How can a Microsoft Word binary be stored in Hive?

Question from a relative Hadoop/Hive newbie: How can I pass the contents of a Microsoft Word (binary) document as a parameter to a Hive function?
My goal is to be able to provide the full contents of a binary file (a Microsoft Word document in my particular use case) as a binary parameter to a UDTF. My initial approach has been to slurp the file's contents into a staging table and then provide it to the UDTF in a query later on, and this was how I attempted to build that staging table:
create table worddoc(content BINARY);
load data inpath '/path/to/wordfile' into table worddoc;
Unfortunately, there seem to be newlines in the Word document (or something acting enough like newlines) that result in the staging table having many rows instead of the single comprehensive blob I was hoping for. Is there some way to ensure that the ingest doesn't get exploded into multiple rows? I've seen similar questions here on SO regarding other binary data like image files, so I'm guessing it's the newlines that are tripping me up.
Failing all that, is there a way to skip storing the file's contents in an intermediary Hive table and just provide the content directly to the UDTF at invocation time? Nothing obvious jumped out during my search through Hive's built-in functions, but maybe I am missing something.
Version-wise, the environment is Hive 0.13.1 and Hadoop 1.2.1 (although upgrades to both are pending).

This is a hacky workaround, but here is what I ended up doing:
1) base64 encode the binary document and put the encoded file into HDFS
2) In Hive:
CREATE TABLE staging_table (content STRING);
LOAD DATA INPATH '/path/to/base64_encoded_file' INTO TABLE staging_table;
CREATE TABLE target_table (content BINARY);
INSERT INTO target_table SELECT unbase64(content) FROM staging_table;
Theoretically this should work for any arbitrary binary file that you want to squish into Hive this way. One gotcha to watch out for: make sure your base64 encoding implementation produces a single-line file (my OS X base64 utility produces one-line output, while the base64 utility in a CentOS 6 VM I was using produced hundreds of lines). If it doesn't, you can manually glue the lines together before putting the file into HDFS.
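For completeness, once the content is reassembled in target_table, handing it to the UDTF is just a LATERAL VIEW over that table. The sketch below is hypothetical: extract_doc_fields stands in for whatever UDTF you register (e.g. via CREATE TEMPORARY FUNCTION), and its output columns are made up.
-- hypothetical UDTF call; extract_doc_fields and its output columns are placeholders
SELECT parsed.title, parsed.body
FROM target_table
LATERAL VIEW extract_doc_fields(content) parsed AS title, body;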

Related

How does the CONCATENATE in ALTER TABLE command in Hive work

I am trying to understand how exactly ALTER TABLE ... CONCATENATE in Hive works.
I saw this link, How does Hive 'alter table <table name> concatenate' work?, but all I got from it is that for ORC files the merge happens at the stripe level.
I am looking for a detailed explanation of how CONCATENATE works. For example, I initially had 500 small ORC files in HDFS. I ran Hive's ALTER TABLE CONCATENATE and the files merged into 27 bigger files. Subsequent runs of CONCATENATE reduced the number of files to 16, and finally I ended up with two large files (using Hive 0.12). So I want to understand:
How exactly does CONCATENATE work? Does it look at the existing number of files as well as their size? How does it determine the number of output ORC files after concatenation?
Are there any known issues with using CONCATENATE? We are planning to run it once a day in the maintenance window.
Is CTAS an alternative to CONCATENATE, and which is better? Note that my requirement is to reduce the number of ORC files (ingested through NiFi) without compromising read performance.
Any help is appreciated, and thanks in advance.
The concatenated file size can be controlled with the following two settings:
set mapreduce.input.fileinputformat.split.minsize=268435456;
set hive.exec.orc.default.block.size=268435456;
These values should be set based on your HDFS/MapR-FS block size.
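For reference, with those settings in place the merge itself is a single statement per table or partition; the table and partition names below are placeholders:
-- merge the small ORC files of one partition into larger ones
ALTER TABLE my_orc_table PARTITION (load_date='2020-01-01') CONCATENATE;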
As @leftjoin commented, it is indeed the case that you can get different output files for the same underlying data.
This is discussed more in the linked HCC thread, but the key point is:
Concatenation depends on which files are chosen first.
Note that having files of different sizes should not be a problem in normal situations.
If you want to streamline your process, then depending on how big your data is, you may also want to batch it a bit before writing to HDFS. For instance, by setting the batch size in NiFi.

Apache Solr support for ORC file format

I have a bunch of tables in Hive, stored as ORC. I want to index their data in a SolrCloud collection.
Is there any support for indexing data stored in ORC format in Solr?
I've googled around but nothing came up.
Looks like you want Solr to read data from a specific Hive file format.
You might look at the problem the other way around, i.e. use Hive to write data to Solr, and thus let Hive take care of the complexity of the actual input file format (whether ORC, Parquet, Avro, whatever, even HBase data files).
In the LucidWorks GitHub repo you will find a project labeled hive-solr. Have a look.
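To give a rough idea of what that looks like, here is a sketch of the approach. The storage handler class and table properties are quoted from memory of the hive-solr README, so double-check them against the version you pull; the table, column, and ZooKeeper/collection names are placeholders.
-- external table backed by a Solr collection via the LucidWorks storage handler
CREATE EXTERNAL TABLE solr_index (id STRING, field1 STRING, field2 INT)
STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler'
LOCATION '/tmp/solr'
TBLPROPERTIES ('solr.zkhost' = 'zk1:2181/solr', 'solr.collection' = 'my_collection');
-- then let Hive read the ORC table and push rows into Solr
INSERT OVERWRITE TABLE solr_index
SELECT id, field1, field2 FROM my_orc_table;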
I'll accept Samson's answer.
Anyway, I'm not fully satisfied with this solution. In fact, I still need to create an external table, manually declaring all the fields of the original table. In terms of operations, it is not much different from creating a new table (stored as textfile) from the original one, indexing the new text files, and finally dropping them (of course, this may be a problem for very large tables, which is not my case).
Since ORC is a self-describing format, it would be great if Solr could read both field names and data directly from the compressed files.

Write Hive Table using Spark SQL and JDBC

I am new to Hadoop and I am using a single node cluster (for development) to pull some data from a relational database.
Specifically, I am using Spark (version 1.4.1) with the Java API to pull data for a query and write it to Hive. I have run into various problems (and have read the manuals and tried searching online), but I think I might be misunderstanding some fundamental part of this.
First, I thought I'd be able to read data into Spark, optionally run some Spark methods to manipulate the data and then write it to Hive through a HiveContext object. But, there doesn't seem to be any way to write straight from Spark to Hive. Is that true?
So I need an intermediate step. I have tried a few different methods of storing the data before writing to Hive and settled on writing an HDFS text file, since it seemed to work best for me. However, when writing the HDFS file, I get square brackets in the files, like this: [A,B,C]
So, when I load the data into Hive using the "LOAD DATA INPATH..." HiveQL statement, I get the square brackets in the Hive table!!
What am I missing? Or more appropriately, can someone please help me understand the steps I need to do to:
Run a SQL on SQL Server or Oracle DB
Write the data out to a Hive table that can be accessed by a dashboard tool.
My code right now, looks something like this:
DataFrame df = sqlContext.read().format("jdbc").options(getSqlContextOptions(driver, dburl, query)).load(); // This step seems to work fine.
JavaRDD<Row> rdd = df.javaRDD();
rdd.saveAsTextFile(getHdfsUri() + pathToFile); // This works, but writes the rows in square brackets, like: [1, AAA].
hiveContext.sql("CREATE TABLE BLAH (MY_ID INT, MY_DESC STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE");
hiveContext.sql("LOAD DATA INPATH '" + getHdfsUri() + hdfsFile + "' OVERWRITE INTO TABLE `BLAH`"); // Get's written like:
MY_INT MY_DESC
------ -------
AAA]
The INT column doesn't get written at all because the leading [ makes it no longer a numeric value and the last column shows the "]" at the end of the row in the HDFS file.
Please help me understand why this isn't working or what a better way would be. Thanks!
I am not locked into any specific approach, so all options would be appreciated.
OK, I figured out what I was doing wrong. I needed to use the write function on the DataFrame built through the HiveContext, together with the com.databricks.spark.csv format, to write the data into Hive. This does not require an intermediate step of saving a file in HDFS, which is great, and writes to Hive successfully.
DataFrame df = hiveContext.createDataFrame(rdd, struct);
df.select(cols).write().format("com.databricks.spark.csv").mode(SaveMode.Append).saveAsTable("TABLENAME");
I did need to create a StructType object, though, to pass into the createDataFrame method for the proper mapping of the data types (something like what is shown in the middle of this page: Support for User Defined Types for java in Spark). And the cols variable is an array of Column objects, which is really just an array of column names (i.e. something like Column[] cols = {new Column("COL1"), new Column("COL2")};).
I think "Insert" is not yet supported.
http://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive
To get rid of the brackets in the text file, you should avoid saveAsTextFile. Instead, try writing the contents yourself using the HDFS API, i.e. FSDataOutputStream.

apache drill memory exception

I am trying to reformat over 600 GB of CSV files into Parquet using Apache Drill in a single-node setup.
I run my SQL statement:
CREATE TABLE Data_Transform.'/' AS
....
FROM Data_source.'/data_dump/*'
and it is creating parquet files but I get the error:
Query Failed: An Error Occurred
org.apache.drill.common.exceptions.UserRemoteException: RESOURCE ERROR:
One or more nodes ran out of memory while executing the query.
is there a way around this?
Or is there an alternative way to do the conversion?
I don't know if querying all those GB on a local node is feasible. If you've configured the memory per the docs, using a cluster of Drillbits to share the load is the obvious solution, but I guess you already know that.
If you're willing to experiment, and you're converting the CSV files with a select * rather than selecting individual columns, change the query to something like select columns[0] as user_id, columns[1] as user_name (see the sketch below). Cast columns to types like int, float, or datetime where possible. This avoids the overhead of storing everything as varchars and prepares the data for future queries that would otherwise need those casts at analysis time.
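A sketch of that, assuming a three-column CSV; the column positions, names, and types are placeholders you would adapt to your data:
-- cast the text columns up front so the Parquet files carry real types
CREATE TABLE Data_Transform.`csv_as_parquet` AS
SELECT CAST(columns[0] AS INT)   AS user_id,
       columns[1]                AS user_name,
       CAST(columns[2] AS FLOAT) AS score
FROM Data_source.`/data_dump/*`;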
I've also seen the following recommendation from a Drill developer: split the input into smaller files manually to work around the limitations of the local file system, since Drill doesn't split files on block boundaries.

Amazon EMR JSON

I am using Amazon EMR Hadoop Hive for big data processing. The current data in my log files is in CSV format. In order to build a table from the log files, I wrote a regex expression to parse the data and store it into different columns of an external table. I know that a SerDe can be used to read data in JSON format, which would mean each log file line is a JSON object. Are there any Hadoop performance advantages to having my log files in JSON format compared to CSV?
If you can process the output of the table (the one you created with the regexp), why add another processing step? Try to avoid unnecessary work.
I think the main issue here is which format is faster to read. I believe CSV will provide better speed than JSON, but don't take my word for it. Hadoop really doesn't care; it's all byte arrays to it once the data is in memory.
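For what it's worth, reading JSON lines from Hive only takes a SerDe declaration on the table. A minimal sketch, assuming the hcatalog JSON SerDe is available in your EMR/Hive build; the columns and S3 path are placeholders:
-- each line in the location is expected to be one JSON object
CREATE EXTERNAL TABLE logs_json (
  request_time STRING,
  user_id      STRING,
  status       INT
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://my-bucket/logs/json/';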
