How to read a customised HDFS file with Hive - hadoop

I have my own file format in HDFS, like below
<bytes_for_size_of_header><header_as_protobuf_bytes><bytes_for_size_of_a_record><record_as_protobuf_bytes>...
As we can see, each record inside the file is encoded with Protocol Buffers.
I've been trying to read these files with Hive, and I suppose I should create an InputFormat and a RecordReader using the older MapReduce API, as well as a SerDe to decode the protobuf records.
Has anyone done this before? Am I going in the right direction? Any help would be appreciated.

Yes, you are going in the right direction. This is exactly what the InputFormat, RecordReader, and SerDe abstractions are for. You should be able to find plenty of examples.
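As a starting point, here is a minimal sketch of such a RecordReader using the older org.apache.hadoop.mapred API. The class name, the 4-byte length prefixes, and the assumption that each file is read as a single split are illustrative guesses based on the format described above, not a confirmed implementation.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.RecordReader;

public class ProtobufRecordReader implements RecordReader<LongWritable, BytesWritable> {
  private final FSDataInputStream in;
  private final long end;
  private long pos;

  public ProtobufRecordReader(FileSplit split, Configuration conf) throws IOException {
    Path path = split.getPath();
    FileSystem fs = path.getFileSystem(conf);
    in = fs.open(path);
    // Skip the file header: read its size prefix (assumed 4-byte int), then the header bytes.
    int headerLen = in.readInt();
    byte[] header = new byte[headerLen];
    in.readFully(header);                 // the header could be parsed here if needed
    pos = in.getPos();
    end = split.getStart() + split.getLength();
  }

  @Override
  public boolean next(LongWritable key, BytesWritable value) throws IOException {
    if (pos >= end) {
      return false;                       // no more records in this file
    }
    int recordLen = in.readInt();         // size prefix of the next protobuf record
    byte[] buf = new byte[recordLen];
    in.readFully(buf);                    // raw protobuf bytes; the SerDe decodes them later
    key.set(pos);
    value.set(buf, 0, recordLen);
    pos = in.getPos();
    return true;
  }

  @Override public LongWritable createKey() { return new LongWritable(); }
  @Override public BytesWritable createValue() { return new BytesWritable(); }
  @Override public long getPos() { return pos; }
  @Override public void close() throws IOException { in.close(); }
  @Override public float getProgress() { return end == 0 ? 1.0f : Math.min(1.0f, pos / (float) end); }
}

A matching InputFormat would extend org.apache.hadoop.mapred.FileInputFormat, override isSplitable to return false so each file is read from the beginning, and return this reader from getRecordReader; a SerDe would then decode the BytesWritable value with your generated protobuf class.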

Related

Do all three of Presto, Hive and Impala support the Avro data format?

I am clear about the SerDe available in Hive to support the Avro data format, and I am comfortable using Avro with Hive.
AvroSerDe
For example, I have found this issue filed against Presto:
https://github.com/prestodb/presto/issues/5009
I need to choose components for a fast execution cycle; Presto and Impala provide a much shorter execution cycle.
So, could anyone please clarify which would be better for the different data formats?
Primarily, I am looking for Avro support with Presto right now.
However, let's consider the following data formats stored on HDFS:
Avro format
Parquet format
ORC format
Which is the best to use for high performance across these data formats?
Please suggest.
Impala can read Avro data but cannot write it. Please refer to this documentation page describing the file formats supported by Impala.
Hive supports both reading and writing Avro files.
Presto's Hive Connector supports Avro as well. Thanks to David Phillips for pointing out this documentation page.
There are different benchmarks on the internet about performance, but I would not like to link to a specific one as results heavily depend on the exact use case benchmarked.

Deserialize protobuf column with Hive

I am really new to Hive, so I apologize if there are any misconceptions in my question.
I need to read a Hadoop SequenceFile into a Hive table. The sequence file contains Thrift binary data, which can be deserialized using the SerDe2 that comes with Hive.
The problem is that one column in the file is encoded with Google protobuf, so when the Thrift SerDe processes the sequence file it does not handle the protobuf-encoded column properly.
I wonder if there's a way in Hive to deal with this kind of protobuf-encoded column nested inside a Thrift sequence file, so that each column can be parsed properly?
Thank you so much for any possible help!
I believe you should use some other SerDe to deserialize the protobuf format.
Maybe you can refer to this:
https://github.com/twitter/elephant-bird/wiki/How-to-use-Elephant-Bird-with-Hive
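One possible workaround, separate from the elephant-bird suggestion above and purely a sketch: keep the Thrift SerDe for the outer record, expose the protobuf column to Hive as binary, and decode it with a small custom UDF. The message class MyColumnProto stands in for your protoc-generated class, and returning the message's string form is just for illustration.

import com.google.protobuf.InvalidProtocolBufferException;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;

public class DecodeProtoColumn extends UDF {
  // Decode a protobuf-encoded binary column into a readable string.
  public Text evaluate(BytesWritable raw) {
    if (raw == null) {
      return null;
    }
    try {
      // MyColumnProto is a placeholder for the protoc-generated class of the nested message.
      MyColumnProto msg = MyColumnProto.parseFrom(raw.copyBytes());
      return new Text(msg.toString());    // or return individual fields instead
    } catch (InvalidProtocolBufferException e) {
      throw new RuntimeException("Failed to decode protobuf column", e);
    }
  }
}

You would register it with ADD JAR and CREATE TEMPORARY FUNCTION and then apply it to the binary column in your queries.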

How do I get the generated filename when calling the Spark saveAsTextFile method

I'm new to Spark, Hadoop, and everything that comes with them. My overall goal is to build a real-time application that gets tweets and stores them on HDFS in order to build a report based on HBase.
I'd like to get the generated filename when calling the saveAsTextFile RDD method, in order to import it into Hive.
Feel free to ask for further information, and thanks in advance.
saveAsTextFile will create a directory of part files. So if you give it the path "hdfs://user/NAME/saveLocation", a folder called saveLocation will be created, filled with part files. You should be able to load this into HBase simply by passing the directory name to your HBase import job (this directory-of-part-files layout is standard in Hadoop).
I do recommend you look into saving as Parquet though; Parquet files are much more useful than plain text files.
From what I understand, you saved your tweets to HDFS and now want the file names of those saved files. Correct me if I'm wrong.
val filenames = sc.wholeTextFiles("your hdfs location where you saved your tweets").map(_._1)
This gives you an RDD of the file names, on which you can do your operations. I'm a newbie to Hadoop too, but anyway... hope that helps.
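If you just need the concrete file names afterwards, another option is to list the part files in the save location with the Hadoop FileSystem API. A minimal sketch, with an assumed path and the default part-00000, part-00001, ... naming:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListSavedFiles {
  public static void main(String[] args) throws Exception {
    String saveLocation = "hdfs://namenode:8020/user/NAME/saveLocation";  // assumed path
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(saveLocation), conf);
    // saveAsTextFile writes part-NNNNN files plus a _SUCCESS marker into the directory
    FileStatus[] parts = fs.globStatus(new Path(saveLocation, "part-*"));
    if (parts != null) {
      for (FileStatus status : parts) {
        System.out.println(status.getPath());   // full path of each generated file
      }
    }
  }
}

Those paths (or simply the directory itself) are what you would point a Hive LOAD DATA statement or external table at.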

Use elephant-bird with hive to read protobuf data

I have a problem similar to this one.
The following is what I used:
CDH4.4 (hive 0.10)
protobuf-java-2.4.1.jar
elephant-bird-hive-4.6-SNAPSHOT.jar
elephant-bird-core-4.6-SNAPSHOT.jar
elephant-bird-hadoop-compat-4.6-SNAPSHOT.jar
The jar file which includes the protoc-compiled .class files.
I followed the Protocol Buffers Java tutorial to create my data file "testbook".
Then I used hdfs dfs -mkdir /protobuf_data to create an HDFS folder,
and hdfs dfs -put testbook /protobuf_data to put "testbook" into HDFS.
Then I followed the elephant-bird wiki page to create the table; the syntax is like this:
create table addressbook
row format serde "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
with serdeproperties (
"serialization.class"="com.example.tutorial.AddressBookProtos$AddressBook")
stored as
inputformat "com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
LOCATION '/protobuf_data/';
All worked.
But when I submit the query select * from addressbook; no results come out.
And I couldn't find any logs with errors to debug.
Could someone help me ?
Many thanks
The problem has been solved.
At first I put the protobuf binary data directly into HDFS, and no results showed, because it doesn't work that way.
After asking some senior colleagues, I learned that protobuf binary data should be written into some kind of container, a file format like a Hadoop SequenceFile.
The elephant-bird page mentions this too, but at first I couldn't understand it completely.
After writing the protobuf binary data into a SequenceFile, I can read the protobuf data with Hive.
And because I use the SequenceFile format, I use this create table syntax:
inputformat 'org.apache.hadoop.mapred.SequenceFileInputFormat'
outputformat 'org.apache.hadoop.mapred.SequenceFileOutputFormat'
Hope it can help others who are new to Hadoop, Hive, and elephant-bird too.
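For anyone else stuck at the same step, here is a minimal sketch of packing protobuf bytes into a SequenceFile with BytesWritable values. The output path and the choice of NullWritable keys are my own assumptions, not taken from the original setup.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import com.example.tutorial.AddressBookProtos.AddressBook;

public class WriteProtobufSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/protobuf_data/testbook.seq");           // assumed output path
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, NullWritable.class, BytesWritable.class);
    try {
      // An empty AddressBook just to show the mechanics; a real message would be built
      // with the generated builder from the Protocol Buffers tutorial.
      AddressBook book = AddressBook.newBuilder().build();
      byte[] bytes = book.toByteArray();                          // raw protobuf bytes
      writer.append(NullWritable.get(), new BytesWritable(bytes));
    } finally {
      writer.close();
    }
  }
}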

Sequence File of Objects into Hive

We started with a bunch of data stored in NetCDF files. From there, some Java code was written to create sequence files from the NetCDF files. We don't know much about the original intentions of the code, but we have been able to learn a little bit about the sequence files themselves. Ultimately, we are trying to create tables within Hive using these sequence files, but seem incapable of doing so at the moment.
We know that the keys and values within the sequence files are stored as objects that implement WritableComparable. We are also capable of writing Java code to iterate through all of the data in the sequence files.
So, what would be necessary to actually get Hive to read the data within the objects of these sequence files properly?
Thanks in advance!
UPDATE: The reason it is so difficult to describe exactly where I am having trouble is that I am not necessarily getting any errors. Hive is simply reading the sequence files incorrectly. When running the hadoop fs -text command on my sequence file I get a list of objects like this:
NetCDFCompositeKey#263c7e3f , NetCDFRecordWritable#4d846db5
The data is within those objects themselves. So, currently, with the help of @Tariq, I believe what I have to do in order to actually read those objects is to create a custom InputFormat to read the keys and a custom SerDe to serialize and deserialize the objects?
I'm sorry, I'm not able to understand from your question where exactly you are facing the problem. If you wish to use SequenceFiles through Hive you just have to add the STORED AS SEQUENCEFILE clause while issuing CREATE TABLE (most probably you already know this, nothing new). When you work with SequenceFiles, Hive treats each key/value pair of the SequenceFile like a row in a normal file. The important thing here is that the keys will be ignored. Apart from that, nothing very special.
Having said that, if you wish to read both keys and values, you might have to write a custom InputFormat that can read both keys and values. See this project for example. It allows us to access data stored in a SequenceFile's key.
Also, if your keys and values are custom classes, you will need to write a SerDe as well to serialize and deserialize your data.
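For illustration, a bare-bones, read-only SerDe might look like the sketch below. The column names ("variable", "value") and the accessor calls on NetCDFRecordWritable are placeholders I made up, since the internals of the NetCDF writables are not shown here.

import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.AbstractSerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.SerDeStats;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Writable;

public class NetCDFRecordSerDe extends AbstractSerDe {
  private ObjectInspector inspector;

  @Override
  public void initialize(Configuration conf, Properties tbl) throws SerDeException {
    // Declare the Hive columns this SerDe produces; the names and types here are made up.
    List<String> names = Arrays.asList("variable", "value");
    List<ObjectInspector> inspectors = Arrays.<ObjectInspector>asList(
        PrimitiveObjectInspectorFactory.javaStringObjectInspector,
        PrimitiveObjectInspectorFactory.javaDoubleObjectInspector);
    inspector = ObjectInspectorFactory.getStandardStructObjectInspector(names, inspectors);
  }

  @Override
  public Object deserialize(Writable blob) throws SerDeException {
    // The record arrives as the SequenceFile value; pull fields out of the custom writable.
    // The accessors below are placeholders for whatever NetCDFRecordWritable actually exposes.
    NetCDFRecordWritable record = (NetCDFRecordWritable) blob;
    return Arrays.asList(record.getVariableName(), record.getValue());
  }

  @Override
  public ObjectInspector getObjectInspector() { return inspector; }

  @Override
  public SerDeStats getSerDeStats() { return null; }

  @Override
  public Class<? extends Writable> getSerializedClass() { return Writable.class; }

  @Override
  public Writable serialize(Object obj, ObjectInspector oi) throws SerDeException {
    throw new SerDeException("This sketch is read-only; writing is not supported");
  }
}

The table would then be created with ROW FORMAT SERDE pointing at this class and STORED AS SEQUENCEFILE.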
HTH
P.S.: I don't know if this is exactly what you were looking for. Do let me know if it is not and add some more detail to your question; I'll try to address that.
