Use elephant-bird with hive to read protobuf data - hadoop

I have a problem similar to this one.
The following is what I used:
CDH4.4 (hive 0.10)
protobuf-java-2.4.1.jar
elephant-bird-hive-4.6-SNAPSHOT.jar
elephant-bird-core-4.6-SNAPSHOT.jar
elephant-bird-hadoop-compat-4.6-SNAPSHOT.jar
The jar file which includes the .class files compiled by protoc.
I followed the Protocol Buffers Java tutorial to create my data file "testbook".
I used hdfs dfs -mkdir /protobuf_data to create the HDFS folder,
and hdfs dfs -put testbook /protobuf_data to put "testbook" into HDFS.
Then I followed the elephant-bird page to create the table; the syntax is like this:
create table addressbook
row format serde "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
with serdeproperties (
"serialization.class"="com.example.tutorial.AddressBookProtos$AddressBook")
stored as
inputformat "com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
LOCATION '/protobuf_data/';
That all worked.
But when I submit the query select * from addressbook; no results come out,
and I couldn't find any logs with errors to help me debug.
Could someone help me?
Many thanks

The problem has been solved.
At first I put the protobuf binary data directly into HDFS, and no results showed, because it doesn't work that way.
After asking some senior colleagues, I learned that protobuf binary data should be written into some kind of container, some file format, like a Hadoop SequenceFile.
The elephant-bird page mentions this too, but at first I couldn't understand it completely.
After writing the protobuf binary data into a SequenceFile, I can read the protobuf data with Hive.
And because I use the SequenceFile format, the create table syntax uses:
inputformat 'org.apache.hadoop.mapred.SequenceFileInputFormat'
outputformat 'org.apache.hadoop.mapred.SequenceFileOutputFormat'
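For reference, here is a rough sketch of how the protobuf messages can be written into a SequenceFile. It's only a sketch: the AddressBook/Person classes are the tutorial's generated code, the class name and output path are examples, and it assumes the elephant-bird SerDe can read the raw message bytes from BytesWritable values.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import com.example.tutorial.AddressBookProtos.AddressBook;
import com.example.tutorial.AddressBookProtos.Person;

public class WriteAddressBookSeqFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Build one AddressBook message; the addPerson() builder method name
        // depends on the field name in the tutorial .proto that was compiled.
        AddressBook book = AddressBook.newBuilder()
            .addPerson(Person.newBuilder()
                .setId(1)
                .setName("Alice")
                .setEmail("alice@example.com"))
            .build();

        // One serialized message per record, stored as raw bytes in the value.
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path("/protobuf_data/testbook.seq"),
            NullWritable.class, BytesWritable.class);
        writer.append(NullWritable.get(), new BytesWritable(book.toByteArray()));
        writer.close();
    }
}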
Hope this helps others who are new to Hadoop, Hive, and elephant-bird too.

Related

How to rename a file within a partition in Hive

I have date-partitioned data in Hive. However, the file within a certain partition has a name like 000112_0. Is there a way to rename this file?
There is no configuration property to do this, but you can write a custom OutputFormat class to achieve it.
OutputFormat describes the output-specification for a Map-Reduce job.
The Map-Reduce framework relies on the OutputFormat of the job to:
Validate the output-specification of the job. For e.g. check that the output directory doesn't already exist.
Provide the RecordWriter implementation to be used to write out the output files of the job. Output files are stored in a FileSystem.
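If you do go the custom OutputFormat route, a minimal sketch of the idea in plain MapReduce (new API) looks like the class below. It only changes the name each task gives its output file; the class name and naming pattern are made up, and hooking something like this into Hive's own output path is a separate exercise.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class RenamingTextOutputFormat extends TextOutputFormat<NullWritable, Text> {
    @Override
    public Path getDefaultWorkFile(TaskAttemptContext context, String extension) throws IOException {
        FileOutputCommitter committer = (FileOutputCommitter) getOutputCommitter(context);
        int task = context.getTaskAttemptID().getTaskID().getId();
        // Produces part files named data_00000, data_00001, ... instead of the default names.
        return new Path(committer.getWorkPath(), String.format("data_%05d%s", task, extension));
    }
}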
You can. Run:
hadoop fs -mv /path_to_file/old_name /path_to_file/new_name
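The same rename can also be done programmatically through the FileSystem Java API if you need to script it (the paths below are just examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RenamePartFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Renames the part file in place inside its partition directory.
        boolean renamed = fs.rename(
            new Path("/user/hive/warehouse/mytable/dt=2016-01-01/000112_0"),
            new Path("/user/hive/warehouse/mytable/dt=2016-01-01/data_000112"));
        System.out.println(renamed ? "renamed" : "rename failed");
    }
}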

Bigdata Live data streaming using flume

I am trying to analyze Twitter data using Flume.
I got the files from Twitter using Flume in BigInsights,
but the data I received is in a compressed Avro format, which is not readable.
Can anyone tell me a way to convert that file to JSON (readable)
in order to do some analysis on it?
Or is there any way for the data I receive to already be in JSON (readable) format?
Thanks in advance.
This is the data I received.
The Avro format is not designed to be human-readable; it's designed to be consumed by programs. But you have a few options to view this data, or even better, to analyze it.
Create a Hive table: this option will allow you to analyze the data using SQL queries, Spark SQL, Spark notebooks, and visualization tools like Tableau and Excel.
Your table creation script will look like this:
CREATE TABLE twitter_data
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{...
In the schema literal, you can define your own schema too.
Write a program: if you are a developer and want to wrangle the data programmatically, you have many languages to choose from to read, parse, convert, and write Avro files to JSON.
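As a sketch of the "write a program" option in Java, the snippet below reads an Avro container file with the generic API and prints each record as JSON. The file name is an example; GenericRecord.toString() renders JSON text, and the reader handles deflate/snappy-compressed blocks when the codec libraries are on the classpath. The avro-tools jar's tojson command does the same conversion from the command line.

import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroToJson {
    public static void main(String[] args) throws Exception {
        // e.g. a FlumeData.* file copied out of HDFS to the local disk
        File avroFile = new File("FlumeData.avro");
        DataFileReader<GenericRecord> reader =
            new DataFileReader<GenericRecord>(avroFile, new GenericDatumReader<GenericRecord>());
        for (GenericRecord record : reader) {
            System.out.println(record);   // toString() of a GenericRecord is JSON text
        }
        reader.close();
    }
}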

How to use DistCp to directly convert data into tables in Hive?

I am using DistCp to copy data from cluster 1 to cluster 2. I was successfully able to copy the table data from cluster 1 into cluster 2. However, on the HDFS side the data just ended up as plain files (visible in the file browser).
Is there any direct way to convert this HDFS data into Hive tables (including data types, delimiters, etc.) using DistCp command(s)? I can certainly query it to gather the data from HDFS, but then I'd have to convert the tables one by one. I'm looking for an efficient way to do this. Thanks!
Example:
hadoop distcp hdfs://nn1:8020/source/a hdfs://nn1:8020/source/b hdfs://nn2:8020/destination
I haven't found documentation saying you can directly use DistCp to copy tables. However, if anyone is in a similar situation, they can use the following. It worked for me.
--hive
export table <<<table_name>>> to '<<<hdfs path>>>';
#bash/shell
hadoop distcp source destination
--hive
import table <<<table_name>>> from '<<<hdfs path>>>';

How to read a customised HDFS file format with Hive

I have my own file format in HDFS, like below
<bytes_for_size_of_header><header_as_protobuf_bytes><bytes_for_size_of_a_record><record_as_protobuf_bytes>...
As we can see, each record inside the file is encoded with protocol buffer
I've been trying to read these files with Hive, and I suppose that I should create an InputFormat and a RecordReader from the older version of the MapReduce API, and also a SerDe to decode the protobuf records.
Has anyone done this before? Am I going in the right direction? Any help would be appreciated.
Yes, you are going in the right direction. This is exactly what the InputFormat, RecordReader, and SerDe abstractions are for. You should be able to find plenty of examples.
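As a rough illustration of the RecordReader piece, something like the sketch below (older mapred API, as you mentioned) would hand each length-prefixed protobuf record to the SerDe as raw bytes. It assumes a 4-byte big-endian length field and that the InputFormat marks the files as non-splittable; the class name is hypothetical and the code is untested.

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;

public class LengthPrefixedProtobufRecordReader
        implements RecordReader<LongWritable, BytesWritable> {

    private final FSDataInputStream in;
    private final long end;
    private long pos;

    public LengthPrefixedProtobufRecordReader(FileSplit split, JobConf job) throws IOException {
        FileSystem fs = split.getPath().getFileSystem(job);
        in = fs.open(split.getPath());
        int headerSize = in.readInt();          // <bytes_for_size_of_header>
        in.readFully(new byte[headerSize]);     // skip <header_as_protobuf_bytes>
        pos = in.getPos();
        end = split.getStart() + split.getLength();
    }

    public boolean next(LongWritable key, BytesWritable value) throws IOException {
        if (pos >= end) {
            return false;
        }
        int recordSize = in.readInt();          // <bytes_for_size_of_a_record>
        byte[] buf = new byte[recordSize];
        in.readFully(buf);                      // <record_as_protobuf_bytes>
        key.set(pos);
        value.set(buf, 0, recordSize);
        pos = in.getPos();
        return true;
    }

    public LongWritable createKey() { return new LongWritable(); }
    public BytesWritable createValue() { return new BytesWritable(); }
    public long getPos() throws IOException { return pos; }
    public float getProgress() { return pos >= end ? 1.0f : 0.0f; }  // crude progress estimate
    public void close() throws IOException { in.close(); }
}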

HDFS file compression internally

I am looking for default compression in HDFS. I saw this, but I don't want my files to have gzip-like extensions (in fact, they should be accessible as if they weren't compressed). Actually, what I am looking for is exactly like the "Compress contents to save disk space" option on Windows. That option compresses the files internally, but they can be accessed just like usual files. Any ideas will be helpful.
Thanks
This doesn't exist in standard HDFS implementations; you have to manage your own compression. However, MapR, a proprietary Hadoop distribution, does this, if solving this problem is important enough for you.
After using Hadoop for a little while, this doesn't really bother me anymore. Pig, MapReduce, and such handle the compression automatically enough for me. I know that's not a real answer, but I couldn't tell from your question whether you are simply annoyed or have a real problem this is causing. Getting used to adding | gunzip to everything didn't take long. For example:
hadoop fs -cat /my/file.gz | gunzip
cat file.txt | gzip | hadoop fs -put - /my/file.txt.gz
When you're using compressed files you need to think about having them splittable, i.e. whether Hadoop can split the file when running a MapReduce job (if the file is not splittable it will only be read by a single mapper).
The usual way around this is to use a container format, e.g. SequenceFile, ORC file, etc., where you can enable compression. If you are using simple text files (CSV etc.), there's an LZO project by Twitter, but I haven't used it personally.
The standard way to store files with compression in HDFS is through a compression argument passed while writing the file into HDFS. This is available in the mapper libraries, Sqoop, Flume, Hive, the HBase catalog, and so on. I am quoting some examples here from Hadoop. You don't need to worry about compressing the file locally for efficiency; it's best to use the built-in file format options to do this work. This type of compression integrates smoothly with Hadoop's mapper processing.
Job written through Mapper Library
You pass the codec while creating the writer in your mapper program. Here is the definition: you write your own mapper and reducer to write the file into HDFS, with your codec passed as an argument to the createWriter method.
createWriter(Configuration conf, FSDataOutputStream out, Class keyClass, Class valClass, org.apache.hadoop.io.SequenceFile.CompressionType compressionType, CompressionCodec codec)
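A hedged example of calling that overload is below; the path, key/value types, and codec are placeholders (DefaultCodec works without native libraries, while SnappyCodec needs the native Hadoop libs installed).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class CompressedSequenceFileWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/tmp/data/logs/part-00000.seq"));

        // BLOCK compression compresses batches of records rather than each record separately.
        SequenceFile.Writer writer = SequenceFile.createWriter(
            conf, out, LongWritable.class, Text.class,
            SequenceFile.CompressionType.BLOCK, new DefaultCodec());

        writer.append(new LongWritable(1L), new Text("a log line"));
        writer.close();
        out.close();
    }
}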
Sqoop Import
The option below sends the default compression argument for the file import into HDFS:
sqoop import --connect jdbc:mysql://yourconnection/rawdata --table loglines --target-dir /tmp/data/logs/ --compress
With Sqoop you can also specify a specific codec with the --compression-codec option:
sqoop import --connect jdbc:mysql://yourconnection/rawdata --table loglines --target-dir /tmp/data/logs/ --compression-codec org.apache.hadoop.io.compress.SnappyCodec
Hive import
In the example below, you can use your desired option to load the file into Hive. Again, this is a property you can set while loading from your local file.
SET hive.exec.compress.output=true;
SET parquet.compression=SNAPPY; -- this is the default actually
CREATE TABLE raw (line STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS PARQUET;
LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log' INTO TABLE raw;
I have not mentioned all the methods of data compression available when you import into HDFS.
The HDFS CLI (e.g. hdfs dfs -copyFromLocal) doesn't provide any direct way to compress. This is my understanding from working with the Hadoop CLI.
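If you do need to compress while copying a local file into HDFS, one workaround is a small Java program that wraps the HDFS output stream in a CompressionCodec instead of using the CLI. The sketch below gzips a local file straight into HDFS; the class name and paths are examples.

import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CompressedPut {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem local = FileSystem.getLocal(conf);
        FileSystem hdfs = FileSystem.get(conf);

        // The codec wraps the HDFS output stream, so bytes land in HDFS already gzipped.
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
        InputStream in = local.open(new Path("file.txt"));
        OutputStream out = codec.createOutputStream(hdfs.create(new Path("/my/file.txt.gz")));
        IOUtils.copyBytes(in, out, 4096, true);  // true = close both streams when done
    }
}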
