Loading protobuf format file into pig script using loadfunc pig UDF - hadoop

I have very little knowledge of Pig. I have a data file in protobuf format that I need to load into a Pig script, so I need to write a LoadFunc UDF to load it; say the function is Protobufloader().
My Pig script would be
A = LOAD 'abc_protobuf.dat' USING Protobufloader() as (name, phonenumber, email);
All I wish to know is how to get the file input stream. Once I get hold of the file input stream, I can parse the data from protobuf format into Pig tuple format.
PS: thanks in advance

Twitter's open source library elephant bird has many such loaders:
https://github.com/kevinweil/elephant-bird
You can use LzoProtobufB64LinePigLoader and LzoProtobufBlockPigLoader.
https://github.com/kevinweil/elephant-bird/tree/master/src/java/com/twitter/elephantbird/pig/load
To use it, you just need to do:
define ProtoLoader com.twitter.elephantbird.pig.load.LzoProtobufB64LinePigLoader('your.proto.class.name');
a = load '/your/file' using ProtoLoader;
b = foreach a generate
field1, field2;
After loading, the data is automatically translated into Pig tuples with the proper schema.
However, they assume your data is written as serialized protobufs and compressed with LZO.
They have corresponding writers as well, in the package com.twitter.elephantbird.pig.store.
If your data format is a bit different, you can adapt their code to write your own custom loader.
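If you do adapt their code, wiring your own loader into the script looks much the same; the class name below is a hypothetical placeholder, not a real elephant-bird class:
define MyProtoLoader com.example.pig.MyProtobufLoader('your.proto.class.name');
a = load '/your/file' using MyProtoLoader as (name:chararray, phonenumber:chararray, email:chararray);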

Related

How to query file in hdfs which has xml as one column

Context:
I have data in a MySQL table with XML as one column.
For example, the table application has 3 fields:
id (integer), details (xml), address (text)
(In the real case I have 10-12 fields here.)
Now we want to query the whole MySQL table, with all the fields, using Pig.
I transferred the data from MySQL into HDFS using Sqoop, with
record delimiter '\u0005' and column delimiter '`', to /x.xml.
Then I load the data from x.xml into Pig using
app = LOAD '/x.xml' USING PigStorage('\u0005') AS (id:int , details:chararray , address:chararray);
What is the best way to query such data?
The solution I could currently think of:
Use a custom loader and extend LoadFunc to read the data.
If there is some way to load a particular column using xmlpathloader and load the rest normally, please suggest whether this can be done,
as all the examples I have seen using XPath use the XML loader while loading the file.
For Ex:
A = LOAD 'xmls/hadoop_books.xml' using org.apache.pig.piggybank.storage.XMLLoader('BOOK') as (x:chararray);
Is it good to use Pig for querying this kind of data? Please suggest any other alternative technologies that do it effectively.
The size of the data is around 500 GB.
FYI, I am new to the Hadoop ecosystem and might be missing something trivial.
Load a specific column:
Some other StackOverflow answers suggest preprocessing the data with awk (generating a new input that contains only the XML part).
A nicer workaround is to generate the specific data with an extra FOREACH from the XML column, like:
B = FOREACH app GENERATE details;
and store it so that it can be loaded with an XML loader.
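A minimal sketch of that store-and-reload step, assuming PigStorage for the intermediate output; the path and the 'ROOT' tag name are illustrative, not taken from the question:
STORE B INTO '/tmp/details_only' USING PigStorage();
-- reload only the XML column with the piggybank XML loader
xml_only = LOAD '/tmp/details_only' USING org.apache.pig.piggybank.storage.XMLLoader('ROOT') AS (doc:chararray);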
Check the StreamingXMLLoader.
(You can also check Apache Drill; it may support this case out of the box.)
Or use a UDF for the XML processing, and in Pig you just hand over the relevant XML field.
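A rough sketch of the UDF route; MyXmlExtract is a hypothetical EvalFunc you would write yourself, not an existing library class:
DEFINE MyXmlExtract com.example.xml.MyXmlExtract();
-- hand only the xml column to the UDF, keep the other fields as-is
parsed = FOREACH app GENERATE id, MyXmlExtract(details) AS parsed_details, address;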

How to store Avro format in HDFS using PIG?

After processing the input data, I have a Java object. I've created an Avro schema for storing the object in an Avro file. I'm stuck at writing the object, using the schema, into HDFS. Can anyone walk me through the process of writing the object using a Pig script and the corresponding UDF?
I suppose you are using a UDF if you use Java.
So you just have to return the result of your UDF as a Pig Tuple.
Then you get a relation with your data ready to store.
Finally, you can use the STORE command with AvroStorage.
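A minimal sketch of that last step, assuming the piggybank AvroStorage and an illustrative output path; check the AvroStorage documentation for how to pass an explicit Avro schema if the one derived from the Pig schema is not what you want:
REGISTER piggybank.jar;
-- 'result' is the relation produced by your UDF, with each record returned as a Pig tuple
STORE result INTO '/user/me/output_avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();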

Using ParquetFileWriter to write data into parquet file?

I am a newbie to Parquet!
I have tried the example code below to write data into a Parquet file using ParquetWriter:
http://php.sabscape.com/blog/?p=623
The above example uses ParquetWriter, but I want to use ParquetFileWriter to write data efficiently into Parquet files.
Please suggest an example, or explain how we can write Parquet files using ParquetFileWriter.
You can probably get some idea from a Parquet column reader that I wrote here.

Apache Pig Load Function Bag as input possible?

If I write a custom load function with the constructor
MyLoadFunction(String someOptions, DataBag myBag)
how can I execute this function with Pig Latin?
X = load 'foo.txt' using MyLoadFunction('myString', myBagAlias);
This does not work. Is it even possible?
Thanks.
I'm not sure your need is suitable for Pig. Pig is all about loading up a lot of data and then putting that data through a pipeline. It sounds like you want something more procedural: load a small amount of data, do some processing, make a decision based on that, and follow that algorithm to completion.
So I'm not sure this is the best way for you to go, but you can try writing a UDF that will access HBase and grab the data you need. LOAD is inappropriate here because LOAD does not return a bag; it returns a relation that Pig expects you to put through some transformations. But you can pass a bag as input to a UDF, and then inside that UDF do the HBase lookup and processing you want.
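A hedged sketch of that shape; MyHBaseLookup is a hypothetical EvalFunc that takes a bag and does the HBase reads internally:
DEFINE MyHBaseLookup com.example.MyHBaseLookup('myString');
foo_data = LOAD 'foo.txt' USING PigStorage() AS (some_key:chararray, some_value:chararray);
grouped = GROUP foo_data BY some_key;
-- the inner bag foo_data is passed to the UDF as a DataBag argument
result = FOREACH grouped GENERATE group, MyHBaseLookup(foo_data);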
A more Pig-ish way of doing things would be to load all of the relevant HBase data into one or more relations, and then do a JOIN as appropriate to combine the pieces of data you want together.
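A sketch of that more Pig-ish approach using the built-in HBaseStorage; the table name, column family, and join key are illustrative assumptions:
hbase_data = LOAD 'hbase://my_table'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:col1 cf:col2', '-loadKey true')
    AS (rowkey:chararray, col1:chararray, col2:chararray);
file_data = LOAD 'foo.txt' USING PigStorage() AS (rowkey:chararray, some_value:chararray);
-- combine the HBase rows with the file rows on the shared key
joined = JOIN file_data BY rowkey, hbase_data BY rowkey;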

Filtering Using MapReduce in Hadoop

I want to filter records from a given file based on some criteria: if the value of the third field is equal to some value, then retrieve that record and save it in the output file. I am taking a CSV file as input. Can anyone suggest something?
The simplest way would probably be to use Pig,
something like:
orig = load 'filename.csv' using PigStorage(',') as (first, second, third:chararray, ...);
filtered_orig = FILTER orig by third == 'somevalue';
store filtered_orig into 'newfilename' using PigStorage(',');
If you need scalability you can use Hadoop in the following way:
Install Hadoop, install Hive, and put your CSV files into HDFS.
Define the CSV file as an external table (http://hive.apache.org/docs/r0.8.1/language_manual/data-manipulation-statements.html) and then you can write SQL against the CSV file. The results of the SQL can then be exported back to CSV.
