PIG: Process tuples in a bag - hadoop

I have a data set that looks like this after a GROUP operation :
input = key1|{(a1,b1,c1),(a2,b2,c2)}
key2|{(a3,b3,c3),(a4,b4,c4),(a5,b5,c5)}
I need to traverse the above to generate final output like this :
<KEY>key1</KEY>|
<VALUES><VALUE><VALUE1>a1</VALUE1>VALUE2>b1</VALUE2>VALUE3>c1</VALUE3></VALUE><VALUE><VALUE1>a2</VALUE1><VALUE2>b2</VALUE2><VALUE3>c2</VALUE3> </VALUE></VALUES>
<KEY>key2</KEY>| ...
I have tried to use FLATTEN and CONCAT to achieve this result in the below manner:
A = FOREACH input GENERATE key, FLATTEN(input);
output = FOREACH A GENERATE CONCAT('<KEY>',CONCAT(input.key,'</KEY>')),
CONCAT('<VALUE>',''),
CONCAT('<VALUE1>',CONCAT(input.col1,'</VALUE1>')
...
But this does not give the desired output. Fairly new to pig, so don't know if this is possible.

If you FLATTEN your bag than you'll ended up as many new 'rows' as many elements you had in the bag:
key1|(a1,b1,c1)
key1|(a2,b2,c2)
If I understand your problem correctly you want this:
Use the BagToTuple built in function.
Than you'll get
key1|(a1,b1,c1,a2,b2,c2)
After this you can format your data with e.g. a UDF

Related

How to Merge Maps in Pig

I am new to Pig so bear with me. I have two datasources that have the same schema: a map of attributes. I know that some attributes will have a single identifiable overlapping attribute. For example
Record A:
{"Name":{"First":"Foo", "Last":"Bar"}, "FavoriteFoods":{["Oranges", "Pizza"]}}
Record B:
{"Name":{"First":"Foo", "Last":"Bar"}, "FavoriteFoods":{["Buffalo Wings"]}}
I want to merge the records on Name such that:
Merged:
{"Name":{"First":"Foo", "Last":"Bar"}, "FavoriteFoods":{["Oranges", "Pizza", "Buffalo Wings"]}}
UNION, UNION ONSCHEMA,and JOIN don't operate in this way. Is there a method available to do this within Pig or will it have to happen within a UDF?
Something like:
A = LOAD 'fileA.json' USING JsonLoader AS infoMap:map[];
B = LOAD 'fileB.json' USING JsonLoader AS infoMap:map[];
merged = MERGE_ON infoMap#Name, A, B;
Pig by itself is very dumb when it comes to even slightly complex data translation. I feel you will need two kinds of UDFs to achieve your task. The first UDF will need to accept a map and create a unique string representation of it. It could be like a hashed string representation of the map (lets call it getHashFromMap()). This string will be used to join the two relations. The second UDF would accept two maps and return a merged map (lets call it mergeMaps()). Your script will then look as follows:
A = LOAD 'fileA.json' USING JsonLoader AS infoMapA:map[];
B = LOAD 'fileB.json' USING JsonLoader AS infoMapB:map[];
A2 = FOREACH A GENERATE *, getHashFromMap(infoMapA#'Name') AS joinKey;
B2 = FOREACH B GENERATE *, getHashFromMap(infoMapB#'Name') AS joinKey;
AB = JOIN A2 BY joinKey, B2 BY joinKey;
merged = FOREACH AB GENERATE *, mergeMaps(infoMapA, infoMapB) AS mergedMap;
Here I assume that the attrbute you want to merge on is a map. If that can vary, you first UDF will need to become more generic. Its main purpose would be to get a unique string representation of the the attribute so that the datasets can be joined on that.

Pig LOADER for SPLUNK like records

I am trying to use PIG to read data from HDFS where the files contain rows that look like:
"key1"="value1", "key2"="value2", "key3"="value3"
"key1"="value10", "key3"="value30"
In a way the rows of the data are essentially dictionaries:
{"key1":"value1", "key2":"value2", "key3":"value3"}
{"key1":"value10", "key3":"value30"}
I can read and dump portion of this data easily enough with something like:
data = LOAD '/hdfslocation/weirdformat*' as PigStorage(',');
sampled = SAMPLE data 0.00001;
dump sampled;
My problem is that I can't parse it efficiently. I have tried to use
org.apache.pig.piggybank.storage.MyRegExLoader
but it seems extremely slow.
Could someone recommend a different approach?
Seems like one way is to use a python UDF.
This solution is heavily inspired from bag-to-tuple
In myudfs.py write:
#!/usr/bin/python
def FieldPairsGenerator(dataline):
for x in dataline.split(','):
k,v = x.split('=')
yield (k.strip().strip('"'),v.strip().strip('"'))
#outputSchema("foo:map[]")
def KVDataToDict(dataline):
return dict( kvp for kvp in FieldPairsGenerator(dataline) )
then write the following Pig script:
REGISTER 'myudfs.py' USING jython AS myfuncs;
data = LOAD 'whereyourdatais*.gz' AS (foo:chararray);
A = FOREACH data GENERATE myfuncs.KVDataToDict(foo);
A now has the data stored as a PigMap

Carry fields around, or store and join?

In Hadoop using Pig, I have a large number of fields in a few separate sources which I load, filter, project, group, run through a couple Java UDFs, join, project and store. (That's everyday life in Hadoop.) Some of the fields in the original load of data aren't used by the UDFs and aren't needed until the final store.
When is it better to pass unused fields through UDFs than to store and join them later?
A trivial toy example is a data source with columns name,weight,height and I ultimately want to store name,weight,heightSquared. My UDF is going to square the height for me. Which is better:
inputdata = LOAD 'data' AS name,weight,height;
outputdata = FOREACH inputdata
GENERATE myudf.squareHeight(name,weight,height)
AS (name,weight,heightSquared);
STORE outputdata INTO 'output';
or
inputdata = LOAD 'data' AS name,weight,height;
name_weight = FOREACH inputdata
GENERATE name,weight;
intdata1 = FOREACH inputdata
GENERATE myudf.squareHeight(name,height)
AS (iname,heightSquared);
intdata2 = JOIN intdata1 BY iname, name_weight BY name;
outputdata = FOREACH intdata2
GENERATE name,weight,heightSquared;
STORE outputdata INTO 'output';
In this case it looks pretty obvious: the first case is better. But the UDF does have to read and store and output the weight field. When you have 15 fields the UDF doesn't care about and one it does, is the first case still better?
If you have 15 fields the UDF doesn't care about, then don't send them to the UDF. In your example, there's no reason to write your UDF to take three fields if it's only going to use the third one. The best script for your example would be
inputdata = LOAD 'data' AS name,weight,height;
outputdata =
FOREACH inputdata
GENERATE
name,
weight,
myudf.squareHeight(height) AS heightSquared;
STORE outputdata INTO 'output';
So that addresses the UDF case. If you have a bunch of fields that you'll want to store, but you are not going to use them in any of the next several map-reduce cycles, you may wish to store them immediately and then join them back in. But that would be a matter of empirically testing which approach is faster for your specific case.

Pig multiple store commands creating duplicate work

I have a pig script which reads input from a file and sends to our custom UDF, which sends back a Map with 2 key/value pair. After that we have to save each key value pair in 2 different locations. We are doing it using Store. Problem we are facing is each STORE command which we are using in our pig script is invoking our custom UDF.
>REGISTER MyUDF.jar;
>LOADFILE = LOAD '$file' AS record:chararray;
>MAPREC = FOREACH LOADFILE GENERATE MyUDF(record);
>ERRLIST = FOREACH MAPREC {
>GENERATE $0#'errorRecord' AS ErrorRecord;
>};
>ERRLIST = FILTER ERRLIST BY ErrorRecord is not null;
>MLIST = FOREACH MAPREC {
>GENERATE $0#'mInfo' AS MRecord;
>};
>MLIST = FILTER MLIST BY MRecord is not null;
>STORE MLIST INTO 'fileOut';
>STORE ERRLIST INTO 'errorDir';
Is there a way in pig script through which UDF will be invoked only once, even if we have multiple STORE....
I think that what's happening under the covers is that MAPREC isn't populated by its assignment statement. Pig is waiting until MAPREC is used (which is twice) to figure out what it contains. I suggest creating an intermediate structure by using a FOREACH to iterate over MAPREC. That would force the calling of MyUDF once and then use that intermediate result twice in place of MAPREC in the following FOREACH statements. Hope that made sense.

How to generate a custom schema from a relation in Pig?

I have a schema describing tf-idf values for words in various articles.
Its description looks like:
tfidf_relation: {word: chararray,id: bytearray,tfidf: double}
Here is an example of such data:
(cat,article_one,0.13515503603605478)
(cat,article_two,0.4054651081081644)
(dog,article_one,0.3662040962227032)
(apple,article_three,0.3662040962227032)
(orange,article_three,0.3662040962227032)
(parrot,article_one,0.13515503603605478)
(parrot,article_three,0.13515503603605478)
I want to get output in a form:
cat article_one 0.13515503603605478, article_two 0.4054651081081644
and so on.
The question is, how do I make a relation from this which contains the word field and a tuple of id and tfidf fields?
Someting like this:
X = FOREACH tfidf_relation GENERATE word, (id, tfidf);
doesn't work. What is the correct syntax for this?
Try this:
t = LOAD 'input/file' USING PigStorage(',') as (word: chararray,id: bytearray,tfidf: double);
u = group t by word;
dump u;
The output will be
(cat,{(cat,article_two,0.4054651081081644),(cat,article_one,0.13515503603605478)})
(dog,{(dog,article_one,0.3662040962227032)})
(apple,{(apple,article_three,0.3662040962227032)})
(orange,{(orange,article_three,0.366204096222703)})
(parrot,{(parrot,article_three,0.13515503603605478),
(parrot,article_one,0.13515503603605478)})
I hope this is what you are looking for.
X = FOREACH tfidf_relation GENERATE word, {(id, tfidf)};
This is probably what you need.

Resources