In Hadoop using Pig, I have a large number of fields in a few separate sources which I load, filter, project, group, run through a couple Java UDFs, join, project and store. (That's everyday life in Hadoop.) Some of the fields in the original load of data aren't used by the UDFs and aren't needed until the final store.
When is it better to pass unused fields through UDFs than to store and join them later?
A trivial toy example is a data source with columns name,weight,height and I ultimately want to store name,weight,heightSquared. My UDF is going to square the height for me. Which is better:
inputdata = LOAD 'data' AS (name, weight, height);
outputdata = FOREACH inputdata
    GENERATE FLATTEN(myudf.squareHeight(name, weight, height))
             AS (name, weight, heightSquared);
STORE outputdata INTO 'output';
or
inputdata = LOAD 'data' AS (name, weight, height);
name_weight = FOREACH inputdata
    GENERATE name, weight;
intdata1 = FOREACH inputdata
    GENERATE FLATTEN(myudf.squareHeight(name, height))
             AS (iname, heightSquared);
intdata2 = JOIN intdata1 BY iname, name_weight BY name;
outputdata = FOREACH intdata2
    GENERATE name, weight, heightSquared;
STORE outputdata INTO 'output';
In this case it looks pretty obvious: the first case is better. But the UDF does have to read, store, and output the weight field. When you have 15 fields the UDF doesn't care about and one it does, is the first case still better?
If you have 15 fields the UDF doesn't care about, then don't send them to the UDF. In your example, there's no reason to write your UDF to take three fields if it's only going to use the third one. The best script for your example would be
inputdata = LOAD 'data' AS (name, weight, height);
outputdata =
FOREACH inputdata
GENERATE
name,
weight,
myudf.squareHeight(height) AS heightSquared;
STORE outputdata INTO 'output';
So that addresses the UDF case. If you have a bunch of fields that you'll want to store, but you are not going to use them in any of the next several map-reduce cycles, you may wish to store them immediately and then join them back in. But that would be a matter of empirically testing which approach is faster for your specific case.
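As a rough illustration of that store-and-rejoin pattern, here is a minimal sketch; the paths, the keyField used to rejoin, and the processed/result names are placeholders rather than anything from the question:
-- Park the fields that are not needed for the next several cycles
passthrough = FOREACH inputdata GENERATE keyField, unused1, unused2;
STORE passthrough INTO 'tmp/passthrough';

-- ... the real pipeline runs on the remaining fields, ending in 'processed' ...

-- Later, load the parked fields and join them back in on the key
parked = LOAD 'tmp/passthrough' AS (keyField, unused1, unused2);
rejoined = JOIN processed BY keyField, parked BY keyField;
final = FOREACH rejoined GENERATE processed::keyField AS keyField,
                                  processed::result AS result,
                                  parked::unused1 AS unused1,
                                  parked::unused2 AS unused2;
STORE final INTO 'output';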
I am new to Pig, so bear with me. I have two data sources that have the same schema: a map of attributes. I know that records from the two sources will overlap on a single identifiable attribute. For example:
Record A:
{"Name":{"First":"Foo", "Last":"Bar"}, "FavoriteFoods":{["Oranges", "Pizza"]}}
Record B:
{"Name":{"First":"Foo", "Last":"Bar"}, "FavoriteFoods":{["Buffalo Wings"]}}
I want to merge the records on Name such that:
Merged:
{"Name":{"First":"Foo", "Last":"Bar"}, "FavoriteFoods":{["Oranges", "Pizza", "Buffalo Wings"]}}
UNION, UNION ONSCHEMA, and JOIN don't operate in this way. Is there a method available to do this within Pig, or will it have to happen within a UDF?
Something like:
A = LOAD 'fileA.json' USING JsonLoader AS infoMap:map[];
B = LOAD 'fileB.json' USING JsonLoader AS infoMap:map[];
merged = MERGE_ON infoMap#Name, A, B;
Pig by itself is very dumb when it comes to even slightly complex data translation. I feel you will need two kinds of UDFs to achieve your task. The first UDF will need to accept a map and create a unique string representation of it. It could be a hashed string representation of the map (let's call it getHashFromMap()). This string will be used to join the two relations. The second UDF would accept two maps and return a merged map (let's call it mergeMaps()). Your script will then look as follows:
A = LOAD 'fileA.json' USING JsonLoader AS infoMapA:map[];
B = LOAD 'fileB.json' USING JsonLoader AS infoMapB:map[];
A2 = FOREACH A GENERATE *, getHashFromMap(infoMapA#'Name') AS joinKey;
B2 = FOREACH B GENERATE *, getHashFromMap(infoMapB#'Name') AS joinKey;
AB = JOIN A2 BY joinKey, B2 BY joinKey;
merged = FOREACH AB GENERATE *, mergeMaps(infoMapA, infoMapB) AS mergedMap;
Here I assume that the attribute you want to merge on is a map. If that can vary, your first UDF will need to become more generic. Its main purpose would be to get a unique string representation of the attribute so that the datasets can be joined on that.
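For illustration only, a jython version of those two UDFs might look roughly like this (the file name, and the assumptions that Pig hands the map to the Python function as a dict and that list-valued entries arrive as Python lists, are mine and would need verifying):
#!/usr/bin/python
# mergeudfs.py -- hypothetical sketch; register with:
#   REGISTER 'mergeudfs.py' USING jython AS myudfs;

@outputSchema("joinKey:chararray")
def getHashFromMap(m):
    # Deterministic string built from the map's sorted key/value pairs
    if m is None:
        return None
    return '|'.join('%s=%s' % (k, m[k]) for k in sorted(m.keys()))

@outputSchema("merged:map[]")
def mergeMaps(a, b):
    # Start from the first map, fold in the second;
    # list-like values present in both maps are concatenated
    out = dict(a or {})
    for k in (b or {}).keys():
        if k in out and isinstance(out[k], list) and isinstance(b[k], list):
            out[k] = out[k] + b[k]
        else:
            out[k] = b[k]
    return out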
I have a data set that looks like this after a GROUP operation:
input = key1|{(a1,b1,c1),(a2,b2,c2)}
key2|{(a3,b3,c3),(a4,b4,c4),(a5,b5,c5)}
I need to traverse the above to generate the final output like this:
<KEY>key1</KEY>|
<VALUES><VALUE><VALUE1>a1</VALUE1><VALUE2>b1</VALUE2><VALUE3>c1</VALUE3></VALUE><VALUE><VALUE1>a2</VALUE1><VALUE2>b2</VALUE2><VALUE3>c2</VALUE3></VALUE></VALUES>
<KEY>key2</KEY>| ...
I have tried to use FLATTEN and CONCAT to achieve this result in the below manner:
A = FOREACH input GENERATE key, FLATTEN(input);
output = FOREACH A GENERATE CONCAT('<KEY>',CONCAT(input.key,'</KEY>')),
CONCAT('<VALUE>',''),
CONCAT('<VALUE1>',CONCAT(input.col1,'</VALUE1>')
...
But this does not give the desired output. I'm fairly new to Pig, so I don't know if this is possible.
If you FLATTEN your bag, then you'll end up with as many new 'rows' as there were elements in the bag:
key1|(a1,b1,c1)
key1|(a2,b2,c2)
If I understand your problem correctly, you want this:
Use the BagToTuple built-in function.
Then you'll get
key1|(a1,b1,c1,a2,b2,c2)
After this you can format your data with, e.g., a UDF.
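A minimal sketch of that approach, assuming the grouped relation is called grouped and its bag field is called vals (both names are placeholders), and that you are on Pig 0.11 or later where BagToTuple is built in:
collapsed = FOREACH grouped GENERATE group AS key,
                                     FLATTEN(BagToTuple(vals));
-- collapsed now looks like (key1, a1, b1, c1, a2, b2, c2),
-- which a UDF or a CONCAT chain can turn into the <KEY>/<VALUES> string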
I am trying to use Pig to read data from HDFS where the files contain rows that look like:
"key1"="value1", "key2"="value2", "key3"="value3"
"key1"="value10", "key3"="value30"
In a way the rows of the data are essentially dictionaries:
{"key1":"value1", "key2":"value2", "key3":"value3"}
{"key1":"value10", "key3":"value30"}
I can read and dump a portion of this data easily enough with something like:
data = LOAD '/hdfslocation/weirdformat*' USING PigStorage(',');
sampled = SAMPLE data 0.00001;
dump sampled;
My problem is that I can't parse it efficiently. I have tried to use
org.apache.pig.piggybank.storage.MyRegExLoader
but it seems extremely slow.
Could someone recommend a different approach?
It seems like one way is to use a Python UDF.
This solution is heavily inspired by bag-to-tuple.
In myudfs.py write:
#!/usr/bin/python

def FieldPairsGenerator(dataline):
    for x in dataline.split(','):
        k, v = x.split('=')
        yield (k.strip().strip('"'), v.strip().strip('"'))

@outputSchema("foo:map[]")
def KVDataToDict(dataline):
    return dict(kvp for kvp in FieldPairsGenerator(dataline))
then write the following Pig script:
REGISTER 'myudfs.py' USING jython AS myfuncs;
data = LOAD 'whereyourdatais*.gz' AS (foo:chararray);
A = FOREACH data GENERATE myfuncs.KVDataToDict(foo);
A now has the data stored as a Pig map.
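From there you can pull individual values out of the map with the # operator; for example (the key names are just taken from the sample rows above):
B = FOREACH A GENERATE $0#'key1' AS key1, $0#'key3' AS key3;
DUMP B;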
I have a blacklist file that looks a little bit like this
481295b2-30c7-4191-8c14-4e513c7e7577
481295a2-1234-4191-8c14-4e513c7e7577
and a lot of other data I am loading.
How can I filter out the data that is already in the blacklist?
Sort of a NOT IN, in SQL terms.
I tried using something a little bit like this
but couldn't make it work with a relation.
You can use a left join and filter to implement it. E.g.,
data = load '/path/to/data.txt' as (id: chararray);
blacklist = load '/path/to/blacklist.txt' as (id: chararray);
jnd = join data by id left outer, blacklist by id using 'replicated';
filtered = filter jnd by blacklist::id is null;
result = foreach filtered generate data::id as id;
dump result;
In this example, the input data is left-outer-joined with the blacklist. After that, we remove the rows which match the blacklist with an is null check.
The using 'replicated' clause tells Pig to load the second relation into memory to speed up the join. If the blacklist is too big to fit in memory, you can remove using 'replicated'.
I have a Pig script which reads input from a file and sends it to our custom UDF, which sends back a map with 2 key/value pairs. After that, we have to save each key/value pair in 2 different locations. We are doing this using STORE. The problem we are facing is that each STORE command in our Pig script invokes our custom UDF.
REGISTER MyUDF.jar;
LOADFILE = LOAD '$file' AS (record:chararray);
MAPREC = FOREACH LOADFILE GENERATE MyUDF(record);
ERRLIST = FOREACH MAPREC {
    GENERATE $0#'errorRecord' AS ErrorRecord;
};
ERRLIST = FILTER ERRLIST BY ErrorRecord is not null;
MLIST = FOREACH MAPREC {
    GENERATE $0#'mInfo' AS MRecord;
};
MLIST = FILTER MLIST BY MRecord is not null;
STORE MLIST INTO 'fileOut';
STORE ERRLIST INTO 'errorDir';
Is there a way in the Pig script by which the UDF will be invoked only once, even if we have multiple STORE statements?
I think that what's happening under the covers is that MAPREC isn't populated by its assignment statement. Pig waits until MAPREC is used (which is twice here) to figure out what it contains. I suggest creating an intermediate structure by using a FOREACH to iterate over MAPREC. That would force MyUDF to be called once, and you would then use that intermediate result in place of MAPREC in the following FOREACH statements. Hope that made sense.
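A rough sketch of that idea, reusing the relation names from the question (the intermediate EXTRACTED relation and the m alias are my additions; it is worth checking with EXPLAIN whether the UDF really ends up being evaluated only once):
MAPREC = FOREACH LOADFILE GENERATE MyUDF(record) AS m;
-- Pull both keys out of the map once, in a single intermediate relation
EXTRACTED = FOREACH MAPREC GENERATE m#'errorRecord' AS ErrorRecord,
                                    m#'mInfo' AS MRecord;
ERRLIST = FILTER EXTRACTED BY ErrorRecord is not null;
ERRLIST = FOREACH ERRLIST GENERATE ErrorRecord;
MLIST = FILTER EXTRACTED BY MRecord is not null;
MLIST = FOREACH MLIST GENERATE MRecord;
STORE MLIST INTO 'fileOut';
STORE ERRLIST INTO 'errorDir';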