how to join header row to detail rows in multiple files with apache pig - hadoop

I have several CSV files in a HDFS folder which I load to a relation with:
source = LOAD '$data' USING PigStorage(','); -- $data is passed as a parameter to the pig command
When I dump it, the structure of the source relation is as follows: (note that the data is text qualified but I will deal with that using the REPLACE function)
("HEADER","20110118","20101218","20110118","T00002")
("0000000000000000035412","20110107","2699","D","20110107","2315.","","","","","","C")
("0000000000000000035412","20110107","2699","D","20110107","246..","162","74","","","","B")
<.... more records ....>
("HEADER","20110224","20110109","20110224","T00002")
("0000000000000000035412","20110121","2028","D","20110121","a6c3.","","","","","R","P")
("0000000000000000035412","20110217","2619","D","20110217","a6c3.","","","","","R","P")
<.... more records ....>
So each file has a header which provides some information about the data set that follows it such as the provider of the data and the date range it covers.
So now, how can I transform the above structure into a new relation like the following?
{
(HEADER,20110118,20101218,20110118,T00002),{(0000000000000000035412,20110107,2699,D,20110107,2315.,,,,,,C),(0000000000000000035412,20110107,2699,D,20110107,246..,162,74,,,,B),..more tuples..},
(HEADER,20110224,20110109,20110224,T00002),{(0000000000000000035412,20110121,2028,D,20110121,a6c3.,,,,,R,P),(0000000000000000035412,20110217,2619,D,20110217,a6c3.,,,,,R,P),..more tuples..},..more tuples..
}
Here each header tuple is followed by a bag of the record tuples belonging to that header.
Unfortunately there is no common key field between the header and the detail rows, so I don't think I can use any JOIN operation.
I am quite new to Pig and Hadoop and this is one of the first concept projects that I am engaging in.
Hope my question is clear and look forward to some guidance here.

This should get you started.
Code:
Source = LOAD '$data' USING PigStorage(',', '-tagFile'); -- '-tagFile' prepends the source file name as $0
SPLIT Source INTO FileHeaders IF $1 == 'HEADER', FileData OTHERWISE; -- SPLIT yields no alias of its own; strip the text qualifiers first or match '"HEADER"'
B = GROUP FileData BY $0;    -- group detail rows by file name
C = GROUP FileHeaders BY $0; -- one header row per file
D = JOIN B BY group, C BY group; -- GROUP's key field is named 'group' (lowercase)
...
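From there, one hedged way to finish (assuming exactly one header row per file, and that the text qualifiers have been stripped as you planned):
E = FOREACH D GENERATE FLATTEN(C::FileHeaders), B::FileData;
-- flattening the one-tuple header bag puts the header fields first,
-- followed by the bag of that file's detail rows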

Related

How to Merge Maps in Pig

I am new to Pig so bear with me. I have two data sources that have the same schema: a map of attributes. I know that some records will have a single identifiable overlapping attribute. For example:
Record A:
{"Name":{"First":"Foo", "Last":"Bar"}, "FavoriteFoods":{["Oranges", "Pizza"]}}
Record B:
{"Name":{"First":"Foo", "Last":"Bar"}, "FavoriteFoods":{["Buffalo Wings"]}}
I want to merge the records on Name such that:
Merged:
{"Name":{"First":"Foo", "Last":"Bar"}, "FavoriteFoods":{["Oranges", "Pizza", "Buffalo Wings"]}}
UNION, UNION ONSCHEMA, and JOIN don't operate in this way. Is there a method available to do this within Pig or will it have to happen within a UDF?
Something like:
A = LOAD 'fileA.json' USING JsonLoader AS infoMap:map[];
B = LOAD 'fileB.json' USING JsonLoader AS infoMap:map[];
merged = MERGE_ON infoMap#Name, A, B;
Pig by itself is very dumb when it comes to even slightly complex data translation. I feel you will need two kinds of UDFs to achieve your task. The first UDF will need to accept a map and create a unique string representation of it. It could be like a hashed string representation of the map (let's call it getHashFromMap()). This string will be used to join the two relations. The second UDF would accept two maps and return a merged map (let's call it mergeMaps()). Your script will then look as follows:
A = LOAD 'fileA.json' USING JsonLoader AS infoMapA:map[];
B = LOAD 'fileB.json' USING JsonLoader AS infoMapB:map[];
A2 = FOREACH A GENERATE *, getHashFromMap(infoMapA#'Name') AS joinKey;
B2 = FOREACH B GENERATE *, getHashFromMap(infoMapB#'Name') AS joinKey;
AB = JOIN A2 BY joinKey, B2 BY joinKey;
merged = FOREACH AB GENERATE *, mergeMaps(infoMapA, infoMapB) AS mergedMap;
Here I assume that the attribute you want to merge on is a map. If that can vary, your first UDF will need to become more generic. Its main purpose would be to get a unique string representation of the attribute so that the datasets can be joined on that.
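If the merge key is always the 'Name' map with known keys, you may be able to skip the hashing UDF and build the join key inline; a hedged sketch (the inner cast is needed because values of an untyped map come back as bytearrays; mergeMaps() is still needed for the actual merge):
A2 = FOREACH A GENERATE infoMapA,
    CONCAT((chararray)(((map[])(infoMapA#'Name'))#'First'),
           (chararray)(((map[])(infoMapA#'Name'))#'Last')) AS joinKey;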

Store pig result in a text file

Hi Stack Overflow community;
I'm totally new to Pig. I want to STORE the result in a text file and name it as I want. Is it possible to do this using the STORE function?
My code:
a = LOAD 'example.csv' USING PigStorage(';');
b = FOREACH a GENERATE $0,$1,$2,$3,$6,$7,$8,$9,$11,$12,$13,$14,$20,$24,$25;
STORE b INTO 'myoutput';
Thanks.
Yes, you can store your result with PigStorage and use any delimiter you want:
a = LOAD 'example.csv' USING PigStorage(';');
b = FOREACH a GENERATE $0,$1,$2,$3,$6,$7,$8,$9,$11,$12,$13,$14,$20,$24,$25;
STORE b INTO 'myoutput.txt' using PigStorage(';');
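Keep in mind that on HDFS the path given to STORE becomes a directory containing part-* files, so 'myoutput.txt' above will be a directory, not a single file. If you need one literal file, a hedged sketch using Pig's fs shortcut (paths illustrative):
STORE b INTO 'myoutput' USING PigStorage(';');
-- merge the part files into a single local file:
fs -getmerge myoutput myoutput.txt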
Yes, it is possible. Note that b projects 15 of the original columns (drawn from $0 through $25), so each stored row will have 15 fields.

Pig reading data as databytearray

Hey guys, I have one more question; I am just not able to understand the behavior of Pig.
I am loading the data into Pig and, after some transformation, storing it on HDFS (/user/sga/transformeddata) using PigStorage().
But when I load the data from the /user/sga/transformeddata location and do
temp = load '/user/sga/transformeddata' using PigStorage();
gen = foreach temp generate page_type;
dump gen;
I get the following error:
DataByteArray cannot be cast to java.lang.String
But if I do
gen = foreach temp generate *;
dump gen;
it works fine.
Any help understanding this is appreciated.
As requested, here is the code:
STORE union_of_all_records INTO '/staged/google/data_after_denormalization' using PigStorage('\t','-schema');
union_of_all_records is an alias in Pig.
Now another script consumes this data:
lookup_data =
LOAD '/staged/google/page_type_map_file/' using PigStorage() AS (page_type:chararray,page_type_classification:chararray);
load_denorm_clickstream_record =
LOAD '/staged/google/data_after_denormalization' using PigStorage('\t','-schema');
and join these two aliases:
denorm_clickstream_record = LIMIT load_denorm_clickstream_record 100;
join_with_lookup =
JOIN denorm_clickstream_record BY page_type LEFT OUTER, lookup_data BY page_type;
-- step x:
final_output =
FOREACH join_with_lookup
GENERATE denorm_clickstream_record::page_type AS page_type;
At step x I get the above error.
I think you have two options:
1) You have to tell Pig the schema that the data has. For example:
temp = load '/user/sga/transformeddata' using PigStorage() AS (page_type:chararray);
2) When you first store the data, tell PigStorage to store the schema information as well: PigStorage('\t', '-schema'). When you then load the data as you do above, PigStorage should read the schema back from the hidden .pig_schema file it wrote.
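A minimal sketch of option 2 (the alias transformed is illustrative; the '-schema' option writes a hidden .pig_schema file next to the data, which PigStorage picks up again on load):
STORE transformed INTO '/user/sga/transformeddata' USING PigStorage('\t', '-schema');
-- consuming script: no AS clause needed, the schema comes from .pig_schema
temp = load '/user/sga/transformeddata' using PigStorage('\t');
gen = foreach temp generate page_type;
dump gen;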

Hive Columnar Loader in HDP2.0

I am using HDP 2.0 and running a simple Pig Script.
I have registered the jars below and am then executing the following code (updated the schema):
register /usr/lib/pig/piggybank.jar;
register /usr/lib/hive/lib/hive-common-0.11.0.2.0.5.0-67.jar;
register /usr/lib/hive/lib/hive-exec-0.11.0.2.0.5.0-67.jar;
A = LOAD '/apps/hive/warehouse/test.db/hivetables' USING
org.apache.pig.piggybank.storage.HiveColumnarLoader('id int, name string,age
int,create_dt string,timestamp string,accno int');
F = FILTER A BY (id == 85986249 );
STORE F INTO '/user/test/Pigout' USING PigStorage();
The problem is, though the value I am filtering for is present in the Hive table, the result always writes 0 records to the output. But it is able to load all the records into A.
Basically the FILTER is not working. My Hive table is not partitioned. I believe the problem could be in HiveColumnarLoader but I am not able to figure out what it is.
Please let me know if you are aware of a solution. I am struggling a lot with this.
Thanks a lot for the help!!!
Based on the Pig 0.12 documentation, HiveColumnarLoader appears to require an intermediate relation before you can filter on a non-partition value. Given that id is not a partition column, that appears to be your problem.
try this:
A = LOAD '/apps/hive/warehouse/test.db/hivetables' USING
org.apache.pig.piggybank.storage.HiveColumnarLoader('id int, name string,age
int,create_dt string,timestamp string,accno int');
B = FOREACH A GENERATE id, name, age, create_dt, timestamp, accno;
F = FILTER B BY (id == 85986249);
STORE F INTO '/user/test/Pigout' USING PigStorage();
The documentation all seems to say that for processing the actual values you need intermediate relation B.
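If the filter still returns nothing after adding the intermediate relation, it may be a type mismatch between the loaded field and the integer literal; a hedged check, reusing the aliases above:
DESCRIBE B; -- confirm how the loader actually typed 'id'
F = FILTER B BY ((int)id == 85986249); -- force the comparison to int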

Pig approach to pairing data fields in a data set

I'm new to Pig and trying to correctly implement a somewhat common algorithm in which I need to pair every matching record in a set of records. In order to distill the question into its simplest form and also avoid discussing some business-specific sensitivities, here's a mock problem:
Say that I have a dataset representing college classes and students that attend them:
Philosophy,John
English,Mary
English,Sue
History,Jack
Philosophy,David
English,Mark
English,Larry
I want to pair every association between students that took the same class; so the output would include this, showing the explosion of the four 'English' rows into six associations:
Philosophy John,David
English Mary,Sue
English Mary,Mark
English Mary,Larry
English Sue,Mark
English Sue,Larry
English Mark,Larry
This page: http://ofps.oreilly.com/titles/9781449302641/advanced_pig_latin.html refers to using flatten() to effect the cross product. I have tried several approaches and researched this extensively and would post my attempts but honestly I'm flailing and I think that would just confuse the reader and not provide any value. But here's the boilerplate:
s = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
grp = group s by class;
...
(I believe the problem I'm facing is that FLATTEN requires multiple bags, not multiple fields, and I can't figure out how to get my grouping to generate multiple bags...)
Thank you for any assistance!
You can use the UnorderedPairs UDF from LinkedIn's DataFu project. Download the package from here and issue the following
(tested on Pig v0.10.0):
register '/home/user/datafu/dist/datafu-0.0.4.jar';
define UnorderedPairs datafu.pig.bags.UnorderedPairs();
A = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
B = GROUP A BY class;
C = FOREACH B GENERATE group, FLATTEN(UnorderedPairs(A.student));
When further flattening the result:
D = FOREACH C generate FLATTEN($0) as (class:chararray),
FLATTEN($1) as (student1:chararray), FLATTEN($2) as (student2:chararray);
You'll end up having the desired result:
dump D;
(English,Mary,Sue)
(English,Mary,Mark)
(English,Mary,Larry)
(English,Sue,Mark)
(English,Sue,Larry)
(English,Mark,Larry)
(Philosophy,John,David)
There are two approaches I see to this. I have not tried either in quite some time, so please follow up and let us know if they worked well or not.
The first approach is a self join
s1 = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
s2 = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
b = JOIN s1 BY class, s2 BY class;
...
The downside of this is that you have to load the data twice. There is some discussion on why this sucks, but it's just how you have to do it.
The other option would be to use CROSS nested in a FOREACH after the GROUP:
Note: I'm not sure at all if this will work, or if I got the syntax right (I'm not in an environment that I could test this right now). Perhaps someone can confirm.
B = GROUP s BY class;
C = FOREACH B {
    s1 = s.student; -- CROSS needs two distinct aliases, so project the bag twice
    s2 = s.student;
    DA = CROSS s1, s2; -- nested CROSS requires Pig 0.10+
    GENERATE group, FLATTEN(DA);
};
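Note that crossing a bag with itself also yields self-pairs and both orderings; a hedged cleanup, assuming the aliases above:
D = FILTER C BY $1 < $2; -- drops (x,x) and keeps one ordering per pair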
This can be done with a self-join and some simple filtering.
classes1 = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
classes2 = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
joined = JOIN classes1 BY class, classes2 BY class;
filtered = FILTER joined BY classes1::student < classes2::student; -- '::' disambiguates fields after a join
pairs = FOREACH filtered GENERATE classes1::class AS class, classes1::student AS student1, classes2::student AS student2;
Note that filtering by student1 < student2 gets you unique pairs.
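With the sample data above, dump pairs; should produce the six English pairs plus (Philosophy,John,David), matching the expected output (row order may vary).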
