Pig approach to pairing data fields in a data set - hadoop

I'm new to Pig and trying to correctly implement a somewhat common algorithm in which I need to pair every matching record in a set of records. In order to distill the question into its simplest form and also avoid discussing some business-specific sensitivities, here's a mock problem:
Say that I have a dataset representing college classes and students that attend them:
Philosophy,John
English,Mary
English,Sue
History,Jack
Philosophy,David
English,Mark
English,Larry
I want to pair every association between students that took the same class; so the output would include this, showing the explosion of the four 'English' rows into six associations:
Philosophy John,David
English Mary,Sue
English Mary,Mark
English Mary,Larry
English Sue,Mark
English Sue,Larry
English Mark,Larry
This page: http://ofps.oreilly.com/titles/9781449302641/advanced_pig_latin.html refers to using flatten() to effect the cross product. I have tried several approaches and researched this extensively and would post my attempts but honestly I'm flailing and I think that would just confuse the reader and not provide any value. But here's the boilerplate:
s = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
grp = group s by class;
...
(I believe the problem I'm facing has to do with flatten requiring multiple bags, not multiple fields, and I can't figure out how to get my group'ing to generate multiple bags...)
Thank you for any assistance!

You can use the UnorderedPairs UDF from LinkedIn's DataFu project. Download the package from here and run the following
(tested on Pig v0.10.0):
register '/home/user/datafu/dist/datafu-0.0.4.jar'
define UnorderedPairs datafu.pig.bags.UnorderedPairs();
A = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
B = GROUP A BY class;
C = FOREACH B GENERATE group, FLATTEN(UnorderedPairs(A.student));
When further flattening the result:
D = FOREACH C generate FLATTEN($0) as (class:chararray),
FLATTEN($1) as (student1:chararray), FLATTEN($2) as (student2:chararray);
You'll end up having the desired result:
dump D;
(English,Mary,Sue)
(English,Mary,Mark)
(English,Mary,Larry)
(English,Sue,Mark)
(English,Sue,Larry)
(English,Mark,Larry)
(Philosophy,John,David)
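To check the expected output without a Pig installation, the group-then-pair logic can be sketched in plain Python. This only illustrates what the script above computes; it is not how DataFu implements UnorderedPairs, and the function name unordered_pairs_by_class is made up for this example.

```python
from collections import defaultdict
from itertools import combinations

def unordered_pairs_by_class(rows):
    """Group (class, student) rows by class and emit each pair of students once."""
    students_by_class = defaultdict(list)
    for cls, student in rows:
        students_by_class[cls].append(student)
    result = []
    for cls, students in students_by_class.items():
        # combinations() yields every 2-element subset exactly once, in input order
        for first, second in combinations(students, 2):
            result.append((cls, first, second))
    return result

# The sample data from the question
rows = [
    ("Philosophy", "John"), ("English", "Mary"), ("English", "Sue"),
    ("History", "Jack"), ("Philosophy", "David"), ("English", "Mark"),
    ("English", "Larry"),
]
pairs = unordered_pairs_by_class(rows)
for row in pairs:
    print(row)
```

A class with a single student, such as History here, produces no rows at all, which matches the dump above.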

There are two approaches I see to this. I have not tried either in quite some time, so please follow up and let us know if they worked well or not.
The first approach is a self join
s1 = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
s2 = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
b = JOIN s1 BY class, s2 BY class;
...
The downside of this is that you have to load the data twice. There is some discussion on why this sucks, but it's just how you have to do it.
The other option would be to use CROSS nested in a FOREACH after the GROUP:
Note: I'm not sure at all if this will work, or if I got the syntax right (I'm not in an environment that I could test this right now). Perhaps someone can confirm.
B = GROUP s BY class;
C = FOREACH B {
DA = CROSS s, s;
GENERATE FLATTEN(DA);
}

This can be done with a self-join and some simple filtering.
classes1 = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
classes2 = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
joined = JOIN classes1 BY class, classes2 BY class;
filtered = FILTER joined BY classes1::student < classes2::student;
pairs = FOREACH filtered GENERATE classes1::student AS student1, classes2::student AS student2;
Note that filtering by student1 < student2 gets you unique pairs.
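Why the comparison is enough can be seen in a small plain-Python sketch of the same join. It is an illustration only, under the assumption that student names are unique within a class; the function name self_join_pairs is made up, and unlike the Pig script it keeps the class in the output.

```python
def self_join_pairs(rows):
    """Join rows to themselves on class, keeping only left student < right student."""
    pairs = []
    for class1, student1 in rows:
        for class2, student2 in rows:
            # The join yields (a, a), (a, b) and (b, a) for every class;
            # the strict "<" drops (a, a) and keeps exactly one of the other two.
            if class1 == class2 and student1 < student2:
                pairs.append((class1, student1, student2))
    return pairs

# The sample data from the question
rows = [
    ("Philosophy", "John"), ("English", "Mary"), ("English", "Sue"),
    ("History", "Jack"), ("Philosophy", "David"), ("English", "Mark"),
    ("English", "Larry"),
]
joined_pairs = self_join_pairs(rows)
english = sorted(p for p in joined_pairs if p[0] == "English")
```

Each pair comes out in alphabetical order (for example Larry before Mary), not in the order the rows appear in the input, and two students with identical names in one class would not be paired.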

Related

How to Merge Maps in Pig

I am new to Pig so bear with me. I have two datasources that have the same schema: a map of attributes. I know that some attributes will have a single identifiable overlapping attribute. For example
Record A:
{"Name":{"First":"Foo", "Last":"Bar"}, "FavoriteFoods":{["Oranges", "Pizza"]}}
Record B:
{"Name":{"First":"Foo", "Last":"Bar"}, "FavoriteFoods":{["Buffalo Wings"]}}
I want to merge the records on Name such that:
Merged:
{"Name":{"First":"Foo", "Last":"Bar"}, "FavoriteFoods":{["Oranges", "Pizza", "Buffalo Wings"]}}
UNION, UNION ONSCHEMA, and JOIN don't operate in this way. Is there a method available to do this within Pig, or will it have to happen within a UDF?
Something like:
A = LOAD 'fileA.json' USING JsonLoader AS infoMap:map[];
B = LOAD 'fileB.json' USING JsonLoader AS infoMap:map[];
merged = MERGE_ON infoMap#Name, A, B;
Pig by itself is very dumb when it comes to even slightly complex data translation. I feel you will need two kinds of UDFs to achieve your task. The first UDF will need to accept a map and create a unique string representation of it. It could be something like a hashed string representation of the map (let's call it getHashFromMap()). This string will be used to join the two relations. The second UDF would accept two maps and return a merged map (let's call it mergeMaps()). Your script will then look as follows:
A = LOAD 'fileA.json' USING JsonLoader AS infoMapA:map[];
B = LOAD 'fileB.json' USING JsonLoader AS infoMapB:map[];
A2 = FOREACH A GENERATE *, getHashFromMap(infoMapA#'Name') AS joinKey;
B2 = FOREACH B GENERATE *, getHashFromMap(infoMapB#'Name') AS joinKey;
AB = JOIN A2 BY joinKey, B2 BY joinKey;
merged = FOREACH AB GENERATE *, mergeMaps(infoMapA, infoMapB) AS mergedMap;
Here I assume that the attribute you want to merge on is a map. If that can vary, your first UDF will need to become more generic. Its main purpose would be to get a unique string representation of the attribute so that the datasets can be joined on that.
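The logic the two UDFs would need can be sketched in plain Python. This is a rough illustration under assumptions, not a Pig UDF: the names mirror the hypothetical getHashFromMap() and mergeMaps() above, FavoriteFoods is modelled as a plain list, and the merge rule (concatenate lists, otherwise keep the left value) is one possible choice.

```python
import hashlib
import json

def get_hash_from_map(value):
    """Return a stable string key for a map: a hash of its key-sorted JSON form."""
    canonical = json.dumps(value, sort_keys=True)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

def merge_maps(left, right):
    """Merge two maps; list values under the same key are concatenated without duplicates."""
    merged = dict(left)
    for key, value in right.items():
        if isinstance(merged.get(key), list) and isinstance(value, list):
            merged[key] = merged[key] + [v for v in value if v not in merged[key]]
        else:
            # Assumption: for non-list values the left record wins
            merged.setdefault(key, value)
    return merged

record_a = {"Name": {"First": "Foo", "Last": "Bar"}, "FavoriteFoods": ["Oranges", "Pizza"]}
record_b = {"Name": {"Last": "Bar", "First": "Foo"}, "FavoriteFoods": ["Buffalo Wings"]}

# Index one side by its join key, then probe with the other side
by_key = {get_hash_from_map(record_a["Name"]): record_a}
merged = merge_maps(by_key[get_hash_from_map(record_b["Name"])], record_b)
```

Sorting the keys before hashing matters: the two Name maps above list their keys in a different order but must still produce the same join key.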

Hadoop Pig: Show entries using STARTSWITH

I am having issues using the STARTSWITH string function. I want to display all records whose System_Period begins with 20040.
transactions = LOAD '/home/cloudera/datasets/assignment2/Transactions.csv'
USING PigStorage(',') AS (Branch_Number:int, Contract_Number:int,
Customer_Number:int,Invoice_Date:chararray, Invoice_Number:int,
Product_Number:int, Sales_Amount:double, Employee_Number:int,
Service_Date:chararray, System_Period:int);
sysGroup = GROUP transactions BY System_Period;
sysFilter = FILTER sysGroup BY STARTSWITH(transactions.System_Period, 20040);
DUMP sysFilter;
The error I am receiving is
Could not infer the matching function for org.apache.pig.builtin.STARTSWITH as multiple or none of them fit. Please use an explicit cast.
STARTSWITH takes two strings and checks whether the first one begins with the second. You cannot pass a relation or a bag to it, and it accepts only strings (chararray), not integers. One option is to load System_Period as a chararray and FILTER the records that begin with 20040 before the GROUP BY, casting the field back to int after the filter if you need to:
transactions = LOAD '/home/cloudera/datasets/assignment2/Transactions.csv'
USING PigStorage(',') AS (Branch_Number:int, Contract_Number:int,
Customer_Number:int,Invoice_Date:chararray, Invoice_Number:int,
Product_Number:int, Sales_Amount:double, Employee_Number:int,
Service_Date:chararray, System_Period:chararray);
sysFilter = FILTER transactions BY STARTSWITH(System_Period, '20040');
Alternatively, FLATTEN the result after the GROUP BY and then filter:
transactions = LOAD '/home/cloudera/datasets/assignment2/Transactions.csv'
USING PigStorage(',') AS (Branch_Number:int, Contract_Number:int,
Customer_Number:int,Invoice_Date:chararray, Invoice_Number:int,
Product_Number:int, Sales_Amount:double, Employee_Number:int,
Service_Date:chararray, System_Period:chararray);
sysGroup = GROUP transactions BY System_Period;
flatres = FOREACH sysGroup GENERATE group,FLATTEN(transactions);
sysFilter = FILTER flatres BY STARTSWITH(System_Period, '20040');
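The reason the column has to be text can be seen in a small plain-Python sketch: a prefix test is a string operation, so the value is compared as text and only converted to a number afterwards. The sample rows and the function name filter_by_period_prefix are invented for illustration.

```python
def filter_by_period_prefix(rows, prefix):
    """Keep rows whose System_Period, read as text, starts with prefix; then cast to int."""
    kept = []
    for row in rows:
        period = row["System_Period"]  # loaded as text, like a chararray column
        if period.startswith(prefix):
            # Cast only after the prefix test has been applied
            kept.append({**row, "System_Period": int(period)})
    return kept

# Made-up sample rows, not taken from Transactions.csv
rows = [
    {"Invoice_Number": 1, "System_Period": "200401"},
    {"Invoice_Number": 2, "System_Period": "200312"},
    {"Invoice_Number": 3, "System_Period": "200409"},
]
result = filter_by_period_prefix(rows, "20040")
```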

how to join header row to detail rows in multiple files with apache pig

I have several CSV files in an HDFS folder which I load into a relation with:
source = LOAD '$data' USING PigStorage(','); -- $data is passed as a parameter to the pig command
When I dump it, the structure of the source relation is as follows: (note that the data is text qualified but I will deal with that using the REPLACE function)
("HEADER","20110118","20101218","20110118","T00002")
("0000000000000000035412","20110107","2699","D","20110107","2315.","","","","","","C")
("0000000000000000035412","20110107","2699","D","20110107","246..","162","74","","","","B")
<.... more records ....>
("HEADER","20110224","20110109","20110224","T00002")
("0000000000000000035412","20110121","2028","D","20110121","a6c3.","","","","","R","P")
("0000000000000000035412","20110217","2619","D","20110217","a6c3.","","","","","R","P")
<.... more records ....>
So each file has a header which provides some information about the data set that follows it such as the provider of the data and the date range it covers.
So now, how can I transform the above structure and create a new relation like the following?
{
(HEADER,20110118,20101218,20110118,T00002),{(0000000000000000035412,20110107,2699,D,20110107,2315.,,,,,,C),(0000000000000000035412,20110107,2699,D,20110107,246..,162,74,,,,B),..more tuples..},
(HEADER,20110224,20110109,20110224,T00002),{(0000000000000000035412,20110121,2028,D,20110121,a6c3.,,,,,R,P),(0000000000000000035412,20110217,2619,D,20110217,a6c3.,,,,,R,P),..more tuples..},..more tuples..
}
Here each header tuple is followed by a bag of the record tuples belonging to that header.
Unfortunately there is no common key field between the header and the detail rows, so I don't think I can use any JOIN operation.
I am quite new to Pig and Hadoop and this is one of the first concept projects that I am engaging in.
Hope my question is clear and look forward to some guidance here.
This should get you started.
Code:
Source = LOAD '$data' USING PigStorage(',','-tagFile');
-- with -tagFile, $0 is the source file name and the original first column becomes $1
SPLIT Source INTO FileHeaders IF $1 == 'HEADER', FileData OTHERWISE;
B = GROUP FileData BY $0;
C = GROUP FileHeaders BY $0;
D = JOIN B BY group, C BY group;
...
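The idea behind the script, using the file name as the missing common key, can be sketched in plain Python. This is an illustration under assumptions: each file contains exactly one header row, the text qualifiers have already been stripped, the file names a.csv and b.csv are invented, and the detail rows are shortened to four fields.

```python
from collections import defaultdict

def attach_headers(tagged_rows):
    """Pair each file's header row with the list of detail rows from the same file."""
    headers = {}
    details = defaultdict(list)
    for file_name, *fields in tagged_rows:
        if fields[0] == "HEADER":
            headers[file_name] = tuple(fields)
        else:
            details[file_name].append(tuple(fields))
    # Equivalent of joining the two groupings on the file name
    return [(headers[name], details[name]) for name in headers]

# First element of each row is the file tag, as -tagFile would prepend it
tagged_rows = [
    ("a.csv", "HEADER", "20110118", "20101218", "20110118", "T00002"),
    ("a.csv", "0000000000000000035412", "20110107", "2699", "D"),
    ("a.csv", "0000000000000000035412", "20110107", "2699", "D"),
    ("b.csv", "HEADER", "20110224", "20110109", "20110224", "T00002"),
    ("b.csv", "0000000000000000035412", "20110121", "2028", "D"),
]
result = attach_headers(tagged_rows)
```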

Hive Columnar Loader in HDP2.0

I am using HDP 2.0 and running a simple Pig Script.
I have registered the below jars and I am then executing the below code (updated the schema) -
register /usr/lib/pig/piggybank.jar;
register /usr/lib/hive/lib/hive-common-0.11.0.2.0.5.0-67.jar;
register /usr/lib/hive/lib/hive-exec-0.11.0.2.0.5.0-67.jar;
A = LOAD '/apps/hive/warehouse/test.db/hivetables' USING
org.apache.pig.piggybank.storage.HiveColumnarLoader('id int, name string,age
int,create_dt string,timestamp string,accno int');
F = FILTER A BY (id == 85986249 );
STORE F INTO '/user/test/Pigout' USING PigStorage();
The problem is that, although the value used in F is present in the Hive table, the script always writes 0 records to the output. It is, however, able to load all the records into A.
Basically the FILTER is not working. My Hive table is not partitioned. I believe the problem could be in HiveColumnarLoader, but I am not able to figure out what it is.
Please let me know if you are aware of a solution. I am struggling a lot with this.
Thanks a lot for the help!!!
Based on the Pig 0.12 documentation, HiveColumnarLoader appears to require an intermediate relation before you can filter on a non-partition value. Given that id is not a partition column, that appears to be your problem.
Try this:
A = LOAD '/apps/hive/warehouse/test.db/hivetables' USING
org.apache.pig.piggybank.storage.HiveColumnarLoader('id int, name string,age
int,create_dt string,timestamp string,accno int');
B = FOREACH A GENERATE id, name, age, create_dt, timestamp, accno;
F = FILTER B BY (id == 85986249);
STORE F INTO '/user/test/Pigout' USING PigStorage();
The documentation all seems to say that for processing the actual values you need intermediate relation B.

Querying M:M relationships using Entity Framework

How would I modify the following code:
var result = from p in Cache.Model.Products
from f in p.Flavours
where f.FlavourID == "012541-5-5-5-651"
select p;
So that f.FlavourID is matched against a range of IDs, as opposed to just the one value shown in the above example?
Given the following ERD Model:
Products* => ProdCombinations <= *Flavours
ProdCombinations is a junction/link table and simply has one composite key in there.
Off the top of my head:
string [] ids = new[]{"012541-5-5-5-651", "012541-5-5-5-652", "012541-5-5-5-653"};
var result = from p in Cache.Model.Products
from f in p.Flavours
where ids.Contains(f.FlavourID)
select p;
There are some limitations, but an array of IDs has worked for me before. I've only actually tried it with a SQL Server backend, and my IDs were integers.
As I understand it, LINQ needs to translate your query into SQL, and that is only sometimes possible. For example, it is not possible with an IEnumerable<SomeClass>, which produces a runtime error, but it is possible with a collection of simple types.
