How to generate a custom schema from a relation in Pig? - hadoop

I have a relation holding tf-idf values for words in various articles.
Its schema looks like this:
tfidf_relation: {word: chararray,id: bytearray,tfidf: double}
Here is an example of such data:
(cat,article_one,0.13515503603605478)
(cat,article_two,0.4054651081081644)
(dog,article_one,0.3662040962227032)
(apple,article_three,0.3662040962227032)
(orange,article_three,0.3662040962227032)
(parrot,article_one,0.13515503603605478)
(parrot,article_three,0.13515503603605478)
I want to get output in this form:
cat article_one 0.13515503603605478, article_two 0.4054651081081644
and so on.
The question is: how do I build a relation from this that contains the word field and a tuple of the id and tfidf fields?
Something like this:
X = FOREACH tfidf_relation GENERATE word, (id, tfidf);
doesn't work. What is the correct syntax for this?

Try this:
t = LOAD 'input/file' USING PigStorage(',') as (word: chararray,id: bytearray,tfidf: double);
u = group t by word;
dump u;
The output will be
(cat,{(cat,article_two,0.4054651081081644),(cat,article_one,0.13515503603605478)})
(dog,{(dog,article_one,0.3662040962227032)})
(apple,{(apple,article_three,0.3662040962227032)})
(orange,{(orange,article_three,0.3662040962227032)})
(parrot,{(parrot,article_three,0.13515503603605478),(parrot,article_one,0.13515503603605478)})
I hope this is what you are looking for.
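To get from the grouped bag above to the one-line-per-word form asked for in the question, the remaining step is string formatting. Here is a minimal Python sketch of that step, assuming each bag arrives as a list of (word, id, tfidf) tuples. The function name format_group is hypothetical, and wiring it into Pig (for example as a UDF) is not shown.

```python
def format_group(word, bag):
    """Join the (id, tfidf) pairs of one grouped word into a single line."""
    # Pig does not guarantee an order inside a bag, so sort by id to get
    # a deterministic line.
    ordered = sorted(bag, key=lambda row: row[1])
    # Each bag element still carries the word as its first field, as in the
    # GROUP output; only the id and tfidf fields are kept.
    pairs = ["%s %s" % (doc_id, tfidf) for (_word, doc_id, tfidf) in ordered]
    return "%s %s" % (word, ", ".join(pairs))


# Illustrative values, not the tf-idf numbers from the question.
line = format_group("cat", [("cat", "article_two", 0.25),
                            ("cat", "article_one", 0.5)])
print(line)
```

Sorting by id is a design choice for readable output. Drop it if the order of articles does not matter.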

X = FOREACH tfidf_relation GENERATE word, {(id, tfidf)};
This is probably what you need.

Related

How to get table name for a simple Sequel Dataset object?

I.e., given a dataset object ds = DB[:transactions].where{updated_at > 1.day.ago} (no funny joins or anything going on), how could I fetch the table name (:transactions)?
If you want the first table in the dataset, you can use ds.first_source.
If you want it as a string you can do:
ds.first_source_table.to_s
If you want a symbol, just omit .to_s
Based on the example provided, I would do something like this.
ds.klass.name
That will return a string with the name of your table.

How to Merge Maps in Pig

I am new to Pig so bear with me. I have two datasources that have the same schema: a map of attributes. I know that some attributes will have a single identifiable overlapping attribute. For example
Record A:
{"Name":{"First":"Foo", "Last":"Bar"}, "FavoriteFoods":{["Oranges", "Pizza"]}}
Record B:
{"Name":{"First":"Foo", "Last":"Bar"}, "FavoriteFoods":{["Buffalo Wings"]}}
I want to merge the records on Name such that:
Merged:
{"Name":{"First":"Foo", "Last":"Bar"}, "FavoriteFoods":{["Oranges", "Pizza", "Buffalo Wings"]}}
UNION, UNION ONSCHEMA, and JOIN don't operate in this way. Is there a method available to do this within Pig, or will it have to happen within a UDF?
Something like:
A = LOAD 'fileA.json' USING JsonLoader AS infoMap:map[];
B = LOAD 'fileB.json' USING JsonLoader AS infoMap:map[];
merged = MERGE_ON infoMap#Name, A, B;
Pig by itself is very dumb when it comes to even slightly complex data transformation. I feel you will need two kinds of UDFs to achieve your task. The first UDF will need to accept a map and create a unique string representation of it, for example a hashed string representation of the map (let's call it getHashFromMap()). This string will be used to join the two relations. The second UDF would accept two maps and return a merged map (let's call it mergeMaps()). Your script will then look as follows:
A = LOAD 'fileA.json' USING JsonLoader AS (infoMapA:map[]);
B = LOAD 'fileB.json' USING JsonLoader AS (infoMapB:map[]);
A2 = FOREACH A GENERATE *, getHashFromMap(infoMapA#'Name') AS joinKey;
B2 = FOREACH B GENERATE *, getHashFromMap(infoMapB#'Name') AS joinKey;
AB = JOIN A2 BY joinKey, B2 BY joinKey;
merged = FOREACH AB GENERATE *, mergeMaps(infoMapA, infoMapB) AS mergedMap;
Here I assume that the attribute you want to merge on is a map. If that can vary, your first UDF will need to become more generic. Its main purpose would be to get a unique string representation of the attribute so that the datasets can be joined on it.
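The logic inside the two UDFs described above could look roughly like the following Python sketch. It assumes the maps arrive as plain dicts and that FavoriteFoods is a list. The names get_hash_from_map and merge_maps mirror the hypothetical getHashFromMap() and mergeMaps(), and registering them with Pig is not shown.

```python
import hashlib
import json


def get_hash_from_map(value):
    """Return a stable hex digest for a map, independent of key order."""
    # Sorting the keys makes two maps with the same content serialize
    # to the same string, so they hash to the same join key.
    canonical = json.dumps(value, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def merge_maps(left, right):
    """Merge two maps; lists under a shared key are combined without duplicates."""
    merged = dict(left)
    for key, value in right.items():
        if key not in merged:
            merged[key] = value
        elif isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = merge_maps(merged[key], value)
        elif isinstance(merged[key], list) and isinstance(value, list):
            merged[key] = merged[key] + [v for v in value if v not in merged[key]]
        # Otherwise (scalars or mismatched types) the left value is kept.
    return merged


record_a = {"Name": {"First": "Foo", "Last": "Bar"},
            "FavoriteFoods": ["Oranges", "Pizza"]}
record_b = {"Name": {"Last": "Bar", "First": "Foo"},
            "FavoriteFoods": ["Buffalo Wings"]}

# Both records produce the same join key even though the key order differs.
same_key = get_hash_from_map(record_a["Name"]) == get_hash_from_map(record_b["Name"])
print(same_key, merge_maps(record_a, record_b))
```

Keeping the left value on a scalar conflict is an arbitrary choice in this sketch. A real UDF would need a rule that fits the data.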

PIG: Process tuples in a bag

I have a data set that looks like this after a GROUP operation :
input = key1|{(a1,b1,c1),(a2,b2,c2)}
key2|{(a3,b3,c3),(a4,b4,c4),(a5,b5,c5)}
I need to traverse the above to generate final output like this :
<KEY>key1</KEY>|
<VALUES><VALUE><VALUE1>a1</VALUE1><VALUE2>b1</VALUE2><VALUE3>c1</VALUE3></VALUE><VALUE><VALUE1>a2</VALUE1><VALUE2>b2</VALUE2><VALUE3>c2</VALUE3></VALUE></VALUES>
<KEY>key2</KEY>| ...
I have tried to use FLATTEN and CONCAT to achieve this result in the below manner:
A = FOREACH input GENERATE key, FLATTEN(input);
output = FOREACH A GENERATE CONCAT('<KEY>',CONCAT(input.key,'</KEY>')),
CONCAT('<VALUE>',''),
CONCAT('<VALUE1>',CONCAT(input.col1,'</VALUE1>')
...
But this does not give the desired output. I am fairly new to Pig, so I don't know if this is possible.
If you FLATTEN your bag, you'll end up with as many new rows as there were elements in the bag:
key1|(a1,b1,c1)
key1|(a2,b2,c2)
If I understand your problem correctly, you want everything for a key on one row. Use the BagToTuple built-in function for that.
Then you'll get:
key1|(a1,b1,c1,a2,b2,c2)
After this, you can format your data with, e.g., a UDF.
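The formatting step such a UDF would perform can be sketched in Python as follows. It assumes the flattened tuple always holds groups of three fields, as in the example data. The function name bag_tuple_to_xml and the width parameter are hypothetical, and hooking the function into Pig is not shown.

```python
from xml.sax.saxutils import escape


def bag_tuple_to_xml(key, flat, width=3):
    """Render a key and a flattened tuple as <KEY>...</KEY>|<VALUES>...</VALUES>."""
    if len(flat) % width != 0:
        raise ValueError("flat tuple length must be a multiple of width")
    values = []
    # Walk the flat tuple in chunks of `width`; each chunk is one original
    # bag element and becomes one <VALUE> block.
    for start in range(0, len(flat), width):
        row = flat[start:start + width]
        cells = "".join(
            "<VALUE%d>%s</VALUE%d>" % (i, escape(str(cell)), i)
            for i, cell in enumerate(row, 1)
        )
        values.append("<VALUE>" + cells + "</VALUE>")
    return "<KEY>%s</KEY>|<VALUES>%s</VALUES>" % (escape(str(key)), "".join(values))


print(bag_tuple_to_xml("key1", ("a1", "b1", "c1", "a2", "b2", "c2")))
```

Escaping the cell values is an addition not mentioned in the question. It keeps the markup well formed if a value contains characters such as < or &.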

Pig: Invalid field Projection; Projected Field does not exist

describe filter_records;
This gives me the below format:
filter_records: {details1: (firstname: chararray,lastname: chararray,age: int,gender: chararray),details2: (firstname: chararray,lastname: chararray,age: int,gender: chararray)}
I want to display the firstname from both details1 and details2. I tried this:
display_records = FOREACH filter_records GENERATE display1.firstname;
But I am getting the error:
Invalid field projection. Projected field [display1] does not exist in schema: details1:tuple(firstname:chararray,lastname:chararray,age:int,gender:chararray),details2:tuple(firstname:chararray,lastname:chararray,age:int,gender:chararray).
Please explain why this error occurs and how to resolve it.
I don't see any field named display1 in filter_records. I guess you used display1.firstname instead of details1.firstname. Can you change your script like this?
display_records = FOREACH filter_records GENERATE details1.firstname;
It seems you used the same field names (firstname, lastname, age, gender) in both details1 and details2. That will give a duplicate error when you project both like this:
display_records = FOREACH filter_records GENERATE details1.firstname,details2.firstname;
To solve this issue, give the fields in details1 and details2 unique names. Can you change your load schema like this? Any unique names will do.
details1:tuple(firstname1:chararray,lastname1:chararray,age1:int,sex1:chararray),details2:tuple(firstname2:chararray,lastname2:chararray,age2:int,sex2:chararray)
Now when you try this, you will get the firstname from both details1 and details2:
display_records = FOREACH filter_records GENERATE details1.firstname1,details2.firstname2;

Querying M:M relationships using Entity Framework

How would I modify the following code:
var result = from p in Cache.Model.Products
from f in p.Flavours
where f.FlavourID == "012541-5-5-5-651"
select p;
So that f.FlavourID is matched against a range of IDs, as opposed to just the one value shown in the example above?
Given the following ERD Model:
Products* => ProdCombinations <= *Flavours
ProdCombinations is a junction/link table and simply has one composite key in there.
Off the top of my head:
string[] ids = new[] { "012541-5-5-5-651", "012541-5-5-5-652", "012541-5-5-5-653" };
var result = from p in Cache.Model.Products
from f in p.Flavours
where ids.Contains(f.FlavourID)
select p;
There are some limitations, but an array of IDs has worked for me before. I've only actually tried it with a SQL Server backend, and my IDs were integers.
As I understand it, LINQ needs to translate your query into SQL, and that is only possible in some cases. For example, it is not possible with an IEnumerable<SomeClass>, which produces a runtime error, but it is possible with a collection of simple types.
