Iterate on 2 Data Sources in PIG - hadoop

I have 2 data sources
1) Params.txt which has the following content
item1
item2
item3
.
.
.
itemN
2) Data.txt which has the following content
The names (aliases) of relations A, B, and C are case sensitive.
The names (aliases) of fields f1, f2, and f3 are case sensitive.
Function names PigStorage and COUNT are case sensitive.
Keywords LOAD, USING, AS, GROUP, BY, FOREACH, GENERAT
and DUMP are case insensitive. They can also be written
The task is to see whether each of the N items in the params file exists in each line of the data file.
This is the pseudocode for the same:
FOREACH d IN data:
    FOREACH PARAM IN PARAMS:
        IF PARAM IN d:
            GENERATE PARAM, 1
Is something of this sort possible in Pig scripting? If yes, could you please point me in that direction?
Thanks

This is possible in Pig, but Pig is perhaps an unusual language to solve the problem!
I would approach the problem like this:
Load in Params.txt
Load in Data.txt and tokenise each line (assuming you're happy to split the text on spaces - you might need to think about what to do with punctuation)
Flatten the bag from tokenise to get one "word" per record in the relation.
Join the Params and Data relations. An inner join would give you only the words that appear in both.
Group the data and then count the occurrences of each word.
params = LOAD 'Params.txt' USING PigStorage() AS (param_word:chararray);
data = LOAD 'Data.txt' USING PigStorage() AS (line:chararray);
token_data = FOREACH data GENERATE TOKENIZE(line) AS words:{(word:chararray)};
token_flat = FOREACH token_data GENERATE FLATTEN(words) AS (word);
joined = JOIN params BY param_word, token_flat BY word;
word_count = FOREACH (GROUP joined BY params::param_word) GENERATE
group AS param_word,
COUNT(joined) AS param_word_count;
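If you then want to inspect or save the result, a couple of standard options (the output path here is only an illustration):
DUMP word_count;
-- or write it out somewhere, e.g.:
STORE word_count INTO 'param_word_counts' USING PigStorage();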

Related

How can I merge sparse tables in Hadoop?

I have a number of csv files, each containing an ID column and a single column of values:
File1:
ID|V1
1111|101
4444|101
File2:
ID|V2
2222|102
4444|102
File3:
ID|V3
3333|103
4444|103
I want to combine these to get:
ID|V1|V2|V3
1111|101||
2222||102|
3333|||103
4444|101|102|103
There are many (100 million) rows, and about 100 columns/tables.
I've been trying to use Pig, but I'm a beginner, and am struggling.
For two files, I can do:
s1 = load 'file1.psv' using PigStorage('|') as (ID,V1);
s2 = load 'file2.psv' using PigStorage('|') as (ID,V2);
cg = cogroup s1 by ID, s2 by ID;
merged = foreach cg generate group, flatten((IsEmpty(s1) ? null : s1.V1)), flatten((IsEmpty(s2) ? null : s2.V2));
But I would like to do this with whatever files are present, up to 100 or so, and I don't think I can cogroup that many big files without running out of memory. I'd also rather get the column names from the headers than hard-code them. In other words, this 2-file toy example doesn't scale.
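For illustration, extending the same cogroup pattern to the three toy files above would presumably look like this, which shows how the script grows with every extra input:
s1 = load 'file1.psv' using PigStorage('|') as (ID,V1);
s2 = load 'file2.psv' using PigStorage('|') as (ID,V2);
s3 = load 'file3.psv' using PigStorage('|') as (ID,V3);
cg = cogroup s1 by ID, s2 by ID, s3 by ID;
merged = foreach cg generate group,
    flatten((IsEmpty(s1) ? null : s1.V1)),
    flatten((IsEmpty(s2) ? null : s2.V2)),
    flatten((IsEmpty(s3) ? null : s3.V3));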

How to count on two columns of group by items in pig

I have generated two columns (origin and destination) out of 'n' number of columns. Now I want to generate a count for each combination of these two columns, but I am not able to get the result. I am getting the error: ERROR 1070: Could not resolve Count using imports.
Below is my script:
mydata = load '/Projects/Flightdata/1987/Rawdata' using PigStorage(',') as (year:int, month:int, dom:int, dow:int, deptime:long, crsdeptime:long, arrtime:long, crsarrtime:long, uniqcarcode:chararray, flightnum:long, tailnum:chararray, actelaptime:long, crselaptime:long, airtime:long, arrdeltime:long, depdeltime:long, origcode:chararray, destcode:chararray, dist:long, taxintime:long, taxiouttime:long, flightcancl:int, canclcode:chararray, diverted:int, carrierdel:long, weatherdel:long, nasdel:long, securitydel:long, lateaircraftdel:long);
Step2 = foreach mydata generate origcode, destcode;
grpby = group Step2 by (origcode, destcode) ;
step3 = foreach grpby generate group.origcode as source, group.destcode as destination, Count(step2);
Here I want to generate the count for each combination of origin and destination.
Any guidance will be helpful.
Please see the Pig documentation about case sensitivity:
The names of Pig Latin functions are case sensitive.
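In this script that means writing the function as COUNT and matching the alias's case exactly (the relation was defined as Step2, not step2). A corrected version of the last two statements would look something like this:
grpby = group Step2 by (origcode, destcode);
step3 = foreach grpby generate group.origcode as source, group.destcode as destination, COUNT(Step2);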

Pig latin join by field

I have a Pig Latin related problem:
I have this data below (in one row):
A = LOAD 'records' AS (f1:chararray, f2:chararray,f3:chararray, f4:chararray,f5:chararray, f6:chararray);
DUMP A;
(FITKA,FINVA,FINVU,FEEVA,FETKA,FINVA)
Now I have another dataset:
B = LOAD 'values' AS (f1:chararray, f2:chararray);
Dump B;
(FINVA,0.454535)
(FITKA,0.124411)
(FEEVA,0.123133)
And I would like to get those two datasets joined: take the corresponding value from dataset B and place it beside the value from dataset A. So the expected output is below:
FITKA 0.123133, FINVA 0.454535 and so on ..
(They can also be like: FITKA, 0.123133, FINVA, 0.454535 and so on .. )
Then I would be able to multiply the values (0.123133 x 0.454535, and so on) because they are on the same row, which is what I want.
Of course I can join column by column, but then the values end up at the end of the row and I have to clean that up with another FOREACH ... GENERATE. I want a simpler solution without too many joins, which may cause performance issues.
Dataset A is text (a sentence, in a way).
So what are my options to achieve this?
Any help would be nice.
A sentence can be represented as a tuple that contains a bag of (word, count) tuples.
Therefore, I suggest you change the way you store your data to the following format:
sentence:tuple(words:bag{wordcount:tuple(word, count)})
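A minimal sketch of what loading and using that layout could look like (the file name, aliases, and types here are assumptions, not from the original post):
-- each record is one sentence: a tuple holding a bag of (word, count) tuples
A = LOAD 'sentences' AS (sentence:tuple(words:bag{wordcount:tuple(word:chararray, count:int)}));
-- one row per word, so it can be joined against the values relation
words = FOREACH A GENERATE FLATTEN(sentence.words) AS (word, count);
B = LOAD 'values' AS (f1:chararray, f2:double);
joined = JOIN words BY word, B BY f1;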

Pig how to assign name to columns?

I have a CSV file which has hundreds of columns. When I load the file into Pig, I don't want to assign a name to each column, like
A = load 'path/to/file' as (a,b,c,d,e......)
since I'll filter a lot of them out in the second step:
B = foreach A generate $0,$2,....;
But can I assign a name and type to each column of B here? Something like:
B = foreach A generate $0,$2,... AS (a:int,b:int,c:float)
I tried the above code but it doesn't work.
Thanks.
You have to specify them one by one, between each comma:
B = foreach A generate $0 as a, $2 as b,...
Note that this just keeps whatever type each field already has.
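If you also want to set the types instead of keeping the existing ones, one option (a sketch with made-up field names) is to cast each projected field:
B = foreach A generate (int)$0 as a, (int)$2 as b, (float)$5 as c;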

UnGroup in Apache Pig

Does Apache Pig support an UNGROUP operation? I guess not. So could anyone help me out with this problem?
I have rows of the form
1,a-b-c
2,d-e-f
3,g-h
I would like to expand it to the form
1,a
1,b
1,c
2,d
2,e
2,f
3,g
3,h
Any help appreciated.
You should probably use the builtin TOKENIZE with a custom delimiter to split your second field into a bag of tokens (unlike STRSPLIT, which returns a tuple), and then apply FLATTEN to that bag to create one row per element. Something like this:
A = LOAD 'input.txt' AS (id:int, data:chararray);
B = FOREACH A GENERATE id, FLATTEN(TOKENIZE(data, '-'));
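With the sample rows above, dumping B should give one (id, token) row per element, along the lines of:
DUMP B;
-- expected output (ordering may differ):
-- (1,a)
-- (1,b)
-- (1,c)
-- (2,d)
-- (2,e)
-- (2,f)
-- (3,g)
-- (3,h)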
