Pig latin join by field - hadoop

I have a Pig Latin problem.
I have the data below (in one row):
A = LOAD 'records' AS (f1:chararray, f2:chararray,f3:chararray, f4:chararray,f5:chararray, f6:chararray);
DUMP A;
(FITKA,FINVA,FINVU,FEEVA,FETKA,FINVA)
Now I have another dataset:
B = LOAD 'values' AS (f1:chararray, f2:chararray);
Dump B;
(FINVA,0.454535)
(FITKA,0.124411)
(FEEVA,0.123133)
And I would like to join those two datasets: look up the corresponding value in dataset B and place it beside each value from dataset A. The expected output is below:
FITKA 0.124411, FINVA 0.454535 and so on ..
(It could also look like: FITKA, 0.124411, FINVA, 0.454535 and so on ..)
Then I would be able to multiply the values (0.124411 x 0.454535 .. and so on) because they are on the same row, which is what I want.
Of course I can join column by column, but then the values appear at the end of the row and I have to rearrange them with another FOREACH ... GENERATE. I would prefer a simpler solution without too many joins, which may cause performance issues.
Dataset A is text (each row is, in a sense, a sentence).
So what are my options to achieve this?
Any help would be nice.

A sentence can be represented as a tuple that contains a bag of (word, count) tuples.
I therefore suggest storing your data in the following format:
sentence:tuple(words:bag{wordcount:tuple(word, count)})
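
To make the suggested shape concrete, here is a minimal plain-Python sketch (not Pig) of a sentence stored as a list of (word, value) pairs, using the data from the question. The helper names `attach_values` and `product_of_known` are made up for illustration, and skipping words that have no entry in B is an assumption, since the question does not say how to treat them.

```python
from math import prod

# Lookup table built from relation B in the question: word -> value.
values = {"FINVA": 0.454535, "FITKA": 0.124411, "FEEVA": 0.123133}

# One row of relation A, i.e. one sentence.
sentence = ("FITKA", "FINVA", "FINVU", "FEEVA", "FETKA", "FINVA")


def attach_values(words, lookup):
    """Pair each word with its value; words missing from the lookup get None."""
    return [(word, lookup.get(word)) for word in words]


def product_of_known(pairs):
    """Multiply the values that were found, skipping words without a value
    (an assumption; the question does not define this case)."""
    return prod(value for _, value in pairs if value is not None)


pairs = attach_values(sentence, values)
print(pairs)
print(product_of_known(pairs))
```

Once every word sits next to its value inside one structure, the multiplication is a single pass over that structure rather than one join per column.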

Related

Iterate on 2 Data Sources in PIG

I have 2 data sources
1) Params.txt which has the following content
item1
item2
item3
.
.
.
itemN
2) Data.txt which has the following content
The names (aliases) of relations A, B, and C are case sensitive.
The names (aliases) of fields f1, f2, and f3 are case sensitive.
Function names PigStorage and COUNT are case sensitive.
Keywords LOAD, USING, AS, GROUP, BY, FOREACH, GENERATE,
and DUMP are case insensitive. They can also be written
The task is to check whether each of the N items in the param file exists in each line of the data file.
This is the pseudocode for it:
FOREACH d IN data:
    FOREACH PARAM IN PARAMS:
        IF PARAM IN d:
            GENERATE PARAM,1
Is something of this sort possible in Pig scripting? If yes, could you please point me in that direction?
Thanks
This is possible in Pig, though Pig is perhaps an unusual language for this problem!
I would approach the problem like this:
1. Load Params.txt.
2. Load Data.txt and tokenise each line (assuming you're happy to split the text on spaces; you might need to think about what to do with punctuation).
3. Flatten the bag from tokenise to get one "word" per record in the relation.
4. Join the Params and Data relations. An inner join gives you only the words that appear in both.
5. Group the joined data and count the occurrences of each word.
-- one param word per line
params = LOAD 'Params.txt' USING PigStorage() AS (param_word:chararray);
-- one line of text per record
data = LOAD 'Data.txt' USING PigStorage() AS (line:chararray);
-- split each line into a bag of words, then emit one word per record
token_data = FOREACH data GENERATE TOKENIZE(line) AS words:{(word:chararray)};
token_flat = FOREACH token_data GENERATE FLATTEN(words) AS (word);
-- inner join keeps only the words that are also params
joined = JOIN params BY param_word, token_flat BY word;
word_count = FOREACH (GROUP joined BY params::param_word) GENERATE
    group AS param_word,
    COUNT(joined) AS param_word_count;
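
For readers who want to check the logic outside a cluster, the same tokenise / flatten / join / count pipeline can be sketched in plain Python. This is an illustration, not a translation of Pig semantics: it splits on whitespace only, and it collapses duplicate entries in the param list, whereas a Pig join would emit one match per duplicate. The function name and sample inputs are made up.

```python
from collections import Counter


def count_param_words(params, lines):
    """Count how often each param word occurs as a whitespace token in the lines."""
    wanted = set(params)  # duplicate params collapse here (unlike a Pig JOIN)
    # TOKENIZE + FLATTEN: one word per item
    tokens = (word for line in lines for word in line.split())
    # JOIN + GROUP + COUNT: keep only wanted words and tally them
    return Counter(word for word in tokens if word in wanted)


# Hypothetical inputs for illustration.
params = ["LOAD", "DUMP", "Pig"]
lines = [
    "Keywords LOAD and DUMP are case insensitive",
    "LOAD reads data",
]
counts = count_param_words(params, lines)
print(counts)
```

Params that never occur in the data are simply absent from the result, which matches what the inner join produces.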

Pig Latin using two data sources in one FILTER statement

In my Pig script, I am reading data from more than 5 data sources (Hive tables), where one is the main source data and the rest are dimension tables. I am trying to filter the main data source relation (or alias) on some value in one of the dimension relations.
E.g.
-- main_data is main data source and dept_data is department data
filtered_data1 = FILTER main_data BY deptID == dept_data.departmentID;
filtered_data2 = FOREACH filtered_data1 GENERATE $0, $1, $3, $7;
In my Pig script there are at least 20 places where I need to match some value between multiple data sources and produce a new relation. But I am getting this error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias filtered_data1.
Backend error : Scalar has more than one row in the output. 1st : ( ..... ) 2nd : ( .... )
Details at logfile: /root/pig_1403263965493.log
I also tried the "relation::field" approach, to no avail. Alternatively, I am joining these two relations (data sources) to get the filtered data, but I feel this will slow down execution and dump an unnecessarily large amount of data.
Please guide me on how to use two or more data sources in one FILTER statement, something like in SQL, so that I can avoid JOIN statements and do it all from the FILTER statement itself:
Where A.deptID = B.departmentID And A.sectionID = C.sectionID And A.cityID = D.cityID
If you want to match records from different tables by a single ID, you would pretty much have to use a join, as such:
Where A::deptID = B::departmentID And A::sectionID = C::sectionID And A::cityID = D::cityID
If you just want to keep the records that occur in all other tables, you could probably go for an INTERSECT and then a
FILTER BY someID IN someIDList
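
The error message in the question ("Scalar has more than one row") suggests that `dept_data.departmentID` was treated as a single scalar value, which only works when that relation has exactly one row. What the question really asks for is a semi-join: keep the main rows whose ID appears in another table. A plain-Python sketch of that idea follows; the function name, field names, and sample rows are hypothetical.

```python
def semi_join(main_rows, key, allowed_ids):
    """Keep the rows whose `key` field appears in allowed_ids."""
    allowed = set(allowed_ids)  # set membership makes each check O(1)
    return [row for row in main_rows if row[key] in allowed]


# Hypothetical sample data for illustration.
main_data = [
    {"name": "r1", "deptID": 10, "sectionID": 1},
    {"name": "r2", "deptID": 20, "sectionID": 2},
    {"name": "r3", "deptID": 30, "sectionID": 1},
]
dept_ids = [10, 20]   # IDs present in the department table
section_ids = [1]     # IDs present in the section table

# Chain one semi-join per dimension table.
filtered = semi_join(semi_join(main_data, "deptID", dept_ids), "sectionID", section_ids)
print(filtered)
```

In Pig the set of allowed IDs lives in another relation rather than in memory, which is why the operation is normally expressed as a join there.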

Hadoop, how to normalize multiple columns data?

I have a .txt file like this:
1036177 19459.7356 17380.3761 18084.1440
1045709 19674.2457 17694.8674 18700.0120
1140443 19772.0645 17760.0904 19456.7521
where the first column represents the key and the others are the values.
I would like to normalize (min-max) each column and then sum up the columns.
Can someone give me some advice on how to do that in MapReduce?
From an algorithmic perspective you'll need to:
Mapper
- Parse / tokenize each input line by its delimiter (space?)
- Use a Text object to encapsulate the key field
- Either create a custom value class to encapsulate the other fields or use an ArrayWritable wrapper
- Output this Key / Value from your Mapper
Reducer
- All values will be grouped by the same key, so here you'll just need to process each input value and calculate the min, max and sum for each column
- Finally output your result
You might want to look at using Apache Pig, which should make this task much easier (untested):
grunt> A = LOAD '/path/to/data.txt' USING PigStorage(' ')
           AS (key, fld1:float, fld2:float, fld3:float);
grunt> GRP = GROUP A BY key;
grunt> -- after GROUP, the original fields live in the bag named A, so use A.fldN
grunt> B = FOREACH GRP GENERATE $0, MIN(A.fld1), MAX(A.fld1), SUM(A.fld1),
           MIN(A.fld2), MAX(A.fld2), SUM(A.fld2),
           MIN(A.fld3), MAX(A.fld3), SUM(A.fld3);
grunt> STORE B INTO '/path/to/output' USING PigStorage('\t', '-schema');
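
One caveat on the arithmetic: min-max scaling a column needs that column's minimum and maximum over all rows, not per key. The following plain-Python sketch shows one reading of the question (scale each column to [0, 1], then sum the scaled values per row) on the three sample rows. The function name is made up, and mapping a constant column to 0.0 is an assumption.

```python
def min_max_sum(rows):
    """rows: list of (key, v1, v2, ...). Returns {key: sum of min-max scaled values}."""
    # Transpose the value part of the rows to get one tuple per column.
    columns = list(zip(*(values for _, *values in rows)))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    result = {}
    for key, *values in rows:
        scaled = [
            (v - lo) / (hi - lo) if hi != lo else 0.0  # constant column -> 0.0 (assumption)
            for v, lo, hi in zip(values, lows, highs)
        ]
        result[key] = sum(scaled)
    return result


rows = [
    ("1036177", 19459.7356, 17380.3761, 18084.1440),
    ("1045709", 19674.2457, 17694.8674, 18700.0120),
    ("1140443", 19772.0645, 17760.0904, 19456.7521),
]
totals = min_max_sum(rows)
print(totals)
```

In MapReduce terms this implies two passes: one to find the per-column minimum and maximum across the whole file, and a second to scale and sum each row.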

how to create set of values, after group function in Pig (Hadoop)

Let's say I have a set of values in file.txt
a,b,c
a,b,d
k,l,m
k,l,n
k,l,o
And my code is:
file = LOAD 'file.txt' using PigStorage(',');
events = foreach file generate session_id, user_id, code, type;
gr = group events by (session_id, user_id);
and I get this set of values:
((a,b),{(a,b,c),(a,b,d)})
((k,l),{(k,l,m),(k,l,n),(k,l,o)})
And I'd like to have:
(a,b,(c,d))
(k,l,(m,n,o))
Have you got any idea how to do it?
Regards
Pawel
Note: your question is inconsistent. You use session_id, user_id, code, type in the FOREACH line, but your LOAD with PigStorage does not declare a schema that provides those names. Also, that FOREACH has 4 values, while your sample data only has 3. I'll assume that type doesn't exist in order to answer your question.
After your gr relation, you are left with the group-by key (in this case (session_id, user_id)) in an automatically generated tuple called group.
So, first step: gr2 = FOREACH gr GENERATE FLATTEN(group);
This will give you the tuples (a,b) and (k,l). You need to use FLATTEN because group is a tuple and you are asking for session_id and user_id to be individual columns. FLATTEN does that for you.
Ok, so now modify the gr2 line to also use a projection to tease out the third value:
gr2 = FOREACH gr GENERATE FLATTEN(group), events.code;
events.code creates a bag out of all the code values. events is the name of the bag of grouped tuples (it's named after the original relation).
This should give you:
(a,b,{(c),(d)})
(k,l,{(m),(n),(o)})
It's very important to note that the values are in a bag, not in a tuple as you asked for. Keeping them in a bag is the right idea because a bag can hold a variable number of items, while a tuple has a fixed set of fields.
Additional advice: Understanding how GROUP BY outputs data is something I see a lot of people struggle with when first using Pig. If my answer doesn't make much sense, I'd recommend spending some time to really understand GROUP BY. Understanding it, rather than treating it as magic, will pay off in the long run.
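
If it helps to see the shape of the data outside Pig, the group-then-project step described above can be mimicked in plain Python. This is only an analogy: the function name is made up, and a Python list keeps insertion order while a Pig bag makes no ordering promise.

```python
from collections import defaultdict


def group_codes(rows):
    """Group (session_id, user_id, code) rows by the first two fields
    and collect the codes of each group into a list."""
    grouped = defaultdict(list)
    for session_id, user_id, code in rows:
        grouped[(session_id, user_id)].append(code)
    # "Flatten" the key back into separate columns next to the collected codes.
    return [(s, u, codes) for (s, u), codes in grouped.items()]


rows = [
    ("a", "b", "c"),
    ("a", "b", "d"),
    ("k", "l", "m"),
    ("k", "l", "n"),
    ("k", "l", "o"),
]
result = group_codes(rows)
print(result)
```

The dictionary key plays the role of Pig's group tuple, and the list of codes plays the role of the events.code bag.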

Max/Min for whole sets of records in PIG

I have a set of records that I am loading from a file, and the first thing I need to do is get the max and min of a column.
In SQL I would do this with a subquery like this:
select c.state, c.population,
(select max(c.population) from state_info c) as max_pop,
(select min(c.population) from state_info c) as min_pop
from state_info c
I assume there must be an easy way to do this in PIG as well but I'm having trouble finding it. It has a MAX and MIN function but when I tried doing the following it didn't work:
records=LOAD '/Users/Winter/School/st_incm.txt' AS (state:chararray, population:int);
with_max = FOREACH records GENERATE state, population, MAX(population);
This didn't work. I had better luck adding an extra column with the same value to each row and then grouping them on that column. Then getting the max on that new group. This seems like a convoluted way of getting what I want so I thought I'd ask if anyone knows a simpler way.
Thanks in advance for the help.
As you said, you need to group all the data together, but no extra column is required if you use GROUP ALL.
Pig
records = LOAD 'states.txt' AS (state:chararray, population:int);
records_group = GROUP records ALL;
with_max = FOREACH records_group
    GENERATE
        FLATTEN(records.(state, population)), MAX(records.population);
Input
CA 10
VA 5
WI 2
Output
(CA,10,10)
(VA,5,10)
(WI,2,10)
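
The effect of GROUP ALL followed by FLATTEN can be pictured in plain Python as "compute one global value, then attach it to every row". This sketch uses the sample input above; the function name is made up for illustration.

```python
def with_global_max(records):
    """Append the maximum population over all records to every (state, population) row."""
    top = max(population for _, population in records)  # one pass over the whole set
    return [(state, population, top) for state, population in records]


records = [("CA", 10), ("VA", 5), ("WI", 2)]
rows_with_max = with_global_max(records)
print(rows_with_max)
```

A minimum column could be added the same way by computing `min(...)` alongside `max(...)`.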

Resources