Pig - how to iterate over a bag of maps - Hadoop

Let me explain the problem. I have this line of code:
u = FOREACH persons GENERATE FLATTEN($0#'experiences') as j;
dump u;
which produces this output:
([id#1,date_begin#12 2012,description#blabla,date_end#04 2013],[id#2,date_begin#02 2011,description#blabla2,date_end#04 2013])
([id#1,date_begin#12 2011,description#blabla3,date_end#04 2012],[id#2,date_begin#02 2010,description#blabla4,date_end#04 2011])
Then, when I do this:
p = foreach u generate j#'id', j#'description';
dump p;
I have this output:
(1,blabla)
(1,blabla3)
But that's not what I wanted. I would like to have an output like this:
(1,blabla)
(2,blabla2)
(1,blabla3)
(2,blabla4)
How can I get this output?
Thank you very much.

I'm assuming that the $0 you are FLATTENing in u is a tuple.
The overall problem is that j is only referencing the first map in the tuple. In order to get the output you want, you'll have to convert each tuple into a bag, then FLATTEN it.
If you know that each tuple will have up to two maps, you can do:
-- My B is your u
B = FOREACH A GENERATE (tuple(map[],map[]))$0#'experiences' AS T ;
B2 = FOREACH B GENERATE FLATTEN(TOBAG(T.$0, T.$1)) AS j ;
C = foreach B2 generate j#'id', j#'description' ;
If you don't know how many fields will be in the tuple, then this will be much harder.
NOTE: This works for Pig 0.10.
For tuples with an undefined number of maps, the best answer I can think of is using a UDF to parse the bytearray:
myudf.py
@outputSchema('vals: {(val:map[])}')
def foo(the_input):
    # This converts the indeterminate number of maps into a bag.
    # Decode the raw bytearray into a string and strip the outer parens.
    chars = [chr(i) for i in the_input]
    s = ''.join(chars).strip('()')
    out = []
    for f in s.split('],['):
        f = f.strip('[]')
        out.append(dict((k, v) for k, v in [i.split('#') for i in f.split(',')]))
    return out
myscript.pig
register 'myudf.py' using jython as myudf ;
B = FOREACH A GENERATE FLATTEN($0#'experiences') ;
T1 = FOREACH B GENERATE FLATTEN(myudf.foo($0)) AS M ;
T2 = FOREACH T1 GENERATE M#'id', M#'description' ;
However, this relies on the assumption that '#', ',', and '],[' never appear in any of the keys or values in the map; a value like 'blabla,foo', for example, would be split in two by the f.split(',') call above.
NOTE: This works for Pig 0.11.
It seems that how Pig hands input to Python UDFs changed here: instead of foo receiving a bytearray, the bytearray is automatically converted to the appropriate type. That makes everything much easier:
myudf.py
@outputSchema('vals: {(val:map[])}')
def foo(the_input):
    # This converts the indeterminate number of maps into a bag
    # (equivalent to: return list(the_input)).
    out = []
    for m in the_input:
        out.append(m)
    return out
myscript.pig
register 'myudf.py' using jython as myudf ;
-- This time you should pass in the entire tuple.
B = FOREACH A GENERATE $0#'experiences' ;
T1 = FOREACH B GENERATE FLATTEN(myudf.foo($0)) AS M ;
T2 = FOREACH T1 GENERATE M#'id', M#'description' ;

Related

How do I parallelize for-loop in octave using pararrayfun (or any other function will also do)?

Well, I'm new to Octave and I wanted to know how to implement parallel execution of a for loop in Octave.
I'm looking for a parallel implementation of the code below (it's not the exact code that I'm trying to execute, but something similar to this):
%read a csv file
master_sheet = csv2cell('master_sheet.csv');
delta = 0.001;
nprocs = nproc();
%extract some values from the csv file and store it in the variables
a = master_sheet{34,2} ;
b = master_sheet{38,2} ;
c = master_sheet{39,2} ;
for i = 0:1000
    %% create variants of a, b and c by adding a delta value
    a_adj = a + i*delta;
    b_adj = b + i*delta;
    c_adj = c + i*delta;
    % club all the above variables into an array variable
    array_abc = [a_adj, b_adj, c_adj];
    % send this array as an argument/parameter to a function;
    % processingData() would essentially perform a series of calculations
    % and write the results to a file
    processingData(array_abc);
endfor
Currently, I'm using the parallel pkg (pararrayfun) to implement this, but if there is any other way (or package) that could achieve the parallelization of a for loop in Octave, then I'm open to exploring that as well.
Thank you!

PIG Script to split large txt file into parts based on specified word

I am trying to build a Pig script that takes in a textbook file, divides it into chapters, and then compares the words in each chapter, returning only the words that show up in all chapters along with their counts. The chapters are delimited fairly easily by CHAPTER - X.
Here's what I have so far:
lines = LOAD '../../Alice.txt' AS (line:chararray);
lineswithoutspecchars = FOREACH lines GENERATE REPLACE(line,'([^a-zA-Z\\s]+)','') as line;
words = FOREACH lineswithoutspecchars GENERATE FLATTEN(TOKENIZE(line)) as word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;
Sorry if this question is way too simple compared to what I normally ask on Stack Overflow; I googled around for it, but perhaps I am not using the correct keywords. I am brand new to Pig and trying to learn it for a new job assignment.
Thanks in advance!
A bit lengthy, but you will get the result. You could cut down unnecessary relations based on your file, though. I have provided appropriate comments in the script.
Input File:
Pig does not know whether integer values in baseball are stored as ASCII strings, Java
serialized values, binary-coded decimal, or some other format. So it asks the load func-
tion, because it is that function’s responsibility to cast bytearrays to other types. In
general this works nicely, but it does lead to a few corner cases where Pig does not know
how to cast a bytearray. In particular, if a UDF returns a bytearray, Pig will not know
how to perform casts on it because that bytearray is not generated by a load function.
CHAPTER - X
In a strongly typed computer language (e.g., Java), the user must declare up front the
type for all variables. In weakly typed languages (e.g., Perl), variables can take on values
of different type and adapt as the occasion demands.
CHAPTER - X
In this example, remember we are pretending that the values for base_on_balls and
ibbs turn out to be represented as integers internally (that is, the load function con-
structed them as integers). If Pig were weakly typed, the output of unintended would
be records with one field typed as an integer. As it is, Pig will output records with one
field typed as a double. Pig will make a guess and then do its best to massage the data
into the types it guessed.
Pig Script:
A = LOAD 'file' as (line:chararray);
B = FOREACH A GENERATE REPLACE(line,'([^a-zA-Z\\s]+)','') as line;
-- We need to split on CHAPTER X, but the load above gives us one tuple per line, so
-- group everything and convert that bag to a string, which yields a single tuple with _ as the delimiter.
C = GROUP B ALL;
D = FOREACH C GENERATE BagToString(B) as (line:chararray);
-- Now we don't have any commas, so convert our delimiter CHAPTER X to a comma. We do this because
-- passing it to TOKENIZE will then split it into separate records that can be RANKed.
E = FOREACH D GENERATE REPLACE(line,'_CHAPTER X_',',') AS (line:chararray);
F = FOREACH E GENERATE REPLACE(line,'_',' ') AS (line:chararray); -- remove the delimiter created by BagToString
-- Create separate columns.
G = FOREACH F GENERATE FLATTEN(TOKENIZE(line,',')) AS (line:chararray);
-- We need to rank each chapter, which makes counting each word per chapter easy.
H = RANK G;
J = FOREACH H GENERATE rank_G,FLATTEN(TOKENIZE(line)) as (line:chararray);
J1 = GROUP J BY (rank_G, line);
J2 = FOREACH J1 GENERATE COUNT(J) AS (cnt:long),FLATTEN(group.line) as (word:chararray),FLATTEN(group.rank_G) as (rnk:long);
-- J2 now has no duplicate word within each chapter.
-- So if we group it by word and filter for a count greater than 2, we are sure the word is present in all three chapters.
J3 = GROUP J2 BY word;
J4 = FOREACH J3 GENERATE SUM(J2.cnt) AS (sumval:long),COUNT(J2) as (cnt:long),FLATTEN(group) as (word:chararray);
J5 = FILTER J4 BY cnt > 2;
J6 = FOREACH J5 GENERATE word,sumval;
dump J6;
-- Result, in the order word, count across chapters.
Output:
(a,8)
(In,5)
(as,6)
(the,9)
(values,4)
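One caveat: the J5 filter hardcodes the number of chapters (cnt > 2 assumes exactly three). If the chapter count can vary, here is a rough, untested sketch of deriving it from the data instead via Pig's scalar projection (chapters and J4X are hypothetical aliases):
chapters = FOREACH (GROUP G ALL) GENERATE COUNT(G) AS n_chapters; -- G has one record per chapter
J4X = FOREACH J4 GENERATE word, sumval, cnt, (long) chapters.n_chapters AS n_chapters;
J5 = FILTER J4X BY cnt >= n_chapters;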

Hadoop Pig Script Help Needed with labeling words in a sentence

I am working on a solution to the following problem:
Given an arbitrary text document written in English, write a program that will generate a concordance, i.e. an alphabetical list of all word occurrences, labeled with word frequencies.
Bonus: label each word with the sentence numbers in which each occurrence appeared.
Now, I have the first part of this exercise completed. I am stuck on the bonus part.
Can someone please help me out? I am using Hadoop Pig on Cloudera Live. Here is what the sample output is supposed to look like, including the bonus.
a. a {2:1,1}
b. all {1:1}
c. alphabetical {1:1}
d. an {2:1,1}
e. appeared {1:2}
The Wordcount.pig script does the word count, and the other one puts it in alphabetical order.
Wordcount.pig
--Load data
lines = LOAD '/user/cloudera/gettysburg.txt' AS (line:chararray);
-- Create list
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;
-- Count occurrences
grouped = GROUP words BY word;
--Generate wordcount
wordcount = FOREACH grouped GENERATE group, COUNT(words);
--Save output
STORE wordcount into '/user/cloudera/output';
WORDCOUNTALPHABETIZE.PIG
--Load unsorted data file
unsortedData = LOAD '/user/cloudera/output/UnsortedList.txt' AS (words:chararray, frequency:int);
DUMP unsortedData;
--Put data in alphabetical order
sortedData = ORDER unsortedData BY words ASC, frequency;
DUMP sortedData;
--Save output
STORE sortedData into '/user/cloudera/output2';
Thanks,
Anne
This can be achieved with the DataFu UDF Enumerate, which generates a sequence number for each tuple in a bag. Can you try this?
register datafu-1.1.0.jar;
define Enumerate datafu.pig.bags.Enumerate('1');
A = LOAD '/home/hduser/a22.dat' as (line:chararray);
Z = FOREACH A GENERATE FLATTEN(TOKENIZE(line,'.')) as (word:chararray); -- split the text into sentences
Z1 = RANK Z; -- generate line_number with rank
Z2 = FOREACH Z1 GENERATE rank_Z,FLATTEN(TOKENIZE(word)) as (word:chararray); -- line_number, word
Z3 = RANK Z2; -- rank used to maintain the word order
Z4 = GROUP Z3 by rank_Z; -- group by line_number to generate a word_number for each line
Z5 = foreach Z4 {
    sorted = order Z3 by rank_Z2; -- ordered to maintain word order
    generate group, sorted;
};
Z6 = foreach Z5 generate FLATTEN(Enumerate(sorted)) as (l:int,word_no:int,word:chararray,line_no:int); -- generate word_number; note the Enumerate index (the word's position in its line) lands in the field aliased line_no, while word_no actually carries the line number
Z7 = GROUP Z6 BY word;
Z8 = FOREACH Z7 GENERATE group,Z6.line_no,Z6.word_no,COUNT(Z6); -- output in the order word, word_number, line_number, count_of_each_word
For the word nation, below is the output:
(nation,{(16),(13),(25),(16)},{(2),(2),(4),(1)},4)
in the order (word, {(word_number1),(word_number2),(word_number3),(word_number4)}, {(line_number1),(line_number2),(line_number3),(line_number4)}, count_of_each_word)
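If you also need the exact word {count:line,line} layout from the question's sample output, here is a rough, untested sketch of one way to collapse the bags, reusing Z7 from above (Z9 is a hypothetical alias; BagToString is the same Pig builtin used in the chapter-splitting answer above):
Z9 = FOREACH Z7 GENERATE group AS word, COUNT(Z6) AS cnt, BagToString(Z6.word_no, ',') AS line_nos; -- word_no carries the line numbers, as noted above
From there, the word {cnt:line_nos} string can be assembled with CONCAT or formatted downstream.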

Hadoop Pig UDF invocation issue

The following code works quite well, but when I already have two existing bags (say, with aliases S1 and S2 representing two sets), how do I call the setDifference UDF to generate their set difference? I think that if I manually construct an additional bag from my already existing input bags (S1 and S2), it will add overhead.
register datafu-1.2.0.jar;
define setDifference datafu.pig.sets.SetDifference();
-- ({(3),(4),(1),(2),(7),(5),(6)} \t {(1),(3),(5),(12)})
A = load 'input.txt' AS (B1:bag{T:tuple(val:int)},B2:bag{T:tuple(val:int)});
F1 = foreach A generate B1;
F2 = foreach A generate B2;
differenced = FOREACH A {
    -- input bags must be sorted
    sorted_b1 = ORDER B1 by val;
    sorted_b2 = ORDER B2 by val;
    GENERATE setDifference(sorted_b1,sorted_b2);
};
-- produces: ({(2),(4),(6),(7)})
DUMP differenced;
Update:
The question is: suppose I already have two bags, how do I call the setDifference UDF to get the set difference? Do I need to build another super-bag that contains the two separate bags? Thanks.
Thanks in advance,
Lin
I don't see any overhead issue with the UDF invocation.
Ref: http://datafu.incubator.apache.org/docs/datafu/guide/set-operations.html has an example of using the SetDifference method.
As per the API (http://datafu.incubator.apache.org/docs/datafu/1.2.0/datafu/pig/sets/SetDifference.html), the SetDifference method takes bags as input and emits the difference between them.
N.B. Do note that the input bags have to be sorted.
In the example snippet shared, I don't see the need for the snippet below:
F1 = foreach A generate B1;
F2 = foreach A generate B2;
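To answer the update directly: no super-bag should be needed. If S1 and S2 are bag-valued fields of the same relation, they can be sorted and passed straight to the UDF inside a nested FOREACH, exactly as with B1 and B2 above. A minimal, untested sketch, assuming S1 and S2 live in a relation R whose inner tuples are (val:int) (R and diffed2 are hypothetical aliases):
diffed2 = FOREACH R {
    -- the input bags must be sorted, per the N.B. above
    sorted_s1 = ORDER S1 BY val;
    sorted_s2 = ORDER S2 BY val;
    GENERATE setDifference(sorted_s1, sorted_s2);
};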

Difficulty Accessing Members of Tuple in Apache Pig

I have a variable titled F.
Describe F returns:
F: {group: bytearray,indexkey: {(indexkey: chararray)}}
Dump F returns:
(321,{(CHOW),(DREW)})
(5011,{(CHOW),(DREW)})
(5825,{(TANNER),(SPITZENBERGER)})
(16631,{(CHOW),(DREW)})
(34299,{(CHOW),(DREW)})
(35044,{(TANNER),(SPITZENBERGER)})
(65623,{(CHOW),(DREW)})
(74597,{(SPITZENBERGER),(TANNER)})
(83499,{(SPITZENBERGER),(TANNER)})
(90257,{(SPITZENBERGER),(TANNER)})
What I need is to produce an output that looks like this (only 1st row as an example):
(321,DREW,{(CHOW)})
I've tried using dereferencing to pull out the first element, like this:
G = FOREACH F generate indexkey.$0;
But this still returns the whole bag.
Can anyone suggest a method for doing this? I was under the impression that the dereference operator would allow me to do this.
Thanks in advance!
Daniel
You can't index into bags like that, because bags don't have any notion of ordering; selecting the first item in a bag should be treated as picking a random one.
Either way, if you want only one item instead of all of them, you can use a nested FOREACH to pull a LIMIT of 1:
first = FOREACH F {
    lim = LIMIT indexkey 1;
    GENERATE group, lim;
};
(Disclaimer: I can't test this code right now; if it doesn't work, let me know. Hopefully you can get the gist.)
You can take this a bit further and FLATTEN it to remove the one-item bag entirely, but be careful: if the bag is empty, I think you throw away the entire record in this case.
first = FOREACH F {
    lim = LIMIT indexkey 1;
    GENERATE group, FLATTEN(lim);
};
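Here the bags come from a GROUP, so they should never be empty, but if you reuse this pattern on bags that can be empty, a small untested sketch to make the drop explicit up front (IsEmpty is a Pig builtin; nonEmpty is a hypothetical alias):
nonEmpty = FILTER F BY NOT IsEmpty(indexkey);
first = FOREACH nonEmpty {
    lim = LIMIT indexkey 1;
    GENERATE group, FLATTEN(lim);
};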
