Nested Flatten in Pig - hadoop

I have a problem using Pig:
Suppose I have an alias A containing ("key1","just_for_example"). I want to get ("key1","just"),("key1","for"),("key1","example"). My script looks like:
B = foreach A generate $0, FLATTEN(TOBAG(FLATTEN(STRSPLIT($1,'_'))));
But it keeps throwing errors like "Error 1070:Couldn't resolve Flatten from builtin". Once I split this statement into two to eliminate the nested FLATTENs, it works. Why is that? Is it related to how Pig compiles my script? Thanks.

FLATTEN is not a UDF; it is an operator that changes the structure of bags and tuples. Its description is in the Pig Latin documentation.
You can get the desired output in a clean one-liner though. TOKENIZE works similarly to STRSPLIT, but produces a bag instead of a tuple. Therefore, you can just do:
B = FOREACH A GENERATE $0, FLATTEN(TOKENIZE($1, '_')) ;
Resulting schema and output:
B: {key: chararray,bag_of_tokenTuples::token: chararray}
(key1,just)
(key1,for)
(key1,example)

Flatten is not a function - you can't nest it.
However, in your case, I don't think you have to use it twice. This should suffice to get your desired output:
B = foreach A generate $0, FLATTEN(STRSPLIT($1,'_'));
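Whether you get one row per token depends on whether FLATTEN is applied to a tuple or to a bag. The following is a minimal sketch in plain Python (not Pig; the helper names `flatten_tuple` and `flatten_bag` are hypothetical) that models this behaviour: flattening a tuple splices its fields into the same row, while flattening a bag produces one row per tuple. Under this model, a STRSPLIT-style tuple gives a single wide row, and a TOKENIZE-style bag gives the three rows asked for.

```python
# Plain-Python model of FLATTEN semantics (an illustration, not Pig code).
# A Pig tuple is modelled as a Python tuple, a bag as a list of tuples.

def flatten_tuple(key, tup):
    """Flattening a tuple splices its fields into the same output row."""
    return [(key, *tup)]

def flatten_bag(key, bag):
    """Flattening a bag yields one output row per tuple in the bag."""
    return [(key, *t) for t in bag]

tokens = "just_for_example".split("_")

# STRSPLIT-like result: a single tuple holding all tokens.
rows_from_tuple = flatten_tuple("key1", tuple(tokens))

# TOKENIZE-like result: a bag of single-field tuples.
rows_from_bag = flatten_bag("key1", [(t,) for t in tokens])

print(rows_from_tuple)  # [('key1', 'just', 'for', 'example')]
print(rows_from_bag)    # [('key1', 'just'), ('key1', 'for'), ('key1', 'example')]
```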

I tried to replicate your question; please find my solution below. I used flatten_exe.txt as input, which contains the data ("Just_For_example").
grunt> flt = LOAD '/home/training/pig/Join/flatten_exe.txt' USING PigStorage();
grunt> b = FOREACH flt GENERATE FLATTEN(TOKENIZE($0, '_'));
grunt> DUMP b;

Related

Pig referencing

I am learning Hadoop Pig and I always get stuck at referencing elements. Please see the example below.
groupwordcount: {group: chararray,words: {(bag_of_tokenTuples_from_line::token: chararray)}}
Can somebody please explain how to reference the elements when we have nested tuples and bags?
Any links for better understanding nested referencing would be a great help.
Let's do a simple demonstration to understand this problem.
Say a file 'a.txt' is stored at '/tmp/a.txt' in HDFS:
A = LOAD '/tmp/a.txt' using PigStorage(',') AS (name:chararray,term:chararray,gpa:float);
Dump A;
(John,fl,3.9)
(John,fl,3.7)
(John,sp,4.0)
(John,sm,3.8)
(Mary,fl,3.8)
(Mary,fl,3.9)
(Mary,sp,4.0)
(Mary,sm,4.0)
Now let's group alias A by some key, say name and term:
B = GROUP A BY (name,term);
dump B;
((John,fl),{(John,fl,3.7),(John,fl,3.9)})
((John,sm),{(John,sm,3.8)})
((John,sp),{(John,sp,4.0)})
((Mary,fl),{(Mary,fl,3.9),(Mary,fl,3.8)})
((Mary,sm),{(Mary,sm,4.0)})
((Mary,sp),{(Mary,sp,4.0)})
describe B;
B: {group: (name: chararray,term: chararray),A: {(name: chararray,term: chararray,gpa: float)}}
Now this matches the situation you asked about. Let me demonstrate how to access the elements of the group tuple, the elements of the tuples in bag A, or both:
C = foreach B generate group.name,group.term,A.name,A.term,A.gpa;
dump C;
(John,fl,{(John),(John)},{(fl),(fl)},{(3.7),(3.9)})
(John,sm,{(John)},{(sm)},{(3.8)})
(John,sp,{(John)},{(sp)},{(4.0)})
(Mary,fl,{(Mary),(Mary)},{(fl),(fl)},{(3.9),(3.8)})
(Mary,sm,{(Mary)},{(sm)},{(4.0)})
(Mary,sp,{(Mary)},{(sp)},{(4.0)})
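So this is how all the elements can be accessed: group.field reads a field of the group tuple, and A.field projects that field out of every tuple in the bag A, which is why the result is itself a bag.
Hope this helps.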
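The shape of C above can be reproduced with a small plain-Python model. This is only a sketch of the semantics, not Pig code: `group_by` and `project` are hypothetical helpers, a bag is modelled as a list of tuples, and the order of tuples inside a bag is not guaranteed to match Pig's.

```python
from collections import defaultdict

# The rows of alias A from the example above.
rows = [
    ("John", "fl", 3.9), ("John", "fl", 3.7), ("John", "sp", 4.0), ("John", "sm", 3.8),
    ("Mary", "fl", 3.8), ("Mary", "fl", 3.9), ("Mary", "sp", 4.0), ("Mary", "sm", 4.0),
]

def group_by(rows, key_idx):
    """Model of GROUP ... BY: returns (group_tuple, bag_of_rows) pairs."""
    groups = defaultdict(list)
    for r in rows:
        groups[tuple(r[i] for i in key_idx)].append(r)
    return sorted(groups.items())

def project(bag, idx):
    """Model of A.field: keeps one field per tuple, so the result is still a bag."""
    return [(t[idx],) for t in bag]

B = group_by(rows, (0, 1))

# Model of: C = foreach B generate group.name, group.term, A.name, A.gpa;
C = [(group[0], group[1], project(bag, 0), project(bag, 2)) for group, bag in B]

for row in C:
    print(row)
```

Each group field comes out as a scalar, while each projection of A stays a bag with one single-field tuple per grouped row.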

Hadoop Pig UDF invocation issue

The following code works quite well, but when I already have two existing bags (with aliases, say S1 and S2, representing two sets), how do I call the UDF setDifference to generate the set difference? I think that if I manually construct an additional bag from my existing input bags (S1 and S2), it will add overhead?
register datafu-1.2.0.jar;
define setDifference datafu.pig.sets.SetDifference();
-- ({(3),(4),(1),(2),(7),(5),(6)} \t {(1),(3),(5),(12)})
A = load 'input.txt' AS (B1:bag{T:tuple(val:int)},B2:bag{T:tuple(val:int)});
F1 = foreach A generate B1;
F2 = foreach A generate B2;
differenced = FOREACH A {
    -- input bags must be sorted
    sorted_b1 = ORDER B1 by val;
    sorted_b2 = ORDER B2 by val;
    GENERATE setDifference(sorted_b1,sorted_b2);
}
-- produces: ({(2),(4),(6),(7)})
DUMP differenced;
Update:
The question is: suppose I already have two bags, how do I call the UDF setDifference to get the set difference? Do I need to build another super bag that contains the two separate bags? Thanks.
thanks in advance,
Lin
I don't see any overhead issue with the UDF invocation.
Ref: http://datafu.incubator.apache.org/docs/datafu/guide/set-operations.html, which has an example of using SetDifference.
As per the API (http://datafu.incubator.apache.org/docs/datafu/1.2.0/datafu/pig/sets/SetDifference.html), SetDifference takes bags as input and emits the difference between them.
N.B. The input bags have to be sorted.
In the snippet shared, I don't see the need for the following lines:
F1 = foreach A generate B1;
F2 = foreach A generate B2;
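To see why sorted input matters, here is a minimal plain-Python sketch of a single-pass difference over two sorted sequences. It is an illustration only and is not taken from DataFu's source; `sorted_difference` is a hypothetical name, and the values are the ones from the comment in the question's script.

```python
def sorted_difference(first, second):
    """Return items of `first` that are absent from `second`.

    Both inputs must be sorted ascending, which lets one pass suffice.
    """
    result = []
    j = 0
    for item in first:
        # Advance through `second` until it catches up with `item`.
        while j < len(second) and second[j] < item:
            j += 1
        # Keep `item` only if `second` has no equal element at this position.
        if j == len(second) or second[j] != item:
            result.append(item)
    return result

b1 = sorted([3, 4, 1, 2, 7, 5, 6])
b2 = sorted([1, 3, 5, 12])
diff = sorted_difference(b1, b2)
print(diff)  # [2, 4, 6, 7]
```

If either input were unsorted, the pointer into `second` could skip past a matching element, so the result would no longer be reliable.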

Reverse the group data as a different record using Pig

Split the grouped record into separate records, e.g.:
Input : (A,(3,2,3))
Output as 3 separate lines:
A,3
A,2
A,3
Can anyone let me know how to do this, please?
The problem is that once you convert the ArrayList output to a tuple, it is difficult to achieve what you want, so I recommend the following approach, which makes it easy to get the output.
In your UDF code, instead of creating an ArrayList, append the output to a comma-separated string and return it to the Pig script.
The final output from your UDF should be a string such as "3,2,3".
Then use the code below to get the result:
C = FOREACH B GENERATE $0, NewRollingCount(BagToString($1)) AS rollingCnt;
D = FOREACH C GENERATE $0, FLATTEN(TOKENIZE(rollingCnt));
DUMP D;

Apache Pig: Convert bag of tuples to a single tuple

I'm trying to convert a bag of tuples into a single tuple:
grunt> describe B;
B: {Comment: {tuple_of_tokens: (token: chararray)}}
grunt> dump B;
({(10),(123),(1234)})
I would like to get (10,123,1234) from B. I've tried using FLATTEN, but this gives a new line for each tuple in the bag, and that is not what I want.
Is there any way to do this conversion without going to UDF ?
Thanks in advance !
The BagToTuple() function is already available in Pig (it lives in the org.apache.pig.builtin package, see the reference below). Just download pig-0.11.0.jar and put it on your classpath; you do not need to write any UDF code for this.
Download jar from this link:
http://www.java2s.com/Code/Jar/p/Downloadpig0110jar.htm
Reference:
https://pig.apache.org/docs/r0.12.0/api/org/apache/pig/builtin/BagToTuple.html
Example:
input.txt
{(10),(123),(1234)}
{(4),(5)}
Pigscript:
A = LOAD 'input.txt' USING PigStorage() AS (b:{t:(f1)});
B = FOREACH A GENERATE FLATTEN(BagToTuple(b));
DUMP B;
Output:
(10,123,1234)
(4,5)
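As a rough model of what happens to the data, the conversion can be sketched in plain Python. This is not Pig code and not the library's implementation; `bag_to_tuple` is a hypothetical helper, with a bag modelled as a list of tuples.

```python
def bag_to_tuple(bag):
    """Concatenate the fields of every tuple in the bag into one tuple."""
    return tuple(field for t in bag for field in t)

first = bag_to_tuple([(10,), (123,), (1234,)])
second = bag_to_tuple([(4,), (5,)])

print(first)   # (10, 123, 1234)
print(second)  # (4, 5)
```

The result is one row per input bag, unlike flattening the bag directly, which would give one row per inner tuple.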

Apache Pig not parsing a tuple fully

I have a file called data that looks like this: (note there are tabs after the 'personA')
personA (1, 2, 3)
personB (2, 1, 34)
And I have an Apache pig script like this:
A = LOAD 'data' AS (name: chararray, nodes: tuple(a:int, b:int, c:int));
C = foreach A generate nodes.$0;
dump C;
The output of which makes sense:
(1)
(2)
However if I change the schema of the script to be like this:
A = LOAD 'data' AS (name: chararray, nodes: tuple());
C = foreach A generate nodes.$0;
dump C;
Then the output I get is this:
(1, 2, 3)
(2, 1, 34)
It looks like the first (and only) element in this tuple is a bytearray. i.e. it's not parsing the input text 1, 2, 3 into a tuple.
In future my input will have an unknown & variable number of elements in the nodes item, so I can't just write out a:int, ….
Is there any way to get Pig to parse the input as a tuple without having to write out the full schema?
Pig does not accept what you are passing in as valid. The default loading scheme PigStorage only accepts delimited files (by default tab delimited). It is not smart enough to parse the tuple construct with the parenthesis and commas you have in the text. Your options are:
Reformat your file to be tab delimited: personA 1 2 3
Read the file in line by line with TextLoader, then write some sort of UDF that parses the line and returns the data in the form you want.
Write your own custom loader.
This is no longer a limitation. Pig parses tuples in the input file, treating the comma as the field separator. I tried this in Apache Pig version 0.15.0.
A = LOAD 'data' AS (name: chararray, nodes: tuple());
C = foreach A generate nodes.$0;
dump C;
Output I get is:
(1)
(2)
Here is another way of tackling this issue, although I know the answers above are more efficient.
data = LOAD 'data' USING PigStorage() AS (name:chararray, field2:chararray);
data = FOREACH data GENERATE name, REPLACE(REPLACE(field2, '\\(',''),'\\)','') AS field2;
data = FOREACH data GENERATE name, STRSPLIT(field2, '\\,') AS fieldTuple;
data = FOREACH data GENERATE name, fieldTuple.$0,fieldTuple.$1, fieldTuple.$2 ;
Load field2 as chararray
Remove parentheses
Split field2 by comma (it gives you a tuple with 3 fields in it)
Get values by index
I know it is hacky. Just wanted to provide another way of doing this
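The same four steps can be modelled in plain Python to check the logic on a sample line. This is a sketch, not Pig code: `parse_nodes` is a hypothetical helper, the input is assumed to be tab-delimited as in the question, and trimming the spaces after the commas is an extra step that the Pig script above does not perform.

```python
def parse_nodes(line):
    """Split a 'name<TAB>(a, b, c)' line into the name and a tuple of string fields."""
    # Step 1: load the second field as a plain string.
    name, field2 = line.split("\t", 1)
    # Step 2: remove the parentheses.
    stripped = field2.replace("(", "").replace(")", "")
    # Step 3: split by comma (whitespace trimming is an addition to the Pig version).
    fields = tuple(part.strip() for part in stripped.split(","))
    return name, fields

name, nodes = parse_nodes("personA\t(1, 2, 3)")

# Step 4: get values by index.
print(name, nodes[0], nodes[1], nodes[2])  # personA 1 2 3
```

Because the split does not fix the number of fields, the same function handles a line with any number of elements.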
