Apache Pig: Convert bag of tuples to single tuple - hadoop

I'm trying to convert a bag of tuples into a single tuple:
grunt> describe B;
B: {Comment: {tuple_of_tokens: (token: chararray)}}
grunt> dump B;
({(10),(123),(1234)})
I would like to get (10,123,1234) from B. I've tried using FLATTEN, but this gives a new line for each tuple in the bag, and that is not what I want.
Is there any way to do this conversion without resorting to a UDF?
Thanks in advance!

The BagToTuple() function is already available as a Pig built-in (the reference below documents it under org.apache.pig.builtin), so you do not need to write any UDF code. It ships in pig-0.11.0.jar; just download that jar and add it to your classpath.
Download jar from this link:
http://www.java2s.com/Code/Jar/p/Downloadpig0110jar.htm
Reference:
https://pig.apache.org/docs/r0.12.0/api/org/apache/pig/builtin/BagToTuple.html
Example:
input.txt
{(10),(123),(1234)}
{(4),(5)}
Pigscript:
A = LOAD 'input.txt' USING PigStorage() AS (b:{t:(f1)});
B = FOREACH A GENERATE FLATTEN(BagToTuple(b));
DUMP B;
Output:
(10,123,1234)
(4,5)
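For readers who want to see the transformation outside Pig, here is a minimal Python sketch of what the script above does to each row. It only models the behavior shown in this answer; the function name bag_to_tuple is made up for illustration and is not part of Pig.

```python
def bag_to_tuple(bag):
    """Concatenate the fields of every tuple in the bag into one flat tuple."""
    return tuple(field for tup in bag for field in tup)


# The two bags from input.txt, written as lists of one-field tuples.
rows = [
    [(10,), (123,), (1234,)],
    [(4,), (5,)],
]

for bag in rows:
    print(bag_to_tuple(bag))
```

Each bag collapses into a single tuple, which is the same shape as the rows in the Pig output above.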

Related

Pig referencing

I am learning Hadoop Pig and I always get stuck when referencing elements. Please see the example below.
groupwordcount: {group: chararray,words: {(bag_of_tokenTuples_from_line::token: chararray)}}
Can somebody please explain how to reference the elements when we have nested tuples and bags?
Any links for better understanding nested referencing would be a great help.
Let's walk through a simple demonstration to understand this problem.
Say a file 'a.txt' is stored at '/tmp/a.txt' in HDFS:
A = LOAD '/tmp/a.txt' using PigStorage(',') AS (name:chararray,term:chararray,gpa:float);
Dump A;
(John,fl,3.9)
(John,fl,3.7)
(John,sp,4.0)
(John,sm,3.8)
(Mary,fl,3.8)
(Mary,fl,3.9)
(Mary,sp,4.0)
(Mary,sm,4.0)
Now let's group alias A by some key, say name and term:
B = GROUP A BY (name,term);
dump B;
((John,fl),{(John,fl,3.7),(John,fl,3.9)})
((John,sm),{(John,sm,3.8)})
((John,sp),{(John,sp,4.0)})
((Mary,fl),{(Mary,fl,3.9),(Mary,fl,3.8)})
((Mary,sm),{(Mary,sm,4.0)})
((Mary,sp),{(Mary,sp,4.0)})
describe B;
B: {group: (name: chararray,term: chararray),A: {(name: chararray,term: chararray,gpa: float)}}
Now this has the same shape as the schema you asked about. Let me demonstrate how to access the elements of the group tuple, the elements of the tuples in bag A, or both:
C = foreach B generate group.name,group.term,A.name,A.term,A.gpa;
dump C;
(John,fl,{(John),(John)},{(fl),(fl)},{(3.7),(3.9)})
(John,sm,{(John)},{(sm)},{(3.8)})
(John,sp,{(John)},{(sp)},{(4.0)})
(Mary,fl,{(Mary),(Mary)},{(fl),(fl)},{(3.9),(3.8)})
(Mary,sm,{(Mary)},{(sm)},{(4.0)})
(Mary,sp,{(Mary)},{(sp)},{(4.0)})
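This is how we access all the elements: group.name reads a field of the group tuple, while A.gpa projects that field out of every tuple in bag A and therefore yields a bag.
Hope this helps.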
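The difference between referencing a field of the group tuple and a field inside the bag can be modelled in a few lines of Python. This is only a sketch of the semantics shown in the dumps above, not Pig code; the helper names group_by and project are invented for illustration, and the order of tuples inside a bag is not guaranteed in Pig.

```python
from collections import defaultdict

# Rows of alias A as (name, term, gpa), copied from the dump above.
A = [
    ("John", "fl", 3.9), ("John", "fl", 3.7), ("John", "sp", 4.0), ("John", "sm", 3.8),
    ("Mary", "fl", 3.8), ("Mary", "fl", 3.9), ("Mary", "sp", 4.0), ("Mary", "sm", 4.0),
]


def group_by(rows, key_indexes):
    """Model of GROUP ... BY: map each key tuple to the bag of full rows."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[i] for i in key_indexes)
        groups[key].append(row)
    return dict(groups)


def project(bag, index):
    """Model of A.field: keep one field of every tuple, so the result is still a bag."""
    return [(row[index],) for row in bag]


B = group_by(A, (0, 1))
for group, bag in B.items():
    name, term = group  # group.name and group.term are plain values
    print(name, term, project(bag, 2))  # A.gpa is a bag of one-field tuples
```

Fields of the group key come back as scalars, while anything projected from the bag stays wrapped in a bag, which is why the third to fifth columns of C are bags.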

tokenize fields using pig script for records having no delimiter

I have fields C1C2C3C4 (no delimiter present) in a raw file. I have to generate output that looks like C1,C2,C3,C4 using a Pig script.
Given: size of C1 = C2 = C3 = C4 = 4 bytes.
This should be straightforward with these steps:
Load the data as is
Generate four new columns, using the SUBSTRING function
For example, you should be able to extract C2 as follows (SUBSTRING takes a 0-based start index and an exclusive stop index):
SUBSTRING(inputstring, 4, 8)
Extending Dennis's Answer.
Assuming the field is stored as chararray
A = LOAD 'data.txt' as (f1:chararray);
B = FOREACH A GENERATE
    SUBSTRING(f1,0,4) as A1,
    SUBSTRING(f1,4,8) as A2,
    SUBSTRING(f1,8,12) as A3,
    SUBSTRING(f1,12,16) as A4;
DUMP B;
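The offset arithmetic is easy to get wrong, so here is a small Python sketch of the same fixed-width split, following the question's stated field width of 4 bytes. Python slices use the same convention as Pig's SUBSTRING (0-based start, exclusive stop). The function name and the sample record are assumptions made up for illustration.

```python
def split_fixed_width(line, width=4, count=4):
    """Cut a delimiter-free record into `count` fields of `width` characters each."""
    return [line[i * width:(i + 1) * width] for i in range(count)]


# Hypothetical 16-character record: four fields of four characters.
record = "AAAABBBBCCCCDDDD"
print(",".join(split_fixed_width(record)))
```

Field k (counting from 0) starts at k * width and stops just before (k + 1) * width, so the second field covers offsets 4 to 8.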

Apache PIG - How to cut digits after decimal point

Is there any possibility to cut a certain area after the decimal point of a float or double number?
For example: the result would be 2.67894 => I want to have 2.6 as result (and not 2.7 when rounded).
Try this, where val holds your values, e.g. 2.666, 3.666, 4.666666, 5.3456334:
b = foreach a GENERATE (FLOOR(val * 10) / 10);
dump b;
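The same multiply, floor, divide arithmetic can be checked in Python. This is only a sketch of the calculation, not Pig code, and the function name is invented for illustration. One caveat: floor rounds toward negative infinity, so negative values move away from zero.

```python
import math


def truncate_one_decimal(val):
    """Shift one digit left, drop the fraction with floor, shift back."""
    return math.floor(val * 10) / 10


print(truncate_one_decimal(2.67894))   # 26.7894 -> 26 -> 2.6
print(truncate_one_decimal(-2.67894))  # -26.7894 -> -27 -> -2.7
```

For a different number of decimals, replace 10 with the matching power of ten (100 for two decimals, and so on).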
Write a UDF (User Defined Function) for this.
A very simple Python UDF (numformat.py):
@outputSchema('value:double')
def format(data):
    # Truncate to one decimal place; round(data, 1) would round 2.67894 up to 2.7.
    return int(data * 10) / 10.0
(Of course, you can parameterize the UDF to use a different precision.)
Then register and use it in your Pig code. Example:
REGISTER numformat.py USING jython as numformat;
A = LOAD 'so/testdata.csv' USING PigStorage(',') AS (data:double);
B = FOREACH A GENERATE numformat.format(data);
DUMP B;
For the following input:
2.1234
12.334
The dumped result is:
(2.1)
(12.3)

Nested Flatten in Pig

I have a problem using Pig, like this:
Suppose I have an alias A, like ("key1","just_for_example"). I want something like: ("key1","just"),("key1","for"),("key1","example"). My script looks like:
B = foreach A generate $0, FLATTEN(TOBAG(FLATTEN(STRSPLIT($1,'_'))));
But it keeps throwing errors like "Error 1070: Couldn't resolve Flatten from builtin". Once I split this statement into two to eliminate the nested FLATTENs, it works. Why is that? Is it related to how Pig compiles my script? Thanks.
Flatten is not a UDF, it is an operator that changes the format of bags and tuples. You can read the description here.
You can get the desired output in a clean one-liner though. TOKENIZE works similarly to STRSPLIT, but produces a bag instead of a tuple. Therefore, you can just do:
B = FOREACH A GENERATE $0, FLATTEN(TOKENIZE($1, '_')) ;
Resulting schema and output:
B: {key: chararray,bag_of_tokenTuples::token: chararray}
(key1,just)
(key1,for)
(key1,example)
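Why TOKENIZE gives one row per token comes down to what FLATTEN does with a bag versus a tuple. The Python sketch below models both cases; it is an illustration of the semantics rather than Pig code, and the function names are made up.

```python
def flatten_bag(key, tokens):
    """FLATTEN over a bag: one output row per element, with the key repeated."""
    return [(key, token) for token in tokens]


def flatten_tuple(key, tokens):
    """FLATTEN over a tuple: the fields are spliced into a single, wider row."""
    return [(key, *tokens)]


key, text = "key1", "just_for_example"
tokens = text.split("_")

print(flatten_bag(key, tokens))    # the shape produced with TOKENIZE (a bag)
print(flatten_tuple(key, tokens))  # the shape produced with STRSPLIT (a tuple)
```

The bag version yields three two-field rows, while the tuple version yields one four-field row.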
Flatten is not a function - you can't nest it.
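However, in your case, I don't think you have to use it twice. A single FLATTEN over the STRSPLIT result avoids the error, but note that flattening a tuple produces one row with the tokens as separate fields, (key1,just,for,example), rather than one row per token: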
B = foreach A generate $0, FLATTEN(STRSPLIT($1,'_'));
I tried to replicate your question; please find my solution below. I used flatten_exe.txt as input, which contains the data ("Just_For_example").
grunt> flt = LOAD '/home/training/pig/Join/flatten_exe.txt' USING PigStorage();
grunt> b = FOREACH flt GENERATE FLATTEN(TOKENIZE($0, '_'));
grunt> dump b;

Apache Pig not parsing a tuple fully

I have a file called data that looks like this: (note there are tabs after the 'personA')
personA (1, 2, 3)
personB (2, 1, 34)
And I have an Apache pig script like this:
A = LOAD 'data' AS (name: chararray, nodes: tuple(a:int, b:int, c:int));
C = foreach A generate nodes.$0;
dump C;
The output of which makes sense:
(1)
(2)
However if I change the schema of the script to be like this:
A = LOAD 'data' AS (name: chararray, nodes: tuple());
C = foreach A generate nodes.$0;
dump C;
Then the output I get is this:
(1, 2, 3)
(2, 1, 34)
It looks like the first (and only) element in this tuple is a bytearray. i.e. it's not parsing the input text 1, 2, 3 into a tuple.
In future my input will have an unknown & variable number of elements in the nodes item, so I can't just write out a:int, ….
Is there any way to get Pig to parse the input tuple as a tuple without having to write out the full schema?
Pig does not accept what you are passing in as valid. The default loader, PigStorage, only accepts delimited files (tab-delimited by default). It is not smart enough to parse the tuple construct with the parentheses and commas you have in the text. Your options are:
Reformat your file to be tab delimited: personA 1 2 3
Read the file in line by line with TextLoader, then write some sort of UDF that parses the line and returns the data in the form you want.
Write your own custom loader.
This is no longer a limitation. Pig parses tuples in the input file, treating the comma as the field separator. I tested this with Apache Pig version 0.15.0.
A = LOAD 'data' AS (name: chararray, nodes: tuple());
C = foreach A generate nodes.$0;
dump C;
Output I get is:
(1)
(2)
Here is another way of tackling this issue, although I know the answers above are more efficient.
data = LOAD 'data' USING PigStorage() AS (name:chararray, field2:chararray);
data = FOREACH data GENERATE name, REPLACE(REPLACE(field2, '\\(',''),'\\)','') AS field2;
data = FOREACH data GENERATE name, STRSPLIT(field2, '\\,') AS fieldTuple;
data = FOREACH data GENERATE name, fieldTuple.$0,fieldTuple.$1, fieldTuple.$2 ;
Load field2 as chararray
Remove parentheses
Split field2 by comma (it gives you a tuple with 3 fields in it)
Get values by index
I know it is hacky. Just wanted to provide another way of doing this
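The same strip-and-split idea, sketched in Python for a field of unknown length. This is an illustration only, not Pig code; the function name is invented, and it assumes every element of the tuple text is an integer.

```python
def parse_nodes(field):
    """Turn text such as '(1, 2, 3)' into a tuple of ints of any length."""
    inner = field.replace("(", "").replace(")", "")  # remove the parentheses
    return tuple(int(part) for part in inner.split(","))  # int() ignores the spaces


# One input line: name, a tab, then the tuple text.
line = "personA\t(1, 2, 3)"
name, nodes_text = line.split("\t")
nodes = parse_nodes(nodes_text)
print(name, nodes, nodes[0])
```

Because the split is not tied to a fixed number of fields, the same function handles rows with more or fewer elements.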
