pig how to concat columns into single string - hadoop

I have a column of strings that I load using Pig:
A
B
C
D
How do I convert this column into a single string like this?
A,B,C,D

You first have to GROUP ALL to put everything into one bag, then join the contents of the bag together using a UDF. Something like this:
-- myudfs.py
-- #!/usr/bin/python
--
-- @outputSchema('concated:chararray')
-- def concat_bag(BAG):
--     # each element of the bag arrives as a one-field tuple
--     return ','.join([t[0] for t in BAG])
Register 'myudfs.py' using jython as myfuncs;
A = LOAD 'myfile.txt' AS (letter:chararray) ;
B = GROUP A ALL ;
C = FOREACH B GENERATE myfuncs.concat_bag(A.letter) AS all_letters ;
If your file/schema contains multiple columns, you are probably going to want to project out the column you want to generate the string for. Something like:
A0 = LOAD 'myfile.txt' AS (letter:chararray, val:int, extra:chararray) ;
A = FOREACH A0 GENERATE letter ;
This way you are not keeping around extra columns that will slow down an already expensive operation.
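For reference, Pig's Jython bridge hands a bag to the UDF as a list of tuples, so the join has to unwrap each one-field tuple first. Here is a minimal plain-Python sketch of that logic that runs without Pig. The sample `bag` is made up to mirror the question's data, the `sep` parameter is my own addition, and it assumes each tuple has exactly one field after the projection:

```python
def concat_bag(bag, sep=","):
    """Join the first field of every tuple in a Pig-style bag."""
    # Skip empty tuples and nulls (None) so a missing value does not break the join.
    return sep.join(str(t[0]) for t in bag if t and t[0] is not None)


# A bag as the UDF would see it: a list of one-field tuples.
bag = [("A",), ("B",), ("C",), ("D",)]
result = concat_bag(bag)
print(result)
```

Keep in mind that Pig does not guarantee the order of tuples inside a bag, so if the order of the letters matters you would need to sort inside the function.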

Related

Union Two files by column using pig

I want to Union/Merge two files using pig. But, this is a different union than a usual union. Following are my files (h* are header of files) :
F1 :
h1,h2,h3,h4
a01,a02,a03,a04
a11,a12,a13,a14
F2 :
h3,h4,h5,h6
a23,a24,b01,b02
a33,a34,b11,b12
The resulting output must be a Union of these files like this :
FR :
h1,h2,h3,h4,h5,h6
a01,a02,a03,a04,,
a11,a12,a13,a14,,
,,a23,a24,b01,b02
,,a33,a34,b11,b12
One more difficulty: I want to make it generic so that it works for any number of common columns. Currently there are two common columns, but there could be three, one, or even none at all. For example :
F1 :
h1,h2,h3,h4
a1,a2,a3,a4
F2 :
h5,h6,h7,h8
b1,b2,b3,b4
FR :
a1,a2,a3,a4,,,,
,,,,b1,b2,b3,b4
Any hint or help is appreciated.
Here is how you can do it statically:
F1full = FOREACH F1 GENERATE h1,h2,h3,h4, NULL as h5, NULL as h6;
F2full = FOREACH F2 GENERATE NULL as h1,NULL as h2,h3,h4, h5, h6;
FR = UNION F1full, F2full;
Pig is not very flexible, so I don't think it is possible to generate this dynamically for the generic case.
If you want a solution for the generic case, you could use a language like Python to build the required statements from the metadata (column names) of the stored tables/files.
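A rough sketch of that idea, assuming the column names of both files are already known as Python lists. The function name `build_union_script` and the `(chararray)` cast on the padded nulls are my own assumptions, not something from the question, so adjust the cast to your real column types:

```python
def build_union_script(rel1, cols1, rel2, cols2):
    """Return Pig statements that pad both relations to the merged column list."""
    # Merged header: columns of the first file, then those only the second file has.
    merged = list(cols1) + [c for c in cols2 if c not in cols1]

    def projection(cols):
        # Keep real columns; emit a typed null for the ones this file lacks.
        return ", ".join(
            c if c in cols else "(chararray)NULL AS %s" % c for c in merged
        )

    return "\n".join([
        "%sfull = FOREACH %s GENERATE %s;" % (rel1, rel1, projection(cols1)),
        "%sfull = FOREACH %s GENERATE %s;" % (rel2, rel2, projection(cols2)),
        "FR = UNION %sfull, %sfull;" % (rel1, rel2),
    ])


script = build_union_script(
    "F1", ["h1", "h2", "h3", "h4"],
    "F2", ["h3", "h4", "h5", "h6"],
)
print(script)
```

Because the merged column list is computed from the two headers, the same function covers the cases with three, one, or no common columns.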
I tried to solve the problem using the following approach:
1) Load both of the files.
2) Add a counter to generate a unique field (ID).
3) Start the counter for file B where the counter for A ended.
4) Cogroup both files on the common columns, including the counter.
5) Take all group columns into a different schema.
6) Generate the uncommon columns from both files, along with the counter.
7) First join the uncommon columns from file A with the group columns on the counter.
8) Join the result of step 7 with the uncommon columns from file B on the counter.
Following is the Pig script to do the same. As this script is generic, I have listed all the parameters that are required before running the script.
-- Parameters required : $file1_path, $file2_path, $file1_schema, $file2_schema, $COUNT_A (number of rows in file A), $CMN_COLUMN_A (common columns in A), $CMN_COLUMN_B, $UNCMN_COLUMN_A(Unique columns in file A), $UNCMN_COLUMN_B.
A = LOAD '$file1_path' USING org.apache.pig.piggybank.storage.CSVExcelStorage('~', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER') as ($file1_schema);
B = LOAD '$file2_path' USING org.apache.pig.piggybank.storage.CSVExcelStorage('~', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER') as ($file2_schema);
RANK_A = RANK A;
RANK_B = RANK B;
COUNT_RANK_B = FOREACH RANK_B GENERATE ($0+(long)'$COUNT_A') as rank_B, $1 ..;
COGRP_RANK_AB = COGROUP RANK_A BY($CMN_COLUMN_A), COUNT_RANK_B BY ($CMN_COLUMN_B);
CMN_COGRP_RANK_AB = FOREACH COGRP_RANK_AB GENERATE FLATTEN(group) AS ($CMN_COLUMN_A);
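-- UNCMN_RA is used in the joins below; presumably it mirrors UNCMN_RB for file A
UNCMN_RA = FOREACH RANK_A GENERATE $UNCMN_COLUMN_A;
UNCMN_RB = FOREACH COUNT_RANK_B GENERATE $UNCMN_COLUMN_B;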
JOIN_CMN_UNCMN_A = JOIN CMN_COGRP_RANK_AB BY(rank_A) LEFT OUTER, UNCMN_RA by rank_A;
JOIN_CMN_UNCMN_B = JOIN JOIN_CMN_UNCMN_A BY(CMN_COGRP_RANK_AB::rank_A) LEFT OUTER, UNCMN_RB by rank_B;
STORE FINAL_DATA INTO '$store_path' USING org.apache.pig.piggybank.storage.CSVExcelStorage('~', 'NO_MULTILINE', 'UNIX', 'WRITE_OUTPUT_HEADER');

PIG replace for multiple columns

I have a total of about 150 columns and want to search for \t in all of them and replace it with a space.
A = LOAD 'db.table' USING org.apache.hcatalog.pig.HCatLoader();
B = GROUP A ALL;
C = FOREACH B GENERATE REPLACE(B, '\\t', ' ');
STORE C INTO 'location';
The output contains only the word ALL.
Is there a better way to do the replacement in all columns at once?
Thank you
Nivi
You could do this with a Python UDF. Say you had some data like this with tabs in it:
Data:
hi there friend,whats up,nothing much
yo yo yo,green eggs, ham
You could write this in Python
UDF:
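@outputSchema("datums:{(no_tabs:chararray)}")
def remove_tabs(columns):
    try:
        # columns is a bag: a list of tuples, one tuple per row
        out = [tuple(map(lambda s: s.replace("\t", " "), x)) for x in columns]
        return out
    except Exception:
        # (None,) is a one-field tuple; (None) would just be None
        return [(None,)]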
and then in Pig
Query:
REGISTER 'remove_tabs.py' USING jython AS udf;
data = LOAD 'toy_data' USING PigStorage(',') AS (col0:chararray,
    col1:chararray, col2:chararray);
grpd = GROUP data all;
A = FOREACH grpd GENERATE FLATTEN(udf.remove_tabs(data));
DUMP A;
Output:
(hi there friend,whats up,nothing much)
(yo yo yo,green eggs,ham)
Obviously you have more than three columns, but since you are grouping by all, the script should generalize to any number of columns.
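One thing to watch with 150 columns is that some of them are probably not strings, or are null, and calling `replace` on those raises an error for the whole bag. A hedged variant of the same idea that leaves such fields alone is sketched below in plain Python so it can be tried outside Pig. The sample rows are made up. Under Jython 2 you would likely need to check against `basestring` rather than `str`, which is an assumption on my part about your runtime:

```python
def remove_tabs(rows):
    """Replace tabs with spaces in every string field of every row."""
    cleaned = []
    for row in rows:
        # Leave non-string fields (ints, None) untouched instead of failing the bag.
        cleaned.append(
            tuple(f.replace("\t", " ") if isinstance(f, str) else f for f in row)
        )
    return cleaned


rows = [
    ("hi\tthere\tfriend", "whats\tup", "nothing\tmuch"),
    ("yo\tyo\tyo", None, 42),
]
result = remove_tabs(rows)
print(result)
```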

Files comparing field by field in Pig

I have two files like
File 1
id,sal,location,code
1000,1000,jupiter,F
1001,2000,jupiter,F
1002,3000,jupiter,F
1003,4000,jupiter,F
1004,5000,jupiter,F
File 2
id,sal,location,code
1000,2000,jupiter,F
1001,2000,jupiter,Z
1002,3000,jupiter,F
1003,4000,jupiter,F
1004,5000,jupiter,F
When I compare file 1 with file 2, I need an output like
1000, sal
1001,code
Basically, it should tell me which field has changed from the previous file, along with the id.
Can this be done in Pig?
The comparison itself is easy; the challenging part is producing the output format you mentioned, which requires somewhat more complex logic.
I have handled most of the edge cases, but you should check with your own input to make sure that it works for all combinations.
file1:
1000,1000,jupiter,F
1001,2000,jupiter,F
1002,3000,jupiter,F
1003,4000,jupiter,F
1004,5000,jupiter,F
file2:
1000,2000,jupiter,F
1001,2000,jupiter,Z
1002,3000,jupiter,F
1003,4000,jupiter,F
1004,5000,jupiter,F
PigScript:
A = LOAD 'file1' USING PigStorage(',') AS (id,sal,location,code);
B = LOAD 'file2' USING PigStorage(',') AS (id,sal,location,code);
C = JOIN A BY id,B BY id;
D = FOREACH C GENERATE A::id AS id,((A::sal == B::sal)?'':'sal') AS sal,
((A::location == B::location)?'':'location') AS location,
((A::code == B::code)?'':'code') AS code;
--Remove the common fields between two files
E = FILTER D BY NOT (sal=='' AND location=='' AND code=='');
--The below two lines are used to formatting the output
F = FOREACH E GENERATE id,REPLACE(BagToString(TOBAG(sal,location,code),','),'(,,$|,$)','') As finalOutput;
G = FOREACH F GENERATE id,REPLACE(finalOutput,',,',',');
DUMP G;
Output:
(1000,sal)
(1001,code)
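If you want something to check the Pig output against on a small sample, the same comparison can be sketched in plain Python. This is only a reference sketch: the function name `changed_fields` is made up, the field list is hard-coded from the question, and ids that appear in only one file are skipped, which mirrors the inner JOIN above:

```python
import csv
import io

FIELDS = ["id", "sal", "location", "code"]


def changed_fields(old_text, new_text):
    """Return {id: [names of fields that differ]} for ids present in both inputs."""
    old = {row[0]: row for row in csv.reader(io.StringIO(old_text))}
    changes = {}
    for row in csv.reader(io.StringIO(new_text)):
        before = old.get(row[0])
        if before is None:
            continue  # id only in the new file: skipped, like an inner join
        diff = [name for name, a, b in zip(FIELDS, before, row) if a != b]
        if diff:
            changes[row[0]] = diff
    return changes


file1 = "1000,1000,jupiter,F\n1001,2000,jupiter,F\n1002,3000,jupiter,F\n"
file2 = "1000,2000,jupiter,F\n1001,2000,jupiter,Z\n1002,3000,jupiter,F\n"
result = changed_fields(file1, file2)
print(result)
```

Here a row with several changed fields simply gets a longer list, so no string clean-up step is needed.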

Count and find maximum number in Hadoop using pig

I have a table that contains sample CDR data. Column A holds the calling person's mobile number and column B holds the called person's mobile number.
I need to find who made the maximum number of calls (column A),
and I also need to find which number (column B) was called the most.
The table structure is like below:
calling called
889578226 77382596
889582256 77382596
889582256 7736368296
7785978214 782987522
In the above table, 889582256 has the most outgoing calls and 77382596 is the most called number; that is the kind of output I need.
In Hive I run it like below:
SELECT calling_a,called_b, COUNT(called_b) FROM cdr_data GROUP BY calling_a,called_b;
What would be the equivalent code for the above query in Pig?
Anas, could you please let me know whether this is what you are expecting, or something different?
input.txt
a,100
a,101
a,101
a,101
a,103
b,200
b,201
b,201
c,300
c,300
c,301
d,400
PigScript:
A = LOAD 'input.txt' USING PigStorage(',') AS (name:chararray,phone:long);
B = GROUP A BY (name,phone);
C = FOREACH B GENERATE FLATTEN(group),COUNT(A) AS cnt;
D = GROUP C BY $0;
E = FOREACH D {
    SortedList = ORDER C BY cnt DESC;
    top = LIMIT SortedList 1;
    GENERATE FLATTEN(top);
}
DUMP E;
Output:
(a,101,3)
(b,201,2)
(c,300,2)
(d,400,1)
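The logic of the script (count each pair, then keep the top pair per name) can be sanity-checked on the same sample with a few lines of plain Python. This is just a sketch with a made-up function name, and on ties it keeps whichever pair it saw first:

```python
from collections import Counter


def top_called_per_caller(pairs):
    """For each caller, return the (called, count) pair with the highest count."""
    counts = Counter(pairs)  # (caller, called) -> number of calls
    best = {}
    for (caller, called), n in counts.items():
        # Keep the highest count; ties keep the pair that was seen first.
        if caller not in best or n > best[caller][1]:
            best[caller] = (called, n)
    return best


pairs = [
    ("a", 100), ("a", 101), ("a", 101), ("a", 101), ("a", 103),
    ("b", 200), ("b", 201), ("b", 201),
    ("c", 300), ("c", 300), ("c", 301),
    ("d", 400),
]
result = top_called_per_caller(pairs)
print(result)
```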

Equivalent of linux 'diff' in Apache Pig

I want to be able to do a standard diff on two large files. I've got something that will work but it's not nearly as quick as diff on the command line.
A = load 'A' as (line);
B = load 'B' as (line);
JOINED = join A by line full outer, B by line;
DIFF = FILTER JOINED by A::line is null or B::line is null;
DIFF2 = FOREACH DIFF GENERATE (A::line is null?B::line : A::line), (A::line is null?'REMOVED':'ADDED');
STORE DIFF2 into 'diff';
Anyone got any better ways to do this?
I use the following approaches. (My JOIN approach is very similar to yours, but that method does not replicate the behavior of diff when there are duplicated lines.) As this was asked some time ago, perhaps you were using only one reducer, since Pig only got an algorithm to adjust the number of reducers in 0.8?
Both approaches I use are within a few percent of each other in performance, but they do not treat duplicates the same way:
The JOIN approach collapses duplicates (so, if one file has more copies of a line than the other, this approach will not output the extra copies)
The UNION approach works like the Unix diff(1) tool and will return the correct number of extra duplicates for the correct file
Unlike the Unix diff(1) tool, order is not important (effectively the JOIN approach performs sort -u <foo.txt> | diff while UNION performs sort <foo> | diff)
If you have an incredible (~thousands) number of duplicate lines, then things will slow down due to the joins (if your use case allows it, perform a DISTINCT on the raw data first)
If your lines are very long (e.g. >1KB in size), then I would recommend using the DataFu MD5 UDF to diff over the hashes only, then JOIN with your original files to get the original rows back before outputting
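To make the duplicate handling concrete, here is a small plain-Python sketch of the two semantics. It is not the Pig code, only an illustration on made-up data: a set difference stands in for the JOIN approach and a `collections.Counter` difference stands in for the UNION approach:

```python
from collections import Counter


def diff_distinct(first, second):
    """JOIN-style diff: duplicates collapse, only presence matters."""
    a, b = set(first), set(second)
    return sorted(a - b), sorted(b - a)


def diff_multiset(first, second):
    """UNION-style diff: surplus copies of a line are reported."""
    a, b = Counter(first), Counter(second)
    # Counter subtraction keeps only the positive counts.
    return sorted((a - b).elements()), sorted((b - a).elements())


first = ["x", "x", "x", "y"]
second = ["x", "z"]
print(diff_distinct(first, second))
print(diff_multiset(first, second))
```

With this input the set version reports only `y` and `z`, while the multiset version also reports the two extra copies of `x` in the first file.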
Using JOIN:
SET job.name 'Diff(1) Via Join'
-- Erase Outputs
rmf first_only
rmf second_only
-- Process Inputs
a = LOAD 'a.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS First: chararray;
b = LOAD 'b.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS Second: chararray;
-- Combine Data
combined = JOIN a BY First FULL OUTER, b BY Second;
-- Output Data
SPLIT combined INTO first_raw IF Second IS NULL,
second_raw IF First IS NULL;
first_only = FOREACH first_raw GENERATE First;
second_only = FOREACH second_raw GENERATE Second;
STORE first_only INTO 'first_only' USING PigStorage();
STORE second_only INTO 'second_only' USING PigStorage();
Using UNION:
SET job.name 'Diff(1)'
-- Erase Outputs
rmf first_only
rmf second_only
-- Process Inputs
a_raw = LOAD 'a.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS Row: chararray;
b_raw = LOAD 'b.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS Row: chararray;
a_tagged = FOREACH a_raw GENERATE Row, (int)1 AS File;
b_tagged = FOREACH b_raw GENERATE Row, (int)2 AS File;
-- Combine Data
combined = UNION a_tagged, b_tagged;
c_group = GROUP combined BY Row;
-- Find Unique Lines
%declare NULL_BAG 'TOBAG(((chararray)\'place_holder\',(int)0))'
counts = FOREACH c_group {
    firsts = FILTER combined BY File == 1;
    seconds = FILTER combined BY File == 2;
    GENERATE
        FLATTEN(
            (COUNT(firsts) - COUNT(seconds) == (long)0 ? $NULL_BAG :
                (COUNT(firsts) - COUNT(seconds) > 0 ?
                    TOP((int)(COUNT(firsts) - COUNT(seconds)), 0, firsts) :
                    TOP((int)(COUNT(seconds) - COUNT(firsts)), 0, seconds))
            )
        ) AS (Row, File);
};
-- Output Data
SPLIT counts INTO first_only_raw IF File == 1,
second_only_raw IF File == 2;
first_only = FOREACH first_only_raw GENERATE Row;
second_only = FOREACH second_only_raw GENERATE Row;
STORE first_only INTO 'first_only' USING PigStorage();
STORE second_only INTO 'second_only' USING PigStorage();
Performance
It takes roughly 10 minutes to difference over 200GB (1,055,687,930 rows) using LZO compressed input with 18 nodes.
Each approach only takes one Map/Reduce cycle.
This results in roughly 1.8GB diffed per node, per minute (not a great throughput, but on my system it seems diff(1) only operates in-memory, while Hadoop leverages streaming disks).
