Pig Latin issue - hadoop

Please help me out; it's really urgent, with a deadline nearing, and I have been stuck on this for two weeks. I am a newbie in Pig Latin.
I have a scenario where I have to filter data from a CSV file.
The CSV is on HDFS and has two columns.
grunt> fl = load '/user/hduser/file.csv' USING PigStorage(',') AS (conv:chararray, clnt:chararray);
grunt> dump fl;
("first~584544fddf~dssfdf","2001")
("first~4332990~fgdfs4s","2001")
("second~232434334~fgvfd4","1000")
("second~786765~dgbhgdf","1000)
("second~345643~gfdgd43","1000")
What I need to do is extract only the first word before the first '~' sign and concatenate it with the second column's value. I also need to group the concatenated results, count the number of such similar rows, and create a new CSV file as output, again with two columns: the first column would be the concatenated value and the second column the row count.
i.e.
("first 2001","2")
("second 1000","3")
and so on.
I have written the code below, but it's just not working. I have used STRSPLIT; it splits the values of the first column of the input CSV file, but I don't know how to extract the first split value.
The code is given below:
convData = LOAD '/user/hduser/file.csv' USING PigStorage(',') AS (conv:chararray, clnt:chararray);
fil = FILTER convData BY conv != '"-1"'; -- I am using this to filter out the rows whose first column is "-1".
data = FOREACH fil GENERATE STRSPLIT($0, '~');
X = FOREACH data GENERATE CONCAT(data.$0,' ',convData.clnt);
Y = FOREACH X GROUP BY X;
Z = FOREACH Y GENERATE COUNT(Y);
var = FOREACH Z GENERATE CONCAT(Y,',',Z);
STORE var INTO '/user/hduser/output.csv' USING PigStorage(',');

STRSPLIT returns a tuple, the individual elements of which you can access using the numbered syntax. This is what you need:
data = FOREACH fil GENERATE STRSPLIT($0, '~') AS a, clnt;
X = FOREACH data GENERATE CONCAT(a.$0,' ', clnt);
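Note that CONCAT with more than two arguments requires Pig 0.13 or later; on older versions, nest two CONCAT calls instead. From there, a group and count along these lines should produce the requested output (a sketch, untested; the relation names and output path are illustrative):
grouped = GROUP X BY $0;
counts = FOREACH grouped GENERATE group, COUNT(X);
STORE counts INTO '/user/hduser/output' USING PigStorage(',');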

Related

Filter inner bag in Pig

The data looks like this:
22678, {(112),(110),(2)}
656565, {(110), (109)}
6676, {(2),(112)}
This is the data structure:
(id:chararray, event_list:{innertuple:(innerfield:chararray)})
I want to filter those rows where event_list contains 2. I thought initially to flatten the data and then filter those rows that have 2. Somehow flatten doesn't work on this dataset.
Can anyone please help?
There might be a simpler way of doing this, like a bag lookup, etc. Otherwise, with basic Pig, one way of achieving this is:
data = load 'data.txt' AS (id:chararray, event_list:bag{});
-- flatten the bag, in order to transpose each element to a separate row.
flattened = foreach data generate id, flatten(event_list);
-- keep only those rows where the value is 2.
filtered = filter flattened by (int) $1 == 2;
-- keep only distinct ids.
ids = foreach filtered generate $0 as id:chararray;
dist = distinct ids;
-- join the distinct ids back to the original relation.
jnd = join data by id, dist by id;
-- remove extra fields, keep the original fields.
result = foreach jnd generate data::id, data::event_list;
dump result;
(22678,{(112),(110),(2)})
(6676,{(2),(112)})
You can filter the bag and project a boolean that says whether 2 is present in the bag or not. Then filter the rows where that projection is true.
So:
-- note: 'input' and 'output' are reserved words in Pig, hence the different aliases
inpt = LOAD 'data.txt' AS (id:chararray, event_list:bag{t:(val_0:chararray)});
inpt_filt = FOREACH inpt {
    bag_filter = FILTER event_list BY (val_0 matches '2');
    GENERATE
        id,
        event_list,
        (IsEmpty(bag_filter) ? false : true) AS is_2_present:boolean;
};
out = FILTER inpt_filt BY is_2_present;
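If you need the result back in the original two-column shape, a final projection (reusing the aliases above) drops the helper column:
projected = FOREACH out GENERATE id, event_list;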

Getting Friends ID with PIG Script - Text manipulation needed

Here is my input
user0=242561&friend=6226&friend=93856&age=35&friend=35900
user1=242562&friend=6226&friend=93856&age=35&friend=35900
user2=242563&friend=6226&friend=93856&age=35&friend=35900&friend=33900&friend=34900
user3=242564&friend=6226&friend=93856&age=35&friend=35900&friend=35930&friend=35920&friend=35901
Notes and requirements:
I need to remove the age=35 parameter.
I need to get each user with the friend numbers associated with that user (in the input, one row holds one user).
The number of friends varies, and the maximum number of friends is not known.
Expected result
user0=242561-6226,93856,35900
user1=242562-6226,93856,35900
user2=242563-6226,93856,35900,33900,34900
user3=242564-6226,93856,35900,35930,35920,35901
I tried something like this, but it didn't work:
inputs = LOAD '/data/friends4' AS (line:chararray);
tokenized = FOREACH inputs GENERATE FLATTEN(TOKENIZE(line, '&')) AS parameter;
filtered = FILTER tokenized BY INDEXOF(parameter, 'age=') != 0;
dump filtered;
I am getting this:
(user0=242561)
(friend=6226)
(friend=93856)
(friend=35900)
(user1=242562)
(friend=6226)
(friend=93856)
(friend=35900)
(user2=242563)
(friend=6226)
(friend=93856)
(friend=35900)
(friend=33900)
(friend=34900)
(user3=242564)
(friend=6226)
(friend=93856)
(friend=35900)
(friend=35930)
(friend=35920)
(friend=35901)
Now I need the result as below; can someone please help with this?
user0=242561-6226,93856,35900
user1=242562-6226,93856,35900
user2=242563-6226,93856,35900,33900,34900
user3=242564-6226,93856,35900,35930,35920,35901
You could create a UDF to handle this properly and easily, but you can also try the script below. I have just added a line to your script that replaces 'friend=' with ','. From there you could write a UDF that splits the string on the space and then replaces the first ',' with '-'.
inputs = LOAD '/data/friends4' AS (line:chararray);
tokenized = FOREACH inputs GENERATE FLATTEN(TOKENIZE(line, '&')) AS parameter;
filtered = FILTER tokenized BY INDEXOF(parameter, 'age=') != 0;
REPL1 = FOREACH filtered GENERATE REPLACE($0, 'friend=', ',');
dump REPL1;
Output:
(user0=242561)
(,6226)
(,93856)
(,35900 user1=242562)
(,6226)
(,93856)
(,35900 user2=242563)
(,6226)
(,93856)
(,35900)
(,33900)
(,34900 user3=242564)
(,6226)
(,93856)
(,35900)
(,35930)
(,35920)
(,35901)
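Alternatively, if each record sits on its own line, the whole transformation can be done with a chain of REPLACE calls and no UDF. This is a sketch (untested); the final regex uses a $1 backreference so that only the first comma is rewritten to '-':
inputs = LOAD '/data/friends4' AS (line:chararray);
-- drop the age parameter (assumes it always appears as '&age=<digits>')
no_age = FOREACH inputs GENERATE REPLACE(line, '&age=[0-9]+', '') AS line;
-- turn every '&friend=' into a comma...
commas = FOREACH no_age GENERATE REPLACE(line, '&friend=', ',') AS line;
-- ...then turn only the first comma into '-'
result = FOREACH commas GENERATE REPLACE(line, '^([^,]*),', '$1-');
DUMP result;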

PIG replace for multiple columns

I have a total of about 150 columns and want to search for \t and replace it with spaces
A = LOAD 'db.table' USING org.apache.hcatalog.pig.HCatLoader();
B = GROUP A ALL;
C = FOREACH B GENERATE REPLACE(B, '\\t', ' ');
STORE C INTO 'location';
This produces only the word ALL as output.
Is there a better way to replace all the columns at once?
Thank you
Nivi
You could do this with a Python UDF. Say you had some data like this with tabs in it:
Data:
hi there friend,whats up,nothing much
yo yo yo,green eggs, ham
You could write this in Python
UDF:
@outputSchema("datums:{(no_tabs:chararray)}")
def remove_tabs(columns):
    # 'columns' arrives as a bag of tuples; replace tabs in every field
    try:
        out = [tuple(map(lambda s: s.replace("\t", " "), x)) for x in columns]
        return out
    except:
        return [(None,)]
and then in Pig
Query:
REGISTER 'remove_tabs.py' USING jython AS udf;
data = LOAD 'toy_data' USING PigStorage(',') AS (col0:chararray, col1:chararray, col2:chararray);
grpd = GROUP data all;
A = FOREACH grpd GENERATE FLATTEN(udf.remove_tabs(data));
DUMP A;
Output:
(hi there friend,whats up,nothing much)
(yo yo yo,green eggs,ham)
Obviously you have more than three columns, but since you are grouping by all, the script should generalize to any number of columns.
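If you then need to write the cleaned rows back out, a STORE along the usual lines should work (the output path here is illustrative):
STORE A INTO 'location_cleaned' USING PigStorage(',');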

How would I make a pig script that only returns fields with entries over a certain length?

The data I have is already fielded. I just want a document that contains two of the fields, and even then it should only contain an entry if the title field is over a certain length. This is what I have so far.
records = LOAD '$INPUT' USING PigStorage('\t') AS (url:chararray, title:chararray, meta:chararray, copyright:chararray, aboutUSLink:chararray, aboutTitle:chararray, aboutMeta:chararray, contactUSLink:chararray, contactTitle:chararray, contactMeta:chararray, phones:chararray);
E = FOREACH records IF SIZE(title)>10 GENERATE url,title;
STORE E INTO '$OUTPUT/phoneNumbersAndTitles';
Why does the code exit at IF?
FOREACH has no IF clause, which is why the script fails to parse at that point. You should use FILTER, which selects tuples from a relation based on some condition:
filtered = FILTER records BY SIZE(title) > 10;
E = FOREACH filtered GENERATE url,title;
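E can then be stored exactly as before:
STORE E INTO '$OUTPUT/phoneNumbersAndTitles';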

Equivalent of linux 'diff' in Apache Pig

I want to be able to do a standard diff on two large files. I've got something that will work but it's not nearly as quick as diff on the command line.
A = load 'A' as (line);
B = load 'B' as (line);
JOINED = join A by line full outer, B by line;
DIFF = FILTER JOINED by A::line is null or B::line is null;
DIFF2 = FOREACH DIFF GENERATE (A::line is null?B::line : A::line), (A::line is null?'REMOVED':'ADDED');
STORE DIFF2 into 'diff';
Anyone got any better ways to do this?
I use the following approaches. (My JOIN approach is very similar to yours, but it does not replicate the behavior of diff on duplicated lines.) As this was asked some time ago, perhaps you were using only one reducer, as Pig got an algorithm to adjust the number of reducers in 0.8?
Both approaches I use are within a few percent of each other in performance, but they do not treat duplicates the same:
The JOIN approach collapses duplicates (so, if one file has more duplicates than the other, this approach will not output the extra duplicates).
The UNION approach works like the Unix diff(1) tool and will return the correct number of extra duplicates for the correct file.
Unlike the Unix diff(1) tool, order is not important (effectively the JOIN approach performs sort -u <foo.txt> | diff while UNION performs sort <foo> | diff).
If you have an incredible (~thousands) number of duplicate lines, things will slow down due to the joins (if your use case allows, perform a DISTINCT on the raw data first).
If your lines are very long (e.g. >1KB in size), then it would be recommended to use the DataFu MD5 UDF and diff over hashes only, then JOIN with your original files to get the original rows back before outputting; a sketch of this follows the list.
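A minimal sketch of the hashing idea (untested; assumes the DataFu jar is available, and the file and alias names are illustrative):
REGISTER datafu.jar;
DEFINE MD5 datafu.pig.hash.MD5();
a = LOAD 'a.csv' AS (line:chararray);
a_hashed = FOREACH a GENERATE MD5(line) AS hash, line;
-- repeat for b, run either approach below over 'hash' instead of the raw line,
-- then JOIN the surviving hashes back to a_hashed/b_hashed to recover the rows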
Using JOIN:
SET job.name 'Diff(1) Via Join';
-- Erase Outputs
rmf first_only
rmf second_only
-- Process Inputs
a = LOAD 'a.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS First: chararray;
b = LOAD 'b.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS Second: chararray;
-- Combine Data
combined = JOIN a BY First FULL OUTER, b BY Second;
-- Output Data
SPLIT combined INTO first_raw IF Second IS NULL,
second_raw IF First IS NULL;
first_only = FOREACH first_raw GENERATE First;
second_only = FOREACH second_raw GENERATE Second;
STORE first_only INTO 'first_only' USING PigStorage();
STORE second_only INTO 'second_only' USING PigStorage();
Using UNION:
SET job.name 'Diff(1)';
-- Erase Outputs
rmf first_only
rmf second_only
-- Process Inputs
a_raw = LOAD 'a.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS Row: chararray;
b_raw = LOAD 'b.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS Row: chararray;
a_tagged = FOREACH a_raw GENERATE Row, (int)1 AS File;
b_tagged = FOREACH b_raw GENERATE Row, (int)2 AS File;
-- Combine Data
combined = UNION a_tagged, b_tagged;
c_group = GROUP combined BY Row;
-- Find Unique Lines
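-- rows whose two counts match flatten to this placeholder (File == 0), which the SPLIT below discards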
%declare NULL_BAG 'TOBAG(((chararray)\'place_holder\',(int)0))'
counts = FOREACH c_group {
firsts = FILTER combined BY File == 1;
seconds = FILTER combined BY File == 2;
GENERATE
FLATTEN(
(COUNT(firsts) - COUNT(seconds) == (long)0 ? $NULL_BAG :
(COUNT(firsts) - COUNT(seconds) > 0 ?
TOP((int)(COUNT(firsts) - COUNT(seconds)), 0, firsts) :
TOP((int)(COUNT(seconds) - COUNT(firsts)), 0, seconds))
)
) AS (Row, File); };
-- Output Data
SPLIT counts INTO first_only_raw IF File == 1,
second_only_raw IF File == 2;
first_only = FOREACH first_only_raw GENERATE Row;
second_only = FOREACH second_only_raw GENERATE Row;
STORE first_only INTO 'first_only' USING PigStorage();
STORE second_only INTO 'second_only' USING PigStorage();
Performance
It takes roughly 10 minutes to difference over 200GB (1,055,687,930 rows) using LZO compressed input with 18 nodes.
Each approach only takes one Map/Reduce cycle.
This works out to roughly 1.8GB diffed per node per minute (not a great throughput, but on my system it seems diff(1) operates only in memory, while Hadoop leverages streaming disks).
