Getting Friends ID with PIG Script - Text manipulation needed - hadoop

Here is my input
user0=242561&friend=6226&friend=93856&age=35&friend=35900
user1=242562&friend=6226&friend=93856&age=35&friend=35900
user2=242563&friend=6226&friend=93856&age=35&friend=35900&friend=33900&friend=34900
user3=242564&friend=6226&friend=93856&age=35&friend=35900&friend=35930&friend=35920&friend=35901
Notes and requirements:
I need to remove the age=35 parameter.
I need to get each user together with the friend numbers associated with that user (in the input, one row has exactly one user).
The number of friends varies from row to row, and the maximum number of friends is not known.
Expected result:
user0=242562-6226,93856,35900
user1=242562-6226,93856,35900
user2=242562-6226,93856,35900,33900,34900
user3=242562-6226,93856,35900,35930,35920,35901
I tried something like this, but it didn't work:
inputs = LOAD '/data/friends4' AS (line:chararray);
tokenized = FOREACH inputs GENERATE FLATTEN(TOKENIZE(line, '&')) AS parameter;
filtered = FILTER tokenized BY INDEXOF(parameter, 'age=') != 0;
dump filtered;
This is the output I am getting:
(user=242562)
(friend=6226)
(friend=93856)
(friend=35900)
(user1=242562)
(friend=6226)
(friend=93856)
(friend=35900)
(user2=242562)
(friend=6226)
(friend=93856)
(friend=35900)
(friend=33900)
(friend=34900)
(user3=242562)
(friend=6226)
(friend=93856)
(friend=35900)
(friend=35930)
(friend=35920)
(friend=35901)
Now I need the result as shown below. Can someone please help with this?
user0=242562-6226,93856,35900
user1=242562-6226,93856,35900
user2=242562-6226,93856,35900,33900,34900
user3=242562-6226,93856,35900,35930,35920,35901

You can write a UDF to handle this properly and easily. Alternatively, you can start from the script below: I have added one line to your script that replaces 'friend=' with ','. You can then write a UDF that splits the string on the space and replaces the first ',' with '-'.
inputs = LOAD '/data/friends4' AS (line:chararray);
tokenized = FOREACH inputs GENERATE FLATTEN(TOKENIZE(line, '&')) AS parameter;
filtered = FILTER tokenized BY INDEXOF(parameter, 'age=') != 0;
REPL1 = FOREACH filtered GENERATE REPLACE($0, 'friend=', ',');
dump REPL1;
Output:
(user0=242561)
(,6226)
(,93856)
(,35900 user1=242562)
(,6226)
(,93856)
(,35900 user2=242563)
(,6226)
(,93856)
(,35900)
(,33900)
(,34900 user3=242564)
(,6226)
(,93856)
(,35900)
(,35930)
(,35920)
(,35901)
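If you would rather avoid a UDF, the same result can probably be reached with built-in string functions alone. The following is a minimal sketch, not a tested solution: it assumes every row starts with the user token, that age= never comes directly after it, and that each row has at least one friend. The aliases no_age, parts, cleaned and result and the field names user_part and friend_part are made up for illustration.

```pig
inputs = LOAD '/data/friends4' AS (line:chararray);
-- Remove the age parameter (REPLACE takes a Java regular expression).
no_age = FOREACH inputs GENERATE REPLACE(line, '&age=[^&]*', '') AS line;
-- Split once, on the first '&': the user token, then everything else.
parts = FOREACH no_age GENERATE FLATTEN(STRSPLIT(line, '&', 2)) AS (user_part:chararray, friend_part:chararray);
-- Drop the 'friend=' labels and turn the remaining '&' separators into commas.
cleaned = FOREACH parts GENERATE user_part, REPLACE(REPLACE(friend_part, 'friend=', ''), '&', ',') AS friend_part;
-- Join the two halves with '-'.
result = FOREACH cleaned GENERATE CONCAT(CONCAT(user_part, '-'), friend_part);
DUMP result;
```

For the first sample row this should give user0=242561-6226,93856,35900. Because the whole line is processed in one piece, the rows are never split into separate tokens and do not have to be regrouped afterwards.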

Related

Filter inner bag in Pig

The data looks like this:
22678, {(112),(110),(2)}
656565, {(110), (109)}
6676, {(2),(112)}
This is the data structure:
(id:chararray, event_list:{innertuple:(innerfield:chararray)})
I want to filter those rows where event_list contains 2. I thought initially to flatten the data and then filter those rows that have 2. Somehow flatten doesn't work on this dataset.
Can anyone please help?
There might be a simpler way of doing this, like a bag lookup etc. Otherwise with basic pig one way of achieving this is:
data = load 'data.txt' AS (id:chararray, event_list:bag{});
-- flatten bag, in order to transpose each element to a separate row.
flattened = foreach data generate id, flatten(event_list);
-- keep only those rows where the value is 2.
filtered = filter flattened by (int) $1 == 2;
-- keep only distinct ids.
dist = distinct (foreach filtered generate $0 as (id:chararray));
-- join distinct ids to original relation
jnd = join data by id, dist by id;
-- remove extra fields, keep original fields.
result = foreach jnd generate data::id, data::event_list;
dump result;
(22678,{(112),(110),(2)})
(6676,{(2),(112)})
You can filter the bag inside a nested FOREACH and project a boolean that says whether 2 is present in the bag. Then filter the rows on that boolean.
For example:
-- 'input' and 'output' are reserved words in Pig, so the aliases are named differently here.
data = LOAD 'data.txt' AS (id:chararray, event_list:bag{t:(val:chararray)});
flagged = FOREACH data {
    bag_filter = FILTER event_list BY (val matches '2');
    GENERATE
        id,
        event_list,
        (IsEmpty(bag_filter) ? false : true) AS is_2_present:boolean;
};
result = FILTER flagged BY is_2_present;

PIG: Filter a string on a basis of a phrase

I was wondering if it is possible to filter a string on the basis of a phrase. For example, I want to count the number of times ps3 (or "ps 3") appears in the query. I am not sure how to avoid an exact match in the filter condition for "ps 3", as I do not know how to put a tab inside it. My code so far is:
data = LOAD '/user/cloudera/' using PigStorage(',') as (text:chararray);
filtered_data = FILTER data BY (text matches '.*ps3.*') OR (text == 'ps 3');
Res = FOREACH (GROUP filtered_data ALL) GENERATE COUNT(filtered_data);
DUMP Res;
Obviously, this code fails to count queries like "ps 3 today". Is there a way to handle this?
Try this -
A = LOAD 'input.csv' USING PigStorage(',') AS (text:chararray);
B = FILTER A BY (LOWER(text) MATCHES '.*ps 3.*' OR LOWER(text) MATCHES '.*ps3.*');
DUMP B;
Output:
(ps 3 today)
(ps 3)
(ps3)
(PS3TODAY)
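The two patterns can also be folded into one, which additionally covers a tab between "ps" and "3" and restores the count from the question. This is a sketch under the assumption that any run of whitespace between the two parts should match; the alias C is made up for illustration.

```pig
A = LOAD 'input.csv' USING PigStorage(',') AS (text:chararray);
-- '\\s*' matches zero or more whitespace characters (space or tab) between 'ps' and '3'.
B = FILTER A BY LOWER(text) MATCHES '.*ps\\s*3.*';
-- Count the matching rows, as in the original script.
C = FOREACH (GROUP B ALL) GENERATE COUNT(B);
DUMP C;
```

The backslash is doubled because the pattern sits inside a Pig string literal before it reaches the Java regular expression engine.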

How do I get the matching values inside a for loop using FILTER in PIG?

Consider this as my input,
Input (File1):
12345;11
34567;12
.
.
Input (File2):
11;(1,2,3,4,5,6,7,8,9)
12;(9,8,7,6,5,4,3,2,1)
.
.
I would like to get the output as follows:
Output:
(1,2,3,4,5,6,7,8,9)
(9,8,7,6,5,4,3,2,1)
Here's the sample code which I have tried using FILTER and I face some errors with this. Please suggest me some other options.
data1 = load '/File1' using PigStorage(';') as (id,number);
data2 = load '/File2' using PigStorage(';') as (numberInfo, collection);
out = foreach data1{
Data_filter = FILTER data2 by (numberInfo matches CONCAT(number,''));
generate Data_filter;
}
Is it possible to do this inside a for loop? Please let me know. Thanks in advance!
There are no for loops in Apache Pig, if you need to iterate through each row of the data for some specific purpose you need to implement your own UDF. The foreach keyword is not used to create a loop, it is used to transform your data based on your columns, applying UDFs to it. You can also use a nested foreach, where you perform operations over each group in your relation.
However, your syntax is wrong. You are trying to use a nested foreach without grouping your data first. What a nested foreach does, is perform the operations you define in the block of code over a grouped relation. Therefore, the only way your code could work is by grouping the data first:
data1 = load '/File1' using PigStorage(';') as (id,number);
data2 = load '/File2' using PigStorage(';') as (numberInfo, collection);
data1 = group data1 by id;
out = foreach data1{
Data_filter = FILTER data2 by (numberInfo matches CONCAT(number,''));
generate Data_filter;
}
However, this won't work because inside a nested foreach you cannot refer to a different relation like data2.
What you really want, is a JOIN operation over both relations using number for data1 and numberInfo for data2. This will give you this:
joined_data = join data1 by number, data2 by numberInfo;
dump joined_data;
(12345,11,11,(1,2,3,4,5,6,7,8,9))
(34567,12,12,(9,8,7,6,5,4,3,2,1))
In your question you said you only wanted as output the last column, so now you can use a foreach to generate the column you want:
final_data = foreach joined_data generate data2::collection;
dump final_data;
((1,2,3,4,5,6,7,8,9))
((9,8,7,6,5,4,3,2,1))

How would I make a pig script that only returns fields with entries over a certain length?

The data I have is already fielded, I just want a document that contains two of the fields and even then it only contains an entry if the title field is over a certain length. This is what I have so far.
records = LOAD '$INPUT' USING PigStorage('\t') AS (url:chararray, title:chararray, meta:chararray, copyright:chararray, aboutUSLink:chararray, aboutTitle:chararray, aboutMeta:chararray, contactUSLink:chararray, contactTitle:chararray, contactMeta:chararray, phones:chararray);
E = FOREACH records IF SIZE(title)>10 GENERATE url,title;
STORE E INTO '$OUTPUT/phoneNumbersAndTitles';
Why does the code exit at IF?
You should use FILTER, which selects tuples from a relation based on some condition:
filtered = FILTER records BY SIZE(title) > 10;
E = FOREACH filtered GENERATE url,title;

Pig Latin issue

Please help me out. This is really urgent: the deadline is nearing and I have been stuck on this for two weeks with no result. I am a newbie in Pig Latin.
I have a scenario where I have to filter data from a CSV file.
The CSV is on HDFS and has two columns.
grunt> f1 = load '/user/hduser/file.csv' USING PigStorage(',') AS (conv:chararray, clnt:chararray);
grunt> dump f1;
("first~584544fddf~dssfdf","2001")
("first~4332990~fgdfs4s","2001")
("second~232434334~fgvfd4","1000")
("second~786765~dgbhgdf","1000")
("second~345643~gfdgd43","1000")
I need to extract only the first word before the first '~' sign and concatenate it with the second column value of the CSV file. I also need to group the concatenated results and count the number of identical rows, then create a new CSV file as output, again with two columns: the first column would be the concatenated value and the second column would be the row count.
For example:
("first 2001","2")
("second 1000","3")
and so on.
I have written the code below, but it is just not working. I have used STRSPLIT, which splits the values of the first column of the input CSV file, but I don't know how to extract the first split value.
The code is given below:
convData = LOAD '/user/hduser/file.csv' USING PigStorage(',') AS (conv:chararray, clnt:chararray);
fil = FILTER convData BY conv != '"-1"'; --im using this to filter out the rows that has 1st column as "-1".
data = FOREACH fil GENERATE STRSPLIT($0, '~');
X = FOREACH data GENERATE CONCAT(data.$0,' ',convData.clnt);
Y = FOREACH X GROUP BY X;
Z = FOREACH Y GENERATE COUNT(Y);
var = FOREACH Z GENERATE CONCAT(Y,',',Z);
STORE var INTO '/user/hduser/output.csv' USING PigStorage(',');
STRSPLIT returns a tuple, the individual elements of which you can access using the numbered syntax. This is what you need:
data = FOREACH fil GENERATE STRSPLIT($0, '~') AS a, clnt;
X = FOREACH data GENERATE CONCAT(a.$0,' ', clnt);
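The grouping and counting part of the question can be sketched along the same lines. This is a hedged outline rather than a verified script: the aliases split_data, keyed, grouped and counts and the field names prefix, rest, key and n are invented here, CONCAT is nested so it does not depend on the multi-argument form, and the double quotes seen in the sample data are left as they are.

```pig
convData = LOAD '/user/hduser/file.csv' USING PigStorage(',') AS (conv:chararray, clnt:chararray);
fil = FILTER convData BY conv != '"-1"';
-- Split once on the first '~' and keep the leading word as 'prefix'.
split_data = FOREACH fil GENERATE FLATTEN(STRSPLIT(conv, '~', 2)) AS (prefix:chararray, rest:chararray), clnt;
-- Build the concatenated key, e.g. prefix + ' ' + clnt.
keyed = FOREACH split_data GENERATE CONCAT(CONCAT(prefix, ' '), clnt) AS key;
-- Group identical keys and count the rows in each group.
grouped = GROUP keyed BY key;
counts = FOREACH grouped GENERATE group AS key, COUNT(keyed) AS n;
STORE counts INTO '/user/hduser/output' USING PigStorage(',');
```

Note that STORE writes a directory of part files rather than a single CSV file, so the output path here is a directory name.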
