PIG replace for multiple columns - hadoop

I have a total of about 150 columns and want to search for \t and replace it with spaces
A = LOAD 'db.table' USING org.apache.hcatalog.pig.HCatLoader();
B = GROUP A ALL;
C = FOREACH B GENERATE REPLACE(B, '\\t', ' ');
STORE C INTO 'location';
This produces only the word 'all' as output — REPLACE here is not being applied to the individual columns.
Is there a better way to replace tabs in all columns at once?
Thank you
Nivi

You could do this with a Python UDF. Say you had some data like this with tabs in it:
Data:
hi there friend,whats up,nothing much
yo yo yo,green eggs, ham
You could write this in Python
UDF:
@outputSchema("datums:{(no_tabs:chararray)}")
def remove_tabs(columns):
    try:
        out = [tuple(map(lambda s: s.replace("\t", " "), x)) for x in columns]
        return out
    except:
        return [(None,)]
and then in Pig
Query:
REGISTER 'remove_tabs.py' USING jython AS udf;
data = LOAD 'toy_data' USING PigStorage(',') AS (col0:chararray, col1:chararray, col2:chararray);
grpd = GROUP data all;
A = FOREACH grpd GENERATE FLATTEN(udf.remove_tabs(data));
DUMP A;
Output:
(hi there friend,whats up,nothing much)
(yo yo yo,green eggs,ham)
Obviously you have more than three columns, but since you are grouping all the rows together, the script generalizes to any number of columns.
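The UDF body is ordinary Python, so you can sanity-check the tab-replacement logic outside Pig. A minimal sketch, modeling the bag as a list of tuples (the sample tuples here are made up):

```python
# Replace tabs with spaces in every field of every tuple of a bag,
# mirroring what the Jython UDF does inside Pig.
def remove_tabs(columns):
    try:
        return [tuple(s.replace("\t", " ") for s in row) for row in columns]
    except (AttributeError, TypeError):
        return [(None,)]

bag = [("hi\tthere\tfriend", "whats up", "nothing much"),
       ("yo\tyo\tyo", "green eggs", "ham")]
print(remove_tabs(bag))
```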

dataset for hadoop

Here is my Pig code. When I try to execute it I see errors, and I am unable to debug it. Can anyone help me debug the code? Please post your answer with the input and output you used.
The problem with this dataset is that it uses multiple characters as a delimiter ("::"). In Pig's PigStorage you can't use a multi-character delimiter. To solve this problem you have three options:
1. Use the REGEX_EXTRACT_ALL built-in function (you need to write a regex for this input)
2. Write a custom UDF
3. Replace the multi-character delimiter with a single-character delimiter (this is very simple)
I downloaded the dataset from http://www.grouplens.org/datasets/movielens/ and tried option 3:
1. Go to your input folder /home/bigdata/sample/inputs/
2. Run these sed commands
>> sed 's/::/$/g' movies.dat > Testmovies.dat
>> sed 's/::/$/g' ratings.dat > Testratings.dat
>> sed 's/::/$/g' users.dat > Testusers.dat
This converts the multi-character delimiter '::' to the single-character delimiter '$'. I chose '$' as the delimiter because it does not appear in any of the three files.
3. Now load the new input files(Testmovies.dat,Testratings.dat,Testusers.dat) in the pig script using '$' as a delimiter
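If you'd rather not shell out to sed, the same rewrite is a one-liner in plain Python (the sample line below is illustrative, not taken from the dataset):

```python
# Convert the MovieLens '::' delimiter to '$' so PigStorage('$') can parse it.
def convert_delimiters(line, old="::", new="$"):
    return line.replace(old, new)

sample = "1::Toy Story (1995)::Animation|Children's|Comedy"
print(convert_delimiters(sample))
# A file-level version would read movies.dat line by line and write Testmovies.dat.
```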
Modified Pig Script:
-- filtering action and war movies
A = LOAD 'Testmovies.dat' USING PigStorage('$') as (MOVIEID: chararray, TITLE: chararray, GENRE: chararray);
B = filter A by ((GENRE matches '.*Action.*') AND (GENRE matches '.*War.*'));
-- finding action and war movie ratings
C = LOAD 'Testratings.dat' USING PigStorage('$') as (UserID: chararray, MovieID: chararray, Rating: int, Timestamp: chararray);
D = JOIN B by $0, C by MovieID;
-- calculating avg
E = group D by $0;
F = foreach E generate group as mvId, AVG(D.Rating) as avgRating;
-- finding max avg-rating
G = group F ALL;
H = FOREACH G GENERATE MAX(F.$1) AS avgMax;
-- finding max avg-rated movie
I = FILTER F BY (float)avgRating == (float)H.avgMax;
-- filtering female users age between 20-30
J = LOAD 'Testusers.dat' USING PigStorage('$') as (UserID: chararray, Gender: chararray, Age: int, Occupation: chararray, Zip: chararray);
K = filter J by ((Gender == 'F') AND (Age >= 20 AND Age <= 30));
L = foreach K generate UserID;
-- finding filtered female users rated movies
M = JOIN L by $0, C by UserID;
-- finding filtered female users who rated highest rated action and war movies
N = JOIN I by $0, M by $2;
-- finding distinct female users
O = foreach N generate $2 as User;
Q1 = DISTINCT O;
DUMP Q1;
Sample Output:
(5763)
(5785)
(5805)
(5808)
(5812)
(5825)
(5832)
(5852)
(5869)
(5878)
(5920)
(5955)
(5972)
(5974)
(6009)
(6036)
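The avg-then-max steps (relations E through I in the script) reduce to a simple computation: average rating per movie, take the global maximum, keep the movie(s) that match. A Python sketch with made-up ratings:

```python
# Same shape as the Pig script: E/F (group + AVG), G/H (MAX over averages),
# I (FILTER avg == avgMax). Movie IDs and ratings below are invented.
ratings = {"2028": [5, 4, 5], "1210": [4, 4], "110": [5, 5, 5]}
avgs = {movie: sum(r) / len(r) for movie, r in ratings.items()}
avg_max = max(avgs.values())
top = sorted(m for m, a in avgs.items() if a == avg_max)
print(top)
```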

Count and find maximum number in Hadoop using pig

I have a table containing sample CDR data, in which column A and column B hold the calling and called phone numbers.
I need to find who made the maximum number of calls (column A),
and also which number (column B) was called the most.
The table structure is like below:
calling called
889578226 77382596
889582256 77382596
889582256 7736368296
7785978214 782987522
In the above table, 889578226 has the most outgoing calls and 77382596 is the most-called number; I need to get the output in that form.
In Hive I run it like below:
SELECT calling_a,called_b, COUNT(called_b) FROM cdr_data GROUP BY calling_a,called_b;
What would be the equivalent code for the above query in Pig?
Anas, could you please let me know whether this is what you are expecting, or something different?
input.txt
a,100
a,101
a,101
a,101
a,103
b,200
b,201
b,201
c,300
c,300
c,301
d,400
PigScript:
A = LOAD 'input.txt' USING PigStorage(',') AS (name:chararray, phone:long);
B = GROUP A BY (name,phone);
C = FOREACH B GENERATE FLATTEN(group),COUNT(A) AS cnt;
D = GROUP C BY $0;
E = FOREACH D {
SortedList = ORDER C BY cnt DESC;
top = LIMIT SortedList 1;
GENERATE FLATTEN(top);
}
DUMP E;
Output:
(a,101,3)
(b,201,2)
(c,300,2)
(d,400,1)
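The grouping logic above (count per (name, phone) pair, then keep the highest-count pair per name) can be checked in plain Python against the same input:

```python
from collections import Counter

rows = [("a", 100), ("a", 101), ("a", 101), ("a", 101), ("a", 103),
        ("b", 200), ("b", 201), ("b", 201),
        ("c", 300), ("c", 300), ("c", 301),
        ("d", 400)]

# Count each (name, phone) pair, then keep the most frequent phone per name,
# mirroring GROUP BY (name,phone) + COUNT, then ORDER/LIMIT 1 per group.
counts = Counter(rows)
top = {}
for (name, phone), cnt in counts.items():
    if name not in top or cnt > top[name][1]:
        top[name] = (phone, cnt)
for name in sorted(top):
    print((name,) + top[name])
```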

Why is my PIG script failing?

I'm running a PIG script where I join 2 datasets. I individually verified that the two datasets I'm joining have the expected schemas, and when I dump the 2 datasets to output, I can see the data.
The strange issue I'm having is that my PIG job fails 9 out of 10 times, but succeeds once in a while. The only error message I get is "Internal error creating job configuration", and I get it when I try to perform the JOIN. Unfortunately, I don't have CLI access to the cluster, so I can't get a more detailed error message (jobs are submitted through a REST API). The Pig version is 0.8.
The 2 datasets are fairly small and can be held in memory. So I'm using replicated option. I also tried it without the replicated option. It fails again.
Any idea what's happening and how I can solve this problem?
SET mapred.map.tasks.speculative.execution false;
X = LOAD 'data1' using ....; -- using custom loader
P = LOAD 'data2' using ....; -- using custom loader
-- dataset1
Y = FILTER X BY type MATCHES 'Info' AND name MATCHES 'someregex';
Z = GROUP Y BY name;
A = FOREACH Z GENERATE group, $1.data;
B = FOREACH A GENERATE $0, FLATTEN(udf1($1));
C = GROUP B ALL;
D = FOREACH C GENERATE FLATTEN(udf2($1)) as (metric1:chararray, count:double, avg: double);
--dataset2
Q = FILTER P BY type MATCHES 'Info' AND name MATCHES 'someother_regex';
R = FOREACH Q GENERATE REPLACE(name,'Percent','') , udf3(records);
S = GROUP R BY $0;
T = FOREACH S GENERATE $0, FLATTEN(udf4(R.$1)) as (q1: double, q2: double, q3: double, q4: double);
-- Each tuple in T is like this: (metric2:chararray, q1: double, q2: double, q3: double, q4: double)
----join the two dataset
I = JOIN D BY $0, T BY $0 USING 'replicated';
STORE I into '$output' using PigStorage(',');
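For context, USING 'replicated' asks Pig for a fragment-replicate join: the smaller relation is loaded into an in-memory hash table and probed for each record of the other. A rough sketch of the idea (toy tuples, not the custom loaders above):

```python
# Fragment-replicate (map-side) join sketch: build a hash table from the
# small relation, then stream the large one and probe by join key ($0).
small = [("metricA", 10.0), ("metricB", 20.0)]            # held in memory
large = [("metricA", 1.0, 2.0), ("metricC", 3.0, 4.0)]    # streamed

table = {}
for key, *rest in small:
    table.setdefault(key, []).append(tuple(rest))

joined = [row + match
          for row in large
          for match in table.get(row[0], [])]
print(joined)
```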

pig how to concat columns into single string

I have a column of strings that I load using Pig:
A
B
C
D
how do I convert this column into a single string like this?
A,B,C,D
You are going to have to first GROUP ALL to put everything into one bag, then join the contents of the bag together using a UDF. Something like this:
#!/usr/bin/python
# myudfs.py

@outputSchema('concated: string')
def concat_bag(BAG):
    return ','.join(BAG)
Register 'myudfs.py' using jython as myfuncs;
A = LOAD 'myfile.txt' AS (letter:chararray) ;
B = GROUP A ALL ;
C = FOREACH B GENERATE myfuncs.concat_bag(A.letter) AS all_letters ;
If your file/schema contains multiple columns, you are probably going to want to project out the column you want to generate the string for. Something like:
A0 = LOAD 'myfile.txt' AS (letter:chararray, val:int, extra:chararray) ;
A = FOREACH A0 GENERATE letter ;
This way you are not keeping around extra columns that will slow down an already expensive operation.
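Outside Pig, the UDF's job is just a comma join over the bag's values; a small Python sketch, with the bag modeled as a list of one-field tuples (roughly how Pig passes it):

```python
# Join the single field of each tuple in a bag with commas.
def concat_bag(bag):
    return ",".join(t[0] for t in bag)

bag = [("A",), ("B",), ("C",), ("D",)]
print(concat_bag(bag))
```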

Pig Latin issue

Please help me out; it's really urgent (deadline nearing) and I've been stuck on this for two weeks, breaking my head with no result. I am a newbie in Pig Latin.
I have a scenario where I have to filter data from a CSV file.
The CSV is on HDFS and has two columns.
grunt>> fl = load '/user/hduser/file.csv' USING PigStorage(',') AS (conv:chararray, clnt:chararray);
grunt>> dump fl;
("first~584544fddf~dssfdf","2001")
("first~4332990~fgdfs4s","2001")
("second~232434334~fgvfd4","1000")
("second~786765~dgbhgdf","1000")
("second~345643~gfdgd43","1000")
What I need to do is extract only the first word before the first '~' and concatenate it with the second column's value. I also need to group the concatenated results, count the number of rows in each group, and create a new CSV as output with two columns: the concatenated value and the row count.
i.e
("first 2001","2")
("second 1000","3")
and so on.
I have written the code below but it's just not working. I have used STRSPLIT; it splits the values of the first column of the input CSV file, but I don't know how to extract the first split value.
The code is given below:
convData = LOAD '/user/hduser/file.csv' USING PigStorage(',') AS (conv:chararray, clnt:chararray);
fil = FILTER convData BY conv != '"-1"'; --im using this to filter out the rows that has 1st column as "-1".
data = FOREACH fil GENERATE STRSPLIT($0, '~');
X = FOREACH data GENERATE CONCAT(data.$0,' ',convData.clnt);
Y = FOREACH X GROUP BY X;
Z = FOREACH Y GENERATE COUNT(Y);
var = FOREACH Z GENERATE CONCAT(Y,',',Z);
STORE var INTO '/user/hduser/output.csv' USING PigStorage(',');
STRSPLIT returns a tuple, the individual elements of which you can access using the numbered syntax. This is what you need:
data = FOREACH fil GENERATE STRSPLIT($0, '~') AS a, clnt;
X = FOREACH data GENERATE CONCAT(a.$0,' ', clnt);
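End to end (strip quotes, split on '~', concatenate with the second column, group, count), the transformation can be checked in plain Python on the sample rows (quote handling simplified here):

```python
from collections import Counter

rows = [('"first~584544fddf~dssfdf"', "2001"),
        ('"first~4332990~fgdfs4s"', "2001"),
        ('"second~232434334~fgvfd4"', "1000"),
        ('"second~786765~dgbhgdf"', "1000"),
        ('"second~345643~gfdgd43"', "1000")]

def make_key(conv, clnt):
    # First word before the first '~' (quotes stripped), joined with column 2.
    first = conv.strip('"').split('~')[0]
    return first + ' ' + clnt

# Group by the concatenated key and count rows per group.
counts = Counter(make_key(conv, clnt) for conv, clnt in rows)
for key, cnt in counts.items():
    print((key, cnt))
```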
