dataset for hadoop - hadoop

Here I put the pig code . when I try to execute this code I see errors. I am unable to debug it. Can any one help me in debugging the code. PLease post the answer with their input and output result.
I would request people to answer the problem with their inputs and the output result. please.

The problem with this dataset is multiple characters as a delimiter "::". In pig you can't use multiple characters as a delimiter. To solve this problem you have 3 options
1. Use REGEX_EXTRACT_ALL build-in function(need to write regex for this input)
2. Write custom UDF
3. Replace the multiple character delimiter to single character delimiter(This is very simple).
I downloaded the dataset from this site http://www.grouplens.org/datasets/movielens/ and tried with option 3
1. Go to your input folder /home/bigdata/sample/inputs/
2. Run this sed command
>> sed 's/::/$/g' movies.dat > Testmovies.dat
>> sed 's/::/$/g' ratings.dat > Testratings.dat
>> sed 's/::/$/g' users.dat > Testusers.dat
This will convert the multiple character delimiter '::' to single character delimiter '$'. I chosen '$' as delimiter bcoz in all the three files '$' is not present.
3. Now load the new input files(Testmovies.dat,Testratings.dat,Testusers.dat) in the pig script using '$' as a delimiter
Modified Pig Script:
-- filtering action and war movies
A = LOAD 'Testmovies.dat' USING PigStorage('$')as (MOVIEID: chararray,TITLE:chararray,GENRE: chararray);
B = filter A by ((GENRE matches '.*Action.*') AND (GENRE matches '.*War.*'));
-- finding action and war movie ratings
C = LOAD 'Testratings.dat' USING PigStorage('$')as (UserID: chararray, MovieID:chararray, Rating: int, Timestamp: chararray);
D = JOIN B by $0, C by MovieID;
-- calculating avg
E = group D by $0;
F = foreach E generate group as mvId, AVG(D.Rating) as avgRating;
-- finding max avg-rating
G = group F ALL;
H = FOREACH G GENERATE MAX(F.$1) AS avgMax;
-- finding max avg-rated movie
I = FILTER F BY (float)avgRating == (float)H.avgMax;
-- filtering female users age between 20-30
J = LOAD 'Testusers.dat' USING PigStorage('$') as (UserID: chararray, Gender: chararray, Age: int, Occupation: chararray, Zip: chararray);
K = filter J by ((Gender == 'F') AND (Age >= 20 AND Age <= 30));
L = foreach K generate UserID;
-- finding filtered female users rated movies
M = JOIN L by $0, C by UserID;
-- finding filtered female users who rated highest rated action and war movies
N = JOIN I by $0, M by $2;
-- finding distinct female users
O = foreach N generate $2 as User;
Q1 = Distinct O;
DUMP Q1;
Sample Output:
(5763)
(5785)
(5805)
(5808)
(5812)
(5825)
(5832)
(5852)
(5869)
(5878)
(5920)
(5955)
(5972)
(5974)
(6009)
(6036)

Related

Hadoop Pig Max Command

I have one file that contains data of all the countries from all over the world.
I want to find out the country which has maximum airport.
I have written below code:
A = load 'airports.dat' USING PigStorage (',') AS(AirportID:int,Name:chararray,City:chararray,Country:chararray,IATA:chararray,IATAothers:chararray,Latitude:float,Longitude:float,Altitude:float,Timezone:float,DST:chararray,Zone:chararray);
B= GROUP A BY Country;
C= FOREACH B GENERATE A.Country, COUNT(A) AS Count;
but after this I am not getting how to find the maximum.
Can anybody please help.
You have created the number of airports per country. What you need to do now, is take the row with the highest number:
D = order C by $1 DESC;
E = limit D 1;
dump E;

Replace multiple space with single space, Remove " and leading 0 using APACHE PIG

Need help with PIG
A = load 'input.txt' as (line:chararray);
B = foreach A generate FLATTEN(TOBAG(*));
C = FOREACH B GENERATE REPLACE(($0, '\\s+', ' ')
Need help on the last line to Replace multiple space with single space, Remove " (Quotes) and leading 00 using APACHE PIG
Note:- Approach should NOT be field specific as there are more than 70 fields,
Basically, expecting help with REPLACE or STRSTRING OR REGEX function which can perfomr mentioned operations on a line.
Input.txt
00595, ab 000cdef california "state, 00USA
00733, 0ds ds "ARIZONA 00state, USA
Expected Output
595, ab cdef califormia state, USA
733, ds ds ARIZONA state, USA
You can use REPLACE function in Pig to do the cleaning and loading as INT will remove the leading zeros from the number.
A = LOAD '/usr/pigfiles/pigo.txt' using PigStorage(',') as (value: INT, state: chararray, country: chararray);
B = FOREACH A GENERATE value,REPLACE(REPLACE(state,' ', ' ' ),'\\"',''), country;
DUMP B;
You cannot use nested REPLACE function within same loop. You have to do a series of operation on your data to get your desired result.
Try below code on your data. it is working well on sample you provided.
*a = LOAD 'ip.txt' USINGTextLoader();*
*b = FOREACH a GENERATE REPLACE($0,'\\s+',' ');*
*c = FOREACH b GENERATE REPLACE($0,'"','');*
*d = FOREACH c GENERATE REPLACE($0,'\\s+0+',' ');*

PIG replace for multiple columns

I have a total of about 150 columns and want to search for \t and replace it with spaces
A = LOAD 'db.table' USING org.apache.hcatalog.pig.HCatLoader();
B = GROUP A ALL;
C = FOREACH B GENERATE REPLACE(B, '\\t', ' ');
STORE C INTO 'location';
This output is producing ALL the only word as output.
Is there a better way to replace all columns at once??
Thank you
Nivi
You could do this with a Python UDF. Say you had some data like this with tabs in it:
Data:
hi there friend,whats up,nothing much
yo yo yo,green eggs, ham
You could write this in Python
UDF:
#outputSchema("datums:{(no_tabs:chararray)}")
def remove_tabs(columns):
try:
out = [tuple(map(lambda s: s.replace("\t", " "), x)) for x in columns]
return out
except:
return [(None)]
and then in Pig
Query:
REGISTER 'remove_tabs.py' USING jython AS udf;
data = LOAD 'toy_data' USING PigStorage(',') AS (col0:chararray,
, col1:chararray, col2:chararray);
grpd = GROUP data all;
A = FOREACH grpd GENERATE FLATTEN(udf.remove_tabs(data));
DUMP A;
Output:
(hi there friend,whats up,nothing much)
(yo yo yo,green eggs,ham)
Ovbiously you have more than three columns, but since you are grouping by all, the script should generalize to any number of columns.

FIles comparing field by field in PIG

I have two files like
File 1
id,sal,location,code
1000,1000,jupiter,F
1001,2000,jupiter,F
1002,3000,jupiter,F
1003,4000,jupiter,F
1004,5000,jupiter,F
File 2
id,sal,location,code
1000,2000,jupiter,F
1001,2000,jupiter,Z
1002,3000,jupiter,F
1003,4000,jupiter,F
1004,5000,jupiter,F
When I compare file1 with file 2, I need a output like
1000, sal
1001,code
Basically, it should tell me what field is changed from the previous file along with the id.
Can this be done in PIG.
You can easily solve this problem but the challenging part will the output format as you mentioned. It requires little bit complex logic to get the output format.
I have fixed most of the edge cases but you can check with your input to make sure that it works for all combinations.
file1:
1000,1000,jupiter,F
1001,2000,jupiter,F
1002,3000,jupiter,F
1003,4000,jupiter,F
1004,5000,jupiter,F
file2:
1000,2000,jupiter,F
1001,2000,jupiter,Z
1002,3000,jupiter,F
1003,4000,jupiter,F
1004,5000,jupiter,F
PigScript:
A = LOAD 'file1' USING PigStorage(',') AS (id,sal,location,code);
B = LOAD 'file2' USING PigStorage(',') AS (id,sal,location,code);
C = JOIN A BY id,B BY id;
D = FOREACH C GENERATE A::id AS id,((A::sal == B::sal)?'':'sal') AS sal,
((A::location == B::location)?'':'location') AS location,
((A::code == B::code)?'':'code') AS code;
--Remove the common fields between two files
E = FILTER D BY NOT (sal=='' AND location=='' AND code=='');
--The below two lines are used to formatting the output
F = FOREACH E GENERATE id,REPLACE(BagToString(TOBAG(sal,location,code),','),'(,,$|,$)','') As finalOutput;
G = FOREACH F GENERATE id,REPLACE(finalOutput,',,',',');
DUMP G;
Output:
(1000,sal)
(1001,code)

Count and find maximum number in Hadoop using pig

I have a table which contain sample CDR data in that column A and column B having calling person and called person mobile number
I need to find whose having maximum number of calls made(column A)
and also need to find to which number(column B) called most
the table structure is like below
calling called
889578226 77382596
889582256 77382596
889582256 7736368296
7785978214 782987522
in the above table 889578226 have most number of outgoing calls and 77382596 is most called number in such a way need to get the output
in hive i run like below
SELECT calling_a,called_b, COUNT(called_b) FROM cdr_data GROUP BY calling_a,called_b;
what might be the equalent code for the above query in pig?
Anas, Could you please let me know this is what you are expecting or something different?
input.txt
a,100
a,101
a,101
a,101
a,103
b,200
b,201
b,201
c,300
c,300
c,301
d,400
PigScript:
A = LOAD 'input.txt' USINg PigStorage(',') AS (name:chararray,phone:long);
B = GROUP A BY (name,phone);
C = FOREACH B GENERATE FLATTEN(group),COUNT(A) AS cnt;
D = GROUP C BY $0;
E = FOREACH D {
SortedList = ORDER C BY cnt DESC;
top = LIMIT SortedList 1;
GENERATE FLATTEN(top);
}
DUMP E;
Output:
(a,101,3)
(b,201,2)
(c,300,2)
(d,400,1)

Resources