Hadoop Pig Filter

I have an input file like this:
481295b2-30c7-4191-8c14-4e513c7e7577,1362974399,56973118825,56950298471,true
67912962-dd84-46fa-84ef-a2fba12c2423,1362974399,56950556676,56982431507,false
cc68e779-4798-405b-8596-c34dfb9b66da,1362974399,56999223677,56998032823,true
37a1cc9b-8846-4cba-91dd-19e85edbab00,1362974399,56954667454,56981867544,false
4116c384-3693-4909-a8cc-19090d418aa5,1362974399,56986027804,56978169216,true
I only need the lines where the last field is "true". So I use the following Pig Latin:
records = LOAD 'test/test.csv' USING PigStorage(',');
A = FILTER records BY $4 'true';
DUMP A;
The problem is with the second command; I always get this error:
2013-08-07 16:48:11,505 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 2, column 25> mismatched input ''true'' expecting SEMI_COLON
Why? I also tried "$4 == 'true'", but that still doesn't work. Could anyone tell me how to do this simple thing?

How about:
A = FILTER records BY $4 == 'true';
Also, if you know how many fields the data will have beforehand, you should give it a schema. Something like:
records = LOAD 'test/test.csv' USING PigStorage(',')
AS (val1: chararray, val2: int, val3: long, val4: long, bool: chararray);
Or whatever names/types fit your needs (note that values like 56973118825 overflow int, hence long for the third and fourth columns).
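Putting the two together, a minimal sketch (these field names are placeholders, not from the original post):
records = LOAD 'test/test.csv' USING PigStorage(',')
AS (id: chararray, ts: long, num1: long, num2: long, flag: chararray);
A = FILTER records BY flag == 'true';
DUMP A;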

Related

Apache Pig fetch max from a data-set that has Groups

I have a data-set stored in HDFS in a file called temp.txt, which is as follows:
US,Arizona,51.7
US,California,56.7
US,Bullhead City,51.1
India,Jaisalmer,42.4
Libya,Aziziya,57.8
Iran,Lut Desert,70.7
India,Banda,42.4
Now, I load this into Pig through the following command:
temp_input = LOAD '/WC/temp.txt' USING PigStorage(',') as
(country:chararray,city:chararray,temp:double);
After this, I grouped all the data in temp_input as:
group_country = GROUP temp_input BY country;
When I dump the data in group_country, the following output is displayed on screen:
(US,{(US,Bullhead City,51.1),(US,California,56.7),(US,Arizona,51.7)})
(Iran,{(Iran,Lut Desert,70.7)})
(India,{(India,Banda,42.4),(India,Jaisalmer,42.4)})
(Libya,{(Libya,Aziziya,57.8)})
Once the data-set is grouped, I tried to fetch the country name and the maximum temperature for each country in group_country through the following query:
max_temp = foreach group_country generate group,max(temp);
This punches out an error which looks like:
2017-06-21 13:20:34,708 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR
1070: Could not resolve max using imports: [, java.lang.,
org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Details at logfile: /opt/ecosystems/pig-0.16.0/pig_1498026994684.log
What should be my next move to resolve this error and fetch the required result?
All help is appreciated.
While transforming relations in Pig, use describe relationname; this shows the schema of the relation and helps you see how to iterate over it. So in your case:
describe group_country;
should give you an output like:
group_country: {group: chararray,temp_input: {(country: chararray,city: chararray,temp: double)}}
Then the query (note that built-in function names like MAX are case-sensitive, which is why max could not be resolved, and that after a GROUP the field must be referenced through the bag, as temp_input.temp):
max_temp = foreach group_country GENERATE group,MAX(temp_input.temp);
Output:
(US,56.7)
(Iran,70.7)
(India,42.4)
(Libya,57.8)
Updated as per comment:
finaldata = foreach group_country {
    orderedset = order temp_input by temp DESC;
    maxtemps = limit orderedset 1;
    generate flatten(maxtemps);
};
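With the sample data above, this variant keeps the whole row rather than just the maximum value, so the dump should look something like this (group order may vary, and for India either of the two 42.4 rows can win the tie):
(US,California,56.7)
(Iran,Lut Desert,70.7)
(India,Banda,42.4)
(Libya,Aziziya,57.8)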

Pig - Unable to DUMP data

I have two datasets, one for movies and the other for ratings.
The movies data looks like:
MovieID#Title#Genre
1#Toy Story (1995)#Animation|Children's|Comedy
2#Jumanji (1995)#Adventure|Children's|Fantasy
3#Grumpier Old Men (1995)#Comedy|Romance
The ratings data looks like:
UserID#MovieID#Ratings#RatingsTimestamp
1#1193#5#978300760
1#661#3#978302109
1#914#3#978301968
My script is as follows:
1) movies_data = LOAD '/user/admin/MoviesDataset/movies_new.dat' USING PigStorage('#') AS (movieid:int,
moviename:chararray,moviegenere:chararray);
2) ratings_data = LOAD '/user/admin/RatingsDataset/ratings_new.dat' USING PigStorage('#') AS (Userid:int,
movieid:int,ratings:int,timestamp:long);
3) moviedata_ratingsdata_join = JOIN movies_data BY movieid, ratings_data BY movieid;
4) moviedata_ratingsdata_join_group = GROUP moviedata_ratingsdata_join BY movies_data.movieid;
5) moviedata_ratingsdata_averagerating = FOREACH moviedata_ratingsdata_join_group GENERATE group,
AVG(moviedata_ratingsdata_join.ratings) AS Averageratings, (moviedata_ratingsdata_join.Userid) AS userid;
6) DUMP moviedata_ratingsdata_averagerating;
I am getting this error
2017-03-25 06:46:50,332 [main] ERROR org.apache.pig.tools.pigstats.PigStats - ERROR 0: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: moviedata_ratingsdata_join_group: Local Rearrange[tuple]{int}(false) - scope-95 Operator Key: scope-95): org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: moviedata_ratingsdata_averagerating: New For Each(false,false)[bag] - scope-83 Operator Key: scope-83): org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (1,Toy Story (1995),Animation|Children's|Comedy), 2nd :(2,Jumanji (1995),Adventure|Children's|Fantasy) (common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar" )
If I remove line 6, the script executes successfully.
Why can't I DUMP the relation generated in line 5?
Use the disambiguate operator ( :: ) to identify field names after JOIN, COGROUP, CROSS, or FLATTEN operators.
Relations movies_data and ratings_data both have a column movieid. When forming the relation moviedata_ratingsdata_join_group, use the :: operator to identify which movieid column to use for the GROUP.
So your 4) would look like:
4) moviedata_ratingsdata_join_group = GROUP moviedata_ratingsdata_join BY movies_data::movieid;
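With that one change the rest of the script should run as-is. The reason for the error: movies_data.movieid in 4) is parsed as a scalar projection of the separate relation movies_data (one value expected, many rows found), whereas movies_data::movieid names the joined column. A minimal sketch of the corrected tail, reusing the question's relation names (Userid dropped from 5) for brevity; keeping it yields a bag of all user ids per movie):
4) moviedata_ratingsdata_join_group = GROUP moviedata_ratingsdata_join BY movies_data::movieid;
5) moviedata_ratingsdata_averagerating = FOREACH moviedata_ratingsdata_join_group GENERATE group,
AVG(moviedata_ratingsdata_join.ratings) AS Averageratings;
6) DUMP moviedata_ratingsdata_averagerating;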

Pig script filtering a file gets an ERROR

I am new to Pig. I'm trying to filter the text file and store it in HBase. Here is the sample input file:
sample.txt
{"pattern":"google_1473491793_265244074740","tweets":[{"tweet::created_at":"18:47:31 ","tweet::id":"252479809098223616","tweet::user_id":"450990391","tweet::text":"rt #joey7barton: ..give a google about whether the americans wins a ryder cup. i mean surely he has slightly more important matters. #fami ..."}]}
{"pattern":"facebook_1473491793_265244074740","tweets":[{"tweet::created_at":"11:33:16 ","tweet::id":"252370526411051008","tweet::user_id":"845912316","tweet::text":"#maarionymcmb facebook mere ta dit tu va resté chez toi dnc tu restes !"}]}
Script:
data = load 'sample.txt' using JsonLoader('pattern:chararray, tweets: bag {t1:tuple(tweet::created_at: chararray,tweet::id: chararray,tweet::user_id: chararray,tweet::text: chararray)}');
A = FILTER data BY pattern == 'google_*';
grouped = foreach (group A by pattern){tweets1 = foreach data generate tweets.(created_at),tweets.(id),tweets.(user_id),tweets.(text); generate group as pattern1,tweets1;}
But I got this error when running grouped:
2016-09-10 13:38:52,995 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse: <line 41, column 57> expression is not a project expression: (Name: ScalarExpression) Type: null Uid: null)
In the nested FOREACH you can't reference 'tweets'; you need to go through 'A'. See my example below.
grouped = FOREACH (GROUP A BY pattern) {
    GENERATE group AS pattern, A.created_at, A.id, A.user_id, A.text;
};
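As an aside, the FILTER in the question compares pattern against the literal string 'google_*'; == does no wildcard matching, so nothing passes the filter. To match the prefix you likely want a regular expression instead:
A = FILTER data BY pattern matches 'google_.*';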

Null error for nonempty data using Generate in Pig/Hadoop

I am working on Task 2 in this link:
https://sites.google.com/site/hadoopbigdataoverview/certification-practice-exam
I used the code below
a = load '/user/horton/flightdelays/flight_delays1.csv' using PigStorage(',');
dump a;
a_top = limit a 5;
a_top shows the first 5 rows, and there are non-null values for Year in each.
Then I type:
a_clean = filter a BY NOT ($4=='NA');
aa = foreach a_clean generate a_clean.Year;
But that gives the error
ERROR 1200: null
What is wrong with this?
EDIT: I also tried
a = load '/user/horton/flightdelays/flight_delays1.csv' using PigStorage(',') AS (Year:chararray,Month:chararray,DayofMonth:chararray,DayOfWeek:chararray,DepTime:chararray,CRSDepTime:chararray,ArrTime:chararray,CRSArrTime:chararray,UniqueCarrier:chararray,FlightNum:chararray,TailNum:chararray,ActualElapsedTime:chararray,CRSElapsedTime:chararray,AirTime:chararray,ArrDelay:chararray,DepDelay:chararray,Origin:chararray,Dest:chararray,Distance:chararray,TaxiIn:chararray,TaxiOut:chararray,Cancelled:chararray,CancellationCode:chararray,Diverted:chararray,CarrierDelay:chararray,WeatherDelay:chararray,NASDelay:chararray,SecurityDelay:chararray,LateAircraftDelay:chararray);
and
aa = foreach a_clean generate a_clean.Year
but the error was
ERROR org.apache.pig.tools.pigstats.PigStats - ERROR 0: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay), 2nd :(2008,1,3,4,2003,1955,2211,2225,WN,335,N712SW,128,150,116,-14,8,IAD,TPA,810,4,8,0,,0,NA,NA,NA,NA,NA)
Since you have not specified a schema in the LOAD statement, you will have to refer to the columns by the order in which they occur. Year seems to be the first column, so try this:
a_clean = filter a BY ($4 != 'NA');
aa = foreach a_clean generate $0;
(Projections like a.Year or a_clean.Year are treated as scalar projections of a whole relation, which is exactly what triggers the 'Scalar has more than one row' error.)
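If you instead load the data with the full schema from the EDIT, refer to fields by bare name inside the FOREACH, never through the relation name; a minimal sketch:
a_clean = filter a BY ($4 != 'NA');
aa = foreach a_clean generate Year;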

Pig Latin - adding values from different bags?

I have one file max_rank.txt containing:
1,a
2,b
3,c
and a second file max_rank_add.txt:
d
e
f
My expected result is:
1,a
2,b
3,c
4,d
5,e
6,f
So I want to generate a RANK for the second set of values, but starting from a value greater than the max of the first set.
The beginning of the script probably looks like this:
existing = LOAD 'max_rank.txt' using PigStorage(',') AS (id: int, text : chararray);
new = LOAD 'max_rank_add.txt' using PigStorage() AS (text2 : chararray);
ordered = ORDER existing by id desc;
limited = LIMIT ordered 1;
new_rank = RANK new;
But I have a problem with the last and most important line, which adds the value from limited to rank_new from new_rank.
Can you please give any suggestions?
Regards
Pawel
I've found a solution.
Both of these statements work:
rank_plus_max = foreach new_rank generate flatten(limited.$0 + rank_new), text2;
rank_plus_max = foreach new_rank generate limited.$0 + rank_new, text2;
This DOES NOT work:
rank_plus_max = foreach new_rank generate flatten(limited.$0) + flatten(rank_new);
2014-02-24 10:52:39,580 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 10, column 62> mismatched input '+' expecting SEMI_COLON
Details at logfile: /export/home/pig/pko/pig_1393234166538.log
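For completeness, to produce the expected output the shifted ranks can be appended back to the original relation; a minimal sketch (untested, building on the relations above; the cast keeps the id types consistent for the UNION):
rank_plus_max = foreach new_rank generate (int)(limited.$0 + rank_new) as id, text2 as text;
combined = UNION existing, rank_plus_max;
combined_ordered = ORDER combined BY id;
DUMP combined_ordered;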
