Null error for nonempty data using Generate in Pig/Hadoop

I am working on Task 2 in this link:
https://sites.google.com/site/hadoopbigdataoverview/certification-practice-exam
I used the code below
a = load '/user/horton/flightdelays/flight_delays1.csv' using PigStorage(',');
dump a;
a_top = limit a 5;
Dumping a_top shows the first 5 rows; there are non-null values for Year in each.
Then I type
a_clean = filter a BY NOT ($4=='NA');
aa = foreach a_clean generate a_clean.Year;
But that gives the error
ERROR 1200: null
What is wrong with this?
EDIT: I also tried
a = load '/user/horton/flightdelays/flight_delays1.csv' using PigStorage(',') AS (Year:chararray,Month:chararray,DayofMonth:chararray,DayOfWeek:chararray,DepTime:chararray,CRSDepTime:chararray,ArrTime:chararray,CRSArrTime:chararray,UniqueCarrier:chararray,FlightNum:chararray,TailNum:chararray,ActualElapsedTime:chararray,CRSElapsedTime:chararray,AirTime:chararray,ArrDelay:chararray,DepDelay:chararray,Origin:chararray,Dest:chararray,Distance:chararray,TaxiIn:chararray,TaxiOut:chararray,Cancelled:chararray,CancellationCode:chararray,Diverted:chararray,CarrierDelay:chararray,WeatherDelay:chararray,NASDelay:chararray,SecurityDelay:chararray,LateAircraftDelay:chararray);
and
aa = foreach a_clean generate a_clean.Year
but the error was
ERROR org.apache.pig.tools.pigstats.PigStats - ERROR 0: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay), 2nd :(2008,1,3,4,2003,1955,2211,2225,WN,335,N712SW,128,150,116,-14,8,IAD,TPA,810,4,8,0,,0,NA,NA,NA,NA,NA)

Since you have not specified the schema in the LOAD statement, you have to refer to the columns by the order in which they occur. Also, inside a FOREACH you project fields of the relation directly; writing a_clean.Year (or a.Year) is treated as a scalar dereference of the whole relation, which is what produces the "Scalar has more than one row in the output" error. Year is the first column, so try this:
a_clean = filter a BY ($4 != 'NA');
aa = foreach a_clean generate $0;
If you load with the full schema from your EDIT, you can instead write:
aa = foreach a_clean generate Year;
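For reference, a minimal end-to-end sketch of the schema-based approach (assuming the column layout from the EDIT above; the schema is truncated to the first five columns for brevity, so $4 corresponds to DepTime):

```pig
-- Load with an explicit schema so columns can be referenced by name
a = LOAD '/user/horton/flightdelays/flight_delays1.csv' USING PigStorage(',')
    AS (Year:chararray, Month:chararray, DayofMonth:chararray,
        DayOfWeek:chararray, DepTime:chararray);
-- Filter on the fifth column by name, then project Year directly
-- (no relation prefix inside the FOREACH)
a_clean = FILTER a BY DepTime != 'NA';
aa = FOREACH a_clean GENERATE Year;
DUMP aa;
```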

Related

Apache Pig fetch max from a data-set that has Groups

I have a data-set stored in HDFS in a file called temp.txt which is as follows :
US,Arizona,51.7
US,California,56.7
US,Bullhead City,51.1
India,Jaisalmer,42.4
Libya,Aziziya,57.8
Iran,Lut Desert,70.7
India,Banda,42.4
Now, I load this into Pig through the following command:
temp_input = LOAD '/WC/temp.txt' USING PigStorage(',') as
(country:chararray,city:chararray,temp:double);
After this, I grouped all the data in temp_input:
group_country = GROUP temp_input BY country;
When I dump group_country, the following output is displayed on screen:
(US,{(US,Bullhead City,51.1),(US,California,56.7),(US,Arizona,51.7)})
(Iran,{(Iran,Lut Desert,70.7)})
(India,{(India,Banda,42.4),(India,Jaisalmer,42.4)})
(Libya,{(Libya,Aziziya,57.8)})
Once the data-set is grouped, I tried to fetch the country name and the individual maximum temperature for each group in group_country through the following query:
max_temp = foreach group_country generate group,max(temp);
This throws an error that looks like:
2017-06-21 13:20:34,708 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR
1070: Could not resolve max using imports: [, java.lang.,
org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Details at logfile: /opt/ecosystems/pig-0.16.0/pig_1498026994684.log
What should be my next move to resolve this error and fetch the required result?
All help is appreciated.
When transforming relations in Pig, use describe relation_name; it shows the schema, which tells you how to iterate over the relation. So in your case:
describe group_country;
should give you output like:
group_country: {group: chararray,temp_input: {(country: chararray,city: chararray,temp: double)}}
Also note that Pig function names are case-sensitive: the built-in is MAX, not max, and it operates on a bag of values. Then the query:
max_temp = foreach group_country GENERATE group, MAX(temp_input.temp);
Output:
(US,56.7)
(Iran,70.7)
(India,42.4)
(Libya,57.8)
Updated as per comment:
finaldata = foreach group_country {
orderedset = order temp_input by temp DESC;
maxtemps = limit orderedset 1;
generate flatten(maxtemps);
}
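As another option (my addition, not part of the original answer): Pig's built-in TOP function returns the top-n tuples of a bag by a column index, which avoids the nested ORDER/LIMIT:

```pig
-- TOP(n, column_index, bag): temp is the third field (index 2) of temp_input,
-- so this keeps the single hottest row per country
max_rows = FOREACH group_country GENERATE group, FLATTEN(TOP(1, 2, temp_input));
```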

Pig- Unable to DUMP data

I have two dataset one for movies and other for ratings
Movies Data looks like
MovieID#Title#Genre
1#Toy Story (1995)#Animation|Children's|Comedy
2#Jumanji (1995)#Adventure|Children's|Fantasy
3#Grumpier Old Men (1995)#Comedy|Romance
Ratings Data looks like
UserID#MovieID#Ratings#RatingsTimestamp
1#1193#5#978300760
1#661#3#978302109
1#914#3#978301968
My Script is as follows
1) movies_data = LOAD '/user/admin/MoviesDataset/movies_new.dat' USING PigStorage('#') AS (movieid:int,
moviename:chararray,moviegenere:chararray);
2) ratings_data = LOAD '/user/admin/RatingsDataset/ratings_new.dat' USING PigStorage('#') AS (Userid:int,
movieid:int,ratings:int,timestamp:long);
3) moviedata_ratingsdata_join = JOIN movies_data BY movieid, ratings_data BY movieid;
4) moviedata_ratingsdata_join_group = GROUP moviedata_ratingsdata_join BY movies_data.movieid;
5) moviedata_ratingsdata_averagerating = FOREACH moviedata_ratingsdata_join_group GENERATE group,
AVG(moviedata_ratingsdata_join.ratings) AS Averageratings, (moviedata_ratingsdata_join.Userid) AS userid;
6) DUMP moviedata_ratingsdata_averagerating;
I am getting this error
2017-03-25 06:46:50,332 [main] ERROR org.apache.pig.tools.pigstats.PigStats - ERROR 0: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: moviedata_ratingsdata_join_group: Local Rearrange[tuple]{int}(false) - scope-95 Operator Key: scope-95): org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: moviedata_ratingsdata_averagerating: New For Each(false,false)[bag] - scope-83 Operator Key: scope-83): org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (1,Toy Story (1995),Animation|Children's|Comedy), 2nd :(2,Jumanji (1995),Adventure|Children's|Fantasy) (common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar" )
If I remove line 6, the script executes successfully.
Why can't I DUMP relation that generates in line 5?
Use the disambiguate operator ( :: ) to identify field names after JOIN, COGROUP, CROSS, or FLATTEN operators.
Relations movies_data and ratings_data both have a column movieid. When forming the relation moviedata_ratingsdata_join_group, use the :: operator to identify which movieid column to GROUP on.
So your 4) would look like,
4) moviedata_ratingsdata_join_group = GROUP moviedata_ratingsdata_join BY movies_data::movieid;
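For completeness, a sketch of lines 4) and 5) with the join fields qualified (ratings and Userid exist only in ratings_data, so their unqualified names still resolve; qualifying them just makes the intent explicit):

```pig
-- movieid exists in both inputs, so it must be qualified with ::
moviedata_ratingsdata_join_group = GROUP moviedata_ratingsdata_join
    BY movies_data::movieid;
moviedata_ratingsdata_averagerating = FOREACH moviedata_ratingsdata_join_group
    GENERATE group,
             AVG(moviedata_ratingsdata_join.ratings_data::ratings) AS Averageratings;
```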

Bug in my Pig Latin script

I'm trying to do a median operation on a file in Pig. The file looks like this.
NewYork,-1
NewYork,-5
NewYork,-2
NewYork,3
NewYork,4
NewYork,13
NewYork,11
Amsterdam,12
Amsterdam,11
Amsterdam,2
Amsterdam,1
Amsterdam,-1
Amsterdam,-4
Mumbai,1
Mumbai,4
Mumbai,5
Mumbai,-2
Mumbai,9
Mumbai,-4
The file is loaded and the data inside it is grouped as follows:
wdata = load 'weatherdata' using PigStorage(',') as (city:chararray, temp:int);
wdata_g = group wdata by city;
I'm trying to get the median of all the temperatures for each city as follows:
wdata_tempmedian = foreach wdata_g { tu = wdata.temp as temp; ord = order tu by temp generate group, Median(ord); }
The data is ordered because it needs to be sorted to find a median.
But I'm getting the following error message, and I can't figure out the mistake:
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 3, column 53> mismatched input 'as' expecting SEMI_COLON
Any help is much appreciated.
You can't use 'as' inside a nested FOREACH block, and you are missing a ';' after ordering the temperatures.
wdata_tempmedian = FOREACH wdata_g {
tu = wdata.temp as temp;
ord = ORDER tu BY temp;
GENERATE group, Median(ord);
}
OR, using DataFu's StreamingMedian, which computes an approximate median and does not require sorting the bag first:
DEFINE Median datafu.pig.stats.StreamingMedian();
wdata_tempmedian = FOREACH wdata_g GENERATE group, Median(wdata.temp);
Note: I am assuming you are using DataFu, since Pig does not have a built-in Median function. Ensure the jar is registered and the function defined:
register /path/datafu-pig-incubating-1.3.1.jar
DEFINE Median datafu.pig.stats.Median();

Pig script filter a file getting ERROR

I am new to Pig. I'm trying to filter the text file and store it in HBase. Here is the sample input file
sample.txt
{"pattern":"google_1473491793_265244074740","tweets":[{"tweet::created_at":"18:47:31 ","tweet::id":"252479809098223616","tweet::user_id":"450990391","tweet::text":"rt #joey7barton: ..give a google about whether the americans wins a ryder cup. i mean surely he has slightly more important matters. #fami ..."}]}
{"pattern":"facebook_1473491793_265244074740","tweets":[{"tweet::created_at":"11:33:16 ","tweet::id":"252370526411051008","tweet::user_id":"845912316","tweet::text":"#maarionymcmb facebook mere ta dit tu va resté chez toi dnc tu restes !"}]}
Script:
data = load 'sample.txt' using JsonLoader('pattern:chararray, tweets: bag {t1:tuple(tweet::created_at: chararray,tweet::id: chararray,tweet::user_id: chararray,tweet::text: chararray)}');
A = FILTER data BY pattern == 'google_*';
grouped = foreach (group A by pattern){tweets1 = foreach data generate tweets.(created_at),tweets.(id),tweets.(user_id),tweets.(text); generate group as pattern1,tweets1;}
But I got this error when running grouped:
2016-09-10 13:38:52,995 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse: <line 41, column 57> expression is not a project expression: (Name: ScalarExpression) Type: null Uid: null)
In the nested FOREACH you can't reference 'tweets'; you have to go through 'A' (and the keyword is group, not groups). See my example below.
grouped = FOREACH (GROUP A BY pattern)
    GENERATE group AS pattern1, A.tweets;
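A separate issue, beyond what the answer addresses: the FILTER uses ==, which is an exact string comparison, so 'google_*' never matches anything; the wildcard intent is probably better expressed with the matches operator, which takes a Java regular expression:

```pig
-- matches applies a Java regex, so .* is the wildcard, not *
A = FILTER data BY pattern MATCHES 'google_.*';
```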

Pig Latin - adding values from different bags?

I have one file max_rank.txt containing:
1,a
2,b
3,c
and second file max_rank_add.txt:
d
e
f
My expected result is:
1,a
2,b
3,c
4,d
5,e
6,f
So I want to generate RANK for second set of values, but starting with value greater than max from first set.
The beginning of the script probably looks like this:
existing = LOAD 'max_rank.txt' using PigStorage(',') AS (id: int, text : chararray);
new = LOAD 'max_rank_add.txt' using PigStorage() AS (text2 : chararray);
ordered = ORDER existing by id desc;
limited = LIMIT ordered 1;
new_rank = RANK new;
But I have a problem with the last and most important line, which should add the value from limited to rank_new in new_rank.
Can you please give any suggestions?
Regards
Pawel
I've found a solution.
Both scripts work:
rank_plus_max = foreach new_rank generate flatten(limited.$0 + rank_new), text2;
rank_plus_max = foreach new_rank generate limited.$0 + rank_new, text2;
This DOES NOT work:
rank_plus_max = foreach new_rank generate flatten(limited.$0) + flatten(rank_new);
2014-02-24 10:52:39,580 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 10, column 62> mismatched input '+' expecting SEMI_COLON
Details at logfile: /export/home/pig/pko/pig_1393234166538.log
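A note on why the working forms parse (my reading of the error, not stated in the original post): limited.$0 is a scalar projection, which is legal inside an expression because limited holds exactly one row; flatten(...), however, is only valid as a top-level item in a GENERATE list, so flatten(limited.$0) + flatten(rank_new) is a syntax error. A cleaner equivalent of the working script:

```pig
-- limited has exactly one row, so limited.$0 acts as a scalar inside the expression
rank_plus_max = FOREACH new_rank GENERATE rank_new + limited.$0, text2;
```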
