Pig Latin - adding values from different bags? - hadoop

I have one file max_rank.txt containing:
1,a
2,b
3,c
and second file max_rank_add.txt:
d
e
f
My expecting result is:
1,a
2,b
3,c
4,d
5,e
6,f
So I want to generate RANK for second set of values, but starting with value greater than max from first set.
The beginning of the script probably looks like this:
existing = LOAD 'max_rank.txt' using PigStorage(',') AS (id: int, text : chararray);
new = LOAD 'max_rank_add.txt' using PigStorage() AS (text2 : chararray);
ordered = ORDER existing by id desc;
limited = LIMIT ordered 1;
new_rank = RANK new;
But I have a problem with the last, most important line, which should add the value from limited to rank_new in new_rank.
Can you please give any suggestions?
Regards
Pawel

I've found a solution.
Both of these statements work:
rank_plus_max = foreach new_rank generate flatten(limited.$0 + rank_new), text2;
rank_plus_max = foreach new_rank generate limited.$0 + rank_new, text2;
This DOES NOT work:
rank_plus_max = foreach new_rank generate flatten(limited.$0) + flatten(rank_new);
2014-02-24 10:52:39,580 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 10, column 62> mismatched input '+' expecting SEMI_COLON
Details at logfile: /export/home/pig/pko/pig_1393234166538.log
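For reference, a minimal end-to-end sketch built from the statements above; the UNION/ORDER at the end is my assumption about how the combined 1..6 listing would be produced, and the scalar projection limited.$0 requires that limited contain exactly one row:
existing = LOAD 'max_rank.txt' USING PigStorage(',') AS (id:int, text:chararray);
new = LOAD 'max_rank_add.txt' USING PigStorage() AS (text2:chararray);
ordered = ORDER existing BY id DESC;
limited = LIMIT ordered 1;                               -- one-row relation holding the current max id
new_rank = RANK new;                                     -- adds a rank_new column starting at 1
rank_plus_max = FOREACH new_rank GENERATE (int)(limited.$0 + rank_new) AS id, text2 AS text;
combined = UNION ONSCHEMA existing, rank_plus_max;       -- assumption: merge old and newly ranked rows
result = ORDER combined BY id;
DUMP result;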

Related

Apache Pig fetch max from a data-set that has Groups

I have a data-set stored in HDFS in a file called temp.txt which is as follows:
US,Arizona,51.7
US,California,56.7
US,Bullhead City,51.1
India,Jaisalmer,42.4
Libya,Aziziya,57.8
Iran,Lut Desert,70.7
India,Banda,42.4
Now, I load this into Pig with the following command:
temp_input = LOAD '/WC/temp.txt' USING PigStorage(',') as
(country:chararray,city:chararray,temp:double);
After this, I GROUPED all the data in temp_input as:
group_country = GROUP temp_input BY country;
When I dump the data in group_country, the following output is displayed on screen:
(US,{(US,Bullhead City,51.1),(US,California,56.7),(US,Arizona,51.7)})
(Iran,{(Iran,Lut Desert,70.7)})
(India,{(India,Banda,42.4),(India,Jaisalmer,42.4)})
(Libya,{(Libya,Aziziya,57.8)})
Once the data-set is grouped, I tried to fetch the country name and the individual maximum temperature for each country in group_country through the following query:
max_temp = foreach group_country generate group,max(temp);
This punches out an error which looks like:
2017-06-21 13:20:34,708 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR
1070: Could not resolve max using imports: [, java.lang.,
org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Details at logfile: /opt/ecosystems/pig-0.16.0/pig_1498026994684.log
What should be my next move to resolve this error and fetch the required result?
All help is appreciated.
When transforming relations in Pig, use describe relation_name; it shows the schema you need to iterate over. So in your case:
describe group_country;
should give you an output like:
group_country: {group: chararray,temp_input: {(country: chararray,city: chararray,temp: double)}}
Built-in function names are case-sensitive, and the field temp lives inside the temp_input bag, so max(temp) cannot be resolved. The query should then be:
max_temp = foreach group_country GENERATE group,MAX(temp_input.temp);
Output:
(US,56.7)
(Iran,70.7)
(India,42.4)
(Libya,57.8)
Updated as per comment:
finaldata = foreach group_country {
orderedset = order temp_input by temp DESC;
maxtemps = limit orderedset 1;
generate flatten(maxtemps);
}
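As a quick usage check against the sample input above (a hedged reading, since LIMIT 1 breaks ties arbitrarily):
dump finaldata;
should return the whole winning tuple per country, e.g. (US,California,56.7), (Iran,Lut Desert,70.7) and (Libya,Aziziya,57.8); for India, where Banda and Jaisalmer tie at 42.4, either of the two rows may be kept.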

Bug in my Pig Latin script

I'm trying to do a median operation on a file in Pig. The file looks like this:
NewYork,-1
NewYork,-5
NewYork,-2
NewYork,3
NewYork,4
NewYork,13
NewYork,11
Amsterdam,12
Amsterdam,11
Amsterdam,2
Amsterdam,1
Amsterdam,-1
Amsterdam,-4
Mumbai,1
Mumbai,4
Mumbai,5
Mumbai,-2
Mumbai,9
Mumbai,-4
The file is loaded and the data inside it is grouped as follows:
wdata = load 'weatherdata' using PigStorage(',') as (city:chararray, temp:int);
wdata_g = group wdata by city;
I'm trying to get the median of all the temperatures for each city as follows:
wdata_tempmedian = foreach wdata_g { tu = wdata.temp as temp; ord = order tu by temp generate group, Median(ord); }
The data is being ordered because it needs to be in sorted order to find a median.
But I'm getting the following error message, and I couldn't figure out what the mistake is:
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 3, column 53> mismatched input 'as' expecting SEMI_COLON
Any help is much appreciated.
You are missing a ';' after ordering the temperatures, and 'as' is not allowed when projecting a column inside a nested FOREACH (that is the token the parser is complaining about).
wdata_tempmedian = FOREACH wdata_g {
tu = wdata.temp;
ord = ORDER tu BY temp;
GENERATE group, Median(ord);
}
OR
wdata_tempmedian = FOREACH wdata_g {
ord = ORDER wdata BY temp;
GENERATE group, Median(ord.temp);
}
Note: I am assuming you are using DataFu, since Pig does not have a built-in Median function. Ensure the jar is correctly registered:
register /path/datafu-pig-incubating-1.3.1.jar
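In addition to registering the jar, the UDF needs a DEFINE (or a fully qualified call) before Median(...) will resolve. A minimal sketch, assuming DataFu's exact Median in datafu.pig.stats (the approximate StreamingMedian lives in the same package):
define Median datafu.pig.stats.Median();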

Null error for nonempty data using Generate in Pig/Hadoop

I am working on Task 2 in this link:
https://sites.google.com/site/hadoopbigdataoverview/certification-practice-exam
I used the code below
a = load '/user/horton/flightdelays/flight_delays1.csv' using PigStorage(',');
dump a;
a_top = limit a 5;
a_top shows the first 5 rows. There are non-null values for Year in each row.
Then I type
a_clean = filter a BY NOT ($4=='NA');
aa = foreach a_clean generate a_clean.Year;
But that gives the error
ERROR 1200: null
What is wrong with this?
EDIT: I also tried
a = load '/user/horton/flightdelays/flight_delays1.csv' using PigStorage(',') AS (Year:chararray,Month:chararray,DayofMonth:chararray,DayOfWeek:chararray,DepTime:chararray,CRSDepTime:chararray,ArrTime:chararray,CRSArrTime:chararray,UniqueCarrier:chararray,FlightNum:chararray,TailNum:chararray,ActualElapsedTime:chararray,CRSElapsedTime:chararray,AirTime:chararray,ArrDelay:chararray,DepDelay:chararray,Origin:chararray,Dest:chararray,Distance:chararray,TaxiIn:chararray,TaxiOut:chararray,Cancelled:chararray,CancellationCode:chararray,Diverted:chararray,CarrierDelay:chararray,WeatherDelay:chararray,NASDelay:chararray,SecurityDelay:chararray,LateAircraftDelay:chararray);
and
aa = foreach a_clean generate a_clean.Year
but the error was
ERROR org.apache.pig.tools.pigstats.PigStats - ERROR 0: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay), 2nd :(2008,1,3,4,2003,1955,2211,2225,WN,335,N712SW,128,150,116,-14,8,IAD,TPA,810,4,8,0,,0,NA,NA,NA,NA,NA)
Since you have not specified the schema in the LOAD statement, you will have to refer to the columns by the order in which they occur. Year seems to be the first column, so try this:
a_clean = filter a BY ($4 != 'NA');
aa = foreach a_clean generate $0;
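If you keep the LOAD from the EDIT, which does declare the schema, the same idea applies with names instead of positions: reference the field directly rather than through the relation name (a_clean.Year makes Pig treat a_clean as a scalar, which is what triggers the 'Scalar has more than one row in the output' error). A minimal sketch under that assumption:
a_clean = filter a BY ($4 != 'NA');   -- with the schema above, $4 is DepTime
aa = foreach a_clean generate Year;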

Get a count by iterating over a Data Bag, with a separate count for each value of a field

Below is the data I have; its schema is:
student_name, question_number, actual_result (either false/Correct)
(b,q1,Correct)
(a,q1,false)
(b,q2,Correct)
(a,q2,false)
(b,q3,false)
(a,q3,Correct)
(b,q4,false)
(a,q4,false)
(b,q5,false)
(a,q5,false)
What I want is to get, for each student (i.e. a/b), the total count of correct and false answers he/she has made.
For the use case shared, the Pig script below suffices.
Pig Script :
student_data = LOAD 'student_data.csv' USING PigStorage(',') AS (student_name:chararray, question_number:chararray, actual_result:chararray);
student_data_grp = GROUP student_data BY student_name;
student_correct_answer_data = FOREACH student_data_grp {
answers = student_data.actual_result;
correct_answers = FILTER answers BY actual_result=='Correct';
incorrect_answers = FILTER answers BY actual_result=='false';
GENERATE group AS student_name, COUNT(correct_answers) AS correct_ans_count, COUNT(incorrect_answers) AS incorrect_ans_count ;
};
Input : student_data.csv :
b,q1,Correct
a,q1,false
b,q2,Correct
a,q2,false
b,q3,false
a,q3,Correct
b,q4,false
a,q4,false
b,q5,false
a,q5,false
Output : DUMP student_correct_answer_data:
-- schema : (student_name, correct_ans_count, incorrect_ans_count)
(a,1,4)
(b,2,3)
Ref : for more details on nested FOREACH, see:
http://pig.apache.org/docs/r0.12.0/basic.html#foreach
http://chimera.labs.oreilly.com/books/1234000001811/ch06.html#more_on_foreach
Use this:
data = LOAD '/abc.txt' USING PigStorage(',') AS (name:chararray, number:chararray,result:chararray);
B = GROUP data by (name,result);
C = foreach B generate FLATTEN(group) as (name,result), COUNT(data) as count;
and the answer will be like:
(a,false,4)
(a,Correct,1)
(b,false,3)
(b,Correct,2)
Hope this is the output you are looking for.
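A further option (a sketch of my own, not taken from either answer above): if you want one row per student without the nested FILTERs, turn the result into 0/1 indicator columns with a bincond and SUM them after grouping:
data = LOAD '/abc.txt' USING PigStorage(',') AS (name:chararray, number:chararray, result:chararray);
flagged = FOREACH data GENERATE name, (result == 'Correct' ? 1 : 0) AS is_correct, (result == 'false' ? 1 : 0) AS is_incorrect;
by_name = GROUP flagged BY name;
counts = FOREACH by_name GENERATE group AS student_name, SUM(flagged.is_correct) AS correct_ans_count, SUM(flagged.is_incorrect) AS incorrect_ans_count;
With the sample input this should give (a,1,4) and (b,2,3), matching the first answer.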

Handle thorn delimiter in pig

My source is a log file having "þ" as the delimiter. I am trying to read this file in Pig. Please look at the options I tried.
Option 1 :
Using PigStorage("þ") - this doesn't work out, as it can't handle Unicode characters.
Option 2 :
I tried reading the lines as strings and splitting each line on "þ". This doesn't work out either, as STRSPLIT left out the last field because it has "\n" at the end.
I can see multiple questions on the web, but I am unable to find a solution.
Kindly direct me on this.
Thorn Details :
http://www.fileformat.info/info/unicode/char/fe/index.htm
Is this the solution you are expecting?
input.txt:
helloþworldþhelloþworld
helloþworldþhelloþworld
helloþworldþhelloþworld
helloþworldþhelloþworld
helloþworldþhelloþworld
PigScript:
A = LOAD 'input.txt' as line;
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)þ(.*)þ(.*)þ(.*)'));
dump B;
Output:
(hello,world,hello,world)
(hello,world,hello,world)
(hello,world,hello,world)
(hello,world,hello,world)
(hello,world,hello,world)
Added 2nd option with different datatypes:
input.txt
helloþ1234þ1970-01-01T00:00:00.000+00:00þworld
helloþ4567þ1990-01-01T00:00:00.000+00:00þworld
helloþ8901þ2001-01-01T00:00:00.000+00:00þworld
helloþ9876þ2014-01-01T00:00:00.000+00:00þworld
PigScript:
A = LOAD 'input.txt' as line;
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)þ(.*)þ(.*)þ(.*)')) as (f1:chararray,f2:long,f3:datetime,f4:chararray);
DUMP B;
DESCRIBE B;
Output:
(hello,1234,1970-01-01T00:00:00.000+00:00,world)
(hello,4567,1990-01-01T00:00:00.000+00:00,world)
(hello,8901,2001-01-01T00:00:00.000+00:00,world)
(hello,9876,2014-01-01T00:00:00.000+00:00,world)
B: {f1: chararray,f2: long,f3: datetime,f4: chararray}
Another thorn-like symbol, Ã¾ (the UTF-8 bytes of þ read as Latin-1):
input.txt
1077Ã¾04-01-2014þ04-30-2014þ0þ0.0þ0
1077Ã¾04-01-2014þ04-30-2014þ0þ0.0þ0
1077Ã¾04-01-2014þ04-30-2014þ0þ0.0þ0
PigScript:
A = LOAD 'input.txt' as line;
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)Ã¾(.*)þ(.*)þ(.*)þ(.*)þ(.*)')) as (f1:long,f2:datetime,f3:datetime,f4:int,f5:double,f6:int);
DUMP B;
describe B;
Output:
(1077,04-01-2014,04-30-2014,0,0.0,0)
(1077,04-01-2014,04-30-2014,0,0.0,0)
(1077,04-01-2014,04-30-2014,0,0.0,0)
B: {f1: long,f2: datetime,f3: datetime,f4: int,f5: double,f6: int}
This should work (replace the unicode code point with the one that's working for you, this is for capital thorn):
A = LOAD 'input' USING TextLoader() AS (f1:chararray);   -- TextLoader assumed here: load each whole line as one chararray
B = FOREACH A GENERATE STRSPLIT(f1, '\\u00DE', -1);
I don't see why the last field should be left out.
Somehow, this does not work:
A = LOAD 'input' USING PigStorage('\00DE');
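For completeness, a runnable sketch of the STRSPLIT route (assuming lowercase thorn U+00FE and four fields per line; the -1 limit keeps trailing fields, which is what addresses the "last field left out" problem):
A = LOAD 'input.txt' USING TextLoader() AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line, '\\u00FE', -1));
DUMP B;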
