I am new to Pig Latin scripting and don't know whether what I am doing is right or wrong, so please help me.
Below is a sample of my data, which I have first grouped by player name (the first field). Now I want to order the data in each bag by score, descending.
Is it possible to do this in Pig with a single statement? The lines below show a sample record, my attempt, and the output I want.
(B.Kumarr,{(B.Kumarr,18),(B.Kumarr,10),(B.Kumarr,38)})
cricData3 = FOREACH cricData2 GENERATE $0,ORDER $1.$1 By DESC;
(B.Kumarr,{(B.Kumarr,38),(B.Kumarr,18),(B.Kumarr,10)})
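A pattern often used for this is a nested FOREACH block, which can sort the bag inside each group within one statement. The following is a minimal sketch, assuming cricData2 has the layout shown above (player name in $0, bag of (name, score) tuples in $1). The alias by_score is made up for illustration.

```pig
-- Sort each player's bag by its second field (the score), highest first.
cricData3 = FOREACH cricData2 {
    -- First $1 is the bag in cricData2; second $1 is the score inside each bag tuple.
    by_score = ORDER $1 BY $1 DESC;
    GENERATE $0, by_score;
};
```

The top-level ORDER operator sorts a whole relation, so it cannot be used as an expression inside GENERATE. The nested form applies the sort per group instead.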
I have an alias called student; its data structure (the output of DESCRIBE) is:
studentIDInt:int,courses:bag{(courseId:int,testID:int,score:int)}
I am trying to filter students by score, but I get the Pig parse error below. If anyone has any good ideas, that would be great. Thanks.
I am also confused about the additional tuple reported in the error message.
student = filter student by courses.score > 3;
incompatible types in GreaterThan Operator left hand side:bag :tuple(score:int) right hand score:int
regards,
Lin
You can't do it directly. One possible solution is to flatten first, then filter, and then group again:
flat_student = foreach student generate studentIDInt, flatten(courses);
filtered_student = filter flat_student by score > 3;
final_student = group filtered_student by studentIDInt;
Another way is to write a custom FilterFunc, so it's up to you which one to choose.
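A third route that is sometimes used is a nested FOREACH with FILTER, which filters the bag in place without flattening and regrouping. This is a sketch under the schema given in the question. The aliases passed and non_empty are hypothetical names, not part of the original scripts.

```pig
-- Keep only the courses with score > 3 inside each student's bag.
filtered_student = FOREACH student {
    passed = FILTER courses BY score > 3;
    GENERATE studentIDInt, passed AS courses;
};

-- Optionally drop students whose filtered bag ended up empty.
non_empty = FILTER filtered_student BY NOT IsEmpty(courses);
```

Unlike the flatten, filter, and group sequence, this keeps the original (courseId, testID, score) tuple layout inside the bag, whereas the regrouped bag also carries studentIDInt in every tuple.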
I have selected two columns (origin and destination) out of 'n' columns. Now I want to generate a count for each combination of these two columns, but I am not able to get the result. I get the error: ERROR 1070: Could not resolve Count using imports:
Below is my script:
mydata = load '/Projects/Flightdata/1987/Rawdata' using PigStorage(',') as (year:int, month:int, dom:int, dow:int, deptime:long, crsdeptime:long, arrtime:long, crsarrtime:long, uniqcarcode:chararray, flightnum:long, tailnum:chararray, actelaptime:long, crselaptime:long, airtime:long, arrdeltime:long, depdeltime:long, origcode:chararray, destcode:chararray, dist:long, taxintime:long, taxiouttime:long, flightcancl:int, canclcode:chararray, diverted:int, carrierdel:long, weatherdel:long, nasdel:long, securitydel:long, lateaircraftdel:long);
Step2 = foreach mydata generate origcode, destcode;
grpby = group Step2 by (origcode, destcode) ;
step3 = foreach grpby generate group.origcode as source, group.destcode as destination, Count(step2);
Here I want to generate a count for each combination of origin and destination.
Any guidance would be helpful.
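Please see the Pig documentation about case sensitivity:
The names of Pig Latin functions are case sensitive.
The built-in function is COUNT, not Count. Aliases are case sensitive too, so its argument must be Step2, as the alias is defined in your script, rather than step2.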
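Applied to the script in the question, the last statement might look like the sketch below. It assumes the aliases grpby and Step2 exactly as defined there. The field name flights is invented here for illustration.

```pig
-- COUNT in upper case, and Step2 spelled exactly as the alias was defined.
step3 = FOREACH grpby GENERATE
            group.origcode AS source,
            group.destcode AS destination,
            COUNT(Step2)   AS flights;
```

After GROUP, each row holds the group key plus a bag named after the grouped alias (here Step2), which is why that name is what gets passed to COUNT.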
Structure of bag:
emp = LOAD '...../emp.csv' using PigStorage(',') AS
(ename:chararray,id:int,job:chararray,sal:double)
This bag contains details of employees. I want to split the data based on job.
Bag = split emp into mngr if job == 'MANAGER';
This is not working and gives Error 1200.
If I include one more condition, e.g. sal10k if sal<10000, then it works. But why does it not work with only the one chararray condition?
I am new to Hadoop Pig and know only a few basics. Kindly help.
Please find the solution below, along with a basic explanation of the SPLIT operator.
The SPLIT operator breaks a relation into two or more new relations, so you need to supply a condition for each one, much like IF and ELSE:
For instance: IF (something matches) then make Relation1, IF (NOT (something matches)) then make another relation. (Pig has no ELSE keyword, though the SPLIT syntax accepts a final OTHERWISE branch.)
SPLIT is a standalone statement, meaning that you can't assign its result to a relation:
Example:
Bag = split emp into mngr if job == 'MANAGER'; -- This is wrong.
You can't represent a SPLIT operation by a relation.
It runs as its own statement in the Grunt shell or in a script, like this:
SPLIT emp INTO managers IF (job MATCHES '.*MANAGER.*'), not_managers IF (NOT (job MATCHES '.*MANAGER.*'));
Here is an example data set and output for your reference:
DATASET:
Ron,1331,MANAGER,7232332.34
John,4332,ASSOCIATE,45534.6
Michell,4112,MANAGER,8342423.43
Tamp,1353,ASSOCIATE,34324.67
Ramo,2144,MODULE LEAD,845433.32
Shina,1389,MANAGER,8345321.78
Chin,4323,MODULE LEAD,455465.42
SCRIPT:
emp = LOAD 'stackfile.txt' USING PigStorage(',') AS (ename:chararray,id:int,job:chararray,sal:double);
SPLIT emp INTO managers IF(job MATCHES '.*MANAGER.*'),not_managers IF(NOT(job MATCHES '.*MANAGER.*'));
DUMP managers;
OUTPUT:
(Ron,1331,MANAGER,7232332.34)
(Michell,4112,MANAGER,8342423.43)
(Shina,1389,MANAGER,8345321.78)
I think you are using the SPLIT operator incorrectly.
This is from the docs:
SPLIT alias INTO alias IF expression, alias IF expression [, alias IF expression …] [, alias OTHERWISE];
So drop the "Bag =" part at the start.
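Following that syntax, a single-condition split could be written with OTHERWISE for the remainder. This is a sketch that assumes a Pig version whose SPLIT accepts OTHERWISE, as in the syntax quoted above. The aliases others and mngr_only are hypothetical.

```pig
-- SPLIT needs at least two output relations; OTHERWISE catches the non-matching rows.
SPLIT emp INTO mngr IF job == 'MANAGER', others OTHERWISE;

-- If only the managers are needed, a plain FILTER avoids SPLIT altogether.
mngr_only = FILTER emp BY job == 'MANAGER';
```

This also explains why adding the second branch (sal10k IF sal<10000) made the original statement parse: SPLIT expects more than one target relation.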
I am using Hadoop's Pig Latin to find the number of occurrences of each unique search string in a search engine log file.
Please help me out. Thanks in advance.
Pig script
excitelog = load '/user/hadoop/input/excite-small.log' using PigStorage() AS
(encryptcode:chararray, numericid:int, searchstring:chararray);
GroupBySearchString = GROUP excitelog by searchstring;
searchStrFrq = foreach GroupBySearchString Generate group as searchstring,count(searchstring);
Error encountered
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve count using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
You need to do:
searchStrFrq = foreach GroupBySearchString Generate group as searchstring,
COUNT(excitelog) as kount;
This is because of the way grouping works in Pig: GroupBySearchString has the schema {group, excitelog}, where excitelog is itself a bag of all tuples matching the group. COUNT is a function that takes a bag as input and returns the number of tuples in that bag. So COUNT(excitelog) gives you the number of tuples matching the group.
Function names such as PigStorage and COUNT are case sensitive,
so the COUNT function must be written in upper case, as below:
wordcount = FOREACH grouped GENERATE group , COUNT(words);
I have a set of records that I am loading from a file, and the first thing I need to do is get the max and min of a column.
In SQL I would do this with a subquery like this:
select c.state, c.population,
(select max(c.population) from state_info c) as max_pop,
(select min(c.population) from state_info c) as min_pop
from state_info c
I assume there must be an easy way to do this in Pig as well, but I'm having trouble finding it. It has MAX and MIN functions, but when I tried the following it didn't work:
records=LOAD '/Users/Winter/School/st_incm.txt' AS (state:chararray, population:int);
with_max = FOREACH records GENERATE state, population, MAX(population);
I had better luck adding an extra column with the same value in every row, grouping on that column, and then taking the max over the resulting group. This seems like a convoluted way of getting what I want, so I thought I'd ask if anyone knows a simpler way.
Thanks in advance for the help.
As you said, you need to group all the data together, but no extra column is required if you use GROUP ALL.
Pig
records = LOAD 'states.txt' AS (state:chararray, population:int);
records_group = GROUP records ALL;
with_max = FOREACH records_group
GENERATE
FLATTEN(records.(state, population)), MAX(records.population);
Input
CA 10
VA 5
WI 2
Output
(CA,10,10)
(VA,5,10)
(WI,2,10)