Sorting in Pig Latin

I'm testing the following example from the Apache Pig docs:
http://pig.apache.org/docs/r0.14.0/basic.html#order-by
but the sort does not seem to be working. Any idea why?
$ pig -version
Apache Pig version 0.14.0 (r1640057)
compiled Nov 16 2014, 18:02:05
grunt> a= load 'data' as (c1:int, c2:int, c3:int);
grunt> describe a;
a: {c1: int,c2: int,c3: int}
grunt> dump a;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
grunt> result = order a by c1 desc;
grunt> dump result;
(8,4,3)
(8,3,4)
(4,3,3)
(4,2,1)
(1,2,3)
(7,2,5)
grunt> result = order a by c2 desc;
grunt> dump result;
(8,4,3)
(7,2,5)
(4,2,1)
(1,2,3)
(4,3,3)
(8,3,4)
grunt> result = order a by c3 desc;
grunt> dump result;
(7,2,5)
(4,3,3)
(8,4,3)
(1,2,3)
(4,2,1)
(8,3,4)

You are loading the data with the default delimiter (tab), but your actual input data is not properly tab-delimited. Make sure the fields in the file 'data' are separated by tabs.
In the example below, each input field is delimited by a tab and it works fine.
data:
1<TAB>2<TAB>3
4<TAB>2<TAB>1
8<TAB>3<TAB>4
4<TAB>3<TAB>3
7<TAB>2<TAB>5
8<TAB>4<TAB>3
grunt> a= load 'data' as (c1:int, c2:int, c3:int);
grunt> dump a;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
grunt> result = order a by c1 desc;
grunt> dump result;
(8,3,4)
(8,4,3)
(7,2,5)
(4,2,1)
(4,3,3)
(1,2,3)
grunt> result = order a by c2 desc;
grunt> dump result;
(8,4,3)
(8,3,4)
(4,3,3)
(1,2,3)
(4,2,1)
(7,2,5)
grunt> result = order a by c3 desc;
grunt> dump result;
(7,2,5)
(8,3,4)
(1,2,3)
(4,3,3)
(8,4,3)
(4,2,1)
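If the file turns out to use a separator other than tab, the delimiter can instead be passed to PigStorage explicitly. A minimal sketch, assuming a comma-delimited file (the comma is an assumption here, not something stated in the question):

```pig
-- Assumption: 'data' is comma-delimited; replace ',' with the file's real separator.
a = LOAD 'data' USING PigStorage(',') AS (c1:int, c2:int, c3:int);
result = ORDER a BY c1 DESC;
DUMP result;
```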

Related

How to join two relations in pig with multiple fields

I have two CSV files:
1- Fertility.csv:
2- Life Expectency.csv:
I want to join them in Pig so that the result looks like this:
I am new to Pig and couldn't get the correct result, but here is my code:
fertility = LOAD 'fertility' USING org.apache.hcatalog.pig.HCatLoader();
lifeExpectency = LOAD 'lifeExpectency' USING org.apache.hcatalog.pig.HCatLoader();
A = JOIN fertility by country, lifeExpectency by country;
B = JOIN fertility by year, lifeExpectency by year;
C = UNION A,B;
DUMP C;
Here is the result of my code:
You have to join by both country and year, then select the columns needed for your final output.
fertility = LOAD 'fertility' USING org.apache.hcatalog.pig.HCatLoader();
lifeExpectency = LOAD 'lifeExpectency' USING org.apache.hcatalog.pig.HCatLoader();
A = JOIN fertility by (country,year), lifeExpectency by (country,year);
B = FOREACH A GENERATE fertility::country,fertility::year,fertility::fertility,lifeExpectency::lifeExpectency;
DUMP B;
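After a JOIN, each field keeps its relation:: prefix, which can make later statements awkward. One option is to rename the fields in the same FOREACH. This is only a sketch: fertility_rate and life_expectancy are illustrative names, not taken from the question.

```pig
-- Same projection as above, with explicit output names.
-- fertility_rate and life_expectancy are hypothetical aliases chosen for readability.
B = FOREACH A GENERATE
        fertility::country AS country,
        fertility::year AS year,
        fertility::fertility AS fertility_rate,
        lifeExpectency::lifeExpectency AS life_expectancy;
DESCRIBE B;
```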

Pig script to find the max, min,avg,sum of Salary in each department

I am stuck after grouping the data by department number. These are the steps I followed:
grunt> A = load '/home/cloudera/naveen1/hive_data/emp_data.txt' using PigStorage(',') as (eno:int,ename:chararray,job:chararray,sal:float,comm:float,dno:int);
grunt> B = group A by dno;
grunt> describe B;
B: {group: int,A: {(eno: int,ename: chararray,job: chararray,sal: float,comm: float,dno: int)}}
Please let me know the steps after this. I am a bit confused about how the nested FOREACH statement executes.
The data contains eno, ename, sal, job, commission and deptno, and I want to extract the max sal in each dept and the employee getting the highest salary.
Similarly for min sal.
Use the aggregate functions after grouping.
C = FOREACH B GENERATE group,MAX(A.sal),MIN(A.sal),AVG(A.sal),SUM(A.sal);
DUMP C;
To get the name, eno and max sal in each dept, sort the records within each group and take the top row:
C = FOREACH B {
max_sal = ORDER A BY sal DESC;
max_limit = LIMIT max_sal 1;
GENERATE FLATTEN(max_limit);
}
DUMP C;
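The question also asks for the lowest salary. A sketch of the same pattern with the sort direction reversed; the aliases D, min_sal and min_limit are names chosen here for illustration:

```pig
-- Sort each department's bag by salary ascending and keep the first row.
D = FOREACH B {
    min_sal = ORDER A BY sal ASC;
    min_limit = LIMIT min_sal 1;
    GENERATE FLATTEN(min_limit);
};
DUMP D;
```

Because of LIMIT 1, only one employee per department is returned, even if several share the same salary.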

How to store distinct values in a list for the same key using Pig

I have a use case
col1|col2
a101|10
a101|20
a101|10
a101|30
a201|40
a201|50
Expected output:
a101|List<10,20,30>
a201|List<40,50>
Below is the query, but I am not getting the output as expected. I want to store col2 distinct values in a list.
input1= load 'list1.csv' using PigStorage('|') as (col1: chararray, col2: int);
input2 = DISTINCT (FOREACH input1 generate col1,col2);
input3 = GROUP input2 by col1;
dump input3;
(a101,{(a101,30),(a101,20),(a101,10)})
(a201,{(a201,50),(a201,40)})
Try this:
input1= load 'input.txt' using PigStorage('|') as (col1: chararray, col2: int);
input2 = DISTINCT input1; --removes duplicate (col1,col2) rows so each value appears once per key
input3 = GROUP input2 by col1;
data = FOREACH input3 GENERATE FLATTEN(group) as col1, input2.col2 AS col2;
DUMP data;
Output Generated:
(a101,{(30),(20),(10)})
(a201,{(50),(40)})
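An alternative is to de-duplicate inside each group with a nested FOREACH rather than on the whole relation first. A minimal sketch, assuming the same pipe-delimited input as in the question; grouped, vals and uniq are illustrative aliases:

```pig
-- Assumption: same file and schema as in the question.
input1 = LOAD 'list1.csv' USING PigStorage('|') AS (col1:chararray, col2:int);
grouped = GROUP input1 BY col1;
data = FOREACH grouped {
    vals = input1.col2;      -- project the col2 values for this key
    uniq = DISTINCT vals;    -- drop duplicates inside the bag
    GENERATE group AS col1, uniq AS col2;
};
DUMP data;
```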

How to join bag in pig

First I have two data files.
largefile.txt:
1001 {(1,-1),(2,-1),(3,-1),(4,-1)}
smallfile.txt:
1002 {(1,0.04),(2,0.02),(4,0.03)}
I want smallfile.txt to end up like this:
1002 {(1,0.04),(2,0.02),(3,-1),(4,0.03)}
What type of join can I use to do something like this?
A = LOAD './largefile.txt' USING PigStorage('\t') AS (id:int, a:bag{tuple(time:int,value:float)});
B = LOAD './smallfile.txt' USING PigStorage('\t') AS (id:int, b:bag{tuple(time:int,value:float)});
Can you clarify your requirement a bit? Do you want to join largefile.txt and smallfile.txt on the first field where the values match (e.g. 1002)? If so, you can simply do this:
A = LOAD './largefile.txt' USING PigStorage('\t') AS (id:int, a:bag{tuple(time:int,value:float)});
A = FOREACH A GENERATE id, FLATTEN(a) AS (time, value);
B = LOAD './smallfile.txt' USING PigStorage('\t') AS (id:int, b:bag{tuple(time:int,value:float)});
B = FOREACH B GENERATE id, FLATTEN(b) AS (time, value);
joined = JOIN A BY id, B BY id;

Apache Pig Error -- Unable to trace

When I try to run the Pig query below, I get an error when using the SORT command. If I omit the SORT step, the query executes.
grunt> month1 = LOAD 'hdfs://localhost.localdomain:8020/user/cloudera/data/big1/climate_month1.txt' USING PigStorage(',');
grunt> month2 = LOAD 'hdfs://localhost.localdomain:8020/user/cloudera/data/big1/climate_month2.txt' USING PigStorage(',');
grunt> month3 = LOAD 'hdfs://localhost.localdomain:8020/user/cloudera/data/big1/climate_month3.txt' USING PigStorage(',');
grunt> month4 = LOAD 'hdfs://localhost.localdomain:8020/user/cloudera/data/big1/climate_month4.txt' USING PigStorage(',');
grunt> month5 = LOAD 'hdfs://localhost.localdomain:8020/user/cloudera/data/big1/climate_month5.txt' USING PigStorage(',');
grunt> months = UNION month1, month2, month3, month4, month5;
grunt> clearWeather = FILTER months BY skyCondition == 'CLR';
grunt> shapedWeather = FOREACH clearWeather GENERATE date, SUBSTRING(date,0,4) as YEAR, SUBSTRING(date,4,6) as MONTH, SUBSTRING(date,6,8) as DAY, skyCondition, dryTemp;
grunt> groupedByMonthDay = GROUP shapedWeather BY (MONTH, DAY) PARALLEL 10;
grunt> aggedResults = FOREACH groupedByMonthDay GENERATE group as MonthDay, AVG(shapedWeather.dryTemp) AS AVERAGETEMP, MIN(shapedWeather.dryTemp) AS MINIMUMTEMP, MAX(shapedWeather.dryTemp) AS MAXIMUMTEMP, COUNT(shapedWeather.dryTemp) AS COUNTTEMP PARALLEL 10;
grunt> sortedResult = SORT aggedResults BY $1 DESC;
2014-10-31 10:22:44,664 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 11, column 0> Syntax error, unexpected symbol at or near 'sortedResult'
Details at logfile: /home/cloudera/pig_1414775884282.log
And the error file says:
/home/cloudera/pig_1414775884282.log
Can anyone give me a solution for this?
Pig Stack Trace
---------------
ERROR 1200: <line 11, column 0> Syntax error, unexpected symbol at or near 'sortedResult'
Failed to parse: <line 11, column 0> Syntax error, unexpected symbol at or near 'sortedResult'
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:235)
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:177)
at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1572)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1545)
at org.apache.pig.PigServer.registerQuery(PigServer.java:518)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:991)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:412)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:538)
at org.apache.pig.Main.main(Main.java:157)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
There is no SORT command in Pig. Use the ORDER BY operator instead:
sortedResult = ORDER aggedResults BY $1 DESC;
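Since the FOREACH that builds aggedResults names its fields, the sort key can also be given by name rather than by position. A small sketch of the same statement, relying on $1 being the field aliased as AVERAGETEMP in the question's script:

```pig
-- $1 is the second field of aggedResults, which the FOREACH named AVERAGETEMP.
sortedResult = ORDER aggedResults BY AVERAGETEMP DESC;
```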
