When I run the Pig query below, I get an error from the SORT command. If I omit the SORT step, the query executes.
grunt> month1 = LOAD 'hdfs://localhost.localdomain:8020/user/cloudera/data/big1/climate_month1.txt' USING PigStorage(',');
grunt> month2 = LOAD 'hdfs://localhost.localdomain:8020/user/cloudera/data/big1/climate_month2.txt' USING PigStorage(',');
grunt> month3 = LOAD 'hdfs://localhost.localdomain:8020/user/cloudera/data/big1/climate_month3.txt' USING PigStorage(',');
grunt> month4 = LOAD 'hdfs://localhost.localdomain:8020/user/cloudera/data/big1/climate_month4.txt' USING PigStorage(',');
grunt> month5 = LOAD 'hdfs://localhost.localdomain:8020/user/cloudera/data/big1/climate_month5.txt' USING PigStorage(',');
grunt> months = UNION month1, month2, month3, month4, month5;
grunt> clearWeather = FILTER months BY skyCondition == 'CLR';
grunt> shapedWeather = FOREACH clearWeather GENERATE date, SUBSTRING(date,0,4) as YEAR, SUBSTRING(date,4,6) as MONTH, SUBSTRING(date,6,8) as DAY, skyCondition, dryTemp;
grunt> groupedByMonthDay = GROUP shapedWeather BY (MONTH, DAY) PARALLEL 10;
grunt> aggedResults = FOREACH groupedByMonthDay GENERATE group as MonthDay, AVG(shapedWeather.dryTemp) AS AVERAGETEMP, MIN(shapedWeather.dryTemp) AS MINIMUMTEMP, MAX(shapedWeather.dryTemp) AS MAXIMUMTEMP, COUNT(shapedWeather.dryTemp) AS COUNTTEMP PARALLEL 10;
grunt> sortedResult = SORT aggedResults BY $1 DESC;
2014-10-31 10:22:44,664 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 11, column 0> Syntax error, unexpected symbol at or near 'sortedResult'
Details at logfile: /home/cloudera/pig_1414775884282.log
Can anyone suggest a solution? The log file contains:
Pig Stack Trace
---------------
ERROR 1200: <line 11, column 0> Syntax error, unexpected symbol at or near 'sortedResult'
Failed to parse: <line 11, column 0> Syntax error, unexpected symbol at or near 'sortedResult'
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:235)
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:177)
at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1572)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1545)
at org.apache.pig.PigServer.registerQuery(PigServer.java:518)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:991)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:412)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:538)
at org.apache.pig.Main.main(Main.java:157)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Pig has no SORT operator. Use ORDER ... BY instead:
sortedResult = ORDER aggedResults BY $1 DESC;
Related
I am stuck after grouping the data by department number. These are the steps I followed:
grunt> A = load '/home/cloudera/naveen1/hive_data/emp_data.txt' using PigStorage(',') as (eno:int,ename:chararray,job:chararray,sal:float,comm:float,dno:int);
grunt> B = group A by dno;
grunt> describe B;
B: {group: int,A: {(eno: int,ename: chararray,job: chararray,sal: float,comm: float,dno: int)}}
Please let me know the steps after this. I am a bit confused about how a nested FOREACH statement executes.
The data contains eno, ename, sal, job, commission and deptno, and I want to extract the maximum salary in each department along with the employee earning it.
Similarly for the minimum salary.
Use the aggregate functions after grouping.
C = FOREACH B GENERATE group,MAX(A.sal),MIN(A.sal),AVG(A.sal),SUM(A.sal);
DUMP C;
To get the name, eno and maximum salary in each department, sort the records within each group and take the top row:
C = FOREACH B {
max_sal = ORDER A BY sal DESC;
max_limit = LIMIT max_sal 1;
GENERATE FLATTEN(max_limit);
}
DUMP C;
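The question also asks for the lowest salary. The same nested FOREACH pattern can be sketched for that case by reversing the sort order. This is a minimal sketch that reuses the relations A and B from the question; the aliases D, min_sal and min_limit are names chosen here for illustration.

```pig
-- Assumes A and B are defined as in the question:
--   A = load ... as (eno:int, ename:chararray, job:chararray, sal:float, comm:float, dno:int);
--   B = group A by dno;
D = FOREACH B {
    -- sort each department's bag by salary, lowest first
    min_sal = ORDER A BY sal ASC;
    -- keep only the first (lowest-paid) row of the sorted bag
    min_limit = LIMIT min_sal 1;
    GENERATE FLATTEN(min_limit);
}
DUMP D;
```

If two employees tie for the lowest salary in a department, LIMIT 1 keeps only one of them.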
I'm trying to write the Pig equivalent of this SQL query:
select * from A where A.ID NOT IN (select id from B)
sourcenew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Source.txt' USING PigStorage(',') as (ID:int,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray);
destnew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Destination.txt' USING PigStorage(',') as (ID:int,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray);
c= FOREACH destnew GENERATE ID;
D=FILTER sourcenew BY NOT ID (c.ID);
org.apache.pig.tools.pigscript.parser.ParseException: Encountered " <PATH> "D=FILTER "" at line 1, column 1.
Was expecting one of:
<EOF>
"cat" ...
"clear" ...<EOF>
Any help resolving this error? I get it when executing the last line.
Use a LEFT OUTER JOIN and FILTER for the rows where the right-hand side is null:
sourcenew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Source.txt' USING PigStorage(',') as (ID:int,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray);
destnew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Destination.txt' USING PigStorage(',') as (ID:int,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray);
c = FOREACH destnew GENERATE ID;
d = JOIN sourcenew BY ID LEFT OUTER,destnew by ID;
e = FILTER d by destnew::ID is null;
NOTE
I wrote a sample script with a couple of test files, and below is the working solution. In your case, check that the data is being loaded correctly from your files.
test1.txt
1 abc
2 def
3 ghi
4 jkl
5 mno
6 pqr
7 stu
8 vwx
1 abc
2 def
3 ghi
4 jkl
1 abc
2 def
3 ghi
1 abc
2 def
test2.txt
1
2
3
4
Script
A = LOAD 'test1.txt' USING PigStorage('\t') AS (aid:int,name:chararray);
B = LOAD 'test2.txt' USING PigStorage('\t') AS (bid:int);
C = JOIN A BY aid LEFT OUTER,B BY bid;
D = FILTER C BY bid is null;
DUMP D;
So in the above example records 5,6,7,8 should be in the result since those Ids are not in test2.txt.
I have two tables, orders and order_items.
The orders table contains (order_id, order_date, order_customer_id, order_status, order_month).
The order_items table contains (order_item_id, order_item_order_id, order_item_product_id, order_item_quantity, order_item_subtotal, order_item_product_price).
The tables are joined on orders.order_id and order_items.order_item_order_id.
No datatypes are provided, so positional notation is used.
orders = LOAD '/user/horton/orders' USING PigStorage(',');
order_items = LOAD '/user/horton/orders' USING PigStorage(',');
ordersjoin = JOIN orders BY $0, order_items BY $1 ;
orderrevenuebydate = FOREACH ordersjoin GENERATE orders::$1, order_items::$4;
I get the following error when running the FOREACH that generates orderrevenuebydate:
Unexpected character '$'
2016-06-19 19:17:22,757 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Unexpected character '$'
Details at logfile: /home/6301dd50e3fac19f7c90fbf9898496/pig_1466356957630.log
You'll have to reference the fields of the joined relation directly by position.
For example, to generate the order_date and order_item_subtotal fields from the ordersjoin relation, use the statement below:
orderrevenuebydate = FOREACH ordersjoin GENERATE $1, $9;
Note that after the join, the ordersjoin relation contains all the fields of both relations: the five orders fields at $0 to $4, followed by the six order_items fields at $5 to $10.
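Counting positions across a join is easy to get wrong, so an alternative is to declare a schema at load time and refer to fields by name. The sketch below is not from the original answer. The field names come from the question, but the types are assumptions, and the path '/user/horton/order_items' is hypothetical (the question loads both relations from '/user/horton/orders').

```pig
-- Types are assumed for illustration; adjust them to the real data.
orders = LOAD '/user/horton/orders' USING PigStorage(',')
    AS (order_id:int, order_date:chararray, order_customer_id:int,
        order_status:chararray, order_month:chararray);

-- Hypothetical path: the question does not give a separate order_items location.
order_items = LOAD '/user/horton/order_items' USING PigStorage(',')
    AS (order_item_id:int, order_item_order_id:int, order_item_product_id:int,
        order_item_quantity:int, order_item_subtotal:float,
        order_item_product_price:float);

ordersjoin = JOIN orders BY order_id, order_items BY order_item_order_id;

-- With named fields, the :: operator identifies which input a field came from.
orderrevenuebydate = FOREACH ordersjoin
    GENERATE orders::order_date, order_items::order_item_subtotal;
```

The `relation::field` form works here because the fields have names; with only positional fields, `$n` has to be applied to the joined relation itself, as in the answer above.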
I am new to Pig Latin and am trying to solve the problem below:
Find the number of employees whose phone number falls in each area code.
EMPID ADD_ID ZIP SAL PHONE DAT
Abcd411 PbcDr60264 953492 46404 111-432-4193 20150113
Abcd874 PbcDr39353 186307 29873 100-432-9164 20150728
Abcd197 PbcDr46725 306185 31908 113-432-4191 20150410
Abcd160 PbcDr77738 330533 61313 105-432-2468 20151007
Abcd327 PbcDr10034 951703 39301 109-432-9235 20150805
Abcd172 PbcDr21679 683299 71686 105-432-5616 20150908
Abcd227 PbcDr57694 876619 46743 109-432-9181 20151101
Abcd900 PbcDr80166 970136 34242 105-432-7415 20150820
Abcd318 PbcDr34711 234066 10989 101-432-9667 20150906
Abcd702 PbcDr86734 997954 97688 105-432-6592 20151026
And below is the way I am trying to solve it.
empdata = LOAD '/home/cloudera/empData.txt' as (empId:chararray, location:chararray, zipCode:long , salary:long, phone:chararray, dateOfJoin:long);
grpdata = GROUP empdata by SUBSTRING(phone, 0, INDEXOF(phone, '-' , 0));
dataCnt = foreach grpdata generate count(grpdata);
But I am getting an error stating: Invalid scalar projection: grpdata : A column needs to be projected from a relation for it to be used as a scalar
A second problem for the same data set:
Find the number of employees whose date of joining is between 2015-01-01 and 2015-05-28.
I tried the solution below, but this time I am not getting any results.
empdata = LOAD '/home/cloudera/empData.txt' as (empId:chararray, location:chararray, zipCode:long , salary:long, phone:chararray, doj:chararray);
filtDate = filter empdata by ToDate(doj, 'yyyyMMdd') >= ToDate('20150101', 'yyyymmdd') AND ToDate(doj, 'yyyyMMdd') <= ToDate('20150528', 'yyyymmdd');
Please help, with an explanation.
Try this:
empdata = LOAD '/home/cloudera/empData.txt' using PigStorage(' ') as (empId:chararray, location:chararray, zipCode:long , salary:long, phone:chararray, dateOfJoin:long);
grpdata = GROUP empdata by SUBSTRING(phone, 0, INDEXOF(phone, '-' , 0));
dataCnt = foreach grpdata generate $0, COUNT(empdata);
You should count empdata, the bag inside each group, rather than grpdata:
dataCnt = foreach grpdata generate COUNT(empdata);
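The answer above does not cover the second problem (date of joining), so what follows is only a sketch under stated assumptions, not a confirmed fix. Two things stand out in the question's script. Its LOAD has no USING clause, so a space-separated file would not be split into fields. Its filter also parses the column with 'yyyyMMdd' but the two bounds with 'yyyymmdd'. ToDate takes Joda-Time style patterns, where lowercase mm is minute-of-hour and uppercase MM is month, so the bounds are not the intended dates. The sketch assumes the fields are separated by single spaces and that the file has no header row; the aliases grpAll and cnt are chosen here for illustration.

```pig
-- Assumption: single-space separated fields, no header line in the file.
empdata = LOAD '/home/cloudera/empData.txt' USING PigStorage(' ')
    AS (empId:chararray, location:chararray, zipCode:long,
        salary:long, phone:chararray, doj:chararray);

-- Use the same pattern, with uppercase MM for month, on both sides of each comparison.
filtDate = FILTER empdata BY
    ToDate(doj, 'yyyyMMdd') >= ToDate('20150101', 'yyyyMMdd') AND
    ToDate(doj, 'yyyyMMdd') <= ToDate('20150528', 'yyyyMMdd');

-- Collapse the filtered rows into one group to count them.
grpAll = GROUP filtDate ALL;
cnt = FOREACH grpAll GENERATE COUNT(filtDate);
DUMP cnt;
```

Nothing is executed or printed until a DUMP or STORE statement is reached, which is another reason a script ending at the FILTER would show no results.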
I'm testing the following example from the Apache Pig docs:
http://pig.apache.org/docs/r0.14.0/basic.html#order-by
but ORDER BY does not seem to sort correctly. Any idea why?
$ pig -version
Apache Pig version 0.14.0 (r1640057)
compiled Nov 16 2014, 18:02:05
grunt> a= load 'data' as (c1:int, c2:int, c3:int);
grunt> describe a;
a: {c1: int,c2: int,c3: int}
grunt> dump a;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
grunt> result = order a by c1 desc;
grunt> dump result;
(8,4,3)
(8,3,4)
(4,3,3)
(4,2,1)
(1,2,3)
(7,2,5)
grunt> result = order a by c2 desc;
grunt> dump result;
(8,4,3)
(7,2,5)
(4,2,1)
(1,2,3)
(4,3,3)
(8,3,4)
grunt> result = order a by c3 desc;
grunt> dump result;
(7,2,5)
(4,3,3)
(8,4,3)
(1,2,3)
(4,2,1)
(8,3,4)
You are loading the data with the default delimiter (tab), but your actual input data is not properly tab-delimited. Can you make sure that the fields in the file 'data' are separated by tabs?
In the example below each input field is delimited by a tab, and it works fine.
data:
1<TAB>2<TAB>3
4<TAB>2<TAB>1
8<TAB>3<TAB>4
4<TAB>3<TAB>3
7<TAB>2<TAB>5
8<TAB>4<TAB>3
grunt> a= load 'data' as (c1:int, c2:int, c3:int);
grunt> dump a;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
grunt> result = order a by c1 desc;
grunt> dump result;
(8,3,4)
(8,4,3)
(7,2,5)
(4,2,1)
(4,3,3)
(1,2,3)
grunt> result = order a by c2 desc;
grunt> dump result;
(8,4,3)
(8,3,4)
(4,3,3)
(1,2,3)
(4,2,1)
(7,2,5)
grunt> result = order a by c3 desc;
grunt> dump result;
(7,2,5)
(8,3,4)
(1,2,3)
(4,3,3)
(8,4,3)
(4,2,1)
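If the fields in 'data' turn out to be separated by something other than a tab, the separator can also be passed to PigStorage explicitly instead of editing the file. A minimal sketch, assuming single spaces (the actual separator in the asker's file is not known):

```pig
-- Assumption: fields in 'data' are separated by single spaces.
a = LOAD 'data' USING PigStorage(' ') AS (c1:int, c2:int, c3:int);
result = ORDER a BY c1 DESC;
DUMP result;
```

PigStorage splits on the one character it is given, so runs of several spaces or mixed tabs and spaces would still produce misaligned fields.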