I have data that looks like this:
STN--- WBAN YEARMODA TEMP DEWP SLP STP VISIB WDSP MXSPD GUST MAX MIN PRCP SNDP FRSHTT
030050 99999 19291029 46.7 4 42.0 4 990.9 4 9999.9 0 10.9 4 13.0 4 13.0 999.9 46.9* 44.1 99.99 999.9 010000
030050 99999 19291030 43.5 4 33.5 4 1015.4 4 9999.9 0 12.4 4 14.3 4 18.1 999.9 46.9 42.1 0.00I 999.9 000000
030050 99999 19291031 43.7 4 37.3 4 1026.8 4 9999.9 0 12.4 4 4.5 4 8.9 999.9 46.9* 37.9 0.00I 999.9 000000
030050 99999 19291101 49.2 4 45.5 4 1019.9 4 9999.9 0 6.2 4 8.2 4 13.0 999.9 51.1* 46.0 99.99 999.9 010000
030050 99999 19291102 47.0 4 44.5 4 1013.6 4 9999.9 0 7.8 4 6.2 4 8.9 999.9 51.1 44.1 0.00I 999.9 000000
030050 99999 19291103 44.0 4 36.0 4 1009.2 4 9999.9 0 10.9 4 8.0 4 8.9 999.9 50.0 42.1 0.00I 999.9 000000
I want to get the average temperature for each month, in this case months 10 and 11.
First I load the data using:
RAW_LOGS = LOAD 'data' as (line:chararray);
Then I split each line into separate fields using a regex:
LOGS_BASE = FOREACH RAW_LOGS GENERATE
FLATTEN(
REGEX_EXTRACT_ALL(line, '^(\\d+)\\s+(\\d+)\\s+(\\d{4})(\\d{2})(\\d{2})\\s+(\\d+\\.\\d).*$')
)
as (
STN: int,
WBAN: int,
YEAR: int,
MONTH: int,
DAY: int,
TEMP: float
);
Next I filter out the tuple that came from the header row:
no_nulls = FILTER LOGS_BASE BY STN is not null;
Then I group the data by STN, WBAN, YEAR, and MONTH:
grouped = group no_nulls by STN..MONTH;
And finally I try to generate an average and run into an error:
C = FOREACH grouped GENERATE AVG(LOGS_BASE.TEMP);
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045:
<line 17, column 29> Could not infer the matching function for org.apache.pig.builtin.AVG as multiple or none of them fit. Please use an explicit cast.
I think the problem may be with my regex, in that it is returning TEMP as a string even though I am telling Pig to treat it as a float, but I could be wrong.
EDIT: I changed C to:
C = FOREACH grouped GENERATE AVG(no_nulls.TEMP);
and now I get this error:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
1.0.3 0.9.2-amzn hadoop 2013-04-20 19:55:25 2013-04-20 19:57:21 GROUP_BY,FILTER
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
job_201304201942_0001 C,LOGS_BASE,RAW_LOGS,grouped,no_nulls GROUP_BY,COMBINER Message: Job failed! Error - # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201304201942_0001_m_000000 hdfs://10.254.106.85:9000/tmp/temp413183623/tmp1677272203,
The log has a bit more info:
org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing average in Initial
at org.apache.pig.builtin.FloatAvg$Initial.exec(FloatAvg.java:99)
at org.apache.pig.builtin.FloatAvg$Initial.exec(FloatAvg.java:75)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:216)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:253)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:334)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:267)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Float
at org.apache.pig.builtin.FloatAvg$Initial.exec(FloatAvg.java:86)
... 19 more
Pig Stack Trace
---------------
ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing average in Initial
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias C. Backend error : Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing average in Initial
at org.apache.pig.PigServer.openIterator(PigServer.java:890)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:679)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:303)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:189)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:500)
at org.apache.pig.Main.main(Main.java:114)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing average in Initial
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:221)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:151)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:354)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1313)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1298)
at org.apache.pig.PigServer.storeEx(PigServer.java:995)
at org.apache.pig.PigServer.store(PigServer.java:962)
at org.apache.pig.PigServer.openIterator(PigServer.java:875)
My guess is that it's because grouped doesn't contain LOGS_BASE; it contains no_nulls. Try changing it to
C = FOREACH grouped GENERATE AVG(no_nulls.TEMP);
and see if that fixes it.
If that doesn't work, try adding DUMP RAW_LOGS after the first line and commenting everything else out. Make sure that output looks good, then uncomment the second line, change the dump to DUMP LOGS_BASE, and repeat for the rest of the lines. It's always good to sanity-check each piece of a Pig script.
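For example, a minimal sketch of that first sanity-check pass (reusing the aliases from the question; LIMIT just keeps the dump small on a big input):
RAW_LOGS = LOAD 'data' as (line:chararray);
few = LIMIT RAW_LOGS 5;
DUMP few;
-- looks good? comment these out, rebuild the next alias, and dump that instead:
-- few = LIMIT LOGS_BASE 5;
-- DUMP few;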
It turns out that TEMP was being treated as a String instead of a Float. I applied the code used here and got it to work. Even though I told Pig to treat the TEMP column as a float in the AS clause, REGEX_EXTRACT_ALL was still returning it as a chararray. This ended up being a one-line fix: putting (tuple(int,int,int,int,int,float)) right before the REGEX_EXTRACT_ALL call. Here's what that code looks like:
LOGS_BASE = FOREACH RAW_LOGS GENERATE
FLATTEN(
(tuple(int,int,int,int,int,float))
REGEX_EXTRACT_ALL(line, '^(\\d+)\\s+(\\d+)\\s+(\\d{4})(\\d{2})(\\d{2})\\s+(-?\\d+\\.\\d).*$')
)
as (
STN: int,
WBAN: int,
YEAR: int,
MONTH: int,
DAY: int,
TEMP: float
);
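For completeness, here is a sketch of how the rest of the script from the question lines up with the fixed LOGS_BASE (same aliases as above; FLATTEN(group) just spreads the grouping keys back out into columns):
no_nulls = FILTER LOGS_BASE BY STN is not null;
grouped = group no_nulls by STN..MONTH;
C = FOREACH grouped GENERATE FLATTEN(group), AVG(no_nulls.TEMP) AS avg_temp;
DUMP C;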
Related
I am trying to find the maximum value of a column, ratingTime, using Pig. I am running the script below:
ratings = LOAD '/user/maria_dev/ml-100k/u.data' AS (userid:int,movieID:int,rating:int, ratingTime:int);
maxrating = MAX(ratings.ratingTime);
DUMP maxrating
Sample input data is:
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
I am getting the error below:
2018-08-05 07:02:05,247 [main] INFO org.apache.pig.backend.hadoop.PigATSClient - Created ATS Hook
2018-08-05 07:02:05,914 [main] ERROR org.apache.pig.PigServer - exception during parsing: Error during parsing. <file script.pi
You need a preceding GROUP ALL before applying MAX: aggregate functions like MAX operate on a bag inside a FOREACH, and GROUP ... ALL collapses the whole relation into a single group for them to work on. (Source)
ratings = LOAD '/user/maria_dev/ml-100k/u.data' USING PigStorage('\t') AS (userid:int,movieID:int,rating:int, ratingTime:int);
rating_group = GROUP ratings ALL;
maxrating = FOREACH rating_group GENERATE MAX(ratings.ratingTime);
DUMP maxrating;
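With only the four sample rows shown above, this would print (891717742); over the full u.data file the value will of course differ.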
I am running a query on Hive. From the logs, it seems to be failing because of an erroneous record in the date_dim table causing a casting exception. When I looked at other records in date_dim, nothing seems different about this particular record (see the sample records at the end).
I may be wrong in pointing to that particular record. Can you please help me understand why this casting error occurs for it, and how it can be resolved?
Any help is highly appreciated!
Hive query :
SELECT 'store' AS channel, 'ss_cdemo_sk' AS col_name, d_year, d_qoy, i_category, ss_ext_sales_price AS ext_sales_price
FROM store_sales , item , date_dim
WHERE ss_cdemo_sk IS NULL
AND ss_sold_date_sk = d_date_sk
AND ss_item_sk = i_item_sk ;
Hive logs :
Query ID = root_20150922151717_a94d4679-224b-41f3-8336-a799d4ebedab
Total jobs = 4
....
Launching Job 3 out of 4
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1441794795162_13426, Tracking URL = http://myhost:8035/proxy/application_1441794795162_13426/
Kill Command = /opt/hes/hadoop/hadoop-2.6.0//bin/hadoop job -kill job_1441794795162_13426
Hadoop job information for Stage-6: number of mappers: 1; number of reducers: 0
2015-09-22 15:18:34,291 Stage-6 map = 0%, reduce = 0%
2015-09-22 15:18:54,541 Stage-6 map = 100%, reduce = 0%
Ended Job = job_1441794795162_13426 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1441794795162_13426_m_000000 (and more) from job job_1441794795162_13426
Task with the most failures(4):
Task ID:task_1441794795162_13426_m_000000
URL : myhost:8088/taskdetails.jsp?jobid=job_1441794795162_13426&tipid=task_1441794795162_13426_m_000000
-----
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"d_date_sk":2452538,"d_date_id":"AAAAAAAAKDMGFCAA","d_date":"2002-09-20","d_month_seq":1232,"d_week_seq":5360,"d_quarter_seq":412,"d_year":2002,"d_dow":5,"d_moy":9,"d_dom":20,"d_qoy":3,"d_fy_year":2002,"d_fy_quarter_seq":412,"d_fy_week_seq":5360,"d_day_name":"Friday ","d_quarter_name":"N","d_holiday":"2002Q3","d_weekend":"N","d_following_holiday":"Y","d_first_dom":2452761,"d_last_dom":2452519,"d_same_day_ly":2452447,"d_same_day_lq":2452173,"d_current_day":"N","d_current_week":"N","d_current_month":"N","d_current_quarter":"N","d_current_year":"N"}
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:185)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"d_date_sk":2452538,"d_date_id":"AAAAAAAAKDMGFCAA","d_date":"2002-09-20","d_month_seq":1232,"d_week_seq":5360,"d_quarter_seq":412,"d_year":2002,"d_dow":5,"d_moy":9,"d_dom":20,"d_qoy":3,"d_fy_year":2002,"d_fy_quarter_seq":412,"d_fy_week_seq":5360,"d_day_name":"Friday ","d_quarter_name":"N","d_holiday":"2002Q3","d_weekend":"N","d_following_holiday":"Y","d_first_dom":2452761,"d_last_dom":2452519,"d_same_day_ly":2452447,"d_same_day_lq":2452173,"d_current_day":"N","d_current_week":"N","d_current_month":"N","d_current_quarter":"N","d_current_year":"N"}
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:503)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:176)
... 8 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unexpected exception: org.apache.hadoop.hive.serde2.io.HiveDecimalWritable cannot be cast to org.apache.hadoop.io.IntWritable
at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:311)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
at org.apache.hadoop.hive.ql.exec.FilterOperator.processOp(FilterOperator.java:120)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:95)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:157)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:493)
... 9 more
Caused by: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.io.HiveDecimalWritable cannot be cast to org.apache.hadoop.io.IntWritable
at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableIntObjectInspector.get(WritableIntObjectInspector.java:36)
at org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPEqual.evaluate(GenericUDFOPEqual.java:84)
at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator._evaluate(ExprNodeGenericFuncEvaluator.java:185)
at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)
at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator$DeferredExprObject.get(ExprNodeGenericFuncEvaluator.java:86)
at org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPAnd.evaluate(GenericUDFOPAnd.java:68)
at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator._evaluate(ExprNodeGenericFuncEvaluator.java:185)
at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)
at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:65)
at org.apache.hadoop.hive.ql.exec.FilterOperator.processOp(FilterOperator.java:106)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.internalForward(CommonJoinOperator.java:638)
at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:651)
at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:654)
at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:750)
at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:299)
... 15 more
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
ATTEMPT: Execute BackupTask: org.apache.hadoop.hive.ql.exec.mr.MapRedTask
.......
Stage-Stage-8: Map: 2 Cumulative CPU: 42.59 sec HDFS Read: 122983367 HDFS Write: 5267230 SUCCESS
Stage-Stage-6: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Stage-Stage-2: Map: 2 Reduce: 1 Cumulative CPU: 8.78 sec HDFS Read: 15648221 HDFS Write: 8106 SUCCESS
Total MapReduce CPU Time Spent: 51 seconds 370 msec
OK
channel col_name d_year d_qoy i_category ext_sales_price
store ss_cdemo_sk 1998 1 Sports NULL
...
Sample data from the date_dim table (including the suspected erroneous record):
hive > select * from date_dim ;
date_dim.d_date_sk date_dim.d_date_id date_dim.d_date date_dim.d_month_seq date_dim.d_week_seq date_dim.d_quarter_seq date_dim.d_year date_dim.d_dow date_dim.d_moy date_dim.d_dom date_dim.d_qoy date_dim.d_fy_year date_dim.d_fy_quarter_seq date_dim.d_fy_week_seq date_dim.d_day_name date_dim.d_quarter_name date_dim.d_holiday date_dim.d_weekend date_dim.d_following_holiday date_dim.d_first_dom date_dim.d_last_dom date_dim.d_same_day_ly date_dim.d_same_day_lq date_dim.d_current_day date_dim.d_current_week date_dim.d_current_month date_dim.d_current_quarter date_dim.d_current_year
2452538 AAAAAAAAKDMGFCAA 2002-09-20 1232 5360 412 2002 5 9 20 3 2002 412 5360 Friday N 2002Q3 N Y 2452761 2452519 2452447 2452173 N N N N N
2431208 AAAAAAAAIOIBFCAA 1944-04-27 531 2313 178 1944 4 4 27 2 1944 178 2313 Thursday N 1944Q2 N N 2431272 2431182 2431117 2430842 N N N N N
2456494 AAAAAAAAOKLHFCAA 2013-07-20 1362 5925 455 2013 6 7 20 3 2013 455 5925 Saturday N 2013Q3 N Y 2456655 2456475 2456403 2456129 N N N N N
2481780 AAAAAAAAEHONFCAA 2082-10-12 2193 9537 732 2082 1 10 12 4 2082 732 9537 Monday N 2082Q4 N N 2482041 2481769 2481688 2481415 N N N N N
hive> describe store_sales ;
ss_sold_date_sk int
ss_sold_time_sk int
ss_item_sk int
ss_customer_sk int
ss_cdemo_sk int
ss_hdemo_sk int
ss_addr_sk int
ss_store_sk int
ss_promo_sk int
ss_ticket_number int
ss_quantity int
ss_wholesale_cost decimal(7,2)
ss_list_price decimal(7,2)
ss_sales_price decimal(7,2)
ss_ext_discount_amt decimal(7,2)
ss_ext_sales_price decimal(7,2)
ss_ext_wholesale_cost decimal(7,2)
ss_ext_list_price decimal(7,2)
ss_ext_tax decimal(7,2)
ss_coupon_amt decimal(7,2)
ss_net_paid decimal(7,2)
ss_net_paid_inc_tax decimal(7,2)
ss_net_profit decimal(7,2)
hive> describe item ;
i_item_sk int
i_item_id string
i_rec_start_date date
i_rec_end_date date
i_item_desc string
i_current_price decimal(7,2)
i_wholesale_cost decimal(7,2)
i_brand_id int
i_brand string
i_class_id int
i_class string
i_category_id int
i_category string
i_manufact_id int
i_manufact string
i_size string
i_formulation string
i_color string
i_units string
i_container string
i_manager_id int
i_product_name string
hive> describe date_dim ;
d_date_sk int
d_date_id string
d_date date
d_month_seq int
d_week_seq int
d_quarter_seq int
d_year int
d_dow int
d_moy int
d_dom int
d_qoy int
d_fy_year int
d_fy_quarter_seq int
d_fy_week_seq int
d_day_name string
d_quarter_name string
d_holiday string
d_weekend string
d_following_holiday string
d_first_dom int
d_last_dom int
d_same_day_ly int
d_same_day_lq int
d_current_day string
d_current_week string
d_current_month string
d_current_quarter string
d_current_year string
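One hedged thing to try, assuming the join-key comparison is what triggers the IntWritable cast (all the schemas above declare the keys as int, so the decimal may be coming from table metadata rather than the rows themselves): rewrite the implicit joins as explicit JOINs and cast both sides of each key so the comparison type is unambiguous. The query below is a sketch, not a confirmed fix:
SELECT 'store' AS channel, 'ss_cdemo_sk' AS col_name, d_year, d_qoy, i_category, ss_ext_sales_price AS ext_sales_price
FROM store_sales
JOIN date_dim ON CAST(ss_sold_date_sk AS INT) = CAST(d_date_sk AS INT)
JOIN item ON CAST(ss_item_sk AS INT) = CAST(i_item_sk AS INT)
WHERE ss_cdemo_sk IS NULL;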
I have data like this:
store trn_date dept_id sale_amt
1 2014-12-14 101 10007655
1 2014-12-14 101 10007654
1 2014-12-14 101 10007544
6 2014-12-14 104 100086544
8 2014-12-14 101 1000000
9 2014-12-14 106 1000000
I want to get the sum of sale_amt. First I load the data using:
table = LOAD 'table' USING org.apache.hcatalog.pig.HCatLoader();
Then I group the data on store, tran_date, and dept_id:
grp_table = GROUP table BY (store, tran_date, dept_id);
Finally I try to get the SUM of sale_amt using:
grp_gen = FOREACH grp_table GENERATE
FLATTEN(group) AS (store, tran_date, dept_id),
SUM(table.sale_amt) AS tota_sale_amt;
and get the error below:
================================================================================
Pig Stack Trace
---------------
ERROR 2103: Problem doing work on Longs
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: grouped_all: Local Rearrange[tuple]{tuple}(false) - scope-1317 Operator Key: scope-1317): org.apache.pig.backend.executionengine.ExecException: ERROR 2103: Problem doing work on Longs
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:289)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNextTuple(POLocalRearrange.java:263)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:183)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:161)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:51)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1645)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1611)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1462)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:700)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:770)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2103: Problem doing work on Longs
at org.apache.pig.builtin.AlgebraicLongMathBase.doTupleWork(AlgebraicLongMathBase.java:84)
at org.apache.pig.builtin.AlgebraicLongMathBase$Intermediate.exec(AlgebraicLongMathBase.java:108)
at org.apache.pig.builtin.AlgebraicLongMathBase$Intermediate.exec(AlgebraicLongMathBase.java:102)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:330)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextTuple(POUserFunc.java:369)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:333)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:378)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:298)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:281)
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Number
at org.apache.pig.builtin.AlgebraicLongMathBase.doTupleWork(AlgebraicLongMathBase.java:77)
================================================================================
Since I'm reading the table using the HCatalog loader, and the data type in the Hive table is string, I have tried casting in the script as well, but I still get the same error.
I don't have HCatalog installed on my system, so I tried with a simple file, but the approach and code below should work for you.
1. SUM works only with numeric types (int, long, float, double, bigdecimal, biginteger, or bytearray cast as double). It looks like your sale_amt column is a string, so you need to cast this column to long or double before using the SUM function.
2. You should not use store as a field name, because it is a reserved keyword in Pig; rename it or you will get an error. I renamed this field to "stores".
Example:
table:
1 2014-12-14 101 10007655
1 2014-12-14 101 10007654
1 2014-12-14 101 10007544
6 2014-12-14 104 100086544
8 2014-12-14 101 1000000
9 2014-12-14 106 1000000
PigScript:
A = LOAD 'table' USING PigStorage() AS (store:chararray,trn_date:chararray,dept_id:chararray,sale_amt:chararray);
B = FOREACH A GENERATE $0 AS stores,trn_date,dept_id,(long)sale_amt; --Renamed the variable store to stores and typecasted the sale_amt to long.
C = GROUP B BY (stores,trn_date,dept_id);
D = FOREACH C GENERATE FLATTEN(group),SUM(B.sale_amt);
DUMP D;
Output:
(1,2014-12-14,101,30022853)
(6,2014-12-14,104,100086544)
(8,2014-12-14,101,1000000)
(9,2014-12-14,106,1000000)
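Since the original script loads through HCatLoader, the same two fixes should carry over; a sketch, assuming the Hive columns are all strings as described (the 'db.table' name is a placeholder):
table = LOAD 'db.table' USING org.apache.hcatalog.pig.HCatLoader();
B = FOREACH table GENERATE $0 AS stores, trn_date, dept_id, (long)sale_amt;
C = GROUP B BY (stores, trn_date, dept_id);
D = FOREACH C GENERATE FLATTEN(group), SUM(B.sale_amt) AS total_sale_amt;
DUMP D;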
Below is the line_profiler record of a function:
Wrote profile results to FM_CORE.py.lprof
Timer unit: 2.79365e-07 s
File: F:\FM_CORE.py
Function: _rpt_join at line 1068
Total time: 1.87766 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1068 #profile
1069 def _rpt_join(dfa, dfb, join_type='inner'):
1070 ''' join two dataframe together by ('STK_ID','RPT_Date') multilevel index.
1071 'join_type' can be 'inner' or 'outer'
1072 '''
1073
1074 2 56 28.0 0.0 try: # ('STK_ID','RPT_Date') are normal column
1075 2 2936668 1468334.0 43.7 rst = pd.merge(dfa, dfb, how=join_type, on=['STK_ID','RPT_Date'], left_index=True, right_index=True)
1076 except: # ('STK_ID','RPT_Date') are index
1077 rst = pd.merge(dfa, dfb, how=join_type, left_index=True, right_index=True)
1078
1079
1080 2 81 40.5 0.0 try: # handle 'STK_Name
1081 2 426472 213236.0 6.3 name_combine = pd.concat([dfa.STK_Name, dfb.STK_Name])
1082
1083
1084 2 900584 450292.0 13.4 nameseries = name_combine[-Series(name_combine.index.values, name_combine.index).duplicated()]
1085
1086 2 1138140 569070.0 16.9 rst.STK_Name_x = nameseries
1087 2 596768 298384.0 8.9 rst = rst.rename(columns={'STK_Name_x': 'STK_Name'})
1088 2 722293 361146.5 10.7 rst = rst.drop(['STK_Name_y'], axis=1)
1089 except:
1090 pass
1091
1092 2 94 47.0 0.0 return rst
What surprises me are these two lines:
1087 2 596768 298384.0 8.9 rst = rst.rename(columns={'STK_Name_x': 'STK_Name'})
1088 2 722293 361146.5 10.7 rst = rst.drop(['STK_Name_y'], axis=1)
Why do simple DataFrame column "rename" and "drop" operations cost that large a share of the time (8.9% + 10.7%)? After all, the "merge" operation only costs 43.7%, and "rename"/"drop" don't look like calculation-intensive operations. How can I improve this?
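In the pandas versions of that era, both rename and drop return a full copy of the frame, so on a wide result they cost real time. One hedged way to avoid both passes entirely (a sketch reusing the dfa/dfb names from the profile): strip dfb's duplicate STK_Name column before merging, so pandas never creates the STK_Name_x/STK_Name_y suffix columns that then need renaming and dropping.

import pandas as pd

def rpt_join_fast(dfa, dfb, join_type='inner'):
    # Remove dfb's copy of STK_Name up front; the merge then keeps dfa's
    # column under its original name, so no rename/drop pass is needed.
    if 'STK_Name' in dfb.columns:
        dfb = dfb.drop(['STK_Name'], axis=1)
    return pd.merge(dfa, dfb, how=join_type, left_index=True, right_index=True)

Note this keeps only dfa's names; the original also backfills names from dfb via the concat step, so keep that logic if you need it.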
I have been stuck on this problem for over twelve hours now. I have a Pig script that is running on Amazon Web Services. Currently, I am just running my script in interactive mode. I am trying to get averages on a large data set of climate readings from weather stations; however, this data doesn't have country or state information, so it has to be joined with another table that does.
State Table:
719990 99999 LILLOOET CN CA BC WKF +50683 -121933 +02780
719994 99999 SEDCO 710 CN CA CWQJ +46500 -048500 +00000
720000 99999 BOGUS AMERICAN US US -99999 -999999 -99999
720001 99999 PEASON RIDGE/RANGE US US LA K02R +31400 -093283 +01410
720002 99999 HALLOCK(AWS) US US MN K03Y +48783 -096950 +02500
720003 99999 DEER PARK(AWS) US US WA K07S +47967 -117433 +06720
720004 99999 MASON US US MI K09G +42567 -084417 +02800
720005 99999 GASTONIA US US NC K0A6 +35200 -081150 +02440
Climate Table: (I realize this doesn't contain anything to satisfy the join condition, but the full data set does.)
STN--- WBAN YEARMODA TEMP DEWP SLP STP VISIB WDSP MXSPD GUST MAX MIN PRCP SNDP FRSHTT
010010 99999 20090101 23.3 24 15.6 24 1033.2 24 1032.0 24 13.5 6 9.6 24 17.5 999.9 27.9* 16.7 0.00G 999.9 001000
010010 99999 20090102 27.3 24 20.5 24 1026.1 24 1024.9 24 13.7 5 14.6 24 23.3 999.9 28.9 25.3* 0.00G 999.9 001000
010010 99999 20090103 25.2 24 18.4 24 1028.3 24 1027.1 24 15.5 6 4.2 24 9.7 999.9 26.2* 23.9* 0.00G 999.9 001000
010010 99999 20090104 27.7 24 23.2 24 1019.3 24 1018.1 24 6.7 6 8.6 24 13.6 999.9 29.8 24.8 0.00G 999.9 011000
010010 99999 20090105 19.3 24 13.0 24 1015.5 24 1014.3 24 5.6 6 17.5 24 25.3 999.9 26.2* 10.2* 0.05G 999.9 001000
010010 99999 20090106 12.9 24 2.9 24 1019.6 24 1018.3 24 8.2 6 15.5 24 25.3 999.9 19.0* 8.8 0.02G 999.9 001000
010010 99999 20090107 26.2 23 20.7 23 998.6 23 997.4 23 6.6 6 12.1 22 21.4 999.9 31.5 19.2* 0.00G 999.9 011000
010010 99999 20090108 21.5 24 15.2 24 995.3 24 994.1 24 12.4 5 12.8 24 25.3 999.9 24.6* 19.2* 0.05G 999.9 011000
010010 99999 20090109 27.5 23 24.5 23 982.5 23 981.3 23 7.9 5 20.2 22 33.0 999.9 34.2 20.1* 0.00G 999.9 011000
010010 99999 20090110 22.5 23 16.7 23 977.2 23 976.1 23 11.9 6 15.5 23 35.0 999.9 28.9* 17.2 0.09G 999.9 000000
I load in the climate data using TextLoader, apply a regular expression to obtain the fields, and filter out the nulls from the result set. I then do the same with the state data, but I filter it for the country being the US.
The bags have the following schema:
CLIMATE_REMOVE_EMPTY: {station: int,wban: int,year: int,month: int,day: int,temp: double}
STATES_FILTER_US: {station: int,wban: int,name: chararray,wmo: chararray,fips: chararray,state: chararray}
I need to perform a join operation on (station,wban) so I can get a resulting bag with the station, wban, year, month, and temps. When I perform a dump on the resulting bag, it says that it was successful; however, the dump returns 0 results. This is the output.
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
1.0.3 0.9.2-amzn hadoop 2013-05-03 00:10:51 2013-05-03 00:12:42 HASH_JOIN,FILTER
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime MinReduceTime AvgReduceTime Alias Feature Outputs
job_201305030005_0001 2 1 36 15 25 33 33 33 CLIMATE,CLIMATE_REMOVE_NULL,RAW_CLIMATE,RAW_STATES,STATES,STATES_FILTER_US,STATE_CLIMATE_JOIN HASH_JOIN hdfs://10.204.30.125:9000/tmp/temp-204730737/tmp1776606203,
Input(s):
Successfully read 30587 records from: "hiddenbucket"
Successfully read 21027 records from: "hiddenbucket"
Output(s):
Successfully stored 0 records in: "hdfs://10.204.30.125:9000/tmp/temp-204730737/tmp1776606203"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
I have no idea why this contains 0 results. My data extraction seems correct, and the job is successful, which leads me to believe that the join condition is never satisfied. I know the input files have data that should satisfy it, but the join returns absolutely nothing.
The only thing that looks suspicious is a warning that states:
Encountered Warning ACCESSING_NON_EXISTENT_FIELD 26001 time(s).
I'm not exactly sure where to go from here. Since the job isn't failing, I can't see any errors or debug output.
I'm not sure if these mean anything, but here are other things that stand out:
When I try to illustrate STATE_CLIMATE_JOIN, I get a NullPointerException - ERROR 2997: Encountered IOException. Exception : null
When I try to illustrate STATES, I get java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
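The ACCESSING_NON_EXISTENT_FIELD warning usually means the regex produced fewer groups than the schema expects on many lines, leaving those fields null. One quick check is to count how many STATES rows came out null (a sketch against the aliases in the script below; COUNT_STAR is used because plain COUNT skips tuples whose first field is null):
BAD_STATES = FILTER STATES BY station IS NULL;
G = GROUP BAD_STATES ALL;
CNT = FOREACH G GENERATE COUNT_STAR(BAD_STATES);
DUMP CNT;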
Here is my full code:
--Piggy Bank Functions
register file:/home/hadoop/lib/pig/piggybank.jar
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
--Load Climate Data
RAW_CLIMATE = LOAD 'hiddenbucket' USING TextLoader as (line:chararray);
RAW_STATES= LOAD 'hiddenbucket' USING TextLoader as (line:chararray);
CLIMATE=
FOREACH
RAW_CLIMATE
GENERATE
FLATTEN ((tuple(int,int,int,int,int,double))
EXTRACT(line,'^(\\d{6})\\s+(\\d{5})\\s+(\\d{4})(\\d{2})(\\d{2})\\s+(\\d{1,3}\\.\\d{1})')
)
AS (
station: int,
wban: int,
year: int,
month: int,
day: int,
temp: double
)
;
STATES=
FOREACH
RAW_STATES
GENERATE
FLATTEN ((tuple(int,int,chararray,chararray,chararray,chararray))
EXTRACT(line,'^(\\d{6})\\s+(\\d{5})\\s+(\\S+)\\s+(\\w{2})\\s+(\\w{2})\\s+(\\w{2})')
)
AS (
station: int,
wban: int,
name: chararray,
wmo: chararray,
fips: chararray,
state: chararray
)
;
CLIMATE_REMOVE_NULL = FILTER CLIMATE BY station IS NOT NULL;
STATES_FILTER_US = FILTER STATES BY (fips == 'US');
STATE_CLIMATE_JOIN = JOIN CLIMATE_REMOVE_NULL BY (station), STATES_FILTER_US BY (station);
Thanks in advance. I am at a loss here.
--EDIT--
I finally got it to work! My regular expression for parsing the STATE_DATA was invalid.
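For anyone hitting the same wall: a plausible culprit (an assumption based on the sample rows above, e.g. SEDCO 710 and PEASON RIDGE/RANGE) is that station names can contain spaces, which the single (\\S+) group never matches across. A non-greedy name group that stops at the two-letter codes is one hedged fix:
STATES=
FOREACH
RAW_STATES
GENERATE
FLATTEN ((tuple(int,int,chararray,chararray,chararray,chararray))
EXTRACT(line,'^(\\d{6})\\s+(\\d{5})\\s+(.+?)\\s+(\\w{2})\\s+(\\w{2})\\s+(\\w{2})\\b')
)
AS (
station: int,
wban: int,
name: chararray,
wmo: chararray,
fips: chararray,
state: chararray
)
;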