Pig Error on SUM function - hadoop

I have data like -
store trn_date dept_id sale_amt
1 2014-12-14 101 10007655
1 2014-12-14 101 10007654
1 2014-12-14 101 10007544
6 2014-12-14 104 100086544
8 2014-12-14 101 1000000
9 2014-12-14 106 1000000
I want to get the sum of sale_amt. First, I load the data using:
table = LOAD 'table' USING org.apache.hcatalog.pig.HCatLoader();
Then I group the data by store, trn_date, dept_id:
grp_table = GROUP table BY (store, trn_date, dept_id);
Finally, I try to get the SUM of sale_amt using:
grp_gen = FOREACH grp_table GENERATE
FLATTEN(group) AS (store, trn_date, dept_id),
SUM(table.sale_amt) AS total_sale_amt;
I'm getting the error below:
================================================================================
Pig Stack Trace
---------------
ERROR 2103: Problem doing work on Longs
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: grouped_all: Local Rearrange[tuple]{tuple}(false) - scope-1317 Operator Key: scope-1317): org.apache.pig.backend.executionengine.ExecException: ERROR 2103: Problem doing work on Longs
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:289)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNextTuple(POLocalRearrange.java:263)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:183)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:161)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:51)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1645)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1611)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1462)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:700)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:770)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2103: Problem doing work on Longs
at org.apache.pig.builtin.AlgebraicLongMathBase.doTupleWork(AlgebraicLongMathBase.java:84)
at org.apache.pig.builtin.AlgebraicLongMathBase$Intermediate.exec(AlgebraicLongMathBase.java:108)
at org.apache.pig.builtin.AlgebraicLongMathBase$Intermediate.exec(AlgebraicLongMathBase.java:102)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:330)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextTuple(POUserFunc.java:369)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:333)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:378)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:298)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:281)
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Number
at org.apache.pig.builtin.AlgebraicLongMathBase.doTupleWork(AlgebraicLongMathBase.java:77)
================================================================================
Since I'm reading the table using HCatLoader and the column's data type in the Hive table is string, I have also tried casting in the script, but I still get the same error.

I don't have HCatalog installed on my system, so I tried with a plain file instead, but the approach and code below should work for you.
1. SUM works only on numeric types (int, long, float, double, bigdecimal, biginteger, or bytearray cast as double). It looks like your sale_amt column is a string, so you need to cast this column to long or double before using the SUM function.
2. You should not use store as a field name, because it is a reserved keyword in Pig; you have to rename it, otherwise you will get an error. I renamed this field to "stores".
Example:
table:
1 2014-12-14 101 10007655
1 2014-12-14 101 10007654
1 2014-12-14 101 10007544
6 2014-12-14 104 100086544
8 2014-12-14 101 1000000
9 2014-12-14 106 1000000
PigScript:
A = LOAD 'table' USING PigStorage() AS (store:chararray,trn_date:chararray,dept_id:chararray,sale_amt:chararray);
B = FOREACH A GENERATE $0 AS stores,trn_date,dept_id,(long)sale_amt; --Renamed the variable store to stores and typecasted the sale_amt to long.
C = GROUP B BY (stores,trn_date,dept_id);
D = FOREACH C GENERATE FLATTEN(group),SUM(B.sale_amt);
DUMP D;
Output:
(1,2014-12-14,101,30022853)
(6,2014-12-14,104,100086544)
(8,2014-12-14,101,1000000)
(9,2014-12-14,106,1000000)
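If you keep HCatLoader as in the original script, a minimal sketch of the same fix would look like this (the column names are taken from the question; the $0 rename of store and the cast to long follow points 1 and 2 above and are untested assumptions against HCatalog):
table = LOAD 'table' USING org.apache.hcatalog.pig.HCatLoader();
B = FOREACH table GENERATE $0 AS stores, trn_date, dept_id, (long)sale_amt AS sale_amt; -- rename store and cast the string column to long
C = GROUP B BY (stores, trn_date, dept_id);
D = FOREACH C GENERATE FLATTEN(group) AS (stores, trn_date, dept_id), SUM(B.sale_amt) AS total_sale_amt;
DUMP D;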

Related

removing bad data from a data file using pig

I have a data file like this
1943 49 1
1975 91 L
1903 56 3
1909 52 3
1953 96 3
1912 82
1976 66 3
1913 35
1990 45 1
1927 92 A
1912 2
1924 22
1971 2
1959 94 E
Now, using a Pig script, I want to remove the bad data: the rows that have characters in the quality field or empty fields.
I tried this:
records = load '/user/a106524609/test.txt' using PigStorage(' ') as
(year:chararray, temperature:int, quality:int);
rec1 = filter records by temperature != 'null' and (quality != 'null ')
Load it as lines
A = load 'data.txt' using PigStorage('\n') as (line:chararray);
Split on all whitespaces
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line, '\\s+')) as (year:int,temp:int,quality:chararray);
Filter by valid strings
C = FILTER B BY quality IN ('0','1','2','3','4','5','6','7','8','9');
(Optionally) Cast to an int
D = FOREACH C GENERATE year,temp,(int)quality;
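If you prefer a single regex test over the IN list, an equivalent Pig filter would be (a sketch; it assumes quality should be exactly one digit, as in the sample data):
C = FILTER B BY quality MATCHES '[0-9]'; -- rows with a missing or non-numeric quality field are dropped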
In Spark, I would start with a regex match of the expected format.
val cleanRows = sc.textFile("data.txt")
.filter(line => line.matches("(?:\\d+\\s+){2}\\d+"))

Maximum value of a column in apache pig

I am trying to find the maximum value of the ratingTime column using Pig. I am running the script below:
ratings = LOAD '/user/maria_dev/ml-100k/u.data' AS (userid:int,movieID:int,rating:int, ratingTime:int);
maxrating = MAX(ratings.ratingTime);
DUMP maxrating
Sample Input data is :
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
I am getting the error below:
2018-08-05 07:02:05,247 [main] INFO org.apache.pig.backend.hadoop.PigATSClient - Created ATS Hook
2018-08-05 07:02:05,914 [main] ERROR org.apache.pig.PigServer - exception during parsing: Error during parsing. <file script.pi
You need a preceding GROUP ALL before applying MAX (source).
ratings = LOAD '/user/maria_dev/ml-100k/u.data' USING PigStorage('\t') AS (userid:int,movieID:int,rating:int, ratingTime:int);
ratings_group = GROUP ratings ALL;
maxrating = FOREACH ratings_group GENERATE MAX(ratings.ratingTime);
DUMP maxrating;

Core Data migration: exceptions after changing a relationship from one entity to its parent entity

I have a Core Data model that includes a Car entity and let's say an EngineData entity. This is a one-to-one relationship.
In a new version of my app I want to add trucks. So I've created a new version of my Core Data model. I now have a Vehicle entity that is the parent entity of Car. I've added a new Truck entity that also has Vehicle as its parent entity. For EngineData, the inverse relationship was generically named object, so the destination entity just changes from Car to Vehicle.
I wasn't entirely sure this would work with lightweight migration, but I first changed it a couple of weeks ago, and until now it seemed fine. I have code that fetches and updates existing data from EngineData using a Car's engineData property, and I haven't seen any issues there. However, there's a single Core Data fetch in my app that causes a crash every time. All I have to do to cause the crash is a simple fetch of all my EngineData objects:
do {
let request: NSFetchRequest<EngineData> = EngineData.fetchRequest()
let objects = try context.fetch(request)
} catch {
NSLog("Error fetching data: \(error)")
}
On the context.fetch line, I get an exception:
[error] error: Background Core Data task threw an exception. Exception = *** -[__NSArrayM objectAtIndex:]: index 18446744073709551615 beyond bounds [0 .. 10] and userInfo = (null)
CoreData: error: Background Core Data task threw an exception. Exception = *** -[__NSArrayM objectAtIndex:]: index 18446744073709551615 beyond bounds [0 .. 10] and userInfo = (null)
And if I try to actually do anything with those objects, I get some more exceptions until the app crashes:
[General] An uncaught exception was raised
[General] *** -[__NSArrayM objectAtIndex:]: index 18446744073709551615 beyond bounds [0 .. 10]
0 CoreFoundation 0x00007fff8861937b __exceptionPreprocess + 171
1 libobjc.A.dylib 0x00007fff9d40d48d objc_exception_throw + 48
2 CoreFoundation 0x00007fff88532b5c -[__NSArrayM objectAtIndex:] + 204
3 CoreData 0x00007fff881978ed -[NSSQLRow newObjectIDForToOne:] + 253
4 CoreData 0x00007fff8819770f -[NSSQLRow _validateToOnes] + 399
5 CoreData 0x00007fff88197571 -[NSSQLRow knownKeyValuesPointer] + 33
6 CoreData 0x00007fff88191868 _prepareResultsFromResultSet + 4312
7 CoreData 0x00007fff8818e47b newFetchedRowsForFetchPlan_MT + 3387
8 CoreData 0x00007fff8835f6d7 _executeFetchRequest + 55
9 CoreData 0x00007fff8828bb35 -[NSSQLFetchRequestContext executeRequestUsingConnection:] + 53
10 CoreData 0x00007fff8832c9c8 __52-[NSSQLDefaultConnectionManager handleStoreRequest:]_block_invoke + 216
11 libdispatch.dylib 0x000000010105478c _dispatch_client_callout + 8
12 libdispatch.dylib 0x00000001010555ad _dispatch_barrier_sync_f_invoke + 307
13 CoreData 0x00007fff8832c89d -[NSSQLDefaultConnectionManager handleStoreRequest:] + 237
14 CoreData 0x00007fff88286c86 -[NSSQLCoreDispatchManager routeStoreRequest:] + 310
15 CoreData 0x00007fff88261189 -[NSSQLCore dispatchRequest:withRetries:] + 233
16 CoreData 0x00007fff8825e21d -[NSSQLCore processFetchRequest:inContext:] + 93
17 CoreData 0x00007fff8817d218 -[NSSQLCore executeRequest:withContext:error:] + 568
18 CoreData 0x00007fff882436de __65-[NSPersistentStoreCoordinator executeRequest:withContext:error:]_block_invoke + 5486
19 CoreData 0x00007fff8823a347 -[NSPersistentStoreCoordinator _routeHeavyweightBlock:] + 407
20 CoreData 0x00007fff8817cd9e -[NSPersistentStoreCoordinator executeRequest:withContext:error:] + 654
21 CoreData 0x00007fff8817af51 -[NSManagedObjectContext executeFetchRequest:error:] + 593
22 libswiftCoreData.dylib 0x0000000100c91648 _TFE8CoreDataCSo22NSManagedObjectContext5fetchuRxSo20NSFetchRequestResultrfzGCSo14NSFetchRequestx_GSax_ + 56
I thought the flag -com.apple.CoreData.SQLDebug 1 might give me some useful information, but I don't see anything very helpful here:
CoreData: annotation: Connecting to sqlite database file at "/Users/name/Library/Group Containers/Folder/Database.sqlite"
CoreData: sql: pragma recursive_triggers=1
CoreData: sql: pragma journal_mode=wal
CoreData: sql: SELECT Z_VERSION, Z_UUID, Z_PLIST FROM Z_METADATA
CoreData: sql: SELECT TBL_NAME FROM SQLITE_MASTER WHERE TBL_NAME = 'Z_MODELCACHE'
CoreData: sql: SELECT 0, t0.Z_PK, t0.Z_OPT, t0.ZDATA, t0.ZOBJECT, t0.Z9_OBJECT FROM ZENGINEDATA t0
2017-04-28 16:18:41.693548-0400 AppName[95979:11442888] [error] error: Background Core Data task threw an exception. Exception = *** -[__NSArrayM objectAtIndex:]: index 18446744073709551615 beyond bounds [0 .. 10] and userInfo = (null)
CoreData: error: Background Core Data task threw an exception. Exception = *** -[__NSArrayM objectAtIndex:]: index 18446744073709551615 beyond bounds [0 .. 10] and userInfo = (null)
CoreData: annotation: sql connection fetch time: 0.0065s
CoreData: annotation: fetch using NSSQLiteStatement <0x6080004805a0> on entity 'ENGINEDATA' with sql text 'SELECT 0, t0.Z_PK, t0.Z_OPT, t0.ZDATA, t0.ZOBJECT, t0.Z9_OBJECT FROM ZENGINEDATA t0 ' returned 1185 rows with values: (
"<JUNENGINEDATA: 0x608000480eb0> (entity: ENGINEDATA; id: 0x40000b <x-coredata://10C884E2-18EF-4DA2-BC5D-4CBD0CE7D1A6/ENGINEDATA/p1> ; data: <fault>)",
"<JUNENGINEDATA: 0x608000480f00> (entity: ENGINEDATA; id: 0x80000b <x-coredata://10C884E2-18EF-4DA2-BC5D-4CBD0CE7D1A6/ENGINEDATA/p2> ; data: <fault>)",
"<JUNENGINEDATA: 0x608000480f50> (entity: ENGINEDATA; id: 0xc0000b <x-coredata://10C884E2-18EF-4DA2-BC5D-4CBD0CE7D1A6/ENGINEDATA/p3> ; data: <fault>)",
...
"<JUNENGINEDATA: 0x600000288110> (entity: ENGINEDATA; id: 0x128c0000b <x-coredata://10C884E2-18EF-4DA2-BC5D-4CBD0CE7D1A6/ENGINEDATA/p1187> ; data: <fault>)",
"<JUNENGINEDATA: 0x600000288160> (entity: ENGINEDATA; id: 0x12900000b <x-coredata://10C884E2-18EF-4DA2-BC5D-4CBD0CE7D1A6/ENGINEDATA/p1188> ; data: <fault>)",
"<JUNENGINEDATA: 0x6000002881b0> (entity: ENGINEDATA; id: 0x12940000b <x-coredata://10C884E2-18EF-4DA2-BC5D-4CBD0CE7D1A6/ENGINEDATA/p1189> ; data: <fault>)"
)
CoreData: annotation: total fetch execution time: 0.1053s for 1185 rows.
2017-04-28 16:18:41.793201-0400 AppName[95979:11442649] Got results: 1185
The good news is everything in EngineData can be recreated, and I've found that if I do a batch delete of all EngineData objects, it no longer crashes (even after creating some new objects). If possible I'd greatly prefer to understand the cause of the problem though, and find a less drastic solution.
Here are some other things I've discovered:
I tried copying a database from another computer, and I was not able to duplicate the problem using that data. That got me thinking maybe the data was corrupt to begin with…
But if I go back to an old version of my app and its data, the same fetch works fine. If I do nothing other than upgrade the data model, I get the same exceptions.
I've tried setting a version hash modifier for the relationship, hoping that might force Core Data to clean things up when it migrates, but no luck.
If I set a batch size of 10, it's able to loop through 900 results (out of 1185) before the exception.
Using that process, I can delete good records one at a time to narrow down which ones are problematic. (I'm saving after each delete, and I kept a backup of the original database if I need to go back for further testing.)
After a lot of experimenting I was able to narrow it down: I have a handful of EngineData objects that reference a Car object that no longer exists. It looks like since that relationship is broken, the records aren't migrated correctly to the new data model version. If I open the .sqlite file in the Mac app Base, I can see a new Z9_OBJECT field that's set to 10 for all the good records, and NULL for the damaged ones.
Fixing the problem was fairly easy once I discovered the cause. Using NSBatchDeleteRequest I was able to find and delete the damaged objects:
do {
let fetchRequest = NSFetchRequest<NSFetchRequestResult>(entityName: "EngineData")
fetchRequest.predicate = NSPredicate(format: "object.uuid == nil")
let batchRequest = NSBatchDeleteRequest(fetchRequest: fetchRequest)
try context.execute(batchRequest)
} catch {
NSLog("Error deleting objects: \(error)")
}
In my predicate, object is the parent object (previously a Car, now a Vehicle), and uuid is a non-optional attribute on that object. As long as the attribute isn't optional—meaning it will have a value on any undamaged object—this should successfully delete only the objects that are damaged.
You might be expecting this to cause an exception, since it's still fetching the damaged objects! Fortunately NSBatchDeleteRequest works by running SQL statements directly on the store, without loading anything into memory—so it avoids the problem.

Error while executing hive Query : org.apache.hadoop.hive.serde2.io.HiveDecimalWritable cannot be cast to org.apache.hadoop.io.IntWritable

I am running a query on Hive. From the logs, it seems to be failing because of an erroneous record in the table [date_dim] that causes a casting exception. When I looked at other records in [date_dim], nothing stood out about this particular record (see sample records at the end).
I may be wrong in pointing to this particular record. I'm trying to understand why this casting error occurs and how it can be resolved.
Can you please help me understand why the error happens for this particular record, and how it can be resolved?
Any help is highly appreciated!
Hive query :
SELECT 'store' AS channel, 'ss_cdemo_sk' AS col_name, d_year, d_qoy, i_category, ss_ext_sales_price AS ext_sales_price
FROM store_sales , item , date_dim
WHERE ss_cdemo_sk IS NULL
AND ss_sold_date_sk = d_date_sk
AND ss_item_sk = i_item_sk ;
Hive logs :
Query ID = root_20150922151717_a94d4679-224b-41f3-8336-a799d4ebedab
Total jobs = 4
....
Launching Job 3 out of 4
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1441794795162_13426, Tracking URL = http://myhost:8035/proxy/application_1441794795162_13426/
Kill Command = /opt/hes/hadoop/hadoop-2.6.0//bin/hadoop job -kill job_1441794795162_13426
Hadoop job information for Stage-6: number of mappers: 1; number of reducers: 0
2015-09-22 15:18:34,291 Stage-6 map = 0%, reduce = 0%
2015-09-22 15:18:54,541 Stage-6 map = 100%, reduce = 0%
Ended Job = job_1441794795162_13426 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1441794795162_13426_m_000000 (and more) from job job_1441794795162_13426
Task with the most failures(4):
Task ID:task_1441794795162_13426_m_000000
URL : myhost:8088/taskdetails.jsp?jobid=job_1441794795162_13426&tipid=task_1441794795162_13426_m_000000
-----
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row **{"d_date_sk":2452538,"d_date_id":"AAAAAAAAKDMGFCAA","d_date":"2002-09-20","d_month_seq":1232,"d_week_seq":5360,"d_quarter_seq":412,"d_year":2002,"d_dow":5,"d_moy":9,"d_dom":20,"d_qoy":3,"d_fy_year":2002,"d_fy_quarter_seq":412,"d_fy_week_seq":5360,"d_day_name":"Friday ","d_quarter_name":"N","d_holiday":"2002Q3","d_weekend":"N","d_following_holiday":"Y","d_first_dom":2452761,"d_last_dom":2452519,"d_same_day_ly":2452447,"d_same_day_lq":2452173,"d_current_day":"N","d_current_week":"N","d_current_month":"N","d_current_quarter":"N","d_current_year":"N"}**
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:185)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"d_date_sk":2452538,"d_date_id":"AAAAAAAAKDMGFCAA","d_date":"2002-09-20","d_month_seq":1232,"d_week_seq":5360,"d_quarter_seq":412,"d_year":2002,"d_dow":5,"d_moy":9,"d_dom":20,"d_qoy":3,"d_fy_year":2002,"d_fy_quarter_seq":412,"d_fy_week_seq":5360,"d_day_name":"Friday ","d_quarter_name":"N","d_holiday":"2002Q3","d_weekend":"N","d_following_holiday":"Y","d_first_dom":2452761,"d_last_dom":2452519,"d_same_day_ly":2452447,"d_same_day_lq":2452173,"d_current_day":"N","d_current_week":"N","d_current_month":"N","d_current_quarter":"N","d_current_year":"N"}
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:503)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:176)
... 8 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unexpected exception: org.apache.hadoop.hive.serde2.io.HiveDecimalWritable cannot be cast to org.apache.hadoop.io.IntWritable
at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:311)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
at org.apache.hadoop.hive.ql.exec.FilterOperator.processOp(FilterOperator.java:120)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:95)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:157)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:493)
... 9 more
Caused by: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.io.HiveDecimalWritable cannot be cast to org.apache.hadoop.io.IntWritable
at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableIntObjectInspector.get(WritableIntObjectInspector.java:36)
at org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPEqual.evaluate(GenericUDFOPEqual.java:84)
at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator._evaluate(ExprNodeGenericFuncEvaluator.java:185)
at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)
at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator$DeferredExprObject.get(ExprNodeGenericFuncEvaluator.java:86)
at org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPAnd.evaluate(GenericUDFOPAnd.java:68)
at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator._evaluate(ExprNodeGenericFuncEvaluator.java:185)
at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)
at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:65)
at org.apache.hadoop.hive.ql.exec.FilterOperator.processOp(FilterOperator.java:106)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.internalForward(CommonJoinOperator.java:638)
at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:651)
at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:654)
at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:750)
at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:299)
... 15 more
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
ATTEMPT: Execute BackupTask: org.apache.hadoop.hive.ql.exec.mr.MapRedTask
.......
Stage-Stage-8: Map: 2 Cumulative CPU: 42.59 sec HDFS Read: 122983367 HDFS Write: 5267230 SUCCESS
Stage-Stage-6: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Stage-Stage-2: Map: 2 Reduce: 1 Cumulative CPU: 8.78 sec HDFS Read: 15648221 HDFS Write: 8106 SUCCESS
Total MapReduce CPU Time Spent: 51 seconds 370 msec
OK
channel col_name d_year d_qoy i_category ext_sales_price
store ss_cdemo_sk 1998 1 Sports NULL
.....
.
.
.
..
Sample data from the date_dim table (including the erroneous record):
hive > select * from date_dim ;
date_dim.d_date_sk date_dim.d_date_id date_dim.d_date date_dim.d_month_seq date_dim.d_week_seq date_dim.d_quarter_seq date_dim.d_year date_dim.d_dow date_dim.d_moy date_dim.d_dom date_dim.d_qoy date_dim.d_fy_year date_dim.d_fy_quarter_seq date_dim.d_fy_week_seq date_dim.d_day_name date_dim.d_quarter_name date_dim.d_holiday date_dim.d_weekend date_dim.d_following_holiday date_dim.d_first_dom date_dim.d_last_dom date_dim.d_same_day_ly date_dim.d_same_day_lq date_dim.d_current_day date_dim.d_current_week date_dim.d_current_month date_dim.d_current_quarter date_dim.d_current_year
2452538 AAAAAAAAKDMGFCAA 2002-09-20 1232 5360 412 2002 5 9 20 3 2002 412 5360 Friday N 2002Q3 N Y 2452761 2452519 2452447 2452173 N N N NN
2431208 AAAAAAAAIOIBFCAA 1944-04-27 531 2313 178 1944 4 4 27 2 1944 178 2313 Thursday N 1944Q2 N N 2431272 2431182 2431117 2430842 N N N NN
2456494 AAAAAAAAOKLHFCAA 2013-07-20 1362 5925 455 2013 6 7 20 3 2013 455 5925 Saturday N 2013Q3 N Y 2456655 2456475 2456403 2456129 N N N NN
2481780 AAAAAAAAEHONFCAA 2082-10-12 2193 9537 732 2082 1 10 12 4 2082 732 9537 Monday N 2082Q4 N N 2482041 2481769 2481688 2481415 N N N NN
hive> describe store_sales ;
ss_sold_date_sk int
ss_sold_time_sk int
ss_item_sk int
ss_customer_sk int
ss_cdemo_sk int
ss_hdemo_sk int
ss_addr_sk int
ss_store_sk int
ss_promo_sk int
ss_ticket_number int
ss_quantity int
ss_wholesale_cost decimal(7,2)
ss_list_price decimal(7,2)
ss_sales_price decimal(7,2)
ss_ext_discount_amt decimal(7,2)
ss_ext_sales_price decimal(7,2)
ss_ext_wholesale_cost decimal(7,2)
ss_ext_list_price decimal(7,2)
ss_ext_tax decimal(7,2)
ss_coupon_amt decimal(7,2)
ss_net_paid decimal(7,2)
ss_net_paid_inc_tax decimal(7,2)
ss_net_profit decimal(7,2)
hive> describe item ;
i_item_sk int
i_item_id string
i_rec_start_date date
i_rec_end_date date
i_item_desc string
i_current_price decimal(7,2)
i_wholesale_cost decimal(7,2)
i_brand_id int
i_brand string
i_class_id int
i_class string
i_category_id int
i_category string
i_manufact_id int
i_manufact string
i_size string
i_formulation string
i_color string
i_units string
i_container string
i_manager_id int
i_product_name string
hive> describe date_dim ;
d_date_sk int
d_date_id string
d_date date
d_month_seq int
d_week_seq int
d_quarter_seq int
d_year int
d_dow int
d_moy int
d_dom int
d_qoy int
d_fy_year int
d_fy_quarter_seq int
d_fy_week_seq int
d_day_name string
d_quarter_name string
d_holiday string
d_weekend string
d_following_holiday string
d_first_dom int
d_last_dom int
d_same_day_ly int
d_same_day_lq int
d_current_day string
d_current_week string
d_current_month string
d_current_quarter string
d_current_year string

Pig group by and average function

I have data that looks like this
STN--- WBAN YEARMODA TEMP DEWP SLP STP VISIB WDSP MXSPD GUST MAX MIN PRCP SNDP FRSHTT
030050 99999 19291029 46.7 4 42.0 4 990.9 4 9999.9 0 10.9 4 13.0 4 13.0 999.9 46.9* 44.1 99.99 999.9 010000
030050 99999 19291030 43.5 4 33.5 4 1015.4 4 9999.9 0 12.4 4 14.3 4 18.1 999.9 46.9 42.1 0.00I 999.9 000000
030050 99999 19291031 43.7 4 37.3 4 1026.8 4 9999.9 0 12.4 4 4.5 4 8.9 999.9 46.9* 37.9 0.00I 999.9 000000
030050 99999 19291101 49.2 4 45.5 4 1019.9 4 9999.9 0 6.2 4 8.2 4 13.0 999.9 51.1* 46.0 99.99 999.9 010000
030050 99999 19291102 47.0 4 44.5 4 1013.6 4 9999.9 0 7.8 4 6.2 4 8.9 999.9 51.1 44.1 0.00I 999.9 000000
030050 99999 19291103 44.0 4 36.0 4 1009.2 4 9999.9 0 10.9 4 8.0 4 8.9 999.9 50.0 42.1 0.00I 999.9 000000
I want to get the average for each month, in this case: 10 and 11.
First I load the data using:
RAW_LOGS = LOAD 'data' as (line:chararray);
Then I separate the data into different variables using a regex:
LOGS_BASE = FOREACH RAW_LOGS GENERATE
FLATTEN(
REGEX_EXTRACT_ALL(line, '^(\\d+)\\s+(\\d+)\\s+(\\d{4})(\\d{2})(\\d{2})\\s+(\\d+\\.\\d).*$')
)
as (
STN: int,
WBAN: int,
YEAR: int,
MONTH: int,
DAY: int,
TEMP: float
);
Next I get rid of the top tuple which previously contained the header data:
no_nulls = FILTER LOGS_BASE BY STN is not null;
Then I group the data by STN, WBAN, YEAR, and MONTH:
grouped = group no_nulls by STN..MONTH;
And finally I try to generate an Average and run into an error:
C = FOREACH grouped GENERATE AVG(LOGS_BASE.TEMP);
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045:
<line 17, column 29> Could not infer the matching function for org.apache.pig.builtin.AVG as multiple or none of them fit. Please use an explicit cast.
I think the error may be with my regex, in that it is returning TEMP as a string even though I am telling it to be a float, but I could be wrong.
EDIT: I changed C to:
C = FOREACH grouped GENERATE AVG(no_nulls.TEMP);
and now I get this error:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
1.0.3 0.9.2-amzn hadoop 2013-04-20 19:55:25 2013-04-20 19:57:21 GROUP_BY,FILTER
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
job_201304201942_0001 C,LOGS_BASE,RAW_LOGS,grouped,no_nulls GROUP_BY,COMBINER Message: Job failed! Error - # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201304201942_0001_m_000000 hdfs://10.254.106.85:9000/tmp/temp413183623/tmp1677272203,
The log has a bit more info:
org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing average in Initial
at org.apache.pig.builtin.FloatAvg$Initial.exec(FloatAvg.java:99)
at org.apache.pig.builtin.FloatAvg$Initial.exec(FloatAvg.java:75)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:216)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:253)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:334)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:267)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Float
at org.apache.pig.builtin.FloatAvg$Initial.exec(FloatAvg.java:86)
... 19 more
Pig Stack Trace
---------------
ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing average in Initial
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias C. Backend error : Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing average in Initial
at org.apache.pig.PigServer.openIterator(PigServer.java:890)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:679)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:303)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:189)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:500)
at org.apache.pig.Main.main(Main.java:114)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing average in Initial
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:221)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:151)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:354)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1313)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1298)
at org.apache.pig.PigServer.storeEx(PigServer.java:995)
at org.apache.pig.PigServer.store(PigServer.java:962)
at org.apache.pig.PigServer.openIterator(PigServer.java:875)
My guess is that it's because grouped doesn't contain LOGS_BASE; it contains no_nulls. Try changing it to:
C = FOREACH grouped GENERATE AVG(no_nulls.TEMP);
and see if that fixes it.
If that doesn't work, try adding DUMP RAW_LOGS after the first line and commenting everything else out; make sure that looks good, then uncomment the second line and change the dump to DUMP LOGS_BASE, and repeat for the rest of the lines. It's always good to sanity-check each piece of a Pig script.
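For example, a first sanity-check pass might look like this (a sketch; uncomment and DUMP one alias at a time as you go):
RAW_LOGS = LOAD 'data' as (line:chararray);
DUMP RAW_LOGS; -- confirm the raw lines look right before adding the next step
-- LOGS_BASE = FOREACH RAW_LOGS GENERATE ... ;
-- DUMP LOGS_BASE;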
It turns out that TEMP was being treated as a String instead of a Float. I applied the code used here and got it to work. Even though I told Pig to treat the TEMP column as a float, it was still reading it in as a chararray. This ended up being a one-line fix: putting (tuple(int,int,int,int,int,float)) right before my REGEX_EXTRACT_ALL call. Here's what that code looks like:
LOGS_BASE = FOREACH RAW_LOGS GENERATE
FLATTEN(
(tuple(int,int,int,int,int,float))
REGEX_EXTRACT_ALL(line, '^(\\d+)\\s+(\\d+)\\s+(\\d{4})(\\d{2})(\\d{2})\\s+(-?\\d+\\.\\d).*$')
)
as (
STN: int,
WBAN: int,
YEAR: int,
MONTH: int,
DAY: int,
TEMP: float
);
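With that cast in place, the rest of the script from the question should run unchanged; a sketch of the remaining steps, reusing the aliases above:
no_nulls = FILTER LOGS_BASE BY STN is not null;
grouped = GROUP no_nulls BY (STN, WBAN, YEAR, MONTH); -- equivalent to STN..MONTH in the question
C = FOREACH grouped GENERATE FLATTEN(group), AVG(no_nulls.TEMP) AS avg_temp;
DUMP C;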
