Pig JOIN is returning no results (Hadoop)

I have been stuck on this problem for over twelve hours now. I have a Pig script that is running on Amazon Web Services. Currently, I am just running my script in interactive mode. I am trying to get averages on a large data set of climate readings from weather stations; however, this data doesn't have country or state information so it has to be joined with another table that does.
State Table:
719990 99999 LILLOOET CN CA BC WKF +50683 -121933 +02780
719994 99999 SEDCO 710 CN CA CWQJ +46500 -048500 +00000
720000 99999 BOGUS AMERICAN US US -99999 -999999 -99999
720001 99999 PEASON RIDGE/RANGE US US LA K02R +31400 -093283 +01410
720002 99999 HALLOCK(AWS) US US MN K03Y +48783 -096950 +02500
720003 99999 DEER PARK(AWS) US US WA K07S +47967 -117433 +06720
720004 99999 MASON US US MI K09G +42567 -084417 +02800
720005 99999 GASTONIA US US NC K0A6 +35200 -081150 +02440
Climate Table: (I realize this doesn't contain anything to satisfy the join condition, but the full data set does.)
STN--- WBAN YEARMODA TEMP DEWP SLP STP VISIB WDSP MXSPD GUST MAX MIN PRCP SNDP FRSHTT
010010 99999 20090101 23.3 24 15.6 24 1033.2 24 1032.0 24 13.5 6 9.6 24 17.5 999.9 27.9* 16.7 0.00G 999.9 001000
010010 99999 20090102 27.3 24 20.5 24 1026.1 24 1024.9 24 13.7 5 14.6 24 23.3 999.9 28.9 25.3* 0.00G 999.9 001000
010010 99999 20090103 25.2 24 18.4 24 1028.3 24 1027.1 24 15.5 6 4.2 24 9.7 999.9 26.2* 23.9* 0.00G 999.9 001000
010010 99999 20090104 27.7 24 23.2 24 1019.3 24 1018.1 24 6.7 6 8.6 24 13.6 999.9 29.8 24.8 0.00G 999.9 011000
010010 99999 20090105 19.3 24 13.0 24 1015.5 24 1014.3 24 5.6 6 17.5 24 25.3 999.9 26.2* 10.2* 0.05G 999.9 001000
010010 99999 20090106 12.9 24 2.9 24 1019.6 24 1018.3 24 8.2 6 15.5 24 25.3 999.9 19.0* 8.8 0.02G 999.9 001000
010010 99999 20090107 26.2 23 20.7 23 998.6 23 997.4 23 6.6 6 12.1 22 21.4 999.9 31.5 19.2* 0.00G 999.9 011000
010010 99999 20090108 21.5 24 15.2 24 995.3 24 994.1 24 12.4 5 12.8 24 25.3 999.9 24.6* 19.2* 0.05G 999.9 011000
010010 99999 20090109 27.5 23 24.5 23 982.5 23 981.3 23 7.9 5 20.2 22 33.0 999.9 34.2 20.1* 0.00G 999.9 011000
010010 99999 20090110 22.5 23 16.7 23 977.2 23 976.1 23 11.9 6 15.5 23 35.0 999.9 28.9* 17.2 0.09G 999.9 000000
I load in the climate data using TextLoader, apply a regular expression to obtain the fields, and filter out the nulls from the result set. I then do the same with the state data, but I filter it for the country being the US.
The bags have the following schema:
CLIMATE_REMOVE_EMPTY: {station: int,wban: int,year: int,month: int,day: int,temp: double}
STATES_FILTER_US: {station: int,wban: int,name: chararray,wmo: chararray,fips: chararray,state: chararray}
I need to perform a join operation on (station,wban) so I can get a resulting bag with the station, wban, year, month, and temps. When I perform a dump on the resulting bag, it says that it was successful; however, the dump returns 0 results. This is the output.
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
1.0.3 0.9.2-amzn hadoop 2013-05-03 00:10:51 2013-05-03 00:12:42 HASH_JOIN,FILTER
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime MinReduceTime AvgReduceTime Alias Feature Outputs
job_201305030005_0001 2 1 36 15 25 33 33 33 CLIMATE,CLIMATE_REMOVE_NULL,RAW_CLIMATE,RAW_STATES,STATES,STATES_FILTER_US,STATE_CLIMATE_JOIN HASH_JOIN hdfs://10.204.30.125:9000/tmp/temp-204730737/tmp1776606203,
Input(s):
Successfully read 30587 records from: "hiddenbucket"
Successfully read 21027 records from: "hiddenbucket"
Output(s):
Successfully stored 0 records in: "hdfs://10.204.30.125:9000/tmp/temp-204730737/tmp1776606203"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
I have no idea why this contains 0 results. My data extraction seems correct, and the job is successful, which leads me to believe that the join condition is never satisfied. I know the input files have some data that should satisfy the join condition, but it returns absolutely nothing.
The only thing that looks suspicious is a warning that states:
Encountered Warning ACCESSING_NON_EXISTENT_FIELD 26001 time(s).
I'm not exactly sure where to go from here. Since the job isn't failing, I can't see any errors or anything in debug.
I'm not sure if these mean anything, but here are other things that stand out:
When I try to illustrate STATE_CLIMATE_JOIN, I get a NullPointerException - ERROR 2997: Encountered IOException. Exception : null
When I try to illustrate STATES, I get java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
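Since illustrate is crashing, one way to inspect each stage instead is to dump a small sample of every alias; a minimal sketch (the CHK_* aliases are invented for illustration, the others come from the full script below):
-- Sketch: sanity-check the extraction by dumping a few rows per alias.
CHK_CLIMATE = LIMIT CLIMATE_REMOVE_NULL 10;
DUMP CHK_CLIMATE;
CHK_STATES = LIMIT STATES_FILTER_US 10;
DUMP CHK_STATES;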
Here is my full code:
--Piggy Bank Functions
register file:/home/hadoop/lib/pig/piggybank.jar
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
--Load Climate Data
RAW_CLIMATE = LOAD 'hiddenbucket' USING TextLoader as (line:chararray);
RAW_STATES= LOAD 'hiddenbucket' USING TextLoader as (line:chararray);
CLIMATE=
FOREACH
RAW_CLIMATE
GENERATE
FLATTEN ((tuple(int,int,int,int,int,double))
EXTRACT(line,'^(\\d{6})\\s+(\\d{5})\\s+(\\d{4})(\\d{2})(\\d{2})\\s+(\\d{1,3}\\.\\d{1})')
)
AS (
station: int,
wban: int,
year: int,
month: int,
day: int,
temp: double
)
;
STATES=
FOREACH
RAW_STATES
GENERATE
FLATTEN ((tuple(int,int,chararray,chararray,chararray,chararray))
EXTRACT(line,'^(\\d{6})\\s+(\\d{5})\\s+(\\S+)\\s+(\\w{2})\\s+(\\w{2})\\s+(\\w{2})')
)
AS (
station: int,
wban: int,
name: chararray,
wmo: chararray,
fips: chararray,
state: chararray
)
;
CLIMATE_REMOVE_NULL = FILTER CLIMATE BY station IS NOT NULL;
STATES_FILTER_US = FILTER STATES BY (fips == 'US');
STATE_CLIMATE_JOIN = JOIN CLIMATE_REMOVE_NULL BY (station), STATES_FILTER_US BY (station);
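One detail worth flagging: the script joins on station alone, while the question text says the join should be on (station, wban). If both keys matter, a two-key join would look like the following sketch (not the code as posted):
-- Sketch: join on both keys, as described in the question text.
STATE_CLIMATE_JOIN = JOIN CLIMATE_REMOVE_NULL BY (station, wban), STATES_FILTER_US BY (station, wban);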
Thanks in advance. I am at a loss here.
--EDIT--
I finally got it to work! My regular expression for parsing the STATE_DATA was invalid.
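The corrected pattern is not shown above. A plausible culprit is the (\\S+) name group, which cannot match multi-word station names such as SEDCO 710 or PEASON RIDGE/RANGE; EXTRACT then returns null for those lines, which would also explain the ACCESSING_NON_EXISTENT_FIELD warnings. A hedged sketch of a more permissive pattern, not necessarily the author's actual fix:
-- Hypothetical repair: let the name group span spaces, anchored by the
-- two-letter codes that follow it. Rows that still fail to match come
-- back as nulls and are dropped by the existing filters.
STATES=
FOREACH
RAW_STATES
GENERATE
FLATTEN ((tuple(int,int,chararray,chararray,chararray,chararray))
EXTRACT(line,'^(\\d{6})\\s+(\\d{5})\\s+(.+?)\\s+(\\w{2})\\s+(\\w{2})\\s+(\\w{2})\\s')
)
AS (
station: int,
wban: int,
name: chararray,
wmo: chararray,
fips: chararray,
state: chararray
)
;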

Related

Why is my DXF file not working in AutoCAD, giving me "ID 11 incorrect: already used"?

I have generated a DXF file, but when I open it with AutoCAD, it crashes AutoCAD and gives the message "ID 11 incorrect: already used".
the dxf content: https://github.com/tarikjabiri/dxf/blob/dev/examples/latest.dxf
I can't spot the problem; I have been trying to solve it for 3 days.
I think something is wrong with the APPID, because it is holding the ID 11 (the handle, in the language of DXF).
I have a dxf working: https://github.com/tarikjabiri/dxf/blob/dev/examples/Minimal_DXF_AC1021.dxf
Thanks in advance.
There are two minor issues:
DIMSTYLE table
0
TABLE
2
DIMSTYLE
105 <<< handle group code of the table "head" is 5 as usual
8
100
AcDbSymbolTable
100
AcDbDimStyleTable
70
1
0
DIMSTYLE
5 <<< handle group code of the table entry is 105
12
330
8
100
AcDbSymbolTableRecord
100
AcDbDimStyleTableRecord
2
STANDARD
70
0
40
1
BLOCK_RECORD table entries for *MODEL_SPACE and *PAPER_SPACE
0
TABLE
2
BLOCK_RECORD
5
9
330
0
100
AcDbSymbolTable
70
2
0
BLOCK_RECORD
5
14
330
9
100
AcDbSymbolTableRecord
100
AcDbRegAppTableRecord <<< subclass marker string "AcDbBlockTableRecord"
2
*MODEL_SPACE
70
0
70
0
280
After these changes, the file opens in Autodesk DWG TrueView 2022.

SparkR - Retaining the previous value in another column

I have a Spark DataFrame that looks like this:
id dates value
1 11 2013-11-15 10
2 11 2013-11-16 15
3 22 2013-11-15 20
4 22 2013-11-16 21
5 22 2013-11-17 3
I wish to retain the value from the previous date per id.
The final result should look like this:
id dates value prev_value
1 11 2013-11-15 10 NA
2 11 2013-11-16 15 10
3 22 2013-11-15 20 NA
4 22 2013-11-16 21 20
5 22 2013-11-17 3 21
The solution from this question would not work for various reasons.
I would appreciate the help!
So after playing with it for a while, here's the workaround that I found:
First of all, here's the example DF
id<-c(11,11,22,22,22)
dates<-as.Date(c('2013-11-15','2013-11-16','2013-11-15','2013-11-16','2013-11-17'), "%Y-%m-%d")
value <- c(10,15,20,21,3)
example<-as.DataFrame(data.frame(id=id,dates=dates, value))
I copy the example DF and add 1 day to the original date, then rename the column
example_p <- example
example_p$dates <- date_add(example_p$dates, 1)
colnames(example_p) <- c("id", "dates", "prev_value")
Finally, I merge the new DF to the original one
result <- select(merge(example, example_p,
                       by = intersect(names(example), names(example_p)),
                       all.x = T),
                 c("id_x", "dates_x", "value", "prev_value"))
showDF(result)
+----+----------+-----+----------+
|id_x| dates_x|value|prev_value|
+----+----------+-----+----------+
|22.0|2013-11-15| 20.0| null|
|11.0|2013-11-15| 10.0| null|
|11.0|2013-11-16| 15.0| 10.0|
|22.0|2013-11-16| 21.0| 20.0|
|22.0|2013-11-17| 3.0| 21.0|
+----+----------+-----+----------+
Obviously, this is somewhat clumsy, and I will be happy to give the points to anyone who can suggest a solution that works faster than this.
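For later readers: on Spark 2.0+, SparkR exposes window functions, so lag() over a window partitioned by id avoids the copy-and-merge entirely. A sketch against the example DF above, assuming a recent SparkR:
# Sketch (assumes Spark >= 2.0): previous value per id via a window.
ws <- orderBy(windowPartitionBy("id"), "dates")
result <- withColumn(example, "prev_value", over(lag(example$value, 1), ws))
showDF(result)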

Pandas performance issue with DataFrame column "rename" and "drop"

Below is the line_profiler record of a function:
Wrote profile results to FM_CORE.py.lprof
Timer unit: 2.79365e-07 s
File: F:\FM_CORE.py
Function: _rpt_join at line 1068
Total time: 1.87766 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1068 @profile
1069 def _rpt_join(dfa, dfb, join_type='inner'):
1070 ''' join two dataframe together by ('STK_ID','RPT_Date') multilevel index.
1071 'join_type' can be 'inner' or 'outer'
1072 '''
1073
1074 2 56 28.0 0.0 try: # ('STK_ID','RPT_Date') are normal column
1075 2 2936668 1468334.0 43.7 rst = pd.merge(dfa, dfb, how=join_type, on=['STK_ID','RPT_Date'], left_index=True, right_index=True)
1076 except: # ('STK_ID','RPT_Date') are index
1077 rst = pd.merge(dfa, dfb, how=join_type, left_index=True, right_index=True)
1078
1079
1080 2 81 40.5 0.0 try: # handle 'STK_Name
1081 2 426472 213236.0 6.3 name_combine = pd.concat([dfa.STK_Name, dfb.STK_Name])
1082
1083
1084 2 900584 450292.0 13.4 nameseries = name_combine[-Series(name_combine.index.values, name_combine.index).duplicated()]
1085
1086 2 1138140 569070.0 16.9 rst.STK_Name_x = nameseries
1087 2 596768 298384.0 8.9 rst = rst.rename(columns={'STK_Name_x': 'STK_Name'})
1088 2 722293 361146.5 10.7 rst = rst.drop(['STK_Name_y'], axis=1)
1089 except:
1090 pass
1091
1092 2 94 47.0 0.0 return rst
What surprises me is these two lines:
1087 2 596768 298384.0 8.9 rst = rst.rename(columns={'STK_Name_x': 'STK_Name'})
1088 2 722293 361146.5 10.7 rst = rst.drop(['STK_Name_y'], axis=1)
Why do a simple DataFrame column "rename" and a "drop" cost that much time (8.9% + 10.7%)? After all, the "merge" operation only costs 43.7%, and "rename"/"drop" do not look like calculation-intensive operations. How can I improve this?
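In pandas of that era, rename() and drop() each return a copy of the whole frame, which is likely where the time goes. A hedged workaround, assuming only these two columns need touching, is to relabel the column index in place and delete the duplicate column directly:
# Sketch: avoid full-frame copies from rename()/drop().
cols = list(rst.columns)
cols[cols.index('STK_Name_x')] = 'STK_Name'
rst.columns = cols       # swaps the labels only; no data is copied
del rst['STK_Name_y']    # in-place column removal instead of drop()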

Pig group by and average function

I have data that looks like this
STN--- WBAN YEARMODA TEMP DEWP SLP STP VISIB WDSP MXSPD GUST MAX MIN PRCP SNDP FRSHTT
030050 99999 19291029 46.7 4 42.0 4 990.9 4 9999.9 0 10.9 4 13.0 4 13.0 999.9 46.9* 44.1 99.99 999.9 010000
030050 99999 19291030 43.5 4 33.5 4 1015.4 4 9999.9 0 12.4 4 14.3 4 18.1 999.9 46.9 42.1 0.00I 999.9 000000
030050 99999 19291031 43.7 4 37.3 4 1026.8 4 9999.9 0 12.4 4 4.5 4 8.9 999.9 46.9* 37.9 0.00I 999.9 000000
030050 99999 19291101 49.2 4 45.5 4 1019.9 4 9999.9 0 6.2 4 8.2 4 13.0 999.9 51.1* 46.0 99.99 999.9 010000
030050 99999 19291102 47.0 4 44.5 4 1013.6 4 9999.9 0 7.8 4 6.2 4 8.9 999.9 51.1 44.1 0.00I 999.9 000000
030050 99999 19291103 44.0 4 36.0 4 1009.2 4 9999.9 0 10.9 4 8.0 4 8.9 999.9 50.0 42.1 0.00I 999.9 000000
I want to get the average for each month, in this case: 10 and 11.
First I load the data using:
RAW_LOGS = LOAD 'data' as (line:chararray);
Then I separate the data into different variables using a regex:
LOGS_BASE = FOREACH RAW_LOGS GENERATE
FLATTEN(
REGEX_EXTRACT_ALL(line, '^(\\d+)\\s+(\\d+)\\s+(\\d{4})(\\d{2})(\\d{2})\\s+(\\d+\\.\\d).*$')
)
as (
STN: int,
WBAN: int,
YEAR: int,
MONTH: int,
DAY: int,
TEMP: float
);
Next I get rid of the top tuple which previously contained the header data:
no_nulls = FILTER LOGS_BASE BY STN is not null;
Then I group the data by STN, WBAN, YEAR, and MONTH:
grouped = group no_nulls by STN..MONTH;
And finally I try to generate an Average and run into an error:
C = FOREACH grouped GENERATE AVG(LOGS_BASE.TEMP);
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045:
<line 17, column 29> Could not infer the matching function for org.apache.pig.builtin.AVG as multiple or none of them fit. Please use an explicit cast.
I think the error may be with my regex, in that it is returning TEMP as a string even though I am telling it to be a float, but I could be wrong.
EDIT: I changed C to:
C = FOREACH grouped GENERATE AVG(no_nulls.TEMP);
and now I get this error:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
1.0.3 0.9.2-amzn hadoop 2013-04-20 19:55:25 2013-04-20 19:57:21 GROUP_BY,FILTER
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
job_201304201942_0001 C,LOGS_BASE,RAW_LOGS,grouped,no_nulls GROUP_BY,COMBINER Message: Job failed! Error - # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201304201942_0001_m_000000 hdfs://10.254.106.85:9000/tmp/temp413183623/tmp1677272203,
The log has a bit more info:
org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing average in Initial
at org.apache.pig.builtin.FloatAvg$Initial.exec(FloatAvg.java:99)
at org.apache.pig.builtin.FloatAvg$Initial.exec(FloatAvg.java:75)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:216)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:253)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:334)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:267)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Float
at org.apache.pig.builtin.FloatAvg$Initial.exec(FloatAvg.java:86)
... 19 more
Pig Stack Trace
---------------
ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing average in Initial
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias C. Backend error : Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing average in Initial
at org.apache.pig.PigServer.openIterator(PigServer.java:890)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:679)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:303)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:189)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:500)
at org.apache.pig.Main.main(Main.java:114)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing average in Initial
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:221)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:151)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:354)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1313)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1298)
at org.apache.pig.PigServer.storeEx(PigServer.java:995)
at org.apache.pig.PigServer.store(PigServer.java:962)
at org.apache.pig.PigServer.openIterator(PigServer.java:875)
My guess is that it's because grouped doesn't contain LOGS_BASE; it contains no_nulls. Try making it
C = FOREACH grouped GENERATE AVG(no_nulls.TEMP);
and see if that fixes it.
If that doesn't work, try adding dump RAW_LOGS after the first line and commenting everything else out; make sure that looks good, then uncomment the second line and change the dump to dump LOGS_BASE, and repeat for the rest of the lines. It's always good to sanity-check each piece of a Pig script.
It turns out that TEMP was being treated as a String instead of a Float. I applied the code used here and got it to work. Even though I told Pig to treat the TEMP column as a float, it was still reading it in as a chararray. This ended up being a one-line fix: putting (tuple(int,int,int,int,int,float)) right before my REGEX_EXTRACT_ALL function. Here's what that code looks like:
LOGS_BASE = FOREACH RAW_LOGS GENERATE
FLATTEN(
(tuple(int,int,int,int,int,float))
REGEX_EXTRACT_ALL(line, '^(\\d+)\\s+(\\d+)\\s+(\\d{4})(\\d{2})(\\d{2})\\s+(-?\\d+\\.\\d).*$')
)
as (
STN: int,
WBAN: int,
YEAR: int,
MONTH: int,
DAY: int,
TEMP: float
);
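With TEMP genuinely a float, the rest of the pipeline from the question runs unchanged; for completeness, a sketch of the remaining steps (FLATTEN(group) is added here only so the key columns appear in the output):
-- Sketch: group per station/WBAN/year/month and average the float TEMP.
no_nulls = FILTER LOGS_BASE BY STN is not null;
grouped = group no_nulls by STN..MONTH;
C = FOREACH grouped GENERATE FLATTEN(group), AVG(no_nulls.TEMP);
DUMP C;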

Merge and matching tables in Oracle

Does anyone know how to merge two tables with a common column name and data into a single table? The shared column is a date column. This is part of a project at work, and no one here quite knows how it works. Any help would be appreciated.
table A
Sub Temp Weight Silicon Cast_Date
108 2675 2731 0.7002 18-jun-11 18:45
101 2691 3268 0.6194 18-jun-11 20:30
107 2701 6749 0.6976 18-jun-11 20:30
113 2713 2112 0.6616 18-jun-11 20:30
116 2733 3142 0.7382 19-jun-11 05:46
121 2745 2611 0.6949 19-jun-11 00:19
125 2726 1995 0.644 19-jun-11 00:19
table B
Si Temperature Sched_Cast_Date Treadwell
0.6622 2542 01-APR-11 02:57 114
0.6622 2542 01-APR-11 03:07 116
0.7516 2526 19-jun-11 05:46 116
0.7516 2526 01-APR-11 03:40 107
0.6741 2372 01-APR-11 04:03 107
0.6206 2369 01-APR-11 09:43 114
0.6741 2372 19-jun-11 00:19 125
the results would look like:
Subcar Temp Weight Silicon Cast_Date SI Temperature Sched_Cast_Date Treadwell
116 2733 3142 0.7382 19-jun-11 05:46 0.7516 2526 19-jun-11 05:46 116
125 2726 1995 0.644 19-jun-11 00:19 0.6741 2372 19-jun-11 00:19 125
I would like to run a query that returns rows only where Sched_Cast_Date and Cast_Date are the same. A table with the same qualities would work just as well.
I hope that this makes more sense.
Are you asking how to join two tables on a common column? i.e.
select a.Sub, a.Temp, a.Weight, a.Silicon, a.Cast_Date, b.Si,
b.Temperature, b.Sched_Cast_Date, b.Treadwell
from a
join b on b.sched_cast_date = a.cast_date
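If a persisted table is wanted rather than a query result, the same join can feed a CREATE TABLE ... AS SELECT; a sketch (the merged_casts name is invented):
-- Sketch: materialize the matching rows as a new table.
create table merged_casts as
select a.Sub, a.Temp, a.Weight, a.Silicon, a.Cast_Date,
       b.Si, b.Temperature, b.Sched_Cast_Date, b.Treadwell
  from a
  join b on b.sched_cast_date = a.cast_date;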
