Why are all the values the same in ARIMA model predictions?

The data set has 1511 observations. I used the first 1400 values to fit an ARIMA model of order (1,1,9), keeping the rest for predictions. But when I look at the predictions, apart from the first 16 values, all the remaining ones are the same. Here's what I tried:
# assuming statsmodels' older ARIMA (statsmodels.tsa.arima_model), given the typ='levels' argument below
from statsmodels.tsa.arima_model import ARIMA

model2 = ARIMA(tstrain, order=(1, 1, 9))
fitted_model2 = model2.fit()
And for prediction:
start = len(tstrain)
end = len(tstrain) + len(tstest) - 1
predictions = fitted_model2.predict(start, end, typ='levels')
Here tstrain and tstest are the train and test sets.
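(tstrain and tstest themselves are not shown in the post; presumably they come from splitting the original series at observation 1400. A minimal sketch with a made-up series, purely for context:)

import numpy as np
import pandas as pd

# hypothetical stand-in for the real series of 1511 observations
ts = pd.Series(np.random.randn(1511).cumsum())
tstrain, tstest = ts.iloc[:1400], ts.iloc[1400:]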
predictions.head(30)
1400 214.097742
1401 214.689674
1402 214.820804
1403 215.621131
1404 215.244980
1405 215.349230
1406 215.392444
1407 215.022312
1408 215.020736
1409 215.021384
1410 215.021118
1411 215.021227
1412 215.021182
1413 215.021201
1414 215.021193
1415 215.021196
1416 215.021195
1417 215.021195
1418 215.021195
1419 215.021195
1420 215.021195
1421 215.021195
1422 215.021195
1423 215.021195
1424 215.021195
1425 215.021195
1426 215.021195
1427 215.021195
1428 215.021195
1429 215.021195
Please help me out here. What am I missing?

Related

Peak removal/interpolation

I have some signals that look like the following:
I would like to remove the two peaks by doing linear interpolation, so I can get something like this:
where the orange line segment should replace the two peaks after the interpolation.
I understand this is very difficult, because even a human could do it differently, like this:
or even this:
So it is really a challenging problem, and there might not be a definitive answer, but I am just looking for something that looks comfortable and natural and captures the details as much as possible.
I tried using a mask, but the edges are pretty noisy, and often the width of the mask is far from the actual width of the spike. I also tried smoothing and then applying finite differences to detect the start and end positions of the edges, but again it is not as accurate as it should be.
I am wondering if anyone has experience dealing with this problem. What algorithm should I use? Is there any literature describing the processing?
For this particular data set, the points are here:
-0.0568
-0.0536
-0.0528
-0.0500
-0.0379
-0.0169
-0.0005
0.0127
0.0075
0.0133
0.0123
0.0130
0.0084
0.0126
0.0144
0.0030
0.0093
0.0168
0.0101
0.0096
0.0078
0.0117
0.0106
0.0138
0.0128
0.0059
0.0075
0.0062
0.0056
0.0017
0.0037
0.0173
0.0114
0.0143
0.0113
0.0117
0.0040
0.0118
0.0085
0.0079
0.0063
0.0152
0.0064
0.0024
0.0058
0.0041
0.0101
0.0086
0.0086
0.0154
0.0018
0.0130
0.0094
0.0094
0.0096
0.0103
0.0170
0.0081
0.0035
0.0138
0.0123
0.0031
0.0120
0.0039
0.0043
0.0063
0.0191
0.0023
0.0165
0.0174
0.0129
0.0135
0.0153
0.0100
0.0066
0.0135
0.0109
0.0038
0.0129
0.0084
0.0095
0.0109
0.0121
0.0077
0.0116
0.0128
0.0101
0.0158
0.0134
0.0042
0.0054
0.0063
0.0059
0.0136
0.0029
0.0139
0.0104
0.0215
0.0180
0.0153
0.0187
0.0138
0.0236
0.0190
0.0267
0.0209
0.0112
0.0108
0.0238
0.0280
0.0266
0.0300
0.0256
0.0278
0.0260
0.0263
0.0257
0.0334
0.0309
0.0301
0.0325
0.0280
0.0300
0.0286
0.0359
0.0317
0.0381
0.0348
0.0422
0.0389
0.0491
0.1754
0.4760
0.8146
1.0172
1.0757
0.9471
0.8509
0.7955
0.7526
0.7314
0.7092
0.7073
0.6906
0.6787
0.6654
0.6646
0.6553
0.6420
0.6385
0.6390
0.6373
0.6305
0.6216
0.6218
0.6212
0.6108
0.6161
0.6054
0.6106
0.6006
0.6032
0.6100
0.6006
0.5975
0.6042
0.6027
0.6044
0.6138
0.6106
0.6051
0.6084
0.6065
0.6212
0.6207
0.6306
0.6270
0.6484
0.6605
0.6742
0.6828
0.6972
0.7076
0.7062
0.6918
0.6905
0.6759
0.6459
0.6134
0.5989
0.5790
0.5663
0.5595
0.5609
0.5467
0.5442
0.5400
0.5317
0.5267
0.5182
0.5187
0.5101
0.4975
0.4951
0.4907
0.4855
0.4745
0.4505
0.4604
0.5814
0.7370
0.8355
0.9012
0.9498
0.9783
1.0188
1.0496
1.0727
1.1201
1.1639
1.2085
1.2465
1.2691
1.3170
1.3553
1.4211
1.4715
1.5169
1.5694
1.5963
1.6341
1.6722
1.7125
1.7388
1.7725
1.8040
1.8505
1.8817
1.9064
1.9337
1.9837
1.9992
2.0385
2.0719
2.1062
2.1415
2.1767
2.2151
2.2385
2.2427
2.2591
2.2856
2.3185
2.3572
2.3638
2.3905
2.4077
2.4429
2.4662
2.4841
2.4977
2.5204
2.5549
2.5709
2.5810
2.6063
2.6301
2.6245
2.6519
2.6594
2.6707
2.6836
2.7045
2.7642
2.8208
2.8278
2.8821
2.8950
2.9526
3.0908
3.1539
3.1935
3.1544
3.1317
3.1717
3.1677
3.1526
3.1489
3.1292
3.1129
3.1293
3.1561
3.1556
3.1857
3.1856
3.1327
3.1160
3.0868
3.1122
3.1407
3.1970
3.2136
3.2211
3.2376
3.2222
3.2521
3.3035
3.4006
3.5001
3.5602
3.5756
3.6020
3.6014
3.5830
3.5640
3.5016
3.4363
3.3618
3.3640
3.4059
3.4812
3.4943
3.5307
3.5735
3.5193
3.5079
3.5052
3.4986
3.4955
3.4303
3.3649
3.3260
3.2755
3.1902
3.0984
3.0574
3.0174
2.9852
2.9648
2.9462
2.9398
2.9393
2.9490
2.9268
2.9042
2.9143
2.9065
2.9340
3.0154
3.0141
3.0202
3.0782
3.1301
3.1803
3.2108
3.2176
3.2588
3.2822
3.3173
3.3732
3.3976
3.4492
3.4675
3.5090
3.5702
3.5230
3.4513
3.3371
3.2674
3.2867
3.3829
3.4563
3.5314
3.5805
3.6043
3.6157
3.6267
3.6450
3.6317
3.5860
3.4163
3.3502
3.3793
3.3572
3.5124
3.8337
4.2717
4.6394
4.8060
4.7245
4.5504
4.3687
4.3737
4.6887
5.4021
6.0749
6.5674
6.7279
6.8391
6.8456
6.8219
6.8410
6.7609
6.5246
5.7718
4.4415
3.5784
3.4720
3.3728
3.4125
3.5051
3.4689
3.2906
3.2217
3.1706
3.1218
3.3428
3.7802
4.5759
5.3222
5.6758
6.0151
6.1276
6.1647
6.0552
5.9937
5.9784
5.7171
5.0609
4.8232
4.2979
3.7390
3.3099
2.9529
2.6971
2.6021
2.5640
2.6019
2.6515
2.6531
2.6558
2.7166
2.7408
2.8190
2.8535
2.8639
2.8700
2.7703
2.6353
2.5842
2.5137
2.4497
2.3751
2.3382
2.1323
1.8490
1.6700
1.5507
1.4733
1.4242
1.3643
1.2997
1.2203
1.1462
1.0776
0.9962
0.8265
0.4876
0.1304
0.0341
0.0296
0.0263
0.0261
0.0247
0.0232
0.0256
0.0214
0.0232
0.0208
0.0205
0.0182
0.0186
0.0169
0.0236
0.0198
0.0157
0.0143
0.0179
0.0118
0.0136
0.0139
0.0115
0.0093
0.0096
0.0107
0.0132
0.0090
0.0074
0.0103
0.0071
0.0086
0.0069
0.0052
0.0069
0.0062
0.0115
0.0068
0.0179
0.0121
0.0092
0.0098
0.0138
0.0081
0.0055
0.0077
0.0048
0.0059
0.0052
0.0095
0.0087
0.0114
0.0036
0.0080
0.0110
0.0049
0.0079
0.0065
0.0080
0.0110
0.0059
0.0158
0.0146
0.0095
0.0045
0.0081
0.0116
0.0091
0.0080
0.0095
0.0105
0.0077
0.0098
0.0138
0.0069
0.0118
0.0087
0.0046
0.0056
0.0072
0.0136
0.0110
0.0054
0.0090
0.0147
0.0102
0.0066
0.0102
0.0092
0.0045
0.0089
0.0134
0.0222
0.0336
0.0362
0.0464
0.0354
0.0420
0.0445
0.0400
0.0338
0.0369
0.0441
0.0397
0.0383
0.0353
0.0319
0.0342
0.0366
0.0414
0.0401
0.0452
0.0507
0.0444
0.0358
0.0432
0.0394
0.0406
0.0441
0.0386
0.0410
0.0409
0.0330
0.0282
0.0186
0.0137
0.0103
0.0033
0.0101
0.0080
0.0141
0.0097
0.0102
0.0092
0.0094
0.0055
0.0119
0.0140
0.0116
0.0077
0.0148
0.0063
0.0021
0.0048
0.0033
0.0123
0.0109
0.0108
0.0168
0.0112
0.0046
0.0085
0.0068
0.0091
0.0096
0.0061
0.0063
0.0082
0.0084
0.0094
0.0070
0.0087
0.0042
0.0077
0.0060
0.0123
0.0127
0.0107
0.0019
0.0082
0.0051
0.0068
0.0064
0.0061
0.0057
0.0094
0.0162
0.0141
0.0165
0.0065
0.0121
0.0047
0.0120
0.0076
0.0050
0.0080
0.0139
0.0023
0.0139
0.0123
0.0087
0.0151
0.0060
0.0103
0.0039
0.0042
0.0043
-0.0011
0.0080
0.0028
0.0074
0.0042
0.0018
0.0087
0.0049
0.0076
0.0156
0.0076
0.0091
0.0056
0.0091
0.0075
0.0012
0.0056
0.0123
0.0137
0.0087
0.0025
0.0084
0.0104
0.0086
-0.0008
0.0072
0.0110
0.0096
0.0081
0.0126
0.0020
0.0098
0.0070
0.0041
0.0027
0.0075
0.0040
0.0069
0.0098
0.0180
0.0143
0.0182
0.0120
0.0003
-0.0011
0.0063
0.0104
0.0043
0.0128
0.0075
0.0051
0.0065
0.0063
0.0005
0.0097
0.0099
0.0084
0.0105
0.0017
0.0080
0.0140
0.0054
0.0048
The easiest approach is to remove the points above a given threshold, without replacing them.
You could instead remove the points whose difference from the previous point exceeds a certain value (it seems you should only consider the positive difference in the special case of your problem), and not replace them; this may require several passes to erase the peaks.
There are more complicated approaches if this doesn't work.
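Not part of the original answer, but here is a rough NumPy sketch of that difference-based idea combined with linear interpolation over the removed points. The signal is assumed to be a 1-D array, and diff_threshold and passes are hand-tuned guesses rather than values from the question:

import numpy as np

def remove_peaks(y, diff_threshold=0.1, passes=500):
    # Repeatedly drop points that follow a sharp positive jump, then
    # linearly interpolate straight segments over the gaps.
    y = np.asarray(y, dtype=float)
    keep = np.ones(len(y), dtype=bool)
    for _ in range(passes):
        kept_idx = np.flatnonzero(keep)
        jumps = np.diff(y[kept_idx])                     # steps between surviving points
        spikes = np.flatnonzero(jumps > diff_threshold)  # positive jumps only, as suggested above
        if spikes.size == 0:
            break
        keep[kept_idx[spikes + 1]] = False               # remove the point after each jump
    x = np.arange(len(y))
    return np.interp(x, x[keep], y[keep])                # fill removed points with straight lines

With the values above loaded into an array (here hypothetically called data), remove_peaks(data, diff_threshold=0.05) is only a starting point; the threshold would need tuning against the actual peak widths.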

Neo4j very slow for graph import

I'm using Neo4j to load a graph. It is a CSV file of 11 million rows, and it is taking a long time to load: two hours have passed and the graph still hasn't finished loading. Is that normal?
My laptop has an i7 at 2.4 GHz and 8 GB of RAM.
The sample data:
protein1 protein2 combined_score
9615.ENSCAFP00000000001 9615.ENSCAFP00000014827 151
9615.ENSCAFP00000000001 9615.ENSCAFP00000026847 802
9615.ENSCAFP00000000001 9615.ENSCAFP00000015235 900
9615.ENSCAFP00000000001 9615.ENSCAFP00000007210 261
9615.ENSCAFP00000000001 9615.ENSCAFP00000025394 248
9615.ENSCAFP00000000001 9615.ENSCAFP00000038575 900
9615.ENSCAFP00000000001 9615.ENSCAFP00000011457 177
9615.ENSCAFP00000000001 9615.ENSCAFP00000002193 503
9615.ENSCAFP00000000001 9615.ENSCAFP00000042321 900
9615.ENSCAFP00000000001 9615.ENSCAFP00000011541 207
9615.ENSCAFP00000000001 9615.ENSCAFP00000038517 183
9615.ENSCAFP00000000001 9615.ENSCAFP00000003009 151
Query
CREATE CONSTRAINT ON (n:Node) ASSERT n.NodeID IS UNIQUE;
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///linksdog.csv'
AS line
MERGE (n1:Node {NodeID: line.protein1})
MERGE (n2:Node {NodeID: line.protein2})
MERGE (n1)-[:ACTING_WITH {Score: TOFLOAT(line.combined_score)}]->(n2);

Pandas performance issue of dataframe column "rename" and "drop"

Below is the line_profiler record of a function:
Wrote profile results to FM_CORE.py.lprof
Timer unit: 2.79365e-07 s
File: F:\FM_CORE.py
Function: _rpt_join at line 1068
Total time: 1.87766 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1068 #profile
1069 def _rpt_join(dfa, dfb, join_type='inner'):
1070 ''' join two dataframe together by ('STK_ID','RPT_Date') multilevel index.
1071 'join_type' can be 'inner' or 'outer'
1072 '''
1073
1074 2 56 28.0 0.0 try: # ('STK_ID','RPT_Date') are normal column
1075 2 2936668 1468334.0 43.7 rst = pd.merge(dfa, dfb, how=join_type, on=['STK_ID','RPT_Date'], left_index=True, right_index=True)
1076 except: # ('STK_ID','RPT_Date') are index
1077 rst = pd.merge(dfa, dfb, how=join_type, left_index=True, right_index=True)
1078
1079
1080 2 81 40.5 0.0 try: # handle 'STK_Name
1081 2 426472 213236.0 6.3 name_combine = pd.concat([dfa.STK_Name, dfb.STK_Name])
1082
1083
1084 2 900584 450292.0 13.4 nameseries = name_combine[-Series(name_combine.index.values, name_combine.index).duplicated()]
1085
1086 2 1138140 569070.0 16.9 rst.STK_Name_x = nameseries
1087 2 596768 298384.0 8.9 rst = rst.rename(columns={'STK_Name_x': 'STK_Name'})
1088 2 722293 361146.5 10.7 rst = rst.drop(['STK_Name_y'], axis=1)
1089 except:
1090 pass
1091
1092 2 94 47.0 0.0 return rst
What surprises me are these two lines:
1087 2 596768 298384.0 8.9 rst = rst.rename(columns={'STK_Name_x': 'STK_Name'})
1088 2 722293 361146.5 10.7 rst = rst.drop(['STK_Name_y'], axis=1)
Why do simple dataframe column "rename" and "drop" operations cost that much time (8.9% + 10.7%)? After all, the "merge" operation only costs 43.7%, and "rename"/"drop" do not look like calculation-intensive operations. How can I improve this?
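Not an answer from the original thread, but one hedged illustration of sidestepping the issue: rename() and drop() without inplace return copies of the whole frame, so rewriting the column index and deleting the unwanted column in place may be cheaper. A toy sketch with made-up data standing in for rst:

import pandas as pd

# toy stand-in for the merged frame 'rst' (hypothetical columns and values)
rst = pd.DataFrame({'STK_ID': [1, 2], 'RPT_Date': ['2012Q1', '2012Q2'],
                    'STK_Name_x': ['a', 'b'], 'STK_Name_y': ['c', 'd']})

# replace the column label without copying the data ...
rst.columns = ['STK_Name' if c == 'STK_Name_x' else c for c in rst.columns]
# ... and remove the duplicate name column in place
del rst['STK_Name_y']

Whether this actually helps on the real frames would have to be confirmed with line_profiler.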

Pig Join is returning no results

I have been stuck on this problem for over twelve hours now. I have a Pig script that is running on Amazon Web Services. Currently, I am just running my script in interactive mode. I am trying to get averages on a large data set of climate readings from weather stations; however, this data doesn't have country or state information so it has to be joined with another table that does.
State Table:
719990 99999 LILLOOET CN CA BC WKF +50683 -121933 +02780
719994 99999 SEDCO 710 CN CA CWQJ +46500 -048500 +00000
720000 99999 BOGUS AMERICAN US US -99999 -999999 -99999
720001 99999 PEASON RIDGE/RANGE US US LA K02R +31400 -093283 +01410
720002 99999 HALLOCK(AWS) US US MN K03Y +48783 -096950 +02500
720003 99999 DEER PARK(AWS) US US WA K07S +47967 -117433 +06720
720004 99999 MASON US US MI K09G +42567 -084417 +02800
720005 99999 GASTONIA US US NC K0A6 +35200 -081150 +02440
Climate Table: (I realize this doesn't contain anything to satisfy the join condition, but the full data set does.)
STN--- WBAN YEARMODA TEMP DEWP SLP STP VISIB WDSP MXSPD GUST MAX MIN PRCP SNDP FRSHTT
010010 99999 20090101 23.3 24 15.6 24 1033.2 24 1032.0 24 13.5 6 9.6 24 17.5 999.9 27.9* 16.7 0.00G 999.9 001000
010010 99999 20090102 27.3 24 20.5 24 1026.1 24 1024.9 24 13.7 5 14.6 24 23.3 999.9 28.9 25.3* 0.00G 999.9 001000
010010 99999 20090103 25.2 24 18.4 24 1028.3 24 1027.1 24 15.5 6 4.2 24 9.7 999.9 26.2* 23.9* 0.00G 999.9 001000
010010 99999 20090104 27.7 24 23.2 24 1019.3 24 1018.1 24 6.7 6 8.6 24 13.6 999.9 29.8 24.8 0.00G 999.9 011000
010010 99999 20090105 19.3 24 13.0 24 1015.5 24 1014.3 24 5.6 6 17.5 24 25.3 999.9 26.2* 10.2* 0.05G 999.9 001000
010010 99999 20090106 12.9 24 2.9 24 1019.6 24 1018.3 24 8.2 6 15.5 24 25.3 999.9 19.0* 8.8 0.02G 999.9 001000
010010 99999 20090107 26.2 23 20.7 23 998.6 23 997.4 23 6.6 6 12.1 22 21.4 999.9 31.5 19.2* 0.00G 999.9 011000
010010 99999 20090108 21.5 24 15.2 24 995.3 24 994.1 24 12.4 5 12.8 24 25.3 999.9 24.6* 19.2* 0.05G 999.9 011000
010010 99999 20090109 27.5 23 24.5 23 982.5 23 981.3 23 7.9 5 20.2 22 33.0 999.9 34.2 20.1* 0.00G 999.9 011000
010010 99999 20090110 22.5 23 16.7 23 977.2 23 976.1 23 11.9 6 15.5 23 35.0 999.9 28.9* 17.2 0.09G 999.9 000000
I load in the climate data using TextLoader, apply a regular expression to obtain the fields, and filter out the nulls from the result set. I then do the same with the state data, but I filter it for the country being the US.
The bags have the following schema:
CLIMATE_REMOVE_EMPTY: {station: int,wban: int,year: int,month: int,day: int,temp: double}
STATES_FILTER_US: {station: int,wban: int,name: chararray,wmo: chararray,fips: chararray,state: chararray}
I need to perform a join operation on (station,wban) so I can get a resulting bag with the station, wban, year, month, and temps. When I perform a dump on the resulting bag, it says that it was successful; however, the dump returns 0 results. This is the output.
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
1.0.3 0.9.2-amzn hadoop 2013-05-03 00:10:51 2013-05-03 00:12:42 HASH_JOIN,FILTER
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime MinReduceTime AvgReduceTime Alias Feature Outputs
job_201305030005_0001 2 1 36 15 25 33 33 33 CLIMATE,CLIMATE_REMOVE_NULL,RAW_CLIMATE,RAW_STATES,STATES,STATES_FILTER_US,STATE_CLIMATE_JOIN HASH_JOIN hdfs://10.204.30.125:9000/tmp/temp-204730737/tmp1776606203,
Input(s):
Successfully read 30587 records from: "hiddenbucket"
Successfully read 21027 records from: "hiddenbucket"
Output(s):
Successfully stored 0 records in: "hdfs://10.204.30.125:9000/tmp/temp-204730737/tmp1776606203"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
I have no idea why this contains 0 results. My data extraction seems correct, and the job is successful. This leads me to believe that the join condition is never satisfied, but I know the input files have data that should satisfy it, yet it returns absolutely nothing.
The only thing that looks suspicious is a warning that states:
Encountered Warning ACCESSING_NON_EXISTENT_FIELD 26001 time(s).
I'm not exactly sure where to go from here. Since the job isn't failing, I can't see any errors or anything in debug.
I'm not sure if these mean anything, but here are other things that stand out:
When I try to illustrate STATE_CLIMATE_JOIN, I get a nullPointerException - ERROR 2997: Encountered IOException. Exception : null
When I try to illustrate STATES, I get java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
Here is my full code:
--Piggy Bank Functions
register file:/home/hadoop/lib/pig/piggybank.jar
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
--Load Climate Data
RAW_CLIMATE = LOAD 'hiddenbucket' USING TextLoader as (line:chararray);
RAW_STATES= LOAD 'hiddenbucket' USING TextLoader as (line:chararray);
CLIMATE=
FOREACH
RAW_CLIMATE
GENERATE
FLATTEN ((tuple(int,int,int,int,int,double))
EXTRACT(line,'^(\\d{6})\\s+(\\d{5})\\s+(\\d{4})(\\d{2})(\\d{2})\\s+(\\d{1,3}\\.\\d{1})')
)
AS (
station: int,
wban: int,
year: int,
month: int,
day: int,
temp: double
)
;
STATES=
FOREACH
RAW_STATES
GENERATE
FLATTEN ((tuple(int,int,chararray,chararray,chararray,chararray))
EXTRACT(line,'^(\\d{6})\\s+(\\d{5})\\s+(\\S+)\\s+(\\w{2})\\s+(\\w{2})\\s+(\\w{2})')
)
AS (
station: int,
wban: int,
name: chararray,
wmo: chararray,
fips: chararray,
state: chararray
)
;
CLIMATE_REMOVE_NULL = FILTER CLIMATE BY station IS NOT NULL;
STATES_FILTER_US = FILTER STATES BY (fips == 'US');
STATE_CLIMATE_JOIN = JOIN CLIMATE_REMOVE_NULL BY (station), STATES_FILTER_US BY (station);
Thanks in advance. I am at a loss here.
--EDIT--
I finally got it to work! My regular expression for parsing the STATE_DATA was invalid.
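For what it is worth, one plausible failure mode (my guess, not confirmed by the poster) is that the single (\\S+) group in the STATES pattern cannot span multi-word station names. A quick check in Python, with the backslashes un-doubled for a raw string:

import re

# the STATES pattern from the script above, as a Python raw string
pattern = r'^(\d{6})\s+(\d{5})\s+(\S+)\s+(\w{2})\s+(\w{2})\s+(\w{2})'

# single-word name: matches
print(re.match(pattern, '720004 99999 MASON US US MI K09G +42567 -084417 +02800'))
# multi-word name: no match, so EXTRACT would yield nulls for rows like this
print(re.match(pattern, '720001 99999 PEASON RIDGE/RANGE US US LA K02R +31400 -093283 +01410'))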

Merge and matching tables in Oracle

Does anyone know how to merge two tables with a common column name and data into a single table? The shared column is a date column. This is part of a project at work, and no one here quite knows how it works. Any help would be appreciated.
table A
Sub Temp Weight Silicon Cast_Date
108 2675 2731 0.7002 18-jun-11 18:45
101 2691 3268 0.6194 18-jun-11 20:30
107 2701 6749 0.6976 18-jun-11 20:30
113 2713 2112 0.6616 18-jun-11 20:30
116 2733 3142 0.7382 19-jun-11 05:46
121 2745 2611 0.6949 19-jun-11 00:19
125 2726 1995 0.644 19-jun-11 00:19
table B
Si Temperature Sched_Cast_Date Treadwell
0.6622 2542 01-APR-11 02:57 114
0.6622 2542 01-APR-11 03:07 116
0.7516 2526 19-jun-11 05:46 116
0.7516 2526 01-APR-11 03:40 107
0.6741 2372 01-APR-11 04:03 107
0.6206 2369 01-APR-11 09:43 114
0.6741 2372 19-jun-11 00:19 125
the results would look like:
Subcar Temp Weight Silicon Cast_Date SI Temperature Sched_Cast_Date Treadwell
116 2733 3142 0.7382 19-jun-11 05:46 0.7516 2526 19-jun-11 05:46 116
125 2726 1995 0.644 19-jun-11 00:19 0.6741 2372 19-jun-11 00:19 125
I would like to run a query that returns data only where Sched_Cast_Date and Cast_Date are the same. A table with the same qualities would work just as well.
I hope that this makes more sense.
Are you asking how to join two tables on a common column? i.e.
select a.Sub, a.Temp, a.Weight, a.Silicon, a.Cast_Date, b.SI,
b.Temperature, b.Sched_Cast_Date, b.Treadwell
from a
join b on b.sched_cast_date = a.cast_date
