I have some signals that look like the following:
I would like to remove the two peaks by linear interpolation, so that I get something like this:
where the orange line segment replaces the two peaks after the interpolation.
I understand this is very difficult, because even a human could do it differently, like this:
or even this:
So it is a really challenging problem, and there might not be a definite answer, but I am looking for something that looks comfortable and natural and captures as much detail as possible.
I tried using a mask, but the edges are pretty noisy, and oftentimes the width of the mask is far from the actual width of the spike. I also tried smoothing and then applying finite differences to detect the start and end positions of the edges, but again it is not as accurate as it should be.
I am wondering whether anyone has experience dealing with this problem. What algorithm should I use? Is there any literature describing this kind of processing?
For this particular data set, the points are here:
-0.0568
-0.0536
-0.0528
-0.0500
-0.0379
-0.0169
-0.0005
0.0127
0.0075
0.0133
0.0123
0.0130
0.0084
0.0126
0.0144
0.0030
0.0093
0.0168
0.0101
0.0096
0.0078
0.0117
0.0106
0.0138
0.0128
0.0059
0.0075
0.0062
0.0056
0.0017
0.0037
0.0173
0.0114
0.0143
0.0113
0.0117
0.0040
0.0118
0.0085
0.0079
0.0063
0.0152
0.0064
0.0024
0.0058
0.0041
0.0101
0.0086
0.0086
0.0154
0.0018
0.0130
0.0094
0.0094
0.0096
0.0103
0.0170
0.0081
0.0035
0.0138
0.0123
0.0031
0.0120
0.0039
0.0043
0.0063
0.0191
0.0023
0.0165
0.0174
0.0129
0.0135
0.0153
0.0100
0.0066
0.0135
0.0109
0.0038
0.0129
0.0084
0.0095
0.0109
0.0121
0.0077
0.0116
0.0128
0.0101
0.0158
0.0134
0.0042
0.0054
0.0063
0.0059
0.0136
0.0029
0.0139
0.0104
0.0215
0.0180
0.0153
0.0187
0.0138
0.0236
0.0190
0.0267
0.0209
0.0112
0.0108
0.0238
0.0280
0.0266
0.0300
0.0256
0.0278
0.0260
0.0263
0.0257
0.0334
0.0309
0.0301
0.0325
0.0280
0.0300
0.0286
0.0359
0.0317
0.0381
0.0348
0.0422
0.0389
0.0491
0.1754
0.4760
0.8146
1.0172
1.0757
0.9471
0.8509
0.7955
0.7526
0.7314
0.7092
0.7073
0.6906
0.6787
0.6654
0.6646
0.6553
0.6420
0.6385
0.6390
0.6373
0.6305
0.6216
0.6218
0.6212
0.6108
0.6161
0.6054
0.6106
0.6006
0.6032
0.6100
0.6006
0.5975
0.6042
0.6027
0.6044
0.6138
0.6106
0.6051
0.6084
0.6065
0.6212
0.6207
0.6306
0.6270
0.6484
0.6605
0.6742
0.6828
0.6972
0.7076
0.7062
0.6918
0.6905
0.6759
0.6459
0.6134
0.5989
0.5790
0.5663
0.5595
0.5609
0.5467
0.5442
0.5400
0.5317
0.5267
0.5182
0.5187
0.5101
0.4975
0.4951
0.4907
0.4855
0.4745
0.4505
0.4604
0.5814
0.7370
0.8355
0.9012
0.9498
0.9783
1.0188
1.0496
1.0727
1.1201
1.1639
1.2085
1.2465
1.2691
1.3170
1.3553
1.4211
1.4715
1.5169
1.5694
1.5963
1.6341
1.6722
1.7125
1.7388
1.7725
1.8040
1.8505
1.8817
1.9064
1.9337
1.9837
1.9992
2.0385
2.0719
2.1062
2.1415
2.1767
2.2151
2.2385
2.2427
2.2591
2.2856
2.3185
2.3572
2.3638
2.3905
2.4077
2.4429
2.4662
2.4841
2.4977
2.5204
2.5549
2.5709
2.5810
2.6063
2.6301
2.6245
2.6519
2.6594
2.6707
2.6836
2.7045
2.7642
2.8208
2.8278
2.8821
2.8950
2.9526
3.0908
3.1539
3.1935
3.1544
3.1317
3.1717
3.1677
3.1526
3.1489
3.1292
3.1129
3.1293
3.1561
3.1556
3.1857
3.1856
3.1327
3.1160
3.0868
3.1122
3.1407
3.1970
3.2136
3.2211
3.2376
3.2222
3.2521
3.3035
3.4006
3.5001
3.5602
3.5756
3.6020
3.6014
3.5830
3.5640
3.5016
3.4363
3.3618
3.3640
3.4059
3.4812
3.4943
3.5307
3.5735
3.5193
3.5079
3.5052
3.4986
3.4955
3.4303
3.3649
3.3260
3.2755
3.1902
3.0984
3.0574
3.0174
2.9852
2.9648
2.9462
2.9398
2.9393
2.9490
2.9268
2.9042
2.9143
2.9065
2.9340
3.0154
3.0141
3.0202
3.0782
3.1301
3.1803
3.2108
3.2176
3.2588
3.2822
3.3173
3.3732
3.3976
3.4492
3.4675
3.5090
3.5702
3.5230
3.4513
3.3371
3.2674
3.2867
3.3829
3.4563
3.5314
3.5805
3.6043
3.6157
3.6267
3.6450
3.6317
3.5860
3.4163
3.3502
3.3793
3.3572
3.5124
3.8337
4.2717
4.6394
4.8060
4.7245
4.5504
4.3687
4.3737
4.6887
5.4021
6.0749
6.5674
6.7279
6.8391
6.8456
6.8219
6.8410
6.7609
6.5246
5.7718
4.4415
3.5784
3.4720
3.3728
3.4125
3.5051
3.4689
3.2906
3.2217
3.1706
3.1218
3.3428
3.7802
4.5759
5.3222
5.6758
6.0151
6.1276
6.1647
6.0552
5.9937
5.9784
5.7171
5.0609
4.8232
4.2979
3.7390
3.3099
2.9529
2.6971
2.6021
2.5640
2.6019
2.6515
2.6531
2.6558
2.7166
2.7408
2.8190
2.8535
2.8639
2.8700
2.7703
2.6353
2.5842
2.5137
2.4497
2.3751
2.3382
2.1323
1.8490
1.6700
1.5507
1.4733
1.4242
1.3643
1.2997
1.2203
1.1462
1.0776
0.9962
0.8265
0.4876
0.1304
0.0341
0.0296
0.0263
0.0261
0.0247
0.0232
0.0256
0.0214
0.0232
0.0208
0.0205
0.0182
0.0186
0.0169
0.0236
0.0198
0.0157
0.0143
0.0179
0.0118
0.0136
0.0139
0.0115
0.0093
0.0096
0.0107
0.0132
0.0090
0.0074
0.0103
0.0071
0.0086
0.0069
0.0052
0.0069
0.0062
0.0115
0.0068
0.0179
0.0121
0.0092
0.0098
0.0138
0.0081
0.0055
0.0077
0.0048
0.0059
0.0052
0.0095
0.0087
0.0114
0.0036
0.0080
0.0110
0.0049
0.0079
0.0065
0.0080
0.0110
0.0059
0.0158
0.0146
0.0095
0.0045
0.0081
0.0116
0.0091
0.0080
0.0095
0.0105
0.0077
0.0098
0.0138
0.0069
0.0118
0.0087
0.0046
0.0056
0.0072
0.0136
0.0110
0.0054
0.0090
0.0147
0.0102
0.0066
0.0102
0.0092
0.0045
0.0089
0.0134
0.0222
0.0336
0.0362
0.0464
0.0354
0.0420
0.0445
0.0400
0.0338
0.0369
0.0441
0.0397
0.0383
0.0353
0.0319
0.0342
0.0366
0.0414
0.0401
0.0452
0.0507
0.0444
0.0358
0.0432
0.0394
0.0406
0.0441
0.0386
0.0410
0.0409
0.0330
0.0282
0.0186
0.0137
0.0103
0.0033
0.0101
0.0080
0.0141
0.0097
0.0102
0.0092
0.0094
0.0055
0.0119
0.0140
0.0116
0.0077
0.0148
0.0063
0.0021
0.0048
0.0033
0.0123
0.0109
0.0108
0.0168
0.0112
0.0046
0.0085
0.0068
0.0091
0.0096
0.0061
0.0063
0.0082
0.0084
0.0094
0.0070
0.0087
0.0042
0.0077
0.0060
0.0123
0.0127
0.0107
0.0019
0.0082
0.0051
0.0068
0.0064
0.0061
0.0057
0.0094
0.0162
0.0141
0.0165
0.0065
0.0121
0.0047
0.0120
0.0076
0.0050
0.0080
0.0139
0.0023
0.0139
0.0123
0.0087
0.0151
0.0060
0.0103
0.0039
0.0042
0.0043
-0.0011
0.0080
0.0028
0.0074
0.0042
0.0018
0.0087
0.0049
0.0076
0.0156
0.0076
0.0091
0.0056
0.0091
0.0075
0.0012
0.0056
0.0123
0.0137
0.0087
0.0025
0.0084
0.0104
0.0086
-0.0008
0.0072
0.0110
0.0096
0.0081
0.0126
0.0020
0.0098
0.0070
0.0041
0.0027
0.0075
0.0040
0.0069
0.0098
0.0180
0.0143
0.0182
0.0120
0.0003
-0.0011
0.0063
0.0104
0.0043
0.0128
0.0075
0.0051
0.0065
0.0063
0.0005
0.0097
0.0099
0.0084
0.0105
0.0017
0.0080
0.0140
0.0054
0.0048
The easiest approach is to remove the points above a given threshold, without replacing them.
You could instead remove the points whose difference from the neighboring point exceeds a certain value (in your particular case it seems you only need to consider positive differences), again without replacing them; this may require several passes to erase the peaks.
There are more complicated approaches if this doesn't work.
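A minimal sketch of the threshold variant, combined with the linear interpolation the question asks for, in plain Python. The function name `remove_peaks` and the threshold value are illustrative assumptions, not part of the answer above; a threshold that suits one signal may not transfer to another.

```python
def remove_peaks(y, threshold):
    """Drop samples above `threshold` and bridge each gap with a straight line."""
    # Indices of the samples that are kept (at or below the threshold).
    keep = [i for i, v in enumerate(y) if v <= threshold]
    if not keep:
        raise ValueError("threshold removes every sample")

    out = list(y)

    # Before the first and after the last kept sample there is only one
    # neighbour, so extend that neighbour's value flat.
    for i in range(keep[0]):
        out[i] = y[keep[0]]
    for i in range(keep[-1] + 1, len(y)):
        out[i] = y[keep[-1]]

    # Linear interpolation across every interior gap between kept samples.
    for a, b in zip(keep, keep[1:]):
        for i in range(a + 1, b):
            frac = (i - a) / (b - a)
            out[i] = y[a] + frac * (y[b] - y[a])
    return out


signal = [0.0, 1.0, 5.0, 6.0, 4.0, 2.0]
print(remove_peaks(signal, 3.0))  # [0.0, 1.0, 1.25, 1.5, 1.75, 2.0]
```

The difference-based variant from the second paragraph could reuse the same interpolation step and only change how the `keep` list is built.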
The data set has 1511 observations. I used the first 1400 values to fit an ARIMA model of order (1,1,9), keeping the rest for prediction. But when I look at the predictions, all values after the first 16 are the same. Here's what I tried:
model2=ARIMA(tstrain,order=(1,1,9))
fitted_model2=model2.fit()
And for prediction:
start=len(tstrain)
end=len(tstrain)+len(tstest)-1
predictions=fitted_model2.predict(start,end,typ='levels')
Here tstrain and tstest are the train and test sets.
predictions.head(30)
1400 214.097742
1401 214.689674
1402 214.820804
1403 215.621131
1404 215.244980
1405 215.349230
1406 215.392444
1407 215.022312
1408 215.020736
1409 215.021384
1410 215.021118
1411 215.021227
1412 215.021182
1413 215.021201
1414 215.021193
1415 215.021196
1416 215.021195
1417 215.021195
1418 215.021195
1419 215.021195
1420 215.021195
1421 215.021195
1422 215.021195
1423 215.021195
1424 215.021195
1425 215.021195
1426 215.021195
1427 215.021195
1428 215.021195
1429 215.021195
Please help me out here. What am I missing?
Suppose we have a finite set S of vectors of length n with positive real values and a positive real number t. I want to find a minimal subset M of S such that each component of the sum of the vectors is greater than t.
For example, given the following problem,
t = 10
S = [[8, 2], [5, 5], [4, 3], [3, 9]]
an optimal solution would consist of the 2 vectors
M = [[8, 2], [3, 9]]
since
[8, 2] + [3, 9] = [11, 11] > [10, 10]
and no single vector has each component greater than 10. There is not necessarily a unique optimal subset, or a subset which satisfies the criterion at all.
It seems to me that this problem may be related to integer linear programming, but it does not seem to fit within its standard form. The problem may well be NP-hard, so I would also appreciate algorithms that try to approximate the solution.
"It seems to me that this problem may be related to integer linear programming, but does not seem to fit within its standard form"
I think it can be trivially implemented as a MIP model:
Here x(i) indicates if vector i is selected.
The data and results can look like:
---- 33 PARAMETER t = 10
---- 33 PARAMETER S
j1 j2
i1 8 2
i2 5 5
i3 4 3
i4 3 9
---- 33 VARIABLE x.L select vectors
i1 1, i4 1
---- 33 VARIABLE numSelected.L = 2
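As a cross-check of the small instance, the selection can be reproduced by exhaustive search in Python. This is not a MIP solver, only a brute-force sketch that takes exponential time and is practical for tiny inputs. The function name `smallest_cover` is made up for illustration, and the strict "greater than t" test follows the wording of the question (the model in the answer may use a different inequality).

```python
from itertools import combinations


def smallest_cover(S, t):
    """Return indices of a smallest subset whose component-wise sum exceeds t.

    Subsets are tried in order of increasing size, so the first hit is
    minimal. Returns None when no subset satisfies the condition.
    """
    n_components = len(S[0])
    for size in range(1, len(S) + 1):
        for idx in combinations(range(len(S)), size):
            totals = [sum(S[i][j] for i in idx) for j in range(n_components)]
            if all(total > t for total in totals):
                return idx
    return None


S = [[8, 2], [5, 5], [4, 3], [3, 9]]
print(smallest_cover(S, 10))  # (0, 3), i.e. vectors [8, 2] and [3, 9]
```

The zero-based indices (0, 3) correspond to i1 and i4 in the listing above.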
Just for testing, I also generated a larger problem:
---- 29 PARAMETER t = 100.000
---- 29 PARAMETER S
j1 j2 j3 j4 j5 j6 j7 j8 j9 j10
i1 1.717 8.433 5.504 3.011 2.922 2.241 3.498 8.563 0.671 5.002
i2 9.981 5.787 9.911 7.623 1.307 6.397 1.595 2.501 6.689 4.354
i3 3.597 3.514 1.315 1.501 5.891 8.309 2.308 6.657 7.759 3.037
i4 1.105 5.024 1.602 8.725 2.651 2.858 5.940 7.227 6.282 4.638
i5 4.133 1.177 3.142 0.466 3.386 1.821 6.457 5.607 7.700 2.978
i6 6.611 7.558 6.274 2.839 0.864 1.025 6.413 5.453 0.315 7.924
i7 0.728 1.757 5.256 7.502 1.781 0.341 5.851 6.212 3.894 3.587
i8 2.430 2.464 1.305 9.334 3.799 7.834 3.000 1.255 7.489 0.692
i9 2.020 0.051 2.696 4.999 1.513 1.742 3.306 3.169 3.221 9.640
i10 9.936 3.699 3.729 7.720 3.967 9.131 1.196 7.355 0.554 5.763
i11 0.514 0.060 4.012 5.199 6.289 2.257 3.961 2.760 1.524 9.363
i12 4.227 1.347 3.861 3.746 2.685 9.484 1.889 2.975 0.746 4.013
i13 1.017 3.839 3.241 1.921 1.124 5.966 5.114 0.451 7.831 9.457
i14 5.965 6.073 3.625 5.941 6.799 5.066 1.593 6.569 5.239 1.244
i15 9.867 2.281 6.757 7.768 9.325 2.012 2.971 1.972 2.463 6.465
i16 7.350 0.854 1.503 4.342 1.869 6.927 7.630 1.548 3.894 6.954
i17 8.458 6.127 9.760 0.269 1.874 0.871 5.404 1.269 7.340 1.132
i18 4.884 7.956 4.920 5.336 0.106 5.439 4.511 9.753 1.838 1.635
i19 0.246 1.778 0.613 0.166 8.357 6.017 0.270 1.961 9.507 3.355
i20 5.943 2.592 6.406 1.552 4.600 3.933 8.055 5.410 3.907 5.578
i21 9.328 3.488 0.083 9.488 5.719 3.336 9.837 7.665 1.101 9.948
i22 5.803 1.664 6.434 3.443 9.123 9.001 0.163 3.686 6.644 5.934
i23 0.346 8.418 9.321 5.080 2.996 4.966 0.449 7.737 5.330 7.468
i24 7.201 6.316 1.149 9.712 7.067 9.863 8.548 6.214 7.013 7.009
i25 7.907 6.102 0.543 4.852 0.525 6.986 1.948 2.260 8.136 9.917
i26 7.507 7.183 0.006 2.639 8.238 8.195 8.604 2.127 4.568 0.384
i27 3.230 4.399 3.153 1.348 8.110 4.168 1.418 4.655 2.830 8.957
i28 0.644 4.146 3.416 4.683 6.427 6.436 3.376 1.008 9.058 2.174
i29 9.189 4.518 0.899 3.742 4.150 4.042 1.117 7.511 8.034 0.237
i30 4.809 2.786 9.016 0.176 6.810 9.509 9.002 8.988 8.745 3.910
i31 5.042 8.313 6.021 0.822 5.778 5.932 6.838 1.588 3.318 3.159
i32 5.199 3.638 1.678 6.831 5.054 5.762 7.198 6.837 0.198 8.398
i33 7.100 1.555 6.107 6.616 1.944 3.635 6.239 7.314 4.140 1.575
i34 0.125 0.102 9.520 9.767 9.663 8.563 1.416 0.497 5.530 1.840
i35 9.942 8.091 3.062 0.874 4.305 3.497 1.173 5.860 4.455 4.123
i36 9.145 2.138 2.242 5.423 6.311 3.274 1.488 9.291 2.510 0.626
i37 3.101 0.402 8.212 2.310 4.100 3.026 4.449 7.160 5.932 1.312
i38 1.612 3.156 5.721 2.687 0.364 6.864 6.746 3.321 7.599 1.768
i39 6.825 6.730 8.312 5.152 2.830 5.554 4.140 0.734 8.060 3.327
i40 0.847 5.722 0.221 7.420 9.051 5.608 4.728 7.176 5.130 8.871
i41 7.715 1.401 2.645 6.826 4.498 9.655 9.579 8.992 3.275 4.571
i42 5.962 8.786 1.707 6.336 7.716 5.694 0.277 8.110 2.789 4.333
i43 3.363 5.886 5.744 5.434 5.782 9.772 3.215 7.630 9.625 9.490
i44 2.559 3.249 2.148 1.740 7.313 2.702 7.585 6.174 2.910 7.407
i45 0.078 8.665 0.151 4.283 3.586 7.049 4.159 5.498 3.450 6.996
i46 9.335 4.693 2.136 5.108 3.657 9.354 0.680 5.039 3.924 2.049
i47 5.295 5.891 3.458 2.529 5.477 5.475 0.583 3.777 9.741 3.798
i48 1.563 4.723 3.970 2.055 6.273 0.035 5.040 0.022 5.214 8.361
i49 0.721 7.606 2.901 2.449 4.360 3.688 5.534 0.747 9.094 0.487
i50 8.204 7.930 6.595 3.918 4.132 8.657 9.753 5.724 3.140 4.550
i51 3.710 4.198 0.853 8.145 5.092 7.344 8.244 4.145 9.244 3.941
i52 4.443 6.955 6.767 5.715 1.735 6.043 5.860 7.278 2.462 1.421
i53 8.912 4.428 1.143 9.036 3.339 9.960 4.645 5.305 1.904 1.991
i54 6.449 7.991 5.864 9.716 4.911 3.725 8.272 8.209 0.313 9.252
i55 0.031 6.180 0.067 4.056 6.563 6.155 2.634 0.704 0.460 1.603
i56 4.525 1.689 5.716 8.587 0.359 3.554 3.379 4.865 2.594 8.912
i57 8.493 6.345 9.865 7.579 5.079 7.678 8.321 5.839 5.751 5.567
i58 6.036 9.771 5.541 9.350 4.132 8.064 0.008 4.553 4.192 0.157
i59 0.820 5.988 5.552 6.180 8.851 1.899 4.281 1.665 8.863 6.765
i60 8.633 0.928 9.965 6.168 0.008 6.089 1.737 3.112 7.289 0.847
i61 4.420 6.591 4.845 3.182 9.140 1.842 8.721 4.561 4.589 1.241
i62 1.038 2.349 3.402 8.731 8.448 8.229 4.805 9.138 9.293 1.081
i63 3.506 8.600 3.405 2.131 9.264 3.607 8.760 1.481 4.560 2.683
i64 2.968 1.155 3.998 9.118 2.516 1.030 1.463 3.863 0.464 0.264
i65 1.540 0.727 8.286 3.998 4.172 9.721 2.434 3.620 6.303 2.429
i66 1.011 4.059 4.791 1.449 5.097 8.853 0.555 5.074 7.641 9.790
i67 7.204 8.719 2.995 7.539 8.438 4.682 6.549 3.781 3.589 2.545
i68 2.555 4.506 9.449 3.355 0.485 8.065 7.326 9.200 3.316 2.098
i69 0.471 8.830 0.928 8.369 4.638 0.002 1.566 7.050 2.803 6.356
i70 5.800 9.621 5.142 5.903 5.675 0.299 9.689 5.938 0.596 5.441
i71 1.274 4.205 8.814 7.166 3.136 9.930 0.553 6.191 8.044 2.966
i72 9.589 6.948 6.314 6.160 5.230 1.410 8.367 8.921 6.965 3.118
i73 3.558 7.611 1.364 7.166 9.637 4.419 2.645 4.442 3.967 3.669
i74 6.213 8.597 7.550 9.008 4.380 0.691 9.386 8.440 9.504 5.794
i75 0.352 4.074 0.574 5.320 1.124 8.938 2.417 6.422 2.908 8.136
i76 1.861 5.007 9.388 6.816 1.852 8.729 1.650 4.308 7.717 9.781
i77 9.445 2.488 8.865 8.844 9.649 7.491 6.476 7.484 5.232 6.861
i78 6.911 8.656 7.958 2.176 5.896 9.226 6.029 0.357 9.111 5.650
i79 3.248 3.901 2.993 2.190 8.217 1.527 9.505 0.319 7.345 1.105
i80 4.878 3.610 2.164 9.237 4.500 9.711 0.963 4.789 7.222 4.332
i81 1.582 1.007 8.055 3.987 1.171 8.744 1.449 1.777 5.452 4.686
i82 9.092 7.231 1.664 3.276 5.813 5.775 6.276 0.267 1.294 0.641
i83 3.111 5.785 8.098 6.792 7.357 3.386 2.242 9.000 8.294 3.162
i84 9.522 2.567 6.261 9.713 9.621 4.253 1.054 0.771 6.441 3.122
i85 5.952 6.064 6.337 9.582 0.823 1.254 6.052 7.415 8.475 3.525
i86 6.414 8.957 3.882 2.734 9.704 3.462 4.096 9.399 6.029 8.995
i87 2.847 2.222 5.748 5.095 5.575 3.442 3.983 7.762 0.282 3.624
i88 7.558 4.749 0.762 0.975 3.297 2.006 0.908 4.488 4.628 8.120
i89 4.500 9.543 1.226 4.066 8.864 7.032 8.749 5.551 2.556 2.592
i90 3.551 1.369 8.071 3.260 4.288 0.090 2.243 6.607 2.874 1.311
i91 4.071 1.616 8.618 3.777 8.886 2.700 7.774 4.228 4.299 2.491
i92 3.818 0.710 7.156 7.029 0.702 9.685 2.700 3.181 8.834 5.862
i93 3.821 9.730 6.708 9.511 7.183 4.436 8.797 4.698 4.639 3.714
i94 1.183 9.654 8.435 7.213 9.644 3.646 7.683 3.325 5.421 1.384
i95 3.137 0.799 9.342 4.943 6.789 6.916 0.547 2.402 8.827 1.006
i96 6.784 1.014 0.428 7.535 9.870 0.148 1.469 9.910 1.322 0.854
i97 1.859 7.902 8.645 7.479 3.895 7.950 6.210 7.614 5.803 6.636
i98 0.821 5.669 4.432 3.283 3.306 1.229 6.179 5.339 1.441 3.923
i99 0.148 8.708 2.455 8.032 0.840 9.872 5.583 6.819 8.596 5.867
i100 0.395 9.129 5.425 2.210 9.256 7.928 7.394 6.777 5.012 4.311
---- 29 VARIABLE x.L select vectors
i24 1, i30 1, i42 1, i43 1, i50 1, i54 1, i57 1, i72 1, i74 1, i76 1, i77 1, i78 1, i83 1
i84 1, i86 1, i93 1, i97 1
---- 29 VARIABLE numSelected.L = 17
This model still solved in 1 second. I am sure there are larger instances that become more difficult to solve to optimality. But MIP solvers typically find good solutions very quickly, so in a sense, a MIP model can also be used as a heuristic: just stop earlier (before proving optimality).
My application is throwing the following error:
Microsoft VBScript runtime error '800a0005'
Invalid procedure call or argument: 'mid'
strLine = Trim(Mid(strLine, 1, InStr(UCase(strLine), "SINGLE POINT DATA") - 1))
The main function is below:
strLine = tstextstreamRead.ReadAll
If cint(intTestId) = 23 Then
    strLine2 = Trim(Mid(strLine, InStr(UCase(strLine), "SINGLE POINT DATA") + 1, Len(strLine)))
    strLine2 = Trim(Mid(strLine2, InStr(strLine2, "Model Cross-Arrhenius"), Len(strLine2)))
End If
response.write strLine
strLine = Trim(Mid(strLine, 1, InStr(UCase(strLine), "SINGLE POINT DATA") - 1))
When I checked the contents with response.write strLine, I got the data below:
Sample ID Region :PACIFIC
Request No :tesy
Sample No :12
Company :213
Family :213
Grade :213
Standard :ZOLLER METHOD
Color No :UNKNOWN
Lot No :UNKNOWN
Remark :213
Date :7/22/2016
Generic :123
Operator :213
Test Lab :WTC
Test Name :Pressure-Volume-Temperature (PVT)
Test Method :ZOLLER METHOD
Dataset 0
POINTS 26 30.2 1109.998 40.6 1126.961 50.8 1124.916 61.1 1121.716 71.3 1117.909 81.5 1113.674 91.4 1108.76 101.4 1103.019 111.8 1095.211 121.8 1087.404 131.8 1079.681 142.2 1072.674 152 1065.485 162.3 1057.771 172.1 1050.805 182.3 1043.491 193.4 1036.327 203 1029.239 214.1 1022.532 223.6 1015.535 234.6 1008.179 245.1 1000.297 255 993.677 265.6 986.178 275.4 979.918 285.8 973.305
Dataset 50
POINTS 26 30.2 1141.188 40.6 1145.201 50.8 1142.895 61.1 1140.183 71.3 1137.529 81.5 1134.817 91.4 1132.018 101.4 1129.33 111.8 1125.39 121.8 1120.212 131.8 1113.709 142.2 1107.681 152 1101.57 162.3 1095.298 172.1 1089.361 182.3 1083.419 193.4 1077.757 203 1072.34 214.1 1066.657 223.6 1061.414 234.6 1055.779 245.1 1049.333 255 1043.651 265.6 1038.333 275.4 1033.283 285.8 1028.57
Dataset 100
POINTS 26 30.2 1160.699 40.6 1159.695 50.8 1157.533 61.1 1155.299 71.3 1153.075 81.5 1151.028 91.4 1149.093 101.4 1147.7 111.8 1145.902 121.8 1142.99 131.8 1138.733 142.2 1133.957 152 1128.643 162.3 1123.009 172.1 1117.834 182.3 1112.526 193.4 1107.26 203 1102.65 214.1 1097.619 223.6 1093.004 234.6 1086.949 245.1 1082.376 255 1077.455 265.6 1072.812 275.4 1068.752 285.8 1064.607
Dataset 150
POINTS 26 30.2 1174.42 40.6 1172.732 50.8 1170.673 61.1 1168.593 71.3 1166.813 81.5 1165.074 91.4 1163.543 101.4 1162.673 111.8 1161.681 121.8 1160.11 131.8 1157.741 142.2 1154.905 152 1150.866 162.3 1146.22 172.1 1141.382 182.3 1136.381 193.4 1131.774 203 1127.484 214.1 1123.175 223.6 1118.773 234.6 1113.149 245.1 1108.975 255 1104.514 265.6 1100.457 275.4 1096.706 285.8 1093.423
Dataset 200
POINTS 26 30.2 1185.837 40.6 1183.95 50.8 1182.17 61.1 1180.377 71.3 1178.593 81.5 1177.103 91.4 1175.963 101.4 1175.373 111.8 1174.678 121.8 1173.552 131.8 1172.182 142.2 1170.538 152 1168.029 162.3 1164.802 172.1 1160.855 182.3 1156.361 193.4 1152.075 203 1147.987 214.1 1143.934 223.6 1139.92 234.6 1134.884 245.1 1130.755 255 1126.75 265.6 1123.348 275.4 1119.874 285.8 1116.635
Kindly let me know what is the exact problem.
Your data does not contain "SINGLE POINT DATA", so InStr() returns 0 (not found) and your Mid() call boils down to
>> s = Mid("s", 1, -1)
>>
Error Number: 5
Error Description: Invalid procedure call or argument
>>
You should check the InStr() result before calling Mid():
nPos = InStr(UCase(strLine), "SINGLE POINT DATA")
If 0 = nPos Then
    WScript.Echo "Bingo"
Else
    strLine = Trim(Mid(strLine, 1, nPos - 1))
End If
I have been stuck on this problem for over twelve hours now. I have a Pig script that is running on Amazon Web Services. Currently, I am just running my script in interactive mode. I am trying to get averages on a large data set of climate readings from weather stations; however, this data doesn't have country or state information so it has to be joined with another table that does.
State Table:
719990 99999 LILLOOET CN CA BC WKF +50683 -121933 +02780
719994 99999 SEDCO 710 CN CA CWQJ +46500 -048500 +00000
720000 99999 BOGUS AMERICAN US US -99999 -999999 -99999
720001 99999 PEASON RIDGE/RANGE US US LA K02R +31400 -093283 +01410
720002 99999 HALLOCK(AWS) US US MN K03Y +48783 -096950 +02500
720003 99999 DEER PARK(AWS) US US WA K07S +47967 -117433 +06720
720004 99999 MASON US US MI K09G +42567 -084417 +02800
720005 99999 GASTONIA US US NC K0A6 +35200 -081150 +02440
Climate Table: (I realize this doesn't contain anything to satisfy the join condition, but the full data set does.)
STN--- WBAN YEARMODA TEMP DEWP SLP STP VISIB WDSP MXSPD GUST MAX MIN PRCP SNDP FRSHTT
010010 99999 20090101 23.3 24 15.6 24 1033.2 24 1032.0 24 13.5 6 9.6 24 17.5 999.9 27.9* 16.7 0.00G 999.9 001000
010010 99999 20090102 27.3 24 20.5 24 1026.1 24 1024.9 24 13.7 5 14.6 24 23.3 999.9 28.9 25.3* 0.00G 999.9 001000
010010 99999 20090103 25.2 24 18.4 24 1028.3 24 1027.1 24 15.5 6 4.2 24 9.7 999.9 26.2* 23.9* 0.00G 999.9 001000
010010 99999 20090104 27.7 24 23.2 24 1019.3 24 1018.1 24 6.7 6 8.6 24 13.6 999.9 29.8 24.8 0.00G 999.9 011000
010010 99999 20090105 19.3 24 13.0 24 1015.5 24 1014.3 24 5.6 6 17.5 24 25.3 999.9 26.2* 10.2* 0.05G 999.9 001000
010010 99999 20090106 12.9 24 2.9 24 1019.6 24 1018.3 24 8.2 6 15.5 24 25.3 999.9 19.0* 8.8 0.02G 999.9 001000
010010 99999 20090107 26.2 23 20.7 23 998.6 23 997.4 23 6.6 6 12.1 22 21.4 999.9 31.5 19.2* 0.00G 999.9 011000
010010 99999 20090108 21.5 24 15.2 24 995.3 24 994.1 24 12.4 5 12.8 24 25.3 999.9 24.6* 19.2* 0.05G 999.9 011000
010010 99999 20090109 27.5 23 24.5 23 982.5 23 981.3 23 7.9 5 20.2 22 33.0 999.9 34.2 20.1* 0.00G 999.9 011000
010010 99999 20090110 22.5 23 16.7 23 977.2 23 976.1 23 11.9 6 15.5 23 35.0 999.9 28.9* 17.2 0.09G 999.9 000000
I load in the climate data using TextLoader, apply a regular expression to obtain the fields, and filter out the nulls from the result set. I then do the same with the state data, but I filter it for the country being the US.
The bags have the following schema:
CLIMATE_REMOVE_EMPTY: {station: int,wban: int,year: int,month: int,day: int,temp: double}
STATES_FILTER_US: {station: int,wban: int,name: chararray,wmo: chararray,fips: chararray,state: chararray}
I need to perform a join operation on (station,wban) so I can get a resulting bag with the station, wban, year, month, and temps. When I perform a dump on the resulting bag, it says that it was successful; however, the dump returns 0 results. This is the output.
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
1.0.3 0.9.2-amzn hadoop 2013-05-03 00:10:51 2013-05-03 00:12:42 HASH_JOIN,FILTER
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime MinReduceTime AvgReduceTime Alias Feature Outputs
job_201305030005_0001 2 1 36 15 25 33 33 33 CLIMATE,CLIMATE_REMOVE_NULL,RAW_CLIMATE,RAW_STATES,STATES,STATES_FILTER_US,STATE_CLIMATE_JOIN HASH_JOIN hdfs://10.204.30.125:9000/tmp/temp-204730737/tmp1776606203,
Input(s):
Successfully read 30587 records from: "hiddenbucket"
Successfully read 21027 records from: "hiddenbucket"
Output(s):
Successfully stored 0 records in: "hdfs://10.204.30.125:9000/tmp/temp-204730737/tmp1776606203"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
I have no idea why this contains 0 results. My data extraction seems correct, and the job is successful. This leads me to believe that the join condition is never satisfied. I know the input files have some data that should satisfy the join condition, but the join returns absolutely nothing.
The only thing that looks suspicious is a warning that states:
Encountered Warning ACCESSING_NON_EXISTENT_FIELD 26001 time(s).
I'm not exactly sure where to go from here. Since the job isn't failing, I can't see any errors or anything in debug.
I'm not sure if these mean anything, but here are other things that stand out:
When I try to illustrate STATE_CLIMATE_JOIN, I get a nullPointerException - ERROR 2997: Encountered IOException. Exception : null
When I try to illustrate STATES, I get java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
Here is my full code:
--Piggy Bank Functions
register file:/home/hadoop/lib/pig/piggybank.jar
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
--Load Climate Data
RAW_CLIMATE = LOAD 'hiddenbucket' USING TextLoader as (line:chararray);
RAW_STATES= LOAD 'hiddenbucket' USING TextLoader as (line:chararray);
CLIMATE=
FOREACH
RAW_CLIMATE
GENERATE
FLATTEN ((tuple(int,int,int,int,int,double))
EXTRACT(line,'^(\\d{6})\\s+(\\d{5})\\s+(\\d{4})(\\d{2})(\\d{2})\\s+(\\d{1,3}\\.\\d{1})')
)
AS (
station: int,
wban: int,
year: int,
month: int,
day: int,
temp: double
)
;
STATES=
FOREACH
RAW_STATES
GENERATE
FLATTEN ((tuple(int,int,chararray,chararray,chararray,chararray))
EXTRACT(line,'^(\\d{6})\\s+(\\d{5})\\s+(\\S+)\\s+(\\w{2})\\s+(\\w{2})\\s+(\\w{2})')
)
AS (
station: int,
wban: int,
name: chararray,
wmo: chararray,
fips: chararray,
state: chararray
)
;
CLIMATE_REMOVE_NULL = FILTER CLIMATE BY station IS NOT NULL;
STATES_FILTER_US = FILTER STATES BY (fips == 'US');
STATE_CLIMATE_JOIN = JOIN CLIMATE_REMOVE_NULL BY (station), STATES_FILTER_US BY (station);
Thanks in advance. I am at a loss here.
--EDIT--
I finally got it to work! My regular expression for parsing the STATE_DATA was invalid.
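One way the STATES pattern can fail is sketched below in Python. Two assumptions apply: Python's `re` module stands in for the Java regex engine behind EXTRACT, and the sample rows are taken as they appear above, with single spaces, which may differ from the raw file. The helper name `parse_state` is made up for illustration. Because `(\S+)` cannot span a space, any station name with more than one word stops the match.

```python
import re

# Pattern copied from the STATES block of the script, with Pig's doubled
# backslashes reduced to single ones.
STATES_PATTERN = re.compile(
    r'^(\d{6})\s+(\d{5})\s+(\S+)\s+(\w{2})\s+(\w{2})\s+(\w{2})'
)

rows = [
    "720001 99999 PEASON RIDGE/RANGE US US LA K02R +31400 -093283 +01410",
    "720003 99999 DEER PARK(AWS) US US WA K07S +47967 -117433 +06720",
    "720004 99999 MASON US US MI K09G +42567 -084417 +02800",
]


def parse_state(line):
    """Return the captured groups, or None when the pattern does not match."""
    match = STATES_PATTERN.search(line)
    return match.groups() if match else None


for row in rows:
    print(parse_state(row))
```

Only the single-word name (MASON) is parsed; the two multi-word names yield None. Rows that fail to match would reach the join with null fields, which seems consistent with the ACCESSING_NON_EXISTENT_FIELD warning.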
Does anyone know how to merge two tables that share a common column and data into a single table? The shared column is a date column. This is part of a project at work, and no one here quite knows how it works. Any help would be appreciated.
table A
Sub Temp Weight Silicon Cast_Date
108 2675 2731 0.7002 18-jun-11 18:45
101 2691 3268 0.6194 18-jun-11 20:30
107 2701 6749 0.6976 18-jun-11 20:30
113 2713 2112 0.6616 18-jun-11 20:30
116 2733 3142 0.7382 19-jun-11 05:46
121 2745 2611 0.6949 19-jun-11 00:19
125 2726 1995 0.644 19-jun-11 00:19
table B
Si Temperature Sched_Cast_Date Treadwell
0.6622 2542 01-APR-11 02:57 114
0.6622 2542 01-APR-11 03:07 116
0.7516 2526 19-jun-11 05:46 116
0.7516 2526 01-APR-11 03:40 107
0.6741 2372 01-APR-11 04:03 107
0.6206 2369 01-APR-11 09:43 114
0.6741 2372 19-jun-11 00:19 125
the results would look like:
Subcar Temp Weight Silicon Cast_Date SI Temperature Sched_Cast_Date Treadwell
116 2733 3142 0.7382 19-jun-11 05:46 0.7516 2526 19-jun-11 05:46 116
125 2726 1995 0.644 19-jun-11 00:19 0.6741 2372 19-jun-11 00:19 125
I would like to run a query that returns rows only where Sched_Cast_Date and Cast_Date are the same. A table with the same properties would work just as well.
I hope that this makes more sense.
Are you asking how to join two tables on a common column? i.e.
select a.Sub, a.Temp, a.Weight, a.Silicon, a.Cast_Date, b.SI,
b.Temperature, b.Sched_Cast_Date, b.Treadwell
from a
join b on b.sched_cast_date = a.cast_date
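With the sample rows from the question, joining on the date alone may return more than the two expected rows, because Sub 121 shares its Cast_Date with Sub 125. The sketch below uses Python's built-in sqlite3 module to compare the date-only join with one that also requires Treadwell to equal Sub. Several things here are assumptions: the question does not name a database, the dates are stored as plain text, only some of the sample rows are loaded, and the extra condition is inferred from the expected output, not stated in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a ("Sub" INTEGER, "Temp" INTEGER, Weight INTEGER,
                    Silicon REAL, Cast_Date TEXT);
    CREATE TABLE b (Si REAL, Temperature INTEGER,
                    Sched_Cast_Date TEXT, Treadwell INTEGER);
    INSERT INTO a VALUES
        (108, 2675, 2731, 0.7002, '18-jun-11 18:45'),
        (116, 2733, 3142, 0.7382, '19-jun-11 05:46'),
        (121, 2745, 2611, 0.6949, '19-jun-11 00:19'),
        (125, 2726, 1995, 0.644,  '19-jun-11 00:19');
    INSERT INTO b VALUES
        (0.6622, 2542, '01-APR-11 02:57', 114),
        (0.7516, 2526, '19-jun-11 05:46', 116),
        (0.6741, 2372, '19-jun-11 00:19', 125);
""")

# Join on the date only, as in the query above.
date_only = conn.execute("""
    SELECT a."Sub", b.Treadwell, a.Cast_Date
    FROM a JOIN b ON b.Sched_Cast_Date = a.Cast_Date
    ORDER BY a."Sub"
""").fetchall()

# Join on the date and additionally require Treadwell = Sub.
date_and_sub = conn.execute("""
    SELECT a."Sub", b.Treadwell, a.Cast_Date
    FROM a JOIN b ON b.Sched_Cast_Date = a.Cast_Date
                 AND b.Treadwell = a."Sub"
    ORDER BY a."Sub"
""").fetchall()

print([row[0] for row in date_only])     # [116, 121, 125]
print([row[0] for row in date_and_sub])  # [116, 125]
```

If Treadwell and Sub are unrelated in the real schema, the date-only join is the one to keep, and the extra row for Sub 121 is simply a legitimate match.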