Integration of orbits with solar system gravity fields from Skyfield - speed issues - performance

In the time tests shown below, I found that Skyfield takes several hundred microseconds up to a millisecond to return obj.at(jd).position.km for a single time value in jd, but the incremental cost for longer JulianDate objects (a list of points in time) is only about one microsecond per point. I see similar speeds using Jplephem and with two different ephemerides.
My question here is: if I want to random-access points in time, for example as a slave to an external Runge-Kutta routine which uses its own variable stepsize, is there a way I can do this faster within python (without having to learn to compile code)?
I understand this is not at all the typical way Skyfield is intended to be used. Normally we'd load a JulianDate object with a long list of time points and then calculate them at once, and probably do that a few times, not thousands of times (or more), the way an orbit integrator might do.
Workaround: I can imagine a work-around where I build my own NumPy database by running Skyfield once using a JulianDate object with fine time granularity, then writing my own Runge-Kutta routine which changes step sizes up and down by discrete amounts such that the timesteps always correspond directly to the striding of NumPy array.
Or I could even try re-interpolating. I am not doing highly precise calculations so a simple NumPy or SciPy 2nd order might be fine.
Ultimately I'd like to try integrating the path of objects under the influence of the gravity field of the solar system (e.g. deep-space satellite, comet, asteroid). When looking for an orbit solution one might try millions of starting state vectors in 6D phase space. I know I should be using things like ob.at(jd).observe(large_body).position.km method because gravity travels at the speed of light like everything else. This seems to cost substantial time since (I'm guessing) it's an iterative calculation ("Let's see... where would Jupiter have been such that I feel it's gravity right NOW"). But let's peel the cosmic onion one layer at a time.
Figure 1. Skyfield and JPLephem performance on my laptop for different length JulianDate objects, for de405 and de421. They are all about the same - (very) roughly about a half-millisecond for the first point and a microsecond for each additional point. Also the very first point to be calculated when the script runs (Earth (blue) with len(jd) = 1) has an additional millisecond artifact.
Earth and Moon are slower because it is a two-step calculation internally (the Earth-Moon Barycenter plus the individual orbits about the Barycenter). Mercury may be slower because it moves so fast compared to the ephemeris time steps that it requires more coefficients in the (costly) Chebyshev interpolation?
SCRIPT FOR SKYFIELD DATA the JPLephem script is farther down
import numpy as np
import matplotlib.pyplot as plt
from skyfield.api import load, JulianDate
import time
ephem = 'de421.bsp'
ephem = 'de405.bsp'
de = load(ephem)
earth = de['earth']
moon = de['moon']
earth_barycenter = de['earth barycenter']
mercury = de['mercury']
jupiter = de['jupiter barycenter']
pluto = de['pluto barycenter']
things = [ earth, moon, earth_barycenter, mercury, jupiter, pluto ]
names = ['earth', 'moon', 'earth barycenter', 'mercury', 'jupiter', 'pluto']
ntimes = [i*10**n for n in range(5) for i in [1, 2, 5]]
years = [np.zeros(1)] + [np.linspace(0, 100, n) for n in ntimes[1:]] # 100 years
microsecs = []
for y in years:
jd = JulianDate(utc=(1900 + y, 1, 1))
mics = []
for thing in things:
tstart = time.clock()
answer = thing.at(jd).position.km
mics.append(1E+06 * (time.clock() - tstart))
microsecs.append(mics)
microsecs = np.array(microsecs).T
many = [len(y) for y in years]
fig = plt.figure()
ax = plt.subplot(111, xlabel='length of JD object',
ylabel='microseconds',
title='time for thing.at(jd).position.km with ' + ephem )
for item in ([ax.title, ax.xaxis.label, ax.yaxis.label] +
ax.get_xticklabels() + ax.get_yticklabels()):
item.set_fontsize(item.get_fontsize() + 4) # http://stackoverflow.com/a/14971193/3904031
for name, mics in zip(names, microsecs):
ax.plot(many, mics, lw=2, label=name)
plt.legend(loc='upper left', shadow=False, fontsize='x-large')
plt.xscale('log')
plt.yscale('log')
plt.savefig("skyfield speed test " + ephem.split('.')[0])
plt.show()
SCRIPT FOR JPLEPHEM DATA the Skyfield script is above
import numpy as np
import matplotlib.pyplot as plt
from jplephem.spk import SPK
import time
ephem = 'de421.bsp'
ephem = 'de405.bsp'
kernel = SPK.open(ephem)
jd_1900_01_01 = 2415020.5004882407
ntimes = [i*10**n for n in range(5) for i in [1, 2, 5]]
years = [np.zeros(1)] + [np.linspace(0, 100, n) for n in ntimes[1:]] # 100 years
barytup = (0, 3)
earthtup = (3, 399)
# moontup = (3, 301)
microsecs = []
for y in years:
mics = []
#for thing in things:
jd = jd_1900_01_01 + y * 365.25 # roughly, it doesn't matter here
tstart = time.clock()
answer = kernel[earthtup].compute(jd) + kernel[barytup].compute(jd)
mics.append(1E+06 * (time.clock() - tstart))
microsecs.append(mics)
microsecs = np.array(microsecs)
many = [len(y) for y in years]
fig = plt.figure()
ax = plt.subplot(111, xlabel='length of JD object',
ylabel='microseconds',
title='time for jplephem [0,3] and [3,399] with ' + ephem )
# from here: http://stackoverflow.com/a/14971193/3904031
for item in ([ax.title, ax.xaxis.label, ax.yaxis.label] +
ax.get_xticklabels() + ax.get_yticklabels()):
item.set_fontsize(item.get_fontsize() + 4)
#for name, mics in zip(names, microsecs):
ax.plot(many, microsecs, lw=2, label='earth')
plt.legend(loc='upper left', shadow=False, fontsize='x-large')
plt.xscale('log')
plt.yscale('log')
plt.ylim(1E+02, 1E+06)
plt.savefig("jplephem speed test " + ephem.split('.')[0])
plt.show()

Related

Discrete path tracking with python gekko

I have some discrete data points representing a path and I want to minimize the distance between a trajectory of an object to these path points along with some other constraints. I'm trying out gekko as a tool to solve this problem and for that I made a simple problem by making data points from a parabola and a constraint to the path. My attempt to solve it is
from gekko import GEKKO
import numpy as np
import matplotlib.pyplot as plt
import time
#path data points
x_ref = np.linspace(0, 4, num=21)
y_ref = - np.square(x_ref) + 16
#constraint for visualization purposes
x_bound = np.linspace(0, 4, num=10)
y_bound = 1.5*x_bound + 4
def distfunc(x,y,xref,yref,p):
'''
Shortest distance from (x,y) to (xref, yref)
'''
dtemp = []
for i in range(len(xref)):
d = (x-xref[i])**2+(y-yref[i])**2
dtemp.append(dtemp)
min_id = dtemp.index(min(dtemp))
if min_id == 0:
next_id = min_id+1
elif min_id == len(x_ref):
next_id = min_id-1
else:
d2 = (x-xref[min_id-1])**2+(y-yref[min_id-1])**2
d1 = (x-xref[min_id+1])**2+(y-yref[mid_id+1])**2
d_next = [d2, d1]
next_id = min_id + 2*d_next.index(min(d_next)) - 1
n1 = xref[next_id] - xref[min_id]
n2 = yref[next_id] - yref[min_id]
nnorm = p.sqrt(n1**2+n2**2)
n1 = n1 / nnorm
n2 = n2 / nnorm
difx = x-xref[min_id]
dify = y-yref[min_id]
dot = difx*n1 + dify*n2
deltax = difx - dot*n1
deltay = dify - dot*n2
return deltax**2+deltay**2
v_ref = 3
now = time.time()
p = GEKKO(remote=False)
p.time = np.linspace(0,10,21)
x = p.Var(value=0)
y = p.Var(value=16)
vx = p.Var(value=1)
vy = p.Var(value=0)
ax = p.Var(value=0)
ay = p.Var(value=0)
p.options.IMODE = 6
p.options.SOLVER = 3
p.options.WEB = 0
x_refg = p.Param(value=x_ref)
y_refg = p.Param(value=y_ref)
x_refg = p.Param(value=x_ref)
y_refg = p.Param(value=y_ref)
v_ref = p.Const(value=v_ref)
p.Obj(distfunc(x,y,x_refg,y_refg,p))
p.Obj( (p.sqrt(vx**2+vy**2) - v_ref)**2 + ax**2 + ay**2)
p.Equation(x.dt()==vx)
p.Equation(y.dt()==vy)
p.Equation(vx.dt()==ax)
p.Equation(vy.dt()==ay)
p.Equation(y>=1.5*x+4)
p.solve(disp=False, debug=True)
print(f'run time: {time.time()-now}')
plt.plot(x_ref, y_ref)
plt.plot(x_bound, y_bound)
plt.plot(x1.value,x2.value)
plt.show()
This is the result that I get. As you can see, its not exactly the solution that one should expect. For reference to a solution that you may expect, here is what I get using the cost function below
p.Obj((x-x_refg)**2 + (y-y_refg)**2 + ax**2 + ay**2)
However since what I actually wanted is the shortest distance to a path described by these points I expect the distfunc to be closer to what I want since the shortest distance is most likely to some interpolated point. So my question is twofold:
Is this the correct gekko expression/formulation for the objective function?
My other goal is solution speed so is there a more efficient way of expressing this problem for gekko?
You can't define an objective function that changes based on conditions unless you insert logical conditions that are continuously differentiable such as with the if2 or if3 function. Gekko evaluates the symbolic model once and then passes that off to an executable for solution. It only calls the Python model build once because it is compiling the model to efficient byte-code for execution. You can see the model that you created with p.open_folder(). The model file ends in the apm extension: gk_model0.apm.
Model
Constants
i0 = 3
End Constants
Parameters
p1
p2
p3
p4
End Parameters
Variables
v1 = 0
v2 = 16
v3 = 1
v4 = 0
v5 = 0
v6 = 0
End Variables
Equations
v3=$v1
v4=$v2
v5=$v3
v6=$v4
v2>=(((1.5)*(v1))+4)
minimize (((((v1-0.0)-((((((v1-0.0))*((0.2/sqrt(0.04159999999999994))))+(((v2-16.0))&
*((-0.03999999999999915/sqrt(0.04159999999999994))))))*&
((0.2/sqrt(0.04159999999999994))))))^(2))+((((v2-16.0)&
-((((((v1-0.0))*((0.2/sqrt(0.04159999999999994))))+(((v2-16.0))&
*((-0.03999999999999915/sqrt(0.04159999999999994))))))&
*((-0.03999999999999915/sqrt(0.04159999999999994))))))^(2)))
minimize (((((sqrt((((v3)^(2))+((v4)^(2))))-i0))^(2))+((v5)^(2)))+((v6)^(2)))
End Equations
End Model
One strategy is to split your problem into multiple optimization problems that are all minimal time problems where you navigate to the first way-point and then re-initialize the problem to navigate to the second way-point, and so on. If you want to preserve momentum and anticipate the turning then you'll need to use more advanced methods such as shown in the Pigeon / Eagle tracking problem (see source files) or similar to a trajectory optimization with UAVs or HALE UAVs (see references below).
Martin, R.A., Gates, N., Ning, A., Hedengren, J.D., Dynamic Optimization of High-Altitude Solar Aircraft Trajectories Under Station-Keeping Constraints, Journal of Guidance, Control, and Dynamics, 2018, doi: 10.2514/1.G003737.
Gates, N.S., Moore, K.R., Ning, A., Hedengren, J.D., Combined Trajectory, Propulsion and Battery Mass Optimization for Solar-Regenerative High-Altitude Long Endurance Unmanned Aircraft, AIAA Science and Technology Forum (SciTech), 2019.

Finding center point given distance matrix

I have a matrix (really a loaded image) in which every element is a L2 distance from some unknown center point.
Here is a trivial example
A = [1.4142 1.0000 1.4142 2.2361]
[1.0000 0.0000 1.0000 2.0000]
[1.4142 1.0000 1.4142 2.2361]
In this case, the center is obviously at coordinate (1,1) (index A[1,1] in a 0-indexed matrix or 2D array).
However, in the case where my centers are not constrained to be integer indices, it's no longer as obvious. For example, given this matrix B, where is my center coordinate?
B = [3.0292 1.9612 2.8932 5.8252]
[1.2292 0.1612 1.0932 4.0252]
[1.4292 0.3612 1.2932 4.2252]
How would you find that the answer in this case is at row 1.034 and column 1.4?
I am aware of the trilateration solution (having provided MATLAB code to visualize that in 3D previously), but is there a more efficient way (e.g. one without a matrix inversion)?
This question is sort of language agnostic, as I am looking more for algorithmic help. If you could stick to MATLAB, Python, or C++ though in a solution, that would be great ;-).
While having no experience with similar tasks, i read some stuff and also tried something.
When unfamiliar with this topic it's hard to grasp it seems and all those resources i found are a bit chaotic.
Still unclear in regards to theory for me:
is the problem as stated above a convex-optimization problem (local-minimum = global-minimum; would mean access to powerful solvers!)
there are much more resources about more generic problems (Sensor Network
Localization), which are non-convex and where extremely complex methods have been developed
is your trilateration-approach able to exploit > 3 points (trilateration vs. multilateration; at least this code does not seem like it can which means: bad performance with noise!)
Here some example code with two approaches:
A: Convex-optimization: SOCP-Relaxation
Follows SECOND-ORDER CONE PROGRAMMING RELAXATION OF SENSOR NETWORK LOCALIZATION
Not impressive performance, but should be powerful as approximation for big-data
Guaranteed global-optimum for this relaxation!
Implemented with cvxpy
B: Nonlinear-programming optimization
Implemented using scipy.optimize
Pretty much perfect in my synthetic experiments; even good results in noisy case; despite the fact we are using numerical-differentiation (automatic-diff hard to use here)
Some additional remark:
Your example B surely has some (pretty bad) noise or some other problem in my opinion, as my approaches are completely off; while especially approach B shines for my synthetic-data (at least that's my impression)
Code:
import numpy as np
import cvxpy as cvx
from scipy.spatial.distance import cdist
from scipy.optimize import minimize
np.random.seed(1)
""" Create noise-free (not anymore!) fake-problem """
real_x = np.random.random(size=2) * 3
M, N = 5, 10
NOISE_DISTS = 0.1
pos = np.array([(i,j) for i in range(M) for j in range(N)]) # ugly -> tile/repeat/stack
real_x_stacked = np.vstack([real_x for i in range(pos.shape[0])])
Y = cdist(pos, real_x[np.newaxis])
Y += np.random.normal(size=Y.shape)*NOISE_DISTS # Let's add some noise!
print('-----')
print('PROBLEM')
print('-------')
print('real x: ', real_x)
print('dist mat: ', np.round(Y,3).T)
""" Helper """
def cost(x, Y, pos):
res = np.linalg.norm(pos - x, ord=2, axis=1) - Y.ravel()
return np.linalg.norm(res, 2)
print('cost with real_x (check vs. noisy): ', cost(real_x, Y, pos))
""" SOLVER SOCP """
def solve_socp_relax(pos, Y):
x = cvx.Variable(2)
y = cvx.Variable(pos.shape[0])
fake_stack = [x for i in range(pos.shape[0])] # hacky
objective = cvx.sum_entries(cvx.norm(y - Y))
x_stacked = cvx.reshape(cvx.vstack(*fake_stack), pos.shape[0], 2) # hacky
constraints = [cvx.norm(pos - x_stacked, 2, axis=1) <= y]
problem = cvx.Problem(cvx.Minimize(objective), constraints)
problem.solve(solver=cvx.ECOS, verbose=False)
return x.value.T
""" SOLVER NLP """
def solve_nlp(pos, Y):
sol = minimize(cost, np.zeros(pos.shape[1]), args=(Y, pos), method='BFGS')
# print(sol)
return sol.x
""" TEST """
print('-----')
print('SOLVE')
print('-----')
socp_relax_sol = solve_socp_relax(pos, Y)
print('SOCP RELAX SOL: ', socp_relax_sol)
nlp_sol = solve_nlp(pos, Y)
print('NLP SOL: ', nlp_sol)
Output:
-----
PROBLEM
-------
real x: [ 1.25106601 2.16097348]
dist mat: [[ 2.444 1.599 1.348 1.276 2.399 3.026 4.07 4.973 6.118 6.746
2.143 1.149 0.412 0.766 1.839 2.762 3.851 4.904 5.734 6.958
2.377 1.432 0.856 1.056 1.973 2.843 3.885 4.95 5.818 6.84
2.711 2.015 1.689 1.939 2.426 3.358 4.385 5.22 6.076 6.97
3.422 3.153 2.759 2.81 3.326 4.162 4.734 5.627 6.484 7.336]]
cost with real_x (check vs. noisy): 0.665125233772
-----
SOLVE
-----
SOCP RELAX SOL: [[ 1.95749275 2.00607253]]
NLP SOL: [ 1.23560791 2.16756168]
Edit: Further speedup can be achieved (especially in large-scale) in using nonlinear-least-squares instead of the more general NLP-approach! My results are still the same (as expected if the problem would be convex). Timings between NLP/NLS can look like 9 vs. 0.5 seconds!
This is my recommended method!
def solve_nls(pos, Y):
def res(x, Y, pos):
return np.linalg.norm(pos - x, ord=2, axis=1) - Y.ravel()
sol = least_squares(res, np.zeros(pos.shape[1]), args=(Y, pos), method='lm')
# print(sol)
return sol.x
Especially the second-approach (NLP) will also run for much bigger instances (cvxpy's overhead hurts; that's not a downside of the SOCP-solver which should scale much much better!).
Here some output for M, N = 500, 1000 with some more noise:
-----
PROBLEM
-------
real x: [ 12.51066014 21.6097348 ]
dist mat: [[ 24.706 23.573 23.693 ..., 1090.29 1091.216
1090.817]]
cost with real_x (check vs. noisy): 353.354267797
-----
SOLVE
-----
NLP SOL: [ 12.51082419 21.60911561]
used: 5.9552763315495625 # SECONDS
So in my experiments it works, but i won't give any global-convergence guarantees or reconstruction-guarantees (still missing some theory).
At first i though about using the global optimum of the relaxed-SOCP-problem as initial-point in the NLP-solver, but i did not find any example where this is needed!
Some just-for-fun visuals using:
M, N = 20, 30
NOISE_DISTS = 0.2
...
import matplotlib.pyplot as plt
plt.imshow(Y.reshape(M, N), cmap='viridis', interpolation='none')
plt.colorbar()
plt.scatter(nlp_sol[1], nlp_sol[0], color='red', s=20)
plt.xlim((0, N))
plt.ylim((0, M))
plt.show()
And some super noisy case (nice performance!):
M, N = 50, 100
NOISE_DISTS = 5
-----
PROBLEM
-------
real x: [ 12.51066014 21.6097348 ]
dist mat: [[ 22.329 18.745 27.588 ..., 94.967 80.034 91.206]]
cost with real_x (check vs. noisy): 354.527196716
-----
SOLVE
-----
NLP SOL: [ 12.44158986 21.50164637]
used: 0.01050068340320306
If I understand correctly, you have a matrix A, where A[i,j] holds the distance from (i,j) to some unknown point (y,x). You could find (y,x) like this:
Square each element of A, to make a matrix B say.
We then want to find (y,x) so
(y-i)*(y-i) + (x-j)*(x-j) = B[i,j]
Subtracting each equation from the 0,0 equation and rearranging:
2*i*y + 2*j*x = B[0,0] + i*i + j*j - B[i,j]
This can be solved by linear least squares. Note that since there are 2 unknowns, the matix inversion (better, factorisation) involved will be on a 2x2 matrix and so not time consuming. You could indeed, given just the dimensions of A, work out the required matrix and its inverse analytically.

Why do the trace values have periods of (unwanted) stability?

I have a fairly simple test data set I am trying to fit with pymc3.
The result generated by traceplot looks something like this.
Essentially the trace of all parameter look like there is a standard 'caterpillar' for 100 iterations, followed by a flat line for 750 iterations, followed by the caterpillar again.
The initial 100 iterations happen after 25,000 ADVI iterations, and 10,000 tune iterations. If I change these amounts, I randomly will/won't have these periods of unwanted stability.
I'm wondering if anyone has any advice about how I can stop this from happening - and what is causing it?
Thanks.
The full code is below. In brief, I am generating a set of 'phases' (-pi -> pi) with a corresponding set of values y = a(j)*sin(phase) + b(j)*sin(phase). a and b are drawn for each subject j at random, but are related to each other.
I then essentially try to fit this same model.
Edit: Here is a similar example, running for 25,000 iterations. Something goes wrong around iteration 20,000.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import matplotlib.pyplot as plt
import numpy as np
import pymc3 as pm
%matplotlib inline
np.random.seed(0)
n_draw = 2000
n_tune = 10000
n_init = 25000
init_string = 'advi'
target_accept = 0.95
##
# Generate some test data
# Just generates:
# x a vector of phases
# y a vector corresponding to some sinusoidal function of x
# subject_idx a vector corresponding to which subject x is
#9 Subjects
N_j = 9
#Each with 276 measurements
N_i = 276
sigma_y = 1.0
mean = [0.1, 0.1]
cov = [[0.01, 0], [0, 0.01]] # diagonal covariance
x_sub = np.zeros((N_j,N_i))
y_sub = np.zeros((N_j,N_i))
y_true_sub = np.zeros((N_j,N_i))
ab_sub = np.zeros((N_j,2))
tuning_sub = np.zeros((N_j,1))
sub_ix_sub = np.zeros((N_j,N_i))
for j in range(0,N_j):
aj,bj = np.random.multivariate_normal(mean, cov)
#aj = np.abs(aj)
#bj = np.abs(bj)
xj = np.random.uniform(-1,1,size = (N_i,1))*np.pi
xj = np.sort(xj)#for convenience
yj_true = aj*np.sin(xj) + bj*np.cos(xj)
yj = yj_true + np.random.normal(scale=sigma_y, size=(N_i,1))
x_sub[j,:] = xj.ravel()
y_sub[j,:] = yj.ravel()
y_true_sub[j,:] = yj_true.ravel()
ab_sub[j,:] = [aj,bj]
tuning_sub[j,:] = np.sqrt(((aj**2)+(bj**2)))
sub_ix_sub[j,:] = [j]*N_i
x = np.ravel(x_sub)
y = np.ravel(y_sub)
subject_idx = np.ravel(sub_ix_sub)
subject_idx = np.asarray(subject_idx, dtype=int)
##
# Fit model
hb1_model = pm.Model()
with hb1_model:
# Hyperpriors
hb1_mu_a = pm.Normal('hb1_mu_a', mu=0., sd=100)
hb1_sigma_a = pm.HalfCauchy('hb1_sigma_a', 4)
hb1_mu_b = pm.Normal('hb1_mu_b', mu=0., sd=100)
hb1_sigma_b = pm.HalfCauchy('hb1_sigma_b', 4)
# We fit a mixture of a sine and cosine with these two coeffieicents
# allowed to be different for each subject
hb1_aj = pm.Normal('hb1_aj', mu=hb1_mu_a, sd=hb1_sigma_a, shape = N_j)
hb1_bj = pm.Normal('hb1_bj', mu=hb1_mu_b, sd=hb1_sigma_b, shape = N_j)
# Model error
hb1_eps = pm.HalfCauchy('hb1_eps', 5)
hb1_linear = hb1_aj[subject_idx]*pm.math.sin(x) + hb1_bj[subject_idx]*pm.math.cos(x)
hb1_linear_like = pm.Normal('y', mu = hb1_linear, sd=hb1_eps, observed=y)
with hb1_model:
hb1_trace = pm.sample(draws=n_draw, tune = n_tune,
init = init_string, n_init = n_init,
target_accept = target_accept)
pm.traceplot(hb1_trace)
To partially answer my own question: After playing with this for a while, it looks like the problem might be due to the hyperprior standard deviation going to 0. I am not sure why the algorithm should get stuck there though (testing a small standard deviation can't be that uncommon...).
In any case, two solutions that seem to alleviate the problem (although they don't remove it entirely) are:
1) Add an offset to the definitions of the standard deviation. e.g.:
offset = 1e-2
hb1_sigma_a = offset + pm.HalfCauchy('hb1_sigma_a', 4)
2) Instead of using a HalfCauchy or HalfNormal for the SD prior, use a logNormal distribution set so that 0 is unlikely.
I'd look at the divergencies, as explained in notes and literature on Hamiltonian Monte Carlo, see, e.g., here and here.
with model:
np.savetxt('diverging.csv', hb1_trace['diverging'])
As a dirty solution, you can try to increase target_accept, perhaps.
Good luck!

What does it mean when train and validation loss diverge from epoch 1?

I was recently working on a deep learning model in Keras and it gave me very perplexing results. The model is capable of mastering the training data over time, but it consistently gets worse results on the validation data.
I know that if the validation accuracy goes up for a while and then starts to decrease that you are over-fitting to the training data, but in this case, the validation accuracy only ever decreases. I am really confused why this happens. Does anyone have any intuition as to what could cause this to happen? Or any suggestions on things to test to potentially fix it?
Edit to add more info and code
Ok. So I am making a model that is trying to do some basic stock predictions. By looking at the open, high, low, close, and volume of the last 40 days, the model tries to predict whether or not the price will go up two average true ranges without going down one average true range. As input, I took CSVs from Yahoo Finance that include this information for the last 30 years for all of the stocks in the Dow Jones Industrial Average. The model trains on 70% of the stocks and validates on the other 20%. This leads to about 150,000 training samples. I am currently using a 1d Convolutional Neural Network, but I have also tried other smaller models (logistic regression and small Feed Forward NN) and I always get the same either diverging train and validation loss or nothing learned at all because the model is too simple.
Here is the code:
import numpy as np
from sklearn import preprocessing
from sklearn.metrics import auc, roc_curve, roc_auc_score
from keras.layers import Input, Dense, Flatten, Conv1D, Activation, MaxPooling1D, Dropout, Concatenate
from keras.models import Model
from keras.callbacks import ModelCheckpoint, EarlyStopping, Callback
from keras import backend as K
import matplotlib.pyplot as plt
from random import seed, shuffle
from os import listdir
class roc_auc(Callback):
def on_train_begin(self, logs={}):
self.aucs = []
def on_train_end(self, logs={}):
return
def on_epoch_begin(self, epoch, logs={}):
return
def on_epoch_end(self, epoch, logs={}):
y_pred = self.model.predict(self.validation_data[0])
self.aucs.append(roc_auc_score(self.validation_data[1], y_pred))
if max(self.aucs) == self.aucs[-1]:
model.save_weights("weights.roc_auc.hdf5")
print(" - auc: %0.4f" % self.aucs[-1])
return
def on_batch_begin(self, batch, logs={}):
return
def on_batch_end(self, batch, logs={}):
return
rrr = 2
epochs = 200
batch_size = 64
days_input = 40
seed(42)
X_train = []
X_test = []
y_train = []
y_test = []
files = listdir("Stocks")
total_stocks = len(files)
shuffle(files)
for x, file in enumerate(files):
test = False
if (x+1.0)/total_stocks > 0.7:
test = True
if test:
print("Test -> Stocks/%s" % file)
else:
print("Train -> Stocks/%s" % file)
stock = np.loadtxt(open("Stocks/"+file, "r"), delimiter=",", skiprows=1, usecols = (1,2,3,5,6))
atr = []
last = None
for day in stock:
if last is None:
tr = abs(day[1] - day[2])
atr.append(tr)
else:
tr = max(day[1] - day[2], abs(last[3] - day[1]), abs(last[3] - day[2]))
atr.append((13*atr[-1]+tr)/14)
last = day.copy()
stock = np.insert(stock, 5, atr, axis=1)
for i in range(days_input,stock.shape[0]-1):
input = stock[i-days_input:i, 0:5].copy()
for j, day in enumerate(input):
input[j][1] = (day[1]-day[0])/day[0]
input[j][2] = (day[2]-day[0])/day[0]
input[j][3] = (day[3]-day[0])/day[0]
input[:,0] = input[:,0] / np.linalg.norm(input[:,0])
input[:,1] = input[:,1] / np.linalg.norm(input[:,1])
input[:,2] = input[:,2] / np.linalg.norm(input[:,2])
input[:,3] = input[:,3] / np.linalg.norm(input[:,3])
input[:,4] = input[:,4] / np.linalg.norm(input[:,4])
preprocessing.scale(input, copy=False)
output = -1
buy = stock[i][1]
stoploss = buy - stock[i][5]
target = buy + rrr*stock[i][5]
for j in range(i+1, stock.shape[0]):
if stock[j][0] < stoploss or stock[j][2] < stoploss:
output = 0
break
elif stock[j][1] > target:
output = 1
break
if output != -1:
if test:
X_test.append(input)
y_test.append(output)
else:
X_train.append(input)
y_train.append(output)
shape = list(X_train[0].shape)
shape[:0] = [len(X_train)]
X_train = np.concatenate(X_train).reshape(shape)
y_train = np.array(y_train)
shape = list(X_test[0].shape)
shape[:0] = [len(X_test)]
X_test = np.concatenate(X_test).reshape(shape)
y_test = np.array(y_test)
print("Train class split is %0.2f" % (100*np.average(y_train)))
print("Test class split is %0.2f" % (100*np.average(y_test)))
inputs = Input(shape=(days_input,5))
x = Conv1D(32, 5, padding='same')(inputs)
x = Activation('relu')(x)
x = MaxPooling1D()(x)
x = Conv1D(64, 5, padding='same')(x)
x = Activation('relu')(x)
x = MaxPooling1D()(x)
x = Conv1D(128, 5, padding='same')(x)
x = Activation('relu')(x)
x = MaxPooling1D()(x)
x = Flatten()(x)
x = Dense(128, activation="relu")(x)
x = Dense(64, activation="relu")(x)
output = Dense(1, activation="sigmoid")(x)
model = Model(inputs=inputs,outputs=output)
model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy'])
filepath="weights.best.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=0, save_best_only=True, mode='max')
auc_hist = roc_auc()
callbacks_list = [checkpoint, auc_hist]
history = model.fit(X_train, y_train, validation_data=(X_test,y_test) , epochs=epochs, callbacks=callbacks_list, batch_size=batch_size, class_weight ='balanced').history
model_json = model.to_json()
with open("model.json", "w") as json_file:
json_file.write(model_json)
model.save_weights("weights.latest.hdf5")
model.load_weights("weights.roc_auc.hdf5")
plt.plot(history['acc'])
plt.plot(history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
plt.plot(history['loss'])
plt.plot(history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
plt.plot(auc_hist.aucs)
plt.title('model ROC AUC')
plt.ylabel('AUC')
plt.xlabel('epoch')
plt.show()
y_pred = model.predict(X_train)
fpr, tpr, _ = roc_curve(y_train, y_pred)
roc_auc = auc(fpr, tpr)
plt.subplot(1, 2, 1)
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy',linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Train ROC')
plt.legend(loc="lower right")
y_pred = model.predict(X_test)
fpr, tpr, thresholds = roc_curve(y_test, y_pred)
roc_auc = auc(fpr, tpr)
plt.subplot(1, 2, 2)
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy',linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Test ROC')
plt.legend(loc="lower right")
plt.show()
with open('roc.csv','w+') as file:
for i in range(len(thresholds)):
file.write("%f,%f,%f\n" % (fpr[i], tpr[i], thresholds[i]))
Results by 100 batches instead of by epoch
I listened to suggestions and made a few updates. The classes are now balanced 50% to 50% instead of 25% to 75%. Also, the validation data is randomly selected now instead of being a specific set of stocks. By graphing the loss and accuracy at a finer resolution(100 batches vs 1 epoch), the over-fitting can clearly be seen. The model does actually start to learn at the very beginning before it starts to diverge. I am surprised at how fast it starts to over-fit, but now that I can see the issue hopefully I can debug it.
Possible explanations
Coding error
Overfitting due to differences in the training / validation data
Skewed classes (and differences in the training / validation data)
Things I would try
Swapping the training and the validation set. Does the problem still occur?
Plot the curves in more detail for the first ~10 epochs (e.g. directly after initialization; each few training iterations, not only per epoch). Do you still start at > 75%? Then your classes might be skewed and you might also want to check if your training-validation split is stratified.
Code
This is useless: np.concatenate(X_train)
Make your code as readable as possible when you post it here. This includes removing lines which are commented out.
This looks suspicious for a coding error to me:
if test:
X_test.append(input)
y_test.append(output)
else:
#if((output == 0 and np.average(y_train) > 0.5) or output == 1):
X_train.append(input)
y_train.append(output)
use sklearn.model_selection.train_test_split instead. Do all transformations to the data before, then make the split with this method.
Looks like the batch size is much too small for the number of training samples you have. Try batching 20% and see if that makes a difference.

PyMC: sampling step by step?

I would like to know why the sampler is incredibly slow when sampling step by step.
For example, if I run:
mcmc = MCMC(model)
mcmc.sample(1000)
the sampling is fast. However, if I run:
mcmc = MCMC(model)
for i in arange(1000):
mcmc.sample(1)
the sampling is slower (and the more it samples, the slower it is).
If you are wondering why I am asking this.. well, I need a step by step sampling because I want to perform some operations on the values of the variables after each step of the sampler.
Is there a way to speed it up?
Thank you in advance!
------------------ EDIT -------------------------------------------------------------
Here I present the specific problem in more details:
I have two models in competition and they are part of a bigger model that has a categorical variable functioning as a 'switch' between the two.
In this toy example, I have the observed vector 'Y', that could be explained by a Poisson or a Geometric distribution. The Categorical variable 'switch_model' selects the Geometric model when = 0 and the Poisson model when =1.
After each sample, if switch_model selects the Geometric model, I want the variables of the Poisson model NOT to be updated, because they are not influencing the likelihood and therefore they are just drifting away. The opposite is true if the switch_model selects the Poisson model.
Basically what I do at each step is to 'change' the value of the non-selected model by bringing it manually one step back.
I hope that my explanation and the commented code will be clear enough. Let me know if you need further details.
import numpy as np
import pymc as pm
import pandas as pd
import matplotlib.pyplot as plt
# OBSERVED VALUES
Y = np.array([0, 1, 2, 3, 8])
# PRIOR ON THE MODELS
pi = (0.5, 0.5)
switch_model = pm.Categorical("switch_model", p = pi)
# switch_model = 0 for Geometric, switch_model = 1 for Poisson
p = pm.Uniform('p', lower = 0, upper = 1) # Prior of the parameter of the geometric distribution
mu = pm.Uniform('mu', lower = 0, upper = 10) # Prior of the parameter of the Poisson distribution
# LIKELIHOOD
#pm.observed
def Ylike(value = Y, mu = mu, p = p, M = switch_model):
if M == 0:
out = pm.geometric_like(value+1, p)
elif M == 1:
out = pm.poisson_like(value, mu)
return out
model = pm.Model([Ylike, p, mu, switch_model])
mcmc = pm.MCMC(model)
n_samples = 5000
traces = {}
for var in mcmc.stochastics:
traces[str(var)] = np.zeros(n_samples)
bar = pm.progressbar.progress_bar(n_samples)
bar.update(0)
mcmc.sample(1, progress_bar=False)
for var in mcmc.stochastics:
traces[str(var)][0] = mcmc.trace(var)[-1]
for i in np.arange(1,n_samples):
mcmc.sample(1, progress_bar=False)
bar.update(i)
for var in mcmc.stochastics:
traces[str(var)][i] = mcmc.trace(var)[-1]
if mcmc.trace('switch_model')[-1] == 0: # Gemetric wins
traces['mu'][i] = traces['mu'][i-1] # One step back for the sampler of the Poisson parameter
mu.value = traces['mu'][i-1]
elif mcmc.trace('switch_model')[-1] == 1: # Poisson wins
traces['p'][i] = traces['p'][i-1] # One step back for the sampler of the Geometric parameter
p.value = traces['p'][i-1]
print '\n\n'
traces=pd.DataFrame(traces)
traces['mu'][traces['switch_model'] == 0] = np.nan
traces['p'][traces['switch_model'] == 1] = np.nan
print traces.describe()
traces.plot()
plt.show()
The reason this is so slow is that Python's for loops are pretty slow, especially when they are compared to FORTRAN loops (Which is what PyMC is written in basically.) If you could show more detailed code, it might be easier to see what you are trying to do and to provide faster alternative algorithms.
Actually I found a 'crazy' solution, and I have the suspect to know why it works. I would still like to get an expert opinion on my trick.
Basically if I modify the for loop in the following way, adding a 'reset of the mcmc' every 1000 loops, the sampling fires up again:
for i in np.arange(1,n_samples):
mcmc.sample(1, progress_bar=False)
bar.update(i)
for var in mcmc.stochastics:
traces[str(var)][i] = mcmc.trace(var)[-1]
if mcmc.trace('switch_model')[-1] == 0: # Gemetric wins
traces['mu'][i] = traces['mu'][i-1] # One step back for the sampler of the Poisson parameter
mu.value = traces['mu'][i-1]
elif mcmc.trace('switch_model')[-1] == 1: # Poisson wins
traces['p'][i] = traces['p'][i-1] # One step back for the sampler of the Geometric parameter
p.value = traces['p'][i-1]
if i%1000 == 0:
mcmc = pm.MCMC(model)
In practice this trick erases the traces and the database of the sampler every 1000 steps. It looks like the sampler does not like having a long database, although I do not really understand why. (of course 1000 steps is arbitrary, too short it adds too much overhead, too long it will cause the traces and database to be too long).
I find this hack a bit crazy and definitely not elegant.. does any of the experts or developers have a comment on it? Thank you!

Resources