GEKKO: Array size as a model variable - gekko

I'm quite new to Gekko. Is it possible to vary the size of a model array as part of an optimization? I am running a simple problem where various numbers of torsional springs engage at different angles, and I would like to allow the model to change the number of engagement angles. Each spring has several component variables, which I am also attempting to define as arrays of variables. However, the size definition of the array theta_engage, below, has not accepted int(n_engage.value). I get the following error:
TypeError: int() argument must be a string, a bytes-like object or a number, not 'GK_Value'
Relevant code:
n_engage = m.Var(2, lb=1, ub=10, integer=True)
theta_engage = m.Array(m.Var, (int(n_engage.value)))
theta_engage[0].value = 0.0
theta_engage[0].lower = 0.0
theta_engage[0].upper = 85.0
theta_engage[1].value = 15.0
theta_engage[1].lower = 0.0
theta_engage[1].upper = 85.0
If I try to define the size of theta_engage only by n_engage.value, I get this error:
TypeError: expected sequence object with len >= 0 or a single integer
I suppose I could define the array at the maximum size I am willing to accept and allow the number of springs to have a lower bound of 0, but I would have to enforce a minimum number of total springs somehow in the constraints. If Gekko is capable of varying the size of the arrays this way it seems to me the more elegant solution.
Any help is much appreciated.

The problem structure can't be changed iteration-to-iteration. However, it is easy to define a binary variable b that either activates or deactivates those parts of the model that should be included or excluded.
from gekko import GEKKO
import numpy as np
m = GEKKO()
# number of springs
n = 10
# number of engaged springs (1-10)
nb = m.Var(2, lb=1, ub=n, integer=True)
# engaged springs (binary, 0-1)
b = m.Array(m.Var,n,lb=0,ub=1,integer=True)
# angle of engaged springs
θ = m.Array(m.Param,n,lb=0,ub=85)
# initialize values
t0 = [0,15,20,25,30,15,30,25,10,50]
for i,ti in enumerate(t0):
    θ[i].value = ti
# contributing spring forces
F = [m.Intermediate(b[i]*m.cos((np.pi/180.0)*θ[i])) \
     for i in range(10)]
# force constraint
m.Equation(m.sum(F)>=3)
# engaged springs
m.Equation(nb==m.sum(b))
# minimize engaged springs
m.Minimize(nb)
# optimize with APOPT solver
m.options.SOLVER=1
m.solve()
# print solution
print(b)
This gives a solution in 0.079 sec that springs 1, 3, 9, and 10 should be engaged. It selects the minimum number of springs (4) to achieve the required force that is equivalent to 3 springs at 0 angle.
Successful solution
---------------------------------------------------
Solver : APOPT (v1.0)
Solution time : 7.959999999729916E-002 sec
Objective : 4.00000000000000
Successful solution
---------------------------------------------------
[[1.0] [0.0] [1.0] [0.0] [0.0] [0.0] [0.0] [0.0] [1.0] [1.0]]
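For a problem this small, the solver's answer can be cross-checked by brute force. The sketch below uses only the Python standard library (it is not part of the Gekko model) and enumerates subsets of the ten angles from t0, smallest subsets first, until the summed cosines reach the required force of 3. The helper name min_engaged is made up for illustration.

```python
from itertools import combinations
from math import cos, radians

# spring angles in degrees, copied from t0 in the Gekko model
t0 = [0, 15, 20, 25, 30, 15, 30, 25, 10, 50]

def min_engaged(angles, required):
    """Return the smallest tuple of spring indices (0-based) whose
    summed cos(angle) is at least `required`, or None if impossible."""
    for size in range(1, len(angles) + 1):
        for subset in combinations(range(len(angles)), size):
            if sum(cos(radians(angles[i])) for i in subset) >= required:
                return subset
    return None

best = min_engaged(t0, 3)
print(len(best), best)
```

The enumeration also finds that four springs are needed. The subset it prints can differ from the solver's, because several four-spring combinations satisfy the force constraint.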

Related

Unexpected results from sum using gekko variable

I am optimizing a simple problem where I am summing intermediate variables for a constraint where the sum needs to be lower than a certain budget.
When I print the sum, using either sum or np.sum, I get the following result: (((((((((((((((((((((((((((((i429+i430)+i431)+i432)+i433)+i434)+i435)+i436)+i437)+i438)+i439)+i440)+i441)+i442)+i443)+i444)+i445)+i446)+i447)+i448)+i449)+i450)+i451)+i452)+i453)+i454)+i455)+i456)+i457)+i458)
Here is the command to create the variables and the sum.
x = m.Array(m.Var, (len(bounds)),integer=True)
sums = [m.Intermediate(objective_inverse2(x,y)) for x,y in zip(x,reg_feats)]
My understanding of the intermediate variable is a variable which is dynamically calculated based on the value of x, which are decision variables.
Here is the summing function for the max budget constraint.
m.Equation(np.sum(sums) < max_budget)
Solving the problem returns an error saying there is no feasible solution, even though trivial solutions exist. Furthermore, removing this constraint returns a solution that does not violate the max budget constraint.
What am I misunderstanding about intermediate variables and how to sum them?
It is difficult to diagnose the problem without a complete, minimal problem. Here is an attempt to recreate the problem:
from gekko import GEKKO
import numpy as np
m = GEKKO()
nb = 5
x = m.Array(m.Var,nb,value=1,lb=0,ub=1,integer=True)
y = m.Array(m.Var,nb,lb=0)
i = [] # intermediate list
for xi,yi in zip(x,y):
    i.append(m.Intermediate(xi*yi))
m.Maximize(m.sum(i))
m.Equation(m.sum(i)<=100)
m.options.SOLVER = 1
m.solve()
print(x)
print(y)
Instead of creating a list of Intermediates, the summation can also happen with the result of the list comprehension. This way, only one Intermediate value is created.
from gekko import GEKKO
import numpy as np
m = GEKKO()
nb = 5
x = m.Array(m.Var,nb,value=1,lb=0,ub=1,integer=True)
y = m.Array(m.Var,nb,lb=0)
sums = m.Intermediate(m.sum([xi*yi for xi,yi in zip(x,y)]))
m.Maximize(sums)
m.Equation(sums<=100)
m.options.SOLVER = 1
m.solve()
print(sums.value)
print(x)
print(y)
In both cases, the optimal solution is:
---------------------------------------------------
Solver : APOPT (v1.0)
Solution time : 1.560000001336448E-002 sec
Objective : -100.000000000000
Successful solution
---------------------------------------------------
[100.0]
[[1.0] [1.0] [1.0] [1.0] [1.0]]
[[20.0] [20.0] [20.0] [20.0] [20.0]]
Try using the Gekko m.sum() function to improve solution efficiency, especially for large problems.
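The long parenthesised expression printed in the question is what left-to-right addition of symbolic objects looks like. The toy class below is not Gekko's implementation; Sym and its behaviour are assumptions made purely to illustrate how adding one term at a time builds a nested expression string instead of a number.

```python
from functools import reduce
import operator

class Sym:
    """Toy symbolic term that records additions as a string."""
    def __init__(self, name):
        self.name = name

    def __add__(self, other):
        # each addition wraps the two operands in one more pair of parentheses
        return Sym(f"({self.name}+{other.name})")

    def __str__(self):
        return self.name

terms = [Sym(f"i{k}") for k in range(1, 5)]
# add left to right, one term at a time
total = reduce(operator.add, terms)
print(total)  # (((i1+i2)+i3)+i4)
```

Seen this way, the printout in the question is the symbolic form of the constraint's left-hand side, not a numeric total.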

Initialization of Weighted Reservoir Sampling (A-Chao implementation)

I am trying to implement A-Chao version of weighted reservoir sampling as shown in https://en.wikipedia.org/wiki/Reservoir_sampling#Algorithm_A-Chao
But I found that the pseudocode described on the wiki seems to be wrong, especially in the initialization part. I read the paper; it mentions that we need to handle over-weighted data points, but I still don't see how to initialize correctly.
In my understanding, at the initialization step we want to make sure that all initial data points have the same probability*weight of being chosen. However, I don't understand how the over-weighted points relate to that.
Here is the code I implemented according to the wiki; the results show it is incorrect.
const reservoirSampling = <T>(dataList: T[], k: number, getWeight: (point: T) => number): T[] => {
  const sampledList = dataList.slice(0, k);
  let currentWeightSum: number = sampledList.reduce((sum, item) => sum + getWeight(item), 0);
  for (let i = k; i < dataList.length; i++) {
    const currentItem = dataList[i];
    currentWeightSum += getWeight(currentItem);
    const probOfChoosingCurrentItem = getWeight(currentItem) / currentWeightSum;
    const rand = Math.random();
    if (rand <= probOfChoosingCurrentItem) {
      sampledList[getRandomInt(0, k - 1)] = currentItem;
    }
  }
  return sampledList;
};
The best way to get the distribution that Chao's algorithm produces is to implement VarOptk sampling as in the pseudocode labeled Algorithm 1 from the paper that introduced VarOptk sampling by Cohen et al.
That's an arXiv link and hence very stable, but to summarize, the idea is to separate the items into "heavy" (weight high enough to guarantee inclusion in the sample so far) and "light" (the others). Keep the heavy items in a priority queue where it is easy to remove the lightest of them. When a new item comes in, we have to determine whether it is heavy or light, and which heavy items became light (if any). Then there's a sampling procedure for dropping an item that treats the heavy → light items specially using weighted sampling and then falls back to choosing a uniform random light item (as in the easy case of Chao's algorithm).
The one trick with the pseudocode is that, if you use floating-point arithmetic, you have to be a little careful about "impossible" cases. Post your finished code on Code Review and ping me here if you would like feedback.
You will find a Python implementation of Chao's strategy below. Here is a plot of 10000 samples from 0,..,99 with weights indicated by the yellow lines. The y-coordinate denotes how many times a given item was sampled.
I first implemented the pseudocode on Wikipedia, and agree completely with the OP that it is dead wrong. It then took me more than a day to understand Chao's paper. I also found the section of Tillé's book on Chao's method (see Algorithm 6.14 on page 120) helpful. (I don't know what the OP means by the issues with initialization.)
Disclaimer: I am new to Python, and just tried to do my best. I think posting code might be more helpful than posting pseudocode. (Mainly I want to save someone a day's work getting to the bottom of Chao's paper!) If you do end up using this, I'd appreciate any feedback. Standard health warnings apply!
First, Chao's computation of inclusion probabilities:
import numpy as np
import random
def compute_Chao_probs(weights, total_weight, sample_size):
    """
    Consider a weighted population, some of its members, and their weights.
    This function returns a list of probabilities that these members are selected
    in a weighted sample of sample_size members of the population.
    Example 1: If all weights are equal, this probability is sample_size /(size of population).
    Example 2: If the size of our population is sample_size then these probabilities are all 1.
    Naively we expect these probabilities to be given by sample_size*weight/total_weight, however
    this may lead to a probability greater than 1. For example, consider a population
    of 3 with weights [3,1,1], and suppose we want to select 2 elements. The naive
    probability of selecting the first element is 2*3/5 > 1.
    We follow Chao's description: compute naive guess, set any probs which are bigger
    than 1 to 1, rinse and repeat.
    We expect to call this routine many times, so we avoid for loops, and try to make numpy do the work.
    """
    assert all(w > 0 for w in weights), "weights must be strictly positive."
    # heavy_items is a True / False array with the same length as weights.
    # True indicates items deemed "heavy" (i.e. assigned probability 1)
    # At the outset, no items are heavy:
    heavy_items = np.zeros(len(weights),dtype=bool)
    while True:
        new_probs = (sample_size - np.sum(heavy_items))/(total_weight - np.sum(heavy_items*weights))*weights
        valid_probs = np.less_equal(np.logical_not(heavy_items) * new_probs, np.ones((len(weights))))
        if all(valid_probs): # we are done
            return np.logical_not(heavy_items)*new_probs + heavy_items
        else: # we need to declare some more items heavy
            heavy_items = np.logical_or(heavy_items, np.logical_not(valid_probs))
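As a quick check of the [3,1,1] example from the docstring, here is a pure Python restatement of the same cap-and-rescale iteration. It is a simplified sketch, not the routine above: the name inclusion_probs is hypothetical, it assumes the given weights are the whole population (so the total weight is just their sum), and it assumes sample_size is no larger than the number of items.

```python
def inclusion_probs(weights, sample_size):
    """Scale the weights so the probabilities sum to sample_size,
    cap any probability above 1 at 1, and repeat on the rest."""
    total = sum(weights)
    heavy = [False] * len(weights)
    while True:
        n_heavy = sum(heavy)
        heavy_weight = sum(w for w, h in zip(weights, heavy) if h)
        # light items share the remaining sample slots in proportion to weight
        scale = (sample_size - n_heavy) / (total - heavy_weight)
        probs = [1.0 if h else scale * w for w, h in zip(weights, heavy)]
        too_big = [p > 1.0 for p in probs]
        if not any(too_big):
            return probs
        # declare the offending items heavy and recompute
        heavy = [h or b for h, b in zip(heavy, too_big)]

print(inclusion_probs([3, 1, 1], 2))  # [1.0, 0.5, 0.5]
```

The first pass gives the naive 1.2 for the heavy item, which is then fixed at 1; the remaining single slot is split evenly between the two light items.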
Then Chao's rejection rule:
def update_sample(current_sample, new_item, new_weight):
    """
    We have a weighted population, from which we have selected n items.
    We know their weights, the total_weight of the population, and the
    probability of their inclusion in the sample when we selected them.
    Now new_item arrives, with a new_weight. Should we take it or not?
    current_sample is a dictionary, with keys 'items', 'weights', 'probs'
    and 'total_weight'. This function updates current_sample according to
    Chao's recipe.
    """
    items = current_sample['items']
    weights = current_sample['weights']
    probs = current_sample['probs']
    total_weight = current_sample['total_weight']
    assert len(items) == len(weights) and len(weights) == len(probs)
    fixed_sample_size = len(weights)
    total_weight = total_weight + new_weight
    new_Chao_probs = compute_Chao_probs(np.hstack((weights,[new_weight])),total_weight,fixed_sample_size)
    if random.random() <= new_Chao_probs[-1]: # we should take new_item
        #
        # Now we need to decide which element should be replaced.
        # Fix an index i in items, and let P denote probability. We have:
        # P(i is selected in previous step) = probs[i]
        # P(i is selected at current step) = new_Chao_probs[i]
        # Hence (by law of conditional probability)
        # P(i is selected at current step | i is selected at previous step) = new_Chao_probs[i] / probs[i]
        # Thus:
        # P(i is not selected at current step | i is selected at previous step) = 1 - new_Chao_probs[i] / probs[i]
        # Now if we condition this on the assumption that the new element is taken, we get
        # 1/new_Chao_probs[-1]*(1 - new_Chao_probs[i] / probs[i]).
        #
        # (*I think* this is what Chao is talking about in the two paragraphs just before Section 3 in his paper.)
        rejection_weights = 1/new_Chao_probs[-1]*(np.ones((fixed_sample_size)) - (new_Chao_probs[0:-1]/probs))
        # assert np.isclose(np.sum(rejection_weights),1)
        # In examples we see that np.sum(rejection_weights) is not necessarily 1.
        # I am a little confused by this, but ignore it for the moment.
        rejected_index = random.choices(range(fixed_sample_size), rejection_weights)[0]
        #make the changes:
        current_sample['items'][rejected_index] = new_item
        current_sample['weights'][rejected_index] = new_weight
        current_sample['probs'] = new_Chao_probs[0:-1]
        current_sample['probs'][rejected_index] = new_Chao_probs[-1]
    current_sample['total_weight'] = total_weight
Finally, code to test and plot:
# Now we test Chao on some different distributions.
#
# This also illustrates how to use update_sample.
#
from collections import Counter
import matplotlib.pyplot as plt
n = 10 # number of samples
items_in = list(range(100))
weights_in = [random.random() for _ in range(10)]
# other possible tests:
weights_in = [i+1 for i in range(10)] # staircase
#weights_in = [9-i+1 for i in range(10)] # upside down staircase
#weights_in = [(i+1)**2 for i in range(10)] # parabola
#weights_in = [10**i for i in range(10)] # a very heavy tailed distribution (to check numerical stability)
random.shuffle(weights_in) # sometimes it is fun to shuffle
weights_in = np.array([w for w in weights_in for _ in range(10)])
count = Counter({})
for j in range(10000):
    # we take the first n with probability 1:
    current_sample = {}
    current_sample['items'] = items_in[:n]
    current_sample['weights'] = np.array(weights_in[:n])
    current_sample['probs'] = np.ones((n))
    current_sample['total_weight'] = np.sum(current_sample['weights'])
    for i in range(n,len(items_in)):
        update_sample(current_sample, items_in[i], weights_in[i])
    count.update(current_sample['items'])
plt.figure(figsize=(20,10))
plt.plot(100000*np.array(weights_in)/np.sum(weights_in), 'yo')
plt.plot(list(count.keys()), list(count.values()), 'ro')
plt.show()

Q learning - epsilon greedy update

I am trying to understand the epsilon - greedy method in DQN. I am learning from the code available in https://github.com/karpathy/convnetjs/blob/master/build/deepqlearn.js
Following is the update rule for epsilon which changes with age as below:
this.epsilon = Math.min(1.0, Math.max(this.epsilon_min, 1.0-(this.age - this.learning_steps_burnin)/(this.learning_steps_total - this.learning_steps_burnin)));
Does this mean that epsilon starts at the minimum value (chosen by the user) and then increases with age up to the burn-in steps, eventually reaching 1? Or does epsilon start around 1 and then decay to epsilon_min?
Either way, learning almost stops after this process. So do we need to choose learning_steps_burnin and learning_steps_total carefully? Any thoughts on what values should be chosen?
Since epsilon denotes the amount of randomness in your policy (action is greedy with probability 1-epsilon and random with probability epsilon), you want to start with a fairly randomized policy and later slowly move towards a deterministic policy. Therefore, you usually start with a large epsilon (like 0.9, or 1.0 in your code) and decay it to a small value (like 0.1). Most common and simple approaches are linear decay and exponential decay. Usually, you have an idea of how many learning steps you will perform (what in your code is called learning_steps_total) and tune the decay factor (your learning_steps_burnin) such that in this interval epsilon goes from 0.9 to 0.1.
Your code is an example of linear decay.
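Transcribing the quoted rule into Python makes the direction of the schedule easy to check. In this sketch the parameter names are shortened versions of the ones in the question, and the numeric settings are hypothetical values chosen only for illustration.

```python
def linear_epsilon(age, epsilon_min, burnin, total):
    """Python transcription of the update rule quoted in the question."""
    fraction = (age - burnin) / (total - burnin)
    return min(1.0, max(epsilon_min, 1.0 - fraction))

# hypothetical settings, chosen only for illustration
eps_min, burnin, total = 0.05, 1000, 11000
for age in (0, 1000, 6000, 11000, 20000):
    print(age, linear_epsilon(age, eps_min, burnin, total))
```

Epsilon stays at 1.0 until the burn-in period ends, falls linearly after that, and is clamped at epsilon_min once age approaches learning_steps_total.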
An example of exponential decay is
epsilon = 0.9
decay = 0.9999
min_epsilon = 0.1
for i from 1 to n
    epsilon = max(min_epsilon, epsilon*decay)
Personally, I recommend an epsilon decay such that you reach the minimum value of epsilon (I'd advise between 0.05 and 0.0025) after about 50-75% of training; from then on, only the policy itself improves.
I wrote a small script for trying out the parameters; it reports the step at which the decay stops (that is, when epsilon reaches the chosen minimum).
import matplotlib.pyplot as plt
import numpy as np
eps_start = 1.0
eps_min = 0.05
eps_decay = 0.9994
epochs = 10000
pct = 0
df = np.zeros(epochs)
for i in range(epochs):
    if i == 0:
        df[i] = eps_start
    else:
        df[i] = df[i-1] * eps_decay
    if df[i] <= eps_min:
        print(i)
        stop = i
        break
print("With this parameter you will stop epsilon decay after {}% of training".format(stop/epochs*100))
plt.plot(df)
plt.show()
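For a pure exponential schedule the stopping step can also be computed without a loop, since eps_start * eps_decay**i <= eps_min is equivalent to i >= log(eps_min/eps_start) / log(eps_decay). A minimal sketch with the same parameter values as the script; the function name steps_to_min is made up here.

```python
import math

def steps_to_min(eps_start, eps_min, eps_decay):
    """Smallest step count i with eps_start * eps_decay**i <= eps_min."""
    return math.ceil(math.log(eps_min / eps_start) / math.log(eps_decay))

eps_start, eps_min, eps_decay, epochs = 1.0, 0.05, 0.9994, 10000
steps = steps_to_min(eps_start, eps_min, eps_decay)
# step at which the minimum is reached, and its share of training in percent
print(steps, 100 * steps / epochs)
```

With these values the minimum is reached a little before the halfway point of training. Solving the same relation for eps_decay gives the decay factor needed to hit a chosen fraction of the epochs.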

KALMAN filter doesn't respond to changes

I am implementing a Kalman filter for the first time to get voltage values from a source. It works and settles at the source voltage, but if the source voltage then changes, the filter doesn't adapt to the new value.
I use 3 steps:
Get the Kalman gain
KG = previous_error_in_estimate / ( previous_error_in_estimate + Error_in_measurement )
Get current estimation
Estimation = previous_estimation + KG*[measurement - previous_estimation]
Calculate the error in estimate
Error_in_estimate = [1-KG]*previous_error_in_estimate
The thing is that, as 0 <= KG <= 1, Error_in_estimate decreases more and more, which makes KG decrease more and more as well (Error_in_measurement is a constant). In the end the estimate depends only on the previous estimate, and the current measurement is not taken into account.
This prevents the filter from adapting to changes in the measurement.
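The effect described above can be reproduced with the error and gain recursion alone, since the gain does not depend on the measurements. The sketch below is an illustration under assumed values (initial error 1.0, measurement variance 0.1**2): with q = 0 it is exactly the three steps listed, and a nonzero q adds a process variance to the error before each update, which is the usual way a Kalman filter is told that the true value may drift.

```python
def final_gain(n_steps, q, r=0.1**2, p0=1.0):
    """Run the scalar error/gain recursion and return the last gain.

    q is a process variance added to the error estimate before each
    update; q = 0 reproduces the three steps listed in the question."""
    p = p0
    gain = 0.0
    for _ in range(n_steps):
        p_prior = p + q                  # error grows by the process variance
        gain = p_prior / (p_prior + r)   # Kalman gain
        p = (1.0 - gain) * p_prior       # error in estimate after the update
    return gain

print(final_gain(400, q=0.0))    # gain keeps shrinking toward zero
print(final_gain(400, q=1e-3))   # gain settles at a nonzero value
```

With q = 0 the gain after k steps is roughly 1/k, so new measurements are almost ignored; with q > 0 the gain levels off and the filter keeps responding to changes.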
How can I make that happen?
Thanks
EDIT:
Answering to Claes:
I am not sure that the Kalman filter is valid for my problem since I don't have a system model, I just have a bunch of readings from a quite noisy sensor measuring a not very predictable variable.
To keep things simple, imagine reading a potentiometer ( a variable resistor ) changed by the user, you can't predict or model the user's behavior.
I have implemented a very basic SMA ( Simple Moving Average ) algorithm and I was wondering if there is a better way to do it.
Is the Kalman filter valid for a problem like this?
If not, what would you suggest?
2ND EDIT
Thanks to Claes for such useful information.
I have been doing some numerical tests in MATLAB (with no real data yet), and convolution with a Gaussian filter seems to give the most accurate result.
With the Kalman filter I don't know how to estimate the process and measurement variances. Is there any method for that? Only when I decrease the measurement variance quite a lot does the Kalman filter seem to adapt. In the previous image the measurement variance was R=0.1^2 (the one in the original example). This is the same test with R=0.01^2
Of course, these are MATLAB tests with no real data. Tomorrow I will try to implement these filters in the real system with real data and see if I can get similar results.
A simple MA filter is probably sufficient for your example. If you would like to use the Kalman filter, there is a great example in the SciPy Cookbook.
I have modified the code to include a step change so you can see the convergence.
# Kalman filter example demo in Python
# A Python implementation of the example given in pages 11-15 of "An
# Introduction to the Kalman Filter" by Greg Welch and Gary Bishop,
# University of North Carolina at Chapel Hill, Department of Computer
# Science, TR 95-041,
# http://www.cs.unc.edu/~welch/kalman/kalmanIntro.html
# by Andrew D. Straw
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10, 8)
# initial parameters
n_iter = 400
sz = (n_iter,) # size of array
x1 = -0.37727*np.ones(n_iter//2) # truth value 1
x2 = -0.57727*np.ones(n_iter//2) # truth value 2
x = np.concatenate((x1,x2),axis=0)
z = x+np.random.normal(0,0.1,size=sz) # observations (normal about x, sigma=0.1)
Q = 1e-5 # process variance
# allocate space for arrays
xhat=np.zeros(sz) # a posteriori estimate of x
P=np.zeros(sz) # a posteriori error estimate
xhatminus=np.zeros(sz) # a priori estimate of x
Pminus=np.zeros(sz) # a priori error estimate
K=np.zeros(sz) # gain or blending factor
R = 0.1**2 # estimate of measurement variance, change to see effect
# initial guesses
xhat[0] = 0.0
P[0] = 1.0
for k in range(1,n_iter):
    # time update
    xhatminus[k] = xhat[k-1]
    Pminus[k] = P[k-1]+Q
    # measurement update
    K[k] = Pminus[k]/( Pminus[k]+R )
    xhat[k] = xhatminus[k]+K[k]*(z[k]-xhatminus[k])
    P[k] = (1-K[k])*Pminus[k]
plt.figure()
plt.plot(z,'k+',label='noisy measurements')
plt.plot(xhat,'b-',label='a posteriori estimate')
plt.plot(x,color='g',label='truth value')
plt.legend()
plt.title('Estimate vs. iteration step', fontweight='bold')
plt.xlabel('Iteration')
plt.ylabel('Voltage')
plt.show()
And the output is:

Parallelising gradient calculation in Julia

I was persuaded some time ago to drop my comfortable MATLAB programming and start programming in Julia. I have been working with neural networks for a long time and I thought that, now with Julia, I could get things done faster by parallelising the calculation of the gradient.
The gradient need not be calculated on the entire dataset in one go; instead one can split the calculation. For instance, by splitting the dataset in parts, we can calculate a partial gradient on each part. The total gradient is then calculated by adding up the partial gradients.
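The additivity claimed here is easy to verify on a toy case. The sketch below is in plain Python rather than Julia and uses a tiny made-up dataset; grad_chunk is a hypothetical helper that accumulates the squared-error gradient one data item at a time over a range of rows.

```python
import math

def grad_chunk(X, Y, w, first, last):
    """Gradient contribution of rows first..last-1 for a linear model
    y ~ x.w with squared error, accumulated item by item."""
    grad = [0.0] * len(w)
    for n in range(first, last):
        residual = Y[n] - sum(x * wi for x, wi in zip(X[n], w))
        for d in range(len(w)):
            grad[d] += X[n][d] * residual
    return grad

# tiny made-up dataset: 6 items, 2 features
X = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 1.5], [0.0, 4.0], [1.5, 1.5]]
Y = [1.0, 0.0, 2.0, -1.0, 3.0, 0.5]
w = [0.1, -0.2]

full = grad_chunk(X, Y, w, 0, 6)
parts = [grad_chunk(X, Y, w, 0, 3), grad_chunk(X, Y, w, 3, 6)]
summed = [a + b for a, b in zip(*parts)]
print(full, summed)
```

The two results agree up to floating-point rounding, which is what makes it possible to hand each chunk to a different worker.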
Though the principle is simple, when I parallelise with Julia I get a performance degradation, i.e. one process is faster than two processes! I am obviously doing something wrong... I have consulted other questions asked in the forum but I could still not piece together an answer. I think my problem is that there is a lot of unnecessary data movement going on, but I can't fix it properly.
In order to avoid posting messy neural network code, I am posting below a simpler example that replicates my problem in the setting of linear regression.
The code-block below creates some data for a linear regression problem. The code explains the constants, but X is the matrix containing the data inputs. We randomly create a weight vector w which when multiplied with X creates some targets Y.
######################################
## CREATE LINEAR REGRESSION PROBLEM ##
######################################
# This code implements a simple linear regression problem
MAXITER = 100 # number of iterations for simple gradient descent
N = 10000 # number of data items
D = 50 # dimension of data items
X = randn(N, D) # create random matrix of data, data items appear row-wise
Wtrue = randn(D,1) # create arbitrary weight matrix to generate targets
Y = X*Wtrue # generate targets
The next code-block below defines functions for measuring the fitness of our regression (i.e. the negative log-likelihood) and the gradient of the weight vector w:
####################################
## DEFINE FUNCTIONS ##
####################################
@everywhere begin
#-------------------------------------------------------------------
function negative_loglikelihood(Y,X,W)
#-------------------------------------------------------------------
    # number of data items
    N = size(X,1)
    # accumulate here log-likelihood
    ll = 0
    for nn=1:N
        ll = ll - 0.5*sum((Y[nn,:] - X[nn,:]*W).^2)
    end
    return ll
end
#-------------------------------------------------------------------
function negative_loglikelihood_grad(Y,X,W, first_index,last_index)
#-------------------------------------------------------------------
    # number of data items
    N = size(X,1)
    # accumulate here gradient contributions by each data item
    grad = zeros(similar(W))
    for nn=first_index:last_index
        grad = grad + X[nn,:]' * (Y[nn,:] - X[nn,:]*W)
    end
    return grad
end
end
Note that the above functions are on purpose not vectorised! I choose not to vectorise, as the final code (the neural network case) will also not admit any vectorisation (let us not get into more details regarding this).
Finally, the code-block below shows a very simple gradient descent that tries to recover the parameter weight vector w from the given data Y and X:
####################################
## SOLVE LINEAR REGRESSION ##
####################################
# start from random initial solution
W = randn(D,1)
# learning rate, set here to some arbitrary small constant
eta = 0.000001
# the following for-loop implements simple gradient descent
for iter=1:MAXITER
    # get gradient
    ref_array = Array(RemoteRef, nworkers())
    # let each worker process part of matrix X
    for index=1:length(workers())
        # first index of subset of X that worker should work on
        first_index = (index-1)*int(ceil(N/nworkers())) + 1
        # last index of subset of X that worker should work on
        last_index = min((index)*(int(ceil(N/nworkers()))), N)
        ref_array[index] = @spawn negative_loglikelihood_grad(Y,X,W, first_index,last_index)
    end
    # gather the gradients calculated on parts of matrix X
    grad = zeros(similar(W))
    for index=1:length(workers())
        grad = grad + fetch(ref_array[index])
    end
    # now that we have the gradient we can update parameters W
    W = W + eta*grad;
    # report progress, monitor optimisation
    @printf("Iter %d neg_loglikel=%.4f\n",iter, negative_loglikelihood(Y,X,W))
end
As is hopefully visible, I tried to parallelise the calculation of the gradient in the easiest possible way here. My strategy is to break the calculation of the gradient in as many parts as available workers. Each worker is required to work only on part of matrix X, which part is specified by first_index and last_index. Hence, each worker should work with X[first_index:last_index,:]. For instance, for 4 workers and N = 10000, the work should be divided as follows:
worker 1 => first_index = 1, last_index = 2500
worker 2 => first_index = 2501, last_index = 5000
worker 3 => first_index = 5001, last_index = 7500
worker 4 => first_index = 7501, last_index = 10000
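The split listed above follows directly from the first_index and last_index formulas in the loop. As a sanity check, here is the same arithmetic in a few lines of Python (used here only as a neutral language for the calculation; chunk_bounds is a made-up name).

```python
import math

def chunk_bounds(n_items, n_workers):
    """1-based inclusive (first_index, last_index) per worker, using the
    same ceil-based formula as the loop in the question."""
    size = math.ceil(n_items / n_workers)
    return [((k - 1) * size + 1, min(k * size, n_items))
            for k in range(1, n_workers + 1)]

print(chunk_bounds(10000, 4))
# [(1, 2500), (2501, 5000), (5001, 7500), (7501, 10000)]
```

When N is not a multiple of the number of workers, the last worker simply gets a shorter chunk, for example (10, 10) for N = 10 and 4 workers.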
Unfortunately, this entire code runs faster if I have only one worker. If I add more workers via addprocs(), the code runs slower. One can aggravate this issue by creating more data items, for instance by using N=20000 instead.
With more data items, the degradation is even more pronounced.
In my particular computing environment with N=20000 and one core, the code runs in ~9 secs. With N=20000 and 4 cores it takes ~18 secs!
I tried many many different things inspired by the questions and answers in this forum but unfortunately to no avail. I realise that the parallelisation is naive and that data movement must be the problem, but I have no idea how to do it properly. It seems that the documentation is also a bit scarce on this issue (as is the nice book by Ivo Balbaert).
I would appreciate your help as I have been stuck for quite some while with this and I really need it for my work. For anyone wanting to run the code, to save you the trouble of copying-pasting you can get the code here.
Thanks for taking the time to read this very lengthy question! Help me turn this into a model answer that anyone new in Julia can then consult!
I would say that GD is not a good candidate for parallelization using any of the proposed methods: SharedArray, DistributedArray, or your own implementation that distributes chunks of data.
The problem does not lie in Julia, but in the GD algorithm.
Consider the code:
Main process:
for iter = 1:iterations #iterations: "the more the better"
    δ = _gradient_descent_shared(X, y, θ)
    θ = θ - α * (δ/N)
end
The problem is in the above for-loop which is a must. No matter how good _gradient_descent_shared is, the total number of iterations kills the noble concept of the parallelization.
After reading the question and the above suggestion I've started implementing GD using SharedArray. Please note, I'm not an expert in the field of SharedArrays.
The main process parts (simple implementation without regularization):
run_gradient_descent(X::SharedArray, y::SharedArray, θ::SharedArray, α, iterations) = begin
    N = length(y)
    for iter = 1:iterations
        δ = _gradient_descent_shared(X, y, θ)
        θ = θ - α * (δ/N)
    end
    θ
end
_gradient_descent_shared(X::SharedArray, y::SharedArray, θ::SharedArray, op=(+)) = begin
    if size(X,1) <= length(procs(X))
        return _gradient_descent_serial(X, y, θ)
    else
        rrefs = map(p -> (@spawnat p _gradient_descent_serial(X, y, θ)), procs(X))
        return mapreduce(r -> fetch(r), op, rrefs)
    end
end
The code common to all workers:
#= Returns the range of indices of a chunk for every worker on which it can work.
The function splits data examples (N rows into chunks),
not the parts of the particular example (features dimensionality remains intact).=#
@everywhere function _worker_range(S::SharedArray)
    idx = indexpids(S)
    if idx == 0
        return 1:size(S,1), 1:size(S,2)
    end
    nchunks = length(procs(S))
    splits = [round(Int, s) for s in linspace(0,size(S,1),nchunks+1)]
    splits[idx]+1:splits[idx+1], 1:size(S,2)
end
#Computations on the chunk of the all data.
@everywhere _gradient_descent_serial(X::SharedArray, y::SharedArray, θ::SharedArray) = begin
    prange = _worker_range(X)
    pX = sdata(X[prange[1], prange[2]])
    py = sdata(y[prange[1],:])
    tempδ = pX' * (pX * sdata(θ) .- py)
end
The data loading and training. Let me assume that we have:
features in X::Array of the size (N,D), where N - number of examples, D-dimensionality of the features
labels in y::Array of the size (N,1)
The main code might look like this:
X=[ones(size(X,1)) X] #adding the artificial coordinate
N, D = size(X)
MAXITER = 500
α = 0.01
initialθ = SharedArray(Float64, (D,1))
sX = convert(SharedArray, X)
sy = convert(SharedArray, y)
X = nothing
y = nothing
gc()
finalθ = run_gradient_descent(sX, sy, initialθ, α, MAXITER);
After implementing and running this (on 8 cores of my Intel Core i7), I got only a very slight acceleration over serial GD (1 core) on my multiclass (19 classes) training data (715 sec for serial GD / 665 sec for shared GD).
If my implementation is correct (please check it, I'm counting on that), then parallelizing the GD algorithm is not worth it. You might well get a better speedup using stochastic GD on one core.
If you want to reduce the amount of data movement, you should strongly consider using SharedArrays. You could preallocate just one output vector, and pass it as an argument to each worker. Each worker sets a chunk of it, just as you suggested.
