How to train multiple models at the same time using 4 CPUs and Python? [duplicate]

This question already has an answer here:
Train multiple models in parallel with sklearn?
(1 answer)
Closed 3 years ago.
My task is like
from sklearn.gaussian_process import GaussianProcessRegressor
num = 100
model = dict()
for i in range(num):
    model[i] = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=20)
for i in range(num):
    model[i].fit(X, Y)
where X, Y are my training data containing features and labels, respectively.
My Ubuntu has 4 CPUs. In order to reduce the training time cost to a quarter of the above code, I therefore want to execute model[0].fit(X, Y) on CPU-0, model[1].fit(X, Y) on CPU-1, model[2].fit(X, Y) on CPU-2 and model[3].fit(X, Y) on CPU-3, simultaneously. What should I do?

Replace input_x and input_y with your actual training data in a list.
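import joblib
from sklearn.gaussian_process import GaussianProcessRegressor
input_x = [X for i in range(100)]
input_y = [Y for i in range(100)]
def trainmodel(X, Y):
    # Fit a single regressor; each call runs in its own worker
    model = GaussianProcessRegressor(n_restarts_optimizer=20)
    model.fit(X, Y)
    return model
# n_jobs=4 runs up to four fits at once; results come back in input order
models = joblib.Parallel(n_jobs=4, verbose=1)(
    joblib.delayed(trainmodel)(x, y) for x, y in zip(input_x, input_y)
)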
You should also check the number of CPUs available, just in case:
import multiprocessing
multiprocessing.cpu_count()

Related

Performance expectations when running caret::train() to develop a kknn model

I'm using the caret::train() function to develop a weighted knn classification model (kknn) with 10-fold cross-validation and a tuneGrid containing 15 values for kmax, one value for distance, and 3 values for kernel.
That’s 450 total iterations if I understand the process correctly (an iteration being the computation of the probability of a given outcome for a given combination of kmax, distance, and kernel). x has about 480,000 data points (6 predictors each having about 80,000 observations), and y has about 80,000 data points.
Understanding that there are innumerable variables affecting performance, how long can I reasonably expect the train function to take if run on a pc with an 8-core 3GHz Intel processor and 32GB of RAM?
It currently takes about 70 minutes per fold, which is about 1.5 minutes per iteration. Is this reasonable, or excessive?
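The arithmetic behind these figures can be checked with a few lines. This is only a sketch using the numbers quoted in the question; it assumes every parameter combination is fitted separately in every fold and ignores any final refit on the full data.

```python
# Values taken from the question
n_kmax, n_distance, n_kernel = 15, 1, 3
n_folds = 10
minutes_per_fold = 70  # runtime observed by the asker

combos = n_kmax * n_distance * n_kernel         # parameter combinations per fold
total_fits = combos * n_folds                   # fits across all folds
minutes_per_fit = minutes_per_fold / combos     # average time per combination
total_hours = minutes_per_fold * n_folds / 60   # whole cross-validation run

print(combos, total_fits, round(minutes_per_fit, 2), round(total_hours, 1))
```

At the observed rate, the complete 10-fold run would take roughly half a day.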
This is a kknn learning exercise. I realize there are other types of algorithms that produce better results more efficiently.
Here is the essential code:
x <- as.matrix(train_set2[, c("n_launch_angle", "n_launch_speed", "n_spray_angle_Kolp", "n_spray_angle_adj", "n_hp_to_1b", "n_if_alignment")])
y <- train_set2$events
set.seed(1)
fitControl <- trainControl(method = "cv", number = 10, p = 0.8, returnData = TRUE,
returnResamp = "all", savePredictions = "all",
summaryFunction = twoClassSummary, classProbs = TRUE,
verboseIter = TRUE)
tuneGrid <- expand.grid(kmax = seq(11, 39, 2),
distance = 2,
kernel = c("triangular", "gaussian", "optimal"))
kknn_train <- train(x, y, method = "kknn",
tuneGrid = tuneGrid, trControl = fitControl)
As we have established in the comments, this type of runtime is reasonable to expect. There are a few steps you can take to reduce it:
Run your code in parallel.
Use a more efficient OS, such as Linux.
Be more economical in your trainControl(): is it really necessary to have returnResamp = "all"? There are small gains to be had from controlling options like these.
Clearly, the first one is a no-brainer. For the second one, I can find as many computer engineers who swear by Linux as those who swear by Windows. What convinced me to switch to Linux was this particular test, which I hope will give you what it gave me.
# Calculate distance matrix
test_data <- function(dim, num, seed = 1903) {
  set.seed(seed)
  dist(
    matrix(
      rnorm(dim * num), nrow = num
    )
  )
}
# Benchmarking
microbenchmark::microbenchmark(test_data(120,4500))
This piece of code simply runs faster on the exact same machine when that machine runs Linux. At least this was my experience.

How to average parameters for a grid in utm format avoiding for loops [duplicate]

This question already has an answer here:
matlab: splitting small arrays by latitude/longitude into individual grid cells of one large array
(1 answer)
Closed 7 years ago.
Hi, I have a small problem. I have one year of GPS data at 1 s time resolution in UTM coordinates (x, y) with speed, and I would like to average the speed over a 20 m grid. My code works, but it is really slow because I use for loops to find the coordinates that match each grid cell. Any help is appreciated.
Kind regards, Matthias
%x_d is the x coordinate
%y_d is the y coordinate
%vec_x is the x grid vector definition
%vec_y is the y grid vector definition
%s is the speed
for i=1:length(vec_x)
    for j=1:length(vec_y)
        ind = find(x_d<=vec_x(i)+10 & x_d>vec_x(i)-10 & y_d<=vec_y(j)+10 & y_d>vec_y(j)-10);
        Ad(j,i) = nanmean(s(ind));
    end
end
It can be done with no loops using MATLAB's histcounts and accumarray functions. This question/solution is a near duplicate of this question/solution, except for possibly the use of histcounts.
histcounts is good for binning problems (which this is). [~,~,x_idx]=histcounts(x_d,x_vec) tells you which x-bin each x-coordinate is in. Similarly for y_d, y_vec.
accumarray is good for aggregating values that share an index (to avoid looping). The call below groups the speed values by bin and applies the @mean function to each group. The 0 tells accumarray to fill empty bins with zeros.
x_vec = 0:20;
y_vec = 0:20;
x_d = rand(1000,1)*20;
y_d = rand(1000,1)*20;
s = rand(1000,1);
[~,~,x_idx] = histcounts(x_d,x_vec);
[~,~,y_idx] = histcounts(y_d,y_vec);
avg = accumarray([x_idx y_idx],s,[length(x_vec)-1,length(y_vec)-1],@mean,0)
A) Use logical indexing, which gets rid of the time consuming find: Your code
ind = find(x_d<=vec_x(i)+10& x_d>vec_x(i)-10 & y_d<=vec_y(j)+10 & y_d>vec_y(j)-10);
Ad(j,i) = nanmean(s(ind));
is equal to the faster
ix = x_d<=vec_x(i)+10& x_d>vec_x(i)-10 & y_d<=vec_y(j)+10 & y_d>vec_y(j)-10;
Ad(j,i) = nanmean(s(ix));
B) Try to use your known Grid-size information to access the right grid-element directly. From the offset and element-size (20) you can infer:
ind_x = floor((x_d - min(vec_x))./20) + 1;
ind_y = floor((y_d - min(vec_y))./20) + 1;
Then you would loop through your grid and pick the positions. Maybe this could be improved with arrayfun.
for i=1:length(vec_x)
for j=1:length(vec_y)
ix = ind_x == i & ind_y == j;
Ad(j,i) = nanmean(s(ix));
end
end
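The direct-index idea in B) can be taken one step further: once every sample has a cell index, sums and counts can be accumulated per cell in a single pass, so the loop over the grid disappears. Below is a minimal Python sketch of that idea, not a drop-in replacement for the MATLAB code. The function name and the x0, y0 and cell parameters are hypothetical, indices are zero-based, and cells are half-open intervals starting at the grid origin rather than being centred on the grid vectors.

```python
import math
from collections import defaultdict

def grid_mean_speed(xs, ys, speeds, x0, y0, cell=20.0):
    """Average speed per grid cell in one pass over the samples.

    x0, y0: lower-left corner of the grid (hypothetical parameters).
    Returns {(ix, iy): mean speed}; NaN speeds are skipped, like nanmean.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y, s in zip(xs, ys, speeds):
        if math.isnan(s):
            continue
        # Cell index follows directly from the offset and the cell size
        key = (math.floor((x - x0) / cell), math.floor((y - y0) / cell))
        sums[key] += s
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

means = grid_mean_speed(
    xs=[5.0, 15.0, 25.0, 27.0],
    ys=[5.0, 8.0, 5.0, 30.0],
    speeds=[2.0, 4.0, 6.0, float("nan")],
    x0=0.0, y0=0.0,
)
print(means)
```

Only cells that contain at least one valid sample appear in the result, which keeps memory proportional to the visited area rather than to the full grid.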

Parallelising gradient calculation in Julia

I was persuaded some time ago to drop my comfortable MATLAB programming and start programming in Julia. I have been working with neural networks for a long time and I thought that, now with Julia, I could get things done faster by parallelising the calculation of the gradient.
The gradient need not be calculated on the entire dataset in one go; instead one can split the calculation. For instance, by splitting the dataset in parts, we can calculate a partial gradient on each part. The total gradient is then calculated by adding up the partial gradients.
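To make the splitting principle concrete, here is a small Python sketch (not the Julia code in question) that computes the linear-regression gradient on two halves of a toy dataset and checks that the partial gradients add up to the gradient over the whole set. The function name and the toy data are made up for illustration.

```python
def partial_gradient(X, Y, w, first, last):
    """Gradient contribution of rows first..last-1 (0-based, half-open)."""
    grad = [0.0] * len(w)
    for n in range(first, last):
        # Residual of data item n under the current weights
        residual = Y[n] - sum(x * wj for x, wj in zip(X[n], w))
        for j, x in enumerate(X[n]):
            grad[j] += x * residual
    return grad

# Toy data: 4 items with 2 features each
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
Y = [1.0, 2.0, 3.0, 4.0]
w = [0.5, -0.5]

full = partial_gradient(X, Y, w, 0, 4)      # whole dataset in one go
part_a = partial_gradient(X, Y, w, 0, 2)    # first half
part_b = partial_gradient(X, Y, w, 2, 4)    # second half
combined = [a + b for a, b in zip(part_a, part_b)]

print(full, combined)
```

Because each partial gradient only reads its own rows, the parts are independent and could in principle be computed by different workers.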
Though the principle is simple, when I parallelise with Julia I get a performance degradation, i.e. one process is faster than two processes! I am obviously doing something wrong... I have consulted other questions asked in the forum but I could still not piece together an answer. I think my problem is that a lot of unnecessary data movement is going on, but I can't fix it properly.
In order to avoid posting messy neural network code, I am posting below a simpler example that replicates my problem in the setting of linear regression.
The code-block below creates some data for a linear regression problem. The code explains the constants, but X is the matrix containing the data inputs. We randomly create a weight vector w which when multiplied with X creates some targets Y.
######################################
## CREATE LINEAR REGRESSION PROBLEM ##
######################################
# This code implements a simple linear regression problem
MAXITER = 100 # number of iterations for simple gradient descent
N = 10000 # number of data items
D = 50 # dimension of data items
X = randn(N, D) # create random matrix of data, data items appear row-wise
Wtrue = randn(D,1) # create arbitrary weight matrix to generate targets
Y = X*Wtrue # generate targets
The next code-block below defines functions for measuring the fitness of our regression (i.e. the negative log-likelihood) and the gradient of the weight vector w:
####################################
## DEFINE FUNCTIONS ##
####################################
@everywhere begin
#-------------------------------------------------------------------
function negative_loglikelihood(Y,X,W)
#-------------------------------------------------------------------
# number of data items
N = size(X,1)
# accumulate here log-likelihood
ll = 0
for nn=1:N
ll = ll - 0.5*sum((Y[nn,:] - X[nn,:]*W).^2)
end
return ll
end
#-------------------------------------------------------------------
function negative_loglikelihood_grad(Y,X,W, first_index,last_index)
#-------------------------------------------------------------------
# number of data items
N = size(X,1)
# accumulate here gradient contributions by each data item
grad = zeros(similar(W))
for nn=first_index:last_index
grad = grad + X[nn,:]' * (Y[nn,:] - X[nn,:]*W)
end
return grad
end
end
Note that the above functions are on purpose not vectorised! I choose not to vectorise, as the final code (the neural network case) will also not admit any vectorisation (let us not get into more details regarding this).
Finally, the code-block below shows a very simple gradient descent that tries to recover the parameter weight vector w from the given data Y and X:
####################################
## SOLVE LINEAR REGRESSION ##
####################################
# start from random initial solution
W = randn(D,1)
# learning rate, set here to some arbitrary small constant
eta = 0.000001
# the following for-loop implements simple gradient descent
for iter=1:MAXITER
# get gradient
ref_array = Array(RemoteRef, nworkers())
# let each worker process part of matrix X
for index=1:length(workers())
# first index of subset of X that worker should work on
first_index = (index-1)*int(ceil(N/nworkers())) + 1
# last index of subset of X that worker should work on
last_index = min((index)*(int(ceil(N/nworkers()))), N)
ref_array[index] = @spawn negative_loglikelihood_grad(Y,X,W, first_index,last_index)
end
# gather the gradients calculated on parts of matrix X
grad = zeros(similar(W))
for index=1:length(workers())
grad = grad + fetch(ref_array[index])
end
# now that we have the gradient we can update parameters W
W = W + eta*grad;
# report progress, monitor optimisation
@printf("Iter %d neg_loglikel=%.4f\n",iter, negative_loglikelihood(Y,X,W))
end
As is hopefully visible, I tried to parallelise the calculation of the gradient in the easiest possible way here. My strategy is to break the calculation of the gradient in as many parts as available workers. Each worker is required to work only on part of matrix X, which part is specified by first_index and last_index. Hence, each worker should work with X[first_index:last_index,:]. For instance, for 4 workers and N = 10000, the work should be divided as follows:
worker 1 => first_index = 1, last_index = 2500
worker 2 => first_index = 2501, last_index = 5000
worker 3 => first_index = 5001, last_index = 7500
worker 4 => first_index = 7501, last_index = 10000
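The index arithmetic that produces this split can be checked in isolation. The following Python sketch mirrors the ceil-based formula from the loop above, keeping the 1-based inclusive indices of the Julia code; the function name chunk_ranges is made up for illustration.

```python
import math

def chunk_ranges(n_items, n_workers):
    """1-based inclusive (first_index, last_index) pairs, one per worker."""
    size = math.ceil(n_items / n_workers)   # rows per worker, rounded up
    ranges = []
    for index in range(1, n_workers + 1):
        first = (index - 1) * size + 1
        last = min(index * size, n_items)   # last worker may get fewer rows
        ranges.append((first, last))
    return ranges

print(chunk_ranges(10000, 4))
```

When n_items is not much larger than n_workers, trailing workers can receive an empty range (first greater than last), which a loop over first:last simply skips.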
Unfortunately, this entire code runs faster if I have only one worker. If I add more workers via addprocs(), the code runs slower. One can aggravate this issue by creating more data items, for instance by using N=20000 instead.
With more data items, the degradation is even more pronounced.
In my particular computing environment with N=20000 and one core, the code runs in ~9 secs. With N=20000 and 4 cores it takes ~18 secs!
I tried many many different things inspired by the questions and answers in this forum but unfortunately to no avail. I realise that the parallelisation is naive and that data movement must be the problem, but I have no idea how to do it properly. It seems that the documentation is also a bit scarce on this issue (as is the nice book by Ivo Balbaert).
I would appreciate your help as I have been stuck for quite some while with this and I really need it for my work. For anyone wanting to run the code, to save you the trouble of copying-pasting you can get the code here.
Thanks for taking the time to read this very lengthy question! Help me turn this into a model answer that anyone new in Julia can then consult!
I would say that GD is not a good candidate for parallelization with any of the proposed methods: SharedArray, DistributedArray, or your own implementation of distributing chunks of data.
The problem does not lie in Julia, but in the GD algorithm.
Consider the code:
Main process:
for iter = 1:iterations #iterations: "the more the better"
δ = _gradient_descent_shared(X, y, θ)
θ = θ - α * (δ/N)
end
The problem is in the above for-loop which is a must. No matter how good _gradient_descent_shared is, the total number of iterations kills the noble concept of the parallelization.
After reading the question and the above suggestion I've started implementing GD using SharedArray. Please note, I'm not an expert in the field of SharedArrays.
The main process parts (simple implementation without regularization):
run_gradient_descent(X::SharedArray, y::SharedArray, θ::SharedArray, α, iterations) = begin
N = length(y)
for iter = 1:iterations
δ = _gradient_descent_shared(X, y, θ)
θ = θ - α * (δ/N)
end
θ
end
_gradient_descent_shared(X::SharedArray, y::SharedArray, θ::SharedArray, op=(+)) = begin
if size(X,1) <= length(procs(X))
return _gradient_descent_serial(X, y, θ)
else
rrefs = map(p -> (@spawnat p _gradient_descent_serial(X, y, θ)), procs(X))
return mapreduce(r -> fetch(r), op, rrefs)
end
end
The code common to all workers:
#= Returns the range of indices of a chunk for every worker on which it can work.
The function splits data examples (N rows into chunks),
not the parts of the particular example (features dimensionality remains intact).=#
@everywhere function _worker_range(S::SharedArray)
idx = indexpids(S)
if idx == 0
return 1:size(S,1), 1:size(S,2)
end
nchunks = length(procs(S))
splits = [round(Int, s) for s in linspace(0,size(S,1),nchunks+1)]
splits[idx]+1:splits[idx+1], 1:size(S,2)
end
#Computations on the chunk of the all data.
@everywhere _gradient_descent_serial(X::SharedArray, y::SharedArray, θ::SharedArray) = begin
prange = _worker_range(X)
pX = sdata(X[prange[1], prange[2]])
py = sdata(y[prange[1],:])
tempδ = pX' * (pX * sdata(θ) .- py)
end
The data loading and training. Let me assume that we have:
features in X::Array of the size (N,D), where N - number of examples, D-dimensionality of the features
labels in y::Array of the size (N,1)
The main code might look like this:
X=[ones(size(X,1)) X] #adding the artificial coordinate
N, D = size(X)
MAXITER = 500
α = 0.01
initialθ = SharedArray(Float64, (D,1))
sX = convert(SharedArray, X)
sy = convert(SharedArray, y)
X = nothing
y = nothing
gc()
finalθ = run_gradient_descent(sX, sy, initialθ, α, MAXITER);
After implementing and running this (on 8 cores of my Intel Core i7), I got only a very slight speedup over serial GD (1 core) on my multiclass (19 classes) training data (715 sec for serial GD / 665 sec for shared GD).
If my implementation is correct (please check it, I'm counting on that), then parallelizing the GD algorithm is not worth the effort. You might well get better acceleration using stochastic GD on 1 core.
If you want to reduce the amount of data movement, you should strongly consider using SharedArrays. You could preallocate just one output vector, and pass it as an argument to each worker. Each worker sets a chunk of it, just as you suggested.

Julia: use of pmap with matrices

I have a question about the use of pmap. I think it's a simple/obvious answer but still can't figure it out! I am currently running a loop where each of 50 iterations is separate and so running it in parallel should be possible and should improve speed. It uses a function that has multiple inputs and outputs, which are both a mixture of vectors and scalars. I need to save the outputs of the function for each of the 50 iterations for later use. Here are the basics of the code when not in parallel.
A=Array(Float64, 500,50)
b=Array(Float64,50)
for i in 1:50
A[:,i],b[i] = func(i,x,y,z)
end
Any advice on how to implement this in parallel? I'm using Julia v0.3.
Thanks in advance.
David
This worked for me.
@everywhere x,y,z = 1,2,3
@everywhere function f(i,x,y,z)
    sleep(1)
    return(ones(500)*i, i+x+y+z)
end
naive = @time map(i -> f(i,x,y,z), 1:50)
parallel = @time pmap(i -> f(i,x,y,z), 1:50)
A = [x[1] for x in parallel]
b = [x[2] for x in parallel]
Let me know if anyone can suggest a more elegant way to get A and b out of the array of tuples that is produced by pmap.
The timings (when run on 8 processes) are as we would expect
elapsed time: 5.063214725 seconds (94436 bytes allocated)
elapsed time: 0.815228485 seconds (288864 bytes allocated)

How to find a function that can approximate another blackbox function programmatically? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 8 years ago.
I have two functions
m1 = f1(w, s)
m2 = f2(w, s)
f1() and f2() are both black boxes. Given w and s, I can get m1 and m2.
Now, I need to design or find a function g, such that
m2' = g(m1)
Also, the difference between m2 and m2' must be minimized.
w and s are both stochastic processes.
How can I find such a function g()? What knowledge domain does this belong to?
Assuming you can invoke f1 and f2 as many times as you want, this can be solved using regression.
Set up a training set: (w_1,s_1,m2_1),...,(w_n,s_n,m2_n).
'Convert' the set to the arguments of g: (m1_1,m2_1),...,(m1_n,m2_n).
Create your 'basis functions'. For example, for basis functions that are polynomials up to degree 3, the 'modified' training set will be (1,m1_1,m1_1^2,m1_1^3,m2_1), ... It is easy to generalize this to any polynomial degree or any other set of basis functions.
Now you have a problem that can be solved by linear regression using ordinary least squares (OLS).
However, note that for some functions it might be impossible to find a model that fits well, since you lose information when you reduce the dimensionality from 2 (w,s) to 1 (m1).
MATLAB code snippet (poor choice of functions):
%example functions
f = @(w,s) w.^2 + s.^3 -1;
g = @(w,s) s.^2 - w + 2;
%random points for sampling
w = rand(1,100);
s = rand(1,100);
%the data
m1 = f(w,s)';
m2 = g(w,s)';
%changing dimension:
d = 5;
points = size(m1,1);
A = ones(points,d);
for jj=1:d
A(:,jj) = (m1.^(jj-1))';
end
%OLS:
theta = pinv(A'*A)*A'*m2;
%new point:
w = rand(1,1);
s = rand(1,1);
m1 = f(w,s);
%estimate the new point:
A = ones(1,d);
for jj=1:d
A(:,jj) = (m1.^(jj-1))';
end
%the estimation:
estimated = A*theta
%the real value:
g(w,s)
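The same recipe (polynomial basis functions followed by OLS) can also be sketched in plain Python without any libraries. This is an illustrative sketch, not a translation of the MATLAB code above: it solves the normal equations by Gaussian elimination, which is fine for a small, well-conditioned toy problem but less robust than a pseudo-inverse or QR solve. All function names are hypothetical, and the toy data are chosen so that m2 is an exact quadratic in m1.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def fit_polynomial(m1, m2, degree):
    """Least-squares theta such that m2 ~ sum(theta[k] * m1**k)."""
    basis = [[v ** k for k in range(degree + 1)] for v in m1]
    size = degree + 1
    # Normal equations: (A^T A) theta = A^T m2
    ata = [[sum(row[i] * row[j] for row in basis) for j in range(size)]
           for i in range(size)]
    atb = [sum(row[i] * t for row, t in zip(basis, m2)) for i in range(size)]
    return solve_linear(ata, atb)

def predict(theta, value):
    """Evaluate the fitted polynomial g at a new m1 value."""
    return sum(c * value ** k for k, c in enumerate(theta))

# Toy data where m2 depends on m1 exactly: m2 = 1 + 2*m1 + 3*m1^2
m1 = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
m2 = [1.0 + 2.0 * v + 3.0 * v * v for v in m1]
theta = fit_polynomial(m1, m2, degree=2)
print(theta, predict(theta, 3.0))
```

With real black-box outputs, m2 is generally not a function of m1 alone, so the fit captures only the part of m2 that m1 can explain.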
Problems of this kind are studied in fields such as statistics and inverse problems. Here's one way to approach the problem theoretically (from the point of view of inverse problems):
First of all, it is quite clear that in the general case the function g might not exist. However, what you can (try to) compute, given that you (assume to) know something about the statistics of w and s, is the posterior probability density p(m2|m1), which can then be used to compute estimators for m2 given m1, for instance a maximum a posteriori estimate.
The posterior density can be computed using Bayes' formula:
p(m2|m1) = (\int p(m1,m2|w,s) p(w,s) dw ds) / (\int p(m1|w,s) p(w,s) dw ds)
which, in this case, might be (theoretically) nasty to apply since some of the involved marginal probability densities are singular. The best way to proceed numerically depends on the additional assumptions you can make about the statistics of w and s (e.g., Gaussian) and the functions f1, f2 (e.g., smooth). There is no silver bullet.
amit's OLS solution is probably a good starting point. Just be sure to sample from the correct distributions for w and s.

Resources