Parallel for-loop with SharedArray - performance

I would like the most time- and memory-efficient, load-balanced way to apply an operation to each column of a SharedArray, writing each result in place into the corresponding column of a pre-allocated output SharedArray. How can I improve the following code?
using SharedArrays, Distributed
nCores = length(Sys.cpu_info())
addprocs(nCores - 1);
@everywhere using SharedArrays, Distributed, Statistics  # Statistics provides mean
@everywhere addConstant = 10000;
@everywhere function divideColByMeanConstant(x)
    return (x ./ mean(x)) .+ addConstant
end
inputMAT = SharedArray(rand(10000,20000))
outputMAT = SharedArray(Array{Float64,2}(undef, size(inputMAT)))
function Array2ArrayColumnwise_forloop!(output::SharedArray{Float64,2},operation::Function, input::SharedArray{Float64,2}, rowRange::Array{Int64,1},colRange::Array{Int64,1})
    @async @distributed for colInd in colRange
        @views output[rowRange,colInd] = operation(input[rowRange, colInd])
    end
end
rowRange = [1,4,6,8,10,15];
colRange = [1,4,5, 6, 7,10,15];
Array2ArrayColumnwise_forloop!(outputMAT,divideColByMeanConstant, inputMAT, rowRange,colRange)
Thank you in advance

Related

Improve code result speed by multiprocessing

I'm teaching myself Python and this is my first program.
I analyze server logs, and I usually need to analyze a full day of logs. I wrote this script (a simplified example) just to check speed. With ordinary sequential code, analyzing 20 million rows takes about 12-13 minutes. I need to process 200 million rows in 5 minutes.
What I tried:
Multiprocessing (I ran into a problem with shared memory, which I think I fixed). The result: 300K rows take 20 seconds, no matter how many processes I use. (PS: I also need to control the number of processes in advance.)
Threading (I found that it gives no speedup: 300K rows take 2 seconds, the same as the ordinary code.)
asyncio (I thought the script was slow because it has to read many files). The result is the same as with threading: 300K rows take 2 seconds.
In the end, I suspect that all three of my scripts are incorrect.
PS: I try to avoid less common Python modules (like pandas) because they would make the script harder to run on different servers. I would rather use commonly available libraries.
Please help me check the first one, multiprocessing.
import csv
import os
from multiprocessing import Process, Queue, Value, Manager
file = {"hcs.log", "hcs1.log", "hcs2.log", "hcs3.log"}
def argument(m, a, n):
    proc_num = os.getpid()
    a_temp_m = a["vod_miss"]
    a_temp_h = a["vod_hit"]
    with open(os.getcwd() + '/' + m, newline='') as hcs_1:
        hcs_2 = csv.reader(hcs_1, delimiter=' ')
        for j in hcs_2:
            if j[3].find('MISS') != -1:
                a_temp_m[n] = a_temp_m[n] + 1
            elif j[3].find('HIT') != -1:
                a_temp_h[n] = a_temp_h[n] + 1
    a["vod_miss"][n] = a_temp_m[n]
    a["vod_hit"][n] = a_temp_h[n]
if __name__ == '__main__':
    procs = []
    manager = Manager()
    vod_live_cuts = manager.dict()
    i = "vod_hit"
    ii = "vod_miss"
    cpu = 1
    n = 1
    vod_live_cuts[i] = manager.list([0] * cpu)
    vod_live_cuts[ii] = manager.list([0] * cpu)
    for m in file:
        proc = Process(target=argument, args=(m, vod_live_cuts, (n-1)))
        procs.append(proc)
        proc.start()
        if n >= cpu:
            n = 1
            proc.join()
        else:
            n += 1
    [proc.join() for proc in procs]
    [proc.close() for proc in procs]
I expect each file to be processed by the function argument in an independent process, with all results finally saved in the dict vod_live_cuts. I added a separate list slot in the dict for each process, thinking this would keep the processes from interfering with each other when they update it. But maybe that's the wrong approach :(
Using IPC is costly, so use "shared objects" only to save the final result, not for intermediate results while parsing the file.
To limit the number of processes, use a multiprocessing.Pool. The following code uses one to reach the maximum hard-disk speed; you only need to post-process the results.
You can only parse data as fast as your HDD can read it (typically 30-80 MB/s), so to improve performance further you would need an SSD or RAID 0 for higher disk speed. You cannot get much faster than this without changing your hardware.
import csv
import os
from multiprocessing import Process, Queue, Value, Manager, Pool
file = {"hcs.log", "hcs1.log", "hcs2.log", "hcs3.log"}
def argument(m, a):
    proc_num = os.getpid()
    a_temp_m_n = 0  # make it local to process
    a_temp_h_n = 0  # as shared lists use IPC
    with open(os.getcwd() + '/' + m, newline='') as hcs_1:
        hcs_2 = csv.reader(hcs_1, delimiter=' ')
        for j in hcs_2:
            if j[3].find('MISS') != -1:
                a_temp_m_n = a_temp_m_n + 1
            elif j[3].find('HIT') != -1:
                a_temp_h_n = a_temp_h_n + 1
    a["vod_miss"].append(a_temp_m_n)
    a["vod_hit"].append(a_temp_h_n)
if __name__ == '__main__':
    manager = Manager()
    vod_live_cuts = manager.dict()
    i = "vod_hit"
    ii = "vod_miss"
    cpu = 1
    vod_live_cuts[i] = manager.list()
    vod_live_cuts[ii] = manager.list()
    with Pool(cpu) as pool:
        tasks = []
        for m in file:
            task = pool.apply_async(argument, args=(m, vod_live_cuts))
            tasks.append(task)
        for task in tasks:
            task.get()
    print(list(vod_live_cuts[i]))
    print(list(vod_live_cuts[ii]))
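As a possible variation on the answer's advice, the shared Manager objects can be dropped entirely by letting each worker return its counts and summing them in the parent. The sketch below is only an illustration under assumptions: the names count_file and count_all are made up here, and skipping rows with fewer than four columns is an added guess about malformed lines, not something from the original code.

```python
import csv
from multiprocessing import Pool


def count_file(path):
    """Count MISS and HIT rows in one log file; returns (path, miss, hit)."""
    miss = hit = 0
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter=" "):
            # Column 3 holds the cache status, as in the question's code.
            if len(row) <= 3:
                continue  # assumption: ignore short or malformed lines
            if "MISS" in row[3]:
                miss += 1
            elif "HIT" in row[3]:
                hit += 1
    return path, miss, hit


def count_all(paths, workers):
    """Process the files in a pool of `workers` processes and sum the counts."""
    with Pool(workers) as pool:
        # Each worker keeps its counters local and returns them only once,
        # so IPC happens once per file rather than once per row.
        results = pool.map(count_file, paths)
    total_miss = sum(miss for _, miss, _ in results)
    total_hit = sum(hit for _, _, hit in results)
    return results, total_miss, total_hit
```

A call such as count_all(["hcs.log", "hcs1.log"], workers=4) would go under an if __name__ == '__main__': guard, as in the code above. The workers argument plays the role of cpu and fixes the number of processes in advance.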

Detectron2- How to log validation loss during training?

I copied the idea from mnslarcher and wrote the following two functions for my keypoint detector (resnet50 backbone) algorithm.
import torch
import detectron2.utils.comm as comm
from detectron2.data import build_detection_train_loader

def build_valid_loader(cfg):
    _cfg = cfg.clone()
    _cfg.defrost()  # make this cfg mutable.
    _cfg.DATASETS.TRAIN = cfg.DATASETS.TEST
    return build_detection_train_loader(_cfg)

def store_valid_loss(model, data, storage):
    training_mode = model.training
    with torch.no_grad():
        loss_dict = model(data)
        losses = sum(loss_dict.values())
        assert torch.isfinite(losses).all(), loss_dict
        loss_dict_reduced = {k: v.item()
                             for k, v in comm.reduce_dict(loss_dict).items()}
        losses_reduced = sum(loss for loss in loss_dict_reduced.values())
        if comm.is_main_process():
            storage.put_scalars(val_loss=losses_reduced, **loss_dict_reduced)
    model.train(training_mode)
Then, in plain_train_net.py, I call them as below.
val_data_loader = build_valid_loader(cfg)
logger.info("Starting training from iteration {}".format(start_iter))
with EventStorage(start_iter) as storage:
    for data, val_data, iteration in zip(data_loader, val_data_loader, range(start_iter, max_iter)):
        iteration = iteration + 1
        ..
        ..
        # At the end of the for loop.
        # Calculate and log validation loss.
        store_valid_loss(model, val_data, storage)
After 1k iterations, loss_keypoint is increasing, but total_loss is the same as it is without the store_valid_loss call. What am I missing? Can anyone please help me understand?
I am using 4 GeForce RTX 2080 Ti.

Distributed Julia: parallel map (pmap) with a timeout / time limit for each map task to complete

My project involves computing in parallel a map using Julia's Distributed's pmap function.
Mapping a given element could take a few seconds, or it could take essentially forever. I want a timeout or time limit for an individual map task/computation to complete.
If a map task finishes in time, great, return the result of the computation. If the task doesn't complete by the time limit, stop computation when the time limit has been reached, and return some value or message indicating a timeout occurred.
A minimal example follows. First are imported modules, and then worker processes are launched:
num_procs = 1
using Distributed
if num_procs > 1
    # The main process (without calling addprocs) can be used for `pmap`:
    addprocs(num_procs-1)
end
Next, the mapping task is defined for all the worker processes. The mapping task should timeout after 1 second:
@everywhere import Random
@everywhere begin
    """
    Compute stuff for `wait_time` seconds, and return `wait_time`.
    If `timeout` seconds elapses, stop computation and return something else.
    """
    function waitForTimeUnlessTimeout(wait_time, timeout=1)
        # < Insert some sort of timeout code? >
        # This block of code simulates a long computation.
        # (pretend the computation time is unknown)
        t0 = time()  # start time, needed by the loop condition below
        x = 0
        while time()-t0 < wait_time
            x += Random.rand() - 0.5
        end
        # computation completed before time limit. Return wait_time.
        round(wait_time, digits=2)
    end
end
The function that executes the parallel map (pmap) is defined on the main process. Each map task randomly takes up to 2 seconds to complete, but should time out after 1 second.
function myParallelMapping(num_tasks = 20, max_runtime=2)
    # random task runtimes between 0 and max_runtime
    runtimes = Random.rand(num_tasks) * max_runtime
    # return the parallel computation of the mapping tasks
    pmap((runtime)->waitForTimeUnlessTimeout(runtime), runtimes)
end
print(myParallelMapping())
How should this time-limited parallel map be implemented?
You could put something like this inside your pmap body
pmap(runtimes) do runtime
    t0 = time()
    task = @async waitForTimeUnlessTimeout(runtime)
    while !istaskdone(task) && time()-t0 < time_limit
        sleep(1)
    end
    istaskdone(task) && (return fetch(task))
    error("time over")
end
Also note that (runtime)->waitForTimeUnlessTimeout(runtime) is the same as just waitForTimeUnlessTimeout .
Following @Fredrik Bagge's very helpful answer, here is the full working example implementation with some extra explanation.
num_procs = 8
using Distributed
if num_procs > 1
    addprocs(num_procs-1)
end
@everywhere import Random
@everywhere begin
    function waitForTime(wait_time)
        # This code block simulates a long computation.
        # Pretend the computation time is unknown.
        t0 = time()
        x = 0
        while time()-t0 < wait_time
            x += Random.rand() - 0.5
            yield() # CRITICAL to release computation to check if task is done.
            # If you comment out yield(), you will see timeout doesn't work!
        end
        return round(wait_time, digits=2)
    end
end
function myParallelMapping(num_tasks = 16, max_runtime=2, time_limit=1)
    # random task runtimes between 0 and max_runtime
    runtimes = Random.rand(num_tasks) * max_runtime
    # parallel compute the mapping tasks. See "do block" in
    # the Julia documentation, it's just syntactic sugar.
    return pmap(runtimes) do runtime
        t0 = time()
        task = @async waitForTime(runtime)
        while !istaskdone(task) && time()-t0 < time_limit
            # releases computation to waitForTime
            sleep(0.1)
            # nothing past here will run until waitForTime calls yield()
            # *and* 0.1 seconds have passed.
        end
        # equal to if istaskdone(task); return fetch(task); end
        istaskdone(task) && (return fetch(task))
        return "TimeOut"
        # `return error("TimeOut")` halts pmap unless pmap is
        # given an error handler argument. See pmap documentation.
    end
end
The output is
julia> print(myParallelMapping())
Any["TimeOut", "TimeOut", 0.33, 0.35, 0.56, 0.41, 0.08, 0.14, 0.72,
"TimeOut", "TimeOut", "TimeOut", 0.52, "TimeOut", 0.33, "TimeOut"]
Note that there are two tasks per process in this example. The original task (the "time checker") is checking every 0.1 seconds if the other task has completed computation. The other task (created with @async) is computing something, periodically calling yield() to release control to the time checker; if it doesn't call yield(), time checking cannot occur.
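For readers more at home in Python, the same cooperative idea can be sketched with asyncio. This is an analogy, not a translation of the Julia code: it runs on a single thread rather than across worker processes, and the function names wait_for_time, run_with_limit and map_with_limit are made up for this illustration. As with yield() above, the computation has to hand control back (here via await asyncio.sleep(0)) or the time limit cannot take effect.

```python
import asyncio
import random
import time


async def wait_for_time(wait_time):
    """Simulate a computation of unknown length that yields cooperatively."""
    t0 = time.monotonic()
    x = 0.0
    while time.monotonic() - t0 < wait_time:
        x += random.random() - 0.5
        # Hand control back to the event loop. Without this, the timeout
        # in run_with_limit could not interrupt the loop.
        await asyncio.sleep(0)
    return round(wait_time, 2)


async def run_with_limit(wait_time, time_limit):
    """Return the result, or the string "TimeOut" if the limit is exceeded."""
    try:
        return await asyncio.wait_for(wait_for_time(wait_time), timeout=time_limit)
    except asyncio.TimeoutError:
        return "TimeOut"


async def map_with_limit(runtimes, time_limit=1.0):
    """Apply run_with_limit to every runtime concurrently, keeping input order."""
    tasks = (run_with_limit(r, time_limit) for r in runtimes)
    return await asyncio.gather(*tasks)
```

Calling asyncio.run(map_with_limit(runtimes, time_limit=1.0)) returns a list that mixes rounded runtimes with "TimeOut" entries, in the same spirit as the Julia output above.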

How to speed up MATLAB integration?

I have the following code:
function [] = Solver( t )
%pre-declaration
foo=[1,1,1];
fooCell = num2cell(foo);
[q, val(q), star]=fooCell{:};
%functions used in prosomoiwsh
syms q val(q) star;
qd1=symfun(90*pi/180+30*pi/180*cos(q),q);
qd2=symfun(90*pi/180+30*pi/180*sin(q),q);
p1=symfun(79*pi/180*exp(-1.25*q)+pi/180,q);
p2=symfun(79*pi/180*exp(-1.25*q)+pi/180,q);
e1=symfun(val-qd1,q);
e2=symfun(val-qd2,q);
T1=symfun(log(-(1+star)/star),star);
T2=symfun(log(star/(1-star)),star);
%anonymous function handles
lambda=[0.75;10.494441313222076];
calcEVR_handles={@(t,x)[double(subs(diff(subs(T1,star,e1/p1),q)+subs(lambda(1)*T1,star,e1/p1),{diff(val,q);val;q},{x(2);x(1);t})),double(subs(diff(subs(T1,star,e1/p1),q)+subs(lambda(1)*T1,star,e1/p1),{diff(val,q);val;q},{0;x(1);t})),double(subs(double(subs(subs(diff(T1,star),star,e1/p1),{val;q},{x(1);t}))/p1,q,t))];@(t,x)[double(subs(diff(subs(T2,star,e2/p2),q)+subs(lambda(2)*T2,star,e2/p2),{diff(val,q);val;q},{x(4);x(3);t})),double(subs(diff(subs(T2,star,e2/p2),q)+subs(lambda(2)*T2,star,e2/p2),{diff(val,q);val;q},{0;x(3);t})),double(subs(double(subs(subs(diff(T2,star),star,e2/p2),{val;q},{x(3);t}))/p2,q,t))]};
options = odeset('AbsTol',1e-1,'RelTol',1e-1);
[T,x_r] = ode23(@prosomoiwsh,[0 t],[80*pi/180;0;130*pi/180;0;2.4943180186983711;11.216948999754299],options);
save newresult T x_r
function dx_th = prosomoiwsh(t,x_th)
%declarations
k=0.80773938740480955;
nf=6.2860930902603602;
hGa=0.16727117784664769;
hGb=0.010886618389781832;
dD=0.14062935253218495;
s=0.64963817519705203;
IwF={[4.5453398382686956 5.2541234145178066 -6.5853972592002235 7.695225990702979];[-4.4358339284697337 -8.1138542053372298 -8.2698210582548395 3.9739729629084071]};
IwG={[5.7098975358444752 4.2470526600975802 -0.83412489434697168 0.53829395964565041] [1.8689492167233894 -0.0015017513794517434 8.8666804106266461 -1.0775021663921467];[6.9513235639494155 -0.8133752392893685 7.4032432556804162 3.1496138243338709] [5.8037182454981568 2.0933267947187457 4.852362963697928 -0.10745559204132382]};
IbF={-1.2165533594615545;7.9215291787744917};
IbG={2.8425752327892844 2.5931576770598168;9.4789237295474873 7.9378928037841252};
p=2;
m=2;
signG=1;
n_vals=[2;2];
nFixedStates=4;
gamma_nn=[0.31559428834175318;9.2037894041383641];
th_star_guess=[2.4943180186983711;11.216948999754299];
%solution
x = x_th(1:nFixedStates);
th = x_th(nFixedStates+1:nFixedStates+p);
f = zeros(m,1);
G = zeros(m,m);
ZF = zeros(p,m);
ZG = zeros(p,m,m);
for i=1:m
[f(i), ZF(:,i)] = calculate_neural_output(x, IwF{i}, IbF{i}, th);
for j=1:m
[G(i,j), ZG(:,i,j)] = calculate_neural_output(x, IwG{i,j}, IbG{i,j}, th);
end
end
detG = det(G);
if m == 1
adjG = 1;
else
adjG = detG*G^-1;
end
E = zeros(m,1);
V = zeros(m,1);
R = zeros(m,m);
for i=1:m
EVR=calcEVR_handles{i}(t,x);
E(i)=EVR(1);
V(i)=EVR(2);
R(i,i)=EVR(3);
end
Rinv = R^-1;
prod_R_E = R*E;
ub = f + Rinv * (V + k*E) + nf*prod_R_E;
ua = - detG / (detG^2+dD) * (adjG * ub) ;
u = ua - signG * (hGa*(ua'*ua) + hGb*(ub'*ub)) * prod_R_E;
dx_th = zeros(nFixedStates+p, 1); %preallocation
%System in form (1) of the IEEE paper
[vec_sys_f, vec_sys_G] = sys_f_G(x);
dx_nm = vec_sys_f + vec_sys_G*u;
%Calculation of dx
index_start = 1;
index_end = -1;
for i=1:m
index_end = index_end + n_vals(i);
for j=index_start:index_end
dx_th(j) = x(j+1);
end
dx_th(index_end+1) = dx_nm(i);
index_start = index_end + 2;
end
%Calculation of dth
AFvalueT = zeros(p,m);
for i=1:m
AFvalueT(:,i) = 0;
for j=1:m
AFvalueT(:,i) = AFvalueT(:,i)+ZG(:,i,j)*ua(j);
end
end
dx_th(nFixedStates+1:nFixedStates+p) = diag(gamma_nn)*( (ZF+AFvalueT)*prod_R_E -s*(th-th_star_guess) );
display(t)
end
function [y, Z] = calculate_neural_output(input, Iw, Ib, state)
Z = [tanh(Iw*input+Ib);1];
y = state' * Z;
end
function [ f,g ] = sys_f_G( x )
Iz1=0.96;
Iz2=0.81;
m1=3.2;
m2=2.0;
l1=0.5;
l2=0.4;
g=9.81;
q1=x(1);
q2=x(3);
q1dot=x(2);
q2dot=x(4);
M=[Iz1+Iz2+m1*l1^2/4+m2*(l1^2+l2^2/4+l1*l2*cos(q2)),Iz2+m2*(l2^2/4+l1*l2*cos(q2)/2);Iz2+m2*(l2^2/4+l1*l2*cos(q2)/2),Iz2+m2*l2^2/4];
c=0.5*m2*l1*l2*sin(q2);
C=[-c*q2dot,-c*(q1dot+q2dot);c*q1dot,0];
G=[0.5*m1*g*l1*cos(q1)+m2*g*(l1*cos(q1)+0.5*l2*cos(q1+q2));0.5*m2*g*l2*cos(q1+q2)];
f=-M\(C*[q1dot;q2dot]+G);
g=inv(M);
end
end
Its goal is to simulate the control of a 2-DOF robotic arm using a certain control law. The results I get after running the simulation are correct (I have a graph of the output I should expect), but it takes ages to finish!
Is there anything I could do to speed up the process?
In order to improve the computational speed of any integration in Matlab, a few options are available to you:
Reduce the required accuracy (which you already have done)
Use a better-suited integrator. As mentioned by @sanchises, ode23 can sometimes take longer than another MATLAB ODE solver (if your equation is stiff, for instance). You could try to determine from the documentation which solver is best suited... or simply try them all!
The best solution, but by far the most time consuming, would be to use a compiled language, such as C or Fortran. If the integration is but a part of your Matlab program, you could use Mex files, and translate only the integration to a compiled language. You could also create dynamic libraries in your compiled language and load them in Matlab using loadlibrary. I use loadlibrary and an integration routine written in Fortran for the integration of orbits and trajectories, and I get over 100 times speedup with Fortran vs. Matlab! Of course, technically, the integration is not in Matlab anymore... But the library or Mex files trick allows you to only convert the integration part of your program to a different language! A number of open source integrators are available, such as ODEPACK or RKSUITE in Fortran. Then, you only need to create a wrapper and your dynamics function in the correct language.
So to put it in a nutshell, if you're going to use this integration a lot, I would advise using a compiled language. If not, you should make do with Matlab, and be patient!
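To illustrate why the choice of solver matters for stiff equations, here is a minimal Python sketch (not MATLAB, and not the system from the question). It uses the test equation y' = -lam*y, picked here purely as an assumption for illustration, and compares explicit Euler with implicit Euler. Explicit Euler is only stable for step sizes h <= 2/lam, so a stiff problem forces it to take many tiny steps, while the implicit method stays stable at any step size.

```python
def explicit_euler(lam, y0, t_end, steps):
    """Integrate y' = -lam*y with explicit (forward) Euler."""
    h = t_end / steps
    y = y0
    for _ in range(steps):
        y = y + h * (-lam * y)  # each step multiplies y by (1 - h*lam)
    return y


def implicit_euler(lam, y0, t_end, steps):
    """Integrate y' = -lam*y with implicit (backward) Euler."""
    h = t_end / steps
    y = y0
    for _ in range(steps):
        y = y / (1.0 + h * lam)  # solves y_new = y + h*(-lam*y_new)
    return y


def compare(lam=1000.0, t_end=1.0, steps=100):
    """Return both results; the exact solution exp(-lam*t_end) is close to 0."""
    return (explicit_euler(lam, 1.0, t_end, steps),
            implicit_euler(lam, 1.0, t_end, steps))
```

With lam=1000 and 100 steps, h*lam is 10, so the explicit result grows without bound while the implicit one decays toward zero. This is the kind of behavior that makes a non-stiff solver slow (it must shrink its step) on a stiff system.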

Multiple linear optimizations

I'm interested in solving a few hundred linear programs in MATLAB. At the moment this is done by a for-loop with linprog.
The vectors used have identical dimensions and are rows of one matrix.
for combination_id = 1:1000
[tempOperatingPointsVectors,tempTargetValue, exitflag] = ...
linprog( lo_c(combination_id,:), ...
[], [], ...
lo_G(:,:,combination_id), lo_d(:,combination_id), ...
lo_u(:,combination_id), lo_v(:,combination_id), ...
x0_in, options);
end
Is there a way to use linprog with the whole matrices at once instead of picking out each row?
I also tried a parfor loop, but since the work in each iteration is very small, there was no speed improvement.
Why can't you set up one big linear program and then solve all of it at once?
Since I don't have your data, I cannot test the following code, but the basic idea should work.
xVar = 1:size(lo_c,2);
uBound = lo_u(:, 1);
vBound = lo_v(:, 1);
dMat = lo_d(:, 1);
gMat = lo_G(:,:, 1);
objMat = lo_c(1,:);
x0_inMat = x0_in;
for combination_id = 2:1000
    xVar = [xVar, xVar(end)+1:xVar(end)+size(lo_c,2)];
    uBound = [uBound; lo_u(:, combination_id)];
    vBound = [vBound; lo_v(:, combination_id)];
    dMat = [dMat; lo_d(:, combination_id)];
    % each subproblem has its own variables, so its constraint block goes on the diagonal
    gMat = blkdiag(gMat, lo_G(:,:, combination_id));
    % the objective is one long vector over all stacked variables
    objMat = [objMat, lo_c(combination_id,:)];
    x0_inMat = [x0_inMat; x0_in];
end
[tempOperatingPointsVectors,tempTargetValue, exitflag] = ...
    linprog( objMat, ...
    [], [], ...
    gMat, dMat, ...
    uBound, vBound, ...
    x0_inMat, options);
Should do the trick.
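The stacking idea can be sketched without any solver. The Python below is an illustration only: the dictionary keys c, A_eq, b_eq, lb and ub are names assumed here for the roles played by lo_c, lo_G, lo_d, lo_u and lo_v, and the helpers block_diag and stack_problems are made up for this sketch. Because the subproblems share no variables and no constraints, minimizing the sum of their objectives minimizes each one separately, so the solution of the stacked problem can be split back into per-problem solutions.

```python
def block_diag(blocks):
    """Place each matrix (a list of rows) on the diagonal of a zero matrix."""
    total_cols = sum(len(b[0]) for b in blocks)
    out = []
    col_offset = 0
    for b in blocks:
        width = len(b[0])
        for row in b:
            left = [0.0] * col_offset
            right = [0.0] * (total_cols - col_offset - width)
            out.append(left + list(row) + right)
        col_offset += width
    return out


def stack_problems(problems):
    """Combine independent LPs into one large LP with block structure."""
    return {
        # Objective and bounds are simply concatenated.
        "c": [v for p in problems for v in p["c"]],
        "lb": [v for p in problems for v in p["lb"]],
        "ub": [v for p in problems for v in p["ub"]],
        # Equality constraints: block-diagonal matrix, concatenated right-hand side.
        "A_eq": block_diag([p["A_eq"] for p in problems]),
        "b_eq": [v for p in problems for v in p["b_eq"]],
    }
```

Whether one large problem is actually faster than a loop depends on the solver and on how well it exploits sparsity; with 1000 blocks, a sparse representation of the constraint matrix is probably needed.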
