SAS parallel processing: Start k parallel jobs at a time until all N jobs are done

There might be a simple solution, but currently I cannot get my head around it. Maybe someone could help me out.
So I have the following problem:
I have N SAS jobs in total that I want to finish. As the resources on my machine are too limited to start all N jobs simultaneously, I only want to start, say, k = 5 at the same time. Every time one job finishes, I want to start the next one. The order in which the jobs complete is not important.
Currently, I let all k=5 jobs finish before I start the next 5.
So here is the pseudocode for what I am currently doing:
/*The following macro starts k threads*/
%macro parallel_processing(k);
options fullstimer autosignon=yes sascmd="sas -nonews -threads";
%global thread jj idlist N;
/* These are the N ID numbers for the jobs */
%let idlist = 1 2 3 4 5 ... N;
%let N = %sysfunc(countw(&idlist.)); /* total number of jobs */
%let jj = 0;
%do %until(&jj. = &N.);
%do thread = 1 %to &k.;
%let jj = %eval(&jj. + 1);
%syslput thread = &thread;
%syslput jj = &jj.;
%syslput idlist = &idlist.;
rsubmit process=task&thread. wait=no sysrputsync=yes;
%let id = %scan(&idlist., &jj.);
/* Do the job (the %job macro must be available in the remote session) */
%job(&id.);
endrsubmit;
%end;
/* HERE IS MY PROBLEM:
I want to wait for each job separately, and start a new one
with an increased id. So that constantly k threads are busy.
*/
/* Wait for all threads to finish */
waitfor _all_
%do thread = 1 %to &k.;
    task&thread
%end;
;
/* GET RESULTS FROM THREADS */
%do thread = 1 %to (&k.);
rget task&thread;
%end;
/* SIGNOFF THREADS*/
%do thread = 1 %to (&k.);
signoff task&thread;
%end;
%end;
%mend parallel_processing;
%parallel_processing(5);
Maybe somebody has a nice idea; I would be grateful. Thank you in advance!

Use waitfor _any_ ... instead of waitfor _all_ ...:
1. Start the first 5 tasks, keeping a note of which 5 tasks are active.
2. Wait for any one of them to finish and remove it from the list of active tasks.
3. Start the next task from the queue and add it to the list of active tasks.
4. Repeat steps 2 and 3 until there are no tasks left in the queue.
You can use any method you like to keep track of which 5 tasks are currently active; one possible sketch of the pattern follows.
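To make these steps concrete, here is a minimal sketch of that pattern. It reuses %job, the task&thread. sessions, and &idlist. from the question, assumes N >= k, and uses per-session done flags set with %sysrput as one possible way of tracking which sessions are free; the macro name is only illustrative.
%macro parallel_queue(k, n);
    options autosignon=yes sascmd="sas -nonews -threads";
    %local thread jj;
    %let jj = 0;

    /* 1. Start the first k tasks, one job id per remote session */
    %do thread = 1 %to &k.;
        %global done&thread.;
        %let done&thread. = 0;
        %let jj = %eval(&jj. + 1);
        %syslput thread = &thread. / remote=task&thread.;
        %syslput id = %scan(&idlist., &jj.) / remote=task&thread.;
        rsubmit task&thread. wait=no sysrputsync=yes;
            %job(&id.);
            %sysrput done&thread=1; /* tell the client this session is free again */
        endrsubmit;
    %end;

    /* 2. + 3. Whenever any session finishes, collect it and hand it the next id */
    %do %while (&jj. lt &n.);
        waitfor _any_ %do thread = 1 %to &k.; task&thread %end; ;
        %do thread = 1 %to &k.;
            %if &&done&thread. = 1 and &jj. lt &n. %then %do;
                rget task&thread.;
                %let done&thread. = 0;
                %let jj = %eval(&jj. + 1);
                %syslput id = %scan(&idlist., &jj.) / remote=task&thread.;
                rsubmit task&thread. wait=no sysrputsync=yes;
                    %job(&id.);
                    %sysrput done&thread=1;
                endrsubmit;
            %end;
        %end;
    %end;

    /* 4. Queue is empty: wait for the remaining tasks, collect results, sign off */
    waitfor _all_ %do thread = 1 %to &k.; task&thread %end; ;
    %do thread = 1 %to &k.;
        rget task&thread.;
        signoff task&thread.;
    %end;
%mend parallel_queue;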

Related

Matlab ROS slow publishing + subscribing

In my experience, MATLAB performs publish/subscribe operations with ROS slowly for some reason. I work with components defined as objects of a class, as shown in the test class below. Normally, objects of comparable structure are used to control mobile robots.
To quantify performance, I measured the time required for various operations and got the following results:
1x publishing a message + 1x simple subscriber callback : 3.7ms
Simply counting in a callback (per count): 2.1318e-03 ms
Creating a new message with msg1 = rosmessage(obj.publisher) adds 3.6-4.3ms per iteration
Pinging myself indicated a communication latency of 0.05 ms
The time required for a simple publish plus the start of a subscriber callback seems oddly long.
I want to have multiple system components as objects in my workspace so that they respond to ROS topic updates or timer events. The PC used for testing is not a monster, but it should not be garbage either.
Do you also think these times are unnecessarily large? They barely allow publishing a single topic at 200 Hz without doing anything else. Normally I have multiple lower-frequency topics (e.g. 20 Hz), but the total time consumed becomes significant.
Do you know any practices to make the system operate quicker?
What do you think of the OOP style of making control system components in general?
classdef subpubspeedMonitor < handle
% Use: call in matlab console, after initializing ros:
%
% SPM1 = subpubspeedMonitor()
%
% This will create an object which starts a set repetitive task upon creation
% and finally destructs itself after posting results in console.
properties
node
subscriber
publisher
timestart
messagetotal
end
methods
function obj = subpubspeedMonitor()
obj.node = ros.Node('subspeedmonitor1');
obj.subscriber = ros.Subscriber(obj.node,'topic1','sensor_msgs/NavSatFix',{@obj.rosSubCallback});
obj.publisher = ros.Publisher(obj.node,'topic1','sensor_msgs/NavSatFix');
obj.timestart = tic;
obj.messagetotal = 0;
msg1 = rosmessage(obj.publisher);
% Choose to evaluate subscriber + publisher loop or just counting
if 1
send(obj.publisher,msg1);
else
countAndDisplay(obj)
end
end
%% Test method one: repetitive publishing and subscribing
function rosSubCallback(obj,~,msg_) % ~3.7 ms per loop for a simple publish+subscribe action
% Latency to self is 0.05ms on average, according to "pinging" in terminal
obj.messagetotal = obj.messagetotal+1;
if obj.messagetotal <10000
%msg1 = rosmessage(obj.publisher); % this line adds 4.3000ms per loop
msg_.Longitude = 51; % this line adds 0.25000 ms per loop
send(obj.publisher,msg_)
else
% Display some results
timepassed = toc(obj.timestart);
time_per_pubsub = timepassed/obj.messagetotal
delete(obj);
end
end
%% Test method two: simply counting
function countAndDisplay(obj) % this costs 2.1318e-03 ms(!) per loop
obj.messagetotal = obj.messagetotal+1;
if obj.messagetotal <10000
%msg1 = rosmessage(obj.publisher); %adds 3.6ms per loop
%i = 1% adds 5.7532e-03 ms per loop
%msg1 = rosmessage("std_msgs/Bool"); %adds 1.5ms per loop
countAndDisplay(obj);
else
% Display some results
timepassed = toc(obj.timestart);
time_per_count_FCN = timepassed/obj.messagetotal
delete(obj);
end
end
%% Deconstructor
function delete(obj)
delete(obj.subscriber)
delete(obj.publisher)
delete(obj.node)
end
end
end

IDL: How to divide a FOR loop into N parts for parallel execution?

I have a time-consuming loop of length 300 that I would like to execute in parallel.
Pseudocode:
for t=0, 300 do begin
output_data[t] = function(input_data(t))
endfor
• The function() for each iteration is completely the same
• The input_data(t) is stored in a file
Is it possible to divide the 300 iterations into k parallel processes (where k is the number of CPUs)?
I found split_fot.pro, but if I understand correctly it is for dividing the different processes within the same nth cycle of the loop.
How can I do this?
Thank you!!
I have some routines in my library that you could use to do something like the following:
pool = obj_new('MG_Pool', n_processes=k)
x = indgen(300)
output_data = pool->map('my_function', x)
Here, my_function would need to accept an argument i, get the data associated with index i, and apply the function to it. The result is then put into output_data[i].
You can specify the number of processes to use for the pool object with the N_PROCESSES keyword, or it will automatically use the number of cores you have available.
The code is in my library; check the src/multiprocessing directory. See the examples/multiprocessing directory for some examples of using it.
You can use IDL_IDLBridge along with object arrays to create multiple child processes. Child processes do not inherit variables from the main process, so you may want to use SHMMAP, SHMVAR, and SHMUNMAP to share variables across child processes, or use the SETVAR method of IDL_IDLBridge if memory is not a problem.
As an example, below I create 5 child processes to distribute the for-loop:
dim_input = size(input_data, /dim) ; obtain the data dimension
dim_output = size(output_data, /dim)
shmmap, dimension=dim_input, get_name=seg_input ; set the memory segment
shmmap, dimension=dim_output, get_name=seg_output
shared_input = shmvar(seg_input)
shared_output = shmvar(seg_output)
shared_input[0] = input_data ; assign data to the shared variable
; shared_input[0, 0] if the data is 2-D
shared_output[0] = output_data
; initialize child processes
procs = objarr(5)
for i = 0, 4 do begin
procs[i] = IDL_IDLBridge(output='')
procs[i].setvar, 'istart', i*60
procs[i].setvar, 'iend', (i+1)*60 - 1
procs[i].setvar, 'seg_input', seg_input
procs[i].setvar, 'seg_output', seg_output
procs[i].setvar, 'dim_input', dim_input
procs[i].setvar, 'dim_output', dim_output
procs[i].execute, 'shmmap, seg_input, dimension=dim_input'
procs[i].execute, 'shmmap, seg_output, dimension=dim_output'
procs[i].execute, 'shared_input = shmvar(seg_input)'
procs[i].execute, 'shared_output = shmvar(seg_output)'
endfor
; execute the for-loop asynchronously
for i = 0, 4 do begin
procs[i].execute, 'for t=istart, iend do ' + $
'shared_output[t] = function(shared_input[t])', $
/nowait
endfor
; wait until all child processes are idle
repeat begin
n_idle = 0
for i = 0, 4 do begin
case procs[i].status() of
0: n_idle++
2: n_idle++
else:
endcase
endfor
wait, 1
endrep until (n_idle eq 5)
; cleanup child processes
for i = 0, 4 do begin
procs[i].cleanup
obj_destroy, procs[i]
endfor
; assign output values back to output_data
output_data = shared_output[0]
; unmap the shared variables
shmunmap, seg_input
shmunmap, seg_output
shared_input = 0
shared_output = 0
You may also want to optimize your function for multiprocessing. Lastly, to prevent concurrent accesses to the shared memory segment, you can use the SEM_CREATE, SEM_LOCK, SEM_RELEASE, and SEM_DELETE functions/procedures provided by IDL, as sketched below.
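A minimal sketch of those semaphore calls (the semaphore name and the guarded statement are illustrative only) could look like this:
; in the main process, before starting the children
if ~sem_create('shm_lock') then message, 'could not create semaphore'
; inside each child process, around any access to the shared segments
while ~sem_lock('shm_lock') do wait, 0.01 ; spin until the lock is acquired
shared_output[t] = shared_input[t]        ; critical section (illustrative)
sem_release, 'shm_lock'
; in the main process, once all children are done
sem_delete, 'shm_lock'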

Reading in parallel from a generator in Keras

I have a big dataset divided into files.
I would like to read and process my data one file at a time, and for this I have the following Keras generator:
def myGenerator():
    while 1:
        rnd = random.randint(1, 200)
        strRnd = str(rnd)
        lenRnd = len(strRnd)
        rndPadded = strRnd.rjust(5, '0')
        nSearchesInBatch = 100
        f = "path/part-" + rndPadded + "*"  # read one block of data
        data = sqlContext.read.load(f).toPandas()
        imax = int(data.shape[0] / nSearchesInBatch)  # number of batches created sequentially from the generator
        for i in range(imax):
            data_batch = data[i * nSearchesInBatch:(i + 1) * nSearchesInBatch]
            features = data_batch['features']
            output = data_batch['output']
            yield features, output
The problem is that the reading takes most of the time (each file is around 200 MB), and in the meantime the GPU sits waiting. Is it possible to pre-read the next batch while the GPU is training on the previous one?
At the moment one file is read and split into steps (the inner loop); the CPUs sit idle while the GPU is training, and as soon as the epoch finishes the GPU goes idle and the CPU starts reading (which takes 20-30 seconds).
Any solution to parallelize this?
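One common way to overlap the reads with the training (a sketch, not from the original post; prefetch and reader are hypothetical names) is to wrap the generator so that a background thread keeps the next chunk ready in a small queue:
import queue
import threading

def prefetch(generator, buffer_size=1):
    # Yield items from `generator` while a background thread pre-reads the next ones.
    q = queue.Queue(maxsize=buffer_size)

    def reader():
        for item in generator:
            q.put(item)  # blocks while the buffer is already full

    threading.Thread(target=reader, daemon=True).start()
    while True:
        yield q.get()  # myGenerator() loops forever, so no end-of-stream handling here

# usage: model.fit_generator(prefetch(myGenerator()), steps_per_epoch=..., epochs=...)
Keras' fit_generator also accepts workers= and use_multiprocessing= arguments that achieve a similar effect, provided the generator is thread- or process-safe.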

The most efficient way to read an unformatted file

I am currently processing 100,000 files with Fortran. These data were generated on an HPC system using MPI I/O. So far I can only come up with the following way to read the raw data, which is not efficient. Is it possible to read a whole ut_yz(:,J,K) at a time instead of reading the values one by one? Thanks.
The old code is as follows, and its efficiency is not high:
OPEN(10,FILE=trim(filename)//".dat",FORM='UNFORMATTED',&
ACCESS='DIRECT', RECL=4, STATUS='OLD')
!,CONVERT='big_endian'
COUNT = 1
DO K=1,nz
DO J=1,ny
DO I=1,nxt
READ(10,REC=COUNT) ut_yz(I,J,K)
COUNT = COUNT + 1
ENDDO
ENDDO
ENDDO
CLOSE(10)
The desired version is:
OPEN(10,FILE=trim(filename)//".dat",FORM='UNFORMATTED', RECL=4, STATUS='OLD')
!,CONVERT='big_endian'
COUNT = 1
DO K=1,nz
DO J=1,ny
READ(10,REC=COUNT) TEMP(:)
COUNT = COUNT + 153
ut_yz(:,J,K)=TEMP(:)
ENDDO
ENDDO
CLOSE(10)
However, it always fails. Can anyone make a comment on this? Thanks.
A direct-access read will read a single record, if I am not mistaken. Thus, in your new code version you need to increase the record length accordingly:
inquire(iolength=rl) ut_yz(:,1,1)
open(10, file=trim(filename)//'.dat', form='UNFORMATTED', recl=rl, status='OLD', action='READ')
count = 1
do k=1,nz
do j=1,ny
read(10, rec=count) ut_yz(:,j,k)
count = count + 1
end do
end do
close(10)
Of course, in this example you could also read the complete array at once, which should be the fastest option:
inquire(iolength=rl) ut_yz
open(10, file=trim(filename)//'.dat', form='UNFORMATTED', recl=rl, status='OLD', action='READ')
read(10, rec=1) ut_yz
close(10)

Rufus Scheduler: Every day, start a job at time X and process in batches with a fixed interval

I am trying to figure out how to do the following.
I want to start a job at 12:00 pm every day that computes a list of things and then processes these things in batches of size 'b', with a guaranteed interval of x minutes between the end of one batch and the beginning of the next.
schedule.cron '00 12 * * *' do
# things = Compute the list of things
schedule.interval '10m', # Job.new(things[index, 10]) ???
end
I tried something like the following but I hate that I have to pass in the scheduler to the job or access the singleton scheduler.
class Job < Struct.new(:things, :batch_index, :scheduler)
  def call
    if (batch_index + 1) >= things.length
      ap "Done with all batches"
    else
      # Process batch
      scheduler.in('10m', Job.new(things, batch_index + 1, scheduler))
    end
  end
end

scheduler = Rufus::Scheduler.new
scheduler.cron '00 12 * * *', Job.new([[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]], 0, scheduler)
With a bit of help from https://github.com/jmettraux/rufus-scheduler#scheduling-handler-instances, I'd prepare a class like:
class Job
  def initialize(batch_size)
    @batch_size = batch_size
  end

  def call(job, time)
    # > I want to start a job (...) which computes a list of things and
    # > then processes these things in batches of size 'b'.
    things = compute_list_of_things()
    loop do
      ts = things.shift(@batch_size)
      do_something_with_the_batch(ts)
      break if things.empty?
      # > and guarantee an interval of x minutes between the end of one batch
      # > and the beginning of another.
      sleep 10 * 60
    end
  end
end
I'd pass an instance of that class to the scheduler instead of a block:
scheduler = Rufus::Scheduler.new
# > I want to start a job at 12:00pm every day which computes...
scheduler.cron '00 12 * * *', Job.new(10) # batch size of 10...
I don't bother using the scheduler for the 10-minute wait.
