IDL: How to divide a FOR loop into N parts for parallel execution?

I have a time-consuming loop of length 300 that I would like to execute in parallel.
Pseudocode:
for t=0, 299 do begin
  output_data[t] = function(input_data(t))
endfor
• function() is exactly the same for each iteration
• input_data(t) is stored in a file
Is it possible to divide the 300 iterations into K parallel processes (where K is the number of CPUs)?
I found split_for.pro, but if I understand correctly it distributes different processes within the same nth cycle of the loop.
How can I do this?
Thank you!

I have some routines in my library that you could use to do something like the following:
pool = obj_new('MG_Pool', n_processes=k)
x = indgen(300)
output_data = pool->map('my_function', x)
Here, my_function would need to accept an argument i, get the data associated with index i, and apply function to it. The result would then be put into output_data[i].
You can specify the number of processes for the pool object with the N_PROCESSES keyword; otherwise it automatically uses the number of cores you have available.
The code is in my library; check the src/multiprocessing directory. See the examples/multiprocessing directory for examples of using it.

You can use IDL_IDLBridge along with object arrays
to create multiple child processes.
Child processes do not inherit variables from main process,
so you may want to use SHMMAP, SHMVAR, SHMUNMAP
to share variables across child processes,
or use SETVAR method of IDL_IDLBridge if memory is not a problem.
As an example, below I create 5 child processes to distribute the for-loop:
dim_input = size(input_data, /dim) ; obtain the data dimension
dim_output = size(output_data, /dim)
shmmap, dimension=dim_input, get_name=seg_input ; set the memory segment
shmmap, dimension=dim_output, get_name=seg_output
shared_input = shmvar(seg_input)
shared_output = shmvar(seg_output)
shared_input[0] = input_data ; assign data to the shared variable
; shared_data[0, 0] if data is 2d
shared_output[0] = output_data
; initialize child processes
procs = objarr(5)
for i = 0, 4 do begin
  procs[i] = IDL_IDLBridge(output='')
  procs[i].setvar, 'istart', i*60
  procs[i].setvar, 'iend', (i+1)*60 - 1
  procs[i].setvar, 'seg_input', seg_input
  procs[i].setvar, 'seg_output', seg_output
  procs[i].setvar, 'dim_input', dim_input
  procs[i].setvar, 'dim_output', dim_output
  procs[i].execute, 'shmmap, seg_input, dimension=dim_input'
  procs[i].execute, 'shmmap, seg_output, dimension=dim_output'
  procs[i].execute, 'shared_input = shmvar(seg_input)'
  procs[i].execute, 'shared_output = shmvar(seg_output)'
endfor
; execute the for-loop asynchronously
for i = 0, 4 do begin
  procs[i].execute, 'for t=istart, iend do ' + $
    'shared_output[t] = function(shared_input[t])', $
    /nowait
endfor
; wait until all child processes are idle
repeat begin
  n_idle = 0
  for i = 0, 4 do begin
    case procs[i].status() of
      0: n_idle++
      2: n_idle++
      else:
    endcase
  endfor
  wait, 1
endrep until (n_idle eq 5)
; cleanup child processes
for i = 0, 4 do begin
  procs[i].cleanup
  obj_destroy, procs[i]
endfor
; copy the whole output array back to output_data
output_data = shared_output
; unmap the shared memory segments
shmunmap, seg_input
shmunmap, seg_output
; drop the variables that are still attached to the segments
shared_input = 0
shared_output = 0
You may also want to optimize your function for multi-processing.
Lastly, to prevent multiple accesses to the memory segment,
you can use SEM_CREATE, SEM_LOCK, SEM_RELEASE, SEM_DELETE
functions/procedures provided by IDL.
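The example above hard-codes 5 workers with 60 iterations each (istart = i*60, iend = (i+1)*60 - 1). As a rough sketch of how the same inclusive bounds could be computed for an arbitrary number of workers, here is the arithmetic in Python; the function name chunk_bounds is made up for this illustration and is not part of IDL or of any library mentioned above.

```python
def chunk_bounds(n_items, n_workers):
    """Return inclusive (istart, iend) pairs that cover 0 .. n_items-1."""
    if n_items < 0 or n_workers < 1:
        raise ValueError("n_items must be >= 0 and n_workers >= 1")
    base, extra = divmod(n_items, n_workers)
    bounds = []
    start = 0
    for worker in range(n_workers):
        # The first `extra` workers take one additional iteration each.
        size = base + (1 if worker < extra else 0)
        if size == 0:
            break  # more workers than items: the remaining workers stay idle
        bounds.append((start, start + size - 1))
        start += size
    return bounds


print(chunk_bounds(300, 5))
# [(0, 59), (60, 119), (120, 179), (180, 239), (240, 299)]
```

With 5 workers this reproduces the bounds used in the IDL loop; with a worker count that does not divide 300 evenly, the leftover iterations are spread over the first few workers.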

Related

Improve code result speed by multiprocessing

I'm teaching myself Python and this is my first program.
I analyze server logs, and I usually need to analyze a full day of logs. I wrote a script (a simplified example) just to check the speed. With ordinary sequential code, analyzing 20 million rows takes about 12-13 minutes. I need to process 200 million rows in 5 minutes.
What I tried:
Multiprocessing (I ran into a shared-memory issue, which I think I fixed). But the result is 300K rows in 20 seconds, no matter how many processes I use. (PS: I also need to control the number of processes in advance.)
Threading (it gives no speedup: 300K rows in 2 seconds, the same as the sequential code).
asyncio (I thought the script was slow because it has to read many files). The result is the same as with threading: 300K rows in 2 seconds.
In the end I suspect that all three of my scripts are incorrect.
PS: I try to avoid third-party Python modules (like pandas) because they would make the script harder to run on different servers. I would rather use the standard library.
Please help me check the first one, multiprocessing.
import csv
import os
from multiprocessing import Process, Queue, Value, Manager
file = {"hcs.log", "hcs1.log", "hcs2.log", "hcs3.log"}
def argument(m, a, n):
    proc_num = os.getpid()
    a_temp_m = a["vod_miss"]
    a_temp_h = a["vod_hit"]
    with open(os.getcwd() + '/' + m, newline='') as hcs_1:
        hcs_2 = csv.reader(hcs_1, delimiter=' ')
        for j in hcs_2:
            if j[3].find('MISS') != -1:
                a_temp_m[n] = a_temp_m[n] + 1
            elif j[3].find('HIT') != -1:
                a_temp_h[n] = a_temp_h[n] + 1
    a["vod_miss"][n] = a_temp_m[n]
    a["vod_hit"][n] = a_temp_h[n]
if __name__ == '__main__':
    procs = []
    manager = Manager()
    vod_live_cuts = manager.dict()
    i = "vod_hit"
    ii = "vod_miss"
    cpu = 1
    n = 1
    vod_live_cuts[i] = manager.list([0] * cpu)
    vod_live_cuts[ii] = manager.list([0] * cpu)
    for m in file:
        proc = Process(target=argument, args=(m, vod_live_cuts, (n-1)))
        procs.append(proc)
        proc.start()
        if n >= cpu:
            n = 1
            proc.join()
        else:
            n += 1
    [proc.join() for proc in procs]
    [proc.close() for proc in procs]
I expect each file to be processed by an independent process running argument, with all results finally saved in the dict vod_live_cuts. I added a separate list to the dict for each process, thinking it would keep the processes from interfering with each other when they update it. But maybe that's the wrong way :(
IPC is costly, so use "shared objects" only for saving the final result, not for intermediate results while parsing the file.
To limit the number of processes, use a multiprocessing.Pool. The following code uses one to reach the maximum hard-disk speed; you only need to post-process the results.
You can only parse data as fast as your HDD can read it (typically 30-80 MB/s), so to improve performance further you should use an SSD or RAID 0 for higher disk speed. You cannot get much faster than this without changing your hardware.
import csv
import os
from multiprocessing import Process, Queue, Value, Manager, Pool
file = {"hcs.log", "hcs1.log", "hcs2.log", "hcs3.log"}
def argument(m, a):
    proc_num = os.getpid()
    a_temp_m_n = 0  # make it local to process
    a_temp_h_n = 0  # as shared lists use IPC
    with open(os.getcwd() + '/' + m, newline='') as hcs_1:
        hcs_2 = csv.reader(hcs_1, delimiter=' ')
        for j in hcs_2:
            if j[3].find('MISS') != -1:
                a_temp_m_n = a_temp_m_n + 1
            elif j[3].find('HIT') != -1:
                a_temp_h_n = a_temp_h_n + 1
    a["vod_miss"].append(a_temp_m_n)
    a["vod_hit"].append(a_temp_h_n)
if __name__ == '__main__':
    manager = Manager()
    vod_live_cuts = manager.dict()
    i = "vod_hit"
    ii = "vod_miss"
    cpu = 1
    vod_live_cuts[i] = manager.list()
    vod_live_cuts[ii] = manager.list()
    with Pool(cpu) as pool:
        tasks = []
        for m in file:
            task = pool.apply_async(argument, args=(m, vod_live_cuts))
            tasks.append(task)
        for task in tasks:
            task.get()
    print(list(vod_live_cuts[i]))
    print(list(vod_live_cuts[ii]))
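Since the point above is that IPC should carry only final results, one possible variation is to drop the Manager entirely and let each worker return its counts, which the parent then sums. This is only a sketch under assumptions: the names count_file and count_files are invented for this example, and it assumes, as in the code above, that the fourth space-delimited field holds the HIT/MISS marker.

```python
import csv
from multiprocessing import Pool


def count_file(path):
    """Count HIT and MISS rows in one space-delimited log file."""
    hits = misses = 0
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter=" "):
            if len(row) < 4:
                continue  # skip blank or short lines
            if "MISS" in row[3]:
                misses += 1
            elif "HIT" in row[3]:
                hits += 1
    return hits, misses


def count_files(paths, workers):
    """Run count_file over all paths in a pool and sum the per-file results."""
    with Pool(workers) as pool:
        results = pool.map(count_file, paths)
    total_hits = sum(h for h, _ in results)
    total_misses = sum(m for _, m in results)
    return total_hits, total_misses
```

As with the code above, count_files should be called from under an if __name__ == '__main__': guard, for example count_files(sorted(file), cpu). Only one small tuple per file crosses the process boundary.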

Replace For-loop to remove blobs (IDL)

I have the following function to identify blobs in an image and remove those under a certain size.
The for-loop makes the removal very slow when there are many blobs. Is it possible to replace the for-loop?
function clean_regions, input, max_size
  output = input
  tmp = size(input)
  input_labels = LABEL_REGION(input, /ULONG)
  hist = histogram(input_labels, binsize=1, locations=loc, /nan, /l64)
  to_remove = loc(where(hist le max_size))
  result_map = MAKE_ARRAY(tmp[1:2], /ULONG, VALUE=0)
  to_keep = mg_complement(to_remove, n_elements(loc))
  for i=0,n_elements(to_keep)-1 do begin
    result_map[where(input_labels EQ to_keep[i])] = 1
  endfor
  output *= result_map
  output = boolean(output gt 0)
  return, output
END

Python: using Pool to create multiple processes, but the tasks produce no results

I put all the functions in a class, including the function that creates the processes and the function they execute, and I call this class's functions from another file.
from multiprocessing import Pool
def initData(self, type):
    # create six process to deal with the data
    if type == 'train':
        data = pd.read_csv('./data/train_merged_8.csv')
    elif type == 'test':
        data = pd.read_csv('./data/test_merged_2.csv')
    modelvec = allWord2Vec('no').getModel()
    modelvec_all = allWord2Vec('all').getModel()
    modelvec_stop = allWord2Vec('stop').getModel()
    p = Pool(6)
    count = 0
    for i in data.index:
        count += 1
        p.apply_async(self.valueCal, args=(i, data, modelvec, modelvec_all, modelvec_stop))
        if count % 1000 == 0:
            print(str(count // 100) + 'h rows of data has been dealed')
    p.close()
    p.join
def valueCal(self, i, data, modelvec, modelvec_all, modelvec_stop):
    # the function run in process
    list_con = []
    q1 = str(data.get_value(i, 'question1')).split()
    q2 = str(data.get_value(i, 'question2')).split()
    f1 = self.getF1_union(q1, q2)
    f2 = self.getF2_inter(q1, q2)
    f3 = self.getF3_sum(q1, q2)
    f4_q1 = len(q1)
    f4_q2 = len(q2)
    f4_rate = f4_q1/f4_q2
    q1 = [','.join(str(ve)) for ve in q1]
    q2 = [','.join(str(ve)) for ve in q2]
    list_con.append('|'.join(q1))
    list_con.append('|'.join(q2))
    list_con.append(f1)
    list_con.append(f2)
    list_con.append(f3)
    list_con.append(f4_q1)
    list_con.append(f4_q2)
    list_con.append(f4_rate)
    f = open('./data/test.txt', 'a')
    f.write('\t'.join(list_con) + '\n')
    f.close()
The output below appears very quickly, but I never see the file being created. When I check the task manager, six processes have indeed been created and they consume a lot of CPU. When the program finishes, the file still has not been created.
How can I solve this problem?
10h rows of data have been dealed
20h rows of data have been dealed
30h rows of data have been dealed
40h rows of data have been dealed

Send and Receive operations between communicators in MPI

Following my previous question: Unable to implement MPI_Intercomm_create
The problem with MPI_INTERCOMM_CREATE has been solved. But when I try to implement a basic send/receive between process 0 of color 0 (global rank 0) and process 0 of color 1 (i.e. global rank 2), the code just hangs after printing the received buffer.
The code:
program hello
  include 'mpif.h'
  implicit none
  integer tag,ierr,rank,numtasks,color,new_comm,inter1,inter2
  integer sendbuf,recvbuf,stat(MPI_STATUS_SIZE)
  tag = 22
  sendbuf = 222
  call MPI_Init(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD,numtasks,ierr)
  if (rank < 2) then
    color = 0
  else
    color = 1
  end if
  call MPI_COMM_SPLIT(MPI_COMM_WORLD,color,rank,new_comm,ierr)
  if (color .eq. 0) then
    if (rank == 0) print*,' 0 here'
    call MPI_INTERCOMM_CREATE(new_comm,0,MPI_Comm_world,2,tag,inter1,ierr)
    call mpi_send(sendbuf,1,MPI_INT,2,tag,inter1,ierr)
    !local_comm,local leader,peer_comm,remote leader,tag,new,ierr
  else if(color .eq. 1) then
    if(rank ==2) print*,' 2 here'
    call MPI_INTERCOMM_CREATE(new_comm,2,MPI_COMM_WORLD,0,tag,inter2,ierr)
    call mpi_recv(recvbuf,1,MPI_INT,0,tag,inter2,stat,ierr)
    print*,recvbuf
  end if
end
Communication through intercommunicators is not well understood by most users, and there are fewer examples of it than of other MPI operations. You can find a good explanation by following this link.
Now, there are two things to remember:
1) Communication in an intercommunicator always goes from one group to the other group. When sending, the rank of the destination is its local rank in the remote group. When receiving, the rank of the sender is its local rank in the remote group.
2) Point-to-point communication (the MPI_Send and MPI_Recv family) is between one sender and one receiver. In your case, everyone in color 0 is sending and everyone in color 1 is receiving; however, if I understood your problem, you want process 0 of color 0 to send something to process 0 of color 1.
The sending code should be something like this:
call MPI_COMM_RANK(inter1,irank,ierr)
if(irank==0)then
  call mpi_send(sendbuf,1,MPI_INT,0,tag,inter1,ierr)
end if
The receiving code should look like:
call MPI_COMM_RANK(inter2,irank,ierr)
if(irank==0)then
  call mpi_recv(recvbuf,1,MPI_INT,0,tag,inter2,stat,ierr)
  print*,'rec buff = ', recvbuf
end if
The sample code uses a new variable, irank, to query the rank of each process in the intercommunicator; that is the rank of the process in its local group. So you will have two processes of rank 0, one in each group, and so on.
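To make the numbering concrete, here is a small plain-Python model of how MPI_COMM_SPLIT assigns local ranks when the key is the world rank. It makes no MPI calls and only illustrates the bookkeeping; the function name split_ranks is invented for this sketch.

```python
def split_ranks(world_size, color_of):
    """Mimic the rank numbering of MPI_COMM_SPLIT with key = world rank.

    Returns {world_rank: (color, local_rank)}.
    """
    next_local = {}
    mapping = {}
    for world_rank in range(world_size):  # ascending key order
        color = color_of(world_rank)
        local = next_local.get(color, 0)
        mapping[world_rank] = (color, local)
        next_local[color] = local + 1
    return mapping


# Same coloring as the Fortran program: ranks 0 and 1 get color 0, the rest color 1.
mapping = split_ranks(4, lambda rank: 0 if rank < 2 else 1)
for world_rank, (color, local) in mapping.items():
    print(f"world {world_rank}: color {color}, local rank {local}")
# world 0: color 0, local rank 0
# world 1: color 0, local rank 1
# world 2: color 1, local rank 0
# world 3: color 1, local rank 1
```

With 4 processes, world rank 2 is local rank 0 of color 1, which is why the destination passed to mpi_send on the intercommunicator is 0 and not 2.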
It is important to emphasize what other commenters on your post are saying: when writing a program these days, use modern constructs such as use mpi instead of include 'mpif.h' (see the comment from Vladimir F). Another piece of advice from your previous question was to use rank 0 as the local leader in both cases. If I combine those two ideas, your program can look like:
program hello
  use mpi !instead of include 'mpif.h'
  implicit none
  integer :: tag,ierr,rank,numtasks,color,new_comm,inter1,inter2
  integer :: sendbuf,recvbuf,stat(MPI_STATUS_SIZE)
  integer :: irank
  !
  tag = 22
  sendbuf = 222
  !
  call MPI_Init(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD,numtasks,ierr)
  !
  if (rank < 2) then
    color = 0
  else
    color = 1
  end if
  !
  call MPI_COMM_SPLIT(MPI_COMM_WORLD,color,rank,new_comm,ierr)
  !
  if (color .eq. 0) then
    call MPI_INTERCOMM_CREATE(new_comm,0,MPI_Comm_world,2,tag,inter1,ierr)
    !
    call MPI_COMM_RANK(inter1,irank,ierr)
    if(irank==0)then
      call mpi_send(sendbuf,1,MPI_INT,0,tag,inter1,ierr)
    end if
    !
  else if(color .eq. 1) then
    call MPI_INTERCOMM_CREATE(new_comm,0,MPI_COMM_WORLD,0,tag,inter2,ierr)
    call MPI_COMM_RANK(inter2,irank,ierr)
    if(irank==0)then
      call mpi_recv(recvbuf,1,MPI_INT,0,tag,inter2,stat,ierr)
      if(ierr/=MPI_SUCCESS)print*,'Error in rec '
      print*,'rec buff = ', recvbuf
    end if
  end if
  !
  call MPI_finalize(ierr)
end program hello

PID control - value of process parameter based on PID result

I'm trying to implement a PID controller following http://en.wikipedia.org/wiki/PID_controller
The mechanism I try to control works as follows:
1. I have an input variable which I can control. Typical values would be 0.5...10.
2. I have an output value which I measure daily. My goal for the output is roughly in the same range.
The two variables have strong correlation - when the process parameter goes up, the output generally goes up, but there's quite a bit of noise.
I'm following the implementation here:
http://code.activestate.com/recipes/577231-discrete-pid-controller/
Now the PID output seems to be correlated with the error term, not with the measured level of the output. So my guess is that I am not supposed to use it as-is for the process variable, but rather as some correction to the current value? How is that supposed to work exactly?
For example, if we take Kp=1, Ki=Kd=0, the process (input) variable is 4, the current output level is 3 and my target is a value of 2, I get the following:
error = 2-3 = -1
PID = -1
Then I should set the process variable to -1? or 4-1=3?
You need to think in terms of the PID controller correcting a manipulated variable (MV) for errors, and that you need to use an I term to get to an on-target steady-state result. The I term is how the PID retains and applies memory of the prior behavior of the system.
If you are thinking in terms of the output of the controller being changes in the MV, it is more of a 'velocity form' PID, and the memory of prior errors and behavior is integrated and accumulated in the prior MV setting.
From your example, it seems like a manipulated value of -1 is not feasible and that you would like the controller to suggest a value like 3 to get a process output (PV) of 2. For a PID controller to make use of "The process (input) variable is 4,..." (MV in my terms) Ki must be non-zero, and if the system was at steady-state, whatever was accumulated in the integral (sum_e=sum(e)) would precisely equal 4/Ki, so:
Kp = Ki = 1; Kd = 0
error = SV - PV = 2 - 3 = -1
sum_e = sum_e + error = 4/Ki - 1
MV = PID = Kp*error + Ki*sum_e = -1*Kp + (4/Ki - 1)*Ki = -Kp + 4 - Ki = -1 + 4 - 1 = 2
If you used a slower Ki than 1, it would smooth out the noise more and not adjust the MV so quickly:
Ki = 0.1
MV = PID = Kp*error + Ki*sum_e = -1*Kp + (4/Ki - 1)*Ki = -Kp + 4 - Ki = -1 + 4 - 0.1 = 2.9
At steady state at target (PV = SV), sum_e * Ki should produce the steady-state MV:
PV = SV
error = SV - PV = 0
Kp * error = 0
MV = 3 = PID = 0 * Kp + Ki * sum_e
A nice way to understand the PID controller is to put units on everything and think of Kp, Ki, Kd as conversions of the process error, accumulated error*timeUnit, and rate-of-change of error/timeUnit into terms of the manipulated variable, and that the controlled system converts the controller's manipulated variable into units of output.
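The arithmetic above can be checked with a few lines of Python. This is a minimal sketch of one step of a positional PI controller (Kd = 0), assuming as in the worked example that the integral starts at the steady-state value MV/Ki; the function name pid_output is made up for this illustration and is not taken from the linked recipe.

```python
def pid_output(kp, ki, sum_e, setpoint, measured):
    """One step of a positional PI controller (Kd = 0).

    Returns (mv, new_sum_e), where sum_e is the accumulated error.
    """
    error = setpoint - measured
    sum_e = sum_e + error
    mv = kp * error + ki * sum_e
    return mv, sum_e


# Kp = Ki = 1, integral preloaded with 4/Ki, SV = 2, PV = 3
mv, _ = pid_output(kp=1.0, ki=1.0, sum_e=4.0 / 1.0, setpoint=2.0, measured=3.0)
print(mv)  # 2.0

# Slower integral gain: the MV moves less in one step
mv_slow, _ = pid_output(kp=1.0, ki=0.1, sum_e=4.0 / 0.1, setpoint=2.0, measured=3.0)
print(round(mv_slow, 6))  # 2.9

# On target (PV = SV): the integral term alone holds the MV at 3
mv_steady, _ = pid_output(kp=1.0, ki=0.1, sum_e=3.0 / 0.1, setpoint=2.0, measured=2.0)
print(round(mv_steady, 6))  # 3.0
```

The controller output is used directly as the new manipulated variable; the previous setting of 4 enters only through the accumulated integral.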
