Python Multiprocessing slight slower than Multithreading on Windows

Python Multiprocessing slight slower than Multithreading on Windows - windows

I have been experimenting my code to send "parallel" commands to multiple serial COM ports.
My multi-threading code consists of:
global q
q = Queue()
devices = [0, 1, 2, 3]
for i in devices:
q.put(i)
cpus=cpu_count() #detect number of cores
logging.debug("Creating %d threads" % cpus)
for i in range(cpus):
t = Thread(name= 'DeviceThread_'+str(i), target=testFunc1)
t.daemon = True
t.start()
and multi-processing code consists of:
devices = [0, 1, 2, 3]
cpus=cpu_count() #detect number of cores
pool = Pool(cpus)
results = pool.map(multi_run_wrapper, devices)
I observe that the task of sending serial commands to 4 COM ports in "parallel" takes about 6 seconds and multi-processing always always takes a 0.5 to 1 second of additional total run time.
Any inputs on why the discrepancy on a Windows machine?

Well, for one, you're not comparing apples to apples. If you want equivalent code, use multiprocessing.dummy.Pool in your threaded case (which is the same as multiprocessing.Pool implemented in terms of threads, not processes), so you're at least using the same basic parallelization model with different internal implementations, not changing everything all at once.
Beyond that, launching the workers and communicating data to them has some overhead, more on Windows than on other systems since Windows can't fork to spawn new processes cheaply; it has to spawn a new Python instance and then copy over state via IPC to approximate forking.
Aside from that, you haven't provided enough information; your process and thread based worker functions aren't provided, and could cause significant differences in behavior. Nor have you provided information on how you're performing timing. Similarly, if each worker process needs to reinitialize the COM port communication library, that could involve non-trivial overhead.

Related

What is the best way to do a very large and repetitive task?

I need to get the duration of 18000+ audios, using the audioread library for each audio takes about 300ms, ie at least 25~30 minutes of processing.
Using a Queue andProcess system, using all available cores of my processor, I can lower the processing average of each audio to 70ms, but it will still take 21 minutes, how can I improve this? I would like to be able to read all the audios in at least 5 minutes, remembering that I have no competition on the machine, it will only run my software, so I can consume all the resources.
Code of function read the duration:
def get_duration(q, au):
while not q.empty():
index = q.get()
with audioread.audio_open(au[index]['absolute_path']) as f:
au[index]['duration'] = f.duration * 1000
Code to create the Processes:
for i in range(os.cpu_count()):
pr = Process(target=get_duration, args=(queue, audios, ))
pr.daemon = True
pr.start()
In my code there is only one Queue with some Process, and I use Manager to edit the objects.

I would look into using a complied solution to speed up your script. Pythons threading still leaves room for improvement.
Try a compiled language.
If you still want python, separate as many functions to different threads as your system supports. use threading.thread.
Also see this wiki for info on efficient loops in python. The gist of the story is that for loops incur overhead.

Parallel computing in Julia - running a simple for-loop on multiple cores

For starters, I have to say I'm completely new to parallel computing (and know close to nothing about computer science), so my understanding of what things like "workers" or "processes" actually are is very limited. I do however have a question about running a simple for-loop that presumably has no dependencies between the iterations in parallel.
Let's say I wanted to do the following:
for N in 1:5:20
println("The N of this iteration in $N")
end
If I simply wanted these messages to appear on screen and the order of appearance didn't matter, how could one achieve this in Julia 0.6, and for future reference in Julia 0.7 (and therefore 1.0)?

Just to add the example to the answer of Chris. Since the release of julia 1.3 you do this easily with Threads.#threads
Threads.#threads for N in 1:5:20
println("The number of this iteration is $N")
end
Here you are running only one julia session with multiple threads instead of using Distributed where you run multiple julia sessions.
See, e.g. multithreading blog post for more information.

Distributed Processing
Start julia with e.g. julia -p 4 if you want to use 4 cpus (or use the function addprocs(4)). In Julia 1.x, you make a parallel loop as following:
using Distributed
#distributed for N in 1:5:20
println("The N of this iteration in $N")
end
Note that every process have its own variables per default.
For any serious work, have a look at the manual https://docs.julialang.org/en/v1.4/manual/parallel-computing/, in particular the section about SharedArrays.
Another option for distributed computing are the function pmap or the package MPI.jl.
Threads
Since Julia 1.3, you can also use Threads as noted by wueli.
Start julia with e.g. julia -t 4 to use 4 threads. Alternatively you can or set the environment variable JULIA_NUM_THREADS before starting julia.
For example Linux/Mac OS:
export JULIA_NUM_THREADS=4
In windows, you can use set JULIA_NUM_THREADS 4 in the cmd prompt.
Then in julia:
Threads.#threads for N = 1::20
println("N = $N (thread $(Threads.threadid()) of out $(Threads.nthreads()))")
end
All CPUs are assumed to have access to shared memory in the examples above (e.g. "OpenMP style" parallelism) which is the common case for multi-core CPUs.

What is the best general purpose computing practice in OpenCL for iterative problems?

When we have a program that requires lots of operations over a large data sets and the operations on each of the data elements are independent, OpenCL can be one of the good choice to make it faster. I have a program like the following:
while( function(b,c)!=TRUE)
{
[X,Y] = function1(BigData);
M = functionA(X);
b = function2(M);
N = functionB(Y);
c = function3(N);
}
Here the function1 is applied on each of the elements on the BigData and produce another two big data sets (X,Y). function2 and function3 are then applied operation individually on each of the elements on these X,Y data, respectively.
Since the operations of all the functions are applied on each of the elements of the data sets independently, using GPU might make it faster. So I come up with the following:
while( function(b,c)!=TRUE)
{
//[X,Y] = function1(BigData);
1. load kernel1 and BigData on the GPU. each of the thread will work on one of the data
element and save the result on X and Y on GPU.
//M = functionA(X);
2a. load kernel2 on GPU. Each of the threads will work on one of the
data elements of X and save the result on M on GPU.
(workItems=n1, workgroup size=y1)
//b = function2(M);
2b. load kernel2 (Same kernel) on GPU. Each of the threads will work on
one of the data elements of M and save the result on B on GPU
(workItems=n2, workgroup size=y2)
3. read the data B on host variable b
//N = functionB(Y);
4a. load kernel3 on GPU. Each of the threads will work on one of the
data element of Y and save the result on N on GPU.
(workItems=n1, workgroup size=y1)
//c = function2(M);
4b. load kernel3 (Same kernel) on GPU. Each of the threads will work
on one of the data element of M and save the result on C on GPU
(workItems=n2, workgroup size=y2)
5. read the data C on host variable c
}
However, the overhead involved in this code seems significant to me (I have implemented a test program and run on a GPU). And if the kernels have some sort of synchronizations it might be ended up with more slowdown.
I also believe the workflow is kind of common. So what is the best practice to using OpenCL for speedup for a program like this.

I don't think there's a general problem with the way you've split up the problem into kernels, although it's hard to say as you haven't been very specific. How often do you expect your while loop to run?
If your kernels do negligible work but the outer loop is doing a lot of iterations, you may wish to combine the kernels into one, and do some number of iterations within the kernel itself, if that works for your problem.
Otherwise:
If you're getting unexpectedly bad performance, you most likely need to be looking at the efficiency of each of your kernels, and possibly their data access patterns. Unless neighbouring work items are reading/writing neighbouring data (ideally: 16 work items read 4 bytes each from a 64-byte cache line at a time) you're probably wasting memory bandwidth. If your kernels contain lots of conditionals or non-constant loop iterations, that will cost you, etc.
You don't specify what kind of runtimes you're getting, on what kind Of job size, (Tens? Thousands? Millions of arithmetic ops? How big are your data sets?) or what hardware. (Compute card? Laptop IGPU?) "Significant overhead" can mean a lot of different things. 5ms? 1 second?
Intel, nVidia and AMD all publish optimisation guides - have you read these?

How do cuda threads are executed inside a single block?

I have several question regarding cuda. Following is a figure taken from a book on parallel programming. It shows how threads are allocated in the device for a multiplication of two vectors each of length 8192.
1) in threadblock 0 there are 15 SIMD threads. Are these 15 threads executed in parallel or just one thread at a specific time?
2) each block contains 512 elements in this example. is this number dependent on the hardware or is it a decision of the programmer?

1)
In this particular example, each thread seems to be assigned to 32 elements in the vector. Code that is executed by a single thread is executed sequentially.
2)
The size of the thread blocks is up to the programmer. However, there are restrictions on the number and size of the thread blocks given the hardware the code is executed on. For more information on this, see this elaborate answer:
Understanding CUDA grid dimensions, block dimensions and threads organization (simple explanation)

From your illustration, it seems that:
The grid is composed of 16 thread blocks, numbered from 0 to 15.
Each block is composed of 16 "SIMD threads", numbered from 0 to 15
Each "SIMD thread" computes the product of 32 vector elements.
It is not necessarily obvious from the illustration whether "SIMD thread" means, in the CUDA (OpenCL) parlance:
A warp (wavefront) of 32 threads (work-items)
or:
A thread (work-item) working on 32 elements
I will assume the former ("SIMD thread" = warp/wavefront), since it is a more reasonable assumption performance-wise, but the latter isn't technically incorrect, it's simply suboptimal design (on current hardware, at least).
1) in threadblock 0 there are 15 SIMD threads. Are these 15 threads executed in parallel or just one thread at a specific time?
As stated above, there are 16 warps (numbered from 0 to 15, that makes 16) in thread block 0, each of them made of 32 threads. These threads execute in lockstep, simultaneously, in parallel. The warps are executed independently from each another, sequentially or in parallel, depending on the capabilities of the underlying hardware. For example, the hardware may be capable of scheduling a number of warps for simultaneous execution.
2) each block contains 512 elements in this example. is this number dependent on the hardware or is it a decision of the programmer?
In this case, it is simply a decision of the programmer, but in some cases there are also hardware limitations that could force the programmer into changing the design. For example, there is a maximum number of threads a block can handle, and there is a maximum number of blocks a grid can handle.

MPI: shared variable value for all processors

Here's one question about MPI. I need two processors that keeps modifying one variable and I want both processors to have access to the variable with the most up-to-date value.
from mpi4py import MPI
from time import sleep
comm = MPI.COMM_WORLD
rank = comm.rank
assert comm.size == 2
msg = 0
sec = 10
if comm.rank == 0:
for i in range(sec):
print msg
sleep(1)
msg = comm.bcast(msg,root = 1)
else:
for i in range(sec*2):
msg += 1
sleep(0.5)
comm.bcast(msg,root = 1)
So I'm expecting the program to print something like: 0 2 4 ...
But the program turns out to print: 0 1 2 3 4 5 6 7 8 9
I'm curious if there's a mechanism in mpi4py such that the variable msg is shared by both processors? That is, whenever msg is modified by processor 1, the new value becomes immediately available to processor 0. In other words, I want processor 0 to access the most-up-to-date value of msg instead of waiting for every changes that were made on msg by processor 1.

I think you're getting confused about how distributed memory programming works. In MPI, each process (or rank) has its own memory, and therefore when it changes values via load/store operations (like what you're doing with msg += 1), it will not affect the value of the variable on another process. The only way to update remote values is by sending messages, which you are doing with the comm.bcast() call. This sends the local value of msg from rank 1 to all other ranks. Until this point, there's no way for rank 0 to know what's been happening on rank 1.
If you want to have shared values between processes, then you probably need to take a look at something else, perhaps threads. You'll lose the distributed abilities of MPI if you switch to OpenMP, but that might not be what you needed MPI for in the first place. There are ways of doing this with distributed memory models (such as PGAS languages like Unified Parallel C, Global Arrays, etc.), but you will always run into the issue of latency which means that there will be some time that the values on ranks 0 and 1 are not synchronized unless you have some sort of protection to enforce it.

As mentioned by Wesley Bland, this isn't really possible in a pure distributed memory environment, as memory isn't shared.
However, MPI has for some time (since 1997) allowed something like this in the MPI-2, as one-sided communications; these have been updated significantly in MPI-3 (2012). This approach can have real advantages, but one has to be a little careful; since memory isn't really shared, every update requires expensive communications and it's easy to accidentally put significant scalability/performance bottlenecks in your code by over-reliance on shared state.
The Using MPI-2 book has an example of implementing a counter using the MPI-2 one-sided communications; a simple version of that counter is described and implemented in this answer in C. In the mpi4py distribution, under 'demos', there are implementations of these same counters in the 'nxtval' demo; the same simple counter as nxtval-onesided.py and a more complicated but more scalable implementation, also as described in the Using MPI-2 book, as nxtval-scalable.py. You should be able to use either of those implementations more or less as-is in the above code.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio