Determining ideal number of cores in MPI - performance

I'm trying to determine the ideal number of cores for my model (in this example I'm using xbeach but I'm having the same issue when running other models such as SWAN) by doing a speedup test, i.e. I'm running the same code mpiexec -n <np> mymodel.exe with <np> = 2, 4, 8, 12, 16, 18, 20, 24, 32, and 36 cores/processes and see how long it takes. I ran the test multiple times, each with different results - see image
To my understanding MPI tries to distribute all tasks evenly across all processors (so having other applications running in the background might slow things down) but I'm still confused why I get SUCH a large range of computing times for the same code with the same number of cores/tasks. Why could it be that my model takes 18 minutes to run on 18 cores one time and then 128 minutes the next time I run it?
I'm using MPICH2 1.4.1p1 on Windows 10 on a machine with 2 NUMA nodes with 36 cores each so any potential background tasks should easily be handled on the remaining cores/node. It also doesn't look like I'm running out of memory.
To summarize my questions are:
Is it normal to have such a large range in computation time when using MPI?
If it is - why?
Is there a better way than mine to determine the ideal number of cores for an MPI program?

Related

How does OpenCL distribute work items?

I'm testing and comparing GPU speed up with different numbers of work-items (no work-groups). The kernel I'm using is a very simple but long operation. When I test with multiple work-items, I use a barrier function and split the work in smaller chunks to get the same result as with just one work-item. I measure the kernel execution time using cl_event and the results are the following:
1 work-item: 35735 ms
2 work-items: 11822 ms (3 times faster than with 1 work-item)
10 work-items: 2380 ms (5 times faster than with 2 work-items)
100 work-items: 239 ms (10 times faster than with 10 work-items)
200 work-items: 122 ms (2 times faster than with 100 work-items)
CPU takes about 580 ms on average to do the same operation.
The only result I don't understand and can't explain is the one with 2 work items. I would expect the speed up to be about 2 times faster compared to the result with just one work item, so why is it 3?
I'm trying to make sense of these numbers by looking at how these work-items were distributed on processing elements. I'm assuming if I have just one kernel, only one compute unit (or multiprocessor) will be activated and the work items distributed on all processing elements (or CUDA cores) of that compute unit. What I'm also not sure about is whether a processing element can process multiple work-items at the same time, or is it just one work-item per processing element?
CL_DEVICE_MAX_WORK_ITEM_SIZES are 1024 / 1024 / 64 and CL_DEVICE_MAX_WORK_GROUP_SIZE 1024. Since I'm using just one dimension, does that mean I can have 1024 work-items running at the same time per processing element or per compute unit? When I tried with 1000 work-items, the result was a smaller number so I figured not all of them got executed, but why would that be?
My GPU info: Nvidia GeForce GT 525M, 96 CUDA cores (2 compute units, 48 CUDA cores per unit)
The only result I don't understand and can't explain is the one with 2
work items. I would expect the speed up to be about 2 times faster
compared to the result with just one work item, so why is it 3?
The exact reasons will probably be hard to pin down, but here are a few suggestions:
GPUs aren't optimised at all for small numbers of work items. Benchmarking that end of the scale isn't especially useful.
35 seconds is a very long time for a GPU. Your GPU probably has other things to do, so your work-item is probably being interrupted many times, with its context saved and resumed every time.
It will depend very much on your algorithm. For example, if your kernel uses local memory, or a work-size dependent amount of private memory, it might "spill" to global memory, which will slow things down.
Depending on your kernel's memory access patterns, you might be running into the effects of read/write coalescing. More work items means fewer memory accesses.
What I'm also not sure about is whether a processing element can process multiple work-items at the same time, or is it just one work-item per processing element?
Most GPU hardware supports a form of SMT to hide memory access latency. So a compute core will have up to some fixed number of work items in-flight at a time, and if one of them is blocked waiting for a memory access or barrier, the core will continue executing commands on another work item. Note that the maximum number of simultaneous threads can be further limited if your kernel uses a lot of local memory or private registers, because those are a finite resource shared by all cores in a compute unit.
Work-groups will normally run on only one compute unit at a time, because local memory and barriers don't work across units. So you don't want to make your groups too large.
One final note: compute hardware tends to be grouped in powers of 2, so it's usually a good idea to make your work group sizes a multiple of e.g. 16 or 64. 1000 is neither, which usually means some cores will be doing nothing.
When I tried with 1000 work-items, the result was a smaller number so I figured not all of them got executed, but why would that be?
Please be more precise in this question, it's not clear what you're asking.

H2O - Not seeing much speed-up after moving to powerful machine

I am running a Python program that calls H2O for deep learning (training and testing). The program runs in a loop of 20 iterations and in each loop calls H2ODeepLearningEstimator() 4 times and associated predict() and model_performance(). I am doing h2o.remove_all() and cleaning up all data-related Python objects after each iteration.
Data size: training set 80,000 with 122 features (all float) with 20% for validation (10-fold CV). test set 20,000. Doing binary classification.
Machine 1: Windows 7, 4 core, Xeon, each core 3.5GHz, Memory 32 GB
Takes about 24 hours to complete
Machine 2: CentOS 7, 20 core, Xeon, each core 2.0GHz, Memory 128 GB
Takes about 17 hours to complete
I am using h2o.init(nthreads=-1, max_mem_size = 96)
So, the speed-up is not that much.
My questions:
1) Is the speed-up typical?
2) What can I do to achieve substantial speed-up?
2.1) Will adding more cores help?
2.2) Are there any H2O configuration or tips that I am missing?
Thanks very much.
- Mohammad,
Graduate student
If the training time is the main effort, and you have enough memory, then the speed up will be proportional to cores times core-speed. So, you might have expected a 40/14 = 2.85 speed-up (i.e. your 24hrs coming down to the 8-10 hour range).
There is a typo in your h2o.init(): 96 should be "96g". However, I think that was a typo when writing the question, as h2o.init() would return an error message. (And H2O would fail to start if you'd tried "96", with the quotes but without the "g".)
You didn't show your h2o.deeplearning() command, but I am guessing you are using early stopping. And that can be unpredictable. So, what might have happened is that your first 24hr run did, say, 1000 epochs, but your second 17hr run did 2000 epochs. (1000 vs. 2000 would be quite an extreme difference, though.)
It might be that you are spending too much time scoring. If you've not touched the defaults, this is unlikely. But you could experiment with train_samples_per_iteration (e.g. set it to 10 times the number of your training rows).
What can I do to achieve substantial speed-up?
Stop using cross-validation. That might be a bit controversial, but personally I think 80,000 training rows is going to be enough to do an 80%/10%/10% split into train/valid/test. That will be 5-10 times quicker.
If it is for a paper, and you want to show more confidence in the results, once you have your final model, and you've checked that test score is close to valid score, then rebuild it a couple of times using a different seed for the 80/10/10 split, and confirm you end up with the same metrics. (*)
*: By the way, take a look at the score for each of the 10 cv models you've already made; if they are fairly close to each other, then this approach should work well. If they are all over the place, you might have to re-consider the train/valid/test splits - or just think about what it is in your data that might be causing that sensitivity.

Python 3 multiprocessing: optimal chunk size

How do I find the optimal chunk size for multiprocessing.Pool instances?
I used this before to create a generator of n sudoku objects:
processes = multiprocessing.cpu_count()
worker_pool = multiprocessing.Pool(processes)
sudokus = worker_pool.imap_unordered(create_sudoku, range(n), n // processes + 1)
To measure the time, I use time.time() before the snippet above, then I initialize the pool as described, then I convert the generator into a list (list(sudokus)) to trigger generating the items (only for time measurement, I know this is nonsense in the final program), then I take the time using time.time() again and output the difference.
I observed that the chunk size of n // processes + 1 results in times of around 0.425 ms per object. But I also observed that the CPU is only fully loaded the first half of the process, in the end the usage goes down to 25% (on an i3 with 2 cores and hyper-threading).
If I use a smaller chunk size of int(l // (processes**2) + 1) instead, I get times of around 0.355 ms instead and the CPU load is much better distributed. It just has some small spikes down to ca. 75%, but stays high for much longer part of the process time before it goes down to 25%.
Is there an even better formula to calculate the chunk size or a otherwise better method to use the CPU most effective? Please help me to improve this multiprocessing pool's effectiveness.
This answer provides a high level overview.
Going into detais, each worker is sent a chunk of chunksize tasks at a time for processing. Every time a worker completes that chunk, it needs to ask for more input via some type of inter-process communication (IPC), such as queue.Queue. Each IPC request requires a system call; due to the context switch it costs anywhere in the range of 1-10 μs, let's say 10 μs. Due to shared caching, a context switch may hurt (to a limited extent) all cores. So extremely pessimistically let's estimate the maximum possible cost of an IPC request at 100 μs.
You want the IPC overhead to be immaterial, let's say <1%. You can ensure that by making chunk processing time >10 ms if my numbers are right. So if each task takes say 1 μs to process, you'd want chunksize of at least 10000.
The main reason not to make chunksize arbitrarily large is that at the very end of the execution, one of the workers might still be running while everyone else has finished -- obviously unnecessarily increasing time to completion. I suppose in most cases a delay of 10 ms is a not a big deal, so my recommendation of targeting 10 ms chunk processing time seems safe.
Another reason a large chunksize might cause problems is that preparing the input may take time, wasting workers capacity in the meantime. Presumably input preparation is faster than processing (otherwise it should be parallelized as well, using something like RxPY). So again targeting the processing time of ~10 ms seems safe (assuming you don't mind startup delay of under 10 ms).
Note: the context switches happen every ~1-20 ms or so for non-real-time processes on modern Linux/Windows - unless of course the process makes a system call earlier. So the overhead of context switches is no more than ~1% without system calls. Whatever overhead you're creating due to IPC is in addition to that.
Nothing will replace the actual time measurements. I wouldn't bother with a formula and try a constant such as 1, 10, 100, 1000, 10000 instead and see what works best in your case.

How do cuda threads are executed inside a single block?

I have several question regarding cuda. Following is a figure taken from a book on parallel programming. It shows how threads are allocated in the device for a multiplication of two vectors each of length 8192.
1) in threadblock 0 there are 15 SIMD threads. Are these 15 threads executed in parallel or just one thread at a specific time?
2) each block contains 512 elements in this example. is this number dependent on the hardware or is it a decision of the programmer?
1)
In this particular example, each thread seems to be assigned to 32 elements in the vector. Code that is executed by a single thread is executed sequentially.
2)
The size of the thread blocks is up to the programmer. However, there are restrictions on the number and size of the thread blocks given the hardware the code is executed on. For more information on this, see this elaborate answer:
Understanding CUDA grid dimensions, block dimensions and threads organization (simple explanation)
From your illustration, it seems that:
The grid is composed of 16 thread blocks, numbered from 0 to 15.
Each block is composed of 16 "SIMD threads", numbered from 0 to 15
Each "SIMD thread" computes the product of 32 vector elements.
It is not necessarily obvious from the illustration whether "SIMD thread" means, in the CUDA (OpenCL) parlance:
A warp (wavefront) of 32 threads (work-items)
or:
A thread (work-item) working on 32 elements
I will assume the former ("SIMD thread" = warp/wavefront), since it is a more reasonable assumption performance-wise, but the latter isn't technically incorrect, it's simply suboptimal design (on current hardware, at least).
1) in threadblock 0 there are 15 SIMD threads. Are these 15 threads executed in parallel or just one thread at a specific time?
As stated above, there are 16 warps (numbered from 0 to 15, that makes 16) in thread block 0, each of them made of 32 threads. These threads execute in lockstep, simultaneously, in parallel. The warps are executed independently from each another, sequentially or in parallel, depending on the capabilities of the underlying hardware. For example, the hardware may be capable of scheduling a number of warps for simultaneous execution.
2) each block contains 512 elements in this example. is this number dependent on the hardware or is it a decision of the programmer?
In this case, it is simply a decision of the programmer, but in some cases there are also hardware limitations that could force the programmer into changing the design. For example, there is a maximum number of threads a block can handle, and there is a maximum number of blocks a grid can handle.

Slow down when executing multiple Racket programs

I have a Racket program that will be long running. Executing many instances of the same programs will help finding the answer faster. (It depends on the randomness.) So I execute 10 instances of the same program from the command line on a 24-core machine. The average throughput when executing one instance (on one core) is 500 iterations/s. The average throughput when executing 10 instances (on 10 cores) goes down to 100 iterations/s per core. I expect to see similar throughput per core because each execution does not interface with the others at all. Does anyone else experience this behavior? What is happening? How can I fix this?
--------------------------- Additional information -----------------------------
OS: ubuntu 13.10
cores: 24
Each instance write its own output file. Approximately once per minute, each instance will replace the same output file with the updated result which is about 10 lines of text. So, I don't think they hit I/O bound.
According to top, each core uses 1.5-2.5% of memory. When running 10 core, 16 GB is used and, 9 GB is free. With nothing running, 11 GB is used, and 14 GB are free.
There is no network request.
The follows are (current-memory-use) divided by 1,000,000 over 12 minutes on 3 of the 10 cores (MB).
core 3: 313, 48, 73, 154, 292, 242
core 4: 56, 245, 261, 106, 229, 190
core 6: 55, 238, 66, 229, 275, 207
When I run (current-memory-use) without anything else, it returns 29 MB.
I found the issue. My program indeed used too much memory. Therefore, when I'm running multiple instances at the same time, either everything can't fit in cache (probably L3) or it exceeds memory bandwidth.
I tried to discover the source of the problem why my program used so much memory. By putting (current-memory-use) at many places in the program, I found that the issue was from arithmetic-shift. Because of that one operation, somehow the memory usage became doubled immediately.
The problem occured when executing (arithmetic-shift x y) when x is big and y is positive. In that case, I believe the result is represented using "flonum" (boxed) instead of "fixnum" (unboxed).
Even though I masked the result to 32-bit later, something prevented racket from optimizing that, likely first-order functions. I fixed it by masking x before passing it to arithmetic-shift such that the result is never greater than 32-bit number, and that fixed the problem. Now, my program uses 80 MB instead of 300 MB, and I get the speed up I expect!
I suppose this isn't truly an answer; it's more like a guess and advice that doesn't fit in a comment.
From the list #MarkSetchell gave, the most obvious place to start is I/O -- do the processes make network requests, or share an input file?
Slightly less obvious (but, wild guess, more likely in your case) is memory. The sole instance could use all available RAM, if needed. Does it?. With 10 instances sharing the same RAM, they'd probably garbage collect more often, which would be slower.
Try adding something like
(thread
(λ ()
(let loop ()
(displayln (current-memory-use))
(sleep 5)
(loop))))
and see how that plots over time. For one instance, does it top out at a value? How does that compare to RAM in the system?
And/or, use racket -W "error debug#GC" <your-program> to show debug-level log info from the GC.

Resources