Embarrassingly parallel execution, no speedup (MEEP, openMPI)

I've been trying to exploit parallelization to run some simulations with the MEEP simulation software a bit faster. By default the software only uses one CPU, and FDTD simulations are easily sped up by parallelization. In the end I found there was no difference between running on 1 or 4 cores; the simulation times were the same.
I then figured I would instead run individual simulations on each core to increase my total simulation throughput (for example running 4 different simulations at the same time).
What I found surprising is that whenever I start a new simulation, the already running simulations slow down, even though they run on separate cores. For example, if I run only 1 simulation on 1 core, each time step of the FDTD simulation takes around 0.01 seconds. If I start another process on another core, each simulation now spends 0.02 seconds per time step, and so on. So even when I run unrelated simulations on separate cores, they all slow down, giving me no net increase in throughput.
I'm not necessarily looking for help to solve this problem as much as I'm looking for help understanding it, because it piqued my curiosity. Each instance of the simulation requires less than 1% of my total memory, so it's not a memory capacity issue. The only thing I can think of is the cores sharing the cache memory, or the memory bandwidth being saturated. Is there any way to check if this is the case?
The simulations are fairly simple, and I've run programs which are much more memory hungry than this one and had great speedup with parallelization.
Any tips to help me understand this phenomenon?
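
One rough way to test the memory-bandwidth hypothesis is to run a deliberately memory-bound loop in 1, 2 and 4 processes at once and see whether the per-process time grows. The following is only a sketch and has nothing to do with MEEP itself: the buffer size, the repeat count and the function names are assumptions chosen for illustration, and the buffer is merely assumed to be larger than the last-level cache.

```python
import time
from multiprocessing import Pool

BUF_BYTES = 32 * 1024 * 1024  # assumption: larger than the last-level cache
REPEATS = 20                  # assumption: enough copies to get a stable average


def time_copies(args):
    """Copy a large buffer `repeats` times and return seconds per copy."""
    size_bytes, repeats = args
    buf = bytearray(size_bytes)
    start = time.perf_counter()
    for _ in range(repeats):
        bytes(buf)  # one streaming read plus one streaming write of the buffer
    return (time.perf_counter() - start) / repeats


def run_concurrent(n_procs, size_bytes=BUF_BYTES, repeats=REPEATS):
    """Run the copy loop in n_procs processes at once; return each one's s/copy."""
    with Pool(n_procs) as pool:
        return pool.map(time_copies, [(size_bytes, repeats)] * n_procs)


if __name__ == "__main__":
    for n in (1, 2, 4):
        per_copy = run_concurrent(n)
        print(n, "process(es):", [f"{t * 1e3:.1f} ms/copy" for t in per_copy])
```

If the time per copy rises roughly in proportion to the number of processes, the memory system is the shared bottleneck. If it stays flat, the slowdown of the simulations more likely comes from somewhere else. The processes are not started in lockstep, so treat the numbers as indicative only.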

I think it would be better to look at bigger simulations. The well-known effect of Turbo Boost-like technology (single-core performance changes with the number of active threads) cannot explain your result; it would explain it only if you had a single-core processor.
So I think it can be explained by the memory cache levels. Maybe try simulations much bigger than the L3 cache (> 8 MB for an i7).

My test on an Intel(R) Core(TM) i7-3517U CPU @ 1.90GHz, dual core (4 threads). All simulations use 1 MPI process (-np 1).
10 MB simulation:
Four simulations: 0.0255 s/step
Two simulations: 0.0145 s/step
One simulation: 0.0129 s/step
100 MB simulation:
Four simulations: 1.13 s/step
Two simulations: 0.61 s/step
One simulation: 0.53 s/step
A curious thing is that two simulations with 2 threads each run at almost the same speed as two simulations with 1 thread each.
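
The per-step times above are easier to compare as total throughput, that is, time steps completed per second across all concurrent simulations. A small sketch using only the numbers posted above (the function name is made up for illustration):

```python
def aggregate_throughput(n_sims, sec_per_step):
    """Total time steps completed per second across n_sims concurrent runs."""
    return n_sims / sec_per_step


# (number of concurrent simulations, seconds per step) from the 10 MB runs
runs_10mb = [(1, 0.0129), (2, 0.0145), (4, 0.0255)]
# same for the 100 MB runs
runs_100mb = [(1, 0.53), (2, 0.61), (4, 1.13)]

for label, runs in (("10 MB", runs_10mb), ("100 MB", runs_100mb)):
    base = aggregate_throughput(*runs[0])
    for n_sims, sec_per_step in runs:
        total = aggregate_throughput(n_sims, sec_per_step)
        print(f"{label}: {n_sims} sim(s) -> {total:.2f} steps/s total, "
              f"{total / base:.2f}x the single-run throughput")
```

On this dual-core machine, two simulations give roughly 1.7 to 1.8 times the single-run throughput, and four give about 1.9 to 2.0 times, so the second pair of hardware threads adds very little.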

Related

GPU affects core calculation and or RAM access (high jitter)?

I have a kthread which runs alone on one core of a multi-core CPU. This kthread disables all IRQs for that core, runs a loop as fast as possible and measures the maximum loop duration with the help of the TSC. The whole ACPI stuff is disabled (no frequency scaling, no power saving, etc.).
My problem is that the maximum loop duration apparently depends on the GPU.
When the system is used normally (a little bit of office work, Internet and programming stuff / not really busy), the maximum loop duration is around 5 us :-(
The same situation, but with a stressed CPU (the other three cores are 100% busy), leads to a maximum loop duration of approximately 1 us :-|
But when the GPU switches into idle mode (turning off the screen), the maximum loop duration goes down to less than 300 ns :-)
Why is that? And how can I influence this behavior? I thought the CPU and the RAM were directly connected. I noticed that the maximum loop duration improves on a system with an external graphics card in the first situation. For the second and third cases I couldn't see a difference. I also tested AMD and Intel systems without success; the result is always the same :-(
I'm fine with the second case. But is it possible to achieve that without stressing the CPU additionally?
Many thanks in advance!
Billy

How to reduce time taken for large calculations in MATLAB

When using the desktop PCs in my university (which have 4 GB of RAM), calculations in MATLAB are fairly speedy, but on my laptop (which also has 4 GB of RAM), the exact same calculations take ages. My laptop is much more modern, so I assume it has at least a similar clock speed to the desktops.
For example, I have written a program that calculates the solid angle subtended by 50 disks at 500 points. On the desktop PCs this calculation takes about 15 seconds; on my laptop it takes about 5 minutes.
Is there a way to reduce the time taken to perform these calculations? E.g., can I allocate more RAM to MATLAB, or can I boot up my PC in a way that optimises it for using MATLAB? I'm thinking that if the processor on my laptop is also doing calculations to run other programs, this will slow down the MATLAB calculations. I've closed all other applications, but I know there's probably a lot of stuff going on that I can't see. Can I boot my laptop up in a way that will have fewer of these things going on in the background?
I can't modify the code to make it more efficient.
Thanks!

You might run some of my benchmarks which, along with example results, can be found via:
http://www.roylongbottom.org.uk/
The CPU core used at a particular point in time is the same on Pentiums, Celerons, Core 2s, Xeons and others. The only differences are L2/L3 cache sizes and external memory bus speeds. So you can compare most results with similar vintage 2 GHz CPUs. Things to try, besides simple number-crunching tests:
1 - Try a memory test, such as my BusSpeed, to show that caches are being used and that RAM is not dead slow.
2 - Assuming Windows, check in Task Manager that the offending program is the one using most CPU time, and also that CPU utilisation is around zero when the program is not running.
3 - Check that CPU temperature is not too high, for example with SpeedFan (free D/L).
4 - If the disk light is flashing, too much RAM might be in use, with some being swapped in and out. Task Manager Performance would show this. Increasing RAM demands can be checked by some of my reliability tests.

There are many things that go into computing power besides RAM. You mention processor speed, but there is also number of cores, GPU capability and more. Programs like MATLAB are designed to take advantage of features like parallelism.
Summary: You can't compare only RAM between two machines and expect to know how they will perform with respect to one another.
Side note: 4 GB is not very much RAM for a modern laptop.

Firstly you should perform a CPU performance benchmark on both computers.
Modern operating systems usually apply their most aggressive power management schemes when running on a laptop. This usually means turning off one or more cores, or setting them to a very low frequency. For example, a quad-core CPU that normally runs at 2.0 GHz could be throttled down to 700 MHz on one core while the other three are basically put to sleep, while it is on battery. (Remark: the numbers are not taken from a real example.)
The OS manages the CPU frequency in a dynamic way, tweaking it on the order of seconds. You will need a software monitoring tool that actually asks for the CPU frequency every second (without doing busy work itself) in order to know if this is the case.
Plugging in the laptop will make the OS use a less aggressive power management scheme.
(If this is found to be unrelated to MATLAB, please "flag" this post and ask moderator to move this question to the SuperUser site.)

Why is parallel compilation performance with HT worse than without?

I've made several measurements of compilation time of wine with HyperThreading enabled and disabled in BIOS on my Core i7 930 @2.8GHz (quad-core) on Linux 2.6.39 x86_64. Each measurement was like this:
git clean -xdf
./configure --prefix=/usr
time make -j$N
where N is a number from 1 to 8.
Here are the results ("speed" is 60/real from time(1)):
Here the blue line corresponds to HT disabled and purple one to HT enabled. It appears that when HT is enabled, using 1-4 threads is slower than without HT. I guess this might be related to the kernel not distributing the processes to different cores and reusing second threads of already busy cores.
So, my question: how can I force the kernel to give 1 process per core scheduling higher priority than adding more processes to the same core's different thread? Or, if my reasoning is wrong, how can I have performance with HT not worse than without HT for 1-4 processes running in parallel?

Hyper-threading on Intel chips is implemented as duplication of some of the elements of a physical core but without enough electronics to be an independent core (e.g. they may share an instruction decoder, but I can't recall the specifics of Intel's implementation).
Imagine a physical core with HT as 1.5 physical cores that your OS sees as 2 real cores. This doesn't equate to 1.5x speed though (this can vary depending on use case).
In your example, non-HT is faster up to 4 threads because none of the cores are sharing work with their HT pipeline. You see a flatline above 4 threads because now you only have 4 execution threads and you get a little extra overhead context switching between threads.
In the HT example you are a bit slower up to 4 threads, probably because some of those threads are being assigned to a real core and its HT sibling, so you are losing performance as those two execution threads share physical resources. Above 4 threads you are seeing the benefit of the extra execution threads, but you also see the beginning of diminishing returns.
You could probably match performance in both cases for up to 4 threads, but likely not with a compilation job. Too many processes are being spawned for processor affinity to be set up, I think. If you instead ran a real parallel job using OpenMP or MPI with X<=4 threads bound to specific real CPU cores, I think you'd see similar performance between HT-off and HT-on.
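
As a sketch of what binding to real cores could look like on Linux, the snippet below picks one logical CPU from each group of hyper-threading siblings and restricts the current process to that set. It assumes the sysfs file `topology/thread_siblings_list` is present for each CPU and uses `os.sched_setaffinity`, which is only available on Linux. The helper names are made up for illustration. Note that this only restricts which CPUs may be used; it does not change scheduler priorities.

```python
import glob
import os


def parse_cpu_list(text):
    """Parse a Linux CPU list such as '0,4' or '0-1' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        elif part:
            cpus.add(int(part))
    return cpus


def one_cpu_per_core(sibling_lists):
    """Pick the lowest-numbered logical CPU from each group of HT siblings."""
    groups = {frozenset(parse_cpu_list(text)) for text in sibling_lists}
    return {min(group) for group in groups if group}


def read_sibling_lists():
    """Read each logical CPU's sibling list from sysfs (Linux-specific path)."""
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    lists = []
    for path in glob.glob(pattern):
        with open(path) as handle:
            lists.append(handle.read())
    return lists


def pin_to_physical_cores():
    """Restrict this process (and children started later) to one CPU per core."""
    cpus = one_cpu_per_core(read_sibling_lists())
    if cpus:
        os.sched_setaffinity(0, cpus)  # Linux-only call
    return cpus
```

Child processes inherit the affinity mask, so calling `pin_to_physical_cores()` and then starting the build from the same script (for example with `subprocess.run(["make", "-j4"])`) keeps all compiler processes on the selected CPUs.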

Given a number of threads <= the number of real cores, using HT should be slower because (considered crudely) you are potentially cutting the speed of your cores in half [1].
Keep in mind that generally more cores is NOT better than FASTER cores. In fact, the only reason so much work was put into developing multi-core systems is that it became increasingly difficult to make faster and faster ones. So if you cannot have a 20 GHz processor, then 8 x 3 GHz ones will have to do.
HT is, I believe, primarily intended as an advantage in contexts where each thread is not necessarily gobbling as much processor as it can; it's doing some particular task that's governed by interaction with a user, such as CAD stuff, video games, etc; these are the kind of applications that benefit from multi-tasking. By contrast, server platforms -- wherein the primary applications tend to thread independent tasks that are not governed by a dependence on anything else, hence are optimally run as fast as possible -- do not benefit directly from multi-tasking; they benefit from speed. make is in the same category, although with a perhaps greater degree of interdependence between threads, which is why you see an advantage for HT from 4-8 threads.
[1] This is a simplification. HT doesn't simply double the number of cores and halve their speed, but whatever dynamic is used, the total number of processor cycles per second for the system is not improved. It's the same -- only more fragmented.

Is CPU time relevant to Hyperthreading?

Is increased CPU time (as reported by time CLI command) indicative of inefficiency when hyperthreading is used (e.g. time spent in spinlocks or cache misses) or is it possible that the CPU time is inflated by the odd nature of HT? (e.g. real cores being busy and HT can't kick in)
I have quad-core i7, and I'm testing trivially-parallelizable part (image to palette remapping) of an OpenMP program — with no locks, no critical sections. All threads access a bit of read-only shared memory (look-up table), but write only to their own memory.
cores  real  CPU
1:     5.8   5.8
2:     3.7   5.9
3:     3.1   6.1
4:     2.9   6.8
5:     2.8   7.6
6:     2.7   8.2
7:     2.6   9.0
8:     2.5   9.7
I'm concerned that the amount of CPU time used increases rapidly as the number of cores exceeds 1 or 2.
I imagine that in an ideal scenario CPU time wouldn't increase much (same amount of work just gets distributed over multiple cores).
Does this mean there's 40% of overhead spent on parallelizing the program?

It's quite possibly an artefact of how CPU time is measured. A trivial example: if you run a 100 MHz CPU and a 3 GHz CPU for one second each, each will report that it ran for one second. The second CPU might do 30 times more work, but it still takes one second.
With hyperthreading, a reasonable (not quite accurate) model would be that one core can run either one task at, let's say, 2000 MHz, or two tasks at, let's say, 1200 MHz each. Running two tasks, it does only 60% of the work per thread, but 120% of the work for both threads together, a 20% improvement. But if the OS asks how many seconds of CPU time were used, the first will report "1 second" after each second of real time, while the second will report "2 seconds".
So the reported CPU time goes up. If it less than doubles, overall performance is improved.
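
To relate this model to the numbers in the question, here is a small sketch that computes the wall-clock speedup and the growth in reported CPU time from the posted table. Only the measurements come from the question; the function name and layout are illustrative.

```python
# (threads, real, CPU) as posted in the question
measurements = [
    (1, 5.8, 5.8), (2, 3.7, 5.9), (3, 3.1, 6.1), (4, 2.9, 6.8),
    (5, 2.8, 7.6), (6, 2.7, 8.2), (7, 2.6, 9.0), (8, 2.5, 9.7),
]


def summarize(rows):
    """Return (threads, wall-clock speedup, CPU-time inflation) per row."""
    _, base_real, base_cpu = rows[0]
    return [(n, base_real / real, cpu / base_cpu) for n, real, cpu in rows]


for n, speedup, inflation in summarize(measurements):
    print(f"{n} threads: {speedup:.2f}x faster, {inflation:.2f}x CPU time")
```

With 8 threads the run is about 2.3 times faster in wall-clock terms while the reported CPU time grows by a factor of about 1.7. Most of that growth appears once more than 4 threads are used, which is where the hyperthreaded siblings come into play.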

Quick question - are you running the genuine time program /usr/bin/time, or the built in bash command of the same name? I'm not sure that matters, they look very similar.
Looking at your table of numbers I sense that the processed data set (i.e. input plus all the output data) is reasonably large overall (bigger than L2 cache), and that the processing per data item is not that lengthy.
The numbers show a nearly linear improvement from 1 to 2 cores, but that is tailing off significantly by the time you're using 4 cores. The hyperthreaded cores are adding virtually nothing. This means that something shared is being contended for. Your program has free running threads, so that thing can only be memory (L3 cache and main memory on the i7).
This sounds like a typical example of being I/O bound rather than compute bound, the I/O in this case being to/from L3 cache and main memory. L2 cache is 256k, so I'm guessing that the size of your input data plus one set of results and all intermediate arrays is bigger than 256k.
Am I near the mark?
Generally speaking, when considering how many threads to use you have to take shared cache, memory speeds and data set sizes into account. That can be a right bugger because you have to work it out at run time, which is a lot of programming effort (unless your hardware config is fixed).

Optimal number of threads per core

Let's say I have a 4-core CPU, and I want to run some process in the minimum amount of time. The process is ideally parallelizable, so I can run chunks of it on an infinite number of threads and each thread takes the same amount of time.
Since I have 4 cores, I don't expect any speedup by running more threads than cores, since a single core is only capable of running a single thread at a given moment. I don't know much about hardware, so this is only a guess.
Is there a benefit to running a parallelizable process on more threads than cores? In other words, will my process finish faster, slower, or in about the same amount of time if I run it using 4000 threads rather than 4 threads?

If your threads don't do I/O, synchronization, etc., and there's nothing else running, 1 thread per core will get you the best performance. However, that is very likely not the case. Adding more threads usually helps, but after some point they cause performance degradation.
Not long ago, I was doing performance testing on a 2 quad-core machine running an ASP.NET application on Mono under a pretty decent load. We played with the minimum and maximum number of threads and in the end we found out that for that particular application in that particular configuration the best throughput was somewhere between 36 and 40 threads. Anything outside those boundaries performed worse. Lesson learned? If I were you, I would test with different number of threads until you find the right number for your application.
One thing for sure: 4k threads will take longer. That's a lot of context switches.

I agree with @Gonzalo's answer. I have a process that doesn't do I/O, and here is what I've found:
Note that all threads work on one array but on different ranges (no two threads access the same index), so the results may differ if they work on different arrays.
The 1.86 machine is a macbook air with an SSD. The other mac is an iMac with a normal HDD (I think it's 7200 rpm). The windows machine also has a 7200 rpm HDD.
In this test, the optimal number was equal to the number of cores in the machine.

I know this question is rather old, but things have evolved since 2009.
There are two things to take into account now: the number of cores, and the number of threads that can run within each core.
With Intel processors, the number of threads per core is defined by Hyperthreading, which is just 2 (when available). But Hyperthreading divides your execution speed by two, even when not using 2 threads! (i.e. 1 pipeline shared between two processes -- this is good when you have more processes, not so good otherwise. More cores are definitely better!) Note that modern CPUs generally have more pipelines to divide the workload, so it's not really divided by two anymore. But Hyperthreading still shares a lot of the CPU units between the two threads (some call those logical CPUs).
On other processors you may have 2, 4, or even 8 threads. So if you have 8 cores each of which supports 8 threads, you could have 64 processes running in parallel without context switching.
"No context switching" is obviously not true if you run with a standard operating system which will do context switching for all sorts of other things out of your control. But that's the main idea. Some OSes let you allocate processors so only your application has access/usage of said processor!
From my own experience, if you have a lot of I/O, multiple threads is good. If you have very heavy memory intensive work (read source 1, read source 2, fast computation, write) then having more threads doesn't help. Again, this depends on how much data you read/write simultaneously (i.e. if you use SSE 4.2 and read 256 bits values, that stops all threads in their step... in other words, 1 thread is probably a lot easier to implement and probably nearly as speedy if not actually faster. This will depend on your process & memory architecture, some advanced servers manage separate memory ranges for separate cores so separate threads will be faster assuming your data is properly filed... which is why, on some architectures, 4 processes will run faster than 1 process with 4 threads.)

The answer depends on the complexity of the algorithms used in the program. I came up with a method to calculate the optimal number of threads by making two measurements of processing times Tn and Tm for two arbitrary numbers of threads 'n' and 'm'. For linear algorithms, the optimal number of threads will be N = sqrt( m*n*(Tm*(n-1) - Tn*(m-1)) / (n*Tn - m*Tm) ).
Please read my article regarding calculations of the optimal number for various algorithms: pavelkazenin.wordpress.com
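
The formula above translates directly into code. In the sketch below, `optimal_threads` is a literal transcription of it. The cost model in `model_time` is an assumption added here for checking purposes and is not taken from the answer or the linked article: run time a/k for the parallel work plus an overhead b*(k-1) that grows with the thread count k. The formula recovers the minimum of that model exactly.

```python
import math


def optimal_threads(n, t_n, m, t_m):
    """Estimate the optimal thread count from two timings.

    t_n and t_m are the measured run times with n and m threads.
    """
    numerator = m * n * (t_m * (n - 1) - t_n * (m - 1))
    denominator = n * t_n - m * t_m
    return math.sqrt(numerator / denominator)


def model_time(k, a, b):
    """Assumed cost model (not from the answer): work a/k plus overhead b*(k-1)."""
    return a / k + b * (k - 1)


# Synthetic check: with a=100 and b=1 the model's minimum is at sqrt(a/b) = 10.
estimate = optimal_threads(2, model_time(2, 100, 1), 4, model_time(4, 100, 1))
print(round(estimate, 6))
```

With real measurements the ratio under the square root can be negative or the denominator zero (for example when n*Tn equals m*Tm), in which case the function raises an error and the two timings do not fit this kind of model.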

The actual performance will depend on how much voluntary yielding each thread will do. For example, if the threads do NO I/O at all and use no system services (i.e. they're 100% cpu-bound) then 1 thread per core is the optimal. If the threads do anything that requires waiting, then you'll have to experiment to determine the optimal number of threads. 4000 threads would incur significant scheduling overhead, so that's probably not optimal either.

I thought I'd add another perspective here. The answer depends on whether the question is assuming weak scaling or strong scaling.
From Wikipedia:
Weak scaling: how the solution time varies with the number of processors for a fixed problem size per processor.
Strong scaling: how the solution time varies with the number of processors for a fixed total problem size.
If the question is assuming weak scaling then @Gonzalo's answer suffices. However, if the question is assuming strong scaling, there's something more to add. In strong scaling you're assuming a fixed workload size, so if you increase the number of threads, the size of the data that each thread needs to work on decreases. On modern CPUs memory accesses are expensive, and it is preferable to maintain locality by keeping the data in caches. Therefore, the likely optimal number of threads can be found when the dataset of each thread fits in each core's cache (I'm not going into the details of discussing whether it's the L1/L2/L3 cache(s) of the system).
This holds true even when the number of threads exceeds the number of cores. For example, assume there are 8 arbitrary units (or AU) of work in the program, which will be executed on a 4-core machine.
Case 1: run with four threads where each thread needs to complete 2AU. Each thread takes 10s to complete (with a lot of cache misses). With four cores the total amount of time will be 10s (10s * 4 threads / 4 cores).
Case 2: run with eight threads where each thread needs to complete 1AU. Each thread takes only 2s (instead of 5s because of the reduced amount of cache misses). With four cores the total amount of time will be 4s (2s * 8 threads / 4 cores).
I've simplified the problem and ignored overheads mentioned in other answers (e.g., context switches), but I hope you get the point that it might be beneficial to have more threads than available cores, depending on the data size you're dealing with.
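
The arithmetic of the two cases can be written out as a tiny sketch. The function name is made up, and the estimate is the same simplified one used above (per-thread time multiplied by threads per core), so it ignores scheduling overhead.

```python
def total_time(seconds_per_thread, n_threads, n_cores):
    """Total run time using the simplified estimate from the two cases above."""
    return seconds_per_thread * n_threads / n_cores


case_1 = total_time(10, 4, 4)  # four threads, 2 AU each, many cache misses
case_2 = total_time(2, 8, 4)   # eight threads, 1 AU each, fewer cache misses
print(case_1, case_2)
```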

4000 threads at one time is pretty high.
The answer is yes and no. If you are doing a lot of blocking I/O in each thread, then yes, you could show significant speedups doing up to probably 3 or 4 threads per logical core.
If you are not doing a lot of blocking things however, then the extra overhead with threading will just make it slower. So use a profiler and see where the bottlenecks are in each possibly parallel piece. If you are doing heavy computations, then more than 1 thread per CPU won't help. If you are doing a lot of memory transfer, it won't help either. If you are doing a lot of I/O though such as for disk access or internet access, then yes multiple threads will help up to a certain extent, or at the least make the application more responsive.

Benchmark.
I'd start ramping up the number of threads for an application, starting at 1, and then go to something like 100, run three to five trials for each number of threads, and build yourself a graph of operation speed vs. number of threads.
You should see that the four-thread case is optimal, with slight rises in runtime after that, but maybe not. It may be that your application is bandwidth limited, i.e., the dataset you're loading into memory is huge, you're getting lots of cache misses, etc., such that 2 threads are optimal.
You can't know until you test.
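
A minimal sketch of such a benchmark loop follows. It assumes you can wrap your application in a function that takes the thread count as its only argument. The names `benchmark`, `best_thread_count` and `fake_job` are hypothetical, and the placeholder job only sleeps, so its timings mean nothing.

```python
import statistics
import time


def benchmark(run_with_threads, thread_counts, trials=3):
    """Time run_with_threads(n) for each n; return {n: median seconds}."""
    results = {}
    for n in thread_counts:
        samples = []
        for _ in range(trials):
            start = time.perf_counter()
            run_with_threads(n)
            samples.append(time.perf_counter() - start)
        results[n] = statistics.median(samples)
    return results


def best_thread_count(results):
    """Return the thread count with the smallest median time."""
    return min(results, key=results.get)


if __name__ == "__main__":
    # Placeholder workload: replace with a function that runs your real job
    # using n threads. Here it only sleeps, so the timings are meaningless.
    def fake_job(n):
        time.sleep(0.01 / n)

    timings = benchmark(fake_job, [1, 2, 4, 8])
    for n, seconds in timings.items():
        print(f"{n} threads: {seconds * 1e3:.2f} ms")
    print("fastest:", best_thread_count(timings))
```

The median is used instead of the mean so that a single slow trial (for example, one disturbed by another process) does not skew the result.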

You can find out how many threads are running on your machine with htop or the ps command, which lists the processes on your machine.
See the man page for the 'ps' command:
man ps
To count the processes of all users, you can use one of these commands:
ps -aux | wc -l
ps -eLf | wc -l
To count the processes of a single user:
ps --User root | wc -l
Also, you can use "htop" [Reference]:
Installing on Ubuntu or Debian:
sudo apt-get install htop
Installing on Redhat or CentOS:
yum install htop
dnf install htop [On Fedora 22+ releases]
If you want to compile htop from source code, you will find it here.

The ideal is 1 thread per core, as long as none of the threads will block.
One case where this may not be true: there are other threads running on the core, in which case more threads may give your program a bigger slice of the execution time.

One example of lots of threads ("thread pool") vs one per core is that of implementing a web-server in Linux or in Windows.
Since sockets are polled in Linux a lot of threads may increase the likelihood of one of them polling the right socket at the right time - but the overall processing cost will be very high.
In Windows the server will be implemented using I/O Completion Ports - IOCPs - which will make the application event driven: if an I/O completes the OS launches a stand-by thread to process it. When the processing has completed (usually with another I/O operation as in a request-response pair) the thread returns to the IOCP port (queue) to wait for the next completion.
If no I/O has completed there is no processing to be done and no thread is launched.
Indeed, Microsoft recommends no more than one thread per core in IOCP implementations. Any I/O may be attached to the IOCP mechanism. IOCs may also be posted by the application, if necessary.

Speaking from a computation- and memory-bound point of view (scientific computing), 4000 threads will make the application run really slowly. Part of the problem is the very high overhead of context switching and most likely very poor memory locality.
But it also depends on your architecture. From what I've heard, Niagara processors are supposed to be able to handle multiple threads on a single core using some kind of advanced pipelining technique. However, I have no experience with those processors.

Hope this makes sense: check the CPU and memory utilization and set a threshold value. If the threshold is crossed, don't allow new threads to be created; otherwise allow them.
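
A rough sketch of that idea for the CPU side only, assuming a Unix-like system where `os.getloadavg()` is available. The function names and the threshold of 0.9 are arbitrary choices for illustration, and memory utilization is left out because the standard library has no portable call for it.

```python
import os


def may_start_thread(load_per_core, threshold=0.9):
    """Allow a new thread only while the load per core is below the threshold."""
    return load_per_core < threshold


def current_load_per_core():
    """One-minute load average divided by the CPU count (Unix-only call)."""
    one_minute, _, _ = os.getloadavg()
    return one_minute / (os.cpu_count() or 1)
```

A thread pool could call `may_start_thread(current_load_per_core())` before growing. The load average reacts slowly, so this is a coarse guard, not a precise controller.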
