I understand that creating many processes may yield no benefit, depending on how many cores your processor has (if the tasks are CPU-bound) or how many I/O operations you can perform simultaneously (if your tasks are I/O-bound). In such cases, creating additional processes simply has no effect.
However, can creating too many processes have a negative effect on performance? If yes, why?
Short answer: yes.
A process that isn't active still has some overhead in memory and CPU time -- not a lot, but not none. So if you have an extremely large number of processes, you will see a negative impact.
On a modern system, multiple processes of the same executable will share code and read-only data, but each needs its own copy of mutable data, each needs its own stack, etc. Thus, each additional process takes up some amount of memory; this means more cache pressure, and in the extreme case, more swapfile activity or outright running out of memory. There may be a hard limit to the number of processes as well.
The OS process scheduler will also have more overhead working through a longer list of processes (though this probably won't be linearly bad; if the run queue is heap-based it might be O(log n)).
Cache pressure is probably the biggest factor in practice. Assume your processes are all handling similar workloads. Some of the data they need while processing is common across multiple work units but isn't known at compile time, so each process winds up with its own copy of that data. Thus two work units handled by two different processes will use up twice as much cache space for that kind of data.
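As a practical takeaway, the usual pattern is to size the pool of workers to the hardware rather than to the number of work items. A minimal sketch in C++ (the use of threads instead of full processes, and the trivial work loop, are my own illustrative assumptions):

```cpp
#include <algorithm>
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Process N independent work items with roughly one worker per hardware
// thread, instead of spawning one worker per item.
int main() {
    const unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    const int items = 10000;
    std::atomic<int> next{0};

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&] {
            for (int i = next.fetch_add(1); i < items; i = next.fetch_add(1)) {
                // ... process work item i ...
            }
        });
    }
    for (auto& t : pool) t.join();
    std::printf("processed %d items with %u workers\n", items, workers);
}
```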
I wanted to ask a question regarding cache consistency (coherence).
If I have a sequential program, I shouldn't have cache consistency problems, because the instructions are executed sequentially and consequently there is no danger of several processors writing to the same memory location at the same time, even when there is shared memory.
The situation is different if I have a parallel program, which runs on multiple processors; then there is a high probability of cache consistency problems.
Is that right?
In a single-threaded program, execution doesn't move between cores by itself; it only moves when the OS migrates the thread (and when it does, the thread's state is re-loaded from memory into the new core's cache, so there is no coherence problem there).
In a multi-threaded program, an update to a variable that is also present in other cores' caches has to be communicated to those caches somehow. This causes data to flow through all the other caches. It may not block the writing thread itself, but as soon as another thread needs the updated value, the synchronization / locking takes a performance hit. This is especially true when other variables are being updated at nearby addresses that fall in the same cache line (false sharing). That's why, for an array of locks, 20-byte elements give worse locking behaviour than 128-byte elements.
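A sketch of that padded-lock-array point in C++, assuming a 64-byte cache line (padding to 128 bytes also covers CPUs whose prefetcher pulls in adjacent line pairs); the struct names and sizes are illustrative:

```cpp
#include <atomic>

// Unpadded: many ~4-byte locks share one 64-byte cache line, so threads
// hammering *different* locks still invalidate each other's line (false sharing).
struct PackedLock {
    std::atomic<int> flag{0};
};

// Padded: one lock per 128-byte slot, so contention on lock i causes no
// coherence traffic for lock j.
struct alignas(128) PaddedLock {
    std::atomic<int> flag{0};
};

static_assert(sizeof(PaddedLock) == 128, "one lock per padded slot");

PackedLock packed_locks[64];   // 64 locks squeezed into a handful of cache lines
PaddedLock padded_locks[64];   // 64 locks spread over 64 separate slots

int main() {
    // Toy usage: spin-lock style acquire/release on one padded slot.
    while (padded_locks[3].flag.exchange(1, std::memory_order_acquire)) { /* spin */ }
    // ... critical section for bucket 3 ...
    padded_locks[3].flag.store(0, std::memory_order_release);
}
```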
If CPUs did not have coherence, multi-threading wouldn't work efficiently. So, in some designs, an update is broadcast to all caches (snooping). But this is not efficient with a high number of cores: if 1000 cores existed in the same CPU, it would require 1000-way broadcast logic consuming a lot of circuit area. So designers break the problem into smaller parts and add other mechanisms, such as directory-based coherence and clusters of multiple cores. But this adds more latency to the coherence traffic.
On the other hand, many GPUs do not implement automatic cache coherence, because
the algorithms developers run on them are generally embarrassingly parallel, with only a few synchronization points, and blocks of threads don't need to communicate with other blocks (when they do, they go through a common cache via instructions the developer chooses explicitly anyway)
there are thousands of streaming pipelines (not full cores) that just need to issue memory requests efficiently; otherwise there wouldn't be enough die area for that many pipelines
high throughput is required rather than low latency (so there is no need for implicit coherence anywhere)
so the multiprocessors in a GPU are designed to do work that is completely independent of the others, and adding automatic coherence would add little performance (if not subtract from it). When the developer needs to synchronize data between multiple threads in the same block, there are instructions for that, and without them there is no guarantee of seeing a valid, updated value. So cache coherence on a GPU is effectively opt-in.
I have 12 tasks to run on an octo-core machine. All tasks are CPU intensive and each will max out a core.
Is there a theoretical reason to avoid stacking tasks on a maxed out core (such as overhead, swapping across tasks) or is it faster to queue everything?
Task switching is a waste of CPU time. Avoid it if you can.
Whatever the scheduler timeslice is set to, the CPU wastes time at the end of every slice by trapping into the kernel, saving all the registers, swapping the memory mappings, and starting the next task. Then that task has to warm up the CPU caches again, etc.
Much more efficient to just run one task at a time.
Things are different of course if the tasks use I/O and aren't purely compute bound.
Yes, it's called queueing theory (https://en.wikipedia.org/wiki/Queueing_theory). There are many different models (https://en.wikipedia.org/wiki/Category:Queueing_theory) for a range of different problems; I'd suggest you scan them, pick the one most applicable to your workload, and then read up on how to avoid the worst outcomes for that model, or pick a different, better model for dispatching your workload.
Although the graph at this link (https://commons.wikimedia.org/wiki/File:StochasticQueueingQueueLength.png) applies to road traffic, it will give you an idea of what happens to response times as your CPU utilisation increases: you reach an inflection point after which things get slower and slower.
More work arrives than can be processed, and subsequent work waits longer and longer until it can be dispatched.
The more cores you have, the further to the right you push that inflection point, but the faster things go bad once you reach it.
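To make that inflection point concrete, the textbook single-server result (M/M/1, a simplifying assumption rather than a model of your specific workload) says that with arrival rate λ and service rate μ the average time a task spends in the system is

W = 1 / (μ − λ)

so, measured in service times, you wait about 2 at 50% utilisation, 10 at 90%, and 100 at 99%; response time explodes as utilisation approaches 100%.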
I would also note that unless you've got some really serious cooling in place you are going to cook your CPU. Depending on its design it will either slow itself down, making your problem worse, or you'll trigger its thermal overload protection.
So a simplistic design for 8 cores would be: 1 thread to manage things and add tasks to the work queue, and 7 threads pulling tasks from that queue. If the tasks need to be performed within a certain time, you can add a TimeToLive value so that they can be discarded rather than executed needlessly. As you are almost certainly running your application on an OS that uses a pre-emptive threading model, consider things like using processor affinity where possible, because, as #Zan-Lynx says, task/context switching hurts. Be careful not to try to rebuild your OS's thread management, as you'll probably wind up in conflict with it.
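A rough sketch of that manager-plus-workers design in C++; the Task type, the 30-second TimeToLive, and the fixed 7-worker count are illustrative assumptions, not a drop-in implementation:

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

using Clock = std::chrono::steady_clock;

struct Task {
    int id;
    Clock::time_point deadline;   // TimeToLive expressed as an absolute deadline
};

std::queue<Task> work;
std::mutex mtx;
std::condition_variable cv;
bool done = false;

void worker() {
    for (;;) {
        Task t;
        {
            std::unique_lock<std::mutex> lk(mtx);
            cv.wait(lk, [] { return done || !work.empty(); });
            if (work.empty()) return;              // done and drained
            t = work.front();
            work.pop();
        }
        if (Clock::now() > t.deadline) continue;   // expired: discard, don't execute
        // ... do the CPU-intensive work for task t.id ...
    }
}

int main() {
    const int n_workers = 7;                       // 1 manager + 7 workers on 8 cores
    std::vector<std::thread> workers;
    for (int i = 0; i < n_workers; ++i) workers.emplace_back(worker);

    // Manager thread (here: main) feeds the queue.
    for (int id = 0; id < 12; ++id) {
        std::lock_guard<std::mutex> lk(mtx);
        work.push({id, Clock::now() + std::chrono::seconds(30)});
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lk(mtx); done = true; }
    cv.notify_all();
    for (auto& t : workers) t.join();
    std::printf("all tasks dispatched\n");
}
```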
tl;dr: cache thrash is Bad
You have a dozen tasks. Each will have to do a certain amount of work.
At the app level, each of them processes a thousand customer records or whatever. That is fixed; it is a constant no matter what happens on the hardware.
At the language level, again it is fixed: C++, Java, or Python will execute a fixed number of app instructions or bytecodes. We'll gloss over GC overhead here, and over page fault and scheduling details.
At the assembly level, again it is fixed, some number of x86 instructions will execute as the app continues to issue new instructions.
But you don't care about how many instructions, you only care about how long it takes to execute those instructions. Many of the instructions are reads which MOV a value from RAM to a register. Think about how long that will take. Your computer has several components to implement the memory hierarchy - which ones will be involved? Will that read hit in L1 cache? In L2? Will it be a miss in last-level cache so you wait (for tens or hundreds of cycles) until RAM delivers that cache line? Did the virtual memory reference miss in RAM, so you wait (for milliseconds) until SSD or Winchester storage can page in the needed frame? You think of your app as issuing N reads, but you might more productively think of it as issuing 0.2 * N cache misses. Running at a different multi-programming level, where you issue 0.3 * N cache misses, could make elapsed time quite noticeably longer.
Every workload is different, and can place larger or smaller demands on memory storage. But every level of the memory hierarchy depends on caching to some extent, and higher multi-programming levels are guaranteed to impact cache hit rates. There are network- and I/O-heavy workloads where very high multi-programming levels absolutely make sense. But for CPU- and memory-intensive workloads, when you benchmark elapsed times you may find that less is more.
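A rough back-of-envelope (the cycle counts are illustrative, not measured): if a cache hit costs about 1 cycle and a miss to RAM costs about 200, the average memory access time is roughly

1 + miss_rate × 200 cycles

so a 0.2 miss rate gives about 41 cycles per read while 0.3 gives about 61; that's the same instruction count running roughly 50% slower, purely because the higher multi-programming level pushed more of each process's data out of cache.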
My MPI experience shows that the speedup does not increase linearly with the number of nodes we use (because of the costs of communication). My experience is similar to this:
Today a speaker said: "Magically (smiles), in some occasions we can get more speedup than the ideal one!".
He meant that ideally, when we use 4 nodes, we would get a speedup of 4. But on some occasions we can get a speedup greater than 4 with 4 nodes! The topic was related to MPI.
Is this true? If so, can anyone provide a simple example of that? Or maybe he was thinking about adding multithreading to the application (he ran out of time and had to leave, so we could not discuss it)?
Parallel efficiency (speed-up / number of parallel execution units) over unity is not at all uncommon.
The main reason for that is the total cache size available to the parallel program. With more CPUs (or cores), one has access to more cache memory. At some point, a large portion of the data fits inside the cache and this speeds up the computation considerably. Another way to look at it is that the more CPUs/cores you use, the smaller the portion of the data each one gets, until that portion could actually fit inside the cache of the individual CPU. This is sooner or later cancelled by the communication overhead though.
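To put rough numbers on it (made up purely for illustration): with a 256 MB working set and a 16 MB last-level cache per CPU, at 16 or more ranks each rank's ~16 MB share starts to fit in its local cache and DRAM traffic drops sharply, whereas below that point most accesses miss. That step change is what produces the super-linear region, until communication overhead catches up.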
Also, your data shows the speed-up compared to the execution on a single node. Using OpenMP could remove some of the overhead when using MPI for intranode data exchange and therefore result in better speed-up compared to the pure MPI code.
The problem comes from the incorrectly used term ideal speed-up. Ideally, one would account for cache effects. I would rather use linear instead.
Not too sure this is on-topic here, but here goes nothing...
This super-linearity in speed-up can typically occur when you parallelise your code while distributing the data in memory with MPI. In some cases, by distributing the data across several nodes / processes, you end up with chunks of data for each individual process that are small enough to fit in the processor's cache. This cache effect can have a huge impact on the code's performance, leading to great speed-ups and compensating for the increased need for MPI communication... This can be observed in many situations, but it isn't something you can really count on to compensate for poor scalability.
Another case where you can observe this sort of super-linear scalability is an algorithm where you distribute the task of finding a specific element in a large collection: by distributing the work, one of the processes/threads can end up finding the result almost immediately, just because it happened to be given a range of indexes starting very close to the answer. But this case is even less reliable than the aforementioned cache effect.
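A toy illustration of that search case, using threads rather than MPI and with the target position deliberately rigged so that one worker gets lucky (everything here is made up for illustration):

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Each worker scans its own slice of the index range and everyone stops as
// soon as someone finds the target. If the target happens to sit near the
// start of a worker's slice, wall-clock time drops by far more than 1/P
// compared with a single sequential scan.
int main() {
    const long n = 400000000;
    const int  workers = 4;
    const long target = 3 * (n / workers) + 17;    // just past the start of worker 3's slice

    std::atomic<bool> found{false};
    std::atomic<long> where{-1};

    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) {
        pool.emplace_back([&, w] {
            const long begin = w * (n / workers);
            const long end   = begin + n / workers;
            for (long i = begin; i < end && !found.load(std::memory_order_relaxed); ++i) {
                if (i == target) { where = i; found = true; break; }
            }
        });
    }
    for (auto& t : pool) t.join();
    std::printf("found at %ld\n", where.load());
}
```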
Hope that gives you a flavour of what super-linearity is.
Cache has been mentioned, but it's not the only possible reason. For instance, you could imagine a parallel program which does not have sufficient memory to store all its data structures at low node counts, but does at high node counts. Thus at low node counts the programmer may have been forced to write intermediate values to disk and then read them back in again, or alternatively to re-calculate the data when required. At high node counts these games are no longer required and the program can store all its data in memory. So super-linear speed-up is a possibility because at higher node counts the code is simply doing less work, by using the extra memory to avoid I/O or recalculation.
Really this is the same as the cache effects noted in the other answers: using extra resources as they become available. And this is really the trick: more nodes doesn't just mean more cores, it also means more of all your resources. So, while speed-up nominally measures your use of the cores, if you can also put those other extra resources to good effect you can achieve super-linear speed-up.
In a Delphi application, what is the maximum number of concurrent threads that can be running at one time? Suppose that a single thread's processing time is about 100 milliseconds.
The number of concurrent threads is limited by available resources. However, keep in mind that every thread uses a minimum amount of memory (usually 1MB by default, unless you specify differently), and the more threads you run, the more work the OS has to do to manage them, and the more time it takes just to keep switching between them so they have fair opportunity to run. A good rule of thumb is to not have more threads than there are CPUs available, since that will be the maximum number of threads that can physically run at any given moment. But you can certainly have more threads than CPUs, the OS will simply schedule them accordingly, which can degrade performance if you have too many running at a time. So you need to think about why you are using threads in the first place and plan accordingly to trade off between performance, memory usage, overhead, etc. Multi-threaded programming is not trivial, so do not treat it lightly.
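As a rough sizing exercise using the 1 MB default mentioned above: 1,000 threads would reserve about 1 GB of address space just for their stacks, and on a machine with, say, 8 logical CPUs only 8 of them can actually execute at any instant; the other 992 exist mostly to be switched between.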
This is memory-dependent; there is no fixed limit to how many threads or other objects you can create. At some point, if you allocate too much memory, you may get an "out of memory" exception, so you should think about how many threads you really need to invoke and go from there. Also keep in mind that the more threads you invoke, the more you should expect the performance of each thread to drop, since they all share the same CPUs. So you may not get the performance that you're looking for if you have too many concurrent threads at once. I hope that this helps!
In parallel systems every process has an impact on other processes, because they all compete for scarce resources like CPU caches, memory, disk I/O, network, etc.
What method is best suited for measuring interference between processes? For example, processes A and B each access the disk heavily, so running them in parallel will probably be slower than running them sequentially (compared with their individual runtimes), because the bottleneck is the hard drive.
If I don't know the exact behaviour of a process (disk-, memory-, or CPU-intensive), what method would be best to analyse that?
Measure individual runtime and compare the relative share of each parallel process?
For example, process A runs in 30 s on average when alone, in 45 s when running 100% in parallel with B, in 35 s when overlapping 20%, etc.?
Or would it be better to compare several indicators like L1 and LLC cache misses, page faults, etc.?
What you need to do is first determine what the limiting factors are for each of the individual programs. If you run a CPU-bound and an I/O-bound process at the same time, they'll have very little impact on each other. If you run two I/O-bound processes at the same time, there'll be a lot of contention.
I wrote a rather detailed answer about how to interpret the output of "time [command]" to see what the limiting factor is. It's here: What caused my elapsed time much longer than user time?
Once you have the output from "time"-ing your programs you can determine which are likely to step on one another and which are not.
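If you'd rather instrument the program itself than wrap it in time, here is a minimal C++ sketch of the same idea (the stand-in workload and its sleep are purely illustrative): comparing CPU time against wall-clock time tells you whether the program was computing or waiting.

```cpp
#include <chrono>
#include <cmath>
#include <cstdio>
#include <ctime>
#include <thread>

// Stand-in workload: some computation followed by a sleep that mimics an
// I/O wait. Replace with the real work you want to characterise.
static double workload() {
    double s = 0.0;
    for (int i = 1; i < 20000000; ++i) s += std::sqrt(double(i));
    std::this_thread::sleep_for(std::chrono::milliseconds(500));
    return s;
}

int main() {
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();

    volatile double sink = workload();   // volatile: keep the work from being optimised away
    (void)sink;

    const double cpu_s  = double(std::clock() - cpu_start) / CLOCKS_PER_SEC;
    const double wall_s = std::chrono::duration<double>(
                              std::chrono::steady_clock::now() - wall_start).count();

    // cpu/wall near 100% -> CPU-bound; much lower -> mostly waiting (I/O, scheduling).
    std::printf("wall %.2fs, cpu %.2fs, cpu/wall %.0f%%\n",
                wall_s, cpu_s, 100.0 * cpu_s / wall_s);
}
```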