In parallel systems every process affects the others, because they all compete for scarce resources such as CPU caches, memory, disk I/O, and network bandwidth.
What method is best suited for measuring interference between processes? For example, processes A and B each access the disk heavily, so running them in parallel will probably be slower than running them sequentially (in terms of individual runtime), because the hard drive is the bottleneck.
If I don't know exactly how a process behaves (disk-, memory-, or CPU-intensive), what method would be best to analyse that?
Measure individual runtime and compare the relative share of each parallel process?
For example, process A runs on average 30 s alone, 45 s when fully overlapped with B, 35 s when overlapped 20% of the time, and so on?
Or would it be better to compare several indicators such as L1 and LLC cache misses, page faults, etc.?
What you need to do first is determine what the limiting factors are for each of the individual programs. If you run a CPU-bound and an IO-bound process at the same time, they will have very little impact on each other. If you run two IO-bound processes at the same time, there will be a lot of contention.
I wrote a rather detailed answer about how to interpret the output of "time [command]" results to see what's the limiting factor. It's here: What caused my elapsed time much longer than user time?
Once you have the output from running your programs under "time", you can determine which are likely to step on one another and which are not.
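As a rough sketch of that approach in Python: the helper below runs a command and collects the same three numbers that "time" reports, then makes a crude guess at the limiting factor. The function names, the 0.8 threshold, and the labels are illustrative assumptions, not anything from the linked answer, and the `resource` module is only available on Unix-like systems.

```python
import resource
import subprocess
import time


def measure(cmd):
    """Run cmd (a list of arguments) and return (real, user, system) seconds."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    real = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    user = after.ru_utime - before.ru_utime
    system = after.ru_stime - before.ru_stime
    return real, user, system


def classify(real, user, system, threshold=0.8):
    """Crude guess: if most of the elapsed time was spent on the CPU,
    call it CPU-bound; otherwise the process was mostly waiting.
    The threshold is an arbitrary assumption."""
    cpu_share = (user + system) / real
    return "cpu-bound" if cpu_share >= threshold else "mostly waiting (I/O or other)"


def slowdown(time_alone, time_together):
    """Relative slowdown of a process when it runs alongside another one."""
    return time_together / time_alone
```

With the numbers from the question, `slowdown(30, 45)` gives 1.5, i.e. process A takes 50% longer when it fully overlaps with B.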
I have 12 tasks to run on an octo-core machine. All tasks are CPU intensive and each will max out a core.
Is there a theoretical reason to avoid stacking tasks on a maxed out core (such as overhead, swapping across tasks) or is it faster to queue everything?
Task switching is a waste of CPU time. Avoid it if you can.
Whatever the scheduler timeslice is set to, the CPU will waste its time every time slice by going into the kernel, saving all the registers, swapping the memory mappings and starting the next task. Then it has to load in all its CPU cache, etc.
Much more efficient to just run one task at a time.
Things are different of course if the tasks use I/O and aren't purely compute bound.
Yes, it's called queueing theory: https://en.wikipedia.org/wiki/Queueing_theory. There are many different models (https://en.wikipedia.org/wiki/Category:Queueing_theory) for a range of different problems. I'd suggest you scan them and pick the one most applicable to your workload, then read up on how to avoid the worst outcomes for that model, or pick a different, better model for dispatching your workload.
Although the graph at this link https://commons.wikimedia.org/wiki/File:StochasticQueueingQueueLength.png applies to traffic, it will give you an idea of what happens to response times as your CPU utilisation increases. It shows that you'll reach an inflection point after which things get slower and slower.
More work is arriving than can be processed with subsequent work waiting longer and longer until it can be dispatched.
The more cores you have the further to the right you push the inflection point but the faster things go bad after you reach it.
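To put rough numbers on that curve, here is a minimal sketch using the simplest textbook model, a single-server M/M/1 queue, where the mean response time is the service time divided by (1 - utilisation). Choosing M/M/1 is an assumption made for illustration; a multi-core machine is closer to a multi-server model, which moves the knee to the right but keeps the same shape.

```python
def mm1_response_time(service_time, utilization):
    """Mean response time (waiting + service) in an M/M/1 queue.

    utilization is the fraction of time the server is busy, 0 <= rho < 1.
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)


if __name__ == "__main__":
    # Response time as a multiple of the bare service time.
    for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
        factor = mm1_response_time(1.0, rho)
        print(f"utilization {rho:.2f}: response time = {factor:.1f} x service time")
```

Under this model, a task takes twice its service time at 50% utilisation and about ten times at 90%, which is the "slower and slower" behaviour described above.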
I would also note that unless you've got some really serious cooling in place you are going to cook your CPU. Depending on its design it will either slow itself down, making your problem worse, or you'll trigger its thermal overload protection.
So a simplistic design for 8 cores would be 1 thread to manage things and add tasks to the work queue, and 7 threads that pull tasks from the work queue. If the tasks need to be performed within a certain time, you can add a TimeToLive value so that they can be discarded rather than executed needlessly. As you are almost certainly running your application on an OS that uses a pre-emptive threading model, consider things like using processor affinity where possible because, as @Zan-Lynx says, task/context switching hurts. Be careful not to try to build your OS's thread management again, as you'll probably wind up in conflict with it.
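A minimal sketch of that manager-plus-workers layout in Python, with a time-to-live on each task. All names here (`run_work_queue`, the tuple layout, the `None` sentinel) are made up for illustration. Python threads are used only to show the structure; for genuinely CPU-bound work in CPython you would likely swap in processes, which this sketch does not do.

```python
import queue
import threading
import time


def run_work_queue(tasks, workers=7):
    """tasks: iterable of (func, arg, ttl_seconds).

    The calling thread plays the manager: it fills the queue.
    Worker threads pull tasks and skip any whose deadline has passed.
    Returns (sorted results, number of discarded tasks).
    """
    work = queue.Queue()
    results = []
    discarded = 0
    lock = threading.Lock()

    def worker():
        nonlocal discarded
        while True:
            item = work.get()
            if item is None:  # sentinel: no more work for this worker
                return
            func, arg, deadline = item
            if time.monotonic() > deadline:
                with lock:
                    discarded += 1  # expired: discard instead of executing
                continue
            value = func(arg)
            with lock:
                results.append(value)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()

    # Manager role: enqueue tasks with an absolute deadline, then one
    # sentinel per worker so that every worker eventually exits.
    for func, arg, ttl in tasks:
        work.put((func, arg, time.monotonic() + ttl))
    for _ in threads:
        work.put(None)
    for t in threads:
        t.join()
    return sorted(results), discarded
```

Processor affinity is deliberately left out, since how to set it depends on the OS.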
tl;dr: cache thrash is Bad
You have a dozen tasks. Each will have to do a certain amount of work.
At an app level they each processed a thousand customer records or whatever. That is fixed, it is a constant no matter what happens on the hardware.
At the language level, again it is fixed: C++, Java, or Python will execute a fixed number of app instructions or bytecodes. We'll gloss over GC overhead here, as well as page fault and scheduling details.
At the assembly level, again it is fixed, some number of x86 instructions will execute as the app continues to issue new instructions.
But you don't care about how many instructions, you only care about how long it takes to execute those instructions. Many of the instructions are reads which MOV a value from RAM to a register. Think about how long that will take. Your computer has several components to implement the memory hierarchy - which ones will be involved? Will that read hit in L1 cache? In L2? Will it be a miss in last-level cache so you wait (for tens or hundreds of cycles) until RAM delivers that cache line? Did the virtual memory reference miss in RAM, so you wait (for milliseconds) until SSD or Winchester storage can page in the needed frame? You think of your app as issuing N reads, but you might more productively think of it as issuing 0.2 * N cache misses. Running at a different multi-programming level, where you issue 0.3 * N cache misses, could make elapsed time quite noticeably longer.
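The effect of that change in miss rate can be sketched with the usual average-memory-access-time formula: hit time plus miss rate times miss penalty. The cycle counts below (4 cycles for a hit, 200 for a miss) are hypothetical round numbers chosen for illustration, not measurements of any particular machine.

```python
def average_access_cycles(hit_cycles, miss_rate, miss_penalty_cycles):
    """Average cost of one read, in cycles, for a single cache level."""
    return hit_cycles + miss_rate * miss_penalty_cycles


if __name__ == "__main__":
    HIT = 4         # assumed cost of a cache hit, in cycles
    PENALTY = 200   # assumed extra cost of going to RAM, in cycles
    low = average_access_cycles(HIT, 0.2, PENALTY)
    high = average_access_cycles(HIT, 0.3, PENALTY)
    print(f"miss rate 0.2: {low:.0f} cycles per read")
    print(f"miss rate 0.3: {high:.0f} cycles per read")
    print(f"relative cost: {high / low:.2f}x")
```

With these assumed numbers the same N reads cost 44 versus 64 cycles each on average, so the read-dominated part of the program takes roughly 45% longer even though the instruction count is unchanged.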
Every workload is different, and can place larger or smaller demands on memory storage. But every level of the memory hierarchy depends on caching to some extent, and higher multi-programming levels are guaranteed to impact cache hit rates. There are network- and I/O-heavy workloads where very high multi-programming levels absolutely make sense. But for CPU- and memory-intensive workloads, when you benchmark elapsed times you may find that less is more.
I'm new to Elixir, and I'm starting to read through Dave Thomas's excellent Programming Elixir. I was curious how far I could take the concurrency of the "pmap" function, so I iteratively boosted the number of items to square from 1,000 to 10,000,000. Out of curiosity, I watched the output of htop as I did so, usually peaking out with CPU usage similar to that shown below:
After showing the example in the book, Dave says:
And, yes, I just kicked off 1,000 background processes, and I used all the cores and processors on my machine.
My question is, how come on my machine only cores 1, 3, 5, and 7 are lighting up? My guess would be that it has to do with my iex process being only a single OS-level process and OSX is managing the reach of that process. Is that what's going on here? Is there some way to ensure all cores get utilized for performance-intensive tasks?
Great comment by @Thiago Silveira about the first line of iex's output. The part [smp:8:8] says how many operating-system-level scheduler threads Erlang is using. You can control this with the -smp flag if you want to disable it:
iex --erl '-smp disable'
This will ensure that you have only one scheduler thread. You can achieve a similar result by leaving symmetric multiprocessing enabled but setting NumberOfSchedulers:NumberOfSchedulersOnline directly:
iex --erl '+S 1:1'
Each scheduler thread runs its own scheduler for Erlang processes, so you can easily see how many of them you currently have:
:erlang.system_info(:schedulers_online)
To answer your question about performance: if your processors are not working at full capacity (100%) and none of them is idle (0%), then distributing the load more evenly will probably not speed things up. Why?
CPU usage is measured by sampling the processor state at many points in time. These states are either "working" or "idle". 82% CPU usage means that you can run a couple more tasks on this CPU without slowing the others down.
Erlang schedulers try to be smart and not migrate Erlang processes between cores unless they have to, because it requires copying. Migration occurs, for example, when one of the schedulers is idle: it can then borrow a process from another scheduler's run queue.
The next thing that may cause such a big discrepancy between odd and even cores is Hyper-Threading. On my dual-core processor htop shows 4 logical cores. In your case you probably have 4 physical cores and 8 logical ones because of HT. It might be the case that you are utilizing your physical cores at 100%.
Another thing: pmap needs to calculate each result in a separate process, but at the end it sends the result to the caller, which may be a bottleneck. The more messages you send, the less CPU utilization you can achieve. For fun, you can try giving the processes a task that is really CPU intensive, like calculating the Ackermann function. You can even calculate how much of your job is sequential and how much is parallel by using Amdahl's law and measuring execution times for different numbers of cores.
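The Amdahl's law calculation mentioned above can be sketched as follows (in Python rather than Elixir, purely for illustration). The first function is the standard formula; the second inverts it to estimate the parallel fraction from two measured run times. The function names are made up, and the estimate assumes the only difference between the two runs is the number of cores.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup on `cores` cores when `parallel_fraction` of the
    work (0..1) can run in parallel and the rest is sequential."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)


def estimate_parallel_fraction(time_one_core, time_n_cores, cores):
    """Invert Amdahl's law: estimate the parallel fraction from the run
    time on one core and the run time on `cores` cores."""
    if cores < 2:
        raise ValueError("need a measurement on at least 2 cores")
    speedup = time_one_core / time_n_cores
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / cores)
```

For example, a job that takes 100 s on one core and 40 s on four cores has an estimated parallel fraction of 0.8, so under this model even an unlimited number of cores could not make it more than 5 times faster.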
To sum up: the CPU utilization from screenshot looks really great! You don't have to change anything for more performance-intensive tasks.
Concurrency is not Parallelism
In order to get good parallel performance out of Elixir/BEAM coding you need to have some understanding of how the BEAM scheduler works.
This is a very simplistic model, but the BEAM scheduler gives each process 2000 reductions before it swaps out the process for the next process. Reductions can be thought of as function calls. By default a process runs on the core/scheduler that spawned it. Processes only get moved between schedulers if the queue of outstanding processes builds up on a given scheduler. By default the BEAM runs a scheduling thread on each available core.
What this implies is that in order to get the most use of the processors you need to break up your tasks into large enough pieces of work that will exceed the standard "reduction" slice of work. In general, pmap style parallelism only gives significant speedup when you chunk many items into a single task.
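The chunking idea itself is language independent. Below is a minimal Python stand-in (not Elixir, and not how pmap in the book is written) that hands each worker a whole chunk of items instead of a single item. `chunked_pmap`, `chunks`, and the default sizes are hypothetical names and values; a thread pool is used only to keep the sketch short, and it will not give CPU parallelism in CPython the way BEAM schedulers do.

```python
from concurrent.futures import ThreadPoolExecutor


def chunks(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def chunked_pmap(func, items, chunk_size=1000, workers=4):
    """Apply func to every item, submitting one task per chunk so that
    the per-task overhead is paid once per chunk, not once per item."""
    def run_chunk(chunk):
        return [func(x) for x in chunk]

    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results stay in order.
        for part in pool.map(run_chunk, chunks(items, chunk_size)):
            results.extend(part)
    return results
```

The right chunk size depends on how expensive `func` is, so it is worth treating it as a tunable parameter and measuring.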
The other thing to be aware of is that some parts of the BEAM use a spin/wait loop when awaiting work and that can skew usage when you use a tool like htop to examine CPU usage. You'll get a much better understanding of your program's performance by using :observer.
I understand that creating many processes may yield no benefit, depending on how many cores your processor has (if the tasks are CPU-bound), or depending on how many IO operations you can do simultaneously (if your tasks are IO-bound). In such cases, creating too many processes simply has no effect.
However, can creating too many processes have a negative effect on performance? If yes, why?
Short answer: yes.
A process that isn't active has some overhead in memory and CPU time -- not a lot, but not none. So if you have an extremely large number of processes, you will see negatives.
On a modern system, multiple processes of the same executable will share code and read-only data, but each needs its own copy of mutable data, each needs its own stack, etc. Thus, each additional process takes up some amount of memory; this means more cache pressure, and in the extreme case, more swapfile activity or outright running out of memory. There may be a hard limit to the number of processes as well.
The OS process scheduler will have more overhead working through a longer list of processes (though this probably won't be linearly bad; if heap-based it might be O(log n)).
Cache pressure is probably the biggest factor in practice. Assume your processes are all processing similar workloads. Some of the data they will need while processing will be shared across multiple work units, while not being known at compile time; each process will wind up having its own copy of that data. Thus two work units being handled by two processes will use up twice as much cache space for that kind of data.
I've been going through this tutorial on parallel pipelines and noticed that, while there is definitely a considerable difference in throughput, couldn't it be even better if the compression stage also took on a read job since it's just waiting around anyway? The same thing goes for the write stage... I mean, why not take on a third compression and then switch over to writing two, and then have one of those cores go back to compressing while the other wraps up the third write, and so on?
I apologize if this is obvious. I imagine this is standard practice and is called something, I'm just not sure what. Is there any overhead involved with switching jobs like this?
And I know this might be the wrong forum for this last question, but can the GPU switch jobs like this or should the programmable shaders/CUDA cores pretty much be left alone after being programmed?
EDIT: I guess I also don't understand how using the same six cores as in the 2 cores/stage example would be faster than just giving each of the six cores all three stages. Sure, there would be two cores that would do two, but that's still faster than the top scenario. I would understand it better in the GPU's case, since there is specialized hardware involved for certain computations, but generally speaking, I don't see it. Maybe this example is weak or something, because I know that parallel processing is here to stay.
This is definitely an issue with pipelining and there are a number of different ways to try and mitigate it.
With specialized hardware the hardware will often be tuned to try and balance the time taken in each stage for typical workloads. Fixed function stages in GPUs for example are typically balanced around the needs of a sample of representative game rendering workloads with transistors being allocated to try and balance the time taken in each stage. With static balancing like this there will usually be some wasted performance still however.
An alternative approach that can be used in both software and hardware to balance a pipeline is to break the longer stages down into multiple shorter steps. This is a common strategy in CPU instruction pipelines but can also be useful in software. In your example, the longer running compression step could potentially be broken down into multiple shorter pipeline stages. Depending on the task this may be difficult or impossible to do efficiently however.
Task scheduling systems can be used to help balance workloads across CPUs in a software pipeline. In a task scheduling system, you have a number of worker threads (usually around one per hardware thread) and any task can run on any worker thread. You have an API to set up dependencies between tasks and the task scheduler is responsible for scheduling tasks to run wherever CPU time is available once their dependencies are satisfied. In your example, the cores with idle time running the Read and Write tasks could help out with Compress tasks rather than sitting idle as long as the Compress tasks had their Read task dependencies satisfied.
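A toy version of such a task scheduler, sketched in Python under stated assumptions: `run_tasks` and `build_pipeline` are hypothetical names, the "read" and "write" stages are stand-ins that touch no files, and a thread pool plays the role of the worker threads. Any worker can pick up any task whose dependencies have finished, so workers are not tied to a stage.

```python
import zlib
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_tasks(tasks, deps, workers=4):
    """tasks: name -> callable; deps: name -> list of prerequisite names.
    Each callable receives the results of its prerequisites as arguments.
    Returns a dict name -> result."""
    results = {}
    pending = set(tasks)
    running = {}  # future -> task name
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while pending or running:
            # A task is ready once all of its prerequisites have results.
            ready = [n for n in pending
                     if all(d in results for d in deps.get(n, []))]
            for name in ready:
                pending.remove(name)
                inputs = [results[d] for d in deps.get(name, [])]
                running[pool.submit(tasks[name], *inputs)] = name
            if not running:
                raise ValueError("dependency cycle or unknown prerequisite")
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for fut in done:
                results[running.pop(fut)] = fut.result()
    return results


def build_pipeline(n_blocks):
    """Read -> Compress -> Write for each block, as in the example."""
    tasks, deps = {}, {}
    for i in range(n_blocks):
        tasks[f"read{i}"] = lambda i=i: (f"block {i} " * 100).encode()
        tasks[f"compress{i}"] = zlib.compress
        tasks[f"write{i}"] = len  # stand-in: report bytes "written"
        deps[f"compress{i}"] = [f"read{i}"]
        deps[f"write{i}"] = [f"compress{i}"]
    return tasks, deps
```

Calling `run_tasks(*build_pipeline(3))` runs nine tasks on four workers; a worker that has just finished a read is free to take the next compress task instead of waiting for more reads.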
Traditional OS thread schedulers can give some of the same benefits of a task scheduling system. In your example, if the Read threads waited on a semaphore when their work queues were empty (to be signalled when new work was added to the queues), the OS could schedule Compress threads to run on those idle cores. This can work reasonably well for relatively long running pipeline stages (10s of milliseconds) but for shorter pipeline stages (sub 1ms) the overhead of the OS thread scheduling and the length of the thread time slice will likely mean a task scheduling system would give better performance.
Your points are valid. The tutorial is lacking.
If the read, compress, and write operations can all occur at once, independently, the simple non-pipelined case would be the fastest for the six cores. Also notice that in the six core diagram, the reads and writes never overlap, so they could be the same ones. You only need four cores.
But consider a case where the reads all access the same disk so issuing too many read operations in parallel makes the reads take longer because they interfere with each other. In this case you can gain by pipelining the reads since you start the first compress steps sooner and they limit the overall performance.
I want to run a batch of, say, 20 CPU-intensive computations (basically really long nested for loops) on a machine.
Each of these 20 jobs doesn't share data with the other 19.
If the machine has N cores, should I spin off N-1 of these jobs then? Or N? Or should I just launch all 20, and have Windows figure out how to schedule them?
Unfortunately, there is no simple answer. The only way to know for sure is to implement and then profile your application.
Typically, for maximum throughput, if the jobs are pure CPU, you'd want one per core. Depending on the type of work, this could mean one per hyperthread (logical core) or just one per "true physical core". (If the work is identical for all 20 jobs, then hyperthreading often slows down the overall work...)
If the jobs have any non-CPU functionality (such as reading a file, waiting on anything, etc.), then more than one work item per core tends to be much better. In many situations, this will improve throughput.
Generally, if you aren't sharing data, aren't blocking on IO, are using lots of CPU, and nothing else is running on the box (and probably a few more caveats), using all the CPUs (e.g. N threads) is probably the best idea.
The best choice is probably to make it configurable and profile it and see what happens.
You should use a thread pool of some sort, so it's (reasonably) easy to tune the number of threads without affecting the structure of the program.
Once you've done that, it's a fairly simple matter of testing to find a reasonably optimal number of threads relative to the number of processors available. Chances are that even if the jobs look like they should be purely CPU bound, you'll get better efficiency with more than N threads, but about the only way to be sure is to test.
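A small sketch of that "make it configurable, then measure" approach, written in Python for illustration. `run_batch` and `sweep` are hypothetical helpers; the executor class is a parameter because, in CPython, a thread pool suits jobs that wait on I/O, while CPU-bound jobs would normally need `concurrent.futures.ProcessPoolExecutor` (with a picklable, module-level job function) to use several cores.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_batch(job, inputs, workers, executor_cls=ThreadPoolExecutor):
    """Run job(x) for every x in inputs on a pool with `workers` workers.
    Returns (results in input order, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as pool:
        results = list(pool.map(job, inputs))
    return results, time.perf_counter() - start


def sweep(job, inputs, worker_counts, executor_cls=ThreadPoolExecutor):
    """Time the same batch with several pool sizes.
    Returns a dict: worker count -> elapsed seconds."""
    inputs = list(inputs)
    return {n: run_batch(job, inputs, n, executor_cls)[1]
            for n in worker_counts}
```

For a machine with N cores, something like `sweep(job, inputs, [N - 1, N, N + 2, 2 * N])` gives a first impression; repeating each measurement a few times reduces noise.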