I recently enabled cgroups/cpu isolation on my Mesos cluster. I've been running some stress tests (like starting some cpu-bound programs and seeing if a cpu-burst program can jump in and claim its cpu allocation), and it looks like Mesos is slicing the cpu correctly. However, I've seen some posts claiming it's dangerous for cpu-bound programs to take all idle cpu.
I'm trying to understand exactly what the dangers of soft-limiting cpu are. Is the problem that a critical task may not be able to use its full cpu allocation immediately? What are some situations where soft limits on cpu would cause problems? The alternative to my current setup is CFS scheduling, but my programs tend to be idle most of the time.
I use Marathon and Chronos (latest stable versions) to schedule tasks on my Mesos cluster (also the latest stable version).
The main danger of soft-limiting CPU is the inherent uncertainty. "Explicit is better than implicit." You hope your task gets scheduled on a host machine whose other tasks are mostly idle, but it might not be so lucky. In the unlucky case where other tasks on the host are bursting, your task's performance suffers relative to what it would see in an environment with hard limits. You may value predictability more than burst-ability. In a more ideal world, we might even want a mix.
That being said, hard limits are not necessarily a silver bullet. I can't speak to the reasoning of the posts you mention, but even the Marathon docs note that CFS may not be appropriate for everything: https://mesosphere.github.io/marathon/docs/cfs.html
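For reference, the switch between the two behaviours happens on the Mesos agents; this is only a rough sketch, assuming the flag names in the docs linked above still apply to your Mesos version:
# Soft limits only: tasks get CPU shares and may burst into idle CPU
mesos-slave --isolation=cgroups/cpu,cgroups/mem
# Hard limits: additionally enforce CFS quotas, capping each task at its cpu allocation
mesos-slave --isolation=cgroups/cpu,cgroups/mem --cgroups_enable_cfs=true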
I'm new to Elixir, and I'm starting to read through Dave Thomas's excellent Programming Elixir. I was curious how far I could take the concurrency of the "pmap" function, so I iteratively boosted the number of items to square from 1,000 to 10,000,000. Out of curiosity, I watched the output of htop as I did so, and the CPU usage usually peaked with cores 1, 3, 5, and 7 busy while the other cores sat mostly idle.
After showing the example in the book, Dave says:
And, yes, I just kicked off 1,000 background processes, and I used all the cores and processors on my machine.
My question is: why are only cores 1, 3, 5, and 7 lighting up on my machine? My guess would be that it has to do with my iex session being only a single OS-level process, and OS X managing the reach of that process. Is that what's going on here? Is there some way to ensure all cores get utilized for performance-intensive tasks?
Great comment by @Thiago Silveira about the first line of iex's output. The [smp:8:8] part tells you how many scheduler threads the Erlang VM is running (and how many of them are online). You can disable symmetric multiprocessing entirely with the -smp flag:
iex --erl '-smp disable'
This ensures that there is only one scheduler thread. You can achieve a similar result by leaving symmetric multiprocessing enabled but setting the scheduler counts directly (NumberOfSchedulers:NumberOfSchedulersOnline):
iex --erl '+S 1:1'
Each scheduler is an operating system thread that runs Erlang processes, so you can easily check how many of them you currently have:
:erlang.system_info(:schedulers_online)
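A few related calls you can run straight from iex to inspect, and experiment with, the scheduler setup at runtime:
:erlang.system_info(:schedulers)                 # scheduler threads started by the VM
:erlang.system_info(:logical_processors_online)  # logical cores the OS reports
:erlang.system_flag(:schedulers_online, 4)       # use only 4 schedulers from now on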
To answer your question about performance: if your processors are not working at full capacity (100%) and none of them is completely idle (0%), then it is likely that distributing the load more evenly will not speed things up. Why?
CPU usage is measured by sampling the processor's state at many points in time; each sample is either "working" or "idle". 82% CPU usage means that you can still run a couple more tasks on this CPU without slowing the other tasks down.
Erlang schedulers try to be smart and avoid migrating Erlang processes between cores unless they have to, because migration has a cost. Migration happens, for example, when one of the schedulers goes idle: it can then steal a process from another scheduler's run queue.
The next thing that may cause such a big discrepancy between odd and even cores is Hyper-Threading. On my dual-core processor, htop shows 4 logical cores. In your case you probably have 4 physical cores and 8 logical ones because of HT, so it may well be that your physical cores are already utilized at 100%.
Another thing: pmap calculates each result in a separate process, but at the end every result is sent back to the caller, which can become a bottleneck. The more messages you send, the less CPU utilization you can achieve. For fun, you can try giving the processes a task that is really CPU-intensive, like computing the Ackermann function. You can also estimate how much of your job is sequential and how much is parallel by measuring execution times for different numbers of cores and applying Amdahl's law.
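For example, a quick back-of-the-envelope check of Amdahl's law in iex (the 0.9 parallel fraction is just an illustrative assumption):
# speedup on n cores when a fraction p of the work can run in parallel
amdahl = fn p, n -> 1 / ((1 - p) + p / n) end
amdahl.(0.9, 8)   # ~4.7x, far from the "ideal" 8x
amdahl.(0.9, 4)   # ~3.1x, so 4 fully busy cores are not much worse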
To sum up: the CPU utilization in your screenshot looks really good! You don't have to change anything for more performance-intensive tasks.
Concurrency is not Parallelism
In order to get good parallel performance out of Elixir/BEAM coding you need to have some understanding of how the BEAM scheduler works.
As a very simplified model, the BEAM scheduler gives each process 2000 reductions before it swaps that process out for the next one. Reductions can be thought of as function calls. By default a process runs on the core/scheduler that spawned it, and processes only get moved between schedulers if the queue of runnable processes builds up on a given scheduler. By default the BEAM runs a scheduler thread on each available core.
What this implies is that, to get the most use out of the processors, you need to break your tasks into pieces of work large enough to exceed the standard reduction slice. In general, pmap-style parallelism only gives a significant speedup when you chunk many items into a single task.
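A minimal sketch of that chunking idea, using Task.async_stream as a convenient stand-in (this is not the book's pmap):
defmodule ChunkedPmap do
  # Map fun over the collection in parallel, but hand each process a whole
  # chunk so it does far more than one reduction slice worth of work.
  def pmap(collection, fun, chunk_size \\ 10_000) do
    collection
    |> Enum.chunk_every(chunk_size)
    |> Task.async_stream(fn chunk -> Enum.map(chunk, fun) end,
         max_concurrency: System.schedulers_online(), timeout: :infinity)
    |> Enum.flat_map(fn {:ok, results} -> results end)
  end
end

ChunkedPmap.pmap(1..10_000_000, &(&1 * &1))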
The other thing to be aware of is that some parts of the BEAM use a spin/wait loop when awaiting work, which can skew the CPU usage you see in a tool like htop. You'll get a much better understanding of your program's performance by using :observer.
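From iex that is just:
:observer.start()   # the Load Charts tab shows per-scheduler utilization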
If a single-threaded process is busy and uses 100% of a single core, it seems like Windows keeps switching the process between cores, because in Task Manager's per-core overview all cores appear equally used.
Why does Windows do that? Isn't this destroying L1/L2 caches?
There are advantages to pinning a process to one core, primarily caching, which you already mentioned.
There are also disadvantages -- you get unequal heating, which can create mechanical stresses that do not improve the expected lifetime of the silicon die.
To avoid this, OSes tend to keep all cores at equal utilization. When there's only one active thread, it has to be moved around, which invalidates caches. As long as this is done infrequently (in CPU time), the impact of the extra cache misses during migration is negligible.
For example, the abstract of "Energy and thermal tradeoffs in hardware-based load balancing for clustered multi-core architectures implementing power gating" explicitly lists this as a design goal of scheduling algorithms:
In this work, a load-balancing technique for these clustered multi-core architectures is presented that provides both a low overhead in energy and a smooth temperature distribution across the die, increasing the reliability of the processor by evenly stressing the cores.
Spreading the heat dissipation throughout the die is also essential for techniques such as Turbo Boost, where cores are clocked temporarily at a rate that is unsustainable long term. By moving load to a different core regularly, the average heat dissipation remains sustainable even though the instantaneous power is not.
Your process may be the only one doing a lot of work, but it is not the only thing running. There are lots of other processes that need to run occasionally. When your process gets evicted and eventually re-scheduled, the core on which it was running previously might not be available. It's better to run the waiting process immediately on a free core than to wait for the previous core to be available (and in any case its data will likely have been bumped from the caches by the other thread).
In addition, modern CPUs allow all the cores in a package to share high-level caches. See the "Smart Cache" feature in this Intel Core i5 spec sheet. You still lose the lower-level cache(s) on core switch, but those are small and will probably churn somewhat anyway if you're running more than just a small tight loop.
I've been going through this tutorial on parallel pipelines and noticed that, while there is definitely a considerable improvement in throughput, couldn't it be even better if the compression stage also took on a read job, since it's just waiting around anyway? The same goes for the write stage... I mean, why not take on a third compression job, then switch over to writing two, and have one of those cores go back to compressing while the other wraps up the third write, and so on?
I apologize if this is obvious. I imagine this is standard practice and is called something; I'm just not sure what. Is there any overhead involved with switching jobs like this?
And I know this might be the wrong forum for this last question, but can the GPU switch jobs like this or should the programmable shaders/CUDA cores pretty much be left alone after being programmed?
EDIT: I guess I also don't understand how taking the same six cores used in the 2-cores-per-stage example would be faster than just giving each of the six cores all three stages. Sure, there would be two cores that would do two, but that's still faster than the top scenario. I would understand it better in the GPU's case, since there is specialized hardware involved for certain computations, but generally speaking I don't see it. Maybe this example is weak or something, because I know parallel processing is here to stay.
This is definitely an issue with pipelining and there are a number of different ways to try and mitigate it.
With specialized hardware, the hardware will often be tuned to balance the time taken in each stage for typical workloads. Fixed-function stages in GPUs, for example, are typically balanced around the needs of a sample of representative game rendering workloads, with transistors allocated to try to balance the time taken in each stage. With static balancing like this there will still usually be some wasted performance, however.
An alternative approach that can be used in both software and hardware to balance a pipeline is to break the longer stages down into multiple shorter steps. This is a common strategy in CPU instruction pipelines but can also be useful in software. In your example, the longer running compression step could potentially be broken down into multiple shorter pipeline stages. Depending on the task this may be difficult or impossible to do efficiently however.
Task scheduling systems can be used to help balance workloads across CPUs in a software pipeline. In a task scheduling system, you have a number of worker threads (usually around one per hardware thread) and any task can run on any worker thread. You have an API to set up dependencies between tasks and the task scheduler is responsible for scheduling tasks to run wherever CPU time is available once their dependencies are satisfied. In your example, the cores with idle time running the Read and Write tasks could help out with Compress tasks rather than sitting idle as long as the Compress tasks had their Read task dependencies satisfied.
Traditional OS thread schedulers can give some of the same benefits of a task scheduling system. In your example, if the Read threads waited on a semaphore when their work queues were empty (to be signalled when new work was added to the queues), the OS could schedule Compress threads to run on those idle cores. This can work reasonably well for relatively long running pipeline stages (10s of milliseconds) but for shorter pipeline stages (sub 1ms) the overhead of the OS thread scheduling and the length of the thread time slice will likely mean a task scheduling system would give better performance.
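If the pipeline were written in Elixir, a rough sketch of the "any task on any free worker" idea could look like the following; read_chunk/1, compress/1 and write_chunk/1 are hypothetical stand-ins for the tutorial's stages, and the Read -> Compress -> Write dependency is expressed simply by composing them inside one task:
# Every worker pulls whichever chunk is ready next, so no core sits idle
# waiting for "its" stage.
chunks
|> Task.async_stream(
     fn chunk -> chunk |> read_chunk() |> compress() |> write_chunk() end,
     max_concurrency: System.schedulers_online(), ordered: false, timeout: :infinity)
|> Stream.run()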
Your points are valid. The tutorial is lacking.
If the read, compress, and write operations can all occur at once, independently, the simple non-pipelined case would be the fastest for the six cores. Also notice that in the six-core diagram the reads and writes never overlap, so the same cores could handle both; you only need four cores.
But consider a case where the reads all access the same disk, so issuing too many read operations in parallel makes the reads take longer because they interfere with each other. In this case you can gain by pipelining the reads, since you start the first compress steps sooner, and it is those compress steps that limit the overall performance.
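As a hedged Elixir sketch of that situation (read_chunk/1 and compress/1 are hypothetical helpers): cap the contended read stage at a small concurrency and let compression fan out over the remaining cores; because the stream is lazy, the first compress tasks start as soon as the first reads finish, which is exactly the gain from pipelining the reads.
chunks
|> Task.async_stream(&read_chunk/1, max_concurrency: 2, timeout: :infinity)
|> Task.async_stream(fn {:ok, data} -> compress(data) end,
     max_concurrency: System.schedulers_online(), timeout: :infinity)
|> Stream.run()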
Has anyone else noticed terrible performance when scaling up to use all the cores on a cloud instance with somewhat memory-intensive jobs (2.5GB in my case)?
When I run jobs locally on my quad-core Xeon chip, the difference between using 1 core and all 4 cores is about a 25% slowdown with all cores. This is to be expected from what I understand; a drop in clock rate as more cores get used is part of the multi-core chip design.
But when I run the jobs on a multicore virtual instance, I am seeing a slowdown of 2x-4x in processing time between using 1 core and all cores. I've seen this on GCE, EC2, and Rackspace instances. And I have tested many different instance types, mostly the fastest offered.
So has this behavior been seen by others with jobs about the same size in memory usage?
The jobs I am running are written in Fortran. I did not write them, and I'm not really a Fortran guy, so my knowledge of them is limited. I know they have low I/O needs. They appear to be CPU-bound when I watch top as they run. They run without any need to communicate with each other, i.e., they are embarrassingly parallel. They each take about 2.5GB of memory.
So my best guess so far is that jobs that use this much memory take a big hit from the virtualization layer's memory management. It could also be that my jobs are competing for an I/O resource, but this seems highly unlikely according to an expert.
My workaround for now is to use GCE, because they have single-core instances that actually run the jobs as fast as my laptop's chip and are priced almost proportionally by core.
You might be running into memory bandwidth constraints, depending on your data access pattern.
The linux perf tool might give some insight into this, though I'll admit that I don't entirely understand your description of the problem. If I understand correctly:
Running one copy of the single-threaded program on your laptop takes X minutes to complete.
Running 4 copies of the single-threaded program on your laptop, each copy takes X * 1.25 minutes to complete.
Running one copy of the single-threaded program on various cloud instances takes X minutes to complete.
Running N copies of the single-threaded program on an N-core virtual cloud instance, each copy takes X * 2-4 minutes to complete.
If so, it sounds like you're running into either kernel contention or contention for, e.g., memory I/O. It would be interesting to see whether various Fortran compiler options might help optimize memory access patterns, for example enabling SSE2 load/store intrinsics or other optimizations. You might also compare results between GCC's and Intel's Fortran compilers.
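As a hedged example of checking for that (my_fortran_job is a placeholder for your binary, and the available counters vary by kernel and CPU), perf can report cache behaviour for one copy running alone versus N copies running concurrently:
perf stat -d ./my_fortran_job
perf stat -e cycles,instructions,cache-references,cache-misses ./my_fortran_job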
Imagine I have two (three, four, whatever) tasks that have to run in parallel. Now, the easy way to do this would be to create separate threads and forget about it. But on a plain old single-core CPU that would mean a lot of context switching - and we all know that context switching is big, bad, slow, and generally simply Evil. It should be avoided, right?
On that note, if I'm writing the software from the ground up anyway, I could go the extra mile and implement my own task switching. Split each task into parts, save the state in between, and then switch among them within a single thread. Or, if I detect that there are multiple CPU cores, I could just give each task to a separate thread and all would be well.
The second solution does have the advantage of adapting to the number of available CPU cores, but will the manual task switch really be faster than the one in the OS kernel? Especially if I'm trying to make the whole thing generic with a TaskManager and an ITask, etc.?
Clarification: I'm a Windows developer so I'm primarily interested in the answer for this OS, but it would be most interesting to find out about other OSes as well. When you write your answer, please state for which OS it is.
More clarification: OK, so this isn't in the context of a particular application. It's really a general question, the result of my musings about scalability. If I want my application to scale and effectively utilize future CPUs (and even the different CPUs of today) I must make it multithreaded. But how many threads? If I make a constant number of threads, then the program will perform suboptimally on all CPUs that do not have the same number of cores.
Ideally the number of threads would be determined at runtime, but few are the tasks that can truly be split into an arbitrary number of parts at runtime. Many tasks, however, can be split into a pretty large constant number of threads at design time. So, for instance, if my program could spawn 32 threads, it would already utilize all the cores of up to 32-core CPUs, which is pretty far in the future yet (I think). But on a simple single-core or dual-core CPU it would mean a LOT of context switching, which would slow things down.
Thus my idea about manual task switching. This way one could create 32 "virtual" threads that would be mapped onto as many real threads as is optimal, and the "context switching" would be done manually. The question is just: would the overhead of my manual "context switching" be less than that of OS context switching?
Naturally, all this applies to processes which are CPU-bound, like games. For your run-of-the-mill CRUD application this has little value. Such an application is best made with one thread (at most two).
I don't see how a manual task switch could be faster, since the OS kernel is still switching other processes, including yours, in and out of the running state too. Seems like a premature optimization and a potentially huge waste of effort.
If the system isn't doing anything else, chances are you won't have a huge number of context switches anyway. The thread will use its timeslice, the kernel scheduler will see that nothing else needs to run and switch right back to your thread. Also the OS will make a best effort to keep from moving threads between CPUs so you benefit there with caching.
If you are really CPU-bound, detect the number of CPUs and start that many threads. You should see nearly 100% CPU utilization. If not, you aren't completely CPU-bound, and maybe the answer is to start N + X threads. For very I/O-bound processes, you would start a (large) multiple of the CPU count (e.g. high-traffic webservers run 1000+ threads).
Finally, for reference, both Windows and Linux schedulers wake up every millisecond to check if another process needs to run. So, even on an idle system you will see 1000+ context switches per second. On heavily loaded systems, I have seen over 10,000 per second per CPU without any significant issues.
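As an aside, the "many virtual tasks mapped onto as many real threads as is optimal" scheme from the question is essentially what a worker pool gives you. A minimal sketch in Elixir terms, where do_work/1 is a hypothetical CPU-bound function and the pool size follows the detected core count:
# one worker per core; each "virtual task" is just an item in the list,
# and the runtime does the switching between them
tasks
|> Task.async_stream(&do_work/1,
     max_concurrency: System.schedulers_online(), timeout: :infinity)
|> Enum.to_list()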
The only advantage of a manual switch that I can see is that you have better control over where and when the switch happens. The ideal place is, of course, after a unit of work has been completed, so that you can throw all of its state away at once. This saves you cache misses.
I advise you not to spend your effort on this.
Single-core Windows machines are going to become extinct in the next few years, so I generally write new code with the assumption that multi-core is the common case. I'd say go with OS thread management, which will automatically take care of whatever concurrency the hardware provides, now and in the future.
I don't know what your application does, but unless you have multiple compute-bound tasks, I doubt that context switches are a significant bottleneck in most applications. If your tasks block on I/O, then you are not going to get much benefit from trying to out-do the OS.