After learning that one can isolate CPUs (in the sense that they are no longer under the supervision of the scheduler) using the isolcpus parameter at boot time, how can one determine which CPUs are under the scheduler's supervision in Linux?
Try stress; it will attempt to consume all the CPU resources on the machine. Then, with the 'top' command, you can see which CPUs stay idle and are therefore isolated.
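If you would rather query the kernel directly than infer it from top, the CPU sysfs files can tell you. Here is a minimal C sketch; it assumes a kernel recent enough to expose /sys/devices/system/cpu/isolated (on older kernels you only get the isolcpus= parameter back from /proc/cmdline):

    /* Print the kernel's view of present vs. isolated CPUs.
     * Assumes /sys/devices/system/cpu/isolated exists (newer kernels);
     * otherwise fall back to looking for isolcpus= in /proc/cmdline. */
    #include <stdio.h>

    static void dump(const char *label, const char *path)
    {
        char buf[4096];
        FILE *f = fopen(path, "r");
        if (!f) {
            printf("%s: (not available)\n", label);
            return;
        }
        if (fgets(buf, sizeof(buf), f))
            printf("%s: %s", label, buf);
        fclose(f);
    }

    int main(void)
    {
        dump("present CPUs ", "/sys/devices/system/cpu/present");
        dump("isolated CPUs", "/sys/devices/system/cpu/isolated");
        dump("boot cmdline ", "/proc/cmdline");   /* look for isolcpus=... */
        return 0;
    }

Any CPU listed as present but not isolated is still under the scheduler's supervision.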
I recently enabled cgroups/cpu isolation on my Mesos cluster. I've been running some stress tests (like starting some cpu-bound programs and seeing if a cpu-burst program can jump in and claim its cpu allocation), and it looks like Mesos is slicing the cpu correctly. However, I've seen some posts claiming it's dangerous for cpu-bound programs to take all idle cpu.
I'm trying to understand exactly what the dangers of soft-limiting cpu are. Is the problem that a critical task may not be able to use its full cpu allocation immediately? What are some situations where soft limits on cpu would cause problems? The alternative to my current setup is CFS scheduling, but my programs tend to be idle most of the time.
I use Marathon and Chronos (latest stable versions) to schedule tasks on my Mesos cluster (also the latest stable version).
The main danger of soft-limiting CPU is the inherent uncertainty. "Explicit is better than implicit." You hope your task gets scheduled on a host machine whose other tasks are mostly idle, but it might not be so lucky. In the unlucky cases where other tasks are bursting, your task's performance is negatively affected relative to an environment with hard limits. You may value predictability more than burst-ability. In a more ideal world, we might even want a mix.
That being said, hard limits are not necessarily a silver bullet. I can't speak to the reasoning of the posts you mention, but even the Mesos docs mention that CFS may not be appropriate for everything: https://mesosphere.github.io/marathon/docs/cfs.html
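To make the soft/hard distinction concrete, this is roughly what the two knobs look like at the cgroup v1 level underneath Mesos. This is only an illustrative sketch; the /sys/fs/cgroup/cpu mount point and the group name "demo" are assumptions about the local setup, and normally Mesos manages these files for you:

    /* Illustrate soft (cpu.shares) vs. hard (CFS quota) CPU limits by
     * writing the cgroup v1 control files directly. Assumes root, cgroup v1
     * mounted at /sys/fs/cgroup/cpu, and an existing group "demo"
     * (mkdir /sys/fs/cgroup/cpu/demo). */
    #include <stdio.h>
    #include <stdlib.h>

    static void write_knob(const char *path, const char *value)
    {
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); exit(1); }
        fputs(value, f);
        fclose(f);
    }

    int main(void)
    {
        /* Soft limit: a relative weight that only matters under contention.
         * If the neighbours are idle, this group can burst well past it. */
        write_knob("/sys/fs/cgroup/cpu/demo/cpu.shares", "512");

        /* Hard limit: at most 0.5 CPU per 100 ms period, even if the rest
         * of the machine is idle. Predictable, but no bursting. */
        write_knob("/sys/fs/cgroup/cpu/demo/cpu.cfs_period_us", "100000");
        write_knob("/sys/fs/cgroup/cpu/demo/cpu.cfs_quota_us",  "50000");
        return 0;
    }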
Has anyone else noticed terrible performance when scaling up to use all the cores on a cloud instance with somewhat memory-intensive jobs (2.5 GB in my case)?
When I run jobs locally on my quad-core Xeon chip, the difference between using 1 core and all 4 cores is about a 25% slowdown with all cores in use. This is to be expected from what I understand; a drop in clock rate as more cores get used is part of the multi-core chip design.
But when I run the jobs on a multicore virtual instance, I am seeing a slowdown of 2x - 4x in processing time between using 1 core and all cores. I've seen this on GCE, EC2, and Rackspace instances, and I have tested many different instance types, mostly the fastest offered.
So has this behavior been seen by others with jobs about the same size in memory usage?
The jobs I am running are written in Fortran. I did not write them, and I'm not really a Fortran guy, so my knowledge of them is limited. I know they have low I/O needs. They appear to be CPU-bound when I watch top as they run. They run without needing to communicate with each other, i.e., they are embarrassingly parallel. They each take about 2.5 GB of memory.
So my best guess so far is that jobs that use this much memory take a big hit from the virtualization layer's memory management. It could also be that my jobs are competing for an I/O resource, but this seems highly unlikely according to an expert.
My workaround for now is to use GCE, because they offer single-core instances that actually run the jobs as fast as my laptop's chip, and they are priced almost proportionally by core.
You might be running into memory bandwidth constraints, depending on your data access pattern.
The linux perf tool might give some insight into this, though I'll admit that I don't entirely understand your description of the problem. If I understand correctly:
Running one copy of the single-threaded program on your laptop takes X minutes to complete.
Running 4 copies of the single-threaded program on your laptop, each copy takes X * 1.25 minutes to complete.
Running one copy of the single-threaded program on various cloud instances takes X minutes to complete.
Running N copies of the single-threaded program on an N-core virtual cloud instance, each copy takes X * 2-4 minutes to complete.
If so, it sounds like you're running into either kernel contention or contention for, e.g., memory I/O. It would be interesting to see whether various Fortran compiler options might help optimize memory access patterns, for example enabling SSE2 load/store intrinsics or other optimizations. You might also compare results between GCC's and Intel's Fortran compilers.
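One way to test the memory-bandwidth hypothesis without touching the Fortran code is a small STREAM-style probe: run one copy, then one copy per core, and see whether per-copy throughput collapses the way your real jobs do. A rough sketch (the array size and repetition count are arbitrary choices):

    /* Rough STREAM-style bandwidth probe. Build with optimization (e.g. -O2)
     * and run 1 copy, then N copies, one per core; if the per-copy GB/s
     * figure drops sharply, the bottleneck is memory, not CPU. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N    (64UL * 1024 * 1024)   /* 64M doubles = 512 MiB per array */
    #define REPS 10

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        if (!a || !b) { perror("malloc"); return 1; }

        for (size_t i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < REPS; r++)
            for (size_t i = 0; i < N; i++)
                a[i] = a[i] + 3.0 * b[i];          /* triad-like kernel */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        double bytes = (double)REPS * N * 3 * sizeof(double);  /* 2 reads + 1 write */
        printf("%.2f GB/s (checksum %.1f)\n", bytes / secs / 1e9, a[0] + a[N - 1]);
        return 0;
    }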
I am running a parallel algorithm using lightweight threads, and I am wondering how these are assigned to different cores when the system provides several cores on several chips. Are threads assigned to a single chip until all the cores on that chip are exhausted? Or are threads assigned to cores on different chips in order to better distribute the work between chips?
You don't say what OS you're on, but on Linux, threads are assigned to a core based on the load on that core. A thread that is ready to run will be assigned to the core with the lowest load, unless you specify otherwise by setting thread affinity. You can do this with sched_setaffinity(); see the man page for more details. In general, as meyes1979 said, this is something decided by the scheduler implemented in the OS you are using.
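For example, pinning the calling process to the first two CPUs looks roughly like this (the CPU numbers are just an example):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);                 /* allow CPU 0 */
        CPU_SET(1, &set);                 /* allow CPU 1 */

        /* pid 0 means "the calling process" */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        printf("now restricted to CPUs 0 and 1\n");
        return 0;
    }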
Depending upon the version of Linux you're using, there are two articles that might be helpful: this article describes early 2.6 kernels, up through 2.6.22, and this article describes kernels newer than 2.6.23.
Different threading libraries perform threading operations differently. The "standard" in Linux these days is NPTL, which schedules threads at the same level as processes. This is quite fine, as process creation is fast on Linux, and is intended to always remain fast.
The Linux kernel attempts to provide very strong CPU affinity for executing processes and threads, to increase the ratio of cache hits to cache misses -- if a task always executes on the same core, it will more likely have pre-populated cache lines.
This is usually a good thing, but I have noticed the kernel might not always migrate tasks away from busy cores to idle cores. This behavior is liable to change from version to version, but I have found multiple CPU-bound tasks all running on one core while three other cores were idle. (I found it by noticing that one core was six or seven degrees Celsius warmer than the other three.)
In general, the right thing should just happen; but when the kernel does not automatically migrate tasks to other processors, you can use the taskset(1) command to restrict the set of processors a program is allowed to run on, or you can modify your program to call pthread_setaffinity_np(3) to ask for individual threads to be migrated. (This is perhaps best for in-house applications -- one of your users might not want your program to use all available cores. If you do choose to include calls to this function within your program, make sure it is configurable via configuration files, to provide functionality similar to the taskset(1) program.)
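A minimal sketch of the pthread_setaffinity_np(3) route, in which a worker thread pins itself to a single core (core 2 is just an example; in practice the number would come from configuration, as suggested above):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static void *worker(void *arg)
    {
        int core = *(int *)arg;
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);

        /* A thread may pin itself (or any other thread, given its pthread_t). */
        if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
            perror("pthread_setaffinity_np");

        printf("worker pinned to CPU %d, now running on CPU %d\n",
               core, sched_getcpu());
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        int core = 2;     /* example value; read it from a config file in practice */

        pthread_create(&t, NULL, worker, &core);
        pthread_join(t, NULL);
        return 0;
    }

(Build with -pthread.)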
I'm using a Linux 2.6.x kernel on my machine, which has Ubuntu installed (Ubuntu is only mentioned in case it changes anything). The kernel runs on a machine that has 8 cores. The machine also runs OpenVZ, but I don't think that changes the context of the question.
I have a piece of software installed that allows the usage of only two CPUs, and it sets a hard CPU affinity to the first two CPUs (cpumask 3). I'm wondering how the scheduling of the other processes is affected by this. I think I read something about it, but for now I assume that processes tend to stay attached to their current CPUs, and that the kernel always tries to keep a process on the same CPU to avoid cache invalidation.
There are quite a few processes running on the machine. How does the kernel handle this situation? Can it be that the hard-affinity processes run slower because they are bound to a crowded set of CPUs? How does the kernel take the hard affinity into account?
What will happen in the long run is that the load balancing code of the scheduler will move more of the unbound tasks to the rest of the CPUs to account for this task being bound to the first two.
The way it works is that each task starts on the CPU where it was created, and at the micro level the Linux task scheduler makes scheduling decisions on each CPU without regard for the others. But then there is the more macro-level process-migration load-balancing code, which will step in and say: "the run queue (the list of processes waiting to be scheduled) on this CPU is longer than on that CPU, let's move some over to balance the loads".
Of course, since your specific task is bound to the first two CPUs, the load balancer will pick other tasks to move -- so in the long run your bound task will "push out" enough of the other, unbound tasks to the other CPUs, and balance will be preserved.
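If you want to watch this happen, a process can report which CPU it is currently running on via sched_getcpu(); start a few copies of something like the sketch below alongside the affinity-bound software, and over time the unbound copies should drift away from CPUs 0 and 1:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* CPU-bound loop that periodically reports the CPU it is running on,
         * so you can observe the load balancer migrating it. */
        for (int i = 0; i < 30; i++) {
            volatile unsigned long spin = 0;
            for (unsigned long j = 0; j < 200000000UL; j++)
                spin += j;                       /* burn some CPU */
            printf("pid %d, iteration %2d, on CPU %d\n",
                   getpid(), i, sched_getcpu());
        }
        return 0;
    }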
Some of the fellows in the office think that when they add threads to their code, Windows will assign these threads to run on different processors of a multi-core or multi-processor machine. Then, when this doesn't happen, everything gets blamed on these threads colliding with one another on said multi-core or multi-processor machine.
Could someone debunk or confirm this notion?
When an application spawns multiple threads, it is indeed possible for them to get assigned to different processors. In fact, it is not uncommon for incorrect multi-threaded code to run ok on a single-processor machine but then display problems on a multi-processor machine. (This happens if the code is safe in the face of time slicing but broken in the face of true concurrency.)
You can generally run only one thread optimally per CPU, but unless your application sets explicit thread affinity to one processor, then yes, Windows will assign these threads to free processors.
Windows will automatically execute multiple threads on different processors if the machine has multiple processors. If you are running on a single processor machine, the threads are time-sliced but when you move the process to a multiple processor machine, the process will automatically take advantage of the multiple processors.
Because the code runs truly simultaneously, the threads may be more likely to step on each other's toes on a multi-core machine than on a single-core machine, since both threads could be writing to a shared location at the same time, instead of that only happening when a thread swap is timed just right.
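A quick way to see this for yourself is to have each thread report the processor it happens to be executing on. This is only a rough Win32 sketch (it assumes a Windows version that provides GetCurrentProcessorNumber); on a multi-core machine the numbers typically differ, but nothing guarantees it:

    #include <windows.h>
    #include <stdio.h>

    static DWORD WINAPI worker(LPVOID arg)
    {
        int id = (int)(INT_PTR)arg;
        for (int i = 0; i < 5; i++) {
            printf("thread %d on processor %lu\n", id, GetCurrentProcessorNumber());
            Sleep(100);
        }
        return 0;
    }

    int main(void)
    {
        HANDLE threads[4];
        for (int i = 0; i < 4; i++)
            threads[i] = CreateThread(NULL, 0, worker, (LPVOID)(INT_PTR)i, 0, NULL);
        WaitForMultipleObjects(4, threads, TRUE, INFINITE);
        for (int i = 0; i < 4; i++)
            CloseHandle(threads[i]);
        return 0;
    }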
Yes. Threads and multi-threading have almost nothing to do with the number of CPUs or cores in a machine...
EDIT ADDITION: To talk about "how many threads run on a cpu" is an oxymoron. Only one thread can ever run on a single CPU at a time. Multi-threading is about multiple threads in a PROCESS, not on a CPU. Before another thread can be run on any CPU, the thread currently on that CPU has to STOP running, and its state must be preserved somewhere so that the OS can restart it when it gets its next "turn".
Code runs in "Processes", which are logical abstractions that can run one or more sequences of code instructions and manage computer resources independently from other processes. Within a process, each separate sequence of code instructions is a "thread". Which CPU they run on is irrelevant. A single thread can run on a different CPU each time it is allocated a CPU to run on... and multiple threads, as they are each allocated CPU cycles, may, by coincidence, run on the same CPU (although obviously not simultaneously).
The OS (a component of the OS) is responsible for "running" threads. It keeps an in-memory list of all threads and constantly "switches" among them (this is called a context switch). It does this on a single-CPU machine in almost exactly the same way as it does on a multi-CPU machine. Even on a multi-CPU machine, each time it "turns on" a thread, it might give it to a different CPU, or to the same CPU as it did the last time.
There is no guarantee that threads of your process will be assigned to run on different CPUs.
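If you do need a particular placement, you have to ask for it explicitly, for example with SetThreadAffinityMask; a minimal sketch (the mask value here is just an example that restricts the thread to processor 1):

    #include <windows.h>
    #include <stdio.h>

    static DWORD WINAPI worker(LPVOID arg)
    {
        printf("worker on processor %lu\n", GetCurrentProcessorNumber());
        return 0;
    }

    int main(void)
    {
        HANDLE t = CreateThread(NULL, 0, worker, NULL, CREATE_SUSPENDED, NULL);

        /* Example: restrict the thread to processor 1 (bit 1 of the mask).
         * Without such a call, Windows is free to run it anywhere. */
        SetThreadAffinityMask(t, 1 << 1);

        ResumeThread(t);
        WaitForSingleObject(t, INFINITE);
        CloseHandle(t);
        return 0;
    }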