If I am running R on Linux or on a Mac, I can detect the number of available cores using multicore:::detectCores(). However, there is no Windows version of the multicore functions, so I can't use this technique on Windows.
How can I programmatically detect the number of cores on a Windows machine, from within R?
The parallel package now has a function to detect the number of cores: parallel::detectCores().
This thread has a number of suggestions, including:
Sys.getenv('NUMBER_OF_PROCESSORS')
Note also the posting in that thread by Prof. Ripley, which speaks to the difficulties of doing this.
If you actually need to distinguish between physical cores, chips, and logical processors, the API to call is GetLogicalProcessorInformation.
Use GetSystemInfo if you just want to know how many logical processors the machine has (with no differentiation for hyperthreading).
How you call this from R is beyond me, but I'd guess R has a facility for invoking code in native Windows DLLs.
GetSystemInfo will give you a structure that has the number of "processors", which corresponds to the total number of cores.
In theory, it will be the same value as the environment variable recommended in another answer, but the user can tamper with (or delete) the environment variable. That can be a bug or a feature depending on your intent.
I have a section of slow code that I want to speed up by using the parallel gem and multiple cores.
With the Parallel gem, you must specify the number of processes or threads you want to use, so I hard-coded it to use the same number of logical cores that I have. It works perfectly, but my problem is that this code is intended to be distributed and used by other people who may have a different number of cores.
Should I try to detect the number of cores that their machine has, and use that number? Or should I default to no parallelism and only switch to multi-threaded code if the user explicitly specifies the number of threads they'd like to use? (e.g. pg_restore)
If I do try to detect cores, should I try to utilise all cores found, or would it be more polite to use, say, all but one of the cores?
No idea how memory-intensive your program is, but its requirements could also cause major unexpected issues for people with less memory than the machine you're testing it on.
Since it's a CLI tool, why not add a flag like --procs that takes an argument for the number of processes to use, and leave it up to the user to decide?
I am running 60 MPI processes and MKL_THREAD_NUM is set to 4 to get me to the full 240 hardware threads on the Xeon Phi. My code is running but I want to make sure that MKL is actually using 4 threads. What is the best way to check this with the limited Xeon Phi linux kernel?
You can set MKL_NUM_THREADS to 4 if you like. However, using every single thread does not necessarily give the best performance. In some cases, the MKL library knows things about the algorithm that mean fewer threads is better; in those cases, the library routines can choose to use fewer threads.
You should only use 60 MPI ranks if you have 61 cores. If you are going to use that many MPI ranks, you will want to set the I_MPI_PIN_DOMAIN environment variable to "core". Remember to leave one core free for the OS and system-level processes. This will put one rank per core on the coprocessor and allow all the OpenMP threads for each MPI process to reside on the same core, giving you better cache behavior.
If you do this, you can also use micsmc in GUI mode on the host processor to continuously monitor the activity on all the cores. With one MPI process per core, you can see how much of the time all threads on a core are being used.
Set MKL_NUM_THREADS to 4. You can use the environment variable or a runtime call. This value will be respected, so there is nothing to check.
The Linux kernel on KNC is not stripped down, so I don't know why you think that's a limitation. You should not use any system calls for this anyway, though.
I am running a parallel algorithm using light threads and I am wondering how are these assigned to different cores when the system provides several cores and several chips. Are threads assigned to a single chip until all the cores on the chip are exhausted? Are threads assigned to cores on different chips in order to better distribute the work between chips?
You don't say what OS you're on, but on Linux, threads are assigned to cores based on the load on each core. A thread that is ready to run will be assigned to the core with the lowest load unless you specify otherwise by setting thread affinity. You can do this with sched_setaffinity(); see the man page for more details. In general, as meyes1979 said, this is something that is decided by the scheduler implemented in the OS you are using.
Depending upon the version of Linux you're using, there are two articles that might be helpful: this article describes early 2.6 kernels, up through 2.6.22, and this article describes kernels newer than 2.6.23.
Different threading libraries perform threading operations differently. The "standard" in Linux these days is NPTL, which schedules threads at the same level as processes. This is quite fine, as process creation is fast on Linux, and is intended to always remain fast.
The Linux kernel attempts to provide very strong CPU affinity with executing processes and threads to increase the ratio of cache hits to cache misses -- if a task always executes on the same core, it'll more likely have pre-populated cache lines.
This is usually a good thing, but I have noticed the kernel might not always migrate tasks away from busy cores to idle cores. This behavior is liable to change from version to version, but I have found multiple CPU-bound tasks all running on one core while three other cores were idle. (I found it by noticing that one core was six or seven degrees Celsius warmer than the other three.)
In general, the right thing should just happen; but when the kernel does not automatically migrate tasks to other processors, you can use the taskset(1) command to restrict the processors allowed to programs or you could modify your program to use the pthread_setaffinity_np(3) function to ask for individual threads to be migrated. (This is perhaps best for in-house applications -- one of your users might not want your program to use all available cores. If you do choose to include calls to this function within your program, make sure it is configurable via configuration files to provide functionality similar to the taskset(1) program.)
So I've looked around online for some time to no avail. I'm new to using OpenMP and so not sure of the terminology here, but is there a way to figure out a specific machine's mapping between OpenMP thread numbers (given by omp_get_thread_num()) and the physical cores on which the threads will run?
I was also interested in how exactly OpenMP assigns threads: for example, is thread 0 always going to run in the same location when the same code is run on the same machine? Thanks.
Typically, the OS takes care of assigning threads to cores, including with OpenMP. This is by design, and a good thing - you normally would want the OS to be able to move a thread across cores (transparently to your application) as required, since it will interrupt your application at times.
Certain operating system APIs will allow thread affinity to be set. For example, on Windows, you can use SetThreadAffinityMask to force a thread onto a specific core.
Most of the time Reed is correct, OpenMP doesn't care about the assignment of threads to cores (or processors). However, because of things like cache reuse and data locality we have found that there are many cases where having the threads assigned to cores increases the performance of OpenMP. Therefore if you look at most OpenMP implementations, you will find that there is usually some environment variable that can be set to "bind" threads to cores. The OpenMP ARB has not yet specified any "standard" way of doing this, so at this time it is left up to an OpenMP implementation to decide if and how this should be done. There has been a great deal of discussion about whether this should be included in the OpenMP spec or not and if so how it could best be done.
Apple introduced Grand Central Dispatch (a thread pool) in Snow Leopard, but hasn't gone into why one should use it over OpenMP, which is cross-platform and also works on Leopard. They're both pretty easy to use and look similar in capability. So, any ideas?
GCD is much better at runtime evaluation of the appropriate level of resources to throw at a problem - OpenMP decides how many threads to invoke for a set of parallel tasks based on information like environment variables. GCD looks at the current system load and number of available cores and allows an appropriate number of threads to run - scaling up and back as the resource usage changes in real time. That means that a GCD program ought to get better results in the general case. Of course, if you've bought a cluster of dedicated boxes to run your code, then this is moot because there will be little else for your code to conflict with.
Now that GCD has been open-sourced, it's a matter of putting both tools side by side and seeing which survives.
Performance and OS Level Integration?