How many kernels run in parallel in Mathematica? - wolfram-mathematica

The Wolfram site states that typically only 4 cores are used with its Parallel feature. If you want more than 4 you need to contact them and pay up.
I have a machine with 2 quad-core hyperthreaded processors. When I run Parallel commands, it starts up 16 kernels (2 x 4 x 2; the factor of 2 from HT, I guess). So it looks like 16 kernels are used and not 4. Correct? It may be the case that my university's license allows for more than 4 cores. I just wanted to check whether I am actually using all available cores.
Thanks.

A standard Mathematica license comes with 2 main kernels and 4 subkernels for each of those kernels. So that would be 8 subkernels if your program uses more than 1 main kernel. Subkernels are essentially what you use for parallel processing.
If you want to see how many subkernels you are allowed, please either
(1) Contact Wolfram customer support about this at info@wolfram.com
(2) Check your user portal account at user.wolfram.com. After entering your password, go to "My Products and Services" and select the copy of Mathematica you are interested in. On that product's page, you will see an entry called "Processes" which tells you how many different processes your license allows.
You can use commands such as $KernelCount to see how many subkernels are running.

Related

Saving JULIA_NUM_THREADS with physical cores or logical cores

I need to initialize this environment variable to parallelize the solving of differential equations. I know how to create a system variable in Windows; what I want to ask is which number I should initialize this variable with. I have a 6-core CPU, but after including logical cores it becomes 12. So should I initialize it with 6 or 12 for better performance?
Also, when I run the command Base.Sys.CPU_THREADS, I get 12, but when I run the command Threads.nthreads(), I get 1.
Also, which one would be optimal: setting JULIA_NUM_THREADS to 6 or 12?
You can always run Julia with the --threads auto option
julia --threads auto
You will see that it allocates all available logical threads (in the case of my machine):
julia> Threads.nthreads()
8
Now the question "what is the optimal number of threads" is far more complicated. The rule of thumb is to use the number of logical cores, up to around 16 threads. From my experience, when you have a machine with a higher level of parallelism, multiprocessing is going to perform better (again, it depends on the use case).
In decisions like this, BenchmarkTools is your biggest friend.
Another thing to be aware of is hardware limitations: if you have a laptop with 8 logical cores and you run it multi-threaded for long periods of time, it will overheat, and you may end up needing a new laptop (I have burned out three laptops this way).

How can I check that MKL calls are running with the correct number of threads on Xeon Phi?

I am running 60 MPI processes and MKL_NUM_THREADS is set to 4 to get me to the full 240 hardware threads on the Xeon Phi. My code is running, but I want to make sure that MKL is actually using 4 threads. What is the best way to check this with the limited Xeon Phi Linux kernel?
You can set MKL_NUM_THREADS to 4 if you like. However, using every single thread does not necessarily give the best performance. In some cases, the MKL library knows things about the algorithm that mean fewer threads is better; in those cases, the library routines can choose to use fewer threads.
You should only use 60 MPI ranks if you have 61 cores. If you are going to use that many MPI ranks, you will want to set the I_MPI_PIN_DOMAIN environment variable to "core". Remember to leave one core free for the OS and system-level processes. This will put one rank per core on the coprocessor and allow all the OpenMP threads for each MPI process to reside on the same core, giving you better cache behavior. If you do this, you can also use micsmc in GUI mode on the host processor to continuously monitor the activity on all the cores. With one MPI process per core, you can see how much of the time all threads on a core are being used.
Set MKL_NUM_THREADS to 4. You can use the environment variable or a runtime call. This value will be respected, so there is nothing to check.
The Linux kernel on KNC is not stripped down, so I don't know why you think that's a limitation. You should not need any system calls for this anyway, though.
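If you would rather verify the thread count from inside the program than trust the environment variable, here is a minimal sketch. It assumes an Intel MKL installation and a build linked against MKL (the compile line in the comment is only an example); mkl_set_num_threads and mkl_get_max_threads are MKL's standard service functions.

/* check_mkl_threads.c -- build with something like: icc check_mkl_threads.c -mkl */
#include <stdio.h>
#include <mkl.h>

int main(void)
{
    /* Equivalent to exporting MKL_NUM_THREADS=4 before launching. */
    mkl_set_num_threads(4);

    /* Upper bound on the threads MKL will use; individual routines
       may still choose fewer for small problems, as noted above. */
    printf("MKL max threads: %d\n", mkl_get_max_threads());
    return 0;
}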

How are lightweight threads scheduled by the Linux kernel on a multichip multicore SMP system?

I am running a parallel algorithm using lightweight threads, and I am wondering how these are assigned to different cores when the system provides several cores and several chips. Are threads assigned to a single chip until all the cores on that chip are exhausted? Are threads assigned to cores on different chips in order to better distribute the work between chips?
You don't say what OS you're on, but in Linux, threads are assigned to a core based on the load on that core. A thread that is ready to run will be assigned to the core with the lowest load unless you specify otherwise by setting thread affinity. You can do this with sched_setaffinity(); see the man page for more details. In general, as meyes1979 said, this is something that is decided by the scheduler implemented in the OS you are using.
Depending upon the version of Linux you're using, there are two articles that might be helpful: one describing the early 2.6 kernels (up through 2.6.22), and one describing kernels newer than 2.6.23.
Different threading libraries perform threading operations differently. The "standard" in Linux these days is NPTL, which schedules threads at the same level as processes. This is quite fine, as process creation is fast on Linux, and is intended to always remain fast.
The Linux kernel attempts to provide very strong CPU affinity with executing processes and threads to increase the ratio of cache hits to cache misses -- if a task always executes on the same core, it'll more likely have pre-populated cache lines.
This is usually a good thing, but I have noticed the kernel might not always migrate tasks away from busy cores to idle cores. This behavior is liable to change from version to version, but I have found multiple CPU-bound tasks all running on one core while three other cores were idle. (I found it by noticing that one core was six or seven degrees Celsius warmer than the other three.)
In general, the right thing should just happen; but when the kernel does not automatically migrate tasks to other processors, you can use the taskset(1) command to restrict the set of processors a program is allowed to run on, or you could modify your program to use the pthread_setaffinity_np(3) function to ask for individual threads to be migrated. (This is perhaps best for in-house applications: one of your users might not want your program to use all available cores. If you do choose to include calls to this function within your program, make sure it is configurable via configuration files, to provide functionality similar to the taskset(1) program.)
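As a concrete illustration of the pthread_setaffinity_np(3) route, here is a minimal, Linux/glibc-specific sketch that pins the calling thread to CPU 0; treat it as a starting point rather than a drop-in solution.

/* pin.c -- compile with: gcc -pthread pin.c -o pin */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);  /* allow this thread to run only on CPU 0 */

    int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc != 0) {
        fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
        return 1;
    }

    /* sched_getcpu() reports which CPU the thread is currently running on. */
    printf("now running on CPU %d\n", sched_getcpu());
    return 0;
}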

Should I disable HyperThreading to run parallel simulations?

My computer has a quad-core i7 processor. I'm studying parallelization of scientific simulations. How does hyperthreading impact parallel performance? I know I should never use more than 4 working processes to get decent performance. But should I disable hyperthreading as well? Does it have an impact on parallel performance?
In my experience, running electromagnetic modelling and inversion codes, the answer is yes, you should disable hyperthreading. But this is not the sort of question which is well answered by other people's anecdotes (not even mine, fascinating and true as they are).
You are the student; this is definitely a topic worth spending your own time on to come to your own conclusions. There are so many factors involved that my experience running my codes on my platforms is nearly worthless to you.
Under Linux, if you have 4 busy threads on an i7, it will place each one on a different core. Provided the other half of each core is idle, the performance should be the same. If you are running another program, it is debatable whether it is better to have hyperthreading run the extra programs or to rely on context switching. (I suspect less context switching is better.)
A common mistake is assuming that if you use 8 threads instead of 4 it will be twice as fast. It might be only slightly faster (in which case it might still be worth it) or slightly slower (in which case limit your program to 4 threads). I have found examples where using double the number of threads was slightly faster. IMHO, it's all a matter of testing to find the optimal number and using that many, as sketched below.
The only time I can see that you need to turn HT off is when you have no control over how your application behaves and using 4 threads is faster.
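One way to do that testing, sketched here with OpenMP in C (the summation loop is only a stand-in for your real workload), is to time the same CPU-bound kernel at several thread counts and compare:

/* ht_sweep.c -- compile with: gcc -O2 -fopenmp ht_sweep.c -o ht_sweep */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    const long n = 200000000L;
    const int counts[] = {1, 2, 4, 8};   /* thread counts to try */

    for (int i = 0; i < 4; i++) {
        double t0 = omp_get_wtime();
        double sum = 0.0;

        /* Placeholder CPU-bound work; substitute your own kernel here. */
        #pragma omp parallel for num_threads(counts[i]) reduction(+:sum)
        for (long j = 0; j < n; j++)
            sum += (double)j * 1e-9;

        printf("%d threads: %.3f s (checksum %.3f)\n",
               counts[i], omp_get_wtime() - t0, sum);
    }
    return 0;
}

If 8 threads beats 4 on your workload, hyperthreading is helping; if it is slower, cap your program at the physical core count rather than disabling HT in the BIOS.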
You state:
I know I should never use more than 4 working processes to get decent performance.
This isn't necessarily true! Here is an example of what I have found running on an i7-3820 with HT enabled. All of the code I was running was C++. Consider that I have 8 separate (albeit identical) programs that I need to run. I have tried the following two ways of running them:
Run only 4 separate threads at a time, simultaneously. When these 4 complete, run the next 4 threads (4 x 2 = 8 total).
Run all 8 as separate threads simultaneously (8 x 1 = 8 total).
As you can see, these two scenarios achieve the same thing. However, what I found is that the run times are:
1 hour for each set of 4 threads, for a total of 2 hours to complete all 8.
1.5 hours for the set of 8 threads.
What you find is that a single thread will finish faster in case #1, but that overall #2 gives better performance, since ALL of your work is completed in less time. I found typical increases in performance to be ~25% with HT enabled.
As is evident, there are scenarios when running 8 threads is faster than 4.
HyperThreading is the Intel implementation of Simultaneous Multi-Threading (SMT). In general, SMT is almost always beneficial (this is why it is usually enabled), unless your application is CPU-bound. If you know for sure that your application is CPU-bound, then disable SMT. Otherwise (your application is IO-bound or is not able to completely saturate the cores), leave it enabled.

MPI on a single dual-core machine

What happens if I run an MPI program which requires 3 nodes (i.e. mpiexec -np 3 ./Program) on a single machine which has 2 CPUs?
This depends on your MPI implementation, of course. Most likely, it will create three processes, and use shared memory to exchange the messages. This will work just fine: the operating system will dispatch the two CPUs across the three processes, and always execute one of the ready processes. If a process waits to receive a message, it will block, and the operating system will schedule one of the other two processes to run - one of which will be the one that is sending the message.
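If you want to watch this happen, a minimal MPI program makes the behaviour easy to observe. Only standard MPI calls are used; the build and launch commands in the comment are typical but depend on your installation.

/* hello.c -- e.g.: mpicc hello.c -o hello && mpiexec -np 3 ./hello */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* All three ranks will print even on a two-CPU machine; the OS
       simply time-slices them across the available cores. */
    printf("rank %d of %d is alive\n", rank, size);

    MPI_Finalize();
    return 0;
}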
Martin has given the right answer and I've plus-1ed him, but I just want to add a few subtleties which are a little too long to fit into the comment box.
There's nothing wrong with having more processes than cores, of course; you probably have dozens running on your machine well before you run any MPI program. You can try this with any command-line executable you have sitting around, something like mpirun -np 24 hostname or mpirun -np 17 ls on a Linux box, and you'll get 24 copies of your hostname, or 17 (probably interleaved) directory listings, and everything will run fine.
In MPI, using more processes than cores is generally called 'oversubscribing'. The fact that it has a special name already suggests that it's a special case. The sorts of programs written with MPI typically perform best when each process has its own core. There are situations where this need not be the case, but it's (by far) the usual one. And for this reason, for instance, OpenMPI has optimized for the usual case: it just makes the strong assumption that every process has its own core, and so is very aggressive in using the CPU to poll to see if a message has come in yet (since it figures it's not doing anything else crucial). That's not a problem, and can easily be turned off if OpenMPI knows it's being oversubscribed ( http://www.open-mpi.org/faq/?category=running#oversubscribing ). It's a design decision, and one which improves the performance of the vast majority of cases.
For historical reasons I'm more familiar with OpenMPI than MPICH2, but my understanding is that MPICH2's defaults are more forgiving of the oversubscribed case; even there, though, I think it's possible to turn on more aggressive busy-waiting.
Anyway, this is a long way of saying that yes, what you're doing is perfectly fine, and if you see any weird problems when you switch MPIs or even versions of MPIs, do a quick search to see if there are any parameters that need to be tweaked for this case.
