Offload daemon on Xeon Phi 5110P - intel-mic

I am aware that the Intel Xeon Phi coprocessor SE10X has 61 cores
and that it is suggested to use only 60 cores, since 1 core is used for the offload daemon.
Since the Intel Xeon Phi coprocessor 5110P has 60 cores, is it likewise suggested to use only 59 cores?

From this MIC-related FAQ:
Sensible Affinities
Under Intel MPSS many of the kernel services and daemons are affinitized to the “Bootstrap Processor” (BSP), which is the last physical core. This is also where the offload daemon runs the services required to support data transfer for offload. It is therefore generally sensible to avoid using this core for user code. (Indeed, as already discussed, the offload system does that automatically by removing the logical CPUs on the last core from the default affinity of offloaded processes).
From this OpenMP on MIC guide:
Offloaded programs inherit an affinity map that hides the last core, which is dedicated to offload system functions. Native programs can use all the cores, making the calculations required for balancing the threads slightly different.
Neither of these sources is specific to any particular MIC model; they describe the architecture in general. So it seems that if you offload to the device and don't use the default affinity, you should indeed avoid the last core.

I evaluated the performance of my test code on an Intel Xeon Phi 7120P card. I observed that performance was best when the number of threads was a multiple of (number of cores - 1). This is because one of the cores is kept busy running the Linux micro-OS services.
In general:
number of threads to create >= K * T * (N - 1)
where
K = positive integer (K = 2 works fine)
T = number of hardware thread contexts per core (4 in my case)
N = number of cores present on the hardware
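As a rough sketch of that rule of thumb in OpenMP (a minimal example, not the original poster's code; the constants are placeholders for a 61-core 7120P and should be queried or tuned for your own card):

    // Sketch: pick a thread count of K * T * (N - 1), leaving the last core free.
    #include <omp.h>
    #include <cstdio>

    int main() {
        const int K = 2;   // positive tuning factor; K = 2 reportedly works fine
                           // (note this oversubscribes the card's hardware threads 2:1)
        const int T = 4;   // hardware thread contexts per core on Xeon Phi
        const int N = 61;  // physical cores on the card (61 on a 7120P)

        omp_set_num_threads(K * T * (N - 1));

        #pragma omp parallel
        {
            #pragma omp single
            std::printf("running with %d threads\n", omp_get_num_threads());
        }
        return 0;
    }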

When you execute your workload in offload mode (the application runs on the CPU and offloads some computation to the Xeon Phi), it is recommended to leave one core for the offload runtime. There is a COI daemon on the Xeon Phi side that runs four service threads to manage offload activity. Keep in mind that one physical core on the Xeon Phi runs 4 hardware threads.
In the native execution model, where the application is started directly on the Xeon Phi card, you can use all available cores, since there is no offload activity.
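To illustrate the difference (a minimal sketch assuming the Intel compiler's offload pragmas and a 60-core 5110P; the card index and the thread counts in the comments are assumptions, not measured values):

    // Offload vs. native thread counts on a Xeon Phi (Intel offload pragmas assumed).
    #include <omp.h>
    #include <cstdio>

    int main() {
        // Offloaded region: the offload runtime hides the last physical core by default,
        // so on a 60-core 5110P this typically reports 59 * 4 = 236 threads.
        #pragma offload target(mic:0)
        #pragma omp parallel
        {
            #pragma omp single
            std::printf("threads on the coprocessor: %d\n", omp_get_num_threads());
        }

        // The same parallel region built natively (icc -mmic) and launched directly on
        // the card can use all 60 cores, i.e. up to 240 threads.
        return 0;
    }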

Related

Can CUDA cores run things absolutely parallel or do they need context switching?

Can a CUDA INT32 core process two different integer instructions completely in parallel, without context switching? I know that it is not possible on a CPU, but on an NVIDIA GPU? I know that an SM can run warps, and if a core has to wait for some information, it gets another thread from the dispatch unit.
I know that it is not possible on a CPU, but on an NVIDIA GPU?
This assertion is wrong on modern mainstream CPUs (e.g. for at least a decade on nearly all x86-64 processors, starting with Intel Skylake or AMD Zen 2). Indeed, a modern x86-64 Intel/AMD processor can generally compute two 256-bit AVX SIMD operations in parallel, since there are generally two SIMD units. Processors like Intel Skylake also have 4 ALU ports capable of computing 4 basic arithmetic operations (e.g. add, sub, and, xor) in parallel per cycle. Some instructions, like division, are far more expensive and do not run in parallel on such architectures, though division is pipelined. The instructions can come from the same thread on the same logical core, or possibly from 2 threads (of possibly 2 different processes) scheduled on 2 logical cores, without any context switches. Note that recent high-end ARM processors can also do this (even some mobile processors).
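To make that concrete, here is a minimal C++ sketch (standard AVX intrinsics; whether both additions actually issue in the same cycle depends on the microarchitecture): the two vector adds are independent, so a core with two SIMD execution ports can run them simultaneously within one thread, with no context switch involved.

    // Two independent 256-bit additions; a core with two SIMD ports can issue both per cycle.
    #include <immintrin.h>
    #include <cstdio>

    int main() {
        __m256 a = _mm256_set1_ps(1.0f);
        __m256 b = _mm256_set1_ps(2.0f);
        __m256 c = _mm256_set1_ps(3.0f);
        __m256 d = _mm256_set1_ps(4.0f);

        // No data dependency between x and y, so the out-of-order scheduler is free
        // to dispatch both vector adds to different execution units in the same cycle.
        __m256 x = _mm256_add_ps(a, b);
        __m256 y = _mm256_add_ps(c, d);

        float out[8];
        _mm256_storeu_ps(out, _mm256_add_ps(x, y));
        std::printf("%f\n", out[0]);  // 10.000000
        return 0;
    }

(Compile with AVX enabled, e.g. -mavx.)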
Can a CUDA INT32 core process two different integer instructions completely in parallel, without context switching?
NVIDIA GPUs execute groups of threads known as warps in SIMT (Single Instruction, Multiple Thread) fashion. Thus, one instruction operates on 32 items in parallel (though, theoretically, the hardware is free not to do that completely in parallel). A kernel execution basically consists of many blocks, and blocks are scheduled onto SMs. An SM can operate on many blocks concurrently, so there is a massive amount of parallelism available.
Whether a specific GPU can execute two INT32 warps in parallel depends on the target architecture, not on CUDA itself. On modern NVIDIA GPUs, each SM is split into multiple partitions that can each execute instructions on warps independently of the other partitions. For example, AFAIK, on a Pascal GP104 there are 20 SMs, and each SM has 4 partitions capable of running SIMD instructions operating on 1 warp (32 items) at a time. In practice, things can be a bit more complex on newer architectures. You can get more information here.
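If you want the numbers for the exact card you have rather than reasoning from the architecture name, a small host-side query against the CUDA runtime API is enough (a sketch; it assumes the CUDA toolkit is installed and the program is linked against cudart):

    // Query SM count and warp size of device 0 via the CUDA runtime API.
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        int device_count = 0;
        if (cudaGetDeviceCount(&device_count) != cudaSuccess || device_count == 0) {
            std::printf("no CUDA device found\n");
            return 1;
        }
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        std::printf("%s: %d SMs, warp size %d, compute capability %d.%d\n",
                    prop.name, prop.multiProcessorCount, prop.warpSize,
                    prop.major, prop.minor);
        return 0;
    }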

Why does the time needed for executing code on the "Hackerrank" site vary significantly?

When you try some coding challenges on www.hackerrank.com and also track the time (e.g. in C++ using the "chrono" library), you will see that every time you run the code you get a different execution time for exactly the same code. I estimate the variance at 10 to 30 percent. What is the reason that execution times vary significantly for identical code? Which factors account for it?
It could be the server system; but are there also physical reasons (stochastic processes in electronic components)?
Almost certainly just a busy system where your process competes for CPU time against other processes. (And competes for memory bandwidth even when it has a CPU to itself).
Also it's likely that the server runs on a CPU with SMT (simultaneous multithreading), which presents each physical core as two logical cores, so the "shared execution resources" that processes compete for include stuff inside a single CPU core, not just L3 cache and memory bandwidth.
Intel calls their SMT tech "Hyper-Threading"; most servers these days run on Intel Xeon CPUs. AMD Zen also uses SMT, so either way, unless the server admins disabled it, when the OS schedules tasks onto both logical cores of a single physical core, they will slow each other down some. How much depends mostly on their average IPC (instructions per cycle): two threads that have a lot of branch mispredicts will typically not see much slowdown (so throughput nearly doubles), but two threads that could each nearly saturate the FP multiplier ALUs on their own will each run at nearly half speed (near zero gain in throughput).
OSes "know about" SMT / Hyper-Threading and can detect which logical cores share a physical core. They try to avoid scheduling threads onto the same physical core while there are still physical cores with both logical cores idle.
See also Modern Microprocessors: A 90-Minute Guide!, which covers SMT, with the background to understand what execution resources are being shared.
but are there also physical reasons (stochastic processes in electronic components)?
No, CPUs are deterministic (except for the rdrand instruction which uses true analog electrical noise as a randomness source).
CPUs do use dynamic voltage and frequency scaling to save power (idle vs. max turbo, or somewhere in between if thermal limits don't allow max turbo.) https://en.wikipedia.org/wiki/Intel_Turbo_Boost / https://en.wikipedia.org/wiki/Dynamic_frequency_scaling
See also Idiomatic way of performance evaluation? for more microbenchmarking pitfalls / warm-up effects.
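One practical way to cope with that noise when you measure with chrono yourself (a sketch, not what Hackerrank does): run the same work several times and report the minimum, since interference from other processes and frequency scaling only ever add time.

    // Repeat the measurement and keep the minimum to filter out scheduling/turbo noise.
    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <vector>

    static long long work(int n) {             // placeholder workload
        long long s = 0;
        for (int i = 0; i < n; ++i) s += i * 3;
        return s;
    }

    int main() {
        using clock = std::chrono::steady_clock;
        std::vector<double> samples;
        volatile long long sink = 0;           // keep the optimizer from removing the work

        for (int rep = 0; rep < 10; ++rep) {
            auto t0 = clock::now();
            sink = sink + work(10000000);
            auto t1 = clock::now();
            samples.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
        }
        std::printf("min %.3f ms, max %.3f ms\n",
                    *std::min_element(samples.begin(), samples.end()),
                    *std::max_element(samples.begin(), samples.end()));
        return 0;
    }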

Xeon Phi coprocessor vs Xeon Phi host processor?

What is the difference between a host processor and coprocessor? Specifically Xeon Phi coprocessor and Xeon Phi host processor?
I have some performance results on these machines (a parallelized OpenMP code of diffusion equation was being run) which shows that the host processor works much faster when the same number of threads are working. I would like to know differences and relate them to my results.
Just to re-iterate what Jeff said in the comments, you have a Xeon host with an attached Xeon Phi coprocessor. The current generation of Xeon Phi (Knights Corner) is only available as a coprocessor, not as a standalone Xeon Phi host (which should become available next generation with Knights Landing).
When you run your program without offloading from your host Xeon, from this website, it looks like you'll be able to run with up to 16 threads. Note that the speed of each of your cores is about 2.2 GHz.
When you run your program in native execution mode on your Xeon Phi coprocessor, you should be able to run with a lot more threads. The optimal number of threads to use depends on the model of Xeon Phi you have (some work best with 56, others with 60). But note that each Xeon Phi core (roughly 1.2 GHz) is noticeably weaker than a single Xeon core (roughly 2.2 GHz). The benefit of the many-core Xeon Phi technology is exactly that: you can run across many cores.
The last very important thing to consider is that the Xeon Phi has a 512-bit wide SIMD instruction set. Thus, you can exploit much better SIMD vectorization on the Xeon Phi coprocessor than on the host. In your case, I believe your Xeon host only has a 256-bit SIMD vector processing unit. Therefore, if you haven't already, you can improve your performance (up to 16x if you're working in single precision) on the Xeon Phi by taking advantage of SIMD vectorization. Your Xeon host will only give up to 8x. Just to start you on a Google trek, OpenMP 4.0 allows you to write things like #pragma omp simd in order to tell the compiler when to vectorize lower-level loops throughout your code. If you really want maximum performance from the Xeon Phi, adding SIMD vectorization is a necessity.
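For example, a minimal sketch of that kind of hint (OpenMP 4.0; the loop and the sizes are illustrative, not taken from the question's diffusion code):

    // OpenMP 4.0 SIMD hint: ask the compiler to vectorize the inner loop.
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
        #pragma omp simd
        for (std::size_t i = 0; i < x.size(); ++i)
            y[i] += a * x[i];   // a 512-bit VPU can process 16 floats per vector operation
    }

    int main() {
        std::vector<float> x(1024, 1.0f), y(1024, 2.0f);
        saxpy(3.0f, x, y);
        std::printf("%f\n", y[0]);  // 5.000000
        return 0;
    }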
So to directly answer your question: comparing the performance results between your Xeon host and Xeon Phi coprocessor using the same number of cores is useless. We already know that each Xeon Phi core is slower than each Xeon core. You should be comparing the results using the maximum number of cores each allows (60, and 16 respectively) and taking maximum advantage of the vector processing unit if you want a direct comparison.
If you are talking about the current generation (KNC) and not the next (KNL), these are the definitions.
Host processor: The ~8 core / ~16 thread Xeon that is hosting the coprocessor, meaning the Xeon host to which the coprocessor is connected via the PCIe bus.
Coprocessor: The ~60 core/~240 thread coprocessor that is hanging off of your Xeon host on the Xeon's PCIe bus.
The host farms off highly parallel / vectorizable jobs to the coprocessor, either using offload directives or by running them natively using some distributed programming paradigm such as MPI.
As to the comment about the next generation host processor, the commenter is referring to the fact that the next generation Xeon Phi (KNL) can be configured either as a coprocessor hanging off the PCIe bus (like the 1st gen Xeon Phi, KNC) or as a normal processor that you plug into a motherboard.

Difference between core and processor

What is the difference between a core and a processor?
I've already looked for it on Google, but I only get definitions for multi-core and multi-processor, which is not what I am looking for.
A core is usually the basic computation unit of the CPU - it can run a single program context (or multiple ones if it supports hardware threads such as hyperthreading on Intel CPUs), maintaining the correct program state, registers, and correct execution order, and performing the operations through ALUs. For optimization purposes, a core can also hold on-core caches with copies of frequently used memory chunks.
A CPU may have one or more cores to perform tasks at a given time. These tasks are usually software processes and threads that the OS schedules. Note that the OS may have many threads to run, but the CPU can only run X such tasks at a given time, where X = number of cores * number of hardware threads per core. The rest would have to wait for the OS to schedule them, whether by preempting currently running tasks or by any other means.
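As a quick illustration in standard C++, the value the OS reports for that X is what std::thread::hardware_concurrency() returns (a sketch; the function may legally return 0 if the value cannot be determined):

    // Report how many tasks the hardware can run at once (cores x hardware threads per core).
    #include <cstdio>
    #include <thread>

    int main() {
        unsigned n = std::thread::hardware_concurrency();  // may be 0 if not computable
        std::printf("hardware can run %u threads at a given time\n", n);
        return 0;
    }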
In addition to the one or more cores, the CPU will include some interconnect that connects the cores to the outside world, and usually also a large "last-level" shared cache. There are multiple other key elements required to make a CPU work, but their exact locations may differ between designs. You'll need a memory controller to talk to the memory, and I/O controllers (display, PCIe, USB, etc.). In the past these elements were outside the CPU, in the complementary "chipset", but most modern designs have integrated them into the CPU.
In addition, the CPU may have an integrated GPU, and pretty much everything else the designer wanted to keep close for performance, power, and manufacturing considerations. CPU design is mostly trending toward what's called a system on chip (SoC).
This is a "classic" design, used by most modern general-purpose devices (client PCs, servers, and also tablets and smartphones). You can find more elaborate designs, usually in academia, where the computation is not done in basic "core-like" units.
An image may say more than a thousand words:
* Figure describing the complexity of a modern multi-processor, multi-core system.
Source:
https://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpu-utilization
Let's first clarify what a CPU is and what a core is. A central processing unit (CPU) can have multiple core units; each of those cores is a processor by itself, capable of executing a program, but self-contained on the same chip.
In the past a single CPU was distributed across quite a few chips, but as Moore's Law progressed it became possible to fit a complete CPU inside one chip (die). Since the 90s, manufacturers started to fit more cores into the same die, and that's the concept of multi-core.
These days it is possible to have hundreds of cores on the same chip or die (GPUs, Intel Xeon Phi). Another technique developed in the 90s was simultaneous multi-threading: basically, it was found that a single core could run a second thread by duplicating a small amount of state, such as the registers, while sharing the rest of the core's resources, like the ALUs.
So basically a CPU can have multiple cores, each of them capable of running one thread or more at the same time. We may expect to have more cores in the future, but with more difficulty programming them efficiently.
A CPU is a central processing unit. Until about 2002 we had only single-core processors, i.e., they could perform only a single task or program at a time.
To have multiple programs running at once, we had to use multiple processors to execute multiple processes at a time, which required a multi-socket motherboard, and that is very expensive.
So Intel introduced the concept of hyper-threading, i.e., it presents a single CPU as two virtual CPUs, so that we appear to have two cores for our tasks. The CPU is still a single core, but it pretends (masquerades) that it is a dual CPU and performs multiple tasks. But having real multiple cores is better than that, so manufacturers developed the multi-core processor, i.e., multiple processing cores on a single chip: in effect, multiple CPUs grouped into one big CPU, i.e., multiple cores.
In the early days, like before the 90s, processors weren't able to multitask that efficiently, because a single processor could handle just a single task. So when we used to say that an antivirus, Microsoft Word, VLC, etc. were all running at the same time, that wasn't actually true. When I said a processor could handle a single task at a time, I meant it. It would actually process a single task, then pause it, take another task, complete it if it was a short one (or pause it again and add it back to the queue), then move on to the next. But the 'pause' I mentioned was so small (on the order of milliseconds) that you didn't notice that the task had been paused. E.g., while listening to music in VLC, other apps appear to run simultaneously, but as I said, it is one program at a time, so VLC is actually being paused in between; the pauses are just too short for you to notice, even though the music playback really is being interrupted.
But this was about the old processors...
Nowadays, processors in recent generations of PCs are multi-core processors. Each 'core' is comparable to an older single-core processor itself, embedded onto a single chip as one processor. So now we understand what cores are: they are mini processors which combine to become a processor. And each core can handle a single process at a time, or multiple threads as designed for the OS. And they follow the same steps I mentioned above for the single processor.
E.g., a 6th-gen i7 processor exposes 8 hardware threads, i.e., roughly 8 mini processors in one i7, so in the ideal case its throughput is up to 8x that of an old single-core processor. And this is how multitasking is done.
There can even be dozens to hundreds of cores in a single processor package,
e.g. many-core chips such as the Intel Xeon Phi.
I hope I explained this well.
I have read all the answers, but this link gave me a clearer explanation of the difference between a CPU (processor) and a core, so I'm leaving some notes from it here.
The main difference between a CPU and a core is that the CPU is an electronic circuit inside the computer that carries out instructions to perform arithmetic, logical, control and input/output operations, while a core is an execution unit inside the CPU that receives and executes instructions.
Intel's picture is helpful, as shown by Tortuga's best answer. Here's a caption for it.
Processor: One semiconductor chip, the CPU (central processing unit), seated in one socket, circa 1970s-2010s. Over time, more functions have been packed onto the CPU chip. Prior to the 1970s releases of single-chip processors, one processor might have been spread across multiple chips. In the mid-2010s, system-on-a-chip designs made it slightly more sketchy to equate one processor with one chip, though that's generally what people mean by processor, as in "this computer has an i7 processor" or "this computer system has four processors."
Core: One block of a CPU, executing one instruction at a time. (You'll see people say one instruction per clock cycle, but some CPUs use multiple clock cycles for some instructions.)

How will applications be scheduled on hyper-threading enabled multi-core machines?

I'm trying to gain a better understanding of how hyper-threading enabled multi-core processors work. Let's say I have an app which can be compiled with MPI or OpenMP or MPI+OpenMP. I wonder how it will be scheduled on a CentOS 5.3 box with four Xeon X7560 @ 2.27GHz processors, where each processor core has Hyper-Threading enabled.
The processors are numbered from 0 to 63 in /proc/cpuinfo. To my understanding, there are FOUR 8-core physical processors, so the total number of PHYSICAL CORES is 32; each processor core has Hyper-Threading enabled, so the total number of LOGICAL processors is 64.
Compiled with MPICH2
How many physical cores will be used if I run with mpirun -np 16? Does it get divided up amongst 16 of the available PHYSICAL cores, or across 16 LOGICAL processors (8 PHYSICAL cores using hyper-threading)?
Compiled with OpenMP
How many physical cores will be used if I set OMP_NUM_THREADS=16? Will it use 16 LOGICAL processors?
Compiled with MPICH2+OpenMP
How many physical cores will be used if I set OMP_NUM_THREADS=16 and run with mpirun -np 16?
Compiled with OpenMPI
OpenMPI has two runtime options
-cpu-set, which specifies the logical cpus allocated to the job,
-cpu-per-proc, which specifies the number of cpus to use for each process.
If run with mpirun -np 16 -cpu-set 0-15, will it only use 8 PHYSICAL cores?
If run with mpirun -np 16 -cpu-set 0-31 -cpu-per-proc 2, how will it be scheduled?
Thanks
Jerry
I'd expect any sensible scheduler to prefer running threads on different physical processors if possible. Then I'd expect it to prefer different physical cores. Finally, if it must, it would start using the hyperthreaded second thread on each physical core.
Basically when threads have to share processor resources they slow down. So the optimal strategy is usually to minimise the amount of processor resource sharing. This is the right strategy for CPU bound processes and that's normally what an OS assumes it is dealing with.
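If you want to check what actually happened rather than guess, each thread can report the logical CPU it is running on (a Linux/glibc sketch using sched_getcpu(); mapping the logical CPU numbers back to physical cores is left to /proc/cpuinfo or lscpu):

    // Print which logical CPU each OpenMP thread is currently running on (Linux/glibc).
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE   // for sched_getcpu(); g++ usually defines this already
    #endif
    #include <omp.h>
    #include <sched.h>
    #include <cstdio>

    int main() {
        #pragma omp parallel
        {
            // Compare these numbers with "processor"/"core id" in /proc/cpuinfo (or lscpu -e)
            // to see whether two threads landed on sibling logical CPUs of one physical core.
            std::printf("OpenMP thread %d is on logical CPU %d\n",
                        omp_get_thread_num(), sched_getcpu());
        }
        return 0;
    }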
I would hazard a guess that the scheduler will try to keep the threads of one process on the same physical cores. So if you had sixteen threads, they would be put on the smallest number of physical cores. The reason for this would be cache locality: threads from the same process are considered more likely to touch the same memory than threads from different processes. (For example, the cost of cache line invalidation across cores is high, but that cost does not occur for logical processors in the same core.)
As you can see from the other two answers the ideal scheduling policy varies depending on what activity the threads are doing.
Threads working on completely different data benefit from more separation. These threads would ideally be scheduled in separate NUMA domains and physical cores.
Threads working on the same data will benefit from cache locality, so the ideal policy is to schedule them close together so they share cache.
Threads that work on the same data and experience a large amount of pipeline stalls benefit from sharing a hyperthread core. Each thread can run until it stalls, at which point the other thread can run. Threads that run without stalls are only hurt by hyperthreading and should be run on different cores.
Making the ideal scheduling decision relies on a lot of data collection and a lot of decision making. A large danger in OS design is to make the thread scheduling too smart. If the OS spends a lot of processor time trying to find the ideal place to run a thread, it's wasting time it could be using to run the thread.
So it's often more efficient to use a simplified thread scheduler and, if needed, let the program specify its own policy. This is what the thread affinity setting is for.
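A minimal sketch of what "specifying its own policy" can look like on Linux (pthread_setaffinity_np on the current thread; pinning OpenMP thread i to logical CPU i is just an illustrative policy, not a recommendation):

    // Pin each OpenMP thread to one logical CPU (Linux-specific; the 1:1 mapping is illustrative).
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE   // for CPU_SET/pthread_setaffinity_np; g++ usually defines this
    #endif
    #include <omp.h>
    #include <pthread.h>
    #include <sched.h>
    #include <cstdio>

    int main() {
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(tid, &set);                      // policy: thread i -> logical CPU i
            pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
            std::printf("thread %d pinned to logical CPU %d\n", tid, tid);
        }
        return 0;
    }

In practice you would more often reach for the runtime's own knobs (e.g. OMP_PROC_BIND/OMP_PLACES, or the MPI launcher's binding options) than call the affinity API directly.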
