Xeon Phi coprocessor vs Xeon Phi host processor? - openmp

What is the difference between a host processor and coprocessor? Specifically Xeon Phi coprocessor and Xeon Phi host processor?
I have some performance results on these machines (running a parallelized OpenMP code for the diffusion equation) which show that the host processor runs much faster when the same number of threads is used. I would like to understand the differences and relate them to my results.

Just to reiterate what Jeff said in the comments, you have a Xeon host with an attached Xeon Phi coprocessor. The current generation of Xeon Phi (Knights Corner) is only available as a coprocessor, not as a standalone Xeon Phi host (which should be available in the next generation, Knights Landing).
When you run your program on your host Xeon without offloading, it looks (from this website) like you'll be able to run with up to 16 threads. Note that each of your cores runs at about 2.2 GHz.
When you run your program in native execution mode on your Xeon Phi coprocessor, you should be able to run with a lot more threads. The optimal number of threads to use depends on the model of Xeon Phi you have (some work best with 56, others with 60). But note that each Xeon Phi core (roughly 1.2 GHz) is noticeably weaker than a single Xeon core (roughly 2.2 GHz). The benefit of the many-core Xeon Phi technology is exactly that: you can run across many cores.
The last very important thing to consider is that the Xeon Phi has a 512-bit wide SIMD instruction set, so it supports much wider SIMD vectorization than the host. In your case, I believe your Xeon host only has a 256-bit SIMD vector processing unit. Therefore, if you haven't already, you can improve your performance on the Xeon Phi by up to 16x in single precision (sixteen 32-bit floats per 512-bit vector) by taking advantage of SIMD vectorization; your Xeon host will only give up to 8x. Just to start you on a Google trek, OpenMP 4.0 allows you to write things like #pragma omp simd to tell the compiler when to vectorize lower-level loops throughout your code. If you really want maximum performance from the Xeon Phi, adding SIMD vectorization is a necessity.
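The 16x and 8x figures are simply the vector width divided by the element width. A minimal sketch of that arithmetic, written in Python for brevity; the helper name `simd_lanes` is made up for illustration, and the 256-bit host width is this answer's assumption about the host:

```python
def simd_lanes(vector_bits: int, element_bits: int) -> int:
    """Return how many elements fit in one SIMD register (hypothetical helper)."""
    if element_bits <= 0 or vector_bits % element_bits != 0:
        raise ValueError("vector width must be a positive multiple of element width")
    return vector_bits // element_bits


# Widths taken from the answer: 512-bit on Xeon Phi, 256-bit assumed for the host.
PHI_BITS = 512
HOST_BITS = 256

for name, bits in (("Xeon Phi", PHI_BITS), ("Xeon host", HOST_BITS)):
    # 32-bit elements = single precision, 64-bit elements = double precision
    print(name, "float32 lanes:", simd_lanes(bits, 32),
          "float64 lanes:", simd_lanes(bits, 64))
```

These are upper bounds on the speedup from vectorization alone; real loops usually fall short of them because of memory bandwidth and the parts of the loop that do not vectorize.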
So to directly answer your question: comparing the performance results between your Xeon host and Xeon Phi coprocessor using the same number of cores is useless. We already know that each Xeon Phi core is slower than each Xeon core. You should be comparing the results using the maximum number of cores each allows (60, and 16 respectively) and taking maximum advantage of the vector processing unit if you want a direct comparison.
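To see why the comparison should use each device's full core count and vector width, here is a back-of-the-envelope estimate built only from the rough figures quoted in this answer. It is an idealized upper bound (one vector operation per core per cycle, no memory limits), not a prediction for your diffusion code, and `relative_peak` is a hypothetical helper:

```python
def relative_peak(cores: int, ghz: float, lanes: int) -> float:
    """Idealized throughput in billions of elements per second (hypothetical helper)."""
    return cores * ghz * lanes


# Figures quoted in the answer: 60 cores at ~1.2 GHz with 16 float lanes on the
# Xeon Phi, and 16 at ~2.2 GHz with 8 float lanes on the Xeon host.
phi_full = relative_peak(60, 1.2, 16)
host_full = relative_peak(16, 2.2, 8)

# The comparison from the question: same thread count, no vectorization.
phi_same = relative_peak(16, 1.2, 1)
host_same = relative_peak(16, 2.2, 1)

print("full device, vectorized: Phi/host =", round(phi_full / host_full, 2))
print("same core count, scalar: Phi/host =", round(phi_same / host_same, 2))
```

With equal thread counts and scalar code only the clock speed matters, so the host wins; the Xeon Phi only pulls ahead in this idealized model once all cores and the full vector width are used.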

If you are talking about the current generation (KNC) and not the next (KNL), these are the definitions.
Host processor: The ~8 core / ~16 thread Xeon that hosts the coprocessor, i.e. the Xeon to which the coprocessor is connected via the PCIe bus.
Coprocessor: The ~60 core/~240 thread coprocessor that is hanging off of your Xeon host on the Xeon's PCIe bus.
The host farms out highly parallel / vectorizable jobs to the coprocessor, either using offload instructions or by running them natively using some distributed programming paradigm such as MPI.
As to the comment about the next generation host processor, the commenter is referring to the fact that the next generation Xeon Phi (KNL) can be configured either as a coprocessor hanging off the PCIe bus (like the 1st gen Xeon Phi, KNC) or as a normal processor that you plug into a motherboard.


How to measure the ACTUAL number of clock cycles elapsed on modern x86?

On recent x86, RDTSC returns some pseudo-counter that measures time instead of clock cycles.
Given this, how do I measure actual clock cycles for the current thread/program?
Platform-wise, I prefer Windows, but a Linux answer works too.
This is not simple. Such a thing is described in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Vol. 3B:
Here is the behaviour:
For Pentium M processors; for Pentium 4 processors, Intel Xeon processors; and for P6 family processors: the time-stamp counter increments with every internal processor clock cycle. The internal processor clock cycle is determined by the current core-clock to bus-clock ratio. Intel® SpeedStep® technology transitions may also impact the processor clock.
For Pentium 4 processors, Intel Xeon processors; for Intel Core Solo and Intel Core Duo processors; for the Intel Xeon processor 5100 series and Intel Core 2 Duo processors; for Intel Core 2 and Intel Xeon processors; for Intel Atom processors: the time-stamp counter increments at a constant rate. That rate may be set by the maximum core-clock to bus-clock ratio of the processor or may be set by the maximum resolved frequency at which the processor is booted. The maximum resolved frequency may differ from the processor base frequency. On certain processors, the TSC frequency may not be the same as the frequency in the brand string.
Here is the advice for your use-case:
To determine average processor clock frequency, Intel recommends the use of performance monitoring logic to count processor core clocks over the period of time for which the average is required. See Section 18.17, “Counting Clocks on systems with Intel Hyper-Threading Technology in Processors Based on Intel NetBurst® Microarchitecture,” and Chapter 19, “Performance-Monitoring Events,” for more information.
The bad news is that AFAIK performance counters are often not portable between AMD and Intel processors, so you need to check the AMD documentation to see which performance counters to use there. There are also complications: you cannot easily measure the number of cycles taken by arbitrary code. For example, the processor can be halted or enter a sleep state for a short period of time (see C-states), or the OS can be executing protected code that cannot be profiled without high privileges (for the sake of security). This method is fine as long as you need to measure the cycle count of numerically intensive code that runs for a relatively long time (at least several dozen cycles). On top of all that, the documentation and usage of MSRs is pretty complex, and it has some restrictions.
Performance counters like CPU_CLK_UNHALTED.THREAD and CPU_CLK_UNHALTED.REF_TSC seem a good start for what you want to measure. Using a library to read such performance counters is generally a very good idea (unless you like having a headache for at least a few days). PAPI might be enough to do the job.
Here are some interesting related posts:
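Once both counters have been read over the same interval (for example through PAPI), the average core frequency while the thread was actually running follows from their ratio. A minimal sketch of that arithmetic, assuming CPU_CLK_UNHALTED.REF_TSC ticks at the TSC rate; the function name and the sample readings are made up for illustration, not measured values:

```python
def average_core_ghz(core_cycles: int, ref_cycles: int, tsc_ghz: float) -> float:
    """Average unhalted core clock in GHz (hypothetical helper).

    core_cycles: delta of CPU_CLK_UNHALTED.THREAD over the interval
    ref_cycles:  delta of CPU_CLK_UNHALTED.REF_TSC over the same interval
    tsc_ghz:     TSC frequency of the machine, in GHz
    """
    if ref_cycles <= 0:
        raise ValueError("ref_cycles must be positive")
    return tsc_ghz * core_cycles / ref_cycles


# Illustrative readings: the core ran 12.5% faster than the reference clock.
print(average_core_ghz(core_cycles=3_600_000, ref_cycles=3_200_000, tsc_ghz=3.2))
```

The core-cycle delta itself is the "actual clock cycles" figure the question asks for; the ratio only shows how far it diverges from what RDTSC would report.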
Lost Cycles on Intel? An inconsistency between rdtsc and CPU_CLK_UNHALTED.REF_TSC
How to read performance counters by rdpmc instruction?

Query Intel CPU details of execution unit, port, etc

Is it possible to query the number of execution units/ports per core and similar information on an Intel CPU?
I have an assembly program and noticed that its performance is quite different on different CPUs. For example, on a Core i5 4570, the cycle counts of some functions consistently differ by 25% from those on a Core i7 4970HQ. They are both Haswell-based, from the same generation. No memory movement is involved in the benchmarked part of the program, so I am thinking the difference may come from details such as the number of execution units, number of ports, etc. The benchmark measures single-core CPU cycles, so frequency, HT, etc. do not come into play.
Am I right to assume such an explanation for the performance difference? If so, where can I find such information for specific CPUs? And is it possible to query it dynamically? If so, I could dispatch dynamically based on that information, distribute uops more evenly, and use similar techniques to optimize the program for multiple CPUs.
Did you time reference cycles (RDTSC) instead of core clock cycles (with perf counters)? That would explain your observations.
Turbo makes a big difference, and the ratio between max turbo and max sustained / rated clock speed (i.e. reference cycle tick rate) is different on different CPUs. e.g. see my answer on this related question
The lower the CPU's TDP, the bigger the ratio between sustained and peak. The Haswell wikipedia article has tables:
84W desktop i5 4570: sustained 3.2GHz = RDTSC frequency, max turbo 3.6GHz (the speed the core was probably actually running for most of your benchmark, if it had time to go up from low-power idle speed).
47W laptop i7-4960HQ: 2.6GHz sustained = RDTSC frequency vs. 3.8GHz max turbo.
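The effect of these two ratios can be put in numbers. The sketch below assumes both CPUs run the benchmark at their max turbo clock the whole time (an assumption; real clocks vary) and converts the same amount of work, expressed in core clock cycles, into the RDTSC ticks each CPU would report. The helper name is hypothetical:

```python
def tsc_ticks(core_cycles: float, core_ghz: float, tsc_ghz: float) -> float:
    """RDTSC ticks that elapse while `core_cycles` core clock cycles pass."""
    return core_cycles * tsc_ghz / core_ghz


work = 1_000_000  # identical work on both CPUs, in core clock cycles

i5_ticks = tsc_ticks(work, core_ghz=3.6, tsc_ghz=3.2)  # i5 4570
i7_ticks = tsc_ticks(work, core_ghz=3.8, tsc_ghz=2.6)  # i7-4960HQ

print("i5 4570   :", round(i5_ticks), "TSC ticks")
print("i7-4960HQ :", round(i7_ticks), "TSC ticks")
print("i5 / i7   :", round(i5_ticks / i7_ticks, 2))
```

Under these assumptions the i5 reports roughly 30% more TSC ticks than the i7 for identical core-cycle work, which is the same order as the difference described in the question.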
Time your code with performance counters, and look at the "core clock cycles" count. (And lots of other neat stuff).
Every Haswell core is identical from Core-M 5Watt CPUs to high-power quad core to 18-core Xeon (which actually has a per-core power-budget more like a laptop CPU); it's only the L3 caches, number of cores (and interconnect), and support or not for HT and/or Turbo that differ. Basically everything outside the cores themselves can be different, including the GPU. They don't disable execution ports, and even the L1/L2 caches are identical. I think disabling execution ports would require significant redesigns in the out-of-order scheduler and stuff like that.
More importantly, every port has at least one execution unit that isn't found on any other port: p0 has the divider, p1 has the integer multiply unit, p5 has the shuffle unit, and p6 is the only port that can execute predicted-taken branches. Actually, p2 and p3 are identical load ports (and can handle store-address uops)...
See Agner Fog's microarch pdf for more about Haswell internals, and also David Kanter's writeup with diagrams of the different blocks.
(However, it's not strictly true that the entire core is identical: Haswell Pentium/Celeron CPUs don't support AVX/AVX2, or BMI/BMI2. I think they do that by disabling decode of VEX prefixes in the decoders. This is still the case for Skylake Pentiums/Celerons, so thanks Intel for delaying the time when we can assume support for new instruction sets. Presumably they do this so CPUs with defects in only the upper or lower half of their vector execution units can still be sold as Celeron or Pentium, just like CPUs with a defect in some of their L3 can be sold as i5 instead of i7.)

Offload daemon on Xeon Phi 5110P

I am aware that the Intel Xeon Phi coprocessor SE10X has 61 cores, and that it is suggested to use only 60 of them since 1 core is used for the offload daemon.
Since the Intel Xeon Phi coprocessor 5110P has 60 cores, is it likewise suggested to use 59 cores?
From this MIC-related FAQ:
Sensible Affinities
Under Intel MPSS many of the kernel services and daemons are affinitized to the “Bootstrap Processor” (BSP), which is the last physical core. This is also where the offload daemon runs the services required to support data transfer for offload. It is therefore generally sensible to avoid using this core for user code. (Indeed, as already discussed, the offload system does that automatically by removing the logical CPUs on the last core from the default affinity of offloaded processes).
From this OpenMP on MIC guide:
Offloaded programs inherit an affinity map that hides the last core, which is dedicated to offload system functions. Native programs can use all the cores, making the calculations required for balancing the threads slightly different.
None of these sources is specific to any MIC model; they're about the architecture. So it seems that if you offload to the device and don't use the default affinity, you should indeed avoid the last core.
I evaluated the performance of my test code on an Intel Xeon Phi 7120P card. I observed that performance was best when the number of threads was a multiple of (number of cores - 1). This is because one of the cores is busy running the Linux micro-OS services.
In general:
No. of threads to create >= K * T * (N-1)
K = Positive integer (=2 works fine)
T = No. of hardware thread contexts per core (4 in my case)
N = No. of cores present on the hardware.
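The rule of thumb above can be written as a tiny helper. This is only a sketch of the formula as stated; the function name is hypothetical, and the 61-core count used for the 7120P is an assumption not stated in this answer (the 60-core count for the 5110P comes from the question):

```python
def suggested_threads(k: int, contexts_per_core: int, cores: int) -> int:
    """Lower bound on thread count: K * T * (N - 1), leaving one core free."""
    if k < 1 or contexts_per_core < 1 or cores < 2:
        raise ValueError("need k >= 1, contexts_per_core >= 1 and cores >= 2")
    return k * contexts_per_core * (cores - 1)


# K = 2 and T = 4 as in the answer; N = 61 is assumed for the 7120P.
print("7120P (61 cores assumed):", suggested_threads(2, 4, 61))
# Same rule applied to the 60-core 5110P from the question, with K = 1.
print("5110P (60 cores):", suggested_threads(1, 4, 60))
```

With K = 1 the result is simply one thread per hardware context on every core except the last one.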
When you execute your workload in offload mode (when the application runs on the CPU and offloads some computation to the Xeon Phi), it is recommended to leave 1 core for the offload runtime. There is a COI daemon on the Xeon Phi side that runs four service threads to manage offload activity. Keep in mind that 1 physical core on the Xeon Phi runs 4 hardware threads.
In the native execution model, when the application is started directly on the Xeon Phi card, you can use all available cores, since there is no offload activity.

parallel code slower on multicore AMD

Parallelized code (OpenMP), compiled on an Intel machine (Linux) with gcc, runs much faster on an Intel computer than on an AMD with twice as many cores. I see that all the cores are in use, but it takes about 10 times more CPU time on the AMD. I had heard about "cripple AMD" in the Intel compiler, but I am using gcc! Thanks in advance.
Intel has Hyper-Threading Technology in its modern processor cores, which essentially means that you have multiple hardware contexts running on a single core simultaneously. Are you taking this into account when you make the comparison?

Do modern GPUs (e.g. Fermi/Evergreen) support out-of-order execution?

I am writing an OpenCL kernel which involves a few barriers in a loop. I have tested the kernel on a CPU (8-core FX8150), and the results show that these barriers reduce running speed by a factor of 50 to 100 (I further verified this by re-implementing the kernel in Java using multi-threading + CyclicBarrier). I suspect the reason is that a barrier essentially stops the CPU from taking advantage of out-of-order execution, so I am a little worried that I would observe the same magnitude of slowdown on a GPU. I checked a few official documents and googled around a bit, but there is little information available on this topic.
Current state-of-the-art GPUs are in-order pipelined processors. GPUs fill the pipeline effectively by interleaving instructions from different warps (wavefronts). In comparison, CPUs use out-of-order speculative execution to fill the pipeline. There are different functional units, like ALUs and SFUs, which have separate pipelines. But notice that an instruction dependency stalls the warp. For more information on how instruction dependencies are resolved on GPUs, refer to this NVIDIA patent.
NVIDIA’s Next Generation CUDA Compute and Graphics Architecture, Code-Named “Fermi”:
The NVIDIA GigaThread Engine has the following capabilities (page 5):
10x faster application context switching
Concurrent kernel execution
Out of Order thread block execution :)
Dual overlapped memory transfer engines
Evergreen has SIMD capabilities and has a chance to outperform some Fermi parts, but I don't know about its out-of-order execution. The HD 7000 series also has the upper hand over the GTX 600 series in "local atomic add" (nearly 10x faster).
