Can CUDA cores run things absolutely parallel or do they need context switching? - parallel-processing

Can a CUDA INT32 core process two different integer instructions completely in parallel, without context switching? I know that it is not possible on a CPU, but what about an NVIDIA GPU? I know that an SM can run warps, and that if a core has to wait for some information, it gets another thread from the dispatch unit.

I know that it is not possible on a CPU, but on a NVIDIA GPU?
This assertion is wrong on modern mainstream CPUs (e.g. for at least a decade on nearly all x86-64 processors, starting from Intel Skylake or AMD Zen 2). Indeed, modern x86-64 Intel/AMD processors can generally compute two 256-bit AVX SIMD vectors in parallel, since there are generally two SIMD units. Processors like Intel Skylake also have 4 ALUs capable of computing 4 basic arithmetic operations (e.g. add, sub, and, xor) in parallel per cycle. Some instructions, like division, are far more expensive and do not run in parallel on such architectures, though division is well pipelined. The instructions can come from the same thread on the same logical core, or possibly from 2 threads (of possibly 2 different processes) scheduled on 2 logical cores, without any context switches. Note that recent high-end ARM processors can also do this (even some mobile processors).
Can a CUDA INT32 Core process two different integer instructions completely parallel, without context switching?
NVIDIA GPUs execute groups of threads known as warps in SIMT (Single Instruction, Multiple Thread) fashion. Thus, 1 instruction operates on 32 items in parallel (though, theoretically, the hardware is free not to do that completely in parallel). A kernel execution basically contains many blocks, and blocks are scheduled onto SMs. An SM can operate on many blocks concurrently, so there is a massive amount of parallelism available.
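To make the SIMT idea a bit more concrete, here is a toy model in plain C (not CUDA, and not a description of how the hardware is built): one "instruction" is applied to all 32 lanes of a warp, and an active mask decides which lanes commit a result. The names `WARP_SIZE`, `warp_iadd` and the `active_mask` parameter are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WARP_SIZE 32  /* lanes per warp on NVIDIA GPUs */

/* Toy model: apply one integer-add "instruction" to every active lane.
 * Bit i of active_mask set means lane i takes part. Real hardware may run
 * the lanes truly in parallel; this loop only models the semantics. */
void warp_iadd(const int32_t *a, const int32_t *b, int32_t *out,
               uint32_t active_mask)
{
    assert(a != NULL && b != NULL && out != NULL);
    for (int lane = 0; lane < WARP_SIZE; ++lane) {
        if (active_mask & (UINT32_C(1) << lane))
            out[lane] = a[lane] + b[lane];  /* same operation, different data */
    }
}
```

Calling `warp_iadd(a, b, out, 0x0000FFFFu)` updates lanes 0 to 15 only and leaves lanes 16 to 31 untouched, which is roughly what happens to the inactive threads of a divergent warp.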
Whether a specific GPU can execute two INT32 warp instructions in parallel depends on the target architecture, not on CUDA itself. On modern NVIDIA GPUs, each SM is split into multiple partitions that can each execute instructions on blocks independently of the other partitions. For example, AFAIK, on a Pascal GP104 there are 20 SMs and each SM has 4 partitions, each capable of running SIMD instructions operating on 1 warp (32 items) at a time. In practice, things can be a bit more complex on newer architectures. You can get more information here.


On NVIDIA GPUs, can ld/st and arithmetic instructions (such as INT32 and FP32) run simultaneously in the same SM?

Especially on the Turing and Ampere architectures: in the same SM and the same warp scheduler, can the warps run ld/st and other arithmetic instructions simultaneously?
I want to know how the warp scheduler works.
In the same SM and the same warp scheduler, can the warps run ld/st and other arithmetic instructions simultaneously?
No, not if "simultaneously" means "issued in the same clock cycle".
In current CUDA GPUs, including Turing and Ampere, when the warp scheduler issues an instruction, it issues the same instruction to all threads in the warp, in any given clock cycle.
Different instructions could be run in different clock cycles (of course) and different instructions can be run in the same clock cycle, if those instructions are issued by different warp schedulers in the SM. This would also imply that those instructions are issued to distinct/separate SM units.
So, for example, an integer add instruction issued by warp scheduler 0 would have to be issued to separate functional units compared to a load/store instruction issued by warp scheduler 1 in the same SM. For this example, since the instructions are different, different functional units are needed anyway, and this is self-evident.
But even if both warp schedulers were issuing, for example, FADD (for 2 different warps), they would have to issue to separate floating-point functional units in the SM.
In modern CUDA GPUs, due to the partitioning of the SM, each warp scheduler has its own execution resources (functional units) for at least some instruction types, like FADD. So this would happen anyway, again, for this reason, in this example.

Does SIMD require a multi-core CPU?

Is a multi-core CPU required to implement SIMD?
I found the following phrase "multiple processing elements" when reading Wikipedia about SIMD. So what's the difference between this phrase and "multi-core CPU"?
Every core has its own independent SIMD execution units. Using SIMD instructions in one core doesn't cost execution resources in other cores. Separate cores even on the same physical chip are independent so they can go to sleep separately to save power, and various other design reasons for keeping them isolated.
One exception that I'm aware of: AMD Bulldozer has two weak integer cores sharing a SIMD / FPU and sharing some cache. They call this a "cluster", and it's basically an alternative to Hyperthreading (SMT). See David Kanter's Bulldozer write-up on RealworldTech.
SIMD and multi-core are orthogonal: you can have multi-core without SIMD (maybe some ARM chips without an FPU / NEON), and you can have SIMD without multi-core.
There are many examples of the latter, most prominently early x86 chips like Pentium-MMX through Pentium III / Pentium 4, which had MMX / SSE1 / SSE2 but were single-core CPUs.
There are at least three different kinds of parallelism in programs:
Instruction-level parallelism: it's possible to overlap some of the work done by different instructions within the same single thread of execution, preserving the illusion of running every instruction one after another. Exploit it by building a pipelined CPU core, or superscalar (multiple instructions per clock), or even out-of-order execution. (See my answer on a question about that for details.)
When creating software: Expose this parallelism to the hardware by avoiding long dependency chains whenever possible. (e.g. replace sum += a[i++] with sum1+=a[i]; sum2+=a[i+1]; i+=2;: unroll with multiple accumulators). Or use arrays instead of linked lists, because the next address to load is computed cheaply, instead of being part of the data from memory you have to wait for on a cache miss. But mostly ILP is already there in "normal" code without doing anything special, and you build bigger / fancier hardware to find more of it, and increase the average instructions-per-clock.
Data parallelism: you need to do the same thing to every pixel of an image, or every sample in an audio file. (e.g. blend 2 images, or mix two audio streams). Exploit this by building parallel execution units into each CPU core so a single instruction can do 16 single-byte additions in parallel, giving you increased throughput with no increase in the amount of instructions you need to get through the CPU core per clock. This is SIMD: Single Instruction, Multiple Data.
Audio / video are the most well-known applications of this, where the speedups are massive because you can fit a lot of byte or 16-bit elements into a single fixed-width vector register.
Exploit SIMD by auto-vectorizing loops with smart compilers, or manually. SIMD turns sum += a[i]; into sum[0..3] += a[i+0..3] (for 4 elements per vector, like with int or float with 128-bit vectors).
Thread/task-level parallelism: exploit with multi-core CPUs, expose to the hardware by manually writing multi-threaded code, or using OpenMP or other auto-parallelization tools to multi-thread a loop, or use a library function that starts multiple threads for a big matrix multiply or something.
Or more simply by running multiple separate programs at once. e.g. compile with make -j8 to keep 8 compile processes in flight at once. Coarse-grained task-level parallelism can also be exploited by running your workload on a cluster of multiple computers, or even distributed computing.
But multi-core CPUs make it possible / efficient to exploit fine-grained thread-level parallelism where tasks need to share lots of data (like a large array), or have low latency communication through shared memory. (e.g. with locks to protect different parts of shared data, or lockless programming.)
These three kinds of parallelism are orthogonal.
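The multiple-accumulator and sum[0..3] += a[i+0..3] ideas above can be sketched in plain C as follows. This is only an illustration: the function names are mine, and whether a compiler actually maps the four-lane version onto one SIMD add depends on the compiler and its flags.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: one accumulator, so every add has to wait for the previous one. */
float sum_naive(const float *a, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}

/* Four independent accumulators, written lane-wise like
 * sum[0..3] += a[i+0..3]. The four adds in one iteration do not depend on
 * each other, which exposes ILP and gives the compiler a shape that can map
 * onto a single 4-wide SIMD add. */
float sum_4lanes(const float *a, size_t n)
{
    assert(a != NULL || n == 0);
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc[0] += a[i + 0];
        acc[1] += a[i + 1];
        acc[2] += a[i + 2];
        acc[3] += a[i + 3];
    }
    float sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);
    for (; i < n; ++i)  /* leftover elements when n is not a multiple of 4 */
        sum += a[i];
    return sum;
}
```

Note that the two functions add the elements in a different order, so for general floating-point data the results can differ in the last bits. That reordering is the reason a compiler needs something like -ffast-math before it will make this transformation on its own.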
To sum a very large array of float on a modern CPU:
You'd start one thread per CPU core, and have each core loop over a chunk of the array in shared memory. (Thread-level parallelism). This gives you a factor of 4 speedup, let's say. (Even that's maybe unrealistic because of memory bottlenecks, but you can imagine some other computationally intensive task that didn't require reading so much memory, running on a 28-core Xeon, or a dual-socket server with two of those chips...)
The code for each thread would use SIMD to do 4 or 8 adds per instruction, on each core separately. (SIMD). This gives you a factor of 4 or 8 speedup. (Or 16 with AVX512)
You'd unroll with let's say 8 vector accumulators to hide the latency of floating-point add. (ILP). Skylake's vaddps instruction has a latency of 4 cycles and a throughput of 0.5 cycles (i.e. 2 per clock). So 8 accumulators is just barely enough to hide that latency and keep 8 FP add instructions in flight at once.
The total throughput gain over single-threaded scalar sum += a[i++] is the product of all those speedup factors: 4 * 8 * 8 = 256x the throughput of a non-parallelized, non-vectorized, single-accumulator ILP-bottlenecked naive implementation like you'd get from gcc -O2 for a simple loop. clang -O3 -march=native -ffast-math would give SIMD, and some ILP (because clang knows how to use multiple accumulators when unrolling, often using 4, unlike gcc.)
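The arithmetic behind those numbers can be written down as a small sketch. The helper names are hypothetical, and the "speedup" is the ideal product from the text, not a measurement.

```c
#include <assert.h>

/* Independent operations that must be in flight to keep a pipelined unit
 * busy: latency in cycles times operations started per cycle.
 * For the Skylake vaddps figures quoted above: 4 cycles * 2 per clock. */
int accumulators_needed(int latency_cycles, int ops_per_cycle)
{
    assert(latency_cycles > 0 && ops_per_cycle > 0);
    return latency_cycles * ops_per_cycle;
}

/* Ideal combined gain when thread-level, SIMD and ILP speedups multiply. */
int combined_speedup(int threads, int simd_lanes, int ilp_factor)
{
    assert(threads > 0 && simd_lanes > 0 && ilp_factor > 0);
    return threads * simd_lanes * ilp_factor;
}
```

With the values from the text, `accumulators_needed(4, 2)` gives 8 and `combined_speedup(4, 8, 8)` gives 256.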
You'd need OpenMP or other auto-parallelization to exploit multiple cores.
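As a rough sketch of what the OpenMP route could look like (one possible way to write it, with a function name of my own choosing): the `parallel for` part splits the loop across cores, `simd` asks for vectorization inside each thread, and the `reduction` clause lets every thread keep a private partial sum.

```c
#include <assert.h>
#include <stddef.h>

/* Sum an array with OpenMP. Build with -fopenmp (GCC/Clang) to get threads;
 * without that flag the pragma is ignored and the loop runs serially,
 * still producing a valid sum. */
float sum_openmp(const float *a, size_t n)
{
    assert(a != NULL || n == 0);
    float sum = 0.0f;
    /* Each thread accumulates its own partial sum; the partial sums are
     * combined when the loop finishes. */
    #pragma omp parallel for simd reduction(+:sum)
    for (long i = 0; i < (long)n; ++i)
        sum += a[i];
    return sum;
}
```

Because the reduction combines partial sums in an unspecified order, the result for general floating-point input may differ slightly from a strictly sequential sum.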
Related: Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? for a more in-depth look at multiple accumulators for ILP, and SIMD, for an FMA loop.
No, each core normally can perform most general operations from the instruction set. But the "multiple processing elements" for SIMD operations just perform a single operation on different data (different bytes or words).
For example, each core of ARM Cortex-A53 microarchitecture has capability to run SIMD instructions independently of other cores, while such SIMD instruction sets as MMX, SSE and SSE2 were first introduced on single-core CPUs.
Yes. It does. But only from the marketing point of view. It would be difficult to sell uP or uC with no SIMD instructions.

Why is parallel compilation performance with HT worse than without?

I've made several measurements of compilation time of wine with HyperThreading enabled and disabled in BIOS on my Core i7 930 @2.8GHz (quad-core) on Linux 2.6.39 x86_64. Each measurement was like this:
git clean -xdf
./configure --prefix=/usr
time make -j$N
where N is number from 1 to 8.
Here're the results ("speed" is 60/real from time(1)):
Here the blue line corresponds to HT disabled and purple one to HT enabled. It appears that when HT is enabled, using 1-4 threads is slower than without HT. I guess this might be related to the kernel not distributing the processes to different cores and reusing second threads of already busy cores.
So, my question: how can I force the kernel to give 1 process per core scheduling higher priority than adding more processes to the same core's different thread? Or, if my reasoning is wrong, how can I have performance with HT not worse than without HT for 1-4 processes running in parallel?
Hyper-threading on Intel chips is implemented by duplicating some of the elements of a physical core, but without enough electronics to form an independent core (e.g. the two threads may share an instruction decoder, but I can't recall the specifics of Intel's implementation).
Imagine a physical core with HT as 1.5 physical cores that your OS sees as 2 real cores. This doesn't equate to 1.5x speed though (this can vary depending on the use case).
In your example, non-HT is faster up to 4 threads because none of the cores are sharing work with their HT pipeline. You see a flatline above 4 threads because now you only have 4 execution threads and you get a little extra overhead from context switching between threads.
In the HT example you are a bit slower up to 4 threads, probably because some of those threads are being assigned to a real core and its HT sibling, so you lose performance as those two execution threads share physical resources. Above 4 threads you see the benefit of the extra execution threads, but also the beginning of diminishing returns.
You could probably match performance in both cases for up to 4 threads, but likely not with a compilation job: too many processes are being spawned for processor affinity to be set up, I think. If you instead ran a real parallel job using OpenMP or MPI with X<=4 threads bound to specific real CPU cores, I think you'd see similar performance between HT-off and HT-on.
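For what binding a thread to a specific CPU can look like on Linux, here is a minimal sketch using sched_getaffinity / sched_setaffinity from glibc. The helper names are mine, and the choice of "lowest allowed CPU" is only for illustration; it does not by itself avoid HT siblings.

```c
#define _GNU_SOURCE  /* must come before any #include to expose the CPU_* macros */
#include <assert.h>
#include <sched.h>

/* Pin the calling thread to the lowest-numbered CPU it is currently allowed
 * to use. Returns that CPU index, or -1 on failure. */
int pin_to_first_allowed_cpu(void)
{
    cpu_set_t allowed;
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            return sched_setaffinity(0, sizeof(one), &one) == 0 ? cpu : -1;
        }
    }
    return -1;
}

/* Number of CPUs the calling thread may run on, or -1 on failure. */
int allowed_cpu_count(void)
{
    cpu_set_t allowed;
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;
    return CPU_COUNT(&allowed);
}
```

To place one thread per physical core you would also need to know which logical CPU numbers are siblings of the same core; on Linux that is listed in /sys/devices/system/cpu/cpuN/topology/thread_siblings_list. For whole processes, the taskset command offers the same kind of control from the shell.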
Given a number of threads <= the number of real cores, using HT should be slower because (considered crudely) you are potentially cutting the speed of your cores in half. [1]
Keep in mind that generally more cores is NOT better than FASTER cores. In fact, the only reason so much work was put into developing multi-core systems is that it became increasingly difficult to make faster and faster ones. So if you cannot have a 20 GHz processor, then 8 x 3 GHz ones will have to do.
HT is, I believe, primarily intended as an advantage in contexts where each thread is not necessarily gobbling as much processor as it can; it's doing some particular task that's governed by interaction with a user, such as CAD stuff, video games, etc.; these are the kinds of applications that benefit from multi-tasking. By contrast, server platforms -- wherein the primary applications tend to thread independent tasks that are not governed by a dependence on anything else, hence are optimally run as fast as possible -- do not benefit directly from multi-tasking; they benefit from speed. make is in the same category, although with a perhaps greater degree of interdependence between threads, which is why you see an advantage for HT from 4-8 threads.
[1] This is a simplification. HT doesn't simply double the number of cores and halve their speed, but whatever dynamic is used, the total number of processor cycles per second for the system is not improved. It's the same -- only more fragmented.

Difference between core and processor

What is the difference between a core and a processor?
I've already looked for it on Google, but I only get definitions for multi-core and multi-processor, which is not what I am looking for.
A core is usually the basic computation unit of the CPU - it can run a single program context (or multiple ones if it supports hardware threads such as hyperthreading on Intel CPUs), maintaining the correct program state, registers, and correct execution order, and performing the operations through ALUs. For optimization purposes, a core can also hold on-core caches with copies of frequently used memory chunks.
A CPU may have one or more cores to perform tasks at a given time. These tasks are usually software processes and threads that the OS schedules. Note that the OS may have many threads to run, but the CPU can only run X such tasks at a given time, where X = number of cores * number of hardware threads per core. The rest have to wait for the OS to schedule them, whether by preempting currently running tasks or by any other means.
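The formula for X can be checked against what the OS reports. A small sketch, with helper names of my own choosing; `_SC_NPROCESSORS_ONLN` is a common extension (available on Linux and macOS) rather than part of strict POSIX.

```c
#include <assert.h>
#include <unistd.h>

/* X from the text: how many software threads can execute at the same instant. */
long max_concurrent_tasks(long cores, long hw_threads_per_core)
{
    assert(cores > 0 && hw_threads_per_core > 0);
    return cores * hw_threads_per_core;
}

/* Logical CPUs currently online as seen by the OS. On a typical SMT machine
 * this equals cores times hardware threads per core. Returns -1 if the
 * value is unavailable. */
long online_logical_cpus(void)
{
    return sysconf(_SC_NPROCESSORS_ONLN);
}
```

For a quad-core CPU with 2 hardware threads per core, `max_concurrent_tasks(4, 2)` gives 8, which is also what `online_logical_cpus()` would be expected to report on such a machine when all CPUs are online.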
In addition to the one or many cores, the CPU will include some interconnect that connects the cores to the outside world, and usually also a large "last-level" shared cache. There are multiple other key elements required to make a CPU work, but their exact locations may differ according to design. You'll need a memory controller to talk to the memory, and I/O controllers (display, PCIe, USB, etc.). In the past these elements were outside the CPU, in the complementary "chipset", but most modern designs have integrated them into the CPU.
In addition, the CPU may have an integrated GPU, and pretty much everything else the designer wanted to keep close for performance, power and manufacturing considerations. CPU design is mostly trending toward what's called a system on chip (SoC).
This is a "classic" design, used by most modern general-purpose devices (client PCs, servers, and also tablets and smartphones). You can find more elaborate designs, usually in academia, where the computation is not done in basic "core-like" units.
An image may say more than a thousand words:
* Figure describing the complexity of a modern multi-processor, multi-core system.
Let's first clarify what a CPU is and what a core is. A central processing unit (CPU) can have multiple cores; each core is a processor in its own right, capable of executing a program, but contained on the same chip.
In the past, one CPU was distributed among quite a few chips, but as Moore's Law progressed it became possible to fit a complete CPU on one chip (die). Since the 90s, manufacturers have been fitting more cores onto the same die, which is the concept of multi-core.
These days it is possible to have hundreds of cores on the same chip or die (GPUs, Intel Xeon). Another technique developed in the 90s was simultaneous multi-threading: it was found that a single-core CPU could run another thread, since most of the resources, such as ALUs and registers, were already duplicated.
So basically a CPU can have multiple cores, each capable of running one or more threads at the same time. We may expect more cores in the future, but programming them efficiently will become more difficult.
CPU stands for central processing unit. Until around 2002 we had only single-core processors, i.e. a CPU would perform only a single task or program at a time.
To run multiple programs at the same time we had to use multiple processors, which required a different motherboard, and that is very expensive.
So Intel introduced the concept of hyper-threading, i.e. it presents a single CPU as two virtual CPUs, so that we have two logical cores for our tasks. The CPU is still a single one, but it pretends (masquerades) to be a dual CPU and performs multiple tasks. Having real multiple cores is better than that, so manufacturers developed multi-core processors, i.e. multiple processors in a single package: several CPUs combined on one big CPU. These are the multiple cores.
Before the 90s, processors weren't able to multitask that efficiently, because a single processor could handle just a single process at a time. When we used to say that my antivirus, Microsoft Word, VLC, etc. were all running at the same time, that wasn't actually true. When I said a processor could handle a single process at a time, I meant it. It would actually process a single task, then pause that task, take another task, complete it if it was a short one or pause it again and add it back to the queue, then move on to the next. But this 'pause' was so short (a tiny fraction of a second) that you didn't notice that the task had been paused. E.g. while listening to music in VLC there are other apps running simultaneously, but as I said, VLC is actually being paused in between for such a short time that you don't notice it, even though its processing is actually stopping in between.
But this was about the old processors...
Nowadays processors, i.e. in 3rd-gen PCs, are multi-core. The 'cores' can be compared to 1st- or 2nd-gen processors themselves, embedded onto a single chip, a single processor. So now we understand what cores are: they are mini processors which combine to become a processor. Each core can handle a single process at a time, or multiple threads, as designed for the OS. And they follow the same steps as I mentioned above for the single processor.
E.g. a 6th-gen i7 processor that presents 8 logical processors is like 8 mini processors in 1, so it can work on up to 8 tasks at once (though that does not make it exactly 8x as fast as the old processors). And this is how multitasking can be done.
There could be hundreds of cores in a single processor.
I hope I explained this well.
I have read all the answers, but this link gave a clearer explanation of the difference between CPU (processor) and core. So I'm leaving some notes from it here.
The main difference between CPU and core is that the CPU is an electronic circuit inside the computer that carries out instructions to perform arithmetic, logical, control and input/output operations, while the core is an execution unit inside the CPU that receives and executes instructions.
Intel's picture is helpful, as shown by Tortuga's best answer. Here's a caption for it.
Processor: One semiconductor chip, the CPU (central processing unit) seated in one socket, circa 1970s-2010s. Over time, more functions have been packed onto the CPU chip. Prior to the 1970s releases of single-chip processors, one processor might have been spread across multiple chips. In the mid 2010s, system-on-a-chip designs made it slightly more sketchy to equate one processor to one chip, though that's generally what people mean by processor, as in "this computer has an i7 processor" or "this computer system has four processors."
Core: One block of a CPU, executing one instruction stream at a time. (You'll see people say one instruction per clock cycle, but some CPUs use multiple clock cycles for some instructions, and superscalar cores can complete several instructions per cycle.)

What is difference between 'Cores across processors' and 'Number of CPUs'?

E.g. consider the following processor configuration of my machine:
Intel(R) Core(TM) i5 CPU 650 @ 3.20GHz (4 CPUs)
Then how should I find out how many 'cores across processors' my machine has?
Is it the 4 cores [i.e. the number of CPUs]?
I have looked at the following links, but I still do not have a clear idea:
Can anyone please clear my doubt?
'Cores across processors' means nothing, or at least nothing in particular; it's a generic, non-technical phrase with no exact meaning.
According to Intel, this CPU provides 2 physical cores with Hyper-Threading, which means that you get 4 logical cores, also called threads.
Hyper-Threading is an Intel technology that provides 2 threads per core, so 2*2 = 4 threads.
I think that this is the closest answer to what you are asking here.
Let's first clarify what a CPU is and what a core is. A central processing unit (CPU) can have multiple cores; each core is a processor in its own right, capable of executing a program, but contained on the same chip.
In the past, one CPU was distributed among quite a few chips, but as Moore's Law progressed it became possible to fit a complete CPU on one chip (die). Since the 90s, manufacturers have been fitting more cores onto the same die, which is the concept of multi-core.
These days it is possible to have hundreds of cores on the same chip or die (GPUs, Intel Xeon). Another technique developed in the 90s was simultaneous multi-threading: it was found that a single-core CPU could run another thread, since most of the resources, such as ALUs and registers, were already duplicated.
So basically a CPU can have multiple cores, each capable of running one or more threads at the same time. We may expect more cores in the future, but programming them efficiently will become more difficult.
