What are the latencies of a GPU? - caching

I can find latencies, in terms of either nanoseconds or CPU cycles, between a CPU core and its cache, main memory, and so on.
But it seems very hard to find similar information about modern GPUs.
Does anyone know the latencies of GPUs, especially the latencies between a modern NVIDIA GPU (GF110 or later) and its memory? Thanks.
GPU memory does have much larger bandwidth, but what about its latency?
I have heard that GPU latencies are just as high as CPU latencies, which would make the larger bandwidth largely pointless for many general-purpose computing tasks; I just need to confirm this.

Since vendors do not reveal all the architectural details, researchers have used reverse engineering to demystify GPU architecture. See the paper Demystifying GPU microarchitecture through microbenchmarking and other papers that cite it (note that it is not my paper). I have copied their findings in the image below.
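If you want to reproduce that kind of number yourself, a minimal pointer-chasing sketch along the lines of those microbenchmarks looks roughly like this. The array size, stride, and iteration count are illustrative assumptions, and the result includes a few cycles of loop overhead; clock64() reports SM cycles, so divide by your GPU's actual clock to get nanoseconds.

```
// Minimal pointer-chasing latency sketch (CUDA). A single thread walks a
// dependent chain of indices, so every load must wait for the previous one;
// elapsed cycles divided by the number of loads approximates the average
// load latency (cache or DRAM, depending on the working-set size).
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void chase(const unsigned int* next, int iters,
                      long long* cycles, unsigned int* sink) {
    unsigned int idx = 0;
    long long start = clock64();
    for (int i = 0; i < iters; ++i)
        idx = next[idx];                 // every load depends on the previous one
    long long stop = clock64();
    *cycles = stop - start;
    *sink = idx;                         // keep the compiler from removing the loop
}

int main() {
    const int n = 1 << 25;               // 128 MB of indices: far larger than any L2, so this measures DRAM
    const int stride = 32;               // 128-byte jumps, so each access touches a new cache line
    std::vector<unsigned int> h(n);
    for (int i = 0; i < n; ++i) h[i] = (i + stride) % n;

    unsigned int* d_next; long long* d_cycles; unsigned int* d_sink;
    cudaMalloc(&d_next, n * sizeof(unsigned int));
    cudaMalloc(&d_cycles, sizeof(long long));
    cudaMalloc(&d_sink, sizeof(unsigned int));
    cudaMemcpy(d_next, h.data(), n * sizeof(unsigned int), cudaMemcpyHostToDevice);

    const int iters = 1 << 18;
    chase<<<1, 1>>>(d_next, iters, d_cycles, d_sink);   // one thread, one block
    cudaDeviceSynchronize();

    long long cycles;
    cudaMemcpy(&cycles, d_cycles, sizeof(cycles), cudaMemcpyDeviceToHost);
    printf("average load latency: %.1f cycles per access\n", (double)cycles / iters);

    cudaFree(d_next); cudaFree(d_cycles); cudaFree(d_sink);
    return 0;
}
```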

Related

Discrete GPU to reduce memory contention & improve CPU performance

I have long suspected that the shared RAM of integrated GPUs causes memory contention and significantly slows the performance of the CPU, especially in the context of compiler and IDE performance.
Have you done any experiments or noticed a difference when adding or removing a discrete graphics card?
Are you aware of any studies on this subject? (I could not find any)
For video there are two uses of memory: reading the frame buffer's contents and sending them to the monitor every frame, and whatever the GPU happens to be doing.
For the GPU there's no way to guess.
For reading the frame buffer; for a video mode like 1920x1600 with 32 bits per pixel you're looking at 12.288 MB per frame, so at 60 frames per second that's 0.737 GB/s. A single RAM module is typically capable of "tens of GB per second" (e.g. DDR4-3200 is 25.6 GB/s according to wikipedia). From this you can assume reading from the framebuffer consumes less than 10% of one RAM module's bandwidth. Of course for most systems there's multiple RAM modules and multiple memory channels; so it's likely to be significantly less than 10% of available RAM bandwidth.
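That arithmetic is easy to reproduce. A tiny host-side sketch, using only the figures from the paragraph above (same resolution, refresh rate, and the Wikipedia DDR4-3200 number):

```
// Host-only arithmetic: share of one DDR4-3200 module's bandwidth consumed
// by scanning out a 1920x1600, 32-bit framebuffer at 60 Hz.
#include <cstdio>

int main() {
    const double width = 1920, height = 1600, bytes_per_pixel = 4, fps = 60;
    const double frame_bytes = width * height * bytes_per_pixel;   // 12.288 MB per frame
    const double scanout_bps = frame_bytes * fps;                  // ~0.737 GB/s
    const double module_bps  = 25.6e9;                             // DDR4-3200, one module
    printf("scan-out: %.3f GB/s (%.1f%% of one module's bandwidth)\n",
           scanout_bps / 1e9, 100.0 * scanout_bps / module_bps);   // ~2.9%
    return 0;
}
```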
Also note that CPUs typically use caches for most memory accesses and only need RAM bandwidth for cache misses (e.g. you could have 8 CPUs pounding caches and still have almost all of the usable RAM bandwidth going unused); so devices of all types (e.g. disk controllers, network cards, USB controllers, sound cards, discrete and integrated video) using RAM bandwidth won't necessarily affect CPU performance.
There are also other (potentially more significant) factors for performance too. For example, for modern integrated video, GPU is in the same package as the CPUs, so when the GPU is going berserk heating up the package the CPUs may need to slow down to avoid melting everything. Discrete video cards don't have this problem (they have the "spend several hundred extra $$ to be deafened by excessive fan noise while you're sitting in a puddle of your own perspiration" problem instead ;) ).
Mostly, everything involved (which hardware, which software, which other devices) is too variable for a concrete measurement of one specific case to be meaningful, so I wouldn't expect to find any studies.

DRAM and its effects on real world performance

After learning a little about how computer programs run, I had some thoughts concerning the CPU and RAM. After watching a few YouTube videos (Linus Tech Tips and others), they all seem to show that increasing RAM speed (frequency) does not yield much of a performance improvement in real-world applications and games on a typical desktop computer. My first question is: why is this? Is it because of the high hit rates (95% and above) of the CPU's cache on most modern CPUs, which in turn means the CPU rarely needs to reach out to RAM? Also, in which situations would a faster RAM frequency be beneficial?
NOTE: this is a very broad question, and the answers can vary considerably depending on the architecture/OS running the system. I am answering from a best-judgement standpoint on how these things generally work.
Why is there not a larger performance difference between different RAM clock speeds?
I would imagine that the clock speed of the RAM of the computer matters less than the clock speed of the CPU cache. Because:
the CPU gets its instructions from the cache, not straight from RAM
with the larger cache sizes of modern CPUs, it is less often necessary to go out to RAM at all.
When a cache miss does force a trip to RAM, the memory controller fetches the data while the CPU can keep executing other independent instructions, or the OS can switch to a different process entirely, hiding part of the latency.
Besides that, the memory controller and the interconnect between the CPU and RAM can themselves become a chokepoint that limits the overall transfer rate, regardless of the RAM's rated speed.
In which situations would a faster RAM frequency be beneficial?
I would say that, overall, any one of the core pieces of hardware involved in storing and moving data (the CPU, the CPU caches, the memory controller and buses, DMA engines, and the RAM itself) can become the chokepoint, so faster RAM might or might not affect the overall performance. It is really a case-by-case issue.
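One hedged way to make the hit-rate argument concrete is the classic average-memory-access-time formula, AMAT = hit time + miss rate x miss penalty. The figures below are illustrative (a typical cache hit time, a fixed controller/row overhead, JEDEC-typical CAS latencies), not measurements of any particular system; the point is that DRAM latency in nanoseconds barely changes across speed grades, because the CAS latency in cycles grows along with the clock.

```
// Host-only sketch: why a higher RAM clock often moves the average memory
// access time (AMAT) very little. AMAT = hit_time + miss_rate * miss_penalty.
#include <cstdio>

double amat_ns(double hit_ns, double miss_rate, double dram_ns) {
    return hit_ns + miss_rate * dram_ns;
}

int main() {
    const double hit_ns = 1.5;     // typical cache hit, illustrative
    const double miss_rate = 0.03; // 97% cache hit rate, illustrative

    // CAS latency in ns = CAS cycles / I/O clock (GHz), plus a fixed
    // controller/row-activation overhead (illustrative 50 ns).
    const double dram_2400_ns = 17.0 / 1.2 + 50.0;  // DDR4-2400 CL17: ~64.2 ns
    const double dram_3200_ns = 22.0 / 1.6 + 50.0;  // DDR4-3200 CL22: ~63.8 ns

    printf("AMAT @ DDR4-2400: %.2f ns\n", amat_ns(hit_ns, miss_rate, dram_2400_ns));
    printf("AMAT @ DDR4-3200: %.2f ns\n", amat_ns(hit_ns, miss_rate, dram_3200_ns));
    return 0;
}
```

The faster kit helps mainly with bandwidth-bound workloads; for latency-bound desktop use the average access time the CPU sees is nearly unchanged.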

Hyperthreading and turbo boost in matrix multiply - worse performance using hyperthreading

I am tuning my GEMM code and comparing it with Eigen and MKL. I have a system with four physical cores. Until now I have used the default number of threads from OpenMP (eight on my system). I assumed this would be at least as good as four threads. However, I discovered today that if I run Eigen and my own GEMM code on a large dense matrix (1000x1000), I get better performance using four threads instead of eight. The efficiency jumped from 45% to 65%. I think this can also be seen in this plot
https://plafrim.bordeaux.inria.fr/doku.php?id=people:guenneba
The difference is quite substantial. However, the performance is much less stable: it jumps around quite a bit each iteration, both with Eigen and with my own GEMM code. I'm surprised that hyperthreading makes the performance so much worse. I guess this is not really a question; it's an unexpected observation which I'm hoping to get feedback on.
I see that not using hyperthreading is also suggested here.
How to speed up Eigen library's matrix product?
I do have a question regarding measuring maximum performance. What I do now is run CPU-Z and look at the frequency while I'm running my GEMM code, and then use that number in my code (4.3 GHz on one overclocked system I use). Can I trust this number for all threads? How do I know the frequency per thread to determine the maximum? How do I properly account for turbo boost?
The purpose of hyperthreading is to improve CPU utilization for code exhibiting high latency. Hyperthreading masks this latency by executing two threads at once, giving the core more independent instructions to fill its issue slots with.
However, a well-written matrix product kernel already exhibits excellent instruction-level parallelism and thus exploits nearly 100% of the CPU resources. Therefore there is no room for a second "hyper" thread, and the overhead of its management can only decrease the overall performance.
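To make "excellent instruction-level parallelism" concrete, here is an illustrative scalar micro-kernel fragment in the spirit of such GEMM inner loops (it is not Eigen's or MKL's actual kernel, which would use SIMD registers): the four accumulators are independent, so their multiply-adds can overlap in the pipeline without any help from a second hardware thread.

```
// Illustrative micro-kernel: independent accumulator chains keep the FP units
// busy every cycle, hiding instruction latency with ILP from a single thread.
#include <cstdio>

void micro_kernel(const float* a, const float* b, float* c, int k) {
    float acc0 = 0.f, acc1 = 0.f, acc2 = 0.f, acc3 = 0.f;  // independent chains
    for (int p = 0; p < k; ++p) {
        const float bp = b[p];
        acc0 += a[4 * p + 0] * bp;   // these four multiply-adds do not depend
        acc1 += a[4 * p + 1] * bp;   // on each other, so they can issue back
        acc2 += a[4 * p + 2] * bp;   // to back and keep the pipeline full
        acc3 += a[4 * p + 3] * bp;
    }
    c[0] += acc0; c[1] += acc1; c[2] += acc2; c[3] += acc3;
}

int main() {
    float a[32], b[8], c[4] = {0, 0, 0, 0};
    for (int i = 0; i < 32; ++i) a[i] = 1.0f;
    for (int i = 0; i < 8; ++i)  b[i] = 2.0f;
    micro_kernel(a, b, c, 8);
    printf("c = %.0f %.0f %.0f %.0f\n", c[0], c[1], c[2], c[3]);  // 16 16 16 16
    return 0;
}
```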
Unless I've missed something (always possible), your CPU has one clock shared by all its components, so if you measure its rate at 4.3 GHz (or whatever) then that's the rate of all the components for which it makes sense to figure out a rate. Imagine the chaos if this were not so, with some cores running at one rate and others at another; the shared components (e.g. memory access) would become unmanageable.
As to hyperthreading actually worsening the performance of your matrix multiplication, I'm not surprised. After all, hyperthreading is a poor person's parallelisation technique, duplicating architectural state (registers and instruction queues) but not the functional units. Once you've got your code screaming along, pushing your n*10^6 contiguous memory locations through the FPUs, switching to the sibling hyperthread in response to a pipeline stall isn't going to help much. At best the other thread will scream along for a while before another switch robs you of useful clock cycles; at worst all the careful arrangement of data in the memory hierarchy will be horribly mangled at each switch.
Hyperthreading is designed not for parallel numeric computational speed but for improving the performance of a much more general workload; we use general-purpose CPUs in high-performance computing not because we want hyperthreading but because all the specialist parallel numeric CPUs have gone the way of all flesh.
As a provider of multithreaded concurrency services, I have explored how hyperthreading affects performance under a variety of conditions. I have found that with software that limits its own high-utilization threads to no more than the number of physical processors available, the presence or absence of HT makes very little difference. Software that attempts to use more threads than that for heavy computational work is likely unaware that it is doing so, relying merely on the total processor count (which doubles under HT), and predictably runs more slowly. Perhaps the largest benefit that enabling HT may provide is that you can max out all physical processors without bringing the rest of the system to a crawl. Without HT, software often has to leave one CPU free to keep the host system running normally. Hyperthreads are just more switchable threads; they are not additional processors.
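A minimal sketch of the mitigation these answers converge on: limit OpenMP to one worker per physical core. The core count is hard-coded as an assumption because OpenMP itself only reports logical processors, and my_gemm is a hypothetical stand-in for Eigen, MKL, or the hand-written kernel.

```
// Host-only sketch: run the GEMM with one thread per physical core instead of
// the OpenMP default of one per logical (hyperthreaded) processor.
#include <cstdio>
#include <omp.h>

// Hypothetical GEMM entry point; stands in for Eigen, MKL, or your own code.
void my_gemm() { /* ... */ }

int main() {
    const int physical_cores = 4;        // assumption: known for this machine, not queried here
    printf("OpenMP default threads: %d\n", omp_get_max_threads());  // typically 8 with HT on

    omp_set_num_threads(physical_cores); // cap worker threads at the physical core count
    // Equivalent from the shell: OMP_NUM_THREADS=4; adding OMP_PROC_BIND=close
    // and OMP_PLACES=cores helps keep the four threads on four distinct cores.
    my_gemm();
    return 0;
}
```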

Why not use GPUs as CPUs?

I know the question is only partially programming-related, because the answer I would like to get originally comes from these two questions:
Why are CPU core counts so low (vs GPUs)? and Why aren't we using GPUs instead of CPUs, GPUs only, or CPUs only? (I know that GPUs are specialized while CPUs are more for multi-tasking, etc.). I also know that there are memory (host vs. GPU) limitations, along with precision and cache capabilities. But in terms of a high-end-to-high-end hardware comparison, GPUs are much, much more performant.
So my question is: could we use GPUs instead of CPUs for the OS, applications, etc.?
The reason I am asking this question is that I would like to know why current computers still use two main processing units (CPU/GPU) with two main memory and caching systems (CPU/GPU), even though that is not something a programmer would like.
Current GPUs lack many of the facilities of a modern CPU that are generally considered important (crucial, really) to things like an OS.
Just for example, an OS normally uses virtual memory and paging to manage processes. Paging allows the OS to give each process its own address space, (almost) completely isolated from every other process. At least based on publicly available information, most GPUs don't support paging at all (or at least not in the way an OS needs).
GPUs also operate at much lower clock speeds than CPUs, so they only provide high performance for embarrassingly parallel problems; CPUs generally provide much higher performance for single-threaded code. Most of the code in an OS isn't highly parallel -- in fact, a lot of it is quite difficult to make parallel at all (e.g., for years, Linux had a giant lock to ensure only one thread executed most kernel code at any given time). For this kind of task, a GPU would be unlikely to provide any benefit.
From a programming viewpoint, a GPU is a mixed blessing (at best). People have spent years working on programming models to make programming a GPU even halfway sane, and even so it's much more difficult (in general) than CPU programming. Given the difficulty of getting even relatively trivial things to work well on a GPU, I can't imagine attempting to write anything even close to as large and complex as an operating system to run on one.
GPUs are designed for graphics related processing (obviously), which is inherently something that benefits from parallel processing (doing multiple tasks/calculations at once). This means that unlike modern CPUs, which as you probably know usually have 2-8 cores, GPUs have hundreds of cores. This means that they are uniquely suited to processing things like ray tracing or anything else that you might encounter in a 3D game or other graphics intensive activity.
CPUs on the other hand have a relatively limited number of cores because the tasks that a CPU faces usually do not benefit from parallel processing nearly as much as rendering a 3D scene would. In fact, having too many cores in a CPU could actually degrade the performance of a machine, because of the nature of the tasks a CPU usually does and the fact that a lot of programs would not be written to take advantage of the multitude of cores. This means that for internet browsing or most other desktop tasks, a CPU with a few powerful cores would be better suited for the job than a GPU with many, many smaller cores.
Another thing to note is that more cores usually means more power needed. This means that a 256-core phone or laptop would be pretty impractical from a power and heat standpoint, not to mention the manufacturing challenges and costs.
Usually operating systems are pretty simple, if you look at their structure.
But parallelizing them will not improve speeds much; only raw clock speed will.
GPUs simply lack parts and many instructions from their instruction sets that an OS needs; it's a matter of sophistication. Just think of virtualization features (Intel VT-x or AMD's AMD-V).
GPU cores are like dumb ants, whereas a CPU is like a complex human, so to speak. Both have different energy consumption because of this and produce very different amounts of heat.
See this extensive Super User answer here for more info.
Because nobody will spend money and time on this, except for some enthusiasts, like this one: http://gerigeri.uw.hu/DawnOS/history.html (now here: http://users.atw.hu/gerigeri/DawnOS/history.html)
Dawn now works on GPU-s: with a new OpenCL capable emulator, Dawn now boots and works on Graphics Cards, GPU-s and IGP-s (with OpenCL 1.0). Dawn is the first and only operating system to boot and work fully on a graphics chip.

Would it be possible for a JIT compiler to utilize GPU for certain operations behind the scenes?

Feel free to correct me if any part of my understanding is wrong.
My understanding is that GPUs offer a subset of the instructions that a normal CPU provides, but execute them much faster.
I know there are ways to utilize GPU cycles for non-graphical purpose, but it seems like (in theory) a language that's Just In Time compiled could detect the presence of a suitable GPU and offload some of the work to the GPU behind the scenes without code change.
Is my understanding naive? Is it just a matter of it being really complicated and simply not having been done yet?
My understanding is that GPUs offer a subset of the instructions that a normal CPU provides, but execute them much faster.
It's definitely not that simple. The GPU is tailored mainly to SIMD/vector processing. So even though the theoretical potential of GPUs nowadays is vastly superior to that of CPUs, only programs that can benefit from SIMD-style execution can be executed efficiently on the GPU. Also, there is of course a performance penalty when data has to be transferred from the CPU to the GPU to be processed there.
So for a JIT compiler to be able to use the GPU efficiently, it must be able to detect code that can be parallelized to benefit from SIMD-style execution, and then determine whether the overhead induced by transferring data between the CPU and the GPU will be outweighed by the performance improvement.
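A back-of-the-envelope version of that decision, as a hypothetical JIT heuristic might compute it; the PCIe bandwidth and CPU/GPU throughput figures are illustrative assumptions, not properties of any particular hardware or runtime.

```
// Host-only sketch of the offload decision a JIT would have to make:
// offload only if (transfer time + GPU compute time) beats CPU compute time.
#include <cstdio>

bool worth_offloading(double bytes, double flops,
                      double pcie_bw,      // bytes/s over the bus, both directions combined
                      double cpu_flops,    // sustained CPU throughput
                      double gpu_flops) {  // sustained GPU throughput
    const double cpu_time = flops / cpu_flops;
    const double gpu_time = bytes / pcie_bw + flops / gpu_flops;
    return gpu_time < cpu_time;
}

int main() {
    // Illustrative figures: 12 GB/s effective PCIe, 100 GFLOP/s CPU, 1 TFLOP/s GPU.
    const double pcie = 12e9, cpu = 100e9, gpu = 1e12;

    // Element-wise add of two 1M-float vectors: many bytes, few flops -> not worth it.
    printf("vector add: %s\n",
           worth_offloading(3 * 4e6, 1e6, pcie, cpu, gpu) ? "offload" : "stay on CPU");

    // 4096x4096 matrix multiply: O(n^3) flops over O(n^2) bytes -> usually worth it.
    const double n = 4096;
    printf("matmul:     %s\n",
           worth_offloading(3 * 4 * n * n, 2 * n * n * n, pcie, cpu, gpu) ? "offload" : "stay on CPU");
    return 0;
}
```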
It is possible to use a GPU (e.g., a CUDA- or OpenCL-enabled one) to speed up the JIT compiler itself; both register allocation and instruction scheduling could be implemented efficiently there.
