Discrete GPU to reduce memory contention & improve CPU performance

I have long suspected that the shared RAM of integrated GPUs causes memory contention and significantly slows CPU performance, especially in the context of compiler and IDE performance.
Have you done any experiments or noticed a difference when adding or removing a discrete graphics card?
Are you aware of any studies on this subject? (I could not find any)

For video there are two uses of memory: reading the frame buffer's contents and sending them to the monitor every frame, and whatever the GPU happens to be doing.
For the GPU's own workload there's no way to guess.
For reading the frame buffer: for a video mode like 1920x1600 with 32 bits per pixel you're looking at 12.288 MB per frame, so at 60 frames per second that's 0.737 GB/s. A single RAM module is typically capable of tens of GB per second (e.g. DDR4-3200 is 25.6 GB/s according to Wikipedia). From this you can assume reading from the framebuffer consumes less than 10% of one RAM module's bandwidth. Of course, most systems have multiple RAM modules and multiple memory channels, so it's likely to be significantly less than 10% of the available RAM bandwidth.
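To make the arithmetic easy to re-run with different numbers, here is a small back-of-the-envelope sketch in C. The resolution, refresh rate and per-module bandwidth are just the example figures from above, not measured values.

```c
#include <stdio.h>

int main(void) {
    /* Example numbers from the text above -- adjust for your own display/RAM. */
    const double width  = 1920.0;
    const double height = 1600.0;
    const double bytes_per_pixel = 4.0;   /* 32 bits per pixel */
    const double fps = 60.0;
    const double module_bw_gbs = 25.6;    /* one DDR4-3200 module, GB/s */

    double frame_mb   = width * height * bytes_per_pixel / 1e6;   /* MB per frame     */
    double scanout_gb = frame_mb * fps / 1e3;                     /* GB/s for scanout */

    printf("Frame size:              %.3f MB\n", frame_mb);
    printf("Scanout bandwidth:       %.3f GB/s\n", scanout_gb);
    printf("Share of one RAM module: %.1f %%\n", 100.0 * scanout_gb / module_bw_gbs);
    return 0;
}
/* Prints roughly: 12.288 MB, 0.737 GB/s, ~2.9 % of one module's bandwidth. */
```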
Also note that CPUs typically serve most memory accesses from their caches and only need RAM bandwidth on a cache miss (e.g. you could have 8 CPUs pounding their caches while almost all of the usable RAM bandwidth sits idle); so devices of all types (e.g. disk controllers, network cards, USB controllers, sound cards, discrete and integrated video) using RAM bandwidth won't necessarily affect CPU performance.
There are also other (potentially more significant) factors for performance. For example, with modern integrated video the GPU is in the same package as the CPUs, so when the GPU is going berserk and heating up the package, the CPUs may need to slow down to avoid melting everything. Discrete video cards don't have this problem (they have the "spend several hundred extra $$ to be deafened by excessive fan noise while sitting in a puddle of your own perspiration" problem instead ;) ).
Mostly, everything involved (which hardware, which software, which other devices) is too variable for a concrete measurement of one specific case to be meaningful, so I wouldn't expect to find any studies.

Related

Explanation for why effective DRAM bandwidth reduces upon adding CPUs

This question is a spin-off of the one posted here: Measuring bandwidth on a ccNUMA system
I've written a micro-benchmark for the memory bandwidth on a ccNUMA system with 2x Intel(R) Xeon(R) Platinum 8168:
24 cores @ 2.70 GHz,
L1 cache 32 kB, L2 cache 1 MB and L3 cache 33 MB.
As a reference, I'm using the Intel Advisor's roof-line plot, which depicts the bandwidths of each CPU data-path available. According to this, the bandwidth is 230 GB/s.
[Plot: strong scaling of bandwidth]
Question: If you look at the strong scaling diagram, you can see that the peak effective bandwidth is actually achieved at 33 CPUs, following which adding CPUs only reduces it. Why is this happening?
Overview
This answer provides probable explanations. Put shortly, no parallel workload scales infinitely. When many cores compete for the same shared resource (e.g. DRAM), using too many cores is often detrimental: there is a point where enough cores saturate a given shared resource, and using more cores only increases the overhead.
More specifically, in your case, the L3 cache and the IMCs are likely the problem. Enabling Sub-NUMA Clustering and using non-temporal prefetches should improve the performance and scalability of your benchmark a bit. Still, there are other architectural hardware limits that can keep the benchmark from scaling well. The next section describes how Intel Skylake SP processors deal with memory accesses and how to find the bottlenecks.
Under the hood
The layout of Intel Xeon Skylake SP processors is like the following in your case:
[Figure: layout of the Intel Xeon Skylake SP processor. Source: Intel]
There are two sockets connected with a UPI interconnect, and each processor is connected to its own set of DRAM. There are two Integrated Memory Controllers (IMCs) per processor and each is connected to 3 DDR4 DRAM channels @ 2666 MHz. This means the theoretical bandwidth is 2*2*3*2666e6*8 = 256 GB/s = 238 GiB/s.
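The same calculation as a tiny sketch, so different channel counts or transfer rates can be plugged in; the constants are just the figures quoted above.

```c
#include <stdio.h>

int main(void) {
    /* Figures quoted above for the dual-socket Xeon Platinum 8168 system. */
    const double sockets            = 2.0;
    const double imcs_per_socket    = 2.0;
    const double channels_per_imc   = 3.0;
    const double transfers_per_sec  = 2666e6;  /* DDR4-2666 */
    const double bytes_per_transfer = 8.0;     /* 64-bit channel */

    double bw = sockets * imcs_per_socket * channels_per_imc
              * transfers_per_sec * bytes_per_transfer;

    printf("Theoretical bandwidth: %.1f GB/s (%.1f GiB/s)\n",
           bw / 1e9, bw / (1024.0 * 1024.0 * 1024.0));
    return 0;
}
/* Prints: Theoretical bandwidth: 255.9 GB/s (238.3 GiB/s) */
```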
Assuming your benchmark is well designed and each processor only accesses its own NUMA node, I expect a very low UPI throughput and a very low number of remote NUMA pages. You can check this with hardware counters; Linux perf or VTune lets you do so relatively easily.
The L3 cache is split into slices. Physical addresses are distributed across the cache slices using a hash function (see here for more information). This lets the processor balance the throughput between all the L3 slices, and also between the two IMCs, so that in the end the processor looks like an SMP architecture rather than a NUMA one. The same approach was also used in Sandy Bridge and Xeon Phi processors (mainly to mitigate NUMA effects).
Hashing does not guarantee perfect balancing, though (no hash function is perfect, especially ones that are fast to compute), but it is often quite good in practice, especially for contiguous accesses. Bad balancing decreases memory throughput due to partial stalls. This is one reason you cannot reach the theoretical bandwidth.
With a good hash function, the balancing should be independent of the number of cores used. If the hash function is not good enough, one IMC can end up more saturated than the other, oscillating over time. The bad news is that the hash function is undocumented and checking this behaviour is complex: AFAIK you can get hardware counters for each IMC's throughput, but their granularity is quite coarse. On my Skylake machine the hardware counters are named uncore_imc/data_reads/ and uncore_imc/data_writes/, but on your platform you should have 4 such counters (one for each IMC).
Fortunately, Intel provides a feature called Sub-NUMA Clustering (SNC) on Xeon SP processors like yours. The idea is to split the processor into two NUMA nodes that each have their own dedicated IMC. This solves the balancing issue caused by the hash function and thus results in faster memory operations, as long as your application is NUMA-friendly. Otherwise, it can actually be significantly slower due to NUMA effects: in the worst case, the pages of an application can all be mapped to the same NUMA node, so that only half the bandwidth is usable. Since your benchmark is supposed to be NUMA-friendly, SNC should be more efficient.
[Figure: Sub-NUMA Clustering domains. Source: Intel]
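Whether SNC pays off depends on the benchmark allocating each thread's working set on its own NUMA node. Here is a minimal, hedged sketch of explicit per-node allocation with libnuma (the buffer size is an arbitrary placeholder; first-touch allocation with pinned threads achieves the same effect):

```c
/* Compile with: gcc -O2 numa_alloc.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not available on this system\n");
        return EXIT_FAILURE;
    }

    size_t bytes = 1ull << 30;            /* 1 GiB per node, arbitrary choice */
    int nodes = numa_max_node() + 1;      /* with SNC enabled, twice as many nodes as sockets */

    printf("NUMA nodes visible: %d\n", nodes);

    for (int node = 0; node < nodes; ++node) {
        /* Allocate the working set of the threads pinned to this node
           directly on the node's own IMC(s). */
        double *buf = numa_alloc_onnode(bytes, node);
        if (!buf) {
            fprintf(stderr, "allocation on node %d failed\n", node);
            continue;
        }
        /* ... run the bandwidth kernel with threads bound to `node` ... */
        numa_free(buf, bytes);
    }
    return 0;
}
```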
Furthermore, having more cores accessing the L3 in parallel can cause more early evictions of prefetched cache lines, which then need to be fetched again later when the core actually needs them (paying an additional DRAM latency). This effect is not as unusual as it seems. Indeed, due to the high latency of DDR4 DRAM, hardware prefetching units have to prefetch data a long time in advance to reduce the impact of that latency, and they also need to perform a lot of requests concurrently. This is generally not a problem with sequential accesses, but more cores cause the accesses to look more random from the caches' and IMCs' point of view. The thing is, DRAM is designed so that contiguous accesses are faster than random ones (multiple contiguous cache lines should be loaded consecutively to fully saturate the bandwidth). You can analyse the value of the LLC-load-misses hardware counter to check whether more data is re-fetched with more threads (I see such an effect on my Skylake-based PC with only 6 cores, but it is not strong enough to cause any visible impact on the final throughput).

To mitigate this problem, you can use software non-temporal prefetches (prefetchnta) to ask the processor to load data directly into the line fill buffers instead of the L3 cache, resulting in less cache pollution (here is a related answer). This may be slower with fewer cores due to lower concurrency, but it should be a bit faster with a lot of cores. Note that this does not solve the problem of fetched addresses looking more random from the IMCs' point of view, and there is not much to do about that.
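As a rough illustration of the non-temporal prefetch idea, here is a hedged sketch of a streaming reduction that issues prefetchnta a fixed distance ahead. The prefetch distance is a guess to be tuned, and whether this actually helps has to be measured on the machine in question.

```c
#include <stddef.h>
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_NTA */

/* Sum a large array, prefetching with the NTA hint a few cache lines ahead
   so the streamed data goes through the fill buffers rather than polluting
   the L3 cache. */
double sum_stream_nta(const double *a, size_t n) {
    const size_t dist = 64;   /* prefetch distance in elements (tunable guess) */
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (i + dist < n)
            _mm_prefetch((const char *)&a[i + dist], _MM_HINT_NTA);
        s += a[i];
    }
    return s;
}
```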
The low-level architecture of DRAM and caches is very complex in practice. More information about memory can be found in the following links:
What Every Programmer Should Know About Memory
Introduction to High Performance Scientific Computing (Section 1.3)
Lecture: Main Memory and the DRAM System
Short lectures: Dynamic Random Access Memory (in 7 parts)
Intel® 64 and IA-32 Architectures Software Developer's Manual (Volume 3)

How come the register file size in GPUs (e.g. GTX 1080) is bigger than the L2 cache size?

Looking at this fact, I've started wondering how registers work in GPUs. Before knowing this, I thought that going higher and higher up the memory hierarchy, the size keeps decreasing (which is intuitive: latency decreases, size decreases). What is the purpose of registers in GPUs, and why is their total size greater than the L2/L1 cache?
Thanks.
In CPUs caches serve two basic purposes:
They enable temporal and spatial reuse of data already fetched from DRAM. This reduces the required bandwidth of the DRAM.
CPU caches provide a huge reduction of latency, which is extremely important for single threaded performance.
GPUs are not focused on single-thread performance, but on throughput instead. Most of the time they also deal with working sets that are too big to fit into any reasonably sized cache. Small caches help in some situations, but overall caches are not nearly as important for GPUs as they are for CPUs.
Now to the second part of the question: why huge register files? GPUs reach their performance by exploiting thread-level parallelism. Many threads need to be active at the same time to reach high performance levels, but every thread needs to store its own set of registers. In Maxwell GPUs, and likely in GP104/GTX 1080, every SM can host up to 2048 threads. Every SM has a 256 KB register file, so if all threads are used, 32 x 32-bit registers are available per thread.
I mentioned earlier that CPUs use caches to reduce memory latency, but GPUs must also somehow deal with memory latency: they just switch to a different thread while one thread is waiting for an answer from memory. Latency, throughput and the amount of data in flight are connected by Little's law:
(data in flight per thread) * threads = latency * throughput
The memory latency is likely a few hundred ns up to a thousand ns (let's use 1000 ns). The throughput here would be the memory bandwidth (320 GB/s). To fully utilize the available memory bandwidth we need (320 GB/s * 1000 ns =) 320 KB in flight. The GTX 1080 should have 20 SMs, so each SM would need to have 16 KB in flight to fully use the memory bandwidth. Even if all 2048 threads were used for memory accesses all the time, every thread would still need 8 bytes of outstanding memory requests. If some threads are busy with calculations and cannot issue new memory requests, even more outstanding requests are required from the remaining threads; and if threads use more than 32 registers each, fewer threads can be resident, so again more outstanding requests per thread are needed.
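Here is the same Little's-law arithmetic as a small sketch, using the illustrative figures above (1000 ns latency, 320 GB/s, 20 SMs, 2048 threads per SM, 256 KB register file); treat them as example numbers rather than exact GP104 specifications.

```c
#include <stdio.h>

int main(void) {
    /* Illustrative numbers from the discussion above. */
    const double latency_s       = 1000e-9;        /* ~1000 ns DRAM latency   */
    const double bandwidth_Bps   = 320e9;          /* 320 GB/s                */
    const double sms             = 20.0;           /* SMs on the chip         */
    const double threads_per_sm  = 2048.0;         /* resident threads per SM */
    const double regfile_bytes   = 256.0 * 1024.0; /* register file per SM    */

    double in_flight  = latency_s * bandwidth_Bps;   /* Little's law */
    double per_sm     = in_flight / sms;
    double per_thread = per_sm / threads_per_sm;
    double regs_per_thread = regfile_bytes / 4.0 / threads_per_sm;

    printf("Bytes in flight (chip):     %.0f KB\n", in_flight / 1e3);
    printf("Bytes in flight per SM:     %.0f KB\n", per_sm / 1e3);
    printf("Bytes in flight per thread: %.0f B\n",  per_thread);
    printf("32-bit registers per thread (all threads resident): %.0f\n", regs_per_thread);
    return 0;
}
/* Prints: 320 KB, 16 KB, 8 B, 32 registers. */
```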
If GPUs used smaller register files, they could not use the full bandwidth of their memory: they would send out some work to the memory interface and then all threads would be waiting for answers, with no new work to submit. The huge register files are required to have enough threads available. Careful coding is still required to really get the maximum power of the GPU.
A GPU is built for 3D and compute, so vendors dedicate more die area to cores. More cores need more data to feed them, and that requires more GPU area for scheduling mechanisms to keep occupancy as high as possible.
There are many cores, many 3D pipeline units such as TMUs and ROPs, many scheduling parts and very wide memory controllers to feed those cores.
GPU area is just not enough for everything, and caches seem to be the least important part. Even texture memory is more important than that, and it is faster too.
Making the GPU bigger means lower production yield, which means less profit. Since GPU vendors are not charities, they choose maximum profit, optimum performance and (lately) power savings. Cache is expensive.
A compute unit in a GPU can have kilobytes of registers per thread, so data that is used multiple times does not need to travel long distances (such as between cache and cores), which also helps energy efficiency.
You can also hide the latency of some parts by having a good occupancy ratio for large-enough calculations; local shared memory (per compute unit) and registers (per thread) have a more important role in achieving that.
While the memory controller, L1 and L2 can handle only on the order of 100 GB/s, 200 GB/s and 300 GB/s, local shared memory and registers can reach something like 5 TB/s and 15 TB/s of bandwidth on a GPU.

DRAM and its effects on real world performance

After learning a little about how computer programs run, I had some thoughts concerning the CPU and RAM. After watching a few YouTube videos (Linus Tech Tips and others), they all seem to show that increasing RAM speed (frequency) does not really bring much of a performance improvement in real-world applications and games on a general desktop computer. My first question is: why is this? Is it because of the high hit rates (95% and above) of the CPU's caches on most modern CPUs, which in turn means the CPU needs to reach out to RAM less and less? Also, in which situations would a faster RAM frequency be beneficial?
NOTE: this is a very broad question, and the answers can vary greatly depending on the architecture/OS running the system. I am answering from a best-judgement standpoint on how these things generally work.
Why is there not a larger performance difference between different RAM clock speeds?
I would imagine that the clock speed of the computer's RAM matters less than the clock speed of the CPU cache, because:
the CPU gets its instructions from the cache, not straight from RAM;
with the larger cache sizes of modern CPUs, it is less often necessary to go out to RAM;
when the cache does need to go out to RAM, the data is fetched asynchronously (by the memory controller, or via DMA for devices), allowing the CPU to switch to other work, or a different process entirely.
Besides that, the clock speed of the motherboard's various data paths (DMA) could be creating a chokepoint that slows the overall transfer rate.
In which situations would faster RAM frequency be beneficial?
I would say that, overall, any one of the core pieces of hardware involved in storing and moving data (the CPU, the CPU cache, the various memory pipelines and transfer devices (DMA, etc.), and the RAM itself) can become the chokepoint, so faster RAM might or might not affect the overall performance. It is really a case-by-case issue.
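One hedged way to see where RAM speed actually matters is to compare a sequential pass over a large array (mostly served by caches and hardware prefetchers) with a scattered-access pass over the same array (dominated by DRAM latency). The buffer size below is an arbitrary choice, and the exact timings depend entirely on your machine:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024)   /* 64M ints = 256 MB, well beyond any cache */

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    int *a = malloc((size_t)N * sizeof *a);
    if (!a) return 1;
    for (size_t i = 0; i < N; ++i) a[i] = (int)i;

    /* Sequential pass: caches and hardware prefetchers do most of the work. */
    long long sum = 0;
    double t0 = seconds();
    for (size_t i = 0; i < N; ++i) sum += a[i];
    double t_seq = seconds() - t0;

    /* Scattered pass: almost every access misses the caches, so DRAM
       latency (and hence RAM speed/timings) dominates. */
    size_t idx = 1;
    t0 = seconds();
    for (size_t i = 0; i < N; ++i) {
        idx = (idx * 2654435761u + 1) & (N - 1);   /* cheap index scramble */
        sum += a[idx];
    }
    double t_rnd = seconds() - t0;

    printf("sequential: %.3f s, scattered: %.3f s (sum=%lld)\n", t_seq, t_rnd, sum);
    free(a);
    return 0;
}
```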

Estimating how processor frequency affects I/O performance

I am doing research about dedicated I/O software that would run on consumer hardware. Essentially it boils down to saving huge data streams for later processing. Right now I am looking for a model to estimate performance factors on x86.
Take for example the new Macbook Pro:
high-speed Thunderbolt I/O (input/output) technology delivers an amazing 10 gigabits per second of transfer speeds in both directions
1.25 GB/s sounds nice, but most processors of the day are clocked around 2 GHz. Multiple cores make little difference as long as only one can be assigned per network channel.
So even if the software acts as a miniature operating system and limits itself to network/disk operations, the amount of data flowing to storage can't be greater than P / (2 * N)[1] chunks per second. Although this hints at the rough performance limit, I feel it's far from adequate.
What other considerations should one take estimating I/O performance in regards to processor frequency and other hardware specifics? For simplicity's sake, assume here that storage performs instantly under all circumstances.
[1] P - processor frequency; N - algorithm overhead
The hardware limiting factors are probably the I/O bus performance, say PCIe, and more recently, the FSB clock-rates, since memory controllers are moving from northbridge to the CPUs themselves.
Then, of course, you have to figure out what sort of processing you need to do on the input and how much work it is to produce the output. These, at least for conventional software running on a CPU, depend on the processor clock, but not only on that. Writing your code to take advantage of hardware facilities like caches, instruction-level parallelism, etc. is still a black art, but can give you an order-of-magnitude performance boost.
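As a small, hedged illustration of what exploiting instruction-level parallelism can look like, here is a reduction written with four independent accumulators so that loads and additions can overlap instead of forming one long dependency chain; the actual speedup depends on the compiler and CPU.

```c
#include <stddef.h>
#include <stdint.h>

/* Naive reduction: every addition depends on the previous one. */
uint64_t sum_naive(const uint32_t *buf, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; ++i) s += buf[i];
    return s;
}

/* Same reduction with four independent accumulators: the CPU can keep
   several additions (and loads) in flight at once. */
uint64_t sum_ilp(const uint32_t *buf, size_t n) {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += buf[i + 0];
        s1 += buf[i + 1];
        s2 += buf[i + 2];
        s3 += buf[i + 3];
    }
    for (; i < n; ++i) s0 += buf[i];   /* leftover elements */
    return s0 + s1 + s2 + s3;
}
```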
Basically what I'm ranting about is that not all software is created equal, and you probably want to take that into account.
Likely, hard disk controllers will decide hard disk I/O performance, graphics cards will decide maximum resolution and refresh I/O performance, and so on. I don't really understand the question; the CPU is becoming less and less involved in these kinds of things (and has been for the last 10 years).
I doubt the question even has bearing on CPUs with integrated GPUs, since the buffer to be output to the screen is in external memory, sharing a bus with (again) a controller on the motherboard.
It's all buffered, so I can only see CPUs affecting file performance if you somehow force the hardware buffer size to something insanely puny. Edit: and I'm pretty sure Apple will prevent you from doing such things. ;)
For Thunderbolt specifically, it's more about which minimum CPU model supports the bus speeds required by the Thunderbolt chipset version in the machine in question.
Thunderbolt is a raw data traffic system and its performance specs are potential maximums, hence all the asterisks in the Apple specs. I believe it will indeed alleviate bottlenecks and, in general, give lag-free intelligent data shuffling while doing many things simultaneously.
The CPU will idle-wait a shorter time for needed data, but the processing speed of the data stays the same. When playing or creating a movie, codec processing time will be the same, but you will still feel a boost/lack of lag because the data is there when the CPU needs it. For the I/O, the bottleneck will become the read/write speed of your hard disk instead, and the CPU bottleneck (for file copy operations, likely at least some code in Finder) will stay the same.
In other words, only CPU-intensive tasks such as for example movie encoding will benefit significantly from a faster CPU, while the benefits of Thunderbolt vs. a mix of interfaces will boost machines with both slow and fast CPUs.

What's the speed of texture upload?

I would like to upload two images to the GPU memory, and I'm interested in how fast I can do this.
In fact, will it be faster to compare two bitmaps in RAM with the CPU, or to upload them to the GPU and use GPU parallelism to do it?
If you run the CUDA device bandwidth sample, you'll get a benchmark for the upload speed.
Assuming DDR3 tri-channel 1600MHz RAM, you'll get something like 38 GB/s memory bandwidth.
Take a typical midrange card like a GTX 460 and you'll get something like 84 GB/s memory bandwidth. Note that you'll have to make a hop across the bus, which is something like 8 GB/s theoretical, ~5.5 GB/s in practice, for a PCIe 2.0 x16 link.
Note that kotlinski's answer isn't quite correct: you can do the comparison in parallel and then do a parallel reduction, in which case the bigger GPU device bandwidth can eventually win out.
I think the answer is likely to be: a loss if you upload to the GPU and do the comparison once; a possible gain if the comparison is made multiple times (with the images kept and modified on the GPU, for example).
Edit:
The "multiple times" comparison refers to the case where you modify the images in GPU memory in situ. That would merit another comparison (caching doesn't cut it) while not incurring the penalty of another copy across the bus.
Since memory access is the bottleneck here, it is extremely likely that it is faster to just do it on the CPU. Making it run in parallel is not likely to gain you anything; memory access is essentially a serial operation.
The answer to this question is highly debatable and depends entirely on your system's configuration. This means that you'll have to do the benchmarks yourself. Factors that could influence your situation:
Speed of your RAM
Speed of the GPU Bus
Whether or not you have shared RAM between GPU & CPU
However, I do think that in the general case (e.g. with bus speeds on the order of GB/s) it's faster to upload the images to the GPU and do the difference comparison there.
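For a rough feel, here is a back-of-the-envelope sketch comparing the time just to push two bitmaps across the bus with the time to stream them once from RAM on the CPU. The bandwidth figures are the example numbers from this thread (~5.5 GB/s effective for PCIe 2.0 x16, ~38 GB/s RAM), and the 1080p image size is an assumption, not something from the question.

```c
#include <stdio.h>

int main(void) {
    /* Example figures from the answers above -- replace with your own. */
    const double image_bytes = 1920.0 * 1080.0 * 4.0;  /* one 1080p RGBA bitmap (assumed size) */
    const double pcie_Bps    = 5.5e9;   /* effective PCIe 2.0 x16     */
    const double ram_Bps     = 38e9;    /* DDR3 tri-channel 1600 MHz  */

    double upload_s   = 2.0 * image_bytes / pcie_Bps;  /* copy both images to the GPU    */
    double cpu_read_s = 2.0 * image_bytes / ram_Bps;   /* stream both images once on CPU */

    printf("Upload to GPU:      %.2f ms\n", upload_s * 1e3);
    printf("CPU streaming read: %.2f ms\n", cpu_read_s * 1e3);
    return 0;
}
/* For a one-shot comparison the PCIe hop alone already costs several times
   more than reading both images from RAM, which is why a single comparison
   tends to favour the CPU. */
```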

Resources