I am doing research about dedicated I/O software that would run on consumer hardware. Essentially it boils down to saving huge data streams for later processing. Right now I am looking for a model to estimate performance factors on x86.
Take for example the new Macbook Pro:
high-speed Thunderbolt I/O (input/output) technology delivers an amazing 10 gigabits per second of transfer speeds in both directions
1.25 GB/s sounds nice, but most processors of the day are clocked around 2 GHz. Multiple cores make little difference as long as only one can be assigned per network channel.
So even if the software acts as a miniature operating system and limits itself to network/disk operations, the amount of data flowing to storage can't be greater than P / (2 * N)[1] chunks per second. Although this hints at the rough performance limit, I feel it's far from adequate.
What other considerations should one take into account when estimating I/O performance with regard to processor frequency and other hardware specifics? For simplicity's sake, assume here that storage performs instantly under all circumstances.
[1] P - processor frequency; N - algorithm overhead
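A rough reading of that formula, treating N as CPU cycles of overhead per chunk (my assumption; the chunk size is also just an illustrative number):

```c
/* Back-of-the-envelope evaluation of the question's P / (2 * N) limit.
 * All numbers are illustrative assumptions, not measurements. */
#include <stdio.h>

int main(void)
{
    double p_hz        = 2.0e9;   /* P: processor frequency, 2 GHz              */
    double n_cycles    = 2000.0;  /* N: assumed CPU cycles of overhead per chunk */
    double chunk_bytes = 4096.0;  /* assumed chunk size                          */

    double chunks_per_sec = p_hz / (2.0 * n_cycles);
    double bytes_per_sec  = chunks_per_sec * chunk_bytes;

    printf("max chunks/s:   %.0f\n", chunks_per_sec);
    printf("max throughput: %.2f GB/s (Thunderbolt peak is ~1.25 GB/s)\n",
           bytes_per_sec / 1e9);
    return 0;
}
```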
The hardware limiting factors are probably the I/O bus performance, say PCIe, and more recently, the FSB clock-rates, since memory controllers are moving from northbridge to the CPUs themselves.
Then, of course, you have to figure out what sort of processing you need to do on the input, and how much work it is to produce the output. These costs, at least for conventional software running on a CPU, depend on the processor clock, but not only on that. Writing your code to take advantage of hardware facilities like caches, instruction-level parallelism, etc. is still a black art, but can give you an order-of-magnitude performance boost.
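As a tiny illustration of that "black art" (a sketch, not a benchmark; the array size is arbitrary): the two loops below do the same arithmetic, but the row-major traversal matches C's memory layout and therefore makes far better use of the caches.

```c
/* Same work, very different cache behaviour (illustrative sketch only). */
#include <stdio.h>
#include <stddef.h>

#define N 4096
static double a[N][N];           /* ~128 MB, zero-initialised */

/* Row-major traversal: walks memory sequentially, cache-friendly. */
static double sum_rows(void)
{
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major traversal: strides N * sizeof(double) bytes per access,
 * missing the cache on nearly every load for large N. */
static double sum_cols(void)
{
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void)
{
    printf("rows: %f  cols: %f\n", sum_rows(), sum_cols());
    return 0;
}
```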
Basically what I'm ranting about is that not all software is created equal, and you probably want to take that into account.
Likely, hard-disk controllers will determine hard-disk I/O performance, graphics cards will determine maximum resolution and refresh performance, and so on. I don't really understand the question; the CPU has been becoming less and less involved in these kinds of things for the last 10 years or so.
I doubt the question will even have a bearing on CPUs with integrated GPUs, since the buffer to be output to the screen is in external memory, sharing a bus with (again) a controller on the motherboard.
It's all buffered, so I can only see CPUs affecting file performance if you somehow force the hardware buffer size to something insanely puny. Edit: and I'm pretty sure Apple will prevent you from doing such things. ;)
For Thunderbolt specifically, it's more about what the minimum CPU model is that supports the kinds of bus speeds required by the Thunderbolt chipset version in the machine in question.
Thunderbolt is a raw data-traffic system and its performance specs are potential maximums, hence all the asterisks in the Apple specs. I believe it will indeed alleviate bottlenecks and in general give lag-free, intelligent data shuffling when doing many things simultaneously.
The CPU will idle-wait a shorter time for needed data, but the processing speed of the data is the same. When playing or creating a movie, codec processing time will be the same, but you will still feel a boost (or rather a lack of lag) because the data is there when the CPU needs it. For the I/O, the bottleneck will instead become the read/write speed of your hard disk, and the CPU bottleneck (for file copy operations, likely at least some code in Finder) will stay the same.
In other words, only CPU-intensive tasks such as movie encoding will benefit significantly from a faster CPU, while the benefit of Thunderbolt over a mix of interfaces will boost machines with both slow and fast CPUs.
Related
I know this may sound like a silly question considering the speeds at which computers work, but say a certain address in RAM is physically closer to the CPU on the motherboard than a memory address located as far from the CPU as possible: will this have an effect on how quickly the closer memory address is accessed compared to the farthest one?
If you're talking about NUMA, i.e. accessing RAM connected to this socket vs. going over the interconnect to access RAM connected to another socket, then yes, this is a well-known effect (example). Otherwise, no.
Also note that signal travel time over the external memory bus is only a tiny fraction of the total cache-miss latency cost for a CPU core. Queuing inside the CPU, time to check L3 cache, and the internal bus between cores and memory controllers all add up. Tightening DDR4 CAS latency by one whole memory cycle makes only a small (but measurable) difference to overall memory performance (see hardware review sites benchmarking memory overclocking); other timings matter even less.
No, DDR4 (and earlier) memory buses are synced to a clock and expect a response a specific number of memory-clock cycles¹ after a command (so the controller can pipeline requests without causing overlap). See What Every Programmer Should Know About Memory for more about DDR memory commands and memory timings (and CAS latency vs. other timings).
(Wikipedia's introduction to SDRAM mentions that earlier DRAM standards were asynchronous, so yes they maybe could just reply as soon as they had the data ready. If that happened to be a whole clock cycle early, a speedup was perhaps possible.)
So memory latency is discrete, not continuous, and being 1 mm closer can't make it fractions of a nanosecond faster. The only plausible effect is if you socket all the memory into DIMM slots in a way that enables you to run tighter timings and/or a faster memory clock than with some other arrangement. Go read about memory overclocking if you want real-world experience with people who try to push systems to the limits of stability. What's best may depend on the motherboard; physical length of traces isn't the only consideration.
AFAIK, all real-world motherboard firmwares insist on using the same timings for all DIMMs on all memory channels².
So even if one DIMM could theoretically support tighter timings than another (e.g. because of shorter or less noisy traces, or less signal reflection because it sits at the end of a trace instead of the middle), you couldn't actually configure a system to take advantage of that. Physical proximity isn't the only thing that could help.
(This is probably a good thing; interleaving physical address space across multiple DRAM channels allows sequential reads/writes to benefit from the aggregate bandwidth of all channels. But if they ran at different speeds, you might have more contention for shared busses between controllers and cores, and more time left unused.)
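A toy model of that channel interleaving, purely as an illustration (real memory controllers hash several address bits; which bit selects the channel here is an assumption, not how any particular chipset works):

```c
/* Toy model: 2-channel interleaving at 64-byte cache-line granularity.
 * Real controllers use more elaborate address hashes; illustration only. */
#include <stdio.h>
#include <stdint.h>

static unsigned channel_of(uint64_t phys_addr)
{
    return (phys_addr >> 6) & 1;   /* assume bit 6 picks the channel (line = 64 B) */
}

int main(void)
{
    /* Sequential cache lines alternate channels, so a streaming access
     * pattern gets the aggregate bandwidth of both. */
    for (uint64_t addr = 0; addr < 512; addr += 64)
        printf("line at 0x%03llx -> channel %u\n",
               (unsigned long long)addr, channel_of(addr));
    return 0;
}
```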
Memory frequency and timings are usually chosen by the firmware after reading the SPD ROM on each DIMM (memory module) to find out what memory is installed and what timings each DIMM is rated for at what frequencies.
Footnote 1: I'm not sure how transmission-line propagation delays over memory traces are accounted for when the memory controller and DIMM agree on how many cycles there should be after a read command before the DIMM starts putting data on the bus.
The CAS latency is a timing number that the memory controller programs into the "mode register" of each DIMM.
Presumably the number the DIMM sees is the actual number it uses, and the memory controller has to account for the round-trip propagation delay to know when to really expect a read burst to start arriving. Other command latencies are just times between sending different commands so propagation delay doesn't matter: the gap at the sending side equals the gap at the receiving side.
But the CAS latency seen by the memory controller includes the round-trip propagation delay for signals to go over the wires to the DIMM and back. Modern systems with DDR4-4000 have a clock that runs at 2 GHz, a cycle time of half a nanosecond (transferring data on both the rising and falling edge).
At light speed, 0.5ns is "only" about 15 cm, half of one of Grace Hopper's nanoseconds, and with transmission-line effects could be somewhat shorter (like maybe 2/3rd of that). On a big server motherboard it's certainly plausible that some DIMMs are far enough away from the CPU for traces to be that long.
The rated speeds on memory DIMMs are somewhat conservative, so they're still supposed to work at that speed even when they're as far from the memory controller as DDR4 standards allow. I don't know the details, but I assume JEDEC considers this when developing DDR SDRAM standards.
If there's a "data valid" pin the DIMM asserts at the start of the read burst, that would solve the problem, but I haven't seen a mention of that on Wikipedia.
Timings are those numbers like 9-9-9-24, with the first one being CAS latency, CL. https://www.hardwaresecrets.com/understanding-ram-timings/ was an early Google hit if you want to read more from a perf-tuning PoV. They're also described in Ulrich Drepper's "What Every Programmer Should Know About Memory" linked earlier, from a how-it-works PoV. Note that the higher the memory clock speed, the less real time (in nanoseconds) a given number of cycles is. So CAS latency and other timings have stayed nearly constant in nanoseconds, or have even dropped, as clock frequencies have increased. https://www.crucial.com/articles/about-memory/difference-between-speed-and-latency shows a table.
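To make the cycles-vs.-nanoseconds point concrete, here is the arithmetic for a few illustrative kits (the exact speed/CL combinations are assumptions, not recommendations):

```c
/* CAS latency in nanoseconds = CL cycles / memory clock.
 * For DDR the clock is half the transfer rate. Kits are illustrative. */
#include <stdio.h>

static double cas_ns(double transfers_per_sec, int cl_cycles)
{
    double clock_hz = transfers_per_sec / 2.0;  /* DDR: 2 transfers per clock */
    return cl_cycles / clock_hz * 1e9;
}

int main(void)
{
    printf("DDR3-1600 CL9 : %5.2f ns\n", cas_ns(1600e6, 9));   /* ~11.3 ns */
    printf("DDR4-3200 CL16: %5.2f ns\n", cas_ns(3200e6, 16));  /* ~10.0 ns */
    printf("DDR4-4000 CL18: %5.2f ns\n", cas_ns(4000e6, 18));  /* ~ 9.0 ns */
    return 0;
}
```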
Footnote 2: Unless we're talking about special faster memory for use as a scratchpad or cache for the larger main memory, but still off-chip. e.g. the 16GB of MCDRAM on Xeon Phi cards, separate from the 384 GB of regular DDR4. But faster memories are usually soldered down so timings are fixed, not socketed DIMMs. So I think it's fair to say that all DIMMs in a system will run with the same timings.
Other random notes:
https://www.overclock.net/threads/ram-4x-sr-or-2x-dr-for-ryzen-3000.1729606/ contained some discussion of motherboards with a "T-topology" vs. "daisy chain" layout for their DIMM sockets. This seems like pretty self-explanatory terminology: a T would be when the two DIMMs on a channel are on opposite sides of the CPU, about equidistant from the pins, vs. "daisy chain" when both DIMMs for the same channel are on the same side of the CPU, with one farther away than the other.
I'm not sure what the recommended practice is for using the closer or farther socket. Signal reflection could be more of a concern with the near socket because it's not the end of the trace.
If you have multiple DIMMs on the same memory channel, the DDR4 protocol may require them all to run at the same timings. (Such DIMMs see each other's commands, except that there's a "chip-select" pin the memory controller can drive independently for each DIMM to control which one a command is for.)
But in theory a CPU could be designed to run its different memory channels at different frequencies, or at least different timings at the same frequency if the memory controllers all share a clock. And of course in a multi-socket system, you'd expect no physical / electrical obstacle to programming different timings for the different sockets.
(I haven't played around in the BIOS on a multi-socket system for years, not since I was a cluster sysadmin in AMD K8 / K10 days.) So IDK; it's possible that some BIOS might have options to control different timings for different sockets, or simply allow different auto-detected settings if you use slower RAM in one socket than in the others. But given the price of servers and how few people run them as hobby machines, it's unlikely that vendors would bother to support or validate such a config.
After learning a little about how computer programs run, I had some thoughts concerning the CPU and RAM. After watching a few YouTube videos (Linus Tech Tips and others), they all seem to show that increasing RAM speed (frequency) does not really give much of a performance improvement in real-world applications and games on a general desktop computer. My first question is: why is this? Is it because of the high hit rates (95% and above) of the CPU's cache on most modern CPUs, which in turn means the CPU needs to reach out to RAM less and less often? Also, in which situations would a faster RAM frequency be beneficial?
NOTE: this is a very broad question, and the answers can vary widely depending on the architecture/OS of the system. I am answering from a best-judgement standpoint on how these things generally work.
Why is there not a larger performance difference between different RAM clock speeds?
I would imagine that the clock speed of the computer's RAM matters less than the clock speed of the CPU cache, because:
the CPU gets its instructions from the cache, not straight from RAM;
with the larger cache sizes of modern CPUs, it is less often necessary to go out to RAM at all;
when the cache does need to go out to RAM, it uses an asynchronous transfer mechanism (DMA) to grab more information, allowing the CPU to switch to a different process entirely.
Besides that, the clock speed of the motherboard's various pipelines (DMA) could create a chokepoint that slows the overall transfer rate of the information.
which situations would faster RAM frequency be beneficial?
I would say that overall, any one of the core pieces of hardware involved with memory and its use and transfer (the CPU, the CPU cache, the various memory pipelines, the various memory transfer devices (DMA, etc.), and the RAM itself) can create a chokepoint where faster RAM might or might not affect overall performance. It is really a case-by-case issue.
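One way to see why a high cache hit rate mutes the effect of RAM speed is the classic average-memory-access-time formula. The numbers below are assumptions picked for illustration, not measurements of any real CPU:

```c
/* AMAT = hit_time + miss_rate * miss_penalty.
 * All latencies are illustrative assumptions. */
#include <stdio.h>

int main(void)
{
    double hit_time_ns = 1.0;    /* assumed cache-hit latency          */
    double miss_rate   = 0.05;   /* 95% hit rate, as in the question   */
    double ram_slow_ns = 90.0;   /* assumed miss penalty, slower RAM   */
    double ram_fast_ns = 70.0;   /* assumed miss penalty, faster RAM   */

    double amat_slow = hit_time_ns + miss_rate * ram_slow_ns;   /* 5.5 ns */
    double amat_fast = hit_time_ns + miss_rate * ram_fast_ns;   /* 4.5 ns */

    printf("AMAT with slow RAM: %.2f ns\n", amat_slow);
    printf("AMAT with fast RAM: %.2f ns\n", amat_fast);
    printf("A ~22%% cut in RAM latency becomes only a ~%.0f%% cut in AMAT,\n"
           "and even less in overall runtime for compute-bound code.\n",
           100.0 * (amat_slow - amat_fast) / amat_slow);
    return 0;
}
```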
We've just bought a 32-core Opteron machine, and the speedups we get are a little disappointing: beyond about 24 threads we see no speedup at all (actually gets slower overall) and after about 6 threads it becomes significantly sub-linear.
Our application is very thread-friendly: our job breaks down into about 170,000 little tasks which can each be executed separately, each taking 5-10 seconds. They all read from the same memory-mapped file of about 4 GB. They make occasional writes to it, but it might be 10,000 reads to each write; we just write a little bit of data at the end of each of the 170,000 tasks. The writes are lock-protected. Profiling shows that the locks are not a problem. The threads use a lot of JVM memory each in non-shared objects, they make very little access to shared JVM objects, and of those accesses only a small percentage involve writes.
We're programming in Java, on Linux, with NUMA enabled. We have 128 GB of RAM. We have 2 Opteron CPUs (model 6274) with 16 cores each. Each CPU has 2 NUMA nodes. The same job running on an Intel quad-core (i.e. 8 cores) scaled nearly linearly up to 8 threads.
We've tried replicating the read-only data so that each thread has its own copy, in the hope that most lookups can be local to a NUMA node, but we observed no speedup from this.
With 32 threads, 'top' shows the CPUs at 74% "us" (user) and about 23% "id" (idle). But there are no sleeps and almost no disk I/O. With 24 threads we get 83% CPU usage. I'm not sure how to interpret the 'idle' state - does this mean 'waiting for the memory controller'?
We tried turning NUMA on and off (I'm referring to the Linux-level setting that requires a reboot) and saw no difference. When NUMA was enabled, 'numastat' showed only about 5% 'allocation and access misses' (95% of cache misses were local to the NUMA node). [Edit:] But adding "-XX:+UseNUMA" as a Java command-line flag gave us a 10% boost.
One theory we have is that we're maxing out the memory controllers, because our application uses a lot of RAM and we think there are a lot of cache misses.
What can we do to either (a) speed up our program to approach linear scalability, or (b) diagnose what's happening?
Also: (c) how do I interpret the 'top' result - does 'idle' mean 'blocked on memory controllers'? And (d) is there any difference in the characteristics of Opterons vs. Xeons?
I also have a 32-core Opteron machine, with 8 NUMA nodes (4x Opteron 6128 processors, Magny-Cours, not Bulldozer), and I have faced similar issues.
I think the answer to your problem is hinted at by the 2.3% "sys" time shown in top. In my experience, this sys time is the time the system spends in the kernel waiting for a lock. When a thread can't get a lock it then sits idle until it makes its next attempt. Both the sys and idle time are a direct result of lock contention. You say that your profiler is not showing locks to be the problem. My guess is that for some reason the code causing the lock in question is not included in the profile results.
In my case a significant cause of lock contention was not the processing I was actually doing but the work scheduler that was handing out the individual pieces of work to each thread. This code used locks to keep track of which thread was doing which piece of work. My solution to this problem was to rewrite my work scheduler avoiding mutexes, which I have read do not scale well beyond 8-12 cores, and instead use gcc builtin atomics (I program in C on Linux). Atomic operations are effectively a very fine grained lock that scales much better with high core counts. In your case if your work parcels really do take 5-10s each it seems unlikely this will be significant for you.
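A minimal sketch of that kind of lock-free hand-out, using the gcc builtin atomics mentioned above (the task count and do_task() body are placeholders, not the poster's actual code):

```c
/* Lock-free work distribution with a shared atomic cursor (gcc builtins).
 * Build: gcc -O2 -pthread handout.c
 * NTASKS and do_task() are placeholders for the real workload. */
#include <pthread.h>
#include <stdio.h>

#define NTASKS   170000
#define NTHREADS 32

static long next_task = 0;               /* shared cursor into the task list */

static void do_task(long id) { (void)id; /* real work would go here */ }

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        /* Each thread atomically claims the next unclaimed task index;
         * no mutex, no kernel involvement on the fast path. */
        long id = __sync_fetch_and_add(&next_task, 1);
        if (id >= NTASKS)
            break;
        do_task(id);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("all %d tasks done\n", NTASKS);
    return 0;
}
```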
I also had problems with malloc, which suffers horrible lock issues in high core count situations, but I can't, off the top of my head, remember whether this also led to sys & idle figures in top, or whether it just showed up using Mike Dunlavey's debugger profiling method (How can I profile C++ code running in Linux?). I suspect it did cause sys & idle problems, but I draw the line at digging through all my old notes to find out :) I do know that I now avoid runtime mallocs as much as possible.
My best guess is that some piece of library code you are using implements locks without your knowledge, is not included in your profiling results, and is not scaling well to high core-count situations. Beware memory allocators!
I'm sure the answer will lie in a consideration of the hardware architecture. You have to think of multi-core computers as if they were individual machines connected by a network. In fact, that's all that HyperTransport and QPI are.
I find that to solve these scalability problems you have to stop thinking in terms of shared memory and start adopting the philosophy of Communicating Sequential Processes (CSP). It means thinking very differently, i.e. imagining how you would write the software if your hardware were 32 single-core machines connected by a network. Modern (and ancient) CPU architectures are not designed to give unfettered scaling of the sort you're after. They are designed to allow many different processes to get on with processing their own data.
Like everything else in computing, these things go in fashions. CSP dates back to the 1970s, but the very modern, Java-derived Scala is a popular embodiment of the concept. See this section on Scala concurrency on Wikipedia.
What the philosophy of CSP does is force you to design a data distribution scheme that fits your data and the problem you're solving. That's not necessarily easy, but if you manage it then you have a solution that will scale very well indeed. Scala may make it easier to develop.
Personally I do everything in CSP and in C. It's allowed me to develop a signal processing application that scales perfectly linearly from 8 cores to several thousand cores (the limit being how big my room is).
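To make the idea concrete, here is a minimal CSP-flavoured sketch in C (my own illustration, not the signal-processing code described above): two threads that share no data structures and communicate only through a pipe standing in for "a network".

```c
/* Minimal CSP-style sketch: threads communicate by message passing only. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static int chan[2];                      /* chan[0] = read end, chan[1] = write end */

static void *producer(void *arg)
{
    (void)arg;
    for (int i = 0; i < 10; i++)
        if (write(chan[1], &i, sizeof i) != (ssize_t)sizeof i)  /* send a message */
            break;
    close(chan[1]);                      /* tell the consumer we're done */
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    int msg;
    while (read(chan[0], &msg, sizeof msg) == (ssize_t)sizeof msg)
        printf("got %d\n", msg);         /* never touches the producer's memory */
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    if (pipe(chan) != 0)
        return 1;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```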
The first thing you're going to have to do is actually use NUMA. It isn't a magic setting that you turn on, you have to exploit it in your software's architecture. I don't know about Java, but in C one would bind a memory allocation to a specific core's memory controller (aka memory affinity), and similarly for threads (core affinity) in cases where the OS doesn't get the hint.
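A rough sketch of what that binding looks like in C with libnuma and pthread affinity (node and CPU numbers are arbitrary examples, not a recommendation; check the real topology with numactl --hardware):

```c
/* Sketch of NUMA-aware placement: allocate memory on a specific node and
 * pin the thread that uses it to a CPU assumed to be on that node.
 * Build: gcc -O2 -pthread numa_bind.c -lnuma */
#define _GNU_SOURCE
#include <numa.h>        /* libnuma: numa_available, numa_alloc_onnode, numa_free */
#include <pthread.h>
#include <sched.h>       /* cpu_set_t, CPU_ZERO, CPU_SET                          */
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }

    /* Memory affinity: 64 MiB backed by node 0's memory controller. */
    size_t len = 64u << 20;
    char *buf = numa_alloc_onnode(len, 0);
    if (!buf)
        return 1;

    /* Core affinity: keep this thread on CPU 0 (assumed to be in node 0). */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);

    /* Touch the pages so they are actually faulted in on node 0. */
    for (size_t i = 0; i < len; i += 4096)
        buf[i] = 1;

    numa_free(buf, len);
    return 0;
}
```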
I presume that your data doesn't break down into 32 neat, discrete chunks? It's difficult to give advice without knowing exactly the data flows implicit in your program. But think about it in terms of data flow. Draw it out, even; Data Flow Diagrams are useful for this (another ancient graphical formal notation). If your picture shows all your data going through a single object (e.g. through a single memory buffer), then it's going to be slow...
I assume you have optimized your locks and kept synchronization to a minimum. In such a case, it still depends a lot on what libraries you are using to program in parallel.
One issue that can arise even if you have no synchronization problems is memory bus congestion. This is very nasty and difficult to get rid of.
All I can suggest is to somehow make your tasks bigger and create fewer of them. This depends highly on the nature of your problem. Ideally you want as many tasks as there are cores/threads, but this is not easy (or even always possible) to achieve.
Something else that can help is to give more heap to your JVM. This will reduce how often the garbage collector needs to run, and speed things up a little.
does 'idle' mean 'blocked on memory controllers'
No, you don't see that in 'top'. If the CPU is waiting for memory access, it is shown as busy. If you have idle periods, the CPU is either waiting for a lock or for I/O.
I'm the Original Poster. We think we've diagnosed the issue, and it's not locks, not system calls, not memory bus congestion; we think it's level 2/3 CPU cache contention.
To reiterate, our task is embarrassingly parallel, so it should scale well. However, a single thread has a large amount of CPU cache available to it, and as we add more threads, the amount of cache each thread can use gets lower and lower (the same cache divided among more threads). Some cache levels on some architectures are shared between cores on a die, and some are even shared between dies (I think). It may help to get "down in the weeds" with the specific machine you're using and optimise your algorithms accordingly, but our conclusion is that there's not a lot we can do to achieve the scalability we thought we'd get.
We identified this as the cause by using 2 different algorithms. The one which accesses more level 2/3 cache scales much worse than the one which does more processing with less data. They both make frequent accesses to the main data in main memory.
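The arithmetic behind that diagnosis is simple; the 16 MB figure below is an assumed shared-L3 size for illustration, not the actual Opteron 6274 topology:

```c
/* Effective share of a shared last-level cache per active thread.
 * The cache size is an assumption for illustration only. */
#include <stdio.h>

int main(void)
{
    double l3_bytes = 16.0 * 1024 * 1024;     /* assumed shared L3 per die */
    int threads[] = { 1, 8, 16, 32 };

    for (int i = 0; i < 4; i++)
        printf("%2d threads -> ~%.2f MB of L3 per thread\n",
               threads[i], l3_bytes / threads[i] / (1024.0 * 1024.0));
    return 0;
}
```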
If you haven't tried it yet: look at hardware-level profilers, like the one Oracle Studio has (for CentOS, Red Hat, and Oracle Linux), or, if you are stuck on Windows, Intel VTune. Then start looking for operations with suspiciously high clocks-per-instruction metrics. "Suspiciously high" means a lot higher than the same code on a single-NUMA-node, single-L3-cache machine (like current Intel desktop CPUs).
I know the question is only partially programming-related, because the answer I would like to get originally comes from these two questions:
Why are CPU cores number so low (vs GPU)? and Why aren't we using GPUs instead of CPUs, GPUs only or CPUs only? (I know that GPUs are specialized while CPUs are more for multi-tasking, etc.). I also know that there are memory limitations (host vs. GPU), along with differences in precision and cache capability. But in terms of hardware, comparing high-end to high-end, GPUs are much, much more performant.
So my question is: could we use GPUs instead of CPUs for the OS, applications, etc.?
The reason I am asking this question is that I would like to know why current computers still use two main processing units (CPU/GPU), with two main memory and caching systems (CPU/GPU), even though this is not something a programmer would like.
Current GPUs lack many of the facilities of a modern CPU that are generally considered important (crucial, really) to things like an OS.
Just for example, an OS normally uses virtual memory and paging to manage processes. Paging allows the OS to give each process its own address space, (almost) completely isolated from every other process. At least based on publicly available information, most GPUs don't support paging at all (or at least not in the way an OS needs).
GPUs also operate at much lower clock speeds than CPUs. Therefore, they only provide high performance for embarrassingly parallel problems. CPUs generally provide much higher performance for single-threaded code. Most of the code in an OS isn't highly parallel -- in fact, a lot of it is quite difficult to make parallel at all (e.g., for years, Linux had a giant lock to ensure only one thread executed most kernel code at any given time). For this kind of task, a GPU would be unlikely to provide any benefit.
From a programming viewpoint, a GPU is a mixed blessing (at best). People have spent years working on programming models to make programming a GPU even halfway sane, and even so it's much more difficult (in general) than CPU programming. Given the difficulty of getting even relatively trivial things to work well on a GPU, I can't imagine attempting to write anything even close to as large and complex as an operating system to run on one.
GPUs are designed for graphics related processing (obviously), which is inherently something that benefits from parallel processing (doing multiple tasks/calculations at once). This means that unlike modern CPUs, which as you probably know usually have 2-8 cores, GPUs have hundreds of cores. This means that they are uniquely suited to processing things like ray tracing or anything else that you might encounter in a 3D game or other graphics intensive activity.
CPUs on the other hand have a relatively limited number of cores because the tasks that a CPU faces usually do not benefit from parallel processing nearly as much as rendering a 3D scene would. In fact, having too many cores in a CPU could actually degrade the performance of a machine, because of the nature of the tasks a CPU usually does and the fact that a lot of programs would not be written to take advantage of the multitude of cores. This means that for internet browsing or most other desktop tasks, a CPU with a few powerful cores would be better suited for the job than a GPU with many, many smaller cores.
Another thing to note is that more cores usually means more power needed. This means that a 256-core phone or laptop would be pretty impractical from a power and heat standpoint, not to mention the manufacturing challenges and costs.
Usually operating systems are pretty simple, if you look at their structure. But parallelizing them will not improve speeds much; only raw clock speed will.
GPUs simply lack hardware features and a lot of instructions in their instruction sets that an OS needs; it's a matter of sophistication. Just think of the virtualization features (Intel VT-x or AMD's AMD-V).
GPU cores are like dumb ants, whereas a CPU is like a complex human, so to speak. Both have different energy consumption because of this and produce very different amounts of heat.
See this extensive Super User answer for more info.
Because nobody will spend money and time on this. Except for some enthusiasts, like this one: http://gerigeri.uw.hu/DawnOS/history.html (now here: http://users.atw.hu/gerigeri/DawnOS/history.html)
Dawn now works on GPU-s: with a new OpenCL capable emulator, Dawn now boots and works on Graphics Cards, GPU-s and IGP-s (with OpenCL 1.0). Dawn is the first and only operating system to boot and work fully on a graphics chip.
I've now saved a bit of money for a hardware upgrade. What I'd like to know is: what is the easiest way to measure which piece of hardware is the bottleneck for compiling and should be upgraded?
Are there any clever techniques I could use? I've looked into perfmon, but it has too many counters and isn't very helpful without exact knowledge of what to look at.
Conditions: Home development, Windows XP Pro, Visual Studio 2008
Thanks!
The question is really "what is maxed out during compilation?"
If you don't want to use perfmon, you can use something like the Task Manager.
Run a compile.
See what's maxed out.
Did you go to 100% CPU for the whole time? Get more CPU -- faster or more cores or something.
Did you go to 100% memory for the whole time? Which number matters on the display? The only memory you can buy is "physical" memory. The only factor that matters is physical memory. The other things you see on the meter are not things you buy, they're adjustments to make to the way Windows works.
Did you go to "huge" amounts of I/O? You can't easily tell what's "huge", but you can conclude this. If you're not using memory and not using CPU, then you're using the only resource that's left -- you're I/O bound and you need a faster bus -- which usually means a whole new machine.
A faster HDD is of little or no value -- the bus clock speed is one limiting factor. The bus width is the other limiting factor. No one designs an ass-kicking I/O bus and then saddles it with junk HDD's. Usually, they design the bus that fits a specific cost target based on available HDD's.
A faster HDD is of little or no value -- the bus clock speed is one limiting factor. The bus width is the other limiting factor. No one designs an ass-kicking I/O bus and then saddles it with junk HDD's. Usually, they design the bus that fits a specific cost target based on available HDD's.
Garbage. Modern HDDs are slow compared to the I/O buses they are connected to. Name a single HDD that can max out a SATA 2 interface (and that is even a generation old now) for random IOPS... A hard drive is lucky to hit 10MB/s when the bus is capable of around 280MB/s.
E.g. http://www.anandtech.com/show/2948/3. Even there the SSDs are only hitting 50 MB/s. It's clear the interface's IOPS limit is NOT the bottleneck; otherwise the HDD would do just as much as the SSDs.
I've never seen a computer bound by the bus's IOPS rather than by the HDD. It doesn't happen.
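The arithmetic behind that claim, with assumed but typical numbers for a spinning disk doing random I/O (both the IOPS figure and the request size are assumptions):

```c
/* Why random I/O on a spinning disk can't saturate a SATA link.
 * ~100 IOPS and 64 KB requests are assumed, ballpark figures for an HDD. */
#include <stdio.h>

int main(void)
{
    double iops      = 100.0;          /* assumed random IOPS for an HDD       */
    double req_bytes = 64.0 * 1024;    /* assumed request size                 */
    double bus_bps   = 280e6;          /* ~280 MB/s usable on SATA 2, as above */

    double hdd_bps = iops * req_bytes; /* ~6.5 MB/s of random throughput       */
    printf("random HDD throughput: %.1f MB/s\n", hdd_bps / 1e6);
    printf("bus capacity:          %.0f MB/s (~%.1f%% utilised)\n",
           bus_bps / 1e6, 100.0 * hdd_bps / bus_bps);
    return 0;
}
```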
Using the Task Manager has already been suggested, but the Sysinternals equivalent (Process Explorer) gives you more information than the built-in Windows Task Manager.
You might also want to see what other things running on your PC are using up memory and/or CPU processing power. It may be possible to remove, or run only on demand, things which are affecting performance.
Windows XP will only support 3 GB of memory using a switch that you have to turn on, and I seem to remember that applications need to be written to actually take this into consideration.