I'm wondering what kind of performance hit numerical calculations take in a virtualized setting. More specifically, what kind of performance loss can I expect from running CPU-bound C++ code in a virtualized Windows OS, as opposed to a native Linux one, on rather fast x86_64 multi-core machines?
I'll be happy to add details as needed, but as I don't know much about virtualization, I don't know what information is relevant.
Processes are just bunches of threads, which are streams of instructions executing in a sequential fashion. In modern virtualisation solutions, as far as the CPU is concerned, the host and the guest processes execute together and differ only in that the I/O of the latter is trapped and virtualised. Memory is also virtualised, but that happens more or less in the hardware MMU. Guest instructions are executed directly by the CPU, otherwise it would not be virtualisation but emulation, and as long as they do not touch any virtualised resources they execute just as fast as host instructions. In the end it all depends on how well the CPU copes with the increased number of running processes.
There are lightweight virtualisation solutions like zones in Solaris that partition the process space in order to give the appearance of multiple copies of the OS but it all happens under the umbrella of a single OS kernel.
The performance hit for pure computational codes is very small, often under 1-2%. The catch is that in reality all programs read and write data and computational codes usually read and write lots of data. Virtualised I/O is usually much slower than direct I/O even with solutions like Intel VT-* or AMD-V.
Exact numbers depend heavily on the specific hardware.
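If you want a rough number for your own setup, one low-effort check is to time an identical CPU-bound kernel natively and inside the guest. A minimal sketch (the workload and iteration count are arbitrary placeholders, not a standard benchmark):

```cpp
// cpu_bench.cpp -- build with: g++ -O2 -std=c++11 -o cpu_bench cpu_bench.cpp
// Run the same binary natively and inside the VM, then compare the elapsed times.
#include <chrono>
#include <cstdio>

int main() {
    const long iters = 200000000L;   // arbitrary; pick something that runs for a few seconds
    double acc = 0.0;

    auto t0 = std::chrono::steady_clock::now();
    for (long i = 1; i <= iters; ++i) {
        acc += 1.0 / (static_cast<double>(i) * i);   // pure FP work, no I/O, no syscalls
    }
    auto t1 = std::chrono::steady_clock::now();

    std::chrono::duration<double> dt = t1 - t0;
    // Printing acc stops the compiler from optimising the loop away.
    std::printf("sum = %.12f, elapsed = %.3f s\n", acc, dt.count());
    return 0;
}
```

The ratio of guest to host wall-time on this kind of loop gives you the compute-only overhead; I/O-heavy phases of a real code have to be measured separately.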
Goaded by @Mitch Wheat's unarguable assertion that my original post here was not an answer, here's an attempt to recast it as an answer:
I work mostly on HPC in the energy sector. Some of the computations that my scientist colleagues run take O(10^5) CPU-hours, and we're seriously thinking about O(10^6) CPU-hour jobs in the near future.
I get well paid to squeeze every last drop of performance out of our codes; I'd think it was a good day's work if I could knock 1% off the run-time of some of our programs. Sometimes it has taken me a month to get that sort of improvement; sure, I may be slow, but it's still cost-effective for our scientists.
I shudder, therefore, when bright salespeople come offering the latest and best in data center software (of which virtualization is one aspect), which will only, as I see it, shackle my codes to a pile of anchor chain from a 250,000 dwt tanker (that was a metaphor).
I have read the question carefully and understand that the OP is not proposing that virtualization would help; I'm offering the perspective of a practitioner. If this is still too much of a comment, do the SO thing and vote to close; I promise I won't be offended!
Related
We've just bought a 32-core Opteron machine, and the speedups we get are a little disappointing: beyond about 24 threads we see no speedup at all (actually gets slower overall) and after about 6 threads it becomes significantly sub-linear.
Our application is very thread-friendly: our job breaks down into about 170,000 little tasks which can each be executed separately, each taking 5-10 seconds. They all read from the same memory-mapped file of about 4 GB. They make occasional writes to it, but it might be 10,000 reads to each write - we just write a little bit of data at the end of each of the 170,000 tasks. The writes are lock-protected. Profiling shows that the locks are not a problem. The threads each use a lot of JVM memory in non-shared objects, they make very little access to shared JVM objects, and only a small percentage of those accesses involve writes.
We're programming in Java, on Linux, with NUMA enabled. We have 128 GB of RAM. We have two Opteron CPUs (model 6274) of 16 cores each. Each CPU has 2 NUMA nodes. The same job running on an Intel quad-core (i.e. 8 cores) scaled nearly linearly up to 8 threads.
We've tried replicating the read-only data to have one-per-thread, in the hope that most lookups can be local to a NUMA node, but we observed no speedup from this.
With 32 threads, 'top' shows the CPUs at 74% "us" (user) and about 23% "id" (idle). But there are no sleeps and almost no disk I/O. With 24 threads we get 83% CPU usage. I'm not sure how to interpret the 'idle' state - does it mean 'waiting for memory controller'?
We tried turning NUMA on and off (I'm referring to the Linux-level setting that requires a reboot) and saw no difference. When NUMA was enabled, 'numastat' showed only about 5% 'allocation and access misses' (95% of cache misses were local to the NUMA node). [Edit:] But adding "-XX:+UseNUMA" as a Java command-line flag gave us a 10% boost.
One theory we have is that we're maxing out the memory controllers, because our application uses a lot of RAM and we think there are a lot of cache misses.
What can we do to either (a) speed up our program to approach linear scalability, or (b) diagnose what's happening?
Also: (c) how do I interpret the 'top' result - does 'idle' mean 'blocked on the memory controllers'? and (d) is there any difference in the characteristics of Opterons vs Xeons?
I also have a 32-core Opteron machine, with 8 NUMA nodes (4x 6128 processors, Magny-Cours, not Bulldozer), and I have faced similar issues.
I think the answer to your problem is hinted at by the 2.3% "sys" time shown in top. In my experience, this sys time is the time the system spends in the kernel waiting for a lock. When a thread can't get a lock it then sits idle until it makes its next attempt. Both the sys and idle time are a direct result of lock contention. You say that your profiler is not showing locks to be the problem. My guess is that for some reason the code causing the lock in question is not included in the profile results.
In my case a significant cause of lock contention was not the processing I was actually doing but the work scheduler that was handing out the individual pieces of work to each thread. This code used locks to keep track of which thread was doing which piece of work. My solution was to rewrite my work scheduler to avoid mutexes, which I have read do not scale well beyond 8-12 cores, and instead use gcc builtin atomics (I program in C on Linux). Atomic operations are effectively a very fine-grained lock that scales much better with high core counts. In your case, if your work parcels really do take 5-10 s each, it seems unlikely this will be significant for you.
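To make the idea concrete, here is a minimal sketch of that style of scheduler. The answer above describes C with gcc builtin atomics; this sketch uses C++ std::atomic instead, and do_task and the task count are placeholders, not the poster's actual code:

```cpp
// atomic_scheduler.cpp -- build with: g++ -O2 -std=c++11 -pthread atomic_scheduler.cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Stand-in for one of the many independent tasks (the real work goes here).
static double do_task(std::size_t index) {
    double x = 0.0;
    for (int k = 1; k <= 1000; ++k) x += 1.0 / (index + k);
    return x;
}

int main() {
    const std::size_t task_count = 170000;                       // placeholder
    unsigned num_threads = std::thread::hardware_concurrency();
    if (num_threads == 0) num_threads = 4;                       // fallback if unknown

    std::atomic<std::size_t> next{0};   // the scheduler's only shared state

    auto worker = [&] {
        for (;;) {
            // One atomic increment per task instead of a mutex acquire/release;
            // contention is limited to the cache line holding 'next'.
            std::size_t i = next.fetch_add(1, std::memory_order_relaxed);
            if (i >= task_count) break;
            volatile double sink = do_task(i);   // keep the work from being optimised out
            (void)sink;
        }
    };

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < num_threads; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();

    std::printf("handed out %zu tasks on %u threads\n", task_count, num_threads);
    return 0;
}
```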
I also had problems with malloc, which suffers horrible lock issues in high core count situations, but I can't, off the top of my head, remember whether this also led to sys & idle figures in top, or whether it just showed up using Mike Dunlavey's debugger profiling method (How can I profile C++ code running in Linux?). I suspect it did cause sys & idle problems, but I draw the line at digging through all my old notes to find out :) I do know that I now avoid runtime mallocs as much as possible.
My best guess is that some piece of library code you are using implements locks without your knowledge, is not included in your profiling results, and is not scaling well to high core-count situations. Beware memory allocators!
I'm sure the answer will lie in a consideration of the hardware architecture. You have to think of multi-core computers as if they were individual machines connected by a network. In fact, that's all that HyperTransport and QPI are.
I find that to solve these scalability problems you have to stop thinking in terms of shared memory and start adopting the philosophy of Communicating Sequential Processes. It means thinking very differently, i.e. imagining how you would write the software if your hardware were 32 single-core machines connected by a network. Modern (and ancient) CPU architectures are not designed to give unfettered scaling of the sort you're after. They are designed to allow many different processes to get on with processing their own data.
Like everything else in computing these things go in fashions. CSP dates back to the 1970s, but the very modern, Java-derived Scala is a popular embodiment of the concept. See the section on Scala concurrency on Wikipedia.
What the philosophy of CSP does is force you to design a data distribution scheme that fits your data and the problem you're solving. That's not necessarily easy, but if you manage it then you have a solution that will scale very well indeed. Scala may make it easier to develop.
Personally I do everything in CSP and in C. It's allowed me to develop a signal processing application that scales perfectly linearly from 8 cores to several thousand cores (the limit being how big my room is).
The first thing you're going to have to do is actually use NUMA. It isn't a magic setting that you turn on, you have to exploit it in your software's architecture. I don't know about Java, but in C one would bind a memory allocation to a specific core's memory controller (aka memory affinity), and similarly for threads (core affinity) in cases where the OS doesn't get the hint.
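For illustration, here is a rough Linux sketch (in C++, using libnuma and pthread affinity, so link with -lnuma -pthread) of what binding memory and a thread to a node can look like. The node and CPU numbers are placeholders; the real mapping depends on your machine's topology:

```cpp
// numa_bind.cpp -- rough sketch only; build with: g++ -O2 numa_bind.cpp -lnuma -pthread
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // for CPU_SET / pthread_setaffinity_np
#endif
#include <numa.h>              // libnuma: numa_available, numa_alloc_onnode, numa_free
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <cstring>

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }

    const int node = 0;                    // placeholder: the node this worker should live on
    const int cpu  = 0;                    // placeholder: a CPU belonging to that node
    const size_t bytes = 64 * 1024 * 1024;

    // 1. Pin the current thread to a core on the chosen node, so its cache and
    //    memory-controller traffic stay local.
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    // 2. Allocate the data this thread works on from that node's memory.
    double* data = static_cast<double*>(numa_alloc_onnode(bytes, node));
    if (!data) { std::fprintf(stderr, "allocation failed\n"); return 1; }
    std::memset(data, 0, bytes);           // touching it here keeps the pages on the local node

    // ... do this thread's work on 'data' ...

    numa_free(data, bytes);
    return 0;
}
```

In the Java world you mostly rely on the JVM (e.g. the -XX:+UseNUMA flag mentioned above) plus keeping each thread's data thread-local, rather than doing this by hand.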
I presume that your data doesn't break down into 32 neat, discrete chunks? It's difficult to give advice without knowing exactly the data flows implicit in your program. But think about it in terms of data flow. Draw it out even; Data Flow Diagrams are useful for this (another ancient graphical formal notation). If your picture shows all your data going through a single object (eg through a single memory buffer) then it's going to be slow...
I assume you have optimized your locks and kept synchronization to a minimum. In such a case, it still depends a lot on what libraries you are using to program in parallel.
One issue that can happen even if you have no synchronization issue, is memory bus congestion. This is very nasty and difficult to get rid of.
All I can suggest is to somehow make your tasks bigger and create fewer of them. This depends highly on the nature of your problem. Ideally you want about as many tasks as cores/threads, but this is not always easy (or even possible) to achieve.
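As a rough sketch of the "fewer, bigger tasks" idea (in C++; small_task and the counts are placeholders): instead of scheduling every small task individually, each thread takes one contiguous chunk, so the per-task overhead is paid once per chunk rather than once per task.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Placeholder for one small unit of work.
static void small_task(std::size_t /*index*/) { /* real work goes here */ }

// Run 'task_count' small tasks as 'num_threads' big chunks, one chunk per thread,
// so scheduling overhead and shared-state traffic are paid per chunk, not per task.
static void run_chunked(std::size_t task_count, unsigned num_threads) {
    const std::size_t chunk = (task_count + num_threads - 1) / num_threads;
    std::vector<std::thread> pool;

    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = t * chunk;
        const std::size_t end   = std::min(task_count, begin + chunk);
        pool.emplace_back([begin, end] {
            for (std::size_t i = begin; i < end; ++i) small_task(i);
        });
    }
    for (auto& th : pool) th.join();
}

int main() {
    unsigned n = std::thread::hardware_concurrency();
    run_chunked(170000, n ? n : 4);
    return 0;
}
```

If the pieces of work are very uneven in size, a middle ground is to hand out medium-sized chunks through an atomic counter, as in the scheduler sketch earlier in this thread.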
Something else that can help is to give more heap to your JVM. This reduces how often the garbage collector needs to run, and speeds things up a little.
does 'idle' mean 'blocked on memory controllers'
No, you don't see that in top. I mean that if the CPU is waiting for memory access, it is shown as busy. If you have idle periods, the CPU is either waiting for a lock or for I/O.
I'm the Original Poster. We think we've diagnosed the issue, and it's not locks, not system calls, not memory bus congestion; we think it's level 2/3 CPU cache contention.
To reiterate, our task is embarrassingly parallel, so it should scale well. However, a single thread has a large amount of CPU cache to itself, and as we add more threads, the amount of cache each thread effectively gets shrinks (the same amount of cache divided among more threads). Some cache levels on some architectures are shared between cores on a die, and some are even shared between dies (I think); it may help to get "down in the weeds" with the specific machine you're using and optimise your algorithms, but our conclusion is that there's not a lot we can do to achieve the scalability we thought we'd get.
We identified this as the cause by using 2 different algorithms. The one which accesses more level 2/3 cache scales much worse than the one which does more processing with less data. They both make frequent accesses to the main data in main memory.
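For anyone who wants to reproduce this kind of diagnosis, here is a crude C++ sketch of the experiment (the buffer sizes are placeholders and need tuning to your cache hierarchy): run the same reduction with a per-thread working set that fits in a private cache level versus one much larger than the shared L3, keeping the arithmetic per thread equal, and compare how the two scale with thread count.

```cpp
// cache_scaling.cpp -- build with: g++ -O2 -std=c++11 -pthread cache_scaling.cpp
#include <chrono>
#include <cstdio>
#include <initializer_list>
#include <thread>
#include <vector>

// Sum a per-thread buffer of 'doubles_per_thread' doubles, 'passes' times, on 'threads' threads.
static double run(unsigned threads, std::size_t doubles_per_thread, int passes) {
    std::vector<std::vector<double>> bufs(threads,
        std::vector<double>(doubles_per_thread, 1.0));
    std::vector<double> results(threads, 0.0);

    auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            double s = 0.0;
            for (int p = 0; p < passes; ++p)
                for (double x : bufs[t]) s += x;
            results[t] = s;              // one write per thread, just to keep 's' alive
        });
    }
    for (auto& th : pool) th.join();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    const std::size_t small_set = 64 * 1024;        // 512 KB per thread: fits a private cache
    const std::size_t large_set = 8 * 1024 * 1024;  // 64 MB per thread: spills any shared L3
    // Equal arithmetic per thread in both cases; only the working-set size differs.
    for (unsigned n : {1u, 2u, 4u, 8u, 16u, 32u}) {
        std::printf("%2u threads: small-set %.2f s, large-set %.2f s\n",
                    n, run(n, small_set, 4000), run(n, large_set, 32));
    }
    return 0;
}
```

If the small-set times stay roughly flat as threads are added while the large-set times climb, cache and memory-bandwidth contention, rather than locking, is the limiting factor.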
If you haven't tried it yet: look at hardware-level profilers like the one Oracle Studio provides (for CentOS, Red Hat, and Oracle Linux) or, if you are stuck with Windows, Intel VTune. Then start looking at operations with suspiciously high clocks-per-instruction metrics. Suspiciously high means a lot higher than the same code on a single-NUMA-node, single-L3-cache machine (like current Intel desktop CPUs).
I know the question is only partially programming-related, because the answer I would like to get originally comes from these two questions:
Why are CPU cores number so low (vs GPU)? and Why aren't we using GPUs instead of CPUs, GPUs only or CPUs only? (I know that GPUs are specialized while CPUs are more for multi-tasking, etc.) I also know that there are memory limitations (host vs GPU), along with precision and cache capabilities. But in terms of a high-end to high-end hardware comparison, GPUs are much, much more performant.
So my question is: could we use GPUs instead of CPUs for the OS, applications, etc.?
The reason I am asking this question is that I would like to know why current computers still use two main processing units (CPU and GPU), with two separate memory and caching systems, even if that is not something a programmer would like.
Current GPUs lack many of the facilities of a modern CPU that are generally considered important (crucial, really) to things like an OS.
Just for example, an OS normally uses virtual memory and paging to manage processes. Paging allows the OS to give each process its own address space, (almost) completely isolated from every other process. At least based on publicly available information, most GPUs don't support paging at all (or at least not in the way an OS needs).
GPUs also operate at much lower clock speeds than CPUs. Therefore, they only provide high performance for embarrassingly parallel problems; CPUs generally provide much higher performance for single-threaded code. Most of the code in an OS isn't highly parallel -- in fact, a lot of it is quite difficult to make parallel at all (e.g., for years, Linux had a giant lock to ensure only one thread executed most kernel code at any given time). For this kind of task, a GPU would be unlikely to provide any benefit.
From a programming viewpoint, a GPU is a mixed blessing (at best). People have spent years working on programming models to make programming a GPU even halfway sane, and even so it's much more difficult (in general) than CPU programming. Given the difficulty of getting even relatively trivial things to work well on a GPU, I can't imagine attempting to write anything even close to as large and complex as an operating system to run on one.
GPUs are designed for graphics related processing (obviously), which is inherently something that benefits from parallel processing (doing multiple tasks/calculations at once). This means that unlike modern CPUs, which as you probably know usually have 2-8 cores, GPUs have hundreds of cores. This means that they are uniquely suited to processing things like ray tracing or anything else that you might encounter in a 3D game or other graphics intensive activity.
CPUs on the other hand have a relatively limited number of cores because the tasks that a CPU faces usually do not benefit from parallel processing nearly as much as rendering a 3D scene would. In fact, having too many cores in a CPU could actually degrade the performance of a machine, because of the nature of the tasks a CPU usually does and the fact that a lot of programs would not be written to take advantage of the multitude of cores. This means that for internet browsing or most other desktop tasks, a CPU with a few powerful cores would be better suited for the job than a GPU with many, many smaller cores.
Another thing to note is that more cores usually means more power needed. This means that a 256-core phone or laptop would be pretty impractical from a power and heat standpoint, not to mention the manufacturing challenges and costs.
Usually operating systems are pretty simple, if you look at their structure.
But parallelizing them will not improve speeds much; only raw clock speed will.
GPUs simply lack features and a lot of the instructions that an OS needs from an instruction set; it's a matter of sophistication. Just think of virtualization features (Intel VT-x or AMD's AMD-V).
GPU cores are like dumb ants, whereas a CPU is like a complex human, so to speak. The two have different energy consumption because of this and produce very different amounts of heat.
See this extensive Super User answer for more info.
Because nobody will spend money and time on this. Except for some enthusiasts, like this one: http://gerigeri.uw.hu/DawnOS/history.html (now here: http://users.atw.hu/gerigeri/DawnOS/history.html)
Dawn now works on GPU-s: with a new OpenCL capable emulator, Dawn now boots and works on Graphics Cards, GPU-s and IGP-s (with OpenCL 1.0). Dawn is the first and only operating system to boot and work fully on a graphics chip.
Will the current trend of adding cores to computers continue? Or is there some theoretical or practical limit to the number of cores that can be served by one set of memory?
Put another way: is the high powered desktop computer of the future apt to have 1024 cores using one set of memory, or is it apt to have 32 sets of memory, each accessed by 32 cores?
Or still another way: I have a multi-threaded program that runs well on a 4-core machine, using a significant amount of the total CPU. As this program grows in size and does more work, can I be reasonably confident more powerful machines will be available to run it? Or should I be thinking seriously about running multiple sessions on multiple machines (or at any rate multiple sets of memory) to get the work done?
In other words, is a purely multithreaded approach to design going to leave me in a dead end? (As using a single threaded approach and depending on continued improvements in CPU speed years back would have done?) The program is unlikely to be run on a machine costing more than, say $3,000. If that machine cannot do the work, the work won't get done. But if that $3,000 machine is actually a network of 32 independent computers (though they may share the same cooling fan) and I've continued my massively multithreaded approach, the machine will be able to do the work, but the program won't, and I'm going to be in an awkward spot.
Distributed processing looks like a bigger pain than multithreading was, but if that might be in my future, I'd like some warning.
Will the current trend of adding cores to computers continue?
Yes, the GHz race is over. It's not practical to ramp the speed any more on the current technology. Physics has gotten in the way. There may be a dramatic shift in the technology of fabricating chips that allows us to get round this, but it's not obviously 'just around the corner'.
If we can't have faster cores, the only way to get more power is to have more cores.
Or is there some theoretical or practical limit to the number of cores that can be served by one set of memory?
Absolutely there's a limit. In a shared memory system the memory is a shared resource and has a limited amount of bandwidth.
Max processes = (Memory Bandwidth) / (Bandwidth required per process)
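To put made-up but plausible numbers on that: if the memory subsystem delivers roughly 50 GB/s and each process streams about 5 GB/s of data it doesn't find in cache, then around ten processes saturate the memory bus, and adding cores beyond that buys you nothing.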
Now - that 'Bandwidth per process' figure will be reduced by caches, but caches become less efficient if they have to be coherent with one another because everyone is accessing the same area of memory. (You can't cache a memory write if another CPU may need what you've written)
When you start talking about huge systems, shared resources like this become the main problem. It might be memory bandwidth, CPU cycles, hard drive access, network bandwidth. It comes down to how the system as a whole is structured.
You seem to be really asking for a vision of the future so you can prepare. Here's my take.
I think we're going to see a change in the way software developers see parallelism in their programs. At the moment, I would say that a lot of software developers see the only way of using multiple threads as having lots of them doing the same thing. The trouble is they're all contending for the same resources. This then means lots of locking needs to be introduced, which causes performance issues and subtle bugs which are infuriating and time-consuming to solve.
This isn't sustainable.
Manufacturing worked out at the beginning of the 20th century that the fastest way to build lots of cars wasn't to have lots of people working on one car and then, when that one's done, move them all on to the next car. It was to split the process of building the car into lots of small jobs, with the output of one job feeding the next. They called it assembly lines. In hardware design it's called pipelining, and I think we'll see software designs move to it more and more, as it minimizes the problem of shared resources.
Sure - There's still a shared resource on the output of one stage and the input of the next, but this is only between two threads/processes and is much easier to handle. Standard methods can also be adopted on how these interfaces are made, and message queueing libraries seem to be making big strides here.
There's not one solution for all problems though. This type of pipe-line works great for high throughput applications that can absorb some latency. If you can't live with the latency, you have no option but to go the 'many workers on a single task' route. Those are the ones you ideally want to be throwing at SIMD machines/Array processors like GPUs, but it only really excels with a certain type of problem. Those problems are ones where there's lots of data to process in the same way, and there's very little or no dependency between data items.
Having a good grasp of message queuing techniques and similar for pipelined systems, and utilising fine grained parallelism on GPUs through libraries such as OpenCL, will give you insight at both ends of the spectrum.
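As a minimal illustration of the pipelined style (this is a generic C++17 sketch with a hand-rolled channel, not any particular message-queueing library; the stage bodies are placeholders): two stages connected by a small thread-safe queue, so the only shared resource is the hand-off point between consecutive stages.

```cpp
// pipeline.cpp -- build with: g++ -O2 -std=c++17 -pthread pipeline.cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

// A tiny unbounded channel between two pipeline stages.
template <typename T>
class Channel {
public:
    void send(T value) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(value)); }
        cv_.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
        cv_.notify_all();
    }
    std::optional<T> receive() {         // empty optional == channel closed and drained
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;
        T v = std::move(q_.front());
        q_.pop();
        return v;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
    bool closed_ = false;
};

int main() {
    Channel<int> link;   // the only resource shared by the two stages

    // Stage 1: produce work items (stands in for "read / decode / generate").
    std::thread stage1([&] {
        for (int i = 0; i < 100; ++i) link.send(i);
        link.close();
    });

    // Stage 2: consume and process (stands in for "transform / write").
    std::thread stage2([&] {
        long total = 0;
        while (auto item = link.receive()) total += *item;
        std::printf("stage 2 processed total = %ld\n", total);
    });

    stage1.join();
    stage2.join();
    return 0;
}
```

Each extra stage adds one more channel, but never a lock that every thread in the program fights over; a real system would use a bounded queue or an existing message-queueing library rather than this toy.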
Update: Multi-threaded code may run on clustered machines, so this issue may not be as critical as I thought.
I was carefully checking out the Java Memory Model in the JLS, chapter 17, and found it does not mirror the typical register-cache-main memory model of most computers. There were opportunities there for a multi-memory machine to cleanly shift data from one memory to another (and from one thread running on one machine to another running on a different one). So I started searching for JVMs that would run across multiple machines. I found several old references--the idea has been out there, but not followed through. However, one company, Terracotta, seems to have something, if I'm reading their PR right.
At any rate, it rather seems that when PCs typically contain several clustered machines, there's likely to be a multi-machine JVM for them.
I could find nothing outside the Java world, but Microsoft's CLR ought to provide the same opportunities. C and C++ and all the other .exe languages might be more difficult. However, Terracotta's websites talk more about linking JVMs rather than running one JVM on multiple machines, so their tricks might work for executable languages also (and maybe the CLR, if needed).
I have a large scientific computing task that parallelizes very well with SMP, but at too fine grained a level to be easily parallelized via explicit message passing. I'd like to parallelize it across address spaces and physical machines. Is it feasible to create a scheduler that would parallelize already multithreaded code across multiple physical computers under the following conditions:
The code is already multithreaded and can scale pretty well on SMP configurations.
The fact that not all of the threads are running in the same address space or on the same physical machine must be transparent to the program, even if this comes at a significant performance penalty in some use cases.
You may assume that all of the physical machines involved are running operating systems and CPU architectures that are binary compatible.
Things like locks and atomic operations may be slow (having network latency to deal with and all) but must "just work".
Edits:
I only care about throughput, not latency.
I'm using the D programming language, and I'm almost sure there's no canned solution. I'm more interested in whether this is feasible in principle than in a particular canned solution.
My first thought is to use Apache Hadoop. It provides distributed storage and distributed computing. You can synchronize across processes by using files as locks.
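As one classic way to use "files as locks" (a generic POSIX sketch, not Hadoop-specific; the lock-file path is a placeholder, and atomic-create semantics can be unreliable on some network filesystems, so a robust distributed lock also needs staleness and crash handling): create the lock file with O_CREAT | O_EXCL, which succeeds for exactly one process at a time, and delete it to release.

```cpp
// file_lock.cpp -- generic POSIX sketch; build with: g++ -O2 file_lock.cpp
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

// Try to take the lock by atomically creating the lock file.
// Succeeds for exactly one process; everyone else sees EEXIST.
static bool try_lock(const char* path) {
    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0644);
    if (fd < 0) return false;
    close(fd);
    return true;
}

static void unlock(const char* path) {
    unlink(path);                        // releasing the lock = deleting the file
}

int main() {
    const char* lock_path = "/shared/storage/job.lock";   // placeholder path on shared storage
    while (!try_lock(lock_path)) {
        usleep(100 * 1000);              // crude back-off; a real system also handles stale locks
    }
    // ... critical section: touch the shared resource ...
    std::puts("got the lock, doing the protected work");
    unlock(lock_path);
    return 0;
}
```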
It sounds like you want something like SCRAMNet, although that requires custom hardware. I don't know if there is a software-only solution. Also, it's likely that even if you got it working, you'd find your networked version was actually running slower than when it was previously on a single machine. You may just have to bite the bullet and re-design your app.
Since your point 2 suggests that you can live with some performance degradation you might want to consider a hybrid approach: SMP within individual machines, message-passing between machines. I'm not familiar with D so can offer no specific advice. Further I've seen mixed reviews of the hybrid approach for OpenMP+MPI, but it might suit you and your application.
EDIT: You might want to Google around for 'partitioned global address space' which seems to describe your desired approach quite accurately. As before, I have no advice on using D for this.
With all the hype around parallel computing lately, I've been thinking a lot about parallelism, number crunching, clusters, etc...
I started reading Learn You Some Erlang. As more people are learning (myself included), Erlang handles concurrency in a very impressive, elegant way.
Then the author asserts that Erlang is not ideal for number crunching. I can understand that a language like Erlang would be slower than C, but the model for concurrency seems ideally suited to things like image handling or matrix multiplication, even though the author specifically says it's not.
Is it really that bad? Is there a tipping point where Erlang's strength overcomes its local speed weakness? What measures, if any, are being taken to deal with speed?
To be clear: I'm not trying to start a debate; I just want to know.
It's a mistake to think of parallelism as only about raw number crunching power. Erlang is closer to the way a cluster computer works than, say, a GPU or classic supercomputer.
In modern GPUs and old-style supercomputers, performance is all about vectorized arithmetic, special-purpose calculation hardware, and low-latency communication between processing units. Because communication latency is low and each individual computing unit is very fast, the ideal usage pattern is to load the machine's RAM up with data and have it crunch it all at once. This processing might involve lots of data passing among the nodes, as happens in image processing or 3D, where there are lots of CPU-bound tasks to do to transform the data from input form to output form. This type of machine is a poor choice when you frequently have to go to a disk, network, or some other slow I/O channel for data. This idles at least one expensive, specialized processor, and probably also chokes the data processing pipeline so nothing else gets done, either.
If your program requires heavy use of slow I/O channels, a better type of machine is one with many cheap independent processors, like a cluster. You can run Erlang on a single machine, in which case you get something like a cluster within that machine, or you can easily run it on an actual hardware cluster, in which case you have a cluster of clusters. Here, communication overhead still idles processing units, but because you have many processing units running on each bit of computing hardware, Erlang can switch to one of the other processes instantaneously. If it happens that an entire machine is sitting there waiting on I/O, you still have the other nodes in the hardware cluster that can operate independently. This model only breaks down when the communication overhead is so high that every node is waiting on some other node, or for general I/O, in which case you either need faster I/O or more nodes, both of which Erlang naturally takes advantage of.
Communication and control systems are ideal applications of Erlang because each individual processing task takes little CPU and only occasionally needs to communicate with other processing nodes. Most of the time, each process is operating independently, each taking a tiny fraction of the CPU power. The most important thing here is the ability to handle many thousands of these efficiently.
The classic case where you absolutely need a classic supercomputer is weather prediction. Here, you divide the atmosphere up into cubes and do physics simulations to find out what happens in each cube, but you can't use a cluster because air moves between each cube, so each cube is constantly communicating with its 6 adjacent neighbors. (Air doesn't go through the edges or corners of a cube, being infinitely fine, so it doesn't talk to the other 20 neighboring cubes.) Run this on a cluster, whether running Erlang on it or some other system, and it instantly becomes I/O bound.
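To see why that workload is communication-bound, here is a toy serial sketch of the access pattern (grid size and "physics" are placeholders, this is not a real weather code): every cell's new value depends on its six face neighbours, so if the grid is partitioned across machines, every cut surface has to be exchanged over the network on every single time step.

```cpp
#include <cstddef>
#include <vector>

// Toy 6-neighbour stencil update on an N x N x N grid (no real physics).
static void step(const std::vector<double>& in, std::vector<double>& out, std::size_t N) {
    auto at = [N](std::size_t x, std::size_t y, std::size_t z) {
        return (x * N + y) * N + z;
    };
    for (std::size_t x = 1; x + 1 < N; ++x)
        for (std::size_t y = 1; y + 1 < N; ++y)
            for (std::size_t z = 1; z + 1 < N; ++z)
                // Each cell reads its 6 face neighbours. Split the grid across machines
                // and every cell on a cut surface must be shipped over the network
                // before each and every step.
                out[at(x, y, z)] = (in[at(x - 1, y, z)] + in[at(x + 1, y, z)] +
                                    in[at(x, y - 1, z)] + in[at(x, y + 1, z)] +
                                    in[at(x, y, z - 1)] + in[at(x, y, z + 1)]) / 6.0;
}

int main() {
    const std::size_t N = 64;                      // placeholder grid size
    std::vector<double> a(N * N * N, 1.0), b(a);
    for (int t = 0; t < 10; ++t) {                 // a few time steps
        step(a, b, N);
        a.swap(b);
    }
    return 0;
}
```

The compute per cell is tiny compared with the data that has to move, which is why this workload needs low-latency interconnects rather than simply more nodes.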
Is there a tipping point where Erlang's strength overcomes its local speed weakness?
Well, of course there is; for example, when trying to find the median of a trillion numbers :)
http://matpalm.com/median/question.html
Just before you posted, I happened to notice this was the number 1 post on erlang.reddit.com.
Almost any language can be parallelized. In some languages it's simple, in others it's a pain in the butt, but it can be done. If you want to run a C++ program across 8000 CPUs in a grid, go ahead! You can do that. It's been done before.
Erlang doesn't do anything that's impossible in other languages. If a single CPU running an Erlang program is less efficient than the same CPU running a C++ program, then two hundred CPUs running Erlang will also be slower than two hundred CPUs running C++.
What Erlang does do is make this kind of parallelism easy to work with. It saves developer time and reduces the chance of bugs.
So I'm going to say no, there is no tipping point at which Erlang's parallelism allows it to outperform another language's numerical number-crunching strength.
Where Erlang scores is in making it easier to scale out and do so correctly. But it can still be done in other languages which are better at number-crunching, if you're willing to spend the extra development time.
And of course, let's not forget the good old point that languages don't have a speed.
A sufficiently good Erlang compiler would yield perfectly optimal code. A sufficiently bad C compiler would yield code that runs slower than anything else.
There is pressure to make Erlang execute numeric code faster. The HiPE compiler compiles to native code instead of BEAM bytecode, for example, and its most effective optimizations are probably on floating-point code, where it can avoid boxing and store values directly in FPU registers.
For the majority of Erlang usage, Erlang is plenty fast as it is. It's used to write always-up control systems where the speed measurement that matters most is low-latency response. Performance under load tends to be I/O-bound. These users tend to stay away from HiPE, since it is not as flexible/malleable for debugging live systems.
Now that servers with 128 GB of RAM are not that uncommon, and there's no reason they won't get even more memory, some I/O-bound problems might shift over to being somewhat CPU-bound. That could be a driver.
You should follow HiPE development.
Your examples of image manipulation and matrix multiplication seem to me to be very bad matches for Erlang, though. Those are examples that benefit from vector/SIMD operations. Erlang is not good at that kind of parallelism (where one does the same thing to many values at once).
Erlang processes are MIMD: multiple instruction streams, multiple data. Erlang does lots of branching behind pattern matching and recursive loops, and that kills CPU instruction pipelining.
The best architecture for heavily parallelised problems is the GPU. For programming GPUs in a functional language I see the best potential in using Haskell to create programs targeting them. A GPU is basically a pure function from input data to output data. See the Lava project in Haskell for creating FPGA circuits; if it is possible to create circuits so cleanly in Haskell, it can't be harder to create program data for GPUs.
The Cell architecture is very nice for vectorizable problems as well.
I think the broader point is that parallelism is not necessarily, or even typically, about speed.
It is about how to express algorithms or programs in which the sequence of activities is partially ordered.