hardware environment for compilation performance - performance

This is a rather general question ..
What hardware setup is best for large C/C++ compile jobs, like a Linux kernel or Applications ?
I remember reading a post by Joel Spolsky on experiments with solid state disks and stuff like that.
Do I have to have rather more CPU power or more RAM or a fast harddisk IO solution like solid state ? Would it for example be convenient to have a 'normal' harddisk for the standard system and then use a solid state for the compilation ? Or can I just buy lots of RAM ? And how important is the CPU, or is it just sitting around most of the compile time ?
Probably it's a stupid question, but I dont have a lot of experience on that field, thanks for answers.
Here's some info on the SSD issue
Linus Torvalds on the topic
How to get the most out of it (Standard settings are sort of slowing it down)
Interesting article on Coding Horror about it and with a tip for a payable Chipset

I think you need enough of everything. CPU is very important, and compiles can easily be parallelised (with make -j), so you want as many CPU cores as possible. Then, RAM is probably just as important, since it provides more 'working space' for the compiler and allows to your IO to be buffered. Finally, of course drive speed is probably the least important of the three - kernel code is big, but not that big.

Definitely not a stupid question, getting the build-test environment tuned correctly will make a lot of headaches go away.
Hard disk performance would probably top the list. I'd stay well away from the solid state drive as they're only rated for a large-but-limited number of write operations and make-clean-build cycles will hammer it.
More importantly, can you take advantage of a parallel or shared build environment - from memory ClearCase and Perforce had mechanisms to handle shared builds. Unless you have a parallelising build system, having multiple CPUs will be fairly pointless
Last but not least, I would doubt that the build time would be the limiting factor - more likely you should focus on the needs of your test system. Before you look at the actual metal though, try to design a build-test system that's appropriate to how you'll actually be working - how often are your builds, how many people are involved, how big is your test system ....

Related

Multithreaded programming for CPUs with E/P cores on Alder Lake architecture

This is more a conceptual/high level question rather than a language specific one.
In terms of writing software for high performance applications that will use multiple threads, would it be advisable to manually check for and only allocate threads to P cores, or rather just simply allocate threads as they were done before Alder Lake, and let the OS scheduler decide where to put them?
To be more specific, my program will be a computer game with separate computationally expensive CPU threads for AI, pathfinding, etc. Ideally I don't want these threads on E cores but I'm wondering if I should be leaving this sort of thing up to the OS to decide instead of ensuring it manually.
This is probably not a good question for SO, and might fit better on https://softwareengineering.stackexchange.com, but here goes with a conceptual/high-level answer anyways.
In terms of writing software for high performance applications you will get best performance by writing platform-dependent code, that is by writing programs which are informed by, and take advantage of, the particular features of the hardware+o/s+runtime on which the programs are to execute.
The costs of this approach are that the code will, by definition, be less-than-optimal-performancewise on any other platform; and that writing codes to squeeze out every last drop of performance for a particular problem can be quite difficult and time consuming.
Personally (so this might be an opinion, which is something SO doesn't like) I would first write the platform-neutral version of the code and test it. Only when I was convinced that I couldn't achieve necessary performance (or other) goals would I roll up my sleeves and develop that first version into a platform-dependent version. (Well, I might do this extra work for fun, but you catch my drift).
Later, if you want to move the program to another platform you already have the platform-neutral version to start with.

ARM NEON: Tools to predict performance issues due to memory access limited bandwidth?

I am trying to optimize critical parts of a C code for image processing in ARM devices and recently discovered NEON.
Having read tips here and there, I am getting pretty nice results, but there is something that escapes me. I see that overall performance is very much dependant on memory accesses and how they are done.
Which is the simplest way (by simple I mean, if possible, not having to run the whole compiled code in an emulator or simulator, but something that can be feed of small pieces of assembly and analyze them), in order to get an idea of how memory accesses are "bottlenecking" the subroutine?
I know this can not be done exactly without running it in a specific hardware and specific conditions, but the purpose is to have a "comparison" trial-and error tool to experiment with, even if the results are only approximations.
(something similar to this great tool for cycle counting)
I think you've probably answered your own question. Memory is a system level effect and many ARM implementers (Apple, Samsung, Qualcomm, etc) implement the system differently with different results.
However, of course you can optimize things for a certain system and it will probably work well on others, so really it comes down to figuring out a way that you can quickly iterate and test/simulate system level effects. This does get complicated so you might pay some money for system level simulators such as is included in ARM's RealView. Or I might recommend getting some open source hardware like a Panda Board and using valgrind's cache-grind. With linux on the panda board you can write some scripts to automate your testing.
It can be a hassle to get this going but if optimizing for ARM will be part of your professional life, then it's worth the (relatively low compared to your salary) software/hardware investment and time.
Note 1: I recommend against using PLD. This is very system tuning dependent, and if you get it working well on one ARM implementation it may hurt you for the next generation of chip or a different implementation. This may be a hint that trying to optimize at the system level, other than some basic data localization and ordering stuff may not be worth your efforts? (See Stephen's comment below).
Memory access is one thing that simply cannot be modeled from "small pieces of assembly” to generate meaningful guidance. Cache hierarchies, store buffers, load miss queues, cache policy, etc … even relatively simple processors have an enormous amount of “state” hiding underneath the LSU, and any small-scale analysis cannot accurately capture that state. That said, there are a few basic guidelines for getting the best performance:
maximize the ratio of "useful computation” instructions to LSU operations.
align your memory accesses (ideally to 16B).
if you need to pick between aligning loads or aligning stores, align your stores.
try to write out complete cachelines when possible.
PLD is mainly useful for non-uniform-but-somehow-still-predictable memory access patterns (these are rare).
For NEON specifically, you should prefer to use the vld1 and vst1 instructions (with an alignment hint). On most micro-architectures, in most cases, they are the fastest way to move between NEON and memory. Eschew v[ld|st][3|4] in particular; these are an attractive nuisance, slower than doing separate permutes on most micro-architectures in most cases.

"Well-parallelized" algorithm not sped up by multiple threads

I'm sorry to ask a question one a topic that I know so little about, but this idea has really been bugging me and I haven't been able to find any answers on the internet.
Background:
I was talking to one of my friends who is in computer science research. I'm in mostly ad-hoc development, so my understanding of a majority of CS concepts is at a functional level (I know how to use them rather than how they work). He was saying that converting a "well-parallelized" algorithm that had been running on a single thread into one that ran on multiple threads didn't result in the processing speed increase that he was expecting.
Reasoning:
I asked him what the architecture of the computer he was running this algorithm on was, and he said 16-core (non-virtualized). According to what I know about multi-core processors, the processing speed increase of an algorithm running on multiple cores should be roughly proportional to how well it is parallelized.
Question:
How can an algorithm that is "well-parallelized" and programmed correctly to run on a true multi-core processor not run several times more quickly? Is there some information that I'm missing here, or is it more likely a problem with the implementation?
Other stuff: I asked if the threads were possibly taking up more power than any individual core had available and apparently each core runs at 3.4 GHz. This is much more than the algorithm should need, and when diagnostics are run the cores aren't maxed out during runtime.
It is likely sharing something. What is being shared may not be obvious.
One of the most common non-obvious shared resources is CPU cache. If the threads are updating the same cache line that cache line has to bounce between CPUs, slowing everything down.
That can happen because of accessing (even read-only) variables which are near to each other in memory. If all accesses are read-only it is OK, but if even one CPU is writing to that cache line it will force a bounce.
A brute-force method of fixing this is to put shared variables into structures that look like:
struct var_struct {
int value;
char padding[128];
};
Instead of hard-coding 128 you could research what system parameter or preprocessor macros define the cache-line size for your system type.
Another place that sharing can take place is inside system calls. Even seemingly innocent functions might be taking global locks. I seem to recall reading about Linux fixing an issue like this a while back with locks on the functions that return process and thread identifiers and parent identifiers.
Performance versus number of cores is often a S-like curve - first it obviously increases but as locking, shared cache and the like take they debt the further cores do not add so much and even may degrade. Hence nothing mysterious. If we would know more details about the algorithm it may be possible to find an idea to speed it up.

Performing numerical calculations in a virtualized setting?

I'm wondering what kind of performance hit numerical calculations will have in a virtualized setting? More specifically, what kind of performance loss can I expect from running CPU-bound C++ code in a virtualized windows OS as opposed to a native Linux one, on rather fast x86_64 multi-core machines?
I'll be happy to add precisions as needed, but as I don't know much about virtualization, I don't know what info is relevant.
Processes are just bunches of threads which are streams of instructions executing in a sequential fashion. In modern virtualisation solutions, as far as the CPU is concerned, the host and the guest processes execute together and differ only in that the I/O of the latter is being trapped and virtualised. Memory is also virtualised but that occurs more or less in the hardware MMU. Guest instructions are directly executed by the CPU otherwise it would not be virtualisation but rather emulation and as long as they do not access any virtualised resources they would execute just as fast as the host instructions. At the end it all depends on how well the CPU could cope with the increased number of running processes.
There are lightweight virtualisation solutions like zones in Solaris that partition the process space in order to give the appearance of multiple copies of the OS but it all happens under the umbrella of a single OS kernel.
The performance hit for pure computational codes is very small, often under 1-2%. The catch is that in reality all programs read and write data and computational codes usually read and write lots of data. Virtualised I/O is usually much slower than direct I/O even with solutions like Intel VT-* or AMD-V.
Exact numbers depend heavily on the specific hardware.
Goaded by #Mitch Wheat's unarguable assertion that my original post here was not an answer, here's an attempt to recast it as an answer:
I work mostly on HPC in the energy sector. Some of the computations that my scientist colleagues run take O(10^5) CPU-hours, we're seriously thinking about O(10^6) CPU-hours jobs in the near future.
I get well paid to squeeze every last drop of performance out of our codes, I'd think it was a good day's work if I could knock 1% off the run-time of some of our programs. Sometimes it has taken me a month to get that sort of performance improvement, sure I may be slow, but it's still cost-effective for our scientists.
I shudder therefore, when bright salespeople offering the latest and best in data center software (of which virtualization is one aspect) which will only, as I see it, shackle my codes to a pile of anchor chain from a 250,00dwt tanker (that was a metaphor).
I have read the question carefully and understand that OP is not proposing that virtualization would help, I'm offering the perspective of a practitioner. If this is still too much of a comment, do the SO thing and vote to close, I promise I won't be offended !

Are there practical limits to the number of cores accessing the same memory?

Will the current trend of adding cores to computers continue? Or is there some theoretical or practical limit to the number of cores that can be served by one set of memory?
Put another way: is the high powered desktop computer of the future apt to have 1024 cores using one set of memory, or is it apt to have 32 sets of memory, each accessed by 32 cores?
Or still another way: I have a multi-threaded program that runs well on a 4-core machine, using a significant amount of the total CPU. As this program grows in size and does more work, can I be reasonably confident more powerful machines will be available to run it? Or should I be thinking seriously about running multiple sessions on multiple machines (or at any rate multiple sets of memory) to get the work done?
In other words, is a purely multithreaded approach to design going to leave me in a dead end? (As using a single threaded approach and depending on continued improvements in CPU speed years back would have done?) The program is unlikely to be run on a machine costing more than, say $3,000. If that machine cannot do the work, the work won't get done. But if that $3,000 machine is actually a network of 32 independent computers (though they may share the same cooling fan) and I've continued my massively multithreaded approach, the machine will be able to do the work, but the program won't, and I'm going to be in an awkward spot.
Distributed processing looks like a bigger pain than multithreading was, but if that might be in my future, I'd like some warning.
Will the current trend of adding cores to computers continue?
Yes, the GHz race is over. It's not practical to ramp the speed any more on the current technology. Physics has gotten in the way. There may be a dramatic shift in the technology of fabricating chips that allows us to get round this, but it's not obviously 'just around the corner'.
If we can't have faster cores, the only way to get more power is to have more cores.
Or is there some theoretical or practical limit to the number of cores that can be served by one set of memory?
Absolutely there's a limit. In a shared memory system the memory is a shared resource and has a limited amount of bandwidth.
Max processes = (Memory Bandwidth) / (Bandwidth required per process)
Now - that 'Bandwidth per process' figure will be reduced by caches, but caches become less efficient if they have to be coherent with one another because everyone is accessing the same area of memory. (You can't cache a memory write if another CPU may need what you've written)
When you start talking about huge systems, shared resources like this become the main problem. It might be memory bandwidth, CPU cycles, hard drive access, network bandwidth. It comes down to how the system as a whole is structured.
You seem to be really asking for a vision of the future so you can prepare. Here's my take.
I think we're going to see a change in the way software developers see parallelism in their programs. At the moment, I would say that a lot of software developers see the only way of using multiple threads is to have lots of them doing the same thing. The trouble is they're all contesting for the same resources. This then means lots of locking needs to be introduced, which causes performance issues, and subtle bugs which are infuriating and time consuming to solve.
This isn't sustainable.
Manufacturing worked out at the beginning of the 20th Century, the fastest way to build lots of cars wasn't to have lots of people working on one car, and then, when that one's done, move them all on to the next car. It was to split the process of building the car down into lots of small jobs, with the output of one job feeding the next. They called it assembly lines. In hardware design it's called pipe-lining, and I'll think we'll see software designs move to it more and more, as it minimizes the problem of shared resources.
Sure - There's still a shared resource on the output of one stage and the input of the next, but this is only between two threads/processes and is much easier to handle. Standard methods can also be adopted on how these interfaces are made, and message queueing libraries seem to be making big strides here.
There's not one solution for all problems though. This type of pipe-line works great for high throughput applications that can absorb some latency. If you can't live with the latency, you have no option but to go the 'many workers on a single task' route. Those are the ones you ideally want to be throwing at SIMD machines/Array processors like GPUs, but it only really excels with a certain type of problem. Those problems are ones where there's lots of data to process in the same way, and there's very little or no dependency between data items.
Having a good grasp of message queuing techniques and similar for pipelined systems, and utilising fine grained parallelism on GPUs through libraries such as OpenCL, will give you insight at both ends of the spectrum.
Update: Multi-threaded code may run on clustered machines, so this issue may not be as critical as I thought.
I was carefully checking out the Java Memory Model in the JLS, chapter 17, and found it does not mirror the typical register-cache-main memory model of most computers. There were opportunities there for a multi-memory machine to cleanly shift data from one memory to another (and from one thread running on one machine to another running on a different one). So I started searching for JVMs that would run across multiple machines. I found several old references--the idea has been out there, but not followed through. However, one company, Terracotta, seems to have something, if I'm reading their PR right.
At any rate, it rather seems that when PC's typically contain several clustered machines, there's likely to be a multi-machine JVM for them.
I could find nothing outside the Java world, but Microsoft's CLR ought to provide the same opportunities. C and C++ and all the other .exe languages might be more difficult. However, Terracotta's websites talk more about linking JVM's rather than one JVM on multiple machines, so their tricks might work for executable langauges also (and maybe the CLR, if needed).

Resources