Why not using GPUs as a CPU? - cpu

I know the question is only partially programming-related because the answer I would like to get is originally from these two questions:
Why are CPU cores number so low (vs GPU)? and Why aren't we using GPUs instead of CPUs, GPUs only or CPUs only? (I know that GPUs are specialized while CPUs are more for multi-task, etc.). I also know that there are memory (Host vs GPU) limitations along with precision and caches capability. But, In term of hardware comparison, high-end to high-end CPU/GPU comparison GPUs are much much more performant.
So my question is: Could we use GPUs instead of CPUs for OS, applications, etc
The reason I am asking this questions is because I would like to know the reason why current computers are still using 2 main processing units (CPU/GPU) with two main memory and caching systems (CPU/GPU) even if it is not something a programmer would like.

Current GPUs lack many of the facilities of a modern CPU that are generally considered important (crucial, really) to things like an OS.
Just for example, an OS normally used virtual memory and paging to manage processes. Paging allows the OS to give each process its own address space, (almost) completely isolated from every other process. At least based on publicly available information, most GPUs don't support paging at all (or at least not in the way an OS needs).
GPUs also operate at much lower clock speeds than CPUs. Therefore, they only provide high performance for embarrassingly parallel problems. CPUs are generally provide much higher performance for single threaded code. Most of the code in an OS isn't highly parallel -- in fact, a lot of it is quite difficult to make parallel at all (e.g., for years, Linux had a giant lock to ensure only one thread executed most kernel code at any given time). For this kind of task, a GPU would be unlikely to provide any benefit.
From a programming viewpoint, a GPU is a mixed blessing (at best). People have spent years working on programming models to make programming a GPU even halfway sane, and even so it's much more difficult (in general) than CPU programming. Given the difficulty of getting even relatively trivial things to work well on a GPU, I can't imagine attempting to write anything even close to as large and complex as an operating system to run on one.

GPUs are designed for graphics related processing (obviously), which is inherently something that benefits from parallel processing (doing multiple tasks/calculations at once). This means that unlike modern CPUs, which as you probably know usually have 2-8 cores, GPUs have hundreds of cores. This means that they are uniquely suited to processing things like ray tracing or anything else that you might encounter in a 3D game or other graphics intensive activity.
CPUs on the other hand have a relatively limited number of cores because the tasks that a CPU faces usually do not benefit from parallel processing nearly as much as rendering a 3D scene would. In fact, having too many cores in a CPU could actually degrade the performance of a machine, because of the nature of the tasks a CPU usually does and the fact that a lot of programs would not be written to take advantage of the multitude of cores. This means that for internet browsing or most other desktop tasks, a CPU with a few powerful cores would be better suited for the job than a GPU with many, many smaller cores.
Another thing to note is that more cores usually means more power needed. This means that a 256-core phone or laptop would be pretty impractical from a power and heat standpoint, not to mention the manufacturing challenges and costs.

Usually operating systems are pretty simple, if you look at their structure.
But parallelizing them will not improve speeds much, only raw clock speed will do.
GPU's simply lack parts and a lot of instructions from their instruction sets that an OS needs, it's a matter of sophistication. Just think of the virtualization features (Intel VT-x or AMD's AMD-v).
GPU cores are like dumb ants, whereas a CPU is like a complex human, so to speak. Both have different energy consumption because of this and produce very different amounts of heat.
See this extensive superuser answer here on more info.

Because nobody will spend money and time on this. Except for some enthusiasts like that one: http://gerigeri.uw.hu/DawnOS/history.html (now here: http://users.atw.hu/gerigeri/DawnOS/history.html)
Dawn now works on GPU-s: with a new OpenCL capable emulator, Dawn now
boots and works on Graphics Cards, GPU-s and IGP-s (with OpenCL 1.0).
Dawn is the first and only operating system to boot and work fully on
a graphics chip.

Related

Is it possible to compare ARM and x86 performance via benchmarks?

Judging by the latest news, new Apple processor A11 Bionic gains more points than the mobile Intel Core i7 in the Geekbench benchmark.
As I understand, there are a lot of different tests in this benchmark. These tests simulate a different load, including the load, which can occur in everyday use.
Some people state that these results can not be compared to x86 results. They say that x86 is able to perform "more complex tasks". As an example, they lead Photoshop, video conversion, scientific calculations. I agree that the software for the ARM is often only a "lighweight" version of software for desktops. But it seems to me that this limitation is caused by the format of mobile operating systems (do your work on the go, no mouse, etc), and not by the performance of ARM.
As an opposite example, let's look at Safari. A browser is a complex program. And on the iPad Safari works just as well as on the Mac. Moreover, if we take the results of Sunspider (JS benchmark), it turns out that Safari on the iPad is gaining more points.
I think that in everyday tasks (Web, Office, Music/Films) ARM (A10X, A11) and x86 (dual core mobile Intel i7) performance are comparable and equal.
Are there any kinds of tasks where ARM really lags far behind x86? If so, what is the reason for this? What's stopping Apple from releasing a laptop on ARM? They already do same thing with migration from POWER to x86. This is technical restrictions, or just marketing?
(Intended this as a comment since this question is off topic, but it got long..).
Of course you can compare, you just need to be very careful, which most people aren't. The fact that companies publishing (or "leaking") results are biased also doesn't help much.
The common misconception is that you can compare a benchmark across two systems and get a single score for each. That ignores the fact that different systems have different optimization points, most often with regards to power (or "TDP"). What you need to look at is the power/performance curve - this graph shows how the system reacts to more power (raising the frequency, enabling more performance features, etc), and how much it contributes to its performance.
One system can win over the low power range, but lose when the available power increases since it doesn't scale that well (or even stops scaling at some point). This is usually the case with Arm, as most of these CPUs are tuned for low power, while x86 covers a larger domain and scales much better.
If you are forced to observe a single point along the graph (which is a legitimate scenario, for example if you're looking for a CPU for a low-power device), at least make sure the comparison is fair and uses the same power envelope.
There are of course other factors that must be aligned (and sometimes aren't due to negligence or an intention to cheat) - the workload should be the same (i've seen different versions compared..), the compiler should be as close as possible (although generating arm vs x86 code is already a difference, but the compiler intermediate optimizations should be similar. When comparing 2 x86 like intel and AMD you should prefer the same binary, unless you also want to allow machine specific optimizations).
Finally, the system should also be similar, which is not the case when comparing a smartphone against a pc/macbook. The memory could differ, the core count, etc. This could be legitimate difference, but it's not really related to one architecture being better than the other.
the topic is bogus, from the ISA to an application or source code there are many abstraction level and the only metric that we have (execution time, or throughput) depends on many factors that could advantage one or the other: the algorithm choices, the optimization written in source code, the compiler/interpreter implementation/optimizations, the operating system behaviour. So they are not exactly/mathematically comparable.
However, looking at the numbers, and the utility of the mobile application written by talking as a management engeneer, ARM chip seems to be capable of run quite good.
I think the only reason is inertia of standard spread around (if you note microsoft propose a variant of windows running on ARM processors, debian ARM variant are ready https://www.debian.org/distrib/netinst).
the ARMv8 cores seems close to x86/64 ones by looking at raw numbers
note i7-3770k results: https://en.wikipedia.org/wiki/Instructions_per_second#MIPS
summary of last Armv8 CPU characteristics, note the quantity of decode, dispatch, caches, and compare the last column on cortex A73 to the i7 3770k
https://en.wikipedia.org/wiki/Comparison_of_ARMv8-A_cores
intel ivy bridge characteristics:
https://en.wikichip.org/wiki/intel/microarchitectures/ivy_bridge_(client)
A75 details. https://www.anandtech.com/show/11441/dynamiq-and-arms-new-cpus-cortex-a75-a55
the topic of power consumption is complex again, the basic rule that go under all the frequency/tension rule (used and abused) over www is: transistors raise time. https://en.wikipedia.org/wiki/Rise_time
There is a fixed time delay in the switching of a transistor, this determinates the maximum frequency that a transistor could switch, and with more of them linked in a cascade way this time sums up in a nonlinear way (need some integration to demonstrate it), as a result 10 years ago to increase the GHz companies try to split in more stage the execution of an operation and runs them (operations) in a pipeline way, even inside the logical pipeline stage. https://en.wikipedia.org/wiki/Instruction_pipelining
the raise time depends of physical characteristics (materials and shape of transistors). It can be reduced by increasing the voltage, so the transistor switch faster, as the switching is associated (let me the term) to a the charge/discharge of a capacitor that trigger the transistor channel opening/closing.
These ARM chips are designed to low power applications, by changing the design they could easily gain MHz, but they will use much power, how much? again not comparable if you don't work inside a foundry and have the numbers.
an example of server applications of ARM processors that could be closer to desktop/workstation CPU as power consumption are Cavium or qualcomm Falkor CPUs, and some benchmark report that they are not bad.

Performing numerical calculations in a virtualized setting?

I'm wondering what kind of performance hit numerical calculations will have in a virtualized setting? More specifically, what kind of performance loss can I expect from running CPU-bound C++ code in a virtualized windows OS as opposed to a native Linux one, on rather fast x86_64 multi-core machines?
I'll be happy to add precisions as needed, but as I don't know much about virtualization, I don't know what info is relevant.
Processes are just bunches of threads which are streams of instructions executing in a sequential fashion. In modern virtualisation solutions, as far as the CPU is concerned, the host and the guest processes execute together and differ only in that the I/O of the latter is being trapped and virtualised. Memory is also virtualised but that occurs more or less in the hardware MMU. Guest instructions are directly executed by the CPU otherwise it would not be virtualisation but rather emulation and as long as they do not access any virtualised resources they would execute just as fast as the host instructions. At the end it all depends on how well the CPU could cope with the increased number of running processes.
There are lightweight virtualisation solutions like zones in Solaris that partition the process space in order to give the appearance of multiple copies of the OS but it all happens under the umbrella of a single OS kernel.
The performance hit for pure computational codes is very small, often under 1-2%. The catch is that in reality all programs read and write data and computational codes usually read and write lots of data. Virtualised I/O is usually much slower than direct I/O even with solutions like Intel VT-* or AMD-V.
Exact numbers depend heavily on the specific hardware.
Goaded by #Mitch Wheat's unarguable assertion that my original post here was not an answer, here's an attempt to recast it as an answer:
I work mostly on HPC in the energy sector. Some of the computations that my scientist colleagues run take O(10^5) CPU-hours, we're seriously thinking about O(10^6) CPU-hours jobs in the near future.
I get well paid to squeeze every last drop of performance out of our codes, I'd think it was a good day's work if I could knock 1% off the run-time of some of our programs. Sometimes it has taken me a month to get that sort of performance improvement, sure I may be slow, but it's still cost-effective for our scientists.
I shudder therefore, when bright salespeople offering the latest and best in data center software (of which virtualization is one aspect) which will only, as I see it, shackle my codes to a pile of anchor chain from a 250,00dwt tanker (that was a metaphor).
I have read the question carefully and understand that OP is not proposing that virtualization would help, I'm offering the perspective of a practitioner. If this is still too much of a comment, do the SO thing and vote to close, I promise I won't be offended !

Are there practical limits to the number of cores accessing the same memory?

Will the current trend of adding cores to computers continue? Or is there some theoretical or practical limit to the number of cores that can be served by one set of memory?
Put another way: is the high powered desktop computer of the future apt to have 1024 cores using one set of memory, or is it apt to have 32 sets of memory, each accessed by 32 cores?
Or still another way: I have a multi-threaded program that runs well on a 4-core machine, using a significant amount of the total CPU. As this program grows in size and does more work, can I be reasonably confident more powerful machines will be available to run it? Or should I be thinking seriously about running multiple sessions on multiple machines (or at any rate multiple sets of memory) to get the work done?
In other words, is a purely multithreaded approach to design going to leave me in a dead end? (As using a single threaded approach and depending on continued improvements in CPU speed years back would have done?) The program is unlikely to be run on a machine costing more than, say $3,000. If that machine cannot do the work, the work won't get done. But if that $3,000 machine is actually a network of 32 independent computers (though they may share the same cooling fan) and I've continued my massively multithreaded approach, the machine will be able to do the work, but the program won't, and I'm going to be in an awkward spot.
Distributed processing looks like a bigger pain than multithreading was, but if that might be in my future, I'd like some warning.
Will the current trend of adding cores to computers continue?
Yes, the GHz race is over. It's not practical to ramp the speed any more on the current technology. Physics has gotten in the way. There may be a dramatic shift in the technology of fabricating chips that allows us to get round this, but it's not obviously 'just around the corner'.
If we can't have faster cores, the only way to get more power is to have more cores.
Or is there some theoretical or practical limit to the number of cores that can be served by one set of memory?
Absolutely there's a limit. In a shared memory system the memory is a shared resource and has a limited amount of bandwidth.
Max processes = (Memory Bandwidth) / (Bandwidth required per process)
Now - that 'Bandwidth per process' figure will be reduced by caches, but caches become less efficient if they have to be coherent with one another because everyone is accessing the same area of memory. (You can't cache a memory write if another CPU may need what you've written)
When you start talking about huge systems, shared resources like this become the main problem. It might be memory bandwidth, CPU cycles, hard drive access, network bandwidth. It comes down to how the system as a whole is structured.
You seem to be really asking for a vision of the future so you can prepare. Here's my take.
I think we're going to see a change in the way software developers see parallelism in their programs. At the moment, I would say that a lot of software developers see the only way of using multiple threads is to have lots of them doing the same thing. The trouble is they're all contesting for the same resources. This then means lots of locking needs to be introduced, which causes performance issues, and subtle bugs which are infuriating and time consuming to solve.
This isn't sustainable.
Manufacturing worked out at the beginning of the 20th Century, the fastest way to build lots of cars wasn't to have lots of people working on one car, and then, when that one's done, move them all on to the next car. It was to split the process of building the car down into lots of small jobs, with the output of one job feeding the next. They called it assembly lines. In hardware design it's called pipe-lining, and I'll think we'll see software designs move to it more and more, as it minimizes the problem of shared resources.
Sure - There's still a shared resource on the output of one stage and the input of the next, but this is only between two threads/processes and is much easier to handle. Standard methods can also be adopted on how these interfaces are made, and message queueing libraries seem to be making big strides here.
There's not one solution for all problems though. This type of pipe-line works great for high throughput applications that can absorb some latency. If you can't live with the latency, you have no option but to go the 'many workers on a single task' route. Those are the ones you ideally want to be throwing at SIMD machines/Array processors like GPUs, but it only really excels with a certain type of problem. Those problems are ones where there's lots of data to process in the same way, and there's very little or no dependency between data items.
Having a good grasp of message queuing techniques and similar for pipelined systems, and utilising fine grained parallelism on GPUs through libraries such as OpenCL, will give you insight at both ends of the spectrum.
Update: Multi-threaded code may run on clustered machines, so this issue may not be as critical as I thought.
I was carefully checking out the Java Memory Model in the JLS, chapter 17, and found it does not mirror the typical register-cache-main memory model of most computers. There were opportunities there for a multi-memory machine to cleanly shift data from one memory to another (and from one thread running on one machine to another running on a different one). So I started searching for JVMs that would run across multiple machines. I found several old references--the idea has been out there, but not followed through. However, one company, Terracotta, seems to have something, if I'm reading their PR right.
At any rate, it rather seems that when PC's typically contain several clustered machines, there's likely to be a multi-machine JVM for them.
I could find nothing outside the Java world, but Microsoft's CLR ought to provide the same opportunities. C and C++ and all the other .exe languages might be more difficult. However, Terracotta's websites talk more about linking JVM's rather than one JVM on multiple machines, so their tricks might work for executable langauges also (and maybe the CLR, if needed).

Estimating how processor frequency affects I/O performance

I am doing research about dedicated I/O software that would run on consumer hardware. Essentially it boils down to saving huge data streams for later processing. Right now I am looking for a model to estimate performance factors on x86.
Take for example the new Macbook Pro:
high-speed Thunderbolt I/O (input/output) technology delivers
an amazing 10 gigabits per second of transfer speeds in both
directions
1.25 GB/s sounds nice but most processors of the day are clocked around 2 Ghz. Multiple cores make little difference as long as only one can be assigned per network channel.
So even if the software acts as a miniature operating system and limits itself to network/disk operations, the amount of data flowing to storage can't be greater than P / (2 * N)[1] chunks per second. Although this hints the rough performance limit, I feel it's far from adequate.
What other considerations should one take estimating I/O performance in regards to processor frequency and other hardware specifics? For simplicity's sake, assume here that storage performs instantly under all circumstances.
[1] P - processor frequency; N - algorithm overhead
The hardware limiting factors are probably the I/O bus performance, say PCIe, and more recently, the FSB clock-rates, since memory controllers are moving from northbridge to the CPUs themselves.
Then, of course, you have to figure out what sort of processing you need to do on the input, and how much work it is to produce the output. These, at least for conventional software running on a CPU, are dependent on the processor clock, but not only. Writing your code to take advantage of the hardware facilities like caches, instruction-level parallelism, etc. is still a black art but can give you an order of magnitude performance boost.
Basically what I'm ranting about is that not all software is created equal, and you probably want to take that into account.
Likely, harddisk controllers will decide the harddisk I/O performance, graphics cards will decide maximum resolution and refresh I/O performance, and so on. Don't really understand the question, the CPU is becoming less and less involved in these kinds of things (well, has been for the last 10 years).
I doubt the question will even have bearing on CPUs with integrated GPUs, since the buffer to be output to screen is in external memory sharing a bus with (again) a controller on the motherboard.
It's all buffered, so I can only see CPUs affecting file performance if you somehow force the hardware buffer size to something insanely puny. Edit: and I'm pretty sure Apple will prevent you from doing such things. ;)
For Thunderbolt specifically, it's more about what the minimum CPU model is, that supports the kinds of bus speeds required by the Thunderbolt chip set version that is in the machine in question.
Thunderbolt is a raw data traffic system and performance specs are potential maximums, hence all the asterisks in the Apple specs. I believe it will indeed alleviate bottlenecks and in general give lag-free intelligent data shuffling doing many things simultaneously.
The CPU will idle-wait a shorter time for needed data, but the processing speed of the data is the same. When playing or creating a movie, codec processing time will be the same, but you will still feel a boost/lack of lag because the data is there when it needs it. For the I/O, the bottleneck will become the read/write speed of your harddisk instead, and the CPU bottleneck (for file copy operations, likely at least some code in Finder) will stay the same.
In other words, only CPU-intensive tasks such as for example movie encoding will benefit significantly from a faster CPU, while the benefits of Thunderbolt vs. a mix of interfaces will boost machines with both slow and fast CPUs.

At what rate are the number of cores per CPU increasing?

I'm designing a system that will be on-line in 2016 and run on commodity 1U or 2U server boxes. I'd like to understand how parallel the software will need to be so I'd like to estimate the number of cores per physical machine. I'm not interested in more exotic hardware like video game console processors, GPUs or DSPs. I could extrapolate based on when chips where issued by Intel or AMD, but this historical information seems scarce.
Thanks.
I found the following charts from Design for Manycore Systems:
As the great computer scientist Yogi Berra said, "It's tough to make predictions, especially about the future.". Given the relative recency of multicore systems, I think you're right to be wary of extrapolations. Still, you need a number to aim for.
M. Spinelli's graphs are very valuable, and (I think) have the benefit of being based on real plans out to 2014. Other than that, if you want a simple, easly calculatable and defensible number, I'd take as a starting point the number of cores in current (say) 2U systems at your price point (high range systems -- 24-32 cores at $15k; mid-range 12-16 cores at $8k, lower-end 8-12 core at $5k). Then note that Moore's law suggests 8-16x as many transistors per unit silicon in 2016 as now, and that on current trends, these mainly go into more cores. That suggests 64-512 cores per node depending on how much you're spending on each -- and these numbers are consistent with the graphs Matt Spinelli posted above.
Cores per physical machine doesn't seem to be a particularly good metric, I think. We haven't really seen that number grow in particularly non-linear ways, and many-core hardware has been available COTS since the 90's (though it was relatively specialized at that point). If your task is really that parallel, quadrupling the number of cores shouldn't change it that much. We've always had the option of faster-but-fewer-cores, which should still be available to you in 6 years if you find that you don't scale well with the current number of cores.
If your application is really embarrassingly parallel, why are you unwilling to consider GPU solutions?
How quickly do you plan to rotate the hardware? Leave old machines till they die, or replace them proactively as they start to slow the cluster down? How many machines are we talking about? What kind of interconnect technology are you considering? For many cluster applications that is the limiting factor.
The drdobbs article above is not a bad analysis, but I think it misses the point just a tad. It's going to be a significant while before many mainstream apps can take advantage of really parallel general compute hardware (and many tasks simply can't be parallelized much), and when they do, they'll be using graphics cards and (to a less extent) soundcards as the specialized hardware they use to do it.

Resources