Has anyone used Charm++ (http://charm.cs.uiuc.edu/research/charm/) for parallelization outside of HPC (supercomputers)?
If so, can you tell me about your experience?
Thanks
I'm one of the lead developers of Charm++. Our use cases all center on application users getting the highest performance from whatever hardware they have available to them - really, that's what any sort of parallel computing is about.
For a very large portion of users, that means simply using all of the cores on a multicore desktop workstation or laptop. Many more also use simple Linux clusters with commodity network hardware (Ethernet or InfiniBand). These are usually fairly small systems, up to a few dozen nodes - hardly a supercomputer!
We've demonstrated applications for domains as diverse as graphics, resource allocation (e.g. planning, scheduling, combinatorial optimization), applied computer vision, and more. Other users have demonstrated volunteer computing applications using Charm++ (think SETI@home or Folding@home).
The vast majority of CPUs coming out nowadays contain multiple cores which can operate at the same time - in parallel.
I'm just wondering, from the point of executing a program as quickly as possible using all available CPU cores, does a programmer need to take into consideration that the software being developed will be running on a multi-core CPU? For instance, would the software being developed have to be manually configured to assign different tasks to each CPU core? Or does the OS/CPU automatically identify and choose which parts of a program can run - in parallel - on different cores?
Apologies if this may seem like a simple or silly question. I'm completely new to the topic of parallel programming and I've come across some conflicting information early on in my research - some sources state that the programmer must manually configure their software in order to utilise more than one CPU core (the more believable option in my opinion) - and other sources state that the OS/CPU automatically identifies and chooses which tasks can be run in parallel on different CPU cores (the less believable option in my opinion due to the complexity involved in automatically identifying this).
Just in case different Operating Systems, CPUs or Programming Languages perform differently in a parallel computing or multi-core environment - I will be using Windows 7 as my OS, an Intel Dual Core i7 Processor, and OpenCL as the programming language.
Any help is much appreciated.
In practice this occurs semi-automatically.
A more detailed answer depends on the nature of your application, your preferred programming model, and the target architecture.
More explanation:
To exploit multicore hardware efficiently (in your case, to keep as many cores busy as possible), you need to do two things: 1) "parallelize" the algorithm itself, that is, make it "concurrent"; and 2) use one of the multi-threading (most common) or multi-process (rarer) parallel programming APIs, such as OpenMP, Intel TBB, OpenCL, POSIX Threads or (for multi-process) MPI, to assign the different "pieces" of your concurrent program to different threads (or, more rarely, processes) efficiently and often automatically.
One of the simplest possible examples of this kind of parallel programming (using OpenMP) is given here.
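For a rough idea of what those two steps look like in code, here is a minimal sketch. It is not the linked OpenMP example: it uses Python's standard `concurrent.futures` module instead, and the names `chunked`, `sum_of_squares` and `parallel_sum_of_squares` are made up for illustration. The work is first split into independent chunks (step 1), then a pool hands the chunks to worker threads (step 2). Note that in CPython the global interpreter lock keeps pure-Python threads from running on several cores at once; swapping in `ProcessPoolExecutor` is the usual way to get real multi-core execution for CPU-bound code.

```python
from concurrent.futures import ThreadPoolExecutor
import os


def chunked(items, n_chunks):
    """Split items into at most n_chunks contiguous, roughly equal pieces."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def sum_of_squares(chunk):
    """The independent piece of work that each worker performs."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(items, workers=None):
    """Step 1: split the data. Step 2: let a pool assign chunks to workers."""
    workers = workers or os.cpu_count() or 1
    chunks = chunked(items, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum_of_squares, chunks))
    # Combining the partial results is the only serial step left.
    return sum(partial_sums)


if __name__ == "__main__":
    data = list(range(1_000_000))
    print(parallel_sum_of_squares(data))
```

The programmer decides how to cut the problem into pieces; the pool (like an OpenMP runtime) decides which worker runs which piece. That split of responsibilities is what "semi-automatically" means here.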
Now, you've said that you are using OpenCL as a programming model for the CPU. In certain cases, when you use a vendor-provided OpenCL implementation (like Intel OpenCL), you can semi-automatically have an OpenCL kernel executed by multiple threads using "NDRange" and other OpenCL concepts, as explained here for the Intel Xeon Phi co-processor (not exactly CPU programming, but a similar idea) or here (a more general, but more advanced, article).
However, using OpenCL as a general-purpose multi-threading API for the CPU is definitely not the simplest approach, and it is not always optimal in terms of final performance. There are certain application types where OpenCL makes some sense for general-purpose CPU multi-threading, but again it very much depends on the nature of your algorithm and the target architecture.
There is one rather dated, but still reasonable, post about OpenCL vs. OpenMP/TBB on Stack Overflow. It is dated in the sense that OpenMP 4.0 now also provides solid capabilities for threading + SIMD programming (which will become interesting to you if you explore this topic in more detail). That's why I would say that OpenMP seems to be the number-one choice nowadays, but TBB, MPI or OpenCL might also be appropriate in certain cases.
I know the question is only partially programming-related, because the answer I would like to get originally comes from these two questions:
Why is the number of CPU cores so low (vs. GPUs)? and Why aren't we using GPUs instead of CPUs, that is, GPUs only or CPUs only? (I know that GPUs are specialized while CPUs are more for multitasking, etc.) I also know that there are memory (host vs. GPU) limitations, along with precision and cache limitations. But in a high-end to high-end hardware comparison, GPUs offer far higher raw performance than CPUs.
So my question is: could we use GPUs instead of CPUs for the OS, applications, etc.?
The reason I am asking this question is that I would like to know why current computers still use two main processing units (CPU/GPU) with two main memory and caching systems (CPU/GPU), even though that is not something a programmer would like.
Current GPUs lack many of the facilities of a modern CPU that are generally considered important (crucial, really) to things like an OS.
Just for example, an OS normally uses virtual memory and paging to manage processes. Paging allows the OS to give each process its own address space, (almost) completely isolated from every other process. At least based on publicly available information, most GPUs don't support paging at all (or at least not in the way an OS needs).
GPUs also operate at much lower clock speeds than CPUs. Therefore, they only provide high performance for embarrassingly parallel problems. CPUs generally provide much higher performance for single-threaded code. Most of the code in an OS isn't highly parallel -- in fact, a lot of it is quite difficult to make parallel at all (e.g., for years, Linux had a giant lock to ensure only one thread executed most kernel code at any given time). For this kind of task, a GPU would be unlikely to provide any benefit.
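To put rough numbers on that trade-off, here is a back-of-the-envelope sketch in the style of Amdahl's law. It is an illustration added for clarity, not a measurement: the function name `relative_speed` and the figures used (4 fast cores versus 512 cores that each run at 0.3x the speed) are assumptions chosen only to show the shape of the curve.

```python
def relative_speed(parallel_fraction, cores, per_core_speed=1.0):
    """Speed relative to a single reference core of speed 1.0.

    The serial part of the work runs on one core; the parallel part is
    spread evenly over all cores. Every core runs at per_core_speed.
    """
    serial_time = (1.0 - parallel_fraction) / per_core_speed
    parallel_time = parallel_fraction / (per_core_speed * cores)
    return 1.0 / (serial_time + parallel_time)


if __name__ == "__main__":
    # Hypothetical machines: a few fast cores vs. many slow ones.
    for p in (0.10, 0.50, 0.99, 1.00):
        cpu_like = relative_speed(p, cores=4, per_core_speed=1.0)
        gpu_like = relative_speed(p, cores=512, per_core_speed=0.3)
        print(f"parallel fraction {p:.2f}: "
              f"CPU-like {cpu_like:.2f}x, GPU-like {gpu_like:.2f}x")
```

Under these assumptions, the many-slow-cores machine stays below the speed of a single fast core whenever half or more of the work is serial, and only pulls ahead once nearly all of the work can run in parallel.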
From a programming viewpoint, a GPU is a mixed blessing (at best). People have spent years working on programming models to make programming a GPU even halfway sane, and even so it's much more difficult (in general) than CPU programming. Given the difficulty of getting even relatively trivial things to work well on a GPU, I can't imagine attempting to write anything even close to as large and complex as an operating system to run on one.
GPUs are designed for graphics related processing (obviously), which is inherently something that benefits from parallel processing (doing multiple tasks/calculations at once). This means that unlike modern CPUs, which as you probably know usually have 2-8 cores, GPUs have hundreds of cores. This means that they are uniquely suited to processing things like ray tracing or anything else that you might encounter in a 3D game or other graphics intensive activity.
CPUs on the other hand have a relatively limited number of cores because the tasks that a CPU faces usually do not benefit from parallel processing nearly as much as rendering a 3D scene would. In fact, having too many cores in a CPU could actually degrade the performance of a machine, because of the nature of the tasks a CPU usually does and the fact that a lot of programs would not be written to take advantage of the multitude of cores. This means that for internet browsing or most other desktop tasks, a CPU with a few powerful cores would be better suited for the job than a GPU with many, many smaller cores.
Another thing to note is that more cores usually means more power needed. This means that a 256-core phone or laptop would be pretty impractical from a power and heat standpoint, not to mention the manufacturing challenges and costs.
Usually operating systems are pretty simple, if you look at their structure.
But parallelizing them will not improve speeds much, only raw clock speed will do.
GPUs simply lack hardware features and many of the instructions that an OS needs; it's a matter of sophistication. Just think of the virtualization features (Intel VT-x or AMD's AMD-V).
GPU cores are like dumb ants, whereas a CPU is like a complex human, so to speak. Both have different energy consumption because of this and produce very different amounts of heat.
See this extensive Super User answer for more info.
Because nobody will spend money and time on this. Except for some enthusiasts like that one: http://gerigeri.uw.hu/DawnOS/history.html (now here: http://users.atw.hu/gerigeri/DawnOS/history.html)
Dawn now works on GPU-s: with a new OpenCL capable emulator, Dawn now
boots and works on Graphics Cards, GPU-s and IGP-s (with OpenCL 1.0).
Dawn is the first and only operating system to boot and work fully on
a graphics chip.
What is the difference between a Cluster and MPP supercomputer architecture?
In a cluster, each machine is largely independent of the others in terms of memory, disk, etc. They are interconnected using some variation on normal networking. The cluster exists mostly in the mind of the programmer and how s/he chooses to distribute the work.
In a Massively Parallel Processor, there really is only one machine with thousands of CPUs tightly interconnected. MPPs have exotic memory architectures to allow extremely high speed exchange of intermediate results with neighboring processors.
The major variants are SIMD (Single Instruction, Multiple Data) and MIMD (Multiple Instruction, Multiple Data). In a SIMD system, every processor is executing the same instruction at the same time, only on different bits of memory. Essentially, there is only one Program Counter. In a MIMD machine, each CPU has its own PC.
MPPs can be a bitch to program and are of use only on algorithms that are embarrassingly parallel (that's actually what they call it). However, if you have such a problem, then an MPP can be shockingly fast. They are also incredibly expensive.
The TOP500 list uses a slightly different distinction between an MPP and a cluster, as explained in the Dongarra et al. paper:
[a cluster is a] parallel computer system comprising an integrated collection of independent nodes, each of which is a system in its own right, capable of independent operation and derived from products developed and marketed for other stand-alone purposes
Compared to a cluster, a modern MPP (such as the IBM Blue Gene) is more tightly-integrated: individual nodes cannot run on their own and they are connected by a custom network (like a multidimensional torus). But, similarly to a cluster, there is no single, shared memory spanning all the nodes (note: an MPP might be hierarchical and shared memory might be used inside a single node (NUMA), or between a handful of nodes).
I'd thus be extremely careful about using the terms SIMD and MIMD in this context, as they usually describe shared memory architectures (SMP).
Update:
Dongarra et al. link
Update:
MPP can have nodes that use shared memory internally; but the whole MPP memory is not shared.
A cluster is a bunch of machines, normally with an Ethernet interconnect (read: network), each running its own separate copy of an OS, which happen to serve a single purpose.
An MPP supercomputer usually implies a very fast proprietary interconnect (e.g. SGI NUMAlink) that supports either Distributed Shared Memory (processes running on different MPP nodes use shared memory over the fast interconnect to share data as if they were running on a single computer) or even a Single System Image (a single instance of an operating system, mostly Linux, running on all the nodes at the same time as if on a single machine; e.g. "ps aux" on any node will show you all the processes running on the MPP).
As you can see, the definition is quite fluid; it's more a question of scale than of clear-cut differences.
I've searched a lot of HPC literature and couldn't find a concrete definition of MPP. There is broad consensus that a cluster consists of multiple interconnected regular personal computers or workstations, usually coupled with standard technologies (like Ethernet or open-source operating systems). The term MPP is usually applied to more proprietary approaches to building distributed-memory computers, usually involving proprietary technologies.
For example: Tianhe-2 is considered a cluster because it uses x86-64 nodes and a regular operating system (Kylin Linux). Sunway TaihuLight is considered an MPP because its nodes have their own particular architecture, SW26010, and run their own operating system, called Sunway Raise OS.
The most concrete explanation of this matter I found was in Sourcebook of Parallel Computing (Dongarra et al.):
We note that the term cluster can be applied both broadly (any system built with a significant number of commodity components) or narrowly (only commodity components and open-source software). In fact, there is no precise definition of a cluster. Some of the issues that are used to argue that a system is a massively parallel processor (MPP) instead of a cluster include proprietary interconnects (...), particularly ones designed for a specific parallel computer, and special software that treats the entire system as a single machine, particularly for the system administrators. Clusters may be built from personal computers or workstations (either single processors or symmetric multiprocessors (SMPs)) and may run either open-source or proprietary operating systems.
I'm designing a system that will be on-line in 2016 and run on commodity 1U or 2U server boxes. I'd like to understand how parallel the software will need to be, so I'd like to estimate the number of cores per physical machine. I'm not interested in more exotic hardware like video game console processors, GPUs or DSPs. I could extrapolate based on when chips were released by Intel or AMD, but this historical information seems scarce.
Thanks.
I found the following charts from Design for Manycore Systems:
As the great computer scientist Yogi Berra said, "It's tough to make predictions, especially about the future." Given the relative recency of multicore systems, I think you're right to be wary of extrapolations. Still, you need a number to aim for.
M. Spinelli's graphs are very valuable, and (I think) have the benefit of being based on real plans out to 2014. Other than that, if you want a simple, easily calculable and defensible number, I'd take as a starting point the number of cores in current (say) 2U systems at your price point (high-range systems: 24-32 cores at $15k; mid-range: 12-16 cores at $8k; lower-end: 8-12 cores at $5k). Then note that Moore's law suggests 8-16x as many transistors per unit silicon in 2016 as now, and that on current trends, these mainly go into more cores. That suggests 64-512 cores per node depending on how much you're spending on each, and these numbers are consistent with the graphs Matt Spinelli posted above.
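The arithmetic behind that estimate can be written out as a short sketch. The assumptions are mine and should be adjusted to taste: a six-year horizon, a transistor doubling period somewhere between 18 and 24 months (which is what gives the 8-16x range), and every extra transistor going into extra cores. The function names are invented for this example.

```python
def growth_factor(years, doubling_period_months):
    """Transistor growth after `years` if the count doubles every period."""
    return 2 ** (years * 12 / doubling_period_months)


def projected_cores(current_cores, years, doubling_period_months):
    """Assume all of the extra transistors are spent on extra cores."""
    return round(current_cores * growth_factor(years, doubling_period_months))


if __name__ == "__main__":
    years = 6                        # assumed planning horizon
    slow, fast = 24, 18              # assumed doubling periods, in months
    tiers = {                        # core counts quoted in the answer above
        "lower-end": (8, 12),
        "mid-range": (12, 16),
        "high-range": (24, 32),
    }
    for name, (low, high) in tiers.items():
        conservative = projected_cores(low, years, slow)
        aggressive = projected_cores(high, years, fast)
        print(f"{name}: {conservative}-{aggressive} cores per node")
```

The most conservative case (8 cores, 8x growth) and the most aggressive case (32 cores, 16x growth) bracket the 64-512 range quoted above.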
Cores per physical machine doesn't seem to be a particularly good metric, I think. We haven't really seen that number grow in particularly non-linear ways, and many-core hardware has been available COTS since the 90's (though it was relatively specialized at that point). If your task is really that parallel, quadrupling the number of cores shouldn't change it that much. We've always had the option of faster-but-fewer-cores, which should still be available to you in 6 years if you find that you don't scale well with the current number of cores.
If your application is really embarrassingly parallel, why are you unwilling to consider GPU solutions?
How quickly do you plan to rotate the hardware? Leave old machines till they die, or replace them proactively as they start to slow the cluster down? How many machines are we talking about? What kind of interconnect technology are you considering? For many cluster applications that is the limiting factor.
The Dr. Dobb's article above is not a bad analysis, but I think it misses the point just a tad. It's going to be a significant while before many mainstream apps can take advantage of truly parallel general-purpose compute hardware (and many tasks simply can't be parallelized much), and when they do, they'll be using graphics cards and (to a lesser extent) sound cards as the specialized hardware to do it.
I have a large scientific computing task that parallelizes very well with SMP, but at too fine grained a level to be easily parallelized via explicit message passing. I'd like to parallelize it across address spaces and physical machines. Is it feasible to create a scheduler that would parallelize already multithreaded code across multiple physical computers under the following conditions:
The code is already multithreaded and can scale pretty well on SMP configurations.
The fact that not all of the threads are running in the same address space or on the same physical machine must be transparent to the program, even if this comes at a significant performance penalty in some use cases.
You may assume that all of the physical machines involved are running operating systems and CPU architectures that are binary compatible.
Things like locks and atomic operations may be slow (having network latency to deal with and all) but must "just work".
Edits:
I only care about throughput, not latency.
I'm using the D programming language, and I'm almost sure there's no canned solution. I'm more interested in whether this is feasible in principle than in a particular canned solution.
My first thought is to use Apache Hadoop. It provides distributed storage and distributed computing. You can synchronize across processes by using files as locks.
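To show what "files as locks" means in practice, here is a minimal sketch. It is not Hadoop code: it uses Python's `os.open` with `O_CREAT | O_EXCL` on a local file system, and the helper names `try_acquire` and `release` are hypothetical. Whether a particular distributed file system offers the same atomic create-if-absent guarantee is something you would have to check for that system.

```python
import os
import tempfile


def try_acquire(lock_path):
    """Try to take the lock by creating lock_path; return True on success.

    O_CREAT | O_EXCL makes the call fail if the file already exists, so
    only one process can succeed in creating it.
    """
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))  # record the owner, useful for debugging
    return True


def release(lock_path):
    """Release the lock by deleting the lock file."""
    os.remove(lock_path)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "job.lock")
    print(try_acquire(path))  # first attempt creates the file
    print(try_acquire(path))  # second attempt finds it already there
    release(path)
    print(try_acquire(path))  # free again after release
    release(path)
```

Every acquire and release is a file-system round trip, so this only suits coarse-grained coordination. It would be far too slow as a drop-in replacement for the fine-grained locks and atomics described in the question.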
It sounds like you want something like SCRAMNet, although that requires custom hardware. I don't know if there is a software-only solution. Also, it's likely that even if you got it working, you'd find your networked version was actually running slower than when it was previously on a single machine. You may just have to bite the bullet and re-design your app.
Since your point 2 suggests that you can live with some performance degradation you might want to consider a hybrid approach: SMP within individual machines, message-passing between machines. I'm not familiar with D so can offer no specific advice. Further I've seen mixed reviews of the hybrid approach for OpenMP+MPI, but it might suit you and your application.
EDIT: You might want to Google around for 'partitioned global address space' which seems to describe your desired approach quite accurately. As before, I have no advice on using D for this.