Faking a Single Address Space - performance

I have a large scientific computing task that parallelizes very well with SMP, but at too fine-grained a level to be easily parallelized via explicit message passing. I'd like to parallelize it across address spaces and physical machines. Is it feasible to create a scheduler that would parallelize already multithreaded code across multiple physical computers under the following conditions:
The code is already multithreaded and can scale pretty well on SMP configurations.
The fact that not all of the threads are running in the same address space or on the same physical machine must be transparent to the program, even if this comes at a significant performance penalty in some use cases.
You may assume that all of the physical machines involved are running operating systems and CPU architectures that are binary compatible.
Things like locks and atomic operations may be slow (having network latency to deal with and all) but must "just work".
Edits:
I only care about throughput, not latency.
I'm using the D programming language, and I'm almost sure there's no canned solution. I'm more interested in whether this is feasible in principle than in a particular canned solution.

My first thought is to use Apache Hadoop. It provides distributed storage and distributed computing. You can synchronize across processes by using files as locks.
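For what it's worth, here's a minimal sketch of the "file as lock" idea in C++ on a shared POSIX filesystem. The path is made up, and whether exclusive-create is truly atomic depends on the filesystem, so treat it as an illustration only:

```cpp
// Sketch of a cross-process lock implemented as a file on shared storage.
// O_CREAT | O_EXCL makes "create the lock file" fail if it already exists.
#include <fcntl.h>
#include <unistd.h>
#include <chrono>
#include <thread>

static bool try_acquire(const char* path) {
    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0644);
    if (fd < 0) return false;   // someone else holds the lock (or an error)
    close(fd);
    return true;
}

static void release(const char* path) { unlink(path); }

int main() {
    const char* lock = "/shared/scratch/job.lock";   // hypothetical shared path
    while (!try_acquire(lock))                       // poll until we own it
        std::this_thread::sleep_for(std::chrono::milliseconds(200));
    // ... critical section: update the shared data ...
    release(lock);
}
```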

It sounds like you want something like SCRAMNet, although that requires custom hardware. I don't know if there is a software-only solution. Also, it's likely that even if you got it working, you'd find your networked version was actually running slower than when it was previously on a single machine. You may just have to bite the bullet and re-design your app.

Since your point 2 suggests that you can live with some performance degradation, you might want to consider a hybrid approach: SMP within individual machines, message passing between machines. I'm not familiar with D, so I can offer no specific advice. Further, I've seen mixed reviews of the hybrid OpenMP+MPI approach, but it might suit you and your application.
EDIT: You might want to Google around for 'partitioned global address space' which seems to describe your desired approach quite accurately. As before, I have no advice on using D for this.
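To make the hybrid idea concrete, here is a rough sketch in C++ rather than D (so not a drop-in answer): MPI provides the message passing between machines, and plain threads provide the SMP part inside each machine. The work split and function names are illustrative only.

```cpp
// Hybrid sketch: one MPI rank per machine, std::thread for SMP within it.
// Build with an MPI compiler wrapper, e.g. mpicxx hybrid.cpp -o hybrid -pthread
#include <mpi.h>
#include <thread>
#include <vector>

static void do_chunk(int rank, int tid) {
    // ... fine-grained shared-memory work for this thread (stub) ...
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);   // only the main thread makes MPI calls here
    int rank = 0, nranks = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    // Coarse split across machines via the rank; fine split via threads.
    unsigned nthreads = std::thread::hardware_concurrency();
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nthreads; ++t)
        pool.emplace_back(do_chunk, rank, static_cast<int>(t));
    for (auto& th : pool) th.join();

    // Exchange only coarse-grained results between machines here.
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
}
```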

Related

How to cheaply process large amounts of data (local setup or cloud)?

I would like to try testing NLP tools against dumps of the web and other corpora, sometimes larger than 4 TB.
If I run this on a mac it's very slow. What is the best way to speed up this process?
deploying to EC2/Heroku and scaling up servers
buying hardware and creating a local setup
Just want to know how this is usually done (processing terabytes in a matter of minutes/seconds), if it's cheaper/better to experiment with this in the cloud, or do I need my own hardware setup?
Regardless of the brand of your cloud, the whole idea of cloud computing is to be able to scale up and scale down in a flexible way.
In a corporate environment you might have a scenario in which you consistently need the same amount of computing resources, so if you already have them, the cloud is a hard case to make because you just don't need the flexibility it provides.
On the other hand if your processing tasks are not quite predictable, your best solution is the cloud because you will be able to pay more when you use more computing power, and then pay less when you don't need as much power.
Take into account, though, that not all cloud solutions are the same. For instance, a web role is a highly web-dedicated node whose main purpose is to serve web requests: the more requests are served, the more you pay.
A virtual machine role, on the other hand, is almost like being given exclusive use of a computer system that you can use for anything you want, running either a Linux or a Windows OS; the system keeps running even when you are not using it to its full capacity.
Overall, the costs depend on your own scenario and how well it fits to your needs.
I suppose it depends quite a bit on what kind of experimenting you are wanting to do, for what purpose and for how long.
If you're looking into buying the hardware and running your own cluster then you probably want something like Hadoop or Storm to manage the compute nodes. I don't know how feasible it is to go through 4TB of data in a matter of seconds but again that really depends on the kind of processing you want to do. Counting the frequency of words in the 4TB corpus should be pretty easy (even on your mac), but building SVMs or doing something like LDA on the lot won't be. One issue you'll run into is that you won't have enough memory to fit all of that, so you'll want a library that can run the methods off disk.
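As a rough illustration of the "easy even on your mac" end of that spectrum, a word count can stream the corpus once, so memory grows with the vocabulary rather than the corpus size (the file name is hypothetical):

```cpp
// Streaming word count: reads the corpus token by token, so memory use is
// bounded by the vocabulary, not by the corpus size.
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    std::ifstream in("corpus.txt");              // hypothetical input file
    std::unordered_map<std::string, long long> counts;
    std::string word;
    while (in >> word)                           // whitespace tokenization only
        ++counts[word];
    for (const auto& kv : counts)
        std::cout << kv.first << '\t' << kv.second << '\n';
}
```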
If you don't know exactly what your requirements are then I would use EC2 to set up a test rig to gain a better understanding of what it is you want to do and how much grunt/memory it needs to get done in the amount of time you require.
We recently bought two compute nodes, 128 cores each with 256 GB of memory and a few terabytes of disk space, for I think around £20k or so. These are AMD Interlagos machines. That said, the compute cluster already had InfiniBand storage, so we just had to hook up to that and buy the two compute nodes, not the whole infrastructure.
The obvious thing to do here is to start off with a smaller data set, say a few gigabytes. That'll get you started on your mac, you can experiment with the data and different methods to get an idea of what works and what doesn't, and then move your pipeline to the cloud, and run it with more data. If you don't want to start the experimentation with a single sample, you can always take multiple samples from different parts of the full corpus, just keep the sample sizes down to something you can manage on your own workstation to start off with.
As an aside, I highly recommend the scikit-learn project on GitHub for machine learning. It's written in Python, but most of the matrix operations are done in Fortran or C libraries so it's pretty fast. The developer community is also extremely active on the project. Another good library that is perhaps a bit more approachable (depending on your level of expertise) is NLTK. It's nowhere near as fast but makes a bit more sense if you're not familiar with thinking about everything as a matrix.
UPDATE
One thing I forgot to mention is the time your project will be running, or to put it another way, how long you will get some use out of your specialty hardware. If it's a project that is supposed to serve the EU parliament for the next 10 years, then you should definitely buy the hardware. If it's a project for you to get familiar with NLP, then forking out the money might be a bit much, unless you're also planning on starting your own cloud computing rental service :).
That said, I don't know what the real world costs of using EC2 are for something like this. I've never had to use them.

Performing numerical calculations in a virtualized setting?

I'm wondering what kind of performance hit numerical calculations will have in a virtualized setting? More specifically, what kind of performance loss can I expect from running CPU-bound C++ code in a virtualized windows OS as opposed to a native Linux one, on rather fast x86_64 multi-core machines?
I'll be happy to add details as needed, but as I don't know much about virtualization, I don't know what info is relevant.
Processes are just bunches of threads, which are streams of instructions executing in a sequential fashion. In modern virtualisation solutions, as far as the CPU is concerned, host and guest processes execute together and differ only in that the I/O of the latter is trapped and virtualised. Memory is also virtualised, but that occurs more or less in the hardware MMU. Guest instructions are executed directly by the CPU (otherwise it would not be virtualisation but emulation), and as long as they do not access any virtualised resources they execute just as fast as the host instructions. In the end it all depends on how well the CPU can cope with the increased number of running processes.
There are lightweight virtualisation solutions like zones in Solaris that partition the process space in order to give the appearance of multiple copies of the OS but it all happens under the umbrella of a single OS kernel.
The performance hit for pure computational codes is very small, often under 1-2%. The catch is that in reality all programs read and write data and computational codes usually read and write lots of data. Virtualised I/O is usually much slower than direct I/O even with solutions like Intel VT-* or AMD-V.
Exact numbers depend heavily on the specific hardware.
Goaded by @Mitch Wheat's unarguable assertion that my original post here was not an answer, here's an attempt to recast it as an answer:
I work mostly on HPC in the energy sector. Some of the computations that my scientist colleagues run take O(10^5) CPU-hours; we're seriously thinking about O(10^6) CPU-hour jobs in the near future.
I get well paid to squeeze every last drop of performance out of our codes; I'd think it was a good day's work if I could knock 1% off the run-time of some of our programs. Sometimes it has taken me a month to get that sort of performance improvement. Sure, I may be slow, but it's still cost-effective for our scientists.
I shudder, therefore, when bright salespeople offer the latest and best in data center software (of which virtualization is one aspect), which will only, as I see it, shackle my codes to a pile of anchor chain from a 250,000 dwt tanker (that was a metaphor).
I have read the question carefully and understand that the OP is not proposing that virtualization would help; I'm offering the perspective of a practitioner. If this is still too much of a comment, do the SO thing and vote to close; I promise I won't be offended!

Are there practical limits to the number of cores accessing the same memory?

Will the current trend of adding cores to computers continue? Or is there some theoretical or practical limit to the number of cores that can be served by one set of memory?
Put another way: is the high powered desktop computer of the future apt to have 1024 cores using one set of memory, or is it apt to have 32 sets of memory, each accessed by 32 cores?
Or still another way: I have a multi-threaded program that runs well on a 4-core machine, using a significant amount of the total CPU. As this program grows in size and does more work, can I be reasonably confident more powerful machines will be available to run it? Or should I be thinking seriously about running multiple sessions on multiple machines (or at any rate multiple sets of memory) to get the work done?
In other words, is a purely multithreaded approach to design going to leave me in a dead end? (As using a single-threaded approach and depending on continued improvements in CPU speed would have done years back?) The program is unlikely to be run on a machine costing more than, say, $3,000. If that machine cannot do the work, the work won't get done. But if that $3,000 machine is actually a network of 32 independent computers (though they may share the same cooling fan) and I've continued my massively multithreaded approach, the machine will be able to do the work, but the program won't, and I'm going to be in an awkward spot.
Distributed processing looks like a bigger pain than multithreading was, but if that might be in my future, I'd like some warning.
Will the current trend of adding cores to computers continue?
Yes, the GHz race is over. It's not practical to ramp the speed any more on the current technology. Physics has gotten in the way. There may be a dramatic shift in the technology of fabricating chips that allows us to get round this, but it's not obviously 'just around the corner'.
If we can't have faster cores, the only way to get more power is to have more cores.
Or is there some theoretical or practical limit to the number of cores that can be served by one set of memory?
Absolutely there's a limit. In a shared memory system the memory is a shared resource and has a limited amount of bandwidth.
Max processes = (Memory Bandwidth) / (Bandwidth required per process)
Now - that 'Bandwidth per process' figure will be reduced by caches, but caches become less efficient if they have to be coherent with one another because everyone is accessing the same area of memory. (You can't cache a memory write if another CPU may need what you've written.)
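To put illustrative numbers on that formula (made up, not measurements): a machine with 50 GB/s of total memory bandwidth running threads that each stream 5 GB/s saturates at about 50 / 5 = 10 such threads, no matter how many cores it has.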
When you start talking about huge systems, shared resources like this become the main problem. It might be memory bandwidth, CPU cycles, hard drive access, network bandwidth. It comes down to how the system as a whole is structured.
You seem to be really asking for a vision of the future so you can prepare. Here's my take.
I think we're going to see a change in the way software developers see parallelism in their programs. At the moment, I would say that a lot of software developers see the only way of using multiple threads as having lots of them doing the same thing. The trouble is they're all contending for the same resources. This then means lots of locking needs to be introduced, which causes performance issues and subtle bugs which are infuriating and time-consuming to solve.
This isn't sustainable.
Manufacturing worked out at the beginning of the 20th century that the fastest way to build lots of cars wasn't to have lots of people working on one car and then, when that one was done, move them all on to the next car. It was to split the process of building the car into lots of small jobs, with the output of one job feeding the next. They called it the assembly line. In hardware design it's called pipelining, and I think we'll see software designs move to it more and more, as it minimizes the problem of shared resources.
Sure - there's still a shared resource at the output of one stage and the input of the next, but this is only between two threads/processes and is much easier to handle. Standard methods can also be adopted for how these interfaces are built, and message-queueing libraries seem to be making big strides here.
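A bare-bones sketch of that pipeline shape, with the hand-off between two stages reduced to one small thread-safe queue (an illustration, not a substitute for a real message-queueing library):

```cpp
// Two-stage pipeline: a producer stage feeds a consumer stage through one
// small thread-safe queue, so the only shared resource is the hand-off point.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

class Channel {
    std::queue<int> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool closed_ = false;
public:
    void push(int v) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(v); }
        cv_.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
        cv_.notify_all();
    }
    std::optional<int> pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;     // closed and drained
        int v = q_.front(); q_.pop();
        return v;
    }
};

int main() {
    Channel ch;
    std::thread stage1([&] {                     // stage 1: produce work items
        for (int i = 0; i < 10; ++i) ch.push(i * i);
        ch.close();
    });
    std::thread stage2([&] {                     // stage 2: consume and process
        while (auto v = ch.pop()) std::cout << *v << '\n';
    });
    stage1.join();
    stage2.join();
}
```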
There's not one solution for all problems though. This type of pipeline works great for high-throughput applications that can absorb some latency. If you can't live with the latency, you have no option but to go the 'many workers on a single task' route. Those are the ones you ideally want to be throwing at SIMD machines/array processors like GPUs, but that approach only really excels with a certain type of problem: problems where there's lots of data to process in the same way, and there's very little or no dependency between data items.
Having a good grasp of message queuing techniques and similar for pipelined systems, and utilising fine grained parallelism on GPUs through libraries such as OpenCL, will give you insight at both ends of the spectrum.
Update: Multi-threaded code may run on clustered machines, so this issue may not be as critical as I thought.
I was carefully checking out the Java Memory Model in the JLS, chapter 17, and found it does not mirror the typical register-cache-main memory model of most computers. There were opportunities there for a multi-memory machine to cleanly shift data from one memory to another (and from one thread running on one machine to another running on a different one). So I started searching for JVMs that would run across multiple machines. I found several old references--the idea has been out there, but not followed through. However, one company, Terracotta, seems to have something, if I'm reading their PR right.
At any rate, it rather seems that when PC's typically contain several clustered machines, there's likely to be a multi-machine JVM for them.
I could find nothing outside the Java world, but Microsoft's CLR ought to provide the same opportunities. C and C++ and all the other .exe languages might be more difficult. However, Terracotta's websites talk more about linking JVMs rather than one JVM on multiple machines, so their tricks might work for executable languages also (and maybe the CLR, if needed).

Converting a parallel program to a cluster program. From OpenMP to?

I want to write a code converter that takes an OpenMP based parallel program and runs it on a cluster.
How do I go about this problem? What libraries do I use? How do I set up a small cluster for this?
I'm finding it extremely hard to find good material about cluster computing on the internet.
EDIT: If it's impossible, then how does Intel do it? The Intel compiler seems to do exactly what I want. I don't have any specific application that I would like to run. I want to write the "converter/compiler", not the application. I understand that shared memory is different from distributed memory, but there has to be a way to sync memory, if not for all cases, then for some specific cases, even if it means that the application is written with custom constructs.
Intel has an implementation of OpenMP that works with their C++ and Fortran compilers for x86 64-bit clusters. You can get a 30-day eval version of these compilers for free. Other than that, Zifre is mostly right. If you are concerned with scalability, bite the bullet and write your parallel program in another programming model (MPI, CUDA, Cilk, ...) which is designed with distributed systems in mind. If you provide a little more information about your application, we may be able to provide more useful guidance on that front.
It seems to me that this is not a good idea.
The basic idea behind OpenMP is data-shared parallel execution. It works well when accessing shared data costs you nothing: every thread can access a variable in shared cache or RAM.
Cluster computations exploit message passing, because the computers in a cluster have distributed memory. When one process needs data from another, you have to manage the data transfer over the network, which is a time-consuming operation.
So, if you want to write such a compiler, you would have to implement data broadcasting operations (e.g. MPI_Bcast from MPI) for each shared data access in OpenMP. This would kill parallel performance entirely.
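To make that concrete, here is a toy contrast (purely illustrative): the OpenMP version reads and writes the shared array directly, while a naive distributed translation has to broadcast it over the network before any rank can touch it.

```cpp
// Toy contrast between shared-memory OpenMP and a naive distributed version
// that must broadcast the shared data before the same loop can run.
#include <mpi.h>
#include <vector>

void shared_memory_version(std::vector<double>& a) {
    // Every thread reads/writes a[] straight from RAM; sharing costs ~nothing.
    #pragma omp parallel for
    for (long i = 0; i < (long)a.size(); ++i)
        a[i] *= 2.0;
}

void naive_distributed_version(std::vector<double>& a, int rank, int nranks) {
    // The whole array crosses the network before any rank can read it.
    // Doing this per shared access, rather than once per well-chosen chunk,
    // is exactly what kills performance.
    MPI_Bcast(a.data(), (int)a.size(), MPI_DOUBLE, 0, MPI_COMM_WORLD);
    for (long i = rank; i < (long)a.size(); i += nranks)
        a[i] *= 2.0;
}
```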
This is simply not possible. You have to structure your code in a completely different way to get it to work on a cluster (programming multiple machines is very different from programming one machine).
There is no magic pixie dust to do this.
On the other hand, if you write your program with clusters in mind, it is possible to run it on a single machine (although it will obviously be slower).
SCORE/SCASH and Omni OpenMP compiler

MPI for multicore?

With the recent buzz on multicore programming, is anyone exploring the possibilities of using MPI?
I've used MPI extensively on large clusters with multi-core nodes. I'm not sure if it's the right thing for a single multi-core box, but if you anticipate that your code may one day scale larger than a single chip, you might consider implementing it in MPI. Right now, nothing scales larger than MPI. I'm not sure where the posters who mention unacceptable overheads are coming from, but I've tried to give an overview of the relevant tradeoffs below. Read on for more.
MPI is the de-facto standard for large-scale scientific computation and it's in wide use on multicore machines already. It is very fast. Take a look at the most recent Top 500 list. The top machines on that list have, in some cases, hundreds of thousands of processors, with multi-socket dual- and quad-core nodes. Many of these machines have very fast custom networks (Torus, Mesh, Tree, etc) and optimized MPI implementations that are aware of the hardware.
If you want to use MPI with a single-chip multi-core machine, it will work fine. In fact, recent versions of Mac OS X come with OpenMPI pre-installed, and you can download and install OpenMPI pretty painlessly on an ordinary multi-core Linux machine. OpenMPI is in use at Los Alamos on most of their systems. Livermore uses mvapich on their Linux clusters. What you should keep in mind before diving in is that MPI was designed for solving large-scale scientific problems on distributed-memory systems. The multi-core boxes you are dealing with probably have shared memory.
OpenMPI and other implementations use shared memory for local message passing by default, so you don't have to worry about network overhead when you're passing messages to local processes. It's pretty transparent, and I'm not sure where other posters are getting their concerns about high overhead. The caveat is that MPI is not the easiest thing you could use to get parallelism on a single multi-core box. In MPI, all the message passing is explicit. It has been called the "assembly language" of parallel programming for this reason. Explicit communication between processes isn't easy if you're not an experienced HPC person, and there are other paradigms more suited for shared memory (UPC, OpenMP, and nice languages like Erlang to name a few) that you might try first.
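For a sense of what "all the message passing is explicit" means in practice, here is roughly the smallest complete MPI program that moves data between two processes; it behaves the same whether the two ranks are two cores in one box or two separate machines.

```cpp
// Minimal explicit message passing: rank 0 sends a value, rank 1 receives it.
// Build/run (e.g. with OpenMPI): mpicxx ping.cpp -o ping && mpirun -np 2 ./ping
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        double payload = 42.0;
        MPI_Send(&payload, 1, MPI_DOUBLE, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        double payload;
        MPI_Recv(&payload, 1, MPI_DOUBLE, /*src=*/0, /*tag=*/0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        std::printf("rank 1 received %.1f\n", payload);
    }
    MPI_Finalize();
}
```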
My advice is to go with MPI if you anticipate writing a parallel application that may need more than a single machine to solve. You'll be able to test and run fine with a regular multi-core box, and migrating to a cluster will be pretty painless once you get it working there. If you are writing an application that will only ever need a single machine, try something else. There are easier ways to exploit that kind of parallelism.
Finally, if you are feeling really adventurous, try MPI in conjunction with threads, OpenMP, or some other local shared-memory paradigm. You can use MPI for the distributed message passing and something else for on-node parallelism. This is where big machines are going; future machines with hundreds of thousands of processors or more are expected to have MPI implementations that scale to all nodes but not all cores, and HPC people will be forced to build hybrid applications. This isn't for the faint of heart, and there's a lot of work to be done before there's an accepted paradigm in this space.
I would have to agree with tgamblin. You'll probably have to roll your sleeves up and really dig into the code to use MPI, explicitly handling the organization of the message-passing yourself. If this is the sort of thing you like or don't mind doing, I would expect that MPI would work just as well on multicore machines as it would on a distributed cluster.
Speaking from personal experience... I coded up some C code in graduate school to do some large scale modeling of electrophysiologic models on a cluster where each node was itself a multicore machine. Therefore, there were a couple of different parallel methods I thought of to tackle the problem.
1) I could use MPI alone, treating every processor as its own "node" even though some of them are grouped together on the same machine.
2) I could use MPI to handle data moving between multicore nodes, and then use threading (POSIX threads) within each multicore machine, where processors share memory.
For the specific mathematical problem I was working on, I tested two formulations first on a single multicore machine: one using MPI and one using POSIX threads. As it turned out, the MPI implementation was much more efficient, giving a speed-up of close to 2 for a dual-core machine as opposed to 1.3-1.4 for the threaded implementation. For the MPI code, I was able to organize operations so that processors were rarely idle, staying busy while messages were passed between them and masking much of the delay from transferring data. With the threaded code, I ended up with a lot of mutex bottlenecks that forced threads to often sit and wait while other threads completed their computations. Keeping the computational load balanced between threads didn't seem to help this fact.
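The "stay busy while messages are in flight" part is typically done with nonblocking sends and receives; a rough sketch of the pattern (the buffer names and the bit of arithmetic are illustrative):

```cpp
// Overlap pattern: start the exchange, compute on data you already have,
// then wait for the transfer to finish before touching the new data.
#include <mpi.h>
#include <vector>

void exchange_and_compute(std::vector<double>& halo_out,
                          std::vector<double>& halo_in,
                          std::vector<double>& interior,
                          int neighbor) {
    MPI_Request reqs[2];
    MPI_Irecv(halo_in.data(),  (int)halo_in.size(),  MPI_DOUBLE, neighbor, 0,
              MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(halo_out.data(), (int)halo_out.size(), MPI_DOUBLE, neighbor, 0,
              MPI_COMM_WORLD, &reqs[1]);

    // Work that does not depend on the incoming data masks the network delay.
    for (double& x : interior) x = x * 0.5 + 1.0;

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    // ... now it is safe to use halo_in ...
}
```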
This may have been specific to just the models I was working on, and the effectiveness of threading vs. MPI would likely vary greatly for other types of parallel problems. Nevertheless, I would disagree that MPI has an unwieldy overhead.
No, in my opinion it is unsuitable for most processing you would do on a multicore system. The overhead is too high, the objects you pass around must be deeply cloned, and passing large object graphs around only to then run a very small computation is very inefficient. It is really meant for sharing data between separate processes, most often running in separate memory spaces, and most often running long computations.
A multicore processor is a shared-memory machine, so there are much more efficient ways to do parallel processing that do not involve copying objects and where most of the threads run for a very small time. For example, think of a multithreaded Quicksort. The overhead of allocating memory and copying the data to a thread before it can be partitioned will make MPI with an unlimited number of processors much slower than Quicksort running on a single processor.
As an example, in Java, I would use a BlockingQueue (a shared memory construct), to pass object references between threads, with very little overhead.
Not that it doesn't have its place; see, for example, the Google search cluster that uses message passing. But it's probably not the problem you are trying to solve.
MPI is not inefficient. You need to break the problem down into chunks, pass the chunks around, and reorganize when the result is finished per chunk. No one in their right mind would pass around the whole object via MPI when only a portion of the problem is being worked on per thread. It's not an inefficiency of the interface or the design pattern; it's an inefficiency in the programmer's knowledge of how to break up a problem.
When you use a locking mechanism, the overhead of the mutex does not scale well. This is because the underlying run queue does not know when you are going to lock the thread next. You will cause more kernel-level thrashing using mutexes than with a message-passing design pattern.
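A sketch of the "break it into chunks, pass the chunks around, reassemble" shape using collectives (the problem size is assumed to divide evenly across ranks, purely for brevity):

```cpp
// Chunked work distribution: scatter slices of the input, compute locally,
// gather the partial results back on rank 0.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int total = 1 << 20;              // assumes total % nranks == 0
    const int chunk = total / nranks;

    std::vector<double> input;
    if (rank == 0) input.assign(total, 1.0);
    std::vector<double> local(chunk);

    MPI_Scatter(input.data(), chunk, MPI_DOUBLE,
                local.data(), chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (double& x : local) x *= 3.0;       // the per-chunk computation

    std::vector<double> output;
    if (rank == 0) output.resize(total);
    MPI_Gather(local.data(), chunk, MPI_DOUBLE,
               output.data(), chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    MPI_Finalize();
}
```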
MPI has a very large amount of overhead, primarily to handle inter-process communication and heterogeneous systems. I've used it in cases where a small amount of data is being passed around, and where the ratio of computation to data is large.
This is not the typical usage scenario for most consumer or business tasks, and in any case, as a previous reply mentioned, on a shared memory architecture like a multicore machine, there are vastly faster ways to handle it, such as memory pointers.
If you had some sort of problem with the properties described above, and you want to be able to spread the job around to other machines, which must be on the same high-speed network as you, then maybe MPI could make sense. I have a hard time imagining such a scenario though.
I personally have taken up Erlang (and I like it so far). The message-based approach seems to fit most of the problem, and I think it is going to be one of the key items for multi-core programming. I never knew about the overhead of MPI; thanks for pointing it out.
You have to decide if you want low-level threading or high-level threading. If you want low-level, use pthreads. You have to be careful that you don't introduce race conditions and make threading performance work against you.
I have used some OSS packages (for C and C++) that are scalable and optimize the task scheduling. TBB (Threading Building Blocks) and Cilk Plus are good, easy to code with, and get applications off the ground. I also believe they are flexible enough to integrate other threading technologies at a later point if needed (OpenMP etc.); a minimal TBB sketch follows the links below.
www.threadingbuildingblocks.org
www.cilkplus.org
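For a flavour of how little code the TBB route needs, here is a minimal parallel_for sketch (assumes TBB is installed; link with -ltbb):

```cpp
// Minimal TBB example: parallel_for splits the range into tasks and the
// scheduler load-balances them across the available cores.
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <vector>

int main() {
    std::vector<double> data(1'000'000, 1.0);
    tbb::parallel_for(tbb::blocked_range<size_t>(0, data.size()),
        [&](const tbb::blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                data[i] = data[i] * 2.0 + 1.0;   // per-element work
        });
}
```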

Resources