Parallel programming over multiple machines without clustering - parallel-processing

I'm going to be a college student at 40. I'll be studying IT and plan on doing a bachelor's project. The basic idea is to try to use neural nets to evaluate bias in media. The training data will be political blogs with well-known biases.
What I need is a programming language that can run in parallel on multiple machines that are networked, but not clustered. I have 2 Linux machines and 3 running OS X. I would prefer a language that compiles to binary rather than to bytecode or a VM, but I'll take what I can get. I don't need any GUI libraries, so that's not a constraint. I do most of my programming in Python, but I'm willing to learn another language if it'll make the parallel execution easier. Any suggestions?

I strongly suggest that you consider sticking with Python. Learning a new language, at the same time as you start tackling parallel / distributed computing, may well throw a spanner in your works that you just don't need. I believe that your time will be better spent tackling the issues of building the neural net you want, rather than learning the peculiarities of a new language. And, by reputation, Python is eminently suitable for what you plan. It does, of course, fail your requirement that it should compile to binary but I'm not sure where that is coming from.
When you write "parallel programming over multiple machines without clustering" I'm thinking "oh, he means distributed programming". I tend towards the view that parallel computing is a niche within distributed computing, in part defined by the homogeneity (from the programmer's point of view) of the resources used. This apparent homogeneity is aided tremendously if it is supported by homogeneity of hardware, so that there is little gap between vision and reality.
If what you really have is an assortment of computers of different specs and different OSes, communicating over a non-dedicated network, then I fear that you will find it difficult to build the illusion of homogeneity for the programmer (i.e. for yourself) and would be better setting out to build a distributed system from the get-go.
I just plain disagree with the answer telling you to pick up C and MPI; I think you'll make progress much faster with Python.
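To give a flavour of what I mean, here is a rough sketch of how you could hand work out to your Linux and OS X boxes over plain TCP using nothing but the standard library's multiprocessing.managers. The queue names, the word-counting "analysis" and the port/authkey are placeholders for your real training step, not a recommendation of any particular design:

    # distributed_sketch.py -- hand work to networked machines, no cluster software.
    # Run "python distributed_sketch.py serve" on one box, then
    # "python distributed_sketch.py work <server-ip>" on each of the others, then
    # "python distributed_sketch.py submit <server-ip>" to queue jobs and collect results.
    import sys
    import queue
    from multiprocessing.managers import BaseManager

    PORT, AUTHKEY = 50000, b"change-me"

    class QueueManager(BaseManager):
        """Serves a shared job queue and result queue over TCP."""

    def serve():
        job_q, result_q = queue.Queue(), queue.Queue()
        QueueManager.register("jobs", callable=lambda: job_q)
        QueueManager.register("results", callable=lambda: result_q)
        QueueManager(address=("", PORT), authkey=AUTHKEY).get_server().serve_forever()

    def connect(host):
        QueueManager.register("jobs")
        QueueManager.register("results")
        mgr = QueueManager(address=(host, PORT), authkey=AUTHKEY)
        mgr.connect()
        return mgr.jobs(), mgr.results()

    def work(host):
        jobs, results = connect(host)
        while True:                          # crude worker loop; Ctrl-C to stop
            text = jobs.get()
            results.put(len(text.split()))   # placeholder for the real analysis

    def submit(host):
        jobs, results = connect(host)
        posts = ["first blog post ...", "second blog post ...", "third blog post ..."]
        for p in posts:
            jobs.put(p)
        print([results.get() for _ in posts])   # one result per job, in any order

    if __name__ == "__main__":
        {"serve": serve, "work": work, "submit": submit}[sys.argv[1]](*sys.argv[2:])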
Good luck with your studies.
Oh, and if you just won't take my advice to forget about a new programming language, consider Haskell and Erlang.

Sounds like an interesting project. However, thinking laterally, wouldn't a GPU-based system (i.e. massively parallel) be more the soupe du jour? Hence something like C + CUDA, perhaps?
I don't know if it's still around, but Occam (from the Transputers of old) was designed as a parallel language, with its PAR and SEQ constructs. I've just read of a version for Linux.

That sounds like C + MPI to me.

Related

Most effective method to use parallel computing on different architectures

I am planning to write something to take advantage of the many devices that I have at home.
Basically my aim is to use the laptop to execute calculations, and also to use my main desktop computer to add more power (and finish the task quicker). I work with cellular simulation and chemical interactions, so it would be great for me to take advantage of all that I have available at home.
I am mainly using OS X, so I need something that works with that OS. I can code in Objective-C, C and C++.
I am aware of GCD, OpenCL and MPI, but I am not sure which way to go.
I was planning not to use the full power of my desktop but only some of the available cores (this way I can continue to work on the desktop doing other tasks that are not so resource-intensive). In particular I would love to use the graphics card's power (it is an ATI card, so no CUDA), since all I mainly do is spreadsheets, word processing and coding with Xcode, and the graphics card's resources are basically unused in that scenario.
Is there a specific set of libraries or APIs, among the aforementioned three, that would allow me to selectively route tasks and use resources on another machine without leaving the control totally to the compiler? I've heard that GCD is great, but that it gives very limited control over where the blocks are executed, while MPI is at the other end of the spectrum... OpenCL seems to be in the middle.
Before diving into one of these technologies I would like to know which one would most likely suit my needs; I am sure that some other researcher has already used parallel computing successfully to achieve what I am trying to do.
Thanks in advance.
MPI is more for large-scale scientific computing (many processors, many nodes, etc.), not for a weekend project. For what you describe I would suggest OpenCL, or one of the more distributed messaging frameworks such as ZeroMQ or RabbitMQ (AMQP), or a combination of OpenCL and messaging. Or, even simpler, consider multithreading; I would suggest OpenMP for that. I'm not sure whether you are looking for direct solvers or parallel functions, but many already exist for both GPUs and CPUs, and you can find them on the web.
Sorry, but this question simply cannot be meaningfully answered as posed. To be sure, I could toss out a collection of buzzwords describing various technologies to look at like GCD, OpenMPI, OpenCL, CUDA and any number of other technologies which allow one to run a single program on multiple cores, multiple programs on different cooperating computers, or a single program distributed across CPU and GPU, and it sounds like you know about a number of those already so I wouldn't even be adding much value in listing the buzzwords.
To simply toss out such terms without knowing the full specifics of the problem you're trying to solve, however, is a bit like saying that you know English, French and a little German so sure, by all means - mix them all together in a single paragraph without knowing anything about the target audience! Similarly, you can parallelize a given computation in any number of ways, across any number of different processing elements, but whether that parallelization is actually a win or not is going to be entirely dependent on the nature of the algorithm, its data dependencies, how much computation is expected for each reasonable "work chunk", and whether it can be executed on a GPU with sufficient numeric precision, among many other factors. The more complex the technology you choose, the more those factors matter and the greater the possibility that the resulting code will actually be slower than its single-threaded, single machine counterpart. IPC overhead and data copying can, and frequently do, swamp all of the gains one might realize from trying to naively parallelize something and then add additional overhead on top of that, resulting in a net loss. This is why engineers who can do this kind of work meaningfully and well are in such high demand. :)
Without knowing anything about your calculations, I would move in baby steps. First try a simple multi-processor framework like GCD (which is already built into OS X and requires no additional dependencies to use) and figure out how to factor your code such that it can effectively use all of the available cores on a single machine. Once you've learned where the wins are (and if there even are any - if multi-threading isn't helping, multi-machine parallelization almost certainly won't either), try setting up several instances of the calculation on several machines with a simple IPC model that allows for distributing the work. Having already factored your algorithm(s) for multiple threads, it should be comparatively straightforward to further generalize the approach across multiple machines (though it bears noting that the two are NOT the same problem and either way you still want to use all the cores available on any of the given target machines, so the two challenges are both complementary and orthogonal).
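To make that first step concrete: GCD itself is a C / Objective-C API, so the snippet below is only a sketch of the shape of the refactoring, written in Python for brevity, with a made-up simulate_cell standing in for one independent chunk of your real calculation. The point is simply that the work has to be expressed as independent items before any framework can spread it over cores, and that you can hold a couple of cores back for interactive use, as you described:

    from multiprocessing import Pool, cpu_count

    def simulate_cell(params):
        # Stand-in for one independent chunk of the real simulation.
        return sum(i * params for i in range(10_000))

    if __name__ == "__main__":
        work_items = list(range(100))
        # Leave a couple of cores free so the desktop stays usable.
        with Pool(processes=max(1, cpu_count() - 2)) as pool:
            results = pool.map(simulate_cell, work_items)
        print(len(results), "chunks done")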

Computationally intensive algorithms

I am planning to write a bunch of programs on computationally intensive algorithms. The programs would serve as an indicator of different compiler/hardware performance.
I would want to pick up some common set of algorithms which are used in different fields, like Bioinformatics, Gaming, Image Processing, et al. The reason I want to do this is to learn the algorithms and have a personal mini benchmark suite that would be small, useful and easy to maintain.
Any advice on algorithm selection would be tremendously helpful.
Benchmarks are not worthy of your attention!
The very best guide to processor performance is: http://www.agner.org/optimize/
And then someone will chuck it in a box with 3 GB of RAM, defeating dual-channel hopes, and your beautifully tuned benchmark will give widely different results again.
If you have a piece of performance critical code and you are sure you've picked the winning algorithm you can't then go use a generic benchmark to determine the best compiler. You have to actually compile your specific piece of code with each compiler and benchmark them with that. And the results you gain, whilst useful for you, will not extrapolate to others.
Case in point: people who make compression software - like zip and 7zip, and the high-end stuff like PPM and context mixing - are very careful about performance and benchmark their programs. They hang out on www.encode.ru
And the situation is this: for engineers developing the same basic algorithm - say LZ, or entropy coding like arithmetic coding and Huffman - they all find that very different compilers are better for them.
That is to say, two engineers solving the same problem with the same high-level algorithm will each benchmark their implementation and have results recommending different compilers...
(I have seen the same thing play out again and again in competition programming, e.g. Al Zimmermann's Programming Contests, which is an equally performance-attentive community.)
(The newer GCC 4.x series is very good all round, but that's just my data-point, others still favour ICC)
(Platform benchmarks for IO-related tasks are another thing altogether; people don't appreciate how differently Linux, Windows and FreeBSD (and the rest) perform when under stress. And benchmarks there - on the same workload, on the same machine, on different machines or with different core counts - would be very generally informative. Sadly, there aren't enough benchmarks like that about.)
There was some work done at Berkeley about this a few years ago. They identified 13 common application patterns for parallel programming, the "13 Dwarves". They include things like linear algebra, n-body models, FFTs, etc.
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
See page 10 onwards.
There are some sample .NET implementations here:
http://paralleldwarfs.codeplex.com/
The typical one is Fast Fourier Transform, perhaps you can also do something like the Lucas–Lehmer primality test.
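To show how small such a kernel can be, here is a sketch of the Lucas-Lehmer test in Python: it decides whether the Mersenne number 2^p - 1 is prime for an odd prime p, and the arbitrary-precision squaring makes it a reasonable CPU/bignum benchmark on its own:

    def lucas_lehmer(p):
        # Return True if the Mersenne number 2**p - 1 is prime (p an odd prime > 2).
        m = 2 ** p - 1
        s = 4
        for _ in range(p - 2):
            s = (s * s - 2) % m
        return s == 0

    if __name__ == "__main__":
        # 2**127 - 1 is a known Mersenne prime; larger exponents make a heavier benchmark.
        print(lucas_lehmer(127))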
I remember a guy who tested computational performance of machines, compiler versions by inverting Hilbert matrices.
For image processing, median filtering (used for removing noise, bad pixels) is always too slow. It might make a good test, given a large enough image say 1000x1000.
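For instance, a deliberately naive pure-Python 3x3 median filter like the sketch below (not tuned in any way) already takes noticeable time on a 1000x1000 image, which is exactly what makes it a handy benchmark kernel:

    import random
    import time

    def median_filter_3x3(img):
        # img is a list of rows of grayscale values; borders are left unfiltered.
        h, w = len(img), len(img[0])
        out = [row[:] for row in img]
        for y in range(1, h - 1):
            for x in range(1, w - 1):
                window = sorted(img[y + dy][x + dx]
                                for dy in (-1, 0, 1) for dx in (-1, 0, 1))
                out[y][x] = window[4]      # middle of the 9 sorted neighbours
        return out

    if __name__ == "__main__":
        img = [[random.randint(0, 255) for _ in range(1000)] for _ in range(1000)]
        t0 = time.perf_counter()
        median_filter_3x3(img)
        print("1000x1000 median filter took", round(time.perf_counter() - t0, 2), "s")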

How to design programs/softwares to take advantage of multiprocessors

To take advantage of multiprocessors
1. Do you need to select any specific programming language?
2. Are there any design patterns?
3. Can you schedule each thread on any of the available processors?
I am trying to understand good practices to write excellent programs which take full advantage of the available processors.
Writing proper parallel code is hard.
I don't know of any textbooks, but I've found Herb Sutter's series on Effective Concurrency to be pretty good.
1. Each language has varying support for multithreading.
2. Yes, you can read about them in operating systems textbooks.
3. It depends on your programming language and your tools. Often, it is easier to leave this decision to the operating system.
If you are just starting out with multithreading, you may want to look at some books: https://stackoverflow.com/search?q=multithreading+book.
Unless the problem you have to solve specifically asks for parallel computing, I would choose the programming language best suited to solving my real problem.
The allocation of threads to processors is usually best left to the operating system. You can influence that allocation by using different priorities of threads.
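If you do want to nudge placement or priority yourself, the knobs are OS-specific. As a small illustration in Python (note that sched_setaffinity is Linux-only, and both calls act on the whole process; per-thread control means dropping down to the native threading API):

    import os

    # Restrict the current process (pid 0 = self) to CPUs 0 and 1 -- Linux-only call.
    os.sched_setaffinity(0, {0, 1})

    # Or simply lower the priority so the scheduler keeps other work responsive.
    os.nice(10)

    print("allowed CPUs:", os.sched_getaffinity(0))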
If you use a language with a runtime environment (Java, .NET), then you have the additional layer of threads within the runtime environment vs. native threads.
To fully use the potential of multiple processors, the problem you have must be one that lends itself to multiprocessing. There is no real use in a heavily multithreaded data entry form.
hth
Mario

Does the advent of MultiCore architectures affect me as a software developer?

As a software developer dealing mostly with high-level programming languages, I'm not sure what I can do to appropriately pay attention to the upcoming omnipresence of multicore computers. I write mostly ordinary and non-demanding applications; nevertheless, I think it is important to know whether I need to change any programming paradigms or even languages to master the future.
My question therefore:
How to deal with increasing multicore presence in day-by-day hacking?
Herb Sutter wrote about it in 2005: The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software
Most problems do not require a lot of CPU time. Really, single cores are quite fast enough for many purposes. When you do find your program is too slow, first profile it and look at your choice of algorithms, architecture, and caching. If that doesn't get you enough, try to divide the problem up into separate processes. Often this is worth doing simply for fault isolation and so that you can understand the CPU and memory usage of each process. Also, normally each process will run on a specific core and make good use of the processor caches, so you won't have to suffer the substantial performance overhead of keeping cache lines consistent. If you go for a multi-process design and still find the problem needs more CPU time than you get with the machine you have, you are well placed to extend it to run over a cluster.
There are situations where you need multiple threads within the same address space, but beware that threads are really hard to get right. Race conditions, especially in non-safe languages, sometimes take weeks to debug; often, simply adding tracing or running under a debugger will change the timings enough to hide the problem. Simply putting locks everywhere often means you get a lot of locking overhead and sometimes so much lock contention that you don't really get the concurrency advantage you were hoping for. Even when you've got the locking right, you then need to profile to tune for cache coherency. Ultimately, if you want to really tune some highly concurrent code, you'll probably end up looking at lock-free constructs and more complex locking schemes than those in current multi-threading libraries.
Learn the benefits of concurrency, and the limits (e.g. Amdahl's law).
So you can, where possible, exploit the only route for higher performance that is going to be open. There is a lot of innovative work happening on easier approaches (futures and task libraries), and old work being rediscovered (functional languages and immutable data).
The free lunch is over, but that does not mean that there is nothing to exploit.
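For reference, Amdahl's law mentioned above puts a number on that limit: if a fraction p of the program can run in parallel on n cores, the best possible speedup is 1 / ((1 - p) + p / n). A quick sketch of how fast that saturates:

    def amdahl_speedup(p, n):
        # Best-case speedup when a fraction p of the work runs on n cores.
        return 1.0 / ((1.0 - p) + p / n)

    # Even with 90% of the work parallelized, 16 cores give well under 7x.
    for n in (2, 4, 8, 16, 64):
        print(n, "cores ->", round(amdahl_speedup(0.9, n), 2), "x")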
In general, become very friendly with threading. It's a terrible mechanism for parallelization, but it's what we have.
If you do work with .NET, look at the Parallel Extensions. They allow you to easily accomplish many parallel programming tasks.
To benefit from more that just one core you should consider parallelizing your code. Multiple threads, immutable types, and a minimum of synchronization are your new friends.
I think it will depend on what kind of applications you're writing.
Some kinds of apps benefit more than others from being run on a multi-core CPU.
If your application can benefit from the multi-core fact, then you should be ready to go parallel.
The free lunch is over; that is: in the past, your application became faster when a new CPU was released, and you didn't have to put any effort into your application to get that extra speed.
Now, to take advantage of the capabilities a multi-core CPU offers, you have to make sure that your application can take advantage of them. That is: you have to see which tasks can be executed multithreaded / concurrently, and this brings some issues to the table ...
Learn Erlang/F# (depending on your platform)
Prefer immutable data structures, their use makes software easier to understand not only in concurrent programs.
Learn the tools for concurrency in your language (e.g. java.util.concurrent, JCIP).
Learn a functional language (e.g Haskell).
I've been asked the same question, and the answer is, "it depends". If you're Joe WinForms, maybe not so much. If you're writing code that must be performant, yes. One of the biggest problems I can see with parallel programming is this: if something can't be parallelized, and you lie and tell the runtime to do it in parallel anyway, it's not going to crash; it's just going to do things wrong, and you'll get crap results and blame the framework.
Learn OpenMP and MPI for C and C++ code.
OpenMP also applies to other languages, like Fortran, I suppose.
Write smaller programs.
Other code languages/styles will let you do multithreading better (though multithreading is still really hard in any language) but the big benefit for regular developers, IMHO, is the ability to execute lots of smaller programs concurrently to accomplish some much larger task.
So, get in the habit of breaking your problems down into independent components that can be run whenever you want.
You'll build more maintainable software too.
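A bare-bones sketch of that habit in Python, where worker.py and the chunk files are only placeholders for your own smaller programs:

    import subprocess
    import sys

    # Launch one independent worker process per input chunk and let the OS
    # spread them across the available cores.
    chunks = ["chunk_a.dat", "chunk_b.dat", "chunk_c.dat", "chunk_d.dat"]
    procs = [subprocess.Popen([sys.executable, "worker.py", c]) for c in chunks]

    for p in procs:
        p.wait()        # gather exit codes once every worker has finished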

Scaling multithreaded applications on multicored machines

I'm working on a project where we need more performance. Over time we've continued to evolve the design to work more in parallel (both threaded and distributed). The latest step has been to move part of it onto a new machine with 16 cores. I'm finding that we need to rethink how we do things to scale to that many cores in a shared-memory model. For example, the standard memory allocator isn't good enough.
What resources would people recommend?
So far I've found Sutter's column in Dr. Dobb's to be a good start.
I just got The Art of Multiprocessor Programming and the O'Reilly book on Intel Threading Building Blocks.
A couple of other books that are going to be helpful are:
Synchronization Algorithms and Concurrent Programming
Patterns for Parallel Programming
Communicating Sequential Processes by C. A. R. Hoare (a classic, free PDF at that link)
Also, consider relying less on sharing state between concurrent processes. You'll scale much, much better if you can avoid it because you'll be able to parcel out independent units of work without having to do as much synchronization between them.
Even if you need to share some state, see if you can partition the shared state from the actual processing. That will let you do as much of the processing in parallel, independently from the integration of the completed units of work back into the shared state. Obviously this doesn't work if you have dependencies among units of work, but it's worth investigating instead of just assuming that the state is always going to be shared.
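To illustrate that shape (in Python rather than whatever your codebase uses, with a trivial stand-in for the real processing): the workers are pure functions over their own inputs, and the shared structure is only touched serially, in one place, after the parallel part finishes:

    from concurrent.futures import ProcessPoolExecutor

    def process_unit(work_item):
        # Pure function: reads its input, returns a result, touches no shared state.
        return work_item * work_item

    if __name__ == "__main__":
        work_items = range(1000)
        shared_state = {}                      # the only shared structure

        with ProcessPoolExecutor() as pool:
            results = pool.map(process_unit, work_items)

        # Integrating completed units back into shared state happens serially,
        # in one place, after the parallel work is done.
        for item, result in zip(work_items, results):
            shared_state[item] = result
        print(len(shared_state), "units integrated")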
You might want to check out Google's Performance Tools. They've released their version of malloc they use for multi-threaded applications. It also includes a nice set of profiling tools.
Jeffrey Richter is into threading a lot. He has a few chapters on threading in his books and check out his blog:
http://www.wintellect.com/cs/blogs/jeffreyr/default.aspx.
As Monty Python would say, "and now for something completely different" - you could try a language/environment that doesn't use threads, but processes and messaging (no shared state). One of the most mature ones is Erlang (and this excellent and fun book: http://www.pragprog.com/titles/jaerlang/programming-erlang). It may not be exactly relevant to your circumstances, but you can still learn a lot of ideas that you may be able to apply in other tools.
For other environments:
.NET has F# (to learn functional programming).
The JVM has Scala (which has actors, very much like Erlang, and is a functional hybrid language). Also, there is the fork/join framework from Doug Lea for Java, which does a lot of the hard work for you.
The allocator in FreeBSD recently got an update for FreeBSD 7. The new one is called jemalloc and is apparently much more scalable with respect to multiple threads.
You didn't mention which platform you are using, so perhaps this allocator is available to you. (I believe Firefox 3 uses jemalloc, even on Windows. So ports must exist somewhere.)
Take a look at Hoard if you are doing a lot of memory allocation.
Roll your own Lock Free List. A good resource is here - it's in C# but the ideas are portable. Once you get used to how they work you start seeing other places where they can be used and not just in lists.
I will have to check out Hoard, Google Perftools and jemalloc sometime. For now we are using scalable_malloc from Intel Threading Building Blocks and it performs well enough.
For better or worse, we're using C++ on Windows, though much of our code will compile with gcc just fine. Unless there's a compelling reason to move to Red Hat (the main Linux distro we use), I doubt it's worth the headache/political trouble to move.
I would love to use Erlang, but there's way too much here to redo now. If we think about the requirements around the development of Erlang in a telco setting, they are very similar to our world (electronic trading). Armstrong's book is on my to-read stack :)
In my testing to scale out from 4 cores to 16 cores I've learned to appreciate the cost of any locking/contention in the parallel portion of the code. Luckily we have a large portion that scales with the data, but even that didn't work at first because of an extra lock and the memory allocator.
I maintain a concurrency link blog that may be of ongoing interest:
http://concurrency.tumblr.com
