I have to implement an MPI system on a cluster. If anyone here has experience with MPI (MPICH/OpenMPI), I'd like to know which is better and how performance can be boosted on a cluster of x86_64 boxes.
MPICH has been around a lot longer. It's extremely portable and you'll find years worth of tips and tricks online. It's a safe bet and it's probably compatible with more MPI programs out there.
OpenMPI is newer. While it's not quite as portable, it supports the most common platforms really well. Most people seem to think it's a lot better in several regards, especially for fault-tolerance - but to take advantage of this you may have to use some of its special features that aren't part of the MPI standard.
As for performance, it depends a lot on the application; it's hard to give general advice. You should post a specific question about the type of calculation you want to run, the number of nodes, and the type of hardware - including what type of network hardware you're using.
I've written quite a few parallel applications for both Windows and Linux clusters, and I can advise you that right now MPICH2 is probably the safer choice. It is, as the other responder mentions, a very mature library. Also, there is ample broadcasting support (via MPI_Bcast) now, and in fact, MPICH2 has quite a few really nice features like scatter-and-gather.
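To make that concrete, here is a minimal, hedged sketch of the scatter/gather pattern using the standard MPI C API from C++; the chunk size and data are placeholders, not taken from any particular application.

    // Minimal MPI collective sketch: scatter chunks of work, gather results.
    // Compile with an MPI wrapper (e.g. mpicxx) and run with: mpiexec -n 4 ./a.out
    #include <mpi.h>
    #include <vector>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int chunk = 4;                      // elements per rank (placeholder)
        std::vector<double> all;                  // only meaningful on rank 0
        if (rank == 0) all.assign(chunk * size, 1.0);

        std::vector<double> mine(chunk);
        // Distribute one chunk to each rank.
        MPI_Scatter(all.data(), chunk, MPI_DOUBLE,
                    mine.data(), chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        for (double& x : mine) x *= 2.0;          // do some local work

        // Collect the results back on rank 0.
        MPI_Gather(mine.data(), chunk, MPI_DOUBLE,
                   all.data(), chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank == 0) std::printf("first result: %f\n", all[0]);
        MPI_Finalize();
        return 0;
    }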
OpenMPI is gaining some ground though. Penguin Computing (they're a big cluster vendor, and they like Linux) actually has some really strong benchmarks where OpenMPI beats MPICH2 hands down in certain circumstances.
Regarding your comment about "boosting performance", the best piece of advice I can give is to never send more data than absolutely necessary if you're I/O bound, and never do more work than necessary if you're CPU bound. I've fallen into the trap of optimizing the wrong piece of code more than once :) Hopefully you won't follow in my footsteps!
Check out the MPI forums - they have a lot of good info about MPI routines, and the Beowulf site has a lot of interesting questions answered.
'Better' is hard to define... 'Faster' can be answered by benchmarking with your code and your hardware. Things like collective and offload optimisation depend on your exact hardware and are also quite variable with respect to driver stack versions; Google should be able to find you working combinations.
As far as optimisation work, that somewhat depends on the code, and somewhat on the hardware.
Is your code I/O bound to storage? In that case, investigating something better than NFS might help a lot, as might using MPI I/O rather than naive parallel I/O.
If you are network bound, then looking at communication locality, and comms/compute overlap can help. Most of the various MPI implementations have tuning options for using local shared memory rather than the network for intranode comms, which for some codes can reduce the network load significantly.
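As a rough illustration of comms/compute overlap, here is a hedged sketch using nonblocking point-to-point calls; the neighbour ranks, buffer sizes, and tag are made up for the example.

    // Sketch of communication/computation overlap with nonblocking MPI calls.
    #include <mpi.h>
    #include <vector>

    void exchange_and_compute(int left, int right, MPI_Comm comm) {
        std::vector<double> send_buf(1024, 0.0), recv_buf(1024, 0.0);
        MPI_Request reqs[2];

        // Start the halo exchange, but don't wait for it yet.
        MPI_Irecv(recv_buf.data(), static_cast<int>(recv_buf.size()), MPI_DOUBLE,
                  left, 0, comm, &reqs[0]);
        MPI_Isend(send_buf.data(), static_cast<int>(send_buf.size()), MPI_DOUBLE,
                  right, 0, comm, &reqs[1]);

        // ... do interior work here that doesn't depend on recv_buf ...

        // Only block once the overlapping work is done.
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        // ... now do the boundary work that needs recv_buf ...
    }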
Segregation of I/O and MPI traffic can make a big difference on some clusters, particularly for gigabit ethernet clusters.
We used MPICH simply because it seemed the most available and best documented; we didn't put a lot of effort into testing alternatives. MPICH has reasonable tools for deployment on Windows.
The main performance issue we had was that we needed to ship the same base data to all nodes, and MPICH doesn't (or didn't, at the time) support broadcast, so deploying the initial data was O(n) in the number of nodes.
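As noted in an earlier answer, MPI_Bcast is a standard collective available in current MPICH and Open MPI, and it lets the library use a tree or pipeline algorithm instead of a loop of sends from rank 0. A minimal sketch of shipping base data that way (payload type and sizing are placeholders):

    // Sketch: replace a rank-0 loop of MPI_Send calls with MPI_Bcast,
    // which lets the library pick a tree/pipeline algorithm under the hood.
    #include <mpi.h>
    #include <vector>

    void ship_base_data(std::vector<char>& base_data, MPI_Comm comm) {
        int count = static_cast<int>(base_data.size());
        // Every rank calls this; rank 0 provides the data, the rest receive it.
        MPI_Bcast(&count, 1, MPI_INT, 0, comm);
        base_data.resize(count);
        MPI_Bcast(base_data.data(), count, MPI_CHAR, 0, comm);
    }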
Considering that many libraries already ship optimized sorting routines, why do so many companies still ask about Big O and sorting algorithms, when in everyday computing this kind of hand-rolled implementation is rarely needed any more?
For example, a self-balancing binary tree is the kind of problem that some big companies in the embedded industry still ask programmers to code as part of their testing and candidate selection process.
Even in embedded work, are there any circumstances where such an implementation is actually going to be written, given that Boost, SQLite and other libraries are available? In other words, is it still worth thinking about ways to optimize such algorithms?
As an embedded programmer, I would say it all comes down to the problem and the system constraints. Especially on a microprocessor, you may not want/need to pull in Boost, and SQLite may still be too heavy for a given problem. How you chop up problems looks different if you have, say, 2K of RAM - but this is definitely the extreme.
So for example, you probably don't want to rewrite a red-black tree yourself because, as you pointed out, you will likely end up with highly non-optimized code. But in the pursuit of reusability, abstraction often adds layers of indirection to the code. So you may end up reimplementing at least the simpler "solved" problems where you can do better than the built-in library for certain niche cases. Recently, for example, I wrote a specialized version of linked lists using shared memory pools for allocation across lists. I benchmarked against STL's list and it just wasn't cutting it because of the added layers of abstraction. Is my code as good? No, probably not. But I was more easily able to specialize it for the specific use cases, so it came out better.
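The original code isn't shown here, but a highly simplified sketch of the shared-pool-plus-intrusive-list idea might look like this (the payload and pool capacity are placeholders):

    // Simplified sketch: a fixed-size node pool shared by several intrusive lists,
    // avoiding per-node heap allocation.
    #include <cstddef>
    #include <cassert>

    struct Node {
        int   value;   // placeholder payload
        Node* next;
    };

    class NodePool {
    public:
        explicit NodePool(std::size_t capacity)
            : storage_(new Node[capacity]), free_(nullptr) {
            // Thread every node onto the free list up front.
            for (std::size_t i = 0; i < capacity; ++i) {
                storage_[i].next = free_;
                free_ = &storage_[i];
            }
        }
        ~NodePool() { delete[] storage_; }

        Node* acquire() {           // O(1), no heap traffic
            assert(free_ && "pool exhausted");
            Node* n = free_;
            free_ = n->next;
            return n;
        }
        void release(Node* n) {     // O(1)
            n->next = free_;
            free_ = n;
        }
    private:
        Node* storage_;
        Node* free_;
    };

    struct List {                   // intrusive singly linked list using pool nodes
        Node* head = nullptr;
        void push_front(NodePool& pool, int v) {
            Node* n = pool.acquire();
            n->value = v;
            n->next = head;
            head = n;
        }
    };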
So I guess I'd like to address a few things from your post:
-Why do companies still ask about big-O runtime? I have seen even pretty good engineers make bad choices of data structures because they didn't reason carefully about the O() runtime. Quadratic versus linear, or linear versus constant time, is a painful lesson when you get the assumption wrong (ah, the voice of experience). See the sketch after this list for a concrete example.
-Why do companies still ask about implementing classic structures/algorithms? Arguably you won't reimplement quick sort, but as stated, you may very well end up implementing slightly less complicated structures on a more regular basis. Truthfully, if I'm looking to hire you, I'm going to want to make sure that you understand the theory inside and out for existing solutions so if I need you to come up with a new solution you can take an educated crack at it. And if the other applicant has that and you don't, I'd probably say they have an advantage.
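Picking up the big-O point from the first item above, here is a made-up example of the quadratic-versus-linear trap: testing membership with a linear scan inside a loop versus using a hash set.

    // Quadratic vs. (roughly) linear duplicate counting over n items.
    #include <vector>
    #include <unordered_set>
    #include <algorithm>

    // O(n^2): for each item, std::find scans everything seen so far.
    int count_duplicates_quadratic(const std::vector<int>& items) {
        std::vector<int> seen;
        int duplicates = 0;
        for (int x : items) {
            if (std::find(seen.begin(), seen.end(), x) != seen.end())
                ++duplicates;
            else
                seen.push_back(x);
        }
        return duplicates;
    }

    // ~O(n): hash-set lookups are expected constant time.
    int count_duplicates_linear(const std::vector<int>& items) {
        std::unordered_set<int> seen;
        int duplicates = 0;
        for (int x : items) {
            if (!seen.insert(x).second)   // insert() reports "already present"
                ++duplicates;
        }
        return duplicates;
    }

On a tiny embedded target you might not reach for std::unordered_set at all, but the asymptotic lesson is the same whatever container you hand-roll.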
Also, here's another way to think about it. In software development, often the platform is quite powerful and the consumer already owns the hardware platform. In embedded software development, the consumer is probably purchasing the hardware platform - hopefully from your company. This means that the software is often selling the hardware. So often it makes more cents to use less powerful, cheaper hardware to solve a problem and take a bit more time to develop the firmware. The firmware is a one-time cost upfront, whereas hardware is per-unit. So even from the business side there are pressures for constrained hardware which in turn leads to specialized structure implementation on the software side.
If you suggest using SQLite on a 2 kB Arduino, you might hear a lot of laughter.
Embedded systems are a bit special. They often have extraordinarily tight memory requirements, so space complexity might be far more important than time complexity. I've rarely worked in that area myself, so I don't know what embedded systems companies are interested in, but if they're asking such questions, then it's probably because you'll need to be more acquainted with such issues than in other areas of IT.
Nothing is optimized enough.
Besides, the questions are meant to test your understanding of the solution (and each part of the solution) and not how great you are at memorizing stuff. Hence it makes perfect sense to ask such questions.
I'm aware this will vary between systems and implementations, but I have absolutely no idea whether OpenMP is feasible to use where you have easily parallelizable code that doesn't take a long time to run. For instance, if I have a loop which would be ideal for OpenMP but only takes a couple of ms to run on a single thread, will the overhead kill any gains?
My usage is in real-time rendering, where we have perhaps 20ms to generate each frame. So typically we don't have any single block of code taking more than 5-10ms... is OpenMP targeted at this kind of time-scale, or only on operations taking several seconds to run?
Any anecdotal/empirical data is welcome, as well as more 'official' sources. I suppose I'm primarily interested in MSVC++ and GCC OpenMP implementations.
I found this interesting article: http://software.intel.com/en-us/blogs/2010/06/02/using-openmp-to-parallelize-a-game/
The answer, to all empirical questions when data is not available, is 42 ...
But seriously, you're going to have to derive your own answer to this question for your own configuration of hardware, software and problem.
Yes, perhaps when OpenMP was first published in 1997, the typical use case was relatively heavyweight threads of computation taking seconds each. Since then, processors have got faster and implementations have got better, but I don't think you'll get a good answer to your question without benchmarking. And you're going to have to benchmark anyway, whatever you learn from SO, so stop wasting time and get coding.
You might be persuaded that OpenMP is not useful: but you won't quite trust the answer wrt your own requirements.
You might be persuaded that OpenMP is useful: but you won't quite trust the answer wrt your own requirements.
You might get a lot of conflicting anecdotal evidence: but you won't quite trust the answer wrt your own requirements.
Get benchmarking. And let us know how you get on, SO is far too long on anecdotes (such as the trash I'm writing), far too short on hard data.
PS Don't lose your benchmark codes. Revise them and rerun them as your experience develops and as your hardware, systems software and applications software change. Of one thing I am convinced without data: the answer to your question is a moving target.
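For anyone who wants a starting point, here is a minimal benchmark sketch, assuming GCC (-fopenmp) or MSVC (/openmp); the loop body and problem size are stand-ins for your real per-frame work, so treat the numbers it prints with suspicion until you substitute your own code.

    // Rough micro-benchmark: time a small loop serially and with OpenMP,
    // to see whether fork/join overhead swamps the gain at your problem size.
    #include <omp.h>
    #include <vector>
    #include <cstdio>
    #include <cmath>

    int main() {
        const int n = 100000;               // stand-in for your per-frame workload
        std::vector<float> data(n, 1.0f);

        double t0 = omp_get_wtime();
        for (int i = 0; i < n; ++i)
            data[i] = std::sqrt(data[i]) * 2.0f;
        double serial = omp_get_wtime() - t0;

        double t1 = omp_get_wtime();
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            data[i] = std::sqrt(data[i]) * 2.0f;
        double parallel = omp_get_wtime() - t1;

        std::printf("serial: %.3f ms, parallel: %.3f ms\n",
                    serial * 1e3, parallel * 1e3);
        return 0;
    }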
I started programming because I was a hardware guy who got bored; I thought the problems being solved on the software side of things were much more interesting than those in hardware. At that time, most of the electrical buses I dealt with were serial, some moving data as fast as 1.5 megabits per second!! ;)
Over the years these evolved into parallel buses in order to speed communication up; after all, transferring 8/16/32/64 (whatever) bits at a time speeds up the transfer enormously. Well, our ability to create and detect state changes got faster and faster, to the point where we could push data so fast that interference between parallel traces or cable wires made cleaning up the signal too expensive to continue, and we still got reasonable performance from serial interfaces - heck, some graphics interfaces have even been running over USB for a while now.
I think I'm seeing a similar trend in software now: our processors were getting faster and faster, so we got good at building "serial" software. Now we've hit a speed bump in raw processor speed, so we're adding cores, or "traces", to the mix, and spending a lot of time and effort on learning how to use them properly. But I'm also seeing what I feel are advances in things like optical switching and even quantum computing that could take us back, far more quickly than I was expecting, to the point where "serial programming" again makes the most sense.
What are your thoughts?
Things like optical switching and quantum computing have been in the wings for years; hopefully, they will eventually come to fruition, but progress is usually a gradual process. There's no reason for research to go in only one direction when it can go in many, and as processors have hit their limits, parallelism seems like one promising approach to increasing computing efficiency.
I think your point is very interesting.
However, it seems to me that it boils down to "eventually we manage to make fast serial hardware faster". That might be true. Nevertheless, in my experience the cost and time needed to improve a technology by adding multiple units of something, rather than moving to the next generation, is still much lower. So I believe that with every "generation change" we will still go through a period where advancement is achieved by the "multi" approach rather than the "neo" approach. In fact, enough time is spent in the multi approach that investing in parallelization makes sense.
Furthermore, we have learned with time. Look at graphics cards. They are essentially a bulk set of cheaper number crunchers. Graphics programming already makes use of this. Eventually, more calculations will be done this way for regular development.
Also, whereas in the past many developers never understood threading, or treated it as some advanced concept, today most C# and Java programmers encounter it early in their education. It's becoming important earlier, which means that over time we will have more people able to write software that makes use of this. Fifteen years ago, very few job interviews asked developers any questions about concurrency. Today, most screening interviews include at least rudimentary threading questions, and almost every developer knows what a deadlock is and why the philosophers might starve.
Beyond all this, we can't ignore the cloud. Even if we can change the way that a single machine is implemented, we can't deny that the network, and the ability to draw in additional resources as we need, create an inherent need for parallel software. More and more software, and the whole model of Software as a Service, is heading that way.
On an unrelated note, I'm not sure I share your optimism on quantum computing. We haven't managed to make reliable software in this universe with deterministic means :)
I think it's a hardware/software cycle. At one point development in hardware stalls for a while, and the world looks to the software developers to get better performance out of the existing hardware; the software engineers become important at that point. However, after some time the clever electrical engineers find a way around the factors stalling hardware development, and hardware development picks up pace again, allowing the software engineers to relax a bit.
It seems that I often hear people criticize certain programming languages because they "have poor performance", or because some other language is "faster" in general (not necessarily for a specific application). However, my experience and education have taught me that anytime you have a performance problem, at least one of the following is probably happening:
The bottleneck isn't in the CPU, it's in some other device, such as the network or the hard drive.
The poor performance is caused by your algorithms, not by the language you're using.
My general impression is that the speed of a programming language itself is all but irrelevant in the vast majority of cases, with exceptions for serious data processing problems. Even in those cases, I believe you could use a hybrid approach and use a lower-level language only for the CPU-intensive pieces so that you wouldn't lose the benefits of the more abstract language altogether.
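As a sketch of that hybrid approach (the function name and build line are illustrative, not from any particular project): keep the hot loop in compiled code behind a C-callable entry point, and call it from the higher-level language through its FFI.

    // hot_loop.cpp - build as a shared library, e.g.:
    //   g++ -O2 -shared -fPIC hot_loop.cpp -o libhotloop.so
    // The extern "C" entry point can then be called from a higher-level
    // language through its FFI (Python ctypes, Lua FFI, etc.).
    #include <cstddef>

    extern "C" double sum_of_squares(const double* data, std::size_t n) {
        double total = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            total += data[i] * data[i];   // CPU-intensive inner loop stays compiled
        return total;
    }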
Do you agree? Is programming language speed insignificant most of the time, or do the critics have a right to point out language performance issues?
I hope this question isn't too subjective, but it seems to me that there should be a relatively objective answer to this.
Performance can be a serious concern in libraries, operating systems, and the like. However, I believe that upwards of 90% of the time raw performance is irrelevant.
What is more important in many cases is TIMING. Any garbage collected language is going to have some unpredictability in this regard, which makes them unsuited to embedded and realtime design spaces.
The overlap of GC'd and "slow" languages is considerable, and so you may see a language discounted for speed reasons when the real problem is inconsistent timing.
There are some allocation/threading/etc. schemes that allow for garbage collection while also guaranteeing the runtime of parts of the system, such as Realtime Java, though I haven't personally seen it in use anywhere.
Short answer: most of the time the speed of the language is irrelevant (within reason), language choices are made based on familiarity and available libraries.
Amazingly, the performance of a system is a combination of the programming language, the system it's executing on, the operations that system is performing and the external resources (network, disk, slow line printers, etc.) that it relies upon.
If your system is slow, rather than guessing, test it.
If there is any "Rule" in computing, it's "Test your assumptions". Everything else is gross guideline.
This is impossible to answer so broadly. It's like asking if big engines are a waste in cars. Well, for some people, yes. For others, not at all. And all sorts in between.
There are a myriad of factors that come in to play. What is your target environment? End-user deployment or servers? Let's suppose we're talking about web development and coding for a server. RoR is well-known to be (relatively) slow. .NET is pretty fast by comparison. But RoR also has RAD qualities that .NET can't compete with.
Is getting your app up-and-running yesterday more of a priority than scalability?
Does your business model live or die on the milliseconds you serve a page, or the time you went to market?
Does your TCO and application architecture support scaling out or scaling up? Do you even expect to need to scale up?
Those are just a tiny handful of the questions an architect has to answer when making platform/language decisions. Does speed matter? Sometimes. If I am planning to write a LoB service that will eventually need to scale to thousands of transactions per second, and it will be deployed in an enterprise environment, I will probably go with .NET. If I have an idea for a Web 2.0 business like selling Twitter T-shirts, I need to capitalize on that idea yesterday, and I know I probably won't get slammed with enough business to bring the site down before I can prepare for it.
This is honestly over-simplifying a very complex issue, but hopefully it illustrates the point that it's impossible to simply "say" whether it matters or not.
I think it's a good question. To answer it requires having a general framework for thinking about performance, so let me try to provide one. (Some of this is going to sound really obvious, but bear with me.)
To keep things simple, let's just consider the simple case of applications that have a specific job to do, that start and then finish, and where what you care about is wall-clock time. Let's assume a standard CPU cycle rate and a mono-processor.
The time duration consists of a stream of time-slices (nanoseconds, say). To do that job, there is a minimum amount of time required, and it is usually greater than zero. There is no maximum amount of time required. If a program spends longer than the minimum number of nanoseconds, then some of those nanoseconds are being spent, strictly speaking, unnecessarily (i.e. for poor reasons).
So, to optimize a program's execution time, it is necessary to find the nanoseconds it is spending that do not have to be spent (i.e. that do not have good reasons) and remove them.
One way to do this is to, if possible, step through the program and keep track at each step of why it is doing that step. If the reason is not good, there is an opportunity for removing steps.
Another way to do this is to select nanoseconds at random from the program's execution, and inquire their reasons. For example, the program counter can tell you what the program is doing, but the call stack can tell you why. In order for the nanosecond to be spent for a good reason, every call instruction on the call stack has to have a good reason. If any instruction on the call stack does not have a good reason, then there is an opportunity to optimize. In fact, the amount of time that instruction is on the call stack is the amount of time that would be saved by its removal.
In some kinds of software that are highly asynchronous, message-driven, or interpreted, the call stack may not provide enough information. In that case, to answer why a given nanosecond is being spent may be more difficult. It may require examining more state information than just the call stack. For example, in an interpreter, the stack of the program being interpreted may also need to be examined. However, often the hardware call stack does provide sufficient information, so it is a useful thing to examine.
Now, to try to answer your question.
There is such a thing as a "hot spot". This is a small set of addresses that are often at the bottom of the call stack. Nanoseconds spent in that code may or may not have good reasons.
There is such a thing as a "performance problem". This is an instruction that often accounts for why nanoseconds are being spent, but that does not have a good reason. Such an instruction may be in a hot spot, or it may be a subroutine call instruction (it cannot be both). It may be an instruction to send a message to be processed later, where the time spent has no good reason. To optimize software, such instructions (not functions) are what you look for.
Languages, loosely speaking, are either compiled into machine language or interpreted. Interpreted languages are usually 1 or 2 orders of magnitude slower than compiled, because they are constantly re-determining what they need to do. However, roughly speaking, this is only a performance problem if it occurs in a hot spot. If a program spends all its time calling compiled library functions, or waiting for I/O completions, then its speed of execution probably doesn't matter, because most of the nanoseconds are being spent for other reasons.
Now, certainly, any language or program can in principle be highly non-optimal, but in terms of compilers, for hotspot code, they are mostly pretty good, give or take maybe 30%. If there is a background process involved, like garbage collection, that adds an overhead, but it depends on the rate at which the program generates garbage.
So to sum up, the speed of a language matters in hotspot code, but not much elsewhere. When a program has been optimized by removal of all other performance problems, and if the hotspot code is actually seen by the compiler/interpreter, then speed of language matters.
Your question is framed very broadly, so I'll try to give a somewhat narrower answer:
Unless there is some good reason not to do so, the language for a project should always be chosen from among those languages that will help the project team be productive and produce reliable software that can easily be adapted for future needs. The tradeoffs generally favor high-level languages with automatic memory management.
N.B. There are plenty of good reasons to make other choices, such as compatibility with current products and libraries.
It sometimes happens that when a program is too slow, the quickest and easiest way to speed it up is to rewrite the program (or a critical part) in a new language. This happens most often when the implementation language is interpreted and the new language is compiled.
Example: I got about a 4x speedup out of the OSBF-Lua spam filter by rewriting the lexical analysis of the mail headers. By rewriting from Lua to C I not only went from interpreted to compiled but was able to eliminate an array-bounds check for every input character.
To answer your question as stated, it is not very often that language performance per se is an issue.
I'm working on a project where we need more performance. Over time we've continued to evolve the design to work more in parallel (both threaded and distributed). The latest step has been to move part of it onto a new machine with 16 cores. I'm finding that we need to rethink how we do things to scale to that many cores in a shared-memory model. For example, the standard memory allocator isn't good enough.
What resources would people recommend?
So far I've found Sutter's column in Dr. Dobb's to be a good start.
I just got The Art of Multiprocessor Programming and The O'Reilly book on Intel Threading Building Blocks
A couple of other books that are going to be helpful are:
Synchronization Algorithms and Concurrent Programming
Patterns for Parallel Programming
Communicating Sequential Processes by C. A. R. Hoare (a classic, free PDF at that link)
Also, consider relying less on sharing state between concurrent processes. You'll scale much, much better if you can avoid it because you'll be able to parcel out independent units of work without having to do as much synchronization between them.
Even if you need to share some state, see if you can partition the shared state from the actual processing. That will let you do as much of the processing in parallel, independently from the integration of the completed units of work back into the shared state. Obviously this doesn't work if you have dependencies among units of work, but it's worth investigating instead of just assuming that the state is always going to be shared.
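A stripped-down sketch of that pattern in C++ (the chunk type and the processing/merge steps are placeholders): each worker handles its own chunk with no shared state, and integration into the final result happens once, serially, at the end.

    // Sketch: process independent chunks in parallel, integrate results afterwards.
    #include <future>
    #include <functional>
    #include <numeric>
    #include <vector>

    double process_chunk(const std::vector<double>& chunk) {   // pure, no shared state
        return std::accumulate(chunk.begin(), chunk.end(), 0.0);
    }

    double process_all(const std::vector<std::vector<double>>& chunks) {
        std::vector<std::future<double>> futures;
        for (const auto& c : chunks)                            // fan out the work
            futures.push_back(std::async(std::launch::async, process_chunk, std::cref(c)));

        double total = 0.0;                                     // integrate serially
        for (auto& f : futures)
            total += f.get();
        return total;
    }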
You might want to check out Google's Performance Tools. They've released their version of malloc they use for multi-threaded applications. It also includes a nice set of profiling tools.
Jeffrey Richter is into threading a lot. He has a few chapters on threading in his books; also check out his blog:
http://www.wintellect.com/cs/blogs/jeffreyr/default.aspx.
As Monty Python would say, "and now for something completely different" - you could try a language/environment that doesn't use threads, but processes and messaging (no shared state). One of the most mature ones is Erlang (see this excellent and fun book: http://www.pragprog.com/titles/jaerlang/programming-erlang). It may not be exactly relevant to your circumstances, but you can still learn a lot of ideas that you may be able to apply with other tools.
For other environments:
.NET has F# (to learn functional programming).
The JVM has Scala (which has actors, very much like Erlang, and is a hybrid functional language). There is also the "fork join" framework from Doug Lea for Java, which does a lot of the hard work for you.
The allocator in FreeBSD recently got an update for FreeBSD 7. The new one is called jemalloc and is apparently much more scalable with respect to multiple threads.
You didn't mention which platform you are using, so perhaps this allocator is available to you. (I believe Firefox 3 uses jemalloc, even on Windows, so ports must exist somewhere.)
Take a look at Hoard if you are doing a lot of memory allocation.
Roll your own Lock Free List. A good resource is here - it's in C# but the ideas are portable. Once you get used to how they work you start seeing other places where they can be used and not just in lists.
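Without reproducing that article, here is a minimal Treiber-style lock-free stack sketch in C++ using std::atomic; note that it deliberately ignores the ABA problem and safe memory reclamation, which any production lock-free structure must handle.

    // Minimal Treiber-style lock-free stack sketch.
    // Caveat: ignores ABA and safe memory reclamation (hazard pointers,
    // epochs, etc.), which real lock-free code must deal with.
    #include <atomic>

    template <typename T>
    class LockFreeStack {
        struct Node {
            T     value;
            Node* next;
        };
        std::atomic<Node*> head_{nullptr};

    public:
        void push(const T& v) {
            Node* n = new Node{v, head_.load(std::memory_order_relaxed)};
            // Retry until we swing head_ from n->next to n.
            while (!head_.compare_exchange_weak(n->next, n,
                                                std::memory_order_release,
                                                std::memory_order_relaxed)) {
                // On failure, n->next is refreshed with the current head.
            }
        }

        bool pop(T& out) {
            Node* n = head_.load(std::memory_order_acquire);
            while (n && !head_.compare_exchange_weak(n, n->next,
                                                     std::memory_order_acquire,
                                                     std::memory_order_relaxed)) {
                // On failure, n is refreshed with the current head.
            }
            if (!n) return false;
            out = n->value;
            delete n;   // unsafe under concurrent pops (use-after-free); sketch only
            return true;
        }
    };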
I will have to check out Hoard, Google Perftools and jemalloc sometime. For now we are using scalable_malloc from Intel Threading Building Blocks and it performs well enough.
For better or worse, we're using C++ on Windows, though much of our code will compile with gcc just fine. Unless there's a compelling reason to move to Red Hat (the main Linux distro we use), I doubt it's worth the headache/political trouble to move.
I would love to use Erlang, but there's way too much here to redo now. If we think about the requirements around the development of Erlang in a telco setting, they are very similar to our world (electronic trading). Armstrong's book is on my to-read stack :)
In my testing to scale out from 4 cores to 16 cores I've learned to appreciate the cost of any locking/contention in the parallel portion of the code. Luckily we have a large portion that scales with the data, but even that didn't work at first because of an extra lock and the memory allocator.
I maintain a concurrency link blog that may be of ongoing interest:
http://concurrency.tumblr.com