Scaling multithreaded applications on multicored machines - performance

I'm working on a project where we need more performance. Over time we've continued to evolve the design to work more in parallel (both threaded and distributed). The latest step has been to move part of it onto a new machine with 16 cores. I'm finding that we need to rethink how we do things to scale to that many cores in a shared memory model. For example, the standard memory allocator isn't good enough.
What resources would people recommend?
So far I've found Sutter's column in Dr. Dobb's to be a good start.
I just got The Art of Multiprocessor Programming and the O'Reilly book on Intel Threading Building Blocks.

A couple of other books that are going to be helpful are:
Synchronization Algorithms and Concurrent Programming
Patterns for Parallel Programming
Communicating Sequential Processes by C. A. R. Hoare (a classic, free PDF at that link)
Also, consider relying less on sharing state between concurrent processes. You'll scale much, much better if you can avoid it because you'll be able to parcel out independent units of work without having to do as much synchronization between them.
Even if you need to share some state, see if you can partition the shared state from the actual processing. That will let you do as much of the processing in parallel, independently from the integration of the completed units of work back into the shared state. Obviously this doesn't work if you have dependencies among units of work, but it's worth investigating instead of just assuming that the state is always going to be shared.
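A minimal sketch of that partitioning idea, in Python for brevity. The names `count_words` and `process` are hypothetical, and a thread pool is used only to show the shape: in CPython it will not speed up CPU-bound work, and in C++ the same structure would sit on top of whatever task library you already use. The workers never touch the shared totals; integration happens in one place, on the calling thread.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk):
    """Pure unit of work: reads its own input, touches no shared state."""
    return Counter(chunk.split())


def process(chunks, workers=4):
    """Process chunks independently, then integrate the results in one place."""
    shared_totals = Counter()  # the only shared state, owned by the caller
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields each worker's private result; no locks are needed
        # because only this thread ever writes to shared_totals.
        for partial in pool.map(count_words, chunks):
            shared_totals.update(partial)
    return shared_totals
```

Because the merge step is the only serial part, it is also the only part that limits scaling, which makes it easy to measure.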

You might want to check out Google's Performance Tools. They've released their version of malloc they use for multi-threaded applications. It also includes a nice set of profiling tools.

Jeffrey Richter is into threading a lot. He has a few chapters on threading in his books and check out his blog:
http://www.wintellect.com/cs/blogs/jeffreyr/default.aspx.

As Monty Python would say, "and now for something completely different" - you could try a language/environment that doesn't use threads, but processes and messaging (no shared state). One of the most mature ones is Erlang (and this excellent and fun book: http://www.pragprog.com/titles/jaerlang/programming-erlang). It may not be exactly relevant to your circumstances, but you can still learn a lot of ideas that you may be able to apply in other tools.
For other environments:
.Net has F# (to learn functional programming).
JVM has Scala (which has actors, very much like Erlang, and is a functional hybrid language). Also there is the "fork join" framework from Doug Lea for Java, which does a lot of the hard work for you.

The allocator in FreeBSD recently got an update for FreeBSD 7. The new one is called jemalloc and is apparently much more scalable with respect to multiple threads.
You didn't mention which platform you are using, so perhaps this allocator is available to you. (I believe Firefox 3 uses jemalloc, even on Windows, so ports must exist somewhere.)

Take a look at Hoard if you are doing a lot of memory allocation.
Roll your own Lock Free List. A good resource is here - it's in C# but the ideas are portable. Once you get used to how they work you start seeing other places where they can be used and not just in lists.

I will have to check out Hoard, Google Perftools and jemalloc sometime. For now we are using scalable_malloc from Intel Threading Building Blocks and it performs well enough.
For better or worse, we're using C++ on Windows, though much of our code will compile with gcc just fine. Unless there's a compelling reason to move to Red Hat (the main Linux distro we use), I doubt it's worth the headache/political trouble to move.
I would love to use Erlang, but there is way too much here to redo now. If we think about the requirements around the development of Erlang in a telco setting, they are very similar to our world (electronic trading). Armstrong's book is on my to-read stack :)
In my testing to scale out from 4 cores to 16 cores I've learned to appreciate the cost of any locking/contention in the parallel portion of the code. Luckily we have a large portion that scales with the data, but even that didn't work at first because of an extra lock and the memory allocator.

I maintain a concurrency link blog that may be of ongoing interest:
http://concurrency.tumblr.com

Related

High Frequency Trading in the JVM with Scala/Akka

Let's imagine a hypothetical HFT system in Java, requiring (very) low latency, with lots of short-lived small objects somewhat due to immutability (Scala?), thousands of connections per second, and an obscene number of messages passing around in an event-driven architecture (Akka and AMQP?).
For the experts out there, what would (hypothetically) be the best tuning for JVM 7? What type of code would make it happy? Would Scala and Akka be ready for this kind of system?
Note: There have been some similar questions, like this one, but I've yet to find one covering Scala (which has its own idiosyncratic footprint in the JVM).
It is possible to achieve very good performance in Java. However, the question needs to be more specific to provide a credible answer. Your main sources of latency will come from the following non-exhaustive list:
How much garbage you create and the work of the GC to collect and promote it. Immutable designs in my experience do not fit well with low-latency. GC tuning needs to be a big focus.
Warm up the JVM so that classes are loaded and the JIT has had time to do its work.
Design your algorithms to be O(1) or at least O(log2 n), and have performance tests that assert this.
Your design needs to be lock-free and follow the "Single Writer Principle".
A significant effort needs to be put into understanding the whole stack and showing mechanical sympathy in its use.
Design your algorithms and data structures to be cache friendly. Cache misses these days are the biggest cost. This is closely related to process affinity which, if not set up correctly, can result in significant cache pollution. This will involve sympathy for the OS and even some JNI code in some cases.
Ensure you have sufficient cores so that any thread that needs to run has a core available without having to wait.
I recently blogged about a case study of such an exercise.
On my laptop the average latency of ping messages between Akka 2.3.7 actors is ~300ns and it is much less than the latency expected due to GC pauses on JVMs.
Code (incl. JVM options) & test results for Akka and other actors on Intel Core i7-2640M here.
P.S. You can find lots of principles and tips for low-latency computing on Dmitry Vyukov's site and in Martin Thompson's blog.
You may find that use of a ring buffer for message passing will surpass what can be done with Akka. The main ring buffer implementation that people use on the JVM for financial applications is one called Disruptor which is carefully tuned for efficiency (power of two size), for the JVM (no GC, no locks) and for modern CPUs (no false sharing of cache lines).
Here is an intro presentation from a Scala point of view http://scala-phase.org/talks/jamie-allen-sdisruptor/index.html#1 and there are links on the last slide to the original LMAX stuff.
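To make the "power of two size" and "no GC" points concrete, here is a rough sketch of the indexing trick only. It is written in Python for brevity, the `RingBuffer` class and its method names are made up for illustration, and it is not the Disruptor API: it has none of the memory-ordering, cache-line padding or batching work that makes the real thing fast. It assumes a single producer and a single consumer.

```python
class RingBuffer:
    """Fixed-capacity ring buffer whose capacity must be a power of two."""

    def __init__(self, capacity):
        if capacity <= 0 or capacity & (capacity - 1) != 0:
            raise ValueError("capacity must be a power of two")
        self._mask = capacity - 1          # lets us replace % with a cheap &
        self._slots = [None] * capacity    # allocated once, reused forever
        self._head = 0  # next sequence to read (advanced only by the consumer)
        self._tail = 0  # next sequence to write (advanced only by the producer)

    def offer(self, item):
        """Store item and return True, or return False if the buffer is full."""
        if self._tail - self._head == len(self._slots):
            return False
        self._slots[self._tail & self._mask] = item
        self._tail += 1
        return True

    def poll(self):
        """Return the oldest item, or None if the buffer is empty."""
        if self._head == self._tail:
            return None
        index = self._head & self._mask
        item = self._slots[index]
        self._slots[index] = None  # drop the reference so the slot can be reused
        self._head += 1
        return item
```

Since each counter has exactly one writer, this layout also lines up with the single-writer idea mentioned earlier in the thread; note that `None` doubles as the "empty" signal here, so it cannot be stored as a message.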
There are HFT systems written in Java, lots of them, but I've never heard of a single one written in Scala. Why?
JVM languages, where garbage collection is commonplace, are not exactly a good choice for HFT. But Java is a garbage-collected language. So, what is the trick? The trick is that people code HFT in Java as if they were using C. I mean: they trade some of the niceties Java offers for tighter control of memory allocation. The trick is basically replacing the JCF (Java Collections Framework) with mundane arrays pre-allocated in memory, or some other creative solutions.
Scala is a functional programming language which promotes heavy use of functional programming techniques, most of them backed by the collections framework. The point is: if you are going to get rid of the collections framework (by item 1 above), you are basically not using the basic building blocks of FP built in Scala. If you are not doing that in the first place, what is the point of using Scala at all?
Akka does not look to be a good option because... well... it is written in Scala. So, by items (1) and (2), we can rule out Akka too.
The point is: Technology evolved since the question was proposed in 2012. My answer is now 10 years from the future. We have far more interesting options than C or C++, the only possible options in 2012.
OK... maybe not options (in plural)... but one option: Rust.
Rust is not a garbage-collected language, it is a safe programming language, it is productive, it offers a rich asynchronous library, and it is pretty fast.
The only "problem" with Rust is that people responsible for steering technology in big banks and financial institutions are, in general, averse to innovation, preferring things they already know since the beginning of their long careers. So, in a nutshell, it will take some time until we see HFT systems written in Rust.

What is the reason why high level abstractions that use lock free programming deep down aren't popular?

From what I gathered on the lock free programming, it is incredibly hard to do right... and I agree.
Just thinking about some problems makes my head hurt. But what I wonder is, why isn't
there a widespread use of high-level wrappers around (e.g. lock free queue and similar stuff)?
For example boost has no lock free library, although one was suggested as far as I know.
I mean I guess that there are a lot of applications where you can't avoid the fact that the critical
section is the big part of the load. So what are the reasons? Is it...
Patents - I heard that some stuff related to lock-free programming is patented.
Performance.
Google, and Microsoft have internal libraries like that but none of them are public...
Something else?
So my question is: Why are high level abstractions that use lock free programming deep down not very
popular, while at the same time "regular" multi-threaded programming is "in"?
EDIT: boost got a lockfree lib :)
There are few people who are familiar enough with the field to implement easy-to-use lock-free libraries. Of those few, even fewer publish work for free and of those almost none do the vital additional work to make the library useable - e.g. publish full API docs, etc. They tend to just release a zip file with code in, which is almost useless. Then of course you also need to find a library which is written in the language you want to use, compiles on the platform you're using and finally, word of the library has to get out, so people know it exists.
Patents are an issue, in that they limit what can be offered. There is, for example, to my knowledge no unpatented lock-free singly-linked list. All the skip list stuff is heavily patented, too.
A hero in this field is Cliff Click, who came up with a lock-free hash, which he has more-or-less placed in the public domain.
You can find my lock-free library here;
http://www.liblfds.org
Another is Samy Bahra's Concurrency Kit;
http://www.concurrencykit.org
FYI Microsoft's .Net framework gained some lock free classes in .Net 4.0. Namely container classes in the System.Collections.Concurrent namespace, which are:
ConcurrentDictionary
ConcurrentQueue
ConcurrentStack
I've looked into their implementation and they are relatively fiddly/complex under the hood therefore they do represent a significant amount of effort in designing and testing (threading issues are of course notoriously difficult to test to a high standard).
You can take a look at the libcds C++ library. It is a collection of lock-free containers (stacks, queues, sets and maps) and safe memory reclamation algorithms.
IMHO regarding C++ (I'm not advanced in other languages). The new C++ standard has just been released and the compiler developers need time to implement its requirements. Today, no compiler supports the C++11 memory model entirely, since it requires significant changes in the compiler’s optimization rules. Recently, Microsoft announced support for atomic operations, which are the basis of lock-free programming, in the VC++ 11 Developer Preview. It is good news for us. As far as I know, GCC is going to support it in 4.8 (or above).
The second problem is patents. Many interesting lock-free container algorithms are patented, which is a barrier to including them in vendors’ libraries.
Third, the main part of lock-free containers is garbage collecting (safe memory reclamation). C++ is free from any GC (fortunately). There are a few GC algos (Hazard Pointer, Pass-the-Buck, epoch-based and so on) but most of them are patented too.
Fourth, there are not enough tools to prove the correctness of the memory fences applied in your lock-free implementation. For now I know of only one: Relacy (http://www.1024cores.net/home/relacy-race-detector).
I think after 2-3 years we’ll see many production-ready multiplatform C++ libraries of lock-free containers and algorithms. These libraries are being developed by vendors and enthusiasts.
However, in my opinion, our future is hardware transactional memory (HTM). Today AMD, Sun (sorry, Oracle), Intel (?) are investigating HTM with very interesting results. Let’s wait.
There is at least one "lock free" framework that is somewhat popular: Erlang.
One major problem is that unless one uses an excessive number of memory barriers, it's hard to be certain that one has enough; if one does use an excessive number of memory barriers, performance is likely to be inferior to what one would have gotten using locks.
The biggest problem with locks is not performance, but robustness. If a thread gets waylaid while it holds a lock, the system dies. By contrast, if a thread which is accessing a lock-free data structure gets waylaid, it won't affect other threads' use thereof. In some situations, a lock-free data structure may be preferable to one using locks, even if performance is inferior, because one must protect the system from being brought down by a malfunctioning thread (for example, even if one was prepared to kill off a thread which hit a StackOverflowException without taking down the process, how would one protect against a thread putting a lot of stuff on its stack before calling a method that accesses a lock-protected data structure, such that the lock-guarded method hits a stack overflow?). If one uses lock-free data structures, such risks aren't a problem.

Relation between language and scalability

I came across the following statement in Trapexit, an Erlang community website:
Erlang is a programming language used
to build massively scalable soft
real-time systems with requirements on
high availability.
Also, I recall reading somewhere that Twitter switched from Ruby to Scala to address scalability problems.
Hence, I wonder what is the relation between a programming language and scalability?
I would think that scalability depends only on the system design, exception handling etc. Is it because of the way a language is implemented, the libraries, or some other reasons?
Hope for enlightenment. Thanks.
Erlang is highly optimized for a telecommunications environment, running at 5 9s uptime or so.
It contains a set of libraries called OTP, and it is possible to reload code into the application 'on the fly' without shutting down the application! In addition, there is a framework of supervisor modules and so on, so that when something fails, it gets automatically restarted, or else the failure can gradually work itself up the chain until it gets to a supervisor module that can deal with it.
That would be possible in other languages too, of course. In C++, you can reload DLLs on the fly and load plugins. In Python you can reload modules. In C#, you can load code in on the fly, use reflection and so on.
It's just that that functionality is built in to Erlang, which means that:
it's more standard, any Erlang developer knows how it works
less stuff to re-implement oneself
That said, there are some fundamental differences between languages, to the extent that some are interpreted, some run off bytecode, some are native compiled, so the performance, and the availability of type information and so on at runtime differs.
The standard Python implementation (CPython) has a global interpreter lock around its interpreter, so the threads of a single process cannot run Python code on several cores at once.
Erlang only recently had changes added to take advantage of SMP.
Generally I would agree with you in that I feel that a significant difference is down to the built-in libraries rather than a fundamental difference between the languages themselves.
Ultimately I feel that any project that gets very large risks getting 'bogged down' no matter what language it is written in. As you say I feel architecture and design are pretty fundamental to scalability and choosing one language over another will not I feel magically give awesome scalability...
Erlang comes from another culture in thinking about reliability and how to achieve it. Understanding the culture is important, since Erlang code does not become fault-tolerant by magic just because it's Erlang.
A fundamental idea is that high uptime does not only come from a very long mean-time-between-failures, it also comes from a very short mean-time-to-recovery, if a failure happened.
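The arithmetic behind that idea is the usual steady-state availability formula, availability = MTBF / (MTBF + MTTR). A tiny sketch (the function name and the numbers in the usage note are only illustrative):

```python
def availability(mtbf, mttr):
    """Steady-state availability from mean time between failures (mtbf)
    and mean time to recovery (mttr), both in the same time unit."""
    if mtbf <= 0 or mttr < 0:
        raise ValueError("mtbf must be positive and mttr non-negative")
    return mtbf / (mtbf + mttr)
```

For example, `availability(1000, 0.1)` and `availability(10000, 1)` come out the same: cutting recovery time by a factor of ten buys as much uptime as making failures ten times rarer, and it is often the cheaper of the two.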
One then realizes that one needs automatic restarts when a failure is detected. And one realizes that at the first detection of something not being quite right, one should "crash" to cause a restart. The recovery needs to be optimized, and the possible information losses need to be minimal.
This strategy is followed by many successful software systems, such as journaling filesystems or transaction-logging databases. But overwhelmingly, software tends to consider only the mean-time-between-failures, sending messages about error indications to the system log and then trying to keep on running until that is no longer possible. This typically requires a human to monitor the system and reboot it manually.
Most of these strategies are in the form of libraries in Erlang. The part that is a language feature is that processes can "link" and "monitor" each other. The first one is a bi-directional contract that "if you crash, then I get your crash message, which if not trapped will crash me", and the second is an "if you crash, I get a message about it".
Linking and monitoring are the mechanisms that the libraries use to make sure that other processes have not crashed (yet). Processes are organized into "supervision" trees. If a worker process in the tree fails, the supervisor will attempt to restart it, or all workers at the same level of that branch in the tree. If that fails it will escalate up, etc. If the top level supervisor gives up the application crashes and the virtual machine quits, at which point the system operator should make the computer restart.
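The restart-then-escalate behaviour can be sketched outside Erlang too. This is only an illustration in Python: `supervise` and `EscalatedFailure` are made-up names, not the OTP supervisor API, and a real supervisor also tracks restart intensity over time and manages several children.

```python
class EscalatedFailure(Exception):
    """Raised when a supervisor gives up and hands the failure to its parent."""


def supervise(worker, max_restarts):
    """Run worker(); restart it when it crashes, escalate after max_restarts."""
    restarts = 0
    while True:
        try:
            return worker()
        except Exception as exc:  # the worker "crashed"
            if restarts >= max_restarts:
                # Out of restart budget: let the next level up decide.
                raise EscalatedFailure(
                    f"gave up after {restarts} restarts") from exc
            restarts += 1
```

Nesting calls, for instance `supervise(lambda: supervise(worker, 3), 2)`, gives a crude two-level tree: the inner level retries the worker, and the outer level restarts the whole inner level when it escalates.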
The complete isolation between process heaps is another reason Erlang fares well. With few exceptions, it is not possible to "share values" between processes. This means that all processes are very self-contained and are often not affected by another process crashing. This property also holds between nodes in an Erlang cluster, so it is low-risk to handle a node failing out of the cluster. Replicate and send out change events rather than have a single point of failure.
The philosophies adopted by Erlang have many names: "fail fast", "crash-only system", "recovery oriented programming", "expose errors", "micro-restarts", "replication", ...
Erlang is a language designed with concurrency in mind. While most languages depend on the OS for multi-threading, concurrency is built into Erlang. Erlang programs can be made from thousands to millions of extremely lightweight processes that can run on a single processor, can run on a multicore processor, or can run on a network of processors. Erlang also has language level support for message passing between processes, fault-tolerance etc. The core of Erlang is a functional language and functional programming is the best paradigm for building concurrent systems.
In short, making a distributed, reliable and scalable system in Erlang is easy as it is a language designed specially for that purpose.
In short, the "language" primarily affects the vertical axis of scaling, but not all aspects, as you already alluded to in your question. Two things here:
1) Scalability needs to be defined in relation to a tangible metric. I propose money.
S = # of users / cost
Without an adequate definition, we will be discussing this point ad vitam aeternam. Using my proposed definition, it becomes easier to compare system implementations. For a system to be scalable (read: profitable), then:
Scalability grows with S
2) A system can be made to scale along 2 primary axes:
a) Vertical
b) Horizontal
a) Vertical scaling relates to enhancing nodes in isolation i.e. bigger server, more RAM etc.
b) Horizontal scaling relates to enhancing a system by adding nodes. This process is more involved since it requires dealing with real-world properties such as the speed of light (latency), tolerance to partition, failures of many kinds etc.
(Node => physical separation, different "fate sharing" from another)
The term scalability is too often abused unfortunately.
Too many times folks confuse language with libraries & implementation. These are all different things. What makes a language a good fit for a particular system has often more to do with the support around the said language: libraries, development tools, efficiency of the implementation (i.e. memory footprint, performance of builtin functions etc.)
In the case of Erlang, it just happens to have been designed with real world constraints (e.g. distributed environment, failures, need for availability to meet liquidated damages exposure etc.) as input requirements.
Anyways, I could go on for too long here.
First you have to distinguish between languages and their implementations. For instance, the Ruby language supports threads, but in the official implementation, threads will not make use of multicore chips.
Then, a language/implementation/algorithm is often termed scalable when it supports parallel computation (for instance via multithreading) AND if it exhibits a good speedup when the number of CPUs goes up (see Amdahl's law).
Some languages like Erlang, Scala, Oz etc. also have syntax (or nice libraries) which help in writing clear and nice parallel code.
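For reference, Amdahl's law says that if a fraction p of the work can be parallelized, the best possible speedup on n cores is 1 / ((1 - p) + p / n). A small sketch (the function name is just for illustration):

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when parallel_fraction of the work is
    spread over `cores` cores and the rest stays serial."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)
```

With 95% of the work parallel, 16 cores give a bound of roughly 9.1x, and no number of cores can get past 20x, which is why a single extra lock in the parallel portion shows up so clearly when moving to a larger machine.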
In addition to the points made here about Erlang (Which I was not aware of) there is a sense in which some languages are more suited for scripting and smaller tasks.
Languages like Ruby and Python have some features which are great for prototyping and creativity but terrible for large scale projects. Arguably their best features are their lack of "formality", which hurts you in large projects.
For example, static typing is a hassle on small script-type things, and makes languages like Java very verbose. But on a project with hundreds or thousands of classes you can easily see variable types. Compare this to maps and arrays that can hold heterogeneous collections, where as a consumer of a class you can't easily tell what kind of data it's holding. This kind of thing gets compounded as systems get larger. For example, you can also do things that are really difficult to trace, like dynamically add bits to classes at runtime (which can be fun but is a nightmare if you're trying to figure out where a piece of data comes from) or call methods that raise exceptions without being forced by the compiler to declare the exception. Not that you couldn't solve these kinds of things with good design and disciplined programming - it's just harder to do.
As an extreme case, you could (performance issues aside) build a large system out of shell scripts, and you could probably deal with some of the issues of the messiness, lack of typing and global variables by being very strict and careful with coding and naming conventions (in which case you'd sort of be creating a static typing system "by convention"), but it wouldn't be a fun exercise.
Twitter switched some parts of their architecture from Ruby to Scala because when they started they used the wrong tool for the job. They were using Ruby on Rails—which is highly optimised for building green field CRUD Web applications—to try to build a messaging system. AFAIK, they're still using Rails for the CRUD parts of Twitter e.g. creating a new user account, but have moved the messaging components to more suitable technologies.
Erlang is at its core based on asynchronous communication (both for co-located and distributed interactions), and that is the key to the scalability made possible by the platform. You can program with asynchronous communication on many platforms, but Erlang the language and the Erlang/OTP framework provide the structure to make it manageable - both technically and in your head. For instance: without the isolation provided by Erlang processes, you will shoot yourself in the foot. With the link/monitor mechanism you can react to failures sooner.

Does the advent of MultiCore architectures affect me as a software developer?

As a software developer dealing mostly with high-level programming languages I'm not sure what I can do to appropriately pay attention to the upcoming omni-presence of multicore computers. I write mostly ordinary and non-demanding applications, nevertheless I think it is important to know if I need to change any programming paradigms or even language to master the future.
My question therefore:
How to deal with increasing multicore presence in day-by-day hacking?
Herb Sutter wrote about it in 2005: The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software
Most problems do not require a lot of CPU time. Really, single cores are quite fast enough for many purposes. When you do find your program is too slow, first profile it and look at your choice of algorithms, architecture, and caching. If that doesn't get you enough, try to divide the problem up into separate processes. Often this is worth doing simply for fault isolation and so that you can understand the CPU and memory usage of each process. Also, normally each process will run on a specific core and make good use of the processor caches, so you won't have to suffer the substantial performance overhead of keeping cache lines consistent. If you go for a multi-process design and still find the problem needs more CPU time than you get with the machine you have, you are well placed to extend it to run over a cluster.
There are situations where you need multiple threads within the same address space, but beware that threads are really hard to get right. Race conditions, especially in non-safe languages, sometimes take weeks to debug; often, simply adding tracing or running under a debugger will change the timings enough to hide the problem. Simply putting locks everywhere often means you get a lot of locking overhead and sometimes so much lock contention that you don't really get the concurrency advantage you were hoping for. Even when you've got the locking right, you then need to profile to tune for cache coherency. Ultimately, if you want to really tune some highly concurrent code, you'll probably end up looking at lock-free constructs and more complex locking schemes than those in current multi-threading libraries.
Learn the benefits of concurrency, and the limits (e.g. Amdahl's law).
So you can, where possible, exploit the only route for higher performance that is going to be open. There is a lot of innovative work happening on easier approaches (futures and task libraries), and old work being rediscovered (functional languages and immutable data).
The free lunch is over, but that does not mean that there is nothing to exploit.
In general, become very friendly with threading. It's a terrible mechanism for parallelization, but it's what we have.
If you do work with .NET, look at the Parallel Extensions. They allow you to easily accomplish many parallel programming tasks.
To benefit from more than just one core you should consider parallelizing your code. Multiple threads, immutable types, and a minimum of synchronization are your new friends.
I think it will depend on what kind of applications you're writing.
Some kinds of apps benefit more than others from running on a multi-core CPU.
If your application can benefit from the multi-core fact, then you should be ready to go parallel.
The free lunch is over; that is: in the past, your application became faster when a new CPU was released and you didn't have to put any effort into your application to get that extra speed.
Now, to take advantage of the capabilities a multi-core CPU offers, you have to make sure that your application can take advantage of it. That is: you have to see which tasks can be executed multithreaded / concurrently, and this brings some issues to the table ...
Learn Erlang/F# (depending on your platform)
Prefer immutable data structures, their use makes software easier to understand not only in concurrent programs.
Learn the tools for concurrency in your language (e.g. java.util.concurrent, JCIP).
Learn a functional language (e.g Haskell).
I've been asked the same question, and the answer is, "it depends". If you're Joe Winforms, maybe not so much. If you're writing code that must be performant, yes. One of the biggest problems I can see with parallel programming is this: if something can't be parallelized, and you lie and tell the run-time to do it in parallel anyway, it's not going to crash, it's just going to do things wrong, and you'll get crap results and blame the framework.
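A small, deterministic illustration of that failure mode, as a sketch with made-up function names. A running total has a dependency from each element to the previous one, so it cannot simply be cut into independent chunks. No threads are even needed to see the problem: processing the chunks as if they were independent is exactly what a parallel runtime would do if you told it they were.

```python
def running_total(values):
    """Correct sequential version: each output depends on the previous one."""
    out, acc = [], 0
    for value in values:
        acc += value
        out.append(acc)
    return out


def naive_chunked_running_total(values, chunks):
    """WRONG on purpose: pretends the chunks are independent.
    It never crashes; it silently restarts the total in every chunk."""
    size = max(1, -(-len(values) // chunks))  # ceiling division, at least 1
    out = []
    for start in range(0, len(values), size):
        out.extend(running_total(values[start:start + size]))
    return out
```

For `[1, 2, 3, 4]` the sequential version gives `[1, 3, 6, 10]`, while the two-chunk version gives `[1, 3, 3, 7]`: no exception, just wrong numbers.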
Learn OpenMP and MPI for C and C++ code.
OpenMP also applies to other languages as well like Fortran I suppose.
Write smaller programs.
Other code languages/styles will let you do multithreading better (though multithreading is still really hard in any language) but the big benefit for regular developers, IMHO, is the ability to execute lots of smaller programs concurrently to accomplish some much larger task.
So, get in the habit of breaking your problems down into independent components that can be run whenever you want.
You'll build more maintainable software too.
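A rough sketch of the "lots of smaller programs" approach, assuming Python is the scripting glue. The child program is an inline one-liner here only to keep the example self-contained; `CHILD` and `parallel_sum` are hypothetical names, and in practice each piece would be its own small executable.

```python
import subprocess
import sys

# A tiny stand-alone "program": sums the integers in [lo, hi) and prints the result.
CHILD = ("import sys; lo, hi = int(sys.argv[1]), int(sys.argv[2]); "
         "print(sum(range(lo, hi)))")


def parallel_sum(n, parts):
    """Sum range(n) by running up to `parts` independent child processes."""
    if n <= 0:
        return 0
    step = -(-n // parts)  # ceiling division
    # Start every child before waiting on any of them, so they run concurrently.
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", CHILD, str(lo), str(min(lo + step, n))],
            stdout=subprocess.PIPE, text=True)
        for lo in range(0, n, step)
    ]
    total = 0
    for proc in procs:
        out, _ = proc.communicate()
        if proc.returncode != 0:
            raise RuntimeError("child process failed")
        total += int(out)
    return total
```

The children share nothing but their command-line arguments and their output, so the OS is free to schedule each one on its own core, and a crash in one of them is isolated from the rest.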

What's the difference between Managed/Byte Code and Unmanaged/Native Code?

Sometimes it's difficult to describe some of the things that "us programmers" may think are simple to non-programmers and management types.
So...
How would you describe the difference between Managed Code (or Java Byte Code) and Unmanaged/Native Code to a Non-Programmer?
Managed Code == "Mansion House with an entire staff of Butlers, Maids, Cooks & Gardeners to keep the place nice"
Unmanaged Code == "Where I used to live in University"
Think of your desk: if you clean it up regularly, there's space to set what you're actually working on in front of you. If you don't clean it up, you run out of space.
That space is equivalent to computer resources like RAM, Hard Disk, etc.
Managed code lets the system automatically choose when and what to clean up. Unmanaged code makes the process "manual" - in that the programmer needs to tell the system when and what to clean up.
I'm astonished by what emerges from this discussion (well, not really but rhetorically). Let me add something, even if I'm late.
Virtual Machines (VMs) and Garbage Collection (GC) are decades old and two separate concepts. Garbage-collected languages compiled to native code exist, and have for decades (canonical example: ANSI Common Lisp; well, there is at least one declarative language with compile-time garbage collection, Mercury - but apparently the masses scream at Prolog-like languages).
Suddenly GCed byte-code based VMs are a panacea for all IT diseases. Sandboxing of existing binaries (other examples here, here and here)? Principle of least authority (POLA)/capabilities-based security? Slim binaries (or its modern variant SafeTSA)? Region inference? No, sir: Microsoft & Sun do not authorize us even to think about such perversions. No, better rewrite our entire software stack for this wonderful(???) new(???) language§/API. As one of our hosts says, it's Fire and Motion all over again.
§ Don't be silly: I know that C# is not the only language that targets .Net/Mono, it's hyperbole.
Edit: it is particularly instructive to look at comments to this answer by S.Lott in the light of alternative techniques for memory management/safety/code mobility that I pointed out.
My point is that non-technical people don't need to be bothered with technicalities at this level of detail.
On the other hand, if they are impressed by Microsoft/Sun marketing, it is necessary to explain to them that they are being fooled - GCed byte-code based VMs are not the novelty they are claimed to be, they don't magically solve every IT problem, and alternatives to these implementation techniques exist (some are better).
Edit 2: Garbage Collection is a memory management technique and, like every implementation technique, needs to be understood to be used correctly. Look how, at ITA Software, they bypass GC to obtain good performance:
4 - Because we have about 2 gigs of static data we need rapid access to, we use C++ code to memory-map huge files containing pointerless C structs (of flights, fares, etc), and then access these from Common Lisp using foreign data accesses. A struct field access compiles into two or three instructions, so there's not really any performance penalty for accessing C rather than Lisp objects. By doing this, we keep the Lisp garbage collector from seeing the data (to Lisp, each pointer to a C object is just a fixnum, though we do often temporarily wrap these pointers in Lisp objects to improve debuggability). Our Lisp images are therefore only about 250 megs of "working" data structures and code.
...
9 - We can do 10 seconds of Lisp computation on a 800mhz box and cons less than 5k of data. This is because we pre-allocate all data structures we need and die on queries that exceed them. This may make many Lisp programmers cringe, but with a 250 meg image and real-time constraints, we can't afford to generate garbage. For example, rather than using cons, we use "cons!", which grabs cells from an array of 10,000,000 cells we've preallocated and which gets reset every query.
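For readers who want to see the shape of the "cons!" trick from the quote, here is a rough sketch in Python. It is not ITA's code: the `ConsPool` class and its method names are made up for illustration, and Python itself still manages the stored values, so this only shows the bookkeeping (preallocate once, fail when the pool is exhausted, reset per query), not the performance benefit.

```python
class ConsPool:
    """Fixed-size pool of cons cells (hypothetical sketch of the quoted idea)."""

    def __init__(self, capacity):
        # Allocate every cell up front; nothing new is allocated per query.
        self.car = [None] * capacity
        self.cdr = [None] * capacity
        self.capacity = capacity
        self.next_free = 0

    def cons(self, head, tail):
        """Claim the next free cell and return its index (our 'pointer')."""
        if self.next_free >= self.capacity:
            # Mirrors "die on queries that exceed them".
            raise MemoryError("query exceeded the preallocated cell pool")
        index = self.next_free
        self.car[index] = head
        self.cdr[index] = tail
        self.next_free += 1
        return index

    def to_list(self, index):
        """Walk a chain of cells and collect the heads into a Python list."""
        items = []
        while index is not None:
            items.append(self.car[index])
            index = self.cdr[index]
        return items

    def reset(self):
        """Make every cell reusable again; meant to be called once per query."""
        self.next_free = 0


pool = ConsPool(capacity=4)
chain = pool.cons(1, pool.cons(2, pool.cons(3, None)))
values = pool.to_list(chain)
```

The point of the design is that "freeing" is a single assignment in `reset()`, so no collector ever has to trace the cells.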
Edit 3: (to avoid misunderstanding) is GC better than fiddling directly with pointers? Most of the time, certainly, but there are alternatives to both. Is there a need to bother users with these details? I don't see any evidence that this is the case, besides dispelling some marketing hype when necessary.
I'm pretty sure the basic interpretation is:
Managed = resource cleanup managed by runtime (i.e. Garbage Collection)
Unmanaged = clean up after yourself (i.e. malloc & free)
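If a concrete picture helps, here is a small sketch of the "managed" half, using Python as a stand-in for a garbage-collected runtime (it is not .NET or the JVM, and the `Node` class and `make_cycle` function are invented for the example). It assumes CPython, where `gc.collect()` forces a collection pass. Nobody calls anything like `free`; the runtime reclaims the objects once they are unreachable.

```python
import gc
import weakref


class Node:
    """A tiny object that can point at another Node."""

    def __init__(self):
        self.other = None


def make_cycle():
    # Two objects that refer to each other, then go out of scope.
    a = Node()
    b = Node()
    a.other = b
    b.other = a
    # A weak reference lets us observe the object without keeping it alive.
    return weakref.ref(a)


probe = make_cycle()
gc.collect()                  # ask the runtime to clean up unreachable objects
reclaimed = probe() is None   # the weak reference is dead once 'a' is reclaimed
```

In the unmanaged version of the same story, the programmer would have to remember to release both nodes by hand, and forgetting to do so is the memory leak described elsewhere in this thread.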
Perhaps compare it with investing in the stock market.
You can buy and sell shares yourself, trying to become an expert in what will give the best risk/reward - or you can invest in a fund which is managed by an "expert" who will do it for you - at the cost of you losing some control, and possibly some commission. (Admittedly I'm more of a fan of tracker funds, and the stock market "experts" haven't exactly done brilliantly recently, but....)
Here's my Answer:
Managed (.NET) or Byte Code (Java) will save you time and money.
Now let's compare the two:
Unmanaged or Native Code
You need to do your own resource (RAM / Memory) allocation and cleanup. If you forget something, you end up with what's called a "Memory Leak" that can crash the computer. A Memory Leak is a term for when an application starts using up (eating up) RAM/Memory but not letting it go so the computer can use it for other applications; eventually this causes the computer to crash.
In order to run your application on different Operating Systems (Mac OSX, Windows, etc.) you need to compile your code specifically for each Operating System, and possibly change a lot of code that is Operating System specific so it works on each Operating System.
.NET Managed Code or Java Byte Code
All the resource (RAM / Memory) allocation and cleanup are done for you and the risk of creating "Memory Leaks" is reduced to a minimum. This allows more time to code features instead of spending it on resource management.
In order to run your application on different Operating Systems (Mac OSX, Windows, etc.) you just compile once, and it'll run on each as long as they support the given Framework your app runs on top of (.NET Framework / Mono or Java).
In Short
Developing using the .NET Framework (Managed Code) or Java (Byte Code) makes it cheaper overall to build an application that can target multiple operating systems with ease, and allows more time to be spent building rich features instead of the mundane tasks of memory/resource management.
Also, before anyone points out that the .NET Framework doesn't support multiple operating systems, I need to point out that technically Windows 98, WinXP 32-bit, WinXP 64-bit, WinVista 32-bit, WinVista 64-bit and Windows Server are all different Operating Systems, but the same .NET app will run on each. And, there is also the Mono Project that brings .NET to Linux and Mac OSX.
Unmanaged code is a list of instructions for the computer to follow.
Managed code is a list of tasks for the computer to follow, which the computer is free to interpret on its own when deciding how to accomplish them.
The big difference is memory management. With native code, you have to manage memory yourself. This can be difficult and is the cause of a lot of bugs and a lot of development time spent tracking down those bugs. With managed code, you still have problems, but far fewer of them and they're easier to track down. This normally means less buggy software, and less development time.
There are other differences, but memory management is probably the biggest.
If they were still interested I might mention how a lot of exploits are from buffer overruns and that you don't get that with managed code, or that code reuse is now easy, or that we no longer have to deal with COM (if you're lucky anyway). I'd probably stay away from COM, otherwise I'd launch into a tirade over how awful it is.
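To make the buffer overrun point a bit more concrete, here is a minimal sketch, again using Python as a stand-in for a bounds-checked managed runtime (the `checked_write` helper is made up for the example). In C, writing one byte past the end of an 8-byte buffer silently scribbles over whatever is next in memory; here the runtime refuses the write.

```python
def checked_write(buffer, index, value):
    """Try to store value at index; report whether the runtime allowed it."""
    try:
        buffer[index] = value
        return True
    except IndexError:
        # The runtime rejected the out-of-range write instead of
        # overwriting neighbouring memory.
        return False


buffer = bytearray(8)                             # valid indices are 0..7
in_bounds = checked_write(buffer, 7, 0xFF)        # last valid slot
out_of_bounds = checked_write(buffer, 8, 0xFF)    # one past the end
```

The failed write leaves the buffer at its original size, which is the property that takes the classic overrun exploit off the table.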
It's like the difference between playing pool with and without bumpers along the edges. Unless you and all the other players always make perfect shots, you need something to keep the balls on the table. (Ignore intentional ricochets...)
Or use soccer with walls instead of sidelines and endlines, or baseball without a backstop, or hockey without a net behind the goal, or NASCAR without barriers, or football without helmets ...
"The specific term managed code is particularly pervasive in the Microsoft world."
Since I work in MacOS and Linux world, it's not a term I use or encounter.
The Brad Abrams "What is Managed Code" blog post has a definition that says things like ".NET Framework Common Language Runtime".
My point is this: it may not be appropriate to explain the term at all. If it's a bug, hack or work-around, it's not very important. Certainly not important enough to work up a sophisticated layperson's description. It may vanish with the next release of some batch of MS products.

Resources