Does the advent of MultiCore architectures affect me as a software developer? - cpu

As a software developer dealing mostly with high-level programming languages I'm not sure what I can do to appropriately pay attention to the upcoming omni-presence of multicore computers. I write mostly ordinary and non-demanding applications, nevertheless I think it is important to know if I need to change any programming paradigms or even language to master the future.
My question therefore:
How to deal with increasing multicore presence in day-by-day hacking?

Herb Sutter wrote about it in 2005: The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software

Most problems do not require a lot of CPU time. Really, single cores are quite fast enough for many purposes. When you do find your program is too slow, first profile it and look at your choice of algorithms, architecture, and caching. If that doesn't get you enough, try to divide the problem up into separate processes. Often this is worth doing simply for fault isolation and so that you can understand the CPU and memory usage of each process. Also, normally each process will run on a specific core and make good use of the processor caches, so you won't have to suffer the substantial performance overhead of keeping cache lines consistent. If you go for a multi process design and still find problem needs more CPU time than you get with the machine you have, you are well placed to extend it run over a cluster.
There are situations where you need multiple threads within the same address space, but beware that threads are really hard to get right. Race conditions, especially in non-safe languages, sometimes take weeks to debug; often, simply adding tracing or running under a debugger will change the timings enough to hide the problem. Simply putting locks everywhere often means you get a lot of locking overhead and sometimes so much lock contention that you don't really get the concurrency advantage you were hoping for. Even when you've got the locking right, you then need to profile to tune for cache coherency. Ultimately, if you want to really tune some highly concurrent code, you'll probably end up looking at lock-free constructs and more complex locking schemes than those in current multi-threading libraries.

Learn the benefits of concurrency, and the limits (e.g. Amdahl's law).
So you can, where possible, exploit the only route for higher performance that is going to be open. There is a lot of innovative work happening on easier approaches (futures and task libraries), and old work being rediscovered (functional languages and immutable data).
The free lunch is over, but that does not mean that there is nothing to exploit.

In general, become very friendly with threading. It's a terrible mechanism for parallelization, but it's what we have.
If you do work with .NET, look at the Parallel Extensions. They allow you to easily accomplish many parallel programming tasks.

To benefit from more that just one core you should consider parallelizing your code. Multiple threads, immutable types, and a minimum of synchronization are your new friends.

I think it will depend on what kind of applications you're writing.
Some kind of apps benefit more of the fact that they're run on a mutli-core cpu then others.
If your application can benefit from the multi-core fact, then you should be ready to go parallel.
The free lunch is over; that is: in the past, your application became faster when a new cpu was released and you didn't have to put any effort in your application to get that extra speed.
Now, to take advantage of the capabilities a multi-core cpu offers, you've to make sure that your application can take advantage of it. That is: you've to see which tasks can be executed multithreaded / concurrently, and this brings some issues to the table ...

Learn Erlang/F# (depending on your platform)

Prefer immutable data structures, their use makes software easier to understand not only in concurrent programs.
Learn the tools for concurrency in your language (e.g. java.util.concurrent, JCIP).
Learn a functional language (e.g Haskell).

I've been asked the same question, and the answer is, "it depends". If your Joe Winforms, maybe not so much. If your writing code that must be performant, yes. One of the biggest problem I can see with parallel programming is this: if something can't be parallized, and you lie and tell the run-time to do in parallel anyways, it's not going to crash, it's just going to do things wrong, and you'll get crap results and blame the framework.

Learn OpenMP and MPI for C and C++ code.
OpenMP also applies to other languages as well like Fortran I suppose.

Write smaller programs.
Other code languages/styles will let you do multithreading better (though multithreading is still really hard in any language) but the big benefit for regular developers, IMHO, is the ability to execute lots of smaller programs concurrently to accomplish some much larger task.
So, get in the habit of breaking your problems down into independent components that can be run whenever you want.
You'll build more maintainable software too.

Related

Where to learn about low-level, hard-core performance stuffs?

This is actually a 2 part question:
For people who want to squeeze every clock cycle, people talk about pipelines, cache locality, etc.
I have seen these low level performance techniques mentioned here and there but I have not seen a good introduction to the subject, from start to finish. Any resource recommendations? (Google gave me definitions and papers, where I'd really appreciate some kind of worked examples/tutorials real-life hands-on kind of materials)
How does one actually measure this kind of things? Like, as in a profiler of some sort? I know we can always change the code, see the improvement and theorize in retrospect, I am just wondering if there are established tools for the job.
(I know algorithm optimization is where the orders of magnitudes are. I am interested in the metal here)
The chorus of replies is, "Don't optimize prematurely." As you mention, you will get a lot more performance out of a better design than a better loop, and your maintainers will appreciate it, as well.
That said, to answer your question:
Learn assembly. Lots and lots of assembly. Don't MUL by a power of two when you can shift. Learn the weird uses of xor to copy and clear registers. For specific references,
http://www.mark.masmcode.com/ and http://www.agner.org/optimize/
Yes, you need to time your code. On *nix, it can be as easy as time { commands ; } but you'll probably want to use a full-features profiler. GNU gprof is open source http://www.cs.utah.edu/dept/old/texinfo/as/gprof.html
If this really is your thing, go for it, have fun, and remember, lots and lots of bit-level math. And your maintainers will hate you ;)
EDIT/REWRITE:
If it is books you need Michael Abrash did a good job in this area, Zen of Assembly language, a number of magazine articles, big black book of graphics programming, etc. Much of what he was tuning for is no longer a problem, the problems have changed. What you will get out of this is the ideas of the kinds of things that can cause bottle necks and the kinds of ways to solve. Most important is to time everything, and understand how your timing measurements work so that you are not fooling yourself by measuring incorrectly. Time the different solutions and try crazy, weird solutions, you may find an optimization that you were not aware of and didnt realize until you exposed it.
I have only just started reading but See MIPS Run (early/first edition) looks good so far (note that ARM took over MIPS as the leader in the processor market, so the MIPS and RISC hype is a bit dated). There are a number of text books old and new to be had about MIPS. Mips being designed for performance (At the cost of the software engineer in some ways).
The bottlenecks today fall into the categories of the processor itself and the I/O around it and what is connected to that I/O. The insides of the processor chips themselves (for higher end systems) run much faster than the I/O can handle, so you can only tune so far before you have to go off chip and wait forever. Getting off the train, from the train to your destination half a minute faster when the train ride was 3 hours is not necessarily a worthwhile optimization.
It is all about learning the hardware, you can probably stay within the ones and zeros world and not have to get into the actual electronics. But without really knowing the interfaces and internals you really cannot do much performance tuning. You might re-arrange or change a few instructions and get a little boost, but to make something several hundred times faster you need more than that. Learning a lot of different instruction sets (assembly languages) helps get into the processors. I would recommend simulating HDL, for example processors at opencores, to get a feel for how some folks do their designs and getting a solid handle on how to really squeeze clocks out of a task. Processor knowledge is big, memory interfaces are a huge deal and need to be learned, media (flash, hard disks, etc) and displays and graphics, networking, and all the types of interfaces between all of those things. And understanding at the clock level or as close to it as you can get, is what it takes.
Intel and AMD provide optimization manuals for x86 and x86-64.
http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html/
http://developer.amd.com/documentation/guides/pages/default.aspx
Another excellent resource is agner.
http://www.agner.org/optimize/
Some of the key points (in no particular order):
Alignment; memory, loop/function labels/addresses
Cache; non-temporal hints, page and cache misses
Branches; branch prediction and avoiding branching with compare&move op-codes
Vectorization; using SSE and AVX instructions
Op-codes; avoiding slow running op-codes, taking advantage of op-code fusion
Throughput / pipeline; re-ordering or interleaving op-codes to perform separate tasks avoiding partial stales and saturating the processor's ALUs and FPUs
Loop unrolling; performing multiple iterations for a single "loop comparison, branch"
Synchronization; using atomic op-code (or LOCK prefix) to avoid high level synchronization constructs
Yes, measure, and yes, know all those techniques.
Experienced people will tell you "don't optimize prematurely", which I relate as simply "don't guess".
They will also say "use a profiler to find the bottleneck", but I have a problem with that. I hear lots of stories of people using profilers and either liking them a lot or being confused with their output.
SO is full of them.
What I don't hear a lot of is success stories, with speedup factors achieved.
The method I use is very simple, and I've tried to give lots of examples, including this case.
I'd suggest Optimizing subroutines in assembly
language
An optimization guide for x86 platforms.
It's quite heavy stuff though ;)

Are there any resources for language independent performance tips?

I work with many people that program video games for a living. I have a quite a bit of knowledge in C++ and I know a number of general performance strategies to utilize in day to day programming. Like using prefix ++/-- over post fix.
My problem is that often times people come to me to give them tips on general optimizations they can do on a regular basis when programming, but often times these people program in all sorts of languages. Some use C++, C#, Java, ActionScript, etc.
I am wondering if there are any general performance tips that can be utilized on a day by day programming basis? For example, I would suggest prefix ++/-- over postfix for people programming in another language, but I am just not sure if that is true.
My guess is that it is language specific and the best way to go about general optimizations is to make sure you are not using majorly bloated algorithms, but maybe someone has some advice.
Without going into language specifics, or even knowing whether this is embedded, web, CAD, game, or iPhone programming, there isn't much that can be said. All we know is that there's multiple languages involved, and for some unknown reason performance is always slower than desirable.
First, check your algorithms. A slow algorithm can cause horrible performance. Read up on algorithms and their complexity.
Second, note if there are any really slow operations, such as hitting a database or transmitting information or moving a robot arm. See if the program is doing more of those than it should.
Third, profile. If there's a section of code that's taking 5% of the time, no optimization will make your program more than 5% faster. If a section of code is taking a lot of the time, it's worth looking at.
Fourth, get somebody who knows what they're doing to make any specific optimizations. Test them when they're done to make sure they actually speed up performance. When performance was an issue, I've improved it with some counterintuitive measures, like rolling up loops.
I don't think you can generalize optimization as such. To optimize execution time, you need to dig deep into the language and understand how things work in detail. Just guessing or making assumptions on experiences with other languages won't work! For example, writing x = x << 1 instead of x = x*2 might be a big benefit in C++. In JavaScript it will slow you down.
With all the differences between all the languages it's hard to find generic optimization tips. Maybe for some languages which are similar (f.ex. C# and Java). But if you add both JavaScript and Python to that list I'm pretty sure not many common optimization techniques will be left over.
Also keep in mind that premature optimization is often considered bad practice. Developer-hours are much more expensive than buying additional hardware.
However, there is one thing which comes to mind. Over the past decade or so, Object Relational Mappers have become quite popular. And hence, they emerge(d) in pretty much all popular languages. But you have to be careful with those. It's easy to load tons of data into memory that you will never use in your code if not properly configured. Keep that in mind. Lazy loading might be of some help here. But your mileage will vary.
Optimization depends on so many things that answering such a generic question would make this post explode into a full-fledged paper. In my opinion, optimization should be regarded on a project-by-project basis. Not only Language-by-Language basis.
I think you need to split this into two separate questions:
1) Are there language-agnostic ways to find performance problems? YES. Profile, but avoid the myths around that subject.
2) Are there language-agnostic ways to fix performance problems? IT DEPENDS.
A general language-agnostic principle is: do (1) before you do (2).
In other words, Ready-Aim-Fire, not Ready-Fire-Aim.
Here's an example of performance tuning, in C, but it could be any language.
A few things I have learned since asking this:
I/O operations are usually the most expensive to performance. This holds especially true when you are doing disk or network I/O (which is usually the most expensive because if you have to wait for a response from the other host you have to wait for all processing and I/O operations the remote host does). Only do these operations when absolutely necessary and possibly consider using a cache when possible.
Database operations can be very expensive because of network/disk I/O and the translation time to and from SQL. Using in-memory DB or cache can help reduce I/O issues and some (not all) NoSQL databases can reduce SQL translation time.
Only log important information. Using logging libraries like log4j can help because you can put logging to your hearts desire in your application but you set each message to a certain log level. Whichever log level you set the application to it will only log messages at that level or higher. This way if you need to troubleshoot functionality you only have to change a quick config and restart you application to give you additional messages. Then when you are done just turn you application back to the default level so that you do not log too often.
Only include functionality that is needed. Additional functionality may be nice to have but can increase processing time, provide additional locations for the application to fail, and costs your team development time that could be spent on more important tasks.
Use and configure your memory manager correctly. Garbage collection routines can kill performance if they are not configured correctly. If every minute you application freezes for a second or two for garbage collection your customer probably will not be happy.
Profile only after you have discovered a performance issue. Profilers will make the applications performance look worse than it is because you have your application and the profiler running on the same host, consuming the same hardware resources.
Do not prematurely do performance tuning. There are general practices you can take that should be better on performance in each language, but starting performance tuning in the middle of application development can cost you a lot on development because there is still functionality to be added.
This is not necessarily going to help performance but keep class dependency to a minimal. When you get into performance tuning there is good chance you will have to rewrite whole portions of code, which if there is a lot of dependencies on the section you are performance tuning the greater chance you will break the code. It can often be a domino affect because after fixing the performance issue than you have to fix all the dependencies, and possibly dependencies of the original dependencies. A performance tuning exercise estimate for a few hours can quickly turn into months with an application that has a lot of dependencies.
If performance is a concern do not use interpreted languages (scripting languages).
Only use the hardware you need. Having a system with a 64 core processor may seem cool but if you only have two or three threads running in your application than you are getting little benefit from having 64 cores. In fact, in rare instances having overly excessive hardware can sometimes hurt performance because the chips have to be wired to handle all the hardware which can cause your application to spend more time switching between cores or processors than actually being processed.
Any timing metrics you report make as granular as possible. Currently, you may only need to be worried about the number of milliseconds a process takes but in the future as you make your application faster and faster you may need more granular timings. If version A uses milliseconds and version B uses microseconds, how can you compare performance if version B is taking about the same number of milliseconds. Version B may be better but you just can't tell because version A did not use granular enough metrics.

Relation between language and scalability

I came across the following statement in Trapexit, an Erlang community website:
Erlang is a programming language used
to build massively scalable soft
real-time systems with requirements on
high availability.
Also I recall reading somewhere that Twitter switched from Ruby to Scala to address scalability problem.
Hence, I wonder what is the relation between a programming language and scalability?
I would think that scalability depends only on the system design, exception handling etc. Is it because of the way a language is implemented, the libraries, or some other reasons?
Hope for enlightenment. Thanks.
Erlang is highly optimized for a telecommunications environment, running at 5 9s uptime or so.
It contains a set of libraries called OTP, and it is possible to reload code into the application 'on the fly' without shutting down the application! In addition, there is a framework of supervisor modules and so on, so that when something fails, it gets automatically restarted, or else the failure can gradually work itself up the chain until it gets to a supervisor module that can deal with it.
That would be possible in other languages of course too. In C++, you can reload dlls on the fly, load plugsin. In Python you can reload modules. In C#, you can load code in on-the-fly, use reflection and so on.
It's just that that functionality is built in to Erlang, which means that:
it's more standard, any erlang developer knows how it works
less stuff to re-implement oneself
That said, there are some fundamental differences between languages, to the extent that some are interpreted, some run off bytecode, some are native compiled, so the performance, and the availability of type information and so on at runtime differs.
Python has a global interpreter lock around its runtime library so cannot make use of SMP.
Erlang only recently had changes added to take advantage of SMP.
Generally I would agree with you in that I feel that a significant difference is down to the built-in libraries rather than a fundamental difference between the languages themselves.
Ultimately I feel that any project that gets very large risks getting 'bogged down' no matter what language it is written in. As you say I feel architecture and design are pretty fundamental to scalability and choosing one language over another will not I feel magically give awesome scalability...
Erlang comes from another culture in thinking about reliability and how to achieve it. Understanding the culture is important, since Erlang code does not become fault-tolerant by magic just because its Erlang.
A fundamental idea is that high uptime does not only come from a very long mean-time-between-failures, it also comes from a very short mean-time-to-recovery, if a failure happened.
One then realize that one need automatic restarts when a failure is detected. And one realize that at the first detection of something not being quite right then one should "crash" to cause a restart. The recovery needs to be optimized, and the possible information losses need to be minimal.
This strategy is followed by many successful softwares, such as journaling filesystems or transaction-logging databases. But overwhelmingly, software tends to only consider the mean-time-between-failure and send messages to the system log about error-indications then try to keep on running until it is not possible anymore. Typically requiring human monitoring the system and manually reboot.
Most of these strategies are in the form of libraries in Erlang. The part that is a language feature is that processes can "link" and "monitor" each other. The first one is a bi-directional contract that "if you crash, then I get your crash message, which if not trapped will crash me", and the second is a "if you crash, i get a message about it".
Linking and monitoring are the mechanisms that the libraries use to make sure that other processes have not crashed (yet). Processes are organized into "supervision" trees. If a worker process in the tree fails, the supervisor will attempt to restart it, or all workers at the same level of that branch in the tree. If that fails it will escalate up, etc. If the top level supervisor gives up the application crashes and the virtual machine quits, at which point the system operator should make the computer restart.
The complete isolation between process heaps is another reason Erlang fares well. With few exceptions, it is not possible to "share values" between processes. This means that all processes are very self-contained and are often not affected by another process crashing. This property also holds between nodes in an Erlang cluster, so it is low-risk to handle a node failing out of the cluster. Replicate and send out change events rather than have a single point of failure.
The philosophies adopted by Erlang has many names, "fail fast", "crash-only system", "recovery oriented programming", "expose errors", "micro-restarts", "replication", ...
Erlang is a language designed with concurrency in mind. While most languages depend on the OS for multi-threading, concurrency is built into Erlang. Erlang programs can be made from thousands to millions of extremely lightweight processes that can run on a single processor, can run on a multicore processor, or can run on a network of processors. Erlang also has language level support for message passing between processes, fault-tolerance etc. The core of Erlang is a functional language and functional programming is the best paradigm for building concurrent systems.
In short, making a distributed, reliable and scalable system in Erlang is easy as it is a language designed specially for that purpose.
In short, the "language" primarily affects the vertical axii of scaling but not all aspects as you already eluded to in your question. Two things here:
1) Scalability needs to be defined in relation to a tangible metric. I propose money.
S = # of users / cost
Without an adequate definition, we will discussing this point ad vitam eternam. Using my proposed definition, it becomes easier to compare system implementations. For a system to be scalable (read: profitable), then:
Scalability grows with S
2) A system can be made to scale based on 2 primary axis:
a) Vertical
b) Horizontal
a) Vertical scaling relates to enhancing nodes in isolation i.e. bigger server, more RAM etc.
b) Horizontal scaling relates to enhancing a system by adding nodes. This process is more involving since it requires dealing with real world properties such as speed of light (latency), tolerance to partition, failures of many kinds etc.
(Node => physical separation, different "fate sharing" from another)
The term scalability is too often abused unfortunately.
Too many times folks confuse language with libraries & implementation. These are all different things. What makes a language a good fit for a particular system has often more to do with the support around the said language: libraries, development tools, efficiency of the implementation (i.e. memory footprint, performance of builtin functions etc.)
In the case of Erlang, it just happens to have been designed with real world constraints (e.g. distributed environment, failures, need for availability to meet liquidated damages exposure etc.) as input requirements.
Anyways, I could go on for too long here.
First you have to distinguish between languages and their implementations. For instance ruby language supports threads, but in the official implementation, the thread will not make use of multicore chips.
Then, a language/implementation/algorithm is often termed scalable when it supports parallel computation (for instance via multithread) AND if it exhibits a good speedup increase when the number of CPU goes up (see Amdahl Law).
Some languages like Erlang, Scala, Oz etc. have also syntax (or nice library) which help writing clear and nice parallel code.
In addition to the points made here about Erlang (Which I was not aware of) there is a sense in which some languages are more suited for scripting and smaller tasks.
Languages like ruby and python have some features which are great for prototyping and creativity but terrible for large scale projects. Arguably their best features are their lack of "formality", which hurts you in large projects.
For example, static typing is a hassle on small script-type things, and makes languages like java very verbose. But on a project with hundreds or thousands of classes you can easily see variable types. Compare this to maps and arrays that can hold heterogeneous collections, where as a consumer of a class you can't easily tell what kind of data it's holding. This kind of thing gets compounded as systems get larger. e.g. You can also do things that are really difficult to trace, like dynamically add bits to classes at runtime (which can be fun but is a nightmare if you're trying to figure out where a piece of data comes from) or call methods that raise exceptions without being forced by the compiler to declare the exception. Not that you couldn't solve these kinds of things with good design and disciplined programming - it's just harder to do.
As an extreme case, you could (performance issues aside) build a large system out of shell scripts, and you could probably deal with some of the issues of the messiness, lack of typing and global variables by being very strict and careful with coding and naming conventions ( in which case you'd sort of be creating a static typing system "by convention"), but it wouldn't be a fun exercise.
Twitter switched some parts of their architecture from Ruby to Scala because when they started they used the wrong tool for the job. They were using Ruby on Rails—which is highly optimised for building green field CRUD Web applications—to try to build a messaging system. AFAIK, they're still using Rails for the CRUD parts of Twitter e.g. creating a new user account, but have moved the messaging components to more suitable technologies.
Erlang is at its core based on asynchronous communication (both for co-located and distributed interactions), and that is the key to the scalability made possible by the platform. You can program with asynchronous communication on many platforms, but Erlang the language and the Erlang/OTP framework provides the structure to make it manageable - both technically and in your head. For instance: Without the isolation provided by erlang processes, you will shoot yourself in the foot. With the link/monitor mechanism you can react on failures sooner.

Is there a relation between static code analysis and application performance

My Question:
Performance tests are generally done after an application is integrated with various modules and ready for deploy.
Is there any way to identify performance bottlenecks during the development phase. Does code analysis throw any hints # performance?
It all depends on rules that you run during code analysis but I don't think that you can prevent performance bottlenecks just by CA.
From my expired it looks that performance problems are usually quite complicated and to find real problems you have to run performance tests.
No, except in very minor cases (eg for Java, use StringBuilder in a loop rather than string appends).
The reason is that you won't know how a particular piece of code will affect the application as a whole, until you're running the whole application with relevant dataset.
For example: changing bubblesort to quicksort wouldn't significantly affect your application if you're consistently sorting lists of a half-dozen elements. Or if you're running the sort once, in the middle of the night, and it doesn't delay other processing.
If we are talking .NET, then yes and no... FxCop (or built-in code analysis) has a number of rules in it that deal with performance concerns. However, this list is fairly short and limited in nature.
Having said that, there is no reason that FxCop could not be extended with a lot more rules (heuristic or otherwise) that catch potential problem areas and flag them. It's simply a fact that nobody (that I know of) has put significant work into this (yet).
Generally, no, although from experience I can look at a system I've never seen before and recognize some design approaches that are prone to performance problems:
How big is it, in terms of lines of code, or number of classes? This correlates strongly with performance problems caused by over-design.
How many layers of abstraction are there? Each layer is a chance to spend more cycles than necessary, and this effect compounds, especially if each operation is perceived as being "pretty efficient".
Are there separate data structures that need to be kept in agreement? If so, how is this done? If there is an attempt, through notifications, to keep the data structures tightly in sync, that is a red flag.
Of the categories of input information to the system, does some of it change at low frequency? If so, chances are it should be "compiled" rather than "interpreted". This can be a huge win both in performance and ease of development.
A common motif is this: Programmer A creates functions that wrap complex operations, like DB access to collect a good chunk of information. Programmer A considers this very useful to other programmers, and expects these functions to be used with a certain respect, not casually. Programmer B appreciates these powerful functions and uses them a lot because they get so much done with only a single line of code. (Programmers B and A can be the same person.) You can see how this causes performance problems, especially if distributed over multiple layers.
Those are the first things that come to mind.

Scaling multithreaded applications on multicored machines

I'm working on a project were we need more performance. Over time we've continued to evolve the design to work more in parallel(both threaded and distributed). Then latest step has been to move part of it onto a new machine with 16 cores. I'm finding that we need to rethink how we do things to scale to that many cores in a shared memory model. For example the standard memory allocator isn't good enough.
What resources would people recommend?
So far I've found Sutter's column Dr. Dobbs to be a good start.
I just got The Art of Multiprocessor Programming and The O'Reilly book on Intel Threading Building Blocks
A couple of other books that are going to be helpful are:
Synchronization Algorithms and Concurrent Programming
Patterns for Parallel Programming
Communicating Sequential Processes by C. A. R. Hoare (a classic, free PDF at that link)
Also, consider relying less on sharing state between concurrent processes. You'll scale much, much better if you can avoid it because you'll be able to parcel out independent units of work without having to do as much synchronization between them.
Even if you need to share some state, see if you can partition the shared state from the actual processing. That will let you do as much of the processing in parallel, independently from the integration of the completed units of work back into the shared state. Obviously this doesn't work if you have dependencies among units of work, but it's worth investigating instead of just assuming that the state is always going to be shared.
You might want to check out Google's Performance Tools. They've released their version of malloc they use for multi-threaded applications. It also includes a nice set of profiling tools.
Jeffrey Richter is into threading a lot. He has a few chapters on threading in his books and check out his blog:
http://www.wintellect.com/cs/blogs/jeffreyr/default.aspx.
As monty python would say "and now for something completely different" - you could try a language/environment that doesn't use threads, but processes and messaging (no shared state). One of the most mature ones is erlang (and this excellent and fun book: http://www.pragprog.com/titles/jaerlang/programming-erlang). May not be exactly relevant to your circumstances, but you can still learn a lot of ideas that you may be able to apply in other tools.
For other environments:
.Net has F# (to learn functional programming).
JVM has Scala (which has actors, very much like Erlang, and is functional hybrid language). Also there is the "fork join" framework from Doug Lea for Java which does a lot of the hard work for you.
The allocator in FreeBSD recently got an update for FreeBSD 7. The new one is called jemaloc and is apparently much more scaleable with respect to multiple threads.
You didn't mention which platform you are using, so perhaps this allocator is available to you. (I believe Firefox 3 uses jemalloc, even on windows. So ports must exist somewhere.)
Take a look at Hoard if you are doing a lot of memory allocation.
Roll your own Lock Free List. A good resource is here - it's in C# but the ideas are portable. Once you get used to how they work you start seeing other places where they can be used and not just in lists.
I will have to check-out Hoard, Google Perftools and jemalloc sometime. For now we are using scalable_malloc from Intel Threading Building Blocks and it performs well enough.
For better or worse, we're using C++ on Windows, though much of our code will compile with gcc just fine. Unless there's a compelling reason to move to redhat (the main linux distro we use), I doubt it's worth the headache/political trouble to move.
I would love to use Erlang, but there way to much here to redo it now. If we think about the requirements around the development of Erlang in a telco setting, the are very similar to our world (electronic trading). Armstrong's book is on my to read stack :)
In my testing to scale out from 4 cores to 16 cores I've learned to appreciate the cost of any locking/contention in the parallel portion of the code. Luckily we have a large portion that scales with the data, but even that didn't work at first because of an extra lock and the memory allocator.
I maintain a concurrency link blog that may be of ongoing interest:
http://concurrency.tumblr.com

Resources