Related
I am learning about concurrent data structures, and I have come across a broad research literature on relaxed data structures, such as k-relaxed queues, stacks, counters, etc. See, e.g., here
https://github.com/cksystemsgroup/scal#data-structures
or here
http://www.faculty.idc.ac.il/gadi/MyPapers/2015ST-RelaxedDataStructures.pdf
The idea is that a relaxed data structure offers weaker guarantees (e.g., a dequeue operation of a k-relaxed FIFO queue is free to give any of the first k elements). Ideally, the relaxed data structures offer better performance than the strict ones.
The problem is that, as far as I see, these data structures are evaluated only on some micro-benchmarks (as in the first link above), measuring only pure performance/throughput of data structure operations, and not looking at how the relaxation influences the overall application design and performance.
Have you ever used such relaxed data structures in real applications, or can you imagine plugging in relaxed data structures into your application to obtain performance gains? Or are those data structures a purely academic exercise?
Quick Answer
Here's a quick answer by analogy. Relaxed structures are comparable to normal0 concurrent structures, in the same way that normal concurrent structures are comparable to normal non-concurrent structures.
Most uses of data structures in an application are the plain-Jane non-concurrent ones, either because they are not shared, or because some type of coarse-grained locking is sufficient to share them. They offer a simple API and are easy to reason about, even in the presence of concurrent access when coarse-grained locking is used. To give a specific example, you'll probably find 20 to 100 uses of plain Map implementations for every use of ConcurrentMap.
That doesn't make ConcurrentMap useless! In the 5% of places it is used, there may be no real substitute, performance wise!
Similarly, you are usually going to do just fine with the non-relaxed concurrent structures in those places, but there may be an additional 5% (so now 0.25% of the total Map use) that could benefit from the relaxed structures (at the cost of additional complexity when reasoning about the system, and possibly theoretically visible service order changes for clients).
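To make the analogy concrete in code (a minimal sketch, assuming Java, since the answer already uses Map and ConcurrentMap as its example):

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    public class MapChoices {
        public static void main(String[] args) {
            // The common case: a plain map, shared (if at all) behind one coarse lock.
            // Simple to reason about, and fine as long as it is not a contention hot spot.
            Map<String, Integer> coarse = Collections.synchronizedMap(new HashMap<>());
            coarse.put("requests", 1);

            // The rarer case: a map hammered by many threads, where the finer-grained
            // internals of ConcurrentHashMap earn their extra complexity.
            ConcurrentMap<String, Integer> hot = new ConcurrentHashMap<>();
            hot.merge("requests", 1, Integer::sum);
        }
    }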
Long Answer
I think there are two separate implied questions here:
Are relaxed data structures actually faster than their non-relaxed equivalents in actual use?
Are relaxed data structures actually useful in real-world applications given their weaker guarantees?
I can speak about my personal experience with both of them. I have used "relaxed structures" before I even knew they were called relaxed structures.
Are Relaxed Structures Actually Faster?
First, maybe you are asking if relaxed structures are actually faster than their non-relaxed equivalents?
The easy answer is "Yes, but only at the extremes of concurrent access".
Microbenchmarks already show the extremes of performance. You usually have one structure which is continually accessed (i.e., with a duty cycle near 100%) from multiple threads. At these extremes of contention, relaxed data structures may show an order of magnitude improvement over their non-relaxed cousins. You have to realize, however, that the vast majority of structures are not accessed in such a fashion. Even structures which already use concurrent data structures (think ConcurrentHashMap in Java) are usually accessed rarely as a percentage of total execution time, and the use of a concurrent structure is necessary more for correctness and high-sigma response time than for performance1.
Usually No
That said, just because the vast majority of structures in your application don't need higher concurrent performance, it doesn't mean that it's worthless: the few structures (if any) that do need it may constitute a high percentage of total accesses. What I mean is, if you look at the code level: "How many declarations of map objects in my application need a fast concurrent structure?", then the answer is likely to be "not many" - but if you look at the runtime level: "How many accesses to map objects in my application need a fast concurrent structure?" - then the answer might be "a lot of them". It depends on whether you weight by "use in code" or "use at runtime".
Sometimes Yes
Once you start tuning a highly concurrent application, you are very likely to run into a concurrency bottleneck at some point. Even if your application is very parallel - such as processing many independent incoming requests in parallel - you'll usually want to introduce some type of shared state for caching, logging, or security checks, or simply because your application does have some amount of shared writable state. Whatever structure you use there may indeed suffer from high contention.
As a final point on the first question: a lot of it also depends on the hardware you use. A typical behavior of a contended structure is that it tends to flatline in throughput under concurrent access, regardless of the number of cores accessing it - i.e., the maximum throughput of the structure is X, regardless of concurrency. Often that is in fact the best case: you may instead reach peak throughput at a moderate amount of concurrency and see total throughput drop after that.
It Is Hardware Dependent
The impact of this depends on your hardware. If you are on a single-socket development machine with 4 cores, then this behavior isn't too terrible. You effectively drop down to 1 core for the contended parts, but that still leaves 25% of the maximum power available (one core). So if the inherently serialized (contended) part of your load is 10% of the total, then you still achieve 1 / (0.9 + 0.1 * 4) = 0.77 = 77% of the ideal throughput, despite your contention. Not a problem you'd reorganize your application over. Then maybe you deploy to production, where you are running on a 2-socket box with 18 cores per socket (36 cores total), and it looks like 1 / (0.9 + 0.1 * 36) = 0.22 = 22% - i.e., you are getting less than a quarter of the ideal performance level because of your concurrency bottleneck. That is a problem.
That's with the "simple" view of concurrent scaling, where the throughput of the contended part stays constant - in reality, increased contention often decreases it outright, making the picture even worse.
Summary
So the long answer then finishes in the same way as the short answer - yes, there are places in highly concurrent applications that can benefit from structures that perform better in contention, but they are small in number. You have to first remove all the places where false sharing is occurring, where you have non-contention bottlenecks, and then you'll find the places where true sharing occurs and may benefit.
Can Relaxed Structures Be Used in Practice?
Absolutely, yes. This is the easy part. There are many times when you need a structure that offers weaker guarantees than the usual structures provide. For example, if you are just going to store a bunch of objects with no duplicates, in no particular order, you might use an array/vector, a linked list, or a set-like structure. Each of the choices gives some advantage over the others, at the cost of some disadvantage in other scenarios (for example, arrays allow constant-time access to any element given its index, but cannot insert elements anywhere other than the end without paying an O(n) cost).
You don't often find structures with even weaker guarantees than the existing ones, because outside of concurrency there aren't really any structures that are faster than the basic ones while offering weaker guarantees. Try it - design a "collection" API with the lowest-common-denominator guarantees of arrays, linked lists, hash tables, trees, etc. I.e., it doesn't need fast insertion in the middle, it doesn't offer fast search by element, it doesn't offer fast indexing, etc. - can you make any basic operation faster?
This scenario is very common: scan through the source code for anything you've written lately and check out how you used List in Java or vector in C++, or just plain arrays: in each case, which of the high performance operations did you actually need? Often you are just holding a collection of elements and iterating over them later - that's it!
Once you introduce the new axis of concurrent performance, however, you can use weaker guarantees to make a faster structure. Arrays are problematic because preserving exact insertion order means you always end up contending on some type of shared size variable. Linked lists are problematic because the head/tail nodes are contended, and so on.
So there is a place for weaker structures: for example, a "bag" of elements that doesn't offer FIFO insertion/iteration order, doesn't offer fast search by equality, etc. Many uses are fine with this weakening.
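As a rough sketch of what such a bag might look like (a toy illustration in Java, not a published design - the class and method names here are made up), each thread appends to its own chunk and the only contention is a once-per-thread registration:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    // A relaxed "bag": inserts are (almost always) uncontended, and in exchange
    // there is no insertion order, no fast search, and draining is only valid
    // after the inserting threads have finished and been joined.
    public class RelaxedBag<T> {
        private final Queue<List<T>> allChunks = new ConcurrentLinkedQueue<>();
        private final ThreadLocal<List<T>> localChunk = ThreadLocal.withInitial(() -> {
            List<T> chunk = new ArrayList<>();
            allChunks.add(chunk);          // one contended operation per thread, not per insert
            return chunk;
        });

        public void add(T item) {
            localChunk.get().add(item);    // purely thread-local after the first call
        }

        public List<T> drainAfterWritersJoined() {
            List<T> out = new ArrayList<>();
            for (List<T> chunk : allChunks) {
                out.addAll(chunk);         // order across threads is unspecified
            }
            return out;
        }
    }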
Concretely
Here are a couple of places where I've actually used relaxed structures, or seen them used, in high-performance concurrent systems:
For statistics collection: highly concurrent systems processing mostly independent requests often scale well because the requests do not share much mutable data. Once you start collecting system-wide fine-grained statistics, however, you have a contention point. Usually you don't need any kind of FIFO ordering for statistics and don't need searchability (at the collection point, anyway) - so the in-process statistics collector code is an obvious place where relaxed structures are useful (see the sketch after these examples).
For logging: in the same way as statistics above, many systems have a global logging facility, which might become a point of contention. Usually you want log messages to appear in FIFO order, since out-of-order messages are very confusing, but you might not care much about the relative order of messages from independent threads, or you might know exactly which messages need to occur in order with respect to other threads. Here again relaxed structures are a natural fit.
The general problem of processing or redistributing a high-volume stream of incoming requests. Imagine what any web server does - it processes a large number of incoming requests. Usually those requests come from many different users, and it doesn't matter if the requests are processed in the exact order they arrive. There are some ordering cases that do matter, however - usually among requests from the same user or session (e.g., imagine a user who submits some change with a POST and then issues a GET to reload the page: she would expect to see the changes from the previous POST reflected in the GET). So many of the front-end structures handling the requests would relax the usual ordering requirements for performance.
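For the statistics case, the JDK already ships something that is effectively a relaxed counter: java.util.concurrent.atomic.LongAdder stripes increments across internal cells to dodge contention, and sum() is only a best-effort snapshot while updates are in flight - exactly the weakening most statistics collection can live with. A minimal sketch:

    import java.util.concurrent.atomic.LongAdder;

    public class RequestStats {
        private final LongAdder requestsServed = new LongAdder();
        private final LongAdder bytesSent = new LongAdder();

        void onRequest(long bytes) {   // called concurrently from many worker threads
            requestsServed.increment();
            bytesSent.add(bytes);
        }

        String snapshot() {            // called occasionally from a reporting thread
            return requestsServed.sum() + " requests, " + bytesSent.sum() + " bytes";
        }
    }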
0 By normal concurrent structures, I mean structures like ConcurrentHashMap which are designed for better performance under concurrency, but usually keep their usual ordering guarantees and so on. That is, they are made about as concurrent as possible within the boundaries of the existing API.
1 Admittedly, this glosses over the difference between average performance, and other "long tail" performance issues. I.e., a structure may be accessed infrequently on average, but under certain scenarios access may skyrocket and it can become a bottleneck under this particular load.
For many parallel programs, the parallelization brings substantial cost, making the speedup sublinear. In this case, the parallel versions are less energy efficient than the sequential one.
However, people may care about both time performance and energy efficiency; are there any specific metrics commonly used for this purpose?
More specifically, is there a metric that can determine the number of threads giving the best combination of energy and performance?
The most common metric is performance per watt. Take a look at the "Green500 List". Wikipedia also has an article on performance per watt. The metric is not as clear cut as it first appears because "performance" is not clear cut. FLOPS is very popular at the moment but it has a lot of deficiencies. I disagree that performance/watt can't be used to evaluate the performance of software. Depending upon your application, you may want to use performance/watt/sec.
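If the goal is really to pick a thread count, one pragmatic way to apply performance per watt is to sweep thread counts and measure both throughput and power for each. The sketch below assumes Java and uses hypothetical measurement hooks (runBenchmark and readAveragePowerWatts are placeholders you would have to implement, e.g. via RAPL counters or an external power meter - they are not a standard API):

    public class ThreadCountSweep {
        public static void main(String[] args) {
            int bestThreads = 1;
            double bestOpsPerWatt = 0.0;
            int maxThreads = Runtime.getRuntime().availableProcessors();
            for (int threads = 1; threads <= maxThreads; threads *= 2) {
                double opsPerSecond = runBenchmark(threads);   // "performance"
                double watts = readAveragePowerWatts();        // power drawn during the run
                double opsPerWatt = opsPerSecond / watts;      // performance per watt
                if (opsPerWatt > bestOpsPerWatt) {
                    bestOpsPerWatt = opsPerWatt;
                    bestThreads = threads;
                }
            }
            System.out.println("Best performance/watt at " + bestThreads + " threads");
        }

        // Hypothetical hooks - replace with your workload and power measurement.
        static double runBenchmark(int threads) { return 0.0; }
        static double readAveragePowerWatts()   { return 1.0; }
    }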
I don’t know why you want to determine energy efficiency if parallelism is costing you. In fact, I don’t really understand how parallelism can be decreasing energy efficiency unless you are using a single core machine, doing pure computation, and are doing a lot of thrashing between threads. I’m guessing that this is not your own code.
Software power efficiency: The most important two factors are:
getting your computation done faster
making sure that periods between computation are truly idle
These factors break down into a whole host of other more concrete guidelines:
avoid timing interrupts and (shudder) polling
minimize synchronization constructs
exploit parallelism (thread and vectorization)
use a good optimizing compiler
use a thread pool if you are continuously creating and terminating a lot of threads (see the sketch after this list)
use efficient high performance libraries
avoid virtual machines (e.g. Java and Flash)
use a modern (tickless) OS
etc. etc. etc
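To illustrate the thread-pool guideline above (a minimal sketch assuming Java; the submitted work is obviously a placeholder): a fixed pool amortizes thread setup and teardown across many tasks instead of paying that cost per task.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class PoolInsteadOfSpawning {
        public static void main(String[] args) {
            // One pool sized to the machine, reused for every task, instead of
            // creating and destroying a thread per unit of work.
            ExecutorService pool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());
            for (int i = 0; i < 10_000; i++) {
                pool.submit(() -> { /* short unit of work */ });
            }
            pool.shutdown();   // finish queued tasks, then release the threads
        }
    }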
Dividing your computation between parallel threads should decrease computation times, or else why add the complication? (Yes, I understand that some programming constructs, such as recursion, can result in simpler and cleaner code but worse performance, but these are exceptions.) Decreasing computation time should increase energy efficiency. If it doesn't, look at the algorithm and coding practices.
If you can give me more detail about your app, I may be able to make more concrete suggestions.
I have a piece of C# 5.0 code that generates a ton of network and disk I/O. I need to run multiple copies of this code in parallel. Which of the following technologies is likely to give me the best performance:
async methods with await
directly use Task from TPL
the TPL Dataflow nuget
Reactive Extensions
I'm not very good at this parallel stuff, but if using a lower level, like say Thread, can give me a lot better performance I'd consider that too.
This is like trying to optimize the length of your transatlantic flight by asking the quickest method to remove your seatbelt.
Ok, some real advice, since I was kind of a jerk
Let's give a helpful answer. Think of performance in terms of "classes" of activities - each one at least an order of magnitude slower than the previous:
Only accessing the CPU, very little memory usage (i.e. rendering very simple graphics to a very fast GPU, or calculating digits of Pi)
Only accessing CPU and in-memory things, nothing on disk (i.e. a well-written game)
Accessing the disk
Accessing the network.
If you do even one of activity #3, there's no point in doing optimizations typical to activities #1 and #2 like optimizing threading libraries - they're completely overshadowed by the disk hit. Same for CPU tricks - if you're constantly incurring L2/L3 cache misses, sparing a few CPU cycles by hand-writing assembly isn't worth it (which is why things like loop unrolling are usually a bad idea these days).
So, what can we derive from this? There are two ways to make your program faster: either move up from #3 to #2 (which isn't often possible, depending on what you're doing), or do less I/O. Disk and network I/O are the rate-limiting factors in most modern applications, and that's what you should be trying to optimize.
Any performance difference between these options would be inconsequential in the face of "a ton of network and disk I/O".
A better question to ask is "which option is easiest to learn and develop with?" Or "which option would be best to maintain this code with five years from now?" And for that I would suggest async first, or Dataflow or Rx if your logic is better represented as a stream.
It's an older question, but for anyone reading this...
It depends. If you try to saturate a 1 Gbps link with 50-byte messages, you will be CPU bound even with a simple non-blocking send over raw sockets. If, on the other hand, you are happy with 1 Mbps throughput or your messages are larger than 10 KB, any of these frameworks will do the job.
For low-bandwidth situations, I would recommend prioritizing by ease of use: async/await, Dataflow, Rx, TPL, in that order. Note that a high-bandwidth application should be prototyped as if it were low-bandwidth, and optimized later.
For a true high-bandwidth application, I can recommend Dataflow over Rx, because Rx is not designed for high concurrency. Raw TPL is the bottom layer, which guarantees the lowest overhead if you can handle the complexity. If you can make efficient use of dedicated threads, then that would be even faster. Async/await vs. Dataflow IMO doesn't make any performance difference; the overhead seems comparable, so choose whichever is a better fit.
I am a student in Computer Science and I am hearing the word "overhead" a lot when it comes to programs and sorts. What does this mean exactly?
It's the resources required to set up an operation. It might seem unrelated to the actual task, but it is necessary.
It's like when you need to go somewhere, you might need a car. But, it would be a lot of overhead to get a car to drive down the street, so you might want to walk. However, the overhead would be worth it if you were going across the country.
In computer science, sometimes we use cars to go down the street because we don't have a better way, or it's not worth our time to "learn how to walk".
The meaning of the word can differ a lot with context. In general, it's resources (most often memory and CPU time) that are used, which do not contribute directly to the intended result, but are required by the technology or method that is being used. Examples:
Protocol overhead: Ethernet frames, IP packets and TCP segments all have headers, and TCP connections require handshake packets. Thus, you cannot use the entire bandwidth the hardware is capable of for your actual data. You can reduce the overhead by using larger packet sizes, and UDP has a smaller header and no handshake. (See the sketch after these examples for the arithmetic.)
Data structure memory overhead: A linked list requires at least one pointer for each element it contains. If the elements are the same size as a pointer, this means a 50% memory overhead, whereas an array can potentially have 0% overhead.
Method call overhead: A well-designed program is broken down into lots of short methods. But each method call requires setting up a stack frame, copying parameters and a return address. This represents CPU overhead compared to a program that does everything in a single monolithic function. Of course, the added maintainability makes it very much worth it, but in some cases, excessive method calls can have a significant performance impact.
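As a rough illustration of the protocol-overhead example above (using the minimum Ethernet II, IPv4, and TCP header sizes, and ignoring the preamble, inter-frame gap, and minimum frame padding):

    public class ProtocolOverhead {
        public static void main(String[] args) {
            int ethernet = 14 + 4;   // Ethernet II header + frame check sequence
            int ip = 20;             // minimum IPv4 header
            int tcp = 20;            // minimum TCP header
            int headers = ethernet + ip + tcp;   // 58 bytes of headers per frame

            int fullPayload = 1460;  // what fits in a 1500-byte MTU after IP and TCP headers
            int tinyPayload = 50;

            System.out.printf("full frames: %.1f%% overhead%n",
                    100.0 * headers / (headers + fullPayload));   // ~3.8%
            System.out.printf("tiny frames: %.1f%% overhead%n",
                    100.0 * headers / (headers + tinyPayload));   // ~53.7%
        }
    }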
You're tired and can't do any more work. You eat food. The energy spent looking for food, getting it and actually eating it consumes energy and is overhead!
Overhead is something wasted in order to accomplish a task. The goal is to make overhead very very small.
In computer science, let's say you want to print a number - that's your task. But storing the number, setting up the display to print it, calling routines to print it, and fetching the number from a variable are all overhead.
Wikipedia has us covered:
In computer science, overhead is generally considered any combination of excess or indirect computation time, memory, bandwidth, or other resources that are required to attain a particular goal. It is a special case of engineering overhead.
Overhead typically refers to the amount of extra resources (memory, processor, time, etc.) that different programming algorithms take.
For example, the overhead of inserting into a balanced binary tree could be much larger than the same insert into a simple linked list (the insert will take longer and use more processing power to balance the tree, which results in a longer perceived operation time for the user).
For a programmer, overhead refers to those system resources which are consumed by your code when it's running on a given platform on a given set of input data. Usually the term is used in the context of comparing different implementations or possible implementations.
For example, we might say that a particular approach might incur considerable CPU overhead, while another might incur more memory overhead and yet another might be weighted toward network overhead (and entail an external dependency, for example).
Let's give a specific example: Compute the average (arithmetic mean) of a set of numbers.
The obvious approach is to loop over the inputs, keeping a running total and a count. When the last number is encountered (signaled by "end of file" EOF, or some sentinel value, or some GUI button, whatever), we simply divide the total by the number of inputs and we're done.
This approach incurs almost no overhead in terms of CPU, memory or other resources. (It's a trivial task).
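A minimal version of that approach (a sketch assuming Java, with numbers read from standard input purely for illustration):

    import java.util.Scanner;

    public class StreamingMean {
        public static void main(String[] args) {
            Scanner in = new Scanner(System.in);
            double total = 0.0;
            long count = 0;
            while (in.hasNextDouble()) {   // "end of input" plays the role of EOF here
                total += in.nextDouble();
                count++;
            }
            System.out.println(count == 0 ? "no input" : "mean = " + (total / count));
        }
    }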
Another possible approach is to "slurp" the input into a list, iterate over the list to calculate the sum, then divide that by the number of valid items in the list.
By comparison this approach might incur arbitrary amounts of memory overhead.
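The same task written the "slurp" way (again a toy sketch), where memory overhead now grows with the size of the input because every value is retained before any arithmetic happens:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Scanner;

    public class SlurpedMean {
        public static void main(String[] args) {
            Scanner in = new Scanner(System.in);
            List<Double> values = new ArrayList<>();
            while (in.hasNextDouble()) {
                values.add(in.nextDouble());   // every input is kept in memory
            }
            double total = 0.0;
            for (double v : values) {
                total += v;
            }
            System.out.println(values.isEmpty() ? "no input" : "mean = " + (total / values.size()));
        }
    }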
In a particularly bad implementation, we might perform the sum operation using recursion but without tail-call elimination. Now, in addition to the memory overhead for our list, we're also introducing stack overhead (which is a different sort of memory and is often a more limited resource than other forms of memory).
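Sketched out (in Java, which does not eliminate tail calls anyway - and here the addition happens after the recursive call returns, so it isn't even a tail call), each element costs a stack frame, and a large enough list will throw StackOverflowError even though the heap could easily hold the values:

    import java.util.List;

    public class RecursiveMean {
        static double sum(List<Double> xs, int i) {
            if (i == xs.size()) {
                return 0.0;
            }
            return xs.get(i) + sum(xs, i + 1);   // work remains after the call returns
        }

        static double mean(List<Double> xs) {
            return xs.isEmpty() ? Double.NaN : sum(xs, 0) / xs.size();
        }
    }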
Yet another (arguably more absurd) approach would be to post all of the inputs to some SQL table in an RDBMS and then simply call the SQL SUM function on that column of that table. This shifts our local memory overhead to some other server, and incurs network overhead and external dependencies on our execution. (Note that the remote server may or may not have any particular memory overhead associated with this task - it might shove all the values immediately out to storage, for example.)
Hypothetically we might consider an implementation over some sort of cluster (possibly to make the averaging of trillions of values feasible). In this case any necessary encoding and distribution of the values (mapping them out to the nodes) and the collection/collation of the results (reduction) would count as overhead.
We can also talk about the overhead incurred by factors beyond the programmer's own code. For example, compilation of some code for 32- or 64-bit processors might entail greater overhead than one would see for an old 8-bit or 16-bit architecture. This might involve larger memory overhead (alignment issues) or CPU overhead (where the CPU is forced to adjust bit ordering or use unaligned accesses, etc.) or both.
Note that the disk space taken up by your code and its libraries, etc. is not usually referred to as "overhead" but rather is called "footprint." The base memory your program consumes (without regard to any data set that it's processing) is also called its "footprint."
Overhead is simply the extra time consumed during program execution. For example, when we call a function, control is passed to where it is defined, its body is executed, and then control is passed back to the former position; the CPU has to run through this whole round trip for every call, which costs time - hence overhead. One way to reduce this overhead is to declare the function inline, which copies the body of the function to the call site, so control never jumps to another location and the program continues in a straight line.
You could use a dictionary; the definition is the same. But to save you time: overhead is the work required in order to do the productive work. For instance, an algorithm runs and does useful work, but requires memory to do its work. This memory allocation takes time and is not directly related to the work being done, so it is overhead.
You can check Wikipedia, but mainly it's when more actions or resources are used. For example, if you are familiar with .NET, it has value types and reference types. Reference types have memory overhead, as they require more memory than value types.
A concrete example of overhead is the difference between a "local" procedure call and a "remote" procedure call.
For example, with classic RPC (and many other remote frameworks, like EJB), a function or method call looks the same to a coder whether it's a local, in-memory call or a distributed network call.
For example:
service.function(param1, param2);
Is that a normal method, or a remote method? From what you see here you can't tell.
But you can imagine that the difference in execution time between the two calls is dramatic.
So, while the core implementation will "cost the same", the "overhead" involved is quite different.
Think about the overhead as the time required to manage the threads and coordinate among them. It is a burden if each thread does not have enough work to do. In such a case the overhead cost outweighs the time saved by using threading, and the code takes more time than the sequential version.
To answer you, I would give you the analogy of cooking rice.
Ideally, when we want to cook, we want everything to be available: the pots already clean, rice on hand in sufficient quantity. If this is true, then we take less time to cook our rice (less overhead).
On the other hand, let's say you don't have clean water available immediately and you don't have rice, so you need to go buy rice from the shops first and also fetch clean water from the tap outside your house. These extra tasks are not part of cooking itself; ideally, your ingredients would already be at hand when you want to cook.
So the time spent going to buy your rice from the shops and fetching water from the tap is overhead to cooking rice. It is a cost we can avoid or minimize, compared with the ideal way of cooking rice (everything is around you, and you don't waste time gathering your ingredients).
The time wasted in collecting ingredients is what we call the overhead.
In computer science, for example in multithreading, communication overhead among threads arises when threads have to take turns giving each other access to a certain resource, or when they pass information or data to each other; context switching also adds overhead. Even though this coordination is crucial, it is still time (CPU cycles) spent on something other than the work itself, compared with traditional single-threaded programming, where no time is spent on communication - a single-threaded program does the work straight away.
It's anything other than the data itself, i.e., TCP flags, headers, CRC, FCS, etc.
With all the hype around parallel computing lately, I've been thinking a lot about parallelism, number crunching, clusters, etc...
I started reading Learn You Some Erlang. As more people are learning (myself included), Erlang handles concurrency in a very impressive, elegant way.
Then the author asserts that Erlang is not ideal for number crunching. I can understand that a language like Erlang would be slower than C, but the model for concurrency seems ideally suited to things like image handling or matrix multiplication, even though the author specifically says it's not.
Is it really that bad? Is there a tipping point where Erlang's strength overcomes its local speed weakness? What measures, if any, are being taken to deal with speed?
To be clear: I'm not trying to start a debate; I just want to know.
It's a mistake to think of parallelism as only about raw number crunching power. Erlang is closer to the way a cluster computer works than, say, a GPU or classic supercomputer.
In modern GPUs and old-style supercomputers, performance is all about vectorized arithmetic, special-purpose calculation hardware, and low-latency communication between processing units. Because communication latency is low and each individual computing unit is very fast, the ideal usage pattern is to load the machine's RAM up with data and have it crunch it all at once. This processing might involve lots of data passing among the nodes, as happens in image processing or 3D, where there are lots of CPU-bound tasks to do to transform the data from input form to output form. This type of machine is a poor choice when you frequently have to go to a disk, network, or some other slow I/O channel for data. This idles at least one expensive, specialized processor, and probably also chokes the data processing pipeline so nothing else gets done, either.
If your program requires heavy use of slow I/O channels, a better type of machine is one with many cheap independent processors, like a cluster. You can run Erlang on a single machine, in which case you get something like a cluster within that machine, or you can easily run it on an actual hardware cluster, in which case you have a cluster of clusters. Here, communication overhead still idles processing units, but because you have many processing units running on each bit of computing hardware, Erlang can switch to one of the other processes instantaneously. If it happens that an entire machine is sitting there waiting on I/O, you still have the other nodes in the hardware cluster that can operate independently. This model only breaks down when the communication overhead is so high that every node is waiting on some other node, or for general I/O, in which case you either need faster I/O or more nodes, both of which Erlang naturally takes advantage of.
Communication and control systems are ideal applications of Erlang because each individual processing task takes little CPU and only occasionally needs to communicate with other processing nodes. Most of the time, each process is operating independently, each taking a tiny fraction of the CPU power. The most important thing here is the ability to handle many thousands of these efficiently.
The classic case where you absolutely need a classic supercomputer is weather prediction. Here, you divide the atmosphere up into cubes and do physics simulations to find out what happens in each cube, but you can't use a cluster because air moves between each cube, so each cube is constantly communicating with its 6 adjacent neighbors. (Air doesn't go through the edges or corners of a cube, being infinitely fine, so it doesn't talk to the other 20 neighboring cubes.) Run this on a cluster, whether running Erlang on it or some other system, and it instantly becomes I/O bound.
Is there a tipping point where Erlang's strength overcomes its local speed weakness?
Well, of course there is. For example, when trying to find the median of a trillion numbers :) :
http://matpalm.com/median/question.html
Just before you posted, I happened to notice this was the number 1 post on erlang.reddit.com.
Almost any language can be parallelized. In some languages it's simple, in others it's a pain in the butt, but it can be done. If you want to run a C++ program across 8000 CPU's in a grid, go ahead! You can do that. It's been done before.
Erlang doesn't do anything that's impossible in other languages. If a single CPU running an Erlang program is less efficient than the same CPU running a C++ program, then two hundred CPU's running Erlang will also be slower than two hundred CPU's running C++.
What Erlang does do is make this kind of parallelism easy to work with. It saves developer time and reduces the chance of bugs.
So I'm going to say no, there is no tipping point at which Erlang's parallelism allows it to outperform another language's numerical number-crunching strength.
Where Erlang scores is in making it easier to scale out and do so correctly. But it can still be done in other languages which are better at number-crunching, if you're willing to spend the extra development time.
And of course, let's not forget the good old point that languages don't have a speed.
A sufficiently good Erlang compiler would yield perfectly optimal code. A sufficiently bad C compiler would yield code that runs slower than anything else.
There is pressure to make Erlang execute numeric code faster. The HiPE compiler compiles to native code instead of BEAM bytecode, for example, and its most effective optimizations are probably on floating-point code, where it can avoid boxing and store values directly in FPU registers.
For the majority of Erlang usage, Erlang is plenty fast as it is. Its users write always-up control systems where the speed measurement that matters most is low-latency response. Performance under load tends to be I/O bound. These users tend to stay away from HiPE, since it is not as flexible/malleable for debugging live systems.
Now that servers with 128 GB of RAM are not that uncommon, and there's no reason they won't get even more memory, some I/O-bound problems might shift over to being somewhat CPU bound. That could be a driver.
You should follow HiPE development.
Your examples of image manipulation and matrix multiplication seem to me like very bad matches for Erlang, though. Those are examples that benefit from vector/SIMD operations, and Erlang is not good at that kind of data parallelism (where one does the same thing to many values at once).
Erlang processes are MIMD, multiple instructions multiple data. Erlang does lots of branching behind pattern matching and recursive loops. That kills CPU instruction pipelining.
The best architecture for heavily parallelizable problems is the GPU. For programming GPUs in a functional language, I see the best potential in using Haskell for creating programs targeting them. A GPU is basically a pure function from input data to output data. See the Lava project in Haskell for creating FPGA circuits: if it is possible to create circuits so cleanly in Haskell, it can't be harder to create programs for GPUs.
The Cell architecture is very nice for vectorizable problems as well.
I think the broader point is that parallelism is not necessarily, or even typically, about speed.
It is about how to express algorithms or programs in which the sequence of activities is partially ordered.