Is there a parallel flood fill implementation?

I've got OpenMP and MPI at my disposal, and was wondering if anyone has come across a parallel version of any flood fill algorithm (preferably in C). If not, I'd be interested in sketches of how to parallelise it - is it even possible, given that it's based on recursion?
Wikipedia's got a pretty good article if you need to refresh your memory on flood fills.
Many thanks for your help.

There's nothing "inherently" recursive about flood fill; to do some work, you just need some information about previously discovered "frontier" cells. If you think of it that way, it's clear that parallelism is eminently possible: even with a single queue, you could use four threads (one for each direction), and only move the tail of the queue once the cell has been examined by each thread. Or, equivalently, four queues. Thinking this way, one might even imagine partitioning the space into multiple queues - bucketed by coordinate ranges, perhaps.
One basic problem is that the problem definition usually includes the proviso that no cell is ever revisited. This implies that each worker needs an up-to-date (global) map of which cells have been considered. Mutable global state is problematic, performance-wise, though it's not hard to think of ways to limit how often updates must be propagated globally...
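To make that concrete, here is a minimal sketch of a level-synchronous, queue-based flood fill in C with OpenMP. It is one possible design, not a tuned implementation: all names and the row-major grid layout are illustrative assumptions, each pass expands the whole frontier in parallel, and a compare-and-swap on the grid itself doubles as the global "visited" map so no cell is ever claimed twice.

    #include <stdlib.h>
    #include <omp.h>

    typedef struct { int x, y; } Cell;

    /* Fill the 4-connected region of `target` cells containing (sx, sy)
     * with `replacement`. */
    void flood_fill(int *grid, int w, int h, int sx, int sy,
                    int target, int replacement)
    {
        if (target == replacement || grid[sy * w + sx] != target)
            return;

        Cell *frontier = malloc(sizeof(Cell) * (size_t)w * h);
        Cell *next     = malloc(sizeof(Cell) * (size_t)w * h);
        size_t n_frontier = 0, n_next = 0;

        grid[sy * w + sx] = replacement;
        frontier[n_frontier++] = (Cell){ sx, sy };

        while (n_frontier > 0) {
            n_next = 0;
            /* Expand every frontier cell of this generation in parallel. */
            #pragma omp parallel for
            for (size_t i = 0; i < n_frontier; i++) {
                static const int dx[4] = { 1, -1, 0, 0 };
                static const int dy[4] = { 0, 0, 1, -1 };
                for (int d = 0; d < 4; d++) {
                    int nx = frontier[i].x + dx[d];
                    int ny = frontier[i].y + dy[d];
                    if (nx < 0 || nx >= w || ny < 0 || ny >= h)
                        continue;
                    /* Atomically claim the cell: the recoloured grid is
                     * itself the "already visited" map. */
                    if (__sync_bool_compare_and_swap(&grid[ny * w + nx],
                                                     target, replacement)) {
                        size_t slot;
                        #pragma omp atomic capture
                        slot = n_next++;
                        next[slot] = (Cell){ nx, ny };
                    }
                }
            }
            Cell *tmp = frontier; frontier = next; next = tmp;
            n_frontier = n_next;
        }
        free(frontier);
        free(next);
    }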


Two stacks with a deque, what's the purpose of implementing it?

From Algorithms, 4th Edition:
1.3.48 Two stacks with a deque. Implement two stacks with a single deque so that each
operation takes a constant number of deque operations (see Exercise 1.3.33).
What's the point of implementing 2 stacks with a single deque? Are there any practical reasons? Why don't I just create 2 stacks directly?
1.3.49 Queue with three stacks. Implement a queue with three stacks so that each
queue operation takes a constant (worst-case) number of stack operations. Warning :
high degree of difficulty.
Related question: How to implement a queue with three stacks?
Also, why do I have to implement a queue with three stacks? Can't I just create a queue directly too?
That first problem looks more like it's designed as an exercise than as anything else. I doubt there are many cases where you'd want to implement two stacks using a single deque, though I'm happy to be proven wrong. I think the purpose of the question, though, is to get you to think about the "geometry" of deques and stacks. There's a really elegant solution to the problem, and if you see how it works it'll give you a much deeper appreciation for how all these types work.
To your second question - in imperative programming languages, there isn't much of a reason to implement a queue with three stacks. However, in functional programming languages like Lisp, stacks are typically fairly simple to implement, but it's actually quite difficult to get a queue working with a constant number of stack operations per queue operation. In fact, for a while, if I remember correctly, it was believed that this simply wasn't possible. You can implement a queue with two stacks (this is a very common exercise, and a really good one, because the resulting queue is extremely fast), but this usually only gives good amortized performance rather than good worst-case performance, and in functional languages, where amortization is either unavailable or much harder to achieve, this isn't necessarily a good idea. Getting a queue out of three stacks with constant complexity is a Big Deal, then, as it unlocks a number of classical algorithms that rely on queues and that otherwise wouldn't be available in a functional context.
But again, in both cases, these are primarily designed as exercises to help you build a better understanding of the fundamentals. Would you actually do either of these things in practice? Probably not - some library designer will likely do it for you. But will doing these exercises give you a much deeper understanding of how these data types work, the sorts of things they're good and bad at, and an appreciation for how hard library designers have to work? Yes, totally!
Stacks from Deques
Stacks are first-in, last-out structures. Deques let you push/pop at both their front and back. If you keep track of the number of items stored at the front and at the back, you can use the front as one stack and the back as the other, returning a sentinel (e.g. NULL) when the relevant counter is zero.
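A minimal sketch of that in C, where the deque is modelled as a fixed-capacity array (the capacity and the -1 "empty" sentinel are assumptions made for brevity); each stack operation is exactly one deque operation at one end:

    #define CAP 1024

    /* One deque, two stacks: stack 1 lives at the front, stack 2 at the back. */
    typedef struct {
        int data[CAP];
        int n_front, n_back;   /* item count of each stack; n_front + n_back <= CAP */
    } TwoStacks;

    /* Stack 1: push/pop at the deque's front. */
    void push1(TwoStacks *t, int v) { t->data[t->n_front++] = v; }
    int  pop1(TwoStacks *t) { return t->n_front ? t->data[--t->n_front] : -1; }

    /* Stack 2: push/pop at the deque's back. */
    void push2(TwoStacks *t, int v) { t->data[CAP - 1 - t->n_back++] = v; }
    int  pop2(TwoStacks *t) { return t->n_back ? t->data[CAP - 1 - --t->n_back] : -1; }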
Why would you do this? Who knows, but read on.
Queues from Stacks
You can implement a queue with O(1) amortized time on all of its operations by using two stacks. When you place items on the queue, push them onto one stack. When you need to pull items off the queue, empty that stack into the other stack and pop from the top of it (while new incoming items keep filling the first stack).
Why would you want to do this?
Because this is, roughly speaking, how you make a queue. Data structures have to be implemented somehow. In a computer you allocate memory starting from a base address and build outwards. Thus a stack is a very natural data structure, because all you need is a single positive offset to know where the top of your stack is.
Implementing a queue directly is more difficult... you are adding items to one end but pulling them off of the other. Two stacks gives you a way to do it.
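Here is a minimal sketch of that two-stack queue in C (the fixed-capacity array stacks and the -1 "empty" sentinel are assumptions for brevity):

    #define CAP 1024

    typedef struct { int data[CAP]; int n; } Stack;
    typedef struct { Stack in, out; } Queue;

    static void push(Stack *s, int v) { s->data[s->n++] = v; }
    static int  pop(Stack *s)         { return s->data[--s->n]; }

    /* Enqueue is always a single push onto the "in" stack. */
    void enqueue(Queue *q, int v) { push(&q->in, v); }

    /* Dequeue pops from "out"; only when "out" runs dry is "in" reversed
     * into it, which is rare enough that the cost is O(1) amortized. */
    int dequeue(Queue *q)
    {
        if (q->out.n == 0)
            while (q->in.n > 0)
                push(&q->out, pop(&q->in));
        return q->out.n ? pop(&q->out) : -1;   /* -1 signals an empty queue */
    }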
But why 3 stacks?
Because this algorithm (if it exists) ensures that there is a constant worst-case bound on the time of every queue operation. With our 2-stack solution each operation takes O(1) time on average, but if the stacks have grown really big, once in a while there'll be a single operation that takes a long time. While that's happening, the car crashes, the rocket blows up, or your patient dies.
You don't want a crummy algorithm that gives unpredictable performance.
You want guarantees.
This StackOverflow answer explains that a 3-stack solution isn't known, but that there is a 6-stack solution.
Why Stacks From Deques
Let's return to your first question. As we've seen, there are good reasons to be able to build queues from stacks. For a computer it's a natural way of building a complex data structure from a simple one. It can even offer us performance guarantees.
Stacks from deques don't strike me as being practical in a computer. But computer science isn't about computers; it's about finding efficient algorithmic solutions to problems. Maybe you're storing stuff on a multi-directional conveyor belt in a factory. You can program the belt to act like a deque and use the deque to make stacks. In this context the question makes more sense.
It seems there is no practical use for these implementations.
The main purpose is to encourage students to invent complex solutions using simple tools - an important skill for every qualified developer.
(Perhaps the secondary goal is to teach programmers to implement the ridiculous visions of the boss :))

How to measure?

When I do performance tuning, I first work at a high level and try to answer: is this CPU-bound or I/O-bound?
Once I'm sure it's CPU-bound, I try to find the hotspots by adding some timer code. That works, but I haven't figured out how to measure these:
cache misses
thread context effects (e.g. context switches).
Does anyone know how to measure these?
Are you open to a different way of thinking about performance tuning?
It doesn't rely on I/O-vs-CPU classification, hotspots, or timers.
First, think about just one thread. The execution of a thread is much like a tree. There is a main function (the trunk). There are points when subroutines are called (branches). There are terminal instructions (leaves) and blocking calls like I/O (fruit). The total time the program takes is the sum of all the leaves and all the fruit.
What you want to do is prune the tree, making it as light as possible, without killing it.
What many people do is weigh (time) the whole thing, and then weigh parts of it, and so on, and hope to find hotspots (leafy branches) that maybe they could trim.
Another way is:
1) Select some leaves or fruit at random.
2) From each leaf or fruit, paint a line from it along the branch it is on, all the way back to the trunk.
3) Take note of branches that have more than one line painted on them.
4) Ask "Do I need this branch?" If you can prune it, do so. You will eliminate the entire weight of the branch, and you did it without weighing it. Then start over.
That's the idea behind random-pausing.
There are certain kinds of problems it will not find, but most of them it will find, quickly, including any that timing threads can find.
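In practice, one way to take such a sample on Linux is simply to interrupt the program under a debugger and record every thread's stack (the PID is a placeholder):

    gdb -p <pid>
    (gdb) thread apply all bt    # one "pause" sample: full stack of every thread
    (gdb) continue               # resume; interrupt (Ctrl-C) and repeat for more samples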
1) Use cachegrind/callgrind/kcachegrind
http://valgrind.org/info/tools.html#cachegrind
These are pretty useful for analysing memory locality under specific sets of assumptions.
2) Threading is really painful to profile correctly. Play around with cpusets and process affinities; on modern NUMA systems this becomes critical quickly.
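For example, assuming Linux and a binary named ./prog (the name is illustrative), cache misses and context switches can be measured like this:

    valgrind --tool=cachegrind ./prog      # simulated cache hierarchy
    cg_annotate cachegrind.out.<pid>       # per-function miss counts

    perf stat -e cache-misses,cache-references,context-switches ./prog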

how to partition the 2d arrays among the processes for "The Game of Life"

I am doing an assignment using MPI to implement the Game of Life. I was wondering: should I use block-row partitioning, cyclic row partitioning, or block-checkerboard partitioning?
What are the pros and cons of each type of partitioning? I tried to find references on the partitioning schemes (which seem to tie in with parallel processing), but it was difficult to find any that didn't go way over my head. :)
Try the one that best fits your needs; since it is an assignment, you should try the simplest one first and do the others if time allows.
However you do it, don't forget to make your partitions bigger on each side, with some overlap.
This means duplicating some data, but it also means each partition can compute a whole tick independently. At the end of each tick, the partitions copy their overlap regions to their neighbours.
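For block-row partitioning, that overlap ("ghost row") exchange might look like this sketch in C with MPI; the row-major layout, the one-row overlap, and all names are illustrative assumptions:

    #include <mpi.h>

    /* Each rank owns rows 1..local_rows of width w; rows 0 and
     * local_rows + 1 are ghost copies of the neighbours' border rows. */
    void exchange_ghost_rows(unsigned char *grid, int local_rows, int w,
                             int rank, int size)
    {
        int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        /* Send my top owned row up; receive my bottom ghost row from below. */
        MPI_Sendrecv(&grid[1 * w],                w, MPI_UNSIGNED_CHAR, up,   0,
                     &grid[(local_rows + 1) * w], w, MPI_UNSIGNED_CHAR, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* Send my bottom owned row down; receive my top ghost row from above. */
        MPI_Sendrecv(&grid[local_rows * w],       w, MPI_UNSIGNED_CHAR, down, 1,
                     &grid[0],                    w, MPI_UNSIGNED_CHAR, up,   1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }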

another Game of Life question (infinite grid)?

I have been playing around with Conway's Game of life and recently discovered some amazingly fast implementations such as Hashlife and Golly. (download Golly here - http://golly.sourceforge.net/)
One thing that I can't get my head around is: how do coders implement the infinite grid? We can't keep an infinite array of anything. If you run Golly and let a few gliders fly off past the edges, wait a few minutes, and zoom right out, you will see the gliders still out there in space, running away. So how in God's name is this concept of infinity dealt with programmatically? Is there a well-documented pattern for it?
Many thanks
It is possible to represent living nodes with some type of sparse matrix in this situation. For instance, if we store a list of (LivingNode, Coordinate) pairs instead of an array of Nodes where each is either living or dead, we are simply changing the Coordinates rather than increasing an array's size. Thus, the space required for this is proportional to the number of LivingNodes.
This solution doesn't work for states where the number of living nodes is constantly increasing, but it works very well for gliders.
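A rough sketch of that idea in C (the linear-scan lookup is purely for brevity; a real implementation would use a hash set keyed on coordinates):

    #include <stdlib.h>

    typedef struct { long x, y; } Coord;

    typedef struct {
        Coord *cells;    /* coordinates of live cells only */
        size_t n, cap;
    } SparseLife;

    /* Memory use is proportional to the live population, not the "grid",
     * so coordinates can wander arbitrarily far without growing anything. */
    int is_alive(const SparseLife *s, long x, long y)
    {
        for (size_t i = 0; i < s->n; i++)
            if (s->cells[i].x == x && s->cells[i].y == y)
                return 1;
        return 0;
    }

    int live_neighbours(const SparseLife *s, long x, long y)
    {
        int count = 0;
        for (long dy = -1; dy <= 1; dy++)
            for (long dx = -1; dx <= 1; dx++)
                if ((dx || dy) && is_alive(s, x + dx, y + dy))
                    count++;
        return count;
    }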
EDIT: So that was off the top of my head. It turns out Wikipedia has an article that shows a much more well-thought-out solution. Oh well! :) Enjoy.
Wikipedia explains it.
The basic idea is that Conway's Game of Life exhibits locality: information travels at a slow speed compared to the pattern size, and the maximum density of live cells is around 1/2 of the cells in any region. (Any more will kill off cells due to overcrowding.)
Since there is locality, you can separate the field in different sections and simulate each section independently. If you choose your locality well, you will often see the same patterns. You can simulate how those evolve and store the results in a lookup table, so that other instances of the same pattern do not need to be simulated more than once. Combining adjacent patterns into larger 'metapatterns' allows you to precalculate those as well, and so on.
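As a small taste of that memoization idea in C: treat every 4x4 block as a 16-bit number and precompute the next-generation 2x2 centre of all 65536 of them, so no 4x4 neighbourhood is ever simulated twice. (Hashlife applies the same trick recursively to ever-larger blocks; this flat version is only an illustration.)

    #include <stdint.h>

    static uint8_t centre_after_step[1 << 16];   /* 2x2 result per 4x4 block */

    static int bit(uint16_t block, int x, int y)
    {
        return (block >> (y * 4 + x)) & 1;
    }

    void build_table(void)
    {
        for (uint32_t b = 0; b < (1u << 16); b++) {
            uint8_t out = 0;
            /* The four centre cells have all eight neighbours inside the block. */
            for (int y = 1; y <= 2; y++)
                for (int x = 1; x <= 2; x++) {
                    int n = 0;
                    for (int dy = -1; dy <= 1; dy++)
                        for (int dx = -1; dx <= 1; dx++)
                            if (dx || dy)
                                n += bit((uint16_t)b, x + dx, y + dy);
                    if (bit((uint16_t)b, x, y) ? (n == 2 || n == 3) : (n == 3))
                        out |= (uint8_t)(1 << ((y - 1) * 2 + (x - 1)));
                }
            centre_after_step[b] = out;
        }
    }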

Efficient reordering of large dataset to maximize memory cache effectiveness

I've been working on a problem which I thought people might find interesting (and perhaps someone is aware of a pre-existing solution).
I have a large dataset consisting of a long list of pairs of pointers to objects, something like this:
[
(a8576, b3295),
(a7856, b2365),
(a3566, b5464),
...
]
There are way too many objects to keep in memory at any one time (potentially hundreds of gigabytes), so they need to be stored on disk, but can be cached in memory (probably using an LRU cache).
I need to run through this list processing every pair, which requires that both objects in the pair be loaded into memory (if they aren't already cached there).
So, the question: is there a way to reorder the pairs in the list to maximize the effectiveness of an in-memory cache (in other words: minimize the number of cache misses)?
Notes
Obviously, the re-ordering algorithm should be as fast as possible, and shouldn't depend on being able to have the entire list in memory at once (since we don't have enough RAM for that) - but it could iterate over the list several times if necessary.
If we were dealing with individual objects, not pairs, then the simple answer would be to sort them. This obviously won't work in this situation because you need to consider both elements in the pair.
The problem may be related to that of finding a minimum graph cut, but even if the problems are equivalent, I don't think solutions to min-cut would meet the requirements above.
My assumption is that the heuristic would stream the data off the disk, and write it back in chunks in a better order. It may need to iterate over this several times.
Actually it may not just be pairs, it could be triplets, quadruplets, or more. I'm hoping that an algorithm that does this for pairs can be easily generalized.
Your problem is related to a similar one for computer graphics hardware:
When rendering indexed vertices in a triangle mesh, the hardware typically has a cache of the most recently transformed vertices (~128 entries the last time I had to worry about it, but I suspect the number is larger these days). Vertices not in the cache need a relatively expensive transform operation. "Mesh optimisation", restructuring triangle meshes to optimise cache usage, used to be a pretty hot research topic. Googling "vertex cache optimisation" (or "optimization" :^) might find you some interesting material relevant to your problem. As other posters suggest, I suspect doing this effectively will depend on exploiting any inherent coherence in your data.
Another thing to bear in mind: as an LRU cache becomes overloaded it can be well worth changing to an MRU replacement strategy to at least hold some of the items in memory (rather than turning over the entire cache each pass). I seem to remember John Carmack has written some good material on this subject in connection with Direct3D texture caching strategies.
For a start, you could mmap the list. That works as long as there's enough address space (not memory), e.g. on 64-bit CPUs, and it makes it easier to access the elements in order.
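A minimal sketch of that in C (the file name, the on-disk Pair layout, and the omitted error handling are assumptions for brevity):

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/stat.h>

    typedef struct { size_t a, b; } Pair;   /* assumed on-disk record layout */

    /* Map the pair list read-only; the OS pages it in and out on demand,
     * and we index it like an ordinary in-memory array. */
    Pair *map_pairs(const char *path, size_t *n_pairs)
    {
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        *n_pairs = (size_t)st.st_size / sizeof(Pair);
        return mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    }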
You could sort that list by a locality measure that considers both elements, which works well if the objects live in a contiguous space. The sorting function could be something like: compare (a, b) to (c, d) as (a - c) + (b - d), i.e. order the pairs by the sum of their two offsets (the measure resembles a Manhattan rather than a Hamming distance). Then you pull in slices of the object store and process the list against them.
EDIT: fixed a mistake in the distance.
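A hedged sketch of that comparator in C, assuming the pointers have been reduced to offsets into a single contiguous object store:

    #include <stdlib.h>

    typedef struct { size_t a, b; } Pair;

    /* Order pairs by the combined offset of both elements, so pairs whose
     * objects sit near one another on disk get processed near one another. */
    static int by_locality(const void *p, const void *q)
    {
        size_t kp = ((const Pair *)p)->a + ((const Pair *)p)->b;
        size_t kq = ((const Pair *)q)->a + ((const Pair *)q)->b;
        return (kp > kq) - (kp < kq);
    }

    /* usage: qsort(pairs, n_pairs, sizeof(Pair), by_locality); */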
Even though you're not just sorting this list, the general pattern of a multiway merge sort might be applicable - that is, consider some kind of (possibly recursive) breakdown of the set into smaller sets that can be dealt with in memory separately, and then a second phase where small chunks of the previously dealt-with sets can all be combined together. Even not knowing the specific nature of what you're doing with the pairs, it's safe to say that many algorithmic problems are made much more straightforward when you're dealing with sorted data (including graph problems, which might be what you have on your hands here).
I think the answer to this question is going to depend very heavily on exactly the access pattern of the pair of objects. As you said, just sorting the pointers would be best in a simple, non-paired case. In a more complex case it may still make sense to sort by one of the halves of the pair if the pattern is such that locality for those values is more important (if, for example, these are key/value pairs and you are doing a lot of searches, locality for the keys is infinitely more important than for the values).
So, really, my answer is that this question can't be answered in a general case.
For storing your structure, what you actually want is probably a B-tree. These are designed for exactly what you're talking about: keeping track of large collections that you don't want to (or can't) keep entirely in memory.
