Quick and Merge sort for multiple CPUs - algorithm

Both merge sort and quick sort can work in parallel. Each time we split a problem into two sub-problems we can run those sub-problems in parallel. However, it looks sub-optimal.
Suppose we have 4 CPUs. On the 1st iteration we split the problem into only 2 sub-problems, so two CPUs are idle. On the 2nd iteration all CPUs are busy, but on the 3rd iteration we do not have enough CPUs. So we should adapt the algorithm for the case when CPUs << log(N).
Does it make sense? How would you adapt the sorting algorithms to these cases?

First off, the best parallel implementation will depend highly on the environment. Some factors to consider:
Shared Memory (a 4-core computer) vs. Not Shared (4 single-core computers)
Size of data to sort
Speed of comparing two elements
Speed of swapping/moving two elements
Memory available
Is each computer/core identical or are there differences in speeds, network latency to communicate between parts, cache effects, etc.
Fault tolerance: what if one computer/core broke down in the middle of the operation.
etc.
Now moving back to the theoretical:
Suppose I have 1024 cards, and 7 other people to help me sort them.
Merge Sort
I quickly split the stack into 8 sections of somewhat equal size. It won't be perfectly equal since I am going fast. Actually since my friends can start sorting their part as soon as they get their section, I should give my first friend a stack bigger than the rest and get smaller towards the end.
Each person sorts their part however they like sequentially. (radix sort, quick sort, merge sort, etc.)
Now for the hard part ... merging.
In real life I would probably have the first two people that are ready form a pair and start merging their decks together. Perhaps they could work together, one person merging from the front and the other from the back. Perhaps they could both work from the front while calling their numbers out.
Soon enough other people will be done with their individual sorting, and can start merging. I would have them form pairs as they find convenient and keep going until all the cards are merged.
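As a rough illustration of the merge sort scheme above, here is a minimal Python sketch. The function name `chunked_merge_sort` and the use of a thread pool are assumptions made for this example, not part of the original answer. In CPython, threads will not speed up pure-Python comparisons much, so a process pool may be a better fit for real workloads.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def chunked_merge_sort(cards, workers=8):
    """Split into `workers` piles, sort each pile independently, then merge."""
    if not cards:
        return []
    size = -(-len(cards) // workers)  # ceiling division
    piles = [cards[i:i + size] for i in range(0, len(cards), size)]
    # Each "friend" sorts one pile however they like; here with the built-in sort.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sorted_piles = list(pool.map(sorted, piles))
    # Merge piles in pairs until one remains, like friends pairing up.
    # (The pairwise merges are done one after another here for simplicity.)
    while len(sorted_piles) > 1:
        pairs = [sorted_piles[i:i + 2] for i in range(0, len(sorted_piles), 2)]
        sorted_piles = [list(heapq.merge(*pair)) for pair in pairs]
    return sorted_piles[0]
```

Equal-sized piles are used for simplicity; handing out larger piles first, as suggested above, would be a refinement.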
Quick Sort
The real trick here is to try to parallelize the partitioning, since the rest is pretty easy to do.
I will start by breaking the stack into 8 parts, and hand one part out to each friend. While doing this, I will choose one of the cards that looks like it might end up towards the middle of the sorted deck. I call out that number.
Each of my friends will partition their smaller stack into three piles, less than the called out number, equal to the called out number, and greater than the called out number. If one friend is faster than the others, he/she can steal some cards from a neighboring friend.
When they are finished with that, I collect all the less thans into one pile and give that to friends 0 through 3, I set aside the equal to's, and give the greater's to friends 4 through 7.
Friends 0 through 3, will divide their stack into four somewhat equal parts, will choose a card to partition around, and repeat the process amongst themselves.
This repeats until each friend has their own stack.
(Note that if the partitioning card wasn't chosen well, rather than dividing up the work 50-50, maybe I would only assign 2 friends to work on the less thans, and let the other 6 work on the greater thans.)
At the end, I just collect all of the stacks in the right order, along with the partition cards.
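The quick sort scheme above could be sketched in Python roughly as follows. The names `three_way_partition` and `team_quicksort`, the sampled-median pivot, and the thread pool are all assumptions of this sketch. Only the partitioning step runs in the pool; the two recursive calls run one after the other here, although they are independent and could run side by side.

```python
from concurrent.futures import ThreadPoolExecutor


def three_way_partition(pile, pivot):
    """One friend's job: split a pile into less / equal / greater."""
    less = [c for c in pile if c < pivot]
    equal = [c for c in pile if c == pivot]
    greater = [c for c in pile if c > pivot]
    return less, equal, greater


def team_quicksort(cards, friends=8):
    if friends <= 1 or len(cards) <= 1:
        return sorted(cards)  # a single friend sorts sequentially
    # Pick a card that looks like it belongs near the middle: median of a sample.
    step = max(1, len(cards) // 8)
    sample = sorted(cards[::step])
    pivot = sample[len(sample) // 2]
    # Hand one pile to each friend and partition the piles in parallel.
    size = -(-len(cards) // friends)  # ceiling division
    piles = [cards[i:i + size] for i in range(0, len(cards), size)]
    with ThreadPoolExecutor(max_workers=friends) as pool:
        parts = list(pool.map(lambda p: three_way_partition(p, pivot), piles))
    less = [c for part in parts for c in part[0]]
    equal = [c for part in parts for c in part[1]]
    greater = [c for part in parts for c in part[2]]
    # Assign friends in proportion to the pile sizes, at least one per side.
    total = len(less) + len(greater)
    share = round(friends * len(less) / total) if total else 1
    left = max(1, min(friends - 1, share))
    return (team_quicksort(less, left) + equal
            + team_quicksort(greater, friends - left))
```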
Conclusion
While it is true that some approaches are faster on a computer than in real life, I think the preceding is a good start. Different computers or cores or threads will perform their work at different speeds, unless you are implementing the sort in hardware. (If you are, you might want to look into "Sorting Networks" and/or "Optimal Sorting Networks".)
If you are sorting numbers, you will need a large dataset before parallelizing helps.
However, if you are sorting images by comparing the summed Manhattan distance between the red, green, and blue values of corresponding pixels, you will find it less difficult to get a speed-up of just under k times with k CPUs.
Lastly, you will want to time the sequential version(s) and compare as you go along, since cache effects, memory usage, network costs, etc. might make a difference.

Related

Is it possible to come up with a distributed / multi core implementation of a prime sieve.

I have been working on prime sieve algorithm, and the basic implementation is working fine for me. What I am currently struggling with is a way to divide and distribute the calculation on to multiple processors.
I know it would require storage of the actual sieve in a shared memory area or a text file, but how would one go about dividing the calculation related steps.
Any lead would help. Thanks!
Split the numbers into sections of equal size, each processor will be responsible for one of these sections.
Another processor (or one of the processors) will generate the numbers whose multiples need to be crossed off, and pass each number to all the other processors.
Each of the processors will then use the remainder of the section size divided by the given number and its own section index to determine the offset into its own section, and then loop through and cross off the applicable numbers.
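A minimal Python sketch of the sectioned approach described above. The function names are hypothetical, and the sections are processed in a plain loop here. Each call to `sieve_section` touches only its own section, so in a real setup each call could go to a different processor.

```python
import math


def base_primes(limit):
    """Plain sieve for the small primes whose multiples get crossed off."""
    flags = [True] * (limit + 1)
    for i in range(min(2, limit + 1)):
        flags[i] = False
    for p in range(2, math.isqrt(limit) + 1):
        if flags[p]:
            for m in range(p * p, limit + 1, p):
                flags[m] = False
    return [i for i, f in enumerate(flags) if f]


def sieve_section(lo, hi, primes):
    """Cross off composites in [lo, hi); one processor would own one section."""
    flags = [True] * (hi - lo)
    for p in primes:
        # Offset of the first multiple of p inside the section (never p itself).
        start = max(p * p, ((lo + p - 1) // p) * p)
        for m in range(start, hi, p):
            flags[m - lo] = False
    return [lo + i for i, f in enumerate(flags) if f and lo + i >= 2]


def sectioned_sieve(n, sections=4):
    """Primes up to and including n, computed section by section."""
    primes = base_primes(math.isqrt(n))
    size = -(-(n + 1) // sections)  # ceiling division
    result = []
    for lo in range(0, n + 1, size):  # each iteration is independent
        result.extend(sieve_section(lo, min(lo + size, n + 1), primes))
    return result
```

Only primes up to the square root of n need to be handed to the sections, which keeps the shared data small.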
Alternatively, one could get a much simpler approach by just using shared memory.
Let the first processor start crossing off multiples of 2, the second multiples of 3, the third multiples of 5, etc.
Essentially just let each processor grab the next number from the array and run with it.
If you don't do this well, you may end up with the third crossing off multiples of 4, since the first didn't get to 4 yet when the third started, so it's not crossed off, but it shouldn't result in too much more work - it will take increasingly longer for a multiple of some prime to be grabbed by a processor, while it will always be the first value crossed off by a processor handling that prime, so the likelihood of this redundancy happening decreases very quickly.
Using shared memory like this tends to be risky. If you plan on using one bit per index, most languages don't let you address individual bits, so you'll end up doing bitwise operations (probably bitwise AND) on whole bytes to make your changes (although this complexity might be hidden in some API). In many languages this read-modify-write is also not an atomic operation: one thread can read a value, AND it, and write it back, while another thread reads the same value before the first thread's write, ANDs it, and writes it back afterwards, so the first thread's changes are lost. There's no simple, efficient fix for this; what exactly you need to do will depend on the language.

How to get top K words distributed in N computers efficiently?

Suppose we have many words distributed in N computers and we want the top 10 frequent words.
There are three approaches I can think of:
1. Count the words on each of the N computers separately and take the top 20 (this number can be discussed) words on each computer. Then merge these results together.
The drawback of this approach is that some words might be missed: a word spread evenly across the computers may not make the top 20 on any one of them, even though its total frequency puts it in the top 10.
2. Almost the same as the first one. The difference is that each computer sends all of its counts, which are merged before calculating the top 10.
The drawback is that the merge time and transmission time are relatively large.
3. Use a good hash function to redistribute the words, so that no two computers hold the same word. Then get the top 10 on each computer and merge them.
The drawback is that every word has to be hashed and transmitted to another computer, which takes a lot of transmission time.
Do you have any better approach for this? Or which one of my approaches is the best?
Your idea in #1 was good but needed better execution. If F is the frequency of the Kth most common word on a single computer, then all words with frequency less than F/N on all N computers can be ignored. If you divide the machines into G groups, then the threshold F'/G applies, where F' is the frequency of the Kth most common word on the computers within a single group.
In two rounds, the computers can determine the best value for F and then aggregate a small Bloom filter that hits on all frequent words and gives false positives on some others, used to reduce the amount of data to merge with approaches #2 and #3.
The drawback of the first approach makes it a non-solution.
The second approach sends all results to a single machine for it to do all the work.
The third approach sends all results as well, but to multiple machines, which then share the workload (this is the important part). The final results that get sent to a single machine to be merged should be small in comparison to sending all word frequencies.
Clearly the third approach makes the most sense.
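To make the third approach concrete, here is a small single-process Python simulation. The names `shard_of` and `top_k_words` are made up for this sketch, and the "machines" are just lists of words. One assumption beyond the original description: each machine counts its own words first and sends (word, count) pairs rather than individual occurrences, which should cut the transmission cost.

```python
import heapq
import zlib
from collections import Counter


def shard_of(word, n):
    """Stable hash so every machine routes a given word to the same shard."""
    return zlib.crc32(word.encode("utf-8")) % n


def top_k_words(machines, k=10):
    """`machines` is a list of word lists, one per computer (simulated)."""
    n = len(machines)
    # Step 1: each machine counts locally, then routes partial counts by hash.
    shards = [Counter() for _ in range(n)]
    for words in machines:
        for word, count in Counter(words).items():
            shards[shard_of(word, n)][word] += count
    # Step 2: each shard now holds the full count of its words; take its top k.
    candidates = []
    for shard in shards:
        candidates.extend(shard.most_common(k))
    # Step 3: one machine merges at most n * k candidates.
    return heapq.nlargest(k, candidates, key=lambda wc: wc[1])
```

Because every occurrence of a word ends up on the same shard, a shard's local top k is computed from complete counts, so no globally frequent word can be missed.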

brute force search optimisation

I have a function that is declared as follows:
int brutesearch(startNumber,endNumber);
This function performs a linear search and returns the correct number if one matches my criteria, or null if none is found in the searched range.
Say that:
I want to search all 6 digits numbers to find one that does something I want
I can run the brutesearch() function multithreaded
I have a laptop with 4 cores
My question is the following:
What is my best bet for optimising this search? Dividing the number space in 4 segments and running 4 instances of the function one on each core? Or dividing for example in 10 segments and running them all together, or dividing in 12 segments and running them in batches of 4 using a queue?
Any ideas?
Knowing nothing about your search criteria (there may be other considerations created by the memory subsystem), the tradeoff here is between the cost of having some processors do more work than others (e.g., because the search predicate is faster on some values than others, or because other threads were scheduled) and the cost of coordinating the work. A strategy that's worked well for me is to have a work queue from which threads grab a constant/#threads fraction of the remaining tasks each time, but with only four processors, it's pretty hard to go wrong, though the really big running-time wins are in algorithms.
There is no general answer. You need to give more information.
If each comparison is completely independent of the others, there are no opportunities for saving computation in a global resource (say, no global hash tables are involved), and your operations are all done in a single stage,
then your best bet is to just divide your problem space by the number of cores you have available, in this case 4, and send 1/4 of the data to each core.
For example, if you had 10 million unique numbers that you wanted to test for primality, or 10 million passwords you were trying to hash to find a match, then just divide by 4.
If you have a real world problem, then you need to know a lot more about the underlying operations to get a good solution. For example if a global resource is involved, then you won't get any improvement from parallelism unless you isolate the operations on the global resource somehow.
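For the simple case of independent tests, the "more segments than cores, fed through a queue" option from the question might look like the Python sketch below. `brutesearch` here is a hypothetical stand-in that takes the criteria as a `predicate` argument, and `parallel_search` is a name invented for this example. A thread pool is used to keep the sketch short; for CPU-bound predicates in CPython a process pool would be the more realistic choice.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def brutesearch(start, end, predicate):
    """Hypothetical stand-in: linear search of [start, end); None if no match."""
    for n in range(start, end):
        if predicate(n):
            return n
    return None


def parallel_search(lo, hi, predicate, workers=4, segments=12):
    """Split [lo, hi) into `segments` ranges and feed them to `workers` threads."""
    step = max(1, -(-(hi - lo) // segments))  # ceiling division, at least 1
    ranges = [(s, min(s + step, hi)) for s in range(lo, hi, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(brutesearch, s, e, predicate) for s, e in ranges]
        for future in as_completed(futures):
            found = future.result()
            if found is not None:
                for f in futures:
                    f.cancel()  # only drops segments that have not started yet
                return found
    return None
```

Using more segments than workers lets a fast worker pick up extra segments, at the price of a little coordination. If several numbers match, whichever segment finishes first wins.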

Are there any good techniques for keeping nearly-sorted data nearly-sorted?

Short Version:
I'm looking for a technique to keep nearly-sorted data in nearly-sorted order over time, despite the values changing slightly.
Here's the scenario:
In the world of 3D graphics, it is often beneficial to order your objects from front-to-back before drawing. As your scene changes or your view of the scene changes, this data may require re-sorting, however it will usually be very close to the sorted order (i.e. it won't change very much between frames). It's also not critical that the data be exactly in sorted order. The worst thing that will happen is that a polygon will be rendered and then completely hidden. It's a small performance hit, but not the end of the world.
With this in mind, is it possible to sort the data once ahead of time and then apply a minimal patch to the data once per frame to ensure that the data stays mostly sorted? In this scenario, the data would be considered mostly sorted if most of the objects were in ascending order. That is, 1 object that is 10 steps away from its proper location is much better (10x better) than 10 objects that are 1 step away from their proper location.
It's also worth noting that the data could continue to be patched on a semi regular basis, as the data is typically rendered 30 times per second (or so). As long as the calculation was efficient, it could continue to be done over time until the changes stop and the list was completely sorted.
Existing Idea:
My knee jerk reaction to this problem is:
1. Apply an n log n sort to the data when it is loaded, and on large changes (which I can track pretty easily).
2. When the data starts changing slowly (e.g. when the scene is rotated), apply a single (linear) pass of some sort on the data to swap backwards neighbors and try to maintain sort order (I think this is basically shell sort - maybe there is a better algorithm to use for this single pass).
3. Keep doing a single pass of the partial sort each frame until the changes stop and the data is completely sorted.
4. Go back to step 2 and wait for more changes.
There are a variety of sorts that run in O(n) time if the input is mostly sorted, and O(n log n) if the data is not sorted. It sounds like you can use that pretty easily. Timsort is one such sort and, I believe, is the default sort now in both Python and Java. Smoothsort is another one that is fairly easy to implement yourself.
From your description it sounds like the sort order changes without you changing the data itself. E.g. you change the camera, so the sort order should change, even though you have not modified any polygons.
If so, you can't detect sort order changes directly when they happen. If you could, I would create buckets for the list of polygons, and resort buckets when 'enough' polygons in that bucket have been touched.
But I'm betting your system doesn't work that way. The sort is determined by the view port. In that case polygons at the front of the sort matter much more than ones at the end.
So I'd segment the poly list into fifths or something like that, front to back, so that the first fifth is the part closest to the camera. I'd completely sort the first segment every frame. I'd divide the second segment into sub-segments (say 5 again) and sort one sub-segment every frame, such that every 5 frames the second fifth is completely sorted. Segment the third through fifth segments into 15 sub-segments and do those every 5 frames each, such that the rest get sorted completely every 75 frames. At 60 fps you'd have the display list completely resorted about once every 1.25 seconds.
The nice thing about prioritizing the front of the list, is
1. Polys at the front are going to tend to be larger on the screen, and will fail depth test more often. Bad orders at the end of the list will more often than not just not matter.
2. The front of the list is more susceptible to sort-order changes due to camera changes.
Also choose those segment ranges with a little overlap, so that polygons can migrate to their correct segment in 2 sorts.
@OP: Thinking about it a little more, you are probably more concerned with having the sorting cost stay bounded, instead of exploding with scene complexity. Especially since a very complex scene should, surprisingly, be less susceptible to bad sorts (because generally the polys get smaller).
You could define a fixed amount of sorting you are willing to do per frame. Use say 50% of the budget for as much of the front of the list as you can afford, 25% of the budget to sort the next region and 25% to spend equally on the rest.
Say you budget 1000 polys sorted per frame, and you have 10000 polys in the scene. Sort the first 500 polys every frame. Sort 250 polys every tenth frame for the next region. So 501-750 on frame 1, 751-1000 on frame 2, etc. And then divide the rest of the list into 250-poly segments and sort them round robin for however many frames you need to.
This keeps the sorting cost fixed as the scene gets more and less complex, and it is easy to tune: you just adjust the sorting budget to what you can afford.
I'll suggest a solution that borrows from a number of others here. Of course we start with a full sort of the objects on initialisation.
What I would do is always perform, say, 10 linear-time runs over your objects for every frame (with early termination if you find out that your objects are already completely sorted). Each run can be, say, one pass of bubble sort with a shell sort-style gap over the whole array: for all i from 0 to n-gap-1, compare A[i] and A[i+gap], and exchange them if they are not sorted. You can use a fixed sequence of gaps, or maybe better, let it vary between frames; either way, if you do sufficiently many frames where the objects do not change, you'll have a fully sorted sequence. You could even mix different types of sub-algorithms to do your runs, as long as each iteration improves the 'sortedness'.
You can add Rafael Baptista's idea of prioritizing the front of the scene easily by doing one extra run on the front segment, or choosing to divide the gap by two for the front half, or something like that.
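The gapped pass described above can be sketched in Python like this. The gap sequence (5, 3, 1) and the names `gap_pass` and `patch_frame` are assumptions chosen for illustration only. Every swap removes at least one inversion, so repeated frames on unchanged data converge to a sorted list.

```python
def gap_pass(a, gap):
    """One linear pass: compare a[i] with a[i + gap], swap if out of order.

    Returns the number of swaps made during the pass.
    """
    swaps = 0
    for i in range(len(a) - gap):
        if a[i] > a[i + gap]:
            a[i], a[i + gap] = a[i + gap], a[i]
            swaps += 1
    return swaps


def patch_frame(a, gaps=(5, 3, 1), max_runs=10):
    """Spend at most `max_runs` passes per frame; stop early once sorted."""
    for run in range(max_runs):
        gap = gaps[run % len(gaps)]
        if gap_pass(a, gap) == 0 and gap == 1:
            return True  # a clean gap-1 pass proves the list is sorted
    return False
```

Calling `patch_frame` once per frame bounds the cost at `max_runs` linear passes, and the return value tells you when you can stop patching until the next change.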
It doesn't work out as neatly as the problem you've supposed because all you have to do is turn the camera 90 degrees and the basis for being sorted is on a different axis entirely. (X and Y axis are independent, for example -- looking down the X axis will cause the sort order to not rely on the X axis, and looking down the Y axis will cause the sort order to not rely on the Y axis.) Even a 5 degree turn can cause far away "close" (as far as Z-order is concerned) things to be suddenly "far".
Let's be honest -- generating the draw calls for the objects is normally going to take much more time than sorting them, especially if you have an optimized sorting algorithm for your scenario and your game is of modern visual complexity.
Sorting can be practically O(n), especially with histogram-based algorithms or radix-style algorithms. (Yes, radix sort applies to integers, so you'd have to scale your world coordinates to integers, but normally that's more than good enough unless you have a gigantic world.)
That being said, since you're already doing O(n) ops for everything you're drawing, resorting per frame isn't going to be a huge problem, especially with both high and low level optimization.
Another common way of addressing this issue is with a scene graph, but for your purposes it ends up essentially being a re-sort per frame. However, you can build frustum culling, shadow culling, and level of detail calculations into the scene graph traversal.
If you're looking for approximations, instead of doing a z-distance sort do a true distance sort and update the sort order more often for close by objects and less often for further objects (depending on distance the camera has traveled). This can work because if you're further away from an object, moving doesn't cause the angle to the viewer to change as often which, in turn, means the old sorting data is more likely to be valid. I'm not a fan of this because I like algorithms which allow my game to teleport across the map without any issues. (Mind you, streaming assets from disk becomes the real issue for teleporting.)
Shell sort is good for lists with few unique values and some scenarios that "need short code and do not use the call stack".
In your case, you need something called an adaptive sort, meaning an algorithm that "takes advantage of existing order in its input".
If your space is tight, you can just use Straight Insertion Sort, which is adaptive and in place.
Otherwise you can try Timsort and Smoothsort as @RunningWild suggested; they are both adaptive sort algorithms.
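For reference, a straight insertion sort takes only a few lines. This Python sketch is illustrative and the `key` parameter is an addition for convenience, for example to sort objects by depth. Its running time is proportional to n plus the number of inversions, which is why it does well on nearly-sorted input.

```python
def insertion_sort(items, key=lambda x: x):
    """Straight insertion sort: in place, stable, fast on nearly-sorted input."""
    for i in range(1, len(items)):
        current = items[i]
        k = key(current)
        j = i - 1
        # Shift larger elements one slot right until current fits.
        while j >= 0 and key(items[j]) > k:
            items[j + 1] = items[j]
            j -= 1
        items[j + 1] = current
    return items
```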

Optimizing Conway's 'Game of Life'

To experiment, I've (long ago) implemented Conway's Game of Life (and I'm aware of this related question!).
My implementation worked by keeping 2 arrays of booleans, representing the 'last state', and the 'state being updated' (the 2 arrays being swapped at each iteration). While this is reasonably fast, I've often wondered about how to optimize this.
One idea, for example, would be to precompute at iteration N the zones that could be modified at iteration (N+1) (so that if a cell does not belong to such a zone, it won't even be considered for modification at iteration (N+1)). I'm aware that this is very vague, and I never took time to go into the details...
Do you have any ideas (or experience!) of how to go about optimizing (for speed) Game of Life iterations?
I am going to quote my answer from the other question, because the chapters I mention have some very interesting and fine-tuned solutions. Some of the implementation details are in C and/or assembly, yes, but for the most part the algorithms can work in any language:
Chapters 17 and 18 of Michael Abrash's Graphics Programmer's Black Book are one of the most interesting reads I have ever had. It is a lesson in thinking outside the box. The whole book is great really, but the final optimized solutions to the Game of Life are incredible bits of programming.
There are some super-fast implementations that (from memory) represent cells of 8 or more adjacent squares as bit patterns and use that as an index into a large array of precalculated values to determine in a single machine instruction if a cell is live or dead.
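A much-simplified Python sketch of the table-lookup idea: pack a 3x3 neighbourhood into 9 bits and look up the next state of the centre cell in a precomputed 512-entry table. The fast implementations mentioned above resolve several cells per lookup; this version resolves one cell per lookup, only to show the principle. The bit layout and the wrapped grid are assumptions of the sketch.

```python
def build_table():
    """table[bits] = next state of the centre cell of a 3x3 block.

    Bit layout (an assumption of this sketch): bit 4 is the centre cell,
    the other eight bits are its neighbours.
    """
    table = []
    for bits in range(512):
        centre = (bits >> 4) & 1
        neighbours = bin(bits).count("1") - centre
        alive = neighbours == 3 or (centre == 1 and neighbours == 2)
        table.append(1 if alive else 0)
    return table


TABLE = build_table()


def step(grid):
    """Advance a wrapped (toroidal) grid of 0/1 rows by one generation."""
    h, w = len(grid), len(grid[0])
    new = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            bits = 0
            # Pack the 3x3 block row by row; the centre lands on bit 4.
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    bits = (bits << 1) | grid[(y + dy) % h][(x + dx) % w]
            new[y][x] = TABLE[bits]
    return new
```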
Check out here:
http://dotat.at/prog/life/life.html
Also XLife:
http://linux.maruhn.com/sec/xlife.html
You should look into Hashlife, the ultimate optimization. It uses the quadtree approach that skinp mentioned.
As mentioned in Abrash's Black Book, one of the simplest and most straightforward ways to get a huge speedup is to keep a change list.
Instead of iterating through the entire cell grid each time, keep a copy of all the cells that you change.
This will narrow down the work you have to do on each iteration.
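One possible shape for a change list, sketched in Python on an unbounded grid with live cells stored as a set of (x, y) tuples. This representation and the function names are assumptions of the sketch, not taken from the book. The key observation is that a cell can only flip in this generation if it, or one of its neighbours, flipped in the previous one.

```python
def neighbours(cell):
    x, y = cell
    return [(x + dx, y + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0)]


def step(live, changed):
    """Advance one generation.

    live: set of live (x, y) cells. changed: cells that flipped last step
    (pass a copy of `live` for the very first step).
    Returns the new live set and the set of cells that flipped.
    """
    # Only cells in or next to the change list can flip now.
    candidates = set(changed)
    for cell in changed:
        candidates.update(neighbours(cell))
    flips = set()
    for cell in candidates:
        count = sum(n in live for n in neighbours(cell))
        alive = cell in live
        if alive != (count == 3 or (alive and count == 2)):
            flips.add(cell)
    return live ^ flips, flips
```

For a stable pattern the change list becomes empty after one step, so later generations cost almost nothing.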
The algorithm itself is inherently parallelizable. Using the same double-buffered method in an unoptimized CUDA kernel, I'm getting around 25ms per generation in a 4096x4096 wrapped world.
What the most efficient algorithm is depends mainly on the initial state.
If the majority of cells are dead, you could save a lot of CPU time by skipping empty parts and not calculating things cell by cell.
In my opinion it can make sense to check for completely dead spaces first, when your initial state is something like "random, but with a chance for life lower than 5%".
I would just divide the matrix up into halves and start checking the bigger ones first.
So if you have a field of 10,000 * 10,000, you'd first accumulate the states of the upper left quarter of 5,000 * 5,000.
If the sum of states is zero in the first quarter, you can ignore this first quarter completely and check the upper right 5,000 * 5,000 for life next.
If its sum of states is >0, you divide this second quarter into 4 pieces again and repeat the check for life for each of these subspaces.
You could go down to subframes of 8*8 or 10*10 (not sure what makes the most sense here).
Whenever you find life, you mark these subspaces as "has life".
Only spaces which "have life" need to be divided into smaller subspaces; the empty ones can be skipped.
When you are finished assigning the "has life" attribute to all possible subspaces, you end up with a list of subspaces, which you now simply extend by +1 in each direction (with empty cells) and apply the regular (or modified) Game of Life rules to them.
You might think that dividing up a 10,000*10,000 space into subspaces of 8*8 is a lot of tasks, but accumulating their state values is in fact much, much less computing work than applying the GoL algorithm to each cell plus its 8 neighbours, plus comparing the number and storing the new state for the next iteration somewhere...
But like I said above, for a random initial state with 30% population this won't make much sense, as there will not be many completely dead 8*8 subspaces to find (let alone dead 256*256 subspaces).
And of course, the best way to optimise will, last but not least, depend on your language.
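The subdivision described above could be sketched recursively in Python as below. To keep the example short it assumes a square grid whose side is a power of two, stored as a list of rows of 0/1 values. The names `has_life` and `live_blocks` are invented for this sketch.

```python
def has_life(grid, x0, y0, size):
    """True if any cell in the size*size block at (x0, y0) is alive."""
    return any(any(row[x0:x0 + size]) for row in grid[y0:y0 + size])


def live_blocks(grid, x0=0, y0=0, size=None, min_size=8):
    """Yield (x0, y0, size) for the smallest blocks that contain a live cell."""
    if size is None:
        size = len(grid)  # assumes a square grid, side a power of two
    if not has_life(grid, x0, y0, size):
        return  # whole block is dead: skip it and everything inside it
    if size <= min_size:
        yield (x0, y0, size)
        return
    half = size // 2
    for dy in (0, half):
        for dx in (0, half):
            yield from live_blocks(grid, x0 + dx, y0 + dy, half, min_size)
```

Each yielded block would then be extended by one cell in every direction before applying the rules, since life can spread across block borders.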
Two ideas:
(1) Many configurations are mostly empty space. Keep a linked list (not necessarily in order, that would take more time) of the live cells, and during an update, only update around the live cells (this is similar to your vague suggestion, OysterD :)
(2) Keep an extra array which stores the # of live cells in each row of 3 positions (left-center-right). Now when you compute the new dead/live value of a cell, you need only 4 read operations (top/bottom rows and the center-side positions), and 4 write operations (update the 3 affected row summary values, and the dead/live value of the new cell). This is a slight improvement from 8 reads and 1 write, assuming writes are no slower than reads. I'm guessing you might be able to be more clever with such configurations and arrive at an even better improvement along these lines.
If you don't want anything too complex, then you can use a grid to slice it up, and if that part of the grid is empty, don't try to simulate it (please view Tyler's answer). However, you could do a few optimizations:
Set different grid sizes depending on the amount of live cells, so if there's not a lot of live cells, that likely means they are in a tiny place.
When you randomize it, don't use the grid code until the user changes the data: I've personally tested randomizing it, and even after a long amount of time, it still fills most of the board (unless for a sufficiently small grid, at which point it won't help that much anymore)
If you are showing it to the screen, don't use rectangles for pixel size 1 and 2: instead set the pixels of the output. Any higher pixel size and I find it's okay to use the native rectangle-filling code. Also, preset the background so you don't have to fill the rectangles for the dead cells (not live, because live cells disappear pretty quickly)
I don't know exactly how this can be done, but I remember some of my friends had to represent this game's grid with a quadtree for an assignment. I'm guessing it's really good for optimizing the space of the grid since you basically only represent the occupied cells. I don't know about execution speed though.
It's a two dimensional automaton, so you can probably look up optimization techniques. Your notion seems to be about compressing the number of cells you need to check at each step. Since you only ever need to check cells that are occupied or adjacent to an occupied cell, perhaps you could keep a buffer of all such cells, updating it at each step as you process each cell.
If your field is initially empty, this will be much faster. You probably can find some balance point at which maintaining the buffer is more costly than processing all the cells.
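A compact Python sketch of processing only occupied cells and their neighbours, using a set of live (x, y) cells on an unbounded grid. The representation is an assumption of this example. Every live cell adds 1 to the count of each of its eight neighbours, so cells far from any life never appear in the table at all.

```python
from collections import Counter


def step(live):
    """Advance one generation; `live` is a set of (x, y) live cells."""
    # Each live cell contributes one to the count of each of its 8 neighbours.
    counts = Counter((x + dx, y + dy)
                     for x, y in live
                     for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                     if (dx, dy) != (0, 0))
    # Birth on exactly 3 neighbours, survival on 2 or 3.
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in live)}
```

The work per generation is proportional to the number of live cells rather than to the size of the field, which is where the win for sparse fields comes from.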
There are table-driven solutions for this that resolve multiple cells in each table lookup. A google query should give you some examples.
I implemented this in C#:
All cells have a location, a neighbor count, a state, and access to the rule.
Put all the live cells in array B in array A.
Have all the cells in array A add 1 to the neighbor count of their neighbors.
Have all the cells in array A put themselves and their neighbors in array B.
All the cells in Array B Update according to the rule and their state.
All the cells in Array B set their neighbors to 0.
Pros:
Ignores cells that don't need to be updated
Cons:
4 arrays: a 2d array for the grid, an array for the live cells, and an array for the active cells.
Can't process rule B0.
Processes cells one by one.
Cells aren't just booleans
Possible improvements:
Cells also have an "Updated" value, they are updated only if they haven't updated in the current tick, removing the need of array B as mentioned above
Instead of array B being the ones with live neighbors, array B could be the cells without, and those check for rule B0.

Resources