Parallelise bubble sort using CUDA - algorithm

I was given an assignment to parallelize bubble sort and implement it using CUDA.
I don't see how bubble sort could possibly be parallelized; it seems inherently sequential, since it compares two consecutive elements and swaps them after a conditional branch.
Thoughts, anyone?

To be completely honest, I had trouble thinking about a way to parallelize bubble sort as well. I initially thought of a hybrid sort where you could tile, bubble sort each tile, and then merge (probably would still improve performance if you could make it work). However, I browsed for "Parallel Bubble Sort", and found this page. If you scroll down you'll find the following parallel bubble sort algorithm:
For k = 0 to n-2
    If k is even then
        for i = 0 to (n/2)-1 do in parallel
            If A[2i] > A[2i+1] then
                Exchange A[2i] ↔ A[2i+1]
    Else
        for i = 0 to (n/2)-2 do in parallel
            If A[2i+1] > A[2i+2] then
                Exchange A[2i+1] ↔ A[2i+2]
Next k
You could run the outer for-loop on the CPU and launch a kernel for each of the "do in parallel" loops. This seems efficient for large arrays, but might involve too much overhead for small ones; large arrays are a fair assumption if you're writing a CUDA implementation. Since the swaps within these kernels are between adjacent pairs of elements, you should be able to tile accordingly. I've searched for generic, non-GPU-specific parallel bubble sorts, and this was the only one I could find.
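To make the structure concrete, here's a minimal CPU sketch of the odd-even scheme above in plain C++ (my own illustration, not from the linked page); each inner loop is the part you'd replace with a kernel launch:

```cpp
#include <algorithm>
#include <vector>

// Odd-even transposition sort: each phase compares DISJOINT adjacent
// pairs, so every comparison within a phase is independent and could
// run in its own thread.
void oddEvenSort(std::vector<int>& a) {
    const int n = static_cast<int>(a.size());
    for (int k = 0; k < n; ++k) {
        int start = k % 2;  // even phase: (0,1),(2,3)...; odd phase: (1,2),(3,4)...
        for (int i = start; i + 1 < n; i += 2)  // the "do in parallel" part
            if (a[i] > a[i + 1])
                std::swap(a[i], a[i + 1]);
    }
}
```

Because the pairs in a phase never overlap, a kernel can run a whole phase with one thread per pair and no locking; only the phases themselves must run in order.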
I did find a (very slightly) helpful visualization here, which can be seen below. I'd love to discuss this more in the comments.
EDIT: I found another parallel version of bubble sort called Cocktail Shaker Sort. Here's the pseudocode:
procedure cocktailShakerSort( A : list of sortable items ) defined as:
    do
        swapped := false
        for each i in 0 to length( A ) - 2 do:
            if A[ i ] > A[ i + 1 ] then // test whether the two elements are in the wrong order
                swap( A[ i ], A[ i + 1 ] ) // let the two elements change places
                swapped := true
            end if
        end for
        if not swapped then
            // we can exit the outer loop here if no swaps occurred.
            break do-while loop
        end if
        swapped := false
        for each i in length( A ) - 2 to 0 do:
            if A[ i ] > A[ i + 1 ] then
                swap( A[ i ], A[ i + 1 ] )
                swapped := true
            end if
        end for
    while swapped // if no elements have been swapped, then the list is sorted
end procedure
This also uses two for-loops comparing adjacent elements in a bubble-sort fashion. The two algorithms are almost mirror images: the first one (which I've now learned is called odd-even sort) runs a fixed number of passes regardless of order, while cocktail shaker sort checks a swapped flag in each loop and stops early once the array is sorted.
The code included in this post for the odd-even sort simply runs the outer loop enough times to guarantee the array is sorted, whereas the Wikipedia pseudocode checks. A reasonable first pass would be to implement the algorithm from this post and then add the early-exit check as an optimization, although the check may actually be slower with CUDA.
Regardless, the sort will be slow. Here's a related SO question FYI, but there isn't much help there; they agree it's not effective for small arrays and really emphasize how poorly it performs.
Are you looking for specific CUDA code, or was this enough? It seems like you wanted an overview of the possible options and already understand CUDA implementation.

TL;DR
For a complete implementation of a generic parallel bubble sort, take a look at generic-bubble-sort.cu. "Generic" here means the algorithm sorts elements of any type, as long as you provide a comparator.
At best
With a number of threads linearly proportional to N (say N/2), you can get a parallel bubble sort with O(N) time complexity, where N is the size of the array you want to sort.
A hint
It might not be obvious, but if you look closely you'll see that all the sequential bubble sort does is swap pairs of elements that are not ordered correctly, one pair at a time!
Disjoint pairs can be compared and swapped independently of one another, and that independence is exactly what a parallel bubble sort exploits.
An approach
Let's say we want to sort the following array in ascending order:
# [7][1][3][2][0]
We first take the unsorted array and treat every two elements array[i] and array[i+1] as an independent pair. For this first iteration, i is an EVEN index, so our pairs are { {array[0], array[1]} , {array[2], array[3]}, ...}.
# [7][1][3][2][0] <-- Unsorted array of 5 elements
# [7][1] [3][2] [0] <-- A set of independent pairs.
Then, we'd swap every two elements of each pair if they are not in the desired order.
# [7][1] [3][2] [0] --┑ Sorting first set of pairs
# |
# [1][7] [2][3] [0] <-┛ starting from an even idx
Our array after this first iteration would look like this :
# [1][7][2][3][0] <-- Result after first iteration
We now iterate a second time, but unlike before, we sort pairs starting from an ODD index: { {array[1], array[2]} , {array[3], array[4]}, ...}. Note that an element left without a partner is simply not considered in that iteration.
# [1][7][2][3][0] <-- Result after first iteration
# [1] [7][2] [3][0] --┑ Sorting second set of pairs
# |
# [1] [2][7] [0][3] <-┛ starting from an odd index
# [1][2][7][0][3] <-- Result after second iteration
After N EVEN/ODD pair sorting iterations we'd have a sorted array.
# [1][2] [7][0] [3] --┑
# [1][2] [0][7] [3] |
# |
# [1][2][0][7][3] |
# | The whole parallel sorting
# [1] [2][0] [7][3] | will converge after N iterations
# [1] [0][2] [3][7] | So we keep sorting pairs for 3 more
# | iterations.
# [1][0][2][3][7] |
# |
# [1][0] [2][3] [7] |
# [0][1] [2][3] [7] <-┛
#
# [0][1][2][3][7] <-- Sorted array!
Parallel Bubble Sort with CUDA
A straightforward implementation of a CUDA program for the approach above would be done as follows:
each thread would be responsible for sorting an individual pair
you would need N/2 threads
since each iteration depends on the result of the previous one, we need to care about synchronizing our threads between iterations
USING A SINGLE BLOCK: if our threads fit into a single block, we only need __syncthreads() after each iteration, and we can take advantage of shared memory by keeping the whole array there.
USING MORE THAN ONE BLOCK: we would have to synchronize all the threads in the grid, which can't be done from inside a kernel (without cooperative groups). We'd only be able to perform one iteration per kernel launch, and launch our kernel N times. The bad news is that we can then only use global memory for the array, since shared memory lives no longer than the block that owns it.
Some code
Here's a simple implementation of what's explained above considering only one block. The whole code is available in this repo.
template<typename T>
__global__
void bubbleSort(T* v, const unsigned int n, ShouldSwap<T> shouldSwap) {
    const unsigned int tIdx = threadIdx.x;
    for (unsigned int i = 0; i < n; i++) {
        const unsigned int offset = i % 2;
        const unsigned int leftIndex = 2 * tIdx + offset;
        const unsigned int rightIndex = leftIndex + 1;
        if (rightIndex < n) {
            if (shouldSwap(v[leftIndex], v[rightIndex])) {
                swap<T>(&v[leftIndex], &v[rightIndex]);
            }
        }
        __syncthreads();
    }
}
If you're wondering about ShouldSwap and swap implementations here's the code:
swap
A device function for swapping elements.
template<typename T>
__host__ __device__ __inline__
void swap(T* a, T* b) {
    T tmp = *a;
    *a = *b;
    *b = tmp;
}
ShouldSwap
A C++ Functor used as a generic comparator.
template<typename T>
__host__ __device__
bool ShouldSwap<T>::operator() (const T left, const T right) const {
    return left > right;
}

Related

abstract inplace mergesort for effective merge sort

I am reading about merge sort in Algorithms in C++ by Robert Sedgewick and have following questions.
static void mergeAB(ITEM[] c, int cl, ITEM[] a, int al, int ar, ITEM[] b, int bl, int br)
{
    int i = al, j = bl;
    for (int k = cl; k < cl+ar-al+br-bl+1; k++)
    {
        if (i > ar) { c[k] = b[j++]; continue; }
        if (j > br) { c[k] = a[i++]; continue; }
        c[k] = less(a[i], b[j]) ? a[i++] : b[j++];
    }
}
The characteristic of the basic merge that is worthy of note is that
the inner loop includes two tests to determine whether the ends of the
two input arrays have been reached. Of course, these two tests usually
fail, and the situation thus cries out for the use of sentinel keys to
allow the tests to be removed. That is, if elements with a key value
larger than those of all the other keys are added to the ends of the a
and aux arrays, the tests can be removed, because when the a (b) array
is exhausted, the sentinel causes the next elements for the c array to
be taken from the b (a) array until the merge is complete.
However, it is not always easy to use sentinels, either because it
might not be easy to know the largest key value or because space might
not be available conveniently.
For merging, there is a simple remedy. The method is based on the
following idea: Given that we are resigned to copying the arrays to
implement the in-place abstraction, we simply put the second array in
reverse order when it is copied (at no extra cost), so that its
associated index moves from right to left. This arrangement leads to
the largest element—in whichever array it is—serving as sentinel for
the other array.
My questions on above text
What does the statement "when the a (b) array is exhausted" mean? What is 'a (b)' here?
Why does the author say it is not always easy to know the largest key value, and how is space related to using sentinels?
What does the author mean by "Given that we are resigned to copying the arrays"? What does "resigned" mean in this context?
Could someone give a simple example illustrating the idea described as the "simple remedy"?
"When the a (b) array is exhausted" is a shorthand for "When either the a array or the b array is exhausted".
The interface is dealing with sub-arrays of a bigger array, so you can't simply go writing beyond the ends of the arrays.
The code copies the data from two arrays into one other array. Since this copy is unavoidable, being 'resigned to copying the arrays' means we reluctantly accept that the arrays must be copied.
Tricky...that's going to take some time to work out what is meant.
Tangentially: That's probably not the way I'd write the loop. I'd be inclined to use:
int i = al, j = bl, k = cl;
for (; i <= ar && j <= br; k++)
{
    if (a[i] < b[j])
        c[k] = a[i++];
    else
        c[k] = b[j++];
}
while (i <= ar)
    c[k++] = a[i++];
while (j <= br)
    c[k++] = b[j++];
One of the two trailing loops does nothing. The revised main merge loop has 3 tests per iteration versus 4 tests per iteration for the original algorithm. I've not formally measured it, but the simpler merge loop is likely to be quicker than the original single-loop version.
The first three questions are almost best suited for English Language Learners.
a(b) and b(a)
Sometimes parentheses are used to state two parallel phrases at once:
when a (b) is exhausted we copy elements from b (a)
means:
when a is exhausted we copy elements from b,
when b is exhausted we copy elements from a
What is difficult about sentinels
Two annoying things about sentinels are:
sometimes your array data may contain every possible value, so there is no value you can use as a sentinel that is guaranteed to be bigger than all the values in the array
using a sentinel instead of checking an index to see whether you are done with an array requires room for one extra slot in the array to store the sentinel
Resigning
We programmers are never happy to copy (or move) things around; leaving them where they already are is better when possible (because we are lazy).
In this version of the merge sort we have already given up on avoiding copies... we are resigned to it.
Given that we must copy, we can just as well copy the second array in the opposite order (and read it in the opposite order too), because that is free(*).
(*) Free at this level of abstraction; the cost on a real CPU may be nontrivial. As almost always in the performance area, YMMV.
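Here's a small C++ sketch of that "simple remedy" (the name bitonicMerge is mine): copy the first run forward and the second run in reverse into an auxiliary buffer, then merge with the two indices converging. Each run's largest element then acts as the other run's sentinel, so no end-of-array tests are needed.

```cpp
#include <vector>

// Merge sorted runs a[lo..mid] and a[mid+1..hi] in place via an aux
// buffer. The second run is copied in REVERSE, so aux looks like:
//   [ first run ascending | second run descending ]
// i walks right, j walks left. When one run is exhausted, its index
// sits on the other run's largest element, which is >= everything
// remaining -- a built-in sentinel, no bounds tests required.
void bitonicMerge(std::vector<int>& a, int lo, int mid, int hi) {
    std::vector<int> aux(hi - lo + 1);
    for (int k = lo; k <= mid; ++k)
        aux[k - lo] = a[k];                        // first run, forward
    for (int k = mid + 1; k <= hi; ++k)
        aux[(mid - lo + 1) + (hi - k)] = a[k];     // second run, reversed
    int i = 0, j = hi - lo;
    for (int k = lo; k <= hi; ++k)                 // indices converge
        a[k] = (aux[i] <= aux[j]) ? aux[i++] : aux[j--];
}
```

Note the merge loop has a single comparison and no exhaustion checks, which is exactly the point of the reverse copy.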

how to read all 1's in an Array of 1's and 0's spread all over the array randomly

I have an array with 1s and 0s spread over it randomly.
int arr[N] = {1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,0,0,0,1....................N}
Now I want to retrieve all the 1's in the array as fast as possible, but I must not lose the exact (index-based) position of each element, so sorting is not an option.
That leaves linear search, i.e. O(n). Is there anything better than this?
The main problem with a linear scan is that I need to run the scan X
times. So I feel I need some other data structure, built during the
first linear scan, so that I don't need to run the linear scan again
and again.
Let me be clear about final expectations-
I just need to find the number of 1's in a given range of the array; precisely, I need the number of 1's within, say, the range 40-100. The range can be arbitrary, and I need the count of 1's within it. I can't just precompute one sum, because I'd have to iterate over the array again and again for different ranges.
I'm surprised you considered sorting as a faster alternative to linear search.
If you don't know where the ones occur, then there is no better way than linear searching. Perhaps if you used bits or char datatypes you could do some optimizations, but it depends on how you want to use this.
The best optimization you can make here is to avoid branch misprediction. Because each value is zero or one, you can add the value itself to the index used to store the one-indices, instead of branching.
Simple approach:
int end = 0;
int indices[N];
for( int i = 0; i < N; i++ )
{
    if( arr[i] ) indices[end++] = i; // Slow due to branch prediction
}
Without branching:
int end = 0;
int indices[N];
for( int i = 0; i < N; i++ )
{
    indices[end] = i;
    end += arr[i];
}
[edit] I tested the above, and found the version without branching was almost 3 times faster (4.36s versus 11.88s for 20 repeats on a randomly populated 100-million element array).
Coming back here to post results, I see you have updated your requirements. What you want is really easy with a dynamic programming approach...
All you do is create a new array that is one element larger, which stores the number of ones from the beginning of the array up to (but not including) the current index.
arr : 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 1
count : 0 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 4 5 6 6 6 6 7
(I've offset arr above so it lines up better)
Now you can compute the number of 1s in any range in O(1) time. To compute the number of 1s between index A and B, you just do:
int num = count[B+1] - count[A];
Obviously you can still use the non-branch-prediction version to generate the counts initially. All this should give you a pretty good speedup over the naive approach of summing for every query:
int *count = new int[N+1];
int total = 0;
count[0] = 0;
for( int i = 0; i < N; i++ )
{
    total += arr[i];
    count[i+1] = total;
}

// to compute the ranged sum:
int range_sum( int *count, int a, int b )
{
    if( b < a ) return range_sum(count, b, a);
    return count[b+1] - count[a];
}
A one-time linear scan is fine. Since you are doing multiple queries across ranges of the array, each query can then be answered in constant time. Here you go:
Scan the array and build a table keyed by array position (1, 2, 3, 4, 5, 6, ...). The value stored for each position is a tuple <isOne, cumulativeSum>, where isOne says whether that position holds a one and cumulativeSum is the running total of 1's encountered so far.
Array = 1 1 0 0 1 0 1 1 1 0 1 0
Tuple: (1,1) (1,2) (0,2) (0,2) (1,3) (0,3) (1,4) (1,5) (1,6) (0,6) (1,7) (0,7)
CASE 1: When lower bound of cumulativeSum has a 0. Number of 1's [6,11] =
cumulativeSum at 11th position - cumulativeSum at 6th position = 7 - 3 = 4
CASE 2: When lower bound of cumulativeSum has a 1. Number of 1's [2,11] =
cumulativeSum at 11th position - cumulativeSum at 2nd position + 1 = 7-2+1 = 6
Step 1 is O(n)
Step 2 is O(1)
The total complexity is linear, no doubt, but for your task, where you query ranges several times, the above algorithm is better if you have ample memory :)
Does it have to be a simple linear array data structure? Or can you create your own data structure which happens to have the desired properties, for which you're able to provide the required API, but whose implementation details can be hidden (encapsulated)?
If you can implement your own and if there is some guaranteed sparsity (to either 1s or 0s) then you might be able to offer better than linear performance. I see that you want to preserve (or be able to regenerate) the exact stream, so you'll have to store an array or bitmap or run-length encoding for that. (RLE will be useless if the stream is actually random rather than arbitrary but could be quite useful if there are significant sparsity or patterns with long strings of one or the other. For example a black&white raster of a bitmapped image is often a good candidate for RLE).
Let's say you're guaranteed that the stream will be sparse --- that no more than 10%, for example, of the bits will be 1s (or, conversely, that more than 90% will be). If that's the case, then you might model your solution on an RLE and maintain a count of all 1s (simply incremented as you set bits and decremented as you clear them). If you need to quickly get the number of set bits for arbitrary ranges, then instead of a single counter you can keep a conveniently sized array of counters for partitions of the stream. (Conveniently sized, in this case, means something that fits easily within memory, your caches, or register sets, but offers a reasonable trade-off between summing whole partitions and linear scanning.) The result for any arbitrary range is the sum of all the partitions fully enclosed by the range, plus the results of linear scans for any fragments not aligned on your partition boundaries.
For a very, very, large stream you could even have a multi-tier "index" of partition sums --- traversing from the largest (most coarse) granularity down toward the "fragments" to either end (using the next layer of partition sums) and finishing with the linear search of only the small fragments.
Obviously such a structure represents trade offs between the complexity of building and maintaining the structure (inserting requires additional operations and, for an RLE, might be very expensive for anything other than appending/prepending) vs the expense of performing arbitrarily long linear search/increment scans.
If:
the purpose is to be able to find the number of 1s in the array at any time,
given that relatively few of the values in the array might change between one moment when you want to know the number and another moment, and
if you have to find the number of 1s in a changing array of n values m times,
... you can certainly do better than examining every cell in the array m times by using a caching strategy.
The first time you need the number of 1s, you certainly have to examine every cell, as others have pointed out. However, if you then store the number of 1s in a variable (say sum) and track changes to the array (by, for instance, requiring that all array updates occur through a specific update() function), every time a 0 is replaced in the array with a 1, the update() function can add 1 to sum and every time a 1 is replaced in the array with a 0, the update() function can subtract 1 from sum.
Thus, sum is always up-to-date after the first time that the number of 1s in the array is counted and there is no need for further counting.
(EDIT to take the updated question into account)
If the need is to return the number of 1s in a given range of the array, that can be done with a slightly more sophisticated caching strategy than the one I've just described.
You can keep a count of the 1s in each subset of the array and update the relevant subset count whenever a 0 is changed to a 1 or vice versa within that subset. Finding the total number of 1s in a given range within the array would then be a matter of adding the number of 1s in each subset that is fully contained within the range and then counting the number of 1s that are in the range but not in the subsets that have already been counted.
Depending on circumstances, it might be worthwhile to have a hierarchical arrangement in which (say) the number of 1s in the whole array is at the top of the hierarchy, the number of 1s in each 1/q th of the array is in the second level of the hierarchy, the number of 1s in each 1/(q^2) th of the array is in the third level of the hierarchy, etc. e.g. for q = 4, you would have the total number of 1s at the top, the number of 1s in each quarter of the array at the second level, the number of 1s in each sixteenth of the array at the third level, etc.
Are you using C (or a derived language)? If so, can you control the encoding of your array? For example, if you could use a bitmap, you could sum the counts with a lookup table (a popcount per byte); if your subrange ends aren't divisible by 8 you'll have to handle the partial end bytes specially, but the speedup will be significant.
If that's not the case, can you at least encode them as single bytes? Then you may be able to exploit sparseness if it exists (more specifically, the hope that there are often multi-index swaths of zeros).
So for:
u8 input = {1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,0,0,0,1....................N};
You can write something like (untested):
uint countBytesBy1FromTo(u8 *input, uint start, uint stop)
{   // counts one byte at a time; use for ranges shorter than 8,
    // use the function below for longer ranges.
    // assumes values are only ones and zeros, otherwise we'd have to test/branch
    uint sum = 0;
    u8 *end = input + stop;
    for (u8 *each = input + start; each < end; each++)
        sum += *each;
    return sum;
}

uint countBytesBy8FromTo(u8 *input, uint start, uint stop)
{
    u64 *chunks = (u64*)(input + start);
    u64 *end = chunks + ((stop - start) >> 3);
    // count the trailing bytes that don't fill a whole 8-byte chunk
    uint sum = countBytesBy1FromTo((u8*)end, 0, (uint)((input + stop) - (u8*)end));
    for (; chunks < end; chunks++)
    {
        if (*chunks) // skip 8-byte chunks that are all zero
        {
            sum += countBytesBy1FromTo((u8*)chunks, 0, 8);
        }
    }
    return sum;
}
The basic trick is exploiting the ability to cast slices of your target array to single entities your language can examine in one swoop, testing whether all of the bytes in a slice are zero with one comparison, and skipping the whole block if so. The more zeros, the better it works. In the case where each large cast integer always contains at least one 1, this approach just adds overhead. You might find that using a u32 works better for your data, or that adding a u32 test between the 1-byte and 8-byte levels helps. For datasets where zeros are much more common than ones, I've used this technique to great advantage.
Why is sorting invalid? You can clone the original array, sort the clone, and count and/or mark the locations of the 1s as needed.

How would you implement this function in CUDA? (offsets in sorted integer vector)

I have a sorted integer array on the device, e.g.:
[0,0,0,1,1,2,2]
And I want the offsets to each element in another array:
[0,3,5]
(since the first 0 is at position 0, the first 1 at position 3 and so on)
I know beforehand how many distinct elements there will be. How would you implement this efficiently in CUDA? I'm not asking for code, but for a high-level description of the algorithm you would use to compute this transformation. I already had a look at the various functions in the thrust namespace, but could not think of any combination of thrust functions to achieve this. Also, does this transformation have a widely accepted name?
You can solve this in Thrust using thrust::unique_by_key_copy with thrust::counting_iterator. The idea is to treat your integer array as the keys argument to unique_by_key_copy and to use a sequence of ascending integers (i.e., counting_iterator) as the values. unique_by_key_copy will compact the values array into the indices of each unique key:
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/unique.h>
#include <thrust/copy.h>
#include <iterator>
#include <iostream>
int main()
{
    thrust::device_vector<int> keys(7);
    keys[0] = 0; keys[1] = 0; keys[2] = 0;
    keys[3] = 1; keys[4] = 1; keys[5] = 2; keys[6] = 2;

    std::cout << "keys before unique_by_key_copy: [ ";
    thrust::copy(keys.begin(), keys.end(), std::ostream_iterator<int>(std::cout," "));
    std::cout << "]" << std::endl;

    thrust::device_vector<int> offsets(3);
    thrust::unique_by_key_copy(keys.begin(), keys.end(), // keys
        thrust::make_counting_iterator(0),               // [0, 1, 2, 3, ...] are the values
        thrust::make_discard_iterator(),                 // discard the compacted keys
        offsets.begin());                                // the offsets are the values

    std::cout << "offsets after unique_by_key_copy: [ ";
    thrust::copy(offsets.begin(), offsets.end(), std::ostream_iterator<int>(std::cout," "));
    std::cout << "]" << std::endl;
    return 0;
}
Here's the output:
$ nvcc test.cu -run
keys before unique_by_key_copy: [ 0 0 0 1 1 2 2 ]
offsets after unique_by_key_copy: [ 0 3 5 ]
Although I've never used the thrust library, what about this possible approach (simple but maybe effective):
int input[N];  // your sorted array
int offset[N]; // the offset of the first occurrence of each value, initialized with -1

// each thread checks one index position (for id > 0; thread 0 just writes offset[input[0]] = 0)
if (input[id] > input[id-1]) // bingo! a new value begins here
{
    int oid = input[id]; // use the integer value as an index
    offset[oid] = id;    // mark the offset of the first occurrence of the new value
}
In your example the output will be:
[0,3,5]
But if the input array is:
[0,0,0,2,2,4,4]
Then the output will be:
[0,-1, 3, -1, 5]
Now, if thrust can do it for you, remove_if( offset[i] == -1 ) and compact the array.
This approach wastes a lot of memory for the offset array, but since you don't know in advance how many offsets you are going to find, the worst case uses as much memory as the input array.
On the other hand, the few instructions per thread compared to the global memory loads mean this implementation will be limited by memory bandwidth. There are optimizations for this case, such as processing several values per thread.
My 2 cents!
Scan is the algorithm you're looking for. If you don't have an implementation lying around, the Thrust library would be a good resource. (Look for thrust::scan)
Scan (or "parallel prefix sum") takes an input array and generates an output where each element is the sum of the inputs to that point: [1 5 3 7] => [1 6 9 16]
If you scan predicates (0 or 1, depending on an evaluated condition), where the predicate is 1 when a given element differs from the preceding element, then the scan computes the output index for each element. For your example array:
[0 0 0 1 1 2 2]
[0 0 0 1 0 1 0] <= predicates
[0 0 0 1 1 2 2] <= scanned predicates
Now you can use the scanned predicates as indices to write your output.
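On the CPU, the pipeline above looks like the sketch below (my own illustration); the GPU version would replace the serial loops with a parallel scan such as thrust::inclusive_scan, plus a scatter kernel.

```cpp
#include <vector>

// Offsets of each distinct value's first occurrence in a sorted
// array, via predicate + prefix sum + scatter (stream compaction).
std::vector<int> firstOffsets(const std::vector<int>& keys) {
    int n = static_cast<int>(keys.size());
    std::vector<int> pred(n, 0);
    for (int i = 1; i < n; ++i)        // 1 wherever a new run starts
        pred[i] = (keys[i] != keys[i - 1]) ? 1 : 0;
    std::vector<int> scanned(n, 0);    // inclusive scan of pred
    for (int i = 1; i < n; ++i)
        scanned[i] = scanned[i - 1] + pred[i];
    std::vector<int> out(n ? scanned[n - 1] + 1 : 0);
    for (int i = 0; i < n; ++i)        // scatter: scanned value = output slot
        if (i == 0 || pred[i]) out[scanned[i]] = i;
    return out;
}
```

Each of the three stages is embarrassingly parallel except the scan itself, which is exactly the primitive GPU libraries provide.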
Good question, and the answer depends on what you need to do with the result afterwards. Let me explain.
Since this problem can be solved in O(n) on the CPU (where n is the input length), you will suffer from the overhead of memory allocation and copying (Host -> Device for the input, Device -> Host for the result). This can lead to worse performance than a simple CPU solution.
Even if your array is already in device memory, each computation block needs to read it into local memory or registers (or at least access device memory), and that can't be done significantly faster than on the CPU.
In general, CUDA accelerates things well when:
the asymptotic complexity of the computation is high compared to the input data length, for example input length n with O(n^2) or O(n^3) complexity;
there is a way to split the task into independent or weakly dependent subtasks.
So if I were you, I would avoid doing this kind of computation on CUDA if possible. If it must be a standalone function or an output-format conversion for some other function, I would do it on the CPU.
If it's part of some more complex algorithm, the answer is more complicated. In your place I would try to change the [0,3,5] output format, because it limits how well you can exploit CUDA's computational power: you can't effectively split the task into independent blocks. For example, if one thread processes 10 integers and another thread processes the next 10, the second one doesn't know where to place its outputs until the first one has finished. Maybe I would split the array into subarrays and store the answer for each subarray separately. It depends heavily on what computations you are doing.

Interview Question: Find Median From Mega Number Of Integers

There is a file that contains 10G (1,000,000,000) integers. Please find the median of these integers; you are given 2G of memory to do this. Can anyone come up with a reasonable way? Thanks!
Create an array of 8-byte longs that has 2^16 entries. Take your input numbers, shift off the bottom sixteen bits, and create a histogram.
Now you count up in that histogram until you reach the bin that covers the midpoint of the values.
Pass through again, ignoring all numbers that don't have that same set of top bits, and make a histogram of the bottom bits.
Count up through that histogram until you reach the bin that covers the midpoint of the (entire list of) values.
Now you know the median, in O(n) time and O(1) space (in practice, under 1 MB).
Here's some sample Scala code that does this:
def medianFinder(numbers: Iterable[Int]) = {
  def midArgMid(a: Array[Long], mid: Long) = {
    val cuml = a.scanLeft(0L)(_ + _).drop(1)
    cuml.zipWithIndex.dropWhile(_._1 < mid).head
  }
  val topHistogram = new Array[Long](65536)
  var count = 0L
  numbers.foreach(number => {
    count += 1
    topHistogram(number>>>16) += 1
  })
  val (topCount,topIndex) = midArgMid(topHistogram, (count+1)/2)
  val botHistogram = new Array[Long](65536)
  numbers.foreach(number => {
    if ((number>>>16) == topIndex) botHistogram(number & 0xFFFF) += 1
  })
  val (botCount,botIndex) =
    midArgMid(botHistogram, (count+1)/2 - (topCount-topHistogram(topIndex)))
  (topIndex<<16) + botIndex
}
and here it is working on a small set of input data:
scala> medianFinder(List(1,123,12345,1234567,123456789))
res18: Int = 12345
If you have 64 bit integers stored, you can use the same strategy in 4 passes instead.
You can use the Medians of Medians algorithm.
If the file is in text format, you may be able to fit it in memory just by converting things to integers as you read them in, since an integer stored as characters may take more space than an integer stored as an integer, depending on the size of the integers and the type of text file. EDIT: You edited your original question; I can see now that you can't read them into memory, see below.
If you can't read them into memory, this is what I came up with:
Figure out how many integers you have. You may know this from the start. If not, then it only takes one pass through the file. Let's say this is S.
Use your 2G of memory to find the x largest integers (however many you can fit). You can do one pass through the file, keeping the x largest in a sorted list of some sort, discarding the rest as you go. Now you know the x-th largest integer. You can discard all of these except for the x-th largest, which I'll call x1.
Do another pass through, finding the next x largest integers less than x1, the least of which is x2.
I think you can see where I'm going with this. After a few passes, you will have reached the (S/2)-th largest integer (you'll have to keep track of how many integers you've found so far), which is your median. If S is even, you'll average the two in the middle.
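The multi-pass idea above can be sketched as follows (a toy version with the "file" as a vector and a deliberately tiny batch size x; in practice x would be as large as the 2G allows). As in the description, it assumes distinct values; handling ties between batches would need extra bookkeeping.

```cpp
#include <functional>
#include <queue>
#include <vector>
#include <limits>

// Find the k-th largest element (1-based) while holding at most x
// candidate values in memory, via repeated passes over the data.
int kthLargestLimitedMemory(const std::vector<int>& data, int k, int x) {
    long long ceiling = std::numeric_limits<long long>::max(); // keep values < ceiling
    int seen = 0;  // how many of the largest values previous passes covered
    while (true) {
        // min-heap holding the x largest values strictly below 'ceiling'
        std::priority_queue<int, std::vector<int>, std::greater<int>> heap;
        for (int v : data) {                         // one full pass over the "file"
            if ((long long)v >= ceiling) continue;
            if ((int)heap.size() < x) heap.push(v);
            else if (v > heap.top()) { heap.pop(); heap.push(v); }
        }
        if (seen + (int)heap.size() >= k) {          // k-th largest is in this batch
            int drop = (int)heap.size() - (k - seen);
            while (drop-- > 0) heap.pop();           // pop up to the wanted rank
            return heap.top();
        }
        seen += (int)heap.size();
        ceiling = heap.top();  // batch minimum becomes the next pass's bound
    }
}
```

For the median of S distinct values you'd call this with k = (S+1)/2, making roughly ceil(k/x) passes over the file.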
Make a pass through the file and find count of integers and minimum and maximum integer value.
Take midpoint of min and max, and get count, min and max for values either side of the midpoint - by again reading through the file.
partition count > count => median lies within that partition.
Repeat for the partition, taking into account size of 'partitions to the left' (easy to maintain), and also watching for min = max.
Am sure this'd work for an arbitrary number of partitions as well.
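This repeated value-range partitioning might look like the following Python sketch. The `read_all` callback standing in for a pass over the file is my own framing; the function returns the lower median for an even count.

```python
def median_by_value_partition(read_all):
    """Find the (lower) median by binary search on the value range.
    `read_all()` re-reads the data and yields the integers; each loop
    iteration is one extra pass through the data."""
    lo = min(read_all())
    hi = max(read_all())
    n = sum(1 for _ in read_all())
    k = (n + 1) // 2                   # rank of the (lower) median
    left = 0                           # count of values known to be < lo
    while lo < hi:
        mid = (lo + hi) // 2
        # one pass: count values falling in the lower half of the range
        below = sum(1 for v in read_all() if lo <= v <= mid)
        if left + below >= k:
            hi = mid                   # the median lies in [lo, mid]
        else:
            left += below
            lo = mid + 1               # the median lies in [mid+1, hi]
    return lo
```

The number of passes is bounded by the bit width of the values (about 32 for 32-bit integers), independent of n.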
Do an on-disk external mergesort on the file to sort the integers (counting them if that's not already known).
Once the file is sorted, seek to the middle number (odd case), or average the two middle numbers (even case) in the file to get the median.
The amount of memory used is adjustable and unaffected by the number of integers in the original file. One caveat of the external sort is that the intermediate sorting data needs to be written to disk.
Given n = number of integers in the original file:
Running time: O(n log n)
Memory: O(1), adjustable
Disk: O(n)
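A minimal Python sketch of this external-sort approach, assuming one integer per line; the chunking into temporary run files and the `heapq.merge` k-way merge are one possible realization, not a prescribed one.

```python
import heapq
import itertools
import os
import tempfile

def external_median(path, chunk_size=100_000):
    """Median of the integers (one per line) in `path` via an external
    merge sort: sort fixed-size chunks into temporary run files, then
    k-way merge the runs and stop at the middle of the merged stream."""
    runs, n = [], 0
    with open(path) as f:
        while True:
            chunk = [int(x) for x in itertools.islice(f, chunk_size)]
            if not chunk:
                break
            n += len(chunk)
            chunk.sort()                          # sort this chunk in memory
            run = tempfile.NamedTemporaryFile('w+', delete=False)
            run.writelines("%d\n" % v for v in chunk)
            run.seek(0)
            runs.append(run)
    merged = heapq.merge(*((int(line) for line in r) for r in runs))
    if n % 2:                                     # odd count: single middle value
        med = next(itertools.islice(merged, n // 2, None))
    else:                                         # even count: average the middles
        a, b = itertools.islice(merged, n // 2 - 1, n // 2 + 1)
        med = (a + b) / 2
    for r in runs:                                # clean up the run files
        r.close()
        os.unlink(r.name)
    return med
```

Memory is bounded by `chunk_size` plus one buffered line per run, matching the adjustable-memory claim above.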
Check out Torben's method here: http://ndevilla.free.fr/median/median/index.html. It also has an implementation in C at the bottom of the document.
My best guess is that a probabilistic median of medians would be the fastest one. Recipe:
Take the next set of N integers (N should be big enough, say 1000 or 10000 elements).
Calculate the median of these integers and assign it to the variable X_new.
If this is not the first iteration, calculate the median of the two medians:
X_global = (X_global + X_new) / 2
Once you see that X_global no longer fluctuates much, you have found an approximate median of the data.
But there are some notes:
The question arises whether the median error is acceptable or not.
The integers must be distributed randomly in a uniform way for the solution to work.
EDIT:
I've played a bit with this algorithm and changed the idea slightly: in each iteration we should blend in X_new with a decreasing weight, such as:
X_global = k*X_global + (1.-k)*X_new
where k is taken from [0.5 .. 1.] and increases in each iteration.
The point is to make the calculation of the median converge quickly to some number in a very small number of iterations. A very approximate median (with a big error) is found among 100000000 array elements in only 252 iterations. Check this C experiment:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define ARRAY_SIZE 100000000
#define RANGE_SIZE 1000

// probabilistic median of medians method
// should print 5000 as data average
// from ARRAY_SIZE of elements
int main (int argc, const char * argv[]) {
    int iter = 0;
    int X_global = 0;
    int X_new = 0;
    int i = 0;
    float dk = 0.002;
    float k = 0.5;
    srand(time(NULL));
    while (i < ARRAY_SIZE && k != 1.) {
        X_new = 0;
        for (int j = i; j < i + RANGE_SIZE; j++) {
            X_new += rand() % 10000 + 1;
        }
        X_new /= RANGE_SIZE;
        if (iter > 0) {
            k += dk;
            k = (k > 1.) ? 1. : k;
            X_global = k * X_global + (1. - k) * X_new;
        }
        else {
            X_global = X_new;
        }
        i += RANGE_SIZE + 1;
        iter++;
        printf("iter %d, median = %d \n", iter, X_global);
    }
    return 0;
}
Oops, it seems I'm talking about the mean, not the median. If that's the case, and you need exactly the median, not the mean, ignore my post. In any case, mean and median are closely related concepts.
Good luck.
Here is the algorithm described by @Rex Kerr implemented in Java.
/**
 * Computes the median.
 * @param arr Array of strings, each element represents a distinct binary number and has the same number of bits (padded with leading zeroes if necessary)
 * @return the median (number of rank ceil((m+1)/2)) of the array as a string
 */
static String computeMedian(String[] arr) {
    // rank of the median element
    int m = (int) Math.ceil((arr.length + 1) / 2.0);
    String bitMask = "";
    int zeroBin = 0;
    while (bitMask.length() < arr[0].length()) {
        // puts elements which conform to the bitMask into one of two buckets
        for (String curr : arr) {
            if (curr.startsWith(bitMask))
                if (curr.charAt(bitMask.length()) == '0')
                    zeroBin++;
        }
        // decides in which bucket the median is located
        if (zeroBin >= m)
            bitMask = bitMask.concat("0");
        else {
            m -= zeroBin;
            bitMask = bitMask.concat("1");
        }
        zeroBin = 0;
    }
    return bitMask;
}
Some test cases and updates to the algorithm can be found here.
I was also asked the same question and couldn't give an exact answer, so after the interview I went through some interview books. Here is what I found in the Cracking the Coding Interview book.
Example: Numbers are randomly generated and stored into an (expanding) array. How
would you keep track of the median?
Our data structure brainstorm might look like the following:
• Linked list? Probably not. Linked lists tend not to do very well with accessing and
sorting numbers.
• Array? Maybe, but you already have an array. Could you somehow keep the elements
sorted? That's probably expensive. Let's hold off on this and return to it if it's needed.
• Binary tree? This is possible, since binary trees do fairly well with ordering. In fact, if the binary search tree is perfectly balanced, the top might be the median. But, be careful—if there's an even number of elements, the median is actually the average
of the middle two elements. The middle two elements can't both be at the top. This is probably a workable algorithm, but let's come back to it.
• Heap? A heap is really good at basic ordering and keeping track of max and mins.
This is actually interesting—if you had two heaps, you could keep track of the bigger
half and the smaller half of the elements. The bigger half is kept in a min heap, such
that the smallest element in the bigger half is at the root. The smaller half is kept in a
max heap, such that the biggest element of the smaller half is at the root. Now, with
these data structures, you have the potential median elements at the roots. If the
heaps are no longer the same size, you can quickly "rebalance" the heaps by popping
an element off the one heap and pushing it onto the other.
Note that the more problems you do, the more developed your instinct on which data
structure to apply will be. You will also develop a more finely tuned instinct as to which of these approaches is the most useful.
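The two-heap idea above can be sketched in Python. The `RunningMedian` class name is mine; since `heapq` only provides a min-heap, the smaller half is stored negated to act as a max-heap.

```python
import heapq

class RunningMedian:
    """Two-heap running median: a max-heap (via negated values) holds the
    smaller half, a min-heap holds the bigger half, and the heaps are
    rebalanced so their sizes never differ by more than one."""
    def __init__(self):
        self.lo = []   # max-heap (negated): smaller half of the elements
        self.hi = []   # min-heap: bigger half of the elements
    def add(self, x):
        # push through the smaller half so both halves stay ordered
        heapq.heappush(self.lo, -x)
        heapq.heappush(self.hi, -heapq.heappop(self.lo))
        if len(self.hi) > len(self.lo) + 1:            # rebalance if needed
            heapq.heappush(self.lo, -heapq.heappop(self.hi))
    def median(self):
        if len(self.hi) > len(self.lo):
            return self.hi[0]          # odd count: root of the bigger heap
        return (self.hi[0] - self.lo[0]) / 2           # even: average the roots
```

Each `add` costs O(log n), and `median` is O(1), which is why this structure suits a stream of incoming numbers.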

Is it possible to rearrange an array in place in O(N)?

If I have a size N array of objects, and I have an array of unique numbers in the range 1...N, is there any algorithm to rearrange the object array in-place in the order specified by the list of numbers, and yet do this in O(N) time?
Context: I am doing a quick-sort-ish algorithm on objects that are fairly large in size, so it would be faster to do the swaps on indices than on the objects themselves, and only move the objects in one final pass. I'd just like to know if I could do this last pass without allocating memory for a separate array.
Edit: I am not asking how to do a sort in O(N) time, but rather how to do the post-sort rearranging in O(N) time with O(1) space. Sorry for not making this clear.
I think this should do:
static <T> void arrange(T[] data, int[] p) {
    boolean[] done = new boolean[p.length];
    for (int i = 0; i < p.length; i++) {
        if (!done[i]) {
            T t = data[i];
            for (int j = i;;) {
                done[j] = true;
                if (p[j] != i) {
                    data[j] = data[p[j]];
                    j = p[j];
                } else {
                    data[j] = t;
                    break;
                }
            }
        }
    }
}
Note: This is Java. If you do this in a language without garbage collection, be sure to delete done.
If you care about space, you can use a BitSet for done. I assume you can afford an additional bit per element because you seem willing to work with a permutation array, which is several times that size.
This algorithm copies instances of T n + k times, where k is the number of cycles in the permutation. You can reduce this to the optimal number of copies by skipping those i where p[i] = i.
The approach is to follow the "permutation cycles" of the permutation, rather than indexing the array left-to-right. But since you do have to begin somewhere, every time a new permutation cycle is needed, the search for unpermuted elements proceeds left-to-right:
// Pseudo-code
N : integer, N > 0 // N is the number of elements
swaps : integer [0..N]
data[N] : array of object
permute[N] : array of integer [-1..N] denoting permutation (used element is -1)
next_scan_start : integer;

next_scan_start = 0;
while (swaps < N)
{
    // Search for the next index that is not-yet-permuted.
    for (idx_cycle_search = next_scan_start;
         idx_cycle_search < N;
         ++idx_cycle_search)
        if (permute[idx_cycle_search] >= 0)
            break;
    next_scan_start = idx_cycle_search + 1;

    // This is a provable invariant. In short, the number of non-negative
    // elements in permute[] equals (N - swaps)
    assert( idx_cycle_search < N );

    // Completely permute one permutation cycle, 'following the
    // permutation cycle's trail'. This is O(N).
    while (permute[idx_cycle_search] >= 0)
    {
        swap( data[idx_cycle_search], data[permute[idx_cycle_search]] )
        swaps++;
        old_idx = idx_cycle_search;
        idx_cycle_search = permute[idx_cycle_search];
        permute[old_idx] = -1;
        // Also '= -idx_cycle_search - 1' could be used rather than '-1'
        // and would allow reversal of these changes to the permute[] array
    }
}
Do you mean that you have an array of objects O[1..N] and then you have an array P[1..N] that contains a permutation of numbers 1..N and in the end you want to get an array O1 of objects such that O1[k] = O[P[k]] for all k=1..N ?
As an example, if your objects are letters A,B,C...,Y,Z and your array P is [26,25,24,..,2,1] is your desired output Z,Y,...C,B,A ?
If yes, I believe you can do it in linear time using only O(1) additional memory. Reversing elements of an array is a special case of this scenario. In general, I think you would need to consider decomposition of your permutation P into cycles and then use it to move around the elements of your original array O[].
If that's what you are looking for, I can elaborate more.
EDIT: Others already presented excellent solutions while I was sleeping, so no need to repeat it here. ^_^
EDIT: My O(1) additional space is indeed not entirely correct. I was thinking only about "data" elements, but in fact you also need to store one bit per permutation element, so if we are precise, we need O(log n) extra bits for that. But most of the time using a sign bit (as suggested by J.F. Sebastian) is fine, so in practice we may not need anything more than we already have.
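The marking trick can be sketched in Python. Since 0-based indices leave no usable sign bit for index 0, this sketch marks visited entries by adding n instead of negating, and restores the permutation afterwards; it's an adaptation of the idea, not J.F. Sebastian's exact suggestion.

```python
def apply_permutation_inplace(data, p):
    """Apply permutation p to data in place, so that afterwards
    data[k] == old_data[p[k]] for all k. Visited entries of p are
    marked by adding n (a stand-in for the sign-bit trick that also
    works for index 0); p is restored before returning."""
    n = len(p)
    for i in range(n):
        if p[i] >= n:              # already placed by an earlier cycle
            continue
        t, j = data[i], i          # remember the value displaced at cycle start
        while p[j] != i:           # follow the cycle until it closes
            data[j] = data[p[j]]
            p[j], j = p[j] + n, p[j]
        data[j] = t                # drop the remembered value into the last slot
        p[j] += n
    for i in range(n):             # unmark: restore the permutation array
        p[i] -= n
```

Beyond `data` and `p` this uses O(1) extra storage, at the cost of temporarily mutating (and then restoring) the permutation array.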
If you didn't mind allocating memory for an extra hash of indexes, you could keep a mapping of original location to current location to get a time complexity of near O(n). Here's an example in Ruby, since it's readable and pseudocode-ish. (This could be shorter or more idiomatically Ruby-ish, but I've written it out for clarity.)
#!/usr/bin/ruby
objects = ['d', 'e', 'a', 'c', 'b']
order = [2, 4, 3, 0, 1]
cur_locations = {}

order.each_with_index do |orig_location, ordinality|
  # Find the current location of the item.
  cur_location = orig_location
  while not cur_locations[cur_location].nil? do
    cur_location = cur_locations[cur_location]
  end
  # Swap the items and keep track of whatever we swapped forward.
  objects[ordinality], objects[cur_location] = objects[cur_location], objects[ordinality]
  cur_locations[ordinality] = orig_location
end

puts objects.join(' ')
That obviously does involve some extra memory for the hash, but since it's just for indexes and not your "fairly large" objects, hopefully that's acceptable. Since hash lookups are O(1), even though there is a slight bump to the complexity due to the case where an item has been swapped forward more than once and you have to rewrite cur_location multiple times, the algorithm as a whole should be reasonably close to O(n).
If you wanted you could build a full hash of original to current positions ahead of time, or keep a reverse hash of current to original, and modify the algorithm a bit to get it down to strictly O(n). It'd be a little more complicated and take a little more space, so this is the version I wrote out, but the modifications shouldn't be difficult.
EDIT: Actually, I'm fairly certain the time complexity is just O(n), since each ordinality can have at most one hop associated, and thus the maximum number of lookups is limited to n.
#!/usr/bin/env python
def rearrange(objects, permutation):
    """Rearrange `objects` inplace according to `permutation`.

    ``result = [objects[p] for p in permutation]``
    """
    seen = [False] * len(permutation)
    for i, already_seen in enumerate(seen):
        if not already_seen: # start permutation cycle
            first_obj, j = objects[i], i
            while True:
                seen[j] = True
                p = permutation[j]
                if p == i: # end permutation cycle
                    objects[j] = first_obj # [old] p -> j
                    break
                objects[j], j = objects[p], p # p -> j
The algorithm (as I noticed after I wrote it) is the same as the one from @meriton's answer in Java.
Here's a test function for the code:
def test():
    import itertools
    N = 9
    for perm in itertools.permutations(range(N)):
        L = range(N)
        LL = L[:]
        rearrange(L, perm)
        assert L == [LL[i] for i in perm] == list(perm), (L, list(perm), LL)
    # test whether assertions are enabled
    try:
        assert 0
    except AssertionError:
        pass
    else:
        raise RuntimeError("assertions must be enabled for the test")

if __name__ == "__main__":
    test()
There's a histogram sort, though the running time is given as a bit higher than O(N): O(N log log N).
I can do it given O(N) scratch space -- copy to new array and copy back.
EDIT: I am aware of the existence of an algorithm that will do this. The idea is to perform the swaps on the array of integers 1..N while at the same time mirroring the swaps on your array of large objects. I just cannot find the algorithm right now.
The problem is one of applying a permutation in place with minimal O(1) extra storage: "in-situ permutation".
It is solvable, but an algorithm is not obvious beforehand.
It is described briefly as an exercise in Knuth, and for work I had to decipher it and figure out how it worked. Look at 5.2 #13.
For some more modern work on this problem, with pseudocode:
http://www.fernuni-hagen.de/imperia/md/content/fakultaetfuermathematikundinformatik/forschung/berichte/bericht_273.pdf
I ended up writing a different algorithm for this, which first generates a list of swaps to apply an order and then runs through the swaps to apply it. The advantage is that if you're applying the ordering to multiple lists, you can reuse the swap list, since the swap algorithm is extremely simple.
void make_swaps(vector<int> order, vector<pair<int,int>> &swaps)
{
    // order[0] is the index in the old list of the new list's first value.
    // Invert the mapping: inverse[0] is the index in the new list of the
    // old list's first value.
    vector<int> inverse(order.size());
    for(int i = 0; i < order.size(); ++i)
        inverse[order[i]] = i;

    swaps.resize(0);
    for(int idx1 = 0; idx1 < order.size(); ++idx1)
    {
        // Swap list[idx] with list[order[idx]], and record this swap.
        int idx2 = order[idx1];
        if(idx1 == idx2)
            continue;
        swaps.push_back(make_pair(idx1, idx2));

        // list[idx1] is now in the correct place, but whoever wanted the value we moved out
        // of idx2 now needs to look in its new position.
        int idx1_dep = inverse[idx1];
        order[idx1_dep] = idx2;
        inverse[idx2] = idx1_dep;
    }
}

template<typename T>
void run_swaps(T data, const vector<pair<int,int>> &swaps)
{
    for(const auto &s: swaps)
    {
        int src = s.first;
        int dst = s.second;
        swap(data[src], data[dst]);
    }
}

void test()
{
    vector<int> order = { 2, 3, 1, 4, 0 };
    vector<pair<int,int>> swaps;
    make_swaps(order, swaps);

    vector<string> data = { "a", "b", "c", "d", "e" };
    run_swaps(data, swaps);
}

Resources