Parallel computing of first indices in a sorted array - parallel-processing

I have a problem using CUDA to compute, for each distinct member of a sorted array, the index of its first occurrence. For example, given the sorted array [1,1,2,2,5,5,5], I need to return 0 (the first index of 1), 2 (the first index of 2), and 4 (the first index of 5). Is there a parallel method to solve this problem?

One possible method to perform this operation would be:
use an adjacent difference methodology (each parallel thread looks at its element and its neighbor) to identify the start of each sub-sequence. Elements which have no difference as compared to their neighbors are not the start of a sub-sequence. Elements which are different from their neighbors represent the start (or end, or start+end) of a sub-sequence.
Once the start of each sub-sequence is identified, use a stream compaction method to reduce the given sequence to just the sequence of elements that represent the start of each sub-sequence. Stream compaction can also be done in parallel, and a typical approach would involve use of a parallel prefix sum to identify destination addresses for each element in the compacted sequence.
The first part of the above algorithm would be fairly easy to write CUDA code directly for. The second part would be a little more involved because a parallel prefix sum is a little bit more complicated to write. Furthermore, for algorithms like parallel prefix sum, parallel reduction, sorting, etc. I would never recommend that someone write these from scratch. You should always use a library implementation if possible.
Therefore, the thrust library, built on top of CUDA, presents a set of routines allowing a straightforward approach to prototype such a solution:
$ cat t1200.cu
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/adjacent_difference.h>
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>
#include <iostream>
#include <iterator>

typedef int mytype;
using namespace thrust::placeholders;

int main(){
  mytype data[] = {1,1,2,2,5,5,5};
  int dsize = sizeof(data)/sizeof(data[0]);
  thrust::device_vector<mytype> d_data(data, data+dsize);
  // adjacent difference: d_diffs[0] = d_data[0], d_diffs[i] = d_data[i] - d_data[i-1]
  thrust::device_vector<mytype> d_diffs(dsize);
  thrust::adjacent_difference(d_data.begin(), d_data.end(), d_diffs.begin());
  // stream compaction: keep index i wherever the corresponding difference is positive
  thrust::device_vector<int> d_result(dsize);
  int rsize = thrust::copy_if(thrust::counting_iterator<int>(0), thrust::counting_iterator<int>(dsize), d_diffs.begin(), d_result.begin(), _1 > 0) - d_result.begin();
  thrust::copy_n(d_result.begin(), rsize, std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl;
  return 0;
}
$ nvcc -o t1200 t1200.cu
$ ./t1200
0,2,4,
$
There are various corner cases that might need to be handled depending on the exact composition of your input data. The above code is just a simple example to demonstrate a possible method. For instance, if the first element of your sorted sequence is zero or negative, the above code would need to be modified slightly. Since the first element of your input data is always the start of a sub-sequence, this can be handled trivially with one extra line of code that sets the first element of d_diffs to a positive value, immediately before the copy_if usage, as sketched below.
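A minimal way to cover that corner case, assuming the d_data/d_diffs vectors from the listing above, would be to add one line before the copy_if call:

  // mark the first element as a sub-sequence start unconditionally,
  // so inputs that begin with 0 or a negative value are handled too
  d_diffs[0] = 1;   // any positive value works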

Related

Performance of counting sort

AFAIK counting sort uses the following algorithm:
// A: input array
// B: output array
// C: counting array
sort(A,B,n,k)
1. for(i:k) C[i]=0;
2. for(i:n) ++C[A[i]];
3. for(i:k) C[i]+=C[i-1];
4. for(i:n-1..0) { B[C[A[i]]-1]=A[i]; --C[A[i]]; }
What if I remove steps 3 and 4 and do the following instead?
3. t=0; for(i:k) while(C[i]) { --C[i]; B[t++]=i; }
Full code here; it looks fine, but I don't know which one has better performance.
Questions:
I guess the complexity of these two versions would be the same; is that true?
In steps 3 and 4 the first version needs to iterate n+k times, while the second one only needs to iterate n times. So does the second one have better performance?
Your code seems to be correct, and it will work in the case of sorting numbers. But suppose you had an array of structures that you were sorting according to their keys. Your method will not work in that case, because it simply counts the frequency of a number and, while that count remains positive, assigns the number to increasing indices in the output array. The classical method, however, will work for arrays of structures, objects, etc., because it calculates the position that each element should go to and then copies data from the initial array to the output array.
To answer your questions:
1. Yes, the runtime complexity of your code will be the same: for an array of size n and range 0...k, the outer loop runs k times and the inner loop runs f(0)+f(1)+...+f(k) = n times in total, where f denotes the frequency of each value, so the runtime is O(n+k), just like the classical version.
2. In terms of asymptotic complexity, both methods have the same performance. Due to the extra pass, the constants of the classical method may be higher, but that also makes it a stable sort and gives it the benefits I pointed out earlier.
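For plain integer keys, the two variants might look like this in C++ (a minimal sketch written for illustration; the function names are mine, not from the linked full code):

#include <cstddef>
#include <vector>

// classical counting sort (steps 1-4): stable, and it also works when sorting
// records by key, because it computes each element's destination index.
// B must already be sized to A.size(); keys are assumed to lie in [0, k].
void counting_sort_classical(const std::vector<int>& A, std::vector<int>& B, int k) {
    std::vector<int> C(k + 1, 0);                     // step 1: clear counts
    for (int x : A) ++C[x];                           // step 2: histogram
    for (int i = 1; i <= k; ++i) C[i] += C[i - 1];    // step 3: prefix sums
    for (std::size_t i = A.size(); i-- > 0; )         // step 4: place from the back
        B[--C[A[i]]] = A[i];
}

// simplified variant: steps 3-4 replaced by replaying each value C[i] times.
// Only valid for bare numbers, not for structures sorted by a key.
void counting_sort_simplified(const std::vector<int>& A, std::vector<int>& B, int k) {
    std::vector<int> C(k + 1, 0);
    for (int x : A) ++C[x];
    std::size_t t = 0;
    for (int i = 0; i <= k; ++i)
        while (C[i]--) B[t++] = i;
}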

Parallel Subset

The setup: I have two arrays which are not sorted and are not of the same length. I want to see if one of the arrays is a subset of the other. Each array is a set in the sense that there are no duplicates.
Right now I am doing this sequentially in a brute-force manner, so it isn't very fast. I have been having trouble finding any algorithms online that A) go faster and B) are parallel. Say the maximum size of either array is N; right now it scales something like N^2. I was thinking that maybe if I sorted them and did something clever I could bring it down to something like N log(N), but I'm not sure.
The main thing is that I have no idea how to parallelize this operation at all. I could have each processor look at an equal share of the first array and compare those entries to all of the second array, but I'd still be doing N^2 work. I guess it would still be better, though, since it would run in parallel.
Any ideas on how to improve the work and make it parallel at the same time?
Thanks
Suppose you are trying to decide if A is a subset of B, and let len(A) = m and len(B) = n.
If m is a lot smaller than n, then it makes sense to me that you sort A, and then iterate through B doing a binary search for each element on A to see if there is a match or not. You can partition B into k parts and have a separate thread iterate through every part doing the binary search.
To count the matches you can do two things. Either you have a num_matched variable that is incremented every time you find a match (you would need to guard this variable with a mutex, though, which might hinder your program's concurrency) and then check whether num_matched == m at the end of the program. Or you have another array or bit vector of size m, and a thread updates the k'th entry if it found a match for the k'th element of A; at the end, you make sure this array is all 1's. (On second thought, a bit vector might not work out without a mutex, because threads might overwrite each other's updates when they load the integer containing the bit relevant to them.) The array approach, at least, would not need any mutex that could hinder concurrency.
Sorting would cost you m log(m), and then, if you only had a single thread doing the matching, the matching would cost you n log(m). So if n is a lot bigger than m, this is effectively n log(m). Your worst case still remains N log(N), but I think concurrency would really help you a lot here to make this fast.
Summary: Just sort the smaller array.
Alternatively, if you are willing to consider converting A into a HashSet (or any equivalent set data structure that uses some sort of hashing plus probing/chaining to give O(1) lookups), then you can do a single membership check in O(1) amortized time, so you can do the whole check in O(n) plus the cost of converting A into a set.
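Here is a sequential C++ sketch of the sort-the-smaller-array idea (the names are mine; the loop over B is the part that could be partitioned across threads, with the per-element flag array avoiding any mutex):

#include <algorithm>
#include <vector>

// returns true if every element of a appears in b (both are duplicate-free sets)
bool is_subset(std::vector<int> a, const std::vector<int>& b) {
    std::sort(a.begin(), a.end());                          // m log(m)
    std::vector<char> matched(a.size(), 0);                 // one flag per element of a
    for (int x : b) {                                       // this loop can be split across threads
        auto it = std::lower_bound(a.begin(), a.end(), x);  // binary search: log(m)
        if (it != a.end() && *it == x)
            matched[it - a.begin()] = 1;                    // distinct flags, so no mutex needed
    }
    for (char f : matched)
        if (!f) return false;                               // some element of a never appeared in b
    return true;
}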

A quick hack to sorting: am I doing this right?

I was looking into different sorting algorithms, and trying to think how to port them to GPUs when I got this idea of sorting without actually sorting. This is how my kernel looks:
__global__ void noSort(int *inarr, char *outarr, int size)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < size)
        outarr[inarr[idx]] = 1;
}
Then on the host side, I just print the array indices where outarr[i] == 1. Effectively, the above could be used to sort an integer list, and it might even be faster than algorithms which actually sort.
Is this legit?
Your example is essentially a specialized counting sort for inputs with unique keys (i.e. no duplicates). To make the code a proper counting sort you could replace the assignment outarr[inarr[idx]] = 1 with atomicAdd(outarr + inarr[idx], 1) so duplicate keys are counted (with outarr changed to an int array, since atomicAdd does not operate on char). However, aside from the fact that atomic operations are fairly expensive, you still have the problem that the complexity of the method is proportional to the largest value in the input. Fortunately, radix sort solves both of these problems.
Radix sort can be thought of as a generalization of counting sort that looks at only B bits of the input at a time. Since integers of B bits can only take on values in the range [0,2^B) we can avoid looking at the full range of values.
Now, before you go and implement radix sort on CUDA I should warn you that it has been studied extensively and extremely fast implementations are readily available. In fact, the Thrust library will automatically apply radix sort whenever possible.
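For example, a minimal sketch of simply handing the data to Thrust (for primitive key types like int with the default comparator, thrust::sort dispatches to a radix sort under the hood):

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <iostream>
#include <iterator>

int main() {
    int data[] = {7, 3, 3, 0, 42, 5};
    thrust::device_vector<int> d(data, data + 6);
    thrust::sort(d.begin(), d.end());   // radix sort is selected for primitive key types
    thrust::copy(d.begin(), d.end(), std::ostream_iterator<int>(std::cout, " "));
    std::cout << std::endl;             // prints: 0 3 3 5 7 42
    return 0;
}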
I see what you're doing here, but I think it's only useful in special cases. For example, what if an element of inarr had an extremely large value? That would require outarr to have at least as many elements in order to handle it. What about duplicate numbers?
Supposing you started with an array with unique, small values within it, this is an interesting way of sorting. In general though, it seems to me that it will use enormous amounts of memory to do something that is already well-handled with algorithms such as parallel merge sort. Reading the output array would also be a very expensive process (especially if there are any large values in the input array), as you will essentially end up with a very sparse array.

Best data structure to store lots of one-bit data

I want to store lots of data so that:
it can be accessed by an index,
each item is just yes or no (so probably one bit is enough for each).
I am looking for the data structure with the highest performance that occupies the least space.
Storing the data in flat memory, one bit per item, is probably not a good choice; on the other hand, the various tree structures still use lots of memory (e.g. the pointers in each node are needed to build the tree even though each node holds just one bit of data).
Does anyone have any ideas?
What's wrong with using a single block of memory and either storing 1 bit per byte (easy indexing, but wastes 7 bits per byte) or packing the data (slightly trickier indexing, but more memory efficient) ?
Well in Java the BitSet might be a good choice http://download.oracle.com/javase/6/docs/api/java/util/BitSet.html
If I understand your question correctly you should store them in an unsigned integer where you assign each value to a bit of the integer (flag).
Say you represent 3 values and they can be on or off. Then you assign the first to 1, the second to 2 and the third to 4. Your unsigned int can then be 0,1,2,3,4,5,6 or 7 depending on which values are on or off and you check the values using bitwise comparison.
Depends on the language and how you define 'index'. If you mean that the index operator must work, then your language will need to be able to overload the index operator. If you don't mind using an index macro or function, you can access the nth element by dividing the given index by the number of bits in your type (say 8 for char, 32 for uint32_t and variants), then returning arr[n / n_bits] & (1 << (n % n_bits)), as sketched below.
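A small C++ sketch of that packed-bit indexing (the BitArray name and interface are mine, purely for illustration):

#include <cstddef>
#include <cstdint>
#include <vector>

// packs one yes/no flag per bit into an array of 32-bit words
class BitArray {
    static const std::size_t n_bits = 32;
    std::vector<std::uint32_t> words;
public:
    explicit BitArray(std::size_t n) : words((n + n_bits - 1) / n_bits, 0) {}
    void set(std::size_t n, bool v) {
        std::uint32_t mask = std::uint32_t(1) << (n % n_bits);
        if (v) words[n / n_bits] |= mask;
        else   words[n / n_bits] &= ~mask;
    }
    bool get(std::size_t n) const {
        return (words[n / n_bits] >> (n % n_bits)) & 1u;
    }
};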
Have a look at a Bloom Filter: http://en.wikipedia.org/wiki/Bloom_filter
It performs very well and is space-efficient. But make sure you read the fine print below ;-): Quote from the above wiki page.
An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps or hashes some set element to one of the m array positions with a uniform random distribution. To add an element, feed it to each of the k hash functions to get k array positions. Set the bits at all these positions to 1. To query for an element (test whether it is in the set), feed it to each of the k hash functions to get k array positions. If any of the bits at these positions are 0, the element is not in the set – if it were, then all the bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set, or the bits have been set to 1 during the insertion of other elements.
The requirement of designing k different independent hash functions can be prohibitive for large k. For a good hash function with a wide output, there should be little if any correlation between different bit-fields of such a hash, so this type of hash can be used to generate multiple "different" hash functions by slicing its output into multiple bit fields. Alternatively, one can pass k different initial values (such as 0, 1, ..., k − 1) to a hash function that takes an initial value; or add (or append) these values to the key. For larger m and/or k, independence among the hash functions can be relaxed with negligible increase in false positive rate (Dillinger & Manolios (2004a), Kirsch & Mitzenmacher (2006)). Specifically, Dillinger & Manolios (2004b) show the effectiveness of using enhanced double hashing or triple hashing, variants of double hashing, to derive the k indices using simple arithmetic on two or three indices computed with independent hash functions.
Removing an element from this simple Bloom filter is impossible. The element maps to k bits, and although setting any one of these k bits to zero suffices to remove it, this has the side effect of removing any other elements that map onto that bit, and we have no way of determining whether any such elements have been added. Such removal would introduce a possibility for false negatives, which are not allowed.
One-time removal of an element from a Bloom filter can be simulated by having a second Bloom filter that contains items that have been removed. However, false positives in the second filter become false negatives in the composite filter, which are not permitted. In this approach re-adding a previously removed item is not possible, as one would have to remove it from the "removed" filter. However, it is often the case that all the keys are available but are expensive to enumerate (for example, requiring many disk reads). When the false positive rate gets too high, the filter can be regenerated; this should be a relatively rare event.
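To make the add/query mechanics above concrete, here is a minimal C++ sketch of a Bloom filter (my own illustration, deriving the k indices from two base hashes in the spirit of the double-hashing variant mentioned in the quote):

#include <cstddef>
#include <functional>
#include <string>
#include <vector>

class BloomFilter {
    std::vector<bool> bits;   // the m-bit array
    std::size_t k;            // number of derived hash functions
    // i-th index, derived from two base hashes (double hashing)
    std::size_t index(const std::string& key, std::size_t i) const {
        std::size_t h1 = std::hash<std::string>{}(key);
        std::size_t h2 = std::hash<std::string>{}(key + "#");  // crude second hash for the sketch
        return (h1 + i * h2) % bits.size();
    }
public:
    BloomFilter(std::size_t m, std::size_t k) : bits(m, false), k(k) {}
    void add(const std::string& key) {
        for (std::size_t i = 0; i < k; ++i) bits[index(key, i)] = true;
    }
    // false: definitely not in the set; true: in the set, or a false positive
    bool possibly_contains(const std::string& key) const {
        for (std::size_t i = 0; i < k; ++i)
            if (!bits[index(key, i)]) return false;
        return true;
    }
};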

Remembering the "original" index of elements after sorting

Say I employ merge sort to sort an array of integers. I also need to remember the positions the elements initially had in the unsorted array. What would be the best way to do this?
A very naive and space-consuming way to do this (in C) would be to maintain each number as a structure with another number storing its original index:
struct integer {
    int value;
    int orig_pos;
};
But, obviously there are better ways. Please share your thoughts and solution if you have already tackled such problems. Let me know if you would need more context. Thank you.
Clearly for an N-long array you do need to store SOMEwhere N integers -- the original position of each item, for example; any other way to encode "1 out of N!" possibilities (i.e., what permutation has in fact occurred) will also take at least O(N) space (since, by Stirling's approximation, log(N!) is about N log(N)...).
So, I don't see why you consider it "space consuming" to store those indices most simply and directly. Of course there are other possibilities (taking similar space): for example, you might make a separate auxiliary array of the N indices and sort THAT auxiliary array (based on the value at that index) leaving the original one alone. This means an extra level of indirectness for accessing the data in sorted order, but can save you a lot of data movement if you're sorting an array of large structures, so there's a performance tradeoff... but the space consumption is basically the same!-)
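A small C++ sketch of that auxiliary-index-array approach (my own illustration, not code from the answer): sort an array of indices by the values they point to, and leave the data itself alone.

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// returns idx such that data[idx[0]] <= data[idx[1]] <= ..., leaving data untouched;
// idx[i] is the original position of the i-th smallest element
std::vector<std::size_t> sort_indices(const std::vector<int>& data) {
    std::vector<std::size_t> idx(data.size());
    std::iota(idx.begin(), idx.end(), 0);          // 0, 1, 2, ...
    std::stable_sort(idx.begin(), idx.end(),       // stable, like merge sort
                     [&data](std::size_t a, std::size_t b) { return data[a] < data[b]; });
    return idx;
}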
Is the struct such a bad idea? The alternative, to me, would be an array of pointers.
It feels to me that in this question you have to consider the age-old question: speed vs. size. In either case, you are keeping both a new representation of your data (the sorted array) and an old representation of your data (the way the array used to look), so inherently your solution will have some data replication. If you are sorting n numbers and you need to remember where those n numbers were before they were sorted, you will have to store n pieces of information somewhere; there is no getting around that.
As long as you accept that you are doubling the amount of space you need in order to keep this old data, you should consider the specific application and decide what will be faster. One option is to just make a copy of the array before you sort it, but resolving which element was where might later turn into an O(N) problem. From that point of view your suggestion of adding another int to your struct doesn't seem like such a bad idea, if it fits with the way you will be using the data later.
This looks like the case where I use an index sort. The following C# example shows how to do it with a lambda expression. I am new at using lambdas, but they can do some complex tasks very easily.
// first, some data to work with
List<double> anylist = new List<double>();
anylist.Add(2.18); // add a value
... // add many more values
// index sort
IEnumerable<int> serial = Enumerable.Range(0, anylist.Count);
int[] index = serial.OrderBy(item => (anylist[item])).ToArray();
// how to use
double FirstValue = anylist[index[0]];
double SecondValue = anylist[index[1]];
And, of course, anylist is still in the original order.
You can do it the way you proposed.
You can also keep a copy of the original unsorted array (which means you may use a sorting algorithm that is not in-place).
You can create an additional array containing only the original indices.
All three ways are equally space-consuming; there is no "better" way. You may use short instead of int to save space if your array won't exceed 65k elements (but be aware of structure padding with your suggestion).
