How would you implement this function in CUDA? (offsets in sorted integer vector) - algorithm

I have a sorted integer array on the device, e.g.:
[0,0,0,1,1,2,2]
And I want the offsets to each element in another array:
[0,3,5]
(since the first 0 is at position 0, the first 1 at position 3 and so on)
I know how many different elements there will be beforehand. How would you implement this efficiently in CUDA? I'm not asking for code, but a high level description of the algorithm you would implement to compute this transformation. I already had a look at the various functions in the thrust namespace, but could not think of any combination of thrust functions to achieve this. Also, does this transformation have a widely accepted name?

You can solve this in Thrust using thrust::unique_by_key_copy with thrust::counting_iterator. The idea is to treat your integer array as the keys argument to unique_by_key_copy and to use a sequence of ascending integers (i.e., counting_iterator) as the values. unique_by_key_copy will compact the values array into the indices of each unique key:
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/unique.h>
#include <thrust/copy.h>
#include <iterator>
#include <iostream>
int main()
{
    thrust::device_vector<int> keys(7);
    keys[0] = 0; keys[1] = 0; keys[2] = 0;
    keys[3] = 1; keys[4] = 1; keys[5] = 2; keys[6] = 2;

    std::cout << "keys before unique_by_key_copy: [ ";
    thrust::copy(keys.begin(), keys.end(), std::ostream_iterator<int>(std::cout," "));
    std::cout << "]" << std::endl;

    thrust::device_vector<int> offsets(3);

    thrust::unique_by_key_copy(keys.begin(), keys.end(),          // keys
                               thrust::make_counting_iterator(0), // [0, 1, 2, 3, ...] are the values
                               thrust::make_discard_iterator(),   // discard the compacted keys
                               offsets.begin());                  // the offsets are the values

    std::cout << "offsets after unique_by_key_copy: [ ";
    thrust::copy(offsets.begin(), offsets.end(), std::ostream_iterator<int>(std::cout," "));
    std::cout << "]" << std::endl;

    return 0;
}
Here's the output:
$ nvcc test.cu -run
keys before unique_by_key_copy: [ 0 0 0 1 1 2 2 ]
offsets after unique_by_key_copy: [ 0 3 5 ]

Although I've never used the thrust library, what about this possible approach (simple but maybe effective):
int input[N];  // your sorted array
int offset[N]; // the offsets of the first value of each element, initialized with -1
// each thread checks one index position
if (id == 0 || input[id] > input[id-1]) // bingo! a new value begins here
{
    int oid = input[id]; // use the integer value as the index
    offset[oid] = id;    // mark the offset with the beginning of the new value
}
In your example the output will be:
[0,3,5]
But if the input array is:
[0,0,0,2,2,4,4]
Then the output will be:
[0,-1, 3, -1, 5]
Now, if thrust can do it for you, remove_if( offset[i] == -1 ) and compact the array.
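If it helps, here is a minimal sketch of that kernel plus the compaction step; markOffsets, maxValue and the exact thrust calls are illustrative, untested code, not something from the original post:
#include <thrust/device_ptr.h>
#include <thrust/remove.h>

__global__ void markOffsets(const int *input, int *offset, int n) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= n) return;
    if (id == 0 || input[id] > input[id - 1]) // a new value begins here
        offset[input[id]] = id;               // use the value itself as the index
}

// host side, after filling offset (size maxValue + 1) with -1 and launching markOffsets:
// thrust::device_ptr<int> off(offset);
// thrust::remove(off, off + maxValue + 1, -1); // compact away the -1 entries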
This approach will waste a lot of memory for the offset array, but since you don't know how many offsets you are going to find, in the worst case it will use as much memory as the input array.
On the other hand, the few instructions per thread compared to the global memory loads mean this implementation will be limited by memory bandwidth. There are some optimizations for this case, such as processing several values per thread.
My 2 cents!

Scan is the algorithm you're looking for. If you don't have an implementation lying around, the Thrust library would be a good resource. (Look for thrust::scan)
Scan (or "parallel prefix sum") takes an input array and generates an output where each element is the sum of the inputs to that point: [1 5 3 7] => [1 6 9 16]
If you scan predicates (0 or 1, depending on an evaluated condition), where the predicate checks whether a given element differs from the preceding element, then you compute the output index of the element in question. Your example array:
[0 0 0 1 1 2 2]
[0 0 0 1 0 1 0] <= predicates
[0 0 0 1 1 2 2] <= scanned predicates
Now you can use the scanned predicates as indices to write your output.
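If you want to stay in Thrust, one way to wire this together is flags + exclusive scan + conditional scatter. This is a hedged sketch assuming the number of distinct values (num_unique) is known, as stated in the question; offsets_by_scan is an illustrative name:
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/scan.h>
#include <thrust/scatter.h>
#include <thrust/transform.h>

void offsets_by_scan(const thrust::device_vector<int> &keys,
                     thrust::device_vector<int> &offsets, // sized num_unique
                     int num_unique)
{
    int n = (int)keys.size();
    thrust::device_vector<int> flags(n, 0);
    flags[0] = 1; // the first element always starts a run
    // flags[i] = 1 where keys[i] differs from keys[i-1]
    thrust::transform(keys.begin() + 1, keys.end(), keys.begin(),
                      flags.begin() + 1, thrust::not_equal_to<int>());
    // exclusive scan of the flags gives each run head its output position
    thrust::device_vector<int> pos(n);
    thrust::exclusive_scan(flags.begin(), flags.end(), pos.begin());
    // scatter the element index i to offsets[pos[i]] wherever flags[i] == 1
    thrust::scatter_if(thrust::make_counting_iterator(0),
                       thrust::make_counting_iterator(n),
                       pos.begin(), flags.begin(), offsets.begin());
}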

Good question and the answer depends on what you need to do with it after. Let me explain.
Since this problem can be solved in O(n) (where n is the input length) on the CPU, you will suffer from memory allocation and copying (Host -> Device for the input and Device -> Host for the result) overheads. This will lead to performance degradation compared to a simple CPU solution.
Even if your array is already in device memory, each computation block needs to read it into local memory or registers (or at least access device memory), and this can't be done significantly faster than on the CPU.
In general, CUDA accelerates performance well if:
The asymptotic complexity of the computation is high compared to the input data length. For example, the input data length is n and the complexity is O(n^2) or O(n^3).
There is a way to split the task into independent or weakly dependent subtasks.
So if I were you, I would not try to do computations of this kind in CUDA if you can avoid it. If it has to be a standalone function or an output format conversion for some other function, I would do it on the CPU.
If it's part of some more complex algorithm, the answer is more complicated. If I were in your place I would try to somehow change the [0,3,5] format, because it limits how well you can utilize CUDA's computation power. You can't effectively split your task into independent blocks. Just as an example: suppose I process 10 integers in one computation thread and the next 10 integers in another. The second one doesn't know where to place its outputs until the first one has finished. Maybe I would split the array into subarrays and store the answer for each subarray separately. It highly depends on what computations you are doing.

Related

Fill device array consecutively in CUDA

(This might be more of a theoretical parallel optimization problem than a CUDA-specific problem per se. I'm very new to parallel programming in general, so this may just be personal ignorance.)
I have a workload that consists of 64-bit binary numbers upon which I run analysis. If the analysis completes successfully then that binary number is a "valid solution". If the analysis breaks midway then the number is "invalid". The end goal is to get a list of all the valid solutions.
Now there are many trillions of 64-bit binary numbers I am analyzing, but only ~5% or less will be valid solutions, and they usually come in bunches (i.e. every consecutive 1000 numbers are valid and then every random billion or so are invalid). I can't find a pattern to the space between bunches, so I can't ignore the large chunks of invalid solutions.
Currently, every thread in a kernel call analyzes just one number. If the number is valid it denotes it as such in its respective place in a device array. Ditto if it's invalid. So basically I generate a data point for every value analyzed regardless of whether it's valid or not. Then once the array is full I copy it to the host, but only if a valid solution was found (denoted by a flag on the device). With this, overall throughput is greatest when the array is the same size as the # of threads in the grid.
But copying memory to & from the GPU is expensive time-wise. That said, what I would like to do is copy data over only when necessary; I want to fill up a device array with only valid solutions and then, once the array is full, copy it over to the host. But how do you consecutively fill an array in a parallel environment? Or am I approaching this problem the wrong way?
EDIT 1
This is the kernel I initially developed. As you can see, I am generating 1 byte of data for each value analyzed. Now I really only need each 64-bit number which is valid; if need be I can make a new kernel. As suggested by some of the commenters, I am currently looking into stream compaction.
__global__ void kValid(unsigned long long *kInfo, unsigned char *values, char *solutionFound) {
    //a 64 bit binary value to be evaluated is called a kValue
    unsigned long long int kStart, kEnd, kRoot, kSize, curK;
    //kRoot is the kValue at the start of the device array; it is used if the device array is larger than the total threads in the grid
    //kStart is the kValue to start this kernel call on
    //kEnd is the last kValue to validate
    //kSize is how many bits long a kValue is (we don't necessarily use all 64 bits, but this value stays constant over the entire chunk of values defined on the host)
    //curK is the current kValue represented as a 64 bit unsigned integer
    int rowCount, kBitLocation, kMirrorBitLocation, row, col, nodes, edges;

    kStart = kInfo[0];
    kEnd = kInfo[1];
    kRoot = kInfo[2];
    nodes = kInfo[3];
    edges = kInfo[4];
    kSize = kInfo[5];

    curK = blockIdx.x*blockDim.x + threadIdx.x + kStart;
    if (curK > kEnd) {//check to make sure you don't overshoot the end value
        return;
    }

    kBitLocation = 1;//assuming the first bit in the kvalue has position 1
    for (row = 0; row < nodes; row++) {
        rowCount = 0;
        kMirrorBitLocation = row;//the bit position for the mirrored kvals always starts at the row value (assuming the first row has a position of 0)
        for (col = 0; col < nodes; col++) {
            if (col > row) {
                if (curK & (1ULL << (kSize - kBitLocation))) {//1ULL so the shift is done in 64 bits; add one to kIterator to convert to counting space
                    rowCount++;
                }
                kBitLocation++;
            }
            if (col < row) {
                if (col > 0) {
                    kMirrorBitLocation += (nodes - 2) - (col - 1);
                }
                if (curK & (1ULL << (kSize - kMirrorBitLocation))) {//if bit is set
                    rowCount++;
                }
            }
        }
        if (rowCount != edges) {
            //set the ith bit to zero
            values[curK - kRoot] = 0;
            return;
        }
    }
    //set the ith bit to one
    values[curK - kRoot] = 1;
    *solutionFound = 1; //not a race condition b/c it will only ever be set to 1 by any thread
}
(This answer assumes output order is inconsequential and so are the positions of the valid values.)
Conceptually, your analysis produces a set of valid values. The implementation you described uses a dense representation of this set: One bit for every potential value. Yet you've indicated that the data is quite sparse (either 5e-2 or 1000/10^9 = 1e-6); moreover, copying data across PCI express is quite a pain.
Well, then, why not consider a sparse representation? The simplest one would be merely an unordered sequence of the valid values. Of course, writing that requires some synchronization across threads - perhaps even across blocks. Roughly, you can have warps collect their valid values in shared memory; then synchronize at the block level to collect the block's valid values (for a given chunk of the input it has analyzed); and finally use atomics to collect the data from all the blocks.
Oh, also - have each thread analyze multiple values, so you don't have to do that much synchronization.
So you would want to have each thread analyze multiple numbers (thousands or millions) before you return from the computation. If you analyze a million numbers in a thread, you will only need 5% of that amount of space to possibly hold the results of that computation.
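As a rough sketch of that idea (a single global atomic counter, without the warp/block aggregation described above; isValid and all names here are placeholders for the analysis in the question, not working code for it):
__device__ bool isValid(unsigned long long k) {
    // placeholder for the real analysis of a kValue
    return (k % 1000ULL) == 0ULL;
}

__global__ void collectValid(unsigned long long kStart,
                             unsigned long long valuesPerThread,
                             unsigned long long *out,   // holds only valid values
                             unsigned int *outCount) {  // device counter, initialized to 0
    unsigned long long base = kStart +
        (blockIdx.x * (unsigned long long)blockDim.x + threadIdx.x) * valuesPerThread;
    for (unsigned long long i = 0; i < valuesPerThread; ++i) {
        unsigned long long k = base + i;
        if (isValid(k)) {
            unsigned int slot = atomicAdd(outCount, 1u); // reserve a slot in the sparse output
            out[slot] = k;
        }
    }
}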

Cache friendly offline random read

Consider this function in C++:
void foo(uint32_t *a1, uint32_t *a2, uint32_t *b1, uint32_t *b2, uint32_t *o) {
    while (b1 != b2) {
        // assert(0 <= *b1 && *b1 < a2 - a1)
        *o++ = a1[*b1++];
    }
}
Its purpose should be clear enough. Unfortunately, b1 contains random data and trashes the cache, making foo the bottleneck of my program. Is there any way I can optimize it?
This is an SSCCE that should resemble my actual code:
#include <cstdint>
#include <iostream>
#include <chrono>
#include <algorithm>
#include <numeric>

namespace {
    void foo(uint32_t *a1, uint32_t *a2, uint32_t *b1, uint32_t *b2, uint32_t *o) {
        while (b1 != b2) {
            // assert(0 <= *b1 && *b1 < a2 - a1)
            *o++ = a1[*b1++];
        }
    }
    constexpr unsigned max_n = 1 << 24, max_q = 1 << 24;
    uint32_t data[max_n], index[max_q], result[max_q];
}

int main() {
    uint32_t seed = 0;
    auto rng = [&seed]() { return seed = seed * 9301 + 49297; };
    std::generate_n(data, max_n, rng);
    std::generate_n(index, max_q, [rng]() { return rng() % max_n; });

    auto t1 = std::chrono::high_resolution_clock::now();
    foo(data, data + max_n, index, index + max_q, result);
    auto t2 = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration<double>(t2 - t1).count() << std::endl;

    uint32_t hash = 0;
    for (unsigned i = 0; i < max_q; i++)
        hash += result[i] ^ (i << 8) ^ i;
    std::cout << hash << std::endl;
}
This is not Cache-friendly copying of an array with readjustment by known index, gather, scatter, which asks about random writes and assumes b is a permutation.
First, let's take a look at the actual performance of the code above:
$ sudo perf stat ./offline-read
0.123023
1451229184
Performance counter stats for './offline-read':
184.661547 task-clock (msec) # 0.997 CPUs utilized
3 context-switches # 0.016 K/sec
0 cpu-migrations # 0.000 K/sec
717 page-faults # 0.004 M/sec
623,638,834 cycles # 3.377 GHz
419,309,952 instructions # 0.67 insn per cycle
70,803,672 branches # 383.424 M/sec
16,895 branch-misses # 0.02% of all branches
0.185129552 seconds time elapsed
We are getting a low IPC of 0.67, probably caused almost entirely by load-misses to DRAM5. Let's confirm:
sudo ../pmu-tools/ocperf.py stat -e cycles,LLC-load-misses,cycle_activity.stalls_l3_miss ./offline-read
perf stat -e cycles,LLC-load-misses,cpu/event=0xa3,umask=0x6,cmask=6,name=cycle_activity_stalls_l3_miss/ ./offline-read
0.123979
1451229184
Performance counter stats for './offline-read':
622,661,371 cycles
16,114,063 LLC-load-misses
368,395,404 cycle_activity_stalls_l3_miss
0.184045411 seconds time elapsed
So ~370 million cycles out of 620 million are straight-up stalled on outstanding misses. In fact, the portion of cycles stalled this way in foo() is much higher, close to 90%, since perf is also measuring the init and accumulate code which takes about a third of the runtime (but doesn't have significant L3 misses).
This is nothing unexpected, since we knew the random-read pattern a1[*b1++] was going to have essentially zero locality. In fact, the number of LLC-load-misses is 16 million1, corresponding almost exactly to the 16 million random reads of a1.2
If we just assume 100% of foo() is spending waiting on memory access, we can get an idea of the total cost of each miss: 0.123 sec / 16,114,063 misses == 7.63 ns/miss. On my box, the memory latency is around 60 ns in the best case, so less than 8 ns per miss means we are already extracting a lot of memory-level parallelism (MLP): about 8 misses would have to be overlapped and in-flight on average to achieve this (even totally ignoring the additional traffic from the streaming load of b1 and streaming write of o).
So I don't think there are many tweaks you can apply to the simple loop to do much better. Still, two possibilities are:
Non-temporal stores for the writes to o, if your platform supports them. This would cut out the reads implied by RFO for normal stores. It should be a straight win since o is never read again (inside the timed portion!).
Software prefetching. Carefully tuned prefetching of a1 or b1 could potentially help a bit. The impact is going to be fairly limited, however, since we are already approaching the limits of MLP as described above. Also, we expect the linear reads of b1 to be almost perfectly prefetched by the hardware prefetchers. The random reads of a1 seem like they could be amenable to prefetching, but in practice the ILP in the loop leads to enough MLP through out-of-order processing (at least on big OoO processors like recent x86).
In the comments user harold already mentioned that he tried prefetching with only a small effect.
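For what it's worth, a sketch of the non-temporal-store variant (the first possibility above) might look like the following; foo_nt is an illustrative name, and this is untested, x86-specific code:
#include <emmintrin.h> // SSE2: _mm_stream_si32, _mm_sfence
#include <cstdint>

void foo_nt(uint32_t *a1, uint32_t *a2, uint32_t *b1, uint32_t *b2, uint32_t *o) {
    while (b1 != b2) {
        // non-temporal store: bypass the cache (and the RFO) for the write to o
        _mm_stream_si32(reinterpret_cast<int *>(o++), static_cast<int>(a1[*b1++]));
    }
    _mm_sfence(); // make the streaming stores visible before o is read elsewhere
}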
So since the simple tweaks aren't likely to bear much fruit, you are left with transforming the loop. One "obvious" transformation is to sort the indexes b1 (along with the index element's original position) and then do the reads from a1 in sorted order. This transforms the reads of a1 from completely random, to almost3 linear, but now the writes are all random, which is no better.
Sort and then unsort
The key problem is that the reads of a1 under control of b1 are random, and a1 is large, so you get a miss to DRAM for essentially every read. We can fix that by sorting b1, and then reading a1 in order to get a permuted result. Now you need to "un-permute" the permuted result to get the result in the final order, which is simply another sort, this time on the "output index".
Here's a worked example with the given input array a, index array b and output array o, and i which is the (implicit) position of each element:
i = 0 1 2 3
a = [00, 10, 20, 30]
b = [ 3, 1, 0, 1]
o = [30, 10, 00, 10] (desired result)
First, sort array b, with the original array position i as secondary data (alternately you may see this as sorting tuples (b[0], 0), (b[1], 1), ...), this gives you the sorted b array b' and the sorted index list i' as shown:
i' = [ 2, 1, 3, 0]
b' = [ 0, 1, 1, 3]
Now you can read the permuted result array o' from a under the control of b'. This read is strictly increasing in order, and should be able to operate at close to memcpy speeds. In fact you may be able to take advantage of wide contiguous SIMD reads and some shuffles to do several reads at once and move the 4-byte elements into the right place (duplicating some elements and skipping others):
a = [00, 10, 20, 30]
b' = [ 0, 1, 1, 3]
o' = [00, 10, 10, 30]
Finally, you de-permute o' to get o, conceptually simply by sorting o' on the permuted indexes i':
i' = [ 2, 1, 3, 0]
o' = [00, 10, 10, 30]
i = [ 0, 1, 2, 3]
o = [30, 10, 00, 10]
Finished!
Now this is the simplest idea of the technique and isn't particularly cache-friendly (each pass conceptually iterates over one or more 2^26-byte arrays), but it at least fully uses every cache line it reads (unlike the original loop which only reads a single element from a cache line, which is why you have 16 million misses even though the data only occupies 1 million cache lines!). All of the reads are more or less linear, so hardware prefetching will help a lot.
How much speedup you get probably largely depends on how you implement the sorts: they need to be fast and cache sensitive. Almost certainly some type of cache-aware radix sort will work best.
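As a conceptual, untuned reference point, here is one way the three steps could be sketched; std::sort stands in for the cache-aware radix sorts a real implementation would use, the final de-permutation is written as a second sort of (i', o') pairs, and foo_sort_unsort is a made-up name:
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

void foo_sort_unsort(const uint32_t *a, const uint32_t *b, uint32_t *o, size_t q) {
    // 1. sort (b[i], i) pairs by b[i], giving b' and i'
    std::vector<std::pair<uint32_t, uint32_t>> bi(q);
    for (size_t i = 0; i < q; ++i) bi[i] = { b[i], static_cast<uint32_t>(i) };
    std::sort(bi.begin(), bi.end());

    // 2. read a under control of b': the reads of a are now (almost) linear
    std::vector<std::pair<uint32_t, uint32_t>> io(q); // (i', o') pairs
    for (size_t j = 0; j < q; ++j) io[j] = { bi[j].second, a[bi[j].first] };

    // 3. de-permute: sort (i', o') by i' to recover the original output order
    std::sort(io.begin(), io.end());
    for (size_t j = 0; j < q; ++j) o[j] = io[j].second;
}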
Here are some notes on ways to improve this further:
Optimize the amount of sorting
You don't actually need to fully sort b. You just want to sort it "enough" such that the subsequent reads of a under the control of b' are more or less linear. For example, 16 elements fit in a cache line, so you don't need to sort based on the last 4 bits at all: the same linear sequence of cache lines will be read anyways. You could also sort on even fewer bits: e.g., if you ignored the 5 least-significant bits, you'd read cache lines in an "almost linear" way, sometimes swapping two cache lines from the perfectly linear pattern like: 0, 1, 3, 2, 5, 4, 6, 7. Here, you'll still get the full benefit of the L1 cache (subsequent reads to a cache line will always hit), and I suspect such a pattern would still be prefetched well and if not you can always help it with software prefetching.
You can test on your system what the optimal number of ignored bits is. Ignoring bits has two benefits:
Less work to do in the radix sort, either from fewer passes needed or from needing fewer buckets in one or more passes (which helps caching).
Potentially less work to do to "undo" the permutation in the last step: if the undo works by examining the original index array b, ignoring bits means that you get the same savings when undoing the sort.
Cache block the work
The above description lays out everything in several sequential, disjoint passes that each work on the entire data set. In practice, you'd probably want to interleave them to get better caching behavior. For example, assuming you use an MSD radix-256 sort, you might do the first pass, sorting the data into 256 buckets of approximately 64K elements each.
Then rather than doing the full second pass, you might finish sorting only the first (or first few) buckets, and proceed to do the read of a based on the resulting block of b'. You are guaranteed that this block is contiguous (i.e., a prefix of the final sorted sequence) so you don't give up any locality in the read, and your reads will generally be cached. You may also do the first pass of de-permuting o' since the block of o' is also hot in the cache (and perhaps you can combine the latter two phases into one loop).
Smart De-permutation
One area for optimization is how exactly the de-permutation of o' is implemented. In the description above, we assume some index array i initially with values [0, 1, 2, ..., max_q] which is sorted along with b. That's conceptually how it works, but you may not need to actually materialize i right away and sort it as auxiliary data. In the first pass of the radix sort, for example, the value of i is implicitly known (since you are iterating through the data), so it could be calculated for free4 and written out during the first pass without ever having appeared in sorted order.
There may also be more efficient ways to do the "unsort" operation than maintaining the full index. For example, the original unsorted b array conceptually has all the information needed to do the unsort, but it is not clear to me how to use it to unsort efficiently.
Will it be faster?
So will this actually be faster than the naive approach? It depends largely on implementation details, especially including the efficiency of the implemented sort. On my hardware, the naive approach processes about ~140 million elements per second. Online descriptions of cache-aware radix sorts seem to vary from perhaps 200 to 600 million elements/s, and since you need two of those, the opportunity for a big speedup would seem limited if you believe those numbers. On the other hand, those numbers are from older hardware, and for slightly more general sorts (e.g., for all 32 bits of the key, while we may be able to use as few as 16 bits).
Only a careful implementation will determine if it is feasible, and feasibility also depends on the hardware. For example, on hardware that can't sustain as much MLP, the sorting-unsorting approach becomes relatively more favorable.
The best approach also depends on the relative values of max_n and max_q. For example, if max_n >> max_q, then the reads will be "sparse" even with optimal sorting, so the naive approach would be better. On the other hand if max_n << max_q, then the same index will usually be read many times, so the sorting approach will have good read locality, the sorting steps will themselves have better locality, and further optimizations which handle duplicate reads explicitly may be possible.
Multiple Cores
It isn't clear from the question whether you are interested in parallelizing this. The naive solution for foo() already admits a "straightforward" parallelization where you simply partition the a and b arrays into equal-sized chunks, one for each thread, which would seem to provide a perfect speedup. Unfortunately, you'll probably find that you get much worse than linear scaling, because you'll be running into resource contention in the memory controller and associated uncore/offcore resources which are shared between all cores on a socket. So it isn't clear how much more throughput you'll get for a purely parallel random read load to memory as you add more cores6.
For the radix-sort version, most of the bottlenecks (store throughput, total instruction throughput) are in the core, so I expect it to scale reasonably with additional cores. As Peter mentioned in the comment, if you are using hyperthreading, the sort may have the additional benefit of good locality in the core local L1 and L2 caches, effectively letting each sibling thread use the entire cache, rather than cutting the effective capacity in half. Of course, that involves carefully managing your thread affinity so that sibling threads actually use nearby data, and not just letting the scheduler do whatever it does.
1 You might ask why the LLC-load-misses isn't say 32 or 48 million, given that we also have to read all 16 million elements of b1 and then the accumulate() call reads all of result. The answer is that LLC-load-misses only counts demand misses that actually miss in the L3. The other mentioned read patterns are totally linear, so the prefetchers will always be bringing the line into the L3 before it is needed. These don't count as "LLC misses" by the definition perf uses.
2 You might want to know how I know that the load misses all come from the reads of a1 in foo: I simply used perf record and perf mem to confirm that the misses were coming from the expected assembly instruction.
3 Almost linear because b1 is not a permutation of all indexes, so in principle there can be skipped and duplicate indexes. At the cache-line level, however, it is highly likely that every cache line will be read in order since each element has a ~63% chance of being included, and a cache line has 16 4-byte elements, so there's only about a 1 in 10 million chance that any given cache line has zero elements. So prefetching, which works at the cache-line level, will work fine.
4 Here I mean that the calculation of the value comes for free or nearly so, but of course the write still costs. This is still much better than the "up-front materialization" approach, however, which first creates the i array [0, 1, 2, ...] needing max_q writes and then again needs another max_q writes to sort it in the first radix sort pass. The implicit materialization only incurs the second write.
5 In fact, the IPC of the actual timed section foo() is much lower: about 0.15 based on my calculations. The reported IPC of the entire process is an average of the IPC of the timed section and the initialization and accumulation code before and after which has a much higher IPC.
6 Notably, this is different from how a dependent-load latency-bound workflow scales: a load that is doing random reads but can only have one load in progress, because each load depends on the result of the last, scales very well to multiple cores because the serial nature of the loads doesn't use many downstream resources (but such loads can conceptually also be sped up even on a single core by changing the core loop to handle more than one dependent load stream in parallel).
You can partition indices into buckets where higher bits of indices are the same. Beware that if indices are not random the buckets will overflow.
#include <cstdint>
#include <iostream>
#include <chrono>
#include <cassert>
#include <algorithm>
#include <numeric>
#include <utility>
#include <vector>

namespace {
    constexpr unsigned max_n = 1 << 24, max_q = 1 << 24;

    void foo(uint32_t *a1, uint32_t *a2, uint32_t *b1, uint32_t *b2, uint32_t *o) {
        while (b1 != b2) {
            // assert(0 <= *b1 && *b1 < a2 - a1)
            *o++ = a1[*b1++];
        }
    }

    uint32_t* foo_fx(uint32_t *a1, uint32_t *a2, uint32_t *b1, uint32_t *b2, const uint32_t b_offset, uint32_t *o) {
        while (b1 != b2) {
            // assert(0 <= *b1 && *b1 < a2 - a1)
            *o++ = a1[b_offset + (*b1++)];
        }
        return o;
    }

    uint32_t data[max_n], index[max_q], result[max_q];
    std::pair<uint32_t, uint32_t[max_q / 8]> index_fx[16];
}

int main() {
    uint32_t seed = 0;
    auto rng = [&seed]() { return seed = seed * 9301 + 49297; };
    std::generate_n(data, max_n, rng);
    //std::generate_n(index, max_q, [rng]() { return rng() % max_n; });
    for (size_t i = 0; i < max_q; ++i) {
        const uint32_t idx = rng() % max_n;
        const uint32_t bucket = idx >> 20;
        assert(bucket < 16);
        index_fx[bucket].second[index_fx[bucket].first] = idx % (1 << 20);
        index_fx[bucket].first++;
        assert((1 << 20)*bucket + index_fx[bucket].second[index_fx[bucket].first - 1] == idx);
    }

    auto t1 = std::chrono::high_resolution_clock::now();
    //foo(data, data + max_n, index, index + max_q, result);
    uint32_t* result_begin = result;
    for (int i = 0; i < 16; ++i) {
        result_begin = foo_fx(data, data + max_n, index_fx[i].second, index_fx[i].second + index_fx[i].first, (1 << 20)*i, result_begin);
    }
    auto t2 = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration<double>(t2 - t1).count() << std::endl;
    std::cout << std::accumulate(result, result + max_q, 0ull) << std::endl;
}

Parallelise bubble sort using CUDA

I was given an assignment, to parallelize Bubble Sort and implement it using CUDA.
I don't see how bubble sort could possibly be parallelized. I think it's inherently sequential, since it compares two consecutive elements and swaps them after a conditional branch.
Thoughts, anyone?
To be completely honest, I had trouble thinking about a way to parallelize bubble sort as well. I initially thought of a hybrid sort where you could tile, bubble sort each tile, and then merge (probably would still improve performance if you could make it work). However, I browsed for "Parallel Bubble Sort", and found this page. If you scroll down you'll find the following parallel bubble sort algorithm:
For k = 0 to n-2
    If k is even then
        for i = 0 to (n/2)-1 do in parallel
            If A[2i] > A[2i+1] then
                Exchange A[2i] ↔ A[2i+1]
    Else
        for i = 0 to (n/2)-2 do in parallel
            If A[2i+1] > A[2i+2] then
                Exchange A[2i+1] ↔ A[2i+2]
Next k
You could run the for-loop in the CPU and then use a kernel for each of the do in parallels. This seems efficient for large arrays, but might be too much overhead with small arrays. Large arrays are assumed if you're writing a CUDA implementation. Since the swaps within these kernels are with adjacent pairs of elements, you should be able to tile accordingly. I've searched for generic, non-gpu-specific parallel bubble sorts and this was the only one I could find.
I did find a (very slightly) helpful visualization here. I'd love to discuss this more in the comments.
EDIT: I found another parallel version of bubble sort called Cocktail Shaker Sort. Here's the pseudocode:
procedure cocktailShakerSort( A : list of sortable items ) defined as:
    do
        swapped := false
        for each i in 0 to length( A ) - 2 do:
            if A[ i ] > A[ i + 1 ] then // test whether the two elements are in the wrong order
                swap( A[ i ], A[ i + 1 ] ) // let the two elements change places
                swapped := true
            end if
        end for
        if not swapped then
            // we can exit the outer loop here if no swaps occurred.
            break do-while loop
        end if
        swapped := false
        for each i in length( A ) - 2 to 0 do:
            if A[ i ] > A[ i + 1 ] then
                swap( A[ i ], A[ i + 1 ] )
                swapped := true
            end if
        end for
    while swapped // if no elements have been swapped, then the list is sorted
end procedure
It looks like this also has two for-loops comparing adjacent elements. These algorithms look like similar opposites: the first one (which I now learned is called odd-even sort) simply runs its passes without checking, while cocktail shaker sort checks in each loop whether anything was swapped.
The code included in this post for the odd-even sort seems to just run the loop enough times to guarantee the array is sorted, whereas the Wikipedia pseudocode checks whether any swaps occurred. A potential first pass could be to implement the algorithm of this post and then optimize with the check, although the check may actually be slower with CUDA.
Regardless the sort will be slow. Here's a related SO question fyi, but there isn't much help. They agree it's not effective for small arrays, and really emphasize its failure.
Are you looking for specific CUDA code or was this enough? It seems like you wanted an overview of possible options and an understanding of a CUDA implementation.
TL;DR;
For a complete implementation of a generic parallel bubble sort, take a look at generic-bubble-sort.cu. "Generic" here means the algorithm sorts any kind of elements as long as you provide your own comparator.
At best
With a number of threads linearly proportional to N (say N/2) you can get a parallel bubble sort of O(N) time complexity (where N is the size of the array you want to sort).
A hint
It might not be trivial, but when you look closely you'd realize that all the sequential bubble sort does is swap pairs of elements that are not ordered correctly, one pair at a time!
Since non-overlapping pairs can be compared and swapped independently of each other, a parallel bubble sort can process a whole set of such pairs at once.
An approach
Let's say we want to sort the following array in ascending order:
# [7][1][3][2][0]
We'd first take our initial unsorted array and consider every two elements array[i] and array[i+1] to be an independent pair. For this first iteration, i is going to be an EVEN index, so our pairs are { {array[0], array[1]}, {array[2], array[3]}, ... }.
# [7][1][3][2][0] <-- Unsorted array of 5 elements
# [7][1] [3][2] [0] <-- A set of independent pairs.
Then, we'd swap every two elements of each pair if they are not in the desired order.
# [7][1] [3][2] [0] --┑ Sorting first set of pairs
# |
# [1][7] [2][3] [0] <-┛ starting from an even idx
Our array after this first iteration would look like this :
# [1][7][2][3][0] <-- Result after first iteration
We are now going to reiterate for a second time, but unlike previously, we will now sort pairs starting from an ODD index { {array[1], array[2]} , {array[3],array[4]}, ...}. It's worth mentioning that elements with no peers won't be considered.
# [1][7][2][3][0] <-- Result after first iteration
# [1] [7][2] [3][0] --┑ Sorting second set of pairs
# |
# [1] [2][7] [0][3] <-┛ starting from an odd index
# [1][2][7][0][3] <-- Result after second iteration
After N EVEN/ODD pair sorting iterations we'd have a sorted array.
# [1][2] [7][0] [3] --┑
# [1][2] [0][7] [3] |
# |
# [1][2][0][7][3] |
# | The whole parallel sorting
# [1] [2][0] [7][3] | will converge after N iterations
# [1] [0][2] [3][7] | So we keep sorting pairs for 3 more
# | iterations.
# [1][0][2][3][7] |
# |
# [1][0] [2][3] [7] |
# [0][1] [2][3] [7] <-┛
#
# [0][1][2][3][7] <-- Sorted array!
Parallel Bubble Sort with CUDA
A straightforward implementation of a CUDA program for the approach above would be done as follows:
each thread would be responsible for sorting an individual pair
you would need N/2 threads
since warp divergence is a thing we'd need to care about synchronizing our threads
USING A SINGLE BLOCK: if our threads fit into a single block, we'd only need __syncthreads() after each iteration and we'd be able to take advantage of shared memory by keeping the whole array there.
USING MORE THAN ONE BLOCK: we would have to ensure thread synchronization across all the threads in our kernel. We'd only be able to perform one iteration per kernel launch, and would launch our kernel N times (see the sketch right after this list). The bad news is that we can only use global memory for processing our array, since shared memory contents do not outlive a kernel launch.
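A bare-bones sketch of that multi-block variant (plain ints and made-up names; one even/odd pass per launch, driven from the host) could look like:
__global__ void oddEvenPass(int *v, const unsigned int n, const unsigned int offset) {
    unsigned int leftIndex = 2 * (blockIdx.x * blockDim.x + threadIdx.x) + offset;
    unsigned int rightIndex = leftIndex + 1;
    if (rightIndex < n && v[leftIndex] > v[rightIndex]) {
        int tmp = v[leftIndex];        // swap the out-of-order pair
        v[leftIndex] = v[rightIndex];
        v[rightIndex] = tmp;
    }
}

// host side: one launch per iteration, alternating even/odd offsets; the implicit
// synchronization between kernel launches replaces __syncthreads()
// for (unsigned int i = 0; i < n; ++i)
//     oddEvenPass<<<numBlocks, threadsPerBlock>>>(d_v, n, i % 2);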
Some code
Here's a simple implementation of what's explained above considering only one block. The whole code is available in this repo.
template<typename T>
__global__
void bubbleSort(T* v, const unsigned int n, ShouldSwap<T> shouldSwap) {
    const unsigned int tIdx = threadIdx.x;
    for (unsigned int i = 0; i < n; i++) {
        unsigned int offset = i % 2;
        unsigned int leftIndex = 2 * tIdx + offset;
        unsigned int rightIndex = leftIndex + 1;
        if (rightIndex < n) {
            if (shouldSwap(v[leftIndex], v[rightIndex])) {
                swap<T>(&v[leftIndex], &v[rightIndex]);
            }
        }
        __syncthreads();
    }
}
If you're wondering about ShouldSwap and swap implementations here's the code:
swap
A device function for swapping elements.
template<typename T>
__host__ __device__ __inline__
void swap (T* a, T* b) {
    T tmp = *a;
    *a = *b;
    *b = tmp;
}
ShouldSwap
A C++ Functor used as a generic comparator.
template<typename T>
__host__ __device__
bool ShouldSwap<T>::operator() (const T left, const T right) const {
    return left > right;
}

how to read all 1's in an Array of 1's and 0's spread all over the array randomly

I have an array with 1s and 0s spread over it randomly.
int arr[N] = {1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,0,0,0,1....................N}
Now I want to retrieve all the 1's in the array as fast as possible, but the condition is that I should not lose the exact position (based on index) in the array, so the sorting option is not valid.
So the only option left is linear searching, i.e. O(n); is there anything better than this?
The main problem behind the linear scan is that I need to run the scan X times. So I feel I need some kind of other data structure which maintains this list once the first linear scan happens, so that I need not run the linear scan again and again.
Let me be clear about the final expectations:
I just need to find the number of 1's in a certain range of the array; precisely, I need to find the number of 1's within a range such as 40-100. The range can be arbitrary, and I need the count of 1's within it. I can't just compute one sum, as I would have to iterate over the array again and again for the different range requirements.
I'm surprised you considered sorting as a faster alternative to linear search.
If you don't know where the ones occur, then there is no better way than linear searching. Perhaps if you used bits or char datatypes you could do some optimizations, but it depends on how you want to use this.
The best optimization that you could do on this is to avoid the cost of branch misprediction. Because each value is zero or one, you can use it directly to advance the index into the array that stores the one-indices.
Simple approach:
int end = 0;
int indices[N];
for( int i = 0; i < N; i++ )
{
    if( arr[i] ) indices[end++] = i; // Slow due to branch prediction
}
Without branching:
int end = 0;
int indices[N];
for( int i = 0; i < N; i++ )
{
    indices[end] = i;
    end += arr[i];
}
[edit] I tested the above, and found the version without branching was almost 3 times faster (4.36s versus 11.88s for 20 repeats on a randomly populated 100-million element array).
Coming back here to post results, I see you have updated your requirements. What you want is really easy with a dynamic programming approach...
All you do is create a new array that is one element larger, which stores the number of ones from the beginning of the array up to (but not including) the current index.
arr : 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 1
count : 0 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 4 5 6 6 6 6 7
(I've offset arr above so it lines up better)
Now you can compute the number of 1s in any range in O(1) time. To compute the number of 1s between index A and B, you just do:
int num = count[B+1] - count[A];
Obviously you can still use the branchless version to generate the counts initially. All this should give you a pretty good speedup over the naive approach of summing for every query:
int *count = new int[N+1];
int total = 0;
count[0] = 0;
for( int i = 0; i < N; i++ )
{
    total += arr[i];
    count[i+1] = total;
}
// to compute the ranged sum:
int range_sum( int *count, int a, int b )
{
    if( b < a ) return range_sum( count, b, a );
    return count[b+1] - count[a];
}
Well, a one-time linear scan is fine. Since you are looking at multiple queries across ranges of the array, I think those can be done in constant time. Here you go:
Scan the array and create a bitmap where key = key of array = sequence (1,2,3,4,5,6....). The value stored in the bitmap would be a tuple<IsOne, cumulativeSum>, where isOne is whether you have a one at that position and cumulativeSum is the running count of 1's as and when you encounter them.
Array = 1 1 0 0 1 0 1 1 1 0 1 0
Tuple: (1,1) (1,2) (0,2) (0,2) (1,3) (0,3) (1,4) (1,5) (1,6) (0,6) (1,7) (0,7)
CASE 1: When lower bound of cumulativeSum has a 0. Number of 1's [6,11] =
cumulativeSum at 11th position - cumulativeSum at 6th position = 7 - 3 = 4
CASE 2: When lower bound of cumulativeSum has a 1. Number of 1's [2,11] =
cumulativeSum at 11th position - cumulativeSum at 2nd position + 1 = 7-2+1 = 6
Step 1 is O(n)
Step 2 is O(1)
Total complexity is linear, no doubt, but for your task, where you have to work with the ranges several times, the above algorithm seems better if you have ample memory :)
Does it have to be a simple linear array data structure? Or can you create your own data structure which happens to have the desired properties, for which you're able to provide the required API, but whose implementation details can be hidden (encapsulated)?
If you can implement your own and if there is some guaranteed sparsity (to either 1s or 0s) then you might be able to offer better than linear performance. I see that you want to preserve (or be able to regenerate) the exact stream, so you'll have to store an array or bitmap or run-length encoding for that. (RLE will be useless if the stream is actually random rather than arbitrary but could be quite useful if there are significant sparsity or patterns with long strings of one or the other. For example a black&white raster of a bitmapped image is often a good candidate for RLE).
Let's say you're guaranteed that the stream will be sparse --- that no more than 10%, for example, of the bits will be 1s (or, conversely, that more than 90% will be). If that's the case then you might model your solution on an RLE and maintain a count of all 1s (simply incremented as you set bits and decremented as you clear them). If there might be a need to quickly get the number of set bits for arbitrary ranges of these elements, then instead of a single counter you can have a conveniently sized array of counters for partitions of the stream. (Conveniently sized, in this case, means something which fits easily within memory, within your caches, or register sets, but which offers a reasonable trade-off between computing a sum over the partitions fully within the range and the linear scan.) The result for any arbitrary range is the sum of all the partitions fully enclosed by the range plus the results of linear scans for any fragments that are not aligned on your partition boundaries.
For a very, very, large stream you could even have a multi-tier "index" of partition sums --- traversing from the largest (most coarse) granularity down toward the "fragments" to either end (using the next layer of partition sums) and finishing with the linear search of only the small fragments.
Obviously such a structure represents trade offs between the complexity of building and maintaining the structure (inserting requires additional operations and, for an RLE, might be very expensive for anything other than appending/prepending) vs the expense of performing arbitrarily long linear search/increment scans.
If:
the purpose is to be able to find the number of 1s in the array at any time,
given that relatively few of the values in the array might change between one moment when you want to know the number and another moment, and
if you have to find the number of 1s in a changing array of n values m times,
... you can certainly do better than examining every cell in the array m times by using a caching strategy.
The first time you need the number of 1s, you certainly have to examine every cell, as others have pointed out. However, if you then store the number of 1s in a variable (say sum) and track changes to the array (by, for instance, requiring that all array updates occur through a specific update() function), every time a 0 is replaced in the array with a 1, the update() function can add 1 to sum and every time a 1 is replaced in the array with a 0, the update() function can subtract 1 from sum.
Thus, sum is always up-to-date after the first time that the number of 1s in the array is counted and there is no need for further counting.
(EDIT to take the updated question into account)
If the need is to return the number of 1s in a given range of the array, that can be done with a slightly more sophisticated caching strategy than the one I've just described.
You can keep a count of the 1s in each subset of the array and update the relevant subset count whenever a 0 is changed to a 1 or vice versa within that subset. Finding the total number of 1s in a given range within the array would then be a matter of adding the number of 1s in each subset that is fully contained within the range and then counting the number of 1s that are in the range but not in the subsets that have already been counted.
Depending on circumstances, it might be worthwhile to have a hierarchical arrangement in which (say) the number of 1s in the whole array is at the top of the hierarchy, the number of 1s in each 1/q th of the array is in the second level of the hierarchy, the number of 1s in each 1/(q^2) th of the array is in the third level of the hierarchy, etc. e.g. for q = 4, you would have the total number of 1s at the top, the number of 1s in each quarter of the array at the second level, the number of 1s in each sixteenth of the array at the third level, etc.
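As a rough, single-level illustration of that subset-count idea (the chunk size, the BitArray type and all names here are made up for the sketch):
#include <vector>

constexpr int CHUNK = 64; // one counter per fixed-size chunk of the array

struct BitArray {
    std::vector<int> bits;        // the 0/1 values
    std::vector<int> chunk_ones;  // number of 1s in each chunk

    explicit BitArray(int n) : bits(n, 0), chunk_ones((n + CHUNK - 1) / CHUNK, 0) {}

    void update(int i, int v) {            // set bits[i] = v (0 or 1), keep the counter in sync
        chunk_ones[i / CHUNK] += v - bits[i];
        bits[i] = v;
    }

    int onesInRange(int a, int b) const {  // number of 1s in [a, b], inclusive
        int sum = 0;
        while (a <= b && a % CHUNK != 0) sum += bits[a++];                       // leading fragment
        while (a + CHUNK - 1 <= b) { sum += chunk_ones[a / CHUNK]; a += CHUNK; } // whole chunks
        while (a <= b) sum += bits[a++];                                         // trailing fragment
        return sum;
    }
};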
Are you using C (or a derived language)? If so, can you control the encoding of your array? You could, for example, use a bitmap for counting. The nice thing about a bitmap is that you can use a lookup table to sum the counts, though if your subrange ends aren't divisible by 8 you'll have to deal with the partial bytes at the ends specially, but the speedup will be significant.
If that's not the case, can you at least encode the values as single bytes? In that case, you may be able to exploit sparseness if it exists (more specifically, the hope that there are often multi-index swaths of zeros).
So for:
u8 input[N] = {1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,0,0,0,1....................N};
You can write something like (untested):
typedef unsigned char u8;       // assumed shorthand types for this fragment
typedef unsigned int uint;
typedef unsigned long long u64;

uint countBytesBy1FromTo(u8 *input, uint start, uint stop)
{   // function for counting one byte at a time, use with ranges of less than 4,
    // use the function below for longer ranges
    // assume it's just ones and zeros, otherwise we have to test/branch
    uint sum = 0;
    u8 *end = input + stop;
    for (u8 *each = input + start; each < end; each++)
        sum += *each;
    return sum;
}

uint countBytesBy8FromTo(u8 *input, uint start, uint stop)
{
    u64 *chunks = (u64*)(input + start);
    u64 *end = chunks + ((stop - start) >> 3);
    // count the tail bytes that don't fill a whole 8-byte chunk
    uint sum = countBytesBy1FromTo((u8*)end, 0, (uint)((input + stop) - (u8*)end));
    for (; chunks < end; chunks++)
    {
        if (*chunks) // skip the whole chunk when all 8 bytes are zero
        {
            sum += countBytesBy1FromTo((u8*)chunks, 0, 8);
        }
    }
    return sum;
}
The basic trick is exploiting the ability to cast slices of your target array to single entities your language can look at in one swoop, and test in one comparison whether all of the values in that slice are zero, so you can skip the whole block. The more zeros, the better it will work. In the case where your large cast integer always has at least one set byte, this approach just adds overhead. You might find that using a u32 is better for your data, or that adding a u32 test between the 1-byte and 8-byte levels helps. For datasets where zeros are much more common than ones, I've used this technique to great advantage.
Why is sorting invalid? You can clone the original array, sort the clone, and count and/or mark the locations of the 1s as needed.

Efficient Indexing method for a 2 by 2 matrix

If I fill numbers from 1 to 4 in a 2 by 2 matrix, there are 16 possible combinations. What I want to do is store values in an array of size 24 corresponding to each matrix. So given a 2 by 2 matrix, I want an efficient indexing method to index into the array directly. (I don't want to compare all 4 elements for each of the 16 positions.) Something similar to a bit vector? But I'm not able to figure out how.
I want to do the same for a 4 by 4 matrix as well, filling from 1 to 9.
To clarify: you're looking for an efficient hash function for 2x2 matrices. You want to use the results of the hash function to compare matrices to see if they're the same.
First, let's assume you actually want the numbers 0 to 3 instead of 1 to 4 - this makes it easier, and is more computer-sciency. Next, 16 is not right: there are 24 possible permutations of the numbers 0-3, and there are 4^4 = 256 possible strings of length 4 that use a four-letter alphabet (if you can repeat already-used numbers).
Either one is trivial to encode into a single byte. Let the first 2 bits represent the (0,0) position, the next 2 bits represent (0,1), and so forth. Then, to hash your 2x2 matrix, simply do:
hash = m[0][0] | (m[0][1] << 2) | (m[1][0] << 4) | (m[1][1] << 6);
Random example: the number 54 in binary is 00110110, which represents a matrix like:
2 1
3 0
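A small illustration of the 2-bits-per-cell encoding and its inverse (encode and decode are names made up for this sketch; cell values must be in 0..3):
#include <cstdint>
#include <cstdio>

uint8_t encode(const int m[2][2]) {
    return m[0][0] | (m[0][1] << 2) | (m[1][0] << 4) | (m[1][1] << 6);
}

void decode(uint8_t h, int m[2][2]) {
    m[0][0] =  h       & 3;
    m[0][1] = (h >> 2) & 3;
    m[1][0] = (h >> 4) & 3;
    m[1][1] = (h >> 6) & 3;
}

int main() {
    int m[2][2];
    decode(54, m); // 54 == 0b00110110
    std::printf("%d %d\n%d %d\n", m[0][0], m[0][1], m[1][0], m[1][1]); // prints the matrix above
}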
When you need efficiency, sometimes code clarity goes out the window :)
First you need to be sure you want efficiency - do you have profiling info showing that the simple comparison code is too inefficient for you?
You can simply treat it as an array of bytes of the same size. memcmp does comparisons of arbitrary memory:
A data structure such as:
int matrix[2][2];
is stored the same as:
int matrix[2*2];
which could be dynamically allocated as:
typedef int matrix[2*2];
matrix* m = (matrix*)malloc(sizeof(matrix));
I'm not suggesting you dynamically allocate them, I'm illustrating how the bytes in your original type are actually laid out in memory.
Therefore, the following is valid:
#include <stdlib.h>
#include <string.h>

matrix lookup[16];

int matrix_cmp(const void* a, const void* b) {
    return memcmp(a, b, sizeof(matrix));
}

void init_matrix_lookup() {
    int i;
    for(i=0; i<16; i++) {
        ...
    }
    qsort(lookup, 16, sizeof(matrix), matrix_cmp);
}

int matrix_to_lookup(matrix* m) {
    // in this example I'm sorting them so we can bsearch;
    // but with only 16 elements, it's probably not worth the effort,
    // and you could safely just loop over them...
    matrix* found = (matrix*)bsearch(m, lookup, 16, sizeof(matrix), matrix_cmp);
    return found ? (int)(found - lookup) : -1; // return the index into the lookup array
}
