Different way to index threads in CUDA C - matrix

I have a 9x9 matrix and I flattened it into a vector of 81 elements; then I defined a grid of 9 blocks with 9 threads each for a total of 81 threads; here's a picture of the grid
Then I tried to verify which index corresponds to thread (0,0) of block (1,1); first I calculated the i-th column and the j-th row like this:
i = blockDim.x*blockIdx.x + threadIdx.x = 3*1 + 0 = 3
j = blockDim.y*blockIdx.y + threadIdx.y = 3*1 + 0 = 3
therefore the index is:
index = N*i + j = 9*3 +3 = 30
As a matter of fact, thread (0,0) of block (1,1) does correspond to element 30 of the matrix;
Now here's my problem: let's say I choose a grid with 4 blocks (0,0)(1,0)(0,1)(1,1) with 4 threads each (0,0)(1,0)(0,1)(1,1)
Let's say I keep the original vector with 81 elements; what should I do to get the index of a generic element of the vector by using just 4*4 = 16 threads? I have tried the formulas written above but they don't seem to apply.
My goal is that every thread handles a single element of the vector...

A common way to have a smaller number of threads cover a larger number of data elements is to use a "grid-striding loop". Suppose I had a vector of length n elements, and I had some smaller number of threads, and I wanted to take every element, add 1 to it, and store it back in the original vector. That code could look something like this:
__global__ void my_inc_kernel(int *data, int n){
    // globally unique thread id in a 2D grid of 2D blocks
    int idx = (gridDim.x*blockDim.x)*(threadIdx.y+blockDim.y*blockIdx.y) + (threadIdx.x+blockDim.x*blockIdx.x);
    while(idx < n){
        data[idx]++;  // add 1 to the element and store it back
        idx += (gridDim.x*blockDim.x)*(gridDim.y*blockDim.y);}
}
(the above is coded in browser, not tested)
The only complicated parts above are the indexing parts. The initial calculation of idx is just a typical creation/assignment of a globally unique id (idx) to each thread in a 2D threadblock/grid structure. Let's break it down:
int idx = (gridDim.x*blockDim.x)*(threadIdx.y+blockDim.y*blockIdx.y) + (threadIdx.x+blockDim.x*blockIdx.x);
          (width of grid in threads)*(thread y-index)                 + (thread x-index)
The amount added to idx on each pass of the while loop is the size of the 2D grid in total threads. Therefore, each iteration of the while loop does one "grid's width" of elements at a time, and then "strides" to the next grid-width, to process the next group of elements. Let's break that down:
idx += (gridDim.x*blockDim.x)*(gridDim.y*blockDim.y);
(width of grid in threads)*(height of grid in threads)
This methodology does not require that the total number of elements be evenly divisible by the number of threads. The conditional check of the while-loop handles all cases of relationship between vector size and grid size.
This particular grid-striding loop methodology has the additional benefit (in terms of mapping elements to threads) that it tends to naturally promote coalesced access. The reads and writes to data vector in the code above will coalesce perfectly, due to the behavior of the grid-striding loop. You can enhance coalescing behavior in this case by choosing blocks that are a whole-number multiple of 32, but that is not central to your question.
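To check the indexing against the numbers in the question, here is a minimal host-side C++ sketch (plain CPU code, not a CUDA kernel) that replays the loop above for every thread of a 2D grid and counts how often each element is visited. The `Dim3` struct and the name `simulate_grid_stride` are made up for this illustration; the index arithmetic is copied from the kernel.

```cpp
#include <vector>

// Hypothetical stand-in for CUDA's built-in dim3 (x and y only).
struct Dim3 { int x, y; };

// Replays the grid-stride loop for every thread of a 2D grid on the CPU and
// returns how many times each of the n elements was visited.
std::vector<int> simulate_grid_stride(Dim3 gridDim, Dim3 blockDim, int n) {
    std::vector<int> visits(n, 0);
    const int grid_width  = gridDim.x * blockDim.x;  // threads per grid row
    const int grid_height = gridDim.y * blockDim.y;  // thread rows in the grid
    const int stride = grid_width * grid_height;     // total threads in the grid
    for (int by = 0; by < gridDim.y; ++by)
        for (int bx = 0; bx < gridDim.x; ++bx)
            for (int ty = 0; ty < blockDim.y; ++ty)
                for (int tx = 0; tx < blockDim.x; ++tx) {
                    int idx = grid_width * (ty + blockDim.y * by)
                            + (tx + blockDim.x * bx);
                    while (idx < n) {
                        ++visits[idx];   // the kernel would do data[idx]++ here
                        idx += stride;
                    }
                }
    return visits;
}
```

With a 2x2 grid of 2x2 blocks (16 threads) and n = 81, every element should be visited exactly once: the thread with id 0 handles elements 0, 16, 32, 48, 64 and 80, and each of the other threads handles five elements.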


Computing partial sums in OpenCL

A 1D dataset is divided into segments, and each work item processes one segment. It reads a number of elements from the segment. The number of elements is not known beforehand and differs for each segment.
For example:
+----+----+----+----+----+----+----+----+----+ <-- segments
A BCD E FG HIJK L M N <-- elements in this segment
After all segments have been processed, they should write the elements contiguously into output memory, like
A B C D E F G H I J K L M N
So the absolute output position of the elements from one segment depends on the number of elements in the previous segments. E is at position 4 because segment 1 contains 1 element (A) and segment 2 contains 3 elements.
The OpenCL kernel writes the number of elements for each segment into a local/shared memory buffer, and works like this (pseudocode)
kernel void k(
    constant uchar* input,
    global int* output,
    local int* segment_element_counts
) {
    int segment = get_local_id(0);
    int count = count_elements(&input[segment * segment_size]);
    segment_element_counts[segment] = count;
    barrier(CLK_LOCAL_MEM_FENCE); // all counts must be written before they are read
    ptrdiff_t position = 0;
    for(int previous_segment = 0; previous_segment < segment; ++previous_segment)
        position += segment_element_counts[previous_segment];
    global int* output_ptr = &output[position];
    read_elements(&input[segment * segment_size], output_ptr);
}
So each work item has to calculate a partial sum using a loop, where the work items with larger id do more iterations.
Is there a more efficient way to implement this (each work item calculate a partial sum of a sequence, up to its index), in OpenCL 1.2? OpenCL 2 seems to provide work_group_scan_inclusive_add for this.
You can do N partial (prefix) sums in log2(N) iterations using something like this:
offsets[get_local_id(0)] = count;
barrier(CLK_LOCAL_MEM_FENCE);
for (ushort combine = 1; combine < total_num_segments; combine *= 2)
{
    if (get_local_id(0) & combine)
        offsets[get_local_id(0)] +=
            offsets[(get_local_id(0) & ~(combine * 2u - 1u)) | (combine - 1u)];
    barrier(CLK_LOCAL_MEM_FENCE); // finish this pass before the next one reads
}
Given segment element counts of
a b c d
The successive iterations will produce:
a b+a c d+c
a b+a c+(b+a) (d+c)+(b+a)
Which is the result we want.
So in the first iteration, we've divided the segment element counts into groups of 2, and sum within them. Then we merge 2 groups at a time into 4 elements, and propagate the result from the first group into the second. We grow the groups again to 8, and so on.
The key observation is that this pattern also matches the binary representation of the index of each segment:
0: 0b00 1: 0b01 2: 0b10 3: 0b11
Index 0 performs no sums. Both indices 1 and 3 perform a sum in the first iteration (bit 0/LSB = 1), whereas indices 2 and 3 perform a sum in the second iteration (bit 1 = 1). That explains this line:
if (get_local_id(0) & combine)
The other statement that really needs an explanation is of course
offsets[get_local_id(0)] +=
offsets[(get_local_id(0) & ~(combine * 2u - 1u)) | (combine - 1u)];
Calculating the index at which we find the previous prefix sum we want to accumulate onto our work-item's sum is a little tricky. The subexpression (combine * 2u - 1u) takes the value (2^n - 1) on each iteration (for n starting at 1):
1 = 0b001
3 = 0b011
7 = 0b111
By bitwise-masking these bit suffixes off (i.e. i & ~x) the work-item index, this gives you the index of the first item in the current group.
The (combine - 1u) subexpression then gives you the index within the current group of the last item of the first half. Putting the two together gives you the overall index of the item you want to accumulate into the current segment.
There is one slight ugliness in the result: it's shifted to the left by one: so segment 1 needs to use offsets[0], and so on, while segment 0's offset is of course 0. You can either over-allocate the offsets array by 1 and perform the prefix sums on the subarray starting at index 1 and initialise index 0 to 0, or use a conditional.
There are probably profiling-driven micro-optimisations you can make to the above code.
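To make the index arithmetic easier to check by hand, here is a small sequential C++ sketch that replays the loop above on the CPU. The names `simulate_scan` and `start_position` are made up for this illustration. Running the "work-items" one after another is only valid because, within one pass, items with the `combine` bit set read exclusively from items with that bit clear; on the device the passes still need to be separated by barriers.

```cpp
#include <vector>

// Sequential replay of the work-group scan: offsets[i] starts as the element
// count of segment i and ends as the inclusive prefix sum up to segment i.
std::vector<unsigned> simulate_scan(std::vector<unsigned> offsets) {
    const unsigned n = static_cast<unsigned>(offsets.size());
    for (unsigned combine = 1; combine < n; combine *= 2) {
        for (unsigned id = 0; id < n; ++id) {  // id plays the role of get_local_id(0)
            if (id & combine) {
                // last item of the first half of the current group
                unsigned src = (id & ~(combine * 2u - 1u)) | (combine - 1u);
                offsets[id] += offsets[src];
            }
        }
    }
    return offsets;
}

// Handles the "shifted by one" issue with a conditional: the position where
// a segment starts writing is the inclusive sum of the segment before it.
unsigned start_position(const std::vector<unsigned>& inclusive, unsigned segment) {
    return segment == 0 ? 0u : inclusive[segment - 1];
}
```

For counts 1, 3, 1, 2, ... the third segment gets start position 4, which matches the position of E in the question's example.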

Sorting and Counting Elements in OpenCL

I want to create an OpenCL kernel that sorts and counts millions of ulong.
Is there a particular algorithm that fits my needs, or should I go for a hash table?
To be clear, given the following input:
[42, 13, 9, 42]
I would like to get an output like this:
[(9,1), (13,1), (42,2)]
My first idea was to modify the Counting Sort - which already counts in order to sort - but because of the wide range of ulongs it requires too much memory. Bitonic or Radix sort plus something to count elements could be a way but I miss a fast way to count the elements. Any suggestions on this?
Extra notes:
I'm developing using an NVIDIA Tesla K40C GPU and a Terasic DE5-Net FPGA. So far the main goal is to make it work on the GPU but I'm also interested in solutions that might be a nice fit for FPGAs.
I know that some values inside the range of ulong aren't used so we can use them to mark invalid elements or duplicates.
I want to consume the output from the GPU using multiple threads on the CPU, so I would like to avoid any solution that requires post-processing (on the host side, I mean) with data dependencies spread around the output.
This solution requires two passes of the bitonic sort to both count the duplicates and remove them (well, move them to the end of the array). Bitonic sort takes O(log(n)^2) parallel steps, so two passes take roughly 2(log(n)^2) steps, which shouldn't be a problem unless you are running this in a loop.
Create a simple struct for each of the elements, to include the number of duplicates, and if the element has been added as a duplicate, something like:
// Note: If you are worried about space, or know that there
// will only be a few duplicates for each element, then
// make the count element smaller
typedef struct {
cl_ulong value;
cl_ulong count : 63;
cl_ulong seen : 1;
} Element;
You can start by creating a comparison function which will move duplicates to the end, and count the duplicates if they are yet to be added to the total count for the element. This is the logic behind the comparison function:
If one element is a duplicate and another is not, return that the non-duplicate element is smaller (regardless of the values), which will move all duplicates to the end.
If the elements are duplicates and the right element has not been marked a duplicate (seen=0), then add the right element's count to the left element's count and set the right element as a duplicate (seen=1). This has the effect of moving the total count of an element with a specific value to the leftmost element in the array with that value, and all duplicates with that value to the end.
Otherwise return that the element with the smaller value, is smaller.
The comparison function would look like:
bool compare(__global Element* E1, __global Element* E2) {
    if (!E1->seen && E2->seen) return true; // E1 smaller
    if (!E2->seen && E1->seen) return false; // E2 smaller
    // If the elements are duplicates and the right element has
    // not yet been "seen" by an element with the same value
    if (E1->value == E2->value && !E2->seen) {
        E1->count += E2->count;
        E2->seen = 1;
        return true;
    }
    // They aren't duplicates, and either
    // neither has been seen, or both have
    return E1->value < E2->value;
}
Bitonic sort has a specific structure, which can be nicely illustrated with a diagram. In the diagram, each element is referred to by a 3-tuple (a,b,c) where a = value, b = count, and c = seen.
Each diagram shows one run of bitonic sort on the array (vertical lines denote a comparison between elements, and horizontal lines move right to the next stage of the bitonic sort). Using the diagram and the above comparison function and logic, you should be able to convince yourself that this does what is required.
Run 1:
Run 2:
At the end of run 2, all elements are arranged by value. Duplicates with seen = 1 are at the end, and duplicates with seen = 0 are in their correct place and count is the number of other elements with the same value.
The diagrams are color coded to illustrate the sub-processes of bitonic sort. I'll call the blue blocks a phase (there are three phases in each run in the diagrams). In general, there will be ceil(log2(N)) phases for each run. Each phase consists of a number of green blocks (I'll call these out-in blocks, because the shape of the comparisons is out to in), and red blocks (I'll call these constant blocks, because the distance between elements to compare remains constant).
From the diagram, the out-in block size (elements in each block) starts at 2 and doubles in each pass. The constant block size for each pass starts at half the out-in block size (in the second (blue block) phase, there are 2 elements in each of the four red blocks, because the green blocks have a size of 4) and halves for each successive vertical line of red blocks within the phase. Also, the number of successive vertical lines of the constant (red) blocks in a phase is always the same as the phase number with 0 indexing (0 vertical lines of red blocks for phase 0, 1 vertical line of red blocks for phase 1, and 2 vertical lines of red blocks for phase 2 -- each vertical line is an iteration of calling that kernel).
You can then make kernels for the out-in passes, and the constant passes, then invoke the kernels from the host side (because you need to constantly synchronise, which is a disadvantage, but you should still see large performance improvements over sequential implementations).
From the host side, the overall bitonic sort might look like:
cl_uint num_elements = 4; // Set number of elements
cl_uint phases = (cl_uint)ceil((float)log2(num_elements));
cl_uint out_in_block_size = 2;
cl_uint constant_block_size;
// Set the elements_buffer, which should have been created with
// with clCreateBuffer, as the first kernel argument, and the
// number of elements as the second kernel argument
clSetKernelArg(out_in_kernel, 0, sizeof(cl_mem), (void*)(&elements_buffer));
clSetKernelArg(out_in_kernel, 1, sizeof(cl_uint), (void*)(&num_elements));
clSetKernelArg(constant_kernel, 0, sizeof(cl_mem), (void*)(&elements_buffer));
clSetKernelArg(constant_kernel, 1, sizeof(cl_uint), (void*)(&num_elements));
// For each pass
for (unsigned int phase = 0; phase < phases; ++phase) {
    // -------------------- Green Part ------------------------ //
    // Set the out_in_block size for the kernel
    clSetKernelArg(out_in_kernel, 2, sizeof(cl_uint), (void*)(&out_in_block_size));
    // Call the kernel - command_queue is the cl_command_queue
    // which should have been created during cl setup
    clEnqueueNDRangeKernel(command_queue     , // cl_command_queue
                           out_in_kernel     , // The kernel
                           1                 , // Work dim = 1 since 1D array
                           NULL              , // No global offset
                           &global_work_size , // Total work-items (size_t, chosen during setup)
                           &local_work_size  , // Work-group size (size_t, chosen during setup)
                           0                 , // No events to wait on
                           NULL              ,
                           NULL);
    clFinish(command_queue); // Synchronise (host side)
    // ---------------------- End Green Part -------------------- //
    // Set the block size for constant blocks based on the out_in_block_size
    constant_block_size = out_in_block_size / 2;
    // -------------------- Red Part ------------------------ //
    for (unsigned int i = 0; i < phase; ++i) {
        // Set the constant_block_size as a kernel argument
        clSetKernelArg(constant_kernel, 2, sizeof(cl_uint), (void*)(&constant_block_size));
        // Call the constant kernel
        clEnqueueNDRangeKernel(command_queue     , // cl_command_queue
                               constant_kernel   , // The kernel
                               1                 , // Work dim = 1 since 1D array
                               NULL              , // No global offset
                               &global_work_size ,
                               &local_work_size  ,
                               0                 ,
                               NULL              ,
                               NULL);
        clFinish(command_queue); // Synchronise (host side)
        // Update constant_block_size for next iteration
        constant_block_size /= 2;
    }
    // ------------------- End Red Part ---------------------- //
    // The out-in block size doubles for the next phase
    out_in_block_size *= 2;
}
And then the kernels would be something like (you also need to put the struct typedef in the kernel file so that the OpenCL compiler knows what 'Element' is):
__kernel void out_in_kernel(__global Element* elements, unsigned int num_elements, unsigned int block_size) {
    const unsigned int idx_upper = /* index of upper element in diagram */;
    const unsigned int idx_lower = /* index of lower element in diagram */;
    // Check that both indices are in range (this depends on thread mapping)
    if (idx_upper < num_elements && idx_lower < num_elements) {
        // Do the comparison
        if (!compare(elements + idx_upper, elements + idx_lower)) {
            // Swap the elements
            Element temp = elements[idx_upper];
            elements[idx_upper] = elements[idx_lower];
            elements[idx_lower] = temp;
        }
    }
}
The constant_kernel will look the same, but the thread mapping (how you determine idx_upper and idx_lower) will be different. There are many ways you can map the threads to the elements generally to mimic the diagrams (note that the number of threads required is half the total number of elements, since each thread can do one comparison).
Another consideration is how to make the thread mapping general (so that if you have a number of elements which is not a power of two the algorithm doesn't break).
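To make the pass structure concrete, here is a sequential C++ sketch of the schedule described above: one out-in pass per phase, followed by `phase` constant passes with a halving block size. It is a CPU illustration only. The function names are made up, each inner loop iteration stands in for one work-item, and it compares plain values with `>` instead of using the duplicate-counting compare() function. Out-of-range partners are simply skipped, which is one possible way to handle sizes that are not a power of two.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// One "out-in" (green) pass: within each block, compare element i with its
// mirror image at the other end of the block.
void out_in_pass(std::vector<unsigned long long>& a, std::size_t block_size) {
    for (std::size_t start = 0; start < a.size(); start += block_size)
        for (std::size_t i = 0; i < block_size / 2; ++i) {
            std::size_t upper = start + i;
            std::size_t lower = start + block_size - 1 - i;
            if (lower < a.size() && a[upper] > a[lower]) std::swap(a[upper], a[lower]);
        }
}

// One "constant" (red) pass: within each block, compare elements that are
// half a block apart.
void constant_pass(std::vector<unsigned long long>& a, std::size_t block_size) {
    for (std::size_t start = 0; start < a.size(); start += block_size)
        for (std::size_t i = 0; i < block_size / 2; ++i) {
            std::size_t upper = start + i;
            std::size_t lower = upper + block_size / 2;
            if (lower < a.size() && a[upper] > a[lower]) std::swap(a[upper], a[lower]);
        }
}

// Host-side schedule: phase p runs one out-in pass, then p constant passes.
void bitonic_sort(std::vector<unsigned long long>& a) {
    std::size_t out_in_block_size = 2;
    for (unsigned phase = 0; out_in_block_size / 2 < a.size(); ++phase) {
        out_in_pass(a, out_in_block_size);
        std::size_t constant_block_size = out_in_block_size / 2;
        for (unsigned i = 0; i < phase; ++i) {
            constant_pass(a, constant_block_size);
            constant_block_size /= 2;
        }
        out_in_block_size *= 2;
    }
}
```

All comparisons inside a single pass touch disjoint pairs of elements, which is why each pass can be issued as one kernel launch with a synchronisation point between passes.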
How about boost.compute or VexCL? Both provide sorting algorithms.
Mergesort works quite well on GPUs and you could modify it to sort key+count instead of keys only. During merging you would then also check if two keys are identical and, if so, fuse them into a single key. (If you merge [9/c:1, 42/c:1] and [13/c:1,42/c:1] you would get [9/c:1,13/c:1,42/c:2] )
You might have to use parallel prefix sum to remove the gaps caused by fusing keys.
Or: Use a regular GPU sort first, mark all keys where the key to its right is different (this is only true at the last key of each unique key), use parallel prefix sum to get consecutive indexes for all unique keys and note their position in the sorted array. Then you only need to subtract the index of the previous unique key to get the count.
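The last suggestion (sort first, mark the last key of each run, then take differences of positions) can be sketched sequentially in C++. This is only a CPU illustration under stated assumptions: `std::sort` stands in for the GPU sort, the name `sort_and_count` is made up, and the running `previous_end` variable plays the role that the parallel prefix sum and the "position of the previous unique key" would play on the device.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Sort, mark the last occurrence of each key, and subtract the end of the
// previous run to get the count of the current key.
std::vector<std::pair<std::uint64_t, std::uint64_t>>
sort_and_count(std::vector<std::uint64_t> keys) {
    std::sort(keys.begin(), keys.end());  // stands in for the GPU sort

    std::vector<std::pair<std::uint64_t, std::uint64_t>> result;
    std::size_t previous_end = 0;  // one past the last index of the previous key
    for (std::size_t i = 0; i < keys.size(); ++i) {
        // A key is marked when its right neighbour differs (or there is none).
        const bool last_of_run = (i + 1 == keys.size()) || (keys[i] != keys[i + 1]);
        if (last_of_run) {
            result.emplace_back(keys[i], i + 1 - previous_end);
            previous_end = i + 1;
        }
    }
    return result;
}
```

For the input [42, 13, 9, 42] this yields (9,1), (13,1), (42,2), the output asked for in the question.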

matlab code optimization - clustering algorithm KFCG

I have a large set of vectors (orientation data in an axis-angle representation... the axis is the vector). I want to apply a clustering algorithm to. I tried kmeans but the computational time was too long (never finished). So instead I am trying to implement KFCG algorithm which is faster (Kirke 2010):
Initially we have one cluster with the entire training vectors and the codevector C1 which is centroid. In the first iteration of the algorithm, the clusters are formed by comparing first element of training vector Xi with first element of code vector C1. The vector Xi is grouped into the cluster 1 if xi1< c11 otherwise vector Xi is grouped into cluster2 as shown in Figure 2(a) where codevector dimension space is 2. In second iteration, the cluster 1 is split into two by comparing second element Xi2 of vector Xi belonging to cluster 1 with that of the second element of the codevector. Cluster 2 is split into two by comparing the second element Xi2 of vector Xi belonging to cluster 2 with that of the second element of the codevector as shown in Figure 2(b). This procedure is repeated till the codebook size is reached to the size specified by user.
I'm unsure what ratio is appropriate for the codebook, but it shouldn't matter for the code optimization. Also note mine is 3-D so the same process is done for the 3rd dimension.
My code attempts
I've tried implementing the above algorithm into Matlab 2013 (Student Version). Here's some different structures I've tried - BUT take way too long (have never seen it completed):
%training vectors:
Atgood = Nx4 vector (see test data below if want to test);
vecA = Atgood(:,1:3);
roA = size(vecA,1);
%Codebook size, Nsel, is ratio of data
Nseltemp = remainFrac2*roA; %codebook size
%Ensure selected size after nearest power of 2 is NOT greater than roA
if 2^round(log2(Nseltemp)) < roA
    NselIter = round(log2(Nseltemp));
else
    NselIter = ceil(log2(Nseltemp)-1);
end
Nsel = 2^NselIter; %power of 2 - for LGB and other algorithms
%%cluster = cell(1,Nsel); %Unsure #rows - Don't know how to initialize if need mean...
codevec(1,1:3) = mean(vecA,1);
for kk = 1:NselIter
hh2 = 1:2:size(codevec,1)*2;
for hh1 = 1:length(hh2)
% for ii = 1:roA
% if vecA(ii,ind) < codevec(hh1,ind)
% cluster{1,hh}(count1,1:4) = Atgood(ii,:); %want all 4 elements
% count1=count1+1;
% else
% cluster{1,hh+1}(count2,1:4) = Atgood(ii,:); %want all 4
% count2=count2+1;
% end
% end
%EDIT: My ATTEMPT at optimizing above for loop:
splitind = vecA(:,ind)>=repcv;
splitind2 = vecA(:,ind)<repcv;
end
clear codevec
%Only mean the 1x3 vector portion of the cluster - for centroid
codevec = cell2mat((cellfun(@(x) mean(x(:,1:3),1),cluster,'UniformOutput',false))');
if ind < 3
ind = ind+1;
end
end
if length(codevec) ~= Nsel
warning('codevec ~= Nsel');
end
Alternatively, instead of cells I thought 3D Matrices would be faster? I tried but it was slower using my method of appending the next row each iteration (temp=[]; for...temp=[temp;new];)
Also, I wasn't sure what was best to loop with, for or while:
%If initialize cell to full length
while length(find(~cellfun('isempty',cluster))) < Nsel
Well, anyways, the first method was fastest for me.
Is the logic standard? Not in the sense that it matches with the algorithm described, but from a coding perspective, any weird methods I employed (especially with those multiple inner loops) that slows it down? Where can I speed up (you can just point me to resources or previous questions)?
My array size, Atgood, is 1,000,000x4 making NselIter=19; - do I just need to find a way to decrease this size or can the code be optimized?
Should this be asked on CodeReview? If so, I'll move it.
Testing Data
Here's some random vectors you can use to test:
for ii=1:1000 %My size is ~ 1,000,000
    omega = 2*rand(3,1)-1;
    omega = (omega/norm(omega))';
    Atgood(ii,1:4) = [omega,57];
end
Your biggest issue is re-iterating through all of vecA FOR EACH CODEVECTOR, rather than just the ones that are part of the corresponding cluster. You're supposed to split each cluster on its codevector. As it is, your cluster structure grows and grows, and each iteration is processing more and more samples.
Your second issue is the loop around the comparisons, and the appending of samples to build up the clusters. Both of those can be solved by vectorizing the comparison operation. Oh, I just saw your edit, where this was optimized. Much better. But codevec(hh1,ind) is just a scalar, so you don't even need the repmat.
Try this version:
% (preallocs added in edit)
cluster = cell(1,Nsel);
codevec = zeros(Nsel, 3);
codevec(1,:) = mean(Atgood(:,1:3),1);
cluster{1} = Atgood;
nClusters = 1;
ind = 1;
while nClusters < Nsel
    for c = 1:nClusters
        lower_cluster_logical = cluster{c}(:,ind) < codevec(c,ind);
        cluster{nClusters+c} = cluster{c}(~lower_cluster_logical,:);
        cluster{c} = cluster{c}(lower_cluster_logical,:);
        codevec(c,:) = mean(cluster{c}(:,1:3), 1);
        codevec(nClusters+c,:) = mean(cluster{nClusters+c}(:,1:3), 1);
    end
    ind = rem(ind,3) + 1;
    nClusters = nClusters*2;
end
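For anyone who wants to try the same splitting scheme outside MATLAB, here is a rough C++ sketch of the idea: each cluster is split only on its own codevector, and the comparison dimension cycles through the three coordinates. `Vec3`, `centroid` and `kfcg` are names made up for this illustration, only the 3-element vector part is kept, and giving an empty cluster a zero centroid is an assumption of the sketch, not part of the algorithm description.

```cpp
#include <array>
#include <cstddef>
#include <utility>
#include <vector>

using Vec3 = std::array<double, 3>;

// Centroid of a cluster (the zero vector if the cluster is empty).
Vec3 centroid(const std::vector<Vec3>& cluster) {
    Vec3 c = {0.0, 0.0, 0.0};
    if (cluster.empty()) return c;
    for (const Vec3& v : cluster)
        for (int d = 0; d < 3; ++d) c[d] += v[d];
    for (int d = 0; d < 3; ++d) c[d] /= static_cast<double>(cluster.size());
    return c;
}

// Splits the data into 2^iterations clusters. Every cluster is compared
// against its own codevector only, one coordinate per iteration.
std::vector<std::vector<Vec3>> kfcg(const std::vector<Vec3>& data, int iterations) {
    std::vector<std::vector<Vec3>> clusters = {data};
    for (int it = 0; it < iterations; ++it) {
        const int dim = it % 3;  // cycle through x, y, z
        std::vector<std::vector<Vec3>> next;
        next.reserve(clusters.size() * 2);
        for (const std::vector<Vec3>& cluster : clusters) {
            const Vec3 code = centroid(cluster);
            std::vector<Vec3> lower, upper;
            for (const Vec3& v : cluster)
                (v[dim] < code[dim] ? lower : upper).push_back(v);
            next.push_back(std::move(lower));
            next.push_back(std::move(upper));
        }
        clusters = std::move(next);
    }
    return clusters;
}
```

Each vector is touched once per iteration, so the total work is proportional to the number of vectors times the number of iterations. Note that the cluster ordering differs from the MATLAB version, which stores the upper halves at index nClusters+c.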

how to read all 1's in an Array of 1's and 0's spread-ed all over the array randomly

I have an Array with 1 and 0 spread over the array randomly.
int arr[N] = {1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,0,0,0,1....................N}
Now I want to retrieve all the 1's in the array as fast as possible, but the condition is that I must not lose the exact position (based on index) in the array, so sorting is not a valid option.
So the only option left is linear searching ie O(n) , is there anything better than this.
The main problem behind linear scan is , I need to run the scan even
for X times. So I feel I need to have some kind of other datastructure
which maintains this list once the first linear scan happens, so that
I need not to run the linear scan again and again.
Let me be clear about final expectations-
I just need to find the number of 1's in a certain range of array , precisely I need to find numbers of 1's in the array within range of 40-100. So this can be random range and I need to find the counts of 1 within that range. I can't do sum and all as I need to iterate over the array over and over again because of different range requirements
I'm surprised you considered sorting as a faster alternative to linear search.
If you don't know where the ones occur, then there is no better way than linear searching. Perhaps if you used bits or char datatypes you could do some optimizations, but it depends on how you want to use this.
The best optimization that you could do on this is to avoid branch misprediction. Because each value is zero or one, you can use it to advance the index of the array that is used to store the one-indices.
Simple approach:
int end = 0;
int indices[N];
for( int i = 0; i < N; i++ )
    if( arr[i] ) indices[end++] = i; // Slow due to branch misprediction
Without branching:
int end = 0;
int indices[N];
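for( int i = 0; i < N; i++ ) {
    indices[end] = i;   // always write; overwritten next time unless arr[i] is 1
    end += arr[i];      // only advance past the slot when arr[i] is 1
}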
[edit] I tested the above, and found the version without branching was almost 3 times faster (4.36s versus 11.88s for 20 repeats on a randomly populated 100-million element array).
Coming back here to post results, I see you have updated your requirements. What you want is really easy with a dynamic programming approach...
All you do is create a new array that is one element larger, which stores the number of ones from the beginning of the array up to (but not including) the current index.
arr : 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 1
count : 0 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 4 5 6 6 6 6 7
(I've offset arr above so it lines up better)
Now you can compute the number of 1s in any range in O(1) time. To compute the number of 1s between index A and B, you just do:
int num = count[B+1] - count[A];
Obviously you can still use the non-branch-prediction version to generate the counts initially. All this should give you a pretty good speedup over the naive approach of summing for every query:
int *count = new int[N+1];
int total = 0;
count[0] = 0;
for( int i = 0; i < N; i++ ) {
    total += arr[i];
    count[i+1] = total;
}
// to compute the ranged sum:
int range_sum( int *count, int a, int b ) {
    if( b < a ) return range_sum(count,b,a);
    return count[b+1] - count[a];
}
Well one time linear scanning is fine. Since you are looking for multiple scans across ranges of array I think that can be done in constant time. Here you go:
Scan the array and create a map where key = index of the array = sequence (1,2,3,4,5,6....). The value stored in the map would be a tuple<isOne,cumulativeSum> where isOne is whether you have a one there and cumulativeSum is the running count of 1's as and when you encounter them
Array = 1 1 0 0 1 0 1 1 1 0 1 0
Tuple: (1,1) (1,2) (0,2) (0,2) (1,3) (0,3) (1,4) (1,5) (1,6) (0,6) (1,7) (0,7)
CASE 1: When lower bound of cumulativeSum has a 0. Number of 1's [6,11] =
cumulativeSum at 11th position - cumulativeSum at 6th position = 7 - 3 = 4
CASE 2: When lower bound of cumulativeSum has a 1. Number of 1's [2,11] =
cumulativeSum at 11th position - cumulativeSum at 2nd position + 1 = 7-2+1 = 6
Step 1 is O(n)
Step 2 is O(1)
Total complexity is linear no doubt but for your task where you have to work with the ranges several times the above Algorithm seems to be better if you have ample memory :)
Does it have to be a simple linear array data structure? Or can you create your own data structure which happens to have the desired properties, for which you're able to provide the required API, but whose implementation details can be hidden (encapsulated)?
If you can implement your own and if there is some guaranteed sparsity (to either 1s or 0s) then you might be able to offer better than linear performance. I see that you want to preserve (or be able to regenerate) the exact stream, so you'll have to store an array or bitmap or run-length encoding for that. (RLE will be useless if the stream is actually random rather than arbitrary but could be quite useful if there are significant sparsity or patterns with long strings of one or the other. For example a black&white raster of a bitmapped image is often a good candidate for RLE).
Let's say that you're guaranteed that the stream will be sparse --- that no more than 10%, for example, of the bits will be 1s (or, conversely, that more than 90% will be). If that's the case then you might model your solution on an RLE and maintain a count of all 1s (simply incremented as you set bits and decremented as you clear them). If there might be a need to quickly get the number of set bits for arbitrary ranges of these elements then instead of a single counter you can have a conveniently sized array of counters for partitions of the stream. (Conveniently sized, in this case, means something which fits easily within memory, within your caches, or register sets, but which offers a reasonable trade-off between computing a sum (all the partitions fully within the range) and the linear scan.) The result for any arbitrary range is the sum of all the partitions fully enclosed by the range plus the results of linear scans for any fragments that are not aligned on your partition boundaries.
For a very, very, large stream you could even have a multi-tier "index" of partition sums --- traversing from the largest (most coarse) granularity down toward the "fragments" to either end (using the next layer of partition sums) and finishing with the linear search of only the small fragments.
Obviously such a structure represents trade offs between the complexity of building and maintaining the structure (inserting requires additional operations and, for an RLE, might be very expensive for anything other than appending/prepending) vs the expense of performing arbitrarily long linear search/increment scans.
If the purpose is to be able to find the number of 1s in the array at any time, given that relatively few of the values in the array might change between one moment when you want to know the number and another, and you have to find the number of 1s in a changing array of n values m times, then you can certainly do better than examining every cell in the array m times by using a caching strategy.
The first time you need the number of 1s, you certainly have to examine every cell, as others have pointed out. However, if you then store the number of 1s in a variable (say sum) and track changes to the array (by, for instance, requiring that all array updates occur through a specific update() function), every time a 0 is replaced in the array with a 1, the update() function can add 1 to sum and every time a 1 is replaced in the array with a 0, the update() function can subtract 1 from sum.
Thus, sum is always up-to-date after the first time that the number of 1s in the array is counted and there is no need for further counting.
(EDIT to take the updated question into account)
If the need is to return the number of 1s in a given range of the array, that can be done with a slightly more sophisticated caching strategy than the one I've just described.
You can keep a count of the 1s in each subset of the array and update the relevant subset count whenever a 0 is changed to a 1 or vice versa within that subset. Finding the total number of 1s in a given range within the array would then be a matter of adding the number of 1s in each subset that is fully contained within the range and then counting the number of 1s that are in the range but not in the subsets that have already been counted.
Depending on circumstances, it might be worthwhile to have a hierarchical arrangement in which (say) the number of 1s in the whole array is at the top of the hierarchy, the number of 1s in each 1/q th of the array is in the second level of the hierarchy, the number of 1s in each 1/(q^2) th of the array is in the third level of the hierarchy, etc. e.g. for q = 4, you would have the total number of 1s at the top, the number of 1s in each quarter of the array at the second level, the number of 1s in each sixteenth of the array at the third level, etc.
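A single-level version of this subset-count idea might look like the following C++ sketch. The class name `BlockCounts`, the fixed block size and the inclusive `[a, b]` convention are all choices made for this illustration, and it assumes the values are only ever 0 or 1 and that the block size is greater than zero.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: keeps one counter per fixed-size block of a 0/1 array
// so that range counts stay cheap while individual cells keep changing.
class BlockCounts {
public:
    BlockCounts(const std::vector<int>& bits, std::size_t block_size)
        : bits_(bits), block_size_(block_size),
          counts_((bits.size() + block_size - 1) / block_size, 0) {
        for (std::size_t i = 0; i < bits_.size(); ++i)
            counts_[i / block_size_] += bits_[i];
    }

    // All writes go through update() so the block counters stay in sync.
    void update(std::size_t index, int value) {
        counts_[index / block_size_] += value - bits_[index];
        bits_[index] = value;
    }

    // Number of 1s in the inclusive range [a, b].
    int count(std::size_t a, std::size_t b) const {
        int total = 0;
        std::size_t i = a;
        while (i <= b) {
            if (i % block_size_ == 0 && i + block_size_ - 1 <= b) {
                total += counts_[i / block_size_];  // whole block inside the range
                i += block_size_;
            } else {
                total += bits_[i];                  // fragment at either end
                ++i;
            }
        }
        return total;
    }

private:
    std::vector<int> bits_;
    std::size_t block_size_;
    std::vector<int> counts_;
};
```

An update costs O(1), and a query costs roughly one addition per fully covered block plus a short linear scan at each end. The hierarchical arrangement described above would add coarser levels of counters on top of this.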
Are you using C (or a derived language)? If so, can you control the encoding of your array? You could, for example, use a bitmap to count. The nice thing about a bitmap is that you can use a lookup table to sum the counts, though if your subrange ends aren't divisible by 8, you'll have to deal with the partial bytes at the ends specially, but the speedup will be significant.
If that's not the case, can you at least encode them as single bytes? In that case, you may be able to exploit sparseness if it exists (more specifically, the hope that there are often multi index swaths of zeros).
So for:
u8 input[] = {1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,0,0,0,1....................N};
You can write something like (untested):
uint countBytesBy1FromTo(u8 *input, uint start, uint stop)
{   // function for counting one byte at a time, use it for short ranges,
    // use the function below for longer ranges
    // assume it's just ones and zeros, otherwise we have to test/branch
    uint sum = 0;
    u8 *end = input + stop;
    for (u8 *each = input + start; each < end; each++)
        sum += *each;
    return sum;
}

uint countBytesBy8FromTo(u8 *input, uint start, uint stop)
{
    u64 *chunks = (u64*)(input+start);
    u64 *end = chunks + ((stop - start) >> 3);
    // count the leftover bytes after the last whole 8-byte chunk
    uint sum = countBytesBy1FromTo((u8*)end, 0, (stop - start) & 7);
    for (; chunks < end; chunks++)
        if (*chunks) // skip chunks that are entirely zero
            sum += countBytesBy1FromTo((u8*)chunks, 0, 8);
    return sum;
}
The basic trick is exploiting the ability to cast slices of your target array to single entities your language can look at in one swoop, testing by inference whether ANY of the values in it are nonzero, and skipping the whole block when none are. The more zeros, the better it will work. In the case where your large cast integer always has at least one 1, this approach just adds overhead. You might find that using a u32 is better for your data. Or that adding a u32 test between the 1 and 8 helps. For datasets where zeros are much more common than ones, I've used this technique to great advantage.
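Casting a byte pointer to a 64-bit pointer can run into alignment and strict-aliasing problems on some platforms. A variant of the same idea that sidesteps this, sketched here in C++ under the assumption that every byte is 0 or 1, copies each 8-byte chunk into an integer with `memcpy` before testing it. The function name and the half-open `[start, stop)` convention are choices made for this illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Counts the 1s in input[start, stop), where every byte is 0 or 1.
std::size_t count_ones_skipping_zero_chunks(const std::uint8_t* input,
                                            std::size_t start, std::size_t stop) {
    std::size_t sum = 0;
    std::size_t i = start;
    for (; i + 8 <= stop; i += 8) {
        std::uint64_t chunk;
        std::memcpy(&chunk, input + i, sizeof chunk);  // works for any alignment
        if (chunk == 0) continue;                      // eight zeros: skip them all
        for (std::size_t k = 0; k < 8; ++k) sum += input[i + k];
    }
    for (; i < stop; ++i) sum += input[i];             // leftover bytes
    return sum;
}
```

Whether this beats a plain byte loop depends on how sparse the data is, so it is worth measuring on real input.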
Why is sorting invalid? You can clone the original array, sort the clone, and count and/or mark the locations of the 1s as needed.

Optimize code performance when odd/even threads are doing different things in CUDA

I have two large vectors, I am trying to do some sort of element multiplication, where an even-numbered element in the first vector is multiplied by the next odd-numbered element in the second vector... and where the odd-numbered element in the first vector is multiplied by the preceding even-numbered element in the second vector.
For example:
vector 1 is V1(1) V1(2) V1(3) V1(4)
vector 2 is V2(1) V2(2) V2(3) V2(4)
V1(1) * V2(2)
V1(3) * V2(4)
V1(2) * V2(1)
V1(4) * V2(3)
I have written Cuda code to do this (Pds has the elements of the first vector in shared memory, Nds the second Vector):
// instead of % 2, checking the first bit to decide if a number
// is odd/even is faster
if ((tx & 0x0001) == 0x0000)
    Nds[tx+1] = Pds[tx] * Nds[tx+1];
else
    Nds[tx-1] = Pds[tx] * Nds[tx-1];
Is there any way to further accelerate this code or avoid divergence?
You should be able to eliminate the branch like this:
int tx_index = tx ^ 1; // equivalent to: tx_index = (tx & 1) ? tx - 1 : tx + 1
Nds[tx_index] = Pds[tx] * Nds[tx_index];
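If you want to convince yourself that flipping the lowest bit picks the same partner as the branch, a quick host-side check in C++ is enough. The two function names below are made up for this illustration.

```cpp
// Partner index computed with a branch, as in the question.
int partner_with_branch(int tx) {
    if ((tx & 1) == 0) return tx + 1;  // even thread pairs with the next one
    return tx - 1;                     // odd thread pairs with the previous one
}

// Partner index computed by flipping the lowest bit, as in the answer.
int partner_with_xor(int tx) { return tx ^ 1; }
```

Both functions agree for every non-negative tx, so the XOR form can replace the if/else without changing which elements are multiplied.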
This is an old post, but maybe someone will find my answer useful. If tx in your code is threadIdx, then you have branching, i.e. warp divergence. You should avoid divergence within a warp, because it serializes execution: the threads with even indices run first, and then the threads with odd indices run. If tx is threadIdx, try to change your algorithm so that the branching depends on blockIdx.
