Sorting and Counting Elements in OpenCL - sorting

I want to create an OpenCL kernel that sorts and counts millions of ulong values.
Is there a particular algorithm that fits my needs, or should I go for a hash table?
To be clear, given the following input:
[42, 13, 9, 42]
I would like to get an output like this:
[(9,1), (13,1), (42,2)]
My first idea was to modify Counting Sort, which already counts in order to sort, but because of the wide range of ulong values it requires too much memory. Bitonic or Radix sort plus something to count elements could be a way, but I lack a fast way to count the elements. Any suggestions on this?
Extra notes:
I'm developing using an NVIDIA Tesla K40C GPU and a Terasic DE5-Net FPGA. So far the main goal is to make it work on the GPU but I'm also interested in solutions that might be a nice fit for FPGAs.
I know that some values inside the range of ulong aren't used so we can use them to mark invalid elements or duplicates.
I want to consume the output from the GPU using multiple threads on the CPU, so I would like to avoid any solution that requires post-processing (on the host side, I mean) with data dependencies spread around the output.

This solution requires two runs of bitonic sort, both to count the duplicates and to remove them (well, move them to the end of the array). A bitonic sorting network has a depth of O(log(n)^2) parallel steps, each performing n/2 independent comparisons, so two runs take roughly 2(log(n)^2) steps, which shouldn't be a problem unless you are running this in a loop.
Create a simple struct for each of the elements, to include the number of duplicates, and if the element has been added as a duplicate, something like:
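// Note: OpenCL C does not support bit-fields, so 'seen' gets its
// own member here. If you are worried about space, or know that
// there will only be a few duplicates for each element, then
// make the count and seen members smaller
typedef struct {
    cl_ulong value;
    cl_ulong count; // number of elements with this value (initially 1)
    cl_ulong seen;  // 1 once this element has been counted as a duplicate
} Element;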
Algorithm:
You can start by creating a comparison function which will move duplicates to the end, and count the duplicates if they have yet to be added to the total count for the element. This is the logic behind the comparison function:
If one element is a duplicate and another is not, return that the non-duplicate element is smaller (regardless of the values), which will move all duplicates to the end.
If the elements are duplicates and the right element has not been marked a duplicate (seen=0), then add the right element's count to the left element's count and set the right element as a duplicate (seen=1). This has the effect of moving the total count of an element with a specific value to the leftmost element in the array with that value, and all duplicates with that value to the end.
Otherwise return that the element with the smaller value, is smaller.
The comparison function would look like:
// The pointers are not const because the function updates count and seen.
// In the kernel file the elements live in global memory, hence __global.
bool compare(__global Element* E1, __global Element* E2) {
    if (!E1->seen && E2->seen) return true;  // E1 smaller
    if (!E2->seen && E1->seen) return false; // E2 smaller
    // If the elements are duplicates and the right element has
    // not yet been "seen" by an element with the same value
    if (E1->value == E2->value && !E2->seen) {
        E1->count += E2->count;
        E2->seen = 1;
        return true;
    }
    // They aren't duplicates, and either
    // neither has been seen, or both have
    return E1->value < E2->value;
}
Bitonic sort has a specific structure, which can be nicely illustrated with a diagram. In the diagram, each element is referred to by a 3-tuple (a,b,c) where a = value, b = count, and c = seen.
Each diagram shows one run of bitonic sort on the array (vertical lines denote a comparison between elements, and horizontal lines move right to the next stage of the bitonic sort). Using the diagram and the above comparison function and logic, you should be able to convince yourself that this does what is required.
Run 1: [diagram not available]
Run 2: [diagram not available]
At the end of run 2, all elements are arranged by value. Duplicates with seen = 1 are at the end, and duplicates with seen = 0 are in their correct place and count is the number of other elements with the same value.
Implementation:
The diagrams are color coded to illustrate the sub-processes of bitonic sort. I'll call the blue blocks a phase (there are three phases in each run in the diagrams). In general, there will be ceil(log2(N)) phases for each run. Each phase consists of a number of green blocks (I'll call these out-in blocks, because the shape of the comparisons is out to in), and red blocks (I'll call these constant blocks, because the distance between elements to compare remains constant).
From the diagram, the out-in block size (elements in each block) starts at 2 and doubles in each phase. The constant block size for each phase starts at half the out-in block size (in the second (blue block) phase, there are 2 elements in each of the four red blocks, because the green blocks have a size of 4) and halves for each successive vertical line of red blocks within the phase. Also, the number of successive vertical lines of constant (red) blocks in a phase is always the same as the phase number with 0 indexing (0 vertical lines of red blocks for phase 0, 1 vertical line of red blocks for phase 1, and 2 vertical lines of red blocks for phase 2; each vertical line is one invocation of that kernel).
You can then make kernels for the out-in passes, and the constant passes, then invoke the kernels from the host side (because you need to constantly synchronise, which is a disadvantage, but you should still see large performance improvements over sequential implementations).
From the host side, the overall bitonic sort might look like:
cl_uint num_elements = 4; // Set number of elements
cl_uint phases = (cl_uint)ceil(log2((double)num_elements));
cl_uint out_in_block_size = 2;
cl_uint constant_block_size;
// Set the elements_buffer, which should have been created with
// clCreateBuffer, as the first kernel argument, and the
// number of elements as the second kernel argument
clSetKernelArg(out_in_kernel, 0, sizeof(cl_mem), (void*)(&elements_buffer));
clSetKernelArg(out_in_kernel, 1, sizeof(cl_uint), (void*)(&num_elements));
clSetKernelArg(constant_kernel, 0, sizeof(cl_mem), (void*)(&elements_buffer));
clSetKernelArg(constant_kernel, 1, sizeof(cl_uint), (void*)(&num_elements));
// For each phase
for (unsigned int phase = 0; phase < phases; ++phase) {
    // -------------------- Green Part ------------------------ //
    // Set the out_in_block size for the kernel
    clSetKernelArg(out_in_kernel, 2, sizeof(cl_uint), (void*)(&out_in_block_size));
    // Call the kernel - command_queue is the cl_command_queue
    // which should have been created during cl setup
    clEnqueueNDRangeKernel(command_queue    , // cl_command_queue
                           out_in_kernel    , // The kernel
                           1                , // Work dim = 1 since 1D array
                           NULL             , // No global offset
                           &global_work_size, // size_t, set during cl setup
                           &local_work_size , // size_t, set during cl setup
                           0                ,
                           NULL             ,
                           NULL);
    clFinish(command_queue); // Synchronise (barrier() only exists inside kernels)
    // ---------------------- End Green Part -------------------- //
    // Set the block size for constant blocks based on the out_in_block_size
    constant_block_size = out_in_block_size / 2;
    // -------------------- Red Part ------------------------ //
    for (unsigned int i = 0; i < phase; ++i) {
        // Set the constant_block_size as a kernel argument
        clSetKernelArg(constant_kernel, 2, sizeof(cl_uint), (void*)(&constant_block_size));
        // Call the constant kernel
        clEnqueueNDRangeKernel(command_queue    , // cl_command_queue
                               constant_kernel  , // The kernel
                               1                , // Work dim = 1 since 1D array
                               NULL             , // No global offset
                               &global_work_size,
                               &local_work_size ,
                               0                ,
                               NULL             ,
                               NULL);
        clFinish(command_queue); // Synchronise
        // Update constant_block_size for next iteration
        constant_block_size /= 2;
    }
    // ------------------- End Red Part ---------------------- //
    // The out-in block size doubles for the next phase
    out_in_block_size *= 2;
}
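And then the kernels would be something like (you also need to put the struct typedef in the kernel file so that the OpenCL compiler knows what 'Element' is):
__kernel void out_in_kernel(__global Element* elements, unsigned int num_elements, unsigned int block_size) {
    // One possible thread mapping (an assumption, other mappings work too):
    // work-item t performs one comparison inside its out-in block
    const unsigned int t = get_global_id(0);
    const unsigned int half_size = block_size / 2;
    const unsigned int base = (t / half_size) * block_size;
    const unsigned int idx_upper = base + (t % half_size);                  // index of upper element in diagram
    const unsigned int idx_lower = base + block_size - 1 - (t % half_size); // index of lower element in diagram
    // Check that both indices are in range (this depends on thread mapping)
    if (idx_upper < num_elements && idx_lower < num_elements) {
        // Do the comparison
        if (!compare(elements + idx_upper, elements + idx_lower)) {
            // Swap the elements
            Element tmp = elements[idx_upper];
            elements[idx_upper] = elements[idx_lower];
            elements[idx_lower] = tmp;
        }
    }
}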
The constant_kernel will look the same, but the thread mapping (how you determine idx_upper and idx_lower) will be different. There are many ways you can map the threads to the elements generally to mimic the diagrams (note that the number of threads required is half the total number of elements, since each thread can do one comparison).
Another consideration is how to make the thread mapping general (so that if you have a number of elements which is not a power of two the algorithm doesn't break).
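To make the thread mapping concrete, here is a minimal sequential sketch in plain C that walks the same out-in and constant passes on the host. It is only an illustration under some assumptions: the names out_in_pair, constant_pair, run_pass and bitonic_sort are made up for this example, each loop iteration t stands in for one work-item, and a plain > comparison on unsigned 64-bit keys replaces the compare function above. Comparisons whose lower index falls outside the array are skipped, which behaves like padding the array with infinitely large values and is one way to handle sizes that are not a power of two.

```c
#include <assert.h>
#include <stddef.h>

/* Index pair for comparison t in an out-in (green) pass:
 * element `off` of a block is compared with its mirror image. */
static void out_in_pair(size_t t, size_t block, size_t *upper, size_t *lower)
{
    size_t half = block / 2;
    size_t base = (t / half) * block; /* first element of this block */
    size_t off = t % half;            /* position inside the first half */
    *upper = base + off;
    *lower = base + block - 1 - off;
}

/* Index pair for comparison t in a constant (red) pass:
 * the partner is always block/2 elements further down. */
static void constant_pair(size_t t, size_t block, size_t *upper, size_t *lower)
{
    size_t half = block / 2;
    *upper = (t / half) * block + t % half;
    *lower = *upper + half;
}

typedef void (*pair_fn)(size_t, size_t, size_t *, size_t *);

/* One pass = one kernel invocation; each t plays the role of a work-item. */
static void run_pass(unsigned long long *a, size_t n, size_t padded,
                     size_t block, pair_fn pair)
{
    for (size_t t = 0; t < padded / 2; t++) {
        size_t upper, lower;
        pair(t, block, &upper, &lower);
        if (lower >= n)
            continue; /* partner lies in the (virtual) padding */
        if (a[upper] > a[lower]) {
            unsigned long long tmp = a[upper];
            a[upper] = a[lower];
            a[lower] = tmp;
        }
    }
}

void bitonic_sort(unsigned long long *a, size_t n)
{
    assert(a != NULL || n == 0);
    size_t padded = 1;
    while (padded < n)
        padded *= 2; /* round up to a power of two */
    for (size_t out_in = 2; out_in <= padded; out_in *= 2) {
        run_pass(a, n, padded, out_in, out_in_pair);
        for (size_t c = out_in / 2; c >= 2; c /= 2)
            run_pass(a, n, padded, c, constant_pair);
    }
}
```

On the device, the body of run_pass would become the kernel and the two nested loops in bitonic_sort would stay on the host.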

How about boost.compute or VexCL? Both provide sorting algorithms.

Mergesort works quite well on GPUs and you could modify it to sort key+count instead of keys only. During merging you would then also check whether two keys are identical and, if so, fuse them into a single key. (If you merge [9/c:1, 42/c:1] and [13/c:1, 42/c:1] you would get [9/c:1, 13/c:1, 42/c:2].)
You might have to use parallel prefix sum to remove the gaps caused by fusing keys.
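As a rough illustration of the fusing merge step, here is a sequential sketch in plain C. The KeyCount struct and the merge_fuse function are hypothetical names for this example, and a GPU mergesort would run many such merges in parallel rather than one loop.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    unsigned long long key;
    unsigned long long count;
} KeyCount;

/* Merges two runs that are sorted by key into out, adding up the counts
 * of equal keys. out needs room for na + nb entries.
 * Returns the number of entries written. */
size_t merge_fuse(const KeyCount *a, size_t na,
                  const KeyCount *b, size_t nb, KeyCount *out)
{
    assert(out != NULL);
    size_t i = 0, j = 0, n = 0;
    while (i < na || j < nb) {
        KeyCount next;
        if (j == nb || (i < na && a[i].key <= b[j].key))
            next = a[i++];
        else
            next = b[j++];
        if (n > 0 && out[n - 1].key == next.key)
            out[n - 1].count += next.count; /* fuse with previous entry */
        else
            out[n++] = next;
    }
    return n;
}
```

Because the output can be shorter than na + nb, a parallel version leaves gaps between merged runs, which is where the prefix sum comes in.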
Or: Use a regular GPU sort first, mark all keys where the key to its right is different (this is only true at the last key of each unique key), use parallel prefix sum to get consecutive indexes for all unique keys and note their position in the sorted array. Then you only need to subtract the index of the previous unique key to get the count.
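The second variant can be sketched sequentially in C as follows. This assumes the keys are already sorted; count_sorted is a made-up name, and each loop corresponds to one data-parallel step (flagging plus prefix sum, scatter, subtraction) that would be a separate kernel on the GPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* keys must be sorted. Writes each distinct key and its count;
 * out_keys and out_counts need room for n entries.
 * Returns the number of distinct keys (0 also if allocation fails). */
size_t count_sorted(const unsigned long long *keys, size_t n,
                    unsigned long long *out_keys, size_t *out_counts)
{
    assert(keys != NULL && out_keys != NULL && out_counts != NULL);
    size_t *slot = malloc((n + 1) * sizeof *slot);
    if (slot == NULL)
        return 0;

    /* Flag the last key of each run and take an exclusive prefix sum
     * of the flags: slot[i] is the output index for position i. */
    slot[0] = 0;
    for (size_t i = 0; i < n; i++) {
        int is_last = (i + 1 == n) || (keys[i] != keys[i + 1]);
        slot[i + 1] = slot[i] + (size_t)is_last;
    }
    size_t unique = slot[n];

    /* Scatter: every flagged position writes its key and end position. */
    for (size_t i = 0; i < n; i++) {
        if (slot[i + 1] != slot[i]) {
            out_keys[slot[i]] = keys[i];
            out_counts[slot[i]] = i + 1; /* end position, for now */
        }
    }

    /* Subtract the end position of the previous unique key
     * (back to front, so it can be done in place). */
    for (size_t k = unique; k-- > 1; )
        out_counts[k] -= out_counts[k - 1];

    free(slot);
    return unique;
}
```

The output is already compact and ordered by key, so host threads can consume disjoint slices of it independently.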

Related

Fill device array consecutively in CUDA

(This might be more of a theoretical parallel optimization problem than a CUDA-specific problem per se. I'm very new to parallel programming in general, so this may just be personal ignorance.)
I have a workload that consists of 64-bit binary numbers upon which I run analysis. If the analysis completes successfully then that binary number is a "valid solution". If the analysis breaks midway then the number is "invalid". The end goal is to get a list of all the valid solutions.
Now there are many trillions of 64-bit binary numbers I am analyzing, but only ~5% or less will be valid solutions, and they usually come in bunches (i.e. every consecutive 1000 numbers are valid and then every random billion or so are invalid). I can't find a pattern to the space between bunches, so I can't ignore the large chunks of invalid solutions.
Currently, every thread in a kernel call analyzes just one number. If the number is valid, the thread marks it as such in its respective place in a device array. Ditto if it's invalid. So basically I generate a data point for every value analyzed, regardless of whether it's valid or not. Then once the array is full I copy it to the host only if a valid solution was found (denoted by a flag on the device). With this, overall throughput is greatest when the array is the same size as the number of threads in the grid.
But copying memory to and from the GPU is expensive time-wise. That said, what I would like to do is copy data over only when necessary: I want to fill up a device array with only valid solutions and, once the array is full, copy it over to the host. But how do you consecutively fill an array in a parallel environment? Or am I approaching this problem the wrong way?
EDIT 1
This is the kernel I initially developed. As you can see, I am generating 1 byte of data for each value analyzed. I really only need each 64-bit number that is valid; if need be I can make a new kernel. As suggested by some of the commenters, I am currently looking into stream compaction.
__global__ void kValid(unsigned long long*kInfo, unsigned char*values, char *solutionFound) {
//a 64 bit binary value to be evaluated is called a kValue
unsigned long long int kStart, kEnd, kRoot, kSize, curK;
//kRoot is the kValue at the start of the device array; this is used if the device array is larger than the total threads in the grid
//kStart is the kValue to start this kernel call on
//kEnd is the last kValue to validate
//kSize is how many bits long is kValue (we don't necessarily use all 64 bits but this value stays constant over the entire chunk of values defined on the host
//curK is the current kValue represented as a 64 bit unsigned integer
int rowCount, kBitLocation, kMirrorBitLocation, row, col, nodes, edges;
kStart = kInfo[0];
kEnd = kInfo[1];
kRoot = kInfo[2];
nodes = kInfo[3];
edges = kInfo[4];
kSize = kInfo[5];
curK = blockIdx.x*blockDim.x + threadIdx.x + kStart;
if (curK > kEnd) {//check to make sure you don't overshoot the end value
return;
}
kBitLocation = 1;//assuming the first bit in the kvalue has a position 1;
for (row = 0; row < nodes; row++) {
rowCount = 0;
kMirrorBitLocation = row;//the bit position for the mirrored kvals is always starts at the row value (assuming the first row has a position of 0)
for (col = 0; col < nodes; col++) {
if (col > row) {
if (curK & (1ULL << (kSize - kBitLocation))) {//1ULL keeps the shift 64 bits wide; add one to kIterator to convert to counting space
rowCount++;
}
kBitLocation++;
}
if (col < row) {
if (col > 0) {
kMirrorBitLocation += (nodes - 2) - (col - 1);
}
if (curK & (1ULL << (kSize - kMirrorBitLocation))) {//if bit is set (1ULL keeps the shift 64 bits wide)
rowCount++;
}
}
}
if (rowCount != edges) {
//set the ith bit to zero
values[curK - kRoot] = 0;
return;
}
}
//set the ith bit to one
values[curK - kRoot] = 1;
*solutionFound = 1; //not a race condition b/c it will only ever be set to 1 by any thread.
}
(This answer assumes output order is inconsequential and so are the positions of the valid values.)
Conceptually, your analysis produces a set of valid values. The implementation you described uses a dense representation of this set: One bit for every potential value. Yet you've indicated that the data is quite sparse (either 5e-2 or 1000/10^9 = 1e-6); moreover, copying data across PCI express is quite a pain.
Well, then, why not consider a sparse representation? The simplest one would be merely an unordered sequence of the valid values. Of course, writing that requires some synchronization across threads - perhaps even across blocks. Roughly, you can have warps collect their valid values in shared memory; then synchronize at the block level to collect the block's valid values (for a given chunk of the input it has analyzed); and finally use atomics to collect the data from all the blocks.
Oh, also - have each thread analyze multiple values, so you don't have to do that much synchronization.
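The core of the sparse representation is "reserve the next free slot atomically, then write". Here is a minimal host-side sketch of that idea using C11 atomics; ResultList and its functions are hypothetical names for illustration. On the GPU the counterpart would be an atomicAdd on a counter in device memory, ideally once per warp or block rather than once per value.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct {
    unsigned long long *values; /* output buffer for valid solutions */
    size_t capacity;            /* number of slots in values */
    atomic_size_t next;         /* next free slot, shared by all threads */
} ResultList;

void result_list_init(ResultList *list, unsigned long long *buffer,
                      size_t capacity)
{
    assert(list != NULL && (buffer != NULL || capacity == 0));
    list->values = buffer;
    list->capacity = capacity;
    atomic_init(&list->next, 0);
}

/* Appends value; returns 1 on success, 0 if the buffer is full.
 * Safe to call from several threads at once; the order of the
 * stored values is then unspecified. */
int result_list_append(ResultList *list, unsigned long long value)
{
    size_t slot = atomic_fetch_add(&list->next, 1); /* reserve a slot */
    if (slot >= list->capacity)
        return 0;
    list->values[slot] = value;
    return 1;
}

/* Number of valid entries (the counter may overshoot when full). */
size_t result_list_size(ResultList *list)
{
    size_t n = atomic_load(&list->next);
    return n < list->capacity ? n : list->capacity;
}
```

A failed append is the signal to copy the buffer to the host and reset the counter.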
So, you would want to have each thread analyze multiple numbers (thousands or millions) before you return from the computation. If you analyze a million numbers in a thread, you will only need 5% of that amount of space to hold the results of that computation.

Random logic engine implementation ideas

I'm trying to find an efficient random logic algorithm for this scenario. The programming language doesn't matter.
Say I have a 20-element array filled with numbers:
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
From this I need to construct a 15-element array each time, BUT
each time I specify numbers that must be in the new array, and the remaining slots are filled with random numbers from the master array.
For example:
In the new array the numbers that must be in are: 1,11,13,20,8,9
so the new array will be:
[1,N,N,11,N,20,8,N,9,N,N,N,13,N,N]
Where the Ns are random numbers from ALL 20 elements of the Master array.
Another example:
given 2,18,17,9,5
create new 10 element array:
[2,2,18,2,11,17,20,5,5,9]
No problem with duplicate elements
I'm trying to find some good algorithm for this.
If you want to receive one random number at a time and don't want to create the full result array up front, an alternative to my other answer is this:
Get a random number in the range 0..requested_number-1 (where requested_number is the total number of elements still to fetch).
If this index is below length(required), print that element of the array required, then remove it from the array;
.. else the index is >= length(required), so pick any random number out of the optional array.
Decrease requested_number and repeat until it reaches 0.
You need 2 calls to random: the first selects an index in 0..requested_number-1, which tells you from which array to pick a value; the second, needed only for optional, picks a random element out of the entire available range.
Here is a basic implementation in C (footnote: using mod on rand() does not yield A Good Random Number, but it'll do for this example).
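#include <errno.h>  /* EDOM */
#include <stdio.h>  /* printf */
#include <stdlib.h> /* rand, srand */
#include <time.h>   /* time */
int main()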
{
int optional[] = { 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20 };
int required[] = { 21,22,23,24,25 };
int requested_number = 15;
int take_from_required, optional_size, next;
srand(time(NULL));
if (requested_number < sizeof(required)/sizeof(required[0]))
{
printf ("requested number of elements must be at least as large as required array\n");
return EDOM;
}
/* Use this much from 'required': */
take_from_required = sizeof(required)/sizeof(required[0]);
/* Use this much from 'optional': */
optional_size = sizeof(optional)/sizeof(optional[0]);
while (requested_number > 0)
{
/* Please note this is a fairly bad 'random'!
As discussed many times before on SO. */
next = rand() % requested_number;
/* Take from which array? */
if (next >= take_from_required)
{
printf ("%d\n", optional[rand() % optional_size]);
} else
{
printf ("%d (required)\n", required[next]);
required[next] = required[take_from_required-1];
take_from_required--;
}
requested_number--;
}
return 0;
}
If I understand correctly, this is the issue:
optional [ 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20 ]
required [ 2,18,17,9,5 ]
Now construct a new array containing at least all elements of required, and filled to its capacity with elements taken from optional.
The problem seems to be that you need to take out random numbers from either required or optional and at the same time make sure required is empty at the end. [*]
Create a new array result (which needs to be at least as long as required -- then again, that can be inferred from the question). Copy all elements of required into it; fill the rest with random elements from optional.
At this point, you fulfill the primary condition, but the elements of required always appear first. So, as a last step, shuffle the elements now stored in the result array (for example, with the well-known Fisher-Yates shuffle).
[*] 'Empty', because all numbers in required must be used at least once. Taking them "out" of the array is the easiest way to make sure this happens. Things start to get complicated when (a) you may have duplicates of any number (from both optional and required) and (b) required is not a subset of optional.
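The copy, fill and shuffle steps described above can be sketched in C like this. build_result is a hypothetical name, the caller is assumed to have seeded rand(), and rand() with mod is only a stand-in random source with the usual bias caveat.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Fills result[0..result_len) with every element of required plus random
 * picks from optional, then shuffles so the required ones are not in front. */
void build_result(const int *required, size_t required_len,
                  const int *optional, size_t optional_len,
                  int *result, size_t result_len)
{
    assert(result_len >= required_len);
    assert(optional_len > 0 || result_len == required_len);

    /* Step 1: copy all required elements. */
    for (size_t i = 0; i < required_len; i++)
        result[i] = required[i];

    /* Step 2: fill the rest with random elements (duplicates allowed). */
    for (size_t i = required_len; i < result_len; i++)
        result[i] = optional[(size_t)rand() % optional_len];

    /* Step 3: Fisher-Yates shuffle of the whole result. */
    for (size_t i = result_len; i > 1; i--) {
        size_t j = (size_t)rand() % i; /* 0 <= j < i */
        int tmp = result[i - 1];
        result[i - 1] = result[j];
        result[j] = tmp;
    }
}
```

Since every required element is copied before anything random happens, the "must be present" condition holds regardless of what the random source returns.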

Data structure for set of (non-disjoint) sets

I'm looking for a data structure that roughly corresponds to (in Java terms) Map<Set<int>, double>. Essentially a set of sets of labeled marbles, where each set of marbles is associated with a scalar. I want it to be able to efficiently handle the following operations:
Add a given integer to every set.
Remove every set that contains (or does not contain) a given integer, or at least set the associated double to 0.
Union two of the maps, adding together the doubles for sets that appear in both.
Multiply all of the doubles by a given double.
Rarely, iterate over the entire map.
under the following conditions:
The integers will fall within a constrained range (between 1 and 10,000 or so); the exact range will be known at compile-time.
Most of the integers within the range (80-90%) will never be used, but which ones will not be easily determinable until the end of the calculation.
The number of integers used will almost always still be over 100.
Many of the sets will be very similar, differing only by a few elements.
It may be possible to identify certain groups of integers that frequently appear only in sequential order: for example, if a set contains the integers 27 and 29 then it (almost?) certainly contains 28 as well.
It may be possible to identify these groups prior to running the calculation.
These groups would typically have 100 or so integers.
I've considered tries, but I don't see a good way to handle the "remove every set that contains a given integer" operation.
The purpose of this data structure would be to represent discrete random variables and permit addition, multiplication, and scalar multiplication operations on them. Each of these discrete random variables would ultimately have been created by applying these operations to a fixed (at compile-time) set of independent Bernoulli random variables (i.e. each takes the value 1 or 0 with some probability).
The systems being modeled are close to being representable as time-inhomogeneous Markov chains (which would of course simplify this immensely) but, unfortunately, it is essential to track the duration since various transitions.
Here's a data structure, that can do all of your operations pretty efficiently:
I'm going to refer to it as a BitmapArray for this explanation.
Thinking about it, apparently for just the operations you have described a sorted array with bitmaps as keys and weights(your doubles) as values will be pretty efficient.
The bitmaps are what maintain membership in your set. Since you said the range of integers in the set are between 1-10,000, we can maintain information about any set with a bitmap of length 10,000.
It's gonna be tough sorting an array where the keys can be as big as 2^10000, but you can be smart about implementing the comparison function in the following way:
Iterate from left to right on the two bitmaps
XOR the bits on each index
Say you get a 1 at ith position
Whichever bitmap has 1 at ith position is greater
If you never get a 1 they're equal
I know this is still a slow comparison.
But it's not too slow. Here's a benchmark fiddle I did on bitmaps with length 10000.
This is in JavaScript; if you're going to write it in Java, it will likely perform even better.
function runTest() {
var num = document.getElementById("txtValue").value;
num = isNaN(num * 1) ? 0 : num * 1;
/*For integers in the range 1-10,000 the worst case for comparison are any equal integers which will cause the comparision to iterate over the whole BitArray*/
bitmap1 = convertToBitmap(10000, num);
bitmap2 = convertToBitmap(10000, num);
var before = Date.now(); // getMilliseconds() only returns the 0-999 ms part of the time
var result = firstIsGreater(bitmap1, bitmap2, 10000);
var after = Date.now();
alert(result + " in time: " + (after-before) + " ms");
}
function convertToBitmap(size, number) {
var bits = new Array();
var q = number;
do {
bits.push(q % 2);
q = Math.floor(q / 2);
} while (q > 0);
xbitArray = new Array();
for (var i = 0; i < size; i++) {
xbitArray.push(0);
}
var j = xbitArray.length - 1;
for (var i = bits.length - 1; i >= 0; i--) {
xbitArray[j] = bits[i];
j--
}
return xbitArray;
}
function firstIsGreater(bitArray1, bitArray2, lengthOfArrays) {
for (var i = 0; i < lengthOfArrays; i++) {
if (bitArray1[i] ^ bitArray2[i]) {
if (bitArray1[i]) return true;
else return false;
}
}
return false;
}
document.getElementById("btnTest").onclick = function (e) {
runTest();
};
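The same left-to-right comparison can also be done a machine word at a time if the bitmap is packed into 64-bit integers. The following C sketch assumes such a packed layout, with the leftmost bits in word 0 and the most significant bit of each word being its leftmost bit; bitmap_compare is a made-up name. A 10,000-bit bitmap then takes 157 words, so a full comparison is at most 157 iterations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compares two packed bitmaps of `words` 64-bit words each.
 * Returns 1 if a is greater, -1 if b is greater, 0 if they are equal. */
int bitmap_compare(const uint64_t *a, const uint64_t *b, size_t words)
{
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < words; i++) {
        /* a[i] != b[i] means the XOR has a 1 somewhere in this word; the
         * bitmap holding the leftmost such 1 is the larger unsigned value. */
        if (a[i] != b[i])
            return a[i] > b[i] ? 1 : -1;
    }
    return 0;
}
```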
Also, remember that you only have to do this once, when building your BitmapArray (or while taking unions) and then it's going to become pretty efficient for the operations you'd do most often:
Note: N is the length of the BitmapArray.
Add integer to every set: Worst/best case O(N) time. Flip a 0 to 1 in each bitmap.
Remove every set that contains a given integer: Worst case O(N) time.
For each bitmap check the bit that represents the given integer; if it is 1, mark its index.
Compress the array by deleting all marked indices.
If you're okay with just setting the weights to 0 it'll be even more efficient. This also makes it very easy if you want to remove all sets that have any element in a given set.
Union of two maps: Worst case O(N1+N2) time. Just like merging two sorted arrays, except you have to be smart about comparisons once more.
Multiply all of the doubles by a given double: Worst/best case O(N) time. Iterate and multiply each value by the input double.
Iterate over the BitmapArray: Worst/best case O(1) time for next element.
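A few of these operations could look as follows in C. This is only a sketch: WeightedSet, MAX_INT and the function names are assumptions for illustration, each bitmap is packed into 64-bit words, and "remove" is implemented as the cheaper variant that sets the weight to 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_INT 10000               /* assumed upper bound from the question */
#define WORDS ((MAX_INT + 64) / 64) /* enough words for bits 0..MAX_INT */

typedef struct {
    uint64_t bits[WORDS]; /* bit v is set if integer v is in the set */
    double weight;
} WeightedSet;

static int set_contains(const WeightedSet *s, unsigned v)
{
    return (int)((s->bits[v / 64] >> (v % 64)) & 1u);
}

/* Add integer v to every set: one bit flip per set, O(N) overall. */
void add_to_every_set(WeightedSet *sets, size_t n, unsigned v)
{
    assert(v <= MAX_INT);
    for (size_t i = 0; i < n; i++)
        sets[i].bits[v / 64] |= (uint64_t)1 << (v % 64);
}

/* "Remove" every set containing v by setting its weight to 0. */
void zero_sets_containing(WeightedSet *sets, size_t n, unsigned v)
{
    assert(v <= MAX_INT);
    for (size_t i = 0; i < n; i++)
        if (set_contains(&sets[i], v))
            sets[i].weight = 0.0;
}

/* Multiply all weights by a given factor. */
void scale_weights(WeightedSet *sets, size_t n, double factor)
{
    for (size_t i = 0; i < n; i++)
        sets[i].weight *= factor;
}
```

One thing to keep in mind: adding an integer to every set can make two previously distinct sets equal, so their weights may need to be merged afterwards.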

how to read all 1's in an Array of 1's and 0's spread-ed all over the array randomly

I have an Array with 1 and 0 spread over the array randomly.
int arr[N] = {1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,0,0,0,1....................N}
Now I want to retrieve all the 1's in the array as fast as possible, but the condition is that I must not lose the exact position (index) of each 1 in the array, so sorting is not a valid option.
So the only option left is a linear search, i.e. O(n). Is there anything better than this?
The main problem with a linear scan is that I need to run it X times. So I feel I need some other data structure that maintains this list once the first linear scan has happened, so that I do not need to run the linear scan again and again.
Let me be clear about the final expectations:
I just need to find the number of 1's in a certain range of the array, for example within the range 40-100. The range can be arbitrary, and I need to count the 1's within it. I can't simply sum, because I would need to iterate over the array over and over again for the different range requirements.
I'm surprised you considered sorting as a faster alternative to linear search.
If you don't know where the ones occur, then there is no better way than linear searching. Perhaps if you used bits or char datatypes you could do some optimizations, but it depends on how you want to use this.
The best optimization that you could do on this is to avoid branch mispredictions. Because each value is zero or one, you can use it to advance the index of the array that is used to store the one-indices.
Simple approach:
int end = 0;
int indices[N];
for( int i = 0; i < N; i++ )
{
if( arr[i] ) indices[end++] = i; // Slow due to branch mispredictions
}
Without branching:
int end = 0;
int indices[N];
for( int i = 0; i < N; i++ )
{
indices[end] = i;
end += arr[i];
}
[edit] I tested the above, and found the version without branching was almost 3 times faster (4.36s versus 11.88s for 20 repeats on a randomly populated 100-million element array).
Coming back here to post results, I see you have updated your requirements. What you want is really easy with a dynamic programming approach...
All you do is create a new array that is one element larger, which stores the number of ones from the beginning of the array up to (but not including) the current index.
arr   :  1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 1
count : 0 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 4 5 6 6 6 6 7
(I've offset arr above so it lines up better)
Now you can compute the number of 1s in any range in O(1) time. To compute the number of 1s between index A and B, you just do:
int num = count[B+1] - count[A];
Obviously you can still use the branch-free version to generate the counts initially. All this should give you a pretty good speedup over the naive approach of summing for every query:
int *count = new int[N+1];
int total = 0;
count[0] = 0;
for( int i = 0; i < N; i++ )
{
total += arr[i];
count[i+1] = total;
}
// to compute the ranged sum:
int range_sum( int *count, int a, int b )
{
if( b < a ) return range_sum(count, b, a);
return count[b+1] - count[a];
}
Well one time linear scanning is fine. Since you are looking for multiple scans across ranges of array I think that can be done in constant time. Here you go:
Scan the array and create a map where key = index of the array = sequence (1,2,3,4,5,6....). The value stored in the map would be a tuple<IsOne,cumulativeSum>, where IsOne is whether you have a one there and cumulativeSum is the running total of 1's as and when you encounter them
Array = 1 1 0 0 1 0 1 1 1 0 1 0
Tuple: (1,1) (1,2) (0,2) (0,2) (1,3) (0,3) (1,4) (1,5) (1,6) (0,6) (1,7) (0,7)
CASE 1: When lower bound of cumulativeSum has a 0. Number of 1's [6,11] =
cumulativeSum at 11th position - cumulativeSum at 6th position = 7 - 3 = 4
CASE 2: When lower bound of cumulativeSum has a 1. Number of 1's [2,11] =
cumulativeSum at 11th position - cumulativeSum at 2nd position + 1 = 7-2+1 = 6
Step 1 is O(n)
Step 2 is O(1)
Total complexity is linear no doubt but for your task where you have to work with the ranges several times the above Algorithm seems to be better if you have ample memory :)
Does it have to be a simple linear array data structure? Or can you create your own data structure which happens to have the desired properties, for which you're able to provide the required API, but whose implementation details can be hidden (encapsulated)?
If you can implement your own and if there is some guaranteed sparsity (to either 1s or 0s) then you might be able to offer better than linear performance. I see that you want to preserve (or be able to regenerate) the exact stream, so you'll have to store an array or bitmap or run-length encoding for that. (RLE will be useless if the stream is actually random rather than arbitrary but could be quite useful if there are significant sparsity or patterns with long strings of one or the other. For example a black&white raster of a bitmapped image is often a good candidate for RLE).
Let's say that you're guaranteed that the stream will be sparse --- that no more than 10%, for example, of the bits will be 1s (or, conversely, that more than 90% will be). If that's the case then you might model your solution on an RLE and maintain a count of all 1s (simply incremented as you set bits and decremented as you clear them). If there might be a need to quickly get the number of set bits for arbitrary ranges of these elements, then instead of a single counter you can have a conveniently sized array of counters for partitions of the stream. (Conveniently sized, in this case, means something which fits easily within memory, within your caches, or register sets, but which offers a reasonable trade-off between computing a sum (all the partitions fully within the range) and the linear scan.) The result for any arbitrary range is the sum of all the partitions fully enclosed by the range plus the results of linear scans for any fragments that are not aligned on your partition boundaries.
For a very, very, large stream you could even have a multi-tier "index" of partition sums --- traversing from the largest (most coarse) granularity down toward the "fragments" to either end (using the next layer of partition sums) and finishing with the linear search of only the small fragments.
Obviously such a structure represents trade offs between the complexity of building and maintaining the structure (inserting requires additional operations and, for an RLE, might be very expensive for anything other than appending/prepending) vs the expense of performing arbitrarily long linear search/increment scans.
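A single-level version of the partition counters might look like this in C. It is a sketch under assumptions: the stream is stored as one byte per element holding 0 or 1, BLOCK is an arbitrarily chosen partition size, and the function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 64 /* assumed partition size; tune to your caches */

/* block_sums needs room for (n + BLOCK - 1) / BLOCK entries. */
void build_block_sums(const unsigned char *bits, size_t n, size_t *block_sums)
{
    for (size_t b = 0; b * BLOCK < n; b++)
        block_sums[b] = 0;
    for (size_t i = 0; i < n; i++)
        block_sums[i / BLOCK] += bits[i];
}

/* Counts the 1s in the inclusive range [lo, hi]: whole partitions use
 * their stored sum, unaligned fragments are scanned linearly. */
size_t count_range(const unsigned char *bits, const size_t *block_sums,
                   size_t lo, size_t hi)
{
    assert(lo <= hi);
    size_t total = 0;
    size_t i = lo;
    while (i <= hi) {
        if (i % BLOCK == 0 && i + BLOCK - 1 <= hi) {
            total += block_sums[i / BLOCK]; /* partition fully inside */
            i += BLOCK;
        } else {
            total += bits[i]; /* fragment at either end */
            i++;
        }
    }
    return total;
}

/* Sets one cell to 0 or 1 and keeps its partition sum up to date. */
void set_cell(unsigned char *bits, size_t *block_sums, size_t i,
              unsigned char value)
{
    assert(value <= 1);
    if (bits[i] != value) {
        if (value)
            block_sums[i / BLOCK]++;
        else
            block_sums[i / BLOCK]--;
        bits[i] = value;
    }
}
```

A query touches at most about 2 * BLOCK individual cells plus one counter per fully covered partition; a multi-tier index would shrink the second term further.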
If:
the purpose is to be able to find the number of 1s in the array at any time,
given that relatively few of the values in the array might change between one moment when you want to know the number and another moment, and
if you have to find the number of 1s in a changing array of n values m times,
... you can certainly do better than examining every cell in the array m times by using a caching strategy.
The first time you need the number of 1s, you certainly have to examine every cell, as others have pointed out. However, if you then store the number of 1s in a variable (say sum) and track changes to the array (by, for instance, requiring that all array updates occur through a specific update() function), every time a 0 is replaced in the array with a 1, the update() function can add 1 to sum and every time a 1 is replaced in the array with a 0, the update() function can subtract 1 from sum.
Thus, sum is always up-to-date after the first time that the number of 1s in the array is counted and there is no need for further counting.
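In C, the caching strategy for the total could be sketched like this. CountedArray, counted_init and counted_update are hypothetical names, and the cells are assumed to hold only 0 or 1.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    unsigned char *cells; /* each cell holds 0 or 1 */
    size_t n;
    size_t sum;           /* cached number of 1s */
} CountedArray;

/* Counts the 1s once; this is the only full scan. */
void counted_init(CountedArray *a, unsigned char *cells, size_t n)
{
    assert(a != NULL && (cells != NULL || n == 0));
    a->cells = cells;
    a->n = n;
    a->sum = 0;
    for (size_t i = 0; i < n; i++)
        a->sum += cells[i];
}

/* All writes go through here so that sum stays up to date. */
void counted_update(CountedArray *a, size_t i, unsigned char value)
{
    assert(i < a->n && value <= 1);
    if (a->cells[i] != value) {
        if (value)
            a->sum++; /* a 0 was replaced with a 1 */
        else
            a->sum--; /* a 1 was replaced with a 0 */
        a->cells[i] = value;
    }
}
```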
(EDIT to take the updated question into account)
If the need is to return the number of 1s in a given range of the array, that can be done with a slightly more sophisticated caching strategy than the one I've just described.
You can keep a count of the 1s in each subset of the array and update the relevant subset count whenever a 0 is changed to a 1 or vice versa within that subset. Finding the total number of 1s in a given range within the array would then be a matter of adding the number of 1s in each subset that is fully contained within the range and then counting the number of 1s that are in the range but not in the subsets that have already been counted.
Depending on circumstances, it might be worthwhile to have a hierarchical arrangement in which (say) the number of 1s in the whole array is at the top of the hierarchy, the number of 1s in each 1/q th of the array is in the second level of the hierarchy, the number of 1s in each 1/(q^2) th of the array is in the third level of the hierarchy, etc. e.g. for q = 4, you would have the total number of 1s at the top, the number of 1s in each quarter of the array at the second level, the number of 1s in each sixteenth of the array at the third level, etc.
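Here is a rough single-level sketch of the subset-count idea, assuming one byte per cell. PART_SIZE, MAX_PARTS and the pc_* names are made up for illustration; a real partition size would be much larger than 4, and a hierarchy would add further levels of sums on top of this one:

```c
#include <assert.h>
#include <stddef.h>

#define PART_SIZE 4    /* cells per partition (illustrative value only) */
#define MAX_PARTS 64   /* capacity of this sketch */

typedef struct {
    unsigned char *cells;               /* each cell holds 0 or 1 */
    size_t         n;                   /* number of cells */
    size_t         part_sum[MAX_PARTS]; /* count of 1s per partition */
} PartitionedCounts;

static void pc_init(PartitionedCounts *p, unsigned char *cells, size_t n)
{
    assert(n <= (size_t)PART_SIZE * MAX_PARTS);
    p->cells = cells;
    p->n = n;
    for (size_t i = 0; i < MAX_PARTS; i++)
        p->part_sum[i] = 0;
    for (size_t i = 0; i < n; i++)
        p->part_sum[i / PART_SIZE] += cells[i];
}

/* All writes go through here so the partition sums stay correct. */
static void pc_update(PartitionedCounts *p, size_t index, unsigned char value)
{
    assert(index < p->n && value <= 1);
    size_t *s = &p->part_sum[index / PART_SIZE];
    *s += value;
    *s -= p->cells[index];
    p->cells[index] = value;
}

/* Number of 1s in the half-open range [start, stop). */
static size_t pc_count(const PartitionedCounts *p, size_t start, size_t stop)
{
    assert(start <= stop && stop <= p->n);
    size_t sum = 0, i = start;
    while (i < stop) {
        if (i % PART_SIZE == 0 && i + PART_SIZE <= stop) {
            sum += p->part_sum[i / PART_SIZE]; /* whole partition in range */
            i += PART_SIZE;
        } else {
            sum += p->cells[i];                /* unaligned fragment: scan */
            i++;
        }
    }
    return sum;
}
```

A query touches at most 2 * (PART_SIZE - 1) individual cells plus one sum per fully covered partition, which is the trade-off the partition size controls.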
Are you using C (or a derived language)? If so, can you control the encoding of your array? If, for example, you could pack it as a bitmap, you could use that to count. The nice thing about a bitmap is that you can use a lookup table to sum the counts. If your subrange ends aren't divisible by 8, you'll have to deal with the partial bytes at the ends specially, but the speedup will be significant.
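To make the bitmap idea concrete, here is a small sketch, assuming bit i of the stream is stored in byte i / 8 at bit position i % 8 (least significant bit first). The names bits_in_byte, init_bit_table and count_bits are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* bits_in_byte[v] = number of set bits in the byte value v. */
static uint8_t bits_in_byte[256];

static void init_bit_table(void)
{
    bits_in_byte[0] = 0;
    for (int v = 1; v < 256; v++)
        bits_in_byte[v] = (uint8_t)((v & 1) + bits_in_byte[v >> 1]);
}

/* Count set bits in bit positions [start, stop).
 * Assumed layout: bit i lives in bitmap[i / 8], at bit (i % 8), LSB first. */
static size_t count_bits(const uint8_t *bitmap, size_t start, size_t stop)
{
    assert(start <= stop);
    size_t sum = 0, i = start;

    /* leading partial byte, one bit at a time */
    while (i < stop && (i & 7) != 0) {
        sum += (bitmap[i >> 3] >> (i & 7)) & 1u;
        i++;
    }
    /* whole bytes, one table lookup each */
    while (i + 8 <= stop) {
        sum += bits_in_byte[bitmap[i >> 3]];
        i += 8;
    }
    /* trailing partial byte */
    while (i < stop) {
        sum += (bitmap[i >> 3] >> (i & 7)) & 1u;
        i++;
    }
    return sum;
}
```

init_bit_table() has to run once before the first call to count_bits(). On compilers that provide a population-count intrinsic, that could replace the table, but the table version is portable C.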
If that's not the case, can you at least encode them as single bytes? In that case, you may be able to exploit sparseness if it exists (more specifically, the hope that there are often multi index swaths of zeros).
So for:
u8 input[] = {1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,0,0,0,1....................N};
You can write something like (untested):
uint countBytesBy1FromTo(u8 *input, uint start, uint stop)
{   // function for counting one byte at a time, use with range of less than 4,
    // use functions below for longer ranges
    // assume it's just ones and zeros, otherwise we have to test/branch
    uint sum = 0;
    u8 *end = input + stop;
    for (u8 *each = input + start; each < end; each++)
        sum += *each;
    return sum;
}
uint countBytesBy8FromTo(u8 *input, uint start, uint stop)
{
    // note: the cast assumes input + start is suitably aligned for u64 access
    u64 *chunks = (u64*)(input + start);
    u64 *end = chunks + ((stop - start) >> 3);
    // count the leftover bytes after the last full 8-byte chunk
    uint sum = countBytesBy1FromTo((u8*)end, 0, (uint)((input + stop) - (u8*)end));
    for (; chunks < end; chunks++)
    {
        if (*chunks) // skip chunks that are entirely zero
        {
            sum += countBytesBy1FromTo((u8*)chunks, 0, 8);
        }
    }
    return sum;
}
The basic trick is exploiting the ability to cast slices of your target array to single entities your language can look at in one swoop, test whether ALL of the values in the slice are zero, and if so skip the whole block. The more zeros, the better it will work. In the case where your large cast integer always contains at least one 1, this approach just adds overhead. You might find that using a u32 is better for your data, or that adding a u32 test between the 1-byte and 8-byte steps helps. For datasets where zeros are much more common than ones, I've used this technique to great advantage.
Why is sorting invalid? You can clone the original array, sort the clone, and count and/or mark the locations of the 1s as needed.

How to find an element in a linked list of blocks (containing n elements) as fast as possible?

My data structure is a linked list of blocks. A block contains 31 elements of 4 bytes each and one 4-byte pointer to the next block or NULL (128 bytes per block in total). I add elements from time to time. If the last block is full, I add another block via the pointer.
One objective is to use as little memory (= as few blocks) as possible and to have no free space between two elements in a block.
This setting is fixed. All code runs on a 32-bit ARM Cortex-A8 CPU with NEON pipeline.
Question:
How to find a specific element in that data structure as quickly as possible?
Approach (right now):
I use sorted blocks and binary search to check for an element (9 bits of the 4 bytes are the search criterion). If the desired element is not in the current block, I jump to the next block. If the element is not in the last block and the last block is not yet full, I use the result of the binary search to insert the new element (if necessary, I make space using memmove within this block). Thus all blocks are always sorted.
Do you have an idea to make that faster?
This is how I search right now: (q->getPosition() is an inline function that just extracts the 9-bit position from the element via "& bitmask")
do
{
    // binary search algorithm (bsearch)
    // from http://www.google.com/codesearch/
    // p?hl=en#qoCVjtE_vOw/gcc4/trunk/gcc-
    // 4.4.3/libiberty/bsearch.c&q=bsearch&sa=N&cd=2&ct=rc
    base = &(block->points[0]);
    if (block->next == NULL)
    {
        pointsInBlock = pointsInLastBlock;
        stop = true;
    }
    else
    {
        block = block->next;
    }
    for (lim = pointsInBlock; lim != 0; lim >>= 1)
    {
        q = base + (lim >> 1);
        cmp = quantizedPosition - q->getPosition();
        if (cmp > 0)
        {
            // quantizedPosition > q: move right
            base = q + 1;
            lim--;
        }
        else if (cmp == 0)
        {
            // We found the QuantPoint
            *outQuantPoint = q;
            return true;
        }
        // else move left
    }
}
while (!stop);
Since the bulk of the time is spent in the within-block search, that needs to be as fast as possible. Since the number of elements is fixed, you can completely unroll that loop, as in:
if (key < a[16]){
    if (key < a[8]){
        ...
    }
    else { // key >= a[8] && key < a[16]
        ...
    }
}
else { // key >= a[16]
    if (key < a[24]){
        ...
    }
    else { // key >= a[24]
        ...
    }
}
Study the generated assembly language and single-step it in a debugger, to make sure the compiler's giving you good code.
You might want to write a little program to print out the above code, as it will be hard to write by hand, or possibly generate it with macros.
ADDED: Just noticed your 9-bit search criterion. In that case, just pre-allocate an array of 512 4-byte words, and index it directly. That's the fastest, and the least code.
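A sketch of what that direct-indexed table could look like. It assumes the 9-bit position sits in the low 9 bits of the element (the question does not say which bits are used) and keeps a separate occupancy flag per slot; DirectTable, table_find and table_insert are hypothetical names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define POSITION_BITS 9
#define POSITION_MASK ((1u << POSITION_BITS) - 1u) /* assumption: key = low 9 bits */
#define TABLE_SIZE    (1u << POSITION_BITS)        /* 512 slots */

typedef struct {
    uint32_t slot[TABLE_SIZE]; /* one 4-byte element per possible key */
    bool     used[TABLE_SIZE]; /* occupancy flags (an addition of this sketch) */
} DirectTable;

static uint32_t position_of(uint32_t element)
{
    return element & POSITION_MASK;
}

/* Returns a pointer to the stored element with this 9-bit key, or NULL. */
static uint32_t *table_find(DirectTable *t, uint32_t quantized_position)
{
    assert(t != NULL);
    uint32_t k = quantized_position & POSITION_MASK;
    return t->used[k] ? &t->slot[k] : NULL;
}

/* Stores the element; returns false (and keeps the old one) if the key exists. */
static bool table_insert(DirectTable *t, uint32_t element)
{
    assert(t != NULL);
    uint32_t k = position_of(element);
    if (t->used[k])
        return false;
    t->slot[k] = element;
    t->used[k] = true;
    return true;
}
```

Lookup is one mask, one flag test and one array access, with no loop at all. The cost is a fixed 2 KB for the slots plus the flags, regardless of how many elements are actually stored, which conflicts with the stated goal of minimal memory when only a few elements exist.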
ALSO ADDED: If you need to keep your block structure, there's another way to do the unrolled binary search. It's the Jon Bentley method:
i = 0;
if (key >= a[i+16]) i += 16;
if (key >= a[i+ 8]) i += 8;
if (key >= a[i+ 4]) i += 4;
if (key >= a[i+ 2]) i += 2;
if (i < 30 && key >= a[i+ 1]) i += 1; // this excludes 31
if (key == a[i]) // then key is found
That's slower than the if-tree above, because of manipulating i, but could be substantially less code.
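Wrapped into a complete function, the unrolled search above might look like the following. This is only a sketch: it assumes a full block of 31 elements sorted in ascending order and compares whole 32-bit values, whereas the question's code compares only the extracted 9-bit position. The name find_in_full_block is made up:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_ELEMS 31

/* Returns the index of key in the sorted array a[0..30], or -1 if absent.
 * Assumes the block is completely filled; a partially filled last block
 * would need padding or the ordinary loop-based search. */
static int find_in_full_block(const uint32_t a[BLOCK_ELEMS], uint32_t key)
{
    assert(a != NULL);
    int i = 0;
    if (key >= a[i + 16]) i += 16;
    if (key >= a[i +  8]) i += 8;
    if (key >= a[i +  4]) i += 4;
    if (key >= a[i +  2]) i += 2;
    if (i < 30 && key >= a[i + 1]) i += 1; /* guard: a[31] does not exist */
    return (key == a[i]) ? i : -1;
}
```

The highest index probed is a[30], so the function never reads past the 31 elements of a block.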
Let the number of elements in each block be m and the total number of blocks currently in the list be n. Then the current time complexity of your algorithm is O(n log m).
If you cannot move elements once they are added to a block, then I don't think you can do better in terms of time complexity than what you are already doing. (You could keep track of the maximum and minimum elements in a block, and skip the blocks if the element does not lie in this range. But this is not going to give you much gain. This will also waste space keeping track of the minimum and maximum for each block)
If you can afford to spend time while inserting the element and can move elements from one block to another, then here is a scheme that has time complexity O(log (mn)).
Basically, you keep all elements in sorted order. When a new element has to be inserted, binary search across block boundaries and insert it in its correct location, shifting elements to create space. This will lead to O(nm) time while inserting elements but O(log (mn)) when finding an element.
If the search criterion for an element is fixed, you would do better to move the searching into a separate index structure. The maximum number of elements you can distinguish by your search criterion is only 2^9 = 512, so the maximum size of the search index would be (2 + 4)*512 = 3072 bytes; you could use something other than a static index if you needed to save some memory. For now, imagine it as an array of 512 pairs <9-bit index, direct address>, which should be very fast (only one NULL check and one dereference per lookup).
Generally, the answer to your question also depends on what other operations you want to perform on your structure and how frequently you perform each of them (including the search). If all you want is search(9 bits)->add/modify/read, then your block structure would be useless.
You could list those operations here, and maybe add which language you're using.
Edit 3:
I just noticed you can't change the blocks' size. But is your search for efficiency reasons only, or do you need the elements of the list to be unique (by those 9 bits)?

Resources