Parallelizing with threads and blocks in CUDA

I have the following simple nested for loop:
float a[1024][1024], b[1024];
for (int i = 1; i < 1024; i++) {
    for (int j = 1; j < 1024 - i; j++) {
        b[i + j] += a[i][j];
    }
}
I am trying to understand how to partition this problem into CUDA threads and thread blocks so it can be parallelized on the GPU. Counting the loop iterations gives a total of N = 522753 computations. I am not entirely sure how to proceed from here. I know the number of threads in each block should be a multiple of 32; if, for instance, the number of threads per block is 1024, then I need at least 511 blocks, where each thread takes one of the computations 1 through N. Can someone explain how to choose the best number of threads per block, and how to actually implement this in parallel?
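For reference, the block-count arithmetic in the question is the standard ceil division; a minimal sketch, assuming one thread per computation:

int N = 522753;                  // total computations, counted above
int threadsPerBlock = 1024;      // must be a multiple of 32
int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;  // ceil(522753 / 1024) = 511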

A long comment:
Edit: the c matrix should be column-major instead of row-major, and the sorting should be on columns instead of rows, but I left it row-major here for readability.
You can prepare (once, up front) a matrix of counts and references per workitem, such that the first column is the count, the rest are references, and the last column is the write address:
c[0] = {1, &a[1][1], &b[2]}; // b[2]
c[1] = {2, &a[1][2], &a[2][1], &b[3]}; // b[3]
c[2] = {3, &a[1][3], &a[2][2], &a[3][1], &b[4]}; // b[4]
..
then sort them (again, only once) by their number of references / subarray size, so they become:
c[0]   = {1, &a[1][1], &b[2]};               // b[2]
c[1]   = {1, &a[1022][1], &b[1023]};         // b[1023]
..
c[k]   = {5, x1, y1, z1, t1, w1, &b[m]};     // b[m]
c[k+1] = {5, x2, y2, z2, t2, w2, &b[n]};     // b[n]
with the amount of work balanced between the CUDA threads of a warp/block.
Then access the c matrix (1 CUDA thread per row) to know which elements to add together, in a plain for loop per workitem:
const int length = (int)c[workitemId][0];
for (int i = 1; i < length + 1; i++)
    resultOfWorkitem += *(c[workitemId][i]);
*(c[workitemId][length + 1]) = resultOfWorkitem;
Since all the sorted lists are prepared only once, if you are going to run the calculation part frequently, this extra referencing step can be faster than using atomics, and the c and a arrays can probably be cached for read-only access.
If the random-write addresses become a performance problem, you can sort the c array by the address of its last item (contiguous b indices), but this would decrease the work balance between neighboring CUDA threads. Maybe that is faster; I didn't test. Maybe sorting c on a's second index value makes it faster by decreasing the number of reads, especially when you sort each row's elements between themselves so that they become contiguous with neighboring threads' reads, similar to the first part:
c[0] = {1, &a[1][1], ... // address x   \
c[1] = {2, &a[1][2], ... // address x+1  > less than L1 cache line size (128 bytes)?
c[2] = {3, &a[1][3], ... // address x+2 /
Preserving both contiguous address access and balanced work per workitem may be impossible.
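For reference, here is a minimal CUDA sketch of the simpler partitioning the question hints at: one thread per output index k = i + j, each summing its own anti-diagonal of a, so no atomics are needed. Note the work per thread is unbalanced (thread k does k - 1 additions), which is exactly the problem the balancing scheme above addresses.

__global__ void diag_sums(const float* a, float* b, int N)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;  // k = i + j, valid range 2..N-1
    if (k < 2 || k >= N) return;
    float sum = 0.0f;
    for (int i = 1; i < k; ++i)       // j = k - i runs over k-1 .. 1
        sum += a[i * N + (k - i)];    // a passed as the flattened 1024x1024 matrix
    b[k] += sum;
}

// Launch with, e.g., 128 threads per block (a multiple of 32):
// diag_sums<<<(1024 + 127) / 128, 128>>>(d_a, d_b, 1024);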


Need help understanding the solution for the Jewelry TopCoder problem

I am fairly new to dynamic programming and don't yet understand most of the types of problems it can solve, so I am having trouble understanding the solution to the Jewelry TopCoder problem.
Can someone at least give me some hints as to what the code is doing?
Most importantly, is this problem a variant of the subset-sum problem? That's what I am studying to make sense of it.
What are these two functions actually counting? Why are we using two DP tables?
void cnk() {
    nk[0][0] = 1;
    FOR(k, 1, MAXN) {
        nk[0][k] = 0;
    }
    FOR(n, 1, MAXN) {
        nk[n][0] = 1;
        FOR(k, 1, MAXN)
            nk[n][k] = nk[n-1][k-1] + nk[n-1][k];
    }
}
void calc(LL T[MAXN+1][MAX+1]) {
    T[0][0] = 1;
    FOR(x, 1, MAX) T[0][x] = 0;
    FOR(ile, 1, n) {
        int a = v[ile-1];
        FOR(x, 0, MAX) {
            T[ile][x] = T[ile-1][x];
            if (x >= a) T[ile][x] += T[ile-1][x-a];
        }
    }
}
How is the original solution constructed by using the following logic ?
FOR(u, 1, c) {
    int uu = u * v[done];
    FOR(x, uu, MAX)
        res += B[done][x-uu] * F[n-done-u][x] * nk[c][u];
}
done = p;
}
Any help would be greatly appreciated.
Let's consider the following task first:
"Given a vector V of N positive integers less than K, find the number of subsets whose sum equals S".
This can be solved in polynomial time with dynamic programming using some extra-memory.
The dynamic programming approach goes like this:
instead of solving the problem for N and S, we will solve all the problems of the following form:
"Find the number of ways to write sum s (with s ≤ S) using only the first n ≤ N of the numbers".
This is a common characteristic of the dynamic programming solutions: instead of only solving the original problem, you solve an entire family of related problems. The key idea is that solutions for more difficult problem settings (i.e. higher n and s) can efficiently be built up from the solutions of the easier settings.
Solving the problem for n = 0 is trivial (sum s = 0 can be expressed in one way -- using the empty set -- while all other sums can't be expressed in any way).
Now consider that we have solved the problem for all values up to a certain n and that we have these solutions in a matrix A (i.e. A[n][s] is the number of ways to write sum s using the first n elements).
Then, we can find the solutions for n+1, using the following formula:
A[n+1][s] = A[n][s - V[n+1]] + A[n][s].
Indeed, when we write the sum s using the first n+1 numbers we can either include or not V[n+1] (the n+1th term).
This is what the calc function computes. (The cnk function uses Pascal's rule, C(n, k) = C(n-1, k-1) + C(n-1, k), to compute the binomial coefficients.)
Note: in general, if in the end we are only interested in answering the initial problem (i.e. for N and S), then the array A can be uni-dimensional (with length S + 1) -- this is because whenever trying to construct solutions for n + 1 we only need the solutions for n, and not for smaller values.
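As a concrete illustration of that note, a minimal C++ sketch of the uni-dimensional variant; iterating s downwards lets each new row overwrite the previous one in place:

#include <vector>

// A[s] = number of subsets of V with sum exactly s (0 <= s <= S)
std::vector<long long> countSubsetSums(const std::vector<int>& V, int S)
{
    std::vector<long long> A(S + 1, 0);
    A[0] = 1;                         // the empty set
    for (int v : V)
        for (int s = S; s >= v; --s)  // downwards, so each item is used at most once
            A[s] += A[s - v];
    return A;
}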
This problem (the one initially stated in this answer) is indeed related to the subset sum problem (finding a subset of elements with sum zero).
A similar type of dynamic programming approach can be applied if we have a reasonable limit on the absolute values of the integers used (we need to allocate an auxiliary array to represent all possible reachable sums).
In the zero-sum problem we are not actually interested in the count, thus the A array can be an array of booleans (indicating whether a sum is reachable or not).
In addition, another auxiliary array, B can be used to allow reconstructing the solution if one exists.
The recurrence would now look like this:
if (!A[s] && A[s - V[n+1]]) {
    A[s] = true;
    // the index of the last value used to reach sum _s_,
    // allows going backwards to reproduce the entire solution
    B[s] = n + 1;
}
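For completeness, a small sketch of the backtracking step (assuming, as above, that B[s] stores the 1-based index of the last value used to reach s, and that the index shifting from the note below has already been applied):

#include <vector>

// Walk B backwards from sum s to recover one subset reaching it.
std::vector<int> reconstruct(const std::vector<int>& V, const std::vector<int>& B, int s)
{
    std::vector<int> subset;
    while (s != 0) {
        int i = B[s];                 // index of the last value used to reach s
        subset.push_back(V[i - 1]);
        s -= V[i - 1];
    }
    return subset;
}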
Note: the actual implementation requires some additional care for handling negative sums, which cannot directly represent indices in the array (the indices can be shifted by taking into account the minimum reachable sum, or, if working in C/C++, a trick like the one described in this answer can be applied: https://stackoverflow.com/a/3473686/6184684).
I'll detail how the above ideas apply in the TopCoder problem and its solution linked in the question.
The B and F matrices.
First, note the meaning of the B and F matrices in the solution:
B[i][s] represents the number of ways to reach sum s using only the smallest i items
F[i][s] represents the number of ways to reach sum s using only the largest i items
Indeed, both matrices are computed using the calc function, after sorting the array of jewelry values in ascending order (for B) and descending order (for F).
Solution for the case with no duplicates.
Consider first the case with no duplicate jewelry values, using this example: [5, 6, 7, 11, 15].
For the remainder of the answer I will assume that the array was sorted in ascending order (thus "first i items" will refer to the smallest i ones).
Each item given to Bob has value less (or equal) to each item given to Frank, thus in every good solution there will be a separation point such that Bob receives only items before that separation point, and Frank receives only items after that point.
To count all solutions we would need to sum over all possible separation points.
When, for example, the separation point is between the 3rd and 4th item, Bob would pick items only from the [5, 6, 7] sub-array (smallest 3 items), and Frank would pick items from the remaining [11, 15] sub-array (largest 2 items). In this case there is a single sum (s = 11) that can be obtained by both of them. Each time a sum can be obtained by both, we need to multiply the number of ways that each of them can reach the respective sum (e.g. if Bob could reach a sum s in 4 ways and Frank could reach the same sum s in 5 ways, then we would get 20 = 4 * 5 valid solutions with that sum, because each combination is a valid solution).
Thus we would get the following code by considering all separation points and all possible sums:
res = 0;
for (int i = 0; i < n; i++) {
    for (int s = 0; s <= maxS; s++) {
        res += B[i][s] * F[n-i][s];
    }
}
However, there is a subtle issue here. This would often count the same combination multiple times (for various separation points). In the example provided above, the same solution with sum 11 would be counted both for the separation [5, 6] - [7, 11, 15], as well as for the separation [5, 6, 7] - [11, 15].
To alleviate this problem we can partition the solutions by "the largest value of an item picked by Bob" (or, equivalently, by always forcing Bob to include in his selection the largest valued item from the first sub-array under the current separation).
In order to count the number of ways to reach sum s when Bob's largest valued item is the ith one (sorted in ascending order), we can use B[i][s - v[i]]. This holds because using the v[i] valued item implies requiring the sum s - v[i] to be expressed using subsets from the first i items (indices 0, 1, ... i - 1).
This would be implemented as follows:
res = 0;
for (int i = 0; i < n; i++) {
    for (int s = v[i]; s <= maxS; s++) {
        res += B[i][s - v[i]] * F[n - 1 - i][s];
    }
}
This is getting closer to the solution on TopCoder (in that solution, done corresponds to the i above, and uu = v[i]).
Extension for the case when duplicates are allowed.
When duplicate values can appear in the array, it's no longer easy to directly count the number of solutions when Bob's most valuable item is v[i]. We need to also consider the number of such items picked by Bob.
If there are c items that have the same value as v[i], i.e. v[i] = v[i+1] = ... = v[i + c - 1], and Bob picks u such items, then the number of ways for him to reach a certain sum s is equal to:
comb(c, u) * B[i][s - u * v[i]] (1)
Indeed, this holds because the u items can be picked from the total of c which have the same value in comb(c, u) ways. For each such choice of the u items, the remaining sum is s - u * v[i], and this should be expressed using a subset from the first i items (indices 0, 1, ... i - 1), thus it can be done in B[i][s - u * v[i]] ways.
For Frank, if Bob used u of the v[i] items, the number of ways to express sum s will be equal to:
F[n - i - u][s] (2)
Indeed, since Bob uses the smallest i + u values, Frank can use any of the largest n - i - u values to reach the sum s.
By combining relations (1) and (2) from above, we obtain that the number of solutions where both Frank and Bob have sum s, when Bob's most valued item is v[i] and he picks u such items is equal to:
comb(c, u) * B[i][s - u * v[i]] * F[n - i - u][s].
This is precisely what the given solution implements.
Indeed, the variable done corresponds to variable i above, variable x corresponds to sums s, the index p is used to determine the c items with same value as v[done], and the loop over u is used in order to consider all possible numbers of such items picked by Bob.
Here's some Java code for this that references the original solution. It also incorporates qwertyman's fantastic explanations (to the extent feasible). I've added some of my comments along the way.
import java.util.*;

public class Jewelry {

    int MAX_SUM = 30005;
    int MAX_N = 30;
    long[][] C;

    // Generate all possible sums
    // ret[i][sum] = number of ways to compute sum using the first i numbers from val[]
    public long[][] genDP(int[] val) {
        int i, sum, n = val.length;
        long[][] ret = new long[MAX_N+1][MAX_SUM];
        ret[0][0] = 1;
        for (i = 0; i + 1 <= n; i++) {
            for (sum = 0; sum < MAX_SUM; sum++) {
                // Carry over the count from i to i+1 for each sum.
                // The problem definition allows excluding numbers from sums,
                // so we are essentially excluding the last number here.
                ret[i+1][sum] = ret[i][sum];
                // DP: (number of ways to generate sum using i+1 numbers) +=
                //     (number of ways to generate sum-val[i] using i numbers)
                if (sum >= val[i])
                    ret[i+1][sum] += ret[i][sum-val[i]];
            }
        }
        return ret;
    }

    // C(n, r) - number of ways of choosing r numbers from n numbers
    // Built up front with dynamic programming via Pascal's rule
    public void nCr() {
        C = new long[MAX_N+1][MAX_N+1];
        int n, r;
        C[0][0] = 1;
        for (n = 1; n <= MAX_N; n++) {
            C[n][0] = 1;
            for (r = 1; r <= MAX_N; r++)
                C[n][r] = C[n-1][r-1] + C[n-1][r];
        }
    }

    /*
    General concept:
    - Sort the array
    - Incrementally divide the array into two partitions
      + Accomplished by using two different arrays - L for left, R for right
    - Take all possible sums on the left side and match with all possible sums
      on the right side (multiply these counts to get totals for each sum)
    - Adjust for common sums so as to not overcount
    - Adjust for duplicate numbers
    */
    public long howMany(int[] values) {
        int i, j, sum, n = values.length;

        // Pre-compute C(n,r) and store in C[][]
        nCr();

        /*
        Incrementally split the array and calculate sums on either side.
        For eg. if val={2, 3, 4, 5, 9}, we would partition this as
        {2 | 3, 4, 5, 9} then {2, 3 | 4, 5, 9}, etc.
        First, sort it ascendingly and generate its sum matrix L.
        Then, sort it descendingly, and generate another sum matrix R.
        In later calculations, manipulate indexes to simulate the partitions,
        so at any point L[i] corresponds to R[n-i-1]. eg. L[1] = R[5-1-1] = R[3]
        */

        // Sort ascendingly
        Arrays.sort(values);
        // Generate all sums for the "left" partition using the sorted array
        long[][] L = genDP(values);

        // Sort descendingly by reversing the existing array.
        // (Arrays.sort cannot sort a primitive int[] in descending order,
        // so reverse the ascending array manually.)
        for (i = 0; i < n/2; i++) {
            int tmp = values[i];
            values[i] = values[n-i-1];
            values[n-i-1] = tmp;
        }
        // Generate all sums for the "right" partition using the re-sorted array
        long[][] R = genDP(values);

        // Re-sort in ascending order as we will be using values[] as reference later
        Arrays.sort(values);

        long tot = 0;
        for (i = 0; i < n; i++) {
            int dup = 0;
            // How many duplicates of values[i] do we have?
            for (j = 0; j < n; j++)
                if (values[j] == values[i])
                    dup++;

            /*
            Calculate the total by iterating through each sum and multiplying
            the counts on both partitions for that sum.
            However, some sums would get counted multiple times.
            For instance, if val={2, 3, 4, 5, 9}, you'd get:
            {2, 3 | 4, 5, 9} and {2, 3, 4 | 5, 9} (on two different iterations);
            in this case, the split {2, 3 | 5} is counted twice.
            To account for this, exclude the current largest number, val[i],
            from L's sum and exclude it from R's i index.
            There is another issue with duplicate numbers:
            e.g. if values={2, 3, 3, 3, 4}, how do you know which 3 went to L?
            To solve this, group the same numbers.
            Applying to {2, 3, 3, 3, 4}:
            - Exclude 3, 6 (3+3) and 9 (3+3+3) from L's sum calculation
            - Exclude 1, 2 and 3 from R's index count
            We're essentially saying that we will exclude the sum contribution
            of these elements to L and ignore their count contribution to R.
            */
            for (j = 1; j <= dup; j++) {
                int dup_sum = j * values[i];
                for (sum = dup_sum; sum < MAX_SUM; sum++) {
                    // (ways to pick j numbers from dup) *
                    // (ways to get sum-dup_sum from i numbers) *
                    // (ways to get sum from n-i-j numbers)
                    if (n-i-j >= 0)
                        tot += C[dup][j] * L[i][sum-dup_sum] * R[n-i-j][sum];
                }
            }
            // Skip past the duplicates of values[i] that we've now accounted for
            i += dup - 1;
        }
        return tot;
    }
}

Computing partial sums in OpenCL

A 1D dataset is divided into segments, and each work item processes one segment. Each work item reads a number of elements from its segment; the number of elements is not known beforehand and differs for each segment.
For example:
+----+----+----+----+----+----+----+----+----+ <-- segments
A BCD E FG HIJK L M N <-- elements in each segment
After all segments have been processed, they should write their elements contiguously to the output memory, like
A B C D E F G H I J K L M N
So the absolute output position of the elements from one segment depends on the number of elements in the previous segments. E is at position 4 because segment 1 contains 1 element (A) and segment 2 contains 3 elements (B, C, D).
The OpenCL kernel writes the number of elements of each segment into a local/shared memory buffer, and works like this (pseudocode):
kernel void k(
    constant uchar* input,
    global int* output,
    local int* segment_element_counts
) {
    int segment = get_local_id(0);

    // Count the elements in this work item's segment
    int count = count_elements(&input[segment * segment_size]);
    segment_element_counts[segment] = count;
    barrier(CLK_LOCAL_MEM_FENCE);

    // O(N) partial sum: work items with larger ids do more iterations
    ptrdiff_t position = 0;
    for (int previous_segment = 0; previous_segment < segment; ++previous_segment)
        position += segment_element_counts[previous_segment];

    global int* output_ptr = &output[position];
    read_elements(&input[segment * segment_size], output_ptr);
}
So each work item has to calculate a partial sum using a loop, where the work items with larger ids do more iterations.
Is there a more efficient way to implement this (each work item calculating the partial sum of a sequence up to its own index) in OpenCL 1.2? OpenCL 2 seems to provide work_group_scan_inclusive_add for this.
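(Aside: in CUDA this primitive is available as a library call; a minimal sketch using CUB's cub::BlockScan, with the question's hypothetical count_elements/read_elements helpers declared but not defined:)

#include <cub/cub.cuh>

__device__ int count_elements(const unsigned char* segment);           // assumed helper
__device__ void read_elements(const unsigned char* segment, int* out); // assumed helper

template <int NUM_SEGMENTS, int SEGMENT_SIZE>
__global__ void compact_segments(const unsigned char* input, int* output)
{
    using BlockScan = cub::BlockScan<int, NUM_SEGMENTS>;
    __shared__ typename BlockScan::TempStorage temp;

    int segment = threadIdx.x;  // one thread per segment
    int count = count_elements(&input[segment * SEGMENT_SIZE]);

    int position;               // exclusive prefix sum of the counts
    BlockScan(temp).ExclusiveSum(count, position);

    read_elements(&input[segment * SEGMENT_SIZE], &output[position]);
}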
You can do N partial (prefix) sums in log2(N) iterations using something like this:
offsets[get_local_id(0)] = count;
barrier(CLK_LOCAL_MEM_FENCE);
for (ushort combine = 1; combine < total_num_segments; combine *= 2)
{
    if (get_local_id(0) & combine)
    {
        offsets[get_local_id(0)] +=
            offsets[(get_local_id(0) & ~(combine * 2u - 1u)) | (combine - 1u)];
    }
    barrier(CLK_LOCAL_MEM_FENCE);
}
Given segment element counts of
a b c d
The successive iterations will produce:
a b+a c d+c
and
a b+a c+(b+a) (d+c)+(b+a)
Which is the result we want.
So in the first iteration, we've divided the segment element counts into groups of 2 and summed within them. Then we merge 2 groups at a time into groups of 4 elements, and propagate the result from the first group into the second. We grow the groups again to 8, and so on.
The key observation is that this pattern also matches the binary representation of the index of each segment:
0: 0b00 1: 0b01 2: 0b10 3: 0b11
Index 0 performs no sums. Both indices 1 and 3 perform a sum in the first iteration (bit 0/LSB = 1), whereas indices 2 and 3 perform a sum in the second iteration (bit 1 = 1). That explains this line:
if (get_local_id(0) & combine)
The other statement that really needs an explanation is of course
offsets[get_local_id(0)] +=
    offsets[(get_local_id(0) & ~(combine * 2u - 1u)) | (combine - 1u)];
Calculating the index at which we find the previous prefix sum we want to accumulate onto our work-item's sum is a little tricky. The subexpression (combine * 2u - 1u) takes the value 2^n - 1 on each iteration (for n starting at 1):
1 = 0b001
3 = 0b011
7 = 0b111
…
By bitwise-masking these bit suffixes off the work-item index (i.e. i & ~x), this gives you the index of the first item in the current group.
The (combine - 1u) subexpression then gives you the index, within the current group, of the last item of the first half. Putting the two together gives you the overall index of the item you want to accumulate into the current segment. For example, with combine = 2 and work-item 6 (0b110): 6 & ~3 = 4 is the start of its group of 4, and 4 | 1 = 5, so work-item 6 accumulates offsets[5].
There is one slight ugliness in the result: it's shifted to the left by one, so segment 1 needs to use offsets[0], and so on, while segment 0's offset is of course 0. You can either over-allocate the offsets array by 1, initialise index 0 to 0 and perform the prefix sums on the sub-array starting at index 1, or use a conditional.
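A minimal sketch of the over-allocation variant, written in CUDA syntax for concreteness (the OpenCL version differs only in the barrier/get_local_id spelling; count_elements/read_elements are again the question's hypothetical helpers):

#define NUM_SEGMENTS 256
#define SEGMENT_SIZE 64

__device__ int count_elements(const unsigned char* segment);           // assumed helper
__device__ void read_elements(const unsigned char* segment, int* out); // assumed helper

__global__ void compact_segments(const unsigned char* input, int* output)
{
    __shared__ int offsets[NUM_SEGMENTS + 1]; // one extra slot; offsets[0] stays 0
    int tid = threadIdx.x;                    // one thread per segment

    if (tid == 0) offsets[0] = 0;
    offsets[tid + 1] = count_elements(&input[tid * SEGMENT_SIZE]);
    __syncthreads();

    // The log2(N) scan from above, run on the sub-array starting at index 1
    for (int combine = 1; combine < NUM_SEGMENTS; combine *= 2) {
        if (tid & combine)
            offsets[tid + 1] += offsets[((tid & ~(combine * 2 - 1)) | (combine - 1)) + 1];
        __syncthreads();
    }

    // Thanks to the padding, offsets[tid] is this segment's exclusive prefix
    // sum; no conditional is needed for segment 0.
    read_elements(&input[tid * SEGMENT_SIZE], &output[offsets[tid]]);
}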
There are probably profiling-driven micro-optimisations you can make to the above code.

O(n) algorithm for two identical points

The Problem Statement
Given n points in a 2D plane, each having x and y coordinates. Two points are identical if one can be obtained from the other by multiplying both coordinates by the same number. Example: (10,15) and (2,3) are identical, whereas (10,15) and (10,20) are not. Suggest an O(n) algorithm which determines whether the input of n points contains two identical points or not.
The simple approach is just checking every pair of points: if there are 5 points, the first one needs 4 comparisons, the second one 3 comparisons, and so on. But that is O(n^2), not O(n). I really can't think beyond that. Any suggestions?
One obvious (but possibly inadequate) possibility would be to reduce each point to a floating point number representing the ratio, so (2,3) and (10,15) both become 0.66667, and (10, 20) become 0.5.
The reason this wouldn't work is that floating point numbers tend to be approximate, so you'd just about need to use an approximate comparison, and put up with the fact that it would show points as identical as long as they were equal to (say) 15 decimal places.
If you don't want that, you could create a rational number class that supported comparison (e.g., reduced each ratio to lowest terms).
Either way, once you've reduced a point to a single number, you just insert each into (for one possibility) a hash table. As you insert each you check whether that ratio is already in the hash table--if it is, you have an identical point. If not, insert it normally.
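A minimal C++ sketch of the exact-rational variant: reduce each point by its gcd, pack the reduced pair into one 64-bit key, and probe a hash set. This assumes positive integer coordinates; zeroes would need the special-casing discussed later in this section.

#include <cstdint>
#include <numeric>        // std::gcd (C++17)
#include <unordered_set>
#include <utility>
#include <vector>

// Returns true if two points have the same reduced ratio.
// One expected-O(1) hash probe per point gives O(n) overall.
bool hasIdenticalPair(const std::vector<std::pair<int, int>>& pts)
{
    std::unordered_set<uint64_t> seen;
    for (auto [x, y] : pts) {
        int g = std::gcd(x, y);   // reduce the ratio to lowest terms
        uint64_t key = (uint64_t)(uint32_t)(x / g) << 32 | (uint32_t)(y / g);
        if (!seen.insert(key).second)
            return true;          // this reduced ratio was already present
    }
    return false;
}

With this, (2,3) and (10,15) both reduce to (2,3) and collide in the set, while (10,20) reduces to (1,2) and does not.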
One way to reduce a point to a single number is to multiply the first co-ordinate of the point by the product of all the second co-ordinates of the other points.
For example:
(10, 20) -> 10 * 10 * 4 = 400
(5, 10) -> 5 * 20 * 4 = 400
(3, 4) -> 3 * 20 * 10 = 600
The first and second points match. For large sets of points the products would become very large and require a BigNumber (which would make this more than O(n)), but you can keep the numbers within a reasonable limit by taking a modulo after each multiplication. Then use a hash table as suggested in Jerry Coffin's answer.
You can easily compute, for each point, the product of all the other points' second co-ordinates by doing a single forward pass and then a single backward pass over the array, keeping running products.
E.g. in Java:
import java.util.HashSet;

// Use a prime modulus small enough that (coordinate * prod) cannot overflow
// a long. (Taking the largest prime below Long.MAX_VALUE would overflow as
// soon as prod * points[i][1] exceeds 2^63.)
long m = 1_000_000_007L;
int[][] points = {{1, 2}, {1, 3}, {1, 4}, {1, 5}, {2, 6}};
long[] mods = new long[points.length];

// Forward pass: mods[i] = product of the second co-ordinates before index i
long prod = 1;
for (int i = 0; i < points.length; i++) {
    mods[i] = prod;
    prod = (points[i][1] * prod) % m;
}

// Backward pass: multiply in the second co-ordinates after index i
prod = 1;
for (int i = points.length - 1; i >= 0; i--) {
    mods[i] = (mods[i] * prod) % m;
    prod = (points[i][1] * prod) % m;
}

// Hash the reduced values; a repeat means two identical points
HashSet<Long> set = new HashSet<Long>();
for (int i = 0; i < points.length; i++) {
    prod = (mods[i] * points[i][0]) % m;
    if (set.contains(prod))
        System.out.println("Found a match");
    set.add(prod);
}
This algorithm assumes all the co-ordinates are non-zero integers. Zeroes can be handled as special cases: all points with zero as the first co-ordinate match each other, likewise for those with zero as the second co-ordinate, and (0, 0) matches every point. As an optimization, the second and third passes through the array could be merged into a single pass.

Different way to index threads in CUDA C

I have a 9x9 matrix and I flattened it into a vector of 81 elements; then I defined a grid of 9 blocks with 9 threads each, for a total of 81 threads (a 3x3 grid of blocks, each block holding 3x3 threads).
Then I tried to verify the index related to thread (0,0) of block (1,1); first I calculated the i-th column and the j-th row like this:
i = blockDim.x*blockIdx.x + threadIdx.x = 3*1 + 0 = 3
j = blockDim.y*blockIdx.y + threadIdx.y = 3*1 + 0 = 3
therefore the index is:
index = N*i + j = 9*3 + 3 = 30
As a matter of fact, thread (0,0) of block (1,1) does correspond to the element at index 30 of the flattened matrix.
Now here's my problem: let's say I choose a grid with 4 blocks, (0,0) (1,0) (0,1) (1,1), each with 4 threads, (0,0) (1,0) (0,1) (1,1).
Let's say I keep the original vector of 81 elements; what should I do to get the index of a generic element of the vector using just 4*4 = 16 threads? I have tried the formulas written above, but they don't seem to apply.
My goal is for every thread to handle a single element of the vector...
A common way to have a smaller number of threads cover a larger number of data elements is to use a "grid-striding loop". Suppose I have a vector of n elements and some smaller number of threads, and I want to take every element, add 1 to it, and store it back in the original vector. That code could look something like this:
__global__ void my_inc_kernel(int *data, int n)
{
    // globally unique 1D index for a thread in a 2D grid of 2D blocks
    int idx = (gridDim.x*blockDim.x)*(threadIdx.y+blockDim.y*blockIdx.y)
            + (threadIdx.x+blockDim.x*blockIdx.x);
    while (idx < n) {
        data[idx]++;
        idx += (gridDim.x*blockDim.x)*(gridDim.y*blockDim.y); // stride = total threads
    }
}
(the above is coded in browser, not tested)
The only complicated parts above are the indexing calculations. The initial computation of idx is just the typical creation/assignment of a globally unique id (idx) for each thread in a 2D threadblock/grid structure. Let's break it down:
int idx = (gridDim.x*blockDim.x)*(threadIdx.y+blockDim.y*blockIdx.y) // (width of grid in threads)*(thread y-index)
        + (threadIdx.x+blockDim.x*blockIdx.x);                       // + (thread x-index)
The amount added to idx on each pass of the while loop is the total number of threads in the 2D grid. Therefore, each iteration of the while loop covers one "grid's width" of elements at a time, and then "strides" to the next grid-width to process the next group of elements. Let's break that down:
idx += (gridDim.x*blockDim.x)*(gridDim.y*blockDim.y); // (width of grid in threads)*(height of grid in threads)
This methodology does not require that the total number of elements be evenly divisible by the number of threads: the conditional check of the while loop handles every relationship between vector size and grid size.
This particular grid-striding loop methodology has the additional benefit (in terms of mapping elements to threads) that it tends to naturally promote coalesced access. The reads and writes to the data vector in the code above will coalesce perfectly, due to the behavior of the grid-striding loop. You can further help coalescing in this case by choosing block sizes that are a whole-number multiple of 32, but that is not central to your question.
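To make the launch configuration from the question concrete, here is a minimal host-side sketch (error checking omitted): a 2x2 grid of 2x2 blocks, i.e. 16 threads, covering all 81 elements.

#include <cuda_runtime.h>

int main()
{
    int n = 81;                 // the flattened 9x9 matrix
    int *d_data;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMemset(d_data, 0, n * sizeof(int));

    dim3 block(2, 2);           // 4 threads per block: (0,0) (1,0) (0,1) (1,1)
    dim3 grid(2, 2);            // 4 blocks, 16 threads in total
    my_inc_kernel<<<grid, block>>>(d_data, n);
    cudaDeviceSynchronize();
    // Thread 0 strides over elements 0, 16, 32, 48, 64, 80 (6 elements);
    // every other thread covers 5.
    cudaFree(d_data);
    return 0;
}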

Modified Parallel Scan

This is more of an algorithms question than a programming one. I'm wondering if the prefix sum (or any) parallel algorithm can be modified to accomplish the following. I'd like to generate a result from two input lists on a GPU in less than O(N) time.
The rule is: Carry forth the first number from data until the same index in keys contains a lesser value.
Whenever I try mapping it to a parallel scan, it doesn't work, because I can't be sure which values of data to propagate in the upsweep: it's not possible to know which prior data might have carried far enough to compare against the current key. This problem reminds me of a ripple-carry adder, where we need to consider the current index AND all past indices.
Again, I don't need code for a parallel scan (though that would be nice); I'm mostly looking to understand how it can be done, or why it can't be.
int data[N] = {5, 6, 5, 5, 3, 1, 5, 5};
int keys[N] = {5, 6, 5, 5, 4, 2, 5, 5};
int result[N];
serial_scan(N, keys, data, result);
// Print result: should be {5, 5, 5, 5, 3, 1, 1, 1}
code to do the scan in serial is below:
void serial_scan(int N, int *k, int *d, int *r)
{
    r[0] = d[0];
    for (int i = 1; i < N; i++)
    {
        if (k[i] >= r[i-1]) {
            r[i] = r[i-1];
        } else if (k[i] >= d[i]) {
            r[i] = d[i];
        } else {
            r[i] = 0;
        }
    }
}
The general technique for a parallel scan can be found here, described in the functional language Standard ML. This can be done for any associative operator, and I think yours fits the bill.
One intuition pump is that you can calculate the sum of an array in O(log(n)) span (running time with infinitely many processors) by recursively calculating the sums of the two halves of the array and adding them together. In calculating the scan, you just need to know the sum of the array before the current point.
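A minimal C++ sketch of that intuition, with std::async standing in for "infinitely many processors" (a real GPU implementation would use a tree reduction instead):

#include <future>

// Sum of a[lo, hi), computing the two halves concurrently.
int parsum(const int* a, int lo, int hi)
{
    if (hi - lo == 1) return a[lo];
    int mid = lo + (hi - lo) / 2;
    auto left = std::async(std::launch::async, parsum, a, lo, mid);
    int right = parsum(a, mid, hi);   // runs while 'left' is in flight
    return left.get() + right;
}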
We can calculate the scan of an array by handling the two halves in parallel: first calculate the sum of the 1st half using the above technique; then the scans of the two halves can be computed in parallel with each other, with the 1st half's scan starting at 0 and the 2nd half's starting at the sum calculated before. The full algorithm is a little trickier, but uses the same idea.
Here's some pseudo-code for doing a parallel scan in a different language (for the specific case of ints and addition, but the logic is identical for any associative operator):
//assume input.length is a power of 2
int[] scanadd(int[] input) {
    if (input.length == 1)
        return input
    else {
        //calculate a new collapsed sequence which is the sum of sequential even/odd pairs
        //assume this for loop is done in parallel
        int[] collapsed = new int[input.length/2]
        for (i <- 0 until collapsed.length)
            collapsed[i] = input[2*i] + input[2*i+1]
        //recursively scan the collapsed values
        int[] scancollapse = scanadd(collapsed)
        //now we can use the scan of the collapsed seq to calculate the full sequence
        //also assume this for loop is done in parallel
        int[] output = new int[input.length]
        for (i <- 0 until input.length)
            //an odd index ends exactly on a collapsed pair, so its value can be read
            //directly from the collapsed scan; an even index takes the scan just
            //before it and adds the value at the current index
            if (i % 2 == 1)
                output[i] = scancollapse[(i-1)/2]
            else
                output[i] = (i == 0) ? input[0] : scancollapse[i/2 - 1] + input[i]
        return output
    }
}
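A direct C++ transcription of the pseudo-code (serial here; the two marked loops are the parallel ones), handy for checking the logic: scanadd({1, 2, 3, 4}) returns {1, 3, 6, 10}.

#include <vector>

std::vector<int> scanadd(const std::vector<int>& input) // length is a power of 2
{
    if (input.size() == 1) return input;

    std::vector<int> collapsed(input.size() / 2);
    for (size_t i = 0; i < collapsed.size(); ++i)       // parallel on a GPU
        collapsed[i] = input[2*i] + input[2*i + 1];

    std::vector<int> scancollapse = scanadd(collapsed);

    std::vector<int> output(input.size());
    for (size_t i = 0; i < input.size(); ++i) {         // parallel on a GPU
        if (i % 2 == 1)
            output[i] = scancollapse[(i - 1) / 2];
        else
            output[i] = (i == 0) ? input[0] : scancollapse[i/2 - 1] + input[i];
    }
    return output;
}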
