Using binary number sequence to determine which boxes are filled - algorithm

I am struggling with programming the following (with GAS):
I have 14 boxes, each of which is either empty or not, and I need to keep track of which boxes are empty. My idea is to use a base-2 control value to keep track of which boxes are empty.
To illustrate I will use an example with only 3 boxes
I figured the following scheme would work:
No boxes empty: 0 control value
Only box 1 empty: 1 control value
Only box 2 empty: 2 control value
Boxes 1 and 2 empty: 3 control value
Only box 3 empty: 4 control value
Boxes 1 and 3 empty: 5 control value
Boxes 2 and 3 empty: 6 control value
All boxes empty: 7 control value
I need an algorithm to know which boxes are empty and how to update the control value as boxes are filled and/or emptied.
I assume(d) that ready-to-use algorithms must exist for this functionality, and hence that this sort of use of a base-2 series has a name I can google, right?
In case this is the wrong place for this question, please let me know. Thank you.

(I second the comment that empty would usually be a 0 bit and full a 1 bit, but I can work with this.)
Just use bitwise operators
x & 1 # True if box 1 is empty
x & 2 # True if box 2 is empty
x & 4 # True if box 3 is empty
To switch whether boxes are empty or full:
x ^ 1 # Switches box 1
x ^ 2 # Switches box 2
x ^ 4 # Switches box 3
To make sure a box is empty:
x | 1 # Box 1 is empty
x | 2 # Box 2 is empty
x | 4 # Box 3 is empty
The only one that is tricky is guaranteeing that a box is full.
(x | 1) ^ 1 # Box 1 is full (same as x & ~1)
(x | 2) ^ 2 # Box 2 is full (same as x & ~2)
(x | 4) ^ 4 # Box 3 is full (same as x & ~4)
Bitwise And(&)
Bitwise XOR (^)
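As a minimal sketch, here are the four operations in Python under the question's convention (a set bit means the box is empty). The helper names are made up for this illustration, and boxes are numbered from 1, so box k uses the bit with value 1 << (k - 1).

```python
def bit(box):
    """Mask for a 1-based box number: box 1 -> 1, box 2 -> 2, box 3 -> 4."""
    return 1 << (box - 1)

def is_empty(x, box):
    """True if the bit for this box is set (set bit = empty)."""
    return (x & bit(box)) != 0

def toggle(x, box):
    """Flip the box between empty and full."""
    return x ^ bit(box)

def mark_empty(x, box):
    """Force the bit on, whatever it was before."""
    return x | bit(box)

def mark_full(x, box):
    """Force the bit off, whatever it was before."""
    return x & ~bit(box)

x = 0                  # no box is marked empty
x = mark_empty(x, 1)   # control value 1
x = mark_empty(x, 3)   # control value 5
x = mark_full(x, 1)    # control value 4
x = toggle(x, 2)       # control value 6
print(x, [box for box in (1, 2, 3) if is_empty(x, box)])
```

Each helper returns the new control value instead of modifying anything in place, so the caller decides where the value is stored.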

This technique is called bitmasking. In computer science, a mask or bitmask is data that is used for bitwise operations, particularly in a bit field.
You can represent the state of your problem as a bool array, something like: bool is_empty[14]; But this takes one array element per box, i.e. O(N) memory, where N is the number of boxes you have. Instead of using this array you can use a bitmask, which represents every box with one bit of a single number. Let's take an example with three boxes:
decimal binary meaning
0 000 All three boxes are empty
1 001 Box 1 is full
2 010 Box 2 is full
3 011 Boxes 1, 2 are full
4 100 Box 3 is full
5 101 Boxes 1, 3 are full
6 110 Boxes 2, 3 are full
7 111 All three boxes are full
To check if bit i is set, you just check if mask & 1<<i is not equal to zero.
To turn on bit i, you perform mask = mask | 1<<i;
To turn off bit i, you perform mask = mask & ~(1<<i)
To toggle bit i, you perform mask = mask ^ 1<<i.
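To make the link between a mask and the boxes concrete, here is a small Python sketch that decodes a mask back into box numbers. It assumes the convention of the table above (a set bit means the box is full) and that bit i stands for box i + 1; full_boxes, empty_boxes and NUM_BOXES are names chosen for this example only.

```python
NUM_BOXES = 14  # the count from the question; any count works

def full_boxes(mask, n=NUM_BOXES):
    """Return the 1-based numbers of the boxes whose bit is set."""
    return [i + 1 for i in range(n) if mask & (1 << i)]

def empty_boxes(mask, n=NUM_BOXES):
    """Return the 1-based numbers of the boxes whose bit is clear."""
    return [i + 1 for i in range(n) if not mask & (1 << i)]

mask = 0
mask |= 1 << 0       # turn on bit 0: box 1 becomes full
mask |= 1 << 2       # turn on bit 2: box 3 becomes full
mask ^= 1 << 13      # toggle bit 13: box 14 was empty, now full
mask &= ~(1 << 2)    # turn off bit 2: box 3 becomes empty again

print(mask, full_boxes(mask))
```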
A common use of bitmasks is to generate all subsets of a set of numbers. Here is a C++ program that does that:
#include <iostream>
#include <vector>

int main() {
    std::vector<int> a{1, 2, 3};
    int n = static_cast<int>(a.size());
    // Each mask from 0 to 2^n - 1 selects one subset of a.
    for (int mask = 0; mask < (1 << n); mask++) {
        std::vector<int> subset;
        for (int i = 0; i < n; i++) {
            if (mask & 1 << i) {  // bit i set: a[i] belongs to this subset
                subset.push_back(a[i]);
            }
        }
        for (auto &&num : subset) std::cout << num << " ";
        std::cout << std::endl;
    }
}

Related

How to extract vectors from a given condition matrix in Octave

I'm trying to extract a matrix with two columns. The first column is the data that I want to group into a vector, while the second column is information about the group.
A =
1 1
2 1
7 2
9 2
7 3
10 3
13 3
1 4
5 4
17 4
1 5
6 5
The result that I seek is:
A1 =
1
2
A2 =
7
9
A3 =
7
10
13
A4=
1
5
17
A5 =
1
6
As an illustration, I used the eval function, but it didn't give the results I wanted.
Assuming that you don't actually need individually named, separate variables, the following will put the values into separate cells of a cell array, each of which can be an arbitrary size and can then be retrieved using cell index syntax. It makes use of logical indexing so that each iteration of the for loop assigns to that cell in B just the values from the first column of A that have the matching number in the second column of A.
num_cells = max (A(:,2));
B = cell (num_cells,1);
for idx = 1:num_cells
  B{idx} = A((A(:,2)==idx),1);
end
B =
{
[1,1] =
1
2
[2,1] =
7
9
[3,1] =
7
10
13
[4,1] =
1
5
17
[5,1] =
1
6
}
Cell arrays are accessed a bit differently than normal numeric arrays. Array indexing (with ()) will return another cell, e.g.:
>> B(1)
ans =
{
[1,1] =
1
2
}
To get the contents of the cell so that you can work with them like any other variable, index them using {}.
>> B{1}
ans =
1
2
How it works:
Use max(A(:,2)) to find out how many array elements are going to be needed. A(:,2) uses subscript notation to indicate every value of A in column 2.
Create an empty cell array B with the right number of cells to contain the separated parts of A. This isn't strictly necessary, but with large amounts of data, things can slow down a lot if you keep adding on to the end of an array. Pre-allocating is usually better.
For each iteration of the for loop, it determines which elements in the 2nd column of A have the value matching the value of idx. This returns a logical array. For example, for the third time through the for loop, idx = 3, and:
>> A_index3 = A(:,2)==3
A_index3 =
0
0
0
0
1
1
1
0
0
0
0
0
That is a logical array of trues/falses indicating which elements equal 3. You are allowed to mix logical and subscript indexing. So using this we can retrieve just those values from the first column:
A(A_index3, 1)
ans =
7
10
13
We get the same result if we do it in a single line without the A_index3 intermediate placeholder:
>> A(A(:,2)==3, 1)
ans =
7
10
13
Putting it in a for loop where 3 is replaced by the loop variable idx, and we assign the answer to the idx location in B, we get all of the values separated into different cells.
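For readers who think more easily outside Octave, the same grouping idea can be sketched in Python. This is an illustration of the logic, not a translation of the Octave API: the list A mirrors the matrix from the question as (value, group) pairs, and group_by_label is a hypothetical helper name.

```python
from collections import defaultdict

# (value, group) pairs, mirroring the two columns of A in the question
A = [(1, 1), (2, 1), (7, 2), (9, 2), (7, 3), (10, 3), (13, 3),
     (1, 4), (5, 4), (17, 4), (1, 5), (6, 5)]

def group_by_label(rows):
    """Collect the first column into one list per value of the second column."""
    groups = defaultdict(list)
    for value, label in rows:
        groups[label].append(value)
    return dict(groups)

B = group_by_label(A)

# The equivalent of the logical index A(:,2)==3 for a single group:
index3 = [label == 3 for _, label in A]
group3 = [value for (value, _), keep in zip(A, index3) if keep]

print(B[3], group3)
```

Like the cell array, the dictionary holds groups of different lengths without needing a separately named variable per group.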

Assignment regarding, dynamic programming. Making my code more efficient?

I've got an assignment regarding dynamic programming.
I'm to design an efficient algorithm that does the following:
There is a path, covered in spots. The user can move forward to the end of the path using a series of push buttons. There are 3 buttons. One moves you forward 2 spots, one moves you forward 3 spots, one moves you forward 5 spots. The spots on the path are either black or white, and you cannot land on a black spot. The algorithm finds the smallest number of button pushes needed to reach the end (past the last spot, can overshoot it).
The user inputs "n", the number of spots, and then fills the array with n values of B or W (black or white). The first spot must be white. Here's what I have so far (it's only meant to be pseudocode):
int x = 0
int totalsteps = 0
n = user input
int countAtIndex[n-1] <- Set all values to -1 // I'll do the nitty gritty stuff like this after
int spots[n-1] = user input
pressButton(totalSteps, x) {
if(countAtIndex[x] != -1 AND totalsteps >= countAtIndex[x]) {
FAILED } //Test to see if the value has already been modified (not -1 or not better)
else
if (spots[x] = "B") {
countAtIndex[x] = -2 // Indicator of invalid spot
FAILED }
else if (x >= n-5) { // Reached within 5 of the end, press 5 so take a step and win
GIVE VALUE OF TOTALSTEPS + 1 A SUCCESSFUL SHORTEST OUTPUT
FINISH }
else
countAtIndex[x] = totalsteps
pressButton(totalsteps + 1, x+5) //take 5 steps
pressButton(totalsteps + 1, x+3) //take 3 steps
pressButton(totalsteps + 1, x+2) //take 2 steps
}
I appreciate this may look quite bad but I hope it comes across okay, I just want to make sure the theory is sound before I write it out better. I'm wondering if this is not the most efficient way of doing this problem. In addition to this, where there are capitals, I'm unsure on how to "Fail" the program, or how to return the "Successful" value.
Any help would be greatly appreciated.
I should add, in case it's unclear, that I'm using countAtIndex[] to store the number of moves needed to get to that index in the path. E.g. position 3 (countAtIndex[2]) could have a value of 1, meaning it has taken 1 move to get there.
I'm converting my comment into an answer since this will be too long for a comment.
There are generally two ways to solve a dynamic programming problem: top-down with memoization, or bottom-up by systematically filling an output array. My intuition says that the implementation of the bottom-up approach will be simpler. And my intent with this answer is to provide an example of that approach. I'll leave it as an exercise for the reader to write the formal algorithm, and then implement the algorithm.
So, as an example, let's say that the first 11 elements of the input array are:
index: 0 1 2 3 4 5 6 7 8 9 10 ...
spot: W B W B W W W B B W B ...
To solve the problem, we create an output array (aka the DP table), to hold the information we know about the problem. Initially all values in the output array are set to infinity, except for the first element which is set to 0. So the output array looks like this:
index: 0 1 2 3 4 5 6 7 8 9 10 ...
spot: W B W B W W W B B W B
output: 0 - x - x x x - - x -
where - is a black space (not allowed), and x is being used as the symbol for infinity (a spot that's either unreachable, or hasn't been reached yet).
Then we iterate from the beginning of the table, updating entries as we go.
From index 0, we can reach 2 and 5 with one move. We can't move to 3 because that spot is black. So the updated output array looks like this:
index: 0 1 2 3 4 5 6 7 8 9 10 ...
spot: W B W B W W W B B W B
output: 0 - 1 - x 1 x - - x -
Next, we skip index 1 because the spot is black. So we move on to index 2. From 2, we can reach 4,5, and 7. Index 4 hasn't been reached yet, but now can be reached in two moves. The jump from 2 to 5 would reach 5 in two moves. But 5 can already be reached in one move, so we won't change it (this is where the recurrence relation comes in). We can't move to 7 because it's black. So after processing index 2, the output array looks like this:
index: 0 1 2 3 4 5 6 7 8 9 10 ...
spot: W B W B W W W B B W B
output: 0 - 1 - 2 1 x - - x -
After skipping index 3 (black) and processing index 4 (can reach 6 and 9), we have:
index: 0 1 2 3 4 5 6 7 8 9 10 ...
spot: W B W B W W W B B W B
output: 0 - 1 - 2 1 3 - - 3 -
Processing index 5 won't change anything because 7,8,10 are all black. Index 6 doesn't change anything because 8 is black, 9 can already be reached in three moves, and we aren't showing index 11. Indexes 7 and 8 are skipped because they're black. And all jumps from 9 are into parts of the array that aren't shown.
So if the goal was to reach index 11, the number of moves would be 4, and the possible paths would be 2,4,6,11 or 2,4,9,11. Or if the array continued, we would simply keep iterating through the array, and then check the last five elements of the array to see which has the smallest number of moves.
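For readers who want something to check their own attempt against, here is one possible Python sketch of the bottom-up table filling walked through above. It assumes the path is given as a non-empty string of "W" and "B" whose first spot is white, and that any jump landing past the last spot counts as finishing; min_presses and its jumps parameter are names invented for this example.

```python
import math

def min_presses(spots, jumps=(2, 3, 5)):
    """Fewest button presses to get past the last spot, or None if impossible."""
    n = len(spots)
    best = [math.inf] * n   # best[i] = fewest presses known to reach spot i
    best[0] = 0             # we start on the first (white) spot
    answer = math.inf
    for i in range(n):
        # Skip black spots and spots that cannot be reached at all.
        if spots[i] == "B" or best[i] == math.inf:
            continue
        for jump in jumps:
            target = i + jump
            if target >= n:
                # This press carries us past the end of the path.
                answer = min(answer, best[i] + 1)
            elif spots[target] == "W":
                # Keep the smaller of the old count and the new route.
                best[target] = min(best[target], best[i] + 1)
    return None if answer == math.inf else answer

print(min_presses("WBWBWWWBBWB"))  # the 11-spot example from the walk-through
```

Each spot is visited once and tries three jumps, so the running time grows linearly with the number of spots.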

Replace multiple pixels value in an image with a certain value Matlab

I have a 640x480 image img, and I want to replace pixels whose values are not in the list or array x=[1, 2, 3, 4, 5] with a certain value, 10, so that any pixel in img which doesn't have any of the values in x will be replaced with 10. I already know how to replace only one value using img(img~=1)=10, or multiple values using img(img~=1 & img~=2 & img~=3 & img~=4 & img~=5)=10, but when I tried img(img~=x)=10 it gave an error saying Matrix dimensions must agree. So if anyone could please advise.
You can achieve this very easily with a combination of permute and bsxfun. We can create a vector along the third dimension (1 x 1 x 5) that consists of the elements of [1,2,3,4,5], then use bsxfun with the not-equals function (@ne) on your image (assuming grayscale) so that we create a 3D matrix of 5 slices. Each slice tells you which locations in the image do not match an element in x. The first slice would give you the locations that don't match x = 1, the second slice would give you the locations that don't match x = 2, and so on.
Once you finish this, we can use an all call operating on the third dimension to consolidate the pixel locations that are not equal to any of 1, 2, 3, 4 or 5. The last step would be to take this logical map, which tells you the locations that are none of 1, 2, 3, 4, or 5, and set those locations to 10.
One thing we need to consider is that the image type and the vector x must be the same type. We can ensure this by casting the vector to be the same class as img.
As such, do something like this:
x = permute([1 2 3 4 5], [3 1 2]);
vals = bsxfun(@ne, img, cast(x, class(img)));
ind = all(vals, 3);
img(ind) = 10;
The advantage of the above method is that the list you want to use to check for the elements can be whatever you want. It prevents having messy logical indexing syntax, like img(img ~= 1 & img ~= 2 & ....). All you have to do is change the input list at the beginning line of the code, and bsxfun, permute and all should do the work for you.
Here's an example 5 x 5 image:
>> rng(123123);
>> img = randi(7, 5, 5)
img =
3 4 3 6 5
7 2 6 5 1
3 1 6 1 7
6 4 4 3 3
6 2 4 1 3
By using the code above, the output we get is:
img =
3 4 3 10 5
10 2 10 5 1
3 1 10 1 10
10 4 4 3 3
10 2 4 1 3
You can most certainly see that those elements that are neither 1, 2, 3, 4 or 5 get set to 10.
Aside
If you don't like the permute and bsxfun approach, one way would be to have a for loop and with an initially all true array, keep logical ANDing the final result with a logical map that consists of those locations which are not equal to each value in x. In the end, we will have a logical map where true are those locations that are neither equal to 1, 2, 3, 4 or 5.
Therefore, do something like this:
ind = true(size(img));
for idx = 1 : 5
ind = ind & img ~= idx;
end
img(ind) = 10;
If you do this instead, you'll see that we get the same answer.
Approach #1
You can use ismember,
which, according to its official documentation, for a call ismember(A,B) outputs a logical array of the same size as A, with 1's where
an element of A is present in B and 0's otherwise. Since you are looking to detect "not in the list or array", you need to invert it afterwards, i.e. ~ismember().
In your case, you have img as A and x as B, so ~ismember(img,x) would give you those places where img is not equal to any element in x.
You can then map into img to set all those in it to 10 with this final solution -
img(~ismember(img,x)) = 10
Approach #2
Similar to rayryeng's solution, you can use bsxfun, but keep it in 2D which could be more efficient as it would also avoid permute. The implementation would look something like this -
img(reshape(all(bsxfun(@ne,img(:),x(:).'),2),size(img))) = 10
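The underlying membership test is language-independent, so here is a small pure-Python sketch of it, using nested lists in place of a MATLAB matrix. The function name replace_not_in is made up for this illustration, and the input is the 5 x 5 example image shown earlier.

```python
def replace_not_in(img, allowed, fill):
    """Return a copy of img where every pixel not found in allowed becomes fill."""
    allowed = set(allowed)  # set lookup keeps each membership test cheap
    return [[pixel if pixel in allowed else fill for pixel in row]
            for row in img]

img = [[3, 4, 3, 6, 5],
       [7, 2, 6, 5, 1],
       [3, 1, 6, 1, 7],
       [6, 4, 4, 3, 3],
       [6, 2, 4, 1, 3]]

out = replace_not_in(img, [1, 2, 3, 4, 5], 10)
for row in out:
    print(row)
```

As with the ismember approach, changing the list of allowed values needs no change to the rest of the code.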

Convert non-zero image pixels to row-column coordinates and save the output to workspace

I'm having difficulties converting image pixels to coordinates and making them appear in my MATLAB workspace. For example, I have the image with pixel values as below (it's a binary image of size 4x4):
0 0 0 0
0 1 1 0
0 1 1 0
0 0 0 0
After getting the pixels, I want to read each value and if they're not equal to zero (which means 1), I want to read the coordinates of that value and save them in to my MATLAB workspace. For example, this is the idea that I thought of:
[x,y] = size(image)
for i=1:x
for j=1:y
if (image(i,j)~=0)
....
However, I am stuck. Can anyone give any suggestion on how to read the coordinates of the non-zero values and save them to my workspace?
Specifically, my expected result in the workspace:
2 2
2 3
3 2
3 3
Doing it with loops is probably not the most efficient way to do what you ask. Instead, use find. find determines the locations in a vector or matrix that are non-zero. In your case, all you have to do is:
[row,col] = find(image);
row and col would contain the row and column locations of the non-zero elements in your binary image. Therefore, with your example:
image = [0 0 0 0;
0 1 1 0;
0 1 1 0;
0 0 0 0];
[row,col] = find(image);
We get:
>> disp([row, col]);
2 2
3 2
2 3
3 3
However, you'll see that the locations are not in the order you expect. This is because the locations are displayed in column-major order, meaning that the columns are traversed first. In your example, you are displaying them in row-major order. If you'd like to maintain this order, you would sort the results by the row coordinate:
>> sortrows([row, col])
ans =
2 2
2 3
3 2
3 3
However, if you really really really really... I mean really... want to use for loops, what you would do is keep two separate arrays that are initially empty, then loop through each pixel and determine whether it's non-zero. If it is, then you would add the row and column locations to these two separate arrays.
As such, you would do this:
row = []; col = [];
[x,y] = size(image);
for i=1:x
for j=1:y
if (image(i,j)~=0)
row = [row; i]; %// Concatenate row and column location if non-zero
col = [col; j];
end
end
end
This should give you the same results as find.
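The difference between the two orderings discussed above can be sketched in plain Python. This only illustrates the traversal order and is not MATLAB's find itself: nonzero_coords and its order argument are names invented for this example, and coordinates are reported 1-based to match MATLAB.

```python
def nonzero_coords(image, order="row"):
    """List the 1-based (row, col) pairs of the non-zero entries.

    order="row" scans row by row (the order the question expects);
    order="column" scans column by column (the order find reports).
    """
    rows, cols = len(image), len(image[0])
    if order == "row":
        return [(r + 1, c + 1) for r in range(rows) for c in range(cols)
                if image[r][c] != 0]
    return [(r + 1, c + 1) for c in range(cols) for r in range(rows)
            if image[r][c] != 0]

image = [[0, 0, 0, 0],
         [0, 1, 1, 0],
         [0, 1, 1, 0],
         [0, 0, 0, 0]]

print(nonzero_coords(image, "row"))
print(nonzero_coords(image, "column"))
```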
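You can use meshgrid() to collect those coordinates. The function generates two outputs, the first being the x (column) coordinates and the second the y (row) coordinates, so x_size below is the number of columns and y_size the number of rows. You'd go like this: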
[xcoord ycoord] = meshgrid( 1:x_size, 1:y_size);
zeros_coordsx = xcoord( image == 0);
zeros_coordsy = ycoord( image == 0);
This is typically faster than nested looping and keeps you within MATLAB's natural vector operation space. Use image ~= 0 in place of image == 0 to collect the non-zero pixels instead. These two outputs are in sync, meaning that
image( zeros_coordsy(1), zeros_coordsx(1))
is one of the zeros on the image

Parallel radix sort, how would this implementation actually work? Are there some heuristics?

I am working on an Udacity quiz for their parallel programming course. I am pretty stuck on how I should start on the assignment because I am not sure if I understand it correctly.
For the assignment (in code) we are given two arrays: an array of values and an array of positions. We are supposed to sort the array of values with a parallelized radix sort, moving the positions along with them.
I completely understand radix sort and how it works. What I don't understand is how they want us to implement it. Here is the template given to start the assignment:
//Udacity HW 4
//Radix Sorting
#include "reference_calc.cpp"
#include "utils.h"
/* Red Eye Removal
===============
For this assignment we are implementing red eye removal. This is
accomplished by first creating a score for every pixel that tells us how
likely it is to be a red eye pixel. We have already done this for you - you
are receiving the scores and need to sort them in ascending order so that we
know which pixels to alter to remove the red eye.
Note: ascending order == smallest to largest
Each score is associated with a position, when you sort the scores, you must
also move the positions accordingly.
Implementing Parallel Radix Sort with CUDA
==========================================
The basic idea is to construct a histogram on each pass of how many of each
"digit" there are. Then we scan this histogram so that we know where to put
the output of each digit. For example, the first 1 must come after all the
0s so we have to know how many 0s there are to be able to start moving 1s
into the correct position.
1) Histogram of the number of occurrences of each digit
2) Exclusive Prefix Sum of Histogram
3) Determine relative offset of each digit
For example [0 0 1 1 0 0 1]
-> [0 1 0 1 2 3 2]
4) Combine the results of steps 2 & 3 to determine the final
output location for each element and move it there
LSB Radix sort is an out-of-place sort and you will need to ping-pong values
between the input and output buffers we have provided. Make sure the final
sorted results end up in the output buffer! Hint: You may need to do a copy
at the end.
*/
void your_sort(unsigned int* const d_inputVals,
unsigned int* const d_inputPos,
unsigned int* const d_outputVals,
unsigned int* const d_outputPos,
const size_t numElems)
{
}
I specifically don't understand how those 4 steps end up sorting the array.
So for the first step, I am supposed to create a histogram of the "digits" (why is that in quotes..?). So given a input value n I need to make a count of the 0's and 1's into a histogram. So, should step 1 create an array of histograms, one for each input value?
And well, for the rest of the steps it breaks down pretty quickly. Could someone show me how these steps are supposed to implement a radix sort?
The basic idea behind a radix sort is that we will consider each element to be sorted digit by digit, from least significant to most significant. For each digit, we will move the elements so that those digits are in increasing order.
Let's take a really simple example. Let's sort four quantities, each of which have 4 binary digits. Let's choose 1, 4, 7, and 14. We'll mix them up and also visualize the binary representation:
Element # 1 2 3 4
Value: 7 14 4 1
Binary: 0111 1110 0100 0001
First we will consider bit 0:
Element # 1 2 3 4
Value: 7 14 4 1
Binary: 0111 1110 0100 0001
bit 0: 1 0 0 1
Now the radix sort algorithm says we must move the elements in such a way that (considering only bit 0) all the zeroes are on the left, and all the ones are on the right. Let's do this while preserving the order of the elements with a zero bit and preserving the order of the elements with a one bit. We could do that like this:
Element # 2 3 1 4
Value: 14 4 7 1
Binary: 1110 0100 0111 0001
bit 0: 0 0 1 1
The first step of our radix sort is complete. The next step is to consider the next (binary) digit:
Element # 2 3 1 4
Value: 14 4 7 1
Binary: 1110 0100 0111 0001
bit 1: 1 0 1 0
Once again, we must move elements so that the digit in question (bit 1) is arranged in ascending order:
Element # 3 4 2 1
Value: 4 1 14 7
Binary: 0100 0001 1110 0111
bit 1: 0 0 1 1
Now we must move to the next higher digit:
Element # 3 4 2 1
Value: 4 1 14 7
Binary: 0100 0001 1110 0111
bit 2: 1 0 1 1
And move them again:
Element # 4 3 2 1
Value: 1 4 14 7
Binary: 0001 0100 1110 0111
bit 2: 0 1 1 1
Now we move to the last (highest order) digit:
Element # 4 3 2 1
Value: 1 4 14 7
Binary: 0001 0100 1110 0111
bit 3: 0 0 1 0
And make our final move:
Element # 4 3 1 2
Value: 1 4 7 14
Binary: 0001 0100 0111 1110
bit 3: 0 0 0 1
And the values are now sorted. This hopefully seems clear, but in the description so far we've glossed over the details of things like "how do we know which elements to move?" and "how do we know where to put them?" So let's repeat our example, but we'll use the specific methods and sequence suggested in the prompt, in order to answer these questions. Starting over with bit 0:
Element # 1 2 3 4
Value: 7 14 4 1
Binary: 0111 1110 0100 0001
bit 0: 1 0 0 1
First let's build a histogram of the number of zero bits in bit 0 position, and the number of 1 bits in bit 0 position:
bit 0: 1 0 0 1
zero bits one bits
--------- --------
1)histogram: 2 2
Now let's do an exclusive prefix-sum on these histogram values:
zero bits one bits
--------- --------
1)histogram: 2 2
2)prefix sum: 0 2
An exclusive prefix-sum is just the sum of all preceding values. There are no preceding values in the first position, and in the second position the preceding value is 2 (the number of elements with a 0 bit in bit 0 position). Now, as an independent operation, let's determine the relative offset of each 0 bit amongst all the zero bits, and each one bit amongst all the one bits:
bit 0: 1 0 0 1
3)offset: 0 0 1 1
This can actually be done programmatically using exclusive prefix-sums again, considering the 0-group and 1-group separately, and treating each position as if it has a value of 1:
0 bit 0: 1 1
3)ex. psum: 0 1
1 bit 0: 1 1
3)ex. psum: 0 1
Now, step 4 of the given algorithm says:
4) Combine the results of steps 2 & 3 to determine the final output location for each element and move it there
What this means is, for each element, we will select the histogram-bin prefix sum value corresponding to its bit value (0 or 1) and add to that, the offset associated with its position, to determine the location to move that element to:
Element # 1 2 3 4
Value: 7 14 4 1
Binary: 0111 1110 0100 0001
bit 0: 1 0 0 1
hist psum: 2 0 0 2
offset: 0 0 1 1
new index: 2 0 1 3
Moving each element to its "new index" position, we have:
Element # 2 3 1 4
Value: 14 4 7 1
Binary: 1110 0100 0111 0001
Which is exactly the result we expect for the completion of our first digit-move, based on the previous walk-through. This has completed step 1, i.e. the first (least-significant) digit; we still have the remaining digits to process, creating a new histogram and new prefix sums at each step.
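The four steps of one pass can be sketched serially in Python (not CUDA). This is only meant to show how the histogram, the exclusive prefix sum and the relative offsets combine into an output index; the function names radix_pass and radix_sort are made up for this illustration, and a GPU version would replace each loop with a parallel primitive.

```python
def radix_pass(values, positions, bit):
    """One pass on a single bit; returns (values, positions, new_index)."""
    digits = [(v >> bit) & 1 for v in values]

    # 1) Histogram: how many elements have a 0 and how many have a 1.
    hist = [digits.count(0), digits.count(1)]

    # 2) Exclusive prefix sum of the histogram: where each group starts.
    starts = [0, hist[0]]

    # 3) Relative offset of each element within its own group.
    seen = [0, 0]
    offsets = []
    for d in digits:
        offsets.append(seen[d])
        seen[d] += 1

    # 4) Combine steps 2 and 3, then scatter values and positions together.
    new_index = [starts[d] + off for d, off in zip(digits, offsets)]
    out_vals = [0] * len(values)
    out_pos = [0] * len(values)
    for i, dst in enumerate(new_index):
        out_vals[dst] = values[i]
        out_pos[dst] = positions[i]
    return out_vals, out_pos, new_index

def radix_sort(values, positions, num_bits=32):
    """Repeat the pass from the least to the most significant bit."""
    for bit in range(num_bits):
        values, positions, _ = radix_pass(values, positions, bit)
    return values, positions

print(radix_pass([7, 14, 4, 1], [0, 1, 2, 3], 0))
print(radix_sort([7, 14, 4, 1], [0, 1, 2, 3], num_bits=4))
```

Because every pass returns fresh lists, the sketch never overwrites its input, which mirrors the ping-pong between input and output buffers that the assignment describes.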
Notes:
Radix-sort, even in a computer, does not have to be done based strictly on binary digits. It's possible to construct a similar algorithm with digits of different sizes, perhaps consisting of 2,3, or 4 bits.
One of the optimizations we can perform on a radix sort is to only sort based on the number of digits that are actually meaningful. For example, if we are storing quantities in 32-bit values, but we know that the largest quantity present is 1023 (2^10-1), we need not sort on all 32 bits. We can stop, expecting a proper sort, after proceeding through the first 10 bits.
What does any of this have to do with GPUs? In so far as the above description goes, not much. The practical application is to consider using parallel algorithms for things like the histogram, the prefix-sums, and the data movement. This decomposition of radix-sort allows one to locate and use parallel algorithms already developed for these more basic operations, in order to construct a fast parallel sort.
What follows is a worked example. This may help with your understanding of radix sort. I don't think it will help with your assignment, because this example performs a 32-bit radix sort at the warp level, for a single warp, ie. for 32 quantities. But a possible advantage from an understanding point of view is that things like histogramming and prefix sums can be done at the warp level in just a few instructions, taking advantage of various CUDA intrinsics. For your assignment, you won't be able to use these techniques, and you will need to come up with full-featured parallel prefix sums, histograms, etc. that can operate on an arbitrary dataset size.
#include <stdio.h>
#include <stdlib.h>
#define WSIZE 32
#define LOOPS 100000
#define UPPER_BIT 31
#define LOWER_BIT 0
__device__ unsigned int ddata[WSIZE];
// naive warp-level bitwise radix sort
__global__ void mykernel(){
__shared__ volatile unsigned int sdata[WSIZE*2];
// load from global into shared variable
sdata[threadIdx.x] = ddata[threadIdx.x];
unsigned int bitmask = 1<<LOWER_BIT;
unsigned int offset = 0;
unsigned int thrmask = 0xFFFFFFFFU << threadIdx.x;
unsigned int mypos;
// for each LSB to MSB
for (int i = LOWER_BIT; i <= UPPER_BIT; i++){
unsigned int mydata = sdata[((WSIZE-1)-threadIdx.x)+offset];
unsigned int mybit = mydata&bitmask;
// get population of ones and zeroes (cc 2.0 ballot)
unsigned int ones = __ballot(mybit); // cc 2.0
unsigned int zeroes = ~ones;
offset ^= WSIZE; // switch ping-pong buffers
// do zeroes, then ones
if (!mybit) // threads with a zero bit
// get my position in ping-pong buffer
mypos = __popc(zeroes&thrmask);
else // threads with a one bit
// get my position in ping-pong buffer
mypos = __popc(zeroes)+__popc(ones&thrmask);
// move to buffer (or use shfl for cc 3.0)
sdata[mypos-1+offset] = mydata;
// repeat for next bit
bitmask <<= 1;
}
// save results to global
ddata[threadIdx.x] = sdata[threadIdx.x+offset];
}
int main(){
unsigned int hdata[WSIZE];
for (int lcount = 0; lcount < LOOPS; lcount++){
unsigned int range = 1U<<UPPER_BIT;
for (int i = 0; i < WSIZE; i++) hdata[i] = rand()%range;
cudaMemcpyToSymbol(ddata, hdata, WSIZE*sizeof(unsigned int));
mykernel<<<1, WSIZE>>>();
cudaMemcpyFromSymbol(hdata, ddata, WSIZE*sizeof(unsigned int));
for (int i = 0; i < WSIZE-1; i++) if (hdata[i] > hdata[i+1]) {printf("sort error at loop %d, hdata[%d] = %d, hdata[%d] = %d\n", lcount,i, hdata[i],i+1, hdata[i+1]); return 1;}
// printf("sorted data:\n");
//for (int i = 0; i < WSIZE; i++) printf("%u\n", hdata[i]);
}
printf("Success!\n");
return 0;
}
The methodology that @Robert Crovella gives is absolutely correct and very helpful. It is slightly different from the process that they explain in the Udacity videos. I'll record one iteration of their method, watchable here, in this answer, jumping off from Robert Crovella's example:
Element # 1 2 3 4
Value: 7 14 4 1
Binary: 0111 1110 0100 0001
LSB: 1 0 0 1
Predicate: 0 __1__ __1__ 0
Pred. Scan: 0 __0__ __1__ 2
Number of ones in predicate: 2
!Predicate:__1__ 0 0 __1__
!Pred. Scan: 0 1 1 1
Offset for !Pred. Scan = Number of ones in predicate = 2
!Pred. Scan + Offset:
__2__ 3 3 __3__
Final indexes to move values after 1 iteration (on LSB):
2 0 1 3
Values after 1 iteration (on LSB):
14 4 7 1
I placed emphasis (__ __) on the values that indicate or contain the index to move the value to.
Terms (from Udacity video):
LSB = least significant bit
Predicate (for LSB): (x & 1) == 0
for the next significant bit: (x & 2) == 0
for the one after that: (x & 4) == 0
and so on, with more left shifting (<<)
Pred. Scan = Predicate Scan = Predicate exclusive prefix sum
!Pred. = bits of predicate flipped (0->1 and 1->0)
Number of ones in predicate
note that this is not necessarily the last entry in the scan, you can instead get this value (sum/reduction of the predicate) as an intermediate of the Blelloch scan
A summary of the above is:
Get the predicate of your list (bit in common, starting from the LSB)
Scan the predicate, and record the sum of the predicate in the process
Blelloch Scan on the GPU
note that your predicate will be of arbitrary size, so read the section on Blelloch Scan for arrays of arbitrary size rather than 2^n size
Flip bits of the predicate, and scan that
Move the values in your array with the following rule:
For the ith element in the array:
if the ith predicate is TRUE, move the ith value to the index in the ith element of the predicate scan
else, move the ith value to the index in the ith element of the !Predicate scan plus the sum of the Predicate
Move to the next significant bit (NSB)
For reference, you can consult my solution for this HW assignment in CUDA.
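The move rule summarized above can be checked with a short serial Python sketch. It is an illustration only: a plain loop stands in for the Blelloch scan, and exclusive_scan, split_indexes and apply_move are hypothetical names introduced here.

```python
def exclusive_scan(flags):
    """Exclusive prefix sum; also returns the total (the reduction)."""
    out, running = [], 0
    for f in flags:
        out.append(running)
        running += f
    return out, running

def split_indexes(values, bit):
    """Destination index of every element for one pass on the given bit."""
    # Predicate is TRUE (1) when the chosen bit is 0.
    pred = [1 if ((v >> bit) & 1) == 0 else 0 for v in values]
    pred_scan, num_true = exclusive_scan(pred)
    not_pred = [1 - p for p in pred]
    not_scan, _ = exclusive_scan(not_pred)
    # TRUE elements go to their predicate-scan slot; FALSE elements go
    # after all TRUE ones, at their !predicate-scan slot plus the offset.
    return [pred_scan[i] if pred[i] else not_scan[i] + num_true
            for i in range(len(values))]

def apply_move(values, indexes):
    """Scatter each value to its destination index."""
    out = [0] * len(values)
    for v, dst in zip(values, indexes):
        out[dst] = v
    return out

indexes = split_indexes([7, 14, 4, 1], 0)
print(indexes, apply_move([7, 14, 4, 1], indexes))
```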

Resources