Control divergence with a simple matrix multiplication kernel

Given the following simple matrix multiplication kernel:
`__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
    int Row = blockIdx.y*blockDim.y+threadIdx.y;
    int Col = blockIdx.x*blockDim.x+threadIdx.x;
    if ((Row < Width) && (Col < Width)) {
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k)
        {
            Pvalue += M[Row*Width+k]*N[k*Width+Col];
        }
        P[Row*Width+Col] = Pvalue;
    }
}`
If we launch the kernel with a block size of 16x16 on a 1000x1000 matrix, how many warps will have control divergence?
Answer: 500
Explanation: There will be 63 blocks in the horizontal direction. 8 threads in the x dimension in each row will be in the invalid range. Every two rows form a warp. Therefore, there are 1000/2=500 warps that will straddle the valid and invalid ranges in the horizontal direction. As for the bottom blocks, each has 8 rows in the valid range and 8 rows in the invalid range, i.e. 4 warps on each side. Threads in these warps are either all in the valid range or all in the invalid range, so they do not diverge in the vertical direction.
Question: I am trying to understand why in this scenario 8 threads in the x dimension will be in the invalid range?

Each block covers a 16x16 array of elements. To cover a matrix of 1000x1000 elements, I need a square threadblock array that has dimensions of 1000/16 = 62.5 blocks in the horizontal direction and 62.5 blocks in the vertical direction.
But I can't launch 62.5x62.5 blocks, so in order to have full coverage I must launch 63x63 blocks, acknowledging that this will create extra threads in the "invalid range" (i.e. that would map to an element location outside the 1000x1000 matrix).
When I launch 63 blocks in the horizontal direction, I get 63x16 = 1008 threads in the horizontal direction. But I only need 1000 so 8 threads (in each row) are in the "invalid range".
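The count of 500 can be checked by brute force. The following is a minimal Python sketch, not CUDA code: it assumes a warp is 32 consecutive threads of a block in row-major order (so two rows of a 16-wide block), and the function name `count_divergent_warps` is made up for illustration.

```python
def count_divergent_warps(width=1000, block=16, warp_size=32):
    """Count warps whose threads disagree on (Row < width) && (Col < width)."""
    blocks = -(-width // block)            # ceiling division: 63 for 1000/16
    rows_per_warp = warp_size // block     # assumes block width divides the warp size
    divergent = 0
    for by in range(blocks):
        for bx in range(blocks):
            for first_ty in range(0, block, rows_per_warp):
                # Evaluate the kernel's boundary test for every thread of this warp
                active = [
                    (by * block + ty < width) and (bx * block + tx < width)
                    for ty in range(first_ty, first_ty + rows_per_warp)
                    for tx in range(block)
                ]
                # A warp diverges only if some threads pass the test and some fail
                if any(active) and not all(active):
                    divergent += 1
    return divergent

print(count_divergent_warps())
```

With a width that is a multiple of the block size (for example 1024), no thread falls in the invalid range and the same function reports no divergent warps.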

Related

Generate random position with uniform distribution inside rounded rect

I can't come up with anything acceptable.
My first (and only) approach is pretty awkward:
1. Calculate area = non_rounded_area + area_of_rounded_corner * 4. Treat this area as the pixel count of the rect.
2. Get a random number from the range [0..area), a pixel index so to speak.
3. Somehow get x and y coordinates from that index.
The main difficulty is how to perform step 3.
I reckon it's enough to consider one quarter of the rect (and one corner) and just rotate the result for the other quarters.
OK, suppose I know how many pixels belong to the given corner.
It's easy to get x and y coordinates from an index that belongs to the non-rounded area.
But how do I do this for pixels that belong to the corners?
My thoughts revolve around "determine whether a pixel belongs to the circle", but I can't formulate them plainly.
Here's a way to do it for one quadrant that you can generalize to a full rectangle:
First compute the total number of pixels in the quadrant (red + orange + green):
int totalPixels = w * h;
Then compute the red area (the pixels in the corner that are outside the rounded rect):
int invalidCornerPixels = (int)((float)(r * r) * ((4.0f - PI) / 4.0f));
The orange area is equal to the red area. You can sample pixels in the red + green area, and if they are in the red area, sample a random pixel in the orange area instead.
int redGreenArea = totalPixels - invalidCornerPixels;
Assume randomValue(n) returns a random int from 0 to n - 1:
int pixelIndex = randomValue(redGreenArea);
int pixelX = pixelIndex % w;
int pixelY = pixelIndex / w;
Test if the sampled pixel is in the red area and resample if necessary:
if((pixelX < r) && (pixelY < r))
{
    int circleX = r - pixelX;
    int circleY = r - pixelY;
    if(((circleX * circleX) + (circleY * circleY)) > (r * r))
    {
        pixelIndex = randomValue(invalidCornerPixels) + redGreenArea;
        pixelX = pixelIndex % w;
        pixelY = pixelIndex / w;
    }
}
This requires a maximum of 2 random number generations (usually only 1), and isn't any more complicated than rejection sampling, because you have to implement the same test for that too. The calculation of totalPixels, invalidCornerPixels and redGreenArea can be done once and stored for a given rectangle.
One weakness is that due to rounding errors the number of pixels that will fail the test in practice may not be exactly equal to invalidCornerPixels, which will give a very slightly non-uniform distribution. You could address this by calculating invalidCornerPixels by brute force offline (counting the pixels that fail the test in an r x r square) and creating a lookup table for each value of r. I doubt it will be noticeable if used for a particle generator however. Another weakness is that it will fail if the red area overlaps the orange area.
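The quadrant sampling described above, combined with the brute-force count suggested for the rounding problem, could look roughly like this in Python. It is a sketch under assumptions: the names `count_red` and `sample_quadrant` are hypothetical, the rounded corner sits at the origin of a `w` x `h` quadrant, and the red and orange areas must not overlap.

```python
import random

def count_red(r):
    """Brute-force count of corner pixels that lie outside the rounded rect."""
    return sum(1 for y in range(r) for x in range(r)
               if (r - x) ** 2 + (r - y) ** 2 > r * r)

def in_red(x, y, r):
    """Same test as the C snippet: inside the r x r corner but outside the circle."""
    return x < r and y < r and (r - x) ** 2 + (r - y) ** 2 > r * r

def sample_quadrant(w, h, r, rng=random):
    """Return one (x, y) pixel of a w x h quadrant, avoiding the red corner."""
    invalid = count_red(r)
    red_green = w * h - invalid          # the first red_green pixels in row-major order
    index = rng.randrange(red_green)
    x, y = index % w, index // w
    if in_red(x, y, r):
        # Redirect the sample into the orange area, the last `invalid` pixels
        index = rng.randrange(invalid) + red_green
        x, y = index % w, index // w
    return x, y

rng = random.Random(0)
print([sample_quadrant(40, 30, 10, rng) for _ in range(5)])
```

Because the red count is exact here, every valid pixel is selected with probability 1 / red_green, whether it is reached directly or through the redirect.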

How to construct a loop with reducing iterations

In MATLAB, I have a 256x256 RGB image and a 3x3 kernel that passes over it. The 3x3 kernel computes the colour-euclidean distance between every pair combination of the 9 pixels in the kernel, and stores the maximum value in an array. It then moves by 1 pixel and performs the same computation, and so on.
I can easily code the movement of the kernel over the image, as well as the extraction of the RGB values from the pixels in the kernel.
HOWEVER, I do have trouble efficiently computing the colour-euclidean distance operation for every pair combination of pixels.
For example if I had a 3x3 matrix with the following values:
[55 12 5; 77 15 99; 124 87 2]
I need to code a loop such that the 1st element performs an operation with the 2nd,3rd...9th element. Then the 2nd element performs the operation with the 3rd,4th...9th element and so on until finally the 8th element performs the operation with the 9th element. Preferrably, the same pixel combination shouldn't compute again (like if you computed 2nd with 7th, don't compute 7th with 2nd).
Thank you in advance.
EDIT: My code so far
K=3;
s=1; %If S=0, don't reject, If S=1 Reject first max distance pixel pair
OI=imread('onion.png');
Rch = im2col(OI(:,:,1),[K,K],'sliding')
Gch = im2col(OI(:,:,2),[K,K],'sliding')
Bch = im2col(OI(:,:,3),[K,K],'sliding')
indexes = bsxfun(@gt,(1:K^2)',1:K^2)
a=find(indexes);
[idx1,idx2] = find(indexes);
Rsqdiff = (Rch(idx2,:) - Rch(idx1,:)).^2
Gsqdiff = (Gch(idx2,:) - Gch(idx1,:)).^2
Bsqdiff = (Bch(idx2,:) - Bch(idx1,:)).^2
dists = sqrt(double(Rsqdiff + Gsqdiff + Bsqdiff)) %Distance values for all 36 combinations in 1 column
[maxdist,idx3] = max(dists,[],1) %idx3 is each column's index of max value
if s==0
y = reshape(maxdist,size(OI,1)-K+1,[]) %max value of each column (each column has 36 values)
elseif s==1
[~,I]=max(maxdist);
idx3=idx3(I);
n=size(idx3,2);
for i=1:1:n
idx3(i)=a(idx3(i));
end
[I,J]=ind2sub([K*K K*K],idx3);
for j=1:1:a
[M,N]=ind2sub([K*K K*K],dists(j,:));
M(I,:)=0;
N(:,J)=0;
dists(j,:)=sub2ind; %Incomplete line, don't know what to do here
end
[maxdist,idx3] = max(dists,[],1);
y = reshape(maxdist,size(OI,1)-K+1,[]);
end
If I understood the question correctly, you are looking to form the unique pairwise combinations within a sliding 3x3 window, compute the Euclidean distance over all three channels for each pair (what we are calling colour-euclidean distance), and finally pick the largest distance for each sliding window. A 3x3 window has 9 elements and therefore 36 unique pairs. If the image size is MxN, the sliding nature gives (M-3+1)*(N-3+1) = 64516 windows (for the 256x256 case) with 36 pairs each, so the distances array is 36x64516 and the output array of maximum distances is 254x254. The implementation suggested here uses im2col to extract the sliding-window elements as columns, bsxfun with find to form the pairs (the same 36 pairs that nchoosek(1:9,2) would give), and then takes the square root of the summed squared differences across the three channels of each pair. It would look something like this:
K = 3; %// Kernel size
%// Cast to double so that differences of uint8 pixels do not saturate at 0
Rch = im2col(double(img(:,:,1)),[K,K],'sliding');
Gch = im2col(double(img(:,:,2)),[K,K],'sliding');
Bch = im2col(double(img(:,:,3)),[K,K],'sliding');
[idx1,idx2] = find(bsxfun(@gt,(1:K^2)',1:K^2)); %//' unique pairs with idx1 > idx2
Rsqdiff = (Rch(idx2,:) - Rch(idx1,:)).^2;
Gsqdiff = (Gch(idx2,:) - Gch(idx1,:)).^2;
Bsqdiff = (Bch(idx2,:) - Bch(idx1,:)).^2;
dists = sqrt(Rsqdiff + Gsqdiff + Bsqdiff); %// 36 x (number of windows)
out = reshape(max(dists,[],1),size(img,1)-K+1,[]);
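The "loop with reducing iterations" itself, where element 1 is paired with 2..9, element 2 with 3..9 and so on, is just the set of 2-element combinations. As a language-neutral illustration, here is a small Python sketch for a single window; the helper name `max_pair_distance` is an assumption, and the single-channel values are the 3x3 example from the question.

```python
from itertools import combinations
from math import dist

def max_pair_distance(pixels):
    """Largest Euclidean distance over all unique pairs of pixel tuples."""
    return max(dist(p, q) for p, q in combinations(pixels, 2))

# The 3x3 example from the question, flattened row by row
values = [55, 12, 5, 77, 15, 99, 124, 87, 2]

# Index pairs (0,1), (0,2), ..., (7,8): each pair appears exactly once
pairs = list(combinations(range(len(values)), 2))
print(len(pairs))

# Treat each value as a one-channel pixel; RGB pixels would be 3-tuples
print(max_pair_distance([(v,) for v in values]))
```

For an RGB window the same function can be called with nine `(r, g, b)` tuples.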
Your question is interesting and caught my attention. As far as I understood, you need to calculate euclidean distance between RGB color values of all cells inside 3x3 kernel and to find the largest one. I suggest a possible way to do this by using circshift function and 4D array operations:
Firstly, we pad the input array and create 8 shifted versions of it for each direction:
DIM = 256;
A = zeros(DIM,DIM,3,9);
A(:,:,:,1) = round(255*rand(DIM,DIM,3));%// random 256x256 array (suppose it is your image)
A = padarray(A,[1,1]);%// add zeros on each side of image
%// compute shifted versions of the input array
%// and write them as 4th dimension starting from shifted up clockwise:
A(:,:,:,2) = circshift(A(:,:,:,1),[-1, 0]);
A(:,:,:,3) = circshift(A(:,:,:,1),[-1, 1]);
A(:,:,:,4) = circshift(A(:,:,:,1),[ 0, 1]);
A(:,:,:,5) = circshift(A(:,:,:,1),[ 1, 1]);
A(:,:,:,6) = circshift(A(:,:,:,1),[ 1, 0]);
A(:,:,:,7) = circshift(A(:,:,:,1),[ 1,-1]);
A(:,:,:,8) = circshift(A(:,:,:,1),[ 0,-1]);
A(:,:,:,9) = circshift(A(:,:,:,1),[-1,-1]);
Next, we create an array that calculates the difference for all the possible combinations between all the above arrays:
q = nchoosek(1:9,2);
B = zeros(DIM+2,DIM+2,3,size(q,1));
for i = 1:size(q,1)
B(:,:,:,i) = (A(:,:,:,q(i,1)) - A(:,:,:,q(i,2))).^2;
end
C = sqrt(sum(B,3));
Finally, what we have is all the Euclidean distances between all possible pairs within a 3x3 kernel, stored along the 4th dimension of C. All we have to do is extract the maximum over that dimension. As far as I understood, you do not consider image edges, so:
D = max(C,[],4); %// largest of the 36 pair distances at every pixel
D = D(3:DIM,3:DIM); %// keep only the windows that lie fully inside the image
D is the 254x254 array of maximum Euclidean distances for the windows centred on image pixels (2:255,2:255), i.e. we exclude image edges.
Hope that helps.
P.S. I am amazed by the shortness of the code provided by @Divakar.

Need an algorithm to calculate the size of a rectangle

I have a logic riddle and I need an efficient algorithm to solve it.
I have a large rectangle (the box) of size w*h (width*height).
I also have n smaller rectangles of unknown size but with a fixed aspect ratio.
What is the fastest way to find the maximum size that lets all n rectangles fit inside the box?
Example:
The box is 150*50 (width*height) and I have 25 small rectangles.
The fixed ratio of a small rectangle is 3 (if height = 5 then width = 5*3 = 15).
Let's call the height of a small rectangle x.
I want to find the largest x that lets me fit all the rectangles into the box.
(The small rectangles will be placed in rows and columns, for example 5 columns and 5 rows, according to the ratio and the maximum height.)
Does anyone know an efficient algorithm to solve this?
Um what?
Isn't it just (w*h)/75?
Yeah, brackets aren't needed... but isn't that what you want? Or am i totes missing something here?
Where w and h are the dimensions of the big or parent rectangle.
And 75 is 3*25.
I would attempt to solve this problem empirically (using backtracking) instead of analytically, i.e. find all possibilities* (I'll explain the *). Essentially we want to place every rectangle at every size from as small as it can be up to its maximum size (the maximum is the largest the rectangle can be before bumping into the start point of its neighbours or growing past the container rect). This means that if we attempt to place every rect at every possible size, one of those solutions will be the best solution. Also note that this is really a one-dimensional problem, since each rect's height and width are bound by a ratio; setting one implicitly sets the other.
* - When I said all possibilities, I really meant most reasonable possibilities. Since we are in floating point space we cannot test ALL possibilities. We can test for finer and finer precision, but will be unable to test all sizes. Due to this we define a step size to iterate through the size of the rects we will try.
const float STEP_SIZE = 0.0001;
float fLastTotalSize = 0;
int main()
{
PlaceRect(myRects.begin(), myRects.end());
}
void PlaceRect(Iterator currentRect, Iterator end)
{
if (currentRect == end)
{
return;
}
float fRectMaxSize = CalculateMaxPossibleRectSize(*currentRect);
// find the number of steps it will take to iterate from the smallest
// rect size to the largest
int nSteps = fRectMaxSize / STEP_SIZE;
for(int i = 0; i < nSteps; ++i)
{
// based on the step index scale the rect size
float fCurrentRectTestSize = i*STEP_SIZE;
currentRect->SetSize(fCurrentRectTestSize);
float fTotalSize = CalculateTotalSizesOfAllRects();
if (fTotalSize > fLastTotalSize)
{
fLastTotalSize = fTotalSize;
SaveRectConfiguration();
}
// Continue placing the rest of the rects assuming the size
// we just set for the current rect
PlaceRect(currentRect + 1, end);
// Once we return we can now reset the current rect size to
// something else and continue testing possibilities
}
}
Based on the step size and the number of rectangles this may run for a very long time, but will find you the empirical solution.
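Since the question states that the small rectangles are laid out in rows and columns, a much cheaper alternative to backtracking is to try every possible column count and keep the one that allows the largest height. This is a different technique from the answer above, sketched in Python under that grid assumption; `largest_height` is a hypothetical name.

```python
from math import ceil

def largest_height(box_w, box_h, count, ratio):
    """Return (height, columns, rows) of the best grid layout.

    Each small rectangle has height x and width ratio * x.
    """
    best = (0.0, 0, 0)
    for cols in range(1, count + 1):
        rows = ceil(count / cols)
        # x is limited by the box width (cols rectangles side by side)
        # and by the box height (rows rectangles stacked)
        x = min(box_w / (cols * ratio), box_h / rows)
        if x > best[0]:
            best = (x, cols, rows)
    return best

print(largest_height(150, 50, 25, 3))
```

For the example in the question (150x50 box, 25 rectangles, ratio 3) this yields a height of 10 with 5 columns and 5 rows, using only `count` loop iterations.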

Rotating a two dimensional array by 90 degrees

I am studying this piece of code on rotating an NxN matrix; I have traced the program countless times, and I sort of understand how the actual rotation happens. It basically rotates the corners first and the elements after the corners in a clockwise direction. I just do not understand a couple of lines, and the code is still not "driven home" in my brain, so to speak. Please help. I am rotating it 90 degrees, given a 4x4 matrix as my tracing example.
[1][2][3][4]
[5][6][7][8]
[9][0][1][2]
[3][4][5][6]
becomes
[3][9][5][1]
[4][0][6][2]
[5][1][7][3]
[6][2][8][4]
public static void rotate(int[][] matrix, int n){
    for(int layer=0; layer < n/2; ++layer) {
        int first=layer; //It moves from the outside in.
        int last=n-1-layer; //<--This I do not understand
        for(int i=first; i<last;++i){
            int offset=i-first; //<--A bit confusing for me
            //save the top left of the matrix
            int top = matrix[first][i];
            //shift left to top;
            matrix[first][i]=matrix[last-offset][first];
            /*I understand that it needs
            last-offset so that it will go up the column in the matrix,
            and first signifies it's in the first column*/
            //shift bottom to left
            matrix[last-offset][first]=matrix[last][last-offset];
            /*I understand that it needs
            last-offset so that the number decreases and it may go up the column (first
            last-offset) and left (latter). */
            //shift right to bottom
            matrix[last][last-offset]=matrix[i][last];
            /*I understand that it i so that in the next iteration, it moves down
            the column*/
            //rightmost top corner
            matrix[i][last]=top;
        }
    }
}
It's easier to understand an algorithm like this if you draw a diagram, so I made a quick pic in Paint to demonstrate for a 5x5 matrix :D
The outer for(int layer=0; layer < n/2; ++layer) loop iterates over the layers from outside to inside. The outer layer (layer 0) is depicted by coloured elements. Each layer is effectively a square of elements requiring rotation. For n = 5, layer will take on values from 0 to 1 as there are 2 layers since we can ignore the centre element/layer which is unaffected by rotation. first and last refer to the first and last rows/columns of elements for a layer; e.g. layer 0 has elements from Row/Column first = 0 to last = 4 and layer 1 from Row/Column 1 to 3.
Then for each layer/square, the inner for(int i=first; i<last;++i) loop rotates it by rotating 4 elements in each iteration. Offset represents how far along the sides of the square we are. For our 5x5 below, we first rotate the red elements (offset = 0), then yellow (offset = 1), then green and blue. Arrows 1-5 demonstrate the 4-element rotation for the red elements, and 6+ for the rest which are performed in the same fashion. Note how the 4-element rotation is essentially a 5-assignment circular swap with the first assignment temporarily putting aside an element. The //save the top left of the matrix comment for this assignment is misleading since matrix[first][i] isn't necessarily the top left of the matrix or even the layer for that matter. Also, note that the row/column indexes of elements being rotated are sometimes proportional to offset and sometimes proportional to its inverse, last - offset.
We move along the sides of the outer layer (delineated by first=0 and last=4) in this manner, then move onto the inner layer (first = 1 and last = 3) and do the same thing there. Eventually, we hit the centre and we're done.
This triggers a WTF. The easiest way to rotate a matrix in place is by
first transposing the matrix (swap M[i,j] with M[j,i])
then swapping M[i,j] with M[i, nColumns - 1 - j] (0-based indices), i.e. reversing each row
When matrices are column-major, the second operation is swapping columns, and hence has good data locality properties. If the matrix is row-major, then first permute rows, and then transpose.
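The transpose-then-reverse idea can be illustrated with a short Python sketch on the 4x4 matrix from the question. The function name `rotate_clockwise` is an assumption, and a square list-of-lists matrix is assumed.

```python
def rotate_clockwise(matrix):
    """Rotate a square matrix 90 degrees clockwise, in place."""
    n = len(matrix)
    # Step 1: transpose (swap M[i][j] with M[j][i] above the diagonal)
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j], matrix[j][i] = matrix[j][i], matrix[i][j]
    # Step 2: reverse each row (swap M[i][j] with M[i][n - 1 - j])
    for row in matrix:
        row.reverse()
    return matrix

example = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 0, 1, 2],
    [3, 4, 5, 6],
]
for row in rotate_clockwise(example):
    print(row)
```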
Here is a recursive way of solving this:
// rotating a square 2D array (n x n) by 90 degrees clockwise, in place
public void rotateArray(int[][] inputArray) {
    System.out.println("Input Array: ");
    print2D(inputArray);
    rotateArray(inputArray, 0, 0, inputArray.length - 1,
            inputArray[0].length - 1);
    System.out.println("\n\nOutput Array: ");
    print2D(inputArray);
}
public void rotateArray(int[][] inputArray, int currentRow,
        int currentColumn, int lastRow, int lastColumn) {
    // condition to come out of recursion.
    // if all rows are covered or all columns are covered (all layers
    // covered)
    if (currentRow >= lastRow || currentColumn >= lastColumn)
        return;
    // rotating the corner elements first
    int top = inputArray[currentRow][currentColumn];
    inputArray[currentRow][currentColumn] = inputArray[lastRow][currentColumn];
    inputArray[lastRow][currentColumn] = inputArray[lastRow][lastColumn];
    inputArray[lastRow][lastColumn] = inputArray[currentRow][lastColumn];
    inputArray[currentRow][lastColumn] = top;
    // clockwise rotation of remaining elements in the current layer
    for (int i = currentColumn + 1; i < lastColumn; i++) {
        // distance along the side of this layer; using i directly is only
        // correct for the outermost layer, where currentColumn is 0
        int offset = i - currentColumn;
        int temp = inputArray[currentRow][i];
        inputArray[currentRow][i] = inputArray[lastRow - offset][currentColumn];
        inputArray[lastRow - offset][currentColumn] = inputArray[lastRow][lastColumn - offset];
        inputArray[lastRow][lastColumn - offset] = inputArray[currentRow + offset][lastColumn];
        inputArray[currentRow + offset][lastColumn] = temp;
    }
    // call recursion on remaining layers
    rotateArray(inputArray, ++currentRow, ++currentColumn, --lastRow,
            --lastColumn);
}

Implementing a Hilbert map of the Internet

In the XKCD comic 195, a design for a map of the Internet address space is suggested that uses a Hilbert curve so that items with similar IP addresses are clustered together.
Given an IP address, how would I calculate its 2D coordinates (in the range zero to one) on such a map?
This is pretty easy, since the Hilbert curve is a fractal, that is, it is recursive. It works by bisecting each square horizontally and vertically, dividing it into four pieces. So you take two bits of the IP address at a time, starting from the left, and use those to determine the quadrant, then continue, using the next two bits, with that quadrant instead of the whole square, and so on until you have exhausted all the bits in the address.
The basic shape of the curve in each square is horseshoe-like:
0 3
1 2
where the numbers correspond to the top two bits and therefore determine the traversal order. In the xkcd map, this square is the traversal order at the highest level. Possibly rotated and/or reflected, this shape is present at each 2x2 square.
How the "horseshoe" is oriented in each of the subsquares is determined by one rule: the 0 corner of the 0 square is in the corner of the larger square. Thus, the subsquare corresponding to 0 above must be traversed in the order
0 1
3 2
and, looking at the whole previous square and showing four bits, we get the following shape for the next division of the square:
00 01 32 33
03 02 31 30
10 13 20 23
11 12 21 22
This is how the square always gets divided at the next level. Now, to continue, just focus on the latter two bits, orient this more detailed shape according to how the horseshoe shape of those bits is oriented, and continue with a similar division.
To determine the actual coordinates, each two bits determine one bit of binary precision in the real number coordinates. So, on the first level, the first bit after the binary point (assuming coordinates in the [0,1] range) in the x coordinate is 0 if the first two bits of the address have the value 0 or 1, and 1 otherwise. Similarly, the first bit in the y coordinate is 0 if the first two bits have the value 1 or 2. To determine whether to add a 0 or 1 bit to the coordinates, you need to check the orientation of the horseshoe at that level.
EDIT: I started working out the algorithm and it turns out that it's not that hard after all, so here's some pseudo-C. It's pseudo because I use a b suffix for binary constants and treat integers as arrays of bits, but changing it to proper C shouldn't be too hard.
In the code, pos is a 3-bit integer for the orientation. The first two bits are the x and y coordinates of 0 in the square and the third bit indicates whether 1 has the same x coordinate as 0. The initial value of pos is 011b, meaning that the coordinates of 0 are (0, 1) and 1 has the same x coordinate as 0. ad is the address, treated as an n-element array of 2-bit integers, and starting from the most significant bits.
double x = 0.0, y = 0.0;
double xinc, yinc;
pos = 011b;
for (int i = 0; i < n; i++) {
switch (ad[i]) {
case 0: xinc = pos[0]; yinc = pos[1]; pos[2] = ~pos[2]; break;
case 1: xinc = pos[0] ^ ~pos[2]; yinc = pos[1] ^ pos[2]; break;
case 2: xinc = ~pos[0]; yinc = ~pos[1]; break;
case 3: xinc = pos[0] ^ pos[2]; yinc = pos[1] ^ ~pos[2];
pos = ~pos; break;
}
x += xinc / (1 << (i+1)); y += yinc / (1 << (i+1));
}
I tested it with a couple of 8-bit prefixes and it placed them correctly according to the xkcd map, so I'm somewhat confident the code is correct.
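For readers who want something runnable, the pseudo-C above can be transcribed into Python roughly as follows. This is one reading of the pseudocode, not the author's own implementation: `pos` is split into three separate bits (`px`, `py`, `same_x`), the `bits` parameter and the name `hilbert_xy` are additions for illustration, and y = 1 is taken to be the top edge of the map, as implied by the initial value 011b.

```python
def hilbert_xy(address, bits=32):
    """Map an integer address to (x, y) in [0, 1) on the Hilbert-curve map."""
    # px, py: position of quadrant 0 in the current square;
    # same_x: 1 if quadrant 1 shares its x coordinate with quadrant 0
    px, py, same_x = 0, 1, 1
    x = y = 0.0
    for i in range(bits // 2):
        # Take the next two bits, most significant pair first
        quadrant = (address >> (bits - 2 * (i + 1))) & 3
        if quadrant == 0:
            xinc, yinc = px, py
            same_x ^= 1
        elif quadrant == 1:
            xinc, yinc = px ^ same_x ^ 1, py ^ same_x
        elif quadrant == 2:
            xinc, yinc = px ^ 1, py ^ 1
        else:
            xinc, yinc = px ^ same_x, py ^ same_x ^ 1
            px, py, same_x = px ^ 1, py ^ 1, same_x ^ 1
        # Each pair of address bits adds one binary digit to x and to y
        x += xinc / (1 << (i + 1))
        y += yinc / (1 << (i + 1))
    return x, y

# Cell "13" of the 4x4 table above: first pair 1, second pair 3
print(hilbert_xy(0b0111, bits=4))
```

With `bits=4` the results can be compared against the 4x4 table in the answer, where the bottom row corresponds to y = 0 and the left column to x = 0.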
Essentially you would decompose the number, using pairs of bits, MSB to LSB. The pair of bits tells you if the location is in the Upper Left (0) Lower Left (1) Lower Right (2) or Upper Right (3) quadrant, at a scale that gets finer as you shift through the number.
Additionally, you need to track an "orientation". This is the winding that is used at the scale you are at; the initial winding is as above (UL, LL, LR, UR), and depending on which quadrant you end up in, the winding at the next scale down is (rotated -90, 0, 0, +90) from your current winding.
So you could accumulate offsets :
suppose I start at 0,0, and the first pair gives me a 2, I shift offsets to 0.5, 0.5. The winding in the lower right is the same as my initial one. The next pair reduces the scale, so my adjustments are going to be 0.25 in length.
This pair is a 3, so I translate only my x coordinate and I am at .75, .5. The winding is now rotated over and my next scale down will be (LR, LL, UL, UR). The scale is now .125, and so on and so on until I run out of bits in my address.
I expect that based on the wikipedia code for a Hilbert curve you could keep track of your current position (as an (x, y) coordinate) and return that position after n cells had been visited. Then the position scaled onto [0..1] would depend on how high and wide the Hilbert curve was going to be at completion.
from turtle import left, right, forward
size = 10
def hilbert(level, angle):
    if level:
        right(angle)
        hilbert(level - 1, -angle)
        forward(size)
        left(angle)
        hilbert(level - 1, angle)
        forward(size)
        hilbert(level - 1, angle)
        left(angle)
        forward(size)
        hilbert(level - 1, -angle)
        right(angle)
Admittedly, this would be a brute force solution rather than a closed form solution.
