CUDA kernel: performance drops by 10x when loop count is increased by 10%

I have a simple CUDA kernel to test loop unrolling, and while testing it I noticed something else: when the loop count is 10, the kernel takes 34 milliseconds; when the loop count is 90, it takes 59 milliseconds; but when the loop count is 100, it takes 423 milliseconds!
Launch configuration is the same, only loop count changed.
So, my question is, what could be the reason for this performance drop?
Here is the code, input is an array of 128x1024x1024 elements, and I'm using PyCUDA:
__global__ void copy(float *input, float *output) {
int tidx = blockIdx.y * blockDim.x + threadIdx.x;
int stride = 1024 * 1024;
for (int i = 0; i < 128; i++) {
int idx = i * stride + tidx;
float x = input[idx];
float y = 0;
for (int j = 0; j < 100; j += 10) {
x = x + sqrt(float(j));
y = sqrt(abs(x)) + sin(x) + cos(x);
x = x + sqrt(float(j+1));
y = sqrt(abs(x)) + sin(x) + cos(x);
x = x + sqrt(float(j+2));
y = sqrt(abs(x)) + sin(x) + cos(x);
x = x + sqrt(float(j+3));
y = sqrt(abs(x)) + sin(x) + cos(x);
x = x + sqrt(float(j+4));
y = sqrt(abs(x)) + sin(x) + cos(x);
x = x + sqrt(float(j+5));
y = sqrt(abs(x)) + sin(x) + cos(x);
x = x + sqrt(float(j+6));
y = sqrt(abs(x)) + sin(x) + cos(x);
x = x + sqrt(float(j+7));
y = sqrt(abs(x)) + sin(x) + cos(x);
x = x + sqrt(float(j+8));
y = sqrt(abs(x)) + sin(x) + cos(x);
x = x + sqrt(float(j+9));
y = sqrt(abs(x)) + sin(x) + cos(x);
}
output[idx] = y;
}
}
The loop count I mentioned is this line:
for (int j = 0; j < 100; j += 10)
And sample outputs here:
10 loops
griddimx: 1 griddimy: 1024 griddimz: 1
blockdimx: 1024 blockdimy: 1 blockdimz: 1
nthreads: 1048576 blocks: 1024
prefetch.py:82: UserWarning: The CUDA compiler succeeded, but said the following:
ptxas info : 0 bytes gmem, 24 bytes cmem[3]
ptxas info : Compiling entry function 'copy' for 'sm_61'
ptxas info : Function properties for copy
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 21 registers, 336 bytes cmem[0], 52 bytes cmem[2]
computation takes 34.24 miliseconds
90 loops
griddimx: 1 griddimy: 1024 griddimz: 1
blockdimx: 1024 blockdimy: 1 blockdimz: 1
nthreads: 1048576 blocks: 1024
prefetch.py:82: UserWarning: The CUDA compiler succeeded, but said the following:
ptxas info : 0 bytes gmem, 24 bytes cmem[3]
ptxas info : Compiling entry function 'copy' for 'sm_61'
ptxas info : Function properties for copy
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 21 registers, 336 bytes cmem[0], 52 bytes cmem[2]
computation takes 59.33 miliseconds
100 loops
griddimx: 1 griddimy: 1024 griddimz: 1
blockdimx: 1024 blockdimy: 1 blockdimz: 1
nthreads: 1048576 blocks: 1024
prefetch.py:82: UserWarning: The CUDA compiler succeeded, but said the following:
ptxas info : 0 bytes gmem, 24 bytes cmem[3]
ptxas info : Compiling entry function 'copy' for 'sm_61'
ptxas info : Function properties for copy
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 22 registers, 336 bytes cmem[0], 52 bytes cmem[2]
computation takes 422.96 miliseconds

The problem seems to come from loop unrolling.
Indeed, the 10-loops case can be trivially unrolled by NVCC since the loop is actually always executed once (thus the for line can be removed with j set to 0).
The 90-loops case is unrolled by NVCC (there are only 9 actual iterations). The resulting code is thus much bigger but still fast since no branches are performed (GPUs hate branches). However, the 100-loops case is not unrolled by NVCC (you hit a threshold of the compiler optimizer). The resulting code is small, but it leads to more branches being executed at runtime: branching is performed for each executed loop iteration (a total of 10).
You can see the assembly code difference here.
You can force unrolling using the directive #pragma unroll. However, keep in mind that increasing the size of a code can reduce its performance.
PS: the slightly higher number of registers used in the last version may decrease performance, but simulations show that it should be OK in this case.


Julia: why doesn't shared memory multi-threading give me a speedup?

I want to use shared memory multi-threading in Julia. As done by the Threads.@threads macro, I can use ccall(:jl_threading_run ...) to do this. And whilst my code now runs in parallel, I don't get the speedup I expected.
The following code is intended as a minimal example of the approach I'm taking and the performance problem I'm having: [EDIT: See later for even more minimal example]
nthreads = Threads.nthreads()
test_size = 1000000
println("STARTED with ", nthreads, " thread(s) and test size of ", test_size, ".")
# Something to be processed:
objects = rand(test_size)
# Somewhere for our results
results = zeros(nthreads)
counts = zeros(nthreads)
# A function to do some work.
function worker_fn()
work_idx = 1
my_result = results[Threads.threadid()]
while work_idx > 0
my_result += objects[work_idx]
work_idx += nthreads
if work_idx > test_size
break
end
counts[Threads.threadid()] += 1
end
end
# Call our worker function using jl_threading_run
@time ccall(:jl_threading_run, Ref{Cvoid}, (Any,), worker_fn)
# Verify that we made as many calls as we think we did.
println("\nCOUNTS:")
println("\tPer thread:\t", counts)
println("\tSum:\t\t", sum(counts))
On an i7-7700, a typical single threaded result is:
STARTED with 1 thread(s) and test size of 1000000.
0.134606 seconds (5.00 M allocations: 76.563 MiB, 1.79% gc time)
COUNTS:
Per thread: [999999.0]
Sum: 999999.0
And with 4 threads:
STARTED with 4 thread(s) and test size of 1000000.
0.140378 seconds (1.81 M allocations: 25.661 MiB)
COUNTS:
Per thread: [249999.0, 249999.0, 249999.0, 249999.0]
Sum: 999996.0
Multi-threading slows things down! Why?
EDIT: A better minimal example can be created using the @threads macro itself.
a = zeros(Threads.nthreads())
b = rand(test_size)
calls = zeros(Threads.nthreads())
@time Threads.@threads for i = 1 : test_size
a[Threads.threadid()] += b[i]
calls[Threads.threadid()] += 1
end
I falsely assumed that the @threads macro's inclusion in Julia would mean that there was a benefit to be had.
The problem you have is most probably false sharing.
You can solve it by separating the areas you write to far enough like this (here is a "quick and dirty" implementation to show the essence of the change):
julia> function f(spacing)
test_size = 1000000
a = zeros(Threads.nthreads()*spacing)
b = rand(test_size)
calls = zeros(Threads.nthreads()*spacing)
Threads.@threads for i = 1 : test_size
@inbounds begin
a[Threads.threadid()*spacing] += b[i]
calls[Threads.threadid()*spacing] += 1
end
end
a, calls
end
f (generic function with 1 method)
julia> @btime f(1);
41.525 ms (35 allocations: 7.63 MiB)
julia> @btime f(8);
2.189 ms (35 allocations: 7.63 MiB)
or doing per-thread accumulation on a local variable like this (this is a preferred approach as it should be uniformly faster):
function getrange(n)
tid = Threads.threadid()
nt = Threads.nthreads()
d , r = divrem(n, nt)
from = (tid - 1) * d + min(r, tid - 1) + 1
to = from + d - 1 + (tid ≤ r ? 1 : 0)
from:to
end
function f()
test_size = 10^8
a = zeros(Threads.nthreads())
b = rand(test_size)
calls = zeros(Threads.nthreads())
Threads.@threads for k = 1 : Threads.nthreads()
local_a = 0.0
local_c = 0.0
for i in getrange(test_size)
for j in 1:10
local_a += b[i]
local_c += 1
end
end
a[Threads.threadid()] = local_a
calls[Threads.threadid()] = local_c
end
a, calls
end
Also note that if you run 4 threads on a machine with only 2 physical cores (and 4 virtual cores), the gains from threading will not be linear.
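To make the chunking in getrange easier to check in isolation, here is a small Python port of the same index arithmetic. It is only a sketch: the name get_range and the explicit tid and nt parameters are assumptions made for illustration (the Julia version reads them from Threads.threadid() and Threads.nthreads()).

```python
def get_range(n, tid, nt):
    """1-based index range handled by worker `tid` (1..nt) out of n items."""
    d, r = divmod(n, nt)
    # The first r workers get one extra item each.
    start = (tid - 1) * d + min(r, tid - 1) + 1
    stop = start + d - 1 + (1 if tid <= r else 0)
    return range(start, stop + 1)


if __name__ == "__main__":
    # Every index from 1 to 10 should be covered by exactly one worker.
    for tid in range(1, 5):
        print(tid, list(get_range(10, tid, 4)))
```

Because the ranges do not overlap, each worker can accumulate into its own local variable and write its result only once at the end.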

Exponent calculation speed

I am currently testing Julia (I've worked with MATLAB).
In MATLAB, computing N^3 is slower than NxNxN. This doesn't happen with N^2 and NxN. MATLAB uses a different algorithm to calculate higher-order exponents because it prefers accuracy over speed.
I think Julia does the same thing.
I wanted to ask if there is a way to force Julia to calculate the exponent of N using multiplication instead of the default algorithm, at least for cube exponents.
Some time ago I ran a few tests of this in MATLAB. I translated that code to Julia.
Links to code:
http://pastebin.com/bbeukhTc
(I can't upload all the links here :( )
Results of the scripts on Matlab 2014:
Exponente1
Elapsed time is 68.293793 seconds. (17.7x times of the smallest)
Exponente2
Elapsed time is 24.236218 seconds. (6.3x times of the smallests)
Exponente3
Elapsed time is 3.853348 seconds.
Results of the scripts on Julia 0.46:
Exponente1
18.423204 seconds (8.22 k allocations: 372.563 KB) (51.6x times of the smallest)
Exponente2
13.746904 seconds (9.02 k allocations: 407.332 KB) (38.5 times of the smallest)
Exponente3
0.356875 seconds (10.01 k allocations: 450.441 KB)
In my tests Julia is faster than MATLAB, but I am using a relatively old version. I can't test other versions.
Checking Julia's source code:
julia/base/math.jl:
^(x::Float64, y::Integer) =
box(Float64, powi_llvm(unbox(Float64,x), unbox(Int32,Int32(y))))
^(x::Float32, y::Integer) =
box(Float32, powi_llvm(unbox(Float32,x), unbox(Int32,Int32(y))))
julia/base/fastmath.jl:
pow_fast{T<:FloatTypes}(x::T, y::Integer) = pow_fast(x, Int32(y))
pow_fast{T<:FloatTypes}(x::T, y::Int32) =
box(T, Base.powi_llvm(unbox(T,x), unbox(Int32,y)))
We can see that Julia uses powi_llvm
Checking llvm's source code:
define double @powi(double %F, i32 %power) {
; CHECK: powi:
; CHECK: bl __powidf2
%result = call double @llvm.powi.f64(double %F, i32 %power)
ret double %result
}
Now, the __powidf2 is the interesting function here:
COMPILER_RT_ABI double
__powidf2(double a, si_int b)
{
const int recip = b < 0;
double r = 1;
while (1)
{
if (b & 1)
r *= a;
b /= 2;
if (b == 0)
break;
a *= a;
}
return recip ? 1/r : r;
}
Example 1: given a = 2; b = 7:
- r = 1
- iteration 1: r = 1 * 2 = 2; b = (int)(7/2) = 3; a = 2 * 2 = 4
- iteration 2: r = 2 * 4 = 8; b = (int)(3/2) = 1; a = 4 * 4 = 16
- iteration 3: r = 8 * 16 = 128;
Example 2: given a = 2; b = 8:
- r = 1
- iteration 1: r = 1; b = (int)(8/2) = 4; a = 2 * 2 = 4
- iteration 2: r = 1; b = (int)(4/2) = 2; a = 4 * 4 = 16
- iteration 3: r = 1; b = (int)(2/2) = 1; a = 16 * 16 = 256
- iteration 4: r = 1 * 256 = 256; b = (int)(1/2) = 0;
Integer power is always implemented as a sequence of multiplications. That's why N^3 is slower than N^2.
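To experiment with the two worked examples, the same square-and-multiply loop can be sketched in Python. This is a hedged port for illustration, not the compiler-rt source: the function name powi is an assumption, and the loop works on abs(b) because Python's integer division rounds differently from C for negative values.

```python
def powi(a, b):
    """Raise float a to integer power b by repeated squaring."""
    recip = b < 0
    n = abs(b)  # assumption: work on the magnitude, then invert at the end
    r = 1.0
    while True:
        if n & 1:       # current low bit set: fold the current power into r
            r *= a
        n //= 2
        if n == 0:
            break
        a *= a          # a, a^2, a^4, a^8, ...
    return 1.0 / r if recip else r


if __name__ == "__main__":
    print(powi(2.0, 7))   # the first worked example above
    print(powi(2.0, 8))   # the second worked example above
```

For an exponent of 3 the loop performs three multiplications (r *= a, a *= a, r *= a), one more than writing N*N*N by hand.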
jl_powi_llvm (called in fastmath.jl. "jl_" is concatenated by macro expansion), on the other hand, casts the exponent to floating-point and calls pow(). C source code:
JL_DLLEXPORT jl_value_t *jl_powi_llvm(jl_value_t *a, jl_value_t *b)
{
jl_value_t *ty = jl_typeof(a);
if (!jl_is_bitstype(ty))
jl_error("powi_llvm: a is not a bitstype");
if (!jl_is_bitstype(jl_typeof(b)) || jl_datatype_size(jl_typeof(b)) != 4)
jl_error("powi_llvm: b is not a 32-bit bitstype");
jl_value_t *newv = newstruct((jl_datatype_t*)ty);
void *pa = jl_data_ptr(a), *pr = jl_data_ptr(newv);
int sz = jl_datatype_size(ty);
switch (sz) {
/* choose the right size c-type operation */
case 4:
*(float*)pr = powf(*(float*)pa, (float)jl_unbox_int32(b));
break;
case 8:
*(double*)pr = pow(*(double*)pa, (double)jl_unbox_int32(b));
break;
default:
jl_error("powi_llvm: runtime floating point intrinsics are not implemented for bit sizes other than 32 and 64");
}
return newv;
}
Lior's answer is excellent. Here is a solution to the problem you posed: Yes, there is a way to force usage of multiplication, at the cost of accuracy. It's the @fastmath macro:
julia> @benchmark 1.1 ^ 3
BenchmarkTools.Trial:
samples: 10000
evals/sample: 999
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 16.00 bytes
allocs estimate: 1
minimum time: 13.00 ns (0.00% GC)
median time: 14.00 ns (0.00% GC)
mean time: 15.74 ns (6.14% GC)
maximum time: 1.85 μs (98.16% GC)
julia> @benchmark @fastmath 1.1 ^ 3
BenchmarkTools.Trial:
samples: 10000
evals/sample: 1000
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 0.00 bytes
allocs estimate: 0
minimum time: 2.00 ns (0.00% GC)
median time: 3.00 ns (0.00% GC)
mean time: 2.59 ns (0.00% GC)
maximum time: 20.00 ns (0.00% GC)
Note that with @fastmath, performance is much better.

JPEG compression implementation in MATLAB

I'm working on an implementation of the JPEG compression algorithm in MATLAB. I've run into some issues when computing the discrete cosine transform (DCT) of the 8x8 image blocks (T = H * F * H_transposed, where H is the 8x8 matrix of DCT coefficients generated with dctmtx(8) and F is an 8x8 image block). The code is below:
jpegCompress.m
function y = jpegCompress(x, quality)
% y = jpegCompress(x, quality) compresses an image X based on 8 x 8 DCT
% transforms, coefficient quantization and Huffman symbol coding. Input
% quality determines the amount of information that is lost and compression achieved. y is the encoding structure containing fields:
% y.size size of x
% y.numblocks number of 8 x 8 encoded blocks
% y.quality quality factor as percent
% y.huffman Huffman coding structure
narginchk(1, 2); % check number of input arguments
if ~ismatrix(x) || ~isreal(x) || ~ isnumeric(x) || ~ isa(x, 'uint8')
error('The input must be a uint8 image.');
end
if nargin < 2
quality = 1; % default value for quality
end
if quality <= 0
error('Input parameter QUALITY must be greater than zero.');
end
m = [16 11 10 16 24 40 51 61 % default JPEG normalizing array
12 12 14 19 26 58 60 55 % and zig-zag reordering pattern
14 13 16 24 40 57 69 56
14 17 22 29 51 87 80 62
18 22 37 56 68 109 103 77
24 35 55 64 81 104 113 92
49 64 78 87 103 121 120 101
72 92 95 98 112 100 103 99] * quality;
order = [1 9 2 3 10 17 25 18 11 4 5 12 19 26 33 ...
41 34 27 20 13 6 7 14 21 28 35 42 49 57 50 ...
43 36 29 22 15 8 16 23 30 37 44 51 58 59 52 ...
45 38 31 24 32 39 46 53 60 61 54 47 40 48 55 ...
62 63 56 64];
[xm, xn] = size(x); % retrieve size of input image
x = double(x) - 128; % level shift input
t = dctmtx(8); % compute 8 x 8 DCT matrix
% Compute DCTs of 8 x 8 blocks and quantize coefficients
y = blkproc(x, [8 8], 'P1 * x * P2', t, t');
y = blkproc(y, [8 8], 'round(x ./ P1)', m); % <== nearly all elements from y are zero after this step
y = im2col(y, [8 8], 'distinct'); % break 8 x 8 blocks into columns
xb = size(y, 2); % get number of blocks
y = y(order, :); % reorder column elements
eob = max(y(:)) + 1; % create end-of-block symbol (must not occur in y)
r = zeros(numel(y) + size(y, 2), 1);
count = 0;
for j = 1:xb % process one block(one column) at a time
i = find(y(:, j), 1, 'last'); % find last non-zero element
if isempty(i) % check if there are no non-zero values
i = 0;
end
p = count + 1;
q = p + i;
r(p:q) = [y(1:i, j); eob]; % truncate trailing zeros, add eob
count = count + i + 1; % and add to output vector
end
r((count + 1):end) = []; % delete unused portion of r
y = struct;
y.size = uint16([xm xn]);
y.numblocks = uint16(xb);
y.quality = uint16(quality * 100);
y.huffman = mat2huff(r);
mat2huff is implemented as:
mat2huff.m
function y = mat2huff(x)
%MAT2HUFF Huffman encodes a matrix.
% Y = mat2huff(X) Huffman encodes matrix X using symbol
% probabilities in unit-width histogram bins between X's minimum
% and maximum values. The encoded data is returned as a structure
% Y:
% Y.code the Huffman-encoded values of X, stored in
% a uint16 vector. The other fields of Y contain
% additional decoding information, including:
% Y.min the minimum value of X plus 32768
% Y.size the size of X
% Y.hist the histogram of X
%
% If X is logical, uint8, uint16, uint32, int8, int16, or double,
% with integer values, it can be input directly to MAT2HUFF. The
% minimum value of X must be representable as an int16.
%
% If X is double with non-integer values --- for example, an image
% with values between 0 and 1 --- first scale X to an appropriate
% integer range before the call. For example, use Y =
% MAT2HUFF(255 * X) for 256 gray level encoding.
%
% NOTE: The number of Huffman code words is round(max(X(:))) -
% round(min(X(:))) + 1. You may need to scale input X to generate
% codes of reasonable length. The maximum row or column dimension
% of X is 65535.
if ~ismatrix(x) || ~isreal(x) || (~isnumeric(x) && ~islogical(x))
error('X must be a 2-D real numeric or logical matrix.');
end
% Store the size of input x.
y.size = uint32(size(x));
% Find the range of x values and store its minimum value biased
% by +32768 as a uint16.
x = round(double(x));
xmin = min(x(:));
xmax = max(x(:));
pmin = double(int16(xmin));
pmin = uint16(pmin+32768);
y.min = pmin;
% Compute the input histogram between xmin and xmax with unit
% width bins , scale to uint16 , and store.
x = x(:)';
h = histc(x, xmin:xmax);
if max(h) > 65535
h = 65535 * h / max(h);
end
h = uint16(h);
y.hist = h;
% Code the input matrix and store the result.
map = huffman(double(h)); % Make Huffman code map
hx = map(x(:) - xmin + 1); % Map image
hx = char(hx)'; % Convert to char array
hx = hx(:)';
hx(hx == ' ') = [ ]; % Remove blanks
ysize = ceil(length(hx) / 16); % Compute encoded size
hx16 = repmat('0', 1, ysize * 16); % Pre-allocate modulo-16 vector
hx16(1:length(hx)) = hx; % Make hx modulo-16 in length
hx16 = reshape(hx16, 16, ysize); % Reshape to 16-character words
hx16 = hx16' - '0'; % Convert binary string to decimal
twos = pow2(15 : - 1 : 0);
y.code = uint16(sum(hx16 .* twos(ones(ysize ,1), :), 2))';
Why is the block processing step generating mostly null values?
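It is likely that multiplying the quantization values you have by four is causing the DCT coefficients to go to zero. Each coefficient is divided by its (scaled) table entry and then rounded, so larger table entries send more coefficients to zero.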
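The effect of the scale factor can be seen on a single row of coefficients. The Python sketch below is an illustration under stated assumptions: the divisors are the first row of the normalizing array m from the question, the coefficient values in SAMPLE_COEFFS are made up for the example, and round_half_away is a hypothetical helper that mimics MATLAB's round (halves rounded away from zero).

```python
import math

# First row of the normalizing array m from the question.
BASE_ROW = [16, 11, 10, 16, 24, 40, 51, 61]

# Hypothetical DCT coefficients for one row of a block (illustrative values).
SAMPLE_COEFFS = [-415.0, -30.0, -61.0, 27.0, 56.0, -19.0, -2.0, 0.0]


def round_half_away(v):
    """Round to nearest integer, halves away from zero (like MATLAB's round)."""
    return int(math.copysign(math.floor(abs(v) + 0.5), v))


def quantize(coeffs, table, quality):
    """Divide each coefficient by its scaled table entry and round."""
    return [round_half_away(c / (q * quality)) for c, q in zip(coeffs, table)]


if __name__ == "__main__":
    for quality in (1, 4, 16):
        row = quantize(SAMPLE_COEFFS, BASE_ROW, quality)
        nonzero = sum(1 for v in row if v != 0)
        print(quality, row, "non-zero:", nonzero)
```

As the scale factor grows, fewer entries survive the rounding, which matches the mostly-zero output seen after the second blkproc call.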

CUDA - Multiple Threads

I am trying to make an LCG random number generator run in parallel using CUDA and GPUs. However, I am having trouble actually getting multiple threads running at the same time. Here is a copy of the code:
#include <iostream>
#include <math.h>
__global__ void rng(long *cont)
{
int a=9, c=3, F, X=1;
long M=524288, Y;
printf("\nKernel X is %d\n", X); // X is a plain int here, not an array
F=X;
Y=X;
printf("Kernel F is %d\nKernel Y is %ld\n", F, Y); // Y is a long, so use %ld
Y=(a*Y+c)%M;
printf("%ld\t", Y);
while(Y!=F)
{
Y=(a*Y+c)%M;
printf("%ld\t", Y);
cont[0]++;
}
}
int main()
{
long cont[1]={1};
int X[1];
long *dev_cont;
int *dev_X;
cudaEvent_t beginEvent;
cudaEvent_t endEvent;
cudaEventCreate( &beginEvent );
cudaEventCreate( &endEvent );
printf("Please give the value of the seed X ");
scanf("%d", &X[0]);
printf("Host X is: %d", *X);
cudaEventRecord( beginEvent, 0);
cudaMalloc( (void**)&dev_cont, sizeof(long) );
cudaMalloc( (void**)&dev_X, sizeof(int) );
cudaMemcpy(dev_cont, cont, 1 * sizeof(long), cudaMemcpyHostToDevice);
cudaMemcpy(dev_X, X, 1 * sizeof(int), cudaMemcpyHostToDevice);
rng<<<1,1>>>(dev_cont);
cudaMemcpy(cont, dev_cont, 1 * sizeof(long), cudaMemcpyDeviceToHost);
cudaEventRecord( endEvent, 0);
cudaEventSynchronize (endEvent );
float timevalue;
cudaEventElapsedTime (&timevalue, beginEvent, endEvent);
printf("\n\nYou generated a total of %ld numbers", cont[0]);
printf("\nCUDA Kernel Time: %.2f ms\n", timevalue);
cudaFree(dev_cont);
cudaFree(dev_X);
cudaEventDestroy( endEvent );
cudaEventDestroy( beginEvent );
return 0;
}
Right now I am only sending one block with one thread. However, if I send 100 threads, the only thing that will happen is that it will produce the same number 100 times and then proceed to the next number. In theory this is what is meant to be expected but it automatically disregards the purpose of "random numbers" when a number is repeated.
The idea I want to implement is to have multiple threads. One thread will use that formula:
Y=(a*Y+c)%M but using an initial value of Y=1, then another thread will use the same formula but with an initial value of Y=1000, etc etc. However, once the first thread produces 1000 numbers, it needs to stop making more calculations because if it continues it will interfere with the second thread producing numbers with a value of Y=1000.
If anyone can point in the right direction, at least in the way of creating multiple threads with different functions or instructions inside of them, to run in parallel, I will try to figure out the rest.
Thanks!
UPDATE: July 31, 8:14PM EST
I updated my code to the following. Basically I am trying to produce 256 random numbers. I created the array where those 256 numbers will be stored. I also created an array with 10 different seed values for the values of Y in the threads. I also changed the code to request 10 threads in the device. I am also saving the numbers that are generated in an array. The code is not working correctly as it should. Please advise on how to fix it or how to make it achieve what I want.
Thanks!
#include <iostream>
#include <math.h>
__global__ void rng(long *cont, int *L, int *N)
{
int Y=threadIdx.x;
Y=N[threadIdx.x];
int a=9, c=3, i;
long M=256;
for(i=0;i<256;i++)
{
Y=(a*Y+c)%M;
N[i]=Y;
cont[0]++;
}
}
int main()
{
long cont[1]={1};
int i;
int L[10]={1,25,50,75,100,125,150,175,200,225}, N[256];
long *dev_cont;
int *dev_L, *dev_N;
cudaEvent_t beginEvent;
cudaEvent_t endEvent;
cudaEventCreate( &beginEvent );
cudaEventCreate( &endEvent );
cudaEventRecord( beginEvent, 0);
cudaMalloc( (void**)&dev_cont, sizeof(long) );
cudaMalloc( (void**)&dev_L, sizeof(int) );
cudaMalloc( (void**)&dev_N, sizeof(int) );
cudaMemcpy(dev_cont, cont, 1 * sizeof(long), cudaMemcpyHostToDevice);
cudaMemcpy(dev_L, L, 10 * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_N, N, 256 * sizeof(int), cudaMemcpyHostToDevice);
rng<<<1,10>>>(dev_cont, dev_L, dev_N);
cudaMemcpy(cont, dev_cont, 1 * sizeof(long), cudaMemcpyDeviceToHost);
cudaMemcpy(N, dev_N, 256 * sizeof(int), cudaMemcpyDeviceToHost);
cudaEventRecord( endEvent, 0);
cudaEventSynchronize (endEvent );
float timevalue;
cudaEventElapsedTime (&timevalue, beginEvent, endEvent);
printf("\n\nYou generated a total of %ld numbers", cont[0]);
printf("\nCUDA Kernel Time: %.2f ms\n", timevalue);
printf("Your numbers are:");
for(i=0;i<256;i++)
{
printf("%d\t", N[i]);
}
cudaFree(dev_cont);
cudaFree(dev_L);
cudaFree(dev_N);
cudaEventDestroy( endEvent );
cudaEventDestroy( beginEvent );
return 0;
}
@Bardia - Please let me know how I can change my code to accommodate my needs.
UPDATE: August 1, 5:39PM EST
I edited my code to accommodate @Bardia's modifications to the kernel code. However, a few errors are showing up in the generation of numbers. First, the counter that I created in the kernel to count how many numbers are being created is not working. At the end it only displays that "1" number was generated. The timer that I created to measure the time it takes for the kernel to execute the instructions is also not working because it keeps displaying 0.00 ms. And based on the parameters that I have set for the formula, the numbers that are being generated, copied into the array, and then printed on the screen do not reflect the numbers that are meant to appear (or even come close). These all used to work before.
Here is the new code:
#include <iostream>
#include <math.h>
__global__ void rng(long *cont, int *L, int *N)
{
int Y=threadIdx.x;
Y=L[threadIdx.x];
int a=9, c=3, i;
long M=256;
int length=ceil((float)M/10); //256 divided by the number of threads.
for(i=(threadIdx.x*length);i<length;i++)
{
Y=(a*Y+c)%M;
N[i]=Y;
cont[0]++;
}
}
int main()
{
long cont[1]={1};
int i;
int L[10]={1,25,50,75,100,125,150,175,200,225}, N[256];
long *dev_cont;
int *dev_L, *dev_N;
cudaEvent_t beginEvent;
cudaEvent_t endEvent;
cudaEventCreate( &beginEvent );
cudaEventCreate( &endEvent );
cudaEventRecord( beginEvent, 0);
cudaMalloc( (void**)&dev_cont, sizeof(long) );
cudaMalloc( (void**)&dev_L, sizeof(int) );
cudaMalloc( (void**)&dev_N, sizeof(int) );
cudaMemcpy(dev_cont, cont, 1 * sizeof(long), cudaMemcpyHostToDevice);
cudaMemcpy(dev_L, L, 10 * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_N, N, 256 * sizeof(int), cudaMemcpyHostToDevice);
rng<<<1,10>>>(dev_cont, dev_L, dev_N);
cudaMemcpy(cont, dev_cont, 1 * sizeof(long), cudaMemcpyDeviceToHost);
cudaMemcpy(N, dev_N, 256 * sizeof(int), cudaMemcpyDeviceToHost);
cudaEventRecord( endEvent, 0);
cudaEventSynchronize (endEvent );
float timevalue;
cudaEventElapsedTime (&timevalue, beginEvent, endEvent);
printf("\n\nYou generated a total of %ld numbers", cont[0]);
printf("\nCUDA Kernel Time: %.2f ms\n", timevalue);
printf("Your numbers are:");
for(i=0;i<256;i++)
{
printf("%d\t", N[i]);
}
cudaFree(dev_cont);
cudaFree(dev_L);
cudaFree(dev_N);
cudaEventDestroy( endEvent );
cudaEventDestroy( beginEvent );
return 0;
}
This is the output I receive:
[wigberto#client2 CUDA]$ ./RNG8
You generated a total of 1 numbers
CUDA Kernel Time: 0.00 ms
Your numbers are:614350480 32767 1132936976 11079 2 0 10 0 1293351837 0 -161443660 48 0 0 614350336 32767 1293351836 0 -161444681 48 614350760 32767 1132936976 11079 2 0 10 0 1057178751 0 -161443660 48 155289096 49 614350416 32767 1057178750 0 614350816 32767 614350840 32767 155210544 49 0 0 1132937352 11079 1130370784 11079 1130382061 11079 155289096 49 1130376992 11079 0 1 1610 1 1 1 1130370408 11079 614350896 32767 614350816 32767 1057178751 0 614350840 32767 0 0 -161443150 48 0 0 1132937352 11079 1 11079 0 0 1 0 614351008 32767 614351032 32767 0 0 0 0 0 0 1130369536 1 1132937352 11079 1130370400 11079 614350944 32767 1130369536 11079 1130382061 11079 1130370784 11079 1130365792 11079 6143510880 614351008 32767 -920274837 0 614351032 32767 0 0 -161443150 48 0 0 0 0 1 0 128 0-153802168 48 614350896 32767 1132839104 11079 97 0 88 0 1 0 155249184 49 1130370784 11079 0 0-1 0 1130364928 11079 2464624 0 4198536 0 4198536 0 4197546 0 372297808 0 1130373120 11079 -161427611 48 111079 0 0 1 0 -153802272 48 155249184 49 372297840 0 -1 0 -161404446 48 0 0 0 0372298000 0 372297896 0 372297984 0 0 0 0 0 1130369536 11079 84 0 1130471067 11079 6303744 0614351656 32767 0 0 -1 0 4198536 0 4198536 0 4197546 0 1130397880 11079 0 0 0 0 0 0 00 0 0 -161404446 48 0 0 4198536 0 4198536 0 6303744 0 614351280 32767 6303744 0 614351656 32767 614351640 32767 1 0 4197371 0 0 0 0 0 [wigberto#client2 CUDA]$
@Bardia - Please advise on what is the best thing to do here.
Thanks!
You can address threads within a block using the threadIdx variable.
i.e., in your case you should probably set
Y = threadIdx.x and then use Y=(a*Y+c)%M
But in general, implementing a good RNG on CUDA can be really difficult.
So I don't know if you want to implement your own generator just for practice.
Otherwise there is the CURAND library, which provides a number of pseudo- and quasi-random generators, e.g. XORWOW, MersenneTwister, Sobol, etc.
All threads do the same work because you have told them all to do the same work. You should always distinguish threads from each other by addressing them.
For example, you should tell thread #1 to do this job and save its work here, and thread #2 to do that job and save its work there, and then go back to the host and use that data.
For a two dimensional block grid with two dimension threads in each block I use this code for addressing:
int X = blockIdx.x*blockDim.x+threadIdx.x;
int Y = blockIdx.y*blockDim.y+threadIdx.y;
The X and Y in the code above are the global address of your thread (I think a one-dimensional grid and block are sufficient for you).
Also note that printf inside a kernel is only supported on devices of compute capability 2.0 or higher. On older devices you can use the cuPrintf function, which is one of the CUDA SDK's samples, but read its instructions to use it correctly.
This answer relates to the edited part of the question.
I didn't notice that it is a recursive algorithm, and unfortunately I don't know how to parallelize a recursive algorithm.
My only idea for generating these 256 numbers is to generate them separately, i.e. generate 26 of them in the first thread, 26 of them in the second thread, and so on.
This code will do this (this is only the kernel part):
#include <iostream>
#include <math.h>
__global__ void rng(long *cont, int *L, int *N)
{
int Y=threadIdx.x;
Y=L[threadIdx.x];
int a=9, c=3, i;
long M=256;
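int length=ceil((float)M/10); // numbers per thread: 256 divided by the number of threads, rounded up
int start=threadIdx.x*length; // first index of N owned by this thread
for(i=start;i<start+length && i<M;i++) // stay inside this thread's chunk and inside N
{
Y=(a*Y+c)%M;
N[i]=Y;
cont[0]++; // unsynchronized across threads, so the final count is not reliable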
}
}
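One way to split a single LCG stream across workers, rather than giving each worker an unrelated seed, is to jump each worker ahead to the start of its chunk. The Python sketch below illustrates that idea on the host side and is not taken from the answers above: it relies on the fact that k steps of Y = (a*Y + c) % M collapse into one affine step (A*Y + C) % M, and the names lcg_jump and lcg_chunks are assumptions made for this example.

```python
def lcg_step(x, a, c, m):
    """One step of the generator: x -> (a*x + c) mod m."""
    return (a * x + c) % m


def lcg_jump(a, c, m, k):
    """Return (A, C) such that k steps from x give (A*x + C) mod m."""
    A, C = 1, 0  # identity map
    while k > 0:
        if k & 1:
            A, C = (a * A) % m, (a * C + c) % m   # apply current map once more
        a, c = (a * a) % m, (a * c + c) % m       # square the current map
        k >>= 1
    return A, C


def lcg_chunks(seed, a, c, m, total, workers):
    """Split the first `total` outputs into per-worker chunks."""
    chunk = -(-total // workers)  # ceiling division
    out = []
    for w in range(workers):
        start = w * chunk
        count = max(0, min(chunk, total - start))
        A, C = lcg_jump(a, c, m, start)
        x = (A * seed + C) % m    # state after `start` steps
        block = []
        for _ in range(count):
            x = lcg_step(x, a, c, m)
            block.append(x)
        out.append(block)
    return out


if __name__ == "__main__":
    # Parameters from the question: a=9, c=3, M=256, 10 workers.
    chunks = lcg_chunks(1, 9, 3, 256, 256, 10)
    print([len(b) for b in chunks])
```

Each inner loop is independent of the others, so in a CUDA kernel the worker index would be threadIdx.x and each thread would write only to its own slice of the output array.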

Is there any easy way to do modulus of 2^32 - 1 operation?

I just heard that x mod (2^32-1) and x / (2^32-1) are easy to compute, but how?
I need them to calculate the formula:
x_n = (x_{n-1} + x_{n-1} / b) mod b.
For b = 2^32 it's easy: x % (2^32) == x & (2^32-1), and x / (2^32) == x >> 32 (the ^ here is not XOR). How do I do that when b = 2^32 - 1?
The page https://en.wikipedia.org/wiki/Multiply-with-carry says "arithmetic for modulus 2^32 − 1 requires only a simple adjustment from that for 2^32". So what is the "simple adjustment"?
(This answer only handles the mod case.)
I'll assume that the datatype of x is more than 32 bits (this answer will actually work with any positive integer) and that it is positive (the negative case is just -(-x mod 2^32-1)), since if it is at most 32 bits, the question can be answered by
x mod (2^32-1) = 0 if x == 2^32-1, x otherwise
x / (2^32 - 1) = 1 if x == 2^32-1, 0 otherwise
We can write x in base 2^32, with digits x0, x1, ..., xn. So
x = x0 + 2^32 * x1 + (2^32)^2 * x2 + ... + (2^32)^n * xn
This makes the answer clearer when we do the modulus, since 2^32 == 1 mod 2^32-1. That is
x == x0 + 1 * x1 + 1^2 * x2 + ... + 1^n * xn (mod 2^32-1)
== x0 + x1 + ... + xn (mod 2^32-1)
x mod 2^32-1 is the same as the sum of the base 2^32 digits! (We can't drop the mod 2^32-1 yet.) We have two cases now: either the sum is between 0 and 2^32-1 or it is greater. In the former, we are done; in the latter, we can just recurse until we get between 0 and 2^32-1. Getting the digits in base 2^32 is fast, since we can use bitwise operations. In Python (this doesn't handle negative numbers):
def mod_2to32sub1(x):
    s = 0 # the sum
    while x > 0: # get the digits
        s += x & (2**32-1)
        x >>= 32
    if s > 2**32-1:
        return mod_2to32sub1(s)
    elif s == 2**32-1:
        return 0
    else:
        return s
(This is extremely easy to generalise to x mod 2^n-1; in fact you just replace any occurrence of 32 with n in this answer.)
(EDIT: added the elif clause to avoid an infinite loop on mod_2to32sub1(2**32-1). EDIT2: replaced ^ with **... oops.)
So you compute with the "rule" 2^32 = 1. In general, 2^(32+x) = 2^x. You can simplify 2^a by taking the exponent modulo 32. Example: 2^66 = 2^2.
You can express any number in binary, and then lower the exponents. Example: the number 2^40 + 2^38 + 2^20 + 2 + 1 can be simplified to 2^8 + 2^6 + 2^20 + 2 + 1.
In general, you can group the exponents every 32 powers of 2, and "downgrade" all exponents modulo 32.
For 64 bit words, the number can be expressed as
2^32 * A + B
where 0 <= A,B <= 2^32-1. Getting A and B is easy with bitwise operations.
So you can simplify that to A + B, which is much smaller: at most 2^33. Then, check if this number is at least 2^32-1, and subtract 2^32 - 1 in that case. (If A = B = 2^32-1, the result after one subtraction is still 2^32-1, so that value must be mapped to 0 as well.)
This avoids expensive direct division.
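The 64-bit case described above can be sketched in a few lines of Python. This is an illustration only: the function name mod_2to32sub1_u64 and the constant MASK32 are assumptions, and the second check covers the single input (all 64 bits set) where one subtraction leaves exactly 2^32-1.

```python
MASK32 = (1 << 32) - 1  # 2^32 - 1, both the modulus and the low-word mask


def mod_2to32sub1_u64(x):
    """x mod (2^32 - 1) for 0 <= x < 2^64, using one shift, one mask, one add."""
    if not 0 <= x < (1 << 64):
        raise ValueError("x must fit in an unsigned 64-bit word")
    s = (x >> 32) + (x & MASK32)   # A + B, at most 2^33 - 2
    if s >= MASK32:
        s -= MASK32                # now 0 <= s <= 2^32 - 1
    if s == MASK32:                # only reached when A == B == 2^32 - 1
        s = 0
    return s


if __name__ == "__main__":
    for x in (0, MASK32, 1 << 32, 12345678901234567, (1 << 64) - 1):
        print(x, mod_2to32sub1_u64(x), x % MASK32)
```

In C the same steps would use a 64-bit unsigned type, with the sum A + B held in 64 bits so that it cannot overflow.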
The modulus has already been explained, nevertheless, let's recapitulate.
To find the remainder of k modulo 2^n-1, write
k = a + 2^n*b, 0 <= a < 2^n
Then
k = a + ((2^n-1) + 1) * b
= (a + b) + (2^n-1)*b
≡ (a + b) (mod 2^n-1)
If a + b >= 2^n, repeat until the remainder is less than 2^n, and if that leads you to a + b = 2^n-1, replace that with 0. Each "shift right by n and add to the last n bits" moves the first set bit right by n or n-1 places (unless k < 2^(2*n-1), when the first set bit after the shift-and-add may be the 2^n bit). So if the width of the type is large compared to n, this will need many shifts - consider a 128-bit type and n = 3, for large k you will need over 40 shifts. To reduce the number of shifts required, you can exploit the fact that
2^(m*n) - 1 = (2^n - 1) * (2^((m-1)*n) + 2^((m-2)*n) + ... + 2^(2*n) + 2^n + 1),
of which we will only use that 2^n - 1 divides 2^(m*n) - 1 for all m > 0. Then you shift by multiples of n that are roughly half the maximal bit-length the value can have at that step. For the above example of a 128-bit type and the remainder modulo 7 (2^3 - 1), the closest multiples of 3 to 128/2 are 63 and 66, first shift by 63 bits
r_1 = (k & (2^63 - 1)) + (k >> 63) // r_1 < 2^63 + 2^(128-63) < 2^66
to get a number with at most 66 bits, then shift by 66/2 = 33 bits
r_2 = (r_1 & (2^33 - 1)) + (r_1 >> 33) // r_2 < 2^33 + 2^(66-33) = 2^34
to reach at most 34 bits. Next shift by 18 bits, then 9, 6, 3
r_3 = (r_2 & (2^18 - 1)) + (r_2 >> 18) // r_3 < 2^18 + 2^(34-18) < 2^19
r_4 = (r_3 & (2^9 - 1)) + (r_3 >> 9) // r_4 < 2^9 + 2^(19-9) < 2^11
r_5 = (r_4 & (2^6 - 1)) + (r_4 >> 6) // r_5 < 2^6 + 2^(11-6) < 2^7
r_6 = (r_5 & (2^3 - 1)) + (r_5 >> 3) // r_6 < 2^3 + 2^(7-3) < 2^5
r_7 = (r_6 & (2^3 - 1)) + (r_6 >> 3) // r_7 < 2^3 + 2^(5-3) < 2^4
Now a single subtraction if r_7 >= 2^3 - 1 suffices. To calculate k % (2^n -1) in a b-bit type, O(log2 (b/n)) shifts are needed.
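The halving strategy can be sketched generically in Python. This is a hedged illustration rather than a tuned implementation: the function name mod_mersenne_halving is an assumption, and the shift at each step is chosen as the largest multiple of n that does not exceed half the current bit length (and at least n), which differs slightly from the hand-picked shifts in the 128-bit example.

```python
def mod_mersenne_halving(k, n):
    """Remainder of non-negative k modulo 2^n - 1 by folding about half the bits."""
    mask = (1 << n) - 1
    while k > mask:
        bits = k.bit_length()
        # A multiple of n near half the width; 2^shift is congruent to 1 mod 2^n - 1.
        shift = max(n, (bits // 2) // n * n)
        k = (k & ((1 << shift) - 1)) + (k >> shift)
    # The loop leaves 0 <= k <= 2^n - 1, and 2^n - 1 itself is congruent to 0.
    return 0 if k == mask else k


if __name__ == "__main__":
    k = (1 << 127) + 12345
    print(mod_mersenne_halving(k, 3), k % 7)
```

Every fold strictly decreases k, so the loop terminates, and because each shift is a multiple of n the value stays congruent to the original k throughout.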
The quotient is obtained similarly, again we write
k = a + 2^n*b, 0 <= a < 2^n
= a + ((2^n-1) + 1)*b
= (2^n-1)*b + (a+b),
so k/(2^n-1) = b + (a+b)/(2^n-1), and we continue while a+b > 2^n-1. Here we unfortunately cannot reduce the work by shifting and masking about half the width, so the method is only efficient when n is not much smaller than the width of the type.
Code for the fast cases where n is not too small:
unsigned long long modulus_2n1(unsigned n, unsigned long long k) {
unsigned long long mask = (1ULL << n) - 1ULL;
while(k > mask) {
k = (k & mask) + (k >> n);
}
return k == mask ? 0 : k;
}
unsigned long long quotient_2n1(unsigned n, unsigned long long k) {
unsigned long long mask = (1ULL << n) - 1ULL, quotient = 0;
while(k > mask) {
quotient += k >> n;
k = (k & mask) + (k >> n);
}
return k == mask ? quotient + 1 : quotient;
}
For the special case where n is half the width of the type, the loop runs at most twice, so if branches are expensive, it may be better to unroll the loop and unconditionally execute the loop body twice.
It is not. What you must have heard is that x mod 2^n and x/2^n are easy. x/2^n can be computed as x>>n, and x mod 2^n as x&((1<<n)-1).
