Parallelizing the Gaussian Blur Algorithm with OpenMP - openmp

I was trying to parallelize the gaussian blur function using OpenMP,
but I am new at OpenMP, and when I tried to parallelize the two for loops (I don't think there are any variables that need to be private for each thread), it ended up
running even slower than before, and the output was different. So did I do anything wrong? What should I do to make it run faster?
void gaussian_blur(float **src, float **dst, int w, int h, float sigma)
{
int x, y, i;
int ksize = (int)(sigma * 2.f * 4.f + 1) | 1;
int halfk = ksize / 2;
float scale = -0.5f/(sigma*sigma);
float sum = 0.f;
float *kernel, *ringbuf;
int xmax = w - halfk;
int ymax = h - halfk;
// if sigma too small, just copy src to dst
if (ksize <= 1)
{
for (y = 0; y < h; y++)
for (x = 0; x < w; x++)
dst[y][x] = src[y][x];
return;
}
// create Gaussian kernel
kernel = malloc(ksize * sizeof(float));
ringbuf = malloc(ksize * sizeof(float));
#pragma omp parallel for reduction(+ : sum)
for (i = 0; i < ksize; i++)
{
float x = (float)(i - halfk);
float t = expf(scale * x * x);
kernel[i] = t;
sum += t;
}
scale = 1.f / sum;
#pragma omp parallel for
for (i = 0; i < ksize; i++)
kernel[i] *= scale;
// blur each row
#pragma omp parallel for // this is the for loop I parallelized but ended up with wrong output and running slower
for (y = 0; y < h; y++)
{
int x1;
int bufi0 = ksize-1;
float tmp = src[y][0];
for (x1 = 0; x1 < halfk ; x1++) ringbuf[x1] = tmp;
for (; x1 < ksize-1; x1++) ringbuf[x1] = src[y][x1-halfk];
for (x1 = 0; x1 < w; x1++)
{
if(x1 < xmax)
ringbuf[bufi0++] = src[y][x1+halfk];
else
ringbuf[bufi0++] = src[y][w-1];
if (bufi0 == ksize) bufi0 = 0;
dst[y][x1] = convolve(kernel, ringbuf, ksize, bufi0);
}
}
// blur each column
#pragma omp parallel for // this is the for loop I parallelized but ended up with wrong output and running slower
for (x = 0; x < w; x++)
{
int y1;
int bufi0 = ksize-1;
float tmp = dst[0][x];
for (y1 = 0; y1 < halfk ; y1++) ringbuf[y1] = tmp;
for ( ; y1 < ksize-1; y1++) ringbuf[y1] = dst[y1-halfk][x];
for (y1 = 0; y1 < h; y1++)
{
if(y1 < ymax)
ringbuf[bufi0++] = dst[y1+halfk][x];
else
ringbuf[bufi0++] = dst[h-1][x];
if (bufi0 == ksize) bufi0 = 0;
dst[y1][x] = convolve(kernel, ringbuf, ksize, bufi0);
}
}
// clean up
free(kernel);
free(ringbuf);
}

Besides the need to properly identify private and shared data, there are several things that you could do in order to speed up your program.
As a first step you should remove any unnecessary concurrency. For example, how big ksize happens to be on average? If it is less than several hundred elements, it makes absolutely no sense to employ OpenMP for such simple operations as computing the kernel and then normalising it:
#pragma omp parallel for reduction(+ : sum)
for (i = 0; i < ksize; i++)
{
float x = (float)(i - halfk);
float t = expf(scale * x * x);
kernel[i] = t;
sum += t;
}
scale = 1.f / sum;
#pragma omp parallel for
for (i = 0; i < ksize; i++)
kernel[i] *= scale;
On a typical modern CPU it would take more cycles to bootstrap the parallel regions than to compute this on a single core. Also on modern CPUs these loops can be unrolled and vectorised and you can get up to 8x boost on a single core. If the kernel is too small, then besides OpenMP overhead you will also get slowdown from excessive false sharing. You have to make sure that each thread gets an exact multiple of 16 elements (64 bytes of cache line size / sizeof(float)) to work on in order to prevent false sharing.
You also have to make sure that threads do not share cache lines in the column blur section.
// blur each column
#pragma omp parallel for
for (x = 0; x < w; x++)
{
...
for (y1 = 0; y1 < h; y1++)
{
...
dst[y1][x] = convolve(kernel, ringbuf, ksize, bufi0);
}
}
Because of the access pattern here, you have to make sure that each thread gets a chunk of columns that is a multiple of 16 or else there will be a border overlap area of 16*y1 pixels shared by every two consecutive threads where excessive false sharing will occur. If you cannot guarantee that w is divisible by 16, then you can give each thread a starting offset in the y direction, e.g. the innermost loop becomes:
int tid = omp_get_thread_num();
for (y1 = 2*tid; y1 < h; y1++)
{
...
}
for (y1 = 0; y1 < 2*tid; y1++)
{
...
}
The multiplier 2 is arbitrary. The idea is to give the next thread several rows of advance in comparison to the current one so that both threads will not be processing the same line at once at any moment in time. You could also use addition and modulo arithmetic to compute y1, i.e.
for (y2 = 0; y2 < h; y2++)
{
y1 = (y2 + 2*tid) % h;
...
}
but this is generally slower than just separating the loop in two parts.
Also mind your data size. The last level cache (LLC) has very high but still limited bandwidth. If data cannot fit in the private cache of each core then compiler optimisations such as loop vectorisations can put very high pressure on the LLC. Things get more ugly if data doesn't fit in the LLC and therefore the main memory has to be accessed.
If you don't know what false sharing is, there is an article in Dr.Dobb's that kind of explains it here.

I may have fixed your code. You did not post your convolve function so it's difficult to say for sure but I'm not sure it matters. There are at least two bugs. There is a race condition in the ringbuf array. To fix this I extend the array times the number of threads.
ringbuf = (float*)malloc(nthreads*ksize * sizeof(float));
To access the array do something like this
int ithread = omp_get_thread_num();
ringbuf[ksize*ithread + x1]
Edit: I added some code which defines ringbuf inside the parallel block. That way you don't have to access ringbuf based on the thread number.
The second errors is the ibufi0 variable. I defined a new one like this
const int ibufi0_fix = (x1+ksize-1)%ksize;
Below is the code I used to check it. Replace with your convolve function. Note, this may still be quite inefficient. There are probably cache issues such as cache misses and false sharing (particularly when you convolve vertically). Hopefully, though, the image will be correct now.
Edit: here is a paper by Intel that shows how to do this best with AVX. It's optimized to minimize the cache misses. I'm not sure it's optimized for threading though.
http://software.intel.com/en-us/articles/iir-gaussian-blur-filter-implementation-using-intel-advanced-vector-extensions
I'm writing my own function on this (it's actually the reason I started learning OpenMP) which uses SSE/AVX as well. There are a lot of similarities with matrix multiplication and image filtering so I learned how to optimized matrix multiplication first and will do Gaussian Blur shortly...
#include "math.h"
#include "omp.h"
#include "stdio.h"
#include <nmmintrin.h>
float convolve(const float *kernel, const float *ringbuf, const int ksize, const int bufi0) {
float sum = 0.0f;
for(int i=0; i<ksize; i++) {
sum += kernel[i]*ringbuf[i];
}
return sum;
}
void gaussian_blur(float *src, float *dst, int w, int h, float sigma, int nthreads)
{
int x, y, i;
int ksize = (int)(sigma * 2.f * 4.f + 1) | 1;
int halfk = ksize / 2;
printf("ksize %d\n", ksize);
float scale = -0.5f/(sigma*sigma);
float sum = 0.f;
float *kernel, *ringbuf;
int xmax = w - halfk;
int ymax = h - halfk;
// if sigma too small, just copy src to dst
if (ksize <= 1)
{
for (y = 0; y < h; y++)
for (x = 0; x < w; x++)
dst[y*w + x] = src[y*w + x];
return;
}
// create Gaussian kernel
//kernel = malloc(ksize * sizeof(float));
kernel = (float*)_mm_malloc(ksize * sizeof(float),16);
//ringbuf = malloc(ksize * sizeof(float));
ringbuf = (float*)_mm_malloc(nthreads*ksize * sizeof(float),16);
#pragma omp parallel for reduction(+ : sum) if(nthreads>1)
for (i = 0; i < ksize; i++)
{
float x = (float)(i - halfk);
float t = expf(scale * x * x);
kernel[i] = t;
sum += t;
}
scale = 1.f / sum;
#pragma omp parallel for if(nthreads>1)
for (i = 0; i < ksize; i++)
kernel[i] *= scale;
// blur each row
#pragma omp parallel for if(nthreads>1)// this is the for loop I parallelized but ended up with wrong output and running slower
for (y = 0; y < h; y++)
{
int ithread = omp_get_thread_num();
//printf("nthread %d\n", nthread);
int x1;
int bufi0 = ksize-1;
float tmp = src[y*w + 0];
for (x1 = 0; x1 < halfk ; x1++) ringbuf[ksize*ithread + x1] = tmp;
for (; x1 < ksize-1; x1++) ringbuf[ksize*ithread + x1] = src[y*w + x1-halfk];
for (x1 = 0; x1 < w; x1++)
{
const int ibufi0_fix = (x1+ksize-1)%ksize;
if(x1 < xmax)
ringbuf[ksize*ithread + ibufi0_fix] = src[y*w + x1+halfk];
else
ringbuf[ksize*ithread + ibufi0_fix] = src[y*w + w-1];
if (bufi0 == ksize) bufi0 = 0;
dst[y*w + x1] = convolve(kernel, &ringbuf[ksize*ithread], ksize, bufi0);
}
}
// blur each column
#pragma omp parallel for if(nthreads>1)// this is the for loop I parallelized but ended up with wrong output and running slower
for (x = 0; x < w; x++)
{
int ithread = omp_get_thread_num();
int y1;
int bufi0 = ksize-1;
float tmp = dst[0*w + x];
for (y1 = 0; y1 < halfk ; y1++) ringbuf[ksize*ithread + y1] = tmp;
for ( ; y1 < ksize-1; y1++) ringbuf[ksize*ithread + y1] = dst[(y1-halfk)*w + x];
for (y1 = 0; y1 < h; y1++)
{
const int ibufi0_fix = (y1+ksize-1)%ksize;
if(y1 < ymax)
ringbuf[ibufi0_fix] = dst[(y1+halfk)*w + x];
else
ringbuf[ibufi0_fix] = dst[(h-1)*w + x];
if (bufi0 == ksize) bufi0 = 0;
dst[y1*w + x] = convolve(kernel, &ringbuf[ksize*ithread], ksize, bufi0);
}
}
// clean up
_mm_free(kernel);
_mm_free(ringbuf);
}
int compare(float *dst1, float *dst2, const int n) {
int error = 0;
for(int i=0; i<n; i++) {
if(*dst1 != *dst2) error++;
}
return error;
}
int main() {
const int w = 20;
const int h = 20;
float *src = (float*)_mm_malloc(w*h*sizeof(float),16);
float *dst1 = (float*)_mm_malloc(w*h*sizeof(float),16);
float *dst2 = (float*)_mm_malloc(w*h*sizeof(float),16);
for(int i=0; i<w*h; i++) {
src[i] = i;
}
gaussian_blur(src, dst1, w, h, 1.0f, 1);
gaussian_blur(src, dst2, w, h, 1.0f, 4);
int error = compare(dst1, dst2, w*h);
printf("error %d\n", error);
_mm_free(src);
_mm_free(dst1);
_mm_free(dst2);
}
Edit: here is code which defines ringbuf inside the parallel block based on the comment by Hristo. It should be equivalent.
#include "math.h"
#include "omp.h"
#include "stdio.h"
#include <nmmintrin.h>
float convolve(const float *kernel, const float *ringbuf, const int ksize, const int bufi0) {
float sum = 0.0f;
for(int i=0; i<ksize; i++) {
sum += kernel[i]*ringbuf[i];
}
return sum;
}
void gaussian_blur(float *src, float *dst, int w, int h, float sigma, int nthreads)
{
int x, y, i;
int ksize = (int)(sigma * 2.f * 4.f + 1) | 1;
int halfk = ksize / 2;
printf("ksize %d\n", ksize);
float scale = -0.5f/(sigma*sigma);
float sum = 0.f;
float *kernel;
int xmax = w - halfk;
int ymax = h - halfk;
// if sigma too small, just copy src to dst
if (ksize <= 1)
{
for (y = 0; y < h; y++)
for (x = 0; x < w; x++)
dst[y*w + x] = src[y*w + x];
return;
}
// create Gaussian kernel
//kernel = malloc(ksize * sizeof(float));
kernel = (float*)_mm_malloc(ksize * sizeof(float),16);
#pragma omp parallel for reduction(+ : sum) if(nthreads>1)
for (i = 0; i < ksize; i++)
{
float x = (float)(i - halfk);
float t = expf(scale * x * x);
kernel[i] = t;
sum += t;
}
scale = 1.f / sum;
#pragma omp parallel for if(nthreads>1)
for (i = 0; i < ksize; i++)
kernel[i] *= scale;
// blur each row
//#pragma omp parallel for if(nthreads>1)// this is the for loop I parallelized but ended up with wrong output and running slower
#pragma omp parallel if(nthreads>1)
{
float *ringbuf = (float*)_mm_malloc(ksize * sizeof(float),16);
#pragma omp for// this is the for loop I parallelized but ended up with wrong output and running slower
for (y = 0; y < h; y++)
{
//printf("nthread %d\n", nthread);
int x1;
int bufi0 = ksize-1;
float tmp = src[y*w + 0];
for (x1 = 0; x1 < halfk ; x1++) ringbuf[x1] = tmp;
for (; x1 < ksize-1; x1++) ringbuf[x1] = src[y*w + x1-halfk];
for (x1 = 0; x1 < w; x1++)
{
const int ibufi0_fix = (x1+ksize-1)%ksize;
if(x1 < xmax)
ringbuf[ibufi0_fix] = src[y*w + x1+halfk];
else
ringbuf[ibufi0_fix] = src[y*w + w-1];
if (bufi0 == ksize) bufi0 = 0;
dst[y*w + x1] = convolve(kernel, ringbuf, ksize, bufi0);
}
}
_mm_free(ringbuf);
}
// blur each column
#pragma omp parralel if(ntheads>1)
{
float *ringbuf = (float*)_mm_malloc(ksize * sizeof(float),16);
#pragma omp for// this is the for loop I parallelized but ended up with wrong output and running slower
for (x = 0; x < w; x++)
{
int y1;
int bufi0 = ksize-1;
float tmp = dst[0*w + x];
for (y1 = 0; y1 < halfk ; y1++) ringbuf[y1] = tmp;
for ( ; y1 < ksize-1; y1++) ringbuf[y1] = dst[(y1-halfk)*w + x];
for (y1 = 0; y1 < h; y1++)
{
const int ibufi0_fix = (y1+ksize-1)%ksize;
if(y1 < ymax)
ringbuf[ibufi0_fix] = dst[(y1+halfk)*w + x];
else
ringbuf[ibufi0_fix] = dst[(h-1)*w + x];
if (bufi0 == ksize) bufi0 = 0;
dst[y1*w + x] = convolve(kernel, ringbuf, ksize, bufi0);
}
}
_mm_free(ringbuf);
}
// clean up
_mm_free(kernel);
}
int compare(float *dst1, float *dst2, const int n) {
int error = 0;
for(int i=0; i<n; i++) {
if(*dst1 != *dst2) error++;
}
return error;
}
int main() {
const int w = 20;
const int h = 20;
float *src = (float*)_mm_malloc(w*h*sizeof(float),16);
float *dst1 = (float*)_mm_malloc(w*h*sizeof(float),16);
float *dst2 = (float*)_mm_malloc(w*h*sizeof(float),16);
for(int i=0; i<w*h; i++) {
src[i] = i;
}
gaussian_blur(src, dst1, w, h, 1.0f, 1);
gaussian_blur(src, dst2, w, h, 1.0f, 4);
int error = compare(dst1, dst2, w*h);
printf("error %d\n", error);
_mm_free(src);
_mm_free(dst1);
_mm_free(dst2);
}

Related

Reducing Shared Memory Bank Conflicts

Nvprof reported that there are about 200 milion shared_ld_bank_conflict and some shared_st_bank_conflict in my sgemm kernel. I tried the padding trick __shared__ float smem[SIZE + OFFSET];, it reduced store bank conflicts to 0, but load bank conflicts are still there. I don't know how to further improve it.
__global__ void sgemm(
const float* __restrict__ A,
const float* __restrict__ B,
float* __restrict__ C,
int M, int N, int K
){
int tid = threadIdx.x;
int gStartx = blockIdx.x * 128;
int gStarty = blockIdx.y * 128;
int dx = tid % 8;
int dy = tid / 8;
int vx = tid % 16;
int vy = tid / 16;
__shared__ volatile float aSM[8][128+4];
__shared__ volatile float bSM[8][128+4];
float aBuffer1[4];
float bBuffer1[4];
float aBuffer2[4];
float bBuffer2[4];
float cCache[8][8];
#pragma unroll
for (int i=0; i<8; i++)
#pragma unroll
for (int j=0; j<8; j++)
cCache[i][j] = 0.f;
//load first two tiles
#pragma unroll
for (int i=0; i<4; i++){
aBuffer1[i] = A[(gStarty + dy + i*32)*K + (dx)];
bBuffer1[i] = B[(gStartx + dy + i*32)*K + (dx)];
}
int nIt = (K + 8 - 1) / 8;
#pragma unroll
for (int itr=0; itr<nIt; itr++){
int gStartk = itr * 8;
int is_odd = itr & 1;
if (is_odd == 0){
#pragma unroll
for (int i=0; i<4; i++){
if (itr != (nIt - 1)){
// prefetch next tiles
aBuffer2[i] = A[(gStarty + i*32 + dy)*K + (gStartk + 8 + dx)];
bBuffer2[i] = B[(gStartx + i*32 + dy)*K + (gStartk + 8 + dx)];
}
//move current tiles to SMEM
aSM[dx][dy+i*32] = aBuffer1[i];
bSM[dx][dy+i*32] = bBuffer1[i];
}
} else {
#pragma unroll
for (int i=0; i<4; i++){
if (itr != (nIt - 1)){
//prefetch next tiles to another buffer
aBuffer1[i] = A[(gStarty + i*32 + dy)*K + (gStartk + 8 + dx)];
bBuffer1[i] = B[(gStartx + i*32 + dy)*K + (gStartk + 8 + dx)];
}
aSM[dx][dy+i*32] = aBuffer2[i];
bSM[dx][dy+i*32] = bBuffer2[i];
}
}
__syncthreads();
float aCache[8][4];
#pragma unroll
for (int p=0; p<2; p++){
#pragma unroll
for (int ki=0; ki<8; ki++){
#pragma unroll
for (int mi=0; mi<4; mi++){
aCache[ki][mi] = aSM[ki][8*vy + 4*p +mi];
}
}
#pragma unroll
for (int ki=0; ki<8; ki++){
#pragma unroll
for (int ni=0; ni<8; ni++){
float b = bSM[ki][8*vx + ni];
#pragma unroll
for (int mi=0; mi<4; mi++){
float a = aCache[ki][mi];
cCache[mi + 4*p][ni] = fma(a, b, cCache[mi + 4*p][ni] );
}
}
}
}
__syncthreads();
}
#pragma unroll
for (int i=0; i<8; i++){
for (int j=0; j<8; j++){
C[(gStarty + vy*8 + i)*N + (gStartx + vx*8 + j)] = cCache[i][j];
}
}
}
A (2048x2048) matrix is row major, B (2048x2048) is column major, each block has 256 threads, each block calculates 128x128 portion of C, and each thread calculates 8x8x8. the gpu is Tesla P100.
Ok I found a solution: when storing to bSM, insert one padding word between every 32 words in the second dimention
//bSM[dx][dy+i*32] = bBuffer1[i];
bSM[dx][dy+i*33] = bBuffer1[i]; //we're skipping column 32, 65, 98, 131
when reading bSM[i][j], read it like this: bSM[i][j/32 + j]
//float b = bSM[ki][8*vx + ni];
float b = bSM[ki][(8*vx) / 32 + 8*vx + ni];
// (8*vx+ni)/32 is the same as (8*vx)/32, since vi is always less than 8
now it's giving me 55% performance of cublas gemm on tesla p4

3d point closest to multiple lines in 3D space

I search for non iterative, closed form, algorithm to find Least squares solution for point closest to the set of 3d lines. It is similar to 3d point triangulation (to minimize re-projections) but seems to be be simpler and faster?
Lines can be described in any form, 2 points, point and unit direction or similar.
Let the i th line be given by point ai and unit direction vector di. We need to find the single point that minimizes the sum of squared point to line distances. This is where the gradient is the zero vector:
Expanding the gradient,
Algebra yields a canonical 3x3 linear system,
where the k'th row (a 3-element row vector) of matrix M is
with vector ek the respective unit basis vector, and
It's not hard to turn this into code. I borrowed (and fixed a small bug in) a Gaussian elimination function from Rosettacode to solve the system. Thanks to the author!
#include <stdio.h>
#include <math.h>
typedef double VEC[3];
typedef VEC MAT[3];
void solve(double *a, double *b, double *x, int n); // linear solver
double dot(VEC a, VEC b) { return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]; }
void find_nearest_point(VEC p, VEC a[], VEC d[], int n) {
MAT m = {{0, 0, 0}, {0, 0, 0}, {0, 0, 0}};
VEC b = {0, 0, 0};
for (int i = 0; i < n; ++i) {
double d2 = dot(d[i], d[i]), da = dot(d[i], a[i]);
for (int ii = 0; ii < 3; ++ii) {
for (int jj = 0; jj < 3; ++jj) m[ii][jj] += d[i][ii] * d[i][jj];
m[ii][ii] -= d2;
b[ii] += d[i][ii] * da - a[i][ii] * d2;
}
}
solve(&m[0][0], b, p, 3);
}
// Debug printing.
void pp(VEC v, char *l, char *r) {
printf("%s%.3lf, %.3lf, %.3lf%s", l, v[0], v[1], v[2], r);
}
void pv(VEC v) { pp(v, "(", ")"); }
void pm(MAT m) { for (int i = 0; i < 3; ++i) pp(m[i], "\n[", "]"); }
// A simple verifier.
double dist2(VEC p, VEC a, VEC d) {
VEC pa = { a[0]-p[0], a[1]-p[1], a[2]-p[2] };
double dpa = dot(d, pa);
return dot(d, d) * dot(pa, pa) - dpa * dpa;
}
double sum_dist2(VEC p, VEC a[], VEC d[], int n) {
double sum = 0;
for (int i = 0; i < n; ++i) sum += dist2(p, a[i], d[i]);
return sum;
}
// Check 26 nearby points and verify the provided one is nearest.
int is_nearest(VEC p, VEC a[], VEC d[], int n) {
double min_d2 = 1e100;
int ii = 2, jj = 2, kk = 2;
#define D 0.01
for (int i = -1; i <= 1; ++i)
for (int j = -1; j <= 1; ++j)
for (int k = -1; k <= 1; ++k) {
VEC pp = { p[0] + D * i, p[1] + D * j, p[2] + D * k };
double d2 = sum_dist2(pp, a, d, n);
// Prefer provided point among equals.
if (d2 < min_d2 || i == 0 && j == 0 && k == 0 && d2 == min_d2) {
min_d2 = d2;
ii = i; jj = j; kk = k;
}
}
return ii == 0 && jj == 0 && kk == 0;
}
void normalize(VEC v) {
double len = sqrt(dot(v, v));
v[0] /= len;
v[1] /= len;
v[2] /= len;
}
int main(void) {
VEC a[] = {{-14.2, 17, -1}, {1, 1, 1}, {2.3, 4.1, 9.8}, {1,2,3}};
VEC d[] = {{1.3, 1.3, -10}, {12.1, -17.2, 1.1}, {19.2, 31.8, 3.5}, {4,5,6}};
int n = 4;
for (int i = 0; i < n; ++i) normalize(d[i]);
VEC p;
find_nearest_point(p, a, d, n);
pv(p);
printf("\n");
if (!is_nearest(p, a, d, n)) printf("Woops. Not nearest.\n");
return 0;
}
// A linear solver from rosettacode (with bug fix: added a missing fabs())
#define mat_elem(a, y, x, n) (a + ((y) * (n) + (x)))
void swap_row(double *a, double *b, int r1, int r2, int n)
{
double tmp, *p1, *p2;
int i;
if (r1 == r2) return;
for (i = 0; i < n; i++) {
p1 = mat_elem(a, r1, i, n);
p2 = mat_elem(a, r2, i, n);
tmp = *p1, *p1 = *p2, *p2 = tmp;
}
tmp = b[r1], b[r1] = b[r2], b[r2] = tmp;
}
void solve(double *a, double *b, double *x, int n)
{
#define A(y, x) (*mat_elem(a, y, x, n))
int i, j, col, row, max_row, dia;
double max, tmp;
for (dia = 0; dia < n; dia++) {
max_row = dia, max = fabs(A(dia, dia));
for (row = dia + 1; row < n; row++)
if ((tmp = fabs(A(row, dia))) > max) max_row = row, max = tmp;
swap_row(a, b, dia, max_row, n);
for (row = dia + 1; row < n; row++) {
tmp = A(row, dia) / A(dia, dia);
for (col = dia+1; col < n; col++)
A(row, col) -= tmp * A(dia, col);
A(row, dia) = 0;
b[row] -= tmp * b[dia];
}
}
for (row = n - 1; row >= 0; row--) {
tmp = b[row];
for (j = n - 1; j > row; j--) tmp -= x[j] * A(row, j);
x[row] = tmp / A(row, row);
}
#undef A
}
This isn't extensively tested, but seems to be working fine.
Let base point of line is p and unit direction vector is d.
Then distance from point v to this line might be calculated using cross product
SquaredDist = ((v - p) x d)^2
Using Maple packet symbolic calculation, we can get
d := <dx, dy, dz>;
v := <vx, vy, vz>;
p := <px, py, pz>;
w := v - p;
cp := CrossProduct(d, w);
nrm := BilinearForm(cp, cp, conjugate=false); //squared dist
nr := expand(nrm);
//now partial derivatives
nrx := diff(nr, vx);
//results:
nrx := -2*dz^2*px-2*dy^2*px+2*dz^2*vx+2*dy^2*vx
+2*dx*py*dy-2*dx*vy*dy+2*dz*dx*pz-2*dz*dx*vz
nry := -2*dx^2*py-2*dz^2*py-2*dy*vz*dz+2*dx^2*vy
+2*dz^2*vy+2*dy*pz*dz+2*dx*dy*px-2*dx*dy*vx
nrz := -2*dy^2*pz+2*dy^2*vz-2*dy*dz*vy+2*dx^2*vz
-2*dx^2*pz-2*dz*vx*dx+2*dy*dz*py+2*dz*px*dx
To minimize sum of squared distances, we have to make system of linear equations for zero partial derivatives like this:
vx*2*(Sum(dz^2)+Sum(dy^2)) + vy * (-2*Sum(dx*dy)) + vz *(-2*Sum(dz*dx)) =
2*Sum(dz^2*px)-2*Sum(dy^2*px) -2*Sum(dx*py*dy)-2*Sum(dz*dx*pz)
where
Sum(dz^2) = Sum{over all i in line indexes} {dz[i] * dz[i]}
and solve it for unknowns vx, vy, vz
Edit: Old erroneous answer for planes instead of lines, left for reference
If we use general equation of line
A * x + B * y + C * z + D = 0
then distance from point (x, y, z) to this line is
Dist = Abs(A * x + B * y + C * z + D) / Sqrt(A^2 + B^2 + C^2)
To simplify - just normalize all line equations dividing by Norm's
Norm = Sqrt(A^2 + B^2 + C^2)
a = A / Norm
b = B / Norm
c = C / Norm
d = D / Norm
now equation is
a * x + b * y + c * z + d = 0
and distance
Dist = Abs(a * x + b * y + c * z + d)
and we can use squared distances like LS method (ai, bi, ci, di are coefficients for i-th line)
F = Sum(ai*x + bi*y + ci * z + d)^2 =
Sum(ai^2*x^2 + bi^2*y^2 + ci^2*z^2 + d^2 +
2 * (ai*bi*x*y + ai*ci*x*z + bi*y*ci*z + ai*x*di + bi*y*di + ci*z*di))
partial derivatives
dF/dx = 2*Sum(ai^2*x + ai*bi*y + ai*ci*z + ai*di) = 0
dF/dy = 2*Sum(bi^2*y + ai*bi*x + bi*ci*z + bi*di) = 0
dF/dz = 2*Sum(ci^2*z + ai*ci*x + bi*ci*y + ci*di) = 0
so we have system of linear equation
x * Sum(ai^2) + y * Sum(ai*bi) + z * Sum(ai*ci)= - Sum(ai*di)
y * Sum(bi^2) + x * Sum(ai*bi) + z * Sum(bi*ci)= - Sum(bi*di)
z * Sum(ci^2) + x * Sum(ai*ci) + y * Sum(bi*ci)= - Sum(ci*di)
x * Saa + y * Sab + z * Sac = - Sad
x * Sab + y * Sbb + z * Sbc = - Sbd
x * Sac + y * Sbc + z * Scc = - Scd
where S** are corresponding sums
and can solve it for unknowns x, y, z
I needed this for a sketch in Processing, so I ported Gene's answer. Works great and thought it might save someone else a little time. Unfortunately PVector/PMatrix don't have array accessors for vectors or matrices so I had to add these as local functions.
float getv(PVector v, int i) {
if(i == 0) return v.x;
if(i == 1) return v.y;
return v.z;
}
void setv(PVector v, int i, float value) {
if (i == 0) v.x = value;
else if (i == 1) v.y = value;
else v.z = value;
}
void incv(PVector v, int i, float value) {
setv(v,i,getv(v,i) + value);
}
float getm(float[] mm, int r, int c) { return mm[c + r*4]; }
void setm(float[] mm, int r, int c, float value) { mm[c + r*4] = value; }
void incm(float[] mm, int r, int c, float value) { mm[c + r*4] += value; }
PVector findNearestPoint(PVector a[], PVector d[]) {
var mm = new float[16];
var b = new PVector();
var n = a.length;
for (int i = 0; i < n; ++i) {
var d2 = d[i].dot(d[i]);
var da = d[i].dot(a[i]);
for (int ii = 0; ii < 3; ++ii) {
for (int jj = 0; jj < 3; ++jj) {
incm(mm,ii,jj, getv(d[i],ii) * getv(d[i],jj));
}
incm(mm, ii,ii, -d2);
incv(b, ii, getv(d[i], ii) * da - getv(a[i], ii) * d2);
}
}
var p = solve(mm, new float[] {b.x, b.y, b.z});
return new PVector(p[0],p[1],p[2]);
}
// Verifier
float dist2(PVector p, PVector a, PVector d) {
PVector pa = new PVector( a.x-p.x, a.y-p.y, a.z-p.z );
float dpa = d.dot(pa);
return d.dot(d) * pa.dot(pa) - dpa * dpa;
}
//double sum_dist2(VEC p, VEC a[], VEC d[], int n) {
float sum_dist2(PVector p, PVector a[], PVector d[]) {
int n = a.length;
float sum = 0;
for (int i = 0; i < n; ++i) {
sum += dist2(p, a[i], d[i]);
}
return sum;
}
// Check 26 nearby points and verify the provided one is nearest.
boolean isNearest(PVector p, PVector a[], PVector d[]) {
float min_d2 = 3.4028235E38;
int ii = 2, jj = 2, kk = 2;
final float D = 0.1f;
for (int i = -1; i <= 1; ++i)
for (int j = -1; j <= 1; ++j)
for (int k = -1; k <= 1; ++k) {
PVector pp = new PVector( p.x + D * i, p.y + D * j, p.z + D * k );
float d2 = sum_dist2(pp, a, d);
// Prefer provided point among equals.
if (d2 < min_d2 || i == 0 && j == 0 && k == 0 && d2 == min_d2) {
min_d2 = d2;
ii = i; jj = j; kk = k;
}
}
return ii == 0 && jj == 0 && kk == 0;
}
void setup() {
PVector a[] = {
new PVector(-14.2, 17, -1),
new PVector(1, 1, 1),
new PVector(2.3, 4.1, 9.8),
new PVector(1,2,3)
};
PVector d[] = {
new PVector(1.3, 1.3, -10),
new PVector(12.1, -17.2, 1.1),
new PVector(19.2, 31.8, 3.5),
new PVector(4,5,6)
};
int n = 4;
for (int i = 0; i < n; ++i)
d[i].normalize();
PVector p = findNearestPoint(a, d);
println(p);
if (!isNearest(p, a, d))
println("Woops. Not nearest.\n");
}
// From rosettacode (with bug fix: added a missing fabs())
int mat_elem(int y, int x) { return y*4+x; }
void swap_row(float[] a, float[] b, int r1, int r2, int n)
{
float tmp;
int p1, p2;
int i;
if (r1 == r2) return;
for (i = 0; i < n; i++) {
p1 = mat_elem(r1, i);
p2 = mat_elem(r2, i);
tmp = a[p1];
a[p1] = a[p2];
a[p2] = tmp;
}
tmp = b[r1];
b[r1] = b[r2];
b[r2] = tmp;
}
float[] solve(float[] a, float[] b)
{
float[] x = new float[] {0,0,0};
int n = x.length;
int i, j, col, row, max_row, dia;
float max, tmp;
for (dia = 0; dia < n; dia++) {
max_row = dia;
max = abs(getm(a, dia, dia));
for (row = dia + 1; row < n; row++) {
if ((tmp = abs(getm(a, row, dia))) > max) {
max_row = row;
max = tmp;
}
}
swap_row(a, b, dia, max_row, n);
for (row = dia + 1; row < n; row++) {
tmp = getm(a, row, dia) / getm(a, dia, dia);
for (col = dia+1; col < n; col++) {
incm(a, row, col, -tmp * getm(a, dia, col));
}
setm(a,row,dia, 0);
b[row] -= tmp * b[dia];
}
}
for (row = n - 1; row >= 0; row--) {
tmp = b[row];
for (j = n - 1; j > row; j--) {
tmp -= x[j] * getm(a, row, j);
}
x[row] = tmp / getm(a, row, row);
}
return x;
}

square Matrix transpose with CUDA

I'm trying to write the matrix transpose algorithm. I test this program with matrix size equal to 1024, the result shows that not all elements are in the right places.
Why isn't my array transposing correctly? Does anyone can help me or give me any hint? I will appreciate it. Thanks a lot!
there is the whole cpu code:
__global__ void transpose_naive (float *out, float *in, int w, int h )
{
unsigned int xIdx = blockDim.x * blockIdx.x + threadIdx.x;
unsigned int yIdx = blockDim.y * blockIdx.y + threadIdx.y;
if ( xIdx <=w && yIdx <=h ) {
unsigned int idx_in = xIdx + w * yIdx;
unsigned int idx_out = yIdx + h * xIdx;
out[idx_out] = in[idx_in];
}
}
int main()
{
int nx=1024;
int mem_size = nx*nx*sizeof(float);
int t=32;
dim3 dimGrid(((nx-1)/t) +1, ((nx-1)/t) +1);
dim3 dimBlock(t,t);
float *h_idata = (float*)malloc(mem_size);
float *h_cdata = (float*)malloc(mem_size);
float *d_idata, *d_cdata;
checkCuda(cudaMalloc(&d_idata, mem_size) );
checkCuda(cudaMalloc(&d_cdata, mem_size) );
// host
for (int j = 0; j < nx; j++)
for (int i = 0; i < nx; i++)
h_idata[j*nx + i] = j*nx + i;
// device
checkCuda(cudaMemcpy(d_idata, h_idata, mem_size, cudaMemcpyHostToDevice) );
// events for timing
cudaEvent_t startEvent, stopEvent;
checkCuda(cudaEventCreate(&startEvent) );
checkCuda(cudaEventCreate(&stopEvent) );
float ms;
checkCuda( cudaEventRecord(startEvent, 0) );
transpose_naive<<<dimGrid, dimBlock>>>(d_cdata, d_idata,nx,nx);
checkCuda(cudaEventRecord(stopEvent, 0) );
checkCuda(cudaEventSynchronize(stopEvent) );
checkCuda(cudaEventElapsedTime(&ms, startEvent, stopEvent) );
checkCuda( cudaMemcpy(h_cdata, d_cdata, mem_size, cudaMemcpyDeviceToHost) );
printf("the time %5f ", ms);
printf("\n");
savetofile(h_idata,"i.txt",nx,nx);
savetofile(h_cdata,"t.txt",nx,nx);
error_exit:
// cleanup
checkCuda(cudaEventDestroy(startEvent) );
checkCuda(cudaEventDestroy(stopEvent) );
checkCuda( cudaFree(d_cdata) );
checkCuda( cudaFree(d_idata) );
free(h_idata);
free(h_cdata);
system("pause");
}
I think there is something wrong with file output "i.txt" and "t.txt" otherwise the program looks to be correct. I have made some minor changes in your code by adding error checking and printing on the standard output stream. I am printing the last (1020 - 1024) 3 x 3 matrix to cross check the transpose. Run it on your system and verify whether the matrix transpose is correct or not?
#include "cuda_runtime.h"
#include <stdio.h>
#include <stdlib.h>
#include "device_launch_parameters.h"
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
if (code != cudaSuccess)
{
fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code),file, line);
if (abort) exit(code);
}
}
__global__ void transpose_naive(float *out, float *in, int w, int h)
{
unsigned int xIdx = blockDim.x * blockIdx.x + threadIdx.x;
unsigned int yIdx = blockDim.y * blockIdx.y + threadIdx.y;
if (xIdx <= w && yIdx <= h) {
unsigned int idx_in = xIdx + w * yIdx;
unsigned int idx_out = yIdx + h * xIdx;
out[idx_out] = in[idx_in];
}
}
int main()
{
int nx = 1024;
int mem_size = nx*nx*sizeof(float);
int t = 32;
dim3 dimGrid(((nx - 1) / t) + 1, (((nx - 1) / t) + 1));
dim3 dimBlock(t, t);
float *h_idata = (float*)malloc(mem_size);
float *h_cdata = (float*)malloc(mem_size);
float *d_idata, *d_cdata;
gpuErrchk(cudaMalloc(&d_idata, mem_size));
gpuErrchk(cudaMalloc(&d_cdata, mem_size));
// host
for (int j = 0; j < nx; j++)
for (int i = 0; i < nx; i++)
h_idata[j*nx + i] = j*nx + i;
// device
gpuErrchk(cudaMemcpy(d_idata,h_idata,mem_size,cudaMemcpyHostToDevice));
// events for timing
cudaEvent_t startEvent, stopEvent;
gpuErrchk(cudaEventCreate(&startEvent));
gpuErrchk(cudaEventCreate(&stopEvent));
float ms;
gpuErrchk(cudaEventRecord(startEvent, 0));
transpose_naive << <dimGrid, dimBlock >> >(d_cdata, d_idata, nx, nx);
gpuErrchk(cudaEventRecord(stopEvent, 0));
gpuErrchk(cudaEventSynchronize(stopEvent));
gpuErrchk(cudaEventElapsedTime(&ms, startEvent, stopEvent));
gpuErrchk(cudaMemcpy(h_cdata,d_cdata,mem_size,cudaMemcpyDeviceToHost));
printf("the time %5f ", ms);
printf("\n");
for (int i = 1020; i < 1024; i++) {
for (int j = 1020; j < 1024; j++) {
printf("%.2f ", h_idata[i*nx + j]);
}
printf("\n");
}
printf("\n");
for (int i = 1020; i < 1024; i++) {
for (int j = 1020; j < 1024; j++) {
printf("%.2f ", h_cdata[i*nx + j]);
}
printf("\n");
}
//savetofile(h_idata, "i.txt", nx, nx);
//savetofile(h_cdata, "t.txt", nx, nx);
//error_exit:
// cleanup
gpuErrchk(cudaEventDestroy(startEvent));
gpuErrchk(cudaEventDestroy(stopEvent));
gpuErrchk(cudaFree(d_cdata));
gpuErrchk(cudaFree(d_idata));
free(h_idata);
free(h_cdata);
//system("pause");
}
The only flaw in the code is the incorrect bound checks in the following line of the kernel.
if ( xIdx <=w && yIdx <=h ) {
As the indices are from 0 to w-1 and 0 to h-1 for x and y dimensions respectively, the if condition should be as follows:
if ( xIdx <w && yIdx <h ) {

I made a processing program that generates a mandelbrot set but don't know how to effectively implement a zoom method

I'm not sure if it is possible in processing but I would like to be able to zoom in on the fractal without it being extremely laggy and buggy. What I currently have is:
int maxIter = 100;
float zoom = 1;
float x0 = width/2;
float y0 = height/2;
void setup(){
size(500,300);
noStroke();
smooth();
}
void draw(){
translate(x0, y0);
scale(zoom);
for(float Py = 0; Py < height; Py++){
for(float Px = 0; Px < width; Px++){
// scale pixel coordinates to Mandelbrot scale
float w = width;
float h = height;
float xScaled = (Px * (3.5/w)) - 2.5;
float yScaled = (Py * (2/h)) - 1;
float x = 0;
float y = 0;
int iter = 0;
while( x*x + y*y < 2*2 && iter < maxIter){
float tempX = x*x - y*y + xScaled;
y = 2*x*y + yScaled;
x = tempX;
iter += 1;
}
// color pixels
color c;
c = pickColor(iter);
rect(Px, Py,1,1);
fill(c);
}
}
}
// pick color based on time pixel took to escape (number of iterations through loop)
color pickColor(int iters){
color b = color(0,0,0);
if(iters == maxIter) return b;
int l = 1;
color[] colors = new color[maxIter];
for(int i = 0; i < colors.length; i++){
switch(l){
case 1 : colors[i] = color(255,0,0); break;
case 2 : colors[i] = color(0,0,255); break;
case 3 : colors[i] = color(0,255,0); break;
}
if(l == 1 || l == 2) l++;
else if(l == 3) l = 1;
else l--;
}
return colors[iters];
}
// allow zooming in and out
void mouseWheel(MouseEvent event){
float direction = event.getCount();
if(direction < 0) zoom += .02;
if(direction > 0) zoom -= .02;
}
// allow dragging back and forth to change view
void mouseDragged(){
x0+= mouseX-pmouseX;
y0+= mouseY-pmouseY;
}
but it doesn't work very well. It works alright at the size and max iteration I have it set to now (but still not well) and is completely unusable at larger sizes or higher maximum iterations.
The G4P library has an example that does exactly this. Download the library and go to the G4P_MandelBrot example. The example can be found online here.
Hope this helps!

Unordinary performance gap between OpenCL and CUDA

I have coded a simple tiled matrix multiplication in CUDA. It's like this:
__global__ void matrixMultiplyShared(float * A, float * B, float * C,
int numARows, int numAColumns,
int numBRows, int numBColumns,
int numCRows, int numCColumns) {
__shared__ float ds_A[TILE_WIDTH][TILE_WIDTH];
__shared__ float ds_B[TILE_WIDTH][TILE_WIDTH];
int bx = blockIdx.x; int by = blockIdx.y;
int tx = threadIdx.x; int ty = threadIdx.y;
int row = by * TILE_WIDTH + ty;
int col = bx * TILE_WIDTH + tx;
float Cvalue = 0.0;
// Loop over the M and N tiles required to compute the Pd element
for (int m = 0; m < (numAColumns-1)/TILE_WIDTH+1; ++m) {
if(row<numARows && m*TILE_WIDTH+tx < numAColumns){
ds_A[ty][tx] = A[row*numAColumns + m*TILE_WIDTH+tx];
} else {
ds_A[ty][tx] = 0;
}
if(m*TILE_WIDTH+ty < numBRows && col < numBColumns){
ds_B[ty][tx] = B[(m*TILE_WIDTH+ty)*numBColumns+col];
} else {
ds_B[ty][tx] = 0;
}
__syncthreads();
if(row < numCRows && col < numCColumns){
for (int k = 0; k < TILE_WIDTH; ++k)
Cvalue += ds_A[ty][k] * ds_B[k][tx];
}
__syncthreads();
}
if(row < numCRows && col < numCColumns)
C[row*numCColumns+col] = Cvalue;
}
After that, I used the same above kernel (with some minor changes) in the OpenCL version to compare the performance of CUDA and OpenCL together. But the result was to so far beyond my expectations. OpenCL was 6-7 times faster than CUDA. Is it valid?
The output of Nisght is as follows:
CUDA:
OpenCL:
You can see a large gap between starting the app and executing the kernel. why is that happened?
My GPU is: GTX 580 |
The Kernel Ex time (CUDA): 3.78s |
The Kernel Ex time (OpenCL): 0.53s |
CUDA Code: http://pastebin.com/VQMp3Hba
OpenCL Host Code: http://pastebin.com/cjGYSLQf
OpenCL Kernel Code: http://pastebin.com/KKw3Ayz7
You can try and insert explicit timers in the code instead of trusting the output from the tool. May be the case that the tool is wrong.

Resources