Resize 8-bit image by 2 with ARM NEON - image

I have an 8-bit 640x480 image that I would like to shrink to a 320x240 image:
void reducebytwo(uint8_t *dst, uint8_t *src)
//src is 640x480, dst is 320x240
What would be the best way to do that using ARM SIMD NEON? Any sample code somewhere?
As a starting point, I simply would like to do the equivalent of:
for (int h = 0; h < 240; h++)
    for (int w = 0; w < 320; w++)
        dst[h * 320 + w] = (src[640 * h * 2 + w * 2] + src[640 * h * 2 + w * 2 + 1] + src[640 * h * 2 + 640 + w * 2] + src[640 * h * 2 + 640 + w * 2 + 1]) / 4;

This is a one-to-one translation of your code to ARM NEON intrinsics:
#include <arm_neon.h>
#include <stdint.h>
static void resize_line (uint8_t * __restrict src1, uint8_t * __restrict src2, uint8_t * __restrict dest)
{
  int i;
  for (i=0; i<640; i+=16)
  {
    // load upper line and add neighbor pixels:
    uint16x8_t a = vpaddlq_u8 (vld1q_u8 (src1));
    // load lower line and add neighbor pixels:
    uint16x8_t b = vpaddlq_u8 (vld1q_u8 (src2));
    // sum of upper and lower line:
    uint16x8_t c = vaddq_u16 (a,b);
    // divide by 4, convert to char and store:
    vst1_u8 (dest, vshrn_n_u16 (c, 2));
    // move pointers to next chunk of data
    src1+=16;
    src2+=16;
    dest+=8;
  }
}
void resize_image (uint8_t * src, uint8_t * dest)
{
  int h;
  // 240 output rows, each built from two consecutive source rows
  for (h = 0; h < 240; h++)
  {
    resize_line (src+640*(h*2+0),
                 src+640*(h*2+1),
                 dest+320*h);
  }
}
It processes 32 source-pixels and generates 8 output pixels per iteration.
I did a quick look at the assembler output and it looks okay. You can get better performance if you write the resize_line function in assembler, unroll the loop and eliminate pipeline stalls. That would give you an estimated factor of three performance boost.
It should be a lot faster than your implementation without assembler changes though.
Note: I haven't tested the code...

If you're not too concerned with precision then this inner loop should give you twice the compute throughput compared to the more accurate algorithm:
for (i=0; i<640; i+= 32)
{
    uint8x16x2_t a, b;
    uint8x16_t c, d;
    /* load upper row, splitting even and odd pixels into a.val[0]
     * and a.val[1] respectively. */
    a = vld2q_u8(src1);
    /* as above, but for lower row */
    b = vld2q_u8(src2);
    /* compute average of even and odd pixel pairs for upper row */
    c = vrhaddq_u8(a.val[0], a.val[1]);
    /* compute average of even and odd pixel pairs for lower row */
    d = vrhaddq_u8(b.val[0], b.val[1]);
    /* compute average of upper and lower rows, and store result */
    vst1q_u8(dest, vrhaddq_u8(c, d));
    /* advance to the next 32 source pixels and 16 output pixels */
    src1 += 32;
    src2 += 32;
    dest += 16;
}
It works by using the vrhadd (rounding halving add) operation, which has a result the same size as the input. This way you don't have to shift the final sum back down to 8-bit, and all of the arithmetic throughout is eight-bit, which means you can perform twice as many operations per instruction.
However it is less accurate, because the intermediate sum is quantised, and GCC 4.7 does a terrible job of generating code. GCC 4.8 does just fine.
The whole operation has a good chance of being I/O bound, though. The loop should be unrolled to maximise separation between loads and arithmetic, and __builtin_prefetch() (or PLD) should be used to hoist the incoming data into caches before it's needed.
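To make the accuracy trade-off concrete, here is a scalar C sketch that models both variants for a single 2x2 block. It is not NEON code; the helper names (avg4_exact, rhadd_u8, avg4_halving, max_abs_diff) are made up for illustration, and rhadd_u8 assumes the usual (x + y + 1) >> 1 definition of a rounding halving add.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Exact variant: widen, sum four pixels, shift right by 2 (truncates),
 * mirroring vpaddl + vadd + vshrn #2. */
static uint8_t avg4_exact(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return (uint8_t)(((unsigned)a + b + c + d) >> 2);
}

/* Scalar model of one vrhadd.u8 lane: (x + y + 1) >> 1. */
static uint8_t rhadd_u8(uint8_t x, uint8_t y)
{
    return (uint8_t)(((unsigned)x + y + 1u) >> 1);
}

/* 8-bit-only variant: average each row pair, then average the two results. */
static uint8_t avg4_halving(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return rhadd_u8(rhadd_u8(a, b), rhadd_u8(c, d));
}

/* Largest absolute difference between the two variants over a sample of
 * inputs (illustrative helper; samples pixel values with stride `step`). */
static int max_abs_diff(int step)
{
    assert(step > 0);
    int worst = 0;
    for (int a = 0; a < 256; a += step)
        for (int b = 0; b < 256; b += step)
            for (int c = 0; c < 256; c += step)
                for (int d = 0; d < 256; d += step) {
                    int h = avg4_halving((uint8_t)a, (uint8_t)b, (uint8_t)c, (uint8_t)d);
                    int e = avg4_exact((uint8_t)a, (uint8_t)b, (uint8_t)c, (uint8_t)d);
                    int diff = abs(h - e);
                    if (diff > worst) worst = diff;
                }
    return worst;
}
```

For the block (0, 0, 0, 1) the exact variant yields 0 while the halving variant yields 1, so the two results can differ by one grey level because of the rounding at each stage.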

Here is the asm version of resize_line that @Nils Pipenbrinck suggested:
static void reduce2_neon_line(uint8_t* __restrict src1, uint8_t* __restrict src2, uint8_t* __restrict dest, int width) {
    for(int i=0; i<width; i+=16) {
        asm (
            "pld [%[line1], #0xc00] \n"
            "pld [%[line2], #0xc00] \n"
            "vldm %[line1]!, {d0,d1} \n"
            "vldm %[line2]!, {d2,d3} \n"
            "vpaddl.u8 q0, q0 \n"
            "vpaddl.u8 q1, q1 \n"
            "vadd.u16 q0, q1 \n"
            "vshrn.u16 d0, q0, #2 \n"
            "vst1.8 {d0}, [%[dst]]! \n"
            // the pointers are post-incremented by the asm, so they are read-write ("+r") operands
            : [line1] "+r"(src1), [line2] "+r"(src2), [dst] "+r"(dest)
            :
            : "q0", "q1", "memory"
        );
    }
}
It is about 4 times faster than the C version (tested on iPhone 5).


Understanding CUDA indexing

I inherited some CUDA code that I need to work on but some of the indexing done in it is confusing me.
A simple example would be the normalisation of data. Say we have a shared array A[2*N] which is a matrix of shape 2xN which has been unrolled to an array. Then we have the normalisation means and standard deviation: norm_means[2] and norm_stds[2]. The goal is to normalise the data in A in parallel. A minimal example would be:
__global__ void normalise(float *data, float *norm, float *std) {
int tdy = threadIdx.y;
for (int i=tdy; i<D; i+=blockDim.y)
data[i] = data[i] * norm[i] + std[i];
int main(int argc, char **argv) {
// generate data
int N=100;
int D=2;
MatrixXd A = MatrixXd::Random(N*D,1);
MatrixXd norm_means = MatrixXd::Random(D,1);
MatrixXd norm_stds = MatrixXd::Random(D,1);
// transfer data to device
float* A_d;
float* norm_means_d;
float* norm_stds_d;
cudaMalloc((void **)&A_d, N * D * sizeof(float));
cudaMalloc((void **)&norm_means_d, D * sizeof(float));
cudaMalloc((void **)&norm_stds_d, D * sizeof(float));
cudaMemcpy(A_d,, D * N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(norm_means_d,, D * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(norm_stds_d,, D * sizeof(float), cudaMemcpyHostToDevice);
// Setup execution
const int BLOCKSIZE_X = 8;
const int BLOCKSIZE_Y = 16;
const int GRIDSIZE_X = (N-1)/BLOCKSIZE_X + 1;
dim3 dimBlock(BLOCKSIZE_X, BLOCKSIZE_Y, 1);
dim3 dimGrid(GRIDSIZE_X, 1, 1);
normalise<<<dimGrid, dimBlock, 0>>>(A_d, norm_means_d, norm_stds_d);
Note that I am using Eigen for the matrix generation. I have omitted the includes for brevity.
The code above somehow works and achieves the desired results. However, the CUDA kernel does not make any sense to me, because the for loop should stop after one execution as i>D after the first iteration... but it doesn't?
If I change the kernel to something that makes more sense to me, e.g.
__global__ void normalise(float *data, float *norm, float *std) {
int tdy = threadIdx.y;
for (int i=0; i<D; i++)
data[tdy + i*blockDim.y] = data[tdy + i*blockDim.y] * norm[i] + std[i];
the program stops working and just outputs gibberish data.
Can somebody explain why I get this behaviour?
PS. I am very new to CUDA
It is indeed senseless to have a 2-dimensional kernel to perform an elementwise operation on an array. There is also no reason to work in blocks of size 8x16. But your modified kernel uses the second dimension (y) only; that's probably why it doesn't work. You probably needed to use the first dimension (x) only.
However - it would be reasonable to use the Y dimension for the actual second dimension, e.g. something like this:
__global__ void normalize(
    float * __restrict data,
    const float * __restrict norm,
    const float * __restrict std)
{
    auto pos = threadIdx.x + blockDim.x * blockIdx.x;
    auto d = threadIdx.y + blockDim.y * blockIdx.y; // or maybe just threadIdx.y;
    data[pos + d * N] = data[pos + d * N] * norm[d] + std[d];
}
Other points to consider:
I added __restrict to your pointers. Always do that when relevant; here's why.
It's a good idea to have a single thread to work on more than one element of data - but you should make that happen in the longer dimension, where the thread can reuse its norm and std values rather than read them from memory every time.
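As for why the loop in the first kernel does not "stop too early": `for (i = tdy; i < D; i += blockDim.y)` is a strided loop, where thread tdy handles elements tdy, tdy + blockDim.y, and so on. With D = 2 and blockDim.y = 16, threads 0 and 1 each run one iteration and the other 14 threads run none. The plain C sketch below models this on the CPU for a single column of threads (fixed threadIdx.x, one block); model_strided_loop and visits are illustrative names, not part of the original code.

```c
#include <assert.h>

/* CPU model of `for (i = tdy; i < D; i += blockDim.y)` for one column of
 * threads. visits[i] counts how many modelled threads touch element i. */
static void model_strided_loop(int block_dim_y, int D, int *visits)
{
    assert(block_dim_y > 0 && D >= 0);
    for (int i = 0; i < D; i++)
        visits[i] = 0;
    for (int tdy = 0; tdy < block_dim_y; tdy++)     /* one pass per "thread" */
        for (int i = tdy; i < D; i += block_dim_y)  /* that thread's loop */
            visits[i]++;
}
```

Every index below D is visited exactly once per column of threads, whether D is smaller or larger than blockDim.y, which is why the loop bound looks odd but still covers the data.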

Mandelbrot optimization in openmp

I have to parallelize the Mandelbrot program in C. I think I have done it well, and I can't get better times. My question is whether someone has an idea to improve the code; I've been thinking perhaps of nested parallel regions between the outer and inner for loops...
I am also unsure whether it is more elegant or recommended to put all the pragmas on a single line or to write separate pragmas (one for omp parallel with the shared and private variables and a conditional, and another with omp for and schedule dynamic).
I also wonder whether constants can be used as private variables, because I think it is cleaner to have constants instead of defined variables.
I have also written a conditional (if numcpu > 1): with a single processor it makes no sense to use a parallel region, so the code runs sequentially.
Finally, as I have read, the dynamic chunk size depends on the hardware and system configuration, so I have left it as a constant that can easily be changed.
I also adapt the number of threads to the number of processors available.
int main(int argc, char *argv[])
int xactual, yactual;
//each iteration, it calculates: newz = oldz*oldz + p, where p is the current pixel, and oldz starts at the origin
double pr, pi; //real and imaginary part of the pixel p
double newRe, newIm, oldRe, oldIm; //real and imaginary parts of new and old z
double zoom = 1, moveX = -0.5, moveY = 0; //you can change these to zoom and change position
pixel_t *pixels = malloc(sizeof(pixel_t)*IMAGEHEIGHT*IMAGEWIDTH);
clock_t begin, end;
double time_spent;
int numcpu;
numcpu = omp_get_num_procs();
//FILE * fp;
printf("El número de procesadores que utilizaremos es: %d", numcpu);
#pragma omp parallel shared(pixels, moveX, moveY, zoom) private(xactual, yactual, pr, pi, newRe, newIm, oldRe, oldIm) if(numcpu>1)
//int xactual=0;
// int yactual=0;
#pragma omp for schedule(dynamic, CHUNK)
//loop through every pixel
for(yactual = 0; yactual < IMAGEHEIGHT; yactual++)
for(xactual = 0; xactual < IMAGEWIDTH; xactual++)
//calculate the initial real and imaginary part of z, based on the pixel location and zoom and position values
pr = 1.5 * (xactual - IMAGEWIDTH / 2) / (0.5 * zoom * IMAGEWIDTH) + moveX;
pi = (yactual - IMAGEHEIGHT / 2) / (0.5 * zoom * IMAGEHEIGHT) + moveY;
newRe = newIm = oldRe = oldIm = 0; //these should start at 0,0
//"i" will represent the number of iterations
int i;
//start the iteration process
for(i = 0; i < ITERATIONS; i++)
//remember value of previous iteration
oldRe = newRe;
oldIm = newIm;
//the actual iteration, the real and imaginary part are calculated
newRe = oldRe * oldRe - oldIm * oldIm + pr;
newIm = 2 * oldRe * oldIm + pi;
//if the point is outside the circle with radius 2: stop
if((newRe * newRe + newIm * newIm) > 4) break;
// color(i % 256, 255, 255 * (i < maxIterations));
//color(0, 0, 0); // black
pixels[yactual*IMAGEWIDTH+xactual][0] = 0;
pixels[yactual*IMAGEWIDTH+xactual][1] = 0;
pixels[yactual*IMAGEWIDTH+xactual][2] = 0;
double z = sqrt(newRe * newRe + newIm * newIm);
int brightness = 256 * log2(1.75 + i - log2(log2(z))) / log2((double)ITERATIONS);
//color(brightness, brightness, 255)
pixels[yactual*IMAGEWIDTH+xactual][0] = brightness;
pixels[yactual*IMAGEWIDTH+xactual][1] = brightness;
pixels[yactual*IMAGEWIDTH+xactual][2] = 255;
} //end of parallel region
end= clock();
time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
fprintf(stderr, "Elapsed time: %.2lf seconds.\n", time_spent);
You could extend the implementation to leverage SIMD extensions. As far as I know the latest OpenMP standard includes vector constructs. Check out this article that describes the new capabilities.
This whitepaper explains how SSE3 can be used when calculating the Mandelbrot set.
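On the pragma layout and private-variable questions, one option is to declare the per-pixel variables inside the loop body, which makes them private to each thread automatically and lets a single combined `parallel for` pragma suffice. The C sketch below assumes the same zoom = 1, moveX = -0.5, moveY = 0 mapping as the question; mandel_iters and render_iters are illustrative names, and it stores iteration counts rather than colours.

```c
#include <assert.h>

/* Escape-time count for the point (pr, pi); returns max_iter if the
 * orbit never leaves the circle of radius 2. */
static int mandel_iters(double pr, double pi, int max_iter)
{
    double re = 0.0, im = 0.0;
    int i;
    for (i = 0; i < max_iter; i++) {
        double new_re = re * re - im * im + pr;
        double new_im = 2.0 * re * im + pi;
        re = new_re;
        im = new_im;
        if (re * re + im * im > 4.0)
            break;
    }
    return i;
}

/* Fills out[y * width + x] with the iteration count of each pixel. */
static void render_iters(int width, int height, int max_iter, int *out)
{
    assert(width > 0 && height > 0 && out != 0);
#ifdef _OPENMP
    #pragma omp parallel for schedule(dynamic)
#endif
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            /* declared inside the loop, so each thread has its own copy */
            double pr = 1.5 * (x - width / 2) / (0.5 * width) - 0.5;
            double pi = (y - height / 2) / (0.5 * height);
            out[y * width + x] = mandel_iters(pr, pi, max_iter);
        }
    }
}
```

Rows are handed out dynamically because rows crossing the set take far longer than rows outside it; whether a chunk size other than the default helps is something to measure on the target machine.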

Optimising Matrix Multiplication OpenCL - Purpose: learn how to manage memory

I'm new to OpenCL and trying to understand how to optimise matrix multiplication to become familiar with the various paradigms. Here's the current code.
I'm multiplying matrices A and B. I allocate a row of A in private memory to start with (because each work item uses it), and a column of B in local memory (because each work group uses it).
1) The code is currently incorrect. I'm struggling with how to use the local work IDs correctly and can't find my mistake. I'm basing this on a set of tutorial slides (slide 27), but they seem wrong to me, as they don't make use of loc_size in their inner loop.
2) Are there any other optimisations you would suggest for this code?
__kernel void mmul(
__global int* C,
__global int* A,
__global int* B,
const int rA,
const int rB,
const int cC,
__local char* local_mem)
int k,ty;
int tx = get_global_id(0);
int loctx = get_local_id(0);
int loc_size = get_local_size(0);
int value = 0 ;
int tmp_array[1000];
for(k=0;k<rB;k++) {
tmp_array[k] = A[tx * cA + k] ;
for (ty=0 ; ty < cC ; ty++) {
for (k = loctx ; k < rB ; k+=loc_size) {
local_mem[k] = B[ty + k * cC] ;
value = 0 ;
for(k=0;k<rB;k+=1) {
int i = loctx + k*loc_size;
value += tmp_array[k] * local_mem[i];
C[ty + (tx * cC)] = value;
where I set the global and local work items as follows
const size_t globalWorkItems[1] = {result_row};
const size_t localWorkItems[1] = {(size_t)local_wi_size};
local_wi_size is result_row/number of compute units (such that result_row % compute units == 0)
Your code is pretty close, but the indexing into the local memory array is actually simpler than you think. You have a row in private memory and a column in local memory, and you need to compute the dot product of these two vectors. You just need to sum row[k]*col[k], for k = 0 up to N-1:
for(k=0;k<rB;k+=1) {
    value += tmp_array[k] * local_mem[k];
}
There's actually a second, more subtle bug that is also present in the example solution given on the slides you are using. Since you are reading and writing local memory inside a loop, you actually need two barriers, in order to make sure that work-items writing to local memory on iteration i don't overwrite values that are being read by other work-items executing iteration i-1.
Therefore, the full code for your kernel (tested and working), should look something like this:
__kernel void mmul(
    __global int* C,
    __global int* A,
    __global int* B,
    const int rA,
    const int rB,
    const int cC,
    __local char* local_mem)
{
    int k,ty;
    int tx = get_global_id(0);
    int loctx = get_local_id(0);
    int loc_size = get_local_size(0);
    int value = 0;
    int tmp_array[1000];
    for(k=0;k<rB;k++) {
        tmp_array[k] = A[tx * cA + k] ;
    }
    for (ty=0 ; ty < cC ; ty++) {
        for (k = loctx ; k < rB ; k+=loc_size) {
            local_mem[k] = B[ty + k * cC];
        }
        barrier(CLK_LOCAL_MEM_FENCE); // First barrier to ensure writes have finished
        value = 0;
        for(k=0;k<rB;k+=1) {
            value += tmp_array[k] * local_mem[k];
        }
        C[ty + (tx * cC)] = value;
        barrier(CLK_LOCAL_MEM_FENCE); // Second barrier to ensure reads have finished
    }
}
You can find the full set of exercises and solutions that go with the slides you are looking at on the HandsOnOpenCL GitHub page. There's also a more complete set of slides from the same tutorial available here, which go on to show a much more optimised matrix multiply example that uses a blocking approach to better exploit temporal and spatial locality. The aforementioned missing barrier bug has been fixed in the example solution code, but not on the slides (yet).
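When debugging indexing like this, it can help to compare the kernel's output against a sequential host-side reference that uses the same row-major layout. The C sketch below is one such reference; mmul_ref is an illustrative name, and it assumes A has rB columns (that is, the kernel's cA equals rB).

```c
#include <assert.h>

/* Sequential reference: C (rA x cC) = A (rA x rB) * B (rB x cC),
 * all stored row-major, using the same index expressions as the kernel. */
static void mmul_ref(int *C, const int *A, const int *B, int rA, int rB, int cC)
{
    assert(rA > 0 && rB > 0 && cC > 0);
    for (int tx = 0; tx < rA; tx++) {          /* one "work-item" per row of A */
        for (int ty = 0; ty < cC; ty++) {      /* one column of B at a time */
            int value = 0;
            for (int k = 0; k < rB; k++)
                value += A[tx * rB + k] * B[ty + k * cC];
            C[ty + tx * cC] = value;
        }
    }
}
```

Running both versions on a small matrix and comparing element by element usually pinpoints which index expression is off.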

Fast sigmoid algorithm

The sigmoid function is defined as
f(x) = 1 / (1 + exp(-x))
I found that using the C built-in function exp() to calculate the value of f(x) is slow. Is there any faster algorithm to calculate the value of f(x)?
You don't have to use the actual, exact sigmoid function in a neural network algorithm; you can replace it with an approximated version that has similar properties but is faster to compute.
For example, you can use the "fast sigmoid" function
f(x) = x / (1 + abs(x))
Using first terms of the series expansion for exp(x) won't help too much if the arguments to f(x) are not near zero, and you have the same problem with a series expansion of the sigmoid function if the arguments are "large".
An alternative is to use table lookup. That is, you precalculate the values of the sigmoid function for a given number of data points, and then do fast (linear) interpolation between them if you want.
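The "fast sigmoid" x / (1 + abs(x)) mentioned above can be sketched in C as follows. The function names are illustrative; the second helper assumes you want outputs in (0, 1) like the logistic function rather than (-1, 1), and the array version is only a convenience wrapper.

```c
#include <assert.h>

/* x / (1 + |x|): output in (-1, 1), S-shaped like tanh. */
static float fast_sigmoid_abs(float x)
{
    float ax = x < 0.0f ? -x : x;
    return x / (1.0f + ax);
}

/* Rescaled to (0, 1) so it can stand in for 1 / (1 + exp(-x)). */
static float fast_sigmoid_abs01(float x)
{
    return 0.5f * fast_sigmoid_abs(x) + 0.5f;
}

/* Applies the rescaled approximation to n values. */
static void fast_sigmoid_abs01_array(const float *in, float *out, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        out[i] = fast_sigmoid_abs01(in[i]);
}
```

Note that this curve approaches its limits much more slowly than the true sigmoid, so it is a replacement with a similar shape rather than a close numerical approximation.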
It's best to measure on your hardware first. A quick benchmark script shows that on my machine x/(1+|x|) is the fastest, with tanh(x) a close second. The error function erf is pretty fast too.
% gcc -Wall -O2 -lm -o sigmoid-bench{,.c} -std=c99 && ./sigmoid-bench
atan(pi*x/2)*2/pi   24.1 ns
atan(x)             23.0 ns
1/(1+exp(-x))       20.4 ns
1/sqrt(1+x^2)       13.4 ns
erf(sqrt(pi)*x/2)    6.7 ns
tanh(x)              5.5 ns
x/(1+|x|)            5.5 ns
I expect that the results may vary depending on architecture and the compiler used, but erf(x) (since C99), tanh(x) and x/(1.0+fabs(x)) are likely to be the fast performers.
People here are mostly concerned about how fast one function is relative to another and create micro benchmark to see whether f1(x) runs 0.0001 ms faster than f2(x). The big problem is that this is mostly irrelevant, because what matters is how fast your network learns with your activation function trying to minimize your cost function.
As of current theory, rectifier function and softplus compared to sigmoid function or similar activation functions, allow for faster and effective training of deep neural architectures on large and complex datasets.
So I suggest throwing away micro-optimization and looking at which function allows faster learning (also looking at various other cost functions).
To make the NN more flexible, an alpha rate is usually used to change the slope of the graph around 0.
The sigmoid function looks like:
f(x) = 1 / ( 1+exp(-x*alpha))
A nearly equivalent (but much faster) function is:
f(x) = 0.5 * (x * alpha / (1 + abs(x*alpha))) + 0.5
You can check the graphs here
When I use the abs function, the network becomes 100+ times faster.
This answer probably isn't relevant for most cases, but just wanted to throw out there that for CUDA computing I've found x/sqrt(1+x^2) to be the fastest function by far.
For example, done with single precision float intrinsics:
__device__ void fooCudaKernel(/* some arguments */) {
float foo, sigmoid;
// some code defining foo
sigmoid = __fmul_rz(rsqrtf(__fmaf_rz(foo,foo,1)),foo);
You can also use a rough version of the sigmoid (it differs from the original by no more than 0.2%):
#include <cmath>
#include <cstddef>
inline float RoughSigmoid(float value)
{
    float x = std::fabs(value); // std::fabs avoids accidentally picking the integer abs()
    float x2 = x*x;
    float e = 1.0f + x + x2*0.555f + x2*x2*0.143f;
    return 1.0f / (1.0f + (value > 0 ? 1.0f / e : e));
}
void RoughSigmoid(const float * src, size_t size, const float * slope, float * dst)
{
    float s = slope[0];
    for (size_t i = 0; i < size; ++i)
        dst[i] = RoughSigmoid(src[i] * s);
}
Optimization of RoughSigmoid function with using SSE:
#include <xmmintrin.h>
void RoughSigmoid(const float * src, size_t size, const float * slope, float * dst)
size_t alignedSize = size/4*4;
__m128 _slope = _mm_set1_ps(*slope);
__m128 _0 = _mm_set1_ps(-0.0f);
__m128 _1 = _mm_set1_ps(1.0f);
__m128 _0555 = _mm_set1_ps(0.555f);
__m128 _0143 = _mm_set1_ps(0.143f);
size_t i = 0;
for (; i < alignedSize; i += 4)
__m128 _src = _mm_loadu_ps(src + i);
__m128 x = _mm_andnot_ps(_0, _mm_mul_ps(_src, _slope));
__m128 x2 = _mm_mul_ps(x, x);
__m128 x4 = _mm_mul_ps(x2, x2);
__m128 series = _mm_add_ps(_mm_add_ps(_1, x), _mm_add_ps(_mm_mul_ps(x2, _0555), _mm_mul_ps(x4, _0143)));
__m128 mask = _mm_cmpgt_ps(_src, _0);
__m128 exp = _mm_or_ps(_mm_and_ps(_mm_rcp_ps(series), mask), _mm_andnot_ps(mask, series));
__m128 sigmoid = _mm_rcp_ps(_mm_add_ps(_1, exp));
_mm_storeu_ps(dst + i, sigmoid);
for (; i < size; ++i)
dst[i] = RoughSigmoid(src[i] * slope[0]);
Optimization of RoughSigmoid function with using AVX:
#include <immintrin.h>
void RoughSigmoid(const float * src, size_t size, const float * slope, float * dst)
size_t alignedSize = size/8*8;
__m256 _slope = _mm256_set1_ps(*slope);
__m256 _0 = _mm256_set1_ps(-0.0f);
__m256 _1 = _mm256_set1_ps(1.0f);
__m256 _0555 = _mm256_set1_ps(0.555f);
__m256 _0143 = _mm256_set1_ps(0.143f);
size_t i = 0;
for (; i < alignedSize; i += 8)
__m256 _src = _mm256_loadu_ps(src + i);
__m256 x = _mm256_andnot_ps(_0, _mm256_mul_ps(_src, _slope));
__m256 x2 = _mm256_mul_ps(x, x);
__m256 x4 = _mm256_mul_ps(x2, x2);
__m256 series = _mm256_add_ps(_mm256_add_ps(_1, x), _mm256_add_ps(_mm256_mul_ps(x2, _0555), _mm256_mul_ps(x4, _0143)));
__m256 mask = _mm256_cmp_ps(_src, _0, _CMP_GT_OS);
__m256 exp = _mm256_or_ps(_mm256_and_ps(_mm256_rcp_ps(series), mask), _mm256_andnot_ps(mask, series));
__m256 sigmoid = _mm256_rcp_ps(_mm256_add_ps(_1, exp));
_mm256_storeu_ps(dst + i, sigmoid);
for (; i < size; ++i)
dst[i] = RoughSigmoid(src[i] * slope[0]);
Code is based on a C# version previously posted by @jenkas with minor modifications.
The following C++ code provides excellent precision that outperforms low-precision approximations by virtue of the fact that it allows compilers to auto-vectorize compiled code onto SIMD instructions when used in simple loops.
GCC will compile code to SIMD (Arm Neon, or Intel AVX) instructions that perform four sigmoid (or tanh) computations in parallel. Auto-vectorization yields performance that is comparable to even very low-precision optimizations while maintaining essentially full precision. Microsoft and Intel compilers also perform auto-vectorization.
A brief discussion of auto-vectorization, compiler optimizations, and practices that produce optimal performance is provided near the end of this post.
The following functions provide a maximum error of +/- 6.55651e-07 over full range as compared to 1/(1+exp(-v)).
// Returns float approximation of 1/(1+exp(-v))
inline float fast_sigmoid(float v)
constexpr float c1 = 0.03138777F;
constexpr float c2 = 0.276281267F;
constexpr float c_log2f = 1.442695022F;
v *= c_log2f*0.5;
int intPart = (int)v;
float x = (v - intPart);
float xx = x * x;
float v1 = c_log2f + c2 * xx;
float v2 = x + xx * c1 * x;
float v3 = (v2 + v1);
*((int*)&v3) += intPart << 24;
float v4 = v2 - v1;
float res = v3 / (v3 - v4); //for tanh change to (v3 + v4)/ (v3 - v4)
return res;
// Returns float approximation tanh(v)
inline float fast_tanh(float v)
const float c1 = 0.03138777F;
const float c2 = 0.276281267F;
const float c_log2f = 1.442695022F;
v *= c_log2f;
int intPart = (int)v;
float x = (v - intPart);
float xx = x * x;
float v1 = c_log2f + c2 * xx;
float v2 = x + xx * c1 * x;
float v3 = (v2 + v1);
*((int*)&v3) += intPart << 24;
float v4 = v2 - v1;
float res = (v3+v4) / (v3 - v4);
return res;
Benchmark results on Raspberry Pi 4 (AArch64):
-- Sigmoid benchmark --------
fast_sigmoid(x)                  5.63 ns
fast_tanh(x)                     5.89 ns
Vectorized fast_sigmoid(out,in,count) using Neon intrinsics
                                 5.79 ns
atan(pi/2 * x)/(pi/2)           27.29 ns
atan(x)                         24.13 ns
1/(1+exp(-x))                   14.92 ns
1/sqrt(1+x^2)                    4.26 ns
erf(sqrt(pi)/2 * x)             20.62 ns
tanh(x)                         20.64 ns
x/(1+|x|)                        8.93 ns
x (measures loop overhead)       1.62 ns
x*x (for reference)              1.62 ns
1/(1+x) (for reference)          2.64 ns
Raspberry Pi 4, aarch64 Arm Cortex-A72 @ 1.8GHz. GCC 10.2.1
In the benchmark, GCC vectorizes the fast_sigmoid call into ARM Neon instructions allowing four values to be calculated in parallel.
For optimal performance, you should ensure that input vectors are aligned on 64-byte boundaries. AVX and Neon instructions both allow for unaligned access, but do so with a mild performance penalty.
In addition, you should inform the compiler that input vectors do not alias by using a restrict qualifier. The restrict keyword is defined in the C99 standard, but is not standard C++. Fortunately, all major C++ compilers (Intel, Microsoft, GCC, Clang) provide an equivalent as an extension, spelled __restrict (GCC and Clang also accept __restrict__). Without alias guarantees, compilers will generate a small code preamble that tests for aliasing at runtime, and executes a slow code-path if aliasing is detected.
To enable vectorization, GCC requires either the -ftree-vectorize option, or -O3 (which includes -ftree-vectorize).
Loops are vectorized as long as there are no operations that prevent vectorization. Including a call to a math intrinsic (exp, sin, cos &c) will prevent loop vectorization, as will if statements within the loop. However, loop bodies can be fairly substantial. For example, in my LSTM implementation, one of the loops contains operations on four separate vector components (more operations in the loop provides more opportunity for interleaved instruction scheduling)
The __restrict qualifier in the following sample informs the compiler that no part of the input and output vectors overlap, allowing the compiler to omit the aliasing check:
void vec_sigmoid(
    int length,
    float * __restrict output,
    const float * __restrict input,
    const float * __restrict bias)
{
    for (int i = 0; i < length; ++i)
    {
        output[i] = fast_sigmoid(input[i])+bias[i];
    }
}
Code is a C++ port of @jenkas' C# code posted earlier, adjusted to return 1/(1+exp(-x)) instead of 1/(1+exp(-2*x)) which is what the original code calculates.
You can use a simple but effective method by using two formulas:
if x < 0 then f(x) = 0.5/(1+(x^2))
if x > 0 then f(x) = (-0.5/(1+(x^2)))+1
This will look like this:
Two graphs for a sigmoid {Blue: (0.5/(1+(x^2))), Yellow: (-0.5/(1+(x^2)))+1}
Try this .NET Core 5+ implementation
public static unsafe float FastSigmoid(float v)
const float c1 = 0.03138777F;
const float c2 = 0.276281267F;
const float c_log2f = 1.442695022F;
v *= c_log2f;
int intPart = (int)v;
float x = (v - intPart);
float xx = x * x;
float v1 = c_log2f + c2 * xx;
float v2 = x + xx * c1 * x;
float v3 = (v2 + v1);
*((int*)&v3) += intPart << 24;
float v4 = v2 - v1;
float res = v3 / (v3 - v4); //for tanh change to (v3 + v4)/ (v3 - v4)
return res;
Using Eureqa to search for approximations to sigmoid I found 1/(1 + 0.3678749025^x) approximates it. It's pretty close, just gets rid of one operation with the negation of x.
Some of the other functions shown here are interesting, but is the power operation really that slow? I tested it and it actually did faster than addition, but that could just be a fluke. If so it should be just as fast or faster as all the others.
EDIT:0.5 + 0.5*tanh(0.5*x) and less accurate, 0.5 + 0.5*tanh(n) also works. And you could just get rid of the constants if you don't care about getting it between the range [0,1] like sigmoid. But it assumes that tanh is faster.
The tanh function may be optimized in some languages, making it faster than a custom defined x/(1+abs(x)), such is the case in Julia.
You can also use this:
y=x / (2 * ((x<0.0)*-x+(x>=0.0)*x) + 2) + 0.5;
acts like a sigmoid now because y(1-y)=y' is more let say round than 1/(2 (1 + abs(x))^2)
acts more like to fast sigmoid;
I don't think you can do better than the built-in exp() but if you want another approach, you can use series expansion. WolframAlpha can compute it for you.

Computing the null space of a matrix as fast as possible

I need to compute the nullspace of several thousand small matrices (8x9, not 4x3 as I wrote previously) in parallel (CUDA). All references point to SVD but the algorithm in numerical recipes seems very expensive, and gives me lots of things other than the null space that I don't really need. Is Gaussian elimination really not an option? Are there any other commonly used methods?
To answer your question directly... yes! QR decomposition!
Let A be an m-by-n matrix with rank n. QR decomposition finds orthonormal m-by-m matrix Q and upper triangular m-by-n matrix R such that A = QR. If we define Q = [Q1 Q2], where Q1 is m-by-n and Q2 is m-by-(m-n), then the columns of Q2 form the null space of A^T.
QR decomposition is computed either by Gram-Schmidt, Givens rotations, or Householder reflections. They have different stability properties and operation counts.
You are right: SVD is expensive! I can't speak for what state-of-the-art stuff uses, but when I hear "compute null space" (EDIT: in a way that is simple for me to understand), I think QR.
I don't think the above proposed method always gives the whole null space. To recap: "A = QR, where Q = [Q1 Q2], and Q1 is m-by-n and Q2 is m-by-(m-n). Then the columns of Q2 form the null space of A^T."
Indeed, this may only give a subspace of the null space. Simple counter-example is when A=0, in which case the null space of A^T is the whole R^m.
Therefore, it is necessary to check R too. Based on my experience with Matlab, if a row of R is straight 0, then the corresponding column in Q should also be a basis of the null space of A^T. Clearly this observation is heuristic and hinges on the particular algorithm used for QR decomposition.
Gaussian elimination is plenty fast for 4x3 matrices. IIRC I've done about 5 million per second with Java without parallelism. With such a small problem, your best bet is to code the routine (row reduce etc.) yourself; otherwise you'll waste most of the time putting the data into the right format for the external routine.
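A hand-rolled row reduction along these lines might look like the plain C sketch below (shown on the CPU; a per-thread CUDA version would follow the same structure). It reduces the matrix to reduced row echelon form with partial pivoting and emits one basis vector per free column. The function name, the size limits and the tolerance parameter are all assumptions made for illustration, and Gaussian elimination is less robust than SVD for nearly rank-deficient matrices.

```c
#include <assert.h>

#define NS_MAX_ROWS 16
#define NS_MAX_COLS 16

static double ns_abs(double v) { return v < 0.0 ? -v : v; }

/* Reduces a (rows x cols, row-major, modified in place) to reduced row
 * echelon form, then writes one null-space basis vector per free column
 * into basis (vectors of length cols, stored back to back).
 * Returns the number of basis vectors (the nullity). */
static int null_space(double *a, int rows, int cols, double tol, double *basis)
{
    assert(rows > 0 && rows <= NS_MAX_ROWS && cols > 0 && cols <= NS_MAX_COLS);
    int pivot_col_of_row[NS_MAX_ROWS];
    int is_pivot[NS_MAX_COLS] = {0};
    int rank = 0;

    for (int c = 0; c < cols && rank < rows; c++) {
        int best = rank;                    /* largest entry in column c */
        for (int r = rank + 1; r < rows; r++)
            if (ns_abs(a[r * cols + c]) > ns_abs(a[best * cols + c]))
                best = r;
        if (ns_abs(a[best * cols + c]) <= tol)
            continue;                       /* no pivot: c is a free column */
        for (int j = 0; j < cols; j++) {    /* swap rows rank and best */
            double t = a[rank * cols + j];
            a[rank * cols + j] = a[best * cols + j];
            a[best * cols + j] = t;
        }
        double p = a[rank * cols + c];
        for (int j = 0; j < cols; j++)
            a[rank * cols + j] /= p;        /* scale the pivot to 1 */
        for (int r = 0; r < rows; r++) {    /* clear column c in other rows */
            if (r == rank) continue;
            double f = a[r * cols + c];
            for (int j = 0; j < cols; j++)
                a[r * cols + j] -= f * a[rank * cols + j];
        }
        pivot_col_of_row[rank] = c;
        is_pivot[c] = 1;
        rank++;
    }

    int nullity = 0;
    for (int fc = 0; fc < cols; fc++) {
        if (is_pivot[fc]) continue;
        double *v = basis + nullity * cols;
        for (int j = 0; j < cols; j++) v[j] = 0.0;
        v[fc] = 1.0;                        /* free variable set to 1 */
        for (int r = 0; r < rank; r++)      /* pivot variables follow from RREF */
            v[pivot_col_of_row[r]] = -a[r * cols + fc];
        nullity++;
    }
    return nullity;
}
```

The caller provides room for up to cols vectors in basis. For a full-rank 8x9 matrix the routine returns a single vector, which can be normalised afterwards if a unit vector is needed.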
In the answers above, it has already been pointed out how the null space of a matrix can be calculated by using the QR or the SVD approach. SVD should be preferred when accuracy is required, see also Null-space of a rectangular dense matrix.
As of February 2015, CUDA 7 (now in release candidate) makes SVD available through its new cuSOLVER library. Below is an example of how to use cuSOLVER's SVD to calculate the null space of a matrix.
Be aware that the problem you are focusing on concerns the calculation of several small matrices, so you should adapt the example I'm providing below by using streams to make sense for your case. To associate a stream to each task you can use
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <cusolverDn.h>
#include <cuda_runtime_api.h>
#include "Utilities.cuh"
/* MAIN */
int main(){
// --- gesvd only supports Nrows >= Ncols
// --- column major memory ordering
const int Nrows = 7;
const int Ncols = 5;
// --- cuSOLVE input/output parameters/arrays
int work_size = 0;
int *devInfo; gpuErrchk(cudaMalloc(&devInfo, sizeof(int)));
// --- CUDA solver initialization
cusolverDnHandle_t solver_handle;
cusolverDnCreate(&solver_handle);
// --- Singular values threshold
double threshold = 1e-12;
// --- Setting the host, Nrows x Ncols matrix
double *h_A = (double *)malloc(Nrows * Ncols * sizeof(double));
for(int j = 0; j < Nrows; j++)
for(int i = 0; i < Ncols; i++)
h_A[j + i*Nrows] = (i + j*j) * sqrt((double)(i + j));
// --- Setting the device matrix and moving the host matrix to the device
double *d_A; gpuErrchk(cudaMalloc(&d_A, Nrows * Ncols * sizeof(double)));
gpuErrchk(cudaMemcpy(d_A, h_A, Nrows * Ncols * sizeof(double), cudaMemcpyHostToDevice));
// --- host side SVD results space
double *h_U = (double *)malloc(Nrows * Nrows * sizeof(double));
double *h_V = (double *)malloc(Ncols * Ncols * sizeof(double));
double *h_S = (double *)malloc(min(Nrows, Ncols) * sizeof(double));
// --- device side SVD workspace and matrices
double *d_U; gpuErrchk(cudaMalloc(&d_U, Nrows * Nrows * sizeof(double)));
double *d_V; gpuErrchk(cudaMalloc(&d_V, Ncols * Ncols * sizeof(double)));
double *d_S; gpuErrchk(cudaMalloc(&d_S, min(Nrows, Ncols) * sizeof(double)));
// --- CUDA SVD initialization
cusolveSafeCall(cusolverDnDgesvd_bufferSize(solver_handle, Nrows, Ncols, &work_size));
double *work; gpuErrchk(cudaMalloc(&work, work_size * sizeof(double)));
// --- CUDA SVD execution
cusolveSafeCall(cusolverDnDgesvd(solver_handle, 'A', 'A', Nrows, Ncols, d_A, Nrows, d_S, d_U, Nrows, d_V, Ncols, work, work_size, NULL, devInfo));
int devInfo_h = 0; gpuErrchk(cudaMemcpy(&devInfo_h, devInfo, sizeof(int), cudaMemcpyDeviceToHost));
if (devInfo_h != 0) std::cout << "Unsuccessful SVD execution\n\n";
// --- Moving the results from device to host
gpuErrchk(cudaMemcpy(h_S, d_S, min(Nrows, Ncols) * sizeof(double), cudaMemcpyDeviceToHost));
gpuErrchk(cudaMemcpy(h_U, d_U, Nrows * Nrows * sizeof(double), cudaMemcpyDeviceToHost));
gpuErrchk(cudaMemcpy(h_V, d_V, Ncols * Ncols * sizeof(double), cudaMemcpyDeviceToHost));
for(int i = 0; i < min(Nrows, Ncols); i++)
std::cout << "d_S["<<i<<"] = " << std::setprecision(15) << h_S[i] << std::endl;
// --- Count the singular values above the threshold (they are returned in descending order)
int count = 0;
while (count < min(Nrows, Ncols) && h_S[count] >= threshold) count++;
printf("The null space of A has dimension %i\n\n", min(Ncols, Nrows) - count);
for(int j = count; j < Ncols; j++) {
    printf("Basis vector nr. %i\n", j - count);
    // gesvd returns V^T in column-major order, so right singular vector j is row j of h_V
    for(int i = 0; i < Ncols; i++)
        std::cout << "d_V["<<i<<"] = " << std::setprecision(15) << h_V[j + i*Ncols] << std::endl;
    printf("\n");
}
return 0;
}
extern "C" int iDivUp(int, int);
extern "C" void gpuErrchk(cudaError_t);
extern "C" void cusolveSafeCall(cusolverStatus_t);
#include <stdio.h>
#include <assert.h>
#include "cuda_runtime.h"
#include <cuda.h>
#include <cusolverDn.h>
/* iDivUp FUNCTION */
extern "C" int iDivUp(int a, int b){ return ((a % b) != 0) ? (a / b + 1) : (a / b); }
// --- Credit to
void gpuAssert(cudaError_t code, char *file, int line, bool abort=true)
if (code != cudaSuccess)
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) { exit(code); }
extern "C" void gpuErrchk(cudaError_t ans) { gpuAssert((ans), __FILE__, __LINE__); }
static const char *_cudaGetErrorEnum(cusolverStatus_t error)
switch (error)
return "<unknown>";
inline void __cusolveSafeCall(cusolverStatus_t err, const char *file, const int line)
{
    if (CUSOLVER_STATUS_SUCCESS != err) {
        fprintf(stderr, "CUSOLVE error in file '%s', line %d\nerror %d: %s\nterminating!\n", file, line, err,
                _cudaGetErrorEnum(err));
        cudaDeviceReset(); assert(0);
    }
}
extern "C" void cusolveSafeCall(cusolverStatus_t err) { __cusolveSafeCall(err, __FILE__, __LINE__); }
I think the most important thing for CUDA is to find an algorithm that doesn't depend on conditional branching (which is quite slow on graphics hardware). Simple if statements that can be optimized into conditional assignment are much better (or you can use the ?: operator).
If necessary, you should be able to do some form of pivoting using conditional assignment. It might actually be harder to determine how to store your result: if your matrix is rank-deficient, what do you want your CUDA program to do about it?
If you assume your 4x3 matrix is not actually rank-deficient, you can find your (single) null-space vector without any conditionals at all: the matrix is small enough that you can use Cramer's rule efficiently.
Actually, since you don't actually care about the scale of your null vector, you don't have to divide by the determinant -- you can just take the determinants of the minors:
    x1 x2 x3
M = y1 y2 y3
    z1 z2 z3
    w1 w2 w3

         |y1 y2 y3|         |x1 x2 x3|        |x1 x2 x3|         |x1 x2 x3|
-> x0 =  |z1 z2 z3|  y0 = - |z1 z2 z3|  z0 =  |y1 y2 y3|  w0 = - |y1 y2 y3|
         |w1 w2 w3|         |w1 w2 w3|        |w1 w2 w3|         |z1 z2 z3|
Note that these 3x3 determinants are just triple products; you can save computation by reusing the cross products.
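The minors formula above can be written out directly. The C sketch below is branch-free apart from the argument check; det3 and null_vector_4x3 are illustrative names, and it computes each determinant separately rather than sharing cross products, so it shows the formula rather than the cheapest possible evaluation.

```c
#include <assert.h>

/* Determinant of the 3x3 matrix with rows a, b, c (a scalar triple product). */
static double det3(const double a[3], const double b[3], const double c[3])
{
    return a[0] * (b[1] * c[2] - b[2] * c[1])
         - a[1] * (b[0] * c[2] - b[2] * c[0])
         + a[2] * (b[0] * c[1] - b[1] * c[0]);
}

/* m holds the rows x, y, z, w of the 4x3 matrix M. On return,
 * v = (x0, y0, z0, w0) satisfies v[0]*x + v[1]*y + v[2]*z + v[3]*w = 0
 * up to rounding. The result is not normalised. */
static void null_vector_4x3(double m[4][3], double v[4])
{
    assert(m != 0 && v != 0);
    v[0] =  det3(m[1], m[2], m[3]);
    v[1] = -det3(m[0], m[2], m[3]);
    v[2] =  det3(m[0], m[1], m[3]);
    v[3] = -det3(m[0], m[1], m[2]);
}
```

If the matrix is rank-deficient all four minors vanish and the result is the zero vector, which is one way to detect that case after the fact.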
"seems very expensive" - what data do you have that supports this?
Maybe Block Lanczos is the answer you seek.
Or maybe this.
Both JAMA and Apache Commons Math have SVD implementations in Java. Why not take those and try them out? Get some real data for your case instead of impressions. It won't cost you much, since the code is already written and tested.
I wondered if the matrices are related rather than just being random, so that the null spaces you are seeking can be considered to be like 1-dimensional tangents to a curve in N-space (N = 9). If so, you may be able to speed things up by using Newton's method to solve successive instances of the system of quadratic equations Ax = 0, |x|^2 = 1, starting from a previous null space vector. Newton's method uses first derivatives to converge to a solution, and so would use Gaussian elimination to solve 9x9 systems. Using this technique would require that you be able to make small steps from matrix to matrix by, say, varying a parameter.
So the idea is that you initialize using SVD on the first matrix, but thereafter you step from matrix to matrix, using the null space vector of one as the starting point for the iteration for the next one. You need one or two iterations to get convergence. If you don't get convergence, you use SVD to restart. If this situation is what you have, it is much faster than starting fresh on each matrix.
I used this a long time ago to map contours in the solutions of sets of 50 x 50 quadratic equations associated with the behavior of electric power systems.
