How to get a random number in a Metal shader?

How would I go about getting a random number in a Metal shader?
I searched for "random" in The Metal Shading Language Specification, but found nothing.

It looks like there isn't one built in. This example code from MetalShaderShowcase/AAPLWoodShader.metal defines its own simple rand function.
// Generate a random float in the range [0.0f, 1.0f] using x, y, and z (based on the xor128 algorithm)
float rand(int x, int y, int z)
{
    int seed = x + y * 57 + z * 241;
    seed = (seed << 13) ^ seed;
    return (( 1.0 - ( (seed * (seed * seed * 15731 + 789221) + 1376312589) & 2147483647) / 1073741824.0f) + 1.0f) / 2.0f;
}

I was working on a random number generator for another project and had been meaning to package it into a neat framework for a while.
Your question pushed me to do just that. If you don't mind the shameless plug, here is a very simple framework that will generate a random number for you in a Metal shader based on (up to) three seeds that you give it. The code is based on a research paper that describes how to create random numbers on parallel processors for Monte Carlo simulations. It also has a (theoretical) period of 2^121, so it should be good for most reasonable calculations that can be done on a GPU.
All you have to do in your shader is call an initializer, then call rand(), like so:
// Initialize a random number generator, seeds 2 and 3 are optional
Loki rng = Loki(seed1, seed2, seed3);
// get a random float [0,1)
float random_float = rng.rand();
I also included a sample project in the repo so you can see how it is used.

Instead of computing the random number on the GPU, you can also compute a bunch of random numbers on the CPU and pass them into the shader using a uniform / MTLBuffer.
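If you go that route, here is a minimal host-side sketch (plain C++; the function name is mine, and the upload itself would go through whatever Metal binding you use, e.g. MTLDevice's newBufferWithBytes:length:options:):

#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Fill a host-side buffer with uniform floats in [0, 1); the contents can
// then be copied into an MTLBuffer and indexed from the shader.
std::vector<float> makeRandomBuffer(std::size_t count, std::uint64_t seed)
{
    std::mt19937_64 rng(seed);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);

    std::vector<float> buffer(count);
    for (float &v : buffer)
        v = dist(rng);
    return buffer;
}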

Please take a look at pcg-random; it's very simple and, more importantly, fast. And it's super easy to modify their C code for Metal. https://www.pcg-random.org/
typedef struct { uint64_t state; uint64_t inc; } pcg32_random_t;

uint32_t pcg32_random_r(thread pcg32_random_t* rng)
{
    uint64_t oldstate = rng->state;
    rng->state = oldstate * 6364136223846793005ULL + rng->inc;
    uint32_t xorshifted = ((oldstate >> 18u) ^ oldstate) >> 27u;
    uint32_t rot = oldstate >> 59u;
    return (xorshifted >> rot) | (xorshifted << ((-rot) & 31));
}

// Defined after pcg32_random_r so the calls below resolve.
void pcg32_srandom_r(thread pcg32_random_t* rng, uint64_t initstate, uint64_t initseq)
{
    rng->state = 0U;
    rng->inc = (initseq << 1u) | 1u;
    pcg32_random_r(rng);
    rng->state += initstate;
    pcg32_random_r(rng);
}
How do I use it?
float randomF(thread pcg32_random_t* rng)
{
    //return pcg32_random_r(rng)/float(UINT_MAX);
    return ldexp(float(pcg32_random_r(rng)), -32);
}
pcg32_random_t rng;
pcg32_srandom_r(&rng, pos_grid.x*int_time, pos_grid.y*int_time);
auto randomFloat = randomF(&rng);

Related

Fast random/mutation algorithms (vector to vector) [duplicate]

I've been trying to create a generalized Gradient Noise generator (which doesn't use the hash method to get gradients). The code is below:
#include <array>
#include <cmath>
#include <cstdint>
#include <random>
#include <glm/glm.hpp>

class GradientNoise {
    std::uint64_t m_seed;
    std::uniform_int_distribution<std::uint8_t> distribution;
    const std::array<glm::vec2, 4> vector_choice = {glm::vec2(1.0, 1.0), glm::vec2(-1.0, 1.0),
                                                    glm::vec2(1.0, -1.0), glm::vec2(-1.0, -1.0)};

public:
    GradientNoise(uint64_t seed) {
        m_seed = seed;
        distribution = std::uniform_int_distribution<std::uint8_t>(0, 3);
    }

    // 0 -> 1
    // just passes the value through; originally was the Perlin noise activation
    double nonLinearActivationFunction(double value) {
        //return value * value * value * (value * (value * 6.0 - 15.0) + 10.0);
        return value;
    }

    // 0 -> 1
    // cosine interpolation
    double interpolate(double a, double b, double t) {
        double mu2 = (1 - cos(t * M_PI)) / 2;
        return (a * (1 - mu2) + b * mu2);
    }

    double noise(double x, double y) {
        std::mt19937_64 rng;
        // first get the bottom left corner associated with these coordinates
        int corner_x = std::floor(x);
        int corner_y = std::floor(y);
        // then get the respective distance from that corner
        double dist_x = x - corner_x;
        double dist_y = y - corner_y;
        double corner_0_contrib; // bottom left
        double corner_1_contrib; // top left
        double corner_2_contrib; // top right
        double corner_3_contrib; // bottom right
        std::uint64_t s1 = ((std::uint64_t(corner_x) << 32) + std::uint64_t(corner_y) + m_seed);
        std::uint64_t s2 = ((std::uint64_t(corner_x) << 32) + std::uint64_t(corner_y + 1) + m_seed);
        std::uint64_t s3 = ((std::uint64_t(corner_x + 1) << 32) + std::uint64_t(corner_y + 1) + m_seed);
        std::uint64_t s4 = ((std::uint64_t(corner_x + 1) << 32) + std::uint64_t(corner_y) + m_seed);
        // each xy pair turns into a distance vector from the respective corner;
        // corner zero is our starting corner (bottom left)
        rng.seed(s1);
        corner_0_contrib = glm::dot(vector_choice[distribution(rng)], {dist_x, dist_y});
        rng.seed(s2);
        corner_1_contrib = glm::dot(vector_choice[distribution(rng)], {dist_x, dist_y - 1});
        rng.seed(s3);
        corner_2_contrib = glm::dot(vector_choice[distribution(rng)], {dist_x - 1, dist_y - 1});
        rng.seed(s4);
        corner_3_contrib = glm::dot(vector_choice[distribution(rng)], {dist_x - 1, dist_y});
        double u = nonLinearActivationFunction(dist_x);
        double v = nonLinearActivationFunction(dist_y);
        double x_bottom = interpolate(corner_0_contrib, corner_3_contrib, u);
        double x_top = interpolate(corner_1_contrib, corner_2_contrib, u);
        double total_xy = interpolate(x_bottom, x_top, v);
        return total_xy;
    }
};
I then generate an OpenGL texture to display it, like this:
int width = 1024;
int height = 1024;
unsigned char *temp_texture = new unsigned char[width*height * 4];
double octaves[5] = {2,4,8,16,32};
for (int i = 0; i < height; i++) {
    for (int j = 0; j < width; j++) {
        double d_noise = 0;
        d_noise += temp_1.noise(j/octaves[0], i/octaves[0]);
        d_noise += temp_1.noise(j/octaves[1], i/octaves[1]);
        d_noise += temp_1.noise(j/octaves[2], i/octaves[2]);
        d_noise += temp_1.noise(j/octaves[3], i/octaves[3]);
        d_noise += temp_1.noise(j/octaves[4], i/octaves[4]);
        d_noise /= 5;
        uint8_t noise = static_cast<uint8_t>(((d_noise * 128.0) + 128.0));
        temp_texture[j*4 + (i * width * 4) + 0] = (noise);
        temp_texture[j*4 + (i * width * 4) + 1] = (noise);
        temp_texture[j*4 + (i * width * 4) + 2] = (noise);
        temp_texture[j*4 + (i * width * 4) + 3] = (255);
    }
}
This gives good results (result image not shown).
But gprof is telling me that the Mersenne Twister is taking up 62.4% of my time, and growing with larger textures. Nothing else individually takes anywhere near as much time. While the Mersenne Twister is fast after initialization, the fact that I initialize it every time I use it seems to make it pretty slow.
This initialization is 100% required to make sure that the same x and y generate the same gradient at each integer point (so you need either a hash function or to seed the RNG each time).
I attempted to change the PRNG to both the linear congruential generator and Xorshiftplus, and while both ran orders of magnitude faster, they gave odd results:
(Result images not shown: LCG output, seeded once and then run 5 times before use; Xorshiftplus output after one iteration and after 10,000 iterations.)
I've tried:
- Running the generator several times before using the output; this results in slow execution or simply different artifacts.
- Using the output of two consecutive runs after the initial seed to seed the PRNG again and using the value afterwards. No difference in result.
What is happening? What can I do to get faster results that are of the same quality as the Mersenne Twister?
OK, BIG UPDATE:
I don't know why this works (I know it has something to do with the prime number used), but after messing around a bit, it appears that the following works:
Step 1: incorporate the x and y values as seeds separately (and incorporate some other offset or additional seed value with them; this number should be a prime/non-trivial factor).
Step 2: use those two seed results to seed the generator again, feeding them back into the function (so, as geza said, the seeds I made were bad).
Step 3: when getting the result, instead of taking it modulo the number of items (4), or & 3, take the result modulo a prime number first, then apply & 3. I'm not sure if it matters whether the prime is a Mersenne prime.
Here is the result with prime = 257 and Xorshiftplus being used (image not shown). Note I used 2048 by 2048 for this one; the others were 256 by 256.
LCG is known to be inadequate for your purpose.
Xorshift128+'s results are bad because it needs good seeding, and providing good seeding defeats the whole purpose of using it. I don't recommend this.
However, I recommend using an integer hash. For example, one from Bob's page.
Here's the result of the first hash on that page (image not shown); it looks OK to me, and it is fast (I think it is much faster than the Mersenne Twister):
Here's the code I've written to generate this:
#include <cmath>
#include <stdio.h>

unsigned int hash(unsigned int a) {
    a = (a ^ 61) ^ (a >> 16);
    a = a + (a << 3);
    a = a ^ (a >> 4);
    a = a * 0x27d4eb2d;
    a = a ^ (a >> 15);
    return a;
}

unsigned int ivalue(int x, int y) {
    return hash(y<<16|x)&0xff;
}

float smooth(float x) {
    return 6*x*x*x*x*x - 15*x*x*x*x + 10*x*x*x;
}

float value(float x, float y) {
    int ix = floor(x);
    int iy = floor(y);
    float fx = smooth(x-ix);
    float fy = smooth(y-iy);
    int v00 = ivalue(iy+0, ix+0);
    int v01 = ivalue(iy+0, ix+1);
    int v10 = ivalue(iy+1, ix+0);
    int v11 = ivalue(iy+1, ix+1);
    float v0 = v00*(1-fx) + v01*fx;
    float v1 = v10*(1-fx) + v11*fx;
    return v0*(1-fy) + v1*fy;
}

unsigned char pic[1024*1024];

int main() {
    for (int y=0; y<1024; y++) {
        for (int x=0; x<1024; x++) {
            float v = 0;
            for (int o=0; o<=9; o++) {
                v += value(x/64.0f*(1<<o), y/64.0f*(1<<o))/(1<<o);
            }
            int r = rint(v*0.5f);
            pic[y*1024+x] = r;
        }
    }
    FILE *f = fopen("x.pnm", "wb");
    fprintf(f, "P5\n1024 1024\n255\n");
    fwrite(pic, 1, 1024*1024, f);
    fclose(f);
}
If you want to understand how a hash function works (or better yet, which properties a good hash has), check out Bob's page, for example this one.
You (unknowingly?) implemented a visualization of PRNG non-random patterns. That looks very cool!
Except for the Mersenne Twister, none of the PRNGs you tested seem fit for your purpose. As I have not done further tests myself, I can only suggest trying out and measuring further PRNGs.
The randomness of LCGs is known to be sensitive to the choice of their parameters. In particular, the period of an LCG is related to the m parameter: at most it will be m (your prime factor), and for many values it can be less.
Similarly, careful parameter selection is required to get a long period from Xorshift PRNGs.
You've noted that some PRNGs give good procedural generation results while others do not. In order to isolate the cause, I would factor out the proc-gen stuff and examine the PRNG output directly. An easy way to visualize the data is to build a grayscale image where each pixel value is a (possibly scaled) random value. For image-based stuff, I find this an easy way to spot things that may lead to visual artifacts. Any artifacts you see with this are likely to cause issues with your proc-gen output.
Another option is to try something like the Diehard tests. If the aforementioned image test failed to reveal any problems, I might use this just to be sure my PRNG techniques were trustworthy.
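For illustration, here's a minimal sketch of such an image test (my own throwaway C++, writing a PGM the same way as the answer above; swap std::mt19937_64 for whichever PRNG you want to inspect):

#include <cstdio>
#include <random>

// Dumps one byte of raw PRNG output per pixel as a grayscale PGM.
// Stripes or lattice patterns visible here will usually show up in the
// procedural generation output as well.
int main() {
    const int W = 512, H = 512;
    std::mt19937_64 rng(12345);
    FILE *f = fopen("prng.pgm", "wb");
    fprintf(f, "P5\n%d %d\n255\n", W, H);
    for (int i = 0; i < W * H; i++) {
        unsigned char v = (unsigned char)(rng() & 0xff);
        fwrite(&v, 1, 1, f);
    }
    fclose(f);
}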
Note that your code seeds the PRNG, then generates one pseudorandom number from the PRNG. The reason for the nonrandomness in xorshift128+ that you discovered is that xorshift128+ simply adds the two halves of the seed (and uses the result mod 2^64 as the generated number) before changing its state (review its source code). This makes that PRNG considerably different from a hash function.
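For reference, here is the commonly cited xorshift128+ step (after Vigna); note that the returned value is formed from the current state before any of the shifts mix it, so the first draw after seeding is just the sum of the two seed halves:

#include <cstdint>

static uint64_t s[2]; // state; must be seeded to something nonzero

uint64_t xorshift128plus() {
    uint64_t s1 = s[0];
    const uint64_t s0 = s[1];
    const uint64_t result = s0 + s1; // output computed before mixing
    s[0] = s0;
    s1 ^= s1 << 23;
    s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);
    return result;
}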
What you see is a practical demonstration of PRNG quality. The Mersenne Twister is one of the best PRNGs with good performance; it passes the Diehard tests. One should know that generating random numbers is not an easy computational task, so looking for better performance will inevitably result in poorer quality. The LCG is known to be the simplest and worst PRNG ever designed, and it clearly shows two-dimensional correlation, as in your picture. The quality of Xorshift generators depends largely on bitness and parameters. They are definitely worse than the Mersenne Twister, but some (xorshift128+) may work well enough to pass the BigCrush battery of TestU01 tests.
In other words, if you are running an important physical-modelling numerical experiment, you had better continue to use the Mersenne Twister, as it is known to be a good trade-off between speed and quality and it comes in many standard libraries. In less important cases you may try the xorshift128+ generator. For ultimate results you would need a cryptographic-quality PRNG (none of the generators mentioned here may be used for cryptographic purposes).

Pseudo random number generation on the gpu

Using a C# script in the Unity3D game engine to control an HLSL compute shader, I'm trying to generate pseudorandom numbers on the GPU and store them in a Texture2D. Following along with the GPU Gems 3 hybrid Tausworthe method and another thread, Pseudo Random Number Generation on the GPU, I've come across an issue.
The problem: the resulting texture appears to be one solid color. If I run the shader multiple times, I get a different solid-color texture every time, but the entire texture is that one color.
Compute shader code
#pragma kernel CSMain
RWTexture2D<float4> result; // 256 resolution texture to write to
uint4 seed; //four uniform random numbers generated on the CPU in a C# script
struct RandomResult
{
uint4 state;
float value;
};
uint TausStep(uint z, int S1, int S2, int S3, uint M)
{
uint b = (((z << S1) ^ z) >> S2);
return ((z & M) << S3) ^ b;
}
uint LCGStep(uint z, uint A, uint C)
{
return A * z + C;
}
RandomResult HybridTaus(uint4 state)
{
state.x = TausStep(state.x, 13, 19, 12, 4294967294);
state.y = TausStep(state.y, 2, 25, 4, 4294967288);
state.z = TausStep(state.z, 3, 11, 17, 4294967280);
state.w = LCGStep(state.w, 1664525, 1013904223);
RandomResult rand;
rand.state = state;
rand.value = 2.3283064365387e-10 * (state.x ^ state.y ^ state.z ^ state.w);
return rand;
}
[numthreads(8, 8, 1)]
void CSMain(uint3 id)
{
result[id.xy] = HybridTaus(seed).value;
}
Do I need to save the state on the GPU? If so, how would I do that? Do I need to deallocate the memory afterwards?
I tried to assign the result of the HybridTaus() function to seed, in hopes that it would use the new value in the following HybridTaus(seed) call, to see if that would make a difference. I also tried to add unique arbitrary numbers based on the thread ID (the id parameter). This gave somewhat improved results, but I suspect the randomness is only as good as I can make it, coming from maths performed on the thread IDs and not really from the random number generator.
[numthreads(8, 8, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    // first thing I tried
    //RandomResult rand = HybridTaus(seed);
    //seed = rand.state; // re-assign seed with the new state
    //result[id.xy] = rand.value;

    // second thing I tried
    RandomResult rand = HybridTaus(seed * uint4(id.x*id.y*id.x*id.y,
                                                id.x*id.y/id.x*id.y,
                                                id.x*id.y+id.x*id.y,
                                                id.x*id.y-id.x*id.y));
    result[id.xy] = rand.value;
}
First of all, I don't know about the algorithm you posted, but I found this simple algorithm online for generating random numbers on the GPU. Here seed is a 32-bit uint.
uint wang_hash(uint seed)
{
    seed = (seed ^ 61) ^ (seed >> 16);
    seed *= 9;
    seed = seed ^ (seed >> 4);
    seed *= 0x27d4eb2d;
    seed = seed ^ (seed >> 15);
    return seed;
}
Now in most cases this is sufficient: you can pass your compute shader's local invocation ID, as that is unique, and get a random number per thread or per invocation. However, if you need multiple random numbers per invocation (for example, in a loop or a nested loop), this doesn't work, as the seed remains the same. So I messed with the function a little bit and came up with this:
uint wang_hash(uint seed, float x, float y)
{
    // x and y are the nested loop variables, passed in so the seed
    // varies between calls within one invocation
    seed = seed + 76.897898 * 48.789789 * cos(x) * sin(y) * 20.79797;
    seed = (seed ^ 61) ^ (seed >> 16);
    seed *= 9;
    seed = seed ^ (seed >> 4);
    seed *= 0x27d4eb2d;
    seed = seed ^ (seed >> 15);
    return seed;
}
Here x and y are my nested for-loop variables. This works for me; now you can get multiple random numbers per invocation.
In your case, however, I don't think you need the latter one. If I understood correctly, you just need to store a random number for every texel, so you can use the first version with the unique local invocation ID to get a random number for every texel value.
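To illustrate that suggestion, here is a small host-side C++ emulation of per-texel seeding (the flattened seed = y * width + x stands in for the dispatch-thread ID; width and the 4x4 print size are arbitrary):

#include <cstdint>
#include <cstdio>

uint32_t wang_hash(uint32_t seed) {
    seed = (seed ^ 61u) ^ (seed >> 16);
    seed *= 9u;
    seed = seed ^ (seed >> 4);
    seed *= 0x27d4eb2du;
    seed = seed ^ (seed >> 15);
    return seed;
}

int main() {
    const uint32_t width = 256;
    // One distinct seed per texel, so every texel gets its own value
    // instead of the whole texture being a single color.
    for (uint32_t y = 0; y < 4; y++) {
        for (uint32_t x = 0; x < 4; x++) {
            uint32_t seed = y * width + x;             // flattened thread ID
            double v = wang_hash(seed) / 4294967296.0; // scale to [0, 1)
            printf("%.3f ", v);
        }
        printf("\n");
    }
}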

TERCOM algorithm - Changing from single thread to multiple threads in CUDA

I'm currently working on porting a TERCOM algorithm from using only one thread to using multiple threads. Briefly explained, the TERCOM algorithm receives 5 measurements and the heading, and compares these measurements to a prestored map. The algorithm chooses the best match, i.e. the lowest Mean Absolute Difference (MAD), and returns the position.
The code works perfectly with one thread and for-loops, but when I try to use multiple threads and blocks it returns the wrong answer. It seems like the multithreaded version doesn't "run through" the calculation in the same way as the single-threaded version. Does anyone know what I am doing wrong?
Here's the code using for-loops
__global__ void kernel (int m, int n, int h, int N, float *f, float heading, float *measurements)
{
    //Without threads
    float pos[2]={0};
    float theta=heading*(PI/180);
    float MAD=0;

    // Calculate how much to move in x and y direction
    float offset_x = h*cos(theta);
    float offset_y = -h*sin(theta);

    float min=100000; //Some High value

    //Calculate Mean Absolute Difference
    for(float row=0;row<m;row++)
    {
        for(float col=0;col<n;col++)
        {
            for(float g=0; g<N; g++)
            {
                f[(int)g] = tex2D (tex, col+(g-2)*offset_x+0.5f, row+(g-2)*offset_y+0.5f);
                MAD += abs(measurements[(int)g]-f[(int)g]);
            }
            if(MAD<min)
            {
                min=MAD;
                pos[0]=col;
                pos[1]=row;
            }
            MAD=0; //Reset MAD
        }
    }

    f[0]=min;
    f[1]=pos[0];
    f[2]=pos[1];
}
This is my attempt to use multiple threads
__global__ void kernel (int m, int n, int h, int N, float *f, float heading, float *measurements)
{
    // With threads
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;

    float pos[2]={0};
    float theta=heading*(PI/180);
    float MAD=0;

    // Calculate how much to move in x and y direction
    float offset_x = h*cos(theta);
    float offset_y = -h*sin(theta);

    float min=100000; //Some High value

    if(idx < n && idy < m)
    {
        for(float g=0; g<N; g++)
        {
            f[(int)g] = tex2D (tex, idx+(g-2)*offset_x+0.5f, idy+(g-2)*offset_y+0.5f);
            MAD += abs(measurements[(int)g]-f[(int)g]);
        }
        if(MAD<min)
        {
            min=MAD;
            pos[0]=idx;
            pos[1]=idy;
        }
        MAD=0; //Reset MAD
    }

    f[0]=min;
    f[1]=pos[0];
    f[2]=pos[1];
}
To launch the kernel
dim3 dimBlock( 16,16 );
dim3 dimGrid;
dimGrid.x = (n + dimBlock.x - 1)/dimBlock.x;
dimGrid.y = (m + dimBlock.y - 1)/dimBlock.y;
kernel <<< dimGrid,dimBlock >>> (m, n, h, N, dev_results, heading, dev_measurements);
The basic problem here is that you have a memory race in the code, centered around the use of f as both some sort of thread local scratch space and an output variable. Every concurrent thread will be trying to write values into the same locations in f simultaneously, which will produce undefined behaviour.
As best I can tell, the use of f as scratch space isn't necessary at all, and the main computational section of the kernel could be written as something like:
if(idx < n && idy < m)
{
    for(float g=0; g<N; g++)
    {
        float fval = tex2D (tex, idx+(g-2)*offset_x+0.5f, idy+(g-2)*offset_y+0.5f);
        MAD += abs(measurements[(int)g]-fval);
    }
    min=MAD;
    pos[0]=idx;
    pos[1]=idy;
}
[disclaimer: written in browser, use at own risk]
At the end of that calculation, each thread has its own values of min and pos. At a minimum these must be stored in unique global memory (i.e. the output must have enough space for each thread's result). You will then need to perform some sort of reduction operation to obtain the global minimum from the set of thread-local values. That could be in the host, or in the device code, or some combination of the two. There is a lot of code already available for CUDA parallel reductions which you should be able to find by searching and/or looking in the examples supplied with the CUDA toolkit. It should be trivial to adapt them to your specific case, where you need to retain the position along with the minimum value.
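As a minimal sketch of the host-side variant of that reduction (h_min and h_pos are hypothetical arrays holding one MAD value and one (x, y) pair per thread, copied back with cudaMemcpy after the kernel; a device-side parallel reduction would avoid the copy for large grids):

#include <cfloat>
#include <cstddef>

// Scan the per-thread results for the global best match.
void reduceResults(const float *h_min, const float *h_pos, size_t numThreads,
                   float &bestMad, float &bestX, float &bestY)
{
    bestMad = FLT_MAX;
    for (size_t i = 0; i < numThreads; ++i) {
        if (h_min[i] < bestMad) {
            bestMad = h_min[i];
            bestX = h_pos[2 * i + 0];
            bestY = h_pos[2 * i + 1];
        }
    }
}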

Fast sigmoid algorithm

The sigmoid function is defined as f(x) = 1 / (1 + exp(-x)).
I found that using the C built-in function exp() to calculate the value of f(x) is slow. Is there any faster algorithm to calculate the value of f(x)?
You don't have to use the actual, exact sigmoid function in a neural network algorithm; you can replace it with an approximated version that has similar properties but is faster to compute.
For example, you can use the "fast sigmoid" function
f(x) = x / (1 + abs(x))
Using the first terms of the series expansion for exp(x) won't help much if the arguments to f(x) are not near zero, and you have the same problem with a series expansion of the sigmoid function if the arguments are "large".
An alternative is to use a table lookup. That is, you precalculate the values of the sigmoid function for a given number of data points, and then do fast (linear) interpolation between them if you want.
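For illustration, a minimal sketch of such a lookup table in C++ (the table size N and the clamping range RANGE are arbitrary choices of mine):

#include <cmath>

constexpr int   N     = 1024;
constexpr float RANGE = 8.0f; // outside [-8, 8] the sigmoid is ~0 or ~1
static float table[N + 1];

// Precompute the table once.
void initSigmoidTable() {
    for (int i = 0; i <= N; ++i) {
        float x = -RANGE + 2.0f * RANGE * i / N;
        table[i] = 1.0f / (1.0f + std::exp(-x));
    }
}

// Answer queries with linear interpolation between neighbouring samples.
float lutSigmoid(float x) {
    if (x <= -RANGE) return table[0];
    if (x >=  RANGE) return table[N];
    float t = (x + RANGE) * (N / (2.0f * RANGE)); // fractional table index
    int   i = (int)t;
    float f = t - i;
    return table[i] * (1.0f - f) + table[i + 1] * f;
}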
It's best to measure on your hardware first. A quick benchmark script shows that on my machine 1/(1+|x|) is the fastest, with tanh(x) a close second. The error function erf is pretty fast too.
% gcc -Wall -O2 -lm -o sigmoid-bench{,.c} -std=c99 && ./sigmoid-bench
atan(pi*x/2)*2/pi 24.1 ns
atan(x) 23.0 ns
1/(1+exp(-x)) 20.4 ns
1/sqrt(1+x^2) 13.4 ns
erf(sqrt(pi)*x/2) 6.7 ns
tanh(x) 5.5 ns
x/(1+|x|) 5.5 ns
I expect that the results may vary depending on architecture and the compiler used, but erf(x) (since C99), tanh(x) and x/(1.0+fabs(x)) are likely to be the fast performers.
People here are mostly concerned with how fast one function is relative to another, and create micro-benchmarks to see whether f1(x) runs 0.0001 ms faster than f2(x). The big problem is that this is mostly irrelevant, because what matters is how fast your network learns with your activation function while trying to minimize your cost function.
According to current theory, the rectifier function and softplus,
compared to the sigmoid function or similar activation functions, allow
for faster and more effective training of deep neural architectures on
large and complex datasets.
So I suggest throwing away the micro-optimization and taking a look at which function allows faster learning (also taking a look at various other cost functions).
To make the NN more flexible, an alpha rate is usually used to change the angle of the graph around 0.
The sigmoid function looks like:
f(x) = 1 / ( 1+exp(-x*alpha))
A nearly equivalent (but faster) function is:
f(x) = 0.5 * (x * alpha / (1 + abs(x*alpha))) + 0.5
You can check the graphs here
When I use the abs-based function, the network becomes 100+ times faster.
This answer probably isn't relevant for most cases, but I just wanted to throw out there that for CUDA computing I've found x/sqrt(1+x^2) to be the fastest function by far.
For example, done with single precision float intrinsics:
__device__ void fooCudaKernel(/* some arguments */) {
    float foo, sigmoid;
    // some code defining foo
    sigmoid = __fmul_rz(rsqrtf(__fmaf_rz(foo,foo,1)),foo);
}
Also, you might use a rough version of the sigmoid (it differs from the original by no more than 0.2%):
inline float RoughSigmoid(float value)
{
    float x = ::abs(value);
    float x2 = x*x;
    float e = 1.0f + x + x2*0.555f + x2*x2*0.143f;
    return 1.0f / (1.0f + (value > 0 ? 1.0f / e : e));
}

void RoughSigmoid(const float * src, size_t size, const float * slope, float * dst)
{
    float s = slope[0];
    for (size_t i = 0; i < size; ++i)
        dst[i] = RoughSigmoid(src[i] * s);
}
Optimization of RoughSigmoid function with using SSE:
#include <xmmintrin.h>

void RoughSigmoid(const float * src, size_t size, const float * slope, float * dst)
{
    size_t alignedSize = size/4*4;
    __m128 _slope = _mm_set1_ps(*slope);
    __m128 _0 = _mm_set1_ps(-0.0f);
    __m128 _1 = _mm_set1_ps(1.0f);
    __m128 _0555 = _mm_set1_ps(0.555f);
    __m128 _0143 = _mm_set1_ps(0.143f);
    size_t i = 0;
    for (; i < alignedSize; i += 4)
    {
        __m128 _src = _mm_loadu_ps(src + i);
        __m128 x = _mm_andnot_ps(_0, _mm_mul_ps(_src, _slope));
        __m128 x2 = _mm_mul_ps(x, x);
        __m128 x4 = _mm_mul_ps(x2, x2);
        __m128 series = _mm_add_ps(_mm_add_ps(_1, x), _mm_add_ps(_mm_mul_ps(x2, _0555), _mm_mul_ps(x4, _0143)));
        __m128 mask = _mm_cmpgt_ps(_src, _0);
        __m128 exp = _mm_or_ps(_mm_and_ps(_mm_rcp_ps(series), mask), _mm_andnot_ps(mask, series));
        __m128 sigmoid = _mm_rcp_ps(_mm_add_ps(_1, exp));
        _mm_storeu_ps(dst + i, sigmoid);
    }
    for (; i < size; ++i)
        dst[i] = RoughSigmoid(src[i] * slope[0]);
}
Optimization of RoughSigmoid function with using AVX:
#include <immintrin.h>

void RoughSigmoid(const float * src, size_t size, const float * slope, float * dst)
{
    size_t alignedSize = size/8*8;
    __m256 _slope = _mm256_set1_ps(*slope);
    __m256 _0 = _mm256_set1_ps(-0.0f);
    __m256 _1 = _mm256_set1_ps(1.0f);
    __m256 _0555 = _mm256_set1_ps(0.555f);
    __m256 _0143 = _mm256_set1_ps(0.143f);
    size_t i = 0;
    for (; i < alignedSize; i += 8)
    {
        __m256 _src = _mm256_loadu_ps(src + i);
        __m256 x = _mm256_andnot_ps(_0, _mm256_mul_ps(_src, _slope));
        __m256 x2 = _mm256_mul_ps(x, x);
        __m256 x4 = _mm256_mul_ps(x2, x2);
        __m256 series = _mm256_add_ps(_mm256_add_ps(_1, x), _mm256_add_ps(_mm256_mul_ps(x2, _0555), _mm256_mul_ps(x4, _0143)));
        __m256 mask = _mm256_cmp_ps(_src, _0, _CMP_GT_OS);
        __m256 exp = _mm256_or_ps(_mm256_and_ps(_mm256_rcp_ps(series), mask), _mm256_andnot_ps(mask, series));
        __m256 sigmoid = _mm256_rcp_ps(_mm256_add_ps(_1, exp));
        _mm256_storeu_ps(dst + i, sigmoid);
    }
    for (; i < size; ++i)
        dst[i] = RoughSigmoid(src[i] * slope[0]);
}
The code is based on a C# version previously posted by @jenkas, with minor modifications.
The following C++ code provides excellent precision, outperforming low-precision approximations by virtue of the fact that it allows compilers to auto-vectorize the compiled code onto SIMD instructions when used in simple loops.
GCC will compile the code to SIMD (Arm Neon or Intel AVX) instructions that perform four sigmoid (or tanh) computations in parallel. Auto-vectorization yields performance comparable to even very low-precision optimizations while maintaining essentially full precision. Microsoft and Intel compilers also perform auto-vectorization.
A brief discussion of auto-vectorization, compiler optimizations, and practices that produce optimal performance is provided near the end of this post.
The following functions provide a maximum error of +/- 6.55651e-07 over the full range, as compared to 1/(1+exp(-v)).
// Returns float approximation of 1/(1+exp(-v))
inline float fast_sigmoid(float v)
{
    constexpr float c1 = 0.03138777F;
    constexpr float c2 = 0.276281267F;
    constexpr float c_log2f = 1.442695022F;

    v *= c_log2f*0.5;
    int intPart = (int)v;
    float x = (v - intPart);
    float xx = x * x;
    float v1 = c_log2f + c2 * xx;
    float v2 = x + xx * c1 * x;
    float v3 = (v2 + v1);
    *((int*)&v3) += intPart << 24;
    float v4 = v2 - v1;
    float res = v3 / (v3 - v4); //for tanh change to (v3 + v4)/ (v3 - v4)
    return res;
}
// Returns float approximation tanh(v)
inline float fast_tanh(float v)
{
    const float c1 = 0.03138777F;
    const float c2 = 0.276281267F;
    const float c_log2f = 1.442695022F;

    v *= c_log2f;
    int intPart = (int)v;
    float x = (v - intPart);
    float xx = x * x;
    float v1 = c_log2f + c2 * xx;
    float v2 = x + xx * c1 * x;
    float v3 = (v2 + v1);
    *((int*)&v3) += intPart << 24;
    float v4 = v2 - v1;
    float res = (v3+v4) / (v3 - v4);
    return res;
}
Benchmark results on Raspberry PI 4 (AARCH64):
-- Sigmoid benchmark --------
fast_sigmoid(x) 5.63 ns
fast_tanh(x) 5.89 ns
Vectorized fast_sigmoid(out,in,count) using Neon intrinsics
5.79 ns
atan(pi/2*x)/(pi/2) 27.29 ns
atan(x) 24.13 ns
1/(1+exp(-x)) 14.92 ns
1/sqrt(1+x^2) 4.26 ns
erf(sqrt(pi)/2*x) 20.62 ns
tanh(x) 20.64 ns
x/(1+|x|) 8.93 ns
x (measures loop overhead) 1.62 ns
x*x (for reference) 1.62 ns
1/(1+x) (for reference) 2.64 ns
Raspberry Pi 4, aarch64 Arm Cortex 72#1.8GHz. GCC 10.2.1
In the benchmark, GCC vectorizes the fast_sigmoid call into ARM Neon instructions allowing four values to be calculated in parallel.
For optimal performance, you should ensure that input vectors are aligned on 64-byte boundaries. AVX and Neon instructions both allow for unaligned access, but do so with a mild performance penalty.
In addition, you should inform the compiler that input vectors do not alias, using the restrict keyword. restrict is defined in the C99 standard but is not standard C++; fortunately, all major C++ compilers (Intel, Microsoft, GCC, Clang) implement an equivalent extension (e.g. __restrict). Without alias guarantees, compilers will generate a small code preamble that tests for aliasing at runtime and executes a slow code path if aliasing is detected.
To enable vectorization, GCC requires either the -ftree-vectorize option, or -O3 (which includes -ftree-vectorize).
Loops are vectorized as long as there are no operations that prevent vectorization. Including a call to a math intrinsic (exp, sin, cos &c) will prevent loop vectorization, as will if statements within the loop. However, loop bodies can be fairly substantial. For example, in my LSTM implementation, one of the loops contains operations on four separate vector components (more operations in the loop provides more opportunity for interleaved instruction scheduling)
The restrict qualifier in the following sample informs the compiler that no part of the input and output vectors overlap, allowing the compiler to omit the aliasing check:

void vec_sigmoid(
    int length,
    float* __restrict output,
    float* __restrict input,
    float* __restrict bias)
{
    for (int i = 0; i < length; ++i)
    {
        output[i] = fast_sigmoid(input[i]) + bias[i];
    }
}
The code is a C++ port of @jenkas's C# code posted earlier, adjusted to return 1/(1+exp(-x)) instead of 1/(1+exp(-2*x)), which is what the original code calculates.
You can use a simple but effective method with two formulas:
if x < 0: f(x) = 0.5 / (1 + x^2)
if x >= 0: f(x) = 1 - 0.5 / (1 + x^2)
This will look like this:
Two graphs for a sigmoid {Blue: 0.5/(1+x^2), Yellow: 1 - 0.5/(1+x^2)} (image not shown)
Try this .NET Core 5+ implementation
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static unsafe float FastSigmoid(float v)
{
    const float c1 = 0.03138777F;
    const float c2 = 0.276281267F;
    const float c_log2f = 1.442695022F;

    v *= c_log2f;
    int intPart = (int)v;
    float x = (v - intPart);
    float xx = x * x;
    float v1 = c_log2f + c2 * xx;
    float v2 = x + xx * c1 * x;
    float v3 = (v2 + v1);
    *((int*)&v3) += intPart << 24;
    float v4 = v2 - v1;
    float res = v3 / (v3 - v4); //for tanh change to (v3 + v4)/ (v3 - v4)
    return res;
}
Using Eureqa to search for approximations to the sigmoid, I found that 1/(1 + 0.3678749025^x) approximates it. It's pretty close; it just gets rid of one operation (the negation of x).
Some of the other functions shown here are interesting, but is the power operation really that slow? I tested it, and it actually ran faster than addition, but that could just be a fluke. If so, it should be just as fast as or faster than all the others.
EDIT: 0.5 + 0.5*tanh(0.5*x) also works, and, less accurately, so does 0.5 + 0.5*tanh(n). And you could just get rid of the constants if you don't care about keeping it in the range [0,1] like the sigmoid. But this assumes that tanh is faster.
The tanh function may be optimized in some languages, making it faster than a custom defined x/(1+abs(x)), such is the case in Julia.
You can also use this:
y = x / (2 * ((x<0.0)*-x + (x>=0.0)*x) + 2) + 0.5;
y' = y*(1-y);
(The expression in the denominator is just a branchless abs(x), so this is the fast sigmoid rescaled into [0,1].) It acts like a sigmoid now because y' = y*(1-y) is, let's say, rounder than 1/(2*(1+abs(x))^2), which acts more like the plain fast sigmoid.
I don't think you can do better than the built-in exp(), but if you want another approach, you can use a series expansion. WolframAlpha can compute it for you.
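For reference, the Maclaurin series (which follows from the identity sigma(x) = 1/2 + (1/2)*tanh(x/2)) is

\sigma(x) = \frac{1}{2} + \frac{x}{4} - \frac{x^3}{48} + \frac{x^5}{480} - \frac{17x^7}{80640} + \cdots

and, as noted in another answer, a truncation of it is only accurate for arguments near zero.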

Computing the null space of a matrix as fast as possible

I need to compute the null space of several thousand small matrices (8x9, not 4x3 as I wrote previously) in parallel (CUDA). All references point to SVD, but the algorithm in Numerical Recipes seems very expensive, and gives me lots of things other than the null space that I don't really need. Is Gaussian elimination really not an option? Are there any other commonly used methods?
To answer your question directly... yes! QR decomposition!
Let A be an m-by-n matrix with rank n. QR decomposition finds an orthogonal m-by-m matrix Q and an upper triangular m-by-n matrix R such that A = QR. If we define Q = [Q1 Q2], where Q1 is m-by-n and Q2 is m-by-(m-n), then the columns of Q2 form the null space of A^T.
QR decomposition is computed either by Gram-Schmidt, Givens rotations, or Householder reflections. They have different stability properties and operation counts.
You are right: SVD is expensive! I can't speak for what state-of-the-art stuff uses, but when I hear "compute null space" (EDIT: in a way that is simple for me to understand), I think QR.
I don't think the above proposed method always gives the whole null space. To recap: "A = QR, where Q = [Q1 Q2], and Q1 is m-by-n and Q2 is m-by-(m-n). Then the columns of Q2 form the null space of A^T."
Indeed, this may give only a subspace of the null space. A simple counterexample is A = 0, in which case the null space of A^T is the whole of R^m.
Therefore, it is necessary to check R too. Based on my experience with Matlab, if a row of R is all zeros, then the corresponding column in Q should also be a basis vector of the null space of A^T. Clearly this observation is heuristic and hinges on the particular algorithm used for the QR decomposition.
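For a concrete CPU-side illustration of the QR approach (using Eigen, which is my choice here, not something from the original answers):

#include <Eigen/Dense>
#include <iostream>

int main() {
    // For an 8x9 matrix M, factor A = M^T (9x8, full column rank assumed);
    // the last m - n columns of Q then span the null space of M.
    const int m = 9, n = 8;
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(m, n);

    Eigen::HouseholderQR<Eigen::MatrixXd> qr(A);
    Eigen::MatrixXd Q = qr.householderQ();          // m-by-m orthogonal factor
    Eigen::MatrixXd nullspace = Q.rightCols(m - n); // basis of null(A^T)

    // Should print a value near machine epsilon.
    std::cout << (A.transpose() * nullspace).norm() << std::endl;
}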
Gaussian elimination is plenty fast for 4x3 matrices. IIRC I've done about 5 million per second with Java without parallelism. With such a small problem, your best bet is to code the routine (row reduce etc.) yourself; otherwise you'll waste most of the time putting the data into the right format for the external routine.
In the answers above, it has already been pointed out how the null space of a matrix can be calculated using the QR or the SVD approach. SVD should be preferred when accuracy is required; see also Null-space of a rectangular dense matrix.
As of February 2015, CUDA 7 (now in release candidate) makes SVD available through its new cuSOLVER library. Below I report an example of how to use cuSOLVER's SVD to calculate the null space of a matrix.
Be aware that the problem you are focusing on concerns the calculation of several small matrices, so you should adapt the example I'm providing below by using streams to make sense for your case. To associate a stream to each task you can use cudaStreamCreate() and cusolverDnSetStream().
kernel.cu
#include "cuda_runtime.h"
#include "device_launch_paraMeters.h"
#include<iostream>
#include<iomanip>
#include<stdlib.h>
#include<stdio.h>
#include<assert.h>
#include<math.h>
#include <cusolverDn.h>
#include <cuda_runtime_api.h>
#include "Utilities.cuh"
/********/
/* MAIN */
/********/
int main(){
// --- gesvd only supports Nrows >= Ncols
// --- column major memory ordering
const int Nrows = 7;
const int Ncols = 5;
// --- cuSOLVE input/output parameters/arrays
int work_size = 0;
int *devInfo; gpuErrchk(cudaMalloc(&devInfo, sizeof(int)));
// --- CUDA solver initialization
cusolverDnHandle_t solver_handle;
cusolverDnCreate(&solver_handle);
// --- Singular values threshold
double threshold = 1e-12;
// --- Setting the host, Nrows x Ncols matrix
double *h_A = (double *)malloc(Nrows * Ncols * sizeof(double));
for(int j = 0; j < Nrows; j++)
for(int i = 0; i < Ncols; i++)
h_A[j + i*Nrows] = (i + j*j) * sqrt((double)(i + j));
// --- Setting the device matrix and moving the host matrix to the device
double *d_A; gpuErrchk(cudaMalloc(&d_A, Nrows * Ncols * sizeof(double)));
gpuErrchk(cudaMemcpy(d_A, h_A, Nrows * Ncols * sizeof(double), cudaMemcpyHostToDevice));
// --- host side SVD results space
double *h_U = (double *)malloc(Nrows * Nrows * sizeof(double));
double *h_V = (double *)malloc(Ncols * Ncols * sizeof(double));
double *h_S = (double *)malloc(min(Nrows, Ncols) * sizeof(double));
// --- device side SVD workspace and matrices
double *d_U; gpuErrchk(cudaMalloc(&d_U, Nrows * Nrows * sizeof(double)));
double *d_V; gpuErrchk(cudaMalloc(&d_V, Ncols * Ncols * sizeof(double)));
double *d_S; gpuErrchk(cudaMalloc(&d_S, min(Nrows, Ncols) * sizeof(double)));
// --- CUDA SVD initialization
cusolveSafeCall(cusolverDnDgesvd_bufferSize(solver_handle, Nrows, Ncols, &work_size));
double *work; gpuErrchk(cudaMalloc(&work, work_size * sizeof(double)));
// --- CUDA SVD execution
cusolveSafeCall(cusolverDnDgesvd(solver_handle, 'A', 'A', Nrows, Ncols, d_A, Nrows, d_S, d_U, Nrows, d_V, Ncols, work, work_size, NULL, devInfo));
int devInfo_h = 0; gpuErrchk(cudaMemcpy(&devInfo_h, devInfo, sizeof(int), cudaMemcpyDeviceToHost));
if (devInfo_h != 0) std::cout << "Unsuccessful SVD execution\n\n";
// --- Moving the results from device to host
gpuErrchk(cudaMemcpy(h_S, d_S, min(Nrows, Ncols) * sizeof(double), cudaMemcpyDeviceToHost));
gpuErrchk(cudaMemcpy(h_U, d_U, Nrows * Nrows * sizeof(double), cudaMemcpyDeviceToHost));
gpuErrchk(cudaMemcpy(h_V, d_V, Ncols * Ncols * sizeof(double), cudaMemcpyDeviceToHost));
for(int i = 0; i < min(Nrows, Ncols); i++)
std::cout << "d_S["<<i<<"] = " << std::setprecision(15) << h_S[i] << std::endl;
printf("\n\n");
int count = 0;
bool flag = 0;
while (!flag) {
if (h_S[count] < threshold) flag = 1;
if (count == min(Nrows, Ncols)) flag = 1;
count++;
}
count--;
printf("The null space of A has dimension %i\n\n", min(Ncols, Nrows) - count);
for(int j = count; j < Ncols; j++) {
printf("Basis vector nr. %i\n", j - count);
for(int i = 0; i < Ncols; i++)
std::cout << "d_V["<<i<<"] = " << std::setprecision(15) << h_U[j*Ncols + i] << std::endl;
printf("\n");
}
cusolverDnDestroy(solver_handle);
return 0;
}
Utilities.cuh
#ifndef UTILITIES_CUH
#define UTILITIES_CUH
extern "C" int iDivUp(int, int);
extern "C" void gpuErrchk(cudaError_t);
extern "C" void cusolveSafeCall(cusolverStatus_t);
#endif
Utilities.cu
#include <stdio.h>
#include <assert.h>

#include "cuda_runtime.h"
#include <cuda.h>

#include <cusolverDn.h>

/*******************/
/* iDivUp FUNCTION */
/*******************/
extern "C" int iDivUp(int a, int b){ return ((a % b) != 0) ? (a / b + 1) : (a / b); }

/********************/
/* CUDA ERROR CHECK */
/********************/
// --- Credit to http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api
void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) { exit(code); }
    }
}

extern "C" void gpuErrchk(cudaError_t ans) { gpuAssert((ans), __FILE__, __LINE__); }

/**************************/
/* CUSOLVE ERROR CHECKING */
/**************************/
static const char *_cudaGetErrorEnum(cusolverStatus_t error)
{
    switch (error)
    {
        case CUSOLVER_STATUS_SUCCESS:
            return "CUSOLVER_SUCCESS";
        case CUSOLVER_STATUS_NOT_INITIALIZED:
            return "CUSOLVER_STATUS_NOT_INITIALIZED";
        case CUSOLVER_STATUS_ALLOC_FAILED:
            return "CUSOLVER_STATUS_ALLOC_FAILED";
        case CUSOLVER_STATUS_INVALID_VALUE:
            return "CUSOLVER_STATUS_INVALID_VALUE";
        case CUSOLVER_STATUS_ARCH_MISMATCH:
            return "CUSOLVER_STATUS_ARCH_MISMATCH";
        case CUSOLVER_STATUS_EXECUTION_FAILED:
            return "CUSOLVER_STATUS_EXECUTION_FAILED";
        case CUSOLVER_STATUS_INTERNAL_ERROR:
            return "CUSOLVER_STATUS_INTERNAL_ERROR";
        case CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED:
            return "CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED";
    }
    return "<unknown>";
}

inline void __cusolveSafeCall(cusolverStatus_t err, const char *file, const int line)
{
    if (CUSOLVER_STATUS_SUCCESS != err) {
        fprintf(stderr, "CUSOLVE error in file '%s', line %d: %s\nterminating!\n", file, line, _cudaGetErrorEnum(err));
        cudaDeviceReset();
        assert(0);
    }
}

extern "C" void cusolveSafeCall(cusolverStatus_t err) { __cusolveSafeCall(err, __FILE__, __LINE__); }
I think the most important thing for CUDA is to find an algorithm that doesn't depend on conditional branching (which is quite slow on graphics hardware). Simple if statements that can be optimized into conditional assignment are much better (or you can use the ?: operator).
If necessary, you should be able to do some form of pivoting using conditional assignment. It might actually be harder to determine how to store your result: if your matrix is rank-deficient, what do you want your CUDA program to do about it?
If you assume your 4x3 matrix is not actually rank-deficient, you can find your (single) null-space vector without any conditionals at all: the matrix is small enough that you can use Cramer's rule efficiently.
Actually, since you don't actually care about the scale of your null vector, you don't have to divide by the determinant -- you can just take the determinants of the minors:
    x1 x2 x3
M = y1 y2 y3
    z1 z2 z3
    w1 w2 w3
|y1 y2 y3| |x1 x2 x3| |x1 x2 x3| |x1 x2 x3|
-> x0 = |z1 z2 z3| y0 = -|z1 z2 z3| z0 = |y1 y2 y3| w0 = -|y1 y2 y3|
|w1 w2 w3| |w1 w2 w3| |w1 w2 w3| |z1 z2 z3|
Note that these 3x3 determinants are just triple products; you can save computation by reusing the cross products.
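A minimal sketch of that computation in plain C++ (the struct and function names are mine; a CUDA version would run the same arithmetic once per matrix):

#include <cstdio>

struct Vec3 { float x, y, z; };

Vec3 cross(Vec3 a, Vec3 b) {
    return { a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x };
}
float dot(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

// Null vector of the 4x3 matrix with rows x, y, z, w, via the four 3x3
// minor determinants. Each determinant is a triple product, and the
// cross products are shared between minors: 3 crosses + 4 dots in total.
void nullVector(Vec3 x, Vec3 y, Vec3 z, Vec3 w, float out[4]) {
    Vec3 zw = cross(z, w); // shared by the x0 and y0 minors
    Vec3 yw = cross(y, w);
    Vec3 yz = cross(y, z);
    out[0] =  dot(y, zw);  //  |y z w|
    out[1] = -dot(x, zw);  // -|x z w|
    out[2] =  dot(x, yw);  //  |x y w|
    out[3] = -dot(x, yz);  // -|x y z|
}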
"seems very expensive" - what data do you have that supports this?
Maybe Block Lanczos is the answer you seek.
Or maybe this.
Both JAMA and Apache Commons Math have SVD implementations in Java. Why not take those and try them out? Get some real data for your case instead of impressions. It won't cost you much, since the code is already written and tested.
I wondered if the matrices are related rather than just being random, so that the null spaces you are seeking can be considered to be like 1-dimensional tangents to a curve in N-space (N = 9). If so, you may be able to speed things up by using Newton's method to solve successive instances of the system of quadratic equations Ax = 0, |x|^2 = 1, starting from a previous null space vector. Newton's method uses first derivatives to converge to a solution, and so would use Gaussian elimination to solve 9x9 systems. Using this technique would require that you be able to make small steps from matrix to matrix by, say, varying a parameter.
So the idea is that you initialize using SVD on the first matrix, but thereafter you step from matrix to matrix, using the null space vector of one as the starting point for the iteration for the next one. You need one or two iterations to get convergence. If you don't get convergence, you use SVD to restart. If this situation is what you have, it is much faster than starting fresh on each matrix.
I used this a long time ago to map contours in the solutions of sets of 50 x 50 quadratic equations associated with the behavior of electric power systems.
