Coding with vectors using the Accelerate framework

I'm playing around with the Accelerate framework for the first time, with the goal of adding some vectorized code to an iOS application. I've never worked with vectors in Objective-C or C. Coming from some experience with MATLAB, I wonder whether using Accelerate is really that much more of a pain. Suppose I'd want to calculate the following:
b = 4*(sin(a/2))^2 where a and b are vectors.
MATLAB code:
a = 1:4;
b = 4*(sin(a/2)).^2;
However, as I see it after some sifting through the documentation, things are quite different using Accelerate.
My C implementation:
#include <Accelerate/Accelerate.h>

float a[4] = {1, 2, 3, 4};                    // input vector
int len = 4;
float div = 2;                                // scalar divisor
float a2[4];                                  // intermediate result 1: a/2
vDSP_vsdiv(a, 1, &div, a2, 1, len);           // divide by scalar
float sinResult[4];                           // intermediate result 2: sin(a/2)
vvsinf(sinResult, a2, &len);                  // take sine
float sqResult[4];                            // intermediate result 3: sin(a/2)^2
vDSP_vsq(sinResult, 1, sqResult, 1, len);     // square
float factor = 4;                             // scalar multiplier
float b[4];                                   // answer vector
vDSP_vsmul(sqResult, 1, &factor, b, 1, len);  // multiply by scalar
// unset all the variables I didn't actually need
Honestly, I don't know what's worse here: keeping track of all the intermediate steps, memorizing how arguments are passed in vDSP versus vecLib (quite different), or that it takes so much time to do something quite trivial.
I really hope I am missing something here and that most steps can be merged or shortened. Any recommendations on coding resources, good coding habits (learned the hard way or from a book), etc. would be very welcome! How do you all deal with multiple lines of vector calculations?

I guess you could write it that way, but it seems awfully complicated to me. I like this better (Intel-specific, but it can easily be abstracted for other architectures):
#include <Accelerate/Accelerate.h>
#include <immintrin.h>
const __m128 a = {1,2,3,4};
const __m128 sina2 = vsinf(a*_mm_set1_ps(0.5));
const __m128 b = _mm_set1_ps(4)*sina2*sina2;
Also, just to be pedantic, what you're doing here is not linear algebra. Linear algebra involves only linear operations (no squaring, no transcendental operations like sin).
Edit: as you noted, the above won't quite work out of the box on iOS; the biggest issue is that there is no vsinf (vMathLib is not available in Accelerate on iOS). I don't have the SDK installed on my machine to test, but I believe that something like the following should work:
#include <Accelerate/Accelerate.h>
const vFloat a = {1, 2, 3, 4};
const vFloat a2 = a*(vFloat){0.5,0.5,0.5,0.5};
const int n = 4;
vFloat sina2;
vvsinf((float *)&sina2, (const float *)&a2, &n);
const vFloat b = sina2*sina2*(vFloat){4,4,4,4};
Not quite as pretty as what is possible with vMathLib, but still fairly compact.
In general, a lot of basic arithmetic operations on vectors just work; there's no need to use calls to any library, which is why Accelerate doesn't go out of its way to supply those operations cleanly. Instead, Accelerate usually tries to provide operations that aren't immediately available by other means.
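For example (a minimal sketch; vFloat is a Clang/GCC vector-extension type pulled in via Accelerate, so element-wise arithmetic compiles straight to SIMD instructions with no function calls):

#include <Accelerate/Accelerate.h>

const vFloat x = {1, 2, 3, 4};
const vFloat y = {5, 6, 7, 8};
const vFloat z = x*y + y;   // element-wise multiply and add, no library call needed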

To answer my own question:
In iOS 6, vMathLib will be introduced. As Stephen clarified, vMathLib could already be used on OS X, but it was not available on iOS. Until now.
The functions that vMathLib provides will allow for easier vector calculations.
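A sketch of the original computation once vMathLib's vsinf is available (assuming the iOS 6 interface matches the OS X one in vfp.h):

#include <Accelerate/Accelerate.h>

const vFloat a = {1, 2, 3, 4};
const vFloat sina2 = vsinf(a * (vFloat){0.5f, 0.5f, 0.5f, 0.5f}); // sin(a/2)
const vFloat b = (vFloat){4, 4, 4, 4} * sina2 * sina2;            // 4*sin(a/2)^2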

Related

Vulkan compute shader for parallel sum reduction

I want to implement this algorithm https://dournac.org/info/gpu_sum_reduction in Vulkan's compute shader. In OpenCL it would be easy, because I can explicitly declare which buffers are __local and which are __global. Unfortunately, I can't seem to find any such mechanism in Vulkan. Could somebody more experienced show me an example of how to get such things working in Vulkan, please?
Subgroups in Vulkan offer equivalent functionality: shader invocations within a subgroup can cooperate with one another.
Maybe something like:
void main() {
    int partial_sum = subgroupAdd(arr[gl_GlobalInvocationID.x]);
    if (subgroupElect()) {
        atomicAdd(mem, partial_sum);
    }
}
You could study the Subgroup Tutorial.
Then again, you could just try going about it the "normal" way by simply:
void main() {
    int partial_sum = 0;
    for (int i = 0; i < LOCAL_SIZE; ++i) {
        partial_sum += arr[gl_WorkGroupID.x * gl_WorkGroupSize.x + i];
    }
    atomicAdd(mem, partial_sum); // or perhaps without atomics, by recursively reducing
}
It is not so much "parallel", but then again it needs no barriers. It is just a matter of measuring performance to find what works best, and it might also depend on how large you assume the input arrays are.
Disclaimer: I have not tried the shaders, so assume they are kind of a pseudocode and can have bugs.
It should also be possible to implement your linked algorithm nearly verbatim. The equivalent of __local in compute GLSL is shared. The workgroup execution barrier in GLSL is barrier(), paired with memoryBarrierShared() to order the shared-memory accesses.
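A hedged sketch of that verbatim port (the buffer bindings, the workgroup size, and the names arr/mem are illustrative assumptions matching the snippets above):

layout(local_size_x = 256) in;
layout(std430, binding = 0) buffer InBuf  { int arr[]; };
layout(std430, binding = 1) buffer OutBuf { int mem; };
shared int sdata[256];

void main() {
    uint lid = gl_LocalInvocationID.x;
    sdata[lid] = arr[gl_GlobalInvocationID.x];
    memoryBarrierShared(); barrier();
    // halve the active region each step, as in the linked OpenCL version
    for (uint s = gl_WorkGroupSize.x / 2u; s > 0u; s >>= 1u) {
        if (lid < s) sdata[lid] += sdata[lid + s];
        memoryBarrierShared(); barrier();
    }
    if (lid == 0u) atomicAdd(mem, sdata[0]); // one partial sum per workgroup
}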

PyOpenCL - Multi-dimensional reduction kernel

I'm a total newbie to OpenCL.
I'm trying to code a reduction kernel that sums along one axis of a multi-dimensional array. I stumbled upon this code, which comes from here: https://tmramalho.github.io/blog/2014/06/16/parallel-programming-with-opencl-and-python-parallel-reduce/
__kernel void reduce(__global float *a, __global float *r, __local float *b) {
    uint gid = get_global_id(0);
    uint wid = get_group_id(0);
    uint lid = get_local_id(0);
    uint gs = get_local_size(0);

    b[lid] = a[gid];
    barrier(CLK_LOCAL_MEM_FENCE);

    for (uint s = gs/2; s > 0; s >>= 1) {
        if (lid < s) {
            b[lid] += b[lid+s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0) r[wid] = b[lid];
}
I don't understand the for-loop part. I get that uint s = gs/2 means we split the array in half, but then it is a complete mystery. Without understanding it, I can't really implement another version, e.g. for taking the maximum of an array, let alone for multi-dimensional arrays.
Furthermore, as far as I understand, the reduce kernel needs to be rerun another time if "N is bigger than the number of cores in a single unit".
Could you give me further explanations on that whole piece of code? Or even guidance on how to implement it for taking the max of an array?
Complete code can be found here: https://github.com/tmramalho/easy-pyopencl/blob/master/008_localreduce.py
Your first question is about the meaning of the for loop:
for(uint s = gs/2; s > 0; s >>= 1)
It means that you divide the local size gs by 2, and keep dividing by 2 (the shift s >>= 1 is equivalent to s = s/2) while s > 0, so the last pass runs with s = 1. In each pass, the first s work-items each add the element s positions ahead into their own slot (b[lid] += b[lid+s]), so the region still holding partial sums halves every pass until the whole group's sum ends up in b[0]. This algorithm depends on your array's size being a power of 2; otherwise you'd have to deal with the excess over a power of 2 until you have reduced the whole array, or pad your array up to a power-of-2 size with values that are neutral for the reduction.
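To adapt this for a maximum instead of a sum, only the combine operation changes. A hedged sketch (assuming float data and a power-of-two local size, as above):

__kernel void reduce_max(__global const float *a, __global float *r, __local float *b) {
    uint gid = get_global_id(0);
    uint wid = get_group_id(0);
    uint lid = get_local_id(0);
    uint gs = get_local_size(0);

    b[lid] = a[gid];
    barrier(CLK_LOCAL_MEM_FENCE);

    for (uint s = gs/2; s > 0; s >>= 1) {
        if (lid < s) {
            b[lid] = fmax(b[lid], b[lid + s]); // combine with max instead of +
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0) r[wid] = b[0]; // one partial maximum per work-group
}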
About your second concern: when N is bigger than what a single work-group can handle, you are right, you have to run the reduction in portions that fit and then merge the results (r holds one partial result per work-group, so you reduce r again until a single value remains).
Finally, when you ask for guidance on how to implement a reduction to get the max of an array, I would suggest the following:
For a simple reduction like max or sum, try using numpy, especially if you need the reduction along a particular axis.
If you think that the GPU would give you an advantage, try first using pyopencl's Multidimensional Array functionality, e.g. max (see the sketch after this list).
If the reduction is more math-intensive, try using pyopencl's Parallel Algorithms, e.g. reduction.
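A minimal sketch of the pyopencl.array route (the array contents and sizes here are arbitrary):

import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a_gpu = cl_array.to_device(queue, np.random.rand(1024).astype(np.float32))
m = cl_array.max(a_gpu)   # reduction runs on the device
print(m.get())            # transfer the scalar result back to the host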
I think that the whole point of using pyopencl is to avoid dealing with the underlying GPU's architecture. Otherwise, it is easier to deal with CUDA or HIP directly instead of OpenCL.

Eigen: Reduction operations very slow with custom scalar type

This is an issue that came up when profiling my convex optimization library. It uses a non-primitive custom Scalar type. We found that reduction operations like sum() and squaredNorm() are very slow compared to raw loops.
Here's a minimal example to reproduce the issue:
#include "epigraph.hpp"
size_t T = 200;
cvx::OptimizationProblem qp;
cvx::MatrixX u = qp.addVariable("u", 1, T);
// this takes very long
cvx::Scalar u_sum = u.sum();
// a raw loop is about 5x faster
cvx::Scalar sum = 0.;
for (int i = 0; i < u.size(); i++)
{
sum += u(i);
}
I did some profiling but could not get to the root of the issue, since the reduction implementations are somewhat convoluted. I suspect that internal optimizations which turn out to be slower in this case lead to a lot of allocations. Is there a way to turn those optimizations off?
[Profiler output screenshot omitted]
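One knob worth trying (an editor's sketch, not a confirmed fix): Eigen chooses unrolling and vectorization strategies for reductions based on NumTraits<Scalar>, so specializing it for the custom scalar, if Epigraph does not already do so, and declaring large operation costs may push Eigen toward the plain evaluation path. The cost values below are illustrative assumptions, not measured numbers.

namespace Eigen {
template <>
struct NumTraits<cvx::Scalar> : NumTraits<double> {
    typedef cvx::Scalar Real;
    typedef cvx::Scalar NonInteger;
    typedef cvx::Scalar Nested;
    enum {
        IsComplex = 0,
        IsInteger = 0,
        IsSigned = 1,
        RequireInitialization = 1, // non-trivial constructor: disables some shortcuts
        ReadCost = 10,             // illustrative guesses; tune by profiling
        AddCost = 200,
        MulCost = 200
    };
};
} // namespace Eigen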

Eigen: map non-contiguous data in an array with stride

I have a data array (double *) in memory which looks like:
[x0,y0,z0,junk,x1,y1,z1,junk,...]
I would like to map it to an Eigen vector and virtually remove the junk values by doing something like:
Eigen::Map<
    Eigen::Matrix<double, Eigen::Dynamic, 1, Eigen::ColMajor>,
    Eigen::Unaligned,
    Eigen::OuterStride<4>
> v(data, n);
But it does not work, because the OuterStride seems to be restricted to 2D matrices.
Is there a trick to do what I want?
Many thanks!
With the head of Eigen, you can map it as a 2D matrix and then view it as a 1D vector:
auto m1 = Matrix<double,3,Dynamic>::Map(ptr, 3, n, OuterStride<4>());
auto v = m1.reshaped(); // new in future Eigen 3.4
But be aware accesses to such a v involve costly integer division/modulo.
If you want a solution compatible with Eigen 3.3, you can do something like this
VectorXd convert(double const* ptr, Index n)
{
    VectorXd res(n*3);
    Matrix3Xd::Map(res.data(), 3, n) = Matrix4Xd::Map(ptr, 4, n).topRows<3>();
    return res;
}
But this of course would copy the data, which you probably intended to avoid.
Alternatively, you should think about whether it is possible to access your data as a 3xN array/matrix instead of a flat vector, as sketched below (this really depends on what you are actually doing).
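A minimal sketch of that 3xN view, compatible with Eigen 3.3 (ptr and n as above):

Eigen::Map<const Eigen::Matrix3Xd, Eigen::Unaligned, Eigen::OuterStride<4>>
    pts(ptr, 3, n);

double x1 = pts(0, 1);            // x-coordinate of point 1, junk column skipped
Eigen::Vector3d p2 = pts.col(2);  // point 2 as a 3-vector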

Random Number Generator in CUDA

I've struggled with this all day. I am trying to get a random number generator for threads in my CUDA code. I have looked through all the forums, and yes this topic comes up a fair bit, but I've spent hours trying to unravel all sorts of code to no avail. If anyone knows of a simple method, probably a device function that can be called to return a random float between 0 and 1, or an integer that I can transform, I would be most grateful.
Again, I hope to use the random number in the kernel, just like rand() for instance.
Thanks in advance
For anyone interested, you can now do it via cuRAND.
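A minimal sketch of cuRAND's device API (using the default XORWOW-based curandState; names per NVIDIA's cuRAND documentation):

#include <curand_kernel.h>

__global__ void setup_states(curandState *states, unsigned long long seed)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    // same seed, distinct subsequence per thread
    curand_init(seed, id, 0, &states[id]);
}

__global__ void draw_uniform(curandState *states, float *out)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    curandState local = states[id];    // copy state to registers for speed
    out[id] = curand_uniform(&local);  // float in (0, 1]
    states[id] = local;                // save state for the next call
}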
I'm not sure I understand why you need anything special. Any traditional PRNG should port more or less directly. A linear congruential should work fine. Do you have some special properties you're trying to establish?
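For instance, a straight port of a linear congruential generator (a sketch; constants from Numerical Recipes, per-thread seeding left to the caller):

__device__ unsigned int lcg_next(unsigned int *state)
{
    // classic 32-bit LCG step; each thread keeps its own state
    *state = 1664525u * (*state) + 1013904223u;
    return *state;
}

__device__ float lcg_uniform(unsigned int *state)
{
    // map the top 24 bits to a float in [0, 1)
    return (lcg_next(state) >> 8) * (1.0f / 16777216.0f);
}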
The best way for this is writing your own device function; here is one:
__device__ void RNG()
{
    unsigned int m_w = 150;
    unsigned int m_z = 40;

    for (int i = 0; i < 100; i++)
    {
        m_z = 36969 * (m_z & 65535) + (m_z >> 16);
        m_w = 18000 * (m_w & 65535) + (m_w >> 16);
        printf("%u\n", (m_z << 16) + m_w); /* 32-bit result; device-side printf needs compute capability 2.0+ */
    }
}
It'll give you 100 random 32-bit numbers.
If you want some random numbers between 1 and 1000, you can also take the result%1000, either at the point of consumption, or at the point of generation:
((m_z << 16) + m_w)%1000
Changing the m_w and m_z starting values (in the example, 150 and 40) allows you to get different results each time. You can use threadIdx.x as one of them, which should give you a different pseudorandom series in each thread.
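For example (a sketch of per-thread seeding along those lines; the +1 offsets are arbitrary, just to avoid an all-zero seed):

__global__ void rng_kernel(unsigned int *out)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int m_w = threadIdx.x + 1;  // per-thread seeds
    unsigned int m_z = id + 1;

    m_z = 36969 * (m_z & 65535) + (m_z >> 16);
    m_w = 18000 * (m_w & 65535) + (m_w >> 16);
    out[id] = (m_z << 16) + m_w;         // one draw per thread
}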
I wanted to add that it runs about 2 times faster than the rand() function, and it works great ;)
I think any discussion of this question needs to answer Zenna's original request, and that is for a thread-level implementation: specifically, a device function that can be called from within a kernel or thread. Sorry if I overdid the emphasis, but I really think the answers so far are not quite addressing what is being sought here.
The cuRAND library is your best bet. I appreciate that people want to reinvent the wheel (it makes one appreciate and more properly use third-party libraries), but high-performance, high-quality number generators are plentiful and well tested. The best info I can recommend is the documentation for the GSL library on the different generators here: http://www.gnu.org/software/gsl/manual/html_node/Random-number-generator-algorithms.html
For any serious code, it is best to use one of the main algorithms that mathematicians and computer scientists have beaten into the ground over and over, looking for systemic weaknesses. The "Mersenne Twister" has a period (repeat loop) on the order of 10^6000 (the MT19937 algorithm is named for its period of 2^19937 - 1) and has been specially adapted for Nvidia to use at the thread level within threads of the same warp, using thread-ID calls as seeds. See the paper here: http://developer.download.nvidia.com/compute/cuda/2_2/sdk/website/projects/MersenneTwister/doc/MersenneTwister.pdf. I am actually working to implement something using this library, and if I get it to work properly I will post my code. Nvidia has some examples on their documentation site for the current CUDA toolkit.
NOTE: Just for the record I do not work for Nvidia, but I will admit their documentation and abstraction design for CUDA is something I have so far been impressed with.
Depending on your application you should be wary of using LCGs without considering whether the streams (one stream per thread) will overlap. You could implement a leapfrog with LCG, but then you would need to have a sufficiently long period LCG to ensure that the sequence doesn't repeat.
An example leapfrog could be:
template <typename ValueType>
__device__ void leapfrog(unsigned long &a, unsigned long &c, int leap)
{
    // compose 'leap' LCG steps into one: a -> a^leap, c -> c*(a^leap - 1)/(a - 1)
    unsigned long an = a;
    for (int i = 1; i < leap; i++)
        an *= a;
    c = c * ((an - 1) / (a - 1));
    a = an;
}

template <typename ValueType>
__device__ ValueType quickrand(unsigned long &seed, const unsigned long a, const unsigned long c)
{
    seed = seed * a + c; // standard LCG step
    return seed;
}
template <typename ValueType>
__global__ void mykernel(unsigned long *d_seeds)
{
    const int tid = threadIdx.x; // this thread's index within the block
    const int bid = blockIdx.x;  // this block's index

    // RNG parameters
    unsigned long a = 1664525L;
    unsigned long c = 1013904223L;
    unsigned long ainit = a;
    unsigned long cinit = c;
    unsigned long seed;

    // Generate local seed
    seed = d_seeds[bid];
    leapfrog<ValueType>(ainit, cinit, tid);
    quickrand<ValueType>(seed, ainit, cinit);
    leapfrog<ValueType>(a, c, blockDim.x);
    ...
}
But then the period of that generator is probably insufficient in most cases.
To be honest, I'd look at using a third party library such as NAG. There are some batch generators in the SDK too, but that's probably not what you're looking for in this case.
EDIT
Since this just got up-voted, I figure it's worth updating to mention that cuRAND, as mentioned by more recent answers to this question, is available and provides a number of generators and distributions. That's definitely the easiest place to start.
There's an MDGPU package (GPL) which includes an implementation of the GNU rand48() function for CUDA here.
I found it (quite easily, using Google, which I assume you tried :-) on the NVidia forums here.
I haven't found a good parallel random number generator for CUDA; however, I did find a parallel random number generator based on academic research here: http://sprng.cs.fsu.edu/
You could try out Mersenne Twister for GPUs
It is based on the SIMD-oriented Fast Mersenne Twister (SFMT), which is a quite fast and reliable random number generator. It passes Marsaglia's DIEHARD tests for random number generators.
In case you're using cuda.jit in Numba for Python, this random number generator is useful; for example:
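A minimal sketch using Numba's built-in xoroshiro128+ generator (names per numba.cuda.random; the sizes and seed are arbitrary):

import numpy as np
from numba import cuda
from numba.cuda.random import (create_xoroshiro128p_states,
                               xoroshiro128p_uniform_float32)

@cuda.jit
def fill_uniform(states, out):
    i = cuda.grid(1)
    if i < out.size:
        # one generator state per thread, uniform float32 draw
        out[i] = xoroshiro128p_uniform_float32(states, i)

n = 1024
states = create_xoroshiro128p_states(n, seed=42)
out = np.zeros(n, dtype=np.float32)
fill_uniform[n // 256, 256](states, out)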
