Adding two __m128 types via Accelerate framework - sse2

I need to add/mul/sub two __m128 (float) variables using the Accelerate framework, but I can't find functions to do that. All the Accelerate framework functions I've found take integer vector types instead of float vector types. I found a function for dividing, 'vdivf', but I need add/mul/sub too.
Can anyone tell me how to add/mul/sub two __m128 (float) variables using the Accelerate framework? Something like _mm_add_ps, _mm_sub_ps and _mm_mul_ps, but using the Accelerate framework API.

The problem is that Accelerate is a higher-level API than SSE2 intrinsics. SSE intrinsics map to single instructions which operate on one vector at a time. Accelerate provides a higher-level API of functions which operate at a much larger granularity, typically on arrays of a reasonable size. To port your existing code you should just stick with SSE intrinsics, and if you really do need PowerPC support then you'll need to #ifdef the SSE code and write an equivalent AltiVec implementation for the ppc build. I doubt this will be worth the effort, however - Apple stopped selling PowerPC Macs around 7 years ago, so the market for PowerPC apps must be very small by now.
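To give a feel for that granularity difference, the array-level Accelerate (vDSP) equivalent of an add looks roughly like this (a minimal sketch; the array names and length are just illustrative):

#include <Accelerate/Accelerate.h>

float a[1024], b[1024], c[1024];
// ... fill a and b ...
// c[i] = a[i] + b[i] for all 1024 elements; vDSP_vmul and vDSP_vsub follow the same pattern
vDSP_vadd(a, 1, b, 1, c, 1, 1024);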

You don't need an API for basic arithmetic:
#include <xmmintrin.h>  // GCC and Clang allow arithmetic operators directly on __m128

__m128 x, y;
__m128 z = x + y;  // add
__m128 w = x - y;  // subtract
__m128 t = x * y;  // multiply
An API would be totally unnecessary for these operations, so Accelerate doesn't have one.
That said, if you have existing code which uses the SSE intrinsics (_mm_add_ps, etc), and you're really trying to make "minimum code changes", why are you changing anything at all? The SSE intrinsics work just fine on OS X as well.

Related

Is batching the same function with SIMD instructions possible?

I have a scenario where many instances of exactly the same function (for simplicity, let's just consider C/C++ and Python here) will be executed at the same time on my machine. Intuitively I would just use multi-threading, treating each instance of the function as a thread, to exploit the parallelism; they do not contend for the same resources, but they do perform many branch operations (e.g., for loops). However, since they are actually the same function, I'm thinking about batching them using SIMD instructions, e.g., AVX-512. Of course, it should be automatic, so that users do not have to modify their code.
The reason? Because every thread/process/container/VM occupies resources, but with AVX a single instruction can do the work of several instances. So I can serve more users with the same hardware.
Most articles I find online focus on using AVX instructions inside a function, for example to accelerate stream data processing or to handle some large calculation. None of them mentions batching different instances of the same function.
I know there are some challenges, such as different execution paths caused by different inputs, and that it is not easy to turn a normal function into a batched version automatically, but I think it is technically possible.
Here are my questions:
1. Is it hard (or even possible) to automatically change a normal function into a batched version?
2. If it is possible, what restrictions should I put on the function to make it work? For example, should the function have only one execution path regardless of the data?
3. Are there other technologies that solve this problem better? I don't think a GPU is a good option for me because a GPU cannot support I/O or branch instructions, although its SIMT model fits my goal perfectly.
Thanks!
SSE/AVX is basically a vector unit: it allows simple operations (like +, -, *, /, and, or, XOR, etc.) on arrays of multiple elements at once. AVX1 and AVX2 have 256-bit registers, so you can process e.g. 8 32-bit singles at once, or 4 doubles. AVX-512 is coming, but is still quite rare at the moment.
So if your functions are all operations on arrays of basic types, it is a natural fit. Rewriting your function using AVX intrinsics is doable if the operations are very simple. Complex things (like mismatched vector widths), or doing it in assembler, are a challenge though.
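For instance, a simple elementwise function batched with AVX intrinsics might look like this (a minimal sketch; the function and array names are made up):

#include <immintrin.h>
#include <stddef.h>

// Multiply two float arrays elementwise, 8 lanes per AVX instruction.
void mul_arrays(float *a, const float *b, const float *c, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vc = _mm256_loadu_ps(c + i);
        _mm256_storeu_ps(a + i, _mm256_mul_ps(vb, vc));
    }
    for (; i < n; ++i)          // scalar tail for the leftover elements
        a[i] = b[i] * c[i];
}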
If your function is not operating on vectors then it becomes difficult, and the possibilities are mostly theoretical. Auto-vectorizing compilers can sometimes do this, but it is rare, limited, and extremely complex.
There are two ways to fix this: vectorization (SIMD) and parallelization (threads).
GCC can already do the SIMD vectorization you want provided that the function is inlined, and the types and operations are compatible (and it will automatically inline smallish functions without you asking it to).
E.g.
// Assuming globally visible arrays, e.g.:
// int somearray[ABIGNUMBER], someotherarray[ABIGNUMBER], athirdarray[ABIGNUMBER];
inline void func (int i) {
  somearray[i] = someotherarray[i] * athirdarray[i];
}

for (int i = 0; i < ABIGNUMBER; i++)
  func (i);
Vectorization and inlining are enabled at -O3.
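For example (assuming GCC on x86; -fopt-info-vec just asks GCC to report which loops it vectorized):
gcc -O3 -march=native -fopt-info-vec file.c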
If the functions are too complex, and/or GCC doesn't vectorize it yet, then you can use OpenMP or OpenACC to parallelize it.
OpenMP uses special markup to tell the compiler where to spawn threads.
E.g.
#pragma omp parallel
#pragma omp for
for (int i = 0; i < ABIGNUMBER; i++)
....
And yes, you can do that on a GPU too! You do have to do a bit more typing to get the data copied in and out correctly. Only the marked up areas run on the GPU. Everything else runs on the CPU, so I/O etc. is not a problem.
#pragma omp target map(somearray,someotherarray,athirdarray)
#pragma omp parallel
#pragma omp for
for (int i = 0; i < ABIGNUMBER; i++)
....
OpenACC is a similar idea, but more specialized towards GPUs.
You can find OpenMP and OpenACC compilers in many places. Both GCC and LLVM support NVidia GPUs. LLVM has some support for AMD GPUs, and there are unofficial GCC builds available too (with official support coming soon).
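For example, with a GCC built with NVPTX offload support, compiling typically looks something like this (the exact offload target name depends on how your toolchain was configured):
gcc -O3 -fopenmp -foffload=nvptx-none file.c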

GCC ARM SIMD intrinsics compiling to scalar instructions

I have a music synthesis app that runs on an RPi3 (Cortex-A53) in 32-bit mode, under a Yocto-based RTLinux. I'm using GCC 6.3 to compile the code, which uses tons of SIMD intrinsics in C++ to operate on float32x4_t and int32x4_t data. The code is instrumented so that I can see how long it takes to execute certain sizeable chunks of SIMD. It worked well until a couple of days ago, when all of a sudden, after fiddling with unrelated stuff, it slowed down by a factor of more than two.
I went in and looked at the code that was being generated. In the past, the code looked beautiful, very efficient. Now, it's not even using SIMD in most places. I checked the compiler options. They include -marm -mcpu=cortex-a53 -mfloat-abi=hard -mfpu=crypto-neon-fp-armv8 -O3. Occasionally you see a q register in the generated code, so it knows they exist, but mostly it operates on s registers. Furthermore, it uses lots of code to move pieces of q8-q15 (a.k.a. d16-d31) into general registers and then back into s0-s31 registers to operate on them, and then moves them back, which is horribly inefficient. Does anyone know any reason why the compiler should suddenly start compiling the float32x4_t and int32x4_t vector intrinsics into individual scalar ops? Or any way to diagnose this by getting the compiler to cough up some information about what's going on inside?
Edit: I found that in some places I was doing direct arithmetic on int32x4_t and float32x4_t types, while in other places I was using the ARM intrinsic functions. In the latter case, I was getting SIMD instructions but in the former it was using scalars. When I rewrote the code using all intrinsics, the SIMD instructions reappeared, and the execution time dropped close to what it was before. But I noticed that if I wrote something like x += y * z; the compiler would use scalars but was smart enough to use four VFMA instructions, while if I wrote x = vaddq_f32(x, vmulq_f32(y, z)); it would use VADDQ and VMULQ instructions. This explains why it isn't quite as fast as before when it was compiling arithmetic operators into SIMD.
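For reference, a minimal sketch (not my original code) of the forms side by side; whether vfmaq_f32 is usable depends on the -mfpu setting:

#include <arm_neon.h>

float32x4_t madd(float32x4_t x, float32x4_t y, float32x4_t z)
{
    // x += y * z;                         // operator form: currently gets scalarised for me
    // x = vaddq_f32(x, vmulq_f32(y, z));  // intrinsics: separate VMULQ + VADDQ
    return vmlaq_f32(x, y, z);             // single multiply-accumulate intrinsic (VMLA);
                                           // vfmaq_f32(x, y, z) gives a fused VFMA instead
}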
So the question is now: Why was the compiler willing to compile direct arithmetic on int32x4_t and float32x4_t values into quad SIMD operations before, but not any more? Is there some obscure option that I didn't realize I had in there, and am now missing?

Is a reduction or atomic operation on mat/vec types with OpenGL compute shader possible?

Is it possible to do reduction/update or atomic operations in a compute shader on e.g. mat3 or vec3 data types?
Like this scheme:
some_type mat3 A;
void main() {
A += mat3(1);
}
I have tried out to use shader storage buffer objects (SSBO) but it seems like the update is not atomic (at least I get wrong results when I read back the buffer).
Does anyone have an idea to realize this? Maybe creating a tiny 3x3 image2D and store the result by imageAtomicAdd in there?
There are buffer-based atomics in GLES 3.1.
https://www.khronos.org/registry/gles/specs/3.1/es_spec_3.1.pdf
Section 7.7.
As for creating a tiny 3x3 image2D and storing the result via imageAtomicAdd in there: image atomics are not core and require an extension.
Thank you for the links. I forgot to mention that I work with ARM Mali GPUs, which do not expose TLP and do not have warps/wavefronts like Nvidia or AMD. That is, I might have to figure out another quick way.
The techniques proposed in the comments on your post (in particular the log(N) approach where you fold the top half of the results down onto the bottom half) still work fine on Mali. The technique doesn't rely on warps/wavefronts - as the original poster said, you just need synchronization (e.g. use a barrier() rather than relying on the implicit synchronization that wavefronts would give you).
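For reference, a minimal sketch of that fold-the-top-half reduction in a GLES 3.1 compute shader (buffer layout, workgroup size and names are illustrative; a mat3 would be folded component-wise in the same way):

#version 310 es
precision highp float;
layout(local_size_x = 64) in;

layout(std430, binding = 0) buffer InputData  { float values[]; };
layout(std430, binding = 1) buffer OutputData { float partialSums[]; };

shared float scratch[64];

void main() {
    uint lid = gl_LocalInvocationID.x;
    scratch[lid] = values[gl_GlobalInvocationID.x];
    barrier();

    // Fold the top half of the workgroup onto the bottom half, log2(64) times.
    for (uint stride = 32u; stride > 0u; stride >>= 1) {
        if (lid < stride) {
            scratch[lid] += scratch[lid + stride];
        }
        barrier();
    }

    if (lid == 0u) {
        partialSums[gl_WorkGroupID.x] = scratch[0];   // one partial sum per workgroup
    }
}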

Portability and Optimization of OpenCL between Radeon Graphic Cards

I'm planning on diving into OpenCL and have been reading (surface-level so far) about what OpenCL can do, but I have a few questions.
Let's say I have an AMD Radeon 7750, and another computer with an AMD Radeon 5870, and no plans on using a computer with an Nvidia card. I've heard that optimizing the code for a particular device brings performance benefits. What exactly does optimizing mean? From what I've read, and a little bit of guessing, it sounds like it means writing the code in a way that a GPU in general likes (without concern for whether it's an AMD or Nvidia card), as well as in a way that matches how the graphics card handles memory (I'm guessing this is compute-device specific? Or is it only brand specific?).
So if I write code and optimize it for the Radeon 7750, would I be able to bring that code to the other computer with the Radeon 5870 and, without changing any of it, still retain a reasonable amount of the performance benefit from the optimization? In the event that the code doesn't work well, would changing parts of the code be a minor issue, or would it involve rewriting enough code that it would have been a better idea to write code optimized for the Radeon 5870 in the first place?
Without more information about the algorithms and applications you intend to write, the question is a little vague. But I think I can give you some high-level strategies to keep in mind as you develop your code for these two different platforms.
The Radeon 7750 is based on the new Graphics Core Next architecture, while your HD 5870 is based on the older VLIW5 (Cypress) architecture.
For your code to perform well on the HD 5870 hardware you must make as heavy use of the packed primitive datatypes as possible, especially the int4 and float4 types. This is because the OpenCL compiler has a difficult time automatically discovering parallelism and packing data into the wide vectors for you. If you can structure your code so that you have already taken this into account, then you will be able to fill more of the VLIW-5 slots and thus use more of your stream processors.
GCN is more like NVidia's Fermi architecture, where the code's path to the functional units (ALUs, etc.) of the stream processors does not go through explicitly scheduled VLIW instructions. So more parallelism can be automatically detected at runtime and keep your functional units busy doing useful work without you having to think as hard about how to make that happen.
Here's an over-simplified example to illustrate my point:
// multiply four factors
// A[0] = B[0] * C[0]
// ...
// A[3] = B[3] * C[3];
float *A, *B, *C;
for (int i = 0; i < 4; i++) {
  A[i] = B[i] * C[i];
}
That code will probably run ok on a GCN architecture (except for suboptimal memory access performance--an advanced topic). But on your HD5870 it would be a disaster, because those four multiplies would take up 4 VLIW5 instructions instead of 1! So you would write that above code using the float4 type:
float4 A, B, C;
A = B * C;
And it would run really well on both of your cards. Plus it would kick ass on a CPU OpenCL context and make great use of MMX/SSE wide registers, a bonus. It's also a much better use of the memory system.
In a tiny nutshell, using the packed primitives is the one thing I can recommend keeping in mind as you start deploying code on these two systems at the same time.
Here's one more example that more clearly illustrates what you need to be careful about on your HD 5870. Say we had implemented the previous example using separate work-units:
// multiply four factors
// as separate work units
// A = B * C
float A, B, C;
A = B * C;
And we had four separate work units instead of one. That would be an absolute disaster on the VLIW device, while the GCN device would show tremendously better performance. That is something you will also want to look for when you are writing your code - can you use float4 types to reduce the number of work units doing the same work? If so, then you will see good performance on both platforms.
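As a sketch (kernel and buffer names are made up), the float4-per-work-item version of the earlier example would be something like:

__kernel void mul4(__global const float4 *b,
                   __global const float4 *c,
                   __global float4 *a)
{
    size_t i = get_global_id(0);
    a[i] = b[i] * c[i];   // one packed float4 multiply per work-item
}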

xmmintrin.h vs gcc vector extensions

Which method should I prefer to write SIMD instructions?
The _mm_* functions from the *mmintrin.h headers seem to be more portable across compilers.
But the GCC vector extensions seem to produce much simpler code, and to support more architectures.
So which method is the best?
If you use the gcc vector extensions you will only be able to use a limited subset of SSE functionality, since there are many SSE intrinsics which do not fit in with a generic vector model such as gcc's. If you only want to do fairly basic stuff, e.g. floating point arithmetic on vectors, then you might get away with it, but if you are interested in exploiting SIMD for maximum performance benefit then you'll need to go with the native intrinsics.
The intrinsics available from the *mmintrin.h files are available only on SSE machines, but they work across different compilers. The GCC vector extensions are more limited, but are implemented on a wider range of platforms and are obviously GCC-specific.
As with everything, there is no 'best' answer; you'll have to choose one that fits your needs.
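To make the trade-off concrete, here is a minimal sketch of the same 4-float add written both ways (the type and function names are made up):

#include <xmmintrin.h>

typedef float v4sf __attribute__ ((vector_size (16)));   // GCC vector extension type

v4sf   add_generic(v4sf a, v4sf b)   { return a + b; }             // portable to any target GCC supports
__m128 add_sse(__m128 a, __m128 b)   { return _mm_add_ps(a, b); }  // SSE-only, but works across compilers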
