Is batching the same function with SIMD instructions possible? - gcc

I have a scenario where many instances of exactly the same function (for simplicity, let's just consider C/C++ and Python here) will be executed at the same time on my machine. Intuitively I would just use multi-threading and treat each instance of the function as a thread to exploit the parallelism; they do not contend for the same resources, but they do perform many branch operations (e.g., for loops). However, since they are actually the same function, I'm thinking about batching them using SIMD instructions, e.g., AVX-512. Of course, it should be automatic, so that users do not have to modify their code.
The reason? Every thread/process/container/VM occupies resources, but with AVX a single instruction can serve several instances at once, so I can host more users on the same hardware.
Most articles I find online focus on using AVX instructions inside a function, for example to accelerate stream data processing or to handle some large calculation. None of them mentions batching different instances of the same function.
I know there are challenges, such as different execution paths caused by different inputs, and that it is not easy to turn a normal function into a batched version automatically, but I think it is technically possible.
Here are my questions:
1. Is it hard (or even possible) to automatically change a normal function into a batched version?
2. If the answer to 1 is no, what restrictions should I put on the function to make it possible? For example, what if the function has only one execution path regardless of the data?
3. Are there other technologies that solve this problem better? I don't think a GPU is a good option for me, because a GPU cannot do I/O and handles branching poorly, although its SIMT model fits my goal perfectly.
Thanks!

SSE/AVX is basically a vector unit: it allows simple operations (like +, -, *, /, AND, OR, XOR, etc.) on arrays of multiple elements at once. AVX1 and AVX2 have 256-bit registers, so you can process e.g. 8 32-bit singles at once, or 4 doubles. AVX-512 is coming but still quite rare at the moment.
So if your functions are all operations on arrays of basic types, it is a natural fit. Rewriting your function using AVX intrinsics is doable if the operations are very simple. Complex cases (like mismatched vector widths), or doing it in assembler, are a challenge though.
If your function does not operate on vectors, it becomes difficult and the possibilities are mostly theoretical. Auto-vectorizing compilers can sometimes do this, but it is rare, limited, and extremely complex.
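To make the batching idea concrete, here is a minimal hand-written sketch (my own illustration, not something a compiler produces automatically): a scalar function and an AVX version that processes 8 instances per call. It assumes every instance follows the same single execution path and that the inputs can be gathered into one contiguous array; f_batched and its parameters are hypothetical names.

#include <immintrin.h>

// Scalar version: one call handles one user's input.
float f(float x) { return x * x + 1.0f; }

// Batched version: each iteration handles 8 inputs at once (256-bit AVX).
// Assumes n is a multiple of 8 and in/out do not overlap.
void f_batched(const float *in, float *out, int n) {
  const __m256 one = _mm256_set1_ps(1.0f);
  for (int i = 0; i < n; i += 8) {
    __m256 x = _mm256_loadu_ps(in + i);
    _mm256_storeu_ps(out + i, _mm256_add_ps(_mm256_mul_ps(x, x), one));
  }
}

As soon as different inputs take different branches, both sides of every branch have to be executed and blended with masks, which is exactly what makes doing this automatically hard.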

There are two ways to fix this: vectorization (SIMD) and parallelization (threads).
GCC can already do the SIMD vectorization you want provided that the function is inlined, and the types and operations are compatible (and it will automatically inline smallish functions without you asking it to).
E.g.
// somearray, someotherarray, and athirdarray are assumed to be globals of size ABIGNUMBER.
inline void func (int i) {
  somearray[i] = someotherarray[i] * athirdarray[i];
}

for (int i = 0; i < ABIGNUMBER; i++)
  func (i);
Vectorization and inlining are enabled at -O3.
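(For reference, a typical invocation would be something like gcc -O3 -march=native -fopt-info-vec file.c; the -fopt-info-vec flag makes GCC report which loops it managed to vectorize.)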
If the functions are too complex, and/or GCC doesn't vectorize them yet, then you can use OpenMP or OpenACC to parallelize them.
OpenMP uses special markup to tell the compiler where to spawn threads.
E.g.
#pragma omp parallel
#pragma omp for
for (int i = 0; i < ABIGNUMBER; i++)
....
And yes, you can do that on a GPU too! You do have to do a bit more typing to get the data copied in and out correctly. Only the marked-up regions run on the GPU; everything else runs on the CPU, so I/O etc. is not a problem.
#pragma omp target map(somearray,someotherarray,athirdarray)
#pragma omp parallel
#pragma omp for
for (int i = 0; i < ABIGNUMBER; i++)
....
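For completeness, here is a self-contained sketch with the elided loop body filled in the same way as the earlier example (the explicit map directions and the array sizes are my own assumptions):

#define ABIGNUMBER 1000000
float somearray[ABIGNUMBER], someotherarray[ABIGNUMBER], athirdarray[ABIGNUMBER];

void run (void) {
  // Copy the inputs to the device, run the loop there, copy the result back.
  #pragma omp target map(from: somearray) map(to: someotherarray, athirdarray)
  #pragma omp parallel for
  for (int i = 0; i < ABIGNUMBER; i++)
    somearray[i] = someotherarray[i] * athirdarray[i];
}

Built with -fopenmp and an offloading-enabled compiler, the marked region runs on the GPU while the rest of the program stays on the CPU.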
OpenACC is a similar idea, but more specialized towards GPUs.
You can find OpenMP and OpenACC compilers in many places. Both GCC and LLVM support NVidia GPUs. LLVM has some support for AMD GPUs, and there are unofficial GCC builds available too (with official support coming soon).

Related

OpenMP: what is the difference between "taskloop" and "omp for" performance wise?

"taskloop" is introduced in OpenMP 4.5. It can take clauses from both loop and task constructs (except depend clause AFAIK).
However, I'm wondering whether "taskloop" and "omp for" also differ performance-wise.
I think it may depend on the actual problem. To parallelize a for loop, omp for can be faster than tasks, because it offers several different scheduling schemes to match your needs. In my experience (solving a particular problem with the Clang 12 compiler), omp for produced slightly faster code than tasks (on a Ryzen 5 7800X).
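For reference, a minimal sketch of the two constructs being compared (the function and array names are just placeholders):

// Worksharing loop: the team's threads split the iterations according to
// the schedule clause.
void scale_for (float *a, const float *b, const float *c, int n) {
  #pragma omp parallel for schedule(static)
  for (int i = 0; i < n; i++)
    a[i] = b[i] * c[i];
}

// taskloop (OpenMP 4.5): one thread creates tasks for chunks of the
// iteration space and the whole team executes them.
void scale_taskloop (float *a, const float *b, const float *c, int n) {
  #pragma omp parallel
  #pragma omp single
  #pragma omp taskloop
  for (int i = 0; i < n; i++)
    a[i] = b[i] * c[i];
}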

GCC ARM SIMD intrinsics compiling to scalar instructions

I have a music synthesis app that runs on an RPi3 (Cortex-A53) in 32-bit mode, under a Yocto-based RTLinux. I'm using GCC 6.3 to compile the code, which uses tons of SIMD intrinsics in C++ to operate on float32x4_t and int32x4_t data. The code is instrumented so that I can see how long certain sizeable chunks of SIMD take to execute. It worked well until a couple of days ago, when, after I fiddled with unrelated things, it suddenly slowed down by a factor of more than two.
I went in and looked at the code that was being generated. In the past, the code looked beautiful, very efficient. Now, it's not even using SIMD in most places. I checked the compiler options. They include -marm -mcpu=cortex-a53 -mfloat-abi=hard -mfpu=crypto-neon-fp-armv8 -O3. Occasionally you see a q register in the generated code, so it knows they exist, but mostly it operates on s registers. Furthermore, it uses lots of code to move pieces of q8-q15 (a.k.a. d16-d31) into general registers and then back into s0-s31 registers to operate on them, and then moves them back, which is horribly inefficient. Does anyone know any reason why the compiler should suddenly start compiling the float32x4_t and int32x4_t vector intrinsics into individual scalar ops? Or any way to diagnose this by getting the compiler to cough up some information about what's going on inside?
Edit: I found that in some places I was doing direct arithmetic on int32x4_t and float32x4_t types, while in other places I was using the ARM intrinsic functions. In the latter case, I was getting SIMD instructions but in the former it was using scalars. When I rewrote the code using all intrinsics, the SIMD instructions reappeared, and the execution time dropped close to what it was before. But I noticed that if I wrote something like x += y * z; the compiler would use scalars but was smart enough to use four VFMA instructions, while if I wrote x = vaddq_f32(x, vmulq_f32(y, z)); it would use VADDQ and VMULQ instructions. This explains why it isn't quite as fast as before when it was compiling arithmetic operators into SIMD.
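To illustrate the two forms described above, plus the fused quad intrinsic that combines them (a hedged sketch; the function names are mine, and it assumes the fused intrinsics are available for the -mfpu=crypto-neon-fp-armv8 target):

#include <arm_neon.h>

// Operator form: GCC's vector extension accepts this, but as described
// above, GCC 6.3 was observed to lower it to scalar code (scalar VFMAs).
float32x4_t acc_ops (float32x4_t x, float32x4_t y, float32x4_t z) {
  return x + y * z;
}

// Intrinsic form: reliably emits quad-lane instructions, but written this
// way it is a separate multiply and add (VMULQ + VADDQ).
float32x4_t acc_intrin (float32x4_t x, float32x4_t y, float32x4_t z) {
  return vaddq_f32(x, vmulq_f32(y, z));
}

// Fused intrinsic: vfmaq_f32(a, b, c) computes a + b * c, giving a single
// quad fused multiply-add on ARMv8 NEON.
float32x4_t acc_fma (float32x4_t x, float32x4_t y, float32x4_t z) {
  return vfmaq_f32(x, y, z);
}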
So the question is now: Why was the compiler willing to compile direct arithmetic on int32x4_t and float32x4_t values into quad SIMD operations before, but not any more? Is there some obscure option that I didn't realize I had in there, and am now missing?

Alternative for dynamic parallelism for CUDA

I am very new to the CUDA programming model, and to programming in general, I suppose. I'm attempting to parallelize an expectation-maximization algorithm. I am working on a GTX 480, which has compute capability 2.0. At first I sort of assumed that there's no reason for the device to launch its own threads, but of course I was sadly mistaken. I came across this pdf.
http://docs.nvidia.com/cuda/pdf/CUDA_Dynamic_Parallelism_Programming_Guide.pdf
Unfortunately, dynamic parallelism only works on the latest and greatest GPUs, with compute capability 3.5. Without diving into too many specifics, what is the alternative to dynamic parallelism? The loops in the CPU EM algorithm have many dependencies and are highly nested, which seems to make dynamic parallelism an attractive feature. I'm not sure if my question makes sense, so please ask if you need clarification.
Thank you!
As indicated by @JackOLantern, dynamic parallelism can be described in a nutshell as the ability to call a kernel (i.e., a __global__ function) from device code (a __global__ or __device__ function).
Since the kernel call is the principal method by which the machine spins up multiple threads in response to a single function call, there is really no direct alternative that provides all the capability of dynamic parallelism on a device that does not support it (i.e., pre-cc 3.5 devices).
Without dynamic parallelism, your overall code will almost certainly involve more synchronization and communication between CPU code and GPU code.
The principal method would be to identify some unit of your code that is parallelizable, convert it to a kernel, and work through your code in an essentially non-nested fashion. Repetitive functions might be handled by looping inside the kernel, or else by looping in the host code that calls the kernel.
For a pictorial example of what I am trying to describe, please refer to slide 14 of this deck which introduces some of the new features of CUDA 5 including dynamic parallelism. The code architecture on the right is an algorithm realized with dynamic parallelism. The architecture on the left is the same function realized without dynamic parallelism.
I have checked your algorithm on Wikipedia and I'm not sure you need dynamic parallelism at all.
You do the expectation step in your kernel, call __syncthreads(), do the maximization step, and call __syncthreads() again. From this distance, the expectation step looks like a reduction primitive and the maximization step looks like a filter.
If that doesn't work and you need real task parallelism, a GPU may not be the best choice. While Kepler GPUs can do this to some degree, it is not what the architecture is designed for. In that case you might be better off using a multi-CPU system, such as an office grid, a supercomputer, or a Xeon Phi accelerator. You should also look at OpenMP and MPI, which are the standard tools for task-parallel programming (OpenMP is actually just a handful of pragmas in most cases).

What is the relationship between vectorization and embarrassingly parallel?

The question says it all. It seems to me that vectorization is very closely related to embarrassingly parallel problems. In other words, all vectorizable programs must be embarrassingly parallel programs. Is this correct?
A quick summary of embarrassing parallelism:
Code is embarrassingly parallel if it can be parallelized without any effort, especially without handling data dependencies. Note that embarrassing parallelism only means that the code can be safely parallelized without effort; it doesn't guarantee optimal performance.
A simple example would be the element-wise addition of two vectors.
// A, B, and C are all distinct arrays.
for (int i = 0; i < N; ++i)
C[i] = A[i] + B[i];
This code is embarrassingly parallel because there is no data dependency on C. This code can be simply parallelized, for example, by using OpenMP:
#pragma omp parallel for
for (int i = 0; i < N; ++i)
C[i] = A[i] + B[i];
Vectorization is a particular form of parallelism. It mostly uses the dedicated SIMD execution units in processors, via specialized instructions such as x86 SSE/AVX and ARM NEON. Compilers may automatically vectorize your code, or you can vectorize manually using intrinsics or direct assembly code.
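As a hedged sketch of the manual-intrinsics route, here is the same loop written with x86 SSE (assuming N is a multiple of 4 and the arrays hold floats):

#include <xmmintrin.h>

void add_arrays (float *C, const float *A, const float *B, int N) {
  // Each iteration adds 4 floats at once using 128-bit SSE registers.
  for (int i = 0; i < N; i += 4) {
    __m128 a = _mm_loadu_ps(A + i);
    __m128 b = _mm_loadu_ps(B + i);
    _mm_storeu_ps(C + i, _mm_add_ps(a, b));
  }
}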
I don't think that vectorization necessarily requires the code being vectorized to be embarrassingly parallel. But in practice, most vectorizable code is embarrassingly parallel, because virtually all SIMD instructions assume that; I can't find any SIMD instructions that allow data-dependent operations. In that sense, yes, you can say that vectorizable programs need to be embarrassingly parallel.
However, in a broad sense, vectorization also embraces GPGPU-style SIMD programming such as Nvidia's CUDA and Intel's MIC architecture, which allow more flexible SIMD operations by handling data-dependent operations and branches.
To summarize: under a narrow definition of vectorization, i.e., vectorization for conventional CPU SIMD operations, I believe that vectorizable programs need to be embarrassingly parallel. However, more advanced forms of SIMD/vectorization can enable data-dependent operations and arbitrary branches.
Embarrassingly parallel problems are tasks that require no effort to write in parallel form. Vectorisation is the process by which a conventional procedure becomes parallel.
So, this isn't a case of one being a logical sub-type of another, but a trend. The closer a problem is to being embarrassingly parallel, the less vectorisation is required.
Parallelism is about using our computing devices as much as possible, so we try to schedule tasks in such a way that their execution is almost independent of each other.
Vectorization, in parallel computing, is a special case of parallelization in which software that by default performs one operation at a time on a single thread is modified to perform multiple operations simultaneously.
Vectorization is the more limited process of converting a program from a scalar implementation, which processes a single pair of operands at a time, to a vector implementation, which performs one operation on multiple pairs of operands at once.
Let's take the simple example of adding two arrays.
In the normal case we need a loop to add the two arrays. We can parallelize this in many ways; for example, we can create two threads and give each half of the arrays. The maximum number of threads that can still improve performance equals the number of elements in the array.
With vectorization, we should not need to loop over the array to add it: a single instruction should add both arrays. To achieve that, the array addition has to be restructured so that it needs no loop, with every element-wise addition happening at the same time under a single instruction.
For a routine to be completely vectorized, it must be embarrassingly parallel, as in the example above. But it depends on the given scenario: two routines that have interdependencies between them can still be individually vectorized, etc.

How do modern compilers use mmx/3dnow/sse instructions?

I've been reading up on the x86 instruction set extensions, and they only seem useful in some quite specific circumstances (e.g. HADDPD, horizontal add packed double, in SSE3). These require a certain register layout that needs to be either deliberately set up or produced by the series of instructions before it. How often do general-purpose compilers like gcc actually use these instructions (or a subset thereof), or are they mainly meant for hand-coded assembler? How does the compiler detect where it is appropriate to use SIMD instructions?
Generally, few compilers use them. GCC and Visual Studio aren't usually able to use the SIMD instructions. If you enable SSE as a compiler flag, they will use the scalar SSE instructions for regular floating-point operations, but generally, don't expect the vectorized ones to be used automatically. Recent versions of GCC might be able to use them in some cases, but it didn't work the last time I tried. Intel's C++ compiler is the only big compiler I know of that is able to auto-vectorize some loops.
In general, though, you'll have to use them yourself, either in raw assembler or via compiler intrinsics. In general, I'd say intrinsics are the better approach, since they let the compiler understand the code better, and so schedule and optimize it; but in practice, I know MSVC at least doesn't always generate very efficient code from intrinsics, so plain asm may be the best solution there. Experiment and see what works. But don't expect the compiler to use these instructions for you unless you 1) use the right compiler, and 2) write fairly simple loops that can be trivially vectorized.
Update 2012
Ok, so three years have passed since I wrote this answer. GCC has been able to auto-vectorize (simple) code for a couple of years now, and in VS2012, MSVC finally gains the same capability. Of course, the main part of my answer still applies: compilers can still only vectorize fairly trivial code. For anything more complex, you're stuck fiddling with intrinsics or inline asm.
Mono can use SIMD extensions as long as you use its classes for vectors. You can read about it here: http://tirania.org/blog/archive/2008/Nov-03.html
GCC should do some automatic vectorisation as long as you're using -O3 or a specific flag. They have an info page here: http://gcc.gnu.org/projects/tree-ssa/vectorization.html
The question of how to exploit SSE and other small vector units automatically (without direction from the programmer in the form of special language constructs or specially blessed compiler "intrinsics") has been a topic of compiler research for some time. Most results seem to be specialized to a particular problem domain, such as digital signal processing. I have not kept up with the literature on this topic, but what I have read suggests that exploiting the vector (SSE) unit is still a topic for research, and that one should have low expectations of general-purpose compilers commonly used in the field.
Suggested search term: vectorizing compiler
I have seen GCC use SSE to zero out a default std::string object. Not a particularly powerful use of SSE, but it exists. In most cases, though, you will have to write your own.
I know this because I had allowed the stack to become unaligned and it crashed, otherwise I probably wouldn't have noticed!
If you use the Vector Pascal compiler you will get efficient SIMD code for types for which SIMD gives an advantage; basically this is anything of length less than 64 bits. (For 64-bit reals it is actually slower to use SIMD.)
The latest versions of the compiler will also automatically parallelise across cores.
