What is the relationship between vectorization and embarrassingly parallel? - parallel-processing

The question says it all. It seems to me that vectorization is very closely related to embarrassingly parallel problems. In other words, all vectorizable programs must be embarrassingly parallel programs. Is this correct?

A quick summary of embarrassing parallelism:
Code is embarrassingly parallel if it can be parallelized without any effort, in particular without having to handle data dependencies. Note that embarrassing parallelism only means the code can be safely parallelized without effort; it doesn't guarantee optimal performance.
A simple example would be the element-wise addition of two vectors.
// A, B, and C are all distinct arrays.
for (int i = 0; i < N; ++i)
    C[i] = A[i] + B[i];
This code is embarrassingly parallel because there is no data dependency across iterations (each C[i] is computed independently). It can be simply parallelized, for example, by using OpenMP:
#pragma omp parallel for
for (int i = 0; i < N; ++i)
    C[i] = A[i] + B[i];
Vectorization is one particular form in which parallelism is achieved. In particular, vectorization mostly uses the dedicated SIMD execution units in processors through specialized instruction sets such as x86 SSE/AVX and ARM NEON. Compilers may automatically vectorize your code, or you can vectorize manually using intrinsics or direct assembly code.
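As a rough sketch of what manual vectorization looks like (my own illustration, not from the original question), the addition loop above could be written with AVX intrinsics, assuming float arrays and an N that is a multiple of 8; compile with AVX enabled (e.g. -mavx):
#include <immintrin.h>

// Adds eight pairs of floats per loop iteration instead of one.
void add_avx(const float *A, const float *B, float *C, int N)
{
    for (int i = 0; i < N; i += 8) {
        __m256 a = _mm256_loadu_ps(&A[i]);   // load 8 floats from A
        __m256 b = _mm256_loadu_ps(&B[i]);   // load 8 floats from B
        __m256 c = _mm256_add_ps(a, b);      // 8 additions in one instruction
        _mm256_storeu_ps(&C[i], c);          // store 8 results into C
    }
}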
I don't think that vectorization necessarily means that the code to be vectorized must be embarrassingly parallel. But, in practice, most vectorizable code is embarrassingly parallel because virtually all SIMD instructions assume that. I can't find any SIMD instructions that allow data dependent operations. In that sense, yes, you can say that vectorizable programs need to be embarrassingly parallel.
However, in a broad sense, vectorization could embrace GPGPU-style SIMD programming such as Nvidia's CUDA and Intel's MIC architecture. They allow more flexible SIMD operations by handling data-dependent operations and branches.
To summarize, under a narrow definition of vectorization, i.e., vectorization for a conventional CPU's SIMD operations, I believe that vectorizable programs need to be embarrassingly parallel. However, more advanced forms of SIMD/vectorization can enable data-dependent operations and arbitrary branches.
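As a hedged illustration of that last point (my own example, not part of the original answer): even on conventional CPUs, a simple per-element branch can often still be vectorized by turning it into a compare-and-select, the same predication idea GPU SIMT hardware applies to divergent branches. Here C[i] = A[i] > 0 ? A[i] + B[i] : B[i] is expressed with AVX, again assuming float arrays and N a multiple of 8:
#include <immintrin.h>

void add_if_positive(const float *A, const float *B, float *C, int N)
{
    __m256 zero = _mm256_setzero_ps();
    for (int i = 0; i < N; i += 8) {
        __m256 a    = _mm256_loadu_ps(&A[i]);
        __m256 b    = _mm256_loadu_ps(&B[i]);
        __m256 mask = _mm256_cmp_ps(a, zero, _CMP_GT_OQ); // per-lane predicate
        __m256 sum  = _mm256_add_ps(a, b);
        __m256 c    = _mm256_blendv_ps(b, sum, mask);     // pick sum where A[i] > 0
        _mm256_storeu_ps(&C[i], c);
    }
}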

Embarrassingly parallel problems are tasks that require no effort to write in parallel form. Vectorisation is the process by which a conventional procedure is turned into a parallel one.
So this isn't a case of one being a logical sub-type of the other, but a tendency: the closer a problem is to being embarrassingly parallel, the less effort its vectorisation requires.

Parallelism is about using our computing devices as fully as possible, so we try to schedule tasks in such a way that their execution is almost independent of each other.
Vectorization, in parallel computing, is a special case of parallelization, in which a program that by default performs one operation at a time on a single thread is modified to perform multiple operations simultaneously.
More precisely, vectorization is the process of converting a program from a scalar implementation, which processes a single pair of operands at a time, to a vector implementation, which applies one operation to multiple pairs of operands at once.
Let's take the simple example of adding two arrays:
In the scalar version, we need a loop that adds one pair of elements per iteration. We can parallelize it in many ways; for example, we can create two threads, split the arrays into two equal halves, and assign one half to each thread. The maximum number of threads that can still improve performance is bounded by the number of elements in the arrays.
With vectorization, we should not need to loop over every single element to add the arrays; a single instruction should add a whole group of elements at once. Conceptually, the arrays are split into portions that fit the vector width, and all the additions in a portion happen under a single instruction instead of one per loop iteration (see the sketch after this answer).
For a routine to be completely vectorized, it must be embarrassingly parallel, as in the example above. But it depends on the given scenario: two routines with interdependencies between them can still each be vectorized individually, and so on.
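A minimal sketch of that idea (illustrative only, using OpenMP's SIMD directive): the loop still appears in the source, but the compiler is told it is safe to pack several element-wise additions into each vector instruction:
// One vector instruction now performs several of the additions at once.
void add_arrays(const float *A, const float *B, float *C, int N)
{
    #pragma omp simd
    for (int i = 0; i < N; ++i)
        C[i] = A[i] + B[i];
}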

Related

Is batching the same function with SIMD instructions possible?

I have a scenario in which many instances of the exact same function (for simplicity let's just consider C/C++ and Python here) will be executed at the same time on my machine. Intuitively I just use multi-threading, treating each instance of the function as a thread, to exploit the parallelism; they do not contend for the same resources, but they do perform many branch operations (e.g., for loops). However, since they are actually the same function, I'm thinking about batching them using some SIMD instructions, e.g., AVX-512. Of course, it should be automatic so that users do not have to modify their code.
The reason? Every thread/process/container/VM occupies resources, but AVX needs only one instruction (per batched operation). So I can support more users with the same hardware.
Most articles I find online focus on using AVX instructions inside a function, for example to accelerate stream data processing or to deal with some large calculation. None of them mentions batching different instances of the same function.
I know there are some challenges, such as different execution paths caused by different inputs, and that it is not easy to turn a normal function into a batched version automatically, but I think it is technically possible.
Here are my questions:
1. Is it hard (or even possible) to automatically change a normal function into a batched version?
2. If the answer to 1 is no, what restrictions should I put on the function to make it possible? For example, should the function have only one execution path regardless of the data?
3. Are there other technologies that solve this problem better? I don't think a GPU is a good option for me because a GPU cannot support I/O or branch instructions, although its SIMT model fits my goal perfectly.
Thanks!
SSE/AVX is basically a vector unit; it allows simple operations (like +, -, *, /, AND, OR, XOR, etc.) on arrays of multiple elements at once. AVX1 and AVX2 have 256-bit registers, so you can process e.g. 8 32-bit singles at once, or 4 doubles. AVX-512 is coming but still quite rare at the moment.
So if your functions are all operations on arrays of basic types, it is a natural fit. Rewriting your function using AVX intrinsics is doable if the operations are very simple. Complex cases (like mismatched vector widths), or doing it in assembler, are more of a challenge though.
If your function is not operating on vectors then it becomes difficult, and the possibilities are mostly theoretical. Auto-vectorizing compilers can sometimes do this, but it is rare, limited, and extremely complex.
There are two ways to fix this: vectorization (SIMD) and parallelization (threads).
GCC can already do the SIMD vectorization you want provided that the function is inlined, and the types and operations are compatible (and it will automatically inline smallish functions without you asking it to).
E.g.
inline void func (int i) {
    somearray[i] = someotherarray[i] * athirdarray[i];
}

for (int i = 0; i < ABIGNUMBER; i++)
    func (i);
Vectorization and inlining are enabled at -O3.
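For completeness, here is a hedged, self-contained version of the snippet above (the array names and size are just placeholders) that GCC should auto-vectorize; something like gcc -O3 -march=native file.c does it, and -fopt-info-vec reports which loops were vectorized:
#define ABIGNUMBER 100000

// File-scope arrays so the compiler can see their types.
int somearray[ABIGNUMBER], someotherarray[ABIGNUMBER], athirdarray[ABIGNUMBER];

static inline void func (int i) {
    somearray[i] = someotherarray[i] * athirdarray[i];
}

void compute (void) {
    for (int i = 0; i < ABIGNUMBER; i++)
        func (i);   // inlined, then the loop is vectorized at -O3
}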
If the function is too complex, or GCC doesn't vectorize it yet, then you can use OpenMP or OpenACC to parallelize it.
OpenMP uses special markup to tell the compiler where to spawn threads.
E.g.
#pragma omp parallel
#pragma omp for
for (int i = 0; i < ABIGNUMBER; i++)
....
And yes, you can do that on a GPU too! You do have to do a bit more typing to get the data copied in and out correctly. Only the marked up areas run on the GPU. Everything else runs on the CPU, so I/O etc. is not a problem.
#pragma omp target map(somearray,someotherarray,athirdarray)
#pragma omp parallel
#pragma omp for
for (int i = 0; i < ABIGNUMBER; i++)
....
OpenACC is a similar idea, but more specialized towards GPUs.
You can find OpenMP and OpenACC compilers in many places. Both GCC and LLVM support NVidia GPUs. LLVM has some support for AMD GPUs, and there are unofficial GCC builds available too (with official support coming soon).

OpenMP: what is the difference between "taskloop" and "omp for" performance wise?

"taskloop" is introduced in OpenMP 4.5. It can take clauses from both loop and task constructs (except depend clause AFAIK).
However, I'm wondering if "taskloop" and "omp for" constructs differ performance wise too.
I think it may depend on the actual problem. To parallelize a for loop, omp for can be faster than tasks, because it offers several different scheduling schemes to suit your needs. In my experience (solving a particular problem using the clang 12 compiler), omp for produced slightly faster code than tasks (on a Ryzen 5 7800X).
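For reference, a minimal sketch of the two constructs applied to the same loop (work(i) and N are placeholders for the loop body and trip count):
// Worksharing loop: iterations are split among the threads of the parallel
// region according to the schedule clause.
#pragma omp parallel for schedule(static)
for (int i = 0; i < N; ++i)
    work(i);

// Taskloop: one thread creates tasks covering chunks of the iteration space,
// and any thread in the parallel region may execute them.
#pragma omp parallel
#pragma omp single
#pragma omp taskloop grainsize(64)
for (int i = 0; i < N; ++i)
    work(i);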

Program memory footprint for different interpreters/compilers

Here's an excerpt from the Wikipedia entry on K programming language:
The small size of the interpreter and compact syntax of the language makes it possible for K applications to fit entirely within the level 1 cache of the processor.
What in particular makes K programs so small? Whether one uses the ' operator in K, map in a compiled functional language like Haskell, or an equivalent for loop in a compiled imperative language like C, I can't imagine the compilers generating radically different assembly code, nor what happens inside the interpreter being very different from a for loop. Is there anything special about K that makes its runtime and programs so small?
There's a similar question on SO, but the answers there basically clarify nothing.
There are ways of generating very compact code. For example, the threaded code (http://en.wikipedia.org/wiki/Threaded_code) used by Forth and the like. It is likely that K is compiled into some form of it.
I am not the author of the wikipedia statement above, just somebody who uses K extensively.
As for code, K is not unrolling loops or making other changes to the program structure that would increase it in size beyond what you're expecting. The executable interpreter itself is tiny. And the programs tend to be small (though not necessarily so). It's not the execution of any particular instructions for mapping, etc. that makes it more likely that the code will execute entirely within the cache.
K programs tend to be small because they are stored as small, tight bytecode, and their syntax tends to yield very small amounts of code for a given operation.
Compare this Java program:
int r = 0;
for (int i = 0; i < 100; i++) {
    r += i;
}
Against this K program to yield the same result:
+/!100
The amount of code being executed is similar (!100 enumerates 0 through 99, and +/ sums over that list), but the storage required by the program (much less typing!) is far less. K is great for those with repetitive stress injuries.
As for the data, the encouragement to work on multiple data items with single instructions tends to make access sequential, in a manner friendly to the cache, rather than random access. All of this merely makes it more likely that the program will be cache friendly.
But this is all just tendencies and best practices within the language in combination with the K executable itself. If you link in large amounts of additional code, special case lots of functions, and randomize your indices before accessing your data, your program will be just as unfriendly to the cache as you'd expect.

Is Cilk's approach to shared memory parallel programming a panacea?

What challenges in shared memory parallel programming (particularly multicore) cannot be solved or cannot be solved efficiently using a Cilk-style solution (i.e. nested data parallelism with per-core work stealing task deques)?
I think Cilk's model is nested task parallelism (which you can use to implement data parallelism). It is pretty cute, but...
Cilk doesn't support SIMD data parallelism or streaming parallelism.
Cilk doesn't appear to handle partial orders over tasks well, because it only offers nested parallelism. Try coding the following set of parallel tasks: A, B, C, D with ordering constraints A before B, A before D, C before D. (This is the canonical example of the smallest partial task order that nested task parallelism can't encode directly.) You lose some of the parallelism when implementing this with nested parallelism (see the sketch below). Parallelism being precious, you don't want to waste opportunities to be parallel.
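To make the loss concrete, here is a rough fork-join sketch (A, B, C, D stand for the tasks above; Cilk-style spawn/sync syntax). The sync is a full barrier, so B is forced to wait for C to finish even though the constraints only require B to wait for A:
cilk_spawn A();
C();            // A and C run in parallel
cilk_sync;      // barrier: waits for BOTH A and C
cilk_spawn B();
D();            // B and D run in parallel
cilk_sync;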
It doesn't handle (AFAIK) exception handling across thread boundaries.
This is needed if you want to build a really big, complex symbolic application.
It is also useful if you want to compute speculatively.
I also don't think Cilk can handle large sets of computation grains that interact (wait on synchronization events), because AFAIK a Cilk program can only have as many live parallel computations as there are OS-offered threads. This is due to the Cilk implementation choice of living on top of standard C/C++ compilers and their stack models, which on most workstations use the one-big-stack-per-thread model. You might get to 10 or 100 threads, but not 10,000; this matters when handling very large graphs. [I don't know for a fact that Cilk even allows computation grains to synchronize, but I don't see any technical reason why it could not if it gave up the large-stack model.]
A second implication is that Cilk applications can't recurse over huge data structures, because whatever size stack you choose is bounded, and there's some example for which you will run out of stack. (This isn't a design flaw of Cilk, just of its implementation). That's too bad, because huge things are one of the places where you want parallelism.
For an alternative, see PARLANSE, which offers an arbitrarily large number of computation grains, with work stealing, but with heap-allocated grains and activation records. Each grain has its own context (and therefore one can implement large interacting sets of grains, because it is straightforward to save a grain's state when it needs to wait on an event). PARLANSE synchronization primitives include futures, semaphores, and critical function results (see below).
PARLANSE offers explicit "teams" (sets of computational grains) as an abstraction, with exceptions propagating out of functions to the top of a computational grain (Java defines this to be "undefined", which is stupid), then to the team parent, and back to all the other team children as an asynchronous abort exception (catchable in a try clause), allowing the other children to clean up.
Because some actions (e.g., asynchronous abort-exceptions) can occur at arbitrary times, PARLANSE offers the notion of critical functions, whose results are guaranteed to be returned to the caller atomically, so a function either returns a result assuredly, or does not, and the function can clean up resources safely in an asynch abort handler.
Special "partial order" teams allows one to encode computations in which the partial order is known; I think this is more efficient than Cilk's nested parallelism if you have large sets of such.
(We use PARLANSE to implement large-scale program analysis/transformation tools. PARLANSE was invented to support this; we want parallelism because the artifacts we deal with are huge, well, trees and graphs, representing millions of lines of code.)
(PARLANSE doesn't do streams or SIMD either, but they aren't out of scope for the language. One could arguably add streams and SIMD to C and C++, but it's likely to be pretty hard.)

How to implement efficient sorting algorithms for multiple processors with Scala?

How to implement efficient sorting algorithms for multiple processors in Scala? Here's the link for radix algorithm in GPU:
radix algorithm in GPU
Use scala.actors.Futures. It isn't a good solution, because you are talking about parallel computation, not concurrent computation, and Futures is aimed at the latter, not the former.
Things like parallel arrays that are coming with Java 7 and a later (not 2.8) version of Scala are more appropriate for parallel algorithms.
Just to explain: a parallel algorithm is one that does the same computation on multiple processing units, each of which runs the same code. A concurrent computation is one in which each processing unit may be running different code.
Relatedly, in parallel algorithms the code being run doesn't change, only the data; in concurrent computation the code being run changes constantly.
By the way, though that is not what you are asking, let me mention that there's a library for Scala to run OpenCL code (i.e., run computations on the GPU). It's called ScalaCL.

Resources