OpenCL Index operations: algorithmic vs constant index buffer - performance

So I'm writing a neural network library using Aparapi (which generates OpenCL from Java code).
Anyway there are many situations where I need to do complex index operations to find the source/destination node for a given weight when doing forward passes and backpropagation.
In many cases this is very simple 1D to 2D formula, but in some cases, such as for convolution nets, I need to do a somewhat more complex operation to find the index (often something like 3D to 1D to 3D).
I have been sticking with algorthims to compute these indices. The alternative would be to simple store the source and destination indices for each weight in a constant int array. I have avoided this as this would almost double the amount of memory storage.
I was wondering what the speed differences would be for computing indices vs reading them from a constant array? Am I losing speed in exchange for memory? Is the difference significant?

Computation is almost always faster on the GPU than global memory access to do the same thing (like a look-up-table). In particular, because the GPU keeps so many kernels "in flight" the math happens while it is waiting on the I/O from the previous kernel slot. So if your math is not too complex, prefer to do it rather than burn a global memory access.

Related

Practical Efficiency of binary search

When searching an element or an insertion point in a sorted array, there are basically two approaches: straight search (element by element) or binary search. From the time complexities O(n) vs O(log(n)) we know that binary search is ultimately more efficient, however this does not automatically imply that binary search will always be faster than "normal search".
My question therefore is: Can binary search be practically less efficent than "normal" search for low n? If yes, can we estimate the point at which binary search will be more efficient?
Thanks!
Yes, a binary search can be practically less efficient than "normal" search for a small n. However, this is very hard to estimate the point at which a binary search will be more efficient (if even possible) because this is very dependent of the problem (eg. data type, search predicate), the hardware (eg. processor, RAM) and even the dynamic state of the hardware used when the search is performed as well as the actual data in the sorted array on modern systems.
The first reason a binary search can be less efficient is vectorization. Indeed, modern processors can support SIMD instructions working on pretty big vectors. Thus, a linear search can work simultaneously on many item per processing cycle. Modern processors can even often execute few SIMD instructions in parallel per cycle. While linear searches can be often trivially vectorized, it is not the case of binary searches which are almost inherently sequential. One should keep in mind that vectorization is not always possible nor always automatically done by compilers, especially on non-trivial data types (eg. composite data structures, pointer-based types) or non-trivial search predicates (eg. the ones with conditionals or memory indirections).
The second reason a binary search can be less efficient is branch predictability. Indeed, modern processors try to predict branches ahead of time to avoid pipeline stall. If this prediction works, then branches can be taken very quickly, otherwise the processor can stall for several cycles (up to dozens). A branch can be easily predicted if it is always true or always false. A randomly taken branch cannot be predicted causing stalls. Because the array is sorted, branches in linear searches are easy to predict (branches are either always taken or never taken until the element is found), while this is clearly not the case for binary searches. As a result, the speed of a search is dependent of the searched item, and data inside the sorted array.
The same thing apply for cache misses and memory fetches: because the latency of the RAM is very big compared to executing arithmetic instructions, modern processors contains dedicated hardware prefetching units trying to predict the next memory fetches and prefetch data ahead of time in order to avoid cache misses. Prefetchers are good to predict linear/contiguous memory accesses but very bad for random memory accesses. Memory accesses of linear searches are trivial while the one of binary searches appear to be mostly random for many processors. A cache miss happening during a binary search will certainly cause the processor to stall for a lot of cycles. If the sorted array is already loaded in cache, a binary search on it can be much faster.
But this is not enough: using wide SIMD instructions or doing cache-misses can impact the frequency of the computing core and so the speed of the algorithm. Not to mention that the size of the data type also matters a lot as the memory throughput is limited and strided memory accesses are slower than contiguous one. One should also take into account the additional complexity of binary searches compared to linear ones (ie. often more instructions to execute). I guess I missed some important points in the above list.
As a programmer, you may need to define a threshold to choose which algorithm to use. If you really need that, the best solution is to find is automatically using a benchmark or autotuning methods. Practical experimentations shows that the threshold changed over the last decades for a given fixed context (data type, cache state, etc.), in favour to linear searches (so the thresholds are generally increasing over time).
My personal advice is not to use a binary search for value of n smaller than 256 / data_type_size_in_bytes with trivial/native data types on mainstream processors. I think it is a good idea to use a binary search when n is bigger than 1000, or also when the data-type is non-trivial as well as when the predicate is expensive.

Suitability of parallel computation for comparisons over a large dataset

Suppose the following hypothetical task:
I am given a single integer A (say, 32 bit double) an a large array of integers B's (same type). The size of the integer array is fixed at runtime (doesn't grow mid-run) but of arbitrary size except it can always fit inside either RAM or VRAM (whichever is smallest). For the sake of this scenario, the integer array can sit in either RAM and VRAM; ignore any time cost in transferring this initial data set at start-up.
The task is to compare A against each B and to return true only if the test is true for against ALL B's, returning false otherwise. For the sake of this scenario, let is the greater than comparison (although I'd be interested if your answer is different for slightly more complex comparisons).
A naïve parallel implementation could involve slicing up the set B and distributing the comparison workload across multiple core. The core's workload would then be entirely independent save for when a failed comparison would interrupt all others as the result would immediately be false. Interrupts play a role in this implementation; although I'd imagine an ever decreasing one probabilistically as the array of integers gets larger.
My question is three-fold:
Would such a scenario be suitable for parallel-processing on GPU. If so, under what circumstances? Or is this a misleading case where the direct CPU implementation is actually the fastest?
Can you suggest an improved parallel algorithm over the naïve one?
Can you suggest any reading to gain intuition on deciding such problems?
If I understand your questions correctly, what you are trying to perform is a reductive operation. The operation in question is equivalent to a MATLAB/Numpy all(A[:] == B). To answer the three sections:
Yes. Reductions on GPUs/multicore CPUs can be faster than their sequential counterpart. See the presentation on GPU reductions here.
The presentation should provide a hierarchical approach for reduction. A more modern approach would be to use atomic operations on shared memory and global memory, as well as warp-aggregation. However, if you do not wish to deal with the intricate details of GPU implementations, you can use a highly-optimized library such as CUB.
See 1 and 2.
Good luck! Hope this helps.
I think this is a situation where you'll derive minimal benefit from the use of a GPU. I also think this is a situation where it'll be difficult to get good returns on any form of parallelism.
Comments on the speed of memory versus CPUs
Why do I believe this? Behold: the performance gap (in terrifyingly unclear units).
The point here is that CPUs have gotten very fast. And, with SIMD becoming a thing, they are poised to become even faster.
In the meantime, memory is getting faster slower. Not shown on the chart are memory buses, which ferry data to/from the CPU. Those are also getting faster, but at a slow rate.
Since RAM and hard drives are slow, CPUs try to store data in "little RAMs" known as the L1, L2, and L3 caches. These caches are super-fast, but super-small. However, if you can design an algorithm to repeatedly use the same memory, these caches can speed things up by an order of magnitude. For instance, this site discusses optimizing matrix multiplication for cache reuse. The speed-ups are dramatic:
The speed of the naive implementation (3Loop) drops precipitously for everything about a 350x350 matrix. Why is this? Because double-precision numbers (8 bytes each) are being used, this is the point at which the 1MB L2 cache on the test machine gets filled. All the speed gains you see in the other implementations come from strategically reusing memory so this cache doesn't empty as quickly.
Caching in your algorithm
Your algorithm, by definition, does not reuse memory. In fact, it has the lowest possible rate of memory reuse. That means you get no benefit from the L1, L2, and L3 caches. It's as though you've plugged your CPU directly into the RAM.
How do you get data from RAM?
Here's a simplified diagram of a CPU:
Note that each core has it's own, dedicated L1 cache. Core-pairs share L2 caches. RAM is shared between everyone and accessed via a bus.
This means that if two cores want to get something from RAM at the same time, only one of them is going to be successful. The other is going to be sitting there doing nothing. The more cores you have trying to get stuff from RAM, the worse this is.
For most code, the problem's not too bad since RAM is being accessed infrequently. However, for your code, the performance gap I talked about earlier, coupled your algorithm's un-cacheable design, means that most of your code's time is spent getting stuff from RAM. That means that cores are almost always in conflict with each other for limited memory bandwidth.
What about using a GPU?
A GPU doesn't really fix things: most of your time will still be spent pulling stuff from RAM. Except rather than having one slow bus (from the CPU to RAM), you have two (the other being the bus from the CPU to the GPU).
Whether you get a speed up is dependent on the relative speed of the CPU, the GPU-CPU bus, and the GPU. I suspect you won't get much of a speed up, though. GPUs are good for SIMD-type operations, or maps. The operation you describe is a reduction or fold: an inherently non-parallel operation. Since your mapped function (equality) is extremely simple, the GPU will spend most of its time on the reduction operation.
tl;dr
This is a memory-bound operation: more cores and GPUs are not going to fix that.
ignore any time cost in transferring this initial data set at
start-up
if there are only a few flase conditions in millions or billions of elements, you can try an opencl example:
// A=5 and B=arr
int id=get_global_id(0);
if(arr[id]!=5)
{
atomic_add(arr,1);
}
is as fast as it gets. arr[0] must be zero if all conditions are "true"
If you are not sure wheter there are only a few falses or millions(which makes atomic functions slow), you can have a single-pass preprocessing to decrease number of falses:
int id=get_global_id(0);
// get arr[id*128] to arr[id*128+128] into local/private mem
// check if a single false exists.
// if yes, set all cells true condition except one
// write results back to a temporary arr2 to be used
this copies whole array to another but if you can ignore time delta of transferring from host device, this should be also ignored. On top of this, only two kernels shouldn't take more than 1ms for the overhead(not including memory read writes)
If data fits in cache, the second kernel(one with the atomic function) will access it instead of global memory.
If time of transfers starts concerning, you can hide their latency using pipelined upload compute download operations if threads are separable from whole array.

Algorithmic Complexity Analysis: practically using Knuth's Ordinary Operations (oops) and Memory Operations (mems) method

In implementing most algorithms (sort, search, graph traversal, etc.), there is frequently a trade-off that can be made in reducing memory accesses at the cost of additional ordinary operations.
Knuth has a useful method for comparing the complexity of various algorithm implementations by abstracting it from particular processors and only distinguishing between ordinary operations (oops) and memory operations (mems).
In compiled programs, one typically lets the compiler organise the low level operations, and hopes that the operating system will handle the question of whether data is held in cache memory (faster) or in virtual memory (slower). Furthermore, the exact number / cost of instructions is encapsulated by the compiler.
With Forth, there is no longer such encapsulation, and one is much closer to the machine, albeit perhaps to a stack machine running on top of a register processor.
Ignoring the effect of an operating system (so no memory stalls, etc.), and assuming for the moment a simple processor,
(1) Can anyone advise on how the ordinary stack operations in Forth (e.g. dup, rot, over, swap, etc.) compare with the cost of Forth's memory access fetch (#) or store (!) ?
(2) Is there a rule of thumb I can use to decide how many ordinary operations to trade-off against saving a memory access?
What I'm looking for is something like 'memory access costs as much as 50 ordinary ops, or 500 ordinary ops, or 5 ordinary ops' Ballpark is absolutely fine.
I'm trying to get a sense of the relative expense of fetch and store vs. rot, swap, dup, drop, over, correct to an order of magnitude.
This article How much time does it take to fetch one word from memory? talks about main memory stall times, with some rule of thumb type numbers, but basically you can do lots of instructions while stalling for main memory. As others have said, the numbers vary a lot between systems.
Main memory stalls is a big area of interest, especially as CPUs have more cores, but typically not much faster memory bandwidth. There is some research going on around compressing data in main memory too, so that the CPU can take advantage of 'spare' cycles and tightly packed cache lines http://oai.cwi.nl/oai/asset/15564/15564B.pdf
For those who are really interested in the details, most CPU manufacturers publish in depth guides on memory optimisations etc. mostly aimed at high end and compiler writers, but readable by all 2gl and 3gl programmers.
Ps. Go Forth.
A comparison between memory fetches and register operations is okay for assembler programs, as it is for the output of c-compilers, which is in fact an assembler program.
In Forth this question hardly makes sense. In the first place Forth is an interpreter and in using Forth one foregoes the ultimate in speed. Of course one could add an optimiser on top of Forth but then the question makes even less sense, because the output of a c-optimiser and a Forth optimiser converge to -- you guessed it -- an optimal solution.
Let's look at an elementary operation in Forth like AND.
This is implemented as
> CODE AND
> POP AX
> POP BX
> AND AX, BX
> PUSH AX
> NEXT
So we see already three memory operations for something that looks like an elementary calculation operation. It appears the Knuth metric is not applicable. Also Forth seems to be loosing big time.That is however not true. Those memory operations are all onto the L1 cache of a typical processor. That is about as efficient as local variables in small c functions,
We can compare stack operations with memory operations using VARIABLE's and the stack. The answer is simple. A VARIABLE risks a memory stall. A stack operation will almost certainly be a L1 cache hit. This is the single most important point of consideration. However the question explicitly asks not to consider it!
So there.

parallel processing multiple evaluations of a sequential task on a large data set -- a task for GPU computing?

I am working on some signal processing code in SciPy, and am now trying to use a numerical optimizer to tune it. Unfortunately, as these things go, it is turning out to be quite a slow process.
The operations I must perform for this optimization are the following:
Load a large 1-d data file (~ 120000 points)
Run optimizer, which:
Executes a signal processing operation, does not modify original data, produces 120000 new data points.
Examines difference between original signal and new signal using various operations,
One of which includes FFT-based convolution
Generates a single "error" value to summarise the result -- this is what should be minimized
Looks at error and re-runs operation with different parameters
The signal processing and error functions take under 3 seconds, but unfortunately doing it 50,000 times takes much longer. I am experimenting with various more efficient optimisation algorithms, but no matter what it's going to take thousands of iterations.
I have parallelised a couple of the optimisers I'm trying using CPU threads, which wasn't too difficult since the optimiser can easily perform several scheduled runs at once on separate threads using ThreadPool.map.
But this is only about a 2x speed-up on my laptop, or maybe 8x on a multicore computer. My question is, is this an application for which I could make use of GPU processing? I have already translated some parts of the code to C, and I could imagine using OpenCL to create a function from an array of parameters to an array of error values, and running this hundreds of times at once. -- Even if it performs the sequential processing part slowly, getting all the results in one shot would be amazing.
However, my guess is that the memory requirements (loading up a large file and producing a temporary one of equal size to generate every data point) would make it difficult to run the whole algorithm in an OpenCL kernel. I don't have much experience with GPU processing and writing CUDA/OpenCL code, so I don't want to set about learning the ins and outs if there is no hope in making it work.
Any advice?
Do you need to produce all 120,000 new points before analysing the difference? Could you calculate the new point, then decide for that point if you are converging?
How big are the points? A $50 graphics card today has 1Gb of memory - should be plenty for 120K points. I'm not as familiar with openCL as Cuda but there may also be limits on how much of this is texture memory vs general memory etc.
edit: More familiar with CUDA than OpenCL but this probably applies to both.
The memory on GPUs is a bit more complex but very flexible, you have texture memory that can be read by the GPU kernel and has some very clever cache features to make access to values in a 2d and 3d arrays very fast. There is openGL memory that you can write to for display and there is a limited (16-64k ?) cache per thread
Although transfers from main memory to the GPU are relatively slow ( few GB/s) the internal memory bus on the graphics card is 20x as fast as this

CUDA: reduction or atomic operations?

I'm writing a CUDA kernel which involves calculating the maximum value on a given matrix and I'm evaluating possibilities. The best way I could find is:
Forcing every thread to store a value in the shared memory and using a reduction algorithm after that to determine the maximum (pro: minimum divergence cons: shared memory is limited to 48Kb on 2.0 devices)
I couldn't use atomic operations because there are both a reading and a writing operation, so threads could not be synchronized by synchthreads.
Any other idea come into your mind?
You may also want to use the reduction routines that comes w/ CUDA Thrust which is a part of CUDA 4.0 or available here.
The library is written by a pair of nVidia engineers and compares favorably with heavily hand optimized code. I believe there is also some auto-tuning of grid/block size going on.
You can interface with your own kernel easily by wrapping your raw device pointers.
This is strictly from a rapid integration point of view. For the theory, see tkerwin's answer.
This is the usual way to perform reductions in CUDA
Within each block,
1) Keep a running reduced value in shared memory for each thread. Hence each thread will read n (I personally favor between 16 and 32), values from global memory and updates the reduced value from these
2) Perform the reduction algorithm within the block to get one final reduced value per block.
This way you will not need more shared memory than (number of threads) * sizeof (datatye) bytes.
Since each block a reduced value, you will need to perform a second reduction pass to get the final value.
For example, if you are launching 256 threads per block, and are reading 16 values per thread, you will be able to reduce (256 * 16 = 4096) elements per block.
So given 1 million elements, you will need to launch around 250 blocks in the first pass, and just one block in the second.
You will probably need a third pass for cases when the number of elements > (4096)^2 for this configuration.
You will have to take care that the global memory reads are coalesced. You can not coalesce global memory writes, but that is one performance hit you need to take.
NVIDIA has a CUDA demo that does reduction: here. There's a whitepaper that goes along with it that explains some motivations behind the design.
I found this document very useful for learning the basics of parallel reduction with CUDA. It's kind of old, so there must be additional tricks to boost performance further.
Actually, the problem you described is not really about matrices. The two-dimensional view of the input data is not significant (assuming the matrix data is layed out contiguously in memory). It's just a reduction over a sequence of values, being all matrix elements in whatever order they appear in memory.
Assuming the matrix representation is contiguous in memory, you just want to perform a simple reduction. And the best available implementation these days - as far as I can tell - is the excellent libcub by nVIDIA's Duane Merill. Here is the documentation on its device-wide Maximum-calculating function.
Note, though, that unless the matrix is small, for most of the computation it will simply be threads reading data and updating their own thread-specific maximum. Only when a thread has finished reading through a large swatch of the matrix (or rather, a large strided swath) will it write its local maximum anywhere - typically into shared memory for a block-level reduction. And as for atomics, you will probably be making an atomicMax() call once every obscenely large number of matrix element reads - tens of thousands if not more.
The atomicAdd function could also be used, but it is much less efficient than the approaches mentioned above. http://supercomputingblog.com/cuda/cuda-tutorial-4-atomic-operations/
If you have K20 or Titan, I suggest dynamic parallelism: lunching a single thread kernel, which lunches #items worker kernel threads to produce data, then lunches #items/first-round-reduction-factor threads for first round reduction, and keep lunching till result coming out.

Resources