Bank conflict in parallel reduction using interleaved addressing method

I was reading the presentation Optimizing Parallel Reduction in CUDA by Mark Harris. Here is the slide I have a problem with:
It says there is a bank conflict problem in this method. But why? Every thread accesses two consecutive memory cells, which are in different banks, and no two threads access the same memory cell concurrently.
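For reference, the interleaved-addressing step on that slide looks roughly like the following (a reconstruction from the presentation, not a verbatim copy; the kernel name is made up, the array name sdata follows the slides):

    __global__ void reduce_interleaved(int *g_idata, int *g_odata) {
        extern __shared__ int sdata[];

        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
        sdata[tid] = g_idata[i];   // each thread loads one element into shared memory
        __syncthreads();

        // Interleaved addressing: active threads touch shared-memory
        // elements that are 2*s apart.
        for (unsigned int s = 1; s < blockDim.x; s *= 2) {
            int index = 2 * s * tid;
            if (index < blockDim.x) {
                sdata[index] += sdata[index + s];
            }
            __syncthreads();
        }

        if (tid == 0) g_odata[blockIdx.x] = sdata[0];  // this block's partial sum
    }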

This presentation dates from the very early days of CUDA, and applies to first generation hardware.
That hardware had shared memory arranged in 16 banks of 32-bit words. Because every sixteenth entry in the shared array resides in the same bank, there are bank conflicts at a number of levels of that reduction tree.
This problem was addressed in newer hardware, where the number of banks was expanded to 32, meaning that this sort of bank conflict cannot occur.

Related

Xilinx FPGA resource estimation

I am trying to understand how to estimate FPGA resource requirement for a design/application.
Let's say a Spartan-7 part has:
Logic Cells - 52,160
DSP Slices - 120
Block RAM - 2,700 Kb
How do I find out the number of CLBs, and the RAM and flash availability?
Let's say my design needs an SPI interface in the FPGA.
How do I estimate the CLB, RAM and flash requirements for this design?
Thanks
Estimation of a block of logic can be done in a couple of ways. One method is to actually pen out the logic on paper and look at what registers you are planning on creating. Then you need to look at the part you are working with. In this case the Spartan-7 has the CLB configuration below:
This is from the Xilinx UG474 7 Series document, pg 17. So now you can see the quantity of flops and memory per CLB. Once you look at the registers in the code and count up the memory in the design, you can figure out the number of CLBs. You can generally share memory and flops in a single CLB without issue; however, if you have multiple memories, quantization takes over. Two separate memories generally can't occupy the same CLB. There are other quantization effects as well. Memories come in power-of-two sizes, and if you build a 33-bit-wide memory x 128K locations, you will really absorb 64 x 128K bits of memory, where 31 bits x 128K are unused and untouchable for other uses.
The second method of estimating size is more experience-based, as practiced by larger FPGA teams: previous designs are looked at, and engineers make basic comparisons of logic to identify previous blocks that are similar to what is being designed next. You might argue that an I2C interface isn't 100% like a SPI interface, but they are similar enough that you could say 125% of the I2C block would be a good estimate of a SPI block, with some margin for error. You then just throw that number into a spreadsheet along with estimates for the 100 other modules in the design, and you call that the rough estimate.
If the estimate needs a second pass to make it more accurate, then you should throw a little code together and validate that it is functional enough to NOT have flops, gates and memory optimized away, and then use that to shore up the estimate. This is tougher because optimization (read: dropping of unused flops) can happen all too easily, so you need to be certain that flops and gates are twiddle-able enough not to be interpreted as unused, or as always 1 or always 0.
To figure out the number of CLBs, you can use the CLB slice configuration table above. Take the number of flops and divide by 16 (for the 7 Series devices), and this will give you the flop-based CLB count. Take the memory bits and divide each memory by 256 (again for 7 Series devices), and you will get the total CLBs based on memory. At that point just take the larger of the two CLB counts, and that will be your CLB estimate.
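As a rough illustration of that arithmetic (a minimal sketch; the 16 flops and 256 memory bits per CLB are the 7 Series figures quoted above, while the register and memory counts are made up):

    #include <algorithm>
    #include <cstdio>

    int main() {
        // Hypothetical design inventory (made-up numbers for illustration).
        long flops       = 1200;         // registers counted from the RTL
        long memory_bits = 32L * 1024;   // total memory bits needed

        // 7 Series figures from the answer above:
        // 16 flip-flops and 256 memory bits per CLB.
        long clb_from_flops  = (flops + 15) / 16;          // round up
        long clb_from_memory = (memory_bits + 255) / 256;  // round up

        long clb_estimate = std::max(clb_from_flops, clb_from_memory);
        std::printf("CLB estimate: %ld\n", clb_estimate);
        return 0;
    }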

Why not just predict both branches?

CPUs use branch prediction to speed up code, but it only helps when the predicted branch is actually the one taken.
Why not simply take both branches? That is, assume both branches will be hit, cache both sides, and then take the proper one when necessary. The cache does not need to be invalidated. While this requires the compiler to load both branches beforehand (more memory, proper layout, etc.), I imagine that proper optimization could streamline both so that one can get near-optimal results from a single predictor. That is, one would require more memory for loading both branches (which grows exponentially for N branches), but the majority of the time one should be able to "recache" the failed branch with new code quickly enough, before the taken branch has finished executing.
if (x) Bl else Br;
Instead of assuming Bl is taken, assume that both Bl and Br are taken (some type of parallel processing or special interleaving), and after the branch is actually determined, one branch becomes invalid and its cache could then be freed for use (maybe some type of special technique would be required to fill and use it properly).
In fact, no prediction circuitry is required and all the design used for that could be, instead, used to handle both branches.
Any ideas if this is feasible?
A Historical Perspective on Fetching Instructions from both Paths
The first similar proposal (to my knowledge) was discussed in this 1968 patent. I understand that you are only asking about fetching instructions from both branches, but bear with me a little. In that patent, three broad strategies were laid out, one of which is following both paths (the fall-through path and the branch path). That is, not just fetching instructions from both paths, but also executing both paths. When the conditional branch instruction is resolved, one of the paths is discarded. It was only mentioned as an idea in the introduction of the patent; the patent itself was about another invention.
Later, in 1977, IBM released a commercial processor called the IBM 3033. That is the first processor (to my knowledge) to implement exactly what you are proposing. I'm surprised to see that the Wikipedia page does not mention that the processor fetched instructions from both paths. The paper that describes the IBM 3033 is titled "The IBM 3033: An inside look". Unfortunately, I'm not able to find that paper, but the paper on the IBM 3090 does mention the fact. So what you're proposing did make sense and was implemented in real processors several decades ago.
A patent was filed in 1981 and granted in 1984 on a processor with two microprogram memories, from which instructions can be fetched simultaneously. I quote from the abstract of the patent:
A dual fetch microsequencer having two single-ported microprogram memories wherein both the sequential and jump address microinstructions of a binary conditional branch can be simultaneously prefetched, one from each memory. The microprogram is assembled so that the sequential and jump addresses of each branch have opposite odd/even polarities. Accordingly, with all odd addresses in one memory and even in the other, the first instruction of both possible paths can always be prefetched simultaneously. When a conditional branch microinstruction is loaded into the execution register, its jump address or a value corresponding to it is transferred to the address register for the appropriate microprogram memory. The address of the microinstruction in the execution register is incremented and transferred to the address register of the other microprogram memory. Prefetch delays are thereby reduced. Also, when a valid conditional jump address is not provided, that microprogram memory may be transparently overlayed during that microcycle.
A Historical Perspective on Fetching and Executing Instructions from both Paths
There is a lot of research published in the 80s and 90s about proposing and evaluating techniques by which instructions from both paths are not only fetched but also executed, even for multiple conditional branches. This will have the potential additional overhead of fetching data required by both paths. The idea of branch prediction confidence was proposed in this paper in 1996 and was used to improve such techniques by being more selective regarding which paths to fetch and execute. Another paper (Threaded Multiple Path Execution) published in 1998 proposes an architecture that exploits simultaneous multithreading (SMT) to run multiple paths following conditional branches. Another paper (Dual Path Instruction Processing) published in 2002 proposes to fetch, decode, and rename, but not execute, instructions from both paths.
Discussion
Fetching instructions from both paths into one or more of the caches reduces the effective capacity of the caches in general, because, typically, one of the paths will be executed much more frequently than the other (in some, potentially highly irregular, pattern). Imagine fetching into the L3 cache, which is practically always shared between all the cores and holds both instructions and data. This can have a negative impact on the ability of the L3 cache to hold useful data. Fetching into the much smaller L2 cache can lead to even substantially worse performance, especially when the L3 is inclusive. Fetching instructions from both paths across multiple conditional branches for all the cores may cause hot data held in the caches to be frequently evicted and brought back. Therefore, extreme variants of the technique you are proposing would reduce the overall performance of modern architectures. However, less aggressive variants can be beneficial.
I'm not aware of any real modern processors that fetch instructions on both paths when they see a conditional branch (perhaps some do, but it's not publicly disclosed). But instruction prefetching has been extensively researched and still is. An important question here that needs to be addressed is: what is the probability that a sufficient number of instructions from the other path are already present in the cache when the predicted path turns out to be the wrong path? If the probability is high, then there would be little motivation to fetch instructions from both paths. Otherwise, there is indeed an opportunity. According to an old paper from Intel (Wrong-Path Instruction Prefetching), on the benchmarks tested, over 50% of instructions accessed on mispredicted paths were later accessed during correct path execution. The answer to this question certainly depends on the target domain of the processor being designed.

OpenCL: work group concept

I don't really understand the purpose of Work-Groups in OpenCL.
I understand that they are a group of Work Items (supposedly, hardware threads), which get executed in parallel.
However, why is there a need for this coarser subdivision? Wouldn't it be OK to have only the grid of threads (and, de facto, only one work-group)?
Should a Work-Group exactly map to a physical core ? For example, the TESLA c1060 card is said to have 240 cores. How would the Work-Groups map to this??
Also, as far as I understand, work-items inside a work group can be synchronized thanks to memory fences. Can work-groups synchronize or is that even needed ? Do they talk to each other via shared memory or is this only for work items (not sure on this one)?
Part of the confusion here, I think, comes down to terminology. What GPU people often call cores aren't really cores, and what GPU people often call threads are only threads in a certain sense.
Cores
A core, in GPU marketing terms, may refer to something like a CPU core, or it may refer to a single lane of a SIMD unit - in effect, a single-core x86 CPU with 4-wide SSE would be four cores of this simpler type. This is why GPU core counts can be so high. It isn't really a fair comparison; you have to divide by 16, 32 or a similar number to get a more directly comparable core count.
Work-items
Each work-item in OpenCL is a thread in terms of its control flow, and its memory model. The hardware may run multiple work-items on a single thread, and you can easily picture this by imagining four OpenCL work-items operating on the separate lanes of an SSE vector. It would simply be compiler trickery that achieves that, and on GPUs it tends to be a mixture of compiler trickery and hardware assistance. OpenCL 2.0 actually exposes this underlying hardware thread concept through sub-groups, so there is another level of hierarchy to deal with.
Work-groups
Each work-group contains a set of work-items that must be able to make progress in the presence of barriers. In practice this means that it is a set, all of whose state is able to exist at the same time, such that when a synchronization primitive is encountered there is little overhead in switching between them and there is a guarantee that the switch is possible.
A work-group must map to a single compute unit, which realistically means an entire work-group fits on a single entity that CPU people would call a core - CUDA would call it a multiprocessor (depending on the generation), AMD a compute unit and others have different names. This locality of execution leads to more efficient synchronization, but it also means that the set of work-items can have access to locally constructed memory units. They are expected to communicate frequently, or barriers wouldn't be used, and to make this communication efficient there may be local caches (similar to a CPU L1) or scratchpad memories (local memory in OpenCL).
As long as barriers are used, work-groups can synchronize internally, between work-items, using local memory, or by using global memory. Work-groups cannot synchronize with each other and the standard makes no guarantees on forward progress of work-groups relative to each other, which makes building portable locking and synchronization primitives effectively impossible.
A lot of this is due to history rather than design. GPU hardware has long been designed to construct vector threads and assign them to execution units in a fashion that optimally processes triangles. OpenCL falls out of generalising that hardware to be useful for other things, but not generalising it so much that it becomes inefficient to implement.
There are already a lot of good answers; for further understanding of the terminology of OpenCL, this paper ("An Introduction to the OpenCL Programming Model" by Jonathan Tompson and Kristofer Schlachter) describes all the concepts very well.
Use of work-groups allows more optimization by the kernel compilers, because data is not transferred between work-groups. Depending on the OpenCL device used, there might be caches that can be used for local variables, resulting in faster data accesses. If there were only one work-group, local variables would be just the same as global variables, which would lead to slower data accesses.
Also, OpenCL devices usually use Single Instruction Multiple Data (SIMD) extensions to achieve good parallelism. One work-group can be run in parallel with SIMD extensions.
Should a Work-Group exactly map to a physical core ?
I think the only way to find the fastest work-group size is to try different work-group sizes. It is also possible to query CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE from the device with clGetKernelWorkGroupInfo (a minimal query is sketched below). The fastest size should be a multiple of that.
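A sketch of that host-side query (assuming a cl_kernel and cl_device_id already obtained; the function name is made up and error handling is omitted):

    #include <CL/cl.h>
    #include <stdio.h>

    /* Query the preferred work-group size multiple for this kernel/device pair. */
    void print_preferred_multiple(cl_kernel kernel, cl_device_id device) {
        size_t preferred = 0;
        clGetKernelWorkGroupInfo(kernel, device,
                                 CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                                 sizeof(preferred), &preferred, NULL);
        printf("preferred work-group size multiple: %zu\n", preferred);
    }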
Can work-groups synchronize or is that even needed ?
Work-groups cannot be synchronized. This way there are no data dependencies between them, and they can also be run sequentially if that is considered to be the fastest way to run them. To achieve the same result as synchronization between work-groups, the kernel needs to be split into multiple kernels. Variables can be transferred between the kernels with buffers.
One benefit of work-groups is that they enable using shared local memory as a programmer-defined cache. A value read from global memory can be stored in shared work-group local memory and then accessed quickly by any work-item in the work-group. A good example is the game of life: each cell depends on itself and the 8 cells around it. If each work-item read this information directly, you'd have 9x global memory reads. By using work-groups and shared local memory you can approach 1x global memory reads (only approach, since there are redundant reads at the edges).
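A minimal sketch of that idea, written here as a CUDA kernel (CUDA __shared__ memory plays the role of OpenCL local memory; the tile size, kernel name and launch layout are assumptions, not taken from the answer):

    #define TILE 16   // interior cells computed per block

    // One game-of-life step. Launch with block dim (TILE+2, TILE+2) and a grid
    // covering the board in TILE x TILE steps.
    __global__ void life_step(const int *in, int *out, int width, int height) {
        // Each block stages a (TILE+2) x (TILE+2) tile (interior plus a one-cell
        // halo), so every interior cell finds its 8 neighbours in shared memory.
        __shared__ int tile[TILE + 2][TILE + 2];

        int gx = blockIdx.x * TILE + threadIdx.x - 1;   // global column (shifted for the halo)
        int gy = blockIdx.y * TILE + threadIdx.y - 1;   // global row

        // One global read per thread; out-of-range cells are treated as dead.
        int v = 0;
        if (gx >= 0 && gx < width && gy >= 0 && gy < height)
            v = in[gy * width + gx];
        tile[threadIdx.y][threadIdx.x] = v;
        __syncthreads();

        // Only the interior threads of the block compute and write a result.
        if (threadIdx.x >= 1 && threadIdx.x <= TILE &&
            threadIdx.y >= 1 && threadIdx.y <= TILE &&
            gx < width && gy < height) {
            int neighbours = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    if (dx || dy)
                        neighbours += tile[threadIdx.y + dy][threadIdx.x + dx];
            int alive = tile[threadIdx.y][threadIdx.x];
            out[gy * width + gx] = (neighbours == 3) || (alive && neighbours == 2);
        }
    }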

Should I prefer stride one memory access for either reading or writing?

It's well known that accessing memory in a stride one fashion is best for performance.
In situations where
I must access one region of memory for reading,
I must access another region for writing, and
I may only access one of the two regions in a stride one fashion,
should I prefer reading stride one or writing stride one?
One simple, concrete example is a BLAS-like copy-and-permute operation like y := P x. The permutation matrix P is defined entirely by some permutation vector q(i). It has a corresponding inverse permutation vector qinv(i). One could code the required loop as y[qinv(i)] = x[i] or as y[i]=x[q(i)] where the former reads from x stride one and the latter writes to y stride one.
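Concretely, the two variants look like this (a plain C++ sketch; q and qinv are assumed to be index arrays of length n):

    // Variant A: stride-one reads of x, scattered writes into y.
    for (int i = 0; i < n; ++i)
        y[qinv[i]] = x[i];

    // Variant B: gathered reads from x, stride-one writes of y.
    for (int i = 0; i < n; ++i)
        y[i] = x[q[i]];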
Ideally one could always code both possibilities, profile them under representative conditions, and choose the faster version. Pretend you could only code one version-- which access pattern would you always anticipate being faster based on the behavior of modern memory architectures? Does working in a threaded environment change your response?
The access pattern that you call "stride-one writes" (y[i] = x[q(i)]) is usually faster.
If memory is cached and your data items are smaller than a cache line, this access pattern requires less memory bandwidth.
It is usual for modern processors to have more load execution units than store units. And the next Intel architecture at the time, named Haswell, supports only a GATHER instruction, while SCATTER was not yet in their plans. All of this is also in favor of the "stride-one writes" pattern.
Working in a threaded environment does not change this.
I'd like to share results of my simple benchmarks. Suppose we have two square NxN matrices A and B of doubles, and we want to perform a copy with a transposition:
A = transpose(B)
Algorithms:
1. Two nested loops such that reads are contiguous and writes are strided.
2. Two nested loops such that reads are strided and writes are contiguous.
3. Sequential MKL mkl_domatcopy.
(Variants 1 and 2 are sketched in code after this list.)
Copy without transposition is used as a baseline. Values of N are taken to be 2^K + 1 to mitigate cache associativity effects.
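A sketch of the two loop orders being compared (plain C++; row-major storage in flat arrays is assumed):

    // Variant 1: reads of B are contiguous, writes into A are strided.
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            A[j * N + i] = B[i * N + j];

    // Variant 2: reads of B are strided, writes into A are contiguous.
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            A[i * N + j] = B[j * N + i];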
Test machines: an Intel Core i7-4770 with GCC 8.3.0 (-O3 -m64 -march=native) and Intel MKL 2019.0.1, and an Intel Xeon E5-2650 v3 with GCC 7.3.0 (-O3 -m64 -march=native) and Intel MKL 2017.0.1.
Numbers and C++ source code

CUDA: reduction or atomic operations?

I'm writing a CUDA kernel which involves calculating the maximum value on a given matrix and I'm evaluating possibilities. The best way I could find is:
Forcing every thread to store a value in shared memory and using a reduction algorithm after that to determine the maximum (pro: minimal divergence; con: shared memory is limited to 48 KB on 2.0 devices).
I couldn't use atomic operations, because there are both a read and a write operation involved, so threads could not be synchronized by __syncthreads().
Does any other idea come to mind?
You may also want to use the reduction routines that come with CUDA Thrust, which is part of CUDA 4.0, or available here.
The library is written by a pair of nVidia engineers and compares favorably with heavily hand-optimized code. I believe there is also some auto-tuning of grid/block size going on.
You can interface with your own kernel easily by wrapping your raw device pointers.
This is strictly from a rapid integration point of view. For the theory, see tkerwin's answer.
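A sketch of that integration path (assuming a raw device pointer d_data holding n floats; the wrapper function name is made up):

    #include <cfloat>
    #include <thrust/device_ptr.h>
    #include <thrust/functional.h>
    #include <thrust/reduce.h>

    // Wrap the raw device pointer and reduce with the maximum functor.
    float max_on_device(const float *d_data, size_t n) {
        thrust::device_ptr<const float> p(d_data);
        return thrust::reduce(p, p + n, -FLT_MAX, thrust::maximum<float>());
    }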
This is the usual way to perform reductions in CUDA.
Within each block:
1) Keep a running reduced value in shared memory for each thread. Each thread reads n values from global memory (I personally favor between 16 and 32) and updates the reduced value from these.
2) Perform the reduction algorithm within the block to get one final reduced value per block.
This way you will not need more shared memory than (number of threads) * sizeof(datatype) bytes.
Since each block produces one reduced value, you will need to perform a second reduction pass to get the final value.
For example, if you are launching 256 threads per block, and are reading 16 values per thread, you will be able to reduce (256 * 16 = 4096) elements per block.
So given 1 million elements, you will need to launch around 250 blocks in the first pass, and just one block in the second.
You will probably need a third pass for cases when the number of elements > (4096)^2 for this configuration.
You will have to take care that the global memory reads are coalesced. You cannot coalesce the global memory writes, but that is one performance hit you need to take. A sketch of the whole scheme follows below.
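A minimal sketch of this scheme for a maximum reduction (the grid-stride read loop and all names are assumptions; error checking omitted; blockDim.x is assumed to be a power of two and the shared-memory size is passed at launch):

    #include <cfloat>

    // First pass: each thread folds many elements into a running maximum, then
    // the block reduces in shared memory and writes one value per block.
    __global__ void block_max(const float *in, float *block_results, int n) {
        extern __shared__ float sdata[];

        int tid = threadIdx.x;
        int stride = blockDim.x * gridDim.x;
        float local_max = -FLT_MAX;

        // Grid-stride loop: coalesced reads, several values per thread.
        for (int i = blockIdx.x * blockDim.x + tid; i < n; i += stride)
            local_max = fmaxf(local_max, in[i]);

        sdata[tid] = local_max;
        __syncthreads();

        // Shared-memory tree reduction within the block.
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                sdata[tid] = fmaxf(sdata[tid], sdata[tid + s]);
            __syncthreads();
        }

        if (tid == 0)
            block_results[blockIdx.x] = sdata[0];   // one partial maximum per block
    }
    // Second pass: run block_max again with a single block over block_results,
    // or reduce the (small) per-block array on the host.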
NVIDIA has a CUDA demo that does reduction: here. There's a whitepaper that goes along with it that explains some motivations behind the design.
I found this document very useful for learning the basics of parallel reduction with CUDA. It's kind of old, so there must be additional tricks to boost performance further.
Actually, the problem you described is not really about matrices. The two-dimensional view of the input data is not significant (assuming the matrix data is laid out contiguously in memory). It's just a reduction over a sequence of values, being all matrix elements in whatever order they appear in memory.
Assuming the matrix representation is contiguous in memory, you just want to perform a simple reduction. And the best available implementation these days - as far as I can tell - is the excellent libcub by nVIDIA's Duane Merrill. Here is the documentation on its device-wide maximum-calculating function.
Note, though, that unless the matrix is small, for most of the computation it will simply be threads reading data and updating their own thread-specific maximum. Only when a thread has finished reading through a large swath of the matrix (or rather, a large strided swath) will it write its local maximum anywhere - typically into shared memory for a block-level reduction. And as for atomics, you will probably be making an atomicMax() call once every obscenely large number of matrix element reads - tens of thousands if not more.
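For reference, a device-wide maximum with cub typically takes the two-call shape below (a sketch: the first call only sizes the temporary storage; the function name and error handling are assumptions):

    #include <cub/cub.cuh>

    // d_in: device array of n floats; d_out: device pointer for the single result.
    void cub_max(const float *d_in, float *d_out, int n) {
        void  *d_temp_storage = nullptr;
        size_t temp_storage_bytes = 0;

        // First call with d_temp_storage == nullptr only computes the required size.
        cub::DeviceReduce::Max(d_temp_storage, temp_storage_bytes, d_in, d_out, n);
        cudaMalloc(&d_temp_storage, temp_storage_bytes);

        // Second call performs the actual reduction.
        cub::DeviceReduce::Max(d_temp_storage, temp_storage_bytes, d_in, d_out, n);
        cudaFree(d_temp_storage);
    }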
An atomic operation (atomicMax in this case) could also be used, but it is much less efficient than the approaches mentioned above. http://supercomputingblog.com/cuda/cuda-tutorial-4-atomic-operations/
If you have a K20 or Titan, I suggest dynamic parallelism: launching a single-thread kernel, which launches #items worker kernel threads to produce data, then launches #items/first-round-reduction-factor threads for the first round of reduction, and keeps launching until the result comes out.
