Regular division as fast as multiplication with approximate reciprocal. Why?

I have the following code:
void division_approximate(float a[], float b[], float c[], int n) {
    // c[i] = a[i] * (1 / b[i]);
    for (int i = 0; i < n; i += 8) {
        __m256 b_val = _mm256_loadu_ps(b + i);
        b_val = _mm256_rcp_ps(b_val);
        __m256 a_val = _mm256_loadu_ps(a + i);
        a_val = _mm256_mul_ps(a_val, b_val);
        _mm256_storeu_ps(c + i, a_val);
    }
}
void division(float a[], float b[], float c[], int n) {
    // c[i] = a[i] / b[i];
    for (int i = 0; i < n; i += 8) {
        __m256 b_val = _mm256_loadu_ps(b + i);
        __m256 a_val = _mm256_loadu_ps(a + i);
        a_val = _mm256_div_ps(a_val, b_val);
        _mm256_storeu_ps(c + i, a_val);
    }
}
I would expect division_approximate to be faster than division, but both functions take almost the same time on my AMD Ryzen 7 4800H. I don't understand why; I would expect division_approximate to be significantly faster. This reproduces with both GCC and Clang. Compiled with -O3 -march=core-avx2.
UPDATE
Here is the assembly generated by GCC 9.3 for both loops:
division
0x555555555c38 <division+88>    vmovups 0x0(%r13,%rax,4),%ymm3
0x555555555c3f <division+95>    vdivps  (%r14,%rax,4),%ymm3,%ymm0
0x555555555c45 <division+101>   vmovups %ymm0,(%rbx,%rax,4)
0x555555555c4a <division+106>   add     $0x8,%rax
0x555555555c4e <division+110>   cmp     %eax,%r12d
0x555555555c51 <division+113>   jg      0x555555555c38 <division+88>
division_approximate
0x555555555b38 <division_approximate+88>    vrcpps  (%r14,%rax,4),%ymm0
0x555555555b3e <division_approximate+94>    vmulps  0x0(%r13,%rax,4),%ymm0,%ymm0
0x555555555b45 <division_approximate+101>   vmovups %ymm0,(%rbx,%rax,4)
0x555555555b4a <division_approximate+106>   add     $0x8,%rax
0x555555555b4e <division_approximate+110>   cmp     %eax,%r12d
0x555555555b51 <division_approximate+113>   jg      0x555555555b38 <division_approximate+88>
Both codes take almost exactly the same amount of time to execute (318 ms vs 319 ms) for n = 256 * 1024 * 1024.

(256 * 1024 * 1024) * 4 (bytes per float) / 0.318 / 1000^3 * 4 is about 13.5 GB/s of DRAM bandwidth, or about 10.1 GB/s of useful stream bandwidth. (Assuming stores actually cost read+write bandwidth for the RFO; as Jerome points out, _mm256_stream_ps could make the stores cost bandwidth just once, not twice.)
IDK if that's good or bad for single-threaded triad bandwidth on your Zen 2, but that's only
(256 * 1024 * 1024 / 8) / 0.318 / 1000^3 = ~0.1055 vectors (of 8 floats) per nanosecond; something Zen 2 vdivps could keep up with at 0.36 GHz. I assume your CPU is faster than that :P
(0.1055 vec/ns * 3.5 cycles/vec = 0.36 cycles/ns aka GHz)
So very obviously a memory bottleneck, nowhere near Zen2's one per 3.5 cycle vdivps ymm throughput. (https://uops.info/). Use a much smaller array (that fits in L1 or at least L2 cache) and loop over it multiple times.
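For example, a quick way to re-test (a sketch only; the array size and repeat count here are arbitrary choices, not part of the original benchmark) is to call the same functions on an L1-sized working set many times:

// Sketch: time the kernels on a small, cache-resident working set so the
// benchmark measures divide/rcp throughput rather than DRAM bandwidth.
// With n_small = 2048 floats, the three arrays total 24 KiB, which fits in a 32 KiB L1d.
void division(float a[], float b[], float c[], int n);   // from the question above
void bench_hot(float a[], float b[], float c[], int n_small, int repeats) {
    for (int r = 0; r < repeats; ++r)
        division(a, b, c, n_small);   // or division_approximate(a, b, c, n_small)
}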
Try to avoid writing loops like this in real-world code. The computational intensity (work per time you load data into L1 cache or registers) is very low. Do this as part of another pass, or use cache-blocking to do it for a small part of the input, then consume that small part of the output while it's still hot in cache. (That's much better than using _mm256_stream_ps to bypass cache.)
When mixed with other operations (lots of FMAs / mul / add), vdivps is often a better choice than rcpps + a Newton iteration (which is normally needed for acceptable accuracy: raw rcpps is only about 11-bit, IIRC). vdivps is only a single uop, vs. separate rcpps and mulps uops (and in the asm above, the vdivps version still needs a separate vmovups load for the other input anyway; Zen has no problem folding a memory source into a single uop). See also Floating point division vs floating point multiplication re: front-end throughput vs. divide-unit bottlenecks if you're just dividing, not mixing it with other operations.
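For reference, here is a sketch of that rcpps + one Newton-Raphson refinement (x1 = x0 * (2 - b*x0)); this is not code from the question, just an illustration of the extra uops involved, and it assumes FMA is available (which -march=core-avx2 or znver2 implies):

#include <immintrin.h>

// Approximate a / b: refine the ~11-bit rcpps estimate with one Newton step,
// giving roughly 22 bits of accuracy.
static inline __m256 div_rcp_nr(__m256 a, __m256 b) {
    __m256 x = _mm256_rcp_ps(b);                        // ~11-bit estimate of 1/b
    __m256 two = _mm256_set1_ps(2.0f);
    x = _mm256_mul_ps(x, _mm256_fnmadd_ps(b, x, two));  // x *= (2 - b*x)
    return _mm256_mul_ps(a, x);                         // a * (1/b)
}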
Of course it's great if you can avoid division entirely, e.g. hoist a reciprocal out of a loop and just multiply, but modern CPUs have good enough divider hardware that there's often not much to gain from rcpps, even if you aren't memory bottlenecked. e.g. when evaluating a polynomial approximation as the ratio of two polynomials, the number of FMAs is usually enough to hide the throughput cost of vdivps, and a Newton iteration would cost more FMA uops.
Also, why -march=core-avx2 when you don't have an Intel "core" microarchitecture? Use -march=native or -march=znver2. Unless you're intentionally benchmarking binaries tuned for Intel running on your AMD CPU.

Related

How to compute the achieved FLOPS of a MPI program which calls cuBlas function

I am accelerating an MPI program using cuBLAS functions. To evaluate the application's efficiency, I want to know the FLOPS, memory usage and other GPU statistics after the program has run, especially FLOPS.
I have read the relevant question: How to calculate Gflops of a kernel. I think the answers give two ways to calculate the FLOPS of a program:
The model count of an operation divided by the cost time of the operation
Using NVIDIA's profiling tools
The first solution doesn't depend on any tools, but I'm not sure what "model count" means. Is it O(f(N))? Like the model count of GEMM is O(N^3)? And if I multiply two matrices of 4 x 5 and 5 x 6 and the time taken is 0.5 s, is the model count 4 x 5 x 6 = 120? So the FLOPS is 120 / 0.5 = 240?
The second solution uses nvprof, which is deprecated now and replaced by Nsight Systems and Nsight Compute. But those two tools only seem to work for a CUDA program, not for an MPI program that launches CUDA functions. So I am wondering whether there is a tool to profile a program that launches CUDA functions.
I have been searching for two days but still can't find an acceptable solution.
"But I'm not sure what "model count" means. Is it O(f(N))? Like the model count of GEMM is O(N^3)? And if I multiply two matrices of 4 x 5 and 5 x 6 and the time taken is 0.5 s, is the model count 4 x 5 x 6 = 120? So the FLOPS is 120 / 0.5 = 240?"
The standard BLAS GEMM operation is C <- alpha * (A . B) + beta * C. For A (m by k), B (k by n) and C (m by n), each inner product of a row of A and a column of B, multiplied by alpha, is 2 * k + 1 flop, and there are m * n such inner products in A . B; adding beta * C to that product costs another 2 * m * n flop. So the total model flop count is (2 * k + 3) * (m * n) when alpha and beta are both non-zero.
For your example, assuming alpha = 1 and beta = 0 and an implementation smart enough to skip the extra operations (and most are), the GEMM flop count is (2 * 5) * (4 * 6) = 240, and if the execution time is 0.5 seconds, the model arithmetic throughput is 240 / 0.5 = 480 flop/s.
I would recommend using that approach if you really need to calculate the performance of GEMM (or other BLAS/LAPACK operations). This is the way most of the computational linear algebra literature and benchmarking has worked since the 1970s, and it is how most reported results you will find are calculated, including the HPC LINPACK benchmark.
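As a small illustration of that model count (hypothetical helper names; the 2*k+3 factor assumes alpha and beta are both non-zero, as described above):

#include <stdint.h>

// Model flop count for C <- alpha*(A.B) + beta*C with A (m x k), B (k x n).
static double gemm_model_flops(int64_t m, int64_t n, int64_t k) {
    return (2.0 * k + 3.0) * (double)m * (double)n;
}

// Model arithmetic throughput in flop/s, given a measured runtime in seconds.
static double gemm_model_flops_per_second(int64_t m, int64_t n, int64_t k, double seconds) {
    return gemm_model_flops(m, n, k) / seconds;
}
// With alpha = 1 and beta = 0 the count drops to (2*k) * (m*n), matching the
// 240-flop example above: 240 / 0.5 s = 480 flop/s.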
The documentation page Using the CLI to Analyze MPI Codes states clearly how to use nsys to collect MPI program runtime information.
The GitLab project Roofline Model on NVIDIA GPUs uses ncu to collect the achieved FLOPS and memory usage of the program. The methodology to compute these metrics is:
Time:
    sm__cycles_elapsed.avg / sm__cycles_elapsed.avg.per_second
FLOPs:
    DP: sm__sass_thread_inst_executed_op_dadd_pred_on.sum + 2 x sm__sass_thread_inst_executed_op_dfma_pred_on.sum + sm__sass_thread_inst_executed_op_dmul_pred_on.sum
    SP: sm__sass_thread_inst_executed_op_fadd_pred_on.sum + 2 x sm__sass_thread_inst_executed_op_ffma_pred_on.sum + sm__sass_thread_inst_executed_op_fmul_pred_on.sum
    HP: sm__sass_thread_inst_executed_op_hadd_pred_on.sum + 2 x sm__sass_thread_inst_executed_op_hfma_pred_on.sum + sm__sass_thread_inst_executed_op_hmul_pred_on.sum
    Tensor Core: 512 x sm__inst_executed_pipe_tensor.sum
Bytes:
    DRAM: dram__bytes.sum
    L2: lts__t_bytes.sum
    L1: l1tex__t_bytes.sum
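As a rough sketch of how those metrics combine (hypothetical helper; the metric values would be read from an ncu report and passed in):

// Achieved single-precision FLOP/s from the SP metrics and the Time metric above.
static double sp_flops_per_second(double fadd, double ffma, double fmul,
                                  double cycles_elapsed, double cycles_per_second) {
    double flops = fadd + 2.0 * ffma + fmul;              // an FMA counts as 2 flops
    double seconds = cycles_elapsed / cycles_per_second;  // the "Time" metric above
    return flops / seconds;
}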

Why accessing an array of int8_t is not faster than int32_t, due to cache?

I have read that when accessing with a stride
    for (int i = 0; i < aSize; i++) a[i] *= 3;
    for (int i = 0; i < aSize; i += 16) a[i] *= 3;
both loops should perform similarly, as the memory accesses cost far more than the multiplications.
I'm playing around with google benchmark and while testing similar cache behaviour, I'm getting results that I don't understand.
template <class IntegerType>
void BM_FillArray(benchmark::State& state) {
    for (auto _ : state)
    {
        IntegerType a[15360 * 1024 * 2]; // Reserve array that doesn't fit in L3
        for (size_t i = 0; i < sizeof(a) / sizeof(IntegerType); ++i)
            benchmark::DoNotOptimize(a[i] = 0); // I have compiler optimizations disabled anyway
    }
}
BENCHMARK_TEMPLATE(BM_FillArray, int32_t);
BENCHMARK_TEMPLATE(BM_FillArray, int8_t);
Run on (12 X 3592 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 256 KiB (x6)
L3 Unified 15360 KiB (x1)
---------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------
BM_FillArray<int32_t> 196577075 ns 156250000 ns 4
BM_FillArray<int8_t> 205476725 ns 160156250 ns 4
I would expect accessing the array of bytes to be faster than the array of ints as more elements fit in a cache line, but this is not the case.
Here are the results with optimizations enabled:
BM_FillArray<int32_t> 47279657 ns 47991071 ns 14
BM_FillArray<int8_t> 49374830 ns 50000000 ns 10
Can anyone please clarify this? Thanks :)
UPDATE 1:
I have read the old article "What Every Programmer Should Know About Memory" and everything is clearer now. However, I've tried the following benchmark:
template <int32_t CacheLineSize>
void BM_ReadArraySeqCacheLine(benchmark::State& state) {
    struct CacheLine
    {
        int8_t a[CacheLineSize];
    };
    vector<CacheLine> cl;
    int32_t workingSetSize = state.range(0);
    int32_t arraySize = workingSetSize / sizeof(CacheLine);
    cl.resize(arraySize);
    const int32_t iterations = 1536 * 1024;
    for (auto _ : state)
    {
        srand(time(NULL));
        int8_t res = 0;
        int32_t i = 0;
        while (i++ < iterations)
        {
            //size_t idx = i % arraySize;
            int idx = (rand() / float(RAND_MAX)) * arraySize;
            benchmark::DoNotOptimize(res += cl[idx].a[0]);
        }
    }
}
BENCHMARK_TEMPLATE(BM_ReadArraySeqCacheLine, 1)
    ->Arg(32 * 1024)     // L1 Data 32 KiB (x6)
    ->Arg(256 * 1024)    // L2 Unified 256 KiB (x6)
    ->Arg(15360 * 1024); // L3 Unified 15360 KiB (x1)
BENCHMARK_TEMPLATE(BM_ReadArraySeqCacheLine, 64)
    ->Arg(32 * 1024)     // L1 Data 32 KiB (x6)
    ->Arg(256 * 1024)    // L2 Unified 256 KiB (x6)
    ->Arg(15360 * 1024); // L3 Unified 15360 KiB (x1)
BENCHMARK_TEMPLATE(BM_ReadArraySeqCacheLine, 128)
    ->Arg(32 * 1024)     // L1 Data 32 KiB (x6)
    ->Arg(256 * 1024)    // L2 Unified 256 KiB (x6)
    ->Arg(15360 * 1024); // L3 Unified 15360 KiB (x1)
I would expect random accesses to perform much worse when the working set doesn't fit in the caches. However, these are the results:
BM_ReadArraySeqCacheLine<1>/32768 39936129 ns 38690476 ns 21
BM_ReadArraySeqCacheLine<1>/262144 40822781 ns 39062500 ns 16
BM_ReadArraySeqCacheLine<1>/15728640 58144300 ns 57812500 ns 10
BM_ReadArraySeqCacheLine<64>/32768 32786576 ns 33088235 ns 17
BM_ReadArraySeqCacheLine<64>/262144 32066729 ns 31994048 ns 21
BM_ReadArraySeqCacheLine<64>/15728640 50734420 ns 50000000 ns 10
BM_ReadArraySeqCacheLine<128>/32768 29122832 ns 28782895 ns 19
BM_ReadArraySeqCacheLine<128>/262144 31991964 ns 31875000 ns 25
BM_ReadArraySeqCacheLine<128>/15728640 68437327 ns 68181818 ns 11
what am I missing?
UPDATE 2:
I'm now using what you suggested (linear_congruential_engine) to generate the random numbers, and I'm using only static arrays, but the results are now even more confusing to me.
Here is the updated code:
template <int32_t WorkingSetSize, int32_t ElementSize>
void BM_ReadArrayRndCacheLine(benchmark::State& state) {
    struct Element
    {
        int8_t data[ElementSize];
    };
    constexpr int32_t ArraySize = WorkingSetSize / sizeof(ElementSize);
    Element a[ArraySize];
    constexpr int32_t iterations = 1536 * 1024;
    linear_congruential_engine<size_t, ArraySize/10, ArraySize/10, ArraySize> lcg; // I've tried with many params...
    for (auto _ : state)
    {
        int8_t res = 0;
        int32_t i = 0;
        while (i++ < iterations)
        {
            size_t idx = lcg();
            benchmark::DoNotOptimize(res += a[idx].data[0]);
        }
    }
}
// L1 Data 32 KiB(x6)
// L2 Unified 256 KiB(x6)
// L3 Unified 15360 KiB(x1)
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 32 * 1024, 1);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 32 * 1024, 64);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 32 * 1024, 128);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 256 * 1024, 1);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 256 * 1024, 64);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 256 * 1024, 128);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 15360 * 1024, 1);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 15360 * 1024, 64);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 15360 * 1024, 128);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 15360 * 1024 * 4, 1);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 15360 * 1024 * 4, 64);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 15360 * 1024 * 4, 128);
Here are the results (optimizations enabled):
// First template parameter is the working set size.
// Second template parameter is the array element size.
BM_ReadArrayRndCacheLine<32 * 1024, 1> 2833786 ns 2823795 ns 249
BM_ReadArrayRndCacheLine<32 * 1024, 64> 2960200 ns 2979343 ns 236
BM_ReadArrayRndCacheLine<32 * 1024, 128> 2896079 ns 2910539 ns 204
BM_ReadArrayRndCacheLine<256 * 1024, 1> 3114670 ns 3111758 ns 236
BM_ReadArrayRndCacheLine<256 * 1024, 64> 3629689 ns 3643135 ns 193
BM_ReadArrayRndCacheLine<256 * 1024, 128> 3213500 ns 3187189 ns 201
BM_ReadArrayRndCacheLine<15360 * 1024, 1> 5782703 ns 5729167 ns 90
BM_ReadArrayRndCacheLine<15360 * 1024, 64> 5958600 ns 6009615 ns 130
BM_ReadArrayRndCacheLine<15360 * 1024, 128> 5958221 ns 5998884 ns 112
BM_ReadArrayRndCacheLine<15360 * 1024 * 4, 1> 6143701 ns 6076389 ns 90
BM_ReadArrayRndCacheLine<15360 * 1024 * 4, 64> 5800649 ns 5902778 ns 90
BM_ReadArrayRndCacheLine<15360 * 1024 * 4, 128> 5826414 ns 5729167 ns 90
How is it possible that the results for (L1d < workingSet < L2) do not differ much from (workingSet < L1d)? L2 throughput and latency are still very good, but with the random accesses I'm trying to prevent prefetching and force cache misses... so why am I not noticing even a minimal increase in time?
Even when trying to fetch from main memory (workingSet > L3) I'm not getting a massive performance drop. You mention that the latest architectures can sustain bandwidths of up to ~8 bytes per clock, but I understand that they must copy a whole cache line, and that without prefetching of a predictable linear pattern, latency should be more noticeable in my tests... why is that not the case?
I suspect that page faults and the TLB may have something to do with it too.
(I've downloaded the VTune analyzer to try to understand all this stuff better, but it's hanging on my machine and I'm waiting for support.)
I REALLY appreciate your help Peter Cordes :)
I'm just a GAME programmer trying to show my teammates whether using certain integer types in our code might (or might not) have implications for our game's performance. For example, whether we should worry about using fast types (e.g. int_fast16_t) or using the fewest possible bytes in our variables for better packing (e.g. int8_t).
Re: the ultimate question: int_fast16_t is garbage for arrays because glibc on x86-64 unfortunately defines it as a 64-bit type (not 32-bit), so it wastes huge amounts of cache footprint. The question is "fast for what purpose", and glibc answered "fast for use as array indices / loop counters", apparently, even though it's slower to divide, or to multiply on some older CPUs (which were current when the choice was made). IMO this was a bad design decision.
Cpp uint32_fast_t resolves to uint64_t but is slower for nearly all operations than a uint32_t (x86_64). Why does it resolve to uint64_t?
How should the [u]int_fastN_t types be defined for x86_64, with or without the x32 ABI?
Why are the fast integer types faster than the other integer types?
Generally using small integer types for arrays is good; usually cache misses are a problem so reducing your footprint is nice even if it means using a movzx or movsx load instead of a memory source operand to use it with an int or unsigned 32-bit local. If SIMD is ever possible, having more elements per fixed-width vector means you get more work done per instruction.
But unfortunately int_fast16_t isn't going to help you achieve that with some libraries; short will, or int_least16_t.
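If you want to check the footprint difference on your own toolchain, a trivial sketch (on x86-64 glibc, int_fast16_t typically comes out as 8 bytes while int16_t / int_least16_t are 2):

#include <cstdint>
#include <cstdio>

int main() {
    std::printf("int_fast16_t:  %zu bytes\n", sizeof(std::int_fast16_t));
    std::printf("int_least16_t: %zu bytes\n", sizeof(std::int_least16_t));
    std::printf("int16_t:       %zu bytes\n", sizeof(std::int16_t));
}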
See my comments under the question for answers to the early part: 200 stall cycles is latency, not throughput. HW prefetch and memory-level parallelism hide that. Modern Microprocessors - A 90 Minute Guide! is excellent, and has a section on memory. See also What Every Programmer Should Know About Memory? which is still highly relevant in 2021. (Except for some stuff about prefetch threads.)
Your Update 2 with a faster PRNG
Re: why L2 isn't slower than L1: out-of-order exec is sufficient to hide L2 latency, and even your LCG is too slow to stress L2 throughput. It's hard to generate random numbers fast enough to give the available memory-level parallelism much trouble.
Your Skylake-derived CPU has an out-of-order scheduler (RS) of 97 uops, and a ROB size of 224 uops (like https://realworldtech.com/haswell-cpu/3 but larger), and 12 LFBs to track cache lines it's waiting for. As long as the CPU can keep track of enough in-flight loads (latency * bandwidth), having to go to L2 is not a big deal. Ability to hide cache misses is one way to measure out-of-order window size in the first place: https://blog.stuffedcow.net/2013/05/measuring-rob-capacity
Latency for an L2 hit is 12 cycles (https://www.7-cpu.com/cpu/Skylake.html). Skylake can do 2 loads per clock from L1d cache, but not from L2. (It can't sustain 1 cache line per clock IIRC, but 1 per 2 clocks or even somewhat better is doable).
Your LCG RNG bottlenecks your loop on its latency: 5 cycles for power-of-2 array sizes, or more like 13 cycles for non-power-of-2 sizes like your "L3" test attempts (see note 1). So that's about 1/10th the access rate that L1d can handle, and even if every access misses L1d but hits in L2, you're not even keeping more than one load in flight from L2. OoO exec + load buffers aren't even going to break a sweat. So L1d and L2 will be the same speed because they both use power-of-2 array sizes.
Note 1: imul (3c) + add (1c) for x = a * x + c, then remainder = x - (x/m * m) using a multiplicative inverse, probably mul (4 cycles for the high half of size_t?) + shr (1c) + imul (3c) + sub (1c). Or, with a power-of-2 size, modulo is just an AND with a constant like (1UL << n) - 1.
Clearly my estimates aren't quite right, because your non-power-of-2 array times are less than twice those of L1d / L2, not the 13/5 ratio my estimate would predict even if L3 latency/bandwidth weren't a factor.
Running multiple independent LCGs in an unrolled loop could make a difference. (With different seeds.) But a non-power-of-2 m for an LCG still means quite a few instructions so you would bottleneck on CPU front-end throughput (and back-end execution ports, specifically the multiplier).
An LCG with multiplier (a) = ArraySize/10 is probably just barely a large enough stride for the hardware prefetcher to not benefit much from locking on to it. But normally IIRC you want a large odd number or something (been a while since I looked at the math of LCG param choices), otherwise you risk only touching a limited number of array elements, not eventually covering them all. (You could test that by storing a 1 to every array element in a random loop, then count how many array elements got touched, i.e. by summing the array, if other elements are 0.)
a and c should definitely not both be factors of m, otherwise you're accessing the same 10 cache lines every time to the exclusion of everything else.
As I said earlier, it doesn't take much randomness to defeat HW prefetch. An LCG with c=0, a= an odd number, maybe prime, and m=UINT_MAX might be good, literally just an imul. You can modulo to your array size on each LCG result separately, taking that operation off the critical path. At this point you might as well keep the standard library out of it and literally just unsigned rng = 1; to start, and rng *= 1234567; as your update step. Then use arr[rng % arraysize].
That's cheaper than anything you could do with xorshift+ or xorshift*.
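A minimal sketch of that loop (illustrative only; the multiplier and the power-of-2 array size are arbitrary choices along the lines suggested above):

#include <cstddef>
#include <cstdint>

// Cheap "random enough" index generation: the update is a single imul, and the
// modulo is a single AND because arraysize is a power of 2.
uint8_t sum_random(const uint8_t* arr, size_t arraysize /* power of 2 */, size_t iterations) {
    unsigned rng = 1;
    uint8_t res = 0;
    for (size_t i = 0; i < iterations; ++i) {
        rng *= 1234567;               // odd multiplier, c = 0, m = 2^32
        res += arr[rng % arraysize];  // compiles to an AND for power-of-2 sizes
    }
    return res;
}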
Benchmarking cache latency:
You could generate an array of random uint16_t or uint32_t indices once (e.g. in a static initializer or constructor) and loop over that repeatedly, accessing another array at those positions. That would interleave sequential and random access, and make code that could probably do 2 loads per clock with L1d hits, especially if you use gcc -O3 -funroll-loops. (With -march=native it might auto-vectorize with AVX2 gather instructions, but only for 32-bit or wider elements, so use -fno-tree-vectorize if you want to rule out that confounding factor that only comes from taking indices from an array.)
To test cache / memory latency, the usual technique is to make linked lists with a random distribution around an array. Walking the list, the next load can start as soon as (but not before) the previous load completes. Because one depends on the other. This is called "load-use latency". See also Is there a penalty when base+offset is in a different page than the base? for a trick Intel CPUs use to optimistically speed up workloads like that (the 4-cycle L1d latency case, instead of the usual 5 cycle). Semi-related: PyPy 17x faster than Python. Can Python be sped up? is another test that's dependent on pointer-chasing latency.
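A sketch of that pointer-chasing idea (a standalone toy, not tuned benchmark code): link all elements into one random cycle so each load address depends on the previous load's result.

#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Each iteration's address depends on the previous load, so the loop runs at
// load-use latency for whatever cache level the working set lands in.
static uint32_t chase(const std::vector<uint32_t>& next, size_t steps) {
    uint32_t i = 0;
    for (size_t s = 0; s < steps; ++s)
        i = next[i];
    return i;                          // returning the result defeats dead-code elimination
}

int main() {
    const size_t n = 256 * 1024 / sizeof(uint32_t);  // ~256 KiB working set; vary this
    std::vector<uint32_t> order(n), next(n);
    std::iota(order.begin(), order.end(), 0);
    std::shuffle(order.begin(), order.end(), std::mt19937{42});
    for (size_t k = 0; k < n; ++k)                   // one cycle visiting every element
        next[order[k]] = order[(k + 1) % n];
    return (int)chase(next, 100000000) & 1;
}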

CUDA Parallel Cross Product

Disclaimer: I am fairly new to CUDA and parallel programming - so if you're not going to bother to answer my question, just ignore this, or at least point me to the right resources so I can find the answer myself.
Here's the particular problem I'm looking to solve using parallel programming. I have some 1D arrays that store 3D vectors in this format -> [v0x, v0y, v0z, ... vnx, vny, vnz], where n is the vector, and x, y, z are the respective components.
Suppose I want to find the cross product between vectors [v0, v1, ... vn] in one array and their corresponding vectors [v0, v1, ... vn] in another array.
The calculation is pretty straightforward without parallelization:
result[x] = vec1[y]*vec2[z] - vec1[z]*vec2[y];
result[y] = vec1[z]*vec2[x] - vec1[x]*vec2[z];
result[z] = vec1[x]*vec2[y] - vec1[y]*vec2[x];
The problem I'm having is understanding how to implement CUDA parallelization for the arrays I currently have. Since each value in the result vector is a separate calculation, I can effectively run the above calculation for each vector in parallel. Since each component of the resulting cross product is a separate calculation, those too could run in parallel. How would I go about setting up the blocks and threads / thinking about the thread layout for such a problem?
The top 2 optimization priorities for any CUDA programmer are to use memory efficiently, and expose enough parallelism to hide latency. We'll use those to guide our algorithmic choices.
A very simple thread strategy (the thread strategy answers the question, "what will each thread do or be responsible for?") in any transformation (as opposed to reduction) type problem is to have each thread be responsible for 1 output value. Your problem fits the description of transformation - the output data set size is on the order of the input data set size(s).
I'll assume that you intended to have two equal length vectors containing your 3D vectors, and that you want to take the cross product of the first 3D vectors in each and the 2nd 3D vectors in each, and so on.
If we choose a thread strategy of 1 output point per thread (i.e. result[x] or result[y] or result[z], all together would be 3 output points), then we will need 3 threads to compute the output of each vector cross product. If we have enough vectors to multiply, then we will have enough threads to keep our machine "busy" and do a good job of hiding latency. As a rule of thumb, your problem will start to become interesting on GPUs if the number of threads is 10000 or more, so this means we would want your 1D vectors to consist of about 3000 3D vectors or more. Let's assume that is the case.
In order to tackle the memory efficiency objective, our first task is to load your vector data from global memory. We will want this ideally to be coalesced, which roughly means adjacent threads access adjacent elements in memory. We'll want the output stores to be coalesced also, and our thread strategy of choosing one output point/one vector component per thread will work nicely to support that.
For efficient memory usage, we'd like to ideally load each item from global memory only once. Your algorithm naturally involves a small amount of data reuse. The data reuse is evident since the computation of result[y] depends on vec2[z] and the computation of result[x] also depends on vec2[z] to pick just one example. Therefore a typical strategy when there is data reuse is to load the data first into CUDA shared memory, and then allow the threads to perform their computations based on the data in shared memory. As we will see, this makes it fairly easy/convenient for us to arrange for coalesced loads from global memory, since the global data load arrangement is no longer tightly coupled to the threads or the usage of the data for computation.
The last challenge is to figure out an indexing pattern so that each thread will select the proper elements from shared memory to multiply together. If we look at your calculation pattern that you have depicted in your question, we see that the first load from vec1 follows an offset pattern of +1(modulo 3) from the index that the result is being computed for. So x->y, y->z, and z -> x. Likewise we see a +2(modulo 3) for the next load from vec2, another +2(modulo 3) pattern for the next load from vec1 and another +1(modulo 3) pattern for the final load from vec2.
If we combine all these ideas, we can then write a kernel that should have generally efficient characteristics:
$ cat t1003.cu
#include <stdio.h>
#define TV1 1
#define TV2 2
const size_t N = 4096;    // number of 3D vectors
const int blksize = 192;  // choose as multiple of 3 and 32, and less than 1024
typedef float mytype;

//pairwise vector cross product
template <typename T>
__global__ void vcp(const T * __restrict__ vec1, const T * __restrict__ vec2, T * __restrict__ res, const size_t n){
  __shared__ T sv1[blksize];
  __shared__ T sv2[blksize];
  size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
  while (idx < 3*n){ // grid-stride loop
    // load shared memory using coalesced pattern to global memory
    sv1[threadIdx.x] = vec1[idx];
    sv2[threadIdx.x] = vec2[idx];
    // compute modulo/offset indexing for thread loads of shared data from vec1, vec2
    int my_mod = threadIdx.x%3; // costly, but possibly hidden by global load latency
    int off1 = my_mod+1;
    if (off1 > 2) off1 -= 3;
    int off2 = my_mod+2;
    if (off2 > 2) off2 -= 3;
    __syncthreads();
    // each thread loads its computation elements from shared memory
    T t1 = sv1[threadIdx.x-my_mod+off1];
    T t2 = sv2[threadIdx.x-my_mod+off2];
    T t3 = sv1[threadIdx.x-my_mod+off2];
    T t4 = sv2[threadIdx.x-my_mod+off1];
    // compute result, and store using coalesced pattern, to global memory
    res[idx] = t1*t2-t3*t4;
    idx += gridDim.x*blockDim.x;} // for grid-stride loop
}

int main(){
  mytype *h_v1, *h_v2, *d_v1, *d_v2, *h_res, *d_res;
  h_v1  = (mytype *)malloc(N*3*sizeof(mytype));
  h_v2  = (mytype *)malloc(N*3*sizeof(mytype));
  h_res = (mytype *)malloc(N*3*sizeof(mytype));
  cudaMalloc(&d_v1,  N*3*sizeof(mytype));
  cudaMalloc(&d_v2,  N*3*sizeof(mytype));
  cudaMalloc(&d_res, N*3*sizeof(mytype));
  for (int i = 0; i < N; i++){
    h_v1[3*i]    = TV1;
    h_v1[3*i+1]  = 0;
    h_v1[3*i+2]  = 0;
    h_v2[3*i]    = 0;
    h_v2[3*i+1]  = TV2;
    h_v2[3*i+2]  = 0;
    h_res[3*i]   = 0;
    h_res[3*i+1] = 0;
    h_res[3*i+2] = 0;}
  cudaMemcpy(d_v1, h_v1, N*3*sizeof(mytype), cudaMemcpyHostToDevice);
  cudaMemcpy(d_v2, h_v2, N*3*sizeof(mytype), cudaMemcpyHostToDevice);
  vcp<<<(N*3+blksize-1)/blksize, blksize>>>(d_v1, d_v2, d_res, N);
  cudaMemcpy(h_res, d_res, N*3*sizeof(mytype), cudaMemcpyDeviceToHost);
  // verification
  for (int i = 0; i < N; i++) if ((h_res[3*i] != 0) || (h_res[3*i+1] != 0) || (h_res[3*i+2] != TV1*TV2)) { printf("mismatch at %d, was: %f, %f, %f, should be: %f, %f, %f\n", i, h_res[3*i], h_res[3*i+1], h_res[3*i+2], (float)0, (float)0, (float)(TV1*TV2)); return -1;}
  printf("%s\n", cudaGetErrorString(cudaGetLastError()));
  return 0;
}
$ nvcc t1003.cu -o t1003
$ cuda-memcheck ./t1003
========= CUDA-MEMCHECK
no error
========= ERROR SUMMARY: 0 errors
$
Note that I've chosen to write the kernel using a grid-stride loop. This isn't terribly important to this discussion, and not that relevant for this problem, because I've chosen a grid size equal to the problem size (4096*3). However for much larger problem sizes, you might choose a smaller grid size than the overall problem size, for some possible small efficiency gain.
For such a simple problem as this, it's fairly easy to define "optimality". The optimal scenario would be however long it takes to load the input data (just once) and write the output data. If we consider a larger version of the test code above, changing N to 40960 (and making no other changes), then the total data read and written would be 40960*3*4*3 bytes. If we profile that code and then compare to bandwidthTest as a proxy for peak achievable memory bandwidth, we observe:
$ CUDA_VISIBLE_DEVICES="1" nvprof ./t1003
==27861== NVPROF is profiling process 27861, command: ./t1003
no error
==27861== Profiling application: ./t1003
==27861== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 65.97% 162.22us 2 81.109us 77.733us 84.485us [CUDA memcpy HtoD]
30.04% 73.860us 1 73.860us 73.860us 73.860us [CUDA memcpy DtoH]
4.00% 9.8240us 1 9.8240us 9.8240us 9.8240us void vcp<float>(float const *, float const *, float*, unsigned long)
API calls: 99.10% 249.79ms 3 83.263ms 6.8890us 249.52ms cudaMalloc
0.46% 1.1518ms 96 11.998us 374ns 454.09us cuDeviceGetAttribute
0.25% 640.18us 3 213.39us 186.99us 229.86us cudaMemcpy
0.10% 255.00us 1 255.00us 255.00us 255.00us cuDeviceTotalMem
0.05% 133.16us 1 133.16us 133.16us 133.16us cuDeviceGetName
0.03% 71.903us 1 71.903us 71.903us 71.903us cudaLaunchKernel
0.01% 15.156us 1 15.156us 15.156us 15.156us cuDeviceGetPCIBusId
0.00% 7.0920us 3 2.3640us 711ns 4.6520us cuDeviceGetCount
0.00% 2.7780us 2 1.3890us 612ns 2.1660us cuDeviceGet
0.00% 1.9670us 1 1.9670us 1.9670us 1.9670us cudaGetLastError
0.00% 361ns 1 361ns 361ns 361ns cudaGetErrorString
$ CUDA_VISIBLE_DEVICES="1" /usr/local/cuda/samples/bin/x86_64/linux/release/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: Tesla K20Xm
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6375.8
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6554.3
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 171220.3
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
$
The kernel takes 9.8240us to execute, and in that time loads or stores a total of 40960*3*4*3 bytes of data. Therefore the achieved memory bandwidth by the kernel is 40960*3*4*3/0.000009824 or 150 GB/s. The proxy measurement for peak achievable on this GPU is 171 GB/s, so this kernel achieves 88% of the optimal throughput. With more careful benchmarking to run the kernel twice in a row, the 2nd execution requires only 8.99us to execute. This brings the achieved bandwidth in this case up to 96% of peak achievable throughput.

64/32-bit division on a processor with 32/16-bit division

My processor, a small 16-bit microcontroller with no FPU and integer math only, has 16/16 division and 32/16 division, which both take 18 cycles. At the moment I'm using a very slow software routine (~7,500 cycles) to do 64/32 division. Is there any way to use these division engines to calculate a 64/32 division, similar to how I'm already using the 16x16 multiplier and adder to calculate 32x32 multiplies? I'm using C but can work with any general explanation of how it can be done... I'm hoping to target <200 cycles (if it's at all possible).
See "Hacker's Delight", multiword division (pages 140-145).
The basic concept (going back to Knuth) is to think of your problem in base-65536 terms. Then you have a 4 digit by 2 digit division problem, with 2/1 digit division as a primitive.
The C code is here: https://github.com/hcs0/Hackers-Delight/blob/master/divmnu.c.txt
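To make the base-65536 idea concrete for the simpler case where the divisor fits in a single 16-bit digit, here is a sketch (not the full 64/32 algorithm, and written in plain C for illustration; on the actual microcontroller the 32/16 hardware divide would back the cur / d step):

#include <stdint.h>

// 64/16 division by schoolbook short division over four base-65536 digits.
// Each step divides a 32-bit partial dividend by the 16-bit divisor, and the
// running remainder is always < d, so every quotient digit fits in 16 bits.
uint64_t div64by16(uint64_t x, uint16_t d, uint16_t *rem) {
    uint64_t q = 0;
    uint32_t r = 0;
    for (int i = 3; i >= 0; --i) {
        uint32_t cur = ((uint32_t)r << 16) | (uint16_t)(x >> (16 * i));
        q |= (uint64_t)(cur / d) << (16 * i);   // 32/16 divide
        r = cur % d;
    }
    *rem = (uint16_t)r;
    return q;
}

The full 64/32 case needs the quotient-estimate-and-correct steps described below (or Knuth's Algorithm D), because the divisor itself then spans two digits.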
My copy of Knuth (The Art of Computer Programming) is at work, so I can't check it until Monday, but that would be my first source. It has a whole section on arithmetic.
edit: your post about "16/16 division and 32/16 division which both take 18 cycles." -- dsPICs have a conditional subtract operation in assembly. Consider using this as your computational primitive.
Also note that if X = XH * 2^32 + XL and D = DH * 2^16 + DL, then if you are looking for
(Q,R) = X/D where X = Q * D + R
with Q = QH * 2^16 + QL and R = RH * 2^16 + RL, then
XH * 2^32 + XL = DH * QH * 2^32 + (DL * QH + DH * QL) * 2^16 + (DL * QL) + RH * 2^16 + RL
This suggests (by looking at the terms that form the high 32 bits) the following procedure, akin to long division:
1. (QH, R0) = XH / (DH+1) -> XH = QH * (DH+1) + R0 [32/16 divide]
2. R1 = X - (QH * 2^16) * D [requires a 16*32 multiply, a shift-left by 16, and a 64-bit subtract]
3. calculate R1' = R1 - D * 2^16
4. while R1' >= 0, adjust QH upwards by 1, set R1 = R1', and go to step 3
5. (QL, R2) = (R1 >> 16) / (DH+1) -> R1 >> 16 = QL * (DH+1) + R2 [32/16 divide]
6. R3 = R1 - (QL * D) [requires a 16*32 multiply and a 48-bit subtract]
7. calculate R3' = R3 - D
8. while R3' >= 0, adjust QL upwards by 1, set R3 = R3', and go to step 7
Your 32-bit quotient is the pair (QH, QL), and the 32-bit remainder is R3.
(This assumes that the quotient is not larger than 32-bit, which you need to know ahead of time, and can easily check ahead of time.)
A starting point would be:
D. Knuth, The Art of Computer Programming Vol.2, Section 4.3.1, Algorithm D
But I suppose you may need to optimize the algorithm.
You may want to look at Booth's Algorithm (http://www.scribd.com/doc/3132888/Booths-Algorithm-Multiplication-Division).
The part you want is about 1/2 way down the page.
I haven't looked at this since my VLSI class, but this may be your best bet. If possible, do it in assembly to optimize it as much as possible, especially if you will be calling it often.
It basically involves shifting and adding or subtracting.

Why is SSE scalar sqrt(x) slower than rsqrt(x) * x?

I've been profiling some of our core math on an Intel Core Duo, and while looking at various approaches to square root I've noticed something odd: using the SSE scalar operations, it is faster to take a reciprocal square root and multiply it to get the sqrt, than it is to use the native sqrt opcode!
I'm testing it with a loop something like:
inline float TestSqrtFunction( float in );

void TestFunc()
{
  #define ARRAYSIZE 4096
  #define NUMITERS 16386
  float flIn[ ARRAYSIZE ];  // filled with random numbers ( 0 .. 2^22 )
  float flOut [ ARRAYSIZE ]; // filled with 0 to force fetch into L1 cache

  cyclecounter.Start();
  for ( int i = 0 ; i < NUMITERS ; ++i )
    for ( int j = 0 ; j < ARRAYSIZE ; ++j )
    {
      flOut[j] = TestSqrtFunction( flIn[j] );
      // unrolling this loop makes no difference -- I tested it.
    }
  cyclecounter.Stop();
  printf( "%d loops over %d floats took %.3f milliseconds",
          NUMITERS, ARRAYSIZE, cyclecounter.Milliseconds() );
}
I've tried this with a few different bodies for the TestSqrtFunction, and I've got some timings that are really scratching my head. The worst of all by far was using the native sqrt() function and letting the "smart" compiler "optimize". At 24ns/float, using the x87 FPU this was pathetically bad:
inline float TestSqrtFunction( float in )
{ return sqrt(in); }
The next thing I tried was using an intrinsic to force the compiler to use SSE's scalar sqrt opcode:
inline void SSESqrt( float * restrict pOut, float * restrict pIn )
{
    _mm_store_ss( pOut, _mm_sqrt_ss( _mm_load_ss( pIn ) ) );
    // compiles to movss, sqrtss, movss
}
This was better, at 11.9ns/float. I also tried Carmack's wacky Newton-Raphson approximation technique, which ran even better than the hardware, at 4.3ns/float, although with an error of 1 in 2^10 (which is too much for my purposes).
The doozy was when I tried the SSE op for reciprocal square root, and then used a multiply to get the square root ( x * 1/√x = √x ). Even though this takes two dependent operations, it was the fastest solution by far, at 1.24ns/float and accurate to 2^-14:
inline void SSESqrt_Recip_Times_X( float * restrict pOut, float * restrict pIn )
{
    __m128 in = _mm_load_ss( pIn );
    _mm_store_ss( pOut, _mm_mul_ss( in, _mm_rsqrt_ss( in ) ) );
    // compiles to movss, movaps, rsqrtss, mulss, movss
}
My question is basically what gives? Why is SSE's built-in-to-hardware square root opcode slower than synthesizing it out of two other math operations?
I'm sure that this is really the cost of the op itself, because I've verified:
All data fits in cache, and
accesses are sequential
the functions are inlined
unrolling the loop makes no difference
compiler flags are set to full optimization (and the assembly is good, I checked)
(edit: stephentyrone correctly points out that operations on long strings of numbers should use the vectorizing SIMD packed ops, like rsqrtps — but the array data structure here is for testing purposes only: what I am really trying to measure is scalar performance for use in code that can't be vectorized.)
sqrtss gives a correctly rounded result. rsqrtss gives an approximation to the reciprocal square root, accurate to about 11 bits.
sqrtss is generating a far more accurate result, for when accuracy is required. rsqrtss exists for the cases when an approximation suffices, but speed is required. If you read Intel's documentation, you will also find an instruction sequence (reciprocal square-root approximation followed by a single Newton-Raphson step) that gives nearly full precision (~23 bits of accuracy, if I remember properly), and is still somewhat faster than sqrtss.
edit: If speed is critical, and you're really calling this in a loop for many values, you should be using the vectorized versions of these instructions, rsqrtps or sqrtps, both of which process four floats per instruction.
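A sketch of that rsqrt-plus-one-Newton-step sequence, in the same style as the code in the question (illustrative only; no special handling of 0 or denormals, and it is not correctly rounded like sqrtss):

#include <xmmintrin.h>

// y1 = y0 * (1.5 - 0.5 * x * y0 * y0), then sqrt(x) = x * y1.
// One Newton-Raphson step takes the ~11-bit rsqrtss estimate to roughly 22 bits.
inline void SSESqrt_RsqrtNR( float * pOut, float * pIn )
{
    __m128 x  = _mm_load_ss( pIn );
    __m128 y  = _mm_rsqrt_ss( x );                      // ~11-bit estimate of 1/sqrt(x)
    __m128 hx = _mm_mul_ss( _mm_set_ss( 0.5f ), x );
    y = _mm_mul_ss( y, _mm_sub_ss( _mm_set_ss( 1.5f ),
                                   _mm_mul_ss( hx, _mm_mul_ss( y, y ) ) ) );
    _mm_store_ss( pOut, _mm_mul_ss( x, y ) );           // sqrt(x) = x * (1/sqrt(x))
}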
There are a number of other answers to this already from a few years ago. Here's what the consensus got right:
The rsqrt* instructions compute an approximation to the reciprocal square root, good to about 11-12 bits.
It's implemented with a lookup table (i.e. a ROM) indexed by the mantissa. (In fact, it's a compressed lookup table, similar to mathematical tables of old, using adjustments to the low-order bits to save on transistors.)
The reason why it's available is that it is the initial estimate used by the FPU for the "real" square root algorithm.
There's also an approximate reciprocal instruction, rcp. Both of these instructions are a clue to how the FPU implements square root and division.
Here's what the consensus got wrong:
SSE-era FPUs do not use Newton-Raphson to compute square roots. It's a great method in software, but it would be a mistake to implement it that way in hardware.
The N-R algorithm to compute reciprocal square root has this update step, as others have noted:
x' = 0.5 * x * (3 - n*x*x);
That's a lot of data-dependent multiplications and one subtraction.
What follows is the algorithm that modern FPUs actually use.
Given b[0] = n, suppose we can find a series of numbers Y[i] such that b[n] = b[0] * Y[0]^2 * Y[1]^2 * ... * Y[n]^2 approaches 1. Then consider:
x[n] = b[0] * Y[0] * Y[1] * ... * Y[n]
y[n] = Y[0] * Y[1] * ... * Y[n]
Clearly x[n] approaches sqrt(n) and y[n] approaches 1/sqrt(n).
We can use the Newton-Raphson update step for reciprocal square root to get a good Y[i]:
b[i] = b[i-1] * Y[i-1]^2
Y[i] = 0.5 * (3 - b[i])
Then:
x[0] = n * Y[0]
x[i] = x[i-1] * Y[i]
and:
y[0] = Y[0]
y[i] = y[i-1] * Y[i]
The next key observation is that b[i] = x[i-1] * y[i-1]. So:
Y[i] = 0.5 * (3 - x[i-1] * y[i-1])
= 1 + 0.5 * (1 - x[i-1] * y[i-1])
Then:
x[i] = x[i-1] * (1 + 0.5 * (1 - x[i-1] * y[i-1]))
     = x[i-1] + x[i-1] * 0.5 * (1 - x[i-1] * y[i-1])
y[i] = y[i-1] * (1 + 0.5 * (1 - x[i-1] * y[i-1]))
     = y[i-1] + y[i-1] * 0.5 * (1 - x[i-1] * y[i-1])
That is, given initial x and y, we can use the following update step:
r = 0.5 * (1 - x * y)
x' = x + x * r
y' = y + y * r
Or, even fancier, we can set h = 0.5 * y. This is the initialisation:
Y = approx_rsqrt(n)
x = Y * n
h = Y * 0.5
And this is the update step:
r = 0.5 - x * h
x' = x + x * r
h' = h + h * r
This is Goldschmidt's algorithm, and it has a huge advantage if you're implementing it in hardware: the "inner loop" is three multiply-adds and nothing else, and two of them are independent and can be pipelined.
In 1999, FPUs already needed a pipelined add/subtract circuit and a pipelined multiply circuit, otherwise SSE would not be very "streaming". Only one of each circuit was needed in 1999 to implement this inner loop in a fully-pipelined way without wasting a lot of hardware just on square root.
Today, of course, we have fused multiply-add exposed to the programmer. Again, the inner loop is three pipelined FMAs, which are (again) generally useful even if you're not computing square roots.
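Written out as a scalar sketch (y0 stands for the initial approx_rsqrt(n) estimate; the iteration count is a free parameter):

// Goldschmidt square root as derived above: each iteration is three multiply-adds,
// and the x and h updates are independent, so they can be pipelined.
static float goldschmidt_sqrt(float n, float y0, int iters) {
    float x = y0 * n;      // converges to sqrt(n)
    float h = y0 * 0.5f;   // converges to 0.5 / sqrt(n)
    for (int i = 0; i < iters; ++i) {
        float r = 0.5f - x * h;
        x = x + x * r;
        h = h + h * r;
    }
    return x;
}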
This is also true for division. MULSS(a,RCPSS(b)) is way faster than DIVSS(a,b). In fact it's still faster even when you increase its precision with a Newton-Raphson iteration.
Intel and AMD both recommend this technique in their optimisation manuals. In applications which don't require IEEE-754 compliance, the only reason to use div/sqrt is code readability.
Instead of supplying an answer that might actually be incorrect (I'm also not going to check or argue about cache and other stuff; let's say they are identical), I'll try to point you to the source that can answer your question.
The difference might lie in how sqrt and rsqrt are computed. You can read more here: http://www.intel.com/products/processor/manuals/. I'd suggest starting by reading about the processor instructions you are using; there is some info, especially about rsqrt (the CPU uses an internal lookup table with a rough approximation, which makes it much simpler to get the result). It may seem that rsqrt is so much faster than sqrt that one additional mul operation (which isn't too costly) doesn't change the situation here.
Edit: A few facts that might be worth mentioning:
1. Once I was doing some micro-optimizations for my graphics library and I used rsqrt for computing the length of vectors (instead of sqrt, I multiplied my sum of squares by its rsqrt, which is exactly what you've done in your tests), and it performed better.
2. Computing rsqrt using a simple lookup table might be easier: for rsqrt, when x goes to infinity, 1/sqrt(x) goes to 0, so for large x the function values don't change (much), whereas sqrt goes to infinity, so it isn't such a simple case ;).
Also, a clarification: I'm not sure where I found it in the books I've linked, but I'm pretty sure I've read that rsqrt uses some lookup table, and it should only be used when the result doesn't need to be exact, although I might be wrong as well, as it was some time ago :).
Newton-Raphson converges to the zero of f(x) using increments equal to -f/f', where f' is the derivative.
For x=sqrt(y), you can try to solve f(x) = 0 for x using f(x) = x^2 - y;
Then the increment is: dx = -f/f' = 1/2 * (y/x - x) = 1/2 * (y - x^2) / x
which has a slow divide in it.
You can try other functions (like f(x) = 1/y - 1/x^2) but they will be equally complicated.
Let's look at 1/sqrt(y) now. You could try f(x) = x^2 - 1/y, but it will be equally complicated: dx = (1 - y*x^2) / (2*x*y), for instance.
One non-obvious alternate choice for f(x) is: f(x) = y - 1/x^2
Then: dx = -f/f' = (1/x^2 - y) / (2/x^3) = 1/2 * x * (1 - y * x^2)
Ah! It's not a trivial expression, but you only have multiplies in it, no divide. => Faster!
And: the full update step new_x = x + dx then reads:
x *= 3/2 - y/2 * x * x which is easy too.
It is faster because these instructions ignore rounding modes and do not handle floating-point exceptions or denormalized numbers. For these reasons it is much easier to pipeline, speculate and execute other FP instructions out of order.

Resources