Fastest way to convert 12bit image to 16bit image


Most modern CMOS cameras can produce 12-bit Bayer images.
What would be the fastest way to convert an image data array from 12-bit to 16-bit so that processing becomes possible? The actual problem is padding each 12-bit number with 4 zero bits. Little endian can be assumed, and SSE2/SSE3/SSE4 are also acceptable.
Code added:
int* imagePtr = (int*)Image.data;
fixed (float* imageData = img.Data)
{
float* imagePointer = imageData;
for (int t = 0; t < total; t++)
{
int i1 = *imagePtr;
imagePtr = (int*)((ushort*)imagePtr + 1);
int i2 = *imagePtr;
imagePtr = (int*)((ushort*)imagePtr + 2);
*imagePointer = (float)(((i1 << 4) & 0x00000FF0) | ((i1 >> 8) & 0x0000000F));
imagePointer++;
*imagePointer = (float)((i1 >> 12) & 0x00000FFF);
imagePointer++;
*imagePointer = (float)(((i2 >> 4) & 0x00000FF0) | ((i2 >> 12) & 0x0000000F));
imagePointer++;
*imagePointer = (float)((i2 >> 20) & 0x00000FFF);
imagePointer++;
}
}

I cannot guarantee this is the fastest, but here is an approach that uses SSE. Eight 12-to-16-bit conversions are done per iteration, and approximately two conversions are done per step (i.e., each iteration takes multiple steps).
This approach straddles the 12bit integers around the 16bit boundaries in the xmm register. Below shows how this is done.
One xmm register is being used (assume xmm0). The state of the register is represented by one line of letters.
Each letter represents 4 bits of a 12bit integer (ie, AAA is the entire first 12bit word in the array).
Each gap represents a 16-bit boundary.
>>2 indicates a logical right-shift of one byte.
The caret (^) symbol is used to highlight which relevant 12bit integers are straddling a 16bit boundary in each step.
:
load
AAAB BBCC CDDD EEEF FFGG GHHH JJJK KKLL
^^^
>>2
00AA ABBB CCCD DDEE EFFF GGGH HHJJ JKKK
      ^^^ ^^^
>>2
0000 AAAB BBCC CDDD EEEF FFGG GHHH JJJK
                ^^^ ^^^
>>2
0000 00AA ABBB CCCD DDEE EFFF GGGH HHJJ
                          ^^^ ^^^
>>2
0000 0000 AAAB BBCC CDDD EEEF FFGG GHHH
                                    ^^^
At each step, we can extract the aligned 12bit integers and store them in the xmm1 register. At the end, our xmm1 will look as follows. Question marks denote values which we do not care about.
AAA? ?BBB CCC? ?DDD EEE? ?FFF GGG? ?HHH
Extract the high aligned integers (A, C, E, G) into xmm2 and then, on xmm2, perform a right logical word shift of 4 bits. This will convert the high aligned integers to low aligned. Blend these adjusted integers back into xmm1. The state of xmm1 is now:
?AAA ?BBB ?CCC ?DDD ?EEE ?FFF ?GGG ?HHH
Finally we can mask out the integers (ie, convert the ?'s to 0's) with 0FFFh on each word.
0AAA 0BBB 0CCC 0DDD 0EEE 0FFF 0GGG 0HHH
Now xmm1 contains eight consecutive converted integers.
The following NASM program demonstrates this algorithm.
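global main
segment .data
align 16 ; movdqa needs a 16-byte-aligned memory operand
sample dw 1234h, 5678h, 9ABCh, 1234h, 5678h, 9ABCh, 1234h, 5678h ; hex, so each nibble is visible
low12 times 8 dw 0FFFh
segment .text
main:
movdqa xmm0, [sample] ; load 128 bits of packed 12-bit data
pblendw xmm1, xmm0, 10000000b ; pblendw is SSE4.1; copy word 7 of xmm0 into xmm1
psrldq xmm0, 1 ; shift the whole register right by one byte
pblendw xmm1, xmm0, 01100000b ; copy words 6 and 5
psrldq xmm0, 1
pblendw xmm1, xmm0, 00011000b ; copy words 4 and 3
psrldq xmm0, 1
pblendw xmm1, xmm0, 00000110b ; copy words 2 and 1
psrldq xmm0, 1
pblendw xmm1, xmm0, 00000001b ; copy word 0
pblendw xmm2, xmm1, 10101010b ; copy the odd-numbered words into xmm2
psrlw xmm2, 4 ; shift each of them right by 4 bits
pblendw xmm1, xmm2, 10101010b ; blend the shifted words back into xmm1
pand xmm1, [low12] ; low12 could be stored in another xmm register
xor eax, eax ; return 0 from main
ret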

I'd try to build a solution around the SSSE3 instruction PSHUFB;
Given A=[a0, a1, a2, a3 ... a7], B=[b0, b1, b2, .. b7];
PSHUFB(A,B) = [a_b0, a_b1, a_b2, ... a_b7],
except that the result byte will be zero, if the top bit of bX is 1.
Thus, if
A = [aa ab bb cc cd dd ee ef] == input vector
C=PSHUFB(A, [0 1 1 2 3 4 4 5]) = [aa ab ab bb cc cd cd dd]
C=PSRLW (C, [4 0 4 0]) = [0a aa ab bb 0c cc cd dd] // (>> 4)
C=PSLLW (C, 4) = [aa a0 bb b0 cc c0 dd d0] // << by immediate
A complete solution would read in 3 or 6 mmx / xmm registers and output 4/8 mmx/xmm registers each round. The middle two outputs will have to be combined from two input chunks, requiring some extra copying and combining of registers.
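Whichever SIMD variant is built, a scalar reference is handy for checking its output. The following is only a minimal sketch under an assumed packing that the question does not confirm: two pixels per three bytes, with the first pixel's low 8 bits in byte 0 and its high 4 bits in the low nibble of byte 1. The name unpack12to16 is hypothetical; adjust the shifts if your camera packs the nibbles differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference: expands packed 12-bit pixels into 16-bit values (upper 4 bits zero).
// ASSUMED layout (not confirmed by the question): 2 pixels in 3 bytes,
//   p0 = b0 | (low nibble of b1) << 8
//   p1 = (high nibble of b1) | b2 << 4
std::vector<std::uint16_t> unpack12to16(const std::uint8_t* src, std::size_t pixelCount) {
    std::vector<std::uint16_t> out(pixelCount);
    std::size_t i = 0;
    for (; i + 1 < pixelCount; i += 2) {
        const std::uint8_t* b = src + (i / 2) * 3;  // 3 source bytes per pixel pair
        out[i]     = static_cast<std::uint16_t>(b[0] | ((b[1] & 0x0F) << 8));
        out[i + 1] = static_cast<std::uint16_t>((b[1] >> 4) | (b[2] << 4));
    }
    if (i < pixelCount) {  // odd pixel count: the last pixel only needs 2 bytes
        const std::uint8_t* b = src + (i / 2) * 3;
        out[i] = static_cast<std::uint16_t>(b[0] | ((b[1] & 0x0F) << 8));
    }
    return out;
}
```

Comparing a vectorized routine against this on random input is a cheap way to catch shuffle-mask or shift mistakes.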

Related

Counting differences between 2 buffers seems too slow

My problem
I have 2 adjacent buffers of bytes of identical size (around 20 MB each). I just want to count the differences between them.
My question
How long should this loop take to run on a 4.8 GHz Intel i7-9700K with 3600 MT/s RAM?
How do we compute the maximum theoretical speed?
What I tried
uint64_t compareFunction(const char *const __restrict buffer, const uint64_t commonSize)
{
uint64_t diffFound = 0;
for(uint64_t byte = 0; byte < commonSize; ++byte)
diffFound += static_cast<uint64_t>(buffer[byte] != buffer[byte + commonSize]);
return diffFound;
}
It takes 11 ms on my PC (9700K at 4.8 GHz, 3600 MT/s RAM, Windows 10, Clang 14.0.6 -O3, MinGW) and I feel it is too slow and that I am missing something.
40 MB should take less than 2 ms to be read by the CPU (my RAM bandwidth is between 20 and 30 GB/s).
I don't know how to count the cycles required to execute one iteration (especially because CPUs are superscalar nowadays). If I assume 1 cycle per operation, and if I didn't mess up my counting, it should be 10 ops per iteration -> 200 million ops -> at 4.8 GHz with only one execution unit -> 40 ms. Obviously I am wrong about how to compute the number of cycles per loop.
Fun fact: I tried on Linux Pop!_OS with GCC 11.2 -O3 and it ran in 4.5 ms. Why such a difference?
Here are the scalar and vectorised disassemblies produced by Clang (scalar first):
compareFunction(char const*, unsigned long): # @compareFunction(char const*, unsigned long)
test rsi, rsi
je .LBB0_1
lea r8, [rdi + rsi]
neg rsi
xor edx, edx
xor eax, eax
.LBB0_4: # =>This Inner Loop Header: Depth=1
movzx r9d, byte ptr [rdi + rdx]
xor ecx, ecx
cmp r9b, byte ptr [r8 + rdx]
setne cl
add rax, rcx
add rdx, 1
mov rcx, rsi
add rcx, rdx
jne .LBB0_4
ret
.LBB0_1:
xor eax, eax
ret
Clang14 O3:
.LCPI0_0:
.quad 1 # 0x1
.quad 1 # 0x1
compareFunction(char const*, unsigned long): # @compareFunction(char const*, unsigned long)
test rsi, rsi
je .LBB0_1
cmp rsi, 4
jae .LBB0_4
xor r9d, r9d
xor eax, eax
jmp .LBB0_11
.LBB0_1:
xor eax, eax
ret
.LBB0_4:
mov r9, rsi
and r9, -4
lea rax, [r9 - 4]
mov r8, rax
shr r8, 2
add r8, 1
test rax, rax
je .LBB0_5
mov rdx, r8
and rdx, -2
lea r10, [rdi + 6]
lea r11, [rdi + rsi]
add r11, 6
pxor xmm0, xmm0
xor eax, eax
pcmpeqd xmm2, xmm2
movdqa xmm3, xmmword ptr [rip + .LCPI0_0] # xmm3 = [1,1]
pxor xmm1, xmm1
.LBB0_7: # =>This Inner Loop Header: Depth=1
movzx ecx, word ptr [r10 + rax - 6]
movd xmm4, ecx
movzx ecx, word ptr [r10 + rax - 4]
movd xmm5, ecx
movzx ecx, word ptr [r11 + rax - 6]
movd xmm6, ecx
pcmpeqb xmm6, xmm4
movzx ecx, word ptr [r11 + rax - 4]
movd xmm7, ecx
pcmpeqb xmm7, xmm5
pxor xmm6, xmm2
punpcklbw xmm6, xmm6 # xmm6 = xmm6[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
pshuflw xmm4, xmm6, 212 # xmm4 = xmm6[0,1,1,3,4,5,6,7]
pshufd xmm4, xmm4, 212 # xmm4 = xmm4[0,1,1,3]
pand xmm4, xmm3
paddq xmm4, xmm0
pxor xmm7, xmm2
punpcklbw xmm7, xmm7 # xmm7 = xmm7[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
pshuflw xmm0, xmm7, 212 # xmm0 = xmm7[0,1,1,3,4,5,6,7]
pshufd xmm5, xmm0, 212 # xmm5 = xmm0[0,1,1,3]
pand xmm5, xmm3
paddq xmm5, xmm1
movzx ecx, word ptr [r10 + rax - 2]
movd xmm0, ecx
movzx ecx, word ptr [r10 + rax]
movd xmm1, ecx
movzx ecx, word ptr [r11 + rax - 2]
movd xmm6, ecx
pcmpeqb xmm6, xmm0
movzx ecx, word ptr [r11 + rax]
movd xmm7, ecx
pcmpeqb xmm7, xmm1
pxor xmm6, xmm2
punpcklbw xmm6, xmm6 # xmm6 = xmm6[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
pshuflw xmm0, xmm6, 212 # xmm0 = xmm6[0,1,1,3,4,5,6,7]
pshufd xmm0, xmm0, 212 # xmm0 = xmm0[0,1,1,3]
pand xmm0, xmm3
paddq xmm0, xmm4
pxor xmm7, xmm2
punpcklbw xmm7, xmm7 # xmm7 = xmm7[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
pshuflw xmm1, xmm7, 212 # xmm1 = xmm7[0,1,1,3,4,5,6,7]
pshufd xmm1, xmm1, 212 # xmm1 = xmm1[0,1,1,3]
pand xmm1, xmm3
paddq xmm1, xmm5
add rax, 8
add rdx, -2
jne .LBB0_7
test r8b, 1
je .LBB0_10
.LBB0_9:
movzx ecx, word ptr [rdi + rax]
movd xmm2, ecx
movzx ecx, word ptr [rdi + rax + 2]
movd xmm3, ecx
add rax, rsi
movzx ecx, word ptr [rdi + rax]
movd xmm4, ecx
pcmpeqb xmm4, xmm2
movzx eax, word ptr [rdi + rax + 2]
movd xmm2, eax
pcmpeqb xmm2, xmm3
pcmpeqd xmm3, xmm3
pxor xmm4, xmm3
punpcklbw xmm4, xmm4 # xmm4 = xmm4[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
pshuflw xmm4, xmm4, 212 # xmm4 = xmm4[0,1,1,3,4,5,6,7]
pshufd xmm4, xmm4, 212 # xmm4 = xmm4[0,1,1,3]
movdqa xmm5, xmmword ptr [rip + .LCPI0_0] # xmm5 = [1,1]
pand xmm4, xmm5
paddq xmm0, xmm4
pxor xmm2, xmm3
punpcklbw xmm2, xmm2 # xmm2 = xmm2[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
pshuflw xmm2, xmm2, 212 # xmm2 = xmm2[0,1,1,3,4,5,6,7]
pshufd xmm2, xmm2, 212 # xmm2 = xmm2[0,1,1,3]
pand xmm2, xmm5
paddq xmm1, xmm2
.LBB0_10:
paddq xmm0, xmm1
pshufd xmm1, xmm0, 238 # xmm1 = xmm0[2,3,2,3]
paddq xmm1, xmm0
movq rax, xmm1
cmp r9, rsi
je .LBB0_13
.LBB0_11:
lea r8, [r9 + rsi]
sub rsi, r9
add r8, rdi
add rdi, r9
xor edx, edx
.LBB0_12: # =>This Inner Loop Header: Depth=1
movzx r9d, byte ptr [rdi + rdx]
xor ecx, ecx
cmp r9b, byte ptr [r8 + rdx]
setne cl
add rax, rcx
add rdx, 1
cmp rsi, rdx
jne .LBB0_12
.LBB0_13:
ret
.LBB0_5:
pxor xmm0, xmm0
xor eax, eax
pxor xmm1, xmm1
test r8b, 1
jne .LBB0_9
jmp .LBB0_10

TL;DR: the Clang code is slow because of a poor vectorization strategy that saturates port 5 (a frequent bottleneck). GCC does a better job here, but its code is still far from efficient. One can write much faster chunk-based code using AVX2 that does not saturate port 5.
Analysis of the unvectorized Clang code
To understand what is going on, it is better to start with a simple example. Indeed, as you said, modern processors are superscalar, so it is not easy to reason about the speed of generated code on such an architecture.
The code generated by Clang using the -O1 optimization flag is a good start. Here is the hot loop, as shown on Godbolt for the code in your question:
(instructions) (ports)
.LBB0_4:
movzx r9d, byte ptr [rdi + rdx] p23
xor ecx, ecx p0156
cmp r9b, byte ptr [r8 + rdx] p0156+p23
setne cl p06
add rax, rcx p0156
add rdx, 1 p0156
mov rcx, rsi (optimized)
add rcx, rdx p0156
jne .LBB0_4 p06
Modern processors like the Coffee Lake 9700K are structured in two big parts: a front-end that fetches and decodes the instructions (splitting them into micro-instructions, a.k.a. uops), and a back-end that schedules and executes them. The back-end schedules the uops on many ports, and each port can execute a specific set of instructions (e.g. only memory loads, or only arithmetic instructions). For each instruction, I listed the ports that can execute it. p0156+p23 means the instruction is split into two uops: the first can be executed by port 0, 1, 5 or 6, and the second by port 2 or 3. Note that the front-end can optimize the code so as not to produce any uops for basic instructions like the mov in the loop (thanks to a mechanism called register renaming).
For each loop iteration, the processor needs to read 2 values from memory. A Coffee Lake processor like the 9700K can load two values per cycle, so the loop takes at least 1 cycle/iteration (assuming the loads into r9d and r9b do not conflict due to the use of different parts of the same 64-bit r9 register). This processor has a uop cache and the loop has a lot of instructions, so the decoding part should not be a problem. That being said, there are 9 uops to execute and the processor can only execute 6 of them per cycle, so the loop cannot take less than 1.5 cycles/iteration. More precisely, ports 0, 1, 5 and 6 are under pressure, so even assuming the processor load-balances the uops perfectly, 2 cycles/iteration are needed. This is an optimistic lower bound on the execution time, since the processor may not schedule the instructions perfectly and many things could go wrong (like a sneaky hidden dependency I did not see). At a frequency of 4.8 GHz, the final execution time is at least 8.3 ms. It can reach 12.5 ms with 3 cycles/iteration (note that 2.5 cycles/iteration is possible due to the scheduling of uops to ports).
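The conversion from cycles per iteration to wall-clock time used in this analysis can be sketched as below. This is only an illustration: estimateMs is a hypothetical helper, and the 20 million iterations are an assumption taken from the "around 20 MB" buffer size in the question.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Estimated run time in milliseconds for `iterations` loop iterations,
// given a throughput in cycles per iteration and a core clock in GHz.
// Purely arithmetic: it ignores memory stalls and frequency changes.
constexpr double estimateMs(std::uint64_t iterations, double cyclesPerIteration, double clockGHz) {
    return static_cast<double>(iterations) * cyclesPerIteration / (clockGHz * 1e9) * 1e3;
}
```

With 20,000,000 iterations at 4.8 GHz, 2 cycles/iteration gives about 8.3 ms, 3 cycles/iteration gives 12.5 ms, and 1 cycle/iteration gives about 4.2 ms.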
The loop can be improved by unrolling. Indeed, a significant number of instructions are needed just for loop control and not for the actual computation. Unrolling increases the ratio of useful instructions and so makes better use of the available ports. Still, the 2 loads prevent the loop from being faster than 1 cycle/iteration, that is 4.2 ms.
Analysis of the vectorized Clang code
The vectorized code generated by Clang is complex. One could try to apply the same analysis as for the previous code, but it would be a tedious task.
Note that even though the code is vectorized, the loads are not. This is an issue since only 2 loads can be done per cycle. That being said, each load reads a pair of contiguous char values, so the loads are not so slow compared to the previously generated code.
Clang does that because only two 64-bit values fit in a 128-bit SSE register, and it needs 64-bit lanes because diffFound is a 64-bit integer. The 8-bit to 64-bit conversion is the biggest issue in the code because it requires several SSE instructions. Moreover, only 4 integers can be computed at a time since there are 3 SSE integer units on Coffee Lake and each of them can only compute two 64-bit integers at a time. In the end, Clang only puts 2 values in each SSE register (and uses 4 of them so as to compute 8 items per loop iteration), so one should expect code running more than twice as fast (especially thanks to SSE and the loop unrolling), but this is not really the case because there are fewer SSE ports than ALU ports and more instructions are required for the type conversions. In short, the vectorization is clearly inefficient, but it is not easy for Clang to generate efficient code in this case. Still, with 28 SSE instructions and 3 SSE integer units computing 8 items per loop iteration, one should expect the computing part of the code to take about 28/3/8 ~= 1.2 cycles/item, which is far from what you observe (and this is not due to the other instructions, since they can mostly be executed in parallel on other ports).
In fact, the performance issue certainly comes from the saturation of port 5. Indeed, this port is the only one that can shuffle items of SIMD registers. Thus, the instructions punpcklbw, pshuflw, pshufd and even movd can only be executed on port 5. This is a pretty common issue with SIMD code. It is a big issue here since there are 20 such instructions per loop iteration (8 items), that is at least 2.5 cycles/item, and the processor may not even schedule them perfectly. This means the code should take at least 10.4 ms, which is very close to the observed execution time (11 ms).
Analysis of the vectorized GCC code
The code generated by GCC is actually pretty good compared to Clang's. Firstly, GCC loads items using SIMD instructions directly, which is much more efficient as 16 items are computed per instruction (and per iteration): it only needs 2 load uops per iteration, reducing the pressure on ports 2 and 3 (1 cycle/iteration for that, so 0.0625 cycle/item). Secondly, GCC only uses 14 punpckhwd instructions while each iteration computes 16 items, reducing the critical pressure on port 5 (0.875 cycle/item for that). Thirdly, the SIMD registers are nearly fully used, at least for the comparison, since the pcmpeqb instruction compares 16 items at a time (as opposed to 2 with Clang). The other instructions like paddq are cheap (for example, paddq can be scheduled on the 3 SSE ports) and should not impact the execution time much. In the end, this version should still be bound by port 5, but it should be much faster than the Clang version. Indeed, one should expect the execution time to reach 1 cycle/item (since the port scheduling is certainly not perfect and memory loads may introduce some stall cycles). This means an execution time of 4.2 ms, which is close to the observed result.
Faster implementation
The GCC implementation is not perfect.
First of all, it does not use AVX2, which your processor supports, since the -mavx2 flag is not provided (nor any similar flag like -march=native). Indeed, GCC, like other mainstream compilers, only uses SSE2 by default for the sake of compatibility with older processors: SSE2 is safe to use on all x86-64 processors, unlike other instruction sets such as SSE3, SSSE3, SSE4.1, SSE4.2, AVX and AVX2. With such a flag, GCC should be able to produce memory-bound code.
Moreover, the compiler could theoretically perform a multi-level sum reduction. The idea is to accumulate the results of the comparison in 8-bit wide SIMD lanes using chunks of 1024 items (i.e. 64x16 items). This is safe since the value of each lane cannot exceed 64. To avoid overflow, the accumulated values then need to be stored in wider SIMD lanes (e.g. 64-bit ones). With this strategy, the overhead of the punpckhwd instructions is 64 times smaller. This is a big improvement since it removes the saturation of port 5. This strategy should be sufficient to generate memory-bound code, even using only SSE2. Here is an example of untested code that requires the flag -fopenmp-simd to be efficient.
uint64_t compareFunction(const char *const __restrict buffer, const uint64_t commonSize)
{
uint64_t byteChunk = 0;
uint64_t diffFound = 0;
if(commonSize >= 127)
{
for(; byteChunk < commonSize-127; byteChunk += 128)
{
uint8_t tmpDiffFound = 0;
#pragma omp simd reduction(+:tmpDiffFound)
for(uint64_t byte = byteChunk; byte < byteChunk + 128; ++byte)
tmpDiffFound += buffer[byte] != buffer[byte + commonSize];
diffFound += tmpDiffFound;
}
}
for(uint64_t byte = byteChunk; byte < commonSize; ++byte)
diffFound += buffer[byte] != buffer[byte + commonSize];
return diffFound;
}
Both GCC and Clang generate rather efficient code (though sub-optimal for data fitting in the cache), especially Clang. Here is, for example, the code generated by Clang using AVX2:
.LBB0_4:
lea r10, [rdx + 128]
vmovdqu ymm2, ymmword ptr [r9 + rdx - 96]
vmovdqu ymm3, ymmword ptr [r9 + rdx - 64]
vmovdqu ymm4, ymmword ptr [r9 + rdx - 32]
vpcmpeqb ymm2, ymm2, ymmword ptr [rcx + rdx - 96]
vpcmpeqb ymm3, ymm3, ymmword ptr [rcx + rdx - 64]
vpcmpeqb ymm4, ymm4, ymmword ptr [rcx + rdx - 32]
vmovdqu ymm5, ymmword ptr [r9 + rdx]
vpaddb ymm2, ymm4, ymm2
vpcmpeqb ymm4, ymm5, ymmword ptr [rcx + rdx]
vpaddb ymm3, ymm4, ymm3
vpaddb ymm2, ymm3, ymm2
vpaddb ymm2, ymm2, ymm0
vextracti128 xmm3, ymm2, 1
vpaddb xmm2, xmm2, xmm3
vpshufd xmm3, xmm2, 238
vpaddb xmm2, xmm2, xmm3
vpsadbw xmm2, xmm2, xmm1
vpextrb edx, xmm2, 0
add rax, rdx
mov rdx, r10
cmp r10, r8
jb .LBB0_4
All the loads are 256-bit SIMD ones. The number of vpcmpeqb is optimal. The number of vpaddb is relatively good. There are a few other instructions, but they should clearly not be a bottleneck. The loop operates on 128 items per iteration and I expect it to take less than a dozen cycles per iteration for data already in the cache (otherwise it should be completely memory-bound). This means <0.1 cycle/item, that is, far less than the previous implementation. In fact, the uiCA tool indicates about 0.055 cycle/item, that is 81 GiB/s! One may manually write better code using SIMD intrinsics, but at the expense of significantly worse portability, maintainability and readability.
Note that generating sequential memory-bound code does not always mean the RAM throughput will be saturated. In fact, on one core, there is sometimes not enough concurrency to hide the latency of memory operations, though it should be fine on your processor (like it is on my i5-9600KF with 2 interleaved 3200 MHz DDR4 memory channels).

Yes, if your data is not hot in cache, even SSE2 should keep up with memory bandwidth. Compare-and-sum of 32 compare results per cycle (from two 32-byte loads) is totally possible if data is hot in L1d cache, or whatever bandwidth outer levels of cache can provide.
If not, the compiler did a bad job. That's unfortunately common for problems like this reducing into a wider variable; compilers don't know good vectorization strategies for summing bytes, especially compare-result bytes that must be 0/-1. They probably widen to 64-bit with pmovsxbq right away (or even worse if SSE4.1 instructions aren't available).
So even -O3 -march=native doesn't help much; this is a big missed-optimization; hopefully GCC and clang will learn how to vectorize this kind of loop at some point, summing compare results probably comes up in enough codebases to be worth recognizing that pattern.
The efficient way is to use psadbw to sum horizontally into qwords. But only after an inner loop does some iterations of vsum -= cmp(p, q), subtracting 0 or -1 to increment a counter or not. 8-bit elements can do 255 iterations of that without risk of overflow. And with unrolling for multiple vector accumulators, that's many vectors of 32 bytes each, so you don't have to break out of that inner loop very often.
See How to count character occurrences using SIMD for manually-vectorized AVX2 code. (And one answer has a Godbolt link to an SSE2 version.) Summing the compare results is the same problem as that, but you're loading two vectors to feed pcmpeqb instead of broadcasting one byte outside the loop to find occurrences of a single char.
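For illustration, here is a minimal SSE2 sketch of the strategy described above: byte accumulators updated with acc -= cmpeq for at most 255 iterations, then psadbw (_mm_sad_epu8) against zero to widen into 64-bit sums. It is not the code from the linked answers; countDiffs and its parameters are hypothetical names, it uses a single accumulator with no unrolling, and it counts matches and subtracts them from the number of bytes processed. A scalar loop handles the tail, and everything if SSE2 is not available.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

// Counts positions i in [0, n) where a[i] != b[i].
std::uint64_t countDiffs(const unsigned char* a, const unsigned char* b, std::size_t n) {
    std::uint64_t matches = 0;
    std::size_t i = 0;
#ifdef HAVE_SSE2
    const __m128i zero = _mm_setzero_si128();
    __m128i total = zero;                       // two 64-bit partial sums
    while (n - i >= 16) {
        __m128i acc = zero;                     // sixteen 8-bit match counters
        std::size_t blocks = (n - i) / 16;
        if (blocks > 255) blocks = 255;         // an 8-bit counter cannot wrap in 255 steps
        for (std::size_t k = 0; k < blocks; ++k, i += 16) {
            const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
            const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
            // pcmpeqb yields 0 or -1 per byte; subtracting -1 adds 1 per match
            acc = _mm_sub_epi8(acc, _mm_cmpeq_epi8(va, vb));
        }
        // psadbw against zero sums each group of 8 bytes into a 64-bit lane
        total = _mm_add_epi64(total, _mm_sad_epu8(acc, zero));
    }
    std::uint64_t lanes[2];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), total);
    matches = lanes[0] + lanes[1];
#endif
    std::uint64_t diffs = i - matches;          // bytes handled so far minus matches
    for (; i < n; ++i)                          // scalar tail
        diffs += a[i] != b[i];
    return diffs;
}
```

For the question's layout one would call it as countDiffs(p, p + commonSize, commonSize), with p being the buffer viewed as unsigned char. Using several accumulators per iteration, as the text suggests, would reduce how often the inner loop has to be left.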
An answer there has benchmarks that report 28 GB/s for AVX2, 23 GB/s for SSE2, on an i7-6700 Skylake (at only 3.4GHz, maybe they disabled turbo or are just reporting the rated speed. DRAM speed not mentioned.)
I'd expect 2 input streams of data to achieve about the same sustained bandwidth as one.
This is more interesting to optimize if you benchmark repeated passes over smaller arrays that fit in L2 cache, then efficiency of your ALU instructions matters. (The strategy in the answers on that question are pretty good and well tuned for that case.)
Fast counting the number of equal bytes between two arrays is an older Q&A using a worse strategy, not using psadbw to sum bytes to 64-bit. (But not as bad as GCC/clang, still hsumming as it widens to 32-bit.)
Multiple threads/cores will barely help on a modern desktop, especially at high core clocks like yours. Memory latency is low enough and each core has enough buffers to keep enough requests in flight that it can nearly saturate dual-channel DRAM controllers.
On a big Xeon, that would be very different; you need most of the cores to achieve peak aggregate bandwidth, even for just memcpy or memset so there's zero ALU work, just loads/stores. The higher latency means a single core has much less memory bandwidth available than on a desktop (even in an absolute sense, let alone as a percentage of 6 channels instead of 2). See also Enhanced REP MOVSB for memcpy and Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?
Portable source that compiles to less-bad asm, micro-optimized from Jérôme's: 5.5 cycles per 4x 32-byte vectors, down from 7 or 8, assuming L1d cache hits.
Still not good (as it reduces to scalar every 128 bytes, or 192 if you want to try that), but @Jérôme Richard came up with a clever way to give clang something it could vectorize with a good strategy, with a uint8_t sum, using that as an inner loop short enough to not overflow.
But clang still does some dumb things with that loop, as we can see in his answer. I modified the loop control to use a pointer increment, which reduces the loop overhead a bit, just one pointer-add and compare/jcc, not LEA/MOV. I don't know why clang was doing it inefficiently using integer indexing.
And it avoids an indexed addressing mode for the vpcmpeqb memory source operands, letting them stay micro-fused on Intel CPUs. (Clang doesn't seem to know that this matters at all! Reversing operands to != in the source was enough to make it use indexed addressing modes for vpcmpeqb instead of for vmovdqu pure loads.)
// micro-optimized version of Jérôme's function, clang compiles this better
// instead of 2 arrays, it compares first and 2nd half of one array, which lets it index one relative to the other with an offset if we hand-hold clang into doing that.
uint64_t compareFunction_sink_fixup(const char *const __restrict buffer, const size_t commonSize)
{
uint64_t byteChunk = 0;
uint64_t diffFound = 0;
const char *endp = buffer + commonSize;
const char *__restrict ptr = buffer;
if(commonSize >= 127) {
// A signed type for commonSize wouldn't avoid UB in pointer subtraction creating a pointer before the object
// in practice it would be fine except maybe when inlining into a function where the compiler could see a compile-time-constant array size.
for(; ptr < endp-127 ; ptr += 128)
{
uint8_t tmpDiffFound = 0;
#pragma omp simd reduction(+:tmpDiffFound)
for(int off = 0 ; off < 128; ++off)
tmpDiffFound += ptr[off + commonSize] != ptr[off];
// without AVX-512, we get -1 for ==, 0 for not-equal. So clang adds set1_epi(4) to each bucket that holds the sum of four 0 / -1 elements
diffFound += tmpDiffFound;
}
}
// clang still auto-vectorizes, but knows the max trip count is only 127
// so doesn't unroll, just 4 bytes per iter.
for(int byte = 0 ; byte < commonSize % 128 ; ++byte)
diffFound += ptr[byte] != ptr[byte + commonSize];
return diffFound;
}
Godbolt with clang15 -O3 -fopenmp-simd -mavx2 -march=skylake -mbranches-within-32B-boundaries
# The main loop, from clang 15 for x86-64 Skylake
.LBB0_4: # =>This Inner Loop Header: Depth=1
vmovdqu ymm2, ymmword ptr [rdi + rsi]
vmovdqu ymm3, ymmword ptr [rdi + rsi + 32] # Indexed addressing modes are fine here
vmovdqu ymm4, ymmword ptr [rdi + rsi + 64]
vmovdqu ymm5, ymmword ptr [rdi + rsi + 96]
vpcmpeqb ymm2, ymm2, ymmword ptr [rdi] # non-indexed allow micro-fusion without un-lamination
vpcmpeqb ymm3, ymm3, ymmword ptr [rdi + 32]
vpcmpeqb ymm4, ymm4, ymmword ptr [rdi + 64]
vpaddb ymm2, ymm4, ymm2
vpcmpeqb ymm4, ymm5, ymmword ptr [rdi + 96]
vpaddb ymm3, ymm4, ymm3
vpaddb ymm2, ymm2, ymm3
vpaddb ymm2, ymm2, ymm0 # add a vector of set1_epi8(4) to turn sums of 0 / -1 into sums of 1 / 0
vextracti128 xmm3, ymm2, 1
vpaddb xmm2, xmm2, xmm3
vpshufd xmm3, xmm2, 238 # xmm3 = xmm2[2,3,2,3]
vpaddb xmm2, xmm2, xmm3 # reduced to 8 bytes
vpsadbw xmm2, xmm2, xmm1 # hsum to one qword
vpextrb edx, xmm2, 0 # extract and zero-extend
add rax, rdx # accumulate the chunk sum
sub rdi, -128 # pointer increment (with a sign_extended_imm8 instead of +imm32)
cmp rdi, rcx
jb .LBB0_4 # }while(p < endp)
This could use 192 instead of 128 to further amortize the loop overhead, at the cost of needing to do %192 (not a power of 2), and making the cleanup loop worst case be 191 bytes. We can't go to 256, or anything higher than UINT8_MAX (255), and sticking to multiples of 32 is necessary. Or 64 for good measure.
There's an extra vpaddb of a fixup constant, set1_epi8(4), which turns the sum of four 0 / -1 into a sum of four 1 / 0 results from the C != operator.
I don't think there's any way to get rid of it or sink it out of the loop while still accumulating into a uint8_t, which is necessary for clang to vectorize this way. It doesn't know how to use vpsadbw to do a widening (non-truncating) sum of bytes, which is ironic because that's what it actually does when used against an all-zero register. If you do something like sum += ptr[off + commonSize] == ptr[off] ? -1 : 0 you can get it to use the vpcmpeqb result directly, summing 4 vectors down to one with 3 adds, and eventually feeding that to vpsadbw after some reduction steps. So you get a sum of matches * 0xFF truncated to uint8_t for each block of 128 bytes. Or as an int8_t, that's a sum of -1 * matches, so 0..-128, which doesn't overflow a signed byte. So that's interesting. But adding with zero-extension into a 64-bit counter might destroy information, and sign-extension inside the outer loop would cost another instruction. It would be a scalar movsx instruction instead of vpaddb, but that's not important for Skylake, probably only if using AVX-512 with 512-bit vectors (which clang and GCC both do badly, not using masked adds). Can we do 128*n_chunks - count after the loop to recover the differences from the sum of matches? No, I don't think so.
uiCA static analysis predicts Skylake (such as your CPU) will run the main loop at 5.51 cycles / iter (4 vectors) if data is hot in L1d cache, or 5.05 on Ice Lake / Rocket Lake. (I had to hand-tweak the asm to emulate the padding effect -mbranches-within-32B-boundaries would have, for uiCA's default assumption of where the top of the loop is relative to a 32-byte alignment boundary. I could have just changed that setting in uiCA instead. :/)
The only missed micro-optimization in implementing this sub-optimal strategy is that it's using vpextrb (because it doesn't prove that truncation to uint8_t isn't needed?) instead of vmovd or vmovq. So it costs an extra uop for the front-end, and for port 5 in the back end. With that optimized (comment + uncomment in the link), 5.25c / iter on Skylake, or 4.81 on Ice Lake, pretty close to the 2 load/clock bottleneck.
(Doing 6 vectors per iter, 192 bytes, predicts 7 cycles per iter on SKL, or 1.166 per vector, down from 5.5 / iter = 1.375 per vector. Or about 6.5 on ICL/RKL = 1.08 c/vec, hitting back-end ALU port bottlenecks.)
This is not bad for something we were able to coax clang into generating from portable C++ source, vs. 4 cycles per 4 vectors of 32 byte-compares each for efficient manual vectorization. This will very likely keep up with memory or cache bandwidth even from L2 cache, so it's pretty usable, and not much slower with data hot in L1d. Taking a few more uops does hurt out-of-order exec, and uses up more execution resources that another logical core sharing a physical core could use. (Hyperthreading).
Unfortunately gcc/clang do not make good use of AVX-512 for this. If you were using 512-bit vectors (or AVX-512 features on 256-bit vectors), you'd compare into mask registers, then do something like vpaddb zmm0{k1}, zmm0, zmm1 merge-masking to conditionally increment a vector, where zmm1 = set1_epi8( 1 ). (Or a -1 constant with sub.) Instruction and uop count per vector should be about the same as AVX2 if done properly, but gcc/clang use about twice as many, so the only saving is in the reduction to scalar which seems to be the price for getting anything at all usable.
This version also avoids unrolling of the clean-up loop, just vectorizing with its dumb 4 bytes per iter strategy, which is about right for cleanup of size%128 bytes. It's pretty silly that it uses both vpxor to flip and vpand to turn 0xff into 0x01, when it could have used vpandn to do both those things in one instruction. That would get that cleanup loop down to 8 uops, just twice the pipeline width on Haswell / Skylake, so it would issue more efficiently from the loop buffer, except Skylake disabled that in microcode updates. It would help a bit on Haswell.

Correct me if I am wrong but the answer seems to be
-march=native for the win.
the scalar version of the code was CPU bottlenecked and not RAM bottlenecked
use uica.uops.info to have an estimate of the cycles per loop
I will try to write my own AVX code to compare.
Details
After an afternoon tinkering around with the suggestions, here is what I found with clang:
-O1 around 10ms, scalar code
-O3 enables SSE2 and is as slow as O1, maybe poor assembly code
-O3 -march=westmere enables also SSE2 but is faster (7ms)
-O3 -march=native enables AVX -> 2.5ms and we are probably RAM bandwidth limited (close to the theoretical speed)
The scalar 10ms makes sense now because according to that awesome tool uica.uops.info it takes
2.35 cycles per loop
47 million cycles for the whole comparison (20 million iterations)
Processor is clocked at 4.8GHz meaning it should take around 9.8ms and it is close to what is measured.
g++ seems to generate better default code when no flags are added
O1 11ms
O2 scalar still but 9ms
O3 SSE 4.5ms
O3 -march=westmere 7ms like clang
O3 -march=native 3.4ms, slightly slower than clang

I still don't get how IMUL works in Assembly

I am a beginner with assembly. I just started learning it and I don't get how the IMUL instruction really works.
For example, I'm working on this piece of code in Visual Studio:
Mat = 0A2A(hexadecimal)
__asm {
MOV AX, Mat
AND AL,7Ch
OR AL,83h
XOR BL,BL
SUB BL,2
IMUL BL
MOV Ris5,AX
}
The result in Ris5 should be 00AA (in hexadecimal). I'm fine with the first few lines, from the first line to 'SUB BL,2':
the results are AL = AB (AX = 0AAB),
but starting from IMUL I'm stuck.
I know that IMUL performs a signed multiply of AL by a register, a byte, or a word and stores the result in AX (here), but I can't arrive at the same result (00AA).

MOV AX, Mat    ; AX = 0x0A2A (...00101010)
AND AL,7Ch     ; AX = 0x0A28 (...00101000)
OR AL,83h      ; AX = 0x0AAB (...10101011)
XOR BL,BL      ; BL = 0x00
SUB BL,2       ; BL = 0xFE
IMUL BL        ; AX = 0xFFAB * 0xFFFE = 0x00AA
MOV Ris5,AX    ; Ris5 = 0x00AA
When you multiply two N bit numbers the lower N bits don't care about signed vs unsigned, but as you pad the numbers then you get into signed vs unsigned multiply instructions as you will see in some instruction sets. To not lose precision you desire a 2*N number of bits result, grade school math:
00000000aaaaaaaa
* 00000000bbbbbbbb
=====================
AAAAAAAAaaaaaaaa
* BBBBBBBBbbbbbbbb
====================
Signed vs unsigned with the Capital letter representing the sign extension
0xAB = 171 unsigned = -85 signed
0xFE = 254 unsigned = -2 signed
unsigned multiply 171 * 254 = 43434 = 0xA9AA
signed multiply -85 * -2 = 170 = 0x00AA
The lower byte is the same as they are 8 bit operands and the sign extension doesn't come into play:
bbbbbbbb *a[0]
bbbbbbbb *a[1]
bbbbbbbb *a[2]
bbbbbbbb *a[3]
bbbbbbbb *a[4]
bbbbbbbb *a[5]
bbbbbbbb *a[6]
+ bbbbbbbb *a[7]
==================
cyyyyyyyxxxxxxxx
If you look up the columns the x bits are not affected by the sign extension so are the same for unsigned and signed. y bits are affected as well as the carry out of the msbit c which makes up the 16th bit of the result.
Now the tool is not complaining about this syntax, is it?
Mat = 0A2A(hexadecimal)
Without an h at the end or 0x or $ up-front that looks like octal, but the A's would cause an error if octal (or if decimal). Assuming you start with 0x0A2A, I think your understanding is solid.
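The signed versus unsigned difference can also be checked in C. The following is a minimal sketch, with made-up function names, that mimics an 8-bit by 8-bit multiply producing a 16-bit result the way MUL and IMUL with a byte operand do. It assumes the usual two's complement conversion when casting to int8_t, which is what mainstream compilers do.

```c
#include <assert.h>
#include <stdint.h>

/* 8-bit x 8-bit -> 16-bit with operands zero-extended (like MUL r/m8). */
uint16_t mul8_unsigned(uint8_t a, uint8_t b)
{
    return (uint16_t)((unsigned)a * (unsigned)b);
}

/* 8-bit x 8-bit -> 16-bit with operands sign-extended (like IMUL r/m8).
   Casting a value above 127 to int8_t is implementation-defined in C;
   this sketch assumes it wraps to the two's complement value. */
uint16_t mul8_signed(uint8_t a, uint8_t b)
{
    int product = (int)(int8_t)a * (int)(int8_t)b;
    return (uint16_t)product;
}

/* The low byte never depends on signedness; only the high byte does. */
void check_low_bytes_agree(void)
{
    for (unsigned a = 0; a < 256; a++)
        for (unsigned b = 0; b < 256; b++)
            assert((mul8_unsigned((uint8_t)a, (uint8_t)b) & 0xFF) ==
                   (mul8_signed((uint8_t)a, (uint8_t)b) & 0xFF));
}
```

With AL = 0xAB and BL = 0xFE, mul8_signed returns 0x00AA (the IMUL result) and mul8_unsigned returns 0xA9AA (what MUL would give).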

Adding arrays using YMM instructions using gcc

I want to run the following code (in Intel syntax) in gcc (AT&T syntax).
; float a[128], b[128], c[128];
; for (int i = 0; i < 128; i++) a[i] = b[i] + c[i];
; Assume that a, b and c are aligned by 32
xor ecx, ecx ; Loop counter i = 0
L: vmovaps ymm0, [b+rcx] ; Load 8 elements from b
vaddps ymm0,ymm0,[c+rcx] ; Add 8 elements from c
vmovaps [a+rcx], ymm0 ; Store result in a
add ecx,32 ; 8 elements * 4 bytes = 32
cmp ecx, 512 ; 128 elements * 4 bytes = 512
jb L ;Loop
Code is from Optimizing subroutines in assembly language.
The code I've written so far is:
static inline void addArray(float *a, float *b, float *c) {
__asm__ __volatile__ (
"nop \n"
"xor %%ecx, %%ecx \n" //;Loop counter set to 0
"loop: \n\t"
"vmovaps %1, %%ymm0 \n" //;Load 8 elements from b <== WRONG
"vaddps %2, %%ymm0, %%ymm0 \n" //;Add 8 elements from c <==WRONG
"vmovaps %%ymm0, %0 \n" //;Store result in a
"add 0x20, %%ecx \n" //;8 elemtns * 4 bytes = 32 (0x20)
"cmp 0x200,%%ecx \n" //;128 elements * 4 bytes = 512 (0x200)
"jb loop \n" //;Loop"
"nop \n"
: "=m"(a) //Outputs
: "m"(b), "m"(c) //Inputs
: "%ecx","%ymm0" //Modifies ECX and YMM0
);
}
The lines marked as "wrong" generate: (except from gdb disassemble)
0x0000000000000b78 <+19>: vmovaps -0x10(%rbp),%ymm0
0x0000000000000b7d <+24>: vaddps -0x18(%rbp),%ymm0,%ymm0
I want to get something like this (I guess):
vmovaps -0x10(%rbp,%ecx,%0x8),%ymm0
But I do not know how to specify %ecx as my index register.
Can you help me, please?
EDIT
I've tried (%1, %%ecx):
__asm__ __volatile__ (
"nop \n"
"xor %%ecx, %%ecx \n" //;Loop counter set to 0
"loop: \n\t"
"vmovaps (%1, %%rcx), %%ymm0 \n" //;Load 8 elements from b <== MODIFIED HERE
"vaddps %2, %%ymm0, %%ymm0 \n" //;Add 8 elements from c
"vmovaps %%ymm0, %0 \n" //;Store result in a
"add 0x20, %%ecx \n" //;8 elemtns * 4 bytes = 32 (0x20)
"cmp 0x200,%%ecx \n" //;128 elements * 4 bytes = 512 (0x200)
"jb loop \n" //;Loop"
"nop \n"
: "=m"(a) //Outputs
: "m"(b), "m"(c) //Inputs
: "%ecx","%ymm0" //Modifies ECX and YMM0
);
And I got:
inline1.cpp: Assembler messages:
inline1.cpp:90: Error: found '(', expected: ')'
inline1.cpp:90: Error: junk `(%rbp),%rcx)' after expression

I don't think it is possible to translate this literally into GAS inline assembly. In AT&T syntax, the syntax is:
displacement(base register, offset register, scalar multiplier)
which would produce something akin to:
movl -4(%ebp, %ecx, 4), %eax
or in your case:
vmovaps -16(%rsp, %rcx, 1), %ymm0
The problem is, when you use a memory constraint (m), the inline assembler is going to emit the following wherever you write %n (where n is the number of the input/output):
-16(%rsp)
There is no way to manipulate the above into the form you actually want. You can write:
(%1, %%rcx)
but this will produce:
(-16(%rsp),%rcx)
which is clearly wrong. There is no way to get the offset register inside of those parentheses, where it belongs, since %n is emitting the whole -16(%rsp) as a chunk.
Of course, this is not really an issue, since you write inline assembly to get speed, and there's nothing speedy about loading from memory. You should have the inputs in a register, and when you use a register constraint for the input/output (r), you don't have a problem. Notice that this will require modifying your code slightly.
Other things wrong with your inline assembly include:
Numeric literals (immediate operands) begin with $. Without it, a bare number such as 0x20 is treated as a memory address.
Instructions should have size suffixes, like l for 32-bit and q for 64-bit.
You are clobbering memory when you write through a, so you should have a memory clobber.
The nop instructions at the beginning and the end are completely pointless. They aren't even aligning the branch target.
Every line should really end with a tab character (\t), in addition to a new-line (\n), so that you get proper alignment when you inspect the disassembly.
Here is my version of the code:
void addArray(float *a, float *b, float *c) {
__asm__ __volatile__ (
"xorl %%ecx, %%ecx \n\t" // Loop counter set to 0
"loop: \n\t"
"vmovaps (%1,%%rcx), %%ymm0 \n\t" // Load 8 elements from b
"vaddps (%2,%%rcx), %%ymm0, %%ymm0 \n\t" // Add 8 elements from c
"vmovaps %%ymm0, (%0,%%rcx) \n\t" // Store result in a
"addl $0x20, %%ecx \n\t" // 8 elements * 4 bytes = 32 (0x20)
"cmpl $0x200, %%ecx \n\t" // 128 elements * 4 bytes = 512 (0x200)
"jb loop" // Loop
: // Outputs
: "r" (a), "r" (b), "r" (c) // Inputs
: "%ecx", "%ymm0", "memory" // Modifies ECX, YMM0, and memory
);
}
This causes the compiler to emit the following:
addArray(float*, float*, float*):
xorl %ecx, %ecx
loop:
vmovaps (%rsi,%rcx), %ymm0 # b
vaddps (%rdx,%rcx), %ymm0, %ymm0 # c
vmovaps %ymm0, (%rdi,%rcx) # a
addl $0x20, %ecx
cmpl $0x200, %ecx
jb loop
vzeroupper
retq
Or, in the more familiar Intel syntax:
addArray(float*, float*, float*):
xor ecx, ecx
loop:
vmovaps ymm0, YMMWORD PTR [rsi + rcx]
vaddps ymm0, ymm0, YMMWORD PTR [rdx + rcx]
vmovaps YMMWORD PTR [rdi + rcx], ymm0
add ecx, 32
cmp ecx, 512
jb loop
vzeroupper
ret
In the System V 64-bit calling convention, the first three parameters are passed in the rdi, rsi, and rdx registers, so the code doesn't need to move the parameters into registers—they are already there.
But you are not using input/output constraints to their fullest. You don't need rcx to be used as the counter. Nor do you need to use ymm0 as the scratch register. If you let the compiler pick which free registers to use, it will make the code more efficient. You also won't need to name specific registers in the clobber list:
#include <stdint.h>
#include <x86intrin.h>
void addArray(float *a, float *b, float *c) {
uint64_t temp = 0;
__m256 ymm;
__asm__ __volatile__(
"loop: \n\t"
"vmovaps (%3,%0), %1 \n\t" // Load 8 elements from b
"vaddps (%4,%0), %1, %1 \n\t" // Add 8 elements from c
"vmovaps %1, (%2,%0) \n\t" // Store result in a
"addq $0x20, %0 \n\t" // 8 elements * 4 bytes = 32 (0x20); %0 is a 64-bit register
"cmpq $0x200, %0 \n\t" // 128 elements * 4 bytes = 512 (0x200)
"jb loop" // Loop
: "+r" (temp), "=x" (ymm)
: "r" (a), "r" (b), "r" (c)
: "memory"
);
}
Of course, as has been mentioned in the comments, this entire exercise is a waste of time. GAS-style inline assembly, although powerful, is exceedingly difficult to write correctly (I'm not even 100% positive that my code here is correct!), so you should not write anything using inline assembly that you absolutely don't have to. And this is certainly not a case where you have to—the compiler will optimize the addition loop automatically:
void addArray(float *a, float *b, float *c) {
for (int i = 0; i < 128; i++) a[i] = b[i] + c[i];
}
With -O2 and -mavx2, GCC compiles this to the following:
addArray(float*, float*, float*):
xor eax, eax
.L2:
vmovss xmm0, DWORD PTR [rsi+rax]
vaddss xmm0, xmm0, DWORD PTR [rdx+rax]
vmovss DWORD PTR [rdi+rax], xmm0
add rax, 4
cmp rax, 512
jne .L2
rep ret
Well, that looks awfully familiar, doesn't it? To be fair, it isn't vectorized like your code is. You can get that by using -O3 or -ftree-vectorize, but you also get a lot more code generated, so I'd need a benchmark to convince me that it was actually faster and worth the explosion in code size. But most of this is to handle cases where the input isn't aligned—if you indicate that it is aligned and that the pointer is restricted, that solves these problems and improves the code generation substantially. Notice that it is completely unrolling the loop, as well as vectorizing the addition.
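One possible way to state the alignment and no-aliasing guarantees is sketched here. It uses the C99 restrict qualifier and the GCC/Clang builtin __builtin_assume_aligned; the function name add_array_aligned is made up for illustration, and this may not be exactly how the original answer expressed it.

```c
#include <assert.h>
#include <stdint.h>

/* Adds 128 floats element-wise. The caller promises that the three arrays
   do not overlap (restrict) and are 32-byte aligned. */
void add_array_aligned(float *restrict a,
                       const float *restrict b,
                       const float *restrict c)
{
    /* Debug-build check of the alignment promise. */
    assert((uintptr_t)a % 32 == 0 && (uintptr_t)b % 32 == 0 && (uintptr_t)c % 32 == 0);

    /* Tell the optimizer about the alignment (GCC/Clang builtin). */
    float *pa = __builtin_assume_aligned(a, 32);
    const float *pb = __builtin_assume_aligned(b, 32);
    const float *pc = __builtin_assume_aligned(c, 32);

    for (int i = 0; i < 128; i++)
        pa[i] = pb[i] + pc[i];
}
```

With these hints the vectorizer no longer needs runtime overlap checks or an alignment prologue, which is where most of the extra code comes from. The exact output depends on the compiler version and flags.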

x86: Long loop-carried dependency chain. Why 13 cycles?

I modified the code from a previous experiment (Agner Fog's Optimizing Assembly, example 12.10a) to make it more dependent:
movsd xmm2, [x]
movsd xmm1, [one]
xorps xmm0, xmm0
mov eax, coeff
L1:
movsd xmm3, [eax]
mulsd xmm3, xmm1
mulsd xmm1, xmm2
addsd xmm1, xmm3
add eax, 8
cmp eax, coeff_end
jb L1
And now it takes ~13 cycles per iteration, but I have no idea why so much.
Please help me understand.
(update)
I'm sorry. Yes, @Peter Cordes is definitely right: it in fact takes 9 cycles per iteration. The misunderstanding was my own fault. I mixed up two similar pieces of code (with two instructions swapped); the 13-cycle code is here:
movsd xmm2, [x]
movsd xmm1, [one]
xorps xmm0, xmm0
mov eax, coeff
L1:
movsd xmm3, [eax]
mulsd xmm1, xmm2
mulsd xmm3, xmm1
addsd xmm1, xmm3
add eax, 8
cmp eax, coeff_end
jb L1

It runs at exactly one iteration per 9c for me, on a Core2 E6600, which is expected:
movsd xmm3, [eax] ; independent, depends only on eax
A: mulsd xmm3, xmm1 ; 5c: depends on xmm1:C from last iteration
B: mulsd xmm1, xmm2 ; 5c: depends on xmm1:C from last iteration
C: addsd xmm1, xmm3 ; 3c: depends on xmm1:B from THIS iteration (and xmm3:A from this iteration)
When xmm1:C is ready from iteration i, the next iteration can start calculating:
A: producing xmm3:A in 5c
B: producing xmm1:B in 5c (but there's a resource conflict; these multiplies can't both start in the same cycle in Core2 or IvyBridge, only Haswell and later)
Regardless of which one runs first, both have to finish before C can run. So the loop-carried dependency chain is 5 + 3 cycles, +1c for the resource conflict that stops both multiplies from starting in the same cycle.
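The same bookkeeping can be written as a tiny C model. This is a rough sketch under the latencies quoted above (5c multiply, 3c add, 1c port conflict), not a simulator, and chain_cycles is a made-up name. It also suggests why the swapped listing in the question's update costs 13 cycles: there the second multiply reads the xmm1 produced by the first one, so the two multiplies run back to back instead of overlapping. That last point is an inference from the listing, not something measured here.

```c
#include <assert.h>

/* Length in cycles of the loop-carried dependency chain for one iteration.
   serial_muls == 0: both multiplies read xmm1 from the previous iteration,
                     so they overlap apart from the port conflict.
   serial_muls != 0: the second multiply depends on the first one's result. */
int chain_cycles(int mul_latency, int add_latency, int port_conflict, int serial_muls)
{
    assert(mul_latency > 0 && add_latency > 0 && port_conflict >= 0);
    int muls_done = serial_muls
        ? mul_latency + mul_latency      /* second multiply waits for the first */
        : mul_latency + port_conflict;   /* later multiply starts one cycle late */
    return muls_done + add_latency;      /* addsd needs both products */
}
```

chain_cycles(5, 3, 1, 0) gives 9 for the first listing and chain_cycles(5, 3, 1, 1) gives 13 for the swapped one.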
Test code that runs at the expected speed:
This slows down to one iteration per ~11c when the array is 8B * 128 * 1024. If you're testing with an even bigger array instead of using a repeat-loop around what you posted, then that's why you're seeing a higher latency.
If a load arrives late, there's no way for the CPU to "catch up", since it delays the loop-carried dependency chain. If the load was only needed in a dependency chain that forked off from the loop-carried chain, then the pipeline could absorb an occasional slow load more easily. So, some loops can be more sensitive to memory delays than others.
default REL
%macro IACA_start 0
mov ebx, 111
db 0x64, 0x67, 0x90
%endmacro
%macro IACA_end 0
mov ebx, 222
db 0x64, 0x67, 0x90
%endmacro
global _start
_start:
movsd xmm2, [x]
movsd xmm1, [one]
xorps xmm0, xmm0
mov ecx, 10000
outer_loop:
mov eax, coeff
IACA_start ; outside the loop
ALIGN 32 ; this matters on Core2, .78 insn per cycle vs. 0.63 without
L1:
movsd xmm3, [eax]
mulsd xmm3, xmm1
mulsd xmm1, xmm2
addsd xmm1, xmm3
add eax, 8
cmp eax, coeff_end
jb L1
IACA_end
dec ecx
jnz outer_loop
;mov eax, 1
;int 0x80 ; exit() for 32bit code
xor edi, edi
mov eax, 231 ; exit_group(0). __NR_exit = 60.
syscall
section .data
x:
one: dq 1.0
section .bss
coeff: resq 24*1024 ; 6 * L1 size. Doesn't run any faster when it fits in L1 (resb)
coeff_end:
Experimental test
$ asm-link interiteration-test.asm
+ yasm -felf64 -Worphan-labels -gdwarf2 interiteration-test.asm
+ ld -o interiteration-test interiteration-test.o
$ perf stat ./interiteration-test
Performance counter stats for './interiteration-test':
928.543744 task-clock (msec) # 0.995 CPUs utilized
152 context-switches # 0.164 K/sec
1 cpu-migrations # 0.001 K/sec
52 page-faults # 0.056 K/sec
2,222,536,634 cycles # 2.394 GHz (50.14%)
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
1,723,575,954 instructions # 0.78 insns per cycle (75.06%)
246,414,304 branches # 265.377 M/sec (75.16%)
51,483 branch-misses # 0.02% of all branches (74.74%)
0.933372495 seconds time elapsed
Each branch / every 7 instructions is one iteration of the inner loop.
$ bc -l
bc 1.06.95
1723575954 / 7
246225136.28571428571428571428
# ~= number of branches: good
2222536634 / .
9.026
# cycles per iteration
IACA agrees: 9c per iteration on IvB
(not counting the nops from ALIGN):
$ iaca.sh -arch IVB interiteration-test
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - interiteration-test
Binary Format - 64Bit
Architecture - IVB
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 9.00 Cycles Throughput Bottleneck: InterIteration
Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 |
-------------------------------------------------------------------------
| Cycles | 2.0 0.0 | 1.0 | 0.5 0.5 | 0.5 0.5 | 0.0 | 2.0 |
-------------------------------------------------------------------------
N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | |
---------------------------------------------------------------------
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | movsd xmm3, qword ptr [eax]
| 1 | 1.0 | | | | | | | mulsd xmm3, xmm1
| 1 | 1.0 | | | | | | CP | mulsd xmm1, xmm2
| 1 | | 1.0 | | | | | CP | addsd xmm1, xmm3
| 1 | | | | | | 1.0 | | add eax, 0x8
| 1 | | | | | | 1.0 | | cmp eax, 0x63011c
| 0F | | | | | | | | jb 0xffffffffffffffe7
Total Num Of Uops: 6

With the addsd change suggested in my comments above, --> addsd xmm0,xmm3, this can be coded to use the full width registers and the performance is twice as fast.
Loosely:
For the initial value of ones, it needs to be:
double ones[2] = { 1.0, x }
And we need to replace x with x2:
double x2[2] = { x * x, x * x }
If there is an odd number of coefficients, pad it with a zero to produce an even number of them.
And, changing the pointer increment to 16.
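In plain C, the idea looks roughly like the sketch below: lane 0 handles the even-indexed coefficients and lane 1 the odd-indexed ones, and each lane's power of x is multiplied by x*x per step. The function names are made up for illustration, and the two-lane arrays only imitate what the packed instructions do in one register.

```c
#include <assert.h>
#include <stddef.h>

/* Reference: sum of c[i] * x^i, one coefficient per step. */
double poly_scalar(const double *c, size_t n, double x)
{
    double sum = 0.0, xn = 1.0;
    for (size_t i = 0; i < n; i++) {
        sum += c[i] * xn;   /* c[i] * x^i */
        xn *= x;            /* x^(i+1) */
    }
    return sum;
}

/* Two-lane version: n must be even (pad with a zero coefficient if not). */
double poly_two_lanes(const double *c, size_t n, double x)
{
    assert(n % 2 == 0);
    double pow_lane[2] = { 1.0, x };    /* the { 1.0, x } start value */
    double sum_lane[2] = { 0.0, 0.0 };
    const double x2 = x * x;            /* each lane advances by x^2 */
    for (size_t i = 0; i < n; i += 2) { /* pointer increment of 16 bytes */
        sum_lane[0] += c[i]     * pow_lane[0];
        sum_lane[1] += c[i + 1] * pow_lane[1];
        pow_lane[0] *= x2;
        pow_lane[1] *= x2;
    }
    return sum_lane[0] + sum_lane[1];   /* horizontal add of the two lanes */
}
```

Because the additions happen in a different order, the two results can differ in the last few digits for general inputs.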
Here are the test results I got. I did a number of trials and took the ones that had the best time and elongated the time by doing 100 iterations. std is the C version, dbl is your version, and qed is the "wide" version:
R=1463870188
C=100
T=100
I=100
x: 3.467957099973322e+00 3.467957099973322e+00
one: 1.000000000000000e+00 3.467957099973322e+00
x2: 1.202672644725538e+01 1.202672644725538e+01
std: 2.803772098439484e+56 (ELAP: 0.000019312)
dbl: 2.803772098439484e+56 (ELAP: 0.000019312)
qed: 2.803772098439492e+56 (ELAP: 0.000009060)
rtn_loop: 2.179378907910304e+55 2.585834207648461e+56
rtn_shuf: 2.585834207648461e+56 2.179378907910304e+55
rtn_add: 2.803772098439492e+56 2.585834207648461e+56
This was done on an i7 920 @ 2.67 GHz.
I think if you take the elapsed numbers and convert them, you'll see that your version is faster than you think.
I apologize, in advance, for switching to AT&T syntax as I had difficulty getting the assembler to work the other way. Again, sorry. Also, I'm using linux, so I used the rdi rsi registers to pass the coefficient pointers. If you're on windows, the ABI is different and you'll have to adjust for that.
I did a C version and disassembled it. It was virtually identical to your code except that it rearranged the non-xmm instructions a bit, which I've added below.
I believe I posted all the files, so you could conceivably run this on your system if you wished.
Here's the original code:
# xmmloop/dbl.s -- implement using single double
.globl dbl
# dbl -- compute result using single double
#
# arguments:
# rdi -- pointer to coeff vector
# rsi -- pointer to coeff vector end
dbl:
movsd x(%rip),%xmm2 # get x value
movsd one(%rip),%xmm1 # get ones
xorps %xmm0,%xmm0 # sum = 0
dbl_loop:
movsd (%rdi),%xmm3 # c[i]
add $8,%rdi # increment to next vector element
cmp %rsi,%rdi # done yet?
mulsd %xmm1,%xmm3 # c[i]*x^i
mulsd %xmm2,%xmm1 # x^(i+1)
addsd %xmm3,%xmm0 # sum += c[i]*x^i
jb dbl_loop # no, loop
retq
Here's the code changed to use movapd et al.:
# xmmloop/qed.s -- implement using packed doubles
.globl qed
# qed -- compute result using packed doubles
#
# arguments:
# rdi -- pointer to coeff vector
# rsi -- pointer to coeff vector end
qed:
movapd x2(%rip),%xmm2 # get x^2 value
movapd one(%rip),%xmm1 # get [1,x]
xorpd %xmm4,%xmm4 # sum = 0
qed_loop:
movapd (%rdi),%xmm3 # c[i]
add $16,%rdi # increment to next coefficient
cmp %rsi,%rdi # done yet?
mulpd %xmm1,%xmm3 # c[i]*x^i
mulpd %xmm2,%xmm1 # x^(i+2)
addpd %xmm3,%xmm4 # sum += c[i]*x^i
jb qed_loop # no, loop
movapd %xmm4,rtn_loop(%rip) # save intermediate DEBUG
movapd %xmm4,%xmm0 # get lower sum
shufpd $1,%xmm4,%xmm4 # get upper value into lower half
movapd %xmm4,rtn_shuf(%rip) # save intermediate DEBUG
addsd %xmm4,%xmm0 # add upper sum to lower
movapd %xmm0,rtn_add(%rip) # save intermediate DEBUG
retq
Here's a C version of the code:
// xmmloop/std -- compute result using C code
#include <xmmloop.h>
// std -- compute result using C
double
std(const double *cur,const double *ep)
{
double xt;
double xn;
double ci;
double sum;
xt = x[0];
xn = one[0];
sum = 0;
for (; cur < ep; ++cur) {
ci = *cur; // get c[i]
ci *= xn; // c[i]*x^i
xn *= xt; // x^(i+1)
sum += ci; // sum += c[i]*x^i
}
return sum;
}
Here's the test program I used:
// xmmloop/xmmloop -- test program
#define _XMMLOOP_GLO_
#include <xmmloop.h>
// tvget -- get high precision time
double
tvget(void)
{
struct timespec ts;
double sec;
clock_gettime(CLOCK_REALTIME,&ts);
sec = ts.tv_nsec;
sec /= 1e9;
sec += ts.tv_sec;
return sec;
}
// timeit -- get best time
void
timeit(fnc_p proc,double *cofptr,double *cofend,const char *tag)
{
double tvbest;
double tvbeg;
double tvdif;
double sum;
sum = 0;
tvbest = 1e9;
for (int trycnt = 1; trycnt <= opt_T; ++trycnt) {
tvbeg = tvget();
for (int iter = 1; iter <= opt_I; ++iter)
sum = proc(cofptr,cofend);
tvdif = tvget();
tvdif -= tvbeg;
if (tvdif < tvbest)
tvbest = tvdif;
}
printf("%s: %.15e (ELAP: %.9f)\n",tag,sum,tvbest);
}
// main -- main program
int
main(int argc,char **argv)
{
char *cp;
double *cofptr;
double *cofend;
double *cur;
double val;
long rseed;
int cnt;
--argc;
++argv;
rseed = 0;
cnt = 0;
for (; argc > 0; --argc, ++argv) {
cp = *argv;
if (*cp != '-')
break;
switch (cp[1]) {
case 'C':
cp += 2;
cnt = strtol(cp,&cp,10);
break;
case 'R':
cp += 2;
rseed = strtol(cp,&cp,10);
break;
case 'T':
cp += 2;
opt_T = (*cp != 0) ? strtol(cp,&cp,10) : 1;
break;
case 'I':
cp += 2;
opt_I = (*cp != 0) ? strtol(cp,&cp,10) : 1;
break;
}
}
if (rseed == 0)
rseed = time(NULL);
srand48(rseed);
printf("R=%ld\n",rseed);
if (cnt == 0)
cnt = 100;
if (cnt & 1)
++cnt;
printf("C=%d\n",cnt);
if (opt_T == 0)
opt_T = 100;
printf("T=%d\n",opt_T);
if (opt_I == 0)
opt_I = 100;
printf("I=%d\n",opt_I);
cofptr = malloc(sizeof(double) * cnt);
cofend = &cofptr[cnt];
val = drand48();
for (; val < 3; val += 1.0);
x[0] = val;
x[1] = val;
DMP(x);
one[0] = 1.0;
one[1] = val;
DMP(one);
val *= val;
x2[0] = val;
x2[1] = val;
DMP(x2);
for (cur = cofptr; cur < cofend; ++cur) {
val = drand48();
val *= 1e3;
*cur = val;
}
timeit(std,cofptr,cofend,"std");
timeit(dbl,cofptr,cofend,"dbl");
timeit(qed,cofptr,cofend,"qed");
DMP(rtn_loop);
DMP(rtn_shuf);
DMP(rtn_add);
return 0;
}
And the header file:
// xmmloop/xmmloop.h -- common control
#ifndef _xmmloop_xmmloop_h_
#define _xmmloop_xmmloop_h_
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#ifdef _XMMLOOP_GLO_
#define EXTRN_XMMLOOP /**/
#else
#define EXTRN_XMMLOOP extern
#endif
#define XMMALIGN __attribute__((aligned(16)))
EXTRN_XMMLOOP int opt_T;
EXTRN_XMMLOOP int opt_I;
EXTRN_XMMLOOP double x[2] XMMALIGN;
EXTRN_XMMLOOP double x2[2] XMMALIGN;
EXTRN_XMMLOOP double one[2] XMMALIGN;
EXTRN_XMMLOOP double rtn_loop[2] XMMALIGN;
EXTRN_XMMLOOP double rtn_shuf[2] XMMALIGN;
EXTRN_XMMLOOP double rtn_add[2] XMMALIGN;
#define DMP(_sym) \
printf(#_sym ": %.15e %.15e\n",_sym[0],_sym[1]);
typedef double (*fnc_p)(const double *cofptr,const double *cofend);
double std(const double *cofptr,const double *cofend);
double dbl(const double *cofptr,const double *cofend);
double qed(const double *cofptr,const double *cofend);
#endif

Summation in AVR Assembly

I want to implement a routine that calculates the sum of all natural numbers from 1 to n. n is a variable stored in RAM. The result has to be stored in a two-byte variable in RAM, too. I'm very new in assembly programming so I'm having a hard time trying to figure out the algorithm to achieve this. So far, I've done this:
.DSEG
.ORG 0x100
n: .BYTE l_n
result: .BYTE l_result
.CSEG
.ORG 0x100
SUM:
LDI XL, n ;the direction of n is stored in XL
LD R16, X ;now r16=n
LDI XL, LOW(result)
LDI XH, HIGH(result) ;X points to result
CLC ;in case C is full with trash
LDI R17, 0x0 ;R17 = 0
LDI R18, 0x1 ;R18 = 1
CALL LOOP
LDI R16,0
LDI R17,0
ADC R16, R17 ;if C is on when the loop finishes, then it has to be summed as well
ST X, R16
RET ;returns to the program that called the routine
I did the initialization of R17 and R18 because I thought that the subroutine LOOP should do something like increasing this numbers one by one until doing it n times. The thing that is complicating me the most is the fact that the result has two bytes, while each number being summed consists of just one byte. I don't know how to deal with this. Any help will be appreciated.

What you need is a 16-bit addition: ADD on the low bytes, then ADC on the high bytes so that the carry propagates:
ADD R18,R24 //sumL += nL
ADC R19,R25 //sumH += nH + Carry
For a two-byte variable the maximum sum is 65535, so
1+2+3+...+n = n*(n+1)/2 <= 65535 requires n <= 361 = 0x0169:
1+2+3+...+361 = 361*362/2 = 65341
The code will look like this:
//CPU: ATmega128A
.include "m128Adef.inc"
.DSEG
//.ORG 0x100
n: .BYTE 2 // define 2 bytes var
result: .BYTE 2 // define 2 bytes var
.CSEG
.ORG 0
RJMP boot
n0: .DW 0x0169 //init value for n=361 (max value for 2 byte result)
//in: N=R24:R25
//out: Sum=R18:R19
//calc sum 1 to n (n >=1 and n <=361)
//1+2+3+...+n=n*(n+1)/2 <= 65535 => n<=361= 0x0169
Sum1toN:
LDI R18,0x00 //sumL=0
LDI R19,0x00 //sumH=0
Lsum:
ADD R18,R24 //sumL += nL
ADC R19,R25 //sumH += nH + C
SBIW R24,0x01 //n--
BRNE Lsum // n >0 ?
RET
boot:
CLR R1
OUT SREG,R1 //Clear all
//init stack pointer
LDI R28,LOW(RAMEND) //LDI R28,0xFF
LDI R29,HIGH(RAMEND) //LDI R29,0x10
OUT SPH,R29
OUT SPL,R28
//init
LDI ZL,LOW(n0<<1)
LDI ZH,HIGH(n0<<1)
LDI XL,LOW(n)
LDI XH,HIGH(n)
LDI R24,2
LDI R25,0
init:
LPM R0,Z+
ST X+,R0
SBIW R24,1
BRNE init
//calc:
LDS R24,n // LDS R24,0x0100
LDS R25,n+1 // LDS R25,0x0101
RCALL Sum1toN
STS result,R18
STS result+1,R19
main:
RJMP main
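The Sum1toN loop can be cross-checked on a PC with a small C model that does the same byte-wise arithmetic. This is only a sketch: the function name is made up, and the 8-bit variables merely stand in for the registers R18/R19 and R24/R25.

```c
#include <assert.h>
#include <stdint.h>

/* Byte-wise model of the Sum1toN loop: adds n, n-1, ..., 1 into a
   two-byte sum using an 8-bit add followed by an add with carry. */
uint16_t sum_1_to_n(uint16_t n)
{
    assert(n >= 1 && n <= 361);          /* larger n overflows 16 bits */
    uint8_t sum_lo = 0, sum_hi = 0;      /* R18, R19 */
    while (n != 0) {
        uint8_t n_lo = (uint8_t)(n & 0xFF);   /* R24 */
        uint8_t n_hi = (uint8_t)(n >> 8);     /* R25 */
        uint16_t lo = (uint16_t)(sum_lo + n_lo);    /* ADD R18,R24 */
        uint8_t carry = (uint8_t)(lo >> 8);         /* carry flag */
        sum_lo = (uint8_t)lo;
        sum_hi = (uint8_t)(sum_hi + n_hi + carry);  /* ADC R19,R25 */
        n--;                                        /* SBIW R24,1 */
    }
    return (uint16_t)((sum_hi << 8) | sum_lo);
}
```

For n = 361 this yields 65341, the largest sum that still fits in two bytes.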
