Two loop-carried dependency chain. Which one matters?

Two loop-carried dependency chain. Which one matters? - performance

My CPU is IvyBridge.
Let's consider example from Agner's Fog optimizing_assembly, I mean. 12.6 chapter and the 12.10a example:
movsd xmm2, [x]
movsd xmm1, [one]
xorps xmm0, xmm0
mov eax, coeff
L1:
movsd xmm3, [eax] ; c[i]
mulsd xmm3, xmm1 ; c[i]*x^i
mulsd xmm1, xmm2 ; x^(i+1)
addsd xmm0, xmm3 ; sum += c[i]*x^i
add eax, 8
cmp eax, coeff_end
jb L1
And the frontend is not a bottleneck ( it is obvious because of the latency of multiplication).
We have two loop-carried dependency ( I skipped add eax, 8):
1. mulsd xmm1, xmm2 latency of 5 cycles
2. addsd xmm0, xmm3 latency of 3 cycles
Btw, I have a proble to decide: Should I sum up ( 5 + 3 = 8) or to get the greatest, i.e. 5 cycle?
I've tested it for 10000000 elements array. And it takes 6.7 cycle per iteration ( according to perf) and 5.9 cycles per iteration according to Agners' tool.
Please explain why it does take 6/7 cycles instead of just 5 cycles?

Probably resource conflicts are sometimes delaying the loop-carried mulsd. uop scheduling chooses oldest-first (out of uops that have their inputs ready), not critical-path-first, so the mulsd xmm3, xmm1 probably steals the execution port and delays the loop-carried mulsd xmm1, xmm2 by a cycle occasionally.
There are probably also occasional pipeline stalls from cache misses, since hardware prefetching doesn't work across page boundaries. Test for this by putting an outer repeat-loop around an inner loop over ~128kiB of data (1/2 L2 cache size).
prefetch should have no problem keeping up with your loop normally, though: One 64b load per 5 cycles is about 1/5th of main memory bandwidth.
You could also use float instead of double
Does your data have any denormals? Those are slow (but NaN isn't, with SSE).

Related

CISC short instructions vs long instructions

I'm currently programming a compiler and am about to implement code generation. The target instruction set for now is x64.
Now x64 is CISC, so there are many complex instructions. But I know these are internally converted to RISC by the CPU, and there's also out-of-order execution after that.
My question is therefore: Does using more short instructions (RISC-like) have a performance impact over using fewer complex instructions? The test programs for my language aren't that big, so I think fitting instructions into the cache should currently not be a problem.

No, using mostly simple x86 instructions (e.g. avoiding push and using sub rsp, whatever and storing args with mov) was a useful optimization for P5-pentium, because it didn't know how to split compact but complex instructions internally. Its 2-wide superscalar pipeline could only pair simple instructions.
Modern x86 CPUs (since Intel P6 (pentium pro / PIII), and including all x86-64 CPUs) do decode complex instructions to multiple uops that can be scheduled independently. (And for common complex instructions like push / pop, they have tricks to handle them as a single uop. In that case, a stack engine that renames the stack pointer outside of the out-of-order part of the core, so no uop is needed for the rsp-=8 part of push.)
Memory-source instructions like add eax, [rdi] can even decode to a single uop on Intel CPUs by micro-fusing the load with the ALU uop, only separating them in the out-of-order scheduler for dispatching to execution units. In the rest of the pipeline, it only uses 1 entry (in the front-end and the ROB). (But see Micro fusion and addressing modes for limitations on Sandybridge with indexed addressing modes, relaxed somewhat on Haswell and later.) AMD CPUs just naturally keep memory operands fused with ALU instructions, and didn't used to decode them to extra m-ops / uops so it doesn't have a fancy name.
Instruction length is not perfectly correlated with simplicitly. e.g. idiv rcx is only 3 bytes, but decodes to 57 uops on Skylake. (Avoid 64-bit division, it's slower than 32-bit.)
Smaller code is better, all else equal. Prefer 32-bit operand-size when it's sufficient to avoid REX prefixes, and choose register that don't need REX prefixes (like ecx instead of r8d). But normally don't spend extra instructions to make that happen. (e.g. use r8d instead of saving/restoring rbx so you can use ebx as another scratch register).
But when all else is not equal, size is usually the last priority for high performance, behind minimizing uops and keeping latency dependency chains short (especially loop-carried dependency chains).
Modern x86 cost model
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
Agner Fog's optimization guides and instruction tables: https://agner.org/optimize/
Intel’s Sandy Bridge Microarchitecture, deep dive by David Kanter.
(https://www.realworldtech.com/sandy-bridge/)
https://stackoverflow.com/tags/x86/info
Most programs spend most of their time in loops small enough to fit in L1d cache, and lots of that time in a few even smaller loops within that.
Unless you can correctly identify "cold" code (executed rarely), optimizing for size over speed with something like 3-byte push 1 / pop rax instead of 5-byte mov eax, 1 is definitely not a good default. clang/LLVM will push/pop for constants with -Oz (optimize only for size), but not -Os (optimize for a balance of size and speed).
Using inc instead of add reg,1 saves a byte (only 1 in x86-64, vs. 2 in 32-bit code). With a register destination, it's just as fast in most cases on most CPUs. See INC instruction vs ADD 1: Does it matter?
Modern mainstream x86 CPUs have decoded-uop caches (AMD since Ryzen, Intel since Sandybridge) that mostly avoid the front-end bottlenecks on older CPUs with average instruction length > 4.
Before that (Core2 / Nehalem), tuning to avoid front-end bottlenecks was much more complicated that just using short instructions on average. See Agner Fog's microarch guide for details about the uop patterns the decoders can handle in those older Intel CPUs, and effects of code alignment relative to 16-byte boundaries for fetch after a jump, and lots more.
AMD Bulldozer-family marks instruction boundaries in L1i cache, and can decode up to 2x 16 bytes per cycle if both cores of a cluster are active, otherwise Agner Fog's microarch PDF (https://agner.org/optimize/) reports ~21 bytes per cycle (vs. Intel's up to 16 bytes per cycle for the decoders when not running from the uop cache). Bulldozer's lower back-end throughput probably means that front-end bottlenecks happen less often. But I don't really know, I haven't tuned anything for Bulldozer-family with access to hardware to test anything.
An example: this function compiled with clang with -O3, -Os, and -Oz
int sum(int*arr) {
int sum = 0;
for(int i=0;i<10240;i++) {
sum+=arr[i];
}
return sum;
}
Source + asm output on the Godbolt compiler explorer, where you can play with this code and compiler options.
I also used -fno-vectorize because I assume you won't be trying to auto-vectorize with SSE2, even though that's baseline for x86-64. (Although that would speed up this loop by a factor of 4
# clang -O3 -fno-vectorize
sum: # #sum
xor eax, eax
mov ecx, 7
.LBB2_1: # =>This Inner Loop Header: Depth=1
add eax, dword ptr [rdi + 4*rcx - 28]
add eax, dword ptr [rdi + 4*rcx - 24]
add eax, dword ptr [rdi + 4*rcx - 20]
add eax, dword ptr [rdi + 4*rcx - 16]
add eax, dword ptr [rdi + 4*rcx - 12]
add eax, dword ptr [rdi + 4*rcx - 8]
add eax, dword ptr [rdi + 4*rcx - 4]
add eax, dword ptr [rdi + 4*rcx]
add rcx, 8
cmp rcx, 10247
jne .LBB2_1
ret
This is pretty silly; it unrolled by 8 but still with only 1 accumulator. So it bottlenecks on 1 cycle latency add instead of on 2 loads per clock throughput on Intel since SnB and AMD since K8. (And only reading 4 bytes per clock cycle, it probably doesn't bottleneck on memory bandwidth very much.)
It does better with normal -O3, not disabling vectorization, using 2 vector accumulators:
sum: # #sum
pxor xmm0, xmm0 # zero first vector register
mov eax, 36
pxor xmm1, xmm1 # 2nd vector
.LBB2_1: # =>This Inner Loop Header: Depth=1
movdqu xmm2, xmmword ptr [rdi + 4*rax - 144]
paddd xmm2, xmm0
movdqu xmm0, xmmword ptr [rdi + 4*rax - 128]
paddd xmm0, xmm1
movdqu xmm1, xmmword ptr [rdi + 4*rax - 112]
movdqu xmm3, xmmword ptr [rdi + 4*rax - 96]
movdqu xmm4, xmmword ptr [rdi + 4*rax - 80]
paddd xmm4, xmm1
paddd xmm4, xmm2
movdqu xmm2, xmmword ptr [rdi + 4*rax - 64]
paddd xmm2, xmm3
paddd xmm2, xmm0
movdqu xmm1, xmmword ptr [rdi + 4*rax - 48]
movdqu xmm3, xmmword ptr [rdi + 4*rax - 32]
movdqu xmm0, xmmword ptr [rdi + 4*rax - 16]
paddd xmm0, xmm1
paddd xmm0, xmm4
movdqu xmm1, xmmword ptr [rdi + 4*rax]
paddd xmm1, xmm3
paddd xmm1, xmm2
add rax, 40
cmp rax, 10276
jne .LBB2_1
paddd xmm1, xmm0 # add the two accumulators
# and horizontal sum the result
pshufd xmm0, xmm1, 78 # xmm0 = xmm1[2,3,0,1]
paddd xmm0, xmm1
pshufd xmm1, xmm0, 229 # xmm1 = xmm0[1,1,2,3]
paddd xmm1, xmm0
movd eax, xmm1 # extract the result into a scalar integer reg
ret
This version unrolls probably more than it needs to; the loop overhead is tiny and movdqu + paddd is only 2 uops, so we're far from bottlenecking on the front-end. With 2-per-clock movdqu loads, this loop can process 32 bytes of input per clock cycle assuming the data is hot in L1d cache or maybe L2, otherwise it will run slower. This more-than-minimum unroll will let out-of-order execution run ahead and see the loop exit condition before the paddd work has caught up, and maybe mostly hide the branch mispredict on the last iteration.
Using more than 2 accumulators to hide latency is very important in FP code, where most instructions don't have single-cycle latency. (It would also be useful for this function on AMD Bulldozer-family, where paddd has 2 cycle latency.)
With big unrolls and large displacements, compilers sometimes generate a lot of instructions that need a disp32 displacement instead of disp8 in the addressing mode. Choosing the point where you increment the loop counter or pointer to keep as many addressing modes as possible using a displacement of -128 .. +127 would probably be a good thing.
Unless you're tuning for Nehalem / Core2 or other CPUs without a uop cache, you probably don't want to add extra loop overhead (of add rdi, 256 twice instead of add rdi, 512 or something) just to shrink the code size.
By comparison, clang -Os still auto-vectorizes (unless you disable it), with an inner loop that's exactly 4 uops long on Intel CPUs.
# clang -Os
.LBB2_1: # =>This Inner Loop Header: Depth=1
movdqu xmm1, xmmword ptr [rdi + 4*rax]
paddd xmm0, xmm1
add rax, 4
cmp rax, 10240
jne .LBB2_1
But with clang -Os -fno-vectorize, we get the simple and obvious minimal scalar implementation:
# clang -Os -fno-vectorize
sum: # #sum
xor ecx, ecx
xor eax, eax
.LBB2_1: # =>This Inner Loop Header: Depth=1
add eax, dword ptr [rdi + 4*rcx]
inc rcx
cmp rcx, 10240
jne .LBB2_1
ret
Missed-optimization: using ecx would avoid a REX prefix on inc and cmp. The range is known to fix in 32-bits. Probably it's using RCX because it promoted int to 64-bit to avoid movsxd rcx,ecx sign-extension to 64-bit before use in an addressing mode. (Because signed overflow is UB in C.) But after doing that, it could optimize it back down again after noticing the range.
The loop is 3 uops (assuming macro-fused cmp/jne on Intel since Nehalem and AMD since Bulldozer), or 4 uops on Sandybridge (unlamination of add with an indexed addressing mode.) A pointer-increment loop could be slightly more efficient on some CPUs, only requiring 3 uops inside the loop even on SnB/IvB.
Clang's -Oz output is actually larger, showing signs of its code-gen strategy. Many loops can't be proven to run at least 1 time, and thus need a conditional branch to skip the loop instead of falling into it in the run-zero-times case. Or they need a jump to an entry point near the bottom. (Why are loops always compiled into "do...while" style (tail jump)?).
Looks like LLVM's -Oz code-gen unconditionally uses the jump-to-bottom strategy without checking if the condition is provably always true on the first iteration.
sum: # #sum
xor ecx, ecx
xor eax, eax
jmp .LBB2_1
.LBB2_3: # in Loop: Header=BB2_1 Depth=1
add eax, dword ptr [rdi + 4*rcx]
inc rcx
.LBB2_1: # =>This Inner Loop Header: Depth=1
cmp rcx, 10240
jne .LBB2_3
ret
Everything's the same except the extra jmp to enter the loop.
In a function that did more, you'd see more differences in code-gen. Like maybe using a slow div even for compile-time-constants, instead of a multiplicative inverse (Why does GCC use multiplication by a strange number in implementing integer division?).

Creating array and adding values

So i am working on an assignment and i am having some issues understanding arrays in this type of code (keep in mind that my knowledge of this stuff is limited). My code is supposed to ask the user to enter the number of values that that will be put in an array of SDWORD's and then create a procedure that has the user input the numbers. I have the part done below that asks the user for the amount (saved in "count") but i am struggling with the other procedure part For example with my code below if they enter 5 then the procedure that i have to make would require them to input 5 numbers that would go in to an array.
The problem I am facing is that i'm not sure how to actually set up the array. It can contain anywhere between 2 and twelve numbers which is why i have the compare set up in the code below. Let's say for example that the user inputs that they will enter 5 numbers and i set it up like this...
.data
array SDWORD 5
the problem i am having is that I'm not sure if that is saying the array will hold 5 values or if just one value in the array is 5. I need the amount of values in the array to be equal to "count". "count" as i have set up below is the amount that the user is going to enter.
Also i obviously know how to set up the procedure like this...
EnterValues PROC
return
EnterValues ENDP
I just don't know how to implement something like this. All of the research that i have done online is only confusing me more and none of the examples i have found ask the user to enter how many number will be i the array before they physically enter any numbers in to it. I hope what i described makes sense. Any input on what i could possibly do would be great!
INCLUDE Irvine32.inc
.data
count SDWORD ?
prompt1 BYTE "Enter the number of values to sort",0
prompt2 BYTE "Error. The number must be between 2 and 12",0
.code
Error PROC
mov edx, OFFSET prompt2
call WriteString
exit ; exit ends program after error occures
Error ENDP
main PROC
mov edx, OFFSET prompt1
call WriteString ; prints out prompt1
call ReadInt
mov count, eax ; save returned value from eax to count
cmp count, 12
jle Loop1 ; If count is less than or equal to 12 jump to Loop1, otherwise continue with Error procedure
call Error ; performs Error procedure which will end the program
Loop1: cmp count, 2
jge Loop2 ; If count is greater than or equal to 2 jump to Loop2, otherwise continue with Error procedure
call Error ; performs Error procedure which will end the program
Loop2: exit
main ENDP
END main
============EDIT==============
I came up with this...
EnterValues PROC
mov ecx, count
mov edx, 0
Loop3:
mov eax, ArrayOfInputs[edx * 4]
call WriteInt
call CrLf
inc edx
dec ecx
jnz Loop3
ret
EnterValues ENDP

.data
array SDWORD 5
defines one SDWORD with the initial value 5 in the DATA section and gives it the name "array".
You might want to use the DUP operator
.data
array SDWORD 12 DUP (5)
This defines twelve SDWORD and initializes each of them with the value 5. If the initial value doesn't matter, i.e. you want an uninitialized array change the initial value to '?':
array SDWORD 12 DUP (?)
MASM may now create a _BSS segment. To force the decision:
.data?
array SDWORD 12 DUP (?)
The symbol array is used in a MASM program as a constant offset to the address of the first entry. Use an additional index to address subsequent entries, for example:
mov eax, [array + 4] ; second SDWORD
mov eax, [array + esi]
Pointer arithmetic:
lea esi, array ; copy address into register
add esi, 8 ; move pointer to the third entry
mov eax, [esi] ; load eax with the third entry
lea esi, array + 12 ; copy the address of the fourth entry
mov eax, [esi] ; load eax with the fourth entry
You've got in every case an array with a fixed size. It's on you, just to fill it with count values.

String Loop in Assembly

I have to create a loop that will have the end result that prints out "LoopLoopLoopLoopLoop" (note only 5 times). What I have so far is this..
.data
STRING_1 DB "Loop"
.code
mov ecx, 5 ; Perform loop 5 times
printLoop:
LEA DX, STRING_1
loop printLoop
call DumpRegs
I'm not even 100% sure if what I have is even correct but I suppose the main question I have is how do I make it so that all of the outcomes are printed on the same line?

Packing BCD to DPD: How to improve this amd64 assembly routine?

I'm writing a routine to convert between BCD (4 bits per decimal digit) and Densely Packed Decimal (DPD) (10 bits per 3 decimal digits). DPD is further documented (with the suggestion for software to use lookup-tables) on Mike Cowlishaw's web site.
This routine only ever requires the lower 16 bit of the registers it uses, yet for shorter instruction encoding I have used 32 bit instructions wherever possible. Is a speed penalty associated with code like:
mov data,%eax # high 16 bit of data are cleared
...
shl %al
shr %eax
or
and $0x888,%edi # = 0000 a000 e000 i000
imul $0x0490,%di # = aei0 0000 0000 0000
where the alternative to a 16 bit imul would be either a 32 bit imul and a subsequent and or a series of lea instructions and a final and.
The whole code in my routine can be found below. Is there anything in it where performance is worse than it could be due to me mixing word and dword instructions?
.section .text
.type bcd2dpd_mul,#function
.globl bcd2dpd_mul
# convert BCD to DPD with multiplication tricks
# input abcd efgh iklm in edi
.align 8
bcd2dpd_mul:
mov %edi,%eax # = 0000 abcd efgh iklm
shl %al # = 0000 abcd fghi klm0
shr %eax # = 0000 0abc dfgh iklm
test $0x880,%edi # fast path for a = e = 0
jz 1f
and $0x888,%edi # = 0000 a000 e000 i000
imul $0x0490,%di # = aei0 0000 0000 0000
mov %eax,%esi
and $0x66,%esi # q = 0000 0000 0fg0 0kl0
shr $13,%edi # u = 0000 0000 0000 0aei
imul tab-8(,%rdi,4),%si # v = q * tab[u-2][0]
and $0x397,%eax # r = 0000 00bc d00h 0klm
xor %esi,%eax # w = r ^ v
or tab-6(,%rdi,4),%ax # x = w | tab[u-2][1]
and $0x3ff,%eax # = 0000 00xx xxxx xxxx
1: ret
.size bcd2dpd_mul,.-bcd2dpd_mul
.section .rodata
.align 4
tab:
.short 0x0011 ; .short 0x000a
.short 0x0000 ; .short 0x004e
.short 0x0081 ; .short 0x000c
.short 0x0008 ; .short 0x002e
.short 0x0081 ; .short 0x000e
.short 0x0000 ; .short 0x006e
.size tab,.-tab
Improved Code
After applying some suggestions from the answer and comments and some other trickery, here is my improved code.
.section .text
.type bcd2dpd_mul,#function
.globl bcd2dpd_mul
# convert BCD to DPD with multiplication tricks
# input abcd efgh iklm in edi
.align 8
bcd2dpd_mul:
mov %edi,%eax # = 0000 abcd efgh iklm
shl %al # = 0000 abcd fghi klm0
shr %eax # = 0000 0abc dfgh iklm
test $0x880,%edi # fast path for a = e = 0
jnz 1f
ret
.align 8
1: and $0x888,%edi # = 0000 a000 e000 i000
imul $0x49,%edi # = 0ae0 aei0 ei00 i000
mov %eax,%esi
and $0x66,%esi # q = 0000 0000 0fg0 0kl0
shr $8,%edi # = 0000 0000 0ae0 aei0
and $0xe,%edi # = 0000 0000 0000 aei0
movzwl lookup-4(%rdi),%edx
movzbl %dl,%edi
imul %edi,%esi # v = q * tab[u-2][0]
and $0x397,%eax # r = 0000 00bc d00h 0klm
xor %esi,%eax # w = r ^ v
or %dh,%al # = w | tab[u-2][1]
and $0x3ff,%eax # = 0000 00xx xxxx xxxx
ret
.size bcd2dpd_mul,.-bcd2dpd_mul
.section .rodata
.align 4
lookup:
.byte 0x11
.byte 0x0a
.byte 0x00
.byte 0x4e
.byte 0x81
.byte 0x0c
.byte 0x08
.byte 0x2e
.byte 0x81
.byte 0x0e
.byte 0x00
.byte 0x6e
.size lookup,.-lookup

TYVM for commenting the code clearly and well, BTW. It made is super easy to figure out what was going on, and where the bits were going. I'd never heard of DPD before, so puzzling it out from uncommented code and the wikipedia article would have sucked.
The relevant gotchas are:
Avoid 16bit operand size for instructions with immediate constants, on Intel CPUs. (LCP stalls)
avoid reading the full 32 or 64bit register after writing only the low 8 or 16, on Intel pre-IvyBridge. (partial-register extra uop). (IvB still has that slowdown if you modify an upper8 reg like AH, but Haswell removes that too). It's not just an extra uop: the penalty on Core2 is 2 to 3 cycles, according to Agner Fog. I might be measuring it wrong, but it seems a lot less bad on SnB.
See http://agner.org/optimize/ for full details.
Other than that, there's no general problem with mixing in some instructions using the operand-size prefix to make them 16-bit.
You should maybe write this as inline asm, rather than as a called function. You only use a couple registers, and the fast-path case is very few instructions.
I had a look at the code. I didn't look into achieving the same result with significantly different logic, just at optimizing the logic you do have.
Possible code suggestions: Switch the branching so the fast-path has the not-taken branch. Actually, it might make no diff either way in this case, or might improve the alignment of the slow-path code.
.p2align 4,,10 # align to 16, unless we're already in the first 6 bytes of a block of 16
bcd2dpd_mul:
mov %edi,%eax # = 0000 abcd efgh iklm
shl %al # = 0000 abcd fghi klm0
shr %eax # = 0000 0abc dfgh iklm
test $0x880,%edi # fast path for a = e = 0
jnz .Lslow_path
ret
.p2align 4 # Maybe fine-tune this alignment based on how the rest of the code assembles.
.Lslow_path:
...
ret
It's sometimes better to duplicate return instructions than to absolutely minimize code-size. The compare-and-branch in this case is the 4th uop of the function, though, so a taken branch wouldn't have prevented 4 uops from issuing in the first clock cycle, and a correctly-predicted branch would still issue the return on the 2nd clock cycle.
You should use a 32bit imul for the one with the table source. (see next section about aligning the table so reading an extra 2B is ok). 32bit imul is one uop instead of two on Intel SnB-family microarches. The result in the low16 should be the same, since the sign bit can't be set. The upper16 gets zeroed by the final and before ret, and doesn't get used in any way where garbage in the upper16 matters while it's there.
Your imul with an immediate operand is problematic, though.
It causes an LCP stall when decoding on Intel, and it writes the the low16 of a register that is later read at full width. Its upper16 would be a problem if not masked off (since it's used as a table index). Its operands are large enough that they will put garbage into the upper16, so it does need to be discarded.
I thought your way of doing it would be optimal for some architectures, but it turns out imul r16,r16,imm16 itself is slower than imul r32,r32,imm32 on every architecture except VIA Nano, AMD K7 (where it's faster than imul32), and Intel P6 (where using it from 32bit / 64bit mode will LCP-stall, and where partial-reg slowdowns are a problem).
On Intel SnB-family CPUs, where imul r16,r16,imm16 is two uops, imul32/movzx would be strictly better, with no downside except code size. On P6-family CPUs (i.e. PPro to Nehalem), imul r16,r16,imm16 is one uop, but those CPUs don't have a uop cache, so the LCP stall is probably critical (except maybe Nehalem calling this in a tight loop, fitting in the 28 uop loop buffer). And for those CPUs, the explicit movzx is probably better from the perspective of the partial-reg stall. Agner Fog says something about there being an extra cycle while the CPU inserts the merging uop, which might mean a cycle where that extra uop is issued alone.
On AMD K8-Steamroller, imul imm16 is 2 m-ops instead of 1 for imul imm32, so imul32/movzx is about equal to imul16 there. They don't suffer from LCP stalls, or from partial-reg problems.
On Intel Silvermont, imul imm16 is 2 uops (with one per 4 clocks throughput), vs. imul imm32 being 1 uops (with one per 1 clock throughput). Same thing on Atom (the in-order predecessor to Silvermont): imul16 is an extra uop and much slower. On most other microarchitectures, throughput isn't worse, just latency.
So if you're willing to increase the code-size in bytes where it will give a speedup, you should use a 32bit imul and a movzwl %di, %edi. On some architectures, this will be about the same speed as the imul imm16, while on others it will be much faster. It might be slightly worse on AMD bulldozer-family, which isn't very good at using both integer execution units at once, apparently, so a 2 m-op instruction for EX1 might be better than two 1 m-op instructions where one of them is still an EX1-only instruction. Benchmark this if you care.
Align tab to at least a 32B boundary, so your 32bit imul and or can do a 4B load from any 2B-aligned entry in it without crossing a cache-line boundary. Unaligned accesses have no penalty on all recent CPUs (Nehalem and later, and recent AMD), as long as they don't span two cache lines.
Making the operations that read from the table 32bit avoids the partial-register penalty that Intel CPUs have. AMD CPUs, and Silvermont, don't track partial-registers separately, so even instructions that write-only to the low16 have to wait for the result in the rest of the reg. This stops 16bit insns from breaking dependency chains. Intel P6 and SnB microarch families track partial regs. Haswell does full dual bookkeeping or something, because there's no penalty when merging is needed, like after you shift al, then shift eax. SnB will insert an extra uop there, and there may be a penalty of a cycle or two while it does this. I'm not sure, and haven't tested. However, I don't see a nice way to avoid this.
The shl %al could be replaced with a add %al, %al. That can run on more ports. Probably no difference, since port0/5 (or port0/6 on Haswell and later) probably aren't saturated. They have the same effect on the bits, but set flags differently. Otherwise they could be decoded to the same uop.
changes: split the pext/pdep / vectorize version into a separate answer, partly so it can have its own comment thread.

(I split the BMI2 version into a separate answer, since it could end up totally different)
After seeing what you're doing with that imul/shr to get a table index, I can see where you could use BMI2 pextr to replace and/imul/shr, or BMI1 bextr to replace just the shr (allowing use of imul32, instead of imul16, since you'd just extract the bits you want, rather than needing to shift zeros from the upper16). There are AMD CPUs with BMI1, but even steamroller lacks BMI2. Intel introduced BMI1 and BMI2 at the same time with Haswell.
You could maybe process two or four 16bit words at once, with 64bit pextr. But not for the whole algorithm: you can't do 4 parallel table lookups. (AVX2 VPGATHERDD is not worth using here.) Actually, you can use pshufb to implement a LUT with indices up to 4bits, see below.
Minor improvement version:
.section .rodata
# This won't won't assemble, written this way for humans to line up with comments.
extmask_lobits: .long 0b0000 0111 0111 0111
extmask_hibits: .long 0b0000 1000 1000 1000
# pext doesn't have an immediate-operand form, but it can take the mask from a memory operand.
# Load these into regs if running in a tight loop.
#### TOTALLY UNTESTED #####
.text
.p2align 4,,10
bcd2dpd_bmi2:
# mov %edi,%eax # = 0000 abcd efgh iklm
# shl %al # = 0000 abcd fghi klm0
# shr %eax # = 0000 0abc dfgh iklm
pext extmask_lobits, %edi, %eax
# = 0000 0abc dfgh iklm
mov %eax, %esi # insn scheduling for 4-issue front-end: Fast-path is 4 fused-domain uops
# And doesn't waste issue capacity when we're taking the slow path. CPUs with mov-elimination won't waste execution units from issuing an extra mov
test $0x880, %edi # fast path for a = e = 0
jnz .Lslow_path
ret
.p2align 4
.Lslow_path:
# 8 uops, including the `ret`: can issue in 2 clocks.
# replaces and/imul/shr
pext extmask_hibits, %edi, %edi #u= 0000 0000 0000 0aei
and $0x66, %esi # q = 0000 0000 0fg0 0kl0
imul tab-8(,%rdi,4), %esi # v = q * tab[u-2][0]
and $0x397, %eax # r = 0000 00bc d00h 0klm
xor %esi, %eax # w = r ^ v
or tab-6(,%rdi,4), %eax # x = w | tab[u-2][1]
and $0x3ff, %eax # = 0000 00xx xxxx xxxx
ret
Of course, if making this an inline-asm, rather than a stand-alone function, you'd change back to the fast path branching to the end, and the slow-path falling through. And you wouldn't waste space with alignment padding mid-function either.
There might be more scope for using pextr and/or pdep for more of the rest of the function.
I was thinking about how to do even better with BMI2. I think we could get multiple aei selectors from four shorts packed into 64b, then use pdep to deposit them in the low bits of different bytes. Then movq that to a vector register, where you use it as a shuffle control-mask for pshufb to do multiple 4bit LUT lookups.
So we could go from 60 BCD bits to 50 DPD bits at a time. (Use shrd to shift bits between registers to handle loads/stores to byte-addressable memory.)
Actually, 48 BCD bits (4 groups of 12bits each) -> 40 DPD bits is probably a lot easier, because you can unpack that to 4 groups of 16bits in a 64b integer register, using pdep. Dealing with the selectors for 5 groups is fine, you can unpack with pmovzx, but dealing with the rest of the data would require bit-shuffling in vector registers. Not even the slow AVX2 variable-shift insns would make that easy to do. (Although it might be interesting to consider how to implement this with BMI2 at all, for big speedups on CPUs with just SSSE3 (i.e. every relevant CPU) or maybe SSE4.1.)
This also means we can put two clusters of 4 groups into the low and high halves of a 128b register, to get even more parallelism.
As a bonus, 48bits is a whole number of bytes, so reading from a buffer of BCD digits wouldn't require any shrd insns to get the leftover 4 bits from the last 64b into the low 4 for the next. Or two offset pextr masks to work when the 4 ignored bits were the low or high 4 of the 64b.... Anyway, I think doing 5 groups at once just isn't worth considering.
Full BMI2 / AVX pshufb LUT version (vectorizable)
The data movement could be:
ignored | group 3 | group 2 | group 1 | group 0
16bits | abcd efgh iklm | abcd efgh iklm | abcd efgh iklm | abcd efgh iklm
3 2 1 | 0
pext -> aei|aei|aei|aei # packed together in the low bits
2 | 1 | 0
pdep -> ... |0000 0000 0000 0aei|0000 0000 0000 0aei # each in a separate 16b word
movq -> xmm vector register.
(Then pinsrq another group of 4 selectors into the upper 64b of the vector reg). So the vector part can handle 2 (or AVX2: 4) of this at once
vpshufb xmm2 -> map each byte to another byte (IMUL table)
vpshufb xmm3 -> map each byte to another byte (OR table)
Get the bits other than `aei` from each group of 3 BCD digits unpacked from 48b to 64b, into separate 16b words:
group 3 | group 2 | group 1 | group 0
pdep(src)-> 0000 abcd efgh iklm | 0000 abcd efgh iklm | 0000 abcd efgh iklm | 0000 abcd efgh iklm
movq this into a vector reg (xmm1). (And repeat for the next 48b and pinsrq that to the upper64)
VPAND xmm1, mask (to zero aei in each group)
Then use the vector-LUT results:
VPMULLW xmm1, xmm2 -> packed 16b multiply, keeping only the low16 of the result
VPAND xmm1, mask
VPXOR xmm1, something
VPOR xmm1, xmm3
movq / pextrq back to integer regs
pext to pack the bits back together
You don't need the AND 0x3ff or equivalent:
Those bits go away when you pext to pack each 16b down to 10b
shrd or something to pack the 40b results of this into 64b chunks for store to memory.
Or: 32b store, then shift and store the last 8b, but that seems lame
Or: just do 64b stores, overlapping with the previous. So you write 24b of garbage every time. Take care at the very end of the buffer.
Use AVX 3-operand versions of the 128b SSE instructions to avoid needing movdqa to not overwrite the table for pshufb. As long as you never run a 256b AVX instruction, you don't need to mess with vzeroupper. You might as well use the v (VEX) versions of all vector instructions, though, if you use any. Inside a VM, you might be running on a virtual CPU with BMI2 but not AVX support, so it's prob. still a good idea to check both CPU feature flags, rather than assuming AVX if you see BMI2 (even though that's safe for all physical hardware that currently exists).
This is starting to look really efficient. It might be worth doing the mul/xor/and stuff in vector regs, even if you don't have BMI2 pext/pdep to do the bit packing/unpacking. I guess you could use code like the existing non-BMI scalar routing to get selectors, and mask/shift/or could build up the non-selector data into 16b chunks. Or maybe shrd for shifting data from one reg into another?

gcc compiled binary in x86 assembly

gcc compile binary has following assembly:
8049264: 8d 44 24 3e lea 0x3e(%esp),%eax
8049268: 89 c2 mov %eax,%edx
804926a: bb ff 00 00 00 mov $0xff,%ebx
804926f: b8 00 00 00 00 mov $0x0,%eax
8049274: 89 d1 mov %edx,%ecx
8049276: 83 e1 02 and $0x2,%ecx
8049279: 85 c9 test %ecx,%ecx
804927b: 74 09 je 0x8049286
At first glance, I had no idea what it is doing at all. My best guess is some sort of memory alignment and clearing up local variable (because rep stos is filling 0 at local variable location). If you take a look at first few lines, load address into eax and move to ecx and test if it is even address or not, but I'm lost why this is happening. I want to know what exactly happen in here.

It looks like initialising a local variable located at [ESP + 0x03e] to zeroes. At first, EDX is initialised to hold the address and EBX is initialised to hold the size in bytes. Then, it's checked whether EDX & 2 is nonzero; in other words, whether EDX as a pointer is wyde-aligned but not tetra-aligned. (Assuming ESP is tetrabyte aligned, as it generally should, EDX, which was initialised at 0x3E bytes above ESP, would not be tetrabyte aligned. But this is slightly besides the point.) If this is the case, the wyde from AX, which is zero, is stored at [EDX], EDX is incremented by two, and the counter EBX is decremented by two. Now, assuming ESP was at least wyde-aligned, EDX is guaranteed to be tetra-aligned. ECX is calculated to hold the number of tetrabytes remaining by shifting EBX right two bits, EDI is loaded from EDX, and the REP STOS stores that many zero tetrabytes at [EDI], incrementing EDI in the process. Then, EDX is loaded from EDI to get the pointer-past-space-initialised-so-far. Finally, if there were at least two bytes remaining uninitialised, a zero wyde is stored at [EDX] and EDX is incremented by two, and if there was at least one byte remaining uninitialised, a zero byte is stored at [EDX] and EDX is incremented by one. The point of this extra complexity is apparently to store most of the zeroes as four-byte values rather than single-byte values, which may, under certain circumstances and in certain CPU architectures, be slightly faster.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio