32-byte aligned routine does not fit the uops cache - performance

KbL i7-8550U
I'm researching the behavior of the uops cache and have come across something I don't understand about it.
As specified in the Intel Optimization Manual, section 2.5.2.2 (emphasis mine):
The Decoded ICache consists of 32 sets. Each set contains eight Ways. Each Way can hold up to six micro-ops.
- All micro-ops in a Way represent instructions which are statically contiguous in the code and have their EIPs within the same aligned 32-byte region.
- Up to three Ways may be dedicated to the same 32-byte aligned chunk, allowing a total of 18 micro-ops to be cached per 32-byte region of the original IA program.
- A non-conditional branch is the last micro-op in a Way.
CASE 1:
Consider the following routine:
uop.h
void inhibit_uops_cache(size_t);
uop.S
align 32
inhibit_uops_cache:
mov edx, esi
mov edx, esi
mov edx, esi
mov edx, esi
mov edx, esi
mov edx, esi
jmp decrement_jmp_tgt
decrement_jmp_tgt:
dec rdi
ja inhibit_uops_cache ;ja is intentional to avoid Macro-fusion
ret
To make sure that the code of the routine is actually 32-byte aligned, here is the disassembly:
0x555555554820 <inhibit_uops_cache> mov edx,esi
0x555555554822 <inhibit_uops_cache+2> mov edx,esi
0x555555554824 <inhibit_uops_cache+4> mov edx,esi
0x555555554826 <inhibit_uops_cache+6> mov edx,esi
0x555555554828 <inhibit_uops_cache+8> mov edx,esi
0x55555555482a <inhibit_uops_cache+10> mov edx,esi
0x55555555482c <inhibit_uops_cache+12> jmp 0x55555555482e <decrement_jmp_tgt>
0x55555555482e <decrement_jmp_tgt> dec rdi
0x555555554831 <decrement_jmp_tgt+3> ja 0x555555554820 <inhibit_uops_cache>
0x555555554833 <decrement_jmp_tgt+5> ret
0x555555554834 <decrement_jmp_tgt+6> nop
0x555555554835 <decrement_jmp_tgt+7> nop
0x555555554836 <decrement_jmp_tgt+8> nop
0x555555554837 <decrement_jmp_tgt+9> nop
0x555555554838 <decrement_jmp_tgt+10> nop
0x555555554839 <decrement_jmp_tgt+11> nop
0x55555555483a <decrement_jmp_tgt+12> nop
0x55555555483b <decrement_jmp_tgt+13> nop
0x55555555483c <decrement_jmp_tgt+14> nop
0x55555555483d <decrement_jmp_tgt+15> nop
0x55555555483e <decrement_jmp_tgt+16> nop
0x55555555483f <decrement_jmp_tgt+17> nop
Running it as
int main(void){
inhibit_uops_cache(4096 * 4096 * 128L);
}
I got the counters
Performance counter stats for './bin':
6 431 201 748 idq.dsb_cycles (56,91%)
19 175 741 518 idq.dsb_uops (57,13%)
7 866 687 idq.mite_uops (57,36%)
3 954 421 idq.ms_uops (57,46%)
560 459 dsb2mite_switches.penalty_cycles (57,28%)
884 486 frontend_retired.dsb_miss (57,05%)
6 782 598 787 cycles (56,82%)
1,749000366 seconds time elapsed
1,748985000 seconds user
0,000000000 seconds sys
This is exactly what I expected to get.
The vast majority of uops came from the uops cache, and the number of uops perfectly matches my expectation:
mov edx, esi - 1 uop;
jmp imm - 1 uop (near);
dec rdi - 1 uop;
ja - 1 uop (near);
4096 * 4096 * 128 * 9 = 19 327 352 832, which is approximately equal to the sum of the counters 19 326 755 442 + 3 836 395 + 1 642 975
CASE 2:
Consider the implementation of inhibit_uops_cache which differs by one instruction being commented out:
align 32
inhibit_uops_cache:
mov edx, esi
mov edx, esi
mov edx, esi
mov edx, esi
mov edx, esi
; mov edx, esi
jmp decrement_jmp_tgt
decrement_jmp_tgt:
dec rdi
ja inhibit_uops_cache ;ja is intentional to avoid Macro-fusion
ret
disas:
0x555555554820 <inhibit_uops_cache> mov edx,esi
0x555555554822 <inhibit_uops_cache+2> mov edx,esi
0x555555554824 <inhibit_uops_cache+4> mov edx,esi
0x555555554826 <inhibit_uops_cache+6> mov edx,esi
0x555555554828 <inhibit_uops_cache+8> mov edx,esi
0x55555555482a <inhibit_uops_cache+10> jmp 0x55555555482c <decrement_jmp_tgt>
0x55555555482c <decrement_jmp_tgt> dec rdi
0x55555555482f <decrement_jmp_tgt+3> ja 0x555555554820 <inhibit_uops_cache>
0x555555554831 <decrement_jmp_tgt+5> ret
0x555555554832 <decrement_jmp_tgt+6> nop
0x555555554833 <decrement_jmp_tgt+7> nop
0x555555554834 <decrement_jmp_tgt+8> nop
0x555555554835 <decrement_jmp_tgt+9> nop
0x555555554836 <decrement_jmp_tgt+10> nop
0x555555554837 <decrement_jmp_tgt+11> nop
0x555555554838 <decrement_jmp_tgt+12> nop
0x555555554839 <decrement_jmp_tgt+13> nop
0x55555555483a <decrement_jmp_tgt+14> nop
0x55555555483b <decrement_jmp_tgt+15> nop
0x55555555483c <decrement_jmp_tgt+16> nop
0x55555555483d <decrement_jmp_tgt+17> nop
0x55555555483e <decrement_jmp_tgt+18> nop
0x55555555483f <decrement_jmp_tgt+19> nop
Running it as
int main(void){
inhibit_uops_cache(4096 * 4096 * 128L);
}
I got the counters
Performance counter stats for './bin':
2 464 970 970 idq.dsb_cycles (56,93%)
6 197 024 207 idq.dsb_uops (57,01%)
10 845 763 859 idq.mite_uops (57,19%)
3 022 089 idq.ms_uops (57,38%)
321 614 dsb2mite_switches.penalty_cycles (57,35%)
1 733 465 236 frontend_retired.dsb_miss (57,16%)
8 405 643 642 cycles (56,97%)
2,117538141 seconds time elapsed
2,117511000 seconds user
0,000000000 seconds sys
The counters are completely unexpected.
I expected all the uops to come from the DSB as before, since the routine matches the requirements of the uops cache.
By contrast, almost 70% of the uops came from the Legacy Decode Pipeline (MITE).
QUESTION: What's wrong with CASE 2? What counters should I look at to understand what's going on?
UPD: Following @PeterCordes's idea I checked the 32-byte alignment of the unconditional branch target decrement_jmp_tgt. Here is the result:
CASE 3:
Aligning the unconditional jump target to a 32-byte boundary as follows
align 32
inhibit_uops_cache:
mov edx, esi
mov edx, esi
mov edx, esi
mov edx, esi
mov edx, esi
; mov edx, esi
jmp decrement_jmp_tgt
align 32 ; align 16 does not change anything
decrement_jmp_tgt:
dec rdi
ja inhibit_uops_cache
ret
disas:
0x555555554820 <inhibit_uops_cache> mov edx,esi
0x555555554822 <inhibit_uops_cache+2> mov edx,esi
0x555555554824 <inhibit_uops_cache+4> mov edx,esi
0x555555554826 <inhibit_uops_cache+6> mov edx,esi
0x555555554828 <inhibit_uops_cache+8> mov edx,esi
0x55555555482a <inhibit_uops_cache+10> jmp 0x555555554840 <decrement_jmp_tgt>
#nops to meet the alignment
0x555555554840 <decrement_jmp_tgt> dec rdi
0x555555554843 <decrement_jmp_tgt+3> ja 0x555555554820 <inhibit_uops_cache>
0x555555554845 <decrement_jmp_tgt+5> ret
and running it as
int main(void){
inhibit_uops_cache(4096 * 4096 * 128L);
}
I got the following counters
Performance counter stats for './bin':
4 296 298 295 idq.dsb_cycles (57,19%)
17 145 751 147 idq.dsb_uops (57,32%)
45 834 799 idq.mite_uops (57,32%)
1 896 769 idq.ms_uops (57,32%)
136 865 dsb2mite_switches.penalty_cycles (57,04%)
161 314 frontend_retired.dsb_miss (56,90%)
4 319 137 397 cycles (56,91%)
1,096792233 seconds time elapsed
1,096759000 seconds user
0,000000000 seconds sys
The result is perfectly expected. More than 99% of the uops came from the DSB.
Avg DSB uops delivery rate = 17 145 751 147 / 4 296 298 295 = 3.99,
which is close to the peak bandwidth.

This is not the answer to the OP's problem, but it is one to watch out for.
See Code alignment dramatically affects performance for compiler options to work around the performance pothole that Intel introduced into Skylake-derived CPUs as part of this microcode workaround.
Other observations: the block of 6 mov instructions should fill a uop cache line, with the jmp in a line by itself. In CASE 2, the 5 movs + jmp should fit in one cache line (or more properly, "way").
(Posting this for the benefit of future readers who might have the same symptoms but a different cause. I realized right as I finished writing it that 0x...30 is not a 32-byte boundary, only 0x...20 and 40, so this erratum shouldn't be the problem for the code in the question.)
A recent (late 2019) microcode update introduced a new performance pothole. It works around Intel's JCC erratum on Skylake-derived microarchitectures. (KBL142 on your Kaby-Lake specifically).
Microcode Update (MCU) to Mitigate JCC Erratum
This erratum can be prevented by a microcode update (MCU). The MCU prevents
jump instructions from being cached in the Decoded ICache when the jump
instructions cross a 32-byte boundary or when they end on a 32-byte boundary. In
this context, Jump Instructions include all jump types: conditional jump (Jcc), macrofused op-Jcc (where op is one of cmp, test, add, sub, and, inc, or dec), direct
unconditional jump, indirect jump, direct/indirect call, and return.
Intel's whitepaper also includes a diagram of cases that trigger this non-uop-cacheable effect. (PDF screenshot borrowed from a Phoronix article with benchmarks before/after, and after with rebuilding with some workarounds in GCC/GAS that try to avoid this new performance pitfall).
The last byte of the ja in your code is at ...30. If that were a 32-byte boundary, not just a 16-byte one, we'd have the problem here:
0x55555555482a <inhibit_uops_cache+10> jmp # fine
0x55555555482c <decrement_jmp_tgt> dec rdi
0x55555555482f <decrement_jmp_tgt+3> ja # spans 16B boundary (not 32)
0x555555554831 <decrement_jmp_tgt+5> ret # fine
(This section is not fully updated; it still talks about spanning a 32B boundary.)
JA itself spans a boundary.
Inserting a NOP after dec rdi should work, putting the 2-byte ja fully after the boundary with a new 32-byte chunk. Macro-fusion of dec/ja wasn't possible anyway because JA reads CF (and ZF) but DEC doesn't write CF.
Using sub rdi, 1 to move the JA would not work; it would macro-fuse, and the combined 6 bytes of x86 code corresponding to that instruction would still span the boundary.
You could use single-byte nops instead of mov before the jmp to move everything earlier, if that gets it all in before the last byte of a block.
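As an illustration, here is a minimal sketch of that NOP workaround (NASM syntax, reusing the routine from the question), under the assumption that the ja would otherwise end on or cross a 32-byte boundary; whether it actually does depends on where the routine ends up:
align 32
inhibit_uops_cache:
mov edx, esi
mov edx, esi
mov edx, esi
mov edx, esi
mov edx, esi
jmp decrement_jmp_tgt
decrement_jmp_tgt:
dec rdi
nop                    ; padding: shifts the 2-byte ja fully past the (assumed) 32-byte boundary
ja inhibit_uops_cache  ; dec/ja can't macro-fuse anyway: JA reads CF, DEC doesn't write CF
ret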
ASLR can change what virtual page code executes from (bit 12 and higher of the address), but not the alignment within a page or relative to a cache line. So what we see in disassembly in one case will happen every time.

OBSERVATION 1: A branch with a target within the same 32-byte region which is predicted to be taken behaves much like an unconditional branch from the uops-cache standpoint (i.e. it should be the last uop in a Way).
Consider the following implementation of inhibit_uops_cache:
align 32
inhibit_uops_cache:
xor eax, eax
jmp t1 ;jz, jp, jbe, jge, jle, jnb, jnc, jng, jnl, jno, jns, jae
t1:
jmp t2 ;jz, jp, jbe, jge, jle, jnb, jnc, jng, jnl, jno, jns, jae
t2:
jmp t3 ;jz, jp, jbe, jge, jle, jnb, jnc, jng, jnl, jno, jns, jae
t3:
dec rdi
ja inhibit_uops_cache
ret
I tested the code with all the branches mentioned in the comment. The differences turned out to be insignificant, so I provide results for only two of them:
jmp:
Performance counter stats for './bin':
4 748 772 552 idq.dsb_cycles (57,13%)
7 499 524 594 idq.dsb_uops (57,18%)
5 397 128 360 idq.mite_uops (57,18%)
8 696 719 idq.ms_uops (57,18%)
6 247 749 210 dsb2mite_switches.penalty_cycles (57,14%)
3 841 902 993 frontend_retired.dsb_miss (57,10%)
21 508 686 982 cycles (57,10%)
5,464493212 seconds time elapsed
5,464369000 seconds user
0,000000000 seconds sys
jge:
Performance counter stats for './bin':
4 745 825 810 idq.dsb_cycles (57,13%)
7 494 052 019 idq.dsb_uops (57,13%)
5 399 327 121 idq.mite_uops (57,13%)
9 308 081 idq.ms_uops (57,13%)
6 243 915 955 dsb2mite_switches.penalty_cycles (57,16%)
3 842 842 590 frontend_retired.dsb_miss (57,16%)
21 507 525 469 cycles (57,16%)
5,486589670 seconds time elapsed
5,486481000 seconds user
0,000000000 seconds sys
IDK why the number of dsb uops is 7 494 052 019, which is significantly less than 4096 * 4096 * 128 * 4 = 8 589 934 592.
Replacing any of the jmps with a branch that is predicted not to be taken yields a significantly different result. For example:
align 32
inhibit_uops_cache:
xor eax, eax
jnz t1 ; perfectly predicted to not be taken
t1:
jae t2
t2:
jae t3
t3:
dec rdi
ja inhibit_uops_cache
ret
results in the following counters:
Performance counter stats for './bin':
5 420 107 670 idq.dsb_cycles (56,96%)
10 551 728 155 idq.dsb_uops (57,02%)
2 326 542 570 idq.mite_uops (57,16%)
6 209 728 idq.ms_uops (57,29%)
787 866 654 dsb2mite_switches.penalty_cycles (57,33%)
1 031 630 646 frontend_retired.dsb_miss (57,19%)
11 381 874 966 cycles (57,05%)
2,927769205 seconds time elapsed
2,927683000 seconds user
0,000000000 seconds sys
Consider another example, similar to CASE 1:
align 32
inhibit_uops_cache:
nop
nop
nop
nop
nop
xor eax, eax
jmp t1
t1:
dec rdi
ja inhibit_uops_cache
ret
results in
Performance counter stats for './bin':
6 331 388 209 idq.dsb_cycles (57,05%)
19 052 030 183 idq.dsb_uops (57,05%)
343 629 667 idq.mite_uops (57,05%)
2 804 560 idq.ms_uops (57,13%)
367 020 dsb2mite_switches.penalty_cycles (57,27%)
55 220 850 frontend_retired.dsb_miss (57,27%)
7 063 498 379 cycles (57,19%)
1,788124756 seconds time elapsed
1,788101000 seconds user
0,000000000 seconds sys
jz:
Performance counter stats for './bin':
6 347 433 290 idq.dsb_cycles (57,07%)
18 959 366 600 idq.dsb_uops (57,07%)
389 514 665 idq.mite_uops (57,07%)
3 202 379 idq.ms_uops (57,12%)
423 720 dsb2mite_switches.penalty_cycles (57,24%)
69 486 934 frontend_retired.dsb_miss (57,24%)
7 063 060 791 cycles (57,19%)
1,789012978 seconds time elapsed
1,788985000 seconds user
0,000000000 seconds sys
jno:
Performance counter stats for './bin':
6 417 056 199 idq.dsb_cycles (57,02%)
19 113 550 928 idq.dsb_uops (57,02%)
329 353 039 idq.mite_uops (57,02%)
4 383 952 idq.ms_uops (57,13%)
414 037 dsb2mite_switches.penalty_cycles (57,30%)
79 592 371 frontend_retired.dsb_miss (57,30%)
7 044 945 047 cycles (57,20%)
1,787111485 seconds time elapsed
1,787049000 seconds user
0,000000000 seconds sys
All these experiments made me think that the observation corresponds to the real behavior of the uops cache. I also ran other experiments, and judging by the counters br_inst_retired.near_taken and br_inst_retired.not_taken, the results correlate with the observation.
Consider the following implementation of inhibit_uops_cache:
align 32
inhibit_uops_cache:
t0:
;nops 0-9
jmp t1
t1:
;nop 0-6
dec rdi
ja t0
ret
Collecting dsb2mite_switches.penalty_cycles and frontend_retired.dsb_miss, we have:
The X-axis of the plot stands for the number of nops, e.g. 24 means 2 nops after the t1 label and 4 nops after the t0 label:
align 32
inhibit_uops_cache:
t0:
nop
nop
nop
nop
jmp t1
t1:
nop
nop
dec rdi
ja t0
ret
Judging by the plots I came to
OBSERVATION 2: If there are 2 branches within a 32-byte region that are predicted to be taken, there is no observable correlation between dsb2mite switches and dsb misses. So the dsb misses may occur independently of the dsb2mite switches.
An increasing frontend_retired.dsb_miss rate correlates well with an increasing idq.mite_uops rate and a decreasing idq.dsb_uops rate. This can be seen in the following plot:
OBSERVATION 3: The dsb misses, which occur for some (unclear?) reason, cause IDQ read bubbles and therefore RAT underflow.
Conclusion: Taking all the measurements into account, there are definitely some differences between the observed behavior and the behavior described in the Intel Optimization Manual, 2.5.2.2 Decoded ICache.

Related

Using POSIX thread library in x86_64 assembler (yasm) takes more execution time

I want to implement parallel processing in a yasm program using the POSIX threads (pthread) library.
Code
Here is the most important part of my program.
section .data
pThreadID1 dq 0
pThreadID2 dq 0
MAX: dq 100000000
value: dd 0
section .bss
extern pthread_create
extern pthread_join
section .text
global main
main:
;create the first thread with id = pThreadID1
mov rdi, pThreadID1
mov rsi, NULL
mov rdx, thread1
mov rcx, NULL
call pthread_create
;join the 1st thread
mov rdi, qword [pThreadID1]
mov rsi, NULL
call pthread_join
;create the second thread with id = pThreadID2
mov rdi, pThreadID2
mov rsi, NULL
mov rdx, thread2
mov rcx, NULL
call pthread_create
;join the 2nd thread
mov rdi, qword [pThreadID2]
mov rsi, NULL
call pthread_join
;print value block
where thread1 contains a loop in which value is incremented by one MAX/2 times:
global thread1
thread1:
mov rcx, qword [MAX]
shr rcx, 1
thread1.loop:
mov eax, dword [value]
inc eax
mov dword [value], eax
loop thread1.loop
ret
and thread2 is similar.
NOTE: thread1 and thread2 share the variable value.
Result
I assemble and link the above program as follows:
yasm -g dwarf2 -f elf64 Parallel.asm -l Parallel.lst
gcc -g Parallel.o -lpthread -o Parallel
Then I use the time command to measure the elapsed execution time:
time ./Parallel
And I get
value: +100000000
real 0m0.482s
user 0m0.472s
sys 0m0.000s
Problem
OK. In the program above I create one thread, wait until it has finished, and only then create the second one. Not the best "threading", is it? So I change the order in the program as follows (the full sequence is sketched after this list):
;create thread1
;create thread2
;join thread1
;join thread2
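For reference, a minimal sketch of that reordered sequence, reusing the calls from the code above (it assumes NULL is defined as 0 elsewhere in the full program):
;create the first thread with id = pThreadID1
mov rdi, pThreadID1
mov rsi, NULL
mov rdx, thread1
mov rcx, NULL
call pthread_create
;create the second thread with id = pThreadID2
mov rdi, pThreadID2
mov rsi, NULL
mov rdx, thread2
mov rcx, NULL
call pthread_create
;join the 1st thread
mov rdi, qword [pThreadID1]
mov rsi, NULL
call pthread_join
;join the 2nd thread
mov rdi, qword [pThreadID2]
mov rsi, NULL
call pthread_join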
I expect the elapsed time to be less in this case, but I get
value: +48634696
real 0m2.403s
user 0m4.772s
sys 0m0.000s
I understand why value is not equal to MAX, but what I do not understand is why the elapsed time is significantly longer in this case. Am I missing something?
EDIT
I decided to eliminate the overlap between thread1 and thread2 by using a different variable for each of them and then simply adding the results. In this case the "parallel" order gives less elapsed time (compared to the previous result), but still more than the "series" order.
Code
Only the changes are shown.
Data
Now there are two variables, one for each thread.
section .data
value1: dd 0
value2: dd 0
Threads
Each thread is responsible for incrementing its own value.
global thread1
thread1:
mov rcx, qword [MAX]
shr rcx, 1
thread1.loop:
mov eax, dword [value1]
inc eax
mov dword [value1], eax
loop thread1.loop
ret
thread2 is similar (replace 1 by 2).
Getting final result
Assuming the comments represent the corresponding blocks of code from the Code section at the beginning of the question, the program is the following.
Parallel order
;create thread1
;create thread2
;join thread1
;join thread2
mov eax, dword [value]
add eax, dword [value1]
add eax, dword [value2]
mov dword [value], eax
Result
value: +100000000
Performance counter stats for './Parallel':
3078.140527 cpu-clock (msec)
1.586070821 seconds time elapsed
Series order
;create thread1
;join thread1
;create thread2
;join thread2
mov eax, dword [value]
add eax, dword [value1]
add eax, dword [value2]
mov dword [value], eax
Result
value: +100000000
Performance counter stats for './Parallel':
508.757321 cpu-clock (msec)
0.509709406 seconds time elapsed
UPDATE
I've drawn a simple graph that shows how the elapsed time depends on the MAX value for 4 different modes.
The version with two threads running at once is slower because the two cores running your code will compete over the cache line that contains your counter value. The cache line will have to move back and forth between the two cores, which takes tens of cycles each time, and only a few increments will occur before it moves back to the other core. Compare that to the single-threaded cases, where the increment can occur once per ~5 cycles, limited by store-forwarding latency and the slow loop instruction.
This is true both for the case where the threads increment a shared value, and for your other case where they are distinct. It applies even in the latter case because the values value1 and value2 are declared to occupy contiguous locations in memory and thus appear on the same cache line. Since coherency occurs at cache-line granularity this so-called "false sharing" effect is similar to true sharing (the first case).
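A minimal sketch of how the false sharing in the second case could be removed, assuming 64-byte cache lines (the align directives are an addition, not part of the original program):
section .data
align 64
value1: dd 0     ; value1 gets a 64-byte cache line of its own
align 64         ; padding pushes value2 onto the next cache line
value2: dd 0
With that layout each core keeps its own line, so the increments no longer force a line to bounce between cores.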
You mentioned that although both the true and false sharing cases were much slower than the single-threaded case, the true sharing case was still even slower than the false sharing case. I had thought, without testing, that these two cases would be equivalent performance-wise, but the difference makes sense: while both suffer the cache-line thrashing described above, the true sharing case perhaps additionally suffers memory-order machine clears, although the exact mechanism isn't clear.

What's up with gcc's weird stack manipulation when it wants extra stack alignment?

I've seen this r10 weirdness a few times, so let's see if anyone knows what's up.
Take this simple function:
#define SZ 4
void sink(uint64_t *p);
void andpop(const uint64_t* a) {
uint64_t result[SZ];
for (unsigned i = 0; i < SZ; i++) {
result[i] = a[i] + 1;
}
sink(result);
}
It just adds 1 to each of the 4 64-bit elements of the passed-in array, stores the results in a local array, and calls sink() on it (to avoid the whole function being optimized away).
Here's the corresponding assembly:
andpop(unsigned long const*):
lea r10, [rsp+8]
and rsp, -32
push QWORD PTR [r10-8]
push rbp
mov rbp, rsp
push r10
sub rsp, 40
vmovdqa ymm0, YMMWORD PTR .LC0[rip]
vpaddq ymm0, ymm0, YMMWORD PTR [rdi]
lea rdi, [rbp-48]
vmovdqa YMMWORD PTR [rbp-48], ymm0
vzeroupper
call sink(unsigned long*)
add rsp, 40
pop r10
pop rbp
lea rsp, [r10-8]
ret
It's hard to understand almost everything that is going on with r10. First, r10 is set to point to rsp + 8, then push QWORD PTR [r10-8], which as far as I can tell pushes a copy of the return address on the stack. Following that, rbp is set up as normal and then finally r10 itself is pushed.
To unwind all this, r10 is popped off of the stack and used to restore rsp to its original value.
Some observations:
Looking at the entire function, all of this seems like a totally roundabout way of simply restoring rsp to its original value before ret - but the usual epilogue of mov rsp, rbp would do just as well (see clang)!
That said, the (expensive) push QWORD PTR [r10-8] doesn't even help in that mission: this value (the return address?) is apparently never used.
Why is r10 pushed and popped at all? The value isn't clobbered in the very small function body and there is no register pressure.
What's up with that? I've seen it several times before, and it usually wants to use r10, sometimes r13. It seems likely that it has something to do with aligning the stack to 32 bytes, since if you change SZ to be less than 4 it uses xmm ops and the issue disappears.
Here's SZ == 2 for example:
andpop(unsigned long const*):
sub rsp, 24
vmovdqa xmm0, XMMWORD PTR .LC0[rip]
vpaddq xmm0, xmm0, XMMWORD PTR [rdi]
mov rdi, rsp
vmovaps XMMWORD PTR [rsp], xmm0
call sink(unsigned long*)
add rsp, 24
ret
Much nicer!
Well, you answered your own question: the stack needs to be aligned to 32 bytes before it can be accessed with aligned AVX2 loads and stores, but the ABI only provides 16-byte alignment. Since the compiler cannot know how far off the alignment is, it has to save the stack pointer in a scratch register and restore it afterwards. But the saved value has to outlive the function call, so it has to be put on the stack, and a stack frame has to be created.
Some x86-64 ABIs have a red zone (a region of the stack below the stack pointer which is not used by signal handlers), so it is feasible not to change the stack pointer at all for such short functions, but GCC apparently does not implement this optimization and it would not apply here anyway because of the function call at the end.
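As an aside, a minimal sketch of what red-zone usage looks like in a leaf function (Intel/NASM syntax; the function is hypothetical and not related to the code above):
; A leaf function (one that makes no calls) may use up to 128 bytes below RSP
; without adjusting RSP; signal handlers are guaranteed not to clobber that region.
leaf_example:
mov [rsp-8], rdi    ; spill into the red zone, no sub rsp needed
mov [rsp-16], rsi
mov rax, [rsp-8]
add rax, [rsp-16]
ret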
In addition, the default stack alignment implementation is rather poor. For this case, -maccumulate-outgoing-args results in better-looking code with GCC 6, just aligning RSP after saving RBP, instead of copying the return address before saving RBP:
andpop:
pushq %rbp
movq %rsp, %rbp # make a traditional stack frame
andq $-32, %rsp # reserve 0 or 16 bytes
subq $32, %rsp
vmovdqu (%rdi), %xmm0 # split unaligned load from tune=generic
vinserti128 $0x1, 16(%rdi), %ymm0, %ymm0 # use -march=haswell instead
movq %rsp, %rdi
vpaddq .LC0(%rip), %ymm0, %ymm0
vmovdqa %ymm0, (%rsp)
vzeroupper
call sink@PLT
leave
ret
(editor's note: gcc8 and later make asm like this by default (Godbolt compiler explorer with gcc8, clang7, ICC19, and MSVC), even without -maccumulate-outgoing-args)
This issue (GCC generating poor code for stack alignment) recently came up when we had to implement a workaround for the GCC __tls_get_addr ABI bug, and we ended up writing the stack realignment by hand.
EDIT There is also another issue, related to RTL pass ordering: stack alignment is picked before the final determination whether the stack is actually needed, as BeeOnRope's second example shows.

What does MOV EAX,DWORD PTR DS:[ESI+EBP*8] do?

If I step through the program in the OllyDbg debugger I see
MOV EAX,DWORD PTR DS:[ESI+EBP*8]
with registers ESI = 0040855C and EBP = 00000000.
My problem is that I don't know what the register * 8 part means.
MOV EAX,DWORD PTR DS:[ESI+EBP*8]
MOV - move
EAX - into EAX (the destination register)
DWORD PTR - a dword value read from the address
[DS: - in the data segment]
[ESI+EBP*8] - ESI plus 8 times EBP.
That is: load into EAX the value at the address ESI + EBP*8 (ESI plus 8 times EBP, exactly as written).
This is probably being used to load data from an array, where the 8 is there to scale up the counter (which is EBP) to the size of each element (8 bytes), and ESI contains the address of the start of the array. So if EBP is zero, you load from ESI+0; if EBP = 1, you load from ESI+8, etc.
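To make that concrete, here is a minimal sketch (NASM, 32-bit; the labels and data are hypothetical) that uses the same addressing mode to walk an array of 8-byte elements:
section .data
array: dq 10, 20, 30, 40          ; four 8-byte elements
COUNT equ 4

section .text
sum_low_dwords:                   ; note: a real routine would preserve EBP
mov esi, array                    ; ESI = base address of the array
xor ebp, ebp                      ; EBP = element index
xor ecx, ecx                      ; ECX = running sum
next_elem:
mov eax, dword [esi+ebp*8]        ; load the low dword of element EBP
add ecx, eax
inc ebp
cmp ebp, COUNT
jb next_elem
mov eax, ecx                      ; result in EAX
ret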
In normal INTEL syntax this instruction moves a value from memory into EAX.
MOV EAX,DWORD PTR DS:[ESI+EBP*8]
It is usually used to extract a value from an array.
The array is situated in memory at DS:ESI.
The elements are indexed through EBP.
The scale of 8 means that every element is 64 bits long, and this instruction reads only the low dword of the selected element.

Triple fault on stack in C code [closed]

I am writing a toy kernel for learning purposes, and I am having a bit of trouble with it. I have made a simple bootloader that loads a segment from a floppy disk (which is written in 32 bit code), then the bootloader enables the A20 gate and turns on protected mode. I can jump to the 32 bit code fine if I write it in assembler, but if I write it in C, I get a triple fault. When I disassemble the C code I can see that the first two instructions involve setting up a new stack frame. This is the key difference between the working ASM code and the failing C code.
I am using NASM v2.10.05 for the ASM code, and GCC from the DJGPP 4.72 collection for the C code.
This is the bootloader code:
org 7c00h
BITS 16
entry:
mov [drive], dl ;Save the current drive
cli
mov ax,cs ; Setup segment registers
mov ds,ax ; Make DS correct
mov ss,ax ; Make SS correct
mov bp,0fffeh
mov sp,0fffeh ;Setup a temporary stack
sti
;Set video mode to text
;===================
mov ah, 0
mov al, 3
int 10h
;===================
;Set current page to 0
;==================
mov ah, 5
mov al, 0
int 10h
;==================
;Load the sector
;=============
call load_image
;=============
;Clear interrupts
;=============
cli
;=============
;Disable NMIs
;============
in ax, 70h
and ax, 80h ;Set the high bit to 1
out 70h, ax
;============
;Enable A20:
;===========
mov ax, 02401h
int 15h
;===========
;Load the GDT
;===========
lgdt [gdt_pointer]
;===========
;Clear interrupts
;=============
cli
;=============
;Enter protected mode
;==================
mov eax, cr0
or eax, 1 ;Set the low bit to 1
mov cr0, eax
;==================
jmp 08h:clear_pipe ;Far jump to clear the instruction queue
;======================================================
load_image:
reset_drive:
mov ah, 00h
; DL contains *this* drive, given to us by the BIOS
int 13h
jc reset_drive
read_sectors:
mov ah, 02h
mov al, 01h
mov ch, 00h
mov cl, 02h
mov dh, 00h
; DL contains *this* drive, given to us by the BIOS
mov bx, 7E0h
mov es, bx
mov bx, 0
int 13h
jc read_sectors
ret
;======================================================
BITS 32 ;Protected mode now!
clear_pipe:
mov ax, 10h ; Save data segment identifier
mov ds, ax ; Move a valid data segment into the data segment register
mov es, ax
mov fs, ax
mov gs, ax
mov ss, ax ; Move a valid data segment into the stack segment register
mov esp, 90000h ; Move the stack pointer to 90000h
mov ebp, esp
jmp 08h:7E00h ;Jump to the kernel proper
;===============================================
;========== GLOBAL DESCRIPTOR TABLE ==========
;===============================================
gdt: ; Address for the GDT
gdt_null: ; Null Segment
dd 0
dd 0
gdt_code: ; Code segment, read/execute, nonconforming
dw 0FFFFh ; LIMIT, low 16 bits
dw 0 ; BASE, low 16 bits
db 0 ; BASE, middle 8 bits
db 10011010b ; ACCESS byte
db 11001111b ; GRANULARITY byte
db 0 ; BASE, low 8 bits
gdt_data: ; Data segment, read/write, expand down
dw 0FFFFh
dw 0
db 0
db 10010010b
db 11001111b
db 0
gdt_end: ; Used to calculate the size of the GDT
gdt_pointer: ; The GDT descriptor
dw gdt_end - gdt - 1 ; Limit (size)
dd gdt ; Address of the GDT
;===============================================
;===============================================
drive: db 00 ;A byte to store the current drive in
times 510-($-$$) db 00
db 055h
db 0AAh
And this is the kernel code:
void main()
{
asm("mov byte ptr [0x8000], 'T'");
asm("mov byte ptr [0x8001], 'e'");
asm("mov byte ptr [0x8002], 's'");
asm("mov byte ptr [0x8003], 't'");
}
The kernel simply inserts those four bytes into memory, which I can check as I am running the code in a VMPlayer virtual machine. If the bytes appear, then I know the code is working. If I write code in ASM that looks like this, then the program works:
org 7E00h
BITS 32
main:
mov byte [8000h], 'T'
mov byte [8001h], 'e'
mov byte [8002h], 's'
mov byte [8003h], 't'
hang:
jmp hang
The only differences are therefore the two stack operations I found in the disassembled C code, which are these:
push ebp
mov ebp, esp
Any help on this matter would be greatly appreciated. I figure I am missing something relatively minor, but crucial, here, as I know this sort of thing is possible to do.
Try using the technique from "Is there a way to get gcc to output raw binary?" to produce a flat binary of the .text section from your object file.
I'd try moving the address of your protected-mode stack to something else, say 0x80000, which (according to http://wiki.osdev.org/Memory_Map_(x86)) is RAM that is guaranteed free for use, as opposed to the address you currently use, 0x90000, which apparently may not be present, depending a bit on what you're using as a testing environment.
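A minimal sketch of that change in the clear_pipe block of the bootloader above (only the stack setup differs from the original):
clear_pipe:
mov ax, 10h         ; data segment selector
mov ds, ax
mov es, ax
mov fs, ax
mov gs, ax
mov ss, ax
mov esp, 80000h     ; stack top in RAM that the memory map guarantees usable
mov ebp, esp
jmp 08h:7E00h       ; jump to the kernel proper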

Code Optimization Tips:

I am using the following ASM routine to bubble sort an array. I want to know about the inefficiencies in my code:
.386
.model flat, c
option casemap:none
.code
public sample
sample PROC
;[ebp+0Ch]Length
;[ebp+08h]Array
push ebp
mov ebp, esp
push ecx
push edx
push esi
push eax
mov ecx,[ebp+0Ch]
mov esi,[ebp+08h]
_bubbleSort:
push ecx
push esi
cmp ecx,1
je _exitLoop
sub ecx,01h
_miniLoop:
push ecx
mov edx,DWORD PTR [esi+4]
cmp DWORD PTR [esi],edx
ja _swap
jmp _continueLoop
_swap:
lodsd
mov DWORD PTR [esi-4],edx
xchg DWORD PTR [esi],eax
jmp _skipIncrementESI
_continueLoop:
add esi,4
_skipIncrementESI:
pop ecx
loop _miniLoop
_exitLoop:
pop esi
pop ecx
loop _bubbleSort
pop eax
pop esi
pop edx
pop ecx
pop ebp
ret
sample ENDP
END
Basically I have two loops, as usual for the bubble sort algorithm. The value of ecx for the outer loop is 10, and for the inner loop it is ecx - 1. I have tried the routine and it compiles and runs successfully, but I am not sure if it is efficient.
There are several things you can do to speed up your assembly code:
don't do things like ja label_1 ; jmp label_2. Just do jbe label_2 instead.
loop is a very slow instruction. dec ebx; jnz loopstart is much faster
use all registers instead of repeatedly push/pop ecx and esi. Use ebx and edi too.
jmp-targets should be well aligned. Use align 4 before the two loop-starts and after the jbe
Get yourself a manual for your CPU from Intel (you can download it as a PDF); it has the timings for the opcodes, and maybe other hints too. A sketch combining these tips follows this list.
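A minimal sketch of the inner loop rewritten along those lines (MASM-style syntax as in the question; keeping the inner counter in EDI is an assumption, and the outer counter would stay in ECX):
_miniLoop:
mov edx, DWORD PTR [esi]
mov eax, DWORD PTR [esi+4]
cmp edx, eax
jbe _noSwap                ; single branch: skip the swap when already in order
mov DWORD PTR [esi], eax   ; swap the adjacent pair in place
mov DWORD PTR [esi+4], edx
_noSwap:
add esi, 4
dec edi                    ; dec/jnz instead of the slow LOOP; no push/pop of ecx needed
jnz _miniLoop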
Several simple tips:
1) Try to minimize the number of conditional jumps, because they are very expensive. Unroll if possible.
2) Reorder instructions to minimize stalls caused by data dependencies:
cmp DWORD PTR [esi],edx ;// takes some time to compute,
mov edx,DWORD PTR [esi+4] ;
ja _swap ;// waits for results of cmp
3) Avoid old composite instructions (a dec/jnz pair is faster than loop and is not bound to the ecx register).
It would be quite difficult to write assembly code that is faster than the code generated by an optimizing C compiler, because you have to consider lots of factors: the sizes of the data and instruction caches, alignment, the pipeline, instruction timings. You can find some good documentation about this here. I especially recommend the first book: Optimizing software in C++.
A substitute for add esi,4 if we do not need the flags from this instruction:
_continueLoop:
lea esi,[esi+4]
