Is it faster to pop unneeded values from the stack, or add an immediate constant to SP on a 386+ CPU? - performance

My code targets 386+ CPUs (usually DOSBox, occasionally a Pentium MMX), but I only use the 8086 feature set for compatibility. My code is written for a non-multitasking environment (MS-DOS, or DOSBox).
In nested loops I often find myself re-using CX for the deeper loop counters. I PUSH it at the top of the nested loop, and POP it before LOOP is executed.
Sometimes conditions other than CX reaching 0 terminate these inner loops. Then I am left with the unnecessary loop counter, and sometimes more variables, sitting on the stack that I need to clean up.
Is it faster to just add a constant to SP, or POP these unneeded values?
I know that the fastest way of doing it would be to store CX in a spare register at the top of the loop, then restore it before LOOP executes, foregoing the stack completely, but I often don't have a spare register.
Here's a piece of code where I add a constant to SP to avoid a couple POP instructions:
FIND_ENTRY PROC
;SEARCHES A SINGLE SECTOR OF A DIRECTORY LOADED INTO secBuff FOR A
;SPECIFIED FILE/SUB DIRECTORY ENTRY
;IF FOUND, RETURNS THE FILE/SUB DIRECTORY'S CLUSTER NUMBER IN BX
;IF NOT FOUND, RETURNS 0 IN BX
;ALTERS BX
;EXPECTS A FILE NAME STRING INDEX NUMBER IN BP
;EXPECTS A SECTOR OF A DIRECTORY (ROOT, OR SUB) TO BE LOADED INTO secBuff
;EXPECTS DS TO BE LOADED WITH varData
push ax
push cx
push es
push si
push di
lea si, fileName ;si -> file name strings
mov ax, 11d ;ax -> file name length in bytes/characters
mul bp ;ax -> offset to file name string
add si, ax ;ds:si -> desired file name as source input
;for CMPS
mov di, ds
mov es, di
lea di, secBuff ;es:di -> first entry in ds:secBuff as
;destination input for CMPS
mov cx, 16d ;outer loop cntr -> num entries in a sector
ENTRY_SEARCH:
push cx ;store outer loop cntr
push si ;store start of the file name
push di ;store start of the entry
mov cx, 11d ;inner loop cntr -> length of file name
repe cmpsb ;Do the strings match?
jne NOT_ENTRY ;If not, test next entry.
pop di ;di -> start of the entry
mov bx, WORD PTR [di+26] ;bx -> entry's cluster number
add sp, 4 ;discard unneeded stack elements
pop di
pop si
pop es
pop cx
pop ax
ret
NOT_ENTRY:
pop di ;di -> start of the entry
add di, 32d ;di -> start of next entry
pop si ;si -> start of file name
pop cx ;restore the outer loop cntr
loop ENTRY_SEARCH ;loop till we've either found a match, or
;have tested every entry in the sector
;without finding a match.
xor bx, bx ;if we're here no match was found.
;return 0.
pop di
pop si
pop es
pop cx
pop ax
ret
FIND_ENTRY ENDP

If you want to write efficient code, pop vs. add is a very minor issue compared to reducing the amount of saving/restoring you need to do, and optimizing everything else (see below).
If it would take more than 1 pop, always use add sp, imm instead. Or use sub sp, -128 to save code size by still using an imm8. Some CPUs may prefer lea instead of add/sub (e.g. gcc uses LEA whenever possible with -mtune=atom). Of course, in 16-bit mode lea would require an address-size prefix, because [sp+2] isn't a valid addressing mode.
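For concreteness, a sketch of those options in 16-bit MASM-style syntax (byte counts assume 16-bit mode; the lea form is 386+ only, and only the low 16 bits of the computed address are kept):
add sp, 4 ;discard two words (3 bytes: 83 C4 04); modifies FLAGS
sub sp, -128 ;add 128 to SP while still using a sign-extended imm8
lea sp, [esp+4] ;leaves FLAGS untouched, but needs a 67h address-size
;prefix in 16-bit mode, since [sp+4] isn't encodable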
Beyond that, there's no single answer that applies to both an actual 386 and a modern x86 like Haswell or Skylake! There have been a lot of microarchitectural changes between those CPUs. Modern CPUs decode x86 instructions to internal RISC-like uops. For a while, using simple x86 instructions was important, but now modern CPUs can represent a lot of work in a single instruction, so more complex x86 instructions (like push, or add with a memory source operand) are single uop instructions.
Modern CPUs (since Pentium-M) have a stack engine that removes the need for a separate uop to actually update RSP/ESP/SP in the out-of-order core. Intel's implementation requires a stack-sync uop when you read/write RSP with a non-stack instruction (anything other than push/pop / call/ret), which is why pop can be useful, especially if you're doing it after a push or call.
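A rough illustration of when that sync fires (assuming an Intel CPU with the stack engine described above):
push ax ;SP delta tracked inside the stack engine, no ALU uop
pop cx ;same: still no explicit SP-update uop in the core
add sp, 4 ;explicit SP write: a stack-sync uop is inserted first
mov bx, sp ;any non-stack read of SP also forces a sync uop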
clang uses push/pop to align the stack in x86-64 code when a single 8-byte offset is needed. See: Why does this function push RAX to the stack as the first operation?
But if you care about performance, loop is slow and should be avoided in the first place, let alone push/pop of the loop counter! Use different regs for inner/outer loops.
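For instance, here's a sketch of the ENTRY_SEARCH loop restructured that way (it assumes DX is usable as the outer counter, which it is in FIND_ENTRY since the MUL clobbers DX anyway; MATCH_FOUND is a hypothetical label):
mov dx, 16 ;outer loop counter lives in DX now
ENTRY_SEARCH:
push si ;still save the string/entry pointers...
push di
mov cx, 11 ;...but CX is only the REPE counter
repe cmpsb ;compare one 11-byte name
pop di ;di -> start of the entry again
pop si ;si -> start of the file name again
je MATCH_FOUND ;hypothetical label for the match case
add di, 32 ;di -> start of the next entry
dec dx ;dec/jnz instead of the slow LOOP,
jnz ENTRY_SEARCH ;and no PUSH/POP of the counter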
Basically, you've gone pretty far down the wrong path as far as optimizing, so the real answer is just to point you at http://agner.org/optimize/, and other performance links in the x86 tag wiki. 16-bit code makes it hard to get good performance because of all the partial-register false dependencies on modern CPUs, but with some impact on code size you can break those by using 32-bit operand size when necessary. (e.g. for xor ebx,ebx)
Of course, if you're optimizing for DOSBox, it's an emulator, not a real CPU, so loop may well be fast! IDK if anyone has profiled or written optimization guides for DOSBox's CPU emulator. But I'd recommend learning what's fast on real modern hardware; that's more interesting.

Related

How often do the contents of a CPU register change?

Does the data that CPU registers hold change often? The Wikipedia article describes the registers as "a quickly accessible location...of a small amount of fast storage". I'm assuming the memory is fast because a register is accessed and modified often?
Yes, data registers may change on consecutive instructions, which is quite often. Superscalar execution, out-of-order execution, pipelining, register renaming, etc. complicate the analysis, but even on a simple in-order CPU, a register can change as often as once per instruction. A plausible program may have a run of many instructions all affecting the same register:
int polynom(int num) {
    return num * num + 2 * num + 1;
}
which compiles as:
polynom(int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
* mov eax, DWORD PTR [rbp-4]
* imul eax, eax
mov edx, DWORD PTR [rbp-4]
add edx, edx
* add eax, edx
* add eax, 1
pop rbp
ret
Note the writes to the eax register, marked with an asterisk. In this little function, four almost-consecutive instructions write to that one register, meaning that we could expect the program-visible state of eax1 to change at a rate of over 1 GHz if this code were called in a tight loop.
On a more fundamental note, there are some architectural registers that almost always change on every instruction. The most evident of these is the program counter (called PC in many contexts, EIP on x86, RIP on x86_64). Because this register points to the currently executing instruction, it must certainly change with every instruction, barring counterexamples like x86 REP encodings or an instruction that simply jumps to itself.
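Both counterexamples fit in a two-line sketch (hypothetical MASM-style fragment):
rep movsb ;one instruction, many iterations: the PC parks here until CX is 0
SPIN:
jmp SPIN ;the PC is rewritten with the same address every time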
1 Again, barring architectural considerations like register renaming, which uses multiple physical registers to implement a single logical, program-visible register.
Since modern CPUs run at clock speeds measured in GHz, CPU registers can change what they are storing hundreds of millions or even billions of times per second.
Since most modern CPUs have on the order of ~128 physical registers, any one of them changes somewhat less often than the core's overall rate, but still typically millions of times per second under load.

Fastest method to accept a byte from a nibble in hardware (8051)

I have a smaller 8051 microcontroller (AT89C4051) connected to a larger microcontroller (AT89S52), and the larger one is driving the clock of the smaller one. The crystal speed for the large one is 22.1184 MHz. Documentation states that because the ALE line is clocking the smaller micro, its clock speed is limited to 3.6 MHz.
The two micros communicate with each other with only 4 I/O lines and one interrupt line. I'm trying to make reception of a byte occur as fast as possible, but the code I have come up with makes me think I didn't choose the best solution, but here is what I got so far:
org 0h
ljmp main ;run initialization + program
org 13h ;INT1 handler - worst-case scenario: 52uS processing time I think?
push PSW ;save old registers and the carry flag
mov PSW,#18h ;load our register space
mov R7,A ;save accumulator (PUSH takes an extra clock cycle I think)
mov A,P1 ;Grab data (wish I could grab it sooner somehow)
anl A,#0Fh ;Only lowest 4 bits on P1 is the actual data. other 4 bits are useless
djnz R2,nonib2 ;See what nibble # we are at. 1st or 2nd?
orl A,R6 ;were at 2nd so merge previously saved data in
mov @R0,A ;and put it in memory space
inc R0 ;and increment pointer
mov R2,#2h ;and reset nibble number
nonib2:
swap A ;exchange nibbles to prevent overwriting nibble later
mov R6,A ;save to R6 high nibble
mov A,R7 ;restore accumulator
pop PSW ;restore carry and register location
reti ;return to wherever
main:
mov PSW,#18h ;use new address for R0 through R7 to not clash with other routines
mov R0,#BUFCMD ;setup start of buffer space in R0 (the pointer the ISR uses)
mov R2,#2h ;set # nibbles needed to process byte
mov PSW,#0h
mov IE,#84h ;enable external interrupt 1
..rest of code here...
We have to assume that this can be triggered by hardware at any point, even during a time-sensitive LCD character processing routine in which all registers and accumulator are used.
What optimizations can I perform to this code here to make it run much faster?
There is no need to do the nibble processing in the interrupt. Just store the 4 bits as they come in.
Assuming you can allocate R0 globally, the code can be as simple as:
org 13h
mov @r0, p1
inc r0
reti
Won't get much faster than that.
If you absolutely cannot reserve R0, but you can at least arrange to use register banks differing in a single bit, e.g. #0 and #1, then you can use bit set/clear instructions to switch away and back in 2 cycles, instead of 5 for the push PSW approach.
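A sketch of that variant, assuming bank 1's R0 can be reserved globally as the buffer pointer (RS0 is PSW.3, the low bank-select bit):
org 13h ;INT1 vector
setb RS0 ;switch to register bank 1 (1 cycle)
mov @r0, p1 ;bank 1's R0 holds the buffer pointer
inc r0
clr RS0 ;switch back to bank 0 (1 cycle)
reti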

Why is a memory round-trip faster than not performing the round-trip?

I've got some simple 32-bit code which computes the product of an array of 32-bit integers. The inner loop looks like this:
##loop:
mov esi,[ebx]
mov [esp],esi
imul eax,[esp]
add ebx, 4
dec edx
jnz ##loop
What I'm trying to understand is why the above code is 6% faster than these two versions, which do not perform the redundant memory round-trip:
##loop:
mov esi,[ebx]
imul eax,esi
add ebx, 4
dec edx
jnz ##loop
and
##loop:
imul eax,[ebx]
add ebx, 4
dec edx
jnz ##loop
The two latter pieces of code execute in virtually the same time, and as mentioned both are 6% slower than the first piece (165ms vs 155ms, 200 million elements).
I've tried manually aligning the jump target to a 16 byte boundary, but it makes no difference.
I'm running this on an Intel i7 4770k, Windows 10 x64.
Note: I know the code could be improved by doing all sorts of optimizations, however I'm only interested in the performance difference between the above pieces of code.
I suspect, but can't be sure, that you are preventing a stall on a data dependency:
The code looks like this:
##loop:
mov esi,[ebx] # (1)Load the memory location to esi reg
(mov [esp],esi) # (1)optionally store the location on the stack
imul eax,[esp] # (3) Perform the multiplication
add ebx, 4 # (1) Add 4
dec edx # (1)decrement counter
jnz ##loop # (0**) loop
Those numbers in parentheses are the latencies of the instructions ... that jump is 0 if the branch predictor guesses correctly (and since the branch almost always loops, it will guess correctly most of the time).
So: while the multiplication is still going (3 cycles), we get back to the top of the loop after 2 cycles and try to reload esi while the multiply is still using it, so the load has to stall. Or we could do a store ... which we can do at the same time as our multiplication, and then not stall at all.
What about the dummy store, you ask? Why does that work? Notice that you are storing to memory the critical value that the multiply is using. Thus the processor can use this value which is being stored in memory, and clobber the register.
So why can't the processor do this anyway? The processor can't issue more memory accesses than you asked for, or it could interfere with multi-processor programs (imagine that the cache line you are writing to is shared, and you have to invalidate it on the other CPUs every iteration by writing to it ... ouch!).
All of this is pure speculation, but it seems to match all the evidence (your code and my knowledge of the intel architecture ... and x86 assembly). Hopefully someone can point out if I have something wrong.

Understanding optimized assembly code generated by gcc

I'm trying to understand what kinds of optimizations are performed by gcc when the -O3 flag is set. I'm quite confused by these two lines:
xor %esi, %esi
lea 0x0(%esi), %esi
It seems redundant to me. What's the point of using a lea instruction here?
That instruction is used to fill space for alignment purposes. Loops can be faster when they start on aligned addresses, because the processor loads memory into the decoder in chunks. By aligning the beginnings of loops and functions, it becomes more likely that they will be at the beginning of one of these chunks. This prevents previous instructions which will not be used from being loaded, maximizes the number of future instructions that will, and, possibly most importantly, ensures that the first instruction is entirely in the first chunk, so it does not take two loads to execute it.
The compiler knows that it is best to align the loop, and has two options to do so. It can either place a jump to the beginning of the loop, or fill the gap with no-ops and let the processor flow through them. Jump instructions break the flow of instructions and often cause wasted cycles on modern processors, so adding them unnecessarily is inadvisable. For a short distance like this no-ops are better.
The x86 architecture contains an instruction specifically for the purpose of doing nothing: nop. However, this is one byte long, so it would take more than one to align the loop. Decoding each one and deciding it does nothing takes time, so it is faster to simply insert another, longer instruction that has no side effects. Therefore, the compiler inserted the lea instruction you see. It has absolutely no effect, and is chosen by the compiler to have the exact length required. In fact, recent processors have standard multi-byte no-op instructions, so this will likely be recognized during decode and never even executed.
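For reference, the padding comes from an alignment directive: gcc typically emits something like the following before a loop, and the assembler fills the gap with the longest no-ops it knows (a sketch; the exact choice of padding varies by assembler version):
.p2align 4,,10 # request 16-byte alignment, but skip at most 10 bytes
.p2align 3 # otherwise fall back to 8-byte alignment
.L3: # loop top, now on (or near) a 16-byte boundary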
As explained by ughoavgfhw, these are padding for better code alignment.
You can find this lea in the following link:
http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2010-September/003881.html
quote:
1-byte: XCHG EAX, EAX
2-byte: 66 NOP
3-byte: LEA REG, 0 (REG) (8-bit displacement)
4-byte: NOP DWORD PTR [EAX + 0] (8-bit displacement)
5-byte: NOP DWORD PTR [EAX + EAX*1 + 0] (8-bit displacement)
6-byte: LEA REG, 0 (REG) (32-bit displacement)
7-byte: NOP DWORD PTR [EAX + 0] (32-bit displacement)
8-byte: NOP DWORD PTR [EAX + EAX*1 + 0] (32-bit displacement)
9-byte: NOP WORD PTR [EAX + EAX*1 + 0] (32-bit displacement)
Also note this SO question describing it in more detail:
What does NOPL do in x86 system?
Note that the xor itself is not a nop (it changes the value of the reg), but it is also very cheap to perform since it's a zero idiom. See: What is the purpose of XORing a register with itself?

How do I intentionally read from main memory vs cache?

So I am being taught assembly, and we have an assignment to find the time difference between reading from memory and reading from cache. We have to do this by creating two loops and timing them (one reads from main memory and the other from cache). The thing is, I don't know, and can't find anything that tells me, how to read from either cache or main memory =/. Could you guys help me? I'm doing this in MASM32. I understand how to make loops and most of the assembly language, but I just can't make it read =/
Edit:
I have a question, I've done this ...
mov ecx, 100 ;loop 100 times
xor eax, eax ;set eax to 0
_label:
mov eax, eax ;according to me this reads memory, is that good?
dec ecx ;dec loop
jnz _label ;if still not equal to 0 goes again to _label
... would that be ok?
Edit 2:
Well then, I don't intend to pry, and I appreciate your help. I just have another question: since these are two loops, I need to compare them somehow. I've been looking for a timer instruction, but I haven't found any; I've found only timeGetTime, GetTickCount and the Performance Counter, but as far as I understand these return the system time, not the time it takes for the loop to finish. Is there a way to actually do what I want, or do I need to think of another way?
Also, to read from different locations in the second loop (the one not reading from cache), is it OK if I give various "mov" instructions? Or am I totally off base here?
Sorry for all this questions but again thank you for your help.
To read from cache, have a loop which reads from the same (or a very similar) memory address:
The first time you read from that address, the values from that memory address (and from other, nearby memory address) will be moved into cache
The next time you read from that same address, the values are already cached and so you're reading from cache.
To read uncached memory, have a loop which reads from many, very different (i.e. further apart than the cache size) memory addresses.
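For the cached case, a minimal sketch in MASM32 syntax (buffer is a hypothetical DWORD in your .data segment; the strided, uncached variant is sketched in the snippets further down):
mov ecx, 1000000 ;lots of iterations so the interval is measurable
mov esi, OFFSET buffer ;one fixed address (hypothetical variable)
cache_loop:
mov eax, [esi] ;after the first iteration this read hits cache
dec ecx
jnz cache_loop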
To answer your second question:
The things you're doing with ecx and jnz look OK (I don't know how accurate/sensitive your timer is, but you might want to loop more than 100 times)
The mov eax, eax is not "read memory" ... it's a no-op, which moves eax into eax. Instead, I think the MASM syntax for reading from memory is something more like mov eax,[esi] ("read from the memory location whose address is contained in esi").
Depending on what O/S you're using, you must read from a memory address which actually exists and is readable. On Windows, for example, an application wouldn't be allowed to do mov esi, 0 followed by mov eax, [esi] because an application isn't allowed to read the memory whose address/location is zero.
To answer your third question:
timeGetTime, GetTickCount and Performance Counter
Your mentioning timeGetTime, GetTickCount and Performance Counter implies that you're running under Windows.
Yes, these return the current time, at various resolutions/accuracies: for example, GetTickCount has a resolution of about 50 msec, so it fails to time events which last less than 50 msec and is inaccurate when timing events which last only 50-100 msec. That's why I said that 100 in your ecx probably isn't big enough.
The QueryPerformanceCounter function is probably the most accurate timer that you have.
To use any of these timers as an interval timer:
Get the time, before you start to loop
Get the time again, after you finish looping
Subtract these two times: the difference is the time interval
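For example, a sketch of that pattern in MASM32 (t0 and t1 are hypothetical variables; QueryPerformanceCounter and QueryPerformanceFrequency are the real Win32 calls, declared for invoke by the masm32 includes):
.data
t0 QWORD ? ;LARGE_INTEGER slots for the two timestamps
t1 QWORD ?
.code
invoke QueryPerformanceCounter, ADDR t0 ;timestamp before the loop
; ... the loop being measured goes here ...
invoke QueryPerformanceCounter, ADDR t1 ;timestamp after the loop
; elapsed ticks = t1 - t0; divide by QueryPerformanceFrequency's
; result to convert ticks to seconds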
is it ok if I give various "mov" instructions?
Yes, I think so. I think you can do it like this (beware: I'm not sure/don't remember whether this is the right MASM syntax for reading from a named memory location) ...
mov eax,[memory1]
mov eax,[memory2]
mov eax,[memory3]
mov eax,[memory4]
mov eax,[memory5]
... where memory1 through memory5 are addresses of widely-spaced global variables in your data segment.
Or, you could do ...
mov eax,[esi]
add esi,edx
mov eax,[esi]
add esi,edx
mov eax,[esi]
add esi,edx
mov eax,[esi]
add esi,edx
mov eax,[esi]
... where esi is pointing to the bottom of a long chunk of memory, and edx is some increment that's equal to about a fifth of the length of the chunk.
