Fastest method to accept a byte from a nibble in hardware (8051)

I have a small 8051 microcontroller (AT89C4051) connected to a larger one (AT89S52), and the larger one supplies the clock for the smaller one. The crystal for the large one runs at 22.1184 MHz. The documentation states that because the ALE line clocks the smaller micro, its clock speed is limited to 3.6 MHz.
The two micros communicate over only 4 I/O lines and one interrupt line. I'm trying to make reception of a byte as fast as possible, but the code I have come up with makes me think I didn't choose the best solution. Here is what I have so far:
org 0h
ljmp main ;run initialization + program
org 13h ;INT1 handler - worst-case scenario: 52 µs processing time, I think?
push PSW ;save old registers and the carry flag
mov PSW,#18h ;load our register space
mov R7,A ;save accumulator (PUSH takes an extra clock cycle I think)
mov A,P1 ;Grab data (wish I could grab it sooner somehow)
anl A,#0Fh ;only the lowest 4 bits of P1 are data; the other 4 are unused
djnz R2,nonib2 ;which nibble are we on, 1st or 2nd?
orl A,R6 ;we're on the 2nd, so merge in the previously saved nibble
mov @R0,A ;and store the completed byte in memory
inc R0 ;and increment the pointer
mov R2,#2h ;and reset the nibble counter
nonib2:
swap A ;exchange nibbles to prevent overwriting nibble later
mov R6,A ;save to R6 high nibble
mov A,R7 ;restore accumulator
pop PSW ;restore carry and register location
reti ;return to wherever
main:
mov PSW,#18h ;use new address for R0 through R7 to not clash with other routines
mov R0,#BUFCMD ;point this bank's R0 at the start of the buffer space
mov R2,#2h ;set # nibbles needed to process byte
mov PSW,#0h
mov IE,#84h ;enable external interrupt 1
..rest of code here...
We have to assume this interrupt can fire at any point, even during a time-sensitive LCD character-processing routine in which all registers and the accumulator are in use.
What optimizations can I perform to this code here to make it run much faster?

There is no need to do the nibble processing in the interrupt. Just store the 4 bits as they come in.
Assuming you can allocate R0 globally, the code can be as simple as:
org 13h
mov @r0, p1
inc r0
reti
Won't get much faster than that.
If you absolutely cannot reserve R0, but you can at least arrange to use register banks differing in a single bit, e.g. #0 and #1, then you can use a bit set/clear to switch away and back in 2 cycles, instead of 5 for the push PSW approach.
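For illustration, a minimal sketch of that variant, assuming the main code stays in bank 0, the ISR owns bank 1's R0 as the buffer pointer, and using RS0 (which is PSW.3):
org 13h
setb PSW.3 ;RS0=1: switch to register bank 1 (1 cycle)
mov @r0, p1 ;bank 1's R0 is reserved as the buffer pointer
inc r0
clr PSW.3 ;RS0=0: back to bank 0 (1 cycle)
reti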

Related

Why is POP slow when using register R12?

On recent Intel CPUs, the POP instruction usually has a throughput of 2 instructions per cycle. However, when using register R12 (or RSP, which has the same encoding except for the prefix), the throughput drops to 1 per cycle if the instructions go through the legacy decoders (the throughput stays at around 2 per cycle if the µops come from the DSB).
This can be reproduced using nanoBench as follows:
sudo ./nanoBench.sh -asm "pop R12"
Further experiments on a Haswell machine show the following: When adding between 1 and 4 nops,
sudo ./nanoBench.sh -asm "pop R12; nop;"
sudo ./nanoBench.sh -asm "pop R12; nop; nop;"
sudo ./nanoBench.sh -asm "pop R12; nop; nop; nop;"
sudo ./nanoBench.sh -asm "pop R12; nop; nop; nop; nop;"
the execution time increases to 2 cycles. When adding a 5th nop,
sudo ./nanoBench.sh -asm "pop R12; nop; nop; nop; nop; nop;"
the execution time increases to 3 cycles. This suggests that no other instruction can be decoded in the same cycle as a pop R12 instruction. (When using a different register, e.g., R11, the last example needs 1.5 cycles.)
On Skylake, the execution time stays at 1 cycle when adding between 1 and 3 nops, and increases to 2 for between 4 and 7 nops. This suggests that pop R12 is an instruction that requires the complex decoder, even though it has just one µop (see also Can the simple decoders in recent Intel microarchitectures handle all 1-µop instructions?)
Why is the POP instruction decoded differently when using register R12? Are there any other instructions for which this is also the case?
Workaround: the pop r/m64 encoding of pop r12 doesn't have this decode penalty. (Thanks @Andreas for testing my guess.)
db 0x41, 0x8f, 0xc4 ; REX.B=1 8F /0 pop r/m64 = pop r12
The standard encoding of pop r12 has the same opcode byte as pop rsp, differing only by a REX. (The short form encoding puts the register number in the low 3 bits of that 1 byte).
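Laid out byte by byte (register numbers per the Intel SDM), the collision with pop rsp is easy to see:
5C       ; pop rsp - short form 58+rd, with rd = 4
41 5C    ; pop r12 - same opcode byte 5C, REX.B supplies the extra register bit
41 8F C4 ; pop r12 - the 8F /0 pop r/m64 form, which avoids the decode penalty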
pop rsp is special-cased even in the decoders; on Haswell it's 3 uops¹, so clearly only the complex decoder can decode it. pop r12 also getting penalized makes sense if the primary filtering of which decoder can decode which instruction is by the opcode byte (not accounting for prefixes), at least for this group of opcodes. Whether or not this really reflects the exact internals, it's at least a useful mental model to understand why pop modrm doesn't have this effect. (Although normally you'd only use pop r/m64 with a memory destination, which would mean multi-uop and thus complex-decoder only.) See Can the simple decoders in recent Intel microarchitectures handle all 1-µop instructions? for more about this effect for other opcodes.
push rsp is 2 total uops on Haswell, unlike most push reg instructions, which are 1 uop. But likely that extra uop is just a stack-sync inserted during issue/rename (because of reading RSP), not during decode. @Andreas reports that push rsp and push r12 both show no special effects in the decoders (and I assume the uop cache): just 1 micro-fused uop, with/without a stack-sync uop when it executes.
Opcodes like FF /0 inc r/m32 where the same leading byte is shared between different instructions (overloading the modrm /r field as extra opcode bytes) might be interesting to check on, if there are some single-uop instructions that share a leading byte with multi-uop instructions. Like maybe C0 /4 SHL r/m8,imm8 vs. C0 /2 RCL r/m8, imm8. http://ref.x86asm.net/coder64.html. But SHL with a memory destination can already be multiple uops, so it might get optimistically attempted by the simple decoders anyway, and succeed if it turns out to be single-uop? While perhaps pop r12 bails out early in the simple decoders instead of detecting the REX prefix.
It would make sense for Intel to spend the transistors to make sure common instructions like immediate shifts can decode efficiently, more so than for less-common instructions like pop r12, which you'll normally only find in function epilogues, and thus usually not in inner loops, only in larger loops that include function calls.
Footnote 1: pop rsp is special because it's just mov rsp, [rsp]. (Or as the manual puts it, "The POP ESP instruction increments the stack pointer (ESP) before data at the old top of stack is written into the destination.") Haswell's 3-uop implementation seems unnecessary vs. literally the same 1 uop as mov rsp, [rsp] (I think the fault conditions are identical), but this might have saved transistors in the decoders by adding a uop to the normal way pop reg decodes (possibly implicitly requiring a stack-sync uop for a total of 3), instead of treating it as a whole separate instruction. pop rsp is very rarely used, so its performance doesn't matter.
Perhaps the 16-bit pop sp case was a problem for decoding that byte as 1 pure-load uop? There is no [sp] addressing mode in x86 machine code, and it's possible that limitation extends to internal uops for 16-bit AGU. Other than that, I think the possible fault reasons are the same for pop and mov.
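To make that concrete: the eight 16-bit ModRM memory forms are [bx+si], [bx+di], [bp+si], [bp+di], [si], [di], [bp]/disp16, and [bx], so for example:
mov ax, [bp]   ; encodable - bp is one of the eight 16-bit base forms
; mov ax, [sp] ; not encodable - sp never appears in the 16-bit ModRM table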
pop r12 (short form) does eventually decode to the normal 1 uop, with no more stack-sync uops than for repeated pops of other registers, as per @Andreas's testing. It gets penalized by not being decodeable in the simple decoders, but not by any extra uops like those pop rsp specifically decodes to.
Perhaps GAS, NASM, and other assemblers should get a patch to make it possible to encode pop r12 with the modrm encoding, although probably not defaulting to that. Decoder throughput is often not a problem so spending an extra byte of code-size by default would be undesirable. Especially if there's no effect on other uarches, like AMD or Silvermont-family.
And/or GCC should use R12 as its last choice of call-preserved reg to save/restore? (R12 always needs a SIB byte when used as the base in an addressing mode, too, so that's another reason to avoid it, if compilers aren't going to try to avoid keeping pointers in it.) And maybe schedule the push/pop of r12 for efficient decoding, with 3 other pops (or other single-uop insns) after it before a multi-uop ret.
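On that addressing-mode point, a quick size comparison (encodings per the x86-64 ModRM/SIB rules):
mov eax, [rbx] ; 8B 03       - 2 bytes
mov eax, [r12] ; 41 8B 04 24 - rm=100 forces a SIB byte, plus REX.B: 4 bytes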

How often do the contents of a CPU register change?

Does the data that CPU registers hold change often? The Wikipedia article describes the registers as "a quickly accessible location...of a small amount of fast storage". I'm assuming the memory is fast because a register is accessed and modified often?
Yes, data registers may change on consecutive instructions, which is quite often. Superscalar execution, out-of-order execution, pipelining, register renaming, etc. complicate the analysis, but even on a simple in-order CPU, a register can change as often as once per instruction. A plausible program may have a run of many instructions, all affecting the same register:
int polynom(int num) {
    return num * num + 2 * num + 1;
}
which compiles as:
polynom(int):
        push rbp
        mov rbp, rsp
        mov DWORD PTR [rbp-4], edi
      * mov eax, DWORD PTR [rbp-4]
      * imul eax, eax
        mov edx, DWORD PTR [rbp-4]
        add edx, edx
      * add eax, edx
      * add eax, 1
        pop rbp
        ret
Note the writes to the eax register, marked with an asterisk. In this little function, four almost-consecutive instructions write to that specific register, meaning that we can expect the program-visible state of eax¹ to change at a rate of over 1 GHz if this code were called in a tight loop.
On a more fundamental note, there are some architectural registers that almost always change on every instruction. The most evident of these is the program counter (called PC in many contexts, EIP on x86, RIP on x86_64). Because this register points to the currently executing instruction, it must certainly change with every instruction, barring counterexamples like x86 REP encodings or an instruction that simply jumps to itself.
¹ Again, barring architectural considerations like register renaming, which uses multiple physical registers to implement a single logical, program-visible register.
Since modern CPUs run at GHz clock rates, CPU registers can change their contents hundreds of millions or even billions of times per second.
Most modern CPUs also have large physical register files (on the order of 128 or more entries), so those writes are spread across many physical registers; any individual one might change values "only" a few million times per second during heavy computation.

Is it faster to pop unneeded values from the stack, or add an immediate constant to SP on a 386+ CPU?

My code targets 386+ CPUs (DOSBox usually, occasionally a Pentium MMX), but I only use the 8086 feature set for compatibility. My code is written for a non-multitasking environment (MS-DOS, or DOSBox).
In nested loops I often find myself re-using CX for the deeper loop counters. I PUSH it at the top of the nested loop, and POP it before LOOP is executed.
Sometimes conditions other than CX reaching 0 terminate these inner loops. Then I am left with the unnecessary loop counter, and sometimes more variables, sitting on the stack that I need to clean up.
Is it faster to just add a constant to SP, or POP these unneeded values?
I know that the fastest way of doing it would be to store CX in a spare register at the top of the loop, then restore it before LOOP executes, foregoing the stack completely, but I often don't have a spare register.
Here's a piece of code where I add a constant to SP to avoid a couple POP instructions:
FIND_ENTRY PROC
;SEARCHES A SINGLE SECTOR OF A DIRECTORY LOADED INTO secBuff FOR A
;SPECIFIED FILE/SUB DIRECTORY ENTRY
;IF FOUND, RETURNS THE FILE/SUB DIRECTORY'S CLUSTER NUMBER IN BX
;IF NOT FOUND, RETURNS 0 IN BX
;ALTERS BX
;EXPECTS A FILE NAME STRING INDEX NUMBER IN BP
;EXPECTS A SECTOR OF A DIRECTORY (ROOT, OR SUB) TO BE LOADED INTO secBuff
;EXPECTS DS TO BE LOADED WITH varData
push ax
push cx
push es
push si
push di
lea si, fileName ;si -> file name strings
mov ax, 11d ;ax -> file name length in bytes/characters
mul bp ;ax -> offset to file name string
add si, ax ;ds:si -> desired file name as source input
;for CMPS
mov di, ds
mov es, di
lea di, secBuff ;es:di -> first entry in ds:secBuff as
;destination input for CMPS
mov cx, 16d ;outer loop cntr -> num entries in a sector
ENTRY_SEARCH:
push cx ;store outer loop cntr
push si ;store start of the file name
push di ;store start of the entry
mov cx, 11d ;inner loop cntr -> length of file name
repe cmpsb ;Do the strings match?
jne NOT_ENTRY ;If not, test next entry.
pop di ;di -> start of the entry
mov bx, WORD PTR [di+26] ;bx -> entry's cluster number
add sp, 4 ;discard unneeded stack elements
pop di
pop si
pop es
pop cx
pop ax
ret
NOT_ENTRY:
pop di ;di -> start of the entry
add di, 32d ;di -> start of next entry
pop si ;si -> start of file name
pop cx ;restore the outer loop cntr
loop ENTRY_SEARCH ;loop till we've either found a match, or
;have tested every entry in the sector
;without finding a match.
xor bx, bx ;if we're here no match was found.
;return 0.
pop di
pop si
pop es
pop cx
pop ax
ret
FIND_ENTRY ENDP
If you want to write efficient code, pop vs. add is a very minor issue compared to reducing the amount of saving/restoring you need to do, and optimizing everything else (see below).
If it would take more than 1 pop, always use add sp, imm. Or use sub sp, -128 to save code size by still using an imm8. Or some CPUs may prefer lea instead of add/sub (e.g. gcc uses LEA whenever possible with -mtune=atom). Of course, lea would require an address-size prefix in 16-bit mode, because [sp+2] isn't a valid 16-bit addressing mode.
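For example, in 16-bit code (byte counts from the 8086 encodings; the sign-extended-imm8 form is 83 /5):
add sp, 128  ; 81 C4 80 00 - imm16 form, 4 bytes
sub sp, -128 ; 83 EC 80    - sign-extended imm8 form, 3 bytes, same effect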
Beyond that, there's no single answer that applies to both an actual 386 and a modern x86 like Haswell or Skylake! There have been a lot of microarchitectural changes between those CPUs. Modern CPUs decode x86 instructions to internal RISC-like uops. For a while, using simple x86 instructions was important, but now modern CPUs can represent a lot of work in a single instruction, so more complex x86 instructions (like push, or add with a memory source operand) are single-uop instructions.
Modern CPUs (since Pentium-M) have a stack engine that removes the need for a separate uop to actually update RSP/ESP/SP in the out-of-order core. Intel's implementation requires a stack-sync uop when you read/write RSP with a non-stack instruction (anything other than push/pop / call/ret), which is why pop can be useful, especially if you're doing it after a push or call.
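A hypothetical 64-bit fragment (not from the question) showing where the sync uop appears:
pop rax      ; stack engine tracks the RSP offset; no separate RSP-update uop
pop rcx      ; still tracked by the stack engine
mov rdx, rsp ; RSP read by a non-stack instruction: Intel inserts a
             ; stack-sync uop here before the mov can execute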
clang uses push/pop to align the stack in x86-64 code when a single 8-byte offset is needed; see Why does this function push RAX to the stack as the first operation?
But if you care about performance, loop is slow and should be avoided in the first place, let alone push/pop of the loop counter! Use different regs for inner/outer loops.
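A generic sketch of that register allocation, with OUTER_COUNT/INNER_COUNT standing in for your real counts and assuming DX is otherwise free:
        mov dx, OUTER_COUNT ;outer counter lives in DX the whole time
OUTER:
        mov cx, INNER_COUNT ;inner counter in CX - nothing pushed or popped
INNER:
        ;... loop body ...
        dec cx              ;dec/jnz is faster than LOOP on most CPUs
        jnz INNER
        dec dx
        jnz OUTER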
Basically, you've gone pretty far down the wrong path as far as optimizing, so the real answer is just to point you at http://agner.org/optimize/, and other performance links in the x86 tag wiki. 16-bit code makes it hard to get good performance because of all the partial-register false dependencies on modern CPUs, but with some impact on code size you can break those by using 32-bit operand size when necessary. (e.g. for xor ebx,ebx)
Of course, if you're optimizing for DOSBox, it's not a real CPU; it's an emulator, so loop may be fast! I don't know if anyone has profiled or written optimization guides for DOSBox's CPU emulation, but I'd recommend learning what's fast on real modern hardware; that's more interesting.

Understanding optimized assembly code generated by gcc

I'm trying to understand what kind of optimizations are performed by gcc when the -O3 flag is set. I'm quite confused by these two lines:
xor %esi, %esi
lea 0x0(%esi), %esi
They seem redundant to me. What's the point of the lea instruction here?
That instruction is used to fill space for alignment purposes. Loops can be faster when they start on aligned addresses, because the processor loads memory into the decoder in chunks. By aligning the beginnings of loops and functions, it becomes more likely that they will be at the beginning of one of these chunks. This prevents previous instructions which will not be used from being loaded, maximizes the number of future instructions that will, and, possibly most importantly, ensures that the first instruction is entirely in the first chunk, so it does not take two loads to execute it.
The compiler knows that it is best to align the loop, and has two options to do so. It can either place a jump to the beginning of the loop, or fill the gap with no-ops and let the processor flow through them. Jump instructions break the flow of instructions and often cause wasted cycles on modern processors, so adding them unnecessarily is inadvisable. For a short distance like this no-ops are better.
The x86 architecture contains an instruction specifically for the purpose of doing nothing: nop. However, this is one byte long, so it would take more than one to align the loop. Decoding each one and deciding it does nothing takes time, so it is faster to simply insert a single longer instruction that has no side effects. Therefore, the compiler inserted the lea instruction you see. It has absolutely no effect, and is chosen by the compiler to have the exact length required. In fact, recent processors have standard multi-byte no-op instructions, so this will likely be recognized during decode and never even executed.
As explained by ughoavgfhw, these are padding for better code alignment.
You can find this lea form in the following link:
http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2010-September/003881.html
quote:
1-byte: XCHG EAX, EAX
2-byte: 66 NOP
3-byte: LEA REG, 0 (REG) (8-bit displacement)
4-byte: NOP DWORD PTR [EAX + 0] (8-bit displacement)
5-byte: NOP DWORD PTR [EAX + EAX*1 + 0] (8-bit displacement)
6-byte: LEA REG, 0 (REG) (32-bit displacement)
7-byte: NOP DWORD PTR [EAX + 0] (32-bit displacement)
8-byte: NOP DWORD PTR [EAX + EAX*1 + 0] (32-bit displacement)
9-byte: NOP WORD PTR [EAX + EAX*1 + 0] (32-bit displacement)
Also note this SO question describing it in more detail:
What does NOPL do in x86 system?
Note that the xor itself is not a nop (it changes the value of the reg), but it is also very cheap to perform since it's a zero idiom - What is the purpose of XORing a register with itself?

How do I intentionally read from main memory vs cache?

So I am being taught assembly, and we have an assignment to find the time difference between reading from memory and reading from cache. We have to do this by creating 2 loops and timing them (one reads from main memory and the other from cache). The thing is, I don't know and can't find anything that tells me how to read from either cache or main memory =/. Could you guys help me? I'm doing this in MASM32. I understand how to make loops and most of the assembly language, but I just can't make it read =/
Edit:
I have a question, I've done this ...
mov ecx, 100 ;loop 100 times
xor eax, eax ;set eax to 0
_label:
mov eax, eax ;according to me this is read memory is that good?
dec ecx ;dec loop
jnz _label ;if still not equal to 0 goes again to _label
... would that be ok?
Edit 2:
Well then, I don't intend to pry, and I appreciate your help. I just have another question: since these are two loops, I need to compare them somehow. I've been looking for a timer instruction, but I've found only timeGetTime, GetTickCount and the Performance Counter, and as far as I understand these return the system time, not the time it takes for the loop to finish. Is there a way to actually do what I want, or do I need to think of another way?
Also, to read from different addresses in the second loop (the one not reading from cache), is it OK if I give various "mov" instructions? Or am I totally off base here?
Sorry for all these questions, but again, thank you for your help.
To read from cache, have a loop which reads from the same (or a very similar) memory address:
The first time you read from that address, the values from that address (and from other, nearby addresses) will be moved into the cache
The next time you read from that same address, the values are already cached, so you're reading from cache.
To read uncached memory, have a loop which reads from many, very different (i.e. further apart than the cache size) memory addresses, as in the sketch below.
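A minimal MASM-style sketch of both loops (bigbuf is a hypothetical buffer of at least 32 MB; the counts and the 4096-byte stride are arbitrary, just big enough to defeat the cache):
    mov esi, OFFSET bigbuf
    mov ecx, 1000000
cached:
    mov eax, [esi]  ; same address every time: in L1 cache after the 1st read
    dec ecx
    jnz cached

    mov esi, OFFSET bigbuf
    mov ecx, 8192
uncached:
    mov eax, [esi]  ; a new page every time, so (almost) every read misses
    add esi, 4096
    dec ecx
    jnz uncached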
To answer your second question:
The things you're doing with ecx and jnz look OK (I don't know how accurate/sensitive your timer is, but you might want to loop more than 100 times)
The mov eax, eax is not "read memory" ... it's a no-op, which moves eax into eax. Instead, I think the MASM syntax for reading from memory is something more like mov eax,[esi] ("read from the memory location whose address is contained in esi")
Depending on what O/S you're using, you must read from a memory address which actually exists and is readable. On Windows, for example, an application wouldn't be allowed to do mov esi, 0 followed by mov eax, [esi] because an application isn't allowed to read the memory whose address/location is zero.
To answer your third question:
timeGetTime, GetTickCount and Performance Counter
Your mentioning timeGetTime, GetTickCount and Performance Counter implies that you're running under Windows.
Yes, these return the current time, to various resolutions/accuracies: for example, GetTickCount has a resolution of about 50 msec, so it fails to time events which last less than 50 msec and is inaccurate when timing events which last only 50-100 msec. That's why I said the 100 in your ecx probably isn't big enough.
The QueryPerformanceCounter function is probably the most accurate timer that you have.
To use any of these timers as an interval timer:
Get the time, before you start to loop
Get the time again, after you finish looping
Subtract these two times: the difference is the time interval. A rough sketch follows below.
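Roughly, in MASM32 syntax (a sketch; t0, t1, and the loop placement are placeholders I've made up):
.data
t0 dq 0 ; QueryPerformanceCounter fills in a 64-bit LARGE_INTEGER
t1 dq 0
.code
    invoke QueryPerformanceCounter, ADDR t0 ; time before the loop
    ; ... the loop being measured goes here ...
    invoke QueryPerformanceCounter, ADDR t1 ; time after the loop
    ; elapsed ticks = t1 - t0; divide by QueryPerformanceFrequency
    ; to convert to seconds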
is it ok if I give various "mov" instructions?
Yes, I think so. I think you can do it like this (beware: I'm not sure/don't remember whether this is the right MASM syntax for reading from a named memory location) ...
mov eax,[memory1]
mov eax,[memory2]
mov eax,[memory3]
mov eax,[memory4]
mov eax,[memory5]
... where memory1 through memory5 are addresses of widely-spaced global variables in your data segment, for example declared as shown below.
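For instance, the spacing could come from padding between the variables (sizes arbitrary, just larger than the cache):
.data?
memory1 dd ?
        db 1000000 dup(?) ; ~1 MB of padding, bigger than the cache
memory2 dd ?
        db 1000000 dup(?)
memory3 dd ?
        db 1000000 dup(?)
memory4 dd ?
        db 1000000 dup(?)
memory5 dd ?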
Or, you could do ...
mov eax,[esi]
add esi,edx
mov eax,[esi]
add esi,edx
mov eax,[esi]
add esi,edx
mov eax,[esi]
add esi,edx
mov eax,[esi]
... where esi is pointing to the bottom of a long chunk of memory, and edx is some increment that's equal to about a fifth of the length of the chunk.
