machine code for backward conditional jump in 8086 microprocessor - x86-16

How do I construct the machine code for a backward conditional jump (e.g. JNZ) on the 8086 microprocessor?
LOOP: MOV DL, [BX] (say this starts at 100C)
ADD AX,DX (this at 100E)
INC BX (1010)
DEC CL (1011)
JNZ LOOP (1013)
What will be the machine code of last line?
The machine code for JNZ is 75h, and here I want to jump 9 bytes backward (I think).

Jumps are relative to the location after the jump instruction. Here you want to jump 9 bytes back, so the displacement is -9 (F7h as a signed byte) and the encoding is 75h, F7h.
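As a quick check (not part of the original answer), the displacement arithmetic can be verified in a few lines of Python:

```python
# rel8 displacement = target - address of the instruction *after* the jump.
# JNZ is 2 bytes (opcode 75h + rel8) and starts at 0x1013,
# so the next instruction is at 0x1015.
target = 0x100C          # address of LOOP
next_ip = 0x1013 + 2     # address after the 2-byte JNZ
disp = next_ip and (target - next_ip)  # -9
encoded = disp & 0xFF    # two's-complement byte
print(hex(encoded))      # 0xf7 -> full encoding is 75h, F7h
```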

Why jnz counts no cycle?

I found in an online resource that IvyBridge has 3 ALUs, so I wrote a small program to test:
global _start
_start:
mov rcx, 10000000
.for_loop: ; do {
inc rax
inc rbx
dec rcx
jnz .for_loop ; } while (--rcx)
xor rdi, rdi
mov rax, 60 ; _exit(0)
syscall
I compile and run it with perf:
$ nasm -felf64 cycle.asm && ld cycle.o && sudo perf stat ./a.out
The output shows:
10,491,664 cycles
which seems to make sense at first glance, because there are 3 independent instructions (2 inc and 1 dec) that use an ALU in the loop, so together they account for 1 cycle per iteration.
But what I don't understand is why the whole loop takes only 1 cycle. jnz depends on the result of dec rcx, so it should count as 1 cycle, making the whole loop 2 cycles. I would expect the output to be close to 20,000,000 cycles.
I also tried changing the second inc from inc rbx to inc rax, which makes it dependent on the first inc. The result does become close to 20,000,000 cycles, which shows that a dependency delays an instruction so they can't run in the same cycle. So why is jnz special?
What am I missing here?
First of all, dec/jnz will macro-fuse into a single uop on Intel Sandybridge-family. You could defeat that by putting a non-flag-setting instruction between the dec and jnz.
.for_loop: ; do {
inc rax
dec rcx
lea rbx, [rbx+1] ; doesn't touch flags, defeats macro-fusion
jnz .for_loop ; } while (--rcx)
This will still run at 1 iter per cycle on Haswell and later and Ryzen because they have 4 integer execution ports to keep up with 4 uops per iteration. (Your loop with macro-fusion is only 3 fused-domain uops on Intel CPUs, so SnB/IvB can run it at 1 per clock, too.)
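As a sanity check on the numbers above (my own back-of-the-envelope model, not the answer's code): with dec/jnz macro-fused, the loop is 3 fused-domain uops per iteration, under the 4-uops-per-clock issue width, so the front end can sustain 1 iteration per cycle:

```python
# Rough front-end throughput model for the loop above (an illustration,
# not a pipeline simulator).
uops_per_iter_fused = 3   # inc rax, inc rbx, dec+jnz (macro-fused into one uop)
issue_width = 4           # fused-domain uops per clock on SnB-family
iters = 10_000_000

# Front-end limit; a taken branch every cycle also caps this at 1 iter/clock.
min_cycles = max(uops_per_iter_fused / issue_width, 1) * iters
print(f"{min_cycles:,.0f}")  # 10,000,000 -- close to the measured 10,491,664
```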
See Agner Fog's optimization guide and especially his microarch guide. Also other links in https://stackoverflow.com/tags/x86/info.
Control dependencies are hidden by branch prediction + speculative execution, unlike data dependencies.
Out-of-order execution and branch prediction + speculative execution hide the "latency" of the control dependency. i.e. the next iteration can start running before the CPU verifies that jnz should really be taken.
So each jnz has an input dependency on the previous dec rcx before it can verify the prediction, but later instructions don't have to wait for it to be checked before they can execute. In-order retirement makes sure that mis-speculation is caught before anything can "see" it happen (except for microarchitectural effects leading to the Spectre attack...)
10M iterations is not a lot. I'd normally use at least 100M for something that runs at only 1c per iter. Having a simple microbenchmark run for 0.1 to 1 second is normally good to get very high precision and hide startup overhead.
And BTW, you don't need sudo perf if you set kernel.perf_event_paranoid = 0 with sysctl. It's almost certainly better to do that than to use sudo all the time.

Fastest method to accept a byte from a nibble in hardware (8051)

I have a smaller 8051 microcontroller (AT89C4051) connected to a larger microcontroller (AT89S52), and the larger one is driving the clock of the smaller one. The crystal speed of the large one is 22.1184 MHz. Documentation states that because the ALE line is clocking the smaller micro, its clock speed is limited to 3.6 MHz.
The two micros communicate with each other with only 4 I/O lines and one interrupt line. I'm trying to make reception of a byte occur as fast as possible, but the code I have come up with makes me think I didn't choose the best solution, but here is what I got so far:
org 0h
ljmp main ;run initialization + program
org 13h ;INT1 handler - worst-case scenario: 52 µs processing time, I think?
push PSW ;save old registers and the carry flag
mov PSW,#18h ;load our register space
mov R7,A ;save accumulator (PUSH takes an extra clock cycle I think)
mov A,P1 ;Grab data (wish I could grab it sooner somehow)
anl A,#0Fh ;Only lowest 4 bits on P1 is the actual data. other 4 bits are useless
djnz R2,nonib2 ;See what nibble # we are at. 1st or 2nd?
orl A,R6 ;were at 2nd so merge previously saved data in
mov @R0,A ;and put it in memory space
inc R0 ;and increment pointer
mov R2,#2h ;and reset nibble number
nonib2:
swap A ;exchange nibbles to prevent overwriting nibble later
mov R6,A ;save to R6 high nibble
mov A,R7 ;restore accumulator
pop PSW ;restore carry and register location
reti ;return to wherever
main:
mov PSW,#18h ;use new address for R0 through R7 to not clash with other routines
mov R0,#BUFCMD ;set up start of buffer space in R0 (the pointer the ISR uses)
mov R2,#2h ;set # nibbles needed to process byte
mov PSW,#0h
mov IE,#84h ;enable external interrupt 1
..rest of code here...
We have to assume that this can be triggered by hardware at any point, even during a time-sensitive LCD character processing routine in which all registers and accumulator are used.
What optimizations can I perform to this code here to make it run much faster?
There is no need to do the nibble processing in the interrupt. Just store the 4 bits as they come in.
Assuming you can allocate R0 globally, the code can be as simple as:
org 13h
mov @r0, p1
inc r0
reti
Won't get much faster than that.
If you absolutely cannot reserve R0, but you can at least arrange to use register banks differing in a single bit, e.g. #0 and #1, then you can use bit set/clear to switch away and back in 2 cycles, instead of 5 for the push psw approach.
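For illustration only (not firmware code): the byte reassembly the original ISR performs, with the first nibble swapped into the high half and the second OR'd in, can be modeled like this:

```python
def assemble_byte(first_nibble: int, second_nibble: int) -> int:
    """Mimic the ISR: the first 4-bit transfer is swapped into the high
    nibble (SWAP A; MOV R6,A), the second is OR'd in as the low nibble
    (ORL A,R6) before the byte is stored to the buffer."""
    high = (first_nibble & 0x0F) << 4
    return high | (second_nibble & 0x0F)

print(hex(assemble_byte(0xA, 0x5)))  # 0xa5
```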

Is it faster to pop unneeded values from the stack, or add an immediate constant to SP on a 386+ CPU?

My code targets 386+ CPUs (usually DOSBox, occasionally a Pentium MMX), but I only use the 8086 feature set for compatibility. My code is written for a non-multitasking environment (MS-DOS or DOSBox).
In nested loops I often find myself re-using CX for the deeper loop counters. I PUSH it at the top of the nested loop, and POP it before LOOP is executed.
Sometimes conditions other than CX reaching 0 terminate these inner loops. Then I am left with the unnecessary loop counter, and sometimes more variables, sitting on the stack that I need to clean up.
Is it faster to just add a constant to SP, or POP these unneeded values?
I know that the fastest way of doing it would be to store CX in a spare register at the top of the loop, then restore it before LOOP executes, foregoing the stack completely, but I often don't have a spare register.
Here's a piece of code where I add a constant to SP to avoid a couple POP instructions:
FIND_ENTRY PROC
;SEARCHES A SINGLE SECTOR OF A DIRECTORY LOADED INTO secBuff FOR A
;SPECIFIED FILE/SUB DIRECTORY ENTRY
;IF FOUND, RETURNS THE FILE/SUB DIRECTORY'S CLUSTER NUMBER IN BX
;IF NOT FOUND, RETURNS 0 IN BX
;ALTERS BX
;EXPECTS A FILE NAME STRING INDEX NUMBER IN BP
;EXPECTS A SECTOR OF A DIRECTORY (ROOT, OR SUB) TO BE LOADED INTO secBuff
;EXPECTS DS TO BE LOADED WITH varData
push ax
push cx
push es
push si
push di
lea si, fileName ;si -> file name strings
mov ax, 11d ;ax -> file name length in bytes/characters
mul bp ;ax -> offset to file name string
add si, ax ;ds:si -> desired file name as source input
;for CMPS
mov di, ds
mov es, di
lea di, secBuff ;es:di -> first entry in ds:secBuff as
;destination input for CMPS
mov cx, 16d ;outer loop cntr -> num entries in a sector
ENTRY_SEARCH:
push cx ;store outer loop cntr
push si ;store start of the file name
push di ;store start of the entry
mov cx, 11d ;inner loop cntr -> length of file name
repe cmpsb ;Do the strings match?
jne NOT_ENTRY ;If not, test next entry.
pop di ;di -> start of the entry
mov bx, WORD PTR [di+26] ;bx -> entry's cluster number
add sp, 4 ;discard unneeded stack elements
pop di
pop si
pop es
pop cx
pop ax
ret
NOT_ENTRY:
pop di ;di -> start of the entry
add di, 32d ;di -> start of next entry
pop si ;si -> start of file name
pop cx ;restore the outer loop cntr
loop ENTRY_SEARCH ;loop till we've either found a match, or
;have tested every entry in the sector
;without finding a match.
xor bx, bx ;if we're here no match was found.
;return 0.
pop di
pop si
pop es
pop cx
pop ax
ret
FIND_ENTRY ENDP
If you want to write efficient code, pop vs. add is a very minor issue compared to reducing the amount of saving/restoring you need to do, and optimizing everything else (see below).
If it would take more than 1 pop, always use add sp, imm. Or sub sp, -128 to save code-size by still using an imm8. Or some CPUs may prefer lea instead of add/sub. (e.g. gcc uses LEA whenever possible with -mtune=atom). Of course, this would require an address-size prefix in 16-bit mode because [sp+2] isn't a valid addressing mode.
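The sub sp, -128 trick relies on the imm8 being sign-extended; in 16-bit wraparound arithmetic, subtracting -128 is identical to adding 128. A quick check (my own, not from the answer):

```python
# add sp, 128 needs a 16-bit immediate (128 doesn't fit in a signed byte),
# but sub sp, -128 encodes -128 as a sign-extended imm8 -- same result mod 2^16.
sp = 0xFF00
add_form = (sp + 128) & 0xFFFF       # add sp, 128
sub_form = (sp - (-128)) & 0xFFFF    # sub sp, -128
print(hex(add_form), hex(sub_form))  # 0xff80 0xff80
```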
Beyond that, there's no single answer that applies to both an actual 386 and a modern x86 like Haswell or Skylake! There have been a lot of microarchitectural changes between those CPUs. Modern CPUs decode x86 instructions to internal RISC-like uops. For a while, using simple x86 instructions was important, but now modern CPUs can represent a lot of work in a single instruction, so more complex x86 instructions (like push, or add with a memory source operand) are single uop instructions.
Modern CPUs (since Pentium-M) have a stack engine that removes the need for a separate uop to actually update RSP/ESP/SP in the out-of-order core. Intel's implementation requires a stack-sync uop when you read/write RSP with a non-stack instruction (anything other than push/pop / call/ret), which is why pop can be useful, especially if you're doing it after a push or call.
clang uses push/pop to align the stack in x86-64 code when a single 8-byte offset is needed. Why does this function push RAX to the stack as the first operation?.
But if you care about performance, loop is slow and should be avoided in the first place, let alone push/pop of the loop counter! Use different regs for inner/outer loops.
Basically, you've gone pretty far down the wrong path as far as optimizing, so the real answer is just to point you at http://agner.org/optimize/, and other performance links in the x86 tag wiki. 16-bit code makes it hard to get good performance because of all the partial-register false dependencies on modern CPUs, but with some impact on code size you can break those by using 32-bit operand size when necessary. (e.g. for xor ebx,ebx)
Of course, if you're optimizing for DOSBox, it's not a real CPU; it's emulating one. So loop may be fast! IDK if anyone has profiled or written optimization guides for DOSBox's CPU emulator. But I'd recommend learning what's fast on real modern hardware; that's more interesting.

Why is a memory round-trip faster than not performing the round-trip?

I've got some simple 32bit code which computes the product of an array of 32bit integers. The inner loop looks like this:
##loop:
mov esi,[ebx]
mov [esp],esi
imul eax,[esp]
add ebx, 4
dec edx
jnz ##loop
What I'm trying to understand is why the above code is 6% faster than these two versions of the code, which do not perform the redundant memory round-trip:
##loop:
mov esi,[ebx]
imul eax,esi
add ebx, 4
dec edx
jnz ##loop
and
##loop:
imul eax,[ebx]
add ebx, 4
dec edx
jnz ##loop
The latter two pieces of code execute in virtually the same time, and as mentioned both are 6% slower than the first piece (165 ms vs 155 ms, 200 million elements).
I've tried manually aligning the jump target to a 16 byte boundary, but it makes no difference.
I'm running this on an Intel i7 4770k, Windows 10 x64.
Note: I know the code could be improved by doing all sorts of optimizations, however I'm only interested in the performance difference between the above pieces of code.
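For perspective, here is a rough conversion of the reported timings into cycles per iteration (my own arithmetic; the 3.9 GHz figure is an assumed single-core turbo for the i7-4770K, not stated in the question):

```python
elements = 200_000_000
freq_hz = 3.9e9  # assumed single-core turbo clock of an i7-4770K

results = {}
for label, ms in [("with round-trip", 155), ("without round-trip", 165)]:
    # cycles per iteration = total time * clock frequency / element count
    results[label] = ms / 1000 * freq_hz / elements
    print(f"{label}: {results[label]:.2f} cycles/iter")
# roughly 3.0 vs 3.2 cycles per iteration
```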
I suspect but can't be sure that you are preventing a stall on a data dependency:
The code looks like this:
##loop:
mov esi,[ebx] # (1)Load the memory location to esi reg
(mov [esp],esi) # (1)optionally store the location on the stack
imul eax,[esp] # (3) Perform the multiplication
add ebx, 4 # (1) Add 4
dec edx # (1)decrement counter
jnz ##loop # (0**) loop
Those numbers in brackets are the latencies of the instructions. The jump is effectively 0 when the branch predictor guesses correctly (and since this loop is almost always taken, it will guess correctly most of the time).
So: while the multiplication is still in flight (3 cycles), we get back to the top of the loop after 2, and the next load would have to stall. Or we could do a store, which can happen at the same time as the multiplication, and then not stall at all.
What about the dummy store, you ask? Why does that work? Notice that you are storing to memory the critical value that we are using to multiply. Thus the processor can use that value as it is being stored and clobber the register.
So why can't the processor do this anyway? The processor can't produce more memory accesses than you ask it to, or it could interfere with multi-processor programs (imagine the cache line you are writing to is shared, and you have to invalidate it on the other CPUs every loop by writing to it... ouch!).
All of this is pure speculation, but it seems to match all the evidence (your code and my knowledge of the intel architecture ... and x86 assembly). Hopefully someone can point out if I have something wrong.

How can I create a sleep function in 16bit MASM Assembly x86?

I am trying to create a sleep/delay procedure in 16bit MASM Assembly x86 that will, say, print a character on the screen every 500ms.
From the research I have done, it seems that there are three methods to achieve this - I would like to use the one that uses CPU clock ticks.
Please note I am running Windows XP through VMWare Fusion on Mac OS X Snow Leopard - I am not sure if that affects anything.
Could someone please point me in the right direction, or provide a working piece of code I can tweak? Thank you!
The code I have found is supposed to print 'A' on the screen every second, but does not work (I'd like to use milliseconds anyway).
TOP:
MOV AH,2Ch ; DOS get-system-time function
INT 21h
MOV BH,DH ; DH has current second
GETSEC: ; Loops until the current second is not equal to the last, in BH
MOV AH,2Ch
INT 21h
CMP BH,DH ; Here is the comparison to exit the loop and print 'A'
JNE PRINTA
JMP GETSEC
PRINTA:
MOV AH,02h
MOV DL,41h ; 'A'
INT 21h
JMP TOP
EDIT: Following GJ's advice, here's a working procedure. Just call it
DELAY PROC
TIMER:
MOV AH, 00H
INT 1AH
CMP DX,WAIT_TIME
JB TIMER
ADD DX,3 ;1-18, where smaller is faster and 18 is close to 1 second
MOV WAIT_TIME,DX
RET
DELAY ENDP
This cannot be done in pure MASM. All the old tricks for setting a fixed delay operate on the assumption that you have total control of the machine and are the only thread running on a CPU, so that if you wait 500 million cycles, exactly 500,000,000/f seconds will have elapsed (for a CPU at frequency f); that'd be 500ms for a 1GHz processor.
Because you are running on a modern operating system, you are sharing the CPU with many other threads (among them, the kernel -- no matter what you do, you cannot take priority over the kernel!), so waiting 500 million cycles in only your thread will mean that more than 500 million cycles elapse in the real world. This problem cannot be solved by userspace code alone; you are going to need the cooperation of the kernel.
The proper way to solve this is to look up what Win32 API function will suspend your thread for a specified number of milliseconds, then just call that function. You should be able to do this directly from assembly, possibly with additional arguments to your linker. Or, there might be an NT kernel system call to perform this function (I have very little experience with NT system calls, and honestly have no idea what the NT system call table looks like, but a sleep function is the sort of thing I might expect to see). If a system call is available, then issuing a direct system call from assembly is probably the quickest way to do what you want; it's also the least portable (but then, you're writing assembly!).
Edit: Looking at the NT kernel system call table, there don't appear to be any calls related to sleeping or getting the date and time (like your original code uses), but there are several system calls to set up and query timers. Spinning while you wait for a timer to reach the desired delay is one effective, if inelegant, solution.
use INT 15h, function 86h:
Call With:
AH = 86h
CX:DX = interval in µs (microseconds)
Actually you can use ROM BIOS interrupt 1Ah function 00h, 'Read Current Clock Count'. Or you can read the dword at address $40:$6C, but you must ensure an atomic read. It is incremented by the BIOS timer interrupt at about 18.2 Hz.
For more information read: The DOS Clock
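For reference, the tick rate comes from the 8254 PIT's 1,193,182 Hz input clock divided by the BIOS default divisor of 65536. A small helper to convert milliseconds to whole ticks (my own arithmetic, not from the answer):

```python
PIT_HZ = 1_193_182          # 8254 PIT input clock
TICK_HZ = PIT_HZ / 65536    # ~18.2065 BIOS ticks per second (default divisor)

def ms_to_ticks(ms: float) -> int:
    """Round a millisecond delay to the nearest whole BIOS tick."""
    return round(ms / 1000 * TICK_HZ)

print(ms_to_ticks(500))   # ~9 ticks for half a second
print(ms_to_ticks(1000))  # ~18 ticks for a second
```

Note the granularity: with ~55 ms per tick, a 500 ms delay can only be approximated.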
Well, then. An old-style, non-constant, power-consuming delay loop, which will slow down other running threads, would look like:
delay equ 5000
top: mov ax, delay
loopa: mov bx, delay
loopb: dec bx
jnz loopb ;dec doesn't touch CF, so jnz (not jnc) is the loop condition
dec ax
jnz loopa
mov ah,2
mov dl,'A'
int 21h
jmp top
The delay is quadratic to the constant. But if you use this delay loop, somewhere in the world a young innocent kitten will die.
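For a sense of scale (my own arithmetic): with delay = 5000, the inner counter decrements 25 million times per printed character:

```python
delay = 5000
# Each of the `delay` outer passes runs the inner countdown `delay` times,
# so total work per character grows quadratically with the constant.
inner_total = delay * delay
print(f"{inner_total:,}")  # 25,000,000 inner-loop decrements per character
```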
I didn't test this code but concept must work...
Save/restore es register is optional!
Check code carefully!
DelayProcedure:
push es ;save es and load new es
mov ax, 0040h
mov es, ax
;Pseudo-atomic read of the 32-bit DOS time tick variable
PseudoAtomicRead1:
mov ax, es:[006ch]
mov dx, es:[006eh]
cmp ax, es:[006ch]
jne PseudoAtomicRead1
;Add the time delay to dx:ax, where smaller is faster and 18 is close to 1 second
add ax, 3
adc dx, 0
;1800AFh is the last DOS time tick value, so check for day overflow
mov cx, ax
mov bx, dx
;Do the 32-bit subtract/compare in cx:bx, so dx:ax keeps the target
sub cx, 00AFh
sbb bx, 0018h
jbe DayOverflow
;Pseudo-atomic read of the 32-bit DOS time tick variable
PseudoAtomicRead2:
mov cx, es:[006ch]
mov bx, es:[006eh]
cmp cx, es:[006ch]
jne PseudoAtomicRead2
NotZero:
;At last do the 32-bit compare
sub cx, ax
sbb bx, dx
jae Exit
;Check day overflow again because the task scheduler can jump past the last time ticks
inc bx ;if no day overflow then bx = 0FFFFh
jz PseudoAtomicRead2
jmp Exit
DayOverflow:
;Pseudo-atomic read of the 32-bit DOS time tick variable
PseudoAtomicRead3:
mov ax, es:[006ch]
mov dx, es:[006eh]
cmp ax, es:[006ch]
jne PseudoAtomicRead3
;At last do the 32-bit compare
sub ax, cx
sbb dx, bx
jb PseudoAtomicRead3
Exit:
pop es ;restore es
ret
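The "pseudo atomic read" pattern above (read the low word, read the high word, re-read the low word and retry if it changed) can be sketched outside of assembly like this (illustrative Python with a hypothetical mem list, not part of the answer):

```python
def read_32bit_atomically(mem):
    """mem is a 2-element list [low16, high16] that an 'interrupt' may
    update between our reads; retry until the low word is stable, so we
    never pair a stale low word with a fresh high word (a torn read)."""
    while True:
        low = mem[0]
        high = mem[1]
        if mem[0] == low:          # low word unchanged -> consistent pair
            return (high << 16) | low

ticks = [0x00AF, 0x0018]           # e.g. last tick of the DOS day, 001800AFh
print(hex(read_32bit_atomically(ticks)))  # 0x1800af
```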
Here's a fairly simple example that should work if a long, not highly precise delay is needed.
To use, specify the delay in AX in 125ms increments.
;----------------------------------------------------------------------------;
; Simple delay based on PIT timer ticks
;----------------------------------------------------------------------------;
; Input: AX = delay_time in 1/8th of a second (~125ms) increments
;----------------------------------------------------------------------------;
DELAY_TIMER PROC
STI ; ensure interrupts are on
PUSH CX ; call-preserve CX and DS (if needed)
PUSH DS
MOV CX, 40H ; set DS to BIOS Data Area
MOV DS, CX
MOV CX, 583 ; delay_factor = 1/8 * 18.2 * 256
MUL CX ; AH (ticks) = delay_time * delay_factor
XOR CX, CX ; CX = 0
MOV CL, AH ; CX = # of ticks to wait
MOV AH, BYTE PTR DS:[6CH] ; get starting tick counter
TICK_DELAY:
HLT ; wait for any interrupt
MOV AL, BYTE PTR DS:[6CH] ; get current tick counter
CMP AL, AH ; still the same?
JZ TICK_DELAY ; loop if the same
MOV AH, AL ; otherwise, save new tick value to AH
LOOP TICK_DELAY ; loop until # of ticks (CX) has elapsed
POP DS
POP CX
RET
DELAY_TIMER ENDP
Here is an example to "sleep" for 1/2 second:
MOV AX, 4 ; delay for 1/2 seconds (4 * 1/8 seconds)
CALL DELAY_TIMER
This is somewhat "non-blocking" as it will halt the CPU between ticks. It would of course be a non-issue on real hardware running on single-user DOS, but in a VM/Windows environment it might make it a better neighbor.
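The delay_factor arithmetic can be checked numerically (my own sketch): the routine effectively computes ticks = delay_time * 583 / 256, which approximates delay_time / 8 seconds at ~18.2 ticks per second:

```python
TICK_HZ = 1_193_182 / 65536              # ~18.2065 Hz BIOS tick rate
delay_factor = round(TICK_HZ * 256 / 8)  # 583, as in the listing's MOV CX, 583

def ticks_for(delay_eighths: int) -> int:
    # MUL CX leaves the product in DX:AX; MOV CL, AH keeps bits 8..15 of AX,
    # i.e. the low word of the product divided by 256.
    return (delay_eighths * delay_factor >> 8) & 0xFF

print(ticks_for(4))   # 9 ticks, ~0.494 s for a requested 0.5 s
```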
The problem with all of the above code examples is that they poll rather than block. If you examine the CPU usage during a relatively long wait period, you will see it running around 50%. What we want is to use some DOS or BIOS function that blocks execution so that CPU usage is near 0%.
Offhand, the BIOS INT 16h, AH=1 function comes to mind. You may be able to devise a routine that calls that function, then inserts a keystroke into the keyboard buffer when the time has expired. There are numerous problems with that idea ;), but it may be food for thought. It is likely that you will be writing some sort of interrupt handler.
In the 32-bit Windows API, there is a "Sleep" function. I suppose you could thunk to that.
