I am trying to create a sleep/delay procedure in 16bit MASM Assembly x86 that will, say, print a character on the screen every 500ms.
From the research I have done, it seems that there are three methods to achieve this - I would like to use the one that uses CPU clock ticks.
Please note I am running Windows XP through VMWare Fusion on Mac OS X Snow Leopard - I am not sure if that affects anything.
Could someone please point me in the right direction, or provide a working piece of code I can tweak? Thank you!
The code I have found is supposed to print 'A' on the screen every second, but does not work (I'd like to use milliseconds anyways).
INT 21
MOV BH,DH ; DH has current second
GETSEC: ; Loops until the current second is not equal to the last, in BH
INT 21
CMP BH,DH ; Here is the comparison to exit the loop and print 'A'
INT 21
EDIT: Following GJ's advice, here's a working procedure. Just call it
ADD DX,3 ;1-18, where smaller is faster and 18 is close to 1 second
This cannot be done in pure MASM. All the old tricks for setting a fixed delay operate on the assumption that you have total control of the machine and are the only thread running on a CPU, so that if you wait 500 million cycles, exactly 500,000,000/f seconds will have elapsed (for a CPU at frequency f); that'd be 500ms for a 1GHz processor.
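To make the arithmetic concrete, here is a small C sketch of how many cycles a busy-wait would have to burn for a given delay. The function name `cycles_for_delay` is made up for illustration, and the calculation assumes the loop is the only thing running on the CPU.

```c
#include <stdint.h>

/* Cycles to wait for delay_ms milliseconds on a CPU running at
   cpu_hz cycles per second, assuming the loop owns the CPU. */
uint64_t cycles_for_delay(uint64_t delay_ms, uint64_t cpu_hz)
{
    return delay_ms * cpu_hz / 1000u;
}
```

For example, `cycles_for_delay(500, 1000000000)` gives the 500 million cycles mentioned above for a 1 GHz processor.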
Because you are running on a modern operating system, you are sharing the CPU with many other threads (among them, the kernel -- no matter what you do, you cannot take priority over the kernel!), so waiting 500 million cycles in only your thread will mean that more than 500 million cycles elapse in the real world. This problem cannot be solved by userspace code alone; you are going to need the cooperation of the kernel.
The proper way to solve this is to look up what Win32 API function will suspend your thread for a specified number of milliseconds, then just call that function. You should be able to do this directly from assembly, possibly with additional arguments to your linker. Or, there might be an NT kernel system call to perform this function (I have very little experience with NT system calls, and honestly have no idea what the NT system call table looks like, but a sleep function is the sort of thing I might expect to see). If a system call is available, then issuing a direct system call from assembly is probably the quickest way to do what you want; it's also the least portable (but then, you're writing assembly!).
Edit: Looking at the NT kernel system call table, there don't appear to be any calls related to sleeping or getting the date and time (like your original code uses), but there are several system calls to set up and query timers. Spinning while you wait for a timer to reach the desired delay is one effective, if inelegant, solution.
use INT 15h, function 86h:
Call With:
AH = 86h
CX:DX = interval in uS
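As a rough illustration of how the CX:DX pair is formed, the C sketch below splits a delay given in milliseconds into the high and low 16-bit words of the microsecond count. The struct and function names are hypothetical and only serve the example.

```c
#include <stdint.h>

/* Register pair for INT 15h, AH=86h: CX is the high word and DX the
   low word of the wait interval in microseconds. */
struct wait_regs {
    uint16_t cx;
    uint16_t dx;
};

/* Split a millisecond delay into the CX:DX microsecond pair.
   The caller is assumed to keep ms small enough that ms * 1000
   still fits in 32 bits. */
struct wait_regs wait_regs_from_ms(uint32_t ms)
{
    uint32_t us = ms * 1000u;
    struct wait_regs r;
    r.cx = (uint16_t)(us >> 16);
    r.dx = (uint16_t)(us & 0xFFFFu);
    return r;
}
```

For a 500 ms delay this yields CX = 0007h and DX = 0A120h, since 500,000 is 7A120h.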
Actually you can use ROM BIOS interrupt 1Ah function 00h, 'Read Current Clock Count'. Or you can read the dword at address 0040h:006Ch, but you must ensure the read is atomic. It is incremented by the timer interrupt at about 18.2 Hz.
For more information read: The DOS Clock
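To get a feel for how coarse the 18.2 Hz tick is, here is a C sketch that converts a delay in milliseconds to the nearest number of ticks. It assumes the standard PC timer input frequency of 1,193,182 Hz divided by 65,536; the function name is invented for the example.

```c
#include <stdint.h>

/* Timer input clock in Hz and the divisor that yields ~18.2 ticks/s
   (assumed standard PC values). */
#define PIT_HZ      1193182ULL
#define PIT_DIVISOR 65536ULL

/* Convert milliseconds to the nearest whole number of timer ticks. */
uint32_t ms_to_ticks(uint32_t ms)
{
    uint64_t denom = PIT_DIVISOR * 1000ULL;
    return (uint32_t)(((uint64_t)ms * PIT_HZ + denom / 2) / denom);
}
```

A 500 ms delay comes out as 9 ticks (about 494 ms), so the tick counter cannot hit 500 ms exactly.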
Well, then. An old-style, non-constant, power-hungry delay loop, which will slow down other running threads, would look like this:
delay equ 5000
top: mov ax, delay
loopa: mov bx, delay
loopb: dec bx ; DEC leaves CF unchanged, so test ZF
jnz loopb
dec ax
jnz loopa
mov ah,2
mov dl,'A'
int 21h ; DOS function 02h: print the character in DL
jmp top
The delay is quadratic to the constant. But if you use this delay loop, somewhere in the world a young innocent kitten will die.
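The quadratic growth can be checked with a small C model of the two nested counters. This is only a sketch: it assumes each loop counts its register down to zero and that the constant is at least 1, and the function name is made up.

```c
#include <stdint.h>

/* Count how often the innermost decrement runs when both loops count
   a 16-bit register down from `delay` to zero (assumes delay >= 1). */
uint32_t inner_iterations(uint16_t delay)
{
    uint32_t count = 0;
    uint16_t ax = delay;
    do {
        uint16_t bx = delay;
        do {
            bx--;
            count++;
        } while (bx != 0);
        ax--;
    } while (ax != 0);
    return count;
}
```

With the constant 5000 the inner decrement runs 25,000,000 times, and doubling the constant quadruples that count.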
I didn't test this code but concept must work...
Save/restore es register is optional!
Check code carefully!
push es ; Save es and load new es
mov ax, 0040h
mov es, ax
; Pseudo atomic read of 32 bit DOS time tick variable
PseudoAtomicRead1:
mov ax, es:[006ch]
mov dx, es:[006eh]
cmp ax, es:[006ch]
jne PseudoAtomicRead1
; Add time delay to dx:ax, where smaller is faster and 18 is close to 1 second
add ax, 3
adc dx, 0
; 1800AFh is the last DOS time tick value, so check for day overflow
mov cx, ax
mov bx, dx
; Do 32 bit subtract/compare on the copy in bx:cx
sub cx, 00AFh
sbb bx, 0018h
jbe DayOverflow
; Pseudo atomic read of 32 bit DOS time tick variable
PseudoAtomicRead2:
mov cx, es:[006ch]
mov bx, es:[006eh]
cmp cx, es:[006ch]
jne PseudoAtomicRead2
; At last do 32 bit compare
sub cx, ax
sbb bx, dx
jae Exit
; Check day overflow again because the task scheduler can jump over the last time ticks
inc bx ; If no day overflow then bx = 0FFFFh
jz PseudoAtomicRead2
jmp Exit
DayOverflow:
; Pseudo atomic read of 32 bit DOS time tick variable
PseudoAtomicRead3:
mov ax, es:[006ch]
mov dx, es:[006eh]
cmp ax, es:[006ch]
jne PseudoAtomicRead3
; At last do 32 bit compare
sub ax, cx
sbb dx, bx
jb PseudoAtomicRead3
Exit:
pop es ; Restore es
Here's a fairly simple example that should work if a long, not highly precise delay is needed.
To use, specify the delay in AX in 125ms increments.
; Simple delay based on PIT timer ticks
; Input: AX = delay_time in 1/8th of a second (~125ms) increments
STI ; ensure interrupts are on
PUSH CX ; call-preserve CX and DS (if needed)
PUSH DS
MOV CX, 40H ; set DS to BIOS Data Area
MOV DS, CX
MOV CX, 583 ; delay_factor = 1/8 * 18.2 * 256
MUL CX ; AH (ticks) = delay_time * delay_factor
XOR CX, CX ; CX = 0
MOV CL, AH ; CX = # of ticks to wait
MOV AH, BYTE PTR DS:[6CH] ; get starting tick counter
TICK_DELAY:
HLT ; wait for any interrupt
MOV AL, BYTE PTR DS:[6CH] ; get current tick counter
CMP AL, AH ; still the same?
JZ TICK_DELAY ; loop if the same
MOV AH, AL ; otherwise, save new tick value to AH
LOOP TICK_DELAY ; loop until # of ticks (CX) has elapsed
POP DS ; restore DS and CX
POP CX
RET
Here is an example to "sleep" for 1/2 second:
MOV AX, 4 ; delay for 1/2 seconds (4 * 1/8 seconds)
This is somewhat "non-blocking" as it will halt the CPU between ticks. It would of course be a non-issue on real hardware running on single-user DOS, but in a VM/Windows environment it might make it a better neighbor.
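The multiply by 583 is a fixed-point trick: 583 is roughly 1/8 * 18.2 * 256, so the tick count ends up in AH, the upper byte of the 16-bit product. The C sketch below mirrors that calculation. The function name is invented, and it assumes the delay is small enough (up to 112 units) that the product still fits in 16 bits.

```c
#include <stdint.h>

/* Ticks to wait for a delay given in 1/8-second units, using the same
   8.8 fixed-point factor as the routine (583 ~= 18.2 / 8 * 256). */
uint8_t eighths_to_ticks(uint16_t eighths)
{
    uint16_t ax = (uint16_t)(eighths * 583u);  /* low word of the product */
    return (uint8_t)(ax >> 8);                 /* AH: integer part of the tick count */
}
```

For the half-second example, 4 * 583 = 2332, and the upper byte of 2332 is 9, so the routine waits for 9 ticks.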
..The problem with all of the above code examples is that they use non-blocking operations. If you examine the CPU usage during a relatively long wait period, you will see it running around 50%. What we want is to use some DOS or BIOS function that blocks execution so that CPU usage is near 0%.
..Offhand, the BIOS INT 16h, AH=1 function comes to mind. You may be able to devise a routine that calls that function, then inserts a keystroke into the keyboard buffer when the time has expired. There are numerous problems with that idea ;), but it may be food for thought. It is likely that you will be writing some sort of interrupt handler.
..In the 32-bit Windows API, there is a "Sleep" function. I suppose you could thunk to that.
Does the data that CPU registers hold change often? The Wikipedia article describes the registers as "a quickly accessible location...of a small amount of fast storage". I'm assuming the memory is fast because a register is accessed and modified often?
Yes, data registers may change on subsequent instructions which is quite often. There are more complications with superscalarity, out-of-order execution, pipelining, register renaming, etc which complicate the analysis, but even on a simple in-order CPU, a register can change as often as once per instruction. A plausible program may have a run of many instructions, all affecting the same register:
int polynom(int num) {
    return num * num + 2 * num + 1;
}
which compiles as:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
* mov eax, DWORD PTR [rbp-4]
* imul eax, eax
mov edx, DWORD PTR [rbp-4]
add edx, edx
* add eax, edx
* add eax, 1
pop rbp
ret
Note the many writes to the eax register, marked with an asterisk. In this little function, four almost-consecutive instructions write to this specific register, meaning that we can expect the program-visible state of eax [1] to change at a rate of over 1 GHz if this code were to be called in a tight loop.
On a more fundamental note, there are some architectural registers that almost always change on every instruction. The most evident of these is the program counter (called PC in many contexts, EIP on x86, RIP on x86_64). Because this register points to the currently executing instruction, it must certainly change with every instruction, barring counterexamples like x86 REP encodings or an instruction that simply jumps to itself.
[1] Again, barring architectural considerations like register renaming, which uses multiple physical registers to implement a single logical, program-visible register.
Since modern CPUs run at GHz clock speeds, CPU registers can change what they are storing hundreds of millions or even billions of times per second.
Since most modern CPUs have ~128 registers, they would typically change values a few million times per second when performing many operations.
I have a smaller 8051 microcontroller (AT89C4051) connected to a larger microcontroller (AT89S52), and the larger one is running the clock of the smaller one. The crystal speed for the large one is 22.1184 MHz. Documentation states that because the ALE line is controlling the smaller micro, its clock speed is limited to 3.6 MHz.
The two micros communicate with each other with only 4 I/O lines and one interrupt line. I'm trying to make reception of a byte occur as fast as possible. The code I have come up with makes me think I didn't choose the best solution, but here is what I have so far:
org 0h
ljmp main ;run initialization + program
org 13h ;INT1 handler - worse case scenario: 52uS processing time I think?
push PSW ;save old registers and the carry flag
mov PSW,#18h ;load our register space
mov R7,A ;save accumulator (PUSH takes an extra clock cycle I think)
mov A,P1 ;Grab data (wish I could grab it sooner somehow)
anl A,#0Fh ;Only lowest 4 bits on P1 is the actual data. other 4 bits are useless
djnz R2,nonib2 ;See what nibble # we are at. 1st or 2nd?
orl A,R6 ;we're at 2nd so merge previously saved data in
mov @R0,A ;and put it in memory space
inc R0 ;and increment pointer
mov R2,#2h ;and reset nibble number
nonib2:
swap A ;exchange nibbles to prevent overwriting nibble later
mov R6,A ;save to R6 high nibble
mov A,R7 ;restore accumulator
pop PSW ;restore carry and register location
reti ;return to wherever
main:
mov PSW,#18h ;use new address for R0 through R7 to not clash with other routines
mov R1,#BUFCMD ;setup start of buffer space as R1
mov R2,#2h ;set # nibbles needed to process byte
mov PSW,#0h
mov IE,#84h ;enable external interrupt 1
..rest of code here...
We have to assume that this can be triggered by hardware at any point, even during a time-sensitive LCD character processing routine in which all registers and accumulator are used.
What optimizations can I perform to this code here to make it run much faster?
There is no need to do the nibble processing in the interrupt. Just store the 4 bits as they come in.
Assuming you can allocate R0 globally, the code can be as simple as:
org 13h
mov @r0, p1
inc r0
reti
Won't get much faster than that.
If you absolutely can not reserve R0, but you can at least arrange to use register banks differing in a single bit, e.g. #0 and #1, then you can use bit set/clear to switch away and back in 2 cycles, instead of 5 for the push psw approach.
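If the interrupt only stores raw port samples, the nibbles can be merged later, outside the time-critical path. The C sketch below shows one way that deferred step might look. It assumes the high nibble arrives first (as in the question's handler) and that the upper four port bits carry no data; the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Combine raw port samples (one nibble per sample, high nibble first)
   into bytes. Returns the number of bytes written to out. */
size_t pack_nibbles(const uint8_t *samples, size_t n_samples, uint8_t *out)
{
    size_t n_bytes = n_samples / 2;   /* a trailing odd sample is ignored */
    for (size_t i = 0; i < n_bytes; i++) {
        uint8_t hi = samples[2 * i] & 0x0F;      /* upper 4 port bits are not data */
        uint8_t lo = samples[2 * i + 1] & 0x0F;
        out[i] = (uint8_t)((hi << 4) | lo);
    }
    return n_bytes;
}
```

The trade-off is that the buffer needs twice as many entries as there are bytes to receive.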
My code targets 386+ CPUs (DOSBox usually, occasionally Pentium MMX), but I only use the 8086 feature set for compatibility. My code is written for a non-multitasking environment (MS-DOS or DOSBox).
In nested loops I often find myself re-using CX for the deeper loop counters. I PUSH it at the top of the nested loop, and POP it before LOOP is executed.
Sometimes conditions other than CX reaching 0 terminate these inner loops. Then I am left with the unnecessary loop counter, and sometimes more variables, sitting on the stack that I need to clean up.
Is it faster to just add a constant to SP, or POP these unneeded values?
I know that the fastest way of doing it would be to store CX in a spare register at the top of the loop, then restore it before LOOP executes, foregoing the stack completely, but I often don't have a spare register.
Here's a piece of code where I add a constant to SP to avoid a couple POP instructions:
push ax
push cx
push es
push si
push di
lea si, fileName ;si -> file name strings
mov ax, 11d ;ax -> file name length in bytes/characters
mul bp ;ax -> offset to file name string
add si, ax ;ds:si -> desired file name as source input
;for CMPS
mov di, ds
mov es, di
lea di, secBuff ;es:di -> first entry in ds:secBuff as
;destination input for CMPS
mov cx, 16d ;outer loop cntr -> num entries in a sector
ENTRY_SEARCH:
push cx ;store outer loop cntr
push si ;store start of the file name
push di ;store start of the entry
mov cx, 11d ;inner loop cntr -> length of file name
repe cmpsb ;Do the strings match?
jne NOT_ENTRY ;If not, test next entry.
pop di ;di -> start of the entry
mov bx, WORD PTR [di+26] ;bx -> entry's cluster number
add sp, 4 ;discard unneeded stack elements
pop di
pop si
pop es
pop cx
pop ax
ret ;match found, bx = cluster number
NOT_ENTRY:
pop di ;di -> start of the entry
add di, 32d ;di -> start of next entry
pop si ;si -> start of file name
pop cx ;restore the outer loop cntr
loop ENTRY_SEARCH ;loop till we've either found a match, or
;have tested every entry in the sector
;without finding a match.
xor bx, bx ;if we're here no match was found.
;return 0.
pop di
pop si
pop es
pop cx
pop ax
ret
If you want to write efficient code, pop vs. add is a very minor issue compared to reducing the amount of saving/restoring you need to do, and optimizing everything else (see below).
If it would take more than 1 pop, always use add sp, imm. Or sub sp, -128 to save code-size by still using an imm8. Or some CPUs may prefer lea instead of add/sub. (e.g. gcc uses LEA whenever possible with -mtune=atom). Of course, this would require an address-size prefix in 16-bit mode because [sp+2] isn't a valid addressing mode.
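The `sub sp, -128` trick works because an 8-bit immediate is sign-extended, so it covers -128 to +127: adding 128 needs a wider immediate, while subtracting -128 does not. A small C sketch of that range check and of the equivalence in 16-bit arithmetic follows; the function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the value fits in a sign-extended 8-bit immediate. */
bool fits_imm8(int32_t value)
{
    return value >= -128 && value <= 127;
}

/* Model of "add sp, n" on a 16-bit stack pointer. */
uint16_t sp_add(uint16_t sp, int16_t n)
{
    return (uint16_t)(sp + n);
}

/* Model of "sub sp, n" on a 16-bit stack pointer. */
uint16_t sp_sub(uint16_t sp, int16_t n)
{
    return (uint16_t)(sp - n);
}
```

Here `sp_add(sp, 128)` and `sp_sub(sp, -128)` give the same result, but only -128 passes `fits_imm8`.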
Beyond that, there's no single answer that applies to both an actual 386 and a modern x86 like Haswell or Skylake! There have been a lot of microarchitectural changes between those CPUs. Modern CPUs decode x86 instructions to internal RISC-like uops. For a while, using simple x86 instructions was important, but now modern CPUs can represent a lot of work in a single instruction, so more complex x86 instructions (like push, or add with a memory source operand) are single uop instructions.
Modern CPUs (since Pentium-M) have a stack engine that removes the need for a separate uop to actually update RSP/ESP/SP in the out-of-order core. Intel's implementation requires a stack-sync uop when you read/write RSP with a non-stack instruction (anything other than push/pop / call/ret), which is why pop can be useful, especially if you're doing it after a push or call.
clang uses push/pop to align the stack in x86-64 code when a single 8-byte offset is needed; see "Why does this function push RAX to the stack as the first operation?".
But if you care about performance, loop is slow and should be avoided in the first place, let alone push/pop of the loop counter! Use different regs for inner/outer loops.
Basically, you've gone pretty far down the wrong path as far as optimizing, so the real answer is just to point you at http://agner.org/optimize/, and other performance links in the x86 tag wiki. 16-bit code makes it hard to get good performance because of all the partial-register false dependencies on modern CPUs, but with some impact on code size you can break those by using 32-bit operand size when necessary. (e.g. for xor ebx,ebx)
Of course, if you're optimizing for DOSBox, it's an emulator rather than a real CPU, so loop may be fast! IDK if anyone has profiled or written optimization guides for DOSBox's CPU emulator. But I'd recommend learning what's fast on real modern hardware; that's more interesting.
I'm trying to understand what kinds of optimizations gcc performs when the -O3 flag is set. I'm quite confused about what these two lines do:
xor %esi, %esi
lea 0x0(%esi), %esi
The lea seems redundant to me. What's the point of using the lea instruction here?
That instruction is used to fill space for alignment purposes. Loops can be faster when they start on aligned addresses, because the processor loads memory into the decoder in chunks. By aligning the beginnings of loops and functions, it becomes more likely that they will be at the beginning of one of these chunks. This prevents previous instructions which will not be used from being loaded, maximizes the number of future instructions that will, and, possibly most importantly, ensures that the first instruction is entirely in the first chunk, so it does not take two loads to execute it.
The compiler knows that it is best to align the loop, and has two options to do so. It can either place a jump to the beginning of the loop, or fill the gap with no-ops and let the processor flow through them. Jump instructions break the flow of instructions and often cause wasted cycles on modern processors, so adding them unnecessarily is inadvisable. For a short distance like this no-ops are better.
The x86 architecture contains an instruction specifically for the purpose of doing nothing, nop. However, this is one byte long, so it would take more than one to align the loop. Decoding each one and deciding it does nothing takes time, so it is faster to simply insert another longer instruction that has no side effects. Therefore, the compiler inserted the lea instruction you see. It has absolutely no effects, and is chosen by the compiler to have the exact length required. In fact, recent processors have standard multi-byte no-op instructions, so this will likely be recognized during decode and never even executed.
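How many filler bytes the compiler needs follows directly from the address of the next instruction and the desired alignment. The C sketch below shows that calculation. It assumes the alignment is a power of two, and the function name is invented for the example.

```c
#include <stdint.h>

/* Number of padding bytes needed so that addr becomes a multiple of
   align. align is assumed to be a power of two (e.g. 16). */
uint32_t padding_for_alignment(uint32_t addr, uint32_t align)
{
    uint32_t mask = align - 1u;
    return (align - (addr & mask)) & mask;
}
```

For instance, code ending at address 100Ah needs 6 bytes of padding to reach the next 16-byte boundary, which a single 6-byte no-op-like instruction can fill.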
As explained by ughoavgfhw - these are paddings for better code alignment.
You can find this lea in the following link -
1-byte: XCHG EAX, EAX
2-byte: 66 NOP
3-byte: LEA REG, 0 (REG) (8-bit displacement)
4-byte: NOP DWORD PTR [EAX + 0] (8-bit displacement)
5-byte: NOP DWORD PTR [EAX + EAX*1 + 0] (8-bit displacement)
**6-byte: LEA REG, 0 (REG) (32-bit displacement)**
7-byte: NOP DWORD PTR [EAX + 0] (32-bit displacement)
8-byte: NOP DWORD PTR [EAX + EAX*1 + 0] (32-bit displacement)
9-byte: NOP WORD PTR [EAX + EAX*1 + 0] (32-bit displacement)
Also note this SO question describing it in more details -
What does NOPL do in x86 system?
Note that the xor itself is not a nop (it changes the value of the reg), but it is also very cheap to perform since it's a zero idiom - What is the purpose of XORing a register with itself?
So I am being taught assembly and we have an assignment which is to find the time difference between reading from memory and reading from cache. We have to do this by creating 2 loops and timing them. (one reads from main memory and the other from cache). The thing is, I don't know and can't find anything that tells me how to read from either cache or main memory =/. Could you guys help me? I'm doing this in MASM32. I understand how to make loops and well most of the assembly language but I just can't make it read =/
I have a question, I've done this ...
mov ecx, 100 ;loop 100 times
xor eax, eax ;set eax to 0
_label:
mov eax, eax ;according to me this is read memory is that good?
dec ecx ;dec loop
jnz _label ;if still not equal to 0 goes again to _label
... would that be ok?
Edit 2:
Well then, I don't intend to pry, and I appreciate your help. I just have another question, since these are two loops I have to write: I need to compare them somehow. I've been looking for a timer instruction but haven't found one. I've only found timeGetTime, GetTickCount and Performance Counter, but as far as I understand these return the system time, not the time it takes for the loop to finish. Is there a way to actually do what I want, or do I need to think of another way?
Also, to read from different registers in the second loop (the one not reading from cache) is it ok if I give various "mov" instructions? or am I totally off base here?
Sorry for all these questions, but again, thank you for your help.
To read from cache, have a loop which reads from the same (or very similar) memory address:
The first time you read from that address, the values from that memory address (and from other, nearby memory address) will be moved into cache
The next time you read from that same address, the values are already cached and so you're reading from cache.
To read uncached memory, have a loop which reads from many, very different (i.e. further apart than the cache size) memory addresses.
To answer your second question:
The things you're doing with ecx and jnz look OK (I don't know how accurate/sensitive your timer is, but you might want to loop more than 100 times)
The mov eax, eax is not "read memory" ... it's a no-op, which moves eax into eax. Instead, I think that the MASM syntax for reading from memory is something more like mov eax,[esi] ("read from the memory location whose address is contained in esi")
Depending on what O/S you're using, you must read from a memory address which actually exists and is readable. On Windows, for example, an application wouldn't be allowed to do mov esi, 0 followed by mov eax, [esi] because an application isn't allowed to read the memory whose address/location is zero.
To answer your third question:
timeGetTime, GetTickCount and Performance Counter
Your mentioning timeGetTime, GetTickCount and Performance Counter implies that you're running under Windows.
Yes, these return the current time, to various resolutions/accuracies: for example, GetTickCount is limited to the resolution of the system timer, typically around 10 to 16 msec, so it cannot time events shorter than that and is inaccurate when timing events that last only a few timer periods. That's why I said that 100 in your ecx probably isn't big enough.
The QueryPerformanceCounter function is probably the most accurate timer that you have.
To use any of these timers as an interval timer:
Get the time, before you start to loop
Get the time again, after you finish looping
Subtract these two times: the difference is the time interval
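The three steps above can be sketched in C as follows. This is only an illustration: it uses the standard C `clock()` function (which reports processor time) rather than any Windows timer, and the function names are invented for the example.

```c
#include <stddef.h>
#include <time.h>

/* Step 3: the interval is simply end minus start, scaled to seconds. */
double elapsed_seconds(clock_t start, clock_t end)
{
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Time `passes` sweeps over a buffer of `len` bytes. */
double time_reads(const unsigned char *buf, size_t len, unsigned passes)
{
    volatile unsigned char sink = 0;   /* keeps the reads from being optimized away */
    clock_t start = clock();           /* step 1: time before the loop */
    for (unsigned p = 0; p < passes; p++)
        for (size_t i = 0; i < len; i++)
            sink = buf[i];
    clock_t end = clock();             /* step 2: time after the loop */
    (void)sink;
    return elapsed_seconds(start, end);
}
```

Because the timer is coarse, the loop should run long enough that the measured interval is many timer periods.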
is it ok if I give various "mov" instructions?
Yes, I think so. I think you can do it like this (beware, I'm not sure/don't remember whether this is the right MASM syntax for reading from a named memory location) ...
mov eax,[memory1]
mov eax,[memory2]
mov eax,[memory3]
mov eax,[memory4]
mov eax,[memory5]
... where memory1 through memory5 are addresses of widely-spaced global variables in your data segment.
Or, you could do ...
mov eax,[esi]
add esi,edx
mov eax,[esi]
add esi,edx
mov eax,[esi]
add esi,edx
mov eax,[esi]
add esi,edx
mov eax,[esi]
... where esi is pointing to the bottom of a long chunk of memory, and edx is some increment that's equal to about a fifth of the length of the chunk.
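The same strided access pattern can be written in C, which may help when checking the idea before coding it in MASM. This is a sketch under the assumption that the buffer is much larger than the cache and the stride is no larger than the buffer; the function name is hypothetical, and the sum is returned only so the reads have a visible result.

```c
#include <stddef.h>
#include <stdint.h>

/* Read `count` bytes from a buffer of `len` bytes, advancing by
   `stride` bytes each time and wrapping at the end, mirroring the
   mov eax,[esi] / add esi,edx pattern. Returns the sum of the bytes. */
uint32_t strided_reads(const uint8_t *base, size_t len, size_t stride, size_t count)
{
    const volatile uint8_t *p = base;   /* volatile: force every read */
    uint32_t sum = 0;
    size_t offset = 0;
    for (size_t i = 0; i < count; i++) {
        sum += p[offset];
        offset = (offset + stride) % len;
    }
    return sum;
}
```

Using a stride of 1 on a small buffer gives the cached case for comparison, while a large stride over a large buffer approximates the uncached case.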