How do I intentionally read from main memory vs cache?

So I am being taught assembly, and we have an assignment: find the time difference between reading from main memory and reading from cache. We have to do this by creating two loops and timing them (one reads from main memory and the other from cache). The thing is, I don't know, and can't find anything that tells me, how to deliberately read from either cache or main memory. Could you guys help me? I'm doing this in MASM32. I understand how to make loops, and most of the assembly language, but I just can't make it read from where I want.
Edit:
I have a question. I've done this ...
mov ecx, 100 ;loop 100 times
xor eax, eax ;set eax to 0
_label:
mov eax, eax ;according to me this reads memory; is that good?
dec ecx ;decrement the loop counter
jnz _label ;if ecx is not yet 0, jump back to _label
... would that be ok?
Edit 2:
Well then, I don't intend to pry, and I appreciate your help. I just have another question: since there are two loops I have to write, I need to compare them somehow. I've been looking for a timer instruction but haven't found one; I've found only timeGetTime, GetTickCount and the performance counter, but as far as I understand these return the system time, not the time it takes for the loop to finish. Is there a way to do what I want, or do I need to think of another approach?
Also, to read from different memory addresses in the second loop (the one not reading from cache), is it ok if I give various "mov" instructions? or am I totally off base here?
Sorry for all these questions, and again thank you for your help.

To read from cache, have a loop which reads from the same (or a very nearby) memory address:
The first time you read from that address, the values from that memory address (and from other, nearby memory addresses) will be moved into the cache.
The next time you read from that same address, the values are already cached, so you're reading from the cache.
To read uncached memory, have a loop which reads from many, very different (i.e. further apart than the cache size) memory addresses.
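For example, a minimal sketch of the cached case, assuming esi already points at valid, readable memory (the uncached variant would instead advance esi by a large stride, bigger than the cache, on every iteration):
mov ecx, 1000000 ;many iterations over the same address
cache_loop:
mov eax, [esi] ;after the first iteration, this read hits the cache
dec ecx
jnz cache_loop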
To answer your second question:
The things you're doing with ecx and jnz look OK (I don't know how accurate/sensitive your timer is, but you might want to loop more than 100 times)
The mov eax, eax is not "read memory" ... it's a no-op, which moves eax into eax. Instead, I think the MASM syntax for reading from memory is something more like mov eax,[esi] ("read from the memory location whose address is contained in esi").
Depending on what O/S you're using, you must read from a memory address which actually exists and is readable. On Windows, for example, an application wouldn't be allowed to do mov esi, 0 followed by mov eax, [esi] because an application isn't allowed to read the memory whose address/location is zero.
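For instance, a hedged sketch of reading from memory that certainly exists: a buffer declared in your own data segment (myBuffer is a hypothetical name):
.data
myBuffer dd 1000 dup (0) ;4000 bytes of valid, readable memory
.code
lea esi, myBuffer ;esi -> a valid, readable address
mov eax, [esi] ;a legal read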
To answer your third question:
timeGetTime, GetTickCount and Performance Counter
Your mentioning timeGetTime, GetTickCount and Performance Counter implies that you're running under Windows.
Yes, these return the current time, to various resolutions/accuracies: for example, GetTickCount has a resolution of about 50 msec, so it fails to time events which last less than 50 msec and is inaccurate when timing events which last only 50-100 msec. That's why I said that the 100 in your ecx probably isn't big enough.
The QueryPerformanceCounter function is probably the most accurate timer that you have.
To use any of these timers as an interval timer:
Get the time, before you start to loop
Get the time again, after you finish looping
Subtract these two times: the difference is the time interval
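Put together, a minimal MASM32 sketch of that pattern (hedged: startCount and endCount are hypothetical variable names; QueryPerformanceCounter fills a 64-bit LARGE_INTEGER through the pointer you pass it):
.data
startCount dq 0 ;tick count before the loop
endCount dq 0 ;tick count after the loop
.code
invoke QueryPerformanceCounter, ADDR startCount
;... the loop being timed goes here ...
invoke QueryPerformanceCounter, ADDR endCount
mov eax, DWORD PTR endCount ;low dword of the end count
sub eax, DWORD PTR startCount ;elapsed ticks (low dword is fine for short intervals)
Divide the elapsed ticks by the rate returned by QueryPerformanceFrequency to convert to seconds.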
is it ok if I give various "mov" instructions?
Yes, I think so. I think you can do it like this (beware: I'm not sure/don't remember whether this is the right MASM syntax for reading from a named memory location) ...
mov eax,[memory1]
mov eax,[memory2]
mov eax,[memory3]
mov eax,[memory4]
mov eax,[memory5]
... where memory1 through memory5 are addresses of widely-spaced global variables in your data segment.
Or, you could do ...
mov eax,[esi]
add esi,edx
mov eax,[esi]
add esi,edx
mov eax,[esi]
add esi,edx
mov eax,[esi]
add esi,edx
mov eax,[esi]
... where esi is pointing to the bottom of a long chunk of memory, and edx is some increment that's equal to about a fifth of the length of the chunk.

Related

How often do the contents of a CPU register change?

Does the data that CPU registers hold change often? The Wikipedia article describes the registers as "a quickly accessible location...of a small amount of fast storage". I'm assuming the memory is fast because a register is accessed and modified often?
Yes, data registers may change from one instruction to the next, which is quite often. Superscalar execution, out-of-order execution, pipelining, register renaming, etc. complicate the analysis, but even on a simple in-order CPU, a register can change as often as once per instruction. A plausible program may have a run of many instructions, all affecting the same register:
int polynom(int num) {
    return num * num + 2 * num + 1;
}
which compiles (unoptimized) as:
polynom(int):
  push rbp
  mov rbp, rsp
  mov DWORD PTR [rbp-4], edi
* mov eax, DWORD PTR [rbp-4]
* imul eax, eax
  mov edx, DWORD PTR [rbp-4]
  add edx, edx
* add eax, edx
* add eax, 1
  pop rbp
  ret
Note the many writes to the eax register, marked with an asterisk. In this little function, four almost-consecutive instructions write to this specific register, meaning that we can expect the program-visible state of eax¹ to change at a rate of over 1 GHz if this code were called in a tight loop.
On a more fundamental note, there are some architectural registers that almost always change on every instruction. The most evident of these is the program counter (called PC in many contexts, EIP on x86, RIP on x86_64). Because this register points to the currently executing instruction, it must certainly change with every instruction, barring counterexamples like x86 REP encodings or an instruction that simply jumps to itself.
¹ Again, barring architectural considerations like register renaming, which uses multiple physical registers to implement a single logical, program-visible register.
Since modern CPUs run at GHz clock rates, CPU registers can change what they are storing hundreds of millions or even billions of times per second.
And since a modern CPU has on the order of a hundred physical registers, each individual register would still typically change value millions of times per second when the CPU is performing many operations.

Cost of instruction jumps in assembly

I've always been curious about the cost of jumps in assembly.
cmp ecx, edx
je SOME_LOCATION # What's the cost of this jump?
Does it need to do a search in a lookup table for each jump, or how does it work?
No, a jump doesn't do a search. The assembler resolves the label to an address, which in most cases is then converted to an offset from the current instruction. The address or offset is encoded in the instruction. At run time, the processor loads the address into the IP register or adds the offset to the current value of the IP register (along with all the other effects discussed by @Brendan).
There is a type of jump instruction that can be used to get the destination from a table. The jump instruction reads the address from a memory location. (The instruction specifies a single location, so there still is no “search”.) This instruction could look something like this:
jmp table[eax*4]
where eax is the index of the entry in the table containing the address to jump to.
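A hedged sketch of such a jump table in MASM-style syntax (table and the case0/case1/case2 labels are hypothetical; it assumes eax holds a validated index in 0..2):
.data
table dd offset case0, offset case1, offset case2 ;the handlers' addresses
.code
jmp DWORD PTR table[eax*4] ;one memory read, then a jump; no search
case0: ;handle case 0
jmp done
case1: ;handle case 1
jmp done
case2: ;handle case 2
done: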
Originally (e.g. 8086) the cost of a jump wasn't much different to the cost of a mov.
Later CPUs added caches, which meant some jumps were faster (because the code they jump to is in the cache) and some jumps were slower (because the code they jump to isn't in the cache).
Even later CPUs added "out of order" execution, where conditional branches (e.g. je SOME_LOCATION) would have to wait until the flags from "previous instructions that happen to be executed in parallel" became known.
This means that a sequence like
mov esi, edi
cmp ecx, edx
je SOME_LOCATION
can be slower than rearranging it to
cmp ecx, edx
mov esi, edi
je SOME_LOCATION
to increase the chance that the flags are already known by the time the je executes.
Even later CPUs added speculative execution. In this case, for conditional branches the CPU just takes a guess at where it will branch to before it actually knows (e.g. before the flags are known), and if it guesses wrong it'll just pretend that it didn't execute the wrong instructions. More specifically, the speculatively executed instructions are tagged at the start of the pipeline and held at the end of the pipeline (at retirement) until the CPU knows if they can be committed to visible state or if they have to be discarded.
After that things just got more complicated, with fancier methods of doing branch prediction, additional "branch target" buffers, etc.
Far jumps that change the code segment are more expensive. In real mode it's not so bad, because the CPU mostly only does "CS.base = value * 16" when CS is changed. For protected mode it's a table lookup (to find the GDT or LDT entry), decoding the entry, deciding what to do based on what kind of entry it is, then a pile of protection checks. For long mode it's vaguely similar. All of this adds more uncertainty (e.g. will the table entry be in the cache?).
On top of all of this there are things like TLB misses. For example, call [indirectAddress] can cause a TLB miss at indirectAddress, then a TLB miss at the stack top (for the return address), then a TLB miss at the new instruction pointer; each TLB miss can cost a few hundred cycles.
Mostly, the cost of a jump can be anything from 0 cycles (for a correctly predicted jump) to maybe 1000 cycles, depending on which CPU it is, what kind of jump it is, what is in the caches, what branch prediction predicts, cache/TLB misses, how fast/slow RAM is, and anything I may have forgotten.

Fastest method to accept a byte from a nibble in hardware (8051)

I have a smaller 8051 microcontroller (AT89C4051) connected to a larger microcontroller (AT89S52), and the larger one is running the clock of the smaller one. The crystal speed for the large one is 22.1184 MHz. Documentation states that because the ALE line is controlling the smaller micro, its clock speed is limited to 3.6 MHz.
The two micros communicate with each other with only 4 I/O lines and one interrupt line. I'm trying to make reception of a byte occur as fast as possible, but the code I have come up with makes me think I didn't choose the best solution. Here is what I have so far:
org 0h
ljmp main ;run initialization + program
org 13h ;INT1 handler - worst-case scenario: 52 µs processing time, I think?
push PSW ;save old registers and the carry flag
mov PSW,#18h ;load our register space
mov R7,A ;save accumulator (PUSH takes an extra clock cycle I think)
mov A,P1 ;Grab data (wish I could grab it sooner somehow)
anl A,#0Fh ;Only lowest 4 bits on P1 is the actual data. other 4 bits are useless
djnz R2,nonib2 ;See what nibble # we are at. 1st or 2nd?
orl A,R6 ;we're at the 2nd, so merge in previously saved data
mov @R0,A ;and put it in memory space (indirect through R0)
inc R0 ;and increment pointer
mov R2,#2h ;and reset nibble number
nonib2:
swap A ;exchange nibbles to prevent overwriting nibble later
mov R6,A ;save to R6 high nibble
mov A,R7 ;restore accumulator
pop PSW ;restore carry and register location
reti ;return to wherever
main:
mov PSW,#18h ;use new address for R0 through R7 to not clash with other routines
mov R0,#BUFCMD ;set up start of buffer space in R0 (the ISR stores through R0)
mov R2,#2h ;set # nibbles needed to process byte
mov PSW,#0h
mov IE,#84h ;enable external interrupt 1
..rest of code here...
We have to assume that this can be triggered by hardware at any point, even during a time-sensitive LCD character processing routine in which all registers and accumulator are used.
What optimizations can I perform to this code here to make it run much faster?
There is no need to do the nibble processing in the interrupt. Just store the 4 bits as they come in.
Assuming you can allocate R0 globally, the code can be as simple as:
org 13h
mov @r0, p1
inc r0
reti
Won't get much faster than that.
If you absolutely cannot reserve R0, but you can at least arrange to use register banks differing in a single bit, e.g. #0 and #1, then you can use a bit set/clear to switch away and back in 2 cycles, instead of 5 for the push psw approach.
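A hedged sketch of that variant (it assumes banks #0 and #1, which differ only in the RS0 bit, PSW.3, and that bank 1's R0 is reserved as the buffer pointer):
org 13h
setb PSW.3 ;RS0 = 1: switch from bank 0 to bank 1
mov @r0, p1 ;store the raw nibble through bank 1's R0
inc r0 ;advance the buffer pointer
clr PSW.3 ;RS0 = 0: switch back to bank 0
reti
Note that mov and inc on the 8051 don't affect the carry flag, so nothing else needs saving.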

Is it faster to pop unneeded values from the stack, or add an immediate constant to SP on a 386+ CPU?

My code targets 386+ CPUs (DOSBox usually, occasionally a Pentium MMX), but I only use the 8086 feature set, for compatibility. My code is written for a non-multitasking environment (MS-DOS, or DOSBox).
In nested loops I often find myself re-using CX for the deeper loop counters. I PUSH it at the top of the nested loop, and POP it before LOOP is executed.
Sometimes conditions other than CX reaching 0 terminate these inner loops. Then I am left with the unnecessary loop counter, and sometimes more variables, sitting on the stack that I need to clean up.
Is it faster to just add a constant to SP, or POP these unneeded values?
I know that the fastest way of doing it would be to store CX in a spare register at the top of the loop, then restore it before LOOP executes, foregoing the stack completely, but I often don't have a spare register.
Here's a piece of code where I add a constant to SP to avoid a couple POP instructions:
FIND_ENTRY PROC
;SEARCHES A SINGLE SECTOR OF A DIRECTORY LOADED INTO secBuff FOR A
;SPECIFIED FILE/SUB DIRECTORY ENTRY
;IF FOUND, RETURNS THE FILE/SUB DIRECTORY'S CLUSTER NUMBER IN BX
;IF NOT FOUND, RETURNS 0 IN BX
;ALTERS BX
;EXPECTS A FILE NAME STRING INDEX NUMBER IN BP
;EXPECTS A SECTOR OF A DIRECTORY (ROOT, OR SUB) TO BE LOADED INTO secBuff
;EXPECTS DS TO BE LOADED WITH varData
push ax
push cx
push es
push si
push di
lea si, fileName ;si -> file name strings
mov ax, 11d ;ax -> file name length in bytes/characters
mul bp ;ax -> offset to file name string
add si, ax ;ds:si -> desired file name as source input
;for CMPS
mov di, ds
mov es, di
lea di, secBuff ;es:di -> first entry in ds:secBuff as
;destination input for CMPS
mov cx, 16d ;outer loop cntr -> num entries in a sector
ENTRY_SEARCH:
push cx ;store outer loop cntr
push si ;store start of the file name
push di ;store start of the entry
mov cx, 11d ;inner loop cntr -> length of file name
repe cmpsb ;Do the strings match?
jne NOT_ENTRY ;If not, test next entry.
pop di ;di -> start of the entry
mov bx, WORD PTR [di+26] ;bx -> entry's cluster number
add sp, 4 ;discard unneeded stack elements
pop di
pop si
pop es
pop cx
pop ax
ret
NOT_ENTRY:
pop di ;di -> start of the entry
add di, 32d ;di -> start of next entry
pop si ;si -> start of file name
pop cx ;restore the outer loop cntr
loop ENTRY_SEARCH ;loop till we've either found a match, or
;have tested every entry in the sector
;without finding a match.
xor bx, bx ;if we're here no match was found.
;return 0.
pop di
pop si
pop es
pop cx
pop ax
ret
FIND_ENTRY ENDP
If you want to write efficient code, pop vs. add is a very minor issue compared to reducing the amount of saving/restoring you need to do, and optimizing everything else (see below).
If it would take more than 1 pop, always use add sp, imm. Or sub sp, -128 to save code-size by still using an imm8. Or some CPUs may prefer lea instead of add/sub. (e.g. gcc uses LEA whenever possible with -mtune=atom). Of course, this would require an address-size prefix in 16-bit mode because [sp+2] isn't a valid addressing mode.
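For example, a sketch of the encoding-size point in 16-bit code (byte counts are for the classic encodings):
add sp, 4 ;83 /0 ib: 3 bytes (sign-extended imm8 form)
add sp, 128 ;128 doesn't fit in a signed imm8, so 81 /0 iw: 4 bytes
sub sp, -128 ;same effect as add sp, 128, but -128 fits: back to 3 bytes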
Beyond that, there's no single answer that applies to both an actual 386 and a modern x86 like Haswell or Skylake! There have been a lot of microarchitectural changes between those CPUs. Modern CPUs decode x86 instructions to internal RISC-like uops. For a while, using simple x86 instructions was important, but now modern CPUs can represent a lot of work in a single instruction, so more complex x86 instructions (like push, or add with a memory source operand) are single uop instructions.
Modern CPUs (since Pentium-M) have a stack engine that removes the need for a separate uop to actually update RSP/ESP/SP in the out-of-order core. Intel's implementation requires a stack-sync uop when you read/write RSP with a non-stack instruction (anything other than push/pop / call/ret), which is why pop can be useful, especially if you're doing it after a push or call.
clang uses push/pop to align the stack in x86-64 code when a single 8-byte offset is needed; see "Why does this function push RAX to the stack as the first operation?".
But if you care about performance, loop is slow and should be avoided in the first place, let alone push/pop of the loop counter! Use different regs for inner/outer loops.
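A minimal sketch of that structure, keeping the outer counter in dx so cx never needs to be pushed/popped, and using dec/jnz instead of loop (the counts are arbitrary):
mov dx, 16 ;outer loop counter lives in dx
OUTER_LOOP:
mov cx, 11 ;inner loop counter (also free for rep cmpsb)
INNER_LOOP:
;... inner-loop body ...
dec cx
jnz INNER_LOOP
dec dx
jnz OUTER_LOOP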
Basically, you've gone pretty far down the wrong path as far as optimizing, so the real answer is just to point you at http://agner.org/optimize/, and other performance links in the x86 tag wiki. 16-bit code makes it hard to get good performance because of all the partial-register false dependencies on modern CPUs, but with some impact on code size you can break those by using 32-bit operand size when necessary. (e.g. for xor ebx,ebx)
Of course, if you're optimizing for DOSBox, it's not a real CPU; it's an emulator, so loop may well be fast! IDK if anyone has profiled or written optimization guides for DOSBox's CPU emulator. But I'd recommend learning what's fast on real modern hardware; that's more interesting.

Why is a memory round-trip faster than not performing the round-trip?

I've got some simple 32-bit code which computes the product of an array of 32-bit integers. The inner loop looks like this:
@@loop:
mov esi,[ebx]
mov [esp],esi
imul eax,[esp]
add ebx, 4
dec edx
jnz @@loop
What I'm trying to understand is why the above code is 6% faster than these two versions of the code, which do not perform the redundant memory round-trip:
@@loop:
mov esi,[ebx]
imul eax,esi
add ebx, 4
dec edx
jnz @@loop
and
@@loop:
imul eax,[ebx]
add ebx, 4
dec edx
jnz @@loop
The latter two pieces of code execute in virtually the same time, and, as mentioned, both are 6% slower than the first piece (165 ms vs 155 ms, for 200 million elements).
I've tried manually aligning the jump target to a 16 byte boundary, but it makes no difference.
I'm running this on an Intel i7 4770k, Windows 10 x64.
Note: I know the code could be improved by doing all sorts of optimizations, however I'm only interested in the performance difference between the above pieces of code.
I suspect but can't be sure that you are preventing a stall on a data dependency:
The code looks like this:
@@loop:
mov esi,[ebx] # (1) load the memory location into the esi reg
(mov [esp],esi) # (1) optionally store the value on the stack
imul eax,[esp] # (3) perform the multiplication
add ebx, 4 # (1) advance the array pointer
dec edx # (1) decrement the counter
jnz @@loop # (0**) loop
Those numbers in brackets are the latencies of the instructions ... that jump is 0 if the branch predictor guesses correctly (and since this branch mostly loops, it will guess correctly most of the time).
So: while the multiplication is still going (3 cycles), we get back to the top of the loop after 2, and the next load would have to stall. Or we could do a store instead, which can happen at the same time as our multiplication, and then not stall at all.
What about the dummy store, you ask? Why does that work? Notice that you are storing the critical value, the one we are multiplying by, to memory. Thus the processor can use the value that is being stored to memory (store-to-load forwarding) and is free to clobber the register.
So why can't the processor do this anyway? The processor can't produce more memory accesses than you ask it to, or it could interfere with multi-processor programs (imagine that the cache line you are writing to is shared, and every loop iteration's write has to invalidate it on the other CPUs ... ouch!).
All of this is pure speculation, but it seems to match all the evidence (your code and my knowledge of the intel architecture ... and x86 assembly). Hopefully someone can point out if I have something wrong.
