How to reset a general purpose performance counter on Intel CPUs

I know we can use the wrmsr and rdmsr instructions to program a performance counter and to read the general purpose performance counter registers.
However, my question is:
Do we need to reset the general purpose performance counter register before we issue the wrmsr?
In other words, do we need to reset the performance counter before running the following code? If we do, how can we reset it?
mov $0x0001010E, %eax # Write selector value to EAX
xor %edx, %edx # Zero EDX
mov $0x187, %ecx # Write logical register id to ECX (IA32_PERFEVTSEL1)
wrmsr
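For what it's worth, if you want the count to start from zero you can clear the counter itself with another wrmsr, before or after programming the event select. A minimal sketch, assuming the counter paired with IA32_PERFEVTSEL1 is IA32_PMC1 at MSR address 0xC2 and that the code runs at ring 0 (wrmsr is privileged):
xor %eax, %eax # low half of the new counter value = 0
xor %edx, %edx # high half = 0
mov $0xC2, %ecx # IA32_PMC1, the counter controlled by IA32_PERFEVTSEL1
wrmsr # the counter is now zero and counts up from there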

Related

How can I know which registers WinAPI functions use for arguments? [duplicate]

I'm writing a function in x86 assembly that should be callable from C code, and I'm wondering which registers I have to restore before I return to the caller.
Currently I'm only restoring esp and ebp, while the return value is in eax.
Are there any other registers I should be concerned about, or could I leave whatever pleases me in them?
Using Microsoft's 32 bit ABI (cdecl or stdcall or other calling conventions), EAX, EDX and ECX are scratch registers (call clobbered). The other general-purpose integer registers are call-preserved.
The condition codes in EFLAGS are call-clobbered. DF=0 is required on call/return so you can use rep movsb without a cld first. The x87 stack must be empty on call, or on return from a function that doesn't return an FP value. (FP return values go in st0, with the x87 stack empty other than that.) XMM6 and 7 are call-preserved, the rest are call-clobbered scratch registers.
Outside of Windows, most 32-bit calling conventions (including i386 System V on Linux) agree with this choice of EAX, EDX and ECX as call-clobbered, but all the xmm registers are call-clobbered.
For x64 under Windows, you only need to restore RBX, RBP, RDI, RSI, R12, R13, R14, and R15. XMM6..15 are call-preserved. (And you have to reserve 32 bytes of shadow space for use by the callee, whether or not there are any args that don't fit in registers.)
See https://en.wikipedia.org/wiki/X86_calling_conventions#Microsoft_x64_calling_convention for more details.
Other OSes use the x86-64 System V ABI (see figure 3.4), where the call-preserved integer registers are RBP, RBX, RSP, R12, R13, R14, and R15. All the XMM/YMM/ZMM registers are call-clobbered.
EFLAGS and the x87 stack are the same as in 32-bit conventions: DF=0, condition flags are clobbered, and x87 stack is empty. (x86-64 conventions return FP values in XMM0, so the x87 stack registers always need to be empty on call/return.)
For links to official calling convention docs, see https://stackoverflow.com/tags/x86/info
In summary, the registers you must preserve (callee-saved) are:
32-bit: EBX, ESI, EDI, EBP
64-bit Windows: RBX, RSI, RDI, RBP, R12-R15, XMM6-XMM15
64-bit Linux, BSD, Mac: RBX, RBP, R12-R15
For details see "Software optimization resources" by Agner Fog. Calling conventions are described in this pdf.
If you are unsure which registers you are allowed to clobber, the instructions below can save the day (32-bit code only; PUSHA/POPA are not encodable in 64-bit mode).
PUSHA/PUSHAD -- Push all General Registers
POPA/POPAD -- Pop all General Registers
These instructions push and pop the general purpose registers, including SI/ESI and DI/EDI, in a fixed order.
The order for PUSHA/PUSHAD instruction is as follows.
Opcode Instruction Clocks Description
60 PUSHA 18 Push AX, CX, DX, BX, original SP, BP, SI, and DI
60 PUSHAD 18 Push EAX, ECX, EDX, EBX, original ESP, EBP, ESI, and EDI
And the order for the POPA/POPAD instructions is as follows (the reverse order).
Opcode Instruction Clocks Description
61 POPA 24 Pop DI, SI, BP, SP, BX, DX, CX, and AX
61 POPAD 24 Pop EDI, ESI, EBP, ESP (***), EBX, EDX, ECX, and EAX
*** The ESP value is discarded instead of loaded into ESP.
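For example, a routine can bracket its body with them, as in this sketch (32-bit only; the routine name is made up). Note that POPAD restores EAX as well, so a return value left in EAX would be overwritten, and the saved block costs 32 bytes of stack:
.globl clobber_everything # hypothetical routine that trashes registers internally
clobber_everything:
pushal # PUSHAD: save EAX, ECX, EDX, EBX, original ESP, EBP, ESI, and EDI
movl $0, %ebx # now the call-preserved registers can be clobbered freely...
movl $0, %esi
movl $0, %edi
popal # POPAD: ...and restored all at once, in reverse order
ret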

Handling Register Integer Value Wrap Around In x86 Assembly GNU on MacOSX

I am working with a simple example in x86 GNU GAS on MacOSX whereby an integer value of 300 is moved into the eax register. As expected, only 300 mod 256 (the value 44) is actually stored in %eax, as echo $? reveals in the Mac terminal:
.globl _main
_main:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
movl $300, %eax;
leave
ret
However, I was under the impression that there is an overflow/wrap-around flag to denote that a wrap-around occurred, or a register storing the result of the integer division of 300 by 256 (which would be 1). I have been unable to find any information detailing this process (if it exists) for x86 with GNU tools. Does anyone know how the wrap-around value or an overflow flag can be accessed?
There are a couple of misconceptions in your question.
First, eax can hold values from 0 to 4294967295, so mov $300, %eax does in fact store 300 into eax.
Second, a mov instruction cannot overflow or wrap around; the size of the source and the size of the destination are the same. The overflow flag is used for arithmetic operations.
The reason echo $? prints 44 is that the operating system reports only the low byte of the process's exit status to the shell, and 300 mod 256 = 44.
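If you do want to observe wrap-around, it has to come from an arithmetic instruction rather than a mov. A small sketch (the register choices are arbitrary): CF records unsigned wrap-around, OF records signed overflow, and either flag can be tested with a conditional jump or a setcc instruction.
movl $0xFFFFFFFF, %eax # largest unsigned 32-bit value
addl $1, %eax # result wraps to 0: CF = 1 (unsigned carry), OF = 0
setc %bl # BL = 1 because the unsigned addition wrapped around
movl $0x7FFFFFFF, %eax # largest signed 32-bit value (2147483647)
addl $1, %eax # result is -2147483648: OF = 1 (signed overflow), CF = 0
seto %cl # CL = 1 because the signed addition overflowed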

Do complex addressing modes have extra overhead for loads from memory?

Is there a difference in performance between these mov load instructions? Do the more complex addressing modes have extra overhead (latency or throughput) compared to the simple ones?
# AT&T syntax                   # Intel syntax:
movq (%rsi), %rax               # mov rax, [rsi]
movq (%rdi, %rsi), %rax         # mov rax, [rdi + rsi]
movq (%rdi, %rsi, 4), %rax      # mov rax, [rdi + rsi*4]
Yes, there is an overhead for "complex addressing" on recent Intel CPUs. The cost is one additional cycle of latency (e.g., 5 cycles for a normal GP load using complex addressing versus 4 cycles with simple addressing).
Simple addressing is anything of the form [reg + offset] where the immediate offset is between 0 and 2047 inclusive.
Complex addressing is anything other than simple addressing.
In particular, any addressing mode with two registers, like your examples [rdi + rsi] or [rdi + rsi*4], is complex addressing and costs an extra cycle.
There is an exceptional case: if the index register¹ is zeroed via a zeroing idiom (like xor edi, edi, but not like mov edi, 0) you don't pay the complex addressing penalty.
¹ The index register is the one multiplied by 1, 2, 4 or 8, i.e., rsi in [rdi + rsi*4]. In the case where neither register has an explicit multiplier, as in [rdi + rsi], the multiplier is 1 and you'll have to check your assembler to see how to specify which register is the index and which is the base. nasm seems to use the second register as the index.
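One way to see the extra cycle directly is a pointer-chasing loop, where each load's address comes from the previous load, so the time per iteration is the load-use latency. A rough sketch, assuming the buffer has been initialized so that every load yields the address needed for the next one (and, for the second loop, that rsi holds a non-zero index so the zeroing-idiom exception above doesn't apply):
chase_simple:
mov (%rdi), %rdi # base + 0 offset is "simple" addressing: ~4 cycles per iteration
dec %ecx
jnz chase_simple

chase_complex:
mov (%rdi,%rsi), %rdi # base + index is "complex" addressing: ~5 cycles per iteration
dec %ecx
jnz chase_complex
The dec/jnz pair executes in parallel with the load chain, so it doesn't add to the measured time.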
It depends on which specific CPU, but mostly "no, there's no extra overhead". However...
Most CPUs have out-of-order cores, which means they perform instructions in whatever order is fastest and not in the order the instructions are given. For this to work, one instruction (e.g. movq (%rdi, %rsi, 4), %rax) can't happen until the things it depends on are finished (e.g. the values in rdi and rsi are known).
For example, these 2 instructions can occur in parallel (because the second instruction doesn't depend on the first):
movq (%rdi), %rdi
movq (%rsi), %rax
And these 2 instructions can't occur in parallel (the second instruction has to wait until the first instruction completes):
movq (%rdi), %rdi
movq (%rdi, %rsi), %rax
Also note that the bottleneck for a piece of code may not be execution. If the bottleneck is instruction fetch then larger instructions will be worse; if the bottleneck is instruction decode then more complex instructions can be worse; if the bottleneck is data cache bandwidth then anything that reads/writes to memory can be worse, etc.
Basically; you can't look at individual instructions in isolation and decide if they're better/worse. You have to look at entire sequences of multiple instructions so that you can know about any dependencies on previous instructions (and their latencies); and you have to know what the bottleneck is (e.g. from performance monitoring tools); and if you know all this then you can make an "educated guess" that's only really useful for a small number of CPUs (because different CPUs have different characteristics).

What is this assembly function prologue / epilogue code doing with rbp / rsp / leave?

I am just starting to learn assembly for the Mac, using the GCC compiler to assemble my code. Unfortunately, there are VERY limited resources for learning how to do this if you are a beginner. I finally managed to find some simple sample code that I could start to wrap my head around, and I got it to assemble and run correctly. Here is the code:
.text # start of code indicator.
.globl _main # make the main function visible to the outside.
_main: # actually label this spot as the start of our main function.
push %rbp # save the base pointer to the stack.
mov %rsp, %rbp # put the previous stack pointer into the base pointer.
subl $8, %esp # Balance the stack onto a 16-byte boundary.
movl $0, %eax # Stuff 0 into EAX, which is where result values go.
leave # leave cleans up base and stack pointers again.
ret
The comments explain some things in the code (I kind of understand what lines 2 - 5 do), but I don't understand what most of this means. I do understand the basics of what registers are and what each register here (rbp, rsp, esp and eax) is used for and how big they are, and I also understand (generally) what the stack is, but this is still going over my head. Can anyone tell me exactly what this is doing? Also, could anyone point me in the direction of a good tutorial for beginners?
Stack is a data structure that follows LIFO principle. Whereas stacks in everyday life (outside computers, I mean) grow upward, stacks in x86 and x86-64 processors grow downward. See Wikibooks article on x86 stack (but please take into account that the code examples are 32-bit x86 code in Intel syntax, and your code is 64-bit x86-64 code in AT&T syntax).
So, what your code does (my explanations here are with Intel syntax):
push %rbp
Pushes rbp onto the stack, effectively subtracting 8 from rsp (because rbp is 8 bytes) and then storing rbp to [ss:rsp].
So, in Intel syntax push rbp practically does this:
sub rsp, 8
mov [ss:rsp], rbp
Then:
mov %rsp, %rbp
This is obvious. Just store the value of rsp into rbp.
subl $8, %esp
Subtract 8 from esp and store it into esp. Actually this is a bug in your code, even if it causes no problems here. Any instruction with a 32-bit register (eax, ebx, ecx, edx, ebp, esp, esi or edi) as destination in x86-64 sets the topmost 32 bits of the corresponding 64-bit register (rax, rbx, rcx, rdx, rbp, rsp, rsi or rdi) to zero, causing the stack pointer to point somewhere below the 4 GiB limit, effectively doing this (in Intel syntax):
sub rsp,8
and rsp,0x00000000ffffffff
However, this causes no problems on a computer with less than 4 GiB of memory; on computers with more than 4 GiB of memory it may result in a segmentation fault. The leave further below in your code restores a sane value to rsp. Generally in x86-64 code you never need esp at all (excluding possibly some optimizations or tweaks). To fix this bug:
subq $8, %rsp
The instructions so far are the standard entry sequence (replace $8 according to the stack usage). Wikibooks has a useful article on x86 functions and stack frames (but note again that it uses 32-bit x86 assembly with Intel syntax, not 64-bit x86-64 assembly with AT&T syntax).
Then:
movl $0, %eax
This is obvious. Store 0 into eax. This has nothing to do with the stack.
leave
This is equivalent to mov rsp, rbp followed by pop rbp.
ret
And this, finally, sets rip to the value stored at [ss:rsp], effectively returning control to where this procedure was called from, and adds 8 to rsp.
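Putting the pieces together, here is the question's code with that one fix applied, so the whole corrected prologue/body/epilogue reads as one piece:
.text
.globl _main
_main:
pushq %rbp # save the caller's base pointer
movq %rsp, %rbp # establish this function's frame
subq $8, %rsp # reserve local space with a 64-bit subtraction (the fix above)
movl $0, %eax # result value 0 goes in EAX
leave # movq %rbp, %rsp followed by popq %rbp
ret # pop the return address from [ss:rsp] into rip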

Performance of modern processor

When executed on a modern processor (AMD Phenom II 1090T), how many clock ticks is the following code more likely to consume: 3 or 11?
label: mov (%rsi), %rax
adc %rax, (%rdx)
lea 8(%rdx), %rdx
lea 8(%rsi), %rsi
dec %ecx
jnz label
The problem is that when I execute many iterations of this code, the results vary between about 3 and about 11 ticks per iteration from time to time, and I can't decide "who is who".
UPD
According to this table of instruction latencies (PDF), my piece of code should take at least 10 clock cycles per iteration on the AMD K10 microarchitecture. Therefore, the seemingly impossible 3 ticks per iteration must be caused by bugs in measurement.
SOLVED
#Atom noticed that clock frequency isn't constant on modern processors. When I disabled three options in the BIOS - Core Performance Boost, AMD C1E Support and AMD K8 Cool&Quiet Control - the consumption of my "six instructions" stabilized at 3 clock ticks :-)
I won't try to answer with certainty how many cycles (3 or 10) it will take to run each iteration, but I'll explain how it might be possible to get 3 cycles per iteration.
(Note that this is for processors in general and I make no references specific to AMD processors.)
Key Concepts:
Out of Order Execution
Register Renaming
Most modern (non-embedded) processors today are both super-scalar and out-of-order. Not only can they execute multiple (independent) instructions in parallel, but they can also re-order instructions to break dependencies and such.
Let's break down your example:
label:
mov (%rsi), %rax
adc %rax, (%rdx)
lea 8(%rdx), %rdx
lea 8(%rsi), %rsi
dec %ecx
jnz label
The first thing to notice is that the last 3 instructions before the branch are all independent:
lea 8(%rdx), %rdx
lea 8(%rsi), %rsi
dec %ecx
So it's possible for a processor to execute all 3 of these in parallel.
Another thing is this:
adc %rax, (%rdx)
lea 8(%rdx), %rdx
There seems to be a dependency on rdx that prevents the two from running in parallel. But in reality, this is a false dependency because the second instruction doesn't actually depend on the output of the first instruction. Modern processors are able to rename the rdx register to allow these two instructions to be re-ordered or executed in parallel.
Same applies to the rsi register between:
mov (%rsi), %rax
lea 8(%rsi), %rsi
So in the end, 3 cycles is (potentially) achievable as follows (this is just one of several possible orderings):
Cycle 1:  mov (%rsi), %rax     lea 8(%rdx), %rdx     lea 8(%rsi), %rsi
Cycle 2:  adc %rax, (%rdx)     dec %ecx
Cycle 3:  jnz label
Of course, I'm over-simplifying here. In reality the latencies are probably longer and there's overlap between different iterations of the loop.
In any case, this could explain how it's possible to get 3 cycles. As for why you sometimes get 10 cycles, there could be a ton of reasons for that: branch misprediction, some random pipeline bubble...
Dr. David Levinthal's "Performance Analysis Guide", written at Intel, investigates the answers to such questions in great detail.
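If you want to reproduce the measurement, one rough approach is to read the time-stamp counter around a large number of iterations and divide. A sketch under stated assumptions: the iteration count and register assignments are mine (the destination pointer is moved to rdi because rdtsc overwrites rdx), rsi and rdi are assumed to already point at buffers of at least that many qwords, and - as noted in the SOLVED update - rdtsc counts reference ticks, which only match core clocks once boost and Cool&Quiet are disabled:
mov $10000, %ecx # iteration count (each buffer must hold at least this many qwords)
lfence # keep earlier work out of the timed region
rdtsc # EDX:EAX = start time stamp
shl $32, %rdx
or %rdx, %rax
mov %rax, %r8 # r8 = start TSC
clc # the adc chain below starts with a clean carry
timed_loop:
mov (%rsi), %rax
adc %rax, (%rdi)
lea 8(%rdi), %rdi
lea 8(%rsi), %rsi
dec %ecx # dec leaves CF alone, so the carry chain survives
jnz timed_loop
rdtscp # EDX:EAX = end time stamp, taken once the loop's instructions have executed (also clobbers ECX)
shl $32, %rdx
or %rdx, %rax
sub %r8, %rax # rax = elapsed ticks; divide by 10000 for ticks per iteration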
