Here's a function:
void func(char *ptr)
{
*ptr = 42;
}
Here's an output (cut) of gcc -S function.c:
func:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movq %rdi, -8(%rbp)
movq -8(%rbp), %rax
movb $42, (%rax)
nop
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
I can use that function as:
func(malloc(1));
or as:
char local_var;
func(&local_var);
The question is how does processor determine which segment register should be used to transform the effective address to virtual one in this instruction (it may be DS as well as SS)
movb $42, (%rax)
I have a x86_64 proc.
The default segment is DS; that’s what the processor uses in your example.
In 64-bit mode, it doesn’t matter what segment is used, because the segment base is always 0 and permissions are ignored. (There are one or two minor differences, which I won’t go into here.)
In 32-bit mode, most OSes set the base of all segments to 0, and sets their permissions the same, so again it doesn’t matter.
In code where it does matter (especially 16 bit code that needs to use more than 64 KB of memory), the code must use far pointers, which include the segment selector as part of the pointer value. The software must load the selector into a segment register in order to perform the memory access.
Related
Consider this C code:
int f(void) {
int ret;
char carry;
__asm__(
"nop # do something that sets eax and CF"
: "=a"(ret), "=#ccc"(carry)
);
return carry ? -ret : ret;
}
When I compile it with gcc -O3, I get this:
f:
nop # do something that sets eax and CF
setc %cl
movl %eax, %edx
negl %edx
testb %cl, %cl
cmovne %edx, %eax
ret
If I change char carry to int carry, I instead get this:
f:
nop # do something that sets eax and CF
setc %cl
movl %eax, %edx
movzbl %cl, %ecx
negl %edx
testl %ecx, %ecx
cmovne %edx, %eax
ret
That change replaced testb %cl, %cl with movzbl %cl, %ecx and testl %ecx, %ecx. The program is actually equivalent, though, and GCC knows it. As evidence of this, if I compile with -Os instead of -O3, then both char carry and int carry result in the exact same assembly:
f:
nop # do something that sets eax and CF
jnc .L1
negl %eax
.L1:
ret
It seems like one of two things must be true, but I'm not sure which:
A testb is faster than a movzbl followed by a testl, so GCC's use of the latter with int is a missed optimization.
A testb is slower than a movzbl followed by a testl, so GCC's use of the former with char is a missed optimization.
My gut tells me that an extra instruction will be slower, but I also have a nagging doubt that it's preventing a partial register stall that I just don't see.
By the way, the usual recommended approach of xoring the register to zero before the setc doesn't work in my real example. You can't do it after the inline assembly runs, since xor will overwrite the carry flag, and you can't do it before the inline assembly runs, since in the real context of this code, every general-purpose call-clobbered register is already in use somehow.
There's no downside I'm aware of to reading a byte register with test vs. movzb.
If you are going to zero-extend, it's also a missed optimization not to xor-zero a reg ahead of the asm statement, and setc into that so the cost of zero-extension is off the critical path. (On CPUs other than Intel IvyBridge+ where movzx r32, r8 is not zero latency). Assuming there's a free register, of course. Recent GCC does sometimes find this zero/set-flags/setcc optimization for generating a 32-bit boolean from a flag-setting instruction, but often misses it when things get complex.
Fortunately for you, your real use-case couldn't do that optimization anyway (except with mov $0, %eax zeroing, which would be off the critical path for latency but cause a partial-register stall on Intel P6 family, and cost more code size.) But it's still a missed optimization for your test case.
This question is similar to another question I posted here. I am attempting to write the Assembly version of the following in c/c++:
int x[10];
for (int i = 0; i < 10; i++){
x[i] = i;
}
Essentially, creating an array storing the values 1 through 9.
My current logic is to create a label that loops up to 10 (calling itself until reaching the end value). In the label, I have placed the instructions to update the array at the current index of iteration. However, after compiling with gcc filename.s and running with ./a.out, the error Segmentation fault: 11 is printed to the console. My code is below:
.data
x:.fill 10, 4
index:.int 0
end:.int 10
.text
.globl _main
_main:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
jmp outer_loop
leave
ret
outer_loop:
movl index(%rip), %eax;
cmpl end(%rip), %eax
jge end_loop
lea x(%rip), %rdi;
mov index(%rip), %rsi;
movl index(%rip), %eax;
movl %eax, (%rdi, %rsi, 4)
incl index(%rip)
jmp outer_loop
leave
ret
end_loop:
leave
ret
Oddly the code below
lea x(%rip), %rdi;
mov index(%rip), %rsi;
movl index(%rip), %eax;
movl %eax, (%rdi, %rsi, 4)
works only if it is not in a label that is called repetitively. Does anyone know how I can implement the code above in a loop, without Segmentation fault: 11 being raised? I am using x86 Assembly on MacOS with GNU GAS syntax compiled with gcc.
Please note that this question is not a duplicate of this question as different Assembly syntax is being used and the scope of the problem is different.
You're using a 64-bit instruction to access a 32-bit area of memory :
mov index(%rip), %rsi;
This results in %rsi being assigned the contents of memory starting from index and ending at end (I'm assuming no alignment, though I don't remember GAS's rules regarding it). Thus, %rsi effectively is assigned the value 0xa00000000 (assuming first iteration of the loop), and executing the following movl %eax, (%rdi, %rsi, 4) results in the CPU trying to access the address that's not mapped by your process.
The solution is to remove the assignment, and replace the line after it with movl index(%rip), %esi. 32-bit operations are guaranteed to always clear out the upper bits of 64-bit registers, so you can then safely use %rsi in the address calculation, as it's going to contain the current index and nothing more.
Your debugger would've told you this, so please do use it next time.
Is it "legal" to have a gcc inline asm statement without the actual instruction?
For example, is the asm statement "legal"? Will it introduce undefined behaviour?
int main(){
int *p = something;
asm("":"=m"(p));
return 0;
}
int main(){
int *p = 0;
asm("":"=m"(p));
return 0;
}
Compiles without any errors, but it is unnecessary:
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movq $0, -8(%rbp)
movl $0, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
GCC completely ignores that empty asm statement.
In that specific example, the compiler can see that none of the outputs of the asm (ie p) are ever used. Since the asm is not volatile (and has at least 1 output), the compiler is free to completely discard the statement during optimization.
It may also be worth mentioning that on i386, all extended asm statements (ie ones with parameters) always implicitly clobber fpsr (floating point flags) and eflags (think: cc clobber). If the statement isn't completely discarded (for example if it is volatile), this might have an effect. If it does, it will at worst be a tiny loss of efficiency, not incorrect results.
So to sum up:
Yes it is legal.
No it doesn't introduce undefined behavior EXCEPT that the value of p could be undefined since you are saying that you are overwriting the contents, but you aren't actually putting anything in it.
It can, theoretically, introduce tiny inefficiencies, but it probably won't.
I have looked at a few dumps of assembler code and there is this section (found here and here) in the main function:
<main+0>: push %ebp
<main+1>: mov %esp, %ebp
<main+3>: sub $0x8, %esp
<main+6>: and $0xfffffff0, %esp
<main+9>: mov $0x0, %eax
<main+14>: add $0xf, %eax
<main+17>: add $0xf, %eax
<main+20>: shr $0x4, %eax
<main+23>: shl $0x4, %eax
<main+26>: sub %eax, %esp
Can you explain me what (main+9) to (main+26) is used for?
Why is this done so 'inefficient'?
So you want a full walk-through without doing any research yourself? Sounds legit.
main+9: mov $0x0, %eax
Loads the register eax with hex 0 (=dec 0).
main+14: add $0xf, %eax
Adds hex F (= dec 15) to the zero in eax.
main+17: add $0xf, %eax
Adds hex F (= dec 15) to eax again. These three instructions could have also been done by
movl $0x1e, %eax
but who's counting instructions... Anyway, at this point eax contains hex 1E which is dec 30.
main+20: shr $0x4, %eax
Shifts the contents of eax to the right by four bits.
main+23: shl $0x4, %eax
Shifts eax right back. Why? Because this clears the lowest four bits. Now eax contains hex 10 (= dec 16)
main+26: sub %eax, %esp
Substracts eax from esp (the stack pointer). Since
main+6: and $0xfffffff0, %esp
cleared the lower four bits in esp previously, the new esp will be sixteen byte aligned, as per ABI. Why not simply use esp after main+6? Because on x86, the stack grows downwards from the top of memory. Simply masking off the lower bits of esp risks clobbering local variables. Hence the subtraction to grow the stack down to the sixteen byte boundary.
I'm a total noob at assembly, just poking around a bit to see what's going on. Anyway, I wrote a very simple function:
void multA(double *x,long size)
{
long i;
for(i=0; i<size; ++i){
x[i] = 2.4*x[i];
}
}
I compiled it with:
gcc -S -m64 -O2 fun.c
And I get this:
.file "fun.c"
.text
.p2align 4,,15
.globl multA
.type multA, #function
multA:
.LFB34:
.cfi_startproc
testq %rsi, %rsi
jle .L1
movsd .LC0(%rip), %xmm1
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L3:
movsd (%rdi,%rax,8), %xmm0
mulsd %xmm1, %xmm0
movsd %xmm0, (%rdi,%rax,8)
addq $1, %rax
cmpq %rsi, %rax
jne .L3
.L1:
rep
ret
.cfi_endproc
.LFE34:
.size multA, .-multA
.section .rodata.cst8,"aM",#progbits,8
.align 8
.LC0:
.long 858993459
.long 1073951539
.ident "GCC: (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3"
.section .note.GNU-stack,"",#progbits
The assembly output makes sense to me (mostly) except for the line xorl %eax, %eax. From googling, I gather that the purpose of this is simply to set %eax to zero, which in this case corresponds to my iterator long i;.
However, unless I am mistaken, %eax is a 32-bit register. So it seems to me that this should actually be xorq %rax, %rax, particularly since this is holding a 64-bit long int. Moreover, further down in the code, it actually uses the 64-bit register %rax to do the iterating, which never gets initialized outside of xorl %eax %eax, which would seem to only zero out the lower 32 bits of the register.
Am I missing something?
Also, out of curiosity, why are there two .long constants there at the bottom? The first one, 858993459 is equal to the double floating-point representation of 2.4 but I can't figure out what the second number is or why it is there.
I gather that the purpose of this is simply to set %eax to zero
Yes.
which in this case corresponds to my iterator long i;.
No. Your i is uninitialized in the declaration. Strictly speaking, that operation corresponds to the i = 0 expression in the for loop.
However, unless I am mistaken, %eax is a 32-bit register. So it seems to me that this should actually be xorq %rax, %rax, particularly since this is holding a 64-bit long int.
But clearing the lower double word of the register clears the entire register. This is not intuitive, but it's implicit.
Just to answer the second part: .long means 32 bit, and the two integral constants side-by-side form the IEEE-754 representation of the double 2.4:
Dec: 1073951539 858993459
Hex: 0x40033333 0x33333333
400 3333333333333
S+E Mantissa
The exponent is offset by 1023, so the actual exponent is 0x400 − 1023 = 1. The leading "one" in the mantissa is implied, so it's 21 × 0b1.001100110011... (You recognize this periodic expansion as 3/15, i.e. 0.2. Sure enough, 2 × 1.2 = 2.4.)