Is it "legal" to have a gcc inline asm statement without the actual instruction?
For example, is the following asm statement legal? Will it introduce undefined behaviour?
int main(){
int *p = something;
asm("":"=m"(p));
return 0;
}
int main(){
int *p = 0;
asm("":"=m"(p));
return 0;
}
It compiles without any errors, but the statement is unnecessary:
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movq $0, -8(%rbp)
movl $0, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
GCC completely ignores that empty asm statement.
In that specific example, the compiler can see that none of the outputs of the asm (i.e. p) are ever used. Since the asm is not volatile (and has at least one output), the compiler is free to discard the statement entirely during optimization.
It may also be worth mentioning that on i386, all extended asm statements (i.e. ones with parameters) implicitly clobber fpsr (the floating-point flags) and eflags (think: cc clobber). If the statement isn't discarded entirely (for example, if it is volatile), this might have an effect. If it does, it will at worst be a tiny loss of efficiency, not incorrect results.
So to sum up:
Yes, it is legal.
No, it doesn't introduce undefined behavior, EXCEPT that the value of p may be indeterminate afterward: you are telling the compiler that you overwrite its contents, but you aren't actually putting anything in it.
It can, theoretically, introduce tiny inefficiencies, but it probably won't.
Related
Case 1:
int a;
std::cout << a << endl; // prints 0
Case 2:
int a;
std::cout << &a << " " << a << endl; // 0x7ffc057370f4 32764
Whenever I print the address of the variable, it is no longer initialized to the default value. Why is that?
I thought the value of a in case 2 would be junk, but every time I run the code it shows 32764,5,6,7. Are these still junk values?
Variables in C++ are not initialized to a default value, hence there's no way to determine the value. You can read more about it here.
I'm afraid the accepted answer does not touch the main point of the question:
why
int a;
std::cout << a << endl; // prints 0
always prints 0, as if a were initialized to its default value, whereas in
int a;
std::cout << &a << " " << a << endl; // 0x7ffc057370f4 32764
the compiler produces some junk value for a.
Yes, in both cases we have an example of undefined behavior, and ANY value of a is possible, so why is it always 0 in Case 1?
First of all, remember that a C/C++ compiler is free to transform the source code in an arbitrary way, as long as the observable behavior of the program remains the same. So, if you write
int a;
std::cout << a << endl; // prints 0
the compiler is free to assume that a need not be associated with any real RAM cell. You don't write to a, and you never take its address, so the compiler is free to keep a in one of its registers. In such a case a has no address and is functionally equivalent to something as weird as a "named, addressless temporary". In Case 2, however, you ask the compiler to print the address of a. It cannot ignore that request, so it generates code that allocates memory for a, even though the value of a may be junk.
The next factor is optimization. You can either switch it off completely (Debug compilation mode) or turn on aggressive optimization (Release mode). So you can expect your simple code to behave differently depending on whether you compile it as Debug or Release. Moreover, since this is undefined behavior, your code may run differently when compiled with different compilers, or even with different versions of the same compiler.
I prepared a version of your program that is a bit easier to analyze:
#include <iostream>
int f()
{
int a;
return a; // prints 0
}
int g()
{
int a;
return reinterpret_cast<long long int>(&a) + a; // prints 0
}
int main() { std::cout << f() << " " << g() << "\n"; }
Function g differs from f in that it uses the address of the uninitialized variable a. I tested both in the Godbolt Compiler Explorer: https://godbolt.org/z/os8b583ss You can switch between various compilers and various optimization options there. Please do experiment yourself. For Debug with gcc or clang, use -O0 or -g; for Release, use -O3.
For the newest (trunk) gcc, we have the following assembly equivalent:
f():
xorl %eax, %eax
ret
g():
leaq -4(%rsp), %rax
addl -4(%rsp), %eax
ret
main:
subq $24, %rsp
xorl %esi, %esi
movl $_ZSt4cout, %edi
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
leaq 12(%rsp), %rsi
movl $_ZSt4cout, %edi
addl 12(%rsp), %esi
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
xorl %eax, %eax
addq $24, %rsp
ret
Please notice that f() was reduced to a trivial zeroing of the eax register (for any value of integer a, a xor a equals 0); eax is the register in which this function returns its value. Hence 0 in Release. Well, actually, no: the compiler is even smarter, it never calls f()! Instead, it zeroes the esi register that is used in the call to operator<<. Similarly, g is replaced by reading 12(%rsp), once as a value and once as an address. This produces a random value for a and rather similar values for &a. AFAIK, stack addresses are somewhat randomized to make the life of hackers attacking our code harder.
Now the same code in Debug:
f():
pushq %rbp
movq %rsp, %rbp
movl -4(%rbp), %eax
popq %rbp
ret
g():
pushq %rbp
movq %rsp, %rbp
leaq -4(%rbp), %rax
movl %eax, %edx
movl -4(%rbp), %eax
addl %edx, %eax
popq %rbp
ret
main:
pushq %rbp
movq %rsp, %rbp
call f()
movl %eax, %esi
movl $_ZSt4cout, %edi
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
call g()
movl %eax, %esi
movl $_ZSt4cout, %edi
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
movl $0, %eax
popq %rbp
ret
You can now clearly see, even without knowing x86 assembly (I don't know it well either), that in Debug mode (-O0 or -g) the compiler performs no optimization at all. In f() it reads a (4 bytes below the frame pointer register value, -4(%rbp)) and moves it to the "result register" eax. In g() the same is done, but a is read once as a value and once as an address. Moreover, both f() and g() are actually called from main(). In this compiler mode, the program produces "random" results for a (try it yourself!).
To make things even more interesting, here's f() as compiled by clang (trunk) in Release:
f(): # #f()
retq
g(): # #g()
retq
Can you see? These functions are so trivial to clang that it generated no code for them at all. Moreover, it did not zero the registers corresponding to a, so, unlike g++, clang produces a random value for a (in both Release and Debug).
You can take your experiments even further and find that what clang produces for f depends on whether f or g is called first in main.
Now you should have a better understanding of what Undefined Behavior is.
Consider this C code:
int f(void) {
int ret;
char carry;
__asm__(
"nop # do something that sets eax and CF"
: "=a"(ret), "=@ccc"(carry)
);
return carry ? -ret : ret;
}
When I compile it with gcc -O3, I get this:
f:
nop # do something that sets eax and CF
setc %cl
movl %eax, %edx
negl %edx
testb %cl, %cl
cmovne %edx, %eax
ret
If I change char carry to int carry, I instead get this:
f:
nop # do something that sets eax and CF
setc %cl
movl %eax, %edx
movzbl %cl, %ecx
negl %edx
testl %ecx, %ecx
cmovne %edx, %eax
ret
That change replaced testb %cl, %cl with movzbl %cl, %ecx and testl %ecx, %ecx. The two versions are actually equivalent, though, and GCC knows it. As evidence of this, if I compile with -Os instead of -O3, both char carry and int carry result in the exact same assembly:
f:
nop # do something that sets eax and CF
jnc .L1
negl %eax
.L1:
ret
It seems like one of two things must be true, but I'm not sure which:
1. A testb is faster than a movzbl followed by a testl, so GCC's use of the latter with int is a missed optimization.
2. A testb is slower than a movzbl followed by a testl, so GCC's use of the former with char is a missed optimization.
My gut tells me that an extra instruction will be slower, but I also have a nagging doubt that it's preventing a partial register stall that I just don't see.
By the way, the usual recommended approach of xoring the register to zero before the setc doesn't work in my real example. You can't do it after the inline assembly runs, since xor will overwrite the carry flag, and you can't do it before the inline assembly runs, since in the real context of this code, every general-purpose call-clobbered register is already in use somehow.
There's no downside I'm aware of to reading a byte register with test vs. movzb.
If you are going to zero-extend, it's also a missed optimization not to xor-zero a reg ahead of the asm statement, and setc into that so the cost of zero-extension is off the critical path. (On CPUs other than Intel IvyBridge+ where movzx r32, r8 is not zero latency). Assuming there's a free register, of course. Recent GCC does sometimes find this zero/set-flags/setcc optimization for generating a 32-bit boolean from a flag-setting instruction, but often misses it when things get complex.
Fortunately for you, your real use-case couldn't do that optimization anyway (except with mov $0, %eax zeroing, which would be off the critical path for latency but cause a partial-register stall on Intel P6 family, and cost more code size.) But it's still a missed optimization for your test case.
Here's a function:
void func(char *ptr)
{
*ptr = 42;
}
Here's an output (cut) of gcc -S function.c:
func:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movq %rdi, -8(%rbp)
movq -8(%rbp), %rax
movb $42, (%rax)
nop
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
I can use that function as:
func(malloc(1));
or as:
char local_var;
func(&local_var);
The question is: how does the processor determine which segment register should be used to translate the effective address to a virtual one in this instruction? (It could be DS as well as SS.)
movb $42, (%rax)
I have a x86_64 proc.
The default segment is DS; that’s what the processor uses in your example.
In 64-bit mode, it doesn’t matter what segment is used, because the segment base is always 0 and permissions are ignored. (There are one or two minor differences, which I won’t go into here.)
In 32-bit mode, most OSes set the base of all segments to 0 and set their permissions the same, so again it doesn't matter.
In code where it does matter (especially 16 bit code that needs to use more than 64 KB of memory), the code must use far pointers, which include the segment selector as part of the pointer value. The software must load the selector into a segment register in order to perform the memory access.
This question is similar to another question I posted here. I am attempting to write the assembly version of the following C/C++ code:
int x[10];
for (int i = 0; i < 10; i++){
x[i] = i;
}
Essentially, creating an array storing the values 0 through 9.
My current logic is to create a label that loops up to 10 (jumping back to itself until reaching the end value). Inside the loop, I have placed the instructions to update the array at the current index of iteration. However, after compiling with gcc filename.s and running with ./a.out, the error Segmentation fault: 11 is printed to the console. My code is below:
.data
x:.fill 10, 4
index:.int 0
end:.int 10
.text
.globl _main
_main:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
jmp outer_loop
leave
ret
outer_loop:
movl index(%rip), %eax;
cmpl end(%rip), %eax
jge end_loop
lea x(%rip), %rdi;
mov index(%rip), %rsi;
movl index(%rip), %eax;
movl %eax, (%rdi, %rsi, 4)
incl index(%rip)
jmp outer_loop
leave
ret
end_loop:
leave
ret
Oddly the code below
lea x(%rip), %rdi;
mov index(%rip), %rsi;
movl index(%rip), %eax;
movl %eax, (%rdi, %rsi, 4)
works only if it is not inside a label that is jumped to repeatedly. Does anyone know how I can implement the code above in a loop without Segmentation fault: 11 being raised? I am using x86-64 assembly on macOS with GNU GAS syntax, compiled with gcc.
Please note that this question is not a duplicate of this question as different Assembly syntax is being used and the scope of the problem is different.
You're using a 64-bit instruction to access a 32-bit area of memory:
mov index(%rip), %rsi;
This results in %rsi being assigned the contents of memory starting at index and extending into end (I'm assuming no padding, though I don't remember GAS's alignment rules). Thus %rsi is effectively assigned the value 0xa00000000 (on the first iteration of the loop), and the following movl %eax, (%rdi, %rsi, 4) makes the CPU try to access an address that isn't mapped by your process.
The solution is to remove the assignment, and replace the line after it with movl index(%rip), %esi. 32-bit operations are guaranteed to always clear out the upper bits of 64-bit registers, so you can then safely use %rsi in the address calculation, as it's going to contain the current index and nothing more.
Your debugger would have told you this, so please do use it next time.
I write a lot of vectorized loops, so 1 common idiom is
volatile int dummy[1<<10];
for (int64_t i = 0; i + 16 <= argc; i+= 16) // process all elements with whole vector
{
int x = dummy[i];
}
// handle remainder (hopefully with SIMD too)
But the resulting machine code has one more instruction than I would like (using gcc 4.9):
.L3:
leaq -16(%rax), %rdx
addq $16, %rax
cmpq %rcx, %rax
movl -120(%rsp,%rdx,4), %edx
jbe .L3
If I change the code to for (int64_t i = 0; i <= argc - 16; i += 16), then the "extra" instruction is gone:
.L2:
movl -120(%rsp,%rax,4), %ecx
addq $16, %rax
cmpq %rdx, %rax
jbe .L2
But why the difference? I was thinking maybe it was due to loop invariants, but only vaguely. Then I noticed that in the 5-instruction case, the increment is done before the load, which would require an extra mov due to x86's destructive 2-operand instructions.
So another explanation could be that it's trading instruction parallelism for 1 extra instruction.
Although it seems there would hardly be any performance difference, can someone explain this mystery (preferably who knows about compiler transformations)?
Ideally I would like to keep the i + 16 <= size form, since it has a more intuitive meaning (the last element of the vector doesn't go out of bounds).
If argc were below -2147483632 and i were below 2147483632, the expression i + 16 <= argc would be required to yield an arithmetically-correct result, while the expression i <= argc - 16 would not (the subtraction could overflow int). The need to give an arithmetically-correct result in that corner case prevents the compiler from optimizing the former expression to match the latter.