How do I ask the assembler to "give me a full size register"? - gcc

I'm trying to allow the assembler to give me a register it chooses, and then use that register with inline assembly. I'm working with the program below, and its seg faulting. The program was compiled with g++ -O1 -g2 -m64 wipe.cpp -o wipe.exe.
When I look at the crash under lldb, I believe I'm getting a 32-bit register rather than a 64-bit register. I'm trying to compute an address (base + offset) using lea, and store the result in a register the assembler chooses:
"lea (%0, %1), %2\n"
Above, I'm trying to say "use a register, and I'll refer to it as %2".
When I perform a disassembly, I see:
0x100000b29: leal (%rbx,%rsi), %edi
-> 0x100000b2c: movb $0x0, (%edi)
So it appears the code being generated calculates and address using 64-bit values (rbx and rsi), but saves it to a 32-bit register (edi) (that the assembler chose).
Here are the values at the time of the crash:
(lldb) type format add --format hex register
(lldb) p $edi
(unsigned int) $3 = 1063330
(lldb) p $rbx
(unsigned long) $4 = 4296030616
(lldb) p $rsi
(unsigned long) $5 = 10
A quick note on the Input Operands below. If I drop the "r" (2), then I get a compiler error when I refer to %2 in the call to lea: invalid operand number in inline asm string.
How do I tell the assembler to "give me a full size register" and then refer to it in my program?
int main(int argc, char* argv[])
{
string s("Hello world");
cout << s << endl;
char* ptr = &s[0];
size_t size = s.length();
if(ptr && size)
{
__asm__ __volatile__
(
"%=:\n" /* generate a unique label for TOP */
"subq $1, %1\n" /* 0-based index */
"lea (%0, %1), %2\n" /* calcualte ptr[idx] */
"movb $0, (%2)\n" /* 0 -> ptr[size - 1] .. ptr[0] */
"jnz %=b\n" /* Back to TOP if non-zero */
: /* no output */
: "r" (ptr), "r" (size), "r" (2)
: "0", "1", "2", "cc"
);
}
return 0;
}
Sorry about these inline assembly questions. I hope this is the last one. I'm not really thrilled with using inline assembly in GCC because of pain points like this (and my fading memory). But its the only legal way I know to do what I want to do given GCC's interpretation of the qualifier volatile in C.
If interested, GCC interprets C's volatile qualifier as hardware backed memory, and anything else is an abuse and it results in an illegal program. So the following is not legal for GCC:
volatile void* g_tame_the_optimizer = NULL;
...
unsigned char* ptr = ...
size_t size = ...;
for(size_t i = 0; i < size; i++)
ptr[i] = 0x00;
g_tame_the_optimizer = ptr;
Interestingly, Microsoft uses a more customary interpretation of volatile (what most programmers expect - namely, anything can change the memory, and not just memory mapped hardware), and the code above is acceptable.

gcc inline asm is a complicated beast. "r" (2) means allocate an int sized register and load it with the value 2. If you just need an arbitrary scratch register you can declare a 64 bit early-clobber dummy output, such as "=&r" (dummy) in the output section, with void *dummy declared earlier. You can consult the gcc manual for more details.
As to the final code snippet looks like you want a memory barrier, just as the linked email says. See the manual for example.

Related

Confusion about different clobber description for arm inline assembly

I'm learning ARM inline assembly, and is confused about a very simple function: assign the value of x to y (both are int type), on arm32 and arm64 why different clobber description required?
Here is the code:
#include <arm_neon.h>
#include <stdio.h>
void asm_test()
{
int x = 10;
int y = 0;
#ifdef __aarch64__
asm volatile(
"mov %w[in], %w[out]"
: [out] "=r"(y)
: [in] "r"(x)
: "r0" // r0 not working, but r1 or x1 works
);
#else
asm volattile(
"mov %[in], %[out]"
: [out] "=r"(y)
: [in] "r"(x)
: "r0" // r0 works, but r1 not working
);
#endif
printf("y is %d\n", y);
}
int main() {
arm_test();
return 0;
}
Tested on my rooted android phone, for arm32, r0 generates correct result but r1 won't. For arm64, r1 or x1 generate correct result, and r0 won't. Why on arm32 and arm64 they are different? What is the concrete rule for this and where can I find it?
ARM / AArch64 syntax is mov dst, src
Your asm statement only works if the compiler happens to pick the same register for both "=r" output and "r" input (or something like that, given extra copies of x floating around).
Different clobbers simply perturb the compiler's register-allocation choices. Look at the generated asm (gcc -S or on https://godbolt.org/, especially with -fverbose-asm.)
Undefined Behaviour from getting the constraints mismatched with the instructions in the template string can still happen to work; never assume that an asm statement is correct just because it works with one set of compiler options and surrounding code.
BTW, x86 AT&T syntax does use mov src, dst, and many GNU C inline-asm examples / tutorials are written for that. Assembly language is specific to the ISA and the toolchain, but a lot of architectures have an instruction called mov. Seeing a mov does not mean this is an ARM example.
Also, you don't actually need a mov instruction to use inline asm to copy a valid. Just tell the compiler you want the input to be in the same register it picks for the output, whatever that happens to be:
// not volatile: has no side effects and produces the same output if the input is the same; i.e. the output is a pure function of the input.
asm (""
: "=r"(output) // pick any register
: "0"(input) // pick the same register as operand 0
: // no clobbers
);

Translate VS inline assembler to GCC inline assembler

I find this C code with inline assembler code:
ReadFromCMOS (unsigned char array [])
{
unsigned char tvalue, index;
for(index = 0; index < 128; index++)
{
_asm
{
cli /* Disable interrupts*/
mov al, index /* Move index address*/
/* since the 0x80 bit of al is not set, NMI is active */
out 0x70,al /* Copy address to CMOS register*/
/* some kind of real delay here is probably best */
in al,0x71 /* Fetch 1 byte to al*/
sti /* Enable interrupts*/
mov tvalue,al
}
array[index] = tvalue;
}
}
WriteTOCMOS(unsigned char array[])
{
unsigned char index;
for(index = 0; index < 128; index++)
{
unsigned char tvalue = array[index];
_asm
{
cli /* Clear interrupts*/
mov al,index /* move index address*/
out 0x70,al /* copy address to CMOS register*/
/* some kind of real delay here is probably best */
mov al,tvalue /* move value to al*/
out 0x71,al /* write 1 byte to CMOS*/
sti /* Enable interrupts*/
}
}
}
I tried to translate to GNU inline assembler, but I failed, mostly because GNU inline asm is messy, uses the archaic AT&T syntax and is difficult to use.
Code that gives me error:
void read_cmos(unsigned char array[])
{
unsigned char tvalue, index;
for (index = 0; index < 128; ++index)
{
/* read from CMOS */
asm ("cli; outb %1, $0x70; inb $0x71, %0; sti" : "=a"(tvalue) : "a"(index));
}
array[index] = tvalue;
}
Try something like this:
/* read from CMOS */
asm ("cli; outb %1, $0x70; inb $0x71, %0; sti" : "=a"(tvalue) : "a"(index));
/* write to CMOS */
unsigned char i = index;
asm volatile ("cli; outb %0, $0x70; movb %1, %%al; outb %%al, $0x71; sti" : "+a"(i) : "rm"(tvalue));
Note that using an extra variable for tvalue is optional. You could also specify
"+a"(array[index])
or
"a"(array[index])
directly. What matters is that the expression you pass has a byte-sized type so gcc picks al instead of eax.
Assigning index to i is needed to allow al to be clobbered without changing the value of index. This code should just work. Alternatively, the second set of instructions can also be split up into two:
asm volatile ("cli; outb %0, $0x70" :: "a"(index));
asm volatile ("outb %0, %0x71" :: "a"(tvalue));
This avoids the need for an extra variable and gives greater flexibility to the compiler when chosing registers.
Take a look at the (ancient) GCC-inline-assembly HOWTO (geared towards i686 Linux, so probably right on for your use), check the argument passing/constraints carefully (they allow GCC to arrange calling code right, by e.g. placing the inputs/outputs in the registers used). The GCC documentation on inline assembly is also relevant, but somewhat opaque in my memory, much more detailed, covering many more architectures in detail (but presumably more up to date).
(Sorry, can't place links on my phone. A quick search should give them as first hits.)

What is the role of the clobber list? [duplicate]

This function "strcpy" aims to copy the content of src to dest, and it works out just fine: display two lines of "Hello_src".
#include <stdio.h>
static inline char * strcpy(char * dest,const char *src)
{
int d0, d1, d2;
__asm__ __volatile__("1:\tlodsb\n\t"
"stosb\n\t"
"testb %%al,%%al\n\t"
"jne 1b"
: "=&S" (d0), "=&D" (d1), "=&a" (d2)
: "0"(src),"1"(dest)
: "memory");
return dest;
}
int main(void) {
char src_main[] = "Hello_src";
char dest_main[] = "Hello_des";
strcpy(dest_main, src_main);
puts(src_main);
puts(dest_main);
return 0;
}
I tried to change the line : "0"(src),"1"(dest) to : "S"(src),"D"(dest), the error occurred: ‘asm’ operand has impossible constraints. I just cannot understand. I thought that "0"/"1" here specified the same constraint as the 0th/1th output variable. the constraint of 0th output is =&S, te constraint of 1th output is =&D. If I change 0-->S, 1-->D, there shouldn't be any wrong. What's the matter with it?
Does "clobbered registers" or the earlyclobber operand(&) have any use? I try to remove "&" or "memory", the result of either circumstance is the same as the original one: output two lines of "Hello_src" strings. So why should I use the "clobbered" things?
The earlyclobber & means that the particular output is written before the inputs are consumed. As such, the compiler may not allocate any input to the same register. Apparently using the 0/1 style overrides that behavior.
Of course the clobber list also has important use. The compiler does not parse your assembly code. It needs the clobber list to figure out which registers your code will modify. You'd better not lie, or subtle bugs may creep in. If you want to see its effect, try to trick the compiler into using a register around your asm block:
extern int foo();
int bar()
{
int x = foo();
asm("nop" ::: "eax");
return x;
}
Relevant part of the generated assembly code:
call foo
movl %eax, %edx
nop
movl %edx, %eax
Notice how the compiler had to save the return value from foo into edx because it believed that eax will be modified. Normally it would just leave it in eax, since that's where it will be needed later. Here you can imagine what would happen if your asm code did modify eax without telling the compiler: the return value would be overwritten.

atomic_inc and atomic_xchg in gcc assembly

I have written the following user-level code snippet to test two sub functions, atomic inc and xchg (refer to Linux code).
What I need is just try to perform operations on 32-bit integer, and that's why I explicitly use int32_t.
I assume global_counter will be raced by different threads, while tmp_counter is fine.
#include <stdio.h>
#include <stdint.h>
int32_t global_counter = 10;
/* Increment the value pointed by ptr */
void atomic_inc(int32_t *ptr)
{
__asm__("incl %0;\n"
: "+m"(*ptr));
}
/*
* Atomically exchange the val with *ptr.
* Return the value previously stored in *ptr before the exchange
*/
int32_t atomic_xchg(uint32_t *ptr, uint32_t val)
{
uint32_t tmp = val;
__asm__(
"xchgl %0, %1;\n"
: "=r"(tmp), "+m"(*ptr)
: "0"(tmp)
:"memory");
return tmp;
}
int main()
{
int32_t tmp_counter = 0;
printf("Init global=%d, tmp=%d\n", global_counter, tmp_counter);
atomic_inc(&tmp_counter);
atomic_inc(&global_counter);
printf("After inc, global=%d, tmp=%d\n", global_counter, tmp_counter);
tmp_counter = atomic_xchg(&global_counter, tmp_counter);
printf("After xchg, global=%d, tmp=%d\n", global_counter, tmp_counter);
return 0;
}
My 2 questions are:
Are these two subfunctions written properly?
Will this behave the same when I compile this on 32-bit or
64-bit platform? For example, could the pointer address have a different
length. or could incl and xchgl will conflict with the operand?
My understanding of this question is below, please correct me if I'm wrong.
All the read-modify-write instructions (ex: incl, add, xchg) need a lock prefix. The lock instruction is to lock the memory accessed by other CPUs by asserting LOCK# signal on the memory bus.
The __xchg function in Linux kernel implies no "lock" prefix because xchg always implies lock anyway. http://lxr.linux.no/linux+v2.6.38/arch/x86/include/asm/cmpxchg_64.h#L15
However, the incl used in atomic_inc does not have this assumption so a lock_prefix is needed.
http://lxr.linux.no/linux+v2.6.38/arch/x86/include/asm/atomic.h#L105
btw, I think you need to copy the *ptr to a volatile variable to avoid gcc optimization.
William

Any constraints to refer to the high half of a register in gcc inline assembly?

In my C code there are some inlined assembly calling PCI BIOS service. Now the problem is that one of the results is returned in the %ah register, but I can't find a constrant to refer to that register.
What I want is to write like following:
asm("lcall *%[call_addr]" : "something here"(status) :);
and the variable status contains the value of %ah register.
If I use "=a"(status) and add a mov %%ah, %%al instruction it will work. But it seems ugly.
Any suggestions?
I don't think there is a way to specify %ah in a constraint - the GCC x86 backend knows that sub-registers contain particular parts of values, but doesn't really treat them as independent entities.
Your approach will work; another option is to shift status down outside the inline assembler. e.g. this:
unsigned int foo(void)
{
unsigned int status;
asm("movb $0x12, %%ah; movb $0x34, %%al" : "=a"(status) : );
return (status >> 8) & 0xff;
}
...implements the (status >> 8) & 0xff as a single movzbl %ah, %eax instruction.
A third option is to use a small structure:
unsigned int bar(void)
{
struct { uint8_t al; uint8_t ah; } status;
asm("movb $0x12, %%ah; movb $0x34, %%al" : "=a"(status) : );
return status.ah;
}
I'm not sure whether this is nicer or not - it seems a little more self-documenting, but the use of a small structure with a register constraint looks less obviously correct. However, it generates the same code as foo() above.
(Disclaimer: code generation tested with gcc 4.3.2 only; results may well differ on other versions.)

Resources