System crash after two MOVs to memory-mapped registers - gcc

I'm trying to perform some writes to memory-mapped registers from user space (through a custom driver). I want to write three 64-bit integers, and I declared the variables value_1, value_2 and value_3 as uint64_t.
I must use a GCC inline-asm MOV instruction, and I'm working on a 64-bit ARM architecture running a custom version of Linux for an embedded system.
This is my code:
asm ( "MOV %[reg1], %[val1]\t\n"
"MOV %[reg2], %[val2]\t\n"
"MOV %[reg3], %[val3]\t\n"
:[reg1] "=&r" (*register_1),[arg2] "=&r" (*register_2), [arg3] "=&r" (*register_3)
:[val1] "r"(value_1),[val2] "r" (value_2), [val3] "r" (value_3)
);
The problem is strange...
If I perform just two MOVs, the code works.
If I perform all three MOVs, the entire system crashes and I have to reboot it.
Even stranger...
If I put a printf, or even a nanosleep of 0 nanoseconds, between the second and the third MOV, the code works!
I looked around trying to find a solution, and I also tried the "memory" clobber:
asm ( "MOV %[reg1], %[val1]\t\n"
"MOV %[reg2], %[val2]\t\n"
"MOV %[reg3], %[val3]\t\n"
:[reg1] "=&r" (*register_1),[arg2] "=&r" (*register_2), [arg3] "=&r" (*register_3)
:[val1] "r"(value_1),[val2] "r" (value_2), [val3] "r" (value_3)
:"memory"
);
...doesn't work!
I also used the memory barrier macro between the second and the third MOV, and at the end of all three MOVs:
asm volatile("" : : : "memory");
...doesn't work!
Also, I tried to write directly into the registers using pointers, and I had the same behavior: after the second write the system crashes...
Can anybody suggest a solution, or tell me if I'm using the GCC inline MOV or the memory barrier in the wrong way?
----> MORE DETAILS <-----
This is my main:
int main()
{
    int dev_fd;
    volatile void * base_addr = NULL;
    volatile uint64_t * reg1_addr = NULL;
    volatile uint32_t * reg2_addr = NULL;
    volatile uint32_t * reg3_addr = NULL;

    dev_fd = open(MY_DEVICE, O_RDWR);
    if (dev_fd < 0)
    {
        perror("Open call failed");
        return -1;
    }

    base_addr = mmap(NULL, PAGE_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, dev_fd, 0);
    if (base_addr == MAP_FAILED)
    {
        perror("mmap operation failed");
        return -1;
    }

    printf("BASE ADDRESS VIRT: 0x%p\n", base_addr);

    /* Preparing the registers */
    reg1_addr = base_addr + REG1_OFF;
    reg2_addr = base_addr + REG2_OFF;
    reg3_addr = base_addr + REG3_OFF;

    uint64_t val_1 = 0xEEEEEEEE;
    uint64_t val_2 = 0x00030010;
    uint64_t val_3 = 0x01;

    asm ( "str %[val1], %[reg1]\n\t"
          "str %[val2], %[reg2]\n\t"
          "str %[val3], %[reg3]"
          : [reg1] "=&m" (*reg1_addr), [reg2] "=&m" (*reg2_addr), [reg3] "=&m" (*reg3_addr)
          : [val1] "r" (val_1), [val2] "r" (val_2), [val3] "r" (val_3)
        );

    printf("--- END ---\n");
    close(dev_fd);
    return 0;
}
This is the compiler's output for the asm statement (Linaro toolchain; I cross-compile):
400bfc: f90013a0 str x0, [x29,#32]
400c00: f94027a3 ldr x3, [x29,#72]
400c04: f94023a4 ldr x4, [x29,#64]
400c08: f9402ba5 ldr x5, [x29,#80]
400c0c: f9401ba0 ldr x0, [x29,#48]
400c10: f94017a1 ldr x1, [x29,#40]
400c14: f94013a2 ldr x2, [x29,#32]
400c18: f9000060 str x0, [x3]
400c1c: f9000081 str x1, [x4]
400c20: f90000a2 str x2, [x5]
Thank you!

I tried with *reg1_addr = val_1; and I have the same problem.
Then this code isn't the problem. Avoiding asm is just a cleaner way to get equivalent machine code, without having to use inline asm. Your problem is more likely your choice of registers and values, or the kernel driver.
Or do you need the values to be in CPU registers before writing the first mmaped location, to avoid loading anything from the stack between stores? That's the only reason I can think of that you'd need inline asm, where compiler-generated stores might not be equivalent.
Answer to original question:
An "=&r" output constraint means a CPU register. So your inline-asm instructions will run in that order, assembling to something like
mov x0, x5
mov x1, x6
mov x2, x7
And then after that, compiler-generated code will store the values back to memory in some unspecified order. That order depends on how it chooses to generate code for the surrounding C. This is probably why changing the surrounding code changes the behaviour.
One solution might be "=&m" constraints with str instructions, so your asm actually does store to memory. Write str %[val1], %[reg1], because STR instructions take the addressing mode as the 2nd operand, even though it's the destination.
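A minimal sketch of that fix, reusing the operand names from the question (it assumes register_1..register_3 point at the mmapped registers and value_1..value_3 are the uint64_t values; volatile keeps the asm from being optimized out when the compiler can't see the stores being used):

asm volatile ( "str %[val1], %[reg1]\n\t"
               "str %[val2], %[reg2]\n\t"
               "str %[val3], %[reg3]"
               : [reg1] "=&m" (*register_1), [reg2] "=&m" (*register_2), [reg3] "=&m" (*register_3)
               : [val1] "r" (value_1), [val2] "r" (value_2), [val3] "r" (value_3)
             );

With "m" output operands the compiler supplies an addressing mode for each register, and the three stores now execute inside the asm, in program order.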
Why can't you use volatile uint64_t *reg1 = register_1; like a normal person, to have the compiler emit store instructions that it's not allowed to reorder or optimize away? MMIO is exactly what volatile is for.
Doesn't Linux have macros or functions for doing MMIO loads/stores?
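For reference, a minimal sketch of the volatile version, reusing the pointers and values from the question's update (plain C, no inline asm needed):

/* The pointers are declared volatile, so the compiler must emit
 * exactly one store per assignment and may not reorder them with
 * respect to each other or optimize them away. */
*reg1_addr = val_1;             /* 64-bit store (str x..) */
*reg2_addr = (uint32_t)val_2;   /* 32-bit store (str w..) */
*reg3_addr = (uint32_t)val_3;

Note the casts: reg2_addr and reg3_addr are uint32_t pointers in the question's main(), so these become 32-bit stores, unlike the 64-bit str x instructions in the disassembly above.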
If you're having problems with inline asm, step 1 in debugging should be to look at the actual asm emitted by the compiler when it filled in the asm template, and at the surrounding code.
Then single-step by instructions through the code (with GDB stepi, maybe in layout regs mode).

Related

What is the GCC documentation and example saying about inline asm and not using early clobbers so a pointer shares a register with a mem input?

The GCC documentation (https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Clobbers-and-Scratch-Registers-1) contains the following PowerPC example and description:
static void
dgemv_kernel_4x4 (long n, const double *ap, long lda,
                  const double *x, double *y, double alpha)
{
    double *a0;
    double *a1;
    double *a2;
    double *a3;

    __asm__
    (
        /* lots of asm here */
        "#n=%1 ap=%8=%12 lda=%13 x=%7=%10 y=%0=%2 alpha=%9 o16=%11\n"
        "#a0=%3 a1=%4 a2=%5 a3=%6"
        :
          "+m" (*(double (*)[n]) y),
          "+&r" (n),        // 1
          "+b" (y),         // 2
          "=b" (a0),        // 3
          "=&b" (a1),       // 4
          "=&b" (a2),       // 5
          "=&b" (a3)        // 6
        :
          "m" (*(const double (*)[n]) x),
          "m" (*(const double (*)[]) ap),
          "d" (alpha),      // 9
          "r" (x),          // 10
          "b" (16),         // 11
          "3" (ap),         // 12
          "4" (lda)         // 13
        :
          "cr0",
          "vs32","vs33","vs34","vs35","vs36","vs37",
          "vs40","vs41","vs42","vs43","vs44","vs45","vs46","vs47"
    );
}
... On the other hand, ap can't be the same as any of the other inputs, so an early-clobber on a0 is not needed. It is also not desirable in this case. An early-clobber on a0 would cause GCC to allocate a separate register for the "m" (*(const double (*)[]) ap) input. Note that tying an input to an output is the way to set up an initialized temporary register modified by an asm statement. An input not tied to an output is assumed by GCC to be unchanged...
I am totally confused about this description:
In the code there is no relationship between "m" (*(const double (*)[]) ap) and "=b" (a0). "=b" (a0) will share a register with "3" (ap), which holds the address of the input array, and "m" (*(const double (*)[]) ap) is the content of the first element of ap, so why would an early-clobber on a0 impact "m" (*(const double (*)[]) ap)?
Even if GCC allocated a new register for "m" (*(const double (*)[]) ap), I still don't understand what the problem is. Since "=b" (a0) is tied to "3" (ap), can't we still read/write through the register allocated for "=b" (a0)?
This is an efficiency consideration, not correctness, stopping GCC from wasting instructions (and creating register pressure).
"m" (*(const double (*)[]) ap) isn't the first element, it's an arbitrary-length array, letting the compiler know that the entire array object is an input. But it's a dummy input; the asm template won't actually use that operand, instead looping over the array via the pointer input "3" (ap)
See How can I indicate that the memory *pointed* to by an inline ASM argument may be used? for more about this technique.
But "m" inputs are real inputs that have to work expand to an addressing mode if the template does use them, including after early-clobbers have clobbered their register.
With =&b(a0) / "3"(ap), GCC couldn't pick the same register as the base for an addressing mode for "m" (*(const double (*)[]) ap).
So it would have to waste an instruction ahead of the asm statement copying the address to another register. Also wasting that integer register.
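To make the technique concrete, here is a self-contained AArch64 sketch (a hypothetical sum4 helper, not from the GCC docs): the template walks the array through a pointer copy tied to an operand, while the dummy "m" input tells GCC that the whole pointed-to array is read, so earlier stores to it can't be reordered past the asm or dropped:

#include <stdint.h>

static int64_t sum4(const int64_t *ap)
{
    int64_t sum, tmp;
    const int64_t *a0 = ap;           /* like "3" (ap) tied to "=b" (a0) */
    __asm__ ("ldr %0, [%2], #8\n\t"   /* sum = *a0++ */
             "ldr %1, [%2], #8\n\t"   /* tmp = *a0++ */
             "add %0, %0, %1\n\t"
             "ldr %1, [%2], #8\n\t"
             "add %0, %0, %1\n\t"
             "ldr %1, [%2]\n\t"
             "add %0, %0, %1"
             : "=&r" (sum), "=&r" (tmp), "+r" (a0)
             : "m" (*(const int64_t (*)[4]) ap)  /* dummy whole-array input, never referenced in the template */
            );
    return sum;
}

Because a0 ("+r") is not early-clobbered, GCC is free to use the same register as the base of the dummy operand's addressing mode, which is exactly the efficiency point the documentation is making.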

Extending SRecord to handle crc32_mpeg2?

statement of problem:
I'm working with a Kinetis L series (ARM Cortex M0+) that has a dedicated CRC hardware module. Through trial and error and using this excellent online CRC calculator, I determined that the CRC hardware is configured to compute CRC32_MPEG2.
I'd like to use srec_input (a part of SRecord 1.64) to generate a CRC for a .srec file whose result must match the CRC32_MPEG2 computed by the hardware. However, SRecord's built-in CRC algorithms (CRC32 and STM32) don't produce the same results as CRC32_MPEG2.
the question:
Is there a straightforward way to extend srec to handle CRC32_MPEG2? My current thought is to fork the srec source tree and extend it, but it seems likely that someone's already been down this path.
Alternatively, is there a way for srec to call an external program? (I didn't see one after a quick scan.) That might do the trick as well.
some details
The parameters of the hardware CRC32 algorithm are:
Input Reflected: No
Output Reflected: No
Polynomial: 0x4C11DB7
Initial Seed: 0xFFFFFFFF
Final XOR: 0x0
To test it, an input string of:
0x10 0xB5 0x06 0x4C 0x23 0x78 0x00 0x2B
0x07 0xD1 0x05 0x4B 0x00 0x2B 0x02 0xD0
should result in a CRC32 value of:
0x938F979A
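For reference, here is a plain bitwise C sketch with exactly those parameters (the function name is mine); given the stated parameters, it should reproduce the value above for the 16-byte test input:

#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

/* CRC-32/MPEG-2: polynomial 0x04C11DB7, seed 0xFFFFFFFF,
 * no input/output reflection, final XOR 0x0. */
static uint32_t crc32_mpeg2(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFF;
    while (len--) {
        crc ^= (uint32_t)*data++ << 24;   /* feed each byte MSB-first */
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x80000000) ? (crc << 1) ^ 0x04C11DB7 : (crc << 1);
    }
    return crc;                           /* final XOR is 0x0 */
}

int main(void)
{
    const uint8_t msg[] = { 0x10, 0xB5, 0x06, 0x4C, 0x23, 0x78, 0x00, 0x2B,
                            0x07, 0xD1, 0x05, 0x4B, 0x00, 0x2B, 0x02, 0xD0 };
    printf("0x%08X\n", (unsigned)crc32_mpeg2(msg, sizeof msg));  /* expect 0x938F979A */
    return 0;
}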
what generated the CRC value in the first place?
In response to Mark Adler's well-posed question: the firmware uses the Freescale fsl_crc library to compute the CRC. The relevant code and parameters (mildly edited) follow:
void crc32_update(crc32_data_t *crc32Config, const uint8_t *src, uint32_t lengthInBytes)
{
    crc_config_t crcUserConfigPtr;

    CRC_GetDefaultConfig(&crcUserConfigPtr);

    crcUserConfigPtr.crcBits = kCrcBits32;
    crcUserConfigPtr.seed = 0xffffffff;
    crcUserConfigPtr.polynomial = 0x04c11db7U;
    crcUserConfigPtr.complementChecksum = false;
    crcUserConfigPtr.reflectIn = false;
    crcUserConfigPtr.reflectOut = false;

    CRC_Init(g_crcBase[0], &crcUserConfigPtr);
    CRC_WriteData(g_crcBase[0], src, lengthInBytes);
    crcUserConfigPtr.seed = CRC_Get32bitResult(g_crcBase[0]);

    crc32Config->currentCrc = crcUserConfigPtr.seed;
    crc32Config->byteCountCrc += lengthInBytes;
}
Peter Miller be praised...
It turns out that if you supply enough filters to srec_cat, you can make it do anything! :) In fact, the following arguments generate the correct checksum:
$ srec_cat test.srec -Bit_Reverse -CRC32LE 0x1000 -Bit_Reverse -XOR 0xff -crop 0x1000 0x1004 -Output -HEX_DUMP
00001000: 93 8F 97 9A #....
In other words: bit-reverse the bytes going into the CRC32 algorithm, bit-reverse them on the way out, and one's-complement them.

inline assembly instruction with two return registers

I have a custom instruction for a processor; it has two result registers and two operands, like:
MINMAX rdMin, rdMax, rs1, rs2
It returns the minimum and the maximum of rs1 and rs2. I have verified this instruction using an assembly program, and it works fine. Now I want to use this instruction from GCC using inline assembly. I tried the following code, but it did not give the correct values for rdMin and rdMax. Is there any mistake in the syntax?
int main() {
    unsigned int array[10] = { 45, 75, 0, 0, 0, 0, 0, 0, 0 };
    int op1 = 16, op2 = 18, out, out1, out2;

    // asm for AVG rd, rs1, rs2
    __asm__ volatile (
        "avg %[my_out], %[my_op1], %[my_op2]\n"
        : [my_out] "=&r" (out)
        : [my_op1] "r" (op1), [my_op2] "r" (op2)
    );

    // asm for MINMAX rdMin, rdMax, rs1, rs2
    __asm__ volatile (
        "minmax %[my_out1], %[my_out2], %[my_op1], %[my_op2]\n"
        : [my_out1] "=r" (out1), [my_out2] "=r" (out2)
        : [my_op1] "r" (op1), [my_op2] "r" (op2)
    );

    array[3] = out;
    array[4] = out1;
    array[5] = out2;
    return 0;
}
Thanks.

What is the fastest way to handle overflow on integer division/remainder without panic?

I'm still improving overflower to handle integer overflow. One goal was to be able to use #[overflow(wrap)] to avoid panics on overflow. However, I found out that the .wrapping_div(_) and .wrapping_rem(_) functions of the standard integer types do in fact panic when dividing by zero. Edit: To motivate this use case better: within interrupt handlers, we absolutely want to avoid panics. I assume that the div-by-zero condition is highly unlikely, but we still need to return a "valid" value for some definition of valid.
One possible solution is saturating the value (which I do when code is annotated with #[overflow(saturate)]), but this is likely relatively slow (especially since other operations are saturated as well). So I want to add an #[overflow(no_panic)] mode that avoids panics completely, and is almost as fast as #[overflow(wrap)] in all cases.
My question is: What is the fastest way to return something (don't care what) without panicking on dividing (or getting the remainder) by zero?
Disclaimer: this isn't really a serious answer. It is almost certainly slower than the naive solution of using an if statement to check whether the divisor is zero.
#![feature(asm)]
fn main() {
    println!("18 / 3 = {}", f(18, 3));
    println!("2555 / 10 = {}", f(2555, 10));
    println!("-16 / 3 = {}", f(-16, 3));
    println!("7784388 / 0 = {}", f(7784388, 0));
}

fn f(x: i32, y: i32) -> i32 {
    let z: i32;
    unsafe {
        asm!(
            "
            test %ecx, %ecx
            lahf
            and $$0x4000, %eax
            or %eax, %ecx
            mov %ebx, %eax
            cdq
            idiv %ecx
            "
            : "={eax}"(z)
            : "{ebx}"(x), "{ecx}"(y)
            : "{edx}"
        );
    }
    z
}
Rust Playground
pub fn nopanic_signed_div(x: i32, y: i32) -> i32 {
    if y == 0 || y == -1 {
        // Divide by -1 is equivalent to neg; we don't care what
        // divide by zero returns.
        x.wrapping_neg()
    } else {
        // (You can replace this with unchecked_div to make it more
        // obvious this will never panic.)
        x / y
    }
}
This produces the following on x86-64 with "rustc 1.11.0-nightly (6e00b5556 2016-05-29)":
movl %edi, %eax
leal 1(%rsi), %ecx
cmpl $1, %ecx
ja .LBB0_2
negl %eax
retq
.LBB0_2:
cltd
idivl %esi
retq
It should produce something similar on other platforms.
At least one branch is necessary because LLVM IR considers divide by zero to be undefined behavior. Checking for 0 and -1 separately would involve an extra branch. With those constraints, there isn't really any other choice.
(It might be possible to come up with something slightly faster with inline assembly, but it would be a terrible idea because you would end up generating much worse code in the case of dividing by a constant.)
Whether this solution is actually appropriate probably depends on what your goal is; a divide by zero is probably a logic error, so silently accepting it seems like a bad idea.

Which is faster for reverse iteration, for or while loops?

I am trying to implement the standard memmove function in Rust and I was wondering which method is faster for downwards iteration (where src < dest):
for i in (0..n).rev() {
//Do copying
}
or
let mut i = n;
while i != 0 {
i -= 1;
// Do copying
}
Will the rev() in the for loops version significantly slow it down?
TL;DR: Use the for loop.
Both should be equally fast. We can check the compiler's ability to peel away the layers of abstraction involved in the for loop quite simply:
#[inline(never)]
fn blackhole() {}

#[inline(never)]
fn with_for(n: usize) {
    for i in (0..n).rev() { blackhole(); }
}

#[inline(never)]
fn with_while(n: usize) {
    let mut i = n;
    while i > 0 {
        blackhole();
        i -= 1;
    }
}
This generates this LLVM IR:
; Function Attrs: noinline nounwind readnone uwtable
define internal void @_ZN8with_for20h645c385965fcce1fhaaE(i64) unnamed_addr #0 {
entry-block:
  ret void
}

; Function Attrs: noinline nounwind readnone uwtable
define internal void @_ZN10with_while20hc09c3331764a9434yaaE(i64) unnamed_addr #0 {
entry-block:
  ret void
}
Even if you are not versed in LLVM, it is obvious that both functions compiled down to the same IR (and thus obviously to the same assembly).
Since their performance is the same, one should prefer the more explicit for loop and reserve the while loop to cases where the iteration is irregular.
EDIT: to address starblue's concern of unfitness.
#[link(name = "snappy")]
extern {
    fn blackhole(i: libc::c_int) -> libc::c_int;
}

#[inline(never)]
fn with_for(n: i32) {
    for i in (0..n).rev() { unsafe { blackhole(i as libc::c_int); } }
}

#[inline(never)]
fn with_while(n: i32) {
    let mut i = n;
    while i > 0 {
        unsafe { blackhole(i as libc::c_int); }
        i -= 1;
    }
}
compiles down to:
; Function Attrs: noinline nounwind uwtable
define internal void @_ZN8with_for20h7cf06f33e247fa35maaE(i32) unnamed_addr #1 {
entry-block:
  %1 = icmp sgt i32 %0, 0
  br i1 %1, label %match_case.preheader, label %clean_ast_95_

match_case.preheader:                   ; preds = %entry-block
  br label %match_case

match_case:                             ; preds = %match_case.preheader, %match_case
  %.in = phi i32 [ %2, %match_case ], [ %0, %match_case.preheader ]
  %2 = add i32 %.in, -1
  %3 = tail call i32 @blackhole(i32 %2)
  %4 = icmp sgt i32 %2, 0
  br i1 %4, label %match_case, label %clean_ast_95_.loopexit

clean_ast_95_.loopexit:                 ; preds = %match_case
  br label %clean_ast_95_

clean_ast_95_:                          ; preds = %clean_ast_95_.loopexit, %entry-block
  ret void
}

; Function Attrs: noinline nounwind uwtable
define internal void @_ZN10with_while20hee8edd624cfe9293IaaE(i32) unnamed_addr #1 {
entry-block:
  %1 = icmp sgt i32 %0, 0
  br i1 %1, label %while_body.preheader, label %while_exit

while_body.preheader:                   ; preds = %entry-block
  br label %while_body

while_exit.loopexit:                    ; preds = %while_body
  br label %while_exit

while_exit:                             ; preds = %while_exit.loopexit, %entry-block
  ret void

while_body:                             ; preds = %while_body.preheader, %while_body
  %i.05 = phi i32 [ %3, %while_body ], [ %0, %while_body.preheader ]
  %2 = tail call i32 @blackhole(i32 %i.05)
  %3 = add nsw i32 %i.05, -1
  %4 = icmp sgt i32 %i.05, 1
  br i1 %4, label %while_body, label %while_exit.loopexit
}
The core loops are:
; -- for loop
match_case:                             ; preds = %match_case.preheader, %match_case
  %.in = phi i32 [ %2, %match_case ], [ %0, %match_case.preheader ]
  %2 = add i32 %.in, -1
  %3 = tail call i32 @blackhole(i32 %2)
  %4 = icmp sgt i32 %2, 0
  br i1 %4, label %match_case, label %clean_ast_95_.loopexit

; -- while loop
while_body:                             ; preds = %while_body.preheader, %while_body
  %i.05 = phi i32 [ %3, %while_body ], [ %0, %while_body.preheader ]
  %2 = tail call i32 @blackhole(i32 %i.05)
  %3 = add nsw i32 %i.05, -1
  %4 = icmp sgt i32 %i.05, 1
  br i1 %4, label %while_body, label %while_exit.loopexit
And the only differences are that:
for decrements before calling blackhole, while decrements after;
for compares against 0, while compares against 1;
otherwise, it's the same core loop.
In short: They are (nearly) equally fast -- use the for loop!
Longer version:
First: rev() only works for iterators that implement DoubleEndedIterator, which provides a next_back() method. This method is expected to run in o(n) (sublinear time), usually even O(1) (constant time). And indeed, by looking at the implementation of next_back() for Range, we can see that it runs in constant time.
Now we know that both versions have asymptotically identical runtime. If this is the case, you should usually stop thinking about it and use the solution that is more idiomatic (which is for in this case). Thinking about optimization too early often decreases programming productivity, because performance matters only in a tiny percentage of all code you write.
But since you are implementing memmove, performance might actually really matter to you. So let's try to look at the resulting ASM. I used this code:
#![feature(start)]
#![feature(test)]
extern crate test;

#[inline(never)]
#[no_mangle]
fn with_for(n: usize) {
    for i in (0..n).rev() {
        test::black_box(i);
    }
}

#[inline(never)]
#[no_mangle]
fn with_while(n: usize) {
    let mut i = n;
    while i > 0 {
        test::black_box(i);
        i -= 1;
    }
}

#[start]
fn main(_: isize, vargs: *const *const u8) -> isize {
    let random_enough_value = unsafe {
        **vargs as usize
    };
    with_for(random_enough_value);
    with_while(random_enough_value);
    0
}
(Playground Link)
The #[no_mangle] is to improve readability of the resulting ASM. The #[inline(never)], the random_enough_value and the black_box are used to prevent LLVM from optimizing away things we don't want optimized. The generated ASM of this (in release mode!) with some cleanup looks like:
with_for: | with_while:
testq %rdi, %rdi | testq %rdi, %rdi
je .LBB0_3 | je .LBB1_3
decq %rdi |
leaq -8(%rsp), %rax | leaq -8(%rsp), %rax
.LBB0_2: | .LBB1_2:
movq %rdi, -8(%rsp) | movq %rdi, -8(%rsp)
decq %rdi | decq %rdi
cmpq $-1, %rdi |
jne .LBB0_2 | jne .LBB1_2
.LBB0_3: | .LBB1_3:
retq | retq
The only difference is that with_while has two fewer instructions, because it counts down to 0 instead of to -1 as with_for does.
Conclusion: if you can tell that the asymptotic runtime is optimal, you should probably not think about optimization at all. Modern optimizers are clever enough to compile high level constructs down to pretty perfect ASM. Often, data layout and resulting cache efficiency is much more important than a minimal count of instructions, anyway.
If you actually need to think about optimization though, look at the ASM (or LLVM IR). In this case the for loop is actually a bit slower (more instructions, comparison with -1 instead of 0). But the number of cases where a Rust programmer should care about this is probably minuscule.
For small N, it really shouldn't matter.
Rust is lazy on iterators; 0..n won't cause any evaluation until you actually ask for an element. rev() asks for the last element first. As far as I know, the Rust counter iterator is clever and doesn't need to generate the first N-1 elements to get the Nth one. In this specific case, the rev method is probably even faster.
In the general case, it depends on what kind of access paradigm and access time your iterator has; make sure that accessing the end takes constant time, and it doesn't make a difference.
As with all benchmarking questions, it depends. Test for your N values yourself!
Premature optimization is also evil, so if your N is small, and your loop isn't done very often... don't worry.
