inline assembly instruction with two return registers - gcc

I have a custom instruction for a processor; it has two return registers and two operands:
MINMAX rdMin, rdMax, rs1, rs2
It returns the minimum and the maximum of rs1 and rs2. I have verified this instruction with an assembly program and it works fine. Now I want to use it from GCC via inline assembly. I tried the following code, but it does not give the correct values of rdMin and rdMax. Is there any mistake in the syntax?
int main() {
    unsigned int array[10] = {45, 75, 0, 0, 0, 0, 0, 0, 0};
    int op1 = 16, op2 = 18, out, out1, out2;

    // asm for AVG rd, rs1, rs2
    __asm__ volatile (
        "avg %[my_out], %[my_op1], %[my_op2]\n"
        : [my_out] "=&r" (out)
        : [my_op1] "r" (op1), [my_op2] "r" (op2)
    );

    // asm for MINMAX rdMin, rdMax, rs1, rs2
    __asm__ volatile (
        "minmax %[my_out1], %[my_out2], %[my_op1], %[my_op2]\n"
        : [my_out1] "=r" (out1), [my_out2] "=r" (out2)
        : [my_op1] "r" (op1), [my_op2] "r" (op2)
    );

    array[3] = out;
    array[4] = out1;
    array[5] = out2;
    return 0;
}
Thanks.
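One thing that may be worth ruling out (an assumption about the hardware, not something the post confirms): if the core writes rdMin before it has finished reading rs1 and rs2, both outputs need to be marked early-clobber so the register allocator cannot give an output the same register as an input. A drop-in variant of the second statement, using the same variables as above:

    // Same minmax statement, but with early-clobber ("=&r") outputs so that
    // neither out1 nor out2 can share a register with op1 or op2.
    __asm__ volatile (
        "minmax %[my_out1], %[my_out2], %[my_op1], %[my_op2]\n"
        : [my_out1] "=&r" (out1), [my_out2] "=&r" (out2)
        : [my_op1] "r" (op1), [my_op2] "r" (op2)
    );

Either way, compiling with gcc -S and checking which registers were substituted into the template (and whether an output overlaps an input) is the quickest way to see where the wrong values come from.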

Related

What is the GCC documentation and example saying about inline asm and not using early clobbers so a pointer shares a register with a mem input?

The GCC documentation (https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Clobbers-and-Scratch-Registers-1) contains the following PowerPC example and description:
static void
dgemv_kernel_4x4 (long n, const double *ap, long lda,
                  const double *x, double *y, double alpha)
{
    double *a0;
    double *a1;
    double *a2;
    double *a3;

    __asm__
    (
        /* lots of asm here */
        "#n=%1 ap=%8=%12 lda=%13 x=%7=%10 y=%0=%2 alpha=%9 o16=%11\n"
        "#a0=%3 a1=%4 a2=%5 a3=%6"
        :
          "+m" (*(double (*)[n]) y),
          "+&r" (n),   // 1
          "+b" (y),    // 2
          "=b" (a0),   // 3
          "=&b" (a1),  // 4
          "=&b" (a2),  // 5
          "=&b" (a3)   // 6
        :
          "m" (*(const double (*)[n]) x),
          "m" (*(const double (*)[]) ap),
          "d" (alpha), // 9
          "r" (x),     // 10
          "b" (16),    // 11
          "3" (ap),    // 12
          "4" (lda)    // 13
        :
          "cr0",
          "vs32","vs33","vs34","vs35","vs36","vs37",
          "vs40","vs41","vs42","vs43","vs44","vs45","vs46","vs47"
    );
}
... On the other hand, ap can’t be the same as any of the other inputs, so an early-clobber
on a0 is not needed. It is also not desirable in this case. An
early-clobber on a0 would cause GCC to allocate a separate register
for the "m" (*(const double (*)[]) ap) input. Note that tying an
input to an output is the way to set up an initialized temporary
register modified by an asm statement. An input not tied to an output
is assumed by GCC to be unchanged...
I am totally confused by this description:
For this code there is no relationship between "m" (*(const double (*)[]) ap) and "=b" (a0). "=b" (a0) will share a register with "3" (ap), which holds the address of the input parameter, and "m" (*(const double (*)[]) ap) is the content of the first element of ap, so why would an early-clobber on a0 affect "m" (*(const double (*)[]) ap)?
Even if GCC allocated a new register for "m" (*(const double (*)[]) ap), I still don't understand what the problem would be. Since "=b" (a0) is tied to "3" (ap), we can still read / write through the register allocated for "=b" (a0), can't we?
This is an efficiency consideration, not correctness, stopping GCC from wasting instructions (and creating register pressure).
"m" (*(const double (*)[]) ap) isn't the first element, it's an arbitrary-length array, letting the compiler know that the entire array object is an input. But it's a dummy input; the asm template won't actually use that operand, instead looping over the array via the pointer input "3" (ap)
See How can I indicate that the memory *pointed* to by an inline ASM argument may be used? for more about this technique.
But "m" inputs are real inputs that have to work expand to an addressing mode if the template does use them, including after early-clobbers have clobbered their register.
With =&b(a0) / "3"(ap), GCC couldn't pick the same register as the base for an addressing mode for "m" (*(const double (*)[]) ap).
So it would have to waste an instruction ahead of the asm statement copying the address to another register. Also wasting that integer register.
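As a stripped-down illustration of the dummy-memory-operand pattern being discussed (the helper name and the empty template body are mine, not from the docs example):

/* Hypothetical helper, only to show the constraint pattern: "r" (ap) is the
 * pointer a real template would loop over, and the "m" operand is a dummy
 * telling GCC that the whole pointed-to array is read by the asm, so the data
 * can't be optimized away or reordered past the statement, without needing a
 * blunt "memory" clobber. */
static inline void consume_array(const double *ap)
{
    __asm__ ("" /* a real template would loop over the array via %0 */
             :
             : "r" (ap),
               "m" (*(const double (*)[]) ap));
}

The linked question ("How can I indicate that the memory *pointed* to by an inline ASM argument may be used?") covers the same pattern in more depth.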

After two mov in memory register system crash

I'm trying to perform some writes to memory-mapped registers from user space (through a custom driver). I want to write three 64-bit integers, and I initialized the variables value_1, value_2 and value_3 as uint64_t.
I have to use GCC inline MOV instructions, and I'm working on a 64-bit ARM architecture running a custom Linux for an embedded system.
This is my code:
asm ( "MOV %[reg1], %[val1]\t\n"
"MOV %[reg2], %[val2]\t\n"
"MOV %[reg3], %[val3]\t\n"
:[reg1] "=&r" (*register_1),[arg2] "=&r" (*register_2), [arg3] "=&r" (*register_3)
:[val1] "r"(value_1),[val2] "r" (value_2), [val3] "r" (value_3)
);
The problem is strange...
If I perform just two MOVs, the code works.
If I perform all three MOVs, the entire system crashes and I have to reboot.
Even stranger...
If I put a printf or even a nanosleep of 0 nanoseconds between the second and the third MOV, the code works!
I looked around trying to find a solution, and I also tried adding a "memory" clobber:
asm ( "MOV %[reg1], %[val1]\t\n"
"MOV %[reg2], %[val2]\t\n"
"MOV %[reg3], %[val3]\t\n"
:[reg1] "=&r" (*register_1),[arg2] "=&r" (*register_2), [arg3] "=&r" (*register_3)
:[val1] "r"(value_1),[val2] "r" (value_2), [val3] "r" (value_3)
:"memory"
);
...doesn't work!
I also used the memory barrier macro between the second and the third MOV, and after all three MOVs:
asm volatile("" : : : "memory");
...doesn't work!
I also tried to write directly into the registers using pointers, and I got the same behaviour: after the second write the system crashes...
Can anybody suggest a solution, or tell me if I'm using the GCC inline MOV or the memory barrier in the wrong way?
----> MORE DETAILS <-----
This is my main:
int main()
{
    int dev_fd;
    volatile void *base_addr = NULL;
    volatile uint64_t *reg1_addr = NULL;
    volatile uint32_t *reg2_addr = NULL;
    volatile uint32_t *reg3_addr = NULL;

    dev_fd = open(MY_DEVICE, O_RDWR);
    if (dev_fd < 0)
    {
        perror("Open call failed");
        return -1;
    }

    base_addr = mmap(NULL, PAGE_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, dev_fd, 0);
    if (base_addr == MAP_FAILED)
    {
        perror("mmap operation failed");
        return -1;
    }

    printf("BASE ADDRESS VIRT: 0x%p\n", base_addr);

    /* Preparing the registers */
    reg1_addr = base_addr + REG1_OFF;
    reg2_addr = base_addr + REG2_OFF;
    reg3_addr = base_addr + REG3_OFF;

    uint64_t val_1 = 0xEEEEEEEE;
    uint64_t val_2 = 0x00030010;
    uint64_t val_3 = 0x01;

    asm ( "str %[val1], %[reg1]\t\n"
          "str %[val2], %[reg2]\t\n"
          "str %[val3], %[reg3]\t\n"
        : [reg1] "=&m" (*reg1_addr), [reg2] "=&m" (*reg2_addr), [reg3] "=&m" (*reg3_addr)
        : [val1] "r" (val_1), [val2] "r" (val_2), [val3] "r" (val_3)
        );

    printf("--- END ---\n");
    close(dev_fd);
    return 0;
}
This is the compiler's output for the asm statement (Linaro toolchain; I cross-compile):
400bfc: f90013a0 str x0, [x29,#32]
400c00: f94027a3 ldr x3, [x29,#72]
400c04: f94023a4 ldr x4, [x29,#64]
400c08: f9402ba5 ldr x5, [x29,#80]
400c0c: f9401ba0 ldr x0, [x29,#48]
400c10: f94017a1 ldr x1, [x29,#40]
400c14: f94013a2 ldr x2, [x29,#32]
400c18: f9000060 str x0, [x3]
400c1c: f9000081 str x1, [x4]
400c20: f90000a2 str x2, [x5]
Thank you!
I tried with *reg1_addr = val_1; and I have the same problem.
Then this code isn't the problem. Avoiding asm is just a cleaner way to get equivalent machine code, without having to use inline asm. Your problem is more likely your choice of registers and values, or the kernel driver.
Or do you need the values to be in CPU registers before writing the first mmaped location, to avoid loading anything from the stack between stores? That's the only reason I can think of that you'd need inline asm, where compiler-generated stores might not be equivalent.
Answer to original question:
An "=&r" output constraint means a CPU register. So your inline-asm instructions will run in that order, assembling to something like
mov x0, x5
mov x1, x6
mov x2, x7
And then after that, compiler-generated code will store the values back to memory in some unspecified order. That order depends on how it chooses to generate code for the surrounding C. This is probably why changing the surrounding code changes the behaviour.
One solution might be "=&m" constraints with str instructions, so your asm does actually store to memory. str %[val1], %[reg1] because STR instructions take the addressing mode as the 2nd operand, even though it's the destination.
Why can't you just assign through a volatile uint64_t * pointer (e.g. volatile uint64_t *reg = register_1; then *reg = value_1;) like a normal person, to have the compiler emit store instructions that it's not allowed to reorder or optimize away? MMIO is exactly what volatile is for.
Doesn't Linux have macros or functions for doing MMIO loads/stores?
If you're having problems with inline asm, step 1 in debugging should be to look at the actual asm emitted by the compiler when it filled in the asm template, and the surrounding code.
Then single-step by instructions through the code (with GDB stepi, maybe in layout reg mode).
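For reference, the plain-C version being suggested here (reusing the names from the question's main(); whether the device also needs explicit barriers between the accesses is a separate question this sketch doesn't answer) would be roughly:

    /* Each volatile dereference forces the compiler to emit exactly one store,
     * in source order, without caching or reordering them relative to each other. */
    *reg1_addr = val_1;
    *reg2_addr = val_2;
    *reg3_addr = val_3;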

gcc arm inline assembler %e0 and %f0 operand modifiers for 16-byte NEON operands?

I found the following inline assembler code to calculate a vector cross product:
float32x4_t cross_test( const float32x4_t& lhs, const float32x4_t& rhs )
{
    float32x4_t result;
    asm volatile(
        "vext.8 d6, %e2, %f2, #4 \n\t"
        "vext.8 d7, %e1, %f1, #4 \n\t"
        "vmul.f32 %e0, %f1, %e2 \n\t"
        "vmul.f32 %f0, %e1, d6 \n\t"
        "vmls.f32 %e0, %f2, %e1 \n\t"
        "vmls.f32 %f0, %e2, d7 \n\t"
        "vext.8 %e0, %f0, %e0, #4 "
        : "+w" ( result )
        : "w" ( lhs ), "w" ( rhs )
        : "d6", "d7" );
    return result;
}
What do the modifiers e and f after '%' mean (e.g. %e2)? I cannot find any reference for them.
This is the assembler code generated by gcc:
vext.8 d6, d20, d21, #4
vext.8 d7, d18, d19, #4
vmul.f32 d16, d19, d20
vmul.f32 d17, d18, d6
vmls.f32 d16, d21, d18
vmls.f32 d17, d20, d7
vext.8 d16, d17, d16, #4
I now understand the meaning of the modifiers, so I tried to follow the cross-product algorithm. For this I added some additional comments to the assembler code, but the result does not match my expectation:
// Notation:
// - '%e' = lower register part
// - '%f' = higher register part
// - '%?0' = res = [ x2 y2 | z2 v2 ]
// - '%?1' = lhs = [ x0 y0 | z0 v0 ]
// - '%?2' = rhs = [ x1 y1 | z1 v1 ]
// - '%e0' = [ x2 y2 ]
// - '%f0' = [ z2 v2 ]
// - '%e1' = [ x0 y0 ]
// - '%f1' = [ z0 v0 ]
// - '%e2' = [ x1 y1 ]
// - '%f2' = [ z1 v1 ]
// Implemented algorithm:
// |x2| |y0 * z1 - z0 * y1|
// |y2| = |z0 * x1 - x0 * z1|
// |z2| |x0 * y1 - y0 * x1|
asm (
    "vext.8 d6, %e2, %f2, #4 \n\t"  // e2=[ x1 y1 ], f2=[ z1 v1 ] -> d6=[ v1 x1 ]
    "vext.8 d7, %e1, %f1, #4 \n\t"  // e1=[ x0 y0 ], f1=[ z0 v0 ] -> d7=[ v0 x0 ]
    "vmul.f32 %e0, %f1, %e2 \n\t"   // f1=[ z0 v0 ], e2=[ x1 y1 ] -> e0=[ z0 * x1, v0 * y1 ]
    "vmul.f32 %f0, %e1, d6 \n\t"    // e1=[ x0 y0 ], d6=[ v1 x1 ] -> f0=[ x0 * v1, y0 * x1 ]
    "vmls.f32 %e0, %f2, %e1 \n\t"   // f2=[ z1 v1 ], e1=[ x0 y0 ] -> e0=[ z0 * x1 - z1 * x0, v0 * y1 - v1 * y0 ] = [ y2, - ]
    "vmls.f32 %f0, %e2, d7 \n\t"    // e2=[ x1 y1 ], d7=[ v0 x0 ] -> f0=[ x0 * v1 - x1 * v0, y0 * x1 - y1 * x0 ] = [ -, - ]
    "vext.8 %e0, %f0, %e0, #4 "     //
    : "+w" ( result )               // Output section: 'w'='VFP floating point register', '+'='read/write'
    : "w" ( lhs ), "w" ( rhs )      // Input section : 'w'='VFP floating point register'
    : "d6", "d7" );                 // Temporary 64[bit] register.
First of all, this is weird: result isn't initialized before the asm statement, but it's used as an input/output operand with "+w" ( result ). I think "=&w" (result) would be better (a pure output, early-clobbered because the template writes %0 before it has finished reading %1 and %2). It also makes no sense for this to be volatile; the output is a pure function of the inputs with no side effects or dependency on any "hidden" inputs, so the same inputs give the same result every time. Omitting volatile would therefore allow the compiler to CSE the statement and hoist it out of loops where possible, instead of being forced to re-run it every time it's reached in the source.
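Concretely (untested), the operand section would then read:

    asm ("..."                      // same template as above
         : "=&w" ( result )         // pure output, early-clobber
         : "w" ( lhs ), "w" ( rhs )
         : "d6", "d7" );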
I couldn't find any reference either; the gcc manual's Extended ASM page only documents operand modifiers for x86, not ARM.
But I think we can see what the operand modifiers do by looking at the asm output:
%e0 is substituted with d16, %f0 with d17. %e1 is d18 and %f1 is d19. %2 is in d20 and d21.
Your inputs are 16-byte NEON vectors, in q registers. In ARM32, the upper and lower half of each q register is separately accessible as a d register. (Unlike AArch64 where each s / d register is the bottom element of a different q reg.) It looks like this code is taking advantage of this to shuffle for free by using 64-bit SIMD on the high and low pair of floats, after doing a 4-byte vext shuffle to mix those pairs of floats.
%e[operand] is the low d register of an operand, %f[operand] is the high d register. They're not documented, but the gcc source code says (in arm_print_operand in gcc/config/arm/arm.c#L22486):
These two codes print the low/high doubleword register of a Neon quad
register, respectively. For pair-structure types, can also print
low/high quadword registers.
I didn't test what happens if you apply these modifiers to 64-bit operands like float32x2_t, and this is all just me reverse-engineering from one example. But it makes perfect sense that there would be modifiers for this.
x86 modifiers include ones for the low and high 8 bits of integer registers (so you can get AL / AH if your input is in EAX), so partial-register access is definitely something that GNU C inline asm operand modifiers can do.
Beware that undocumented means unsupported.
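If you want to check what these modifiers expand to on your own toolchain (a quick experiment, not a supported interface; the function name is mine), an asm template that only emits an assembler comment is enough on an ARM32 NEON target; compile with -S and read the substituted comment:

#include <arm_neon.h>

// The template is just an ARM assembler comment ('@ ...'), so no instruction
// is emitted; after substitution it shows which d registers %e0 and %f0 name
// for the q register holding v.
float32x4_t show_halves(float32x4_t v)
{
    asm ("@ low half = %e0, high half = %f0" : "+w" (v));
    return v;
}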
I was looking for the meaning of %e0 and %f0, and this topic was very helpful. The cross_test() output can be explained as follows:
#include <arm_neon.h>
#include <stdio.h>

float32x4_t cross_test(const float32x4_t& lhs, const float32x4_t& rhs) {
    float32x4_t result;
    //   |      f      |      e
    // -----------------------------
    // 1 | a3(4) a2(3) | a1(2) a0(1)
    // 2 | b3(5) b2(6) | b1(7) b0(8)
    asm volatile (
        "vext.8 d6, %e1, %f1, #4" "\n"  // a2, a1
        "vext.8 d7, %e2, %f2, #4" "\n"  // b2, b1
        "vmul.f32 %e0, %f1, %e2" "\n"   // a3*b1, a2*b0
        "vmul.f32 %f0, %e1, d7" "\n"    // a1*b2, a0*b1
        "vmls.f32 %e0, %f2, %e1" "\n"   // a3*b1-a1*b3(18), a2*b0-a0*b2(18)
        "vmls.f32 %f0, %e2, d6" "\n"    // a1*b2-a2*b1(-9), a0*b1-a1*b0(-9)
        "vext.8 %e0, %f0, %e0, #4" "\n" // a2*b0-a0*b2(18), a1*b2-a2*b1(-9)
        : "+w"(result)  // %0
        : "w"(lhs),     // %1
          "w"(rhs)      // %2
        : "d6", "d7"
    );
    return result;
}

#define nforeach(i, count) \
    for (int i = 0, __count = static_cast<int>(count); i < __count; ++i)

#define dump_f128(qf) do { \
    float *fp = reinterpret_cast<float *>(&qf); \
    puts(#qf ":"); \
    nforeach(i, 4) { \
        printf("[%d]%f\n", i, fp[i]); \
    } \
} while (0)

int main() {
    float fa[] = {1., 2., 3., 4.};
    float fb[] = {8., 7., 6., 5.};
    float32x4_t qa, qb, qres;

    qa = vld1q_f32(const_cast<const float *>(&fa[0]));
    qb = vld1q_f32(const_cast<const float *>(&fb[0]));
    qres = cross_test(qa, qb);

    dump_f128(qa);
    puts("---");
    dump_f128(qb);
    puts("---");
    // -9, 18, -9, -9
    dump_f128(qres);
    return 0;
}

Neon Optimization for multiplication and store in ARM

Using an ARM Cortex-A15 board, I'm trying to optimize a perfectly working piece of C code using NEON intrinsics.
compiler: gcc 4.7 on ubuntu 12.04
Flags:-g -O3 -mcpu=cortex-a15 -mfpu=neon-vfpv4 -ftree-vectorize -DDRA7XX_ARM -DARM_PROC -DSL -funroll-loops -ftree-loop-ivcanon -mfloat-abi=hard
I wanted to write the following function; it's just a simple load->multiply->store.
Here are some parameters:
*input is a pointer to an array of size 40680; after completing the loop, the pointer should retain its current position so the same can be done for the next input stream via the input pointer.
float32_t A = 0.7;
float32_t *ptr_op = (float*)output[9216];
float32x2_t reg1;

for (i = 0; i < 4608; i += 4) {
    /* output[(2*i)] = A*(*input);   // C version
       input++;
       output[(2*i)+1] = A*(*input);
       input++; */
    reg1 = vld1q_f32(input++);       // Neon version
    R_N = vmulq_n_f32(reg1, A);
    vst1q_f32(ptr_op++, R_N);
}
I want to understand where I am making a mistake in this loop, because it seems pretty straightforward.
Here is my assembly implementation of the same. Am I going in the correct direction?
__asm__ __volatile__(
    "\t mov r4, #0\n"
    "\t vdup.32 d1,%3\n"
    "Lloop2:\n"
    "\t cmp r4, %2\n"
    "\t bge Lend2\n"
    "\t vld1.32 d0, [%0]!\n"
    "\t vmul.f32 d0, d0, d1\n"
    "\t vst1.32 d0, [%1]!\n"
    "\t add r4, r4, #2\n"
    "\t b Lloop2\n"
    "Lend2:\n"
    : "=r"(input), "=r"(ptr_op), "=r"(length), "=r"(A)
    : "0"(input), "1"(ptr_op), "2"(length), "3"(A)
    : "cc", "r4", "d1", "d0");
Hmmmmm, does your code compile in the first place? I didn't know that you can multiply a vector by a float scalar. Probably the compiler converted it for you.
Anyway, you have to understand that most NEON instructions have long latencies. Unless you hide them properly, your code won't be any faster than the standard C version, if not slower.
vld1q..... // 1 cycle
// 4 cycles latency + potential cache miss penalty
vmulq..... // 2 cycles
// 6 cycles latency
vst1q..... // 1 cycle
// 2 cycles loop overhead
The example above roughly shows the cycles required for each iteration.
And as you can see, it's a minimum of 18 cycles per iteration, of which only 4 cycles are spent on actual computation while 14 cycles are wasted doing nothing.
This is called a RAW (read-after-write) dependency.
The easiest and practically only way to hide these latencies is loop unrolling: a deep one.
Unrolling by four vectors per iteration is usually sufficient, and eight is even better, if you don't mind the code length.
void vecMul(float *pDst, float *pSrc, float coeff, int length)
{
    const float32x4_t scal = vmovq_n_f32(coeff);
    float32x4x4_t veca, vecb;

    length -= 32;

    if (length >= 0)
    {
        while (1)
        {
            do
            {
                length -= 32;
                veca = vld1q_f32_x4(pSrc); pSrc += 16;   // each x4 load consumes 16 floats
                vecb = vld1q_f32_x4(pSrc); pSrc += 16;
                veca.val[0] = vmulq_f32(veca.val[0], scal);
                veca.val[1] = vmulq_f32(veca.val[1], scal);
                veca.val[2] = vmulq_f32(veca.val[2], scal);
                veca.val[3] = vmulq_f32(veca.val[3], scal);
                vecb.val[0] = vmulq_f32(vecb.val[0], scal);
                vecb.val[1] = vmulq_f32(vecb.val[1], scal);
                vecb.val[2] = vmulq_f32(vecb.val[2], scal);
                vecb.val[3] = vmulq_f32(vecb.val[3], scal);
                vst1q_f32_x4(pDst, veca); pDst += 16;
                vst1q_f32_x4(pDst, vecb); pDst += 16;
            } while (length >= 0);

            if (length <= -32) return;

            // rewind so the final (overlapping) iteration ends exactly at the end
            pSrc += length;
            pDst += length;
        }
    }

    ///////////////////////////////////////////////////////////////
    // remainder for length < 32: lanes that aren't loaded below are
    // never stored, thanks to the identical masks on the store side

    if (length & 16)
    {
        veca = vld1q_f32_x4(pSrc); pSrc += 16;
    }
    if (length & 8)
    {
        vecb.val[0] = vld1q_f32(pSrc); pSrc += 4;
        vecb.val[1] = vld1q_f32(pSrc); pSrc += 4;
    }
    if (length & 4)
    {
        vecb.val[2] = vld1q_f32(pSrc); pSrc += 4;
    }
    if (length & 2)
    {
        vecb.val[3] = vld1q_lane_f32(pSrc++, vecb.val[3], 0);
        vecb.val[3] = vld1q_lane_f32(pSrc++, vecb.val[3], 1);
    }
    if (length & 1)
    {
        vecb.val[3] = vld1q_lane_f32(pSrc, vecb.val[3], 2);
    }

    veca.val[0] = vmulq_f32(veca.val[0], scal);
    veca.val[1] = vmulq_f32(veca.val[1], scal);
    veca.val[2] = vmulq_f32(veca.val[2], scal);
    veca.val[3] = vmulq_f32(veca.val[3], scal);
    vecb.val[0] = vmulq_f32(vecb.val[0], scal);
    vecb.val[1] = vmulq_f32(vecb.val[1], scal);
    vecb.val[2] = vmulq_f32(vecb.val[2], scal);
    vecb.val[3] = vmulq_f32(vecb.val[3], scal);

    if (length & 16)
    {
        vst1q_f32_x4(pDst, veca); pDst += 16;
    }
    if (length & 8)
    {
        vst1q_f32(pDst, vecb.val[0]); pDst += 4;
        vst1q_f32(pDst, vecb.val[1]); pDst += 4;
    }
    if (length & 4)
    {
        vst1q_f32(pDst, vecb.val[2]); pDst += 4;
    }
    if (length & 2)
    {
        vst1q_lane_f32(pDst++, vecb.val[3], 0);
        vst1q_lane_f32(pDst++, vecb.val[3], 1);
    }
    if (length & 1)
    {
        vst1q_lane_f32(pDst, vecb.val[3], 2);
    }
}
Now we are dealing with eight independent vectors per iteration, hence the latencies are completely hidden, and the potential cache-miss penalty as well as the flat loop overhead are greatly diminished.
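With the sizes given in the question (9216 contiguous elements per call, reusing the question's variable names purely as an illustration), the call site would be something like the following; since pSrc is passed by value, the caller advances its own pointer to keep it at the current position for the next stream:

    vecMul(output, input, A, 9216);
    input += 9216;   /* input keeps its position for the next input stream */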

Which is faster for reverse iteration, for or while loops?

I am trying to implement the standard memmove function in Rust and I was wondering which method is faster for downwards iteration (where src < dest):
for i in (0..n).rev() {
    // Do copying
}

or

let mut i = n;
while i != 0 {
    i -= 1;
    // Do copying
}
Will the rev() in the for-loop version significantly slow it down?
TL;DR: Use the for loop.
Both should be equally fast. We can check the compiler's ability to peel away the layers of abstraction involved in the for loop quite simply:
#[inline(never)]
fn blackhole() {}

#[inline(never)]
fn with_for(n: usize) {
    for i in (0..n).rev() { blackhole(); }
}

#[inline(never)]
fn with_while(n: usize) {
    let mut i = n;
    while i > 0 {
        blackhole();
        i -= 1;
    }
}
This generates this LLVM IR:
; Function Attrs: noinline nounwind readnone uwtable
define internal void @_ZN8with_for20h645c385965fcce1fhaaE(i64) unnamed_addr #0 {
entry-block:
  ret void
}

; Function Attrs: noinline nounwind readnone uwtable
define internal void @_ZN10with_while20hc09c3331764a9434yaaE(i64) unnamed_addr #0 {
entry-block:
  ret void
}
Even if you are not versed in LLVM, it is obvious that both functions compiled down to the same IR (and thus obviously to the same assembly).
Since their performance is the same, one should prefer the more explicit for loop and reserve the while loop for cases where the iteration is irregular.
EDIT: to address starblue's concern that the example above is unfit for the comparison (both loops were optimized away entirely).
#[link(name = "snappy")]
extern {
fn blackhole(i: libc::c_int) -> libc::c_int;
}
#[inline(never)]
fn with_for(n: i32) {
for i in (0..n).rev() { unsafe { blackhole(i as libc::c_int); } }
}
#[inline(never)]
fn with_while(n: i32) {
let mut i = n;
while i > 0 {
unsafe { blackhole(i as libc::c_int); }
i -= 1;
}
}
compiles down to:
; Function Attrs: noinline nounwind uwtable
define internal void @_ZN8with_for20h7cf06f33e247fa35maaE(i32) unnamed_addr #1 {
entry-block:
  %1 = icmp sgt i32 %0, 0
  br i1 %1, label %match_case.preheader, label %clean_ast_95_

match_case.preheader:                  ; preds = %entry-block
  br label %match_case

match_case:                            ; preds = %match_case.preheader, %match_case
  %.in = phi i32 [ %2, %match_case ], [ %0, %match_case.preheader ]
  %2 = add i32 %.in, -1
  %3 = tail call i32 @blackhole(i32 %2)
  %4 = icmp sgt i32 %2, 0
  br i1 %4, label %match_case, label %clean_ast_95_.loopexit

clean_ast_95_.loopexit:                ; preds = %match_case
  br label %clean_ast_95_

clean_ast_95_:                         ; preds = %clean_ast_95_.loopexit, %entry-block
  ret void
}

; Function Attrs: noinline nounwind uwtable
define internal void @_ZN10with_while20hee8edd624cfe9293IaaE(i32) unnamed_addr #1 {
entry-block:
  %1 = icmp sgt i32 %0, 0
  br i1 %1, label %while_body.preheader, label %while_exit

while_body.preheader:                  ; preds = %entry-block
  br label %while_body

while_exit.loopexit:                   ; preds = %while_body
  br label %while_exit

while_exit:                            ; preds = %while_exit.loopexit, %entry-block
  ret void

while_body:                            ; preds = %while_body.preheader, %while_body
  %i.05 = phi i32 [ %3, %while_body ], [ %0, %while_body.preheader ]
  %2 = tail call i32 @blackhole(i32 %i.05)
  %3 = add nsw i32 %i.05, -1
  %4 = icmp sgt i32 %i.05, 1
  br i1 %4, label %while_body, label %while_exit.loopexit
}
The core loops are:
; -- for loop
match_case:                            ; preds = %match_case.preheader, %match_case
  %.in = phi i32 [ %2, %match_case ], [ %0, %match_case.preheader ]
  %2 = add i32 %.in, -1
  %3 = tail call i32 @blackhole(i32 %2)
  %4 = icmp sgt i32 %2, 0
  br i1 %4, label %match_case, label %clean_ast_95_.loopexit

; -- while loop
while_body:                            ; preds = %while_body.preheader, %while_body
  %i.05 = phi i32 [ %3, %while_body ], [ %0, %while_body.preheader ]
  %2 = tail call i32 @blackhole(i32 %i.05)
  %3 = add nsw i32 %i.05, -1
  %4 = icmp sgt i32 %i.05, 1
  br i1 %4, label %while_body, label %while_exit.loopexit
And the only difference is that:
for decrements before calling blackhole, while decrements after
for compares against 0, while compares against 1
otherwise, it's the same core loop.
In short: They are (nearly) equally fast -- use the for loop!
Longer version:
First: rev() only works for iterators that implement DoubleEndedIterator, which provides a next_back() method. This method is expected to run in o(n) (sublinear time), usually even O(1) (constant time). And indeed, by looking at the implementation of next_back() for Range, we can see that it runs in constant time.
Now we know that both versions have asymptotically identical runtime. If this is the case, you should usually stop thinking about it and use the solution that is more idiomatic (which is for in this case). Thinking about optimization too early often decreases programming productivity, because performance matters only in a tiny percentage of all code you write.
But since you are implementing memmove, performance might actually really matter to you. So let's look at the resulting ASM. I used this code:
#![feature(start)]
#![feature(test)]
extern crate test;

#[inline(never)]
#[no_mangle]
fn with_for(n: usize) {
    for i in (0..n).rev() {
        test::black_box(i);
    }
}

#[inline(never)]
#[no_mangle]
fn with_while(n: usize) {
    let mut i = n;
    while i > 0 {
        test::black_box(i);
        i -= 1;
    }
}

#[start]
fn main(_: isize, vargs: *const *const u8) -> isize {
    let random_enough_value = unsafe {
        **vargs as usize
    };
    with_for(random_enough_value);
    with_while(random_enough_value);
    0
}
(Playground Link)
The #[no_mangle] is to improve readability of the resulting ASM. The #[inline(never)], the random_enough_value and the black_box are used to prevent LLVM from optimizing away the things we want to look at. The generated ASM of this (in release mode!), after some cleanup, looks like:
with_for: | with_while:
testq %rdi, %rdi | testq %rdi, %rdi
je .LBB0_3 | je .LBB1_3
decq %rdi |
leaq -8(%rsp), %rax | leaq -8(%rsp), %rax
.LBB0_2: | .LBB1_2:
movq %rdi, -8(%rsp) | movq %rdi, -8(%rsp)
decq %rdi | decq %rdi
cmpq $-1, %rdi |
jne .LBB0_2 | jne .LBB1_2
.LBB0_3: | .LBB1_3:
retq | retq
The only difference is that with_while has two fewer instructions, because it counts down to 0, whereas with_for counts down to -1.
Conclusion: if you can tell that the asymptotic runtime is optimal, you should probably not think about optimization at all. Modern optimizers are clever enough to compile high level constructs down to pretty perfect ASM. Often, data layout and resulting cache efficiency is much more important than a minimal count of instructions, anyway.
If you actually need to think about optimization though, look at the ASM (or LLVM IR). In this case the for loop is actually a bit slower (more instructions, comparison with -1 instead of 0). But the number of cases where a Rust programmer should care about this is probably minuscule.
For small N, it really shouldn't matter.
Rust is lazy on iterators; 0..n won't cause any evaluation until you actually ask for an element. rev() asks for the last element first. As far as I know, the Rust counter iterator is clever and doesn't need to generate the first N-1 elements to get the Nth one. In this specific case, the rev method is probably even faster.
In the general case, it depends on what kind of access paradigm and access time your iterator has; make sure that accessing the end takes constant time, and it doesn't make a difference.
As with all benchmarking questions, it depends. Test for your N values yourself!
Premature optimization is also evil, so if your N is small, and your loop isn't done very often... don't worry.
