Replicate LLVM instructions - C++11

I'm trying to replicate an instruction (an addition binary operation, for example) and show the copies in the LLVM IR, but the following code only keeps the first instruction (add1) that I built. How can I keep both built instructions?
IRBuilder<> builder(op);
Value *lhs = op->getOperand(0);
Value *rhs = op->getOperand(1);
Value *add1 = builder.CreateAdd(lhs, rhs);
Value *add2 = builder.CreateAdd(lhs, rhs);
for (auto &U : op->uses()) {
    User *user = U.getUser(); // A User is anything with operands.
    user->setOperand(U.getOperandNo(), add1);
    user->setOperand(U.getOperandNo(), add2);
}

Assume an add instruction. You have a BinaryOperator with two operands, e.g.: %op = add i32 10, 32
You take them as Value *lhs = op->getOperand(0); and Value *rhs = op->getOperand(1);
So far so good. Now you are creating two new add instructions before the actual add, since you are constructing your IRBuilder with op as the insertion point.
%add1 = add i32 10, 32
%add2 = add i32 10, 32
%op = add i32 10, 32
Finally, you update the users of your original instruction, e.g. another BinaryOperator such as: %0 = mul i32 %op, %op
When you look closely at your loop, you will see that you set both add1 and add2 as the same operand of the user, so the second call overwrites the first. After your loop the multiplication will look like %0 = mul i32 %add2, %add2
If you dump the BasicBlock where the instructions are inserted directly after insertion, you should see something like:
%add1 = add i32 10, 32
%add2 = add i32 10, 32
%op = add i32 10, 32
%0 = mul i32 %add2, %add2
But if you run another LLVM Pass that performs dead code elimination (e.g., InstCombine) you will end up with:
%add2 = add i32 10, 32
%0 = mul i32 %add2, %add2
This happens because add1 has no users: the second setOperand call immediately replaced the use of add1 with add2. And op is also gone, because all of its users now use add2 instead of op.
From your question it is hard to guess what you intended with your code, but this is why you see only one of your instructions in the final IR.
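If the goal is for both newly built instructions to survive later passes, each one needs at least one user. A minimal sketch (assuming you simply want both adds kept alive; adjust the operands to whatever you actually mean to compute):

IRBuilder<> builder(op);
Value *lhs = op->getOperand(0);
Value *rhs = op->getOperand(1);

// add1 is used as an operand of add2, so it has a user and will not be
// removed as dead code.
Value *add1 = builder.CreateAdd(lhs, rhs, "add1");
Value *add2 = builder.CreateAdd(add1, rhs, "add2");

// Redirect every user of the original instruction to the new value,
// then drop the now-dead original.
op->replaceAllUsesWith(add2);
op->eraseFromParent();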

Related

What is the fastest way to handle overflow on integer division/remainder without panic?

I'm still improving overflower to handle integer overflow. One goal was to be able to use #[overflow(wrap)] to avoid panics on overflow. However, I found out that the .wrapping_div(_) and .wrapping_rem(_) functions of the standard integer types do in fact panic when dividing by zero. Edit: To motivate this use case better: within interrupt handlers, we absolutely want to avoid panics. I assume that the div-by-zero condition is highly unlikely, but we still need to return a "valid" value for some definition of valid.
One possible solution is saturating the value (which I do when code is annotated with #[overflow(saturate)]), but this is likely relatively slow (especially since other operations are also saturated). So I want to add an #[overflow(no_panic)] mode that avoids panics completely and is almost as fast as #[overflow(wrap)] in all cases.
My question is: What is the fastest way to return something (don't care what) without panicking on dividing (or getting the remainder) by zero?
Disclaimer: this isn't really a serious answer. It is almost certainly slower than the naive solution of using an if statement to check whether the divisor is zero.
#![feature(asm)]
fn main() {
    println!("18 / 3 = {}", f(18, 3));
    println!("2555 / 10 = {}", f(2555, 10));
    println!("-16 / 3 = {}", f(-16, 3));
    println!("7784388 / 0 = {}", f(7784388, 0));
}

fn f(x: i32, y: i32) -> i32 {
    let z: i32;
    unsafe {
        asm!(
            "
            test %ecx, %ecx
            lahf
            and $$0x4000, %eax
            or %eax, %ecx
            mov %ebx, %eax
            cdq
            idiv %ecx
            "
            : "={eax}"(z)
            : "{ebx}"(x), "{ecx}"(y)
            : "{edx}"
        );
    }
    z
}
Rust Playground
pub fn nopanic_signed_div(x: i32, y: i32) -> i32 {
    if y == 0 || y == -1 {
        // Divide by -1 is equivalent to neg; we don't care what
        // divide by zero returns.
        x.wrapping_neg()
    } else {
        // (You can replace this with unchecked_div to make it more
        // obvious this will never panic.)
        x / y
    }
}
This produces the following on x86-64 with "rustc 1.11.0-nightly (6e00b5556 2016-05-29)":
    movl %edi, %eax
    leal 1(%rsi), %ecx
    cmpl $1, %ecx
    ja .LBB0_2
    negl %eax
    retq
.LBB0_2:
    cltd
    idivl %esi
    retq
It should produce something similar on other platforms.
At least one branch is necessary because LLVM IR considers divide by zero to be undefined behavior (and the -1 case matters too, since i32::MIN / -1 overflows). Checking for 0 and -1 separately would involve an extra branch. With those constraints, there isn't really any other choice.
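For comparison, here is the same guard written out by hand in C++ (a sketch, not taken from the answer above), including the unsigned-compare trick the compiler applied in the generated assembly:

// y == 0 || y == -1 collapses into one unsigned comparison: (unsigned)y + 1 <= 1.
// Both cases must be guarded, because x / 0 and INT_MIN / -1 are undefined in C++.
int nopanic_signed_div_cxx(int x, int y) {
    if (static_cast<unsigned>(y) + 1u <= 1u) {
        // Wrapping negation, computed in unsigned to avoid signed overflow;
        // the conversion back to int is the usual two's-complement wrap.
        return static_cast<int>(0u - static_cast<unsigned>(x));
    }
    return x / y;
}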
(It might be possible to come up with something slightly faster with inline assembly, but it would be a terrible idea because you would end up generating much worse code in the case of dividing by a constant.)
Whether this solution is actually appropriate probably depends on what your goal is; a divide by zero is probably a logic error, so silently accepting it seems like a bad idea.

getelementptr has -1 as the first index operand

I'm reading the IR of nginx generated by Clang. In the function ngx_event_expire_timers, there are some getelementptr instructions with i64 -1 as the first index operand. For example,
%handler = getelementptr inbounds %struct.ngx_rbtree_node_s, %struct.ngx_rbtree_node_s* %node.addr.0.i, i64 -1, i32 2
I know the first index operand will be used as an offset to the first operand. But what does a negative offset mean?
The GEP instruction is perfectly fine with negative indices.
In this case you have something like:
node arr[100];
node *ptr = &arr[50];
if ((ptr - 1)->value == ptr->value)
    // then ...
GEP with negative indices just calculates the offset from the base pointer in the other direction. There is nothing wrong with it.
Considering what is going on inside the nginx source code, the semantics of this getelementptr instruction are interesting. It's the result of two lines of C source code:
ev = (ngx_event_t *) ((char *) node - offsetof(ngx_event_t, timer));
ev->handler(ev);
node is of type ngx_rbtree_node_t, which is embedded as a member in ev's type ngx_event_t. That is, roughly:
struct ngx_event_t {
    ....
    struct ngx_rbtree_node_t timer;
    ....
};
struct ngx_event_t *ev;
struct ngx_rbtree_node_t *node;
timer is the name of the ngx_event_t member that node points to.
              |<-  ngx_rbtree_node_t   ->|
|<-                ngx_event_t                        ->|
--------------------------------------------------------
| (some data) |         "timer"          | (some data) |
--------------------------------------------------------
^             ^
ev            node
The diagram above shows the layout of an instance of ngx_event_t. The result of offsetof(ngx_event_t, timer) is 40, which means the data before timer occupies 40 bytes. And the size of ngx_rbtree_node_t is also 40 bytes, by coincidence. So the i64 -1 in the first index operand of the getelementptr instruction computes the base address of the ngx_event_t containing node, which is 40 bytes before node.
handler is another member of ngx_event_t, located 16 bytes past the base of ngx_event_t. By (another) coincidence, the third member of ngx_rbtree_node_t is also 16 bytes past the base address of ngx_rbtree_node_t. So the i32 2 in the getelementptr instruction adds 16 bytes to ev, giving the address of handler.
Note that the 16 bytes is computed from the layout of ngx_rbtree_node_t, not from ngx_event_t. Clang must have done some computation to ensure the correctness of the getelementptr instruction. Before the value of %handler is used, there is a bitcast instruction that casts %handler to a function pointer type.
What Clang has done does not follow the type conversions written in the C source code, but the result is exactly the same.
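A small C++ sketch of the pointer arithmetic this GEP encodes. The struct definitions are simplified stand-ins for the nginx types (the real ones have many more members, which is where the 40- and 16-byte offsets come from):

#include <cstddef>

struct ngx_rbtree_node_s { unsigned key; void *left, *right, *parent; };

struct ngx_event_s {
    void *data;                          // ... other members ...
    void (*handler)(ngx_event_s *ev);
    ngx_rbtree_node_s timer;             // this is what node points to
};

void fire(ngx_rbtree_node_s *node) {
    // Step back from the embedded timer node to its containing event;
    // this subtraction is what the i64 -1 index expresses in the IR.
    ngx_event_s *ev = reinterpret_cast<ngx_event_s *>(
        reinterpret_cast<char *>(node) - offsetof(ngx_event_s, timer));
    ev->handler(ev);
}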

How to divide by 9 using just shifts/add/sub?

Last week I was in an interview and there was a test like this:
Calculate N/9 (given that N is a positive integer), using only
SHIFT LEFT, SHIFT RIGHT, ADD, SUBTRACT instructions.
First, find the representation of 1/9 in binary:
0.0001110001110001
which means it is (1/16) + (1/32) + (1/64) + (1/1024) + (1/2048) + (1/4096) + (1/65536),
so x/9 is approximately (x>>4) + (x>>5) + (x>>6) + (x>>10) + (x>>11) + (x>>12) + (x>>16).
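Written out directly in C++, the formula above is just (a sketch; note that each shift truncates, so the sum can come out low, e.g. it yields 0 for x = 9):

unsigned div9_approx(unsigned x) {
    return (x >> 4) + (x >> 5) + (x >> 6)
         + (x >> 10) + (x >> 11) + (x >> 12)
         + (x >> 16);
}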
Possible optimization (if loops are allowed):
if you loop over 0001110001110001b, shifting it right on each iteration,
add "x" to your result register whenever the carry was set by that shift,
and shift your result right once each time afterwards,
then your result is x/9:
    mov cx, 16      ; assuming 16 bit registers
    mov bx, 7281    ; bit mask of 2^16 * (1/9)
    mov ax, 8166    ; sample value, (1/9 of it is 907)
    mov dx, 0       ; dx holds the result
div9:
    inc ax          ; or "add ax,1" if inc's not allowed :)
                    ; workaround for the fact that 7/64
                    ; are a bit less than 1/9
    shr bx, 1
    jnc no_add
    add dx, ax
no_add:
    shr dx, 1
    dec cx
    jnz div9
(I currently cannot test this; it may be wrong.)
You can use a fixed-point math trick:
you scale up so that the significant fractional part moves into the integer range, do the math you need, and then scale back.
a/9 = ((a*10000)/9)/10000
As you can see, I scaled by 10000. Now the integer part of 10000/9 = 1111 is big enough, so I can write:
a/9 = ~a*1111/10000
Power of 2 scale
If you use a power-of-2 scale, then you just need a bit shift instead of a division. You have to compromise between precision and input value range. I empirically found that with 32-bit arithmetic the best scale for this is 1<<18, so:
(((a+1)<<18)/9)>>18 = ~a/9;
The (a+1) corrects the rounding errors back to the right range.
Hardcoded multiplication
Rewrite the multiplication constant into binary:
q = (1<<18)/9 = 29127 = 0111 0001 1100 0111 bin
Now if you need to compute c = a*q, use hard-coded binary multiplication: for each 1 bit of q you add a<<(position_of_1) to c. If you see a run like 111 you can rewrite it as 1000-1, minimizing the number of operations.
If you put all of this together, you should get something like this C++ code of mine:
DWORD div9(DWORD a)
{
    // ((a+1)*q)>>18 = (((a+1)<<18)/9)>>18 = ~a/9;
    // q = (1<<18)/9 = 29127 = 0111 0001 1100 0111 bin
    // valid for a = < 0 , 147455 >
    DWORD c;
    c =(a<< 3)-(a    );  // c = a*29127
    c+=(a<< 9)-(a<< 6);
    c+=(a<<15)-(a<<12);
    c+=29127;            // c = (a+1)*29127
    c>>=18;              // c = ((a+1)*29127)>>18
    return c;
}
Now, if you look at the binary form, the pattern 111000 repeats, so you can further improve the code a bit:
DWORD div9(DWORD a)
{
    DWORD c;
    c =(a<<3)-a;        // first pattern
    c+=(c<<6)+(c<<12);  // and the other 2...
    c+=29127;
    c>>=18;
    return c;
}
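A quick way to sanity-check the routine against the compiler's own division over the documented input range (a sketch; it assumes DWORD is a 32-bit unsigned type such as typedef unsigned int DWORD;):

#include <cstdio>

int main() {
    // Beyond 147455, (a+1)*29127 no longer fits in 32 bits.
    for (DWORD a = 0; a <= 147455; ++a) {
        if (div9(a) != a / 9) {
            std::printf("mismatch at a = %u\n", a);
            return 1;
        }
    }
    std::printf("div9(a) == a/9 for all a in [0, 147455]\n");
    return 0;
}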

Which is faster for reverse iteration, for or while loops?

I am trying to implement the standard memmove function in Rust and I was wondering which method is faster for downwards iteration (where src < dest):
for i in (0..n).rev() {
    // Do copying
}
or
let mut i = n;
while i != 0 {
    i -= 1;
    // Do copying
}
Will the rev() in the for loops version significantly slow it down?
TL;DR: Use the for loop.
Both should be equally fast. We can check the compiler's ability to peel away the layers of abstraction involved in the for loop quite simply:
#[inline(never)]
fn blackhole() {}

#[inline(never)]
fn with_for(n: usize) {
    for i in (0..n).rev() { blackhole(); }
}

#[inline(never)]
fn with_while(n: usize) {
    let mut i = n;
    while i > 0 {
        blackhole();
        i -= 1;
    }
}
This generates this LLVM IR:
; Function Attrs: noinline nounwind readnone uwtable
define internal void @_ZN8with_for20h645c385965fcce1fhaaE(i64) unnamed_addr #0 {
entry-block:
ret void
}
; Function Attrs: noinline nounwind readnone uwtable
define internal void @_ZN10with_while20hc09c3331764a9434yaaE(i64) unnamed_addr #0 {
entry-block:
ret void
}
Even if you are not versed in LLVM, it is obvious that both functions compiled down to the same IR (and thus obviously to the same assembly).
Since their performance is the same, one should prefer the more explicit for loop and reserve the while loop to cases where the iteration is irregular.
EDIT: to address starblue's concern that the benchmark above is unfit because the empty loops are optimized away entirely:
#[link(name = "snappy")]
extern {
fn blackhole(i: libc::c_int) -> libc::c_int;
}
#[inline(never)]
fn with_for(n: i32) {
for i in (0..n).rev() { unsafe { blackhole(i as libc::c_int); } }
}
#[inline(never)]
fn with_while(n: i32) {
let mut i = n;
while i > 0 {
unsafe { blackhole(i as libc::c_int); }
i -= 1;
}
}
compiles down to:
; Function Attrs: noinline nounwind uwtable
define internal void @_ZN8with_for20h7cf06f33e247fa35maaE(i32) unnamed_addr #1 {
entry-block:
%1 = icmp sgt i32 %0, 0
br i1 %1, label %match_case.preheader, label %clean_ast_95_
match_case.preheader: ; preds = %entry-block
br label %match_case
match_case: ; preds = %match_case.preheader, %match_case
%.in = phi i32 [ %2, %match_case ], [ %0, %match_case.preheader ]
%2 = add i32 %.in, -1
%3 = tail call i32 @blackhole(i32 %2)
%4 = icmp sgt i32 %2, 0
br i1 %4, label %match_case, label %clean_ast_95_.loopexit
clean_ast_95_.loopexit: ; preds = %match_case
br label %clean_ast_95_
clean_ast_95_: ; preds = %clean_ast_95_.loopexit, %entry-block
ret void
}
; Function Attrs: noinline nounwind uwtable
define internal void @_ZN10with_while20hee8edd624cfe9293IaaE(i32) unnamed_addr #1 {
entry-block:
%1 = icmp sgt i32 %0, 0
br i1 %1, label %while_body.preheader, label %while_exit
while_body.preheader: ; preds = %entry-block
br label %while_body
while_exit.loopexit: ; preds = %while_body
br label %while_exit
while_exit: ; preds = %while_exit.loopexit, %entry-block
ret void
while_body: ; preds = %while_body.preheader, %while_body
%i.05 = phi i32 [ %3, %while_body ], [ %0, %while_body.preheader ]
%2 = tail call i32 @blackhole(i32 %i.05)
%3 = add nsw i32 %i.05, -1
%4 = icmp sgt i32 %i.05, 1
br i1 %4, label %while_body, label %while_exit.loopexit
}
The core loops are:
; -- for loop
match_case: ; preds = %match_case.preheader, %match_case
%.in = phi i32 [ %2, %match_case ], [ %0, %match_case.preheader ]
%2 = add i32 %.in, -1
%3 = tail call i32 @blackhole(i32 %2)
%4 = icmp sgt i32 %2, 0
br i1 %4, label %match_case, label %clean_ast_95_.loopexit
; -- while loop
while_body: ; preds = %while_body.preheader, %while_body
%i.05 = phi i32 [ %3, %while_body ], [ %0, %while_body.preheader ]
%2 = tail call i32 @blackhole(i32 %i.05)
%3 = add nsw i32 %i.05, -1
%4 = icmp sgt i32 %i.05, 1
br i1 %4, label %while_body, label %while_exit.loopexit
And the only differences are that:
- for decrements before calling blackhole, while decrements after;
- for compares against 0, while compares against 1;
otherwise, it's the same core loop.
In short: They are (nearly) equally fast -- use the for loop!
Longer version:
First: rev() only works for iterators that implement DoubleEndedIterator, which provides a next_back() method. This method is expected to run in o(n) (sublinear time), usually even O(1) (constant time). And indeed, by looking at the implementation of next_back() for Range, we can see that it runs in constant time.
Now we know that both versions have asymptotically identical runtime. If this is the case, you should usually stop thinking about it and use the solution that is more idiomatic (which is for in this case). Thinking about optimization too early often decreases programming productivity, because performance matters only in a tiny percentage of all code you write.
But since you are implementing memmove, performance might actually really matter to you. So let's try to look at the resulting ASM. I used this code:
#![feature(start)]
#![feature(test)]

extern crate test;

#[inline(never)]
#[no_mangle]
fn with_for(n: usize) {
    for i in (0..n).rev() {
        test::black_box(i);
    }
}

#[inline(never)]
#[no_mangle]
fn with_while(n: usize) {
    let mut i = n;
    while i > 0 {
        test::black_box(i);
        i -= 1;
    }
}

#[start]
fn main(_: isize, vargs: *const *const u8) -> isize {
    let random_enough_value = unsafe {
        **vargs as usize
    };
    with_for(random_enough_value);
    with_while(random_enough_value);
    0
}
(Playground Link)
The #[no_mangle] is to improve readability of the resulting ASM. The #[inline(never)], the random_enough_value, and the black_box are used to prevent LLVM from optimizing away the things we want to inspect. The generated ASM of this (in release mode!) with some cleanup looks like:
with_for:                     | with_while:
    testq %rdi, %rdi          |     testq %rdi, %rdi
    je .LBB0_3                |     je .LBB1_3
    decq %rdi                 |
    leaq -8(%rsp), %rax       |     leaq -8(%rsp), %rax
.LBB0_2:                      | .LBB1_2:
    movq %rdi, -8(%rsp)       |     movq %rdi, -8(%rsp)
    decq %rdi                 |     decq %rdi
    cmpq $-1, %rdi            |
    jne .LBB0_2               |     jne .LBB1_2
.LBB0_3:                      | .LBB1_3:
    retq                      |     retq
The only difference is that with_while has two fewer instructions, because it counts down to 0 instead of to -1 as with_for does.
Conclusion: if you can tell that the asymptotic runtime is optimal, you should probably not think about optimization at all. Modern optimizers are clever enough to compile high level constructs down to pretty perfect ASM. Often, data layout and resulting cache efficiency is much more important than a minimal count of instructions, anyway.
If you actually need to think about optimization though, look at the ASM (or LLVM IR). In this case the for loop is actually a bit slower (more instructions, and a comparison against -1 instead of 0). But the number of cases where a Rust programmer should care about this is probably minuscule.
For small N, it really shouldn't matter.
Rust is lazy on iterators; 0..n won't cause any evaluation until you actually ask for an element. rev() asks for the last element first. As far as I know, the Rust counter iterator is clever and doesn't need to generate the first N-1 elements to get the Nth one. In this specific case, the rev method is probably even faster.
In the general case, it depends on what kind of access paradigm and access time your iterator has; make sure that accessing the end takes constant time, and it doesn't make a difference.
As with all benchmarking questions, it depends. Test for your N values yourself!
Premature optimization is also evil, so if your N is small, and your loop isn't done very often... don't worry.

Poor LLVM JIT performance

I have a legacy C++ application that constructs a tree of C++ objects. I want to use LLVM to call class constructors to create said tree. The generated LLVM code is fairly straight-forward and looks like repeated sequences of:
; ...
%11 = getelementptr [11 x i8*]* %Value_array1, i64 0, i64 1
%12 = call i8* @T_string_M_new_A_2Pv(i8* %heap, i8* getelementptr inbounds ([10 x i8]* @0, i64 0, i64 0))
%13 = call i8* @T_QueryLoc_M_new_A_2Pv4i(i8* %heap, i8* %12, i32 1, i32 1, i32 4, i32 5)
%14 = call i8* @T_GlobalEnvironment_M_getItemFactory_A_Pv(i8* %heap)
%15 = call i8* @T_xs_integer_M_new_A_Pvl(i8* %heap, i64 2)
%16 = call i8* @T_ItemFactory_M_createInteger_A_3Pv(i8* %heap, i8* %14, i8* %15)
%17 = call i8* @T_SingletonIterator_M_new_A_4Pv(i8* %heap, i8* %2, i8* %13, i8* %16)
store i8* %17, i8** %11, align 8
; ...
Where each T_ function is a C "thunk" that calls some C++ constructor, e.g.:
void* T_string_M_new_A_2Pv( void *v_value ) {
    string *const value = static_cast<string*>( v_value );
    return new string( value );
}
The thunks are necessary, of course, because LLVM knows nothing about C++. The T_ functions are added to the ExecutionEngine in use via ExecutionEngine::addGlobalMapping().
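Registering one of these thunks looks roughly like this (a simplified sketch; module and engine stand in for the actual Module and ExecutionEngine objects):

// module declares the thunk; engine is the ExecutionEngine that will JIT the code.
llvm::Function *thunk = module->getFunction("T_string_M_new_A_2Pv");
engine->addGlobalMapping(thunk, reinterpret_cast<void *>(&T_string_M_new_A_2Pv));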
When this code is JIT'd, the performance of the JIT'ing itself is very poor. I've generated a call-graph using kcachegrind. I don't understand all the numbers (and this PDF seems not to include commas where it should), but if you look at the left fork, the bottom two ovals, Schedule... is called 16K times and setHeightToAtLeas... is called 37K times. On the right fork, RAGreed... is called 35K times.
Those are far too many calls to anything for what's mostly a simple sequence of call LLVM instructions. Something seems horribly wrong.
Any ideas on how to improve the performance of the JIT'ing?
Another order of magnitude of improvement is unlikely to happen without a huge change in how the JIT works or a closer look at the particular calls you're trying to JIT. You could enable -fast-isel-verbose on llc -O0 (e.g. llc -O0 -fast-isel-verbose mymodule.[ll,bc]) and have it tell you whether it's falling back to the selection DAG for instruction generation. You may want to profile again and see what the current hot spots are.
