Poor LLVM JIT performance - performance

I have a legacy C++ application that constructs a tree of C++ objects. I want to use LLVM to call class constructors to create said tree. The generated LLVM code is fairly straight-forward and looks like repeated sequences of:
; ...
%11 = getelementptr [11 x i8*]* %Value_array1, i64 0, i64 1
%12 = call i8* #T_string_M_new_A_2Pv(i8* %heap, i8* getelementptr inbounds ([10 x i8]* #0, i64 0, i64 0))
%13 = call i8* #T_QueryLoc_M_new_A_2Pv4i(i8* %heap, i8* %12, i32 1, i32 1, i32 4, i32 5)
%14 = call i8* #T_GlobalEnvironment_M_getItemFactory_A_Pv(i8* %heap)
%15 = call i8* #T_xs_integer_M_new_A_Pvl(i8* %heap, i64 2)
%16 = call i8* #T_ItemFactory_M_createInteger_A_3Pv(i8* %heap, i8* %14, i8* %15)
%17 = call i8* #T_SingletonIterator_M_new_A_4Pv(i8* %heap, i8* %2, i8* %13, i8* %16)
store i8* %17, i8** %11, align 8
; ...
Where each T_ function is a C "thunk" that calls some C++ constructor, e.g.:
void* T_string_M_new_A_2Pv( void *v_value ) {
string *const value = static_cast<string*>( v_value );
return new string( value );
}
The thunks are necessary, of course, because LLVM knows nothing about C++. The T_ functions are added to the ExecutionEngine in use via ExecutionEngine::addGlobalMapping().
When this code is JIT'd, the performance of the JIT'ing itself is very poor. I've generated a call-graph using kcachegrind. I don't understand all the numbers (and this PDF seems not to include commas where it should), but if you look at the left fork, the bottom two ovals, Schedule... is called 16K times and setHeightToAtLeas... is called 37K times. On the right fork, RAGreed... is called 35K times.
Those are far too many calls to anything for what's mostly a simple sequence of call LLVM instructions. Something seems horribly wrong.
Any ideas on how to improve the performance of the JIT'ing?

Another order of magnitude is unlikely to happen without a huge change in how the JIT works or looking at the particular calls you're trying to JIT. You could enable -fast-isel-verbose on llc -O0 (e.g. llc -O0 -fast-isel-verbose mymodule.[ll,bc]) and get it to tell you if it's falling back to the selection dag for instruction generation. You may want to profile again and see what the current hot spots are.

Related

What are the errors in these avx2 intrinsics, and how to use the raw assembler [duplicate]

This question already has answers here:
inlining failed in call to always_inline '__m256d _mm256_broadcast_sd(const double*)'
(1 answer)
The Effect of Architecture When Using SSE / AVX Intrinisics
(2 answers)
Closed 2 years ago.
The Intel reference manual is fairly opaque on the details of how instructions are used. There are no examples on each instruction page.
https://software.intel.com/sites/default/files/4f/5b/36945
It's possible we haven't found the right page, or the right manual yet. In any case, the syntax of gnu as might be different.
So I thought perhaps we could call intrinsics, step in and view the opcodes in gdb, and from that perhaps learn the opcodes in gnu as. we also searched for the opcodes for avx2 in gnu as but didn't find them documented.
Starting with the c++ code that isn't compiling:
#include <immintrin.h>
int main() {
__m256i a = _mm256_set_epi32(1, 2, 3, 4, 5, 6, 7, 8); // seems to work
double x = 1.0, y = 2.0, z = 3.0;
__m256d c,d;
__m256d b = _mm256_loadu_pd(&x);
// __m256d c = _mm256_loadu_pd(&y);
// __m256d d = _mm256_loadu_pd(&z);
d = _mm256_fmadd_pd(b, c, d); // c += a * b
_mm256_add_pd(b, c);
}
g++ -g
We would like to be able to load a vector register %ymm0 with a single value, which appears to be supported by the intrinsic: _mm256_loadu_pd.
Fused multiply-add and add are also there. All intrinsics except the first give errors.
/usr/lib/gcc/x86_64-linux-gnu/9/include/fmaintrin.h:47:1: error: inlining failed in call to always_inline ‘__m256d _mm256_fmadd_pd(__m256d, __m256d, __m256d)’: target specific option mismatch
47 | _mm256_fmadd_pd (__m256d __A, __m256d __B, __m256d __C)
| ^~~~~~~~~~~~~~~
Next, what are the syntax of the underlying assembler instructions? If you could point to a manual showing them that would be very helpful.

F# Performance Impact of Checked Calcs?

Is there a performance impact from using the Checked module? I've tested it out with sequences of type int and see no noticeable difference. Sometimes the checked version is faster and sometimes unchecked is faster, but generally not by much.
Seq.initInfinite (fun x-> x) |> Seq.item 1000000000;;
Real: 00:00:05.272, CPU: 00:00:05.272, GC gen0: 0, gen1: 0, gen2: 0
val it : int = 1000000000
open Checked
Seq.initInfinite (fun x-> x) |> Seq.item 1000000000;;
Real: 00:00:04.785, CPU: 00:00:04.773, GC gen0: 0, gen1: 0, gen2: 0
val it : int = 1000000000
Basically I'm trying to figure out if there would be any downside to always opening Checked. (I encountered an overflow that wasn't immediately obvious, so I'm now playing the role of the jilted lover who doesn't want another broken heart.) The only non-contrived reason I can come up with for not always using Checked is if there were some performance hit, but I haven't seen one yet.
When you measure performance it's usually not a good idea to include Seq as Seq adds lots of overhead (at least compared to int operations) so you risk that most of the time is spent in Seq, not in the code you like to test.
I wrote a small test program for (+):
let clock =
let sw = System.Diagnostics.Stopwatch ()
sw.Start ()
fun () ->
sw.ElapsedMilliseconds
let dbreak () = System.Diagnostics.Debugger.Break ()
let time a =
let b = clock ()
let r = a ()
let n = clock ()
let d = n - b
d, r
module Unchecked =
let run c () =
let rec loop a i =
if i < c then
loop (a + 1) (i + 1)
else
a
loop 0 0
module Checked =
open Checked
let run c () =
let rec loop a i =
if i < c then
loop (a + 1) (i + 1)
else
a
loop 0 0
[<EntryPoint>]
let main argv =
let count = 1000000000
let testCases =
[|
"Unchecked" , Unchecked.run
"Checked" , Checked.run
|]
for nm, a in testCases do
printfn "Running %s ..." nm
let ms, r = time (a count)
printfn "... it took %d ms, result is %A" ms r
0
The performance results are this:
Running Unchecked ...
... it took 561 ms, result is 1000000000
Running Checked ...
... it took 1103 ms, result is 1000000000
So it seems some overhead is added by using Checked. The cost of int add should be less than the loop overhead so the overhead of Checked is higher than 2x maybe closer to 4x.
Out of curiousity we can check the IL Code using tools like ILSpy:
Unchecked:
IL_0000: nop
IL_0001: ldarg.2
IL_0002: ldarg.0
IL_0003: bge.s IL_0014
IL_0005: ldarg.0
IL_0006: ldarg.1
IL_0007: ldc.i4.1
IL_0008: add
IL_0009: ldarg.2
IL_000a: ldc.i4.1
IL_000b: add
IL_000c: starg.s i
IL_000e: starg.s a
IL_0010: starg.s c
IL_0012: br.s IL_0000
Checked:
IL_0000: nop
IL_0001: ldarg.2
IL_0002: ldarg.0
IL_0003: bge.s IL_0014
IL_0005: ldarg.0
IL_0006: ldarg.1
IL_0007: ldc.i4.1
IL_0008: add.ovf
IL_0009: ldarg.2
IL_000a: ldc.i4.1
IL_000b: add.ovf
IL_000c: starg.s i
IL_000e: starg.s a
IL_0010: starg.s c
IL_0012: br.s IL_0000
The only difference is that Unchecked uses add and Checked uses add.ovf. add.ovf is add with overflow check.
We can dig even deeper by looking at the jitted x86_64 code.
Unchecked:
; if i < c then
00007FF926A611B3 cmp esi,ebx
00007FF926A611B5 jge 00007FF926A611BD
; i + 1
00007FF926A611B7 inc esi
; a + 1
00007FF926A611B9 inc edi
; loop (a + 1) (i + 1)
00007FF926A611BB jmp 00007FF926A611B3
Checked:
; if i < c then
00007FF926A62613 cmp esi,ebx
00007FF926A62615 jge 00007FF926A62623
; a + 1
00007FF926A62617 add edi,1
; Overflow?
00007FF926A6261A jo 00007FF926A6262D
; i + 1
00007FF926A6261C add esi,1
; Overflow?
00007FF926A6261F jo 00007FF926A6262D
; loop (a + 1) (i + 1)
00007FF926A62621 jmp 00007FF926A62613
Now the reason for the Checked overhead is visible. After each operation the jitter inserts the conditional instruction jo which jumps to code that raises OverflowException if the overflow flag is set.
This chart shows us that the cost of an integer add is less than 1 clock cycle. The reason it's less than 1 clock cycle is that modern CPU can execute certain instructions in parallel.
The chart also shows us that branch that was correctly predicted by the CPU takes around 1-2 clock cycles.
So assuming a throughtput of at least 2 the cost of two integer additions in the Unchecked example should be 1 clock cycle.
In the Checked example we do add, jo, add, jo. Most likely CPU can't parallelize in this case and the cost of this should be around 4-6 clock cycles.
Another interesting difference is that the order of additions changed. With checked additions the order of the operations matter but with unchecked the jitter (and the CPU) has a greater flexibility moving the operations possibly improving performance.
So long story short; for cheap operations like (+) the overhead of Checked should be around 4x-6x compared to Unchecked.
This assumes no overflow exception. The cost of a .NET exception is probably around 100,000x times more expensive than an integer addition.

getelementptr has -1 as the first index operand

I'm reading the IR of nginx generated by Clang. In function ngx_event_expire_timers, there are some getelementptr instructions with i64 -1 as first index operand. For example,
%handler = getelementptr inbounds %struct.ngx_rbtree_node_s, %struct.ngx_rbtree_node_s* %node.addr.0.i, i64 -1, i32 2
I know the first index operand will be used as an offset to the first operand. But what does a negative offset mean?
The GEP instruction is perfectly fine with negative indices.
In this case you have something like:
node arr[100];
node* ptr = arr[50];
if ( (ptr-1)->value == ptr->value)
// then ...
GEP with negative indices just calculate the offset to the base pointer into the other direction. There is nothing wrong with it.
Considering what is doing inside nginx source code, the semantics of the getelementptr instruction is interesting. It's the result of two lines of C source code:
ev = (ngx_event_t *) ((char *) node - offsetof(ngx_event_t, timer));
ev->handler(ev);
node is of type ngx_rbtree_node_t, which is a member of ev's type ngx_event_t. That is like:
struct ngx_event_t {
....
struct ngx_rbtree_node_t time;
....
};
struct ngx_event_t *ev;
struct ngx_rbtree_node_t *node;
timer is the name of struct ngx_event_t member where node should point to.
|<- ngx_rbtree_node_t ->|
|<- ngx_event_t ->|
------------------------------------------------------
| (some data) | "time" | (some data)
------------------------------------------------------
^ ^
ev node
The graph above shows the layout of an instance of ngx_event_t. The result of offsetof(ngx_event_t, time) is 40. That means the some data before time is of 40 bytes. And the size of ngx_rbtree_node_t is also 40 bytes, by coincidence. So the i64 -1 in the first index oprand of getelementptr instruction computes the base address of the ngx_event_t containing node, which is 40 bytes ahead of node.
handler is another member of ngx_event_t, which is 16 bytes behind the base of ngx_event_t. By (another) coincidence, the third member of ngx_rbtree_node_t is also 16 bytes behind the base address of ngx_rbtree_node_t. So the i32 2 in getelementptr instruction will add 16 bytes to ev, to get the address of handler.
Note that the 16 bytes is computed from the layout of ngx_rbtree_node_t, but not ngx_event_t. Clang must have done some computations to ensure the correctness of the getelementptr instruction. Before use the value of %handler, there is a bitcast instruction which casts %handler to a function pointer type.
What Clang has done breaks the type transformation process defined in C source code. But the result is the exactly same.

Which is faster for reverse iteration, for or while loops?

I am trying to implement the standard memmove function in Rust and I was wondering which method is faster for downwards iteration (where src < dest):
for i in (0..n).rev() {
//Do copying
}
or
let mut i = n;
while i != 0 {
i -= 1;
// Do copying
}
Will the rev() in the for loops version significantly slow it down?
TL;DR: Use the for loop.
Both should be equally fast. We can check the compiler's ability to peel away the layers of abstraction involved in the for loop quite simply:
#[inline(never)]
fn blackhole() {}
#[inline(never)]
fn with_for(n: usize) {
for i in (0..n).rev() { blackhole(); }
}
#[inline(never)]
fn with_while(n: usize) {
let mut i = n;
while i > 0 {
blackhole();
i -= 1;
}
}
This generates this LLVM IR:
; Function Attrs: noinline nounwind readnone uwtable
define internal void #_ZN8with_for20h645c385965fcce1fhaaE(i64) unnamed_addr #0 {
entry-block:
ret void
}
; Function Attrs: noinline nounwind readnone uwtable
define internal void #_ZN10with_while20hc09c3331764a9434yaaE(i64) unnamed_addr #0 {
entry-block:
ret void
}
Even if you are not versed in LLVM, it is obvious that both functions compiled down to the same IR (and thus obviously to the same assembly).
Since their performance is the same, one should prefer the more explicit for loop and reserve the while loop to cases where the iteration is irregular.
EDIT: to address starblue's concern of unfitness.
#[link(name = "snappy")]
extern {
fn blackhole(i: libc::c_int) -> libc::c_int;
}
#[inline(never)]
fn with_for(n: i32) {
for i in (0..n).rev() { unsafe { blackhole(i as libc::c_int); } }
}
#[inline(never)]
fn with_while(n: i32) {
let mut i = n;
while i > 0 {
unsafe { blackhole(i as libc::c_int); }
i -= 1;
}
}
compiles down to:
; Function Attrs: noinline nounwind uwtable
define internal void #_ZN8with_for20h7cf06f33e247fa35maaE(i32) unnamed_addr #1 {
entry-block:
%1 = icmp sgt i32 %0, 0
br i1 %1, label %match_case.preheader, label %clean_ast_95_
match_case.preheader: ; preds = %entry-block
br label %match_case
match_case: ; preds = %match_case.preheader, %match_case
%.in = phi i32 [ %2, %match_case ], [ %0, %match_case.preheader ]
%2 = add i32 %.in, -1
%3 = tail call i32 #blackhole(i32 %2)
%4 = icmp sgt i32 %2, 0
br i1 %4, label %match_case, label %clean_ast_95_.loopexit
clean_ast_95_.loopexit: ; preds = %match_case
br label %clean_ast_95_
clean_ast_95_: ; preds = %clean_ast_95_.loopexit, %entry-block
ret void
}
; Function Attrs: noinline nounwind uwtable
define internal void #_ZN10with_while20hee8edd624cfe9293IaaE(i32) unnamed_addr #1 {
entry-block:
%1 = icmp sgt i32 %0, 0
br i1 %1, label %while_body.preheader, label %while_exit
while_body.preheader: ; preds = %entry-block
br label %while_body
while_exit.loopexit: ; preds = %while_body
br label %while_exit
while_exit: ; preds = %while_exit.loopexit, %entry-block
ret void
while_body: ; preds = %while_body.preheader, %while_body
%i.05 = phi i32 [ %3, %while_body ], [ %0, %while_body.preheader ]
%2 = tail call i32 #blackhole(i32 %i.05)
%3 = add nsw i32 %i.05, -1
%4 = icmp sgt i32 %i.05, 1
br i1 %4, label %while_body, label %while_exit.loopexit
}
The core loops are:
; -- for loop
match_case: ; preds = %match_case.preheader, %match_case
%.in = phi i32 [ %2, %match_case ], [ %0, %match_case.preheader ]
%2 = add i32 %.in, -1
%3 = tail call i32 #blackhole(i32 %2)
%4 = icmp sgt i32 %2, 0
br i1 %4, label %match_case, label %clean_ast_95_.loopexit
; -- while loop
while_body: ; preds = %while_body.preheader, %while_body
%i.05 = phi i32 [ %3, %while_body ], [ %0, %while_body.preheader ]
%2 = tail call i32 #blackhole(i32 %i.05)
%3 = add nsw i32 %i.05, -1
%4 = icmp sgt i32 %i.05, 1
br i1 %4, label %while_body, label %while_exit.loopexit
And the only difference is that:
for decrements before calling blackhole, while decrements after
for compares against 0, while compares against 1
otherwise, it's the same core loop.
In short: They are (nearly) equally fast -- use the for loop!
Longer version:
First: rev() only works for iterators that implement DoubleEndedIterator, which provides a next_back() method. This method is expected to run in o(n) (sublinear time), usually even O(1) (constant time). And indeed, by looking at the implementation of next_back() for Range, we can see that it runs in constant time.
Now we know that both versions have asymptotically identical runtime. If this is the case, you should usually stop thinking about it and use the solution that is more idiomatic (which is for in this case). Thinking about optimization too early often decreases programming productivity, because performance matters only in a tiny percentage of all code you write.
But since you are implementing memmove, performance might actually really matter to you. So lets try to look at the resulting ASM. I used this code:
#![feature(start)]
#![feature(test)]
extern crate test;
#[inline(never)]
#[no_mangle]
fn with_for(n: usize) {
for i in (0..n).rev() {
test::black_box(i);
}
}
#[inline(never)]
#[no_mangle]
fn with_while(n: usize) {
let mut i = n;
while i > 0 {
test::black_box(i);
i -= 1;
}
}
#[start]
fn main(_: isize, vargs: *const *const u8) -> isize {
let random_enough_value = unsafe {
**vargs as usize
};
with_for(random_enough_value);
with_while(random_enough_value);
0
}
(Playground Link)
The #[no_mangle] is to improve readability in the resulting ASM. The #inline(never) and the random_enough_value as well as the black_box are used to prevent LLVM to optimize things we don't want to be optimized. The generated ASM of this (in release mode!) with some cleanup looks like:
with_for: | with_while:
testq %rdi, %rdi | testq %rdi, %rdi
je .LBB0_3 | je .LBB1_3
decq %rdi |
leaq -8(%rsp), %rax | leaq -8(%rsp), %rax
.LBB0_2: | .LBB1_2:
movq %rdi, -8(%rsp) | movq %rdi, -8(%rsp)
decq %rdi | decq %rdi
cmpq $-1, %rdi |
jne .LBB0_2 | jne .LBB1_2
.LBB0_3: | .LBB1_3:
retq | retq
The only difference is that with_while has two instructions less, because it's counting down to 0 instead of -1, like with_for does.
Conclusion: if you can tell that the asymptotic runtime is optimal, you should probably not think about optimization at all. Modern optimizers are clever enough to compile high level constructs down to pretty perfect ASM. Often, data layout and resulting cache efficiency is much more important than a minimal count of instructions, anyway.
If you actually need to think about optimization though, look at the ASM (or LLVM IR). In this case the for loop is actually a bit slower (more instructions, comparison with -1 instead of 0). But the number of cases where a Rust programmers should care about this, is probably miniscule.
For small N, it really shouldn't matter.
Rust is lazy on iterators; 0..n won't cause any evaluation until you actually ask for an element. rev() asks for the last element first. As far as I know, the Rust counter iterator is clever and doesn't need to generate the first N-1 elements to get the Nth one. In this specific case, the rev method is probably even faster.
In the general case, it depends on what kind of access paradigm and access time your iterator has; make sure that accessing the end takes constant time, and it doesn't make a difference.
As with all benchmarking questions, it depends. Test for your N values yourself!
Premature optimization is also evil, so if your N is small, and your loop isn't done very often... don't worry.

Replicate llvm instructions

I'm trying to replicate instruction (addition binary operation for example), and show them in the LLVM IR, but the following code only returns the 1st instruction (add1) that I built.How to return both built instructions ?
IRBuilder<> builder(op);
Value *lhs = op->getOperand(0);
Value *rhs = op->getOperand(1);
Value *add1 = builder.CreateAdd(lhs, rhs);
Value *add2 = builder.CreateAdd(lhs, rhs);
for (auto &U : op->uses()) {
User *user = U.getUser(); // A User is anything with operands.
user->setOperand(U.getOperandNo(), add1);
user->setOperand(U.getOperandNo(), add2);
}
Assume an add instruction. You have a BinaryOperator which has two operands e.g.,: %op = add i32 10, 32
You take them as Value *lhs = op->getOperand(0); and Value *rhs = op->getOperand(1);
So fare so good. Now you are creating two new add tnstructions before the actual add since you are construction your IRBuilder with op as insertion point.
%add1 = add i32 10, 32
%add2 = add i32 10, 32
%op = add i32 10, 32
Finally you update the Users of your original instruction e.g., something like another BinaryOperator: %0 = mul i32 %op, %op
When you look closely on your loop you will see that you set both (add1 and add2) to the same operand of the User. After your loop the multiplication will look like %0 = mul i32 %add2, %add2
If you dump the BasicBlock where the instructions are inserted directly after insertion, you should see something like:
%add1 = add i32 10, 32
%add2 = add i32 10, 32
%op = add i32 10, 32
%0 = mul i32 %add2, %add2
But if you run another LLVM Pass that performs dead code elimination (e.g., InstCombine) you will end up with:
%add2 = add i32 10, 32
%0 = mul i32 %add2, %add2
Because add1 has no users. You have immediately replaced the uses of add1 with add2. And op is also gone because all users now use add2 instead of op.
From your question it is hard to guess what you have intended with your code but this is why you will see only one of your instructions in the final IR.

Resources