F# Performance Impact of Checked Calcs? - performance

Is there a performance impact from using the Checked module? I've tested it out with sequences of type int and see no noticeable difference. Sometimes the checked version is faster and sometimes unchecked is faster, but generally not by much.
Seq.initInfinite (fun x-> x) |> Seq.item 1000000000;;
Real: 00:00:05.272, CPU: 00:00:05.272, GC gen0: 0, gen1: 0, gen2: 0
val it : int = 1000000000
open Checked
Seq.initInfinite (fun x-> x) |> Seq.item 1000000000;;
Real: 00:00:04.785, CPU: 00:00:04.773, GC gen0: 0, gen1: 0, gen2: 0
val it : int = 1000000000
Basically I'm trying to figure out if there would be any downside to always opening Checked. (I encountered an overflow that wasn't immediately obvious, so I'm now playing the role of the jilted lover who doesn't want another broken heart.) The only non-contrived reason I can come up with for not always using Checked is if there were some performance hit, but I haven't seen one yet.
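To make the difference concrete, a minimal sketch (illustrative only) of what opening Checked changes:

let addOneUnchecked (x : int) = x + 1
printfn "%d" (addOneUnchecked System.Int32.MaxValue)   // prints -2147483648: silent wrap-around

open Checked
let addOneChecked (x : int) = x + 1
printfn "%d" (addOneChecked System.Int32.MaxValue)     // raises System.OverflowException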

When you measure performance it's usually not a good idea to include Seq, as Seq adds a lot of overhead (at least compared to raw int operations), so you risk that most of the time is spent in Seq rather than in the code you actually want to test.
I wrote a small test program for (+):
let clock =
    let sw = System.Diagnostics.Stopwatch ()
    sw.Start ()
    fun () ->
        sw.ElapsedMilliseconds

let dbreak () = System.Diagnostics.Debugger.Break ()

let time a =
    let b = clock ()
    let r = a ()
    let n = clock ()
    let d = n - b
    d, r

module Unchecked =
    let run c () =
        let rec loop a i =
            if i < c then
                loop (a + 1) (i + 1)
            else
                a
        loop 0 0

module Checked =
    open Checked

    let run c () =
        let rec loop a i =
            if i < c then
                loop (a + 1) (i + 1)
            else
                a
        loop 0 0

[<EntryPoint>]
let main argv =
    let count = 1000000000
    let testCases =
        [|
            "Unchecked" , Unchecked.run
            "Checked"   , Checked.run
        |]
    for nm, a in testCases do
        printfn "Running %s ..." nm
        let ms, r = time (a count)
        printfn "... it took %d ms, result is %A" ms r
    0
The performance results are this:
Running Unchecked ...
... it took 561 ms, result is 1000000000
Running Checked ...
... it took 1103 ms, result is 1000000000
So it seems some overhead is added by using Checked. Since the cost of an int add should be well below the loop overhead, the overhead of Checked on the addition itself is higher than the 2x the totals suggest, maybe closer to 4x.
Out of curiosity we can check the IL code using tools like ILSpy:
Unchecked:
IL_0000: nop
IL_0001: ldarg.2
IL_0002: ldarg.0
IL_0003: bge.s IL_0014
IL_0005: ldarg.0
IL_0006: ldarg.1
IL_0007: ldc.i4.1
IL_0008: add
IL_0009: ldarg.2
IL_000a: ldc.i4.1
IL_000b: add
IL_000c: starg.s i
IL_000e: starg.s a
IL_0010: starg.s c
IL_0012: br.s IL_0000
Checked:
IL_0000: nop
IL_0001: ldarg.2
IL_0002: ldarg.0
IL_0003: bge.s IL_0014
IL_0005: ldarg.0
IL_0006: ldarg.1
IL_0007: ldc.i4.1
IL_0008: add.ovf
IL_0009: ldarg.2
IL_000a: ldc.i4.1
IL_000b: add.ovf
IL_000c: starg.s i
IL_000e: starg.s a
IL_0010: starg.s c
IL_0012: br.s IL_0000
The only difference is that Unchecked uses add and Checked uses add.ovf. add.ovf is add with overflow check.
We can dig even deeper by looking at the jitted x86_64 code.
Unchecked:
; if i < c then
00007FF926A611B3 cmp esi,ebx
00007FF926A611B5 jge 00007FF926A611BD
; i + 1
00007FF926A611B7 inc esi
; a + 1
00007FF926A611B9 inc edi
; loop (a + 1) (i + 1)
00007FF926A611BB jmp 00007FF926A611B3
Checked:
; if i < c then
00007FF926A62613 cmp esi,ebx
00007FF926A62615 jge 00007FF926A62623
; a + 1
00007FF926A62617 add edi,1
; Overflow?
00007FF926A6261A jo 00007FF926A6262D
; i + 1
00007FF926A6261C add esi,1
; Overflow?
00007FF926A6261F jo 00007FF926A6262D
; loop (a + 1) (i + 1)
00007FF926A62621 jmp 00007FF926A62613
Now the reason for the Checked overhead is visible. After each operation the jitter inserts the conditional instruction jo which jumps to code that raises OverflowException if the overflow flag is set.
Instruction timing tables for modern CPUs show that the cost of an integer add is less than 1 clock cycle. The reason it can be less than 1 clock cycle is that modern CPUs can execute several such instructions in parallel.
They also show that a branch that was correctly predicted by the CPU takes around 1-2 clock cycles.
So, assuming a throughput of at least 2, the cost of the two integer additions in the Unchecked example should be around 1 clock cycle.
In the Checked example we do add, jo, add, jo. Most likely the CPU can't parallelize in this case, and the cost should be around 4-6 clock cycles.
Another interesting difference is that the order of additions changed. With checked additions the order of the operations matter but with unchecked the jitter (and the CPU) has a greater flexibility moving the operations possibly improving performance.
So, long story short: for cheap operations like (+), the overhead of Checked should be around 4x-6x compared to Unchecked.
This assumes no overflow exception is thrown. The cost of a .NET exception is probably around 100,000x more expensive than an integer addition.
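To get a rough feel for that last claim, a small sketch like this (absolute numbers will vary a lot between machines and runtimes) times a loop that forces and catches the overflow:

open Checked

// Rough sketch: time 100,000 caught OverflowExceptions.
let sw = System.Diagnostics.Stopwatch.StartNew ()
let mutable top = System.Int32.MaxValue   // mutable so the add isn't folded away at compile time
let mutable caught = 0
for _ in 1 .. 100000 do
    try
        top + 1 |> ignore
    with :? System.OverflowException ->
        caught <- caught + 1
printfn "100,000 caught overflows: %d ms (caught=%d)" sw.ElapsedMilliseconds caught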

Related

Copy-and-update semantics and performance in F# record types

I am experimenting with a settings object that is passed on through a bunch of functions (a bit like a stack). It has quite a few fields (mixed ints, enums, DU's, strings) and I was wondering what the best data type is for this task (I have come back to this a few times in the past few years...).
While I currently employ a home-grown data type, it is slow and not thread-safe, so I am looking into a more logical choice and decided to experiment with record types.
Since they use copy-and-update semantics, it seems logical to think the F# compiler is smart enough to update only the necessary data and leave non-mutated data alone, considering it is immutable anyway.
So I ran a couple of tests and expected to see a performance difference between a full copy (every field is replaced) and a one-field copy (only one field is replaced). But I am not sure I am using the proper approach for testing, or whether other data types may be more suitable.
As an actual example, let's take an XML qualified name. They have a small local part and a prefix (which can be ignored) and a long namespace part. The namespace is mostly the same, so only the local part needs updating.
type RecQName = { Ns :string; Prefix :string; Name :string }
// full update
let f a b c = { Ns = a; Prefix = b; Name = c}
let g() = f "http://test/is/a/long/name/here/and/there" "xs" "value"
// partial update
let test = g();;
let h a b = {a with Name = b }
let k() = h test "newname"
With timings on, in FSI (set to Release, x64 and Debug off) I get (running each twice):
> for i in 0 .. 100000000 do g() |> ignore;;
Real: 00:00:01.412, CPU: 00:00:01.404, GC gen0: 637, gen1: 1, gen2: 1
> for i in 0 .. 100000000 do g() |> ignore;;
Real: 00:00:01.317, CPU: 00:00:01.310, GC gen0: 636, gen1: 0, gen2: 0
> for i in 0 .. 100000000 do k() |> ignore;;
Real: 00:00:01.191, CPU: 00:00:01.185, GC gen0: 636, gen1: 1, gen2: 0
> for i in 0 .. 100000000 do k() |> ignore;;
Real: 00:00:01.099, CPU: 00:00:01.092, GC gen0: 636, gen1: 0, gen2: 0
Now I know timings are not everything, and there's a clear difference of roughly 20%, but that seems too small to justify the change, and it may well be due to other reasons (the CLI may intern the strings, for instance).
Am I making the wrong assumptions? I googled for record-type performance, but they were all about comparing them with structs. Does anybody know of the algorithm used for copy-and-update? Any thoughts on whether this datatype or something else is a smarter choice (given many fields, not just three as above, and wanting to use immutability without locking with copy-and-update).
Update 1
The test above doesn't really test anything, it seems, as can be shown with the next test.
type Rec = { A: int; B: int; C: int; D: int; E: int};;
// full
let f a b c d e = { A = a; B = b; C = c; D = d; E = e }
// partial
let temp = f 1 2 3 4 5
let g b = { temp with B = b }
// perf tests (subtract is necessary or the compiler optimizes them away)
let mutable res = 0;;
for i in 0 .. 1000000000 do f i (i-1) (i-2) (i-3) (i-4) |> function { B = b } -> if b < 1000 then res <- b
for i in 0 .. 1000000000 do g i |> function { B = b } -> if b < 1000 then res <- b
Results:
> for i in 0 .. 1000000000 do f i (i-1) (i-2) (i-3) (i-4) |> function { B = b } -> if b < 1000 then res <- b ;;
Real: 00:00:09.039, CPU: 00:00:09.032, GC gen0: 6358, gen1: 1, gen2: 0
> for i in 0 .. 1000000000 do g i |> function { B = b } -> if b < 1000 then res <- b;;
Real: 00:00:10.571, CPU: 00:00:10.576, GC gen0: 6358, gen1: 2, gen2: 0
Now a difference shows, though an unexpected one: copy-and-update is definitely not faster than building the record from scratch. The overhead of the copy in this case may be due to the extra stack slot required, I don't know.
At the very least it shows that no magic goes on (as Fyodor already mentioned in the comments).
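One way to see why there is no magic: the compiler simply expands { temp with B = b } into an ordinary record construction that reads every other field from the source record, roughly like this hand-written equivalent (a sketch of the expansion, not actual compiler output):

// Roughly what g compiles to (hand-written sketch, not compiler output):
let g' b = { A = temp.A; B = b; C = temp.C; D = temp.D; E = temp.E }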
Update 2
Ok, one more update. If I inline both f and g functions above, the timings become remarkably different, in favor of the partial update. Apparently, the inlining has the effect that either the compiler or the JIT "knows" it doesn't have to do a full copy, or it is just the effect of putting everything on the stack (the JIT or compiler getting rid of the boxing), as can be seen from the 0 GC collections.
> for i in 0 .. 1000000000 do f i (i-1) (i-2) (i-3) (i-4) |> function { B = b } -> if b < 1000 then res <- b ;;
Real: 00:00:08.885, CPU: 00:00:08.876, GC gen0: 6359, gen1: 1, gen2: 1
> for i in 0 .. 1000000000 do g i |> function { B = b } -> if b < 1000 then res <- b ;;
Real: 00:00:00.571, CPU: 00:00:00.561, GC gen0: 0, gen1: 0, gen2: 0
Whether this holds in the "real world" is debatable and should (of course) be tested. Whether this improvement has any relation to record types, I doubt it.

How to divide by 9 using just shifts/add/sub?

Last week I was in an interview and there was a test like this:
Calculate N/9 (given that N is a positive integer), using only
SHIFT LEFT, SHIFT RIGHT, ADD, SUBTRACT instructions.
First, find the representation of 1/9 in binary:
0.000111000111000111... (the 000111 group repeats)
Truncated to 16 fractional bits, 0.0001110001110001 means (1/16) + (1/32) + (1/64) + (1/1024) + (1/2048) + (1/4096) + (1/65536),
so (x/9) is approximately (x>>4) + (x>>5) + (x>>6) + (x>>10) + (x>>11) + (x>>12) + (x>>16). (It is only an approximation: each shifted term truncates, so the sum comes out slightly low.)
Possible optimization (if loops are allowed):
if you loop over 0001110001110001b right shifting it each loop,
add "x" to your result register whenever the carry was set on this shift
and shift your result right each time afterwards,
your result is x/9
mov cx, 16 ; assuming 16 bit registers
mov bx, 7281 ; bit mask of 2^16 * (1/9)
mov ax, 8166 ; sample value, (1/9 of it is 907)
mov dx, 0 ; dx holds the result
div9:
inc ax ; or "add ax,1" if inc's not allowed :)
; workaround for the fact that 7/64
; are a bit less than 1/9
shr bx,1
jnc no_add
add dx,ax
no_add:
shr dx,1
dec cx
jnz div9
(I currently cannot test this, so it may be wrong.)
You can use a fixed-point math trick: scale up so that the significant fractional part moves into the integer range, do the math you need, and scale back down.
a/9 = ((a*10000)/9)/10000
as you can see I scaled by 10000. Now the integer part of 10000/9=1111 is big enough so I can write:
a/9 = ~a*1111/10000
power of 2 scale
If you use a power-of-2 scale then you just need a bit shift instead of a division. You have to compromise between precision and input value range. I empirically found that on 32-bit arithmetic the best scale for this is 1<<18, so:
(((a+1)<<18)/9)>>18 = ~a/9;
The (a+1) corrects the rounding errors back to the right range.
Hardcoded multiplication
Rewrite the multiplication constant to binary
q = (1<<18)/9 = 29127 = 0111 0001 1100 0111 bin
Now if you need to compute c=(a*q), use hard-coded binary multiplication: for each 1 bit of q, add a<<(position_of_that_1) to c. If you see a run of ones like 111, you can rewrite it as 1000-1 to minimize the number of operations.
If you put all of this together you should get something like this C++ code of mine:
DWORD div9(DWORD a)
{
// ((a+1)*q)>>18 = (((a+1)<<18)/9)>>18 = ~a/9;
// q = (1<<18)/9 = 29127 = 0111 0001 1100 0111 bin
// valid for a = < 0 , 147455 >
DWORD c;
c =(a<< 3)-(a ); // c= a*29127
c+=(a<< 9)-(a<< 6);
c+=(a<<15)-(a<<12);
c+=29127; // c= (a+1)*29127
c>>=18; // c= ((a+1)*29127)>>18
return c;
}
Now if you look at the binary form, the 111 group repeats every 6 bits, so you can improve the code a bit further:
DWORD div9(DWORD a)
{
DWORD c;
c =(a<<3)-a; // first pattern
c+=(c<<6)+(c<<12); // and the other 2...
c+=29127;
c>>=18;
return c;
}
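As a sanity check, the same scheme transcribed to F# (a sketch of my own, not part of the original answer), together with a brute-force comparison against ordinary division over the stated valid range:

// q = (1 <<< 18) / 9 = 29127; (a+1)*q >>> 18 = a/9, valid for a in 0 .. 147455.
let div9 (a : uint32) =
    let mutable c = (a <<< 3) - a        // c = 7*a          (one "111" group)
    c <- c + (c <<< 6) + (c <<< 12)      // c = 29127*a      (the group repeats every 6 bits)
    c <- c + 29127u                      // c = (a+1)*29127  (rounding correction)
    c >>> 18

// Brute-force check over the stated valid range.
{ 0 .. 147455 }
|> Seq.forall (fun a -> div9 (uint32 a) = uint32 (a / 9))
|> printfn "div9 matches a/9 for 0..147455: %b"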

Performance of Lazy in F#

Why is the creation of Lazy type so slow?
Assume the following code:
type T() =
    let v = lazy (0.0)
    member o.a = v.Value

type T2() =
    member o.a = 0.0

#time "on"
for i in 0 .. 10000000 do
    T() |> ignore

#time "on"
for i in 0 .. 10000000 do
    T2() |> ignore
The first loop gives me: Real: 00:00:00.647 whereas the second loop gives me Real: 00:00:00.051. Lazy is 13X slower!!
I have tried to optimize my code in this way and I ended up with simulation code 6X slower. It was then fun to track back where the slow down occurred...
The Lazy version has some significant overhead code -
.method public specialname
       instance default float64 get_a () cil managed
{
    // Method begins at RVA 0x2078
    // Code size 14 (0xe)
    .maxstack 3
    IL_0000: ldarg.0
    IL_0001: ldfld class [FSharp.Core]System.Lazy`1<float64> Test/T::v
    IL_0006: tail.
    IL_0008: call instance !0 class [FSharp.Core]System.Lazy`1<float64>::get_Value()
    IL_000d: ret
} // end of method T::get_a
Compare this to the direct version
.method public specialname
       instance default float64 get_a () cil managed
{
    // Method begins at RVA 0x20cc
    // Code size 10 (0xa)
    .maxstack 3
    IL_0000: ldc.r8 0.
    IL_0009: ret
} // end of method T2::get_a
So the direct version has a load and then return, whilst the indirect version has a load then a call and then a return.
Since the lazy version has an extra call I would expect it to be significantly slower.
UPDATE:
So I wondered if we could create a custom version of lazy which did not require the method calls - I also updated the test to actually call the method rather than just create the objects. Here is the code:
type T() =
    let v = lazy (0.0)
    member o.a() = v.Value

type T2() =
    member o.a() = 0.0

type T3() =
    let mutable calculated = true
    let mutable value = 0.0
    member o.a() = if calculated then value else failwith "not done";;

#time "on"
let lazy_ =
    for i in 0 .. 1000000 do
        T().a() |> ignore
    printfn "lazy"

#time "on"
let fakelazy =
    for i in 0 .. 1000000 do
        T3().a() |> ignore
    printfn "fake lazy"

#time "on"
let direct =
    for i in 0 .. 1000000 do
        T2().a() |> ignore
    printfn "direct";;
Which gives the following result:
lazy
Real: 00:00:03.786, CPU: 00:00:06.443, GC gen0: 7
val lazy_ : unit = ()
--> Timing now on
fake lazy
Real: 00:00:01.627, CPU: 00:00:02.858, GC gen0: 2
val fakelazy : unit = ()
--> Timing now on
direct
Real: 00:00:01.759, CPU: 00:00:02.935, GC gen0: 2
val direct : unit = ()
Here the lazy version is only 2x slower than the direct version and the fake lazy version is even slightly faster than the direct version - this is probably due to a GC happening during the benchmark.
Update: in the .NET Core world
A new constructor was added to Lazy to handle constants such as your case. Unfortunately F#'s lazy "pseudo keyword" always (at the moment!) wraps constants as functions.
Anyway, if you change:
let v = lazy (0.0)
to:
let v = Lazy<_> 0.0 // NB. Only .net core at the moment
then you will find that your T() class only takes ~3 times as long as your T2.
(What's the point of having a lazy constant? Well, it means that you can use Lazy as an abstraction with very little overhead when you have a mix of constants and real lazy items...)
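For example (a sketch relying on the value constructor mentioned above, so .NET Core only), a constant and a genuinely deferred computation can sit behind the same Lazy abstraction:

// Sketch: a pre-computed constant and real deferred work behind the same Lazy<_> type.
let cheap  : Lazy<int> = Lazy<_> 42                    // value constructor, no thunk to invoke
let costly : Lazy<int> = lazy (Seq.sum { 1 .. 1000 })  // genuinely deferred computation
printfn "%d %d" cheap.Value costly.Value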
...and...
If you actually use the created value a number of times then the overhead shrinks further. i.e. something such as:
open System.Diagnostics

type T() =
    let v = Lazy<_> 0.1
    member o.a () = v.Value

type T2() =
    member o.a () = 0.1

let withLazyType () =
    let mutable sum = 0.0
    for i in 0 .. 10000000 do
        let t = T()
        for __ = 1 to 10 do
            sum <- sum + t.a()
    sum

let withoutLazyType () =
    let mutable sum = 0.0
    for i in 0 .. 10000000 do
        let t = T2()
        for __ = 1 to 10 do
            sum <- sum + t.a()
    sum

let runtest name count f =
    let mutable checksum = 0.
    let mutable totaltime = 0L
    for i = 0 to count do
        if i = 0 then
            f () |> ignore // warm up
        else
            let sw = Stopwatch.StartNew ()
            checksum <- checksum + f ()
            totaltime <- totaltime + sw.ElapsedMilliseconds
    printfn "%s: %4d (checksum=%f for %d runs)" name (totaltime/int64 count) checksum count

[<EntryPoint>]
let main _ =
    runtest "w/o lazy" 10 withoutLazyType
    runtest "with lazy" 10 withLazyType
    0
brings the difference in time to < 2 times.
NB. I worked on the new lazy implementation...

Performance problem with Euler problem and recursion on Int64 types

I'm currently learning Haskell using the Project Euler problems as my playground.
I was astounded by how slow my Haskell programs turned out to be compared to similar programs written in other languages. I'm wondering if I've overlooked something, or if this is the kind of performance penalty one has to expect when using Haskell.
The following program is inspired by Problem 331, but I've changed it before posting so I don't spoil anything for other people. It computes the arc length of a discrete circle drawn on a 2^30 x 2^30 grid. It is a simple tail-recursive implementation and I make sure that the updates of the accumulation variable keeping track of the arc length are strict. Yet it takes almost one and a half minutes to complete (compiled with the -O flag with ghc).
import Data.Int

arcLength :: Int64 -> Int64
arcLength n = arcLength' 0 (n-1) 0 0 where
  arcLength' x y norm2 acc
    | x > y           = acc
    | norm2 < 0       = arcLength' (x + 1) y (norm2 + 2*x + 1) acc
    | norm2 > 2*(n-1) = arcLength' (x - 1) (y-1) (norm2 - 2*(x + y) + 2) acc
    | otherwise       = arcLength' (x + 1) y (norm2 + 2*x + 1) $! (acc + 1)

main = print $ arcLength (2^30)
Here is a corresponding implementation in Java. It takes about 4.5 seconds to complete.
public class ArcLength {
public static void main(String args[]) {
long n = 1 << 30;
long x = 0;
long y = n-1;
long acc = 0;
long norm2 = 0;
long time = System.currentTimeMillis();
while(x <= y) {
if (norm2 < 0) {
norm2 += 2*x + 1;
x++;
} else if (norm2 > 2*(n-1)) {
norm2 += 2 - 2*(x+y);
x--;
y--;
} else {
norm2 += 2*x + 1;
x++;
acc++;
}
}
time = System.currentTimeMillis() - time;
System.err.println(acc);
System.err.println(time);
}
}
EDIT: After the discussions in the comments I made some modifications to the Haskell code and did some performance tests. First I changed n to 2^29 to avoid overflows. Then I tried 6 different versions: with Int64 or Int, and with bang patterns on neither variable, on acc only, or on both norm2 and acc in the declaration arcLength' x y !norm2 !acc. All are compiled with
ghc -O3 -prof -rtsopts -fforce-recomp -XBangPatterns arctest.hs
Here are the results:
(Int !norm2 !acc)
total time = 3.00 secs (150 ticks @ 20 ms)
total alloc = 2,892 bytes (excludes profiling overheads)
(Int norm2 !acc)
total time = 3.56 secs (178 ticks @ 20 ms)
total alloc = 2,892 bytes (excludes profiling overheads)
(Int norm2 acc)
total time = 3.56 secs (178 ticks @ 20 ms)
total alloc = 2,892 bytes (excludes profiling overheads)
(Int64 norm2 acc)
arctest.exe: out of memory
(Int64 norm2 !acc)
total time = 48.46 secs (2423 ticks @ 20 ms)
total alloc = 26,246,173,228 bytes (excludes profiling overheads)
(Int64 !norm2 !acc)
total time = 31.46 secs (1573 ticks @ 20 ms)
total alloc = 3,032 bytes (excludes profiling overheads)
I'm using GHC 7.0.2 under a 64-bit Windows 7 (The Haskell platform binary distribution). According to the comments, the problem does not occur when compiling under other configurations. This makes me think that the Int64 type is broken in the Windows release.
Hm, I installed a fresh Haskell platform with 7.0.3, and get roughly the following core for your program (-ddump-simpl):
Main.$warcLength' =
\ (ww_s1my :: GHC.Prim.Int64#) (ww1_s1mC :: GHC.Prim.Int64#)
(ww2_s1mG :: GHC.Prim.Int64#) (ww3_s1mK :: GHC.Prim.Int64#) ->
case {__pkg_ccall ghc-prim hs_gtInt64 [...]
ww_s1my ww1_s1mC GHC.Prim.realWorld#
[...]
So GHC has realized that it can unpack your integers, which is good. But this hs_gtInt64 call looks suspiciously like a C call. Looking at the assembler output (-ddump-asm), we see stuff like:
pushl %eax
movl 76(%esp),%eax
pushl %eax
call _hs_gtInt64
addl $16,%esp
So this looks very much like every operation on the Int64 get turned into a full-blown C call in the backend. Which is slow, obviously.
The source code of GHC.IntWord64 seems to verify that: In a 32-bit build (like the one currently shipped with the platform), you will have only emulation via the FFI interface.
Hmm, this is interesting. So I just compiled both of your programs, and tried them out:
% java -version
java version "1.6.0_18"
OpenJDK Runtime Environment (IcedTea6 1.8.7) (6b18-1.8.7-2~squeeze1)
OpenJDK 64-Bit Server VM (build 14.0-b16, mixed mode)
% javac ArcLength.java
% java ArcLength
843298604
6630
So about 6.6 seconds for the Java solution. Next is ghc with some optimization:
% ghc --version
The Glorious Glasgow Haskell Compilation System, version 6.12.1
% ghc --make -O arc.hs
% time ./arc
843298604
./arc 12.68s user 0.04s system 99% cpu 12.718 total
Just under 13 seconds for ghc -O
Trying with some further optimization:
% ghc --make -O3
% time ./arc
843298604
./arc 5.75s user 0.00s system 99% cpu 5.754 total
With further optimization flags, the Haskell solution took under 6 seconds.
It would be interesting to know what compiler version you are using.
There are a couple of interesting things in your question.
You should be using -O2 primarily. It will just do a better job (in this case, identifying and removing laziness that was still present in the -O version).
Secondly, your Haskell isn't quite the same as the Java (it does different tests and branches). As with others, running your code on my Linux box results in around 6s runtime. It seems fine.
Make sure it is the same as the Java
One idea: let's do a literal transcription of your Java, with the same control flow, operations and types.
import Data.Bits
import Data.Int

loop :: Int -> Int
loop n = go 0 (n-1) 0 0
  where
    go :: Int -> Int -> Int -> Int -> Int
    go x y acc norm2
        | x <= y = case () of { _
            | norm2 < 0         -> go (x+1) y acc     (norm2 + 2*x + 1)
            | norm2 > 2 * (n-1) -> go (x-1) (y-1) acc (norm2 + 2 - 2 * (x+y))
            | otherwise         -> go (x+1) y (acc+1) (norm2 + 2*x + 1)
            }
        | otherwise = acc

main = print $ loop (1 `shiftL` 30)
Peek at the core
We'll take a quick peek at the Core, using ghc-core, and it shows a very nice loop of unboxed type:
main_$s$wgo
:: Int#
-> Int#
-> Int#
-> Int#
-> Int#
main_$s$wgo =
\ (sc_sQa :: Int#)
(sc1_sQb :: Int#)
(sc2_sQc :: Int#)
(sc3_sQd :: Int#) ->
case <=# sc3_sQd sc2_sQc of _ {
False -> sc1_sQb;
True ->
case <# sc_sQa 0 of _ {
False ->
case ># sc_sQa 2147483646 of _ {
False ->
main_$s$wgo
(+# (+# sc_sQa (*# 2 sc3_sQd)) 1)
(+# sc1_sQb 1)
sc2_sQc
(+# sc3_sQd 1);
True ->
main_$s$wgo
(-#
(+# sc_sQa 2)
(*# 2 (+# sc3_sQd sc2_sQc)))
sc1_sQb
(-# sc2_sQc 1)
(-# sc3_sQd 1)
};
True ->
main_$s$wgo
(+# (+# sc_sQa (*# 2 sc3_sQd)) 1)
sc1_sQb
sc2_sQc
(+# sc3_sQd 1)
that is, all unboxed into registers. That loop looks great!
And performs just fine (Linux/x86-64/GHC 7.0.3):
./A 5.95s user 0.01s system 99% cpu 5.980 total
Checking the asm
We get reasonable assembly too, as a nice loop:
Main_mainzuzdszdwgo_info:
cmpq %rdi, %r8
jg .L8
.L3:
testq %r14, %r14
movq %r14, %rdx
js .L4
cmpq $2147483646, %r14
jle .L9
.L5:
leaq (%rdi,%r8), %r10
addq $2, %rdx
leaq -1(%rdi), %rdi
addq %r10, %r10
movq %rdx, %r14
leaq -1(%r8), %r8
subq %r10, %r14
jmp Main_mainzuzdszdwgo_info
.L9:
leaq 1(%r14,%r8,2), %r14
addq $1, %rsi
leaq 1(%r8), %r8
jmp Main_mainzuzdszdwgo_info
.L8:
movq %rsi, %rbx
jmp *0(%rbp)
.L4:
leaq 1(%r14,%r8,2), %r14
leaq 1(%r8), %r8
jmp Main_mainzuzdszdwgo_info
Using the -fvia-C backend.
So this looks fine!
My suspicion, as mentioned in the comment above, is something to do with the version of libgmp you have on 32 bit Windows generating poor code for 64 bit ints. First try upgrading to GHC 7.0.3, and then try some of the other code generator backends, then if you still have an issue with Int64, file a bug report to GHC trac.
Broadly confirming that it is indeed the cost of making those C calls in the 32 bit emulation of 64 bit ints, we can replace Int64 with Integer, which is implemented with C calls to GMP on every machine, and indeed, runtime goes from 3s to well over a minute.
Lesson: use hardware 64 bits if at all possible.
The normal optimization flag for performance concerned code is -O2. What you used, -O, does very little. -O3 doesn't do much (any?) more than -O2 - it even used to include experimental "optimizations" that often made programs notably slower.
With -O2 I get performance competitive with Java:
tommd@Mavlo:Test$ uname -r -m
2.6.37 x86_64
tommd@Mavlo:Test$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.0.3
tommd@Mavlo:Test$ ghc -O2 so.hs
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
tommd@Mavlo:Test$ time ./so
843298604
real 0m4.948s
user 0m4.896s
sys 0m0.000s
And Java is about 1 second faster (20%):
tommd@Mavlo:Test$ time java ArcLength
843298604
3880
real 0m3.961s
user 0m3.936s
sys 0m0.024s
But an interesting thing about GHC is it has many different backends. By default it uses the native code generator (NCG), which we timed above. There's also an LLVM backend that often does better... but not here:
tommd@Mavlo:Test$ ghc -O2 so.hs -fllvm -fforce-recomp
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
tommd@Mavlo:Test$ time ./so
843298604
real 0m5.973s
user 0m5.968s
sys 0m0.000s
But, as FUZxxl mentioned in the comments, LLVM does much better when you add a few strictness annotations:
$ ghc -O2 -fllvm -fforce-recomp so.hs
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
tommd@Mavlo:Test$ time ./so
843298604
real 0m4.099s
user 0m4.088s
sys 0m0.000s
There's also an old "via-c" generator that uses C as an intermediate language. It does well in this case:
tommd@Mavlo:Test$ ghc -O2 so.hs -fvia-c -fforce-recomp
[1 of 1] Compiling Main ( so.hs, so.o )
on the commandline:
Warning: The -fvia-c flag will be removed in a future GHC release
Linking so ...
tommd@Mavlo:Test$ time ./so
843298604
real 0m3.982s
user 0m3.972s
sys 0m0.000s
Hopefully the NCG will be improved to match via-c for this case before they remove this backend.
dberg, I feel like all of this got off to a bad start with the unfortunate -O flag. Just to emphasize a point made by others, for run-of-the-mill compilation and testing, do like me and paste this into your .bashrc or whatever:
alias ggg="ghc --make -O2"
alias gggg="echo 'Glorious Glasgow for Great Good!' && ghc --make -O2 -fforce-recomp"
I've played with the code a little and this version seems to run faster than the Java version on my laptop (3.55s vs 4.63s):
{-# LANGUAGE BangPatterns #-}

arcLength :: Int -> Int
arcLength n = arcLength' 0 (n-1) 0 0 where
  arcLength' :: Int -> Int -> Int -> Int -> Int
  arcLength' !x !y !norm2 !acc
    | x > y           = acc
    | norm2 > 2*(n-1) = arcLength' (x - 1) (y - 1) (norm2 - 2*(x + y) + 2) acc
    | norm2 < 0       = arcLength' (succ x) y (norm2 + x*2 + 1) acc
    | otherwise       = arcLength' (succ x) y (norm2 + 2*x + 1) (acc + 1)

main = print $ arcLength (2^30)
Compiled and run:
$ ghc -O2 tmp1.hs -fforce-recomp
[1 of 1] Compiling Main ( tmp1.hs, tmp1.o )
Linking tmp1 ...
$ time ./tmp1
843298604
real 0m3.553s
user 0m3.539s
sys 0m0.006s

What is tail call optimization?

Very simply, what is tail-call optimization?
More specifically, what are some small code snippets where it could be applied, and where not, with an explanation of why?
Tail-call optimization is where you are able to avoid allocating a new stack frame for a function because the calling function will simply return the value that it gets from the called function. The most common use is tail-recursion, where a recursive function written to take advantage of tail-call optimization can use constant stack space.
Scheme is one of the few programming languages that guarantee in the spec that any implementation must provide this optimization, so here are two examples of the factorial function in Scheme:
(define (fact x)
(if (= x 0) 1
(* x (fact (- x 1)))))
(define (fact x)
(define (fact-tail x accum)
(if (= x 0) accum
(fact-tail (- x 1) (* x accum))))
(fact-tail x 1))
The first function is not tail recursive because when the recursive call is made, the function needs to keep track of the multiplication it needs to do with the result after the call returns. As such, the stack looks as follows:
(fact 3)
(* 3 (fact 2))
(* 3 (* 2 (fact 1)))
(* 3 (* 2 (* 1 (fact 0))))
(* 3 (* 2 (* 1 1)))
(* 3 (* 2 1))
(* 3 2)
6
In contrast, the stack trace for the tail recursive factorial looks as follows:
(fact 3)
(fact-tail 3 1)
(fact-tail 2 3)
(fact-tail 1 6)
(fact-tail 0 6)
6
As you can see, we only need to keep track of the same amount of data for every call to fact-tail because we are simply returning the value we get right through to the top. This means that even if I were to call (fact 1000000), I need only the same amount of space as (fact 3). This is not the case with the non-tail-recursive fact, and as such large values may cause a stack overflow.
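The same accumulator pattern in F# (a minimal sketch) looks like the following; the F# compiler turns a direct self-tail-call like this into a loop, so it runs in constant stack space:

// Minimal F# sketch: tail-recursive factorial with an accumulator.
let factorial (n : bigint) =
    let rec loop acc n =
        if n <= 1I then acc
        else loop (acc * n) (n - 1I)   // tail call: nothing is left to do after it returns
    loop 1I n

printfn "%d digits" ((factorial 100000I).ToString().Length)   // completes without a stack overflow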
Let's walk through a simple example: the factorial function implemented in C.
We start with the obvious recursive definition
unsigned fac(unsigned n)
{
if (n < 2) return 1;
return n * fac(n - 1);
}
A function ends with a tail call if the last operation before the function returns is another function call. If this call invokes the same function, it is tail-recursive.
Even though fac() looks tail-recursive at first glance, it is not as what actually happens is
unsigned fac(unsigned n)
{
if (n < 2) return 1;
unsigned acc = fac(n - 1);
return n * acc;
}
ie the last operation is the multiplication and not the function call.
However, it's possible to rewrite fac() to be tail-recursive by passing the accumulated value down the call chain as an additional argument and passing only the final result up again as the return value:
unsigned fac(unsigned n)
{
return fac_tailrec(1, n);
}
unsigned fac_tailrec(unsigned acc, unsigned n)
{
if (n < 2) return acc;
return fac_tailrec(n * acc, n - 1);
}
Now, why is this useful? Because we immediately return after the tail call, we can discard the previous stackframe before invoking the function in tail position, or, in case of recursive functions, reuse the stackframe as-is.
The tail-call optimization transforms our recursive code into
unsigned fac_tailrec(unsigned acc, unsigned n)
{
TOP:
if (n < 2) return acc;
acc = n * acc;
n = n - 1;
goto TOP;
}
This can be inlined into fac() and we arrive at
unsigned fac(unsigned n)
{
unsigned acc = 1;
TOP:
if (n < 2) return acc;
acc = n * acc;
n = n - 1;
goto TOP;
}
which is equivalent to
unsigned fac(unsigned n)
{
unsigned acc = 1;
for (; n > 1; --n)
acc *= n;
return acc;
}
As we can see here, a sufficiently advanced optimizer can replace tail-recursion with iteration, which is far more efficient as you avoid function call overhead and only use a constant amount of stack space.
TCO (Tail Call Optimization) is the process by which a smart compiler can make a call to a function and take no additional stack space. The only situation in which this happens is if the last instruction executed in a function f is a call to a function g (Note: g can be f). The key here is that f no longer needs stack space - it simply calls g and then returns whatever g would return. In this case the optimization can be made that g just runs and returns whatever value it would have to the thing that called f.
This optimization can make recursive calls take constant stack space, rather than explode.
Example: this factorial function is not TCOptimizable:
from dis import dis

def fact(n):
    if n == 0:
        return 1
    return n * fact(n-1)

dis(fact)
2 0 LOAD_FAST 0 (n)
2 LOAD_CONST 1 (0)
4 COMPARE_OP 2 (==)
6 POP_JUMP_IF_FALSE 12
3 8 LOAD_CONST 2 (1)
10 RETURN_VALUE
4 >> 12 LOAD_FAST 0 (n)
14 LOAD_GLOBAL 0 (fact)
16 LOAD_FAST 0 (n)
18 LOAD_CONST 2 (1)
20 BINARY_SUBTRACT
22 CALL_FUNCTION 1
24 BINARY_MULTIPLY
26 RETURN_VALUE
This function does things besides call another function in its return statement.
This below function is TCOptimizable:
def fact_h(n, acc):
    if n == 0:
        return acc
    return fact_h(n-1, acc*n)

def fact(n):
    return fact_h(n, 1)

dis(fact)
2 0 LOAD_GLOBAL 0 (fact_h)
2 LOAD_FAST 0 (n)
4 LOAD_CONST 1 (1)
6 CALL_FUNCTION 2
8 RETURN_VALUE
This is because the last thing to happen in any of these functions is to call another function.
Probably the best high level description I have found for tail calls, recursive tail calls and tail call optimization is the blog post
"What the heck is: A tail call"
by Dan Sugalski. On tail call optimization he writes:
Consider, for a moment, this simple function:
sub foo (int a) {
a += 15;
return bar(a);
}
So, what can you, or rather your language compiler, do? Well, what it can do is turn code of the form return somefunc(); into the low-level sequence pop stack frame; goto somefunc();. In our example, that means before we call bar, foo cleans itself up and then, rather than calling bar as a subroutine, we do a low-level goto operation to the start of bar. Foo's already cleaned itself out of the stack, so when bar starts it looks like whoever called foo has really called bar, and when bar returns its value, it returns it directly to whoever called foo, rather than returning it to foo which would then return it to its caller.
And on tail recursion:
Tail recursion happens if a function, as its last operation, returns
the result of calling itself. Tail recursion is easier to deal with
because rather than having to jump to the beginning of some random
function somewhere, you just do a goto back to the beginning of
yourself, which is a darned simple thing to do.
So that this:
sub foo (int a, int b) {
  if (b == 1) {
    return a;
  } else {
    return foo(a*a + a, b - 1);
  }
}
gets quietly turned into:
sub foo (int a, int b) {
  label:
  if (b == 1) {
    return a;
  } else {
    a = a*a + a;
    b = b - 1;
    goto label;
  }
}
What I like about this description is how succinct and easy it is to grasp for those coming from an imperative language background (C, C++, Java)
GCC C minimal runnable example with x86 disassembly analysis
Let's see how GCC can automatically do tail call optimizations for us by looking at the generated assembly.
This will serve as an extremely concrete example of what was mentioned in other answers such as https://stackoverflow.com/a/9814654/895245 that the optimization can convert recursive function calls to a loop.
This in turn saves memory and improves performance, since memory accesses are often the main thing that makes programs slow nowadays.
As an input, we give GCC a non-optimized naive stack based factorial:
tail_call.c
#include <stdio.h>
#include <stdlib.h>
unsigned factorial(unsigned n) {
if (n == 1) {
return 1;
}
return n * factorial(n - 1);
}
int main(int argc, char **argv) {
int input;
if (argc > 1) {
input = strtoul(argv[1], NULL, 0);
} else {
input = 5;
}
printf("%u\n", factorial(input));
return EXIT_SUCCESS;
}
GitHub upstream.
Compile and disassemble:
gcc -O1 -foptimize-sibling-calls -ggdb3 -std=c99 -Wall -Wextra -Wpedantic \
-o tail_call.out tail_call.c
objdump -d tail_call.out
where -foptimize-sibling-calls is the name of the generalization of tail calls, according to man gcc:
-foptimize-sibling-calls
Optimize sibling and tail recursive calls.
Enabled at levels -O2, -O3, -Os.
as mentioned at: How do I check if gcc is performing tail-recursion optimization?
I choose -O1 because:
the optimization is not done with -O0. I suspect that this is because there are required intermediate transformations missing.
-O3 produces ungodly efficient code that would not be very educative, although it is also tail call optimized.
Disassembly with -fno-optimize-sibling-calls:
0000000000001145 <factorial>:
1145: 89 f8 mov %edi,%eax
1147: 83 ff 01 cmp $0x1,%edi
114a: 74 10 je 115c <factorial+0x17>
114c: 53 push %rbx
114d: 89 fb mov %edi,%ebx
114f: 8d 7f ff lea -0x1(%rdi),%edi
1152: e8 ee ff ff ff callq 1145 <factorial>
1157: 0f af c3 imul %ebx,%eax
115a: 5b pop %rbx
115b: c3 retq
115c: c3 retq
With -foptimize-sibling-calls:
0000000000001145 <factorial>:
1145: b8 01 00 00 00 mov $0x1,%eax
114a: 83 ff 01 cmp $0x1,%edi
114d: 74 0e je 115d <factorial+0x18>
114f: 8d 57 ff lea -0x1(%rdi),%edx
1152: 0f af c7 imul %edi,%eax
1155: 89 d7 mov %edx,%edi
1157: 83 fa 01 cmp $0x1,%edx
115a: 75 f3 jne 114f <factorial+0xa>
115c: c3 retq
115d: 89 f8 mov %edi,%eax
115f: c3 retq
The key difference between the two is that:
the -fno-optimize-sibling-calls uses callq, which is the typical non-optimized function call.
This instruction pushes the return address to the stack, therefore increasing it.
Furthermore, this version also does push %rbx, which pushes %rbx to the stack.
GCC does this because it stores edi, which holds the first function argument (n), into ebx, then calls factorial.
GCC needs to do this because it is preparing for another call to factorial, which will use the new edi == n-1.
It chooses ebx because this register is callee-saved (see: What registers are preserved through a linux x86-64 function call), so the subcall to factorial won't change it and lose n.
the -foptimize-sibling-calls version does not use any instructions that push to the stack: it only does goto jumps within factorial with the instructions je and jne.
Therefore, this version is equivalent to a while loop, without any function calls. Stack usage is constant.
Tested in Ubuntu 18.10, GCC 8.2.
Note first of all that not all languages support it.
TCO applies to a special case of recursion. The gist of it is, if the last thing you do in a function is call itself (i.e. it calls itself from the "tail" position), this can be optimized by the compiler to act like iteration instead of standard recursion.
You see, normally during recursion, the runtime needs to keep track of all the recursive calls, so that when one returns it can resume at the previous call and so on. (Try manually writing out the result of a recursive call to get a visual idea of how this works.) Keeping track of all the calls takes up space, which gets significant when the function calls itself a lot. But with TCO, it can just say "go back to the beginning, only this time change the parameter values to these new ones." It can do that because nothing after the recursive call refers to those values.
Look here:
http://tratt.net/laurie/tech_articles/articles/tail_call_optimization
As you probably know, recursive function calls can wreak havoc on a stack; it is easy to quickly run out of stack space. Tail call optimization is a way to create a recursive-style algorithm that uses constant stack space, so the stack does not grow and grow and you don't get stack errors.
The recursive function approach has a problem. It builds up a call stack of size O(n), which makes our total memory cost O(n). This makes it vulnerable to a stack overflow error, where the call stack gets too big and runs out of space.
Tail call optimization (TCO) is a scheme that optimizes such recursive functions so they avoid building up a tall call stack, and hence saves the memory cost.
Many languages do TCO (for example JavaScript, Ruby, and some C compilers), whereas Python and Java do not.
The JavaScript language has it in its specification (ES2015): http://2ality.com/2015/06/tail-call-optimization.html
The requirement is that the recursive call is the very last thing the function does, so nothing remains to be executed in the caller after it returns.
Large-scale recursion benefits from this optimization, but at a small scale the instruction overhead of arranging the call as a tail call can outweigh the benefit.
TCO might cause a forever-running function: with the optimization, this infinite recursion simply loops instead of overflowing the stack:
void eternity()
{
eternity();
}
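The F# analogue (a sketch) shows the same effect once the compiler turns the tail call into a loop:

// Sketch: with the self-call in tail position, the F# compiler emits a plain loop,
// so this would spin forever instead of throwing a StackOverflowException.
let rec eternity () : unit = eternity ()
// eternity ()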
In a functional language, tail call optimization is as if a function call could return a partially evaluated expression as the result, which would then be evaluated by the caller.
f x = g x
f 6 reduces to g 6. So if the implementation could return g 6 as the result, and then call that expression it would save a stack frame.
Also
f x = if c x then g x else h x.
f 6 reduces to either g 6 or h 6. So if the implementation evaluates c 6 and finds it is true, then it can reduce,
if true then g x else h x ---> g x
f x ---> h x
A simple interpreter without tail call optimization might look like this,
class simple_expresion
{
...
public:
virtual simple_value *DoEvaluate() const = 0;
};
class simple_value
{
...
};
class simple_function : public simple_expresion
{
...
private:
simple_expresion *m_Function;
simple_expresion *m_Parameter;
public:
virtual simple_value *DoEvaluate() const
{
vector<simple_expresion *> parameterList;
parameterList.push_back(m_Parameter);
return m_Function->Call(parameterList);
}
};
class simple_if : public simple_function
{
private:
simple_expresion *m_Condition;
simple_expresion *m_Positive;
simple_expresion *m_Negative;
public:
simple_value *DoEvaluate() const
{
if (m_Condition->DoEvaluate()->IsTrue())
{
return m_Positive->DoEvaluate();
}
else
{
return m_Negative->DoEvaluate();
}
}
}
An interpreter with tail call optimization might look like this,
class tco_expresion
{
...
public:
virtual tco_expresion *DoEvaluate() const = 0;
virtual bool IsValue()
{
return false;
}
};
class tco_value
{
...
public:
virtual bool IsValue()
{
return true;
}
};
class tco_function : public tco_expresion
{
...
private:
tco_expresion *m_Function;
tco_expresion *m_Parameter;
public:
virtual tco_expresion *DoEvaluate() const
{
vector<tco_expresion *> parameterList;
tco_expresion *function = const_cast<tco_function *>(this);
while (!function->IsValue())
{
function = function->DoCall(parameterList);
}
return function;
}
tco_expresion *DoCall(vector<tco_expresion *> &p_ParameterList)
{
p_ParameterList.push_back(m_Parameter);
return m_Function;
}
};
class tco_if : public tco_function
{
private:
tco_expresion *m_Condition;
tco_expresion *m_Positive;
tco_expresion *m_Negative;
tco_expresion *DoEvaluate() const
{
if (m_Condition->DoEvaluate()->IsTrue())
{
return m_Positive;
}
else
{
return m_Negative;
}
}
}
