Copy-and-update semantics and performance in F# record types

I am experimenting with a settings object that is passed on through a bunch of functions (a bit like a stack). It has quite a few fields (mixed ints, enums, DU's, strings) and I was wondering what the best data type is for this task (I have come back to this a few times in the past few years...).
While I currently employ a home-grown data type, it is slow and not thread-safe, so I am looking into a more logical choice and decided to experiment with record types.
Since they use copy-and-update semantics, it seems logical to think the F# compiler is smart enough to update only the necessary data and leave non-mutated data alone, considering it is immutable anyway.
So I ran a couple of tests and expected to see a performance improvement between full-copy (each field is updated) and one-field copy (one field is updated). But I am not sure I am using the proper approach for testing, or whether other datatypes may be more suitable.
As an actual example, let's take an XML qualified name. They have a small local part and a prefix (which can be ignored) and a long namespace part. The namespace is mostly the same, so only the local part needs updating.
type RecQName = { Ns :string; Prefix :string; Name :string }
// full update
let f a b c = { Ns = a; Prefix = b; Name = c}
let g() = f "http://test/is/a/long/name/here/and/there" "xs" "value"
// partial update
let test = g();;
let h a b = {a with Name = b }
let k() = h test "newname"
With timings on, in FSI (set to Release, x64 and Debug off) I get (running each twice):
> for i in 0 .. 100000000 do g() |> ignore;;
Real: 00:00:01.412, CPU: 00:00:01.404, GC gen0: 637, gen1: 1, gen2: 1
> for i in 0 .. 100000000 do g() |> ignore;;
Real: 00:00:01.317, CPU: 00:00:01.310, GC gen0: 636, gen1: 0, gen2: 0
> for i in 0 .. 100000000 do k() |> ignore;;
Real: 00:00:01.191, CPU: 00:00:01.185, GC gen0: 636, gen1: 1, gen2: 0
> for i in 0 .. 100000000 do k() |> ignore;;
Real: 00:00:01.099, CPU: 00:00:01.092, GC gen0: 636, gen1: 0, gen2: 0
Now I know timings are not everything, and there's a clear difference of roughly 20%, but that seems too small to justify the change, and it may well be due to other reasons (the CLI may intern the strings, for instance).
Am I making the wrong assumptions? I googled for record-type performance, but the results were all about comparing records with structs. Does anybody know the algorithm used for copy-and-update? Any thoughts on whether this datatype or something else is a smarter choice (given many fields, not just three as above, and wanting immutability with copy-and-update rather than locking)?
Update 1
The test above doesn't really test anything, it seems, as can be shown with the next test.
type Rec = { A: int; B: int; C: int; D: int; E: int};;
// full
let f a b c d e = { A = a; B = b; C = c; D = d; E = e }
// partial
let temp = f 1 2 3 4 5
let g b = { temp with B = b }
// perf tests (subtract is necessary or the compiler optimizes them away)
let mutable res = 0;;
for i in 0 .. 1000000000 do f i (i-1) (i-2) (i-3) (i-4) |> function { B = b } -> if b < 1000 then res <- b
for i in 0 .. 1000000000 do g i |> function { B = b } -> if b < 1000 then res <- b
Results:
> for i in 0 .. 1000000000 do f i (i-1) (i-2) (i-3) (i-4) |> function { B = b } -> if b < 1000 then res <- b ;;
Real: 00:00:09.039, CPU: 00:00:09.032, GC gen0: 6358, gen1: 1, gen2: 0
> for i in 0 .. 1000000000 do g i |> function { B = b } -> if b < 1000 then res <- b;;
Real: 00:00:10.571, CPU: 00:00:10.576, GC gen0: 6358, gen1: 2, gen2: 0
Now a difference shows, although in the unexpected direction: the copy-and-update is clearly not faster than constructing the record in full. The overhead of the copy in this case may be due to the extra stack slot required; I don't know.
At the very least it shows that no magic goes on (as Fyodor already mentioned in the comments).
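For reference, copy-and-update is plain syntactic sugar: the compiler reads every unchanged field from the source record and calls the generated constructor, so a new object is always allocated. A hand-written equivalent of g above would be something like the following (a sketch of the elaboration, not verified compiler output; gDesugared is just an illustrative name):
// roughly what { temp with B = b } elaborates to:
// all five fields are passed to the record constructor, unchanged ones read from temp
let gDesugared b =
    { A = temp.A; B = b; C = temp.C; D = temp.D; E = temp.E }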
Update 2
Ok, one more update. If I inline both the f and g functions above, the timings become remarkably different, in favor of the partial update. Apparently the inlining has the effect that either the compiler or the JIT "knows" it doesn't have to do a full copy, or it is simply the effect of keeping everything on the stack (the JIT or compiler eliminating the heap allocation), as can be seen from the 0 GC collections.
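Presumably the inlined versions are just the definitions above marked inline (a sketch, since the exact code used is not shown):
// same bodies as before, now candidates for inlining at the call site
let inline f a b c d e = { A = a; B = b; C = c; D = d; E = e }
let inline g b = { temp with B = b }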
> for i in 0 .. 1000000000 do f i (i-1) (i-2) (i-3) (i-4) |> function { B = b } -> if b < 1000 then res <- b ;;
Real: 00:00:08.885, CPU: 00:00:08.876, GC gen0: 6359, gen1: 1, gen2: 1
> for i in 0 .. 1000000000 do g i |> function { B = b } -> if b < 1000 then res <- b ;;
Real: 00:00:00.571, CPU: 00:00:00.561, GC gen0: 0, gen1: 0, gen2: 0
Whether this holds in the "real world" is debatable and should (of course) be tested. Whether this improvement has any relation to record types, I doubt it.

Related

F# Performance Impact of Checked Calcs?

Is there a performance impact from using the Checked module? I've tested it out with sequences of type int and see no noticeable difference. Sometimes the checked version is faster and sometimes unchecked is faster, but generally not by much.
Seq.initInfinite (fun x-> x) |> Seq.item 1000000000;;
Real: 00:00:05.272, CPU: 00:00:05.272, GC gen0: 0, gen1: 0, gen2: 0
val it : int = 1000000000
open Checked
Seq.initInfinite (fun x-> x) |> Seq.item 1000000000;;
Real: 00:00:04.785, CPU: 00:00:04.773, GC gen0: 0, gen1: 0, gen2: 0
val it : int = 1000000000
Basically I'm trying to figure out if there would be any downside to always opening Checked. (I encountered an overflow that wasn't immediately obvious, so I'm now playing the role of the jilted lover who doesn't want another broken heart.) The only non-contrived reason I can come up with for not always using Checked is if there were some performance hit, but I haven't seen one yet.
When you measure performance it's usually not a good idea to involve Seq, as Seq adds a lot of overhead (at least compared to int operations), so you risk that most of the time is spent in Seq rather than in the code you'd like to test.
I wrote a small test program for (+):
let clock =
    let sw = System.Diagnostics.Stopwatch ()
    sw.Start ()
    fun () ->
        sw.ElapsedMilliseconds

let dbreak () = System.Diagnostics.Debugger.Break ()

let time a =
    let b = clock ()
    let r = a ()
    let n = clock ()
    let d = n - b
    d, r

module Unchecked =
    let run c () =
        let rec loop a i =
            if i < c then
                loop (a + 1) (i + 1)
            else
                a
        loop 0 0

module Checked =
    open Checked

    let run c () =
        let rec loop a i =
            if i < c then
                loop (a + 1) (i + 1)
            else
                a
        loop 0 0

[<EntryPoint>]
let main argv =
    let count = 1000000000
    let testCases =
        [|
            "Unchecked" , Unchecked.run
            "Checked"   , Checked.run
        |]
    for nm, a in testCases do
        printfn "Running %s ..." nm
        let ms, r = time (a count)
        printfn "... it took %d ms, result is %A" ms r
    0
The performance results are this:
Running Unchecked ...
... it took 561 ms, result is 1000000000
Running Checked ...
... it took 1103 ms, result is 1000000000
So it seems some overhead is added by using Checked. The cost of the integer add itself should be less than the loop overhead, so the true overhead of Checked is higher than 2x, maybe closer to 4x.
Out of curiosity we can check the IL code using tools like ILSpy:
Unchecked:
IL_0000: nop
IL_0001: ldarg.2
IL_0002: ldarg.0
IL_0003: bge.s IL_0014
IL_0005: ldarg.0
IL_0006: ldarg.1
IL_0007: ldc.i4.1
IL_0008: add
IL_0009: ldarg.2
IL_000a: ldc.i4.1
IL_000b: add
IL_000c: starg.s i
IL_000e: starg.s a
IL_0010: starg.s c
IL_0012: br.s IL_0000
Checked:
IL_0000: nop
IL_0001: ldarg.2
IL_0002: ldarg.0
IL_0003: bge.s IL_0014
IL_0005: ldarg.0
IL_0006: ldarg.1
IL_0007: ldc.i4.1
IL_0008: add.ovf
IL_0009: ldarg.2
IL_000a: ldc.i4.1
IL_000b: add.ovf
IL_000c: starg.s i
IL_000e: starg.s a
IL_0010: starg.s c
IL_0012: br.s IL_0000
The only difference is that Unchecked uses add and Checked uses add.ovf. add.ovf is add with overflow check.
We can dig even deeper by looking at the jitted x86_64 code.
Unchecked:
; if i < c then
00007FF926A611B3 cmp esi,ebx
00007FF926A611B5 jge 00007FF926A611BD
; i + 1
00007FF926A611B7 inc esi
; a + 1
00007FF926A611B9 inc edi
; loop (a + 1) (i + 1)
00007FF926A611BB jmp 00007FF926A611B3
Checked:
; if i < c then
00007FF926A62613 cmp esi,ebx
00007FF926A62615 jge 00007FF926A62623
; a + 1
00007FF926A62617 add edi,1
; Overflow?
00007FF926A6261A jo 00007FF926A6262D
; i + 1
00007FF926A6261C add esi,1
; Overflow?
00007FF926A6261F jo 00007FF926A6262D
; loop (a + 1) (i + 1)
00007FF926A62621 jmp 00007FF926A62613
Now the reason for the Checked overhead is visible. After each operation the jitter inserts the conditional instruction jo which jumps to code that raises OverflowException if the overflow flag is set.
This chart shows us that the cost of an integer add is less than 1 clock cycle. The reason it's less than 1 clock cycle is that modern CPUs can execute certain instructions in parallel.
The chart also shows us that a branch that was correctly predicted by the CPU takes around 1-2 clock cycles.
So, assuming a throughput of at least 2, the cost of the two integer additions in the Unchecked example should be about 1 clock cycle.
In the Checked example we do add, jo, add, jo. Most likely the CPU can't parallelize these, so the cost should be around 4-6 clock cycles.
Another interesting difference is that the order of the additions changed. With checked additions the order of the operations matters, but with unchecked additions the jitter (and the CPU) has greater flexibility to reorder the operations, possibly improving performance.
So, long story short: for cheap operations like (+) the overhead of Checked should be around 4x-6x compared to Unchecked.
This assumes no overflow exception is raised. The cost of a .NET exception is probably around 100,000x that of an integer addition.
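If the overhead only matters in a few hot paths, one option (a sketch, not part of the original answer) is to keep Checked opened in a narrow scope instead of for the whole file:
// checked arithmetic only where overflow is a real concern
module SafeMath =
    open Checked
    let addExact (a: int) (b: int) = a + b   // raises System.OverflowException on overflow

// hot loop keeps the default unchecked operators
let sumUnchecked (xs: int[]) =
    let mutable acc = 0
    for x in xs do
        acc <- acc + x
    acc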

Optimising F# answer for Euler #4

I have recently begun learning F#, hoping to use it for the mathematically heavy algorithms in my C# applications and to broaden my knowledge.
I have so far avoided Stack Overflow, as I didn't want to see the answer to this until I came to one myself.
I want to be able to write very efficient F# code, focused on performance first and maybe on other qualities later, such as conciseness (number of lines etc.).
Project Euler Question 4:
A palindromic number reads the same both ways. The largest palindrome made from the product of two 2-digit numbers is 9009 = 91 × 99.
Find the largest palindrome made from the product of two 3-digit numbers.
My Answer:
let IsPalindrome (x:int) = if x.ToString().ToCharArray() = Array.rev(x.ToString().ToCharArray()) then x else 0
let euler4 =
    [ for i in [100..999] do
        for j in [i..999] do
            yield i*j ]
    |> Seq.filter (fun x -> x = IsPalindrome(x))
    |> Seq.max
    |> printf "Largest product of two 3-digit numbers is %d"
I tried using option and returning Some(x) and None in IsPalindrome but kept getting compile errors, as I was passing in an int and returning an int option. I got a NullReferenceException trying to return None.Value.
Instead I return 0 if the number isn't a palindrome; unfortunately, those 0's go into the sequence.
Maybe I could sort the sequence and then take the top value instead of using Seq.max? Or filter to keep only results > 1?
Would this be better? Any advice would be much appreciated, even if it's general F# advice.
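For reference, the option-returning version described above would look something like this (a sketch using Seq.choose, not code from the original question):
open System

// Some product when it reads the same forwards and backwards, None otherwise
let isPalindromeOpt (x: int) =
    let s = string x
    if s = String(Array.rev (s.ToCharArray())) then Some x else None

[ for i in 100 .. 999 do
    for j in i .. 999 -> i * j ]
|> Seq.choose isPalindromeOpt   // drops non-palindromes instead of yielding 0
|> Seq.max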
Efficiency being a primary concern, using string allocation/manipulation to find a numeric palindrome seems misguided – here's my approach:
module NumericLiteralG =
    let inline FromZero () = LanguagePrimitives.GenericZero
    let inline FromOne () = LanguagePrimitives.GenericOne

module Euler =
    let inline isNumPalindrome number =
        let ten = 1G + 1G + 1G + 1G + 1G + 1G + 1G + 1G + 1G + 1G
        let hundred = ten * ten
        let rec findHighDiv div =
            let div' = div * ten
            if number / div' = 0G then div else findHighDiv div'
        let rec impl n div =
            div = 0G || n / div = n % ten && impl (n % div / ten) (div / hundred)
        findHighDiv 1G |> impl number

    let problem004 () =
        { 100 .. 999 }
        |> Seq.collect (fun n -> Seq.init (1000 - n) ((+) n >> (*) n))
        |> Seq.filter isNumPalindrome
        |> Seq.max
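A quick usage check (assuming the modules above compile as written; 906609 is the expected answer for two 3-digit factors, as the timings further below also show):
Euler.problem004 ()   // expected: 906609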
Here's one way to do it:
open System

/// handy extension for reversing a string
type System.String with
    member s.Reverse() = String(Array.rev (s.ToCharArray()))

let isPalindrome x = let s = string x in s = s.Reverse()

seq {
    for i in 100..999 do
        for j in i..999 -> i * j
}
|> Seq.filter isPalindrome
|> Seq.max
|> printfn "The answer is: %d"
let IsPalindrom (str:string) =
    let rec fn(a,b) = a>b || str.[a]=str.[b] && fn(a+1,b-1)
    fn(0, str.Length-1)

let IsIntPalindrome = string >> IsPalindrom

let sq = {100..999}

sq |> Seq.map (fun x -> sq |> Seq.map (fun y -> (x,y), x*y))
   |> Seq.concat
   |> Seq.filter (snd >> IsIntPalindrome)
   |> Seq.maxBy snd
just my solution:
let isPalin (x: int) =
    x.ToString() = new string(Array.rev (x.ToString().ToCharArray()))

let isGood num seq1 = Seq.exists (fun elem -> (num % elem = 0 && (num / elem) < 999)) seq1

{998001 .. -1 .. 10000} |> Seq.filter (fun x -> isPalin x) |> Seq.filter (fun x -> isGood x {999 .. -1 .. 100}) |> Seq.nth 0
The simplest optimisation is to go from 999 down to 100, because the answer is much more likely to be a product of two large numbers.
j can then start from i, because the other ordering has already been tested.
Other optimisations would go in the direction of generating the products in descending order, but that makes everything a little more difficult. In general it is expressed as list merging.
Haskell (my best try in functional programming)
merge f x [] = x
merge f [] y = y
merge f (x:xs) (y:ys)
  | f x y     = x : merge f xs (y:ys)
  | otherwise = y : merge f (x:xs) ys

compare_tuples (a,b) (c,d) = a*b >= c*d

gen_mul n = (n,n) : merge compare_tuples
                      ( gen_mul (n-1) )
                      ( map (\x -> (n,x)) [n-1,n-2 .. 1] )

is_product_palindrome (a,b) = x == reverse x where x = show (a*b)

main = print $ take 10 $ map ( \(a,b) -> (a,b,a*b) )
     $ filter is_product_palindrome $ gen_mul 9999
output (less than 1s)- first 10 palindromes =>
[(9999,9901,99000099),
(9967,9867,98344389),
(9999,9811,98100189),
(9999,9721,97200279),
(9999,9631,96300369),
(9999,9541,95400459),
(9999,9451,94500549),
(9767,9647,94222249),
(9867,9547,94200249),
(9999,9361,93600639)]
One can see that this sequence is lazily generated from large to small.
Optimized version:
let Euler dgt =
    let [mine; maxe] = [dgt-1; dgt] |> List.map (fun x -> String.replicate x "9" |> int)
    let IsPalindrom (str: string) =
        let rec fn(a,b) = a>b || str.[a]=str.[b] && fn(a+1,b-1)
        fn(0, str.Length-1)
    let IsIntPalindrome = string >> IsPalindrom
    let rec fn = function
        | x,y,max,a,_ when a=mine -> x,y,max
        | x,y,max,a,b when b=mine -> fn(x, y, max, a-1, maxe)
        | x,y,max,a,b ->
            a*b |> function
                | m when b=maxe && m<max -> x,y,max
                | m when m>max && IsIntPalindrome(m) -> fn(a, b, m, a-1, maxe)
                | m when m>max -> fn(x, y, max, a, b-1)
                | _ -> fn(x, y, max, a-1, maxe)
    fn(0, 0, 0, maxe, maxe)
Log (switch #time on):
> Euler 2;;
Real: 00:00:00.004, CPU: 00:00:00.015, GC gen0: 0, gen1: 0, gen2: 0
val it : int * int * int = (99, 91, 9009)
> Euler 3;;
Real: 00:00:00.004, CPU: 00:00:00.015, GC gen0: 0, gen1: 0, gen2: 0
val it : int * int * int = (993, 913, 906609)
> Euler 4;;
Real: 00:00:00.002, CPU: 00:00:00.000, GC gen0: 0, gen1: 0, gen2: 0
val it : int * int * int = (9999, 9901, 99000099)
> Euler 5;;
Real: 00:00:00.702, CPU: 00:00:00.686, GC gen0: 108, gen1: 1, gen2: 0
val it : int * int * int = (99793, 99041, 1293663921) //int32 overflow
Extended to BigInteger:
let Euler dgt =
    let [mine; maxe] = [dgt-1; dgt] |> List.map (fun x -> new System.Numerics.BigInteger(String.replicate x "9" |> int))
    let IsPalindrom (str: string) =
        let rec fn(a,b) = a>b || str.[a]=str.[b] && fn(a+1,b-1)
        fn(0, str.Length-1)
    let IsIntPalindrome = string >> IsPalindrom
    let rec fn = function
        | x,y,max,a,_ when a=mine -> x,y,max
        | x,y,max,a,b when b=mine -> fn(x, y, max, a-1I, maxe)
        | x,y,max,a,b ->
            a*b |> function
                | m when b=maxe && m<max -> x,y,max
                | m when m>max && IsIntPalindrome(m) -> fn(a, b, m, a-1I, maxe)
                | m when m>max -> fn(x, y, max, a, b-1I)
                | _ -> fn(x, y, max, a-1I, maxe)
    fn(0I, 0I, 0I, maxe, maxe)
Check:
Euler 5;;
Real: 00:00:02.658, CPU: 00:00:02.605, GC gen0: 592, gen1: 1, gen2: 0
val it :
System.Numerics.BigInteger * System.Numerics.BigInteger *
System.Numerics.BigInteger =
(99979 {...}, 99681 {...}, 9966006699 {...})

Overhead involved in nested sequences in F#

I need to do excessive pattern matching against an XML structure, so I declared a type to represent the XML nodes. The XML is a multitree and I need to somehow iterate over the nodes. To make the tree enumerable, I use nested sequence comprehensions. My XML will never be too big, so simplicity beats performance in my case, but is the following code somehow dangerous for larger inputs, or would it be OK to leave it as it is?
type ElementInfo = { Tag : string; Attributes : Map<string, string> }

type Contents =
    | Nothing
    | String of string
    | Elements of Node list

and Node =
    | Element of ElementInfo * Contents
    | Comment of string

    member node.PreOrder =
        seq {
            match node with
            | Element (_, Elements children) as parent -> yield parent; yield! children |> Seq.collect (fun child -> child.PreOrder)
            | singleNode -> yield singleNode
        }
I thought the call to Seq.collect might cause it to generate an IEnumerable per call, so I replaced it with a for loop, and it produces the same generated code (checked via ILSpy).
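The for-loop variant referred to is presumably along these lines (a sketch written as a free function over the Node type above rather than as a member):
// same pre-order traversal, with an explicit for loop instead of Seq.collect
let rec preOrderLoop node =
    seq {
        match node with
        | Element (_, Elements children) as parent ->
            yield parent
            for child in children do
                yield! preOrderLoop child
        | singleNode -> yield singleNode
    }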
That made me curious so I decompiled a simpler sequence expression I'd expect to be "flattened."
let rec infSeq n =
    seq {
        yield n
        yield! infSeq (n+1)
    }
The code to generate the next element in the sequence decompiles to this:
public override int GenerateNext(ref IEnumerable<int> next)
{
    switch (this.pc)
    {
        case 1:
            this.pc = 2;
            next = Test.infSeq(this.n + 1);
            return 2;
        case 2:
            this.pc = 3;
            break;
        case 3:
            break;
        default:
            this.pc = 1;
            this.current = this.n;
            return 1;
    }
    this.current = 0;
    return 0;
}
As you can see, it calls itself recursively, generating a fresh IEnumerable each time. A quick test in FSI
infSeq 0 |> Seq.take 10000000 |> Seq.length
you can see there's a lot of GC:
> Real: 00:00:01.759, CPU: 00:00:01.762, GC gen0: 108, gen1: 107, gen2: 1
Compared to the C# version
public static IEnumerable<int> InfSeq(int n) {
    while (true) yield return n++;
}
in FSI:
> Real: 00:00:00.991, CPU: 00:00:00.998, GC gen0: 0, gen1: 0, gen2: 0
It's faster and uses constant memory (no extra IEnumerables).
I thought F# would generate a single IEnumerable for yield! in a tail position, but apparently not.
EDIT
The spec confirms this: {| yield! expr |} is elaborated as expr, that is, subsequences (recursive or otherwise) are not merged into a single IEnumerable.
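If all that is needed is the flat counter, one workaround (a sketch, not from the original answer) is Seq.unfold, which builds a single enumerator instead of one per recursive call:
// one enumerator for the whole sequence; no recursive re-entry per element
let infSeqUnfold n =
    Seq.unfold (fun state -> Some (state, state + 1)) n

// infSeqUnfold 0 |> Seq.take 10000000 |> Seq.length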

Performance of Lazy in F#

Why is the creation of Lazy type so slow?
Assume the following code:
type T() =
    let v = lazy (0.0)
    member o.a = v.Value

type T2() =
    member o.a = 0.0

#time "on"
for i in 0 .. 10000000 do
    T() |> ignore

#time "on"
for i in 0 .. 10000000 do
    T2() |> ignore
The first loop gives me: Real: 00:00:00.647 whereas the second loop gives me Real: 00:00:00.051. Lazy is 13X slower!!
I have tried to optimize my code in this way and I ended up with simulation code that was 6X slower. It was then fun to track back where the slowdown occurred...
The Lazy version has some significant overhead code -
.method public specialname
       instance default float64 get_a () cil managed
{
    // Method begins at RVA 0x2078
    // Code size 14 (0xe)
    .maxstack 3
    IL_0000: ldarg.0
    IL_0001: ldfld class [FSharp.Core]System.Lazy`1<float64> Test/T::v
    IL_0006: tail.
    IL_0008: call instance !0 class [FSharp.Core]System.Lazy`1<float64>::get_Value()
    IL_000d: ret
} // end of method T::get_a
Compare this to the direct version
.method public specialname
       instance default float64 get_a () cil managed
{
    // Method begins at RVA 0x20cc
    // Code size 10 (0xa)
    .maxstack 3
    IL_0000: ldc.r8 0.
    IL_0009: ret
} // end of method T2::get_a
So the direct version has a load and then return, whilst the indirect version has a load then a call and then a return.
Since the lazy version has an extra call I would expect it to be significantly slower.
UPDATE:
So I wondered if we could create a custom version of lazy which did not require the method calls - I also updated the test to actually call the method rather than just create the objects. Here is the code:
type T() =
    let v = lazy (0.0)
    member o.a() = v.Value

type T2() =
    member o.a() = 0.0

type T3() =
    let mutable calculated = true
    let mutable value = 0.0
    member o.a() = if calculated then value else failwith "not done";;

#time "on"
let lazy_ =
    for i in 0 .. 1000000 do
        T().a() |> ignore
    printfn "lazy"

#time "on"
let fakelazy =
    for i in 0 .. 1000000 do
        T3().a() |> ignore
    printfn "fake lazy"

#time "on"
let direct =
    for i in 0 .. 1000000 do
        T2().a() |> ignore
    printfn "direct";;
Which gives the following result:
lazy
Real: 00:00:03.786, CPU: 00:00:06.443, GC gen0: 7
val lazy_ : unit = ()
--> Timing now on
fake lazy
Real: 00:00:01.627, CPU: 00:00:02.858, GC gen0: 2
val fakelazy : unit = ()
--> Timing now on
direct
Real: 00:00:01.759, CPU: 00:00:02.935, GC gen0: 2
val direct : unit = ()
Here the lazy version is only 2x slower than the direct version and the fake lazy version is even slightly faster than the direct version - this is probably due to a GC happening during the benchmark.
Update in the .NET Core world
A new constructor was added to Lazy to handle constants such as in your case. Unfortunately F#'s lazy "pseudo keyword" always (at the moment!) wraps constants as functions.
Anyway, if you change:
let v = lazy (0.0)
to:
let v = Lazy<_> 0.0 // NB. Only .net core at the moment
then you will find that your T() class only takes ~3 times as long as your T2.
(What's the point of having a lazy constant? Well, it means that you can use Lazy as an abstraction with quite little overhead when you have a mix of constants and real lazy items...)
...and...
If you actually use the created value a number of times then the overhead shrinks further. i.e. something such as:
open System.Diagnostics

type T() =
    let v = Lazy<_> 0.1
    member o.a () = v.Value

type T2() =
    member o.a () = 0.1

let withLazyType () =
    let mutable sum = 0.0
    for i in 0 .. 10000000 do
        let t = T()
        for __ = 1 to 10 do
            sum <- sum + t.a()
    sum

let withoutLazyType () =
    let mutable sum = 0.0
    for i in 0 .. 10000000 do
        let t = T2()
        for __ = 1 to 10 do
            sum <- sum + t.a()
    sum

let runtest name count f =
    let mutable checksum = 0.
    let mutable totaltime = 0L
    for i = 0 to count do
        if i = 0 then
            f () |> ignore // warm up
        else
            let sw = Stopwatch.StartNew ()
            checksum <- checksum + f ()
            totaltime <- totaltime + sw.ElapsedMilliseconds
    printfn "%s: %4d (checksum=%f for %d runs)" name (totaltime/int64 count) checksum count

[<EntryPoint>]
let main _ =
    runtest "w/o lazy" 10 withoutLazyType
    runtest "with lazy" 10 withLazyType
    0
brings the difference in time to < 2 times.
NB. I worked on the new lazy implementation...

Why is my Scala tail-recursion faster than the while loop?

Here are two solutions to exercise 4.9 in Cay Horstmann's Scala for the Impatient: "Write a function lteqgt(values: Array[Int], v: Int) that returns a triple containing the counts of values less than v, equal to v, and greater than v." One uses tail recursion, the other uses a while loop. I thought that both would compile to similar bytecode but the while loop is slower than the tail recursion by a factor of almost 2. This suggests to me that my while method is badly written.
import scala.annotation.tailrec
import scala.util.Random

object PerformanceTest {

  def main(args: Array[String]): Unit = {
    val bigArray: Array[Int] = fillArray(new Array[Int](100000000))
    println(time(lteqgt(bigArray, 25)))
    println(time(lteqgt2(bigArray, 25)))
  }

  def time[T](block: => T): T = {
    val start = System.nanoTime: Double
    val result = block
    val end = System.nanoTime: Double
    println("Time = " + (end - start) / 1000000.0 + " millis")
    result
  }

  @tailrec def fillArray(a: Array[Int], pos: Int = 0): Array[Int] = {
    if (pos == a.length)
      a
    else {
      a(pos) = Random.nextInt(50)
      fillArray(a, pos + 1)
    }
  }

  @tailrec def lteqgt(values: Array[Int], v: Int, lt: Int = 0, eq: Int = 0, gt: Int = 0, pos: Int = 0): (Int, Int, Int) = {
    if (pos == values.length)
      (lt, eq, gt)
    else
      lteqgt(values, v, lt + (if (values(pos) < v) 1 else 0), eq + (if (values(pos) == v) 1 else 0), gt + (if (values(pos) > v) 1 else 0), pos + 1)
  }

  def lteqgt2(values: Array[Int], v: Int): (Int, Int, Int) = {
    var lt = 0
    var eq = 0
    var gt = 0
    var pos = 0
    val limit = values.length
    while (pos < limit) {
      if (values(pos) > v)
        gt += 1
      else if (values(pos) < v)
        lt += 1
      else
        eq += 1
      pos += 1
    }
    (lt, eq, gt)
  }
}
Adjust the size of bigArray according to your heap size. Here is some sample output:
Time = 245.110899 millis
(50004367,2003090,47992543)
Time = 465.836894 millis
(50004367,2003090,47992543)
Why is the while method so much slower than the tailrec? Naively the tailrec version looks to be at a slight disadvantage, as it must always perform 3 "if" checks for every iteration, whereas the while version will often only perform 1 or 2 tests due to the else construct. (NB reversing the order I perform the two methods does not affect the outcome).
Test results (after reducing array size to 20000000)
Under Java 1.6.22 I get 151 and 122 ms for tail-recursion and while-loop respectively.
Under Java 1.7.0 I get 55 and 101 ms
So under Java 6 your while-loop is actually faster; both have improved in performance under Java 7, but the tail-recursive version has overtaken the loop.
Explanation
The performance difference is due to the fact that in your loop, you conditionally add 1 to the totals, while for recursion you always add either 1 or 0. So they are not equivalent. The equivalent while-loop to your recursive method is:
def lteqgt2(values: Array[Int], v: Int): (Int, Int, Int) = {
  var lt = 0
  var eq = 0
  var gt = 0
  var pos = 0
  val limit = values.length
  while (pos < limit) {
    gt += (if (values(pos) > v) 1 else 0)
    lt += (if (values(pos) < v) 1 else 0)
    eq += (if (values(pos) == v) 1 else 0)
    pos += 1
  }
  (lt, eq, gt)
}
and this gives exactly the same execution time as the recursive method (regardless of Java version).
Discussion
I'm not an expert on why the Java 7 VM (HotSpot) can optimize this better than your first version, but I'd guess it's because it's taking the same path through the code each time (rather than branching along the if / else if paths), so the bytecode can be inlined more efficiently.
But remember that this is not the case in Java 6. Why one while-loop outperforms the other is a question of JVM internals. Happily for the Scala programmer, the version produced from idiomatic tail-recursion is the faster one in the latest version of the JVM.
The difference could also be occurring at the processor level. See this question, which explains how code slows down if it contains unpredictable branching.
The two constructs are not identical. In particular, in the first case you don't need any jumps (on x86, you can use cmp, setle, and add, instead of having to use cmp, jb, and, if you don't jump, add). Not jumping is faster than jumping on pretty much every modern architecture.
So, if you have code that looks like
if (a < b) x += 1
where you may add or you may jump instead, vs.
x += (a < b)
(which only makes sense in C/C++ where 1 = true and 0 = false), the latter tends to be faster as it can be turned into more compact assembly code. In Scala/Java, you can't do this, but you can do
x += if (a < b) 1 else 0
which a smart JVM should recognize is the same as x += (a < b), which has a jump-free machine code translation, which is usually faster than jumping. An even smarter JVM would recognize that
if (a < b) x += 1
is the same yet again (because adding zero doesn't do anything).
C/C++ compilers routinely perform optimizations like this. Being unable to apply any of these optimizations was not a mark in the JIT compiler's favor; apparently it can as of 1.7, but only partially (i.e. it doesn't recognize that adding zero is the same as a conditional adding one, but it does at least convert x += if (a<b) 1 else 0 into fast machine code).
Now, none of this has anything to do with tail recursion or while loops per se. With tail recursion it's more natural to write the if (a < b) 1 else 0 form, but you can do either; and with while loops you can also do either. It just so happened that you picked one form for tail recursion and the other for the while loop, making it look like recursion vs. looping was the change instead of the two different ways to do the conditionals.

Resources