Performance of Lazy in F#

Why is the creation of Lazy type so slow?
Assume the following code:
type T() =
    let v = lazy (0.0)
    member o.a = v.Value

type T2() =
    member o.a = 0.0

#time "on"
for i in 0 .. 10000000 do
    T() |> ignore

#time "on"
for i in 0 .. 10000000 do
    T2() |> ignore
The first loop gives me Real: 00:00:00.647, whereas the second gives Real: 00:00:00.051. Lazy is about 13x slower!
I had tried to optimize my code this way and ended up with simulation code that was 6x slower. It was then fun to track back where the slowdown occurred...

The Lazy version has some significant overhead code:
.method public specialname
        instance default float64 get_a () cil managed
{
    // Method begins at RVA 0x2078
    // Code size 14 (0xe)
    .maxstack 3
    IL_0000: ldarg.0
    IL_0001: ldfld class [FSharp.Core]System.Lazy`1<float64> Test/T::v
    IL_0006: tail.
    IL_0008: call instance !0 class [FSharp.Core]System.Lazy`1<float64>::get_Value()
    IL_000d: ret
} // end of method T::get_a
Compare this to the direct version
.method public specialname
        instance default float64 get_a () cil managed
{
    // Method begins at RVA 0x20cc
    // Code size 10 (0xa)
    .maxstack 3
    IL_0000: ldc.r8 0.
    IL_0009: ret
} // end of method T2::get_a
So the direct version has a load and then return, whilst the indirect version has a load then a call and then a return.
Since the lazy version has an extra call I would expect it to be significantly slower.
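For intuition about where the construction cost comes from as well: `lazy (0.0)` does not just store the constant, it allocates a Lazy<float> plus a closure for the thunk, roughly like the hand-written sketch below (an approximation for illustration, not the exact code the compiler emits):
type TSketch() =
    // Roughly what `let v = lazy (0.0)` amounts to: a Lazy<float> object plus a
    // closure allocated for every instance, even though the value is a constant.
    let v = System.Lazy<float>(fun () -> 0.0)
    member o.a = v.Value   // every read goes through Lazy`1::get_Value, as in the IL above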
UPDATE:
So I wondered whether we could create a custom version of lazy that did not require the method calls. I also updated the test to actually call the method rather than just create the objects. Here is the code:
type T() =
    let v = lazy (0.0)
    member o.a() = v.Value

type T2() =
    member o.a() = 0.0

type T3() =
    let mutable calculated = true
    let mutable value = 0.0
    member o.a() = if calculated then value else failwith "not done";;

#time "on"
let lazy_ =
    for i in 0 .. 1000000 do
        T().a() |> ignore
    printfn "lazy"

#time "on"
let fakelazy =
    for i in 0 .. 1000000 do
        T3().a() |> ignore
    printfn "fake lazy"

#time "on"
let direct =
    for i in 0 .. 1000000 do
        T2().a() |> ignore
    printfn "direct";;
Which gives the following result:
lazy
Real: 00:00:03.786, CPU: 00:00:06.443, GC gen0: 7
val lazy_ : unit = ()
--> Timing now on
fake lazy
Real: 00:00:01.627, CPU: 00:00:02.858, GC gen0: 2
val fakelazy : unit = ()
--> Timing now on
direct
Real: 00:00:01.759, CPU: 00:00:02.935, GC gen0: 2
val direct : unit = ()
Here the lazy version is only 2x slower than the direct version and the fake lazy version is even slightly faster than the direct version - this is probably due to a GC happening during the benchmark.

Update in the .NET Core world
A new constructor was added to Lazy to handle constants such as in your case. Unfortunately, F#'s lazy "pseudo keyword" (at the moment!) always wraps constants as functions.
Anyway, if you change:
let v = lazy (0.0)
to:
let v = Lazy<_> 0.0 // NB. Only .net core at the moment
then you will find that your T() class only takes about 3x as long as your T2().
(What's the point of having a lazy constant? Well, it means that you can use Lazy as an abstraction with quite little overhead when you do have a mix of constants and real lazy items...)
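A minimal sketch of that kind of mix (hypothetical names, just to illustrate the point; the constant-taking constructor is the .NET Core one mentioned above):
// Hypothetical example: one value is a known constant, the other is genuinely
// expensive, but consumers see the same Lazy<float> abstraction for both.
let cheapConstant : Lazy<float> = Lazy<_> 42.0   // value constructor, .NET Core only
let expensiveValue : Lazy<float> = lazy (List.sum [ for i in 1 .. 1000000 -> float i ])

let describe (v: Lazy<float>) =
    printfn "created before read: %b" v.IsValueCreated
    printfn "value: %f" v.Value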
...and...
If you actually use the created value a number of times, then the overhead shrinks further, i.e. something such as:
open System.Diagnostics

type T() =
    let v = Lazy<_> 0.1
    member o.a () = v.Value

type T2() =
    member o.a () = 0.1

let withLazyType () =
    let mutable sum = 0.0
    for i in 0 .. 10000000 do
        let t = T()
        for __ = 1 to 10 do
            sum <- sum + t.a()
    sum

let withoutLazyType () =
    let mutable sum = 0.0
    for i in 0 .. 10000000 do
        let t = T2()
        for __ = 1 to 10 do
            sum <- sum + t.a()
    sum

let runtest name count f =
    let mutable checksum = 0.
    let mutable totaltime = 0L
    for i = 0 to count do
        if i = 0 then
            f () |> ignore // warm up
        else
            let sw = Stopwatch.StartNew ()
            checksum <- checksum + f ()
            totaltime <- totaltime + sw.ElapsedMilliseconds
    printfn "%s: %4d (checksum=%f for %d runs)" name (totaltime/int64 count) checksum count

[<EntryPoint>]
let main _ =
    runtest "w/o lazy" 10 withoutLazyType
    runtest "with lazy" 10 withLazyType
    0
brings the difference in time to < 2 times.
NB. I worked on the new lazy implementation...

Related

How to reduce the allocations in Julia?

I am starting to use Julia mainly because of its speed. Currently, I am solving a fixed point problem. Although the current version of my code runs fast I would like to know some methods to improve its speed.
First of all, let me summarize the algorithm.
There is an initial seed called C0 that maps from the space (b,y) into an action space c, then we have C0(b,y)
There is a formula that generates a rule Ct from C0.
Then, using an additional restriction, I can obtain an update of b [let's call it bt]. Thus, it generates a rule Ct(bt,y).
I need to interpolate the previous rule to move from the grid bt into the original grid b. This gives me an update for C0 [let's call it C1].
I will iterate until the distance between C1 and C0 is below a convergence threshold.
To implement it I created two structures:
struct Parm
    lC::Array{Float64, 2}  # Lower limit
    uC::Array{Float64, 2}  # Upper limit
    γ::Float64             # CRRA coefficient
    δ::Float64             # factor in the euler
    γ1::Float64            #
    r1::Float64            # inverse of the gross interest rate
    yb1::Array{Float64, 2} # y - b(t+1)
    P::Array{Float64, 2}   # Transpose of transition matrix
end

mutable struct Upd1
    pol::Array{Float64,2}  # policy function
    b::Array{Float64, 1}   # exogenous grid for interpolation
    dif::Float64           # updating difference
end
The first one is a set of parameters while the second one stores the decision rule C1. I also define some functions:
function eulerm(x::Upd1,p::Parm)
    ct = p.δ*(x.pol.^(-p.γ)*p.P).^(-p.γ1); # Euler equation
    bt = p.r1.*(ct .+ p.yb1);              # Endogenous grid for bonds
    return ct,bt
end

function interp0!(bt::Array{Float64},ct::Array{Float64},x::Upd1, p::Parm)
    polold = x.pol;
    polnew = similar(x.pol);
    @inbounds @simd for col in 1:size(bt,2)
        F1 = LinearInterpolation(bt[:,col], ct[:,col], extrapolation_bc=Line());
        polnew[:,col] = F1(x.b);
    end
    polnew[polnew .< p.lC] .= p.lC[polnew .< p.lC];
    polnew[polnew .> p.uC] .= p.uC[polnew .> p.uC];
    dif = maximum(abs.(polnew - polold));
    return polnew,dif
end

function updating!(x::Upd1,p::Parm)
    ct, bt = eulerm(x,p); # endogenous grid
    x.pol, x.dif = interp0!(bt,ct,x,p);
end

function conver(x::Upd1,p::Parm)
    while x.dif>1e-8
        updating!(x,p);
    end
end
The first function implements steps 2 and 3, the third one performs the update (last part of step 4), and the last one iterates until convergence (step 5).
The most important function is the second one; it does the interpolation. While profiling with @time and @btime I realized that the largest number of allocations occurs in the loop inside this function. I tried to reduce it by not defining polnew and writing directly into x.pol, but in that case the results are not correct, since it then only needs two iterations to converge (I think Julia treats polold as the very same array as x.pol, so both get updated at the same time).
Any advice is well received.
To anyone that wants to run it by themselves, I add the rest of the required code:
function rouwen(ρ::Float64, σ2::Float64, N::Int64)
    if (N % 2 != 1)
        return "N should be an odd number"
    end
    sigz = sqrt(σ2/(1-ρ^2));
    zn = sigz*sqrt(N-1);
    z = range(-zn,zn,N);
    p = (1+ρ)/2;
    q = p;
    Rho = [p 1-p; 1-q q];
    for i = 3:N
        zz = zeros(i-1,1);
        Rho = p*[Rho zz; zz' 0] + (1-p)*[zz Rho; 0 zz'] + (1-q)*[zz' 0; Rho zz] + q*[0 zz'; zz Rho];
        Rho[2:end-1,:] = Rho[2:end-1,:]/2;
    end
    return z,Rho;
end
#############################################################
# Parameters of the model
############################################################
lb = 0; ub = 1000; pivb = 0.25; nb = 500;
ρ = 0.988; σz = 0.0439; μz =-σz/2; nz = 7;
ϕ = 0.0; σe = 0.6376; μe =-σe/2; ne = 7;
β = 0.98; r = 1/400; γ = 1;
b = exp10.(range(start=log10(lb+pivb), stop=log10(ub+pivb), length=nb)) .- pivb;
#=========================================================
Algorithm
======================================================== =#
(z,Pz) = rouwen(ρ,σz, nz);
μZ = μz/(1-ρ);
z = z .+ μZ;
(ee,Pe) = rouwen(ϕ,σe,ne);
ee = ee .+ μe;
y = exp.(vec((z .+ ee')'));
P = kron(Pz,Pe);
R = 1 + r;
r1 = R^(-1);
γ1 = 1/γ;
δ = (β*R)^(-γ1);
m = R*b .+ y';
lC = max.(m .- ub,0);
uC = m .- lb;
by1 = b .- y';
# initial guess for C0
c0 = 0.1*(m);
# Set of parameters
pp = Parm(lC,uC,γ,δ,γ1,r1,by1,P');
# Container of results
up1 = Upd1(c0,b,1);
# Fixed point problem
conver(up1,pp)
UPDATE: As was recommended, I made the following changes to the third function:
function interp0!(bt::Array{Float64},ct::Array{Float64},x::Upd1, p::Parm)
    polold = x.pol;
    polnew = similar(x.pol);
    @inbounds for col in 1:size(bt,2)
        F1 = LinearInterpolation(@view(bt[:,col]), @view(ct[:,col]), extrapolation_bc=Line());
        polnew[:,col] = F1(x.b);
    end
    for j in eachindex(polnew)
        polnew[j] < p.lC[j] ? polnew[j] = p.lC[j] : nothing
        polnew[j] > p.uC[j] ? polnew[j] = p.uC[j] : nothing
    end
    dif = maximum(abs.(polnew - polold));
    return polnew,dif
end
This leads to an improvement in speed (from ~1.5 to ~1.3 seconds) and a reduction in the number of allocations. Some things that I noted were:
Changing from polnew[:,col] = F1(x.b) to polnew[:,col] .= F1(x.b) can reduce the total allocations, but the time is slower; why is that?
How should I understand the difference between @time and @btime? For this case, I have:
up1 = Upd1(c0,b,1);
@time conver(up1,pp)
1.338042 seconds (385.72 k allocations: 1.157 GiB, 3.37% gc time)

up1 = Upd1(c0,b,1);
@btime conver(up1,pp)
4.200 ns (0 allocations: 0 bytes)
Just to be precise: in both cases I ran it several times and chose representative numbers for each line.
Does it mean that all the time is due to allocations during compilation?
Start by going through the "performance tips", as advised by @DNF, but below you will find the most important comments for your code.
Vectorize vector assignments - a small dot makes a big difference:
julia> a = rand(3,4);
julia> @btime $a[3,:] = $a[3,:] ./ 2;
40.726 ns (2 allocations: 192 bytes)
julia> @btime $a[3,:] .= $a[3,:] ./ 2;
20.562 ns (1 allocation: 96 bytes)
Use views when doing something with subarrays:
julia> @btime sum($a[3,:]);
18.719 ns (1 allocation: 96 bytes)
julia> @btime sum(@view($a[3,:]));
5.600 ns (0 allocations: 0 bytes)
Your code around the line polnew[polnew .< p.lC] .= p.lC[polnew .< p.lC]; will make far fewer allocations when you do it with a for loop over each element of polnew.
@simd will have no effect on conditionals (point 3), nor when the code is calling complex external functions.
I want to give an update about this problem. I made two main changes to my code: (i) I defined my own linear interpolation function, and (ii) I included the bounds check in the interpolation.
With this, the new third function is:
function interp0!(bt::Array{Float64},ct::Array{Float64},x::Upd1, p::Parm)
    polold = x.pol;
    polnew = similar(x.pol);
    @inbounds @views for col in 1:size(bt,2)
        polnew[:,col] = myint(bt[:,col], ct[:,col], x.b[:], p.lC[:,col], p.uC[:,col]);
    end
    dif = maximum(abs.(polnew - polold));
    return polnew,dif
end
And the interpolation is now:
function myint(x0,y0,x1,ly,uy)
    y1 = similar(x1);
    n = size(x0,1);
    j = 1;
    @simd for i in eachindex(x1)
        while (j <= n) && (x1[i] > x0[j])
            j += 1;
        end
        if j == 1
            y1[i] = y0[1] + ((y0[2]-y0[1])/(x0[2]-x0[1]))*(x1[i]-x0[1]);
        elseif j == n+1
            y1[i] = y0[n] + ((y0[n]-y0[n-1])/(x0[n]-x0[n-1]))*(x1[i]-x0[n]);
        else
            y1[i] = y0[j-1] + ((x1[i]-x0[j-1])/(x0[j]-x0[j-1]))*(y0[j]-y0[j-1]);
        end
        y1[i] > uy[i] ? y1[i] = uy[i] : nothing;
        y1[i] < ly[i] ? y1[i] = ly[i] : nothing;
    end
    return y1;
end
As you can see, I am taking advantage of (and assuming) the fact that both vectors we use as a basis are sorted, while the last two lines in the outer loop check the bounds imposed by lC and uC.
With that I get the following total time:
up1 = Upd1(c0,b,1);
@time conver(up1,pp)
0.734630 seconds (28.93 k allocations: 752.214 MiB, 3.82% gc time)

up1 = Upd1(c0,b,1);
@btime conver(up1,pp)
4.200 ns (0 allocations: 0 bytes)
which is almost twice as fast, with ~8% of the original allocations. The use of views in the loop of the function interp0! also helps a lot.

Performance issues with evaluation of custom tree data structure in Julia

I am implementing a binary tree in Julia. The binary tree has nodes and leaves. The nodes point to left and right children, which are also node/leaf objects. The following code exemplifies the data structure:
using TimerOutputs

mutable struct NodeLeaf
    isleaf::Bool
    value::Union{Nothing,Float64}
    split::Union{Nothing,Float64}
    column::Union{Nothing,Int64}
    left::Union{Nothing,NodeLeaf}
    right::Union{Nothing,NodeLeaf}
end

function evaluate(node::NodeLeaf, x)::Float64
    while !node.isleaf
        if x[node.column] < node.split
            node = node.left
        else
            node = node.right
        end
    end
    return node.value
end

function build_random_tree(max_depth)
    if max_depth == 0
        return NodeLeaf(true, randn(), randn(), rand(1:10), nothing, nothing)
    else
        return NodeLeaf(false, randn(), randn(), rand(1:10), build_random_tree(max_depth - 1), build_random_tree(max_depth - 1))
    end
end

function main()
    my_random_tree = build_random_tree(4)
    @timeit to "evaluation" for i in 1:1000000
        evaluate(my_random_tree, randn(10))
    end
end

const to = TimerOutput()
main()
show(to)
I notice that a lot of allocations occur in the evaluate function, but I don't see the reason why this is the case:
julia mytree.jl
───────────────────────────────────────────────────────────────────────
Time Allocations
─────────────────────── ────────────────────────
Tot / % measured: 476ms / 21.6% 219MiB / 62.7%
Section ncalls time %tot avg alloc %tot avg
───────────────────────────────────────────────────────────────────────
evaluation 1 103ms 100.0% 103ms 137MiB 100.0% 137MiB
───────────────────────────────────────────────────────────────────────
As I increase the evaluation loop, the allocation continues to increase without bound. Can anybody explain why allocation grows so much and please suggest how to avoid this issue? Thanks.
EDIT
I simplified the code too much for the example. The actual code is accessing DataFrames, so the main looks like this:
using DataFrames

function main()
    my_random_tree = build_random_tree(7)
    df = DataFrame(A=1:1000000)
    for i in 1:9
        df[!, string(i)] = collect(1:1000000)
    end
    @timeit to "evaluation" for i in 1:size(df, 1)
        evaluate(my_random_tree, @view df[i, :])
    end
end
I expect this to yield 0 allocations, but that isn't true:
julia mytree.jl
───────────────────────────────────────────────────────────────────────
Time Allocations
─────────────────────── ────────────────────────
Tot / % measured: 551ms / 20.5% 305MiB / 45.0%
Section ncalls time %tot avg alloc %tot avg
───────────────────────────────────────────────────────────────────────
evaluation 1 113ms 100.0% 113ms 137MiB 100.0% 137MiB
───────────────────────────────────────────────────────────────────────%
On the other hand, if I use a plain array I don't get allocations:
function main()
    my_random_tree = build_random_tree(7)
    df = randn(1000000, 10)
    @timeit to "evaluation" for i in 1:size(df, 1)
        evaluate(my_random_tree, @view df[i, :])
    end
end
julia mytree.jl
───────────────────────────────────────────────────────────────────────
Time Allocations
─────────────────────── ────────────────────────
Tot / % measured: 465ms / 5.7% 171MiB / 0.0%
Section ncalls time %tot avg alloc %tot avg
───────────────────────────────────────────────────────────────────────
evaluation 1 26.4ms 100.0% 26.4ms 0.00B - % 0.00B
───────────────────────────────────────────────────────────────────────%
The thing that allocates is randn, not evaluation. Switch to randn!:
julia> using Random

julia> function main()
           my_random_tree = build_random_tree(4)
           x = randn(10)
           @allocated for i in 1:1000000
               evaluate(my_random_tree, randn!(x))
           end
       end
main (generic function with 1 method)

julia> main()
0
EDIT
Solution with DataFrames.jl:
function bar(mrt, nti)
    @timeit to "evaluation" for nt in nti
        evaluate(mrt, nt)
    end
end
function main()
    my_random_tree = build_random_tree(7)
    df = DataFrame(A=1:1000000)
    for i in 1:9
        df[!, string(i)] = collect(1:1000000)
    end
    bar(my_random_tree, Tables.namedtupleiterator(df))
end

F# Performance Impact of Checked Calcs?

Is there a performance impact from using the Checked module? I've tested it out with sequences of type int and see no noticeable difference. Sometimes the checked version is faster and sometimes unchecked is faster, but generally not by much.
Seq.initInfinite (fun x-> x) |> Seq.item 1000000000;;
Real: 00:00:05.272, CPU: 00:00:05.272, GC gen0: 0, gen1: 0, gen2: 0
val it : int = 1000000000
open Checked
Seq.initInfinite (fun x-> x) |> Seq.item 1000000000;;
Real: 00:00:04.785, CPU: 00:00:04.773, GC gen0: 0, gen1: 0, gen2: 0
val it : int = 1000000000
Basically I'm trying to figure out if there would be any downside to always opening Checked. (I encountered an overflow that wasn't immediately obvious, so I'm now playing the role of the jilted lover who doesn't want another broken heart.) The only non-contrived reason I can come up with for not always using Checked is if there were some performance hit, but I haven't seen one yet.
When you measure performance it's usually not a good idea to include Seq, as Seq adds lots of overhead (at least compared to int operations), so you risk that most of the time is spent in Seq, not in the code you'd like to test.
I wrote a small test program for (+):
let clock =
    let sw = System.Diagnostics.Stopwatch ()
    sw.Start ()
    fun () ->
        sw.ElapsedMilliseconds

let dbreak () = System.Diagnostics.Debugger.Break ()

let time a =
    let b = clock ()
    let r = a ()
    let n = clock ()
    let d = n - b
    d, r

module Unchecked =
    let run c () =
        let rec loop a i =
            if i < c then
                loop (a + 1) (i + 1)
            else
                a
        loop 0 0

module Checked =
    open Checked

    let run c () =
        let rec loop a i =
            if i < c then
                loop (a + 1) (i + 1)
            else
                a
        loop 0 0

[<EntryPoint>]
let main argv =
    let count = 1000000000
    let testCases =
        [|
            "Unchecked" , Unchecked.run
            "Checked"   , Checked.run
        |]
    for nm, a in testCases do
        printfn "Running %s ..." nm
        let ms, r = time (a count)
        printfn "... it took %d ms, result is %A" ms r
    0
The performance results are this:
Running Unchecked ...
... it took 561 ms, result is 1000000000
Running Checked ...
... it took 1103 ms, result is 1000000000
So it seems some overhead is added by using Checked. The cost of an int add should be less than the loop overhead, so the overhead of Checked is higher than 2x, maybe closer to 4x.
Out of curiosity, we can check the IL code using tools like ILSpy:
Unchecked:
IL_0000: nop
IL_0001: ldarg.2
IL_0002: ldarg.0
IL_0003: bge.s IL_0014
IL_0005: ldarg.0
IL_0006: ldarg.1
IL_0007: ldc.i4.1
IL_0008: add
IL_0009: ldarg.2
IL_000a: ldc.i4.1
IL_000b: add
IL_000c: starg.s i
IL_000e: starg.s a
IL_0010: starg.s c
IL_0012: br.s IL_0000
Checked:
IL_0000: nop
IL_0001: ldarg.2
IL_0002: ldarg.0
IL_0003: bge.s IL_0014
IL_0005: ldarg.0
IL_0006: ldarg.1
IL_0007: ldc.i4.1
IL_0008: add.ovf
IL_0009: ldarg.2
IL_000a: ldc.i4.1
IL_000b: add.ovf
IL_000c: starg.s i
IL_000e: starg.s a
IL_0010: starg.s c
IL_0012: br.s IL_0000
The only difference is that Unchecked uses add and Checked uses add.ovf. add.ovf is add with overflow check.
We can dig even deeper by looking at the jitted x86_64 code.
Unchecked:
; if i < c then
00007FF926A611B3 cmp esi,ebx
00007FF926A611B5 jge 00007FF926A611BD
; i + 1
00007FF926A611B7 inc esi
; a + 1
00007FF926A611B9 inc edi
; loop (a + 1) (i + 1)
00007FF926A611BB jmp 00007FF926A611B3
Checked:
; if i < c then
00007FF926A62613 cmp esi,ebx
00007FF926A62615 jge 00007FF926A62623
; a + 1
00007FF926A62617 add edi,1
; Overflow?
00007FF926A6261A jo 00007FF926A6262D
; i + 1
00007FF926A6261C add esi,1
; Overflow?
00007FF926A6261F jo 00007FF926A6262D
; loop (a + 1) (i + 1)
00007FF926A62621 jmp 00007FF926A62613
Now the reason for the Checked overhead is visible. After each operation the jitter inserts the conditional instruction jo which jumps to code that raises OverflowException if the overflow flag is set.
This chart shows us that the cost of an integer add is less than 1 clock cycle. The reason it's less than 1 clock cycle is that modern CPUs can execute certain instructions in parallel.
The chart also shows us that a branch that was correctly predicted by the CPU takes around 1-2 clock cycles.
So, assuming a throughput of at least 2, the cost of the two integer additions in the Unchecked example should be 1 clock cycle.
In the Checked example we do add, jo, add, jo. Most likely the CPU can't parallelize in this case, and the cost should be around 4-6 clock cycles.
Another interesting difference is that the order of the additions changed. With checked additions the order of the operations matters, but with unchecked the jitter (and the CPU) has greater flexibility to reorder the operations, possibly improving performance.
So, long story short: for cheap operations like (+) the overhead of Checked should be around 4x-6x compared to Unchecked.
This assumes no overflow exception. The cost of a .NET exception is probably around 100,000x that of an integer addition.
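A practical consequence (my sketch, not part of the answer above): if that 4x-6x on hot arithmetic matters, you don't have to open Checked for a whole file; the Checked module's operators can also be applied selectively, for example:
// Sketch: checked arithmetic only where overflow is a real risk, unchecked elsewhere.
let sumChecked (xs: int[]) =
    let mutable acc = 0
    for x in xs do
        acc <- Checked.(+) acc x   // raises OverflowException instead of wrapping
    acc

let sumUnchecked (xs: int[]) =
    let mutable acc = 0
    for x in xs do
        acc <- acc + x             // wraps silently on overflow
    acc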

Copy-and-update semantics and performance in F# record types

I am experimenting with a settings object that is passed on through a bunch of functions (a bit like a stack). It has quite a few fields (mixed ints, enums, DU's, strings) and I was wondering what the best data type is for this task (I have come back to this a few times in the past few years...).
While I currently employ a home-grown data type, it is slow and not thread-safe, so I am looking into a more logical choice and decided to experiment with record types.
Since they use copy-and-update semantics, it seems logical to think the F# compiler is smart enough to update only the necessary data and leave non-mutated data alone, considering it is immutable anyway.
So I ran a couple of tests and expected to see a performance improvement between full-copy (each field is updated) and one-field copy (one field is updated). But I am not sure I am using the proper approach for testing, or whether other datatypes may be more suitable.
As an actual example, let's take an XML qualified name. They have a small local part and a prefix (which can be ignored) and a long namespace part. The namespace is mostly the same, so only the local part needs updating.
type RecQName = { Ns :string; Prefix :string; Name :string }
// full update
let f a b c = { Ns = a; Prefix = b; Name = c}
let g() = f "http://test/is/a/long/name/here/and/there" "xs" "value"
// partial update
let test = g();;
let h a b = {a with Name = b }
let k() = h test "newname"
With timings on, in FSI (set to Release, x64 and Debug off) I get (running each twice):
> for i in 0 .. 100000000 do g() |> ignore;;
Real: 00:00:01.412, CPU: 00:00:01.404, GC gen0: 637, gen1: 1, gen2: 1
> for i in 0 .. 100000000 do g() |> ignore;;
Real: 00:00:01.317, CPU: 00:00:01.310, GC gen0: 636, gen1: 0, gen2: 0
> for i in 0 .. 100000000 do k() |> ignore;;
Real: 00:00:01.191, CPU: 00:00:01.185, GC gen0: 636, gen1: 1, gen2: 0
> for i in 0 .. 100000000 do k() |> ignore;;
Real: 00:00:01.099, CPU: 00:00:01.092, GC gen0: 636, gen1: 0, gen2: 0
Now I know timings are not everything, and there's a clear difference of roughly 20%, but that seems too small to justify the change, and may well be due to other reasons (the CLI may intern the strings, for instance).
Am I making the wrong assumptions? I googled for record-type performance, but the results were all about comparing records with structs. Does anybody know the algorithm used for copy-and-update? Any thoughts on whether this datatype or something else is a smarter choice (given many fields, not just three as above, and wanting to use immutability, without locking, via copy-and-update)?
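For reference on the copy-and-update part of the question: `with` is not an in-place patch of the old record; it compiles down to a call of the record's constructor with every field supplied, the untouched ones simply read from the source record. A hand-written equivalent of h above (a sketch of the semantics, not the literal compiler output) would be:
// Semantically what { a with Name = b } does: allocate a brand-new RecQName and
// copy the unchanged fields over from the source record.
let hExpanded (a: RecQName) b =
    { Ns = a.Ns; Prefix = a.Prefix; Name = b }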
Update 1
The test above doesn't really test anything, it seems, as can be shown with the next test.
type Rec = { A: int; B: int; C: int; D: int; E: int};;
// full
let f a b c d e = { A = a; B = b; C = c; D = d; E = e }
// partial
let temp = f 1 2 3 4 5
let g b = { temp with B = b }
// perf tests (subtract is necessary or the compiler optimizes them away)
let mutable res = 0;;
for i in 0 .. 1000000000 do f i (i-1) (i-2) (i-3) (i-4) |> function { B = b } -> if b < 1000 then res <- b
for i in 0 .. 1000000000 do g i |> function { B = b } -> if b < 1000 then res <- b
Results:
> for i in 0 .. 1000000000 do f i (i-1) (i-2) (i-3) (i-4) |> function { B = b } -> if b < 1000 then res <- b ;;
Real: 00:00:09.039, CPU: 00:00:09.032, GC gen0: 6358, gen1: 1, gen2: 0
> for i in 0 .. 1000000000 do g i |> function { B = b } -> if b < 1000 then res <- b;;
Real: 00:00:10.571, CPU: 00:00:10.576, GC gen0: 6358, gen1: 2, gen2: 0
Now the difference, while unexpected, shows. It looks like copy-and-update is definitely not faster than building the record from scratch. The overhead of the copy in this case may be due to the extra stack slot required; I don't know.
At the very least it shows that no magic goes on (as Fyodor already mentioned in the comments).
Update 2
Ok, one more update. If I inline both the f and g functions above, the timings become remarkably different, in favor of the partial update. Apparently the inlining has the effect that either the compiler or the JIT "knows" it doesn't have to do a full copy, or it is just the effect of putting everything on the stack (the JIT or compiler getting rid of the boxing), as can be seen from the 0 GC collections.
> for i in 0 .. 1000000000 do f i (i-1) (i-2) (i-3) (i-4) |> function { B = b } -> if b < 1000 then res <- b ;;
Real: 00:00:08.885, CPU: 00:00:08.876, GC gen0: 6359, gen1: 1, gen2: 1
> for i in 0 .. 1000000000 do g i |> function { B = b } -> if b < 1000 then res <- b ;;
Real: 00:00:00.571, CPU: 00:00:00.561, GC gen0: 0, gen1: 0, gen2: 0
Whether this holds in the "real world" is debatable and should (of course) be tested. Whether this improvement has any relation to record types, I doubt it.
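Since the question mentions the struct comparisons that keep coming up in searches: if the heap allocations themselves ever become the issue, a struct record is a cheap thing to benchmark against (a sketch under the assumption that the fields stay small; copy-and-update still copies every field, just by value and with no GC pressure):
// Sketch: same shape as Rec above, but as a struct record (F# 4.1+).
[<Struct>]
type SRec = { A: int; B: int; C: int; D: int; E: int }

let stemp : SRec = { A = 1; B = 2; C = 3; D = 4; E = 5 }
let sg b = { stemp with B = b }   // value copy on the stack, no heap allocation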

debugging old fortran code

I have an old Fortran code for the calculation of Lyapunov exponents, which I tried converting to modern Fortran syntax.
PROGRAM ODE
integer, PARAMETER :: N=3
integer, PARAMETER :: NN=12
EXTERNAL FCN
DIMENSION Y(NN),ZNORM(N),GSC(N),CUM(N),C(24),W(NN,9)
Y(1) = 10.0
Y(2) = 1.0
Y(3) = 0.0
! INITIAL CONDITIONS FOR LINEAR SYSTEM (ORTHONORMAL FRAME)
DO 10 I = N+1,NN
Y(I) = 0.0
10 CONTINUE
DO 20 I = 1,N
Y((N+1)*I) = 1.0
CUM(I) = 0.0
20 CONTINUE
! INTEGRATION TOLERANCE, # OF INTEGRATION STEPS,
! TIME PER STEP, AND I/O RATE
write (*,*) "TOL, NSTEP, STPSZE, IO ?"
read (*,*) TOL, NSTEP, STPSZE, IO
! INITIALIZATION FOR INTEGRATOR
NEQ = NN
X=0.0
IND = 1
DO 100 I = 1,NSTEP
XEND = STPSZE*FLOAT(I)
! CALL ANY ODE INTEGRATOR - THIS IS AN LMSL ROUTINE
CALL DVERK (NEQ,FCN,X,Y,XEND,TOL, IND,C,NEQ,W,IER)
! CONSTRUCT A NEW ORTHONORMAL BASIS BY GRAM-SCHMIDT METHOD
! NORMALIZE FIRST VECTOR
ZNORM(1) = 0.0
DO 30 J = 1,N
ZNORM(1) = ZNORM(1)+Y(N*J+1)**2
30 CONTINUE
ZNORM(1) = SQRT(ZNORM(1))
DO 40 J = 1,N
Y(N*J+1) = Y(N*J+1)/ZNORM(1)
40 CONTINUE
! GENERATE THE NEW ORTHONORMAL SET OF VECTORS.
DO 80 J = 2,N
! GENERATE J-1 GSR COEFFICIENTS.
DO 50 K = l,(J-l)
GSC(K) = 0.0
DO 50 L = 1,N
GSC(K) = GSC(K)+Y(N*L+J)*Y(N*L+K)
50 CONTINUE
! CONSTRUCT A NEW VECTOR.
DO 60 K = 1,N
DO 60 L = l,(J-l)
Y(N*K+J) = Y(N*K+J)-GSC(L)*Y(N*K+L)
60 CONTINUE
! CALCULATE THE VECTOR'S NORM
ZNORM(J) = 0.0
DO 70 K = I,N
ZNORM(J) = ZNORM(J)+Y(N*K+J)**2
70 CONTINUE
ZNORM(J) = SQRT(ZNORM(J))
! NORMALIZE THE NEW VECTOR.
DO 80 K = 1,N
Y(N*K+J) = Y(N*K+J)/ZNORM(J)
80 CONTINUE
! UPDATE RUNNING VECTORMAGNITUDES
DO 90 K = 1,N
CUM(K) = CUM(K)+ALOG(ZNORM(K) )/ALOG(2. )
90 CONTINUE
! NORMALIZE EXPONENT AND PRINT EVERY IO ITERATIONS
IF (MOD(I,IO).EQ.0) write (*,*) X,(CUM(K)/X,K = I,N)
100 CONTINUE
CALL EXIT
END
SUBROUTINE FCN (N,X,Y,YPRIME)
! USER DEFINED ROUTINE CALLED BY IMSL INTEGRATOR.
DIMENSION Y(12),YPRIME(12)
! LORENZ EQUATIONS OF MOTION
YPRIME(1) = 16.*(Y(2)-Y(1))
YPRIME(2) = -Y(1)*Y(3)+45.92*Y(1)-Y(2)
YPRIME(3) = Y(1)*Y(2)-4.*Y(3)
! 3 COPIES OF LINEARIZED EQUATIONS OF MOTION.
DO 10 I = 0,2
YPRIME(4+I) = 16.*(Y(7+I)-Y(4+I))
YPRIME(7+I) = (45.92-Y(3))*Y(4+I)-Y(7+I)-Y(1)*Y(10+I)
YPRIME(10+I) = Y(2)*Y(4+I)+Y(1)*Y(7+I)-4.*Y(10+I)
10 CONTINUE
RETURN
END
I have debugged most of this, but I am still left with a few errors that I am unable to get around. The error log says:
main.f95:44.14:
DO 50 L = 1,N
1
Warning: Obsolescent feature: Shared DO termination label 50 at (1)
main.f95:49.18:
DO 60 L = l,(J-l)
1
Warning: Obsolescent feature: Shared DO termination label 60 at (1)
main.f95:59.14:
DO 80 K = 1,N
1
Warning: Obsolescent feature: Shared DO termination label 80 at (1)
/tmp/ccfI69Sj.o: In function `MAIN__':
main.f95:(.text+0x296): undefined reference to `dverk_'
main.f95:(.text+0x844): undefined reference to `exit_'
collect2: error: ld returned 1 exit status
Could someone please help me out in resolving the errors?
Thanks.
It's just what the compiler states:
Shared DO termination label
The nested loop 50 uses the same termination label:
DO 50 K = l,(J-l)
GSC(K) = 0.0
DO 50 L = 1,N
GSC(K) = GSC(K)+Y(N*L+J)*Y(N*L+K)
50 CONTINUE
In modern Fortran, you should use separate enddo statements:
DO K = l,(J-l)
    GSC(K) = 0.0
    DO L = 1,N
        GSC(K) = GSC(K)+Y(N*L+J)*Y(N*L+K)
    ENDDO
ENDDO
This omits the loop label, but in your code you don't need it (I guess).
The same needs to be done with loops 60 and 80.
The real errors are the undefined references to dverk and exit. These subroutines are missing in your code, so I assume they are contained in external objects/libraries. You need to tell the compiler where to find them, or include them in your code (after the end of the program or inside a module).
