I am starting to use Julia mainly because of its speed. Currently, I am solving a fixed point problem. Although the current version of my code runs fast, I would like to know some methods to improve its speed.
First of all, let me summarize the algorithm.
1. There is an initial seed called C0 that maps from the space (b,y) into an action space c; that is, we have C0(b,y).
2. A formula generates a rule Ct from C0.
3. Using an additional restriction, I can obtain an update of b [let's call it bt]. This generates a rule Ct(bt,y).
4. I interpolate the previous rule to move from the grid bt back onto the original grid b. This gives me an update of C0 [let's call it C1].
5. I iterate until the distance between C1 and C0 is below a convergence threshold.
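In schematic form, the whole procedure is the loop below (a minimal sketch; update_rule is a hypothetical stand-in for steps 2-4, and the actual functions are defined further down):

# Sketch of the fixed point iteration. update_rule stands for the Euler step,
# the endogenous grid for b, and the interpolation back onto the grid b.
function fixed_point(C0, b, y; tol = 1e-8)
    dif = Inf
    while dif > tol
        C1 = update_rule(C0, b, y)     # steps 2-4 (hypothetical helper)
        dif = maximum(abs.(C1 .- C0))  # step 5: sup-norm distance
        C0 = C1
    end
    return C0
end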
To implement it I created two structures:
struct Parm
lC::Array{Float64, 2} # Lower limit
uC::Array{Float64, 2} # Upper limit
γ::Float64 # CRRA coefficient
δ::Float64 # factor in the euler
γ1::Float64 #
r1::Float64 # inverse of the gross interest rate
yb1::Array{Float64, 2} # y - b(t+1)
P::Array{Float64, 2} # Transpose of transition matrix
end
mutable struct Upd1
pol::Array{Float64,2} # policy function
b::Array{Float64, 1} # exogenous grid for interpolation
dif::Float64 # updating difference
end
The first one is a set of parameters, while the second one stores the decision rule C1. I also define some functions:
function eulerm(x::Upd1,p::Parm)
ct = p.δ *(x.pol.^(-p.γ)*p.P).^(-p.γ1); # Euler equation
bt = p.r1.*(ct .+ p.yb1); # Endogenous grid for bonds
return ct,bt
end
function interp0!(bt::Array{Float64},ct::Array{Float64},x::Upd1, p::Parm)
polold = x.pol;
polnew = similar(x.pol);
@inbounds @simd for col in 1:size(bt,2)
F1 = LinearInterpolation(bt[:,col], ct[:,col],extrapolation_bc=Line());
polnew[:,col] = F1(x.b);
end
polnew[polnew .< p.lC] .= p.lC[polnew .< p.lC];
polnew[polnew .> p.uC] .= p.uC[polnew .> p.uC];
dif = maximum(abs.(polnew - polold));
return polnew,dif
end
function updating!(x::Upd1,p::Parm)
ct, bt = eulerm(x,p); # endogenous grid
x.pol, x.dif = interp0!(bt,ct,x,p);
end
function conver(x::Upd1,p::Parm)
while x.dif>1e-8
updating!(x,p);
end
end
The first function implements steps 2 and 3. The third one applies the update (the last part of step 4), and the last one iterates until convergence (step 5).
The most important function is the second one: it does the interpolation. While running @time and @btime I realized that the largest number of allocations occurs in the loop inside this function. I tried to reduce them by not defining polnew and writing directly to x.pol, but in that case the results are not correct, since it takes only two iterations to converge (I think Julia treats polold as exactly the same array as x.pol and updates both at the same time).
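A minimal standalone sketch of the aliasing pitfall I suspect (illustrative, not my actual code): assigning an array to a new name does not copy it, so both names mutate the same data unless copy is used.

x = rand(3, 3)
polold = x            # no copy: both names refer to the same array
x[1, 1] = -1.0
polold[1, 1] == -1.0  # true -- the "old" values changed as well

polold = copy(x)      # an explicit copy breaks the aliasing
x[1, 1] = 99.0
polold[1, 1] == -1.0  # true -- polold keeps the previous value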
Any advice is well received.
For anyone who wants to run it themselves, here is the rest of the required code:
function rouwen(ρ::Float64, σ2::Float64, N::Int64)
if (N % 2 != 1)
return "N should be an odd number"
end
sigz = sqrt(σ2/(1-ρ^2));
zn = sigz*sqrt(N-1);
z = range(-zn,zn,N);
p = (1+ρ)/2;
q = p;
Rho = [p 1-p;1-q q];
for i = 3:N
zz = zeros(i-1,1);
Rho = p*[Rho zz; zz' 0] + (1-p)*[zz Rho; 0 zz'] + (1-q)*[zz' 0; Rho zz] + q *[0 zz'; zz Rho];
Rho[2:end-1,:] = Rho[2:end-1,:]/2;
end
return z,Rho;
end
#############################################################
# Parameters of the model
############################################################
lb = 0; ub = 1000; pivb = 0.25; nb = 500;
ρ = 0.988; σz = 0.0439; μz =-σz/2; nz = 7;
ϕ = 0.0; σe = 0.6376; μe =-σe/2; ne = 7;
β = 0.98; r = 1/400; γ = 1;
b = exp10.(range(start=log10(lb+pivb), stop=log10(ub+pivb), length=nb)) .- pivb;
#=========================================================
Algorithm
=========================================================#
(z,Pz) = rouwen(ρ,σz, nz);
μZ = μz/(1-ρ);
z = z .+ μZ;
(ee,Pe) = rouwen(ϕ,σe,ne);
ee = ee .+ μe;
y = exp.(vec((z .+ ee')'));
P = kron(Pz,Pe);
R = 1 + r;
r1 = R^(-1);
γ1 = 1/γ;
δ = (β*R)^(-γ1);
m = R*b .+ y';
lC = max.(m .- ub,0);
uC = m .- lb;
by1 = b .- y';
# initial guess for C0
c0 = 0.1*(m);
# Set of parameters
pp = Parm(lC,uC,γ,δ,γ1,r1,by1,P');
# Container of results
up1 = Upd1(c0,b,1);
# Fixed point problem
conver(up1,pp)
UPDATE: As was recommended, I made the following changes to the second function:
function interp0!(bt::Array{Float64},ct::Array{Float64},x::Upd1, p::Parm)
polold = x.pol;
polnew = similar(x.pol);
@inbounds for col in 1:size(bt,2)
F1 = LinearInterpolation(@view(bt[:,col]), @view(ct[:,col]),extrapolation_bc=Line());
polnew[:,col] = F1(x.b);
end
for j in eachindex(polnew)
polnew[j] < p.lC[j] ? polnew[j] = p.lC[j] : nothing
polnew[j] > p.uC[j] ? polnew[j] = p.uC[j] : nothing
end
dif = maximum(abs.(polnew - polold));
return polnew,dif
end
This leads to an improvement in speed (from ~1.5 to ~1.3 seconds) and a reduction in the number of allocations. Some things I noted were:
Changing from polnew[:,col] = F1(x.b) to polnew[:,col] .= F1(x.b) can reduce the total allocations, but the time is slower. Why is that?
How should I understand the difference between @time and @btime? For this case, I have:
up1 = Upd1(c0,b,1);
@time conver(up1,pp)
1.338042 seconds (385.72 k allocations: 1.157 GiB, 3.37% gc time)
up1 = Upd1(c0,b,1);
@btime conver(up1,pp)
4.200 ns (0 allocations: 0 bytes)
Just to be precise, in both cases I ran it several times and chose representative numbers for each line.
Does this mean that all the time is due to allocations during compilation?
Start by going through the "performance tips" as advised by @DNF, but below you will find the most important comments for your code.
Vectorize vector assignments - a small dot makes a big difference:
julia> a = rand(3,4);
julia> @btime $a[3,:] = $a[3,:] ./ 2;
40.726 ns (2 allocations: 192 bytes)
julia> @btime $a[3,:] .= $a[3,:] ./ 2;
20.562 ns (1 allocation: 96 bytes)
Use views when doing something with subarrays:
julia> @btime sum($a[3,:]);
18.719 ns (1 allocation: 96 bytes)
julia> @btime sum(@view($a[3,:]));
5.600 ns (0 allocations: 0 bytes)
The code around the line polnew[polnew .< p.lC] .= p.lC[polnew .< p.lC]; will make far fewer allocations when you do it with a for loop over each element of polnew, as in the sketch below.
@simd will have no effect on conditionals (point 3), nor when the code is calling complex external functions.
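For point 3, a minimal sketch of that loop replacement (illustrative names, assuming lC[j] <= uC[j] elementwise; not your exact code):

# Allocating version: each `pol .< lC` builds a temporary Bool mask, and the
# right-hand side gathers the matching elements into another temporary array.
function clamp_masked!(pol, lC, uC)
    pol[pol .< lC] .= lC[pol .< lC]
    pol[pol .> uC] .= uC[pol .> uC]
    return pol
end

# Loop version: touches each element once and allocates nothing.
function clamp_loop!(pol, lC, uC)
    @inbounds for j in eachindex(pol, lC, uC)
        pol[j] = clamp(pol[j], lC[j], uC[j])
    end
    return pol
end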
I want to give an update on this problem. I made two main changes to my code: (i) I defined my own linear interpolation function, and (ii) I included the bounds check inside the interpolation.
With this, the new version of the second function is:
function interp0!(bt::Array{Float64},ct::Array{Float64},x::Upd1, p::Parm)
polold = x.pol;
polnew = similar(x.pol);
@inbounds @views for col in 1:size(bt,2)
polnew[:,col] = myint(bt[:,col], ct[:,col],x.b[:],p.lC[:,col],p.uC[:,col]);
end
dif = maximum(abs.(polnew - polold));
return polnew,dif
end
And the interpolation is now:
function myint(x0,y0,x1,ly,uy)
y1 = similar(x1);
n = size(x0,1);
j = 1;
@simd for i in eachindex(x1)
while (j <= n) && (x1[i] > x0[j])
j+=1;
end
if j == 1
y1[i] = y0[1] + ((y0[2]-y0[1])/(x0[2]-x0[1]))*(x1[i]-x0[1]) ;
elseif j == n+1
y1[i] = y0[n] + ((y0[n]-y0[n-1])/(x0[n]-x0[n-1]))*(x1[i]-x0[n]);
else
y1[i] = y0[j-1]+ ((x1[i]-x0[j-1])/(x0[j]-x0[j-1]))*(y0[j]-y0[j-1]);
end
y1[i] > uy[i] ? y1[i] = uy[i] : nothing;
y1[i] < ly[i] ? y1[i] = ly[i] : nothing;
end
return y1;
end
As you can see, I am taking advantage of (and assuming) the fact that both vectors used as a basis are sorted, while the last two lines in the loop clamp the result to the bounds imposed by lC and uC.
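Under the same sortedness assumption, the inner while scan could also be replaced by Base's binary search; a hypothetical variant I have not benchmarked:

# Same contract as myint, but it locates the bracketing interval with
# searchsortedfirst (binary search) instead of a linear scan.
function myint_bsearch(x0, y0, x1, ly, uy)
    y1 = similar(x1)
    n = length(x0)
    @inbounds for i in eachindex(x1)
        j = searchsortedfirst(x0, x1[i])  # first j with x0[j] >= x1[i]
        j = clamp(j, 2, n)                # reuse the edge segments to extrapolate
        s = (y0[j] - y0[j-1]) / (x0[j] - x0[j-1])
        y1[i] = clamp(y0[j-1] + s * (x1[i] - x0[j-1]), ly[i], uy[i])
    end
    return y1
end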
With that I get the following total time
up1 = Upd1(c0,b,1);
@time conver(up1,pp)
0.734630 seconds (28.93 k allocations: 752.214 MiB, 3.82% gc time)
up1 = Upd1(c0,b,1);
@btime conver(up1,pp)
4.200 ns (0 allocations: 0 bytes)
which is almost twice as fast with ~8% of the total allocations. The use of views in the loop of interp0! also helps a lot.
I am implementing a binary tree in Julia. The tree has nodes and leaves. The nodes point to left and right children, which are also NodeLeaf objects. The following code exemplifies the data structure:
using TimerOutputs
mutable struct NodeLeaf
isleaf::Bool
value::Union{Nothing,Float64}
split::Union{Nothing,Float64}
column::Union{Nothing,Int64}
left::Union{Nothing,NodeLeaf}
right::Union{Nothing,NodeLeaf}
end
function evaluate(node::NodeLeaf, x)::Float64
while !node.isleaf
if x[node.column] < node.split
node = node.left
else
node = node.right
end
end
return node.value
end
function build_random_tree(max_depth)
if max_depth == 0
return NodeLeaf(true, randn(), randn(), rand(1:10), nothing, nothing)
else
return NodeLeaf(false, randn(), randn(), rand(1:10), build_random_tree(max_depth - 1), build_random_tree(max_depth - 1))
end
end
function main()
my_random_tree = build_random_tree(4)
@timeit to "evaluation" for i in 1:1000000
evaluate(my_random_tree, randn(10))
end
end
const to = TimerOutput()
main()
show(to)
I notice that a lot of allocations occur in the evaluate function, but I don't see the reason why this is the case:
julia mytree.jl
───────────────────────────────────────────────────────────────────────
Time Allocations
─────────────────────── ────────────────────────
Tot / % measured: 476ms / 21.6% 219MiB / 62.7%
Section ncalls time %tot avg alloc %tot avg
───────────────────────────────────────────────────────────────────────
evaluation 1 103ms 100.0% 103ms 137MiB 100.0% 137MiB
───────────────────────────────────────────────────────────────────────
As I increase the number of iterations of the evaluation loop, the allocations continue to grow without bound. Can anybody explain why the allocation grows so much and suggest how to avoid this issue? Thanks.
EDIT
I simplified the code too much for the example. The actual code accesses DataFrames, so main looks like this:
using DataFrames
function main()
my_random_tree = build_random_tree(7)
df = DataFrame(A=1:1000000)
for i in 1:9
df[!, string(i)] = collect(1:1000000)
end
@timeit to "evaluation" for i in 1:size(df, 1)
evaluate(my_random_tree, @view df[i, :])
end
end
I expect this to yield 0 allocations, but that isn't true:
julia mytree.jl
───────────────────────────────────────────────────────────────────────
Time Allocations
─────────────────────── ────────────────────────
Tot / % measured: 551ms / 20.5% 305MiB / 45.0%
Section ncalls time %tot avg alloc %tot avg
───────────────────────────────────────────────────────────────────────
evaluation 1 113ms 100.0% 113ms 137MiB 100.0% 137MiB
───────────────────────────────────────────────────────────────────────
On the other hand, if I use a plain array I don't get allocations:
function main()
my_random_tree = build_random_tree(7)
df = randn(1000000, 10)
@timeit to "evaluation" for i in 1:size(df, 1)
evaluate(my_random_tree, @view df[i, :])
end
end
julia mytree.jl
───────────────────────────────────────────────────────────────────────
Time Allocations
─────────────────────── ────────────────────────
Tot / % measured: 465ms / 5.7% 171MiB / 0.0%
Section ncalls time %tot avg alloc %tot avg
───────────────────────────────────────────────────────────────────────
evaluation 1 26.4ms 100.0% 26.4ms 0.00B - % 0.00B
───────────────────────────────────────────────────────────────────────
The thing that allocates is randn, not evaluate. Switch to randn!:
julia> using Random
julia> function main()
my_random_tree = build_random_tree(4)
x = randn(10)
@allocated for i in 1:1000000
evaluate(my_random_tree, randn!(x))
end
end
main (generic function with 1 method)
julia> main()
0
EDIT
Solution with DataFrames.jl:
function bar(mrt, nti)
@timeit to "evaluation" for nt in nti
evaluate(mrt, nt)
end
end
function main()
my_random_tree = build_random_tree(7)
df = DataFrame(A=1:1000000)
for i in 1:9
df[!, string(i)] = collect(1:1000000)
end
bar(my_random_tree, Tables.namedtupleiterator(df))
end
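My reading of why this helps (an assumption on my part, not spelled out above): iterating df[i, :] produces type-unstable DataFrameRow objects, while Tables.namedtupleiterator yields concretely typed NamedTuples, and passing the iterator through the function barrier bar lets the evaluate call site compile against known types. A minimal standalone sketch of the same pattern, with illustrative names:

using DataFrames, Tables

# Function barrier: `rows` arrives as a concretely typed iterator, so the
# loop body compiles with the element type known and field access is cheap.
function total(rows)
    s = 0.0
    for r in rows   # r is a NamedTuple such as (A = 1.0, B = 4)
        s += r.A
    end
    return s
end

df = DataFrame(A = 1.0:3.0, B = 4:6)
total(Tables.namedtupleiterator(df))  # 6.0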
Is there a performance impact from using the Checked module? I've tested it out with sequences of type int and see no noticeable difference. Sometimes the checked version is faster and sometimes unchecked is faster, but generally not by much.
Seq.initInfinite (fun x-> x) |> Seq.item 1000000000;;
Real: 00:00:05.272, CPU: 00:00:05.272, GC gen0: 0, gen1: 0, gen2: 0
val it : int = 1000000000
open Checked
Seq.initInfinite (fun x-> x) |> Seq.item 1000000000;;
Real: 00:00:04.785, CPU: 00:00:04.773, GC gen0: 0, gen1: 0, gen2: 0
val it : int = 1000000000
Basically I'm trying to figure out if there would be any downside to always opening Checked. (I encountered an overflow that wasn't immediately obvious, so I'm now playing the role of the jilted lover who doesn't want another broken heart.) The only non-contrived reason I can come up with for not always using Checked is if there were some performance hit, but I haven't seen one yet.
When you measure performance, it's usually not a good idea to include Seq, as Seq adds lots of overhead (at least compared to int operations), so you risk that most of the time is spent in Seq, not in the code you would like to test.
I wrote a small test program for (+):
let clock =
let sw = System.Diagnostics.Stopwatch ()
sw.Start ()
fun () ->
sw.ElapsedMilliseconds
let dbreak () = System.Diagnostics.Debugger.Break ()
let time a =
let b = clock ()
let r = a ()
let n = clock ()
let d = n - b
d, r
module Unchecked =
let run c () =
let rec loop a i =
if i < c then
loop (a + 1) (i + 1)
else
a
loop 0 0
module Checked =
open Checked
let run c () =
let rec loop a i =
if i < c then
loop (a + 1) (i + 1)
else
a
loop 0 0
[<EntryPoint>]
let main argv =
let count = 1000000000
let testCases =
[|
"Unchecked" , Unchecked.run
"Checked" , Checked.run
|]
for nm, a in testCases do
printfn "Running %s ..." nm
let ms, r = time (a count)
printfn "... it took %d ms, result is %A" ms r
0
The performance results are as follows:
Running Unchecked ...
... it took 561 ms, result is 1000000000
Running Checked ...
... it took 1103 ms, result is 1000000000
So it seems some overhead is added by using Checked. The cost of an int add should be less than the loop overhead, so the overhead of Checked is higher than 2x, maybe closer to 4x.
Out of curiosity, we can check the IL code using tools like ILSpy:
Unchecked:
IL_0000: nop
IL_0001: ldarg.2
IL_0002: ldarg.0
IL_0003: bge.s IL_0014
IL_0005: ldarg.0
IL_0006: ldarg.1
IL_0007: ldc.i4.1
IL_0008: add
IL_0009: ldarg.2
IL_000a: ldc.i4.1
IL_000b: add
IL_000c: starg.s i
IL_000e: starg.s a
IL_0010: starg.s c
IL_0012: br.s IL_0000
Checked:
IL_0000: nop
IL_0001: ldarg.2
IL_0002: ldarg.0
IL_0003: bge.s IL_0014
IL_0005: ldarg.0
IL_0006: ldarg.1
IL_0007: ldc.i4.1
IL_0008: add.ovf
IL_0009: ldarg.2
IL_000a: ldc.i4.1
IL_000b: add.ovf
IL_000c: starg.s i
IL_000e: starg.s a
IL_0010: starg.s c
IL_0012: br.s IL_0000
The only difference is that Unchecked uses add and Checked uses add.ovf. add.ovf is add with overflow check.
We can dig even deeper by looking at the jitted x86_64 code.
Unchecked:
; if i < c then
00007FF926A611B3 cmp esi,ebx
00007FF926A611B5 jge 00007FF926A611BD
; i + 1
00007FF926A611B7 inc esi
; a + 1
00007FF926A611B9 inc edi
; loop (a + 1) (i + 1)
00007FF926A611BB jmp 00007FF926A611B3
Checked:
; if i < c then
00007FF926A62613 cmp esi,ebx
00007FF926A62615 jge 00007FF926A62623
; a + 1
00007FF926A62617 add edi,1
; Overflow?
00007FF926A6261A jo 00007FF926A6262D
; i + 1
00007FF926A6261C add esi,1
; Overflow?
00007FF926A6261F jo 00007FF926A6262D
; loop (a + 1) (i + 1)
00007FF926A62621 jmp 00007FF926A62613
Now the reason for the Checked overhead is visible. After each operation the jitter inserts the conditional instruction jo which jumps to code that raises OverflowException if the overflow flag is set.
Instruction-timing charts show us that the cost of an integer add is less than 1 clock cycle. The reason it's less than 1 clock cycle is that modern CPUs can execute certain instructions in parallel.
The charts also show us that a branch that was correctly predicted by the CPU takes around 1-2 clock cycles.
So, assuming a throughput of at least 2, the cost of the two integer additions in the Unchecked example should be 1 clock cycle.
In the Checked example we do add, jo, add, jo. Most likely the CPU can't parallelize in this case, and the cost should be around 4-6 clock cycles.
Another interesting difference is that the order of the additions changed. With checked additions the order of the operations matters, but with unchecked additions the jitter (and the CPU) has greater flexibility to move the operations around, possibly improving performance.
So, long story short: for cheap operations like (+) the overhead of Checked should be around 4x-6x compared to Unchecked.
This assumes no overflow exception occurs. The cost of a .NET exception is probably around 100,000x that of an integer addition.
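As an aside, for readers coming from the Julia questions above: Julia draws the same checked/unchecked distinction through its Base.Checked module. A tiny sketch of the semantics (the cost figures above are .NET-specific):

using Base.Checked: checked_add

typemax(Int) + 1              # default Int arithmetic wraps around silently
checked_add(typemax(Int), 1)  # throws OverflowError instead of wrapping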
I am experimenting with a settings object that is passed on through a bunch of functions (a bit like a stack). It has quite a few fields (mixed ints, enums, DU's, strings) and I was wondering what the best data type is for this task (I have come back to this a few times in the past few years...).
While I currently employ a home-grown data type, it is slow and not thread-safe, so I am looking into a more logical choice and decided to experiment with record types.
Since they use copy-and-update semantics, it seems logical to think the F# compiler is smart enough to update only the necessary data and leave non-mutated data alone, considering it is immutable anyway.
So I ran a couple of tests and expected to see a performance improvement between full-copy (each field is updated) and one-field copy (one field is updated). But I am not sure I am using the proper approach for testing, or whether other datatypes may be more suitable.
As an actual example, let's take an XML qualified name. They have a small local part and a prefix (which can be ignored) and a long namespace part. The namespace is mostly the same, so only the local part needs updating.
type RecQName = { Ns :string; Prefix :string; Name :string }
// full update
let f a b c = { Ns = a; Prefix = b; Name = c}
let g() = f "http://test/is/a/long/name/here/and/there" "xs" "value"
// partial update
let test = g();;
let h a b = {a with Name = b }
let k() = h test "newname"
With timings on, in FSI (set to Release, x64 and Debug off) I get (running each twice):
> for i in 0 .. 100000000 do g() |> ignore;;
Real: 00:00:01.412, CPU: 00:00:01.404, GC gen0: 637, gen1: 1, gen2: 1
> for i in 0 .. 100000000 do g() |> ignore;;
Real: 00:00:01.317, CPU: 00:00:01.310, GC gen0: 636, gen1: 0, gen2: 0
> for i in 0 .. 100000000 do k() |> ignore;;
Real: 00:00:01.191, CPU: 00:00:01.185, GC gen0: 636, gen1: 1, gen2: 0
> for i in 0 .. 100000000 do k() |> ignore;;
Real: 00:00:01.099, CPU: 00:00:01.092, GC gen0: 636, gen1: 0, gen2: 0
Now I know timings are not everything, and there's a clear difference of roughly 20%, but that seems too small to justify the change, and may well be because of other reasons (the CLI may intern the strings, for instance).
Am I making the wrong assumptions? I googled for record-type performance, but the results were all about comparing records with structs. Does anybody know the algorithm used for copy-and-update? Any thoughts on whether this datatype or something else is a smarter choice (given many fields, not just three as above, and wanting immutability without locking, via copy-and-update)?
Update 1
The test above doesn't really test anything, it seems, as can be shown with the next test.
type Rec = { A: int; B: int; C: int; D: int; E: int};;
// full
let f a b c d e = { A = a; B = b; C = c; D = d; E = e }
// partial
let temp = f 1 2 3 4 5
let g b = { temp with B = b }
// perf tests (subtract is necessary or the compiler optimizes them away)
let mutable res = 0;;
for i in 0 .. 1000000000 do f i (i-1) (i-2) (i-3) (i-4) |> function { B = b } -> if b < 1000 then res <- b
for i in 0 .. 1000000000 do g i |> function { B = b } -> if b < 1000 then res <- b
Results:
> for i in 0 .. 1000000000 do f i (i-1) (i-2) (i-3) (i-4) |> function { B = b } -> if b < 1000 then res <- b ;;
Real: 00:00:09.039, CPU: 00:00:09.032, GC gen0: 6358, gen1: 1, gen2: 0
> for i in 0 .. 1000000000 do g i |> function { B = b } -> if b < 1000 then res <- b;;
Real: 00:00:10.571, CPU: 00:00:10.576, GC gen0: 6358, gen1: 2, gen2: 0
Now the difference, while unexpected, shows. It looks like the copying is definitely not faster than not copying. The overhead of copying in this case may be due to the extra stack slot required; I don't know.
At the very least it shows that no magic goes on (as Fyodor already mentioned in the comments).
Update 2
OK, one more update. If I inline both the f and g functions above, the timings become remarkably different, in favor of the partial update. Apparently, the inlining has the effect that either the compiler or the JIT "knows" it doesn't have to do a full copy, or it is just the effect of putting everything on the stack (the JIT or compiler getting rid of the boxing), as can be seen from the 0 GC collections.
> for i in 0 .. 1000000000 do f i (i-1) (i-2) (i-3) (i-4) |> function { B = b } -> if b < 1000 then res <- b ;;
Real: 00:00:08.885, CPU: 00:00:08.876, GC gen0: 6359, gen1: 1, gen2: 1
> for i in 0 .. 1000000000 do g i |> function { B = b } -> if b < 1000 then res <- b ;;
Real: 00:00:00.571, CPU: 00:00:00.561, GC gen0: 0, gen1: 0, gen2: 0
Whether this holds in the "real world" is debatable and should (of course) be tested. Whether this improvement has any relation to record types, I doubt it.
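For comparison with the Julia questions earlier on this page, the same copy-and-update idea can be written by hand for an immutable struct; with_name below is my own illustrative helper. Only the small struct itself is reallocated, while the field references are shared:

struct QName
    ns::String
    prefix::String
    name::String
end

# "Copy-and-update": build a new struct, reusing the untouched fields.
with_name(q::QName, n::String) = QName(q.ns, q.prefix, n)

q0 = QName("http://test/is/a/long/name/here/and/there", "xs", "value")
q1 = with_name(q0, "newname")
q1.ns === q0.ns   # true: the namespace string is shared, not copied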
I have an old Fortran code for the calculation of Lyapunov exponents, which I tried converting to modern Fortran syntax.
PROGRAM ODE
integer, PARAMETER :: N=3
integer, PARAMETER :: NN=12
EXTERNAL FCN
DIMENSION Y(NN),ZNORM(N),GSC(N),CUM(N),C(24),W(NN,9)
Y(1) = 10.0
Y(2) = 1.0
Y(3) = 0.0
! INITIAL CONDITIONS FOR LINEAR SYSTEM (ORTHONORMAL FRAME)
DO 10 I = N+1,NN
Y(I) = 0.0
10 CONTINUE
DO 20 I = 1,N
Y((N+1)*I) = 1.0
CUM(I) = 0.0
20 CONTINUE
! INTEGRATION TOLERANCE, # OF INTEGRATION STEPS,
! TIME PER STEP, AND I/O RATE
write (*,*) "TOL, NSTEP, STPSZE, IO ?"
read (*,*) TOL, NSTEP, STPSZE, IO
! INITIALIZATION FOR INTEGRATOR
NEQ = NN
X=0.0
IND = 1
DO 100 I = 1,NSTEP
XEND = STPSZE*FLOAT(I)
! CALL ANY ODE INTEGRATOR - THIS IS AN LMSL ROUTINE
CALL DVERK (NEQ,FCN,X,Y,XEND,TOL, IND,C,NEQ,W,IER)
! CONSTRUCT A NEW ORTHONORMAL BASIS BY GRAM-SCHMIDT METHOD
! NORMALIZE FIRST VECTOR
ZNORM(1) = 0.0
DO 30 J = 1,N
ZNORM(1) = ZNORM(1)+Y(N*J+1)**2
30 CONTINUE
ZNORM(1) = SQRT(ZNORM(1))
DO 40 J = 1,N
Y(N*J+1) = Y(N*J+1)/ZNORM(1)
40 CONTINUE
! GENERATE THE NEW ORTHONORMAL SET OF VECTORS.
DO 80 J = 2,N
! GENERATE J-1 GSR COEFFICIENTS.
DO 50 K = l,(J-l)
GSC(K) = 0.0
DO 50 L = 1,N
GSC(K) = GSC(K)+Y(N*L+J)*Y(N*L+K)
50 CONTINUE
! CONSTRUCT A NEW VECTOR.
DO 60 K = 1,N
DO 60 L = l,(J-l)
Y(N*K+J) = Y(N*K+J)-GSC(L)*Y(N*K+L)
60 CONTINUE
! CALCULATE THE VECTOR'S NORM
ZNORM(J) = 0.0
DO 70 K = I,N
ZNORM(J) = ZNORM(J)+Y(N*K+J)**2
70 CONTINUE
ZNORM(J) = SQRT(ZNORM(J))
! NORMALIZE THE NEW VECTOR.
DO 80 K = 1,N
Y(N*K+J) = Y(N*K+J)/ZNORM(J)
80 CONTINUE
! UPDATE RUNNING VECTORMAGNITUDES
DO 90 K = 1,N
CUM(K) = CUM(K)+ALOG(ZNORM(K) )/ALOG(2. )
90 CONTINUE
! NORMALIZE EXPONENT AND PRINT EVERY IO ITERATIONS
IF (MOD(I,IO).EQ.0) write (*,*) X,(CUM(K)/X,K = I,N)
100 CONTINUE
CALL EXIT
END
SUBROUTINE FCN (N,X,Y,YPRIME)
! USER DEFINED ROUTINE CALLED BY IMSL INTEGRATOR.
DIMENSION Y(12),YPRIME(12)
! LORENZ EQUATIONS OF MOTION
YPRIME(1) = 16.*(Y(2)-Y(1))
YPRIME(2) = -Y(1)*Y(3)+45.92*Y(1)-Y(2)
YPRIME(3) = Y(1)*Y(2)-4.*Y(3)
! 3 COPIES OF LINEARIZED EQUATIONS OF MOTION.
DO 10 I = 0,2
YPRIME(4+I) = 16.*(Y(7+I)-Y(4+I))
YPRIME(7+I) = (45.92-Y(3))*Y(4+I)-Y(7+I)-Y(1)*Y(10+I)
YPRIME(10+I) = Y(2)*Y(4+I)+Y(1)*Y(7+I)-4.*Y(10+I)
10 CONTINUE
RETURN
END
I have debugged most of this, but I am still left with a few errors that I am unable to get around. The error log says:
main.f95:44.14:
DO 50 L = 1,N
1
Warning: Obsolescent feature: Shared DO termination label 50 at (1)
main.f95:49.18:
DO 60 L = l,(J-l)
1
Warning: Obsolescent feature: Shared DO termination label 60 at (1)
main.f95:59.14:
DO 80 K = 1,N
1
Warning: Obsolescent feature: Shared DO termination label 80 at (1)
/tmp/ccfI69Sj.o: In function `MAIN__':
main.f95:(.text+0x296): undefined reference to `dverk_'
main.f95:(.text+0x844): undefined reference to `exit_'
collect2: error: ld returned 1 exit status
Could someone please help me out in resolving the errors?
Thanks.
It's just what the compiler states:
Shared DO termination label
The nested DO loops share the termination label 50:
DO 50 K = l,(J-l)
GSC(K) = 0.0
DO 50 L = 1,N
GSC(K) = GSC(K)+Y(N*L+J)*Y(N*L+K)
50 CONTINUE
In modern Fortran, you should use separate enddo statements:
DO K = l,(J-l)
GSC(K) = 0.0
DO L = 1,N
GSC(K) = GSC(K)+Y(N*L+J)*Y(N*L+K)
ENDDO
ENDDO
This omits the loop label, which your code doesn't need anyway (I guess).
The same needs to be done with loops 60 and 80.
The real errors are the undefined references to dverk and exit. These subroutines are missing from your code, so I assume they are contained in external objects/libraries. You need to tell the compiler where to find them, or include them in your code (after the end of the program, or inside a module).