Int32 vs Float64 performances in Crystal - performance

I ran this benchmark and I was very surprised to see that Crystal performances are almost the same for Int32 or Float64 operations.
$ crystal benchmarks/int_vs_float.cr --release
int32 414.96M ( 2.41ns) (±14.81%) 0.0B/op fastest
float64 354.27M ( 2.82ns) (±12.46%) 0.0B/op 1.17× slower
Do I have some weird side effects on my benchmark code?
require "benchmark"
res = 0
res2 = 0.0
Benchmark.ips do |x|
x.report("int32") do
a = 128973 / 119236
b = 119236 - 128973
d = 117232 > 123462 ? 117232 * 123462 : 123462 / 117232
res = a + b + d
end
x.report("float64") do
a = 1.28973 / 1.19236
b = 1.19236 - 1.28973
d = 1.17232 > 1.23462 ? 1.17232 * 1.23462 : 1.23462 / 1.17232
res = a + b + d
end
end
puts res
puts res2

First of all / in Crystal is float division, so this is largely comparing floats:
typeof(a) # => Float64
typeof(b) # => Int32
typeof(d) # => Float64 | Int32)
If we fix the benchmark to use integer division, //, I get:
int32 631.35M ( 1.58ns) (± 5.53%) 0.0B/op 1.23× slower
float64 773.57M ( 1.29ns) (± 3.21%) 0.0B/op fastest
Still no real difference, within error margin. Why's that? Let's dig deeper. First we can extract the examples into a not inlinable function and make sure to call it so Crystal doesn't just ignore it:
#[NoInline]
def calc
a = 128973 // 119236
b = 119236 - 128973
d = 117232 > 123462 ? 117232 * 123462 : 123462 // 117232
a + b + d
end
p calc
Then we can build this with crystal build --release --no-debug --emit llvm-ir to obtain an .ll file witht the optimized LLVM-IR. We dig out our calc function and see something like this:
define i32 #"*calc:Int32"() local_unnamed_addr #19 {
alloca:
%0 = tail call i1 #llvm.expect.i1(i1 false, i1 false)
br i1 %0, label %overflow, label %normal6
overflow: ; preds = %alloca
tail call void #__crystal_raise_overflow()
unreachable
normal6: ; preds = %alloca
ret i32 -9735
}
Where's all our calculations gone? LLVM did them at compile time because it was all constants! We can repeat the experiment with the Float64 example:
define double #"*calc:Float64"() local_unnamed_addr #11 {
alloca:
ret double 0x40004CAA3B35919C
}
A little less boilerplate, hence it being slightly faster, but again, all precomputed!
I'll end the exercise here. Further research for the reader:
What happens if we try to introduce non constant terms into all expressions?
Is the premise that 32bit integer operations should be any faster or slower than 64bit IEEE754 floating point operations on a modern 64bit CPU sane?

Related

What is the simplest method to achieve 100% CPU usage in F#?

Sometimes I write code that I expect to be able to fully saturate the CPU.
For example, to calculate the Mandelbrot set, you might use something like this:
type MandelbrotPoint =
|Escaped of int
|NotEscaped
let getIterationCount maxIters (c:Complex) =
let rec innerFunction iters (z:Complex) =
match z.Magnitude, iters with
|m, i when m > 2.0 -> Escaped i
|_, i when i > maxIters -> NotEscaped
|_ -> innerFunction (iters + 1) (z * z + c)
innerFunction 0 c
let getIterationCounts (topLeft:Complex) pixelWidth pixelHeight realWidth =
let xGap = realWidth / ((pixelWidth - 1) |> float)
[|for iY in 0 .. (pixelHeight - 1) do
for iX in 0 .. (pixelWidth - 1) do
yield Complex(topLeft.Real + xGap * (float iX), topLeft.Imaginary - xGap * (float iY))
|]
|> Array.Parallel.map (getIterationCount 1000)
Naively, I would expect this to run at close to 100% until complete, but it bounces around between 25% and 60%.
I get that often calculations are constrained by how long it takes data to move into and out of the CPU cache, but that shouldn't be a problem here right? There is much data to move, its just a simple iterative calculation, no?
On my 4-core 8-thread CPU the below naive snippet does pretty close to 100% CPU saturation by FSI process:
let consume (x: int) =
while true do
let _ = x*x
()
[|0..7|] |> Array.Parallel.iter consume

Filling a matrix using parallel processing in Julia

I'm trying to speed up the solution time for a dynamic programming problem in Julia (v. 0.5.0), via parallel processing. The problem involves choosing the optimal values for every element of a 1073 x 19 matrix at every iteration, until successive matrix differences fall within a tolerance. I thought that, within each iteration, filling in the values for each element of the matrix could be parallelized. However, I'm seeing a huge performance degradation using SharedArray, and I'm wondering if there's a better way to approach parallel processing for this problem.
I construct the arguments for the function below:
est_params = [.788,.288,.0034,.1519,.1615,.0041,.0077,.2,0.005,.7196]
r = 0.015
tau = 0.35
rho =est_params[1]
sigma =est_params[2]
delta = 0.15
gamma =est_params[3]
a_capital =est_params[4]
lambda1 =est_params[5]
lambda2 =est_params[6]
s =est_params[7]
theta =est_params[8]
mu =est_params[9]
p_bar_k_ss =est_params[10]
beta = (1+r)^(-1)
sigma_range = 4
gz = 19
gp = 29
gk = 37
lnz=collect(linspace(-sigma_range*sigma,sigma_range*sigma,gz))
z=exp(lnz)
gk_m = fld(gk,2)
# Need to add mu somewhere to k_ss
k_ss = (theta*(1-tau)/(r+delta))^(1/(1-theta))
k=cat(1,map(i->k_ss*((1-delta)^i),collect(1:gk_m)),map(i->k_ss/((1-delta)^i),collect(1:gk_m)))
insert!(k,gk_m+1,k_ss)
sort!(k)
p_bar=p_bar_k_ss*k_ss
p = collect(linspace(-p_bar/2,p_bar,gp))
#Tauchen
N = length(z)
Z = zeros(N,1)
Zprob = zeros(Float32,N,N)
Z[N] = lnz[length(z)]
Z[1] = lnz[1]
zstep = (Z[N] - Z[1]) / (N - 1)
for i=2:(N-1)
Z[i] = Z[1] + zstep * (i - 1)
end
for a = 1 : N
for b = 1 : N
if b == 1
Zprob[a,b] = 0.5*erfc(-((Z[1] - mu - rho * Z[a] + zstep / 2) / sigma)/sqrt(2))
elseif b == N
Zprob[a,b] = 1 - 0.5*erfc(-((Z[N] - mu - rho * Z[a] - zstep / 2) / sigma)/sqrt(2))
else
Zprob[a,b] = 0.5*erfc(-((Z[b] - mu - rho * Z[a] + zstep / 2) / sigma)/sqrt(2)) -
0.5*erfc(-((Z[b] - mu - rho * Z[a] - zstep / 2) / sigma)/sqrt(2))
end
end
end
# Collecting tauchen results in a 2 element array of linspace and array; [2] gets array
# Zprob=collect(tauchen(gz, rho, sigma, mu, sigma_range))[2]
Zcumprob=zeros(Float32,gz,gz)
# 2 in cumsum! denotes the 2nd dimension, i.e. columns
cumsum!(Zcumprob, Zprob,2)
gm = gk * gp
control=zeros(gm,2)
for i=1:gk
control[(1+gp*(i-1)):(gp*i),1]=fill(k[i],(gp,1))
control[(1+gp*(i-1)):(gp*i),2]=p
end
endog=copy(control)
E=Array(Float32,gm,gm,gz)
for h=1:gm
for m=1:gm
for j=1:gz
# set the nonzero net debt indicator
if endog[h,2]<0
p_ind=1
else
p_ind=0
end
# set the investment indicator
if (control[m,1]-(1-delta)*endog[h,1])!=0
i_ind=1
else
i_ind=0
end
E[m,h,j] = (1-tau)*z[j]*(endog[h,1]^theta) + control[m,2]-endog[h,2]*(1+r*(1-tau)) +
delta*endog[h,1]*tau-(control[m,1]-(1-delta)*endog[h,1]) -
(i_ind*gamma*endog[h,1]+endog[h,1]*(a_capital/2)*(((control[m,1]-(1-delta)*endog[h,1])/endog[h,1])^2)) +
s*endog[h,2]*p_ind
elem = E[m,h,j]
if E[m,h,j]<0
E[m,h,j]=elem+lambda1*elem-.5*lambda2*elem^2
else
E[m,h,j]=elem
end
end
end
end
I then constructed the function with serial processing. The two for loops iterate through each element to find the largest value in a 1072-sized (=the gm scalar argument in the function) array:
function dynam_serial(E,gm,gz,beta,Zprob)
v = Array(Float32,gm,gz )
fill!(v,E[cld(gm,2),cld(gm,2),cld(gz,2)])
Tv = Array(Float32,gm,gz)
# Set parameters for the loop
convcrit = 0.0001 # chosen convergence criterion
diff = 1 # arbitrary initial value greater than convcrit
while diff>convcrit
exp_v=v*Zprob'
for h=1:gm
for j=1:gz
Tv[h,j]=findmax(E[:,h,j] + beta*exp_v[:,j])[1]
end
end
diff = maxabs(Tv - v)
v=copy(Tv)
end
end
Timing this, I get:
#time dynam_serial(E,gm,gz,beta,Zprob)
> 106.880008 seconds (91.70 M allocations: 203.233 GB, 15.22% gc time)
Now, I try using Shared Arrays to benefit from parallel processing. Note that I reconfigured the iteration so that I only have one for loop, rather than two. I also use v=deepcopy(Tv); otherwise, v is copied as an Array object, rather than a SharedArray:
function dynam_parallel(E,gm,gz,beta,Zprob)
v = SharedArray(Float32,(gm,gz),init = S -> S[Base.localindexes(S)] = myid() )
fill!(v,E[cld(gm,2),cld(gm,2),cld(gz,2)])
# Set parameters for the loop
convcrit = 0.0001 # chosen convergence criterion
diff = 1 # arbitrary initial value greater than convcrit
while diff>convcrit
exp_v=v*Zprob'
Tv = SharedArray(Float32,gm,gz,init = S -> S[Base.localindexes(S)] = myid() )
#sync #parallel for hj=1:(gm*gz)
j=cld(hj,gm)
h=mod(hj,gm)
if h==0;h=gm;end;
#async Tv[h,j]=findmax(E[:,h,j] + beta*exp_v[:,j])[1]
end
diff = maxabs(Tv - v)
v=deepcopy(Tv)
end
end
Timing the parallel version; and using a 4-core 2.5 GHz I7 processor with 16GB of memory, I get:
addprocs(3)
#time dynam_parallel(E,gm,gz,beta,Zprob)
> 164.237208 seconds (2.64 M allocations: 201.812 MB, 0.04% gc time)
Am I doing something incorrect here? Or is there a better way to approach parallel processing in Julia for this particular problem? I've considered using Distributed Arrays, but it's difficult for me to see how to apply them to the present problem.
UPDATE:
Per #DanGetz and his helpful comments, I turned instead to trying to speed up the serial processing version. I was able to get performance down to 53.469780 seconds (67.36 M allocations: 103.419 GiB, 19.12% gc time) through:
1) Upgrading to 0.6.0 (saved about 25 seconds), which includes the helpful #views macro.
2) Preallocating the main array I'm trying to fill in (Tv), per the section on Preallocating Outputs in the Julia Performance Tips: https://docs.julialang.org/en/latest/manual/performance-tips/. (saved another 25 or so seconds)
The biggest remaining slow-down seems to be coming from the add_vecs function, which sums together subarrays of two larger matrices. I've tried devectorizing and using BLAS functions, but haven't been able to produce better performance.
In any event, the improved code for dynam_serial is below:
function add_vecs(r::Array{Float32},h::Int,j::Int,E::Array{Float32},exp_v::Array{Float32},beta::Float32)
#views r=E[:,h,j] + beta*exp_v[:,j]
return r
end
function dynam_serial(E::Array{Float32},gm::Int,gz::Int,beta::Float32,Zprob::Array{Float32})
v = Array{Float32}(gm,gz)
fill!(v,E[cld(gm,2),cld(gm,2),cld(gz,2)])
Tv = Array{Float32}(gm,gz)
r = Array{Float32}(gm)
# Set parameters for the loop
convcrit = 0.0001 # chosen convergence criterion
diff = 1 # arbitrary initial value greater than convcrit
while diff>convcrit
exp_v=v*Zprob'
for h=1:gm
for j=1:gz
#views Tv[h,j]=findmax(add_vecs(r,h,j,E,exp_v,beta))[1]
end
end
diff = maximum(abs,Tv - v)
v=copy(Tv)
end
return Tv
end
If add_vecs seems to be the critical function, writing an explicit for loop could offer more optimization. How does the following benchmark:
function add_vecs!(r::Array{Float32},h::Int,j::Int,E::Array{Float32},
exp_v::Array{Float32},beta::Float32)
#inbounds for i=1:size(E,1)
r[i]=E[i,h,j] + beta*exp_v[i,j]
end
return r
end
UPDATE
To continue optimizing dynam_serial I have tried to remove more allocations. The result is:
function add_vecs_and_max!(gm::Int,r::Array{Float64},h::Int,j::Int,E::Array{Float64},
exp_v::Array{Float64},beta::Float64)
#inbounds for i=1:gm
r[i] = E[i,h,j]+beta*exp_v[i,j]
end
return findmax(r)[1]
end
function dynam_serial(E::Array{Float64},gm::Int,gz::Int,
beta::Float64,Zprob::Array{Float64})
v = Array{Float64}(gm,gz)
fill!(v,E[cld(gm,2),cld(gm,2),cld(gz,2)])
r = Array{Float64}(gm)
exp_v = Array{Float64}(gm,gz)
# Set parameters for the loop
convcrit = 0.0001 # chosen convergence criterion
diff = 1.0 # arbitrary initial value greater than convcrit
while diff>convcrit
A_mul_Bt!(exp_v,v,Zprob)
diff = -Inf
for h=1:gm
for j=1:gz
oldv = v[h,j]
newv = add_vecs_and_max!(gm,r,h,j,E,exp_v,beta)
v[h,j]= newv
diff = max(diff, oldv-newv, newv-oldv)
end
end
end
return v
end
Switching the functions to use Float64 should increase speed (as CPUs are inherently optimized for 64-bit word lengths). Also, using the mutating A_mul_Bt! directly saves another allocation. Avoiding the copy(...) by switching the arrays v and Tv.
How do these optimizations improve your running time?
2nd UPDATE
Updated the code in the UPDATE section to use findmax. Also, changed dynam_serial to use v without Tv, as there was no need to save the old version except for the diff calculation, which is now done inside the loop.
Here's the code I copied-and-pasted, provided by Dan Getz above. I include the array and scalar definitions exactly as I ran them. Performance was: 39.507005 seconds (11 allocations: 486.891 KiB) when running #time dynam_serial(E,gm,gz,beta,Zprob).
using SpecialFunctions
est_params = [.788,.288,.0034,.1519,.1615,.0041,.0077,.2,0.005,.7196]
r = 0.015
tau = 0.35
rho =est_params[1]
sigma =est_params[2]
delta = 0.15
gamma =est_params[3]
a_capital =est_params[4]
lambda1 =est_params[5]
lambda2 =est_params[6]
s =est_params[7]
theta =est_params[8]
mu =est_params[9]
p_bar_k_ss =est_params[10]
beta = (1+r)^(-1)
sigma_range = 4
gz = 19 #15 #19
gp = 29 #19 #29
gk = 37 #25 #37
lnz=collect(linspace(-sigma_range*sigma,sigma_range*sigma,gz))
z=exp.(lnz)
gk_m = fld(gk,2)
# Need to add mu somewhere to k_ss
k_ss = (theta*(1-tau)/(r+delta))^(1/(1-theta))
k=cat(1,map(i->k_ss*((1-delta)^i),collect(1:gk_m)),map(i->k_ss/((1-delta)^i),collect(1:gk_m)))
insert!(k,gk_m+1,k_ss)
sort!(k)
p_bar=p_bar_k_ss*k_ss
p = collect(linspace(-p_bar/2,p_bar,gp))
#Tauchen
N = length(z)
Z = zeros(N,1)
Zprob = zeros(Float64,N,N)
Z[N] = lnz[length(z)]
Z[1] = lnz[1]
zstep = (Z[N] - Z[1]) / (N - 1)
for i=2:(N-1)
Z[i] = Z[1] + zstep * (i - 1)
end
for a = 1 : N
for b = 1 : N
if b == 1
Zprob[a,b] = 0.5*erfc(-((Z[1] - mu - rho * Z[a] + zstep / 2) / sigma)/sqrt(2))
elseif b == N
Zprob[a,b] = 1 - 0.5*erfc(-((Z[N] - mu - rho * Z[a] - zstep / 2) / sigma)/sqrt(2))
else
Zprob[a,b] = 0.5*erfc(-((Z[b] - mu - rho * Z[a] + zstep / 2) / sigma)/sqrt(2)) -
0.5*erfc(-((Z[b] - mu - rho * Z[a] - zstep / 2) / sigma)/sqrt(2))
end
end
end
# Collecting tauchen results in a 2 element array of linspace and array; [2] gets array
# Zprob=collect(tauchen(gz, rho, sigma, mu, sigma_range))[2]
Zcumprob=zeros(Float64,gz,gz)
# 2 in cumsum! denotes the 2nd dimension, i.e. columns
cumsum!(Zcumprob, Zprob,2)
gm = gk * gp
control=zeros(gm,2)
for i=1:gk
control[(1+gp*(i-1)):(gp*i),1]=fill(k[i],(gp,1))
control[(1+gp*(i-1)):(gp*i),2]=p
end
endog=copy(control)
E=Array(Float64,gm,gm,gz)
for h=1:gm
for m=1:gm
for j=1:gz
# set the nonzero net debt indicator
if endog[h,2]<0
p_ind=1
else
p_ind=0
end
# set the investment indicator
if (control[m,1]-(1-delta)*endog[h,1])!=0
i_ind=1
else
i_ind=0
end
E[m,h,j] = (1-tau)*z[j]*(endog[h,1]^theta) + control[m,2]-endog[h,2]*(1+r*(1-tau)) +
delta*endog[h,1]*tau-(control[m,1]-(1-delta)*endog[h,1]) -
(i_ind*gamma*endog[h,1]+endog[h,1]*(a_capital/2)*(((control[m,1]-(1-delta)*endog[h,1])/endog[h,1])^2)) +
s*endog[h,2]*p_ind
elem = E[m,h,j]
if E[m,h,j]<0
E[m,h,j]=elem+lambda1*elem-.5*lambda2*elem^2
else
E[m,h,j]=elem
end
end
end
end
function add_vecs_and_max!(gm::Int,r::Array{Float64},h::Int,j::Int,E::Array{Float64},
exp_v::Array{Float64},beta::Float64)
maxr = -Inf
#inbounds for i=1:gm r[i] = E[i,h,j]+beta*exp_v[i,j]
maxr = max(r[i],maxr)
end
return maxr
end
function dynam_serial(E::Array{Float64},gm::Int,gz::Int,
beta::Float64,Zprob::Array{Float64})
v = Array{Float64}(gm,gz)
fill!(v,E[cld(gm,2),cld(gm,2),cld(gz,2)])
Tv = Array{Float64}(gm,gz)
r = Array{Float64}(gm)
exp_v = Array{Float64}(gm,gz)
# Set parameters for the loop
convcrit = 0.0001 # chosen convergence criterion
diff = 1.0 # arbitrary initial value greater than convcrit
while diff>convcrit
A_mul_Bt!(exp_v,v,Zprob)
diff = -Inf
for h=1:gm
for j=1:gz
Tv[h,j]=add_vecs_and_max!(gm,r,h,j,E,exp_v,beta)
diff = max(abs(Tv[h,j]-v[h,j]),diff)
end
end
(v,Tv)=(Tv,v)
end
return v
end
Now, here's another version of the algorithm and inputs. The functions are similar to what Dan Getz suggested, except that I use findmax rather than an iterated max function to find the array maximum. In the input construction, I am using both Float32 and mixing different bit-types together. However, I've consistently achieved better performance this way: 24.905569 seconds (1.81 k allocations: 46.829 MiB, 0.01% gc time). But it's not clear at all why.
using SpecialFunctions
est_params = [.788,.288,.0034,.1519,.1615,.0041,.0077,.2,0.005,.7196]
r = 0.015
tau = 0.35
rho =est_params[1]
sigma =est_params[2]
delta = 0.15
gamma =est_params[3]
a_capital =est_params[4]
lambda1 =est_params[5]
lambda2 =est_params[6]
s =est_params[7]
theta =est_params[8]
mu =est_params[9]
p_bar_k_ss =est_params[10]
beta = Float32((1+r)^(-1))
sigma_range = 4
gz = 19
gp = 29
gk = 37
lnz=collect(linspace(-sigma_range*sigma,sigma_range*sigma,gz))
z=exp(lnz)
gk_m = fld(gk,2)
# Need to add mu somewhere to k_ss
k_ss = (theta*(1-tau)/(r+delta))^(1/(1-theta))
k=cat(1,map(i->k_ss*((1-delta)^i),collect(1:gk_m)),map(i->k_ss/((1-delta)^i),collect(1:gk_m)))
insert!(k,gk_m+1,k_ss)
sort!(k)
p_bar=p_bar_k_ss*k_ss
p = collect(linspace(-p_bar/2,p_bar,gp))
#Tauchen
N = length(z)
Z = zeros(N,1)
Zprob = zeros(Float32,N,N)
Z[N] = lnz[length(z)]
Z[1] = lnz[1]
zstep = (Z[N] - Z[1]) / (N - 1)
for i=2:(N-1)
Z[i] = Z[1] + zstep * (i - 1)
end
for a = 1 : N
for b = 1 : N
if b == 1
Zprob[a,b] = 0.5*erfc(-((Z[1] - mu - rho * Z[a] + zstep / 2) / sigma)/sqrt(2))
elseif b == N
Zprob[a,b] = 1 - 0.5*erfc(-((Z[N] - mu - rho * Z[a] - zstep / 2) / sigma)/sqrt(2))
else
Zprob[a,b] = 0.5*erfc(-((Z[b] - mu - rho * Z[a] + zstep / 2) / sigma)/sqrt(2)) -
0.5*erfc(-((Z[b] - mu - rho * Z[a] - zstep / 2) / sigma)/sqrt(2))
end
end
end
# Collecting tauchen results in a 2 element array of linspace and array; [2] gets array
# Zprob=collect(tauchen(gz, rho, sigma, mu, sigma_range))[2]
Zcumprob=zeros(Float32,gz,gz)
# 2 in cumsum! denotes the 2nd dimension, i.e. columns
cumsum!(Zcumprob, Zprob,2)
gm = gk * gp
control=zeros(gm,2)
for i=1:gk
control[(1+gp*(i-1)):(gp*i),1]=fill(k[i],(gp,1))
control[(1+gp*(i-1)):(gp*i),2]=p
end
endog=copy(control)
E=Array(Float32,gm,gm,gz)
for h=1:gm
for m=1:gm
for j=1:gz
# set the nonzero net debt indicator
if endog[h,2]<0
p_ind=1
else
p_ind=0
end
# set the investment indicator
if (control[m,1]-(1-delta)*endog[h,1])!=0
i_ind=1
else
i_ind=0
end
E[m,h,j] = (1-tau)*z[j]*(endog[h,1]^theta) + control[m,2]-endog[h,2]*(1+r*(1-tau)) +
delta*endog[h,1]*tau-(control[m,1]-(1-delta)*endog[h,1]) -
(i_ind*gamma*endog[h,1]+endog[h,1]*(a_capital/2)*(((control[m,1]-(1-delta)*endog[h,1])/endog[h,1])^2)) +
s*endog[h,2]*p_ind
elem = E[m,h,j]
if E[m,h,j]<0
E[m,h,j]=elem+lambda1*elem-.5*lambda2*elem^2
else
E[m,h,j]=elem
end
end
end
end
function add_vecs!(gm::Int,r::Array{Float32},h::Int,j::Int,E::Array{Float32},
exp_v::Array{Float32},beta::Float32)
#inbounds #views for i=1:gm
r[i]=E[i,h,j] + beta*exp_v[i,j]
end
return r
end
function dynam_serial(E::Array{Float32},gm::Int,gz::Int,beta::Float32,Zprob::Array{Float32})
v = Array{Float32}(gm,gz)
fill!(v,E[cld(gm,2),cld(gm,2),cld(gz,2)])
Tv = Array{Float32}(gm,gz)
# Set parameters for the loop
convcrit = 0.0001 # chosen convergence criterion
diff = 1.00000 # arbitrary initial value greater than convcrit
iter=0
exp_v=Array{Float32}(gm,gz)
r=Array{Float32}(gm)
while diff>convcrit
A_mul_Bt!(exp_v,v,Zprob)
for h=1:gm
for j=1:gz
Tv[h,j]=findmax(add_vecs!(gm,r,h,j,E,exp_v,beta))[1]
end
end
diff = maximum(abs,Tv - v)
(v,Tv)=(Tv,v)
end
return v
end

AsymUp function (Rounding) not producing desired results - VB6

I am trying to round up numbers in a legacy application which I had coded in VB6 to obtain the following outcomes:
2.53 should be 2.60
2.55 should be 2.60
2.56 should be 2.60
2.50 should remain 2.50
2.501 should be 2.50
2.505 should be 2.60
I have tried to use the suggested User-Defined Rounding functions by Microsoft
http://support.microsoft.com/kb/196652/en-gb
The closest I arrived to is the Asymup function
Function AsymUp(ByVal X As Double, _
Optional ByVal Factor As Double = 1) As Double
Dim Temp As Double
Temp = Int(X * Factor)
AsymUp = (Temp + IIf(X = Temp, 0, 1)) / Factor
End Function
I am testing this procedure by calling the function as follows:
Text1.Text = AsymUp(Val(Text1.Text), 10)
But this is not producing the desired results because 2.60 for example becomes 2.7 when I want it to remain 2.6. Strangely enough 2.0 also becomes 2.1, implying that the function is not working well.
How can I correct this to acheive the desired results
Private Function AsymUp(ByVal D As Double, Optional Precision As Double = 1) As Double
Precision = CDbl("1" & String$(Precision, 48))
D = D * Precision
If Int(D) <> D Then
AsymUp = (Int(D) + 1) / Precision
Else
AsymUp = D / Precision
End If
End Function
Visual Basic will always automatically truncate trailing zeros from Double and Single data types. Therefore, unless a function is designed to pass the number back as a string, along with the extra zero(s) appended, it will be impossible to retain trailing zeros of a particular precision.
Bonus tip: The IIf function should not be used where performance is of concern, as both the falsepart and truthpart parameters are always evaluated, regardless of the arguments passed to them.
String manipulation method (retains zeros):
Private Function AsymUp(ByVal D As Double, Optional Precision As Double = 1) As String
Dim P As Double, Z As Long
P = CDbl("1" & String$(Precision, 48))
D = D * P
If Int(D) <> D Then
D = (Int(D) + 1) / P
Else
D = D / P
End If
AsymUp = CStr(D)
Z = Precision - (Len(AsymUp) - InStr(AsymUp, "."))
If Z > 0 Then AsymUp = AsymUp & String$(Z, 48)
End Function
This will obviously not be as efficient to run as my other answer, but it's the only way to keep any trailing zeros intact.
Please note that the function will always return the number as a string. You may need to convert the string back to a decimal number to continue using the number with additional mathematics in your program.
I've resolved the issue using the following code:
Private Sub Command1_Click()
roundnumber = Val(Text1.Text)
Text1.Text = Round(roundnumber, 2)
Text1.Text = AsymUp(Val(Text1.Text), 10)
End Sub
Function AsymUp(ByVal X As Double, _
Optional ByVal Factor As Double = 1) As Double
Dim Temp As Double
Temp = Int(X * Factor)
AsymUp = (Temp + IIf((X * Factor) = Temp, 0, 1)) / Factor
End Function
This is what I was after not precision to 10dp as suggested in this post.

Matlab - Speeding up Nested For-Loops

I'm working on a function with three nested for loops that is way too slow for its intended use. The bottleneck is clearly the looping part - almost 100 % of the execution time is spent in the innermost loop.
The function takes a 2d matrix called rM as input and returns a 3d matrix called ec:
rows = size(rM, 1);
cols = size(rM, 2);
%preallocate.
ec = zeros(rows+1, cols, numRiskLevels);
ec(1, :, :) = 100;
for risk = minRisk:stepRisk:maxRisk;
for c = 1:cols,
for r = 2:rows+1,
ec(r, c, risk) = ec(r-1, c, risk) * (1 + risk * rM(r-1, c));
end
end
end
Any help on speeding up the for loops would be appreciated...
The problem is, that the inner loop is slowest, while it is also near-impossible to vectorize. As every iteration directly depends on the previous one.
The outer two are possible:
clc;
rM = rand(50);
rows = size(rM, 1);
cols = size(rM, 2);
minRisk = 1;
stepRisk = 1;
maxRisk = 100;
numRiskLevels = maxRisk/stepRisk;
%preallocate.
ec = zeros(rows+1, cols, numRiskLevels);
ec(1, :, :) = 100;
riskArray = (minRisk:stepRisk:maxRisk)';
tic
for r = 2:rows+1
tmp = riskArray * rM(r-1, :);
tmp = permute(tmp, [3 2 1]);
ec(r, :, :) = ec(r-1, :, :) .* (1 + tmp);
end
toc
%preallocate.
ec2 = zeros(rows+1, cols, numRiskLevels);
ec2(1, :, :) = 100;
tic
for risk = minRisk:stepRisk:maxRisk;
for c = 1:cols
for r = 2:rows+1
ec2(r, c, risk) = ec2(r-1, c, risk) * (1 + risk * rM(r-1, c));
end
end
end
toc
all(all(all(ec == ec2)))
But to my surprise, the vectorized code is indeed slower. (But maybe someone can improve the code, so I figured I leave it her for you.)
I have just tried to vectorize the outer loop, and actually noticed a significant speed increase. Of course it is hard to judge the speed of a script without knowing (the size of) the inputs but I would say this is a good starting point:
% Here you can change the input parameters
riskVec = 1:3:120;
rM = rand(50);
%preallocate and calculate non vectorized solution
ec2 = zeros(size(rM,2)+1, size(rM,1), max(riskVec));
ec2(1, :, :) = 100;
tic
for risk = riskVec
for c = 1:size(rM,2)
for r = 2:size(rM,1)+1
ec2(r, c, risk) = ec2(r-1, c, risk) * (1 + risk * rM(r-1, c));
end
end
end
t1=toc;
%preallocate and calculate vectorized solution
ec = zeros(size(rM,2)+1, size(rM,1), max(riskVec));
ec(1, :, :) = 100;
tic
for c = 1:size(rM,2)
for r = 2:size(rM,1)+1
ec(r, c, riskVec) = ec(r-1, c, riskVec) .* reshape(1 + riskVec * rM(r-1, c),[1 1 length(riskVec)]);
end
end
t2=toc;
% Check whether the vectorization is done correctly and show the timing results
if ec(:) == ec2(:)
t1
t2
end
The given output is:
t1 =
0.1288
t2 =
0.0408
So for this riskVec and rM it is about 3 times as fast as the non-vectorized solution.

How to optimize MATLAB bitwise operations

I have written my own SHA1 implementation in MATLAB, and it gives correct hashes. However, it's very slow (a string a 1000 a's takes 9.9 seconds on my Core i7-2760QM), and I think the slowness is a result of how MATLAB implements bitwise logical operations (bitand, bitor, bitxor, bitcmp) and bitwise shifts (bitshift, bitrol, bitror) of integers.
Especially I wonder the need to construct fixed-point numeric objects for bitrol and bitror using fi command, because anyway in Intel x86 assembly there's rol and ror both for registers and memory addresses of all sizes. However, bitshift is quite fast (it doesn't need any fixed-point numeric costructs, a regular uint64 variable works fine), which makes the situation stranger: why in MATLAB bitrol and bitror need fixed-point numeric objects constructed with fi, whereas bitshift does not, when in assembly level it all comes down to shl, shr, rol and ror?
So, before writing this function in C/C++ as a .mex file, I'd be happy to know if there is any way to improve the performance of this function. I know there are some specific optimizations for SHA1, but that's not the issue, if the very basic implementation of bitwise rotations is so slow.
Testing a little bit with tic and toc, it's evident that what makes it slow are the loops in with bitrol and fi. There are two such loops:
%# Define some variables.
FFFFFFFF = uint64(hex2dec('FFFFFFFF'));
%# constants: K(1), K(2), K(3), K(4).
K(1) = uint64(hex2dec('5A827999'));
K(2) = uint64(hex2dec('6ED9EBA1'));
K(3) = uint64(hex2dec('8F1BBCDC'));
K(4) = uint64(hex2dec('CA62C1D6'));
W = uint64(zeros(1, 80));
... some other code here ...
%# First slow loop begins here.
for index = 17:80
W(index) = uint64(bitrol(fi(bitxor(bitxor(bitxor(W(index-3), W(index-8)), W(index-14)), W(index-16)), 0, 32, 0), 1));
end
%# First slow loop ends here.
H = sha1_handle_block_struct.H;
A = H(1);
B = H(2);
C = H(3);
D = H(4);
E = H(5);
%# Second slow loop begins here.
for index = 1:80
rotatedA = uint64(bitrol(fi(A, 0, 32, 0), 5));
if (index <= 20)
% alternative #1.
xorPart = bitxor(D, (bitand(B, (bitxor(C, D)))));
xorPart = bitand(xorPart, FFFFFFFF);
temp = rotatedA + xorPart + E + W(index) + K(1);
elseif ((index >= 21) && (index <= 40))
% FIPS.
xorPart = bitxor(bitxor(B, C), D);
xorPart = bitand(xorPart, FFFFFFFF);
temp = rotatedA + xorPart + E + W(index) + K(2);
elseif ((index >= 41) && (index <= 60))
% alternative #2.
xorPart = bitor(bitand(B, C), bitand(D, bitxor(B, C)));
xorPart = bitand(xorPart, FFFFFFFF);
temp = rotatedA + xorPart + E + W(index) + K(3);
elseif ((index >= 61) && (index <= 80))
% FIPS.
xorPart = bitxor(bitxor(B, C), D);
xorPart = bitand(xorPart, FFFFFFFF);
temp = rotatedA + xorPart + E + W(index) + K(4);
else
error('error in the code of sha1_handle_block.m!');
end
temp = bitand(temp, FFFFFFFF);
E = D;
D = C;
C = uint64(bitrol(fi(B, 0, 32, 0), 30));
B = A;
A = temp;
end
%# Second slow loop ends here.
Measuring with tic and toc, the entire computation of SHA1 hash of message abc takes on my laptop around 0.63 seconds, of which around 0.23 seconds is passed in the first slow loop and around 0.38 seconds in the second slow loop. So is there some way to optimize those loops in MATLAB before writing a .mex file?
There's this DataHash from the MATLAB File Exchange that calculates SHA-1 hashes lightning fast.
I ran the following code:
x = 'The quick brown fox jumped over the lazy dog'; %# Just a short sentence
y = repmat('a', [1, 1e6]); %# A million a's
opt = struct('Method', 'SHA-1', 'Format', 'HEX', 'Input', 'bin');
tic, x_hashed = DataHash(uint8(x), opt), toc
tic, y_hashed = DataHash(uint8(y), opt), toc
and got the following results:
x_hashed = F6513640F3045E9768B239785625CAA6A2588842
Elapsed time is 0.029250 seconds.
y_hashed = 34AA973CD4C4DAA4F61EEB2BDBAD27316534016F
Elapsed time is 0.020595 seconds.
I verified the results with a random online SHA-1 tool, and the calculation was indeed correct. Also, the 106 a's were hashed ~1.5 times faster than the first sentence.
So how does DataHash do it so fast??? Using the java.security.MessageDigest library, no less!
If you're interested with a fast MATLAB-friendly SHA-1 function, this is the way to go.
However, if this is just an exercise for implementing fast bit-level operations, then MATLAB doesn't really handle them efficiently, and in most cases you'll have to resort to MEX.
why in MATLAB bitrol and bitror need fixed-point numeric objects constructed with fi, whereas bitshift does not
bitrol and bitror are not part of the set of bitwise logic functions that are applicable for uints. They are part of the fixed-point toolbox, which also contains variants of bitand, bitshift etc that apply to fixed-point inputs.
A bitrol could be expressed as two bitshifts, a bitand and a bitor if you want to try using only the uint-functions. That might be even slower though.
As most MATLAB functions, bitand, bitor, bitxor are vectorized. So you get a lot faster if you give these function vector input rather than calling them in a loop over each element
Example:
%# create two sets of 10k random numbers
num = 10000;
hex = '0123456789ABCDEF';
A = uint64(hex2dec( hex(randi(16, [num 16])) ));
B = uint64(hex2dec( hex(randi(16, [num 16])) ));
%# compare loop vs. vectorized call
tic
C1 = zeros(size(A), class(A));
for i=1:numel(A)
C1(i) = bitxor(A(i),B(i));
end
toc
tic
C2 = bitxor(A,B);
toc
assert(isequal(C1,C2))
The timing was:
Elapsed time is 0.139034 seconds.
Elapsed time is 0.000960 seconds.
That's an order of magnitude faster!
The problem is, and as far as I can tell, the SHA-1 computation cannot be well vectorized. So you might not be able to take advantage of such vectorization.
As an experiment, I implemented a pure MATLAB-based funciton to compute such bit operations:
function num = my_bitops(op,A,B)
%# operation to perform: not, and, or, xor
if ischar(op)
op = str2func(op);
end
%# integer class: uint8, uint16, uint32, uint64
clss = class(A);
depth = str2double(clss(5:end));
%# bit exponents
e = 2.^(depth-1:-1:0);
%# convert to binary
b1 = logical(dec2bin(A,depth)-'0');
if nargin == 3
b2 = logical(dec2bin(B,depth)-'0');
end
%# perform binary operation
if nargin < 3
num = op(b1);
else
num = op(b1,b2);
end
%# convert back to integer
num = sum(bsxfun(#times, cast(num,clss), cast(e,clss)), 2, 'native');
end
Unfortunately, this was even worse in terms of performance:
tic, C1 = bitxor(A,B); toc
tic, C2 = my_bitops('xor',A,B); toc
assert(isequal(C1,C2))
The timing was:
Elapsed time is 0.000984 seconds.
Elapsed time is 0.485692 seconds.
Conclusion: write a MEX function or search the File Exchange to see if someone already did :)

Resources