Faster way to compute distributions from Markov chain? - performance

Suppose I have a probability transition matrix, say of dimensions 2000x2000, that represents a homogeneous Markov chain, and I want some statistics of the probability distribution at each of the first 200 steps of the chain (the distribution given by the first row at each step). I've written the following:
using Distributions, LinearAlgebra
# This function defines our transition matrix:
function tm(N::Int, n0::Int)
    [pdf(Hypergeometric(N-l,l,n0),k-l) for l in 0:N, k in 0:N]
end
# This computes the 5-percentile of a probability vector
function percentile5(M::Vector)
    s=0
    i=0
    while s <= 0.05
        i += 1
        s += M[i]
    end
    return i-1
end
# This function computes a matrix with three rows: means, 5-percentiles
# and standard deviations. Each column represents a session.
function stats(N::Int, n0::Int, m::Int)
    A = tm(N,n0)
    B = Matrix{Float64}(I, N+1, N+1) # Initializing B with the identity matrix
    sup = 0:N # The support of each distribution
    sup2 = [k^2 for k in sup]
    stats = zeros(3,m)
    for i in 1:m
        C = B[1,:]
        stats[1,i] = sum(C .* sup) # Mean
        stats[2,i] = percentile5(C) # 5-percentile
        stats[3,i] = sqrt(sum(C .* sup2) - stats[1,i]^2) # Standard deviation
        B = A*B
    end
    return stats
end
data = stats(2000,50,200)
My question is: is there a more efficient (faster) way to do the same computation? I don't see a better way to do it, but maybe there are some tricks that would speed it up.

This is what I have running so far:
using Distributions, LinearAlgebra, SparseArrays
# This function defines our transition matrix:
function tm(N::Int, n0::Int)
    [pdf(Hypergeometric(N-l,l,n0),k-l) for l in 0:N, k in 0:N]
end
# This computes the 5-percentile of a probability vector
function percentile5(M::AbstractVector)
    s = zero(eltype(M))
    res = length(M)
    @inbounds for i = 1:length(M)
        s += M[i]
        if s > 0.05
            res = i - 1
            break
        end
    end
    return res
end
# This function computes a matrix with three rows: means, 5-percentiles
# and standard deviations. Each column represents a session.
function stats(N::Int, n0::Int, m::Int)
    A = sparse(transpose(tm(N, n0)))
    C = zeros(size(A, 1))
    C[1] = 1.0
    sup = 0:N # The support of each distribution
    sup2 = sup .^ 2
    stats = zeros(3, m)
    for i = 1:m
        stats[1, i] = sum(C .* sup) # Mean
        stats[2, i] = percentile5(C) # 5-percentile
        stats[3, i] = sqrt(sum(C .* sup2) - stats[1, i]^2) # Standard deviation
        C = A * C
    end
    return stats
end
It is around 4x faster on smaller parameters, and the speedup is possibly much larger on large parameters. It basically uses the tips I made in the comment:
using sparse arrays.
avoiding the full matrix-matrix multiply and using a matrix-vector multiply instead.
Further improvements are possible (like the simulation/ensemble method I've mentioned).
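To see why propagating a single vector is enough, here is a minimal sketch on a small 3-state chain. The matrix values and the name first_row_distributions are made up for illustration. It checks that repeatedly applying the transposed matrix to the unit vector reproduces the first row of each matrix power, which is the only part of B the statistics need.

```julia
using LinearAlgebra

# Hypothetical 3-state transition matrix; each row sums to 1.
A = [0.5 0.5 0.0;
     0.1 0.6 0.3;
     0.0 0.2 0.8]

# Returns the distributions of steps 0, 1, ..., m-1 when starting in state 1.
# Each step costs one matrix-vector product instead of a matrix-matrix product.
function first_row_distributions(A::AbstractMatrix, m::Int)
    p = zeros(size(A, 1))
    p[1] = 1.0                  # first row of the identity matrix
    At = transpose(A)           # row-vector update p' * A written as A' * p
    out = Vector{Vector{Float64}}()
    for _ in 1:m
        push!(out, copy(p))
        p = At * p
    end
    return out
end

dists = first_row_distributions(A, 5)
```

For an n x n matrix this replaces roughly n^3 operations per step with n^2 (or fewer when the matrix is sparse).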

Related

Exponential fractional values for power function

I am using the Matlab code below to compute the power function power = b^e without using the built-in function.
At the moment I am unable to make it support fractional exponents, such as b^(1/2) = sqrt(b) or 3.4^(1/4), because the approach is inefficient: it loops e times. I need help with an efficient logic for fractional exponents.
Thank you
b = [-32:32]; % example input values
e = [-3:3]; % example input values, but fractions are not supported
function p = power_function(b,e)
    p = 1;
    if e < 0
        e = abs(e);
        multiplier = 1/b;
    else
        multiplier = b;
    end
    for k = 1:e
        p(:) = p * multiplier; % repeated multiplication, so e must be an integer
    end
end
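One possible approach, sketched here in Julia rather than Matlab and only for a positive base with a rational exponent p/q: take the q-th root with Newton's method and raise it to the integer power p by repeated squaring. The names int_pow, nth_root and rational_pow are hypothetical, and negative bases are deliberately rejected.

```julia
# Integer power by repeated squaring: O(log n) multiplications instead of n.
function int_pow(b::Float64, n::Int)
    n < 0 && return 1 / int_pow(b, -n)
    result = 1.0
    while n > 0
        if isodd(n)
            result *= b
        end
        b *= b
        n >>= 1
    end
    return result
end

# q-th root of b > 0 by Newton's method applied to f(x) = x^q - b.
function nth_root(b::Float64, q::Int; tol = 1e-14, maxiter = 500)
    b > 0 || throw(DomainError(b, "this sketch only handles positive bases"))
    x = max(b, 1.0)    # start above the root so the iterates decrease towards it
    for _ in 1:maxiter
        xnew = ((q - 1) * x + b / int_pow(x, q - 1)) / q
        abs(xnew - x) <= tol * xnew && return xnew
        x = xnew
    end
    return x
end

# b^(p/q) for integers p and q > 0, e.g. rational_pow(3.4, 1, 4) for 3.4^(1/4).
rational_pow(b::Float64, p::Int, q::Int) = int_pow(nth_root(b, q), p)
```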

How to improve in Julia the Storkey Learning Performance?

Hi community, I'm new here, and also new to Julia 1.0.3. I'm currently studying Storkey learning rules in different number systems. In my first attempt at coding these ideas I tried this naive code:
function storkey_learning_first(U)
    # The memories are given by the columns
    row,col = size(U)
    # First W matrix
    W_new = zeros(row,row)
    for mu=1:col
        W_old = copy(W_new)
        for i=1:row
            for j=i:row
                s = 0.0
                # Putting this value in the new matrix
                s += U[i,mu]*U[j,mu]
                s -= local_field_opt(W_old,U[:,mu],i,j,row)*U[j,mu]
                s -= local_field_opt(W_old,U[:,mu],j,i,row)*U[i,mu]
                s *= 1/row
                W_new[i,j] += s
                W_new[j,i] = W_new[i,j]
            end
        end
    end
    return W_new
end
which is the main function and the "local field" given by
function local_field_opt(W_old,U,i,j,row)
    hij = 0.0
    for k=1:row
        if k != i && k != j
            hij += W_old[i,k]*U[k]
        end
    end
    return hij
end
Then, given n-dimensional real-valued vectors, the code creates a matrix of dimension (n x n). It works for lower-dimensional vectors, but it is really slow for higher-dimensional arrays. In fact, I want to store vectors of dimension n = 8192. I would also like to work with lower-dimensional complex-valued vectors, or quaternions, but I can't do better even in the real case. In a second attempt I split the complete structure into two functions; in particular, I separated out the two inner loops to avoid accessing the same elements repeatedly:
function inner(U,W_old,W_mu,mu,row)
    # Calling one time the column
    U_mu = U[:,mu]
    for j=1:row
        U_j_mu = U[j,mu]
        for i=j:row
            U_i_mu = U[i,mu]
            s = 0.0
            s += U_i_mu *U_j_mu
            s -= U_i_mu *local_field_opt(W_old,U_mu,j,i,row)
            s -= local_field_opt(W_old,U_mu,i,j,row)*U_j_mu
            s *= 1/row
            W_mu[i,j] += s
            W_mu[j,i] = W_mu[i,j]
        end
    end
    return W_mu
end
With this I gained a few seconds. How can I improve my code in this particular case? And would using complex or quaternion numbers add a considerable additional burden? Finally, this is the timing I currently get for vectors of dimension n=1352:
@time W = RealStorkey.storkey_learning(U,RealStorkey.first)
349.680284 seconds (268 allocations: 1.376 GiB, 0.09% gc time)
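One possible restructuring, offered as a sketch rather than a tuned implementation: the local field for the pair (i, j) is the full product of W_old with the memory vector minus the two excluded terms, so all local fields for one memory can be obtained from a single matrix-vector product. That reduces the work per memory from roughly n^3 to n^2 operations. The name storkey_vectorized is hypothetical, and the code assumes real-valued memories stored as columns, as in the loops above.

```julia
using LinearAlgebra

function storkey_vectorized(U::AbstractMatrix{<:Real})
    n, p = size(U)
    W = zeros(n, n)
    for mu in 1:p
        x = U[:, mu]
        h = W * x                   # full local fields: h[i] = sum over k of W[i,k] * x[k]
        d = diag(W) .* x            # self terms W[i,i] * x[i]
        # H[i,j] = sum over k != i, k != j of W[i,k] * x[k]
        H = (h .- d) .- W .* x'
        for i in 1:n
            H[i, i] = h[i] - d[i]   # on the diagonal, index i is excluded only once
        end
        Hx = H .* x'                # Hx[i,j] = H[i,j] * x[j]
        W = W + (x * x' .- Hx .- Hx') ./ n
    end
    return W
end
```

The update uses the weights from before the current memory on the right-hand side, which mirrors the role of W_old in the loop version.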

Speeding up program in matlab

I have 2 functions:
ccexpan - calculates the coefficients of the interpolating polynomial of a function f with N nodes, in the basis of Chebyshev polynomials of the first kind.
csum - calculates the values at arguments t using the coefficients c from ccexpan (using the Clenshaw algorithm).
This is what I have written so far:
function c = ccexpan(f,N)
    z = zeros (1,N+1);
    s = zeros (1,N+1);
    for i = 1:(N+1)
        z(i) = pi*(i-1)/N;
    end
    t = f(cos(z));
    for k = 1:(N+1)
        s(k) = sum(t.*cos(z.*(k-1)));
        s(k) = s(k)-(f(1)+f(-1)*cos(pi*(k-1)))/2;
    end
    c = s.*2/N;
and:
function y = csum(t,c)
    M = length(t);
    N = length(c);
    y = t;
    b = zeros(1,N+2);
    for k = 1:M
        for i = N:-1:1
            b(i) = c(i)+2*t(k)*b(i+1)-b(i+2);
        end
        y(k)=(b(1)-b(3))/2;
    end
Unfortunately these programs are very slow, and also slightly inaccurate. Please give me some tips on how to speed them up and how to improve accuracy.
Where possible, try to get away from looping structures. At first blush, I would trade out your first for loop of
for i = 1:(N+1)
    z(i) = pi*(i-1)/N;
end
and replace it with
i=1:(N+1)
z = pi*(i-1)/N
I did not check the rest of your code, but the above example will definitely speed up your code. A second strategy is to combine loops when possible.
Martin,
Consider the following strategy.
% create hypothetical N and f
N = 3
f = @(x) 1./(1+15*x.*x)
% calculate z and t
i=1:(N+1)
z = pi*(i-1)/N
t = f(cos(z))
% make a column vector of k's
k = (1:(N+1))'
% do this: s(k) = sum(t.*cos(z.*(k-1)))
s1 = t.*cos(z.*(k-1)) % should be a matrix with one row for each row of k
% via implicit expansion
s2 = sum(s1,2) % row sum, i.e., one value for each row of k
% do this: s(k) = s(k)-(f(1)+f(-1)*cos(pi*(k-1)))/2
s3 = s2 - (f(1)+f(-1)*cos(pi*(k-1)))/2
% calculate c
c = s3 .* 2/N
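The same vectorized strategy can be sketched in Julia together with the Clenshaw evaluation. This sketch assumes the goal is interpolation at the N+1 Chebyshev extrema cos(pi*j/N). For that node set the last coefficient is normally halved as well as the first, which may account for part of the reported inaccuracy; treat that as an assumption about the intent rather than a confirmed diagnosis. The function names mirror the question, but the code itself is illustrative.

```julia
# Coefficients of the interpolant at the Chebyshev extrema cos(pi*j/N), j = 0..N.
function ccexpan(f, N::Int)
    z = pi .* (0:N) ./ N
    t = f.(cos.(z))
    t[1] /= 2
    t[end] /= 2                      # endpoint samples enter the sum with weight 1/2
    k = (0:N)'                       # row vector of degrees
    c = (2 / N) .* vec(sum(t .* cos.(z .* k), dims = 1))
    c[end] /= 2                      # assumption: halve the last coefficient for exact interpolation
    return c
end

# Clenshaw evaluation of c[1]/2 + sum over k >= 1 of c[k+1] * T_k(t).
function csum(t::Real, c::AbstractVector)
    b1 = 0.0
    b2 = 0.0
    for k in length(c):-1:2
        b1, b2 = c[k] + 2t * b1 - b2, b1
    end
    return c[1] / 2 + t * b1 - b2
end

f(x) = 1 / (1 + 15 * x * x)
c = ccexpan(f, 8)
y = csum.(range(-1, 1, length = 5), Ref(c))   # evaluate at several points by broadcasting
```

Keeping only two scalars in the recurrence avoids allocating the work array b for every evaluation point.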

Julia @spawn and pmap() on an embarrassingly parallel problem that requires JuMP and Ipopt

I'd really appreciate some help on parallelizing the following pseudo code in Julia (and I do apologize in advance for the long post):
P, Q # both K by N matrix, K = num features and N = num samples
X, Y # K*4 by N and K*2 by N matrices
tempX, tempY # column vectors of size K*4 and K*2
ndata # a dict from parsing a .m file to be used by a solver with JuMP and Ipopt
# serial version
for i = 1:N
    ndata[P] = P[:, i] # technically requires a for loop from 1 to K since the dict has to be indexed element-wise
    ndata[Q] = Q[:, i]
    ndata_A = run_solver_A(ndata) # with a third-party package and JuMP, Ipopt
    ndata_B = run_solver_B(ndata)
    kX = 1; kY = 1
    for j = 1:K
        tempX[kX:kX+3] = [ndata_A[j][a], ndata_A[j][b], P[j, i], Q[j, i]]
        tempY[kY:kY+1] = [ndata_B[j][a], ndata_B[j][b]]
        kX += 4
        kY += 2
    end
    X[:, i] = deepcopy(tempX)
    Y[:, i] = deepcopy(tempY)
end
So obviously, the iterations of this for loop can be executed independently, as long as no column of P and Q is accessed twice and the same column i of P and Q is used in each iteration. The only thing I need to be careful about is that column i of X and Y are the correct pair of tempX and tempY; I don't care as much about whether the i = 1, ..., N order is maintained (hopefully that makes sense!).
I read both the official documentation and some online tutorials, and wrote the following with @spawn and fetch. It works for the insertion part, with ndata[j][a] etc. replaced by the placeholder numbers 1.0 and 180:
using Distributed
addprocs(2)
num_proc = nprocs()
@everywhere function insertPQ(P, Q)
    println(myid())
    data = zeros(4*length(P))
    k = 1
    for i = 1:length(P)
        data[k:k+3] = [1.0, 180., P[i], Q[i]]
        k += 4
    end
    return data
end
P = [0.99, 0.99, 0.99, 0.99]
Q = [-0.01, -0.01, -0.01, -0.01]
for i = 1:5 # should be 4 x 32
    global P = hcat(P, (P .- 0.01))
    global Q = hcat(Q, (Q .- 0.01))
end
datas = zeros(16, 32) # serial result, filled in place
datap = zeros(16, 0) # parallel result, grown with hcat
@time for i = 1:32
    s = fetch(@spawn insertPQ(P[:, i], Q[:, i]))
    global datap = hcat(datap, s)
end
@time for i = 1:32
    k = 1
    for j = 1:4
        datas[k:k+3, i] = [1.0, 180., P[j, i], Q[j, i]]
        k += 4
    end
end
println(datap == datas)
The above code is fine, but I did notice the output was consistently worker 2->3->4->5->2... and that it was much slower than the serial case (I'm testing this on my laptop with only 4 cores, but eventually I'll run it on a cluster). When I added run_solver_A/B to insertPQ(), it took so long to run that I had to stop it.
As for pmap(), I couldn't figure out how to pass an entire vector to the function. I probably misunderstood the documentation, but "Transform collection c by applying f to each element using available workers and tasks" sounds like I can only do this element-wise? That can't be it. I went to a Julia intro session last week and asked the lecturer about this. He said I should use pmap, and I've been trying to make it work since.
So, how can I parallelize my original pseudo code? Any help or suggestion is greatly appreciated!
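A minimal sketch of the pmap route, assuming the per-column work can be wrapped in one function. The "elements" of the collection passed to pmap can themselves be whole columns, so the function receives entire vectors. Note also that calling fetch directly on each @spawn inside the loop waits for that task before starting the next one, so the loop in the question effectively runs one column at a time. The input data here is hypothetical, and the placeholders 1.0 and 180.0 stand in for solver output as in the question.

```julia
using Distributed
# addprocs(4)   # assumption: uncomment to add workers; with none, pmap runs on the calling process

# Define the work function on every process so that workers can call it.
@everywhere function insertPQ(p, q)
    data = zeros(4 * length(p))
    for j in eachindex(p)
        data[(4j - 3):(4j)] = [1.0, 180.0, p[j], q[j]]
    end
    return data
end

# Hypothetical 4 x 32 inputs.
P = [0.99 - 0.01 * (i - 1) for j in 1:4, i in 1:32]
Q = [-0.01 - 0.01 * (i - 1) for j in 1:4, i in 1:32]

# Each element of these collections is a whole column of P or Q.
Pcols = [P[:, i] for i in axes(P, 2)]
Qcols = [Q[:, i] for i in axes(Q, 2)]

cols = pmap(insertPQ, Pcols, Qcols)   # results are returned in input order
X = reduce(hcat, cols)                # 16 x 32, column i built from column i of P and Q
```

For work as small as this placeholder, communication overhead will likely dominate; the approach is more likely to pay off once each call runs an expensive solver.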

Julia parallel computing for loop

I would like to calculate the summation of elements from a large upper triangular matrix. The regular Julia code is below.
function upsum(M)
    n = size(M)[1]
    sum = 0
    for i = 1:n-1
        for j = i+1:n
            sum = sum + M[i,j]
        end
    end
    return sum
end
R = randn(10000,10000)
upsum(R)
Since the matrix is very large, I would like to know whether there is any way to improve the speed. How can I use parallel computing here?
I would use threads rather than multiple processes in this case. Here is some example code:
using Base.Threads
function upsum_threads(M)
    n = size(M, 1)
    chunks = nthreads()
    sums = zeros(eltype(M), chunks)
    chunkend = round.(Int, n * sqrt.((1:chunks) ./ chunks))
    @assert all(diff(chunkend) .> 0) # also holds when there is a single chunk
    chunkstart = [2; chunkend[1:end-1] .+ 1]
    @threads for job in 1:chunks
        s = zero(eltype(M))
        for i in chunkstart[job]:chunkend[job]
            @simd for j in 1:(i-1)
                @inbounds s += M[j, i]
            end
        end
        sums[job] = s
    end
    return sum(sums)
end
R = randn(10000,10000)
upsum_threads(R)
It should give you a significant speedup (even if you remove @threads it should be much faster).
You choose the number of threads Julia uses by setting the JULIA_NUM_THREADS environment variable.
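A likely reason the single-threaded version is already faster is memory layout: Julia stores matrices column by column, so keeping the row index in the innermost loop reads contiguous memory. A minimal sketch of just that part, without threads (the name upsum_colmajor is hypothetical):

```julia
function upsum_colmajor(M::AbstractMatrix)
    n = size(M, 1)
    s = zero(eltype(M))
    for j in 2:n                    # column index in the outer loop
        @simd for i in 1:(j - 1)    # rows above the diagonal, contiguous in memory
            @inbounds s += M[i, j]
        end
    end
    return s
end

M = reshape(collect(1.0:16.0), 4, 4)
total = upsum_colmajor(M)           # sum of the entries strictly above the diagonal
```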
