Markov Chain does not converge - chain

i am trying to model car urban mobility by using Markov chains. I am trying to figure out why my Markov Chain model will not converge to a steady state distribution, assuming that it follows the assumptions of aperiodocity, ergodicity etc. I tried to implement it through python and matrix multiplication using numpy. Below you can see the initial state matrix (A0) and the transition matrix (T). I would appreciate any help.
Edit:
Below there is a sample of my python code
#Transition matrix
T = np.array([[0.772,0.044,0.001,0.026,0.026,0.001],
[0.114,0.760,0.183,0.026,0.022,0.007],
[0.001,0.112,0.447,0.055,0.056,0.008],
[0.057,0.028,0.157,0.473,0.022,0.001],
[0.055,0.042,0.184,0.394,0.556,0.072],
[0.001,0.014,0.026,0.028,0.318,0.911]]).reshape(6,6)
#Initial state matrix
A1 = np.array([0.086,0.180,0.086,0.078,0.219,0.350]).reshape(6,1)
forw = A1
fin = []
i=0
while i<10000:
foo = np.dot(T,forw)
fin.append(foo)
forw = foo
i+=1

Related

How to speed up the solving of multiple optimization problems?

Currently, I'm writing a simulation that asses the performance of a positioning algorithm by measuring the mean error of the position estimator for different points around the room. Unfortunately the running times are pretty slow and so I am looking for ways to speed up my code.
The working principle of the position estimator is based on the MUSIC algorithm. The estimator gets an autocorrelation matrix (sized 12x12, with complex values in general) as an input and follows the next steps:
Find the 12 eigenvalues and eigenvectors of the autocorrelation matrix R.
Construct a new 12x11 matrix EN whose columns are the 11 eigenvectors corresponding to the 11 smallest eigenvalues.
Using the matrix EN, construct a function P = 1/(a' EN EN' a).
Where a is a 12x1 complex vector and a' is the Hermitian conjugate of a. The components of a are functions of 3 variables (named x,y and z) and so the scalar P is also a function P(x,y,z)
Finally, find the values (x0,y0,z0) which maximizes the value of P and return it as the position estimate.
In my code, I choose some constant z and create a grid on points in the plane (at heigh z, parallel to the xy plane). For each point I make n4Avg repetitions and calculate the error of the estimated point. At the end of the parfor loop (and some reshaping), I have a matrix of errors with dims (nx) x (ny) x (n4Avg) and the mean error is calculated by taking the mean of the error matrix (acting on the 3rd dimension).
nx=30 is the number of point along the x axis.
ny=15 is the number of points along the y axis.
n4Avg=100 is the number of repetitions used for calculating the mean error at each point.
nGen=100 is the number of generations in the GA algorithm (100 was tested to be good enough).
x = linspace(-20,20,nx);
y = linspace(0,20,ny);
z = 5;
[X,Y] = meshgrid(x,y);
parfor ri = 1:nx*ny
rT = [X(ri);Y(ri);z];
[ENs] = getEnNs(rT,stdv,n4R,n4Avg); % create n4Avg EN matrices
for rep = 1:n4Avg
pos_est = estPos_helper(squeeze(ENs(:,:,rep)),nGen);
posEstErr(ri,rep) = vecnorm(pos_est(:)-rT(:));
end
end
The matrices EN are generated by the following code
function [ENs] = getEnNs(rT,stdv,n4R,nEN)
% generate nEN simulated EN matrices, each using n4R simulated phases
f_c = 2402e6; % center frequency [Hz]
c0 = 299702547; % speed of light [m/s]
load antennaeArr1.mat antennaeArr1;
% generate initial phases.
phi0 = 2*pi*rand(n4R*nEN,1);
k0 = 2*pi.*(f_c)./c0;
I = cos(-k0.*vecnorm(antennaeArr1 - rT(:),2,1)-phi0);
Q = -sin(-k0.*vecnorm(antennaeArr1 - rT(:),2,1)-phi0);
phases = I+1i*Q;
phases = phases + stdv/sqrt(2)*(randn(size(phases)) + 1i*randn(size(phases)));
phases = reshape(phases',[12,n4R,nEN]);
Rxx = pagemtimes(phases,pagectranspose(phases));
ENs = zeros(12,11,nEN);
for i=1:nEN
[ENs(:,:,i),~] = eigs(squeeze(Rxx(:,:,i)),11,'smallestabs');
end
end
The position estimator uses a solver utilizing a 'genetic algorithm' (chosen because it preformed the best of all the other solvers).
function pos_est = estPos_helper(EN,nGen)
load antennaeArr1.mat antennaeArr1; % 3x12 constant matrix
antennae_array = antennaeArr1;
x0 = [0;10;5];
lb = [-20;0;0];
ub = [20;20;10];
function y = myfun(x)
k0 = 2*pi*2.402e9/299702547;
a = exp( -1i*k0*sqrt( (x(1)-antennae_array(1,:)').^2 + (x(2) - antennae_array(2,:)').^2 + (x(3)-antennae_array(3,:)').^2 ) );
y = 1/real((a')*(EN)*(EN')*a);
end
% Create optimization variables
x3 = optimvar("x",3,1,"LowerBound",lb,"UpperBound",ub);
% Set initial starting point for the solver
initialPoint2.x = x0;
% Create problem
problem = optimproblem("ObjectiveSense","Maximize");
% Define problem objective
problem.Objective = fcn2optimexpr(#myfun,x3);
% Set nondefault solver options
options2 = optimoptions("ga","Display","off","HybridFcn","fmincon",...
"MaxGenerations",nGen);
% Solve problem
solution = solve(problem,initialPoint2,"Solver","ga","Options",options2);
% Clear variables
clearvars x3 initialPoint2 options2
pos_est = solution.x;
end
The current runtime of the code, when setting the parameters as shown above, is around 700-800 seconds. This is a problem as I would like to increase the number of points in the grid and the number of repetitions to get a more accurate result.
The main ways I've tried to tackle this is by using parallel computing (in the form of the parloop) and by reducing the nested loops I had (one for x and one for y) into a single vectorized loop going over all the points in the grid.
It indeed helped, but not quite enough.
I apologize for the messy code.
Michael.

Hermitian matrix of h-chain using np.diag

I hope you are well.
Greatly appreciate any help provided.
I am attempting to create a NxN hermitian matrix which takes the molecular chain of H atoms using a function and numpy diagonal.
def H_chain(N, E, A):
y = np.ones(N,N)
x = np.arange(E).reshape((N,N))
z = np.arange(A).reshape((N,N))
g = np.diag(np.diag(y)*x)+np.diag(np.diag(y, k=-1)*-z+np.diag(np.diag(y,k=1)*-z))
return g
#The function should return something that looks like this
#[E,-A,0,0,0],
#[-A,E,-A,0,0]
#[0,-A,E,-A,0]
#[0,0,-A,E,-A]
#[0,0,0,-A,E]
Returns an N x N matrix representing the Hamiltonian of an N-atom chain molecule with
site-energies E and inter-site coupling rate A.
Ideally I should be summing three matrices together using the numpy diagonal function and the numpy ones to fill the diagonals aswell.
Much appreciate any help.
Thank you.
Kind regards,

Using RMSE function with 5-fold cross validation to choose the best model out of 3

I have defined three different models obtained from the dataset diabetes from the library lars. The first model (M1) is the one that minimizes the BIC value out of all the possible regression models obtained combining the explanatory variables (which are p=10, so 2^10 possible models). The other two are obtained through glmnet and are a Lasso regression with respectively lambda.min (M2) and lambda.1se (M3), where lambda.min and lambda.1se are obtained through cv.glmnet. Now I should perform 5-fold cross-validation using the RMSE (Root Mean Square Error) function to check which of the tree models Μ1, Μ2 and Μ3, has the best predictive performance. In order to find the errors in the models obtained from Lasso I have to use the ordinal least squares estimates.
This is my code as for now:
library(lars)
library(glmnet)
data(diabetes)
y<-diabetes$y
x<-diabetes$x
x2<-diabetes$x2
X = as.data.frame(cbind(x))
Y = as.data.frame(y)
p=10
n=442
best_score = Inf
M1 = NA
for (i in 1:(2^p-1)){
model = lm(y ~ ., data = subset(X, select = c(which(as.integer(intToBits(i)) == 1))))
if (BIC(model) < best_score){
M1 = model
best_score = BIC ( model )
}
}
W<-as.matrix(X)
Y<-as.matrix(Y)
lasso<-glmnet(W, Y)
x11()
plot(lasso, label=T)
x11()
plot(lasso, xvar = 'lambda', label=T)
lasso$df
lasso$lambda
cvfit<-cv.glmnet(W,Y)
cvfit
coef(cvfit, s="lambda.min")
coef(cvfit, s="lambda.1se")
M2<-glmnet(W,Y,lambda = cvfit$lambda.min)
M3<-glmnet(W,Y,lambda = cvfit$lambda.1se)
I really don't know where to put hands now. Should I first of all split the original dataset in 5 and then compute again the models on the different train and test set? And how do I compute the final RMSE for each model? And what does it mean that I should use ordinal least square estimates for the models obtained through Lasso?

How to simulate a disturbance in scilab program (not xcos)

how are you?.
I need to simulate a disturbance in a control system using scilab, that is, the csim function is used to simulate the response of a system by using a step, impulse, ramp or any other input, but, I need to input a disturbance for example in t = 0.5s to see the system behavior.
That drags another problem to me because I don't know how to make csim or syslin to acknowledge two different inputs, or its as simple as defining two systems, one with the referent input and other with the disturbance entrance and sum both?.
Thanks in advance for the help.
Suppose you have the following linear time invariant system (A,B,C)
x'=A*x+B1*v+B2*d
y=C*x
with B=[B1,B2], where v is the control/input and d the disturbance. If you want e.g. simulate a step response and a disturbance you have to define your own overall input [v;d] and decide when to apply the disturbance. Here is an example:
function ud = step(t)
ud = [1;0];
endfunction
function ud = input(t)
ud = zeros(t);
ud(1,:) = 1;
s = abs(t-0.5);
// 0.2 is the half-width of disturbance
ud(2,:) = -0.1*(1-s/0.2).*(s<0.2)
endfunction
A = [-2 1;
1 -2];
B1 = [1;0];
B2 = [0;4];
C = [0 1];
sl = syslin('c',A,[B1 B2],C);
t = linspace(0,5,1000);
x = csim(step,t,sl)
xd = csim(input,t,sl)
clf
plot(t,x,t,xd,t,input(t)(2,:))
legend('step','step and disturbance','disturbance',2)
I made here two csim calls, one for the usual step response and the second one for the perturbed step response. However, I warn you about the ode solver used by csim: discontinuous inputs can be easily missed, that's why I applied here a hat-shaped disturbance. The code of the perturbed input is designed to allow vector time input, in order to easily plot the perturbation.

Parallelising gradient calculation in Julia

I was persuaded some time ago to drop my comfortable matlab programming and start programming in Julia. I have been working for a long with neural networks and I thought that, now with Julia, I could get things done faster by parallelising the calculation of the gradient.
The gradient need not be calculated on the entire dataset in one go; instead one can split the calculation. For instance, by splitting the dataset in parts, we can calculate a partial gradient on each part. The total gradient is then calculated by adding up the partial gradients.
Though, the principle is simple, when I parallelise with Julia I get a performance degradation, i.e. one process is faster then two processes! I am obviously doing something wrong... I have consulted other questions asked in the forum but I could still not piece together an answer. I think my problem lies in that there is a lot of unnecessary data moving going on, but I can't fix it properly.
In order to avoid posting messy neural network code, I am posting below a simpler example that replicates my problem in the setting of linear regression.
The code-block below creates some data for a linear regression problem. The code explains the constants, but X is the matrix containing the data inputs. We randomly create a weight vector w which when multiplied with X creates some targets Y.
######################################
## CREATE LINEAR REGRESSION PROBLEM ##
######################################
# This code implements a simple linear regression problem
MAXITER = 100 # number of iterations for simple gradient descent
N = 10000 # number of data items
D = 50 # dimension of data items
X = randn(N, D) # create random matrix of data, data items appear row-wise
Wtrue = randn(D,1) # create arbitrary weight matrix to generate targets
Y = X*Wtrue # generate targets
The next code-block below defines functions for measuring the fitness of our regression (i.e. the negative log-likelihood) and the gradient of the weight vector w:
####################################
## DEFINE FUNCTIONS ##
####################################
#everywhere begin
#-------------------------------------------------------------------
function negative_loglikelihood(Y,X,W)
#-------------------------------------------------------------------
# number of data items
N = size(X,1)
# accumulate here log-likelihood
ll = 0
for nn=1:N
ll = ll - 0.5*sum((Y[nn,:] - X[nn,:]*W).^2)
end
return ll
end
#-------------------------------------------------------------------
function negative_loglikelihood_grad(Y,X,W, first_index,last_index)
#-------------------------------------------------------------------
# number of data items
N = size(X,1)
# accumulate here gradient contributions by each data item
grad = zeros(similar(W))
for nn=first_index:last_index
grad = grad + X[nn,:]' * (Y[nn,:] - X[nn,:]*W)
end
return grad
end
end
Note that the above functions are on purpose not vectorised! I choose not to vectorise, as the final code (the neural network case) will also not admit any vectorisation (let us not get into more details regarding this).
Finally, the code-block below shows a very simple gradient descent that tries to recover the parameter weight vector w from the given data Y and X:
####################################
## SOLVE LINEAR REGRESSION ##
####################################
# start from random initial solution
W = randn(D,1)
# learning rate, set here to some arbitrary small constant
eta = 0.000001
# the following for-loop implements simple gradient descent
for iter=1:MAXITER
# get gradient
ref_array = Array(RemoteRef, nworkers())
# let each worker process part of matrix X
for index=1:length(workers())
# first index of subset of X that worker should work on
first_index = (index-1)*int(ceil(N/nworkers())) + 1
# last index of subset of X that worker should work on
last_index = min((index)*(int(ceil(N/nworkers()))), N)
ref_array[index] = #spawn negative_loglikelihood_grad(Y,X,W, first_index,last_index)
end
# gather the gradients calculated on parts of matrix X
grad = zeros(similar(W))
for index=1:length(workers())
grad = grad + fetch(ref_array[index])
end
# now that we have the gradient we can update parameters W
W = W + eta*grad;
# report progress, monitor optimisation
#printf("Iter %d neg_loglikel=%.4f\n",iter, negative_loglikelihood(Y,X,W))
end
As is hopefully visible, I tried to parallelise the calculation of the gradient in the easiest possible way here. My strategy is to break the calculation of the gradient in as many parts as available workers. Each worker is required to work only on part of matrix X, which part is specified by first_index and last_index. Hence, each worker should work with X[first_index:last_index,:]. For instance, for 4 workers and N = 10000, the work should be divided as follows:
worker 1 => first_index = 1, last_index = 2500
worker 2 => first_index = 2501, last_index = 5000
worker 3 => first_index = 5001, last_index = 7500
worker 4 => first_index = 7501, last_index = 10000
Unfortunately, this entire code works faster if I have only one worker. If add more workers via addprocs(), the code runs slower. One can aggravate this issue by create more data items, for instance use instead N=20000.
With more data items, the degradation is even more pronounced.
In my particular computing environment with N=20000 and one core, the code runs in ~9 secs. With N=20000 and 4 cores it takes ~18 secs!
I tried many many different things inspired by the questions and answers in this forum but unfortunately to no avail. I realise that the parallelisation is naive and that data movement must be the problem, but I have no idea how to do it properly. It seems that the documentation is also a bit scarce on this issue (as is the nice book by Ivo Balbaert).
I would appreciate your help as I have been stuck for quite some while with this and I really need it for my work. For anyone wanting to run the code, to save you the trouble of copying-pasting you can get the code here.
Thanks for taking the time to read this very lengthy question! Help me turn this into a model answer that anyone new in Julia can then consult!
I would say that GD is not a good candidate for parallelizing it using any of the proposed methods: either SharedArray or DistributedArray, or own implementation of distribution of chunks of data.
The problem does not lay in Julia, but in the GD algorithm.
Consider the code:
Main process:
for iter = 1:iterations #iterations: "the more the better"
δ = _gradient_descent_shared(X, y, θ)
θ = θ - α * (δ/N)
end
The problem is in the above for-loop which is a must. No matter how good _gradient_descent_shared is, the total number of iterations kills the noble concept of the parallelization.
After reading the question and the above suggestion I've started implementing GD using SharedArray. Please note, I'm not an expert in the field of SharedArrays.
The main process parts (simple implementation without regularization):
run_gradient_descent(X::SharedArray, y::SharedArray, θ::SharedArray, α, iterations) = begin
N = length(y)
for iter = 1:iterations
δ = _gradient_descent_shared(X, y, θ)
θ = θ - α * (δ/N)
end
θ
end
_gradient_descent_shared(X::SharedArray, y::SharedArray, θ::SharedArray, op=(+)) = begin
if size(X,1) <= length(procs(X))
return _gradient_descent_serial(X, y, θ)
else
rrefs = map(p -> (#spawnat p _gradient_descent_serial(X, y, θ)), procs(X))
return mapreduce(r -> fetch(r), op, rrefs)
end
end
The code common to all workers:
#= Returns the range of indices of a chunk for every worker on which it can work.
The function splits data examples (N rows into chunks),
not the parts of the particular example (features dimensionality remains intact).=#
#everywhere function _worker_range(S::SharedArray)
idx = indexpids(S)
if idx == 0
return 1:size(S,1), 1:size(S,2)
end
nchunks = length(procs(S))
splits = [round(Int, s) for s in linspace(0,size(S,1),nchunks+1)]
splits[idx]+1:splits[idx+1], 1:size(S,2)
end
#Computations on the chunk of the all data.
#everywhere _gradient_descent_serial(X::SharedArray, y::SharedArray, θ::SharedArray) = begin
prange = _worker_range(X)
pX = sdata(X[prange[1], prange[2]])
py = sdata(y[prange[1],:])
tempδ = pX' * (pX * sdata(θ) .- py)
end
The data loading and training. Let me assume that we have:
features in X::Array of the size (N,D), where N - number of examples, D-dimensionality of the features
labels in y::Array of the size (N,1)
The main code might look like this:
X=[ones(size(X,1)) X] #adding the artificial coordinate
N, D = size(X)
MAXITER = 500
α = 0.01
initialθ = SharedArray(Float64, (D,1))
sX = convert(SharedArray, X)
sy = convert(SharedArray, y)
X = nothing
y = nothing
gc()
finalθ = run_gradient_descent(sX, sy, initialθ, α, MAXITER);
After implementing this and run (on 8-cores of my Intell Clore i7) I got a very slight acceleration over serial GD (1-core) on my training multiclass (19 classes) training data (715 sec for serial GD / 665 sec for shared GD).
If my implementation is correct (please check this out - I'm counting on that) then parallelization of the GD algorithm is not worth of that. Definitely you might get better acceleration using stochastic GD on 1-core.
If you want to reduce the amount of data movement, you should strongly consider using SharedArrays. You could preallocate just one output vector, and pass it as an argument to each worker. Each worker sets a chunk of it, just as you suggested.

Resources