R: How to solve Lapack routine dgesv: system is exactly singular in Mahalanobis distance - matrix

I am trying to run an Explanatory Factor Analysis on my questionnaire data.
I have data for 201 participants and 30 questions. The head of my data looks somehow like this (I am showing only the first 5 questions to give an idea of the dataset structure):
Q1 Q2 Q3 Q3 Q4 Q5
1 14 0 20 0 0 0
2 14 14 20 20 20 1
3 20 18 20 20 20 9
4 14 14 20 20 20 0
5 20 18 20 20 20 5
6 20 18 20 20 8 7
I want to find multivariate outliers ,so I am trying to calculate the Mahalanobis distance (cases with Mahalanobis Distance p values bigger than 0.001 are considered outliers).
I am using this code in R-studio (all_data_EFA is my dataset name):
distance <- as.matrix(mahalanobis(all_data_EFA, colMeans(all_data_EFA), cov = cov(all_data_EFA)))
Mah_significant <- all_data_EFA %>%
transmute(row_number = 1:nrow(all_data_EFA),
Mahalanobis_distance = distance,
Mah_p_value = pchisq(distance, df = ncol(all_data_EFA), lower.tail = F)) %>%
filter(Mah_p_value <= 0.001)
However, when I run "distance" I get the following Error:
Error in solve.default(cov, ...) :
Lapack routine dgesv: system is exactly singular: U[26,26] = 0
As far as I understood, this means that the covariance matrix of my data is singular, hence the matrix is not invertible and I cannot calculate Mahalanobis distance.
Is there an alternative way to calculate multivariate outliers or how can I solve this problem?
Many thanks.

Related

Generate an unique identifier such as a hash every N minutes. But they have to be the same in N minutes timeframe without storing data

I want to create a unique identifier such as a small hash every N minutes but the result should be the same in the N timeframe without storing data.
Examples when the N minute is 10:
0 > 10 = 25ba38ac9
10 > 20 = 900605583
20 > 30 = 6156625fb
30 > 40 = e130997e3
40 > 50 = 2225ca027
50 > 60 = 3b446db34
Between minute 1 and 10 i get "25ba38ac9" but with anything between 10 and 20 i get "900605583" etc.
I have no start/example code because i have no idea or algorithm i can use to create the desired result.
I did not provide a specific tag or language in this question because i am interested in the logic and not the final code. But i appreciate documented code as a example.
Pick your favourite hash-function h. Pick your favourite string sugar. To get a hash at time t, append the euclidean quotient of t divided by N to sugar, and apply h to it.
Example in python:
h = lambda x: hex(abs(hash(x)))
sugar = 'Samir'
def hash_for_N_minutes(t, N=10):
return h(sugar + str(t // N))
for t in range(0, 30, 2):
print(t, hash_for_N_minutes(t, 10))
Output:
0 0xeb3d3abb787c890
2 0xeb3d3abb787c890
4 0xeb3d3abb787c890
6 0xeb3d3abb787c890
8 0xeb3d3abb787c890
10 0x45e2d2a970323e9f
12 0x45e2d2a970323e9f
14 0x45e2d2a970323e9f
16 0x45e2d2a970323e9f
18 0x45e2d2a970323e9f
20 0x334dce1d931e5da8
22 0x334dce1d931e5da8
24 0x334dce1d931e5da8
26 0x334dce1d931e5da8
28 0x334dce1d931e5da8
Weakness of this hash method, and suggested improvement
Of course, nothing stops you from inputting a time in the future. So you can easily answer the question "What will the hash be in exactly one hour ?".
If you want future hashes to be unpredictable, you can combine the value t // N with a real-world value that's dependent on the time, not known in advance, but for which we keep records.
There are two well-known time-series that fit this criterion: values related to the meteo, and values related to the stock market.
See also this 2008 xkcd comic: https://xkcd.com/426/

Quickly compute `dot(a(n:end), b(1:end-n))`

Suppose we have two, one dimensional arrays of values a and b which both have length N. I want to create a new array c such that c(n)=dot(a(n:N), b(1:N-n+1)) I can of course do this using a simple loop:
for n=1:N
c(n)=dot(a(n:N), b(1:N-n+1));
end
but given that this is such a simple operation which resembles a convolution I was wondering if there isn't a more efficient method to do this (using Matlab).
A solution using 1D convolution conv:
out = conv(a, flip(b));
c = out(ceil(numel(out)/2):end);
In conv the first vector is multiplied by the reversed version of the second vector so we need to compute the convolution of a and the flipped b and trim the unnecessary part.
This is an interesting problem!
I am going to assume that a and b are column vectors of the same length. Let us consider a simple example:
a = [9;10;2;10;7];
b = [1;3;6;10;10];
% yields:
c = [221;146;74;31;7];
Now let's see what happens when we compute the convolution of these vectors:
>> conv(a,b)
ans =
9
37
86
166
239
201
162
170
70
>> conv2(a, b.')
ans =
9 27 54 90 90
10 30 60 100 100
2 6 12 20 20
10 30 60 100 100
7 21 42 70 70
We notice that c is the sum of elements along the lower diagonals of the result of conv2. To show it clearer we'll transpose to get the diagonals in the same order as values in c:
>> triu(conv2(a.', b))
ans =
9 10 2 10 7
0 30 6 30 21
0 0 12 60 42
0 0 0 100 70
0 0 0 0 70
So now it becomes a question of summing the diagonals of a matrix, which is a more common problem with existing solution, for example this one by Andrei Bobrov:
C = conv2(a.', b);
p = sum( spdiags(C, 0:size(C,2)-1) ).'; % This gives the same result as the loop.

Drawing from a 2-D prior that is only available as samples in pymc2

I'm trying to play around with Bayesian updating, and have a situation in which I am using a posterior from previous runs as a prior. This is a 2D prior on alpha and beta, for which I have traces, alphatrace and betatrace. So I stack them and use code adopted from https://gist.github.com/jcrudy/5911624 to make a KDE based stochastic.
#from https://gist.github.com/jcrudy/5911624
def KernelSmoothing(name, dataset, bw_method=None, observed=False, value=None):
'''Create a pymc node whose distribution comes from a kernel smoothing density estimate.'''
density = gaussian_kde(dataset, bw_method)
def logp(value):
#print "VAL", value
d = density(value)
if d == 0.0:
return float('-inf')
return np.log(d)
def random():
result = None
sample=density.resample(1)
#print sample, sample.shape
result = sample[0][0],sample[1][0]
return result
if value == None:
value = random()
dtype = type(value)
result = pymc.Stochastic(logp = logp,
doc = 'A kernel smoothing density node.',
name = name,
parents = {},
random = random,
trace = True,
value = None,
dtype = dtype,
observed = observed,
cache_depth = 2,
plot = True,
verbose = 0)
return result
Note that the critical thing here is to obtain 2-values from the joint prior: this is why i need a 2-D prior and not two 1-D priors.
The model itself is so:
ctrace=np.vstack((alphatrace, betatrace))
cnew=KernelSmoothing("cnew", ctrace)
#pymc.deterministic
def alphanew(cnew=cnew, name='alphanew'):
return cnew[0]
#pymc.deterministic
def betanew(cnew=cnew, name='betanew'):
return cnew[1]
newtheta=pymc.Beta("newtheta", alphanew, betanew)
newexp = pymc.Binomial('newexp', n=[14], p=[newtheta], value=[4], observed=True)
model3=pymc.Model([cnew, alphanew, betanew, newtheta, newexp])
mcmc3=pymc.MCMC(model3)
mcmc3.sample(20000,5000,5)
In case you are wondering, this is to do the 71st experiment in the hierarchical Rat Tumor example in Chapter 5 in Gelman's BDA. The "prior" I am using is the posterior on alpha and beta after 70 experiments.
But, when I sample, things blow up with the error:
ValueError: Maximum competence reported for stochastic cnew is <= 0... you may need to write a custom step method class.
Its not cnew I care about updating as a stochastic, but rather alphanew and betanew. How ought I be structuring the code to make this error go away?
EDIT: initial model which gave me the posteriors I wish to use as the prior:
tumordata="""0 20
0 20
0 20
0 20
0 20
0 20
0 20
0 19
0 19
0 19
0 19
0 18
0 18
0 17
1 20
1 20
1 20
1 20
1 19
1 19
1 18
1 18
3 27
2 25
2 24
2 23
2 20
2 20
2 20
2 20
2 20
2 20
1 10
5 49
2 19
5 46
2 17
7 49
7 47
3 20
3 20
2 13
9 48
10 50
4 20
4 20
4 20
4 20
4 20
4 20
4 20
10 48
4 19
4 19
4 19
5 22
11 46
12 49
5 20
5 20
6 23
5 19
6 22
6 20
6 20
6 20
16 52
15 46
15 47
9 24
"""
tumortuples=[e.strip().split() for e in tumordata.split("\n")]
tumory=np.array([np.int(e[0].strip()) for e in tumortuples if len(e) > 0])
tumorn=np.array([np.int(e[1].strip()) for e in tumortuples if len(e) > 0])
N = tumorn.shape[0]
mu = pymc.Uniform("mu",0.00001,1., value=0.13)
nu = pymc.Uniform("nu",0.00001,1., value=0.01)
#pymc.deterministic
def alpha(mu=mu, nu=nu, name='alpha'):
return mu/(nu*nu)
#pymc.deterministic
def beta(mu=mu, nu=nu, name='beta'):
return (1.-mu)/(nu*nu)
thetas=pymc.Container([pymc.Beta("theta_%i" % i, alpha, beta) for i in range(N)])
deaths = pymc.Binomial('deaths', n=tumorn, p=thetas, value=tumory, size=N, observed=True)
I use the joint-posterior from this model on alpha, beta as input to the "new model" at top. This also begs the question if I ought to be including theta1..theta70 in the model at top as they will update along with alpha and beta thanks to the new data which is a binomial with n=14, y=4. But I cant even get the little model with only a prior as a 2d sample array working :-(
I found your question since I ran into a similar proble. According to the documentation of pymc.StepMethod.competence, the problem is that none of the built-in samplers handle the dtype associated with the stochastic variable.
I am not sure what needs to be done to actually resolve that. Maybe one of the sampler methods can be extended to handle special types?
Hopefully someone with more pymc mojo can shine a light on what needs to be done..
def competence(s):
"""
This function is used by Sampler to determine which step method class
should be used to handle stochastic variables.
Return value should be a competence
score from 0 to 3, assigned as follows:
0: I can't handle that variable.
1: I can handle that variable, but I'm a generalist and
probably shouldn't be your top choice (Metropolis
and friends fall into this category).
2: I'm designed for this type of situation, but I could be
more specialized.
3: I was made for this situation, let me handle the variable.
In order to be eligible for inclusion in the registry, a sampling
method's init method must work with just a single argument, a
Stochastic object.
If you want to exclude a particular step method from
consideration for handling a variable, do this:
Competence functions MUST be called 'competence' and be decorated by the
'#staticmethod' decorator. Example:
#staticmethod
def competence(s):
if isinstance(s, MyStochasticSubclass):
return 2
else:
return 0
:SeeAlso: pick_best_methods, assign_method
"""

How to calculate Sub matrix of a matrix

I was giving a test for a company called Code Nation and came across this question which asked me to calculate how many times a number k appears in the submatrix M[n][n]. Now there was a example which said Input like this.
5
1 2 3 2 5
36
M[i][j] is to calculated by a[i]*a[j]
which on calculation turn I could calculate.
1,2,3,2,5
2,4,6,4,10
3,6,9,6,15
2,4,6,4,10
5,10,15,10,25
Now I had to calculate how many times 36 appears in sub matrix of M.
The answer was 5.
I am unable to comprehend how to calculate this submatrix. How to represent it?
I had a naïve approach which resulted in many matrices of which I think none are correct.
One of them is Submatrix[i][j]
1 2 3 2 5
3 9 18 24 39
6 18 36 60 99
15 33 69 129 228
33 66 129 258 486
This was formed by adding all the numbers before it 0,0 to i,j
In this 36 did not appear 5 times so i know this is incorrect. If you can back it up with some pseudo code it will be icing on the cake.
Appreciate the help
[Edit] : Referred Following link 1 link 2
My guess is that you have to compute how many submatrices of M have sum equal to 36.
Here is Matlab code:
a=[1,2,3,2,5];
n=length(a);
M=a'*a;
count = 0;
for a0 = 1:n
for b0 = 1:n
for a1 = a0:n
for b1 = b0:n
A = M(a0:a1,b0:b1);
if (sum(A(:))==36)
count = count + 1;
end
end
end
end
end
count
This prints out 5.
So you are correctly computing M, but then you have to consider every submatrix of M, for example, M is
1,2,3,2,5
2,4,6,4,10
3,6,9,6,15
2,4,6,4,10
5,10,15,10,25
so one possible submatrix is
1,2,3
2,4,6
3,6,9
and if you add up all of these, then the sum is equal to 36.
There is an answer on cstheory which gives an O(n^3) algorithm for this.

sorting 2d points into a matrix

I have the following problem:
An image is given and I am doing some blob detection. As a limit, lets say I have a max of 16 blobs and from each blob I calculate the centroid (x,y position).
If no distorion happends, these centroids are arranged in an equidistant 4x4 grid but they could be really much distorted.
The assumption is that they keep more or less the grid form but they could be really much warped.
I need to sort the blobs such that I know which one is the nearest left, right, up and down. So the best would be to write these blobs into a matrix.
If this is not enough, it could happen that I detect less then 16 and then I also need to sort them into a matrix.
Does anyone know how this could be efficiently solved in Matlab?
Thanks.
[update 1:]
I uploaded an image and the red numbers are the numbers which my blob detection algorithm assign each blob.
The resulting matrix should look like this with these numbers:
1 2 4 3
6 5 7 8
9 10 11 12
13 16 14 15
e.g. I start with blob 11 and the nearest right number is 12 and so on
[update 2:]
The posted solution looks quite nice. In reality it could happen, that one of the outer spots is missing or maybe two ... I know that this makes everything much more complicated and I just want to get a feeling if this is worth spending time.
These problems arise if you analyze a wavefront with a shack-hartmann wavefront sensor and you want to increase the dynamic range :-)
The spots could be really warped such that the dividing lines are not orthogonal any more.
Maybe someone knows a good literature for classification algorithms.
Best solution would be one, which could be implemented on a FPGA without to much effort but this is at this stage not so much important.
This will work as long as the blobs form a square and are relatively ordered:
Image:
Code:
bw = imread('blob.jpg');
bw = im2bw(bw);
rp = regionprops(bw,'Centroid');
% Must be a square
side = sqrt(length(rp));
centroids = vertcat(rp.Centroid);
centroid_labels = cellstr(num2str([1:length(rp)]'));
figure(1);
imshow(bw);
hold on;
text(centroids(:,1),centroids(:,2),centroid_labels,'Color','r','FontSize',60);
hold off;
% Find topleft element - minimum distance from origin
[~,topleft_idx] = min(sqrt(centroids(:,1).^2+centroids(:,2).^2));
% Find bottomright element - maximum distance from origin
[~,bottomright_idx] = max(sqrt(centroids(:,1).^2+centroids(:,2).^2));
% Find bottom left element - maximum normal distance from line formed by
% topleft and bottom right blob
A = centroids(bottomright_idx,2)-centroids(topleft_idx,2);
B = centroids(topleft_idx,1)-centroids(bottomright_idx,1);
C = -B*centroids(topleft_idx,2)-A*centroids(topleft_idx,1);
[~,bottomleft_idx] = max(abs(A*centroids(:,1)+B*centroids(:,2)+C)/sqrt(A^2+B^2));
% Sort blobs based on distance from line formed by topleft and bottomleft
% blob
A = centroids(bottomleft_idx,2)-centroids(topleft_idx,2);
B = centroids(topleft_idx,1)-centroids(bottomleft_idx,1);
C = -B*centroids(topleft_idx,2)-A*centroids(topleft_idx,1);
[~,leftsort_idx] = sort(abs(A*centroids(:,1)+B*centroids(:,2)+C)/sqrt(A^2+B^2));
% Reorder centroids and redetermine bottomright_idx and bottomleft_idx
centroids = centroids(leftsort_idx,:);
bottomright_idx = find(leftsort_idx == bottomright_idx);
bottomleft_idx = find(leftsort_idx == bottomleft_idx);
% Sort blobs based on distance from line formed by bottomleft and
% bottomright blob
A = centroids(bottomright_idx,2)-centroids(bottomleft_idx,2);
B = centroids(bottomleft_idx,1)-centroids(bottomright_idx,1);
C = -B*centroids(bottomleft_idx,2)-A*centroids(bottomleft_idx,1);
[~,bottomsort_idx] = sort(abs(A*reshape(centroids(:,1),side,side)+B*reshape(centroids(:,2),side,side)+C)/sqrt(A^2+B^2),'descend');
disp(leftsort_idx(bsxfun(#plus,bottomsort_idx,0:side:side^2-1)));
Output:
2 12 13 20 25 31
4 11 15 19 26 32
1 7 14 21 27 33
3 8 16 22 28 34
6 9 17 24 29 35
5 10 18 23 30 36
Just curious, are you using this to automate camera calibration through a checkerboard or something?
UPDATE:
For skewed image
tform = maketform('affine',[1 0 0; .5 1 0; 0 0 1]);
bw = imtransform(bw,tform);
Output:
1 4 8 16 21 25
2 5 10 18 23 26
3 6 13 19 27 29
7 9 17 24 30 32
11 14 20 28 33 35
12 15 22 31 34 36
For rotated image:
bw = imrotate(bw,20);
Output:
1 4 10 17 22 25
2 5 12 18 24 28
3 6 14 21 26 31
7 9 16 23 30 32
8 13 19 27 33 35
11 15 20 29 34 36

Resources