I am trying to run an Exploratory Factor Analysis on my questionnaire data.
I have data for 201 participants and 30 questions. The head of my data looks something like this (I am showing only the first few questions to give an idea of the dataset structure):
Q1 Q2 Q3 Q3 Q4 Q5
1 14 0 20 0 0 0
2 14 14 20 20 20 1
3 20 18 20 20 20 9
4 14 14 20 20 20 0
5 20 18 20 20 20 5
6 20 18 20 20 8 7
I want to find multivariate outliers, so I am trying to calculate the Mahalanobis distance (cases whose Mahalanobis distance has a p-value of 0.001 or less are considered outliers).
I am using this code in RStudio (all_data_EFA is my dataset name):
distance <- as.matrix(mahalanobis(all_data_EFA, colMeans(all_data_EFA), cov = cov(all_data_EFA)))
Mah_significant <- all_data_EFA %>%
  transmute(row_number = 1:nrow(all_data_EFA),
            Mahalanobis_distance = distance,
            Mah_p_value = pchisq(distance, df = ncol(all_data_EFA), lower.tail = FALSE)) %>%
  filter(Mah_p_value <= 0.001)
However, when I run the line that computes distance, I get the following error:
Error in solve.default(cov, ...) :
Lapack routine dgesv: system is exactly singular: U[26,26] = 0
As far as I understand, this means that the covariance matrix of my data is singular, so it cannot be inverted and the Mahalanobis distance cannot be calculated.
Is there an alternative way to detect multivariate outliers, or a way to work around the singular covariance matrix?
Many thanks.
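One way to find out which questions make the covariance matrix singular is to look for columns that are exact linear combinations of earlier ones (a constant column is the simplest case). The following is only a sketch in Python rather than R, assuming the data fits in memory as a list of rows; dependent_columns is a made-up helper name, not a library function. It centers each column and runs Gaussian elimination with exact fractions, reporting every column that ends up without a pivot.

```python
from fractions import Fraction


def dependent_columns(rows):
    """Return 0-based indices of columns that, after centering, are exact
    linear combinations of the columns to their left (hypothetical helper)."""
    n_rows, n_cols = len(rows), len(rows[0])
    means = [Fraction(sum(r[j] for r in rows), n_rows) for j in range(n_cols)]
    # Center every column; the covariance matrix is singular exactly when
    # this centered matrix has rank smaller than the number of columns.
    m = [[Fraction(r[j]) - means[j] for j in range(n_cols)] for r in rows]

    pivot_row = 0
    dependent = []
    for col in range(n_cols):
        # Look for a non-zero entry at or below the current pivot row.
        pivot = next((i for i in range(pivot_row, n_rows) if m[i][col] != 0), None)
        if pivot is None:
            dependent.append(col)  # no pivot: column depends on earlier ones
            continue
        m[pivot_row], m[pivot] = m[pivot], m[pivot_row]
        for i in range(pivot_row + 1, n_rows):
            factor = m[i][col] / m[pivot_row][col]
            if factor:
                m[i] = [x - factor * y for x, y in zip(m[i], m[pivot_row])]
        pivot_row += 1
    return dependent


# Toy data (not the questionnaire): the second column is twice the first.
toy = [[1, 2, 3], [2, 4, 5], [3, 6, 9], [4, 8, 1]]
print(dependent_columns(toy))  # [1]
```

Whether the reported columns should be dropped or merged before recomputing the distances depends on the questionnaire; the sketch only locates them.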
Suppose we have two one-dimensional arrays of values, a and b, both of length N. I want to create a new array c such that c(n)=dot(a(n:N), b(1:N-n+1)). I can of course do this using a simple loop:
for n=1:N
    c(n)=dot(a(n:N), b(1:N-n+1));
end
but given that this is such a simple operation, and one that resembles a convolution, I was wondering whether there is a more efficient way to do it in Matlab.
A solution using 1D convolution conv:
out = conv(a, flip(b));
c = out(ceil(numel(out)/2):end);
conv multiplies the first vector by a reversed copy of the second, so we convolve a with the flipped b to cancel that reversal and then keep only the last N entries of the result.
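To check the indexing, here is a small pure-Python sketch that compares the loop definition of c with the flipped-convolution route. The conv function below is a hand-written full convolution standing in for MATLAB's conv, and the function names and example vectors are chosen for illustration only.

```python
def conv(x, y):
    """Full 1-D convolution of two lists (length len(x) + len(y) - 1)."""
    out = [0] * (len(x) + len(y) - 1)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            out[i + j] += xi * yj
    return out


def tail_dots_loop(a, b):
    """c[k] = dot(a[k:], b[:N-k]), the loop from the question (0-based)."""
    n = len(a)
    return [sum(a[k + i] * b[i] for i in range(n - k)) for k in range(n)]


def tail_dots_conv(a, b):
    """Same values via convolution with the reversed b, keeping the last N entries."""
    return conv(a, b[::-1])[len(a) - 1:]


a = [9, 10, 2, 10, 7]
b = [1, 3, 6, 10, 10]
print(tail_dots_loop(a, b))  # [221, 146, 74, 31, 7]
print(tail_dots_conv(a, b))  # [221, 146, 74, 31, 7]
```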
This is an interesting problem!
I am going to assume that a and b are column vectors of the same length. Let us consider a simple example:
a = [9;10;2;10;7];
b = [1;3;6;10;10];
% yields:
c = [221;146;74;31;7];
Now let's see what happens when we compute the convolution of these vectors:
>> conv(a,b)
ans =
9
37
86
166
239
201
162
170
70
>> conv2(a, b.')
ans =
9 27 54 90 90
10 30 60 100 100
2 6 12 20 20
10 30 60 100 100
7 21 42 70 70
We notice that c is the sum of the elements along the lower diagonals of the result of conv2. To show this more clearly, we transpose the product so that the diagonals appear in the same order as the values in c:
>> triu(conv2(a.', b))
ans =
9 10 2 10 7
0 30 6 30 21
0 0 12 60 42
0 0 0 100 70
0 0 0 0 70
So now it becomes a question of summing the diagonals of a matrix, which is a more common problem with existing solutions, for example this one by Andrei Bobrov:
C = conv2(a.', b);
p = sum( spdiags(C, 0:size(C,2)-1) ).'; % This gives the same result as the loop.
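The diagonal-sum view can be sketched in plain Python as well. This is an illustration under the assumption that a and b are equal-length lists; diagonal_sums is a made-up name, and the nested list plays the role of the matrix that conv2(a.', b) builds.

```python
def diagonal_sums(a, b):
    """Sum the main and upper diagonals of the outer product b * a'."""
    n = len(a)
    # outer[i][j] = b[i] * a[j], matching conv2(a.', b) for a row a.' and a column b
    outer = [[bi * aj for aj in a] for bi in b]
    # The k-th upper diagonal holds the products b[i] * a[i + k].
    return [sum(outer[i][i + k] for i in range(n - k)) for k in range(n)]


a = [9, 10, 2, 10, 7]
b = [1, 3, 6, 10, 10]
print(diagonal_sums(a, b))  # [221, 146, 74, 31, 7]
```

Note that this route materialises an N-by-N matrix, so for long vectors the 1D convolution is the lighter option.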
I have the struct Trajectories with the fields uniqueDate, dateAll and label. I want to compare the fields uniqueDate and dateAll and, wherever they match, save in label a value from another struct.
I have written this code:
for k=1:nCols
    for j=1:size(Trajectories(1,k).dateAll,1)
        for i=1:size(Trajectories(1,k).uniqueDate,1)
            if (~isempty(s(1,k).places))&&(Trajectories(1,k).dateAll(j,1)==Trajectories(1,k).uniqueDate(i,1))&&(Trajectories(1,k).dateAll(j,2)==Trajectories(1,k).uniqueDate(i,2))&&(Trajectories(1,k).dateAll(j,3)==Trajectories(1,k).uniqueDate(i,3))
                for z=1:24
                    if(Trajectories(1,k).dateAll(j,4)==z)&&(size(s(1,k).places.all,2)>=size(Trajectories(1,k).uniqueDate,1))
                        Trajectories(1,k).label(j)=s(1,k).places.all(z,i);
                    else if(Trajectories(1,k).dateAll(j,4)==z)&&(size(s(1,k).places.all,2)<size(Trajectories(1,k).uniqueDate,1))
                            for l=1:size(s(1,k).places.all,2)
                                Trajectories(1,k).label(l)=s(1,k).places.all(z,l);
                            end
                        end
                    end
                end
            end
        end
    end
end
For example:
Trajectories(1,4).dateAll=[1 2004 8 1 14 1 15 0 0 0 1 42 13 2;596 2004 8 1 16 20 14 0 0 0 1 29 12 NaN;674 2004 8 1 18 26 11 0 0 0 1 20 38 1;674 2004 8 2 10 7 40 0 0 0 14 26 5 3;674 2004 8 2 11 3 29 0 0 0 1 54 3 3;631 2004 8 2 11 57 56 0 0 0 0 30 8 2;1 2004 8 2 12 4 35 0 0 0 1 53 21 2;631 2004 8 2 12 52 58 0 0 0 0 20 36 2;631 2004 8 2 13 5 3 0 0 0 1 49 40 2;631 2004 8 2 14 0 20 0 0 0 1 56 12 2;631 2004 8 2 15 2 0 0 0 0 1 57 39 2;631 2004 8 2 16 1 4 0 0 0 1 55 53 2;1 2004 8 2 17 9 15 0 0 0 1 48 41 2];
Trajectories(1,4).uniqueDate= [2004 8 1;2004 8 2;2004 8 3;2004 8 4];
It runs, but it is very slow. How can I modify it to speed it up?
Let's work from the inside out and see where it gets us.
Step 1: Simplify your comparison condition:
if (~isempty(s(1,k).places))&&(Trajectories(1,k).dateAll(j,1)==Trajectories(1,k).uniqueDate(i,1))&&(Trajectories(1,k).dateAll(j,2)==Trajectories(1,k).uniqueDate(i,2))&&(Trajectories(1,k).dateAll(j,3)==Trajectories(1,k).uniqueDate(i,3))
becomes
if (~isempty(s(1,k).places)) && all( Trajectories(1,k).dateAll(j,1:3)==Trajectories(1,k).uniqueDate(i,1:3) )
Then we want to remove this from a for-loop. The "intersect" function is useful here:
[ia i1 i2]=intersect(Trajectories(1,k).dateAll(:,1:3),Trajectories(1,k).uniqueDate(:,1:3),'rows');
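We now have a vector i1 of rows in dateAll that match a row of uniqueDate, and a vector i2 of the corresponding rows in uniqueDate. Note that intersect returns each common row only once, so if the same date occurs in several rows of dateAll, only one of them appears in i1; ismember with the 'rows' option reports a match for every row instead.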
Now we can remove the loop comparing z using a similar approach:
[iz iz1 iz2] = intersect(Trajectories(1,k).dateAll(i1,4),1:24);
We have to be careful about our indices here, using a subset of a subset.
This simplifies the code to:
for k=1:nCols
    if isempty(s(1,k).places)
        continue; % skip to the next value of k, no need to do the rest of the comparison
    end
    [ia i1 i2]=intersect(Trajectories(1,k).dateAll(:,1:3),Trajectories(1,k).uniqueDate(:,1:3),'rows');
    [iz iz1 iz2] = intersect(Trajectories(1,k).dateAll(i1,4),1:24);
    usescalarlabel = (size(s(1,k).places.all,2)>=size(Trajectories(1,k).uniqueDate,1));
    if (usescalarlabel)
        Trajectories(1,k).label(i1(iz1)) = s(1,k).places.all(iz,i2(iz1));
    else
        % you will need to check this: I think here you were needlessly repeating this step for every match
        Trajectories(1,k).label(i1(iz1)) = s(1,k).places.all(iz,:);
    end
end
But wait! That z loop is exactly the same as using indexing. So we don't need that second intersect after all:
for k=1:nCols
    if isempty(s(1,k).places)
        continue; % skip to the next value of k, no need to do the rest of the comparison
    end
    [ia i1 i2]=intersect(Trajectories(1,k).dateAll(:,1:3),Trajectories(1,k).uniqueDate(:,1:3),'rows');
    usescalarlabel = (size(s(1,k).places.all,2)>=size(Trajectories(1,k).uniqueDate,1));
    label_indices = Trajectories(1,k).dateAll(i1,4);
    if (usescalarlabel)
        Trajectories(1,k).label(label_indices) = s(1,k).places.all(label_indices,i2);
    else
        % you will need to check this: I think here you were needlessly repeating this step for every match
        Trajectories(1,k).label(label_indices) = s(1,k).places.all(label_indices,:);
    end
end
You'll need to check the indexing in this. I'm sure I've made a mistake somewhere, since I had no data to test against, but it should give you an idea of how to go about removing the loops and using vector expressions instead. Without seeing the data, that's as far as I can optimise. You may be able to go further if you can reformat your data into a set of 3d matrices / cells instead of using structs.
I am suspicious of your condition which I have called "usescalarlabel": it seems like you are mixing two data types. I would also strongly recommend splitting the dateAll matrices into separate "date" and "data" matrices, as the columns from index 4 onwards don't seem to be dates. Finally, the example you copy/pasted seems to have an extra value in column 1. If so, you'll need to compare Trajectories(1,k).dateAll(:,2:4) instead of Trajectories(1,k).dateAll(:,1:3).
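The core idea, replacing the nested date comparison with a single row lookup, can be sketched in Python as follows. This is an illustration only: match_rows is a made-up name, the dates are the year/month/day triples from columns 2:4 of the first four example rows of dateAll, and the result is 0-based. It behaves like the location output of MATLAB's ismember with the 'rows' option, giving one entry per row of the first argument.

```python
def match_rows(date_rows, unique_dates):
    """For each row in date_rows, return the index of the equal row in
    unique_dates, or None when there is no match."""
    # Build the lookup once, so each row costs one dictionary access
    # instead of a scan over all unique dates.
    lookup = {tuple(d): i for i, d in enumerate(unique_dates)}
    return [lookup.get(tuple(r)) for r in date_rows]


date_rows = [(2004, 8, 1), (2004, 8, 1), (2004, 8, 1), (2004, 8, 2)]
unique_dates = [(2004, 8, 1), (2004, 8, 2), (2004, 8, 3), (2004, 8, 4)]
print(match_rows(date_rows, unique_dates))  # [0, 0, 0, 1]
```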
Good luck.
I am solving Project Euler problem 58. A square is built by starting with 1 and spiralling anticlockwise in the following way (here with side length 7):
37 36 35 34 33 32 31
38 17 16 15 14 13 30
39 18 5 4 3 12 29
40 19 6 1 2 11 28
41 20 7 8 9 10 27
42 21 22 23 24 25 26
43 44 45 46 47 48 49
The task is to keep adding layers to the spiral and find the side length at which the ratio of primes on the two diagonals to the total count of numbers on the diagonals first falls below 0.10.
I am convinced the code below is correct (see the code comments for clarification), but the site says the answer is wrong when I enter it.
require 'prime'

# We use a mathematical derivation of the corner values, keep increasing the value till we find a ratio smaller
# than 0.10 and increase the grid_size and amount of numbers on diagonals each iteration
side_length = 3 # start with grid size of 3x3 so that we do not get into trouble with 1x1 grid
prime_count = 3 # 3, 5, 7 are prime and on a diagonal in a 3x3 grid
diagonal_size = 5
prime_ratio = 1 # dummy value bigger than 0.10 so we can start the loop

while prime_ratio >= 0.10
  # Add one to prime count for each corner if it is prime
  # Corners are given by n^2 (bottom right), n^2-n+1, n^2-2n+2, and n^2-3n+3
  prime_count += 1 if (side_length**2).prime?
  prime_count += 1 if (side_length**2-side_length+1).prime?
  prime_count += 1 if (side_length**2-2*side_length+2).prime?
  prime_count += 1 if (side_length**2-3*side_length+3).prime?
  # Divide amount of primes counted by the diagonal length to get prime ratio
  prime_ratio = prime_count/diagonal_size.to_f
  # Increase the side length by two (full spiral) and diagonal size by four
  side_length += 2 and diagonal_size += 4
end

puts side_length-2 #-2 to account for last addition in while-loop
# => 26612
My answer is probably wrong and the site right, but I have been stuck on this problem for quite some time now. Can anyone point out the mistake?
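side_length += 2 and diagonal_size += 4 should be at the beginning of the loop. As written, the first iteration tests the corners of the 3x3 grid again, even though prime_count already includes 3, 5 and 7, so those primes are counted twice. Once the increment is moved, the final side_length-2 adjustment is no longer needed.
I couldn't check this because I do not have Ruby installed, but I can reproduce the same problem in my Python solution.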
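For reference, here is a minimal Python sketch of the loop with the increment moved to the top. It assumes plain trial division is good enough for the primality test; is_prime and first_side_below are illustrative names, not taken from the question. It starts from the 1x1 grid so that no primes need to be pre-counted, and it skips n^2 because an odd square is never prime.

```python
def is_prime(n):
    """Trial-division primality test (simple, not optimised)."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def first_side_below(threshold):
    """Smallest odd side length whose diagonal prime ratio is below threshold."""
    side = 1            # the 1x1 grid holds only the number 1
    primes = 0
    diagonal_count = 1
    while True:
        # Grow the spiral first, then count the corners of the new layer.
        side += 2
        diagonal_count += 4
        for k in range(1, 4):  # corners n^2-(n-1), n^2-2(n-1), n^2-3(n-1)
            if is_prime(side * side - k * (side - 1)):
                primes += 1
        if primes / diagonal_count < threshold:
            return side


print(first_side_below(0.5))  # 11, since the ratio is 10/21 at side length 11
```

Calling first_side_below(0.10) runs the same loop for the threshold in the problem.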
I'm writing a gem to detect tracking numbers (called tracking_number, natch). It searches text for valid tracking number formats, and then runs those formats through the checksum calculation as specified in each respective service's spec to determine valid numbers.
The other day I mailed a letter using USPS Certified Mail, got the accompanying tracking number from USPS, and fed it into my gem and it failed the validation. I am fairly certain I am performing the calculation correctly, but have run out of ideas.
The number is validated using USS Code 128 as described in section 2.8 (page 15) of the following document: http://www.usps.com/cpim/ftp/pubs/pub109.pdf
The tracking number I got from the post office was "7196 9010 7560 0307 7385", and the code I'm using to calculate the check digit is:
def valid_checksum?
  # tracking number doesn't have spaces at this point
  chars = self.tracking_number.chars.to_a
  check_digit = chars.pop

  total = 0
  chars.reverse.each_with_index do |c, i|
    x = c.to_i
    x *= 3 if i.even? # weight every other digit, starting with the one next to the check digit
    total += x
  end

  check = total % 10
  check = 10 - check unless check.zero?
  check == check_digit.to_i # explicit true/false rather than true/nil
end
According to my calculations based on the spec provided, the last digit should be a 3 in order to be valid. However, Google's tracking number auto detection picks up the number fine as is, so I can only assume I am doing something wrong.
From my manual calculations, it should match what your code does:
posn: 20 19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2   sum  mult
even:  7     9     9     1     7     6     0     0     7     8    54   162
odd:      1     6     0     0     5     0     3     7     3       25    25
                                                                       ===
                                                                       187
Hence the check digit should be three.
If that number is valid, then they're using a different algorithm to the one you think they are.
I think that might be the case since, when I plug the number you gave into the USPS tracker page, I can see its entire path.
In fact, if you look at publication 91, the Confirmation Services Technical Guide, you'll see it uses two extra digits, including the 91 at the front for the tracking application ID. Applying the algorithm found in that publication gives us:
posn: 22 21 20 19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2   sum  mult
even:  9     7     9     9     1     7     6     0     0     7     8    63   189
odd:      1     1     6     0     0     5     0     3     7     3       26    26
                                                                             ===
                                                                             215
and that would indeed give you a check digit of 5. I'm not saying that's the answer but it does match with the facts and is at least a viable explanation.
Probably your best bet would be to contact USPS for the information.
I don't know Ruby, but it looks as though you're multiplying by 3 at each even number; and the way I read the spec, you sum all the even digits and multiply the sum by 3. See the worked-through example pp. 20-21.
(later)
Your code may be right. This Python snippet gives 7 for their example and 3 for yours:
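#!/usr/bin/python
'check tracking number checksum'
import sys

def check(number=sys.argv[1:]):
    # join the argument groups into one digit string and drop any dashes
    to_check = ''.join(number).replace('-', '')
    print(to_check)
    # positions are counted from the right; the check digit itself is position 1
    even = sum(map(int, to_check[-2::-2]))
    odd = sum(map(int, to_check[-3::-2]))
    # weighted sum; the check digit brings it up to the next multiple of 10
    print(even * 3 + odd)

if __name__ == '__main__':
    check(sys.argv[1:])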
[added later]
just completing my code, for reference:
jcomeau#intrepid:~$ /tmp/track.py 7196 9010 7560 0307 7385
False
jcomeau#intrepid:~$ /tmp/track.py 91 7196 9010 7560 0307 7385
True
jcomeau#intrepid:~$ /tmp/track.py 71123456789123456787
True
jcomeau#intrepid:~$ cat /tmp/track.py
#!/usr/bin/python
'check tracking number checksum'
import sys

def check(number):
    to_check = ''.join(number).replace('-', '')
    # positions are counted from the right; the check digit itself is position 1
    even = sum(map(int, to_check[-2::-2]))
    odd = sum(map(int, to_check[-3::-2]))
    checksum = even * 3 + odd
    # the outer % 10 maps a result of 10 to a check digit of 0
    checkdigit = (10 - (checksum % 10)) % 10
    return checkdigit == int(to_check[-1])

if __name__ == '__main__':
    print(check(''.join(sys.argv[1:]).replace('-', '')))