I am trying to run an Explanatory Factor Analysis on my questionnaire data.
I have data for 201 participants and 30 questions. The head of my data looks somehow like this (I am showing only the first 5 questions to give an idea of the dataset structure):
Q1 Q2 Q3 Q3 Q4 Q5
1 14 0 20 0 0 0
2 14 14 20 20 20 1
3 20 18 20 20 20 9
4 14 14 20 20 20 0
5 20 18 20 20 20 5
6 20 18 20 20 8 7
I want to find multivariate outliers ,so I am trying to calculate the Mahalanobis distance (cases with Mahalanobis Distance p values bigger than 0.001 are considered outliers).
I am using this code in R-studio (all_data_EFA is my dataset name):
distance <- as.matrix(mahalanobis(all_data_EFA, colMeans(all_data_EFA), cov = cov(all_data_EFA)))
Mah_significant <- all_data_EFA %>%
transmute(row_number = 1:nrow(all_data_EFA),
Mahalanobis_distance = distance,
Mah_p_value = pchisq(distance, df = ncol(all_data_EFA), lower.tail = F)) %>%
filter(Mah_p_value <= 0.001)
However, when I run "distance" I get the following Error:
Error in solve.default(cov, ...) :
Lapack routine dgesv: system is exactly singular: U[26,26] = 0
As far as I understood, this means that the covariance matrix of my data is singular, hence the matrix is not invertible and I cannot calculate Mahalanobis distance.
Is there an alternative way to calculate multivariate outliers or how can I solve this problem?
Many thanks.
This is snapshot of my dataset
A B
1 34
1 33
1 66
0 54
0 77
0 98
0 39
0 12
I am trying to create a random sample where there are 2 1s and 3 0s from column A in the sample along with their respective B values. Is there a way to do that? Basically trying to see how to get a sample with specific percentages of a particular column? Thanks.
I tried to use VW to train a regression model on a small set of examples (about 3112). I think I'm doing it correctly, yet it showed me weird results. Dug around but didn't find anything helpful.
$ cat sh600000.feat | vw --l1 1e-8 --l2 1e-8 --readable_model model -b 24 --passes 10 --cache_file cache
using l1 regularization = 1e-08
using l2 regularization = 1e-08
Num weight bits = 24
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using cache_file = cache
ignoring text input in favor of cache input
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.040000 0.040000 1 1.0 -0.2000 0.0000 79
0.051155 0.062310 2 2.0 0.2000 -0.0496 79
0.046606 0.042056 4 4.0 0.4100 0.1482 79
0.052160 0.057715 8 8.0 0.0200 0.0021 78
0.064936 0.077711 16 16.0 -0.1800 0.0547 77
0.060507 0.056079 32 32.0 0.0000 0.3164 79
0.136933 0.213358 64 64.0 -0.5900 -0.0850 79
0.151692 0.166452 128 128.0 0.0700 0.0060 79
0.133965 0.116238 256 256.0 0.0900 -0.0446 78
0.179995 0.226024 512 512.0 0.3700 -0.0217 79
0.109296 0.038597 1024 1024.0 0.1200 -0.0728 79
0.579360 1.049425 2048 2048.0 -0.3700 -0.0084 79
0.485389 0.485389 4096 4096.0 1.9600 0.3934 79 h
0.517748 0.550036 8192 8192.0 0.0700 0.0334 79 h
finished run
number of examples per pass = 2847
passes used = 5
weighted example sum = 14236
weighted label sum = -155.98
average loss = 0.490685 h
best constant = -0.0109567
total feature number = 1121506
$ wc model
41 48 657 model
Questions:
Why is the number of features in the output (readable) model less than the number of actual features? I counted that the training data contains 78 features (plus the bias that's 79 as shown during the training). The number of feature bits is 24, which should be far more than enough to avoid collision.
Why does the average loss actually go up in the training as you can see in the above example?
(Minor) I tried to increase the number of feature bits to 32, and it output an empty model. Why?
EDIT:
I tried to shuffle the input file, as well as using --holdout_off, as suggested. But the result is still almost the same - the average loss go up.
$ cat sh600000.feat.shuf | vw --l1 1e-8 --l2 1e-8 --readable_model model -b 24 --passes 10 --cache_file cache --holdout_off
using l1 regularization = 1e-08
using l2 regularization = 1e-08
Num weight bits = 24
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using cache_file = cache
ignoring text input in favor of cache input
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.040000 0.040000 1 1.0 -0.2000 0.0000 79
0.051155 0.062310 2 2.0 0.2000 -0.0496 79
0.046606 0.042056 4 4.0 0.4100 0.1482 79
0.052160 0.057715 8 8.0 0.0200 0.0021 78
0.071332 0.090504 16 16.0 0.0300 0.1203 79
0.043720 0.016108 32 32.0 -0.2200 -0.1971 78
0.142895 0.242071 64 64.0 0.0100 -0.1531 79
0.158564 0.174232 128 128.0 0.0500 -0.0439 79
0.150691 0.142818 256 256.0 0.3200 0.1466 79
0.197050 0.243408 512 512.0 0.2300 -0.0459 79
0.117398 0.037747 1024 1024.0 0.0400 0.0284 79
0.636949 1.156501 2048 2048.0 1.2500 -0.0152 79
0.363364 0.089779 4096 4096.0 0.1800 0.0071 79
0.477569 0.591774 8192 8192.0 -0.4800 0.0065 79
0.411068 0.344567 16384 16384.0 0.0700 0.0450 77
finished run
number of examples per pass = 3112
passes used = 10
weighted example sum = 31120
weighted label sum = -105.5
average loss = 0.423404
best constant = -0.0033901
total feature number = 2451800
The training examples are unique to each other so I doubt there is over-fitting problem (which, as I understand it, usually happens when the number of input is too small comparing the number of features).
EDIT2:
Tried to print the average loss for every pass of examples, and see that it mostly remains constant.
$ cat dist/sh600000.feat | vw --l1 1e-8 --l2 1e-8 -f dist/model -P 3112 --passes 10 -b 24 --cache_file dist/cache
using l1 regularization = 1e-08
using l2 regularization = 1e-08
Num weight bits = 24
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
final_regressor = dist/model
using cache_file = dist/cache
ignoring text input in favor of cache input
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.498822 0.498822 3112 3112.0 0.0800 0.0015 79 h
0.476677 0.454595 6224 6224.0 -0.2200 -0.0085 79 h
0.466413 0.445856 9336 9336.0 0.0200 -0.0022 79 h
0.490221 0.561506 12448 12448.0 0.0700 -0.1113 79 h
finished run
number of examples per pass = 2847
passes used = 5
weighted example sum = 14236
weighted label sum = -155.98
average loss = 0.490685 h
best constant = -0.0109567
total feature number = 1121506
Also another try without the --l1, --l2 and -b parameters:
$ cat dist/sh600000.feat | vw -f dist/model -P 3112 --passes 10 --cache_file dist/cacheNum weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
final_regressor = dist/model
using cache_file = dist/cache
ignoring text input in favor of cache input
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.520286 0.520286 3112 3112.0 0.0800 -0.0021 79 h
0.488581 0.456967 6224 6224.0 -0.2200 -0.0137 79 h
0.474247 0.445538 9336 9336.0 0.0200 -0.0299 79 h
0.496580 0.563450 12448 12448.0 0.0700 -0.1727 79 h
0.533413 0.680958 15560 15560.0 -0.1700 0.0322 79 h
0.524531 0.480201 18672 18672.0 -0.9800 -0.0573 79 h
finished run
number of examples per pass = 2801
passes used = 7
weighted example sum = 19608
weighted label sum = -212.58
average loss = 0.491739 h
best constant = -0.0108415
total feature number = 1544713
Does that mean it's normal for average loss to go up during one pass, but as long as multiple pass gets the same loss then it's fine?
Model file stores only non-zero weights. So most likely others got nulled especially if you are using --l1
It may be caused by many reasons. Perhaps your dataset isn't shuffled well enough. If you sort your dataset so examples labeled -1 will be in first half and examples labeled 1 will be in second then your model will show very good convergence on first half, but you'll see avg loss bump as it reaches 2nd half. So it may be disbalance in dataset. As for last two losses - these are holdout losses (marked with 'h' at end of line) and may point that model is overfitted. Pls refer to my other answer.
Well, in master branch usage of -b 32 is even currently blocked. You shall use up to -b 31. On practice -b 24-28 is usually enough even for dozens of thousands of features.
I would recommend you to get up-to-date VW version from github
I'm trying to play around with Bayesian updating, and have a situation in which I am using a posterior from previous runs as a prior. This is a 2D prior on alpha and beta, for which I have traces, alphatrace and betatrace. So I stack them and use code adopted from https://gist.github.com/jcrudy/5911624 to make a KDE based stochastic.
#from https://gist.github.com/jcrudy/5911624
def KernelSmoothing(name, dataset, bw_method=None, observed=False, value=None):
'''Create a pymc node whose distribution comes from a kernel smoothing density estimate.'''
density = gaussian_kde(dataset, bw_method)
def logp(value):
#print "VAL", value
d = density(value)
if d == 0.0:
return float('-inf')
return np.log(d)
def random():
result = None
sample=density.resample(1)
#print sample, sample.shape
result = sample[0][0],sample[1][0]
return result
if value == None:
value = random()
dtype = type(value)
result = pymc.Stochastic(logp = logp,
doc = 'A kernel smoothing density node.',
name = name,
parents = {},
random = random,
trace = True,
value = None,
dtype = dtype,
observed = observed,
cache_depth = 2,
plot = True,
verbose = 0)
return result
Note that the critical thing here is to obtain 2-values from the joint prior: this is why i need a 2-D prior and not two 1-D priors.
The model itself is so:
ctrace=np.vstack((alphatrace, betatrace))
cnew=KernelSmoothing("cnew", ctrace)
#pymc.deterministic
def alphanew(cnew=cnew, name='alphanew'):
return cnew[0]
#pymc.deterministic
def betanew(cnew=cnew, name='betanew'):
return cnew[1]
newtheta=pymc.Beta("newtheta", alphanew, betanew)
newexp = pymc.Binomial('newexp', n=[14], p=[newtheta], value=[4], observed=True)
model3=pymc.Model([cnew, alphanew, betanew, newtheta, newexp])
mcmc3=pymc.MCMC(model3)
mcmc3.sample(20000,5000,5)
In case you are wondering, this is to do the 71st experiment in the hierarchical Rat Tumor example in Chapter 5 in Gelman's BDA. The "prior" I am using is the posterior on alpha and beta after 70 experiments.
But, when I sample, things blow up with the error:
ValueError: Maximum competence reported for stochastic cnew is <= 0... you may need to write a custom step method class.
Its not cnew I care about updating as a stochastic, but rather alphanew and betanew. How ought I be structuring the code to make this error go away?
EDIT: initial model which gave me the posteriors I wish to use as the prior:
tumordata="""0 20
0 20
0 20
0 20
0 20
0 20
0 20
0 19
0 19
0 19
0 19
0 18
0 18
0 17
1 20
1 20
1 20
1 20
1 19
1 19
1 18
1 18
3 27
2 25
2 24
2 23
2 20
2 20
2 20
2 20
2 20
2 20
1 10
5 49
2 19
5 46
2 17
7 49
7 47
3 20
3 20
2 13
9 48
10 50
4 20
4 20
4 20
4 20
4 20
4 20
4 20
10 48
4 19
4 19
4 19
5 22
11 46
12 49
5 20
5 20
6 23
5 19
6 22
6 20
6 20
6 20
16 52
15 46
15 47
9 24
"""
tumortuples=[e.strip().split() for e in tumordata.split("\n")]
tumory=np.array([np.int(e[0].strip()) for e in tumortuples if len(e) > 0])
tumorn=np.array([np.int(e[1].strip()) for e in tumortuples if len(e) > 0])
N = tumorn.shape[0]
mu = pymc.Uniform("mu",0.00001,1., value=0.13)
nu = pymc.Uniform("nu",0.00001,1., value=0.01)
#pymc.deterministic
def alpha(mu=mu, nu=nu, name='alpha'):
return mu/(nu*nu)
#pymc.deterministic
def beta(mu=mu, nu=nu, name='beta'):
return (1.-mu)/(nu*nu)
thetas=pymc.Container([pymc.Beta("theta_%i" % i, alpha, beta) for i in range(N)])
deaths = pymc.Binomial('deaths', n=tumorn, p=thetas, value=tumory, size=N, observed=True)
I use the joint-posterior from this model on alpha, beta as input to the "new model" at top. This also begs the question if I ought to be including theta1..theta70 in the model at top as they will update along with alpha and beta thanks to the new data which is a binomial with n=14, y=4. But I cant even get the little model with only a prior as a 2d sample array working :-(
I found your question since I ran into a similar proble. According to the documentation of pymc.StepMethod.competence, the problem is that none of the built-in samplers handle the dtype associated with the stochastic variable.
I am not sure what needs to be done to actually resolve that. Maybe one of the sampler methods can be extended to handle special types?
Hopefully someone with more pymc mojo can shine a light on what needs to be done..
def competence(s):
"""
This function is used by Sampler to determine which step method class
should be used to handle stochastic variables.
Return value should be a competence
score from 0 to 3, assigned as follows:
0: I can't handle that variable.
1: I can handle that variable, but I'm a generalist and
probably shouldn't be your top choice (Metropolis
and friends fall into this category).
2: I'm designed for this type of situation, but I could be
more specialized.
3: I was made for this situation, let me handle the variable.
In order to be eligible for inclusion in the registry, a sampling
method's init method must work with just a single argument, a
Stochastic object.
If you want to exclude a particular step method from
consideration for handling a variable, do this:
Competence functions MUST be called 'competence' and be decorated by the
'#staticmethod' decorator. Example:
#staticmethod
def competence(s):
if isinstance(s, MyStochasticSubclass):
return 2
else:
return 0
:SeeAlso: pick_best_methods, assign_method
"""
I have a crosstab and create custom grand total for the row level in each column dimension, by using a data element expression.
Crosstab Example:
Cat 1 Cat 2 GT
ITEM C F % VALUE C F % VALUE
A 101 0 0.9 10 112 105 93.8 10 20
B 294 8 2.7 6 69 66 95.7 10 16
C 211 7 3.3 4 212 161 75.9 6 10
------------------------------------------------------------------
GT 606 15 2.47 6 393 332 84.5 8 **14**
Explanation for GT row:
Those C and F column is summarized from the above. But the
% column is division result of F/C.
Create a data element to fill the VALUE column, which comes from range of value definition, varies for each Cat (category). For instance... in Cat 1, if the value is between 0 - 1 the value will be 10, or between 1 - 2 = 8, etc. And condition for Cat 2, between 85 - 100 = 10, and 80 - 85 = 8, etc.
The GT row (with the value of 14), is gathered by adding VALUE of Cat 1 + Cat 2.
I am able to work on point 1 and 2 above, but I can't seem to make it working for GT row. I don't know the code/expression to sum up the VALUE data element for this 2 categories. Because those VALUE field comes from one data element in design mode.
I have found the solution for my problem. I can show the result by using a report variable. I am assigning 2 report variables in % field expression, based on the category in data cube dimension (by using if statement). And then in data element expression, I am calling both of the expressions and add them.