Drawing from a 2-D prior that is only available as samples in pymc2 - pymc

I'm trying to play around with Bayesian updating, and have a situation in which I am using a posterior from previous runs as a prior. This is a 2-D prior on alpha and beta, for which I have traces, alphatrace and betatrace. So I stack them and use code adapted from https://gist.github.com/jcrudy/5911624 to make a KDE-based stochastic.
#from https://gist.github.com/jcrudy/5911624
import numpy as np
import pymc
from scipy.stats import gaussian_kde

def KernelSmoothing(name, dataset, bw_method=None, observed=False, value=None):
    '''Create a pymc node whose distribution comes from a kernel smoothing density estimate.'''
    density = gaussian_kde(dataset, bw_method)
    def logp(value):
        #print "VAL", value
        d = density(value)
        if d == 0.0:
            return float('-inf')
        return np.log(d)
    def random():
        result = None
        sample=density.resample(1)
        #print sample, sample.shape
        result = sample[0][0],sample[1][0]
        return result
    if value == None:
        value = random()
    dtype = type(value)
    result = pymc.Stochastic(logp = logp,
                             doc = 'A kernel smoothing density node.',
                             name = name,
                             parents = {},
                             random = random,
                             trace = True,
                             value = None,
                             dtype = dtype,
                             observed = observed,
                             cache_depth = 2,
                             plot = True,
                             verbose = 0)
    return result
Note that the critical thing here is to draw a pair of values from the joint prior: this is why I need a 2-D prior and not two 1-D priors.
The model itself looks like this:
ctrace=np.vstack((alphatrace, betatrace))
cnew=KernelSmoothing("cnew", ctrace)
@pymc.deterministic
def alphanew(cnew=cnew, name='alphanew'):
    return cnew[0]
@pymc.deterministic
def betanew(cnew=cnew, name='betanew'):
    return cnew[1]
newtheta=pymc.Beta("newtheta", alphanew, betanew)
newexp = pymc.Binomial('newexp', n=[14], p=[newtheta], value=[4], observed=True)
model3=pymc.Model([cnew, alphanew, betanew, newtheta, newexp])
mcmc3=pymc.MCMC(model3)
mcmc3.sample(20000,5000,5)
In case you are wondering, this is to do the 71st experiment in the hierarchical Rat Tumor example in Chapter 5 in Gelman's BDA. The "prior" I am using is the posterior on alpha and beta after 70 experiments.
But, when I sample, things blow up with the error:
ValueError: Maximum competence reported for stochastic cnew is <= 0... you may need to write a custom step method class.
It's not cnew I care about updating as a stochastic, but rather alphanew and betanew. How should I structure the code to make this error go away?
EDIT: initial model which gave me the posteriors I wish to use as the prior:
tumordata="""0 20
0 20
0 20
0 20
0 20
0 20
0 20
0 19
0 19
0 19
0 19
0 18
0 18
0 17
1 20
1 20
1 20
1 20
1 19
1 19
1 18
1 18
3 27
2 25
2 24
2 23
2 20
2 20
2 20
2 20
2 20
2 20
1 10
5 49
2 19
5 46
2 17
7 49
7 47
3 20
3 20
2 13
9 48
10 50
4 20
4 20
4 20
4 20
4 20
4 20
4 20
10 48
4 19
4 19
4 19
5 22
11 46
12 49
5 20
5 20
6 23
5 19
6 22
6 20
6 20
6 20
16 52
15 46
15 47
9 24
"""
tumortuples=[e.strip().split() for e in tumordata.split("\n")]
# int() replaces np.int, which recent NumPy releases no longer provide
tumory=np.array([int(e[0].strip()) for e in tumortuples if len(e) > 0])
tumorn=np.array([int(e[1].strip()) for e in tumortuples if len(e) > 0])
N = tumorn.shape[0]
mu = pymc.Uniform("mu",0.00001,1., value=0.13)
nu = pymc.Uniform("nu",0.00001,1., value=0.01)
@pymc.deterministic
def alpha(mu=mu, nu=nu, name='alpha'):
    return mu/(nu*nu)
@pymc.deterministic
def beta(mu=mu, nu=nu, name='beta'):
    return (1.-mu)/(nu*nu)
thetas=pymc.Container([pymc.Beta("theta_%i" % i, alpha, beta) for i in range(N)])
deaths = pymc.Binomial('deaths', n=tumorn, p=thetas, value=tumory, size=N, observed=True)
I use the joint posterior on alpha and beta from this model as input to the "new model" at the top. This also raises the question of whether I ought to include theta1..theta70 in the model at the top, since they will update along with alpha and beta given the new data, which is a binomial with n=14, y=4. But I can't even get the little model with only a prior as a 2-D sample array working :-(

I found your question because I ran into a similar problem. According to the documentation of pymc.StepMethod.competence, the problem is that none of the built-in samplers handles the dtype associated with the stochastic variable.
I am not sure what needs to be done to actually resolve that. Maybe one of the sampler methods can be extended to handle special types?
Hopefully someone with more pymc mojo can shed some light on what needs to be done.
def competence(s):
    """
    This function is used by Sampler to determine which step method class
    should be used to handle stochastic variables.
    Return value should be a competence
    score from 0 to 3, assigned as follows:
    0: I can't handle that variable.
    1: I can handle that variable, but I'm a generalist and
       probably shouldn't be your top choice (Metropolis
       and friends fall into this category).
    2: I'm designed for this type of situation, but I could be
       more specialized.
    3: I was made for this situation, let me handle the variable.
    In order to be eligible for inclusion in the registry, a sampling
    method's init method must work with just a single argument, a
    Stochastic object.
    If you want to exclude a particular step method from
    consideration for handling a variable, do this:
    Competence functions MUST be called 'competence' and be decorated by the
    '@staticmethod' decorator. Example:
        @staticmethod
        def competence(s):
            if isinstance(s, MyStochasticSubclass):
                return 2
            else:
                return 0
    :SeeAlso: pick_best_methods, assign_method
    """


R: How to solve Lapack routine dgesv: system is exactly singular in Mahalanobis distance

I am trying to run an Exploratory Factor Analysis on my questionnaire data.
I have data for 201 participants and 30 questions. The head of my data looks something like this (I am showing only the first few questions to give an idea of the dataset structure):
Q1 Q2 Q3 Q3 Q4 Q5
1 14 0 20 0 0 0
2 14 14 20 20 20 1
3 20 18 20 20 20 9
4 14 14 20 20 20 0
5 20 18 20 20 20 5
6 20 18 20 20 8 7
I want to find multivariate outliers, so I am trying to calculate the Mahalanobis distance (cases whose Mahalanobis distance has a p-value of 0.001 or less are considered outliers).
I am using this code in RStudio (all_data_EFA is my dataset name):
distance <- as.matrix(mahalanobis(all_data_EFA, colMeans(all_data_EFA), cov = cov(all_data_EFA)))
Mah_significant <- all_data_EFA %>%
  transmute(row_number = 1:nrow(all_data_EFA),
            Mahalanobis_distance = distance,
            Mah_p_value = pchisq(distance, df = ncol(all_data_EFA), lower.tail = FALSE)) %>%
  filter(Mah_p_value <= 0.001)
However, when I compute distance I get the following error:
Error in solve.default(cov, ...) :
Lapack routine dgesv: system is exactly singular: U[26,26] = 0
As far as I understood, this means that the covariance matrix of my data is singular, hence the matrix is not invertible and I cannot calculate Mahalanobis distance.
Is there an alternative way to calculate multivariate outliers or how can I solve this problem?
Many thanks.

Quickly compute `dot(a(n:end), b(1:end-n))`

Suppose we have two one-dimensional arrays of values, a and b, which both have length N. I want to create a new array c such that c(n)=dot(a(n:N), b(1:N-n+1)). I can of course do this using a simple loop:
for n=1:N
    c(n)=dot(a(n:N), b(1:N-n+1));
end
but given that this is such a simple operation, and one that resembles a convolution, I was wondering whether there is a more efficient way to do this (using Matlab).
A solution using 1D convolution conv:
out = conv(a, flip(b));
c = out(ceil(numel(out)/2):end);
In conv the first vector is multiplied by the reversed version of the second vector so we need to compute the convolution of a and the flipped b and trim the unnecessary part.
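As a cross-check outside MATLAB, the index bookkeeping can be sketched in pure Python. This is only an illustration under stated assumptions: the helper names `full_convolution`, `lagged_dots` and `lagged_dots_loop` are made up here, the test vectors are a small example pair, and indices are 0-based, so `c[n]` corresponds to `c(n+1)` in the MATLAB code.

```python
def full_convolution(x, y):
    """Full linear convolution of two sequences (length len(x) + len(y) - 1)."""
    out = [0] * (len(x) + len(y) - 1)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            out[i + j] += xi * yj
    return out


def lagged_dots(a, b):
    """c[n] = dot(a[n:], b[:N-n]), via convolution with the reversed b."""
    n_items = len(a)
    out = full_convolution(a, b[::-1])
    # The first N-1 entries pair a with b at negative lags; drop them.
    return out[n_items - 1:]


def lagged_dots_loop(a, b):
    """Reference implementation: the plain loop from the question."""
    n_items = len(a)
    return [sum(a[n + k] * b[k] for k in range(n_items - n))
            for n in range(n_items)]


a = [9, 10, 2, 10, 7]
b = [1, 3, 6, 10, 10]
print(lagged_dots(a, b))       # [221, 146, 74, 31, 7]
print(lagged_dots_loop(a, b))  # [221, 146, 74, 31, 7]
```

The sketch only confirms that flipping b and keeping the second half of the output reproduces the loop; the nested loops here are not themselves any faster.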
This is an interesting problem!
I am going to assume that a and b are column vectors of the same length. Let us consider a simple example:
a = [9;10;2;10;7];
b = [1;3;6;10;10];
% yields:
c = [221;146;74;31;7];
Now let's see what happens when we compute the convolution of these vectors:
>> conv(a,b)
ans =
9
37
86
166
239
201
162
170
70
>> conv2(a, b.')
ans =
9 27 54 90 90
10 30 60 100 100
2 6 12 20 20
10 30 60 100 100
7 21 42 70 70
We notice that c is the sum of elements along the lower diagonals of the result of conv2. To show this more clearly, we transpose so that the diagonals appear in the same order as the values in c:
>> triu(conv2(a.', b))
ans =
9 10 2 10 7
0 30 6 30 21
0 0 12 60 42
0 0 0 100 70
0 0 0 0 70
So now it becomes a question of summing the diagonals of a matrix, which is a more common problem with existing solutions, for example this one by Andrei Bobrov:
C = conv2(a.', b);
p = sum( spdiags(C, 0:size(C,2)-1) ).'; % This gives the same result as the loop.
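The diagonal-sum view can be checked with a short pure-Python sketch. It is an illustration only: `diagonal_sums` is a hypothetical helper, not part of the answer, and it builds the outer product explicitly rather than calling anything like spdiags.

```python
def diagonal_sums(a, b):
    """Sum the main and upper diagonals of the outer product C[i][j] = b[i] * a[j]."""
    n_items = len(a)
    outer = [[bi * aj for aj in a] for bi in b]
    # Diagonal d holds the entries C[i][i + d]; d = 0 is the main diagonal.
    return [sum(outer[i][i + d] for i in range(n_items - d))
            for d in range(n_items)]


a = [9, 10, 2, 10, 7]
b = [1, 3, 6, 10, 10]
print(diagonal_sums(a, b))  # [221, 146, 74, 31, 7]
```

Note that this approach materialises an N-by-N matrix, so its memory use grows quadratically with N.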

Speed up code to compare fields in a struct

I have the struct Trajectories with the fields uniqueDate, dateAll and label. I want to compare the fields uniqueDate and dateAll and, where they correspond, save in label a value from another struct.
I have written this code:
for k=1:nCols
    for j=1:size(Trajectories(1,k).dateAll,1)
        for i=1:size(Trajectories(1,k).uniqueDate,1)
            if (~isempty(s(1,k).places))&&(Trajectories(1,k).dateAll(j,1)==Trajectories(1,k).uniqueDate(i,1))&&(Trajectories(1,k).dateAll(j,2)==Trajectories(1,k).uniqueDate(i,2))&&(Trajectories(1,k).dateAll(j,3)==Trajectories(1,k).uniqueDate(i,3))
                for z=1:24
                    if(Trajectories(1,k).dateAll(j,4)==z)&&(size(s(1,k).places.all,2)>=size(Trajectories(1,k).uniqueDate,1))
                        Trajectories(1,k).label(j)=s(1,k).places.all(z,i);
                    else if(Trajectories(1,k).dateAll(j,4)==z)&&(size(s(1,k).places.all,2)<size(Trajectories(1,k).uniqueDate,1))
                            for l=1:size(s(1,k).places.all,2)
                                Trajectories(1,k).label(l)=s(1,k).places.all(z,l);
                            end
                        end
                    end
                end
            end
        end
    end
end
For example:
Trajectories(1,4).dateAll=[1 2004 8 1 14 1 15 0 0 0 1 42 13 2;596 2004 8 1 16 20 14 0 0 0 1 29 12 NaN;674 2004 8 1 18 26 11 0 0 0 1 20 38 1;674 2004 8 2 10 7 40 0 0 0 14 26 5 3;674 2004 8 2 11 3 29 0 0 0 1 54 3 3;631 2004 8 2 11 57 56 0 0 0 0 30 8 2;1 2004 8 2 12 4 35 0 0 0 1 53 21 2;631 2004 8 2 12 52 58 0 0 0 0 20 36 2;631 2004 8 2 13 5 3 0 0 0 1 49 40 2;631 2004 8 2 14 0 20 0 0 0 1 56 12 2;631 2004 8 2 15 2 0 0 0 0 1 57 39 2;631 2004 8 2 16 1 4 0 0 0 1 55 53 2;1 2004 8 2 17 9 15 0 0 0 1 48 41 2];
Trajectories(1,4).uniqueDate= [2004 8 1;2004 8 2;2004 8 3;2004 8 4];
It runs, but it is very slow. How can I modify it to speed it up?
Let's work from the inside out and see where it gets us.
Step 1: Simplify your comparison condition:
if (~isempty(s(1,k).places))&&(Trajectories(1,k).dateAll(j,1)==Trajectories(1,k).uniqueDate(i,1))&&(Trajectories(1,k).dateAll(j,2)==Trajectories(1,k).uniqueDate(i,2))&&(Trajectories(1,k).dateAll(j,3)==Trajectories(1,k).uniqueDate(i,3))
becomes
if (~isempty(s(1,k).places)) && all( Trajectories(1,k).dateAll(j,1:3)==Trajectories(1,k).uniqueDate(i,1:3) )
Then we want to remove this from a for-loop. The "intersect" function is useful here:
[ia i1 i2]=intersect(Trajectories(1,k).dateAll(:,1:3),Trajectories(1,k).uniqueDate(:,1:3),'rows');
We now have a vector i1 of all rows in dateAll that intersect with uniqueDate.
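One caveat worth checking before relying on i1: intersect with 'rows' returns each common row once, so it gives one index per distinct date rather than every row of dateAll that carries that date; ismember(..., 'rows') is the MATLAB function that reports a match for every row. The intended per-row lookup can be sketched in Python as below. This is an illustration, not the answerer's code: `match_rows` is a hypothetical helper, the rows are trimmed from the example in the question, the column names in the comment are guesses, and taking the date from the second to fourth columns (indices 1 to 3 here) is an assumption about the layout.

```python
def match_rows(date_rows, unique_dates):
    """For each row, return the index of its date in unique_dates, or None."""
    index_of = {tuple(d): i for i, d in enumerate(unique_dates)}
    # Assumed layout: row[1:4] is (year, month, day).
    return [index_of.get(tuple(row[1:4])) for row in date_rows]


# First five columns of a few example rows (assumed: id, year, month, day, hour).
date_all = [
    [1, 2004, 8, 1, 14],
    [596, 2004, 8, 1, 16],
    [674, 2004, 8, 2, 10],
    [631, 2004, 8, 2, 11],
    [1, 2004, 8, 5, 9],  # made-up row whose date is not in unique_date
]
unique_date = [[2004, 8, 1], [2004, 8, 2], [2004, 8, 3], [2004, 8, 4]]

print(match_rows(date_all, unique_date))  # [0, 0, 1, 1, None]
```

Every row gets its own result, including repeated dates, which is the behaviour the original triple loop relies on.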
Now we can remove the loop comparing z using a similar approach:
[iz iz1 iz2] = intersect(Trajectories(1,k).dateAll(i1,4),1:24);
We have to be careful about our indices here, using a subset of a subset.
This simplifies the code to:
for k=1:nCols
    if isempty(s(1,k).places)
        continue; % skip to the next value of k, no need to do the rest of the comparison
    end
    [ia i1 i2]=intersect(Trajectories(1,k).dateAll(:,1:3),Trajectories(1,k).uniqueDate(:,1:3),'rows');
    [iz iz1 iz2] = intersect(Trajectories(1,k).dateAll(i1,4),1:24);
    usescalarlabel = (size(s(1,k).places.all,2)>=size(Trajectories(1,k).uniqueDate,1));
    if (usescalarlabel)
        Trajectories(1,k).label(i1(iz1)) = s(1,k).places.all(iz,i2(iz1));
    else
        % you will need to check this: I think here you were needlessly repeating this step for every match
        Trajectories(1,k).label(i1(iz1)) = s(1,k).places.all(iz,:);
    end
end
But wait! That z loop is exactly the same as using indexing. So we don't need that second intersect after all:
for k=1:nCols
    if isempty(s(1,k).places)
        continue; % skip to the next value of k, no need to do the rest of the comparison
    end
    [ia i1 i2]=intersect(Trajectories(1,k).dateAll(:,1:3),Trajectories(1,k).uniqueDate(:,1:3),'rows');
    usescalarlabel = (size(s(1,k).places.all,2)>=size(Trajectories(1,k).uniqueDate,1));
    label_indices = Trajectories(1,k).dateAll(i1,4);
    if (usescalarlabel)
        Trajectories(1,k).label(label_indices) = s(1,k).places.all(label_indices,i2);
    else
        % you will need to check this: I think here you were needlessly repeating this step for every match
        Trajectories(1,k).label(label_indices) = s(1,k).places.all(label_indices,:);
    end
end
You'll need to check the indexing in this - I'm sure I've made a mistake somewhere without having data to test against, but that should give you an idea on how to proceed removing the loops and using vector expressions instead. Without seeing the data that's as far as I can optimise. You may be able to go further if you can reformat your data into a set of 3d matrices / cells instead of using structs.
I am suspicious of your condition which I have called "usescalarlabel" - it seems like you are mixing two data types. Also I would strongly recommend separating the dateAll matrices into separate "date" and "data" matrices as the row indices 4 onwards don't seem to be dates. Also the example you copy/pasted in seems to have an extra value at row index 1? In that case you'll need to compare Trajectories(1,k).dateAll(:,2:4) instead of Trajectories(1,k).dateAll(:,1:3).
Good luck.

Where is my mistake in this answer to Project Euler #58?

I am solving Project Euler problem 58. Here a square is created by starting with 1 and spiralling anticlockwise in the following way (shown here with a side length of 7):
37 36 35 34 33 32 31
38 17 16 15 14 13 30
39 18 5 4 3 12 29
40 19 6 1 2 11 28
41 20 7 8 9 10 27
42 21 22 23 24 25 26
43 44 45 46 47 48 49
The task is to keep spiralling around the square and find the side length at which the ratio of primes on the diagonals to the total count of numbers on the diagonals first falls below 0.10.
I am convinced I have the solution with the code below (see code comments for clarification), but the site states that the answer is wrong when I enter it.
require 'prime'
# We use a mathematical derivation of the corner values, keep increasing the value till we find a ratio smaller
# than 0.10 and increase the grid_size and amount of numbers on diagonals each iteration
side_length = 3 # start with grid size of 3x3 so that we do not get into trouble with 1x1 grid
prime_count = 3 # 3, 5, 7 are prime and on a diagonal in a 3x3 grid
diagonal_size = 5
prime_ratio = 1 # dummy value bigger than 0.10 so we can start the loop
while prime_ratio >= 0.10
  # Add one to prime count for each corner if it is prime
  # Corners are given by n**2 (bottom right), n**2-n+1, n**2-2n+2, and n**2-3n+3
  prime_count += 1 if (side_length**2).prime?
  prime_count += 1 if (side_length**2-side_length+1).prime?
  prime_count += 1 if (side_length**2-2*side_length+2).prime?
  prime_count += 1 if (side_length**2-3*side_length+3).prime?
  # Divide amount of primes counted by the diagonal length to get prime ratio
  prime_ratio = prime_count/diagonal_size.to_f
  # Increase the side length by two (full spiral) and diagonal size by four
  side_length += 2 and diagonal_size += 4
end
puts side_length-2 #-2 to account for last addition in while-loop
# => 26612
The site is probably right and my code wrong. I have been stuck on this problem for quite some time now. Can anyone point out the mistake?
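side_length += 2 and diagonal_size += 4 should be at the beginning of the loop. As written, prime_count = 3 and diagonal_size = 5 already describe the 3x3 square, yet the first iteration tests the corners of that same square (9, 7, 5, 3) again, so 3, 5 and 7 are counted twice. Once the increment is moved, print side_length itself rather than side_length-2.
I couldn't check this, as I do not have Ruby installed, but I can reproduce the same problem in my Python solution.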
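A minimal Python sketch of the corrected loop order follows. It is not the answerer's actual solution: `is_prime` and `first_side_below` are names invented for this illustration, primality is tested by plain trial division, and the threshold is a parameter so that the logic can be checked quickly on a small case.

```python
def is_prime(n):
    """Trial division; adequate for the sizes involved here."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def first_side_below(threshold):
    """Smallest odd side length whose diagonal prime ratio is below threshold."""
    side = 1
    prime_count = 0
    diagonal_size = 1  # the centre value 1
    while True:
        # Grow the spiral first, then count the corners of the new layer.
        side += 2
        diagonal_size += 4
        square = side * side
        # The corner side**2 is a perfect square, so only three need testing.
        for k in (1, 2, 3):
            if is_prime(square - k * (side - 1)):
                prime_count += 1
        if prime_count < threshold * diagonal_size:
            return side


print(first_side_below(0.5))  # 11
```

Starting from the 1x1 square avoids hard-coding the 3x3 counts, which is where the double counting crept in. Calling `first_side_below(0.10)` corresponds to the puzzle's setting and takes noticeably longer with trial division.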

Why isn't this valid USPS tracking number validating according to their spec?

I'm writing a gem to detect tracking numbers (called tracking_number, natch). It searches text for valid tracking number formats, and then runs those formats through the checksum calculation as specified in each respective service's spec to determine valid numbers.
The other day I mailed a letter using USPS Certified Mail, got the accompanying tracking number from USPS, and fed it into my gem and it failed the validation. I am fairly certain I am performing the calculation correctly, but have run out of ideas.
The number is validated using USS Code 128 as described in section 2.8 (page 15) of the following document: http://www.usps.com/cpim/ftp/pubs/pub109.pdf
The tracking number I got from the post office was "7196 9010 7560 0307 7385", and the code I'm using to calculate the check digit is:
def valid_checksum?
  # tracking number doesn't have spaces at this point
  chars = self.tracking_number.chars.to_a
  check_digit = chars.pop
  total = 0
  chars.reverse.each_with_index do |c, i|
    x = c.to_i
    x *= 3 if i.even?
    total += x
  end
  check = total % 10
  check = 10 - check unless (check.zero?)
  return true if check == check_digit.to_i
end
According to my calculations based on the spec provided, the last digit should be a 3 in order to be valid. However, Google's tracking number auto detection picks up the number fine as is, so I can only assume I am doing something wrong.
From my manual calculations, it should match what your code does:
posn: 20 19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2  sum mult
even:  7     9     9     1     7     6     0     0     7     8   54  162
odd:      1     6     0     0     5     0     3     7     3      25   25
                                                                     ===
                                                                     187
Hence the check digit should be three.
If that number is valid, then they're using a different algorithm to the one you think they are.
I think that might be the case since, when I plug the number you gave into the USPS tracker page, I can see its entire path.
In fact, if you look at publication 91, the Confirmation Services Technical Guide, you'll see it uses two extra digits, including the 91 at the front for the tracking application ID. Applying the algorithm found in that publication gives us:
posn: 22 21 20 19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2  sum mult
even:  9     7     9     9     1     7     6     0     0     7     8   63  189
odd:      1     1     6     0     0     5     0     3     7     3      26   26
                                                                           ===
                                                                           215
and that would indeed give you a check digit of 5. I'm not saying that's the answer but it does match with the facts and is at least a viable explanation.
Probably your best bet would be to contact USPS for the information.
I don't know Ruby, but it looks as though you're multiplying by 3 at each even number; and the way I read the spec, you sum all the even digits and multiply the sum by 3. See the worked-through example pp. 20-21.
(later)
Your code may be right. This Python snippet prints a checksum implying a check digit of 7 for their example, and of 3 for yours:
#!/usr/bin/python
'check tracking number checksum'
import sys
def check(number = sys.argv[1:]):
    to_check = ''.join(number).replace('-', '')
    print(to_check)
    even = sum(map(int, to_check[-2::-2]))
    odd = sum(map(int, to_check[-3::-2]))
    print(even * 3 + odd)
if __name__ == '__main__':
    check(sys.argv[1:])
[added later]
just completing my code, for reference:
jcomeau@intrepid:~$ /tmp/track.py 7196 9010 7560 0307 7385
False
jcomeau@intrepid:~$ /tmp/track.py 91 7196 9010 7560 0307 7385
True
jcomeau@intrepid:~$ /tmp/track.py 71123456789123456787
True
jcomeau@intrepid:~$ cat /tmp/track.py
#!/usr/bin/python
'check tracking number checksum'
import sys
def check(number):
    to_check = ''.join(number).replace('-', '')
    even = sum(map(int, to_check[-2::-2]))
    odd = sum(map(int, to_check[-3::-2]))
    checksum = even * 3 + odd
    checkdigit = (10 - (checksum % 10)) % 10
    return checkdigit == int(to_check[-1])
if __name__ == '__main__':
    print(check(''.join(sys.argv[1:]).replace('-', '')))
