I have data from a study for a "device". Subjects receive an implant and are followed for 5 years, with a checkup every year. A "disability score" is calculated from a survey every visit. The implant comes in different "sizes". Also, every year the "device condition" is evaluated. What I'm trying to do is see if "device condition" is associated with "disability score" while controlling for implant size/time/subject.
The Data:
I have the subject id, implant (A and B for sizes and is a static/baseline variable), the timepoint (12, 24, 36, 48, 60 months), implant condition (ordinal scale where 0 is best condition and can change timepoint to timepoint within each subject) and the disability score (dependent quantitative variable).
Example shown below
data modelling;
input sid $ implant $ timepoint condition disability_score;
cards;
1 A 12 0 8
1 A 24 0 9
1 A 36 1 9.5
1 A 48 1 8
1 A 60 2 9
2 B 12 1 7
2 B 24 2 6
2 B 36 2 7
2 B 48 3 8
2 B 60 4 9
.....
15 A 12 0 8
15 A 24 0 9
15 A 36 1 9.5
15 A 48 1 8
15 A 60 2 9
;
run;
I'm using proc mixed, but I'm not sure which syntax would be most appropriate for this question. My initial inkling is this is a repeated measures situation and used this:
proc mixed data=modelling plots=none;
class sid implant timepoint condition;
model disability_score = implant condition timepoint condition*timepoint;
repeated timepoint / subject=sid;
run;
However, one thing I'm concerned with is if timepoint needs/should be a quantitative value, and if this is truly a repeated measures situation. When I run this, the condition and implant seem to be significant, and timepoint is not.
I also tried using the random statement with the subject id instead and using timepoint as quantitative
proc mixed data=modelling plots=none;
class sid implant condition;
model disability_score = implant condition timepoint condition*timepoint;
random sid;
run;
When I do this, I get vastly different results, with timepoint being significant and both implant and condition no longer significant.
Really, I'm not 100% sure which is the correctly specified model, and am looking for some validation and/or guidance to the correctly specified model for this question.
Related
I have a script for a rock-paper-scissors (RPS) game I am making, and I am trying to generate a random number to determine a series of RPS moves. The logic is as follows:
moves = {}
table.insert(moves, 'rock')
table.insert(moves, 'paper')
table.insert(moves, 'scissors')
currentMoves = {}
math.randomseed(playdate.getSecondsSinceEpoch()) -- game SDK library function that returns seconds since midnight January 1 2000 UTC to initialize new random sequence
math.random(); math.random(); math.random();
-- generates a list of rps moves to display on the screen
function generateMoves(maxMovesLength) -- i set maxMovesLength to 3
currentMoves = {}
for i = 1, maxMovesLength, 1 do
randomNumber = math.random(1, 3)
otherRandomNumber = math.random(1,99) -- even with this, based on the presumption 1~33 is rock, 34~66 is paper, 67~99 is scissors, I get a suspicious number of 3 of the same move)
print(otherRandomNumber)
table.insert(currentMoves, moves[randomNumber])
end
return currentMoves
end
However, I noticed that using the Lua math.random() function, I seem to be getting a statistically unlikely number of series of 3 of the same RPS move. The likelihood of getting 3 of the same move (rock rock rock, paper paper paper, or scissors scissors scissors) should be about 11%, but I am getting sets of 3 much more often.
For example, here is what I got when I set maxMovesLength to 15:
36 -paper
41 -paper
60 -paper
22 -rock
1 -rock
2 -rock
91 -scissors
36 -paper
69 -scissors
76 -scissors
35 -paper
18 -rock
22 -rock
22 -rock
92 -scissors
From this sample, it seems that sets of 3 of a kind are happening much more often than they should be. There are 13 series of 3 moves in this list of 15 moves, and among those 3/13 are three of a kind which would be a probability of about 23%, higher than the expected statistical probability of 11%.
Is this just a flaw in the Lua math library?
It seems that when setting maxMovesLength to a very high number this issue doesn't exist, so I will just call math.random() a bunch of times before I actually use it in my game (more than the 3 times I currently do under randomseed().
I want to create a unique identifier such as a small hash every N minutes but the result should be the same in the N timeframe without storing data.
Examples when the N minute is 10:
0 > 10 = 25ba38ac9
10 > 20 = 900605583
20 > 30 = 6156625fb
30 > 40 = e130997e3
40 > 50 = 2225ca027
50 > 60 = 3b446db34
Between minute 1 and 10 i get "25ba38ac9" but with anything between 10 and 20 i get "900605583" etc.
I have no start/example code because i have no idea or algorithm i can use to create the desired result.
I did not provide a specific tag or language in this question because i am interested in the logic and not the final code. But i appreciate documented code as a example.
Pick your favourite hash-function h. Pick your favourite string sugar. To get a hash at time t, append the euclidean quotient of t divided by N to sugar, and apply h to it.
Example in python:
h = lambda x: hex(abs(hash(x)))
sugar = 'Samir'
def hash_for_N_minutes(t, N=10):
return h(sugar + str(t // N))
for t in range(0, 30, 2):
print(t, hash_for_N_minutes(t, 10))
Output:
0 0xeb3d3abb787c890
2 0xeb3d3abb787c890
4 0xeb3d3abb787c890
6 0xeb3d3abb787c890
8 0xeb3d3abb787c890
10 0x45e2d2a970323e9f
12 0x45e2d2a970323e9f
14 0x45e2d2a970323e9f
16 0x45e2d2a970323e9f
18 0x45e2d2a970323e9f
20 0x334dce1d931e5da8
22 0x334dce1d931e5da8
24 0x334dce1d931e5da8
26 0x334dce1d931e5da8
28 0x334dce1d931e5da8
Weakness of this hash method, and suggested improvement
Of course, nothing stops you from inputting a time in the future. So you can easily answer the question "What will the hash be in exactly one hour ?".
If you want future hashes to be unpredictable, you can combine the value t // N with a real-world value that's dependent on the time, not known in advance, but for which we keep records.
There are two well-known time-series that fit this criterion: values related to the meteo, and values related to the stock market.
See also this 2008 xkcd comic: https://xkcd.com/426/
I need help in a particular issue with Stata. I have a panel dataset by id year from 1996 to 2018.
The panel data is a combination of world countries and regions, yearly observations, for 7 different crops, area cultivated.
I would like to create a mean around years 2000, 2010 and 2018, so that mean(year2000)= mean of (1999+2000+2001), mean(year2010)=mean from (2009+2010+2011) and mean(year2018)= mean from (2016+2017+2018) for every crop from my 7 crops selection.
Then the problem is even more complicated when I need to combine some countries to form sub-regions: say I need the sub-region RUS1 = Russia + Ukraine. How can I create another variable that shows the total from crop1 between crop1 area cultivated in Russia + crop1 area cultivated in Ukraine on yearly basis. Meaning another variable that shows these sums for each year using the above means.
I've tried with by id year: egen area_rus1=total(area) if area=="Russia" & area=="Ukraine"
but nothing works.
The names of area being strings I used encode (area), gen (area2) and automatically Stata generates a number.
In order to create a panel dataset i've used gen id=area2+itemcode
The panel data looks like this after sort year
Please be aware that the period is 1996-2018. The example above shows only year 1996.
This didn't get much of a response, for several reasons:
You didn't show very much code.
You didn't show data in a form that is especially useful. An image can't be copied and pasted easily into someone's Stata to allow experiment. In fact your image shows variables that are irrelevant and variables that are different versions of each other and so is much more complicated than we need.
You escalated the question to ask the most complicated version of what you want to know.
There is a problem you should have explained better. area is string and so totals can't be calculated at all and area2 is just arbitrary integers so totals can be calculated but don't make sense. "nothing works" is not informative as a problem report. The only totals that make sense to me are totals of value.
You need to simplify your problem first and then build up.
The essence seems to be as follows:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str2 country str6 item float year str1 region float value
"A" "barley" 1999 "X" 1
"B" "barley" 1999 "X" 2
"C" "barley" 1999 "Y" 3
"A" "barley" 2000 "X" 4
"B" "barley" 2000 "X" 5
"C" "barley" 2000 "Y" 6
"A" "barley" 2001 "X" 7
"B" "barley" 2001 "X" 8
"C" "barley" 2001 "Y" 9
end
* means by countries: similar variables for other periods
egen mean_9901_c = mean(cond(inrange(year, 1999, 2001), value, .)), by(country item)
* aggregation to regions, but ensure that you don't double count
egen value_region = total(value), by(region item year)
egen tag = tag(region item year)
* means by regions: similar variables for other periods
egen mean_9901_r = mean(cond(tag == 1 & inrange(year, 1999, 2001), value_region, .)), by(region item)
list, sepby(year)
+---------------------------------------------------------------------------------+
| country item year region value mean_9~c value_~n tag mean_9~r |
|---------------------------------------------------------------------------------|
1. | A barley 1999 X 1 4 3 1 9 |
2. | B barley 1999 X 2 5 3 0 9 |
3. | C barley 1999 Y 3 6 3 1 6 |
|---------------------------------------------------------------------------------|
4. | A barley 2000 X 4 4 9 1 9 |
5. | B barley 2000 X 5 5 9 0 9 |
6. | C barley 2000 Y 6 6 6 1 6 |
|---------------------------------------------------------------------------------|
7. | A barley 2001 X 7 4 15 1 9 |
8. | B barley 2001 X 8 5 15 0 9 |
9. | C barley 2001 Y 9 6 9 1 6 |
+---------------------------------------------------------------------------------+
The example shows just one item, but the code should work for several.
The example shows fake data for just three years, but means for other periods can be constructed similarly.
Results are repeated for all observations to which they apply. To see or use results just once, use if. For example the means over 1999 to 2001 are shown for each of those years (and others) but if year == 1999 would be a way to see results just once.
See also help collapse, help egen for its tag() function and this paper.
What was wrong with your code
Your problems start with
if area=="Russia" & area=="Ukraine"
which selects observations for which it is true that area is both "Russia" and "Ukraine" in the same observation, which is impossible. You need the | (or) operator there, not the & operator, or to approach the problem in another way.
The prefix id is wrong too. Using by id: enforces separate calculations for different values of id and is going to make the combinations of identifiers impossible.
I'm trying to play around with Bayesian updating, and have a situation in which I am using a posterior from previous runs as a prior. This is a 2D prior on alpha and beta, for which I have traces, alphatrace and betatrace. So I stack them and use code adopted from https://gist.github.com/jcrudy/5911624 to make a KDE based stochastic.
#from https://gist.github.com/jcrudy/5911624
def KernelSmoothing(name, dataset, bw_method=None, observed=False, value=None):
'''Create a pymc node whose distribution comes from a kernel smoothing density estimate.'''
density = gaussian_kde(dataset, bw_method)
def logp(value):
#print "VAL", value
d = density(value)
if d == 0.0:
return float('-inf')
return np.log(d)
def random():
result = None
sample=density.resample(1)
#print sample, sample.shape
result = sample[0][0],sample[1][0]
return result
if value == None:
value = random()
dtype = type(value)
result = pymc.Stochastic(logp = logp,
doc = 'A kernel smoothing density node.',
name = name,
parents = {},
random = random,
trace = True,
value = None,
dtype = dtype,
observed = observed,
cache_depth = 2,
plot = True,
verbose = 0)
return result
Note that the critical thing here is to obtain 2-values from the joint prior: this is why i need a 2-D prior and not two 1-D priors.
The model itself is so:
ctrace=np.vstack((alphatrace, betatrace))
cnew=KernelSmoothing("cnew", ctrace)
#pymc.deterministic
def alphanew(cnew=cnew, name='alphanew'):
return cnew[0]
#pymc.deterministic
def betanew(cnew=cnew, name='betanew'):
return cnew[1]
newtheta=pymc.Beta("newtheta", alphanew, betanew)
newexp = pymc.Binomial('newexp', n=[14], p=[newtheta], value=[4], observed=True)
model3=pymc.Model([cnew, alphanew, betanew, newtheta, newexp])
mcmc3=pymc.MCMC(model3)
mcmc3.sample(20000,5000,5)
In case you are wondering, this is to do the 71st experiment in the hierarchical Rat Tumor example in Chapter 5 in Gelman's BDA. The "prior" I am using is the posterior on alpha and beta after 70 experiments.
But, when I sample, things blow up with the error:
ValueError: Maximum competence reported for stochastic cnew is <= 0... you may need to write a custom step method class.
Its not cnew I care about updating as a stochastic, but rather alphanew and betanew. How ought I be structuring the code to make this error go away?
EDIT: initial model which gave me the posteriors I wish to use as the prior:
tumordata="""0 20
0 20
0 20
0 20
0 20
0 20
0 20
0 19
0 19
0 19
0 19
0 18
0 18
0 17
1 20
1 20
1 20
1 20
1 19
1 19
1 18
1 18
3 27
2 25
2 24
2 23
2 20
2 20
2 20
2 20
2 20
2 20
1 10
5 49
2 19
5 46
2 17
7 49
7 47
3 20
3 20
2 13
9 48
10 50
4 20
4 20
4 20
4 20
4 20
4 20
4 20
10 48
4 19
4 19
4 19
5 22
11 46
12 49
5 20
5 20
6 23
5 19
6 22
6 20
6 20
6 20
16 52
15 46
15 47
9 24
"""
tumortuples=[e.strip().split() for e in tumordata.split("\n")]
tumory=np.array([np.int(e[0].strip()) for e in tumortuples if len(e) > 0])
tumorn=np.array([np.int(e[1].strip()) for e in tumortuples if len(e) > 0])
N = tumorn.shape[0]
mu = pymc.Uniform("mu",0.00001,1., value=0.13)
nu = pymc.Uniform("nu",0.00001,1., value=0.01)
#pymc.deterministic
def alpha(mu=mu, nu=nu, name='alpha'):
return mu/(nu*nu)
#pymc.deterministic
def beta(mu=mu, nu=nu, name='beta'):
return (1.-mu)/(nu*nu)
thetas=pymc.Container([pymc.Beta("theta_%i" % i, alpha, beta) for i in range(N)])
deaths = pymc.Binomial('deaths', n=tumorn, p=thetas, value=tumory, size=N, observed=True)
I use the joint-posterior from this model on alpha, beta as input to the "new model" at top. This also begs the question if I ought to be including theta1..theta70 in the model at top as they will update along with alpha and beta thanks to the new data which is a binomial with n=14, y=4. But I cant even get the little model with only a prior as a 2d sample array working :-(
I found your question since I ran into a similar proble. According to the documentation of pymc.StepMethod.competence, the problem is that none of the built-in samplers handle the dtype associated with the stochastic variable.
I am not sure what needs to be done to actually resolve that. Maybe one of the sampler methods can be extended to handle special types?
Hopefully someone with more pymc mojo can shine a light on what needs to be done..
def competence(s):
"""
This function is used by Sampler to determine which step method class
should be used to handle stochastic variables.
Return value should be a competence
score from 0 to 3, assigned as follows:
0: I can't handle that variable.
1: I can handle that variable, but I'm a generalist and
probably shouldn't be your top choice (Metropolis
and friends fall into this category).
2: I'm designed for this type of situation, but I could be
more specialized.
3: I was made for this situation, let me handle the variable.
In order to be eligible for inclusion in the registry, a sampling
method's init method must work with just a single argument, a
Stochastic object.
If you want to exclude a particular step method from
consideration for handling a variable, do this:
Competence functions MUST be called 'competence' and be decorated by the
'#staticmethod' decorator. Example:
#staticmethod
def competence(s):
if isinstance(s, MyStochasticSubclass):
return 2
else:
return 0
:SeeAlso: pick_best_methods, assign_method
"""
I'm writing a gem to detect tracking numbers (called tracking_number, natch). It searches text for valid tracking number formats, and then runs those formats through the checksum calculation as specified in each respective service's spec to determine valid numbers.
The other day I mailed a letter using USPS Certified Mail, got the accompanying tracking number from USPS, and fed it into my gem and it failed the validation. I am fairly certain I am performing the calculation correctly, but have run out of ideas.
The number is validated using USS Code 128 as described in section 2.8 (page 15) of the following document: http://www.usps.com/cpim/ftp/pubs/pub109.pdf
The tracking number I got from the post office was "7196 9010 7560 0307 7385", and the code I'm using to calculate the check digit is:
def valid_checksum?
# tracking number doesn't have spaces at this point
chars = self.tracking_number.chars.to_a
check_digit = chars.pop
total = 0
chars.reverse.each_with_index do |c, i|
x = c.to_i
x *= 3 if i.even?
total += x
end
check = total % 10
check = 10 - check unless (check.zero?)
return true if check == check_digit.to_i
end
According to my calculations based on the spec provided, the last digit should be a 3 in order to be valid. However, Google's tracking number auto detection picks up the number fine as is, so I can only assume I am doing something wrong.
From my manual calculations, it should match what your code does:
posn: 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 sum mult
even: 7 9 9 1 7 6 0 0 7 8 54 162
odd: 1 6 0 0 5 0 3 7 3 25 25
===
187
Hence the check digit should be three.
If that number is valid, then they're using a different algorithm to the one you think they are.
I think that might be the case since, when I plug the number you gave into the USPS tracker page, I can see its entire path.
In fact, if you look at publication 91, the Confirmation Services Technical Guide, you'll see it uses two extra digits, including the 91 at the front for the tracking application ID. Applying the algorithm found in that publication gives us:
posn: 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 sum mult
even: 9 7 9 9 1 7 6 0 0 7 8 63 189
odd: 1 1 6 0 0 5 0 3 7 3 26 26
===
215
and that would indeed give you a check digit of 5. I'm not saying that's the answer but it does match with the facts and is at least a viable explanation.
Probably your best bet would be to contact USPS for the information.
I don't know Ruby, but it looks as though you're multiplying by 3 at each even number; and the way I read the spec, you sum all the even digits and multiply the sum by 3. See the worked-through example pp. 20-21.
(later)
your code may be right. this Python snippet gives 7 for their example, and 3 for yours:
#!/usr/bin/python
'check tracking number checksum'
import sys
def check(number = sys.argv[1:]):
to_check = ''.join(number).replace('-', '')
print to_check
even = sum(map(int, to_check[-2::-2]))
odd = sum(map(int, to_check[-3::-2]))
print even * 3 + odd
if __name__ == '__main__':
check(sys.argv[1:])
[added later]
just completing my code, for reference:
jcomeau#intrepid:~$ /tmp/track.py 7196 9010 7560 0307 7385
False
jcomeau#intrepid:~$ /tmp/track.py 91 7196 9010 7560 0307 7385
True
jcomeau#intrepid:~$ /tmp/track.py 71123456789123456787
True
jcomeau#intrepid:~$ cat /tmp/track.py
#!/usr/bin/python
'check tracking number checksum'
import sys
def check(number):
to_check = ''.join(number).replace('-', '')
even = sum(map(int, to_check[-2::-2]))
odd = sum(map(int, to_check[-3::-2]))
checksum = even * 3 + odd
checkdigit = (10 - (checksum % 10)) % 10
return checkdigit == int(to_check[-1])
if __name__ == '__main__':
print check(''.join(sys.argv[1:]).replace('-', ''))