problem with power.roc.test in R - roc

I am running several different ROC analyses for my article. Therefore I am investigating whether my sample size is appropriate. I have created a data frame which contains all combinations of possible sample sizes for the ROC analysis.
str(auc)
'data.frame': 93 obs. of 2 variables:
$ cases : int 10 11 12 13 14 15 16 17 18 19 ...
$ controls: int 102 101 100 99 98 97 96 95 94 93 ...
My aim is to create a line plot of cases/controls (i.e. kappa) versus optimal AUC.
Hence I would like to create a third variable using power.roc.test to calculate the optimal AUC.
I ran into the problem below; where does the problem lie?
auc$auc <- power.roc.test(sig.level = .05, power = .8, ncases = auc$cases, ncontrols = auc$controls)$auc
Error in value[[3L]](cond) : AUC could not be solved:
Error in uniroot(power.roc.test.optimize.auc.function, interval = c(0.5, : invalid function value in 'zeroin'
In addition: Warning messages:
1: In if (is.na(f.lower)) stop("f.lower = f(lower) is NA") :
the condition has length > 1 and only the first element will be used
2: In if (is.na(f.upper)) stop("f.upper = f(upper) is NA") :
the condition has length > 1 and only the first element will be used
3: In if (f.lower * f.upper > 0) stop("f() values at end points not of opposite sign") :
the condition has length > 1 and only the first element will be used

I believe you are using the pROC package. The error message is not especially helpful here, but basically you need to pass scalar values, including for ncases and ncontrols.
power.roc.test(sig.level = .05, power = .8, ncases = 10, ncontrols = 102)
You can wrap that in an apply loop:
auc$auc <- apply(auc, 1, function(line) {
  power.roc.test(sig.level = .05, power = .8,
                 ncases = line[["cases"]], ncontrols = line[["controls"]])$auc
})
Then you will be able to plot this however you want:
plot(auc$cases / auc$controls, auc$auc, type = "l")
Note that the AUC here is not an "optimal AUC": it is the AUC at which you can expect the given power, at the given significance level, with the given sample sizes, for a test of the significance of the AUC (H0: AUC = 0.5). Also note that you won't be able to perform that test with pROC anyway.


How to fix skew trapezoidal distribution sampling output sample size

I am trying to generate a skewed trapezoidal distribution using inverse transform sampling.
The inputs are the values where the ramps start and end (a, b, c, d) and the sample size.
a=-3;b=-1;c=1;d=8;
SampleSize=10e4;
h=2/(d+c-a-b);
Then I calculate the ratio of the lengths of the ramps and the flat component to get a sample size for each:
firstramp=round(((b-a)/(d-a)),3);
flat=round((c-b)/(d-a),3);
secondramp=round((d-c)/(d-a),3);
n1=firstramp*SampleSize; %sample size for first ramp
n3=secondramp*SampleSize; %sample size for second ramp
n2=flat*SampleSize;
And then finally I get the histogram from the following code:
quartile1=h/2*(b-a);
quartile2=1-h/2*(d-c);
y1=linspace(0,quartile1,n1);
y2=linspace(quartile1,quartile2,n2);
y3=linspace(quartile2,1,n3);
%inverse cumulative distribution functions
invcdf1=a+sqrt(2*(b-a)/h)*sqrt(y1);
invcdf2=(a+b)/2+y2/h;
invcdf3=d-sqrt(2*(d-c)/h)*sqrt(1-y3);
distr=[invcdf1 invcdf2 invcdf3];
histogram(distr,100)
However the sampling of the ramps and the flat component is not equal; it looks like this:
I fixed this by trial and error, by reducing the sample size of the ramps by half:
n1=0.5*firstramp*SampleSize; %sample size for first ramp
n3=0.5*secondramp*SampleSize; %sample size for second ramp
n2=flat*SampleSize;
This made the distribution look like this:
However this makes the output sample size smaller than the one given as input.
I've also tried different combinations of changing the sample sizes of ramps and flat.
This also works:
n1=0.75*firstramp*SampleSize; %sample size for first ramp
n3=0.75*secondramp*SampleSize; %sample size for second ramp
n2=1.5*flat*SampleSize;
It increases the output sample count, but it's still not close.
Any help will be appreciated.
Full code:
a=-3;b=-1;c=1;d=8;
SampleSize=10e4;%*1.33333333333333;
h=2/(d+c-a-b);
firstramp=round(((b-a)/(d-a)),3);
flat=round((c-b)/(d-a),3);
secondramp=round((d-c)/(d-a),3);
n1=firstramp*SampleSize; %sample size for first ramp
n3=secondramp*SampleSize; %sample size for second ramp
n2=flat*SampleSize;
quartile1=h/2*(b-a);
quartile2=1-h/2*(d-c);
y1=linspace(0,quartile1,.75*n1);
y2=linspace(quartile1,quartile2,1.5*n2);
y3=linspace(quartile2,1,.75*n3);
%inverse cumulative distribution functions
invcdf1=a+sqrt(2*(b-a)/h)*sqrt(y1);
invcdf2=(a+b)/2+y2/h;
invcdf3=d-sqrt(2*(d-c)/h)*sqrt(1-y3);
distr=[invcdf1 invcdf2 invcdf3];
histogram(distr,100)
%end
I don't know Matlab so I was hoping somebody else would jump in on this, but since nobody did here goes.
If I'm reading your code correctly, what you did is not an inversion. Inversion is 1-1, i.e., one uniform input produces one outcome. You seem to be using a technique known as the "composition method". In composition, the overall distribution is composed of component pieces, each of which is straightforward to generate; you choose which component to generate from based on its proportion/probability relative to the whole.
For density functions, probability is found as the area under the density curve, so your first mistake was sampling the components relative to the width of each component rather than using their areas. The correct sampling proportions are 2/13, 4/13, and 7/13 for what you designated the firstramp, flat, and secondramp components, respectively.
A second, relatively minor, mistake was to assign exact sample sizes to each of the components. Having probability 2/13 does not mean that exactly 2*SampleSize/13 of your samples will come from the firstramp; it means that's the expected sample size for that component. The expected value of a random variate is not necessarily (or even likely to be) the outcome you actually get.
In pseudocode, the composition approach would be
generate U ~ Uniform(0,1)
if U <= 2/13:
    generate and return a value from firstramp
else if U <= 6/13:
    generate and return a value from flat
else:
    generate and return a value from secondramp
Note that since each of the generate options will use one or more uniforms, and choosing between the options requires a uniform U, this is not an inversion.
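As a sketch in Python (rather than Matlab), assuming the same a, b, c, d as in the question, the composition approach might look like the following; the 2/13 and 6/13 thresholds come from the component areas stated above, and the function name is made up:

```python
import math
import random

def sample_by_composition():
    """Composition method for the trapezoid with a=-3, b=-1, c=1, d=8:
    pick a component by its area (2/13, 4/13, 7/13), then invert that
    component's own normalized CDF."""
    a, b, c, d = -3.0, -1.0, 1.0, 8.0
    u = random.random()
    if u <= 2 / 13:
        # rising ramp on [a, b]: density proportional to (x - a)
        return a + (b - a) * math.sqrt(random.random())
    elif u <= 6 / 13:
        # flat section on [b, c]: uniform
        return b + (c - b) * random.random()
    else:
        # falling ramp on [c, d]: density proportional to (d - x)
        return d - (d - c) * math.sqrt(random.random())
```

Note that each call consumes two uniforms: one to pick the component and one to sample within it.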
If you want an actual inversion, you need to quantify your density, integrate it to get the cumulative distribution function, then apply the inversion technique by setting F(X) = U and solving for X. Since your distribution is made of distinct components, both the density and cumulative density will be piecewise functions.
After deriving the height based on the requirement that the areas of the two triangles and the flat section must add up to 1, I came up with the following for your density:
| (x + 3) / 13 -3 <= x <= -1
|
f(x) = | 2 / 13 -1 <= x <= 1
|
| 2 * (8 - x) / 91 1 <= x <= 8
Integrating this and collecting terms produces the CDF:
| (x + 3)**2 / 26 -3 <= x <= -1
|
F(x) = | (2 + x) * 2 / 13 -1 <= x <= 1
|
| 6 / 13 + [49 - (x - 8)**2] / 91 1 <= x <= 8
Finally, determining the values of F(x) at the break points between the segments and applying inversion yields the following pseudocode algorithm:
generate U ~ Uniform(0,1)
if U <= 2 / 13:
    return 2 * sqrt( (13 * U) / 2 ) - 3
else if U <= 6 / 13:
    return (13 * U) / 2 - 2
else:
    return 8 - sqrt( 91 * (1 - U) )
Note that this is a true inversion. The outcome is determined by generating a single U, and transforming it in different ways depending on which range it falls in.
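The same pseudocode, transcribed into Python (the function name is made up):

```python
import math

def trapezoid_inverse_cdf(u):
    """True inversion for the trapezoid with a=-3, b=-1, c=1, d=8:
    one uniform u in [0, 1] goes in, one sample comes out."""
    if u <= 2 / 13:
        # first ramp: invert (x + 3)**2 / 26 = u
        return 2 * math.sqrt(13 * u / 2) - 3
    elif u <= 6 / 13:
        # flat section: invert (2 + x) * 2 / 13 = u
        return 13 * u / 2 - 2
    else:
        # second ramp: invert 6/13 + [49 - (x - 8)**2] / 91 = u
        return 8 - math.sqrt(91 * (1 - u))
```

Feeding it linspace-style probabilities (or uniform draws) reproduces the intended distribution without any per-component sample-size bookkeeping.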

Speeding up a Monte Carlo simulation in Matlab

I'm trying to speed up the following Monte Carlo simulation in matlab:
http://pastebin.com/nS0K7XXa
and this is the full result of the matlab profiler
http://i.imgur.com/bGFY5e7.png
I am pretty new to using Matlab, but I have already spent a good deal of time on this and I think I'm missing something somewhere, because I have the feeling that this should run much faster.
I'm concerned about the lines the profiler shows in red, of course... let's start with these:
time calls line code
37.59 19932184 54 radselec = fix(rand(1)*nr) + 1;
4.54 19932184 55 nm = nm - 1;
45.35 19932184 56 Rad2(radselec) = Rad2(radselec) + 1;
I have a very large vector (Rad2) which holds positive integer values, initially they are all zero but as the simulation progresses it fills up.
Line 54 picks a random element of that vector. Every time I add a value to the vector I also increment the variable nr, so nr is basically the number of elements in use, and fix(rand(1)*nr)+1 picks a random index between 1 and nr.
Question 1: Is there a better way of doing this? rand(1) alone seems to take a long time, as you can see from line 26:
31.50 20540616 26 r = rand(1);
Question 2: line 56 also caught my attention... once I have a value for radselec, I need to add 1 to the value of Rad2(radselec).
Now, I thought that doing Rad2(radselec) = Rad2(radselec) + 1; was just as fast as doing nm = nm - 1 (or + 1, for that matter)... but the profiler shows that adding 1 to an element of a vector is 10 times slower.
Question 3:
31.50 20540616 26 r = rand(1);
27
22.72 20540616 28 if r > R1/Rt
3.39 20220062 29 reacselec = 2;
10.80 20220062 30 if r > (R1+R2)/Rt
rand(1) seems to be slow as it is... by definition I need a random number between 0 and 1, so I can't think of another way of speeding that line up.
Now, how come line 28 is two times slower than line 30? They are practically the same line with the same calculation... if anything, line 30 should be slightly slower for having R1+R2 in the numerator instead of just R1.
What's happening there?
And finally,
24.26 20540616 79 end
why is that end statement consuming so much time? How can I fix that?
Thank you for your time, and sorry if these questions are too basic. I just started programming a few months ago, and I do not have a computer science background. I'm thinking of taking some courses, but that's not a priority.
Any help will be very appreciated.

Code 16K barcode - checksum computation

I found 2 documents about this barcode.
Neither of them fully describes how to compute the checksum.
They both just give a formula and don't say which characters to include in the computation.
Also, these documents don't list integer values for the start/stop/pad or other special symbols. So if those are included in the computation, I don't even know their values.
Does anyone know how to compute the checksum?
I found this information here: http://www.gomaro.ch/ftproot/Code%2016k.pdf
and here (more complete): http://www.expresscorp.com/content/express/pdf/IndustrySpecifications/USS-16K.pdf
So this code has 2 check characters, which are calculated as weighted sums of the values of each character, including the start character.
The first check symbol starts the weighting at 2.
The second starts the weighting at 1.
Next, take the sum modulo 107.
So if you had the character values 22, 10, 15, 20, the two checksums would be:
(2*22 + 3*10 + 4*15 + 5*20) % 107
(1*22 + 2*10 + 3*15 + 4*20) % 107
If you have more characters just keep going... a general formula for n symbol characters would be:
C1 = [ sum from i=1 to n-2 of (i+1) * Char(i) ] mod 107
C2 = [ sum from i=1 to n-1 of i * Char(i) ] mod 107   (so this sum includes C1)
Here is an image of the structure of a 16k code :
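A Python sketch of the general formula (the function name is made up; note that, per the formula above, the second weighted sum includes C1, whereas the short worked example omits it, and `values` must already include the start character):

```python
def code16k_checksums(values):
    """Compute the two Code 16K check character values (mod 107).
    `values`: integer character values, including the start character."""
    # C1: weights 2, 3, 4, ... over the data characters
    c1 = sum((i + 2) * v for i, v in enumerate(values)) % 107
    # C2: weights 1, 2, 3, ... over the data characters followed by C1
    c2 = sum((i + 1) * v for i, v in enumerate(values + [c1])) % 107
    return c1, c2
```

For the character values 22, 10, 15, 20 this gives C1 = 234 mod 107 = 20, and C2 is then computed over the five values 22, 10, 15, 20, 20.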

What's the reasoning behind odd integers rounding down when divided by 2?

Wouldn't it be better to see an error when dividing an odd integer by 2 than an incorrect calculation?
Example in Ruby (I'm guessing it's the same in other languages, because ints and floats are common datatypes):
39 / 2 => 19
I get that the output isn't 19.5 because we're asking for the value of an integer divided by an integer, not a float (39.0) divided by an integer. My question is, if the limits of these datatypes inhibit it from calculating the correct value, why output the least correct value?
Correct = 19.5
Correct-ish = 20 (rounded up)
Least correct = 19
Wouldn't it be better to see an error?
Throwing an error would usually be extremely counter-productive, and computationally inefficient in most languages.
And consider that this is often useful behaviour:
total_minutes = 563;
hours = total_minutes / 60;
minutes = total_minutes % 60;
Correct = 19.5
Correct-ish = 20 (rounded up)
Least correct = 19
Who said that 20 is more correct than 19?
Among other reasons to keep the following very useful relationship between the sibling operators of division and modulo.
Quotient: a / b = Q
Remainder: a % b = R
Awesome relationship: a = b*Q + R.
Also so that integer division by two returns the same result as a right shift by one bit and lots of other nice relationships.
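For example, in Python (whose // floors the quotient, matching C's truncation for positive operands), both the quotient-remainder identity and the shift equivalence are easy to check:

```python
total_minutes = 563
hours, minutes = divmod(total_minutes, 60)      # Q = 9, R = 23
assert total_minutes == 60 * hours + minutes    # a == b*Q + R always holds

# integer division by two matches a right shift by one bit
assert 39 // 2 == 39 >> 1 == 19
```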
But the secret, main reason is that C did it this way, and you simply don't argue with C!
If you divide by 2.0, you get the correct result in Ruby.
39 / 2.0 => 19.5

Is there such a thing as a randomly accessible pseudo-random number generator? (preferably open-source)

First off, is there such a thing as a random-access random number generator, where you could not only sequentially generate random numbers as we're all used to (assuming rand100() always generates a value from 0-100):
for (int i=0;i<5;i++)
print rand100()
output:
14
75
36
22
67
but also randomly access any random value like:
rand100(0)
would output 14 as long as you didn't change the seed
rand100(3)
would always output 22
rand100(4)
would always output 67
and so on...
I've actually found an open-source generator algorithm that does this, but you cannot change the seed. I know that pseudorandomness is a complex field; I wouldn't know how to alter it to add that functionality.
Is there a seedable random-access random number generator, preferably open source? Or is there a better term for this that I can google for more information?
If not, part 2 of my question would be: is there any reliably random, open-source, conventional seedable pseudorandom number generator that I could port to multiple platforms/languages while retaining a consistent sequence of values on each platform for any given seed?
I've not heard of anything like that, but it seems to me you could use a decent hash and write a wrapper function that takes a seed value and your 'index', and runs them through the hash function. I'm not sure about the randomness of the bits output by various cryptographic hash functions, but I imagine someone has taken a look at that.
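A minimal sketch of that wrapper in Python, using SHA-256 (the name rand100 and the 0-100 range follow the question; this is illustrative, not a vetted construction):

```python
import hashlib

def rand100(seed, index):
    """Random-access generator: hash (seed, index) into a value in 0..100.
    Deterministic: the same seed and index always give the same value."""
    digest = hashlib.sha256(f"{seed}:{index}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 101
```

Changing the seed reseeds the entire "sequence" at once, and any index can be sampled in constant time without generating the values before it.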
The PCG family of pseudo-random number generators can jump forward and backward in logarithmic time (i.e. jumping forward 1000 numbers requires O(log(1000)) operations), which is probably good enough to be considered random access. The reference C and C++ implementations both support this feature.
The table on the front page of the PCG site indicates that a number of other generators can support jump-ahead, but I've not seen it in any implementations.
Blum Blum Shub is a pseudorandom number generator with a seed and random access to any value it generates.
Thanks for all the replies.
Also, for anyone who happens upon this with a similar question: I found a solution that isn't exactly what I asked for, but fits the bill for my purposes.
It is a perlin noise class that can be found here.
I'm not sure how computationally complex this is relative to a conventional random number generator, which is a concern, since one of the planned platforms is Android. Also, perlin noise isn't the same thing as pseudorandomness, but from what I can tell, a high octave and/or frequency value should provide suitable randomness for non-cryptographic purposes, where the statistical level of true randomness isn't as important as the mere appearance of randomness.
This solution allows seeding, and also allows sampling a random set from any point, in other words, random access randomness.
Here's an example set of regular C++ randomness (rand()%200) in the left column for comparison, and Perlin noise (with the equivalent of %200) in the right:
91 , 100
48 , 97
5 , 90
93 , 76
197 , 100
97 , 114
132 , 46
190 , 67
118 , 103
78 , 96
143 , 110
187 , 108
139 , 79
69 , 58
156 , 81
123 , 128
84 , 98
15 , 105
178 , 117
10 , 82
13 , 110
182 , 56
10 , 96
144 , 64
133 , 105
both were seeded to 0
the parameters for the perlin noise were
octaves = 8
amplitude = 100
frequency = 9999
width/height = 10000,100
the sequential sampling order for the perlin noise was simply
for (int i=0;i<24;i++)
floor(Get(i,i)+100);
//amplitude 100 generates noise between -100 and 100,
//so adding 100 generates between 0 and 200
Once I read a really good blog post from a guy who used to work at Google, which answered a question very similar to yours.
In short, the answer was to use a block cipher with a random number as the encryption key, and the index of the number you want in the sequence as the data to be encrypted. He mentioned a cipher which can work on blocks of any size (in bits), which is convenient -- I'd have to search for the blog to find the name of the cipher.
For example: say you want a random shuffling of integers from 0 to (2^32)-1. You can achieve that using a block cipher which takes 4 bytes input, and returns 4 encrypted bytes. To iterate over the series, first "encrypt" a block of value 0, then 1, then 2, etc. If you only want the 1 millionth item in the shuffled sequence, you just encrypt the number 1,000,000.
The "random sequences" you will get using a cipher are different from what you would get using a hash function (as #MichaelBurr suggested). Using a cipher, you can get a random permutation of a range of integers, and sample any item in that permutation in constant time. In other words, the "random numbers" won't repeat. If you use a hash function, the numbers in the sequence may repeat.
Having said all this, #MichaelBurr's solution is more appropriate for your situation, and I would recommend you use it.
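The block-cipher idea can be sketched with a toy Feistel network in Python (an illustrative construction, not the cipher the blog post mentions). Because each Feistel round is invertible, the mapping is a permutation of the 32-bit indices, so outputs never repeat:

```python
import hashlib

def feistel_permute(index, key, rounds=4):
    """Map a 32-bit index to a unique 32-bit output: encrypting the index
    with a keyed permutation gives constant-time access to any position
    of a non-repeating 'shuffled' sequence."""
    mask = (1 << 16) - 1
    left, right = (index >> 16) & mask, index & mask
    for r in range(rounds):
        # round function: hash of (key, round, right half), truncated to 16 bits
        f = int.from_bytes(
            hashlib.sha256(f"{key}:{r}:{right}".encode()).digest()[:2], "big")
        left, right = right, left ^ f
    return (left << 16) | right
```

The key plays the role of the seed: a different key yields a different permutation of 0 to 2^32 - 1.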
One way of achieving this is to synthesize a larger amount of random data from a smaller set. One way of doing that is to have three arrays of pre-generated random data, each with a prime number of entries.
To produce our random numbers we imagine each of these one-time pads looped infinitely and sampled incrementally; we combine the data from each of them using xor.
#define PRIME1 7001
#define PRIME2 7013
#define PRIME3 7019

static int pad1[PRIME1];
static int pad2[PRIME2];
static int pad3[PRIME3];

static void random_no_init (void)
{
    static int initialized = 0;
    int i;

    if (initialized)
        return;
    for (i = 0; i < PRIME1; i++) pad1[i] = random ();
    for (i = 0; i < PRIME2; i++) pad2[i] = random ();
    for (i = 0; i < PRIME3; i++) pad3[i] = random ();
    initialized = 1;
}

int random_no (int no)
{
    random_no_init ();
    return pad1[no % PRIME1] ^ pad2[no % PRIME2] ^ pad3[no % PRIME3];
}
The code sample above shows a simple example that yields 344,618,953,247 (= 7001 × 7013 × 7019) randomly accessible entries. To ensure reproducible results between runs, you should provide the random number generator with a seed. A more complex system built on the same principle, with seed variation based on picking different primes, can be found at http://git.gnome.org/browse/gegl/tree/gegl/gegl-random.c
All the generators I'm aware of are iterative, so any 'random access' would involve calculating all the values from the first up to the one you ask for.
The closest you could come is to take a fixed seed, hash it, and then hash the index value, using something that mixes really enthusiastically.
Or generate a long list of them and store it.
Take a look at this patent:
Random-access psuedo random number generator
https://patents.google.com/patent/US4791594
It uses a multi-stage bit-scrambling scheme to generate a pseudo-random number sequence that can be accessed randomly.
The idea is to use the input address as control bits to scramble multiple seed numbers, XOR the results to produce an output, and then run a second pass of scrambling using the result of the first pass.
