What are the best SAS PROC SURVEYSELECT options for accomplishing a semi-controlled random set?

The scenario I'm working with is creating a macro that takes in a data set and produces a random stratified sample. The stratification should be by the column STATE, with equal representation per state (when possible) in the sample.
The required sample size follows set rules that we have to abide by:
If the total data set size is <= 50 then let the sample size = the entire data set
Else if the total data set size is between 51 and 500 then let the sample size = 50
Else if the total data set size is between 501 and 999 then let the sample size = 10% of the total data set size (n*.10) given that n = the total data set size.
Else if the total data set size is > 999 then let the sample size = 100
SAMPLESIZE is currently defined in code as:
/*sets sample size in accordance to standards*/
%if &num>=0 and &num<=50 %then %let samplesize=&num;
%else %if &num<501 %then %let samplesize=50;
%else %if &num<1000 %then %let samplesize=%sysevalf((&num*.10),ceil);
%else %let samplesize=100;
The data set I used for testing has 550 total records (so the required sample size would be 55), with the states contributing the following counts:
IN = 100
KY = 217
MO = 189
OH = 8
WI = 36
Applying the STRATA option for SURVEYSELECT works great when each state has at least the number of records needed per stratum; in this case the sample size for each stratum would be 11.
You can see that the OH stratum does not satisfy that minimum, since there are only 8 OH records in the data set, leading to the following error:
ERROR: The sample size, 11, is greater than the number of sampling units, 8.
UPDATE (7/14/21): I was able to resolve the error by using the SELECTALL option, and I was able to grab records from other states to fill in the shortfall for OH using the ALLOC option on STRATA, so my updated SURVEYSELECT statement now looks like this:
```
PROC SURVEYSELECT DATA=UniqueList OUT=UniqueListsamp METHOD=SRS SAMPSIZE=&samplesize
                  SELECTALL NOPRINT;
  STRATA PROVIDER_STATE / ALLOC=(.2 .2 .2 .2 .2);
RUN;
```
What I would like to achieve is to make the ALLOC option work for any number of states found in the input file. My understanding is that the option requires hard-coded decimals that add up to 1, one per stratum (in this case 5 strata, so 1/5 gives five instances of .2 that sum to 1). This works great if we know the total number of states ahead of time, but that will not be the case when the code gets implemented for use. Is there a way to do the calculation (1 / number of states = .2) and then pass that value into the ALLOC option as many times as there are states, separated by commas or spaces (.2 .2 .2 .2 .2)?

You can pass a dataset as the argument to SAMPSIZE in surveyselect. I think that's what you need here.
Taking your counts as a starting point, I first just create a dataset matching your actual input. Then I run a tabulate to get your counts back. Then I parse the tabulate to figure out how many to pull, and how many per state, and make sure it's not asking for too many. This gives us a first idea of what's going to be pulled per state, and gives us a dataset that lets us modify that number.
The question of how to pull those last 3 records is the complicated part, because it's not straightforward: how do you want to pull them? Should you pick the states to add one to "randomly"? What if a state only had 1 record left, and you actually want 3 more per state? It gets a bit messy, and if you're not doing this frequently, it might be easier to just do it analytically. A proper system will have detailed checks, several passes, and the assumption that everything that can go wrong, will.
In this example I just go ahead and take the "extra", so I sample 56. That gets you very close to your desired sample while sticking evenly to your sampling-plan ratios and not having different amounts per state (among the states that can supply them). If you want to sample exactly 55, you need to decide which states get that 12th record: the 3 largest states? Three random states? Up to you, but the work is similar.
data for_gen;
  input state $ count;
  do id = 1 to count;
    state_id = cats(state,put(id,z3.));
    output;
  end;
  keep state state_id;
  datalines;
IN 100
KY 217
MO 189
OH 8
WI 36
;;;;
run;
* create a listing, including the overall row (which will be on top);
proc tabulate data=for_gen out=state_counts(keep=state n);
  class state;
  table (all state),n;
run;
* now distribute the sample, first pass;
data sample_counts;
  set state_counts nobs=statecount end=eof;
  retain total_sample sample_per_state states_left;
  if _n_ = 1 then do;
    * the sample size rules;
    if n lt 500 then total_sample = min(50,n);
    else total_sample = min(100,floor(n/10));
    * how many per state;
    sample_per_state = ceil(total_sample/(statecount-1)); * or maybe floor?;
  end;
  else do;
    * here we are in the per-state section;
    _NSIZE_ = min(n,sample_per_state);
    * allocate sample amounts, remove the used sample quantity from the total quantity, and keep track of how many states still have sample remaining;
    total_sample = total_sample - _NSIZE_;
    if n ne _nsize_ then states_left+1;
  end;
  * save the remaining info in macro variables to use later;
  if eof then do;
    call symputx('sample_left',total_sample);
    call symputx('states_left',states_left);
  end;
  if state ne ' ' then output;
run;
* allocate the remaining sample - we assume we want "at least" the sample count;
data sample_secondpass;
  set sample_counts end=eof;
  retain total_sample_left &sample_left.
         total_states_left &states_left.
         leftover 0
  ;
  if total_sample_left gt 0 and total_states_left gt 0 then do;
    per_state = ceil(total_sample_left/total_states_left);
    if n gt (_nsize_ + per_state) then do;
      _nsize_ = _nsize_ + per_state;
    end;
    else do;
      leftover = leftover + (_nsize_ + per_state - n);
      _nsize_ = n;
    end;
  end;
  if eof then call symputx('leftover',leftover);
run;
* Use the sample counts dataset to run the surveyselect;
proc surveyselect sampsize=sample_secondpass data=for_gen;
  strata state;
run;

Related

Algorithm to convert offset pagination to page number pagination

I have a service that must receive pagination queries in offset format, receiving offset and limit parameters. For example, if I receive offset=5&limit=10, I would expect to receive items 5-14 back. I am able to enforce some validation against these parameters, for example to set a maximum value for limit.
My data source must receive pagination requests in page number format, receiving page_number and page_size parameters. For example, if I send page_number=0&page_size=20, I would receive items 0-19. The data source has a maximum page_size of 100.
I need to be able to take the offset pagination parameters I receive and use them to determine appropriate values for the page_number and page_size parameters, in order to return a range from the data source that includes all of the items I need. Additional items may be returned padding out the start and/or end of the range, which can then be filtered out to produce the requested range.
If possible, I should only make a single request to the data source. Optionally, performance can be improved by minimising the size of the range to be requested from the datasource (i.e. fetching 10 items to satisfy a request for 8 items is more efficient than requesting 100 items for the same).
This feels like it should be relatively simple to achieve, but my attempts at simple mathematical solutions don't address all of the edge cases, and my attempts at more robust solutions have started to head into the more complex space of calculating and iterating over factors etc.
Is there a simple way to calculate the appropriate values?
I've put together a test harness REPL with a set of example test cases that makes it easy to trial different implementations here.
An appropriate implementation is to start at page_size=limit, and then increment page_size until there's a single page that contains the whole range from offset to offset+limit.
If you're thinking that you don't want to waste time iterating, then consider that the time taken by this method is at most proportional to the size of the result set, and it's completely insignificant compared to the time you'll take reading, marshaling, unmarshaling, and processing the results themselves.
I've tested all combinations with offset+limit <= 100000. In all cases, page_size <= 2*limit+20. For large limits, the worst case overhead always occurs when limit=offset+1. At some point it will become more efficient to make 2 requests. You should check for that.
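A minimal Python sketch of that incremental search (the 100 cap is the data source's maximum page_size from the question; the function name is just for illustration):
```python
def offset_to_page(offset, limit, max_page_size=100):
    # Grow page_size from limit upward until one page covers [offset, offset + limit).
    for page_size in range(limit, max_page_size + 1):
        page_number = offset // page_size
        page_start = page_number * page_size
        if offset + limit <= page_start + page_size:
            return page_number, page_size
    return None  # no single page can cover the range; fall back to two requests

# offset=5, limit=10 -> needs items 5..14; one page of 15 (items 0..14) covers it.
print(offset_to_page(5, 10))   # (0, 15)
```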
How about this ?
if (limit > 100) {
    // return error for exceeding the data source's maximum page size...
}
mod_offset = (offset % limit)
page_number = (offset / limit);   // integer division
page_size = limit;
// Sample test cases ..
// offset=25, limit=20 .. mod_offset = 5
(a) page_number = 1, page_size = 20 // skip first 'n' values equal to 'mod_offset'
(b) page_number = 1+1 = 2, page_size = 20 // include only first 'n' values equal to 'mod_offset'
// offset=50, limit=25 .. mod_offset = 0
(a) page_number = 2, page_size = 25 // if offset is multiple of limit, no need to fetch twice...
// offset=125, limit=20 .. mod_offset = 5
(a) page_number = 6, page_size = 20 // skip first 'n' values equal to 'mod_offset'
(b) page_number = 6+1 = 7, page_size = 20 // include only first 'n' values equal to 'mod_offset'

Sorted Two-Way Tabulation of Many Values

I have a decent-sized dataset (about 18,000 rows). I have two variables that I want to tabulate, one taking on many string values, and the second taking on just 4 values. I want to tabulate the string values by the 4 categories. I need these sorted. I have tried several commands, including tabsort, which works, but only if I restrict the number of rows it uses to the first 603 (at least with the way it is currently sorted). If the number of rows is greater than this, then I get the r(134) error that there are too many values. Is there anything to be done? My goal is to create a table with the most common words and export it to LaTeX. Would it be a lot easier to try and do this in something like R?
Here's one way, via contract and texsave from SSC:
/* Fake Data */
set more off
clear
set matsize 5000
set seed 12345
set obs 1000
gen x = string(rnormal())
expand mod(_n,10)
gen y = mod(_n,4)
/* Collapse Data to Get Frequencies for Each x-y Cell */
preserve
contract x y, freq(N)
reshape wide N, i(x) j(y)
forvalues v=0/3 {
lab var N`v' "`v'" // need this for labeling
replace N`v'=0 if missing(N`v')
}
egen T = rowtotal(N*)
gsort -T x // sort by occurrence
keep if T > 0 // set occurrence threshold
capture ssc install texsave
texsave x N0 N1 N2 N3 using "tab_x_y.tex", varlabel replace title("tab x y")
restore
/* Check Calculations */
type "tab_x_y.tex"
tab x y, rowsort
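You asked whether another tool would make this easier; for comparison, roughly the same sorted two-way tabulation in Python/pandas is only a few lines (a sketch with toy data standing in for your 18,000 rows):
```python
import pandas as pd

# Toy data: "x" takes many string values, "y" takes four categories.
df = pd.DataFrame({
    "x": ["alpha", "beta", "alpha", "gamma", "beta", "alpha"],
    "y": [0, 1, 1, 2, 3, 0],
})

tab = pd.crosstab(df["x"], df["y"])              # counts of each x value per y category
tab["Total"] = tab.sum(axis=1)
tab = tab.sort_values("Total", ascending=False)  # most common values first
tab.to_latex("tab_x_y.tex")                      # export for LaTeX, like texsave
print(tab)
```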

How to find planet in resonance

I'm trying to find a method to detect from orbital parameters (period, eccentricity, semi-major axis...) planets that are in resonance.
I know that if the ratio between two planets' orbital periods is commensurable, this means they are in resonance, but suppose I want to know IN WHICH resonance they are; how can I do it?
For instance, I have my matrix of N planets and periods. How can I create a loop to check if and in which resonance the planets are?
Something like:
for i=1, N
P(i)/P(i-1)=m
if m (check the resonance condition) then
write (planets parameters)
end if
end for
Thanks a lot.
I wrote this program. I have an N x 2 matrix in which the columns are the planet (system) ID and the period, and the rows are the individual planets, for instance something like this:
1 0.44
1 0.8
1 0.9
2 0.9
2 1.2
3 2.0
3 3.0
The trick for moving from one planetary system to the next is to give all the planets of a system the same ID number and the planets of another system a different number, so I can restrict the resonance check to planets within the same system.
The program is simple:
read the file and save the columns and rows numbers,
create and save a matrix of col*row objects,
save as a vector the `name` and `period` of planets,
start the cycle:
for r=1,row <--- THIS MUST READ all the file
if (difference in name = 0.) then start the resonance find criterion
for l = 0,4 (number of planet in each system: THIS MUST BE MODIFIED !!)
for i = 1,5
for j = 1,5
if (i*period(l)-j*period(l+1) eq 0) <- RESONANCE CONDITION !!!
then write on file
end for
end for
end for
else write a separation between the first set and second set of planets !
end for
This is the IDL code I wrote:
pro resfind
  file = "data.dat"
  rows = File_Lines(file)                              ; number of rows
  openr, lun, file, /Get_lun                           ; open once to count the columns
  line = ""
  readf, lun, line
  cols = n_elements(StrSplit(line, /RegEx, /extract))
  free_lun, lun

  openr, 1, "data.dat"
  data = dblarr(cols, rows)
  readf, 1, data
  close, 1

  name = data(0,*)
  period = data(1,*)

  openw, 2, "find.dat"
  for r = 0, rows-2 DO BEGIN
    if (name(r)-name(r+1) EQ 0) then begin
      for l = 0, rows-2 do begin
        for j = 1, 4 do begin
          for i = 1, 4 do begin
            if (abs(i*period(l)-j*period(l+1)) EQ 0.) then begin
              printf, 2, 'i resonance:', i, ' j resonance:', j, ' planet ID:', l, ' planet ID:', l+1
            endif
          endfor
        endfor
      endfor
    endif else begin
      printf, 2, ' '
    endelse
  endfor
  close, 2
end
PROBLEMS:
I can't understand how to eliminate the multiples of a resonance (2:4, 3:6, and so on);
in the second for loop (the one over the planets) the number of planets must change every time, but I don't understand how to handle that.
First, every real number can be approximated by a ratio of integers to any finite precision. That's in particular what we do when we express numbers with more and more digits in the decimal system. So you need to check not only whether orbital periods are in some integer-to-integer ratio, but also whether the two integers are relatively small. And it's an arbitrary decision which integers count as 'small'.
Second, remember that two floating point values are, in general, different if one is not a copy of the other. For example 3*(1/3) may not equal 1. That's a result of finite precision: 1/3 is infinitely repeating when represented in binary, so it gets truncated somewhere when stored in memory. So you should not check whether the period ratio is equal to some ratio, but rather whether it is close enough to some ratio. And it's arbitrary what counts as 'close enough'.
So the fastest way would be to build an array of ratios of relatively small integers, then sort it and remove duplicates (3:3 = 2:2, and you don't need multiple ones in your array). (Remember that duplicates are not just those exactly equal to each other, but those close enough to each other.) Then, for each pair of planets, calculate the orbital period ratio and binary-search your table for the closest value. If it is close enough, you found a resonance.
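A minimal Python sketch of that table-plus-binary-search idea; the integer bound and the tolerance are the arbitrary "small" and "close enough" choices discussed above:
```python
import bisect

MAX_INT = 5      # arbitrary bound on "relatively small" integers
TOL = 1e-3       # arbitrary "close enough" threshold

# Build the table of p:q ratios, sort it, and drop near-duplicates (2:2 == 3:3 etc.),
# keeping the lowest-order pair for each ratio.
candidates = sorted((p / q, p, q)
                    for p in range(1, MAX_INT + 1)
                    for q in range(1, MAX_INT + 1))
table = []
for cand in candidates:
    if not table or cand[0] - table[-1][0] > TOL:
        table.append(cand)

def find_resonance(period_inner, period_outer):
    """Return (p, q) if period_outer/period_inner is close to a small-integer ratio."""
    x = period_outer / period_inner
    keys = [t[0] for t in table]
    i = bisect.bisect_left(keys, x)
    neighbours = table[max(i - 1, 0):i + 1]          # closest table entries on either side
    best = min(neighbours, key=lambda t: abs(t[0] - x))
    if abs(best[0] - x) <= TOL:
        return best[1], best[2]
    return None

print(find_resonance(1.0, 2.0))   # (2, 1): a 2:1 resonance
print(find_resonance(1.0, 1.5))   # (3, 2): a 3:2 resonance
print(find_resonance(1.0, 1.7))   # None within the tolerance
```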

Fastest approximate counting algorithm

What's the fastest way to get an approximate count of the number of rows in an input file or stdout data stream? FYI, I'm after a probabilistic algorithm; I can't find many examples online.
The data could just be one or two columns coming from an awk script or a CSV file. Let's say I want an approximate group-by on one of the columns. I would use a database GROUP BY, but the number of rows is over 6-7 billion. I would like the first approximate result in under 3 to 4 seconds, then run a Bayes estimate or something after decisions are made on the prior. Any ideas for a really rough initial group count?
If you can provide the algorithm example in Python or Java, that would be very helpful.
Ben Allison's answer is a good way if you want to count the total lines. Since you mentioned Bayes and the prior, I will add an answer in that direction to estimate the percentage of different groups. (See my comments on your question: if you have an idea of the total and you want to do a group-by, estimating the percentage of the different groups makes more sense.)
The recursive Bayesian update:
I will start by assuming you have only two groups, group1 and group2 (extensions can be made to handle multiple groups; see the explanation further down).
For m group1s out of the first n lines (rows) you processed, we denote the event as M(m,n). Obviously you will see n-m group2s, because we assume these are the only two possible groups. So the conditional probability of the event M(m,n), given the percentage of group1 (call it s), is given by the binomial distribution with n trials. We are trying to estimate s in a Bayesian way.
The conjugate prior for the binomial is the beta distribution. For simplicity, we choose Beta(1,1) as the prior (of course, you can pick your own parameters for alpha and beta), which is the uniform distribution on (0,1). Therefore, for this beta distribution, alpha=1 and beta=1.
The recursive update formulas for a binomial + beta prior are as below:
if group == 'group1':
    alpha = alpha + 1
else:
    beta = beta + 1
The posterior of s is actually also a beta distribution:
p(s | M(m,n)) = s^(m+alpha-1) * (1-s)^(n-m+beta-1) / B(m+alpha, n-m+beta) = Beta(m+alpha, n-m+beta)
where B is the beta function. To report the estimated result, you can rely on the Beta distribution's mean and variance, where:
mean = alpha/(alpha+beta)
var = alpha*beta/((alpha+beta)**2 * (alpha+beta+1))
The python code: groupby.py
So a few lines of python to process your data from stdin and estimate the percentage of group1 would be something like below:
import sys

alpha = 1.
beta = 1.
for line in sys.stdin:
    data = line.strip()
    if data == 'group1':
        alpha += 1.
    elif data == 'group2':
        beta += 1.
    else:
        continue
    mean = alpha/(alpha+beta)
    var = alpha*beta/((alpha+beta)**2 * (alpha+beta+1))
    print('mean = %.3f, var = %.3f' % (mean, var))
The sample data
I feed a few lines of data to the code:
group1
group1
group1
group1
group2
group2
group2
group1
group1
group1
group2
group1
group1
group1
group2
The approximate estimation result
And here is what I get as results:
mean = 0.667, var = 0.056
mean = 0.750, var = 0.037
mean = 0.800, var = 0.027
mean = 0.833, var = 0.020
mean = 0.714, var = 0.026
mean = 0.625, var = 0.026
mean = 0.556, var = 0.025
mean = 0.600, var = 0.022
mean = 0.636, var = 0.019
mean = 0.667, var = 0.017
mean = 0.615, var = 0.017
mean = 0.643, var = 0.015
mean = 0.667, var = 0.014
mean = 0.688, var = 0.013
mean = 0.647, var = 0.013
The result shows that group1 is estimated to make up 64.7% of the data up to the 15th row processed (based on our Beta(1,1) prior). You might notice that the variance keeps shrinking as we accumulate more observations.
Multiple groups
Now if you have more than 2 groups, just change the underlying distribution from binomial to multinomial; the corresponding conjugate prior is then the Dirichlet distribution. Everything else needs only similar changes.
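As a sketch of that multi-group version (the three group labels below are placeholders; it assumes the labels are known up front and uses a Dirichlet(1,...,1) prior):
```python
import sys

GROUPS = ["group1", "group2", "group3"]      # placeholder labels, assumed known up front
alpha = {g: 1.0 for g in GROUPS}             # Dirichlet(1, ..., 1) prior

for line in sys.stdin:
    g = line.strip()
    if g not in alpha:
        continue
    alpha[g] += 1.0                          # each observation bumps one pseudo-count
    total = sum(alpha.values())
    means = {name: a / total for name, a in alpha.items()}   # posterior means per group
    print("  ".join("%s=%.3f" % (name, means[name]) for name in GROUPS))
```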
Further notes
You said you would like the approximate estimate in 3-4 seconds. In this case, you just sample a portion of your data and feed the output to the above script, e.g.,
head -n100000 YOURDATA.txt | python groupby.py
That's it. Hope it helps.
If it's reasonable to assume the data are IID (so there's no bias such as certain types of records occur in certain parts of the stream), then just subsample and scale up the counts by approximate size.
Take say the first million records (this should be processable in a couple of seconds). Its size is x units (MB, chars, whatever you care about). The full stream has size y where y >> x. Now, derive counts for whatever you care about from your sample x, and simply scale them by the factor y/x for approximate full counts. An example: you want to know roughly how many records have column 1 with value v in the full stream. The first million records have a file size of 100MB, while the total file size is 10GB. In the first million records, 150,000 of them have value v for column 1. So, you assume that in the full 10GB of records, you'll see 150,000 * (10,000,000,000 / 100,000,000) = 15,000,000 with that value. Any statistics you compute on the sample can simply be scaled by the same factor to produce an estimate.
If there is bias in the data such that certain records are more or less likely to be in certain places of the file then you should select your sample records at random (or evenly spaced intervals) from the total set. This is going to ensure an unbiased, representative sample, but probably incur a much greater I/O overhead.
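A rough Python sketch of that subsample-and-scale approach (the file name, column index, and delimiter are placeholders; it scales the sampled counts by the byte-size factor y/x described above):
```python
from collections import Counter
import os

SAMPLE_LINES = 1_000_000                     # "take say the first million records"

def approx_group_counts(path, column=0, delimiter=","):
    counts = Counter()
    sample_bytes = 0
    with open(path, "rb") as f:
        for i, raw in enumerate(f):
            if i >= SAMPLE_LINES:
                break
            sample_bytes += len(raw)         # x: size of the sample
            fields = raw.decode(errors="replace").rstrip("\n").split(delimiter)
            counts[fields[column]] += 1
    total_bytes = os.path.getsize(path)      # y: size of the full file
    scale = total_bytes / max(sample_bytes, 1)
    return {value: int(n * scale) for value, n in counts.items()}

print(approx_group_counts("YOURDATA.txt"))   # placeholder file name, as in the example above
```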

Generate infinite stream of unique numbers between 0 and 1

Came across this question previously on an interview. The requirements are to write a function that
Generates a number between 0..1
Never returns the same number
Can scale (called every few milliseconds and continuously for years)
Can use only 1mb of heap memory
Does not need to return as a decimal, can render directly to stdout
My idea was hacky at best: manipulate a string, producing "0.1", then "0.11", then "0.12", and so on. Since the requirements did not mention it had to be uniformly distributed, it does not need to be random. Another idea is to generate a timestamp of the form yyyyMMddhhmmssSSS (where SSS is msec), convert that to a string, and prefix it with "0.". This way the values will always be unique.
It's a pretty open ended question and I'm curious how other people would tackle it.
Pseudo code that can do what you want, except it cannot guarantee no repeats:
1. Take your 1 MB allocation.
2. Randomly set every byte.
3. Echo to stdout as "0.<bytes as integer string>" (will be very long).
4. Go to step 2.
Your "never returns the same number" requirement is not strictly guaranteed, but a repeat is extremely unlikely (for any two outputs, about 1 in 2^8388608, since 1 MB is 8,388,608 random bits), assuming a good implementation of Random.
Allocate about a million characters and set them initially to all 0.
Then each call to the function simply increments the number and returns it, something like:
# Gives you your 1MB heap space.
num = new digit/byte/char/whatever[about a million]

# Initialise all digits to zero (1-based arrays).
def init():
    for posn ranges from 1 to size(num):
        set num[posn] to 0

# Print next value.
def printNext():
    # Carry-based add-1-to-number.
    # Last non-zero digit stored for truncated output.
    set carry to 1
    set posn to size(num)
    set lastposn to posn
    # Keep going until no more carry or out of digits.
    while posn is greater than 0 and carry is 1:
        # Detect carry and continue, or increment and stop.
        if num[posn] is '9':
            set num[posn] to '0'
            set lastposn to posn minus 1
        else:
            set num[posn] to num[posn] + 1
            set carry to 0
        set posn to posn minus one
    # Carry set after all digits means you've exhausted all numbers.
    if carry is 1:
        exit badly
    # Output the number.
    output "0."
    for posn ranges from 1 to lastposn:
        output num[posn]
The use of lastposn prevents the output of trailing zeros. If you don't care about that, you can remove every line with lastposn in it and run the output loop from 1 to size(num) instead.
Calling this every millisecond will give you well over 10^(some very big number) years of run time, far longer than the age of the universe.
I wouldn't go with your time-based solution because the time may change - think daylight savings or summer time and people adjusting clocks due to drift.
Here's some actual Python code which demonstrates it:
import sys

num = "00000"

def printNext():
    global num
    carry = 1
    posn = len(num) - 1
    lastposn = posn
    while posn >= 0 and carry == 1:
        if num[posn:posn+1] == '9':
            num = num[:posn] + '0' + num[posn+1:]
            lastposn = posn - 1
        else:
            num = num[:posn] + chr(ord(num[posn:posn+1]) + 1) + num[posn+1:]
            carry = 0
        posn = posn - 1
    if carry == 1:
        print("URK!")
        sys.exit(0)
    s = "0."
    for posn in range(0, lastposn+1):
        s = s + num[posn:posn+1]
    print(s)

for i in range(0, 15):
    printNext()
And the output:
0.00001
0.00002
0.00003
0.00004
0.00005
0.00006
0.00007
0.00008
0.00009
0.0001
0.00011
0.00012
0.00013
0.00014
0.00015
Your method would eventually use more than 1 MB of heap memory. However you represent numbers, if you are constrained to 1 MB of heap then there is only a finite number of values. I would take the maximum amount of memory possible and increment the least significant bit by one on each call. That would ensure running as long as possible before returning a repeated number.
Yes, because there is no random requirement, you have a lot of flexibility.
The idea here I think is very close to that of enumerating all strings over the regular expression [0-9]* with a couple modifications:
the real string starts with the sequence 0.
you cannot end with a 0
So how would you enumerate? One idea is
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.11 0.12 0.13 0.14 0.15 ... 0.19 0.21 0.22 ... 0.29 0.31 ... 0.99 0.101 0.102 ...
The only state you need here is an integer I think. Just be clever in skipping those zeros at the end (not difficult really). 1 MB of memory should be fine. It stores a massive massive integer, so I think you would be good here.
(It is different from yours because I generate all one character strings, then all two character strings, then all three character strings, ... so I believe there is no need for state other than the last number generated.)
Then again I may be wrong; I haven't tried this.
ADDENDUM
Okay I will try it. Here is the generator in Ruby
i = 0
while true
  puts "0.#{i}" if i % 10 != 0
  i += 1
end
Looks okay to me....
If you are programming in C, the nextafter() family of functions are Posix-compatible functions useful for producing the next double after or before any given value. This will give you about 2^64 different values to output, if you output both positive and negative values.
If you are required to print out the values, use the %a or %A format for exact representation. From the printf(3) man page: "For 'a' conversion, the double argument is converted to hexadecimal notation (using the letters abcdef) in the style [-]0xh.hhhhp±d..." "The default precision suffices for an exact representation of the value if an exact representation in base 2 exists..."
If you want to generate random numbers rather than sequentially ascending ones, perhaps do a google search for 64-bit KISS RNG. Implementations in Java, C, Ada, Fortran, et al are available on the web. The period of 64-bit KISS RNG itself is ~ 2^250, but there are not that many 64-bit double-precision numbers, so some numbers will re-appear within 2^64 outputs, but with different neighbor values. On some systems, long doubles have 128-bit values; on others, only 80 or 96. Using long doubles, you could accordingly increase the number of different values output by combining two randoms into each output.
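If you end up in Python rather than C, math.nextafter (Python 3.9+) gives the same stepping through adjacent doubles, and float.hex() plays the role of printf's %a exact representation; a tiny sketch:
```python
import math

x = 0.0
for _ in range(5):
    x = math.nextafter(x, 1.0)   # next representable double toward 1.0
    print(x.hex())               # exact hexadecimal representation, like printf %a
```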
It may be that the point of this question in an interview is to figure out if you can recognize a silly spec when you see it.
