Stata - How to Generate Random Integers - random

I am learning Stata and want to know how to generate random integers (without replacement). If I had 10 total rows, I would want each row to have a unique integer from 1 to 10 assigned to it. In R, one could simply do:
sample(1:10, 10)
But it seems more difficult to do in Stata. From this Stata page, I saw:
generate ui = floor((b-a+1)*runiform() + a)
If I substitute a=1 and b=10, I get something close to what I want, but it samples with replacement.
After getting that part figured out, how would I handle the following wrinkle: my data come in pairs. For example, in the 10 observations, there are 5 groups of 2. Each group of 2 has a unique identifier. How would I arrange the groups (and not the observations) in random order? The data would look something like this:
obs group mem value
1 A x 9345
2 A y 129
3 B x 251
4 B y 373
5 C x 788
6 C y 631
7 D x 239
8 D y 481
9 E x 224
10 E y 585
obs is the observation number. group is the group the observation (row) belongs to. mem is the member identifier in the group. Each group has one x and one y in it.

First question:
You could just shuffle observation identifiers.
set obs 10
gen y = _n
gen rnd = runiform()
sort rnd
Or in Mata
jumble(1::10)
Second question: Several ways. Here's one.
gen rnd = runiform()
bysort group (rnd): replace rnd = rnd[1]
sort rnd
General comment: For reproducibility, set the random number seed beforehand.
set seed 2803
or whatever.

Related

Combining every column-combination of an arbitrary number of matrices

I'm trying to figure out a way to do a certain "reduction"
I have a varying number of matrices of varying size, e.g
1 2 2 2 5 6...70 70
3 7 8 9 7 7...88 89
1 3 4
2 7 7
3 8 8
9 9 9
.
.
44 49 49 49 49 49 49
50 50 50 50 50 50 50
87 87 88 89 90 91 92
What I need to do (and I hope that I'm explaining this clearly enough) is to combine any possible
combination of columns from these matrices, this means that one column might be
1
3
1
2
3
9
.
.
.
44
50
87
Which would reduce down to
1
2
3
9
.
.
.
44
50
87
The reason why I'm doing this is because I need to find the smallest unique combined column
What am I trying to accomplish
For those interested, I'm trying to find the smallest set of gene knockouts
to disable reactions. Here, every matrix represents a reactions, and the columns represent the indices of
the genes that would disable that reaction.
The method may be as brute force as needed, as these matrices rarely become overwhelmingly large,
and the reaction combinations won't be long either
The problem
I can't (as far as I know) create a for loop with an arbitrary number of iterators, and the number of
matrices (reactions to disable) is arbitrary.
Clarification
If I have matrices A,B,C with columns a1,a2...b1,b2...c1...cn what I need
are the columns [a1 b1 c1], [a1, b1, c2], ..., [a1 b1 cn] ... [an bn cn]
Solution
Courtesy of Michael Ohlrogge below.
Extension of his answer, for completeness
His solution ends with
MyProd = product(Array_of_ColGroups...)
Which gets the job done
And picking up where he left off
collection = collect(MyProd); #MyProd is an iterator
merged_cols = Array[] # the rows of 'collection' are arrays of arrays
for (i,v) in enumerate(collection)
# I apologize for this line
push!(merged_cols, sort!(unique(vcat(v...))))
end
# find all lengths so I can find which is the minimum
lengths = map(x -> length(x), merged_cols);
loc_of_shortest = find(broadcast((x,y) -> length(x) == y, merged_cols,minimum(lengths)))
best_gene_combos = merged_cols[loc_of_shortest]
tl;dr - complete solution:
# example matrices
a = rand(1:50, 8,4); b = rand(1:50, 10,5); c = rand(1:50, 12,4);
Matrices = [a,b,c];
toJagged(x) = [x[:,i] for i in 1:size(x,2)];
JaggedMatrices = [toJagged(x) for x in Matrices];
Combined = [unique(i) for i in JaggedMatrices[1]];
for n in 2:length(JaggedMatrices)
Combined = [unique([i;j]) for i in Combined, j in JaggedMatrices[n]];
end
Lengths = [length(s) for s in Combined];
Minima = findin(Lengths, min(Lengths...));
SubscriptsArray = ind2sub(size(Lengths), Minima);
ComboTuples = [((i[j] for i in SubscriptsArray)...) for j in 1:length(Minima)]
Explanation:
Assume you have matrix a and b
a = rand(1:50, 8,4);
b = rand(1:50, 10,5);
Express them as a jagged array, columns first
A = [a[:,i] for i in 1:size(a,2)];
B = [b[:,i] for i in 1:size(b,2)];
Concatenate rows for all column combinations using a list comprehension; remove duplicates on the spot:
Combined = [unique([i;j]) for i in A, j in B];
You now have all column combinations of a and b, as concatenated rows with duplicates removed. Find the lengths easily:
Lengths = [length(s) for s in Combined];
If you have more than two matrices, perform this process iteratively in a for loop, e.g. by using the Combined matrix in place of a. e.g. if you have a matrix c:
c = rand(1:50, 12,4);
C = [c[:,i] for i in 1:size(c,2)];
Combined = [unique([i;j]) for i in Combined, j in C];
Once you have the Lengths array as a multidimensional array (as many dimensions as input matrices, where the size of each dimension is the number of columns in each matrix), you can find the column combinations that correspond to the lowest value (there may well be more than one combination), via a simple ind2sub operation:
Minima = findin(Lengths, min(Lengths...));
SubscriptsArray = ind2sub(size(Lengths), Minima)
(e.g. for a randomized run with 3 input matrices, I happened to get 4 results with the minimal length of 19. The result of ind2sub was ([4,4,3,4,4],[3,3,4,5,3],[1,3,3,3,4])
You can convert this further to a list of "Column Combination" tuples with a (somewhat ugly) list comprehension:
ComboTuples = [((i[j] for i in SubscriptsArray)...) for j in 1:length(Minima)]
# results in:
# 5-element Array{Tuple{Int64,Int64,Int64},1}:
# (4,3,1)
# (4,3,3)
# (3,4,3)
# (4,5,3)
# (4,3,4)
Ok, let's see if I understand this. You've got n matrices and want all combinations with one column from each of the n matrices? If so, how about the product() (for Cartesian product) from the Iterators package?
using Iterators
n = 3
Array_of_Arrays = [rand(3,3) for idx = 1:n] ## arbitrary representation of your set of arrays.
Array_of_ColGroups = Array(Array, length(Array_of_Arrays))
for (idx, MyArray) in enumerate(Array_of_Arrays)
Array_of_ColGroups[idx] = [MyArray[:,jdx] for jdx in 1:size(MyArray,2)]
end
MyProd = product(Array_of_ColGroups...)
This will create an iterator object which you can then loop over to consider the specific combinations of columns.

Make an n x n-1 matrix from 1 x n vector where the i-th row is the vector without the i-th element, without a for loop

I need this for Lagrange polynomials. I'm curious how one would do this without a for loop. The code currently looks like this:
tj = 1:n;
ti = zeros(n,n-1);
for i = 1:n
ti(i,:) = tj([1:i-1, i+1:end]);
end
My tj is not really just a 1:n vector but that's not important. While this for loop gets the job done, I'd rather use some matrix operation. I tried looking for some appropriate matrices to multiply it with, but no luck so far.
Here's a way:
v = [10 20 30 40]; %// example vector
n = numel(v);
M = repmat(v(:), 1, n);
M = M(~eye(n));
M = reshape(M,n-1,n).';
gives
M =
20 30 40
10 30 40
10 20 40
10 20 30
This should generalize to any n
ti = flipud(reshape(repmat(1:n, [n-1 1]), [n n-1]));
Taking a closer look at what's going on. If you look at the resulting matrix closely, you'll see that it's n-1 1's, n-1 2's, etc. from the bottom up.
For the case where n is 3.
ti =
2 3
1 3
1 2
So we can flip this vertically and get
f = flipud(ti);
1 2
1 3
2 3
Really this is [1, 2, 3; 1, 2, 3] reshaped to be 3 x 2 rather than 2 x 3.
In that line of thinking
a = repmat(1:3, [2 1])
1 2 3
1 2 3
b = reshape(a, [3 2]);
1 2
1 3
2 3
c = flipud(b);
2 3
1 3
1 2
We are now back to where you started when we bring it all together and replace 3's with n and 2's with n-1.
Here's another way. First create a matrix where each row is the vector tj but are stacked on top of each other. Next, extract the lower and upper triangular parts of the matrix without the diagonal, then add the results together ensuring that you remove the last column of the lower triangular matrix and the first column of the upper triangular matrix.
n = numel(tj);
V = repmat(tj, n, 1);
L = tril(V,-1);
U = triu(V,1);
ti = L(:,1:end-1) + U(:,2:end);
numel finds the total number of values in tj which we store in n. repmat facilitates the stacking of the vector tj to create a matrix that is n x n large. After, we use tril and triu so that we extract the lower and upper triangular parts of the matrices without the diagonal. In addition, the rest of the matrix is all zero except for the relevant triangular parts. The -1 and 1 flags for tril and triu respectively extract this out successfully while ensuring that the diagonal is all zero. This creates a column of extra zeroes appearing at the last column when calling tril and the first column when calling triu. The last part is to simply add these two matrices together ignoring the last column of the tril result and the first column of the triu result.
Given that tj = [10 20 30 40]; (borrowed from Luis Mendo's example), we get:
ti =
20 30 40
10 30 40
10 20 40
10 20 30

How to calculate one certain value from a rolling-window estimation in Stata

I'm using Stata to estimate Value-at-risk (VaR) with the historical simulation method. Basically, I will create a rolling window with 100 observations, to estimate VaR for the next 250 days (repeat 250 times). Hence, as I've known, the rolling window with time series command in Stata would be useful in this case. Here is the process:
Input: 350 values
1. Ascending sort the very first 100 values (by magnitude).
2. Then I need to take the 5th smallest for each window.
3. Repeat 250 times.
Output: a list of the 5th values (250 in total).
Sound simple, but I cannot do it the right way. This was my attempt below:
program his,rclass
sort lnreturn
return scalar actual=lnreturn in 5
end
tsset stt
time variable: stt, 1 to 350
delta: 1 unit
rolling actual=r(actual), window(100) saving(C:\result100.dta, replace) : his
(running his on estimation sample)
And the result is:
Start end actual
1 100 -.047856
2 101 -.047856
3 102 -.047856
4 103 -.047856
.... ..... ......
251 350 -.047856
What I want is 250 different 5th values in panel "actual", not the same like that.
If I understand this correctly, you want the 5th percentile of values in a window of 100. That should yield to summarize, detail or centile. I see no need to write a program.
Your bug is that your program his calculates the same thing each time it is called. There is no communication about windows other than what is explicit in your code. It is like saying
move here: now add 2 + 2
move there: now add 2 + 2
move to New York: now add 2 + 2
The result is invariant to your supposed position.
Note that I doubt that
return scalar actual=lnreturn in 5
really is your code. lnreturn[5] should work.
UPDATE You don't even need rolling here. Looping over data is easy enough. The data in this example are clearly fake.
clear
* sandpit
set obs 500
set seed 2803
gen y = ceil(exp(rnormal(3,2)))
l y in 1/5
* initialise
gen p5 = .
* windows of length 100: 1..100, 101..200, ...
quietly forval j = 1/401 {
local J = `j' + 99
su y in `j'/`J', detail
replace p5 = r(p5) in `j'
}
* check first calculation
su y in 1/100, detail
l in 1/5

Consolidate 10 bit Value into a Unique Byte

As part of an algorithm I'm writing, I need to find a way to convert a 10-bit word into a unique 8-bit word. The 10-bit word is made up of 5 pairs, where each pair can only ever equal 0, 1 or 2 (never 3). For example:
|00|10|00|01|10|
This value needs to somehow be consolidated into a single, unique byte.
As each pair can never equal 3, there are a wide range of values that this 10-bit word will never represent, which makes me think that it is possible to create an algorithm to perform this conversion. The simplest way to do this would be to use a lookup table, but it seems like a waste of resources to store ~680 values which will only be used once in my program. I've already tried to incorporate one of the pairs into the others somehow, but every attempt I've made has resulted in a non-unique value, and I'm now very quickly running out of ideas!
Any help?
The number you have is essentially base 3. You just need to convert this to base 2.
There are 5 pairs, so 3^5 = 243 numbers. And 8 bits is 2^8 = 256 numbers, so it's possible.
The simplest way to convert between bases is to go to base 10 first.
So, for your example:
00|10|00|01|10
Base 3: 02012
Base 10: 2*3^3 + 1*3^1 + 2*3^0
= 54 + 3 + 2
= 59
Base 2:
59 % 2 = 1
/2 29 % 2 = 1
/2 14 % 2 = 0
/2 7 % 2 = 1
/2 3 % 2 = 1
/2 1 % 2 = 1
So 111011 is your number in binary
This explains the above process in a bit more detail.
Note that once you have 59 above stored in a 1-byte integer, you'll probably already have what you want, thus explicitly converting to base 2 might not be necessary.
What you basically have is a base 3 number and you want to convert this to a single number 0 - 255, luckily 5 digits in ternary (base 3) gives 243 combinations.
What you'll need to do is:
Digit Action
( 1st x 3^4)
+ (2nd x 3^3)
+ (3rd x 3^2)
+ (4th x 3)
+ (5th)
This will give you a number 0 to 242.
You are considering to store some information in a byte. A byte can contain at most 2 ^ 8 = 256 status.
Your status is totally 3 ^ 5 = 243 < 256. That make the transfer possible.
Consider your pairs are ABCDE (each character can be 0, 1 or 2)
You can just calculate A*3^4 + B*3^3 + C*3^2 + D*3 + E as your result. I guarantee the result will be in range 0 -- 255.

Using one probability set to generate another [duplicate]

This question already has answers here:
Expand a random range from 1–5 to 1–7
(78 answers)
Closed 8 years ago.
How can I generate a bigger probability set from a smaller probability set?
This is from Algorithm Design Manual -Steven Skiena
Q:
Use a random number generator (rng04) that generates numbers from {0,1,2,3,4} with equal probability to write a random number generator that generates numbers from 0 to 7 (rng07) with equal probability?
I tried for around 3 hours now, mostly based on summing two rng04 outputs. The problem is that in that case the probability of each value is different - 4 can come with 5/24 probability while 0 happening is 1/24. I tried some ways to mask it, but cannot.
Can somebody solve this?
You have to find a way to combine the two sets of random numbers (the first and second random {0,1,2,3,4} ) and make n*n distinct possibilities. Basically the problem is that with addition you get something like this
X
0 1 2 3 4
0 0 1 2 3 4
Y 1 1 2 3 4 5
2 2 3 4 5 6
3 3 4 5 6 7
4 4 5 6 7 8
Which has duplicates, which is not what you want. One possible way to combine the two sets would be the Z = X + Y*5 where X and Y are the two random numbers. That would give you a set of results like this
X
0 1 2 3 4
0 0 1 2 3 4
Y 1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24
So now that you have a bigger set of random numbers, you need to do the reverse and make it smaller. This set has 25 distinct values (because you started with 5, and used two random numbers, so 5*5=25). The set you want has 8 distinct values. A naïve way to do this would be
x = rnd(5) // {0,1,2,3,4}
y = rnd(5) // {0,1,2,3,4}
z = x+y*5 // {0-24}
random07 = x mod 8
This would indeed have a range of {0,7}. But the values {1,7} would appear 3/25 times, and the value 0 would appear 4/25 times. This is because 0 mod 8 = 0, 8 mod 8 = 0, 16 mod 8 = 0 and 24 mod 8 = 0.
To fix this, you can modify the code above to this.
do {
x = rnd(5) // {0,1,2,3,4}
y = rnd(5) // {0,1,2,3,4}
z = x+y*5 // {0-24}
while (z != 24)
random07 = z mod 8
This will take the one value (24) that is throwing off your probabilities and discard it. Generating a new random number if you get a 'bad' value like this will make your algorithm run very slightly longer (in this case 1/25 of the time it will take 2x as long to run, 1/625 it will take 3x as long, etc). But it will give you the right probabilities.
The real problem, of course, is the fact that the numbers in the middle of the sum (4 in this case) occur in many combinations (0+4, 1+3, etc.) whereas 0 and 8 have exactly one way to be produced.
I don't know how to solve this problem, but I'm going to try to reduce it a bit for you. Some points to consider:
The 0-7 range has 8 possible values, so ultimately the total number of possible situations that you should aim for has to be a multiple of 8. That way you can have an integral number of distributions per value in that codomain.
When you take the sum of two density functions, the number of possible situations (not necessarily distinct when you evaluate the sum, just in terms of different permutations of inputs) is equal to the product of the size of each of the input sets.
Thus, given two {0,1,2,3,4} sets summed together, you have 5*5=25 possibilities.
It will not be possible to get a multiple of eight (see first point) from powers of 5 (see second point, but extrapolate it to any number of sets > 1), so you will need to have a surplus of possible situations in your function and ignore some of them if they occur.
The simplest way to do that, as far as I can see at this point, is to use the sum of two {0,1,2,3,4} sets (25 possibilities) and ignore 1 (to leave 24, a multiple of 8).
Thus the challenge now has been reduced to this: Find a way to distribute the remaining 24 possibilities among the 8 output values. For this, you'll probably NOT want to use the sum, but rather just the input values.
One way to do that is, imagine a number in base 5 constructed from your input. Ignore 44 (that's your 25th, superfluous value; if you get it, synthesize a new set of inputs) and take the others, modulo 8, and you'll get your 0-7 across 24 different input combinations (3 each), which is an equal distribution.
My logic would be this:
rn07 = 0;
do {
num = rng04;
}
while(num == 4);
rn07 = num * 2;
do {
num = rng04;
}
while(num == 4);
rn07 += num % 2

Resources