How to draw without replacement to fill in data set - random

I am generating a data set where I first want to randomly draw a number for each observation from a discrete distribution, fill in var1 with these numbers. Next, I want to draw another number from the distribution for each row, but the catch is that the number in var1 for this observation is not eligible to be drawn anymore. I want to repeat this a relatively large number of times.
To hopefully make this make more sense, suppose that I start with:
id
1
2
3
...
999
1000
Suppose that the distribution I have is ["A", "B", "C", "D", "E"] that happen with probability [.2, .3, .1, .15, .25].
I would first like to randomly draw from this distribution to fill in var1. Suppose that the result of this is:
id var1
1 E
2 E
3 C
...
999 B
1000 A
Now E is not eligible to be drawn for observations 1 and 2. C, B, and A are ineligible for observations 3, 999, and 1000, respectively.
After all the columns are filled in, we may end up with this:
id var1 var2 var3 var4 var5
1 E C B A D
2 E A B D C
3 C B A E D
...
999 B D C A E
1000 A E B C D
I am not sure of how to approach this in Stata. But one way to fill in var1 is to do something like:
gen random1 = runiform()
gen var1 = "A" if random1<.2
replace var1 = "B" if random1>=.2 & random1<.5
etc....
Note that sticking with the (scaled) probabilities after creating var1 is desirable, but is not required for me.

Here's a solution that works in long form to select from the distribution. As values are selected, they are flagged as done and the next selection is made from the groups that contain the remaining values. Probabilities are scaled at each pass.
version 14
set seed 3241234
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte ip str1 y double p
1 "A" .2
2 "B" .3
3 "C" .1
4 "D" .15
5 "E" .25
end
local nval = _N
* the following should be true
isid y
expand 1000
bysort y: gen id = _n
sort id ip
gen done = 0
forvalues i = 1/`nval' {
    // scale probabilities
    bysort id done (ip): gen double ptot = sum(p) // this is a running sum
    by id done: gen double phigh = sum(p / ptot[_N])
    by id done: gen double plow = cond(_n == 1, 0, phigh[_n-1])
    // random number in the range of (0,1) for the group
    bysort id done (ip): gen double x = runiform()
    // pick from the not done group; choose first x to represent group
    by id done: gen pick = !done & inrange(x[1], plow, phigh)
    // put the picked obs at the end and create the new var
    bysort id (pick ip): gen v`i' = y[_N]
    // we are done for the obs that was picked
    bysort id: replace done = 1 if _n == _N
    drop x pick ptot phigh plow
}
bysort id: keep if _n == 1
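For readers who want to see the same idea outside Stata, here is a minimal Python sketch of the pass-by-pass logic described above: draw one value per row, remove it from the eligible set, and rescale the remaining probabilities before the next draw. The function name draw_row and the use of Python's standard random module are illustrative assumptions, not part of the Stata solution.

```python
import random

values = ["A", "B", "C", "D", "E"]
probs = [0.2, 0.3, 0.1, 0.15, 0.25]

def draw_row(values, probs, rng):
    """Draw every value once, without replacement, weighted by the remaining probabilities."""
    remaining = dict(zip(values, probs))
    row = []
    while remaining:
        keys = list(remaining)
        # random.choices normalizes the weights, which rescales the
        # probabilities of the values that are still eligible
        pick = rng.choices(keys, weights=[remaining[k] for k in keys], k=1)[0]
        row.append(pick)
        del remaining[pick]
    return row

rng = random.Random(3241234)  # fixed seed for reproducibility
data = {i: draw_row(values, probs, rng) for i in range(1, 1001)}
print(data[1])
```

Each entry of data plays the role of one observation, with list positions 0 to 4 corresponding to v1 to v5.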


Duplicate Strings with Ambiguity

I have a large (5-10 million) set of strings with the restricted alphabet of nucleotide symbols (A,T,C, and G) along with a wildcard symbol N. Each string has an integer associated with it.
I want to find all the unique strings and, for each, sum their integer values. The 'representative' string for a set of equal strings should be the one with the highest integer value. For example, given:
NTG 9
NAG 6
ANG 5
TTT 2
ATG 2
I want the output to be:
NTG 14
NAG 6
ATG 2
TTT 2
With a dataset of this size pairwise comparisons are not feasible. Any ideas?
I assumed that your target output wasn't accurate. It seems more appropriate to merge "ATG" into "NTG" (which I have done) instead of merging "ANG" into "NTG" (your stated goal). This solution handles your sample set, but may not be practical for your real application given the significant difference in scale.
Code:
import re
test = """
NTG 9
NAG 6
ANG 5
TTT 2
ATG 2
"""
test = [x.split(" ") for x in test.upper().split("\n") if x != ""]
#print(test)
index = 0
while index < len(test):
    seq = test[index]
    seq_regex = seq[0].replace("N", ".")
    no_match_li = [x for x in test if len(re.findall(seq_regex, x[0])) == 0]
    match_li = [int(x[1]) for x in test if len(re.findall(seq_regex, x[0])) != 0]
    #print(no_match_li, match_li)
    test = [[seq[0], sum(match_li)]] + no_match_li
    index += 1
test = sorted(test, key=lambda x: x[1], reverse=True)
for seq in test:
    print(seq[0], seq[1])
Output:
NTG 11
NAG 6
ANG 5
TTT 2
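The matching step in the code above treats N in the pattern string as a regular-expression wildcard. A minimal sketch of that step in isolation follows. The helper name matches is hypothetical, and it uses re.fullmatch rather than re.findall so that strings of different lengths cannot match by accident. Note that the test is one-directional: an N in the candidate string is not treated as a wildcard, which is why "ANG" is not merged into "NTG" in the output above.

```python
import re

def matches(pattern, candidate):
    """True if candidate fits pattern, where N in pattern stands for any single symbol."""
    regex = pattern.replace("N", ".")
    return re.fullmatch(regex, candidate) is not None

print(matches("NTG", "ATG"))  # True: N in the pattern matches A
print(matches("ATG", "NTG"))  # False: N in the candidate is not a wildcard
print(matches("NTG", "ANG"))  # False: T in the pattern does not match N
```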

corpus extraction with changing data type R

I have a corpus of text files containing only text. I want to extract the n-grams from each text and save them, under the original file name, as matrices with 3 columns.
library(tokenizer)
myTokenizer <- function(x, n, n_min) {
corp<-"this is a full text "
tok <- unlist(tokenize_ngrams(as.character(x), n = n, n_min = n_min))
M <- matrix(nrow=length(tok), ncol=3,
dimnames=list(NULL, c( "gram" , "num.words", "words")))
}
corp <- tm_map(corp,content_transformer(function (x) myTokenizer(x, n=3, n_min=1)))
writecorpus(corp)
Since I don't have your corpus, I created one of my own using the crude dataset from tm. There is no need to use tm_map, as that keeps the data in a corpus format. The tokenizers package can handle this.
What I do is store all your desired matrices in a list object via lapply and then use sapply to store the data in the crude directory as separate files.
Do realize that the matrices as specified in your function will be character matrices. This means that columns 1 and 2 will be characters, not numbers.
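To make the three columns concrete, here is a rough Python sketch (not the R solution itself) that builds the same kind of rows for a single text. It assumes simple whitespace splitting and lowercasing; the row order, with all lengths from n_min to n emitted at each starting word, is an assumption modeled on the sample output in this answer, not taken from the tokenizers source. The function name ngram_rows is hypothetical. Unlike the R character matrix, the first two columns stay integers here.

```python
def ngram_rows(text, n=3, n_min=1):
    """Return (gram, num_words, words) rows for every n-gram of n_min..n words."""
    tokens = text.lower().split()
    rows = []
    for start in range(len(tokens)):
        for size in range(n_min, n + 1):
            # skip n-grams that would run past the end of the text
            if start + size <= len(tokens):
                words = " ".join(tokens[start:start + size])
                rows.append((len(rows) + 1, size, words))
    return rows

for row in ngram_rows("Diamond Shamrock Corp said"):
    print(row)
```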
library(tm)
data("crude")
crude <- as.VCorpus(crude)
myTokenizer <- function(x, n, n_min) {
tok <- unlist(tokenizers::tokenize_ngrams(as.character(x), n = n, n_min = n_min))
M <- matrix(nrow=length(tok), ncol=3,
dimnames=list(NULL, c( "gram" , "num.words", "words")))
M[, 3] <- tok
M[, 2] <- lengths(strsplit(M[, 3], "\\W+")) # counts the words
M[, 1] <- 1:length(tok)
return(M)
}
my_matrices <- lapply(crude, myTokenizer, n = 3, n_min = 1)
# make sure directory crude exists as a subfolder in working directory
sapply(names(my_matrices),
function (x) write.table(my_matrices[[x]], file=paste("crude/", x, ".txt", sep=""), row.names = FALSE))
Outcome of the first file:
"gram" "num.words" "words"
"1" "1" "diamond"
"2" "2" "diamond shamrock"
"3" "3" "diamond shamrock corp"
"4" "1" "shamrock"
"5" "2" "shamrock corp"
"6" "3" "shamrock corp said"
I would recommend creating a document term matrix (DTM). You will probably need this in your downstream tasks anyway. From that you could also extract the information you want, although it is probably not reasonable to assume that a term (incl. ngrams) comes from only a single document (at least this is what I understood from your question; please correct me if I am wrong). Therefore, I guess that in practice one term will have several documents associated with it, and this kind of information is usually stored in a DTM.
An example with text2vec below. If you could elaborate further how you want to use your terms, etc. I could adapt the code according to your needs.
library(text2vec)
# I have set up two texts that do not overlap in any term, just as an example
# in practice, this probably never happens
docs = c(d1 = c("here a text"), d2 = c("and another one"))
it = itoken(docs, tokenizer = word_tokenizer, progressbar = F)
v = create_vocabulary(it, ngram = c(1,3))
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer)
as.matrix(dtm)
# a a_text and and_another and_another_one another another_one here here_a here_a_text one text
# d1 1 1 0 0 0 0 0 1 1 1 0 1
# d2 0 0 1 1 1 1 1 0 0 0 1 0
library(stringi)
docs = c(d1 = c("here a text"), d2 = c("and another one"))
it = itoken(docs, tokenizer = word_tokenizer, progressbar = F)
v = create_vocabulary(it, ngram = c(1,3))
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer)
for (d in rownames(dtm)) {
v = dtm[d, ]
v = v[v!=0]
v = data.frame(number = 1:length(v)
,term = names(v))
v$n = stri_count_fixed(v$term, "_")+1
write.csv(v, file = paste0("v_", d, ".csv"), row.names = F)
}
read.csv("v_d1.csv")
# number term n
# 1 1 a 1
# 2 2 a_text 2
# 3 3 here 1
# 4 4 here_a 2
# 5 5 here_a_text 3
# 6 6 text 1
read.csv("v_d2.csv")
# number term n
# 1 1 and 1
# 2 2 and_another 2
# 3 3 and_another_one 3
# 4 4 another 1
# 5 5 another_one 2
# 6 6 one 1

Stata - How to Generate Random Integers

I am learning Stata and want to know how to generate random integers (without replacement). If I had 10 total rows, I would want each row to have a unique integer from 1 to 10 assigned to it. In R, one could simply do:
sample(1:10, 10)
But it seems more difficult to do in Stata. From this Stata page, I saw:
generate ui = floor((b-a+1)*runiform() + a)
If I substitute a=1 and b=10, I get something close to what I want, but it samples with replacement.
After getting that part figured out, how would I handle the following wrinkle: my data come in pairs. For example, in the 10 observations, there are 5 groups of 2. Each group of 2 has a unique identifier. How would I arrange the groups (and not the observations) in random order? The data would look something like this:
obs group mem value
1 A x 9345
2 A y 129
3 B x 251
4 B y 373
5 C x 788
6 C y 631
7 D x 239
8 D y 481
9 E x 224
10 E y 585
obs is the observation number. group is the group the observation (row) belongs to. mem is the member identifier in the group. Each group has one x and one y in it.
First question:
You could just shuffle observation identifiers.
set obs 10
gen y = _n
gen rnd = runiform()
sort rnd
Or in Mata
jumble(1::10)
Second question: Several ways. Here's one.
gen rnd = runiform()
bysort group (rnd): replace rnd = rnd[1]
sort rnd
General comment: For reproducibility, set the random number seed beforehand.
set seed 2803
or whatever.
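For comparison with the R call in the question, the same two ideas (shuffle the identifiers; give every member of a group the same random key and sort on it) can be sketched in Python. The variable names are illustrative only, and the group labels are taken from the example data in the question.

```python
import random

rng = random.Random(2803)  # fixed seed for reproducibility

# First question: the integers 1..10 in random order, each used once
ids = rng.sample(range(1, 11), 10)

# Second question: shuffle groups, keeping the members of a group together
rows = [(g, m) for g in "ABCDE" for m in "xy"]
key = {g: rng.random() for g in "ABCDE"}  # one random number per group
shuffled = sorted(rows, key=lambda r: key[r[0]])  # stable sort keeps x before y

print(ids)
print(shuffled)
```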

Efficient way of finding rows in which A>B

Suppose M is a matrix where each row represents a randomized sequence of a pool of N objects, e.g.,
1 2 3 4
3 4 1 2
2 1 3 4
How can I efficiently find all the rows in which a number A comes before a number B?
e.g., A=1 and B=2; I want to retrieve the first and the second rows (in which 1 comes before 2)
There you go:
[iA jA] = find(M.'==A);
[iB jB] = find(M.'==B);
sol = find(iA<iB)
Note that this works because, according to the problem specification, every number is guaranteed to appear once in each row.
To find rows of M with a given prefix (as requested in the comments): let prefix be a vector with the sought prefix (for example, prefix = [1 2]):
find(all(bsxfun(@eq, M(:,1:numel(prefix)).', prefix(:))))
Something like the following code should work. It checks whether A comes before B in each row.
temp = [1 2 3 4;
3 4 1 2;
2 1 3 4];
A = 1;
B = 2;
orderMatch = zeros(1,size(temp,1));
for i = 1:size(temp,1)
    match1 = temp(i,:) == A;
    match2 = temp(i,:) == B;
    aIndex = find(match1,1);
    bIndex = find(match2,1);
    if aIndex < bIndex
        orderMatch(i) = 1;
    end
end
solution = find(orderMatch);
Here orderMatch will be [1,1,0], because the first two rows have 1 coming before 2 but the third row does not, so solution is [1 2].
UPDATE
Added the find call on orderMatch to return row indices, as suggested by Luis.
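The same row test can be written compactly in Python as a point of comparison. This is only a sketch using plain lists; like the answers above, it relies on every number appearing exactly once in each row, and it reports 1-based row numbers to match the MATLAB results.

```python
M = [
    [1, 2, 3, 4],
    [3, 4, 1, 2],
    [2, 1, 3, 4],
]
A, B = 1, 2

# Keep the rows where the position of A is smaller than the position of B
rows = [i for i, row in enumerate(M, start=1) if row.index(A) < row.index(B)]
print(rows)  # [1, 2]
```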

Number of combinations of football pool

I have developed a VBA macro that combines three triple predictions of a football pool into the 27 possible combinations (27 being the maximum number of betting combinations for three games). I would like to modify it so that it can develop a system with fixed, double, and triple predictions.
For example, now the program only works for:
1st game 1 x 2
2nd game 1 x 2
3rd game 1 x 2
equal to (3 * 3 * 3 = 27 possible combinations)
but if the prediction was the following:
1st game 1 x
2nd game 1
3rd game 1 x 2
equal to (2 * 1 * 3 = 6 possible combinations)
In that case, only the valid columns (here, 6) should be printed.
Thank you in advance to anyone who can help me solve the problem.
Sub Combination_Prediction()
Dim A As Integer
Dim B As Integer
Dim C As Integer
Dim Col1Sviluppo As Integer
Dim Row1Sviluppo As Integer
Col1Sviluppo = 10
Row1Sviluppo = 14
For C = 3 To 5
For B = 3 To 5
For A = 3 To 5
Contatore = Contatore + 1
Col1Sviluppo = Col1Sviluppo + 1
Cells(Row1Sviluppo + 1, Col1Sviluppo) = Cells(2, A)
Cells(Row1Sviluppo + 2, Col1Sviluppo) = Cells(3, B)
Cells(Row1Sviluppo + 3, Col1Sviluppo) = Cells(4, C)
Cells(10, 10) = Contatore & " colonne elaborate"
Next A
Next B
Next C
End Sub
Disclaimer: your logic is based on unpredictable assumptions. Please do not rely on it if you're betting real money. It's all more complicated than you think it is. There is only one reliable way of betting and earning (it requires a lot of money to get started and a good understanding of bookmakers' policies), and it's called sure bets. But please, do not get into it.
Now, back to your original question.
You can have a function return the number of combinations based on the input: the combinations multipliers.
Let's assume that
combinations multipliers
1 - 1
2 - 1X
3 - 1X2
1 represents either home or away win, 1 combination
2 stands for home win or draw, away win or draw, 2 combinations
3 is default: win, draw, win
The code:
Sub Combination_Prediction()
' combinations multipliers
' 1 - 1
' 2 - 1X
' 3 - 1X2
Range("A1") = Combination(3, 3, 3) ' 1x2, 1x2, 1x2
Range("B2") = Combination(2, 1, 3) ' 1x, 1, 1x2
End Sub
Function Combination(c1 As Long, c2 As Long, c3 As Long) As Long
Dim i As Long, j As Long, k As Long, combinationMultiplier As Long
combinationMultiplier = 0
For i = 1 To c1
    For j = 1 To c2
        For k = 1 To c3
            combinationMultiplier = combinationMultiplier + 1
        Next k
    Next j
Next i
Combination = combinationMultiplier
End Function
If you run this code, you will see the number 27 in cell A1, which is the correct (and simplified) count of possible bets.
The Combination() function takes 3 parameters which are the 3 combinations.
In the first example the first input is 3, 3, 3 as from your sample
1st game = 1x2
2nd game = 1x2
3rd game = 1x2
Now look at the combinations multipliers above
1st game = 1x2 = 3
2nd game = 1x2 = 3
3rd game = 1x2 = 3
Therefore, your 3 parameters are: 3, 3, 3
The second sample you provided
1st game = 1x = 2
2nd game = 1 = 1
3rd game = 1x2 = 3
Therefore Combination(2, 1, 3) will return 6 (combinations) to cell B2.
Stick any combination of 1, 2, 3 into the combination function to get results. You can either print them to cell or use msgbox or debug.print for testing.
I hope that helps
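The function above only counts the combinations. If the goal is also to list the valid columns, one possible sketch (in Python rather than VBA, with illustrative variable names) enumerates the Cartesian product of the outcomes chosen for each game. The predictions list below encodes the second example from the question: a double, a fixed, and a triple prediction.

```python
from itertools import product

# Outcomes selected for each game: a double, a fixed and a triple prediction
predictions = [["1", "X"], ["1"], ["1", "X", "2"]]

# Every valid column picks exactly one outcome per game
columns = list(product(*predictions))
print(len(columns))  # 6
for column in columns:
    print(" ".join(column))
```

The same idea carries over to VBA by looping over only the cells that hold a selected outcome for each game instead of always looping over all three.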
