Set seeds for a parallel random forest in caret for reproducible results

I wish to run a random forest in parallel using the caret package, and I wish to set the seeds for reproducible results as in Fully reproducible parallel models using caret. However, I don't understand the seed line in the following code taken from the caret help: why do we sample 22 integers per resample (23 counting the single seed for the last model), when only 12 values of the parameter k are evaluated? For information, I wish to run 5-fold CV to evaluate 584 values of the RF parameter 'mtry'. Any help is much appreciated. Thank you.
## Not run:
## Do 5 repeats of 10-Fold CV for the iris data. We will fit
## a KNN model that evaluates 12 values of k and set the seed
## at each iteration.
set.seed(123)
seeds <- vector(mode = "list", length = 51)
for(i in 1:50) seeds[[i]] <- sample.int(1000, 22) # Why 22?
## For the last model:
seeds[[51]] <- sample.int(1000, 1)
ctrl <- trainControl(method = "repeatedcv",
                     repeats = 5,
                     seeds = seeds)

I'd say it is a mistake, and should be 12 instead of 22.
From what I understand, the model is fit on 10*5 = 50 resamples, once for each value of k within each resample. Hence, for each i in 1:50, you need 12 seeds (one for every k). After the best k is obtained, the final model is fit once more, and this time you only need one seed (no more repeated resampling).
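For the setup described in the question, the same pattern carries over. A minimal sketch, assuming plain 5-fold CV (5 resamples plus 1 final model) and a tuning grid of 584 mtry values, so each of the first five list elements holds 584 seeds:

set.seed(123)
seeds <- vector(mode = "list", length = 6)          # 5 resamples + 1 for the final model
for(i in 1:5) seeds[[i]] <- sample.int(10000, 584)  # one seed per mtry value in each fold
seeds[[6]] <- sample.int(10000, 1)                  # single seed for the final fit
ctrl <- trainControl(method = "cv", number = 5, seeds = seeds)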


Binary random number with a specific number of ones

I am looking to randomly insert ones into a binary number where each specific set of bits has a fixed number of ones.
For example, if I have a 15-bit number, each of the 3 sets of 5 bits must have exactly 3 ones. I need to generate, say, 40 such unique binary numbers.
import numpy as np
N = 15
K = 9 # K zeros, N-K ones
arr = np.array([0] * K + [1] * (N-K))
np.random.shuffle(arr)
This is something I found, but the issue is that it does not guarantee the ones are distributed the way I want: all the ones could end up grouped together at the beginning, leaving the last set of 5 bits all zeros, which is not what I'm looking for.
Also, this method does not guarantee that the combinations I get are unique.
Looking for any suggestions regarding this. Thank you!
If I understand the question correctly, you could do something like this in Python:
import random
def valgen():
    set_bits = [
        *random.sample(range(0, 5), 3),
        *random.sample(range(5, 10), 3),
        *random.sample(range(10, 15), 3),
    ]
    return sum(1 << i for i in set_bits)
i.e. sample three sets of integer values, without replacement, in each block and set those bits in the result.
If you want 40 unique values, I'd do:
vals = {valgen() for _ in range(40)}
while len(vals) < 40:
    vals.add(valgen())
See the birthday problem for why you should expect roughly one duplicate per set of 40.
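The same idea also fits the numpy approach from the question: shuffle each 5-bit block separately so every block carries exactly 3 ones, then retry on duplicates. A sketch (block size and counts follow the example in the question; valgen_np is an illustrative name):

import numpy as np

rng = np.random.default_rng()

def valgen_np():
    blocks = []
    for _ in range(3):                     # three blocks of 5 bits
        block = np.array([1, 1, 1, 0, 0])  # exactly 3 ones per block
        rng.shuffle(block)                 # shuffle within the block only
        blocks.append(block)
    return np.concatenate(blocks)          # full 15-bit pattern as a 0/1 array

# collect 40 unique patterns, retrying on the occasional duplicate
seen = set()
while len(seen) < 40:
    seen.add(tuple(valgen_np()))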

Is there a way to get a random sample from a particular decile of the normal distribution using Stata's rnormal() function?

I'm working with a dataset where the values of my variable of interest are hidden. I have the range (min, max), mean, and sd of this variable, and for each observation I know which decile its value lies in. Is there any way I can impute values for this variable using the random number generator or the rnormal() suite of commands in Stata? Something along the lines of:
set seed 1
gen imputed_var=rnormal(mean,sd,decile) if decile==1
Appreciate any help on this, thanks!
I am not familiar with Stata, but the following may get you in the right direction.
In general, to generate a random number in a certain decile:
Generate a uniform random number in [(decile-1)/10, decile/10], where decile is the desired decile, from 1 through 10.
Apply the normal quantile function (inverse CDF) to the number just generated.
Thus, in pseudocode, the following will achieve what you want (I'm not sure about the exact names of the corresponding functions in Stata, though, which is why it's pseudocode):
decile = 4 # 4th decile
# Generate a random number in the decile (here, [0.3, 0.4]).
v = runiform((decile-1)/10, decile/10)
# Convert the number to a normal random number
q = qnormal(v) # Quantile of the standard normal distribution
# Scale and shift the number to the desired mean
# and standard deviation
q = q * sd + mean
This is precisely the suggestion just made by @Peter O. I make the same assumption he did: that, by a common abuse of terminology, "decile" is your shorthand for a decile class, bin, or interval. Historically, deciles are the values corresponding to cumulative probabilities 0.1(0.1)0.9, not the bins those values delimit.
. clear
. set obs 100
number of observations (_N) was 0, now 100
. set seed 1506
. gen foo = invnormal(runiform(0, 0.1))
. su foo
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         foo |        100   -1.739382    .3795648  -3.073447  -1.285071
and (closer to your variable names)
gen wanted = invnormal(runiform(0.1 * (decile - 1), 0.1 * decile))
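To bring in the known mean and sd from the question, scale and shift the same expression, exactly as in the last step of the earlier pseudocode. A sketch (here mean, sd, and decile stand for your known values, whether stored as variables or typed in as numbers):

set seed 1
gen wanted = mean + sd * invnormal(runiform(0.1 * (decile - 1), 0.1 * decile))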

generate clustered spatstat marks?

I was wondering if anyone knows how to assign marks in spatstat so that they tend to cluster spatially? I have a set of lat long coordinates that I want to categorize into 4 groups. I have figured out how to randomly assign marks/groups to these points using the following code:
as.ppp(data, window, marks = factor(sample(1:4, replace = TRUE)))
But I can't figure out how to assign the marks so that groups tend to occupy points closer to one another. As a further complication, I would also like the number of points within each group to be the same, specified number each time. Does anyone have any leads? Thanks in advance!
Typically in spatstat we define models that describe/generate points at random locations, possibly with random marks. If I understand you correctly, you have a fixed set of locations and simply want to assign random marks. How many points do you have? If you don't have too many, a simple suggestion is to generate a multivariate normal variable whose correlation decays with distance, and then take the n_1 lowest values for the first mark, the n_2 next values for the second mark, and so on. A simple example with four equal-sized groups of points:
library(spatstat)
library(mvtnorm)
set.seed(42) # Make reproducible
X <- redwood # Example data
n <- npoints(redwood)
Xdist <- pairdist(X) # n x n matrix of distances in X
decay_rate <- 1 # Parameter for covariance structure
sigma <- exp(-decay_rate * Xdist)
m <- rmvnorm(1, rep(0, n), sigma)
breaks <- quantile(m, probs = c(0, .25, .5, .75, 1)) # breaks to cut marks in four equal sized groups
marks(X) <- cut(m, breaks = breaks, include.lowest=TRUE, labels = 1:4)
plot(X)
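If the groups must have exact, user-specified sizes rather than quantile-based cuts, one option is to follow the "take the n_1 lowest values" idea directly. A sketch (the sizes below are hypothetical and must sum to npoints(X), which is 62 for redwood):

sizes <- c(16, 16, 15, 15)                               # hypothetical group sizes
grp <- rep(1:4, times = sizes)                           # group label for each rank position
marks(X) <- factor(grp[rank(m, ties.method = "first")])  # lowest values -> group 1, etc.
plot(X)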

What is the fast way to calculate this summation in MATLAB?

So I have the following constraints: for each n in B, the sum of x_mn over m in U must be at most 1, and for each m in U, the sum of x_mn over n in B must be at most 1.
How can I write this in MATLAB in an efficient way? The inputs are x_mn, M, and N. The set B = {1,...,N} and the set U = {1,...,M}.
I did it like this (because I write x as the following vector):
x = [x_11, x_12, ..., x_1N, x_21, x_22, ..., x_M1, x_M2, ..., x_MN]
%# first constraint
function R1 = constraint_1(M, N)
    ee = eye(N);
    R1 = zeros(N, N*M);
    for m = 1:M
        R1(:, (m-1)*N+1:m*N) = ee;
    end
end

%# second constraint
function R2 = constraint_2(M, N)
    ee = ones(1, N);
    R2 = zeros(M, N*M);
    for m = 1:M
        R2(m, (m-1)*N+1:m*N) = ee;
    end
end
With the above code I get a 0-1 matrix A = [R1; R2], and the constraints become A*x <= 1.
For example, for M = N = 2, A looks like this:
A =
     1     0     1     0
     0     1     0     1
     1     1     0     0
     0     0     1     1
Then I will create a function test(x) that returns true or false depending on whether x satisfies the constraints.
I would like to get some help from you to optimize my code.
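For concreteness, the test function described above might look like this (a sketch only; it assumes the two constraint builders defined earlier and the stacked vector x, and the name test_constraints is just illustrative):

function ok = test_constraints(x, M, N)
    A = [constraint_1(M, N); constraint_2(M, N)];  % stack the 0-1 constraint matrices
    ok = all(A * x(:) <= 1);                       % true only if every constraint holds
end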
You should place your x_mn values in a matrix. After that, you can sum along each dimension to get what you want. Looking at your constraints, you will place these values in an M x N matrix, where M is the number of rows and N is the number of columns.
You can certainly place your values in a vector and construct your summations in the way you intended earlier, but you would have to write for loops to properly subset the proper elements in each iteration, which is very inefficient. Instead, use a matrix, and use sum to sum over the dimensions you want.
For example, let's say your values of x_mn range from 1 to 20, B is the set from 1 to 5, and U is the set from 1 to 4. As such:
X = vec2mat(1:20, 5)
X =
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
vec2mat takes a vector and reshapes it into a matrix. You specify the number of columns you want as the second element, and it will create the right amount of rows to ensure that a proper matrix is built. In this case, I want 5 columns, so this should create a 4 x 5 matrix.
The first constraint can be achieved by doing:
first = sum(X,1)
first =
34 38 42 46 50
sum works for vectors as well as matrices. If you have a matrix supplied to sum, you can specify a second parameter that tells you in what direction you wish to sum. In this case, specifying 1 will sum over all of the rows for each column. It works in the first dimension, which is the rows.
What this does is sum x_mn over all values of m in U, for each n in B, which is exactly the first constraint. You are simply summing every column individually.
The second constraint can be achieved by doing:
second = sum(X,2)
second =
15
40
65
90
Here we specify 2 as the second parameter so that we sum over all of the columns for each row; the second dimension runs over the columns. What this does is sum x_mn over all values of n in B, for each m in U. Basically, you are simply summing every row individually.
BTW, your code is not achieving what you think it's achieving. All it does is replicate the identity matrix (or a row of ones) across groups of columns; it performs no summation itself. What you are building are the coefficient matrices that, when multiplied by x, enforce the conditions stated at the beginning of your post. These are the matrices required to express the constraints.
Now, if you want to check to see if the first condition or second condition is satisfied, you can do:
%// First condition satisfied?
firstSatisfied = all(first <= 1);
%// Second condition satisfied
secondSatisfied = all(second <= 1);
This checks every element of first or second to see whether the sums computed above are all <= 1. If they all satisfy the constraint, the result is true; otherwise it is false.
Please let me know if you need anything further.
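To tie this back to the stacked vector x from the question, a minimal sketch (assuming the ordering x = [x_11, ..., x_1N, x_21, ..., x_MN] described there):

X = reshape(x(:), N, M).';               % reshape fills column-wise, so transpose to get M-by-N
firstSatisfied  = all(sum(X, 1) <= 1);   % sum over m in U for each n in B
secondSatisfied = all(sum(X, 2) <= 1);   % sum over n in B for each m in U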

increase the performance to generate random numbers in a range with step-size

To make sure that this is not a duplicate, I have already checked this and this out.
I want to generate random numbers in a specific range on a grid with a fixed step size (not from a continuous distribution).
For example, I want to generate random numbers between -2 and 3 in which the step between two consecutive values is 0.02 (i.e. [-2 -1.98 -1.96 ... 2.96 2.98 3]), so a generated number should be 2.96, not 2.95.
I have tried this:
a = -2*100;
b = 3*100;
r = (b-a).*rand(5,1) + a;
for i = 1:length(r)
    if r(i) >= 0
        if mod(fix(r(i)), 2)
            r(i) = ceil(r(i))/100;
        else
            r(i) = floor(r(i))/100;
        end
    else
        if mod(fix(r(i)), 2)
            r(i) = floor(r(i))/100;
        else
            r(i) = ceil(r(i))/100;
        end
    end
end
and it works.
There is an alternative way to do this in MATLAB, which is:
y = datasample(-2:0.02:3,5,'Replace',false)
I want to know:
1. How can I make my own implementation faster (improve the performance)?
2. If the second method is faster (it looks faster to me), how can I use a similar implementation in C++?
Those previous answers do cover your case if you read carefully. For example, this one produces random numbers between limits with a step size of one. But let's generalize this to an arbitrary step size in case you can't figure out how to get there. There are several different ways. Here's one using randi, where we use the default step size of one and the range from one to the number of possible values as indices:
lo = 2;
hi = 3;
step = 0.02;
v = lo:step:hi;
r = v(randi(length(v),[5 1]))
If you look inside datasample (type edit datasample in your command window to view the code) you'll see that it's doing something very similar to this answer. In the case of the 'Replace' option being true see around line 135 (in R2013a at least).
If the 'Replace' option is false, as in your use of datasample above, then randperm actually needs to be used instead (see around line 159):
lo = 2;
hi = 3;
step = 0.02;
v = lo:step:hi;
r = v(randperm(length(v),51))
Because there is no replacement in this case, 51 (the number of elements in v) is the maximum number of values that can be requested in a call, and all values of r will be unique.
In C++ you should not use rand() if you're doing scientific computing and generating large numbers of random variates. Instead you should use a generator with a large period, such as the Mersenne Twister (the default in MATLAB). C++11 includes a version of this generator in the <random> header. If you want something fast, you should try the double-precision SIMD-oriented Fast Mersenne Twister (dSFMT). You'll have to ask another question if you want to implement your code in C++.
The distribution you want is a simple transform of integers, so how about:
step = 0.02;
r = randi([-2 3] / step, [5, 1]) * step;
In C++, rand() generates integers too, so it should be pretty obvious how to take a similar approach there.
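For reference, here is a minimal C++11 sketch of the same scaled-integer idea, using the <random> engine mentioned above instead of rand() (the range and step follow the question; the variable names are illustrative):

#include <iostream>
#include <random>

int main() {
    const double lo = -2.0, hi = 3.0, step = 0.02;
    const int nsteps = static_cast<int>((hi - lo) / step + 0.5);  // 250 grid intervals

    std::mt19937 gen(std::random_device{}());             // Mersenne Twister engine
    std::uniform_int_distribution<int> pick(0, nsteps);   // uniform integer in [0, 250]

    for (int i = 0; i < 5; ++i)
        std::cout << lo + pick(gen) * step << '\n';        // values on the grid -2, -1.98, ..., 3
    return 0;
}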
