I have the following code using random forest as method which is fully reproducible if you run it in parallel mode on the same machine:
library(doParallel)
library(caret)
recursive_feature_elimination <- function(dat){
all_preds <- dat[,which(names(dat) %in% c("Time", "Chick", "Diet"))]
response <- dat[,which(names(dat) == "weight")]
sizes <- c(1:(ncol(all_preds)-1))
# set seeds manually
set.seed(42, kind = "Mersenne-Twister", normal.kind = "Inversion")
# rfe() takes `seeds` as a list with number*repeats + 1 elements (here 3*5 + 1 = 16).
# Each of the first 15 elements is an integer vector of length length(sizes)+1.
seeds <- vector(mode = "list", length = 16)
for(i in 1:15) seeds[[i]]<- sample.int(n=1000, size = length(sizes)+1)
# for the last model
seeds[[16]]<-sample.int(1000, 1)
seeds_list <- list(rfe_seeds = seeds,
train_seeds = NA)
# specify rfeControl
contr <- caret::rfeControl(functions=rfFuncs, method="repeatedcv", number=3, repeats=5,
saveDetails = TRUE, seeds = seeds, allowParallel = TRUE)
# recursive feature elimination caret
results <- caret::rfe(x = all_preds,
y = response,
sizes = sizes,
method ="rf",
ntree = 250,
metric= "RMSE",
rfeControl=contr )
return(results)
}
dat <- as.data.frame(ChickWeight)
cores <- detectCores()
cl <- makePSOCKcluster(cores, outfile="")
registerDoParallel(cl)
results <- recursive_feature_elimination(dat)
stopCluster(cl)
registerDoSEQ()
The outcome on my machine is:
Variables  RMSE Rsquared   MAE RMSESD RsquaredSD MAESD Selected
        1 39.14   0.6978 24.60  2.755    0.02908 1.697
        2 23.12   0.8998 13.90  2.675    0.02273 1.361        *
        3 28.18   0.8997 20.32  2.243    0.01915 1.225
The top 2 variables (out of 2):
Time, Chick
I am using a Windows OS with one CPU and 4 cores. If the code is run on a UNIX OS using multiple CPUs with multiple cores, the outcome is different. I think this behaviour shows up because of the random number generation, which obviously differs between my system and the multi-CPU system. The same happens with train().
How can I get fully reproducible results independent of the OS and independent of how many CPUs and cores used for parallelization?
How can I assure that the same random numbers are used for each internal process of rfe and randomForest no matter in which sequence during the parallel computing the process is run?
How are the random numbers generated for each parallel process?
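The idea behind passing one seed per resample (as the seeds list above does) can be illustrated outside caret. The following is only a minimal Python sketch, not caret's implementation: the function names are made up for illustration, and run_task is a stand-in for one resampling iteration. Each task builds its own private generator from its own seed, so the result does not depend on how many workers there are or in which order the tasks happen to run.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def make_task_seeds(master_seed, n_tasks):
    """Derive one integer seed per task from a single master seed."""
    rng = random.Random(master_seed)
    return [rng.randrange(1, 1001) for _ in range(n_tasks)]

def run_task(seed):
    """Hypothetical stand-in for one resample: uses only its own private RNG."""
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(100))

def run_all(n_workers, master_seed=42, n_tasks=15):
    seeds = make_task_seeds(master_seed, n_tasks)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() returns results in task order, whatever the execution order was
        return list(pool.map(run_task, seeds))

# Same results with 1 worker and with 4 workers
print(run_all(1) == run_all(4))
```

Whether a given R setup behaves this way depends on every random draw inside the worker being made after the per-task seed is set, which is an assumption this sketch does not verify for rfe or randomForest.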
Consider the following scenario:
We use n MPI processes to evaluate chunks of a large matrix A. Each chunk is evaluated by a unique process. Number of OMP threads is 1.
We gather these chunks in the master process (rank=0) and build the full matrix A.
The master process calls a function x = f(A) and broadcasts the result x to all processes.
The issue is that f(A) takes a long time, and other processes have to wait.
If the function f can be accelerated with OMP parallelism, does it make sense to set the number of OMP threads (on the master process) equal to n before calling f(A) and changing it to 1 afterwards?
In other words, what can be wrong with the following pseudo-code?
mpirun -n n ./exe # command for running the code
where the pseudo-code for exe looks something like
a = g(mpi_rank, ... ) # process-specific matrix
A = mpi_gather(a, mpi_rank=0) # gather all
if mpi_rank == 0
    set_num_omp_threads(n)
    x = f(A)
    set_num_omp_threads(1)
else
    x = 0
mpi_bcast(x, mpi_rank=0)
What are the possible performance issues or pitfalls?
EDIT: The original code is too complicated, but I have replicated the issue via the following code.
# run by: mpirun -n 6 python test.py
import time
import torch
import torch.distributed as dist
def get_weights(A, Y, use_omp=False):
    # maximize OMP threads
    master = dist.get_rank() == 0
    if use_omp and master:
        ws = dist.get_world_size()
        th = torch.get_num_threads()
        torch.set_num_threads(ws*th)
    # actual calculation; only master
    if dist.get_rank() == 0:
        Q, R = torch.qr(A)
        W = R.inverse() @ Q.t() @ Y
        print(torch.get_num_threads())
    else:
        W = torch.zeros(A.shape[1], 1)
    dist.broadcast(W, 0)
    # reset OMP threads
    if use_omp and master:
        torch.set_num_threads(th)
    return W
def test(n=100000, m=300, s=10, use_omp=False):
    # ...
    # normal distributed code ...
    # ...
    A = torch.rand(n, m)
    _W = torch.rand(m, 1)
    Y = A @ _W
    W = get_weights(A, Y, use_omp=use_omp) # <- this
    res = W.allclose(_W, atol=1e-3)
    return res
def timeit(repeat=10, use_omp=False):
    t1 = time.time()
    for _ in range(repeat):
        test(use_omp=use_omp)
    t2 = time.time()
    return (t2-t1)/repeat
if __name__ == '__main__':
    dist.init_process_group('mpi')
    print(timeit(use_omp=False)) # -> 3.036880087852478
    print(timeit(use_omp=True)) # -> 1.3959508180618285
In the above example, setting the number of threads to 6 improved the speed by a factor of ~2. But when I tried something similar in my actual code (and on a much larger cluster), it became almost 2 times slower!
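The set-then-restore pattern used in get_weights can be wrapped in a context manager so that the previous thread count is restored even if the computation raises. This is only a sketch: temporary_threads, fake_get and fake_set are hypothetical names, and the fake getter/setter stands in for a real pair such as torch.get_num_threads and torch.set_num_threads so that the example runs without MPI or torch.

```python
from contextlib import contextmanager

@contextmanager
def temporary_threads(get_threads, set_threads, n):
    """Set the thread count to n inside the with-block, then restore the old value."""
    previous = get_threads()
    set_threads(n)
    try:
        yield previous
    finally:
        # Runs on normal exit and on exceptions alike
        set_threads(previous)

# Hypothetical stand-in for a real thread-count API
_state = {"threads": 1}

def fake_get():
    return _state["threads"]

def fake_set(n):
    _state["threads"] = n

with temporary_threads(fake_get, fake_set, 6):
    inside = fake_get()   # thread count while the heavy call would run
after = fake_get()        # thread count once the block has exited
print(inside, after)
```

This only tidies up the bookkeeping. It does not address whether the extra threads actually get free cores while the other MPI ranks are waiting, which is the open performance question here.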
EDIT2: The essence of this question is that (on a single node with n cores) most of the calculation (70%) is optimal with MPI using n processes, the other part (30%) is optimal with n OMP threads. I was wondering about optimal utilization of the available cores in both parts.
Although the comments were very helpful, I guess there are no easy answers. In this particular case the mentioned region is a linear algebra problem and using scalapack is probably the best solution.
But the question stands for general cases.
As I understand your question, you want to launch the OpenMP threads for the computation and then shut down the OpenMP runtime afterwards. With your pseudo-code example, it would look like this:
a = g(mpi_rank, ... ) # process-specific matrix
A = mpi_gather(a, mpi_rank=0) # gather all
if mpi_rank == 0
    x = f(A) # do a parallel region in f()
    omp_pause_resource_all(omp_pause_hard)
else
    x = 0
mpi_bcast(x, mpi_rank=0)
See the functions omp_pause_resource_all() and omp_pause_resource() of the OpenMP API specification. They terminate most of the OpenMP implementation, such that only a very small part that should not affect performance will remain.
Note, the functions were introduced with OpenMP API version 5.0, so you will need a fairly new OpenMP compiler for this to work.
Arrange your MPI processes so that node 0 has only one process, and all other nodes have one process per core. Then process zero can easily launch an OpenMP (or otherwise threaded) shared-memory parallel section.
I'm using the caret::train() function to develop a weighted knn classification model (kknn) with 10-fold cross-validation and a tuneGrid containing 15 values for kmax, one value for distance, and 3 values for kernel.
That’s 450 total iterations if I understand the process correctly (an iteration being the computation of the probability of a given outcome for a given combination of kmax, distance, and kernel). x has about 480,000 data points (6 predictors each having about 80,000 observations), and y has about 80,000 data points.
Understanding that there are innumerable variables affecting performance, how long can I reasonably expect the train function to take if run on a pc with an 8-core 3GHz Intel processor and 32GB of RAM?
It currently takes about 70 minutes per fold, which is about 1.5 minutes per iteration. Is this reasonable, or excessive?
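The arithmetic behind those figures can be written out as follows. This is a rough sketch that assumes every grid combination is fitted separately in every fold and ignores the final refit on the full training set.

```python
# Grid size: kmax = seq(11, 39, 2), one distance, three kernels
n_kmax = len(range(11, 40, 2))      # 11, 13, ..., 39 -> 15 values
n_distance = 1
n_kernel = 3
n_folds = 10

combos = n_kmax * n_distance * n_kernel   # 45 parameter combinations
n_fits = combos * n_folds                 # 450 fits in total

minutes_per_fold = 70                     # observed runtime per fold
minutes_per_fit = minutes_per_fold / combos
total_hours = minutes_per_fold * n_folds / 60

print(combos, n_fits)
print(round(minutes_per_fit, 2), round(total_hours, 1))
```

At roughly 1.56 minutes per fit, the full cross-validation comes to a little under 12 hours on one core.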
This is a kknn learning exercise. I realize there are other types of algorithms that produce better results more efficiently.
Here is the essential code:
x <- as.matrix(train_set2[, c("n_launch_angle", "n_launch_speed", "n_spray_angle_Kolp", "n_spray_angle_adj", "n_hp_to_1b", "n_if_alignment")])
y <- train_set2$events
set.seed(1)
fitControl <- trainControl(method = "cv", number = 10, p = 0.8, returnData = TRUE,
returnResamp = "all", savePredictions = "all",
summaryFunction = twoClassSummary, classProbs = TRUE,
verboseIter = TRUE)
tuneGrid <- expand.grid(kmax = seq(11, 39, 2),
distance = 2,
kernel = c("triangular", "gaussian", "optimal"))
kknn_train <- train(x, y, method = "kknn",
tuneGrid = tuneGrid, trControl = fitControl)
As we have established in the comments, it is reasonable to expect this type of runtime. There are a few steps you can take to reduce it:
Running your code in parallel
Using a more efficient OS, like Linux
Being more economical in your trainControl(): is it really necessary to have returnResamp = "all"? There are small gains in controlling these.
Clearly, the first one is a no-brainer. For the second one, I can find as many computer engineers who swear by Linux as those who swear by Windows. What convinced me to switch to Linux was this particular test, which I hope will give you what it gave me.
# Calculate distance matrix
test_data <- function(dim, num, seed = 1903) {
set.seed(seed)
dist(
matrix(
rnorm(dim * num), nrow = num
)
)
}
# Benchmarking
microbenchmark::microbenchmark(test_data(120,4500))
This piece of code simply runs faster on the exact same machine when it runs Linux. At least this was my experience.
Question: How can I generate a random number in the interval [0,1] from a Gaussian distribution in Julia?
I gather randn is the way to generate normally distributed random numbers, but the documentation's description of how to specify a range is quite opaque.
Use the Distributions package. If you don't already have it:
using Pkg ; Pkg.add("Distributions")
then:
using Distributions
mu = 0 #The mean of the truncated Normal
sigma = 1 #The standard deviation of the truncated Normal
lb = 0 #The truncation lower bound
ub = 1 #The truncation upper bound
d = truncated(Normal(mu, sigma), lb, ub) #Construct the distribution (recent Distributions.jl versions prefer lowercase truncated; older ones used Truncated)
x = rand(d, 100) #Simulate 100 obs from the truncated Normal
or all in one line:
x = rand(truncated(Normal(0, 1), 0, 1), 100)
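To make clear what "truncated Normal on [0, 1]" means, here is a minimal rejection-sampling sketch in Python: draw from the untruncated Normal and keep only the draws that land inside the bounds. This is not how Distributions.jl necessarily does it internally, and the function name is made up for illustration. Rejection is simple but becomes slow when the interval carries very little probability mass.

```python
import random

def truncated_normal(mu, sigma, lb, ub, n, seed=None):
    """Draw n values from Normal(mu, sigma) conditioned on lb <= x <= ub."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        x = rng.gauss(mu, sigma)
        if lb <= x <= ub:      # keep only draws inside the truncation bounds
            out.append(x)
    return out

samples = truncated_normal(0.0, 1.0, 0.0, 1.0, 100, seed=1)
print(min(samples), max(samples))
```

Note that the result is not a rescaled Gaussian: it is the standard Normal density restricted to [0, 1] and renormalised.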
I have two matrices: one that contains all the mean values and another that contains all the standard deviations. I want to simulate a random number for each of the three investors and see which investor gets the highest value.
For example, loan 1 has three investors. I take the highest of
rnorm(1,means[1,1],sd[1,1]),rnorm(1,means[1,2],sd[1,2]),rnorm(1,means[1,3],sd[1,3])
and store it. I want to simulate this 1000 times and store the results as follows.
Output
Can I use a combination of mapply, sapply and replicate to do it? If you can give me some pointers I would be very grateful.
means <- matrix(c(-0.086731728,-0.1556901,-0.744495,
-0.166453802, -0.1978284, -0.9021422,
-0.127376145, -0.1227214, -0.6926699
), ncol = 3)
means <- t(means)
colnames(means) <- c("inv1","inv2","inv3")
rownames(means) <- c("loan1","loan2","loan3")
sd <- matrix(c(0.4431459, 0.5252441, 0.5372112,
0.4431882, 0.5252268, 0.5374614,
0.4430836, 0.5248798, 0.536924
), ncol = 3)
sd <- t(sd)
colnames(sd) <- c("inv1","inv2","inv3")
rownames(sd) <- c("loan1","loan2","loan3")
Given this is just an element-wise operation, you can use an appropriate vectorised function to compute this:
# Create a function to perform the computation you want
# Get the highest value from 1000 simulations
f <- function(m,s,reps=1000) max(rnorm(reps,m,s))
# Convert this function to a vectorised binary function
`%f%` <- Vectorize(f)
# Generate results - this will be a vector
results <- means %f% sd
# Tidy up results
results <- matrix(results,ncol(means))
colnames(results) <- colnames(means)
rownames(results) <- rownames(means)
# Results
results
          inv1     inv2      inv3
loan1 1.486830 1.317569 0.8679278
loan2 1.212262 1.762396 0.7514182
loan3 1.533593 1.461248 0.7539696
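The code above reports, for each loan and investor, the largest of 1000 draws. If what is wanted is instead the round-by-round comparison described in the question (one draw per investor, note who is highest, repeat 1000 times), a sketch in Python could look like the following. It is an illustration under that reading, not a translation of the R answer, and simulate_winners is a made-up name. The numbers are the question's matrices after transposing, one row per loan.

```python
import random

means = {
    "loan1": [-0.086731728, -0.1556901, -0.744495],
    "loan2": [-0.166453802, -0.1978284, -0.9021422],
    "loan3": [-0.127376145, -0.1227214, -0.6926699],
}
sds = {
    "loan1": [0.4431459, 0.5252441, 0.5372112],
    "loan2": [0.4431882, 0.5252268, 0.5374614],
    "loan3": [0.4430836, 0.5248798, 0.536924],
}

def simulate_winners(means, sds, reps=1000, seed=0):
    """Per loan, count how often each investor has the highest draw."""
    rng = random.Random(seed)
    wins = {loan: [0] * len(mu) for loan, mu in means.items()}
    for loan, mu in means.items():
        for _ in range(reps):
            # One draw per investor for this round
            draws = [rng.gauss(m, s) for m, s in zip(mu, sds[loan])]
            wins[loan][draws.index(max(draws))] += 1
    return wins

wins = simulate_winners(means, sds)
for loan, counts in wins.items():
    print(loan, dict(zip(["inv1", "inv2", "inv3"], counts)))
```

Each loan's counts add up to reps, so dividing by reps gives the estimated probability that a given investor comes out on top.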
I'm intending to implement a random number generator via Swift 3. I have three different methods for generating an integer (between 0 and 50000) ten thousand times non-stop.
Do these generators use the same math principles of generating a value or not?
What generator is less CPU and RAM intensive at runtime (having 10000 iterations)?
method A:
var generator: Int = random() % 50000
method B:
let generator = Int(arc4random_uniform(50000))
method C:
import GameKit
let number: [Int] = Array(0...50000)
func generator() -> Int {
let random = GKRandomSource.sharedRandom().nextInt(upperBound: number.count)
return number[random]
}
All of these are pretty well documented, and most have published source code.
var generator: Int = random() % 50000
Well, first of all, this is modulo biased, so it certainly won't be equivalent to a proper uniform random number. The docs for random explain it:
The random() function uses a non-linear, additive feedback, random number generator, employing a default table of size 31 long integers. It returns successive pseudo-random numbers in the range
from 0 to (2**31)-1. The period of this random number generator is very large, approximately 16*((2**31)-1).
But you can look at the full implementation and documentation in Apple's source code for libc.
Contrast the documentation for arc4random_uniform (which does not have modulo bias):
These functions use a cryptographic pseudo-random number generator to generate high quality random bytes very quickly. One data pool is used for all consumers in a process, so that consumption
under program flow can act as additional stirring. The subsystem is re-seeded from the kernel random number subsystem on a regular basis, and also upon fork(2).
And the source code is also available. The important thing to note from arc4random_uniform is that it avoids modulo bias by adjusting the modulo correctly and then generating random numbers until it is in the correct range. In principle this could require generating an unlimited number of random values; in practice it is incredibly rare that it would need to generate more than one, and rare-to-the-point-of-unbelievable that it would generate more than that.
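Modulo bias and the rejection fix are easiest to see with a tiny generator. The sketch below is in Python with made-up function names. It counts exhaustively what happens when a 3-bit source (values 0 to 7) is reduced to the range 0..<3, first with a plain %, then with the discard-and-retry approach described above. uniform_below applies the same idea to an arbitrary raw source and is an illustration, not Apple's code.

```python
import random
from collections import Counter

def modulo_counts(range_size, bound):
    """How often each result occurs if every raw value is reduced with %."""
    return Counter(v % bound for v in range(range_size))

def rejection_counts(range_size, bound):
    """Same count, but the first range_size % bound raw values are discarded."""
    threshold = range_size % bound
    return Counter(v % bound for v in range(threshold, range_size))

def uniform_below(next_raw, range_size, bound):
    """Retry until the raw value falls in the unbiased region, then reduce."""
    threshold = range_size % bound
    while True:
        v = next_raw()
        if v >= threshold:
            return v % bound

print(sorted(modulo_counts(8, 3).items()))     # [(0, 3), (1, 3), (2, 2)]
print(sorted(rejection_counts(8, 3).items()))  # [(0, 2), (1, 2), (2, 2)]

rng = random.Random(0)
value = uniform_below(lambda: rng.getrandbits(31), 2**31, 50000)
```

With a 31-bit source and a bound of 50000, fewer than 50000 of the 2**31 raw values are rejected, which is why a retry is so rare in practice.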
GKRandomSource.sharedRandom() is also well documented:
The system random source shares state with the arc4random family of C functions. Generating random numbers with this source modifies the outcome of future calls to those functions, and calling those functions modifies the sequence of random values generated by this source. As such, this source is neither deterministic nor independent—use it only for trivial gameplay mechanics that do not rely on those attributes.
For performance, you would expect random() to be fastest since it never seeds itself from the system entropy pool, and so it also will not reduce the entropy in the system (though arc4random only does this periodically, I believe around every 1.5MB or so of random bytes generated; not for every value). But as with all things performance, you must profile. Of course since random() does not reseed itself it is less random than arc4random, which is itself less random than the source of entropy in the system (/dev/random).
When in doubt, if you have GameplayKit available, use it. Apple selected the implementation of sharedRandom() based on what they think is going to work best in most cases. Otherwise use arc4random. But if you really need to minimize impact on the system for "pretty good" (but not cryptographic) random numbers, look at random. If you're willing to take "kinda random if you don't look at them too closely" numbers and have even less impact on the system, look at rand. And if you want almost no impact on the system (guaranteed O(1), inlineable), see XKCD's getRandomNumber().
Xorshift generators are among the fastest non-cryptographically-secure random number generators, requiring very small code and state.
An example Swift implementation of xorshift128+:
func xorshift128plus(seed0 : UInt64, seed1 : UInt64) -> () -> UInt64 {
var s0 = seed0
var s1 = seed1
if s0 == 0 && s1 == 0 {
s1 = 1 // The state must be seeded so that it is not everywhere zero.
}
return {
var x = s0
let y = s1
s0 = y
x ^= x << 23
x ^= x >> 17
x ^= y
x ^= y >> 26
s1 = x
return s0 &+ s1
}
}
// create random generator, seed as needed!!
let random = xorshift128plus(seed0: 0, seed1: 0)
for _ in 0..<100 {
// and use it later
random()
}
To avoid modulo bias, you could use
func random_uniform(bound: UInt64)->UInt64 {
var u: UInt64 = 0
let b: UInt64 = (u &- bound) % bound
repeat {
u = random()
} while u < b
return u % bound
}
In your case:
let r_number = random_uniform(bound: 50000) // r_number from interval 0..<50000