Random number seeds for Julia cluster computing

I have some Julia code that I want to submit to a remote computing cluster by running a large number of jobs in parallel (around 10,000). The way this code works is that the main function (call it "main.jl") calls another function (call it "generator.jl") which uses random numbers such as rand(Float64) and so on. I submit main.jl via a bash file, and I run N jobs in parallel by including
#PBS -t 1-N
I want to make sure I have a different random number generator for each of the N job submissions, but I'm not sure how to do this. I was thinking of setting a random seed based on an environment variable, i.e., by putting
@everywhere using Random
Random.seed!(ENV_VAR)
in main.jl. However, I'm not sure how to get the environment variable ENV_VAR. In MATLAB, I know I can get this via
NUM = getenv( 'PBS_ARRAYID' )
But I don't know how to do this in Julia. If I manage to set this new random seed in main.jl, will this generate a different random seed every time the bash script submits main.jl to the cluster? In a similar vein, do I even need to do this in Julia, considering the Julia RNG uses MersenneTwister?
Just in case it matters, I have been using Julia 1.5.1 on the remote machine.

There are two issues here:
1. Getting the job number.
2. Using the job number in random number generation.
Each of these issues has two solutions - one more elegant and one less so, but both are fine.
Ad.1.
For managing job numbers, consider using PBS together with ClusterManagers.jl. It provides a command addprocs_pbs(np::Integer; qsub_flags="") that makes it possible to manage the run numbers and orchestrate the cluster from within Julia. In many scenarios you will find this approach more comfortable. Here you can seed the random number generator (more on that later) with myid(), which returns the worker number. Most likely, in this scenario you will run your computations using a @distributed loop, and you can use that value for seeding the RNG, as in the sketch below.
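A minimal sketch of this approach (it assumes ClusterManagers.jl is installed and a PBS queue is reachable; the worker count and the loop body are placeholders):
using Distributed, ClusterManagers

addprocs_pbs(10)                      # request 10 workers through qsub

@everywhere using Random
@everywhere Random.seed!(myid())      # a distinct seed on every worker process

results = @distributed (vcat) for i in 1:1000
    [rand()]                          # stand-in for the real per-iteration work
end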
If you are instead orchestrating the array jobs externally via bash scripts, then perhaps the best approach is to pass the job number to the Julia process as a command-line parameter and read it from the ARGS variable, or to have a setup bash script export an environment variable that can be read from the ENV dictionary.
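For example (PBS_ARRAYID is the environment variable that #PBS -t sets under Torque/PBS; the ARGS fallback covers the command-line variant):
# read the array index from the scheduler, or else from the first command-line argument
jobid = haskey(ENV, "PBS_ARRAYID") ? parse(Int, ENV["PBS_ARRAYID"]) : parse(Int, ARGS[1])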
Ad.2.
There are two approaches here. First, you can simply create a new MersenneTwister on each worker and use it for your random streams. For example (assuming you have some variable jobid):
using Random
rnd = MersenneTwister(jobid)
rand(rnd, 4)
This is basically OK, and such streams are generally believed not to be correlated. However, if you are worried that this approach might introduce artifacts into your simulation, you can be more careful and use a single random stream divided across processes. This is also perhaps the most state-of-the-art solution:
using Random, Future
rnd = Future.randjump(MersenneTwister(0), jobid*big(10)^20)
This will make all processes share the same huge random number stream. (Note that the state of the Mersenne Twister is 19937 bits and its period is 2^19937 - 1, so a jump of this size is not big at all; big(10)^20 is the recommended jump step because that jump is already precomputed in the randjump implementation.)
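Putting the two parts together for the externally orchestrated array-job scenario, each of the N array jobs might run something like this sketch (the final computation is a placeholder):
using Random, Future

jobid = parse(Int, ENV["PBS_ARRAYID"])                  # 1..N, from #PBS -t 1-N
rng = Future.randjump(MersenneTwister(0), jobid * big(10)^20)
x = rand(rng, 1000)                                     # stand-in for the draws in generator.jl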

Related

In Julia, setting a random seed but not generating the same values when calling a function from different parts of the program

I have a Julia program which runs some Monte Carlo simulations. To do this, I set a seed before calling the function that generates my random values. This function can be called from different parts of my program, and the problem is that I am not generating the same random values when I call the function from different parts of my program.
My question is: in order to obtain the same reproducible results, do I have to call this function always from the same part of the program? Maybe this question is more about computer science than about the programming language itself.
I thought of seeding inside the function with an increasing index, but I still do not get the same results.
It is easier to control randomness according to your needs if you pass your function an RNG (random number generator) object as a parameter.
For example:
using Random
function foo(a=3; rng=MersenneTwister(123))
    return rand(rng, a)
end
Now you can call the function and get the same result at each call:
foo() # the same result on any call
However, this is rarely what you want. More often you want to guarantee the replicability of your overall experiment.
In this case, set an RNG at the beginning of your script and call the function with it. This will cause the output of the function to be different at each call, but the same between one run of the whole script and another run.
myGlobalProgramRNG = MersenneTwister() # at the beginning of your script...
foo(rng=myGlobalProgramRNG)
Two further notes. First, be careful with multithreading: if you want to guarantee replicability under multithreading, and in particular independently of whether your script runs with 1, 2, 4, ... threads, look for example at how I dealt with it in the BetaML library using a generateParallelRngs() function.
Second, even if you create your own RNG at the beginning of your script, or seed the default RNG, different Julia versions may (and indeed do) change the RNG stream, so replicability is guaranteed only within the same Julia version.
To overcome this problem you can use a package like StableRNGs.jl, which guarantees the stability of the RNG stream (at the cost of losing a bit of performance)...
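A minimal sketch of its use (the seed value is arbitrary):
using StableRNGs
rng = StableRNG(123)
rand(rng, 3)    # the same three values on any Julia version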

Behaviour of random_number across different platforms

I'm using the random_number subroutine in Fortran, but in different runs of the program the numbers produced don't change. What should I include in my code so that the numbers change every time I compile and run the program?
The random number generator produces pseudo-random numbers. To get different numbers each run, you need to initialise the random seed at the start of your program. This picks a different starting position in the pseudo-random stream.
The sequence of pseudorandom numbers coming from call(s) to random_number depends on the algorithm used by the processor and the value of the seed.
The initial value of the seed is processor dependent. For some processors this seed value will be the same each time the program runs, and for some it will be different. The first case gives a repeatable pseudorandom sequence and the second a non-repeatable sequence.
gfortran (before version 7) falls into this first category. You will need to explicitly change the random seed if you wish to get non-repeatable sequences.
As stated in another answer, the intrinsic random_seed can be used to set the value of the seed and restart the pseudorandom generator. Again, it is processor dependent what happens when random_seed() is called without a put= argument. Some processors will restart the generator with a repeatable sequence, some won't. gfortran (again, before version 7) is in the first category.
For processors where call random_seed() gives rise to a repeatable sequence an explicit run-time varying seed will be required to generate distinct sequences. An example for those older gfortran versions can be found in the documentation.
It should be noted that choosing a seed can be a complicated thing. Not only will there be portability issues, but care may be required in ensuring that the generator is not restarted in a low entropy region. For multi-image programs the user will have to work to have varying sequences across these images.
On a final note, Fortran 2018 introduced the standard intrinsic procedure random_init. This handles both cases of selecting repeatability across invocations and distinctness over (coarray) images.

Running N Iterations of a Single-Processor Job in Parallel

There should be a simple solution, but I am too much of a novice with parallel processing.
I want to run N instances of a command f in different directories. There are no parameters or anything for f; it just runs based on an input file in the directory where it is started. I would like to run one instance of f in each of the N directories.
I have access to 7 nodes which have a total of ~280 processors between them, but I'm not familiar enough with MPI to know how to code for the above.
I do know that I can use mpirun and mpiexec, if that helps at all...
Help?

Safety of sharing a read-only scipy sparse matrix between multiple processes

I have a computation I must do which is somewhat expensive and I want to spawn multiple processes to complete it. The gist is more or less this:
1) I have a big scipy.sparse.csc_matrix (could use other sparse format if needed) from which I'm going to read (only read, never write) data for the calculation.
2) I must do lots of embarrassingly parallel calculations and return values.
So I did something like this:
import numpy as np
from multiprocessing import Process, Manager

def f(instance, big_matrix):
    """
    This is the actual thing I want to calculate. This reads lots of
    data from big_matrix but never writes anything to it.
    """
    return stuff_calculated

def do_some_work(big_matrix, instances, outputs):
    """
    This does a few chunked calculations for a few instances and
    saves the result in `outputs`, which is a memory-shared dictionary.
    """
    for instance in instances:
        x = f(instance, big_matrix)
        outputs[instance] = x

def split_work(big_matrix, instances_to_calculate):
    """
    Split do_some_work into many processes by chunking instances_to_calculate,
    creating a shared dictionary and spawning and joining the processes.
    """
    # break instance list into 4 chunks to pass each process
    instance_sets = np.array_split(instances_to_calculate, 4)

    manager = Manager()
    outputs = manager.dict()

    processes = [
        Process(target=do_some_work, args=(big_matrix, instances, outputs))
        for instances in instance_sets
    ]

    for p in processes:
        p.start()
    for p in processes:
        p.join()

    return instance_sets, outputs
My question is: is this safe? My function f never writes anything, but I'm not taking any precautions to share big_matrix between processes, just passing it as it is. It seems to be working, but I'm concerned that I could corrupt something just by passing a value between multiple processes, even if I never write to it.
I tried to use the sharemem package to share the matrix between multiple processes but it seems to be unable to hold scipy sparse matrices, only normal numpy arrays.
If this isn't safe, how can I share (read only) big sparse matrices between processes without problems?
I've seen here that I can make another csc_matrix pointing to the same memory with:
other_matrix = csc_matrix(
    (bit_matrix.data, bit_matrix.indices, bit_matrix.indptr),
    shape=bit_matrix.shape,
    copy=False
)
Will this make it safer, or would it be just as safe as passing the original object?
Thanks.
As explained here, it seems your first option creates one copy of the sparse matrix per process. This is safe, but it isn't ideal from a performance point of view. However, depending on the computation you perform on the sparse matrix, the overhead may not be significant.
I suspect a cleaner option using the multiprocessing lib would be to create three lists (depending on the matrix format you use) and populate these with the values, row_ind and col_ptr of your CSC matrix. The documentation for multiprocessing shows how this can be done using an Array or using the Manager and one of the supported types.
Afterwards I don't see how you could run into trouble using read-only operations and it may be more efficient.

Thread-safe uniform random number generator

I have some parallel Fortran90 code in which each thread needs to generate the same sequence of random numbers.
I have a random number generator that seems to be thread-unsafe, since, for a given seed, I'm completely unable to repeat the same results each time I run the program.
I have unsuccessfully surfed (almost) the entire web looking for the code of a thread-safe RNG. Could anyone provide me with (a link to) the code of one?
Thanks in advance!
A good pseudorandom number generator for Fortran 90 can be found in the Intel Math Kernel Library's Vector Statistical Library; its generators are thread safe. Also, why does it need to be thread-safe? If you want each thread to get the same list, instantiate a new PRNG for each thread with the same seed.
Most repeatable random number generators need state in some form. Without state, they can't determine what comes next. In order to be thread safe, you need a way to hold onto the state yourself (i.e., it can't be global).
When you say "needs to generate the same sequence of random numbers", do you mean that:
1. Each thread needs to generate a stream of numbers identical to the other threads'? This implies choosing the seed before peeling off the threads, then instantiating a thread-local PRNG in each thread with the same seed.
or
2. You want to be able to repeat the same sequence of numbers between different runs of the program, but each thread generates its own independent sequence? In this case you still can't share a single PRNG, because the thread operation sequence is non-deterministic. So seed a single PRNG with a known seed before launching threads, and use it to generate the initial seeds for the threads. Then you instantiate thread-local generators in each thread...
In each of these cases you should note what Neil Butterworth says about the statistics: most of the usual guarantees that PRNGs like to claim are not reliable when you mix streams generated in this way.
In both cases you need a thread-local PRNG. I don't know what is available in f90... but you can also write your own (look up the Mersenne Twister, and write a routine that takes the saved state as a parameter...).
In Fortran 77, this would look something like:
      function PRNGthread (state)
      double precision state(statesize)
c     stuff happens here which uses and manipulates the state vector...
      PRNGthread = result
      return
      end
and each of your threads should maintain a separate state vector, though all will use the same initial value.
I understand you need every thread to produce the same stream of random numbers.
A very good pseudorandom generator that will produce a reproducible stream of numbers and is quite fast is the MT19937. Just make sure that you generate the seed before spawning off the threads, but create a separate instance of the MT in every thread (make the instance thread-local). That way each MT is guaranteed to produce the same stream of numbers.
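As an illustration in Julia (the language of the main question above), this idea, one identically seeded thread-local generator per thread, might be sketched like this (the seed value is arbitrary):
using Random

seed = 1234                                                    # chosen once, before any threads run
rngs = [MersenneTwister(seed) for _ in 1:Threads.nthreads()]   # one generator per thread

Threads.@threads for t in 1:Threads.nthreads()
    r = rngs[Threads.threadid()]
    # every thread draws the same sequence from its own generator
    println(Threads.threadid(), " => ", rand(r, 3))
end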
How about SPRNG? I have not tried it myself though.
I coded a thread-safe Fortran 90 version of the Mersenne Twister (MT19937). The state of the PRNG is saved in a derived type (randomNumberSequence), and you use procedures to seed the generator or get the next element in the sequence.
See http://code.google.com/p/i3rc-monte-carlo-model/source/browse/trunk/Code/RandomNumbersForMC.f95
The alternatives seem to be:
- Use a synchronisation object (such as a mutex) on the generator's seed value. This will unfortunately serialise your code on accesses to the generator.
- Use thread-local storage in the generator so each thread gets its own seed - this may cause statistical problems for your app.
- If your platform supports a suitable atomic operation, use that on the seed (it probably won't, however).
Not a very encouraging list, I know. And to add to it, I have no idea how to implement any of them in FORTRAN!
This article https://www.cmiss.org/openCMISS/wiki/RandomNumberGenerationWithOpenMP not only links to a Fortran implementation, but also mentions the key points needed to make a PRNG usable with threads. The most important point is:
The Fortran90 version of Ziggurat has several variables and arrays with the 'SAVE' attribute. In order to parallelize the uniform RNG, then, it appears that the required changes are to make these variables arrays with a separate value for each thread (beware of false sharing). Then when the PRNG function is called, we must pass the thread number, and use the corresponding state value.
