Running N Iterations of a Single-Processor Job in Parallel - parallel-processing

There should be a simple solution, but I am too novice with parallel processing.
I want to run N instances of command f in different directories. f takes no parameters; it just runs based on an input file in the directory where it is started. I would like to run one instance of it in each of the N directories.
I have access to 7 nodes with a total of ~280 processors between them, but I'm not familiar enough with MPI to know how to code the above.
I do know that I can use mpirun and mpiexec, if that helps at all...
Help?
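One possible starting point, sketched in Python under some assumptions: the directories are visible from a single node, and f can be launched as an ordinary command. The helper names run_in_dir and run_everywhere are hypothetical. This only fans the work out over the cores of one machine; spreading it across all 7 nodes would still need the cluster's scheduler or mpirun, which this sketch does not cover.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_in_dir(command, directory):
    """Run `command` with `directory` as its working directory."""
    result = subprocess.run(command, cwd=directory, capture_output=True, text=True)
    return directory, result.returncode, result.stdout


def run_everywhere(command, directories, max_workers=4):
    """Run one instance of `command` per directory, at most `max_workers` at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # The threads only wait on child processes, so the real work
        # happens in the separately launched commands.
        return list(pool.map(lambda d: run_in_dir(command, d), directories))
```

Calling run_everywhere(["f"], list_of_directories, max_workers=40) would then start f once in each directory and return the exit code and output of each run, in the same order as the directories.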

Related

Random number seeds for Julia cluster computing

I have a Julia code, and I want to submit this code to a remote computing cluster via running a large number of jobs in parallel (i.e., around 10,000 jobs in parallel). The way this code works is that, the main function (call it "main.jl") calls another function (call it "generator.jl") which utilizes random numbers such as rand(Float64) and so on. I submit main.jl via a bash file, and I run N jobs in parallel by including
#PBS -t 1-N
I want to make sure I have a different random number generator for each of the N job submissions, but I'm not sure how to do this. I was thinking of setting a random seed based upon the environment variable; i.e., by setting
@everywhere import Random.Random
@everywhere using Random.Random
Random.seed!(ENV_VAR)
in main.jl. However, I'm not sure how to get the environment variable ENV_VAR. In MATLAB, I know I can get this via
NUM = getenv( 'PBS_ARRAYID' )
But I don't know how to do this in Julia. If I manage to set this new random seed in main.jl, will this generate a different random seed every time the bash script submits main.jl to the cluster? In a similar vein, do I even need to do this in Julia, considering the Julia RNG uses MersenneTwister?
Just in case it matters, I have been using Julia 1.5.1 on the remote machine.
There are two issues here:
1. Getting the job number
2. Using the job number in random number generation.
Each of those issues has two solutions: one more elegant, the other less so but still fine.
Ad.1.
For managing job numbers, consider using PBS together with ClusterManagers.jl. It provides addprocs_pbs(np::Integer; qsub_flags=""), which lets you manage the run numbers and orchestrate the cluster from within Julia. In many scenarios you will find this approach more comfortable. To seed the random number generator (more on that later) you can then use myid(), which returns the worker number. In this scenario you will most likely run your computations in a @distributed loop anyway, and you can use that value for seeding the RNG.
If you are instead orchestrating array jobs externally via bash scripts, then perhaps the best option is to pass the job number as a command-line argument to the Julia process and read it from the ARGS variable, or to have a setup bash script export an environment variable that Julia reads from ENV (for example, parse(Int, ENV["PBS_ARRAYID"]) for the PBS_ARRAYID variable mentioned in the question).
Ad.2.
There are two approaches here. Firstly, you can simply create a new MersenneTwister on each worker and then use it as that worker's stream. For example (here I assume that you have some variable jobid):
using Random
rnd = MersenneTwister(jobid)
rand(rnd, 4)
This is basically OK, and the random streams are known not to be correlated. However, you might worry that this approach will introduce artifacts into your simulation. If you want to be more careful, you can use a single random stream and divide it across processes. This is also perhaps the most state-of-the-art solution:
using Random, Future
rnd = Future.randjump(MersenneTwister(0), jobid*big(10)^20)
This makes all processes share the same huge random number stream. Note that the state of the Mersenne Twister is 19937 bits and it has a period of 2^19937 - 1, so a jump of this size is not big at all; big(10)^20 is the recommended jump step because it is already precomputed in the randjump implementation.
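For comparison, the simpler of the two approaches (one generator per job, seeded with the job number) can be sketched in Python rather than Julia. This is only an illustration of the pattern: the helper name job_rng and the fallback value are assumptions, and the variable name PBS_ARRAYID is taken from the question. Like the first Julia snippet, it gives each job its own seed but does not carve out guaranteed non-overlapping segments the way randjump does.

```python
import os
import random


def job_rng(env_var="PBS_ARRAYID", default=0):
    """Return (job_id, generator) with the generator seeded by the array index."""
    # The scheduler exports the array index as a string; fall back to
    # `default` when the script is run outside an array job.
    job_id = int(os.environ.get(env_var, default))
    return job_id, random.Random(job_id)
```

Each job would then draw from its own generator object, for example job_id, rng = job_rng() followed by rng.random(), instead of the module-level random functions.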

Is there a way to use torch.autograd.grad in parallel in PyTorch?

I am trying to train a network where the loss is a function not only of the output but also of the derivative of the output w.r.t. the input. The problem is that while computing the batch output can be done in parallel with PyTorch modules, I can't find a way to compute the derivative in parallel. Here's the best I can do in serial:
import torch
x=torch.rand(300,1)
dydx=torch.zeros_like(x)
fc=torch.nn.Linear(1,1)
x.requires_grad=True
for ii in range(x.size(0)):
    xi=x[ii,0:]
    yi=torch.tanh(fc(xi))
    dydx[ii]=torch.autograd.grad(yi,xi,create_graph=True)[0]
dydxsum=(dydx**2).sum()
dydxsum.backward()
In the code above, x is split to save memory and time. However, when x becomes large, parallelization (in CUDA) is still necessary. If this has to be implemented by tinkering with PyTorch itself, a hint about where to start would be appreciated.
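One way this might be vectorized, as a minimal sketch: it assumes, as in the snippet above, that each output row depends only on the matching input row (no batch normalization or other coupling between samples). Under that assumption the gradient of the summed output with respect to x contains every per-sample derivative, so one call to torch.autograd.grad replaces the loop and runs as a single batched operation.

```python
import torch

torch.manual_seed(0)
x = torch.rand(300, 1, requires_grad=True)
fc = torch.nn.Linear(1, 1)

# Forward pass for the whole batch at once.
y = torch.tanh(fc(x))

# y[i] depends only on x[i], so d(sum(y))/dx[i] equals dy[i]/dx[i].
# create_graph=True keeps the graph so the loss below can be differentiated.
(dydx,) = torch.autograd.grad(y.sum(), x, create_graph=True)

dydxsum = (dydx ** 2).sum()
dydxsum.backward()
```

After backward(), fc.weight.grad and fc.bias.grad hold the gradients of the derivative-based loss. Moving x and fc to a CUDA device should work the same way, since nothing here is device-specific.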

Get all possible letter combinations with more than 5 letters

This code successfully builds an array of all possible 5-character letter combinations.
a = ('aaaaa'..'zzzzz').to_a
However, when I try to go for 6+ characters, it runs for about 10 minutes and then the task gets killed. Is there any way for it to finish without being killed? Is it limited by hardware?
You are indeed limited by hardware. In oversimplified terms, there are two limitations that you are facing here: processing power and memory capacity.
Counting strings of length k over an alphabet of n letters gives n**k, so you are trying to generate and store 26**6 = 308_915_776 elements.
(x..y) creates a Range, which knows how to generate all of its elements, but doesn't eagerly do so. When you call Range#to_a however, your processor tries to generate all those elements. After some time, the process runs out of memory and dies.
To avoid the memory restriction, you could instead take advantage of the fact that Range is also Enumerable. For example:
('aaaaaaa'..'zzzzzzz').each { |seven_letter_word| puts seven_letter_word }
will instantly start printing strings. Eventually (after a lot of waiting) it will loop through all of them.
However, note that this lets you bypass the memory restriction, but not the processing one. For that there are no shortcuts other than understanding the specifics of the problem at hand.
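The same lazy-iteration idea, sketched in Python rather than Ruby purely as an illustration (the function name letter_combinations is hypothetical): a generator produces one string at a time, so memory use stays flat no matter how many combinations exist.

```python
from itertools import islice, product
from string import ascii_lowercase


def letter_combinations(length):
    """Lazily yield every lowercase string of the given length, in order."""
    for letters in product(ascii_lowercase, repeat=length):
        yield "".join(letters)


# Only the first few items are ever built; nothing else is stored.
first_three = list(islice(letter_combinations(6), 3))
print(first_three)  # ['aaaaaa', 'aaaaab', 'aaaaac']
```

As with the Ruby version, looping over all 26**6 strings still takes time; only the memory problem goes away.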

OpenMP ensemble execution

I am new to OpenMP and at the moment have no access to my workstation, where I could check the details. I have a quick question to get the basics right before moving on to the hands-on part.
Suppose I have a serial program written in FORTRAN90 which evolves a map with iterations and gives the final value of the variable after the evolution, the code looks like:
call random_number(xi) !! RANDOM INITIALIZATION OF THE VARIABLE
do i=1,50000 !! ITERATION OF THE SYSTEM
  xf=4.d0*xi*(1.d0-xi) !! EVOLUTION OF THE SYSTEM
  xi=xf
enddo !! END OF SYSTEM ITERATION
print*, xf
I want to run the same code as independent processes on a cluster for 100 different random initial conditions and see how the output changes with the initial conditions. A serial program for this purpose would look like:
do iter=1,100 !! THE INITIAL CONDITION LOOP
  call random_number(xi) !! RANDOM INITIALIZATION OF THE VARIABLE
  do i=1,50000 !! ITERATION OF THE SYSTEM
    xf=4.d0*xi*(1.d0-xi) !! EVOLUTION OF THE SYSTEM
    xi=xf
  enddo !! END OF SYSTEM ITERATION
  print*, xf
enddo !! END OF THE INITIAL CONDITION LOOP
Will the OpenMP implementation that I could think of work? The code I could come up with is as follows:
!$OMP PARALLEL PRIVATE(xi,xf,i)
!$OMP DO
do iter=1,100 !! THE INITIAL CONDITION LOOP
  call random_number(xi) !! RANDOM INITIALIZATION OF THE VARIABLE
  do i=1,50000 !! ITERATION OF THE SYSTEM
    xf=4.d0*xi*(1.d0-xi) !! EVOLUTION OF THE SYSTEM
    xi=xf
  enddo !! END OF SYSTEM ITERATION
  print*, xf
enddo !! END OF THE INITIAL CONDITION LOOP
!$OMP END DO
!$OMP END PARALLEL
Thank you in advance for any suggestions or help.
I think that this line
call random_number(xi) !! RANDOM INITIALIZATION OF THE VARIABLE
might cause some problems. Is the implementation of random_number on your system thread-safe? I haven't a clue, since I know nothing about your compiler or operating system. If it isn't thread-safe, then your program might do a number of things when the OpenMP threads all start using the random number generator; those things include crashing or deadlocking.
If the implementation is thread-safe you will want to figure out how to ensure that the threads either do or don't all generate the same sequence of random numbers. It's entirely sensible to write programs which use the same random numbers in each thread, or that use different sequences in different threads, but you ought to figure out that what you get is what you want.
And if the random number generator is thread-safe and generates different sequences for each thread, do those sequences pass the sort of tests for randomness that a single-threaded random number generator might pass?
It's quite tricky to generate properly independent sequences of pseudo-random numbers in parallel programs; certainly not something I can cover in the space of an SO answer.
While you figure all that out, one workaround which might help would be to generate, in a sequential part of your code, all the random numbers you need (into an array perhaps) and let the different threads read different elements out of the array.
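That workaround can be sketched as follows. This is in Python rather than Fortran and is only meant to show the pattern; the names evolve and run_ensemble, the seed, and the worker count are all assumptions. Every initial condition is drawn sequentially from one generator, and the parallel workers only ever read their own element.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def evolve(xi, steps=50000):
    """Iterate the map x -> 4x(1-x) and return the final value."""
    xf = xi
    for _ in range(steps):
        xf = 4.0 * xi * (1.0 - xi)
        xi = xf
    return xf


def run_ensemble(n_runs=100, steps=50000, seed=12345, workers=4):
    """Draw all initial conditions sequentially, then evolve them in parallel."""
    rng = random.Random(seed)
    # Sequential part: the generator is touched by one thread only.
    initial = [rng.random() for _ in range(n_runs)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Parallel part: each task reads exactly one precomputed value.
        return list(pool.map(lambda x0: evolve(x0, steps), initial))
```

Python threads will not speed up this particular loop, so treat it as a picture of the data flow. In the OpenMP version the equivalent would be filling an array with random_number before the parallel region and indexing it with iter inside the loop.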
"I want to run the same code as independent processes on a cluster"
Then you do not want OpenMP. OpenMP is about exploiting parallelism inside a single address space.
I suggest you look at MPI if you want to operate on a cluster.

expressing a grep like algorithm in mapreduce terms for a very long list of keywords

I am having trouble expressing an algorithm in mapreduce terms.
I have two big input text files: Let's call the first file "R" and the
second one "P". R is typically much bigger than P, but both are big.
In a non-mapreduce approach, the contents of P would be loaded into
memory (hashed) and then we would start iterating over all the lines
in R. The lines in R are just strings, and we want to
check if any of the substrings in R match any string in P.
The problem is very similar to grepping words in a bigfile, the issue
is that the list of words is very large so you cannot hardcode them
in your map routine.
The problem I am encountering is that I don't know how to ensure that
all the splits of the P file end up in a map job per each split of the R file.
So, assuming these splits:
R = R1, R2, R3;
P = P1, P2
The 6 map jobs have to contain these splits:
(R1, P1) (R1, P2);
(R2, P1) (R2, P2);
(R3, P1) (R3, P2);
How would you express this problem in mapreduce terms?
Thanks.
I have spent some time working on this and I have come up with a couple of
solutions. The first one is based on hadoop streaming and the second one uses
native java.
For the first solution I use an interesting feature from ruby. If you add
the keyword __END__ at the end of your code, all the text after that will
be exposed by the interpreter via the global variable DATA. This variable
is a File object. Example:
$ cat /tmp/foo.rb
puts DATA.read
__END__
Hello World!
$ ruby /tmp/foo.rb
Hello World!
We will use the file R as input (it will be distributed across the HDFS filesystem).
We iterate over the P file and after traversing a certain number of lines,
we add those at the end of our mapper script. Then, we submit the job to the
hadoop cluster. We keep iterating over the contents of P until we have
consumed all the lines. Multiple jobs will be sent to the cluster based on
the number of lines per job and the size of P.
That's a fine approach that I have implemented and it works quite well. I
don't find it particularly elegant, though. We can do better by writing a native
mapreduce app in Java.
When using a native Java app, we have full access to the hadoop HDFS API.
That means we can read the contents of a file from our code. I don't think
that is available when streaming.
We follow an approach similar to the streaming method, but once we have
traversed a certain number of lines, we send those to the hadoop cluster instead
of appending them to the code. We can do that within the code that schedules
our jobs.
Then, it is a matter of running as many jobs as the number of splits that
we have for P. All the mappers in a particular job will load a certain split
and will use it to compute the splits of R.
Nice problem.
One quick way I can think of is to split the P file into multiple files and run multiple MR jobs with each split of the P file and the complete R file as input.
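The "one job per split of P" idea can be sketched in plain Python, with no Hadoop involved. All names here (chunks, map_job, grep_all) are hypothetical, and in-memory lists stand in for the R and P files. Each call to map_job plays the role of one MapReduce job: it loads a single split of P and scans every line of R against it.

```python
from itertools import islice


def chunks(lines, size):
    """Yield successive lists of at most `size` lines (the splits of P)."""
    it = iter(lines)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


def map_job(r_lines, p_chunk):
    """One 'job': scan all of R against a single split of P."""
    keywords = set(p_chunk)
    for line in r_lines:
        hits = sorted(k for k in keywords if k in line)
        if hits:
            yield line, hits


def grep_all(r_lines, p_lines, chunk_size):
    """Run one job per split of P and merge the matches per line of R."""
    matches = {}
    for p_chunk in chunks(p_lines, chunk_size):
        for line, hits in map_job(r_lines, p_chunk):
            matches.setdefault(line, []).extend(hits)
    return matches
```

Because every job sees all of R, every (Ri, Pj) pair from the question is covered. The price is that R is read once per split of P, so the chunk size trades memory per mapper against the number of passes over R.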
