Safety of sharing a read-only scipy sparse matrix between multiple processes - thread-safety

I have a computation I must do which is somewhat expensive and I want to spawn multiple processes to complete it. The gist is more or less this:
1) I have a big scipy.sparse.csc_matrix (I could use another sparse format if needed) from which I'm going to read (only read, never write) data for the calculation.
2) I must do lots of embarrassingly parallel calculations and return values.
So I did something like this:
import numpy as np
from multiprocessing import Process, Manager

def f(instance, big_matrix):
    """
    This is the actual thing I want to calculate. It reads lots of
    data from big_matrix but never writes anything to it.
    """
    return stuff_calculated

def do_some_work(big_matrix, instances, outputs):
    """
    This does a few chunked calculations for a few instances and
    saves the result in `outputs`, which is a memory-shared dictionary.
    """
    for instance in instances:
        x = f(instance, big_matrix)
        outputs[instance] = x

def split_work(big_matrix, instances_to_calculate):
    """
    Split do_some_work into many processes by chunking instances_to_calculate,
    creating a shared dictionary and spawning and joining the processes.
    """
    # break the instance list into 4 chunks, one per process
    instance_sets = np.array_split(instances_to_calculate, 4)
    manager = Manager()
    outputs = manager.dict()
    processes = [
        Process(target=do_some_work, args=(big_matrix, instances, outputs))
        for instances in instance_sets
    ]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    return instance_sets, outputs
My question is: is this safe? My function f never writes anything, but I'm not taking any precautions to share big_matrix between processes, just passing it as it is. It seems to work, but I'm concerned that I could corrupt something just by passing a value between multiple processes, even though I never write to it.
I tried to use the sharemem package to share the matrix between multiple processes, but it seems unable to hold scipy sparse matrices, only plain numpy arrays.
If this isn't safe, how can I share (read only) big sparse matrices between processes without problems?
I've seen here that I can make another csc_matrix pointing to the same memory with:
other_matrix = csc_matrix(
    (big_matrix.data, big_matrix.indices, big_matrix.indptr),
    shape=big_matrix.shape,
    copy=False
)
Will this make it safer, or would it be just as safe as passing the original object?
Thanks.

As explained here, it seems your first option creates one copy of the sparse matrix per process. This is safe, but it isn't ideal from a performance point of view. However, depending on the computation you perform on the sparse matrix, the overhead may not be significant.
I suspect a cleaner option using the multiprocessing lib would be to create three shared arrays (depending on the matrix format you use) and populate them with the values, row_ind and col_ptr of your CSC matrix. The multiprocessing documentation shows how this can be done using an Array, or using the Manager and one of the supported types.
Afterwards I don't see how you could run into trouble with read-only operations, and it may be more efficient.
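For concreteness, here is a minimal sketch of that idea, reusing big_matrix, f and instances_to_calculate from the question. RawArray is the lock-free sibling of the Array type the documentation describes; the helper names (to_shared, rebuild_csc, worker) and the float64/int32 dtypes are illustrative assumptions, not anything prescribed by scipy or multiprocessing.

import numpy as np
from scipy.sparse import csc_matrix
from multiprocessing import Process, RawArray, Manager

def to_shared(arr, typecode):
    # copy a 1-D numpy array into an unlocked shared-memory buffer
    buf = RawArray(typecode, arr.size)
    np.frombuffer(buf, dtype=arr.dtype)[:] = arr
    return buf

def rebuild_csc(data_buf, indices_buf, indptr_buf, shape):
    # wrap the shared buffers in numpy views and rebuild the matrix
    # without copying (same copy=False trick as in the question)
    return csc_matrix(
        (np.frombuffer(data_buf, dtype=np.float64),
         np.frombuffer(indices_buf, dtype=np.int32),
         np.frombuffer(indptr_buf, dtype=np.int32)),
        shape=shape, copy=False)

def worker(data_buf, indices_buf, indptr_buf, shape, instances, outputs):
    big_matrix = rebuild_csc(data_buf, indices_buf, indptr_buf, shape)
    for instance in instances:
        outputs[instance] = f(instance, big_matrix)  # f from the question

# shared copies of the three CSC components; this assumes float64 data and
# int32 indices, so check big_matrix.indices.dtype on very large matrices
data_buf = to_shared(big_matrix.data, 'd')
indices_buf = to_shared(big_matrix.indices, 'i')
indptr_buf = to_shared(big_matrix.indptr, 'i')

# (wrap the part below in `if __name__ == '__main__':` on platforms that
# spawn rather than fork)
manager = Manager()
outputs = manager.dict()
processes = [
    Process(target=worker,
            args=(data_buf, indices_buf, indptr_buf, big_matrix.shape,
                  instances, outputs))
    for instances in np.array_split(instances_to_calculate, 4)
]
for p in processes:
    p.start()
for p in processes:
    p.join()

Each worker then reads from the same underlying buffers instead of carrying its own copy of the matrix.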

Related

A memory efficient way for a randomized single pass over a set of indices

I have a big file (about 1 GB) which I am using as a basis for some data integrity testing. I'm using Python 2.7 for this because I don't care so much about how fast the writes happen; my window for data corruption should be big enough (and it's easier to submit a Python script to the machine I'm using for testing).
To do this I'm writing a sequence of 32-bit integers to the file as a background process while other code is running, like the following:
from struct import pack

with open('./FILE', 'rb+', buffering=0) as f:
    f.seek(0)
    counter = 1
    while counter < SIZE+1:
        f.write(pack('>i', counter))
        counter += 1
Then after I do some other stuff it's very easy to see if we missed a write, since there will be a gap instead of the sequentially increasing sequence. This works well enough. My problem is that some data corruption cases might only be caught with random I/O (not sequential like this), based on how we track changes to files.
So what I need is a method for performing a single random-order pass over my 1 GB file, but I can't really store the order in memory since 1 GB ~= 250 million 4-byte integers. I considered chunking the file into smaller pieces and indexing those, maybe 500 KB or so, but if there is a way to write a generator that can do the same job that would be awesome. Like this:
from struct import pack

def rand_index_generator():
    generator = RAND_INDEX(1, MAX+1, NO_REPLACEMENT)  # pseudocode
    counter = 0
    while counter < MAX:
        counter += 1
        yield generator.next_index()

with open('./FILE', 'rb+', buffering=0) as f:
    counter = 1
    for index in rand_index_generator():
        f.seek(4*index)
        f.write(pack('>i', counter))
        counter += 1
I need it:
Not to run out of memory (so no pouring the random sequence into a list)
To be reproducible so I can verify these values in the same order later
Is there a way to do this in Python 2.7?
Just to provide an answer for anyone who has the same problem, the approach that I settled on was this, which worked well enough if you don't need something all that random:
def rand_index_generator(a, b):
    ctr = 0
    while True:
        yield (ctr % b)
        ctr += a
Then initialize it with your index size b and a value a which is coprime to b. This is easy to choose if b is a power of two, since a just needs to be an odd number to make sure it isn't divisible by 2. It's a hard requirement for the two values to be coprime, so you might have to do more work if your index size b is not such an easily factored number as a power of 2.
index_gen = rand_index_generator(1934919251, 2**28)
Then each time you want a new index you call index_gen.next(), and it is guaranteed to visit every number in [0, 2^28 - 1] exactly once per cycle, in a semi-random order that depends on your choice of a.
There's really no point in picking an a value larger than your index size, since only a mod b matters anyway. This isn't a very good approach in terms of randomness, but it's very efficient in terms of memory and speed, which is what I care about for simulating this write workload.
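For completeness, here is roughly how the generator might plug into the original write loop, plus a reproducible verification pass. This is a minimal Python 2.7 sketch (using .next()) reusing rand_index_generator from above; FILE, the 2**28 slot count and the '>i' format come from the snippets above, the rest is an illustrative assumption.

from struct import pack, unpack

A, B = 1934919251, 2**28   # B = number of 4-byte slots; FILE must be >= 4*B bytes

# write pass: visit every slot exactly once, in the permuted order
index_gen = rand_index_generator(A, B)
with open('./FILE', 'rb+', buffering=0) as f:
    for counter in xrange(1, B + 1):
        f.seek(4 * index_gen.next())
        f.write(pack('>i', counter))

# verification pass: regenerate the same permutation and check every slot
index_gen = rand_index_generator(A, B)
with open('./FILE', 'rb', buffering=0) as f:
    for counter in xrange(1, B + 1):
        f.seek(4 * index_gen.next())
        assert unpack('>i', f.read(4))[0] == counter

Because the permutation is fully determined by (a, b), the verification pass sees exactly the same order as the write pass, which satisfies the reproducibility requirement.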

Random number seeds for Julia cluster computing

I have some Julia code, and I want to submit it to a remote computing cluster by running a large number of jobs in parallel (i.e., around 10,000 jobs). The way this code works is that the main function (call it "main.jl") calls another function (call it "generator.jl") which uses random numbers such as rand(Float64) and so on. I submit main.jl via a bash file, and I run N jobs in parallel by including
#PBS -t 1-N
I want to make sure I have a different random number generator for each of the N job submissions, but I'm not sure how to do this. I was thinking of setting a random seed based upon the environment variable; i.e., by setting
@everywhere import Random.Random
@everywhere using Random.Random
Random.seed!(ENV_VAR)
in main.jl. However, I'm not sure how to get the environment variable ENV_VAR. In MATLAB, I know I can get this via
NUM = getenv( 'PBS_ARRAYID' )
But I don't know how to do this in Julia. If I manage to set this new random seed in main.jl, will this generate a different random seed every time the bash script submits main.jl to the cluster? In a similar vein, do I even need to do this in Julia, considering the Julia RNG uses MersenneTwister?
Just in case it matters, I have been using Julia 1.5.1 on the remote machine.
There are two issues here:
Getting the job number
Using the job number in random number generation.
Each of those issues has two solutions: one more elegant and one less so, but both are OK.
Ad.1.
For managing job numbers consider using PBS together with ClusterManagers.jl. It provides the command addprocs_pbs(np::Integer; qsub_flags="") which makes it possible to manage the run numbers and orchestrate the cluster from within Julia. In many scenarios you will find this approach more comfortable. Here you can use myid(), which returns the worker number, for seeding the random number generator (more on that later). Most likely in this scenario you will run your computations using a @distributed loop, and you can use that value for seeding the RNG.
If you are instead orchestrating array jobs externally via bash scripts, then the best thing is probably to pass the job number to the Julia process as a parameter and read it from the ARGS variable, or to have a setup bash script that exports an environment variable that can be read from ENV.
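For the PBS array-job case in the question, reading the array index and seeding could look roughly like this. A minimal sketch: PBS_ARRAYID is the variable the question already mentions, parse, ENV, ARGS and Random.seed! are standard Julia, and the fallback default is just an assumption.

using Random

# PBS sets PBS_ARRAYID for each element of the job array (#PBS -t 1-N)
jobid = parse(Int, get(ENV, "PBS_ARRAYID", "1"))

# simplest variant: one independent seed per array job
Random.seed!(jobid)

# alternatively, pass the id on the command line (julia main.jl $PBS_ARRAYID)
# and read it with: jobid = parse(Int, ARGS[1])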
Ad.2.
There are two approaches here. Firstly, you can simply create a new MersenneTwister on each worker and then use it for the streams. For example (here I assume that you use some variable jobid):
using Random
rnd = MersenneTwister(jobid)
rand(rnd, 4)
This is basically OK and the random streams are known not to be correlated. However, you might be worried that this approach is going to introduce some artifacts into your simulation. If you want to be more careful, you can use a single random stream and divide it across processes. This is also perhaps the most state-of-the-art solution:
using Random, Future
rnd = Future.randjump(MersenneTwister(0), jobid*big(10)^20)
This will make all processes share the same huge random number stream (note that the state of the Mersenne Twister is 19937 bits and it has a period of 2^19937 − 1, so a jump of this size is not big at all; big(10)^20 is the recommended jump step because it is already precomputed in the randjump implementation).

Orthogonal Recursive Bisection in Chapel (Barnes-Hut algorithm)

I'm implementing a distributed version of the Barnes-Hut n-body simulation in Chapel. I've already implemented the sequential and shared memory versions which are available on my GitHub.
I'm following the algorithm outlined here (Chapter 7):
Perform orthogonal recursive bisection and distribute bodies so that each process has equal amount of work
Construct locally essential tree on each process
Compute forces and advance bodies
I have a pretty good idea of how to implement the algorithm in C/MPI, using MPI_Allreduce for bisection and simple message passing for communication between processes (for body transfer). MPI_Comm_split is also a very handy function that allows me to split the processes at each step of ORB.
I'm having some trouble performing ORB using the parallel/distributed constructs that Chapel provides. I would need some way to sum (reduce) work across processes (locales in Chapel), to split processes into groups, and to do process-to-process communication to transfer bodies.
I would be grateful for any advice on how to implement this in Chapel. If another approach would be better for Chapel that would also be great.
After a lot of deadlocks and crashes I did manage to implement the algorithm in Chapel. It can be found here: https://github.com/novoselrok/parallel-algorithms/tree/75312981c4514b964d5efc59a45e5eb1b8bc41a6/nbody-bh/dm-chapel
I was not able to use much of the fancy parallel equipment Chapel provides. I relied only on block distributed arrays with sync elements. I also replicated the SPMD model.
In main.chpl I set up all of the necessary arrays that will be used to transfer data. Each array has a corresponding sync array used for synchronization. Then each worker is started with its share of bodies and the previously mentioned arrays. Worker.chpl contains the bulk of the algorithm.
I replaced the MPI_Comm_split functionality with a custom function determineGroupPartners where I do the same thing manually. As for the MPI_Allreduce I found a nice little pattern I could use everywhere:
var localeSpace = {0..#numLocales};
var localeDomain = localeSpace dmapped Block(boundingBox=localeSpace);

var arr: [localeDomain] SomeType;
var arr$: [localeDomain] sync int; // stores the ranks of intended receivers
var rank = here.id;

for i in 0..#numLocales-1 {
    var intendedReceiver = (rank + i + 1) % numLocales;
    var partner = ((rank - (i + 1)) + numLocales) % numLocales;

    // Wait until the previous value is read
    if (arr$[rank].isFull) {
        arr$[rank];
    }

    // Store my value
    arr[rank] = valueIWantToSend;
    arr$[rank] = intendedReceiver;

    // Am I the intended receiver?
    while (arr$[partner].readFF() != rank) {}

    // Read partner value
    var partnerValue = arr[partner];

    // Do something with partner value

    arr$[partner]; // empty

    // Reset write, which blocks until arr$[rank] is empty
    arr$[rank] = -1;
}
This is a somewhat complicated way of implementing FIFO channels (see Julia RemoteChannel, where I got the inspiration for this "pattern").
Overview:
First, each locale calculates its intended receiver and its partner (the locale it will read a value from)
The locale checks whether the previous value was read
The locale stores a new value and "locks" it by setting arr$[rank] to the intended receiver
The locale waits while its partner stores its value and sets the appropriate intended receiver
Once the locale is the intended receiver, it reads the partner's value and does some operation on it
Then the locale empties/unlocks the value by reading arr$[partner]
Finally it resets its arr$[rank] by writing -1. This way we also ensure that the value set by a locale was read by the intended receiver
I realize that this might be an overly complicated solution for this problem. There probably exists a better algorithm that would fit Chapel's global view of parallel computation. The algorithm I implemented lends itself to the SPMD model of computation.
That being said, I think that Chapel still does a good job performance-wise. Here are the performance benchmarks against Julia and C/MPI. As the number of processes grows the performance improves by quite a lot. I didn't have a chance to run the benchmark on a cluster with >4 nodes, but I think Chapel will end up with respectable benchmarks.

Incremental text file processing for parallel processing

This is my first experience with the Julia language, and I'm quite surprised by its simplicity.
I need to process big files, where each line is composed of a set of tab-separated strings. As a first example, I started with a simple count program; I managed to use @parallel with the following code:
d = open(f)
lis = readlines(d)
ntrue = @parallel (+) for li in lis
    contains(li,s)
end
println(ntrue)
close(d)
I compared the parallel approach against a simple "serial" one on a 3.5 GB file (more than 1 million lines). On a 4-core Intel Xeon E5-1620, 3.60 GHz, with 32 GB of RAM, what I got is:
Parallel = 10.5 seconds; Serial = 12.3 seconds; Allocated Memory = 5.2 GB
My first concern is about memory allocation; is there a better way to read the file incrementally in order to lower the memory allocation, while preserving the benefits of parallelizing the processing?
Secondly, since the CPU gain related to the use of @parallel is not astonishing, I'm wondering whether it might be related to the specific case itself, or to my naive use of the parallel features of Julia? In the latter case, what would be the right approach to follow? Thanks for the help!
Your program is reading all of the file into memory as a large array of strings at once. You may want to try a serial version that processes the lines one at a time instead (i.e. streaming):
const s = "needle" # it's important for this to be const
open(f) do d
ntrue = 0
for li in eachline(d)
ntrue += contains(li,s)
end
println(ntrue)
end
This avoids allocating an array to hold all of the strings and avoids allocating all of the string objects at once, allowing the program to reuse the same memory by periodically reclaiming it during garbage collection. You may want to try this and see if it improves the performance sufficiently for you. The fact that s is const is important since it allows the compiler to predict the types in the for loop body, which isn't possible if s could change value (and thus type) at any time.
If you still want to process the file in parallel, you will have to open the file in each worker and advance each worker's read cursor (using the seek function) to an appropriate point in the file to start reading lines. Note that you'll have to be careful to avoid reading in the middle of a line and you'll have to make sure each worker does all of the lines assigned to it and no more – otherwise you might miss some instances of the search string or double count some of them.
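A rough sketch of that chunked approach, in the same pre-1.0 Julia style as the question (count_chunk and the byte-offset chunking are illustrative, not a standard API; it assumes the workers were added with addprocs and that f and s are defined as above):

# defined on every worker so pmap can call it remotely
@everywhere function count_chunk(fname, s, from, to)
    # count lines containing s whose starting byte offset lies in [from, to)
    open(fname) do io
        if from > 0
            seek(io, from - 1)
            readline(io)        # skip the line that started in the previous chunk
        end
        n = 0
        while !eof(io) && position(io) < to
            n += contains(readline(io), s)
        end
        n
    end
end

sz = filesize(f)
nchunks = nworkers()
chunk = div(sz, nchunks)
starts = [(i - 1) * chunk for i in 1:nchunks]
stops  = [i == nchunks ? sz : i * chunk for i in 1:nchunks]
ntrue = sum(pmap(i -> count_chunk(f, s, starts[i], stops[i]), 1:nchunks))
println(ntrue)

Assigning each line to the chunk that contains its starting byte offset is what prevents lines straddling a boundary from being missed or counted twice.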
If this workload isn't just an example and you actually want to count the number of lines in which a certain string occurs in a file, you may just want to use the grep command, e.g. calling it from Julia like this:
julia> s = "boo"
"boo"
julia> f = "/usr/share/dict/words"
"/usr/share/dict/words"
julia> parse(Int, readchomp(`grep -c -F $s $f`))
292
Since the grep command has been carefully optimized over decades to search text files for lines matching certain patterns, it's hard to beat its performance. [Note: if it's possible that zero lines contain the pattern you're looking for, you will want to wrap the grep command in a call to the ignorestatus function since the grep command returns an error status code when there are no matches.]
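Concretely, the wrapped call might look like this (the same command as above, just wrapped in ignorestatus so that a zero-match result parses as 0 instead of throwing):

count = parse(Int, readchomp(ignorestatus(`grep -c -F $s $f`)))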

Data management in a parallel for-loop in Julia

I'm trying to do some statistical analysis using Julia. The code consists of the files script.jl (e.g. initialisation of the data) and algorithm.jl.
The number of simulations is large (at least 100,000) so it makes sense to use parallel processing.
The code below is just some pseudocode to illustrate my question —
function script(simulations::Int64)
    # initialise input data
    ...
    # initialise other variables for statistical analysis using zeros()
    ...
    require("algorithm.jl")

    @parallel for z = 1:simulations
        while true
            choices = algorithm(data);
            if length(choices) == 0
                break
            else
                # process choices and pick one (which alters the data)
                ...
            end
        end
    end

    # display results of statistical analysis
    ...
end
and
function algorithm(data)
    # actual algorithm
    ...
    return choices;
end
As an example, I would like to know how many choices there are on average, what the most common choice is, and so on. For this purpose I need to save some data from choices (inside the for-loop) into the statistical analysis variables (initialised before the for-loop) and display the results (after the for-loop).
I've read about using @spawn and fetch() and functions like pmap(), but I'm not sure how I should proceed. Just using the variables inside the for-loop does not work, as each proc gets its own copy, so the values of the statistical analysis variables after the for-loop will just be zeros.
[Edit] In Julia I use include("script.jl") and script(100000) to run the simulations; there are no issues when using a single proc. However, when using multiple procs (e.g. with addprocs(3)), all statistical variables are zeros after the for-loop, which is to be expected.
It seems that you want to parallelize an inherently serial operation, because each operation is related to the result of another one (in this case data).
I think if you could implement the above code like this:
@parallel (dosomethingwithdata) for z = 1:simulations
    while true
        choices = algorithm(data,z);
        if length(choices) == 0
            break
        else
            # process choices and pick one (which alters the data)
            ...
        end
    end
    data
end
then you may find a parallel solution for the problem.
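To make that concrete, here is a minimal sketch of the reduction pattern with an actual reducer. It assumes you only need a reducible summary per simulation (here a simple count of choices, standing in for the placeholder dosomethingwithdata and the elided "..." steps, and assuming algorithm.jl was loaded on all workers as in the question); for richer statistics you could return a small tuple per iteration or switch to pmap, which the question already mentions.

# inside script(), replacing the bare @parallel loop; the (+) reducer collects
# one number per simulation back on the master
total_choices = @parallel (+) for z = 1:simulations
    local_data = deepcopy(data)     # each simulation works on its own copy
    nchoices = 0
    while true
        choices = algorithm(local_data)
        if length(choices) == 0
            break
        end
        # process choices and pick one (which alters local_data)
        nchoices += 1
    end
    nchoices                        # value this iteration contributes to (+)
end
println("average number of choices: ", total_choices / simulations)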
