Mutate existing container for `rand` results in Julia? - random

I have a simulation where the computation will reuse the random sample of a multi-dimensional random variable. I'd like to be able to re-use a pre-allocated container for better performance with less allocations.
Simplified example:
function f()
container = zeros(2) # my actual use-case is an MvNormal
map(1:100) do i
container = rand(2)
sum(container)
end
end
I think the above is allocating a new vector as a result of rand(2) each time. I'd like to mutate container by storing the results of the rand call.
I tried the typical pattern of rand!(container,...) but there does not seem to be a built-in rand! function following the usual mutation convention in Julia.
How can I reuse the container or otherwise improve the performance of this approach?

rand! exists in the Random module
julia> using Random
julia> a = zeros(2)
2-element Vector{Float64}:
0.0
0.0
julia> rand!(a);#show a;
a = [0.8139794738918935, 0.6336948436048475]

Related

Synchronously Outputting to DistributedArray of Vectors in Parallel

I'm trying to distribute a function that outputs a vector into an array.
I followed this post with something like the following code:
a = distribute([Float64[] for _ in 1:nrow(df)])
#sync #distributed for i in 1:nrow(df)
append!(localpart(a)[i], foo(df[i]))
end
But I get the following error:
BoundsError: attempt to access 145-element Vector{Vector{Float64}} at index [147]
I've only ever parallelized with SharedArrays, which aren't an option, since I need to store vectors in the shared array. Any and all advice would be life-saving.
Each localpart is indexed starting from one.
Hence you need to convert between a global index and the local index.
The function DistributedArrays.localindices() returns a one element tuple that contains a range of global indices that are mapped to the localpart.
This information can be in turn used for the index conversion:
#sync #distributed for i in 1:nrow(df)
id = i - DistributedArrays.localindices(a)[1][1] + 1
push!(localpart(a)[id], f(df[i,1]))
end
EDIT
to understand how localindiced work look at this code:
julia> addprocs(4);
julia> a = distribute([Float64[] for _ in 1:14]);
julia> fetch(#spawnat 2 DistributedArrays.localindices(a))
(1:4,)
julia> fetch(#spawnat 3 DistributedArrays.localindices(a))
(5:8,)
julia> fetch(#spawnat 4 DistributedArrays.localindices(a))
(9:11,)
julia> fetch(#spawnat 5 DistributedArrays.localindices(a))
(12:14,)

Julia type instability: Array of LinearInterpolations

I am trying to improve the performance of my code by removing any sources of type instability.
For example, I have several instances of Array{Any} declarations, which I know generally destroy performance. Here is a minimal example (greatly simplified compared to my code) of a 2D Array of LinearInterpolation objects, i.e
n,m=5,5
abstract_arr=Array{Any}(undef,n+1,m+1)
arr_x=LinRange(1,10,100)
for l in 1:n
for alpha in 1:m
abstract_arr[l,alpha]=LinearInterpolation(arr_x,alpha.*arr_x.^n)
end
end
so that typeof(abstract_arr) gives Array{Any,2}.
How can I initialize abstract_arr to avoid using Array{Any} here?
And how can I do this in general for Arrays whose entries are structures like Dicts() where the Dicts() are dictionaries of 2-tuples of Float64?
If you make a comprehension, the type will be figured out for you:
arr = [LinearInterpolation(arr_x, ;alpha.*arr_x.^n) for l in 1:n, alpha in 1:m]
isconcretetype(eltype(arr)) # true
When it can predict the type & length, it will make the right array the first time. When it cannot, it will widen or extend it as necessary. So probably some of these will be Vector{Int}, and some Vector{Union{Nothing, Int}}:
[rand()>0.8 ? nothing : 0 for i in 1:3]
[rand()>0.8 ? nothing : 0 for i in 1:3]
[rand()>0.8 ? nothing : 0 for i in 1:10]
The main trick is that you just need to know the type of the object that is returned by LinearInterpolation, and then you can specify that instead of Any when constructing the array. To determine that, let's look at the typeof one of these objects
julia> typeof(LinearInterpolation(arr_x,arr_x.^2))
Interpolations.Extrapolation{Float64, 1, ScaledInterpolation{Float64, 1, Interpolations.BSplineInterpolation{Float64, 1, Vector{Float64}, BSpline{Linear{Throw{OnGrid}}}, Tuple{Base.OneTo{Int64}}}, BSpline{Linear{Throw{OnGrid}}}, Tuple{LinRange{Float64}}}, BSpline{Linear{Throw{OnGrid}}}, Throw{Nothing}}
This gives a fairly complicated type, but we don't necessarily need to use the whole thing (though in some cases it might be more efficient to). So for instance, we can say
using Interpolations
n,m=5,5
abstract_arr=Array{Interpolations.Extrapolation}(undef,n+1,m+1)
arr_x=LinRange(1,10,100)
for l in 1:n
for alpha in 1:m
abstract_arr[l,alpha]=LinearInterpolation(arr_x,alpha.*arr_x.^n)
end
end
which gives us a result of type
julia> typeof(abstract_arr)
Matrix{Interpolations.Extrapolation} (alias for Array{Interpolations.Extrapolation, 2})
Since the return type of this LinearInterpolation does not seem to be of known size, and
julia> isbitstype(typeof(LinearInterpolation(arr_x,arr_x.^2)))
false
each assignment to this array will still trigger allocations, and consequently there actually may not be much or any performance gain from the added type stability when it comes to filling the array. Nonetheless, there may still be performance gains down the line when it comes to using values stored in this array (depending on what is subsequently done with them).

How can I use randjump instead of generating a certain number of random numbers?

I am trying to "skip forward" a few realizations by using the function Future.randjump(), but it doesn't seem to behave as I expect it to. The following code gives me the desired result, where jumping forward 1 steps gives the same result as if I had called rand(rng) twice, i.e. the two println display the same number:
using Random, Future
rng = MersenneTwister(123);
new_rng = Future.randjump(rng, 1)
rand(rng)
rand(rng)
println(rand(rng))
println(rand(new_rng))
However, if I add one extra call to rand(rng) before the call to randjump(), the two printed numbers are completely different:
using Random, Future
rng = MersenneTwister(123);
rand(rng) # Added line
new_rng = Future.randjump(rng, 1)
rand(rng)
rand(rng)
println(rand(rng))
println(rand(new_rng))
I expected that the two calls to println() would display the same thing even in the second case, how come they don't? Is there a way I can use randjump() in the second case to get the same realizations as if I had called rand(rng) several times? Thank you in advance.
One unit of randjump corresponds to generation of two floating point numbers.
Consider this example
julia> rng = MersenneTwister(123);
julia> rng2 = Future.randjump(rng, 1);
julia> rand(rng, 4)
4-element Vector{Float64}:
0.7684476751965699
0.940515000715187
0.6739586945680673
0.3954531123351086
julia> rand(rng2,2)
2-element Vector{Float64}:
0.6739586945680673
0.3954531123351086
Note that in the second call (that is rand(rng2,2)) the both numbers are identical to the two last numbers in the first call (taht is rand(rng,2)).
Another issue is that different distributions might "consume" Float64 numbers from the stream at a different speed - so you need to check with a particular distribution how fast it consumes floats for the stream (some might also use buffering etc...).
Looking at the source code of randn (#edit randn()) it consumes one float and hence you get the same results for those two calls:
julia> randn(MersenneTwister(123),6)[3:end]
4-element Vector{Float64}:
1.142650902867199
0.45941562040708034
-0.396679079295223
-0.6647125451916877
julia> randn(Future.randjump(MersenneTwister(123),1),4)
4-element Vector{Float64}:
1.142650902867199
0.45941562040708034
-0.396679079295223
-0.6647125451916877
EDIT
Regarding your comment the size of Mersenne Twister state is 19937 bits and half-unit jumps are not supported. Running rand is mutating this state but not half-the way - so you end up with different bits. Note that an RNG is a sequence of states and the actual values are calculated from that state.
The correct pattern to synchronize random numbers in your computations is the following:
master_rng = MersenneTwister(123);
rng1 = Future.randjump(master_rng, big(10)^20)
# do whatever you want
rng2 = Future.randjump(master_rng, 2*big(10)^20)
# do whatever you want
rng3 = Future.randjump(master_rng, 3*big(10)^20)
# do whatever you want
With this pattern you can correctly maintains synchronization between random number streams and have full control whether the should overlap or not.

passing shared array in julia using #everywhere

I found this post - Shared array usage in Julia, which is clearly close but I still don't really understand what to do in my case.
I am trying to pass a shared array to a function I define, and call that function using #everywhere. The following, which has no shared array, works:
#everywhere mat = rand(3,3)
#everywhere foo1(x::Array) = det(x)
Then this
#everywhere println(foo1(mat))
properly produces different results from each worker. Now let me include a shared array:
test = SharedArray(Float64,10)
#everywhere foo2(x::Array,y::SharedArray) = det(x) + sum(y)
Then this
#everywhere println(foo2(mat,test))
fails on the workers.
ERROR: On worker 2:
UndefVarError: test not defined
etc. I can get what I want like this:
for w in procs()
#spawnat w println(foo2(eval(:mat),test))
end
This works - but is it optimal? Is there a way to make it work with #everywhere?
While it's tempting to use "named variables" on workers, it generally seems to work better if you access them via references. Schematically, you might do something like this:
mat = [#spawnat p rand(3,3) for p in workers()] # process 1 holds references to objects on workers
#sync for (i, p) in enumerate(workers())
#spawnat p foo(mat[i], sharedarray)
end

Julia: Parallel for loop over partitions iterator

So I'm trying to iterate over the list of partitions of something, say 1:n for some n between 13 and 21. The code that I ideally want to run looks something like this:
valid_num = #parallel (+) for p in partitions(1:n)
int(is_valid(p))
end
println(valid_num)
This would use the #parallel for to map-reduce my problem. For example, compare this to the example in the Julia documentation:
nheads = #parallel (+) for i=1:200000000
Int(rand(Bool))
end
However, if I try my adaptation of the loop, I get the following error:
ERROR: `getindex` has no method matching getindex(::SetPartitions{UnitRange{Int64}}, ::Int64)
in anonymous at no file:1433
in anonymous at multi.jl:1279
in run_work_thunk at multi.jl:621
in run_work_thunk at multi.jl:630
in anonymous at task.jl:6
which I think is because I am trying to iterate over something that is not of the form 1:n (EDIT: I think it's because you cannot call p[3] if p=partitions(1:n)).
I've tried using pmap to solve this, but because the number of partitions can get really big, really quickly (there are more than 2.5 million partitions of 1:13, and when I get to 1:21 things will be huge), constructing such a large array becomes an issue. I left it running over night and it still didn't finish.
Does anyone have any advice for how I can efficiently do this in Julia? I have access to a ~30 core computer and my task seems easily parallelizable, so I would be really grateful if anyone knows a good way to do this in Julia.
Thank you so much!
The below code gives 511, the number of partitions of size 2 of a set of 10.
using Iterators
s = [1,2,3,4,5,6,7,8,9,10]
is_valid(p) = length(p)==2
valid_num = #parallel (+) for i = 1:30
sum(map(is_valid, takenth(chain(1:29,drop(partitions(s), i-1)), 30)))
end
This solution combines the takenth, drop, and chain iterators to get the same effect as the take_every iterator below under PREVIOUS ANSWER. Note that in this solution, every process must compute every partition. However, because each process uses a different argument to drop, no two processes will ever call is_valid on the same partition.
Unless you want to do a lot of math to figure out how to actually skip partitions, there is no way to avoid computing partitions sequentially on at least one process. I think Simon's answer does this on one process and distributes the partitions. Mine asks each worker process to compute the partitions itself, which means the computation is being duplicated. However, it is being duplicated in parallel, which (if you actually have 30 processors) will not cost you time.
Here is a resource on how iterators over partitions are actually computed: http://www.informatik.uni-ulm.de/ni/Lehre/WS03/DMM/Software/partitions.pdf.
PREVIOUS ANSWER (More complicated than necessary)
I noticed Simon's answer while writing mine. Our solutions seem similar to me, except mine uses iterators to avoid storing partitions in memory. I'm not sure which would actually be faster for what size sets, but I figure it's good to have both options. Assuming it takes you significantly longer to compute is_valid than to compute the partitions themselves, you can do something like this:
s = [1,2,3,4]
is_valid(p) = length(p)==2
valid_num = #parallel (+) for i = 1:30
foldl((x,y)->(x + int(is_valid(y))), 0, take_every(partitions(s), i-1, 30))
end
which gives me 7, the number of partitions of size 2 for a set of 4. The take_every function returns an iterator that returns every 30th partition starting with the ith. Here is the code for that:
import Base: start, done, next
immutable TakeEvery{Itr}
itr::Itr
start::Any
value::Any
flag::Bool
skip::Int64
end
function take_every(itr, offset, skip)
value, state = Nothing, start(itr)
for i = 1:(offset+1)
if done(itr, state)
return TakeEvery(itr, state, value, false, skip)
end
value, state = next(itr, state)
end
if done(itr, state)
TakeEvery(itr, state, value, true, skip)
else
TakeEvery(itr, state, value, false, skip)
end
end
function start{Itr}(itr::TakeEvery{Itr})
itr.value, itr.start, itr.flag
end
function next{Itr}(itr::TakeEvery{Itr}, state)
value, state_, flag = state
for i=1:itr.skip
if done(itr.itr, state_)
return state[1], (value, state_, false)
end
value, state_ = next(itr.itr, state_)
end
if done(itr.itr, state_)
state[1], (value, state_, !flag)
else
state[1], (value, state_, false)
end
end
function done{Itr}(itr::TakeEvery{Itr}, state)
done(itr.itr, state[2]) && !state[3]
end
One approach would be to divide the problem up into pieces that are not too big to realize and then process the items within each piece in parallel, e.g. as follows:
function my_take(iter,state,n)
i = n
arr = Array[]
while !done(iter,state) && (i>0)
a,state = next(iter,state)
push!(arr,a)
i = i-1
end
return arr, state
end
function get_part(npart,npar)
valid_num = 0
p = partitions(1:npart)
s = start(p)
while !done(p,s)
arr,s = my_take(p,s,npar)
valid_num += #parallel (+) for a in arr
length(a)
end
end
return valid_num
end
valid_num = #time get_part(10,30)
I was going to use the take() method to realize up to npar items from the iterator but take() appears to be deprecated so I've included my own implementation which I've called my_take(). The getPart() function therefore uses my_take() to obtain up to npar partitions at a time and carry out a calculation on them. In this case, the calculation just adds up their lengths, because I don't have the code for the OP's is_valid() function. get_part() then returns the result.
Because the length() calculation isn't very time-consuming, this code is actually slower when run on parallel processors than it is on a single processor:
$ julia -p 1 parpart.jl
elapsed time: 10.708567515 seconds (373025568 bytes allocated, 6.79% gc time)
$ julia -p 2 parpart.jl
elapsed time: 15.70633439 seconds (548394872 bytes allocated, 9.14% gc time)
Alternatively, pmap() could be used on each piece of the problem instead of the parallel for loop.
With respect to the memory issue, realizing 30 items from partitions(1:10) took nearly 1 gigabyte of memory on my PC when I ran Julia with 4 worker processes so I expect realizing even a small subset of partitions(1:21) will require a great deal of memory. It may be desirable to estimate how much memory would be needed to see if it would be at all possible before trying such a computation.
With respect to the computation time, note that:
julia> length(partitions(1:10))
115975
julia> length(partitions(1:21))
474869816156751
... so even efficient parallel processing on 30 cores might not be enough to make the larger problem solvable in a reasonable time.

Resources