I'm writing a convex solver, for concreteness' sake assume it's solving ordinary least squares: find x that minimizes ||b-Ax||^2. So my function call would look like
x = optim(A, b)
I would like to be able to use warm-starts when they are useful, to provide a good initial guess at the solution. So something like
x = optim(A, b; w=some_starting_value)
My problem is that if I want to use a default value, some_starting_value needs to be of length equal to the number of columns in A, which is chosen by the user. In R it's possible to do something like
x = optim(A, b, w=ncol(A))
Does any similar functionality exist in Julia? My current solution is to do something like
x = optim(A, b; w=0)
and then check if w != 0 and set it to be the right size vector inside the optim function. But that seems hacky and (I assume) messes with type stability.
Is there a clean way to specify a keyword argument whose size depends on a required argument?
Edit
It looks like something like
function foo{T<:Real}(A::Array{T,2}; w=zeros(T,size(A,2)))
    println("$A")
    println("$w")
end
will do the trick.
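For example (a quick check of my own, not part of the original post), calling the foo above with and without the keyword:
A = rand(4, 3)
foo(A)              # w defaults to zeros(eltype(A), size(A,2)), i.e. a length-3 zero vector
foo(A; w=ones(3))   # an explicitly supplied keyword overrides the default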
It appears that default parameters in Julia can be expressions containing the values of the other parameters:
julia> a(x, y=2*x) = println("$x, $y")
a (generic function with 2 methods)
julia> a(10)
10, 20
Additionally the default parameter expressions can make calls to other functions:
julia> b(x) = sqrt(x)
b (generic function with 1 method)
julia> a(x, y=b(x)) = println("$x, $y")
a (generic function with 2 methods)
julia> a(100)
100, 10.0
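To tie this back to the original question, the same mechanism works for keyword arguments, so a warm-start default can be sized from the required matrix argument. A minimal sketch (the optim name, the stub body, and the length check are mine, not a real solver):
function optim{T<:Real}(A::Array{T,2}, b::Vector{T}; w=zeros(T, size(A,2)))
    length(w) == size(A,2) || error("warm start w must have length size(A,2)")
    # ... run the actual solver starting from w; here we just return the start point
    return w
end

A = rand(10, 3); b = rand(10)
optim(A, b)             # w defaults to zeros(3)
optim(A, b; w=ones(3))  # user-supplied warm start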
I have been working in Julia for a few months now and I am interested in writing some of my code in parallel. I am working on a problem where I use 1 model to generate data for several different receivers (the data for each receiver is a vector). The data for each receiver can be computed independently, which leads me to believe I should be able to use the pmap function. My plan is to initialize the data as a 2D SharedArray (each column represents the data for 1 receiver) and then have pmap loop over each of the columns. However I am finding that using SharedArrays with pmap is no faster than working in serial using map. I wrote the following dummy code to illustrate this point.
@everywhere function Dummy(icol,model,data,A,B)
    nx = 250
    nz = 250
    nh = 50
    for ih = 1:nh
        for ix = 1:nx
            for iz = 1:nz
                data[iz,icol] += A[iz,ix,ih]*B[iz,ix,ih]*model[iz,ix,ih]
            end
        end
    end
end
function main()
    nx = 250
    nz = 250
    nh = 50
    nt = 500
    ncol = 100
    model1 = rand(nz,nx,nh)
    model2 = copy(model1)
    model3 = convert(SharedArray,model1)
    data1 = zeros(Float64,nt,ncol)
    data2 = SharedArray(Float64,nt,ncol)
    data3 = SharedArray(Float64,nt,ncol)
    A1 = rand(nz,nx,nh)
    A2 = copy(A1)
    A3 = convert(SharedArray,A1)
    B1 = rand(nz,nx,nh)
    B2 = copy(B1)
    B3 = convert(SharedArray,B1)
    @time map((arg)->Dummy(arg,model1,data1,A1,B1),[icol for icol = 1:ncol])
    @time pmap((arg)->Dummy(arg,model2,data2,A2,B2),[icol for icol = 1:ncol])
    @time pmap((arg)->Dummy(arg,model3,data3,A3,B3),[icol for icol = 1:ncol])
    println(data1==data2)
    println(data1==data3)
end
main()
I start the Julia session with julia -p 3 and run the script. The times for the 3 tests are 1.4s, 4.7s, and 1.6s respectively. Using SharedArrays with pmap (1.6s runtime) hasn't provided any improvement in speed compared with regular Arrays with map (1.4s). I am also confused as to why the 2nd case (data as a SharedArray, all other inputs as a regular Array with pmap) is so slow. What do I need to change in order to benefit from working in parallel?
Preface: yes, there actually is a solution to your issue. See code at bottom for that. But, before I get there, I'll go into some explanation.
I think the root of the problem here is memory access. First off, although I haven't rigorously investigated it, I suspect that there are a moderate number of improvements that could be made to Julia's underlying code in order to improve the way that it handles memory access in parallel processing. Nevertheless, in this case, I suspect that any underlying issues with the base code, if such actually exist, aren't so much at fault. Instead, I think it is useful to think carefully about what exactly is going on in your code and what it means vis-a-vis memory access.
A key thing to keep in mind when working in Julia is that it stores Arrays in column-major order. That is, it stores them as stacks of columns, one after another. This generalizes to dimensions > 2 as well. See this very helpful segment of the Julia performance tips for more info. The implication of this is that it is fast to access one row after another in a single column. But, if you need to be jumping around columns, then you get into trouble. Yes, accessing RAM is relatively fast, but accessing the cache is much, much faster, and so if your code allows for a single column or so to be loaded from RAM into the cache and then worked on, then you'll do much better than if you need to be doing lots of swapping between RAM and the cache. Here in your code, you're switching from column to column between your computations like nobody's business. For instance, in your pmap each process gets a different column of the shared array to work on. Then, each goes down the rows of that column and modifies the values in it. But, since they are trying to work in parallel with one another, and the whole array is too big to fit into your cache, there is lots of swapping between RAM and the cache that happens and that really slows you down. In theory, perhaps a clever enough under-the-hood memory management system could be devised to address this, but I don't really know - that goes beyond my pay grade. The same thing of course is happening to your accesses to your other objects.
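To make the column-major point concrete, here is a small self-contained sketch of my own (not part of the original code) comparing the two loop orders; exact timings depend on the machine, but the column-first version is typically several times faster because its inner loop walks down contiguous memory:
function sum_col_major(X)   # inner loop goes down a column: contiguous access
    s = 0.0
    for j = 1:size(X,2), i = 1:size(X,1)
        s += X[i,j]
    end
    return s
end

function sum_row_major(X)   # inner loop jumps across columns: strided access
    s = 0.0
    for i = 1:size(X,1), j = 1:size(X,2)
        s += X[i,j]
    end
    return s
end

X = rand(5000, 5000)
sum_col_major(X); sum_row_major(X)   # run once to compile
@time sum_col_major(X)
@time sum_row_major(X)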
Another thing to keep in mind in general when parallelizing is your ratio of flops (i.e. computer calculations) to read/write operations. Flops tend to parallelize well: you can have different cores, processes, etc. doing their own little computations on their own bits of data that they hold in their tiny caches. But, read/write operations don't parallelize so well. There are some things that can be done to engineer hardware systems to improve on this. But in general, if you have a given computer system with, say, two cores, and you add four more cores to it, your ability to perform flops will increase threefold, but your ability to read/write data to/from RAM won't really improve so much. (Note: this is an oversimplification, a lot depends on your system.) Nevertheless, in general, the higher your ratio of flops to read/writes, the more benefits you can expect from parallelism. In your case, your code involves a decent number of read/writes (all of those accesses to your different arrays) for a relatively small number of flops (a few multiplications and an addition). It's just something to keep in mind.
Fortunately, your case is amenable to some good speedups from parallelism if written correctly. In my experience with Julia, all of my most successful parallelism comes when I can break data up and have workers process chunks of it separately. Your case happens to be amenable to that. Below is an example of some code I wrote that does that. You can see that it gets roughly a 2.3x increase in speed (2.12 s down to 0.92 s) going from serial execution to three workers. The code is a bit crude in places, but it at least demonstrates the general idea of how something like this could be approached. I give a few comments on the code afterwards.
addprocs(3)
nx = 250;
nz = 250;
nh = 50;
nt = 250;
@everywhere ncol = 100;
model = rand(nz,nx,nh);
data = SharedArray(Float64,nt,ncol);
A = rand(nz,nx,nh);
B = rand(nz,nx,nh);
function distribute_data(X, obj_name_on_worker::Symbol, dim)
    size_per_worker = floor(Int,size(X,1) / nworkers())
    StartIdx = 1
    EndIdx = size_per_worker
    for (idx, pid) in enumerate(workers())
        if idx == nworkers()
            EndIdx = size(X,1)
        end
        println(StartIdx:EndIdx)
        if dim == 3
            @spawnat(pid, eval(Main, Expr(:(=), obj_name_on_worker, X[StartIdx:EndIdx,:,:])))
        elseif dim == 2
            @spawnat(pid, eval(Main, Expr(:(=), obj_name_on_worker, X[StartIdx:EndIdx,:])))
        end
        StartIdx = EndIdx + 1
        EndIdx = EndIdx + size_per_worker - 1
    end
end
distribute_data(model, :model, 3)
distribute_data(A, :A, 3)
distribute_data(B, :B, 3)
distribute_data(data, :data, 2)
@everywhere function Dummy(icol,model,data,A,B)
    nx = size(model, 2)
    nz = size(A,1)
    nh = size(model, 3)
    for ih = 1:nh
        for ix = 1:nx
            for iz = 1:nz
                data[iz,icol] += A[iz,ix,ih]*B[iz,ix,ih]*model[iz,ix,ih]
            end
        end
    end
end
regular_test() = map((arg)->Dummy(arg,model,data,A,B),[icol for icol = 1:ncol])
function parallel_test()
    @everywhere begin
        if myid() != 1
            map((arg)->Dummy(arg,model,data,A,B),[icol for icol = 1:ncol])
        end
    end
end
@time regular_test(); # 2.120631 seconds (307 allocations: 11.313 KB)
@time parallel_test(); # 0.918850 seconds (5.70 k allocations: 337.250 KB)
getfrom(p::Int, nm::Symbol; mod=Main) = fetch(@spawnat(p, getfield(mod, nm)))
function recombine_data(Data::Symbol)
    Results = cell(nworkers())
    for (idx, pid) in enumerate(workers())
        Results[idx] = getfrom(pid, Data)
    end
    return vcat(Results...)
end
@time P_Data = recombine_data(:data); # 0.003132 seconds
P_Data == data ## true
Comments
The use of the SharedArray is quite superfluous here. I just used it since it lends itself easily to being modified in place, which is how your code is originally written. This let me work more directly based on what you wrote without modifying it as much.
I didn't include the step to bring the data back in the time trial, but as you can see, it's quite a trivial amount of time in this case. In other situations, it might be less trivial, but data movement is just one of those issues that you face with parallelism.
When doing time trials in general, it is considered best practice to run the function once (in order to compile the code) and then run it again to get the times. That's what I did here.
See this SO post for where I got inspiration for some of these functions that I used here.
I try to run the following function at the Julia prompt, but when timing the function I see far more memory allocation than I can explain.
function pdpf(L::Int64, iters::Int64)
    snr_dB = -10
    snr = 10^(snr_dB/10)
    Pf = 0.01:0.01:1
    thresh = rand(100)
    Pd = rand(100)
    for m = 1:length(Pf)
        i = 0
        for k = 1:iters
            n = randn(L)
            s = sqrt(snr) * randn(L)
            y = s + n
            energy_fin = (y'*y) / L
            @inbounds thresh[m] = erfcinv(2Pf[m]) * sqrt(2/L) + 1
            if energy_fin[1] >= thresh[m]
                i += 1
            end
        end
        @inbounds Pd[m] = i/iters
    end
    #thresh = erfcinv(2Pf) * sqrt(2/L) + 1
    #Pd_the = 0.5 * erfc(((thresh - (snr + 1)) * sqrt(L)) / (2*(snr + 1)))
end
Running that function at the Julia prompt on my laptop, I get the following shocking numbers:
julia> @time pdpf(1000, 10000)
17.621551 seconds (9.00 M allocations: 30.294 GB, 7.10% gc time)
What is wrong with my code? Any help is appreciated.
I don't think this memory allocation is so surprising. For instance, consider all of the times that the inner loop gets executed:
for m = 1:length(Pf) this gives you 100 executions
for k = 1:iters this gives you 10,000 executions based on the arguments you supply to the function.
randn(L) this gives you a random vector of length 1,000, based on the arguments you supply to the function.
Thus, just considering these, you've got 100*10,000*1000 = 1 billion Float64 random numbers being generated. Each one of them takes 64 bits = 8 bytes. I.e. 8GB right there. And, you've got two calls to randn(L) which means that you're at 16GB allocations already.
You then have y = s + n which means another 8GB of allocations, taking you up to 24GB. I haven't looked in detail at the remaining code to get you from 24GB to 30GB of allocations, but this should show you that it's not hard for the GB of allocations to start adding up in your code.
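As a quick sanity check (my own addition, not from the original answer), you can ask Julia directly how much one randn(L) call allocates; for L = 1000 it is roughly 8 KB, and multiplying by the two calls per inner iteration and the 10^6 inner iterations reproduces the ~16 GB estimate above:
L = 1000
randn(L)                        # run once so compilation isn't included
@allocated randn(L)             # roughly 8*L = 8000 bytes of Float64 data, plus a small header
2 * 8 * L * 100 * 10000 / 1e9   # ≈ 16, i.e. about 16 GB from the two randn(L) calls alone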
If you're looking at places to improve, I'll give you a hint that these lines can be improved by using the properties of normal random variables:
n = randn(L)
s = sqrt(snr) * randn(L)
y = s + n
You should easily be able to cut down the allocations here from 24GB to 8GB in this way. Note that y will be a normal random variable here as you've defined it, and think up a way to generate a normal random variable with an identical distribution to what y has now.
Another small thing, snr is a constant inside your function. Yet, you keep taking its sqrt 1 million separate times. In some settings, 'checking your work' can be helpful, but I think that you can be confident the computer will get it right the first time and thus you don't need to make it keep re-doing this calculation ; ). There are other similar places you can improve your code to avoid duplicate computations here that I'll leave to you to locate.
aireties gives a good answer for why you have so many allocations. You can do more to reduce the number of allocations. Using the fact that a sum of independent normal random variables is itself normal, with the variances adding, we know that y = s + n is really y = sqrt(snr) * randn(L) + randn(L), and so we can instead do y = rvvar*randn(L) where rvvar = sqrt(1 + sqrt(snr)^2) is defined outside the loop (thanks for the fix!). This will halve the number of random variables needed.
Outside the loop you can also precompute sqrt(2/L) to cut down a little more time.
I don't think transpose is special-cased yet, so try using dot(y,y) instead of y'*y. I know dot for sure is just a loop without having to transpose, while the other may transpose depending on the version of Julia.
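As a small illustration of that point (my own sketch; in the Julia versions current at the time, y'*y on a vector produced a 1-element array, which is why the original code indexes energy_fin[1]):
y = randn(1000)
a = dot(y, y)     # plain scalar, computed in a single pass with no temporaries
b = (y' * y)[1]   # y' * y goes through a transpose/matmul path and is indexed to get a scalar
isapprox(a, b)    # true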
Something that would help performance (but not allocations) would be to use one big randn(L,iters) and loop through that. The reason is that if you make all of your random numbers at once it's faster, since it can use SIMD and a bunch of other goodies. If you want to implicitly do that without changing your code much, you can use ChunkedArrays.jl where you can use rands = ChunkedArray(randn,L) to initialize it and then every time you want a randn(L), you instead use next(rands). Inside the ChunkedArray it actually makes bigger vectors and replenishes them as needed, but like this you can just get your randn(L) without having to keep track of all of that.
Edit:
ChunkedArrays probably only save time when L is smaller. This gives the code:
function pdpf(L::Int64, iters::Int64)
    snr_dB = -10
    snr = 10^(snr_dB/10)
    Pf = 0.01:0.01:1
    thresh = rand(100)
    Pd = rand(100)
    rvvar = sqrt(1+sqrt(snr)^2)
    for m = 1:length(Pf)
        i = 0
        for k = 1:iters
            y = rvvar*randn(L)
            energy_fin = (y'*y) / L
            @inbounds thresh[m] = erfcinv(2Pf[m]) * sqrt(2/L) + 1
            if energy_fin[1] >= thresh[m]
                i += 1
            end
        end
        @inbounds Pd[m] = i/iters
    end
end
which runs in half the time of the version with two randn calls. Indeed, from ProfileView we get:
@profile pdpf(1000, 10000)
using ProfileView
ProfileView.view()
The two largest blocks in the profile correspond to the line y = rvvar*randn(L), so the vast majority of the time is random number generation. Last time I checked you could still get a decent speedup on random number generation by switching to the VSL.jl library, but you need MKL linked to your Julia build. Note that from the Google Summer of Code page you can see that there is a project to make an RNG.jl repo with faster pseudo-RNGs. It looks like it already has a few new ones implemented. You may want to check them out and see if they give speedups (or help out with that project!)
The goal of this experiment is to compare speed of Matlab and Julia with a small piece of code below.
First the Matlab code:
>> t = 5000; n = 10000; x = 1:t*n;
>> x = reshape(x, t, n);
>> tic(); y1 = sum(x(:) .* x(:)); toc()
Elapsed time is 0.229563 seconds.
>> y1
y1 =
4.1667e+22
>> tic(); y2 = trace(x * x'); toc()
Elapsed time is 15.332694 seconds.
>> y2
y2 =
4.1667e+22
Versus in Julia
julia> t = 5000; n = 10000; x = 1: t*n;
julia> x = reshape(x, t, n);
julia> tic(); y1 = sum(x[:].* x[:]); toc();
elapsed time: 1.235170533 seconds
julia> y1
-4526945843202100544
julia> tic();y2 = trace(x*x'); toc();
The second one did not finish the job after more than a minute. So what is the matter here with Julia? This piece of code happens to run both slower and to overflow in Julia. Is there any problem with my style? I think one reason to convert from Matlab to Julia is the speed, and I used to think that Julia handles big-number arithmetic by default. Now it looks like neither of these is correct. Can someone explain?
There are a couple of things going on here.
Firstly, unlike in Matlab, x here is an array of machine integers, not floating-point values. This appears to be the main reason for the speed difference, as Julia can't use BLAS routines for linear algebra on integer arrays.
You need to do either
x = 1.0:t*n
or explicitly convert it via
x = float(x)
This is also the reason for a different answer: Julia uses machine arithmetic, which for integers will wrap around on overflow (hence the negative number). You won't have this problem with floating point values.
(Julia does have arbitrary-precision integers, but doesn't promote by default: instead you would need to promote them yourself via big(n))
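To see the difference concretely (a quick sketch of my own, not from the original answer):
typemax(Int64) + 1 == typemin(Int64)   # true: machine integers silently wrap around
big(typemax(Int64)) + 1                # 9223372036854775808: BigInt is exact, no wraparound
float(typemax(Int64)) + 1.0            # ≈ 9.223372036854776e18: Float64 rounds but does not wrap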
This will give you some speed up, but you can do better. Julia does require a slightly different programming style than Matlab. See the links that Isaiah provided in the comments above.
In regards to your specific examples
sum(x[:].* x[:])
This is slow because it creates three intermediate vectors: two copies from x[:] and their elementwise product (though hopefully the copying will change in a future version).
Similarly,
trace(x*x')
computes an intermediate matrix (most of which is not used in the trace).
My suggestion would be to use
dot(vec(x),vec(x))
vec(x) just reshapes x into a vector (so no copying), and dot is the usual sum-product. You could also "roll-your-own":
function test(x)
    s = zero(eltype(x)) # this prevents type-instability
    for xi in x
        s += xi*xi
    end
    s
end
test(x)
(this needs to be in a function for the JIT compiler to work its magic). It should be reasonably fast, though probably still not as fast as dot, which uses BLAS calls underneath.
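A quick way to try both suggestions together (my own sketch; exact timings will vary, and each call is run once beforehand so compilation isn't included in the measurement):
t = 5000; n = 10000
x = reshape(collect(1.0:t*n), t, n)   # Float64 data, as recommended above
test(x); dot(vec(x), vec(x))          # run once to compile
@time test(x)                         # hand-written loop
@time dot(vec(x), vec(x))             # BLAS-backed dot on the same data, no copies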
Optimizing my MATLAB code, I stumbled upon a weird problem regarding anonymous functions.
As in this thread, I realized that anonymous functions sometimes run really slowly.
But with minimal changes to the function, it runs as fast as subfunctions or nested functions.
I used this (simple) test file to reproduce the behaviour with Matlab R2010b under Windows 7 64-bit:
clear all; close all; clc;
% functions
fn1 = @(x) x^2;
fn2 = @(x) double(x^2);
% variables
x = linspace(-100,100,100000);
N = length(x);
%% anonymous function
y = zeros(1,N);
t = tic;
for i=1:N
    y(i) = fn1(x(i));
end
tm.anonymous_1 = toc(t);
%% anonymous function (modified)
y = zeros(1,N);
t = tic;
for i=1:N
    y(i) = fn2(x(i));
end
tm.anonymous_2 = toc(t);
%% print
tm
The results I got were:
tm =
anonymous_1: 1.0605
anonymous_2: 0.1217
As you can see the first approach is about 10 times slower.
I have no idea what triggers this speedup/slowdown.
I tried different things, getting nearly the same (fast) timings:
fn2 = @(x) 1 * x^2;
fn2 = @(x) 0 + x^2;
fn2 = @(x) abs(x^2);
fn2 = @(x) x*x;
Before I start profiling all of my functions,
I would like to know if anyone has an explanation for this behaviour?
P.S.: I know that "vectorized" approaches are much faster, but in my case a solver will be evaluating the function for each variable time step, so that is not an option.
It appears that in the case of 'fn2' the Matlab optimizer is able to inline the function, whereas it is unable to do so in the case of 'fn1'.
This probably has to do with what Matlab knows about the scalarity or complexity or structure of the argument and return value. It probably figures out that 'x(i)' (the argument at the call-site) is necessarily scalar, real and non-structured. Given a scalar argument it then tries to figure out the behaviour of the function. In the case of 'fn2', Matlab's optimizer statically determines that it can always fit all possible results of 'double()' into the target variable 'y(i)'. For some reason only known to the designers of the optimizer, Matlab is unable to come to the same conclusion for 'fn1'. Maybe there are some non-obvious corner cases, or '^' lacks some piece of metadata that the optimizer depends on. Anyway, the result is that in the case of 'fn1' Matlab apparently re-evaluates (rather than inlines) the function at every iteration.
Anyway, statically optimizing dynamic languages is something of a black art in compiler design.
I believe that making the return type of a function independent of its argument types makes it easier for Matlab to optimize. By the way, y = fn1(x); and y = fn2(x); show roughly the same run-time ratio, so it's not an effect of the argument being scalar versus complex.