I have one very large matrix M (around 5 GB) and have to perform an operation f: Column -> Column on every column of M.
I suppose I should use pmap (correct me if I am wrong), but as I understand it, I should give it a list of matrices. How do I efficiently partition M in order to pass it to pmap?
The second question is whether it is preferable for f to take multiple columns at once or not.
I think it might be a good idea to try SharedArray for this. Even better would be multithreading instead of Julia's current multiprocessing, but this isn't released yet.
f should take a reference to the matrix, and a list of columns, rather than the columns themselves, to avoid copying.
EDIT: Here is my attempt at a SharedArray example - I've never used it myself before, so it's probably not written optimally.
addprocs(3)

@everywhere rows = 10000
@everywhere cols = 100

data = SharedArray(Float64, (rows, cols))

@everywhere function f(col, data)
    for row = 1:rows
        new_val = rand() * col
        for dowork = 1:10000
            new_val = sqrt(new_val)^2
        end
        data[row, col] = new_val
    end
end

tic()
pmap(g -> f(g...), [(col, data) for col in 1:cols])
toc()

for i = 1:10:cols
    println(i, " ", mean(data[:, i]), " ", 0.5 * i)
end

tic()
map(g -> f(g...), [(col, data) for col in 1:cols])
toc()
with output
elapsed time: 24.454875168 seconds
1 0.49883655930753457 0.5
11 5.480063271913496 5.5
21 10.495998948926 10.5
31 15.480227440365235 15.5
41 20.70105670567518 20.5
51 25.300540822213783 25.5
61 30.427728439076436 30.5
71 35.5280001975307 35.5
81 41.06101008798742 40.5
91 45.72394376323945 45.5
elapsed time: 69.651211534 seconds
So we are getting approximately a 3x speedup, as hoped for. It will get closer to the ideal the longer the jobs run, as there is probably some JIT warm-up time.
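For comparison, here is a sketch of what the threaded version might look like once multithreading is available (hypothetical: it assumes a Julia build with threading support, started with e.g. JULIA_NUM_THREADS=4):

# Sketch only - threads share memory, so no SharedArray is needed.
using Base.Threads

function fill_columns!(data)
    @threads for col in 1:size(data, 2)
        # same per-column work as f above, but each thread takes whole columns
        for row in 1:size(data, 1)
            new_val = rand() * col
            for dowork = 1:10000
                new_val = sqrt(new_val)^2
            end
            data[row, col] = new_val
        end
    end
end

data = zeros(10_000, 100)
fill_columns!(data)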
Originally this problem came up on mathematica.SE, but since multiple programming languages have become involved in the discussion, I think it's better to rephrase it a bit and post it here.
In short, michalkvasnicka found that in the following MATLAB sample
s = 15000;

tic
% for-loop version
H = zeros(s,s);
for c = 1:s
    for r = 1:s
        H(r,c) = 1/(r+c-1);
    end
end
toc
%Elapsed time is 1.359625 seconds.... For-loop

tic;
% vectorized version
c = 1:s;
r = c';
HH = 1./(r+c-1);
toc
%Elapsed time is 0.047916 seconds.... Vectorized

isequal(H,HH)
the vectorized code piece is more than 25 times faster than the pure for-loop code piece. I don't have access to MATLAB, so I cannot test the sample myself, but the timing of 1.359625 seconds seems to suggest it was tested on an average PC, just like mine.
But I cannot reproduce the timing with other languages like Fortran or Julia! (Both are famous for their numeric performance, though I admit I'm by no means an expert in Fortran or Julia.)
The following are the samples I used for testing. I'm using a laptop with an i7-8565U CPU running Windows 10.
fortran
The Fortran code is compiled with gfortran (TDM-GCC-10.3.0-2, with compile option -Ofast).
program tst
    use, intrinsic :: iso_fortran_env
    implicit none
    integer, parameter :: s = 15000
    integer :: r, c
    real(real64) :: hmn(s,s)
    do r = 1, s
        do c = 1, s
            hmn(r,c) = 1._real64/(r + c - 1)
        end do
    end do
    print *, hmn(s,s)
end program
compilation timing: 0.2057823 seconds
execution timing: 0.7179657 seconds
julia
The Julia version is 1.6.3.
@time (s=15000; Hmm=[1. /(r+c-1) for r=1:s,c=1:s];)
Timing: 0.7945998 seconds
Here come the questions:
1. Is the timing of MATLAB reliable?
2. If the answer to the 1st question is yes, then how can we reproduce the performance (for a 2 GHz CPU, the timing should be around 0.05 seconds) with Julia, Fortran, or any other programming language?
Just to add on the Julia side - make sure you use BenchmarkTools to benchmark, wrap the code you want to benchmark in functions so as not to benchmark in global scope, and interpolate any variables you pass to @btime.
Here's how I would do it:
julia> s = 15_000;

julia> function f_loop!(H)
           for c ∈ 1:size(H, 1)
               for r ∈ 1:size(H, 1)
                   H[r, c] = 1 / (r + c - 1)
               end
           end
       end
f_loop! (generic function with 1 method)

julia> function f_vec!(H)
           c = 1:size(H, 1)
           r = c'
           H .= 1 ./ (r .+ c .- 1)
       end
f_vec! (generic function with 1 method)

julia> H = zeros(s, s);

julia> using BenchmarkTools

julia> @btime f_loop!($H);
625.891 ms (0 allocations: 0 bytes)

julia> H = zeros(s, s);

julia> @btime f_vec!($H);
625.248 ms (0 allocations: 0 bytes)
So both versions come in at about the same time, which is what I'd expect for such a straightforward operation, where properly type-inferred code should compile down to roughly the same machine code.
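For what it's worth, if MATLAB's remaining edge comes from implicit multithreading (a hypothesis raised in later answers), a threaded loop might close the gap. A minimal sketch, assuming Julia was started with multiple threads (e.g. julia -t 4):

function f_loop_threaded!(H)
    Threads.@threads for c ∈ 1:size(H, 2)
        for r ∈ 1:size(H, 1)
            H[r, c] = 1 / (r + c - 1)
        end
    end
end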
tic/toc should be fine, but it looks like the timing is being skewed by memory pre-allocation.
I can reproduce similar timings to your MATLAB example; however:

On first run (clear workspace):
- Loop approach takes 2.08 sec
- Vectorised approach takes 1.04 sec
- Vectorisation saves 50% execution time

On second run (workspace not cleared):
- Loop approach takes 2.55 sec
- Vectorised approach takes 0.065 sec
- Vectorisation "saves" 97.5% execution time
My guess would be that since the loop approach explicitly creates a new matrix via zeros, the memory is reallocated from scratch on every run and you don't see the speed improvement on subsequent runs.
However, when HH remains in memory and the HH=___ line outputs a matrix of the same size, I suspect MATLAB is doing some clever memory allocation to speed up the operation.
We can prove this theory with the following test:
Test Num | Workspace cleared | s     | Loop (sec) | Vectorised (sec)
---------|-------------------|-------|------------|-----------------
1        | Yes               | 15000 | 2.10       | 1.41
2        | No                | 15000 | 2.73       | 0.07
3        | No                | 15000 | 2.50       | 0.07
4        | No                | 15001 | 2.74       | 1.73
See the variation between tests 2 and 3; this is why timeit would have been helpful for an average runtime (see footnote). The difference in output size between tests 3 and 4 is pretty small, but for the vectorised approach the execution time returns to a similar magnitude as in test 1, suggesting that the re-allocation to create HH costs most of the time.
Footnote: tic/toc timings in MATLAB can be improved by using the built-in timeit function, which essentially takes an average over several runs. One interesting thing to observe from the workings of timeit, though, is that it explicitly "warms up" (quoting a comment) the tic/toc function by calling it a couple of times. You can see when running tic/toc a few times from a clear workspace (with no intermediate code) that the first call takes longer than subsequent calls, as there must be some overhead in getting the timer initialised.
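For reference, timeit takes a zero-argument function handle, so the code under test must first be wrapped in a function. A minimal sketch (hilb_vec is a hypothetical wrapper, not from the posts above):

% hypothetical wrapper around the vectorized version
function HH = hilb_vec(s)
    c = 1:s;
    r = c';
    HH = 1./(r+c-1);
end

% timeit calls the handle several times and returns a representative timing
t = timeit(@() hilb_vec(15000))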
I hope that the following modified benchmark sheds some new light on the problem:
s = 15000;

tic
% for-loop version
H = zeros(s,s);
for i = 1:10
    for c = 1:s
        for r = 1:s
            H(r,c) = H(r,c) + 1/(r+c-1+i);
        end
    end
end
toc

tic;
% vectorized version
HH = zeros(s,s);
c = 1:s;
r = c';
for i = 1:10
    HH = HH + 1./(r+c-1+i);
end
toc

isequal(H,HH)
In this case any kind of "caching" is avoided by changing the matrix H (HH) at each iteration of the for-loop over i.
In this case we get:
Elapsed time is 3.737275 seconds. (for-loop)
Elapsed time is 1.143387 seconds. (vectorized)
So, there is still a performance improvement (~3x) due to the vectorization, which probably comes from the implicit multithreading of vectorized MATLAB commands.
Yes, tic/toc vs timeit is not strictly consistent, but the overall timing functionality is very similar.
To add to this, here is a simple Python script which performs the vectorized operation with NumPy:
from timeit import default_timer
import numpy as np

s = 15000

start = default_timer()
# for-loop version (0-based indices, so the formula becomes 1/(r+c+1))
H = np.zeros([s, s])
for c in range(s):
    for r in range(s):
        H[r, c] = 1 / (r + c + 1)
end = default_timer()
print(end - start)

start = default_timer()
# vectorized version
c = np.arange(1, s + 1).reshape([1, -1])
r = c.T
HH = 1 / (c + r - 1)
end = default_timer()
print(end - start)
for-loop: 32.94566780002788 seconds
vectorized: 0.494859800033737 seconds
While the for-loop version is terribly slow, the vectorized version is faster than the posted Fortran/Julia times. NumPy internally tries to use special SIMD hardware instructions to speed up arithmetic on vectors, which can make a significant difference. It's possible that the Fortran/Julia compilers weren't able to generate those instructions from the provided code, but NumPy/MATLAB were. However, MATLAB is still about 10x faster than the NumPy code, which I don't think is explained by better use of SIMD instructions alone. Instead, MATLAB may also be using multiple threads to parallelize the computation, since the matrix is fairly large.
Ultimately, I think the MATLAB numbers are plausible, but I'm not sure exactly how they're getting their speedup.
I'm doing MC simulations, and I need to generate random integers within a range between 1 and a variable upper limit n_mol.
The specific Julia function for this is rand(1:n_mol), where n_mol is an integer that changes with every MC iteration. The problem is that doing it this way is slow (possibly an issue to open for the Julia developers). So, instead of using that particular function call, I thought about generating a random float in [0,1), multiplying it by n_mol, and then taking the integer part of the result: int(rand()*n_mol). The problem now is that int() rounds to the nearest integer, so I could end up with numbers between 0 and n_mol, and I can't accept 0. So the solution I'm using for the moment is ifloor plus 1, i.e. ifloor(rand()*n_mol)+1, which is considerably faster than the first version, but slower than the second.
function t1(N, n_mol)
    for i = 1:N
        rand(1:n_mol)
    end
end

function t2(N, n_mol)
    for i = 1:N
        int(rand()*n_mol)
    end
end

function t3(N, n_mol)
    for i = 1:N
        ifloor(rand()*n_mol)+1
    end
end

@time t1(1e8, 123456789)
@time t2(1e8, 123456789)
@time t3(1e8, 123456789)
elapsed time: 3.256220849 seconds (176 bytes allocated)
elapsed time: 0.482307467 seconds (176 bytes allocated)
elapsed time: 0.975422095 seconds (176 bytes allocated)
So, is there any way of doing this faster with speeds near the second test?
It's important because the MC simulation goes for more than 1e10 iterations.
The result has to be an integer because it will be used as an index of an array.
The rand(r::Range) code is quite fast, given the following two considerations. First, Julia calls a 52-bit RNG twice to obtain a random integer but only once to obtain a random float; with some bookkeeping, that gives a factor of about 2.5. The second consideration is that
(rand(Uint) % k)
is only evenly distributed between 0 and k-1 if k is a power of 2. This is taken care of with rejection sampling, which explains more or less the remaining additional cost.
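For illustration, the rejection step might look like the following sketch (my own illustration, written in the same 0.3-era style as the code below; it is not Julia's actual implementation):

function randint_unbiased(k::Uint)
    # largest multiple of k that fits below typemax(Uint)
    lim = typemax(Uint) - rem(typemax(Uint), k)
    while true
        x = rand(Uint)
        # accept only draws below lim, so rem(x, k) is exactly uniform
        x < lim && return 1 + rem(x, k) % Int
    end
end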
If speed is extremely important, you can use a simpler random number generator than Julia's and ignore those issues. For example, with a linear congruential generator and without rejection sampling:
function lcg(old)
    a = unsigned(2862933555777941757)
    b = unsigned(3037000493)
    a*old + b
end

function randfast(k, x::Uint)
    x = lcg(x)
    1 + rem(x, k) % Int, x
end

function t4(N, R)
    state = rand(Uint)
    for i = 1:N
        x, state = randfast(R, state)
    end
end
But be careful if the range is (really) big.
julia> m = div(typemax(Uint), 3)*2

julia> mean([rand(1:m)*1.0 for i in 1:10^7])
6.148922790091841e18

julia> m/2
6.148914691236517e18
but (!)

julia> mean([(rand(Uint) % m)*1.0 for i in 1:10^7])
5.123459611164573e18

julia> 5/12*m
5.124095576030431e18

(Why 5/12? Since typemax(Uint) is roughly 1.5m, every value in [0, m/2) is produced twice by the modulo while values in [m/2, m) are produced only once, so the mean works out to (5/12)*m instead of m/2.)
Note that in 0.4, int() is deprecated, and you're asked to use round() instead:
function t2(N, n_mol)
    for i = 1:N
        round(rand()*n_mol)
    end
end
gives 0.27 seconds on my machine (using Julia 0.4).
It is really frustrating - all I am trying to do is build a 7-column matrix consisting of all mod-7 numbers, and it takes a huge amount of time to generate it using the following code:

to = 7^k;
msgValue = zeros(to,k);
for l = 0:to
    for kCounter = 0:(k-1)
        msgValue(l+1,kCounter+1) = mod((l/7^kCounter),7);
    end
end
msgValue = floor(msgValue);
How can I do this faster?
Or another vectorized approach (direct matrix multiplication); dividing by 7.^(0:k-1), then taking mod 7 and flooring, extracts each base-7 digit:
msgValue = floor( mod( (0:7^k).' * (1./(7.^(0:k-1))),7 ) ) ;
a wee bit faster than the famous bsxfun ;-)
%// For 10000 iterations, k=3
Elapsed time is 2.280774 seconds. %// double loop
Elapsed time is 1.329179 seconds. %// bsxfun
Elapsed time is 0.958945 seconds. %// matrix multiplication
You can use a vectorized approach with bsxfun -
msgValue = floor(mod(bsxfun(@rdivide,[0:to]',7.^(0:(k-1))),7));
Quick runtime tests for k = 7:
-------------------- With Original Approach
Elapsed time is 1.519023 seconds.
-------------------- With Proposed Approach
Elapsed time is 0.279547 seconds.
I used a submission from MATLAB Central called rude, which I tend to use from time to time, and was able to eliminate one for-loop and vectorize the code to some extent.
tic
k = 7;
modval = 7;
to = modval^k;
mods = mod(0:(modval-1), modval);
msgValue = zeros(to,k);
for kCounter = 1:k
    aux = rude(modval^(kCounter-1)*ones(1,modval), mods)';
    msgValue(:,kCounter) = repmat(aux, to/(7^kCounter), 1);
end
toc
The idea behind the code is to build, at the beginning of each iteration, the building block of the column vector using the rude function (a run-length decoder: given run lengths and values, e.g. rude([2 3],[5 7]) would return [5 5 7 7 7]). rude, in turn, uses mods = [0 1 2 3 4 5 6] as the starting point for the manipulation. The real work is done through vectorization.
You did not mention how long your code takes to run, so I timed it just once to give you a rough idea. It ran in 0.43 seconds on my machine: a Windows 7 Ultimate, 2.4 GHz, 4 GB RAM, dual-CPU machine.
Also, the way you defined your loop adds a repeated row to your msgValue matrix: the first row consists of zero values throughout all columns, and so does the last row, which I also fixed. For a toy example with k=3, your code returns a 344x3 matrix, while you explicitly initialize it as a 7³x3 (343x3) matrix.
I'm trying to speed up the following Monte Carlo simulation in matlab:
http://pastebin.com/nS0K7XXa
and this is the full result of the MATLAB profiler:
http://i.imgur.com/bGFY5e7.png
I am pretty new to MATLAB, but I have already spent a good deal of time on this, and I think I'm missing something somewhere, because I have the feeling that this should run much faster.
I'm concerned about the lines the profiler shows in red, of course... let's start with these:
time calls line code
37.59 19932184 54 radselec = fix(rand(1)*nr) + 1;
4.54 19932184 55 nm = nm - 1;
45.35 19932184 56 Rad2(radselec) = Rad2(radselec) + 1;
I have a very large vector (Rad2) which holds positive integer values, initially they are all zero but as the simulation progresses it fills up.
Line 54 picks a random element of that vector. Every time I add a value to that vector I also increment the variable nr, so basically nr is the current number of elements in Rad2, and fix(rand(1)*nr)+1 will pick a random number between 1 and nr.
Question 1: Is there a better way of doing this? rand(1) alone seems to take a long time, as you can see from line 26:
31.50 20540616 26 r = rand(1);
Question 2: line 56 also caught my attention... once I have a value for radselec, I need to add 1 to the value of Rad2(radselec).
Now, I thought that doing Rad2(radselec) = Rad2(radselec) + 1; was just as fast as doing nm = nm - 1 (or +1, for that matter)... but the profiler shows that adding 1 to an element of a vector is 10 times slower.
Question 3:
31.50 20540616 26 r = rand(1);
27
22.72 20540616 28 if r > R1/Rt
3.39 20220062 29 reacselec = 2;
10.80 20220062 30 if r > (R1+R2)/Rt
rand(1) seems to be slow as it is... by definition I need that random number between 0 and 1. So I can't think of another way of speeding that line up.
Now... how come line 28 is 2 times slower than line 30? I mean... they are practically the same line with the same calculation... if anything, line 30 should be slightly slower for having R1+R2 in the numerator instead of just R1.
What's happening there?
And finally,
24.26 20540616 79 end
why is that end statement consuming so much time? How can I fix that?
Thank you for your time, and sorry if these questions are too basic. I just started programming a few months ago, and I do not have a computer science background. I'm thinking of taking some courses, but that's not a priority.
Any help will be very appreciated.
I have a list of dataframes for which I am certain that they all contain at least one row (in fact, some contain only one row, and others contain a given number of rows), and that they all have the same columns (names and types). In case it matters, I am also certain that there are no NA's anywhere in the rows.
The situation can be simulated like this:
#create one row
onerowdfr <- do.call(data.frame, c(list(), rnorm(100),
    lapply(sample(letters[1:2], 100, replace=TRUE),
           function(x){factor(x, levels=letters[1:2])})))
colnames(onerowdfr) <- c(paste("cnt", 1:100, sep=""), paste("cat", 1:100, sep=""))

#reuse it in a list
someParts <- lapply(rbinom(200, 1, 14/200)*6+1,
                    function(reps){onerowdfr[rep(1, reps),]})
I've set the parameters (of the randomization) so that they approximate my true situation.
Now, I want to unite all these dataframes in one dataframe. I thought using rbind would do the trick, like this:
system.time(
    result <- do.call(rbind, someParts)
)
Now, on my system (which is not particularly slow), and with the settings above, this is the output of system.time:
user system elapsed
5.61 0.00 5.62
Nearly 6 seconds for rbind-ing 254 (in my case) rows of 200 variables? Surely there has to be a way to improve the performance here? In my code, I have to do similar things very often (it is a form of multiple imputation), so I need this to be as fast as possible.
Can you build your matrices with numeric variables only and convert to a factor at the end? rbind is a lot faster on numeric matrices.
On my system, using data frames:
> system.time(result<-do.call(rbind, someParts))
user system elapsed
2.628 0.000 2.636
Building the list with all numeric matrices instead:
onerowdfr2 <- matrix(as.numeric(onerowdfr), nrow=1)
someParts2 <- lapply(rbinom(200, 1, 14/200)*6+1,
                     function(reps){onerowdfr2[rep(1, reps),]})
results in a lot faster rbind.
> system.time(result2<-do.call(rbind, someParts2))
user system elapsed
0.001 0.000 0.001
EDIT: Here's another possibility; it just combines each column in turn.
> system.time({
+ n <- 1:ncol(someParts[[1]])
+ names(n) <- names(someParts[[1]])
+ result <- as.data.frame(lapply(n, function(i)
+ unlist(lapply(someParts, `[[`, i))))
+ })
user system elapsed
0.810 0.000 0.813
Still not nearly as fast as using matrices though.
EDIT 2:
If you only have numerics and factors, it's not that hard to convert everything to numeric, rbind them, and convert the necessary columns back to factors. This assumes all factors have exactly the same levels. Converting to a factor from an integer is also faster than from a numeric so I force to integer first.
someParts2 <- lapply(someParts, function(x)
matrix(unlist(x), ncol=ncol(x)))
result<-as.data.frame(do.call(rbind, someParts2))
a <- someParts[[1]]
f <- which(sapply(a, class)=="factor")
for(i in f) {
lev <- levels(a[[i]])
result[[i]] <- factor(as.integer(result[[i]]), levels=seq_along(lev), labels=lev)
}
The timing on my system is:
user system elapsed
0.090 0.00 0.091
Not a huge boost, but swapping rbind for rbind.fill from the plyr package knocks about 10% off the running time (with the sample dataset, on my machine).
If you really want to manipulate your data.frames faster, I would suggest using the package data.table and the function rbindlist(). I did not perform extensive tests, but for my dataset (3000 dataframes, 1000 rows x 40 columns each) rbindlist() takes only 20 seconds.
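A minimal sketch of that approach, using someParts from the question (rbindlist returns a data.table, so convert back if you need a plain data.frame):

library(data.table)
result <- as.data.frame(rbindlist(someParts))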
This is ~25% faster, but there has to be a better way...
system.time({
N <- do.call(sum, lapply(someParts, nrow))
SP <- as.data.frame(lapply(someParts[[1]], function(x) rep(x,N)))
k <- 0
for(i in 1:length(someParts)) {
j <- k+1
k <- k + nrow(someParts[[i]])
SP[j:k,] <- someParts[[i]]
}
})
Make sure you're binding dataframe to dataframe - I ran into a huge performance degradation when binding a list to a dataframe.
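For illustration, the difference is between something like the following (hypothetical names; the list version forces a coercion on every call):

# slower: rbind must coerce the bare list to a data.frame each time
result <- rbind(result, newrow_as_list)
# faster: convert once, then bind data.frame to data.frame
result <- rbind(result, as.data.frame(newrow_as_list))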