I have a stack of images (3D array) and I want to improve their resolution (upsampling). I run the following code snippet that I find a little slow ...
Is there any way to improve the speed of this piece of code? (without using multiprocessing)
using BenchmarkTools
using Interpolations
function doInterpol(arr::Array{Int, 2}, h, w)
A = interpolate(arr, BSpline(Linear()))
return A[1:2/(h-1)/2:2, 1:2/(w-1)/2:2]
end
function applyResize!(arr3D_hd::Array, arr3D_ld::Array, t::Int, h::Int, w::Int)
for i = 1:1:t
#inbounds arr3D_hd[i, :, :] = doInterpol(arr3D_ld[i, :, :], h, w)
end
end
t, h, w = 502, 65, 47
h_target, w_target = 518, 412
arr3D_ld = reshape(collect(1:t*h*w), (t, h, w))
arr3D_hd = Array{Float32}(undef, t, h_target, w_target)
applyResize!(arr3D_hd, arr3D_ld, t, h_target, w_target)
When I benchmark the following:
#btime applyResize!(arr3D_hd, arr3D_ld, t, h_target, w_target)
I got :
2.334 s (68774 allocations: 858.01 MiB)
I ran it multiple time and results are in [1.8s - 2.8s] interval.
Julia stores arrays in column-major order. This means that slices like arr[i, : ,:] perform much worse than arr[:,:,i] (which is contiguous in memory). Therefore, a way to gain some speed is to index your arrays using (h,w,t) rather than (t, w, h).
A second issue is that taking slices like arr[i,:,:] copies data. It seems to have negligible impact here, but it might be good to get into the habit of using array views instead of slices when you can. A view is a small wrapper object that behaves in the same way as a slice of a larger array, but does not hold a copy of the data: it directly accesses the data of the parent array (see the example below to maybe better understand what a view is).
Note that both these issues are mentioned in the Julia performance tips; it might be useful to read the remaining pieces of advice in this page.
Putting this together, your example can be rewritten like:
function applyResize2!(arr3D_hd::Array, arr3D_ld::Array, h::Int, w::Int, t)
#inbounds for i = 1:1:t
A = interpolate(#view(arr3D_ld[:, :, i]), BSpline(Linear()))
arr3D_hd[:, :, i] .= A(1:2/(h-1)/2:2, 1:2/(w-1)/2:2)
end
end
which is used with arrays stored a bit differently from your case:
# Note the order of indices
julia> arr3D_ld = reshape(collect(1:t*h*w), (h, w, t));
julia> arr3D_hd = Array{Float32}(undef, h_target, w_target, t);
# Don't forget to escape arguments with a $ when using btime
# (not really an issue here, but could have been one)
julia> #btime applyResize2!($arr3D_hd, $arr3D_ld, h_target, w_target, t)
506.449 ms (6024 allocations: 840.11 MiB)
This is roughly a speed-up by a factor 3.4 w.r.t your original code, which benchmarks like this on my machine:
julia> arr3D_ld = reshape(collect(1:t*h*w), (t, h, w));
julia> arr3D_hd = Array{Float32}(undef, t, h_target, w_target);
julia> #btime applyResize!($arr3D_hd, $arr3D_ld, t, h_target, w_target)
1.733 s (50200 allocations: 857.30 MiB)
NB: Your original code uses a syntax like A[x, y] to get interpolated values. This seems to be deprecated in favor of A(x, y). I might not have the same version of Interpolations as you, though...
Example illustrating the behavior of views
julia> a = rand(3,3)
3×3 Array{Float64,2}:
0.042097 0.767261 0.0433798
0.791878 0.764044 0.605218
0.332268 0.197196 0.722173
julia> v = #view(a[:,2]) # creates a view instead of a slice
3-element view(::Array{Float64,2}, :, 2) with eltype Float64:
0.7672610491393876
0.7640443797187411
0.19719581867637093
julia> v[3] = 42 # equivalent to a[3,2] = 42
42
Use
itp = interpolate(arr3D_ld, (NoInterp(), BSpline(Linear()), BSpline(Linear())));
A = itp(1:size(itp,1), 1:2/517:2, 1:2/411:2);
It should give a ~7x performance improvement compared to your version.
As François Févotte noted, it's also important to pay attention to deprecation warnings, as they slow down execution.
Related
Suppose I have the following function:
function y1(x)
y = x^(2) - 4
return y
end
Now, I want to evaluate all the values from this sequence: collect(range(-10,10, 1000))
I tried this
y_1 = zeros(1000);
for x in collect(range(-10, 10, 1000))
y_1 = y1.(x)
end
Note that I use the broadcast operator to apply the function y1 for every value that takes the iterator. But if I don't use it I get the same result.
But as an answer, I just get 96.0.
How can I refill the y_1 vector with the for loop, so I get the evaluated values?
The evaluated vector should be of size 1000
Thanks in advance!
Edit:
I found a way to get to my desired result without the for loop:
y_1 = y1.(collect(range(-10, 10, 1000)))
But I still want to know how can I do it in a loop.
The broadcast operator broadcasts the function over the entire iterator by itself i.e. y1.(arr) will
call y1 on each of the elements of the array arr
collect the results of all those calls, and
allocate memory to store those results as an array too
So the following are all equivalent in terms of functionality:
julia> arr = range(-4, 5, length = 10) #define a simple range
-4.0:1.0:5.0
julia> y1.(arr)
10-element Vector{Float64}:
12.0
5.0
0.0
-3.0
-4.0
-3.0
0.0
5.0
12.0
21.0
julia> [y1(x) for x in arr]
10-element Vector{Float64}:
(same values as above)
julia> map(y1, arr)
10-element Vector{Float64}:
(same values as above)
julia> y_1 = zeros(10);
julia> for (i, x) in pairs(arr)
y_1[i] = y1(x)
end
julia> y_1
10-element Vector{Float64}:
(same values as above)
In practice, there maybe other considerations, including performance, that decides between these and other choices.
As an aside, note that very often you don't want to collect a range in Julia i.e. don't think of collect as somehow equivalent to c() in R. For many operations, the ranges can be directly used, including for iteration in for loops. collect should only be necessary in the rare cases where an actual Vector is necessary, for eg. a value in the middle of the array needs to be changed for some reason. As a general rule, use the range results as they are, until and unless you get an error that requires you to change it.
I'm new to Julia. I want to write code which, for each of several vectors, outputs a new vector, the name of which depends on the name of the input vector.
For example, the following code works
a = ones(10)
b = ones(10)
for var in [a, b]
global log_var = log.(var)
end
except I want the resulting vectors to be named log_a and log_b (rather than have the loop overwrite log_var). I had thought this would be simple, but having read a few tutorials about locals in Julia, I'm still lost! Is there a simple way to go about this?
In case this question is unclear, I'll describe how I would do this in Stata, with which I'm more familiar:
clear
set obs 10
gen a = 1
gen b = 1
foreach var in a b {
gen log_`var' = log(`var')
}
Thank you!
if you are looking for something similar to what you do in stata, you can use DataFrames.jl,
julia> using DataFrames
julia> df = DataFrame(a=ones(10), b=ones(10))
julia> for col in ["a", "b"]
df[:, "log_"*col] = log.(df[:, col])
end
julia> df
You really probably don't want to do that. But, if you had to, you could do it pretty easily with metaprogramming. In this case for example:
macro logify(variable)
quote
$(esc(Symbol("log_$variable"))) = log.($variable)
end
end
then
julia> b = rand(5)
5-element Vector{Float64}:
0.29129581739244315
0.21098023915449915
0.8736387630142392
0.34378216482772417
0.621583372934101
julia> #logify b;
julia> log_b
5-element Vector{Float64}:
-1.2334159735391819
-1.555990803188027
-0.13508830339365252
-1.0677470639708686
-0.4754852291054692
In general, any time you need to depend on the name of a variable rather than its contents, you're going to need metaprogramming.
However, to emphasize, again, this feels like a bad idea.
Rather than defining new top-level variables, you might consider instead using some sort of data structure like a Dict or a NamedTuple or a DataFrame, or even just a multidimensional Array. For example, with NamedTuples:
julia> data = (a = rand(5), b = rand(5));
julia> typeof(data)
NamedTuple{(:a, :b), Tuple{Vector{Float64}, Vector{Float64}}}
julia> data.a
5-element Vector{Float64}:
0.7146929585896256
0.5248314042991269
0.040560190890127856
0.9714549101298824
0.9477790450084252
julia> data.b
5-element Vector{Float64}:
0.6856764745285641
0.3066093923258396
0.5655243277481422
0.13478854894985115
0.8495720250298817
julia> logdata = NamedTuple{keys(data)}(log.(data[x]) for x in keys(data));
julia> logdata.a
5-element Vector{Float64}:
-0.335902257064951
-0.6446782026336225
-3.204968213346185
-0.02896042387181646
-0.05363387877891503
julia> logdata.b
5-element Vector{Float64}:
-0.3773493739743169
-1.182180679204628
-0.5700019644606769
-2.0040480325554944
-0.1630225562612911
Not really recommended for such usage, but a quick and dirty variant is
for var in [:a, :b]
#eval global $(Symbol("log_", var)) = log.($var)
end
I am trying to minimize memory allocations in Julia by pre-allocating arrays as shown in the documentation. My sample code looks as follows:
using BenchmarkTools
dim1 = 100
dim2 = 1000
A = rand(dim1,dim2)
B = rand(dim1,dim2)
C = rand(dim1,dim2)
D = rand(dim1,dim2)
M = Array{Float64}(undef,dim1,dim2)
function calc!(a, b, c, d, E)
#. E = a * b * ((d-c)/d)
nothing
end
function run_calc(A,B,C,D,M)
for i in 1:dim2
#views calc!(A[:,i], B[:,i], C[:,i], D[:,i], M[:,i])
end
end
My understanding is that this should essentially not allocate since M is pre-allocated outside the either of the two functions. However, when I benchmark this I still see a lot of allocations:
#btime run_calc(A,B,C,D,M)
1.209 ms (14424 allocations: 397.27 KiB)
In this case I can of course run the much more concise
#btime #. M = A * B * ((D-C)/D)
which performs very few allocations as expected:
122.599 μs (6 allocations: 144 bytes)
However my actual code is more complex and cannot be reduced like this, hence I am wondering where I am going wrong with the first version.
You are not doing anything wrong. Currently creation of views in Julia is allocating (as Stefan noted it has gotten much better than in the past, but still some allocations seem to happen in this case). The allocations you see are a consequence of this.
See:
julia> #allocated view(M, 1:10, 1:10)
64
Your case is one of the situations where it is simplest to just write an appropriate loop (I assume that in your code the loop will be more complex but I hope the intent is clear), e.g.:
julia> function run_calc2(A,B,C,D,M)
#inbounds for i in eachindex(A,B,C,D,M)
M[i] = A[i] * B[i] * ((D[i] - C[i])/D[i])
end
end
run_calc2 (generic function with 1 method)
julia> #btime run_calc2($A,$B,$C,$D,$M)
56.441 μs (0 allocations: 0 bytes)
julia> #btime run_calc($A,$B,$C,$D,$M)
893.789 μs (14424 allocations: 397.27 KiB)
julia> #btime #. $M = $A * $B * (($D-$C)/$D);
381.745 μs (0 allocations: 0 bytes)
EDIT: all timings on Julia Version 1.6.0-DEV.1580
EDIT2: for completeness a code that passes #views down to the inner function. It still allocates (but is better) and is still slower than using just the loop:
julia> function calc2!(a, b, c, d, E, i)
#inbounds #. #views E[:,i] = a[:,i] * b[:,i] * ((d[:,i]-c[:,i])/d[:,i])
nothing
end
calc2! (generic function with 1 method)
julia> function run_calc3(A,B,C,D,M)
for i in 1:dim2
calc2!(A,B,C,D,M,i)
end
end
run_calc3 (generic function with 1 method)
julia> #btime run_calc3($A,$B,$C,$D,$M);
305.709 μs (1979 allocations: 46.56 KiB)
Prior to Julia 1.5, creating array views will often allocate a bit of memory for the view object. After Julia 1.5, creating views will usually not cause any allocation. Your post doesn't include what version of Julia you're using, so I'll assume that it's older than 1.5. In your code, you are creating a view for each index of a potentially large array dimension, which will definitely add up. You could refactor this code to pass the dimension through to the inner calculation. Otherwise you can upgrade Julia and see if the allocation goes away.
We can use
c = #parallel (vcat) for i=1:10
(i,i+1)
end
But when I'm trying to use push!() instead of vcat() I'm getting some error. How can I use push!() in this parallel loop?
c = #parallel (push!) for i=1:10
(c, (i,i+1))
end
The #parallel is somewhat similar to foldl(op, itr) in that it uses the first value of itr as an initial first parameter for op. push! lacks the required symmetry between the operands. Perhaps what you are looking for is:
julia> c = #parallel (append!) for i=1:10
[(i,i+1)]
end
Elaborating a bit on Dan's point; to see how the parallel macro works, see the difference between the following two invocations:
julia> #parallel print for i in 1:10
(i,i+1)
end
(1, 2)(2, 3)nothing(3, 4)nothing(4, 5)nothing(5, 6)nothing(6, 7)nothing(7, 8)nothing(8, 9)nothing(9, 10)nothing(10, 11)
julia> #parallel string for i in 1:10
(i,i+1)
end
"(1, 2)(2, 3)(3, 4)(4, 5)(5, 6)(6, 7)(7, 8)(8, 9)(9, 10)(10, 11)"
From the top one it should be clear what's going on. Each iteration produces an output. When it comes to using the specified function on those outputs, this is done in output pairs. Two first pair of outputs is fed to print, and the result of the print operation then becomes the first item in the next pair to be processed. Since the output is nothing, print prints nothing then (3,4). The result of this print statement is nothing, therefore the next pair to be printed is nothing and (4,5), and so on until all elements are consumed. I.e. in terms of pseudocode, this is what's happening:
Step 1: state = print((1,2), (2,3)); # state becomes nothing
Step 2: state = print(state, (3,4)); # state becomes nothing again
Step 3: state = print(state, (4,5)); # and so forth
The reason string works as expected is because what's happening is the following steps:
Step 1: state = string((1,2),(2,3));
Step 2: state = string(state, (3,4));
Step 3: state = string(state, (4,5);
etc
In general, the function you pass to the parallel macro should be something that takes two inputs of the same type, and outputs an object of the same type.
Therefore you cannot use push!, because this always uses two inputs of different types (one array, and one plain element), and outputs an array. Therefore you need to use append! instead, which fits the specification.
Also note that the order of outputs is not guaranteed. (here it happens to be in order because I only used 1 worker). If you want something where the order of operations matters, then you shouldn't use this construct. E.g., obviously in something like addition it doesn't matter, because addition is a completely associative operation; but if I used string, if outputs are processed in different order, then obviously you could end up with a different string than what you'd expect.
EDIT - addressing benchmark between vcat / append! / indexed assignment
I think the most efficient way to do this is in fact via normal indexing onto a preallocated array. But between append! and vcat, append will most certainly be faster as vcat always makes a copy (as I understand it).
Benchmarks:
function parallelWithVcat!( A::Array{Tuple{Int64, Int64}, 1} )
A = #parallel vcat for i = 1:10000
(i, i+1)
end
end;
function parallelWithFunction!( A::Array{Tuple{Int64, Int64}, 1} )
A = #parallel append! for i in 1:10000
[(i, i+1)];
end
end;
function parallelWithPreallocation!( A::Array{Tuple{Int64, Int64}, 1} )
#parallel for i in 1:10000
A[i] = (i, i+1);
end
end;
A = Array{Tuple{Int64, Int64}, 1}(10000);
### first runs omitted, all benchmarks here are from 2nd runs ###
# first on a single worker:
#time for n in 1:100; parallelWithVcat!(A); end
#> 8.050429 seconds (24.65 M allocations: 75.341 GiB, 15.42% gc time)
#time for n in 1:100; parallelWithFunction!(A); end
#> 0.072325 seconds (1.01 M allocations: 141.846 MiB, 52.69% gc time)
#time for n in 1:100; parallelWithPreallocation!(A); end
#> 0.000387 seconds (4.21 k allocations: 234.750 KiB)
# now with true parallelism:
addprocs(10);
#time for n in 1:100; parallelWithVcat!(A); end
#> 1.177645 seconds (160.02 k allocations: 109.618 MiB, 0.75% gc time)
#time for n in 1:100; parallelWithFunction!(A); end
#> 0.060813 seconds (111.87 k allocations: 70.585 MiB, 3.91% gc time)
#time for n in 1:100; parallelWithPreallocation!(A); end
#> 0.058134 seconds (116.16 k allocations: 4.174 MiB)
If someone can suggest an even more efficient way, please do so!
Note in particular that the indexed assignment is much faster than the rest, such that it appears (for this example at least) that most of its computation in the parallel case appears to be lost on the parallelisation itself.
Disclaimer: I make no claim that the above are correct summonings of the #parallel spell. I have not delved into the inner workings of the macro in detail to be able to claim otherwise. In particular, I am not aware which parts the macro causes to be processed remotely vs local (e.g. the assignment part). Caution is advised, ymmv, etc.
I have a simple function which appears at several places in my Julia code and is run millions of times inside a loop. The function essentially does rand([1,-1,im,-im]), that's, it picks one of four possible given values. I noticed that this function takes a substantial amount of time in my huge loop, so, I tried to write it in a slightly faster way like this:
function qpsk()
temp1 = ifelse(rand(Bool), 1+0im, -1+0im)
temp2 = ifelse(rand(Bool), 1+0im, 0+1im)
temp1*temp2
end
Then, it is typically called like this:
sig = complex(zeros(N))
for i = 1:N
sig[i] = qpsk()
end
Now, is there any way to further optimize this function, or use another faster method? Appreciate your help.
Comments on current answers:
The answer of #DanGetz (22 lines??) doesn't solve the problem, because at the moment, Julia is not as good at vectors as with explicit loops. Also,
my simple, 1 line qpsk2(s) below, is about 2X faster than those "cryptic" 22 lines of code in the original answer by Dan (a vector is created, though, which adds more time).
But the question remains, why they
didn't implement something like qpsk1 below? and why my original qpsk with branching is more than 3X faster than the straightforward qpsk4(s) below?
I added more versions below to guide the discussion if more experienced people like to jump in.
qpsk1(s) = s[1+(rand(Int8)&3)] # Blazingly fast
qpsk2(s) = s[1+rand(Bool)+2rand(Bool)] # Very fast
qpsk3(s) = s[rand(1:4,1)] # Compiler issue here?
qpsk4(s) = s[rand(1:4)] # Why slow?
qpsk5(s) = rand([s]) # Ridiculously slow!!
function test_orig(n) # Test qpsk(), very fast(branching!), why?
for i = 1:n
qpsk()
end
end
using StaticArrays
function test(func, n) # Test all qpsk1 --> qpsk5
s = SVector(1,-1,im,-im)
for i=1:n
func(s)
end
end
#time test(qpsk1,10^8) 0.554994 seconds (5 allocations: 176 bytes)
#time test(qpsk2,10^8) 0.755286 seconds (5 allocations: 176 bytes)
#time test(qpsk3,10^8) 13.431529 seconds (400 M allocations: 26.822 GiB, 20.68% gc time)
#time test(qpsk4,10^8) 2.520085 seconds (5 allocations: 176 bytes)
#time test(qpsk5,10^8) 10.881852 seconds (200 M allocations: 20.862 GiB, 19.76% gc time)
#time test_orig(10^8) 0.771778 seconds (5 allocations: 176 bytes)
#time nqpsk2(10^8); 1.402830 seconds (9 allocations: 1.490 GiB, 6.39% gc time)
Summary of answer
[(-1)^b1*im^b2 for (b1,b2) in zip(rand!(BitVector(N)),rand!(BitVector(N)))]
generates a length N vector faster.
Answer
Calculating the random bits is the bulk of the work, so exploring Chris' idea from comments of using RandomNumbers.jl is worth a shot. Additionally, we can use #rickhg12hs's idea to extract more bits from each random number generated. Regardless, generating a block of values together is essential for better optimization.
For example, the following code (nqpsk1 uses qpsk from the question as the baseline. nqpsk2 is a suggested improvement):
function qpsk()
temp1 = ifelse(rand(Bool), 1+0im, -1+0im)
temp2 = ifelse(rand(Bool), 1+0im, 0+1im)
temp1*temp2
end
nqpsk1(n::Int) = [qpsk() for i=1:n]
nqpsk2(n::Int) = begin
res = zeros(Int,2*n)
blocks = n >>> 4 # use blocks of 16 values
btail = n & 0x000000000000000f # in case n is not a multiple of 16
pos = 1
#inbounds for i=1:blocks
bits = rand(UInt32) # get random bits for a whole block
for j=1:16
b1 = Bool(bits & 1)
bits >>>= 1
b2 = Bool(bits & 1)
bits >>>= 1
res[pos+b1] = (-1)^b2
pos += 2
end
end
#inbounds for i=1:btail
res[pos+rand(Bool)] = (-1)^rand(Bool)
pos += 2
end
return reinterpret(Complex{Int64},res)
end
achieved a >4x improvement on my setup (Julia 0.7):
julia> using BenchmarkTools
julia> #btime nqpsk1(320);
8.791 μs (323 allocations: 15.19 KiB)
julia> #btime nqpsk2(320);
1.056 μs (3 allocations: 5.20 KiB)
Update
With only a modest compromise in speed (and some allocation), but much better looking code:
function nqpsk3(n::Int)
res = zeros(Int,2n)
rv1 = rand!(BitVector(n))
rv2 = rand!(BitVector(n))
#inbounds for (b1,b2,i) in zip(rv1,rv2,1:2:2n)
res[i+b1] = (-1)^b2
end
return reinterpret(Complex{Int},res)
end
The benchmark:
julia> #btime nqpsk3(320);
1.780 μs (11 allocations: 5.83 KiB)
Addendum
And the one-(wrapped)-line version, does OK (2.48 μs) too:
nqpsk4(n) = [(1+0im,-1+0im,0+im,0-im)[2b1+b2+1] for
(b1,b2) in zip(rand!(BitVector(n)),rand!(BitVector(n)))]
Finally, the real one-line version (1.96 μs):
nqpsk5(n) = [(-1)^b1*im^b2 for (b1,b2) in zip(rand!(BitVector(n)),rand!(BitVector(n)))]
Latest state of investigation
My current best solution is the following:
function g(pX::Array{Complex{Float64},1})
tab = [1.0,im,-1.0,-im]
bits = UInt128(0)
#inbounds for i = 1 : length(pX)
bits = (i % 64) == 1 ? rand(UInt128) : bits >>> 2
pX[i] = tab[(bits & 3)+1]
end
end
sig = complex(zeros(1280));
using BenchmarkTools
#btime g(sig)
3.838 μs (13 allocations: 464 bytes)
This is better than my optimized version of Dan Getz which runs with the same N, and i feel much more readable
4.236 μs (4 allocations: 20.16 KiB)
However, the performance is extremely fragil. Just have a look at subtle differences to this 36 times slower version:
function g(pX::Array{Complex{Float64},1})
tab = [1,im,-1,-im]
bits = 0
for i = 1 : length(pX)
bits = (i % 64) == 1 ? rand(UInt128) : bits >>> 2
pX[i] = tab[(bits & 3)+1]
end
end
138.320 μs (10209 allocations: 319.14 KiB)
Did you find the differences?
no conversion from Int64 to Float64
Type stability
disable range checking
to follow the convention g() should be renamed to g!()
In the following you find the evolution to the currently best timed solution
My first approach to answer was addressing general weaknesses
a) calling functions is expensive due to calling overhead.
b) complex calculations are more time consuming than lookup.
This end up with the proposal
cases = [1+0im,0+1im,-1+0im,0-1im]
g() = cases[rand(1:4)]
// to use just call g()
g()
What happend?
Why does a) do not succeed?
using BenchmarkTools
test(n) = [q() for i = 1:n]
g() = rand()
#btime test(800);
This results to
rand() => 5.784
rand(Float32) => 5.604
rand(Float64) => 5.821
rand(Bool) => 5.167
rand(Int8) => 5.126
rand(Int16) => 5.171
rand(Int32) => 5.631
rand(Int64) => 7.980
rand(Int128) => 10.549
rand(1:4) => 28.603
(rand(Int8) % 4) + 1 => 6.053
(rand(Int8) & 3) + 1 => 5.843
rand(0:255) => 28.568
rand(UInt8) => 5.104
rand([1,2,3,4]) => 58.437
l = [1,2,3,4]; g() = rand(l) => 47.399
rand(l, 1) => 70.052
m = (1,2,3,4); rand(m) => 124.311
0 => 0.872
0.0 => 0.887
Int8(0) => 0.113
return => 0.33
(running Julia 0.6 on Ubuntu)
How to judge the results
requesting float32 and float64 needs same time. This may be an indicator that float64 does NOT the full mantisse (of 56 bits) for random value
rand for Bool, Int8, Int16 needs nearly the same time. Probably same algorithmus just using fewer bits.
rand for Int32 slightly more time consuming. Int64 and Int128 takes under proportional more time.
rand(1:4) takes surprisingly much more time. It should be in the range of rand(Int8) since it is equivalent to (rand(Int8) % 4) + 1 and (rand(Int8) & 3) + 1.
Even if I hurt somebodies religious feelings, this is just poor code.
Same to rand(Uint) and rand(0:255)
The performance of rand with arrays and tupel is far from acceptable!
Why does b) do not succeed?
Julia seems not able to lookup efficiently from tupels or arrays.
But even when lookup were fast, the rand methods dominates.
Other approaches
Dan Getz approach uses all bits from a rand call. So in end it needs in his first algorithm 1/16 calls per value.
However, this approach could be improved by using UInt128 since now 1/64 calls per value is required.
On my machine Dan Getz original code takes 17.314 for 1280 values while the modified code takes 4.595. The improvement is proportional to the reduced number of calls to rand!
test2(n::Int) = begin
res = zeros(Int,2*n)
blocks = n >>> 7 # use blocks of 16 values
btail = n & 0x000000000000007f # in case n is not a multiple of 16
pos = 1
#inbounds for i=1:blocks
bits = rand(UInt128) # get random bits for a whole block
for j=1:16
b1 = Bool(bits & 1)
bits >>>= 1
b2 = Bool(bits & 1)
bits >>>= 1
res[pos+b1] = (-1)^b2
pos += 2
end
end
#inbounds for i=1:btail
res[pos+rand(Bool)] = (-1)^rand(Bool)
pos += 2
end
return reinterpret(Complex{Int64},res)
end
#btime test2(1280);
However, the use of reinterpret means to know the bit layout of the different structures. That not a real good idea.
high level view
In the end all what the questioneer has coded was a complicated complex build array of random numbers from 1 to 4 (or 0 to 3). I would try to optimize the next step in questioneers following task. However, no info were supplied.
In that following case julia performs much better, which sounds somewhat strange. More returns, less time??
#btime rand(0:3, 1280)
=> 24.377
PS:
Just for comparing the numbers with Dan Getz last approach, the following code takes 27.004
N=1280
#btime [(-1)^b1*im^b2 for (b1,b2) in zip(rand!(BitVector(N)),rand!(BitVector(N)))]