Julia - How to efficiently turn the diagonal of a matrix to zero?

In Julia, what would be an efficient way of turning the diagonal of a matrix to zero?

Assuming m is your matrix of size N x N this could be done as:
setindex!.(Ref(m), 0.0, 1:N, 1:N)
Another option:
using LinearAlgebra
m[diagind(m)] .= 0.0
And some performance tests:
julia> using LinearAlgebra, BenchmarkTools
julia> m=rand(20,20);
julia> @btime setindex!.(Ref($m), 0.0, 1:20, 1:20);
55.533 ns (1 allocation: 240 bytes)
julia> @btime $m[diagind($m)] .= 0.0;
75.386 ns (2 allocations: 80 bytes)

Performance-wise, a simple loop is faster (and more explicit, though that is a matter of taste):
julia> @btime foreach(i -> $m[i, i] = 0, 1:20)
11.881 ns (0 allocations: 0 bytes)
julia> @btime setindex!.(Ref($m), 0.0, 1:20, 1:20);
50.923 ns (1 allocation: 240 bytes)
For a larger matrix it is still faster than the diagind version, but not by much:
julia> m = rand(1000, 1000);
julia> @btime foreach(i -> $m[i, i] = 0.0, 1:1000)
1.456 μs (0 allocations: 0 bytes)
julia> @btime foreach(i -> @inbounds($m[i, i] = 0.0), 1:1000)
1.338 μs (0 allocations: 0 bytes)
julia> @btime $m[diagind($m)] .= 0.0;
1.495 μs (2 allocations: 80 bytes)
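The loop variants above can be wrapped in a small reusable function. The sketch below is one way to do that; the name zero_diag_loop! is hypothetical (it does not appear in the answers), and it assumes a matrix with ordinary 1-based indexing.

```julia
# Hypothetical helper: zero the main diagonal in place with a plain loop.
# Assumes 1-based indexing; for non-square matrices it stops at the shorter side.
function zero_diag_loop!(m::AbstractMatrix)
    for i in 1:min(size(m)...)
        m[i, i] = zero(eltype(m))
    end
    return m
end

m = ones(3, 4)
zero_diag_loop!(m)   # only m[1,1], m[2,2] and m[3,3] are changed
```

Returning m mirrors the convention of other mutating functions whose names end in `!`, so the call can be chained.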

Benchmarking Przemyslaw Szufel's solutions on a 1_000 x 1_000 matrix shows that diagind performs best:
julia> @btime setindex!.(Ref($m), 0.0, 1:1_000, 1:1_000);
2.222 μs (1 allocation: 7.94 KiB)
julia> @btime $m[diagind($m)] .= 0.0;
1.280 μs (2 allocations: 80 bytes)

For a more general, performant way to use setindex! on an Array with custom indices, see:
Using an array for in array indexing
Linear indexing performs best here, which is why diagind runs faster than the Cartesian-index version.
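To make the linear-indexing point concrete: a Julia Matrix is stored column by column, so the diagonal of an N x N matrix sits at every (N+1)-th linear index. The following is a minimal sketch of that relationship, assuming a square, dense Matrix; the variable names are illustrative only.

```julia
using LinearAlgebra

m = rand(4, 4)
N = size(m, 1)

# In column-major storage, element (i, i) sits at linear index 1 + (i - 1) * (N + 1).
by_hand   = [1 + (i - 1) * (N + 1) for i in 1:N]
cartesian = [LinearIndices(m)[i, i] for i in 1:N]
same = by_hand == cartesian == collect(diagind(m))

# Writing through the strided range has the same effect as m[diagind(m)] .= 0.0 here.
m[1:N+1:N^2] .= 0.0
```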


Why is operating on Float64 faster than Float16?

I wonder why operating on Float64 values is faster than operating on Float16:
julia> rnd64 = rand(Float64, 1000);
julia> rnd16 = rand(Float16, 1000);
julia> @benchmark rnd64.^2
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.800 μs … 662.140 μs ┊ GC (min … max): 0.00% … 99.37%
Time (median): 2.180 μs ┊ GC (median): 0.00%
Time (mean ± σ): 3.457 μs ± 13.176 μs ┊ GC (mean ± σ): 12.34% ± 3.89%
▁██▄▂▂▆▆▄▂▁ ▂▆▄▁ ▂▂▂▁ ▂
████████████████▇▇▆▆▇▆▅▇██▆▆▅▅▆▄▄▁▁▃▃▁▁▄▁▃▄▁▃▁▄▃▁▁▆▇██████▇ █
1.8 μs Histogram: log(frequency) by time 10.6 μs <
Memory estimate: 8.02 KiB, allocs estimate: 5.
julia> @benchmark rnd16.^2
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
Range (min … max): 5.117 μs … 587.133 μs ┊ GC (min … max): 0.00% … 98.61%
Time (median): 5.383 μs ┊ GC (median): 0.00%
Time (mean ± σ): 5.716 μs ± 9.987 μs ┊ GC (mean ± σ): 3.01% ± 1.71%
▃▅█▇▅▄▄▆▇▅▄▁ ▁ ▂
▄██████████████▇▆▇▆▆▇▆▇▅█▇████▇█▇▇▆▅▆▄▇▇▆█▇██▇█▇▇▇▆▇▇▆▆▆▆▄▄ █
5.12 μs Histogram: log(frequency) by time 7.48 μs <
Memory estimate: 2.14 KiB, allocs estimate: 5.
You might ask why I expect the opposite: Float16 values have less floating-point precision:
julia> rnd16[1]
julia> rnd64[1]
Shouldn't calculations at lower precision be faster? If not, why would anyone use Float16 at all? They might as well use Float128!
As you can see, the effect you are expecting is present for Float32:
julia> rnd64 = rand(Float64, 1000);
julia> rnd32 = rand(Float32, 1000);
julia> rnd16 = rand(Float16, 1000);
julia> @btime $rnd64.^2;
616.495 ns (1 allocation: 7.94 KiB)
julia> @btime $rnd32.^2;
330.769 ns (1 allocation: 4.06 KiB) # faster!!
julia> @btime $rnd16.^2;
2.067 μs (1 allocation: 2.06 KiB) # slower!!
Float64 and Float32 have hardware support on most platforms, but Float16 does not, and must therefore be implemented in software.
Note also that you should use variable interpolation ($) when micro-benchmarking. The difference is significant here, not least in terms of allocations:
julia> @btime $rnd32.^2;
336.187 ns (1 allocation: 4.06 KiB)
julia> @btime rnd32.^2;
930.000 ns (5 allocations: 4.14 KiB)
The short answer is that you probably shouldn't use Float16 unless you are using a GPU or an Apple CPU because (as of 2022) other processors don't have hardware support for Float16.
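One reason to still reach for Float16 is storage: it takes a quarter of the memory of Float64, at the cost of far fewer significant bits. Below is a minimal sketch of that trade-off, together with the pattern of storing data in Float16 while doing the arithmetic in Float32. Whether the widening step is actually faster on a given machine is an assumption to benchmark, not a guarantee.

```julia
# Storage size in bytes and machine epsilon for each type.
for T in (Float16, Float32, Float64)
    println(T, ": ", sizeof(T), " bytes, eps = ", eps(T))
end

# Keep the data in Float16, widen to Float32 for the arithmetic, then narrow the result.
rnd16 = rand(Float16, 1000)
squared = Float16.(Float32.(rnd16) .^ 2)
```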

Find index of maximum element satisfying condition (Julia)

In Julia I can use argmax(X) to find the index of the maximum element. If I want to find all elements satisfying a condition C, I can use findall(C, X). But how can I combine the two? What is the most efficient/idiomatic/concise way to find the index of the maximum element satisfying some condition in Julia?
If you'd like to avoid allocations, filtering the array lazily would work:
idx_filtered = (i for (i, el) in pairs(X) if C(el))
argmax(i -> X[i], idx_filtered)
Unfortunately, this is about twice as slow as a hand-written version. (edit: in my benchmarks, it's 2x slower on Intel Xeon Platinum but nearly equal on Apple M1)
function byhand(C, X)
    start = findfirst(C, X)
    isnothing(start) && return nothing
    imax, max = start, X[start]
    for i = start:lastindex(X)
        if C(X[i]) && X[i] > max
            imax, max = i, X[i]
        end
    end
    imax, max
end
You can store the indices returned by findall and subset them with the result of argmax applied to the elements that fulfil the condition.
X = [5, 4, -3, -5]
C = <(0)
i = findall(C, X);
i[argmax(X[i])]
Or combine both:
argmax(i -> X[i], findall(C, X))
This assumes that findall does not return an empty result. Otherwise that case needs to be checked first, e.g. with isempty.
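The empty case mentioned above can be handled with a small guard. The sketch below is one possible shape; the name argmax_where is hypothetical, returning nothing for "no match" is an assumed convention, and argmax(f, itr) requires Julia 1.7 or later.

```julia
# Hypothetical wrapper: index of the largest element of X satisfying C,
# or `nothing` when no element satisfies it.
function argmax_where(C, X)
    idx = findall(C, X)
    isempty(idx) && return nothing
    return argmax(i -> X[i], idx)   # returns the index i that maximizes X[i]
end

hit  = argmax_where(<(0), [5, 4, -3, -5])
miss = argmax_where(<(0), [1, 2, 3])
```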
function August(C, X)
    idx_filtered = (i for (i, el) in pairs(X) if C(el))
    argmax(i -> X[i], idx_filtered)
end
function byhand(C, X)
    start = findfirst(C, X)
    isnothing(start) && return nothing
    imax, max = start, X[start]
    for i = start:lastindex(X)
        if C(X[i]) && X[i] > max
            imax, max = i, X[i]
        end
    end
    imax, max
end
function GKi1(C, X)
    i = findall(C, X);
    i[argmax(X[i])]
end
GKi2(C, X) = argmax(i -> X[i], findall(C, X))
using Random
n = 100000
X = randn(n)
C = <(0)
using BenchmarkTools
suite = BenchmarkGroup()
suite["August"] = @benchmarkable August(C, $X)
suite["byhand"] = @benchmarkable byhand(C, $X)
suite["GKi1"] = @benchmarkable GKi1(C, $X)
suite["GKi2"] = @benchmarkable GKi2(C, $X)
results = run(suite)
#4-element BenchmarkTools.BenchmarkGroup:
# tags: []
# "August" => Trial(641.061 μs)
# "byhand" => Trial(261.135 μs)
# "GKi2" => Trial(259.260 μs)
# "GKi1" => Trial(339.570 μs)
#BenchmarkTools.Trial: 7622 samples with 1 evaluation.
# Range (min … max): 641.061 μs … 861.379 μs ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 643.640 μs ┊ GC (median): 0.00%
# Time (mean ± σ): 653.027 μs ± 18.123 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
# ▄█▅▄▃ ▂▂▃▁ ▁▃▃▂▂ ▁▃ ▁▁ ▁
# ██████▇████████████▇▆▆▇████▇▆██▇▇▇▆▆▆▅▇▆▅▅▅▅▆██▅▆▆▆▇▆▇▇▆▇▆▆▆▅ █
# 641 μs Histogram: log(frequency) by time 718 μs <
# Memory estimate: 16 bytes, allocs estimate: 1.
#BenchmarkTools.Trial: 10000 samples with 1 evaluation.
# Range (min … max): 261.135 μs … 621.141 μs ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 261.356 μs ┊ GC (median): 0.00%
# Time (mean ± σ): 264.382 μs ± 11.638 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
# █ ▁▁▁▁ ▂ ▁▁ ▂ ▁ ▁ ▁
# █▅▂▂▅████▅▄▃▄▆█▇▇▆▄▅███▇▄▄▅▆▆█▄▇█▅▄▅▅▆▇▇▅▄▅▄▄▄▃▄▃▃▃▄▅▆▅▄▇█▆▅▄ █
# 261 μs Histogram: log(frequency) by time 292 μs <
# Memory estimate: 32 bytes, allocs estimate: 1.
#BenchmarkTools.Trial: 10000 samples with 1 evaluation.
# Range (min … max): 339.570 μs … 1.447 ms ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 342.579 μs ┊ GC (median): 0.00%
# Time (mean ± σ): 355.167 μs ± 52.935 μs ┊ GC (mean ± σ): 1.90% ± 6.85%
# █▆▄▅▃▂▁▁ ▁ ▁
# ████████▇▆▆▅▅▅▆▄▄▄▄▁▃▁▁▃▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ █
# 340 μs Histogram: log(frequency) by time 722 μs <
# Memory estimate: 800.39 KiB, allocs estimate: 11.
#BenchmarkTools.Trial: 10000 samples with 1 evaluation.
# Range (min … max): 259.260 μs … 752.773 μs ┊ GC (min … max): 0.00% … 54.40%
# Time (median): 260.692 μs ┊ GC (median): 0.00%
# Time (mean ± σ): 270.300 μs ± 40.094 μs ┊ GC (mean ± σ): 1.31% ± 5.60%
# █▁▁▅▄▂▂▄▃▂▁▁▁ ▁ ▁
# █████████████████▇██▆▆▇▆▅▄▆▆▆▄▅▄▆▅▇▇▆▆▅▅▄▅▃▃▅▃▄▁▁▁▃▁▃▃▃▄▃▃▁▃▃ █
# 259 μs Histogram: log(frequency) by time 390 μs <
# Memory estimate: 408.53 KiB, allocs estimate: 9.
#Julia Version 1.8.0
#Commit 5544a0fab7 (2022-08-17 13:38 UTC)
#Platform Info:
# OS: Linux (x86_64-linux-gnu)
# CPU: 8 × Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
# LIBM: libopenlibm
# LLVM: libLLVM-13.0.1 (ORCJIT, sandybridge)
# Threads: 1 on 8 virtual cores
In this example, argmax(i -> X[i], findall(C, X)) comes close to the performance of @August's hand-written function but uses more memory. It can perform better when the data is sorted:
results = run(suite)
#4-element BenchmarkTools.BenchmarkGroup:
# tags: []
# "August" => Trial(297.519 μs)
# "byhand" => Trial(270.486 μs)
# "GKi2" => Trial(242.320 μs)
# "GKi1" => Trial(319.732 μs)
From what I understand from your question you can use findmax() (requires Julia >= v1.7) to find the maximum index on the result of findall():
julia> v = [10, 20, 30, 40, 50]
5-element Vector{Int64}:
julia> findmax(findall(x -> x > 30, v))[1]
Performance of the above function:
julia> v = collect(10:1:10_000_000);
julia> @btime findmax(findall(x -> x > 30, v))[1]
33.471 ms (10 allocations: 77.49 MiB)
Update: the solutions suggested by @dan-getz, using last() and findlast(), perform better than findmax(), and findlast() is the clear winner:
julia> @btime last(findall(x -> x > 30, v))
19.961 ms (9 allocations: 77.49 MiB)
julia> @btime findlast(x -> x > 30, v)
81.422 ns (2 allocations: 32 bytes)
Update 2: Looks like the OP wanted to find the max element and not only the index. In that case, the solution would be:
julia> v[findmax(findall(x -> x > 30, v))[1]]
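Note that findlast and last(findall(...)) return the largest index satisfying the condition. That coincides with the index of the largest matching element only when the data is sorted in ascending order, as v is here. A small sketch with unsorted data (reusing the X and C values from the earlier answer) shows the difference; the variable names are illustrative.

```julia
X = [5, 4, -3, -5]
C = <(0)

# Last position at which the condition holds.
largest_index = findlast(C, X)

# Position of the largest element among those satisfying the condition (Julia >= 1.7).
index_of_max = argmax(i -> X[i], findall(C, X))
```

Here largest_index points at -5 (position 4), while index_of_max points at -3 (position 3).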

Julia parallelism: @distributed (+) slower than serial?

After seeing a couple tutorials on the internet on Julia parallelism I decided to implement a small parallel snippet for computing the harmonic series.
The serial code is:
harmonic = function (n::Int64)
    x = 0
    for i in n:-1:1 # summing backwards to avoid rounding errors
        x +=1/i
    end
    x
end
And I made 2 parallel versions, one using the @distributed macro and another using the @everywhere macro (julia -p 2 btw):
@everywhere harmonic_ever = function (n::Int64)
    x = 0
    for i in n:-1:1
        x +=1/i
    end
    x
end
harmonic_distr = function (n::Int64)
    x = @distributed (+) for i in n:-1:1
        x = 1/i
    end
    x
end
However, when I run the above code and @time it, I don't get any speedup. In fact, the @distributed version runs significantly slower!
@time harmonic(10^10)
>>> 53.960678 seconds (29.10 k allocations: 1.553 MiB) 23.60306659488827
job = @spawn harmonic_ever(10^10)
@time fetch(job)
>>> 46.729251 seconds (309.01 k allocations: 15.737 MiB) 23.60306659488827
@time harmonic_distr(10^10)
>>> 143.105701 seconds (1.25 M allocations: 63.564 MiB, 0.04% gc time) 23.603066594889185
What completely baffles me is the "0.04% gc time". I'm clearly missing something, and the examples I saw weren't written for version 1.0.1 (one, for example, used @parallel).
Your distributed version should be
function harmonic_distr2(n::Int64)
    x = @distributed (+) for i in n:-1:1
        1/i # no x assignment here
    end
    x
end
The @distributed loop will accumulate the values of 1/i on every worker and then finally on the master process.
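The key point is that the value of the last expression in the loop body is what gets passed to the reducer. A minimal, self-contained sketch of that pattern follows; the name harmonic_sketch is hypothetical, and the snippet also runs without added workers, in which case the reduction simply happens on the single local process.

```julia
using Distributed

# Hypothetical name; same shape as harmonic_distr2 above.
function harmonic_sketch(n::Int)
    # Each iteration evaluates to 1 / i; (+) combines the per-chunk partial sums.
    @distributed (+) for i in n:-1:1
        1 / i
    end
end

h4 = harmonic_sketch(4)   # 1 + 1/2 + 1/3 + 1/4
```

Because the reduction order depends on how the range is split across processes, the last few digits of the floating-point result can differ from the serial sum.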
Note that it is also generally better to use BenchmarkTools's @btime macro instead of @time for benchmarking:
julia> using Distributed; addprocs(4);
julia> @btime harmonic(1_000_000_000); # serial
1.601 s (1 allocation: 16 bytes)
julia> @btime harmonic_distr2(1_000_000_000); # parallel
754.058 ms (399 allocations: 36.63 KiB)
julia> @btime harmonic_distr(1_000_000_000); # your old parallel version
4.289 s (411 allocations: 37.13 KiB)
The parallel version is, of course, slower if run only on one process:
julia> rmprocs(workers())
Task (done) @0x0000000006fb73d0
julia> nprocs()
julia> @btime harmonic_distr2(1_000_000_000); # (not really) parallel
1.879 s (34 allocations: 2.00 KiB)

Extract lower triangle portion of a matrix

I was wondering if there is a command or a package in Julia that permits us to extract directly the lower triangle portion of a matrix, excluding the diagonal. I can call R commands for that (like lowerTriangle of the gdata package), obviously, but I'd like to know if Julia has something similar. For example, imagine I have the matrix
1.0 0.751 0.734
0.751 1.0 0.948
0.734 0.948 1.0
I don't want to create a lower triangular matrix like
0.751 NA NA
0.734 0.948 NA
but extract the lower portion of the matrix as an array: 0.751 0.734 0.948
If you're OK with creating a lower triangular matrix as an intermediate step, you can use logical indexing and tril! with an extra argument to get what you need.
julia> M = [1.0 0.751 0.734
0.751 1.0 0.948
0.734 0.948 1.0];
julia> v = M[tril!(trues(size(M)), -1)]
3-element Array{Float64, 1}:
The trues call returns an array of M's shape filled with boolean true values. tril! then prunes this down to just the part of the matrix that we want. The second argument to tril! tells it which superdiagonal to start from; passing -1 selects the first subdiagonal, which excludes the values on the main diagonal.
We use the result of that for indexing into M, and that returns an array with the required values.
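To see the two steps separately, the mask can be built first and inspected before indexing. This is a small sketch using the matrix from the question; the variable name mask is illustrative.

```julia
using LinearAlgebra

M = [1.0   0.751 0.734
     0.751 1.0   0.948
     0.734 0.948 1.0]

# Boolean mask that is true strictly below the main diagonal.
mask = tril!(trues(size(M)), -1)

# Logical indexing collects the selected entries in column-major order.
v = M[mask]
```

Because the entries are collected column by column, the result for this matrix is 0.751, 0.734, 0.948.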
Using comprehensions:
julia> [M[m, n] for m in 2:size(M, 1) for n in 1:m-1]
3-element Array{Float64,1}:
But it is much slower than the sundar/Matt B. solution:
lower_triangular_1(M) = [M[m, n] for m in 2:size(M, 1) for n in 1:m-1]
lower_triangular_2(M) = [M[m, n] for n in 1:size(M, 2) for m in n+1:size(M, 1)]
lower_triangular_3(M) = M[tril!(trues(size(M)), -1)]
using BenchmarkTools
using LinearAlgebra # avoid warning in 0.7
M=rand(100, 100)
Testing with Julia Version 0.7.0-alpha.0:
julia> @btime lower_triangular_1(M);
73.179 μs (10115 allocations: 444.34 KiB)
julia> @btime lower_triangular_2(M);
71.157 μs (10117 allocations: 444.41 KiB)
julia> @btime lower_triangular_3(M);
16.325 μs (6 allocations: 40.19 KiB)
Not elegant, but faster (with @views):
function lower_triangular_4(M)
    # works only for square matrices
    res = similar(M, ((size(M, 1)-1) * size(M, 2)) ÷ 2)
    start_idx = 1
    for n = 1:size(M, 2)-1
        @views column = M[n+1:end, n]
        last_idx = start_idx -1 + length(column)
        @views res[start_idx:last_idx] = column[:]
        start_idx = last_idx + 1
    end
    return res
end
julia> @btime lower_triangular_4(M);
4.272 μs (101 allocations: 44.95 KiB)
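The same column-by-column copy can also be written element by element into a preallocated vector of length n(n-1)/2. The sketch below is an alternative formulation, not one of the benchmarked answers; the name lower_triangle_loop is hypothetical, it assumes a square matrix with 1-based indexing, and no timing claim is made for it.

```julia
# Hypothetical helper: strictly-lower-triangle entries in column-major order.
function lower_triangle_loop(M::AbstractMatrix)
    n = size(M, 1)
    n == size(M, 2) || throw(DimensionMismatch("matrix must be square"))
    res = Vector{eltype(M)}(undef, n * (n - 1) ÷ 2)
    k = 0
    for j in 1:n-1, i in j+1:n   # column j, rows below the diagonal
        res[k += 1] = M[i, j]
    end
    return res
end

A = [1.0 0.751 0.734; 0.751 1.0 0.948; 0.734 0.948 1.0]
low = lower_triangle_loop(A)
```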

Julia: best way to sample from successively shrinking range?

I would like to sample k numbers where the first number is sampled from 1:n and the second from 1:n-1 and the third from 1:n-2 and so on.
I have the below implementation
function shrinksample(n,k)
    [rand(1:m) for m in n:-1:n-k+1]
end
Are there faster solutions in Julia?
The following takes ideas from the implementation of randperm and since n and k are of the same order, this is appropriate as the same type of randomness is needed (both have output space of size n factorial):
function fastshrinksample(r::AbstractRNG,n,k)
    a = Vector{typeof(n)}(k)  # Julia 0.6 syntax; 0.7+ needs Vector{typeof(n)}(undef, k)
    @assert n <= Int64(2)^52
    k == 0 && return a
    mask = (1<<(64-leading_zeros(n)))-1
    nextmask = mask>>1
    nn = n
    for i=1:k
        a[i] = 1+Base.Random.rand_lt(r, nn, mask)
        nn -= 1
        if nn == nextmask
            mask, nextmask = nextmask, nextmask>>1
        end
    end
    return a
end
fastshrinksample(n,k) = fastshrinksample(Base.Random.GLOBAL_RNG, n, k)
Benchmarking gives a 3x improvement for one tested instance:
julia> using BenchmarkTools
julia> @btime shrinksample(10000,10000);
310.277 μs (2 allocations: 78.20 KiB)
julia> @btime fastshrinksample(10000,10000);
91.815 μs (2 allocations: 78.20 KiB)
The trick is mainly to use the internal Base.Random.rand_lt instead of regular rand(1:n)
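The mask variables in fastshrinksample suggest mask-and-reject sampling: draw random bits, keep only as many low bits as n needs, and retry if the result is not below n. The sketch below illustrates that idea using only the public Random API. The names rand_below and shrinksample_masked are hypothetical, and this is an illustration of the technique, not the Base implementation of rand_lt.

```julia
using Random

# Hypothetical helper: uniform integer in 0:n-1 via mask-and-reject.
function rand_below(rng::AbstractRNG, n::Int)
    n > 0 || throw(ArgumentError("n must be positive"))
    # All-ones bit mask just wide enough to represent n.
    mask = (UInt64(1) << (64 - leading_zeros(n))) - UInt64(1)
    while true
        x = rand(rng, UInt64) & mask
        x < n && return Int(x)   # reject and redraw when x >= n
    end
end

# The i-th sample is drawn from 1:(n - i + 1), as in shrinksample.
shrinksample_masked(rng, n, k) = [1 + rand_below(rng, m) for m in n:-1:n-k+1]

s = shrinksample_masked(Random.default_rng(), 10, 10)
```

Unlike the answer's version, this sketch recomputes the mask on every draw, which keeps it short at the cost of some of the speed the original gains by updating the mask incrementally.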
If this is not very sensitive to randomness (you're not doing cryptography), the following should be amazingly fast and very simple:
blazingshrinksample(n,k) = (Int)[trunc(Int,(n-m)rand()+1) for m in 0:k-1]
Testing this along with your implementation and with Dan's, I got this:
using BenchmarkTools
@btime shrinksample(10000,10000);
259.414 μs (2 allocations: 78.20 KiB)
@btime fastshrinksample(10000,10000);
66.713 μs (2 allocations: 78.20 KiB)
@btime blazingshrinksample(10000,10000);
33.614 μs (2 allocations: 78.20 KiB)
