OpenMP Sparse Jacobi - algorithm

I'm trying to determine if there is a way to parallelize the Jacobi method using sparse matrix formats (specifically the Compressed Sparse Row format).
I have a working sparse-matrix Jacobi implementation. I don't know if I can place
!$OMP PARALLEL DO
directives on the middle do loop, because x is being both written to and read from. I guess the inner do loop could take one, but the same t is being overwritten, so I don't know if it is possible there either. Am I overlooking something here? Thanks.
x(:) = 0
do p = 1, numIterations
  do i = 1, n
    t = b(i)
    ! loop over the nonzeros of row i (CSR: IA = row pointers, JA = column indices)
    do j = IA(i), IA(i+1) - 1
      if (JA(j) == i) then
        d = A(j)                     ! diagonal entry of row i
      else
        t = t - A(j) * x(JA(j))
      end if
    end do
    x(i) = t / d
  end do
end do

It is true that you have a dependency on t in the inner loop, since it is used as an accumulator. However, that also means you can give each thread a private copy of t: the arrays A and x are not written in the loop, so the value accumulated into t only depends on the values of j, which are also thread private.
The following should work:
x(:) = 0
do p = 1, numIterations
  do i = 1, n
    t = 0
    !$OMP PARALLEL DO REDUCTION(+:t)
    do j = IA(i), IA(i+1) - 1
      if (JA(j) == i) then
        d = A(j)                     ! diagonal entry of row i
      else
        t = t + A(j) * x(JA(j))      ! accumulate into the thread-private t
      end if
    end do
    !$OMP END PARALLEL DO
    x(i) = (b(i) - t) / d
  end do
end do
Note that d can only be written by one of the threads, so the variable can safely be shared between the threads; there are no loop-carried dependencies on d.

Related

Fastest way to generate a kmer count vector from a nucleotide sequence (Julia)

Given a nucleotide sequence, I'm writing some Julia code to generate a sparse vector of (masked) kmer counts, and I would like it to run as fast as possible.
Here is my current implementation,
using Distributions
using SparseArrays
function kmer_profile(seq, k, mask)
    basis = [4^i for i in (k - 1):-1:0]
    d = Dict('A'=>0, 'C'=>1, 'G'=>2, 'T'=>3)
    kmer_dict = Dict{Int, Int32}(4^k=>0)
    for n in 1:(length(seq) - length(mask) + 1)
        kmer_hash = 1
        j = 1
        for i in 1:length(mask)
            if mask[i]
                kmer_hash += d[seq[n+i-1]] * basis[j]
                j += 1
            end
        end
        haskey(kmer_dict, kmer_hash) ? kmer_dict[kmer_hash] += 1 : kmer_dict[kmer_hash] = 1
    end
    return sparsevec(kmer_dict)
end
seq = join(sample(['A','C','G','T'], 1000000))
mask_str = "111111011111001111111111111110"
mask = BitArray([parse(Bool, string(m)) for m in split(mask_str, "")])
k = sum(mask)
@time kmer_profile(seq, k, mask)
This code runs in about 0.3 seconds on my M1 MacBook Pro. Is there any way to make it run significantly faster?
The function kmer_profile uses a sliding window of size length(mask) to count the number of times each masked kmer appears in the nucleotide sequence. A mask is a binary sequence, and a masked kmer is a kmer with nucleotides dropped at positions at which the mask is zero. E.g. the kmer ACGT and mask 1001 will produce the masked kmer AT.
To produce the kmer hash, the function treats each kmer as a base-4 number and then converts it to a (base-10) 64-bit integer, for indexing into the kmer vector.
The value of k is equal to the number of ones in the mask string, and is implicitly limited to 31 so that kmer hashes fit into a 64-bit integer type.
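As a concrete illustration of this masking and hashing convention, here is a toy example (my own, not part of the original post): the kmer ACGT with mask 1001 keeps positions 1 and 4, and the resulting masked kmer AT hashes to 4.
d = Dict('A'=>0, 'C'=>1, 'G'=>2, 'T'=>3)
mask = [true, false, false, true]
kmer = "ACGT"
kept = [kmer[i] for i in eachindex(mask) if mask[i]]                        # ['A', 'T']
khash = 1 + sum(d[c] * 4^(length(kept) - p) for (p, c) in enumerate(kept))  # 1 + 0*4^1 + 3*4^0 == 4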
There are several possible optimizations to make this code faster.
First of all, one can convert the Dict to an array, since array-based indexing is faster than dictionary-based indexing, and this is possible here because the key is an ASCII character.
Moreover, the extraction of the sequence codes can be done once instead of length(mask) times, by pre-computing the codes and putting the result in a temporary array.
Additionally, the mask-based conditional and the loop-carried dependency make things slow. Indeed, the condition cannot be (easily) predicted by the processor, causing it to stall for several cycles. The loop-carried dependency makes things even worse, since the processor can hardly execute other instructions during this stall. This problem can be solved by pre-computing the factors based on both mask and basis. The result is a faster, branch-less loop.
Once the above optimizations are done, the biggest bottleneck is sparsevec. In fact, it was also taking nearly half the time of the initial implementation! Optimizing this step is difficult but not impossible. It is slow because of random accesses in the Julia implementation. One can speed it up by sorting the key-value pairs first. This is faster due to a more cache-friendly execution, and it can also help the branch prediction unit of the processor. This is a complex topic; for more details about how this works, please read Why is processing a sorted array faster than processing an unsorted array?.
Here is the final optimized code:
function kmer_profile_opt(seq, k, mask)
    basis = [4^i for i in (k - 1):-1:0]
    d = zeros(Int8, 128)
    d[Int64('A')] = 0
    d[Int64('C')] = 1
    d[Int64('G')] = 2
    d[Int64('T')] = 3
    seq_codes = [d[Int8(e)] for e in seq]
    j = 1
    premult = zeros(Int64, length(mask))
    for i in 1:length(mask)
        if mask[i]
            premult[i] = basis[j]
            j += 1
        end
    end
    kmer_dict = Dict{Int, Int32}(4^k=>0)
    for n in 1:(length(seq) - length(mask) + 1)
        kmer_hash = 1
        for i in 1:length(mask)
            kmer_hash += seq_codes[n+i-1] * premult[i]
        end
        haskey(kmer_dict, kmer_hash) ? kmer_dict[kmer_hash] += 1 : kmer_dict[kmer_hash] = 1
    end
    sorted_kmer_pairs = sort(collect(kmer_dict))
    sorted_kmer_keys = [e[1] for e in sorted_kmer_pairs]
    sorted_kmer_values = [e[2] for e in sorted_kmer_pairs]
    return sparsevec(sorted_kmer_keys, sorted_kmer_values)
end
This code is a bit more than twice as fast as the initial implementation on my machine. A significant fraction of the time is still spent in the sorting algorithm.
The code can still be optimized further. One way is to use a parallel sort algorithm. Another way is to replace the premult[i] multiplication by a shift, which is faster assuming premult[i] is modified so as to contain shift amounts (exponents) instead of powers of 4. I expect the resulting code to be about 4 times faster than the original code. The main bottleneck should then be the big dictionary creation. Improving the performance of that further is very hard (though it is still possible).
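For what it is worth, here is a rough sketch of that shift idea (the helper and the names offsets/shifts are mine, not from the answer): visit only the positions kept by the mask and replace the multiplication by a left shift of 2 bits per symbol, widening the Int8 code before shifting.
offsets = findall(mask)                                           # kept positions
shifts  = [2 * (length(offsets) - j) for j in 1:length(offsets)]  # 2*(k-1) down to 0
function kmer_hash_shift(seq_codes, offsets, shifts, n)
    h = 1
    for (o, s) in zip(offsets, shifts)
        h += Int(seq_codes[n+o-1]) << s   # widen to Int before shifting
    end
    return h
end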
Inspired by Jérôme's answer, and squeezing some more by avoiding Dicts altogether:
function kmer_profile_opt3a(seq, k, mask)
    d = zeros(Int8, 128)
    d[Int64('A')] = 0
    d[Int64('C')] = 1
    d[Int64('G')] = 2
    d[Int64('T')] = 3
    seq_codes = [d[Int8(e)] for e in seq]
    basis = [4^i for i in (k-1):-1:0]
    j = 1
    premult = zeros(Int64, length(mask))
    for i in 1:length(mask)
        if mask[i]
            premult[i] = basis[j]
            j += 1
        end
    end
    kmer_vec = Vector{Int}(undef, length(seq)-length(mask)+1)
    @inbounds for n in 1:(length(seq) - length(mask) + 1)
        kmer_hash = 1
        for i in 1:length(mask)
            kmer_hash += seq_codes[n+i-1] * premult[i]
        end
        kmer_vec[n] = kmer_hash
    end
    sort!(kmer_vec)
    return sparsevec(kmer_vec, ones(length(kmer_vec)), 4^k, +)
end
This achieved another 2x over Jérôme's answer on my machine.
The auto-combining feature of sparsevec makes the code a bit more compact.
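For reference, a minimal demonstration of that combining behavior (my own example, not from the answer): sparsevec(I, V, n, combine) merges entries with duplicate indices using the supplied function, which is what turns the sorted kmer hashes into counts.
using SparseArrays
v = sparsevec([1, 3, 3, 5], [1.0, 1.0, 1.0, 1.0], 6, +)
# v[1] == 1.0, v[3] == 2.0 (the two entries at index 3 were summed), v[5] == 1.0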
To slim the code further and avoid unnecessary allocations in the sparse vector creation, the following can be used:
using SparseArrays, LinearAlgebra
function specialsparsevec(nzs, n)
    vals = Vector{Int}(undef, length(nzs))
    j, k, count, last = (1, 1, 0, nzs[1])
    while k <= length(nzs)
        if nzs[k] == last
            count += 1
        else
            vals[j], nzs[j] = (count, last)
            count, last = (1, nzs[k])
            j += 1
        end
        k += 1
    end
    vals[j], nzs[j] = (count, last)
    resize!(nzs, j)
    resize!(vals, j)
    return SparseVector(n, nzs, vals)
end
function kmer_profile_opt3(seq, k, mask)
    d = zeros(Int8, 128)
    foreach(((i,c),) -> d[Int(c)] = i-1, enumerate(collect("ACGT")))
    seq_codes = getindex.(Ref(d), Int8.(collect(seq)))
    premult = foldr(
        (i,(p,j)) -> (mask[i] && (p[i] = j; j <<= 2); (p,j)),
        1:length(mask); init=(zeros(Int64, length(mask)), 1)) |> first
    kmer_vec = sort(
        [dot(@view(seq_codes[n:n+length(mask)-1]), premult) + 1 for
            n in 1:(length(seq)-length(mask)+1)
        ])
    return specialsparsevec(kmer_vec, 4^k)
end
This last version gets another 10% speedup (but is a little cryptic):
julia> @btime kmer_profile_opt($seq, $k, $mask);
  367.584 ms (81 allocations: 134.71 MiB)   # other answer
julia> @btime kmer_profile_opt3a($seq, $k, $mask);
  140.882 ms (22 allocations: 54.36 MiB)    # 1st in this answer
julia> @btime kmer_profile_opt3($seq, $k, $mask);
  127.016 ms (14 allocations: 27.66 MiB)    # 2nd in this answer

Julia parallel computing for loop

I would like to calculate the sum of the elements of a large upper triangular matrix. The regular Julia code is below.
function upsum(M)
    n = size(M)[1]
    sum = 0
    for i = 1:n-1
        for j = i+1:n
            sum = sum + M[i,j]
        end
    end
    return sum
end
R = randn(10000,10000)
upsum(R)
Since the matrix is very large, I would like to know whether there is any way to improve the speed. How can I use parallel computing here?
I would use threads rather than multiprocessing in this case. Here is an example code:
using Base.Threads
function upsum_threads(M)
    n = size(M, 1)
    chunks = nthreads()
    sums = zeros(eltype(M), chunks)
    chunkend = round.(Int, n * sqrt.((1:chunks) ./ chunks))
    @assert minimum(diff(chunkend)) > 0
    chunkstart = [2; chunkend[1:end-1] .+ 1]
    @threads for job in 1:chunks
        s = zero(eltype(M))
        for i in chunkstart[job]:chunkend[job]
            @simd for j in 1:(i-1)
                @inbounds s += M[j, i]
            end
        end
        sums[job] = s
    end
    return sum(sums)
end
R = randn(10000,10000)
upsum_threads(R)
It should give you a significant speedup (even if you remove @threads it should be much faster).
You choose the number of threads Julia uses by setting the JULIA_NUM_THREADS environment variable.
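For instance (my own note, not part of the original answer), you can start Julia with a given thread count and check it from within the session:
# shell: JULIA_NUM_THREADS=4 julia    (or, on Julia 1.5+, julia --threads 4)
using Base.Threads
println(nthreads())   # number of threads available to @threads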

Is there a way to refactor the julia code below in order to avoid the loop/malloc?

m, n = size(l.x)
for batch = 1:m
    l.ly = l.y[batch,:]
    l.jacobian .= -l.ly .* l.ly'
    l.jacobian[diagind(l.jacobian)] .= l.ly .* (1.0 .- l.ly)
    # n x 1 = n x n * n x 1
    l.dldx[batch,:] = l.jacobian * DLDY[batch,:]
end
return l.dldx
l.x is an m by n matrix. l.y is another matrix with the same size as l.x. My goal is to create another m by n matrix, l.dldx, in which each row is the result of the operation inside the for loop. Can anyone spot further optimizations for this block of code? The code above is part of https://github.com/stevenygd/NN.jl.
The following should implement the same calculation and is more efficient:
l.dldx = l.y .* (DLDY .- sum(l.y .* DLDY, dims=2))
There might be a slight improvement available by refactoring the sum into a loop.
As the question does not have runnable code, or a test case, it is hard to give definite benchmarks, so feedback would be welcome.
UPDATE
Here is the code above with explicit loops:
function calc_dldx(y, DLDY)
    tmp = zeros(eltype(y), size(y,1))
    dldx = similar(y)
    @inbounds for j = 1:size(y,2)
        for i = 1:size(y,1)
            tmp[i] += y[i,j]*DLDY[i,j]
        end
    end
    @inbounds for j = 1:size(y,2)
        for i = 1:size(y,1)
            dldx[i,j] = y[i,j]*(DLDY[i,j]-tmp[i])
        end
    end
    return dldx
end
The long version should run even faster. A good way to measure the performance of code is using the BenchmarkTools package.
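For example, a minimal timing sketch with BenchmarkTools (my own example; y and DLDY below are stand-ins for the question's l.y and DLDY):
using BenchmarkTools
y    = rand(100, 50)
DLDY = rand(100, 50)
@btime $y .* ($DLDY .- sum($y .* $DLDY, dims=2));   # one-liner version
@btime calc_dldx($y, $DLDY);                        # explicit-loop version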

breaking out of a loop in Julia

I have a Vector W of Vectors of different lengths. These inner vectors contain integers between 0 and 150,000 in steps of 5, but they can also be empty. I am trying to compute the empirical cdf for each of those vectors. I could compute these cdfs by iterating over every vector and every integer like this:
cdfdict = Dict{Tuple{Int,Int},Float64}()
for i in 1:length(W)
    v = W[i]
    len = length(v)
    if len == 0
        pcdf = 1.0
    else
        for j in 0:5:150_000
            pcdf = length(v[v .<= j])/len
            cdfdict[i, j] = pcdf
        end
    end
end
However, this approach is inefficient because the cdf will be equal to 1 for j >= maximum(v) and sometimes this maximum(v) will be much lower than 150,000.
My question is: how can I include a condition that breaks out of the j loop for j > maximum(v) but still assigns pcdf = 1.0 for the rest of js?
I tried including a break when j > maximum(v) but this, of course, stops the loop from continuing for the rest of js. Also, I can break the loop and then use get! to access/include 1.0 for keys not found in cdfdict later on, but that is not what I'm looking for.
To elaborate on my comment, this answer details an implementation which fills an Array instead of a Dict.
First to create a random test case:
W = [rand(0:mv, rand(0:10)) for mv in floor.(Int, exp.(log(150_000) .* rand(10)))]
Next create an array of the right size filled with 1.0s:
cdfmat = ones(Float64,length(W),length(0:5:150_000));
Now to fill the beginning of the CDFs:
for i = 1:length(W)
    v = sort(W[i])
    k = 1
    thresh = 0
    for j = 1:length(v)
        if (j > 1 && v[j] == v[j-1])
            continue
        end
        pcdf = (j-1)/length(v)
        while thresh < v[j]
            cdfmat[i,k] = pcdf
            k += 1
            thresh += 5
        end
    end
end
This implementation uses a sort, which can sometimes be slow, but the other implementations basically compare the vector against many threshold values, which is even slower in most cases.
break only exits one loop level. You can do what you want by wrapping the for loops in a function and using return (instead of where you would've put break), or by using @goto.
Or, where you would break, you could set a boolean breakd=true and then break, and at the bottom of the outer loop do if breakd break end.
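A minimal sketch of the wrap-in-a-function pattern (my own example with hypothetical names): return leaves both loops at once, where break would only leave the inner one.
function first_entry_above(W, limit)
    for (i, v) in enumerate(W)
        for x in v
            if x > limit
                return (i, x)   # exits both loops immediately
            end
        end
    end
    return nothing              # nothing found
end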
You can use another for loop to set all remaining elements to 1.0. The inner loop becomes
m = maximum(v)
for j in 0:5:150_000
    if j > m
        for k in j:5:150_000
            cdfdict[i, k] = 1.0
        end
        break
    end
    pcdf = count(x -> x <= j, v)/len
    cdfdict[i, j] = pcdf
end
However, this is rather hard to understand. It would be easier to use a branch. In fact, this should be just as fast because the branch is very predictable.
m = maximum(v)
for j in 0:5:150_000
    if j > m
        cdfdict[i, j] = 1.0
    else
        pcdf = count(x -> x <= j, v)/len
        cdfdict[i, j] = pcdf
    end
end
Another answer gave an implementation using an Array which calculated the CDF by sorting the samples and filling up the CDF bins with quantile values. Since the whole Array is thus filled, doing another pass on the array should not be overly costly (we tolerate a single pass already). The sorting bit and the allocation accompanying it can be avoided by calculating a histogram in the array and using cumsum to produce a CDF. Perhaps the code will explain this better:
Initialize sizes, lengths and widths:
n = 10; w = 5; rmax = 150_000; hl = length(0:w:rmax)
Produce a sample example:
W = [rand(0:mv, rand(0:10)) for mv in floor.(Int, exp.(log(rmax) .* rand(n)))];
Calculate the CDFs:
cdfmat = zeros(Float64, n, hl);  # empty histograms
for i = 1:n                      # drop samples into histogram bins
    for j = 1:length(W[i])
        cdfmat[i, 1+(W[i][j]+w-1)÷w] += one(Float64)
    end
end
cumsum!(cdfmat, cdfmat, dims=2)  # calculate pre-CDF by cumsum
for i = 1:n                      # normalize each CDF by its total
    if cdfmat[i,hl] == zero(Float64)     # is the histogram empty?
        for j = 1:hl                     # CDF of 1.0 as default (might be changed)
            cdfmat[i,j] = one(Float64)
        end
    else                                 # the normalization factor, calculated once
        f = one(Float64)/cdfmat[i,hl]
        for j = 1:hl
            cdfmat[i,j] *= f
        end
    end
end
(a) Note the use of one and zero to prepare for a change of the Real type used; this is good practice. (b) Adding various @inbounds and @simd annotations should optimize further. (c) Putting this code in a function is recommended (this is not done in this answer). (d) If a zero CDF is acceptable for empty samples (which semantically treats an empty sample as an arbitrarily large one), then the second loop can be simplified.
See other answers for more options, and a reminder: premature optimization is the root of all evil (Knuth).

OpenMP over Summation

I have been trying to apply OpenMP to a simple summation operation inside two nested loops, but so far it has produced incorrect results. I have been looking around here and here, and also here. All suggest using the reduction clause, but it does not work for my case: it produces a very large number, which leads to a segmentation fault.
I also tried the approach posted here, and my own question here, which has been solved. Neither uses reduction; they simply set the summation variable as shared, but that also produces an incorrect result. Is there something that I am missing? When should reduction be used, and when not, for a summation operation?
Code using the reduction clause:
index = 0
!$OMP PARALLEL DO PRIVATE(iy,ix) REDUCTION(+:index)
do iy = 1, number(2)
  do ix = 1, number(1)
    index = index + 1
    xoutput(index) = xinput(ix)
    youtput(index) = yinput(iy)
  end do
end do
!$OMP END PARALLEL DO
Code without the reduction clause:
index = 0
!$OMP PARALLEL DO PRIVATE(iy,ix) SHARED(index)
do iy = 1, number(2)
  do ix = 1, number(1)
    index = index + 1
    xoutput(index) = xinput(ix)
    youtput(index) = yinput(iy)
  end do
end do
!$OMP END PARALLEL DO
I think you have a misconception of what the reduction clause does...
REDUCTION(+:index)
means that you will have the correct sum in index at the end. During the iterations, each thread will have a different version of index with different values! So a reduction is not suitable for managing array indices during the parallel section.
Let me try to illustrate this...
The following loop
!$OMP PARALLEL DO PRIVATE(iy) REDUCTION(+:index)
do iy = 1, number(2)
  index = index + 1
end do
!$OMP END PARALLEL DO
is (more or less) equivalent to
!$OMP PARALLEL PRIVATE(iy, privIndex) SHARED(index)
privIndex = 0      ! each private copy starts from the identity element of +
!$OMP DO
do iy = 1, number(2)
  privIndex = privIndex + 1
end do
!$OMP END DO
!$OMP CRITICAL
index = index + privIndex
!$OMP END CRITICAL
!$OMP END PARALLEL
You can see that during the loop each thread works on its own variable privIndex, which is private to that thread, and calculates a local (partial) sum. At the end, the total sum is taken, using a critical section to avoid race conditions.
This might not be exactly what the compiler does, but it gives you an idea of how a reduction works: at no point within the loop does the (private) index correspond to the correct index you would expect in the serial version.
As Vladimir suggests in his comment, since you only increment index by one per inner iteration, you can calculate it directly from the loop variables:
!$OMP PARALLEL DO PRIVATE(iy,ix,index)
do iy = 1, number(2)
  do ix = 1, number(1)
    index = (iy-1)*number(1) + ix
    xoutput(index) = xinput(ix)
    youtput(index) = yinput(iy)
  end do
end do
!$OMP END PARALLEL DO
