Fastest way to generate a kmer count vector from a nucleotide sequence (Julia) - performance

Given a nucleotide sequence, I'm writing some Julia code to generate a sparse vector of (masked) kmer counts, and I would like it to run as fast as possible.
Here is my current implementation,
using Distributions
using SparseArrays
function kmer_profile(seq, k, mask)
basis = [4^i for i in (k - 1):-1:0]
d = Dict('A'=>0, 'C'=>1, 'G'=>2, 'T'=>3)
kmer_dict = Dict{Int, Int32}(4^k=>0)
for n in 1:(length(seq) - length(mask) + 1)
kmer_hash = 1
j = 1
for i in 1:length(mask)
if mask[i]
kmer_hash += d[seq[n+i-1]] * basis[j]
j += 1
end
end
haskey(kmer_dict, kmer_hash) ? kmer_dict[kmer_hash] += 1 : kmer_dict[kmer_hash] = 1
end
return sparsevec(kmer_dict)
end
seq = join(sample(['A','C','G','T'], 1000000))
mask_str = "111111011111001111111111111110"
mask = BitArray([parse(Bool, string(m)) for m in split(mask_str, "")])
k = sum(mask)
#time kmer_profile(seq, k, mask)
This code runs in about 0.3 seconds on my M1 MacBook Pro, is there any way to make it run significantly faster?
The function kmer_profile uses a sliding window of size length(mask) to count the number of times each masked kmer appears in the nucleotide sequence. A mask is a binary sequence, and a masked kmer is a kmer with nucleotides dropped at positions at which the mask is zero. E.g. the kmer ACGT and mask 1001 will produce the masked kmer AT.
To produce the kmer hash, the function treats each kmer as a base 4 number and then converts it to a (base 10) 64-bit integer, for indexing into the kmer vector.
The size of k is equal to the number of ones in the mask string, and is implicitly limited to 31 so that kmer hashes can fit into a 64-bit integer type.

There are several possible optimizations to make this code faster.
First of all, one can convert the Dict to an array since array-based indexing is faster than dictionary-based indexing one and this is possible here since the key is an ASCII character.
Moreover, the extraction of the sequence codes can be done once instead of length(mask) times by pre-computing code and putting the result in a temporary array.
Additionally, the mask-based conditional and the loop carried dependency make things slow. Indeed, the condition cannot be (easily) predicted by the processor causing it to stall for several cycles. The loop carried dependency make things even worse since the processor can hardly execute other instructions during this stall. This problem can be solved by pre-computing the factors based on both mask and basis. The result is a faster branch-less loop.
Once the above optimizations are done, the biggest bottleneck is sparsevec. In fact, it was also taking nearly half the time of the initial implementation! Optimizing this step is difficult but not impossible. It is slow because of random accesses in the Julia implementation. One can speed this up by sorting the keys-values pairs in the first place. It is faster due to a more cache-friendly execution and it can also help the prediction unit of the processor. This is a complex topic. For more details about how this works, please read Why is processing a sorted array faster than processing an unsorted array?.
Here is the final optimized code:
function kmer_profile_opt(seq, k, mask)
basis = [4^i for i in (k - 1):-1:0]
d = zeros(Int8, 128)
d[Int64('A')] = 0
d[Int64('C')] = 1
d[Int64('G')] = 2
d[Int64('T')] = 3
seq_codes = [d[Int8(e)] for e in seq]
j = 1
premult = zeros(Int64, length(mask))
for i in 1:length(mask)
if mask[i]
premult[i] = basis[j]
j += 1
end
end
kmer_dict = Dict{Int, Int32}(4^k=>0)
for n in 1:(length(seq) - length(mask) + 1)
kmer_hash = 1
j = 1
for i in 1:length(mask)
kmer_hash += seq_codes[n+i-1] * premult[i]
end
haskey(kmer_dict, kmer_hash) ? kmer_dict[kmer_hash] += 1 : kmer_dict[kmer_hash] = 1
end
sorted_kmer_pairs = sort(collect(kmer_dict))
sorted_kmer_keys = [e[1] for e in sorted_kmer_pairs]
sorted_kmer_values = [e[2] for e in sorted_kmer_pairs]
return sparsevec(sorted_kmer_keys, sorted_kmer_values)
end
This code is a bit more than twice faster than the initial implementation on my machine. A significant fraction of the time is still spent in the sorting algorithm.
The code can still be optimized further. One way is to use a parallel sort algorithm. Another way is to replace the premult[i] multiplication by a shift which is faster assuming premult[i] is modified so to contain exponents. I expect the code to be about 4 times faster than the original code. The main bottleneck should be the big dictionary creation. Improving further the performance of this is very hard (though it is still possible).

Inspired by Jérôme's answer, and squeezing some more by avoiding Dicts altogether:
function kmer_profile_opt3a(seq, k, mask)
d = zeros(Int8, 128)
d[Int64('A')] = 0
d[Int64('C')] = 1
d[Int64('G')] = 2
d[Int64('T')] = 3
seq_codes = [d[Int8(e)] for e in seq]
basis = [4^i for i in (k-1):-1:0]
j = 1
premult = zeros(Int64, length(mask))
for i in 1:length(mask)
if mask[i]
premult[i] = basis[j]
j += 1
end
end
kmer_vec = Vector{Int}(undef, length(seq)-length(mask)+1)
#inbounds for n in 1:(length(seq) - length(mask) + 1)
kmer_hash = 1
for i in 1:length(mask)
kmer_hash += seq_codes[n+i-1] * premult[i]
end
kmer_vec[n] = kmer_hash
end
sort!(kmer_vec)
return sparsevec(kmer_vec, ones(length(kmer_vec)), 4^k, +)
end
This achieved another 2x over Jérôme's answer on my machine.
The auto-combining feature of sparsevec makes the code a bit more compact.
Trying to slim the code further, and avoid unnecessary allocations in sparse vector creation, the following can be used:
using SparseArrays, LinearAlgebra
function specialsparsevec(nzs, n)
vals = Vector{Int}(undef, length(nzs))
j, k, count, last = (1, 1, 0, nzs[1])
while k <= length(nzs)
if nzs[k] == last
count += 1
else
vals[j], nzs[j] = (count, last)
count, last = (1, nzs[k])
j += 1
end
k += 1
end
vals[j], nzs[j] = (count, last)
resize!(nzs, j)
resize!(vals, j)
return SparseVector(n, nzs, vals)
end
function kmer_profile_opt3(seq, k, mask)
d = zeros(Int8, 128)
foreach(((i,c),) -> d[Int(c)]=i-1, enumerate(collect("ACGT")))
seq_codes = getindex.(Ref(d), Int8.(collect(seq)))
premult = foldr(
(i,(p,j))->(mask[i] && (p[i]=j ; j<<=2) ; (p,j)),
1:length(mask); init=(zeros(Int64,length(mask)),1)) |> first
kmer_vec = sort(
[ dot(#view(seq_codes[n:n+length(mask)-1]),premult) + 1 for
n in 1:(length(seq)-length(mask)+1)
])
return specialsparsevec(kmer_vec, 4^k)
end
This last version gets another 10% speedup (but is a little cryptic):
julia> #btime kmer_profile_opt($seq, $k, $mask);
367.584 ms (81 allocations: 134.71 MiB) # other answer
julia> #btime kmer_profile_opt3a($seq, $k, $mask);
140.882 ms (22 allocations: 54.36 MiB) # 1st this answer
julia> #btime kmer_profile_opt3($seq, $k, $mask);
127.016 ms (14 allocations: 27.66 MiB) # 2nd this answer

Related

A faster alternative to all(a(:,i)==a,1) in MATLAB

It is a straightforward question: Is there a faster alternative to all(a(:,i)==a,1) in MATLAB?
I'm thinking of a implementation that benefits from short-circuit evaluations in the whole process. I mean, all() definitely benefits from short-circuit evaluations but a(:,i)==a doesn't.
I tried the following code,
% example for the input matrix
m = 3; % m and n aren't necessarily equal to those values.
n = 5000; % It's only possible to know in advance that 'm' << 'n'.
a = randi([0,5],m,n); % the maximum value of 'a' isn't necessarily equal to
% 5 but it's possible to state that every element in
% 'a' is a positive integer.
% all, equal solution
tic
for i = 1:n % stepping up the elapsed time in orders of magnitude
%%%%%%%%%% all and equal solution %%%%%%%%%
ax_boo = all(a(:,i)==a,1);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
end
toc
% alternative solution
tic
for i = 1:n % stepping up the elapsed time in orders of magnitude
%%%%%%%%%%% alternative solution %%%%%%%%%%%
ax_boo = a(1,i) == a(1,:);
for k = 2:m
ax_boo(ax_boo) = a(k,i) == a(k,ax_boo);
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
end
toc
but it's intuitive that any "for-loop-solution" within the MATLAB environment will be naturally slower. I'm wondering if there is a MATLAB built-in function written in a faster language.
EDIT:
After running more tests I found out that the implicit expansion does have a performance impact in evaluating a(:,i)==a. If the matrix a has more than one row, all(repmat(a(:,i),[1,n])==a,1) may be faster than all(a(:,i)==a,1) depending on the number of columns (n). For n=5000 repmat explicit expansion has proved to be faster.
But I think that a generalization of Kenneth Boyd's answer is the "ultimate solution" if all elements of a are positive integers. Instead of dealing with a (m x n matrix) in its original form, I will store and deal with adec (1 x n matrix):
exps = ((0):(m-1)).';
base = max(a,[],[1,2]) + 1;
adec = sum( a .* base.^exps , 1 );
In other words, each column will be encoded to one integer. And of course adec(i)==adec is faster than all(a(:,i)==a,1).
EDIT 2:
I forgot to mention that adec approach has a functional limitation. At best, storing adec as uint64, the following inequality must hold base^m < 2^64 + 1.
Since your goal is to count the number of columns that match, my example converts the binary encoding to integer decimals, then you just loop over the possible values (with 3 rows that are 8 possible values) and count the number of matches.
a_dec = 2.^(0:(m-1)) * a;
num_poss_values = 2 ^ m;
num_matches = zeros(num_poss_values, 1);
for i = 1:num_poss_values
num_matches(i) = sum(a_dec == (i - 1));
end
On my computer, using 2020a, Here are the execution times for your first 2 options and the code above:
Elapsed time is 0.246623 seconds.
Elapsed time is 0.553173 seconds.
Elapsed time is 0.000289 seconds.
So my code is 853 times faster!
I wrote my code so it will work with m being an arbitrary integer.
The num_matches variable contains the number of columns that add up to 0, 1, 2, ...7 when converted to a decimal.
As an alternative you can use the third output of unique:
[~, ~, iu] = unique(a.', 'rows');
for i = 1:n
ax_boo = iu(i) == iu;
end
As indicated in a comment:
ax_boo isolates the indices of the columns I have to sum in a row vector b. So, basically the next line would be something like c = sum(b(ax_boo),2);
It is a typical usage of accumarray:
[~, ~, iu] = unique(a.', 'rows');
C = accumarray(iu,b);
for i = 1:n
c = C(i);
end

Julia parallel computing for loop

I would like to calculate the summation of elements from a large upper triangular matrix. The regular Julia code is below.
function upsum(M); n = size(M)[1]; sum = 0
for i = 1:n-1 for j = i+1:n
sum = sum + M[i,j]
end
end
return sum
end
R = randn(10000,10000)
upsum(R)
Since the matrix is very large, I would like to know is there anyway to improve the speed. How can I use parallel computing here?
I would use threads not parallel processing in this case. Here is an example code:
using Base.Threads
function upsum_threads(M)
n = size(M, 1)
chunks = nthreads()
sums = zeros(eltype(M), chunks)
chunkend = round.(Int, n * sqrt.((1:chunks) ./ chunks))
#assert minimum(diff(chunkend)) > 0
chunkstart = [2; chunkend[1:end-1] .+ 1]
#threads for job in 1:chunks
s = zero(eltype(M))
for i in chunkstart[job]:chunkend[job]
#simd for j in 1:(i-1)
#inbounds s += M[j, i]
end
end
sums[job] = s
end
return sum(sums)
end
R = randn(10000,10000)
upsum_threads(R)
It should give you a significant speedup (even if you remove #threads it should be much faster).
You choose number of threads Julia uses by setting JULIA_NUM_THREADS environment variable.

IDL: How do I loop through all possible values of shift without using a loop?

I have a 3 dimensional array which I am seeking to shift, multiply (element wise) with the un-shifted array, and sum all the products. For a given set of the three integers defining the shift (i,j,k) the sum corresponding to this triple will be stored in another 3 dimensional array at said location (i,j,k).
I am attempting to do this for all possible values of i,j,k. Is there a way to do this without explicitly using for/while loops? I am highly interested in minimizing the computational time but IDL doesnt understand the current instructions I am providing:
FUNCTION ccs, vecx, vecy
TIC
l = 512
km = l/2
corvec = fltarr(km,km,km)
N = float(l)^3
corvec[*,*,*] = (total(vecx*shift(vecx,*,*,*))/N)+ (total(vecy*shift(vecy,*,*,*))/N)
return, corvec
TOC
end
IDL appears to only accept scalars or 1 element arrays as possible shifting arguments.
Is the below what you are trying to do? It's not fast, but it will give something to test faster solutions against.
function ccs, vecx, vecy
compile_opt strictarr
tic
l = 512
km = l/2
corvec = fltarr(km, km, km)
n = float(l)^3
for i = 0L, km - 1L do begin
for j = 0L, km - 1L do begin
for k = 0L, km - 1L do begin
corvec[i, j, k] = total(vecx * shift(vecx, i, j, k)) / n + total(vecy * shift(vecy, i, j, k)) / n
endfor
endfor
endfor
toc
return, corvec
end
If this is what you are trying to do, I think it will be tough in IDL. Generally, vectorized solutions without loops for these types of problems need to create intermediate arrays — in this case, of all possible shifts of vecx and vecy, which would be a very large intermediate array. My suggestion would be to implement in C and call from IDL, but maybe I am missing something.

How to parallelize computation of pairwise distance matrix?

My problem is roughly as follows. Given a numerical matrix X, where each row is an item. I want to find each row's nearest neighbor in terms of L2 distance in all rows except itself. I tried reading the official documentation but was still a little confused about how to achieve this. Could someone give me some hint?
My code is as follows
function l2_dist(v1, v2)
return sqrt(sum((v1 - v2) .^ 2))
end
function main(Mat, dist_fun)
n = size(Mat, 1)
Dist = SharedArray{Float64}(n) #[Inf for i in 1:n]
Id = SharedArray{Int64}(n) #[-1 for i in 1:n]
#parallel for i = 1:n
Dist[i] = Inf
Id[i] = 0
end
Threads.#threads for i in 1:n
for j in 1:n
if i != j
println(i, j)
dist_temp = dist_fun(Mat[i, :], Mat[j, :])
if dist_temp < Dist[i]
println("Dist updated!")
Dist[i] = dist_temp
Id[i] = j
end
end
end
end
return Dict("Dist" => Dist, "Id" => Id)
end
n = 4000
p = 30
X = [rand() for i in 1:n, j in 1:p];
main(X[1:30, :], l2_dist)
#time N = main(X, l2_dist)
I'm trying to distributed all the i's (i.e. calculating each row minimum) over different cores. But the version above apparently isn't working correctly. It is even slower than the sequential version. Can someone point me to the right direction? Thanks.
Maybe you're doing something in addition to what you have written down, but, at this point from what I can see, you aren't actually doing any computations in parallel. Julia requires you to tell it how many processors (or threads) you would like it to have access to. You can do this through either
Starting Julia with multiple processors julia -p # (where # is the number of processors you want Julia to have access to)
Once you have started a Julia "session" you can call the addprocs function to add additional processors.
To have more than 1 thread, you need to run command export JULIA_NUM_THREADS = #. I don't know very much about threading, so I will be sticking with the #parallel macro. I suggest reading documentation for more details on threading -- Maybe #Chris Rackauckas could expand a little more on the difference.
A few comments below about my code and on your code:
I'm on version 0.6.1-pre.0. I don't think I'm doing anything 0.6 specific, but this is a heads up just in case.
I'm going to use the Distances.jl package when computing the distances between vectors. I think it is a good habit to farm out as many of my computations to well-written and well-maintained packages as possible.
Rather than compute the distance between rows, I'm going to compute the distance between columns. This is because Julia is a column-major language, so this will increase the number of cache hits and give a little extra speed. You can obviously get the row-wise results you want by just transposing the input.
Unless you expect to have that many memory allocations then that many allocations are a sign that something in your code is inefficient. It is often a type stability problem. I don't know if that was the case in your code before, but that doesn't seem to be an issue in the current version (it wasn't immediately clear to me why you were having so many allocations).
Code is below
# Make sure all processors have access to Distances package
#everywhere using Distances
# Create a random matrix
nrow = 30
ncol = 4000
# Seed creation of random matrix so it is always same matrix
srand(42)
X = rand(nrow, ncol)
function main(X::AbstractMatrix{Float64}, M::Distances.Metric)
# Get size of the matrix
nrow, ncol = size(X)
# Create `SharedArray` to store output
ind_vec = SharedArray{Int}(ncol)
dist_vec = SharedArray{Float64}(ncol)
# Compute the distance between columns
#sync #parallel for i in 1:ncol
# Initialize various temporary variables
min_dist_i = Inf
min_ind_i = -1
X_i = view(X, :, i)
# Check distance against all other columns
for j in 1:ncol
# Skip comparison with itself
if i==j
continue
end
# Tell us who is doing the work
# (can uncomment if you want to verify stuff)
# println("Column $i compared with Column $j by worker $(myid())")
# Evaluate the new distance...
# If it is less then replace it, otherwise proceed
dist_temp = evaluate(M, X_i, view(X, :, j))
if dist_temp < min_dist_i
min_dist_i = dist_temp
min_ind_i = j
end
end
# Which column is minimum distance from column i
dist_vec[i] = min_dist_i
ind_vec[i] = min_ind_i
end
return dist_vec, ind_vec
end
# Using Euclidean metric
metric = Euclidean()
inds, dist = main(X, metric)
#time main(X, metric);
#show dist[[1, 5, 25]], inds[[1, 5, 25]]
You can run the code with
1 processor julia testfile.jl
% julia testfile.jl
0.640365 seconds (16.00 M allocations: 732.495 MiB, 3.70% gc time)
(dist[[1, 5, 25]], inds[[1, 5, 25]]) = ([2541, 2459, 1602], [1.40892, 1.38206, 1.32184])
n processors (in this case 4) julia -p n testfile.jl
% julia -p 4 testfile.jl
0.201523 seconds (2.10 k allocations: 99.107 KiB)
(dist[[1, 5, 25]], inds[[1, 5, 25]]) = ([2541, 2459, 1602], [1.40892, 1.38206, 1.32184])

breaking out of a loop in Julia

I have a Vector of Vectors of different length W. These last vectors contain integers between 0 and 150,000 in steps of 5 but can also be empty. I am trying to compute the empirical cdf for each of those vectors. I could compute these cdf iterating over every vector and every integer like this
cdfdict = Dict{Tuple{Int,Int},Float64}()
for i in 1:length(W)
v = W[i]
len = length(v)
if len == 0
pcdf = 1.0
else
for j in 0:5:150_000
pcdf = length(v[v .<= j])/len
cdfdict[i, j] = pcdf
end
end
end
However, this approach is inefficient because the cdf will be equal to 1 for j >= maximum(v) and sometimes this maximum(v) will be much lower than 150,000.
My question is: how can I include a condition that breaks out of the j loop for j > maximum(v) but still assigns pcdf = 1.0 for the rest of js?
I tried including a break when j > maximum(v) but this, of course, stops the loop from continuing for the rest of js. Also, I can break the loop and then use get! to access/include 1.0 for keys not found in cdfdict later on, but that is not what I'm looking for.
To elaborate on my comment, this answer details an implementation which fills an Array instead of a Dict.
First to create a random test case:
W = [rand(0:mv,rand(0:10)) for mv in floor(Int,exp(log(150_000)*rand(10)))]
Next create an array of the right size filled with 1.0s:
cdfmat = ones(Float64,length(W),length(0:5:150_000));
Now to fill the beginning of the CDFs:
for i=1:length(W)
v = sort(W[i])
k = 1
thresh = 0
for j=1:length(v)
if (j>1 && v[j]==v[j-1])
continue
end
pcdf = (j-1)/length(v)
while thresh<v[j]
cdfmat[i,k]=pcdf
k += 1
thresh += 5
end
end
end
This implementation uses a sort which can be slow sometimes, but the other implementations basically compare the vector with various values which is even slower in most cases.
break only does one level. You can do what you want by wrapping the for loop function and using return (instead of where you would've put break), or using #goto.
Or where you would break, you could switch a boolean breakd=true and then break, and at the bottom of the larger loop do if breakd break end.
You can use another for loop to set all remaining elements to 1.0. The inner loop becomes
m = maximum(v)
for j in 0:5:150_000
if j > m
for k in j:5:150_000
cdfdict[i, k] = 1.0
end
break
end
pcdf = count(x -> x <= j, v)/len
cdfdict[i, j] = pcdf
end
However, this is rather hard to understand. It would be easier to use a branch. In fact, this should be just as fast because the branch is very predictable.
m = maximum(v)
for j in 0:5:150_000
if j > m
cdfdict[i, j] = 1.0
else
pcdf = count(x -> x <= j, v)/len
cdfdict[i, j] = pcdf
end
end
Another answer gave an implementation using an Array which calculated the CDF by sorting the samples and filling up the CDF bins with quantile values. Since the whole Array is thus filled, doing another pass on the array should not be overly costly (we tolerate a single pass already). The sorting bit and the allocation accompanying it can be avoided by calculating a histogram in the array and using cumsum to produce a CDF. Perhaps the code will explain this better:
Initialize sizes, lengths and widths:
n = 10; w = 5; rmax = 150_000; hl = length(0:w:rmax)
Produce a sample example:
W = [rand(0:mv,rand(0:10)) for mv in floor(Int,exp(log(rmax)*rand(n)))];
Calculate the CDFs:
cdfmat = zeros(Float64,n,hl); # empty histograms
for i=1:n # drop samples into histogram bins
for j=1:length(W[i])
cdfmat[i,1+(W[i][j]+w-1)÷5]+=one(Float64)
end
end
cumsum!(cdfmat,cdfmat,2) # calculate pre-CDF by cumsum
for i=1:n # normalize each CDF by total
if cdfmat[i,hl]==zero(Float64) # check if histogram empty?
for j=1:hl # CDF of 1.0 as default (might be changed)
cdfmat[i,j] = one(Float64)
end
else # the normalization factor calc-ed once
f = one(Float64)/cdfmat[i,hl]
for j=1:hl
cdfmat[i,j] *= f
end
end
end
(a) Note the use of one,zero to prepare for change of Real type - this is good practice. (b) Also adding various #inbounds and #simd should optimize further. (c) Putting this code in a function is recommended (this is not done in this answer). (d) If having a zero CDF for empty samples is OK (which means no samples means huge samples semantically), then the second for can be simplified.
See other answers for more options, and reminder: Premature optimization is the root of all evil (Knuth??)

Resources