Optimize rank computation for high dimension tensor - matrix

My programme spends a lot of time in the code below, even though it runs on a GPU machine. How can I optimise it? The tensors can be as large as y_ul.shape = [8, 512, 128, 128].
for i, m in enumerate(y_ul):
for j, l in enumerate(m):
mean_rank_topleft = torch.mean(ranks_topleft.float())

Unlike torch.matrix_rank, torch.linalg.matrix_rank accepts batched inputs (see the documentation). You can try:
ranks = torch.linalg.matrix_rank(y_ul) # shape (8, 512)
mean_rank_topleft = torch.mean(ranks.float()) # scalar; the ranks are integers, so cast to float before averaging
Note that you can adjust the tolerance used in the computation (the default should be fine). Also, if your matrices are symmetric, you can pass hermitian=True to speed up the calculation.
Note that torch.matrix_rank is deprecated and will be removed in a future release!


Julia: FAST way of calculating the smallest distances between two sets of points

I have 5000 3D points in a matrix A and another 5000 3D points in a matrix B.
For each point in A, I want to find the smallest distance to a point in B. These distances should be stored in an array with 5000 entries.
So far I have this solution, running in about 0.145342 seconds (23 allocations: 191.079 MiB). How can I improve this further?
using Distances
A = rand(5000, 3)
B = rand(5000, 3)
mis = @time minimum(Distances.pairwise(SqEuclidean(), A, B, dims=1), dims=2)
This is a standard way to do it as it will have a better time complexity (especially for larger data):
using NearestNeighbors
nn(KDTree(B'; leafsize = 10), A')[2] .^ 2
Two comments:
by default Euclidean distance is computed (so I square it)
by default NearestNeighbors.jl assumes observations are stored in columns (so I need B' and A' in the solution; if your original data were transposed, this would not be needed; it is designed this way because Julia uses column-major matrix storage)
Generating a big distance matrix with Distances.pairwise(SqEuclidean(), A, B, dims=1) is inefficient because main memory is slow compared to the CPU caches and the computing power of modern CPUs, and this is not going to improve any time soon (see "memory wall"). It is faster to compute the minimum on the fly using two basic nested for loops. Additionally, the work can be spread over multiple cores using multiple threads.
function computeMinDist(A, B)
    n, m = size(A, 1), size(B, 1)
    result = zeros(n)
    # Each thread handles a subset of the points of A
    Threads.@threads for i = 1:n
        minSqDist = Inf
        @inbounds for j = 1:m
            dx = A[i,1] - B[j,1]
            dy = A[i,2] - B[j,2]
            dz = A[i,3] - B[j,3]
            sqDist = dx*dx + dy*dy + dz*dz
            if sqDist < minSqDist
                minSqDist = sqDist
            end
        end
        result[i] = minSqDist
    end
    return result
end
mis = @time computeMinDist(A, B)
Note that Julia uses one thread by default, but this can be changed by setting the environment variable JULIA_NUM_THREADS=auto or by launching Julia with the flag --threads=auto. See the multi-threading documentation for more information.
Performance results
Here are performance results on my i5-9600KF machine with 6 cores (with two 5000x3 matrices):
Initial implementation: 93.4 ms
This implementation: 4.4 ms
This implementation is thus 21 times faster.
The results agree to within a few ULP.
Note the code can certainly be optimized further using loop tiling, and possibly by transposing A and B so the JIT can generate a more efficient implementation using SIMD instructions.
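For readers who do not use Julia, the running-minimum idea can be sketched in plain Python. This is only an illustration of the technique (it is single-threaded and will be far slower than the Julia version); the function names min_sq_dists and min_sq_dists_via_matrix are made up for this sketch.

```python
import random

def min_sq_dists(a, b):
    """For each 3D point in a, return the smallest squared distance to any point in b.

    Only one running minimum per point of a is kept, so no len(a) x len(b)
    matrix is ever stored.
    """
    result = []
    for ax, ay, az in a:
        best = float("inf")
        for bx, by, bz in b:
            dx, dy, dz = ax - bx, ay - by, az - bz
            sq = dx * dx + dy * dy + dz * dz
            if sq < best:
                best = sq
        result.append(best)
    return result

def min_sq_dists_via_matrix(a, b):
    """Reference version that builds the full pairwise matrix first."""
    matrix = [[sum((p - q) ** 2 for p, q in zip(pa, pb)) for pb in b] for pa in a]
    return [min(row) for row in matrix]

# Small random inputs (hypothetical sizes, much smaller than the 5000 points above)
random.seed(0)
A = [[random.random() for _ in range(3)] for _ in range(200)]
B = [[random.random() for _ in range(3)] for _ in range(200)]
fast = min_sq_dists(A, B)
ref = min_sq_dists_via_matrix(A, B)
```

Both functions return the same values up to rounding; the difference is only in how much intermediate memory they touch.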

Performance expectations when running caret::train() to develop a kknn model

I'm using the caret::train() function to develop a weighted knn classification model (kknn) with 10-fold cross-validation and a tuneGrid containing 15 values for kmax, one value for distance, and 3 values for kernel.
That’s 450 total iterations if I understand the process correctly (an iteration being the computation of the probability of a given outcome for a given combination of kmax, distance, and kernel). x has about 480,000 data points (6 predictors each having about 80,000 observations), and y has about 80,000 data points.
Understanding that there are innumerable variables affecting performance, how long can I reasonably expect the train function to take if run on a pc with an 8-core 3GHz Intel processor and 32GB of RAM?
It currently takes about 70 minutes per fold, which is about 1.5 minutes per iteration. Is this reasonable, or excessive?
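The arithmetic behind these numbers can be checked with a short Python sketch. It assumes, as above, that every row of the tuning grid is fitted once per fold; whether caret actually does that for kknn is not something this sketch verifies.

```python
from itertools import product

# Tuning grid from the question: seq(11, 39, 2) gives 15 odd values of kmax
kmax_values = list(range(11, 40, 2))
distance_values = [2]
kernel_values = ["triangular", "gaussian", "optimal"]
folds = 10

grid = list(product(kmax_values, distance_values, kernel_values))
fits = len(grid) * folds                       # grid rows times folds

minutes_per_fold = 70                          # observed runtime quoted above
minutes_per_fit = minutes_per_fold / len(grid)

print(len(grid), fits, round(minutes_per_fit, 2))
```

Under that assumption there are 45 grid rows and 450 fits, and 70 minutes per fold works out to roughly 1.5 minutes per fit.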
This is a kknn learning exercise. I realize there are other types of algorithms that produce better results more efficiently.
Here is the essential code:
x <- as.matrix(train_set2[, c("n_launch_angle", "n_launch_speed", "n_spray_angle_Kolp", "n_spray_angle_adj", "n_hp_to_1b", "n_if_alignment")])
y <- train_set2$events
fitControl <- trainControl(method = "cv", number = 10, p = 0.8, returnData = TRUE,
                           returnResamp = "all", savePredictions = "all",
                           summaryFunction = twoClassSummary, classProbs = TRUE,
                           verboseIter = TRUE)
tuneGrid <- expand.grid(kmax = seq(11, 39, 2),
                        distance = 2,
                        kernel = c("triangular", "gaussian", "optimal"))
kknn_train <- train(x, y, method = "kknn",
                    tuneGrid = tuneGrid, trControl = fitControl)
As we established in the comments, this kind of runtime is to be expected. There are a few steps you can take to reduce it:
Run your code in parallel
Use a more efficient OS, such as Linux
Be more economical in your trainControl(): is it really necessary to have returnResamp = "all"? There are small gains to be had from trimming these options.
Clearly, the first one is a no-brainer. For the second one, I can find as many computer engineers who swear by Linux as who swear by Windows. What convinced me to switch to Linux was this particular test, which I hope will give you what it gave me.
# Calculate distance matrix
test_data <- function(dim, num, seed = 1903) {
  set.seed(seed)
  matrix(rnorm(dim * num), nrow = num)
}
# Benchmarking
This piece of code simply runs faster on the exact same machine when it runs Linux. At least, that was my experience.

cudnnRNNForwardTraining seqLength / xDesc usage

Let's say I have N sequences x[i], each with length seqLength[i] for 0 <= i < N. As far as I understand from the cuDNN docs, they have to be ordered by sequence length, the longest first, so assume that seqLength[i] >= seqLength[i+1]. Let's say that they have the feature dimension D, so x[i] is a 2D tensor of shape (seqLength[i], D). As far as I understand, I should prepare a tensor x where all x[i] are contiguously behind each other, i.e. it would be of shape (sum(seqLength), D).
According to the cuDNN docs, the functions cudnnRNNForwardInference / cudnnRNNForwardTraining take the arguments int seqLength and cudnnTensorDescriptor_t* xDesc, where:
seqLength: Number of iterations to unroll over.
xDesc: Array of tensor descriptors. Each must have the same second dimension. The first dimension may decrease from element n to element n + 1 but may not increase.
I'm not exactly sure I understand this correctly.
Is seqLength my max(seqLength)?
And xDesc is an array. Of what length? max(seqLength)? If so, I assume that it describes one batch of features for each frame, but some of the later frames will have fewer sequences in them. It sounds like the number of sequences per frame is described by the first dimension.
xDesc[t].shape[0] = len([i for i in range(N) if t < seqLength[i]])
for all 0 <= t < max(seqLength). I.e. 0 <= xDesc[t].shape[0] <= N.
How many dimensions does each xDesc[t] describe, i.e. what is len(xDesc[t].shape)? I would assume that it is 2 and the second dimension is the feature dimension, i.e. D, i.e.:
xDesc[t].shape = (len(...), D)
The strides would have to be set accordingly, although it's also not totally clear. If x is stored in row-major order, then
xDesc[0].strides[0] = D * xDesc[0].shape[0]
xDesc[0].strides[1] = 1
But how does cuDNN compute the offset for frame t? I guess it will keep track and thus calculate sum([xDesc[t2].strides[0] for t2 in range(t)]).
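To make this reading concrete, here is a small Python sketch of the bookkeeping I have in mind. It only encodes my guess (per-frame batch sizes, and frame offsets as running sums over a packed buffer); it does not call cuDNN, and the helper name frames_layout is made up.

```python
def frames_layout(seq_lengths, feature_dim):
    """Per-frame batch sizes and element offsets for a packed buffer.

    Assumes seq_lengths is sorted in non-increasing order and that frame t
    holds one feature vector for every sequence that is still running at t.
    """
    assert all(a >= b for a, b in zip(seq_lengths, seq_lengths[1:]))
    max_len = seq_lengths[0] if seq_lengths else 0
    # Guess for xDesc[t].shape[0]: number of sequences with t < seqLength[i]
    batch_sizes = [sum(1 for n in seq_lengths if t < n) for t in range(max_len)]
    offsets = []
    offset = 0
    for size in batch_sizes:
        offsets.append(offset)          # first element of frame t
        offset += size * feature_dim    # frame t occupies size * D elements
    return batch_sizes, offsets, offset

batch_sizes, offsets, total = frames_layout([4, 2, 1], feature_dim=3)
print(batch_sizes, offsets, total)
```

For lengths (4, 2, 1) and D = 3 this gives batch sizes [3, 2, 1, 1], so the total number of elements equals sum(seqLength) * D, which at least is consistent with the buffer size I assumed above.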
Most example code I have seen assumes that all sequences are of the same length. Also, they all describe 3 dimensions per xDesc[t], not 2. Why is that? The third dimension is always 1, as well as the stride of the second and third dimension, and the stride for the first dimension is N. So this assumes that the tensor x is row-major ordered and of shape (max(seqLength), N, D). The code is actually a bit strange. E.g. from TensorFlow:
int dims[] = {batch_size, data_size, 1};
int strides[] = {dims[1] * dims[2], dims[2], 1};
sizeof(dims) / sizeof(dims[0]) /*nbDims*/, dims /*dimA*/,
strides /*strideA*/);
The code looks really similar in all examples I have found. Search for cudnnSetTensorNdDescriptor or cudnnRNNForwardTraining. E.g.:
TensorFlow (issue 6633)
Baidu persistent-rnn
I found one example which can handle sequences of different length. Again search for cudnnSetTensorNdDescriptor:
Microsoft CNTK
That claims that there must be 3 dimensions for every xDesc[t]. It has the comment:
these dimensions are what CUDNN expects: (the minibatch dimension, the data dimension, and the number 1 (because each descriptor describes one frame of data)
Edit: Support for this was added to PyTorch at the end of 2018, in this commit.
Am I missing something from the cuDNN documentation? I really have not found that information in it.
My question is basically: is my conclusion about how to set the arguments x, seqLength and xDesc for cudnnRNNForwardInference / cudnnRNNForwardTraining correct, along with my implicit assumptions? If not, how would I use it, what does the memory layout look like, etc.?

no explicit loop to calculate product of list to some modulo in Mathematica

In Mathematica, do I have to use an explicit loop to calculate the product of the elements of a given (potentially very long) list modulo another number?
Please show me your elegant approach if you have one. Thanks!
Just to give an example
The above is very inefficient, because while calculating the products, one could have taken the modulo to make the multipliers smaller.
Edit 2
I guess my question relates to how to replace for loop for
Module[{ret = initial_value}, For[i = 1, i <= Length[list], i++, ret = general_function[list[[i]],ret]]; ret]
given a general function general_function and a list list.
For long lists a divide-and-conquer is typically faster. The idea is to compute the times-mod for the first and second halves, multiply that, and take the mod.
Here is an example. We'll use a list of 10^6 integers, all between 0 and 10^10.
len = 6;
max = 10;
list = RandomInteger[10^max, 10^len];
Multiplying and taking the modulus, for a slightly larger mod (I wanted to decrease the likelihood that the result was zero):
In[119]:= Timing[Mod[Times @@ list, 32327541]]
Out[119]= {1.360000, 8826597}
Here is a variant of the sort I described. Trial and error tuning indicated that lists of length 2^9 or so were best done nonrecursively, at least for numbers in the size range indicated above.
tmod2[ll_List, m_] := With[{len=Floor[Length[ll]/2]},
  If[len <= 2^9,
    Mod[Times @@ ll, m],
    Mod[tmod2[Take[ll,len],m] * tmod2[Drop[ll,len],m], m]]]
In[120]:= Timing[tmod2[list, 32327541]]
Out[120]= {0.310000, 8826597}
When I increase the list length to 10^7 and allow ints from 0 to 10^20, the first method takes 50 seconds and the second one takes 5 seconds. So clearly the scaling is working to our advantage.
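The same divide-and-conquer scheme can be sketched in Python for comparison. This is an illustration of the idea, not a port with tuned parameters: the cutoff of 512 simply mirrors the 2^9 mentioned above, and prod_mod_split is a made-up name.

```python
import math
import random

def prod_mod_split(values, m, cutoff=512):
    """Product of values modulo m, splitting long lists in half recursively."""
    if len(values) <= cutoff:
        # Short lists: multiply everything, then reduce once
        return math.prod(values) % m
    half = len(values) // 2
    left = prod_mod_split(values[:half], m, cutoff)
    right = prod_mod_split(values[half:], m, cutoff)
    return (left * right) % m

random.seed(1)
values = [random.randint(0, 10**10) for _ in range(10**4)]
m = 32327541
result = prod_mod_split(values, m)
```

Because every partial product is reduced before being multiplied again, the intermediate integers stay small above the cutoff, which is where the speedup comes from.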
For situations where an iteration interleaving two operations might be preferred to divide-and-conquer, one might use Fold as below.
tmod3[ll_List, m_] := Fold[Mod[#1*#2,m]&, First[ll], Rest[ll]]
While not competitive with tmod2 on long lists, this is faster than multiplying out everything prior to invoking Mod. For length 10^7 and max element of 10^20 it takes around 8 seconds to do what tmod2 did in 5.
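For comparison, the fold that interleaves multiplication and reduction looks like this in Python with functools.reduce. It is a sketch with a made-up name, and it differs slightly from tmod3: it starts from 1, so it also handles an empty list.

```python
from functools import reduce

def prod_mod_fold(values, m):
    """Multiply left to right, reducing modulo m after every step."""
    return reduce(lambda acc, x: (acc * x) % m, values, 1)

example = prod_mod_fold([3, 5, 7], 11)   # (3 * 5 * 7) mod 11
```

This is also the general answer to replacing the explicit For loop in the question: a fold takes the accumulator and the next element and returns the new accumulator.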
Why not use Times? The following
will probably be the most efficient. From a recent WRI blog post,
Times knows a clever binary splitting trick that can be used when you have a large number of integer arguments. It is faster to recursively split the arguments into two smaller products, (1*2*…32767)(32768*…*65536), rather than working through the arguments from first to last. It still has to do the same number of multiplications, but fewer of them involve very big integers, and so, on average, are quicker to do
I'm assuming that list in your question is just an example. If you really have to take the product of n consecutive integers starting with 1, then Factorial will be the fastest. i.e.,
Mod[2000!, 32327]
This appears to be as much as twice as fast as Daniel's code on my system:
list = RandomInteger[1*^20, 1*^7];
m = 32327501;
Mod[Times @@ Mod[Times @@@ Partition[list, 50, 50, 1, {}], m], m] // AbsoluteTiming
tmod2[list, m] // AbsoluteTiming
{1.5800904, 21590133}
{3.1081778, 21590133}
Different partition lengths could be used to tune this for your system and work set.
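A rough Python equivalent of the partitioned version, in case the structure is easier to read that way. The chunk size of 50 is taken from the code above; prod_mod_chunked is a made-up name and no timing claims are made for Python.

```python
import math

def prod_mod_chunked(values, m, chunk=50):
    """Reduce each block of `chunk` elements modulo m, then combine the partial results."""
    partials = [math.prod(values[i:i + chunk]) % m
                for i in range(0, len(values), chunk)]
    return math.prod(partials) % m

check = prod_mod_chunked(list(range(1, 201)), 32327)
```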

Bit-wise alternative

I'm trying to write a shader that needs pseudo-random number generation per pixel - fetching from a texture is just too expensive.
All of the generators I've found use ^, <<, & operators, but the shader model I'm working on doesn't support these. Is there a mathematical equivalent of these operators I can use instead?
For reference, I'm valuing speed over precision.
Of those, the only one I know the mathematical equivalent to is the << operator. Namely:
N << X = N * 2^X, i.e. N * (2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, etc.) for X = 1, 2, 3, ...
N << 5 = N * 32
Simply create a lookup table for the values of 2^X (2 to the power X), and multiply by the looked-up value.
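As a quick sanity check of the idea (written in Python rather than shader code, with made-up names POW2 and shift_left), the lookup-and-multiply version matches the real shift operator for non-negative integers:

```python
# Lookup table: POW2[x] holds 2 to the power x
POW2 = [2 ** x for x in range(16)]

def shift_left(n, x):
    """Equivalent of n << x using only a table lookup and a multiplication."""
    return n * POW2[x]

print(shift_left(3, 5))
```

In a shader the table would be a constant array, and the result is only exact as long as it stays within the precision of the number type you are using.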
The others are going to be more complicated, and will probably require that you write an algorithm to solve them. I don't think they have any direct mathematical equivalents.
The source code for a C runtime implementation might be useful for that. Or simply search for algorithms to implement each, such as: Fast implementation/approximation of pow() function in C/C++
