Avoiding array allocations is good for performance. However, I have yet to understand the most efficient way to perform a QR decomposition of a matrix A. (Note: both the Q and R matrices are needed.)
Simply using
Q, R = qr(A)
is probably not the best idea, since it allocates both Q and R, where both could be re-allocated.
The function qrfact allows one to store factorization in a packed format. However, I would still write afterwards:
F = qrfact(A); Q = F[:Q]; R = F[:R]
once again allocating new arrays for Q and R. Finally, the documentation also suggests the qrfact! function, which saves space by overwriting the input A, instead of creating a copy. However, if one uses F = qrfact!(A)
the overwritten A is not useful, in the sense that it is neither Q nor R, which are what I need.
So my two questions are:
What is the best/most efficient way to perform a QR decomposition if you only care about the matrices Q and R and you have no problem re-allocating them?
What is actually written in the matrix A when one calls qrfact!(A) ?
In
F = qrfact!(A)
or
F = qrfact(A)
F[:Q] and F[:R] do not allocate new dense arrays; they are simply views over the packed format from which Q and R are easily computed. This means that qrfact!(A) doesn't need to allocate arrays for Q and R, it simply computes the packed format in place for A.
However, that also means that F[:Q] and F[:R] cannot be mutated. If you need to modify one of them for whatever reason, you will need to collect it into a mutable Array, and this will certainly allocate. It will still be more efficient to use qrfact!(A) instead of qrfact(A), because the latter will allocate space for the packed QR factorization as well as for the collected Array.
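To make the "packed format" concrete, here is a minimal pure-Python sketch of an in-place Householder QR. It is not Julia's implementation; it only follows the usual LAPACK-style convention (R in the upper triangle, Householder vectors below the diagonal with an implied leading 1, plus a separate list of scalars tau). The function names householder_qr_inplace and unpack_qr are made up for this illustration.

```python
import math

def householder_qr_inplace(a):
    """Overwrite the m x n list-of-lists `a` with a packed QR factorization.

    Afterwards the upper triangle of `a` holds R, and the entries below the
    diagonal hold the Householder vectors (their leading 1 is implied).
    Returns the list of scalars tau, one per reflector H = I - tau * v * v^T.
    """
    m, n = len(a), len(a[0])
    tau = []
    for k in range(min(m, n)):
        # Norm of column k from the diagonal downwards.
        norm = math.sqrt(sum(a[i][k] ** 2 for i in range(k, m)))
        if norm == 0.0:
            tau.append(0.0)  # nothing to eliminate; reflector is the identity
            continue
        alpha = a[k][k]
        beta = -math.copysign(norm, alpha)  # new diagonal entry of R
        t = (beta - alpha) / beta
        v = [1.0] + [a[i][k] / (alpha - beta) for i in range(k + 1, m)]
        # Apply the reflector to the remaining columns.
        for j in range(k + 1, n):
            dot = sum(v[i - k] * a[i][j] for i in range(k, m))
            for i in range(k, m):
                a[i][j] -= t * dot * v[i - k]
        # Store R's diagonal entry and the reflector in column k.
        a[k][k] = beta
        for i in range(k + 1, m):
            a[i][k] = v[i - k]
        tau.append(t)
    return tau

def unpack_qr(packed, tau):
    """Build dense Q (m x m) and R (m x n) from the packed form; this allocates."""
    m, n = len(packed), len(packed[0])
    r = [[packed[i][j] if j >= i else 0.0 for j in range(n)] for i in range(m)]
    q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    # Q = H_0 * H_1 * ... ; apply the reflectors to the identity, last one first.
    for k in reversed(range(len(tau))):
        v = [1.0] + [packed[i][k] for i in range(k + 1, m)]
        for j in range(m):
            dot = sum(v[i - k] * q[i][j] for i in range(k, m))
            for i in range(k, m):
                q[i][j] -= tau[k] * dot * v[i - k]
    return q, r
```

The factorization itself needs no storage beyond the input matrix and tau; dense copies of Q and R only appear when you explicitly unpack them, which matches the trade-off described above.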
I want to know the performance characteristics of xla::Reshape. Specifically, I can imagine that it could be implemented by simply remapping XlaOp metadata e.g. addresses, rather than creating a whole new XlaOp. Alternatively, does XLA's fusion or some other technique essentially make it cheap?
The reason I ask is because I'm working out how to map a function over a tensor, for example a function
f : [p, q] -> [r]
over an [n, m, p, q] to get an [n, m, r]. One option I have is to flatten leading dimensions, require the function allows a single leading dimension, e.g.
f' : [n, p, q] -> [n, r]
then reshape the result as required. However, this is only feasible if flattening and expanding is performant.
Tagging Jax because I imagine it's the same story there. Of course Jax has vmap/pmap which make this unnecessary.
It depends on whether the reshape changes the physical layout of the tensor.
Usually, an XLA reshape is accompanied by a change in the tensor's physical layout, and can then cost more than a bitcast operation (which does not change the physical layout and so has almost no overhead).
However, if the reshape does not require a physical layout change, it can be cheap.
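As an illustration of when a reshape needs no data movement, the sketch below checks that flattening the two leading dimensions of an [n, m, p, q] array leaves every element at the same linear offset. It assumes a dense row-major layout, which is an assumption of this example only: XLA chooses layouts per backend, so this says nothing definite about what XLA actually does. The helper names are made up.

```python
from itertools import product

def row_major_offset(index, shape):
    """Linear offset of `index` in a dense row-major array of `shape`."""
    offset = 0
    for i, dim in zip(index, shape):
        offset = offset * dim + i
    return offset

def flatten_is_bitcast(n, m, p, q):
    """True if [n, m, p, q] -> [n*m, p, q] keeps every element in place."""
    return all(
        row_major_offset((i, j, k, l), (n, m, p, q))
        == row_major_offset((i * m + j, k, l), (n * m, p, q))
        for i, j, k, l in product(range(n), range(m), range(p), range(q))
    )

print(flatten_is_bitcast(2, 3, 4, 5))  # True
```

Under that layout assumption, the flatten-then-expand approach from the question only relabels indices; a reshape that follows a transpose would not have this property.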
I have one tensor A of dimension [a,b,c,d], and another B of dimension [b,b,d,e], and C, a list of [a] integers from 0 to b. I need to produce the tensor D of dimension [a,b,c,e] given by
D[i,j,k,l] = sum for m=0..d of A[i,C[i],k,m] * B[C[i],j,m,l]
b is small enough (3 or 5, usually?) that I don't mind doing this in b separate operations -- but I can't afford the waste by going to something that takes b^2 memory or time, when this operation clearly should be linear in b. This seems like it will be some combination of pointwise multiplies (with broadcasting?) and tensor contractions (a matrix multiply across the common m dimension), but I can't pin it down.
If someone can really convince me that this isn't possible in O(b) flops with tensorflow's provided operations, then okay, but then I'd want an O(b^2) for sure.
Update: It's looking like the appropriately modified A tensors can be built individually using tf.gather_nd; if this can then be paired up with B somehow, maybe? Unfortunately my experiments in this so far led to finding a bug in tf.gather_nd itself which has slowed things down.
I figured out how to accomplish this, reasonably efficiently. First build a modified version of B with tf.gather, with the appropriate parts in the first index:
B2 = tf.gather(B, C)
Then pull out just the relevant parts of the A tensor using tf.gather_nd. We're going to pull out a bunch of pairs of indices of the form [0,C[0]], [1,C[1]], [2,C[2]]... and so on, so first we need to build the index tensor.
a = tf.shape(A)[0]
A2_indices = tf.stack([tf.range(a), C], axis=1)  # shape [a, 2]: one [i, C[i]] pair per row
A2 = tf.gather_nd(A, A2_indices)
producing A2 with shape [a,c,d]. Now we need to multiply A2 and B2 appropriately. It's a tensor contraction over the m index (index 2 in both A2 and B2) but a pointwise multiplication in the i index (0 in both). This means that, sadly, the result is neither a pure tensor contraction nor a pure pointwise multiplication! One option would be computing the tensor product, contracting only over m, and then taking tf.diag over the two i indices, but this would waste a lot of computation building the rest of a matrix that we don't need. Instead, we can think of this as a batched matrix multiplication: this used to be called tf.batch_matmul, but now it's just tf.matmul. This has the caveat, though, that besides the 2 matrix dimensions in each input tensor, the rest all have to be pointwise multiplies. A2 and B2 fail this criterion, because B2 has the additional j index. But we could "wrap that in" with the l output dimension, and then remove it later. This means first calling tf.transpose to put j and l next to each other, then tf.reshape to merge them into one j*l output dimension, then doing tf.matmul, then another tf.reshape and tf.transpose to return to the original form. So
a, b, d, e = B2.get_shape().as_list()
c = A2.get_shape().as_list()[1]
B2_trans = tf.transpose(B2, perm=[0,2,1,3])  # [a,d,b,e]
B2_jl = tf.reshape(B2_trans, [a,d,b*e])
product_jl = tf.matmul(A2, B2_jl)  # [a,c,b*e]
product_trans = tf.reshape(product_jl, [a,c,b,e])
result = tf.transpose(product_trans, perm=[0,2,1,3])  # [a,b,c,e]
Which finishes it up! Of course in practice it may well be that B is only needed in this one instance, in which case it may be that B can start out already in the "compressed" state, saving a transpose (and a cheap reshape); or if A2 is going to be flattened or transposed anyway then it could also save a transpose. But overall everything is pretty minimal complexity. :)
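For anyone who wants to check the index bookkeeping without a TensorFlow session, here is a sketch of the same gather / transpose / reshape / batched-matmul sequence in NumPy (swapped in for TensorFlow purely for illustration), compared against a direct loop over the defining formula. The function names are made up for this example.

```python
import numpy as np

def contract_reference(A, B, C):
    """Direct evaluation of D[i,j,k,l] = sum_m A[i,C[i],k,m] * B[C[i],j,m,l]."""
    a, _, c, _ = A.shape
    b, e = B.shape[1], B.shape[3]
    D = np.empty((a, b, c, e))
    for i in range(a):
        # A[i, C[i]] is [c, d]; B[C[i]] is [b, d, e]
        D[i] = np.einsum("km,jml->jkl", A[i, C[i]], B[C[i]])
    return D

def contract_batched(A, B, C):
    """Same result via gathers plus one batched matrix multiply."""
    a, _, c, d = A.shape
    b, e = B.shape[1], B.shape[3]
    A2 = A[np.arange(a), C]                                  # [a, c, d]
    B2 = B[C]                                                # [a, b, d, e]
    B2_jl = B2.transpose(0, 2, 1, 3).reshape(a, d, b * e)    # merge j and l
    product_jl = A2 @ B2_jl                                  # [a, c, b*e]
    return product_jl.reshape(a, c, b, e).transpose(0, 2, 1, 3)  # [a, b, c, e]
```

Both functions do work proportional to b rather than b^2, since only the C[i] slice of each tensor is ever touched.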
I am using the following function:
kernel = @(X,Y,sigma) exp((-pdist2(X,Y,'euclidean').^2)./(2*sigma^2));
to compute a series of kernels, in the following way:
K = [(1:size(featureVectors,1))', kernel(featureVectors,featureVectors, sigma)];
However, since featureVectors is a huge matrix (something like 10000x10000), it takes a really long time to compute the kernels (e.g., K).
Is it possible to somehow speed up the computation?
EDIT: Context
I am using a classifier via libsvm, with a gaussian kernel, as you may have noticed from the variable names and semantics.
I am using now (more or less) #terms~=10000 and #docs~=10000. This #terms resulted after stopwords removal and stemming. This course indicates that having 10000 features makes sense.
Unfortunately, libsvm does not implement automatically the Gaussian kernel. Thus, it is required to compute it by hand. I took the idea from here, but the kernel computation (as suggested by the referenced question) is really slow.
You are using pdist2 with two equal input arguments (X and Y are equal when you call kernel). You could save half the time by computing each pair only once. You do that using pdist and then squareform:
kernel = @(X,sigma) exp((-squareform(pdist(X,'euclidean')).^2)./(2*sigma^2));
K = [(1:size(featureVectors,1))', kernel(featureVectors, sigma)];
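The "compute each pair only once" idea, sketched in plain Python rather than MATLAB: only the upper triangle is evaluated and each value is mirrored into the lower triangle. This is meant to show the symmetry trick, not to be fast; the function name is made up, and for 10000 rows you would want a vectorized routine like the pdist/squareform version above.

```python
import math

def gaussian_kernel_matrix(X, sigma):
    """Gaussian kernel K[i][j] = exp(-||x_i - x_j||^2 / (2*sigma^2)) over rows of X."""
    n = len(X)
    # The diagonal is exp(0) = 1, so start from a matrix of ones.
    K = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # each unordered pair is visited once
            sq_dist = sum((xi - xj) ** 2 for xi, xj in zip(X[i], X[j]))
            K[i][j] = K[j][i] = math.exp(-sq_dist / (2.0 * sigma ** 2))
    return K
```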
Your exponential function decays very quickly: for distances of several sigma, the kernel is essentially zero. You can skip those cases and save time:
function z = kernel(X, Y, sigma)
d = pdist2(X,Y,'euclidean');
z = zeros(size(d)); % start with zeros
m = d < 3 * sigma;
z(m) = exp(-d(m).^2/(2*sigma^2));
end
So I have a function that takes four numerical arguments and produces a numerical argument.
f(w,x,y,z) --> A
If I have the function f and a target result A, is there an iterative method for discovering parameters w,x,y,z that produce a given number A?
If it helps, my function f is a quintic bezier where most of the parameters are determined. I have isolated just these four that are required to fit the value A.
Q(t)=R(1−t)^5+5S(1−t)^4*t+10T(1−t)^3*t^2+10U(1−t)^2*t^3+5V(1−t)t^4+Wt^5
R,S,T,U,V,W are vectors where R and W are known, I have isolated only a single element in each of S,T,U,V that vary as parameters.
The set of solutions of the equation f(w,x,y,z)=A (where all of w, x, y, z and A are scalars) is, in general, a 3 dimensional manifold (surface) in the 4-dimensional space R^4 of (w,x,y,z). I.e., the solution is massively non-unique.
Now, if f is simple enough for you to compute its derivative, you can use Newton's method to find a solution: the gradient is the direction of fastest change of the function, so you step along it.
Specifically, let X_0=(w_0,x_0,y_0,z_0) be your initial approximation of a solution and let G=f'(X_0) be the gradient at X_0.
Then f(X_0+h)=f(X_0)+(G,h)+O(|h|^2) (where (a,b) is the dot product).
Let h=a*G, and solve A=f(X_0)+a*|G|^2 to get a=(A-f(X_0))/|G|^2 (if G=0, change X_0) and X_1=X_0+a*G. If f(X_1) is close enough to A, you are done, otherwise proceed to compute f'(X_1) &c.
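A minimal sketch of that iteration in Python. The solver takes f and its gradient as callables; the Bezier example after it assumes the curve is evaluated at one fixed t and that the four parameters are single scalar components of S, T, U, V, which is my reading of the question and not something it states. The constants T_PARAM, R_END and W_END are hypothetical values chosen for the demonstration.

```python
def solve_for_target(f, grad, x0, target, tol=1e-10, max_iter=50):
    """Iterate X_{k+1} = X_k + a*G with a = (target - f(X_k)) / |G|^2."""
    x = list(x0)
    for _ in range(max_iter):
        residual = target - f(x)
        if abs(residual) < tol:
            break
        g = grad(x)
        g_norm_sq = sum(gi * gi for gi in g)
        if g_norm_sq == 0.0:
            raise ValueError("zero gradient; pick a different starting point")
        step = residual / g_norm_sq
        x = [xi + step * gi for xi, gi in zip(x, g)]
    return x

def bezier_weights(t):
    """Bernstein weights multiplying S, T, U, V in the quintic Bezier."""
    return [5 * (1 - t) ** 4 * t, 10 * (1 - t) ** 3 * t ** 2,
            10 * (1 - t) ** 2 * t ** 3, 5 * (1 - t) * t ** 4]

# Hypothetical fixed curve parameter and known end control values.
T_PARAM, R_END, W_END = 0.5, 0.0, 1.0

def f(x):
    """One component of Q(T_PARAM) as a function of the four free parameters."""
    w = bezier_weights(T_PARAM)
    return (R_END * (1 - T_PARAM) ** 5
            + sum(wi * xi for wi, xi in zip(w, x))
            + W_END * T_PARAM ** 5)

def grad(x):
    """f is linear in x at fixed t, so the gradient is just the weights."""
    return bezier_weights(T_PARAM)

solution = solve_for_target(f, grad, [0.0, 0.0, 0.0, 0.0], target=0.75)
```

Because f is linear in the parameters under these assumptions, the first step already lands on a solution; for a nonlinear f the loop simply keeps iterating. The solution found is the one closest to the starting point along the gradient, which is one of infinitely many.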
If you cannot compute f', you can play with many other methods.
If you can impose 3 (or more) additional equations that you know (or suspect) must be true for your 4-variable solution that gives target value A, then you can try applying Newton's method for solving a system of k equations with k unknowns. Otherwise, without a deeper understanding of the structure of the function you are trying to make equal to A, the only general type of technique I'm aware of that's easy to implement is to simply define the error function as g(w,x,y,z) = |f(w,x,y,z) - A| and search for a minimum of g. Typically the "minimum" found will be a local minimum, so it may require many restarts of the minimization with different starting values for your parameters to actually find a minimum with g = 0. This is very easy to implement and try in a few lines, e.g. in MATLAB using fminsearch.
Assume that we are working with a language which stores arrays in column-major order. Assume also that we have a function which uses 2-D array as an argument, and returns it.
I'm wondering whether it is (or isn't) in general beneficial to transpose this array when calling the function, in order to work with column-wise operations instead of row-wise operations, or whether the transposing negates the benefits of column-wise operations.
As an example, in R I have an object of class ts named y which has dimension n x p, i.e. I have p time series of length n.
I need to make some computations with y in Fortran, where I have two loops with following kind of structure:
do i = 1, n
do j= 1, p
!just an example, some row-wise operations on `y`
x(i,j) = a*y(i,j)
D = ddot(m,y(i,1:p),1,b,1)
! ...
end do
end do
Since Fortran (like R) uses column-wise storage, it would be better to make the computations with a p x n array instead. So instead of
out<-.Fortran("something",y=array(y,dim(y)),x=array(0,dim(y)))
ynew<-out$y
x<-out$x
I could use
out<-.Fortran("something2",y=t(array(y,dim(y))),x=array(0,dim(y)[2:1]))
ynew<-t(out$y)
x<-t(out$x)
where Fortran subroutine something2 would be something like
do i = 1, n
do j= 1, p
!just an example, some column-wise operations on `y`
x(j,i) = a*y(j,i)
D = ddot(m,y(1:p,i),1,b,1)
! ...
end do
end do
Does the choice of approach always depend on the dimensions n and p or is it possible to say one approach is better in terms of computation speed and/or memory requirements? In my application n is usually much larger than p, which is 1 to 10 in most cases.
More of a comment, but I wanted to include a bit of code: under old-school F77 you would essentially be forced to use the second approach, as
y(1:p,i)
is simply a pointer to y(1,i), with the following p values contiguous in memory.
the first construct
y(i,1:p)
is a list of values spaced apart in memory, so it seems to require making a copy of the data to pass to the subroutine. I say "seems" because I haven't the foggiest idea how a modern optimizing compiler deals with these things. I tend to think that at best it's a wash, and at worst this could really hurt. Imagine an array so large you need to page swap to access the whole vector.
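To see the memory pattern behind this, here is a small Python sketch that computes the zero-based memory offsets of the two slices under column-major storage. The sizes n and p are hypothetical, and the helper name is made up.

```python
def column_major_offsets(shape, index_list):
    """Zero-based memory offsets of 1-based (row, col) indices, column-major."""
    n_rows, _ = shape
    return [(i - 1) + (j - 1) * n_rows for i, j in index_list]

n, p = 1000, 4  # hypothetical sizes: n observations, p series

# y(1,1:p) in an n x p array: consecutive elements are n apart.
row_slice = column_major_offsets((n, p), [(1, j) for j in range(1, p + 1)])
# y(1:p,1) in a p x n array: consecutive elements are adjacent.
col_slice = column_major_offsets((p, n), [(j, 1) for j in range(1, p + 1)])

print(row_slice)  # [0, 1000, 2000, 3000]
print(col_slice)  # [0, 1, 2, 3]
```

The row slice has stride n, so passing it to a routine that expects contiguous data may force a temporary copy, while the column slice can be passed as is.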
In the end the only way to answer this is to test it yourself
----------edit
Did a little testing and confirmed my hunch: passing rows y(i,1:p) does cost you vs. passing columns y(1:p,i). I used a subroutine that does practically nothing to see the difference. My guess is that with any real subroutine the hit is negligible.
Btw (and maybe this helps understand what goes on) passing every other value in a column
y(1:p:2,i) takes longer (orders of magnitude) than passing the whole column, while passing every other value in a row cuts the time in half vs. passing a whole row.
(using gfortran 12..)