I want to know the performance characteristics of xla::Reshape. Specifically, I can imagine that it could be implemented by simply remapping XlaOp metadata e.g. addresses, rather than creating a whole new XlaOp. Alternatively, does XLA's fusion or some other technique essentially make it cheap?
The reason I ask is because I'm working out how to map a function over a tensor, for example a function
f : [p, q] -> [r]
over an [n, m, p, q] tensor to get an [n, m, r] tensor. One option I have is to flatten the leading dimensions and require that the function accepts a single leading dimension, e.g.
f' : [n, p, q] -> [n, r]
then reshape the result as required. However, this is only feasible if flattening and expanding is performant.
Tagging JAX because I imagine it's the same story there. Of course JAX has vmap/pmap, which make this unnecessary.
It depends on whether the reshape changes the physical layout of the tensor.
Usually, XLA's reshape is accompanied by a change in the tensor's physical layout, and it may then cost more than a bitcast operation (which does not change the physical layout and so has almost no overhead).
However, if the reshape does not require a physical layout change, it can be cheap.
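To illustrate why flattening leading dimensions can avoid a layout change, here is a minimal sketch in plain Python. It assumes a dense row-major buffer, which is an assumption of the example and not necessarily the layout XLA assigns; the helper names are made up for illustration. Under that assumption every element keeps its memory offset when [n, m, p, q] is viewed as [n*m, p, q], so only the shape metadata differs.

```python
def row_major_strides(shape):
    """Element strides of a dense row-major array with the given shape."""
    strides = []
    step = 1
    for extent in reversed(shape):
        strides.append(step)
        step *= extent
    return tuple(reversed(strides))


def offset(index, strides):
    """Linear memory offset of a multi-dimensional index."""
    return sum(i * s for i, s in zip(index, strides))


n, m, p, q = 2, 3, 4, 5
full = row_major_strides((n, m, p, q))   # strides of the original shape
flat = row_major_strides((n * m, p, q))  # strides after flattening n and m

# Element (i, j, k, l) of the original is element (i*m + j, k, l) of the
# flattened view; check that both names refer to the same memory offset.
same_offsets = all(
    offset((i, j, k, l), full) == offset((i * m + j, k, l), flat)
    for i in range(n)
    for j in range(m)
    for k in range(p)
    for l in range(q)
)
```

Whether the compiler actually treats a given reshape as a bitcast depends on the layouts it has chosen, which this sketch does not model.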
I have N lists of items
eg:
A, B, C, D
1, 2, 3
V, W, X, Y, Z
They are flattened into a single long list, and the user chooses an ordering to their liking
eg:
1, C, X, 3, B, A, Y, Z, 2, W, D, V
I need to re-order my N original lists so their relative sort order matches that in the user's ordering
eg:
C, B, A, D
1, 3, 2
X, Y, Z, W, V
The simple brute-force approach is to create N new empty containers, loop over the user's ordering and add each item into the relevant container as it is encountered.
Is there a more elegant approach?
There is not possibly a more elegant approach unless assumptions can be made about the ordering of the data.
You must, at some point, create each of the N new containers.
You must also, at some point, add the necessary elements to those N containers.
These two things cannot be avoided. Your approach has both of those and nothing more, and thus is proved minimal.
A minor caveat is that block array copies are slightly faster than iterative copies, so if you know of large blocks that are the same, then you can make a slightly faster copy for those blocks. But usually, in order to get that information, you must first visit and analyze the data. So instead of visiting and analyzing, you should just visit and insert.
You must either store the knowledge of which container an element came from at the beginning, or else store a mapping of element to position in the list and use that for sorting. (Or else save memory and do a ton of searching.)
If you're going to rearrange all lists, then it is more efficient to store the knowledge of which container each element comes from and proceed as you suggest. If you're going to only rearrange SOME lists (or rearrange future lists), then it may make more sense to store the mapping of element to position in the list and sort based on that. Which you can do either with a comparison function that goes through that lookup, or with a Schwartzian transform.
BTW have you thought about how to handle repeated elements?
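As a rough sketch of the approach discussed above (remember which container each element came from, then walk the user's ordering once), something like the following would work. The function name is hypothetical, and the sketch assumes every item is unique across all lists, so it does not answer the repeated-elements question.

```python
def reorder_lists(lists, user_order):
    """Reorder each list so its items follow their relative order in user_order.

    Assumes each item appears in exactly one list and exactly once.
    """
    # Remember which container each item came from.
    origin = {item: idx for idx, lst in enumerate(lists) for item in lst}
    # One new empty container per original list.
    result = [[] for _ in lists]
    # Single pass over the user's ordering: visit and insert.
    for item in user_order:
        result[origin[item]].append(item)
    return result


lists = [["A", "B", "C", "D"], ["1", "2", "3"], ["V", "W", "X", "Y", "Z"]]
user_order = ["1", "C", "X", "3", "B", "A", "Y", "Z", "2", "W", "D", "V"]
reordered = reorder_lists(lists, user_order)
```

With the example data from the question, `reordered` comes out as C, B, A, D / 1, 3, 2 / X, Y, Z, W, V.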
I have one tensor A of dimension [a,b,c,d], and another B of dimension [b,b,d,e], and C, a list of [a] integers from 0 to b. I need to produce the tensor D of dimension [a,b,c,e] given by
D[i,j,k,l] = sum for m=0..d of A[i,C[i],k,m] * B[C[i],j,m,l]
b is small enough (3 or 5, usually?) that I don't mind doing this in b separate operations -- but I can't afford the waste by going to something that takes b^2 memory or time, when this operation clearly should be linear in b. This seems like it will be some combination of pointwise multiplies (with broadcasting?) and tensor contractions (a matrix multiply across the common m dimension), but I can't pin it down.
If someone can really convince me that this isn't possible in O(b) flops with tensorflow's provided operations, then okay, but then I'd want an O(b^2) for sure.
Update: It's looking like the appropriately modified A tensors can be built individually using tf.gather_nd; if this can then be paired up with B somehow, maybe? Unfortunately my experiments in this so far led to finding a bug in tf.gather_nd itself which has slowed things down.
I figured out how to accomplish this, reasonably efficiently. First build a modified version of B with tf.gather, with the appropriate parts in the first index:
B2 = tf.gather(B, C)
Then pull out just the relevant parts of the A tensor using tf.gather_nd. We're going to pull out a bunch of pairs of indices of the form [0,C[0]], [1,C[1]], [2,C[2]]... and so on, so first we need to build the index tensor.
a = tf.shape(A)[0]
A2_indices = tf.stack([tf.range(a), C], axis=1)  # shape [a,2]: rows are [i, C[i]]
A2 = tf.gather_nd(A, A2_indices)
producing A2 with shape [a,c,d]. Now we need to multiply A2 and B2 appropriately. It's tensor contraction in the m index (axis 2 in both A2 and B2) but pointwise multiplication in the i index (0 in both). This means that, sadly, the resulting item isn't tensor contraction or pointwise multiplication! One option would be computing the tensor product and contracting only over m, and then taking tf.diag over the two i indices -- but this would waste a lot of computation building the rest of a matrix that we don't need. Instead, we can think of this as a batched matrix multiplication: this used to be called tf.batch_matmul but now it's just tf.matmul. This has the caveat, though, that besides the 2 matrix dimensions in each input tensor, the rest all have to be pointwise multiplies. A2 and B2 fail this criterion, because B2 has the additional j index. But we could "wrap that in" with the l output dimension, and then remove it later. This means first calling tf.transpose to put j and l next to each other, then tf.reshape to turn them into one j*l output dimension, then doing tf.matmul, then another tf.reshape and tf.transpose to return to the original form. So
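# B2 has shape [a,b,d,e]; A2 has shape [a,c,d]
a, b, d, e = B2.get_shape().as_list()
c = A2.get_shape().as_list()[1]
# Move j next to l, then merge them into a single j*l dimension
B2_trans = tf.transpose(B2, perm=[0,2,1,3])
B2_jl = tf.reshape(B2_trans, [a,d,b*e])
# Batched matmul over i: [a,c,d] x [a,d,b*e] -> [a,c,b*e]
product_jl = tf.matmul(A2, B2_jl)
# Split j*l again and move j back in front of k
product_trans = tf.reshape(product_jl, [a,c,b,e])
result = tf.transpose(product_trans, perm=[0,2,1,3])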
Which finishes it up! Of course in practice it may well be that B is only needed in this one instance, in which case it may be that B can start out already in the "compressed" state, saving a transpose (and a cheap reshape); or if A2 is going to be flattened or transposed anyway then it could also save a transpose. But overall everything is pretty minimal complexity. :)
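For checking a result like this on small inputs, a slow reference implementation of the formula from the question can be handy. The following is only a sketch in plain Python over nested lists (the function name is made up, and it is not meant to be fast); it evaluates D[i,j,k,l] directly as the sum over m of A[i,C[i],k,m] * B[C[i],j,m,l], which is linear in b.

```python
def contract_reference(A, B, C):
    """Direct evaluation of D[i][j][k][l] = sum_m A[i][C[i]][k][m] * B[C[i]][j][m][l].

    A has shape [a,b,c,d], B has shape [b,b,d,e], C is a list of a integers.
    """
    a, b = len(A), len(A[0])
    c, d = len(A[0][0]), len(A[0][0][0])
    e = len(B[0][0][0])
    D = [[[[0] * e for _ in range(c)] for _ in range(b)] for _ in range(a)]
    for i in range(a):
        s = C[i]  # the slice selected for batch element i
        for j in range(b):
            for k in range(c):
                for l in range(e):
                    D[i][j][k][l] = sum(
                        A[i][s][k][m] * B[s][j][m][l] for m in range(d)
                    )
    return D


# Tiny example with a=1, b=2, c=1, d=2, e=1.
A = [[[[1, 2]], [[3, 4]]]]
B = [[[[1], [0]], [[0], [1]]], [[[2], [3]], [[4], [5]]]]
C = [1]
D = contract_reference(A, B, C)
```

Here C[0] = 1 selects the row [3, 4] of A, so D[0][0][0][0] = 3*2 + 4*3 and D[0][1][0][0] = 3*4 + 4*5.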
Avoiding array allocations is good for performance. However, I have yet to understand what the most efficient way is to perform a QR decomposition of a matrix A. (Note: both the Q and R matrices are needed.)
Simply using
Q, R = qr(A)
is probably not the best idea, since it allocates both Q and R, where both could be re-allocated.
The function qrfact allows one to store factorization in a packed format. However, I would still write afterwards:
F = qrfact(A); Q = F[:Q]; R = F[:R]
once again allocating new arrays for Q and R. Finally, the documentation also suggests the qrfact! function, which saves space by overwriting the input A instead of creating a copy. However, if one uses F = qrfact!(A), the overwritten A is not useful, in the sense that it is neither Q nor R, which are what I need.
So my two questions are:
What is the best/most efficient way to perform a QR decomposition if you only care about the matrices Q and R and you have no problem re-allocating them?
What is actually written in the matrix A when one calls qrfact!(A) ?
In
F = qrfact!(A)
or
F = qrfact(A)
F[:Q] and F[:R] do not allocate new dense arrays; they are simply views over the packed format from which Q and R are easily computed. This means that qrfact!(A) doesn't need to allocate arrays for Q and R, it simply computes the packed format in place for A.
However, that also means that F[:Q] and F[:R] cannot be mutated. If you need to modify one of them for whatever reason, you will need to collect it into a mutable Array, and this will certainly allocate. It will still be more efficient to use qrfact!(A) instead of qrfact(A), because the latter will allocate space for the packed QR factorization as well as for the collected Array.
Assume that we are working with a language which stores arrays in column-major order. Assume also that we have a function which takes a 2-D array as an argument and returns it.
I'm wondering whether it can be claimed that it is (or isn't) in general beneficial to transpose this array when calling the function, in order to work with column-wise operations instead of row-wise operations, or whether the transposing negates the benefits of column-wise operations.
As an example, in R I have an object of class ts named y which has dimension n x p, i.e. I have p time series of length n.
I need to make some computations with y in Fortran, where I have two loops with following kind of structure:
do i = 1, n
  do j = 1, p
    !just an example, some row-wise operations on `y`
    x(i,j) = a*y(i,j)
    D = ddot(m,y(i,1:p),1,b,1)
    ! ...
  end do
end do
As Fortran (as does R) uses column-wise storage, it would be better to make the computations with p x n array instead. So instead of
out <- .Fortran("something", y=array(y,dim(y)), x=array(0,dim(y)))
ynew <- out$y
x <- out$x
I could use
out <- .Fortran("something2", y=t(array(y,dim(y))), x=array(0,dim(y)[2:1]))
ynew <- t(out$y)
x <- t(out$x)
where Fortran subroutine something2 would be something like
do i = 1, n
  do j = 1, p
    !just an example, some column-wise operations on `y`
    x(j,i) = a*y(j,i)
    D = ddot(m,y(1:p,i),1,b,1)
    ! ...
  end do
end do
Does the choice of approach always depend on the dimensions n and p or is it possible to say one approach is better in terms of computation speed and/or memory requirements? In my application n is usually much larger than p, which is 1 to 10 in most cases.
More of a comment, but I wanted to put in a bit of code: under old-school F77 you would essentially be forced to use the second approach, as
y(1:p,i)
is simply a pointer to y(1,i), with the following p values contiguous in memory.
The first construct
y(i,1:p)
is a list of values interspersed in memory, so it seems to require making a copy of the data to pass to the subroutine. I say "seems" because I haven't the foggiest idea how a modern optimizing compiler deals with these things. I tend to think that at best it's a wash, and at worst this could really hurt. Imagine an array so large that you need to page swap to access the whole vector.
In the end, the only way to answer this is to test it yourself.
----------edit
Did a little testing and confirmed my hunch: passing rows y(i,1:p) does cost you compared with passing columns y(1:p,i). I used a subroutine that does practically nothing to see the difference. My guess is that with any real subroutine the hit is negligible.
By the way (and maybe this helps in understanding what goes on), passing every other value in a column
y(1:p:2,i) takes longer (by orders of magnitude) than passing the whole column, while passing every other value in a row cuts the time in half compared with passing a whole row.
(using gfortran 12..)
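The contiguity argument above can be made concrete by computing memory offsets by hand. This is only an illustrative sketch in Python (the helper name is made up, and it says nothing about what a particular compiler does): in column-major storage, element (i, j) of an array with leading dimension ld sits at offset (i-1) + (j-1)*ld.

```python
def column_major_offset(i, j, leading_dim):
    """0-based memory offset of the Fortran element (i, j), with 1-based i and j."""
    return (i - 1) + (j - 1) * leading_dim


n, p = 1000, 4

# Row section y(2,1:p) of an n x p array: consecutive elements are n apart.
row_section = [column_major_offset(2, k, n) for k in range(1, p + 1)]

# Column section y(1:p,2) of a p x n array: consecutive elements are adjacent.
column_section = [column_major_offset(k, 2, p) for k in range(1, p + 1)]
```

The row section touches offsets 1, 1001, 2001, 3001, while the column section touches 4, 5, 6, 7, which is why the latter can be passed as a plain pointer and the former may need a temporary copy.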
While working on the simulation of particle interactions, I stumbled across grid indexing in Morton order (Z-order), which is regarded as providing an efficient nearest-neighbor cell search. The main reason that I've read is the almost sequential ordering of spatially close cells in memory.
Being in the middle of a first implementation, I cannot wrap my head around how to efficiently implement the algorithm for the nearest neighbors, especially in comparison to a basic uniform grid.
Given a cell (x,y) it is trivial to obtain the 8 neighbor cell indices and compute the respective z-index. Although this provides constant access time to the elements, the z-index has to be either calculated or looked up in predefined tables (separate for each axis and OR'ing). How can this possibly be more efficient? Is it true that accessing elements in an array A in an order such as A[0] -> A[1] -> A[3] -> A[4] -> ... is more efficient than in an order such as A[1023] -> A[12] -> A[456] -> A[56] -> ...?
I expected that there exists a simpler algorithm to find the nearest neighbors in z-order. Something along the lines of: find the first cell of the neighbors, iterate. But this can't be true, as this works nicely only within 2^4 sized blocks. There are two problems, however: when the cell is not on the boundary, one can easily determine the first cell of the block and iterate through the cells in the block, but one has to check whether each cell is a nearest neighbor. Worse is the case when the cell lies on the boundary; then one has to take into account 2^5 cells. What am I missing here? Is there a comparatively simple and efficient algorithm that will do what I need?
The question in point 1. is easily testable, but I'm not very familiar with the underlying instructions that the described access pattern generates and would really like to understand what is going on behind the scenes.
Thanks in advance for any help, references, etc...
EDIT:
Thank you for clarifying point 1! So, with Z-ordering, the cache hit rate is increased on average for neighbor cells, interesting. Is there a way to profile cache hit/miss rates?
Regarding point 2:
I should add that I understand how to build the Morton-ordered array for a point cloud in R^d where the index i = f(x1, x2, ..., xd) is obtained from bitwise interlacing etc. What I try to understand is whether there is a better way than the following naive ansatz to get the nearest neighbors (here in d=2, "pseudo code"):
// Get the z-indices of cells adjacent to the cell containing (x, y)
// Accessing the contents of the cells is irrelevant here
(x, y) \elem R^2
point = (x, y)
zindex = f(x, y)
(zx, zy) = f^(-1)(zindex) // grid coordinates
nc = [(zx - 1, zy - 1), (zx - 1, zy), (zx - 1, zy + 1), // neighbor grid
(zx , zy - 1), (zx, zy + 1), // coordinates
(zx + 1, zy - 1), (zx + 1, zy), (zx + 1, zy + 1)]
ni= [f(x[0], x[1]) for x in nc] // neighbor indices
In modern multi-level cache-based computer systems, spatial locality is an important factor in optimising access time to data elements.
Put simply, this means that if you access a data element in memory, then accessing another data element in memory that is nearby (has an address that is close to the first) can be cheaper by several orders of magnitude than accessing a data element that is far away.
When 1-d data is accessed sequentially, as in simple image processing or sound processing, or when iterating over data structures processing each element the same way, then arranging the data elements in memory in order tends to achieve spatial locality - i.e. since you access element N+1 just after accessing element N, the two elements should be placed next to each other in memory.
Standard C arrays (and many other data structures) have this property.
The point of Morton ordering is to support schemes where data is accessed two dimensionally instead of one dimensionally. In other words, after accessing element (x,y), you may go on to access (x+1,y) or (x,y+1) or similar.
The Morton ordering means that (x,y), (x+1,y) and (x,y+1) are near to each other in memory. In a standard C multidimensional array, this is not necessarily the case. For example, in the array myArray[10000][10000], (x,y) and (x,y+1) are 10000 elements apart - too far apart to take advantage of spatial locality.
In a Morton ordering, a standard C array can still be used as a store for the data, but the calculation to work out where (x,y) is stored is no longer as simple as store[x+y*rowsize].
To implement your application using Morton ordering, you need to work out how to transform a coordinate (x,y) into the address in the store. In other words, you need a function f(x,y) that can be used to access the store as in store[f(x,y)].
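One possible shape for such an f(x,y), sketched in Python under the assumption of non-negative grid coordinates that fit in 16 bits: interleave the bits of x and y using the usual shift-and-mask steps. The function names are made up for illustration, and the neighbor helper is just the naive approach of encoding each adjacent cell, not a smarter Z-order traversal.

```python
def part1by1(n):
    """Spread the low 16 bits of n so they occupy the even bit positions."""
    n &= 0x0000FFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n


def compact1by1(n):
    """Inverse of part1by1: gather the even bit positions into the low 16 bits."""
    n &= 0x55555555
    n = (n | (n >> 1)) & 0x33333333
    n = (n | (n >> 2)) & 0x0F0F0F0F
    n = (n | (n >> 4)) & 0x00FF00FF
    n = (n | (n >> 8)) & 0x0000FFFF
    return n


def morton_encode(x, y):
    """f(x, y): x bits go to even positions, y bits to odd positions."""
    return part1by1(x) | (part1by1(y) << 1)


def morton_decode(z):
    """f^(-1)(z): recover the grid coordinates (x, y)."""
    return compact1by1(z), compact1by1(z >> 1)


def neighbor_indices(x, y, size):
    """Z-indices of the up to 8 cells adjacent to (x, y) in a size x size grid."""
    result = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            nx, ny = x + dx, y + dy
            if (dx, dy) != (0, 0) and 0 <= nx < size and 0 <= ny < size:
                result.append(morton_encode(nx, ny))
    return result
```

A store would then be accessed as store[morton_encode(x, y)]; a table-driven encode is a common alternative when the bit tricks prove too slow.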
Looks like you need to do some more research - follow the links from the Wikipedia page, particularly the ones on the BIGMIN function.
Yes, accessing array elements in order is indeed faster. The CPU loads memory from RAM into cache in chunks. If you access sequentially, the CPU can preload the next chunk easily, and you won't notice the load time. If you access randomly, it can't. This is called locality of reference (here, spatial locality), and what it means is that accessing memory near to memory you've already accessed is faster.
In your example, when loading A[1], A[2], A[3] and A[4], the processor probably loaded several of those elements at once, making the later accesses nearly free. Moreover, if you then go on to try to access A[5], it can pre-load that chunk while you operate on A[1] and such, making the load time effectively nothing.
However, if you load A[1023], the processor must load that chunk. Then it must load A[12]- which it hasn't already loaded and thus must load a new chunk. Et cetera, et cetera. I have no idea about the rest of your question, however.
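A toy model of the chunk loading described above, sketched in Python. It assumes 4-byte elements, 64-byte chunks (cache lines), and a cache that remembers only the most recently loaded chunk; all three are simplifying assumptions and real hardware is far more elaborate. The function name is made up.

```python
def count_line_loads(indices, element_size=4, line_size=64):
    """Count chunk loads for a sequence of array indices in a one-line toy cache."""
    loads = 0
    current_line = None
    for index in indices:
        line = (index * element_size) // line_size  # chunk holding this element
        if line != current_line:
            loads += 1           # element is not in the chunk we already hold
            current_line = line
    return loads


sequential = count_line_loads([1, 2, 3, 4])
scattered = count_line_loads([1023, 12, 456, 56])
```

In this model the sequential pattern needs a single chunk load for all four accesses, while the scattered pattern needs one load per access.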