I'm looking for a way to implement block-diagonal matrices in TensorFlow. Specifically, I have a block-diagonal matrix A with N blocks of size S x S each, and a vector v of length N*S. I want to calculate A dot v. Is there an efficient way to do this in TensorFlow?
Ideally the implementation would support a batch dimension for v (i.e. its real shape is batch_size x (N*S)) and be memory efficient, keeping only the diagonal blocks of A in memory.
Thanks for any help!
You can simply convert your tensor to a sparse tensor, since a block-diagonal matrix is just a special case of one. The operations are then carried out efficiently. If you already have a dense representation of the tensor, you can convert it using sparse_tensor = tf.contrib.layers.dense_to_sparse(dense_tensor) (tf.contrib was removed in TensorFlow 2.x, where tf.sparse.from_dense is the closest equivalent). Otherwise, you can construct it with the tf.SparseTensor(...) function. To get the indices, you might use tf.strided_slice; see this post for more information.
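As an alternative to the sparse route, the product can be computed from the diagonal blocks alone. The following is a minimal sketch in NumPy, not TensorFlow code; the function name block_diag_matvec and the (N, S, S) storage layout are assumptions made for illustration. tf.einsum accepts the same subscript string, so the pattern should carry over, but that has not been tested here.

```python
import numpy as np

def block_diag_matvec(blocks, v):
    """Multiply a block-diagonal matrix by a batch of vectors.

    blocks: array of shape (N, S, S) holding only the diagonal blocks.
    v:      array of shape (batch, N * S).
    Returns an array of shape (batch, N * S).
    """
    n_blocks, size, _ = blocks.shape
    v_blocks = v.reshape(-1, n_blocks, size)            # (batch, N, S)
    # out[b, n, i] = sum_j blocks[n, i, j] * v_blocks[b, n, j]
    out = np.einsum("nij,bnj->bni", blocks, v_blocks)
    return out.reshape(-1, n_blocks * size)

rng = np.random.default_rng(0)
N, S, batch = 3, 4, 5
blocks = rng.standard_normal((N, S, S))
v = rng.standard_normal((batch, N * S))
result = block_diag_matvec(blocks, v)
```

Only N*S*S numbers are stored for A, and the batch dimension of v is handled by the same call.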
Assume I have a very fast subroutine for fixed-size unitary matrix multiplication. (The subroutine may involve hardware acceleration.) Say, a function called quantum_unmm_256(A, U, m) right-multiplies an m-by-256 matrix A by a 256-by-256 unitary matrix U.
Now I want to multiply something by a unitary matrix whose size is a multiple of 256, say, a 1280x1280 unitary matrix. What would be a fast algorithm that makes the best use of the fast subroutine?
All matrices are assumed dense, with 64 or 128 bit float complex type.
Have a look at parallel matrix multiplication algorithms. You can always divide the matrix into blocks and multiply it in pieces. You can even reduce the number of operations needed.
For example, reading Wikipedia:
This isn't a full answer, but too long for a comment:
It might be easier to work with the (1280x1280) matrix if it were reshaped to (5, 256, 5, 256) and then transposed to (5, 5, 256, 256), since 1280 = 5 x 256. But even that could require a copy() to ensure that the innermost blocks (in numpy terms) are contiguous.
It could even be cast as a (5, 5) object dtype array, where each element is your 'fast' unitary array.
I could elaborate on those actions if needed, but I suspect you have enough numpy skills to do it.
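The reshape-and-transpose idea can be sketched as follows. This is a minimal illustration under assumptions: kernel_matmul is a hypothetical stand-in for the fast fixed-size routine (quantum_unmm_256 in the question), and a small block size is used in the demo so that it runs quickly. With block = 256 and a 1280x1280 matrix, the same code would make 25 kernel calls.

```python
import numpy as np

def kernel_matmul(a_block, u_block):
    # Hypothetical stand-in for the fast fixed-size kernel.
    return a_block @ u_block

def blocked_right_multiply(a, u, block):
    """Compute a @ u using only (m x block) @ (block x block) kernel calls."""
    m, n = a.shape
    if u.shape != (n, n) or n % block:
        raise ValueError("u must be n x n with n a multiple of the block size")
    nb = n // block
    # u_blocks[i, j] is a contiguous copy of the tile in block-row i, block-column j.
    u_blocks = np.ascontiguousarray(
        u.reshape(nb, block, nb, block).transpose(0, 2, 1, 3))
    out = np.zeros((m, n), dtype=np.result_type(a, u))
    for i in range(nb):
        a_panel = np.ascontiguousarray(a[:, i * block:(i + 1) * block])
        for j in range(nb):
            out[:, j * block:(j + 1) * block] += kernel_matmul(a_panel, u_blocks[i, j])
    return out

rng = np.random.default_rng(0)
block, nb, m = 4, 5, 3
n = block * nb
z = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
u, _ = np.linalg.qr(z)                      # a random unitary matrix
a = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
product = blocked_right_multiply(a, u, block)
```

The copy made by ascontiguousarray is the one mentioned above: it is paid once per matrix U, after which every tile handed to the kernel is contiguous.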
There are a lot of things that are unclear about this question:
Why both the MATLAB and numpy tags?
How is this block acceleration coded? If it's fast it must be compiled code; if so, what's the proposed link to interpreted code?
What are the constraints on the data structure? I suspect it must be some sort of contiguous block(s) of data. That's why I suggest the reshape and transpose.
I have an MxM matrix S whose entries are zero on the diagonal, and non-zero everywhere else. I need to make a larger, block matrix. The blocks will be size NxN, and there will be MxM of them.
The (i,j)th block will be S(i,j)I, where I=eye(N) is the NxN identity. This matrix will certainly be sparse: S has M^2-M nonzero entries, so my block matrix will have N(M^2-M) nonzero entries out of (NM)^2, a fraction of about 1/N. However, I'll be adding it to another NMxNM matrix that I do not expect to be sparse.
Since I will be adding my block matrix to a full matrix, would there be any speed gain from writing my code in a 'sparse' way? I keep going back and forth, but my thinking is settling on this: even if my code to turn S into a sparse block matrix isn't very efficient, when I tell MATLAB to add a full and a sparse matrix together, wouldn't it know that it only needs to iterate over the nonzero elements? I've been trained that for loops are slow in MATLAB and that things like repmat and padding with zeros are faster, but my guess is that the fastest thing would be to not build the block matrix at all, and instead write code that adds the entries of (the small matrix) S to my other (large, full) matrix in a sparse way. If I were to learn how to build the block matrix with sparse code (faster than building it in a full way and passing it to sparse), then shouldn't that code be able to do the addition for me in a sparse way without even needing to build the block matrix?
If you can keep a full NMxNM matrix in memory, don't bother with sparse operations. In fact, in most cases A+B with A full and B sparse will take longer than A+B where A and B are both full.
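The idea raised in the question, adding the entries of S straight into the full matrix without building the block matrix, can be sketched as below. This is a NumPy illustration rather than MATLAB code, and the function name add_scaled_identity_blocks is an assumption. It relies on the block matrix being kron(S, eye(N)), so only the diagonals of the NxN blocks of the large matrix are touched.

```python
import numpy as np

def add_scaled_identity_blocks(b, s, n):
    """In place: b += kron(s, eye(n)), without forming the Kronecker product."""
    m = s.shape[0]
    if s.shape != (m, m) or b.shape != (m * n, m * n):
        raise ValueError("shape mismatch")
    if not b.flags.c_contiguous:
        raise ValueError("b must be C-contiguous so that reshape returns a view")
    view = b.reshape(m, n, m, n)      # view[i, a, j, c] is b[i*n + a, j*n + c]
    d = np.arange(n)
    # Select the diagonal (a == c) of every block; the selection has shape (n, m, m).
    view[:, d, :, d] += s
    return b

rng = np.random.default_rng(0)
M, N = 3, 4
S = rng.random((M, M))
np.fill_diagonal(S, 0.0)                      # zero diagonal, as in the question
B = rng.random((M * N, M * N))
expected = B + np.kron(S, np.eye(N))          # reference result, built the expensive way
add_scaled_identity_blocks(B, S, N)
```

This touches N*M*M entries of the large matrix instead of all (NM)^2, which is the saving the question was after.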
From your description, using sparse is likely slower for your problem:
If you're adding a sparse matrix A to a full matrix B, the result is full and there's almost certainly no advantage to having A sparse.
For example:
n = 12000; A = rand(n, n); B1 = rand(n, n); B2 = spalloc(n, n, n*n);
B2 is as sparse as possible, that is, it's all zeros!
On my machine, A+B1 takes about .23 seconds while A + B2 takes about .7 seconds.
Basically, operations on full matrices use BLAS/LAPACK library calls that are insanely optimized. Overhead associated with sparse is going to make things worse unless you're in the special cases where sparse is super useful.
When is sparse super useful?
Sparse is super useful when the size of the matrices suggests that some algorithm should be very slow, but because of sparsity (plus perhaps special matrix structure), the actual number of computations required is orders of magnitude smaller.
EXAMPLE: Solving linear system A*x=b where A is block diagonal matrix:
% generate a 5000x5000 sparse block-diagonal matrix of 5x5 blocks
As = sparse(rand(5, 5));
for i = 1:999
    As = blkdiag(As, sparse(rand(5, 5)));
end
Af = full(As);
b = rand(5000, 1);
On my machine, solving the linear system on the full matrix (i.e. Af \ b) takes about 2.3 seconds while As \ b takes .0012 seconds.
Sparse can be awesome, but it's only helpful for large problems where you can cleverly exploit structure.
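To make the "exploit structure" point concrete, here is a minimal NumPy sketch (not the MATLAB code above) that solves a block-diagonal system block by block. The block count and size mirror the example; adding a multiple of the identity to each block is an assumption made only so that every block in the demo is safely invertible.

```python
import numpy as np

rng = np.random.default_rng(0)
n_blocks, size = 1000, 5
# Diagonally dominant blocks, so each 5x5 system has a unique solution.
blocks = rng.random((n_blocks, size, size)) + size * np.eye(size)
b = rng.random(n_blocks * size)

# Solve the 1000 independent 5x5 systems in one batched call,
# instead of one dense 5000x5000 solve.
x = np.linalg.solve(blocks, b.reshape(n_blocks, size, 1)).reshape(-1)
```

The work is proportional to 1000 * 5^3 operations rather than 5000^3, which is the kind of gap behind the timings quoted above.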
I am using Armadillo mostly for symmetric and triangular matrices, and I want to be efficient in terms of memory storage. However, it seems there is no other way than to create a new mat and fill the lower/upper part of the matrix with zeros (for triangular) or with duplicates (for symmetric).
Is there a more efficient way of using triangular/symmetric matrices using Armadillo?
Thanks,
Antoine
There is no specific support for triangular or banded matrices in Armadillo. However, since version 3.4 support for sparse matrices has gradually been added. Depending on what Armadillo functions you need, and the sparsity of your matrix, you might gain from using SpMat<type> which implements the compressed sparse column (CSC) format. For each nonzero value in your matrix the CSC format stores the row index along with the value so you would likely not save much memory for a triangular matrix. A banded diagonal matrix should however consume significantly less memory.
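The memory argument can be checked with a little arithmetic. The sketch below is a rough estimate, not a measurement of Armadillo itself: it assumes 8-byte values and 8-byte indices, and the helper names are made up for illustration.

```python
def dense_bytes(n, value_bytes=8):
    # A dense n x n matrix stores every entry.
    return n * n * value_bytes

def csc_bytes(n_cols, nnz, value_bytes=8, index_bytes=8):
    # CSC stores a value and a row index per nonzero, plus n_cols + 1 column pointers.
    return nnz * (value_bytes + index_bytes) + (n_cols + 1) * index_bytes

n = 1000
triangular_nnz = n * (n + 1) // 2     # upper triangle including the diagonal
tridiagonal_nnz = 3 * n - 2           # a narrow banded matrix

print(dense_bytes(n))                 # 8000000
print(csc_bytes(n, triangular_nnz))   # 8016008
print(csc_bytes(n, tridiagonal_nnz))  # 55976
```

Under these assumptions a triangular matrix in CSC takes slightly more memory than the dense matrix, while a tridiagonal one takes well under 1% of it.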
symmatu()/symmatl() and trimatu()/trimatl()
may be what you are looking for:
http://arma.sourceforge.net/docs.html
I am trying to apply the Random Projections method to a very sparse dataset. I found papers and tutorials about the Johnson-Lindenstrauss method, but every one of them is full of equations that don't give me any meaningful explanation. For example, this document on Johnson-Lindenstrauss
Unfortunately, from this document, I can get no idea about the implementation steps of the algorithm. It's a long shot, but is there anyone who can tell me the plain English version or very simple pseudocode of the algorithm? Or where can I start to dig into these equations? Any suggestions?
For example, what I understand from the algorithm by reading this paper concerning Johnson-Lindenstrauss is that:
Assume we have an AxB matrix where A is the number of samples and B is the number of dimensions, e.g. 100x5000. And I want to reduce its dimension to 500, which will produce a 100x500 matrix.
As far as I understand: first, I need to construct a 100x500 matrix and fill the entries randomly with +1 and -1 (with a 50% probability).
Edit:
Okay, I think I started to get it. So we have a matrix A which is mxn. We want to reduce it to E which is mxk.
What we need to do is construct a matrix R of dimension nxk and fill it with 0, -1 or +1, with probabilities 2/3, 1/6 and 1/6 respectively.
After constructing R, we simply do the matrix multiplication A x R to find our reduced matrix E. But we don't need to do a full matrix multiplication: if an entry of R is 0, we don't need to do any calculation and simply skip it. If the entry is +1, we just add the corresponding entry of A, and if it's -1, we subtract it. So we use summation rather than multiplication to find E, and that is what makes this method very fast.
It turned out a very neat algorithm, although I feel too stupid to get the idea.
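The construction described in the edit above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than a reference implementation: the function name is made up, the sqrt(3/k) scaling is added here (the post does not mention it) so that row lengths are preserved on average, and a plain matrix product is used for brevity instead of the add/subtract shortcut.

```python
import numpy as np

def sparse_random_projection(a, k, rng):
    """Project the rows of a (m x n) down to k dimensions."""
    n = a.shape[1]
    # Entries of R are +1, 0, -1 with probabilities 1/6, 2/3, 1/6.
    r = rng.choice([1.0, 0.0, -1.0], size=(n, k), p=[1 / 6, 2 / 3, 1 / 6])
    # Scaling by sqrt(3 / k) keeps the expected squared length of each row unchanged.
    return np.sqrt(3.0 / k) * (a @ r)

rng = np.random.default_rng(0)
a = rng.standard_normal((100, 5000))       # 100 samples, 5000 dimensions
e = sparse_random_projection(a, 500, rng)  # reduced to 500 dimensions
ratios = np.linalg.norm(e, axis=1) / np.linalg.norm(a, axis=1)
```

The ratios should all be close to 1, meaning each sample keeps roughly its original length after the reduction.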
You have the idea right. However, as I understand random projection, the rows of your matrix R should have unit length. I believe that's approximately what the normalizing by 1/sqrt(k) is for: to normalize away the fact that they're not unit vectors.
It isn't a projection, but it's nearly one; R's rows aren't orthonormal, but within a much higher-dimensional space, they quite nearly are. In fact, the dot product of any two of those vectors you choose will be pretty close to 0. This is why it is generally a good approximation of actually finding a proper basis for projection.
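The near-orthogonality claim is easy to check numerically. The following is a small sketch with arbitrarily chosen sizes: it draws random +/-1 vectors, scales them to unit length, and looks at the largest dot product between two different vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, count = 5000, 50
# Random sign vectors, scaled so that every row has length exactly 1.
vectors = rng.choice([-1.0, 1.0], size=(count, dim)) / np.sqrt(dim)

gram = vectors @ vectors.T                        # all pairwise dot products
off_diagonal = gram[~np.eye(count, dtype=bool)]   # drop each vector's dot with itself
max_abs_dot = float(np.abs(off_diagonal).max())
```

A typical dot product here is on the order of 1/sqrt(dim), so max_abs_dot comes out small compared with the unit lengths on the diagonal.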
The mapping from high-dimensional data A to low-dimensional data E is given in the statement of theorem 1.1 in the latter paper - it is simply a scalar multiplication followed by a matrix multiplication. The data vectors are the rows of the matrices A and E. As the author points out in section 7.1, you don't need to use a full matrix multiplication algorithm.
If your dataset is sparse, then sparse random projections will not work well.
You have a few options here:
Option A:
Step 1. Apply a structured dense random projection (the so-called fast Hadamard transform is typically used). This is a special projection that is very fast to compute but otherwise has the properties of a normal dense random projection.
Step 2. Apply a sparse projection to the "densified" data (sparse random projections are useful for dense data only).
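Step 1 of Option A can be sketched as follows. This is a minimal illustration, assuming the input length is a power of two; the function names are made up, and the random sign flip before the transform is one common way to randomize it rather than something stated in the answer.

```python
import numpy as np

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform of a length-2^p vector."""
    x = np.asarray(x, dtype=float).copy()
    n = x.shape[0]
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        x = x.reshape(-1, 2, h)     # pair up neighbouring runs of length h
        x = np.stack([x[:, 0] + x[:, 1], x[:, 0] - x[:, 1]], axis=1)
        h *= 2
    return x.reshape(n)

def randomized_hadamard(x, rng):
    """Flip random signs, transform, and rescale so that lengths are preserved."""
    signs = rng.choice([-1.0, 1.0], size=x.shape[0])
    return fwht(signs * x) / np.sqrt(x.shape[0])

rng = np.random.default_rng(0)
x = np.zeros(8)
x[3] = 1.0                          # a maximally sparse input vector
y = randomized_hadamard(x, rng)     # every entry of y is now nonzero
```

The single nonzero entry is spread evenly over all coordinates, which is the "densifying" effect that makes the sparse projection in Step 2 usable.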
Option B:
Apply SVD to the sparse data. If the data is sparse but has some structure, SVD is better. Random projection preserves the distances between all points. SVD better preserves the distances between dense regions; in practice this is more meaningful. People also use random projections to compute the SVD on huge datasets. Random projections give you efficiency, but not necessarily the best quality of embedding in a low dimension.
If your data has no structure, then use random projections.
Option C:
For data points for which SVD has little error, use SVD; for the rest of the points use Random Projection
Option D:
Use a random projection based on the data points themselves.
It is very easy to understand what is going on here. It looks something like this:
create an n by k matrix scores (n = number of data points, k = new dimension)
for j from 0 to k-1 do  # generate k random projection vectors
    randomized_combination = feature vector of zeros (number of zeros = number of features)
    sample_point_ids = select a sample of point ids
    for each point_id in sample_point_ids do:
        random_sign = +1 or -1, each with prob. 1/2
        randomized_combination += random_sign * feature_vector[point_id]  # this is a vector operation
    normalize the randomized combination
    # note that the normal random projection is:
    # randomized_combination = [+/-1, +/-1, ...] (one +/-1 per feature; if you want it sparse, randomly set a fraction to 0; also good to normalize by length)
    # to project the data points on this random feature just do
    for each point_id in dataset:
        scores[point_id, j] = dot_product(feature_vector[point_id], randomized_combination)
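The pseudocode above can be turned into a short runnable sketch. The function name, the sample_size parameter, and the choice to sample point ids without replacement are assumptions made for illustration; the answer does not specify how the sample is drawn.

```python
import numpy as np

def data_driven_projection(features, k, sample_size, rng):
    """Project data onto k directions built from signed sums of sampled points."""
    n_points, n_features = features.shape
    directions = np.empty((k, n_features))
    for j in range(k):
        ids = rng.choice(n_points, size=sample_size, replace=False)
        signs = rng.choice([-1.0, 1.0], size=sample_size)
        combo = signs @ features[ids]             # signed sum of the sampled points
        norm = np.linalg.norm(combo)
        directions[j] = combo / norm if norm > 0 else combo
    scores = features @ directions.T              # scores[point_id, j]
    return scores, directions

rng = np.random.default_rng(0)
features = rng.random((30, 12))                   # 30 points, 12 features
scores, directions = data_driven_projection(features, k=4, sample_size=10, rng=rng)
```

Each column of scores is the overlap of every data point with one random pattern built from the data itself.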
If you are still looking to solve this problem, write a message here, I can give you more pseudocode.
The way to think about it is that a random projection is just a random pattern, and the dot product between the data point and the pattern (i.e. projecting the data point) gives you the overlap between them. So if two data points overlap with many of the same random patterns, those points are similar. Therefore, random projections preserve similarity while using less space, but they also add random fluctuations to the pairwise similarities. What JLT tells you is that to keep the fluctuations at 0.1 (eps) you need about 100*log(n) dimensions, i.e. on the order of log(n)/eps^2.
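As a rough rule of thumb, the dimension count quoted above can be computed directly. This sketch uses log(n)/eps^2 with the constant factor ignored, as in the answer; the exact constant in the lemma is larger, so treat the result as an order of magnitude only. The function name is made up.

```python
import math

def jl_dimensions(n_points, eps):
    """Rough target dimension log(n) / eps^2, ignoring the constant factor."""
    return math.ceil(math.log(n_points) / eps ** 2)

dims = jl_dimensions(10_000, 0.1)          # about 100 * log(10000)
dims_tighter = jl_dimensions(10_000, 0.05)
```

Halving eps roughly quadruples the number of dimensions, while the number of points only enters through its logarithm.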
Good Luck!
An R package to perform random projection using the Johnson-Lindenstrauss lemma:
RandPro
Are there any algorithms that allow efficient creation (element filling) of sparse (e.g. CSR or coordinate) matrix in parallel?
If you store your matrix as a coordinate map, any language which has a concurrent dictionary implementation available should do the job for you.
Java's got the ConcurrentHashMap, and .NET 4 has ConcurrentDictionary, both of which allow multi-threaded non-blocking (afaik) element insertion in parallel.
There are no efficient algorithms for creating sparse matrices in a data-parallel way. The coordinate format is a plausible choice, but it requires sorting after the contents are filled in, and that format is slow for matrix products etc.
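The coordinate-then-sort route can be sketched like this. It is an illustration under assumptions: fill_chunk is a hypothetical element generator, each worker collects its own triplets so no shared structure is needed during filling, and Python threads are used only to show the shape of the approach, not as a claim about speed.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def fill_chunk(row_range, n_cols):
    """Hypothetical generator: each worker produces triplets for its own rows."""
    rows, cols, vals = [], [], []
    for i in row_range:
        for j in (i, (i + 1) % n_cols):       # two nonzeros per row, for illustration
            rows.append(i)
            cols.append(j)
            vals.append(float(i + j + 1))
    return (np.array(rows, dtype=np.int64),
            np.array(cols, dtype=np.int64),
            np.array(vals, dtype=float))

def coo_to_csr(rows, cols, vals, n_rows):
    """Sort the triplets by (row, column) and build the CSR row pointer."""
    order = np.lexsort((cols, rows))
    counts = np.bincount(rows, minlength=n_rows)
    indptr = np.concatenate(([0], np.cumsum(counts)))
    return indptr, cols[order], vals[order]

n_rows = n_cols = 8
chunks = [range(s, min(s + 3, n_rows)) for s in range(0, n_rows, 3)]
with ThreadPoolExecutor(max_workers=4) as pool:
    parts = list(pool.map(lambda r: fill_chunk(r, n_cols), chunks))
rows, cols, vals = (np.concatenate(p) for p in zip(*parts))
indptr, indices, data = coo_to_csr(rows, cols, vals, n_rows)
```

The filling step is embarrassingly parallel; the sort in coo_to_csr is the sequential cost mentioned above.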
The solution is to not build the sparse matrix at all: you don't keep it in memory, and instead you perform the operations implicitly, in place, as you calculate the elements of the sparse matrix.
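A small sketch of what "implicit operations" can look like: the matrix below is never stored, and its product with a vector is computed directly from the rule that defines its entries. The tridiagonal matrix (2 on the diagonal, -1 next to it) is just an example chosen for illustration.

```python
import numpy as np

def laplacian_matvec(x):
    """y = A @ x for the tridiagonal matrix with 2 on the diagonal and -1 beside it."""
    y = 2.0 * x            # diagonal contribution (new array, x is left untouched)
    y[1:] -= x[:-1]        # sub-diagonal contribution
    y[:-1] -= x[1:]        # super-diagonal contribution
    return y

x = np.arange(6, dtype=float)
y = laplacian_matvec(x)
```

Because every output entry depends only on a few input entries, the rows can be computed independently, which is what makes this style friendly to parallel execution.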