Why does this TensorFlow loop require so much memory?

I have a contrived version of a complicated network:
import tensorflow as tf
a = tf.ones([1000])
b = tf.ones([1000])
for i in range(int(1e6)):
    a = a * b
My intuition is that this should require very little memory: just the space for the initial array allocations plus a sequence of commands that reuses those nodes and overwrites the memory stored in tensor 'a' at each step. But memory usage grows quite rapidly.
What is going on here, and how can I decrease memory usage when I compute a tensor and overwrite it a bunch of times?
Edit:
Thanks to Yaroslav's suggestions, the solution turned out to be to use tf.while_loop to minimize the number of nodes in the graph. This works great: it is much faster, requires far less memory, and is contained entirely in-graph.
import tensorflow as tf
a = tf.ones([1000])
b = tf.ones([1000])
cond = lambda _i, _1, _2: tf.less(_i, int(1e6))
body = lambda _i, _a, _b: [tf.add(_i, 1), _a * _b, _b]
i = tf.constant(0)
output = tf.while_loop(cond, body, [i, a, b])
with tf.Session() as sess:
    result = sess.run(output)
    print(result)

Your a*b command translates to tf.mul(a, b), which is equivalent to tf.mul(a, b, g=tf.get_default_graph()). Each call adds a Mul node to the current Graph object, so you are trying to add 1 million Mul nodes to the current graph. That is also problematic because a Graph object larger than 2 GB cannot be serialized, and some checks may fail once you are dealing with a graph that large.
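A quick way to see this (a small sketch using the TF 1.x graph API; not part of the original answer) is to count the operations registered in the default graph as the loop runs:
import tensorflow as tf

a = tf.ones([1000])
b = tf.ones([1000])
for i in range(5):
    a = a * b
    # every iteration registers one more Mul node in the default graph
    print(len(tf.get_default_graph().get_operations()))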
I'd recommend reading Programming Models for Deep Learning by the MXNet folks. TensorFlow is "symbolic" programming in their terminology, and you are treating it as imperative.
To get what you want using a Python loop, you could construct the multiplication op once and run it repeatedly, using feed_dict to feed the updates:
mul_op = a*b
result = sess.run(a)
for i in range(int(1e6)):
    result = sess.run(mul_op, feed_dict={a: result})
For more efficiency you could use tf.Variable objects and var.assign to avoid Python<->TensorFlow data transfers.
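A minimal sketch of that variable-based pattern (assuming the TF 1.x API; this code is not from the original answer):
import tensorflow as tf

a = tf.Variable(tf.ones([1000]))
b = tf.ones([1000])
update = a.assign(a * b)  # a single graph node, reused on every run

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(int(1e6)):
        sess.run(update)  # no Python<->TensorFlow array transfer per step
    result = sess.run(a)
Here the graph stays constant in size and the data never leaves the TensorFlow runtime between iterations.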

Related

solving sparse Ax=b in scipy

I need to solve Ax=b where A is the matrix that represents finite difference method for PDEs. Typical size of A for a 2D problem is around (256^2)x(256^2), and it consists of a few diagonals. The following sample code is how I construct A:
N = Nx*Ny # Nx is no. of cols (size in x-direction), Ny is no. of rows (size in y-direction)
# finite difference in x-direction
up1 = (0.5)*c
up1[Nx-1::Nx] = 0
down1 = (-0.5)*c
down1[::Nx] = 0
matX = diags([down1[1:], up1[:-1]], [-1,1], format='csc')
# finite difference in y-direction
up1 = (0.5)*c
down1 = (-0.5)*c
matY = diags([down1[Nx:], up1[:N-Nx]], [-Nx,Nx], format='csc')
Adding matX and matY together results in four diagonals. The above is for a second-order discretization; for a fourth-order discretization I have eight diagonals instead. If there is a second-derivative term, the main diagonal is nonzero as well.
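For reference, a self-contained sketch of the construction above with a small grid (Nx = Ny = 4; the coefficient array c is not shown in the question and is assumed to be all ones here):
import numpy as np
from scipy.sparse import diags

Nx, Ny = 4, 4
N = Nx*Ny
c = np.ones(N)                      # placeholder for the actual coefficients
up1 = (0.5)*c
up1[Nx-1::Nx] = 0
down1 = (-0.5)*c
down1[::Nx] = 0
matX = diags([down1[1:], up1[:-1]], [-1, 1], format='csc')
up1 = (0.5)*c
down1 = (-0.5)*c
matY = diags([down1[Nx:], up1[:N-Nx]], [-Nx, Nx], format='csc')
A_fixed = matX + matY               # four nonzero diagonals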
I use the following code to do the actual solving:
# Initialize A_fixed, B_fixed
if const is True:  # the potential term V(x) is time-independent
    A = A_fixed + sp.sparse.diags(V_func(x))
    B = B_fixed + sp.sparse.diags(V_func(x))
    A_factored = sp.sparse.linalg.factorized(A)
for idx, t in enumerate(t_steps[1:], 1):
    # Solve Ax = b = Bu
    if const is False:  # time-dependent potential: rebuild A and B each step
        A = A_fixed + sp.sparse.diags(V_func(x, t))
        B = B_fixed + sp.sparse.diags(V_func(x, t))
    psi_n = B.dot(psi_old)
    if const is True:
        psi_new = A_factored(psi_n)
    else:
        psi_new = sp.sparse.linalg.spsolve(A, psi_n, use_umfpack=False)
    psi_old = psi_new.copy()
I have a couple questions:
What's the best way to solve Ax=b in scipy? I currently use spsolve from sp.sparse.linalg, which uses an LU decomposition. I tried the standard dense solver sp.linalg.solve, but it is considerably slower. I also tried all the other iterative solvers in sp.sparse.linalg, but they are slower too (for a 1000x1000 system they all take a couple of seconds, compared to less than half a second for spsolve, and my A is likely to be a lot bigger). Are there any alternative ways to do the computation?
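One commonly used alternative (a sketch, not from the original question) is to factor A once with SuperLU via scipy.sparse.linalg.splu and reuse the factorization for every right-hand side:
import scipy.sparse.linalg as spla

lu = spla.splu(A.tocsc())     # sparse LU factorization; A must be in CSC format
psi_new = lu.solve(psi_n)     # each solve is then just two triangular solves
As with factorized(), this only pays off when A stays fixed across solves, which is the issue raised in the edit below.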
Edit: The problem I'm trying to solve is actually the time-dependent Schrodinger equation. If the potential term is time-independent, then as suggested I can pre-factorize the matrix A to speed up the code, but this doesn't work if the potential is time-varying, as I need to change the diagonal terms of both matrices A and B at each time step. For this specific problem, are there any ways to speed up the code using methods similar to pre-factorization, or in other ways?
I have installed umfpack. I tried setting use_umfpack to True and False to test it, but surprisingly use_umfpack=True takes nearly twice as long as use_umfpack=False. I thought having this package should give a speed-up. Any idea why that's the case? (PS: I am using Anaconda Spyder to run the code, if that makes any difference.)
I have used cProfile to profile my code, and nearly all the time is spent on the line with spsolve. So I think the remaining part of the code (matrix/problem initialization) is pretty much optimized, and it's the matrix-solving part that needs to be improved.
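For completeness, the profiling mentioned above was presumably done with something along these lines (a sketch; solve_loop is a hypothetical stand-in for the time-stepping code, not a function from the original post):
import cProfile

cProfile.run('solve_loop()', sort='cumtime')  # shows spsolve dominating the runtime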
Thanks.

Optimised code for computing truncated weighted sum of matrix products

I need to perform a double sum over a weighted matrix product of the form
t2 = Σ_i Σ_j C[i] · X[:, :, i] · m · X[:, :, j]^H
where C is a list containing a different (complex) weight for each term and X is a 3D array whose last index selects the i-th 2D array along the depth axis. I need to evaluate such a sum thousands of times (for different 'm' arrays), so execution time is very important. In the real problem the arrays have larger dimensions. An illustrative minimal example of my code is
import numpy as np
from time import clock

X = np.random.random((12, 12, 31))
m = np.random.random((12, 12))
C = list(map(lambda x: x*2 + 1j*x, range(31)))

def summing(m):
    start = clock()
    t2 = np.zeros((12, 12), dtype=complex)
    for i in range(len(C)):
        for j in range(len(C)):
            t2 += C[i]*(X[:, :, i].dot(m).dot(np.conj(X[:, :, j].T)))
    return t2, clock() - start

summing(m)[1]
The execution time for me is around 0.025 s, which is way too slow for the number of computations I need to perform. Running in parallel is not an option, as this function is part of a larger parallel computation, so parallelising it would create nested parallelism.
Do any of you see a cleverer way of performing this computation faster and, hopefully, in a more pythonic way?
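One observation worth noting (a sketch, not part of the original question): because the weight C[i] does not depend on j, the double sum factorizes as t2 = (Σ_i C[i]·X[:, :, i]) · m · (Σ_j X[:, :, j])^H, which removes both Python loops entirely:
import numpy as np

Carr = np.asarray(C)                             # shape (31,)
left = np.tensordot(X, Carr, axes=([2], [0]))    # Σ_i C[i] * X[:, :, i]
right = X.sum(axis=2)                            # Σ_j X[:, :, j]
t2_fast = left.dot(m).dot(np.conj(right.T))
# should match the loop version up to floating-point error:
# assert np.allclose(t2_fast, summing(m)[0])
Whether this factorization applies to the real problem depends on whether the weights there really depend on only one of the two summation indices.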

Numpy Hermitian Matrix class

Are you aware of something like a Hermitian matrix class in numpy? I'd like to optimize matrix calculations like
B = U * A * U.H
where A (and thus B) is Hermitian. Without further specification, all matrix elements of B are calculated, yet it should be possible to save a factor of about 2 here. Am I missing something?
The method I need should take the upper/lower triangle of A and the full matrix U, and return the upper/lower triangle of B.
I don't think there exists a method for your specific problem, but with a little thought you might be able to build an algorithm from the low-level BLAS routines that are wrapped in SciPy. For example, dgemm, dsymm, and dtrmm do general, symmetric, and triangular matrix products respectively. Here's an example of using them:
import numpy as np
from scipy.linalg.blas import dgemm, dsymm, dtrmm
A = np.random.rand(10, 10)
B = np.random.rand(10, 10)
S = np.dot(A, A.T) # symmetric matrix
T = np.triu(S) # upper triangular matrix
# normal matrix-matrix product
assert np.allclose(dgemm(1, A, B), np.dot(A, B))
# symmetric mat-mat product using only upper-triangle
assert np.allclose(dsymm(1, T, B), np.dot(S, B))
# upper-triangular mat-mat product
assert np.allclose(dtrmm(1, T, B), np.dot(T, B))
There are many other low-level BLAS routines available; I find the NETLIB page to be a good resource to learn what they do. You may be able to cleverly use some combination of the available routines to efficiently solve the problem you have in mind.
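As one illustration of such a combination (a sketch for the real symmetric case, not from the original answer), B = U·A·Uᵀ can be formed so that only A's upper triangle is ever read:
import numpy as np
from scipy.linalg.blas import dsymm, dgemm

n = 10
U = np.random.rand(n, n)
A = np.random.rand(n, n)
A = A + A.T                  # make A symmetric
T = np.triu(A)               # store only the upper triangle

AUt = dsymm(1.0, T, U.T)     # A @ U.T, reading only the upper triangle of T
B = dgemm(1.0, U, AUt)       # U @ (A @ U.T)
assert np.allclose(B, U.dot(A).dot(U.T))
Note that this still materializes the full B; it only avoids touching A's lower triangle, so it does not by itself recover the factor-of-two saving asked about.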
Edit: it looks like there are LAPACK routines that quickly compute exactly what you want: dsytrd or zhetrd, but unfortunately these don't appear to be wrapped directly in scipy.linalg.lapack, though scipy does provide cython wrappers for them. Best of luck!
I needed tridiagonal reduction of a symmetric/Hermitian matrix A,
T = Q^H * A * Q
– presumably OP's underlying problem – and I've just submitted a pull request to SciPy for properly interfacing LAPACK's {s,d}sytrd (for real symmetric matrices) and {c,z}hetrd (for Hermitian matrices). All routines use either only the upper or the lower triangular part of the matrix.
Once this has been merged, it can be used like
import numpy as np
from scipy.linalg.lapack import sytrd, sytrd_lwork  # available once the PR is merged

dtype = np.float64  # use a complex dtype and hetrd/hetrd_lwork for Hermitian matrices
n = 3
A = np.zeros((n, n), dtype=dtype)
A[np.triu_indices_from(A)] = np.arange(1, 2*n+1, dtype=dtype)

# query the optimal workspace size -- optional
lwork, info = sytrd_lwork(n)
assert info == 0

data, d, e, tau, info = sytrd(A, lwork=lwork)
assert info == 0
The vectors d and e now contain the main diagonal and the off-diagonal (which serves as both the sub- and superdiagonal), respectively.
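As a quick sanity check (a small sketch, assuming the real symmetric case), the tridiagonal matrix can be assembled from d and e:
import numpy as np

T = np.diag(d) + np.diag(e, 1) + np.diag(e, -1)  # the tridiagonal form of A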

Parallelising gradient calculation in Julia

I was persuaded some time ago to drop my comfortable MATLAB programming and start programming in Julia. I have been working with neural networks for a long time, and I thought that with Julia I could get things done faster by parallelising the calculation of the gradient.
The gradient need not be calculated on the entire dataset in one go; instead one can split the calculation. For instance, by splitting the dataset in parts, we can calculate a partial gradient on each part. The total gradient is then calculated by adding up the partial gradients.
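Concretely, for the linear-regression example below, the full gradient is a sum over data items and can therefore be regrouped into per-chunk partial sums (written here in the notation of the code further down):
grad = sum_{n=1}^{N} X[n,:]' * (Y[n,:] - X[n,:]*W)
     = sum over chunks k of ( sum_{n in chunk_k} X[n,:]' * (Y[n,:] - X[n,:]*W) )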
Though the principle is simple, when I parallelise with Julia I get a performance degradation, i.e. one process is faster than two processes! I am obviously doing something wrong... I have consulted other questions asked in the forum, but I could still not piece together an answer. I think my problem lies in a lot of unnecessary data movement, but I can't fix it properly.
In order to avoid posting messy neural network code, I am posting below a simpler example that replicates my problem in the setting of linear regression.
The code-block below creates some data for a linear regression problem. The code explains the constants, but X is the matrix containing the data inputs. We randomly create a weight vector w which when multiplied with X creates some targets Y.
######################################
## CREATE LINEAR REGRESSION PROBLEM ##
######################################
# This code implements a simple linear regression problem
MAXITER = 100 # number of iterations for simple gradient descent
N = 10000 # number of data items
D = 50 # dimension of data items
X = randn(N, D) # create random matrix of data, data items appear row-wise
Wtrue = randn(D,1) # create arbitrary weight matrix to generate targets
Y = X*Wtrue # generate targets
The next code-block below defines functions for measuring the fitness of our regression (i.e. the negative log-likelihood) and the gradient of the weight vector w:
####################################
## DEFINE FUNCTIONS ##
####################################
@everywhere begin
#-------------------------------------------------------------------
function negative_loglikelihood(Y, X, W)
#-------------------------------------------------------------------
    # number of data items
    N = size(X, 1)
    # accumulate here log-likelihood
    ll = 0
    for nn=1:N
        ll = ll - 0.5*sum((Y[nn,:] - X[nn,:]*W).^2)
    end
    return ll
end

#-------------------------------------------------------------------
function negative_loglikelihood_grad(Y, X, W, first_index, last_index)
#-------------------------------------------------------------------
    # number of data items
    N = size(X, 1)
    # accumulate here gradient contributions by each data item
    grad = zeros(similar(W))
    for nn=first_index:last_index
        grad = grad + X[nn,:]' * (Y[nn,:] - X[nn,:]*W)
    end
    return grad
end
end
Note that the above functions are on purpose not vectorised! I choose not to vectorise, as the final code (the neural network case) will also not admit any vectorisation (let us not get into more details regarding this).
Finally, the code-block below shows a very simple gradient descent that tries to recover the parameter weight vector w from the given data Y and X:
####################################
## SOLVE LINEAR REGRESSION ##
####################################
# start from random initial solution
W = randn(D,1)
# learning rate, set here to some arbitrary small constant
eta = 0.000001
# the following for-loop implements simple gradient descent
for iter=1:MAXITER
    # get gradient
    ref_array = Array(RemoteRef, nworkers())
    # let each worker process part of matrix X
    for index=1:length(workers())
        # first index of subset of X that worker should work on
        first_index = (index-1)*int(ceil(N/nworkers())) + 1
        # last index of subset of X that worker should work on
        last_index = min((index)*(int(ceil(N/nworkers()))), N)
        ref_array[index] = @spawn negative_loglikelihood_grad(Y, X, W, first_index, last_index)
    end
    # gather the gradients calculated on parts of matrix X
    grad = zeros(similar(W))
    for index=1:length(workers())
        grad = grad + fetch(ref_array[index])
    end
    # now that we have the gradient we can update parameters W
    W = W + eta*grad
    # report progress, monitor optimisation
    @printf("Iter %d neg_loglikel=%.4f\n", iter, negative_loglikelihood(Y, X, W))
end
As is hopefully visible, I tried to parallelise the calculation of the gradient in the easiest possible way here. My strategy is to break the calculation of the gradient in as many parts as available workers. Each worker is required to work only on part of matrix X, which part is specified by first_index and last_index. Hence, each worker should work with X[first_index:last_index,:]. For instance, for 4 workers and N = 10000, the work should be divided as follows:
worker 1 => first_index = 1, last_index = 2500
worker 2 => first_index = 2501, last_index = 5000
worker 3 => first_index = 5001, last_index = 7500
worker 4 => first_index = 7501, last_index = 10000
Unfortunately, this entire code works faster if I have only one worker. If I add more workers via addprocs(), the code runs slower. One can aggravate this issue by creating more data items, for instance using N=20000 instead.
With more data items, the degradation is even more pronounced.
In my particular computing environment, with N=20000 and one core the code runs in ~9 s. With N=20000 and 4 cores it takes ~18 s!
I tried many different things inspired by the questions and answers in this forum, but unfortunately to no avail. I realise that my parallelisation is naive and that data movement must be the problem, but I have no idea how to do it properly. The documentation also seems a bit scarce on this issue (as is the nice book by Ivo Balbaert).
I would appreciate your help, as I have been stuck with this for quite a while and I really need it for my work. For anyone wanting to run the code, to save you the trouble of copying and pasting, you can get the code here.
Thanks for taking the time to read this very lengthy question! Help me turn this into a model answer that anyone new in Julia can then consult!
I would say that GD is not a good candidate for parallelization using any of the proposed methods: either SharedArray or DistributedArray, or your own implementation distributing chunks of data.
The problem does not lie in Julia, but in the GD algorithm.
Consider the code:
Main process:
for iter = 1:iterations # iterations: "the more the better"
    δ = _gradient_descent_shared(X, y, θ)
    θ = θ - α * (δ/N)
end
The problem is in the above for-loop, which is unavoidable. No matter how good _gradient_descent_shared is, the total number of iterations kills the noble concept of parallelization.
After reading the question and the above suggestion I've started implementing GD using SharedArray. Please note, I'm not an expert in the field of SharedArrays.
The main process parts (simple implementation without regularization):
run_gradient_descent(X::SharedArray, y::SharedArray, θ::SharedArray, α, iterations) = begin
N = length(y)
for iter = 1:iterations
δ = _gradient_descent_shared(X, y, θ)
θ = θ - α * (δ/N)
end
θ
end
_gradient_descent_shared(X::SharedArray, y::SharedArray, θ::SharedArray, op=(+)) = begin
if size(X,1) <= length(procs(X))
return _gradient_descent_serial(X, y, θ)
else
rrefs = map(p -> (#spawnat p _gradient_descent_serial(X, y, θ)), procs(X))
return mapreduce(r -> fetch(r), op, rrefs)
end
end
The code common to all workers:
#= Returns the range of indices of a chunk for every worker on which it can work.
The function splits data examples (N rows into chunks),
not the parts of the particular example (features dimensionality remains intact). =#
@everywhere function _worker_range(S::SharedArray)
    idx = indexpids(S)
    if idx == 0
        return 1:size(S,1), 1:size(S,2)
    end
    nchunks = length(procs(S))
    splits = [round(Int, s) for s in linspace(0, size(S,1), nchunks+1)]
    splits[idx]+1:splits[idx+1], 1:size(S,2)
end

# Computations on a chunk of the full data.
@everywhere _gradient_descent_serial(X::SharedArray, y::SharedArray, θ::SharedArray) = begin
    prange = _worker_range(X)
    pX = sdata(X[prange[1], prange[2]])
    py = sdata(y[prange[1], :])
    tempδ = pX' * (pX * sdata(θ) .- py)
end
The data loading and training. Let me assume that we have:
features in X::Array of size (N, D), where N is the number of examples and D is the dimensionality of the features
labels in y::Array of size (N, 1)
The main code might look like this:
X=[ones(size(X,1)) X] #adding the artificial coordinate
N, D = size(X)
MAXITER = 500
α = 0.01
initialθ = SharedArray(Float64, (D,1))
sX = convert(SharedArray, X)
sy = convert(SharedArray, y)
X = nothing
y = nothing
gc()
finalθ = run_gradient_descent(sX, sy, initialθ, α, MAXITER);
After implementing this and running it (on the 8 cores of my Intel Core i7), I got a very slight acceleration over serial GD (1 core) on my multiclass (19 classes) training data (715 s for serial GD / 665 s for shared GD).
If my implementation is correct (please check this, I'm counting on that), then parallelization of the GD algorithm is not worth it. You would most likely get better acceleration using stochastic GD on a single core.
If you want to reduce the amount of data movement, you should strongly consider using SharedArrays. You could preallocate just one output vector, and pass it as an argument to each worker. Each worker sets a chunk of it, just as you suggested.

Performance of swapping two elements in MATLAB

Purely as an experiment, I'm writing sort functions in MATLAB then running these through the MATLAB profiler. The aspect I find most perplexing is to do with swapping elements.
I've found that the "official" way of swapping two elements in a matrix
self.Data([i1, i2]) = self.Data([i2, i1])
runs much slower than doing it in four lines of code:
e1 = self.Data(i1);
e2 = self.Data(i2);
self.Data(i1) = e2;
self.Data(i2) = e1;
The total time taken by the second example is about 12 times less than by the single line of code in the first example.
Would somebody have an explanation as to why?
Based on suggestions posted, I've run some more tests.
It appears the performance hit comes when the same matrix is referenced in both the LHS and RHS of the assignment.
My theory is that MATLAB uses an internal reference-counting / copy-on-write mechanism, and this is causing the entire matrix to be copied internally when it's referenced on both sides. (This is a guess because I don't know the MATLAB internals).
Here are the results from calling the function 885548 times. (The difference here is a factor of four, not twelve as I originally posted. Each of the functions has the additional function-wrapping overhead, while in my initial post I just summed up the individual lines.)
swap1: 12.547 s
swap2: 14.301 s
swap3: 51.739 s
Here's the code:
methods (Access = public)
    function swap(self, i1, i2)
        swap1(self, i1, i2);
        swap2(self, i1, i2);
        swap3(self, i1, i2);
        self.SwapCount = self.SwapCount + 1;
    end
end

methods (Access = private)
    %
    % swap1: stores values in temporary doubles
    %        This has the best performance
    %
    function swap1(self, i1, i2)
        e1 = self.Data(i1);
        e2 = self.Data(i2);
        self.Data(i1) = e2;
        self.Data(i2) = e1;
    end
    %
    % swap2: stores values in a temporary matrix
    %        Marginally slower than swap1
    %
    function swap2(self, i1, i2)
        m = self.Data([i1, i2]);
        self.Data([i2, i1]) = m;
    end
    %
    % swap3: does not use variables for storage.
    %        This has the worst performance
    %
    function swap3(self, i1, i2)
        self.Data([i1, i2]) = self.Data([i2, i1]);
    end
end
In the first (slow) approach, the RHS value is a matrix, so I think MATLAB incurs a performance penalty in creating a new matrix to store the two elements. The second (fast) approach avoids this by working directly with the elements.
Check out the "Techniques for Improving Performance" article on MathWorks for ways to improve your MATLAB code.
You could also do:
tmp = self.Data(i1);
self.Data(i1) = self.Data(i2);
self.Data(i2) = tmp;
Zach is potentially right in that a temporary copy of the matrix may be made to perform the first operation, although I would hazard a guess that there is some internal optimization within MATLAB that attempts to avoid this. It may be a function of the version of MATLAB you are using. I tried both of your cases in version 7.1.0.246 (a couple years old) and only saw a speed difference of about 2-2.5.
It's possible that this may be an example of speed improvement by what's called "loop unrolling". When doing vector operations, at some level within the internal code there is likely a FOR loop which loops over the indices you are swapping. By performing the scalar operations in the second example, you are avoiding any overhead from loops. Note these two (somewhat silly) examples:
vec = [1 2 3 4];
%Example 1:
for i = 1:4,
    vec(i) = vec(i)+1;
end;
%Example 2:
vec(1) = vec(1)+1;
vec(2) = vec(2)+1;
vec(3) = vec(3)+1;
vec(4) = vec(4)+1;
Admittedly, it would be much easier to simply use vector operations like:
vec = vec+1;
but the examples above are for the purpose of illustration. When I repeat each example multiple times over and time them, Example 2 is actually somewhat faster than Example 1. For a small loop with a known number (in the example, just 4), it can actually be more efficient to forgo the loop. Of course, in this particular example, the vector operation given above is actually the fastest.
I usually follow this rule: Try a few different things, and pick the fastest for your specific problem.
This post deserves an update, since the JIT compiler is now a thing (since R2015b) and so is timeit (since R2013b) for more reliable function timing.
Below is a short benchmarking function for element swapping within a large array.
I have used the terms "directly swapping" and "using a temporary variable" to describe the two methods in the question respectively.
The results are pretty staggering: the performance of directly swapping two elements becomes increasingly poor compared with using a temporary variable as the array grows.
function benchie()
    % Variables for plotting, loop to increase size of the arrays
    M = 15; D = zeros(1,M); W = zeros(1,M);
    for n = 1:M
        N = 2^n;
        % Create some random array of length N, and random indices to swap
        v = rand(N,1);
        x = randi([1, N], N, 1);
        y = randi([1, N], N, 1);
        % Time the functions
        D(n) = timeit(@()direct);
        W(n) = timeit(@()withtemp);
    end
    % Plotting
    plot(2.^(1:M), D, 2.^(1:M), W);
    legend('direct', 'with temp')
    xlabel('number of elements'); ylabel('time (s)')

    function direct()
        % Direct swapping of two elements
        for k = 1:N
            v([x(k) y(k)]) = v([y(k) x(k)]);
        end
    end

    function withtemp()
        % Using an intermediate temporary variable
        for k = 1:N
            tmp = v(y(k));
            v(y(k)) = v(x(k));
            v(x(k)) = tmp;
        end
    end
end
