How to calculate the loss over a number of images, backpropagate the average loss, and update the network weights

I am doing a task where the batch size is 1, i.e., each batch contains only one image. So I have to do manual batching: when the number of accumulated losses reaches a certain number, average the loss and then do backpropagation.
My original code is:
real_batchsize = 200
for epoch in range(1, 5):
    net.train()
    total_loss = Variable(torch.zeros(1).cuda(), requires_grad=True)
    iter_count = 0
    for batch_idx, (input, target) in enumerate(train_loader):
        input, target = Variable(input.cuda()), Variable(target.cuda())
        output = net(input)
        loss = F.nll_loss(output, target)
        total_loss = total_loss + loss
        if batch_idx % real_batchsize == 0:
            iter_count += 1
            ave_loss = total_loss/real_batchsize
            ave_loss.backward()
            optimizer.step()
            if iter_count % 10 == 0:
                print("Epoch:{}, iteration:{}, loss:{}".format(epoch,
                                                               iter_count,
                                                               ave_loss.data[0]))
            total_loss.data.zero_()
            optimizer.zero_grad()
This code will give the error message
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
I have tried the following ways.
First way (failed)
I read some posts about this error message but cannot understand it fully. Changing ave_loss.backward() to ave_loss.backward(retain_graph=True) prevents the error message, but the loss doesn't improve and soon becomes nan.
Second way (failed)
I also tried changing the accumulation to total_loss = total_loss + loss.data[0]. This also prevents the error message, but the loss always stays the same, so there must be something wrong.
Third way (success)
Following the instructions in this post, I divide each image's loss by real_batchsize and backprop it. When the number of input images reaches real_batchsize, I do one parameter update using optimizer.step(). The loss slowly decreases as training goes on, but training is really slow, because we backprop for each image.
My question
What does the error message mean in my case? Also, why don't the first and second ways work? How do I write the code correctly, so that we backprop the gradient every real_batchsize images and update the weights once, making training faster? I know my code is nearly correct, but I just do not know how to change it.

The problem you encounter here is related to how PyTorch accumulates gradients over different passes. (see here for another post on a similar question)
So let's have a look at what happens when you have code of the following form:
loss_total = Variable(torch.zeros(1).cuda(), requires_grad=True)
for l in (loss_func(x1,y1), loss_func(x2, y2), loss_func(x3, y3), loss_func(x4, y4)):
    loss_total = loss_total + l
    loss_total.backward()
Here, we do a backward pass when loss_total has the following values over the different iterations:
loss_total = loss(x1, y1)
loss_total = loss(x1, y1) + loss(x2, y2)
loss_total = loss(x1, y1) + loss(x2, y2) + loss(x3, y3)
loss_total = loss(x1, y1) + loss(x2, y2) + loss(x3, y3) + loss(x4, y4)
so when you call .backward() on loss_total each time, you actually call .backward() on loss(x1, y1) four times! (and on loss(x2, y2) three times, etc.)
Combine that with what is discussed in the other post - namely that, to optimize memory usage, PyTorch frees the graph attached to a Variable when .backward() is called (thereby destroying the gradients connecting x1 to y1, x2 to y2, etc.) - and you can see what the error message means: you try to do a backward pass over a loss several times, but the underlying graph was freed after the first pass (unless you specify retain_graph=True, of course).
As for the specific variations you have tried:
First way: here, you will accumulate (i.e. sum up - again, see the other post) gradients forever, with them (potentially) adding up to inf.
Second way: here, you convert loss to a tensor by doing loss.data, removing the Variable wrapper, and thereby deleting the gradient information (since only Variables hold gradients).
Third way: here, you only do one pass through each (xk, yk) tuple, since you immediately do a backprop step, avoiding the above problem altogether.
SOLUTION: I have not tested it, but from what I gather, the solution should be pretty straightforward: create a new total_loss object at the beginning of each batch, then sum all of the losses into that object, and then do one final backprop step at the end.
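In code, that sketch might look like the following (untested, reusing the names from the question; note that total_loss starts out as a plain Python number, so a fresh graph is built for every batch):

real_batchsize = 200
for epoch in range(1, 5):
    net.train()
    total_loss = 0                       # plain number: a fresh graph per batch
    optimizer.zero_grad()
    # start=1 so the first update happens after real_batchsize images
    for batch_idx, (input, target) in enumerate(train_loader, start=1):
        input, target = Variable(input.cuda()), Variable(target.cuda())
        output = net(input)
        total_loss = total_loss + F.nll_loss(output, target)
        if batch_idx % real_batchsize == 0:
            ave_loss = total_loss / real_batchsize
            ave_loss.backward()          # one backward pass over this batch's graph
            optimizer.step()
            optimizer.zero_grad()
            total_loss = 0               # drop the old graph, start a new one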

Related

Creating function for implementing steepest descent algorithm

I am trying to implement steepest descent algorithm for minimization of 2D function. Let me explain with example.
I have function f1(x1,x2) = 2*x1^2 + x2^2 - 5*x1*x2 and starting with initial guess point p0 = [1,0].
Step 1: Initial guess point p0 = [1,0], convergence parameter e=0.1
Step 2: Calculate the gradient c1 of f1 at p0: c1=[4,-5]. I am using the central difference method for that.
Step 3: If norm of c1>e, go to step 4, otherwise stop.
Step 4: Our direction of search is d1 = - c1. So, d1 = [-4,5].
Step 5: Find the step size a that minimizes g(a) = f1(p0 + a*d1) = f1(1-4a, 5a).
Step 6: Update p0 to p1 = p0 + a*d1 and go to step 2.
I am trying to implement this example in MATLAB and do not know how to implement step 5. I know that any 1D search algorithm, such as the bisection method, can work. But the problem is 'converting' f1(1-4a,5a) into a function of a alone, that is, substituting (1-4a,5a) into f1. I run into the symbolic variable a here, which I do not know how to deal with. If I write a minimization function, I can pass values to it, but I am not sure about the symbolic variable a. I do not want to use special features of MATLAB, such as symbolics, and am trying to keep the code general, so I can convert it into other programming languages without any problems. Your suggestions are welcome.
You probably want to use fminbnd to place some bounds on the line search (-1 to 1 below), since searching the entire line is an ill-defined task. There is no need for a symbolic variable. Step size is determined by minimizing an anonymous function below:
a = fminbnd(@(a) f1(p(1) + a*d(1), p(2) + a*d(2)), -1, 1);
Here p is the current point and d is the direction of search. (I wouldn't want to call them p1 and d1 as in your description, as if they are something used for the 1st step only.)
The example allows negative a, which I find helpful in practice; sometimes it's better to go in the opposite direction from what the gradient suggests. To disallow this, change the bounds to 0, 1 (or 0, 10 for a more aggressive search in the forward direction).
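For reference, the whole loop might look like this in Python (an untested sketch; scipy.optimize.minimize_scalar with the 'bounded' method plays the role of fminbnd, and the gradient is approximated by central differences as in the question):

import numpy as np
from scipy.optimize import minimize_scalar

def f1(x1, x2):
    return 2*x1**2 + x2**2 - 5*x1*x2

def central_diff_grad(f, p, h=1e-6):
    # central-difference approximation of the gradient at p
    g = np.zeros_like(p)
    for i in range(len(p)):
        e = np.zeros_like(p)
        e[i] = h
        g[i] = (f(*(p + e)) - f(*(p - e))) / (2*h)
    return g

p = np.array([1.0, 0.0])    # initial guess p0
eps = 0.1                   # convergence parameter e
for _ in range(100):        # iteration cap, in case the search diverges
    c = central_diff_grad(f1, p)
    if np.linalg.norm(c) <= eps:
        break
    d = -c                  # steepest-descent direction
    # line search: minimize f1(p + a*d) over a in [-1, 1]
    res = minimize_scalar(lambda a: f1(*(p + a*d)), bounds=(-1, 1), method='bounded')
    p = p + res.x * d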

Global minimum in a huge convex matrix by using small matrices

I have a function J(x,y,z) that gives me the value at those coordinates. This function is convex. What I need is to find the minimum value over this huge matrix of values.
At first I tried to loop through all of them, calculate the values, and then search with the min function, but that takes too long...
so I decided to take advantage of the convexity.
Take a random (for now) set of coordinates, which will be the center of my small 3x3x3 matrix, find the local minimum, and make it the center of the next matrix. This continues until we reach the global minimum.
Another issue is that the function is not perfectly convex, so 'fake' local minima can appear as well, so I'm thinking of a control measure: when the search finds a fake minimum, increase the search range to make sure of it.
How would you advise me to go with it? Is this approach good? Or should I look into something else?
This is something I started myself but I am fairly new to Matlab and I am not sure how to continue.
clear all
clc
min=100;
%the initial size of the search matrix 2*level +1
level=1;
i=input('Enter the starting coordinate for i (X) : ');
j=input('Enter the starting coordinate for j (Y) : ');
k=input('Enter the starting coordinate for k (Z) : ');
for m=i-level:i+level
    for n=j-level:j+level
        for p=k-level:k+level
            A(m,n,p)=J(m,n,p);
            if A(m,n,p)<min
                min=A(m,n,p);
            end
        end
    end
end
display(min, 'Minim');
[r,c,d] = ind2sub(size(A),find(A ==min));
display(r,'X');
display(c,'Y');
display(d,'Z');
Any guidance, improvement and constructive criticism are appreciated. Thanks in advance.
Try fminsearch because it is fairly general and easy to use. This is especially easy if you can specify your function anonymously. For example:
aFunc = @(x)100*(x(2)-x(1)^2)^2+(1-x(1))^2
then using fminsearch:
[x,fval] = fminsearch( aFunc, [-1.2, 1]);
If your 3-dimensional function, J(x,y,z), can be described anonymously or as a regular function, then you can try fminsearch. The input takes a vector, so you would need to write your function as J(X), where X is a vector of length 3, with x=X(1), y=X(2), z=X(3).
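For what it's worth, the same vector-wrapping idea looks like this in Python, where scipy.optimize.minimize with the Nelder-Mead method is the closest analogue of fminsearch (an untested sketch; J below is just a hypothetical placeholder for your function):

import numpy as np
from scipy.optimize import minimize

def J(x, y, z):
    # hypothetical convex example, standing in for your real function
    return (x - 1)**2 + (y + 2)**2 + (z - 0.5)**2

# wrap the 3-argument function so the optimizer sees a single vector X
res = minimize(lambda X: J(X[0], X[1], X[2]),
               x0=np.array([0.0, 0.0, 0.0]),
               method='Nelder-Mead')
print(res.x, res.fun)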
fminsearch can fail, especially if the starting point is not near the solution. It is often better to refine the initial starting point. For example, the code below samples a patch around the starting vector and generally improves the chances of finding the global minimum.
% deltaR is used to refine the start vector with a scatter min search over a
% region defined by a patch of [-deltaR+startVec(i):dx:deltaR+startVec(i)] on
% a side.
% Determine dx using maxIter.
maxIter = 1e4;
dx = max( ( 2*deltaR+1)^2/maxIter, 1/8);
dim = length( startVec);
[x,y] = meshgrid( [-deltaR:dx:deltaR]);
xV = zeros( length(x(:)), dim);
% Alternate patches as sequential x-y grids.
for ii = 1:2:dim
    xV(:, ii) = startVec(ii) + x(:);
end
for ii = 2:2:dim
    xV(:, ii) = startVec(ii) + y(:);
end
% Find the scatter min index to update startVec.
for ii = 1: length( xV)
    nS(ii)=aFunc( xV(ii,:));
end
[fSmin, iw] = min( nS);
startVec = xV( iw,:);
fSmin = fSmin        % no trailing semicolon, so the refined minimum is displayed
startVec = startVec  % likewise for the refined start vector
[x,fval] = fminsearch( aFunc, startVec);
You can run a 2-dimensional test case f(x,y)=z on AlgorithmHub. The app runs the above code in Octave. You can also edit the inline function (and possibly even try your problem) from that website.

matlab: optimum amount of points for linear fit

I want to make a linear fit to a few data points, as shown in the image. Since I know the intercept (in this case, say 0.05), I want to fit only the points which are in the linear region with this particular intercept. In this case those will be, let's say, points 5:22 (but not 22:30).
I'm looking for a simple algorithm to determine this optimal number of points, based on... hmm, that's the question... R^2? Any ideas how to do it?
I was thinking about probing R^2 for fits using points 1 to 2:30, 2 to 3:30, and so on, but I don't really know how to enclose that in a clear and simple function. For fits with a fixed intercept I'm using polyfit0 (http://www.mathworks.com/matlabcentral/fileexchange/272-polyfit0-m). Thanks for any suggestions!
EDIT:
sample data:
intercept = 0.043;
x = 0.01:0.01:0.3;
y = [0.0530642513911393,0.0600786706929529,0.0673485248329648,0.0794662409166333,0.0895915873196170,0.103837395346484,0.107224784565365,0.120300492775786,0.126318699218730,0.141508831492330,0.147135757370947,0.161734674733680,0.170982455701681,0.191799936622712,0.192312642057298,0.204771365716483,0.222689541632988,0.242582251060963,0.252582727297656,0.267390860166283,0.282890010610515,0.292381165948577,0.307990544720676,0.314264952297699,0.332344368808024,0.355781519885611,0.373277721489254,0.387722683944356,0.413648156978284,0.446500064130389;];
What you have here is a rather difficult problem to solve in general.
One approach would be to compute the slopes/intercepts between all consecutive pairs of points, and then do cluster analysis on the intercepts:
slopes = diff(y)./diff(x);
intersepts = y(1:end-1) - slopes.*x(1:end-1);
idx = kmeans(intersepts, 3);
x([idx; 3] == 2)   % the points whose intercepts are closest to the linear one
                   % (the appended 3 pads idx to the length of x)
This requires the Statistics Toolbox (for kmeans). This was the best of all the methods I tried, although the range of points found this way might have a few small holes in it; e.g., when the slopes between two points in the start and end ranges lie close to the slope of the line, those points will be detected as belonging to the line. This (and other factors) will require a bit more post-processing of the solution found this way.
Another approach (which I failed to construct successfully) is to do a linear fit in a loop, each time increasing the range of points from some point in the middle towards both of the endpoints, and checking whether the sum of squared errors remains small. I gave up on this very quickly, because defining what "small" is is very subjective and must be done in some heuristic way.
I tried a more systematic and robust approach of the above:
function test
    %% example data
    slope = 2;
    intercept = 1.5;
    x = linspace(0.1, 5, 100).';
    y = slope*x + intercept;
    y(1:12) = log(x(1:12)) + y(12)-log(x(12));
    y(74:100) = y(74:100) + (x(74:100)-x(74)).^8;
    y = y + 0.2*randn(size(y));
    %% simple algorithm
    [X,fn] = fminsearch(@(ii)P(ii, x,y,intercept), [0.5 0.5])
    [~,inds] = P(X, x,y,intercept)
end

function [C, inds] = P(ii, x,y,intercept)
    % ii represents the fraction of the range from the center to each end,
    % so ii lies between 0 and 1.
    N = numel(x);
    n = round(N/2);
    ii = round(ii*n);
    inds = min(max(1, n+(-ii(1):ii(2))), N);
    % Solve the linear system with fixed intercept
    A = x(inds);
    b = y(inds) - intercept;
    % and return the sum of squared errors, divided by the number of points
    % included in the set. This last step is required to prevent fminsearch
    % from reducing the set to 1 point (= minimum possible squared error).
    C = sum(((A\b)*A - b).^2)/numel(inds);
end
which only finds a rough approximation to the desired indices (12 and 74 in this example).
When fminsearch is run a few dozen times with random starting values (really just rand(1,2)), it gets more reliable, but I still wouldn't bet my life on it.
If you have the statistics toolbox, use the kmeans option.
Depending on the number of data values, I would split the data into a relatively small number of overlapping segments, and for each segment calculate the linear fit, or rather the first-order coefficient (remember, you know the intercept, which will be the same for all segments).
Then, for each coefficient, calculate the MSE between the resulting line and the entire dataset, choosing the coefficient which yields the smallest MSE.
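A rough Python sketch of this segment idea (untested; the helper name and the segment sizing are my own choices, and the fixed intercept is handled by fitting on y - intercept):

import numpy as np

def best_segment_slope(x, y, intercept, n_segments=6):
    # Fit a fixed-intercept line on overlapping segments and keep the
    # slope whose line gives the smallest MSE over the entire dataset.
    x, y = np.asarray(x), np.asarray(y)
    N = len(x)
    seg_len = max(2, 2*N // n_segments)      # segments overlap by ~50%
    step = max(1, seg_len // 2)
    best_slope, best_mse = None, np.inf
    for start in range(0, N - seg_len + 1, step):
        xs = x[start:start + seg_len]
        ys = y[start:start + seg_len] - intercept
        slope = xs.dot(ys) / xs.dot(xs)      # least squares with the intercept fixed
        mse = np.mean((slope*x + intercept - y)**2)
        if mse < best_mse:
            best_slope, best_mse = slope, mse
    return best_slope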

Algorithm for calculating the sum-of-squares distance of a rolling window from a given line function

Given a line function y = a*x + b (a and b are previously known constants), it is easy to calculate the sum-of-squares distance between the line and a window of samples (1, Y1), (2, Y2), ..., (n, Yn) (where Y1 is the oldest sample and Yn is the newest):
sum((Yx - (a*x + b))^2 for x in 1,...,n)
I need a fast algorithm for calculating this value for a rolling window (of length n) - I cannot rescan all the samples in the window every time a new sample arrives.
Obviously, some state should be saved and updated for every new sample that enters the window and every old sample leaves the window.
Notice that when a sample leaves the window, the indices of the remaining samples change as well: every Yx becomes Y(x-1). Therefore, when a sample leaves the window, every other sample in the window contributes a different value to the new sum: (Yx - (a*(x-1) + b))^2 instead of (Yx - (a*x + b))^2.
Is there a known algorithm for calculating this? If not, can you think of one? (It is ok to have some mistakes due to first-order linear approximations).
Won't a straightforward approach do the trick?...
By 'straightforward' I mean maintaining a queue of samples. Once a new sample arrives, you would:
pop the oldest sample from the queue
subtract its distance from your sum
append the new sample to the queue
calculate its distance and add it to your sum
As for time, everything here is O(1) if the queue is implemented as a linked list or something similar. You would want to store the distance with your samples in the queue too, so you calculate it only once. The memory usage is thus 3 floats per sample - O(n).
If you expand the term (Yx - (a*x + b))^2, it breaks into three kinds of parts:
Terms of only a,x and b. These produce some constant when summed over n and can be ignored.
Terms of only Yx and b. These can be handled in the style of a boxcar integrator as @Xion described.
One term of -2*Yx*a*x. The -2*a is a constant, so ignore that part. Consider the partial sum S = Y1*1 + Y2*2 + Y3*3 + ... + Yn*n. Given Y1 and a running sum R = Y1 + Y2 + ... + Yn, you can form S - R, which eliminates the Y1*1 term and reduces the index of each remaining term by one, leaving Y2*1 + Y3*2 + ... + Yn*(n-1). Now update the running sum R as for (2), by subtracting off Y1 and adding Y(n+1). Then add the new sample's term, Y(n+1)*n, to S.
Now just add up all those partial terms.
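A concrete Python sketch of these running sums (untested; the class wrapper is my own, and note that the updates are exact, so no first-order approximation is actually needed):

from collections import deque

class RollingSSE:
    # Sum of squared distances between the window (1,Y1)..(n,Yn) and the
    # line y = a*x + b, updated in O(1) per new sample.
    def __init__(self, a, b, n):
        self.a, self.b, self.n = a, b, n
        self.win = deque()
        self.R = 0.0    # running sum of Yx
        self.S = 0.0    # running sum of x*Yx (x = 1-based position in window)
        self.Q = 0.0    # running sum of Yx^2
        self.K = sum((a*x + b)**2 for x in range(1, n + 1))  # constant term

    def push(self, y_new):
        if len(self.win) == self.n:        # window full: evict the oldest sample
            y_old = self.win.popleft()
            self.R -= y_old
            self.Q -= y_old * y_old
            self.S -= y_old                # remove the Y1*1 term ...
            self.S -= self.R               # ... and shift all other indices down by 1
        self.win.append(y_new)
        k = len(self.win)                  # the new sample sits at index k
        self.R += y_new
        self.Q += y_new * y_new
        self.S += k * y_new

    def sse(self):
        # E = sum (Yx - (a*x+b))^2 = Q - 2*a*S - 2*b*R + K  (valid once the window is full)
        return self.Q - 2*self.a*self.S - 2*self.b*self.R + self.K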

Efficient algorithm to find String overlaps

I won't go into the details of the problem I'm trying to solve, but it deals with a large string and involves finding overlapping intervals that exist in the string. I can only use one of the intervals that overlap, so I wanted to separate these intervals out and analyze them individually. I was wondering what algorithm to use to do this as efficiently as possible.
I must stress that speed is paramount here. I need to separate the intervals as quickly as possible. The algorithm that came to my mind was an Interval Tree, but I wasn't sure if that's the best that we can do.
Interval trees can be queried in O(log n) time, where n is the number of intervals, and construction requires O(n log n) time, though I wanted to know if we can cut down on either.
Thanks!
Edit: I know the question is vague. I apologize for the confusion. I suggest that people look at the answer by Aaron Huran and the comments on the same. That should help clarify things a lot more.
Well, I was bored last night, so I did this in Python. It's unnecessarily recursive (I just read The Little Schemer and think recursion is super neat right now), but it solves your problem and handles all the input I threw at it.
intervals = [(0,4), (5,13), (8,19), (10,12)]

def overlaps(x,y):
    x1, x2 = x
    y1, y2 = y
    return (
        (x1 <= y1 <= x2) or
        (x1 <= y2 <= x2) or
        (y1 <= x1 <= y2) or
        (y1 <= x2 <= y2)
    )

def find_overlaps(intervals, checklist=None, pending=None):
    if not intervals:
        return []
    interval = intervals.pop()
    if not checklist:
        return find_overlaps(intervals, [interval], [interval])
    check = checklist.pop()
    if overlaps(interval, check):
        pending = pending or []
        checklist.append(check)
        checklist.append(interval)
        return pending + [interval] + find_overlaps(intervals, checklist)
    else:
        intervals.append(interval)
        return find_overlaps(intervals, checklist)
Use like this:
>>> find_overlaps(intervals)
[(10, 12), (8, 19), (5, 13)]
Note that it returns all overlapping intervals in REVERSE order of their start point. Hopefully that's a minor issue. That's only happening because I'm using append() and pop() on the list, which operate on the end of the list, rather than insert(0, ...) and pop(0), which operate on the beginning.
This isn't perfect, but it runs in linear time. Also remember that the size of the actual string doesn't matter at all - the running time is relative to the number of intervals, not the size of the string.
You may want to try using Ukkonen's algorithm (see https://en.wikipedia.org/wiki/Ukkonen%27s_algorithm).
There is a free code version at http://biit.cs.ut.ee/~vilo/edu/2002-03/Tekstialgoritmid_I/Software/Loeng5_Suffix_Trees/Suffix_Trees/cs.haifa.ac.il/shlomo/suffix_tree/suffix_tree.c
You are looking to calculate the difference between the two strings right? What language are you trying to do this in?
Update:
Without any sort of criteria for how you will select which intervals to use, there is an enormous number of possible solutions.
One method would be to take the interval with the lowest starting number and grab its end.
Then grab the next interval whose starting number is higher than the previous interval's end, take its end, and repeat.
So for 0-4, 5-13, 8-19, 10-12
You get: 0-4, 5-13 and ignore the others.
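In Python, that greedy selection might look like this (a sketch; intervals are assumed to be (start, end) tuples):

def select_non_overlapping(intervals):
    # Sort by start; keep an interval only if it starts after the
    # end of the last interval kept.
    selected = []
    last_end = float('-inf')
    for start, end in sorted(intervals):
        if start > last_end:
            selected.append((start, end))
            last_end = end
    return selected

# select_non_overlapping([(0, 4), (5, 13), (8, 19), (10, 12)])
# -> [(0, 4), (5, 13)]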
