I am implementing various backpropagation training algorithms for the same dataset and trying to compare their performance. I used the following tutorial as a reference:
https://nl.mathworks.com/help/nnet/ug/choose-a-multilayer-neural-network-training-function.html
I tried to plot:
mean squared error versus execution time for each algorithm
time required to converge versus the mean squared error convergence goal for each algorithm
I have used the following code to create my neural networks, and I would like to know how I can implement the two plots above.
%Data
x=0:0.2:6*pi; y=sin(x);
p=con2seq(x); t=con2seq(y);
% Networks
net1=feedforwardnet(20,'trainlm');
net2=feedforwardnet(20,'traingd');
net2.iw{1,1}=net1.iw{1,1}; % give net2 the same initial weights and biases as net1
net2.lw{2,1}=net1.lw{2,1};
net2.b{1}=net1.b{1};
net2.b{2}=net1.b{2};
%training and simulation
net1.trainParam.epochs=1; % set the number of epochs for the training
net2.trainParam.epochs=1;
net1=train(net1,p,t); % train the networks
net2=train(net2,p,t);
a11=sim(net1,p); a21=sim(net2,p); % simulate the networks with the input vector p
net1.trainParam.epochs=14; % train 14 more epochs (15 in total)
net2.trainParam.epochs=14;
net1=train(net1,p,t);
net2=train(net2,p,t);
a12=sim(net1,p); a22=sim(net2,p);
net1.trainParam.epochs=985; % train 985 more epochs (1000 in total)
net2.trainParam.epochs=985;
net1=train(net1,p,t);
net2=train(net2,p,t);
a13=sim(net1,p); a23=sim(net2,p);
%plots
figure
subplot(3,3,1);
plot(x,y,'bx',x,cell2mat(a11),'r',x,cell2mat(a21),'g'); % plot the sine function and the output of the networks
title('1 epoch');
legend('target','trainlm','traingd');
subplot(3,3,2);
postregm(cell2mat(a11),y); % perform a linear regression analysis and plot the result
subplot(3,3,3);
postregm(cell2mat(a21),y);
%
subplot(3,3,4);
plot(x,y,'bx',x,cell2mat(a12),'r',x,cell2mat(a22),'g');
title('15 epochs');
legend('target','trainlm','traingd');
subplot(3,3,5);
postregm(cell2mat(a12),y);
subplot(3,3,6);
postregm(cell2mat(a22),y);
%
subplot(3,3,7);
plot(x,y,'bx',x,cell2mat(a13),'r',x,cell2mat(a23),'g');
title('1000 epochs');
legend('target','trainlm','traingd');
subplot(3,3,8);
postregm(cell2mat(a13),y);
subplot(3,3,9);
postregm(cell2mat(a23),y);
Note that MSE is the default performance measure when none is specified.
When training you can do something like:
[net, tr] = train(net, x, t);
Then plot tr.perf (training error per epoch), tr.tperf (test error), or tr.vperf (validation error).
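The execution-time axis can be obtained by wrapping each train call in tic/toc and recording the elapsed time together with the achieved error. The plotting itself is independent of the toolbox; below is a minimal sketch of the structure of the two requested plots in Python/matplotlib, with entirely hypothetical numbers:

import matplotlib.pyplot as plt

# Hypothetical recorded results per algorithm: cumulative training time (s)
# and the MSE reached at that time
results = {
    "trainlm": {"time": [0.1, 0.4, 2.0], "mse": [1e-2, 1e-4, 1e-6]},
    "traingd": {"time": [0.5, 2.5, 12.0], "mse": [1e-1, 1e-2, 1e-3]},
}

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Plot 1: mean squared error versus execution time
for name, r in results.items():
    ax1.loglog(r["time"], r["mse"], "o-", label=name)
ax1.set_xlabel("execution time (s)"); ax1.set_ylabel("MSE"); ax1.legend()

# Plot 2: time required to converge versus the MSE convergence goal,
# read off the same records (first time the error drops below each goal)
goals = [1e-1, 1e-2, 1e-3]
for name, r in results.items():
    t_goal = [next(t for t, m in zip(r["time"], r["mse"]) if m <= g) for g in goals]
    ax2.loglog(goals, t_goal, "o-", label=name)
ax2.set_xlabel("MSE convergence goal"); ax2.set_ylabel("time to converge (s)"); ax2.legend()
plt.show()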
Predicting the probability of class assignment for each chosen sample from Train_features:
probs = classifier.predict_proba(Train_features)
Choosing the class for which the AUC has to be determined.
preds = probs[:,1]
Calculating the false positive rate, the true positive rate, and the candidate thresholds used to separate the classes.
fpr, tpr, threshold = metrics.roc_curve(Train_labels, preds)
roc_auc = metrics.auc(fpr, tpr)
print(max(threshold))
Output: 1.97834
The previous answer did not really address your question of why the threshold is > 1, and in fact is misleading when it says the threshold does not have any interpretation.
The range of the thresholds should technically be [0, 1], because they are probability thresholds. But scikit-learn adds +1 to the largest score to create one extra threshold at which no sample is predicted positive, so that the curve covers the full range. So if in your example max(threshold) = 1.97834, the very next number in the threshold array should be 0.97834.
See this sklearn github issue thread for an explanation. It's a little funny because somebody thought this was a bug, but it's just how the creators of sklearn decided to define the thresholds.
Finally, because it is a probability threshold, it does have a very useful interpretation. The optimal cutoff is the threshold at which sensitivity + specificity is maximal (Youden's J statistic). In scikit-learn this can be computed like so:
from sklearn.metrics import roc_curve
import numpy as np

fpr_p, tpr_p, thresh = roc_curve(true_labels, pred)
# maximize sensitivity + specificity, i.e. tpr + (1-fpr) or just tpr-fpr
th_optimal = thresh[np.argmax(tpr_p - fpr_p)]
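To see the out-of-range first threshold concretely, here is a small self-contained example with made-up labels and scores:

import numpy as np
from sklearn import metrics

true_labels = np.array([0, 0, 1, 1, 0, 1, 1, 0])            # made-up data
pred        = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.5])

fpr, tpr, thresh = metrics.roc_curve(true_labels, pred)
# Older scikit-learn versions report thresh[0] == max(pred) + 1 (here 1.8);
# newer versions use np.inf instead. Either way it is the "predict nothing
# positive" threshold, and thresh[1] is the true maximum score (0.8).
print(thresh)
print(metrics.auc(fpr, tpr))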
The threshold value itself does not have any particular interpretation here; what really matters is the shape of the ROC curve. Your classifier performs well if there are thresholds (no matter their values) such that the generated ROC curve lies above the diagonal (better than random guessing). Your classifier is perfect (this happens rarely in practice) if the curve reaches the point (0, 1), and it is at its worst if the curve collapses to the point (1, 0). A good summary indicator of the performance of your classifier is the integral of the ROC curve, known as the AUC; it is bounded between 0 and 1, with 0 for the worst performance and 1 for perfect performance.
I want to solve the following nonlinear ODE, in two ways:
Code#1:
tspan1 =t0:0.05:TT;
[t1,y1] = ode45(@(t1,T) ((1-alpha)*Q-sigm*(T.^4))/R, tspan1, t0);
h1=(TT-t0)/(size(y1,1)-1);
Tspan1=t0:h1:TT;
figure(55);plot(Tspan1,y1,'b');
Code#2:
tspan=[t0 TT];
[t,y] = ode45(@(t,T) ((1-alpha)*Q-sigm*(T.^4))/R, tspan, t0);
h=(TT-t0)/(size(y,1)-1);
Tspan=t0:h:TT;
figure(5);plot(Tspan,y,'b');
wherein:
R=2.912;
Q = 342;
alpha=0.3;
sigm=5.67*(10^(-8));
TT=20;
t0=0;
Why are the results different?
The second result is not equally spaced. It is, in a way, a minimal set of points that represents the solution curve: if the curve is rather linear there will be only a few points, while in regions of high curvature you get a dense sampling. You can and should use the returned time array, as it contains the times that the solution points belong to:
figure(55);plot(t1,y1,'b');
figure(5);plot(t,y,'b');
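For comparison, the same behavior can be reproduced with SciPy's solve_ivp (a sketch; like ode45 it steps adaptively, and t_eval only changes where the solution is reported, not how it is computed):

import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import solve_ivp

R, Q, alpha, sigm = 2.912, 342.0, 0.3, 5.67e-8
rhs = lambda t, T: ((1 - alpha) * Q - sigm * T**4) / R

sol1 = solve_ivp(rhs, (0, 20), [0.0])                                  # adaptive output points
sol2 = solve_ivp(rhs, (0, 20), [0.0], t_eval=np.linspace(0, 20, 401))  # fixed reporting grid

# Always plot against the time array the solver actually returns
plt.plot(sol1.t, sol1.y[0], "o-", label="solver's own points")
plt.plot(sol2.t, sol2.y[0], "-", label="t_eval grid")
plt.legend(); plt.show()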
I have a (somewhat complicated) expression in three dimensions, x, y, z. I'm interested in the cumulative integral over one of them. My best solution so far is to create a 3D grid, evaluate the expression at every point, then integrate over the third dimension with cumtrapz. This is just a scaled-down example of what I'm trying to achieve:
%integration
xvec = linspace(-pi,pi,40);
yvec = linspace(-pi,pi,40);
zvec = 1:160;
[x,y,z] = meshgrid(xvec,yvec,zvec);
f = @(x,y,z) sin(x).*cos(y).*exp(z/80).*cos((x-z/20));
output = cumtrapz(f(x,y,z),3);
%(plotting)
for j = 1:size(output,3)
    surf(output(:,:,j));
    zlim([-120,120]);
    shading interp
    pause(.05);
    drawnow;
end
Given the sizes of the vectors (x, y ~ 100 points, z ~ 5000), is this a computationally sensible way to do it?
If this is the functional form you want to integrate, @(x,y,z) sin(x).*cos(y).*exp(z/80).*cos((x-z/20)), then after expanding cos(x-z/20) with the angle-addition identity the integrand becomes a sum of terms that are separable in x, y and z, so each factor can be integrated on its own. The integrals can even be solved analytically using complex exponentials, replacing sin(x) = (exp(ix)-exp(-ix))/2i and cos(x) = (exp(ix)+exp(-ix))/2, which will greatly reduce the cost of your calculation.
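For the cumulative integral over z specifically, that split means only two 1-D cumulative integrals are needed instead of a cumtrapz over the whole 3-D array. A sketch of the idea in Python/NumPy, mirroring the grid from the question:

import numpy as np
from scipy.integrate import cumulative_trapezoid

x = np.linspace(-np.pi, np.pi, 40)
y = np.linspace(-np.pi, np.pi, 40)
z = np.arange(1, 161, dtype=float)

# cos(x - z/20) = cos(x)*cos(z/20) + sin(x)*sin(z/20), so the z-dependence
# reduces to two 1-D cumulative integrals, broadcast over x and y afterwards
Ic = cumulative_trapezoid(np.exp(z / 80) * np.cos(z / 20), z, initial=0.0)
Is = cumulative_trapezoid(np.exp(z / 80) * np.sin(z / 20), z, initial=0.0)

X, Y = np.meshgrid(x, y)          # only a 2-D grid is needed now
g = np.sin(X) * np.cos(Y)
out = g[:, :, None] * (np.cos(X)[:, :, None] * Ic + np.sin(X)[:, :, None] * Is)

This avoids evaluating the integrand on, and integrating over, the full 3-D grid; only the final broadcast touches all Nx*Ny*Nz points.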
I built a neural network and successfully trained it by using backpropagation with stochastic gradient descent. Now I'm switching to batch training but I'm a bit confused about when to apply momentum and weight decay.
I know fairly well how backpropagation works in theory; I'm just stuck on implementation details.
With the stochastic approach, all I had to do was apply the updates to the weights immediately after computing the gradients, as in this pseudo-Python code:
for epoch in epochs:
    for p in patterns:
        outputs = net.feedforward(p.inputs)
        # output_layer_errors is needed to plot the error
        output_layer_errors = net.backpropagate(outputs, p.targets)
        net.update_weights()
where the update_weights method is defined as follows:
def update_weights(self):
    for h in self.hidden_neurons:
        for o in self.output_neurons:
            gradient = h.output * o.error
            self.weights[h.index][o.index] += self.learning_rate * gradient + \
                self.momentum * self.prev_gradient
            self.weights[h.index][o.index] -= self.decay * self.weights[h.index][o.index]
    for i in self.input_neurons:
        for h in self.hidden_neurons:
            gradient = i.output * h.error
            self.weights[i.index][h.index] += self.learning_rate * gradient + \
                self.momentum * self.prev_gradient
            self.weights[i.index][h.index] -= self.decay * self.weights[i.index][h.index]
This works like a charm. (Note that there might be small errors: I'm using Python here only because it's more readable; the actual net is coded in C. This code just shows the steps I take to compute the updates.)
Now, switching to batch updates, the main algorithm should be something like:
for epoch in epochs:
    for p in patterns:
        outputs = net.feedforward(p.inputs)
        # output_layer_errors is needed to plot the error
        output_layer_errors = net.backpropagate(outputs, p.targets)
        net.accumulate_updates()
    net.update_weights()  # applied once per epoch, after the whole batch
The accumulate method is as follows:
def accumulate_updates(self):
    for h in self.hidden_neurons:
        for o in self.output_neurons:
            gradient = h.output * o.error
            self.accumulator[h.index][o.index] += self.learning_rate * gradient
            # should I compute momentum here?
    for i in self.input_neurons:
        for h in self.hidden_neurons:
            gradient = i.output * h.error
            # should I just accumulate the gradient without scaling it by the learning rate here?
            self.accumulator[i.index][h.index] += self.learning_rate * gradient
            # should I compute momentum here?
and update_weights looks like this:
def update_weights(self):
    for h in self.hidden_neurons:
        for o in self.output_neurons:
            # what to do here? apply momentum? apply weight decay?
            self.weights[h.index][o.index] += self.accumulator[h.index][o.index]
            self.accumulator[h.index][o.index] = 0.0
    for i in self.input_neurons:
        for h in self.hidden_neurons:
            # what to do here? apply momentum? apply weight decay?
            self.weights[i.index][h.index] += self.accumulator[i.index][h.index]
            self.accumulator[i.index][h.index] = 0.0
I'm not sure whether I should:
1) scale the gradient by the learning rate at the time of accumulation or at the time of the update
2) apply momentum at the time of accumulation or at the time of the update
3) same as 2), but for weight decay
Can somebody help me solve this issue?
I'm sorry for the long question, but I wanted to be detailed to explain my doubts better.
Just a quick comment on this. Stochastic gradient descent most of the time leads to a non-smooth optimization, and it is inherently sequential, which does not suit current technology advances such as parallel computation.
The mini-batch approach therefore tries to combine the advantages of stochastic optimization with those of batch optimization (parallel computation). You create small training blocks which you hand out, in a parallel fashion, to the learning algorithm; at the end, each worker reports the error on its training sample, which you average and use as in normal stochastic gradient descent.
This approach leads to a much smoother optimization, and probably to a quicker one if you make use of parallel computing.
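As a sketch of the splitting step (the helper name is made up):

import random

def minibatches(patterns, batch_size):
    # Yield shuffled mini-batches; each batch then gets one accumulated update.
    patterns = list(patterns)
    random.shuffle(patterns)
    for start in range(0, len(patterns), batch_size):
        yield patterns[start:start + batch_size]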
For your first question, either seems fine, but if you want to combine it with momentum, you had better check the original formula against your implementation. I would say you should not scale the gradient during accumulation. When computing the momentum, use the formula:
v_{t+1} = mu * v_t - alpha * g_t
where g_t is the accumulated gradient, mu the momentum coefficient, and alpha the learning rate.
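A minimal sketch of a batch update along these lines (the container names are hypothetical; it follows the question's sign convention, where the computed gradient already points in the update direction):

def batch_update(weights, accumulator, velocity, lr, momentum, decay, batch_size):
    # One update per batch: momentum and weight decay are applied here,
    # not while accumulating. weights etc. are dicts keyed per connection.
    for idx in weights:
        g = accumulator[idx] / batch_size                  # mean raw gradient over the batch
        velocity[idx] = momentum * velocity[idx] + lr * g  # v_t = mu*v_{t-1} + alpha*g_t
        weights[idx] += velocity[idx]                      # apply the smoothed step
        weights[idx] -= decay * weights[idx]               # weight decay, once per update
        accumulator[idx] = 0.0                             # reset for the next batch

With mini-batches, batch_size is the block size and the function runs once per block.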
I also recommend using AdaGrad and mini-batch instead of full-batch.
Reference: http://firstprayer.github.io/stochastic-gradient-descent-tricks/
Given a set of points, what's the fastest way to fit a parabola to them? Is it doing the least-squares calculation, or is there an iterative way?
Thanks
Edit:
I think gradient descent is the way to go. The least-squares calculation would have been a little more taxing (having to do a QR decomposition or something to keep things stable).
If the points have no error associated with them, you may interpolate through three of the points. Otherwise, least squares or any equivalent formulation is the way to go.
I recently needed to find a parabola that passes through 3 points.
Suppose you have (x1,y1), (x2,y2) and (x3,y3), and you want the parabola
y-y0 = a*(x-x0)^2
to pass through them: find y0, x0, and a.
You can do some algebra and get this solution (provided the points aren't all on a line):
let c = (y1-y2) / (y2-y3)
x0 = ( -x1^2 + x2^2 + c*( x2^2 - x3^2 ) ) / (2.0*( -x1+x2 + c*x2 - c*x3 ))
a = (y1-y2) / ( (x1-x0)^2 - (x2-x0)^2 )
y0 = y1 - a*(x1-x0)^2
Note that in the equation for c, if y2 == y3 you have a division by zero. So in my algorithm I check for this, swap, say, (x1, y1) with (x2, y2), and then proceed.
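For what it's worth, here is this recipe transcribed into Python (the function name is made up; it swaps points exactly as described when y2 == y3):

def parabola_through_3_points(p1, p2, p3):
    # Fit the vertex form y - y0 = a*(x - x0)^2 through three points,
    # following the algebra above. Assumes the points are not collinear.
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    if y2 == y3:                      # c would divide by zero; swap p1 and p2
        x1, y1, x2, y2 = x2, y2, x1, y1
    c = (y1 - y2) / (y2 - y3)
    x0 = (-x1**2 + x2**2 + c * (x2**2 - x3**2)) / (2.0 * (-x1 + x2 + c * x2 - c * x3))
    a = (y1 - y2) / ((x1 - x0)**2 - (x2 - x0)**2)
    y0 = y1 - a * (x1 - x0)**2
    return x0, y0, a

print(parabola_through_3_points((0, 5), (1, 2), (2, 1)))   # -> (2.0, 1.0, 1.0)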
Hope that helps!
Paul Probert
A calculated solution is almost always faster than an iterative solution. The "exception" would be for low iteration counts and complex calculations.
I would use the least-squares method. I've only ever coded it for linear regression fits, but it can be used for parabolas too (I had reason to look it up recently; sources include an old edition of "Numerical Recipes", Press et al., and "Engineering Mathematics", Kreyszig).
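For instance, with NumPy the least-squares parabola is a single call; a sketch with made-up sample data:

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # hypothetical data
y = np.array([5.1, 2.0, 1.1, 2.2, 4.9])
a, b, c = np.polyfit(x, y, 2)             # least-squares fit of y = a*x^2 + b*x + c

polyfit solves the least-squares problem with a numerically stable factorization internally, so the stability concern from the question is handled for you.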
ALGORITHM FOR PARABOLA
1) Read the number of data points n and the order of the polynomial Mp.
2) Read the data values.
3) If n < Mp, regression is not possible: stop. Otherwise continue.
4) Set M = Mp + 1.
5) Compute the coefficients of the C-matrix.
6) Compute the coefficients of the B-matrix.
7) Solve for the coefficients a1, a2, ..., an.
8) Write out the coefficients.
9) Estimate the function value at the given values of the independent variables.
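A sketch of that scheme in Python/NumPy, where the C-matrix holds the moment sums C[j][k] = sum_i x_i^(j+k) and the B-matrix holds B[j] = sum_i y_i * x_i^j (the usual normal equations for polynomial least squares):

import numpy as np

def polyfit_normal_equations(x, y, degree):
    # Solve the normal equations C a = B for the polynomial coefficients
    # a0..a_degree (lowest power first).
    M = degree + 1
    C = np.array([[np.sum(x ** (j + k)) for k in range(M)] for j in range(M)])
    B = np.array([np.sum(y * x ** j) for j in range(M)])
    return np.linalg.solve(C, B)

coeffs = polyfit_normal_equations(np.array([0., 1., 2.]), np.array([5., 2., 1.]), 2)
# -> [5., -4., 1.], i.e. y = 5 - 4x + x^2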
Using the free arbitrary-precision math program "PARI" (for Mac or PC): here is how I would fit a parabola to a set of 641 points, and I also show how to find the minimum of that parabola.
Set a high number of digits of precision:
\p 300
Write the data points to a text file, separated by one space per data point (use ASCII characters in base ten, no space at the start or end of the file, and no returns; write extremely large or small floating-point values as, for example, "9.0E-23" but not "9.0D-23").
Make a string pointing to that file:
fileone="./desktop/data.txt"
Read that file into PARI using the following instructions:
fileopen(fileone,r)
readsplit(file) = my(cmd);cmd="perl -ne \"chomp; print '[' . join(',', split(/ +/)) . ']\n';\"";eval(externstr(Str(cmd," ",file)))
readsplit(fileone)
Label that data with a name:
in = %
V = in[1]
Define a least squares fit function:
lsf(X,Y,n) = my(M=matrix(#X,n+1,i,j,X[i]^(j-1)));fit=Polrev(matsolve(M~*M,M~*Y~))
Apply that lsf function to your 641 data points:
lsf([-320..320],V, 2)
Then if you want to show the minimum of that parabolic fit, enter:
xextreme = solve (x=-1000,1000,eval(deriv(fit)));print (xextreme*(124.5678-123.5678)/640+(124.5678+123.5678)/2);x=xextreme;print(eval(fit))
(I had to adjust for my particular x-axis scaling before the "print" statement in that command line above).
(Note: a simplification made in this algorithm causes it to work only when the data set has equally spaced x-axis coordinates.)
I was worried that my last post was too compact to follow and too hard to convert to other environments. I would like to show here how to solve the generalized problem of parabolic data fitting explicitly, without specialized matrix-math terminology, and so that each multiplication, division, subtraction and addition can be seen at once.
To save ink, this fit reparameterizes the x-axis as evenly spaced points centered on zero, so that the odd-powered sums all get eliminated (saving a lot of space and time); the x-coordinates of the N data points are effectively labeled by the points of this vector: X=[-(N-1)/2..(N-1)/2]. For example, "xextreme" will be returned versus those integer indices, and so (if desired) a simple linear transformation (which consumes very little CPU time) must be applied after the algorithm below to get it versus your problem's particular x-axis labels.
This is written in the language of the free program "PARI", but all the commands are simple to translate to any language.
Step 1: assign a label to the y-axis data:
? V=[5,2,1,2,5]
"PARI" confirms that entry:
%280 = [5, 2, 1, 2, 5]
Then type in the following processing algorithm, which calculates a best-fit parabola through any y-axis data set with constant x-axis separation:
? g=#V;h=(g-1)*g*(g+1)/3;i=h*(3*g*g-7)/5;\
a=sum(i=1,g,V[i]);b=sum(i=1,g,(2*i-1-g)*V[i]);c=sum(i=1,g,(2*i-1-g)*(2*i-1-g)*V[i]);\
A=matdet([a,c;h,i])/matdet([g,h;h,i]);B=b/h*2;C=matdet([g,h;a,c])/matdet([g,h;h,i])*4;\
xextreme=-B/(2*C);yextreme=-B*B/(4*C)+A;fit=Polrev([A,B,C]);\
print("\n","y of extreme is ",yextreme,"\n","which occurs this many data points from center of data: ",xextreme)
(Note for non-PARI users: the command "matdet([a,c;h,i])" is just another way of writing "a*i-c*h".)
Those commands then produce the following screen output:
y of extreme is 1
which occurs this many data points from center of data: 0
The algorithm stores the polynomial of the fit in the variable "fit":
? fit
%282 = x^2 + 1
?
(Note that to keep the algorithm short, the x-axis labels are assigned as X=[-(N-1)/2..(N-1)/2], thus here they are X=[-2,-1,0,1,2]. To correct that for the same polynomial as parameterized by an x-axis coordinate data set of, say, X=[-1,0,1,2,3], just apply a simple linear transform, in this case: "x^2 + 1" --> "(t - 1)^2 + 1".)