Ways of Trading Off Precision and Recall in Neural Networks - performance

I have seen threads with similar questions/problems but I have not found this very issue.
Suppose I train a NN with following cost function:
J(theta) = 1/m * sum(sum( -y * log(h(x)) - ( 1 - y ) * log(1-h(x)) ))
and also use sigmoid function as the activation function.
Now, e.g. for cancer detection, for a CV test I get 0.6 Precision and 0.6 Recall. If I want to get another Ratio of Precision and Recall (e.g. lower Precision but higher Recall) I can just change the threshold of a prediction function (i.e. h(output_layer) > threshold). I guess I could also:
- change the NN architecture,
- change the training set,
- change regularization parameter and I would get a different result.
But what if I do NOT want to change any architecture of the NN. Is changing the threshold of the predict function really smart? I see it like that: We train our NN with the sigmoid function (that kind of checks if an activiation of a certain node is below or above 0.5, roughly speaking). And then, after we trained the network with this lower-or-higher-than-0.5 approach, we change the last prediction threshold to some other value.
I do not think that this would be the optimal Precision/Recall Ratio (or F1 Score) that is possible with a certain training set and NN architecture. Or in other words, I do not think we 'walk along' the optimal ROC Curve. Is that correct?
My 2 thoughts on how to come up with a better solution:
1.) Change the activation function. Either to a completly different function or shift the sigmoid function (e.g. sigmoid new = 0.1 + sigmoid original). So I would also get more activation and I guess more Recall in the end.
2.) Change the Cost function (!). E.g. to
J(theta) = 1/m * sum(sum( ALPHA* -y * log(h(x)) - ( 1 - y ) * log(1-h(x)) )). With this Alpha (Scalar) I could punish the -y * log(h(x)) error more (alpha >1) or less (alpha <1). But would I need to also change the backpropagation and/or gradient calculation if I change the costfunction?
I'd appreciate every help, link or thought on this topic :-)
Best, Wolfgang

Related

How to minimize a cost function with Matlab when input variable is a large image: increase speed and prevent memory crash

I am trying to implement a differential phase integration method described in this paper:
Thüring, Thomas, et al. "Non-linear regularized phase retrieval for unidirectional X-ray differential phase contrast radiography." Optics express 19.25 (2011): 25545-25558.
Basically, it's a way to integrate a differential image across the columns only, while imposing some constraints on continuity across the rows to prevent stripe noise.
From a mathematical point of view, I want to minimize the following equation:
where ||.|| is the L2 norm, Dx is the derivative along the columns, Dy is the derivative across the rows, A is the unknown integrated matrix, lambda is a user-defined parameter and phi is the differential profile I measured. Note that for the Dy operator the L1 norm can also be used.
I wrote down a code using fminunc as Matlab solver
pdiff=imresize(diff(padarray(p,[0,1],'replicate','post'),1,2),[128,128]);
noise = 0.02 * randn(size(pdiff));
pdiff_noise = pdiff + noise ;
% normal integration
integratedProfile=cumsum(pdiff_noise,2);
options=optimoptions(#fminunc,'Display','iter-detailed','UseParallel',true,'MaxIterations',35);
% regularized integration
startingPoint=zeros(size(pdiff_noise));
fun=#(x)costFunction(pdiff_noise,x);
integratedProfile_optmized=fminunc(fun,startingPoint,options);
function difference=costFunction(ep,op)
L=0.2;
dep_o=diff(padarray(op,[0,1],'replicate','post'),1,2);
dep_v=diff(padarray(op,[1,0],'replicate','post'),1,1);
difference=sum(sum((ep-dep_o).^2))+L*sum(sum(dep_v.^2));
end
It works using a 128x128 differential image.
The problem arises as soon as I try to work with a larger image. In particular, when I use a 256x256 matrix takes forever to make each iteration even using the parallel option and takes almost the entire RAM.
When I move to a matrix that is 512x512 I get this error
Requested 262144x262144 (512.0GB) array exceeds maximum
array size preference.
Error in fminusub (line 165)
H = eye(sizes.nVar);
Error in fminunc (line 446)
[x,FVAL,GRAD,HESSIAN,EXITFLAG,OUTPUT] =
fminusub(funfcn,x, ...
Error in Untitled (line 13)
integratedProfile_optmized=fminunc(fun,startingPoint,options);
Unfortunately, my final goal is to process approximately 3000 images of 500x500 size.
I think I have understood that the crash problem is related to the size of the matrix and to the fact that each pixel is a variable. Therefore, Matlab needs to calculate a huge hessian that doesn't fit into the memory.
However, I don't really know how to solve it while also speeding up the processing.
Do you have any suggestions on how to work with large images? Is there another solver that may work in a faster way? Any mathematical approach to making the problem easier?
Thanks!

How to set a convergence tolerance to an specific variable using Dymola?

So, I have a model of a tube with pressure loss, where the unknown is the mass flow rate. Normally, and on most models of this problem, the conservation equations are used to calculate the mass flow rate, but such models have lots of convergence issues (because of the blocked flow at the end of the tube which results in an infinite pressure derivative at the end). See figure below for a representation of the problem on the left and the right a graph showing the infinite pressure derivative.
Because of that I'm using a model which is more robust, though it outputs not the mass flow rate but the tube length, which is known. Therefore an iterative loop is needed to determine the mass flow rate. Ok then, I coded a function length that given the tube geometry, mass flow rate and boundary conditions it outputs the calculated tube length and made the equations like so:
parameter Real L;
Real m_flow;
...
equation
L = length(geometry, boundary, m_flow)
It simulates fine, but it takes ages... And it shouldn't because the mass flow rate is rather insensitive to the tube length, e.g. if L=3 I could say that m_flow has converged if the output of length is within L ± 0.1. On the other hand the default convergence tolerance of DASSL in Dymola is 0.0001, which is fine for all other variables, but a major setback to my model here...
That being said, I'd like to know if there's a (hacky) way of setting a specific tolerance L (from annotations or something). I was unable to find any solution online or in Dymola's user manual... So far I managed a workaround by making a second function which uses a Newton-Raphson method to determine the mass flow rate, something like:
function massflowrate
input geometry, boundary, m_flow_start, tolerance;
output m_flow;
protected
Real error, L, dL, dLdm_flow, Delta_m_flow;
algorithm
error = geometry.L;
m_flow = m_flow_start;
while error>tolerance loop
L = length(geometry, boundary, m_flow);
error = abs(boundary.L - L);
dL = length(geometry, boundary, m_flow*1.001);
dLdm_flow = dL/(0.001*m_flow);
Delta_m_flow = (geometry.L - L)/dLdm_flow;
m_flow = m_flow + Delta_m_flow;
end while;
end massflowrate;
And then I use it in the equations section:
parameter Real L;
Real m_flow;
...
equation
m_flow = massflowrate(geometry, boundary, delay(m_flow,10), tolerance)
Nevertheless, this solutions is not without it's problems, the real equations are very non-linear and depending on the boundary conditions the solver reaches a never-ending loop... =/
PS: I'm sorry for the long post and the lack of a MWE, the real equations are very long and with loads of thermodynamics which I believe not to be of any help, be that as it may, if necessary, I'm able to provide the real model.
Is the length-function smooth? To me that it being non-smooth seems like a likely cause for problems, and the suggestions by #Phil might also be good ideas.
However, it should also be possible to do what you want as follows:
Real m_flow(nominal=1e9);
Explanation: The equations are normally solved to a certain tolerance in unknowns - in this case m_flow.
The tolerance for each variable is a relative/absolute tolerance taking into the nominal value, and Dymola does not allow you to set different tolerances for different variables.
Thus the simple way to compute m_flow less accurately is by setting a high nominal value for it, since the error tolerance will be tol*(abs(m_flow)+abs(nominal(m_flow))) or something like that.
The downside is that it may be too inaccurate, e.g. causing additional events, or that the error is so random that the solver is still slowed down.

How to speed up GLM estimation?

I am using RStudio 0.97.320 (R 2.15.3) on Amazon EC2. My data frame has 200k rows and 12 columns.
I am trying to fit a logistic regression with approximately 1500 parameters.
R is using 7% CPU and has 60+GB memory and is still taking a very long time.
Here is the code:
glm.1.2 <- glm(formula = Y ~ factor(X1) * log(X2) * (X3 + X4 * (X5 + I(X5^2)) * (X8 + I(X8^2)) + ((X6 + I(X6^2)) * factor(X7))),
family = binomial(logit), data = df[1:150000,])
Any suggestions to speed this up by a significant amount?
There are a couple packages to speed up glm fitting. fastglm has benchmarks showing it to be even faster than speedglm.
You could also install a more performant BLAS library on your computer (as Ben Bolker suggests in comments), which will help any method.
Although a bit late but I can only encourage dickoa's suggestion to generate a sparse model matrix using the Matrix package and then feeding this to the speedglm.wfit function. That works great ;-) This way, I was able to run a logistic regression on a 1e6 x 3500 model matrix in less than 3 minutes.
Assuming that your design matrix is not sparse, then you can also consider my package parglm. See this vignette for a comparison of computation times and further details. I show a comparison here of computation times on a related question.
One of the methods in the parglm function works as the bam function in mgcv. The method is described in detail in
Wood, S.N., Goude, Y. & Shaw S. (2015) Generalized additive models for large datasets. Journal of the Royal Statistical Society, Series C 64(1): 139-155.
On advantage of the method is that one can implement it with non-concurrent QR implementation and still do the computation in parallel. Another advantage is a potentially lower memory footprint. This is used in mgcv's bam function and could also be implemented here with a setup as in speedglm's shglm function.

Technique for balancing controller input and output

I have a system where I use RS232 to control a lamp that takes an input given in float representing voltage (in the range 2.5 - 7.5). The control then gives a output in the range 0 to 6000 which is the brightness a sensor picks up.
What I want is to be able to balance the system so that I can specify a brightness value, and the system should balance in on a voltage value that achieves this.
Is there some standard algorithm or technique to find what the voltage input should be in order to get a specific output? I was thinking of an an algorithm which iteratively tries values and from each try it determines some new value which should be better in order to achieve the determined output value. (in my case that is 3000).
The voltage values required tend to vary between different systems and also over the lifespan of the lamp, so this should preferably be done completely automatic.
I am just looking for a name for a technique or algorithm, but pseudo code works just as well. :)
Calibrate the system on initial run by trying all voltages between 2.5 and 7.5 in e.g. 0.1V increments, and record the sensor output.
Given e.g. 3000 as a desired brightness level, pick the voltage that gives the closest brightness then adjust up/down in small increments based on the sensor output until the desired brightness is achieved. From time to time (based on your calibrated values becoming less accurate) recalibrate.
After some more wikipedia browsing I found this:
Control loop feedback mechanism:
previous_error = setpoint - actual_position
integral = 0
start:
error = setpoint - actual_position
integral = integral + (error*dt)
derivative = (error - previous_error)/dt
output = (Kp*error) + (Ki*integral) + (Kd*derivative)
previous_error = error
wait(dt)
goto start
[edit]
By removing the "integral" component and tweaking the weights (Ki and Kd), the loop works perfectly.
I am not at all into physics, but if you can assume that the relationship between voltage and brightness is somewhat close to linear, you can use a standard binary search.
Other than that, this reminds me of the inverted pendulum, which is one of the standard examples for the use of fuzzy logic.

What is a "good" R value when comparing 2 signals using cross correlation?

I apologize for being a bit verbose in advance: if you want to skip all the background mumbo jumbo you can see my question down below.
This is pretty much a follow up to a question I previously posted on how to compare two 1D (time dependent) signals. One of the answers I got was to use the cross-correlation function (xcorr in MATLAB), which I did.
Background information
Perhaps a little background information will be useful: I'm trying to implement an Independent Component Analysis algorithm. One of my informal tests is to (1) create the test case by (a) generate 2 random vectors (1x1000), (b) combine the vectors into a 2x1000 matrix (called "S"), and multiply this by a 2x2 mixing matrix (called "A"), to give me a new matrix (let's call it "T").
In summary: T = A * S
(2) I then run the ICA algorithm to generate the inverse of the mixing matrix (called "W"), (3) multiply "T" by "W" to (hopefully) give me a reconstruction of the original signal matrix (called "X")
In summary: X = W * T
(4) I now want to compare "S" and "X". Although "S" and "X" are 2x1000, I simply compare S(1,:) to X(1,:) and S(2,:) to X(2,:), each which is 1x1000, making them 1D signals. (I have another step which makes sure that these vectors are the proper vectors to compare to each other and I also normalize the signals).
So my current quandary is how to 'grade' how close S(1,:) matches to X(1,:), and likewise with S(2,:) to X(2,:).
So far I have used something like: r1 = max(abs(xcorr(S(1,:), X(1,:)))
My question
Assuming that using the cross correlation function is a valid way to go about comparing the similarity of two signals, what would be considered a good R value to grade the similarity of the signals? Wikipedia states that this is a very subjective area, and so I defer to the better judgment of those who might have experience in this field.
As you might realize, I'm not coming from a EE/DSP/statistical background at all (I'm a medical student) so I'm going through a sort of "baptism through fire" right now, and I appreciate all the help I can get. Thanks!
(edit: as far as directly answering your question about R values, see below)
One way to approach this would be to use cross-correlation. Bear in mind that you have to normalize amplitudes and correct for delays: if you have signal S1, and signal S2 is identical in shape, but half the amplitude and delayed by 3 samples, they're still perfectly correlated.
For example:
>> t = 0:0.001:1;
>> y = #(t) sin(10*t).*exp(-10*t).*(t > 0);
>> S1 = y(t);
>> S2 = 0.4*y(t-0.1);
>> plot(t,S1,t,S2);
These should have a perfect correlation coefficient. A way to compute this is to use maximum cross-correlation:
>> f = #(S1,S2) max(xcorr(S1,S2));
f =
#(S1,S2) max(xcorr(S1,S2))
>> disp(f(S1,S1)); disp(f(S2,S2)); disp(f(S1,S2));
12.5000
2.0000
5.0000
The maximum value of xcorr() takes care of the time-delay between signals. As far as correcting for amplitude goes, you can normalize the signals so that their self-cross-correlation is 1.0, or you can fold that equivalent step into the following:
ρ2 = f(S1,S2)2 / (f(S1,S1)*f(S2,S2);
In this case ρ2 = 5 * 5 / (12.5 * 2) = 1.0
You can solve for ρ itself, i.e. ρ = f(S1,S2)/sqrt(f(S1,S1)*f(S2,S2)), just bear in mind that both 1.0 and -1.0 are perfectly correlated (-1.0 has opposite sign)
Try it on your signals!
with respect to what threshold to use for acceptance/rejection, that really depends on what kind of signals you have. 0.9 and above is fairly good but can be misleading. I would consider looking at the residual signal you get after you subtract out the correlated version. You could do this by looking at the time index of the maximum value of xcorr():
>> t = 0:0.001:1;
>> y = #(a,t) sin(a*t).*exp(-a*t).*(t > 0);
>> S1=y(10,t);
>> S2=0.4*y(9,t-0.1);
>> f(S1,S2)/sqrt(f(S1,S1)*f(S2,S2))
ans =
0.9959
This looks pretty darn good for a correlation. But let's try fitting S2 with a scaled/shifted multiple of S1:
>> [A,i]=max(xcorr(S1,S2)); tshift = i-length(S1);
>> S2fit = zeros(size(S2)); S2fit(1-tshift:end) = A/f(S1,S1)*S1(1:end+tshift);
>> plot(t,[S2; S2fit]); % fit S2 using S1 as a basis
>> plot(t,[S2-S2fit]); % residual
Residual has some energy in it; to get a feel for how much, you can use this:
>> S2res=S2-S2fit;
>> dot(S2res,S2res)/dot(S2,S2)
ans =
0.0081
>> sqrt(dot(S2res,S2res)/dot(S2,S2))
ans =
0.0900
This says that the residual has about 0.81% of the energy (9% of the root-mean-square amplitude) of the original signal S2. (the dot product of a 1D signal with itself will always be equal to the maximum value of cross-correlation of that signal with itself.)
I don't think there's a silver bullet for answering how similar two signals are with each other, but hopefully I've given you some ideas that might be applicable to your circumstances.
A good starting point is to get a sense of what a perfect match will look like by calculating the auto-correlations for each signal (i.e. do the "cross-correlation" of each signal with itself).
THIS IS A COMPLETE GUESS - but I'm guessing max(abs(xcorr(S(1,:),X(1,:)))) > 0.8 implies success. Just out of curiosity, what kind of values do you get for max(abs(xcorr(S(1,:),X(2,:))))?
Another approach to validate your algorithm might be to compare A and W. If W is calculated correctly, it should be A^-1, so can you calculate a measure like |A*W - I|? Maybe you have to normalize by the trace of A*W.
Getting back to your original question, I come from a DSP background, so I get to deal with fairly noise-free signals. I understand that's not a luxury you get in biology :) so my 0.8 guess might be very optimistic. Perhaps looking at some literature in your field, even if they aren't using cross-correlation exactly, might be useful.
Usually in such cases people talk about "false acceptance rate" and "false rejection rate".
The first one describes how many times algorithm says "similar" for non-similar signals, the second one is the opposite.
Selecting a threshold thus becomes a trade-off between these criteria. To make FAR=0, threshold should be 1, to make FRR=0 threshold should be -1.
So probably, you will need to decide which trade-off between FAR and FRR is acceptable in your situation and this will give the right value for threshold.
Mathematically this can be expressed in different ways. Just a couple of examples:
1. fix some of rates at acceptable value and minimize other one
2. minimize max(FRR,FAR)
3. minimize aFRR+bFAR
Since they should be equal, the correlation coefficient should be high, between .99 and 1. I would take the max and abs functions out of your calculation, too.
EDIT:
I spoke too soon. I confused cross-correlation with correlation coefficient, which is completely different. My answer might not be worth much.
I would agree that the result would be subjective. Something that would involve the sum of the squares of the differences, element by element, would have some value. Two identical arrays would give a value of 0 in that form. You would have to decide what value then becomes "bad". Make up 2 different vectors that "aren't too bad" and find their cross-correlation coefficient to be used as a guide.
(parenthetically: if you were doing a correlation coefficient where 1 or -1 would be great and 0 would be awful, I've been told by bio-statisticians that a real-life value of 0.7 is extremely good. I understand that this is not exactly what you are doing but the comment on correlation coefficient came up earlier.)

Resources