Imbalanced data: precision and recall when the minority is negative case instead of positive case - precision-recall

I have an imbalanced dataset where 90% of cases have Y = 1 and 10% of cases have Y = 0. In this case, do precision and recall still apply? Precision and recall focus on true positives (TP), which are not what matters in my dataset: I actually want to focus on true negatives. Which metrics should I use to evaluate the model in my case?

Related

What techniques are effective to find periodicity in arbitrary data points?

By "arbitrary" I mean that I don't have a signal sampled on a grid that is amenable to taking an FFT. I just have points (e.g. in time) where events happened, and I'd like an estimate of the rate, for example:
p = [0, 1.1, 1.9, 3, 3.9, 6.1 ...]
...could be hits from a process with a nominal periodicity (repetition interval) of 1.0, but with noise and some missed detections.
Are there well known methods for processing such data?
A least-squares algorithm may do the trick, if correctly initialized. A clustering method can be applied to this end.
When an FFT is performed, the signal is represented as a sum of sine waves. The amplitudes of the frequencies can be seen as the result of a least-squares fit to the signal. Hence, if the signal is unevenly sampled, solving the same least-squares problem makes sense as a way to estimate the Fourier transform. Applied to an evenly sampled signal, it gives the same result.
As your signal is discrete, you may want to fit it as a sum of Dirac combs. It seems sounder to minimize the sum of squared distances from each event to the nearest Dirac of the Dirac comb. This is a non-linear optimization problem where each Dirac comb is described by its period and offset. This non-linear least-squares problem can be solved by means of the Levenberg-Marquardt algorithm. Here is a Python example making use of the scipy.optimize.leastsq() function. Moreover, the error on the estimated period and offset can be estimated as described in How to compute standard deviation errors with scipy.optimize.least_squares. It is also covered in the documentation of curve_fit() and in Getting standard errors on fitted parameters using the optimize.leastsq method in python.
Nevertheless, half the period, or a third of the period, ..., would also fit, and multiples of the period are local minima; these are to be avoided by refining the initialization of the Levenberg-Marquardt algorithm. To this end, the differences between times of events can be clustered, the cluster featuring the smallest value being that of the expected period. As proposed in Clustering values by their proximity in python (machine learning?), the clustering function sklearn.cluster.MeanShift() is applied.
Notice that the procedure can be extended to multidimensional data to look for periodic patterns or mixed periodic patterns featuring different fundamental periods.
import numpy as np
from scipy.optimize import least_squares
from scipy.optimize import leastsq
from sklearn.cluster import MeanShift, estimate_bandwidth
ticks = [0, 1.1, 1.9, 3, 3.9, 6.1]
import scipy
print(scipy.__version__)
def crudeEstimate():
    # looking for the period by looking at differences between values:
    diffs = np.zeros((len(ticks) * (len(ticks) - 1)) // 2)
    k = 0
    for i in range(len(ticks)):
        for j in range(i):
            diffs[k] = ticks[i] - ticks[j]
            k = k + 1
    # see https://stackoverflow.com/questions/18364026/clustering-values-by-their-proximity-in-python-machine-learning
    X = np.array(list(zip(diffs, np.zeros(len(diffs)))), dtype=float)
    bandwidth = estimate_bandwidth(X, quantile=1.0 / len(ticks))
    ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
    ms.fit(X)
    labels = ms.labels_
    cluster_centers = ms.cluster_centers_
    print(cluster_centers)
    labels_unique = np.unique(labels)
    n_clusters_ = len(labels_unique)
    for k in range(n_clusters_):
        my_members = labels == k
        print("cluster {0}: {1}".format(k, X[my_members, 0]))
    # the cluster with the smallest center is taken as the period
    estimated_period = np.min(cluster_centers[:, 0])
    return estimated_period
def disttoDiracComb(x):
    # x = [period_0, offset_0, period_1, offset_1, ...]
    residual = np.zeros(len(ticks))
    for i in range(len(ticks)):
        mindist = np.inf
        for j in range(len(x) // 2):
            offset = x[2 * j + 1]
            period = x[2 * j]
            # index of the comb tooth just before the event
            index = np.floor((ticks[i] - offset) / period)
            currdist = ticks[i] - (index * period + offset)
            # if the next tooth is closer, measure the distance to that one
            if currdist > 0.5 * period:
                currdist = period - currdist
                index = index + 1
            if currdist < mindist:
                mindist = currdist
        residual[i] = mindist
    return residual
estimated_period = crudeEstimate()
print('crude estimate by clustering :', estimated_period)
xp = np.array([estimated_period, 0.0])
#res_1 = least_squares(disttoDiracComb, xp, method='lm', xtol=1e-15, verbose=1)
p, pcov, infodict, mesg, ier = leastsq(disttoDiracComb, x0=xp, ftol=1e-18, full_output=True)
# see https://stackoverflow.com/questions/14581358/getting-standard-errors-on-fitted-parameters-using-the-optimize-leastsq-method-i
s_sq = (disttoDiracComb(p)**2).sum() / (len(ticks) - len(p))
pcov = pcov * s_sq
perr = np.sqrt(np.diag(pcov))
print('estimated period is ', p[0], ' +/- ', 1.96 * perr[0])
print('estimated offset is ', p[1], ' +/- ', 1.96 * perr[1])
Applied to your sample, it prints :
crude estimate by clustering : 0.975
estimated period is 1.0042857141346768 +/- 0.04035792507868619
estimated offset is -0.011428571139828817 +/- 0.13385206912205957
It sounds like you need to decide what exactly you want to determine. If you want to know the average interval in a set of timestamps, then that's easy (just take the mean or median).
If you expect that the interval could be changing, then you need to have some idea about how fast it is changing. Then you can find a windowed moving average. You need to have an idea of how fast it is changing so that you can select your window size appropriately - a larger window will give you a smoother result, but a smaller window will be more responsive to a faster-changing rate.
If you have no idea whether the data is following any sort of pattern, then you are probably in the territory of data exploration. In that case, I would start by plotting the intervals, to see if a pattern appears to the eye. This might also benefit from applying a moving average if the data is quite noisy.
Essentially, whether or not there is something in the data and what it means is up to you and your knowledge of the domain. That is, in any set of timestamps there will be an average (and you can also easily calculate the variance to give an indication of variability in the data), but it is up to you whether that average carries any meaning.
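A minimal sketch of the simpler suggestions in this answer (mean or median interval, and a windowed moving average), using only the standard library. The helper names intervals and moving_average and the window size of 3 are illustrative assumptions, not something the answer specifies.

```python
from statistics import mean, median

def intervals(timestamps):
    """Differences between consecutive event times (input assumed sorted)."""
    return [b - a for a, b in zip(timestamps, timestamps[1:])]

def moving_average(values, window):
    """Mean over each run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]

# Sample data from the question
p = [0, 1.1, 1.9, 3, 3.9, 6.1]
gaps = intervals(p)
print("intervals:", gaps)
print("mean interval:", mean(gaps))
print("median interval:", median(gaps))
print("moving average (window=3):", moving_average(gaps, 3))
```

On this sample the last interval spans a missed detection, so it pulls the mean up while the median stays near the typical gap. Which of the two is more meaningful depends on the data, as the answer says.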

Is there an established method to weigh a weighted mean?

Problem. The weighted mean/average can be used to give differing weight in a mean computation to elements of differing importance. I need to figure out an extension that would in turn 'scale' or 'weigh' the resulting weighted mean with regards to zero, depending on the actual (non-normalized) values of the weights:
if the weights are low, the scaled weighted mean should be close to 0.
if at least some weights are close to the max weight, then the scaled weighted mean should be more or less equivalent with simple weighted mean.
Rationale and details. I need such an extension in order to produce a more sensible mean value in a case where:
the weights are proximity/similarity scores (of interval (0,1)) of the elements (let's call them neighbors for simplicity) of a target element, in some space, and
the values on the neighbors (being averaged) reflect a change in some quality of theirs (because it is assumed to have an effect on the target, if they are close enough)
elements that are further away should have less weight, so using weighted mean seems reasonable - but in some cases, all the neighbors are far away - in these cases, they presumably should have little to no effect on the target (so their mean should reflect this, and be closer to zero).
Reproducible example. This requirement is not met when using a simple weighted mean:
# Using R for example code (answer doesn't have to use R)
weighted.mean = function(x, w){
  return( sum(x*w)/sum(w) ) # standard way to calculate weighted mean
}
# Example data:
weights1 = c(0.9, 0.1, 0.01) # proximity of neighbors to target
weights2 = c(0.1, 0.1, 0.01) # proximity of neighbors to some other target
values = c(1,2,10) # values on these neighbors
mean(values)
# 4.333333 # not useful, doesn't take into account distance of elements at all
weighted.mean(values, weights1)
# 1.188119 # useful result, reflects distance/weight!
weighted.mean(values, weights2)
# 1.904762 # not useful result - none of them should have any effect, being all distant; the mean should be close to 0 (no effect) instead
What I've tried so far (1) Removing the normalizing sum(weights) business and just taking mean of values*weights:
weighted.mean2 = function(x, w){
  return( mean(x*w) )
}
weighted.mean2(values, weights1)
# 0.4 # lower value, but should be viewed relatively in comparison now
weighted.mean2(values, weights2)
# 0.1333333 # makes more sense, low proximity leads to low(er) mean value
What I've tried so far (2) Call weighted mean on 0 and the weighted mean, with the new weights for this vector of length two being 1 (max proximity/identity) and the proximity of the closest neighbor as a scale; the reasoning being that if the target has no close neighbors, then the effect in question should be about 0:
weighted.mean3 = function(x, w){
  tmp = weighted.mean(x, w)
  maxw = max(w)
  return( weighted.mean( c(0, tmp), c(1, maxw)) )
}
weighted.mean3(values, weights1)
# 0.5627931
weighted.mean3(values, weights2)
# 0.1731602 # also makes sense, low proximity leads to low(er) mean value
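For readers who do not use R, here is a rough Python port of the two attempts, as a sketch only. The function names mirror the R ones. The closed form in weighted_mean3_closed_form is not stated in the question; it follows from expanding the two-element weighted mean, and is included to make the scaling factor explicit.

```python
def weighted_mean(x, w):
    """Standard weighted mean: sum(x*w) / sum(w)."""
    return sum(xi * wi for xi, wi in zip(x, w)) / sum(w)

def weighted_mean2(x, w):
    """Attempt (1): plain mean of value*weight, no normalization by sum(w)."""
    return sum(xi * wi for xi, wi in zip(x, w)) / len(x)

def weighted_mean3(x, w):
    """Attempt (2): weighted mean of 0 and the weighted mean, with weights 1 and max(w)."""
    return weighted_mean([0.0, weighted_mean(x, w)], [1.0, max(w)])

def weighted_mean3_closed_form(x, w):
    """Same value as attempt (2): the weighted mean times max(w) / (1 + max(w))."""
    m = max(w)
    return weighted_mean(x, w) * m / (1.0 + m)

weights1 = [0.9, 0.1, 0.01]
weights2 = [0.1, 0.1, 0.01]
values = [1, 2, 10]

for w in (weights1, weights2):
    print(weighted_mean(values, w), weighted_mean2(values, w),
          weighted_mean3(values, w), weighted_mean3_closed_form(values, w))
```

Written this way, attempt (2) multiplies the weighted mean by max(w) / (1 + max(w)). For weights in (0, 1) that factor stays below 0.5, so even a target with a very close neighbor gets roughly half of the simple weighted mean rather than something close to it.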
Both approaches seem to yield a smaller value for the target with distant neighbors, and a comparatively higher value to a target with closer neighbors. However, this feels rather hacky to me, and I'm not sure if there might be cases where either approach might fail - surely there must be a more principled/established algorithm to do something like this out there (perhaps it's not called 'mean' or 'average' though; also, if one of my attempts is equivalent with one, then the answer could just confirm that). Long story short:
Is there an established/published method to weigh/scale a weighted mean in the way I've described above?
Note on previous version of the question: it was initially flagged as too broad, so I rewrote it and applied to reopen, but it was auto-closed as being abandoned; so I rewrote a new question; this one also now has a clear yes or no answer (rationale and/or references beyond a simple yes/no are of course appreciated)

genetic algorithm handling negative fitness values

I am trying to implement a genetic algorithm for maximizing a function of n variables. However, the problem is that the fitness values can be negative, and I am not sure how to handle negative values during selection. I read this article: Linear fitness scaling in Genetic Algorithm produces negative fitness values
but it's not clear to me how the negative fitness values were taken care of and how the scaling factors a and b were calculated.
Also, from the article I know that roulette wheel selection only works for positive fitness values. Is it the same for tournament selection as well?
When you have negative values, you could try to find the smallest fitness value in your population and add its opposite to every value. This way you will no longer have negative values, while the differences between fitness values will remain the same.
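A minimal sketch of that shift, followed by a roulette-wheel pick on the shifted values. The function names and the uniform fallback when all fitness values are equal are assumptions made for illustration.

```python
import random

def shift_to_nonnegative(fitness):
    """Subtract the smallest fitness so the worst individual ends up at 0."""
    lowest = min(fitness)
    return [f - lowest for f in fitness]

def roulette_pick(fitness, rng=random):
    """Pick an index with probability proportional to the shifted fitness."""
    shifted = shift_to_nonnegative(fitness)
    if sum(shifted) == 0:
        # All individuals are equally fit: fall back to a uniform choice (assumption).
        return rng.randrange(len(fitness))
    return rng.choices(range(len(fitness)), weights=shifted, k=1)[0]

population_fitness = [-12, -45, 0, 72.1, -32.3]
print(shift_to_nonnegative(population_fitness))
print(roulette_pick(population_fitness))
```

Note that after the shift the worst individual has weight 0 and is never picked by the roulette wheel.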
Tournament selection is not affected by this problem. It simply compares the fitness values of a uniformly sampled subset of size n of the population and takes the one with the best value. Still of course this means that, if you sample without repetition then the worst n-1 individuals will never get selected. If you sample with repetition they have a chance of being selected.
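The tournament described above could be sketched as follows. The function name tournament_select and its parameters are hypothetical. Only comparisons between fitness values are used, which is why negative values cause no trouble.

```python
import random

def tournament_select(fitness, size, rng=random, replace=False):
    """Return the index of the fittest among `size` uniformly sampled individuals."""
    if replace:
        # Sampling with repetition: the same individual may enter more than once.
        contestants = [rng.randrange(len(fitness)) for _ in range(size)]
    else:
        # Sampling without repetition: `size` distinct individuals.
        contestants = rng.sample(range(len(fitness)), size)
    return max(contestants, key=lambda i: fitness[i])

fitness = [50, 100, -1000]
print(tournament_select(fitness, size=2))
print(tournament_select(fitness, size=2, replace=True))
```

With size=2 and no repetition, the worst individual (index 2) always meets a better one and is never returned. With repetition it can be drawn twice and win its own tournament.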
As for proportional selection: it doesn't work with negative fitness values. You can only apply "windowing" or "scaling" to your fitness values, in which case it works again.
I once programmed some sampling methods as extension methods for C#'s IEnumerable; among them are SampleProportional and SampleProportionalWithoutRepetition. They're part of HeuristicLab under the GPL license.
Okay, it's late to answer, but someone could still google it.
First of all: yes, you can use negative fitness. But I strongly suggest you don't, because I did it and experienced a lot of problems (still doable, but really not recommended). So here's the explanation:
Say you have a population of N creatures. After the simulation they all have some fitness value f(n), where f(n) is the fitness and n is the creature number. After this you want to build a probability distribution to determine which creatures should be killed (of course you can just delete, say, the worst 40% of creatures, but it is better to use a distribution). How do you build such a distribution? Say f(a) = 50 and f(b) = 100, so creature b is 2 times better than creature a, so you probably want to make the survival probability of creature b 2 times higher than that of creature a (which makes great sense if your fitness value is linear). In case you wonder how to do it:
Let's say that sum( f(n) ) is the sum of all fitness values. Then the survival probability p(a) of creature a is:
p(a) = f(a) / sum( f(n) )
This will do the trick.
But now let's allow negative fitness. Say f(a) = 50, f(b) = 100, f(c) = -1000. b is again 2 times better than a, which makes sense, but is it -10 times better than c? That doesn't make sense. The gentleman above suggested adding the opposite of the worst fitness value, which can kind of "fix" your situation, but really it doesn't (I made the same mistake before). Okay, let's add 1000 to all fitness values:
f(a) = 1050, f(b) = 1100, f(c) = 0, so the survival probability of c is zero now; okay, we can accept that. But let's compare a and b now:
b is about 1.05 times better than a now, which means that the fitness of a and b is almost the same, which is totally unacceptable, because b clearly was 2 times better than a (assuming, of course, that fitness is linear, but this will mess up nonlinear fitnesses as well)! You can't escape this problem; it will constantly get in your way, because a probability can't be negative. So you can either remove the probability from the evolution (which is not a very good thing to do) or use some exceptions and tricks.
Since it was too late in my scenario to remove negative fitness, here's how I fixed things:
Once again, you have a population of N creatures. Say neg(N) gives you all creatures with negative fitness and pos(N) all creatures with positive fitness (it's your call whether to count zero as negative or positive; it doesn't matter in this case). And let's say you need D creatures to die. And now here's the trick:
the higher f(c) (c is a pos creature), the better the creature is, so we can use its fitness to determine the probability of survival. But the lower (more negative) f(m) (m is a neg creature), the worse the creature is, so we can use its fitness to determine the probability of dying.
Now, if D > neg(N), then all neg(N) will die, and (D - neg(N)) of pos(N) will die according to a probability distribution based on the fitness of all positive creatures (probability of survival p(a) = f(a) / sum( pos(n) )). But if D < neg(N), then all pos(N) will survive, and D of the neg(N) creatures will die according to a probability distribution based on the fitness of all negative creatures (probability of dying p(a) = f(a) / sum( neg(n) ); f(a) will be negative, but sum( neg(n) ) will be negative as well, so the probability will be positive).
I know this question has been here for a long time, but in case newcomers want to know a good way to deal with negative values when the problem is also a minimization problem, here is the code for it.
from numpy import min, sum, ptp, array
from numpy.random import uniform
list_fitness1 = array([-12, -45, 0, 72.1, -32.3])
list_fitness2 = array([0.5, 6.32, 988.2, 1.23])
def get_index_roulette_wheel_selection(list_fitness=None):
    """ It can handle negative also. Make sure your list fitness is 1D-numpy array"""
    # scale to [0, 1]; this divides by zero if all fitness values are equal
    scaled_fitness = (list_fitness - min(list_fitness)) / ptp(list_fitness)
    # minimization: the smaller the fitness, the larger the slice of the wheel
    minimized_fitness = 1.0 - scaled_fitness
    total_sum = sum(minimized_fitness)
    r = uniform(low=0, high=total_sum)
    for idx, f in enumerate(minimized_fitness):
        r = r + f
        if r > total_sum:
            return idx
get_index_roulette_wheel_selection(list_fitness1)
get_index_roulette_wheel_selection(list_fitness2)
Make sure your fitness list is a 1D numpy array
Scale the fitness list to the range [0, 1]
Turn the maximization problem into a minimization problem with 1.0 - scaled_fitness_list
Draw a random number between 0 and sum(minimized_fitness_list)
Keep adding elements of the minimized fitness list until the value exceeds the total sum
You can see that if the fitness is small --> it has a bigger value in minimized_fitness --> it has a bigger chance of pushing the value above the total sum.
I think that the main issue people are running into here is that they're treating the fitness score improperly. Let's think about an example fitness score as the temperature inside of a truck shipping frozen goods. The truck's internal temperature should be -2 C... but that's also 28.4 F. They are the same exact fitness relative to the food staying frozen, but 2 * -2 = -4, and 2 * 28.4 is 56.8. "Two times colder" doesn't really make any sense here (-4 C != 14.2 F either). Same with fitness scores.
In the case of -1000 in Volot's example, the difference between 50 and 100 is actually comparatively low: the important thing is that you'd pick either / both of those over the -1000, which you will definitely do if you just subtract -1000 from everything. Then the next generation of children may have fitness scores of 50, 100, 200, and 10, let's say. Now the difference between 50 and 100 is much more pronounced, and 50 will have a much lower chance of getting picked. Remember genetic algorithms are iterative. It also reminds me of a saying: You don’t have to run faster than the bear to get away. You just have to run faster than the guy next to you. 50 just needs to outrun -1000 to survive to reproduce.
The problem of subtracting the min resulting in 0 can also be avoided. When estimating probability distributions, people will add 1 occurrence to every (known) possible outcome so that extremely rare events are still captured. That gets somewhat trickier with fitness scores. You can't just add 1. What if your fitness scores are 0.01, 0.02, and -0.01? After subtracting the min and adding 1 you get 1.02, 1.03, and 1.00, which are going to result in picking a low relative fitness a lot. You can instead add the lowest non-zero value to everything, resulting in 0.04, 0.05, and 0.02. For the -1000 case, it results in 2100, 2150, and 1050 (so everything that used to be 0 will always be half as likely as the next lowest fitness to get picked).
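A small sketch of the "subtract the min, then add the lowest non-zero value" idea from the paragraph above. The function name shift_with_floor and the handling of the all-equal case are assumptions; the answer does not say what to do when every shifted value is 0.

```python
def shift_with_floor(fitness):
    """Subtract the minimum, then add the smallest non-zero shifted value to everything."""
    lowest = min(fitness)
    shifted = [f - lowest for f in fitness]
    nonzero = [s for s in shifted if s > 0]
    if not nonzero:
        # All fitness values are equal: give everyone the same weight (assumption).
        return [1.0] * len(fitness)
    floor = min(nonzero)
    return [s + floor for s in shifted]

print(shift_with_floor([0.01, 0.02, -0.01]))
print(shift_with_floor([50, 100, -1000]))
```

The results can be fed to any proportional sampler as weights. The individual that used to sit at 0 now gets exactly half the weight of the next-worst one.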
Still, to make things as consistent as possible with what is a more typical GA sampling method, I would only subtract the min and add back in a small amount of fitness when there are negative values. When everything is positive, there's no reason to do it.

Efficient (fastest) way to sum elements of matrix in matlab

Let's take a matrix A, say A = magic(100);. I have seen two ways of computing the sum of all elements of A.
sumOfA = sum(sum(A));
Or
sumOfA = sum(A(:));
Is one of them faster (or better practice) than the other? If so, which one? Or are they both equally fast?
It seems that you can't make up your mind about whether performance or floating point accuracy is more important.
If floating point accuracy were of paramount importance, then you would segregate the positive and negative elements, sorting each segment. Then sum in order of increasing absolute value. Yeah, I know, it's more work than anyone would do, and it will probably be a waste of time.
Instead, use adequate precision such that any errors made will be irrelevant. Use good numerical practices about tests, etc, such that there are no problems generated.
As far as the time goes, for an NxM array,
sum(A(:)) will require N*M-1 additions.
sum(sum(A)) will require (N-1)*M + M-1 = N*M-1 additions.
Either method requires the same number of adds, so for a large array, even if the interpreter is not smart enough to recognize that they are both the same op, who cares?
It is simply not an issue. Don't make a mountain out of a mole hill to worry about this.
Edit: in response to Amro's comment about the errors for one method over the other, there is little you can control. The additions will be done in a different order, but there is no assurance about which sequence will be better.
A = randn(1000);
format long g
The two solutions are quite close. In fact, compared to eps, the difference is barely significant.
sum(A(:))
ans =
945.760668102446
sum(sum(A))
ans =
945.760668102449
sum(sum(A)) - sum(A(:))
ans =
2.72848410531878e-12
eps(sum(A(:)))
ans =
1.13686837721616e-13
Suppose you choose the segregate and sort trick I mentioned. See that the negative and positive parts will be large enough that there will be a loss of precision.
sum(sort(A(A<0),'descend'))
ans =
-398276.24754782
sum(sort(A(A<0),'descend')) + sum(sort(A(A>=0),'ascend'))
ans =
945.7606681037
So you really would need to accumulate the pieces in a higher precision array anyway. We might try this:
[~,tags] = sort(abs(A(:)));
sum(A(tags))
ans =
945.760668102446
An interesting problem arises even in these tests. Will there be an issue because the tests are done on a random (normal) array? Essentially, we can view sum(A(:)) as a random walk, a drunkard's walk. But consider sum(sum(A)). Each element of sum(A) (i.e., the internal sum) is itself a sum of 1000 normal deviates. Look at a few of them:
sum(A)
ans =
Columns 1 through 6
-32.6319600960983 36.8984589766173 38.2749084367497 27.3297721091922 30.5600109446534 -59.039228262402
Columns 7 through 12
3.82231962760523 4.11017616179294 -68.1497901792032 35.4196443983385 7.05786623564426 -27.1215387236418
Columns 13 through 18
When we add them up, there will be a loss of precision. So potentially, the operation as sum(A(:)) might be slightly more accurate. Is it so? What if we use a higher precision for the accumulation? So first, I'll form the sum down the columns using doubles, then convert to 25 digits of decimal precision, and sum the rows. (I've displayed only 20 digits here, leaving 5 digits hidden as guard digits.)
sum(hpf(sum(A)))
ans =
945.76066810244807408
Or, instead, convert immediately to 25 digits of precision, then sum the result.
sum(hpf(A(:)))
ans =
945.76066810244749807
So both forms in double precision were equally wrong here, in opposite directions. In the end, this is all moot, since any of the alternatives I've shown are far more time consuming compared to the simple variations sum(A(:)) or sum(sum(A)). Just pick one of them and don't worry.
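The same effect can be reproduced outside MATLAB. The sketch below is a loose Python analogue, not a model of how MATLAB's sum is implemented: it sums a random matrix in two different orders with a plain running total and compares both to math.fsum, which returns a correctly rounded sum. The matrix size and the seed are arbitrary choices.

```python
import math
import random

def naive_sum(values):
    """Plain left-to-right running sum (explicit loop, since the built-in
    sum() may use compensated summation for floats on newer Python versions)."""
    total = 0.0
    for v in values:
        total += v
    return total

def flat_sum(matrix):
    """Loose analogue of sum(A(:)): one running sum over all elements, column by column."""
    return naive_sum(v for col in zip(*matrix) for v in col)

def column_then_row_sum(matrix):
    """Loose analogue of sum(sum(A)): sum each column, then sum the column totals."""
    return naive_sum(naive_sum(col) for col in zip(*matrix))

rng = random.Random(0)
A = [[rng.gauss(0.0, 1.0) for _ in range(200)] for _ in range(200)]
reference = math.fsum(v for row in A for v in row)

print("flat order error:       ", flat_sum(A) - reference)
print("column-then-row error:  ", column_then_row_sum(A) - reference)
```

Both errors are tiny compared to the size of the sum, and which ordering happens to land closer depends on the data, in line with the conclusion above.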
Performance-wise, I'd say both are very similar (assuming a recent MATLAB version). Here is quick test using the TIMEIT function:
function sumTest()
    M = randn(5000);
    timeit( @() func1(M) )
    timeit( @() func2(M) )
end
function v = func1(A)
    v = sum(A(:));
end
function v = func2(A)
    v = sum(sum(A));
end
the results were:
>> sumTest
ans =
0.0020917
ans =
0.0017159
What I would worry about is floating-point issues. Example:
>> M = randn(1000);
>> abs( sum(M(:)) - sum(sum(M)) )
ans =
3.9108e-11
Error magnitude increases for larger matrices
I think a simple way to find out is to put tic at the start and toc at the end of your code.
tic
A = randn(5000);
format long g
sum(A(:));
toc
However, when you use the randn function the elements are random, and the measured time can differ from one run to the next.
It is better to use a single fixed, sufficiently large matrix when comparing calculation times.

Linear fitness scaling in Genetic Algorithm produces negative fitness values

I have a GA with a fitness function that can evaluate to negative or positive values. For the sake of this question let's assume the function
u = 5 - (x^2 + y^2)
where
x in [-5.12 .. 5.12]
y in [-5.12 .. 5.12]
Now in the selection phase of GA I am using simple roulette wheel. Since to be able to use simple roulette wheel my fitness function must be positive for concrete cases in a population, I started looking for scaling solutions. The most natural seems to be linear fitness scaling. It should be pretty straightforward, for example look at this implementation. However, I am getting negative values even after linear scaling.
For example for the above mentioned function and these fitness values:
-9.734897 -7.479017 -22.834280 -9.868979 -13.180669 4.898595
after linear scaling I am getting these values
-9.6766040 -11.1755111 -0.9727897 -9.5875139 -7.3870793 -19.3997490
Instead, I would like to scale them to positive values, so I can do roulette wheel selection in the next phase.
I must be doing something fundamentally wrong here. How should I approach this problem?
The main mistake was that the input to linear scaling must already be positive (by definition), whereas I was also feeding it negative values.
The talk about negative values is not about input to the algorithm, but about output (scaled values) from the algorithm. The check is to handle this case and then correct it so as not to produce negative scaled values.
if(p->min > (p->scaleFactor * p->avg - p->max)/
   (p->scaleFactor - 1.0)) { /* if nonnegative smin */
    d = p->max - p->avg;
    p->scaleConstA = (p->scaleFactor - 1.0) * p->avg / d;
    p->scaleConstB = p->avg * (p->max - (p->scaleFactor * p->avg))/d;
} else { /* if smin becomes negative on scaling */
    d = p->avg - p->min;
    p->scaleConstA = p->avg/d;
    p->scaleConstB = -p->min * p->avg/d;
}
On the image below, if f'min is negative, go to else clause and handle this case.
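As a rough illustration, the C snippet above can be transcribed to Python like this. The function name, the default scale factor of 2.0, the explicit error on negative input and the handling of a population with identical fitness are assumptions added for the sketch; the two branches follow the C code.

```python
def linear_scaling_coefficients(fitness, scale_factor=2.0):
    """Return (a, b) so that scaled = a * f + b keeps the average fitness unchanged."""
    f_min, f_max = min(fitness), max(fitness)
    f_avg = sum(fitness) / len(fitness)
    if f_min < 0:
        # Assumption: reject negative raw fitness, since the scaling expects positive input.
        raise ValueError("raw fitness must be non-negative before linear scaling")
    if f_max == f_min:
        return 1.0, 0.0  # nothing to scale (assumption)
    if f_min > (scale_factor * f_avg - f_max) / (scale_factor - 1.0):
        # Scaled minimum stays non-negative: map the maximum to scale_factor * average.
        d = f_max - f_avg
        a = (scale_factor - 1.0) * f_avg / d
        b = f_avg * (f_max - scale_factor * f_avg) / d
    else:
        # Scaled minimum would become negative: map the minimum to 0 instead.
        d = f_avg - f_min
        a = f_avg / d
        b = -f_min * f_avg / d
    return a, b

raw = [1, 2, 3, 4, 10]
a, b = linear_scaling_coefficients(raw)
scaled = [a * f + b for f in raw]
print(a, b, scaled)

raw_else = [1, 9, 10, 10, 10]
a2, b2 = linear_scaling_coefficients(raw_else)
scaled_else = [a2 * f + b2 for f in raw_else]
print(a2, b2, scaled_else)
```

In both branches the average of the scaled values equals the raw average. The first example takes the first branch, the second takes the else branch and ends up with a scaled minimum of 0.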
Well, the solution is then to prescale the above-mentioned function so that it gives only non-negative values. As Hyperboreus suggested, this can be done by subtracting the smallest possible value of u (that is, adding its absolute value), which is
u = 5 - (2*5.12^2)
It is best if we separate real fitness values that we are trying to maximize from scaled fitness values that are input to selection phase of GA.
I agree with the previous answer. Linear scaling by itself tries to preserve the average fitness value, so it needs to be offset if the function is negative. For more details, please have a look in Goldberg's Genetic Algorithms book (1989), Chapter 7, pp. 76-79.
The smallest possible value of u is 5 - (2*5.12^2). Why not just subtract this from your u (that is, add its absolute value)?
