Optimal Cutoff Point for max sum of sensitivity and specificity - probability

I would like to calculate the sensitivity-specificity sum maximization threshold (Youden Index) for my glm model:
model = glm(present ~ Summer_precipitation + Summer_temperature + Frost_days + Snowcover_days + Forest_presence + Population_density + Tick_density + Vaccination_coverage, family = binomial(link = "logit"), data = tbe_data)
I calculated predicted probabilities for the model. weather_data is a raster stack of all the covariate rasters listed in the model above.
#create predictions based on weather data
predictions=predict(weather_data,model,type="response")
#plot predictions
plot(predictions)
How can I now calculate the optimal probability cutoff point from the model? I would like to use the cutpointr function, but I don't know how to adapt the code to my situation.

Related

Real-time peak detection in noisy sinusoidal time-series

I have been attempting to detect peaks in sinusoidal time-series data in real time, but I've had no success so far. I cannot seem to find a real-time algorithm that detects peaks in sinusoidal signals with a reasonable level of accuracy. Either no peaks are detected, or a zillion points along the sine wave are detected as peaks.
What is a good real-time algorithm for input signals that resemble a sine wave, and may contain some random noise?
As a simple test case, consider a stationary sine wave that is always the same frequency and amplitude. (The exact frequency and amplitude don't matter; I have arbitrarily chosen a frequency of 60 Hz, an amplitude of ±1 unit, and a sampling rate of 8 kS/s.) The following MATLAB code will generate such a sinusoidal signal:
dt = 1/8000;
t = (0:dt:(1-dt)/4)';
x = sin(2*pi*60*t);
Using the algorithm developed and published by Jean-Paul, I either get no peaks detected (left) or a zillion "peaks" detected (right):
I've tried just about every combination of values for the algorithm's three parameters (lag, threshold, and influence) that I could think of, following the "rules of thumb" that Jean-Paul gives, but I have so far been unable to get my expected result.
I found an alternative algorithm, developed and published by Eli Billauer, that does give me the results that I want—e.g.:
Even though Eli Billauer's algorithm is much simpler and does tend to reliably produce the results that I want, it is not suitable for real-time applications.
As another example of a signal that I'd like to apply such an algorithm to, consider the test case given by Eli Billauer for his own algorithm:
t = 0:0.001:10;
x = 0.3*sin(t) + sin(1.3*t) + 0.9*sin(4.2*t) + 0.02*randn(1, 10001);
This is a more unusual (less uniform/regular) signal, with a varying frequency and amplitude, but still generally sinusoidal. The peaks are plainly obvious to the eye when plotted, but hard to identify with an algorithm.
What is a good real-time algorithm to correctly identify the peaks in a sinusoidal input signal? I am not really an expert when it comes to signal processing, so it would be helpful to get some rules of thumb that consider sinusoidal inputs. Or, perhaps I need to modify e.g. Jean-Paul's algorithm itself in order to work properly on sinusoidal signals. If that's the case, what modifications would be required, and how would I go about making these?
Case 1: sinusoid without noise
If your sinusoid does not contain any noise, you can use a very classic signal processing technique: taking the first derivative and detecting when it is equal to zero.
For example:
function signal = derivesignal( d )
% Identify signal
signal = zeros(size(d));
for i = 2:length(d)
    if d(i-1) > 0 && d(i) <= 0
        signal(i) = +1; % peak detected
    elseif d(i-1) < 0 && d(i) >= 0
        signal(i) = -1; % trough detected
    end
end
end
Using your example data:
% Generate data
dt = 1/8000;
t = (0:dt:(1-dt)/4)';
y = sin(2*pi*60*t);
% Add some trends
y(1:1000) = y(1:1000) + 0.001*(1:1000)';
y(1001:2000) = y(1001:2000) - 0.002*(1:1000)';
% Approximate first derivative (delta y / delta x)
d = [0; diff(y)];
% Identify signal
signal = derivesignal(d);
% Plot result
figure(1); clf; set(gcf,'Position',[0 0 677 600])
subplot(4,1,1); hold on;
title('Data');
plot(t,y);
subplot(4,1,2); hold on;
title('First derivative');
area(d);
ylim([-0.05, 0.05]);
subplot(4,1,3); hold on;
title('Signal (-1 for trough, +1 for peak)');
plot(t,signal); ylim([-1.5 1.5]);
subplot(4,1,4); hold on;
title('Signals marked on data');
markers = abs(signal) > 0;
plot(t,y); scatter(t(markers),y(markers),30,'or','MarkerFaceColor','red');
This yields:
This method will work extremely well for any type of sinusoid, with the only requirement that the input signal contains no noise.
Case 2: sinusoid with noise
As soon as your input signal contains noise, the derivative method will fail. For example:
% Generate data
dt = 1/8000;
t = (0:dt:(1-dt)/4)';
y = sin(2*pi*60*t);
% Add some trends
y(1:1000) = y(1:1000) + 0.001*(1:1000)';
y(1001:2000) = y(1001:2000) - 0.002*(1:1000)';
% Add some noise
y = y + 0.2.*randn(2000,1);
The same derivative method now produces this result, because first differences amplify the noise:
Now there are many ways to deal with noise, and the most standard way is to apply a moving average filter. One disadvantage of moving averages is that they are slow to adapt to new information, such that signals may be identified after they have occurred (moving averages have a lag).
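For instance, here is a minimal sketch (the 50-sample window length is an arbitrary choice) that smooths the noisy signal y from above with a moving average before applying the derivative test from Case 1:
% Smooth the noisy signal before first-differencing (window length is arbitrary)
w = 50;                      % moving-average window in samples
ySmooth = movmean(y, w);     % movmean needs R2016a+; filter(ones(w,1)/w, 1, y) also works
d = [0; diff(ySmooth)];
signal = derivesignal(d);    % derivesignal as defined in Case 1
Keep the lag mentioned above in mind: the smoothed extrema are shifted relative to the raw signal.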
Another very typical approach is to use Fourier analysis to identify all the frequencies in your input data, disregard all low-amplitude and high-frequency components, and use the remaining sinusoid as a filter. The remaining sinusoid will be (largely) cleansed of the noise, and you can then use first-differencing again to determine the peaks and troughs (or, for a single sine wave, you know the peaks and troughs occur at 1/4 and 3/4 of the period, i.e. at phases π/2 and 3π/2). I suggest you pick up any signal processing theory book to learn more about this technique. Matlab also has some educational material about this.
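As a rough illustration of this idea, here is a minimal sketch (not a general-purpose filter) that keeps only the dominant non-DC frequency bin of the noisy signal y from above and then applies the derivative test again; note that this discards the DC offset and the trend:
% Keep only the dominant (non-DC) frequency component and reconstruct the signal
Y = fft(y);
[~, idx] = max(abs(Y(2:floor(end/2))));      % strongest positive-frequency bin, skipping DC
idx = idx + 1;                               % shift because the search started at bin 2
Yf = zeros(size(Y));
Yf([idx, end-idx+2]) = Y([idx, end-idx+2]);  % keep the bin and its conjugate mirror
yClean = real(ifft(Yf));                     % reconstructed dominant sinusoid (DC and trend dropped)
d = [0; diff(yClean)];
signal = derivesignal(d);                    % first-differencing works again on the clean wave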
If you want to use this algorithm in hardware, I would suggest you also take a look at WFLC (Weighted Fourier Linear Combiner) with e.g. 1 oscillator or PLL (Phase-Locked Loop) that can estimate the phase of a noisy wave without doing a full Fast Fourier Transform. You can find a Matlab algorithm for a phase-locked loop on Wikipedia.
I will suggest a slightly more sophisticated approach here that will identify the peaks and troughs in real-time: fitting a sine wave function to your data using moving least squares minimization with initial estimates from Fourier analysis.
Here is my function to do that:
function [result, peaks, troughs] = fitsine(y, t, eps)
% Fast Fourier transform
f = fft(y);
l = length(y);
p2 = abs(f/l);
p1 = p2(1:ceil(l/2+1));
p1(2:end-1) = 2*p1(2:end-1);
freq = (1/mean(diff(t)))*(0:ceil(l/2))/l;
% Find maximum amplitude and frequency
maxPeak = p1 == max(p1(2:end)); % disregard 0 frequency!
maxAmplitude = p1(maxPeak); % find maximum amplitude
maxFrequency = freq(maxPeak); % find corresponding frequency
% Initialize guesses
p = [];
p(1) = mean(y); % vertical shift
p(2) = maxAmplitude; % amplitude estimate
p(3) = maxFrequency; % frequency estimate
p(4) = 0; % phase shift (no guess)
p(5) = 0; % trend (no guess)
% Create model
f = @(p) p(1) + p(2)*sin( p(3)*2*pi*t+p(4) ) + p(5)*t;
ferror = @(p) sum((f(p) - y).^2);
% Nonlinear least squares
% If you have the Optimization Toolbox, use lsqcurvefit instead!
options = optimset('MaxFunEvals',50000,'MaxIter',50000,'TolFun',1e-25);
[param,fval,exitflag,output] = fminsearch(ferror,p,options);
% Calculate result
result = f(param);
% Find peaks
peaks = abs(sin(param(3)*2*pi*t+param(4)) - 1) < eps;
% Find troughs
troughs = abs(sin(param(3)*2*pi*t+param(4)) + 1) < eps;
end
As you can see, I first perform a Fourier transform to find initial estimates of the amplitude and frequency of the data. I then fit a sinusoid to the data using the model a + b sin(ct + d) + et. The fitted values represent a sine wave of which I know that +1 and -1 are the peaks and troughs, respectively. I can therefore identify these values as the signals.
This works very well for sinusoids with (slowly changing) trends and general (white) noise:
% Generate data
dt = 1/8000;
t = (0:dt:(1-dt)/4)';
y = sin(2*pi*60*t);
% Add some trends
y(1:1000) = y(1:1000) + 0.001*(1:1000)';
y(1001:2000) = y(1001:2000) - 0.002*(1:1000)';
% Add some noise
y = y + 0.2.*randn(2000,1);
% Loop through data (moving window) and fit sine wave
window = 250; % How many data points to consider
interval = 10; % How often to estimate
result = nan(size(y));
signal = zeros(size(y));
for i = window+1:interval:length(y)
    data = y(i-window:i); % Get data window
    period = t(i-window:i); % Get time window
    [output, peaks, troughs] = fitsine(data,period,0.01);
    result(i-interval:i) = output(end-interval:end);
    signal(i-interval:i) = peaks(end-interval:end) - troughs(end-interval:end);
end
% Plot result
figure(1); clf; set(gcf,'Position',[0 0 677 600])
subplot(4,1,1); hold on;
title('Data');
plot(t,y); xlim([0 max(t)]); ylim([-4 4]);
subplot(4,1,2); hold on;
title('Model fit');
plot(t,result,'-k'); xlim([0 max(t)]); ylim([-4 4]);
subplot(4,1,3); hold on;
title('Signal (-1 for trough, +1 for peak)');
plot(t,signal,'r','LineWidth',2); ylim([-1.5 1.5]);
subplot(4,1,4); hold on;
title('Signals marked on data');
markers = abs(signal) > 0;
plot(t,y,'-','Color',[0.1 0.1 0.1]);
scatter(t(markers),result(markers),30,'or','MarkerFaceColor','red');
xlim([0 max(t)]); ylim([-4 4]);
Main advantages of this approach are:
You have an actual model of your data, so you can predict signals in the future before they happen (e.g. fix the model parameters and evaluate the model at future time points; see the sketch below)
You don't need to estimate the model every period (see parameter interval in the code)
The disadvantage is that you need to select a lookback window, but you will have this problem with any method that you use for real-time detection.
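For example, assuming fitsine is extended to also return the fitted parameter vector param (it currently returns only result, peaks and troughs), the fitted model can be evaluated at future time points:
% Sketch: evaluate the fitted model at future time points
% (assumes fitsine has been modified to also return the fitted parameters param)
tFuture = t(end) + dt*(1:100)';   % the next 100 samples
yFuture = param(1) + param(2)*sin(param(3)*2*pi*tFuture + param(4)) + param(5)*tFuture;
futurePeaks = abs(sin(param(3)*2*pi*tFuture + param(4)) - 1) < 0.01;   % predicted peak locations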
Video demonstration
Data is the input data, Model fit is the fitted sine wave to the data (see code), Signal indicates the peaks and troughs and Signals marked on data gives an impression of how accurate the algorithm is. Note: watch the model fit adjust itself to the trend in the middle of the graph!
That should get you started. There are also a lot of excellent books on signal detection theory (just google that term), which will go much further into these types of techniques. Good luck!
Consider using findpeaks; it is fast, which may be important for real-time use. You should filter out high-frequency noise to improve accuracy; here I smooth the data with a moving window.
t = 0:0.001:10;
x = 0.3*sin(t) + sin(1.3*t) + 0.9*sin(4.2*t) + 0.02*randn(1, 10001);
[~,iPeak0] = findpeaks(movmean(x,100),'MinPeakProminence',0.5);
You can time the process (0.0015sec)
f0 = @() findpeaks(movmean(x,100),'MinPeakProminence',0.5)
disp(timeit(f0,2))
For comparison, processing the slope directly is faster (about 0.00013 s), but findpeaks has many useful options, such as a minimum interval between peaks.
iPeaks1 = derivePeaks(x);
f1 = @() derivePeaks(x)
disp(timeit(f1,1))
Where derivePeaks is:
function iPeak1 = derivePeaks(x)
    xSmooth = movmean(x,100);
    goingUp = find(diff(movmean(xSmooth,100)) > 0);
    iPeak1 = unique(goingUp([1, find(diff(goingUp) > 100), end]));
    iPeak1(iPeak1 == 1 | iPeak1 >= length(x)-1) = []; % discard indices at the very edges of the signal
end

Kalman Filter Prediction Implementation

I am trying to implement a Kalman filter in order to localize a robot.
I am confused with the prediction step (excluding process noise) x = Fx + u
If x is a state estimation vector: [xLocation, xVelocity] and F is the state transition matrix [[1 1],[0 1]], then the new xLocation would be equal to xLocation + xVelocity + the corresponding component of the motion vector u.
Why is the equation not x = x + u? Shouldn't the predicted location of the robot be the location + motion of the robot?
Maybe there is some confusion with respect to what the matrices actually represent.
The "control vector", u, might be the acceleration externally applied to the system.
In this case, I would expect the equations to look like this:
xlocation = xlocation + xvelocity
xvelocity = xvelocity + uvelocity
These two equations assume that the update is applied every 1 second (otherwise some "delta time" factors would need to be included in the transition matrix and the control vector).
For the situation mentioned above, the matrices and vectors are:
The state vector (column vector with 2 entries):
xlocation
xvelocity
The transition matrix (2 x 2 matrix):
1 1
0 1
The control vector (column vector with 2 entries):
0
uvelocity
This link contains nice explanations and visualizations for the Kalman Filter.
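As a minimal MATLAB sketch of this prediction step (the numbers are hypothetical example values, and a 1-second update interval is assumed as above):
% One Kalman prediction step for the [location; velocity] state, dt = 1 s
F = [1 1; 0 1];   % state transition matrix
x = [0; 2];       % current state: location 0, velocity 2 (example values)
u = [0; 0.5];     % control vector: externally applied change in velocity
x = F*x + u;      % predicted state: location 0 + 2 = 2, velocity 2 + 0.5 = 2.5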

calculate first,second,third derivative on 3d image

I am having trouble calculating the first, second, and third derivatives of a 3-D image in MATLAB.
I have 60 DICOM slices of a knee MRI, and I want to calculate the derivatives.
For a 2-D image, the derivative in the x or y direction can be calculated with a Sobel (or similar) operator applied in that direction. But for a 3-D image built from the 60 DICOM slices, how can I calculate the first, second, and third derivatives in the x, y, and z directions?
For the first derivative I implemented central differences like this, where F is the 3-D matrix containing all slices and [k,l,m] = size(F):
switch direction
    case 'x'
        D(1,:,:) = F(2,:,:) - F(1,:,:);
        D(k,:,:) = F(k,:,:) - F(k-1,:,:);
        D(2:k-1,:,:) = (F(3:k,:,:) - F(1:k-2,:,:))/2;
    case 'y'
        D(:,1,:) = F(:,2,:) - F(:,1,:);
        D(:,l,:) = F(:,l,:) - F(:,l-1,:);
        D(:,2:l-1,:) = (F(:,3:l,:) - F(:,1:l-2,:))/2;
    case 'z'
        D(:,:,1) = F(:,:,2) - F(:,:,1);
        D(:,:,m) = F(:,:,m) - F(:,:,m-1);
        D(:,:,2:m-1) = (F(:,:,3:m) - F(:,:,1:m-2))/2;
end
but I don't think this is correct. Please help me: how can I calculate the first, second, and third derivatives in the x, y, and z directions?
There is a function for that: imgradient3 (https://www.mathworks.com/help/images/ref/imgradient3.html), which has options to select the kind of gradient computation; Sobel is the default.
If you'd like directional gradients, consider using imgradientxyz (https://www.mathworks.com/help/images/ref/imgradientxyz.html), which has the same options available but returns the directional gradients Gx, Gy and Gz.
volData = load('mri');
sz = volData.siz;
vol = squeeze(volData.D);
[Gx, Gy, Gz] = imgradientxyz(vol);
Note that these functions were introduced in R2016a.
The "first derivative" in higher dimensions is called a gradient vector. There are many formulas to numerically approximate the gradient, and one of the most accurate approaches is disccused in a recent paper: "High Order Spatial Generalization of 2D and 3D Isotropic Discrete Gradient Operators with Fast Evaluation on GPUs" by Leclaire et al.
Higher order derivatives in more than one dimension are tensors. The "second derivative" in particular is a rank-2 tensor (the Hessian) and has 6 independent components, which to the lowest order of approximation (assuming unit voxel spacing) are
Dxx(x,y,z) = F(x+1,y,z) - 2*F(x,y,z) + F(x-1,y,z)
Dyy(x,y,z) = F(x,y+1,z) - 2*F(x,y,z) + F(x,y-1,z)
Dzz(x,y,z) = F(x,y,z+1) - 2*F(x,y,z) + F(x,y,z-1)
Dxy(x,y,z) = (F(x+1,y+1,z) - F(x+1,y-1,z) - F(x-1,y+1,z) + F(x-1,y-1,z))/4
Dxz(x,y,z) = (F(x+1,y,z+1) - F(x+1,y,z-1) - F(x-1,y,z+1) + F(x-1,y,z-1))/4
Dyz(x,y,z) = (F(x,y+1,z+1) - F(x,y+1,z-1) - F(x,y-1,z+1) + F(x,y-1,z-1))/4
The "third derivative" will be a rank-3 tensor and will have even more components. The formulas are lenghty and can be derived by considering a Taylor series expansion of F up to the 3rd order

How to speed up vector equations in Matlab?

I'm using the following code to do logistic regression with stochastic gradient descent in Matlab. The total number of training + testing samples is about 600K. The code runs for hours. How can I speed it up?
%% load dataset
clc;
clear;
load('covtype.mat');
Data = [X y];
%% Split into testing and training data in a 1:9 split
nRows=size(Data,1);
randRows=randperm(nRows); % generate random ordering of row indices
Test=Data(randRows(1:58101),:); % index using random order
Train=Data(randRows(58102:end),:);
Testx=Test(:,1:54);
Testy=Test(:,55:end);
Trainx=Train(:,1:54);
Trainy=Train(:,55:end);
%% Perform stochastic gradient descent on training data
lambda=0.01; % regularisation constant
alpha=0.01; % step length constant
theta_old = zeros(54,1);
theta_new = theta_old;
z=1;
for count = 1:size(Train,1)
    theta_old = theta_new;
    theta_new = theta_old + (alpha*Trainy(count)* (1.0 ./ (1.0 + exp(Trainy(count)*(Trainx(count,:)*theta_old)))).*Trainx(count,:))' - alpha*lambda*2*theta_old;
    n = norm(theta_new);
    llr = lambda*n*n;
    count_dummy(z) = count; % dummy variable to store iteration number for plotting later
    % calculate log likelihood error for test data with current value of theta_new
    for i = 1:size(Test,1)
        llr = llr - 1.*log(1.0 + exp(-(Testy(i)*(Testx(i,:)*theta_new))));
    end
    llr_dummy(z) = llr; % dummy variable to store llr for plotting later
    z = z+1;
end
thetaopt = theta_new; % this is optimal theta
%% Plot results on testing data
plot(count_dummy, llr_dummy);
I have to calculate the log likelihood error of test data at every iteration to plot it. How can I speed up this code?
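As a sketch of the kind of vectorization that removes the inner loop (assuming, as in the update step above, that Testy holds ±1 labels in a single column), the per-iteration test log-likelihood can be computed without looping over the test rows:
% Vectorized test log-likelihood (replaces the inner for-loop over the test set)
margin = Testy .* (Testx * theta_new);                         % column vector of y_i * x_i' * theta
llr = lambda*norm(theta_new)^2 - sum(log(1 + exp(-margin)));   % same quantity as the original loop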

comparison of clustering algorithms performance in rapidminer

I have applied different clustering algorithms such as k-means, k-medoids, fast k-means and expectation maximization clustering to my biomedical dataset using RapidMiner. Now I want to check the performance of these algorithms, i.e. which one gives better clustering results.
For that I have applied some operators such as 'cluster density performance' and 'cluster distance performance', which give me the average within-cluster distance for each cluster and the Davies-Bouldin index. But I am confused: is this the right way to check the clustering performance of each algorithm?
I am also interested in the Silhouette method, which I could apply to each algorithm to check performance, but I don't understand where to get the b(i) and a(i) values from the clustering output.
The most reliable way of evaluating clusterings is by looking at your data: are the clusters of any use to you, and do they make sense to a domain expert?
Never just rely on numbers.
For example, you can evaluate clusterings numerically by taking the within-cluster variance.
However, k-means optimizes exactly this value, so k-means will always come out best; in fact, this measure keeps decreasing as k increases, but the results do not become any more meaningful!
It is somewhat okay to use one coefficient such as Silhouette coefficient to compare results of the same algorithm this way. Since Silhouette coefficient is somewhat orthogonal to the variance minimization, it will make k-means stop at some point, when the result is a reasonable balance of the two objectives.
However, applying such a measure to different algorithms - which may have a different amount of correlation to the measure - is inherently unfair. Most likely, you will be overestimating one algorithm and underestimating the performance of another.
Another popular way is external evaluation, with labeled data. While this should be unbiased against the method -- unless the labels were generated by a similar method -- it has different issues: it will punish a solution that actually discovers new clusters!
All in all, evaluation of unsupervised methods is hard. Really hard. The best you can do is see if the results prove useful in practice!
It's very good advice never to rely on the numbers.
All the numbers can do is help you focus on particular clusterings that are mathematically interesting. The Davies-Bouldin validity measure is nice because it will show a minimum when it thinks the clusters are most compact with respect to themselves and most separated with respect to others. If you plot a graph of the Davies-Bouldin measure as a function of k, the "best" clustering will show up as a minimum. Of course the data may not form into spherical clusters so this measure may not be appropriate but that's another story.
The Silhouette measure tends to a maximum when it identifies a clustering that is relatively better than another.
The cluster density and cluster distance measures often exhibit an "elbow" as they tend to zero. This elbow often coincides with an interesting clustering (I have to be honest and say I'm never really convinced by this elbow criterion approach).
If you were to plot different validity measures as a function of k and all of the measures gave an indication that a particular k was better then others that would be good reason to consider that value in more detail to see if you, as the domain expert for the data, agree.
If you're interested I have some examples here.
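To illustrate plotting a validity measure against k, here is a minimal MATLAB sketch using the built-in evalclusters with the Davies-Bouldin criterion (this assumes the Statistics and Machine Learning Toolbox and uses synthetic data; it is not tied to RapidMiner):
% Davies-Bouldin index as a function of k; the minimum suggests a candidate "best" k
rng(1);
X = [randn(100,2); randn(100,2)+4; randn(100,2)+8];    % synthetic data with 3 clusters
E = evalclusters(X, 'kmeans', 'DaviesBouldin', 'KList', 1:8);
plot(E.InspectedK, E.CriterionValues, '-o');
xlabel('k'); ylabel('Davies-Bouldin index');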
There are many ways to evaluate the performance of clustering models in machine learning. They are broadly divided into 3 categories-
1. Supervised techniques
2. Unsupervised techniques
3. Hybrid techniques
Supervised techniques are evaluated by comparing the value of evaluation metrics with some pre-defined ground rules and values.
For example- Jaccard similarity index, Rand Index, Purity etc.
Unsupervised techniques comprise evaluation metrics which cannot be compared with pre-defined values, but they can be compared among different clustering models, and thereby we can choose the best model.
For example - Silhouette measure, SSE
Hybrid techniques are nothing but combination of supervised and unsupervised methods.
Now, let’s have a look at the intuition behind these methods-
Silhouette measure
Silhouette measure is derived from 2 primary measures- Cohesion and Separation.
Cohesion is nothing but the compactness or tightness of the data points within a cluster.
There are basically 2 ways to compute the cohesion-
· Graph based cohesion
· Prototype based cohesion
Let’s consider that A is a cluster with 4 data points as shown in the figure-
Graph based cohesion computes the cohesion value by adding the distances (Euclidean or Manhattan) from each point to every other point.
Here,
Graph Cohesion(A) = Constant * ( Dis(1,2) + Dis(1,3) + Dis(1,4) + Dis(2,3) + Dis(2,4) + Dis(3,4) )
Where,
Constant = 1/ (2 * Average of all distances)
Prototype based cohesion is calculated by adding the distance of all data points from a commonly accepted point like centroid.
Here, let’s consider C as centroid in cluster A
Then,
Prototype Cohesion(A) = Constant * (Dis(1,C) +Dis(2,C) + Dis(3,C) + Dis(4,C))
Where,
Constant = 1/ (2 * Average of all distances)
Separation is the distance or magnitude of difference between the data points of 2 different clusters.
Here also, we have primarily 2 kinds of methods of computation of separation value.
1. Graph based separation
2. Prototype based separation
Graph based separation calculates the value by adding the distances between every point in Cluster 1 and every point in Cluster 2.
For example, If A and B are 2 clusters with 4 data points each then,
Graph based separation = Constant * ( Dis(A1,B1) + Dis(A1,B2) + Dis(A1,B3) + Dis(A1,B4) + Dis(A2,B1) + Dis(A2,B2) + Dis(A2,B3) + Dis(A2,B4) + Dis(A3,B1) + Dis(A3,B2) + Dis(A3,B3) + Dis(A3,B4) + Dis(A4,B1) + Dis(A4,B2) + Dis(A4,B3) + Dis(A4,B4) )
Where,
Constant = 1/ number of clusters
Prototype based separation is calculated by finding the distance between the commonly accepted points (e.g. the centroids) of the 2 clusters.
Here, we can simply calculate the distance between the centroids of the 2 clusters A and B, i.e. Dis(C(A),C(B)), multiplied by a constant, where constant = 1/ number of clusters.
Silhouette measure = (b-a)/max(b,a)
Where,
a = cohesion value
b = separation value
If the Silhouette measure is close to -1, the clustering is very poor.
If the Silhouette measure is around 0, points lie on or near the boundary between clusters, so the clustering can still be improved.
If the Silhouette measure is close to 1, the clustering is very good.
When we have multiple clustering algorithms, it is always recommended to choose the one with high Silhouette measure.
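As a concrete illustration (a sketch on synthetic data, using MATLAB's built-in silhouette and kmeans rather than RapidMiner), two clusterings of the same data can be compared by their mean silhouette value:
% Compare two k-means clusterings of the same data by mean silhouette value
rng(1);
X = [randn(100,2); randn(100,2)+4];          % synthetic data with 2 well-separated groups
idx2 = kmeans(X,2);
idx3 = kmeans(X,3);
s2 = mean(silhouette(X,idx2));               % silhouette returns one value per point
s3 = mean(silhouette(X,idx3));
fprintf('mean silhouette  k=2: %.3f   k=3: %.3f\n', s2, s3);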
SSE ( Sum of squared errors)
SSE is calculated by adding Cohesion and Separation values.
SSE = Value ( Cohesion) + Value ( Separation).
When we have multiple clustering algorithms, it is always recommended to choose the one with low SSE.
Jaccard similarity index
The Jaccard similarity index is measured using the labels of the data points. If labels are not provided, we cannot measure this index.
The data points are divided into 4 categories-
True Negative (TN) = pairs of data points with different classes and different clusters
True Positive (TP) = pairs of data points with the same class and the same cluster
False Negative (FN) = pairs of data points with the same class but different clusters
False Positive (FP) = pairs of data points with different classes but the same cluster
Here,
Note - nC2 means number of combinations with 2 elements possible from a set containing n elements
nC2 = n*(n-1)/2
TP = 5C2 + 4C2 + 2C2 + 3C2 = 20
FN = 5C1 * 1C1 + 5C1 * 2C1 + 1C1 * 4C1 + 1C1 * 2C1 + 1C1 * 3C1 = 24
FP = 5C1 * 1C1 + 4C1 * 1C1 +4C1 * 1C1 + 1C1 * 1C1 +3C1 * 2C1 = 20
TN= 5C1 * 4C1 + 5C1 * 1C1 + 5C1 * 3C1 + 1C1 * 1C1 + 1C1 * 1C1 + 1C1 * 2C1 + 1C1 * 3C1 + 4C1 * 3C1 + 4C1 * 2C1 + 1C1 * 3C1 + 1C1 * 2C1 = 72
Jaccard similarity index = TP / (TP + FP + FN)
Here, Jaccard similarity index = 20 / (20 + 20 + 24) ≈ 0.31
Rand Index
Rand Index is similar to Jaccard similarity index. Its formula is given by-
Rand Index = (TP + TN) / (TP + TN + FP +FN)
Here, Rand Index = (20 + 72) / (20+ 72 + 20 + 24) = 0.67
When Rand index is above 0.7, it can be considered as a good clustering.
Similarly, when Jaccard similarity index is above 0.5, it can be considered as good clustering.
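As a sketch of the pair counting behind these indices (the label and cluster vectors below are hypothetical example data, not the counts used above):
% Pair-counting Rand and Jaccard indices from class labels and cluster assignments
labels   = [1 1 1 2 2 2 3 3]';               % hypothetical ground-truth classes
clusters = [1 1 2 2 2 3 3 3]';               % hypothetical cluster assignments
n = numel(labels);
[I,J] = find(triu(true(n),1));               % all unordered pairs of points
sameClass   = labels(I)   == labels(J);
sameCluster = clusters(I) == clusters(J);
TP = sum( sameClass &  sameCluster);
FN = sum( sameClass & ~sameCluster);
FP = sum(~sameClass &  sameCluster);
TN = sum(~sameClass & ~sameCluster);
randIndex    = (TP + TN) / (TP + TN + FP + FN);
jaccardIndex =  TP / (TP + FP + FN);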
Purity
This metrics also requires labels in the data. The formula is given by-
Purity = (Number of data points belonging to the label which is maximum in Cluster 1 + Number of data points belonging to the label which is maximum in Cluster 2 +.... + Number of data points belonging to the label which is maximum in Cluster n ) / Total number of data points.
For example, let's consider 3 clusters – A, B and C – with labelled data points.
Purity = (a + b + c) / n
Where,
a = Number of black circles in cluster A (Since black is the maximum in count)
b = Number of red circles in cluster B (Since red is the maximum in count)
c = Number of green circles in cluster C (Since green is the maximum in count)
n = Total number of data points
Here, Purity = (5 + 6 + 3) / (8 + 9 + 5) ≈ 0.64
If purity is greater than 0.7 then it can be considered as a good clustering.
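A minimal sketch of the purity computation (again on hypothetical label and cluster vectors):
% Purity: fraction of points assigned to the majority class of their cluster
labels   = [1 1 1 2 2 2 3 3]';               % hypothetical ground-truth classes
clusters = [1 1 2 2 2 3 3 3]';               % hypothetical cluster assignments
total = 0;
for c = unique(clusters)'
    lc = labels(clusters == c);
    total = total + sum(lc == mode(lc));     % count of the majority label in this cluster
end
purity = total / numel(labels);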
Original source - https://qr.ae/pNsxIX
