Using Fourier Analysis to fit function to data - algorithm

I have 24 values for Y and corresponding 24 values for the Y values are measured experimentally,
while t has values : t=[1,2,3........24]
I want to find the relationship between Y and t as an equation using Fourier analysis,
what I have tried and done is:
I wrote the following MATLAB code:
Y=[10.6534
9.6646
8.7137
8.2863
8.2863
8.7137
9.0000
9.5726
11.0000
12.7137
13.4274
13.2863
13.0000
12.7137
12.5726
13.5726
15.7137
17.4274
18.0000
18.0000
17.4274
15.7137
14.0297
12.4345];
ts=1; % step
t=1:ts:24; % the period is 24
f=[-length(t)/2:length(t)/2-1]/(length(t)*ts); % computing frequency interval
M=abs(fftshift(fft(Y)));
figure;plot(f,M,'LineWidth',1.5);grid % plot of harmonic components
figure;
plot(t,Y,'LineWidth',1.5);grid % plot of original data Y
figure;bar(f,M);grid % plot of harmonic components as bar shape
the results of the bar figure is:
Now, I want to find the equation for these harmonic components which represent the data. After that I want to draw the original data Y with the data found from the fitting function and the two curves should be close to each other.
Should I use cos or sin or -sin or -cos?
In another way, what is the rule to represent these harmonics as a function: Y = f (t) ?

An example done with your data and Mathematica using Discrete sine transform. Hope you can extrapolate to Matlab:
n = 24;
xg = N[Range[n]]/n
fg = l (*your list *)
fp = ListPlot[Transpose[{xg, fg}], PlotRange -> All] (*points plot*)
coef = FourierDST[fg, 1]/Sqrt[n/2]; (*Fourier transform*)
Show[fp, Plot[Sum[coef[[r]]*Sin[Pi r x], {r, n - 1}], {x, -1, 1},
PlotRange -> All]]
The coefficients are:
{16.6411, -4.00062, 5.31557, -1.38863, 2.89762, 0.898562,
1.54402, -0.116046, 1.54847, 0.136079, 1.16729, 0.156489,
0.787476, -0.0879736, 0.747845, 0.00903859, 0.515012, 0.021791,
0.35001, 0.0159676, 0.215619, 0.0122281, 0.0943376, -0.00150218}
More detailed view:
Edit
However, as an even function seems to be better, I made also a discrete fourier cosine transform of type 3, which works much better:
In this case the coefficients are:
{14.7384, -8.93197, 4.56404, -2.85262, 2.42847, -0.249488,
0.565181,-0.848594, 0.958699, -0.468337, 0.660136, -0.317903,
0.390689,-0.457621, 0.427875, -0.260669, 0.278931, -0.166846,
0.18547, -0.102438, 0.111731, -0.0425396, 0.0484102, -0.00559378}
And the plotting of coeffs and function are obtained by:
coef = FourierDCT[fg, 3]/Sqrt[n];(*Fourier transform*)
f[x_]:= Sum[coef[[r]]*Cos[Pi (r - 1/2) x], {r, n - 1}]
You'll have to experiment a little ...

Depends on what MATLAB gave you back. It's either sine and cosine or a complex exponential.
Most FFT algorithms that I know of usually demand that the number of data points be an integer power of two. The closest one for your data set is 32, so you should pad it out with zeros.

Thanks for your help.
I found the solution I was aiming to get but for some reason everything is shifted by 1
Here is the code:
ts = 1; % time step
t = [1:ts:24];
fs = 1/ts; % frequency step
f=[-length(t)/2:length(t)/2-1]/(length(t)*ts); % frequency formula
%data
P=[10.7083
9.7003
8.9780
8.4531
8.1653
8.2633
8.8795
9.9850
11.3289
12.5172
13.2012
13.2720
12.9435
12.6647
12.8940
13.8516
15.3819
17.0033
18.1227
18.3039
17.4531
15.8322
13.9056
12.1154];
plot(t,P,'LineWidth',1.5);grid
xlabel('time (hours)');ylabel('Power (MW)')
title('Power Profile for 2nd Feb, 1998')
% fourier transform analysis
P1 = fft(P)/length(t);
P2=fftshift(P1);
amp=abs(P2); % amplitude
phi = angle(P2); % phase angle
figure
subplot(211),stem(f,amp,'LineWidth',1.5),grid
xlabel('frequency (Hz)');ylabel('amplitude (MW)')
subplot(212),stem(f,phi,'LineWidth',1.5),grid
xlabel('frequency (Hz)');ylabel('phase angle (rad)')
% NOW, I WILL CONSTRUCT THE MODEL FROM THE FIGURE
% THE STRUCTURE IS:
% Pmodel=Ai*COS(i*w*t+phii)
% where, w=2*pi/24 and i is the harmonic order
% Here, up to the third harmonic is enough
% and using Parseval's Theorem, the model is:
% PP=12.6635+2*(1.9806*cos(w*tt+1.807)+0.86388*cos(2*w*tt+2.0769)+0.39683*cos(3*w*tt- 1.8132));
w=2*pi/24;
Pmodel=12.6635+2*(1.9806*cos(w*t+1.807)+0.86388*cos(2*w*t+2.0769)+0.39686*cos(3*w*t-1.8132));
figure
plot(t,P,'LineWidth',1.5);grid on
hold on;
plot(t,Pmodel,'r','LineWidth',1.5)
legend('original','model');xlabel('time (hours )');ylabel('Power (MW)')
% But here is a problem, the modeled signal is shifted
% by 1 comparing to the original one
% I redraw the two figures together by plotting Pmodeled vs t+1
% Actually, I don't know why it is shifted, but they are
% exactly identical with shifting by 1
figure
plot(t,P,'LineWidth',1.5);grid on
hold on;
plot(t+1,Pmodel,'r','LineWidth',1.5)
legend('original','model');xlabel('time (hours )');ylabel('Power (MW)')
Why has this shifting problem happened, and how can I solve it?

The problem is with
line 2
"t = [1:ts:24];"
it should be "t= 0:ts:23;"

Related

How to speed up the solving of multiple optimization problems?

Currently, I'm writing a simulation that asses the performance of a positioning algorithm by measuring the mean error of the position estimator for different points around the room. Unfortunately the running times are pretty slow and so I am looking for ways to speed up my code.
The working principle of the position estimator is based on the MUSIC algorithm. The estimator gets an autocorrelation matrix (sized 12x12, with complex values in general) as an input and follows the next steps:
Find the 12 eigenvalues and eigenvectors of the autocorrelation matrix R.
Construct a new 12x11 matrix EN whose columns are the 11 eigenvectors corresponding to the 11 smallest eigenvalues.
Using the matrix EN, construct a function P = 1/(a' EN EN' a).
Where a is a 12x1 complex vector and a' is the Hermitian conjugate of a. The components of a are functions of 3 variables (named x,y and z) and so the scalar P is also a function P(x,y,z)
Finally, find the values (x0,y0,z0) which maximizes the value of P and return it as the position estimate.
In my code, I choose some constant z and create a grid on points in the plane (at heigh z, parallel to the xy plane). For each point I make n4Avg repetitions and calculate the error of the estimated point. At the end of the parfor loop (and some reshaping), I have a matrix of errors with dims (nx) x (ny) x (n4Avg) and the mean error is calculated by taking the mean of the error matrix (acting on the 3rd dimension).
nx=30 is the number of point along the x axis.
ny=15 is the number of points along the y axis.
n4Avg=100 is the number of repetitions used for calculating the mean error at each point.
nGen=100 is the number of generations in the GA algorithm (100 was tested to be good enough).
x = linspace(-20,20,nx);
y = linspace(0,20,ny);
z = 5;
[X,Y] = meshgrid(x,y);
parfor ri = 1:nx*ny
rT = [X(ri);Y(ri);z];
[ENs] = getEnNs(rT,stdv,n4R,n4Avg); % create n4Avg EN matrices
for rep = 1:n4Avg
pos_est = estPos_helper(squeeze(ENs(:,:,rep)),nGen);
posEstErr(ri,rep) = vecnorm(pos_est(:)-rT(:));
end
end
The matrices EN are generated by the following code
function [ENs] = getEnNs(rT,stdv,n4R,nEN)
% generate nEN simulated EN matrices, each using n4R simulated phases
f_c = 2402e6; % center frequency [Hz]
c0 = 299702547; % speed of light [m/s]
load antennaeArr1.mat antennaeArr1;
% generate initial phases.
phi0 = 2*pi*rand(n4R*nEN,1);
k0 = 2*pi.*(f_c)./c0;
I = cos(-k0.*vecnorm(antennaeArr1 - rT(:),2,1)-phi0);
Q = -sin(-k0.*vecnorm(antennaeArr1 - rT(:),2,1)-phi0);
phases = I+1i*Q;
phases = phases + stdv/sqrt(2)*(randn(size(phases)) + 1i*randn(size(phases)));
phases = reshape(phases',[12,n4R,nEN]);
Rxx = pagemtimes(phases,pagectranspose(phases));
ENs = zeros(12,11,nEN);
for i=1:nEN
[ENs(:,:,i),~] = eigs(squeeze(Rxx(:,:,i)),11,'smallestabs');
end
end
The position estimator uses a solver utilizing a 'genetic algorithm' (chosen because it preformed the best of all the other solvers).
function pos_est = estPos_helper(EN,nGen)
load antennaeArr1.mat antennaeArr1; % 3x12 constant matrix
antennae_array = antennaeArr1;
x0 = [0;10;5];
lb = [-20;0;0];
ub = [20;20;10];
function y = myfun(x)
k0 = 2*pi*2.402e9/299702547;
a = exp( -1i*k0*sqrt( (x(1)-antennae_array(1,:)').^2 + (x(2) - antennae_array(2,:)').^2 + (x(3)-antennae_array(3,:)').^2 ) );
y = 1/real((a')*(EN)*(EN')*a);
end
% Create optimization variables
x3 = optimvar("x",3,1,"LowerBound",lb,"UpperBound",ub);
% Set initial starting point for the solver
initialPoint2.x = x0;
% Create problem
problem = optimproblem("ObjectiveSense","Maximize");
% Define problem objective
problem.Objective = fcn2optimexpr(#myfun,x3);
% Set nondefault solver options
options2 = optimoptions("ga","Display","off","HybridFcn","fmincon",...
"MaxGenerations",nGen);
% Solve problem
solution = solve(problem,initialPoint2,"Solver","ga","Options",options2);
% Clear variables
clearvars x3 initialPoint2 options2
pos_est = solution.x;
end
The current runtime of the code, when setting the parameters as shown above, is around 700-800 seconds. This is a problem as I would like to increase the number of points in the grid and the number of repetitions to get a more accurate result.
The main ways I've tried to tackle this is by using parallel computing (in the form of the parloop) and by reducing the nested loops I had (one for x and one for y) into a single vectorized loop going over all the points in the grid.
It indeed helped, but not quite enough.
I apologize for the messy code.
Michael.

Frequency computation and fast fourier transform in Matlab

I have a question related to Fast Fourier transform. I want to calculate the phase and make FFT to draw power spectral density. However when I calculate the frequency f, there are some errors. This is my program code:
n = 1:32768;
T = 0.2*10^-9; % Sampling period
Fs = 1/T; % Sampling frequency
Fn = Fs/2; % Nyquist frequency
omega = 2*pi*200*10^6; % Carrier frequency
L = 32768; % % Length of signal
t = (0:L-1)*T; % Time vector
x_signal(n) = cos(omega*T*n + 0.1*randn(size(n))); % Additive phase noise (random)
y_signal(n) = sin(omega*T*n + 0.1*randn(size(n))); % Additive phase noise (random)
theta(n) = atan(y_signal(n)/x_signal(n));
f = (theta(n)-theta(n-1))/(2*pi)
Y = fft(f,t);
PSD = Y.*conj(Y); % Power Spectral Density
%Fv = linspace(0, 1, fix(L/2)+1)*Fn; % Frequency Vector
As posted, you would get the error 
error: subscript indices must be either positive integers less than 2^31 or logicals
which refers to the operation theta(n-1) when n=1 which results in an index of 0 (which is out of bounds since Matlab uses 1-based indexing). To avoid that could use a subset of indices in n:
f = (theta(n(2:end))-theta(n(1:end-1)))/(2*pi);
That said, if you are doing this to try to obtain an instantaneous measure of the frequency, then you will have a few more issues to deal with. The most trivial one is that you should also divide by T. Not as obvious is the fact that as given, theta is a scalar due to the use of the / operator (see Matlab's mrdivide) rather than the ./ operator which performs element-wise division. So a better expression would be:
theta(n) = atan(y_signal(n)./x_signal(n));
Now, the next problem you might notice is that you are actually losing some phase information since the result of atan is [-pi/2,pi/2] instead of the full [-pi,pi] range. To avoid this you should instead be using atan2:
theta(n) = atan2(y_signal(n), x_signal(n));
Even with this, you are likely to notice that the estimated frequency regularly has spikes whenever the phase jumps between near -pi and near pi. This can be avoided by computing the phase difference modulo 2*pi:
f = mod(theta(n(2:end))-theta(n(1:end-1)),2*pi)/(2*pi*T);
A final thing to note: when calling the fft, you should not be passing in a time variable (the input is implicitly assumed to be sampled at regular time intervals). You may however specify the desired length of the FFT. So, you would thus compute Y as follow:
Y = fft(f, L);
And you could then plot the resulting PSD using:
Fv = linspace(0, 1, fix(L/2)+1)*Fn; % Frequency Vector
plot(Fv, abs(PSD(1:L/2+1)));

finding the best/ scale/shift between two vectors

I have two vectors that represents a function f(x), and another vector f(ax+b) i.e. a scaled and shifted version of f(x). I would like to find the best scale and shift factors.
*best - by means of least squares error , maximum likelihood, etc.
any ideas?
for example:
f1 = [0;0.450541598502498;0.0838213779969326;0.228976968716819;0.91333736150167;0.152378018969223;0.825816977489547;0.538342435260057;0.996134716626885;0.0781755287531837;0.442678269775446;0];
f2 = [-0.029171964726699;-0.0278570165494982;0.0331454732535324;0.187656956432487;0.358856370923984;0.449974662483267;0.391341738643094;0.244800719791534;0.111797007617227;0.0721767235173722;0.0854437239807415;0.143888234591602;0.251750993723227;0.478953530572365;0.748209818420035;0.908044924557262;0.811960826711455;0.512568916956487;0.22669198638799;0.168136111568694;0.365578085161896;0.644996661336714;0.823562159983554;0.792812945867018;0.656803251999341;0.545799498053254;0.587013303815021;0.777464637372241;0.962722388208354;0.980537136457874;0.734416947254272;0.375435649393553;0.106489547770962;0.0892376361668696;0.242467741982851;0.40610516900965;0.427497319032133;0.301874099075184;0.128396341665384;0.00246347624097456;-0.0322120242872125]
*note that f(x) may be irreversible...
Thanks,
Ohad
For each f(x), take the absolute value of f(x) and normalize it such that it can be considered a probability mass function over its support. Calculate the expected value E[x] and variance of Var[x]. Then, we have that
E[a x + b] = a E[x] + b
Var[a x + b] = a^2 Var[x]
Use the above equations and the known values of E[x] and Var[x] to calculate a and b. Taking your values of f1 and f2 from your example, the following Octave script performs this procedure:
% Octave script
% f1, f2 are defined as given in your example
f1 = [zeros(length(f2) - length(f1), 1); f1];
save_f1 = f1; save_f2 = f2;
f1 = abs( f1 ); f2 = abs( f2 );
f1 = f1 ./ sum( f1 ); f2 = f2 ./ sum( f2 );
mean = #(x)sum(((1:length(x))' .* x));
var = #(x)sum((((1:length(x))'-mean(x)).^2) .* x);
m1 = mean(f1); m2 = mean(f2);
v1 = var(f1); v2 = var(f2)
a = sqrt( v2 / v1 ); b = m2 - a * m1;
plot( a .* (1:length( save_f1 )) + b, save_f1, ...
1:length( save_f2 ), save_f2 );
axis([0 length( save_f1 )];
And the output is
Here's a simple, effective, but perhaps somewhat naive approach.
First make sure you make a generic interpolator through both functions. That way you can evaluate both functions in between the given data points. I used a cubic-splines interpolator, since that seems general enough for the type of smooth functions you provided (and does not require additional toolboxes).
Then you evaluate the source function ("original") at a large number of points. Use this number also as a parameter in an inline function, that takes as input X, where
X = [a b]
(as in ax+b). For any input X, this inline function will compute
the function values of the target function at the same x-locations, but then scaled and offset by a and b, respectively.
The sum of the squared-differences between the resulting function values, and the ones of the source function you computed earlier.
Use this inline function in fminsearch with some initial estimate (one that you have obtained visually or by via automatic means). For the example you provided, I used a few random ones, which all converged to near-optimal fits.
All of the above in code:
function s = findScaleOffset
%% initialize
f2 = [0;0.450541598502498;0.0838213779969326;0.228976968716819;0.91333736150167;0.152378018969223;0.825816977489547;0.538342435260057;0.996134716626885;0.0781755287531837;0.442678269775446;0];
f1 = [-0.029171964726699;-0.0278570165494982;0.0331454732535324;0.187656956432487;0.358856370923984;0.449974662483267;0.391341738643094;0.244800719791534;0.111797007617227;0.0721767235173722;0.0854437239807415;0.143888234591602;0.251750993723227;0.478953530572365;0.748209818420035;0.908044924557262;0.811960826711455;0.512568916956487;0.22669198638799;0.168136111568694;0.365578085161896;0.644996661336714;0.823562159983554;0.792812945867018;0.656803251999341;0.545799498053254;0.587013303815021;0.777464637372241;0.962722388208354;0.980537136457874;0.734416947254272;0.375435649393553;0.106489547770962;0.0892376361668696;0.242467741982851;0.40610516900965;0.427497319032133;0.301874099075184;0.128396341665384;0.00246347624097456;-0.0322120242872125];
figure(1), clf, hold on
h(1) = subplot(2,1,1); hold on
plot(f1);
legend('Original')
h(2) = subplot(2,1,2); hold on
plot(f2);
linkaxes(h)
axis([0 max(length(f1),length(f2)), min(min(f1),min(f2)),max(max(f1),max(f2))])
%% make cubic interpolators and test points
pp1 = spline(1:numel(f1), f1);
pp2 = spline(1:numel(f2), f2);
maxX = max(numel(f1), numel(f2));
N = 100 * maxX;
x2 = linspace(1, maxX, N);
y1 = ppval(pp1, x2);
%% search for parameters
s = fminsearch(#(X) sum( (y1 - ppval(pp2,X(1)*x2+X(2))).^2 ), [0 0])
%% plot results
y2 = ppval( pp2, s(1)*x2+s(2));
figure(1), hold on
subplot(2,1,2), hold on
plot(x2,y2, 'r')
legend('before', 'after')
end
Results:
s =
2.886234493867320e-001 3.734482822175923e-001
Note that this computes the opposite transformation from the one you generated the data with. Reversing the numbers:
>> 1/s(1)
ans =
3.464721948700991e+000 % seems pretty decent
>> -s(2)
ans =
-3.734482822175923e-001 % hmmm...rather different from 7/11!
(I'm not sure about the 7/11 value you provided; using the exact values you gave to make a plot results in a less accurate approximation to the source function...Are you sure about the 7/11?)
Accuracy can be improved by either
using a different optimizer (fmincon, fminunc, etc.)
demanding a higher accuracy from fminsearch through optimset
having more sample points in both f1 and f2 to improve the quality of the interpolations
Using a better initial estimate
Anyway, this approach is pretty general and gives nice results. It also requires no toolboxes.
It has one major drawback though -- the solution found may not be the global optimizer, e.g., the quality of the outcomes of this method could be quite sensitive to the initial estimate you provide. So, always make a (difference) plot to make sure the final solution is accurate, or if you have a large number of such things to do, compute some sort of quality factor upon which you decide to re-start the optimization with a different initial estimate.
It is of course very possible to use the results of the Fourier+Mellin transforms (as suggested by chaohuang below) as an initial estimate to this method. That might be overkill for the simple example you provide, but I can easily imagine situations where this could indeed be very useful.
For the scale factor a, you can estimate it by computing the ratio of the amplitude spectra of the two signals since the Fourier transform is invariant to shift.
Similarly, you can estimate the shift factor b by using the Mellin transform, which is scale invariant.
Here's a super simple approach to estimate the scale a that works on your example data:
a = length(f2) / length(f1)
This gives 3.4167 which is close to your stated value of 3.4. If that estimate is good enough, you can use correlation to estimate the shift.
I realize that this is not exactly what you asked, but it may be an acceptable alternative depending on the data.
Both Rody Oldenhuis and jstarr's answers are correct. I'm adding my own answer just to sum things up, and connect between them.
I've messed up Rody's code a little bit and ended up with the following:
function findScaleShift
load f1f2
x0 = [length(f1)/length(f2) 0]; %initial guess, can do better
n=length(f1);
costFunc = #(z) sum((eval_f1(z,f2,n)-f1).^2);
opt.TolFun = eps;
xopt=fminsearch(costFunc,x0,opt);
f1r=eval_f1(xopt,f2,n);
subplot(211);
plot(1:n,f1,1:n,f1r,'--','linewidth',5)
title(xopt);
subplot(212);
plot(1:n,(f1-f1r).^2);
title('squared error')
end
function y = eval_f1(x,f2,n)
t = maketform('affine',[x(1) 0 x(2); 0 1 0 ; 0 0 1]');
y=imtransform(f2',t,'cubic','xdata',[1 n ],'ydata',[1 1])';
end
This gives zero results:
This method is accurate but exhaustive and may take some time. Another disadvantage is that it finds only a local minima, and may give false results if initial guess (x0) is far.
On the other hand, jstarr method gave the following results:
xopt = [ 3.49655562549115 -0.676062367063033]
which is 10% deviation from the correct answer. Pretty fast solution, but not as accurate as I requested, but still should be noted.
I think in order to get the best results jstarr method should be used as an initial guess for the method purposed by Rody, giving an accurate solution.
Ohad

Get equidistant intervals on approximated bark scale

Wikipedia says we can approximate Bark scale with the equation:
b(f) = 13*atan(0.00076*f)+3.5*atan(power(f/7500,2))
How can I divide frequency spectrum into n intervals of the same length on Bark scale (interval division points will be equidistant on Bark scale)?
The best way would be to analytically inverse function (express x by function of y). I was trying doing it on paper but failed. WolframAlpha search bar couldn't do it also. I tried Octave finverse function, but I got error.
Octave says (for simpler example):
octave:2> x = sym('x');
octave:3> finverse(2*x)
error: `finverse' undefined near line 3 column 1
This is finverse description from Matlab: http://www.mathworks.com/help/symbolic/finverse.html
There could be also numerical way to do it. I can imagine that you just start from dividing the y axis equally and search for ideal division by binary search. But maybe there are some existing tools that do it?
You need to numerically solve this equation (there is no analytical inverse function). Set values for b equally spaced and solve the equation to find the various f. Bissection is somewhat slow but a very good alternative is Brent's method. See http://en.wikipedia.org/wiki/Brent%27s_method
This function can't be inverted analytically. You'll have to use some numerical procedure. Binary search would be fine, but there are more efficient ways to do these sorts of things: look into root-finding algorithms. You can apply your algorithm of choice to the equation b(f) = f_n for each of the frequency interval endpoints f_n.
Just so you know, in (say) octave to implement rpsmi's or David Zaslavsky's answer, you'd do something like this:
global x0 = 0.
function res = b(f)
global x0
res = 13*atan(0.00076*f)+3.5*atan(power(f/7500,2)) - x0
end
function [intervals, barks] = barkintervals(left, right, n)
global x0
intervals = linspace(left, right, n);
barks = intervals;
for i = 1:n
x0 = intervals(i);
# 125*x0 is just a crude guess starting point given the values
[barks(i), fval, info] = fsolve('b', 125*x0);
endfor
end
and run it like so:
octave:1> barks
octave:2> [i,bx] = barkintervals(0, 10, 10)
[... lots of output from fsolve deleted...]
i =
Columns 1 through 8:
0.00000 1.11111 2.22222 3.33333 4.44444 5.55556 6.66667 7.77778
Columns 9 and 10:
8.88889 10.00000
bx =
Columns 1 through 6:
0.0000e+00 1.1266e+02 2.2681e+02 3.4418e+02 4.6668e+02 5.9653e+02
Columns 7 through 10:
7.3639e+02 8.8960e+02 1.0605e+03 1.2549e+03
I finally decided not to use the Bark values approximation but ideal values for critical bands centres (defined for n=1..24). I plotted them with gnuplot and on the same graph I plotted arbitrarily chosen values for points of greater density (for the required n>24). I adjusted the points values in Hz till the the both curves were approximately the same.
Of course rpsmi and David Zaslavsky answers are more general and scalable.

How do I convert between a measure of similarity and a measure of difference (distance)?

Is there a general way to convert between a measure of similarity and a measure of distance?
Consider a similarity measure like the number of 2-grams that two strings have in common.
2-grams('beta', 'delta') = 1
2-grams('apple', 'dappled') = 4
What if I need to feed this to an optimization algorithm that expects a measure of difference, like Levenshtein distance?
This is just an example...I'm looking for a general solution, if one exists. Like how to go from Levenshtein distance to a measure of similarity?
I appreciate any guidance you may offer.
Let d denotes distance, s denotes similarity. To convert distance measure to similarity measure, we need to first normalize d to [0 1], by using d_norm = d/max(d). Then the similarity measure is given by:
s = 1 - d_norm.
where s is in the range [0 1], with 1 denotes highest similarity (the items in comparison are identical), and 0 denotes lowest similarity (largest distance).
If your similarity measure (s) is between 0 and 1, you can use one of these:
1-s
sqrt(1-s)
-log(s)
(1/s)-1
Doing 1/similarity is not going to keep the properties of the distribution.
the best way is
distance (a->b) = highest similarity - similarity (a->b).
with highest similarity being the similarity with the biggest value. You hence flip your distribution.
the highest similarity becomes 0 etc
Yes, there is a most general way to change between similarity and distance: a strictly monotone decreasing function f(x).
That is, with f(x) you can make similarity = f(distance) or distance = f(similarity). It works in both directions. Such function works, because the relation between similarity and distance is that one decreases when the other increases.
Examples:
These are some well-known strictly monotone decreasing candidates that work for non-negative similarities or distances:
f(x) = 1 / (a + x)
f(x) = exp(- x^a)
f(x) = arccot(ax)
You can choose parameter a>0 (e.g., a=1)
Edit 2021-08
A very practical approach is to use the function sim2diss belonging to the statistical software R. This functions provides a up to 13 methods to compute dissimilarity from similarities. Sadly the methods are not at all explained: you have to look into the code :-\
similarity = 1/difference
and watch out for difference = 0
According to scikit learn:
Kernels are measures of similarity, i.e. s(a, b) > s(a, c) if objects a and b are considered “more similar” than objects a and c. A kernel must also be positive semi-definite.
There are a number of ways to convert between a distance metric and a similarity measure, such as a kernel. Let D be the distance, and S be the kernel:
S = np.exp(-D * gamma), where one heuristic for choosing gamma is 1 /
num_features
S = 1. / (D / np.max(D))
In the case of Levenshtein distance, you could increase the sim score by 1 for every time the sequences match; that is, 1 for every time you didn't need a deletion, insertion or substitution. That way the metric would be a linear measure of how many characters the two strings have in common.
In one of my projects (based on Collaborative Filtering) I had to convert between correlation (cosine between vectors) which was from -1 to 1 (closer 1 is more similar, closer to -1 is more diverse) to normalized distance (close to 0 the distance is smaller and if it's close to 1 the distance is bigger)
In this case: distance ~ diversity
My formula was: dist = 1 - (cor + 1)/2
If you have similarity to diversity and the domain is [0,1] in both cases the simlest way is:
dist = 1 - sim
sim = 1 - dist
Cosine similarity is widely used for n-gram count or TFIDF vectors.
from math import pi, acos
def similarity(x, y):
return sum(x[k] * y[k] for k in x if k in y) / sum(v**2 for v in x.values())**.5 / sum(v**2 for v in y.values())**.5
Cosine similarity can be used to compute a formal distance metric according to wikipedia. It obeys all the properties of a distance that you would expect (symmetry, nonnegativity, etc):
def distance_metric(x, y):
return 1 - 2 * acos(similarity(x, y)) / pi
Both of these metrics range between 0 and 1.
If you have a tokenizer that produces N-grams from a string you could use these metrics like this:
>>> import Tokenizer
>>> tokenizer = Tokenizer(ngrams=2, lower=True, nonwords_set=set(['hello', 'and']))
>>> from Collections import Counter
>>> list(tokenizer('Hello World again and again?'))
['world', 'again', 'again', 'world again', 'again again']
>>> Counter(tokenizer('Hello World again and again?'))
Counter({'again': 2, 'world': 1, 'again again': 1, 'world again': 1})
>>> x = _
>>> Counter(tokenizer('Hi world once again.'))
Counter({'again': 1, 'world once': 1, 'hi': 1, 'once again': 1, 'world': 1, 'hi world': 1, 'once': 1})
>>> y = _
>>> sum(x[k]*y[k] for k in x if k in y) / sum(v**2 for v in x.values())**.5 / sum(v**2 for v in y.values())**.5
0.42857142857142855
>>> distance_metric(x, y)
0.28196592805724774
I found the elegant inner product of Counter in this SO answer

Resources