Normalizing the edit distance - algorithm

I have a question that can we normalize the levenshtein edit distance by dividing the e.d value by the length of the two strings?
I am asking this because, if we compare two strings of unequal length, the difference between the lengths of the two will be counted as well.
for eg:
ed('has a', 'has a ball') = 4 and ed('has a', 'has a ball the is round') = 15.
if we increase the length of the string, the edit distance will increase even though they are similar.
Therefore, I can not set a value, what a good edit distance value should be.

Yes, normalizing the edit distance is one way to put the differences between strings on a single scale from "identical" to "nothing in common".
A few things to consider:
Whether or not the normalized distance is a better measure of similarity between strings depends on the application. If the question is "how likely is this word to be a misspelling of that word?", normalization is a way to go. If it's "how much has this document changed since the last version?", the raw edit distance may be a better option.
If you want the result to be in the range [0, 1], you need to divide the distance by the maximum possible distance between two strings of given lengths. That is, length(str1)+length(str2) for the LCS distance and max(length(str1), length(str2)) for the Levenshtein distance.
The normalized distance is not a metric, as it violates the triangle inequality.

I used the following successfully:
len = std::max(s1.length(), s2.length());
// normalize by length, high score wins
fDist = float(len - levenshteinDistance(s1, s2)) / float(len);
Then chose the highest score. 1.0 means an exact match.

I had used a normalized edit distance or similarity (NES) which I think is very useful, defined by Daniel Lopresti and Jiangyin Zhou, in Equation (6) of their work: http://www.cse.lehigh.edu/~lopresti/Publications/1996/sdair96.pdf.
The NES in python is:
import math
def normalized_edit_similarity(m, d):
# d : edit distance between the two strings
# m : length of the shorter string
return ( 1.0 / math.exp( d / (m - d) ) )
print(normalized_edit_similarity(3, 0))
print(normalized_edit_similarity(3, 1))
print(normalized_edit_similarity(4, 1))
print(normalized_edit_similarity(5, 1))
print(normalized_edit_similarity(5, 2))
1.0
0.6065306597126334
0.7165313105737893
0.7788007830714049
0.513417119032592
More examples can be found in Table 2 in the above paper.
The variable m in the above function can be replaced with the length of the longer string, depending on your application.

Related

OCR'ed and real string similarity

The problem:
There is a set of word S = {W1,W2.. Wn} where n < 10. This set just exists, we do not know its content.
These words are drawn on some image and then recognized. The OCR algorytm is poor as well as dpi and as a result there are mistakes. So we have a second set of errorneous words S' = {W1',W2'..Wn'}
Now we have a word W that is a member of original set S. And now I need and algorythm which, given W and S', return index of the word in S'. most similar to W.
Example. S is {"alpha", "bravo", "charlie"}, S' is for example {"alPha","hravc","onarlio"} (these are real possible ocr erros).
So the target function should return F("alpha") => 0, F("bravo") => 1, F("charlie") => 2
I tried Levenshtein distance, but it does not work well, because it returns small numbers on small strings and OCRed string can be longer than original.
Example if W' is {'hornist','cornrnunist'} and the given word is 'communist' the Levenshtein distance is 4 for the both words, but the right one is second.
Any suggestions?
As a zero approach, I'd suggest you to use the modification of Levenshtein distance algorithm with conditional cost of replacing/deleting/adding characters:
Distance(i, j) = min(Distance(i-1, j-1) + replace_cost(a.charAt(i), b.charAt(j)),
Distance(i-1, j ) + insert_cost(b.charAt(j)),
Distance(i , j-1) + delete_cost(a.charAt(i)))
You can implement function replace_cost in such way, that it will returns small values for visually similar characters (and high values for visually different characters), e.g.:
// visually similar characters
replace_cost('o', '0') = 0.1
replace_cost('o', 'O') = 0.1
replace_cost('O', '0') = 0.1
...
// visually different characters
replace_cost('O', 'K') = 0.9
...
And the similar approach can be used for insert_cost and delete_cost (e.g. you may notice, that during the OCR - some characters are more likely to disappear than others).
Also, in case when approach from above is not enough for you, I'd suggest you to look at Noisy channel model - which is widely used for spelling correction (this subject described very well in Natural Language Processing course by Dan Jurafsky, Christopher Manning - "Week 2 - Spelling Correction").
This appears to be quite difficult to do because the misread strings are not necessarily textually similar to the input, which is why Levinshtein distance won't work for you. The words are visually corrupted, not simply mistyped. You could try creating a dataset of common errors (o => 0, l -> 1, e => o) and then do some sort of comparison based on that.
If you have access to the OCR algorithm, you could run that algorithm again on a much broader set of inputs (with known outputs) and train a neural network to recognize common errors. Then you could use that model to predict mistakes in your original dataset (maybe overkill for an array of only ten items).

Compare two arrays of points [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 9 years ago.
Improve this question
I'm trying to find a way to find similarities in two arrays of different points. I drew circles around points that have similar patterns and I would like to do some kind of auto comparison in intervals of let's say 100 points and tell what coefficient of similarity is for that interval. As you can see it might not be perfectly aligned also so point-to-point comparison would not be a good solution also (I suppose). Patterns that are slightly misaligned could also mean that they are matching the pattern (but obviously with a smaller coefficient)
What similarity could mean (1 coefficient is a perfect match, 0 or less - is not a match at all):
Points 640 to 660 - Very similar (coefficient is ~0.8)
Points 670 to 690 - Quite similar (coefficient is ~0.5-~0.6)
Points 720 to 780 - Let's say quite similar (coefficient is ~0.5-~0.6)
Points 790 to 810 - Perfectly similar (coefficient is 1)
Coefficient is just my thoughts of how a final calculated result of comparing function could look like with given data.
I read many posts on SO but it didn't seem to solve my problem. I would appreciate your help a lot. Thank you
P.S. Perfect answer would be the one that provides pseudo code for function which could accept two data arrays as arguments (intervals of data) and return coefficient of similarity.
Click here to see original size of image
I also think High Performance Mark has basically given you the answer (cross-correlation). In my opinion, most of the other answers are only giving you half of what you need (i.e., dot product plus compare against some threshold). However, this won't consider a signal to be similar to a shifted version of itself. You'll want to compute this dot product N + M - 1 times, where N, M are the sizes of the arrays. For each iteration, compute the dot product between array 1 and a shifted version of array 2. The amount you shift array 2 increases by one each iteration. You can think of array 2 as a window you are passing over array 1. You'll want to start the loop with the last element of array 2 only overlapping the first element in array 1.
This loop will generate numbers for different amounts of shift, and what you do with that number is up to you. Maybe you compare it (or the absolute value of it) against a threshold that you define to consider two signals "similar".
Lastly, in many contexts, a signal is considered similar to a scaled (in the amplitude sense, not time-scaling) version of itself, so there must be a normalization step prior to computing the cross-correlation. This is usually done by scaling the elements of the array so that the dot product with itself equals 1. Just be careful to ensure this makes sense for your application numerically, i.e., integers don't scale very well to values between 0 and 1 :-)
i think HighPerformanceMarks's suggestion is the standard way of doing the job.
a computationally lightweight alternative measure might be a dot product.
split both arrays into the same predefined index intervals.
consider the array elements in each intervals as vector coordinates in high-dimensional space.
compute the dot product of both vectors.
the dot product will not be negative. if the two vectors are perpendicular in their vector space, the dot product will be 0 (in fact that's how 'perpendicular' is usually defined in higher dimensions), and it will attain its maximum for identical vectors.
if you accept the geometric notion of perpendicularity as a (dis)similarity measure, here you go.
caveat:
this is an ad hoc heuristic chosen for computational efficiency. i cannot tell you about mathematical/statistical properties of the process and separation properties - if you need rigorous analysis, however, you'll probably fare better with correlation theory anyway and should perhaps forward your question to math.stackexchange.com.
My Attempt:
Total_sum=0
1. For each index i in the range (m,n)
2. sum=0
3. k=Array1[i]*Array2[i]; t1=magnitude(Array1[i]); t2=magnitude(Array2[i]);
4. k=k/(t1*t2)
5. sum=sum+k
6. Total_sum=Total_sum+sum
Coefficient=Total_sum/(m-n)
If all values are equal, then sum would return 1 in each case and total_sum would return (m-n)*(1). Hence, when the same is divided by (m-n) we get the value as 1. If the graphs are exact opposites, we get -1 and for other variations a value between -1 and 1 is returned.
This is not so efficient when the y range or the x range is huge. But, I just wanted to give you an idea.
Another option would be to perform an extensive xnor.
1. For each index i in the range (m,n)
2. sum=1
3. k=Array1[i] xnor Array2[i];
4. k=k/((pow(2,number_of_bits))-1) //This will scale k down to a value between 0 and 1
5. sum=(sum+k)/2
Coefficient=sum
Is this helpful ?
You can define a distance metric for two vectors A and B of length N containing numbers in the interval [-1, 1] e.g. as
sum = 0
for i in 0 to 99:
d = (A[i] - B[i])^2 // this is in range 0 .. 4
sum = (sum / 4) / N // now in range 0 .. 1
This now returns distance 1 for vectors that are completely opposite (one is all 1, another all -1), and 0 for identical vectors.
You can translate this into your coefficient by
coeff = 1 - sum
However, this is a crude approach because it does not take into account the fact that there could be horizontal distortion or shift between the signals you want to compare, so let's look at some approaches for coping with that.
You can sort both your arrays (e.g. in ascending order) and then calculate the distance / coefficient. This returns more similarity than the original metric, and is agnostic towards permutations / shifts of the signal.
You can also calculate the differentials and calculate distance / coefficient for those, and then you can do that sorted also. Using differentials has the benefit that it eliminates vertical shifts. Sorted differentials eliminate horizontal shift but still recognize different shapes better than sorted original data points.
You can then e.g. average the different coefficients. Here more complete code. The routine below calculates coefficient for arrays A and B of given size, and takes d many differentials (recursively) first. If sorted is true, the final (differentiated) array is sorted.
procedure calc(A, B, size, d, sorted):
if (d > 0):
A' = new array[size - 1]
B' = new array[size - 1]
for i in 0 to size - 2:
A'[i] = (A[i + 1] - A[i]) / 2 // keep in range -1..1 by dividing by 2
B'[i] = (B[i + 1] - B[i]) / 2
return calc(A', B', size - 1, d - 1, sorted)
else:
if (sorted):
A = sort(A)
B = sort(B)
sum = 0
for i in 0 to size - 1:
sum = sum + (A[i] - B[i]) * (A[i] - B[i])
sum = (sum / 4) / size
return 1 - sum // return the coefficient
procedure similarity(A, B, size):
sum a = 0
a = a + calc(A, B, size, 0, false)
a = a + calc(A, B, size, 0, true)
a = a + calc(A, B, size, 1, false)
a = a + calc(A, B, size, 1, true)
return a / 4 // take average
For something completely different, you could also run Fourier transform using FFT and then take a distance metric on the returning spectra.

matlab: optimum amount of points for linear fit

I want to make a linear fit to few data points, as shown on the image. Since I know the intercept (in this case say 0.05), I want to fit only points which are in the linear region with this particular intercept. In this case it will be lets say points 5:22 (but not 22:30).
I'm looking for the simple algorithm to determine this optimal amount of points, based on... hmm, that's the question... R^2? Any Ideas how to do it?
I was thinking about probing R^2 for fits using points 1 to 2:30, 2 to 3:30, and so on, but I don't really know how to enclose it into clear and simple function. For fits with fixed intercept I'm using polyfit0 (http://www.mathworks.com/matlabcentral/fileexchange/272-polyfit0-m) . Thanks for any suggestions!
EDIT:
sample data:
intercept = 0.043;
x = 0.01:0.01:0.3;
y = [0.0530642513911393,0.0600786706929529,0.0673485248329648,0.0794662409166333,0.0895915873196170,0.103837395346484,0.107224784565365,0.120300492775786,0.126318699218730,0.141508831492330,0.147135757370947,0.161734674733680,0.170982455701681,0.191799936622712,0.192312642057298,0.204771365716483,0.222689541632988,0.242582251060963,0.252582727297656,0.267390860166283,0.282890010610515,0.292381165948577,0.307990544720676,0.314264952297699,0.332344368808024,0.355781519885611,0.373277721489254,0.387722683944356,0.413648156978284,0.446500064130389;];
What you have here is a rather difficult problem to find a general solution of.
One approach would be to compute all the slopes/intersects between all consecutive pairs of points, and then do cluster analysis on the intersepts:
slopes = diff(y)./diff(x);
intersepts = y(1:end-1) - slopes.*x(1:end-1);
idx = kmeans(intersepts, 3);
x([idx; 3] == 2) % the points with the intersepts closest to the linear one.
This requires the statistics toolbox (for kmeans). This is the best of all methods I tried, although the range of points found this way might have a few small holes in it; e.g., when the slopes of two points in the start and end range lie close to the slope of the line, these points will be detected as belonging to the line. This (and other factors) will require a bit more post-processing of the solution found this way.
Another approach (which I failed to construct successfully) is to do a linear fit in a loop, each time increasing the range of points from some point in the middle towards both of the endpoints, and see if the sum of the squared error remains small. This I gave up very quickly, because defining what "small" is is very subjective and must be done in some heuristic way.
I tried a more systematic and robust approach of the above:
function test
%% example data
slope = 2;
intercept = 1.5;
x = linspace(0.1, 5, 100).';
y = slope*x + intercept;
y(1:12) = log(x(1:12)) + y(12)-log(x(12));
y(74:100) = y(74:100) + (x(74:100)-x(74)).^8;
y = y + 0.2*randn(size(y));
%% simple algorithm
[X,fn] = fminsearch(#(ii)P(ii, x,y,intercept), [0.5 0.5])
[~,inds] = P(X, y,x,intercept)
end
function [C, inds] = P(ii, x,y,intercept)
% ii represents fraction of range from center to end,
% So ii lies between 0 and 1.
N = numel(x);
n = round(N/2);
ii = round(ii*n);
inds = min(max(1, n+(-ii(1):ii(2))), N);
% Solve linear system with fixed intercept
A = x(inds);
b = y(inds) - intercept;
% and return the sum of squared errors, divided by
% the number of points included in the set. This
% last step is required to prevent fminsearch from
% reducing the set to 1 point (= minimum possible
% squared error).
C = sum(((A\b)*A - b).^2)/numel(inds);
end
which only finds a rough approximation to the desired indices (12 and 74 in this example).
When fminsearch is run a few dozen times with random starting values (really just rand(1,2)), it gets more reliable, but I still wouln't bet my life on it.
If you have the statistics toolbox, use the kmeans option.
Depending on the number of data values, I would split the data into a relative small number of overlapping segments, and for each segment calculate the linear fit, or rather the 1-st order coefficient, (remember you know the intercept, which will be same for all segments).
Then, for each coefficient calculate the MSE between this hypothetical line and entire dataset, choosing the coefficient which yields the smallest MSE.

Algorithm for discrete similarity metric

Given that I have two lists that each contain a separate subset of a common superset, is
there an algorithm to give me a similarity measurement?
Example:
A = { John, Mary, Kate, Peter } and B = { Peter, James, Mary, Kate }
How similar are these two lists? Note that I do not know all elements of the common superset.
Update:
I was unclear and I have probably used the word 'set' in a sloppy fashion. My apologies.
Clarification: Order is of importance.
If identical elements occupy the same position in the list, we have the highest similarity for that element.
The similarity decreased the farther apart the identical elements are.
The similarity is even lower if the element only exists in one of the lists.
I could even add the extra dimension that lower indices are of greater value, so a a[1] == b[1] is worth more than a[9] == b[9], but that is mainly cause I am curious.
The Jaccard Index (aka Tanimoto coefficient) is used precisely for the use case recited in the OP's question.
The Tanimoto coeff, tau, is equal to Nc divided by Na + Nb - Nc, or
tau = Nc / (Na + Nb - Nc)
Na, number of items in the first set
Nb, number of items in the second set
Nc, intersection of the two sets, or the number of unique items
common to both a and b
Here's Tanimoto coded as a Python function:
def tanimoto(x, y) :
w = [ ns for ns in x if ns not in y ]
return float(len(w) / (len(x) + len(y) - len(w)))
I would explore two strategies:
Treat the lists as sets and apply set ops (intersection, difference)
Treat the lists as strings of symbols and apply the Levenshtein algorithm
If you truly have sets (i.e., an element is simply either present or absent, with no count attached) and only two of them, just adding the number of shared elements and dividing by the total number of elements is probably about as good as it gets.
If you have (or can get) counts and/or more than two of them, you can do a bit better than that with something like cosine simliarity or TFIDF (term frequency * inverted document frequency).
The latter attempts to give lower weighting to words that appear in all (or nearly) all the "documents" -- i.e., sets of words.
What is your definition of "similarity measurement?" If all you want is how many items in the set are in common with each other, you could find the cardinality of A and B, add the cardinalities together, and subtract from the cardinality of the union of A and B.
If order matters you can use Levenshtein distance or other kind of Edit distance
.

How do I convert between a measure of similarity and a measure of difference (distance)?

Is there a general way to convert between a measure of similarity and a measure of distance?
Consider a similarity measure like the number of 2-grams that two strings have in common.
2-grams('beta', 'delta') = 1
2-grams('apple', 'dappled') = 4
What if I need to feed this to an optimization algorithm that expects a measure of difference, like Levenshtein distance?
This is just an example...I'm looking for a general solution, if one exists. Like how to go from Levenshtein distance to a measure of similarity?
I appreciate any guidance you may offer.
Let d denotes distance, s denotes similarity. To convert distance measure to similarity measure, we need to first normalize d to [0 1], by using d_norm = d/max(d). Then the similarity measure is given by:
s = 1 - d_norm.
where s is in the range [0 1], with 1 denotes highest similarity (the items in comparison are identical), and 0 denotes lowest similarity (largest distance).
If your similarity measure (s) is between 0 and 1, you can use one of these:
1-s
sqrt(1-s)
-log(s)
(1/s)-1
Doing 1/similarity is not going to keep the properties of the distribution.
the best way is
distance (a->b) = highest similarity - similarity (a->b).
with highest similarity being the similarity with the biggest value. You hence flip your distribution.
the highest similarity becomes 0 etc
Yes, there is a most general way to change between similarity and distance: a strictly monotone decreasing function f(x).
That is, with f(x) you can make similarity = f(distance) or distance = f(similarity). It works in both directions. Such function works, because the relation between similarity and distance is that one decreases when the other increases.
Examples:
These are some well-known strictly monotone decreasing candidates that work for non-negative similarities or distances:
f(x) = 1 / (a + x)
f(x) = exp(- x^a)
f(x) = arccot(ax)
You can choose parameter a>0 (e.g., a=1)
Edit 2021-08
A very practical approach is to use the function sim2diss belonging to the statistical software R. This functions provides a up to 13 methods to compute dissimilarity from similarities. Sadly the methods are not at all explained: you have to look into the code :-\
similarity = 1/difference
and watch out for difference = 0
According to scikit learn:
Kernels are measures of similarity, i.e. s(a, b) > s(a, c) if objects a and b are considered “more similar” than objects a and c. A kernel must also be positive semi-definite.
There are a number of ways to convert between a distance metric and a similarity measure, such as a kernel. Let D be the distance, and S be the kernel:
S = np.exp(-D * gamma), where one heuristic for choosing gamma is 1 /
num_features
S = 1. / (D / np.max(D))
In the case of Levenshtein distance, you could increase the sim score by 1 for every time the sequences match; that is, 1 for every time you didn't need a deletion, insertion or substitution. That way the metric would be a linear measure of how many characters the two strings have in common.
In one of my projects (based on Collaborative Filtering) I had to convert between correlation (cosine between vectors) which was from -1 to 1 (closer 1 is more similar, closer to -1 is more diverse) to normalized distance (close to 0 the distance is smaller and if it's close to 1 the distance is bigger)
In this case: distance ~ diversity
My formula was: dist = 1 - (cor + 1)/2
If you have similarity to diversity and the domain is [0,1] in both cases the simlest way is:
dist = 1 - sim
sim = 1 - dist
Cosine similarity is widely used for n-gram count or TFIDF vectors.
from math import pi, acos
def similarity(x, y):
return sum(x[k] * y[k] for k in x if k in y) / sum(v**2 for v in x.values())**.5 / sum(v**2 for v in y.values())**.5
Cosine similarity can be used to compute a formal distance metric according to wikipedia. It obeys all the properties of a distance that you would expect (symmetry, nonnegativity, etc):
def distance_metric(x, y):
return 1 - 2 * acos(similarity(x, y)) / pi
Both of these metrics range between 0 and 1.
If you have a tokenizer that produces N-grams from a string you could use these metrics like this:
>>> import Tokenizer
>>> tokenizer = Tokenizer(ngrams=2, lower=True, nonwords_set=set(['hello', 'and']))
>>> from Collections import Counter
>>> list(tokenizer('Hello World again and again?'))
['world', 'again', 'again', 'world again', 'again again']
>>> Counter(tokenizer('Hello World again and again?'))
Counter({'again': 2, 'world': 1, 'again again': 1, 'world again': 1})
>>> x = _
>>> Counter(tokenizer('Hi world once again.'))
Counter({'again': 1, 'world once': 1, 'hi': 1, 'once again': 1, 'world': 1, 'hi world': 1, 'once': 1})
>>> y = _
>>> sum(x[k]*y[k] for k in x if k in y) / sum(v**2 for v in x.values())**.5 / sum(v**2 for v in y.values())**.5
0.42857142857142855
>>> distance_metric(x, y)
0.28196592805724774
I found the elegant inner product of Counter in this SO answer

Resources