Finding similar names better than O(n^2) - algorithm

Say I've got a list of names. Unfortunately, there are some duplicates, but it isn't obvious which of those are duplicates.
Tom Riddle
Tom M. Riddle
tom riddle
Tom Riddle, PhD.
I'm thinking of using Levenshtein distance, and other algorithms for comparing two names at a time certainly come to mind.
But in a list of names, regardless of the string distance algorithm, I'll always end up generating a grid of comparison outputs (n^2).
How can I avoid the O(n^2) situation?

Introduction
What you want to do is known as fuzzy search. Let me guide you through the topic.
First, set up an inverted index (Wikipedia) of n-grams (Wikipedia). That is, split a word like "hello" into, for example, 3-grams:
"$$h", "$he", "hel", "ell", "llo", "lo$", "o$$"
And have a map which maps every n-gram to a list of words that contain it:
"$$h" -> ["hello", "helloworld", "hi", "huhu", "hey"]
"$he" -> ["hello", "helloworld", "hey"]
...
"llo" -> ["hello", "helloworld", "llowaddup", "allo"]
...
All words in your database are now indexed by their n-grams; this is why it is called an inverted index.
The idea is, given a query, to compute how many n-grams the query has in common with all words in your database. This can be computed quickly, and you can then use it to skip computing the expensive edit distance for a huge set of records, which dramatically increases the speed. It is, more or less, the standard approach that all search engines use.
Let me first explain the general approach by the example of an exact match. After that we will slightly modify it and get to the fuzzy matching.
Exact match
At query time, compute the n-grams of your query, fetch the lists and compute the intersection.
Like if you get "hello" you compute the grams and get:
"$$h", "$he", "hel", "ell", "llo", "lo$", "o$$"
You fetch all lists for all of those n-grams:
List<String> result = null;
for (String nGram : query.getNGrams()) {
    List<String> words = map.get(nGram);
    if (result == null) {
        result = new ArrayList<>(words);   // start with the first list
    } else {
        result.retainAll(words);           // keep only words present in both lists
    }
}
The intersection contains all words which match exactly those grams; here, that is only "hello".
Note that an exact match can be computed faster by using hashing, like a HashSet.
Fuzzy match
Instead of intersecting the lists, merge them. To merge efficiently you should use a k-way merge algorithm; this requires the word lists in your inverted index to be sorted, so make sure to sort them at construction time.
You now get a list of all words that have at least one n-gram in common with the query.
We have already greatly reduced the set of possible records, but we can do even better. Maintain, for each word, a count of how many n-grams it has in common with the query; you can easily do that while merging the lists.
Consider the following threshold:
max(|x|, |y|) - 1 - (delta - 1) * n
where x is your query, y the candidate word you are comparing against, n is the n-gram size you used (3 for 3-grams, for example), and delta is the number of mistakes you allow.
If the count is below that value, you directly know that the edit distance is
ED(x, y) > delta
So you only need to consider words whose count is at least the above threshold; only for those words do you compute the edit distance ED(x, y).
This drastically reduces the set of candidates, so the expensive edit distance is computed for only a small number of records.
Example
Suppose you get the query "hilari". Let's use 3-grams. We get
"$$h", "$hi", "hil", "ila", "lar", "ari", "ri$", "i$$"
We search through our inverted index, merge the lists of words that share those grams, and get "hillary", "haemophilia", "solar". Along the way we count how many grams each word has in common with the query:
"hillary" -> 4 ("$$h", "$hi", "hil", "lar")
"haemophilia" -> 2 ("$$h", "hil")
"solar" -> 1 ("lar")
Check each entry against the threshold. Let delta be 2. We get:
4 >= max(|"hilari"|, |"hillary"|) - 4 = 3
2 < max(|"hilari"|, |"haemophilia"|) - 4 = 7
1 < max(|"hilari"|, |"solar"|) - 4 = 2
Only "hillary" is above the threshold, discard the rest. Compute the edit distance for all remaining records:
ED("hilari", "hillary") = 2
which does not exceed delta = 2, so we accept it.
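To make the whole pipeline concrete, here is a minimal Python sketch of the approach under a couple of simplifying assumptions: a hash-map counter stands in for the sorted-list k-way merge, and the helper names (ngrams, build_index, fuzzy_search, edit_distance) are mine, not from the answer above.

from collections import defaultdict

def ngrams(word, n=3, pad="$"):
    # Padded n-grams, e.g. "hello" -> "$$h", "$he", "hel", "ell", "llo", "lo$", "o$$"
    padded = pad * (n - 1) + word + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def build_index(words, n=3):
    # Inverted index: n-gram -> list of words containing it
    index = defaultdict(list)
    for word in words:
        for gram in set(ngrams(word, n)):
            index[gram].append(word)
    return index

def edit_distance(x, y):
    # Plain Levenshtein distance, only run on the few surviving candidates
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cx != cy)))
        prev = cur
    return prev[-1]

def fuzzy_search(query, index, n=3, delta=2):
    common = defaultdict(int)              # candidate word -> n-grams shared with the query
    for gram in set(ngrams(query, n)):
        for word in index.get(gram, []):
            common[word] += 1
    matches = []
    for word, count in common.items():
        threshold = max(len(query), len(word)) - 1 - (delta - 1) * n
        if count >= threshold and edit_distance(query, word) <= delta:
            matches.append(word)
    return matches

index = build_index(["hillary", "haemophilia", "solar"])
print(fuzzy_search("hilari", index))       # ['hillary']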

This will be hard. Accept that you will make mistakes and don't let the perfect be the enemy of the good.
Begin by removing honorifics (Mr, Mrs, Sir, Dr, PhD, Jr, Sr, ...). Remove common first names (based on a list of first names) and initials, and convert all characters to upper case. Create a signature for whatever is left: use Soundex or something similar, or simply remove all vowels and doubled consonants. Sort by signature to bring like names together, then run the full compare only on names with the same signature. That reduces the time complexity to O(n log n) for the sorting, plus a little O(k²) for each group of k equal signatures.
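A rough Python sketch of that signature idea; the honorific list is a small illustrative sample, the signature here is the simple vowel-and-doubled-consonant-stripping variant rather than Soundex, and grouping via a hash map plays the role of the sort.

import re
from collections import defaultdict

HONORIFICS = {"MR", "MRS", "MS", "SIR", "DR", "PHD", "JR", "SR", "ESQ"}   # illustrative, extend as needed

def signature(name):
    # Drop honorifics and initials, upper-case, then strip vowels and doubled consonants.
    tokens = [t for t in re.split(r"[^A-Za-z]+", name.upper()) if t and t not in HONORIFICS]
    tokens = [t for t in tokens if len(t) > 1]          # drop single-letter initials
    joined = "".join(tokens)
    joined = re.sub(r"[AEIOU]", "", joined)             # remove vowels
    joined = re.sub(r"(.)\1+", r"\1", joined)           # collapse doubled consonants
    return joined

names = ["Tom Riddle", "Tom M. Riddle", "tom riddle", "Tom Riddle, PhD."]
groups = defaultdict(list)
for name in names:
    groups[signature(name)].append(name)

for sig, members in groups.items():
    if len(members) > 1:
        print(sig, "->", members)          # run the full compare only within each group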

Other answers have approached this as an abstract string problem. If that's what you're after then I think they give good advice. I'm going to assume that you would like to use specific knowledge of how names work, so that, for instance, "Mr. Thomas Riddle, Esq" and "Riddle, Tom" would match "Tom Riddle", but "Tom Griddle" wouldn't.
In general with this kind of problem you define some kind of canonicalization function and look for terms that canonicalize to the same thing. In this case, it seems like your canonical representation of a name ought to include a lower-case version of first and last name, stripped of any titles, and "de-nicknamed" using a nickname-to-formal-name mapping (assuming you want "Tom" and "Thomas" to match). This function would produce "Tom Riddle" -> {first: "tom", last: "riddle"}, "Riddle, Tom" -> {first: "tom", last: "riddle"}, "Tom Riddle, Esq" -> {first: "tom", last: "riddle"}, and so on, but "Tom Griddle" -> {first: "tom", last: "griddle"}.
Once you have a name-canonicalization function, you can create a map (e.g. hashmap or BST) that associates canonical names to a list of uncanonicalized names. For each uncanonicalized name, find the list corresponding to its canonical form in the map and insert it there. Once you're done, all the lists with more than one element are your duplicates.
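Here is a hedged Python sketch of that canonicalize-and-bucket idea; the title set, the nickname map, and the "Last, First" handling are tiny illustrative stand-ins for real data and real name parsing.

from collections import defaultdict

TITLES = {"mr", "mrs", "ms", "dr", "esq", "phd", "jr", "sr"}            # illustrative
NICKNAMES = {"tom": "thomas", "tommy": "thomas", "bill": "william"}     # illustrative

def canonicalize(name):
    # Return a (first, last) key: lower-cased, de-titled, de-nicknamed.
    raw = name.replace(".", "").replace(",", " , ").lower()
    if "," in raw:                                   # crude "Riddle, Tom" -> "tom riddle" swap
        before, after = raw.split(",", 1)
        raw = after + " " + before
    tokens = [t for t in raw.split() if t not in TITLES and t != ","]
    if not tokens:
        return ("", "")
    first, last = tokens[0], tokens[-1]
    return (NICKNAMES.get(first, first), last)

names = ["Tom Riddle", "Mr. Thomas Riddle, Esq", "Riddle, Tom", "Tom Griddle"]
buckets = defaultdict(list)
for name in names:
    buckets[canonicalize(name)].append(name)

duplicates = [group for group in buckets.values() if len(group) > 1]
print(duplicates)   # [['Tom Riddle', 'Mr. Thomas Riddle, Esq', 'Riddle, Tom']]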

Related

More efficient way to find phrases in a string?

I have a list that contains 100,000+ words/phrases sorted by length
let list = ["string with spaces", "another string", "test", ...]
I need to find the longest element in the list above that is inside a given sentence. This is my initial solution
for item in list {
    if sentence == item
        || sentence.startsWith(item + " ")
        || sentence.contains(" " + item + " ")
        || sentence.endsWith(" " + item) {
        ...
        break
    }
}
The issue I am running into is that this is too slow for my application. Is there a different approach I could take to make this faster?
You could build an Aho-Corasick searcher from the list and then run this on the sentence. According to https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm "The complexity of the algorithm is linear in the length of the strings plus the length of the searched text plus the number of output matches. Note that because all matches are found, there can be a quadratic number of matches if every substring matches (e.g. dictionary = a, aa, aaa, aaaa and input string is aaaa). "
I would break the given sentence up into a list of words and then compute all possible contiguous sublists (i.e. phrases). Given a sentence of n words, there are n * (n + 1) / 2 possible phrases that can be found inside it.
If you now replace your list of search phrases (["string with spaces", "another string", "test", ...]) with an (amortized) constant-time lookup data structure such as a hash set, you can walk over the list of phrases computed in the previous step and check whether each one is in the set in roughly constant time.
The overall time complexity of this algorithm scales quadratically in the size of the sentence, and is roughly independent of the size of the set of search terms.
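A short Python sketch of that approach, assuming the search phrases have already been loaded into a set (the function and variable names here are mine):

def longest_phrase_in(sentence, phrase_set):
    # Try every contiguous run of words (n * (n + 1) / 2 of them) against the set.
    words = sentence.split()
    best = None
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            candidate = " ".join(words[i:j])
            if candidate in phrase_set:                      # ~O(1) membership test
                if best is None or len(candidate) > len(best):
                    best = candidate
    return best

phrases = {"string with spaces", "another string", "test"}
print(longest_phrase_in("this is a test with another string inside", phrases))   # "another string"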
The solution I decided to use was a Trie https://en.wikipedia.org/wiki/Trie. Each node in the trie is a word, and all I do is tokenize the input sentence (by word) and traverse the trie.
This improved performance from ~140 seconds to ~5 seconds
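For reference, a minimal sketch of such a word-level trie in Python; the node layout and the traversal are my own guesses at the design the poster describes, not their actual code.

class WordTrieNode:
    def __init__(self):
        self.children = {}            # next word -> child node
        self.is_phrase_end = False

def build_trie(phrases):
    root = WordTrieNode()
    for phrase in phrases:
        node = root
        for word in phrase.split():
            node = node.children.setdefault(word, WordTrieNode())
        node.is_phrase_end = True
    return root

def longest_match(sentence, root):
    # Longest stored phrase (in words) starting at any token of the sentence.
    tokens = sentence.split()
    best = None
    for start in range(len(tokens)):
        node, matched = root, []
        for word in tokens[start:]:
            if word not in node.children:
                break
            node = node.children[word]
            matched.append(word)
            if node.is_phrase_end and (best is None or len(matched) > len(best)):
                best = list(matched)
    return " ".join(best) if best else None

root = build_trie(["string with spaces", "another string", "test"])
print(longest_match("a test sentence with another string in it", root))   # "another string"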

Paired comparisons algorithm design

I have 15 groupings of 5 words. Let's say the first grouping is the 'happy' grouping and is the following: ["happy", "smile", "fun", "joy", "laugh"], and the second is the 'sad' grouping and is the following: ["sad", "frown", "bummer", "cry", "rain-cloud"]. All the other groupings are similar, with five words in an array.
I am designing a paired comparisons React app, and I need one word from each grouping to be randomly chosen and paired with a randomly chosen word from each other grouping. From the examples above, a pair for groupings 1 and 2 might be ["smile", "cry"]. There should be 105 total pairs (exactly one for each grouping with each other grouping).
I was thinking of using a loop and going through the groupings one by one, then for each of the remaining groupings, taking a random word from the grouping I'm looking at and one from the other and creating a pair.
I feel like this isn't very elegant or efficient, and I'm curious how I might design a better algorithm. I think recursion might be helpful, but I can't think of how I could use it in this scenario.
Any thoughts or ideas? Thanks!
I had a thought about using recursion, but unfortunately I couldn't come up with an algorithm for it.
What I tried here is: while more than one set remains, iterate through every item of the first set and pair it with a random item from any other set, then remove that chosen item. After the whole first set has been iterated, simply remove the first set from the list of all sets. This way there is no need for extra work to check for duplicates.
(I used javascript for implementing, and made the sets simpler)
let myArray = [["happy", "smile", "fun"],
               ["sad", "frown", "bummer"],
               [1, 2, 3],
               [4, 5, 6]]
let pairs = []
while (myArray.length != 1) {
    for (var i = 0; i < myArray[0].length; i++) {
        var next = myArray[Math.floor(Math.random() * (myArray.length - 1)) + 1];
        var nextindex = Math.floor(Math.random() * next.length);
        pairs.push([myArray[0][i], next[nextindex]]);
        next.splice(nextindex, 1);
    }
    myArray.splice(0, 1);
}
console.log(pairs)
Although, as you said, this code is not elegant (nor the most efficient)... There should be a better solution, so it is worth continuing to think about a better algorithm!
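For comparison, the straightforward nested loop the asker describes is short and produces exactly one pair per unordered pair of groupings; here is a Python sketch (note that words have to be reused across pairs, since each of the 15 groupings takes part in 14 pairs but only holds 5 words):

import random

def all_pairs(groups):
    # One random-word pair for every unordered pair of groupings.
    pairs = []
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            pairs.append([random.choice(groups[i]), random.choice(groups[j])])
    return pairs

groups = [["happy", "smile", "fun", "joy", "laugh"],
          ["sad", "frown", "bummer", "cry", "rain-cloud"],
          ["calm", "quiet", "still", "peace", "rest"]]
print(all_pairs(groups))   # 3 pairs here; 15 groupings would give 105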

On counting pairs of words that differ by one letter

Let us consider n words, each of length k. Those words consist of letters over an alphabet (whose cardinality is n) with defined order. The task is to derive an O(nk) algorithm to count the number of pairs of words that differ by one position (no matter which one exactly, as long as it's only a single position).
For instance, in the following set of words (n = 5, k = 4):
abcd, abdd, adcb, adcd, aecd
there are 5 such pairs: (abcd, abdd), (abcd, adcd), (abcd, aecd), (adcb, adcd), (adcd, aecd).
So far I've managed to find an algorithm that solves a slightly easier problem: counting the number of pairs of words that differ at one GIVEN position (the i-th). To do this, I swap the letter at the i-th position with the last letter within each word, perform a radix sort (ignoring the last position in each word, formerly the i-th position), linearly detect words whose letters at positions 1 to k-1 are the same, and finally count the occurrences of each letter at the last (originally i-th) position within each set of duplicates and compute the desired pairs (this last part is simple).
However, the algorithm above doesn't seem to be applicable to the main problem (under the O(nk) constraint) - at least not without some modifications. Any idea how to solve this?
Assuming n and k aren't too large, so that this fits into memory:
Have a set with the first letter removed, one with the second letter removed, one with the third letter removed, and so on. Technically each of these has to be a map from strings to counts.
Run through the list and add the current element to each of the maps (after removing the applicable letter); if the key already exists, add its count to totalPairs and increment the count by one.
Then totalPairs is the desired value.
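A compact Python version of that idea, assuming all input words are distinct and have the same length, and using plain dictionaries as the maps:

from collections import defaultdict

def count_one_letter_pairs(words):
    # Count pairs of distinct equal-length words that differ in exactly one position.
    total_pairs = 0
    maps = [defaultdict(int) for _ in range(len(words[0]))]   # one map per removed position
    for word in words:
        for i, counter in enumerate(maps):
            key = word[:i] + word[i + 1:]        # the word with its i-th letter removed
            total_pairs += counter[key]          # every earlier word with this key forms a pair
            counter[key] += 1
    return total_pairs

print(count_one_letter_pairs(["abcd", "abdd", "adcb", "adcd", "aecd"]))   # 5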
EDIT:
Complexity:
This should be O(n.k.logn).
You can use a map that uses hashing (e.g. HashMap in Java), instead of a sorted map for a theoretical complexity of O(nk) (though I've generally found a hash map to be slower than a sorted tree-based map).
Improvement:
A small alteration on this is to have a map of the first 2 letters removed to 2 maps, one with first letter removed and one with second letter removed, and have the same for the 3rd and 4th letters, and so on.
Then put these into maps with 4 letters removed and those into maps with 8 letters removed and so on, up to half the letters removed.
The complexity of this is:
You do 2 lookups into 2 sorted sets containing maximum k elements (for each half).
For each of these you do 2 lookups into 2 sorted sets again (for each quarter).
So the number of lookups is 2 + 4 + 8 + ... + k/2 + k, which I believe is O(k).
I may be wrong here, but, worst case, the number of elements in any given map is n, but this will cause all other maps to only have 1 element, so still O(logn), but for each n (not each n.k).
So I think that's O(n.(logn + k)).
EDIT 2:
Example of my maps (without the improvement):
(x-1) means x maps to 1.
Let's say we have abcd, abdd, adcb, adcd, aecd.
The first map would be (bcd-1), (bdd-1), (dcb-1), (dcd-1), (ecd-1).
The second map would be (acd-3), (add-1), (acb-1) (for 4th and 5th, value already existed, so increment).
The third map : (abd-2), (adb-1), (add-1), (aed-1) (2nd already existed).
The fourth map : (abc-1), (abd-1), (adc-2), (aec-1) (4th already existed).
totalPairs = 0
For second map - acd, for the 4th, we add 1, for the 5th we add 2.
totalPairs = 3
For third map - abd, for the 2nd, we add 1.
totalPairs = 4
For fourth map - adc, for the 4th, we add 1.
totalPairs = 5.
Partial example of improved maps:
Same input as above.
Map of first 2 letters removed to maps of 1st and 2nd letter removed:
(cd-{ {(bcd-1)}, {(acd-1)} }),
(dd-{ {(bdd-1)}, {(add-1)} }),
(cb-{ {(dcb-1)}, {(acb-1)} }),
(cd-{ {(dcd-1)}, {(acd-1)} }),
(cd-{ {(ecd-1)}, {(acd-1)} })
The above is a map consisting of an element cd mapped to 2 maps, one containing one element (bcd-1) and the other containing (acd-1).
But for the 4th and 5th cd already existed, so, rather than generating the above, it will be added to that map instead, as follows:
(cd-{ {(bcd-1, dcd-1, ecd-1)}, {(acd-3)} }),
(dd-{ {(bdd-1)}, {(add-1)} }),
(cb-{ {(dcb-1)}, {(acb-1)} })
You can put each word into an array. Pop out elements from that array one by one, then compare the resulting arrays. Finally, add back the popped element to get back the original array.
The popped elements from the two arrays must not be the same.
Count the number of cases where this occurs and finally divide it by 2 to get the exact solution.
Think about how you would enumerate the language - you would likely use a recursive algorithm. Recursive algorithms map onto tree structures. If you construct such a tree, each divergence represents a difference of one letter, and each leaf will represent a word in the language.
It's been two months since I submitted the problem here. I have discussed it with my peers in the meantime and would like to share the outcome.
The main idea is similar to the one presented by Dukeling. For each word A and for each i-th position within that word, we consider the tuple (prefix, suffix, letter at the i-th position), i.e. (A[1..i-1], A[i+1..k], A[i]). If i is either 1 or k, the applicable substring is considered empty (these are simple boundary cases).
Having these tuples in hand, we should be able to apply the reasoning I provided in my first post to count the number of pairs of different words. All we have to do is sort the tuples by the prefix and suffix values (separately for each i); then, words with letters equal at all but the i-th position will be adjacent to each other.
Here though is the technical part I am lacking. So as to make the sorting procedure (RadixSort appears to be the way to go) meet the O(nk) constraint, we might want to assign labels to our prefixes and suffixes (we only need n labels for each i). I am not quite sure how to go about the labelling stuff. (Sure, we might do some hashing instead, but I am pretty confident the former solution is viable).
While this is not an entirely complete solution, I believe it casts some light on the possible way to tackle this problem and that is why I posted it here. If anyone comes up with an idea of how to do the labelling part, I will implement it in this post.
How's the following Python solution?
import string

words = {"abcd", "abdd", "adcb", "adcd", "aecd"}    # the example words from the question

def one_apart(words, word):
    res = set()
    for i, _ in enumerate(word):
        for c in string.ascii_lowercase:
            w = word[:i] + c + word[i+1:]
            if w != word and w in words:
                res.add(w)
    return res

pairs = set()
for w in words:
    for other in one_apart(words, w):
        pairs.add(frozenset((w, other)))

for pair in pairs:
    print(pair)
Output:
frozenset({'abcd', 'adcd'})
frozenset({'aecd', 'adcd'})
frozenset({'adcb', 'adcd'})
frozenset({'abcd', 'aecd'})
frozenset({'abcd', 'abdd'})

Algorithm for discrete similarity metric

Given that I have two lists that each contain a separate subset of a common superset, is there an algorithm to give me a similarity measurement?
Example:
A = { John, Mary, Kate, Peter } and B = { Peter, James, Mary, Kate }
How similar are these two lists? Note that I do not know all elements of the common superset.
Update:
I was unclear and I have probably used the word 'set' in a sloppy fashion. My apologies.
Clarification: Order is of importance.
If identical elements occupy the same position in the list, we have the highest similarity for that element.
The similarity decreases the farther apart the identical elements are.
The similarity is even lower if the element only exists in one of the lists.
I could even add the extra dimension that lower indices are of greater value, so a[1] == b[1] is worth more than a[9] == b[9], but that is mainly because I am curious.
The Jaccard Index (aka Tanimoto coefficient) is used precisely for the use case recited in the OP's question.
The Tanimoto coeff, tau, is equal to Nc divided by Na + Nb - Nc, or
tau = Nc / (Na + Nb - Nc)
Na, number of items in the first set
Nb, number of items in the second set
Nc, the size of the intersection of the two sets, i.e. the number of unique items common to both a and b
Here's Tanimoto coded as a Python function:
def tanimoto(x, y):
    w = [ns for ns in x if ns in y]     # Nc: the items common to both lists
    return len(w) / float(len(x) + len(y) - len(w))
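For example, with the two lists from the question (order ignored), the shared items are Mary, Kate and Peter, so the coefficient comes out as 3 / (4 + 4 - 3) = 0.6:

a = ["John", "Mary", "Kate", "Peter"]
b = ["Peter", "James", "Mary", "Kate"]
print(tanimoto(a, b))   # 0.6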
I would explore two strategies:
Treat the lists as sets and apply set ops (intersection, difference)
Treat the lists as strings of symbols and apply the Levenshtein algorithm
If you truly have sets (i.e., an element is simply either present or absent, with no count attached) and only two of them, just adding the number of shared elements and dividing by the total number of elements is probably about as good as it gets.
If you have (or can get) counts and/or more than two of them, you can do a bit better than that with something like cosine similarity or TF-IDF (term frequency * inverse document frequency).
The latter attempts to give lower weighting to words that appear in all (or nearly all) the "documents" -- i.e., sets of words.
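If it helps to see the count-based variant concretely, here is a rough Python sketch of plain cosine similarity over item counts (no IDF weighting; the lists reuse the question's example):

import math
from collections import Counter

def cosine_similarity(a, b):
    # Cosine of the angle between the two count vectors.
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[item] * cb[item] for item in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

print(cosine_similarity(["John", "Mary", "Kate", "Peter"],
                        ["Peter", "James", "Mary", "Kate"]))   # 0.75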
What is your definition of "similarity measurement"? If all you want is how many items the sets have in common, you could take the cardinalities of A and B, add them together, and subtract the cardinality of the union of A and B, which gives the size of the intersection.
If order matters, you can use Levenshtein distance or another kind of edit distance.

String similarity score/hash

Is there a method to calculate something like a general "similarity score" for a string? That is, rather than comparing two strings with each other, I would get a number (hash) for each string that can later tell me whether two strings are similar. Two similar strings should have similar (close) hashes.
Let's consider these strings and scores as an example:
Hello world 1000
Hello world! 1010
Hello earth 1125
Foo bar 3250
FooBarbar 3750
Foo Bar! 3300
Foo world! 2350
You can see that Hello world! and Hello world are similar and their scores are close to each other.
This way, finding the most similar strings to a given string would be done by subtracting the given string's score from the other scores and then sorting by the absolute value of the difference.
I believe what you're looking for is called a Locality Sensitive Hash. Whereas most hash algorithms are designed such that small variations in input cause large changes in output, these hashes attempt the opposite: small changes in input generate proportionally small changes in output.
As others have mentioned, there are inherent issues with forcing a multi-dimensional mapping into a 2-dimensional mapping. It's analogous to creating a flat map of the Earth... you can never accurately represent a sphere on a flat surface. Best you can do is find a LSH that is optimized for whatever feature it is you're using to determine whether strings are "alike".
Levenshtein distance, or one of its derivatives, is the algorithm you want.
Match the given string against each of the strings in the dictionary.
(Here, if you only need a fixed number of most similar strings, you may want to use a min-heap.)
If running Levenshtein distance against all strings in the dictionary is too expensive, then first use some rough algorithm that excludes words which are too distant from the list of candidates.
After that, run Levenshtein distance on the remaining candidates.
One way to exclude distant words is to index n-grams.
Preprocess the dictionary by splitting each of its words into a list of n-grams.
For example, consider n=3:
(0) "Hello world" -> ["Hel", "ell", "llo", "lo ", "o w", " wo", "wor", "orl", "rld"]
(1) "FooBarbar" -> ["Foo", "ooB", "oBa", "Bar", "arb", "rba", "bar"]
(2) "Foo world!" -> ["Foo", "oo ", "o w", " wo", "wor", "orl", "rld", "ld!"]
Next, create an index of n-grams:
" wo" -> [0, 2]
"Bar" -> [1]
"Foo" -> [1, 2]
"Hel" -> [0]
"arb" -> [1]
"bar" -> [1]
"ell" -> [0]
"ld!" -> [2]
"llo" -> [0]
"lo " -> [0]
"o w" -> [0, 2]
"oBa" -> [1]
"oo " -> [2]
"ooB" -> [1]
"orl" -> [0, 2]
"rba" -> [1]
"rld" -> [0, 2]
"wor" -> [0, 2]
When you need to find the most similar strings for a given string, split it into n-grams and select only those words from the dictionary which have at least one matching n-gram.
This reduces the number of candidates to a reasonable amount, and you may proceed with Levenshtein-matching the given string against each of the remaining candidates.
If your strings are long enough, you may reduce the index size by using the min-hashing technique:
calculate an ordinary hash for each n-gram and keep only the K smallest hashes; the others are thrown away.
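A small Python sketch of that min-hashing idea; keeping only the K smallest n-gram hashes as a signature and comparing signature overlap gives a rough estimate of n-gram similarity, and the choice of MD5 as the hash is arbitrary:

import hashlib

def ngrams(s, n=3):
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def min_hash_signature(s, n=3, k=5):
    # Keep only the k smallest n-gram hashes as a compact signature.
    hashes = {int(hashlib.md5(g.encode()).hexdigest(), 16) for g in ngrams(s, n)}
    return set(sorted(hashes)[:k])

def estimated_similarity(a, b, n=3, k=5):
    sa, sb = min_hash_signature(a, n, k), min_hash_signature(b, n, k)
    union = len(sa | sb)
    return len(sa & sb) / float(union) if union else 0.0

print(estimated_similarity("Hello world", "Hello world!"))   # high: most 3-grams are shared
print(estimated_similarity("Hello world", "FooBarbar"))      # 0.0: no 3-grams in common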
P.S. this presentation seems like a good introduction to your problem.
This isn't possible, in general, because strings under edit distance form a metric space, but not one with a fixed dimension. That means that you can't provide a mapping between strings and integers that preserves the distance measure between them.
For example, you cannot assign numbers to these three phrases:
one two
one six
two six
such that the numbers reflect the differences between all three phrases.
While the idea seems extremely sweet... I've never heard of this.
I've read many, many techniques, theses, and scientific papers on the subject of spell correction / typo correction, and the fastest proposals revolve around an index and the Levenshtein distance.
There are fairly elaborate techniques; the one I am currently working on combines:
a burst trie, with level compactness
a Levenshtein automaton
Even though this doesn't mean it is "impossible" to get a score, I somehow think there would not be so much recent research on string comparison if such a "scoring" method had proved efficient.
If you ever find such a method, I am extremely interested :)
Would Levenshtein distance work for you?
In an unbounded problem, there is no solution which can convert any possible sequence of words, or any possible sequence of characters to a single number which describes locality.
Imagine similarity at the character level
stops
spots
hello world
world hello
In both examples the messages are different, but the characters in the messages are identical, so the measure would need to hold a position value as well as a character value. (char 0 == 'h', char 1 == 'e' ...)
Then compare the following similar messages
hello world
ello world
Although the two strings are similar, they could differ at the beginning, or at the end, which makes scaling by position problematic.
In the case of
spots
stops
The words only differ by position of the characters, so some form of position is important.
If the following strings are similar
yesssssssssssssss
yessssssssssssss
Then you have a form of paradox. If you add two more 's' characters to the second string, it should stay roughly the same distance from the first string, yet it must remain distinct. This can be repeated with progressively longer strings, all of which need to be close to the strings just shorter and just longer than themselves. I can't see how to achieve this.
In general this is treated as a multi-dimensional problem - breaking the string into a vector
[ 'h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd' ]
But the values of the vector cannot be represented by a fixed-size number, nor do they give a good-quality difference measure.
If the number of words or the length of the strings were bounded, then a coding solution might be possible.
Bounded values
Using something like arithmetic coding, a sequence of words can be converted into a floating-point number which represents the sequence. However, this would treat items earlier in the sequence as more significant than the last item in the sequence.
Data mining solution
If you accept that the problem is high-dimensional, then you can store your strings in a metric tree (Wikipedia: metric tree). This limits your search space, while still not giving you a "single number" solution.
I have code for this on GitHub: clustering
Items which are close together should be stored together in a part of the tree, but there is really no guarantee. The radius of subtrees is used to prune the search space.
Edit Distance or Levenshtein distance
This is used in an SQLite extension to perform similarity searching. There is no single-number solution here either; instead it works out how many edits change one string into another, and that edit count becomes a score which shows similarity.
I think of something like this:
remove all non-word characters
apply soundex
Your idea sounds like an ontology, but applied to whole phrases. The more similar two phrases are, the closer they are in the graph (assuming you're using weighted edges), and vice versa: dissimilar phrases are very far from each other.
Another approach is to use a Fourier transform to get a sort of 'index' for a given string (it won't be a single number, but always). You may find a bit more in this paper.
And another idea, based on the Levenshtein distance: you may compare n-grams, which will give you a similarity index for two given phrases; the more similar they are, the closer the value is to 1. This may be used to calculate distance in the graph. I wrote a paper on this a few years ago; if you'd like, I can share it.
Anyway: although I don't know the exact solution, I'm also interested in what you come up with.
Maybe use PCA, where the matrix is a list of the differences between the string and a fixed alphabet (à la ABCDEFGHI...). The answer could be simply the length of the principal component.
Just an idea.
ready-to-run PCA in C#
It is unlikely that one can get a fairly small number for each of two phrases such that, when compared, those numbers give a relevant indication of how similar the original phrases are.
One reason is that the number gives an indication in one dimension, while phrases vary in two dimensions, length and intensity.
The number could evolve in length as well as in intensity, but I'm not sure it would help a lot.
In two dimensions, you are better off looking at a matrix; some of its properties, like the determinant (a kind of derivative of the matrix), could give a rough idea of the phrase's trend.
In Natural Language Processing we have a thing called Minimum Edit Distance (also known as Levenshtein Distance).
It's basically defined as the smallest number of operations needed to transform string1 into string2.
The operations include insertion, deletion and substitution; each operation is given a cost, which is added to the distance.
The idea for solving your problem is to calculate the MED from your chosen string to all the other strings, sort that collection, and pick out the n strings with the smallest distances.
For example:
{"Hello World", "Hello World!", "Hello Earth"}
Choosing base-string="Hello World"
Med(base-string, "Hello World!") = 1
Med(base-string, "Hello Earth") = 8
1st closest string is "Hello World!"
This effectively gives a score to each string in your string collection.
C# Implementation (Add-1, Deletion-1, Substitution-2)
public static int Distance(string s1, string s2)
{
    int[,] matrix = new int[s1.Length + 1, s2.Length + 1];
    for (int i = 0; i <= s1.Length; i++)
        matrix[i, 0] = i;
    for (int i = 0; i <= s2.Length; i++)
        matrix[0, i] = i;
    for (int i = 1; i <= s1.Length; i++)
    {
        for (int j = 1; j <= s2.Length; j++)
        {
            int value1 = matrix[i - 1, j] + 1;
            int value2 = matrix[i, j - 1] + 1;
            int value3 = matrix[i - 1, j - 1] + ((s1[i - 1] == s2[j - 1]) ? 0 : 2);
            matrix[i, j] = Math.Min(value1, Math.Min(value2, value3));
        }
    }
    return matrix[s1.Length, s2.Length];
}
Complexity: O(n x m), where n and m are the lengths of the two strings.
More info on Minimum Edit Distance can be found here
Well, you could add up the ASCII values of the characters and then compare the sums, with a maximum value by which they are allowed to differ. This does not guarantee, however, that the strings will be similar, for the same reason that two different strings can have the same hash value.
You could of course make a more complex function, starting by checking the size of the strings and then comparing each character one by one, again with a maximum difference set up.

Resources