String similarity score/hash - algorithm

Is there a method to calculate something like general "similarity score" of a string? In a way that I am not comparing two strings together but rather I get some number (hash) for each string that can later tell me that two strings are or are not similar. Two similar strings should have similar (close) hashes.
Let's consider these strings and scores as an example:
Hello world 1000
Hello world! 1010
Hello earth 1125
Foo bar 3250
FooBarbar 3750
Foo Bar! 3300
Foo world! 2350
You can see that Hello world! and Hello world are similar and their scores are close to each other.
This way, finding the most similar strings to a given string would be done by subtracting given strings score from other scores and then sorting their absolute value.

I believe what you're looking for is called a Locality Sensitive Hash. Whereas most hash algorithms are designed such that small variations in input cause large changes in output, these hashes attempt the opposite: small changes in input generate proportionally small changes in output.
As others have mentioned, there are inherent issues with forcing a multi-dimensional mapping into a 2-dimensional mapping. It's analogous to creating a flat map of the Earth... you can never accurately represent a sphere on a flat surface. Best you can do is find a LSH that is optimized for whatever feature it is you're using to determine whether strings are "alike".

Levenstein distance or its derivatives is the algorithm you want.
Match given string to each of strings from dictionary.
(Here, if you need only fixed number of most similar strings, you may want to use min-heap.)
If running Levenstein distance for all strings in dictionary is too expensive, then use some rough
algorithm first that will exclude too distant words from list of candidates.
After that, run levenstein distance on left candidates.
One way to remove distant words is to index n-grams.
Preprocess dictionary by splitting each of words into list of n-grams.
For example, consider n=3:
(0) "Hello world" -> ["Hel", "ell", "llo", "lo ", "o w", " wo", "wor", "orl", "rld"]
(1) "FooBarbar" -> ["Foo", "ooB", "oBa", "Bar", "arb", "rba", "bar"]
(2) "Foo world!" -> ["Foo", "oo ", "o w", " wo", "wor", "orl", "rld", "ld!"]
Next, create index of n-gramms:
" wo" -> [0, 2]
"Bar" -> [1]
"Foo" -> [1, 2]
"Hel" -> [0]
"arb" -> [1]
"bar" -> [1]
"ell" -> [0]
"ld!" -> [2]
"llo" -> [0]
"lo " -> [0]
"o w" -> [0, 2]
"oBa" -> [1]
"oo " -> [2]
"ooB" -> [1]
"orl" -> [0, 2]
"rba" -> [1]
"rld" -> [0, 2]
"wor" -> [0, 2]
When you need to find most similar strings for given string, you split given string into n-grams and select only those
words from dictionary which have at least one matching n-gram.
This reduces number of candidates to reasonable amount and you may proceed with levenstein-matching given string to each of left candidates.
If your strings are long enough, you may reduce index size by using min-hashing technnique:
you calculate ordinary hash for each of n-grams and use only K smallest hashes, others are thrown away.
P.S. this presentation seems like a good introduction to your problem.

This isn't possible, in general, because the set of edit distances between strings forms a metric space, but not one with a fixed dimension. That means that you can't provide a mapping between strings and integers that preserves a distance measure between them.
For example, you cannot assign numbers to these three phrases:
one two
one six
two six
Such that the numbers reflect the difference between all three phrases.

While the idea seems extremely sweet... I've never heard of this.
I've read many, many, technics, thesis, and scientific papers on the subject of spell correction / typo correction and the fastest proposals revolve around an index and the levenshtein distance.
There are fairly elaborated technics, the one I am currently working on combines:
A Bursted Trie, with level compactness
A Levenshtein Automaton
Even though this doesn't mean it is "impossible" to get a score, I somehow think there would not be so much recent researches on string comparisons if such a "scoring" method had proved efficient.
If you ever find such a method, I am extremely interested :)

Would Levenshtein distance work for you?

In an unbounded problem, there is no solution which can convert any possible sequence of words, or any possible sequence of characters to a single number which describes locality.
Imagine similarity at the character level
stops
spots
hello world
world hello
In both examples the messages are different, but the characters in the message are identical, so the measure would need to hold a position value , as well as a character value. (char 0 == 'h', char 1 == 'e' ...)
Then compare the following similar messages
hello world
ello world
Although the two strings are similar, they could differ at the beginning, or at the end, which makes scaling by position problematic.
In the case of
spots
stops
The words only differ by position of the characters, so some form of position is important.
If the following strings are similar
yesssssssssssssss
yessssssssssssss
Then you have a form of paradox. If you add 2 s characters to the second string, it should share the distance it was from the first string, but it should be distinct. This can be repeated getting progressively longer strings, all of which need to be close to the strings just shorter and longer than them. I can't see how to achieve this.
In general this is treated as a multi-dimensional problem - breaking the string into a vector
[ 'h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd' ]
But the values of the vector can not be
represented by a fixed size number, or
give good quality difference measure.
If the number of words, or length of strings were bounded, then a solution of coding may be possible.
Bounded values
Using something like arithmetic compression, then a sequence of words can be converted into a floating point number which represents the sequence. However this would treat items earlier in the sequence as more significant than the last item in the sequence.
data mining solution
If you accept that the problem is high dimensional, then you can store your strings in a metric-tree wikipedia : metric tree. This would limit your search space, whilst not solving your "single number" solution.
I have code for such at github : clustering
Items which are close together, should be stored together in a part of the tree, but there is really no guarantee. The radius of subtrees is used to prune the search space.
Edit Distance or Levenshtein distance
This is used in a sqlite extension to perform similarity searching, but with no single number solution, it works out how many edits change one string into another. This then results in a score, which shows similarity.

I think of something like this:
remove all non-word characters
apply soundex

Your idea sounds like ontology but applied to whole phrases. The more similar two phrases are, the closer in the graph they are (assuming you're using weighted edges). And vice-versa: non similar phrases are very far from each other.
Another approach, is to use Fourier transform to get sort of the 'index' for a given string (it won't be a single number, but always). You may find little bit more in this paper.
And another idea, that bases on the Levenshtein distance: you may compare n-grams that will give you some similarity index for two given phrases - the more they are similar the value is closer to 1. This may be used to calculate distance in the graph. wrote a paper on this a few years ago, if you'd like I can share it.
Anyways: despite I don't know the exact solution, I'm also interested in what you'll came up with.

Maybe use PCA, where the matrix is a list of the differences between the string and a fixed alphabet (à la ABCDEFGHI...). The answer could be simply the length of the principal component.
Just an idea.
ready-to-run PCA in C#

It is unlikely one can get a rather small number from two phrases that, being compared, provide a relevant indication of the similarity of their initial phrases.
A reason is that the number gives an indication in one dimension, while phrases are evolving in two dimensions, length and intensity.
The number could evolve as well in length as in intensity but I'm not sure it'll help a lot.
In two dimensions, you better look at a matrix, which some properties like the determinant (a kind of derivative of the matrix) could give a rough idea of the phrase trend.

In Natural Language Processing we have a thing call Minimum Edit Distance (also known as Levenshtein Distance)
Its basically defined as the smallest amount of operation needed in order to transform string1 to string2
Operations included Insertion, Deletion, Subsitution, each operation is given a score to which you add to the distance
The idea to solve your problem is to calculate the MED from your chosen string, to all the other string, sort that collection and pick out the n-th first smallest distance string
For example:
{"Hello World", "Hello World!", "Hello Earth"}
Choosing base-string="Hello World"
Med(base-string, "Hello World!") = 1
Med(base-string, "Hello Earth") = 8
1st closest string is "Hello World!"
This have somewhat given a score to each string of your string-collection
C# Implementation (Add-1, Deletion-1, Subsitution-2)
public static int Distance(string s1, string s2)
{
int[,] matrix = new int[s1.Length + 1, s2.Length + 1];
for (int i = 0; i <= s1.Length; i++)
matrix[i, 0] = i;
for (int i = 0; i <= s2.Length; i++)
matrix[0, i] = i;
for (int i = 1; i <= s1.Length; i++)
{
for (int j = 1; j <= s2.Length; j++)
{
int value1 = matrix[i - 1, j] + 1;
int value2 = matrix[i, j - 1] + 1;
int value3 = matrix[i - 1, j - 1] + ((s1[i - 1] == s2[j - 1]) ? 0 : 2);
matrix[i, j] = Math.Min(value1, Math.Min(value2, value3));
}
}
return matrix[s1.Length, s2.Length];
}
Complexity O(n x m) where n, m is length of each string
More info on Minimum Edit Distance can be found here

Well, you could add up the ascii value of each character and then compare the scores, having a maximum value on which they can differ. This does not guarantee however that they will be similar, for the same reason two different strings can have the same hash value.
You could of course make a more complex function, starting by checking the size of the strings, and then comparing each caracter one by one, again with a maximum difference set up.

Related

Finding similar names better than O(n^2)

Say I've got a list of names. Unfortunately, there are some duplicates, but it isn't obvious which of those are duplicates.
Tom Riddle
Tom M. Riddle
tom riddle
Tom Riddle, PhD.
I'm thinking of using Levenshtein distance, and there are definitely other algorithms that do come to mind to compare 2 names at a time.
But in a list of names, regardless of the string distance algorithm, I'll always end up generating a grid of comparison outputs (n^2).
How can I avoid the O(n^2) situation?
Introduction
What you want to do is known as Fuzzy search. Let me guide you through the topic.
First, setup an inverted index (Wikipedia) of n-grams (Wikipedia). That is, split a word like "hello" into, for example 3-grams:
"$$h", "$he", "hel", "ell", "llo", "lo$", "o$$"
And have a map which maps every n-gram to a list of words that contain it:
"$$h" -> ["hello", "helloworld", "hi", "huhu", "hey"]
"$he" -> ["hello", "helloworld", "hey"]
...
"llo" -> ["hello", "helloworld", "llowaddup", "allo"]
...
All words in your database are now indexed by their n-grams. This is why it is called inverted index.
The idea is, given a query, to compute how many n-grams the query has in common with all words in your database. This can be computed fast. After that you can use this to skip computation of the expensive edit distance for a huge set of records. Which dramatically increases the speed. It is the standard approach that all search engines use (more or less).
Let me first explain the general approach by the example of an exact match. After that we will slightly modify it and get to the fuzzy matching.
Exact match
At query time, compute the n-grams of your query, fetch the lists and compute the intersection.
Like if you get "hello" you compute the grams and get:
"$$h", "$he", "hel", "ell", "llo", "lo$", "o$$"
You fetch all lists for all of those n-grams:
List result;
foreach (String nGram) in (query.getNGrams()) {
List words = map.get(nGram);
result = result.intersect(words);
}
The intersection contains all words which match exactly those grams, this is "hello" only.
Note that an exact match can be computed faster by using hashing, like a HashSet.
Fuzzy match
Instead of intersecting the lists, merge them. In order to merge efficiently you should use any k-way merge algorithm, it requires the list of words in your inverted index to be sorted prior though, so make sure to sort it at construction.
You now get a list of all words that have at least one n-gram in common with the query.
We already greatly reduced the set of possible records. But we can do even better. Maintain, for each word, the amount of how many n-grams it has in common with the query. You can easily do that while merging the lists.
Consider the following threshold:
max(|x|, |y|) - 1 - (delta - 1) * n
where x is your query, y the word candidate you are comparing against. n is the value for the n-grams you have used, 3 if 3-gram for example. delta is the value of how many mistakes you allow.
If the count is below that value, you directly know that the edit distance is
ED(x, y) > delta
So you only need to consider words with a count more than the above threshold. Only for those words you compute the edit distance ED(x, y).
By that we extremely reduced the set of possible candidates and only compute the expensive edit distance on a small amount of records.
Example
Suppose you get the query "hilari". Let's use 3-grams. We get
"$$h", "$hi", "hil", "ila", "lar", "ari", "ri$", "i$$"
We search through our inverted index, merge lists of words that have those grams in common and get "hillary", "haemophilia", "solar". Together with those words we counted how many grams they have in common:
"hillary" -> 4 ("$$h", "hi", "hil", "lar")
"haemophilia" -> 2 ("$$h", "hil")
"solar" -> 1 ("lar")
Check each entry against the threshold. Let delta be 2. We get:
4 >= max(|"hilari"|, |"hillary"|) - 4 = 3
2 < max(|"hilari"|, |"haemophilia"|) - 4 = 6
1 < max(|"hilari"|, |"solar"|) - 4 = 2
Only "hillary" is above the threshold, discard the rest. Compute the edit distance for all remaining records:
ED("hilari", "hillary") = 2
Which is not beyond delta = 2, so we accept it.
This will be hard. Accept that you will make mistakes and don't let the perfect be the enemy of the good.
Begin by removing honorifics (Mr, Mrs, Sir, Dr, PhD, Jr, Sr,). Remove common first names (based on a list of first names) and initials and convert all characters to upper case. Create a signature for whatever is left — use Soundex or something similar, or simply remove all vowels and doubled consonants. Sort by signature to bring like names together, then run the full compare only on names with the same signature. That reduces the time complexity to O(n log n) for the sorting plus a little O(k²) for each set of k signatures.
Other answers have approached this as an abstract string problem. If that's what you're after then I think they give good advice. I'm going to assume that you would like to use specific knowledge of how names work, so that, for instance, "Mr. Thomas Riddle, Esq" and "Riddle, Tom" would match "Tom Riddle", but "Tom Griddle" wouldn't.
In general with this kind of problem you define some kind of canonicalization function and look for terms that canonicalize to the same thing. In this case, it seems like your canonical representation of a name ought to include a lower-case version of first and last name, stripped of any titles, and "de-nicknamed" using a nickname-to-formal-name mapping (assuming you want "Tom" and "Thomas" to match). This function would produce "Tom Riddle" -> {first: "tom", last: "riddle"}, "Riddle, Tom" -> {first: "tom", last: "riddle"}, "Tom Riddle, Esq" -> {first: "tom", last: "riddle"}, and so on, but "Tom Griddle" -> {first: "tom", last: "griddle"}.
Once you have a name-canonicalization function, you can create a map (e.g. hashmap or BST) that associates canonical names to a list of uncanonicalized names. For each uncanonicalized name, find the list corresponding to its canonical form in the map and insert it there. Once you're done, all the lists with more than one element are your duplicates.

Splitting a sentence to minimize sentence lengths

I have come across the following problem statement:
You have a sentence written entirely in a single row. You would like to split it into several rows by replacing some of the spaces
with "new row" indicators. Your goal is to minimize the width of the
longest row in the resulting text ("new row" indicators do not count
towards the width of a row). You may replace at most K spaces.
You will be given a sentence and a K. Split the sentence using the
procedure described above and return the width of the longest row.
I am a little lost with where to start. To me, it seems I need to try to figure out every possible sentence length that satisfies the criteria of splitting the single sentence up into K lines.
I can see a couple of edge cases:
There are <= K words in the sentence, therefore return the longest word.
The sentence length is 0, return 0
If neither of those criteria are true, then we have to determine all possible combinations of splitting the sentence and the return the minimum of all those options. This is the part I don't know how to do (and is obviously the heart of the problem).
You can solve it by inverting the problem. Let's say I fix the length of the longest split to L. Can you compute the minimum number of breaks you need to satisfy it?
Yes, you just break before the first word that would go over L and count them up (O(N)).
So now that we have that we just have to find a minimum L that would require less or equal K breaks. You can do a binary search in the length of the input. Final complexity O(NlogN).
First Answer
What you want to achieve is Minimum Raggedness. If you just want the algorithm, it is here as a PDF. If the research paper's link goes bad, please search for the famous paper named Breaking Paragraphs into Lines by Knuth.
However if you want to get your hands over some implementations of the same, in the question Balanced word wrap (Minimum raggedness) in PHP on SO, people have actually given implementation not only in PHP but in C, C++ and bash as well.
Second Answer
Though this is not exactly a correct approach, it is quick and dirty if you are looking for something like that. This method will not return correct answer for every case. It is for those people for whom time to ship their product is more important.
Idea
You already know the length of your input string. Let's call it L;
When putting in K breaks, the best scenario would be to be able to break the string to parts of exactly L / (K + 1) size;
So break your string at that word which makes the resulting sentence part's length least far from L / (K + 1);
My recursive solution, which can be improved through memoization or dynamic programming.
def split(self,sentence, K):
if not sentence: return 0
if ' ' not in sentence or K == 0: return len(sentence)
spaces = [i for i, s in enumerate(sentence) if s == ' ']
res = 100000
for space in spaces:
res = min(res, max(space, self.split(sentence[space+1:], K-1)))
return res

OCR'ed and real string similarity

The problem:
There is a set of word S = {W1,W2.. Wn} where n < 10. This set just exists, we do not know its content.
These words are drawn on some image and then recognized. The OCR algorytm is poor as well as dpi and as a result there are mistakes. So we have a second set of errorneous words S' = {W1',W2'..Wn'}
Now we have a word W that is a member of original set S. And now I need and algorythm which, given W and S', return index of the word in S'. most similar to W.
Example. S is {"alpha", "bravo", "charlie"}, S' is for example {"alPha","hravc","onarlio"} (these are real possible ocr erros).
So the target function should return F("alpha") => 0, F("bravo") => 1, F("charlie") => 2
I tried Levenshtein distance, but it does not work well, because it returns small numbers on small strings and OCRed string can be longer than original.
Example if W' is {'hornist','cornrnunist'} and the given word is 'communist' the Levenshtein distance is 4 for the both words, but the right one is second.
Any suggestions?
As a zero approach, I'd suggest you to use the modification of Levenshtein distance algorithm with conditional cost of replacing/deleting/adding characters:
Distance(i, j) = min(Distance(i-1, j-1) + replace_cost(a.charAt(i), b.charAt(j)),
Distance(i-1, j ) + insert_cost(b.charAt(j)),
Distance(i , j-1) + delete_cost(a.charAt(i)))
You can implement function replace_cost in such way, that it will returns small values for visually similar characters (and high values for visually different characters), e.g.:
// visually similar characters
replace_cost('o', '0') = 0.1
replace_cost('o', 'O') = 0.1
replace_cost('O', '0') = 0.1
...
// visually different characters
replace_cost('O', 'K') = 0.9
...
And the similar approach can be used for insert_cost and delete_cost (e.g. you may notice, that during the OCR - some characters are more likely to disappear than others).
Also, in case when approach from above is not enough for you, I'd suggest you to look at Noisy channel model - which is widely used for spelling correction (this subject described very well in Natural Language Processing course by Dan Jurafsky, Christopher Manning - "Week 2 - Spelling Correction").
This appears to be quite difficult to do because the misread strings are not necessarily textually similar to the input, which is why Levinshtein distance won't work for you. The words are visually corrupted, not simply mistyped. You could try creating a dataset of common errors (o => 0, l -> 1, e => o) and then do some sort of comparison based on that.
If you have access to the OCR algorithm, you could run that algorithm again on a much broader set of inputs (with known outputs) and train a neural network to recognize common errors. Then you could use that model to predict mistakes in your original dataset (maybe overkill for an array of only ten items).

Fastest way to find minimal Hamming distance to any substring?

Given a long string L and a shorter string S (the constraint is that L.length must be >= S.length), I want to find the minimum Hamming distance between S and any substring of L with length equal to S.length. Let's call the function for this minHamming(). For example,
minHamming(ABCDEFGHIJ, CDEFGG) == 1.
minHamming(ABCDEFGHIJ, BCDGHI) == 3.
Doing this the obvious way (enumerating every substring of L) requires O(S.length * L.length) time. Is there any clever way to do this in sublinear time? I search the same L with several different S strings, so doing some complicated preprocessing to L once is acceptable.
Edit: The modified Boyer-Moore would be a good idea, except that my alphabet is only 4 letters (DNA).
Perhaps surprisingly, this exact problem can be solved in just O(|A|nlog n) time using Fast Fourier Transforms (FFTs), where n is the length of the larger sequence L and |A| is the size of the alphabet.
Here is a freely available PDF of a paper by Donald Benson describing how it works:
Fourier methods for biosequence analysis (Donald Benson, Nucleic Acids Research 1990 vol. 18, pp. 3001-3006)
Summary: Convert each of your strings S and L into several indicator vectors (one per character, so 4 in the case of DNA), and then convolve corresponding vectors to determine match counts for each possible alignment. The trick is that convolution in the "time" domain, which ordinarily requires O(n^2) time, can be implemented using multiplication in the "frequency" domain, which requires just O(n) time, plus the time required to convert between domains and back again. Using the FFT each conversion takes just O(nlog n) time, so the overall time complexity is O(|A|nlog n). For greatest speed, finite field FFTs are used, which require only integer arithmetic.
Note: For arbitrary S and L this algorithm is clearly a huge performance win over the straightforward O(mn) algorithm as |S| and |L| become large, but OTOH if S is typically shorter than log|L| (e.g. when querying a large DB with a small sequence), then obviously this approach provides no speedup.
UPDATE 21/7/2009: Updated to mention that the time complexity also depends linearly on the size of the alphabet, since a separate pair of indicator vectors must be used for each character in the alphabet.
Modified Boyer-Moore
I've just dug up some old Python implementation of Boyer-Moore I had lying around and modified the matching loop (where the text is compared to the pattern). Instead of breaking out as soon as the first mismatch is found between the two strings, simply count up the number of mismatches, but remember the first mismatch:
current_dist = 0
while pattern_pos >= 0:
if pattern[pattern_pos] != text[text_pos]:
if first_mismatch == -1:
first_mismatch = pattern_pos
tp = text_pos
current_dist += 1
if current_dist == smallest_dist:
break
pattern_pos -= 1
text_pos -= 1
smallest_dist = min(current_dist, smallest_dist)
# if the distance is 0, we've had a match and can quit
if current_dist == 0:
return 0
else: # shift
pattern_pos = first_mismatch
text_pos = tp
...
If the string did not match completely at this point, go back to the point of the first mismatch by restoring the values. This makes sure that the smallest distance is actually found.
The whole implementation is rather long (~150LOC), but I can post it on request. The core idea is outlined above, everything else is standard Boyer-Moore.
Preprocessing on the Text
Another way to speed things up is preprocessing the text to have an index on character positions. You only want to start comparing at positions where at least a single match between the two strings occurs, otherwise the Hamming distance is |S| trivially.
import sys
from collections import defaultdict
import bisect
def char_positions(t):
pos = defaultdict(list)
for idx, c in enumerate(t):
pos[c].append(idx)
return dict(pos)
This method simply creates a dictionary which maps each character in the text to the sorted list of its occurrences.
The comparison loop is more or less unchanged to naive O(mn) approach, apart from the fact that we do not increase the position at which comparison is started by 1 each time, but based on the character positions:
def min_hamming(text, pattern):
best = len(pattern)
pos = char_positions(text)
i = find_next_pos(pattern, pos, 0)
while i < len(text) - len(pattern):
dist = 0
for c in range(len(pattern)):
if text[i+c] != pattern[c]:
dist += 1
if dist == best:
break
c += 1
else:
if dist == 0:
return 0
best = min(dist, best)
i = find_next_pos(pattern, pos, i + 1)
return best
The actual improvement is in find_next_pos:
def find_next_pos(pattern, pos, i):
smallest = sys.maxint
for idx, c in enumerate(pattern):
if c in pos:
x = bisect.bisect_left(pos[c], i + idx)
if x < len(pos[c]):
smallest = min(smallest, pos[c][x] - idx)
return smallest
For each new position, we find the lowest index at which a character from S occurs in L. If there is no such index any more, the algorithm will terminate.
find_next_pos is certainly complex, and one could try to improve it by only using the first several characters of the pattern S, or use a set to make sure characters from the pattern are not checked twice.
Discussion
Which method is faster largely depends on your dataset. The more diverse your alphabet is, the larger will be the jumps. If you have a very long L, the second method with preprocessing might be faster. For very, very short strings (like in your question), the naive approach will certainly be the fastest.
DNA
If you have a very small alphabet, you could try to get the character positions for character bigrams (or larger) rather than unigrams.
You're stuck as far as big-O is concerned.. At a fundamental level, you're going to need to test if every letter in the target matches each eligible letter in the substring.
Luckily, this is easily parallelized.
One optimization you can apply is to keep a running count of mismatches for the current position. If it's greater than the lowest hamming distance so far, then obviously you can skip to the next possibility.

Map strings to numbers maintaining the lexicographic ordering

I'm looking for an algorithm or function that is able to map a string to a number in such way that the resulting values correspond the lexicographic ordering of strings. Example:
"book" -> 50000
"car" -> 60000
"card" -> 65000
"a longer string" -> 15000
"another long string" -> 15500
"awesome" -> 16000
As a function it should be something like: f(x) = y, so that for any x1 < x2 => f(x1) < f(x2), where x is an arbitrary string and y is a number.
If the input set of x is finite, then I could always do a sort and assign the proper values, but I'm looking for something generic for an unlimited input set for x.
If you require that f map to integers this is impossible.
Suppose that there is such a map f. Consider the strings a, aa, aaa, etc. Consider the values f(a), f(aa), f(aaa), etc. As we require that f(a) < f(aa) < f(aaa) < ... we see that f(a_n) tends to infinity as n tends to infinity; here I am using the obvious notation that a_n is the character a repeated n times. Now consider the string b. We require that f(a_n) < f(b) for all n. But f(b) is some finite integer and we just showed that f(a_n) goes to infinity. We have a contradiction. No such map is possible.
Maybe you could tell us what you need this for? This is fairly abstract and we might be able to suggest something more suitable. Further, don't necessarily worry about solving "it" generally. YAGNI and all that.
As a corollary to Jason's answer, if you can map your strings to rational numbers, such a mapping is very straightforward. If code(c) is the ASCII code of the character c and s[i] is theith character in the string s, just sum like follows:
result <- 0
scale <- 1
for i from 1 to length(s)
scale <- scale / 26
index <- (1 + code(s[i]) - code('a'))
result <- result + index / scale
end for
return result
This maps the empty string to 0, and every other string to a rational number between 0 and 1, maintaining lexicographical order. If you have arbitrary-precision decimal floating-point numbers, you can replace the division by powers of 26 with powers of 100 and still have exactly representable numbers; with arbitrary precision binary floating-point numbers, you can divide by powers of 32.
what you are asking for is a a temporary suspension of the pigeon hole principle (http://en.wikipedia.org/wiki/Pigeonhole_principle).
The strings are the pigeons, the numbers are the holes.
There are more pigeons than holes, so you can't put each pigeon in its own hole.
You would be much better off writing a comparator which you can supply to a sort function. The comparator takes two strings and returns -1, 0, or 1. Even if you could create such a map, you still have to sort on it. If you need both a "hash" and the order, then keep stuff in two data structures - one that preserves the order, and one that allows fast access.
Maybe a Radix Tree is what you're looking for?
A radix tree, Patricia trie/tree, or
crit bit tree is a specialized set
data structure based on the trie that
is used to store a set of strings. In
contrast with a regular trie, the
edges of a Patricia trie are labelled
with sequences of characters rather
than with single characters. These can
be strings of characters, bit strings
such as integers or IP addresses, or
generally arbitrary sequences of
objects in lexicographical order.
Sometimes the names radix tree and
crit bit tree are only applied to
trees storing integers and Patricia
trie is retained for more general
inputs, but the structure works the
same way in all cases.
LWN.net also has an article describing this data structures use in the Linux kernel.
I have post a question here https://stackoverflow.com/questions/22798824/what-lexicographic-order-means
As workaround you can append empty symbols with code zero to right side of the string, and use expansion from case II.
Without such expansion with extra empty symbols I' m actually don't know how to make such mapping....
But if you have a finite set of Symbols (V), then |V*| is eqiualent to |N| -- fact from Disrete Math.

Resources