How to determine character similarity? - algorithm

I am using the Levenshtein distance to find similar strings after OCR. However, for some strings the edit distance is the same, although the visual appearance is obviously different.
For example the string Co will return these matches:
CY (1)
CZ (1)
Ca (1)
Considering that Co is the result from an OCR engine, Ca would be a more likely match than the others. Therefore, after calculating the Levenshtein distance, I'd like to refine the query result by ordering by visual similarity. To calculate this similarity I'd like to use a standard sans-serif font, like Arial.
Is there a library I can use for this purpose, or how could I implement this myself? Alternatively, are there any string similarity algorithms that are more accurate than the Levenshtein distance, which I could use in addition?

If you're looking for a table that will allow you to calculate a 'replacement cost' of sorts based on visual similarity, I've been searching for such a thing for awhile with little success, so I started looking at it as a new problem. I'm not working with OCR, but I am looking for a way to limit the search parameters in a probabilistic search for mis-typed characters. Since they are mis-typed because a human has confused the characters visually, the same principle should apply to you.
My approach was to categorize letters based on their stroke components in an 8-bit field. The bits are, left to right:
7: Left Vertical
6: Center Vertical
5: Right Vertical
4: Top Horizontal
3: Middle Horizontal
2: Bottom Horizontal
1: Top-left to bottom-right stroke
0: Bottom-left to top-right stroke
For lower-case characters, descenders on the left are recorded in bit 1, and descenders on the right in bit 0, as diagonals.
With that scheme, I came up with the following values which attempt to rank the characters according to visual similarity.
m: 11110000: F0
g: 10111101: BD
S,B,G,a,e,s: 10111100: BC
R,p: 10111010: BA
q: 10111001: B9
P: 10111000: B8
Q: 10110110: B6
D,O,o: 10110100: B4
n: 10110000: B0
b,h,d: 10101100: AC
H: 10101000: A8
U,u: 10100100: A4
M,W,w: 10100011: A3
N: 10100010: A2
E: 10011100: 9C
F,f: 10011000: 98
C,c: 10010100: 94
r: 10010000: 90
L: 10000100: 84
K,k: 10000011: 83
T: 01010000: 50
t: 01001000: 48
J,j: 01000100: 44
Y: 01000011: 43
I,l,i: 01000000: 40
Z,z: 00010101: 15
A: 00001011: 0B
y: 00000101: 05
V,v,X,x: 00000011: 03
This, as it stands, is too primitive for my purposes and requires more work. You may be able to use it, however, or perhaps adapt it to suit your purposes. The scheme is fairly simple. This ranking is for a mono-space font. If you are using a sans-serif font, then you likely have to re-work the values.
This table is a hybrid table including all characters, lower- and upper-case. Splitting it into an upper-case-only and a lower-case-only table might prove more effective, and would also allow you to apply specific casing penalties.
Keep in mind that this is early experimentation. If you see a way to improve it (for example by changing the bit-sequencing) by all means feel free to do so.
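One possible way to turn the table above into a ranking is to count how many stroke bits differ between two characters (the Hamming distance of their bit fields). This is only a sketch: the answer does not say how the values should be compared, so the distance measure and the names STROKES, stroke_distance and rank_by_strokes are assumptions, and candidates are assumed to have the same length as the OCR output.

```python
# Stroke bit fields copied from the table above (hex values).
STROKES = {}
for chars, value in [
    ("m", 0xF0), ("g", 0xBD), ("SBGaes", 0xBC), ("Rp", 0xBA), ("q", 0xB9),
    ("P", 0xB8), ("Q", 0xB6), ("DOo", 0xB4), ("n", 0xB0), ("bhd", 0xAC),
    ("H", 0xA8), ("Uu", 0xA4), ("MWw", 0xA3), ("N", 0xA2), ("E", 0x9C),
    ("Ff", 0x98), ("Cc", 0x94), ("r", 0x90), ("L", 0x84), ("Kk", 0x83),
    ("T", 0x50), ("t", 0x48), ("Jj", 0x44), ("Y", 0x43), ("Ili", 0x40),
    ("Zz", 0x15), ("A", 0x0B), ("y", 0x05), ("VvXx", 0x03),
]:
    for ch in chars:
        STROKES[ch] = value


def stroke_distance(a, b):
    """Number of stroke bits that differ between two characters (assumed metric)."""
    return bin(STROKES[a] ^ STROKES[b]).count("1")


def rank_by_strokes(ocr_text, candidates):
    """Sort equal-length candidates by summed per-character stroke distance."""
    def total(candidate):
        return sum(stroke_distance(a, b) for a, b in zip(ocr_text, candidate))
    return sorted(candidates, key=total)


ranked = rank_by_strokes("Co", ["CY", "CZ", "Ca"])
print(ranked)  # ['Ca', 'CZ', 'CY']
```

With these table values, o and a differ in one bit, o and Z in three, and o and Y in seven, so Ca is ranked first.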

In general I've seen Damerau-Levenshtein used much more often than plain Levenshtein; it basically adds the transposition operation. Its four operations together are supposed to account for more than 80% of human misspellings, so you should certainly consider it.
As to your specific problem, you could easily modify the algorithm to increase the cost when substituting a capital letter with a lowercase letter, and vice versa, to obtain something like this:
dist(Co, CY) = 2
dist(Co, CZ) = 2
dist(Co, Ca) = 1
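A minimal sketch of that idea, assuming a case-changing substitution costs 2 and any other substitution costs 1. For brevity it is plain Levenshtein without the transposition step, and the function names are made up for the example.

```python
def substitution_cost(a, b):
    """0 for identical characters, 2 if the case differs, 1 otherwise (assumed costs)."""
    if a == b:
        return 0
    if a.isupper() != b.isupper():
        return 2
    return 1


def case_aware_distance(s, t):
    """Levenshtein distance with the case-aware substitution cost above."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(
                prev[j] + 1,                            # delete a
                cur[j - 1] + 1,                         # insert b
                prev[j - 1] + substitution_cost(a, b),  # substitute a with b
            ))
        prev = cur
    return prev[-1]


for candidate in ("CY", "CZ", "Ca"):
    print(candidate, case_aware_distance("Co", candidate))
```

This reproduces the three distances listed above: 2, 2 and 1.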

So in your distance function just have a different cost for replacing different pairs of characters.
That is, rather than a replacement adding a set cost of one or two irrespective of the characters involved, have a replace cost function that returns something between 0.0 and 2.0 for the cost of replacing certain characters in certain contexts.
At each step of the memoization, just call this cost function:
cost[x][y] = min(
    cost[x-1][y] + 1, // insert
    cost[x][y-1] + 1, // delete
    cost[x-1][y-1] + cost_to_replace(a[x], b[y]) // replace
)
Here is my full Edit Distance implementation, just swap the replace_cost constant for a replace_cost function as shown:
In terms of implementing the cost_to_replace function, you need a matrix of characters with costs based on how similar the characters are. There may be such a table floating around, or you could build it yourself by rendering each pair of characters to a pair of images and then comparing the images for similarity using standard vision techniques.
Alternatively you could use a supervised method whereby you correct several OCR misreads and note the occurrences in a table that will then become the above cost table (i.e. if the OCR gets it wrong, then the characters must be similar).
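The supervised variant could be sketched roughly as follows. The correction data is invented for illustration, and the formula that maps confusion counts to a cost between 0.0 and 2.0 is an assumption, not something the answer specifies.

```python
from collections import Counter


def build_cost_to_replace(corrections, max_cost=2.0):
    """Build a replace-cost function from (ocr_char, true_char) pairs.

    Pairs that were confused more often get a lower cost; unseen pairs
    get max_cost. The mapping max_cost / (1 + count) is an assumed choice.
    """
    counts = Counter((ocr, truth) for ocr, truth in corrections if ocr != truth)

    def cost_to_replace(a, b):
        if a == b:
            return 0.0
        seen = counts[(a, b)] + counts[(b, a)]  # treat confusions as symmetric
        return max_cost / (1 + seen)

    return cost_to_replace


# Hypothetical log of manually corrected OCR misreads.
corrections = [("o", "a"), ("o", "a"), ("a", "o"), ("l", "I")]
cost_to_replace = build_cost_to_replace(corrections)

print(cost_to_replace("o", "a"))  # 0.5, confused three times
print(cost_to_replace("o", "Y"))  # 2.0, never confused
```

The returned function can be plugged into the min(...) step shown earlier in place of a constant replace cost.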


Mutual exclusivity in Problog

We have
4 different storage spaces, and
5 different boxes (named b1, b2, b3, b4 and b5) which we want to put in these storage spaces.
Each storage space can be filled with only one unique box at a time.
*But b5 has a special condition which allows it to be used in multiple storage spaces at the same time.
Each box has a specific weight assigned to it (b1:4, b2:6, b3:5, b4:6 and b5:5).
Each box has a specific probability of being placed in the storage spaces (b1:1, b2:0.6, b3:1, b4:0.8, b5:1).
We want the probable contents of the storage spaces, and their probabilities, given that the total weight is 22 (we will use this as evidence).
For example :
SS1 - b2(6)
SS2 - b5(5)
SS3 - b4(6)
SS4 - b5(5)
Where the total weight will be 22
And the probability of this content.
With my code below I get totalboxweight(b2, b5, b4, b5, 22) as one of the probable contents, which is okay for me. It means that box b2 is in the first storage space, b5 is in the second storage space, and so on.
Here is my code so far. I added comments to explain my intentions.
But I need help to update it so that it includes the probabilities and applies the conditions I talked about.
box(b5,5). % I tried to define the boxes but I don't know how to assign probabilities to them in this format
total(D1,D2,D3,D4,Sum) :-
    Sum is D1+D2+D3+D4. % I defined the sum calculation
totalboxweight(A,B,C,D,Sum) :-
    box(A,D1), box(B,D2), box(C,D3), box(D,D4),
    total(D1,D2,D3,D4,Sum). % I sum up all weights
sumtotal(Sum) :-
    box(A,D1), box(B,D2), box(C,D3), box(D,D4),
    total(D1,D2,D3,D4,Sum). % I defined this one to use it as evidence
evidence(sumtotal(22),true). % if we know the total weight is 22
query(totalboxweight(D1,D2,D3,D4,22)). % what is the probable content
I am using an online Problog editor to test my code. Here is the link.
I am trying to do this in Problog, not Prolog, so the syntax is different.
With the help of the answers I have overcome some issues. The problems I still have are:
I couldn't apply the probabilities.
I couldn't apply the condition (each storage space can be filled with only one unique box at a time, but b5 has a special condition which allows it to be used in multiple storage spaces at the same time).
Thank you in advance.
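To make the intended result concrete, here is a brute-force sketch in Python rather than Problog. It is not a Problog solution. It rests on several assumptions that the question does not confirm: every storage space is filled, each box other than b5 may appear at most once, an assignment is weighted by the product of the probabilities of the boxes it uses, and the result is normalized over all assignments that reach the required total weight.

```python
from itertools import product

WEIGHT = {"b1": 4, "b2": 6, "b3": 5, "b4": 6, "b5": 5}
PROB = {"b1": 1.0, "b2": 0.6, "b3": 1.0, "b4": 0.8, "b5": 1.0}
REUSABLE = {"b5"}  # boxes allowed in several storage spaces at once


def is_valid(assignment):
    """Every non-reusable box appears at most once."""
    return all(assignment.count(b) == 1 for b in set(assignment) - REUSABLE)


def posterior(total_weight, spaces=4):
    """Probability of each valid assignment given the total weight (assumed model)."""
    scores = {}
    for assignment in product(WEIGHT, repeat=spaces):
        if not is_valid(assignment):
            continue
        if sum(WEIGHT[b] for b in assignment) != total_weight:
            continue
        score = 1.0
        for b in assignment:
            score *= PROB[b]
        scores[assignment] = score
    norm = sum(scores.values())
    if norm == 0:
        return {}
    return {a: s / norm for a, s in scores.items()}


result = posterior(22)
for assignment, p in sorted(result.items()):
    print(assignment, round(p, 4))
```

Under these assumptions only the combinations b2, b4, b5, b5 and b2, b3, b4, b5 (in any order) reach a total of 22, including the example (b2, b5, b4, b5) from the question.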

Huffman encoding with variable length symbols

I'm thinking of using a Huffman code to compress text, but with symbols of variable length (strings). For example (using an underscore as a space):
huffman-code | symbol
00 | _
01 | E
100 | THE
101 | A
1100 | UP
1101 | DOWN
11100 | .
11101 |
How can I construct the frequency table? Obviously there are some overlapping issues: the sequence _TH would appear nearly as often as THE, but would be useless in the table (both _ and THE have short Huffman codes).
Does such an algorithm exist? Does it have a special name? What would be the tricks to generate the frequency table? Do I need to tokenize the input? I did not find anything in the literature or on the web. (All this also makes me think of radix trees.)
I was thinking of using an iterative process:
1. Generate a Huffman tree for all symbols of length 1 to N
2. Remove from the tree all symbols with length > 1 whose count is below a certain threshold
3. Generate a second Huffman tree, but this time tokenizing the input with the previous one (probably using a radix tree for lookup)
4. Go back to step 1 and repeat until we converge (or for a few iterations)
But I can't figure out how I can prevent the problem of overlaps (_TH vs THE) with this.

As long as you tokenize the text properly you don't have to worry about the overlap problem. You can define each token to be a word (longest continuous run of characters), a punctuation symbol or a whitespace character (' ', '\t', '\n'). Thus by definition the tokens/symbols do not overlap.
But using Huffman coding directly isn't ideal for compressing text since it cannot make use of the dependencies between the symbols. For example, 'q' is likely followed by 'u', 'qu' is likely followed by a vowel, 'thank' is likely followed by 'you' and so on. You may want to look into a higher-order encoder like 'LZ' which can exploit this redundancy by converting the data into a sequence of lookup addresses, copy lengths, and deviating symbols. Here's an example of how LZ works. You can then apply Huffman coding to each of the three streams to further compress the data. The DEFLATE algorithm works essentially this way.
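The tokenize-then-Huffman idea can be sketched like this. The regular expression is one possible reading of "word, punctuation symbol or whitespace character", and the function names are made up for the example.

```python
import heapq
import re
from collections import Counter
from itertools import count


def tokenize(text):
    """Split into words, single whitespace characters and single other symbols."""
    return re.findall(r"\w+|\s|[^\w\s]", text)


def huffman_codes(tokens):
    """Return a dict mapping each distinct token to its Huffman bit string."""
    freq = Counter(tokens)
    if not freq:
        return {}
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(n, next(tie), {tok: ""}) for tok, n in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {tok: "0" for tok in heap[0][2]}
    while len(heap) > 1:
        n1, _, codes1 = heapq.heappop(heap)
        n2, _, codes2 = heapq.heappop(heap)
        merged = {tok: "0" + code for tok, code in codes1.items()}
        merged.update({tok: "1" + code for tok, code in codes2.items()})
        heapq.heappush(heap, (n1 + n2, next(tie), merged))
    return heap[0][2]


text = "the cat and the dog"
tokens = tokenize(text)
codes = huffman_codes(tokens)
encoded = "".join(codes[tok] for tok in tokens)
print(codes)
print(len(encoded), "bits")
```

Because every character belongs to exactly one token, the tokens can be concatenated back into the original text, and a sequence like _TH never competes with THE.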

This is not a complete solution.
Since you have to store both the sequence and the lookup table, maybe you can greedily pick symbols that minimize the storage cost.
Step 1: Store all the symbols of length at most k in a trie and keep track of their counts.
Step 2: For each probable symbol, calculate the space saved (or compression ratio).
Encode_length(symbol) = log(N) - log(count(symbol))
Space_saved(symbol) = length(symbol)*count(symbol) - Encode_length(symbol)*count(symbol) - (length(symbol)+Encode_length(symbol))
N is the total frequency of all symbols (which we don't know yet, maybe approximate?).
Step 3: Select the optimal symbol and subtract the frequency of other symbols that overlap with it.
Step 4: If the whole sequence is not encoded yet, pick the next optimal symbol (i.e. go to step 2).
NOTE: This is just an outline and it is neither complete nor computationally efficient. If you are looking for a practical quick solution you should use krjampani's solution. This answer is purely academic.
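Steps 1 and 2 might look like the following sketch. Several choices here are assumptions that the outline leaves open: a Counter stands in for the trie, overlapping occurrences are all counted, the logarithm is base 2, symbol length is converted to bits at 8 bits per character so that both terms use the same unit, and N is approximated by the text length.

```python
import math
from collections import Counter


def substring_counts(text, k):
    """Count every substring of length 1..k (overlapping occurrences included)."""
    counts = Counter()
    for length in range(1, k + 1):
        for i in range(len(text) - length + 1):
            counts[text[i:i + length]] += 1
    return counts


def space_saved(symbol, symbol_count, total, bits_per_char=8):
    """Space_saved from Step 2, with lengths expressed in bits (assumed unit)."""
    raw_bits = len(symbol) * bits_per_char
    encode_length = math.log2(total) - math.log2(symbol_count)
    table_entry = raw_bits + encode_length  # cost of storing the symbol itself
    return raw_bits * symbol_count - encode_length * symbol_count - table_entry


text = "the cat the dog the end"
counts = substring_counts(text, 3)
total = len(text)  # rough stand-in for N
savings = {s: space_saved(s, n, total) for s, n in counts.items()}
best = max(savings, key=savings.get)
print(best, round(savings[best], 2))
```

Steps 3 and 4 (subtracting the counts of overlapping symbols and repeating) are not covered by this sketch.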

What are the design decisions behind Google Maps encoded polyline algorithm format?

Several Google Maps products have the notion of polylines, which in terms of underlying data is basically just a sequence of lat/lng points that might, for example, manifest as a line drawn on a map. The Google Maps developer libraries make use of an encoded polyline format that produces an ASCII string representing the points making up the polyline. This encoded format is then typically decoded with a built-in function of the Google libraries or a function written by a third party that implements the decoding algorithm.
The algorithm for encoding polyline points is described in the Encoded Polyline Algorithm Format document. What is not described is the rationale for implementing the algorithm this way, and the significance of each of the individual steps. I'm interested to know whether the thinking/purpose behind implementing the algorithm this way is publicly described anywhere. Two example questions:
Do some of the steps have a quantifiable impact on compression and how does this impact vary as a function of the delta between points?
Is the summing of values with ASCII 63 a compatibility hack of some sort?
But just in general, a description to go along with the algorithm explaining why the algorithm is implemented the way it is.

Update: This blog post from James Snook also makes the 'valid ASCII' range argument and gives a logical explanation for the other steps I wondered about, e.g. the left shift before storing, which makes room for the sign bit as the first bit.
Some explanations I found, not sure if everything is 100% correct.
One double value is stored in multiple 5-bit chunks, and 0x20 (binary '0010 0000') is used as an indication that the next 5-bit chunk belongs to the current double.
0x1f (binary '0001 1111') is used as a bit mask to throw away the other bits.
I expect that 5 bits are used because the deltas of lats or lons are in this range, so that every double value takes only 5 bits on average when done for a lot of examples (but I have not verified this yet).
Now, compression is done by assuming that nearby double values are very close, so that their difference is nearly 0 and the result fits in a few bytes. This result is then stored in a dynamic fashion: store 5 bits, and if the value is longer, mark the chunk with 0x20 and store the next 5 bits, and so on. So I guess you could tweak the compression by trying 6 or 4 bits, but I guess 5 is a practically reasonable choice.
Now regarding the magic 63: this is 0x3f, binary 0011 1111. I'm not sure why they add it. I thought that adding 63 would give some 'better' ASCII characters (e.g. ones allowed in XML or in a URL), as we skip e.g. 62, which is >, but is 63, which is ?, really better? At least the first ASCII chars are not displayable and have to be avoided. Note that if one used 64, then one would hit the ASCII char 127 for the maximum value of 31 (31+64+32), and this char is not defined in HTML4. Or is it because a signed char goes from -128 to 127 and we need to store the negative numbers as positive, thus adding the maximum possible negative number?
Just for me: here is a link to an official Java implementation with Apache License
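For reference, the steps discussed above (scale and round, encode deltas, left shift, invert negatives, split into 5-bit chunks, set the 0x20 continuation bit, add 63) can be sketched in Python as follows. This follows my reading of the published format rather than any official implementation, and the function names are made up.

```python
def encode_value(value):
    """Encode one signed integer (a scaled coordinate delta) as ASCII characters."""
    # The left shift frees the lowest bit for the sign; negatives are inverted.
    value = ~(value << 1) if value < 0 else value << 1
    out = []
    while value >= 0x20:
        # Low 5 bits, with 0x20 set to signal that more chunks follow.
        out.append(chr((0x20 | (value & 0x1F)) + 63))
        value >>= 5
    out.append(chr(value + 63))  # last chunk, continuation bit clear
    return "".join(out)


def encode_polyline(points, precision=5):
    """Encode a list of (lat, lng) pairs; each point is stored as a delta."""
    factor = 10 ** precision
    out = []
    prev_lat = prev_lng = 0
    for lat, lng in points:
        ilat, ilng = round(lat * factor), round(lng * factor)
        out.append(encode_value(ilat - prev_lat))
        out.append(encode_value(ilng - prev_lng))
        prev_lat, prev_lng = ilat, ilng
    return "".join(out)


print(encode_polyline([(38.5, -120.2), (40.7, -120.95), (43.252, -126.453)]))
```

Each output character is a 6-bit value (5 data bits plus the continuation bit) plus 63, so it always falls in the printable range 63 to 126.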

How can I replace all non-words in a phrase, with the exception of numbers followed or preceded by characters?

Let us take a ruby array of sentences. Within the array we have
Sentences containing only words
Sentences containing phone numbers
Sentences containing numeric values with units of measurement
In this case we may have things that look like this: 1mL, 55mL, 1 mL, etc
Sentences containing quantities denoted as 1x or 5 x.
I am trying to construct a ruby regexp for the gsub or scan functions, such that I clean up the above sentences array to only be left with the words (1), units of measurement (3), and quantities (4) in each sentence, but clean up all non-word characters, such as phone numbers (2) and any other delimiting characters such as \t.
I've got this so far (with sentences being the array):
sentences.each do |sentence|
  sentence.gsub!(/(?:(\d+)(?:[xX])|([xX])(?:\d+)[^a-zA-Z ])/, "")
end
Unfortunately, that replaces the exact opposite of what I want to replace. And, it doesn't account for cases where units of measurement are what I want to preserve at all.
Example inputs and outputs:
input: Lavender top (6 mL size preferred)
output: Lavender top (6 mL size preferred)
input: Blood & bone marrow aspirate: 15 mL centrifuge tube with transport media. Available from Cytogenetics, 415-123-4567.
output: Blood & bone marrow aspirate: 15 mL centrifuge tube with transport media. Available from Cytogenetics, .
input: Gold top x1, Lt. Green top x 1, Lavender top x1
output: Gold top x1, Lt. Green top x 1, Lavender top x1
So, effectively, replace numbers and other non-alpha characters, but only when the numbers don't denote measurements or quantities
I've been playing on rubular for about 3 hours to no avail. I think I might be misunderstanding look-aheads completely or just missing one key gotcha moment.
Looking forward to the regexp experts chiming in!

This could perhaps be a start (assuming your array is called sentences):
sentences.map! { |x| x.gsub(/(?<!x\s|x)[\d-]+(?!\s?\w\w?)/i, '') }
# (?<!x\s|x) Don't match if preceded by an x or by x+space
# [\d-]+ Match digits (and hyphens, so phone numbers match as a whole)
# (?!\s?\w\w?) Make sure it is not followed by a one- or two-character word. Here you could be more specific if it causes trouble.
# /expression/i Make the whole thing case insensitive.
This works on your sample data, but there may be other cases not taken care of:
The regex only matches the 415-123-4567 in your sample data.
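As a quick check of that claim, here is the same pattern ported to Python and run against the three sample inputs from the question. Python's re module only accepts fixed-width lookbehinds, so the alternation (?<!x\s|x) is split into two separate lookbehinds; apart from that the pattern is unchanged. The function name is made up.

```python
import re

# Same pattern as above; the lookbehind is split because Python's re
# requires each lookbehind to have a fixed width.
PATTERN = re.compile(r"(?<!x\s)(?<!x)[\d-]+(?!\s?\w\w?)", re.IGNORECASE)


def strip_stray_numbers(sentence):
    """Remove digit/hyphen runs that are neither quantities nor measurements."""
    return PATTERN.sub("", sentence)


samples = [
    "Lavender top (6 mL size preferred)",
    "Blood & bone marrow aspirate: 15 mL centrifuge tube with transport media. "
    "Available from Cytogenetics, 415-123-4567.",
    "Gold top x1, Lt. Green top x 1, Lavender top x1",
]
cleaned = [strip_stray_numbers(s) for s in samples]
for line in cleaned:
    print(line)
```

Note that [\d-]+ would also remove a standalone hyphen that is not followed by a word character, which is one of the cases not taken care of.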

Deducing string transformation rules

I have a set of pairs of character strings, e.g.:
abba - aba,
haha - aha,
baa - ba,
exb - esp,
xa - za
The second (right) string in the pair is somewhat similar to the first (left) string.
That is, a character from the first string can be represented by nothing, itself or a character from a small set of characters.
There's no simple rule for this character-to-character mapping, although there are some patterns.
Given several thousands of such string pairs, how do I deduce the transformation rules such that if I apply them to the left strings, I get the right strings?
The solution can be approximate, working correctly for, say, 80-95% of the strings.
Would you recommend using some kind of genetic algorithm? If so, how?

If you could align the characters, or rather groups of characters, you could work out tables saying that aa => a, bb => z, and so on. If you had such tables, you could align the characters using dynamic time warping (DTW). One approach is therefore to guess an alignment (e.g. one for one, just as a starting point, or just align the first and last characters of each sequence), work out a translation table from that, use DTW to get a new alignment, work out a revised translation table, and iterate in that way. Perhaps you could wrap this up with enough maths to show that there is some measure of optimality or probability that such passes increase, climbing to a local maximum.
There is probably some way of doing this by building a Hidden Markov Model that generates both sequences simultaneously and then deriving rules from that model, but I would not choose this approach unless I was already familiar with HMMs and had software to use as a starting point that I was happy to modify.
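The align, re-estimate, re-align loop could be sketched as below. This is not DTW proper: a standard edit-distance alignment over single characters (match, substitute, delete, insert) is swapped in for the alignment step, the starting guess is a unit-cost alignment, and the smoothing constants in the cost update are arbitrary assumptions. An empty string stands for "nothing".

```python
import math
from collections import Counter


def align(src, dst, cost):
    """Min-cost alignment; returns (src_piece, dst_piece) pairs, '' meaning nothing."""
    n, m = len(src), len(dst)
    best = [[math.inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            steps = []
            if i > 0:
                steps.append((i - 1, j))      # drop a source character
            if j > 0:
                steps.append((i, j - 1))      # insert a target character
            if i > 0 and j > 0:
                steps.append((i - 1, j - 1))  # map source to target character
            for pi, pj in steps:
                c = best[pi][pj] + cost(src[pi:i], dst[pj:j])
                if c < best[i][j]:
                    best[i][j], back[i][j] = c, (pi, pj)
    ops, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        ops.append((src[pi:i], dst[pj:j]))
        i, j = pi, pj
    return ops[::-1]


def learn_table(pairs, rounds=3):
    """Alternate between aligning all pairs and re-estimating mapping costs."""
    def cost(a, b):
        return 0.0 if a == b else 1.0  # initial guess: plain edit distance

    counts = Counter()
    for _ in range(rounds):
        counts = Counter(op for s, d in pairs for op in align(s, d, cost))
        totals = Counter()
        for (a, _b), n in counts.items():
            totals[a] += n

        def cost(a, b, counts=counts, totals=totals):
            # Negative log of a smoothed estimate of P(b | a).
            return -math.log((counts[(a, b)] + 0.1) / (totals[a] + 1.0))
    return counts


pairs = [("abba", "aba"), ("haha", "aha"), ("baa", "ba"), ("exb", "esp"), ("xa", "za")]
table = learn_table(pairs)
for (a, b), n in sorted(table.items()):
    print(repr(a), "=>", repr(b), n)
```

The resulting counts are a character-level translation table; extending it to groups of characters, as suggested above, would need alignment steps that consume more than one character at a time.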

You can use text-to-speech to create sound waves, then compare the sound waves with each other and score the matches as percentages.
This is my theory of how Google has such an advanced spell checker.
