Mapping arbitrary strings to RGB values - algorithm

I have a huge set of arbitrary natural language strings. For my tool to analyze them I need to convert each string to a unique color value (RGB or other). I need color contrast to depend on string similarity (the more a string differs from another, the more their respective colors should differ). It would be perfect if I always got the same color value for the same string.
Any advice on how to approach this problem?
Update on distance between strings
I probably need "similarity" defined as a Levenshtein-like distance. No natural language parsing is required.
That is:
"I am going to the store" and
"We are going to the store"
Similar.
"I am going to the store" and
"I am going to the store today"
Similar as well (but slightly less).
"I am going to the store" and
"J bn hpjoh up uif tupsf"
Not similar at all.
I will probably only know exactly which distance function I need once I see the program's output, so let's start with simpler things.
Update on task simplification
I've removed my own suggestion to split the task into two parts: absolute distance calculation and color distribution. This would not work well, because we would first reduce the multi-dimensional information to a single dimension and then try to expand it back up to three dimensions.

You need to elaborate more on what you mean by "similar strings" in order to come up with an appropriate conversion function. Are the strings
"I am going to the store" and
"We are going to the store"
considered similar? What about the strings
"I am going to the store" and
"J bn hpjoh up uif tupsf"
(all of the letters in the original +1), or
"I am going to the store" and
"I am going to the store today"
? Based on what you mean by "similar", you might consider different functions.
If the difference can be based solely on the values of the characters (in Unicode or whatever space they are from), then you can try summing the values up and using the result as a hue for HSV space. If having a longer string should cause the colours to be more different, you might consider weighting characters by their position in the string.
If the difference is more complex, such as by the occurrences of certain letters or words, then you need to identify this. Maybe you can decide red, green and blue values based on the number of Es, Ss and Rs in a string, if your domain has a lot of these. Or pick a hue based on the ratio of vowels to consonants, or words to syllables.
There are many, many different ways to approach this, but the best one really depends on what you mean by "similar" strings.

It sounds like you want a hash of some sort. It doesn't need to be secure (so nothing as complicated as MD5 or SHA) but something along the lines of:
(char1 + char2 + char3 + ... + charN) % MAX_COLOUR_VALUE
would work as a simple first step. You could also do fancier things along the lines of having each character act as an 'amplitude' for R,G and B (e could be +1R, +2G and -4B, etc.) and then simply add up all the values in a string... clamp them at the end and you have a method of turning arbitrary length strings into colours as a 'colour hash' sort of process.
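A minimal sketch of both ideas, assuming 8-bit channels. The function names and the per-character amplitudes are made up for illustration and are not part of the suggestion above; any fixed per-character table would do.

```python
MAX_COLOUR_VALUE = 256  # assumption: one 8-bit channel


def simple_colour_hash(text):
    """Sum the character codes and wrap the total into a single channel."""
    return sum(ord(ch) for ch in text) % MAX_COLOUR_VALUE


def amplitude_colour_hash(text):
    """Let every character nudge R, G and B, then clamp to 0..255."""
    r = g = b = 0
    for ch in text:
        code = ord(ch)
        # Hypothetical amplitudes derived from the code point.
        r += (code % 7) - 3
        g += (code % 5) - 2
        b += (code % 11) - 5

    def clamp(value):
        # Start from mid-grey (128) and keep the result inside 0..255.
        return max(0, min(255, 128 + value))

    return clamp(r), clamp(g), clamp(b)
```

Both functions are deterministic, so the same string always gets the same colour, but neither guarantees that similar strings land near each other.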

First, you'll need to pick a way to measure string similarity. Minimal edit distance is traditional, but is not sufficient to well-order the strings, which is what you will need if you want to allocate the same colours to the same strings every time - perhaps you could weight the edit costs by alphabetic distance. Also minimal edit distance by itself may not be very useful if what you are after is similarity in speech rather than in written form (if so, consider a stemming/soundex pass first), or some other sense of "similarity".
Then you need to pick a way of traversing the visible colour space based on that metric. It may be helpful to consider using HSL or HSV colour representation - the algorithm could then become as simple as picking a starting hue and walking the sorted corpus, assigning the current hue to each string before offsetting it by the string's difference from the previous one.

How important is it that you never end up with two dissimilar strings having the same colour?
If it's not that important then maybe this could work?
You could pick a 1 dimensional color space that is "homotopic" to the circle: Say the color function c(x) is defined for x between 0 and 1. Then you'd want c(0) == c(1).
Now you take the sum of all character values modulo some scaling factor and wrap this back to the color space:
c( (SumOfCharValues(word) modulo ScalingFactor) / ScalingFactor )
This might work even better if you defined a "wrapping" color space of higher dimensions and for each dimension pick different SumOfCharValues function; someone suggested alternating sum and length.
Just a thought... HTH
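As a rough sketch, the hue circle of HSV is one such wrapping colour space, since hue 0 and hue 1 are the same red. The function name and the default scaling factor of 360 are assumptions chosen for illustration.

```python
import colorsys


def string_to_rgb(word, scaling_factor=360):
    """Map a string onto the hue circle via the sum of its character codes."""
    x = (sum(ord(ch) for ch in word) % scaling_factor) / scaling_factor
    # Full saturation and value; only the hue depends on the string.
    r, g, b = colorsys.hsv_to_rgb(x, 1.0, 1.0)
    return round(r * 255), round(g * 255), round(b * 255)
```

Note that anagrams collide under a plain sum ("ab" and "ba" get the same colour), which is one reason to mix in a second function such as length for another dimension.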

Here is my suggestion (I think there is a general name for this algorithm, but I'm too tired to remember it):
You want to transform each string to a 3D point node(r, g, b) (you can scale the values so that they fit your range) such that the following error is minimized:
Error = \sum_i{\sum_j{(dist(node_i, node_j) - dist(str_i, str_j))^2}}
You can do this:
First assign each string a random color (r, g, b)
Repeat until you see fit (e.g. until the error changes by less than \epsilon = 0.0001):
Pick a random node
Adjust its position (r, g, b) such that the error is minimized
Scale the coordinate system such that each node's coordinates are in the range [0., 1.) or [0, 255]
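The procedure resembles what is usually called multidimensional scaling. Below is a rough sketch of it, assuming plain Levenshtein distance for dist(str_i, str_j) and a small gradient step in place of an exact per-node minimisation. The function names, step count and learning rate are illustrative assumptions.

```python
import random


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def embed_as_colours(strings, steps=3000, rate=0.1, seed=0):
    """Place each string in 3D so point distances approximate edit distances."""
    n = len(strings)
    if n == 0:
        return []
    rng = random.Random(seed)  # fixed seed, so the same corpus gives the same colours
    target = [[levenshtein(s, t) for t in strings] for s in strings]
    pts = [[rng.random() for _ in range(3)] for _ in range(n)]
    for _ in range(steps):
        i = rng.randrange(n)  # pick a random node
        grad = [0.0, 0.0, 0.0]
        for j in range(n):
            if j == i:
                continue
            diff = [pts[i][k] - pts[j][k] for k in range(3)]
            dist = max(sum(d * d for d in diff) ** 0.5, 1e-9)
            # Derivative of (dist - target)^2 with respect to node i.
            coef = 2.0 * (dist - target[i][j]) / dist
            for k in range(3):
                grad[k] += coef * diff[k]
        for k in range(3):
            pts[i][k] -= (rate / n) * grad[k]
    # Rescale all axes by the same factor so relative distances survive.
    lo = min(min(p) for p in pts)
    hi = max(max(p) for p in pts)
    span = (hi - lo) or 1.0
    return [tuple(round(255 * (c - lo) / span) for c in p) for p in pts]
```

The all-pairs distance matrix makes this quadratic in the number of strings, so for a huge corpus it would have to be run on a sample. The colours also depend on the whole corpus, not on each string alone.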

You can use something like MinHash or some other LSH method and define similarity as intersection between sets of shingles measured by Jaccard coefficient.
There is a good description in Mining of Massive Datasets, Ch. 3, by Rajaraman and Ullman.
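A small sketch of the idea, assuming character shingles of length 3 and MD5 with a numeric prefix standing in for a family of hash functions. The function names and parameters are illustrative, not taken from the book.

```python
import hashlib


def shingles(text, k=3):
    """Set of all length-k substrings (the whole text if it is shorter than k)."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def jaccard(a, b):
    """Exact Jaccard coefficient of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def minhash_signature(items, num_hashes=64):
    """For each seeded hash function keep the smallest hash over the set."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in items))
    return signature


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions estimates the Jaccard coefficient."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

The signature has a fixed length regardless of string length, which is what makes it usable as input to a colour mapping or to an LSH index.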

I would maybe define some delta between two strings. I don't know what you define as the difference (or "inequality") of two strings, but the most obvious things I can think of are string length and the number of occurrences of particular letters (and their index in the string). It should not be tricky to implement it such that it returns the same color code for equal strings (if you do an equality check first and return before any further comparison).
When it comes to the actual RGB value, I would try to convert the string data into 4 bytes (RGBA), or 3 bytes if you only use RGB. I don't know if every string would fit into them (as that may be language specific?).

Sorry, but you can't do what you're looking for with Levenshtein distance or similar. RGB and HSV are 3-dimensional geometric spaces, but Levenshtein distance describes a metric space, which is a much looser set of constraints with no fixed number of dimensions. There's no way to map a metric space into a fixed number of dimensions while always preserving locality.
As far as approximations go, though, for single terms you could use a modification of an algorithm like soundex or metaphone to pick a color; for multiple terms, you could, for example, apply soundex or metaphone to each word individually, then sum them up (with overflow).


How can I encode double values in genetic algorithm?

I want to use a neural network to learn to drive cars around a racetrack. In my opinion the best way to train the net is a genetic algorithm, but in every tutorial the genotype is encoded with 0s and 1s (binary values). In my net the weights are double values, so the genotype looks like 3.12, 9.12, 0.83, -0.73, etc.
So my question is:
Should I encode each weight as a binary value? I think I can use double values, but I don't know how to mutate them. A binary value I can flip from 0 to 1 or from 1 to 0, but what about a double?
From a theoretical point of view yes, you can.
The condition, however, is that you correctly define all the operations (crossover, mutation, etc.) for continuous values too.
The answer then is yes, if your software implementation enables you to do so.
Let me draw a simplified example.
If the algorithm aims at identifying the best fit for the sine function, and you can use shapes [triangle, square, half-circle], y magnitude and x displacement, you can have a chromosome of let's say N shapes to be summed together.
In such a case x and y must both be doubles: you can mutate them, e.g. by adding a random number in a sensible range, and perform crossover by exchanging x or y with the partner, or even by taking a full tuple (shape-x-y).
I would say that the mantra is to keep things coherent and let individuals mutate in a way that makes sense for your model (a bad choice would be to cross x with y).
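For the neural-network case, a minimal sketch of these operators on a list of double weights could look like this. Gaussian noise and uniform crossover are one common choice rather than the only one, and the function names, rate and sigma are assumptions.

```python
import random


def mutate(weights, rate=0.1, sigma=0.2, rng=random):
    """Add Gaussian noise to each weight with probability `rate`."""
    return [w + rng.gauss(0.0, sigma) if rng.random() < rate else w
            for w in weights]


def crossover(parent_a, parent_b, rng=random):
    """Uniform crossover: each gene comes from one parent, position by position."""
    return [a if rng.random() < 0.5 else b
            for a, b in zip(parent_a, parent_b)]
```

Because genes are only exchanged at the same position, a weight is never crossed with a weight that plays a different role in the network, which is the coherence point made above.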

Word distance algorithm for OCR

I am working with an OCR output and I'm searching for special words inside it.
As the output is not clean, I look for elements that match my inputs according to a word distance lower than a specific threshold.
However, I feel that neither the Levenshtein distance nor the Hamming distance is the best choice, as the OCR always seems to make the same mistakes: I for 1, 0 for O, Q for O... and these "classic" mistakes should count for less than, say, "A for K". These distances take no account of how visually different the confused characters are (a little or a lot).
Is there any word distance algorithm that was made specifically for OCR that I can use that would better fit my case? Or should I implement my custom word distance empirically according to the visual differences of characters?
The Levenshtein distance allows you to specify a different cost for every substitution pair, so you can tune it to your needs by giving more or less weight to the common mistakes.
If you want a custom cost function for letter mismatches, you could look at the Needleman–Wunsch algorithm (NW).
OCR paper related to the NW-algorithm
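A sketch of a Levenshtein variant with per-pair substitution costs follows. The confusion table and its cost values are invented for illustration; in practice they would be tuned empirically against real OCR errors.

```python
# Hypothetical costs for visually confusable pairs; every other substitution costs 1.0.
CONFUSABLE = {
    frozenset("I1"): 0.2,
    frozenset("O0"): 0.2,
    frozenset("OQ"): 0.3,
}


def substitution_cost(a, b):
    """Cheap for look-alike characters, full price otherwise."""
    if a == b:
        return 0.0
    return CONFUSABLE.get(frozenset((a, b)), 1.0)


def weighted_levenshtein(s, t, insert_cost=1.0, delete_cost=1.0):
    """Edit distance where the substitution cost depends on the character pair."""
    prev = [j * insert_cost for j in range(len(t) + 1)]
    for i, cs in enumerate(s, 1):
        cur = [i * delete_cost]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + delete_cost,
                           cur[j - 1] + insert_cost,
                           prev[j - 1] + substitution_cost(cs, ct)))
        prev = cur
    return prev[-1]
```

With this table, "INVOICE" read as "1NV0ICE" scores about 0.4, while "ANVKICE" scores 2.0, so a single threshold can accept the first and reject the second.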

How to apply the Levenshtein distance to a set of target strings?

Let TARGET be a set of strings that I expect to be spoken.
Let SOURCE be the set of strings returned by a speech recognizer (that is, the possible sentences that it has heard).
I need a way to choose a string from TARGET. I read about the Levenshtein distance and the Damerau-Levenshtein distance, which basically return the distance between a source string and a target string, that is, the number of changes needed to transform the source string into the target string.
But, how can I apply this algorithm to a set of target strings?
I thought I'd use the following method:
For each string that belongs to TARGET, I calculate the distance from each string in SOURCE. In this way we obtain an m-by-n matrix, where m is the cardinality of TARGET and n is the cardinality of SOURCE. We could say that the i-th row represents the similarity of the sentences detected by the speech recognizer with respect to the i-th target.
By calculating the average of the values in each row, you obtain the average distance between the i-th target and the output of the speech recognizer. Let's call it average_on_row(i), where i is the row index.
Finally, for each row, I calculate the standard deviation of all values in the row. For each row, I also compute the sum of all the standard deviations. The result is a column vector in which each element (let's call it standard_deviation_sum(i)) refers to a string of TARGET.
The string associated with the smallest standard_deviation_sum could be the sentence pronounced by the user. Could the method I used be considered correct? Or are there other methods?
Obviously, values that are too high indicate that the sentence pronounced by the user probably does not belong to TARGET.
I'm not an expert but your proposal does not make sense. First of all, in practice I'd expect the cardinality of TARGET to be very large if not infinite. Second, I don't believe the Levenshtein distance or some similar similarity metric will be useful.
If:
you could really define SOURCE and TARGET sets,
all strings in SOURCE were equally probable,
all strings in TARGET were equally probable,
the strings in SOURCE and TARGET consisted of not characters but phonemes,
then I believe your best bet would be to find the pair p in SOURCE, q in TARGET such that distance(p,q) is minimum. Since you cannot guarantee the equal-probability part in particular, I think you should rethink the problem from scratch, do some research and come up with a completely different design. The usual methodology for speech recognition is to use hidden Markov models. I would start from there.
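Under those idealised assumptions, the minimum-distance pair can be found with a plain double loop. This sketch works on characters with unweighted Levenshtein distance purely for illustration; the function names are not from the question.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,
                           cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def best_pair(source, target):
    """Return (distance, p, q) for the closest pair p in source, q in target."""
    return min((levenshtein(p, q), p, q) for p in source for q in target)
```

Ties are broken alphabetically here, which is arbitrary: "chees" is at distance 1 from both "cheese" and "chess", so a plain distance cannot choose between them.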
Answer to your comment: Choose whichever is more probable. If you don't consider probabilities, it is hopeless.
[Suppose the following example is on phonemes, not characters]
Suppose the recognized word is "chees". The target set is "cheese", "chess". You must calculate P(cheese|chees) and P(chess|chees). What I'm trying to say is that not every substitution is equiprobable. If you model probabilities as distances between strings, then at the very least you must allow that, for example, d("c","s") < d("c","q"). (It is common to confuse the letters c and s, but it is not common to confuse c and q.) Adapting the distance calculation algorithm is easy; coming up with good values for all pairs is difficult.
You must also somehow estimate P(cheese|context) and P(chess|context). If we are talking about board games, chess is more probable. If we are talking about dairy products, cheese is more probable. This is why you'll need large amounts of data to come up with such estimates. This is also why hidden Markov models are good for this kind of problem.
You need to calculate these probabilities first: probability of insertion, deletion and substitution. Then use log of these probabilities as penalties for each operation.
In a "context independent" situation, if pi is probability of insertion, pd is probability of deletion and ps probability of substitution, the probability of observing the same symbol is pp=1-ps-pd.
In this case use log(pi/pp/k), log(pd/pp) and log(ps/pp/(k-1)) as penalties for insertion, deletion and substitution respectively, where k is the number of symbols in the system.
Essentially if you use this distance measure between a source and target you get log probability of observing that target given the source. If you have a bunch of training data (i.e. source-target pairs) choose some initial estimates for these probabilities, align source-target pairs and re-estimate these probabilities (AKA EM strategy).
You can start with one set of probabilities and assume context independence. Later you can assume some kind of clustering among the contexts (eg. assume there are k different sets of letters whose substitution rate is different...).
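A sketch of the context-independent scoring follows, using the three log terms exactly as given above. The probability values and k=26 are made-up defaults, and treating "insertion" as a symbol present only in the target is an assumption. Since the log terms are negative, the best alignment is the one with the highest total.

```python
import math


def log_penalties(pi, pd, ps, k):
    """Log terms for insertion, deletion and substitution, relative to a match."""
    pp = 1.0 - ps - pd  # probability of observing the same symbol
    return (math.log(pi / pp / k),
            math.log(pd / pp),
            math.log(ps / pp / (k - 1)))


def log_score(source, target, pi=0.05, pd=0.05, ps=0.1, k=26):
    """Best-alignment log score; 0.0 for identical strings, lower is less likely."""
    ins, dele, sub = log_penalties(pi, pd, ps, k)
    prev = [j * ins for j in range(len(target) + 1)]
    for i, cs in enumerate(source, 1):
        cur = [i * dele]
        for j, ct in enumerate(target, 1):
            cur.append(max(prev[j] + dele,                          # symbol dropped
                           cur[j - 1] + ins,                        # symbol added
                           prev[j - 1] + (0.0 if cs == ct else sub)))
        prev = cur
    return prev[-1]
```

Re-estimating pi, pd and ps from aligned training pairs, as described above, would replace the hard-coded defaults.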

Better distance metrics besides Levenshtein for ordered word sets and subsequent clustering

I am trying to solve a problem that involves comparing large numbers of word sets, each of which contains a large number of ordered words drawn from a vocabulary (around 600+ words in total, very high dimensionality!), for similarity and then clustering them into distinct groupings. The solution needs to be as unsupervised as possible.
The data looks like
[Apple, Banana, Orange...]
[Apple, Banana, Grape...]
[Jelly, Anise, Orange...]
[Strawberry, Banana, Orange...]
The order of the words in each set matters ([Apple, Banana, Orange] is distinct from [Apple, Orange, Banana]).
The approach I have been using so far is to compute Levenshtein distance (limited by a distance threshold) in a Python script, with each word treated as a unique symbol, generate a similarity matrix from the distances, and feed that matrix into k-Medoids in KNIME for the groupings.
My questions are:
Is Levenshtein the most appropriate distance metric to use for this problem?
Is mean/medoid prototype clustering the best way to go about the groupings?
I haven't yet put much thought into validating the choice for 'k' in the clustering. Would evaluating an SSE curve of the clustering be the best way to go about this?
Are there any flaws in my methodology?
As an extension to the solution in the future, given training data, would anyone happen to have any ideas for going about assigning probabilities to cluster assignments? For example, set 1 has a 80% chance of being in cluster 1, etc.
I hope my questions don't seem too silly or the answers painfully obvious, I'm relatively new to data mining.
Yes, Levenshtein is a very suitable way to do this. But if the sequences vary in size much, you might be better off normalising these distances by dividing by the sum of the sequence lengths -- otherwise you will find that observed distances tend to increase for pairs of long sequences whose "average distance" (in the sense of the average distance between corresponding k-length substrings, for some small k) is constant.
Example: The pair ([Apple, Banana], [Carrot, Banana]) could be said to have the same "average" distance as ([Apple, Banana, Widget, Xylophone], [Carrot, Banana, Yam, Xylophone]) since every 2nd item matches in both, but the latter pair's raw Levenshtein distance will be twice as great.
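The example can be checked with a short sketch that runs Levenshtein over lists of words rather than characters and divides by the sum of the lengths. The function names are illustrative.

```python
def sequence_levenshtein(a, b):
    """Edit distance between two sequences, treating each word as one symbol."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,
                           cur[j - 1] + 1,
                           prev[j - 1] + (wa != wb)))
        prev = cur
    return prev[-1]


def normalised_distance(a, b):
    """Raw distance divided by the sum of the sequence lengths."""
    total = len(a) + len(b)
    return sequence_levenshtein(a, b) / total if total else 0.0
```

For the two pairs above the raw distances are 1 and 2, while the normalised distance is 0.25 in both cases.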
Also bear in mind that Levenshtein does not make special allowances for "block moves": if you take a string, and move one of its substrings sufficiently far away, then the resulting pair (of original and modified strings) will have the same Levenshtein score as if the 2nd string had completely different elements at the position where the substring was moved to. If you want to take this into account, consider using a compression-based distance instead. (Although I say there that it's useful for computing distances without respect to order, it does of course favour ordered similarity to disordered similarity.)
Check out SimMetrics on SourceForge, a platform that supports a variety of metrics and can be used to evaluate which one is best for a task.
for a commercially valid version check out K-Similarity from

Is there an edit distance algorithm that takes "chunk transposition" into account?

I put "chunk transposition" in quotes because I don't know whether or what the technical term should be. Just knowing if there is a technical term for the process would be very helpful.
The Wikipedia article on edit distance gives some good background on the concept.
By taking "chunk transposition" into account, I mean that
Turing, Alan.
should match
Alan Turing
more closely than it matches
Turing Machine
I.e. the distance calculation should detect when substrings of the text have simply been moved within the text. This is not the case with the common Levenshtein distance formula.
The strings will be a few hundred characters long at most -- they are author names or lists of author names which could be in a variety of formats. I'm not doing DNA sequencing (though I suspect people that do will know a bit about this subject).
In the case of your application you should probably think about adapting some algorithms from bioinformatics.
For example, you could first unify your strings by making sure that all separators are spaces (or anything else you like), so that you would compare "Alan Turing" with "Turing Alan". Then split one of the strings and run an exact string matching algorithm (like the Horspool algorithm) with the pieces against the other string, counting the number of matching substrings.
If you would like to find matches that are merely similar but not equal, something along the lines of a local alignment might be more suitable since it provides a score that describes the similarity, but the referenced Smith-Waterman-Algorithm is probably a bit overkill for your application and not even the best local alignment algorithm available.
Depending on your programming environment there is a possibility that an implementation is already available. I personally have worked with SeqAn lately, which is a bioinformatics library for C++ and definitely provides the desired functionality.
Well, that was a rather abstract answer. I hope it points you in the right direction, though sadly it doesn't give you a simple formula to solve your problem.
Have a look at the Jaccard distance metric (JDM). It's an oldie-but-goodie that's pretty adept at token-level discrepancies such as last name first, first name last. For two string comparands, the calculation is simply the number of unique characters the two strings have in common divided by the total number of unique characters between them (in other words, the intersection over the union; strictly speaking this ratio is the Jaccard similarity, and the distance is one minus it). For example, given the two arguments "JEFFKTYZZER" and "TYZZERJEFF", the numerator is 7 and the denominator is 8, yielding a value of 0.875. My choice of characters as tokens is not the only one available, BTW: n-grams are often used as well.
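The worked example can be reproduced with a few lines. The optional n parameter for n-gram tokens and the function names are assumptions added for illustration.

```python
def tokens(text, n=1):
    """Set of unique length-n substrings; n=1 gives the unique characters."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_similarity(a, b, n=1):
    """Size of the token intersection divided by the size of the union."""
    ta, tb = tokens(a, n), tokens(b, n)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0
```

With single characters as tokens, word order is ignored entirely, so "Alan Turing" and "Turing Alan" score 1.0; larger n makes the measure more sensitive to local order.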
One of the easiest and most effective modern alternatives to edit distance is called the Normalized Compression Distance, or NCD. The basic idea is easy to explain. Choose a popular compressor that is implemented in your language, such as zlib. Then, given string A and string B, let C(A) be the compressed size of A and C(B) be the compressed size of B. Let AB mean "A concatenated with B", so that C(AB) means "the compressed size of A concatenated with B". Next, compute the fraction
(C(AB) - min(C(A), C(B))) / max(C(A), C(B))
This value is called NCD(A,B). It measures similarity much as edit distance does, but supports more forms of similarity depending on which data compressor you choose. Certainly, zlib supports the "chunk" style similarity that you are describing. If two strings are similar, the compressed size of the concatenation will be near the size of each alone, so the numerator will be small and the result will be near 0. If two strings are very dissimilar, the compressed size of the concatenation will be roughly the sum of the two compressed sizes, so the result will be near 1.
This formula is much easier to implement than edit distance or almost any other explicit string similarity measure if you already have access to a data compression program like zlib. That is because most of the "hard" work, such as heuristics and optimization, has already been done in the data compression part, and this formula simply extracts the amount of similar patterns it found, using generic information theory that is agnostic to language. Moreover, this technique will be much faster than most explicit similarity measures (such as edit distance) for the few hundred byte size range you describe. For more information on this and a sample implementation, just search for Normalized Compression Distance (NCD) or have a look at the paper "Clustering by Compression" and the C language implementation on GitHub.
There are many other implementations and papers on this subject in the last decade that you may use as well in other languages and with modifications.
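A minimal sketch of the formula using Python's zlib module follows. The function names and the choice of compression level 9 are assumptions.

```python
import zlib


def compressed_size(data):
    """C(x): number of bytes after zlib compression."""
    return len(zlib.compress(data, 9))


def ncd(a, b):
    """Normalized Compression Distance between two strings."""
    xa, xb = a.encode("utf-8"), b.encode("utf-8")
    ca, cb = compressed_size(xa), compressed_size(xb)
    cab = compressed_size(xa + xb)
    return (cab - min(ca, cb)) / max(ca, cb)
```

One caveat worth checking on real data: for very short inputs the compressor's fixed overhead is a large share of every compressed size, so the values bunch together and it is safer to compare them relative to each other than against the ideal 0-to-1 range.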
I think you're looking for Jaro-Winkler distance, which is aimed precisely at name matching.
You might find compression distance useful for this. See an answer I gave for a very similar question.
Or you could use a k-tuple based counting system:
Choose a small value of k, e.g. k=4.
Extract all length-k substrings of your string into a list.
Sort the list. (O(k n log n) time.)
Do the same for the other string you're comparing to. You now have two sorted lists.
Count the number of k-tuples shared by the two strings. If the strings are of length n and m, this can be done in O(n+m) time using a list merge, since the lists are in sorted order.
The number of k-tuples in common is your similarity score.
With small alphabets (e.g. DNA) you would usually maintain a vector storing the count for every possible k-tuple instead of a sorted list, although that's not practical when the alphabet is any character at all -- for k=4, you'd need a 256^4 array.
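The sorted-list version of the steps above can be sketched as follows, with k=4 as in the example. The function names are illustrative, and repeated k-tuples are counted as often as they occur in both strings.

```python
def k_tuples(text, k=4):
    """Sorted list of all length-k substrings of text."""
    return sorted(text[i:i + k] for i in range(len(text) - k + 1))


def shared_k_tuples(a, b, k=4):
    """Count k-tuples common to both strings with a single merge pass."""
    xs, ys = k_tuples(a, k), k_tuples(b, k)
    i = j = shared = 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            shared += 1
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return shared
```

On the names from the question, "Turing, Alan." shares 4 tuples with "Alan Turing" and 3 with "Turing Machine", so the reordered name does score higher, though only narrowly on strings this short.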
I'm not sure whether what you really want is edit distance, which works simply on strings of characters, or semantic distance, which means choosing the most appropriate or similar meaning. You might want to look at topics in information retrieval for ideas on how to decide which is the most appropriate matching term or phrase for a given term or phrase. In a sense, what you're doing is comparing very short documents rather than strings of characters.
