Does anyone know of a good way to calculate the "semantic distance" between two words?
Immediately an algorithm that counts the steps between words in a thesaurus springs to mind.
OK, looks like a similar question has already been answered: Is there an algorithm that tells the semantic similarity of two phrases.
In text mining there is an important maxim: "You shall know a word by the
company it keeps". It means that it is possible to learn the meaning of a word based on the terms that frequently appear close to it.
Without going into extensive detail, let me give two simple options for estimating the semantic distance between terms:
Use a resource such as WordNet (a large lexical database of English). WordNet superficially resembles a thesaurus, in that it groups words together based on their meanings. The semantic distance between two words can then be estimated as the length of the path connecting them in this graph.
Using a large corpus (e.g. Wikipedia), count the terms that appear close to the words you are analyzing. Build a co-occurrence vector for each word and compute a distance between them (e.g. cosine similarity).
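For a feel of option 2, here is a minimal sketch with a made-up three-sentence corpus and window size; a real run would use something the size of a Wikipedia dump:

# A minimal sketch of option 2: co-occurrence vectors plus cosine similarity.
# The toy corpus and window size are illustrative only.
from collections import Counter
from math import sqrt

corpus = [
    "the car engine needs oil",
    "the boat engine burns fuel",
    "olive oil is used in cooking",
]

def context_counts(target, sentences, window=2):
    """Count terms appearing within `window` words of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                counts.update(t for t in tokens[lo:hi] if t != target)
    return counts

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(cosine(context_counts("engine", corpus), context_counts("oil", corpus)))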
You can check these materials to get a good picture of the subject:
http://www.saifmohammad.com/WebDocs/Mohammad_Saif_Thesis-slides.pdf
http://www.umiacs.umd.edu/~saif/WebDocs/distributionalmeasures.pdf
http://www.umiacs.umd.edu/~saif/WebDocs/Measuring-Semantic-Distance.pdf
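And for option 1, a minimal sketch assuming NLTK with its WordNet data installed (nltk.download('wordnet')); path_similarity is 1 / (shortest path length + 1), so higher means closer:

# Best WordNet path similarity over all sense pairs of two words.
from nltk.corpus import wordnet as wn

def wordnet_similarity(word_a, word_b):
    scores = [
        s_a.path_similarity(s_b)
        for s_a in wn.synsets(word_a)
        for s_b in wn.synsets(word_b)
    ]
    scores = [s for s in scores if s is not None]
    return max(scores, default=None)

print(wordnet_similarity("car", "engine"))
print(wordnet_similarity("car", "banana"))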
The thesaurus idea has some merit. One approach would be to build a graph from a thesaurus, with the nodes being the words and an edge indicating that two words are listed as synonyms. You could then use a shortest-path algorithm, taking the distance between the nodes as a measure of their similarity.
One difficulty here is that some words have different meanings in different contexts. Your algorithm may need to take this into account and use directed links with the weight of the outgoing link dependent on the incoming link being followed (or ignore some outgoing links based on the incoming link).
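Setting the sense-disambiguation issue aside, the basic graph-plus-shortest-path version is only a few lines; the synonym map here is made up for illustration:

# BFS over a synonym graph: the distance is the number of synonym links.
from collections import deque

synonyms = {
    "big": {"large", "huge"},
    "large": {"big", "sizable"},
    "huge": {"big", "enormous"},
    "enormous": {"huge", "vast"},
}

def thesaurus_distance(start, goal):
    """Shortest number of synonym links between two words, or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        word, dist = queue.popleft()
        for nxt in synonyms.get(word, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

print(thesaurus_distance("big", "vast"))  # 3: big -> huge -> enormous -> vast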
Possible hack: send the two words to Google search, and return the # of pages found.
Related
I have a big city database which was compiled from many different sources. I am trying to find a way to easily spot duplicates based on city name. The naive answer would be to use the levenshtein distance. However, the problem with cities is that they often have prefixes and suffixes which are common to the country they are in.
For example:
Boulleville vs. Boscherville
These are almost certainly different cities. However, because they both end with "ville" (and both begin with "Bo"), they have a rather small Levenshtein distance.
I am looking for a string-distance algorithm that takes the position of each character into account, weighting letters in the middle of the word more heavily than letters at the ends, to minimize the effect of prefixes and suffixes.
I could probably write something myself but I would find it hard to believe that no one has yet published a suitable algorithm.
This is similar to stemming in Natural Language Processing.
In that field, the stem of a word is found before performing further analysis, e.g.
run => run
running => run
runs => run
(of course things like ran do not stem to run. For that one can use a lemmatizer. But I digress...). Even though stemming is far from perfect in NLP, it works remarkably well.
In your case, it may work well to stem the city name using rules specific to city names before applying Levenshtein. I'm not aware of a stemmer implementation for cities, but the rules seem, on the surface, to be fairly simple.
You might start with a list of prefixes and a list of suffixes (including any common variant / typo spellings) and simply remove such a prefix / suffix before checking the Levenstein distance.
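A rough sketch of that, with made-up prefix/suffix lists (a real list would need to be tuned per country):

# Strip a known prefix and suffix, then compare with plain edit distance.
PREFIXES = ("saint ", "st. ", "san ", "bad ")
SUFFIXES = ("ville", "burg", "berg", "stadt", "ton")

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def stem_city(name):
    """Remove one known prefix and one known suffix from a lowercased name."""
    name = name.strip().lower()
    for p in PREFIXES:
        if name.startswith(p):
            name = name[len(p):]
            break
    for s in SUFFIXES:
        if name.endswith(s):
            name = name[:-len(s)]
            break
    return name

print(levenshtein(stem_city("Boulleville"), stem_city("Boscherville")))  # compares "boulle" vs "boscher"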
On a side note, if you have additional address information (such as a street address or zip/postal code), there exists address normalization software for many countries that will find the best match based on address-specific algorithms.
A pretty simple way to do it would be to just remove the common prefix and suffix before doing the distance calculation. The absolute distance between the resulting strings will be the same as with the full strings, but when the shorter length is taken into account the distance looks much greater.
Also keep in mind that, in general, even grievous misspellings get the first letter right. It's highly likely, then, that Cowville and Bowville are different cities, even though their Levenshtein distance is only 1.
You can make your job a lot easier by, at least at first, skipping the distance calculation if two words start with different letters; they're likely to be different. Concentrate first on removing duplicates among words that start with the same letter. If, after that, you still have a large number of potential duplicates, you can refine your distance threshold to more closely examine words that start with different letters.
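A sketch of that first-letter blocking idea; difflib's ratio stands in here for whatever distance measure you settle on, and the prefix/suffix trimming above can be applied to each pair before the comparison:

# Bucket names by first letter and only compare pairs inside each bucket.
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

cities = ["Boulleville", "Bouleville", "Boscherville", "Cowville", "Bowville"]

buckets = defaultdict(list)
for name in cities:
    buckets[name[0].lower()].append(name)

for bucket in buckets.values():
    for a, b in combinations(bucket, 2):
        similarity = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if similarity > 0.9:
            print(f"possible duplicate: {a} / {b} ({similarity:.2f})")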
Can anyone suggest an appropriate data structure to hold a dictionary that will allow me to query the presence of words (items) that have particular letters at particular positions? For example, determine which words (if any) have letters a,b,c at positions x,y,z. Insertions do not have to be particularly efficient.
This is basically the scrabble problem (I have scores associated with the letters too, but that need not concern us). I suspect bioinformaticians have studied the same problem under the guise of sequence alignment. What's the state of the art in terms of speed?
If you are trying to build a very fast Scrabble player, you might want to look into the GADDAG data structure, which was specifically designed for the purpose. Essentially, the GADDAG is a compressed trie structure (specifically, it's a modified DAWG) that lets you explore outward and find all words that can be made with a certain set of letters subject to constraints about which letters of the words must be in what positions, as well as the overall lengths of the strings found.
The Wikipedia article on GADDAGs goes into more depth on the structure and links to the original paper on the subject. You might also want to look at DAWGs as a starting point.
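If a full GADDAG is more machinery than you need, the exact query in the question (which words have given letters at given positions) can also be answered with a plain positional index; a rough sketch, with an illustrative word list:

# Not a GADDAG: a simple positional inverted index over a word list.
from collections import defaultdict

words = ["scrabble", "scramble", "word", "ward", "wild"]

index = defaultdict(set)          # (position, letter) -> words with that letter there
for w in words:
    for pos, letter in enumerate(w):
        index[(pos, letter)].add(w)

def words_matching(constraints):
    """constraints: iterable of (position, letter) pairs."""
    sets = [index[c] for c in constraints]
    return set.intersection(*sets) if sets else set(words)

print(words_matching([(0, "w"), (2, "r")]))   # {'word', 'ward'}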
Hope this helps!
I came across this variation of edit-distance problem:
Find the shortest path from one word to another, for example storm->power, validating each intermediate word by using a isValidWord() function. There is no other access to the dictionary of words and therefore a graph cannot be constructed.
I am trying to figure this out but it doesn't seem to be a distance related problem, per se. Use simple recursion maybe? But then how do you know that you're going the right direction?
Anyone else find this interesting? Looking forward to some help, thanks!
This is a puzzle from Lewis Carroll known as Word Ladders. Donald Knuth covers it in The Stanford GraphBase.
You can view it as a breadth-first search. You will need access to a dictionary of words, otherwise the space you have to search will be huge. If all you have is isValidWord(), you can generate all single-letter edits of the current word and use isValidWord() to filter them down (Norvig's "How to Write a Spelling Corrector" is a great explanation of generating the edits).
You can guide the search by trying to minimize the edit distance between where you currently are and where you want to be. For example, generate the candidate words and sort them by edit distance to the target, then follow the links that are closest to the target first. In the example, follow the nodes that are closest to "power".
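A minimal sketch of the breadth-first version (the edit-distance guidance above is left out for brevity); the toy dictionary stands in for isValidWord(), since an actual storm -> power ladder depends entirely on the real word list:

# BFS word ladder where isValidWord() is the only oracle available.
from collections import deque
from string import ascii_lowercase

def neighbours(word, is_valid_word):
    """All valid words obtained by changing exactly one letter."""
    for i in range(len(word)):
        for c in ascii_lowercase:
            if c != word[i]:
                candidate = word[:i] + c + word[i + 1:]
                if is_valid_word(candidate):
                    yield candidate

def word_ladder(start, goal, is_valid_word):
    """Shortest ladder from start to goal, or None if none exists."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbours(path[-1], is_valid_word):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

toy = {"cold", "cord", "card", "ward", "warm"}   # stand-in dictionary
print(word_ladder("cold", "warm", lambda w: w in toy))
# ['cold', 'cord', 'card', 'ward', 'warm']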
I found this interesting as well, so there's a Haskell implementation here which works reasonably well. There's a link in the comments to a Clojure version which has some really nice visualizations.
You can search from both ends at the same time. I.e. change a letter in storm and run the result through isValidWord(), and change a letter in power and run that through isValidWord(). When a word generated from one side matches a word generated from the other, you have found a path.
I have a question related to trees. I have about 100 sentences on a topic such as "car"; the sentences basically talk about cars. If a user submits a query like "find all combinations of word links between the words 'engine' and 'oil'", I want to find every possible word link so that "engine" and "oil" are connected through any number of words shared between sentences.
For instance.
Engine is hot when it runs.
Car has an engine.
Car use an oil.
In this case the answer will be: engine->car->oil (a three-word chain). And I want to find all possible combinations so that, in the end, "engine" and "oil" are connected. It is not the shortest path, or the longest path, but all possible paths running in all directions through all words. It is even possible to have 1,000 word combinations connecting "engine" and "oil", as long as the paths are not identical, of course.
Is there a way to do this? I tried using breadth-first search, but it is a little tricky. For instance, combinations could be:
engine->car->run->stop->oil
engine->car->oil
engine->fast->brake->oil
Can anyone please help me with this? What is the logic and idea here? I can't simply ignore a word I have already visited, because that will stop the algorithm right there and not give me all the links.
Any help and insights would be appreciated.
Thanks.
Your question is so underspecified... that is, even in your example, it is unclear why the answer doesn't include engine->an->oil.
Also, it doesn't actually have anything to do with trees, rather it deals with graphs.
The first thing you need to do is determine how to build your graph. A reasonable way of doing this would be to have an edge between two words if they both appear in a particular sentence.
Then you have to decide what you want outputted. I highly doubt that you want all paths. Why? Well, if you build the graph I described, then even using just your first sentence, there are 24 paths from engine to oil not counting paths with cycles. However, if that is what you want anyway, you can find all non-cyclic paths in a graph via depth-first search, where you mark a node visited when you push it on the stack, and unmark it when you take it off.
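A rough sketch of that DFS, building the graph as described (an edge joins two words that share a sentence) and using the mark/unmark trick to enumerate all simple paths; the sentences are the ones from the question:

# All cycle-free paths between two words in a sentence co-occurrence graph.
from collections import defaultdict
from itertools import combinations

sentences = [
    "engine is hot when it runs",
    "car has an engine",
    "car use an oil",
]

graph = defaultdict(set)
for sentence in sentences:
    for a, b in combinations(set(sentence.split()), 2):
        graph[a].add(b)
        graph[b].add(a)

def all_paths(start, goal, visited=None):
    visited = visited or [start]
    if start == goal:
        yield list(visited)
        return
    for nxt in graph[start]:
        if nxt not in visited:
            visited.append(nxt)          # mark
            yield from all_paths(nxt, goal, visited)
            visited.pop()                # unmark

for path in all_paths("engine", "oil"):
    print(" -> ".join(path))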
I put "chunk transposition" in quotes because I don't know whether or what the technical term should be. Just knowing if there is a technical term for the process would be very helpful.
The Wikipedia article on edit distance gives some good background on the concept.
By taking "chunk transposition" into account, I mean that
Turing, Alan.
should match
Alan Turing
more closely than it matches
Turing Machine
I.e. the distance calculation should detect when substrings of the text have simply been moved within the text. This is not the case with the common Levenshtein distance formula.
The strings will be a few hundred characters long at most -- they are author names or lists of author names which could be in a variety of formats. I'm not doing DNA sequencing (though I suspect people that do will know a bit about this subject).
In the case of your application you should probably think about adapting some algorithms from bioinformatics.
For example, you could first unify your strings by making sure that all separators are spaces (or anything else you like), so that you would compare "Alan Turing" with "Turing Alan". Then split one of the strings and run an exact string-matching algorithm (like the Horspool algorithm) with the pieces against the other string, counting the number of matching substrings.
If you would like to find matches that are merely similar but not equal, something along the lines of a local alignment might be more suitable, since it provides a score that describes the similarity; but the Smith-Waterman algorithm is probably a bit overkill for your application, and not even the best local alignment algorithm available.
Depending on your programming environment there is a possibility that an implementation is already available. I personally have worked with SeqAn lately, which is a bioinformatics library for C++ and definitely provides the desired functionality.
Well, that was a rather abstract answer, but I hope it points you in the right direction, even if it doesn't provide you with a simple formula to solve your problem.
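Still, a rough sketch of the split-and-match idea is short; Python's substring search stands in here for an exact matcher like Horspool:

# Normalise separators, split one string, count pieces found in the other.
import re

def token_match_score(a, b):
    norm = lambda s: re.sub(r"[^a-z0-9 ]+", " ", s.lower())
    a, b = norm(a), norm(b)
    tokens = a.split()
    hits = sum(1 for t in tokens if t in b)   # substring search per token
    return hits / len(tokens) if tokens else 0.0

print(token_match_score("Turing, Alan.", "Alan Turing"))    # 1.0
print(token_match_score("Turing, Alan.", "Turing Machine")) # 0.5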
Have a look at the Jaccard distance metric (JDM). It's an oldie-but-goodie that's pretty adept at token-level discrepancies such as last name first, first name last. For two string comparands, the JDM calculation is simply the number of unique characters the two strings have in common divided by the total number of unique characters between them (in other words the intersection over the union). For example, given the two arguments "JEFFKTYZZER" and "TYZZERJEFF," the numerator is 7 and the denominator is 8, yielding a value of 0.875. My choice of characters as tokens is not the only one available, BTW--n-grams are often used as well.
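For reference, the calculation described here is only a couple of lines, with the n-gram token choice mentioned at the end included as an option:

# Jaccard similarity over character sets (n=1) or character n-grams (n>1).
def jaccard_similarity(a, b, n=1):
    grams = lambda s: {s[i:i + n] for i in range(len(s) - n + 1)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0

print(jaccard_similarity("JEFFKTYZZER", "TYZZERJEFF"))       # 7/8 = 0.875
print(jaccard_similarity("JEFFKTYZZER", "TYZZERJEFF", n=2))  # bigram variant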
One of the easiest and most effective modern alternatives to edit distance is called the Normalized Compression Distance, or NCD. The basic idea is easy to explain. Choose a popular compressor that is implemented in your language, such as zlib. Then, given string A and string B, let C(A) be the compressed size of A and C(B) be the compressed size of B. Let AB mean "A concatenated with B", so that C(AB) means "the compressed size of A concatenated with B". Next, compute the fraction
(C(AB) - min(C(A), C(B))) / max(C(A), C(B))
This value is called NCD(A,B) and measures similarity much like edit distance does, but supports more forms of similarity depending on which data compressor you choose. Certainly, zlib supports the "chunk" style similarity that you are describing. If two strings are similar, the compressed size of the concatenation will be near the size of each alone, so the numerator will be near 0 and the result will be near 0. If two strings are very dissimilar, the compressed size of the concatenation will be roughly the sum of the compressed sizes, and so the result will be near 1.
This formula is much easier to implement than edit distance or almost any other explicit string similarity measure if you already have access to a data compression library like zlib. That is because most of the "hard" work, such as heuristics and optimization, has already been done in the compression part, and this formula simply extracts the amount of shared patterns it found, using generic information theory that is agnostic to language. Moreover, this technique will be much faster than most explicit similarity measures (such as edit distance) for the few-hundred-byte size range you describe. For more information and a sample implementation, search for Normalized Compression Distance (NCD) or have a look at the following paper and GitHub project:
http://arxiv.org/abs/cs/0312044 "Clustering by Compression"
https://github.com/rudi-cilibrasi/libcomplearn C language implementation
There are many other implementations and papers on this subject in the last decade that you may use as well in other languages and with modifications.
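A minimal NCD sketch using zlib, following the formula above; note that very short strings carry some compressor overhead, so treat the absolute values loosely:

# Normalized Compression Distance with zlib as the compressor C().
import zlib

def ncd(a: bytes, b: bytes) -> float:
    c = lambda data: len(zlib.compress(data, 9))
    ca, cb, cab = c(a), c(b), c(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)

# More shared "chunks" should push the score toward 0:
print(ncd(b"Turing, Alan.", b"Alan Turing"))
print(ncd(b"Turing, Alan.", b"Turing Machine"))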
I think you're looking for Jaro-Winkler distance which is precisely for name matching.
You might find compression distance useful for this. See an answer I gave for a very similar question.
Or you could use a k-tuple based counting system:
Choose a small value of k, e.g. k=4.
Extract all length-k substrings of your string into a list.
Sort the list (O(kn log n) time).
Do the same for the other string you're comparing to. You now have two sorted lists.
Count the number of k-tuples shared by the two strings. If the strings are of length n and m, this can be done in O(n+m) time using a list merge, since the lists are in sorted order.
The number of k-tuples in common is your similarity score.
With small alphabets (e.g. DNA) you would usually maintain a vector storing the count for every possible k-tuple instead of a sorted list, although that's not practical when the alphabet is any character at all -- for k=4, you'd need a 256^4 array.
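A sketch of this recipe using sorted lists plus a merge, as described (a set intersection would be simpler in Python, but this mirrors the steps):

# Count k-tuples (length-k substrings) shared by two strings.
def k_tuples(s, k=4):
    return sorted(s[i:i + k] for i in range(len(s) - k + 1))

def shared_k_tuples(a, b, k=4):
    """Merge two sorted k-tuple lists and count tuples common to both."""
    ta, tb = k_tuples(a, k), k_tuples(b, k)
    i = j = shared = 0
    while i < len(ta) and j < len(tb):
        if ta[i] == tb[j]:
            shared += 1
            i += 1
            j += 1
        elif ta[i] < tb[j]:
            i += 1
        else:
            j += 1
    return shared

print(shared_k_tuples("Turing, Alan.", "Alan Turing"))
print(shared_k_tuples("Turing, Alan.", "Turing Machine"))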
I'm not sure whether what you really want is edit distance -- which works purely on strings of characters -- or semantic distance -- choosing the most appropriate or similar meaning. You might want to look at topics in information retrieval for ideas on how to pick the most appropriate matching term/phrase for a given term or phrase. In a sense, what you're doing is comparing very short documents rather than strings of characters.