Do the same words have different lemmas for different POS? - stanford-nlp

If there are two words that have different POS but are spelled the same, then are there different lemmas for such words?
For instance, do 'care' as a noun and 'care' as a verb have the same lemma or different lemmas?
Is there NEVER such a situation where "different" words have the same lemma?

There are rare instances where a single word can have a different lemma depending on part of speech; an example is saw. For saw/NN the lemma is saw; for saw/VBD the lemma is see.
I'm not sure what you meant by the last question. All the time different words have the same lemma: look, looks, looking, and looked all have the lemma look.
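As a quick illustration of that behaviour, here is a small Python example using NLTK's WordNet lemmatizer (not Stanford CoreNLP, but it exposes the same POS-dependent lemmas; the 'n'/'v' tags are WordNet-style, and the WordNet data must be downloaded first):

# Requires: pip install nltk, then nltk.download('wordnet') once.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("saw", pos="n"))     # saw  (noun lemma is the word itself)
print(lemmatizer.lemmatize("saw", pos="v"))     # see  (verb lemma differs)
print(lemmatizer.lemmatize("looks", pos="v"))   # look
print(lemmatizer.lemmatize("looked", pos="v"))  # look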

Related

Design NFA with changing alphabet and language

I ran into this exercise, thought about it for a few hours, and got nowhere.
Our alphabet is {1, ..., n}, and our language Ln contains all words over Σ* such that each word in the language is missing at least one letter from the alphabet.
For example, if n=5, the word w=111223432 is in the language because '5' does not occur in it, while the word w=1352224 is not in the language because all the letters 1...n occur in it.
I need to design an NFA for this language that has n+1 states.
Again, I tried a few things and don't exactly have a good idea.
For simplicity, let's do this for the alphabet {a, b, c}. Imagine that you have a string in your language. That means that it's missing a, or it's missing b, or it's missing c (inclusive or). If we knew which character was missing, it would be really easy to check whether a string never had a copy of that character using a single-state NFA consisting of an accepting state that transitions back to itself on everything except that character.
Since there are only finitely many characters in the alphabet, we can build three one-state NFAs, each of which are designed to check whether a string is missing a particular character.
To build the overall machine, have the start state of the NFA nondeterministically guess which character is missing by adding ε-transitions from the start state to each of the individual one-state NFAs we built earlier. You now have a four-state NFA for this language. (You can see a picture of it here.) Hopefully it's not too hard to see how to generalize this up to larger alphabet sizes!
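As a minimal sketch of that construction (in Python, with invented state names, for the alphabet {a, b, c}): the start state ε-moves to one accepting state per letter, and each of those states loops on every letter except the one it guessed was missing.

# Simulate the (n+1)-state NFA directly: track which "missing-letter" guesses are still alive.
ALPHABET = "abc"

def accepts(word):
    # ε-closure of the start state: every guess is initially possible.
    alive = {f"missing-{c}" for c in ALPHABET}
    for ch in word:
        # the "missing-ch" state has no transition on ch, so that guess dies
        alive = {state for state in alive if state != f"missing-{ch}"}
        if not alive:
            return False
    return True   # any surviving guess ends in an accepting state

print(accepts("aabba"))  # True: 'c' never appears
print(accepts("abcab"))  # False: every letter of the alphabet appears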

Alternative to Levenshtein distance for prefixes / suffixes

I have a big city database which was compiled from many different sources. I am trying to find a way to easily spot duplicates based on city name. The naive answer would be to use the Levenshtein distance. However, the problem with cities is that they often have prefixes and suffixes which are common to the country they are in.
For example:
Boulleville vs. Boscherville
These are almost certainly different cities. However, because they both end with "ville" (and both begin with "Bo"), they have a rather small Levenshtein distance.
I am looking for a string distance algorithm that takes into account the position of each character, to minimize the effect of prefixes and suffixes by weighting letters in the middle of the word more heavily than letters at the ends of the word.
I could probably write something myself but I would find it hard to believe that no one has yet published a suitable algorithm.
This is similar to stemming in Natural Language Processing.
In that field, the stem of a word is found before performing further analysis, e.g.
run => run
running => run
runs => run
(of course things like ran do not stem to run. For that one can use a lemmatizer. But I digress...). Even though stemming is far from perfect in NLP, it works remarkably well.
In your case, it may work well to stem the city name using rules specific to city names before applying Levenshtein. I'm not aware of a stemmer implementation for cities, but the rules seem on the surface to be fairly simple.
You might start with a list of prefixes and a list of suffixes (including any common variant / typo spellings) and simply remove such a prefix / suffix before checking the Levenshtein distance.
On a side note, if you have additional address information (such as a street address or zip/postal code), there exists address normalization software for many countries that will find the best match based on address-specific algorithms.
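A rough Python sketch of that stem-then-compare idea (the prefix and suffix lists below are invented examples rather than a published place-name stemmer, and the Levenshtein helper is the plain textbook version):

# Strip country-typical affixes before measuring the distance.
CITY_PREFIXES = ("saint-", "st-", "bad ")            # illustrative only
CITY_SUFFIXES = ("ville", "bourg", "berg", "burgh")  # illustrative only

def city_stem(name):
    s = name.lower()
    for pre in CITY_PREFIXES:
        if s.startswith(pre):
            s = s[len(pre):]
            break
    for suf in CITY_SUFFIXES:
        if s.endswith(suf):
            s = s[:-len(suf)]
            break
    return s

def levenshtein(a, b):
    # standard two-row dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

print(levenshtein("Boulleville", "Boscherville"))                        # distance over the full names
print(levenshtein(city_stem("Boulleville"), city_stem("Boscherville")))  # similar distance, but over much shorter stems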
A pretty simple way to do it would be to just remove the common prefix and suffix before doing the distance calculation. The absolute distance between the resulting strings will be the same as with the full strings, but when the shorter length is taken into account the distance looks much greater.
Also keep in mind that, in general, even grievous misspellings get the first letter right. It's highly likely, then, that Cowville and Bowville are different cities, even though their Levenshtein distance is only 1.
You can make your job a lot easier by, at least at first, not doing the distance calculation if two words start with different letters. They're likely to be different. Concentrate first on removing duplicates among words that start with the same letter. If, after that, you still have a large number of potential duplicates, you can refine your distance threshold to examine words that start with different letters more closely.
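A small Python sketch of those two suggestions, using nothing beyond the standard library (os.path.commonprefix is simply a convenient way to find the shared prefix):

import os

def strip_common_affixes(a, b):
    # Remove the prefix and suffix the two names share; the Levenshtein distance of
    # what remains is unchanged, but it is now measured over much shorter strings.
    p = len(os.path.commonprefix([a, b]))
    a, b = a[p:], b[p:]
    s = len(os.path.commonprefix([a[::-1], b[::-1]]))
    return a[:len(a) - s], b[:len(b) - s]

def worth_comparing(a, b):
    # Cheap pre-filter: misspellings usually keep the first letter.
    return a[:1].lower() == b[:1].lower()

print(strip_common_affixes("Boulleville", "Boscherville"))  # ('ulle', 'scher')
print(worth_comparing("Cowville", "Bowville"))              # False: treat as different cities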

Finding a list of adjacent words between two words

I am working on a programming challenge for practice and am having trouble finding a good data structure/algorithm to use to implement a solution.
Background:
Call two words “adjacent” if you can change one word into the other by adding, deleting, or changing a single letter.
A “word list” is an ordered list of unique words where successive words are adjacent.
The problem:
Write a program which takes two words as inputs and walks through the dictionary and creates a list of words between them.
Examples:
hate → love: hate, have, hove, love
dogs → wolves: dogs, does, doles, soles, solves, wolves
man → woman: man, ran, roan, roman, woman
flour → flower: flour, lour, dour, doer, dower, lower, flower
I am not quite sure how to approach this problem. My first attempt involved creating permutations of the first word and then trying to replace letters in it. My second thought was maybe something like a suffix tree.
Any thoughts or ideas toward at least breaking the problem down would be appreciated. Keep in mind that this is not homework, but a programming challenge I am working on myself.
This puzzle was first stated by Charles Dodgson, who wrote Alice's Adventures in Wonderland under his pseudonym Lewis Carroll.
The basic idea is to create a graph structure in which the nodes are words in a dictionary and the edges connect words that are one letter apart, then do a breadth-first search through the graph, starting at the first word, until you find the second word.
I discuss this problem, and give an implementation that includes a clever algorithm for identifying "adjacent to" words, at my blog.
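For concreteness, here is a small Python sketch of that graph/BFS approach (it is not the blog's implementation, and the toy dictionary is hard-coded just so the example runs):

from collections import deque

def adjacent(a, b):
    # True if b can be reached from a by one insertion, deletion or substitution.
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    return any(short == long_[:i] + long_[i + 1:] for i in range(len(long_)))

def word_ladder(start, goal, dictionary):
    # Breadth-first search over the implicit graph, so the first path found is shortest.
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for w in dictionary:
            if w not in seen and adjacent(path[-1], w):
                seen.add(w)
                queue.append(path + [w])
    return None

words = ["hate", "have", "hove", "love", "dove", "cove"]   # toy dictionary
print(word_ladder("hate", "love", words))  # ['hate', 'have', 'hove', 'love']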
I have done this myself and used it to create a (not very good) Windows game.
I used the approach recommended by others of implementing this as a graph, where each node is a word and two nodes are connected if they differ in one letter. This means you can use well-known graph theory results to find paths between words (e.g. simple recursion where knowing the words at distance 1 allows you to find the words at distance 2).
The tricky part is building up the graph. The bad news is that it is O(n^2). The good news is that it doesn't have to be done in real time - rather than your program reading the dictionary words from a file, it reads in the data structure you baked earlier.
The key insight is that the order doesn't matter, in fact it gets in the way. You need to construct another form in which to hold the words which strips out the order information and allows words to be compared more easily. You can do this in O(n). You have lots of choices; I will show two.
For word puzzles I quite often use an encoding which I call an anagram dictionary. A word is represented by another word which has the same letters but in alphabetic sequence. So "cars" becomes "acrs". Both "lists" and "slits" become "ilsst". This is a better structure for comparison than the original word, but much better comparisons exist (it is, however, a very useful structure for other word puzzles).
Letter counts. An array of 26 values which show the frequency of that letter in the word. So for "cars" it starts 1,0,1,0,0... as there is one "a" and one "c". Hold an external list of the non-zero entries (which letters appear in the word) so you only have to check 5 or 6 values at most instead of 26. Very simple to compare two words held in this form quickly by ensuring at most two counts are different. This is the one I would use.
So, this is how I did it.
I wrote a program which implemented the data structure up above.
It had a class called WordNode. This contains the original word; a List of all other WordNodes which are one letter different; an array of 26 integers giving the frequency of each letter; and a list of the non-zero values in the letter-count array.
The initialiser populates the letter-frequency array and the corresponding list of non-zero values, and sets the list of connected WordNodes to empty.
After I have created an instance of the WordNode class for every word, I run a compare method which checks whether the frequency counts differ in no more than two places. That normally takes slightly fewer comparisons than there are letters in the words; not too bad. If they differ in exactly two places the words almost certainly differ by one letter (strictly speaking the count comparison is a fast filter rather than a guarantee, so a direct comparison of the two words settles any doubt), and I add that WordNode to the list of WordNodes differing by only one letter.
This means we now have a graph of all the words one letter different.
You can export either the whole data structure or strip out the letter frequency and other stuff you don't need and save it (I used serialized XML. If you go that way, make sure you check it handles the List of WordNodes as references and not embedded objects).
Your actual game then only has to read in this data structure (instead of a dictionary) and it can find the words one letter different with a direct lookup, in essentially zero time.
Pity my game was crap.
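Here is a rough Python reconstruction of the letter-count filter and the O(n^2) graph-building step described above (it only handles the change-one-letter case, as in the description, and it is not the original game code):

from collections import Counter
from itertools import combinations

def counts_close(a, b):
    # Fast filter: words one substitution apart have letter counts differing for at most two letters.
    ca, cb = Counter(a), Counter(b)
    return sum(1 for ch in set(ca) | set(cb) if ca[ch] != cb[ch]) <= 2

def one_letter_apart(a, b):
    # Exact check, run only on pairs that survive the filter.
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def build_graph(words):
    graph = {w: [] for w in words}
    for a, b in combinations(words, 2):        # the O(n^2) pre-bake step
        if counts_close(a, b) and one_letter_apart(a, b):
            graph[a].append(b)
            graph[b].append(a)
    return graph

print(build_graph(["cars", "bars", "bats", "bits"]))
# {'cars': ['bars'], 'bars': ['cars', 'bats'], 'bats': ['bars', 'bits'], 'bits': ['bats']}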
I don't know if this is the type of solution that you're looking for, but an active area of research is in constructing "edit distance 1" dictionaries for quickly looking up adjacent words (to use your parlance) for search term suggestions, data entry correction, and bioinformatics (e.g. finding similarities in chromosomes). See for example this research paper. Short of indexing your entire dictionary, at the very least this might suggest a search heuristic that you can use.
The simplest (recursive) algorithm I can think of (well, the only one I can think of at the moment) is:
Initialize an empty blacklist.
Take all words from your dictionary that are a valid step from the current word.
Remove the ones that are in the blacklist.
Check if you can find the target word.
If not, repeat the algorithm for all words you found in the last step.
If yes, you found it. Return from the recursion, printing all words on the path you found.
Maybe someone with a bit more time can add the ruby code for this?
Try this
x = 'hate'
puts x = x.next until x == 'love'
And if you couple it with dictionary lookup, you will get a list of all valid words in between in that dictionary.

What are the main differences between the Knuth-Morris-Pratt and Boyer-Moore search algorithms?

What are the main differences between the Knuth-Morris-Pratt search algorithm and the Boyer-Moore search algorithm?
I know KMP searches for Y in X, trying to define a pattern in Y, and saves the pattern in a vector. I also know that BM works better for small words, like DNA (ACTG).
What are the main differences in how they work? Which one is faster? Which one is less computer-greedy? In which cases?
Moore's UTexas webpage walks through both algorithms in a step-by-step fashion (he also provides various technical sources):
Knuth-Morris-Pratt
Boyer-Moore
According to the man himself,
The classic Boyer-Moore algorithm suffers from the phenomenon that it tends not to work so efficiently on small alphabets like DNA. The skip distance tends to stop growing with the pattern length because substrings re-occur frequently. By remembering more of what has already been matched, one can get larger skips through the text. One can even arrange "perfect memory" and thus look at each character at most once, whereas the Boyer-Moore algorithm, while linear, may inspect a character from the text multiple times. This idea of remembering more has been explored in the literature by others. It suffers from the need for very large tables or state machines.
However, there have been some modifications of BM that have made small-alphabet searching viable.
As a rough explanation:
Boyer-Moore's approach is to try to match the last character of the pattern instead of the first one, on the assumption that if there's no match at the end there is no need to try to match at the beginning. This allows for "big jumps", so BM works better when the pattern and the text you are searching resemble "natural text" (i.e. English).
Knuth-Morris-Pratt searches for occurrences of a "word" W within a main "text string" S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters. (Source: Wiki)
This means KMP is better suited for small alphabets like DNA (ACTG).
The Boyer-Moore technique matches the characters from right to left and works well on long patterns.
Knuth-Morris-Pratt matches the characters from left to right and works fast on short patterns.
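To make the contrast concrete, here is a small Python sketch of the two precomputed tables these answers refer to: the KMP failure (prefix) table, and a simplified Boyer-Moore bad-character table (real Boyer-Moore also uses a good-suffix rule):

def kmp_failure(pattern):
    # fail[i] = length of the longest proper prefix of pattern[:i+1] that is also its suffix;
    # after a mismatch, KMP resumes from here instead of re-examining matched text characters.
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def bm_bad_character(pattern):
    # Last index of each character in the pattern; a mismatching text character that is
    # not in the pattern at all lets Boyer-Moore skip past the whole pattern.
    return {ch: i for i, ch in enumerate(pattern)}

print(kmp_failure("ACACAGT"))       # [0, 0, 1, 2, 3, 0, 0]
print(bm_bad_character("EXAMPLE"))  # {'E': 6, 'X': 1, 'A': 2, 'M': 3, 'P': 4, 'L': 5}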

Number of simple mutations to change one string to another?

I'm sure you've all heard of the "Word game", where you try to change one word to another by changing one letter at a time, and only going through valid English words. I'm trying to implement an A* Algorithm to solve it (just to flesh out my understanding of A*) and one of the things that is needed is a minimum-distance heuristic.
That is, the minimum number of one of these three mutations that can turn an arbitrary string a into another string b:
1) Change one letter for another
2) Add one letter at a spot before or after any letter
3) Remove any letter
Examples
aabca => abaca:
aabca
abca
abaca
= 2
abcdebf => bgabf:
abcdebf
bcdebf
bcdbf
bgdbf
bgabf
= 4
I've tried many algorithms out; I can't seem to find one that gives the actual answer every time. In fact, sometimes I'm not sure if even my human reasoning is finding the best answer.
Does anyone know any algorithm for such purpose? Or maybe can help me find one?
(Just to clarify, I'm asking for an algorithm that can turn any arbitrary string to any other, disregarding their English validity-ness.)
You want the minimum edit distance (or Levenshtein distance):
The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965.
And one algorithm to determine the editing sequence is on the same page here.
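For reference, here is a straightforward dynamic-programming sketch of that edit distance in Python (the standard textbook recurrence, not code from the linked page); it reproduces the two worked examples above:

def edit_distance(a, b):
    # d[i][j] = minimum edits to turn a[:i] into b[:j]
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete everything in a[:i]
    for j in range(n + 1):
        d[0][j] = j                      # insert everything in b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                          # delete a[i-1]
                          d[i][j - 1] + 1,                          # insert b[j-1]
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1])) # substitute (or match)
    return d[m][n]

print(edit_distance("aabca", "abaca"))    # 2
print(edit_distance("abcdebf", "bgabf"))  # 4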
An excellent reference on "Edit distance" is section 6.3 of the Algorithms textbook by S. Dasgupta, C. H. Papadimitriou, and U. V. Vazirani, a draft of which is available freely here.
If you have a reasonably sized (small) dictionary, a breadth-first tree search might work.
So start with all words your word can mutate into, then all those can mutate into (except the original), then go down to the third level... Until you find the word you are looking for.
You could eliminate divergent words (ones further away from the target), but doing so might cause you to fail in a case where you must go through some divergent state to reach the shortest path.
