Understanding the Stanford CoreNLP Coreference set notation - stanford-nlp

I am using Stanford CoreNLP. To understand the coreference set more clearly, I need some help. For the sentence
"Kosgi Santosh sent an email to Stanford University. He didn't get a reply" I got the
coreference set
(2,1,[1,2)) -> (1,2,[1,3)), that is: "He" -> "Kosgi Santosh".
So far I understand that "(2,1," means 2nd sentence, 1st word, and "(1,2," means 1st sentence, 2nd word, but I cannot understand the meaning of [1,2) and [1,3).
Could you please explain?
Thanks

I needed this bit of information for my research as well.
After some digging, I found out that what one gets is, in order:
the sentence containing the mention
the position of the mention's head word within the sentence (i.e. "He" and "Santosh")
an interval [#first_word_of_mention, #end) which follows the mathematical notation for half-open intervals: the bracket [ indicates that the left-hand boundary is included, and the parenthesis ) indicates that the right-hand boundary is excluded. In other words, the first number is the index of the mention's first word, and the second number is the index of the word right after the mention's last word (whether or not such a word exists).
source http://nlp.stanford.edu/software/corenlp_output2.html
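To make the half-open interval concrete, here is a tiny Python illustration (not CoreNLP code, just index arithmetic on the example sentence; CoreNLP word indices are 1-based):

# The mention (1,2,[1,3)) points at sentence 1, head word 2, and the
# half-open word span [1,3), i.e. words 1 and 2 only.
sentence1 = ["Kosgi", "Santosh", "sent", "an", "email", "to", "Stanford", "University", "."]
start, end = 1, 3                       # the [1,3) part of (1,2,[1,3))
mention = sentence1[start - 1:end - 1]  # shift the 1-based interval to a 0-based slice
print(" ".join(mention))                # -> Kosgi Santosh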

Related

Turing Machine Algorithm

Could you please help me? I need to write code for a one-tape Turing machine that uses the two-letter alphabet a and b.
The program should output the common prefix of the two words.
For example:
g(aab,aaaba) -> aa; g(_,abab) -> _; g(aaba,baa) -> _; g(_,_) -> _; g(babaab,babb) -> bab
where g is the function computed by the machine, an underscore denotes the empty word, and the two words are separated by a space.
I tried to implement the following option:
If at the start we see the letter a, we erase it and move to the beginning of the second word. If we also see the letter a there, we erase it too and write an a after both words, separated by a space. After that we return to the beginning of the first word and repeat this operation. When the first letter of the first word no longer matches the first letter of the second word, we erase everything that is left.
But I'm having trouble with the code, because after each operation the gap between the two words gets longer and I don't know how to handle this. There is also a problem when the first or the second word is itself the entire common prefix, like this:
g(baa,baabab) -> baa
Your approach seems reasonable. From your description it sounds like you just have trouble generalizing some of the individual steps.
For instance, to deal with the growing spaces between the two words, remember that at any time in the program, the two words are separated by one or more spaces. So implement your seek operation for that general case.
For the common prefix case you have to deal with the situation that you eventually run out of characters to compare. So after deleting the character from the first word, while seeking for the beginning of the second word, check whether the first character you pass over is a letter or a space. If it's a space, you're in the prefix case and need to take care that you don't try to seek back to the first word later, because you already erased all of it and there's only spaces left. Similarly, if the second word is the prefix, you can detect this when seeking to the output.
Try breaking the algorithm down into its essential steps and test each of those steps in isolation. It is much easier to make sure you handle the corner cases correctly when you can focus on a simple step in isolation, instead of having to test it as part of the larger algorithm. Doing this is an essential skill in debugging code, so consider this a good exercise for that. Even if it seems painful at first, make sure you have a structured approach to analyzing problems and breaking your code down into smaller parts, and you will be able to fix any problems eventually. Happy coding!
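For example, here is a rough Python sketch (not a Turing machine table, just the tape modelled as a list) of the "seek to the beginning of the second word" step tested in isolation; the blank symbol and the function name are my own choices:

BLANK = "_"

def seek_start_of_second_word(tape, head):
    # Skip over whatever is left of the first word...
    while head < len(tape) and tape[head] != BLANK:
        head += 1
    # ...then over one or more blanks separating the two words.
    while head < len(tape) and tape[head] == BLANK:
        head += 1
    return head

# The gap may have grown to several blanks after earlier erasures;
# the same step still lands on the first letter of the second word.
tape = list("ab") + [BLANK] * 3 + list("abab")
print(seek_start_of_second_word(tape, 0))  # -> 5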

Why does MaxentTagger tag numbers as NN sometimes?

I am trying to tag an HTML page full of space-separated numbers like "5320412185 5320412184 5320412189..." to observe how the tagger behaves with numbers. I'm using english-left3words-distsim.tagger in the constructor. I can see on the console that most of the numbers are tagged as CD, but at times some numbers are also tagged as NN. I searched the FAQ page of nlp.stanford.edu but couldn't find anything about this. Can anyone help me understand it?
I don't know whether I need to mention this: I'm feeding each number to the tagger separately, by splitting the huge input (1,045,000 numbers!) on the space delimiter.
From Part-of-Speech Tagging Guidelines for the Penn Treebank Project (3rd Revision)
Sometimes, it is unclear whether one is a cardinal number or a noun. In general, it should be tagged as a
cardinal number (CD) even when its sense is not clearly that of a numeral.
EXAMPLE: one/CD of the best reasons
But if it could be pluralized or modified by an adjective in a particular context, it is a common noun (NN).
EXAMPLE: the only (good) one/NN of its kind
(cf. the only (good) ones/NNS of their kind)
In the collocation another one, one should also be tagged as a common noun (NN).
Hyphenated fractions one-half, three-fourths, seven-eighths, one-and-a-half, seven-and-three-eighths should
be tagged as adjectives (JJ) when they are prenominal modifiers, but as adverbs (RB) if they could be
replaced by double or twice.
For further reading: http://repository.upenn.edu/cgi/viewcontent.cgi?article=1603&context=cis_reports

Stem comparison algorithm

I'm writing a program that performs word declension for the Polish language. In this language, stems can vary in some cases (because of palatalization, mobile/fleeting e, and other effects).
For example, we have the word "karzeł", which is its basic dictionary form. Its stem is also 'karzeł'. But the genitive form of this word is "karła", whose stem is "karł". We can see here that the 'e' disappeared and 'rz' changed to 'r'.
Another example:
'uzda' -> stem 'uzd'
'uździe' -> stem 'uździ'
Alternation: 'zd' -> 'ździ'
I'd like to store only the basic form of the stem in the dictionary ('karzeł' and 'uzd'), and when I give my program the stem 'karł' or 'uździ', it should find the proper basic stem. Alternations take place only at the end of the stem and involve at most 4 of its letters.
Is there any algorithm that could do that? Levenshtein distance treats all letters equally, so if I type the word 'barzeł', the distance to the stem 'karzeł' will be less than the distance to the stem 'karł'.
I have also thought about neural networks, but I'm not sure how to encode the words (give each stem variation a different id?).
Another idea is to write an algorithm that performs something like a reversed alternation, generating a set of possible stems and looking them up in the dictionary.
I would like to emphasize that I only want to store the basic form of the stem and compute everything else on the fly.
First of all, I remember seeing a number of projects on Polish morphology around. So I would look at them first, before starting one of your own.
Regarding Levenshtein, as Pierre correctly noted in the comment, the distance function can be customized. And it should be. Let me put it this way: think of Levenshtein not as an algorithm in and of itself, but as a solution to a specific error model. It assumes that when you are typing a word, every letter can be either dropped or replaced by another one due to some random process (fingers not hitting the right keys). The algorithm is then just a generator of maximum-likelihood solutions under this model: the more errors you allow, the smaller the probability of that sequence of errors actually happening, and the bigger the score (distance).
You (implicitly) state a very different hypothesis, though. That Polish stems may have certain flexibility at the end (some linguistic process that you do not fully understand within this framework). Then, when you strip your suffix (or something that looks like one), there are three options:
1) there is a chance that what you have here is just a different form of a stem you have stored in your dictionary, or
2) it is a completely different stem, or
3) you've stripped your suffix improperly and what you have is not a stem at all.
You can heuristically estimate these probabilities by looking at how many letters in the beginning of the supposed stem match some dictionary entries, for example (how to find these entries is a related but different question). And then you can pick the guess that is the most plausible according to your metric/heuristic.
Now, note that you can use any algorithm to find the candidates in the dictionary. Including the Levenshtein algorithm - as long as you are reasonably sure that the right ones will be picked up. But obviously you are better off writing your own dictionary search algorithm that follows your own metric or emulates it. For example, by giving the biggest/prohibitive cost to the change of letters in the beginning of the word and reducing it as you go towards the end.
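For instance, here is a rough Python sketch of such a position-weighted edit distance (the specific weighting scheme and the function name are my assumptions, not something from the question): edits inside the first few letters of the word are made expensive, while edits towards the end, where the Polish alternations happen, stay cheap.

def weighted_distance(a, b, protected=2, front_cost=10.0, base_cost=1.0):
    # Edits within the first `protected` letters are heavily penalised;
    # edits towards the end of the word are cheap.
    def cost(pos):
        return front_cost if pos < protected else base_cost

    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + cost(i - 1)           # delete a[i-1]
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + cost(j - 1)           # insert b[j-1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else cost(i - 1)
            d[i][j] = min(d[i - 1][j] + cost(i - 1),  # deletion
                          d[i][j - 1] + cost(j - 1),  # insertion
                          d[i - 1][j - 1] + sub)      # substitution
    return d[n][m]

# The typed stem 'karł' now maps back to the dictionary stem 'karzeł' cheaply
# (the edits sit at the end), while 'barzeł' does not (the edit sits at the start).
print(weighted_distance("karł", "karzeł"))    # 2.0
print(weighted_distance("barzeł", "karzeł"))  # 10.0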

Unscrambling words in a sentence using Natural Language Generation

I have a sentence in English. Now I want to jumble the words up and input that set of words into a program which should unscramble the words according to normal rules of English grammar to output the original sentence. I can vaguely assume it would require Natural Language Generation algorithms.
For eg:
Sentence: Mary has gone for a walk with her dog.
Set of words: {has, for, a, with, her, dog, Mary, gone, walk}
The output should be the same sentence.
I assume that the set of words alone will never be enough to reconstruct the original sentence. But what additional information must be included to recover it?
Please guide me as to where I should start.
Language models are things that can take in a text or sentence (any sequence of words) and assign it a probability based on how well the model "recognizes" that text.
To solve your problem, you could take a language model and use it to compute the probability of each possible permutation of the input words. The most probable sentence according to the model is probably the most coherent one.
For a situation like yours, an n-gram model (n = 2 or 3 should be enough) or a hidden Markov model leveraging part-of-speech tags should do the trick.
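Here is a toy Python sketch of the bigram idea (the hand-made probability table is an assumption for the demo; in practice you would estimate it from a large corpus). It scores every permutation and keeps the best one, which only works for short inputs since the number of permutations grows factorially:

from itertools import permutations

# Toy bigram "model"; real probabilities would come from a corpus.
bigram_prob = {
    ("<s>", "mary"): 0.5, ("mary", "has"): 0.4, ("has", "gone"): 0.3,
    ("gone", "for"): 0.3, ("for", "a"): 0.4, ("a", "walk"): 0.3,
    ("walk", "with"): 0.2, ("with", "her"): 0.3, ("her", "dog"): 0.3,
}
FLOOR = 1e-6  # probability assigned to unseen bigrams

def score(words):
    p = 1.0
    for prev, cur in zip(["<s>"] + list(words), words):
        p *= bigram_prob.get((prev, cur), FLOOR)
    return p

words = ["has", "for", "a", "with", "her", "dog", "mary", "gone", "walk"]
best = max(permutations(words), key=score)
print(" ".join(best))  # with this toy table: mary has gone for a walk with her dog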
You will not be able to solve your problem without additional information. Take this example:
{"happy", "you" "are"}
Can you reconstruct the sentence? Is it "You are happy" or is it "Are you happy"? Note that the words are the same but the meaning changes radically. No matter how good algorithm you write it will not be able to reconstruct the sentence if you can not.
You need to do the following to get started:
Maintain a dictionary of English words classified as nouns, adjectives, verbs, etc.
Build grammar rules for English, which you can get from any English tutorial.
Try to rearrange the words to match the grammar rules.
Note: English is a very ambiguous language, so you might end up with something else.
e.g.
grammar rule: article noun verb
input words: dog, barks, the
dictionary lookup: dog => noun, barks => verb, the => article
Rearrange the words according to the rule. There can be multiple rules, and a word can also be of multiple types, so try all possibilities (a minimal sketch follows below).
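A minimal Python sketch of that procedure, with a tiny hand-made POS dictionary and a single grammar rule (both are assumptions for the demo):

from itertools import permutations

pos = {"dog": "noun", "barks": "verb", "the": "article"}   # dictionary lookup
rules = [("article", "noun", "verb")]                      # grammar rule

words = ["dog", "barks", "the"]
for perm in permutations(words):
    if tuple(pos[w] for w in perm) in rules:
        print(" ".join(perm))                              # -> the dog barks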

find some sentences

I'd like to find a good way to extract some (let's say two) sentences from a text. Which would be better: a regexp or the split method? Your ideas?
As requested by Jeremy Stein, here are some examples.
Examples:
Input:
The first thing to do is to create the Comment model. We’ll create this in the normal way, but with one small difference. If we were just creating comments for an Article we’d have an integer field called article_id in the model to store the foreign key, but in this case we’re going to need something more abstract.
First two sentences:
The first thing to do is to create the Comment model. We’ll create this in the normal way, but with one small difference.
Input:
Mr. T is one mean dude. I'd hate to get in a fight with him.
First two sentences:
Mr. T is one mean dude. I'd hate to get in a fight with him.
Input:
The D.C. Sniper was executed by lethal injection at a Virginia prison. Death was pronounced at 9:11 p.m. ET.
First two sentences:
The D.C. Sniper was executed by lethal injection at a Virginia prison. Death was pronounced at 9:11 p.m. ET.
Input:
In her concluding remarks, the opposing attorney said that "...in this and so many other instances, two wrongs won’t make a right." The jury seemed to agree.
First two sentences:
In her concluding remarks, the opposing attorney said that "...in this and so many other instances, two wrongs won’t make a right." The jury seemed to agree.
As you can see, it's not that easy to pick two sentences out of a text. :(
As you've noticed, sentence tokenizing is a bit trickier than it might first seem, so you may as well take advantage of existing solutions. The Punkt sentence tokenizing algorithm is popular in NLP, and there is a good implementation in the Python Natural Language Toolkit which they describe the use of here. They also describe another approach here.
There are probably other implementations around, or you could also read the original paper describing the Punkt algorithm: Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection. Computational Linguistics 32: 485-525.
You can also read another Stack Overflow question about sentence tokenizing here.
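For example, a quick sketch of the NLTK route in Python (the Punkt model has to be downloaded once; its resource name varies slightly between NLTK versions):

import nltk
nltk.download("punkt", quiet=True)  # one-time model download ('punkt_tab' in newer NLTK versions)

text = ("Mr. T is one mean dude. I'd hate to get in a fight with him. "
        "Death was pronounced at 9:11 p.m. ET.")
print(nltk.sent_tokenize(text)[:2])  # abbreviations like "Mr." are usually handled correctly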
your_string = "First sentence. Second sentence. Third sentence"
sentences = your_string.split(".")
=> ["First sentence", " Second sentence", " Third sentence"]
No need to make simple code complicated.
Edit: Now that you've clarified that the real input is more complex than your initial example, you should disregard this answer as it doesn't consider edge cases. An initial look at NLP should show you what you're getting into, though.
Some of the edge cases that I've found to be a bit complicated in the past are:
Dates: Some regions use dd.mm.yyyy
Quotes: While he was sighing — "Whatever, do it. Now. And by the way...". This was enough.
Units: He was going at 138 km. while driving on the freeway.
If you plan to parse these texts you should stay away from splits or regular expressions.
This will usually match sentences.
/\S(?:(?![.?!]+\s).)*[.?!]+(?=\s|$)/m
For your example of two sentences, take the first two matches.
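If you want to sanity-check that pattern outside Ruby, here is the same idea in Python (re.S plays the role of Ruby's /m flag, letting . cross newlines):

import re

pattern = re.compile(r"\S(?:(?![.?!]+\s).)*[.?!]+(?=\s|$)", re.S)
text = "The first sentence. The second sentence. And the third one!"
print(pattern.findall(text)[:2])  # -> ['The first sentence.', 'The second sentence.']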
irb(main):005:0> a = "The first sentence. The second sentence. And the third"
irb(main):006:0> a.split(".")[0...2]
=> ["The first sentence", " The second sentence"]
irb(main):007:0>
EDIT: here's how you handle the "This is a sentence ...... and another . And yet another ..." case :
irb(main):001:0> a = "This is the first sentence ....... And the second. Let's not forget the third"
=> "This is the first sentence ....... And the second. Let's not forget the thir
d"
irb(main):002:0> a.split(/\.+/)
=> ["This is the first sentence ", " And the second", " Let's not forget the thi rd"]
And you can apply the same range operator ... to extract the first 2.
You will find tips and software links on the sentence boundary detection Wikipedia page.
If you know which sentences to search for, a regex should do well with something like
((YOUR SENTENCE HERE)|(YOUR OTHER SENTENCE)){1}
Split would probably use up quite a lot of memory, as it also keeps the things you don't need (all the text that isn't your sentence), whereas the regex only keeps the sentence you searched for (if it finds it, of course).
If you're segmenting a piece of text into sentences, then what you want to do is begin by determining which punctuation marks can separate sentences. In general, this is !, ? and . (but if all you care about is the . for the texts you're processing, then just go with that).
Now, since these can appear inside quotations or as parts of abbreviations, what you want to do is find each occurrence of these punctuation marks and run some sort of machine-learning classifier to determine whether that occurrence ends a sentence or does something else. This involves training data and a properly constructed classifier. And it won't be 100% accurate, because there's probably no way to be 100% accurate.
I suggest looking in the literature for sentence segmentation techniques, and have a look at the various natural language processing toolkits that are out there. I haven't really found one for Ruby yet, but I happen to like OpenNLP (which is in Java).
