Coreference resolution for tokenized text in stanford - stanford-nlp

What I have been given is tokenized text, something like:
"In the summer of 2005 , a picture that people have long been looking
forward to T-1 started *-2 emerging with frequency in various major
media ."
I need to get the coreference resolution output from Stanford CoreNLP, but I want to start from the steps after tokenization. I have the tokens and I am trying to create the sentence annotation, but it comes back null. Can somebody help? (I read the post "Coreference resolution using Stanford CoreNLP"; it was helpful but not enough.)

Try setting the tokenize.whitespace property to true. This tells the tokenizer to split on whitespace only, i.e., to treat the text as already tokenized.
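As a rough illustration, here is a small Python sketch that renders a properties file with that setting. The helper name corenlp_properties is made up for this example, and the annotator list and the ssplit.eolonly line go beyond what the answer says: they assume a typical dcoref pipeline and input with one sentence per line, so adjust or drop them to match your data.

```python
def corenlp_properties(annotators, pretokenized=True, one_sentence_per_line=True):
    """Render CoreNLP-style properties as text (hypothetical helper)."""
    props = {"annotators": ",".join(annotators)}
    if pretokenized:
        # Split on whitespace only, so the existing tokens are kept as-is.
        props["tokenize.whitespace"] = "true"
    if one_sentence_per_line:
        # Assumption: each input line already holds exactly one sentence.
        props["ssplit.eolonly"] = "true"
    return "\n".join(f"{key} = {value}" for key, value in props.items())


# Assumed annotator chain for dcoref; not taken from the answer above.
text = corenlp_properties(
    ["tokenize", "ssplit", "pos", "lemma", "ner", "parse", "dcoref"]
)
print(text)
```

The resulting text can be saved to a file and passed to the pipeline, or the same keys can be set on a Properties object in Java.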

Related

Which Tagging format is the best for training Stanford NER (IO/ IOB)?

I have trained Stanford NER to extract organization names from text. I used the IO tagging format, and it works fine. However, I wonder whether changing the tag format to IOB (or another format) might improve the scores?
Suppose you have a sentence that lacks normal punctuation, like this:
John Sam Ted are all here.
If you don't have a B tag you won't be able to tell if this should be three entities or one entity with three words.
On the other hand, many common types of entities can't simply run together in normal English text, since you'll at least have a comma between them.
If you can set it up, using IOB is better in case you have entities run together, but depending on your data set it may not be an issue. You'll have to look at the data to tell.
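To make the difference concrete, here is a minimal Python sketch that decodes entity spans from both tag formats for the sentence above. The function decode_entities and the PER label are assumptions made for this illustration; they are not part of Stanford NER.

```python
def decode_entities(tokens, tags):
    """Group tokens into entity strings from IO or IOB tags.

    IO tags look like "PER" / "O"; IOB tags look like "B-PER" / "I-PER" / "O".
    """
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag == "O":
            prefix, etype = None, None
        elif "-" in tag:
            prefix, etype = tag.split("-", 1)
        else:
            prefix, etype = "I", tag  # plain IO tag: always "inside"
        # Close the open entity on O, on a type change, or on an explicit B.
        if current and (etype != current_type or prefix == "B"):
            entities.append(" ".join(current))
            current = []
        if etype is not None:
            current.append(token)
        current_type = etype
    if current:
        entities.append(" ".join(current))
    return entities


tokens = ["John", "Sam", "Ted", "are", "all", "here", "."]
io_tags = ["PER", "PER", "PER", "O", "O", "O", "O"]
iob_tags = ["B-PER", "B-PER", "B-PER", "O", "O", "O", "O"]

io_entities = decode_entities(tokens, io_tags)    # one merged entity
iob_entities = decode_entities(tokens, iob_tags)  # three separate entities
print(io_entities, iob_entities)
```

With IO tags the three names collapse into a single entity, while the B tags keep them apart, which is exactly the case where IOB helps.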

Part of speech tagged as "word"

I'm using the Stanford Part of Speech tagger on some Spanish text. As per their docs the part of speech tags come from this set: http://nlp.stanford.edu/software/spanish-faq.shtml#tagset
Overall, I've found this to be accurate and haven't had an issue. However, I just ran into a small snippet of text: "Adiós ~ hailey". This is tagged as follows: Adiós_i ~_word hailey_aq0000. So the ~ symbol, which I think should get a punctuation tag of f0, got a tag of word instead. That tag isn't documented or expected. Is this a bug or intended behavior?
Update
It turns out the special "word" tag appears in other contexts as well. I just saw it for the word it and the word á.
Thanks for catching this! I've been a bit slow to catch up on documentation. I just updated the tag list in our documentation to include the new word tag.
In the CoreNLP 3.7.0 release, we included new Spanish models trained on extra data (specifically, the DEFT Spanish Treebank V2). Some of the new data comes from a discussion forum dataset (Latin American Spanish Discussion Forum Treebank). This dataset uses an extra POS tag, word, to label emoticons and miscellaneous symbols (e.g. the ® sign).
(I know, it's a sort of silly choice of name, but we wanted to stick with what the original corpus used.)
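If you post-process the tagger output, you may want to handle this extra tag explicitly. Here is a small Python sketch that splits the token_TAG format shown in the question and collects the tokens tagged word; the helper parse_tagged is hypothetical and assumes an underscore separator.

```python
def parse_tagged(line, separator="_"):
    """Split 'token_TAG' items into (token, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split on the last separator so tokens containing "_" stay intact.
        token, tag = item.rsplit(separator, 1)
        pairs.append((token, tag))
    return pairs


pairs = parse_tagged("Adiós_i ~_word hailey_aq0000")
# Tokens that received the extra "word" tag (emoticons, symbols, etc.).
symbols = [token for token, tag in pairs if tag == "word"]
print(pairs, symbols)
```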

Stanford CoreNLP demo and coreference resolution

Consider the sentences below:
Bats are the only mammals that can fly. They are also among the only mammals known to feed on blood.
Enter them at the link below:
http://nlp.stanford.edu:8080/corenlp/process
Coreference output does not show that the word They refers to Bats. Am I missing something basic?
Stanford's dcoref module has the pronoun 'they' hardcoded to be animate only, and presumably 'bat' is in the inanimate word list.
The animate restriction is probably justified for the newswire training data, but is not valid for general English.
You can change the animate list here: https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/dcoref/Dictionaries.java#L155

Bing/Google/Flickr API: how would you find an image to go along each of 150,000 Japanese sentences?

I'm doing a part-of-speech and morphological analysis project for Japanese sentences. Each sentence will have its own webpage. To make this page more visual, I want to show one picture that is somehow related to the sentence. For example, for the sentence "私は学生です" ("I'm a student"), the relevant pictures would be pictures of a school, a Japanese textbook, students, etc. What I have: part-of-speech tagging for every word. My approach now: use 2-3 nouns from every sentence and retrieve the first image from the search results using the Bing Images API. Note: all the sentence processing up to this point was done in Java.
I have a couple of questions, though:
1) Which is better (richer corpus and more powerful search) for searching nouns in Japanese: the Google Images API, the Bing Images API, the Flickr API, etc.?
2) How do you select the most important noun from the sentence to use as the image search query, without doing complicated topic modeling, etc.?
Thanks!
Japanese WordNet has links to OpenClipart pictures. That could be another relevant source. They describe it in their paper called "Enhancing the Japanese WordNet".
I would start by choosing any noun that comes before は, が, or を and giving these priority, probably in that order.
But that assumes that your part-of-speech tagging is good enough to get は=subject identified properly (as I guess you know that は is not always the subject marker).
I looked at a bunch of sample sentences here with this technique in mind and found it as good as could be expected, except where none of those particles are used, which is rarish.
There are also sentences like the one below, where you might look for で and the noun before it when there is no を or は. Notice that the word 人 (people) really doesn't tell you anything about what's being said. Without parsing the context properly, you don't even know whether the noun means person or people.
毎年 交通事故で 多くの人が 死にます
(many people die in traffic accidents every year)
But basically, couldn't you implement a priority/fallback type system like this?
BTW I hope your sentences all use kanji, or when you see はし (in one of the sentences linked to) you won't know whether to show a bridge or chopsticks - and showing the wrong one will probably not be good.
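A priority/fallback system along these lines could be sketched in Python as follows. Everything here is an assumption for illustration: the function pick_noun, the tag names, and the segmentation of the example sentence are made up, and a real tagger's output will look different.

```python
NOUN = "noun"          # hypothetical tag names; map your tagger's tags to these
PARTICLE = "particle"


def pick_noun(tagged, priority=("は", "が", "を", "で")):
    """Return the noun directly before the highest-priority particle.

    Falls back to the first noun in the sentence, or None if there is none.
    """
    for particle in priority:
        for i in range(1, len(tagged)):
            word, pos = tagged[i]
            prev_word, prev_pos = tagged[i - 1]
            if word == particle and pos == PARTICLE and prev_pos == NOUN:
                return prev_word
    for word, pos in tagged:
        if pos == NOUN:
            return word
    return None


# Assumed segmentation of: 毎年 交通事故で 多くの人が 死にます
sentence = [
    ("毎年", NOUN), ("交通事故", NOUN), ("で", PARTICLE),
    ("多く", NOUN), ("の", PARTICLE), ("人", NOUN), ("が", PARTICLE),
    ("死に", "verb"), ("ます", "auxiliary"),
]

default_choice = pick_noun(sentence)
reordered_choice = pick_noun(sentence, priority=("は", "を", "で", "が"))
print(default_choice, reordered_choice)
```

With the default order the sketch picks 人 for this sentence, which is the uninformative case described above. Moving で ahead of が picks 交通事故 instead, so the priority order is worth tuning against your own sentences.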

How can I add more tagged words to the Stanford POS-Tagger's trained models?

I haven't found anything in the documentation about adding more tagged words to the tagger, specifically the bi-directional one.
Thanks
At present, you can't. Model training is an all-at-one-time operation. (Since the tagger uses weights that take into account contexts and frequencies, it isn't trivial to add new words to it post hoc.)
There is a workaround. It is ugly but should do the trick:
1) Build a list of "your" words.
2) Scan the text for these words.
3) If any matches are found, do the POS tagging for them yourself (NLTK can help you here).
4) Feed the tagged text to the Stanford parser.
From http://www.cs.ucf.edu/courses/cap5636/fall2011/nltk.pdf:
"You can also give it POS tagged text; the parser will try to use
your tags if they make sense.
You might want to do this if the parser makes tagging
mistakes in your text domain."
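The workaround steps could look roughly like this in Python. The function pretag, the example lexicon, and the "/" separator are assumptions for this sketch; check the parser documentation for how to tell it which tag separator your input uses (I believe the command-line option is -tagSeparator, but verify that before relying on it).

```python
def pretag(tokens, custom_tags, separator="/"):
    """Attach a tag to every token found in custom_tags; leave the rest untagged."""
    tagged = []
    for token in tokens:
        # Lookup is case-insensitive; lexicon keys are stored in lowercase.
        tag = custom_tags.get(token.lower())
        tagged.append(f"{token}{separator}{tag}" if tag else token)
    return " ".join(tagged)


# Hypothetical domain lexicon: "your" words and the tags you want for them.
custom_tags = {"corenlp": "NNP", "dcoref": "NN"}

line = pretag(["We", "ran", "dcoref", "in", "CoreNLP", "."], custom_tags)
print(line)
```

Only the words in your list get a tag, so the parser is left to tag everything else itself.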
