How can I add more tagged words to the Stanford POS-Tagger's trained models? - pos-tagger

I haven't found anything in the documentation about adding more tagged words to the tagger, specifically the bi-directional one.
Thanks

At present, you can't. Model training is an all-at-once operation. (Since the tagger uses weights that take contexts and frequencies into account, it isn't trivial to add new words to it post hoc.)

There is a workaround. It is ugly but should do the trick:
- build a list of "your" words
- scan the text for those words
- if any matches are found, do the POS tagging yourself (NLTK can help you here)
- feed the tagged text to the Stanford parser (a rough sketch follows the quote below).
FROM: http://www.cs.ucf.edu/courses/cap5636/fall2011/nltk.pdf
"You can also give it POS tagged text; the parser will try to use
your tags if they make sense.
You might want to do this if the parser makes tagging
mistakes in your text domain."
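For the NLTK side of that workaround, a minimal sketch might look like the following. The custom_tags dictionary and the example words are hypothetical, the tagging uses NLTK's default tagger rather than Stanford's, and the exact options your Stanford parser version needs to accept pre-tagged "word/TAG" input (e.g. -tokenized and -tagSeparator) should be checked against its documentation.

```python
import nltk

# Hypothetical list of "your" words and the tags you want forced on them.
custom_tags = {"Foobar": "NNP", "grok": "VB"}

def pretag(sentence):
    """Tokenize and tag a sentence, overriding tags for the custom words."""
    tokens = nltk.word_tokenize(sentence)   # requires the 'punkt' data
    tagged = nltk.pos_tag(tokens)           # NLTK's default tagger, not Stanford's
    tagged = [(w, custom_tags.get(w, t)) for w, t in tagged]
    # Emit "word/TAG" text to hand to the Stanford parser as pre-tagged input.
    return " ".join(f"{word}/{tag}" for word, tag in tagged)

print(pretag("Foobar can grok plain text."))
```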

Related

Which Tagging format is the best for training Stanford NER (IO/ IOB)?

I have trained Stanford NER to extract organization names from text. I used the IO tagging format, and it works fine. However, I wonder whether changing the tagging format to IOB (or another format) might improve the scores?
Suppose you have a sentence that lacks normal punctuation, like this:
John Sam Ted are all here.
If you don't have a B tag you won't be able to tell if this should be three entities or one entity with three words.
On the other hand, for many common types of entities, they can't just run together in normal English text since you'll at least have a comma between them.
If you can set it up, using IOB is better in case you have entities that run together, but depending on your data set it may not be an issue. You'll have to look at the data to tell.
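As an illustration (not from the original answer), here is that sentence labelled both ways, using the conventional PERSON/O label names:

```python
# The same tokens labelled in IO vs. IOB style (conventional PERSON/O labels).
tokens   = ["John",     "Sam",      "Ted",      "are", "all", "here", "."]
io_tags  = ["PERSON",   "PERSON",   "PERSON",   "O",   "O",   "O",    "O"]
iob_tags = ["B-PERSON", "B-PERSON", "B-PERSON", "O",   "O",   "O",    "O"]

# With IO labels alone there is no way to tell whether "John Sam Ted" is one
# three-word entity or three one-word entities; the B- prefix in IOB makes
# the three-entity reading explicit.
```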

NLP Postagger can't grok imperatives?

The Stanford NLP POS tagger claims that imperative verbs were added to a recent version. I've input lots of text with abundant and obvious imperatives, but there seems to be no tag for them in the output. Must one, after all, train it for this POS?
There is no special tag for imperatives; they are simply tagged as VB.
The info on the website refers to the fact that we added a bunch of manually annotated imperative sentences to our training data, so that the POS tagger gets more of them right, i.e. tags the verb as VB.
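To check this on your own text, one option is to query a locally running CoreNLP server and look at the POS field of each token; the sketch below assumes such a server on port 9000, and the example sentence and expected tag are illustrative only.

```python
import json
import requests

# Assumes a Stanford CoreNLP server is already running locally, e.g.:
#   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
props = {"annotators": "tokenize,ssplit,pos", "outputFormat": "json"}
resp = requests.post(
    "http://localhost:9000/",
    params={"properties": json.dumps(props)},
    data="Close the door.".encode("utf-8"),
)
for token in resp.json()["sentences"][0]["tokens"]:
    print(token["word"], token["pos"])  # the imperative "Close" should come out as VB
```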

Segmentation of entities in Named Entity Recognition

I have been using the Stanford NER tagger to find the named entities in a document. The problem I am facing is described below:
Let the sentence be The film is directed by Ryan Fleck-Anna Boden pair.
Now the NER tagger marks Ryan as one entity, Fleck-Anna as another and Boden as a third entity. The correct marking should be Ryan Fleck as one and Anna Boden as another.
Is this a problem of the NER tagger and if it is then can it be handled?
How about:
- take your data and run it through Stanford NER or some other NER
- look at the results and find all the mistakes
- correctly tag the incorrect results and feed them back into your NER
- lather, rinse, repeat...
This is a sort of manual boosting technique. But your NER probably won't learn too much this way.
In this case it looks like there is a new feature, hyphenated names, that the NER needs to learn about. Why not make up a bunch of hyphenated names, put them in some text, tag them, and train your NER on that?
You should get there by adding more features, more data and training.
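A minimal sketch of generating such synthetic training data is below. The name lists, sentence template, and output file name are all made up; the token-per-line, tab-separated PERSON/O layout is the format Stanford NER's CRFClassifier is typically trained on, and you would still point a properties file at the resulting file to retrain.

```python
import random

# Hypothetical name lists and sentence template for synthesising hyphenated-name
# examples in token<TAB>label format (blank line between sentences).
FIRST = ["Ryan", "Anna", "John", "Maria"]
LAST = ["Fleck", "Boden", "Smith", "Garcia"]
NAME_PARTS = set(FIRST + LAST)

def label(token):
    """PERSON if the token contains any of our made-up name parts, else O."""
    return "PERSON" if any(part in token for part in NAME_PARTS) else "O"

def synth_sentence():
    pair = (f"{random.choice(FIRST)} {random.choice(LAST)}-"
            f"{random.choice(FIRST)} {random.choice(LAST)}")
    tokens = f"The film is directed by the {pair} pair .".split()
    return "\n".join(f"{tok}\t{label(tok)}" for tok in tokens)

with open("hyphenated-names.tsv", "w") as out:
    for _ in range(200):
        out.write(synth_sentence() + "\n\n")
```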
Instead of using Stanford CoreNLP you could try Apache OpenNLP. There is an option to train your own model on your training data. Since that model depends on the names you supply, it is able to detect the names you are interested in.

Ruby Text/Sentiment Analysis

I have two strings:
"I like running around the track."
"I like swimming in the pool, but only in the morning."
I need to pull out what people "like" from the above two comments (running around the track and swimming in the pool).
Does anyone have a recommendation for a text analytics gem or other method of pulling in that kind of information? I don't necessarily need word counts or n-grams, I just want to know what words are seen in relation to the word "like".
For a quick-and-dirty fix, you could use a regex to search for all the forms of "like" and pull out all the text between there and the next punctuation mark or newline character.
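A rough sketch of that quick-and-dirty approach (shown here in Python rather than Ruby; the pattern and sample text are just illustrations, and an equivalent regex works the same way in Ruby):

```python
import re

text = ("I like running around the track. "
        "I like swimming in the pool, but only in the morning.")

# Capture everything between a form of "like" and the next punctuation mark.
matches = re.findall(r"\blik(?:e|es|ed|ing)\b\s+([^.,;!?\n]+)", text)
print(matches)  # ['running around the track', 'swimming in the pool']
```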
You could use a dependency parser such as the Stanford Parser to parse your text and find the key words from your sentiment dictionary, probably putting some constraints on the type of dependencies for disambiguation. For example, the dependency might need to be of type "dobj" (direct object). Then follow the dependency structure to the end of the phrase or sentence, depending on your needs.
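A small sketch of that idea, using spaCy instead of the Stanford Parser purely for illustration (the model name and the dependency labels checked here are assumptions; the point is the same, follow the dependency edges out of "like"):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed
doc = nlp("I like swimming in the pool, but only in the morning.")

for token in doc:
    # Complements attached directly to "like" (direct object or open clausal complement).
    if token.head.lemma_ == "like" and token.dep_ in ("dobj", "xcomp"):
        # token.subtree yields the whole phrase governed by that complement.
        print(" ".join(t.text for t in token.subtree))
```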

Algorithms to recognize misspelled names in texts

I need to develop an application that will index several texts, and I need to search for people’s names inside these texts. The problem is that, while a person’s correct name is “Gregory Jackson Junior”, inside the text the name might be written as:
- Greg Jackson Jr
- Gegory Jackson Jr
- Gregory Jackson
- Gregory J. Junior
I plan to index the texts on a nightly basis and build a database index to speed up the search. I would like recommendations for good books and/or articles on the subject.
Thanks
Check these related questions.
Algorithm to find articles with similar text
How to search for a person's name in a text? (heuristic)
Your question is phrased a bit misleadingly: the examples do not show misspellings so much as variation in how a full name is written.
And,
would your search expect to match on words like son with reference to the example?
would it expect to match "Bob" when looking for the name "Robert"?
Are you looking for things like this and this?
Ok, reading your comment suggests you do not want to venture into that.
For the record: use a Bayesian filter. You could use Mechanical Turk for initializing your algorithm.
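As a starting point for matching such name variants (a generic sketch, not from the answer above; difflib's SequenceMatcher is just one simple string-similarity measure, and a real system would add nickname tables such as Bob/Robert and perhaps phonetic keys):

```python
from difflib import SequenceMatcher

canonical = "Gregory Jackson Junior"

variants = [
    "Greg Jackson Jr",
    "Gegory Jackson Jr",
    "Gregory Jackson",
    "Gregory J. Junior",
]

def similarity(a, b):
    """Character-level similarity ratio between two names (0..1)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

for v in variants:
    print(f"{v!r}: {similarity(canonical, v):.2f}")
```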
