Natural Language Parsing using Stanford NLP - stanford-nlp

How does the Stanford Natural Language Parser use the Penn Treebank in the tagging process? I want to know how it finds the POS tags for a given input.

The Stanford part-of-speech tagger uses a probabilistic sequence model to determine the most likely sequence of part-of-speech tags underlying a sentence. Some of the features provided to this model are:
Surrounding words and n-grams
Part-of-speech tags of surrounding words
"Word shapes" (e.g., "Foo5" is translated to "Xxx#")
Word suffix, prefix
See the ExtractorFrames class for details. The model is trained on a tagged corpus (like the Penn Treebank) which has each token annotated with its correct part of speech.
At run time, features like those mentioned above are calculated for input text and are used to build per-tag probabilities, which are then fed into an implementation of the Viterbi algorithm (ExactBestSequenceFinder), which finds the most likely arrangement of tags for the entire sequence.
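To make the decoding step concrete, here is a minimal Viterbi sketch in Python. The tag set, emission scores, and transition scores are made-up toy numbers, not the tagger's real feature-based probabilities; the sketch only illustrates the kind of search ExactBestSequenceFinder performs.

import numpy as np

# Toy per-tag scores for each token (log probabilities); these stand in for
# the per-tag probabilities the feature model would produce for real text.
tags = ["DT", "NN", "VB"]
emissions = np.log([[0.8, 0.1, 0.1],   # "the"
                    [0.1, 0.7, 0.2],   # "dog"
                    [0.1, 0.2, 0.7]])  # "runs"
# Toy transition scores between consecutive tags (rows: previous tag).
transitions = np.log([[0.1, 0.8, 0.1],
                      [0.1, 0.3, 0.6],
                      [0.3, 0.5, 0.2]])

n_tokens, n_tags = emissions.shape
score = np.full((n_tokens, n_tags), -np.inf)
backptr = np.zeros((n_tokens, n_tags), dtype=int)
score[0] = emissions[0]
for t in range(1, n_tokens):
    for j in range(n_tags):
        cand = score[t - 1] + transitions[:, j] + emissions[t, j]
        backptr[t, j] = int(np.argmax(cand))
        score[t, j] = cand[backptr[t, j]]

# Follow the back-pointers from the best final tag to recover the sequence.
best = [int(np.argmax(score[-1]))]
for t in range(n_tokens - 1, 0, -1):
    best.append(backptr[t, best[-1]])
print([tags[i] for i in reversed(best)])  # -> ['DT', 'NN', 'VB']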
For more information on getting started with POS tagging:
Watch the Week 5 lectures of the Coursera NLP class (co-taught by the CoreNLP lead)
Check out the code in the edu.stanford.nlp.tagger.maxent package
Part-of-speech tagging in NLTK

Related

POS/NER able to differentiate between the same word being used in multiple contexts?

I have a collection of over 1 million bodies of text. Within those bodies are multiple entities whose names mimic common stop words and phrases.
This has created issues when tokenizing the data, as there are ~50 entities with the same problem. To counteract this, I've disabled stop-word removal for the matched words. This is fine, but ideally I'd have a way to differentiate when a token is actually meant as a stop word vs. an entity, since I only care about when it's used as an entity.
Here's a sample excerpt:
A determined somebody slept. Prior to this, A could never be comfortable with the idea of responsibility. It was foreign, something heard about through a story passed down by words of U. As slow as it could be, A began to find meaning in the words of a story.
A and U are entities/nouns in most of their usages here. POS tagging so far has only labelled A as a determiner, and NER won't tag any instances of the word. Adding the target tags to the NER list results in every instance being tagged as an entity, which is not the case either.
So far I've primarily used the Stanford POS Tagger and SpaCY for NER.
I think you should try to train your own NER model.
You can do this in three steps, as follows:
1. Label a number of documents in your corpus. You can do this using the spacy-annotator.
2. Train your spaCy NER model from scratch. You can follow the instructions in the spaCy docs.
3. Use the trained model to predict entities in your corpus.
By labelling a good number of entities in step 1, the model will learn to differentiate between a determiner and an entity.
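If it helps, here is a minimal sketch of steps 2 and 3 using spaCy v3's training API. The CHARACTER label, the two tiny hand-labelled examples, and their character offsets are hypothetical stand-ins for the annotations you would produce with the spacy-annotator in step 1.

import spacy
from spacy.training import Example

# Hypothetical annotations: (text, {"entities": [(start_char, end_char, label)]}).
TRAIN_DATA = [
    ("A began to find meaning in the words of a story.",
     {"entities": [(0, 1, "CHARACTER")]}),
    ("It was a story passed down by words of U.",
     {"entities": [(39, 40, "CHARACTER")]}),
]

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("CHARACTER")

optimizer = nlp.initialize()
for epoch in range(30):
    losses = {}
    for text, annots in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annots)
        nlp.update([example], sgd=optimizer, losses=losses)

# The trained model now predicts entities for unseen sentences (step 3).
doc = nlp("Prior to this, A could never be comfortable with responsibility.")
print([(ent.text, ent.label_) for ent in doc.ents])

In practice you would label far more than two sentences and use spaCy's config-based training, but the flow (label, train, predict) is the same.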

Determining the "goodness" of a phrase based on "grammatical" or "contextual" relevancy

Given a random string of words, I would like to assign a "goodness" score to the phrase, where "goodness" is some indication of grammatical and contextual relevancy.
For example:
"the green tree was tall" [Good score]
"delicious tires swim open" [Medium score]
"jump an con porch calmly" [Poor score]
I've been experimenting with the Natural Language Toolkit. I'd considered using a trained tagger to assign parts-of-speech to each word in a phrase, and then parse a corpus for occurrences of that POS pattern. This may give me an indication of grammatical "goodness". However, as the tagger itself is trained on the same corpus that I'm using for validation, I can't imagine the results would be reliable. This approach also does not take into consideration the contextual relevancy of the words.
Is anyone aware of existing projects or research into this sort of thing? How would you approach this?
You could employ two different approaches: supervised and semi-supervised.
Supervised
Assuming you have a labelled dataset of tuples of the form <sentence> <goodness label> (like the ones in your examples), you could first split your dataset into train and test folds (e.g., 4:1).
Then you could simply use BERT feature vectors (these are pre-trained on large volumes of natural-language text). The following piece of code gives you the vectors for the sentence "the green tree was tall" (read more here).
from transformers import pipeline
import numpy as np

nlp_features = pipeline('feature-extraction')
output = nlp_features('the green tree was tall')
np.array(output).shape  # (Samples, Tokens, Vector Size)
Assuming you vectorize every sentence, you could then train a simple logistic regression model (scikit-learn) that learns a set of parameters to minimize the prediction errors on the training set; eventually, you throw the test-set sentences at this model to see how it behaves.
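A rough sketch of that supervised pipeline follows; the two sentences and their labels are placeholders for your real training fold, and the per-token BERT vectors are mean-pooled into one fixed-size vector per sentence before fitting the classifier.

from transformers import pipeline
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder training data: 1 = "good" sentence, 0 = "poor" sentence.
sentences = ["the green tree was tall", "jump an con porch calmly"]
labels = [1, 0]

extractor = pipeline("feature-extraction")
# Mean-pool the per-token vectors so every sentence maps to one fixed-size vector.
X = np.array([np.mean(extractor(s)[0], axis=0) for s in sentences])

clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X))  # sanity check on the (tiny) training data itself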
Instead of BERT, you could also use embedded vectors as inputs to an LSTM network for training the classifier (like the one here).
Semi-supervised
This is applicable when you don't have sufficient labelled data (although you need a few examples to get started).
In this case, I think what you could do is to map the words of a sentence into POS tag sequences, e.g.,
the green tree was tall --> ARTICLE ADJ NOUN VERB ADJ (see here for more details).
This step would make your method depend less on the words themselves. A model trained on these sequences would try to discover some latent distinguishing characteristics of good sentences from the bad ones.
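A quick way to obtain such POS tag sequences is NLTK's off-the-shelf tagger; this is just a sketch, and it emits Penn Treebank tags rather than the coarse labels shown above.

import nltk

nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

tokens = nltk.word_tokenize("the green tree was tall")
print([tag for _, tag in nltk.pos_tag(tokens)])  # e.g. ['DT', 'JJ', 'NN', 'VBD', 'JJ']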
In particular, you could run a standard text classification approach with Bidirectional LSTMs for training your classifier (this time not with words but with a much smaller vocabulary of POS tags).
You can use a transformer model from HuggingFace that is fine-tuned for sentence correctness. Specifically, the model has to be fine-tuned on the Corpus of Linguistic Acceptability (CoLA). Here's a Medium article on HuggingFace, transformers, and the fine-tuning process.
You can also get a model that's already fine-tuned and plug it into the text-classification pipeline of HuggingFace's transformers library here. That site hosts fine-tuned models, and you can search there for a few others fine-tuned on the CoLA task.
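For instance, assuming a CoLA fine-tuned checkpoint from the hub (the model name below is one publicly shared checkpoint, used here only as an example), the whole thing collapses to a few lines:

from transformers import pipeline

# Example checkpoint fine-tuned on CoLA; any CoLA-tuned model should behave similarly.
classifier = pipeline("text-classification", model="textattack/bert-base-uncased-CoLA")
print(classifier("the green tree was tall"))
print(classifier("jump an con porch calmly"))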

Parts of Speech, Stanford Core NLP

It may be that he is bargaining for the Chancellorship, which he is certainly not fit for.
In the sentence above, the Stanford Parser tags the word "bargaining" as NN, which signifies that the word is a noun here. However, given how the word is used in this sentence, it should have been tagged as a verb.
Could anyone clarify this?
You are right: in your sentence above, the word is a verb. The Stanford POS Tagger calculates tags using a bidirectional approach. The POS tag of a word is calculated based on the context it appears in, meaning that the two words before and the two words after are considered. Based on that, the algorithm outputs the tag that is most likely to be correct. The tagger does not claim the output to be correct; according to the model, it is simply more likely that "bargaining" is a noun in this sentence.
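If you want to inspect what tag is assigned to "bargaining" in this exact context, a minimal check (shown here with stanza, the Stanford group's Python pipeline, purely for convenience) looks like this:

import stanza  # assumes the stanza package and its English models are installed

nlp = stanza.Pipeline(lang='en', processors='tokenize,pos', verbose=False)
doc = nlp("It may be that he is bargaining for the Chancellorship, "
          "which he is certainly not fit for.")
for word in doc.sentences[0].words:
    if word.text == "bargaining":
        print(word.text, word.xpos)  # whichever tag the model judges most likely here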
You could train your own POS tagger, if you really want a correct output. But that requires a lot of effort. For more information, take a look here and here.

Stanford NLP training DocumentPreprocessor

Does Stanford NLP provide a training method for the DocumentPreprocessor, so that I can train it on my own corpora and create my own models for sentence splitting?
I am working with German sentences and I need to create my own German model for sentence splitting tasks. Therefore, I need to train the sentence splitter, DocumentPreprocessor.
Is there a way I can do it?
No. At present, tokenization of all European languages is done by a (hand-written) finite automaton. Machine learning-based tokenization is used for Chinese and Arabic. At present, sentence splitting for all languages is done by rule, exploiting the decisions of the tokenizer. (Of course, that's just how things are now, not how they have to be.)
At present we have no separate German tokenizer/sentence splitter. The current properties file just re-uses the English ones. This is clearly sub-optimal. If someone wanted to produce something for German, that would be great to have. (We may do it at some point, but German development is not currently at the top of the list of priorities.)

Sentiment analysis

While performing sentiment analysis, how can I make the machine understand that I'm referring to Apple (the iPhone), instead of apple (the fruit)?
Thanks for the advice!
Well, there are several methods.
I would start by checking capitalization; usually, when referring to a name, the first letter is capitalized.
Before doing sentiment analysis, I would use part-of-speech tagging and named entity recognition to tag the relevant words.
Stanford CoreNLP is a good text-analysis project to start with; it will teach you the basic concepts.
Example from CoreNLP (each token annotated with its POS and NER tag): you can see how the tags can help you, and you can check out more info there.
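As a small sketch of that tagging step (using stanza, the Stanford group's Python pipeline, for brevity; the sentence is made up), you could run NER before the sentiment step and keep the entity labels alongside the tokens:

import stanza  # assumes the stanza package and its English models are installed

nlp = stanza.Pipeline(lang='en', processors='tokenize,ner', verbose=False)
doc = nlp("I just bought a new Apple iPhone and an apple for lunch.")
for ent in doc.ents:
    print(ent.text, ent.type)  # the brand mention is usually labelled as an organization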
As described by Ofiris, NER is only one way to solve your problem. I feel it's more effective to use word embeddings to represent your words; that way, the machine automatically recognizes the context of a word. For example, "apple" mostly occurs together with words like "eat", but if the input "Apple" appears together with "mobile" or other words from that domain, the machine will understand it is the iPhone's "Apple" rather than the fruit. Two popular ways to generate word embeddings are word2vec and fastText.
Gensim provides reliable implementations of both word2vec and fastText:
https://radimrehurek.com/gensim/models/word2vec.html
https://radimrehurek.com/gensim/models/fasttext.html
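A tiny gensim sketch might look like the following; the four-sentence corpus and the hyperparameters are placeholders, and in practice you would train on your own domain text.

from gensim.models import Word2Vec

# Placeholder corpus; real training needs far more text.
sentences = [
    ["i", "eat", "an", "apple", "every", "morning"],
    ["the", "new", "apple", "iphone", "has", "a", "better", "camera"],
    ["she", "peeled", "the", "apple", "before", "eating", "it"],
    ["apple", "released", "a", "software", "update", "for", "the", "iphone"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=100)
# Words that share contexts with "apple" end up close to it in vector space.
print(model.wv.most_similar("apple", topn=3))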
In the presence of dates, famous brands, VIPs, or historical figures, you can use a NER (named entity recognition) algorithm; in that case, as suggested by Ofiris, Stanford CoreNLP offers a good named entity recognizer.
For a more general disambiguation of polysemous words (i.e., words having more than one sense, such as "good"), you could use a POS tagger coupled with a Word Sense Disambiguation (WSD) algorithm. An example of the latter can be found HERE, but I do not know of any freely downloadable library for this purpose.
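For what it's worth, NLTK does ship a simple Lesk baseline for WSD; here is a minimal sketch, to be treated only as a starting point, since its accuracy is limited.

import nltk
from nltk.wsd import lesk

nltk.download('punkt', quiet=True)
nltk.download('wordnet', quiet=True)

sentence = "I had an apple and a banana for lunch"
sense = lesk(nltk.word_tokenize(sentence), "apple")
print(sense, "-", sense.definition() if sense else "no sense found")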
This problem has already been addressed by many open-source pre-trained NER models. Alternatively, you can try retraining an existing NER model to fine-tune it for this issue.
You can find a demo of NER results produced by spaCy's NER here.
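For example, a pre-trained spaCy pipeline (assuming the small English model en_core_web_sm is installed) often recognizes brand mentions out of the box:

import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed
doc = nlp("Apple unveiled the new iPhone at its headquarters in Cupertino.")
print([(ent.text, ent.label_) for ent in doc.ents])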

Resources