Stanford NLP: training the DocumentPreprocessor - stanford-nlp

Does Stanford NLP provide a train method for the DocumentPreprocessor, so that I can train it on my own corpora and create my own models for sentence splitting?
I am working with German sentences and need to create my own German model for sentence-splitting tasks. Therefore, I need to train the sentence splitter, DocumentPreprocessor.
Is there a way I can do it?

No. At present, tokenization of all European languages is done by a (hand-written) finite automaton; machine-learning-based tokenization is used only for Chinese and Arabic. Sentence splitting for all languages is done by rule, exploiting the decisions of the tokenizer. (Of course, that's just how things are now, not how they have to be.)
We currently have no separate German tokenizer/sentence splitter; the German properties file just re-uses the English ones. This is clearly sub-optimal. If someone wanted to produce something for German, that would be great to have. (We may do it at some point, but German development is not currently at the top of the priority list.)
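Since the splitter is rule-based, a workable stopgap for German is to write your own small rule-driven splitter over the token stream. The sketch below is only a toy illustration of that approach; the abbreviation list and function name are made up and are not part of Stanford NLP:

```python
# A toy rule-based sentence splitter: split after ., ! or ? unless the
# token is a known abbreviation, mirroring the "by rule" approach above.
GERMAN_ABBREVIATIONS = {"dr.", "z.b.", "usw.", "bzw.", "nr.", "ca."}

def split_sentences(text: str) -> list[str]:
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if tok.endswith((".", "!", "?")) and tok.lower() not in GERMAN_ABBREVIATIONS:
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            # Only split when the next token starts a new sentence (or text ends).
            if not nxt or nxt[0].isupper():
                sentences.append(" ".join(current))
                current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

A real implementation would need a much larger abbreviation list and smarter handling of quotes and numbers, but the structure (tokenize, then decide boundaries by rule) is the same one the answer describes.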

Related

Problem in negation sentences using a model from BERTimbau for sentiment analysis in text

Could anyone tell me how to make BERT (used as a text sentiment classifier, with BERTimbau, the Brazilian Portuguese model, as tokenizer and base model) classify negated sentences (that is, sentences with a "not" in front of the feeling) as the inverse of what it was trained on?
To explain better: I have a model that I fine-tuned from BERTimbau on data I collected. It classifies phrases as Satisfied, Dissatisfied, Excited, or Discouraged.
When a person writes a sentence expressing a feeling but negating it, the model still classifies it with that feeling. Example:
"The day is not lively."
What the model predicts: Excited.
What I want it to predict: Discouraged.
Can anyone tell me whether this is possible (and, if so, how to do it)?
I have been trying to figure this out for days (otherwise the model classifies well).
Thank you very much!

Is language translation algorithm non-deterministic in nature?

A few days ago, I got a Thai translation of the string "Reward points issues" as "คะแนนสะสม".
But when I checked it today, the Google translator gave a different Thai translation - "ประเด็นคะแนนรางวัล"
So, I am guessing the algorithm might be non-deterministic.
But there is one thing I am not able to understand. In any language, new words are added every day, not new characters or new ways to form a pre-defined word. Then why did Google Translate give a different output?
Also, is my assumption of non-deterministic behaviour correct?
NOTE: I observed the same behaviour with other languages such as Russian, Dutch, Chinese, and Polish.
I don't think the algorithms used by Google are non-deterministic; there is no reason for them to be.
In any case, Google translates by reference to a huge corpus of known translations. This corpus is constantly updated, which influences the translations from day to day. It is made of complete sentences rather than isolated words.
In effect, Google Translate learns.

Train a non-english Stanford NER models

I'm seeing several posts about training the Stanford NER for other languages.
eg: https://blog.sicara.com/train-ner-model-with-nltk-stanford-tagger-english-french-german-6d90573a9486
However, the Stanford CRF classifier uses some language-dependent features (such as part-of-speech tags).
Can we really train non-English models using the same Jar file?
https://nlp.stanford.edu/software/crf-faq.html
Training a NER classifier is language-independent: you have to provide high-quality training data and create meaningful features. The point is that not all features are equally useful for every language. Capitalization, for instance, is a good indicator of a named entity in English, but in German all nouns are capitalized, which makes this feature less useful.
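To make "meaningful features" concrete, here is a minimal sketch of the kind of per-token feature function a CRF-based NER tagger consumes. The function and feature names are illustrative only, not Stanford NER's API:

```python
def token_features(tokens: list[str], i: int) -> dict:
    """Features for the token at position i, in the style CRF taggers use."""
    tok = tokens[i]
    return {
        "word.lower": tok.lower(),
        "word.istitle": tok.istitle(),  # strong cue in English, weak in German
        "word.isupper": tok.isupper(),
        "suffix3": tok[-3:],            # morphology often signals entity type
        "prev.lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": tokens[i + 1].lower() if i + 1 < len(tokens) else "<EOS>",
    }
```

For German, you would down-weight or drop the capitalization features and lean more on context and morphology, which is exactly the trade-off described above.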
In Stanford NER you can decide which features the classifier has to use and therefore you can disable POS tags (in fact, they are disabled by default). Of course, you could also provide your own POS tags in your desired language.
I hope I could clarify some things.
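In Stanford NER the feature set is controlled through a properties file passed at training time. A sketch of such a file is below; the file names are placeholders, and the flags are the usual CRFClassifier options, with POS-tag features left at their disabled default:

```properties
trainFile = german-train.tsv
serializeTo = german-ner-model.ser.gz
map = word=0,answer=1

useClassFeature = true
useWord = true
useNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true
useSequences = true
usePrevSequences = true
useDisjunctive = true
wordShape = chris2useLC
# useTags is left at its default (false), so no POS-tag features are used
```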
I agree with the previous answer that the NER classification model is language-independent.
If you are short of training data, I can suggest this link, which lists a large number of labeled datasets for different languages.
If you would like to try another model, I suggest ESTNLTK, a library for the Estonian language, which can also fit language-independent NER models (documentation).
Also, here you can find an example of how to train a NER model using spaCy.
I hope this helps. Good luck!

Sentiment analysis

While performing sentiment analysis, how can I make the machine understand that I'm referring to Apple (the iPhone) instead of apple (the fruit)?
Thanks for the advice!
Well, there are several methods.
I would start by checking for capital letters: usually, when referring to a name, the first letter is capitalized.
Before doing sentiment analysis, I would use part-of-speech tagging and named-entity recognition (NER) to tag the relevant words.
Stanford CoreNLP is a good text analysis project to start with; it will teach you the basic concepts, and its output shows how the tags can help you.
As described by Ofiris, NER is only one way to solve your problem. I feel it is more effective to use word embeddings to represent your words; that way, the machine automatically recognizes the context of the word. For example, "apple" mostly occurs together with "eat", but if the input "Apple" appears alongside "mobile" or another word from that domain, the machine will understand it is the iPhone Apple rather than the fruit. Two popular ways to generate word embeddings are word2vec and fastText.
Gensim provides reliable implementations of both word2vec and fastText:
https://radimrehurek.com/gensim/models/word2vec.html
https://radimrehurek.com/gensim/models/fasttext.html
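The intuition that neighbouring words disambiguate "apple" can be sketched without any library: represent each occurrence of the ambiguous word by a bag of its neighbours and compare it against prototype contexts. This is only a toy stand-in for word2vec/fastText embeddings; the prototype lists and function names are made up:

```python
def context_vector(tokens: list[str], target: str, window: int = 2) -> dict:
    """Bag of neighbouring words for every occurrence of `target`."""
    counts = {}
    for i, tok in enumerate(tokens):
        if tok == target:
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[tokens[j]] = counts.get(tokens[j], 0) + 1
    return counts

def overlap(a: dict, b: dict) -> int:
    """Shared mass between two bags of words."""
    return sum(min(v, b.get(k, 0)) for k, v in a.items())

# Hand-built prototype contexts for the two senses (illustrative only).
fruit_proto = {"eat": 1, "fruit": 1, "juice": 1}
brand_proto = {"iphone": 1, "mobile": 1, "phone": 1}

tokens = "i want to buy the new apple iphone mobile".split()
ctx = context_vector(tokens, "apple")
sense = "brand" if overlap(ctx, brand_proto) > overlap(ctx, fruit_proto) else "fruit"
```

Real embeddings generalize this idea: instead of exact word matches, they place similar contexts near each other in a continuous vector space.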
In presence of dates, famous brands, vip or historical figures you can use a NER (named entity recognition) algorithm; in such case, as suggested by Ofiris, the Stanford CoreNLP offers a good Named entity recognizer.
For a more general disambiguation of polysemous words (i.e., words having more than one sense, such as "good"), you could use a POS tagger coupled with a word sense disambiguation (WSD) algorithm. An example of the latter can be found here, but I do not know of any freely downloadable library for this purpose.
This problem has already been solved by many open-source pre-trained NER models. You can also try retraining an existing NER model to fine-tune it for this issue.
You can find a demo of NER results as produced by spaCy's NER here.

Corenlp basic errors

Take the phrase "A Pedestrian wishes to cross the road".
I learnt English in England and, according to the old rules, the word 'pedestrian' is a noun. Stanford CoreNLP finds it to be an adjective, regardless of capitalization.
I don't want to contradict the big brains of Stanford, but that is just wrong. I am new to this semantic stuff, but by finding the word to be an adjective, the sentence is left without a valid noun phrase.
Have I missed the point of CoreNLP, lost the point of the English language, or should I be seeking more effective analysis tools?
I ask because the example sentence is the very first sentence of my very first processing experiment, and it is most discouraging.
CoreNLP is a statistical analysis tool. It is trained on many texts that have been annotated by pools of human experts, and those experts agree in only about 90% of cases. The CoreNLP system therefore cannot beat that percentage, and your sentence falls into the roughly 10% of wrong parses.