Problem in negation sentences using a model from BERTimbau for sentiment analysis in text

Could anyone tell me how to make a BERT text sentiment classifier, built with the BERTimbau (Brazilian Portuguese) tokenizer and model, classify negated sentences (that is, sentences with "not" in them) as the opposite of the class it was trained to predict?
To explain in more detail:
I have a model that I created from BERTimbau and data that I collected. It classifies phrases as Satisfied, Dissatisfied, Excited, or Discouraged.
When a person writes a sentence that expresses a feeling but negates it, the model still classifies the sentence with that feeling. Example:
The day is not lively.
What the model predicts: Excited
What I want it to predict: Discouraged
Can anyone tell me whether this is possible and, if so, how to do it?
I've been trying to work this out for days (apart from this, the model classifies well).
Thank you very much!

Related

Stanford NLP training documentpreprocessor

Does Stanford NLP provide a train method for the DocumentPreprocessor, so that I can train it on my own corpora and create my own models for sentence splitting?
I am working with German sentences and I need to create my own German model for sentence splitting tasks. Therefore, I need to train the sentence splitter, DocumentPreprocessor.
Is there a way I can do it?

No. At present, tokenization of all European languages is done by a (hand-written) finite automaton. Machine learning-based tokenization is used for Chinese and Arabic. At present, sentence splitting for all languages is done by rule, exploiting the decisions of the tokenizer. (Of course, that's just how things are now, not how they have to be.)
At present we have no separate German tokenizer/sentence splitter. The current properties file just re-uses the English ones. This is clearly sub-optimal. If someone wanted to produce something for German, that would be great to have. (We may do it at some point, but German development is not currently at the top of the list of priorities.)

Sentiment analysis

While performing sentiment analysis, how can I make the machine understand that I'm referring to Apple (the iPhone) rather than apple (the fruit)?
Thanks for the advice!

Well, there are several methods.
I would start by checking for a capital letter: when a word refers to a name, its first letter is usually capitalized.
Before doing sentiment analysis, I would use part-of-speech tagging and named entity recognition to tag the relevant words.
Stanford CoreNLP is a good text analysis project to start with; it will teach you the basic concepts.
Example from CoreNLP:
You can see how the tags can help you.
And check out more info

As described by Ofiris, NER is only one way to solve your problem. I feel it's more effective to use word embeddings to represent your words. That way, the machine picks up the context of the word automatically. For example, "Apple" mostly appears together with words such as "eat". But if the input contains "Apple" together with "mobile" or another word from that domain, the machine will understand that it means "iPhone apple" rather than "apple fruit". Two popular ways to generate word embeddings are word2vec and fastText.
Gensim provides reliable implementations of both word2vec and fastText.
https://radimrehurek.com/gensim/models/word2vec.html
https://radimrehurek.com/gensim/models/fasttext.html
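To make the idea concrete, here is a minimal sketch of context-based disambiguation. It is not word2vec or fastText: it uses plain bag-of-context-words vectors and cosine similarity as a stand-in for learned embeddings, and the context word lists (SENSE_CONTEXTS) and the function guess_sense are made up for illustration. With Gensim you would learn the vectors from a real corpus instead of listing context words by hand.

```python
import math
from collections import Counter

# Hypothetical context words per sense; real embeddings would learn
# these associations from a large corpus instead of a hand-made list.
SENSE_CONTEXTS = {
    "fruit": ["eat", "juice", "tree", "pie", "sweet"],
    "company": ["iphone", "mobile", "phone", "store", "laptop"],
}


def bag(words):
    """Turn a list of words into a lowercase count vector."""
    return Counter(w.lower() for w in words)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def guess_sense(sentence):
    """Return the sense whose context words best match the sentence, or None."""
    cleaned = sentence.replace("?", " ").replace(".", " ").replace(",", " ")
    context = bag(cleaned.split())
    scores = {
        sense: cosine(context, bag(words))
        for sense, words in SENSE_CONTEXTS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Calling guess_sense("I love my Apple mobile phone") picks the "company" sense because "mobile" and "phone" overlap with that sense's context words; a sentence with no overlapping context returns None.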

In the presence of dates, famous brands, VIPs, or historical figures you can use an NER (named entity recognition) algorithm; in that case, as suggested by Ofiris, Stanford CoreNLP offers a good named entity recognizer.
For a more general disambiguation of polysemous words (i.e., words having more than one sense, such as "good") you could use a POS tagger coupled with a Word Sense Disambiguation (WSD) algorithm. An example of the latter can be found HERE, but I do not know any freely downloadable library for this purpose.

This problem has already been solved by many open-source pre-trained NER models. You can also try fine-tuning an existing NER model on your own data to handle this case.
You can find a demo of NER results produced by spaCy NER here.

Does an algorithm exist to help detect the "primary topic" of an English sentence?

I'm trying to find out if there is a known algorithm that can detect the "key concept" of a sentence.
The use case is as follows:
User enters a sentence as a query (Does chicken taste like turkey?)
Our system identifies the concepts of the sentence (chicken, turkey)
And it runs a search of our corpus content
The area that we're lacking in is identifying what the core "topic" of the sentence is really about. The sentence "Does chicken taste like turkey" has a primary topic of "chicken", because the user is asking about the taste of chicken. While "turkey" is a helper topic of less importance.
So... I'm trying to find out if there is an algorithm that will help me identify the primary topic of a sentence... Let me know if you are aware of any!!!

I actually did a research project on this and won two competitions and am competing in nationals.
There are two steps to the method:
Parse the sentence with a Context-Free Grammar
In the resulting parse trees, find all nouns which are only subordinate to Noun-Phrase-like constituents
For example, "I ate pie" has 2 nouns: "I" and "pie". Looking at the parse tree, "pie" is inside of a Verb Phrase, so it cannot be a subject. "I", however, is only inside of NP-like constituents. Being the only subject candidate, it is the subject. You can find an early copy of this program at http://www.candlemind.com. Note that the vocabulary is limited to basic singular words, and there are no verb conjugations, so it has "man" but not "men", and "eat" but not "ate." Also, the CFG I used was hand-made and limited. I will be updating this program shortly.
Anyway, there are limitations to this program. My mentor pointed out that in its current state, it cannot recognize sentences with subjects that are "real" NPs (what grammar actually calls NPs). For example, "that the moon is flat is not a debate any longer." The subject is actually "that the moon is flat." However, the program would recognize "moon" as the subject. I will be fixing this shortly.
Anyway, this is good enough for most sentences...
My research paper can be found there too. Go to page 11 of it to read the methods.
Hope this helps.
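A rough sketch of the second step described above, assuming the parse tree is already available. This is not the author's program: the nested-tuple tree format, the tag sets, and the function subject_candidates are assumptions made for illustration, and pronouns are counted as nouns because the answer treats "I" as one.

```python
# A parse tree node is (label, child, child, ...); a leaf is (pos_tag, word).
# This tuple format is an assumption for the sketch, not a parser's output.
TREE = (
    "S",
    ("NP", ("PRP", "I")),
    ("VP", ("VBD", "ate"), ("NP", ("NN", "pie"))),
)

# Penn Treebank style tags; pronouns are included because the answer
# counts "I" as a noun.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS", "PRP"}

# Ancestors a subject candidate may have. Treating the sentence root
# (S, SQ) as acceptable is an assumption of this sketch.
ALLOWED_ANCESTORS = {"S", "SQ", "NP"}


def subject_candidates(tree, ancestors=()):
    """Collect nouns whose every ancestor is an NP-like constituent."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        word = children[0]
        only_np_like = all(a in ALLOWED_ANCESTORS for a in ancestors)
        return [word] if label in NOUN_TAGS and only_np_like else []
    found = []
    for child in children:
        found.extend(subject_candidates(child, ancestors + (label,)))
    return found
```

For TREE, "pie" is rejected because one of its ancestors is a VP, which leaves "I" as the only candidate. As the answer notes, clausal subjects such as "that the moon is flat" are not handled by this kind of check.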

Most basic NLP parsing techniques will be able to extract the basic aspects of the sentence, i.e., that chicken and turkey are NPs and that they are linked by 'like', etc. Getting from these to a 'topic' or 'concept' is more difficult.
Techniques such as Latent Semantic Analysis and its many derivatives transform this information into a vector (some have methods of partly retaining the hierarchy/relations between parts of speech) and then compare it to existing vectors, usually pre-classified by concept. See http://en.wikipedia.org/wiki/Latent_semantic_analysis to get started.
Edit: Here's an example LSA app you can play around with to see if you might want to pursue it further: http://lsi.research.telcordia.com/lsi/demos.html

For many longer sentences it's difficult to say exactly what the topic is, and there may be more than one.
One way to get an approximate answer is:
1.) First tag the sentence using OpenNLP, the Stanford Parser, or any other tagger.
2.) Then remove all the stop words from the sentence.
3.) Pick out the nouns (proper, singular and plural).
Another way is:
1.) Chunk the sentence into phrases with any parser.
2.) Pick out all the noun phrases.
3.) Remove the noun phrases that don't have a noun as a child.
4.) Keep only the adjectives and nouns; remove all other words from the remaining noun phrases.
This might give an approximate guess.
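The first recipe can be sketched roughly as follows. The tagging step is assumed to have been done already by a tagger of your choice, so the input here is a hand-tagged list; the stop-word list and the function noun_topics are made up for the example.

```python
# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"does", "do", "is", "the", "a", "an", "of", "like"}

# Penn Treebank noun tags: singular, plural, proper singular, proper plural.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def noun_topics(tagged_tokens):
    """Drop stop words, then keep only the tokens tagged as nouns."""
    return [
        word
        for word, tag in tagged_tokens
        if word.lower() not in STOP_WORDS and tag in NOUN_TAGS
    ]


# Hand-tagged input standing in for the output of step 1.
TAGGED = [
    ("Does", "VBZ"),
    ("chicken", "NN"),
    ("taste", "VB"),
    ("like", "IN"),
    ("turkey", "NN"),
]
```

noun_topics(TAGGED) keeps both "chicken" and "turkey", which shows the limit of the approach: it finds candidate topics but does not rank them.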

"Key concept" is not a well-defined term in linguistics, but this may be a starting point: parse the sentence, find the subject in the parse tree or dependency structure that you get. (This doesn't always work; for example, the subject of "Is it raining?" is "it", while the key concept is likely "rain". Also, what's the key concept in "Are spaghetti and lasagna the same thing?")
This kind of problem (NLP + search) is more properly dealt with by methods such as LSA, but that's quite an advanced topic.

On the most basic level, a question in English is usually in the form of <verb> <subject> ... ? or <pronoun> <verb> <subject> ... ?. This is by no means a good algorithm, especially considering that the subject could span several words, but depending on how sophisticated a solution you need, it might be a useful starting point.
If you need precision, ignore this answer.

If you're willing to shell out money, http://www.connexor.com/ is supposed to be able to do this type of semantic analysis for a wide variety of languages, including English. I have never directly used their product, and so can't comment on how well it works.

There's an article about Parsing Noun Phrases in the MIT Computational Linguistics journal of this month: http://www.mitpressjournals.org/doi/pdf/10.1162/COLI_a_00076

Compound or complex sentences may have more than one key concept.
You can use Stanford NLP or MaltParser, which can give you the dependency structure of a sentence. It also gives part-of-speech tags, along with the subject, verb, object, etc.
I think most of the time the object will be the key concept of the sentence.

You should look at Google's Cloud Natural Language API. It's their NLP service.
https://cloud.google.com/natural-language/

A simple solution is to tag your sentence with a part-of-speech tagger (e.g. from the NLTK library for Python), then find matches with some predefined part-of-speech patterns in which it's clear where the main subject of the sentence is.
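A small sketch of the pattern-matching half of this idea. The tagger output is assumed (the input is hand-tagged with coarse, made-up tags), and the two patterns and the function main_subject are only examples, not a complete set.

```python
# Each entry: (tag sequence expected at the start of the sentence,
#              index of the subject within that sequence).
# Both patterns are illustrative assumptions.
PATTERNS = [
    (("VERB", "NOUN"), 1),          # "Does chicken ..."
    (("PRON", "VERB", "NOUN"), 2),  # "What is chicken ..."
]


def main_subject(tagged):
    """Return the subject word if the sentence starts with a known pattern."""
    tags = tuple(tag for _, tag in tagged)
    for pattern, subject_index in PATTERNS:
        if tags[: len(pattern)] == pattern:
            return tagged[subject_index][0]
    return None


# Hand-tagged example; in practice these tags would come from a tagger.
QUESTION = [
    ("Does", "VERB"),
    ("chicken", "NOUN"),
    ("taste", "VERB"),
    ("like", "ADP"),
    ("turkey", "NOUN"),
]
```

main_subject(QUESTION) returns "chicken"; sentences that match none of the patterns return None, so the pattern list decides how much of your input is covered.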

One option is to look into something like this as a first step:
http://www.abisource.com/projects/link-grammar/
But how you derive the topic from these links is another problem in itself. But as Abiword is trying to detect grammatical problems, you might be able to use it to determine the topic.

By "primary topic" you're referring to what is termed the subject of the sentence.
The subject can be identified by understanding a sentence through natural language processing.
The answer to this question is the same as that for How to determine subject, object and other words? - this is a currently unsolved problem.

Is it possible to guess a user's mood based on the structure of text?

I assume a natural language processor would need to be used to parse the text itself, but what suggestions do you have for an algorithm to detect a user's mood based on text that they have written? I doubt it would be very accurate, but I'm still interested nonetheless.
EDIT: I am by no means an expert on linguistics or natural language processing, so I apologize if this question is too general or stupid.

This is the basis of an area of natural language processing called sentiment analysis. Although your question is general, it's certainly not stupid - this sort of research is done by Amazon on the text in product reviews for example.
If you are serious about this, then a simple version could be achieved as follows:
1.) Acquire a corpus of positive/negative sentiment. If this were a professional project you might take some time and manually annotate a corpus yourself, but if you are in a hurry or just want to experiment at first, then I'd suggest looking at the sentiment polarity corpus from Bo Pang and Lillian Lee's research. The issue with using that corpus is that it is not tailored to your domain (specifically, the corpus uses movie reviews), but it should still be applicable.
2.) Split your dataset into sentences, each labelled either Positive or Negative. For the sentiment polarity corpus you could split each review into its constituent sentences and then apply the overall sentiment polarity tag (positive or negative) to all of those sentences. Split this corpus into two parts: 90% for training and 10% for testing. If you're using Weka then it can handle the splitting of the corpus for you.
3.) Apply a machine learning algorithm (such as SVM, Naive Bayes, Maximum Entropy) to the training corpus at the word level. This model is called a bag-of-words model, which just represents the sentence as the words it's composed of. This is the same model that many spam filters run on. For a nice introduction to machine learning algorithms there is an application called Weka that implements a range of these algorithms and gives you a GUI to play with them. You can then measure the performance of the learned model from the errors it makes when classifying your test corpus.
4.) Apply the learned model to your user posts. For each user post, separate the post into sentences and then classify them using your machine-learned model.
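The steps above can be sketched with a tiny hand-rolled Naive Bayes classifier over a bag of words. Everything here is illustrative: the training and test sentences are invented toy data rather than the Pang and Lee corpus, the held-out set is fixed by hand instead of a 90/10 split, and the class and function names are made up. A real project would use Weka, LingPipe, or a similar library.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(sentence):
    """Lowercase the sentence and split it into word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words features (add-one smoothing)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter()            # label -> number of sentences
        self.vocab = set()

    def train(self, examples):
        for sentence, label in examples:
            self.label_counts[label] += 1
            for word in tokenize(sentence):
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def predict(self, sentence):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in tokenize(sentence):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy data standing in for a real annotated corpus.
TRAIN = [
    ("I love this movie", "pos"),
    ("what a great and wonderful film", "pos"),
    ("a truly enjoyable story", "pos"),
    ("I hate this movie", "neg"),
    ("what a boring and awful film", "neg"),
    ("a truly terrible story", "neg"),
]
TEST = [("a great story", "pos"), ("an awful movie", "neg")]

model = NaiveBayes()
model.train(TRAIN)
accuracy = sum(model.predict(s) == label for s, label in TEST) / len(TEST)
```

The accuracy on the held-out sentences plays the role of the test-corpus evaluation in step 3; to classify user posts as in step 4, split each post into sentences and call model.predict on each one.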
So yes, if you are serious about this then it is achievable - even without past experience in computational linguistics. It would be a fair amount of work, but even with word based models good results can be achieved.
If you need more help feel free to contact me - I'm always happy to help others interested in NLP =]
Small Notes -
Merely splitting a segment of text into sentences is a field of NLP - called sentence boundary detection. There are a number of tools, OSS or free, available to do this, but for your task a simple split on whitespaces and punctuation should be fine.
SVMlight is another machine learner to consider; in fact, its inductive SVM example does a similar task to what we're looking at: classifying which Reuters articles are about "corporate acquisitions", with 1000 positive and 1000 negative examples.
Turning the sentences into features to classify over may take some work. In this model each word is a feature - this requires tokenizing the sentence, which means separating words and punctuation from each other. Another tip is to lowercase all the separate word tokens so that "I HATE you" and "I hate YOU" both end up being considered the same. With more data you could try and also include whether capitalization helps in classifying whether someone is angry, but I believe words should be sufficient at least for an initial effort.
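As a small illustration of the notes above, this sketch does a naive sentence split on end punctuation followed by whitespace, then tokenizes and lowercases each sentence. The function names and the regular expressions are assumptions chosen for the example; a proper sentence boundary detector would handle abbreviations and other edge cases.

```python
import re


def split_sentences(text):
    """Naive sentence split: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def features(sentence):
    """Separate words from punctuation and lowercase every token."""
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    return [t.lower() for t in tokens]
```

With this, features("I HATE you") and features("I hate YOU") produce the same token list, so both sentences look identical to a word-level classifier.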
Edit
I just discovered LingPipe that in fact has a tutorial on sentiment analysis using the Bo Pang and Lillian Lee Sentiment Polarity corpus I was talking about. If you use Java that may be an excellent tool to use, and even if not it goes through all of the steps I discussed above.

No doubt it is possible to judge a user's mood based on the text they type, but it would be no trivial thing. Things that I can think of:
Capitals tend to signify agitation, annoyance or frustration and are certainly an emotional response, but then again some newbies do that because they don't realize the significance, so you couldn't assume that without looking at what else they've written (to make sure it's not all in caps);
Capitals are really just one form of emphasis. Others are the use of certain aggressive colours (e.g. red) or the use of bold or larger fonts;
Some people make more spelling and grammar mistakes and typos when they're highly emotional;
Scanning for emoticons could give you a very clear picture of what the user is feeling, but again something like :) could be interpreted as happy, as "I told you so", or even as sarcastic;
Use of expletives tends to have a clear meaning, but again it's not clear-cut. Colloquial speech by many people will routinely contain certain four-letter words. Other people might not even say "hell", saying "heck" instead, so for them any expletive (even "sucks") is significant;
Groups of punctuation marks (like ##$#$#) tend to be substituted for expletives in a context where expletives aren't necessarily appropriate, so that's less likely to be colloquial;
Exclamation marks can indicate surprise, shock or exasperation.
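Several of the cues listed above are easy to extract mechanically, even if interpreting them is not. The sketch below only counts them; the emoticon table, the symbol set, and the function mood_cues are assumptions for illustration, and turning these numbers into a mood would still need the context the answer warns about.

```python
import re

# Hypothetical emoticon polarity table; extend as needed.
EMOTICONS = {":)": 1, ":-)": 1, ":(": -1, ":-(": -1}


def mood_cues(text):
    """Extract a few surface cues from text; interpretation is left to the caller."""
    letters = [c for c in text if c.isalpha()]
    caps_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    return {
        # Share of letters that are capitals (1.0 means all caps).
        "caps_ratio": caps_ratio,
        # Number of exclamation marks.
        "exclamations": text.count("!"),
        # Runs of three or more symbols, as in masked expletives like ##$#$#.
        "symbol_runs": len(re.findall(r"[#$%&@*]{3,}", text)),
        # Sum of emoticon polarities found in the text.
        "emoticon_score": sum(
            polarity * text.count(emoticon)
            for emoticon, polarity in EMOTICONS.items()
        ),
    }
```

For example, mood_cues("THIS IS BROKEN!!! ##$#$# :(") reports an all-caps message with three exclamation marks, one symbol run, and a negative emoticon score.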
You might want to look at Advances in written text analysis or even Determining Mood for a Blog by Combining Multiple Sources of Evidence.
Lastly it's worth noting that written text is usually perceived to be more negative than it actually is. This is a common problem with email communication in companies, just as one example.

I can't believe I'm taking this seriously... assuming a one-dimensional mood space:
If the text contains a curse word,
-10 mood.
I think exclamations would tend to be negative, so -2 mood.
When I get frustrated, I type in
Very. Short. Sentences. -5 mood.
The more I think about this, the more it's clear that a lot of these signifiers indicate extreme mood in general, but it's not always clear what kind of mood.
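For what it's worth, the one-dimensional scoring above fits in a few lines. The point values come from the answer; the rest is assumed for the sketch: the placeholder curse-word list, applying -2 per exclamation mark, and treating "very short sentences" as at least two sentences averaging two words or fewer.

```python
import re

# Placeholder list; a real one would be longer.
CURSE_WORDS = {"damn", "hell"}


def mood_score(text):
    """Score text on a single mood axis; lower means a worse mood."""
    score = 0
    words = re.findall(r"[a-z']+", text.lower())
    if any(word in CURSE_WORDS for word in words):
        score -= 10
    # Assumption: -2 for each exclamation mark.
    score -= 2 * text.count("!")
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if len(sentences) >= 2:
        average_length = sum(len(s.split()) for s in sentences) / len(sentences)
        # Assumption: "very short" means two words or fewer on average.
        if average_length <= 2:
            score -= 5
    return score
```

As the answer says, a low score signals a strong mood more reliably than it signals which mood.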

If you support fonts, bold red text is probably an angry user. Green regular sized texts with butterfly clip art a happy one.

My memory isn't good on this subject, but I believe I saw some research about the grammatical structure of the text and its overall tone. It could also be as simple as shorter words and words that express emotion (well, expletives are pretty obvious).
Edit: I noticed that the first person to answer had a substantially similar post. There could indeed be something to the idea about shorter sentences.

Analysis of mood and behavior is very serious science. Despite the other answers mocking the question, law enforcement agencies have been investigating categorization of mood for years. The uses in computers that I have heard of generally had more context (timing information, voice pattern, speed in changing channels). I think that you could, with some success, determine whether a user is in a particular mood by training a neural network with samples from two known groups: angry and not angry. Good luck with your efforts.

I think my algorithm is rather straightforward, but why not count the smileys in the text, :) vs :(
Obviously, the text ":) :) :) :)" resolves to a happy user, while ":( :( :(" will surely resolve to a sad one. Enjoy!

I agree with ojblass that this is a serious question.
Mood categorization is currently a hot topic in the speech recognition area. If you think about it, an interactive voice response (IVR) application needs to handle angry customers far differently than calm ones: angry people should be routed quickly to human operators with the right experience and training. Vocal tone is a pretty reliable indicator of emotion, practical enough so that companies are eager to get this to work. Google "speech emotion recognition", or read this article to find out more.
The situation should be no different in web-based GUIs. Referring back to cletus's comments, the analogies between text and speech emotion detection are interesting. If a person types CAPITALS they are said to be 'shouting', just as if his voice rose in volume and pitch using a voice interface. Detecting typed profanities is analogous to "keyword spotting" of profanity in speech systems. If a person is upset, they'll make more errors using either a GUI or a voice user interface (VUI) and can be routed to a human.
There's a "multimodal" emotion detection research area here. Imagine a web interface that you can also speak to (along the lines of the IBM/Motorola/Opera XHTML + Voice Profile prototype implementation). Emotion detection could be based on a combination of cues from the speech and visual input modality.

Yes.
Whether or not you can do it is another story. The problem seems at first to be AI complete.
Now then, if you had keystroke timings you should be able to figure it out.

Fuzzy logic will do I guess.
Any way it will be quite easy to start with several rules of determining the user's mood and then extend and combine the "engine" with more accurate and sophisticated ones.

Algorithm to determine how positive or negative a statement/text is

I need an algorithm to determine if a sentence, paragraph or article is negative or positive in tone... or better yet, how negative or positive.
For instance:
Jason is the worst SO user I have ever witnessed (-10)
Jason is an SO user (0)
Jason is the best SO user I have ever seen (+10)
Jason is the best at sucking with SO (-10)
While, okay at SO, Jason is the worst at doing bad (+10)
Not easy, huh? :)
I don't expect somebody to explain this algorithm to me, but I assume there is already much work on something like this in academia somewhere. If you can point me to some articles or research, I would love it.
Thanks.

There is a sub-field of natural language processing called sentiment analysis that deals specifically with this problem domain. There is a fair amount of commercial work done in the area because consumer products are so heavily reviewed in online user forums (UGC, or user-generated content). There is also a prototype platform for text analytics called GATE from the University of Sheffield, and a Python project called NLTK. Both are considered flexible, but not very high performance. One or the other might be good for working out your own ideas.

In my company we have a product which does this and also performs well. I did most of the work on it. I can give a brief idea:
You need to split the paragraph into sentences and then split each sentence into smaller sub-sentences, splitting on commas, hyphens, semicolons, colons, 'and', 'or', etc.
In some cases each sub-sentence will exhibit a totally separate sentiment.
Some sentences, even after being split, will have to be joined back together.
E.g.: The product is amazing, excellent and fantastic.
We have developed a comprehensive set of rules on the types of sentences which need to be split and which shouldn't be (based on the POS tags of the words).
On the first level, you can use a bag-of-words approach, meaning: have a list of positive and negative words/phrases and check for them in every sub-sentence. While doing this, also look at negation words like 'not', 'no', etc., which will change the polarity of the sentence.
If you still can't find the sentiment, you can go for a Naive Bayes approach. This approach is not very accurate (about 60%). But if you apply it only to the sentences which fail to pass through the first set of rules, you can easily get to 80-85% accuracy.
The important part is the positive/negative word list and the way you split things up. If you want, you can go a level higher by implementing an HMM (Hidden Markov Model) or CRF (Conditional Random Fields). But I am not a pro in NLP and someone else may fill you in on that part.
For the curious, we implemented all of this in Python with NLTK and the Reverend Bayes module.
Pretty simple and handles most of the sentences. You may however face problems when trying to tag content from the web. Most people don't write proper sentences on the web. Also handling sarcasm is very hard.
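A rough sketch of the first-level approach described in this answer: split into sub-sentences, look up words in positive/negative lists, and flip the polarity after a negation word. This is not the product's actual rule set. The word lists, the splitting regex, and the function names are assumptions, and the POS-based rules for rejoining sub-sentences are left out entirely.

```python
import re

# Hypothetical word lists; the answer stresses that real lists matter a lot.
POSITIVE = {"amazing", "excellent", "fantastic", "good", "lively"}
NEGATIVE = {"bad", "terrible", "boring", "awful"}
NEGATIONS = {"not", "no", "never"}

# Split on commas, semicolons, colons, spaced hyphens, 'and', 'or'.
SPLIT_PATTERN = re.compile(r"[,;:]|\s-\s|\b(?:and|or)\b", re.IGNORECASE)


def clause_polarity(clause):
    """Sum word polarities in one sub-sentence, flipping after a negation."""
    score, negate = 0, False
    for word in re.findall(r"[a-z']+", clause.lower()):
        if word in NEGATIONS:
            negate = True
            continue
        polarity = (word in POSITIVE) - (word in NEGATIVE)
        if polarity:
            score += -polarity if negate else polarity
            negate = False  # the negation has been consumed
    return score


def sentence_polarity(sentence):
    """Split a sentence into sub-sentences and add up their polarities."""
    clauses = SPLIT_PATTERN.split(sentence)
    return sum(clause_polarity(c) for c in clauses if c.strip())
```

Under these assumptions "The day is not lively." scores -1, because "not" flips the positive word that follows it, while "The product is amazing, excellent and fantastic." scores +3 across its three sub-sentences.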

This falls under the umbrella of Natural Language Processing, and so reading about that is probably a good place to start.
If you don't want to get in to a very complicated problem, you can just create lists of "positive" and "negative" words (and weight them if you want) and do word counts on sections of text. Obviously this isn't a "smart" solution, but it gets you some information with very little work, where doing serious NLP would be very time consuming.
One of your examples would potentially be marked positive when it was in fact negative using this approach ("Jason is the best at sucking with SO") unless you happen to weight "sucking" more than "best".... But also this is a small text sample, if you're looking at paragraphs or more of text, then weighting becomes more reliable unless you have someone purposefully trying to fool your algorithm.
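A minimal sketch of the weighted word-count idea, with made-up weights in which "sucking" outweighs "best" so that the tricky example comes out negative. The WEIGHTS table and the function weighted_score are assumptions for illustration only.

```python
import re

# Hypothetical weights; "sucking" is deliberately weighted more than "best".
WEIGHTS = {"best": 2, "good": 1, "worst": -2, "bad": -1, "sucking": -3}


def weighted_score(text):
    """Add up the weights of all known words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(WEIGHTS.get(word, 0) for word in words)
```

With these weights, weighted_score("Jason is the best at sucking with SO") is negative, while the neutral sentence "Jason is an SO user" scores zero.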

As pointed out, this comes under sentiment analysis under natural language processing. Afaik GATE doesn't have any component that does sentiment analysis.
In my experience, I have implemented an algorithm that is an adaptation of the one in the paper 'Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis' by Theresa Wilson, Janyce Wiebe and Paul Hoffmann as a GATE plugin, which gives reasonably good results. It could help you if you want to bootstrap the implementation.

Depending on your application you could do it via a Bayesian Filtering algorithm (which is often used in spam filters).
One way to do it would be to have two filters: one for positive documents and another for negative documents. You would seed the positive filter with positive documents (whatever criteria you use) and the negative filter with negative documents. The trick would be to find these documents. Maybe you could set it up so your users effectively rate documents.
The positive filter (once seeded) would look for positive words. Maybe it would end up with words like love, peace, etc. The negative filter would be seeded appropriately as well.
Once your filters are set up, you run the test text through them to come up with positive and negative scores. Based on these scores and some weighting, you could come up with your numeric score.
Bayesian Filters, though simple, are surprisingly effective.

You can do like this:
Jason is the worst SO user I have ever witnessed (-10)
worst (-), the rest is (+). so, that would be (-) + (+) = (-)
Jason is an SO user (0)
( ) + ( ) = ( )
Jason is the best SO user I have ever seen (+10)
best (+) , the rest is ( ). so, that would be (+) + ( ) = (+)
Jason is the best at sucking with SO (-10)
best (+), sucking (-). so, (+) + (-) = (-)
While, okay at SO, Jason is the worst at doing bad (+10)
worst (-), doing bad (-). so, (-) + (-) = (+)
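The five results above happen to match what you get by multiplying the signs of the marked words, with unmarked text treated as neutral. The sketch below encodes that reading; it is one interpretation of the examples, and the SIGNS table and the function tone are made up for illustration.

```python
import re
from math import prod

# Signs for the marked words in the examples; "bad" stands in for "doing bad".
SIGNS = {"worst": -1, "best": 1, "sucking": -1, "bad": -1}


def tone(text):
    """Multiply the signs of all marked words; 0 means no marked words."""
    words = re.findall(r"[a-z']+", text.lower())
    signs = [SIGNS[word] for word in words if word in SIGNS]
    return prod(signs) if signs else 0
```

Under this reading two negatives cancel, which is why "the worst at doing bad" comes out positive.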

There are many machine learning approaches to this kind of sentiment analysis. I used machine learning algorithms that are already implemented. In my case I used:
Weka classification algorithms
SVM
Naive Bayes
J48
You only have to train the model for your context, add feature vectors, and do rule-based tuning. In my case I got about 61% accuracy. So we moved to Stanford CoreNLP (they trained their model on movie reviews), used their training set, and added our own training set. We achieved 80-90% accuracy.

This is an old question, but I happened upon it looking for a tool that could analyze article tone and found Watson Tone Analyzer by IBM. It allows 1000 api calls monthly for free.

It's all about context, I think. If you're looking for the people who are best at sucking with SO, sucking the best can be a positive thing. To determine what is bad or good, and by how much, I would recommend looking into fuzzy logic.
It's a bit like being tall. Someone who's 1.95m can be considered tall. If you place that person in a group of people who are all over 2.10m, he looks short.

Maybe essay grading software could be used to estimate tone? WIRED article.
Possible reference. (I couldn't read it.)
This report compares writing skill to the Flesch-Kincaid Grade Level needed to read it!
Page 4 of e-rater says that they look at misspelling and such. (Maybe bad posts are misspelled too!)
Slashdot article.
You could also use an email filter of some sort for negativity instead of spam-ness.

How about sarcasm:
Jason is the best SO user I have ever seen, NOT
Jason is the best SO user I have ever seen, right

Ah, I remember one Java library for this called LingPipe (commercial license) that we evaluated. It would work fine for the example corpus that is available at the site, but for real data it sucks pretty bad.

Most sentiment analysis tools are lexicon-based, and none of them is perfect. Also, sentiment analysis can be framed as either ternary (three-class) or binary sentiment classification. Moreover, it is a domain-specific task, meaning that tools which work well on a news dataset may not do a good job on informal and unstructured tweets.
I would suggest using several tools and have an aggregation or vote based mechanism to decide the intensity of the sentiment. The best survey study on sentiment analysis tools that I have come across is SentiBench. You will find it helpful.

use Algorithm::NaiveBayes;
# Create the classifier
my $nb = Algorithm::NaiveBayes->new;
# Add labelled training instances (attribute name => weight)
$nb->add_instance(
    attributes => {foo => 1, bar => 1, baz => 3},
    label      => 'sports',
);
# An instance may carry more than one label
$nb->add_instance(
    attributes => {foo => 2, blurp => 1},
    label      => ['sports', 'finance'],
);
# ... repeat for several more instances, then:
$nb->train;
# Find results for unseen instances
my $result = $nb->predict(
    attributes => {bar => 3, blurp => 2},
);
