TextBlob sentiment algorithm - sentiment-analysis

Does anyone know how TextBlob sentiment works? I know it is based on Pattern, but I could not find any article or document explaining how Pattern assigns a polarity value to a sentence.

Here is the code of the textblob sentiment module:
https://github.com/sloria/TextBlob/blob/90cc87ab0f9e25f37379079840ec43aba59af440/textblob/en/sentiments.py
As you can see, it has a training set of pre-classified movie reviews; when you give it a new text for analysis, it uses a Naive Bayes classifier to estimate the new text's polarity as pos and neg probabilities.

By default, it calculates the average polarity and subjectivity over each word in a given text, using a dictionary of adjectives and their hand-tagged scores. It actually uses the pattern library for that, which takes the individual word scores from SentiWordNet.
If you request sentiment scores by specifying the NaiveBayesAnalyzer, such as
TextBlob("The movie was excellent!", analyzer=NaiveBayesAnalyzer())
then the sentiment score is calculated by a NaiveBayesAnalyzer trained on a dataset of movie reviews.
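As a quick illustration (a minimal sketch, assuming TextBlob and the corpora it downloads are installed), the two analyzers can be compared side by side. The default pattern-based analyzer returns (polarity, subjectivity), while the NaiveBayesAnalyzer returns a classification with p_pos and p_neg:
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

text = "The movie was excellent!"

# Default: pattern-based lexicon scoring -> Sentiment(polarity, subjectivity)
print(TextBlob(text).sentiment)

# Naive Bayes analyzer trained on a movie-reviews corpus
# -> Sentiment(classification, p_pos, p_neg)
print(TextBlob(text, analyzer=NaiveBayesAnalyzer()).sentiment)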

Related

How to calculate word similarity based on transformer?

I know I can train word embeddings in TensorFlow or Gensim, and then retrieve the top N most similar words for a target word. Given that transformers are now the mainstream models for text representation, I want to know whether there is a better way to compute word similarity than word embeddings. In Gensim, I can do:
sims = model.wv.most_similar('computer', topn=10)
For example, if I use a sentence transformer to compute embeddings:
https://huggingface.co/sentence-transformers/LaBSE
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]
model = SentenceTransformer('sentence-transformers/LaBSE')
embeddings = model.encode(sentences)
print(embeddings)
Then use these embeddings to compute similarity; would that work for word similarity if I treat any word as a 'sentence'? Or I could use a BERT embedding model:
https://huggingface.co/transformers/model_doc/bert.html
I would feed a word like 'computer' as input to get an embedding, then compute its top-N similarities. Does this make sense, or will it not work better than embeddings trained without a transformer?
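For what it's worth, here is a minimal sketch of that idea (assuming sentence-transformers is installed; the candidate word list is purely illustrative):
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/LaBSE')

target = "computer"
candidates = ["laptop", "keyboard", "software", "banana", "river"]  # hypothetical vocabulary

# Encode the target word and the candidates as if each were a sentence
target_emb = model.encode([target])
cand_embs = model.encode(candidates)

# Cosine similarity between the target and every candidate
scores = util.cos_sim(target_emb, cand_embs)[0]
for word, score in sorted(zip(candidates, scores), key=lambda x: -x[1]):
    print(word, float(score))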

Kaggle - Tweet Sentiment Extraction - What will be length of word or phrase that supports the sentiment

Kaggle problem: https://www.kaggle.com/c/tweet-sentiment-extraction
We have to upload an output file containing an id and the word or phrase that supports the sentiment, in the form:
<id>,"<word or phrase that supports the sentiment>"
The question is how the model will be able to choose the length of the phrase, i.e. decide that from word x to word y there is strong sentiment.
Can anyone please help?
The most common way this is done is by having your model predict a start index and an end index (of the sequence of tokens you want to extract).
Poking through the discussion threads, this was the architecture of the winning entry for that competition: https://www.kaggle.com/c/tweet-sentiment-extraction/discussion/159477
Notice in the first section ("Heartkilla") that they predict two things, y-start and y-end. Further down they mention that they filter out predictions where y-start is greater than y-end.
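As a rough sketch of the start/end idea (not the winning solution itself; the encoder, the untrained linear head, and the model name below are illustrative placeholders, assuming PyTorch and Hugging Face transformers):
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")   # placeholder encoder
encoder = AutoModel.from_pretrained("roberta-base")
span_head = torch.nn.Linear(encoder.config.hidden_size, 2)  # start and end logits (untrained here)

text = "its an exciting and downright exhilarating experience"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state            # (1, seq_len, hidden_size)
    start_logits, end_logits = span_head(hidden).split(1, dim=-1)

start = int(start_logits.squeeze(-1).argmax())
end = int(end_logits.squeeze(-1).argmax())
if start <= end:  # discard invalid spans, as the winners describe
    print(tokenizer.decode(inputs["input_ids"][0][start:end + 1]))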

In general, when does TF-IDF reduce accuracy?

I'm training on a corpus of 200,000 reviews labelled positive or negative using a Naive Bayes model, and I noticed that applying TF-IDF actually reduced the accuracy (when testing on a test set of 50,000 reviews) by about 2%. So I was wondering whether TF-IDF has any underlying assumptions about the data or the model it works with, i.e. any cases where accuracy is reduced by using it?
The IDF component of TF-IDF can harm your classification accuracy in some cases.
Let's suppose the following artificial, easy classification task, made up for the sake of illustration:
Class A: texts containing the word 'corn'
Class B: texts not containing the word 'corn'
Suppose now that in class A you have 100,000 examples and in class B 1,000 examples.
What will happen with TF-IDF? The inverse document frequency of 'corn' will be very low (because it is found in almost all documents), so the feature 'corn' will get a very small TF-IDF weight, and that weight is what the classifier uses. Obviously, 'corn' was THE best feature for this classification task. This is an example where TF-IDF may reduce your classification accuracy (a quick numeric check is sketched after the list below). In more general terms, this happens:
when there is class imbalance: if you have more instances in one class, the good word features of the frequent class risk having a lower IDF, so its best features will have a lower weight
when you have high-frequency words that are very predictive of one of the classes (words found in most documents of that class)
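A back-of-the-envelope check on the 'corn' example (a sketch using the plain idf = log(N / df), ignoring the smoothing that libraries such as scikit-learn add):
import math

n_docs = 101_000    # 100,000 class A documents + 1,000 class B documents
df_corn = 100_000   # 'corn' appears in every class A document

idf_corn = math.log(n_docs / df_corn)
print(round(idf_corn, 4))  # ~0.01, so the most discriminative feature is almost zeroed out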
You can determine empirically whether using IDF on your training data decreases predictive accuracy by running an appropriate grid search.
For example, if you are working in sklearn and want to determine whether IDF decreases the predictive accuracy of your model, you can perform a grid search on the use_idf parameter of the TfidfVectorizer.
As an example, this code would run that grid search over the use of IDF for classification with SGDClassifier:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

X = ...  # your training documents (list of strings)
y = ...  # your labels

pipeline = Pipeline([('tfidf', TfidfVectorizer()),
                     ('sgd', SGDClassifier())])
params = {'tfidf__use_idf': (False, True)}
gridsearch = GridSearchCV(pipeline, params)
gridsearch.fit(X, y)
print(gridsearch.best_params_)
The output will be whichever setting scored best in cross-validation, either:
{'tfidf__use_idf': False}
or
{'tfidf__use_idf': True}
TF-IDF, as far as I understand, is a feature weighting. TF is term frequency, i.e. how often a term occurs in a document. IDF is inverse document frequency, i.e. the inverse of how many documents the term occurs in.
Here, the model uses the TF-IDF information from the training corpus to label new documents. For a very simple example, say the word 'bad' has a pretty high term frequency in training documents labelled negative; then any new document containing 'bad' will be more likely to be classified as negative.
For accuracy, you can manually select a training corpus that contains the most commonly used negative and positive words. This can boost the accuracy.

How does Stanford CoreNLP assign parentheses to phrases?

I have a question regarding how CoreNLP assigns parentheses to phrases en route to accumulating an overall sentence score. The main question is the ORDER in which it calculates the sentiment of phrases in a sentence. Does anyone know what algorithm is used? An example will clearly illustrate my question:
In my training model, the scale I am using is 0-4, where 0 is negative, 2 is neutral, and 4 is positive, so the following phrase is scored: (3 (1 lower) (2 (2 oil) (2 production)))
- Note: the reason for the jump to positive is that we are predicting oil prices, and lower oil production will lead to higher prices, so a proper prediction of the price of oil increasing needs an overall positive sentiment.
Next, let's assume the following tweet was grabbed: "OPEC decides to lower oil production". I assume the first thing CoreNLP does is assign each individual word a score. In our training model, lower has a score of 1 and all other words are not scored, so they receive a neutral score.
The problem seems to stem from how CoreNLP decides to score phrases (groups of words). If the first thing it did was score "oil production", then score "lower oil production", it would see we have an exact phrase match of "lower oil production" in our model and properly assign a score of 3.
However, what I'm guessing happens is this: first CoreNLP scores "OPEC decides", then "OPEC decides to", then "OPEC decides to lower", then "OPEC decides to lower oil", then "OPEC decides to lower oil production". In this case the phrase "lower oil production" is never considered in a vacuum; because no phrase matches our training model, the individual word scores decide the overall sentiment and the sentence gets a score of 1 due to "lower".
The only solution for this would be for someone to tell me the exact parentheses algorithm that CoreNLP uses to score phrases. Thanks for the help!
Stanford CoreNLP runs a constituency parser on the sentence. Then it turns the constituency tree into a binary tree with TreeBinarizer.
This is the relevant class:
edu.stanford.nlp.parser.lexparser.TreeBinarizer
Here is a link to source code on GitHub:
https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/parser/lexparser/TreeBinarizer.java
Here is the source code of where that TreeBinarizer is set up:
https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/pipeline/ParserAnnotator.java
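If you want to see the exact bracketing the sentiment model actually scores, one option (a sketch, assuming a CoreNLP server with the sentiment annotator is running locally on port 9000; the JSON field names can vary between CoreNLP versions) is to query the server and inspect the returned tree:
import json
import requests

text = "OPEC decides to lower oil production"
props = {"annotators": "tokenize,ssplit,parse,sentiment", "outputFormat": "json"}

resp = requests.post("http://localhost:9000",
                     params={"properties": json.dumps(props)},
                     data=text.encode("utf-8"))
sentence = resp.json()["sentences"][0]

# Overall sentiment plus the binarized tree the per-phrase scores were computed over
print(sentence.get("sentiment"), sentence.get("sentimentValue"))
print(sentence.get("sentimentTree"))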

Document Features Vector Representation

I am building a document classifier to categorize documents.
So the first step is to represent each document as a "feature vector" for training purposes.
After some research, I found that I can use either the bag-of-words approach or the n-gram approach to represent a document as a vector.
The text in each document (scanned PDFs and images) is retrieved using OCR, so some words contain errors. I also have no prior knowledge about the language used in these documents (so I can't use stemming).
So, as far as I understand, I have to use the n-gram approach. Or are there other approaches to represent a document?
I would also appreciate it if someone could link me to an n-gram guide so I can get a clearer picture of how it works.
Thanks in advance.
Use language detection to get the document's language (my favorite tool is LanguageIdentifier from the Tika project, but many others are available).
Use spell correction (see this question for some details).
Stem the words (if you work in a Java environment, Lucene is your choice).
Collect all N-grams (see below).
Make instances for classification by extracting n-grams from particular documents.
Build classifier.
N-gram models
N-grams are just sequences of N items. In classification by topic you normally use N-grams of words or of their roots (though there are also models based on N-grams of characters). The most popular N-grams are unigrams (single words), bigrams (2 consecutive words) and trigrams (3 consecutive words). So, from the sentence
Hello, my name is Frank
you should get the following unigrams:
[hello, my, name, is, frank] (or [hello, I, name, be, frank], if you use roots)
the following bigrams:
[hello_my, my_name, name_is, is_frank]
and so on.
In the end, your feature vector should have as many positions (dimensions) as there are distinct words in all your texts, plus 1 for unknown words. Every position in an instance vector should somehow reflect the number of corresponding words in the instance text. This may be the number of occurrences, a binary feature (1 if the word occurs, 0 otherwise), a normalized count, or tf-idf (very popular in classification by topic).
The classification process itself is the same as in any other domain.
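A minimal sketch of the n-gram extraction and classification steps with scikit-learn (an assumed library choice, not mentioned in the original answer); character n-grams via analyzer='char_wb' are often a useful fallback when OCR errors and an unknown language make whole-word features unreliable:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical labelled OCR output and document categories
docs = ["inv0ice t0tal due in 30 days", "meeting agenda and minutes", "invoice amount payable"]
labels = ["invoice", "minutes", "invoice"]

pipeline = Pipeline([
    # word unigrams + bigrams; switch to CountVectorizer(analyzer='char_wb', ngram_range=(3, 5))
    # if OCR noise makes whole words unreliable
    ('vect', CountVectorizer(ngram_range=(1, 2))),
    ('clf', LogisticRegression()),
])
pipeline.fit(docs, labels)
print(pipeline.predict(["total invoice due"]))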
