Polarity calculation in Sentiment Analysis using TextBlob

How is the polarity of a word in a sentence calculated using the PatternAnalyzer of TextBlob?

TextBlob internally uses a Naive Bayes classifier for sentiment analysis; the Naive Bayes classifier it uses is, in turn, the one provided by NLTK.
See the TextBlob sentiment analyzer code here.
@requires_nltk_corpus
def train(self):
    """Train the Naive Bayes classifier on the movie review corpus."""
    super(NaiveBayesAnalyzer, self).train()
    neg_ids = nltk.corpus.movie_reviews.fileids('neg')
    pos_ids = nltk.corpus.movie_reviews.fileids('pos')
    neg_feats = [(self.feature_extractor(
        nltk.corpus.movie_reviews.words(fileids=[f])), 'neg') for f in neg_ids]
    pos_feats = [(self.feature_extractor(
        nltk.corpus.movie_reviews.words(fileids=[f])), 'pos') for f in pos_ids]
    train_data = neg_feats + pos_feats
    #### THE CLASSIFIER USED IS NLTK's NAIVE BAYES #####
    self._classifier = nltk.classify.NaiveBayesClassifier.train(train_data)

def analyze(self, text):
    """Return the sentiment as a named tuple of the form:
    ``Sentiment(classification, p_pos, p_neg)``
    """
    # Lazily train the classifier
    super(NaiveBayesAnalyzer, self).analyze(text)
    tokens = word_tokenize(text, include_punc=False)
    filtered = (t.lower() for t in tokens if len(t) >= 3)
    feats = self.feature_extractor(filtered)
    #### USE THE PROB_CLASSIFY METHOD OF THE NLTK CLASSIFIER #####
    prob_dist = self._classifier.prob_classify(feats)
    return self.RETURN_TYPE(
        classification=prob_dist.max(),
        p_pos=prob_dist.prob('pos'),
        p_neg=prob_dist.prob("neg")
    )
The source for NLTK's NaiveBayes classifier is here. Its prob_classify method returns a probability distribution, which is used to build the result returned by TextBlob's sentiment analyzer.
def prob_classify(self, featureset):
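
For reference, a minimal usage sketch (assuming the NLTK movie_reviews corpus is downloaded) of how these probabilities surface through TextBlob's public API when the NaiveBayesAnalyzer is selected:

from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

# The first call lazily trains the classifier on the movie_reviews corpus
# (the train() method shown above); analyze() then calls prob_classify().
blob = TextBlob("The plot was dull but the acting was great.",
                analyzer=NaiveBayesAnalyzer())
print(blob.sentiment)  # Sentiment(classification=..., p_pos=..., p_neg=...)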

Related

Training loss and validation loss are all 0

I'm trying to fine-tune a T5 model on my own dataset for grammatical error correction, but when I run the model I keep getting all 0s for my results. I'm following the Hugging Face translation tutorial.
I think it's a problem with the preprocess function, but I can't seem to figure out why.
prefix = ''
max_input_length = 128
max_target_length = 128
source_lang = "ar"
target_lang = "ar"

def preprocess_function(examples):
    inputs = [prefix + ex for ex in examples["original"]]
    targets = [ex for ex in examples["corrected"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)
    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
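
For context, a minimal sketch (the checkpoint name and the raw_datasets object are assumptions, not from the question) of how such a preprocessing function is applied in the translation tutorial, where the collator also pads labels with -100 so padded positions are ignored by the loss:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

checkpoint = "t5-small"  # assumed checkpoint; the question does not name one
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# raw_datasets is assumed to be a datasets.DatasetDict with
# "original" and "corrected" columns, as preprocess_function expects.
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

# The tutorial pads batches dynamically; DataCollatorForSeq2Seq pads the
# labels with -100 so those positions do not contribute to the loss.
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)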

How can I get the score from the question-answering pipeline? Is there a bug when the question-answering pipeline is used?

When I run the following code
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

text = r"""
As checked Dis is not yet on boarded to ARB portal, hence we cannot upload the invoices in portal
"""

questions = [
    "Dis asked if it is possible to post the two invoice in ARB.I have not access so I wanted to check if you would be able to do it.",
]

for question in questions:
    inputs = tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]
    text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
    answer_start_scores, answer_end_scores = model(**inputs)
    answer_start = torch.argmax(
        answer_start_scores
    )  # Get the most likely beginning of answer with the argmax of the score
    answer_end = torch.argmax(answer_end_scores) + 1  # Get the most likely end of answer with the argmax of the score
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
    print(f"Question: {question}")
    print(f"Answer: {answer}\n")
The answer that I get here is:
Question: Dis asked if it is possible to post the two invoice in ARB.I have not access so I wanted to check if you would be able to do it.
Answer: dis is not yet on boarded to ARB portal
How do I get a score for this answer? The score here should be very similar to what I get when I run the question-answering pipeline.
I have to take this approach because the question-answering pipeline gives me a KeyError for the code below:
from transformers import pipeline
nlp = pipeline("question-answering")
context = r"""
As checked Dis is not yet on boarded to ARB portal, hence we cannot upload the invoices in portal.
"""
print(nlp(question="Dis asked if it is possible to post the two invoice in ARB?", context=context))
This is my attempt to get the score. I cannot figure out what feature.p_mask is, so I could not remove the non-context indexes that contribute to the softmax at the moment.
# ... assuming imports and question and context
model_name="deepset/roberta-base-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
inputs = tokenizer(question, context,
                   add_special_tokens=True,
                   return_tensors='pt')
input_ids = inputs['input_ids'].tolist()[0]
outputs = model(**inputs)
# used to compute score
start = outputs.start_logits.detach().numpy()
end = outputs.end_logits.detach().numpy()
# from source code
# Ensure padded tokens & question tokens cannot belong to the set of candidate answers.
#?? undesired_tokens = np.abs(np.array(feature.p_mask) - 1) & feature.attention_mask
# Generate mask
undesired_tokens = inputs['attention_mask']
undesired_tokens_mask = undesired_tokens == 0.0
# Make sure non-context indexes in the tensor cannot contribute to the softmax
start_ = np.where(undesired_tokens_mask, -10000.0, start)
end_ = np.where(undesired_tokens_mask, -10000.0, end)
# Normalize logits and spans to retrieve the answer
start_ = np.exp(start_ - np.log(np.sum(np.exp(start_), axis=-1, keepdims=True)))
end_ = np.exp(end_ - np.log(np.sum(np.exp(end_), axis=-1, keepdims=True)))
# Compute the score of each tuple(start, end) to be the real answer
outer = np.matmul(np.expand_dims(start_, -1), np.expand_dims(end_, 1))
# Remove candidate with end < start and end - start > max_answer_len
max_answer_len = 15
candidates = np.tril(np.triu(outer), max_answer_len - 1)
scores_flat = candidates.flatten()
idx_sort = [np.argmax(scores_flat)]
start, end = np.unravel_index(idx_sort, candidates.shape)[1:]
end += 1
score = candidates[0, start, end-1]
start, end, score = start.item(), end.item(), score.item()
print(tokenizer.decode(input_ids[start:end]))
print(score)
See more source code
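
A shorter route to a comparable number, skipping the p_mask/attention-mask bookkeeping (so it assumes the best span actually falls inside the context), is to softmax the start/end logits and multiply the probabilities of the chosen positions, which is essentially how the pipeline defines its score. A minimal sketch, reusing outputs, input_ids and tokenizer from the attempt above:

import torch

# Normalize the raw logits over the sequence dimension.
start_probs = torch.softmax(outputs.start_logits, dim=-1)[0]
end_probs = torch.softmax(outputs.end_logits, dim=-1)[0]

answer_start = torch.argmax(start_probs).item()
answer_end = torch.argmax(end_probs).item()

# Product of the start and end probabilities of the selected span.
score = (start_probs[answer_start] * end_probs[answer_end]).item()
print(tokenizer.decode(input_ids[answer_start:answer_end + 1]))
print(score)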

Abnormal number of sims documents in gensim

I'm injecting 77 documents into a gensim model by reading them from a database with a first script, and I save the documents on the file system.
I then load another document to check its similarity against a vector.
def read_corpus_bdd(cursor, tokens_only=False):
    for i, (url_id, url_label, contenu) in enumerate(cursor):
        tokens = gensim.utils.simple_preprocess(contenu)
        if tokens_only:
            yield tokens
        else:
            # For training data, add tags
            # yield gensim.models.doc2vec.TaggedDocument(tokens, dataLine[0])
            yield gensim.models.doc2vec.TaggedDocument(tokens, [int(str(url_id))])
            print(int(str(url_id)))

targetContentCorpus = list(read_corpus_bdd(cursor))

# Params of the training corpus
model = gensim.models.doc2vec.Doc2Vec(vector_size=40, min_count=2, epochs=40)

# Build a vocabulary
model.build_vocab(targetContentCorpus)

model.train(targetContentCorpus, total_examples=model.corpus_count, epochs=model.epochs)

# Generate file model name for save
from datetime import date
pathModelSave = os.getenv("MODEL_BASE_SAVE") + '/projet_' + str(projetId)
When I infer the vector:
inferred_vector = model.infer_vector(test_corpus[0])
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
len(sims) #output 335
So I don't understand where this 335 comes from, and also why sims[0][0] returns an id other than the ones tagged in the yield section.
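
A minimal diagnostic sketch (assuming the model and targetContentCorpus built above): since the tags are raw database ids rather than 0..76, compare how many documents were seen with how many doc-vectors were actually allocated and with the largest integer tag used.

print(model.corpus_count)                               # documents seen during training (77)
print(len(model.docvecs))                               # doc-vectors actually allocated
print(max(doc.tags[0] for doc in targetContentCorpus))  # largest raw int tag (database id)
# With plain-int tags, gensim sizes the doc-vector array to hold the largest
# tag value, so len(model.docvecs) can exceed the number of trained documents.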

What is the difference between gensim LabeledSentence and TaggedDocument

Please help me understand the difference between how TaggedDocument and LabeledSentence of gensim work. My ultimate goal is text classification using a Doc2Vec model and any classifier. I am following this blog.
class MyLabeledSentences(object):
    def __init__(self, dirname, dataDct={}, sentList=[]):
        self.dirname = dirname
        self.dataDct = {}
        self.sentList = []

    def ToArray(self):
        for fname in os.listdir(self.dirname):
            with open(os.path.join(self.dirname, fname)) as fin:
                for item_no, sentence in enumerate(fin):
                    self.sentList.append(LabeledSentence([w for w in sentence.lower().split() if w in stopwords.words('english')], [fname.split('.')[0].strip() + '_%s' % item_no]))
        return self.sentList

class MyTaggedDocument(object):
    def __init__(self, dirname, dataDct={}, sentList=[]):
        self.dirname = dirname
        self.dataDct = {}
        self.sentList = []

    def ToArray(self):
        for fname in os.listdir(self.dirname):
            with open(os.path.join(self.dirname, fname)) as fin:
                for item_no, sentence in enumerate(fin):
                    self.sentList.append(TaggedDocument([w for w in sentence.lower().split() if w in stopwords.words('english')], [fname.split('.')[0].strip() + '_%s' % item_no]))
        return self.sentList

sentences = MyLabeledSentences(some_dir_name)
model_l = Doc2Vec(min_count=1, window=10, size=300, sample=1e-4, negative=5, workers=7)
sentences_l = sentences.ToArray()
model_l.build_vocab(sentences_l)
for epoch in range(15):
    random.shuffle(sentences_l)
    model_l.train(sentences_l)
    model_l.alpha -= 0.002  # decrease the learning rate
    model_l.min_alpha = model_l.alpha

sentences = MyTaggedDocument(some_dir_name)
model_t = Doc2Vec(min_count=1, window=10, size=300, sample=1e-4, negative=5, workers=7)
sentences_t = sentences.ToArray()
model_t.build_vocab(sentences_t)
for epoch in range(15):
    random.shuffle(sentences_t)
    model_t.train(sentences_t)
    model_t.alpha -= 0.002  # decrease the learning rate
    model_t.min_alpha = model_t.alpha
My question is: is model_l.docvecs['some_word'] the same as model_t.docvecs['some_word']?
Can you provide links to good sources for getting a grasp of how TaggedDocument or LabeledSentence works?
LabeledSentence is an older, deprecated name for the same simple object-type to encapsulate a text-example that is now called TaggedDocument. Any objects that have words and tags properties, each a list, will do. (words is always a list of strings; tags can be a mix of integers and strings, but in the common and most-efficient case, is just a list with a single id integer, starting at 0.)
model_l and model_t will serve the same purposes, having trained on the same data with the same parameters, using just different names for the objects. But the vectors they'll return for individual word-tokens (model['some_word']) or document-tags (model.docvecs['somefilename_NN']) will likely be different – there's randomness in Word2Vec/Doc2Vec initialization and training-sampling, and introduced by ordering-jitter from multithreaded training.
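
A minimal sketch of that equivalence, assuming a current gensim where only the TaggedDocument name exists:

from gensim.models.doc2vec import TaggedDocument

# Any object exposing `words` (list of str) and `tags` (list) will do;
# TaggedDocument is simply the current name for that container.
doc = TaggedDocument(words=['some', 'tokenized', 'text'], tags=[0])
print(doc.words)   # ['some', 'tokenized', 'text']
print(doc.tags)    # [0]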

How to train tfidfvectorizer for new dataset

I am doing document classification using TfidfVectorizer and LinearSVC. I need to train the TfidfVectorizer again and again as new datasets come in. Is there any way to store the current TfidfVectorizer and mix in new features when a new dataset comes?
Code:
if os.path.exists("trans.pkl"):
    with open("trans.pkl", "rb") as fid:
        transformer = cPickle.load(fid)
else:
    transformer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
    with open("trans.pkl", "wb") as fid:
        cPickle.dump(transformer, fid)

X_train = transformer.fit_transform(train_data)
X_test = transformer.transform(test_data)
print X_train.shape[1]

if os.path.exists("store_model.pkl"):
    print "model exists"
    with open("store_model.pkl", "rb") as fid:
        classifier = cPickle.load(fid)
    print classifier
else:
    print "model created"
    classifier = LinearSVC().fit(X_train, train_target)
    with open("store_model.pkl", "wb") as fid:
        cPickle.dump(classifier, fid)

predictions = classifier.predict(X_test)
I have 2 different train files and 1 test file. When I execute the code for the 1st train file, it works well. But when I try the 2nd train file, the number of features differs from the 1st, so it gives an error. How can I train my model if I have multiple such dataset files?
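
One common way to keep the feature space consistent, sketched below under the assumption that both train files can be read together (train_data_1, train_data_2 and new_test_data are placeholder names, not from the question): fit the vectorizer once on all training text, pickle the fitted vectorizer, and afterwards only call transform().

from sklearn.feature_extraction.text import TfidfVectorizer
import pickle

# Placeholder lists of documents; not from the question.
all_train_data = train_data_1 + train_data_2

transformer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X_train = transformer.fit_transform(all_train_data)

# Pickle the vectorizer *after* fitting so later runs reuse the same vocabulary.
with open("trans.pkl", "wb") as fid:
    pickle.dump(transformer, fid)

# Later runs: load and transform only, so the number of features stays fixed.
with open("trans.pkl", "rb") as fid:
    transformer = pickle.load(fid)
X_new = transformer.transform(new_test_data)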
