How do I use classifyAndWriteAnswersKBest in coreNLP pipeline? - stanford-nlp

When I used a CRF classifier directly, I could use the classifyAndWriteAnswersKBest function (as below) to calculate a probability score.
Counter<List<CoreLabel>> kBest = classifier.classifyKBest(labels,
CoreAnnotations.AnswerAnnotation.class, 1);
List<List<CoreLabel>> sorted = Counters.toSortedList(kBest);
for (List<CoreLabel> l : sorted) {
prob_list.add(Math.exp(kBest.getCount(l)));
}
But now because I need to take advantage of the POS tag as a feature for my classifier, I called the StanfordCoreNLP API and used ner.model to initiate the classifier. Now the problem is, how can I still use the above Kbest function to do the similar calculation without loading the classifier again?

Related

Reduce the output layer size from XLTransformers

I'm running the following using the huggingface implementation:
t1 = "My example sentence is really great."
tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
model = TransfoXLLMHeadModel.from_pretrained("transfo-xl-wt103")
encoded_input = tokenizer(t1, return_tensors='pt', add_space_before_punct_symbol=True)
output = model(**encoded_input)
tmp = output[0].detach().numpy()
print(tmp.shape)
>>> (1, 7, 267735)
With the goal of getting output embeddings that I'll use downstream.
The last dimension is /substantially/ larger than I expected, and it looks like it is the size of the entire vocab_size rather than a reduction based on the ECL from the paper (which potentially I am misinterpreting).
What argument would I provide the model to reduce this layer size to a smaller dimensional space, something more like the basic BERT at 400 or 768 and still obtain good performance based on the pretrained embeddings?
That's because you used ...LMHeadModel, which predicts the next token. You can use TransfoXLModel.from_pretrained("transfo-xl-wt103") instead, then output[0] is the last hidden state which has the shape (batch_size, sequence_length, hidden_size).

Is there a way to infer topic distributions on unseen document from gensim LDA pre-trained model using matrix multiplication?

Is there a way to get the topic distribution of an unseen document using a pretrained LDA model without using the LDA_Model[unseenDoc] syntax? I am trying to implement my LDA model into a web application, and if there was a way to use matrix multiplication to get a similar result then I could use the model in javascript.
For example, I tried the following:
import numpy as np
import gensim
from gensim.corpora import Dictionary
from gensim import models
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
nltk.download('wordnet')
def Preprocesser(text_list):
smallestWordSize = 3
processedList = []
for token in gensim.utils.simple_preprocess(text_list):
if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > smallestWordSize:
processedList.append(StemmAndLemmatize(token))
return processedList
lda_model = models.LdaModel.load('LDAModel\GoldModel') #Load pretrained LDA model
dictionary = Dictionary.load("ModelTrain\ManDict") #Load dictionary model was trained on
#Sample Unseen Doc to Analyze
doc = "I am going to write a string about how I can't get my task executor \
to travel properly. I am trying to use the \
AGV navigator, but it doesn't seem to be working network. I have been trying\
to use the AGV Process flow but that isn't working either speed\
trailer offset I am now going to change this so I can see how fast it runs"
termTopicMatrix = lda_model.get_topics() #Get Term-topic Matrix from pretrained LDA model
cleanDoc = Preprocesser(doc) #Tokenize, lemmatize, clean and stem words
bowDoc = dictionary.doc2bow(cleanDoc) #Create bow using dictionary
dictSize = len(termTopicMatrix[0]) #Get length of terms in dictionary
fullDict = np.zeros(dictSize) #Initialize array which is length of dictionary size
First = [first[0] for first in bowDoc] #Get index of terms in bag of words
Second = [second[1] for second in bowDoc] #Get frequency of term in bag of words
fullDict[First] = Second #Add word frequency to full dictionary
print('Matrix Multiplication: \n', np.dot(termTopicMatrix,fullDict))
print('Conventional Syntax: \n', lda_model[bowDoc])
Output:
Matrix Multiplication:
[0.0283254 0.01574513 0.03669142 0.01671816 0.03742738 0.01989461
0.01558603 0.0370233 0.04648389 0.02887623 0.00776652 0.02147539
0.10045133 0.01084273 0.01229849 0.00743788 0.03747379 0.00345913
0.03086953 0.00628912 0.29406082 0.10656977 0.00618827 0.00406316
0.08775404 0.00785408 0.02722744 0.09957815 0.01669402 0.00744392
0.31177135 0.03063149 0.07211428 0.01192056 0.03228589]
Conventional Syntax:
[(0, 0.070313625), (2, 0.056414187), (18, 0.2016589), (20, 0.46500313), (24, 0.1589748)]
In the pretrained model there are 35 topics and 1155 words.
In the "Conventional Syntax" output, the first element of each tuple is the index of the topic and the second element is the probability of the topic. In the "Matrix Multiplication" version, the probability is the index and the value is the probability. Clearly the two don't match up.
For example, the lda_model[unseenDoc] shows that topic 0 has a 0.07 probability, but the matrix multiplication method says that topic has a 0.028 probability. Am I missing a step here?
You can review the full source code used by LDAModel's get_document_topics() method in your installation, or online at:
https://github.com/RaRe-Technologies/gensim/blob/e75f6c8e8d1dee0786b1b2cd5ef60da2e290f489/gensim/models/ldamodel.py#L1283
(It also makes use of the inference() method in the same file.)
It's doing a lot more scaling/normalization/clipping than your code, which is likely the cause of the discrepancy. But you should be able to examine, line-by-line, where your process & its differ to get the steps to match up.
It also shouldn't be hard to use the gensim code's steps as guidance for creating parallel Javascript code that, given the right parts of the model's state, can reproduce its results.

Copying embeddings for gensim word2vec

I wanted to see if I can simply set new weights for gensim's Word2Vec without training. I get the 20 News Group data set from scikit-learn (from sklearn.datasets import fetch_20newsgroups) and trained an instance of Word2Vec on it:
model_w2v = models.Word2Vec(sg = 1, size=300)
model_w2v.build_vocab(all_tokens)
model_w2v.train(all_tokens, total_examples=model_w2v.corpus_count, epochs = 30)
Here all_tokens is the tokenized data set.
Then I created a new instance of Word2Vec without training
model_w2v_new = models.Word2Vec(sg = 1, size=300)
model_w2v_new.build_vocab(all_tokens)
and set the embeddings of the new Word2Vec equal to the first one
model_w2v_new.wv.vectors = model_w2v.wv.vectors
Most of the functions work as expected, e.g.
model_w2v.wv.similarity( w1='religion', w2 = 'religions')
> 0.4796233
model_w2v_new.wv.similarity( w1='religion', w2 = 'religions')
> 0.4796233
and
model_w2v.wv.words_closer_than(w1='religion', w2 = 'judaism')
> ['religions']
model_w2v_new.wv.words_closer_than(w1='religion', w2 = 'judaism')
> ['religions']
and
entities_list = list(model_w2v.wv.vocab.keys()).remove('religion')
model_w2v.wv.most_similar_to_given(entity1='religion',entities_list = entities_list)
> 'religions'
model_w2v_new.wv.most_similar_to_given(entity1='religion',entities_list = entities_list)
> 'religions'
However, most_similar doesn't work:
model_w2v.wv.most_similar(positive=['religion'], topn=3)
[('religions', 0.4796232581138611),
('judaism', 0.4426296651363373),
('theists', 0.43141329288482666)]
model_w2v_new.wv.most_similar(positive=['religion'], topn=3)
>[('roderick', 0.22643062472343445),
> ('nci', 0.21744996309280396),
> ('soviet', 0.20012077689170837)]
What am I missing?
Disclaimer. I posted this question on datascience.stackexchange but got no response, hoping to have a better luck here.
Generally, your approach should work.
It's likely the specific problem you're encountering was caused by an extra probing step you took and is not shown in your code, because you had no reason to think it significant: some sort of most_similar()-like operation on model_w2v_new after its build_vocab() call but before the later, malfunctioning operations.
Traditionally, most_similar() calculations operate on a version of the vectors that has been normalized to unit-length. The 1st time these unit-normed vectors are needed, they're calculated – and then cached inside the model. So, if you then replace the raw vectors with other values, but don't discard those cached values, you'll see results like you're reporting – essentially random, reflecting the randomly-initialized-but-never-trained starting vector values.
If this is what happened, just discarding the cached values should cause the next most_similar() to refresh them properly, and then you should get the results you expect:
model_w2v_new.wv.vectors_norm = None

Misclassification cost matrix on the Random Forest classifier implemented by Apache Spark (MLLIB)

I saw that the Random Forest classifier implemented by Apache Spark's MLLIB does not support a misclassification cost matrix for training and metric evaluation.
QUESTIONS: Is there a way to implement this without changing the MLLIB's code? How?
Each split on each tree should consider the cost matrix (e.g., C5.0 cost matrix related question).
Below is a part of a SCALA code for training and prediction using the [new] Random Forest classifer provided by MLLIB.
[...]
// Train a RandomForest model.
// Here, I would like to insert a misclassification cost matrix
val rf = new RandomForestClassifier()
.setLabelCol("indexedLabel")
.setFeaturesCol("rawFeatures")
.setNumTrees(500)
.setImpurity("gini")
.setMaxBins(500)
.setMaxDepth(30)
.setFeatureSubsetStrategy("50")
// Chain indexers and forest in a Pipeline
val pipeline = new Pipeline()
.setStages(Array(assembler,rf))
// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)
// Make predictions.
val predictions = model.transform(testData)
[...]
// I could use predictions and the cost matrix to evaluate
// some metrics directly.
[...]
Thank you in advance.
regards,
Paulo Angelo

GenSim Word2Vec unexpectedly pruning

My objective is to find a vector representation of phrases. Below is the code I have, that works partially for bigrams using the Word2Vec model provided by the GenSim library.
from gensim.models import word2vec
def bigram2vec(unigrams, bigram_to_search):
bigrams = Phrases(unigrams)
model = word2vec.Word2Vec(sentences=bigrams[unigrams], size=20, min_count=1, window=4, sg=1, hs=1, negative=0, trim_rule=None)
if bigram_to_search in model.vocab.keys():
return model[bigram_to_search]
else:
return None
The problem is that the Word2Vec model is seemingly doing automatic pruning of some of the bigrams, i.e. len(model.vocab.keys()) != len(bigrams.vocab.keys()). I've tried adjusting various parameters such as trim_rule, min_count, but they don't seem to affect the pruning.
PS - I am aware that bigrams to look up need to be represented using underscore instead of space, i.e. proper way to call my function would be bigram2vec(unigrams, 'this_report')
Thanks to further clarification at the GenSim support forum, the solution is to set the appropriate min_count and threshold values for the Phrases being generated (see documentation for details about these parameters in the Phrases class). The corrected solution code is below.
from gensim.models import word2vec, Phrases
def bigram2vec(unigrams, bigram_to_search):
bigrams = Phrases(unigrams, min_count=1, threshold=0.1)
model = word2vec.Word2Vec(sentences=bigrams[unigrams], size=20, min_count=1, trim_rule=None)
if bigram_to_search in model.vocab.keys():
return model[bigram_to_search]
else:
return []

Resources