Adding a feature for unseen words - matrix

I am using sklearn's CountVectorizer to build my term-document matrix:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1)
corpus = ['this is jummy speaking now']
X = vectorizer.fit_transform(corpus)
c = vectorizer.transform(['lol 123']).toarray()
Here X is a term-document matrix with 5 columns, one per word in the corpus. However, I would like the matrix to have an extra "unknown" column, giving 6 columns in total. Whenever a new, unseen word is found, it should be counted in the unknown column. For example, neither lol nor 123 is in the corpus, so both should be counted there.
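One possible way to get such a column is to count, per document, the tokens that are missing from the fitted vocabulary and append that count as an extra column. The sketch below is only an illustration and not a built-in CountVectorizer feature; the helper name transform_with_unknown is made up for this example.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

def transform_with_unknown(vectorizer, docs):
    """Transform docs and append one column counting out-of-vocabulary tokens.

    Hypothetical helper: assumes `vectorizer` has already been fitted.
    """
    analyzer = vectorizer.build_analyzer()   # same tokenization as the vectorizer
    vocab = vectorizer.vocabulary_           # token -> column index
    known = vectorizer.transform(docs)       # counts for known words only
    unknown_counts = [
        sum(1 for token in analyzer(doc) if token not in vocab) for doc in docs
    ]
    unknown_col = csr_matrix(np.array(unknown_counts).reshape(-1, 1))
    return hstack([known, unknown_col]).tocsr()

vectorizer = CountVectorizer(min_df=1)
vectorizer.fit(['this is jummy speaking now'])
c = transform_with_unknown(vectorizer, ['lol 123']).toarray()
print(c)
```

For the document 'lol 123' the five known columns are all zero and the appended sixth column holds 2, one for each unseen token.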

Related

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation

I'm learning NLP following this sequence classification tutorial from HuggingFace https://huggingface.co/transformers/custom_datasets.html#sequence-classification-with-imdb-reviews
The original code runs without problems. But when I tried to load a different tokenizer, such as the one from google/bert_uncased_L-4_H-256_A-4, the following warning appeared:
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
from transformers import AutoTokenizer
from pathlib import Path
def read_imdb_split(split_dir):
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text())
            # compare strings with ==, not "is"
            labels.append(0 if label_dir == "neg" else 1)
    return texts[:50], labels[:50]
if __name__ == '__main__':
    test_texts, test_labels = read_imdb_split('aclImdb/test')
    tokenizer = AutoTokenizer.from_pretrained('google/bert_uncased_L-4_H-256_A-4')
    test_encodings = tokenizer(test_texts, truncation=True, padding=True)
    for input_id in test_encodings["input_ids"]:
        print(len(input_id))
The output shows that every entry of input_ids has length 1288, so they all seem to have been padded to 1288. How can I specify a truncation target length such as 512?
Specify model_max_length when loading the tokenizer:
tokenizer = AutoTokenizer.from_pretrained('google/bert_uncased_L-4_H-256_A-4', model_max_length=512)

Is there a way to infer topic distributions on unseen document from gensim LDA pre-trained model using matrix multiplication?

Is there a way to get the topic distribution of an unseen document using a pretrained LDA model without using the LDA_Model[unseenDoc] syntax? I am trying to implement my LDA model into a web application, and if there was a way to use matrix multiplication to get a similar result then I could use the model in javascript.
For example, I tried the following:
import numpy as np
import gensim
from gensim.corpora import Dictionary
from gensim import models
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
nltk.download('wordnet')
def Preprocesser(text_list):
    smallestWordSize = 3
    processedList = []
    for token in gensim.utils.simple_preprocess(text_list):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > smallestWordSize:
            processedList.append(StemmAndLemmatize(token))
    return processedList
lda_model = models.LdaModel.load(r'LDAModel\GoldModel') #Load pretrained LDA model (raw string keeps the backslash literal)
dictionary = Dictionary.load(r"ModelTrain\ManDict") #Load dictionary model was trained on
#Sample Unseen Doc to Analyze
doc = "I am going to write a string about how I can't get my task executor \
to travel properly. I am trying to use the \
AGV navigator, but it doesn't seem to be working network. I have been trying\
to use the AGV Process flow but that isn't working either speed\
trailer offset I am now going to change this so I can see how fast it runs"
termTopicMatrix = lda_model.get_topics() #Get Term-topic Matrix from pretrained LDA model
cleanDoc = Preprocesser(doc) #Tokenize, lemmatize, clean and stem words
bowDoc = dictionary.doc2bow(cleanDoc) #Create bow using dictionary
dictSize = len(termTopicMatrix[0]) #Get length of terms in dictionary
fullDict = np.zeros(dictSize) #Initialize array which is length of dictionary size
First = [first[0] for first in bowDoc] #Get index of terms in bag of words
Second = [second[1] for second in bowDoc] #Get frequency of term in bag of words
fullDict[First] = Second #Add word frequency to full dictionary
print('Matrix Multiplication: \n', np.dot(termTopicMatrix,fullDict))
print('Conventional Syntax: \n', lda_model[bowDoc])
Output:
Matrix Multiplication:
[0.0283254 0.01574513 0.03669142 0.01671816 0.03742738 0.01989461
0.01558603 0.0370233 0.04648389 0.02887623 0.00776652 0.02147539
0.10045133 0.01084273 0.01229849 0.00743788 0.03747379 0.00345913
0.03086953 0.00628912 0.29406082 0.10656977 0.00618827 0.00406316
0.08775404 0.00785408 0.02722744 0.09957815 0.01669402 0.00744392
0.31177135 0.03063149 0.07211428 0.01192056 0.03228589]
Conventional Syntax:
[(0, 0.070313625), (2, 0.056414187), (18, 0.2016589), (20, 0.46500313), (24, 0.1589748)]
In the pretrained model there are 35 topics and 1155 words.
In the "Conventional Syntax" output, the first element of each tuple is the index of the topic and the second element is the probability of the topic. In the "Matrix Multiplication" version, the position in the array is the topic index and the value is the probability. Clearly the two don't match up.
For example, the lda_model[unseenDoc] shows that topic 0 has a 0.07 probability, but the matrix multiplication method says that topic has a 0.028 probability. Am I missing a step here?
You can review the full source code used by LDAModel's get_document_topics() method in your installation, or online at:
https://github.com/RaRe-Technologies/gensim/blob/e75f6c8e8d1dee0786b1b2cd5ef60da2e290f489/gensim/models/ldamodel.py#L1283
(It also makes use of the inference() method in the same file.)
It does a lot more scaling, normalization, and clipping than your code, which is likely the cause of the discrepancy. You should be able to compare the two line by line to see where your process and gensim's differ, and then make the steps match up.
It also shouldn't be hard to use the gensim code's steps as guidance for creating parallel Javascript code that, given the right parts of the model's state, can reproduce its results.
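As a rough illustration of the kind of steps involved: as I read the linked source, inference() runs an iterative variational update per document rather than a single matrix product, and the result is then normalized to sum to 1. The sketch below is a simplified version of that idea, not gensim's actual code. It assumes exp_elog_beta corresponds to the model's expElogbeta attribute and alpha to its alpha attribute; the function name is hypothetical, and gamma starts from a fixed value here whereas gensim initializes it randomly, so the numbers will not match gensim exactly.

```python
import numpy as np
from scipy.special import digamma

def infer_topic_distribution(word_ids, word_counts, exp_elog_beta, alpha,
                             iterations=50, threshold=1e-3, eps=1e-12):
    """Simplified variational inference for a single bag-of-words document."""
    ids = np.asarray(word_ids, dtype=int)
    cts = np.asarray(word_counts, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    # Keep only the topic-word columns for words present in the document: (K, n_words)
    beta_d = np.asarray(exp_elog_beta, dtype=float)[:, ids]
    gamma = np.ones_like(alpha)  # assumption: deterministic start
    for _ in range(iterations):
        last_gamma = gamma
        # exp(E[log theta]) under the current Dirichlet parameters gamma
        exp_elog_theta = np.exp(digamma(gamma) - digamma(gamma.sum()))
        phinorm = exp_elog_theta @ beta_d + eps
        gamma = alpha + exp_elog_theta * ((cts / phinorm) @ beta_d.T)
        if np.mean(np.abs(gamma - last_gamma)) < threshold:
            break
    # Normalize so the topic weights sum to 1
    return gamma / gamma.sum()
```

Because the loop only uses array arithmetic plus the digamma function, the same steps could be ported to JavaScript given the exported arrays.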

Extracting Topic distribution from gensim LDA model

I created an LDA model for some text files using the gensim package in Python. I want to get the topic distributions for the learned model. Is there a method in the gensim LdaModel class, or some other solution, to get the topic distributions from the model?
For example, I use the coherence model to find the model with the best coherence value, with the number of topics ranging from 1 to 5. After getting the best model, I use the get_document_topics method (thanks kenhbs) to get the topic distribution of the document that was used for creating the model.
id2word = corpora.Dictionary(doc_terms)
bow = id2word.doc2bow(doc_terms)
max_coherence = -1
best_lda_model = None
for num_topics in range(1, 6):
    lda_model = gensim.models.ldamodel.LdaModel(corpus=bow, num_topics=num_topics)
    coherence_model = gensim.models.CoherenceModel(model=lda_model, texts=doc_terms, dictionary=id2word)
    coherence_value = coherence_model.get_coherence()
    if coherence_value > max_coherence:
        max_coherence = coherence_value
        best_lda_model = lda_model
The best has 4 topics
print(best_lda_model.num_topics)
4
But when I use get_document_topics, I get fewer than 4 values for the document's distribution.
topic_ditrs = best_lda_model.get_document_topics(bow)
print(len(topic_ditrs))
3
My question is: for the best LDA model with 4 topics (chosen using the coherence model), why does get_document_topics return fewer topics for the same document? Why do some topics have a very small probability (less than 1e-8)?
From the documentation, you can use two methods for this.
If you are aiming to get the main terms in a specific topic, use get_topic_terms:
from gensim.models.ldamodel import LdaModel
K = 10
lda = LdaModel(some_corpus, num_topics=K)
lda.get_topic_terms(5, topn=10)
# Or for all topics
for i in range(K):
    lda.get_topic_terms(i, topn=10)
You can also print the entire underlying np.ndarray (called either beta or phi in standard LDA papers, dimensions are (K, V) or (V, K)).
phi = lda.get_topics()
edit:
From the link i included in the original answer: if you are looking for a document's topic distribution, use
res = lda.get_document_topics(bow)
As can be read from the documentation, the result contains the following (the second and third lists only when per_word_topics is set to True):
list of (int, float) – Topic distribution for the whole document. Each element in the list is a pair of a topic’s id, and the probability that was assigned to it.
list of (int, list of (int, float), optional – Most probable topics per word. Each element in the list is a pair of a word’s id, and a list of topics sorted by their relevance to this word. Only returned if per_word_topics was set to True.
list of (int, list of float), optional – Phi relevance values, multipled by the feature length, for each word-topic combination. Each element in the list is a pair of a word’s id and a list of the phi values between this word and each topic. Only returned if per_word_topics was set to True.
Now,
tops, probs = zip(*res[0])
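probs will contain at most K (for you, 4) probabilities. Topics whose probability falls below minimum_probability are left out, so the list can be shorter than K and the values then sum to slightly less than 1.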
You can play with the minimum_probability parameter and set it to a very small value like 0.000001.
topic_vector = [ x[1] for x in ldamodel.get_document_topics(new_doc_bow , minimum_probability= 0.0, per_word_topics=False)]
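To make the effect of the threshold concrete, here is a small sketch of the filtering step in plain Python. It is an illustration rather than gensim's own code: the function name is made up, the default of 0.01 is what I believe LdaModel uses for minimum_probability, and the 1e-8 floor reflects my reading of the gensim source, which would explain why topics below 1e-8 are dropped even with minimum_probability=0.0.

```python
def filter_topics(topic_probs, minimum_probability=0.01):
    """Return (topic_id, probability) pairs at or above the threshold.

    Hypothetical helper mimicking the thresholding described above.
    """
    # Assumption: the threshold is never allowed to go below 1e-8
    floor = max(minimum_probability, 1e-8)
    return [(topic_id, p) for topic_id, p in enumerate(topic_probs) if p >= floor]

doc_dist = [0.6, 0.3, 0.095, 0.005]            # a 4-topic distribution
print(len(filter_topics(doc_dist)))            # topic 3 is below 0.01 and is dropped
print(len(filter_topics(doc_dist, 0.0)))       # all four topics are kept
```

With a 4-topic model, a document whose fourth topic has probability 0.005 therefore comes back with only 3 entries unless the threshold is lowered.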
Just type,
pd.DataFrame(lda_model.get_document_topics(doc_term_matrix))

Vectorization in sklearn seems to be very memory expensive. Why?

I need to process more than 1,000,000 text records. I am employing CountVectorizer to transform my data. I have the following code.
TEXT = [data[i].values()[3] for i in range(len(data))] #these are the text records
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(TEXT)
X_list = X.toarray().tolist()
When I run this code, it raises a MemoryError. The text records I have are mostly short paragraphs (~100 words). Vectorization seems to be very expensive.
UPDATE
I added more constraints to CountVectorizer but still got a MemoryError. The length of feature_names is 2391.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=0.003,max_df = 3.05, lowercase = True, stop_words = 'english')
X = vectorizer.fit_transform(TEXT)
feature_names = vectorizer.get_feature_names()
X_tolist = X.toarray().tolist()
Traceback (most recent call last):
  File "nlp2.py", line 42, in <module>
    X_tolist = X.toarray().tolist()
  File "/opt/conda/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 940, in toarray
    return self.tocoo(copy=False).toarray(order=order, out=out)
  File "/opt/conda/lib/python2.7/site-packages/scipy/sparse/coo.py", line 250, in toarray
    B = self._process_toarray_args(order, out)
  File "/opt/conda/lib/python2.7/site-packages/scipy/sparse/base.py", line 817, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
Why is this so, and how can I get around it? Thank you!!
Your problem is that X is a sparse matrix with one row per document, recording which words are present in that document. If you have a million documents with a total of 2391 distinct words (the length of feature_names given in your question), the dense version of X would have about 2.4 billion entries, enough to cause a memory error.
The problem is this line, X_list = X.toarray().tolist(), which converts X to a dense array. You don't have enough memory for that, and there should be a way to do what you are trying to do without it, as the sparse version of X has all the information that you need.
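A quick back-of-the-envelope calculation shows the scale. The sketch below assumes 8 bytes per entry (64-bit integers, which I believe is CountVectorizer's default dtype); the helper name is made up for illustration.

```python
def dense_size_bytes(n_rows, n_cols, itemsize=8):
    """Bytes needed to hold an n_rows x n_cols dense array."""
    return n_rows * n_cols * itemsize

n_docs, n_features = 1_000_000, 2391   # figures from the question
size = dense_size_bytes(n_docs, n_features)
print(f"{n_docs * n_features:,} entries, about {size / 1e9:.1f} GB dense")
# prints: 2,391,000,000 entries, about 19.1 GB dense
```

Calling .tolist() on top of that makes things worse still, since every entry then becomes a separate Python object.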

Extrapolating variance components from Weir-Fst on Vcftools

vcftools --vcf ALL.chr1.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf --weir-fst-pop POP1.txt --weir-fst-pop POP2.txt --out fst.POP1.POP2
The above command computes Fst distances on 1000 Genomes population data using Weir and Cockerham's 1984 formula. This formula uses 3 variance components, namely a, b, c (between populations; between individuals within populations; between gametes within individuals within populations).
The output directly provides the result of the formula but not the components that the program calculated to arrive at the final result. How can I ask Vcftools to output the values for a,b,c?
If you can get the data into the format for hierfstat, you can get the variance components from varcomp.glob. What I normally do is:
use vcftools with --012 to get genotypes
convert 0/1/2/-1 to hierfstat format (e.g., 11/12/22/NA)
load the data into hierfstat and compute (see below)
R example:
library(hierfstat)
data = read.table("hierfstat.txt", header=T, sep="\t")
levels = data.frame(data$popid)
loci = data[,2:ncol(data)]
res = varcomp.glob(levels=levels, loci=loci, diploid=T)
print(res$loc)
print(res$F)
Fst for each locus (row) therefore is (without hierarchical design), from res$loc: res$loc[1]/sum(res$loc). If you have more complicated sampling, you'll need to interpret the variance components differently.
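For reference, the same arithmetic can be written out once the components have been exported from R. This is a minimal sketch of the Weir and Cockerham ratio a / (a + b + c); the function names are hypothetical, and the multi-locus version uses the ratio of sums across loci rather than the mean of per-locus ratios.

```python
def weir_cockerham_fst(a, b, c):
    """Single-locus Fst from the three variance components."""
    total = a + b + c
    if total == 0:
        return float("nan")   # monomorphic locus: Fst is undefined
    return a / total

def multilocus_fst(components):
    """Ratio-of-sums Fst over an iterable of (a, b, c) tuples, one per locus."""
    components = list(components)
    num = sum(a for a, _, _ in components)
    den = sum(a + b + c for a, b, c in components)
    return num / den if den else float("nan")

print(weir_cockerham_fst(0.02, 0.03, 0.15))
print(multilocus_fst([(0.02, 0.03, 0.15), (0.0, 0.1, 0.1)]))
```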
--update per your comment--
I do this in Pandas, but any language would do. It's a text replacement exercise. Just get your .012 file into a dataframe and convert as below. I read in row by row into numpy b/c I have tons of snps, but read_csv would work, too.
import pandas as pd
import numpy as np
# z12_file: path to the .012 file written by vcftools --012
z12_data = []
for i, line in enumerate(open(z12_file)):
    line = line.strip()
    line = [int(x) for x in line.split("\t")]
    z12_data.append(np.array(line))
    if i % 10 == 0:
        print(i)  # progress indicator
z12_data = np.array(z12_data)
z12_df = pd.DataFrame(z12_data)
# the first column of the .012 file is the individual index, so drop it
z12_df = z12_df.drop(0, axis=1)
z12_df.columns = pd.Series(z12_df.columns)-1
hierf_trans = {0:11, 1:12, 2:22, -1:'NA'}
def apply_hierf_trans(series):
    return [hierf_trans[x] if x in hierf_trans else x for x in series]
hierf = z12_df.apply(apply_hierf_trans)
hierf.to_csv("hierfstat.txt", header=True, index=False, sep="\t")
Then, you'd read that file, hierfstat.txt, into R; these are your loci. You'd need to specify the levels in your sampling design (e.g., your population). Then call varcomp.glob() to get the variance components. I have a parallel version of this here if you want to use it.
Note that you are specifying 0 as the reference allele in this case. That may be what you want, or it may not. I often calculate minor allele frequency and make 2 the minor allele, but it depends on your study goal.
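If you do want 2 to mean "homozygous for the minor allele", one way to recode a 0/1/2/-1 matrix is sketched below. This is only an illustration under the assumption that rows are individuals, columns are loci, and -1 marks missing genotypes; the function name is made up.

```python
import numpy as np

def recode_minor_as_two(genotypes, missing=-1):
    """Flip loci where the allele counted by the 0/1/2 code is the major one."""
    g = np.array(genotypes, dtype=int)      # copy, so the input is left untouched
    for j in range(g.shape[1]):
        col = g[:, j]                        # view into g
        called = col != missing
        if not called.any():
            continue                         # locus with no called genotypes
        # frequency of the counted allele among called genotypes
        alt_freq = col[called].sum() / (2 * called.sum())
        if alt_freq > 0.5:
            col[called] = 2 - col[called]    # swap 0 <-> 2, keep 1 and missing
    return g

print(recode_minor_as_two([[2, 0], [2, 1], [1, -1]]))
```

In this toy input the first locus is flipped because the counted allele has frequency 5/6, while the second locus is left as it is.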
