Vectorization in sklearn seems to be very memory expensive. Why?

I need to process more than 1,000,000 text records. I am employing CountVectorizer to transform my data. I have the following code.
TEXT = [data[i].values()[3] for i in range(len(data))] #these are the text records
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(TEXT)
X_list = X.toarray().tolist()
When I run this code, it raises a MemoryError. The text records I have are mostly short paragraphs (~100 words). Vectorization seems to be very memory expensive.
UPDATE
I added more constraints to CountVectorizer but still got a MemoryError. The length of feature_names is 2391.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=0.003,max_df = 3.05, lowercase = True, stop_words = 'english')
X = vectorizer.fit_transform(TEXT)
feature_names = vectorizer.get_feature_names()
X_tolist = X.toarray().tolist()
Traceback (most recent call last):
File "nlp2.py", line 42, in <module>
X_tolist = X.toarray().tolist()
File "/opt/conda/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 940, in toarray
return self.tocoo(copy=False).toarray(order=order, out=out)
File "/opt/conda/lib/python2.7/site-packages/scipy/sparse/coo.py", line 250, in toarray
B = self._process_toarray_args(order, out)
File "/opt/conda/lib/python2.7/site-packages/scipy/sparse/base.py", line 817, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
Why is this so, and how can I get around it? Thank you!

Your problem is that X is a sparse matrix with one row per document, representing which words are present in that document. If you have a million documents with a total of 2391 distinct words among them (the length of feature_names given in your question), the dense version of X would have roughly 2.4 billion entries, easily enough to cause a MemoryError.
The problem is the line X_list = X.toarray().tolist(), which converts X to a dense array. You don't have enough memory for that, and there should be a way to do what you are trying to do without it, since the sparse version of X already has all the information you need.
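As a minimal sketch of working with the sparse matrix directly (the tiny TEXT list here is a stand-in for the real corpus, and the per-row loop is just one illustration of avoiding toarray(); the names match the question's code):
from sklearn.feature_extraction.text import CountVectorizer

TEXT = ["first short paragraph", "second short paragraph about text"]  # stand-in corpus

vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(TEXT)             # scipy.sparse CSR matrix
feature_names = vectorizer.get_feature_names()

# Work row by row on the sparse matrix instead of calling X.toarray():
for i in range(X.shape[0]):
    row = X.getrow(i)                          # still sparse, shape (1, n_features)
    counts = {feature_names[j]: int(row[0, j]) for j in row.indices}
    print(i, counts)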

Related

word2vec recommendation system KeyError: "word '21883' not in vocabulary"

The code works absolutely fine for a data set containing 500,000+ instances, but whenever I reduce the data set to 5000/10000/15000 it throws a key error: word "***" not in vocabulary. Not for every data point, but for most of them. The data set is in Excel format (screenshot: https://i.stack.imgur.com/YCBiQ.png).
I don't know how to fix this problem since I have very little knowledge about it; I am still learning. Please help me fix this problem!
from tqdm import tqdm
from gensim.models import Word2Vec

purchases_train = []
for i in tqdm(customers_train):
    temp = train_df[train_df["CustomerID"] == i]["StockCode"].tolist()
    purchases_train.append(temp)

purchases_val = []
for i in tqdm(validation_df['CustomerID'].unique()):
    temp = validation_df[validation_df["CustomerID"] == i]["StockCode"].tolist()
    purchases_val.append(temp)

model = Word2Vec(window=10, sg=1, hs=0,
                 negative=10,  # for negative sampling
                 alpha=0.03, min_alpha=0.0007,
                 seed=14)
model.build_vocab(purchases_train, progress_per=200)
model.train(purchases_train, total_examples=model.corpus_count,
            epochs=10, report_delay=1)
model.save("word2vec_2.model")
model.init_sims(replace=True)

# extract all vectors
X = model[model.wv.vocab]
X.shape

products = train_df[["StockCode", "Description"]]
products.drop_duplicates(inplace=True, subset='StockCode', keep="last")
products_dict = products.groupby('StockCode')['Description'].apply(list).to_dict()

def similar_products(v, n=6):
    ms = model.similar_by_vector(v, topn=n + 1)[1:]
    new_ms = []
    for j in ms:
        pair = (products_dict[j[0]][0], j[1])
        new_ms.append(pair)
    return new_ms

similar_products(model['21883'])
If you get a KeyError saying a word is not in the vocabulary, that's a reliable indicator that the word you're looking up was not in the training data fed to Word2Vec, or did not appear enough times (default min_count=5).
So, your error indicates the word-token '21883' did not appear at least 5 times in the texts (purchases_train) supplied to Word2Vec. You should do either or both of:
Ensure all words you're going to look up appear enough times, either with more training data or a lower min_count. (However, words with only one or a few occurrences tend not to get good vectors and instead just drag the quality of surrounding words' vectors down, so keeping this value above 1, or even raising it above the default of 5 to discard more rare words, is a better path whenever you have sufficient data.)
If your later code will be looking up words that might not be present, either check for their presence first (word in model.wv.vocab) or set up a try: ... except: ... to catch & handle the case where they're not present.
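A minimal sketch of both guards, assuming the trained model and the similar_products() helper from the question (and the older gensim API where the vocabulary is exposed as model.wv.vocab):
token = '21883'

# Option 1: check for presence before looking up
if token in model.wv.vocab:
    print(similar_products(model.wv[token]))
else:
    print("token %s appeared fewer than min_count times, no vector learned" % token)

# Option 2: attempt the lookup and handle the failure
try:
    print(similar_products(model.wv[token]))
except KeyError:
    print("token %s not in vocabulary" % token)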

Copying embeddings for gensim word2vec

I wanted to see if I can simply set new weights for gensim's Word2Vec without training. I got the 20 Newsgroups data set from scikit-learn (from sklearn.datasets import fetch_20newsgroups) and trained an instance of Word2Vec on it:
model_w2v = models.Word2Vec(sg = 1, size=300)
model_w2v.build_vocab(all_tokens)
model_w2v.train(all_tokens, total_examples=model_w2v.corpus_count, epochs = 30)
Here all_tokens is the tokenized data set.
Then I created a new instance of Word2Vec without training
model_w2v_new = models.Word2Vec(sg = 1, size=300)
model_w2v_new.build_vocab(all_tokens)
and set the embeddings of the new Word2Vec equal to the first one
model_w2v_new.wv.vectors = model_w2v.wv.vectors
Most of the functions work as expected, e.g.
model_w2v.wv.similarity( w1='religion', w2 = 'religions')
> 0.4796233
model_w2v_new.wv.similarity( w1='religion', w2 = 'religions')
> 0.4796233
and
model_w2v.wv.words_closer_than(w1='religion', w2 = 'judaism')
> ['religions']
model_w2v_new.wv.words_closer_than(w1='religion', w2 = 'judaism')
> ['religions']
and
entities_list = list(model_w2v.wv.vocab.keys()).remove('religion')
model_w2v.wv.most_similar_to_given(entity1='religion',entities_list = entities_list)
> 'religions'
model_w2v_new.wv.most_similar_to_given(entity1='religion',entities_list = entities_list)
> 'religions'
However, most_similar doesn't work:
model_w2v.wv.most_similar(positive=['religion'], topn=3)
> [('religions', 0.4796232581138611),
> ('judaism', 0.4426296651363373),
> ('theists', 0.43141329288482666)]
model_w2v_new.wv.most_similar(positive=['religion'], topn=3)
>[('roderick', 0.22643062472343445),
> ('nci', 0.21744996309280396),
> ('soviet', 0.20012077689170837)]
What am I missing?
Disclaimer: I posted this question on datascience.stackexchange but got no response, so I'm hoping to have better luck here.
Generally, your approach should work.
It's likely that the specific problem you're encountering was caused by an extra probing step you took that is not shown in your code, because you had no reason to think it significant: some sort of most_similar()-like operation on model_w2v_new after its build_vocab() call but before the later, malfunctioning operations.
Traditionally, most_similar() calculations operate on a version of the vectors that has been normalized to unit-length. The 1st time these unit-normed vectors are needed, they're calculated – and then cached inside the model. So, if you then replace the raw vectors with other values, but don't discard those cached values, you'll see results like you're reporting – essentially random, reflecting the randomly-initialized-but-never-trained starting vector values.
If this is what happened, just discarding the cached values should cause the next most_similar() to refresh them properly, and then you should get the results you expect:
model_w2v_new.wv.vectors_norm = None
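A minimal sketch of the whole fix, using the pre-4.0 gensim attribute names that appear in the question:
# copy the trained raw vectors into the untrained model
model_w2v_new.wv.vectors = model_w2v.wv.vectors

# discard the cached unit-normed vectors so they get recomputed from the new raw vectors
model_w2v_new.wv.vectors_norm = None

# this call now re-normalizes the copied vectors and should match the original model
print(model_w2v_new.wv.most_similar(positive=['religion'], topn=3))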

Adding a feature for unseen words

I am using sklearn's CountVectorizer to build my term-document matrix:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1)
corpus = ['this is jummy speaking now']
X = vectorizer.fit_transform(corpus)
c = vectorizer.transform(['lol 123']).toarray()
What happens is that X is a term-document matrix over 5 words. However, I would like the matrix to have an unknown column, i.e. a term-document matrix over 6 words. Whenever a new, unseen word is found, I would like it to be counted in the unknown column. For example, 'lol' and '123' are not in the corpus, so they should be counted in the unknown column.
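One possible way to get this behaviour, sketched here as an illustration rather than an answer from the original thread, is to keep the vectorizer as-is and append an extra column that counts tokens missing from the learned vocabulary, using the vectorizer's own analyzer (transform_with_unknown is a hypothetical helper):
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=1)
corpus = ['this is jummy speaking now']
X = vectorizer.fit_transform(corpus)

analyzer = vectorizer.build_analyzer()   # same tokenization the vectorizer uses
vocab = vectorizer.vocabulary_

def transform_with_unknown(docs):
    # counts for the known vocabulary, still sparse
    counts = vectorizer.transform(docs)
    # one extra column counting out-of-vocabulary tokens per document
    unknown = [sum(1 for tok in analyzer(doc) if tok not in vocab) for doc in docs]
    unknown_col = csr_matrix(np.array(unknown).reshape(-1, 1))
    return hstack([counts, unknown_col])

c = transform_with_unknown(['lol 123']).toarray()
# c now has len(vocab) + 1 columns; the last one counts 'lol' and '123', i.e. 2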

Julia: How to modify a column of a matrix that has been saved as a binary file?

I am working with large matrices of data (Nrow x Ncol) that are too large to be stored in memory. Instead, it is standard in my field of work to save the data into a binary file. Due to the nature of the work, I only need to access 1 column of the matrix at a time. I also need to be able to modify a column and then save the updated column back into the binary file. So far I have managed to figure out how to save a matrix as a binary file and how to read 1 'column' of the matrix from the binary file into memory. However, after I edit the contents of a column I cannot figure out how to save that column back into the binary file.
As an example, suppose the data file is a 32-bit identity matrix that has been saved to disk.
Nrow = 500
Ncol = 325
data = eye(Float32,Nrow,Ncol)
stream_data = open("data","w")
write(stream_data,data[:])
close(stream_data)
Reading the entire file from disk and then reshaping back into the matrix is straightforward:
stream_data = open("data","r")
data_matrix = read(stream_data,Float32,Nrow*Ncol)
data_matrix = reshape(data_matrix,Nrow,Ncol)
close(stream_data)
As I said before, the data-matrices I am working with are too large to read into memory and as a result the code written above would normally not be possible to execute. Instead, I need to work with 1 column at a time. The following is a solution to read 1 column (e.g. the 7th column) of the matrix into memory:
icol = 7
stream_data = open("data","r")
position_data = 4*Nrow*(icol-1)
seek(stream_data,position_data)
data_col = read(stream_data,Float32,Nrow)
close(stream_data)
Note that the coefficient '4' in the 'position_data' variable is because I am working with Float32. Also, I don't fully understand what the seek command is doing here, but it seems to be giving me the correct output based on the following tests:
data == data_matrix # true
data[:,7] == data_col # true
For the sake of this problem, let's say I have determined that the column I loaded (i.e. the 7th column) needs to be replaced with zeros:
data_col = zeros(Float32,size(data_col))
The problem now is to figure out how to save this column back into the binary file without affecting any of the other data. Naturally I intend to use 'write' to perform this task; however, I am not entirely sure how to proceed. I know I need to start by opening a stream to the data, but I am not sure what 'mode' I need to use: "w", "w+", "a", or "a+"? Here is a failed attempt using "w":
icol = 7
stream_data = open("data","w")
position_data = 4*Nrow*(icol-1)
seek(stream_data,position_data)
write(stream_data,data_col)
close(stream_data)
The original binary file (before my failed attempt to edit the binary file) occupied 650000 bytes on disk. This is consistent with the fact that the matrix is size 500x325 and Float32 numbers occupy 4 bytes (i.e. 4*500*325 = 650000). However, after my attempt to edit the binary file I have observed that the binary file now occupies only 14000 bytes of space. Some quick mental math shows that 14000 bytes corresponds to 7 columns of data (4*500*7 = 14000). A quick check confirms that the binary file has replaced all of the original data with a new matrix with size 500x7, and whose elements are all zeros.
stream_data = open("data","r")
data_new_matrix = read(stream_data,Float32,Nrow*7)
data_new_matrix = reshape(data_new_matrix,Nrow,7)
sum(abs(data_new_matrix)) # 0.0f0
What do I need to do/change in order to modify only the 7th 'column' in the binary file?
Instead of
icol = 7
stream_data = open("data","w")
position_data = 4*Nrow*(icol-1)
seek(stream_data,position_data)
write(stream_data,data_col)
close(stream_data)
in the OP, write
icol = 7
stream_data = open("data","r+")
position_data = 4*Nrow*(icol-1)
seek(stream_data,position_data)
write(stream_data,data_col)
close(stream_data)
i.e. replace "w" with "r+" and everything works.
The reference for open is http://docs.julialang.org/en/release-0.4/stdlib/io-network/#Base.open and it explains the various modes. Preferably, open shouldn't be used with the string mode parameter at all; it is somewhat confusing and definitely slower than the alternatives.
You can use SharedArrays for the need you describe:
data = SharedArray("/some/absolute/path/to/a/file", Float32, (Nrow, Ncol))
# do something with data
data[:,1] = data[:,1] .+ 1
exit()
# restart julia
data = SharedArray("/some/absolute/path/to/a/file", Float32, (Nrow, Ncol))
@show data[1,1]
# prints 1
Now, be mindful that you're supposed to handle synchronisation to read/write from/to this file (if you have async workers) and that you're not supposed to change the size of the array (unless you know what you're doing).

Extrapolating variance components from Weir-Fst on Vcftools

vcftools --vcf ALL.chr1.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf --weir-fst-pop POP1.txt --weir-fst-pop POP2.txt --out fst.POP1.POP2
The above script computes Fst distances on 1000 Genomes population data using Weir and Cockerham's 1984 formula. This formula uses three variance components, namely a, b, c (between populations; between individuals within populations; between gametes within individuals within populations).
The output directly provides the result of the formula but not the components that the program calculated to arrive at the final result. How can I ask Vcftools to output the values for a,b,c?
If you can get the data into the format for hierfstat, you can get the variance components from varcomp.glob. What I normally do is:
use vcftools with --012 to get genotypes
convert 0/1/2/-1 to hierfstat format (e.g., 11/12/22/NA)
load the data into hierfstat and compute (see below)
R example:
library(hierfstat)
data = read.table("hierfstat.txt", header=T, sep="\t")
levels = data.frame(data$popid)
loci = data[,2:ncol(data)]
res = varcomp.glob(levels=levels, loci=loci, diploid=T)
print(res$loc)
print(res$F)
Fst for each locus (row) is therefore, without a hierarchical design and using res$loc: res$loc[1]/sum(res$loc). If you have more complicated sampling, you'll need to interpret the variance components differently.
--update per your comment--
I do this in pandas, but any language would do; it's a text-replacement exercise. Just get your .012 file into a dataframe and convert as below. I read it in row by row into numpy because I have tons of SNPs, but read_csv would work, too.
import pandas as pd
import numpy as np

z12_data = []
for i, line in enumerate(open(z12_file)):
    line = line.strip()
    line = [int(x) for x in line.split("\t")]
    z12_data.append(np.array(line))
    if i % 10 == 0:
        print i
z12_data = np.array(z12_data)
z12_df = pd.DataFrame(z12_data)
z12_df = z12_df.drop(0, axis=1)
z12_df.columns = pd.Series(z12_df.columns) - 1
hierf_trans = {0: 11, 1: 12, 2: 22, -1: 'NA'}

def apply_hierf_trans(series):
    return [hierf_trans[x] if x in hierf_trans else x for x in series]

hierf = z12_df.apply(apply_hierf_trans)
hierf.to_csv("hierfstat.txt", header=True, index=False, sep="\t")
Then you'd read that file hierfstat.txt into R; these are your loci. You'd need to specify your levels from your sampling design (e.g., your population), then call varcomp.glob() to get the variance components. I have a parallel version of this here if you want to use it.
Note that you are specifying 0 as the reference allele in this case. That may be what you want, maybe not. I often calculate minor allele frequency and make 2 the minor allele, but it depends on your study goal.
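As a rough illustration of that last point (a hypothetical sketch, not part of the original answer): given a 0/1/2 genotype dataframe like z12_df above, with -1 for missing, you could flip columns so that 2 always counts the minor allele before recoding for hierfstat:
def flip_major_minor(df, missing=-1):
    # df is a pandas DataFrame of 0/1/2 genotype counts, one column per SNP
    flipped = df.copy()
    for col in flipped.columns:
        observed = flipped[col][flipped[col] != missing]
        alt_freq = observed.sum() / (2.0 * len(observed))  # frequency of the allele coded as 2
        if alt_freq > 0.5:                                  # 2 is actually the major allele here
            flipped.loc[flipped[col] != missing, col] = 2 - observed
    return flipped

# z12_flipped = flip_major_minor(z12_df)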
