Gensim: How to load a corpus from a saved LDA model?

When I saved my LdaModel lda_model.save('model'), it saved 4 files:
model
model.expElogbeta.npy
model.id2word
model.state
I want to use pyLDAvis.gensim to visualize the topics, which seems to need the model, corpus and dictionary. I was able to load the model and dictionary with:
lda_model = LdaModel.load('model')
dict = corpora.Dictionary.load('model.id2word')
Is it possible to load the corpus? How?

Sharing this here because it took me a while to find the answer as well. Note that dict shadows the Python built-in of the same name, so we use lda_dict instead.
# text array is a list of lists containing text you are analysing
# eg. text_array = [['volume', 'eventually', 'metric', 'rally'], ...]
# lda_dict is a gensim.corpora.Dictionary object
bow_corpus = [lda_dict.doc2bow(doc) for doc in text_array]
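If you would rather not rebuild the bag-of-words corpus from the raw texts every time, gensim can also persist it to disk. A minimal sketch, assuming 'model.mm' as an arbitrary output path:
from gensim import corpora
# Serialize the bag-of-words corpus once, in Matrix Market format
corpora.MmCorpus.serialize('model.mm', bow_corpus)
# Later, load it back without needing the original texts
bow_corpus = corpora.MmCorpus('model.mm')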

Jireh answered correctly, but it may be confusing how to load all the previously saved LDA files. I'm not sure why gensim saves the *.state and *.npy files (I'd appreciate insights in the comments). To reuse a previous LDA model, load the *.model and *.id2word files along with your original corpus.
For instance, if your documents live in a dataframe column 'docs', load that dataframe again, as you will need it to recreate your corpus.
import pandas as pd
from gensim import corpora, models
import pyLDAvis
from pyLDAvis import gensim_models
df = pd.read_csv('your_file.csv')
# Each entry in 'docs' must already be a list of tokens, not a raw string,
# since doc2bow expects an iterable of string tokens per document
texts = df['docs'].values
You load your previously created dictionary as follows:
dictionary = corpora.Dictionary.load('your_file.id2word')
... and then create the corpus from the dictionary and your original texts (created from the dataframe['docs'] above):
corpus = [dictionary.doc2bow(text) for text in texts]
The previously created LDA model is loaded via gensim:
lda_model = models.ldamodel.LdaModel.load('your_file.model')
These objects are then fed into your pyLDAvis instance:
lda_viz = gensim_models.prepare(lda_model, corpus, dictionary)
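From there you can render the visualization in a notebook or save it to a standalone HTML file (both are standard pyLDAvis calls; the filename is just an example):
# In a Jupyter notebook:
pyLDAvis.display(lda_viz)
# Or write a self-contained HTML file:
pyLDAvis.save_html(lda_viz, 'lda_visualization.html')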
If you don't use the .id2word file, you can run into shape-mismatch issues (IndexError). I've had this happen when running LDA multicore, so I load the saved .id2word rather than recreating the dictionary from the corpus.

In the gensim source code, the comments say the expElogbeta and state files can be ignored when loading. It is possible to reconstruct the corpus, since a corpus is just a list of lists of (token_id, count) pairs, but recovering it from the model alone is complex. I suggest rebuilding the corpus from the original text data using the saved id2word dictionary.

Related

Doc2Vec input format

Running gensim Doc2Vec on Ubuntu, Doc2Vec rejects my input with the error:
AttributeError: 'list' object has no attribute 'words'
import gensim
from gensim.models import doc2vec as dtv
from nltk.corpus import brown
documents = brown.tagged_sents()
d2vmodel = dtv.Doc2Vec(documents, size=100, window=1, min_count=1, workers=1)
I have already tried the suggestion from this SO question and many variations, with the same result:
documents = [brown.tagged_sents()]
I also tried adding a hash function.
If the corpus is a .txt file I can use
documents = TaggedLineDocument(documents)
but that is often not possible.
Gensim's Doc2Vec requires each document to be in the form of an object with a words property that is a list of string tokens, and a tags property that is a list of tags. These tags are usually strings, but expert users with large datasets can save a little memory by using plain ints, starting from 0, instead.
A class TaggedDocument is included that is of the right 'shape', and used in most of the Gensim documentation/tutorial examples – but given Python's 'duck typing', any object with words and tags properties will do.
But a plain list won't.
And if I understand correctly, brown.tagged_sents() returns lists of (word, part-of-speech-tag) tuples, which isn't even the kind of list-of-word-tokens that would work as words, and it doesn't supply the full-document tags that Doc2Vec needs as keys to the doc-vectors being trained.
Separately: it is unlikely you'd want to use min_count=1. Discarding very-low-frequency words usually makes retained Word2Vec/Doc2Vec vectors better.
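A minimal sketch combining both points, using plain (untagged) Brown sentences as documents and each sentence's position as its tag (the corpus choice and hyperparameters are illustrative only; vector_size is the Gensim 4.x name for the older size parameter):
from nltk.corpus import brown
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
# brown.sents() yields plain lists of word tokens (no POS tuples), which is
# the right shape for words; str(i) gives each document a unique tag
documents = [TaggedDocument(words=sent, tags=[str(i)])
             for i, sent in enumerate(brown.sents())]
# min_count=5 discards very-rare words, which usually improves the vectors
model = Doc2Vec(documents, vector_size=100, window=2, min_count=5, workers=4)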

Is there a way to set min_df and max_df in gensim's tfidf model?

I am using gensim's tfidf model like so:
from gensim import corpora, models
dictionary = corpora.Dictionary(some_corpus)
mapped_corpus = [dictionary.doc2bow(text) for text in some_corpus]
tfidf = models.TfidfModel(mapped_corpus)
Now I'd like to apply thresholds to remove terms that appear too frequently (max_df) and too infrequently (min_df). I know that scikit's CountVectorizer allows you to do this, but I can't seem to find how to set these thresholds in gensim's tfidf. Could someone please help?
You can filter your dictionary with
dictionary.filter_extremes(no_below=min_df, no_above=rel_max_df)
Note that no_below expects the minimum number of documents in which tokens must appear, whereas no_above expects a maximum relative frequency, e.g. 0.5. Afterwards you can construct your corpus with the filtered dictionary. According to the gensim docs, it is also possible to construct a TfidfModel from only a dictionary.
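Putting it together with the snippet from the question (the thresholds 5 and 0.5 are example values only):
from gensim import corpora, models
dictionary = corpora.Dictionary(some_corpus)
# Keep tokens appearing in at least 5 documents (absolute count)
# and in no more than 50% of all documents (relative frequency)
dictionary.filter_extremes(no_below=5, no_above=0.5)
# Rebuild the bag-of-words corpus with the filtered dictionary
mapped_corpus = [dictionary.doc2bow(text) for text in some_corpus]
tfidf = models.TfidfModel(mapped_corpus)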

Using Tensorflow to train a image classifier using my own data using inception and TFrecords

I followed the TensorFlow tutorial on GitHub on how to train on your own data: https://github.com/tensorflow/models/tree/master/inception#how-to-construct-a-new-dataset-for-retraining.
I split my data (training and validation), created the labels as suggested, and managed to create the TFRecords using bazel-bin. Everything works, and now I have my own data as TFRecords.
Now I want to train my image classifier from scratch using the Inception-v3 model, and it seems I should use the script inception_train.py, but I am not sure. Is that right? https://github.com/tensorflow/models/blob/master/inception/inception/inception_train.py.
If so, I have two questions:
1) How can I train it using my TFRecords? An example would be great.
2) Can I run on a CPU, or is it only possible on GPUs?
Thank you very much.
Try the following sample code to read images and labels from your TFRecords:
import os
import glob
import tensorflow as tf
from matplotlib import pyplot as plt

def read_and_decode_file(filename_queue):
    # Create an instance of the TFRecord reader
    reader = tf.TFRecordReader()
    # Read the next serialized example from the filename queue
    _, serialized_reader = reader.read(filename_queue)
    # Extract the features you require from the tfrecord using their keys.
    # In my example, all images were written with the 'image' key.
    # Note: parse_single_example only supports tf.string, tf.int64 and
    # tf.float32, so integer labels must be stored as int64.
    features = tf.parse_single_example(
        serialized_reader, features={
            'image': tf.FixedLenFeature([], tf.string),
            'labels': tf.FixedLenFeature([], tf.int64)
        })
    # Decode the image bytes; this assumes the images were stored as
    # encoded JPEG (if you wrote raw bytes, use tf.decode_raw and tf.reshape)
    img = tf.image.decode_jpeg(features['image'], channels=3)
    img_out = tf.image.resize_image_with_crop_or_pad(img, target_height=128, target_width=128)
    # Similarly extract the labels; be careful with the type
    label = features['labels']
    return img_out, label

if __name__ == "__main__":
    tf.reset_default_graph()
    # Path to your tfrecords
    path_to_tf_records = os.getcwd() + '/*.tfrecords'
    # Collect all tfrecords present in the records folder using glob
    list_of_tfrecords = sorted(glob.glob(path_to_tf_records))
    # Generate a tensorflow-readable filename queue by supplying it with
    # a list of tfrecords; it is recommended to shuffle your data
    # before feeding it into the network
    filename_queue = tf.train.string_input_producer(list_of_tfrecords, shuffle=False)
    # Supply the tensorflow-generated filename queue to the custom function above
    image, label = read_and_decode_file(filename_queue)
    # Create a new tf session to read the data
    sess = tf.Session()
    tf.train.start_queue_runners(sess=sess)
    # Arbitrary number of iterations
    for i in range(50):
        img = sess.run(image)
        # Show the image
        plt.imshow(img)
        plt.show()
TensorFlow also provides tf.train.shuffle_batch, which spawns multiple CPU threads that run this function and return batches of images and labels at a user-specified batch size. You would set up the data and training pipelines so that they run concurrently.
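A minimal sketch of batching on top of the image and label tensors returned above (batch size, capacity, and thread count are illustrative values):
# Four background threads fill a shuffling queue while training
# dequeues batches of 32 images and labels
image_batch, label_batch = tf.train.shuffle_batch(
    [image, label],
    batch_size=32,
    capacity=2000,
    min_after_dequeue=1000,
    num_threads=4)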
To answer your second question: yes, you can train your model on the CPU alone, but it will be slow and might take several hours or even days to achieve decent results. Remove the with tf.device('/gpu:{0}'): scope (it is a context manager, not a decorator) before the creation of your inception model, and tensorflow will create the model on your CPU.
Hope this explanation helps.

How to get vocabulary word count from gensim word2vec?

I am using the gensim word2vec package in Python. I know how to get the vocabulary from the trained model. But how do I get the word count for each word in the vocabulary?
Each word in the vocabulary has an associated vocabulary object, which contains an index and a count.
vocab_obj = w2v.vocab["word"]
vocab_obj.count
Output for the Google News w2v model: 2998437
So to get the count for each word, you would iterate over all words and vocab objects in the vocabulary.
for word, vocab_obj in w2v.vocab.items():
    # Do something with vocab_obj.count
The vocab attribute was removed from KeyedVectors in Gensim 4.0.0.
Instead:
word2vec_model.wv.get_vecattr("my-word", "count") # returns count of "my-word"
len(word2vec_model.wv) # returns size of the vocabulary
Check out the notes on migrating from Gensim 3.x to 4.
When you want to create a dictionary mapping each word to its count for easy retrieval later, you can do so as follows (Gensim 3.x API):
w2c = dict()
for item in model.wv.vocab:
    w2c[item] = model.wv.vocab[item].count
If you want to sort it to see the most frequent words in the model, you can do so like this:
w2cSorted = dict(sorted(w2c.items(), key=lambda x: x[1], reverse=True))
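In Gensim 4.x, a sketch of the same word-to-count mapping using the current KeyedVectors API:
# index_to_key lists all vocabulary words; get_vecattr returns the stored count
w2c = {word: model.wv.get_vecattr(word, "count")
       for word in model.wv.index_to_key}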

Caffe Multiple Input Images

I'm looking at implementing a Caffe CNN which accepts two input images and a label (later perhaps other data) and was wondering if anyone was aware of the correct syntax in the prototxt file for doing this? Is it simply an IMAGE_DATA layer with additional tops? Or should I use separate IMAGE_DATA layers for each?
Thanks,
James
Edit: I have been using the HDF5_DATA layer lately for this and it is definitely the way to go.
HDF5 is a key-value store, where each key is a string and each value is a multi-dimensional array. Thus, to use the HDF5_DATA layer, just add a new key for each top you want to use, and set the value for that key to store the image you want to use. Writing these HDF5 files from Python is easy:
import h5py
import numpy as np

filelist = []
for i in range(100):
    image1 = get_some_image(i)
    image2 = get_another_image(i)
    filename = '/tmp/my_hdf5%d.h5' % i
    with h5py.File(filename, 'w') as f:
        # Caffe expects channels-first (C, H, W) layout
        f['data1'] = np.transpose(image1, (2, 0, 1))
        f['data2'] = np.transpose(image2, (2, 0, 1))
    filelist.append(filename)

with open('/tmp/filelist.txt', 'w') as f:
    for filename in filelist:
        f.write(filename + '\n')
Then simply set the source of the HDF5_DATA param to be '/tmp/filelist.txt', and set the tops to be "data1" and "data2".
I'm leaving the original response below:
====================================================
There are two good ways of doing this. The easiest is probably to use two separate IMAGE_DATA layers, one with the first image and label, and a second with the second image. Caffe retrieves images from LMDB or LEVELDB, which are key value stores, and assuming you create your two databases with corresponding images having the same integer id key, Caffe will in fact load the images correctly, and you can proceed to construct your net with the data/labels of both layers.
The problem with this approach is that having two data layers is not really very satisfying, and it doesn't scale very well if you want to do more advanced things like having non-integer labels for things like bounding boxes, etc. If you're prepared to make a time investment, you can do a better job by modifying the tools/convert_imageset.cpp file to stack images or other data across channels. For example, you could create a datum with 6 channels: the first 3 for your first image's RGB, and the second 3 for your second image's RGB. After reading this in using the IMAGE_DATA layer, you can split the stream into two images using a SLICE layer with a slice_point at index 3 along the slice_dim = 1 dimension, as sketched below. If, further down the road, you decide you want to load even more complex assortments of data, you'll understand the encoding scheme and can write your own decoding layer based on src/caffe/layers/data_layer.cpp to gain full control of the pipeline.
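A small numpy sketch of the channel-stacking layout described above (image1 and image2 are assumed HxWx3 arrays; this illustrates only the array layout, not the convert_imageset.cpp change itself):
import numpy as np
# Stack two RGB images into one 6-channel, channels-first array
stacked = np.concatenate(
    [np.transpose(image1, (2, 0, 1)),   # channels 0-2: first image
     np.transpose(image2, (2, 0, 1))],  # channels 3-5: second image
    axis=0)  # shape (6, H, W); a SLICE layer with slice_point=3 splits it back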
You may also consider using the HDF5_DATA layer with multiple "top"s.
