If I train a custom tokenizer on my dataset, I would still be able to leverage pre-trained model weights - huggingface-transformers

The title is a declaration, but I'm not sure it is correct. Let me elaborate.
I have a considerably large dataset (23 GB). I'd like to continue pre-training Roberta-base or XLM-Roberta-base on it, so the language model fits my data better for further downstream tasks.
I know I can just run it against my dataset for a few epochs and get good results. But what if I also train the tokenizer, generating a new vocab and merges file? Will the weights from the pre-trained model I started from still be usable, or will the new set of tokens demand complete training from scratch?
I'm asking this because maybe some layers can still contribute knowledge, so the final model would get the best of both worlds: a tokenizer that fits my dataset, and the weights from previous training.
Does that make sense?

In short, no.
You cannot use your own newly trained tokenizer with a pretrained model. The reason is that your tokenizer's vocabulary and the vocabulary of the tokenizer that was used to pretrain the model you want to start from are different. Thus a word-piece token present in your tokenizer's vocabulary may not be present in the pretrained model's vocabulary at all, and the token ids produced by your tokenizer no longer point at the embedding rows the pretrained weights were trained with.
Detailed answers can be found here.
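For illustration, here is a minimal sketch of that mismatch (the ./my-custom-tokenizer path is a hypothetical directory holding your newly trained tokenizer; roberta-base and the transformers library are assumed):

from transformers import AutoTokenizer, AutoModelForMaskedLM

# Tokenizer that roberta-base was actually pretrained with
original_tok = AutoTokenizer.from_pretrained("roberta-base")
# Hypothetical custom tokenizer trained on your own corpus
custom_tok = AutoTokenizer.from_pretrained("./my-custom-tokenizer")

text = "A domain-specific sentence from my corpus."
print(original_tok(text)["input_ids"])  # ids index rows of the pretrained embedding matrix
print(custom_tok(text)["input_ids"])    # same text, but ids refer to a different vocabulary

model = AutoModelForMaskedLM.from_pretrained("roberta-base")
# The embedding matrix is sized and ordered for the original vocabulary
# (roughly 50k x 768 for roberta-base), so ids from the custom tokenizer
# would map tokens onto unrelated rows of it.
print(model.get_input_embeddings().weight.shape)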

Related

Is there a way to load pre-trained word vectors before training the doc2vec model?

I am trying to build a doc2vec model with more or less 10k sentences; after that I will use the model to find, for some new sentences, the most similar sentence in the model.
I have trained a gensim doc2vec model using the corpus (10k sentences) I have. This model can, to some extent, tell me if a new sentence is similar to some of the sentences in the corpus.
But there is a problem: new sentences may contain words which don't exist in the corpus, which means that they don't have a word embedding. If this happens, the prediction result will not be good.
As far as I know, the trained doc2vec model does have a matrix of doc vectors as well as a matrix of word vectors. So what I was thinking is to load a set of pre-trained word vectors, which contains a large number of words, and then train the model to get the doc vectors. Does it make sense? Is it possible with gensim? Or is there another way to do it?
Unlike what you might guess, typical Doc2Vec training does not train up word-vectors first, then compose doc-vectors using those word-vectors. Rather, in the modes that use word-vectors, the word-vectors are trained in a simultaneous, interleaved fashion alongside the doc-vectors, both changing together. And in one fast and well-performing mode, PV-DBOW (dm=0 in gensim), word-vectors aren't trained or used at all.
So, gensim Doc2Vec doesn't support pre-loading state from elsewhere, and even if it did, it probably wouldn't provide the benefit you expect. (You could dig through the source code & perhaps force it by doing a bunch of initialization steps yourself. But then, if words were in the pre-loaded set but not in your training data, training the rest of the active words would adjust the entire model in directions incompatible with the imported-but-untrained 'foreign' words. It's only the interleaved, tug-of-war co-training of the model's state which makes the various vectors meaningful in relation to each other.)
The most straightforward and reliable strategy would be to try to expand your training corpus, by finding more documents from a similar/compatible domain, to include multiple varied examples of any words you might encounter later. (If you thought some other word-vectors were apt enough for your domain, perhaps the texts that were used to train those word-vectors can be mixed into your training corpus. That's a reasonable way to put the word/document data from that other source on equal footing in your model.)
And, as new documents arrive, you can also occasionally re-train the model from scratch, with the now-expanded corpus, letting newer documents contribute equally to the model's vocabulary and modeling strength.
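As a rough sketch of that expand-and-retrain strategy (gensim 4.x naming is assumed; the toy token lists below stand in for your real 10k sentences plus whatever extra in-domain documents you gather):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy stand-ins for the original corpus and the extra in-domain documents
original_sentences = [["user", "interface", "response", "time"],
                      ["system", "user", "survey"]]
extra_domain_docs = [["interface", "survey", "feedback"],
                     ["system", "response", "lag"]]

corpus = [TaggedDocument(words=tokens, tags=[i])
          for i, tokens in enumerate(original_sentences + extra_domain_docs)]

model = Doc2Vec(vector_size=50, min_count=1, epochs=40, dm=0)  # PV-DBOW
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# For a new sentence, infer a vector and look up the nearest training documents
new_vec = model.infer_vector(["user", "survey", "response"])
print(model.dv.most_similar([new_vec], topn=3))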

Is there pre-trained doc2vec model?

Is there a pre-trained doc2vec model with a large data set, like Wikipedia or similar?
I don't know of any good one. There's one linked from this project, but:
it's based on a custom fork from an older gensim, so won't load in recent code
it's not clear what parameters or data it was trained with, and the associated paper may have made uninformed choices about the effects of parameters
it doesn't appear to be the right size to include actual doc-vectors for either Wikipedia articles (4-million-plus) or article paragraphs (tens-of-millions), or a significant number of word-vectors, so it's unclear what's been discarded
While it takes a long time and a significant amount of working RAM, there is a Jupyter notebook demonstrating the creation of a Doc2Vec model from Wikipedia included in gensim:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb
So, I would recommend fixing the mistakes in your attempt. (And, if you succeed in creating a model, and want to document it for others, you could upload it somewhere for others to re-use.)
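For orientation, the notebook's approach has roughly the following shape (a hedged sketch, not the notebook verbatim: gensim 4.x naming, a locally downloaded enwiki-latest-pages-articles.xml.bz2 dump, and illustrative parameters are assumed):

from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

class TaggedWikiCorpus:
    def __init__(self, wiki):
        self.wiki = wiki
        self.wiki.metadata = True  # get_texts() then yields (tokens, (page_id, title))
    def __iter__(self):
        for tokens, (page_id, title) in self.wiki.get_texts():
            yield TaggedDocument(words=tokens, tags=[title])

# Scanning the full dump is slow and memory-hungry
documents = TaggedWikiCorpus(WikiCorpus("enwiki-latest-pages-articles.xml.bz2"))

model = Doc2Vec(dm=0, dbow_words=1, vector_size=200, min_count=20, epochs=10, workers=8)
model.build_vocab(documents)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

# Doc-vectors are keyed by article title
print(model.dv.most_similar("Machine learning", topn=10))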
Yes!
I could find two pre-trained doc2vec models at this link, but still could not find any pre-trained doc2vec model trained on tweets.

gensim doc2vec train more documents from pre-trained model

I am trying to further train a pre-trained model with new labelled documents (TaggedDocument).
The pre-trained model was trained on documents whose unique ids use label1_index, for instance Good_0, Good_1 ... to Good_999,
and the total size of that training data is about 7000.
Now I want to train the pre-trained model with new documents whose unique ids use label2_index, for instance Bad_0, Bad_1 ... to Bad_1211,
and the total size of this new training data is about 1211.
The training itself was successful without any error, but the problem is that whenever I try to use 'most_similar' it only suggests similar documents labelled with Good_..., where I expected documents labelled with Bad_.
If I train everything together from the beginning, it gives me the answers I expected - it infers a newly given document as similar to documents labelled either Good or Bad.
However, the incremental practice above does not work like the model trained altogether from the beginning.
Is continued training not working properly, or did I make some mistake?
The gensim Doc2Vec class can always be fed extra examples via train(), but it only discovers the working vocabulary of both word-tokens and document-tags during an initial build_vocab() step. So unless words/tags were available during the build_vocab(), they'll be ignored as unknown later. (The words get silently dropped from the text; the tags aren't trained or remembered inside the model.)
The Word2Vec superclass from which Doc2Vec borrows a lot of functionality has a newer, more-experimental parameter on its build_vocab() called update. If set to True, that call to build_vocab() will add to, rather than replace, any prior vocabulary. However, as of February 2018, this option doesn't yet work with Doc2Vec, and indeed often causes memory-fault crashes.
But even if/when that can be made to work, providing incremental training examples isn't necessarily a good idea. By only updating parts of the model – those exercised by the new examples – the overall model can get worse, or its vectors made less self-consistent with each other. (The essence of these dense-embedding models is that the optimization over all varied examples results in generally-useful vectors. Training over just some subset causes the model to drift towards being good on just that subset, at likely cost to earlier examples.)
If you need new examples to also become part of the results for most_similar(), you might want to create your own separate set-of-vectors outside of Doc2Vec. When you infer new vectors for new texts, you could add those to that outside set, and then implement your own most_similar() (using the gensim code as a model) to search over this expanding set of vectors, rather than just the fixed set that is created by initial bulk Doc2Vec training.
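For instance, a minimal sketch of that outside-the-model bookkeeping (it assumes model is an already-trained gensim Doc2Vec, and uses plain numpy cosine similarity in place of gensim's own most_similar() code):

import numpy as np

extra_vectors = {}  # tag -> inferred vector, kept outside the Doc2Vec model

def add_document(model, tag, tokens):
    # Infer a vector for a new text and remember it under its tag
    extra_vectors[tag] = model.infer_vector(tokens)

def most_similar_extra(query_vec, topn=10):
    # Cosine similarity of the query vector against every vector in the outside set
    sims = []
    for tag, vec in extra_vectors.items():
        cos = np.dot(query_vec, vec) / (np.linalg.norm(query_vec) * np.linalg.norm(vec))
        sims.append((tag, float(cos)))
    return sorted(sims, key=lambda pair: pair[1], reverse=True)[:topn]

# Usage:
# add_document(model, "Bad_1212", ["tokens", "of", "the", "new", "document"])
# print(most_similar_extra(model.infer_vector(["some", "query", "tokens"])))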

gensim(1.0.1) Doc2Vec with google pretrained vectors

For gensim (1.0.1) doc2vec, I am trying to load Google's pre-trained word vectors instead of using Doc2Vec.build_vocab:
wordVec_google = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
model0 = Doc2Vec(size=300, alpha=0.05, min_alpha=0.05, window=8, min_count=5, workers=4, dm=0, hs=1)
model0.wv = wordVec_google
##some other code
model0.build_vocab(sentences=allEmails, max_vocab_size = 20000)
But this object model0 cannot be further trained with "labeled Docs", and can't infer vectors for documents.
Anyone knows how to use doc2vec with google pretrained word vectors?
I tried this post: http://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/
but it does not work to load them into a gensim.models.Word2Vec object; perhaps it is a different gensim version.
The GoogleNews vectors are just raw vectors - not a full Word2Vec model.
Also, the gensim Doc2Vec class does not have general support for loading pretrained word-vectors. The Doc2Vec algorithm doesn't need pre-trained word-vectors – only some modes even use such vectors, and when they do, they're trained simultaneously as needed alongside the doc-vectors.
Specifically, the mode your code is using, dm=0, is the 'Paragraph Vectors' PV-DBOW mode, and does not use word-vectors at all. So even if there were a function to load them, they'd be loaded – then ignored during training and inference. (You would need to use PV-DM (dm=1), or add skip-gram word-training to PV-DBOW (dm=0, dbow_words=1), in order for such reused vectors to have any relevance to your training.)
Why do you think you want/need to use pre-trained vectors? (Especially, a set of 3 million word-vectors, from another kind of data, when a later step suggests you only care about a vocabulary of 20,000 words?)
If for some reason you feel sure you want to initialize Doc2Vec with word-vectors from elsewhere, and use a training mode where that would have some effect, you can look into the intersect_word2vec_format() method that gensim Doc2Vec inherits from Word2Vec.
That method specifically needs to be called after build_vocab() has already learned the corpus-specific vocabulary, and it only brings in the words from the outside source that are locally relevant. It's at best an advanced, experimental feature – see its source code, doc-comments, and discussion on the gensim list to understand its side-effects and limitations.
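If you do go that route, the call order matters. Here is a hedged sketch, assuming a gensim from the era of the question (where Doc2Vec still exposes intersect_word2vec_format(), inherited from Word2Vec as described above), a word-vector-using mode, and a locally downloaded GoogleNews-vectors-negative300.bin:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [TaggedDocument(words=line.split(), tags=[i])
          for i, line in enumerate(["some training text", "more training text"])]

# PV-DM (dm=1) actually consults word-vectors; plain PV-DBOW (dm=0) does not
model = Doc2Vec(size=300, min_count=1, dm=1)
model.build_vocab(corpus)  # must come first: learns the corpus-specific vocabulary

# Only words present in BOTH the corpus vocabulary and the GoogleNews file are overwritten;
# lockf=1.0 lets the imported vectors keep training, 0.0 would freeze them
model.intersect_word2vec_format("GoogleNews-vectors-negative300.bin",
                                binary=True, lockf=1.0)

# ...then train as usual, e.g.:
# model.train(corpus, total_examples=model.corpus_count, epochs=model.iter)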

Which classification algorithm can be used for document categorization?

Hey, here is my problem.
Given a set of documents I need to assign each document to a predefined category.
I was going to use the n-gram approach to represent the text-content of each document and then train an SVM classifier on the training data that I have.
Please correct me if I misunderstood something.
The problem now is that the categories should be dynamic. Meaning, my classifier should handle new training data with a new category.
So for example, if I trained a classifier to classify a given document as category A, category B or category C, and then I was given new training data with category D, I should be able to incrementally train my classifier by providing it with the new training data for "category D".
To summarize, I do NOT want to combine the old training data (with 3 categories) and the new training data (with the new/unseen category) and train my classifier again. I want to train my classifier on the fly.
Is this possible to implement with SVM? If not, could you recommend several classification algorithms, or any book/paper that can help me?
Thanks in advance.
Naive Bayes is a relatively fast incremental classification algorithm.
KNN is also incremental by nature, and even simpler to implement and understand.
Both algorithms are implemented in the open source project Weka, as NaiveBayes and as IBk for KNN.
However, from personal experience, they are both vulnerable to a large number of non-informative features (which is usually the case with text classification), so some kind of feature selection is usually used to squeeze better performance out of these algorithms, which could be problematic to implement incrementally.
This blog post by Edwin Chen describes infinite mixture models to do clustering. I think this method supports automatically determining the number of clusters, but I am still trying to wrap my head all the way around it.
The class of algorithms that matches your criteria is called "incremental algorithms". There are incremental versions of almost any method. The easiest to implement is naive Bayes.
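To make that concrete, here is a toy from-scratch sketch of a multinomial naive Bayes that can absorb a brand-new category at any time just by updating counts (an illustration of the incremental idea, not a substitute for Weka's NaiveBayes or a tuned text classifier):

from collections import defaultdict
import math

class IncrementalNaiveBayes:
    def __init__(self):
        self.class_counts = defaultdict(int)                       # documents seen per class
        self.word_counts = defaultdict(lambda: defaultdict(int))   # class -> word -> count
        self.total_words = defaultdict(int)                        # class -> total tokens
        self.vocab = set()

    def update(self, tokens, label):
        # Incremental step: a new class or a new word just creates a new counter
        self.class_counts[label] += 1
        for w in tokens:
            self.word_counts[label][w] += 1
            self.total_words[label] += 1
            self.vocab.add(w)

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best, best_score = None, float("-inf")
        for label in self.class_counts:
            # log prior + log likelihoods with add-one (Laplace) smoothing
            score = math.log(self.class_counts[label] / total_docs)
            denom = self.total_words[label] + len(self.vocab)
            for w in tokens:
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best

# Train on some categories first, then feed a new category later without retraining from scratch
nb = IncrementalNaiveBayes()
nb.update(["refund", "invoice", "payment"], "A")
nb.update(["server", "outage", "incident"], "B")
nb.update(["contract", "clause", "renewal"], "D")  # category D arrives later
print(nb.predict(["invoice", "payment", "overdue"]))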
