How to load a word2vec txt file with vocabulary constraint - gensim

I have a word2vec file in the standard format, but it is huge, with about 2M entries. I also have a vocabulary file where each row is a word; that file has about 800K rows. Now I want to load the embeddings from the word2vec file, but only the embeddings for words in the vocabulary file. Is there an efficient implementation in gensim?

There's no built-in support for filtering the words on load. But you could use the code for the load_word2vec_format() function as a model for your own alternate loading code that skips words not-of-interest.
You can view the code for that function in the KeyedVectors class...
https://github.com/RaRe-Technologies/gensim/blob/ff107d6c5cb50d9ab99999cb898ff0aceb192592/gensim/models/keyedvectors.py#L1434
...and some shared support functions...
https://github.com/RaRe-Technologies/gensim/blob/ff107d6c5cb50d9ab99999cb898ff0aceb192592/gensim/models/utils_any2vec.py#L294
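For instance, a minimal sketch of such alternate loading, assuming Gensim 4.x, a plain-text (non-binary) word2vec file, and made-up file names:

import numpy as np
from gensim.models import KeyedVectors

def load_word2vec_subset(w2v_path, vocab_path, encoding='utf-8'):
    # Read the words to keep (one per line).
    with open(vocab_path, encoding=encoding) as f:
        wanted = {line.strip() for line in f if line.strip()}
    words, vectors = [], []
    with open(w2v_path, encoding=encoding) as f:
        # Header line of the word2vec text format: "<count> <dimensions>"
        _, dims = map(int, f.readline().split())
        for line in f:
            parts = line.rstrip('\n').split(' ')
            if parts[0] in wanted:
                words.append(parts[0])
                vectors.append(np.array(parts[1:1 + dims], dtype=np.float32))
    kv = KeyedVectors(dims)
    kv.add_vectors(words, vectors)  # named .add() in Gensim 3.x
    return kv

subset_kv = load_word2vec_subset('embeddings_2m.txt', 'vocab_800k.txt')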

Related

Can't load the pre-trained word2vec of korean language

I would like to download and load the pre-trained word2vec for analyzing Korean text.
I downloaded the pre-trained word2vec here: https://drive.google.com/file/d/0B0ZXk88koS2KbDhXdWg1Q2RydlU/view?resourcekey=0-Dq9yyzwZxAqT3J02qvnFwg
from the Github Pre-trained word vectors of 30+ languages: https://github.com/Kyubyong/wordvectors
My gensim version is 4.1.0, so I used:
KeyedVectors.load_word2vec_format('./ko.bin', binary=False)
to load the model. But I got this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
I already tried many options, including suggestions from Stack Overflow and GitHub, but it still does not work.
Would you mind letting me know a suitable solution?
Thanks,
While the page at https://github.com/Kyubyong/wordvectors isn't clear about the formats this author has chosen, their source code at...
https://github.com/Kyubyong/wordvectors/blob/master/make_wordvectors.py#L61
...shows it using the Gensim model .save() method.
Such saved models should be reloaded using the .load() class method of the same model class. For example, if a Word2Vec model was saved with...
model.save('language.bin')
...then it could be reloaded with...
loaded_model = Word2Vec.load('language.bin')
Note, though, that:
Models saved this way are often split over multiple files that should be kept together (and all start with the same root name) - but I don't see those here.
This work appears to be ~5 years old, based on a pre-1.0 version of Gensim – so there might be issues loading the models directly into the latest Gensim. If you do run into such issues, & absolutely need to make these vectors work, you might need to temporarily use a prior version of Gensim to .load() the model. Then, you could save the plain vectors out with .save_word2vec_format() for later reloading across any version. (Or, using the latest interim version that can load the model, re-save the model with .save(), then repeat the process with the latest version that can read that model, until you reach the current Gensim.)
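For example, a rough sketch of that re-save step, assuming an older Gensim that can still .load() the file (the exact version needed may vary, and the output filename is just a placeholder):

# Run under an older Gensim, e.g. pip install "gensim==3.8.3"
from gensim.models import Word2Vec

old_model = Word2Vec.load('ko.bin')  # the full model saved via .save()
old_model.wv.save_word2vec_format('ko_plain.txt', binary=False)

# Later, under any current Gensim:
# from gensim.models import KeyedVectors
# kv = KeyedVectors.load_word2vec_format('ko_plain.txt', binary=False)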
But, you also might want to find a more recent & better-documented set of pretrained word-vectors.
For example, Facebook makes FastText pretrained vectors available in both a 'text' format and a 'bin' format for many languages at https://fasttext.cc/docs/en/pretrained-vectors.html (trained on Wikipedia only) or https://fasttext.cc/docs/en/crawl-vectors.html (trained on Wikipedia plus web crawl data).
The 'text' format should in fact be loadable with KeyedVectors.load_word2vec_format(filename, binary=False), but will only include full-word vectors. (It will also be relatively easy to view as text, or to write simple code to massage into other formats.)
The 'bin' format is Facebook's own native FastText model format, and should be loadable with either the load_facebook_model() or load_facebook_vectors() utility methods. Then, the loaded model (or vectors) will be able to create the FastText algorithm's substring-based guesstimate vectors even for many words that weren't in the model or training data.
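For example, with the Korean vectors from the crawl-vectors page (a sketch; filenames as listed there):

from gensim.models import KeyedVectors
from gensim.models.fasttext import load_facebook_vectors

# 'text' format: full-word vectors only
kv = KeyedVectors.load_word2vec_format('cc.ko.300.vec', binary=False)

# 'bin' format: native FastText model, can also synthesize vectors for
# out-of-vocabulary words from character n-grams
ft = load_facebook_vectors('cc.ko.300.bin')
vec = ft['가방']  # works even for words absent from the training vocabulary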

Can you provide additional tags for documents using TaggedLineDocument?

When training a doc2vec model using a corpus in the TaggedDocument class, you can provide a list of tags. When the doc2vec model is trained it learns a vector representation for the tags. For example you could have one tag representing the document, and another representing some classification that can be shared between documents.
How would one provide additional tags when streaming a corpus using TaggedLineDocument?
The TaggedLineDocument class only considers documents to be one per line, with a single tag that is their line-number.
If you want more tags, you'll have to provide your own iterable that does that. It should only be a few lines of code, depending on where your other tags come from. You can use the source for TaggedLineDocument – which is itself only 9 lines of Python code – as a model to build on:
https://github.com/RaRe-Technologies/gensim/blob/e4199cb4e9a90df44ca59c1d0505b138caa21951/gensim/models/doc2vec.py#L1126
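For example, a minimal sketch of such an iterable, where extra_tags_for_line is a hypothetical callable you'd supply to map a line number to its additional tags:

from gensim.models.doc2vec import TaggedDocument

class MultiTagLineDocument:
    # Like TaggedLineDocument, but adds extra tags per document.
    def __init__(self, source, extra_tags_for_line):
        self.source = source
        self.extra_tags_for_line = extra_tags_for_line

    def __iter__(self):
        with open(self.source, encoding='utf-8') as fin:
            for item_no, line in enumerate(fin):
                tags = [item_no] + list(self.extra_tags_for_line(item_no))
                yield TaggedDocument(line.split(), tags)

# e.g.: Doc2Vec(documents=MultiTagLineDocument('corpus.txt',
#                lambda n: ['news'] if n < 5000 else ['blog']))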
Note: while supplying more than one tag per document is a natural extension of the original 'Paragraph Vectors' approach, and often can provide benefits, sometimes it also 'dilutes' the salience of each tag's vector – which will be a special concern as the average number of tags per document grows, or the model acquires many more tags than unique documents. So be sure to comparatively evaluate whether any multiple-tag strategy is helping or hurting, in different modes, and whether things like pre-known categories work better as extra tags or as known labels for some later steps.

How do file converters work in general, like Word to PDF, XML to JSON, Word to TXT, etc.?

I've used many types of file converter, like Word to PDF, XML to JSON, Word to TXT, etc.
How do they work in the backend? Are there specific guidelines each of them follows? Are there similarities in the way they are implemented?
I tried searching, but most of the articles take me to web apps that can do the conversion; none of them gives clarity on how it's done.
All of them work by parsing the source document into a data structure, then generating a document in the other format from that data structure, usually with recursion.
Parsing itself is a giant topic that people take courses on in computer science. But long story short, it proceeds by breaking the document into tokens, and then fitting the tokens into a parse tree using one of a standard set of methods. They have all sorts of fancy names like Recursive Descent and LALR(1). That's where most of the theory you'd want to learn is.
For example, if you're writing a JSON to XML converter, you'd first need to parse that JSON. A JSON Parser shows how you could write that, from scratch, using recursive descent. Once that's written, you just need a recursive function that takes each data type and does something appropriate with it to generate text in the format that you want.
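As a toy sketch of that second half (generating output from the already-parsed structure), here using Python's standard-library JSON parser in place of a hand-written one:

import json
from xml.sax.saxutils import escape

def json_to_xml(value, tag="item"):
    # Recursively walk the parsed data structure, emitting XML text.
    if isinstance(value, dict):
        body = "".join(json_to_xml(v, k) for k, v in value.items())
    elif isinstance(value, list):
        # Each list element gets its own <tag> wrapper.
        return "".join(json_to_xml(v, tag) for v in value)
    else:
        body = escape(str(value))
    return f"<{tag}>{body}</{tag}>"

parsed = json.loads('{"name": "Ada", "skills": ["math", "code"]}')
print(json_to_xml(parsed, "person"))
# <person><name>Ada</name><skills>math</skills><skills>code</skills></person>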
Incidentally, you can also write a "document converter" that converts from a document format to the same document format. Why would someone want to do that? The two most common use cases are to prettify or minify code. Despite the fact that only one format is being dealt with, the principles of how you do it are exactly the same.

Train Model with Token Features

I want to train a BERT-like model for Hebrew, where for every word I know:
Lemma
Gender
Number
Voice
And I would like to train a model where for each token these features are concatenated:
Embedding(Token) = E1(Lemma):E2(Gender):E3(Number):E4(Voice)
Is there a way to do such a thing with the current huggingface transformers library?
Models in Huggingface's Transformers do not support factored inputs by default. As a workaround, you can embed the inputs yourself and bypass the embedding layer in BERT. Instead of providing the input_ids when you call the model, you can provide inputs_embeds. It will use the provided embeddings and add the position embeddings to them. Note that the provided embeddings need to have the same dimension as the rest of the model.
You need to have one embedding layer per input type (lemma, gender, number, voice), which also means having factor-specific vocabularies that assign indices to the inputs used for the embedding lookup. It makes sense to have a larger embedding for lemmas than for the grammatical categories, which have only a handful of possible values.
Then you just concatenate the embeddings, optionally project them, and feed them as inputs_embeds to the model.
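A rough sketch of that wiring with Huggingface's PyTorch BERT, where the vocabulary and embedding sizes are illustrative placeholders:

import torch
import torch.nn as nn
from transformers import BertConfig, BertModel

config = BertConfig()  # hidden_size is 768 by default
bert = BertModel(config)

# One embedding table per factor; sizes here are made up for illustration.
lemma_emb  = nn.Embedding(50_000, 512)
gender_emb = nn.Embedding(4, 64)
number_emb = nn.Embedding(4, 64)
voice_emb  = nn.Embedding(8, 128)
project = nn.Linear(512 + 64 + 64 + 128, config.hidden_size)

def encode(lemma_ids, gender_ids, number_ids, voice_ids, attention_mask=None):
    # Concatenate the per-factor embeddings along the last dimension...
    factored = torch.cat([lemma_emb(lemma_ids), gender_emb(gender_ids),
                          number_emb(number_ids), voice_emb(voice_ids)], dim=-1)
    # ...project to the model's hidden size and bypass BERT's token embeddings.
    return bert(inputs_embeds=project(factored), attention_mask=attention_mask)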

How to use BERT pretrain embeddings with my own new dataset?

My dataset and NLP task are very different from the large corpus on which the authors pre-trained their model (https://github.com/google-research/bert#pre-training-with-bert), so I can't directly fine-tune.
Is there any example code/GitHub repo that can help me train BERT with my own data? I expect to get embeddings like GloVe.
Thank you very much!
Yes, you can get BERT embeddings, like other word embeddings, using the extract_features.py script. You have the capability to select the layers from which you need the output. Usage is simple: save one sentence per line in a text file and pass it as input. The output will be a JSONL file providing contextual embeddings per token.
The usage of the script, with documentation, is provided at: https://github.com/google-research/bert#using-bert-to-extract-fixed-feature-vectors-like-elmo
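For instance, a small sketch of reading that JSONL output afterwards (field names follow the script's documented output format; check them against your version of the repo):

import json
import numpy as np

token_vectors = []
with open('output.jsonl', encoding='utf-8') as f:
    for line in f:
        sentence = json.loads(line)
        for feat in sentence['features']:
            # 'layers' holds one entry per requested layer (e.g. -1, -2, ...).
            top = np.array(feat['layers'][0]['values'], dtype=np.float32)
            token_vectors.append((feat['token'], top))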
