How can I use XLNet to generate word embeddings - word-embedding

I want to know how I can use XLNet to generate word embeddings.
I am currently using a word embedding model, but I want to compare its performance with XLNet.
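One common way to get word embeddings out of XLNet (not covered in the question itself) is to run text through a pretrained model from the Hugging Face transformers library and use the hidden states as contextual token vectors. A minimal sketch, assuming a recent transformers version, PyTorch, and the xlnet-base-cased checkpoint:
import torch
from transformers import XLNetTokenizer, XLNetModel

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetModel.from_pretrained("xlnet-base-cased")
model.eval()

sentence = "XLNet produces contextual word embeddings."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# one vector per SentencePiece token: shape (1, sequence_length, hidden_size)
token_embeddings = outputs.last_hidden_state
# a crude sentence-level vector: mean over the token dimension
sentence_embedding = token_embeddings.mean(dim=1)
Note that XLNet tokenizes into SentencePiece subwords, so a single word may map to several vectors; averaging the pieces of each word is one simple way to get per-word embeddings comparable to a word2vec-style model.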

Related

Can you provide additional tags for documents using TaggedLineDocument?

When training a doc2vec model using a corpus in the TaggedDocument class, you can provide a list of tags. When the doc2vec model is trained it learns a vector representation for the tags. For example you could have one tag representing the document, and another representing some classification that can be shared between documents.
How would one provide additional tags when streaming a corpus using TaggedLineDocument?
The TaggedLineDocument class only considers documents to be one per line, with a single tag that is their line-number.
If you want more tags, you'll have to provide your own iterable which does that. It should only be a few lines of code, depending on where your other tags come from. You can use the source for TaggedLineDocument – which is itself only 9 lines of Python code – as a model to build on:
https://github.com/RaRe-Technologies/gensim/blob/e4199cb4e9a90df44ca59c1d0505b138caa21951/gensim/models/doc2vec.py#L1126
Note: while supplying more than one tag per document is a natural extension of the original 'Paragraph Vectors' approach, and often can provide benefits, sometimes it also 'dilutes' the salience of each tag's vector – which will be a special concern as the average number of tags per document grows, or the model acquires many more tags than unique documents. So be sure to comparatively evaluate whether any multiple-tag strategy is helping or hurting, in different modes, and whether things like pre-known categories work better as extra tags or as known labels for some later steps.
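For illustration only (this is not part of the original answer), a minimal sketch of such an iterable, assuming the corpus is a file with one document per line and the extra tags come from a hypothetical per-line list of labels:
from gensim.models.doc2vec import TaggedDocument
from gensim.utils import simple_preprocess

class MultiTagLineDocument:
    def __init__(self, path, extra_tags):
        self.path = path              # one document per line
        self.extra_tags = extra_tags  # hypothetical: list of tag-lists, one per line
    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line_no, line in enumerate(f):
                words = simple_preprocess(line)
                # keep the line number as the primary tag, then append the extra tags
                yield TaggedDocument(words=words, tags=[line_no] + list(self.extra_tags[line_no]))
An instance of this class can then be passed to Doc2Vec() anywhere a TaggedLineDocument would have been used.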

Train Model with Token Features

I want to train a BERT-like model for Hebrew, where for every word I know:
Lemma
Gender
Number
Voice
And I would like to train a model where, for each token, the embeddings of these features are concatenated:
Embedding(Token) = E1(Lemma):E2(Gender):E3(Number):E4(Voice)
Is there a way to do such a thing with the current huggingface transformers library?
Models in Hugging Face's Transformers do not support factored inputs by default. As a workaround, you can embed the inputs yourself and bypass BERT's embedding layer: instead of providing input_ids when you call the model, you can provide inputs_embeds. The model will use the provided embeddings and add the position embeddings to them. Note that the provided embeddings need to have the same dimension as the rest of the model.
You need one embedding layer per input type (lemma, gender, number, voice), which also means having factor-specific vocabularies that assign indices to the inputs for the embedding lookup. It makes sense to have a larger embedding for lemmas than for the grammatical categories, which have only a few possible values.
Then you just concatenate the embeddings, optionally project them to the model's hidden size, and feed them to the model as inputs_embeds.
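A minimal sketch of that idea with PyTorch and a BERT model built from scratch (the vocabulary sizes and embedding dimensions below are illustrative assumptions, not values prescribed by the library):
import torch
import torch.nn as nn
from transformers import BertConfig, BertModel

config = BertConfig()        # adjust the architecture/hidden size as needed
bert = BertModel(config)     # untrained BERT-like model

# one embedding table per factor, each with its own vocabulary (sizes are made up)
lemma_emb  = nn.Embedding(50000, 200)
gender_emb = nn.Embedding(4, 8)
number_emb = nn.Embedding(4, 8)
voice_emb  = nn.Embedding(8, 8)
# project the concatenated factor embeddings to the model's hidden size
projection = nn.Linear(200 + 8 + 8 + 8, config.hidden_size)

def embed(lemma_ids, gender_ids, number_ids, voice_ids):
    # each *_ids tensor has shape (batch, seq_len)
    concatenated = torch.cat([
        lemma_emb(lemma_ids),
        gender_emb(gender_ids),
        number_emb(number_ids),
        voice_emb(voice_ids),
    ], dim=-1)
    return projection(concatenated)

# toy batch of factored token ids
batch, seq_len = 2, 16
lemma_ids  = torch.randint(0, 50000, (batch, seq_len))
gender_ids = torch.randint(0, 4, (batch, seq_len))
number_ids = torch.randint(0, 4, (batch, seq_len))
voice_ids  = torch.randint(0, 8, (batch, seq_len))
attention_mask = torch.ones(batch, seq_len, dtype=torch.long)

# bypass BERT's own token embeddings by passing inputs_embeds
outputs = bert(inputs_embeds=embed(lemma_ids, gender_ids, number_ids, voice_ids),
               attention_mask=attention_mask)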

How to Cluster words and phrases with pre-trained model on Gensim

What I want exactly is to cluster words and phrases, e.g.
knitting / knit loom / loom knitting / weaving loom / rainbow loom / home decoration accessories / loom knit / knitting loom / ...
I don't have a corpus; I only have the words/phrases. Could I use a pre-trained model like the one from GoogleNews/Wikipedia/... to do this?
I am now trying to use Gensim to load the GoogleNews pre-trained model to get phrase similarity. I've been told that the GoogleNews model includes vectors for both phrases and words. But I find that I can only get word similarity, while phrase similarity fails with an error message saying the phrase is not in the vocabulary. Please advise me. Thank you.
from gensim.models import KeyedVectors
GOOGLE_MODEL = '../GoogleNews-vectors-negative300.bin'
model = KeyedVectors.load_word2vec_format(GOOGLE_MODEL, binary=True)
# works as expected
model.most_similar("computer", topn=3)
# fails with an error message that "computer_software" is not in the vocabulary
model.most_similar("computer_software", topn=3)
The GoogleNews set does include many multi-word phrases, as created via some statistical analysis, but might not include something specific you're hoping it does, like 'computer_software'.
On the other hand, I see an online word-list suggesting that a phrase like 'composite_fillings' is in the GoogleNews vocabulary, so this will likely work for you:
model.most_similar("composite_fillings", topn=3)
With that vector-set, you're limited to what they chose to model as phrases. If you need similarly-strong vectors for other phrases, you'd likely need to train your own model, on a corpus where the phrases important to you have been combined into single tokens. (If you just need something-better-than-nothing, averaging together the constituent words' word-vectors would give you something to work with... but that's a pretty-crude stand-in for truly modeling the bigram/multigram against its unique contexts.)
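For illustration (not from the original answer), that crude fallback of averaging the constituent words' vectors for a phrase the model doesn't contain, assuming the GoogleNews vectors are already loaded as model above:
import numpy as np

def phrase_vector(phrase, kv):
    # average the vectors of whichever words of the phrase are in the vocabulary
    words = [w for w in phrase.split("_") if w in kv]
    if not words:
        raise KeyError("none of the words in %r are in the vocabulary" % phrase)
    return np.mean([kv[w] for w in words], axis=0)

vec = phrase_vector("computer_software", model)
model.similar_by_vector(vec, topn=3)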

How to load a word2vec txt file with vocabulary constraint

I have a word2vec file in the standard format, but it is huge, with 2M items. I also have a vocabulary file where each row is a word; the file has about 800K rows. Now I want to load the embeddings from the word2vec file, but only for the words in the vocabulary file. Is there an efficient implementation in gensim?
There's no built-in support for filtering the words on load. But you could use the code for the load_word2vec_format() function as a model for your own alternate loading code that skips words not-of-interest.
You can view the code for that function in the KeyedVectors class...
https://github.com/RaRe-Technologies/gensim/blob/ff107d6c5cb50d9ab99999cb898ff0aceb192592/gensim/models/keyedvectors.py#L1434
...and some shared support functions...
https://github.com/RaRe-Technologies/gensim/blob/ff107d6c5cb50d9ab99999cb898ff0aceb192592/gensim/models/utils_any2vec.py#L294
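For illustration (this isn't a built-in gensim feature), a minimal sketch of such filtered loading for the plain-text word2vec format, assuming gensim 4.x (which provides KeyedVectors.add_vectors) and a vocabulary file with one word per line; the file names are hypothetical:
import numpy as np
from gensim.models import KeyedVectors

def load_filtered(vectors_path, vocab_path):
    with open(vocab_path, encoding="utf-8") as f:
        wanted = {line.strip() for line in f if line.strip()}
    words, vectors = [], []
    with open(vectors_path, encoding="utf-8") as f:
        _count, dim = map(int, f.readline().split())  # header line: "<count> <dimensions>"
        for line in f:
            parts = line.rstrip().split()
            if parts[0] in wanted:
                words.append(parts[0])
                vectors.append(np.array(parts[1:], dtype=np.float32))
    kv = KeyedVectors(dim)
    kv.add_vectors(words, vectors)
    return kv

kv = load_filtered("word2vec_2M.txt", "vocab_800k.txt")
kv.most_similar("computer", topn=3)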

Defining new language grammar rules?

Can you help me with how I could edit the .tagger file using Stanford NLP? I have a problem here: I can't open and edit the file to define the grammar rules for a new language in order to generate part-of-speech tags.
The .tagger files are serialized statistical models used by a Maximum Entropy based sequence tagger. You can't edit them in any meaningful way.
If you want to create part-of-speech tags for a new language, you will have to create training data, consisting of a large set of sentences in the target language with the correct part-of-speech tag for each word, and then train a new part-of-speech tagging model.
