Extracting word features from BERT model - word-embedding

So as you know, we can extract BERT features for a word in a sentence. My question is, can we also extract features for words that are not part of a sentence? For example, BERT features of single words such as "dog", "human", etc.

The very first layer of BERT is a static embedding table, so you can use it like any other embedding table: it holds the embeddings of the words (or, more frequently, subwords) that BERT uses as input to the first self-attention layer. The static embeddings are only comparable with each other, not with the standard contextual embeddings. If you need embeddings that are comparable with the contextual ones, you can try passing single-word sentences to BERT, but note that the result will be an embedding of a single-word sentence, not of the word in general.
However, BERT is a sentence-level model that is supposed to produce embeddings of words in context. It is not designed for static word embeddings, and methods specifically designed for static word embeddings (such as FastText) would most likely give better results.
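To make the "words or, more frequently, subwords" point concrete, here is a minimal sketch of a static lookup. It does not load BERT: TOY_TABLE is a made-up two-dimensional table standing in for the real first-layer table, and averaging the subword vectors into one word vector is an assumption (a common choice, not something BERT prescribes). The greedy longest-match-first split mirrors how WordPiece tokenization works.

```python
# Toy stand-in for BERT's first-layer embedding table (values are made up).
TOY_TABLE = {
    "dog": [0.2, 0.8],
    "human": [0.6, 0.4],
    "play": [1.0, 0.0],
    "##ing": [0.0, 1.0],
    "[UNK]": [0.0, 0.0],
}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of a word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            # Pieces that continue a word carry the "##" prefix.
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                break
            end -= 1
        if end == start:  # nothing in the vocabulary matched at this position
            return ["[UNK]"]
        pieces.append(candidate)
        start = end
    return pieces


def static_embedding(word, table):
    """Average the static vectors of the word's subword pieces (an assumption)."""
    vectors = [table[piece] for piece in wordpiece(word.lower(), table)]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


print(wordpiece("playing", TOY_TABLE))
print(static_embedding("dog", TOY_TABLE))
print(static_embedding("playing", TOY_TABLE))
```

With a real model loaded through the Hugging Face transformers library, the corresponding table should be reachable via model.get_input_embeddings(); the vectors obtained this way are still only comparable with each other, as noted above.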

Related

Word2Vec convert a sentence

I have trained a Word2Vec model using gensim, and I have a dataset of tweets that I would like to convert to vectors. What is the best way to convert a sentence to a vector, and how can this be done using a word2vec model?
Formally, the word2vec algorithm only gives you a vector per word, not per longer text (like a sentence or paragraph or tweet or article).
One quick & easy baseline approach for turning longer texts into vectors is to just average together the vectors of each word. Recent versions of Gensim have a helper method get_mean_vector() to do this on KeyedVectors model objects (sets-of-word-vectors):
text_vector = kv_model.get_mean_vector(list_of_words)
Of course, such a simpleminded average has no way to model the effects of word-order/grammar. Words may tend to cancel each other out rather than have the compositional effects of real language, and the space of possible multiword-text meanings is much larger than the space of single-word meanings – so just collapsing the text into the same coordinate system as words may lose a lot.
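The word-order limitation can be seen in a small sketch. This is plain Python with made-up two-dimensional vectors (TOY_VECTORS and mean_vector are hypothetical names, not Gensim API); it only illustrates what averaging does, not how Gensim implements it.

```python
# Made-up word vectors, purely for illustration.
TOY_VECTORS = {
    "dog": [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man": [1.0, 1.0],
}


def mean_vector(words, vectors):
    """Average the vectors of all known words; unknown words are skipped."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        raise ValueError("none of the words has a vector")
    return [sum(column) / len(known) for column in zip(*known)]


forward = mean_vector(["dog", "bites", "man"], TOY_VECTORS)
backward = mean_vector(["man", "bites", "dog"], TOY_VECTORS)

# Both sentences collapse to the same point, although they mean different things.
print(forward == backward)  # True
```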
More sophisticated ways of vectorizing text rely on models far more sophisticated than plain word2vec, such as deep/recurrent neural networks for modelling longer ranges of text.

How to measure similarity between words or very short text

I work on the problem of finding the nearest document in a list of documents. Each document is a word or a very short sentence (e.g. "jeans" or "machine tool" or "biological tomatoes"). By closest I mean close in a semantical way.
I have tried to use word2vec embeddings (from the Mikolov article), but the closest words are more contextually linked than semantically linked ("jeans" is linked to "shoes" and not to "trousers" as expected).
I have tried to use BERT encodings (https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/#32-understanding-the-output) using the last layers, but they face the same issue.
I have tried Elasticsearch, but it doesn't find semantic similarities.
(The task needs to be solved in French but maybe solving it in English is a good first step)
Note that different sets of word-vectors may vary in how well they capture your desired 'semantic' similarities. (In particular, training with a shorter window may emphasize similarity among words that are drop-in replacements for each other, whereas larger window values may emphasize words that are merely used in similar domains. See this answer for more details.)
You may also want to take a look at "Word Mover's Distance" as a way to compare short texts that contain various mixes of somewhat-similar words. (It's fairly expensive, but should be practical on your short texts. It's available in the Python gensim library as wmdistance() on KeyedVectors instances.)
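To give a feel for the idea, here is a sketch of the relaxed lower bound on Word Mover's Distance (each word simply "travels" to its nearest word in the other text). This is not the full optimal-transport computation and not what gensim's wmdistance() runs; the vectors and the names one_sided_cost and relaxed_wmd are made up for illustration.

```python
import math
from collections import Counter

# Made-up word vectors, purely for illustration.
TOY_VECTORS = {
    "jeans": [1.0, 0.0],
    "trousers": [0.9, 0.1],
    "blue": [0.0, 1.0],
    "tomatoes": [-1.0, 0.0],
}


def one_sided_cost(src, dst, vectors):
    """Move each source word to its nearest destination word, weighted by frequency."""
    weights = Counter(src)
    total = sum(weights.values())
    cost = 0.0
    for word, count in weights.items():
        nearest = min(math.dist(vectors[word], vectors[other]) for other in set(dst))
        cost += (count / total) * nearest
    return cost


def relaxed_wmd(doc_a, doc_b, vectors):
    """Relaxed WMD: the larger of the two one-sided costs (a lower bound on WMD)."""
    return max(one_sided_cost(doc_a, doc_b, vectors),
               one_sided_cost(doc_b, doc_a, vectors))


close = relaxed_wmd(["blue", "jeans"], ["blue", "trousers"], TOY_VECTORS)
far = relaxed_wmd(["blue", "jeans"], ["blue", "tomatoes"], TOY_VECTORS)
print(close < far)  # True
```

Texts that share no words can still come out close as long as their words have nearby vectors, which is the property that makes this family of distances attractive for very short texts.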
If you have training data where your specific multi-word phrases are used, in many natural-language-like subtly-varied contexts, you could consider combining all such phrases-of-interest into single tokens (like machine_tool or biological_tomatoes), and training your own domain-specific word-vectors.
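A minimal preprocessing sketch for that phrase-merging step, assuming you already have a list of phrases of interest (merge_phrases is a hypothetical helper, not a library function). The merged token lists could then be used as training sentences for your own word-vectors.

```python
def merge_phrases(tokens, phrases):
    """Replace known multi-word phrases with single underscore-joined tokens."""
    phrase_set = {tuple(p.split()) for p in phrases}
    max_len = max((len(p) for p in phrase_set), default=1)
    merged, i = [], 0
    while i < len(tokens):
        # Try the longest possible phrase first, then shorter ones.
        for n in range(max_len, 1, -1):
            window = tuple(tokens[i:i + n])
            if window in phrase_set:
                merged.append("_".join(window))
                i += len(window)
                break
        else:
            # No phrase starts here: keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


sentence = "the machine tool cuts biological tomatoes".split()
print(merge_phrases(sentence, ["machine tool", "biological tomatoes"]))
```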
For computing similarity between short texts that contain 2 or 3 words, you can use word2vec and take the average vector of the sentence.
For example, if you have the text "machine tool" and want to represent it as one vector using word2vec, get the vector of "machine" and the vector of "tool", then combine them into one vector by averaging: add the two vectors and divide by 2 (the number of words). This gives you a vector representation for a text of more than one word.
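As a rough sketch of how this could be used for the nearest-document search, the snippet below averages word vectors per document and ranks documents by cosine similarity. The three-dimensional vectors are made up, and average, cosine and nearest are hypothetical helper names; in practice the vectors would come from a trained word2vec model.

```python
import math

# Made-up word vectors, purely for illustration.
TOY_VECTORS = {
    "machine": [0.9, 0.1, 0.0],
    "tool": [0.7, 0.3, 0.0],
    "drill": [0.8, 0.2, 0.1],
    "biological": [0.0, 0.2, 0.9],
    "tomatoes": [0.1, 0.0, 0.8],
}


def average(words, vectors):
    """Add the word vectors and divide by the number of words."""
    rows = [vectors[w] for w in words]
    return [sum(column) / len(rows) for column in zip(*rows)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(query, documents, vectors):
    """Return the document whose averaged vector is most similar to the query's."""
    q = average(query.split(), vectors)
    return max(documents, key=lambda d: cosine(q, average(d.split(), vectors)))


print(nearest("drill", ["machine tool", "biological tomatoes"], TOY_VECTORS))
```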
You can also use something like doc2vec, which is built on top of word2vec and whose purpose is to produce a vector for a sentence or paragraph.
You might try a document embedding that is built on top of word2vec.
However, notice that word and document embeddings do not always capture the "desired similarity": they just learn a language model on your corpus, and they are heavily influenced by text size and word frequency.
How big is your corpus? If you only need it to perform some classification, it might be better to train your vectors on a large dataset such as the Google News corpus.

Differences between BERT sentence embeddings and LSA embeddings

BERT as a service (https://github.com/hanxiao/bert-as-service) allows extracting sentence-level embeddings. Assuming I have a pre-trained LSA model which gives me 300-dimensional word vectors, I am trying to understand in which scenarios an LSA model would perform better than BERT when comparing two sentences for semantic coherence.
I cannot think of a reason why LSA would be better for this use case - since LSA is just a compression of a big bag of words matrix.
BERT requires memory that is quadratic in the sequence length and is only trained on pairs of split sentences. This might be inconvenient when processing really long sentences.
For LSA, you only need the bag-of-words vector, whose size is constant with respect to the document length. For really long documents, LSA might still be a better option.
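The scaling argument can be illustrated with a small sketch. It only counts sizes: attention_score_count gives the number of entries in a single self-attention score matrix (real models repeat this per head and per layer), and the four-word vocabulary is made up.

```python
from collections import Counter


def attention_score_count(num_tokens):
    """Entries in one attention score matrix: every token attends to every token."""
    return num_tokens * num_tokens


def bag_of_words(tokens, vocabulary):
    """Count vector over a fixed vocabulary; its length ignores document length."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


vocabulary = ["bert", "lsa", "sentence", "vector"]
short_doc = ["lsa", "vector"]
long_doc = ["bert", "sentence"] * 500  # 1000 tokens

# The bag-of-words input to LSA has the same size for both documents ...
print(len(bag_of_words(short_doc, vocabulary)), len(bag_of_words(long_doc, vocabulary)))
# ... while the attention matrix grows quadratically with the token count.
print(attention_score_count(len(short_doc)), attention_score_count(len(long_doc)))
```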

Specify condition for negative sampling in gensim word2vec

I'm training a word2vec model where each word belongs to a specific class.
I want my embeddings to learn differences between words within each class, but don't want them to learn the differences between classes.
This can be achieved by negative sampling from only the words of the same class as the target word.
In gensim word2vec, we can specify the number of words to negative sample using the negative parameter, but the documentation doesn't mention any options to modify/filter the sampling function.
Is there any method to achieve this?
Update:
Consider the classes to be like languages. So I have words from different languages. In the training data, each sentence/document contains mostly words from the same language, but sometimes from other languages.
Now I want embeddings where words with similar meanings are together irrespective of the language.
But because words from different languages do not occur together as frequently as words from the same language, the embeddings basically group words from the same language together.
Because of this, I wanted to try negative sampling target words with words from the same language so that it learns to distinguish the words within the same language.
It's unclear what you mean by "learn differences of words within each class, but don't want them to learn the differences between classes", or what benefit you'd hope to achieve.
If words co-occur in training texts, the word2vec training algorithm will try to predict neighboring words, and the end-results are the useful word-vectors.
If two words shouldn't have any influence on each other, you could preprocess your texts so they never co-occur. For example, if you have three classes of words, and your text corpus naturally includes a mixture of all three classes in each, you could filter the corpus into three separate corpuses. Each corpus would feature the words of one class, and drop the words of the other classes. Then you could train 3 separate Word2Vec models from the 3 corpuses.
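A minimal sketch of that filtering step, assuming you have a mapping from each word to its class (split_corpus_by_class and word_class are hypothetical names, and the tiny two-language corpus is made up):

```python
def split_corpus_by_class(sentences, word_class):
    """Build one corpus per class, keeping only that class's words in each sentence."""
    classes = sorted(set(word_class.values()))
    corpora = {c: [] for c in classes}
    for sentence in sentences:
        for c in classes:
            kept = [w for w in sentence if word_class.get(w) == c]
            if kept:  # skip sentences that contain no word of this class
                corpora[c].append(kept)
    return corpora


word_class = {"dog": "en", "runs": "en", "chien": "fr", "court": "fr"}
sentences = [["dog", "runs", "chien"], ["chien", "court"]]

corpora = split_corpus_by_class(sentences, word_class)
print(corpora)
```

Each resulting list of token lists could then be passed as the training sentences of its own gensim Word2Vec model. Note that a sentence reduced to a single word contributes no neighboring-word pairs to training.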
But I'm not sure why you'd want to do that: the word-vectors from each corpus/model wouldn't be usefully comparable. I've not seen any work that does that, nor can I imagine a benefit – while it seems to throw away exactly the subtle relationships most people want from word2vec.

Natural Language Parsing using Stanford NLP

How does the Stanford Natural Language Parser use the Penn Treebank for the tagging process? I want to know how it finds the POS tags for a given input.
The Stanford part-of-speech tagger uses a probabilistic sequence model to determine the most likely sequence of part-of-speech tags underlying a sentence. Some of the features provided to this model are:
Surrounding words and n-grams
Part-of-speech tags of surrounding words
"Word shapes" (e.g., "Foo5" is translated to "Xxx#")
Word suffix, prefix
See the ExtractorFrames class for details. The model is trained on a tagged corpus (like the Penn Treebank) which has each token annotated with its correct part of speech.
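As an illustration of the "word shape" feature listed above, here is a minimal sketch that reproduces the "Foo5" to "Xxx#" example. It is not the tagger's actual implementation (word_shape is a hypothetical function, and the real code may use other shape variants); it only shows the kind of mapping involved.

```python
def word_shape(token):
    """Map uppercase letters to 'X', lowercase to 'x', digits to '#'; keep the rest."""
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("#")
        else:
            shape.append(ch)
    return "".join(shape)


print(word_shape("Foo5"))  # Xxx#
```

A feature like this lets the model generalize to unseen tokens: a capitalized word it has never seen still shares its shape with known proper nouns.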
At run time, features like those mentioned above are calculated for input text and are used to build per-tag probabilities, which are then fed into an implementation of the Viterbi algorithm (ExactBestSequenceFinder), which finds the most likely arrangement of tags for the entire sequence.
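The decoding step can be sketched as follows. This is a generic Viterbi implementation over made-up log-probabilities, not the ExactBestSequenceFinder code; the tag set, the per-token scores and the transition scores are all assumptions chosen so that the best sequence differs from picking the best tag for each token independently.

```python
import math


def viterbi(per_token_scores, transition, tags):
    """Return the highest-scoring tag sequence.

    per_token_scores[t][tag] is the log-score of tag at position t;
    transition[(prev, tag)] is the log-score of moving from prev to tag.
    """
    best = [{tag: per_token_scores[0][tag] for tag in tags}]
    back = []
    for scores in per_token_scores[1:]:
        column, pointers = {}, {}
        for tag in tags:
            # Best previous tag for ending in `tag` at this position.
            prev = max(tags, key=lambda p: best[-1][p] + transition[(p, tag)])
            column[tag] = best[-1][prev] + transition[(prev, tag)] + scores[tag]
            pointers[tag] = prev
        best.append(column)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


TAGS = ["DT", "NN", "VB"]
# Made-up per-tag probabilities for the tokens "the", "dog", "barks".
EMISSIONS = [
    {"DT": 0.90, "NN": 0.05, "VB": 0.05},
    {"DT": 0.05, "NN": 0.60, "VB": 0.35},
    {"DT": 0.05, "NN": 0.50, "VB": 0.45},
]
# Made-up transition probabilities between consecutive tags.
TRANSITIONS = {
    ("DT", "DT"): 0.1, ("DT", "NN"): 0.8, ("DT", "VB"): 0.1,
    ("NN", "DT"): 0.1, ("NN", "NN"): 0.3, ("NN", "VB"): 0.6,
    ("VB", "DT"): 0.5, ("VB", "NN"): 0.3, ("VB", "VB"): 0.2,
}

log_emissions = [{t: math.log(p) for t, p in row.items()} for row in EMISSIONS]
log_transitions = {pair: math.log(p) for pair, p in TRANSITIONS.items()}

greedy = [max(TAGS, key=row.get) for row in EMISSIONS]
decoded = viterbi(log_emissions, log_transitions, TAGS)
print(greedy)   # ['DT', 'NN', 'NN']
print(decoded)  # ['DT', 'NN', 'VB']
```

Looking at each token in isolation slightly prefers NN for "barks", but scoring the whole sequence makes NN followed by VB the better arrangement, which is the point of decoding over the entire sentence.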
For more information to get started with POS tagging:
Watch the Week 5 lectures of the Coursera NLP class (co-taught by the CoreNLP lead)
Check out the code in the edu.stanford.nlp.tagger.maxent package
Part-of-speech tagging in NLTK
