Comparison among ELMo, BERT, and GloVe - stanford-nlp

What are the differences among ELMo, BERT, and GloVe in word representation? How differently do they perform word embedding tasks? Which one is better and what advantages and disadvantages does each have in comparison with others?

This is a big question, so I will concentrate on word representation.
ELMo, BERT, and GloVe can be divided into two big groups: GloVe is a non-contextual word embedding, while ELMo and BERT are contextual word embeddings.
The second group can be divided further by directionality: ELMo combines two language models that each read the text in one direction (left-to-right and right-to-left), whereas BERT is a bi-directional model that looks at both sides of a word at once.
First, we can try to understand four terms: non-contextual/contextual word embedding and uni-/bi-directional model.
Afterward, we can go deeper into the other differences.

Related

How to measure similarity between words or very short text

I work on the problem of finding the nearest document in a list of documents. Each document is a word or a very short sentence (e.g. "jeans" or "machine tool" or "biological tomatoes"). By nearest I mean semantically close.
I have tried to use word2vec embeddings (from the Mikolov article), but the closest words are more contextually linked than semantically linked ("jeans" is linked to "shoes" and not to "trousers" as expected).
I have tried to use BERT encodings (https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/#32-understanding-the-output) using the last layers, but this faces the same issue.
I have tried Elasticsearch, but it doesn't find semantic similarities.
(The task needs to be solved in French, but maybe solving it in English is a good first step.)
Note different sets of word-vectors may vary in how well they capture your desired 'semantic' similarities. (In particular, training with a shorter window may emphasize similarity among words that are drop-in replacements for each other, as opposed to just used-in-similar domains, as larger window values may emphasize. See this answer for more details.)
You may also want to take a look at "Word Mover's Distance" as a way to compare short texts that contain various mixes of somewhat-similar words. (It's fairly expensive, but should be practical on your short texts. It's available in the Python gensim library as wmdistance() on KeyedVectors instances.)
If you have training data where your specific multi-word phrases are used, in many natural-language-like subtly-varied contexts, you could consider combining all such phrases-of-interest into single tokens (like machine_tool or biological_tomatoes), and training your own domain-specific word-vectors.
To compute similarity between short texts that contain 2 or 3 words, you can use word2vec and take the average vector of the text.
For example, to represent the text "machine tool" as one vector using word2vec, get the vector of "machine" and the vector of "tool", then combine them by averaging: add the two vectors and divide by 2 (the number of words). This gives you a vector representation for a text of more than one word.
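The averaging described above can be sketched in a few lines. This is only a minimal illustration: the three-dimensional vectors in toy_vectors are made up and stand in for real word2vec embeddings, and average_vector and cosine_similarity are hypothetical helper names, not part of any library.

```python
import math

# Made-up 3-dimensional vectors standing in for real word2vec embeddings.
toy_vectors = {
    "machine": [0.9, 0.1, 0.0],
    "tool": [0.7, 0.3, 0.1],
    "jeans": [0.0, 0.8, 0.6],
    "trousers": [0.1, 0.9, 0.5],
}

def average_vector(text, vectors):
    """Average the vectors of the known words in a short text."""
    words = [w for w in text.lower().split() if w in vectors]
    if not words:
        raise ValueError(f"no known words in {text!r}")
    dim = len(vectors[words[0]])
    # Add the vectors component by component, then divide by the number of words.
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

machine_tool = average_vector("machine tool", toy_vectors)
jeans = average_vector("jeans", toy_vectors)
trousers = average_vector("trousers", toy_vectors)

print(machine_tool)
print(cosine_similarity(jeans, trousers))
print(cosine_similarity(jeans, machine_tool))
```

With real embeddings you would look the words up in the trained model instead of a hand-written dictionary; the averaging and the cosine comparison stay the same.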
You can also use something like doc2vec, which is built on top of word2vec and is designed to produce a vector for a sentence or paragraph.
You might try a document embedding that is built on top of word2vec.
However, notice that word and document embeddings do not always capture the "desired similarity": they just learn a language model on your corpus, and they are heavily influenced by text size and word frequency.
How big is your corpus? If you need it just to perform some classification, it might be better to train your vectors on a large dataset such as the Google News corpus.

Specify condition for negative sampling in gensim word2vec

I'm training a word2vec model where each word belongs to a specific class.
I want my embeddings to learn the differences between words within each class, but I don't want them to learn the differences between classes.
This can be achieved by negative sampling from only the words of the same class as the target word.
In gensim word2vec, we can specify the number of words to negative sample using the negative parameter, but the documentation doesn't mention any options to modify or filter the sampling function.
Is there any method to achieve this?
Update:
Consider the classes to be like languages. So I have words from different languages. In the training data, each sentence/document contains mostly words from the same language, but sometimes words from other languages.
Now I want embeddings where words with similar meanings are together irrespective of the language.
But because words from different languages do not occur together as frequently as words from the same language, the embeddings basically group words from the same language together.
Because of this, I wanted to try negative sampling target words with words from the same language, so that the model learns to distinguish the words within the same language.
It's unclear what you mean by "learn differences of words within each class, but don't want them to learn the differences between classes", or what benefit you'd hope to achieve.
If words co-occur in training texts, the word2vec training algorithm will try to predict neighboring words, and the end-results are the useful word-vectors.
If two words shouldn't have any influence on each other, you could preprocess your texts so they never co-occur. For example, if you have three classes of words, and your text corpus naturally includes a mixture of all three classes in each, you could filter the corpus into three separate corpuses. Each corpus would feature the words of one class, and drop the words of the other classes. Then you could train 3 separate Word2Vec models from the 3 corpuses.
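The preprocessing step described above could look roughly like this. It is a sketch under assumptions: word_class is a hypothetical word-to-class mapping you would have to supply yourself, the corpus is already tokenized, and split_corpus_by_class is an illustrative helper name.

```python
# Hypothetical word-to-class mapping and a tiny tokenized corpus (illustration only).
word_class = {"chat": "fr", "chien": "fr", "cat": "en", "dog": "en", "gato": "es"}
corpus = [
    ["cat", "dog", "chat"],
    ["chien", "chat", "dog"],
    ["gato", "cat"],
]

def split_corpus_by_class(sentences, word_class):
    """Return {class: sentences}, where each sentence keeps only the words of that class."""
    per_class = {cls: [] for cls in set(word_class.values())}
    for sentence in sentences:
        for cls, filtered in per_class.items():
            kept = [w for w in sentence if word_class.get(w) == cls]
            if kept:  # skip sentences that are empty after filtering
                filtered.append(kept)
    return per_class

corpora = split_corpus_by_class(corpus, word_class)
for cls in sorted(corpora):
    print(cls, corpora[cls])
```

Each of the resulting lists of token lists could then be used to train its own Word2Vec model.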
But I'm not sure why you'd want to do that: the word-vectors from each corpus/model wouldn't be usefully comparable. I've not seen any work that does that, nor can I imagine a benefit – while it seems to throw away exactly the subtle relationships most people want from word2vec.

Given a list of words, how to develop an algorithmic way to semantically group them?

I am working with the Google Places API, which contains a list of 97 different locations. I want to reduce the list to a smaller number of them, as many of them are groupable. For example, atm and bank into financial; temple, church, mosque, synagogue into worship; school, university into education; subway_station, train_station, transit_station, gas_station into transportation.
But also, it should not overgeneralize; for example, pet_store, city_hall, courthouse, restaurant into something like buildings.
I tried quite a few methods to do this. First I downloaded synonyms of each of the 97 words in the list from multiple dictionaries. Then I computed the similarity between two words as the fraction of unique synonyms they share (Jaccard similarity).
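For reference, the Jaccard similarity between two synonym sets can be computed as below. This is not the code from the question, just a minimal sketch; the synonym sets are made up, and real ones would come from the downloaded dictionaries.

```python
# Made-up synonym sets, for illustration only.
synonyms = {
    "atm": {"cash machine", "cashpoint", "bank", "teller"},
    "bank": {"financial institution", "teller", "cashpoint", "lender"},
    "church": {"chapel", "temple", "house of worship"},
}

def jaccard(a, b):
    """Size of the intersection over size of the union; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

print(jaccard(synonyms["atm"], synonyms["bank"]))
print(jaccard(synonyms["atm"], synonyms["church"]))
```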
But after that, how do I group words into clusters? Using traditional clustering methods (k-means, k-medoid, hierarchical clustering, and FCM), I am not getting any good clustering (I identified several misclassifications by scanning the results manually).
I even tried the word2vec model trained on Google News data (where each word is expressed as a vector of 300 features), and I do not get good clusters based on that either.
You are probably looking for something related to vector space dimensionality reduction. In these techniques, you'll need a corpus of text that uses the locations as words in the text. Dimensionality reduction will then group the terms together. You can do some reading on Latent Dirichlet Allocation and Latent semantic indexing. A good reference is "Introduction to Information Retrieval" by Manning et al., chapter 18. Note that this book is from 2009, so a lot of advances are not captured. As you noted, there has been a lot of work such as word2vec. Another good reference is "Speech and Language Processing" by Jurafsky and Martin, chapter 16.
You need much more data.
No algorithm ever, without additional data, will relate ATM and bank to financial. Because that requires knowledge of these terms.
Jaccard similarity doesn't have access to such knowledge, it can only work on the words. And then "river bank" and "bank branch" are very similar.
So don't expect magic to happen by the algorithm. You need the magic to be in the data...

Algorithm to compare similarity of ideas (as strings)

Consider an arbitrary text box that records the answer to the question "What do you want to do before you die?"
Using a collection of response strings (max length 240), I'd like to somehow sort and group them and count them by idea (which may be just string similarity as described in this question).
Is there another or better way to do something like this?
Is this any different than string similarity?
Is this the right question to be asking?
The idea here is to have people write in a text box over and over again, and for me to provide a number that says, generally speaking, that 802 people wrote approximately the same thing.
It is much more difficult than string similarity. This is what you need to do at a minimum:
Perform some text formatting/cleaning tasks, like removing punctuation characters and common "stop words".
Construct a corpus (a collection of words with their usage statistics) from the terms that occur in the answers.
Calculate a weight for every term.
Construct a document vector from every answer (each term corresponds to a dimension in a very high-dimensional Euclidean space).
Run a clustering algorithm on the document vectors.
Read a good statistical natural language processing book, or search Google for good introductions/tutorials (likely terms: statistical NLP, text categorization, clustering). You can probably find some libraries (Weka or NLTK come to mind) depending on the language of your choice, but you need to understand the concepts to use the library anyway.
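To make the steps above concrete, here is a small standard-library sketch. Several things in it are assumptions for illustration: the stop-word list is tiny and hand-written, the term weight is a simple tf-idf variant, and the "clustering" is a naive greedy grouping by cosine similarity with an arbitrary threshold, standing in for a real clustering algorithm.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "to", "i", "want", "in", "my", "of", "and"}

def tokenize(text):
    """Lowercase, drop punctuation, and remove stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]

def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [tokenize(d) for d in docs]
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    n = len(docs)
    return [
        {t: c * (math.log(n / doc_freq[t]) + 1.0) for t, c in Counter(tokens).items()}
        for tokens in tokenized
    ]

def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

def group_answers(docs, threshold=0.3):
    """Greedy grouping: each answer joins the first group whose first member is similar enough."""
    vectors = tfidf_vectors(docs)
    groups = []  # lists of document indices; group[0] acts as the group's seed
    for i, vec in enumerate(vectors):
        for group in groups:
            if cosine(vec, vectors[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups

answers = [
    "I want to travel the world",
    "Travel around the world!",
    "Learn to play the piano",
    "learn piano",
]
groups = group_answers(answers)
print(groups)
print([len(g) for g in groups])  # how many people wrote roughly the same thing
```

This only groups answers that share words; answers that express the same idea with different vocabulary will end up in different groups.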
Latent Semantic Analysis (LSA) might interest you. Here is a nice introduction.
Latent semantic analysis (LSA) is a technique in natural language processing, in particular in vectorial semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.
[...]
What you want is very much an open problem in NLP. @Ali's answer describes the idea at a high level, but the part "Construct a document vector for every answer" is the really hard one. There are a few obvious ways of building a document vector from the vectors of the words it contains. Addition, multiplication and averaging are fast, but they effectively ignore the syntax. "Man bites dog" and "Dog bites man" will have the same representation, but clearly not the same meaning. Google "compositional distributional semantics"; as far as I know, there are people at the Universities of Texas, Trento, Oxford, Sussex and at Google working in the area.
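The point about word order is easy to check. In this sketch the two-dimensional vectors in order_demo_vectors are made up and compose_by_average is a hypothetical helper; any order-insensitive operation (addition, element-wise multiplication, averaging) behaves the same way.

```python
# Made-up 2-dimensional word vectors, for illustration only.
order_demo_vectors = {"man": [1.0, 0.0], "bites": [0.5, 0.5], "dog": [0.0, 1.0]}

def compose_by_average(sentence, vectors):
    """Average the word vectors of a sentence; word order plays no role."""
    words = sentence.lower().split()
    dim = len(vectors[words[0]])
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]

man_bites_dog = compose_by_average("Man bites dog", order_demo_vectors)
dog_bites_man = compose_by_average("Dog bites man", order_demo_vectors)

# Both sentences contain the same words, so they get the same vector.
print(man_bites_dog == dog_bites_man)
```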

Is there an algorithm that extracts meaningful tags of english text

I would like to extract a reduced collection of "meaningful" tags (10 max) out of an English text of any size.
http://tagcrowd.com/ is quite interesting but the algorithm seems very basic (just word counting)
Is there any other existing algorithm to do this?
There are existing web services for this. Three examples:
Yahoo's Term Extraction API
Topicalizer
OpenCalais
When you subtract the human element (tagging), all that is left is frequency. "Ignore common English words" is the next best filter, since it deals with exclusion instead of inclusion. I tested a few sites, and it is very accurate. There really is no other way to derive "meaning", which is why the Semantic Web gets so much attention these days. It is a way to imply meaning with HTML... of course, that has a human element to it as well.
Basically, this is a text categorization problem/document classification problem. If you have access to a number of already tagged documents, you could analyze which (content) words trigger which tags, and then use this information for tagging new documents.
If you don't want to use a machine-learning approach and you still have a document collection, then you can use metrics like tf.idf to filter out interesting words.
Going one step further, you can use Wordnet to find synonyms and replace words by their synonym, if the frequency of the synonym is higher.
Manning & Schütze contains a lot more introduction on text categorization.
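As a rough illustration of the tf.idf filtering mentioned above, the sketch below ranks the words of one document against a small collection. The function names words and top_tags, the three example documents, and the exact weighting formula are assumptions for this example only.

```python
import math
import re
from collections import Counter

def words(text):
    """Lowercase a text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def top_tags(target, collection, max_tags=10):
    """Rank the words of `target` by tf-idf against `collection` (assumed to include `target`)."""
    doc_freq = Counter(w for doc in collection for w in set(words(doc)))
    n = len(collection)
    scores = {
        w: count * math.log(n / max(doc_freq[w], 1))
        for w, count in Counter(words(target)).items()
    }
    # Highest score first; ties broken alphabetically so the result is deterministic.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    # Words that occur in every document score 0 and are dropped.
    return [w for w in ranked if scores[w] > 0][:max_tags]

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the market fell, the market closed early",
]
tags = top_tags(docs[2], docs, max_tags=3)
print(tags)
```

Words that appear in every document get a weight of zero here, which is how very common words are filtered out without an explicit stop list.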
In text classification, this problem is known as dimensionality reduction. There are many useful algorithms in the literature on this subject.
You want to do semantic analysis of a text.
Word frequency analysis is one of the easiest ways to do semantic analysis. Unfortunately (and obviously) it is the least accurate one. It can be improved by using special dictionaries (for synonyms or forms of a word, for example), "stop-lists" with common words, other texts (to find those "common" words and exclude them)...
As for other algorithms they could be based on:
Syntax analysis (like trying to find the main subject and/or verb in a sentence)
Format analysis (analyzing headers, bold text, italic... where applicable)
Reference analysis (if the text is in Internet, for example, then a reference can describe it in several words... used by some search engines)
BUT... you should understand that these algorithms are merely heuristics for semantic analysis, not strict algorithms for achieving the goal.
The problem of semantic analysis is one of the main problems in Artificial Intelligence/Machine Learning studies since the first computers appeared.
Perhaps "Term Frequency - Inverse Document Frequency" (TF-IDF) would be useful...
You can use this in two steps:
1 - Try topic modeling algorithms:
Latent Dirichlet Allocation
Latent word Embeddings
2 - After that you can select the most representative word of every topic as a tag

Resources