Is it possible to get the meaning of each word using BERT? - huggingface-transformers

I'm a linguist and new to AI, and I would like to know whether BERT can get the meaning of each word based on its context.
From what I've read, BERT can do that and, if I'm not mistaken, it converts each word into a unique vector, but that's not the output I want.
What I want is the meaning of each word, or the components that make up that meaning, written in plain English. Is this possible?

No, you cannot get the meaning of a word in plain English. The whole idea of BERT is to convert plain English into meaningful numerical representations.
Unfortunately, these vectors are not interpretable. This is a general limitation of deep learning compared to traditional ML models that use features you extract yourself.
But note that you can use these representations to find certain relationships between words. For example, words that are close to each other (in terms of some distance measure) have similar meanings. Have a look at this link for more information.
https://opensource.googleblog.com/2013/08/learning-meaning-behind-words.html
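As a rough illustration of "close to each other in terms of some distance measure", here is a minimal sketch using cosine similarity. The three-dimensional vectors are invented for the example and are not real BERT or word2vec output; real embeddings have hundreds of dimensions and would come from a trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy "embeddings", made up for illustration only.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.82, 0.15],
    "banana": [0.1, 0.05, 0.95],
}

def most_similar(word, vectors):
    """Return the other word whose vector is closest to `word` by cosine."""
    return max(
        (w for w in vectors if w != word),
        key=lambda w: cosine_similarity(vectors[word], vectors[w]),
    )
```

The numbers themselves stay uninterpretable; only the relative distances between vectors carry information.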

Related

How to measure similarity between words or very short text

I am working on the problem of finding the nearest document in a list of documents. Each document is a word or a very short sentence (e.g. "jeans" or "machine tool" or "biological tomatoes"). By closest I mean semantically close.
I have tried word2vec embeddings (from the Mikolov article), but the closest words are more contextually linked than semantically linked ("jeans" is linked to "shoes" and not to "trousers" as expected).
I have tried BERT encodings (https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/#32-understanding-the-output) using the last layers, but this faces the same issue.
I have tried Elasticsearch, but it doesn't find semantic similarities.
(The task needs to be solved in French but maybe solving it in English is a good first step)
Note that different sets of word-vectors may vary in how well they capture your desired 'semantic' similarities. (In particular, training with a shorter window may emphasize similarity among words that are drop-in replacements for each other, whereas larger window values may emphasize words that are merely used in similar domains. See this answer for more details.)
You may also want to take a look at "Word Mover's Distance" as a way to compare short texts that contain various mixes of somewhat-similar words. (It's fairly expensive, but should be practical on your short texts. It's available in the Python gensim library as wmdistance() on KeyedVectors instances.)
If you have training data where your specific multi-word phrases are used, in many natural-language-like subtly-varied contexts, you could consider combining all such phrases-of-interest into single tokens (like machine_tool or biological_tomatoes), and training your own domain-specific word-vectors.
To compute similarity between short texts of 2 or 3 words, you can use word2vec and take the average vector of the words in the text.
For example, to represent the text "machine tool" as a single vector using word2vec, get the vector of "machine" and the vector of "tool", then combine them by averaging: add the two vectors and divide by 2 (the number of words). This gives you a vector representation for a text of more than one word.
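The averaging step can be sketched as follows. The `toy` lookup table is a made-up stand-in for a real word2vec model, and skipping unknown words is my own assumption about how to handle out-of-vocabulary tokens.

```python
def average_vector(text, word_vectors):
    """Average the vectors of the known words in `text`.

    `word_vectors` maps a word to a list of floats; unknown words are skipped.
    """
    vectors = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not vectors:
        raise ValueError("no known words in text")
    dim = len(vectors[0])
    # Component-wise sum divided by the number of words.
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]

# Hypothetical 3-dimensional vectors, invented for illustration only.
toy = {"machine": [1.0, 0.0, 2.0], "tool": [3.0, 4.0, 0.0]}
```

With these toy values, `average_vector("machine tool", toy)` gives `[2.0, 2.0, 1.0]`.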
You can also use something like doc2vec, which is built on top of word2vec and is designed to produce a vector for a sentence or paragraph.
You might try document embeddings, which are built on top of word2vec.
However, note that word and document embeddings do not always capture the "desired similarity"; they just learn a language model on your corpus and are heavily influenced by text size and word frequency.
How big is your corpus? If you need it just to perform some classification it might be better to train your vectors on a large dataset such as Google News corpus.

Grammar parsing in Ruby

I have a task ahead of me which relies on interpreting structure of a text – to be precise, a monolingual dictionary. The dictionary has quite complex entries: up to 29 unique elements, and some are nested within others. I am designing my own XML schema for the dictionary, but I would like to write a program that parses the plain text I have automatically.
I have some basic skills in Ruby and I am a rather experienced RegEx user, but I think creating lots of if-trees and extremely long RegEx formulas is probably not the best idea. I have found some information on Parsing Expression Grammar, Backus Normal Form and W-grammar, but it is somewhat unclear what each of them is best suited for.
My question is: what is the best way to interpret the structure of a text written in a natural language? I don't want to interpret the language itself, but rather to divide each entry into segments based on the characters and keywords used, as well as their neighborhood. What gems and resources would you suggest?
Edit: here's an example of a moderately simple entry from the dictionary (in Polish). What I want to do is to tag each element (senses, explanations, collocation, label markers etc.). As you can see, I am looking for an efficient way to encompass a large number of cases in a tree-like form.
Another problem is that I want to have lots of captures, as I want to tag the segments in XML from bigger to smaller.
This looks like a problem that would be well suited for Treetop. I don't think I have enough information to be sure that it will work, but being able to combine regular expressions into a larger structure where each of the 29 elements can be managed and their information extracted/represented using any of Ruby's features as appropriate, seems like the sort of feature set you need.

Extract only English sentences

I need to extract posts and tweets from Facebook and Twitter into our database for analysis. My problem is that the system can process only English sentences (phrases). So how can I remove non-English posts and tweets from my database?
If you know of any NLP algorithm that can do this, please tell me.
Thanks and regards
Avoiding automatic language identification where possible is usually preferable - for instance, https://dev.twitter.com/docs/api/1/get/search shows that returned tweets contain a field iso_language_code which might be helpful.
If that's not good enough, you'll have to either
look for existing language identification libraries in whatever language you're using; or
get your hands on a sufficient amount of English text (dumps of English Wikipedia, say, or any of the Google n-gram models) and implement something like http://www.cavar.me/damir/LID/.
Get an English dictionary and see if the majority of the words in your text are in it. Since you are looking at online text, be sure to include common slang and abbreviations.
This can run very quickly if you store the dictionary in a trie data structure.
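A minimal sketch of the dictionary check with a trie might look like this. The 0.5 threshold, the tokenizing regex and the tiny word list in the usage note are my own assumptions; a real word list (including slang and abbreviations) would have to be loaded from a file.

```python
import re

class Trie:
    """Minimal trie storing lowercase words as nested dicts."""

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.add(word)

    def add(self, word):
        node = self.root
        for ch in word.lower():
            node = node.setdefault(ch, {})
        node[None] = True  # end-of-word marker (None can never be a character)

    def __contains__(self, word):
        node = self.root
        for ch in word.lower():
            if ch not in node:
                return False
            node = node[ch]
        return None in node

def looks_english(text, dictionary, threshold=0.5):
    """True if more than `threshold` of the alphabetic tokens are in `dictionary`."""
    tokens = re.findall(r"[a-zA-Z']+", text)
    if not tokens:
        return False
    hits = sum(1 for token in tokens if token in dictionary)
    return hits / len(tokens) > threshold
```

For example, with `d = Trie(["the", "cat", "is", "on", "mat"])`, the call `looks_english("The cat is on the mat", d)` is true and `looks_english("Le chat est sur le tapis", d)` is false. A plain Python `set` would work with the same `looks_english` function, since only the `in` operator is used.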
I think fancy NLP is a bit overkill for this task. You don't need to identify the language if it's not English so all you have to do is test your text with some simple characteristics of the English language.
I have tried using standard libraries for language detection on tweets. You will get a lot of false negatives because there are a lot of non-standard characters in names, smilies etc. This problem is more severe in smaller posts where the signal-to-noise ratio is lower.
The main problem is not the algorithm but the outdated data-sources. I would suggest crawling/streaming a new one from Twitter. The language flag in Twitter is based on geographical information, so that will not work in all cases. (A Chinese person can still write Chinese posts in the USA.) I would suggest using a white-list of many English-speaking people and collecting their posts.
I wrote a little tweet language classifier (either English or not) that was 95+% accurate, if I'm remembering right. I think it was just naive Bayes + 1000 training instances. Combine that with location information and you can do even better.
I found this project, the source code is very clear. I have tested and it runs pretty well.
http://code.google.com/p/guess-language/
Have you tried SVD (Singular Value Decomposition) for LSI (Latent Semantic Indexing) and LSA (Latent Semantic Analysis)? See: http://alias-i.com/lingpipe/demos/tutorial/svd/read-me.html

How do I approximate "Did you mean?" without using Google?

I am aware of the duplicates of this question:
How does the Google “Did you mean?” Algorithm work?
How do you implement a “Did you mean”?
... and many others.
These questions are interested in how the algorithm actually works. My question is more like: Let's assume Google did not exist or maybe this feature did not exist and we don't have user input. How does one go about implementing an approximate version of this algorithm?
Why is this interesting?
Ok. Try typing "qualfy" into Google and it tells you:
Did you mean: qualify
Fair enough. It uses Statistical Machine Learning on data collected from billions of users to do this. But now try typing this: "Trytoreconnectyou" into Google and it tells you:
Did you mean: Try To Reconnect You
Now this is the more interesting part. How does Google determine this? Does it have a dictionary handy and guess the most probable words, again using user input? And how does it differentiate between a misspelled word and a sentence?
Now considering that most programmers do not have access to input from billions of users, I am looking for the best approximate way to implement this algorithm and what resources are available (datasets, libraries etc.). Any suggestions?
Assuming you have a dictionary of words (all the words that appear in the dictionary in the worst case, all the phrases that appear in the data in your system in the best case) and that you know the relative frequency of the various words, you should be able to reasonably guess at what the user meant via some combination of the similarity of the word and the number of hits for the similar word. The weights obviously require a bit of trial and error, but generally the user will be more interested in a popular result that is a bit linguistically further away from the string they entered than in a valid word that is linguistically closer but only has one or two hits in your system.
The second case should be a bit more straightforward. You find all the valid words that begin the string ("T" is invalid, "Tr" is invalid, "Try" is a word, "Tryt" is not a word, etc.) and for each valid word, you repeat the algorithm for the remaining string. This should be pretty quick assuming your dictionary is indexed. If you find a result where you are able to decompose the long string into a set of valid words with no remaining characters, that's what you recommend. Of course, if you're Google, you probably modify the algorithm to look for substrings that are reasonably close typos to actual words and you have some logic to handle cases where a string can be read multiple ways with a loose enough spellcheck (possibly using the number of results to break the tie).
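The decomposition of a run-together string can be sketched as a recursive search with memoization. This is only an illustration of the idea above, not what Google does; the word set in the usage note is made up, and trying longer words first is my own choice (the description above walks prefixes from shortest to longest).

```python
from functools import lru_cache

def segment(text, dictionary):
    """Split `text` into dictionary words that cover every character.

    Returns a list of words, or None if no complete decomposition exists.
    """
    text = text.lower()

    @lru_cache(maxsize=None)
    def solve(start):
        if start == len(text):
            return ()  # nothing left: successful decomposition
        for end in range(len(text), start, -1):  # longer prefixes first
            word = text[start:end]
            if word in dictionary:
                rest = solve(end)
                if rest is not None:
                    return (word,) + rest
        return None  # no valid word starts here

    result = solve(0)
    return list(result) if result is not None else None
```

With `words = {"try", "to", "reconnect", "you", "re", "connect"}`, `segment("Trytoreconnectyou", words)` returns `["try", "to", "reconnect", "you"]`. Handling typos inside the run-together string would need fuzzy matching on top of this.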
From the horse's mouth: How to Write a Spelling Corrector
The interesting thing here is how you don't need a bunch of query logs to approximate the algorithm. You can use a corpus of mostly-correct text (like a bunch of books from Project Gutenberg).
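A compact sketch in the spirit of that approach: count words in a corpus of mostly-correct text, generate every string one edit away from the input, and pick the known candidate with the highest count. This is not the linked article's code verbatim; it only goes one edit deep, and the training text in the usage note is a made-up stand-in for a real corpus.

```python
import re
from collections import Counter

def train(corpus_text):
    """Count word frequencies in a corpus of mostly-correct text."""
    return Counter(re.findall(r"[a-z]+", corpus_text.lower()))

def edits1(word):
    """All strings one delete, transpose, replace or insert away from `word`."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, counts):
    """Return `word` if known, else the most frequent known word one edit away."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    if not candidates:
        return word  # nothing better found; leave the input unchanged
    return max(candidates, key=counts.get)
```

With `counts = train("to qualify you must qualify early; quality matters")`, `correct("qualfy", counts)` returns `"qualify"`.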
I think this can be done using a spellchecker along with N-grams.
For Trytoreconnectyou, we first check with all 1-grams (all dictionary words) and find a closest match that's pretty terrible. So we try 2-grams (which can be built by removing spaces from phrases of length 2), and then 3-grams and so on. When we try a 4-gram, we find that there is a phrase that is at 0 distance from our search term. Since we can't do better than that, we return that answer as the suggestion.
I know this is very inefficient, but Peter Norvig's post here clearly suggests that Google uses spelling correctors to generate its suggestions. Since Google has massive parallelization capabilities, it can accomplish this task very quickly.
An impressive tutorial on how this works can be found here: http://alias-i.com/lingpipe-3.9.3/demos/tutorial/querySpellChecker/read-me.html
In a few words, it is a trade-off between modifying the query (at the character or word level) and increasing coverage in the searched documents. For example, "aple" leads to 2 million documents, but "apple" leads to 60 million, and the modification is only one character, so it is obvious that you mean "apple".
Datasets/tools that might be useful:
WordNet
Corpora such as the ukWaC corpus
You can use WordNet as a simple dictionary of terms, and you can boost that with frequent terms extracted from a corpus.
You can use the Peter Norvig link mentioned before as a first attempt, but with a large dictionary, this won't be a good solution.
Instead, I suggest you use something like locality sensitive hashing (LSH). This is commonly used to detect duplicate documents, but it will work just as well for spelling correction. You will need a list of terms and strings of terms extracted from your data that you think people may search for - you'll have to choose a cut-off length for the strings. Alternatively if you have some data of what people actually search for, you could use that. For each string of terms you generate a vector (probably character bigrams or trigrams would do the trick) and store it in LSH.
Given any query, you can use an approximate nearest neighbour search on the LSH described by Charikar to find the closest neighbour out of your set of possible matches.
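To make the idea concrete, here is a sketch of a SimHash-style fingerprint over character trigrams, which is one well-known scheme of the kind attributed to Charikar. Several things here are my own simplifications: MD5 is used only as a convenient way to get pseudo-random bits per trigram, `bits` is assumed to be a multiple of 8 no larger than 128, and the "index" is a plain linear scan rather than a real LSH table.

```python
import hashlib

def char_ngrams(text, n=3):
    """Overlapping character n-grams of `text` (the whole text if shorter than n)."""
    text = text.lower()
    if len(text) < n:
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def simhash(text, bits=64, n=3):
    """Fingerprint where similar texts tend to differ in only a few bits."""
    totals = [0] * bits
    for gram in char_ngrams(text, n):
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            # Each n-gram votes +1 or -1 on every bit position.
            totals[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if totals[i] > 0)

def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def nearest(query, candidates):
    """Candidate whose fingerprint is closest to the query's (linear scan)."""
    q = simhash(query)
    return min(candidates, key=lambda c: hamming(q, simhash(c)))
```

A real setup would store the fingerprints in hash tables keyed on bit bands so that near neighbours can be found without scanning everything. On very short strings there are few trigrams to vote, so the fingerprints are noisy.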
Note: links removed as I'm a new user - sorry.
@Legend - Consider using one of the variations of the Soundex algorithm. It has some known flaws, but it works decently well in most applications that need to approximate misspelled words.
Edit (2011-03-16):
I suddenly remembered another Soundex-like algorithm that I had run across a couple of years ago. In this Dr. Dobb's article, Lawrence Philips discusses improvements to his Metaphone algorithm, dubbed Double Metaphone.
You can find a Python implementation of this algorithm here, and more implementations on the same site here.
Again, these algorithms won't be the same as what Google uses, but for English language words they should get you very close. You can also check out the wikipedia page for Phonetic Algorithms for a list of other similar algorithms.
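For reference, a sketch of classic American Soundex follows (not Metaphone or Double Metaphone, which are considerably longer). Treating every character outside the six consonant groups like a vowel, including non-English letters, is my own simplification.

```python
def soundex(name):
    """Four-character American Soundex code, e.g. 'Robert' -> 'R163'."""
    codes = {}
    for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in group:
            codes[ch] = digit

    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""

    result = letters[0].upper()
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so a repeated code after a vowel counts
    return (result + "000")[:4]  # pad or truncate to one letter plus three digits
```

Words that sound alike map to the same code, for example `soundex("Robert")` and `soundex("Rupert")` are both `"R163"`, so a misspelling can be matched against dictionary words sharing its code.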
Take a look at this: How does the Google "Did you mean?" Algorithm work?

How to determine a string's DNA for likeness to another

I am hoping I am wording this correctly to get across what I am looking for.
I need to compare two pieces of text. If the two strings are alike, I would like to get scores that are very alike; if the strings are very different, I need scores that are very different.
If I take an MD5 hash of an email and change one character, the hash changes dramatically. I want something that does not change too much. I need to compare how alike two pieces of content are without storing the string.
Update: I am now looking at combining some ideas from the various links people have provided. Ideally I would have liked a single-input function to create my score, so I am looking at using a reference string to always compare my input to. I am also looking at taking the ASCII character codes and summing them up. Still reading all the links provided.
What you're looking for is an LCS algorithm (see also Levenshtein distance). You may also try Soundex or some other phonetic algorithm.
Reading your comments, it sounds like you are actually trying to compare entire documents, each containing many words.
This is done successfully in information retrieval systems by treating documents as N-dimensional points in space. Each word in the language is an axis. The distance along the axis is determined by the number of times that word appears in the document. Similar documents are then "near" each other in space.
This way, the whole document doesn't need to be stored, just its word counts. And usually the most common words in the language are not counted at all.
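A small sketch of this word-count representation, compared with cosine similarity, is below. The stop-word list is a tiny made-up sample, and cosine is just one common choice of "nearness"; the answer above does not prescribe a particular measure.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; real systems use a much longer one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def word_counts(text):
    """Word-count vector of a document, ignoring very common words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def cosine(c1, c2):
    """Cosine similarity of two count vectors: 1.0 = same direction, 0.0 = no shared words."""
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    if n1 == 0 or n2 == 0:
        return 0.0
    return dot / (n1 * n2)
```

Only the `Counter` has to be kept per document, not the text itself. Two documents with no words in common score 0.0, and two documents with the same non-stop words in the same proportions score 1.0.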
Check their Levenshtein Distance
In PHP you even have the levenshtein() function, which does exactly that.
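If you are not in PHP, the distance is short to write by hand. Here is a sketch of the usual dynamic-programming version in Python, with unit cost for insertions, deletions and substitutions; it keeps only two rows of the table in memory.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]
```

For example, `levenshtein("kitten", "sitting")` is 3. The running time is proportional to the product of the two lengths, so this is fine for short strings and slow for whole documents.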
I need to compare two pieces of text. If the two strings are alike, I would like to get scores that are very alike; if the strings are very different, I need scores that are very different.
It really depends on what you mean by "same" or "different". For example, if someone replaces "United States of America" with "USA" in your string, is that mostly the same string (because USA is just an abbreviation for something longer), or is it very different (because a lot of characters changed)?
You essentially need to either devise a function that describes how to compute "sameness" or use a pre-existing definition thereof. For example, the aforementioned Levenshtein distance measures total difference based on the number of changes you have to make to get to the original string.
Since the Levenshtein distance needs both input strings to produce a value, you would have to store all strings.
You could, however, use a small number of strings as markers and only store these as strings.
You would then calculate the Levenshtein distance from a new string to each of these marker strings and store these values. You could then guess that two strings that have a similar Levenshtein distance to all markers are also similar to each other. It would likely be sensible to "engineer" these markers in such a way that their mutual Levenshtein distance is as large as possible. I don't know whether there has been some research in this direction.
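The marker idea can be sketched like this. The marker strings are arbitrary placeholders I made up, and `signature_gap` is a hypothetical helper name. One thing that can be said for certain: because Levenshtein distance obeys the triangle inequality, the difference between two strings' distances to the same marker can never exceed their distance to each other. So a large gap proves the strings differ, while a small gap only suggests they might be similar.

```python
def levenshtein(a, b):
    """Unit-cost edit distance (two-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]

# Hypothetical marker strings; only these are stored as text.
MARKERS = ["the quick brown fox", "0123456789", "zzzzzzzzzz"]

def signature(text, markers=MARKERS):
    """Distances from `text` to every marker; this list is what gets stored."""
    return [levenshtein(text, m) for m in markers]

def signature_gap(sig_a, sig_b):
    """Largest per-marker difference: a lower bound on the true edit distance."""
    return max(abs(x - y) for x, y in zip(sig_a, sig_b))
```

Two unrelated strings of similar length can still end up with nearly identical signatures, so treat a small gap as a hint rather than a match.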
Many people have suggested looking at distance/metric-like approaches, and I think the wording of the question leads that way. (By the way, a hash like MD5 is trying to do pretty much the opposite of what a metric does, so it's hardly surprising that this wouldn't work for you. There are similar ideas that don't change much under small deltas, but I suspect they don't encode enough information for what you want to do.)
Particularly given your update in the comments, though, I think this type of approach is not very helpful.
What you are looking for is more of a clustering problem, where you want to generate a signature (i.e. feature vector) from each email and later compare it to new inputs. So essentially what you have is a machine learning problem. Deciding what "close" means may be a bit of a challenge. To get started, though, assuming it actually is emails you're looking at, you may do well to look at the sorts of feature generation done by many spam filters. This will give you a space (probably Euclidean, at least to start) in which to measure distances based on a signature (feature vector).
Without knowing more about your problem it's hard to be more specific.

Resources