Adding words to gensim word2vec model, but it's not shown in model.wv - gensim

I have a pre-trained model, but I need to add some new words in it.
I tried:
model.build_vocab([[new_word1, new_word2]], update=True)
model.train([[new_word1, new_word2]], total_examples=model.corpus_count, epochs=model.epochs)
But when I check:
model.wv[new_word1]
model.wv[new_word2]
I got
KeyError: "Key {new_word1} not present"
same as new_word2
I have checked this
How to add words and vectors manually to Word2vec gensim?
How can I solve it?
Thanks

If you enable logging at the INFO level, you may see more hints of where things may not be having the expeted effect.
In particular, the default min_count value used by Word2Vec is 5, meaning any words that appear fewer than 5 times in a corpus fed to .build_vocab() will be ignored. (Ignoring such rare words is almost always the right thing to do with the word2vec algorithm, which can only learn useful word-vectors when there are many varied examples of a word's usage.)
If you test is truly just 2 new words, each with just one use, a model with reasonable defaults will ignore those two single-occurrence words.
Separately: expanding the vocabulary of an existing model is a tricky, error-prone process. Most improvised/naive ways of doing it are unlikely to reliably give good results. In most cases the safer, more robust process would be re-training with all text, old and new, rather than tiny new increments.

Related

How to interpret doc2vec classifier in terms of words?

I have trained a doc2vec (PV-DM) model in gensim on documents which fall into a few classes. I am working in a non-linguistic setting where both the number of documents and the number of unique words are small (~100 documents, ~100 words) for practical reasons. Each document has perhaps 10k tokens. My goal is to show that the doc2vec embeddings are more predictive of document class than simpler statistics and to explain which words (or perhaps word sequences, etc.) in each document are indicative of class.
I have good performance of a (cross-validated) classifier trained on the embeddings compared to one compared on the other statistic, but I am still unsure of how to connect the results of the classifier to any features of a given document. Is there a standard way to do this? My first inclination was to simply pass the co-learned word embeddings through the document classifier in order to see which words inhabited which classifier-partitioned regions of the embedding space. The document classes output on word embeddings are very consistent across cross validation splits, which is encouraging, although I don't know how to turn these effective labels into a statement to the effect of "Document X got label Y because of such and such properties of words A, B and C in the document".
Another idea is to look at similarities between word vectors and document vectors. The ordering of similar word vectors is pretty stable across random seeds and hyperparameters, but the output of this sort of labeling does not correspond at all to the output from the previous method.
Thanks for help in advance.
Edit: Here are some clarifying points. The tokens in the "documents" are ordered, and they are measured from a discrete-valued process whose states, I suspect, get their "meaning" from context in the sequence, much like words. There are only a handful of classes, usually between 3 and 5. The documents are given unique tags and the classes are not used for learning the embedding. The embeddings have rather dimension, always < 100, which are learned over many epochs, since I am only worried about overfitting when the classifier is learned, not the embeddings. For now, I'm using a multinomial logistic regressor for classification, but I'm not married to it. On that note, I've also tried using the normalized regressor coefficients as vector in the embedding space to which I can compare words, documents, etc.
That's a very small dataset (100 docs) and vocabulary (100 words) compared to much published work of Doc2Vec, which has usually used tens-of-thousands or millions of distinct documents.
That each doc is thousands of words and you're using PV-DM mode that mixes both doc-to-word and word-to-word contexts for training helps a bit. I'd still expect you might need to use a smaller-than-defualt dimensionaity (vector_size<<100), & more training epochs - but if it does seem to be working for you, great.
You don't mention how many classes you have, nor what classifier algorithm you're using, nor whether known classes are being mixed into the (often unsupervised) Doc2Vec training mode.
If you're only using known classes as the doc-tags, and your "a few" classes is, say, only 3, then to some extent you only have 3 unique "documents", which you're training on in fragments. Using only "a few" unique doctags might be prematurely hiding variety on the data that could be useful to a downstream classifier.
On the other hand, if you're giving each doc a unique ID - the original 'Paragraph Vectors' paper approach, and then you're feeding those to a downstream classifier, that can be OK alone, but may also benefit from adding the known-classes as extra tags, in addition to the per-doc IDs. (And perhaps if you have many classes, those may be OK as the only doc-tags. It can be worth comparing each approach.)
I haven't seen specific work on making Doc2Vec models explainable, other than the observation that when you are using a mode which co-trains both doc- and word- vectors, the doc-vectors & word-vectors have the same sort of useful similarities/neighborhoods/orientations as word-vectors alone tend to have.
You could simply try creating synthetic documents, or tampering with real documents' words via targeted removal/addition of candidate words, or blended mixes of documents with strong/correct classifier predictions, to see how much that changes either (a) their doc-vector, & the nearest other doc-vectors or class-vectors; or (b) the predictions/relative-confidences of any downstream classifier.
(A wishlist feature for Doc2Vec for a while has been to synthesize a pseudo-document from a doc-vector. See this issue for details, including a link to one partial implementation. While the mere ranked list of such words would be nonsense in natural language, it might give doc-vectors a certain "vividness".)
Whn you're not using real natural language, some useful things to keep in mind:
if your 'texts' are really unordered bags-of-tokens, then window may not really be an interesting parameter. Setting it to a very-large number can make sense (to essentially put all words in each others' windows), but may not be practical/appropriate given your large docs. Or, trying PV-DBOW instead - potentially even mixing known-classes & word-tokens in either tags or words.
the default ns_exponent=0.75 is inherited from word2vec & natural-language corpora, & at least one research paper (linked from the class documentation) suggests that for other applications, especially recommender systems, very different values may help.

How are keyword clouds constructed?

How are keyword clouds constructed?
I know there are a lot of nlp methods, but I'm not sure how they solve the following problem:
You can have several items that each have a list of keywords relating to them.
(In my own program, these items are articles where I can use nlp methods to detect proper nouns, people, places, and (?) possibly subjects. This will be a very large list given a sufficiently sized article, but I will assume that I can winnow the list down using some method by comparing articles. How to do this properly is what I am confused about).
Each item can have a list of keywords, but how do they pick keywords such that the keywords aren't overly specific or overly general between each item?
For example, trivially "the" can be a keyword that is a lot of items.
While "supercalifragilistic" could only be in one.
I suppose that I could create a heuristic where if a word exists in n% of the items where n is sufficiently small, but will return a nice sublist (say 5% of 1000 articles is 50, which seems reasonable) then I could just use that. However, the issue that I take with this approach is that given two different sets of entirely different items, there is most likely some difference in interrelatedness between the items, and I'm throwing away that information.
This is very unsatisfying.
I feel that given the popularity of keyword clouds there must have been a solution created already. I don't want to use a library however as I want to understand and manipulate the assumptions in the math.
If anyone has any ideas please let me know.
Thanks!
EDIT:
freenode/programming/guardianx has suggested https://en.wikipedia.org/wiki/Tf%E2%80%93idf
tf-idf is ok btw, but the issue is that the weighting needs to be determined apriori. Given that two distinct collections of documents will have a different inherent similarity between documents, assuming an apriori weighting does not feel correct
freenode/programming/anon suggested https://en.wikipedia.org/wiki/Word2vec
I'm not sure I want something that uses a neural net (a little complicated for this problem?), but still considering.
Tf-idf is still a pretty standard method for extracting keywords. You can try a demo of a tf-idf-based keyword extractor (which has the idf vector, as you say apriori determined, estimated from Wikipedia). A popular alternative is the TextRank algorithm based on PageRank that has an off-the-shelf implementation in Gensim.
If you decide for your own implementation, note that all algorithms typically need plenty of tuning and text preprocessing to work correctly.
The minimum you need to do is removing stopwords that you know that they never can be a keyword (prepositions, articles, pronouns, etc.). If you want something fancier, you can use for instance Spacy to keep only desired parts of speech (nouns, verbs, adjectives). You can also include frequent multiword expressions (gensim has good function for automatic collocation detection), named entities (spacy can do it). You can get better results if you run coreference resolution and substitute pronouns with what they refer to... There are endless options for improvements.

Manually add collocations to gensim phraser

I am doing topic modelling on linguistics papers and I am using the Gensim Phrases to identify frequent collocations. I want to be able to mark terms as 'do-support' and 'it-clefts' as one single word, since they are specific linguistic terminology. However, if I make the Gensim model after taking out stopwords, these collocations will not be found (since they contain stopwords), if I make the model after taking out stopwords (or stopwords not including 'it' or 'do'), it identifies a whole lot of irrelevant collocations. Is there a way to manually add phrases that should be recognised as collocations by the Gensim Phrases?
Thanks!
The Phrases class doesn't have the ability to add desired bigrams. Its technique generally does not expect 'stop words' to have been removed before processing.
You could potentially tune Phrases behavior by trying different 'threshold' and 'min_count' values.
If you find some settings are connecting desired-phrases, but then also some unwanted phrases that still fit the same statistical thresholds, maybe that's not a great harm, despite the non-intuitiveness of some of the phrases. All these statistical techniques are imprecise, and often best judged by their end results on quantitative goals – rather than any arbitrary oddities/corner-cases found from an ad-hoc review.
If you did want to dig into the code to add the ability to force certain bigrams, it might be easier via the Phraser utility class, also in gensim's phrases.py module. At the cost of some extra up-front calculation, it reduces the Phrases data to a smaller structure, with just the bigrams that would later pass the combination-threshold. As such, it saves a bit of memory, and performs later corpus-transformations a little faster, but if you only keep the Phraser, you lose the ability to try other thresholds/min_counts below what was used in its creation. But you could potentially force extra hand-chosen bigrams into its structures, after creation, more easily than tampering with the full Phrases model.
Update (April 2021): Starting in Gensim-4.0, the Phraser class has been renamed FrozenPhrases, for better distinction from the training Phrases class. Additionally, a suggestion in a project issue provides a probably-effective way to 'force' certain bigram-phrases to always be promoted. Specifically:
phrases = Phrases(…) # do customary training/etc
frozen_phrases = phrases.freeze() # freeze bigrams' scores for compactness/efficiency
frozen_phrases.phrasegrams['my_phrase'] = float('inf') # set the desired phrase to infinite score

Find basic words and estimate their difficulty

I'm looking for a possibly simple solution of the following problem:
Given input of a sentence like
"Absence makes the heart grow fonder."
Produce a list of basic words followed by their difficulty/complexity
[["absence", 0.5], ["make", 0.05], ["the", 0.01"], ["grow", 0.1"], ["fond", 0.5]]
Let's assume that:
all the words in the sentence are valid English words
popularity is an acceptable measure of difficulty/complexity
base word can be understood in any constructive way (see below)
difficulty/complexity is on scale from 0 - piece of cake to 1 - mind-boggling
difficulty bias is ok, better to be mistaken saying easy is though than the other way
working simple solution is preferred to flawless but complicated stuff
[edit] there is no interaction with user
[edit] we can handle any proper English input
[edit] a word is not more difficult than it's basic form (because as smart beings we can create unhappily if we know happy), unless it creates a new word (unlikely is not same difficulty as like)
General ideas:
I considered using Google searches or sites like Wordcount to estimate words popularity that could indicate its difficulty. However, both solutions give different results depending on the form of entered words. Google gives 316m results for fond but 11m for fonder, whereas Wordcount gives them ranks of 6k and 54k.
Transforming words to their basic forms is not a must but solves ambiguity problem (and makes it easy to create dictionary links), however it's not a simple task and its sense could me found arguable. Obviously fond should be taken instead of fonder, however investigating believe instead of unbelievable seems to be an overkill ([edit] it might be not the best example, but there is a moment when modifying basic word we create a new one like -> likely) and words like doorkeeper shouldn't be cut into two.
Some ideas of what should be consider basic word can be found here on Wikipedia but maybe a simpler way of determining it would be a use of a dictionary. For instance according to dictionary.reference.com unbelievable is a basic word whereas fonder comes from fond but then grow is not the same as growing
Idea of a solution:
It seems to me that the best way to handle the problem would be using a dictionary to find basic words, apply some of the Wikipedia rules and then use Wordcount (maybe combined with number of Google searches) to estimate difficulty.
Still, there might (probably is a simpler and better) way or ready to use algorithms. I would appreciate any solution that deals with this problem and is easy to put in practice. Maybe I'm just trying to reinvent the wheel (or maybe you know my approach would work just fine and I'm wasting my time deliberating instead of coding what I have). I would, however, prefer to avoid implementing frequency analysis algorithms or preparing a corpus of texts.
Some terminology:
The core part of the word is called a stem or a root. More on this distinction later. You can think of the root/stem as the part that carries the main meaning of the word and will appear in the dictionary.
(In English) most words are composed of one root (exception: compounds like "windshield") / one stem and zero or more affixes: the affixes that come after the root/stem are called suffixes, and the affixes that precede the root/stem are called prefixes. Examples: "driver" = "drive" (root/stem) + suffix "-er"; "unkind" = "kind" (root/stem) + "un-" (prefix).
Suffixes/prefixes (=affixes) can be inflectional or derivational. For example, in English, third-person singular verbs have an s on the end: "I drive" but "He drive-s". These kind of agreement suffixes don't change the category of the word: "drive" is a verb regardless of the inflectional "s". On the other hand, a suffix like "-er" is derivational: it takes a verb (e.g. "drive") and turns it into a noun (e.g. "driver")
The stem, is the piece of the word without any inflectional affixes, whereas the root is the piece of the word without any derivational affixes. For instance, the plural noun "drivers" is decomposable into "drive" (root) + "er" (derivational affix, makes a new stem "driver") + "s" (plural).
The process of deriving the "base" form of the word is called "stemming".
So, armed with this terminology it seems that for your task the most useful thing to do would be to stem each form you come across, i.e. remove all the inflectional affixes, and keep the derivational ones, since derivational affixes can change how common the word is considered to be. Think about it this way: if I tell you a new word in English, you will always know how to make it plural, 3rd-person singular, however, you may not know some of the other words you can derive from this). English being inflection-poor language, there aren't a lot of inflectional suffixes to worry about (and Google search is pretty good about stripping them off, so maybe you can use the Google's stemming engine just by running your word forms through google search and getting out the highlighted results):
Third singular verbal -s: "I drive"/"He drive-s"
Nominal plural `-s': "One wug"/"Two wug-s". Note that there are some irregular forms here such as "children", "oxen", "geese", etc. I think I wouldn't worry about these.
Verbal past tense forms and participial forms. The regular ones are easy: the past tense has -ed for past tense and past participle ("I walk"/"I walk-ed"/"I had walk-ed"), but there are quite a few of irregular ones (fall/fell/fallen, dive/dove/dived?, etc). Maybe make a list of these?
Verbal -ing forms: "walk"/"walk-ing"
Adjectival comparative -er and superlative -est. There are a few irregular/suppletive ones ("good"/"better"/"best"), but these should not present a huge problem.
These are the main inflectional affixes in English: I may be forgetting a few that you could discover by picking up an introductory Linguistics books. Also there are going to be borderline cases, such as "un-" which is so promiscuous that we might consider it inflectional. For more information on these types, see Level 1 vs. Level 2 affixation, but I would treat these cases as derivational for your purposes and not stem them.
As far as "grading" how common various stems are, besides google you could various freely-available text corpora. The wikipedia article linked to has a few links to free corpora, and you can find a bunch more by googling. From these corpora you can build a frequency count of each stem, and use that to judge how common the form is.
I'm afraid there is no simple solution to the task of finding "basic" forms. I'm basing that on my memory of my Machine Learning textbook, of which language analysis was part of. You need some database, from which you can get them.
At the same time, please take note that the amount of words people use in everyday language is not that big. You can always ask a user what is the base form of a world you have not seen before. (unless this is your homework, which will be automatically checked)
Eventually, if you don't care about covering all words, you can create simple database, which would contain different forms of the most common words, and then try to use grammatical rules for the less common ones (which would be a good approximation, as actually, the most common words in English are irregular, whereas the uncommon ones are regular, because their original forms have been forgotten).
Note however, i'm no specialist, i'm simply trying to help :-)

Finding personal information in documents (hard problem)

I am tasked with trying to create an automated system that removes personal information from text documents.
Emails, phone numbers are relatively easy to remove. Names are not. The problem is hard because there are names in the documents that need to be kept (eg, references, celebrities, characters etc). The author name needs to be removed from the content (there may also be more than one author).
I have currently thought of the following:
Quite often personal names are located at the beginning of a document
Look at how frequently the name is used in the document (personal names tend to be written just once)
Search for words around the name to find patterns (mentions of university and so on...)
Any ideas? Anyone solved this problem already??
With current technology, doing what what you are describing in a fully automated way with a low error rate is impossible.
It might be possible to come up with an approximate solution, but it would still make a lot of errors...... either false positives or false negatives or some combination of the two.
If you are still really determined to try, I think your best approach would be Bayseian filtering (as used in spam filtering). The reason for this is that it is quite good at assigning probabilities based on relative positions and frequencies of words, and could also learn which names are more likely / less likely to be celebrities etc.
The area of machine learning that you would need to learn about to make an attempt at this would be natural language processing. There are a few different approaches that could be used, bayesian networks (something better then a naive bayes classifier), support vector machines, or neural nets would be areas to research. Whatever system you end up building would probably need to use an annotated corpus (labeled set of data) to learn where names should be. Even with a large corpus, whatever you build will not be 100% accurate, so you would probably be better off setting flags at the names for deletion instead of just deleting all of the words that might be names.
This is a common problem in basic cryptography courses (my first programming job).
If you generated a word histogram of your entire document corpus (each bin is a word on the x-axis whose height is frequency represented by height on the y-axis), words like "this", "the", "and" and so forth would be easy to identify because of their large y-values (frequency). Surnames should at the far right of your histogram--very infrequent; given names towards the left, but not by much.
Does this technique definitively identify the names in each document? No, but it could be used to substantially constrain your search, by eliminating all words whose frequency is larger than X. Likewise, there should be other attributes that constrain your search, such as author names only appear once on the documents they authored and not on any other documents.

Resources