Is POS/NER able to differentiate between the same word being used in multiple contexts? - pos-tagger

I have a collection of over 1 million bodies of text. Within those bodies are multiple entities whose names mimic common stop words and phrases.
This has created issues when tokenizing the data, as there are ~50 entities with the same problem. To counteract this, I've exempted the matched stop words from removal. This works, but ideally I'd have a way to differentiate when a token is actually meant to be a stop word vs. an entity, since I only care about the cases where it's used as an entity.
Here's a sample excerpt:
A determined somebody slept. Prior to this, A could never be comfortable with the idea of responsibility. It was foreign, something heard about through a story passed down by words of U. As slow as it could be, A began to find meaning in the words of a story.
A and U are entities/nouns in most of their usages here. POS tagging so far has only labelled A as a determiner, and NER won't tag any instances of the word. Adding the target words to the NER list results in every instance being tagged as an entity, which is not correct either.
So far I've primarily used the Stanford POS Tagger and spaCy for NER.

I think you should try to train your own NER model.
You can do this in three steps, as follows:
1. Label a number of documents in your corpus. You can do this using the spacy-annotator.
2. Train your spaCy NER model from scratch (a rough sketch follows below). You can follow the instructions in the spaCy docs.
3. Use the trained model to predict entities in your corpus.
By labelling a good amount of entities at step 1, the model will learn to differentiate between a determiner and an entity.
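For reference, a minimal spaCy 3.x training loop might look roughly like this; the CHARACTER label and the tiny training sample are purely illustrative, not something from the question:
import spacy
from spacy.training import Example

# Hypothetical labelled data: (text, {"entities": [(start_char, end_char, label)]})
TRAIN_DATA = [
    ("A began to find meaning in the words of a story.",
     {"entities": [(0, 1, "CHARACTER")]}),
]

nlp = spacy.blank("en")          # start from a blank English pipeline
ner = nlp.add_pipe("ner")        # add an empty NER component
ner.add_label("CHARACTER")       # register the custom label

optimizer = nlp.initialize()
for epoch in range(30):
    losses = {}
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer, losses=losses)

doc = nlp("A and U are entities in most of their usages here.")
print([(ent.text, ent.label_) for ent in doc.ents])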

Related

What is the purpose of Tags in Doc2Vec TaggedDocument?

Is it to aid in classification tasks? The docs and tutorials don't explain this; they seem to assume a level of understanding that I don't have. These SO answers get near it but don't say so explicitly:
https://datascience.stackexchange.com/questions/10216/doc2vec-how-to-label-the-paragraphs-gensim
Multiple tags for single document in doc2vec. TaggedDocument
The 'tag' is just the key with which to look up the learned document vector, after training is done.
The original 'Paragraph Vectors' research papers, on which Gensim's Doc2Vec is based, tended to just assume each document had one unique ID – perhaps, a string token just like any other word. (So, too, did a small patch to the original Google word2vec.c that was once shared, long ago, as a limited example of one mode of 'paragraph vectors'.)
In those original formulations, documents had just one unique ID – the lookup key for their vector.
However, it was a fairly obvious/straightforward extension to allow these associated vectors to potentially map to other known shared labels, across many documents. (That is, not a unique vector per document, but a unique vector per label, which might appear on multiple texts.) And further, that multiple such range-of-text vectors might be relevant to a single text, that's known to deserve more-than-one label.
So the word 'tag' was used in the Gensim implementation to convey that this is an association more general than either a unique ID or a known label, though it might in some cases be either.
If you're just starting out, or trying to match early papers, just consider the 'tag' a single unique ID per document. Give every independent document its own unique name – whether it's something natural from your data source (like a unique article title or primary key), or a mere serial number, from '0' to the count of docs in your data.
Only if you're trying other expert/experimental approaches, after understanding the basic approach, would you want to either repeat a 'tag' across multiple documents, or use more than one 'tag' per document. Neither of those approaches is necessary, or typical, in the initial application of Doc2Vec.
(And if you start to re-use known tags in training, Doc2Vec is no longer a strictly 'unsupervised' machine-learning technique, but starts to behave more like a 'supervised' or 'semi-supervised' technique, where you're nudging the algorithm towards desired answers. That's sometimes useful, and appropriate, but starts to complicate estimates of how well your steps are working: you then have to use things like held-back test/validation data to get trustworthy estimates of your system's success.)
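For the simple one-unique-tag-per-document case, the pattern looks roughly like this (a tiny made-up corpus, Gensim 4.x attribute names, just to show the tag acting as a lookup key):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = [["human", "machine", "interface"], ["graph", "minors", "survey"]]
corpus = [TaggedDocument(words=words, tags=[str(i)]) for i, words in enumerate(texts)]

model = Doc2Vec(corpus, vector_size=20, min_count=1, epochs=40)

vec = model.dv["0"]                 # the 'tag' is just the key for the learned doc-vector
print(model.dv.most_similar("0"))   # other docs' tags, ranked by vector similarity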

How to interpret doc2vec classifier in terms of words?

I have trained a doc2vec (PV-DM) model in gensim on documents which fall into a few classes. I am working in a non-linguistic setting where both the number of documents and the number of unique words are small (~100 documents, ~100 words) for practical reasons. Each document has perhaps 10k tokens. My goal is to show that the doc2vec embeddings are more predictive of document class than simpler statistics and to explain which words (or perhaps word sequences, etc.) in each document are indicative of class.
I get good performance from a (cross-validated) classifier trained on the embeddings compared to one trained on the simpler statistics, but I am still unsure of how to connect the results of the classifier to any features of a given document. Is there a standard way to do this? My first inclination was to simply pass the co-learned word embeddings through the document classifier in order to see which words inhabited which classifier-partitioned regions of the embedding space. The document classes output for the word embeddings are very consistent across cross-validation splits, which is encouraging, although I don't know how to turn these effective labels into a statement to the effect of "Document X got label Y because of such-and-such properties of words A, B and C in the document".
Another idea is to look at similarities between word vectors and document vectors. The ordering of similar word vectors is pretty stable across random seeds and hyperparameters, but the output of this sort of labeling does not correspond at all to the output from the previous method.
Thanks in advance for any help.
Edit: Here are some clarifying points. The tokens in the "documents" are ordered, and they are measured from a discrete-valued process whose states, I suspect, get their "meaning" from context in the sequence, much like words. There are only a handful of classes, usually between 3 and 5. The documents are given unique tags and the classes are not used for learning the embedding. The embeddings have rather low dimension, always < 100, and are learned over many epochs, since I am only worried about overfitting when the classifier is learned, not the embeddings. For now, I'm using a multinomial logistic regressor for classification, but I'm not married to it. On that note, I've also tried using the normalized regressor coefficients as a vector in the embedding space to which I can compare words, documents, etc.
That's a very small dataset (100 docs) and vocabulary (100 words) compared to most published Doc2Vec work, which has usually used tens of thousands or millions of distinct documents.
That each doc is thousands of words, and that you're using the PV-DM mode that mixes both doc-to-word and word-to-word contexts for training, helps a bit. I'd still expect you might need to use a smaller-than-default dimensionality (vector_size<<100) and more training epochs - but if it does seem to be working for you, great.
You don't mention how many classes you have, nor what classifier algorithm you're using, nor whether known classes are being mixed into the (often unsupervised) Doc2Vec training mode.
If you're only using known classes as the doc-tags, and your "a few" classes is, say, only 3, then to some extent you only have 3 unique "documents", which you're training on in fragments. Using only "a few" unique doctags might be prematurely hiding variety in the data that could be useful to a downstream classifier.
On the other hand, if you're giving each doc a unique ID - the original 'Paragraph Vectors' paper approach, and then you're feeding those to a downstream classifier, that can be OK alone, but may also benefit from adding the known-classes as extra tags, in addition to the per-doc IDs. (And perhaps if you have many classes, those may be OK as the only doc-tags. It can be worth comparing each approach.)
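A sketch of that 'unique ID plus class as an extra tag' setup, where token_lists and class_labels are stand-ins for your own data:
from gensim.models.doc2vec import TaggedDocument

# each doc keeps its own unique ID and also shares a class tag with other docs of its class
docs = [TaggedDocument(words=tokens, tags=[f"doc_{i}", label])
        for i, (tokens, label) in enumerate(zip(token_lists, class_labels))]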
I haven't seen specific work on making Doc2Vec models explainable, other than the observation that when you are using a mode which co-trains both doc- and word- vectors, the doc-vectors & word-vectors have the same sort of useful similarities/neighborhoods/orientations as word-vectors alone tend to have.
You could simply try creating synthetic documents, or tampering with real documents' words via targeted removal/addition of candidate words, or blended mixes of documents with strong/correct classifier predictions, to see how much that changes either (a) their doc-vector, & the nearest other doc-vectors or class-vectors; or (b) the predictions/relative-confidences of any downstream classifier.
(A wishlist feature for Doc2Vec for a while has been to synthesize a pseudo-document from a doc-vector. See this issue for details, including a link to one partial implementation. While the mere ranked list of such words would be nonsense in natural language, it might give doc-vectors a certain "vividness".)
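A rough sketch of the word-removal probing idea above, assuming model is your trained gensim Doc2Vec and clf is a fitted scikit-learn classifier over doc-vectors (both are placeholders here, not defined):
def class_shift(model, clf, tokens, word):
    # Re-infer the doc-vector with and without one candidate word, and see how much
    # the classifier's per-class probabilities move as a result.
    base = model.infer_vector(tokens, epochs=100)
    ablated = model.infer_vector([t for t in tokens if t != word], epochs=100)
    p_base = clf.predict_proba(base.reshape(1, -1))[0]
    p_ablated = clf.predict_proba(ablated.reshape(1, -1))[0]
    return p_base - p_ablated   # change attributable to `word`, per class

# e.g. rank a document's distinct words by how much removing them shifts class 2:
# shifts = {w: class_shift(model, clf, doc_tokens, w)[2] for w in set(doc_tokens)}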
When you're not using real natural language, some useful things to keep in mind:
- If your 'texts' are really unordered bags-of-tokens, then window may not really be an interesting parameter. Setting it to a very large number can make sense (to essentially put all words in each other's windows), but may not be practical/appropriate given your large docs. Or, try PV-DBOW instead - potentially even mixing known classes & word-tokens in either tags or words.
- The default ns_exponent=0.75 is inherited from word2vec & natural-language corpora, & at least one research paper (linked from the class documentation) suggests that for other applications, especially recommender systems, very different values may help.

gensim doc2vec train more documents from pre-trained model

I am trying to continue training a pre-trained model with new labelled documents (TaggedDocument).
The pre-trained model was trained on documents whose unique IDs follow a label1_index pattern, for instance Good_0, Good_1 ... Good_999.
The total size of that training data is about 7000.
Now I want to train the pre-trained model with new documents whose unique IDs follow a label2_index pattern, for instance Bad_0, Bad_1 ... Bad_1211.
The total size of that training data is about 1211.
The training itself completed without any error, but the problem is that whenever I try to use 'most_similar' it only suggests similar documents labelled with Good_..., whereas I expect documents labelled with Bad_....
If I train everything together from the beginning, it gives me the answers I expected - it infers a newly given document as similar to documents labelled with either Good or Bad.
However, the incremental training above does not behave like the model trained on everything from the beginning.
Is continued training not working properly, or did I make some mistake?
The gensim Doc2Vec class can always be fed extra examples via train(), but it only discovers the working vocabulary of both word-tokens and document-tags during an initial build_vocab() step. So unless words/tags were available during the build_vocab(), they'll be ignored as unknown later. (The words get silently dropped from the text; the tags aren't trained or remembered inside the model.)
The Word2Vec superclass from which Doc2Vec borrows a lot of functionality has a newer, more-experimental parameter on its build_vocab() called update. If set true, that call to build_vocab() will add to, rather than replace, any prior vocabulary. However, as of February 2018, this option doesn't yet work with Doc2Vec, and indeed often causes memory-fault crashes.
But even if/when that can be made to work, providing incremental training examples isn't necessarily a good idea. By only updating parts of the model – those exercised by the new examples – the overall model can get worse, or its vectors made less self-consistent with each other. (The essence of these dense-embedding models is that the optimization over all varied examples results in generally-useful vectors. Training over just some subset causes the model to drift towards being good on just that subset, at likely cost to earlier examples.)
If you need new examples to also become part of the results for most_similar(), you might want to create your own separate set-of-vectors outside of Doc2Vec. When you infer new vectors for new texts, you could add those to that outside set, and then implement your own most_similar() (using the gensim code as a model) to search over this expanding set of vectors, rather than just the fixed set that is created by initial bulk Doc2Vec training.
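A sketch of that 'separate set-of-vectors' idea, using a standalone KeyedVectors as the outside store (Gensim 4.x names; model is the bulk-trained Doc2Vec and the new docs are placeholders):
from gensim.models import KeyedVectors

extra = KeyedVectors(vector_size=model.vector_size)   # separate, growable store

new_docs = {"Bad_0": ["some", "new", "tokens"],
            "Bad_1": ["more", "new", "tokens"]}
for tag, tokens in new_docs.items():
    extra.add_vector(tag, model.infer_vector(tokens, epochs=100))   # inferred, not bulk-trained

query_vec = model.infer_vector(["a", "query", "document"], epochs=100)
print(extra.most_similar([query_vec], topn=5))   # search only the expanding outside set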

Manually add collocations to gensim phraser

I am doing topic modelling on linguistics papers and I am using Gensim Phrases to identify frequent collocations. I want to be able to mark terms such as 'do-support' and 'it-clefts' as one single token, since they are specific linguistic terminology. However, if I build the Gensim model after taking out stopwords, these collocations will not be found (since they contain stopwords); if I build the model without taking out stopwords (or with a stopword list that doesn't include 'it' or 'do'), it identifies a whole lot of irrelevant collocations. Is there a way to manually add phrases that should be recognised as collocations by Gensim Phrases?
Thanks!
The Phrases class doesn't have the ability to add desired bigrams. Its technique generally does not expect 'stop words' to have been removed before processing.
You could potentially tune Phrases behavior by trying different 'threshold' and 'min_count' values.
If you find some settings are connecting desired-phrases, but then also some unwanted phrases that still fit the same statistical thresholds, maybe that's not a great harm, despite the non-intuitiveness of some of the phrases. All these statistical techniques are imprecise, and often best judged by their end results on quantitative goals – rather than any arbitrary oddities/corner-cases found from an ad-hoc review.
If you did want to dig into the code to add the ability to force certain bigrams, it might be easier via the Phraser utility class, also in gensim's phrases.py module. At the cost of some extra up-front calculation, it reduces the Phrases data to a smaller structure, with just the bigrams that would later pass the combination-threshold. As such, it saves a bit of memory, and performs later corpus-transformations a little faster, but if you only keep the Phraser, you lose the ability to try other thresholds/min_counts below what was used in its creation. But you could potentially force extra hand-chosen bigrams into its structures, after creation, more easily than tampering with the full Phrases model.
Update (April 2021): Starting in Gensim-4.0, the Phraser class has been renamed FrozenPhrases, for better distinction from the training Phrases class. Additionally, a suggestion in a project issue provides a probably-effective way to 'force' certain bigram-phrases to always be promoted. Specifically:
phrases = Phrases(…) # do customary training/etc
frozen_phrases = phrases.freeze() # freeze bigrams' scores for compactness/efficiency
frozen_phrases.phrasegrams['my_phrase'] = float('inf') # set the desired phrase to infinite score
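Applying the frozen phraser then looks roughly like this (illustrative token list, and assuming the default '_' delimiter, so the key 'my_phrase' stands for the bigram 'my' + 'phrase'):
tokens = ["the", "construction", "uses", "my", "phrase", "in", "context"]
print(frozen_phrases[tokens])   # the forced bigram comes out joined as 'my_phrase'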

How to create a gazetteer based Named Entity Recognition(NER) system?

I have tried my hand at many NER tools (OpenNLP, Stanford NER, LingPipe, DBpedia Spotlight, etc.).
But what has constantly evaded me is a gazetteer/dictionary based NER system where my free text is matched with a list of pre-defined entity names, and potential matches are returned.
This way I could have various lists like PERSON, ORGANIZATION, etc. I could dynamically change the lists and get different extractions. This would tremendously decrease training time (since most of these tools are based on a maximum-entropy model, they generally require tagging a large dataset, training the model, etc.).
I have built a very crude gazetteer-based NER system using an OpenNLP POS tagger, which I use to extract all the proper nouns (NP) and then look them up in a Lucene index created from my lists. This however gives me a lot of false positives. For example, if my Lucene index has "Samsung Electronics" and my POS tagger gives me "Electronics" as an NP, my approach returns "Samsung Electronics" since I am doing partial matches.
I have also read people talking about using gazetteer as a feature in CRF algorithms. But I never could understand this approach.
Can any of you guide me towards a clear and solid approach that builds NER on gazetteer and dictionaries?
I'll try to make the use of gazetteers clearer, as I suspect this is what you are looking for. Whatever training algorithm is used (CRF, maxent, etc.), it takes into account features, which most of the time include:
- tokens
- part of speech
- capitalization
- gazetteers
- (and much more)
Actually, gazetteer features provide the model with intermediary information that the training step will take into account, without being explicitly dependent on the list of NEs present in the training corpora. Let's say you have a gazetteer of sports teams: once the model is trained, you can expand the list as much as you want without training the model again. The model will consider any listed sports team as... a sports team, whatever its name.
In practice:
1. Use any NER or ML-based framework.
2. Decide which gazetteers are useful (this is maybe the most crucial part).
3. Assign each gazetteer a relevant tag (e.g. sportteams, companies, cities, monuments, etc.).
4. Populate the gazetteers with large lists of NEs.
5. Make your model take those gazetteers into account as features (see the sketch below).
6. Train a model on a relevant corpus (it should contain many NEs from the gazetteers).
7. Update your lists as much as you want.
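To make the 'gazetteer as a feature' idea concrete, here is a toy sketch using sklearn-crfsuite; the team list, labels, and sentences are all made up:
import sklearn_crfsuite

SPORT_TEAMS = {"barcelona", "liverpool", "juventus"}   # tiny stand-in gazetteer

def token_features(sent, i):
    word = sent[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "in_sportteam_gazetteer": word.lower() in SPORT_TEAMS,   # the gazetteer feature
    }

def sent_features(sent):
    return [token_features(sent, i) for i in range(len(sent))]

X_train = [sent_features(["Liverpool", "won", "again"])]
y_train = [["B-TEAM", "O", "O"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict([sent_features(["Juventus", "lost"])]))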
Hope this helps!
You can try this minimal bash Named-Entity Recognizer:
https://github.com/lasigeBioTM/MER
Demo: http://labs.fc.ul.pt/mer/
