Adding custom dictionary - stanford-nlp

In Stanford sentiment analysis, is there an option for marking specific/custom words as positive (based on our requirements)?
Analysing tweets is giving a negative trend because of the business terms used. Can we neutralize the negative output caused by these words by adding our own custom dictionary?

The cleanest way to do this would be to retrain the sentiment model. Acquire the sentiment training data and manually modify the labels for the words you are concerned about. There are very basic instructions for training on another Stanford Sentiment page. Then use this trained model as you wish!
A very dirty but possibly faster solution would be to modify the trees you get from the standard model after the fact. For example, you'd search an analyzed tree for words of interest and manually modify their sentiment label, then apply some heuristic to propagate this modification up the tree and possibly alter the sentiment of the whole sentence.
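For illustration, here's a rough Python sketch of that second approach, re-parsing a bracketed sentiment tree (the kind CoreNLP can print) with nltk's Tree. The tree string, the override table, and the "take the most positive child" propagation heuristic are all made up for the example.

from nltk.tree import Tree

# Hypothetical bracketed sentiment tree (node labels are sentiment classes 0-4);
# the sentence and labels here are invented for the example.
tree_str = "(1 (1 (2 quarterly) (1 loss)) (2 (2 was) (2 reported)))"
tree = Tree.fromstring(tree_str)

# Words we want to force to a given sentiment class ("2" = neutral).
OVERRIDES = {"loss": "2"}

def override_and_propagate(node):
    # Preterminal: a node whose single child is the word itself.
    if len(node) == 1 and isinstance(node[0], str):
        word = node[0].lower()
        if word in OVERRIDES:
            node.set_label(OVERRIDES[word])
        return node.label()
    child_labels = [override_and_propagate(child) for child in node]
    # Placeholder heuristic: the parent takes the most positive child label.
    node.set_label(max(child_labels))
    return node.label()

override_and_propagate(tree)
print(tree)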

Related

POS/NER able to differentiate between the same word being used in multiple contexts?

I have a collection of over 1 million bodies of text. Within those bodies are multiple entities whose names mimic common stop words and phrases.
This has created issues when tokenizing the data, as there are ~50 entities with this problem. To counteract it, I've disabled stop-word removal for the matched words. This works, but ideally I'd have a way to tell when a token is actually meant as a stop word versus an entity, since I only care about the cases where it's used as an entity.
Here's a sample excerpt:
A determined somebody slept. Prior to this, A could never be comfortable with the idea of responsibility. It was foreign, something heard about through a story passed down by words of U. As slow as it could be, A began to find meaning in the words of a story.
A and U are entities/nouns in most of their usages here. POS tagging so far has only labelled A as a determiner, and NER won't tag any instances of the word. Adding the target tags to the NER list results in every instance being tagged as an entity, which is not correct either.
So far I've primarily used the Stanford POS Tagger and spaCy for NER.
I think you should try to train your own NER model.
You can do this in three steps:
1. Label a number of documents in your corpus. You can do this using the spacy-annotator.
2. Train your spaCy NER model from scratch. You can follow the instructions in the spaCy docs.
3. Use the trained model to predict entities in your corpus.
By labelling a good amount of entities at step 1, the model will learn to differentiate between a determiner and an entity.
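A minimal sketch of step 2 with spaCy v3's programmatic API. The CHARACTER label, the two training sentences, and their character offsets are invented for illustration; real training needs far more labelled data and would normally follow the config-based workflow in the spaCy docs.

import random
import spacy
from spacy.training import Example

# Tiny invented training sample in spaCy's (text, {"entities": [...]}) format.
TRAIN_DATA = [
    ("A began to find meaning in the words of a story.",
     {"entities": [(0, 1, "CHARACTER")]}),
    ("It was a story passed down by words of U.",
     {"entities": [(39, 40, "CHARACTER")]}),
]

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for _, annotations in TRAIN_DATA:
    for _, _, label in annotations["entities"]:
        ner.add_label(label)

examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in TRAIN_DATA]
optimizer = nlp.initialize(lambda: examples)
for _ in range(20):
    random.shuffle(examples)
    nlp.update(examples, sgd=optimizer)

doc = nlp("Prior to this, A could never be comfortable.")
print([(ent.text, ent.label_) for ent in doc.ents])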

Determining the "goodness" of a phrase based on "grammatical" or "contextual" relevancy

Given a random string of words, I would like to assign a "goodness" score to the phrase, where "goodness" is some indication of grammatical and contextual relevancy.
For example:
"the green tree was tall" [Good score]
"delicious tires swim open" [Medium score]
"jump an con porch calmly" [Poor score]
I've been experimenting with the Natural Language Toolkit. I'd considered using a trained tagger to assign parts-of-speech to each word in a phrase, and then parse a corpus for occurrences of that POS pattern. This may give me an indication of grammatical "goodness". However, as the tagger itself is trained on the same corpus that I'm using for validation, I can't imagine the results would be reliable. This approach also does not take into consideration the contextual relevancy of the words.
Is anyone aware of existing projects or research into this sort of thing? How would you approach this?
You could employ two different approaches - supervised and semi-supervised.
Supervised
Assuming you have a labeled dataset of tuples of the form <sentence> <goodness label> (like the ones in your examples), you could first split your dataset into train and test folds (e.g. 4:1).
Then you could simply use BERT feature vectors (these are pre-trained on large volumes of natural language text). The following piece of code gives you the vector for the sentence the green tree was tall (read more here).
from transformers import pipeline
import numpy as np

nlp_features = pipeline('feature-extraction')
output = nlp_features('the green tree was tall')
np.array(output).shape  # (samples, tokens, vector size)
Assuming you vectorize every sentence, you could then train a simple logistic regression model (sklearn) that learns parameters minimizing the prediction error on the training set; finally, you throw the test-set sentences at this model to see how it behaves.
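A rough sketch of that pipeline; the three sentences, their numeric goodness labels, and the mean-pooling over token vectors are assumptions for the example, and with real data you would keep the train/test split described above.

import numpy as np
from sklearn.linear_model import LogisticRegression
from transformers import pipeline

# Invented toy labels: 2 = good, 1 = medium, 0 = poor.
sentences = ["the green tree was tall", "delicious tires swim open", "jump an con porch calmly"]
labels = [2, 1, 0]

extract = pipeline('feature-extraction')
# Mean-pool the token vectors so each sentence becomes one fixed-size vector.
X = np.array([np.mean(extract(s)[0], axis=0) for s in sentences])

clf = LogisticRegression(max_iter=1000).fit(X, labels)
new_vec = np.mean(extract("the old dog barked loudly")[0], axis=0)
print(clf.predict([new_vec]))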
Instead of BERT, you could also use word-embedding vectors as inputs to an LSTM network for training the classifier (like the one here).
Semi-supervised
This is applicable when you don't have sufficient labeled data (although you need a few labeled examples to get started).
In this case, I think what you could do is to map the words of a sentence into POS tag sequences, e.g.,
the green tree was tall --> ARTICLE ADJ NOUN VERB ADJ (see here for more details).
This step would make your method depend less on the words themselves. A model trained on these sequences would try to discover some latent distinguishing characteristics of good sentences from the bad ones.
In particular, you could run a standard text classification approach with Bidirectional LSTMs for training your classifier (this time not with words but with a much smaller vocabulary of POS tags).
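As a small example of the mapping step with NLTK (the tag set will be the Penn Treebank one, e.g. DT/JJ/NN, rather than the ARTICLE/ADJ/... labels written above):

import nltk

# Requires the 'punkt' tokenizer and 'averaged_perceptron_tagger' data;
# adjust the nltk.download() resource names to your NLTK version.
tokens = nltk.word_tokenize("the green tree was tall")
tags = [tag for _, tag in nltk.pos_tag(tokens)]
print(tags)  # e.g. ['DT', 'JJ', 'NN', 'VBD', 'JJ']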
You can use a transformer model from HuggingFace that is fine-tuned for sentence correctness. Specifically, the model has to be fine-tuned on the Corpus of Linguistic Acceptability (CoLA). Here's a Medium article on HuggingFace, transformers, and the fine-tuning process.
You can also get a model that's already fine-tuned and drop it into the text-classification pipeline of HuggingFace's transformers library here. That site hosts fine-tuned models, and you can search there for other models fine-tuned on the CoLA task.
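For example, something along these lines; the checkpoint name below is just one publicly shared CoLA model at the time of writing, so check the Hub for current options.

from transformers import pipeline

# "textattack/bert-base-uncased-CoLA" is an example CoLA-fine-tuned checkpoint.
classifier = pipeline("text-classification", model="textattack/bert-base-uncased-CoLA")
print(classifier("the green tree was tall"))
print(classifier("jump an con porch calmly"))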

Keyword suggestion Algorithm

I have been working on a project which asks me to give keyword/keyphrase suggestions based on the description of a product.
What I have currently: the description of the product and the category of the product (which may or may not be present).
What I want: Machine generated keywords/keyphrases based on description.
What research I have done (NLP-based approach): this problem can be broken down into two separate approaches.
Not using past data: just summarizing the current description.
Method: tokenization, stemming, stopword removal, etc. (preprocessing), then shallow NLP (constituency parsing), retaining only NP and JJ phrases.
This approach doesn't use the descriptions already present in the database.
What I was looking for is a better approach which uses ML algorithms and also uses my past product description data.
I was thinking about applying shallow parsing to the entire dataset, and then suggesting keywords that occur in more than N products.
What algorithm or approach would come in handy?
How can I use my data?
Try looking at basic models like term frequency or TF-IDF first; these give you some important words: https://en.wikipedia.org/wiki/Tf%E2%80%93idf (see the sketch at the end of this answer).
Then look into text clustering (to group related texts together) and topic detection approaches (these can help you find prominent words and topics related to a document).
Then you can find keywords for each cluster (you can also consider the categories of the documents) and try to find the words most relevant to other words.
I suggest reading some (or all) chapters of this book: http://nlp.stanford.edu/IR-book/
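As a concrete starting point, a minimal TF-IDF sketch with scikit-learn; the product descriptions are invented, and the idea is simply to surface the highest-weighted terms per description as candidate keywords.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

descriptions = [
    "wireless bluetooth headphones with noise cancellation",
    "stainless steel water bottle keeps drinks cold",
    "bluetooth speaker with deep bass and long battery life",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(descriptions)
terms = np.array(vectorizer.get_feature_names_out())

for i, doc in enumerate(descriptions):
    weights = tfidf[i].toarray().ravel()
    top = terms[weights.argsort()[::-1][:3]]  # top 3 terms per description
    print(doc, "->", list(top))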

Which classification algorithm can be used for document categorization?

Hey, here is my problem.
Given a set of documents I need to assign each document to a predefined category.
I was going to use the n-gram approach to represent the text-content of each document and then train an SVM classifier on the training data that I have.
Correct me if I misunderstood something, please.
The problem now is that the categories should be dynamic. Meaning, my classifier should handle new training data with new category.
So for example, if I trained a classifier to classify a given document as category A, category B or category C, and then I was given new training data with category D. I should be able to incrementally train my classifier by providing it with the new training data for "category D".
To summarize, I do NOT want to combine the old training data (with 3 categories) and the new training data (with the new/unseen category) and train my classifier again. I want to train my classifier on the fly.
Is this possible to implement with SVM? If not, could you recommend some classification algorithms, or a book/paper that could help me?
Thanks in Advance.
Naive Bayes is a relatively fast incremental classification algorithm.
KNN is also incremental by nature, and even simpler to implement and understand.
Both algorithms are implemented in the open source project Weka as NaiveBayes and IBk for KNN.
However, from personal experience, they are both vulnerable to a large number of non-informative features (which is usually the case with text classification), so some kind of feature selection is usually used to squeeze better performance out of these algorithms, and that can be problematic to implement incrementally.
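The Weka classes above are Java; as a rough Python analogue of the same idea, scikit-learn's MultinomialNB supports incremental updates via partial_fit (note, though, that it needs every class it will ever see declared on the first call, so a truly unseen category still requires care). The texts and labels below are invented.

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# HashingVectorizer is stateless, so it never needs refitting on new batches.
vectorizer = HashingVectorizer(n_features=2**16, alternate_sign=False)
clf = MultinomialNB()

batch1_texts = ["invoice overdue payment", "football match tonight"]
batch1_labels = ["finance", "sports"]
# Every class the model will ever see must be declared on the first call.
clf.partial_fit(vectorizer.transform(batch1_texts), batch1_labels,
                classes=["finance", "sports", "politics"])

batch2_texts = ["election results announced"]
clf.partial_fit(vectorizer.transform(batch2_texts), ["politics"])

print(clf.predict(vectorizer.transform(["late payment reminder"])))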
This blog post by Edwin Chen describes infinite mixture models to do clustering. I think this method supports automatically determining the number of clusters, but I am still trying to wrap my head all the way around it.
The class of algorithms that matches your criteria is called "incremental algorithms". There are incremental versions of almost any method. The easiest to implement is Naive Bayes.

Is there an algorithm that extracts meaningful tags of english text

I would like to extract a reduced collection of "meaningful" tags (10 max) out of an english text of any size.
http://tagcrowd.com/ is quite interesting but the algorithm seems very basic (just word counting)
Is there any other existing algorithm to do this?
There are existing web services for this. Three examples:
Yahoo's Term Extraction API
Topicalizer
OpenCalais
When you subtract the human element (tagging), all that is left is frequency. "Ignore common English words" is the next best filter, since it deals with exclusion instead of inclusion. I tested a few sites, and it is very accurate. There really is no other way to derive "meaning", which is why the Semantic Web gets so much attention these days. It is a way to imply meaning with HTML... of course, that has a human element to it as well.
Basically, this is a text categorization problem/document classification problem. If you have access to a number of already tagged documents, you could analyze which (content) words trigger which tags, and then use this information for tagging new documents.
If you don't want to use a machine-learning approach and you still have a document collection, then you can use metrics like tf.idf to filter out interesting words.
Going one step further, you can use WordNet to find synonyms and replace a word with its synonym if the frequency of the synonym is higher.
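For example, a quick way to pull synonym candidates out of WordNet via NLTK (the word "car" is an arbitrary example, and the 'wordnet' corpus must be downloaded first):

from nltk.corpus import wordnet  # requires nltk.download('wordnet') beforehand

word = "car"
synonyms = {lemma.name() for syn in wordnet.synsets(word) for lemma in syn.lemmas()}
print(synonyms)  # includes 'auto', 'automobile', 'motorcar', ...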
Manning & Schütze contains a lot more introduction on text categorization.
In text classification, this problem is known as dimensionality reduction. There are many useful algorithms in the literature on this subject.
You want to do the semantic analysis of a text.
Word frequency analysis is one of the easiest ways to do semantic analysis. Unfortunately (and obviously) it is also the least accurate one. It can be improved by using special dictionaries (e.g. for synonyms or word forms), "stop-lists" of common words, or other texts (to find those "common" words and exclude them)...
As for other algorithms they could be based on:
Syntax analysis (like trying to find the main subject and/or verb in a sentence)
Format analysis (analyzing headers, bold text, italic... where applicable)
Reference analysis (if the text is in Internet, for example, then a reference can describe it in several words... used by some search engines)
BUT... you should understand that these algorithms are merely heuristics for semantic analysis, not strict algorithms for achieving the goal.
The problem of semantic analysis is one of the main problems in Artificial Intelligence/Machine Learning studies since the first computers appeared.
Perhaps "Term Frequency - Inverse Document Frequency" TF-IDF would be useful...
You can do this in two steps:
1 - Try topic modeling algorithms:
Latent Dirichlet Allocation
Latent word Embeddings
2 - After that, you can select the most representative word of every topic as a tag.
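A toy sketch of both steps with gensim's LDA; the tokenised documents are invented, and real use needs a much larger corpus plus proper preprocessing (stop-word removal, lemmatisation, etc.).

from gensim import corpora
from gensim.models import LdaModel

docs = [
    ["tree", "green", "forest", "leaf"],
    ["engine", "tire", "car", "road"],
    ["forest", "leaf", "branch", "tree"],
]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id in range(2):
    # Step 2: take the most probable word of each topic as a candidate tag.
    top_word = lda.show_topic(topic_id, topn=1)[0][0]
    print(topic_id, top_word)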
