Sentiment analysis

While performing sentiment analysis, how can I make the machine understand that I'm referring to Apple (the iPhone) instead of apple (the fruit)?
Thanks for the advice!

Well, there are several methods.
I would start by checking capitalization: when a word refers to a name, its first letter is usually capitalized.
Before doing sentiment analysis, I would run part-of-speech (POS) tagging and named entity recognition (NER) to tag the relevant words.
Stanford CoreNLP is a good text-analysis project to start with; it will teach you the basic concepts.
Looking at CoreNLP's output, you can see how the tags can help you. And check out the documentation for more info.
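For illustration (my own sketch, not from the original answer): the Stanza Python package from the Stanford NLP group exposes comparable POS and NER annotations. Assuming the English models are downloaded, the brand reading and the fruit reading typically come back with different tags.

    # pip install stanza
    import stanza

    # Download the English models once, then build a pipeline with
    # tokenization, POS tagging, and named entity recognition.
    stanza.download("en")
    nlp = stanza.Pipeline("en", processors="tokenize,pos,ner")

    doc = nlp("I love my new Apple phone, but I also like to eat an apple every day.")

    # POS tags: the brand is usually a proper noun (PROPN), the fruit a common noun (NOUN).
    for sentence in doc.sentences:
        for word in sentence.words:
            print(word.text, word.upos)

    # NER: only the brand mention should come back as an entity (typically ORG).
    for ent in doc.ents:
        print(ent.text, ent.type)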

As described by Ofiris, NER is only one way to solve your problem. I feel it's more effective to use word embeddings to represent your words; that way, the machine automatically picks up the context of a word. For example, "apple" (the fruit) mostly occurs together with words like "eat", but if the input "Apple" appears with "mobile" or any other word from that domain, the machine will understand it means Apple the iPhone maker rather than the fruit. There are two popular ways to generate word embeddings: word2vec and fastText.
Gensim provides reliable implementations of both word2vec and fastText:
https://radimrehurek.com/gensim/models/word2vec.html
https://radimrehurek.com/gensim/models/fasttext.html
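If you want to try this, here is a minimal Gensim word2vec sketch (assuming gensim 4.x, where the dimensionality parameter is vector_size; the toy corpus is far too small to learn anything useful and only shows the API):

    # pip install gensim
    from gensim.models import Word2Vec

    # Toy corpus: each "sentence" is a list of tokens. In practice you would
    # train on a large, domain-relevant corpus so that "apple" and "iphone"
    # end up close together in the embedding space.
    sentences = [
        ["apple", "iphone", "mobile", "release"],
        ["eat", "apple", "fruit", "healthy"],
        ["google", "android", "mobile", "phone"],
    ]

    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

    # Words that occur in similar contexts get similar vectors.
    print(model.wv.most_similar("apple", topn=3))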

In the presence of dates, famous brands, VIPs, or historical figures you can use a NER (named entity recognition) algorithm; in such cases, as suggested by Ofiris, Stanford CoreNLP offers a good named entity recognizer.
For a more general disambiguation of polysemous words (i.e., words having more than one sense, such as "good") you could use a POS tagger coupled with a Word Sense Disambiguation (WSD) algorithm. An example of the latter can be found HERE, but I do not know of any freely downloadable library for this purpose.
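For what it's worth (my own note, not from the answer above), NLTK ships a simple Lesk-based WSD baseline that is free to use; a minimal sketch (results on short contexts are often poor, which is part of why WSD stays hard):

    # pip install nltk   (plus: nltk.download('wordnet'); nltk.download('punkt'))
    from nltk.tokenize import word_tokenize
    from nltk.wsd import lesk

    sentence = "The good news made everyone happy."
    tokens = word_tokenize(sentence)

    # Lesk picks the WordNet synset whose gloss overlaps most with the context.
    sense = lesk(tokens, "good", pos="a")  # 'a' = adjective
    if sense is not None:
        print(sense, "-", sense.definition())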

This problem has already been addressed by many open-source pre-trained NER models. You can also retrain an existing NER model to fine-tune it for this issue.
You can find a demo of NER results produced by spaCy's NER here.
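As a quick illustration of what a pre-trained model gives you (my sketch, assuming spaCy with the en_core_web_sm model installed):

    # pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")

    for text in ["Apple released a new iPhone today.", "I ate an apple for lunch."]:
        doc = nlp(text)
        # The brand mention is typically tagged ORG; the fruit yields no entity.
        print(text, "->", [(ent.text, ent.label_) for ent in doc.ents])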

Related

POS/NER able to differentiate between the same word being used in multiple contexts?

I have a collection of over 1 million bodies of text. Within those bodies are multiple entities whose names mimic common stop words and phrases.
This has created issues when tokenizing the data, as there are ~50 entities with the same problem. To counteract this, I've disabled removal of the matched stop words. This works, but ideally I'd have a way to differentiate when a token is actually meant as a stop word versus an entity, since I only care about the cases where it's used as an entity.
Here's a sample excerpt:
A determined somebody slept. Prior to this, A could never be comfortable with the idea of responsibility. It was foreign, something heard about through a story passed down by words of U. As slow as it could be, A began to find meaning in the words of a story.
A and U are entities/nouns in most of their usages here. So far, POS tagging has only labelled A as a determiner, and NER won't tag any instances of the word. Adding the target words to the NER list results in every instance being tagged as an entity, which is not correct either.
So far I've primarily used the Stanford POS Tagger and spaCy for NER.
I think you should try to train your own NER model.
You can do this in three steps:
1.) Label a number of documents in your corpus. You can do this using the spacy-annotator.
2.) Train your spaCy NER model from scratch. You can follow the instructions in the spaCy docs (see the sketch below).
3.) Use the trained model to predict entities in your corpus.
By labelling a good number of entities at step 1, the model will learn to differentiate between a determiner and an entity.
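A minimal training sketch, assuming spaCy 3.x; the CHARACTER label, the tiny hand-made training set, and the character offsets are placeholders for whatever your annotation step actually produces:

    import random
    import spacy
    from spacy.training import Example

    # Start from a blank English pipeline with only an NER component.
    nlp = spacy.blank("en")
    ner = nlp.add_pipe("ner")
    ner.add_label("CHARACTER")  # hypothetical label for the 'A'/'U' entities

    # Tiny, hand-made training set: (text, {"entities": [(start, end, label)]}).
    TRAIN_DATA = [
        ("A determined somebody slept.", {"entities": [(0, 1, "CHARACTER")]}),
        ("The words of U were passed down.", {"entities": [(13, 14, "CHARACTER")]}),
        ("He read a story to them.", {"entities": []}),
    ]

    optimizer = nlp.initialize()
    for epoch in range(30):
        random.shuffle(TRAIN_DATA)
        for text, annotations in TRAIN_DATA:
            example = Example.from_dict(nlp.make_doc(text), annotations)
            nlp.update([example], sgd=optimizer)

    doc = nlp("Prior to this, A could never be comfortable.")
    print([(ent.text, ent.label_) for ent in doc.ents])

With only a handful of examples the model will be unreliable; the point of the annotation step is to give it enough labelled contexts to separate the determiner from the entity.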

Train non-English Stanford NER models

I'm seeing several posts about training the Stanford NER for other languages,
e.g.: https://blog.sicara.com/train-ner-model-with-nltk-stanford-tagger-english-french-german-6d90573a9486
However, the Stanford CRF classifier uses some language-dependent features (such as part-of-speech tags).
Can we really train non-English models using the same JAR file?
https://nlp.stanford.edu/software/crf-faq.html
Training a NER classifier is language-independent. You have to provide high-quality training data and create meaningful features. The point is that not all features are equally useful for every language. Capitalization, for instance, is a good indicator of a named entity in English, but in German all nouns are capitalized, which makes this feature less useful.
In Stanford NER you can decide which features the classifier uses, so you can disable POS tags (in fact, they are disabled by default). Of course, you could also provide your own POS tags in your desired language.
I hope I could clarify some things.
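As a practical note (my own sketch, not part of the answer above): once you have trained and serialized a model following the CRF FAQ linked in the question, the same stanford-ner.jar loads it regardless of language, for example through NLTK's wrapper. The file names below are placeholders for your own model and jar.

    # pip install nltk; also requires a local Java install and stanford-ner.jar
    from nltk.tag.stanford import StanfordNERTagger

    # Paths are placeholders: point them at your own serialized model
    # (the serializeTo file from your properties file) and the NER jar.
    st = StanfordNERTagger(
        "models/my-german-ner-model.ser.gz",   # hypothetical non-English model
        "stanford-ner.jar",
        encoding="utf-8",
    )

    tokens = "Angela Merkel besuchte Berlin".split()
    print(st.tag(tokens))  # tags depend entirely on the model you trained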
I agree with the first answer that the NER classification model is language-independent.
If you have trouble finding training data, I can suggest this link with a huge number of labelled datasets for different languages.
If you would like to try another model, I suggest ESTNLTK, a library for the Estonian language, which can also fit language-independent NER models (documentation).
Also, here you can find an example of how to train a NER model using spaCy.
I hope it helps. Good luck!

Sentence-level to document-level sentiment analysis. Analysing news

I need to perform sentiment analysis on news articles about a specific topic using the Stanford NLP tool.
This tool only allows sentence-based sentiment analysis, while I would like to extract a sentiment evaluation of the whole article with respect to my topic.
For instance, if my topic is Apple, I would like to know the sentiment of a news article with respect to Apple.
Just computing the average of the sentences in my articles won't do. For instance, I might have an article saying something along the lines of "Apple is very good at this, and this and that. While Google products are very bad for these reasons". Such an article would result in a Neutral classification using the average score of sentences, while it is actually a Very positive article about Apple.
On the other hand filtering my sentences to include only the ones containing the word Apple would miss articles along the lines of "Apple's product A is pretty good. However, it lacks the following crucial features: ...". In this case the effect of the second sentence would be lost if I were to use only the sentences containing the word Apple.
Is there a standard way of addressing this kind of problems? Is Stanford NLP the wrong tool to accomplish my goal?
Update: You might want to look into
http://blog.getprismatic.com/deeper-content-analysis-with-aspects/
This is a very active area of research so it would be hard to find an off-the-shelf tool to do this (at least nothing is built in the Stanford CoreNLP). Some pointers: look into aspect-based sentiment analysis. In this case, Apple would be an "aspect" (not really but can be modeled that way). Andrew McCallum's group at UMass, Bing Liu's group at UIC, Cornell's NLP group, among others, have worked on this problem.
If you want a quick fix, I would suggest extracting sentiment from the sentences that refer to Apple and its products; use coreference resolution (check out the dcoref annotator in Stanford CoreNLP), which will increase the recall of sentences and solve the problem of sentences like "However, it lacks...".
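A rough sketch of the quick fix without the coreference step (my own sketch, using NLTK's VADER scorer rather than the Stanford sentiment model): score only the sentences that mention the target and average them. Adding coref on top is what recovers sentences like "However, it lacks...".

    # pip install nltk   (plus: nltk.download('vader_lexicon'); nltk.download('punkt'))
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    from nltk.tokenize import sent_tokenize

    def target_sentiment(article, target="Apple"):
        """Average sentiment over sentences mentioning the target.

        Naive: without coreference resolution, sentences that refer back
        to the target with a pronoun are missed entirely.
        """
        sia = SentimentIntensityAnalyzer()
        scores = [
            sia.polarity_scores(sent)["compound"]
            for sent in sent_tokenize(article)
            if target.lower() in sent.lower()
        ]
        return sum(scores) / len(scores) if scores else 0.0

    article = ("Apple is very good at this, and this and that. "
               "While Google products are very bad for these reasons.")
    print(target_sentiment(article, "Apple"))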

Does an algorithm exist to help detect the "primary topic" of an English sentence?

I'm trying to find out if there is a known algorithm that can detect the "key concept" of a sentence.
The use case is as follows:
User enters a sentence as a query (Does chicken taste like turkey?)
Our system identifies the concepts of the sentence (chicken, turkey)
And it runs a search of our corpus content
The area that we're lacking in is identifying what the core "topic" of the sentence is really about. The sentence "Does chicken taste like turkey" has a primary topic of "chicken", because the user is asking about the taste of chicken. While "turkey" is a helper topic of less importance.
So... I'm trying to find out if there is an algorithm that will help me identify the primary topic of a sentence... Let me know if you are aware of any!!!
I actually did a research project on this and won two competitions and am competing in nationals.
There are two steps to the method:
Parse the sentence with a Context-Free Grammar
In the resulting parse trees, find all nouns which are only subordinate to Noun-Phrase-like constituents
For example, "I ate pie" has 2 nouns: "I" and "pie". Looking at the parse tree, "pie" is inside of a Verb Phrase, so it cannot be a subject. "I", however, is only inside of NP-like constituents. being the only subject candidate, it is the subject. Find an early copy of this program on http://www.candlemind.com. Note that the vocabulary is limited to basic singular words, and there are no verb conjugations, so it has "man" but not "men", has "eat" but not "ate." Also, the CFG I used was hand-made an limited. I will be updating this program shortly.
Anyway, there are limitations to this program. My mentor pointed out in its currents state, it cannot recognize sentences with subjects that are "real" NPs (what grammar actually calls NPs). For example, "that the moon is flat is not a debate any longer." The subject is actually "that the moon is flat." However, the program would recognize "moon" as the subject. I will be fixing this shortly.
Anyway, this is good enough for most sentences...
My research paper can be found there too. Go to page 11 of it to read the methods.
Hope this helps.
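A rough reconstruction of the idea with NLTK and a tiny hand-written CFG (my own sketch, not the author's program): keep only nouns and pronouns that are not dominated by a VP.

    import nltk

    # A deliberately tiny grammar, just enough to parse "I ate pie".
    grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> PRP | NN
    VP -> VBD NP
    PRP -> 'I'
    VBD -> 'ate'
    NN -> 'pie'
    """)
    parser = nltk.ChartParser(grammar)

    for tree in parser.parse("I ate pie".split()):
        subjects = []
        for pos in tree.treepositions("leaves"):
            word = tree[pos]
            tag = tree[pos[:-1]].label()                 # the preterminal (POS)
            ancestors = [tree[pos[:i]].label() for i in range(len(pos) - 1)]
            # A noun/pronoun is a subject candidate only if no VP dominates it.
            if tag in ("NN", "PRP") and "VP" not in ancestors:
                subjects.append(word)
        print(tree)
        print("subject candidates:", subjects)   # -> ['I']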
Most of your basic NLP parsing techniques will be able to extract the basic aspects of the sentence, i.e., that "chicken" and "turkey" are NPs and that they are linked by "like", etc. Getting from these to a 'topic' or 'concept' is more difficult.
Techniques such as Latent Semantic Analysis (LSA) and its many derivatives transform this information into a vector (some have ways of retaining, at least in part, the hierarchy/relations between parts of speech) and then compare it to existing vectors, usually pre-classified by concept. See http://en.wikipedia.org/wiki/Latent_semantic_analysis to get started.
Edit: Here's an example LSA app you can play around with to see if you might want to pursue it further: http://lsi.research.telcordia.com/lsi/demos.html
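If you want to see the mechanics in code, Gensim has an LsiModel; a toy sketch (the corpus here is far too small to be meaningful, it only shows the API):

    # pip install gensim
    from gensim import corpora, models

    documents = [
        "chicken tastes like turkey".split(),
        "roast turkey and chicken recipes".split(),
        "python is a programming language".split(),
    ]

    dictionary = corpora.Dictionary(documents)
    corpus = [dictionary.doc2bow(doc) for doc in documents]

    # LSA / LSI: project bag-of-words vectors into a low-dimensional "concept" space.
    lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

    query = dictionary.doc2bow("does chicken taste like turkey".split())
    print(lsi[query])   # the query expressed in the latent concept space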
For many longer sentences it's difficult to say exactly what the topic is, and there may be more than one.
One way to get an approximate answer is:
1.) First tag the sentence using OpenNLP, the Stanford parser, or any other tagger.
2.) Then remove all the stop words from the sentence.
3.) Pick out the nouns (proper, singular and plural).
The other way is:
1.) Chunk the sentence into phrases with any parser.
2.) Pick out all the noun phrases.
3.) Remove the noun phrases that don't have a noun as a child.
4.) Keep only adjectives and nouns; remove all other words from the remaining noun phrases.
This might give an approximate guess (a sketch of the first recipe is below).
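A small sketch of the first recipe with NLTK (assuming the punkt, stopwords and tagger resources have been downloaded):

    # nltk.download('punkt'); nltk.download('stopwords'); nltk.download('averaged_perceptron_tagger')
    from nltk import pos_tag, word_tokenize
    from nltk.corpus import stopwords

    sentence = "Does chicken taste like turkey?"
    stop = set(stopwords.words("english"))

    # 1) tag, 2) drop stop words, 3) keep nouns (NN, NNS, NNP, NNPS)
    tokens = [t for t in word_tokenize(sentence) if t.lower() not in stop]
    nouns = [word for word, tag in pos_tag(tokens) if tag.startswith("NN")]
    print(nouns)   # noun candidates; the exact output depends on the tagger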
"Key concept" is not a well-defined term in linguistics, but this may be a starting point: parse the sentence, find the subject in the parse tree or dependency structure that you get. (This doesn't always work; for example, the subject of "Is it raining?" is "it", while the key concept is likely "rain". Also, what's the key concept in "Are spaghetti and lasagna the same thing?")
This kind of problem (NLP + search) is more properly dealt with by methods such as LSA, but that's quite an advanced topic.
On the most basic level, a question in English is usually in the form of <verb> <subject> ... ? or <pronoun> <verb> <subject> ... ?. This is by no means a good algorithm, especially considering that the subject could span several words, but depending on how sophisticated a solution you need, it might be a useful starting point.
If you need precision, ignore this answer.
If you're willing to shell out money, http://www.connexor.com/ is supposed to be able to do this type of semantic analysis for a wide variety of languages, including English. I have never directly used their product, and so can't comment on how well it works.
There's an article about parsing noun phrases in this month's issue of the MIT Computational Linguistics journal: http://www.mitpressjournals.org/doi/pdf/10.1162/COLI_a_00076
Compound or complex sentences may have more than one key concept.
You can use Stanford NLP or MaltParser, which give the dependency structure of a sentence along with part-of-speech tags, marking the subject, verb, object, etc.
I think that most of the time the object will be the key concept of the sentence.
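To see what the dependency structure gives you, here is a quick sketch with spaCy instead of the Stanford parser or MaltParser (same idea: look at the subject/object relations), assuming the en_core_web_sm model is installed:

    # pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Does chicken taste like turkey?")

    for token in doc:
        print(token.text, token.dep_, "->", token.head.text)

    # Pull out subject and object candidates from the dependency labels.
    subjects = [t.text for t in doc if t.dep_ in ("nsubj", "nsubjpass")]
    objects = [t.text for t in doc if t.dep_ in ("dobj", "pobj", "attr")]
    print("subjects:", subjects, "objects:", objects)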
You should look at Google's Cloud Natural Language API. It's their NLP service.
https://cloud.google.com/natural-language/
A simple solution is to tag your sentence with a part-of-speech tagger (e.g., from the NLTK library for Python) and then look for matches against some predefined part-of-speech patterns in which it's clear where the main subject of the sentence is.
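One concrete way to do the pattern matching is NLTK's regular-expression chunker; the NP pattern here is just one possible choice:

    # nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
    import nltk

    sentence = "Does chicken taste like turkey?"
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

    # Chunk noun phrases: optional determiner, any adjectives, one or more nouns.
    chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
    tree = chunker.parse(tagged)

    for subtree in tree.subtrees(lambda t: t.label() == "NP"):
        print(" ".join(word for word, tag in subtree.leaves()))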
One option is to look into something like this as a first step:
http://www.abisource.com/projects/link-grammar/
How you derive the topic from these links is another problem in itself. But since AbiWord uses it to try to detect grammatical problems, you might be able to use it to determine the topic as well.
By "primary topic" you're referring to what is termed the subject of the sentence.
The subject can be identified by understanding a sentence through natural language processing.
The answer to this question is the same as that for How to determine subject, object and other words? - this is a currently unsolved problem.

Word coloring and syntax analysis

I want to colorize the words in a text according to their classification (category/declension etc.). I have a fully working dictionary, but the problem is that there is a lot of ambiguity. foedere, for instance, can be a form of either the verb "fornicate" or the noun "treaty".
What are the general strategies for solving these ambiguities or generating good guesses?
Thanks!
The general strategy is to first run a part-of-speech tagger on the data to determine the word category (noun, verb, etc.). That, however, requires data (context statistics) and tools. This research paper may be a starting point.
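As an English analogue of that strategy (a trained Latin tagger would be needed for the original question, which I won't presume here), a POS tagger resolves this kind of noun/verb ambiguity from context; a tiny NLTK sketch:

    # nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
    import nltk

    # Same surface form, different word category depending on context;
    # the tag tells you which dictionary entry (and colour) to use.
    for sentence in ["They record a new song every year.", "The old record was broken."]:
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        print([(w, t) for w, t in tagged if w.lower() == "record"])
        # typically a verb tag in the first sentence and a noun tag in the second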

Resources