Apache Stanbol Sentiment Analysis

I am trying to get the sentiment tags for a given text in Apache Stanbol.
I have added the "sentiment-word-classifier" engine to an enhancer chain, and I have also added all the required engines to extract the tokens and their part-of-speech tags.
This is the composition of my enhancer chain:
langid (required, LangIdEnhancementEngine)
opennlp-sentence (required, OpenNlpSentenceDetectionEngine)
stanford-nlp (required, RestfulNlpAnalysisEngine)
opennlp-token (required, OpenNlpTokenizerEngine)
opennlp-pos (required, OpenNlpPosTaggingEngine)
sentiment-wordclassifier (required, SentimentEngine)
opennlp-chunker (required, OpenNlpChunkingEngine)
pos-chunker (required, PosChunkerEngine)
Is this sufficient input for the sentiment-word-classifier?
I am still not getting any sentiment tags.
Can somebody shed some light on what I am missing?
Thanks

Sentiment analysis needs two engines to be included
sentiment-wordclassifier
sentiment-summarization
The sentiment-wordclassifier classifies tokens with sentiment values (based on dictionary entries for a language). NOTE that you will also need to provide those dictionaries (see the modules under data/sentiment). Results are stored in the AnalyzedText content part.
The sentiment-summarization engine uses those classifications to create sentiments for phrases, sentences and the whole document. Summarization considers negations and also tries to assign adjectives holding a sentiment to the correct noun or pronoun. The results of sentiment-summarization are added to the Enhancement Results as fise:SentimentAnnotation.
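Once both engines (and the dictionaries) are in place, you can check for fise:SentimentAnnotation results by posting text to the chain's RESTful endpoint. A minimal sketch, assuming a local Stanbol at http://localhost:8080 and a chain named "sentiment-chain" (both names are assumptions; substitute your own):

import requests

# Assumed host and chain name; replace with your Stanbol instance and chain.
STANBOL_CHAIN_URL = "http://localhost:8080/enhancer/chain/sentiment-chain"

text = "I really love this book, but the ending was terribly disappointing."
response = requests.post(
    STANBOL_CHAIN_URL,
    data=text.encode("utf-8"),
    headers={
        "Content-Type": "text/plain",
        "Accept": "application/rdf+xml",  # or another RDF serialization Stanbol supports
    },
)
response.raise_for_status()
# The returned enhancement graph should contain fise:SentimentAnnotation
# resources produced by sentiment-summarization.
print(response.text)

If no fise:SentimentAnnotation shows up, the usual culprits are the missing sentiment dictionaries or the missing sentiment-summarization engine described above.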

Same here. I started out with a chain that had just the sentiment-wordclassifier engine and was getting nothing. Then I spotted a helpful message in stanbol/logs/error.log about analyzed content not reaching the sentiment engine, with a suggestion to include opennlp-pos. I looked at the other chains and included opennlp-sentence and opennlp-token in addition to opennlp-pos, but was still getting nothing. Then I came across your question and the mention of the data/sentiment module. I changed into the data/sentiment/sentiwordnet folder and ran mvn install -DskipTests -PinstallBundle -Dsling=http://your.stanbol.com:8080/system/console. I am now seeing sentiment output and trying to make sense of it.

Related

How to handle spelling mistakes (typos) in entity extraction in Rasa NLU?

I have a few intents in my training set (nlu_data.md file) with a sufficient number of training examples under each intent.
Following is an example,
##intent: SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai
I have added multiple sentences like this.
At the time of testing, all sentences in the training file work fine. But if an input query has a spelling mistake, e.g. hotol/hetel/hotele for the hotel keyword, then Rasa NLU is unable to extract it as an entity.
I want to resolve this issue.
I am allowed to change only the training data, and I am also restricted from writing any custom component for this.
To handle spelling mistakes like this in entities, you should add these examples to your training data. So something like this:
##intent: SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai
- looking for a [hotol](place) in Chennai
- [hetel](place) in Berlin please
Once you've added enough examples, the model should be able to generalise from the sentence structure.
If you're not using it already, it also makes sense to use the character-level CountVectorsFeaturizer. That should be in the default pipeline described on this page already.
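For reference, the relevant part of config.yml looks roughly like this (a sketch assuming a Rasa 1.x-style pipeline; the exact n-gram range is only an example):

pipeline:
  - name: "WhitespaceTokenizer"
  - name: "CountVectorsFeaturizer"
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4

The char_wb analyzer builds character n-grams inside word boundaries, which is what lets the model tolerate small misspellings.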
One thing I would highly suggest is using look-up tables with fuzzywuzzy matching. If you have a limited number of entities (like country names), look-up tables are quite fast, and fuzzy matching catches typos when that entity exists in your look-up table (searching for typo variations of those entities). There's a whole blog post about it on the Rasa blog.
There's a working implementation of fuzzywuzzy as a custom component:
import json
import os

from fuzzywuzzy import process
from nltk.corpus import stopwords
from rasa.nlu.components import Component  # rasa_nlu.components on pre-1.0 versions

# STOP_WORDS is just a set of stop words from NLTK (requires nltk.download('stopwords'))
STOP_WORDS = set(stopwords.words('english'))


class FuzzyExtractor(Component):
    name = "FuzzyExtractor"
    provides = ["entities"]
    requires = ["tokens"]
    defaults = {}
    language_list = ["en"]
    threshold = 90

    def __init__(self, component_config=None, *args):
        super(FuzzyExtractor, self).__init__(component_config)

    def train(self, training_data, cfg, **kwargs):
        pass

    def process(self, message, **kwargs):
        entities = list(message.get('entities', []))

        # Get the file path of the lookup table in JSON format
        cur_path = os.path.dirname(__file__)
        if os.name == 'nt':
            partial_lookup_file_path = '..\\data\\lookup_master.json'
        else:
            partial_lookup_file_path = '../data/lookup_master.json'
        lookup_file_path = os.path.join(cur_path, partial_lookup_file_path)

        with open(lookup_file_path, 'r') as file:
            lookup_data = json.load(file)['data']

        tokens = message.get('tokens')

        for token in tokens:
            if token.text not in STOP_WORDS:
                # Fuzzy-match the token against every lookup entry
                fuzzy_results = process.extract(
                    token.text,
                    lookup_data,
                    processor=lambda a: a['value'] if isinstance(a, dict) else a,
                    limit=10)
                for result, confidence in fuzzy_results:
                    if confidence >= self.threshold:
                        entities.append({
                            "start": token.offset,
                            "end": token.end,
                            "value": token.text,
                            "fuzzy_value": result["value"],
                            "confidence": confidence,
                            "entity": result["entity"]
                        })

        message.set("entities", entities, add_to_output=True)
I didn't implement it myself; it was implemented and validated here: Rasa forum.
You then just reference it in your NLU pipeline in the config.yml file.
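For example, if the class above lives in a file fuzzy_extractor.py next to your config (the module path here is hypothetical), the pipeline entry would look roughly like this:

pipeline:
  - name: "WhitespaceTokenizer"               # the component requires tokens
  - name: "fuzzy_extractor.FuzzyExtractor"
  - name: "CRFEntityExtractor"

Custom components are referenced by module path and class name, so Rasa can import them at training and inference time.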
It's a strange request that they ask you not to change the code or write custom components.
The approach you would have to take would be to use entity synonyms. A slight edit on a previous answer:
##intent: SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai
- looking for a [hotol](place:hotel) in Chennai
- [hetel](place:hotel) in Berlin please
This way, even if the user enters a typo, the correct entity will be extracted. If you want this to be foolproof, I do not recommend hand-editing the intents. Use some kind of automated tool for generating the training data, e.g. a script that generates misspelled words (typos); see the sketch below.
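As an illustration (the function name and output format here are my own, not from any Rasa tooling), such a generator can be as simple as:

import random

def typo_variants(word, n=3, seed=0):
    """Generate a few simple misspellings (deletion, swap, substitution) of a word."""
    random.seed(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"
    variants = set()
    for _ in range(100):  # bounded number of attempts
        if len(variants) >= n:
            break
        i = random.randrange(len(word))
        kind = random.choice(["delete", "swap", "substitute"])
        if kind == "delete" and len(word) > 3:
            candidate = word[:i] + word[i + 1:]
        elif kind == "swap" and i < len(word) - 1:
            candidate = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        else:
            candidate = word[:i] + random.choice(letters) + word[i + 1:]
        if candidate != word:
            variants.add(candidate)
    return sorted(variants)

# Emit training lines in the entity-synonym format shown above.
for variant in typo_variants("hotel"):
    print("- looking for a [{}](place:hotel) in Chennai".format(variant))

The generated lines can then be appended to nlu_data.md alongside the hand-written examples.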
First of all, add samples for the most common typos for your entities as advised here
Beyond this, you need a spellchecker.
I am not sure whether there is a single library that can be used in the pipeline, but if not, you need to create a custom component. Dealing with training data alone is not feasible, since you can't create samples for every possible typo.
Using fuzzywuzzy is one way; it is generally slow and it doesn't solve all the issues.
Universal Encoder is another solution.
There should be more options for spell correction, but you will need to write code in any way.
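As an illustration of the spellchecker idea (this is my own sketch using the pyspellchecker package, not something Rasa ships with), you could correct tokens before they reach the NLU model:

from spellchecker import SpellChecker  # pip install pyspellchecker

spell = SpellChecker(language="en")

def correct_text(text):
    """Replace each token with the spellchecker's best correction, if any."""
    corrected = []
    for token in text.split():
        # correction() returns the most likely candidate, or None / the token itself
        # if nothing better is found (behaviour varies slightly across versions).
        corrected.append(spell.correction(token) or token)
    return " ".join(corrected)

print(correct_text("find good hotol for me"))
# 'hotol' should come back as 'hotel'; exact output depends on the dictionary.

In a Rasa setup this would still have to live in a custom component (or sit in front of the bot), which is exactly why training data alone isn't enough.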

Gensim most_similar() with fastText word vectors returns useless/meaningless words

I'm using Gensim with fastText word vectors to return similar words.
This is my code:
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('cc.it.300.vec')
words = model.most_similar(positive=['sole'],topn=10)
print(words)
This will return:
[('sole.', 0.6860659122467041), ('sole.Ma', 0.6750558614730835), ('sole.Il', 0.6727924942970276), ('sole.E', 0.6680260896682739), ('sole.A', 0.6419174075126648), ('sole.È', 0.6401025652885437), ('splende', 0.6336565613746643), ('sole.La', 0.6049465537071228), ('sole.I', 0.5922051668167114), ('sole.Un', 0.5904430150985718)]
The problem is that "sole" ("sun" in English) returns a series of words with a dot in them (like sole., sole.Ma, etc.). Where is the problem? Why does most_similar return these meaningless words?
EDIT
I tried with the English word vectors and the word "sun" returns this:
[('sunlight', 0.6970556974411011), ('sunshine', 0.6911839246749878), ('sun.', 0.6835992336273193), ('sun-', 0.6780728101730347), ('suns', 0.6730450391769409), ('moon', 0.6499731540679932), ('solar', 0.6437565088272095), ('rays', 0.6423950791358948), ('shade', 0.6366724371910095), ('sunrays', 0.6306195259094238)] 
Is it impossible to reproduce results like relatedwords.org?
Perhaps the bigger question is: why does the Facebook FastText cc.it.300.vec model include so many meaningless words? (I haven't noticed that before – is there any chance you've downloaded a peculiar model that has decorated words with extra analytical markup?)
To gain the unique benefits of FastText – including the ability to synthesize plausible (better-than-nothing) vectors for out-of-vocabulary words – you may not want to use the general load_word2vec_format() on the plain-text .vec file, but rather a Facebook-FastText specific load method on the .bin file. See:
https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.load_facebook_vectors
(I'm not sure that will help with these results, but if you choose to use FastText, you may be interested in using it "fully".)
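For example, a minimal sketch (assuming you have downloaded the matching cc.it.300.bin file from Facebook's distribution; the rest follows the gensim API linked above):

from gensim.models.fasttext import load_facebook_vectors

# Load the full fastText data (including character n-grams) from the .bin file,
# rather than only the plain-text .vec word vectors.
wv = load_facebook_vectors("cc.it.300.bin")

# Same in-vocabulary query as before.
print(wv.most_similar(positive=["sole"], topn=10))

# With subword information available, even an out-of-vocabulary misspelling
# gets a synthesized vector (and therefore neighbours).
print(wv.most_similar("solee", topn=5))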
Finally, given the source of this training data – common-crawl text from the open web, which may contain lots of typos/junk – these might be legitimate word-like tokens, essentially typos of sole, that appear often enough in the training data to get word-vectors. (And because they really are typo-synonyms for 'sole', they're not necessarily bad results for all purposes, just for your desired purpose of only seeing "real-ish" words.)
You might find it helpful to try using the restrict_vocab argument of most_similar(), to only receive results from the leading (most-frequent) part of all known word-vectors. For example, to only get results from among the top 50000 words:
words = model.most_similar(positive=['sole'], topn=10, restrict_vocab=50000)
Picking the right value for restrict_vocab might help in practice to leave out long-tail 'junk' words, while still providing the real/common similar words you seek.

Wiktionary/MediaWiki Search & Suffix Filtering

I'm building an application that will hopefully use Wiktionary words and definitions as a data source. In my queries, I'd like to be able to search for all Wiktionary entries that are similar to user provided terms in either the title or definition, but also have titles ending with a specified suffix (or one of a set of suffixes).
For example, I want to find all Wiktionary entries that contain the words "large dog", like this:
https://en.wiktionary.org/w/api.php?action=query&list=search&srsearch=large%20dog
But further filter the results to only contain entries with titles ending with "d". So in that example, "boarhound", "Saint Bernard", and "unleashed" would be returned.
Is this possible with the MediaWiki search API? Do you have any recommendations?
This is mostly possible with ElasticSearch/CirrusSearch, but it is disabled for performance reasons. You can still enable it on your own wiki, or attempt smart search queries.
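One practical workaround while regex title matching is unavailable is to page through list=search results and filter the titles client-side. A rough sketch (assuming the public en.wiktionary.org endpoint; the suffix filtering happens in your own code, not in the API):

import requests

API = "https://en.wiktionary.org/w/api.php"

def search_with_suffix(term, suffixes, max_pages=10):
    """Full-text search for `term`, keeping only titles ending in one of `suffixes`."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": term,
        "srlimit": 50,      # results per request; the API caps this
        "format": "json",
    }
    matches = []
    for _ in range(max_pages):
        data = requests.get(API, params=params).json()
        for hit in data["query"]["search"]:
            if hit["title"].endswith(tuple(suffixes)):
                matches.append(hit["title"])
        if "continue" not in data:
            break
        params.update(data["continue"])  # follow the continuation token
    return matches

print(search_with_suffix("large dog", suffixes=("d",)))

Since list=search does full-text search, definition text is covered as well as titles, which is what the question asks for.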
Usually for Wiktionary I use yanker, which can access the page table of the database. Your example (one-letter suffix) would be huge, but for instance .*hound$ finds:
Afghan_hound
Bavarian_mountain_hound
Foxhound
Irish_Wolfhound
Mahound
Otterhound
Russian_Wolfhound
Scottish_Deerhound
Tripehound
basset_hound
bearhound
black_horehound
bloodhound
boarhound
bookhound
boozehound
buckhound
chowhound
coon_hound
coonhound
covert-hound
covert_hound
coverthound
deerhound
double-nosed_andean_tiger_hound
elkhound
foxhound
gazehound
gorehound
grayhound
greyhound
harehound
heckhound
hell-hound
hell_hound
hellhound
hoarhound
horehound
hound
limehound
lyam-hound
minkhound
newshound
nursehound
otterhound
powder_hound
powderhound
publicity-hound
publicity_hound
rock_hound
rockhound
scent_hound
scenthound
shag-hound
sighthound
sleuth-hound
sleuthhound
slot-hound
slowhound
sluthhound
smooth_hound
smoothhound
smuthound
staghound
war_hound
whorehound
wolfhound

How do I define a SemgrexPattern in the Stanford API to match nodes' text without using {lemma:...}?

I am working with the edu.stanford.nlp.semgrex and edu.stanford.nlp.trees.semgraph packages and am looking for a way to match nodes on a text value other than via the lemma: directive.
I couldn't find all possible attribute names in the javadoc for SemgrexPattern, only those for lemma, tag, and relational operators - is there a comprehensive list available?
For example, in the following sentence
My take-home pay is $20.
extracting the 'take-home' node is not possible, since
(SemgrexPattern.compile( "{lemma:take-home}"))
.matcher( "My take-home pay is $20.").find()
yields false, because take-home is deemed not to be a lemma.
What do I need to do to match nodes with non-lemma, arbitrary text?
Thanks for any advice or comment.
Sorry - I realize that {word:take-home} would work in the example above.
Thanks..

I want to get nodes of parseTree

This is part of my code:
String sentence = "The system Does Not Require users to identify themselves to search for books according to certain criteria and to check the availability of a particular book. However to check out books, to check their respective book loan status, and to place holds on books that are already on loan, users must first identify themselves to the system.";
Parse topParses[] = ParserTool.parseLine(sentence, parser, /*numParses=*/ 3);
for (Parse parseTree : topParses) {
    parseTree.show();
}
How can I get the verbs in the sentence, please?
I mean, how can I get the nodes of the tree?
If you only need to get the verbs from the sentence, then the POS tagger in OpenNLP is sufficient. All you have to do is use an OpenNLP tokenizer to get the tokens in an array and feed it to POSTaggerME. It will give you the corresponding POS tags. Then you can filter by the verb tags such as VB, VBZ, etc.
If you are looking for verb phrases, use the Chunker; if you need just verbs, use the POS tagger.
Check out this answer
How to extract the noun phrases using Open nlp's chunking parser
