Existing Warmth/Competence Dictionaries for NLP? - sentiment-analysis

I am doing text analysis for feedback left for a large set of employees. I am a researcher and want to assign a "warmth" score to each comment (warmth = comments like "Jerry is friendly!" "Sarah is kind!"), and also a competence score (e.g. "Anna is very capable." "Jack is very skilled at programming."). Before I build out my own dictionaries to embody these concepts, I wanted to see if there are any existing dictionaries out there. I know there are many dictionaries for negative/positive words, but that's not what I am trying to do. I am using R to analyze the data, but any dictionary in an importable format would work. Thanks!
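Absent a ready-made lexicon, the dictionary approach itself is straightforward. Below is a minimal sketch of dictionary-based scoring; the word lists are illustrative placeholders, not a validated warmth/competence dictionary:

```python
# Minimal dictionary-based scoring sketch. The word lists here are
# illustrative placeholders, not a validated warmth/competence lexicon.
WARMTH_WORDS = {"friendly", "kind", "warm", "caring", "helpful"}
COMPETENCE_WORDS = {"capable", "skilled", "competent", "smart", "efficient"}

def score_comment(text):
    # Strip trailing punctuation and lowercase before lookup.
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    warmth = sum(t in WARMTH_WORDS for t in tokens)
    competence = sum(t in COMPETENCE_WORDS for t in tokens)
    return {"warmth": warmth, "competence": competence}

print(score_comment("Jerry is friendly and very capable!"))
# {'warmth': 1, 'competence': 1}
```

The same counting logic ports directly to R (e.g. with `tidytext` and a custom lexicon data frame).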

Related

Are there any alternate ways other than Named Entity Recognition to extract event names from sentences?

I'm a newbie to NLP and I'm working on NER using OpenNLP. I have a sentence like "We have a dinner party today". Here "dinner party" is an event type. Similarly, consider the sentence "We have a room reservation"; here "room reservation" is an event type. My goal is to extract such words from sentences and label them as "Event_types" in the final output. This can be fairly well achieved by creating custom NER models, annotating sentences with the proper tags in the training dataset. But event types can be heterogeneous and random, so it is very hard to label all possible patterns (i.e. event types can be anything: "security meeting", "family function", "parents teachers meeting", etc.). So I'm looking for an alternate way to approach this problem. Thanks! :)
Basically you have two options: 1) A list-based approach, where you maintain lists of the entities you will extract from text. To cope with heterogeneous language use, you can train an embedding (e.g. Word2Vec or FastText) to identify contextually similar phrases for your list. 2) Train a custom CRF on data you have annotated (this obviously requires annotating a bunch of sentences with the corresponding tags). The ideal solution really depends on the data and people's willingness to annotate it.
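Option 1 can be sketched in a few lines. The event list and sentence below are made up; a real system would expand the list with embedding-based neighbours (e.g. from Word2Vec/FastText):

```python
# Sketch of the list-based approach: match known event phrases in text.
# The phrase list is illustrative; expand it with embedding neighbours.
EVENT_PHRASES = ["dinner party", "room reservation", "security meeting"]

def extract_events(sentence):
    lowered = sentence.lower()
    return [p for p in EVENT_PHRASES if p in lowered]

print(extract_events("We have a dinner party today"))  # ['dinner party']
```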

How does the "The values are interchangable" option in Phrase List work in LUIS?

I've gone through the documentation and tried to understand the Phrase List feature. Although I'm sure of the purpose of the Phrase List feature itself, I couldn't quite grasp the purpose of the "interchangeable" option intuitively.
Any thorough explanation would be appreciated.
#Srichakradhar at your suggestion, posting the answer to your question from Gitter here on Stack Overflow as well, to benefit the community as a whole:
"...regarding your question on phrase lists, happy to speak high-levelly on what the feature does :)
#srichakradhar
So ultimately the goal with LUIS is to understand the meaning of the user's input (utterance); through calculations, it returns a value expressing how confident it is about the meaning of the input. Using phrase lists is one of the ways to improve the accuracy of determining the meaning of the user's utterance; more specifically, adding features to a phrase list can put more weight on the score of an intent or entity.
Using a couple of examples to illustrate the high-level concept of how features help determine intent/entity score, and in turn predict the user’s utterance’s meaning:
For example, if I wanted to describe a class called Tablet, features I could use to describe it could include screen, size, battery, color, etc. If an utterance mentions any of the features, it’ll add points/weight to the score of predicting that the utterance’s meaning is describing Tablet. However, features that would be good to include in a phrase list are words that are maybe foreign, proprietary, or perhaps just rare. For example, maybe I would add, “SurfacePro”, “iPad”, or “Wugz” (a made-up tablet brand) to the phrase list of Tablet. Then if a user’s utterance includes “Wugz”, more points/weight would be put onto predicting that Tablet is the right entity to an utterance.
Or maybe the intent is Book.Flight and features include “Book”, “Flight”, “Cairo”, “Seattle”, etc. And the utterance is “Book me a flight to Cairo”, points/weight towards the score of Book.Flight intent would be added for “Book”, “flight”, “Cairo”.
Now, regarding interchangeable vs. non-interchangeable phrase lists.
Maybe I had a Cities phrase list that included “Seattle”, “Cairo”, “L.A.”, etc. I would make sure that the phrase list is non-interchangeable, because it would indicate that yes “Seattle” and “Cairo” are somehow similar to one-another, however they are not synonyms—I can’t use them interchangeably or rather one in place of the other. (“book flight to Cairo” is different from “book flight to Seattle”)
But if I had a phrase list of Coffee that included features “Coffee”, “Starbucks”, “Joe”, and marked the list as interchangeable, I’m specifying that the features in the list are interchangeable. (“I’d like a cup of coffee” means the same as “I’d like a cup of Joe”)
For more on Phrase Lists - Phrase List features in LUIS
For more on improving prediction - Tutorial: Add phrase list to improve predictions"
Taken from documentation (here):
A phrase list may be interchangeable or non-interchangeable. An interchangeable phrase list is for values that are synonyms, and a non-interchangeable phrase list is intended for values that aren't synonyms but are similar in another way.
There is also a great reply here on MSDN:
Choose "Exchangeable" when the list of words or phrases in your feature form a class or group -- for example, months like "January", "February", "March"; or names like "John", "Mary", "Frank". These features are "exchangeable" in the sense that an utterance where one word/phrase appears would be labeled similarly if the word/phrase were exchanged with another. For example, if "show the calendar for January" has the same intent as "show the calendar for February", this suggests choosing "exchangeable".
Choose "Not exchangeable" for words/phrases that are useful in your domain, but which do not form a class or group. For example, the words "calendar", "email", "show", and "send" might be relevant to your domain, but might all be associated with different intents, like "show my calendar" or "send an email".
If you're not sure, you can try either and see if there's any difference in performance.
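One way to picture the distinction is that members of an interchangeable list collapse into one shared feature, while members of a non-interchangeable list stay distinct. The snippet below is a toy illustration of that concept only, not how LUIS is actually implemented:

```python
# Toy illustration of the concept only -- not how LUIS works internally.
# Interchangeable list members collapse to one shared feature; a
# non-interchangeable list keeps each member as its own feature.
COFFEE_SYNONYMS = {"coffee", "starbucks", "joe"}   # interchangeable
CITIES = {"seattle", "cairo", "l.a."}              # non-interchangeable

def featurize(utterance):
    features = []
    for token in utterance.lower().split():
        if token in COFFEE_SYNONYMS:
            features.append("COFFEE")          # one shared feature
        elif token in CITIES:
            features.append(f"CITY:{token}")   # distinct per city
        else:
            features.append(token)
    return features

print(featurize("book flight to cairo"))
# ['book', 'flight', 'to', 'CITY:cairo']
```

With this featurization, "cup of coffee" and "cup of joe" produce identical features, while "book flight to Cairo" and "book flight to Seattle" do not.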

How to index mixed language contents on Elasticsearch?

How do you index mixed-language content in Elasticsearch? Let's say we have a system where people submit content from various parts of the world: the US, Canada, Europe, Japan, Korea, India, China, Kenya, the Arab world, Russia, and everywhere else.
Content can be in any language, which we can't know beforehand, and can even be mixed-language. We don't want to guess the language of the content and create a separate language-specific index for each inputted language; we believe this is unmanageable.
We need an easy solution to index this content efficiently in Elasticsearch, with full-text search capability as well as fuzzy string matching. Can anyone help?
What is the target you want to achieve? Do you want to have hits only in the language used at query time? Or would you also accept hits in any other language?
One approach would be to run all of elasticsearch's different language analyzers on the input and store the result in separate fields, for instance suffixed by the language of the current analyzer.
Then, at query time, you would have to search in all of these fields if you have no method to guess the most relevant ones.
However, this is likely to blow up your index, since you create a multitude of unused duplicate fields. This is IMHO also less elegant than having separate indices.
I would strongly recommend evaluating whether you really don't know the number of languages you will see in production. Having a distinct index per language would give you much more control over the input/output and enable you to fine-tune your engine for the actual use case.
Alternatively, you may start with a simple whitespace tokenizer and evaluate the quality of the search results (per use case).
You will not have language specific stemming but at least token streams for most languages.
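The "one subfield per language analyzer" idea above could be sketched as a mapping like the following. The field name, analyzer list, and query-field expansion are illustrative; pass the dict to your Elasticsearch client's index-creation call:

```python
# Sketch of a mapping with one subfield per language analyzer.
# Field names and the analyzer list are made up for illustration.
LANG_ANALYZERS = ["english", "french", "german", "cjk"]

mapping = {
    "mappings": {
        "properties": {
            "content": {
                "type": "text",
                "fields": {
                    # content.english, content.french, ... each analyzed
                    # by the corresponding language analyzer.
                    lang: {"type": "text", "analyzer": lang}
                    for lang in LANG_ANALYZERS
                },
            }
        }
    }
}

# At query time, search the base field plus every language subfield:
query_fields = ["content"] + [f"content.{lang}" for lang in LANG_ANALYZERS]
print(query_fields)
```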

How can I do "related tags"?

I have tags on my website, and I input them one by one when I create a blog post. I love Gmail's new feature that asks whether you want to include X in an email when you type Y's name, because you often include both of them in the same messages.
I'd like to do something similar on my website, but I don't know how to represent the tags' "relatedness" in an object or database. Thoughts?
It all boils down to creating associations between certain characteristics of your posts and certain tags, and then, when you press the "publish" button, analysing the new post and proposing all tags matched to your post's characteristics.
This can be done in several ways, from a totally hard-coded association to some sort of "learning AI", and everything in between.
Hard-coded solutions
These are the simplest algorithms to implement. You should first decide which characteristics of your posts are relevant for tagging (e.g. their length if you tag them "short" or "long", the presence of photos or videos if you tag them "multimedia-content", etc.). The most obvious, however, is to focus on which words are used in the posts. For example, you could build a mapping like this:
tag_hint_words = {'code-development': ['programming', 'language', 'python',
                                       'function', 'object', 'method'],
                  'family': ['Theresa', 'kids', 'uncle Ben', 'holidays']}
Then you would check your post for the presence of the words in the lists (the code between [ and ]) and propose the tag (the word before :) as a possible candidate.
A common approach is to give "scores", or in other words to assign a number that indicates the probability that a given tag is the right one. For example, if your post contained the sentence...
After months of programming, we finally left for the summer holidays at uncle Ben's cottage. Theresa and the kids were ecstatic!
...despite the presence of the word "programming", the program should indicate family as the most likely tag to use, as there are many more words hinting at it.
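The scoring idea above can be sketched by counting, for each tag, how many of its hint words appear in the post (reusing the mapping from earlier):

```python
# Scoring sketch for the hard-coded approach: count how many hint words
# from each tag's list appear in the post text.
tag_hint_words = {
    'code-development': ['programming', 'language', 'python',
                         'function', 'object', 'method'],
    'family': ['Theresa', 'kids', 'uncle Ben', 'holidays'],
}

def score_tags(post):
    lowered = post.lower()
    return {tag: sum(w.lower() in lowered for w in words)
            for tag, words in tag_hint_words.items()}

post = ("After months of programming, we finally left for the summer "
        "holidays at uncle Ben's cottage. Theresa and the kids were ecstatic!")
scores = score_tags(post)
print(scores)                       # {'code-development': 1, 'family': 4}
print(max(scores, key=scores.get))  # family
```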
Learning AI's
One of the obvious limitations of the above method is that, say one day you pick up Java alongside Python, you would probably need to change your code and include words like "java" or "oracle" too. The same applies if you create new tags.
To circumvent this limitation (and have some fun!!) you could try to implement a learning algorithm. Learning algorithms are those that refine their outcome the more you use them (so they do indeed... learn!). Some algorithms require initial training (many spam filters and voice-recognition programs need this initial "primer"); some don't.
I am absolutely no expert on the subject, but two common approaches are the Naive Bayes classifier and some flavour of neural network.
Although the WP pages might look scary, they are surprisingly easy to implement (at least in Python). Here's the recording of a lecture at PyCon 2009 on the subject "Easy AI with Python". I found it very informative and even somehow inspiring! :)
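To show how little code a Naive Bayes classifier actually needs, here is a hand-rolled sketch trained on a few made-up posts; real training data and smoothing choices would of course differ:

```python
from collections import Counter, defaultdict
import math

# Tiny hand-rolled Naive Bayes tag classifier on made-up training posts.
# It only shows the shape of the "learning" alternative to fixed word lists.
training = [
    ("python function object method", "code-development"),
    ("programming language tutorial", "code-development"),
    ("holidays with the kids", "family"),
    ("uncle Ben and Theresa visited", "family"),
]

word_counts = defaultdict(Counter)
tag_counts = Counter()
vocab = set()
for text, tag in training:
    tag_counts[tag] += 1
    for w in text.lower().split():
        word_counts[tag][w] += 1
        vocab.add(w)

def predict(text):
    scores = {}
    for tag in tag_counts:
        total = sum(word_counts[tag].values())
        # log prior + log likelihoods with add-one smoothing
        score = math.log(tag_counts[tag] / len(training))
        for w in text.lower().split():
            score += math.log((word_counts[tag][w] + 1) / (total + len(vocab)))
        scores[tag] = score
    return max(scores, key=scores.get)

print(predict("a new python method"))  # code-development
```

Unlike the hard-coded lists, adding a new tag or topic here only means adding training examples, not editing code.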
HTH!
You should have a look at this post:
Any suggestions for a db schema for storing related keywords?
If you're looking for a schema for storing related tags, it will help.
Relevancy searches where multiple agents play a part are usually done using collaborative filtering. You might want to give that a look.
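In the same spirit, the "Gmail-style" suggestion can be approximated with plain tag co-occurrence counting; the posts below are made up for illustration:

```python
from collections import Counter
from itertools import combinations

# Co-occurrence sketch for "related tags": count how often tag pairs
# appear on the same post, then rank a tag's most frequent partners.
posts_tags = [
    {"python", "nlp"},
    {"python", "django"},
    {"python", "nlp", "nltk"},
]

pair_counts = Counter()
for tags in posts_tags:
    for a, b in combinations(sorted(tags), 2):
        pair_counts[(a, b)] += 1

def related(tag):
    partners = Counter()
    for (a, b), n in pair_counts.items():
        if a == tag:
            partners[b] += n
        elif b == tag:
            partners[a] += n
    return [t for t, _ in partners.most_common()]

print(related("python"))  # most-related tags first
```

In a database, the `pair_counts` table is just `(tag_a, tag_b, count)`, incremented on publish.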
Look up Clustering (Machine Learning algorithm). Don't be intimidated by math, it's a pretty straightforward algorithm. Check out Machine Learning for Hackers for simpler explanations of many Machine Learning algorithms and methods.

What's needed for NLP?

Assuming that I know nothing about anything and that I'm starting programming TODAY, what would you say is necessary to learn in order to start working with Natural Language Processing?
I've been struggling with some string-parsing methods, but so far it's just annoying me and making me write ugly code. I'm looking for fresh ideas on how to create a Remember The Milk-like API to parse user input, in order to provide an input form for fast data entry that is based not on fields but on simple one-line phrases instead.
EDIT: RTM is a to-do list system. So to enter a task you don't need to type into each field to fill in values (task name, due date, location, etc.). You can simply type a phrase like "Dentist appointment monday at 2PM in WhateverPlace" and it will parse it and fill in all the fields for you.
I don't have any kind of technical constraints since it's going to be a personal project but I'm more familiar with .NET world. Actually, I'm not sure this is a matter of language but if it's necessary I'm more than willing to learn a new language to do it.
My project is related to personal finances, so the phrases are more like "Spent 10USD on Coffee last night with my girlfriend", and it would fill in the location, amount, tags and other fields.
Thanks a lot for any kind of directions that you might give me!
This does not appear to require full NLP. Simple pattern-based information extraction will probably suffice. The basic idea is to tokenize the text, then recognize/classify certain keywords, and finally recognize patterns/phrases.
In your example, tokenizing gives you "Dentist", "appointment", "monday", "at", "2PM", "in", "WhateverPlace". Your tool will recognize that "monday" is a day of the week, "2PM" is a time, etc. Finally, you can find patterns like [at] [TIME] and [in] [Place] and use those to fill in the fields.
A framework like GATE may help, but even that may be a larger hammer than you really need.
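The tokenize-classify-match pipeline above can be sketched with a few regular expressions. The patterns are deliberately naive and tuned to the example; a real parser would need broader date/time grammars:

```python
import re

# Regex sketch of pattern-based extraction: [at] [TIME], [in] [Place],
# plus a day-of-week keyword list. Patterns are naive by design.
def parse_task(text):
    fields = {}
    time = re.search(r"\bat\s+(\d{1,2}(?::\d{2})?\s?[AP]M)", text, re.I)
    place = re.search(r"\bin\s+(\w+)", text)
    day = re.search(r"\b(monday|tuesday|wednesday|thursday|friday|"
                    r"saturday|sunday)\b", text, re.I)
    if time:
        fields["time"] = time.group(1)
    if place:
        fields["place"] = place.group(1)
    if day:
        fields["day"] = day.group(1)
    return fields

print(parse_task("Dentist appointment monday at 2PM in WhateverPlace"))
# {'time': '2PM', 'place': 'WhateverPlace', 'day': 'monday'}
```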
Have a look at NLTK; it's a good resource for beginner programmers interested in NLP.
http://www.nltk.org/
It is written in Python, which is one of the easier programming languages.
Now that I understand your problem, here is my solution:
You can develop a kind of restricted vocabulary, in which all amounts must end with a $ sign and any time must be in the form 00:00 and/or end with AM/PM. Regarding detecting items, you can use a list of objects from an ontology such as OpenCyc, which can provide you with a list of objects such as beer, coffee, bread, milk, etc. This will help you detect objects in the short phrase. Still, it would be a very fuzzy approach.
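The restricted-vocabulary idea could look like the sketch below; the currency pattern and the object list (a stand-in for an ontology such as OpenCyc) are illustrative:

```python
import re

# Restricted-vocabulary sketch: amounts must look like "<number><currency>",
# and known items come from a fixed list standing in for an ontology.
KNOWN_OBJECTS = {"coffee", "beer", "bread", "milk"}

def parse_expense(phrase):
    fields = {}
    amount = re.search(r"(\d+(?:\.\d+)?)\s?(USD|EUR|\$)", phrase)
    if amount:
        fields["amount"] = float(amount.group(1))
        fields["currency"] = amount.group(2)
    for token in phrase.lower().split():
        if token.strip(".,!") in KNOWN_OBJECTS:
            fields["item"] = token.strip(".,!")
    return fields

print(parse_expense("Spent 10USD on Coffee last night with my girlfriend"))
# {'amount': 10.0, 'currency': 'USD', 'item': 'coffee'}
```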
