Natural Language Processing for Smart Homes - algorithm

I'm writing up a Smart Home software for my bachelor's degree, that will only simulate the actual house, but I'm stuck at the NLP part of the project. The idea is to have the client listen to voice inputs (already done), transform it into text (done) and send it to the server, which does all the heavy lifting / decision making.
So all my inputs will be fairly short (like "please turn on the porch light"). Based on this, I want to take the decision on which object to act, and how to act. So I came up with a few things to do, in order to write up something somewhat efficient.
Get rid of unnecessary words (in the previous example "please" and "the" are words that don't change the meaning of what needs to be done; but if I say "turn off my lights", "my" does have a fairly important meaning).
Deal with synonyms ("turn on lights" should do the same as "enable lights" -- I know it's a stupid example). I'm guessing the only option is to have some kind of a dictionary (XML maybe), and just have a list of possible words for one particular object in the house.
Detecting the verb and subject. "turn on" is the verb, and "lights" is the subject. I need a good way to detect this.
General implementation. How are these things usually developed in terms of algorithms? I only managed to find one article about NLP in Smart Homes, which was very vague (and had bad English). Any links welcome.
I hope the question is unique enough (I've seen NLP questions on SO, none really helped), that it won't get closed.

If you don't have a lot of time to spend with the NLP problem, you may use the Wit API (http://wit.ai) which maps natural language sentences to JSON:
It's based on machine learning, so you need to provide examples of sentences + JSON output to configure it to your needs. It should be much more robust than grammar-based approaches, especially because the voice-to-speech engine might make mistakes that will break your grammar (but the machine learning module can still get the meaning of the sentence).

I am no way a pioneer in NLP(I love it though) but let me try my hand on this one. For your project I would suggest you to go through Stanford Parser
From your problem definition I guess you don't need anything other then verbs and nouns. SP generates POS(Part of speech tags) That you can use to prune the words that you don't require.
For this I can't think of any better option then what you have in mind right now.
For this again you can use grammatical dependency structure from SP and I am pretty much sure that it is good enough to tackle this problem.
This is where your research part lies. I guess you can find enough patterns using GD and POS tags to come up with an algorithm for your problem. I hardly doubt that any algorithm would be efficient enough to handle every set of input sentence(Structured+unstructured) but something that is more that 85% accurate should be good enough for you.

First, I would construct a list of all possible commands (not every possible way to say a command, just the actual function itself: "kitchen light on" and "turn on the light in the kitchen" are the same command) based on the actual functionality the smart house has available. I assume there is a discrete number of these in the order of no more than hundreds. Assign each some sort of identifier code.
Your job then becomes to map an input of:
a sentence of english text
location of speaker
time of day, day of week
any other input data
to an output of a confidence level (0.0 to 1.0) for each command.
The system will then execute the best match command if the confidence is over some tunable threshold (say over 0.70).
From here it becomes a machine learning application. There are a number of different approaches (and furthermore, approaches can be combined together by having them compete based on features of the input).
To start with I would work through the NLP book from Jurafsky/Manning from Stanford. It is a good survey of current NLP algorithms.
From there you will get some ideas about how the mapping can be machine learned. More importantly how natural language can be broken down into a mathematical structure for machine learning.
Once the text is semantically analyzed, the simplest ML algorithm to try first would be of the supervised ones. To generate training data have a normal GUI, speak your command, then press the corresponding command manually. This forms a single supervised training case. Make some large number of these. Set some aside for testing. It is also unskilled work so other people can help. You can then use these as your training set for your ML algorithm.

Related

How do I get a quick and dirty recognition of possible typos in .net?

I have to manually go through a long list of terms (~3500) which have been entered by users through the years. Beside other things, I want to reduce the list by looking for synonyms, typos and alternate spellings.
My work will be much easier if I can group the list into clusters of possible typos before starting. I was imagining to use some metric which can calculate the similarity to a term, e.g. in percent, and then cluster everything which has a similarity higher than some threshold. As I am going through it manually anyway, I don't mind a high failure rate, if it can keep the whole thing simple.
Ideally, there exists some easily available library to do this for me, implemented by people who know what they are doing. If there is no such, then at least one calculating a similarity metric for a pair of strings would be great, I can manage the clustering myself.
If this is not available either, do you know of a good algorithm which is simple to implement? I was first thinking a Hamming distance divided by word length will be a good metric, but noticed that while it will catch swapped letters, it won't handle deletions and insertions well (ptgs-1 will be caught as very similar to ptgs/1, but hematopoiesis won't be caught as very similar to haematopoiesis).
As for the requirements on the library/algorithm: it has to rely completely on spelling. I know that the usual NLP libraries don't work this way, but
there is no full text available for it to consider context.
it can't use a dictionary corpus of words, because the terms are far outside of any everyday language, frequently abbreviations of highly specialized terms.
Finally, I am most familiar with C# as a programming language, and I already have a C# pseudoscript which does some preliminary cleanup. If there is no one-step solution (feed list in, get grouped list out), I will prefer a library I can call from within a .NET program.
The whole thing should be relatively quick to learn for somebody with almost no previous knowledge in information retrieval. This will save me maybe 5-6 hours of manual work, and I don't want to spend more time than that in setting up an automated solution. OK, maybe up to 50% longer if I get the chance to learn something awesome :)
The question: What should I use, a library, or an algorithm? Which ones should I consider? If what I need is a library, how do I recognize one which is capable of delivering results based on spelling alone, as opposed to relying on context or dictionary use?
edit To clarify, I am not looking for actual semantic relatedness the way search or recommendation engines need it. I need to catch typos. So, I am looking for a metric by which mouse and rodent have zero similarity, but mouse and house have a very high similarity. And I am afraid that tools like Lucene use a metric which gets these two examples wrong (for my purposes).
Basically you are looking to cluster terms according to Semantic Relatedness.
One (hard) way to do it is following Markovitch and Gabrilovitch approach.
A quicker way will be consisting of the following steps:
download wikipedia dump and an open source Information Retrieval library such as Lucene (or Lucene.NET).
Index the files.
Search each term in the index - and get a vector - denoting how relevant the term (the query) is for each document. Note that this will be a vector of size |D|, where |D| is the total number of documents in the collection.
Cluster your vectors in any clustering algorithm. Each vector represents one term from your initial list.
If you are interested only in "visual" similarity (words are written similar to each other) then you can settle for levenshtein distance, but it won't be able to give you semantic relatedness of terms.For example, you won't be able to relate between "fall" and "autumn".

How to count votes for qualitative survey answers

We are creating a website for a client that wants a website based around a survey of peoples' '10 favourite things'. There are 10 questions that each user must answer, e.g. 'What is your favourite colour', 'Who is your favourite celebrity', etc., and then the results are collated into a global Top 10 list on the home page.
The conundrum lies in both allowing the user to input anything they want, e.g. their favourite holiday destination might be 'Grandma's house', and being able to accurately count the votes accurately, e.g. User A might say their favourite celebrity is 'The Queen' and User B might says it's 'Queen of England' - we need those two answers to be counted as two votes for the same 'thing'.
If we force the user to choose from a large but predetermined list for each question, it restricts users' ability to define literally anything as their 'favourite thing'. Whereas, if we have a plain text input field and try to interpret answers after they have been submitted, it's going to be much more difficult to count votes where there are variations in names or spelling for the same answer.
Is it possible to automatically moderate their answers in real-time through some form of search phrase suggestion engine? How can we make sure that, if a plain text field is the input method, we make allowances for variations in spelling?
If anyone has any ideas as to possible solutions to this functionality, perhaps a piece of software, a plugin, an API, anything, then please do let us know.
Thank you and please just ask for any clarification.
If you want to automate counting "The Queen" and "The Queen of England", you're in for work that might be more complex than it's worth for a "fun little survey". If the volume is light enough, consider just manually counting the results. Just to give you a feeling, what if someone enters "The Queen of Sweden" or "Queen Letifah Concerts"?
If you really want to go down that route, look into Natural Language Processing (NLP). Specifically, the field of categorization.
For a general introduction to NLP, I recommend the relevant Wikipedia article
http://en.wikipedia.org/wiki/Natural_language_processing
RapidMiner is an open source NLP solution that would be worth looking into.
As Eric J said, this is getting into cutting edge NLP applications. These are fields of study that are very important for AI/automation researchers and computer science in general, but are still very fledgeling. There are a number of programs and algorithms you can use, the drawbacks and benefits of which very widely. RapidMiner is good, WordNet is widely used in medical applications and should be relatively easy to adjust to your own corpus, and there are more advanced methods like latent Dirichlet allocation. Here are a few resources you should start with (in addition to the Wikipedia article provided above)
http://www.semanticsearchart.com/index.html
http://www.mitpressjournals.org/loi/coli
http://marimba.d.umn.edu/ (try the SenseClusters calculator)
http://wordnet.princeton.edu/
The best to classify short answers is k-means clustering. You need to apply stemming. Then you need to convert words into indexes using elementary dictionary. You can use EverGroingDictionary.cs from sematicsearchart.com. After throwing phrase to a dictionary it will be converted to sequence of numbers or vector. Introduce measure of proximity as number of coincidences in words and apply k-means, which is lightning fast algorithm. k-means will organize all answers into groups. Most frequent words in each group will be a signature of the group. Your whole program in C++ or C# or Java must be less than 1000 lines.

User recognition algorithm

let's say you have a big IRC chan log, and you want to find out what user is using multiple accounts. As input you have the time the user connects to the server, and some sort of text analysis ( word frequency, and so on), and as output you want the likelihood two user "matches".
Is it possible to do it using ANN? Are there better algorithms to accomplish that task?
PS : use IP addresses is not an accepted solution :)
The problem with using neural networks is that you need a robust set of training data--that is, you need to have lots of examples of people using multiple accounts where you already know that's what they're doing. Furthermore, if the people you're trying to identify have ever played a role-playing game, they'll probably be able to make themselves seem quite a bit different if they want to.
So, if people are acting just like themselves and you have a pretty good training data set, then you stand a chance. You should probably start with methods used by forensic linguistics.
But I suspect that what you'll probably end up doing is identifying people who are sort of similar to each other. Good for a matchmaking site, perhaps; not so cool for most other things. (For example, I would think this would be a perfectly dreadful way to try to find members of Anonymous in other guises.)
This problem is known as "authorship detection" (or sometimes, in a particular domain, "plagiarism detection"). It can be done using a variety of statistical algorithms, of which neural networks aren't the easiest.
Check out the Cavnar & Trenkle algorithm for text classification. That may be made into a useful baseline algorithm for this task. Implementations in various languages are available on the web. You may want to turn it into a clustering algorithm instead of a classifier.

What's the best way to generate keywords from a given Text?

I want to generate Keywords for my CMS.
Does someone know a good PHP Script (or something else) which generates keywords?
I have a HTML Site like this: http://pastebin.com/ZU8vdyeP
This is a very hard problem for a computer to solve. It would be much easier to get somebody (else?) to do it manually, or simply not do it at all.
If you'd really need a computer to do it, I'd head over to the excellent Python library NLTK which has many tools for this sort of thing (=natural language processing), and it's a lot of fun to work with.
For example, you could calculate a frequency distribution of the words, and then search for the most common hypernyms of larger (above say 5 char) words that appear most frequently and use that as a hint of what the keywords could be.
Again, it is much easier to get it done by a human, however.
to automate, get the words from the article, match them against a blacklist and dont include words under 4 chars.
Additionally, Let user manually edit. So only automate if no present keywords.
This can be done by trigger or application layer.
regards,
/t
If I understand the problem, you have text and you want to determine keywords that are most relevant to the text.
Three approaches:
1) Have user enter keywords
2) Statistical analysis of text, for example determine the words that are far more common in the text than they are in the language overall. Any good text on Information Retrieval will have some algorithms.
3) If you have a set of documents that are already classified (perhaps previously classified by humans) then you can use a machine learning algorithm (perhaps a Bayesian classifier) to train the system to classify the new documents. If you let the users override/correct the suggested keywords, the system can learn over time.
Personally, I'd do #3, since it is more adaptive.

Is it possible to guess a user's mood based on the structure of text?

I assume a natural language processor would need to be used to parse the text itself, but what suggestions do you have for an algorithm to detect a user's mood based on text that they have written? I doubt it would be very accurate, but I'm still interested nonetheless.
EDIT: I am by no means an expert on linguistics or natural language processing, so I apologize if this question is too general or stupid.
This is the basis of an area of natural language processing called sentiment analysis. Although your question is general, it's certainly not stupid - this sort of research is done by Amazon on the text in product reviews for example.
If you are serious about this, then a simple version could be achieved by -
Acquire a corpus of positive/negative sentiment. If this was a professional project you may take some time and manually annotate a corpus yourself, but if you were in a hurry or just wanted to experiment this at first then I'd suggest looking at the sentiment polarity corpus from Bo Pang and Lillian Lee's research. The issue with using that corpus is it is not tailored to your domain (specifically, the corpus uses movie reviews), but it should still be applicable.
Split your dataset into sentences either Positive or Negative. For the sentiment polarity corpus you could split each review into it's composite sentences and then apply the overall sentiment polarity tag (positive or negative) to all of those sentences. Split this corpus into two parts - 90% should be for training, 10% should be for test. If you're using Weka then it can handle the splitting of the corpus for you.
Apply a machine learning algorithm (such as SVM, Naive Bayes, Maximum Entropy) to the training corpus at a word level. This model is called a bag of words model, which is just representing the sentence as the words that it's composed of. This is the same model which many spam filters run on. For a nice introduction to machine learning algorithms there is an application called Weka that implements a range of these algorithms and gives you a GUI to play with them. You can then test the performance of the machine learned model from the errors made when attempting to classify your test corpus with this model.
Apply this machine learning algorithm to your user posts. For each user post, separate the post into sentences and then classify them using your machine learned model.
So yes, if you are serious about this then it is achievable - even without past experience in computational linguistics. It would be a fair amount of work, but even with word based models good results can be achieved.
If you need more help feel free to contact me - I'm always happy to help others interested in NLP =]
Small Notes -
Merely splitting a segment of text into sentences is a field of NLP - called sentence boundary detection. There are a number of tools, OSS or free, available to do this, but for your task a simple split on whitespaces and punctuation should be fine.
SVMlight is also another machine learner to consider, and in fact their inductive SVM does a similar task to what we're looking at - trying to classify which Reuter articles are about "corporate acquisitions" with 1000 positive and 1000 negative examples.
Turning the sentences into features to classify over may take some work. In this model each word is a feature - this requires tokenizing the sentence, which means separating words and punctuation from each other. Another tip is to lowercase all the separate word tokens so that "I HATE you" and "I hate YOU" both end up being considered the same. With more data you could try and also include whether capitalization helps in classifying whether someone is angry, but I believe words should be sufficient at least for an initial effort.
Edit
I just discovered LingPipe that in fact has a tutorial on sentiment analysis using the Bo Pang and Lillian Lee Sentiment Polarity corpus I was talking about. If you use Java that may be an excellent tool to use, and even if not it goes through all of the steps I discussed above.
No doubt it is possible to judge a user's mood based on the text they type but it would be no trivial thing. Things that I can think of:
Capitals tends to signify agitation, annoyance or frustration and is certainly an emotional response but then again some newbies do that because they don't realize the significance so you couldn't assume that without looking at what else they've written (to make sure its not all in caps);
Capitals are really just one form of emphasis. Others are use of certain aggressive colours (eg red) or use of bold or larger fonts;
Some people make more spelling and grammar mistakes and typos when they're highly emotional;
Scanning for emoticons could give you a very clear picture of what the user is feeling but again something like :) could be interpreted as happy, "I told you so" or even have a sarcastic meaning;
Use of expletives tends to have a clear meaning but again its not clearcut. Colloquial speech by many people will routinely contain certain four letter words. For some other people, they might not even say "hell", saying "heck" instead so any expletive (even "sucks") is significant;
Groups of punctuation marks (like ##$#$#) tend to be replaced for expletives in a context when expletives aren't necessarily appropriate, so thats less likely to be colloquial;
Exclamation marks can indicate surprise, shock or exasperation.
You might want to look at Advances in written text analysis or even Determining Mood for a Blog by Combining Multiple Sources of Evidence.
Lastly it's worth noting that written text is usually perceived to be more negative than it actually is. This is a common problem with email communication in companies, just as one example.
I can't believe I'm taking this seriously... assuming a one-dimensional mood space:
If the text contains a curse word,
-10 mood.
I think exclamations would tend to be negative, so -2 mood.
When I get frustrated, I type in
Very. Short. Sentences. -5 mood.
The more I think about this, the more it's clear that a lot of these signifiers indicate extreme mood in general, but it's not always clear what kind of mood.
If you support fonts, bold red text is probably an angry user. Green regular sized texts with butterfly clip art a happy one.
My memory isn't good on this subject, but I believe I saw some research about the grammar structure of the text and the overall tone. That could be also as simple as shorter words and emotion expression words (well, expletives are pretty obvious).
Edit: I noted that the first person to answer had substantially similar post. There could be indeed some serious idea about shorter sentences.
Analysis of mood and behavior is very serious science. Despite the other answers mocking the question law enforcement agencies have been investigating categorization of mood for years. Uses in computers I have heard of generally had more context (timing information, voice pattern, speed in changing channels). I think that you could--with some success--determine if a user is in a particular mood by training a Neural Network with samples from two known groups: angry and not angry. Good luck with your efforts.
I think, my algorythm is rather straightforward, yet, why not calculating smilics through the text :) vs :(
Obviously, the text ":) :) :) :)" resolves to a happy user, while ":( :( :(" will surely resolve to a sad one. Enjoy!
I agree with ojblass that this is a serious question.
Mood categorization is currently a hot topic in the speech recognition area. If you think about it, an interactive voice response (IVR) application needs to handle angry customers far differently than calm ones: angry people should be routed quickly to human operators with the right experience and training. Vocal tone is a pretty reliable indicator of emotion, practical enough so that companies are eager to get this to work. Google "speech emotion recognition", or read this article to find out more.
The situation should be no different in web-based GUIs. Referring back to cletus's comments, the analogies between text and speech emotion detection are interesting. If a person types CAPITALS they are said to be 'shouting', just as if his voice rose in volume and pitch using a voice interface. Detecting typed profanities is analogous to "keyword spotting" of profanity in speech systems. If a person is upset, they'll make more errors using either a GUI or a voice user interface (VUI) and can be routed to a human.
There's a "multimodal" emotion detection research area here. Imagine a web interface that you can also speak to (along the lines of the IBM/Motorola/Opera XHTML + Voice Profile prototype implementation). Emotion detection could be based on a combination of cues from the speech and visual input modality.
Yes.
Whether or not you can do it is another story. The problem seems at first to be AI complete.
Now then, if you had keystroke timings you should be able to figure it out.
Fuzzy logic will do I guess.
Any way it will be quite easy to start with several rules of determining the user's mood and then extend and combine the "engine" with more accurate and sophisticated ones.

Resources