How to deactivate sentence boundary detection in Watson Knowledge Studio? - watson-knowledge-studio

I use Watson Knowledge Studio to analyze resume. When I upload a document, a sentence boundary detection is run by Watson. However, a resume is not exactly like natural language such as emails or comments, and have less punctuation. Therefore, the sentence boundary detection can fail miserably, and split tokens that should be within the same entity on multiple lines.
To handle this problem, I created my own model to detect sentences in a resume. Now, I would like to upload the document to Watson, without letting it try to re-segment sentences.
The best approach that I manage was two put two lines break each time my model predicted a sentence break. Thanks to that, Watson never joins together different sentences. However, it will sometimes consider that a sentence break is missing and add a new one.
How can I deactivate sentence boundary detection in Watson Knowledge Studio?

Unfortunately there is no option to disable/modify sentence boundary detection in Watson Knowledge Studio.

Related

Identify appropriate independent feature to apply machine learning

I am looking to develop a machine learning algorithm to parse a sentence and identify various parts in it. Here is what I mean:
Consider the sentence 'Demonstrate me the procedure to turn on a fan when then there is no electricity'. I would like to split the sentence into:
Command: 'Demonstrate'
Action: 'Turn on a fan'
Condition: 'When there is no electricity'
The way I plan to do this is by using a large number of sample inputs of sentences and specifying target outputs in each case. Then, I would use an appropriate machine learning algorithm for classification.
The problem I am facing is with data preparation for machine learning training. So far, I have thought about the following approach:
1- Parse the sentence and determine POS of each word. Classify each word as 1-7 depending upon the part of speech. Combine each word and the sentence will get a specific code: e.g, 102163374. Use that as an independent feature.
2- Use the total number of words as the second independent feature.
The precise problem with this approach is that the first feature will vary very heavily depending on the number of words there are in the sentence. Is that a problem? If so, how do I deal with it?

Natural Language Processing for Smart Homes

I'm writing up a Smart Home software for my bachelor's degree, that will only simulate the actual house, but I'm stuck at the NLP part of the project. The idea is to have the client listen to voice inputs (already done), transform it into text (done) and send it to the server, which does all the heavy lifting / decision making.
So all my inputs will be fairly short (like "please turn on the porch light"). Based on this, I want to take the decision on which object to act, and how to act. So I came up with a few things to do, in order to write up something somewhat efficient.
Get rid of unnecessary words (in the previous example "please" and "the" are words that don't change the meaning of what needs to be done; but if I say "turn off my lights", "my" does have a fairly important meaning).
Deal with synonyms ("turn on lights" should do the same as "enable lights" -- I know it's a stupid example). I'm guessing the only option is to have some kind of a dictionary (XML maybe), and just have a list of possible words for one particular object in the house.
Detecting the verb and subject. "turn on" is the verb, and "lights" is the subject. I need a good way to detect this.
General implementation. How are these things usually developed in terms of algorithms? I only managed to find one article about NLP in Smart Homes, which was very vague (and had bad English). Any links welcome.
I hope the question is unique enough (I've seen NLP questions on SO, none really helped), that it won't get closed.
If you don't have a lot of time to spend with the NLP problem, you may use the Wit API (http://wit.ai) which maps natural language sentences to JSON:
It's based on machine learning, so you need to provide examples of sentences + JSON output to configure it to your needs. It should be much more robust than grammar-based approaches, especially because the voice-to-speech engine might make mistakes that will break your grammar (but the machine learning module can still get the meaning of the sentence).
I am no way a pioneer in NLP(I love it though) but let me try my hand on this one. For your project I would suggest you to go through Stanford Parser
From your problem definition I guess you don't need anything other then verbs and nouns. SP generates POS(Part of speech tags) That you can use to prune the words that you don't require.
For this I can't think of any better option then what you have in mind right now.
For this again you can use grammatical dependency structure from SP and I am pretty much sure that it is good enough to tackle this problem.
This is where your research part lies. I guess you can find enough patterns using GD and POS tags to come up with an algorithm for your problem. I hardly doubt that any algorithm would be efficient enough to handle every set of input sentence(Structured+unstructured) but something that is more that 85% accurate should be good enough for you.
First, I would construct a list of all possible commands (not every possible way to say a command, just the actual function itself: "kitchen light on" and "turn on the light in the kitchen" are the same command) based on the actual functionality the smart house has available. I assume there is a discrete number of these in the order of no more than hundreds. Assign each some sort of identifier code.
Your job then becomes to map an input of:
a sentence of english text
location of speaker
time of day, day of week
any other input data
to an output of a confidence level (0.0 to 1.0) for each command.
The system will then execute the best match command if the confidence is over some tunable threshold (say over 0.70).
From here it becomes a machine learning application. There are a number of different approaches (and furthermore, approaches can be combined together by having them compete based on features of the input).
To start with I would work through the NLP book from Jurafsky/Manning from Stanford. It is a good survey of current NLP algorithms.
From there you will get some ideas about how the mapping can be machine learned. More importantly how natural language can be broken down into a mathematical structure for machine learning.
Once the text is semantically analyzed, the simplest ML algorithm to try first would be of the supervised ones. To generate training data have a normal GUI, speak your command, then press the corresponding command manually. This forms a single supervised training case. Make some large number of these. Set some aside for testing. It is also unskilled work so other people can help. You can then use these as your training set for your ML algorithm.

Graph of word frequencies

I want to make a function which from a text input produces a word frequency graph like this in the picture. This picture is taken from a report, so I am not sure how they have made it.
There's much more work here that a single function. The following links contain code of projects that perform similar tasks. You could reuse them in your program.
http://www.codeproject.com/Articles/224231/Word-Cloud-Tag-Cloud-Generator-Control-for-NET-Win
http://www.codeproject.com/Articles/19968/WordCloud-A-Squarified-Treemap-of-Word-
Frequency
https://github.com/whydoidoit/WordCloud (Silverlight)
Also, check out the accepted answer of Algorithm to implement a word cloud like Wordle, it's an answer by the creator of Wordle in which he explain the basic algorithm.

What is the typical method to separate connected letters in a word using OCR

I am very new to OCR and almost know nothing about the algorithms used to recognize words. I am just getting familiar to that.
Could anybody please advise on the typical method used to recognize and separate individual characters in connected form (I mean in a word where all letters are linked together)? Forget about handwriting, supposing the letters are connected together using a known font, what is the best method to determine each individual character in a word? When characters are written separately there is no problem, but when they are joined together, we should know where every single character starts and ends in order to go to the next step and match them individually to a letter.
Is there any known algorithm for that?
The standard term for this process is "character segmentation" - segmentation is the image processing term for breaking images into grouped areas for recognition. "Arabic character segmentation" throws up a lot of hits in google scholar if you want to learn more.
I'd encourage you to look at Tesseract - an open source OCR implementation, especially the documents.
Feature as defined in the glossary has a bit on this, but there is a ton of information here.
Basically Tesseract solves the problem (from How Tesseract Works) by looking at blobs (not letters) then combining those blobs into words. This avoids the problem you describe, while creating new problems.
For arabic (as you point out) Tesseract doesn't work. I don't know much about this area but this paper seems to imply Dynamic Time Warping (DTW) is a useful technique. This tries to stretch the words to match them to known words, and again works in word rather than letter space.

Is it possible to guess a user's mood based on the structure of text?

I assume a natural language processor would need to be used to parse the text itself, but what suggestions do you have for an algorithm to detect a user's mood based on text that they have written? I doubt it would be very accurate, but I'm still interested nonetheless.
EDIT: I am by no means an expert on linguistics or natural language processing, so I apologize if this question is too general or stupid.
This is the basis of an area of natural language processing called sentiment analysis. Although your question is general, it's certainly not stupid - this sort of research is done by Amazon on the text in product reviews for example.
If you are serious about this, then a simple version could be achieved by -
Acquire a corpus of positive/negative sentiment. If this was a professional project you may take some time and manually annotate a corpus yourself, but if you were in a hurry or just wanted to experiment this at first then I'd suggest looking at the sentiment polarity corpus from Bo Pang and Lillian Lee's research. The issue with using that corpus is it is not tailored to your domain (specifically, the corpus uses movie reviews), but it should still be applicable.
Split your dataset into sentences either Positive or Negative. For the sentiment polarity corpus you could split each review into it's composite sentences and then apply the overall sentiment polarity tag (positive or negative) to all of those sentences. Split this corpus into two parts - 90% should be for training, 10% should be for test. If you're using Weka then it can handle the splitting of the corpus for you.
Apply a machine learning algorithm (such as SVM, Naive Bayes, Maximum Entropy) to the training corpus at a word level. This model is called a bag of words model, which is just representing the sentence as the words that it's composed of. This is the same model which many spam filters run on. For a nice introduction to machine learning algorithms there is an application called Weka that implements a range of these algorithms and gives you a GUI to play with them. You can then test the performance of the machine learned model from the errors made when attempting to classify your test corpus with this model.
Apply this machine learning algorithm to your user posts. For each user post, separate the post into sentences and then classify them using your machine learned model.
So yes, if you are serious about this then it is achievable - even without past experience in computational linguistics. It would be a fair amount of work, but even with word based models good results can be achieved.
If you need more help feel free to contact me - I'm always happy to help others interested in NLP =]
Small Notes -
Merely splitting a segment of text into sentences is a field of NLP - called sentence boundary detection. There are a number of tools, OSS or free, available to do this, but for your task a simple split on whitespaces and punctuation should be fine.
SVMlight is also another machine learner to consider, and in fact their inductive SVM does a similar task to what we're looking at - trying to classify which Reuter articles are about "corporate acquisitions" with 1000 positive and 1000 negative examples.
Turning the sentences into features to classify over may take some work. In this model each word is a feature - this requires tokenizing the sentence, which means separating words and punctuation from each other. Another tip is to lowercase all the separate word tokens so that "I HATE you" and "I hate YOU" both end up being considered the same. With more data you could try and also include whether capitalization helps in classifying whether someone is angry, but I believe words should be sufficient at least for an initial effort.
Edit
I just discovered LingPipe that in fact has a tutorial on sentiment analysis using the Bo Pang and Lillian Lee Sentiment Polarity corpus I was talking about. If you use Java that may be an excellent tool to use, and even if not it goes through all of the steps I discussed above.
No doubt it is possible to judge a user's mood based on the text they type but it would be no trivial thing. Things that I can think of:
Capitals tends to signify agitation, annoyance or frustration and is certainly an emotional response but then again some newbies do that because they don't realize the significance so you couldn't assume that without looking at what else they've written (to make sure its not all in caps);
Capitals are really just one form of emphasis. Others are use of certain aggressive colours (eg red) or use of bold or larger fonts;
Some people make more spelling and grammar mistakes and typos when they're highly emotional;
Scanning for emoticons could give you a very clear picture of what the user is feeling but again something like :) could be interpreted as happy, "I told you so" or even have a sarcastic meaning;
Use of expletives tends to have a clear meaning but again its not clearcut. Colloquial speech by many people will routinely contain certain four letter words. For some other people, they might not even say "hell", saying "heck" instead so any expletive (even "sucks") is significant;
Groups of punctuation marks (like ##$#$#) tend to be replaced for expletives in a context when expletives aren't necessarily appropriate, so thats less likely to be colloquial;
Exclamation marks can indicate surprise, shock or exasperation.
You might want to look at Advances in written text analysis or even Determining Mood for a Blog by Combining Multiple Sources of Evidence.
Lastly it's worth noting that written text is usually perceived to be more negative than it actually is. This is a common problem with email communication in companies, just as one example.
I can't believe I'm taking this seriously... assuming a one-dimensional mood space:
If the text contains a curse word,
-10 mood.
I think exclamations would tend to be negative, so -2 mood.
When I get frustrated, I type in
Very. Short. Sentences. -5 mood.
The more I think about this, the more it's clear that a lot of these signifiers indicate extreme mood in general, but it's not always clear what kind of mood.
If you support fonts, bold red text is probably an angry user. Green regular sized texts with butterfly clip art a happy one.
My memory isn't good on this subject, but I believe I saw some research about the grammar structure of the text and the overall tone. That could be also as simple as shorter words and emotion expression words (well, expletives are pretty obvious).
Edit: I noted that the first person to answer had substantially similar post. There could be indeed some serious idea about shorter sentences.
Analysis of mood and behavior is very serious science. Despite the other answers mocking the question law enforcement agencies have been investigating categorization of mood for years. Uses in computers I have heard of generally had more context (timing information, voice pattern, speed in changing channels). I think that you could--with some success--determine if a user is in a particular mood by training a Neural Network with samples from two known groups: angry and not angry. Good luck with your efforts.
I think, my algorythm is rather straightforward, yet, why not calculating smilics through the text :) vs :(
Obviously, the text ":) :) :) :)" resolves to a happy user, while ":( :( :(" will surely resolve to a sad one. Enjoy!
I agree with ojblass that this is a serious question.
Mood categorization is currently a hot topic in the speech recognition area. If you think about it, an interactive voice response (IVR) application needs to handle angry customers far differently than calm ones: angry people should be routed quickly to human operators with the right experience and training. Vocal tone is a pretty reliable indicator of emotion, practical enough so that companies are eager to get this to work. Google "speech emotion recognition", or read this article to find out more.
The situation should be no different in web-based GUIs. Referring back to cletus's comments, the analogies between text and speech emotion detection are interesting. If a person types CAPITALS they are said to be 'shouting', just as if his voice rose in volume and pitch using a voice interface. Detecting typed profanities is analogous to "keyword spotting" of profanity in speech systems. If a person is upset, they'll make more errors using either a GUI or a voice user interface (VUI) and can be routed to a human.
There's a "multimodal" emotion detection research area here. Imagine a web interface that you can also speak to (along the lines of the IBM/Motorola/Opera XHTML + Voice Profile prototype implementation). Emotion detection could be based on a combination of cues from the speech and visual input modality.
Yes.
Whether or not you can do it is another story. The problem seems at first to be AI complete.
Now then, if you had keystroke timings you should be able to figure it out.
Fuzzy logic will do I guess.
Any way it will be quite easy to start with several rules of determining the user's mood and then extend and combine the "engine" with more accurate and sophisticated ones.

Resources