Stanford NLP sentiment training set - stanford-nlp

Is there a problem with the original (movie reviews) training set provided by Stanford?
Looking at it, it seems that the words "no" and "not" are always marked as negative and the word "n't" is always marked as neutral. Moreover, words with 2 meanings are also always consistent. One would expect the word "like" to be positive in a phrase such as "I like you" and neutral in a phrase such as "A is like B".
Does anyone know why this is the case?

"Problem" is a relative term. There's not something really wrong, but you could provide arguments for doing things differently.
tl;dr
Annotation was indeed done under the model that one subtree of words (including the limiting case of a single word) always gets the same rating.
The idea here is the principle of compositionality of language: if you want to work out the meaning of a novel large sentence, it's generally accepted that you should work out the meaning of the parts and then work out what happens when those parts are combined. The ratings are doing that for the case of sentiment.
In contrast, it's not obvious what you'd be doing if you were assigning sentiment to a substring in context. For example, if the substring is "[a little bit]", what does it mean to say that you're evaluating it in a context like "The movie was [a little bit] original" or "The movie was [a little bit] boring"? Are you evaluating the sentiment of "a little bit", or are you just looking at the context and attaching to the substring a rating which really reflects the sentiment of "original" or "boring"?
Nevertheless, one can still raise questions about the approach. For one thing, there is no use of word senses. One substring gets one rating. For another, it could be argued that sentiment is a kind of gestalt, and even though words have a meaning and larger phrase meanings are calculated compositionally from them, it doesn't really make sense to say that words have a sentiment absent their use in a particular context. That is, "thin" has a clear meaning, and working from that using your world knowledge, it makes sense that a "thin laptop" is a good thing and "thin walls" are a bad thing, but it doesn't seem like by itself "thin" has a sentiment - it arises as a result of whether the object it refers to is deemed good if thin. Hopefully, in such cases, AMT annotators gave "thin" by itself neutral sentiment, and only gave positive and negative ratings to phrases like "thin laptop" and "thin walls". But, in practice, their minds could easily have been conjuring up a particular context, and they may have judged the word relative to that context.
p.s. This question really seems more Linguistics Stack Exchange than Stack Overflow.

Related

Find basic words and estimate their difficulty

I'm looking for a possibly simple solution of the following problem:
Given input of a sentence like
"Absence makes the heart grow fonder."
Produce a list of basic words followed by their difficulty/complexity
[["absence", 0.5], ["make", 0.05], ["the", 0.01], ["grow", 0.1], ["fond", 0.5]]
Let's assume that:
all the words in the sentence are valid English words
popularity is an acceptable measure of difficulty/complexity
base word can be understood in any constructive way (see below)
difficulty/complexity is on scale from 0 - piece of cake to 1 - mind-boggling
difficulty bias is ok, better to be mistaken saying easy is tough than the other way
working simple solution is preferred to flawless but complicated stuff
[edit] there is no interaction with user
[edit] we can handle any proper English input
[edit] a word is not more difficult than its basic form (because as smart beings we can create unhappily if we know happy), unless it creates a new word (unlikely is not the same difficulty as like)
General ideas:
I considered using Google searches or sites like Wordcount to estimate a word's popularity, which could indicate its difficulty. However, both solutions give different results depending on the form of the entered word. Google gives 316m results for fond but 11m for fonder, whereas Wordcount gives them ranks of 6k and 54k.
Transforming words to their basic forms is not a must but solves the ambiguity problem (and makes it easy to create dictionary links); however, it's not a simple task and its sense could be found arguable. Obviously fond should be taken instead of fonder; however, investigating believe instead of unbelievable seems to be overkill ([edit] it might not be the best example, but there is a moment when modifying a basic word creates a new one, e.g. like -> likely), and words like doorkeeper shouldn't be cut in two.
Some ideas of what should be considered a basic word can be found here on Wikipedia, but maybe a simpler way of determining it would be to use a dictionary. For instance, according to dictionary.reference.com, unbelievable is a basic word whereas fonder comes from fond, but then grow is not the same as growing.
Idea of a solution:
It seems to me that the best way to handle the problem would be using a dictionary to find basic words, apply some of the Wikipedia rules and then use Wordcount (maybe combined with number of Google searches) to estimate difficulty.
Still, there might be (probably is) a simpler and better way, or ready-to-use algorithms. I would appreciate any solution that deals with this problem and is easy to put into practice. Maybe I'm just trying to reinvent the wheel (or maybe you know my approach would work just fine and I'm wasting my time deliberating instead of coding what I have). I would, however, prefer to avoid implementing frequency analysis algorithms or preparing a corpus of texts.
Some terminology:
The core part of the word is called a stem or a root. More on this distinction later. You can think of the root/stem as the part that carries the main meaning of the word and will appear in the dictionary.
(In English) most words are composed of one root (exception: compounds like "windshield") / one stem and zero or more affixes: the affixes that come after the root/stem are called suffixes, and the affixes that precede the root/stem are called prefixes. Examples: "driver" = "drive" (root/stem) + suffix "-er"; "unkind" = "kind" (root/stem) + "un-" (prefix).
Suffixes/prefixes (=affixes) can be inflectional or derivational. For example, in English, third-person singular verbs have an s on the end: "I drive" but "He drive-s". These kinds of agreement suffixes don't change the category of the word: "drive" is a verb regardless of the inflectional "s". On the other hand, a suffix like "-er" is derivational: it takes a verb (e.g. "drive") and turns it into a noun (e.g. "driver").
The stem is the piece of the word without any inflectional affixes, whereas the root is the piece of the word without any derivational affixes. For instance, the plural noun "drivers" is decomposable into "drive" (root) + "er" (derivational affix, makes a new stem "driver") + "s" (plural).
The process of deriving the "base" form of the word is called "stemming".
So, armed with this terminology, it seems that for your task the most useful thing to do would be to stem each form you come across, i.e. remove all the inflectional affixes and keep the derivational ones, since derivational affixes can change how common the word is considered to be. (Think about it this way: if I tell you a new word in English, you will always know how to make it plural or 3rd-person singular; however, you may not know some of the other words you can derive from it.) English being an inflection-poor language, there aren't a lot of inflectional suffixes to worry about (and Google search is pretty good about stripping them off, so maybe you can use Google's stemming engine just by running your word forms through Google search and getting out the highlighted results):
Third singular verbal -s: "I drive"/"He drive-s"
Nominal plural -s: "One wug"/"Two wug-s". Note that there are some irregular forms here such as "children", "oxen", "geese", etc. I think I wouldn't worry about these.
Verbal past tense forms and participial forms. The regular ones are easy: they take -ed for both the past tense and the past participle ("I walk"/"I walk-ed"/"I had walk-ed"), but there are quite a few irregular ones (fall/fell/fallen, dive/dove/dived?, etc). Maybe make a list of these?
Verbal -ing forms: "walk"/"walk-ing"
Adjectival comparative -er and superlative -est. There are a few irregular/suppletive ones ("good"/"better"/"best"), but these should not present a huge problem.
These are the main inflectional affixes in English; I may be forgetting a few that you could discover by picking up an introductory Linguistics book. Also, there are going to be borderline cases, such as "un-", which is so promiscuous that we might consider it inflectional. For more information on these types, see Level 1 vs. Level 2 affixation, but I would treat these cases as derivational for your purposes and not stem them.
As far as "grading" how common various stems are, besides Google you could use various freely available text corpora. The Wikipedia article linked to has a few links to free corpora, and you can find a bunch more by googling. From these corpora you can build a frequency count of each stem, and use that to judge how common the form is.
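The stem-then-count idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `IRREGULAR`, `strip_inflection` and `stem_frequencies` are hypothetical names, the suffix list covers only the inflectional endings listed above, and `vocabulary` stands in for a real dictionary of base forms (a word found there, such as "driver", is left alone so that derivational affixes survive).

```python
from collections import Counter
import re

# Hypothetical, deliberately tiny table of irregular forms (assumption).
IRREGULAR = {"fell": "fall", "fallen": "fall", "better": "good",
             "best": "good", "children": "child", "geese": "goose"}

def strip_inflection(word, vocabulary):
    """Remove one inflectional suffix if that yields a known base form."""
    word = word.lower()
    if word in vocabulary:          # already a dictionary word, e.g. "driver"
        return word
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in ("ing", "est", "ed", "er", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            base = word[: -len(suffix)]
            # Try the bare base, base + "e" (makes -> make),
            # and an undoubled final consonant (bigger -> big).
            undoubled = base[:-1] if base[-1:] == base[-2:-1] else base
            for candidate in (base, base + "e", undoubled):
                if candidate in vocabulary:
                    return candidate
    return word                     # unknown form: leave it untouched

def stem_frequencies(text, vocabulary):
    """Count how often each stem occurs in a piece of corpus text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(strip_inflection(t, vocabulary) for t in tokens)

vocabulary = {"absence", "make", "the", "heart", "grow", "fond",
              "drive", "driver", "walk"}
counts = stem_frequencies("Absence makes the heart grow fonder.", vocabulary)
print(counts.most_common())
```

With a real corpus, the resulting counts (or their ranks) could be rescaled to the 0 to 1 difficulty range the question asks for.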
I'm afraid there is no simple solution to the task of finding "basic" forms. I'm basing that on my memory of my Machine Learning textbook, of which language analysis was a part. You need some database from which you can get them.
At the same time, please note that the number of words people use in everyday language is not that big. You can always ask a user what the base form is of a word you have not seen before (unless this is your homework, which will be automatically checked).
Eventually, if you don't care about covering all words, you can create a simple database, which would contain different forms of the most common words, and then try to use grammatical rules for the less common ones (which would be a good approximation, as actually, the most common words in English are irregular, whereas the uncommon ones are regular, because their original forms have been forgotten).
Note, however, I'm no specialist; I'm simply trying to help :-)

Using Sentiment Analysis to Detect Contradictory Arguments?

I don't have much background in sentiment analysis or natural language processing at all, but I have been reading a bit about it in my spare time. I would like to conduct an experiment to analyze forum threads/comments on sites such as reddit, digg, blogs, etc. I'm particularly interested in doing something like counting the number of for, against, and neutral comments for threads of heated religious and political debates. Here's what I am thinking.
1) Find a thread in which the original poster has defined a touchy political or religious topic.
2) For each comment categorize it as supporting the original poster or otherwise taking a contradicting or neutral stance.
3) Compare various mediums with the numbers of for or against arguments to determine what platforms are good "debate platforms" (i.e. balanced argument counts).
One big problem that I'm anticipating is that heated topics will invoke strong reactions from both supporting and contradicting parties so a simple happy/sad sentiment analysis won't cut it. I'm just sort of interested in this project for my own curiosities, so if anyone knows of similar research or utilities to conduct this experiment I'd be interested to hear more.
Can someone recommend a good sentiment analysis tool, word dictionary, training set, etc. for this task?
IMHO this is not possible without running into semantics. Consider the sentence:
Unlike many others, I am not against the abolishment of capital punishment.
Your AI may need to recognise idiomatic subphrases like "not against", or other "not ..." snippets. This is not impossible ;-)
An additional problem is that "not" is more or less a stopword; its rank will probably be in the top 100, causing a low entropy (though it has a high "semantic" value in every sentence where it is used). Also note that omitting "the abolishment of" will cause the "polarity" of the sentence to flip as well.
You can try to use the bag of words [or even better: use n-grams as tokens to the bag]
The approach is basically:
Classify a set of examples, let your algorithm extract the relevant words from the classified examples.
When a new comment is given, extract the relevant words, and use k-nearest neighbors to decide if the new comment is a pro/against/neutral.
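As a rough sketch of the two steps above, the following uses a bag of unigrams plus bigrams (so that a phrase like "not against" becomes a token of its own) and a cosine-similarity k-nearest-neighbors vote. The function names, the choice of cosine similarity and the toy training comments are all assumptions made for illustration; a real experiment would need far more labelled data.

```python
from collections import Counter
import math
import re

def bag_of_ngrams(text, n=2):
    """Bag of single words plus n-grams (bigrams by default)."""
    words = re.findall(r"[a-z']+", text.lower())
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return Counter(words + grams)

def cosine(a, b):
    """Cosine similarity between two token-count bags."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def knn_classify(comment, labelled, k=3):
    """Label a comment by majority vote among its k most similar examples."""
    query = bag_of_ngrams(comment)
    ranked = sorted(labelled,
                    key=lambda pair: cosine(query, bag_of_ngrams(pair[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Invented toy examples, only to show the call shape.
labelled = [
    ("I fully support this proposal", "pro"),
    ("I am against this proposal", "against"),
    ("What time is the meeting", "neutral"),
]
print(knn_classify("I support it fully", labelled, k=1))
```

Note that a surface-level model like this will still tend to file "I am not against the proposal" next to the "against" example unless the training set contains enough "not against" bigrams, which is exactly the semantics problem described above.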
Also, you might want to have a look at Apache Mahout.

Does an algorithm exist to help detect the "primary topic" of an English sentence?

I'm trying to find out if there is a known algorithm that can detect the "key concept" of a sentence.
The use case is as follows:
User enters a sentence as a query (Does chicken taste like turkey?)
Our system identifies the concepts of the sentence (chicken, turkey)
And it runs a search of our corpus content
The area that we're lacking in is identifying what the core "topic" of the sentence is really about. The sentence "Does chicken taste like turkey" has a primary topic of "chicken", because the user is asking about the taste of chicken. While "turkey" is a helper topic of less importance.
So... I'm trying to find out if there is an algorithm that will help me identify the primary topic of a sentence... Let me know if you are aware of any!!!
I actually did a research project on this and won two competitions and am competing in nationals.
There are two steps to the method:
Parse the sentence with a Context-Free Grammar
In the resulting parse trees, find all nouns which are only subordinate to Noun-Phrase-like constituents
For example, "I ate pie" has 2 nouns: "I" and "pie". Looking at the parse tree, "pie" is inside of a Verb Phrase, so it cannot be a subject. "I", however, is only inside of NP-like constituents. Being the only subject candidate, it is the subject. Find an early copy of this program on http://www.candlemind.com. Note that the vocabulary is limited to basic singular words, and there are no verb conjugations, so it has "man" but not "men", has "eat" but not "ate." Also, the CFG I used was hand-made and limited. I will be updating this program shortly.
Anyway, there are limitations to this program. My mentor pointed out that in its current state, it cannot recognize sentences with subjects that are "real" NPs (what grammar actually calls NPs). For example, "that the moon is flat is not a debate any longer." The subject is actually "that the moon is flat." However, the program would recognize "moon" as the subject. I will be fixing this shortly.
Anyway, this is good enough for most sentences...
My research paper can be found there too. Go to page 11 of it to read the methods.
Hope this helps.
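The second step of the method above can be illustrated on a hand-built parse tree. This is not the author's program: the tuple representation, the tag names, and the decision to treat the sentence root `S` as an allowed ancestor are assumptions made for this sketch, and the tree is written by hand rather than produced by a CFG parser.

```python
def subject_candidates(tree, np_like=("S", "NP"), ancestors=()):
    """Yield nouns whose ancestors are all NP-like (or the sentence root)."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        # Leaf node of the form (part-of-speech tag, word).
        if label in ("N", "NN", "PRP") and all(a in np_like for a in ancestors):
            yield children[0]
        return
    for child in children:
        yield from subject_candidates(child, np_like, ancestors + (label,))

# Hand-written parse of "I ate pie".
tree = ("S",
        ("NP", ("PRP", "I")),
        ("VP", ("V", "ate"),
               ("NP", ("NN", "pie"))))

print(list(subject_candidates(tree)))  # ['I']
```

"pie" is rejected because one of its ancestors is the VP, which matches the reasoning in the example above.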
Most of your basic NLP parsing techniques will be able to extract the basic aspects of the sentence - i.e., that chicken and turkey are NPs and they are linked by an adjective 'like', etc. Getting these to a 'topic' or 'concept' is more difficult.
Techniques such as Latent Semantic Analysis and its many derivatives transform this information into a vector (some have methods of retaining, in part, the hierarchy/relations between parts of speech) and then compare it to existing vectors, usually pre-classified by concept. See http://en.wikipedia.org/wiki/Latent_semantic_analysis to get started.
Edit: Here's an example LSA app you can play around with to see if you might want to pursue it further: http://lsi.research.telcordia.com/lsi/demos.html
For many longer sentences it's difficult to say what exactly the topic is, and there may also be more than one.
One way to get an approximate answer is
1.) First tag the sentence using OpenNLP, the Stanford Parser or any other tagger.
2.) Then remove all the stop words from the sentence.
3.) Pick up the nouns (proper, singular and plural).
Another way is
1.) Chunk the sentence into phrases with any parser.
2.) Pick up all the noun phrases.
3.) Remove the noun phrases that don't have nouns as a child.
4.) Keep only adjectives and nouns; remove all other words from the remaining noun phrases.
This might give an approximate guess.
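The first recipe above (tag, drop stop words, keep nouns) might look roughly like this. To keep the sketch self-contained the tokens are tagged by hand with Penn Treebank-style tags; in practice they would come from a tagger such as OpenNLP or the Stanford tools. `STOP_WORDS` is a small made-up list, not a standard one.

```python
# Assumed inputs: a tiny stop-word list and Penn Treebank noun tags.
STOP_WORDS = {"does", "like", "the", "a", "an", "is", "of"}
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def noun_candidates(tagged_tokens):
    """Keep tokens tagged as nouns that are not stop words."""
    return [word for word, tag in tagged_tokens
            if tag in NOUN_TAGS and word.lower() not in STOP_WORDS]

# Hand-tagged version of the question's example sentence.
tagged = [("Does", "VBZ"), ("chicken", "NN"), ("taste", "VB"),
          ("like", "IN"), ("turkey", "NN"), ("?", ".")]

print(noun_candidates(tagged))  # ['chicken', 'turkey']
```

This returns both concepts; deciding that "chicken" is the primary one still needs something extra, such as word order or the parse structure.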
"Key concept" is not a well-defined term in linguistics, but this may be a starting point: parse the sentence, find the subject in the parse tree or dependency structure that you get. (This doesn't always work; for example, the subject of "Is it raining?" is "it", while the key concept is likely "rain". Also, what's the key concept in "Are spaghetti and lasagna the same thing?")
This kind of problem (NLP + search) is more properly dealt with by methods such as LSA, but that's quite an advanced topic.
On the most basic level, a question in English is usually in the form of <verb> <subject> ... ? or <pronoun> <verb> <subject> ... ?. This is by no means a good algorithm, especially considering that the subject could span several words, but depending on how sophisticated a solution you need, it might be a useful starting point.
If you need precision, ignore this answer.
If you're willing to shell out money, http://www.connexor.com/ is supposed to be able to do this type of semantic analysis for a wide variety of languages, including English. I have never directly used their product, and so can't comment on how well it works.
There's an article about Parsing Noun Phrases in the MIT Computational Linguistics journal of this month: http://www.mitpressjournals.org/doi/pdf/10.1162/COLI_a_00076
Compound or complex sentences may have more than one key concept.
You can use Stanford NLP or MaltParser, which can give the dependency structure of a sentence. It also gives part-of-speech tags and relations such as subject, verb, object, etc.
I think most of the time the object will be the key concept of the sentence.
You should look at Google's Cloud Natural Language API. It's their NLP service.
https://cloud.google.com/natural-language/
A simple solution is to tag your sentence with a part-of-speech tagger (e.g. from the NLTK library for Python), then find matches with some predefined part-of-speech patterns in which it's clear where the main subject of the sentence is.
One option is to look into something like this as a first step:
http://www.abisource.com/projects/link-grammar/
But how you derive the topic from these links is another problem in itself. But as Abiword is trying to detect grammatical problems, you might be able to use it to determine the topic.
By "primary topic" you're referring to what is termed the subject of the sentence.
The subject can be identified by understanding a sentence through natural language processing.
The answer to this question is the same as that for How to determine subject, object and other words? - this is a currently unsolved problem.

Determine the difficulty of an english word

I am working on a word-based game. My word database contains around 10,000 English words (sorted alphabetically). I am planning to have 5 difficulty levels in the game. Level 1 shows the easiest words and Level 5 shows the most difficult words, relatively speaking.
I need to divide the 10,000-word list into 5 levels, starting from the easiest words to difficult ones. I am looking for a program to do this for me.
Can someone tell me if there is an algorithm or a method to quantitatively measure the difficulty of an english word?
I have some thoughts revolving around using the "word length" and "word frequency" as factors, and come up with a formula or something that accomplishes this.
Get a large corpus of texts (e.g. from the Gutenberg archives), do a straight frequency analysis, and eyeball the results. If they don't look satisfying, weight each text with its Flesch-Kincaid score and run the analysis again - words that show up frequently, but in "difficult" texts will get a score boost, which is what you want.
If all you have is 10000 words, though, it will probably be quicker to just do the frequency sorting as a first pass and then tweak the results by hand.
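A first-pass frequency sort of the kind suggested above could be sketched as follows. The function name, the tie-breaking rule (shorter words first among equally frequent ones) and the equal-sized buckets are assumptions for illustration; `corpus_text` stands for whatever large body of text you have collected.

```python
from collections import Counter
import re

def difficulty_levels(words, corpus_text, levels=5):
    """Map each word to a level 1..levels, most frequent words first."""
    counts = Counter(re.findall(r"[a-z]+", corpus_text.lower()))
    # Sort by descending corpus frequency; break ties by length, then A-Z.
    ranked = sorted(words, key=lambda w: (-counts[w.lower()], len(w), w))
    size = -(-len(ranked) // levels)  # ceiling division: words per level
    return {w: i // size + 1 for i, w in enumerate(ranked)}

corpus = "the cat sat on the mat the cat"
game_words = ["the", "cat", "sat", "thoroughly", "zymurgy"]
print(difficulty_levels(game_words, corpus))
```

The output is meant as a starting point to be tweaked by hand, as the answer recommends.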
I'm not understanding how frequency is being used... if you were to scan a newspaper, I'm sure you would see the word "thoroughly" mentioned much more frequently than the word "bop" or "moo" but that doesn't mean it's an easier word; on the contrary 'thoroughly' is one of the most disgustingly absurd spelling anomalies that gives grade school children nightmares...
Try explaining to a sane human being learning english as a second language the subtle difference between slaughter and laughter.
I agree that frequency of use is the most likely metric; there are studies supporting a high correlation between word frequency and difficulty (correct responses on tests, etc.). Check out the English Lexicon Project at http://elexicon.wustl.edu/ for some 70k(?) frequency-rated words.
Crowd-source the answer.
Create an online 'game' that lists 10 words at random.
Get the player to drag and drop them into easiest - hardest, and tick to indicate if the player has ever heard of the word.
Apply a ranking algorithm (e.g. Elo) on the result of each experiment.
Repeat.
It might even be fun to play, you could get a language proficiency score at the end.
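One possible way to turn each player's ordering into ratings is to treat it as a set of pairwise "matches" in which the harder word wins, and apply the standard Elo update. The starting rating of 1000, the K-factor of 32 and the function names are assumptions of this sketch.

```python
from itertools import combinations

def elo_update(ratings, harder, easier, k=32.0):
    """Standard Elo update where the 'harder' word wins the comparison."""
    expected = 1.0 / (1.0 + 10 ** ((ratings[easier] - ratings[harder]) / 400.0))
    delta = k * (1.0 - expected)
    ratings[harder] += delta
    ratings[easier] -= delta

def record_ordering(ratings, ordering):
    """`ordering` lists words from easiest to hardest, as ranked by a player."""
    for easier, harder in combinations(ordering, 2):
        elo_update(ratings, harder, easier)

ratings = {"cat": 1000.0, "thorough": 1000.0, "zymurgy": 1000.0}
record_ordering(ratings, ["cat", "thorough", "zymurgy"])
print(sorted(ratings, key=ratings.get))  # easiest to hardest so far
```

After many players, a higher rating would indicate a word that people consistently judge to be harder.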
Difficulty is a pretty amorphous concept. If you've no clear idea of what you want, perhaps you could take a look at the Porter Stemming Algorithm (see for example the original paper). That contains a more advanced idea of 'length' by defining words as being of the form [C](VC){m}[V]; C means a block of consonants and V a block of vowels, and this definition says a word is an optional C followed by m VC blocks and finally an optional V. The m value is this advanced 'length'.
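For illustration, the m value can be computed by labelling each letter as consonant or vowel and counting how many times a vowel block is followed by a consonant block. The sketch follows Porter's convention that "y" counts as a vowel when it follows a consonant; the function names are my own.

```python
def is_consonant(word, i):
    """Porter's rule: a, e, i, o, u are vowels; y is a vowel after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def porter_measure(word):
    """Number of VC blocks (m) in the form [C](VC){m}[V]."""
    word = word.lower()
    kinds = ["C" if is_consonant(word, i) else "V" for i in range(len(word))]
    # Each vowel-to-consonant transition closes one VC block.
    return sum(1 for prev, cur in zip(kinds, kinds[1:])
               if prev == "V" and cur == "C")

for w in ("tree", "trouble", "private"):
    print(w, porter_measure(w))  # tree 0, trouble 1, private 2
```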
Depending on the type of game the definition of "difficult" will change. If your game involves typing quickly (ztype-style...), "difficult" will have a different meaning than in a game where you need to define a word's meaning.
That said, Scrabble has a way to measure how "difficult" a word is which is also quite easy algorithmically.
Also you may look into defining "difficult" in terms of your game. You could beta test your game and classify words according to how "difficult" players find them in the context of your own game.
There are several factors that relate to word difficulty, including age at acquisition, imageability, concreteness, abstractness, syllables, frequency (spoken and written). There are also psycholinguistic databases that will search for words by at least some of these factors (just do a search for "psycholinguistic database").
Word frequency is an obvious choice (of course not perfect). You can download Google n-grams V2 here, which is licensed under the Creative Commons Attribution 3.0 Unported License.
Format: ngram TAB year TAB match_count TAB page_count TAB volume_count NEWLINE
Example:
Corpus used (from Lin, Yuri, et al. "Syntactic annotations for the google books ngram corpus." Proceedings of the ACL 2012 system demonstrations. Association for Computational Linguistics, 2012.):
Word length is a good indicator; for word frequency, you would need data, as an algorithm obviously cannot determine it by itself.
You could also use some sort of scoring like the Scrabble game does: each letter has a value and the final value would be the sum of the values.
It would be, imo, easier to find frequency data about each letter in your language.
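A letter-value score of this kind is easy to sketch. The table below uses the standard English-language Scrabble tile values; whether those values track "difficulty" for your game is an open question, and the function name is just illustrative.

```python
# Standard English Scrabble letter values, grouped by score.
SCRABBLE_VALUES = {}
for letters, value in (("aeioulnstr", 1), ("dg", 2), ("bcmp", 3),
                       ("fhvwy", 4), ("k", 5), ("jx", 8), ("qz", 10)):
    for letter in letters:
        SCRABBLE_VALUES[letter] = value

def scrabble_score(word):
    """Sum of letter values; characters outside a-z score 0."""
    return sum(SCRABBLE_VALUES.get(ch, 0) for ch in word.lower())

for w in ("moo", "quiz", "thoroughly"):
    print(w, scrabble_score(w))  # moo 5, quiz 22, thoroughly 20
```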
In his article on spell correction Peter Norvig uses a dictionary to count the number of occurrences of each word (and thus determine their frequency).
You could use this as a stepping stone :)
Also, frequency should probably influence the difficulty more than length... you would have to beta-test the game for that.
In addition to metrics such as Flesch-Kincaid, you could try an approach based on the Dale-Chall readability formula, using lists of words that are familiar to readers of a particular level of ability.
Implementations of many of the readability formulae contain code for estimating the number of syllables in a word, which may also be useful.
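As an example of the kind of syllable estimate such implementations contain, here is a crude heuristic: count runs of vowel letters and discount a silent final "e". It is an assumption-laden approximation that will miscount many English words, not the routine used by any particular readability library.

```python
import re

def estimate_syllables(word):
    """Rough syllable count: vowel groups, minus a silent trailing 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # A final "e" is usually silent ("make"), except in "-le" ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

for w in ("cat", "make", "table", "laughter", "thoroughly"):
    print(w, estimate_syllables(w))
```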
I would guess that the grade at which the word is introduced into normal students' vocabulary is a measure of difficulty. Next would be how many standard rule violations it has, meaning words that have spellings or pronunciations that seem to violate the normal set of rules. Finally, the meaning can be a tough concept: for example, try explaining "abstract" to someone who's never heard the word.
Without claiming to know anything about their algorithm, there is an API that returns a 1-10 scale word difficulty: TwinWord API
I have never used it, myself, though.

Is it possible to guess a user's mood based on the structure of text?

I assume a natural language processor would need to be used to parse the text itself, but what suggestions do you have for an algorithm to detect a user's mood based on text that they have written? I doubt it would be very accurate, but I'm still interested nonetheless.
EDIT: I am by no means an expert on linguistics or natural language processing, so I apologize if this question is too general or stupid.
This is the basis of an area of natural language processing called sentiment analysis. Although your question is general, it's certainly not stupid - this sort of research is done by Amazon on the text in product reviews for example.
If you are serious about this, then a simple version could be achieved by -
Acquire a corpus of positive/negative sentiment. If this was a professional project you may take some time and manually annotate a corpus yourself, but if you were in a hurry or just wanted to experiment with this at first then I'd suggest looking at the sentiment polarity corpus from Bo Pang and Lillian Lee's research. The issue with using that corpus is it is not tailored to your domain (specifically, the corpus uses movie reviews), but it should still be applicable.
Split your dataset into sentences, each either Positive or Negative. For the sentiment polarity corpus you could split each review into its constituent sentences and then apply the overall sentiment polarity tag (positive or negative) to all of those sentences. Split this corpus into two parts - 90% should be for training, 10% should be for testing. If you're using Weka then it can handle the splitting of the corpus for you.
Apply a machine learning algorithm (such as SVM, Naive Bayes, Maximum Entropy) to the training corpus at a word level. This model is called a bag of words model, which is just representing the sentence as the words that it's composed of. This is the same model which many spam filters run on. For a nice introduction to machine learning algorithms there is an application called Weka that implements a range of these algorithms and gives you a GUI to play with them. You can then test the performance of the machine learned model from the errors made when attempting to classify your test corpus with this model.
Apply this machine learning algorithm to your user posts. For each user post, separate the post into sentences and then classify them using your machine learned model.
So yes, if you are serious about this then it is achievable - even without past experience in computational linguistics. It would be a fair amount of work, but even with word based models good results can be achieved.
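To make the steps above concrete, here is a minimal bag-of-words Naive Bayes classifier written from scratch, with add-one smoothing. The class name, the regex tokenizer and the four training sentences are invented for illustration; in practice you would train on a real corpus such as the one mentioned above and hold out part of it for testing.

```python
from collections import Counter, defaultdict
import math
import re

def tokenize(sentence):
    """Lowercase and split into word tokens (a very simple tokenizer)."""
    return re.findall(r"[a-z']+", sentence.lower())

class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.label_counts = Counter()            # label -> sentence count
        self.vocabulary = set()

    def train(self, labelled_sentences):
        for sentence, label in labelled_sentences:
            self.label_counts[label] += 1
            for token in tokenize(sentence):
                self.word_counts[label][token] += 1
                self.vocabulary.add(token)

    def classify(self, sentence):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            score = math.log(n / total)  # log prior
            denom = sum(self.word_counts[label].values()) + len(self.vocabulary)
            for token in tokenize(sentence):
                # Add-one smoothing so unseen words do not zero the score.
                score += math.log((self.word_counts[label][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

model = NaiveBayes()
model.train([("I love this wonderful film", "pos"),
             ("a great and moving story", "pos"),
             ("I hate this boring film", "neg"),
             ("a dull and terrible story", "neg")])
print(model.classify("what a wonderful story"))  # pos
```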
If you need more help feel free to contact me - I'm always happy to help others interested in NLP =]
Small Notes -
Merely splitting a segment of text into sentences is a field of NLP - called sentence boundary detection. There are a number of tools, OSS or free, available to do this, but for your task a simple split on whitespaces and punctuation should be fine.
SVMlight is also another machine learner to consider, and in fact their inductive SVM does a similar task to what we're looking at - trying to classify which Reuters articles are about "corporate acquisitions" with 1000 positive and 1000 negative examples.
Turning the sentences into features to classify over may take some work. In this model each word is a feature - this requires tokenizing the sentence, which means separating words and punctuation from each other. Another tip is to lowercase all the separate word tokens so that "I HATE you" and "I hate YOU" both end up being considered the same. With more data you could try and also include whether capitalization helps in classifying whether someone is angry, but I believe words should be sufficient at least for an initial effort.
Edit
I just discovered LingPipe that in fact has a tutorial on sentiment analysis using the Bo Pang and Lillian Lee Sentiment Polarity corpus I was talking about. If you use Java that may be an excellent tool to use, and even if not it goes through all of the steps I discussed above.
No doubt it is possible to judge a user's mood based on the text they type but it would be no trivial thing. Things that I can think of:
Capitals tend to signify agitation, annoyance or frustration and are certainly an emotional response, but then again some newbies do that because they don't realize the significance, so you couldn't assume that without looking at what else they've written (to make sure it's not all in caps);
Capitals are really just one form of emphasis. Others are use of certain aggressive colours (eg red) or use of bold or larger fonts;
Some people make more spelling and grammar mistakes and typos when they're highly emotional;
Scanning for emoticons could give you a very clear picture of what the user is feeling but again something like :) could be interpreted as happy, "I told you so" or even have a sarcastic meaning;
Use of expletives tends to have a clear meaning but again it's not clear-cut. Colloquial speech by many people will routinely contain certain four letter words. For some other people, they might not even say "hell", saying "heck" instead so any expletive (even "sucks") is significant;
Groups of punctuation marks (like ##$#$#) tend to be substituted for expletives in a context when expletives aren't necessarily appropriate, so that's less likely to be colloquial;
Exclamation marks can indicate surprise, shock or exasperation.
You might want to look at Advances in written text analysis or even Determining Mood for a Blog by Combining Multiple Sources of Evidence.
Lastly it's worth noting that written text is usually perceived to be more negative than it actually is. This is a common problem with email communication in companies, just as one example.
I can't believe I'm taking this seriously... assuming a one-dimensional mood space:
If the text contains a curse word, -10 mood.
I think exclamations would tend to be negative, so -2 mood.
When I get frustrated, I type in Very. Short. Sentences. -5 mood.
The more I think about this, the more it's clear that a lot of these signifiers indicate extreme mood in general, but it's not always clear what kind of mood.
If you support fonts, bold red text is probably an angry user. Green regular sized texts with butterfly clip art a happy one.
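In the same half-serious spirit, the one-dimensional scoring above might be coded like this. Everything beyond the three penalties is an assumption: `CURSE_WORDS` is a placeholder list, the exclamation penalty is applied once per text, and "very short sentences" is taken to mean at least two sentences of at most two words each.

```python
import re

CURSE_WORDS = {"damn", "hell"}  # hypothetical placeholder list

def mood_score(text):
    """One-dimensional mood score; more negative means a worse mood."""
    score = 0
    words = re.findall(r"[a-z']+", text.lower())
    if any(w in CURSE_WORDS for w in words):
        score -= 10                      # curse word present
    if "!" in text:
        score -= 2                       # exclamation present
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if len(sentences) >= 2 and all(len(s.split()) <= 2 for s in sentences):
        score -= 5                       # Very. Short. Sentences.
    return score

print(mood_score("Very. Short. Sentences."))  # -5
print(mood_score("Damn this!"))               # -12
```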
My memory isn't good on this subject, but I believe I saw some research about the grammar structure of the text and the overall tone. That could be also as simple as shorter words and emotion expression words (well, expletives are pretty obvious).
Edit: I noted that the first person to answer had a substantially similar post. There could indeed be a serious idea in the point about shorter sentences.
Analysis of mood and behavior is very serious science. Despite the other answers mocking the question, law enforcement agencies have been investigating categorization of mood for years. Uses in computers I have heard of generally had more context (timing information, voice pattern, speed in changing channels). I think that you could--with some success--determine if a user is in a particular mood by training a Neural Network with samples from two known groups: angry and not angry. Good luck with your efforts.
I think my algorithm is rather straightforward: why not count the smileys in the text, :) vs :(
Obviously, the text ":) :) :) :)" resolves to a happy user, while ":( :( :(" will surely resolve to a sad one. Enjoy!
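A literal sketch of that smiley count, assuming only the four emoticon spellings shown in the code and an "unknown" result on a tie (both choices are mine):

```python
def smiley_mood(text):
    """Compare counts of happy and sad emoticons in the text."""
    happy = text.count(":)") + text.count(":-)")
    sad = text.count(":(") + text.count(":-(")
    if happy > sad:
        return "happy"
    if sad > happy:
        return "sad"
    return "unknown"

print(smiley_mood(":) :) :) :)"))  # happy
print(smiley_mood(":( :( :("))     # sad
```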
I agree with ojblass that this is a serious question.
Mood categorization is currently a hot topic in the speech recognition area. If you think about it, an interactive voice response (IVR) application needs to handle angry customers far differently than calm ones: angry people should be routed quickly to human operators with the right experience and training. Vocal tone is a pretty reliable indicator of emotion, practical enough so that companies are eager to get this to work. Google "speech emotion recognition", or read this article to find out more.
The situation should be no different in web-based GUIs. Referring back to cletus's comments, the analogies between text and speech emotion detection are interesting. If a person types CAPITALS they are said to be 'shouting', just as if his voice rose in volume and pitch using a voice interface. Detecting typed profanities is analogous to "keyword spotting" of profanity in speech systems. If a person is upset, they'll make more errors using either a GUI or a voice user interface (VUI) and can be routed to a human.
There's a "multimodal" emotion detection research area here. Imagine a web interface that you can also speak to (along the lines of the IBM/Motorola/Opera XHTML + Voice Profile prototype implementation). Emotion detection could be based on a combination of cues from the speech and visual input modality.
Yes.
Whether or not you can do it is another story. The problem seems at first to be AI complete.
Now then, if you had keystroke timings you should be able to figure it out.
Fuzzy logic will do I guess.
Anyway, it will be quite easy to start with several rules for determining the user's mood and then extend and combine the "engine" with more accurate and sophisticated ones.

Resources