Algorithm choice for gaining intelligence from messages

What I'm trying to do is find an algorithm that I can implement to generate 'intelligent' suggestions to people, by comparing the messages they send to messages sent by their peers.
For example, Person A sends a message to Person B talking about Obj1. If Person C sends a message to Person D about Obj1, the system should notice that they are talking about the same thing, and may suggest that Person A talk to Person C.
I have implemented collecting the statistics that capture the mentions people have in common, but I do not know which algorithm to use to analyse them.
Any suggestions?
(I hope this makes enough sense)

Take a look at clustering algorithms, and at k-means or k-nearest neighbours for a quick start.
How much data have you got? The more the better.
There are lots of approaches to this problem. You may, for example, assume that all users are similar to each other to some degree, so what you want is to find, for each user, the most similar ones. A vector space model with cosine similarity will give you quick results.
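To make the vector space idea concrete, here is a minimal sketch in Python. It assumes the collected statistics can be expressed as one count vector per user (object mentioned -> number of mentions). The names `mentions`, `cosine_similarity` and `most_similar_users` are made up for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_users(user, mentions, top_n=3):
    """Rank every other user by similarity to `user`, best first."""
    scores = [
        (other, cosine_similarity(mentions[user], mentions[other]))
        for other in mentions
        if other != user
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]


# Hypothetical mention counts: user -> how often each object was mentioned.
mentions = {
    "A": Counter({"Obj1": 3, "Obj2": 1}),
    "C": Counter({"Obj1": 2}),
    "D": Counter({"Obj3": 4}),
}
```

With this data, `most_similar_users("A", mentions)` ranks C first, because A and C both mention Obj1, while D shares nothing with A and scores 0.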
Give some more information on what you want to achieve.

This is exactly the same problem Twitter is battling with. You might end up with a job there if you crack this ;)
On a more serious note, one could use some crude, heuristic-based measures to do something like this, but they have a high error rate, as delnan said in the comments.
NLP is a sure bet. Note that NLP also has some error rate, but it's far more accurate than any heuristic you would use. If you are using Python, I would suggest this toolkit, which I use now and then - NLP.
For other languages I am sure there are packages which will help you in this regard.
UPDATE1: If you have a way for the users to tag their messages (like Stack Overflow does), you could approach this problem without NLP. Then you could simply take the intersection of the tags of both messages to see if there is any commonality, and suggest some top items for the common tags.
But there are other issues you'll have to deal with: making tags mandatory, making sure that users actually enter correct tags, and so on. Nevertheless, this greatly simplifies your problem.
UPDATE2: Since the question has been updated: you are only interested in some specific keywords/phrases, which simplifies things. Take each message, split it into words, then stem each word. After stemming, intersect this set with your set of keywords to get a set S1. Do the same with the second message to get a set S2. Then intersect S1 and S2. If the intersection is non-empty, bingo: the two messages share a theme. Otherwise they have nothing in common.
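The stem-and-intersect steps can be sketched as follows. The `crude_stem` function is a deliberately naive stand-in written for this example. It is not a real stemming algorithm, and in practice you would swap in a proper stemmer from an NLP library. The keyword list is also made up.

```python
import re

SUFFIXES = ("ing", "ed", "es", "s")  # crude, illustrative only


def crude_stem(word: str) -> str:
    """Strip one common English suffix; a stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def keyword_set(message: str, keywords: set) -> set:
    """Split, lowercase and stem a message, then keep only known keywords."""
    words = re.findall(r"[a-z0-9]+", message.lower())
    return {crude_stem(w) for w in words} & keywords


def common_themes(message1: str, message2: str, keywords: set) -> set:
    """Intersect the keyword sets S1 and S2 of the two messages."""
    s1 = keyword_set(message1, keywords)
    s2 = keyword_set(message2, keywords)
    return s1 & s2


# Keywords are stored in stemmed form so they match the stemmed words.
keywords = {crude_stem(k) for k in ("router", "printers", "backup")}
```

A non-empty result from `common_themes` means the two messages mention at least one keyword of interest in common. An empty set means they do not.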

Related

How do I get a quick and dirty recognition of possible typos in .net?

I have to manually go through a long list of terms (~3500) which have been entered by users through the years. Among other things, I want to reduce the list by looking for synonyms, typos and alternate spellings.
My work will be much easier if I can group the list into clusters of possible typos before starting. I was thinking of using some metric which can calculate the similarity between two terms, e.g. in percent, and then clustering everything which has a similarity higher than some threshold. As I am going through it manually anyway, I don't mind a high failure rate, if it can keep the whole thing simple.
Ideally, there exists some easily available library to do this for me, implemented by people who know what they are doing. If there is none, then at least one that calculates a similarity metric for a pair of strings would be great; I can manage the clustering myself.
If this is not available either, do you know of a good algorithm which is simple to implement? I was first thinking that a Hamming distance divided by word length would be a good metric, but noticed that while it will catch substituted letters, it won't handle deletions and insertions well (ptgs-1 will be caught as very similar to ptgs/1, but hematopoiesis won't be caught as very similar to haematopoiesis).
As for the requirements on the library/algorithm: it has to rely completely on spelling. I know that the usual NLP libraries don't work this way, but:
There is no full text available for it to consider context.
It can't use a dictionary corpus of words, because the terms are far outside of any everyday language, frequently abbreviations of highly specialized terms.
Finally, I am most familiar with C# as a programming language, and I already have a C# pseudoscript which does some preliminary cleanup. If there is no one-step solution (feed list in, get grouped list out), I will prefer a library I can call from within a .NET program.
The whole thing should be relatively quick to learn for somebody with almost no previous knowledge in information retrieval. This will save me maybe 5-6 hours of manual work, and I don't want to spend more time than that in setting up an automated solution. OK, maybe up to 50% longer if I get the chance to learn something awesome :)
The question: What should I use, a library, or an algorithm? Which ones should I consider? If what I need is a library, how do I recognize one which is capable of delivering results based on spelling alone, as opposed to relying on context or dictionary use?
Edit: To clarify, I am not looking for actual semantic relatedness the way search or recommendation engines need it. I need to catch typos. So, I am looking for a metric by which mouse and rodent have zero similarity, but mouse and house have a very high similarity. And I am afraid that tools like Lucene use a metric which gets these two examples wrong (for my purposes).
Basically you are looking to cluster terms according to Semantic Relatedness.
One (hard) way to do it is to follow Markovitch and Gabrilovich's approach.
A quicker way consists of the following steps:
Download a Wikipedia dump and an open source Information Retrieval library such as Lucene (or Lucene.NET).
Index the files.
Search each term in the index - and get a vector - denoting how relevant the term (the query) is for each document. Note that this will be a vector of size |D|, where |D| is the total number of documents in the collection.
Cluster your vectors in any clustering algorithm. Each vector represents one term from your initial list.
If you are interested only in "visual" similarity (words that are spelled similarly to each other), then you can settle for Levenshtein distance, but it won't be able to give you the semantic relatedness of terms. For example, you won't be able to relate "fall" and "autumn".
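For the spelling-only case, a minimal sketch of Levenshtein distance with a simple threshold grouping is shown below. It is written in Python for brevity and uses only loops and lists, so it should port to C# almost line by line. The normalisation by the longer string's length, the greedy grouping strategy and the default threshold of 0.8 are assumptions made for this example, not part of any library.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """1.0 for identical strings, approaching 0.0 as they diverge."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def group_terms(terms, threshold=0.8):
    """Greedy single pass: put each term in the first group it resembles."""
    groups = []
    for term in terms:
        for group in groups:
            if any(similarity(term, member) >= threshold for member in group):
                group.append(term)
                break
        else:
            groups.append([term])
    return groups
```

Unlike a Hamming distance, this handles the inserted letter in haematopoiesis as a single edit. The greedy grouping depends on input order and compares every term against every group member, which is quadratic but unproblematic for a few thousand terms.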

Natural Language Processing for Smart Homes

I'm writing Smart Home software for my bachelor's degree that will only simulate the actual house, but I'm stuck at the NLP part of the project. The idea is to have the client listen to voice inputs (already done), transform them into text (done) and send the text to the server, which does all the heavy lifting / decision making.
So all my inputs will be fairly short (like "please turn on the porch light"). Based on this, I want to decide which object to act on, and how to act. So I came up with a few things to do, in order to write something reasonably efficient.
1. Get rid of unnecessary words (in the previous example "please" and "the" are words that don't change the meaning of what needs to be done; but if I say "turn off my lights", "my" does have a fairly important meaning).
2. Deal with synonyms ("turn on lights" should do the same as "enable lights" -- I know it's a stupid example). I'm guessing the only option is to have some kind of a dictionary (XML maybe), and just have a list of possible words for one particular object in the house.
3. Detecting the verb and subject. "turn on" is the verb, and "lights" is the subject. I need a good way to detect this.
4. General implementation. How are these things usually developed in terms of algorithms? I only managed to find one article about NLP in Smart Homes, which was very vague (and had bad English). Any links welcome.
I hope the question is unique enough (I've seen NLP questions on SO, none really helped), that it won't get closed.
If you don't have a lot of time to spend on the NLP problem, you may use the Wit API (http://wit.ai), which maps natural language sentences to JSON.
It's based on machine learning, so you need to provide examples of sentences + JSON output to configure it to your needs. It should be much more robust than grammar-based approaches, especially because the speech-to-text engine might make mistakes that will break your grammar (but the machine learning module can still get the meaning of the sentence).
I am in no way a pioneer in NLP (I love it, though), but let me try my hand at this one. For your project I would suggest you look at the Stanford Parser (SP).
1. From your problem definition I guess you don't need anything other than verbs and nouns. SP generates part-of-speech (POS) tags that you can use to prune the words you don't need.
2. For this I can't think of any better option than what you have in mind right now.
3. For this, again, you can use the grammatical dependency (GD) structure from SP, and I am pretty sure that it is good enough to tackle this problem.
4. This is where your research part lies. I guess you can find enough patterns using GD and POS tags to come up with an algorithm for your problem. I highly doubt that any algorithm would be efficient enough to handle every kind of input sentence (structured and unstructured), but something that is more than 85% accurate should be good enough for you.
First, I would construct a list of all possible commands (not every possible way to say a command, just the actual function itself: "kitchen light on" and "turn on the light in the kitchen" are the same command) based on the actual functionality the smart house has available. I assume there is a finite number of these, on the order of no more than hundreds. Assign each some sort of identifier code.
Your job then becomes to map an input of:
a sentence of English text
location of speaker
time of day, day of week
any other input data
to an output of a confidence level (0.0 to 1.0) for each command.
The system will then execute the best match command if the confidence is over some tunable threshold (say over 0.70).
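As a rough illustration of the "confidence per command, then threshold" idea, here is a sketch in Python. It uses plain word overlap as a placeholder for a learned model and ignores the other inputs (speaker location, time of day). The command identifiers, their word sets and the scoring rule are all hypothetical.

```python
import re

# Hypothetical command table: identifier -> words that characterise it.
COMMANDS = {
    "PORCH_LIGHT_ON": {"porch", "light", "on"},
    "PORCH_LIGHT_OFF": {"porch", "light", "off"},
    "KITCHEN_LIGHT_ON": {"kitchen", "light", "on"},
}


def confidences(sentence: str) -> dict:
    """Score each command by the fraction of its words found in the sentence."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return {
        command: len(words & expected) / len(expected)
        for command, expected in COMMANDS.items()
    }


def best_command(sentence: str, threshold: float = 0.70):
    """Return the best-scoring command, or None if nothing clears the threshold."""
    scores = confidences(sentence)
    command = max(scores, key=scores.get)
    return command if scores[command] > threshold else None
```

For "please turn on the porch light" the porch-on command scores 1.0 and the other two score 2/3, so only the first clears the 0.70 threshold. A trained classifier would replace `confidences` while keeping the same interface.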
From here it becomes a machine learning application. There are a number of different approaches (and furthermore, approaches can be combined together by having them compete based on features of the input).
To start with I would work through the NLP book from Jurafsky/Manning from Stanford. It is a good survey of current NLP algorithms.
From there you will get some ideas about how the mapping can be machine learned. More importantly how natural language can be broken down into a mathematical structure for machine learning.
Once the text is semantically analyzed, the simplest ML algorithm to try first would be one of the supervised ones. To generate training data, have a normal GUI, speak your command, then press the corresponding command manually. This forms a single supervised training case. Make some large number of these. Set some aside for testing. It is also unskilled work, so other people can help. You can then use these as the training set for your ML algorithm.

Intelligent web features, algorithms (people you may follow, similar to you ...)

I have 3 main questions about the algorithms in the intelligent web (web 2.0).
Here is the book I'm reading: http://www.amazon.com/Algorithms-Intelligent-Web-Haralambos-Marmanis/dp/1933988665. I want to learn the algorithms in more depth.
1. People You may follow (Twitter)
How can one determine the nearest result to my requests? Data mining? Which algorithms?
2. How you’re connected feature (LinkedIn)
The algorithm simply works like this: it draws the path between two nodes, say between me and another person, C. Me -> A, B -> A connections -> C. It is not a brute-force algorithm or any other kind of graph algorithm :)
3. Similar to you (Twitter, Facebook)
This algorithm is similar to 1. Does it simply take the max(count) of friends in common (Facebook) or of followers in common (Twitter), or do they implement some other algorithm? I think the second part is true, because running the loop
    def most_similar(me, contacts_of):
        # contacts_of maps each person to the set of people they are connected to
        counts = {}
        for person in contacts_of[me]:
            counts[person] = len(contacts_of[me] & contacts_of[person])
        return max(counts, key=counts.get)
on every page refresh would be silly.
4. Did you mean (Google)
I know that they may implement it with a phonetic algorithm (http://en.wikipedia.org/wiki/Phonetic_algorithm), most simply Soundex (http://en.wikipedia.org/wiki/Soundex), and here is Google VP of Engineering and CIO Douglas Merrill speaking about it: http://www.youtube.com/watch?v=syKY8CrHkck#t=22m03s
What about the first 3 questions? Any ideas are welcome!
Thanks
People who you may follow
You can use a factor-based calculation:
factorA = getFactorA(); // say double(0.3)
factorB = getFactorB(); // say double(0.6)
factorC = getFactorC(); // say double(0.8)
result = (factorA + factorB + factorC) / 3; // double(0.5666666666666667)
// if result is more than 0.5, you show this person
So in the case of Twitter, "People who you may follow" can be based on the following factors (User A is the user viewing the feature, User B is a candidate, and there may be more or fewer factors):
Similarity between the frequent keywords found in User A's and User B's tweets
Similarity between the profile descriptions of both users
Proximity between the locations of User A and User B
Whether the people User A is following also follow User B
So where does the candidate list for "People who you may follow" come from? It probably comes from a combination of people with a large number of followers (they are probably celebrities, alpha geeks, famous products/services, etc.) and the people followed by those whom User A is following.
Basically there's a certain amount of data mining to be done here: reading the tweets and bios, and doing the calculations. This can be done in a daily or weekly cron job when the server load is lowest for the day (or maybe done 24/7 on a separate server).
How are you connected
This is probably a smart trick to make you feel that loads of brute force has gone into determining the path. However, after some surface research, I found that it is simple:
Say you are User A; User B is your connection; and User C is a connection of User B.
In order for you to visit User C, you need to visit User B's profile first. When you visit User B's profile, the website already saves the info indicating that User A is at User B's profile. So when you visit User C from User B, the website immediately tells you 'User A -> User B -> User C', ignoring all other possible paths.
This is the maximum depth: at User C, User A cannot go on to look at User C's connections until User C is User A's connection.
Source: observing LinkedIn
Similar to you
It's the exact same thing as #1 (People you may follow), except that the algorithm reads in a different list of people. The list of people that the algorithm reads in is the people whom you follow.
Did you mean
Well, you got it right there, except that Google probably uses more than just Soundex. There's language translation, word replacement, and many other algorithms used in Google's case. I can't comment much on this because it will probably get very complex, and I am not an expert in language processing.
If we research a little more into Google's infrastructure, we can find that Google has servers dedicated to Spelling and Translation services. You can get more information on Google platform at http://en.wikipedia.org/wiki/Google_platform.
Conclusion
The key to computationally intensive algorithms is caching. Once you cache the result, you don't have to recompute it on every page load. Google does it, Stack Overflow does it (on most of the pages with lists of questions), and, not surprisingly, Twitter does too!
Basically, algorithms are defined by developers. You may use others' algorithms, but ultimately, you can also create your own.
People you may follow
Could be one of many types of recommendation algorithms, maybe collaborative filtering?
How you are connected
This is just a shortest path algorithm on the social graph. Assuming there is no weight on the connections, it will simply use breadth-first search.
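A minimal sketch of breadth-first search on an unweighted connection graph, assuming the graph fits in memory as an adjacency mapping. The `graph` data and the function name are made up for illustration, and a real social network would need a distributed or precomputed variant.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns a shortest path as a list, or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for neighbour in graph.get(person, ()):
            if neighbour in previous:
                continue
            previous[neighbour] = person
            if neighbour == goal:
                # Walk the predecessor links back to the start.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


# Hypothetical undirected connection graph.
graph = {
    "Me": ["A", "B"],
    "A": ["Me", "C"],
    "B": ["Me"],
    "C": ["A", "D"],
    "D": ["C"],
    "E": [],
}
```

Because BFS explores the graph level by level, the first time it reaches the goal it has found a path with the fewest hops, which is what a "how you're connected" display needs.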
Similar to you
Simply a re-arrangement of the data set using the same algorithm as People you may follow.
Check out the book Programming Collective Intelligence for a good introduction to the type of algorithms that are used for People you may follow and Similar to you. It has great Python code available too.
People You may follow
From Twitter blog - "suggestions are based on several factors, including people you follow and the people they follow" http://blog.twitter.com/2010/07/discovering-who-to-follow.html
So if you follow A and B and they both follow C, then Twitter will suggest C to you...
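The "you follow A and B, both follow C" rule can be sketched as a simple count over the follow graph. This is only an illustration of that one factor, not Twitter's actual method. The `follows` data and the function name are hypothetical.

```python
from collections import Counter


def suggest_follows(user, follows, top_n=3):
    """Rank accounts followed by the people `user` follows."""
    already = follows.get(user, set())
    counts = Counter()
    for followed in already:
        for candidate in follows.get(followed, set()):
            # Skip the user themselves and accounts they already follow.
            if candidate != user and candidate not in already:
                counts[candidate] += 1
    return counts.most_common(top_n)


# Hypothetical follow graph: account -> set of accounts it follows.
follows = {
    "you": {"A", "B"},
    "A": {"C", "D"},
    "B": {"C", "you"},
}
```

Here C is followed by both A and B, so it is ranked above D, which only A follows.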
How you’re connected feature
I think you have answered this one.
Similar to you
As above and as you say, although the results are probably cached, so it's only done once per session or maybe even less frequently...
Hope that helps,
Chris
I don't use twitter; but with that in mind:
1). On the surface, this isn't that difficult: For each person I follow, see who they follow. Then for each of the people they follow, see who they follow, etc. The deeper you go, of course, the more number crunching it takes.
You can take this a bit further, if you can also efficiently extract the reverse: For those I follow, who also follows them?
For both ways, what's unsaid is a way to weight the tweeters to see if they're someone I'd really want to follow: a liberal follower may also follow a conservative tweeter, but that doesn't mean I'd want to follow the conservative (see #3).
2). Not sure, thinking about it...
3). Assuming the bio and tweets are the only thing to go on, the hard parts are:
Deciding what attributes should exist (political affiliation, topic types, etc.)
Cleaning each 140 characters to data-mine.
Once you have the right set of attributes, then two different algorithms come to mind:
K-means clustering, to decide which attributes I tend to discriminate on.
N-nearest neighbors, to find the N most similar tweeters to me, given the attributes I tend to give weight to.
EDIT: Actually, a decision tree is probably a FAR better way to do all of this...
This is all speculative, but it sounds fun if one were getting paid to do this.

How the computer knows "Recommended for You"?

Recently, I found that several web sites have something like "Recommended for You", for example YouTube or Facebook. The web site can study my usage behavior and recommend some content for me. I would like to know how they analyze this information. Is there any algorithm to do so? Thank you.
Amazon and Netflix (among others) use a technique called Collaborative filtering to suggest things you might like based on the likes/dislikes of others who have made purchases and selections similar to yours.
Is there any Algorithm to do so?
Yes
Yes. One fairly common one is to look at things you've selected in the past, find other people who've made those selections, then find the other selections most common among those other people, and guess that you're likely to be interested in those as well.
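A minimal sketch of that co-occurrence approach, assuming each person's past selections are available as a set. The `selections` data and the function name are invented for the example, and production systems weight and normalise these counts in more elaborate ways.

```python
from collections import Counter


def recommend(user, selections, top_n=3):
    """Recommend items chosen by people who share a selection with `user`."""
    mine = selections[user]
    counts = Counter()
    for other, theirs in selections.items():
        if other == user or not (mine & theirs):
            continue  # skip yourself and people with nothing in common
        counts.update(theirs - mine)  # count only items the user lacks
    return [item for item, _ in counts.most_common(top_n)]


# Hypothetical selection history: person -> set of chosen items.
selections = {
    "ann": {"book1", "book2"},
    "bob": {"book1", "book3"},
    "cat": {"book2", "book3", "book4"},
    "dan": {"book5"},
}
```

For ann, both bob and cat overlap with her history and both chose book3, so it is recommended first. dan shares nothing with her, so book5 is never suggested.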
Yup there are lots of algorithms. Things such as k-nearest neighbor: http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm.
Here is a pretty good book on the subject that covers making these sorts of systems along with others: http://www.amazon.com/gp/product/0596529325?ie=UTF8&tag=ianburriscom-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=0596529325.
It's generally done by matching you with other users who have a similar usage history / profile and then recommending other things that they've purchased/watched/whatever.
Searching for "recommendation algorithm" yields lots of papers. Most algorithms incorporate "machine learning" algorithms to determine groups of things (comedy movies, books on gardening, orchestral music, etc.). Your matching with those groups yields recommendations. Some companies use humans to classify things, too.
Such an algorithm is going to vary wildly from company to company. In many cases, it analyzes some combination of your search history, purchase history, physical location, and other factors. It probably will also compare purchases/searches amongst other people to find what those people have purchased/searched for, and recommend some of those products to you.
There are probably hundreds of these algorithms out there, but I doubt you can use any of them (that are actually good). Probably you are better off figuring it out yourself.
If you can categorize your contents (i.e. by tagging or content analysis), you can also categorize your users and their preferences.
For example: you have a video portal with 5 million videos, and 1 million of them are tagged mostly red. If 80% of all videos watched by a user (who is identified by an IP, a persistent user account, ...) are tagged mostly red, you might want to recommend even more red videos to him. You might want to refine your recommendations by looking at his further actions: does he like your recommendations -- if so, why not give him even more; if not, try the second-best guess, maybe he's not looking for color, but for the background music ...
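The "80% red" rule could look roughly like this in Python. It assumes one dominant tag per video. The `min_share` default of 0.8 mirrors the example above, and all names and data structures are hypothetical.

```python
from collections import Counter


def dominant_tag(watched, tags, min_share=0.8):
    """Return the tag covering at least `min_share` of watched videos, if any."""
    if not watched:
        return None
    counts = Counter(tags[video] for video in watched)
    tag, count = counts.most_common(1)[0]
    return tag if count / len(watched) >= min_share else None


def recommend_by_tag(watched, tags, min_share=0.8):
    """Suggest unwatched videos carrying the user's dominant tag."""
    tag = dominant_tag(watched, tags, min_share)
    if tag is None:
        return []
    seen = set(watched)
    return sorted(v for v, t in tags.items() if t == tag and v not in seen)
```

Feedback on the recommendations (clicked or ignored) could then be used to switch to the second-best guess, as described above.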
There's no absolute algorithm to do it, but all implementations will go in a similar direction. It's always based on observing users, which scares me from time to time :-)
There's a whole lot of algorithms tackling the issue: Wiki article. It's a Machine Learning domain problem. Computers can be taught using two main techniques: classification and clustering. They require some datasets as input. If the dataset is informative (really holds some useful patterns), then those ML techniques can dig most of it out.
Clustering could be the best fit for this kind of problem. Its main use is to find similarities among points in the provided dataset. If the points are, e.g., your search history, they can be grouped together to form certain clusters. If your search history closely relates to another user's, a hint can be given by picking the links that are most similar to yours.
The same goes for book recommendations - it's obvious what dataset they use: "Other people who bought this product also bought Product A, Product B,...". The key here is to match your profile to others' and use the most similar ones to recommend.
The computer retrieves information from the human brain with complex memory scan process, sorts it accordingly and outputs results based on what you have experienced in your life so far.

What should be considered when building a Recommendation Engine?

I've read the book Programming Collective Intelligence and found it fascinating. I'd recently heard about a challenge Amazon had posted to the world to come up with a better recommendation engine for their system.
The winner apparently produced the best algorithm by limiting the amount of information that was being fed to it.
As a first rule of thumb I guess... "More information is not necessarily better when it comes to fuzzy algorithms."
I know it's subjective, but ultimately it's a measurable thing (clicks in response to recommendations).
Since most of us are dealing with the web these days and search can be considered a form of recommendation... I suspect I'm not the only one who'd appreciate other people's ideas on this.
In a nutshell, "What is the best way to build a recommendation engine?"
You don't want to use "overall popularity" unless you have no information about the user. Instead, you want to align this user with similar users and weight accordingly.
This is exactly what Bayesian Inference does. In English, it means adjusting the overall probability you'll like something (the average rating) with ratings from other people who generally vote your way as well.
Another piece of advice, but this time ad hoc: I find that there are people where if they like something I will almost assuredly not like it. I don't know if this effect is real or imagined, but it might be fun to build in a kind of "negative effect" instead of just clumping people by similarity.
Finally there's a company specializing in exactly this called SenseArray. The owner (Ian Clarke of freenet fame) is very approachable. You can use my name if you call him up.
There is an entire research area in computer science devoted to this subject. I'd suggest reading some articles.
Agree with @Ricardo. This question is too broad, like asking "What's the best way to optimize a system?"
One feature common to nearly all existing recommendation engines is that making the final recommendation boils down to multiplying some number of matrices and vectors. For example, you multiply a matrix containing proximity weights between users by a vector of item ratings.
(Of course you have to be ready for most of your vectors to be super sparse!)
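As an illustration of that matrix-vector product with sparse data, here is a small pure-Python sketch. It represents the user-proximity matrix as a dict of dicts and the ratings of a single item as a dict, so missing entries cost nothing. The representation, the names and the numbers are assumptions for this example, and real engines would typically use a sparse linear algebra library.

```python
def predict_scores(proximity, ratings):
    """Multiply a sparse user-proximity matrix by a sparse ratings vector.

    proximity[u][v] is the weight between users u and v; ratings[v] is
    user v's rating of one item. The result is a predicted score per user.
    """
    scores = {}
    for user, row in proximity.items():
        total = 0.0
        for other, weight in row.items():
            total += weight * ratings.get(other, 0.0)  # missing entry = 0
        scores[user] = total
    return scores


# Hypothetical data: u1 is close to u2 and u3, who have both rated the item.
proximity = {
    "u1": {"u2": 0.5, "u3": 0.25},
    "u2": {"u1": 0.5},
}
ratings = {"u2": 4.0, "u3": 2.0}
```

Each predicted score is a proximity-weighted sum of other users' ratings, so u1 gets 0.5 * 4.0 + 0.25 * 2.0, while u2's only neighbour has not rated the item and contributes nothing.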
My answer is surely too late for @Allain, but for other users finding this question through search -- send me a PM and ask a more specific question and I will be sure to respond.
(I design recommendation engines professionally.)
@Lao Tzu, I agree with you.
In my view, recommendation engines are made up of:
Context input fed from context-aware systems (logging all your data)
Logical reasoning to filter out the most obvious
Expert systems that improve your subjective data over time based on context inputs, and
Probabilistic reasoning to make decisions by proximity, based on a weighted sum of previous actions (beliefs, desires, & intentions).
P.S.
I have built such a recommendation engine.

Resources