How to implement a "related" degree measure algorithm? - algorithm

I was about to ask a question earlier today when I came across a surprising feature in Stack Overflow. As I typed my question title, Stack Overflow suggested several related questions, and I found that there were already two similar questions. That was stunning!
Then I started thinking about how I would implement such a feature. How would I order questions by relatedness?
Questions that have a higher number of word matches with the new question rank higher
If the number of matches is the same, the order of the words is considered
Words that appear in the title have higher relevance
Would that be a simple workflow or a complex scoring algorithm?
Some stemming to increase the recall, maybe?
Is there some library that implements this function?
What other aspects would you consider?
Maybe Jeff could answer himself! How did you implement this in Stack Overflow? :)

One such way to implement such an algorithm would involve ranking the questions as per a heuristic function which assigns a 'relevance' weight factor using the following steps:
Apply a noise filter to the 'New' question to remove words that are common across a large number of objects such as: 'the', 'and', 'or', etc.
Get the number of words in the 'New' question that match words in the set of questions already posted on the website. [A]
Get the number of tag matches between the words in the 'New' question and the available tags. [B]
Compute the 'relevance weight' based on [A] and [B] as 'x[A] + y[B]', where x and y are weight multipliers (Assign a higher weight multiplier to [B] as tagging is more relevant than simple word search)
Get the top 5 questions which have the highest 'relevance weight'.
The heuristic might require tweaking to get optimal results, but it should work.
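A minimal Python sketch of the steps above, assuming each posted question is a dict with a 'title' and a list of 'tags'. The stop-word list, the weights x=1 and y=3, and the function names are illustrative assumptions, not anything Stack Overflow is known to use.

```python
import re

# Illustrative noise list; a real one would be much longer.
STOP_WORDS = {"the", "and", "or", "a", "an", "to", "of", "in", "is", "how"}

def tokenize(text):
    """Lowercase the text, split it into words, and drop noise words."""
    words = re.findall(r"[a-z0-9#+]+", text.lower())
    return {w for w in words if w not in STOP_WORDS}

def relevance(new_title, question, x=1.0, y=3.0):
    """Relevance weight x*[A] + y*[B]; y > x because tag matches count for more."""
    words = tokenize(new_title)
    word_matches = len(words & tokenize(question["title"]))  # [A]
    tag_matches = len(words & set(question["tags"]))         # [B]
    return x * word_matches + y * tag_matches

def top_related(new_title, questions, limit=5):
    """Return the titles of the highest-scoring questions, best first."""
    scored = [(relevance(new_title, q), q["title"]) for q in questions]
    scored = [s for s in scored if s[0] > 0]  # drop questions with no match
    scored.sort(key=lambda s: -s[0])
    return [title for _, title in scored[:limit]]
```

Calling top_related(new_title, questions) returns at most five titles. Tuning x and y is the "tweaking" mentioned above.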

Your question seems similar to this one, which has some additional answers.

@marcio
Sorry, I am not aware of any direct API reference that I could suggest here and I have never worked with Lucene.
However, I am aware that Google Desktop uses a Query API to rank and suggest the relevant search results. More information on the API can be found here.
Perhaps others could chime in and guide you.

Isn't StackOverflow going to be open sourced at some point? If so, you can always find out how they did it there.
Update: It appears that they say they might open source it. I hope they do.

Related

Confusion about Rank Selection for Genetic Algorithms

I have seen other SO questions asked about rank selection for genetic algorithms, but I am still confused. I haven't really seen an answer to this, or maybe I just didn't understand it: When using the rank selection, what is the population being ranked on? I had seen some answers say it's fitness, others say it's not. If it is possible to get a snippet of code so I can better understand this would be greatly appreciated. If there are any other questions, I can answer them to provide clarity. Thank you
EDIT: The case I am trying to solve is that I have a string I need the program to get right (I know what it is and have hard-coded it)
That code snippet, the fitness function, is entirely dependent on the application. It really defines the selection process. Imagine a simple program for playing five-card draw (poker). Each candidate is an algorithm that decides which cards to replace.
The fitness function might work like this: (1) remove the specified cards. (2) repeat 100 trials: replace the cards and compute the strength of the resulting hand. (3) return the average of the 100 trials.
That average stands as the fitness measure by which the algorithms are ranked.
Does that clear things up a little?
FOLLOW-UP
This means that you have to choose a similarity metric. You'll want something that is distinct for an exact match and degrades gracefully as you get farther from the right answer. A simple search will find the popular ones.
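For the string case, here is a rough Python sketch of rank selection. It assumes the fitness is simply the number of characters that match the target, and it uses linear ranking (worst gets rank 1, best gets rank N). The target string and both function names are made up for illustration; other similarity metrics and ranking schemes are possible.

```python
import random

TARGET = "hello world"  # hypothetical hard-coded target string

def fitness(candidate):
    """Number of positions that match the target: highest on an exact
    match, and one lower for every wrong character."""
    return sum(a == b for a, b in zip(candidate, TARGET))

def rank_select(population, k, rng=random):
    """Linear rank selection: sort by fitness, then pick parents with
    probability proportional to rank, not to the raw fitness value."""
    ranked = sorted(population, key=fitness)    # worst first
    ranks = list(range(1, len(ranked) + 1))     # worst = 1, best = N
    return rng.choices(ranked, weights=ranks, k=k)
```

So the population is ordered by fitness, but the selection probability depends only on the position in that ordering. That may be why some answers say the ranking "is" fitness and others say it is not.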

Algorithm to find odd word in a list of english words? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
Suppose I have the following list of words:
banana, apple, orange, tree
In this list the odd word is tree. Can anyone give me an idea for how to write an algorithm that finds it?
What is it about tree that makes it the odd one out? Why not banana (since it's a herb, where the others are trees, and also because it's the only one in the list that doesn't end with 'e'). Or why not orange (since it's a colour as well as a plant, where the others are just plants).
You need to define the criteria that you're trying to filter by: something may be obvious to a human reader, but a computer algorithm can't see that without knowing all the facts that make it obvious to a human. Or at least sufficient facts that are relevant to draw a reliable conclusion.
You're basically talking about a large knowledge-base, not a simple algorithm.
Disclaimer: This is not an easy to do task, and thus my suggested solutions will be high level and include references to academic papers that aim to solve a part of your problem:
You can try a semantic relatedness approach:
Find the relatedness between every pair of words, then filter out the word that is least related to all the others.
Semantic relatedness can be done using semantic sort in a supervised learning, for example.
Another alternative is to model a semantic representation of each word.
Each word will be represented by a vector representing its meaning.
This vector can be obtained for example using the wikipedia articles
that mention this word. More information on this approach can be
found in Markovitch et al Wikipedia-based Semantic Interpretation
for Natural Language Processing
After you represent your data as vectors, it is a question of finding
the word which is least similar to the others. It can be done using
supervised learning, or other alternative is choosing the point
which is most distant from the median of all vectors.
One more possible solution is using WordNet
Note that all methods are heuristics that I would try, and are expected to fail for some cases, but I believe will work pretty well for most of the cases.
Have a look at ontologies and reasoning algorithms. If you have an ontology that models the specific area of knowledge you will have a source of information that will allow you to distinguish words, e.g. by using the partial order and the relations and then check if the words are in the same "sub branch" of the partial order. You might even define a metric to get a "level of closeness" or something similar.
Edit: also check SPARQL, a language to query such structures. And check out triple stores, which allow you to get information by subject, predicate, object combinations. This matches your problem since it allows you to compare two objects of your list by a predicate.
You can try creating a database of categorized words like:
banana {food, plant, fruit, yellow}
apple {food, plant, fruit, computer, phone}
orange {food, plant, fruit, phone}
tree {plant}
And then you can see that all words other than tree belong to fruit category. That kind of check would be easy to code.
The biggest problem here is getting the database: I don't think you would like to create it manually, and I have no idea where to find one. It also might not work. Imagine we add
eclair{food, phone}
to this database (phone because Android 2.1 is called Eclair). Then for the query orange, apple, banana, eclair there are two possible answers: eclair, which is not a fruit, or banana, which is not connected with mobile phones.
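The check itself could look something like this in Python. The database is just the toy data from this answer, and the rule used here (a word is odd if it is the only one missing from a category that all the other words share) is one possible reading of the idea, not the only one.

```python
from collections import Counter

# Toy database copied from the example above.
CATEGORIES = {
    "banana": {"food", "plant", "fruit", "yellow"},
    "apple": {"food", "plant", "fruit", "computer", "phone"},
    "orange": {"food", "plant", "fruit", "phone"},
    "tree": {"plant"},
    "eclair": {"food", "phone"},
}

def odd_words(words):
    """Return every word that is the single word missing from a
    category that all the other words share."""
    counts = Counter(cat for w in words for cat in CATEGORIES[w])
    odd = set()
    for cat, n in counts.items():
        if n == len(words) - 1:  # all words but one are in this category
            odd.update(w for w in words if cat not in CATEGORIES[w])
    return sorted(odd)
```

For banana, apple, orange, tree this returns only tree. For orange, apple, banana, eclair it returns both banana and eclair, which is exactly the ambiguity described above.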

Methods to identify duplicate questions on Twitter?

As stated in the title, I'm simply looking for algorithms or solutions one might use to take in the twitter firehose (or a portion of it) and
a) identify questions in general
b) for a question, identify questions that could be the same, with some degree of confidence
Thanks!
(A)
I would try to identify questions using machine learning and the Bag of Words model.
Create a labeled set of tweets, and label each of them with a binary flag: question or not question.
Extract the features from the training set. The features are traditionally words, but at least for any time I tried it - using bi-grams significantly improved the results. (3-grams were not helpful for my cases).
Build a classifier from the data. I usually found that SVM gives better performance than other classifiers, but you can use others as well, such as Naive Bayes or KNN (but you will probably need a feature selection algorithm for these).
Now you can use your classifier to classify a tweet. (1)
(B)
This issue is referred to in the world of Information Retrieval as "duplicate detection" or "near-duplicate detection".
You can at least find questions which are very similar to each other using Semantic Interpretation, as described by Markovitch and Gabrilovich in their wonderful article Wikipedia-based Semantic Interpretation for Natural Language Processing. At the very least, it will help you identify if two questions are discussing the same issues (even though not identical).
The idea goes like this:
Use Wikipedia to build, for each term t, a vector that represents its semantics: the entry vector_t[i] is the tf-idf score of the term i as it co-appeared with the term t. The idea is described in detail in the article. Reading the first 3-4 pages is enough to understand it; there is no need to read it all. (2)
For each tweet, construct a vector which is a function of the vectors of its terms. Compare between two vectors - and you can identify if two questions are discussing the same issues.
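To make the two steps concrete, here is a small Python sketch. The term vectors below are invented numbers standing in for the Wikipedia-derived tf-idf scores, and summing the term vectors and comparing with cosine similarity is just one simple choice for "a function of the vectors" and for the comparison.

```python
import math
from collections import defaultdict

# Hypothetical term vectors (concept -> weight). In the approach described
# above these would be tf-idf scores built from Wikipedia; these are made up.
TERM_VECTORS = {
    "python": {"Programming": 0.9, "Snakes": 0.4},
    "java": {"Programming": 0.8, "Coffee": 0.3},
    "loop": {"Programming": 0.7},
    "espresso": {"Coffee": 0.9},
}

def tweet_vector(text):
    """Sum the vectors of the known terms in the tweet (whitespace split only)."""
    vec = defaultdict(float)
    for term in text.lower().split():
        for concept, weight in TERM_VECTORS.get(term, {}).items():
            vec[concept] += weight
    return vec

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either one is empty."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Two tweets would then be flagged as possible duplicates when their cosine similarity is above some threshold that you would have to tune on labeled data.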
EDIT:
On second thought, the BoW model is not a good fit here, since it ignores the position of terms. However, I believe that if you add NLP processing to extract features (for example, for each term, also denote whether it is pre-subject or post-subject, as determined by NLP processing), combining that with machine learning will yield pretty good results.
(1) For evaluation of your classifier, you can use cross-validation, and check the expected accuracy.
(2) I know Evgeny Gabrilovich published the implemented algorithm they created as an open source project, just need to look for it.

Designing a twenty questions algorithm

I am interested in writing a twenty questions algorithm similar to what akinator and, to a lesser extent, 20q.net uses. The latter seems to focus more on objects, explicitly telling you not to think of persons or places. One could say that akinator is more general, allowing you to think of literally anything, including abstractions such as "my brother".
The problem with this is that I don't know what algorithm these sites use, but from what I read they seem to be using a probabilistic approach in which questions are given a certain fitness based on how many times they have led to correct guesses. This SO question presents several techniques, but rather vaguely, and I would be interested in more details.
So, what could be an accurate and efficient algorithm for playing twenty questions?
I am interested in details regarding:
What question to ask next.
How to make the best guess at the end of the 20 questions.
How to insert a new object and a new question into the database.
How to query (1, 2) and update (3) the database efficiently.
I realize this may not be easy and I'm not asking for code or a 2000 words presentation. Just a few sentences about each operation and the underlying data structures should be enough to get me started.
Update, 10+ years later
I'm now hosting a (WIP, but functional) implementation here: https://twentyq.evobyte.org/ with the code here: https://github.com/evobyte-apps/open-20-questions. It's based on the same rough idea listed below.
Well, over three years later, I did it (although I didn't work full time on it). I hosted a crude implementation at http://twentyquestions.azurewebsites.net/ if anyone is interested (please don't teach it too much wrong stuff yet!).
It wasn't that hard, but I would say it's the non-intuitive kind of not hard that you don't immediately think of. My methods include some trivial fitness-based ranking, ideas from reinforcement learning and a round-robin method of scheduling new questions to be asked. All of this is implemented on a normalized relational database.
My basic ideas follow. If anyone is interested, I will share code as well, just contact me. I plan on making it open source eventually, but once I have done a bit more testing and reworking. So, my ideas:
an Entities table that holds the characters and objects played;
a Questions table that holds the questions, which are also submitted by users;
an EntityQuestions table holds entity-question relations. This holds the number of times each answer was given for each question in relation to each entity (well, those for which the question was asked for anyway). It also has a Fitness field, used for ranking questions from "more general" down to "more specific";
a GameEntities table is used for ranking the entities according to the answers given so far for each on-going game. An answer of A to a question Q pushes up all the entities for which the majority answer to question Q is A;
The first question asked is picked from those with the highest sum of fitnesses across the EntityQuestions table;
Each next question is picked from those with the highest fitness associated with the currently top entries in the GameEntities table. Questions for which the expected answer is Yes are favored even before the fitness, because these have more chances of consolidating the current top ranked entity;
If the system is quite sure of the answer even before all 20 questions have been asked, it will start asking questions not associated with its answer, so as to learn more about that entity. This is done in a round-robin fashion from the global questions pool right now. Discussion: is round-robin fine, or should it be fully random?
Premature answers are also given under certain conditions and probabilities;
Guesses are given based on the rankings in GameEntities. This allows the system to account for lies as well, because it never eliminates any possibility, just decreases its likeliness of being the answer;
After each game, the fitness and answers statistics are updated accordingly: fitness values for entity-question associations decrease if the game was lost, and increase otherwise.
I can provide more details if anyone is interested. I am also open to collaborating on improving the algorithms and implementation.
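As a rough illustration of the GameEntities ranking described above (not the actual code of the hosted implementation), here is a Python sketch. The in-memory dict stands in for the EntityQuestions table, and the +1 / -1 scoring is an assumption made for the example.

```python
from collections import Counter

# Hypothetical stand-in for the EntityQuestions table:
# (entity, question) -> counts of the answers given in past games.
ENTITY_QUESTIONS = {
    ("cat", "Is it alive?"): Counter(yes=9, no=1),
    ("cat", "Can it fly?"): Counter(yes=1, no=8),
    ("eagle", "Is it alive?"): Counter(yes=7, no=0),
    ("eagle", "Can it fly?"): Counter(yes=6, no=1),
    ("kite", "Is it alive?"): Counter(yes=0, no=5),
    ("kite", "Can it fly?"): Counter(yes=5, no=0),
}

def rank_entities(answers):
    """Score each entity: +1 when the player's answer agrees with the
    majority answer recorded for it, -1 when it disagrees. Entities are
    only pushed up or down, never eliminated, so lies can be absorbed."""
    scores = Counter()
    for (entity, question), counts in ENTITY_QUESTIONS.items():
        scores[entity] += 0  # make sure every entity appears in the ranking
        if question in answers:
            majority = counts.most_common(1)[0][0]
            scores[entity] += 1 if answers[question] == majority else -1
    return sorted(scores, key=lambda e: -scores[e])
```

The first element of the returned list is the current best guess.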
This is a very interesting question. Unfortunately I don't have a full answer, let me just write down the ideas I could come up with in 10 minutes:
If you are able to halve the set of available answers on each question, you can distinguish between 2^20 ~ 1 million "objects". Your set is probably going to be larger, so it's right to assume that sometimes you have to make a guess.
You want to maximize utility. Some objects are chosen more often than others. If you want to make good guesses you have to take into consideration the weight of each object (= the probability of that object being picked) when creating the tree.
If you trust your users a little bit, you can gain knowledge based on their answers. This also means that you cannot use a static tree to ask questions, because then you'll get the answers to the same questions, and you'll learn nothing new if you encounter the same object again.
If a simple question is not able to divide the set to two halves, you could combine them to get better results: eg: "is the object green or blue?". "green or has a round shape?"
I am trying to write a Python implementation using a naïve Bayesian network for learning, and minimizing the expected entropy after the question has been answered as the criterion for selecting a question (with an epsilon chance of selecting a random question in order to learn more about that question), following the ideas in http://lists.canonical.org/pipermail/kragen-tol/2010-March/000912.html. I have put what I have so far on GitHub.
Preferably choose questions with low remaining entropy expectation. (For putting together something quickly, I stole from ε-greedy multi-armed bandit learning and use: With probability 1–ε: Ask the question with the lowest remaining entropy expectation. With probability ε: Ask any random question. However, this approach seems far from optimal.)
Since my approach is a Bayesian network, I obtain the probabilities of the objects and can ask for the most probable object.
A new object is added as new column to the probabilities matrix, with low a priori probability and the answers to the questions as given if given or as guessed by the Bayes network if not given. (I expect that this second part would work much better if I would add Bayes network structure learning instead of just using naive Bayes.)
Similarly, a new question is a new row in the matrix. If it comes from user input, probably only very few answer probabilities are known, the rest needs to be guessed. (In general, if you can get objects by asking for properties, you can obtain properties by asking if given objects have them or not, and the transformation between these is essentially Bayes' theorem and breaks down to transposition in the easiest case. The guessing quality should improve again once the network has an appropriate structure.)
(This is a problem, since I calculate lots of probabilities. My goal is to do it using database-oriented sparse tensor calculations optimized for working with weighted directed acyclic graphs.)
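A small Python sketch of the question selection rule described above, under the naive Bayes assumption. The object priors, the P(yes | object) table and all function names are made up for illustration; this is not the code from the GitHub repository mentioned.

```python
import math
import random

# Hypothetical model: prior over objects and P(answer is yes | object).
PRIOR = {"cat": 0.4, "eagle": 0.4, "kite": 0.2}
P_YES = {
    "Is it alive?": {"cat": 0.95, "eagle": 0.95, "kite": 0.05},
    "Can it fly?": {"cat": 0.05, "eagle": 0.95, "kite": 0.95},
    "Is it a thing?": {"cat": 0.5, "eagle": 0.5, "kite": 0.5},
}

def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def posterior(prior, likelihood):
    """Bayes' theorem: multiply prior by likelihood, then normalise."""
    joint = {o: prior[o] * likelihood[o] for o in prior}
    total = sum(joint.values())
    return {o: p / total for o, p in joint.items()}

def expected_entropy(prior, question):
    """Entropy left after the answer, averaged over both possible answers."""
    p_yes = P_YES[question]
    p_no = {o: 1 - p for o, p in p_yes.items()}
    prob_yes = sum(prior[o] * p_yes[o] for o in prior)
    return (prob_yes * entropy(posterior(prior, p_yes))
            + (1 - prob_yes) * entropy(posterior(prior, p_no)))

def choose_question(prior, epsilon=0.1, rng=random):
    """Epsilon-greedy: usually the most informative question, sometimes random."""
    if rng.random() < epsilon:
        return rng.choice(list(P_YES))  # explore
    return min(P_YES, key=lambda q: expected_entropy(prior, q))  # exploit
```

After each answer you would replace the prior with posterior(prior, ...) and repeat; the final guess is the object with the highest probability.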
It would be interesting to see how good a decision tree based algorithm would serve you. The trick here is purely in the learning/sorting of the tree. I'd like to note that this is stuff I remember from AI class and student work in the AI working group and should be taken with a semi-large grain (or nugget) of salt.
To answer the questions:
You just walk the tree :)
This is a big downside of decision trees. You'd only have one guess that can be attached to the end nodes of the tree at depth 20 (or earlier, if the tree is still sparse).
There are whole books dedicated to this topic. As far as I remember from AI class you try minimize entropy at all times, so you want to ask questions that ideally divide the set of remaining objects into two sets of equal size. I'm afraid you'd have to look this up in AI books.
Decision trees are highly efficient during the query phase, as you literally walk the tree and follow the 'yes' or 'no' branch at each node. Update efficiency depends on the learning algorithm applied. You might be able to do this offline as in a nightly batched update or something like that.

How does the Google "Did you mean?" Algorithm work? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
I've been developing an internal website for a portfolio management tool. There is a lot of text data, company names etc. I've been really impressed with some search engines ability to very quickly respond to queries with "Did you mean: xxxx".
I need to be able to intelligently take a user query and respond with not only raw search results but also with a "Did you mean?" response when there is a highly likely alternative answer etc
[I'm developing in ASP.NET (VB - don't hold it against me! )]
UPDATE:
OK, how can I mimic this without the millions of 'unpaid users'?
Generate typos for each 'known' or 'correct' term and perform lookups?
Some other more elegant method?
Here's the explanation directly from the source ( almost )
Search 101!
at min 22:03
Worth watching!
Basically and according to Douglas Merrill former CTO of Google it is like this:
1) You write a ( misspelled ) word in google
2) You don't find what you wanted ( don't click on any results )
3) You realize you misspelled the word so you rewrite the word in the search box.
4) You find what you want ( you click in the first links )
This pattern multiplied millions of times, shows what are the most common misspells and what are the most "common" corrections.
This way Google can almost instantaneously, offer spell correction in every language.
This also means that if overnight everyone starts to spell night as "nigth", Google would suggest that word instead.
EDIT
@ThomasRutter: Douglas describes it as "statistical machine learning".
They know who corrects the query, because they know which query comes from which user ( using cookies )
If the users perform a query, and only 10% of the users click on a result and 90% go back and type another query ( with the corrected word ) and this time that 90% click on a result, then they know they have found a correction.
They can also know if those are "related" queries of two different, because they have information of all the links they show.
Furthermore, they are now including the context into the spell check, so they can even suggest different word depending on the context.
See this demo of Google Wave ( @ 44m 06s ) that shows how the context is taken into account to automatically correct the spelling.
Here it is explained how that natural language processing works.
And finally here is an awesome demo of what can be done adding automatic machine translation ( @ 1h 12m 47s ) to the mix.
I've added anchors of minute and seconds to the videos to skip directly to the content, if they don't work, try reloading the page or scrolling by hand to the mark.
I found this article some time ago: How to Write a Spelling Corrector, written by Peter Norvig (Director of Research at Google Inc.).
It's an interesting read about the "spelling correction" topic. The examples are in Python but it's clear and simple to understand, and I think that the algorithm can be easily translated to other languages.
Below follows a short description of the algorithm.
The algorithm consists of two steps, preparation and word checking.
Step 1: Preparation - setting up the word database
It is best if you can use actual search words and their occurrences.
If you don't have that, a large set of text can be used instead.
Count the occurrence (popularity) of each word.
Step 2. Word checking - finding words that are similar to the one checked
Similar means that the edit distance is low (typically 0-1 or 0-2). The edit distance is the minimum number of inserts/deletes/changes/swaps needed to transform one word to another.
Choose the most popular word from the previous step and suggest it as a correction (if other than the word itself).
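Here is a short Python sketch of the two steps, written in the spirit of the article but not copied from it. It assumes a lowercase a-z alphabet, and the function names train, edits1 and correct are chosen for this example.

```python
import re
from collections import Counter

def train(text):
    """Step 1: count how often each word occurs in a body of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings one insert, delete, change or swap away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    changes = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + swaps + changes + inserts)

def correct(word, counts):
    """Step 2: keep a known word as is, otherwise take the most popular
    known word at edit distance 1, then at distance 2."""
    if word in counts:
        return word
    near = {w for w in edits1(word) if w in counts}
    if not near:
        near = {w2 for w1 in edits1(word) for w2 in edits1(w1) if w2 in counts}
    return max(near, key=counts.get) if near else word
```

With counts = train(some_large_text), correct("nigth", counts) would suggest "night" as long as "night" occurs in the text. The quality depends almost entirely on how good the word counts are.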
For the theory of "did you mean" algorithm you can refer to Chapter 3 of Introduction to Information Retrieval. It is available online for free. Section 3.3 (page 52) exactly answers your question. And to specifically answer your update you only need a dictionary of words and nothing else (including millions of users).
Hmm... I thought that google used their vast corpus of data (the internet) to do some serious NLP (Natural Language Processing).
For example, they have so much data from the entire internet that they can count the number of times a three-word sequence occurs (known as a trigram). So if they see a sentence like: "pink frugr concert", they could see it has few hits, then find the most likely "pink * concert" in their corpus.
They apparently just do a variation of what Davide Gualano was saying, though, so definitely read that link. Google does of course use all web-pages it knows as a corpus, so that makes its algorithm particularly effective.
My guess is that they use a combination of a Levenshtein distance algorithm and the masses of data they collect regarding the searches that are run. They could pull a set of searches that have the shortest Levenshtein distance from the entered search string, then pick the one with the most results.
Normally a production spelling corrector utilizes several methodologies to provide a spelling suggestion. Some are:
Decide on a way to determine whether spelling correction is required. These may include insufficient results, results which are not specific or accurate enough (according to some measure), etc. Then:
Use a large body of text or a dictionary, where all, or most are known to be correctly spelled. These are easily found online, in places such as LingPipe. Then to determine the best suggestion you look for a word which is the closest match based on several measures. The most intuitive one is similar characters. What has been shown through research and experimentation is that two or three character sequence matches work better. (bigrams and trigrams). To further improve results, weigh a higher score upon a match at the beginning, or end of the word. For performance reasons, index all these words as trigrams or bigrams, so that when you are performing a lookup, you convert to n-gram, and lookup via hashtable or trie.
Use heuristics related to potential keyboard mistakes based on character location. So that "hwllo" should be "hello" because 'w' is close to 'e'.
Use a phonetic key (Soundex, Metaphone) to index the words and lookup possible corrections. In practice this normally returns worse results than using n-gram indexing, as described above.
In each case you must select the best correction from a list. This may be a distance metric such as levenshtein, the keyboard metric, etc.
For a multi-word phrase, only one word may be misspelled, in which case you can use the remaining words as context in determining a best match.
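The n-gram indexing idea can be sketched in Python like this. Padding the word with '$' is one simple way to let matches at the beginning and end of the word count, and scoring by Jaccard overlap is an assumption for the example; a production system would combine this with the other measures listed above.

```python
from collections import defaultdict

def ngrams(word, n=3):
    """Character n-grams, padded so the start and end of the word also produce n-grams."""
    padded = "$" * (n - 1) + word + "$" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def build_index(dictionary, n=3):
    """Hashtable from each n-gram to the dictionary words that contain it."""
    index = defaultdict(set)
    for word in dictionary:
        for gram in ngrams(word, n):
            index[gram].add(word)
    return index

def suggest(query, index, n=3, limit=3):
    """Rank words sharing at least one n-gram with the query by Jaccard overlap."""
    grams = ngrams(query, n)
    candidates = set().union(*(index.get(g, set()) for g in grams))

    def score(word):
        other = ngrams(word, n)
        return len(grams & other) / len(grams | other)

    return sorted(candidates, key=lambda w: (-score(w), w))[:limit]
```

Only words that share at least one n-gram with the query are ever scored, which is where the performance gain comes from.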
Use Levenshtein distance, then create a Metric Tree (or Slim tree) to index words.
Then run a 1-Nearest Neighbour query, and you got the result.
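A Python sketch of the distance and the query. Note that the nearest-neighbour search here is a plain linear scan, used only to keep the example short; a metric tree would give the same answer while skipping most of the comparisons.

```python
def levenshtein(a, b):
    """Edit distance by dynamic programming, keeping only the previous row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute ca -> cb
        prev = cur
    return prev[-1]

def nearest(word, dictionary):
    """1-nearest-neighbour by linear scan over the dictionary."""
    return min(dictionary, key=lambda w: levenshtein(word, w))
```

Because Levenshtein distance satisfies the triangle inequality, it is a valid metric, which is what makes indexing with a metric tree possible in the first place.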
Google apparently suggests the queries with the best results, not the ones that are spelled correctly. But in this case a spell-corrector would probably be more feasible. Of course, you could store some value for every query, based on some metric of how good the results it returns are.
So,
You need a dictionary (english or based on your data)
Generate a word trellis and calculate probabilities for the transitions using your dictionary.
Add a decoder to calculate minimum error distance using your trellis. Of course you should take care of insertions and deletions when calculating distances. Fun thing is that QWERTY keyboard maximizes the distance if you hit keys close to each other.(cae would turn car, cay would turn cat)
Return the word which has the minimum distance.
Then you could compare that to your query database and check if there are better results for other close matches.
Here is the best answer I found, Spelling corrector implemented and described by Google's Director of Research Peter Norvig.
If you want to read more about the theory behind this, you can read his book chapter.
The idea of this algorithm is based on statistical machine learning.
I saw something on this a few years back, so may have changed since, but apparently they started it by analysing their logs for the same users submitting very similar queries in a short space of time, and used machine learning based on how users had corrected themselves.
As a guess... it could
search for words
if it is not found use some algorithm to try to "guess" the word.
Could be something from AI like Hopfield network or back propagation network, or something else "identifying fingerprints", restoring broken data, or spelling corrections as Davide mentioned already ...
Simple. They have tons of data. They have statistics for every possible term, based on how often it is queried, and what variations of it usually yield results the users click... so, when they see you typed a frequent misspelling for a search term, they go ahead and propose the more usual answer.
Actually, if the misspelling is in effect the most frequently searched term, the algorithm will take it for the right one.
Regarding your question of how to mimic the behavior without having tons of data: why not use the tons of data collected by Google? Download the Google search results for the misspelled word and search for "Did you mean:" in the HTML.
I guess that's called mashup nowadays :-)
Apart from the above answers, in case you want to implement something by yourself quickly, here is a suggestion -
Algorithm
You can find the implementation and detailed documentation of this algorithm on GitHub.
Create a Priority Queue with a comparator.
Create a Ternary Search Tree and insert all English words (from Norvig's post) along with their frequencies.
Start traversing the TST and for every word encountered in TST, calculate its Levenshtein Distance(LD) from input_word
If LD ≤ 3 then put it in a Priority Queue.
Finally, extract 10 words from the Priority Queue and display them.
You mean to say spell checker? If it is a spell checker rather than a whole phrase then I've got a link about the spell checking where the algorithm is developed in python. Check this link
Meanwhile, I am also working on project that includes searching databases using text. I guess this would solve your problem
This is an old question, and I'm surprised that nobody suggested the OP using Apache Solr.
Apache Solr is a full text search engine that besides many other functionality also provides spellchecking or query suggestions. From the documentation:
By default, the Lucene Spell checkers sort suggestions first by the
score from the string distance calculation and second by the frequency
(if available) of the suggestion in the index.
There is a specific data structure - ternary search tree - that naturally supports partial matches and near-neighbor matches.
Easiest way to figure it out is to Google dynamic programming.
It's an algorithm that's been borrowed from Information Retrieval and is used heavily in modern day bioinformatics to see how similar two gene sequences are.
Optimal solution uses dynamic programming and recursion.
This is a very solved problem with lots of solutions. Just google around until you find some open source code.
