new features for autocomplete algorithm - algorithm

I met this problem in an interview. It is easy to implement a basic autocomplete system(https://www.futurice.com/blog/data-structures-for-fast-autocomplete/) to get a list of string from the prefix string. Now we want to add some new features.
ex,
User input: lun pla Output: lunch plan (mutiple words autocomplete)
User input: pla Output: lunch plan
User input: unc Output: lunch (autocomplete form part of the word)
How to implement the features?

You can try the following (basic) approach, and I will later give suggestions for extensions:
load a dictionary of accepted words
build a BK-Tree out of these words using Levenshtein-Damerau distance as the underlying metric
split an input sequence on the whitespace character to get words
for each word, check whether it is an accepted word. If it isn't find the nearest (within acceptable distance) word in the BK-tree
Now for the improvements:
As you indicated, sometimes a match makes more sense when two words are grouped together
Use the Google word2Phrase algorithm for this. You can find a C++ version here.
Use a more clever approach to finding word-boundaries. A stochastic method like HMM (Hidden Markov Model) might be useful (to avoid dates, times, abbreviations, etc being split)
Use a more intelligent error-metric. You could take into account common misspellings, keyboard layout errors (there are very specific errors for people that are used to typing qwerty that are suddenly faced with azerty), etc
Try to determine word part-of-speech type. (Adjective, noun, verb, etc). By doing this you can make much better completions.

Related

Predicting phrases instead of just next word

For an application that we built, we are using a simple statistical model for word prediction (like Google Autocomplete) to guide search.
It uses a sequence of ngrams gathered from a large corpus of relevant text documents. By considering the previous N-1 words, it suggests the 5 most likely "next words" in descending order of probability, using Katz back-off.
We would like to extend this to predict phrases (multiple words) instead of a single word. However, when we are predicting a phrase, we would prefer not to display its prefixes.
For example, consider the input the cat.
In this case we would like to make predictions like the cat in the hat, but not the cat in & not the cat in the.
Assumptions:
We do not have access to past search statistics
We do not have tagged text data (for instance, we do not know the parts of speech)
What is a typical way to make these kinds of multi-word predictions? We've tried multiplicative and additive weighting of longer phrases, but our weights are arbitrary and overfit to our tests.
For this question, you need to define what it is you consider to be a valid completion -- then it should be possible to come up with a solution.
In the example you've given, "the cat in the hat" is much better than "the cat in the". I could interpret this as, "it should end with a noun" or "it shouldn't end with overly common words".
You've restricted the use of "tagged text data" but you could use a pretrained model, (e.g. NLTK, spacy, StanfordNLP) to guess the parts of speech and make an attempt to restrict predictions to only complete noun-phrases (or sequence ending in noun). Note that you would not necessarily need to tag all documents fed into the model, but only those phrases you're keeping in your autocomplete db.
Alternately, you could avoid completions that end in stopwords (or very high frequency words). Both "in" and "the" are words that occur in almost all English documents, so you could experimentally find a frequency cutoff (can't end in a word that occurs in more than 50% of documents) that help you filter. You could also look at phrases -- if the end of the phrase is drastically more common as a shorter phrase, then it doesn't make sense to tag it on, as the user could come up with it on their own.
Ultimately, you could create a labeled set of good and bad instances and try to create a supervised re-ranker based on word features -- both ideas above could be strong features in a supervised model (document frequency = 2, pos tag = 1). This is typically how search engines with data can do it. Note that you don't need search statistics or users for this, just a willingness to label the top-5 completions for a few hundred queries. Building a formal evaluation (that can be run in an automated manner) would probably help when trying to improve the system in the future. Any time you observe a bad completion, you could add it to the database and do a few labels -- over time, a supervised approach would get better.

Algorithm to reform a sentence from sentence whose spaces are removed and alphabets of words are reordered?

I was looking around some puzzles online to improve my knowledge on algorithms...
I came upon below question:
"You have a sentence with several words with spaces remove and words having their character order shuffled. You have a dictionary. Write an algorithm to produce the sentence back with spaces and words with normal character order."
I do not know what is good way to solve this.
I am new to algorithms but just looking at problem I think I would make program do what an intellectual mind would do.
Here is something I can think of:
-First find out manually common short english words from dictionary like "is" "the" "if" etc and put in dataset-1.
-Then find out permutation of words in dataset1 (eg "si", "eht" or "eth" or "fi") and put in dataset-2
-then find out from input sentense what character sequence matches the words of dataset2 and put them in dataset-3 and insert space in input sentence instead of those found.
-for rest of the words i would perform permutations to find out word from dictionary.
I am newbie to algorithms...is it a bad solution?
this seems like a perfectly fine solution,
In general there are 2 parameters for judging an algorithm.
correctness - does the algorithm provide the correct answer.
resources - the time or storage size needed to provide an answer.
usually there is a tradeoff between these two parameters.
so for example the size of your dictionary dictates what scrambled sentences you may
reconstruct, giving you a correct answer for more inputs,
however the whole searching process would take longer and would require more storage.
The hard part of the problem you presented is the fact that you need to compute permutations, and there are a LOT of them.
so checking them all is expensive, a good approach would be to do what you suggested, create a small subset of commonly used words and check them first, that way the average case is better.
note: just saying that you check the permutation/search is ok, but in the end you would need to specify the exact way of doing that.
currently what you wrote is an idea for an algorithm but it would not allow you to take a given input and mechanically work out the output.
Actually, might be wise to start by partitioning the dictionary by word length.
Then try to find the largest words that can be made using the letters avaliable, instead of finding the smallest ones. Short words are more common and thus will be harder to narrow down. IE: is it really "If" or "fig".
Then for each word length w, you can proceed w characters at a time.
There are still a lot of possible combinations though, simply because you found a valid word, doesn't mean it's the right word. Once you've gone through all the substrings, of which there should be something like O(c^4*d) where d is the number of words in the dictionary and c is the number of characters in the sentence. Practically speaking if the dictionary is sorted by word length, it'll be a fair bit less than that. Then you have to take the valid words, and figure out an ordering that works, so that all characters are used. There might be multiple solutions.

Creating a suggested words algorithm

I'm designing a cool spell checker (I know I know, modern browsers already have this), anyway, I am wondering what kind of effort would it take to develop a fairly simple but decent suggest-word algorithm.
My idea is that I would first look through the misspelled word's characters and count the amount of characters it matches in each word in the dictionary (sounds resources intensive), and then pick the top 5 matches (so if the misspelled word matches the most characters with 7 words from the dictionary, it will randomly display 5 of those words as suggested spelling).
Obviously to get more advanced, we would look at "common words" and have a dictionary file that is numbered with 'frequency of that word used in English language' ranking. I think that's taking it a bit overboard maybe.
What do you think? Anyone have ideas for this?
First of all you will have to consider the complexity in finding the "nearer" words to the misspelled word. I see that you are using a dictionary, a hash table perhaps. But this may not be enough. The best and cooler solution here is to go for a TRIE datastructure. The complexity of finding these so called nearer words will take linear order timing and it is very easy to exhaust the tree.
A small example
Take the word "njce". This is a level 1 example where one word is clearly misspelled. The obvious suggestion expected would be nice. The first step is very obvious to see whether this word is present in the dictionary. Using the search function of a TRIE, this could be done O(1) time, similar to a dictionary. The cooler part is finding the suggestions. You would obviously have to exhaust all the words that start with 'a' to 'z' that has words like ajce bjce cjce upto zjce. Now to find the occurences of this type is again linear depending on the character count. You should not carried away by multiplying this number with 26 the length of words. Since TRIE immediately diminishes as the length grows. Coming back to the problem. Once that search is done for which no result was found, you go the next character. Now you would be searching for nace nbce ncce upto nzce. In fact you wont have explore all the combinations as the TRIE data structure by itself will not be having the intermediate characters. Perhaps it will have na ni ne no nu characters and the search space becomes insanely simple. So are the further occurrences. You could develop on this concept further based on second and third order matches. Hope this helped.
I'm not sure how much of the wheel you're trying to reinvent, so you may want to check out Lucene.
Apache Lucene Coreā„¢ (formerly named Lucene Java), our flagship sub-project, provides a Java-based indexing and search implementation, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities.

How to recognize words in text with non-word tokens?

I am currently parsing a bunch of mails and want to get words and other interesting tokens out of mails (even with spelling errors or combination of characters and letters, like "zebra21" or "customer242"). But how can I know that "0013lCnUieIquYjSuIA" and "anr5Brru2lLngOiEAVk1BTjN" are not words and not relevant? How to extract words and discard tokens that are encoding errors or parts of pgp signature or whatever else we get in mails and know that we will never be interested in those?
You need to decide on a good enough criteria for a word and write a regular expression or a manual to enforce it.
A few rules that can be extrapolated from your examples:
words can start with a captial letter or be all capital letters but if you have more than say, 2 uppercase letters and more than 2 lowercase letters inside a word, it's not a word
If you have numbers inside the word, it's not a word
if it's longer than say, 20 characters
There's no magic trick. you need to decide what you want the rules to be and make them happen.
Al alternative way is to train some kind of Hidden Markov-Models system to recognize things that sound like words but I think this is an overkill for what you want to do.
http://en.wikipedia.org/wiki/English_words_with_uncommon_properties
you can make rules that reject anything with these 'uncommon properties' to build a system that accepts most actual words
Although I generally agree with shoosh's answer, his approach makes it easy to achieve high recall but also low precision, i.e. you would get almost all real words but also a lot non-words. If your definition of word is too restrictive, it's the other way around but that's also not what you want since then you would miss cases like 'zebra123'. So here are a few ideas about how to improve precision:
It may be worthwile thinking about if you could determine what parts of an email belong to the main text and which are footers like pgp signatures. I'm sure it's possible to find some simple heuristics that match most cases, e.g. cut of everything below a line which consists only of '-'-characters.
Depending on your performance criteria you may want to check if a word is a real word or contains a real word by matching against a simple word list. It's easy to find quite exhaustive lists of Englisch words on the web, and you could also compile one yourself by extracting words from a large and clean text corpus.
Using a lexical analyser, you could filter every token which is marked as unknown.
Some simple statistics may tell you how likely it is that something is a word. Tokens which occur with high frequency most probably are words. Tokens which appear only once or whose number is below a certain threshold very probably are not words. Common spelling errors should appear more than once and uncommon ones may be ignored.
Some if these suggestions clearly don't work for cases like 'zebra123'. Again, simply cutting off, or splitting on, in-word numbers may do the trick.
My general approach would be to first identify tokens which certainly are words (using the suggestions above), then identify tokens which certainly are not words (using a regular expression), and then look (with your eyes) at the few hundred or thousand remaining tokens to find common characteristics to handle these separately.

Categorizing Words and Category Values

We were set an algorithm problem in class today, as a "if you figure out a solution you don't have to do this subject". SO of course, we all thought we will give it a go.
Basically, we were provided a DB of 100 words and 10 categories. There is no match between either the words or the categories. So its basically a list of 100 words, and 10 categories.
We have to "place" the words into the correct category - that is, we have to "figure out" how to put the words into the correct category. Thus, we must "understand" the word, and then put it in the most appropriate category algorthmically.
i.e. one of the words is "fishing" the category "sport" --> so this would go into this category. There is some overlap between words and categories such that some words could go into more than one category.
If we figure it out, we have to increase the sample size and the person with the "best" matching % wins.
Does anyone have ANY idea how to start something like this? Or any resources ? Preferably in C#?
Even a keyword DB or something might be helpful ? Anyone know of any free ones?
First of all you need sample text to analyze, to get the relationship of words.
A categorization with latent semantic analysis is described in Latent Semantic Analysis approaches to categorization.
A different approach would be naive bayes text categorization. Sample text with the assigned category are needed. In a learning step the program learns the different categories and the likelihood that a word occurs in a text assigned to a category, see bayes spam filtering. I don't know how well that works with single words.
Really poor answer (demonstrates no "understanding") - but as a crazy stab you could hit google (through code) for (for example) "+Fishing +Sport", "+Fishing +Cooking" etc (i.e. cross join each word and category) - and let the google fight win! i.e. the combination with the most "hits" gets chosen...
For example (results first):
weather: fish
sport: ball
weather: hat
fashion: trousers
weather: snowball
weather: tornado
With code (TODO: add threading ;-p):
static void Main() {
string[] words = { "fish", "ball", "hat", "trousers", "snowball","tornado" };
string[] categories = { "sport", "fashion", "weather" };
using(WebClient client = new WebClient()){
foreach(string word in words) {
var bestCategory = categories.OrderByDescending(
cat => Rank(client, word, cat)).First();
Console.WriteLine("{0}: {1}", bestCategory, word);
}
}
}
static int Rank(WebClient client, string word, string category) {
string s = client.DownloadString("http://www.google.com/search?q=%2B" +
Uri.EscapeDataString(word) + "+%2B" +
Uri.EscapeDataString(category));
var match = Regex.Match(s, #"of about \<b\>([0-9,]+)\</b\>");
int rank = match.Success ? int.Parse(match.Groups[1].Value, NumberStyles.Any) : 0;
Debug.WriteLine(string.Format("\t{0} / {1} : {2}", word, category, rank));
return rank;
}
Maybe you are all making this too hard.
Obviously, you need an external reference of some sort to rank the probability that X is in category Y. Is it possible that he's testing your "out of the box" thinking and that YOU could be the external reference? That is, the algorithm is a simple matter of running through each category and each word and asking YOU (or whoever sits at the terminal) whether word X is in the displayed category Y. There are a few simple variations on this theme but they all involve blowing past the Gordian knot by simply cutting it.
Or not...depends on the teacher.
So it seems you have a couple options here, but for the most part I think if you want accurate data you are going to need to use some outside help. Two options that I can think of would be to make use of a dictionary search, or crowd sourcing.
In regards to a dictionary search, you could just go through the database, query it and parse the results to see if one of the category names is displayed on the page. For example, if you search "red" you will find "color" on the page and likewise, searching for "fishing" returns "sport" on the page.
Another, slightly more outside the box option would be to make use of crowd sourcing, consider the following:
Start by more or less randomly assigning name-value pairs.
Output the results.
Load the results up on Amazon Mechanical Turk (AMT) to get feedback from humans on how well the pairs work.
Input the results of the AMT evaluation back into the system along with the random assignments.
If everything was approved, then we are done.
Otherwise, retain the correct hits and process them to see if any pattern can be established, generate a new set of name-value pairs.
Return to step 3.
Granted this would entail some financial outlay, but it might also be one of the simplest and accurate versions of the data you are going get on a fairly easy basis.
You could do a custom algorithm to work specifically on that data, for instance words ending in 'ing' are verbs (present participle) and could be sports.
Create a set of categorization rules like the one above and see how high an accuracy you get.
EDIT:
Steal the wikipedia database (it's free anyway) and get the list of articles under each of your ten categories. Count the occurrences of each of your 100 words in all the articles under each category, and the category with the highest 'keyword density' of that word (e.g. fishing) wins.
This sounds like you could use some sort of Bayesian classification as it is used in spam filtering. But this would still require "external data" in the form of some sort of text base that provides context.
Without that, the problem is impossible to solve. It's not an algorithm problem, it's an AI problem. But even AI (and natural intelligence as well, for that matter) needs some sort of input to learn from.
I suspect that the professor is giving you an impossible problem to make you understand at what different levels you can think about a problem.
The key question here is: who decides what a "correct" classification is? What is this decision based on? How could this decision be reproduced programmatically, and what input data would it need?
I am assuming that the problem allows using external data, because otherwise I cannot conceive of a way to deduce the meaning from words algorithmically.
Maybe something could be done with a thesaurus database, and looking for minimal distances between 'word' words and 'category' words?
Fire this teacher.
The only solution to this problem is to already have the solution to the problem. Ie. you need a table of keywords and categories to build your code that puts keywords into categories.
Unless, as you suggest, you add a system which "understands" english. This is the person sitting in front of the computer, or an expert system.
If you're building an expert system and doesn't even know it, the teacher is not good at giving problems.
Google is forbidden, but they have almost a perfect solution - Google Sets.
Because you need to unterstand the semantics of the words you need external datasources. You could try using WordNet. Or you could maybe try using Wikipedia - find the page for every word (or maybe only for the categories) and look for other words appearing on the page or linked pages.
Yeah I'd go for the wordnet approach.
Check this tutorial on WordNet-based semantic similarity measurement. You can query Wordnet online at princeton.edu (google it) so it should be relatively easy to code a solution for your problem.
Hope this helps,
X.
Interesting problem. What you're looking at is word classification. While you can learn and use traditional information retrieval methods like LSA and categorization based on such - I'm not sure if that is your intent (if it is, then do so by all means! :)
Since you say you can use external data, I would suggest using wordnet and its link between words. For instance, using wordnet,
# S: (n) **fishing**, sportfishing (the act of someone who fishes as a diversion)
* direct hypernym / inherited hypernym / sister term
o S: (n) **outdoor sport, field sport** (a sport that is played outdoors)
+ direct hypernym / inherited hypernym / sister term
# S: (n) **sport**, athletics
(an active diversion requiring physical exertion and competition)
What we see here is a list of relationships between words. The term fishing relates to outdoor sport, which relates to sport.
Now, if you get the drift - it is possible to use this relationship to compute a probability of classifying "fishing" to "sport" - say, based on the linear distance of the word-chain, or number of occurrences, et al. (should be trivial to find resources on how to construct similarity measures using wordnet. when the prof says "not to use google", I assume he means programatically and not as a means to get information to read up on!)
As for C# with wordnet - how about http://opensource.ebswift.com/WordNet.Net/
My first thought would be to leverage external data. Write a program that google-searches each word, and takes the 'category' that appears first/highest in the search results :)
That might be considered cheating, though.
Well, you can't use Google, but you CAN use Yahoo, Ask, Bing, Ding, Dong, Kong...
I would do a few passes. First query the 100 words against 2-3 search engines, grab the first y resulting articles (y being a threshold to experiment with. 5 is a good start I think) and scan the text. In particular I"ll search for the 10 categories. If a category appears more than x time (x again being some threshold you need to experiment with) its a match.
Based on that x threshold (ie how many times a category appears in the text) and how may of the top y pages it appears in you can assign a weigh to a word-category pair.
for better accuracy you can then do another pass with those non-google search engines with the word-category pair (with a AND relationship) and apply the number of resulting pages to the weight of that pair. Them simply assume the word-category pair with highest weight is the right one (assuming you'll even have more than one option). You can also multi assign a word to a multiple category if the weights are close enough (z threshold maybe).
Based on that you can introduce any number of words and any number of categories. And You'll win your challenge.
I also think this method is good to evaluate the weight of potential adwords in advertising. but that's another topic....
Good luck
Harel
Use (either online, or download) WordNet, and find the number of relationships you have to follow between words and each category.
Use an existing categorized large data set such as RCV1 to train your system of choice. You could do worse then to start reading existing research and benchmarks.
Appart from Google there exist other 'encyclopedic" datasets you can build of, some of them hosted as public data sets on Amazon Web Services, such as a complete snapshot of the English language Wikipedia.
Be creative. There is other data out there besides Google.
My attempt would be to use the toolset of CRM114 to provide a way to analyze a big corpus of text. Then you can utilize the matchings from it to give a guess.
My naive approach:
Create a huge text file like this (read the article for inspiration)
For every word, scan the text and whenever you match that word, count the 'categories' that appear in N (maximum, aka radio) positions left and right of it.
The word is likely to belong in the category with the greatest counter.
Scrape delicious.com and search for each word, looking at collective tag counts, etc.
Not much more I can say about that, but delicious is old, huge, incredibly-heavily tagged and contains a wealth of current relevant semantic information to draw from. It would be very easy to build a semantics database this way, using your word list as a basis from scraping.
The knowledge is in the tags.
As you don't need to attend the subject when you solve this 'riddle' it's not supposed to be easy I think.
Nevertheless I would do something like this (told in a very simplistic way)
Build up a Neuronal Network which you give some input (a (e)book, some (e)books)
=> no google needed
this network classifies words (Neural networks are great for 'unsure' classification). I think you may simply know which word belongs to which category because of the occurences in the text. ('fishing' is likely to be mentioned near 'sports').
After some training of the neural network it should "link" you the words to the categories.
You might be able to put use the WordNet database, create some metric to determine how closely linked two words (the word and the category) are and then choose the best category to put the word in.
You could implement a learning algorithm to do this using a monte carlo method and human feedback. Have the system randomly categorize words, then ask you to vote them as "match" or "not match." If it matches, the word is categorized and can be eliminated. If not, the system excludes it from that category in future iterations since it knows it doesn't belong there. This will get very accurate results.
This will work for the 100 word problem fairly easily. For the larger problem, you could combine this with educated guessing to make the process work faster. Here, as many people above have mentioned, you will need external sources. The google method would probably work the best, since google's already done a ton of work on it, but barring that you could, for example, pull data from your facebook account using the facebook apis and try to figure out which words are statistically more likely to appear with previously categorized words.
Either way, though, this cannot be done without some kind of external input that at some point came from a human. Unless you want to be cheeky and, for example, define the categories by some serialized value contained in the ascii text for the name :P

Resources