Google search suggestion implementation - data-structures

In a recent Amazon interview I was asked to implement Google's "suggestion" feature. When a user enters "Aeniffer Aninston", Google suggests "Did you mean Jeniffer Aninston". I tried to solve it with hashing but could not cover the corner cases. Please let me know your thoughts on this.

The four most common types of errors are:
Omitted letter: "stck" instead of "stack"
One-letter typo: "styck" instead of "stack"
Extra letter: "starck" instead of "stack"
Adjacent letters swapped: "satck" instead of "stack"
BTW, you could also swap any two letters, not only adjacent ones, but that is not a common typo.
Treat every string as a vertex and every single edit listed above as an edge. The initial state is the typed word. Run BFS/DFS from that initial vertex. The depth of the search is your own choice. Remember that increasing the depth dramatically increases the number of "probable corrections". I think depth ~ 4-5 is a good start.
After generating the "probable corrections", look up each candidate word in a dictionary: binary search in a sorted dictionary, or search in a trie populated with your dictionary.
A trie is faster, but binary search allows searching in a random access file without loading the dictionary into RAM. You only have to load a precomputed integer array: array[i] gives the number of bytes to skip to access the i-th word. Words in the random access file must be written in sorted order. If you have enough RAM to store the dictionary, use a trie.
Before suggesting corrections, check the typed word: if it is in the dictionary, suggest nothing.
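The offset-array lookup described above could be sketched roughly as follows. This is only an illustration in Python: the function names build_offsets and contains are made up here, and an in-memory io.BytesIO object stands in for a real dictionary file opened in binary mode.

```python
import io

def build_offsets(sorted_words, out):
    """Write one word per line and return the byte offset of each word."""
    offsets = []
    for word in sorted_words:
        offsets.append(out.tell())          # where the i-th word starts
        out.write(word.encode("utf-8") + b"\n")
    return offsets

def contains(f, offsets, target):
    """Binary search over the file: only the offsets array lives in RAM."""
    key = target.encode("utf-8")
    lo, hi = 0, len(offsets) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        f.seek(offsets[mid])                # jump straight to the mid-th word
        word = f.readline().rstrip(b"\n")
        if word == key:
            return True
        if word < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return False

# Stand-in for a dictionary file on disk (hypothetical data).
dictionary_file = io.BytesIO()
offsets = build_offsets(sorted(["stock", "stack", "stick"]), dictionary_file)
print(contains(dictionary_file, offsets, "stick"))
```

Each lookup costs O(log n) seeks, which is the trade-off against the trie mentioned above.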
UPDATE
Generating corrections should be done by BFS. When I tried DFS, entries like "Jeniffer" showed "edit distance = 3". DFS doesn't work, since it makes a lot of changes where a single step would do: for example, Jniffer->nJiffer->enJiffer->eJniffer->Jeniffer instead of Jniffer->Jeniffer.
Sample code for generating corrections by BFS
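// Imports this snippet needs (put them at the top of the file;
// Pair and main are assumed to sit inside one enclosing class).
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
static class Pair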
{
private String word;
private byte dist;
// dist fits in a byte because dist<=127.
// Moreover, dist<=6 in a real application
public Pair(String word,byte dist)
{
this.word = word;
this.dist = dist;
}
public String getWord()
{
return word;
}
public int getDist()
{
return dist;
}
}
public static void main(String[] args) throws Exception
{
HashSet<String> usedWords;
HashSet<String> dict;
ArrayList<String> corrections;
ArrayDeque<Pair> states;
usedWords = new HashSet<String>();
corrections = new ArrayList<String>();
dict = new HashSet<String>();
states = new ArrayDeque<Pair>();
// populate dictionary. In real usage should be populated from prepared file.
dict.add("Jeniffer");
dict.add("Jeniffert"); //depth 2 test
usedWords.add("Jniffer");
states.add(new Pair("Jniffer", (byte)0));
while(!states.isEmpty())
{
Pair head = states.pollFirst();
//System.out.println(head.getWord()+" "+head.getDist());
if(head.getDist()<=2)
{
// Depth check: words at distance 0..2 are expanded, so candidates reach distance 3.
// Distance 4 is the first depth where we don't generate anything.
// swap adjacent letters
for(int i=0;i<head.getWord().length()-1;i++)
{
// swap i-th and i+1-th letters
String newWord = head.getWord().substring(0,i)+head.getWord().charAt(i+1)+head.getWord().charAt(i)+head.getWord().substring(i+2);
// even if i==curWord.length()-2 and then i+2==curWord.length
//substring(i+2) doesn't throw exception and returns empty string
// the same for substring(0,i) when i==0
if(!usedWords.contains(newWord))
{
usedWords.add(newWord);
if(dict.contains(newWord))
{
corrections.add(newWord);
}
states.addLast(new Pair(newWord, (byte)(head.getDist()+1)));
}
}
// insert letters
for(int i=0;i<=head.getWord().length();i++)
for(char ch='a';ch<='z';ch++)
{
String newWord = head.getWord().substring(0,i)+ch+head.getWord().substring(i);
if(!usedWords.contains(newWord))
{
usedWords.add(newWord);
if(dict.contains(newWord))
{
corrections.add(newWord);
}
states.addLast(new Pair(newWord, (byte)(head.getDist()+1)));
}
}
}
}
for(String correction:corrections)
{
System.out.println("Did you mean "+correction+"?");
}
usedWords.clear();
corrections.clear();
// helper data structures must be cleared after each generateCorrections call - must be empty for the future usage.
}
Words in the dictionary: Jeniffer, Jeniffert. (Jeniffert is just for testing.)
Output:
Did you mean Jeniffer?
Did you mean Jeniffert?
Important!
I expand words up to depth 2 (so the generated candidates are at most 3 edits away). In a real application the depth should be 4-6, but since the number of combinations grows exponentially, I don't go that deep. There are optimizations that reduce the number of branches in the search tree, but I haven't thought much about them; I wrote only the main idea.
Also, I used a HashSet both for storing the dictionary and for marking used words. HashSet's constant factor seems too large when it contains a million objects. Maybe you should use a trie both for checking whether a word is in the dictionary and for checking whether a word has been marked.
I didn't implement the erase-letter and change-letter operations because I only wanted to show the main idea.
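For completeness, here is a minimal Python sketch of the same BFS with all four edit operations (erase, change, swap, insert). The names neighbors and corrections and the max_dist parameter are illustrative, and the sketch assumes lowercase ASCII words; it is not a tuned implementation.

```python
from collections import deque
import string

def neighbors(word, alphabet=string.ascii_lowercase):
    """Yield every string that is one edit away from word."""
    for i in range(len(word)):
        yield word[:i] + word[i + 1:]                  # erase a letter
        for ch in alphabet:
            yield word[:i] + ch + word[i + 1:]         # change a letter
    for i in range(len(word) - 1):
        yield word[:i] + word[i + 1] + word[i] + word[i + 2:]  # swap adjacent
    for i in range(len(word) + 1):
        for ch in alphabet:
            yield word[:i] + ch + word[i:]             # insert a letter

def corrections(typed, dictionary, max_dist=2):
    """BFS from the typed word; return (word, distance) pairs found in dictionary."""
    if typed in dictionary:
        return []                                      # nothing to suggest
    seen = {typed}
    queue = deque([(typed, 0)])
    found = []
    while queue:
        word, dist = queue.popleft()
        if dist == max_dist:
            continue                                   # do not expand further
        for cand in neighbors(word):
            if cand in seen:
                continue
            seen.add(cand)
            if cand in dictionary:
                found.append((cand, dist + 1))
            queue.append((cand, dist + 1))
    return found

print(corrections("jniffer", {"jeniffer", "jeniffert"}))
```

Because BFS visits candidates in order of distance, closer corrections appear first in the result.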

Related

Shortest word in a dictionary containing all letters

I was asked this question in an interview.
Given an array of characters, find the shortest word in a dictionary that contains all the characters. Also, propose an implementation for the dictionary that would optimize this function call.
For example, char[] chars = { 'R' , 'C' }. The result should be the word "CAR".
I could not come up with anything that would run reasonably quickly. I thought of pre-processing the dictionary by building a hash table to retrieve all words of a particular length. Then I could only think of retrieving all words in increasing order of length and checking whether the required characters were present in any of them (maybe by using a bitmask).
This is a common software interview question, and its solution is this: sort the dictionary itself by length and sort each value alphabetically. When given the characters, sort them and find the needed letters.
First sort the dictionary in ascending order of length.
For each letter, construct a bit map of the locations in the dictionary of the words containing that letter. Each bit map will be long, but there will not be many.
For each search, take the intersection of the bitmaps for the letters in the array. The first one bit in the result will be at the index corresponding to the location in the dictionary of the shortest word containing all the letters.
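The bitmap idea above might look like this in Python, using arbitrary-precision integers as the bitmaps. The function names build_index and shortest_containing are made up for this sketch, and matching is assumed to be case-insensitive.

```python
def build_index(words):
    """Sort words by length and build one bitmap (a Python int) per letter."""
    ordered = sorted(words, key=len)
    bitmaps = {}
    for pos, word in enumerate(ordered):
        for letter in set(word.lower()):
            # bit `pos` is set when the word at index `pos` contains the letter
            bitmaps[letter] = bitmaps.get(letter, 0) | (1 << pos)
    return ordered, bitmaps

def shortest_containing(ordered, bitmaps, letters):
    """Intersect the bitmaps; the lowest set bit marks the shortest match."""
    result = (1 << len(ordered)) - 1        # start with every word allowed
    for letter in letters:
        result &= bitmaps.get(letter.lower(), 0)
    if result == 0:
        return None
    lowest = (result & -result).bit_length() - 1
    return ordered[lowest]

ordered, bitmaps = build_index(["carrot", "car", "arc", "cat"])
print(shortest_containing(ordered, bitmaps, ["R", "C"]))
```

Note that a bitmap per letter only records presence, so a query such as "two o's" would need an extra check on the candidate word.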
The other answers are better, but I realized this is entirely precomputable.
For each word:
  Sort the letters and remove duplicates.
  The set of letters can be viewed as a bitmask: A = bit 0, B = bit 1, ..., Z = bit 25. Set the bits of a mask A according to the letters in this word.
  For each combination of set bits in the mask A, make a subset mask B:
    If there is already a word associated with mask B:
      if this word is shorter, replace the associated word with this one;
      otherwise try the next B.
    If there is no word associated with mask B:
      associate this word with mask B.
This would take a huge amount of setup time, and the subsequent association storage would be in the vicinity of 1.7GB, but you'd be able to find the shortest word containing a superset of the letters in O(1) time guaranteed.
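A small Python sketch of this precomputation follows. It deviates from the description in one labeled way: instead of a full table with one slot per possible mask, it stores only the masks that actually occur in a dict, so lookups are hash lookups. The names letter_mask, precompute and shortest_containing are illustrative.

```python
def letter_mask(word):
    """Bitmask of the letters in word: a = bit 0, ..., z = bit 25."""
    mask = 0
    for ch in word.lower():
        if "a" <= ch <= "z":
            mask |= 1 << (ord(ch) - ord("a"))
    return mask

def precompute(words):
    """Map every subset of every word's letter mask to the shortest such word."""
    best = {}
    for word in words:
        mask = letter_mask(word)
        sub = mask
        while True:
            current = best.get(sub)
            if current is None or len(word) < len(current):
                best[sub] = word
            if sub == 0:
                break
            sub = (sub - 1) & mask          # next smaller subset of mask
    return best

def shortest_containing(best, letters):
    """Single lookup: the query letters form the subset mask."""
    return best.get(letter_mask("".join(letters)))

best = precompute(["carrot", "car", "dog"])
print(shortest_containing(best, ["R", "C"]))
```

A word with k distinct letters contributes 2^k subset masks, which is where the large setup time and storage come from.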
The obvious preprocessing is to sort all words in the dictionary by their length and alphabetical re-ordering: "word" under "dorw", for example. Then you can use general search algorithms (e.g., regex) to search for the letters you need. An efficient (DFA) search requires only one pass over the dictionary in the worst case, and much less if the first match is short.
Here is a solution in C#:
using System.Collections.Generic;
using System.Linq;
public class ShortestWordFinder
{
public ShortestWordFinder(IEnumerable<string> dictionary)
{
this.dictionary = dictionary;
}
public string ShortestWordContaining(IEnumerable<char> chars)
{
var wordsContaining = dictionary.Where(s =>
{
foreach (var c in chars)
{
if (!s.Contains(c))
{
return false;
}
s = s.Remove(s.IndexOf(c), 1);
}
return true;
}).ToList();
if (!wordsContaining.Any())
{
return null;
}
var minLength = wordsContaining.Min(word => word.Length);
return wordsContaining.First(word => word.Length == minLength);
}
private readonly IEnumerable<string> dictionary;
}
Simple test:
using System.Diagnostics;
using Xunit;
public class ShortestWordFinderTests
{
[Fact]
public void Works()
{
var words = new[] { "dog", "moose", "gargoyle" };
var finder = new ShortestWordFinder(words);
Trace.WriteLine(finder.ShortestWordContaining("o"));
Trace.WriteLine(finder.ShortestWordContaining("oo"));
Trace.WriteLine(finder.ShortestWordContaining("oy"));
Trace.WriteLine(finder.ShortestWordContaining("eyg"));
Trace.WriteLine(finder.ShortestWordContaining("go"));
Assert.Null(finder.ShortestWordContaining("ooo"));
}
}
Preprocessing:
a. Sort the letters of each word into an alphabetically ordered char array. Retain the mapping from the sorted form to the original word.
b. Split the dictionary by word length, as you suggest.
c. Sort the entries in each word-length set alphabetically.
On function call:
1. Sort the char array alphabetically.
2. Start with the group of the same length as the array.
3. Loop through the entries, testing for the characters, until the first letter of an entry is lexicographically greater than the first in your char array, then break. If there is a match, return the original word (see a. above for the mapping).
4. Go back to 2 for the next-longest word group.
Interesting extension: multiple words might map to the same entry in a. Which one(s) should you return?

Any Java functions for blocking non English words?

Please suggest me the best Java api for removing non English words and blocking incorrect words using
I use an English word list file to parse the given string. The code responds very slowly.
String englishword;
while ((englishword = br.readLine()) != null) {
//System.out.println("#"+englishword);
for (String word : wordsArray) {
//System.out.println("#"+word);
if(englishword.trim().toUpperCase().equals(word.trim().toUpperCase()))
{
linetmp = linetmp.replaceAll(word, " ").trim();
break;
}
}
}
if(linetmp!=null)
for(String nonEnglish:linetmp.split("\\s+"))
{
line = line.replaceAll(nonEnglish, "");
}
line = line.replaceAll(" +", " ");
return line;
Please suggest a faster way to do this, if there is one.
Note: I am using the Linux OS's dictionary list.
Do the trim() and toUpperCase() of the checked word only once, outside the for (String word : wordsArray) loop.
If you do excessively heavy operations in the inner loop, no API will help you.
You can use a Java API function for searching
import org.apache.commons.lang.ArrayUtils;
ArrayUtils.indexOf(array, string);
You can make your code a lot faster [1] by changing the wordsArray to a HashSet, and using the contains(String) method to do the checks. (Make sure you convert words to upper case when you build the set.)
However, I would point out that this approach doesn't scale. It is not practical to enumerate all possible "non-English or incorrect" words. You would be better off building a set containing all of the words that you are prepared to accept, and then eliminating the words not in the set.
[1] Currently, your inner loop takes time that is proportional to the number of words (N) in wordsArray; i.e. O(N). If you use a HashSet, the operation takes O(1) time; i.e. roughly constant time.
There is a faster way.
Create a HashSet<String> containing all the elements of wordsArray (converted to all lower case or all upper case).
For each new word englishword check if set.contains(englishword.toLowerCase()).
This solution runs in O(n|S|) pre-processing (creating the HashSet), and checking each word is O(|S|) where |S| is the length of the string and n is number of words in the array, while your solution is basically O(n|S|) per word.
Code snippet (it needs java.util.HashSet and java.util.Set):
public static class EnglishChecker {
private final Set<String> set;
public EnglishChecker(String[] englishWords) {
set = new HashSet<>();
for (String s : englishWords) {
set.add(s.toLowerCase());
}
}
public boolean isWord(String s) {
return set.contains(s.toLowerCase());
}
}
public static void main(String[] args) {
String[] words = { "Cat", "dog", "mousE" };
EnglishChecker checker = new EnglishChecker(words);
System.out.println(checker.isWord("cat"));
System.out.println(checker.isWord("cccccccat"));
System.out.println(checker.isWord("MOUSE"));
}
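Applied to the original task, the same set-based check can filter a whole line in one pass. The following Python sketch is only an illustration: keep_known_words and the sample word list are made up, and words are assumed to be separated by whitespace.

```python
def keep_known_words(line, accepted):
    """Keep only the words whose lowercase form is in the accepted set."""
    return " ".join(w for w in line.split() if w.lower() in accepted)

# The accepted set is built once, lowercased, then reused for every line.
accepted = {w.lower() for w in ["Cat", "dog", "mousE"]}
print(keep_known_words("The cat xyzzy chased a MOUSE", accepted))
```

Each word costs one hash lookup, so the per-line cost no longer depends on the size of the word list.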

Calculate each Word Occurrence in large document

I was wondering which data structure I could use to solve this problem. Can anyone explain this in detail? I was thinking of using a tree.
There is a large document that contains millions of words. How would you calculate the occurrence count of each word in an optimal way?
This question was asked at Microsoft. Any suggestions will be appreciated!
I'd just use a hash map (or Dictionary, since this is Microsoft ;) ) of strings to integers. For each word of the input, either add it to the dictionary if it's new, or increment its count otherwise. O(n) over the length of the input, assuming the hash map implementation is decent.
Using a dictionary or hash map will result in O(n) on average.
To solve it in O(n) worst case, a trie with a small change should be used:
add a counter to each word's representation in the trie; each time a word that is inserted already exists, increment its counter.
If you want to print all the counts at the end, you can keep the counters in a separate list and reference it from the trie instead of storing the counters in the trie.
class IntValue
{
public IntValue(int value)
{
Value = value;
}
public int Value;
}
static void Main(string[] args)
{
//assuming document is a enumerator for the word in the document:
Dictionary<string, IntValue> dict = new Dictionary<string, IntValue>();
foreach (string word in document)
{
IntValue intValue;
if(!dict.TryGetValue(word, out intValue))
{
intValue = new IntValue(0);
dict.Add(word, intValue);
}
++intValue.Value;
}
//now dict contains the counts
}
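The trie-with-counter variant described in this answer could be sketched like this in Python. TrieNode, count_words and iter_counts are names invented for the sketch; children are kept in a dict per node, which is one possible representation among several.

```python
class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}   # next character -> TrieNode
        self.count = 0       # how many times a word ended at this node

def count_words(words):
    """Insert every word; an already existing word just bumps its counter."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            child = node.children.get(ch)
            if child is None:
                child = node.children[ch] = TrieNode()
            node = child
        node.count += 1
    return root

def iter_counts(node, prefix=""):
    """Walk the trie and yield (word, count) pairs in alphabetical order."""
    if node.count:
        yield prefix, node.count
    for ch in sorted(node.children):
        yield from iter_counts(node.children[ch], prefix + ch)

root = count_words("a b a ab b a".split())
print(list(iter_counts(root)))
```

Inserting a word takes time proportional to its length regardless of how many words are stored, which is where the worst-case bound comes from.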
Tree would not work here.
Hashtable<String, Integer> ht = new Hashtable<String, Integer>();
// Read each word in the text in order; for each of them:
if (ht.containsKey(oneWord)) // containsKey, because contains() tests values, not keys
{
    ht.put(oneWord, ht.get(oneWord) + 1);
}
else
{
    ht.put(oneWord, 1);
}

word distribution problem

I have a big file of words (~100 GB) and limited memory (4 GB). I need to calculate the word distribution from this file. One option is to divide it into chunks, sort each chunk, and then merge to calculate the word distribution. Is there any other way it can be done faster? One idea is to sample, but I am not sure how to implement it so that it returns a result close to the correct one.
Thanks
You can build a trie where each leaf (and some inner nodes) holds the current count. Since words overlap with each other (repeated words and shared prefixes are stored once), 4 GB should be enough to process 100 GB of data.
Naively I would just build up a hash table until it hits a certain limit in memory, then sort it in memory and write this out. Finally, you can do n-way merging of each chunk. At most you will have 100/4 chunks or so, but probably many fewer provided some words are more common than others (and how they cluster).
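The spill-and-merge approach just described might be sketched as follows in Python. The limit is expressed here as a number of distinct words (max_distinct) rather than bytes, which is a simplification; spill, read_run and word_counts are invented names, and words are assumed to contain no tabs or newlines.

```python
import heapq
import os
import tempfile
from collections import Counter
from itertools import groupby

def spill(counts, tmpdir):
    """Write one sorted run of "word<TAB>count" lines and return its path."""
    fd, path = tempfile.mkstemp(dir=tmpdir)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        for word in sorted(counts):
            f.write(f"{word}\t{counts[word]}\n")
    return path

def read_run(path):
    """Stream (word, count) pairs back from a sorted run."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, n = line.rstrip("\n").split("\t")
            yield word, int(n)

def word_counts(words, max_distinct, tmpdir):
    """Count in memory, spill when the table is full, then n-way merge the runs."""
    runs, counts = [], Counter()
    for word in words:
        counts[word] += 1
        if len(counts) >= max_distinct:
            runs.append(spill(counts, tmpdir))
            counts = Counter()
    if counts:
        runs.append(spill(counts, tmpdir))
    merged = heapq.merge(*(read_run(p) for p in runs))
    for word, group in groupby(merged, key=lambda pair: pair[0]):
        yield word, sum(n for _, n in group)

with tempfile.TemporaryDirectory() as d:
    print(dict(word_counts("b a b c a b".split(), 2, d)))
```

Only one line per run is held in memory during the merge, so the number of runs, not the input size, bounds the memory used in the final phase.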
Another option is to use a trie which was built for this kind of thing. Each character in the string becomes a branch in a 256-way tree and at the leaf you have the counter. Look up the data structure on the web.
If you can pardon the pun, "trie" this:
using System.Collections.Generic;
using System.Linq; // needed for Skip() and ToArray()
public class Trie : Dictionary<char, Trie>
{
public int Frequency { get; set; }
public void Add(string word)
{
this.Add(word.ToCharArray());
}
private void Add(char[] chars)
{
if (chars == null || chars.Length == 0)
{
throw new System.ArgumentException();
}
var first = chars[0];
if (!this.ContainsKey(first))
{
this.Add(first, new Trie());
}
if (chars.Length == 1)
{
this[first].Frequency += 1;
}
else
{
this[first].Add(chars.Skip(1).ToArray());
}
}
public int GetFrequency(string word)
{
return this.GetFrequency(word.ToCharArray());
}
private int GetFrequency(char[] chars)
{
if (chars == null || chars.Length == 0)
{
throw new System.ArgumentException();
}
var first = chars[0];
if (!this.ContainsKey(first))
{
return 0;
}
if (chars.Length == 1)
{
return this[first].Frequency;
}
else
{
return this[first].GetFrequency(chars.Skip(1).ToArray());
}
}
}
Then you can call code like this:
var t = new Trie();
t.Add("Apple");
t.Add("Banana");
t.Add("Cherry");
t.Add("Banana");
var a = t.GetFrequency("Apple"); // == 1
var b = t.GetFrequency("Banana"); // == 2
var c = t.GetFrequency("Cherry"); // == 1
You should be able to add code to traverse the trie and return a flat list of words and their frequencies.
If you find that this too still blows your memory limit then might I suggest that you "divide and conquer". Maybe scan the source data for all the first characters and then run the trie separately against each and then concatenate the results after all of the runs.
Do you know how many different words you have? If not a lot (e.g. a hundred thousand), then you can stream the input, split it into words, and use a hash table to keep the counts. After the input is done, just traverse the result.
Just use a DBM file. It’s a hash on disk. If you use the more recent versions, you can use a B+Tree to get in-order traversal.
Why not use any relational DB? The procedure would be as simple as:
Create a table with the word and count.
Create an index on word. Some databases have a word index (e.g. Progress).
Do SELECT on this table with the word.
If word exists then increase counter.
Otherwise - add it to the table.
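As a rough illustration of these steps, here is a sketch using SQLite through Python's sqlite3 module. The table and column names (word_count, occurrences) are invented for the example, and the PRIMARY KEY on word doubles as the index; an on-disk database file would replace ":memory:" for real data.

```python
import sqlite3

def count_into_db(conn, words):
    """Increase the counter if the word exists, otherwise add it to the table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_count ("
        "word TEXT PRIMARY KEY, occurrences INTEGER NOT NULL)"
    )
    for word in words:
        cur = conn.execute(
            "UPDATE word_count SET occurrences = occurrences + 1 WHERE word = ?",
            (word,),
        )
        if cur.rowcount == 0:               # no row was updated: new word
            conn.execute(
                "INSERT INTO word_count (word, occurrences) VALUES (?, 1)",
                (word,),
            )
    conn.commit()

conn = sqlite3.connect(":memory:")
count_into_db(conn, ["apple", "banana", "apple"])
print(conn.execute("SELECT word, occurrences FROM word_count ORDER BY word").fetchall())
```

This trades speed for simplicity: every word costs at least one indexed statement, which is likely to be slower than the in-memory approaches.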
If you are using Python, you can write an __iter__ method as a generator. It will read line by line from your file and will not cause memory problems. You should not "return" the value but "yield" it.
Here is a sample that I used to read a file and get the vector values.
def __iter__(self):
    for line in open(self.temp_file_name):
        yield self.dictionary.doc2bow(line.lower().split())
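For the word-distribution problem itself, the same generator pattern can feed a counter directly. This is a minimal sketch under the assumption that the number of distinct words fits in memory; iter_words and word_distribution are illustrative names, and any iterable of lines (such as an open file) can be passed in.

```python
from collections import Counter

def iter_words(lines):
    """Lazily yield lowercase words, one input line at a time."""
    for line in lines:
        yield from line.lower().split()

def word_distribution(lines):
    """Only the counts are kept in memory, never the whole input."""
    return Counter(iter_words(lines))

sample = ["Apple banana", "banana Cherry", "apple"]
print(word_distribution(sample).most_common())
```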

Word-separating algorithm

What is the algorithm - seemingly in use on domain parking pages - that takes a spaceless bunch of words (eg "thecarrotofcuriosity") and more-or-less correctly breaks it down into the constituent words (eg "the carrot of curiosity") ?
Start with a basic Trie data structure representing your dictionary. As you iterate through the characters of the string, search your way through the trie with a set of pointers rather than a single pointer - the set is seeded with the root of the trie. For each letter, the whole set is advanced at once via the pointer indicated by the letter, and if a set element cannot be advanced by the letter, it is removed from the set. Whenever you reach a possible end-of-word, add a new root-of-trie to the set (keeping track of the list of words seen associated with that set element). Finally, once all characters have been processed, return an arbitrary list of words which is at the root-of-trie. If there's more than one, that means the string could be broken up in multiple ways (such as "therapistforum" which can be parsed as ["therapist", "forum"] or ["the", "rapist", "forum"]) and it's undefined which we'll return.
Or, in a wacked up pseudocode (Java foreach, tuple indicated with parens, set indicated with braces, cons using head :: tail, [] is the empty list):
List<String> breakUp(String str, Trie root) {
    Set<(List<String>, Trie)> set = {([], root)};
    for (char c : str) {
        Set<(List<String>, Trie)> newSet = {};
        for (List<String> ls, Trie t : set) {
            Trie tNext = t.follow(c);
            if (tNext != null) {
                newSet.add((ls, tNext));
                if (tNext.isWord()) {
                    // a word ends here: remember it and restart from the root
                    newSet.add((tNext.getWord() :: ls, root));
                }
            }
        }
        set = newSet;
    }
    for (List<String> ls, Trie t : set) {
        // ls was built by consing onto the head, so it lists the words last-to-first
        if (t == root) return ls;
    }
    return null;
}
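A runnable Python take on the pseudocode above is sketched here. It makes two labeled changes: each state also remembers where the current word started so the word can be sliced out of the input, and at most one "back at the root" state is kept per position, since any one segmentation is acceptable. The names build_trie, break_up and END are invented for the sketch.

```python
END = ""   # marker key for "a word ends at this node"; never equal to a character

def build_trie(words):
    """Nested dicts: one level per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def break_up(text, root):
    """Return one list of dictionary words covering text, or None."""
    # state: (words found so far, start index of the current word, trie node)
    states = [((), 0, root)]
    for i, ch in enumerate(text):
        next_states = []
        restarted = False
        for words, start, node in states:
            child = node.get(ch)
            if child is None:
                continue                     # this pointer cannot advance
            next_states.append((words, start, child))
            if END in child and not restarted:
                # a word ends here: record it and seed a new root pointer
                next_states.append((words + (text[start:i + 1],), i + 1, root))
                restarted = True
        states = next_states
    for words, _start, node in states:
        if node is root:
            return list(words)
    return None

trie = build_trie(["the", "carrot", "of", "curiosity"])
print(break_up("thecarrotofcuriosity", trie))
```

Keeping a single root state per position bounds the number of live states by the length of the input.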
Let me know if I need to clarify or I missed something...
I would imagine they take a dictionary word list like /usr/share/dict/words on your common or garden variety Unix system and try to find sets of word matches (starting from the left?) that result in the largest amount of original text being covered by a match. A simple breadth-first-search implementation would probably work fine, since it obviously doesn't have to run fast.
I'd imagine these sites do it similarly to this:
Get a list of words for your target language
Remove "useless" words like "a", "the", ...
Run through the list and check which of the words are substrings of the domain name
Take the most common words of the remaining list (Or the ones with the highest adsense rating,...)
Of course that leads to nonsense for expertsexchange, but what else would you expect there...
(disclaimer: I did not try it myself, so take it merely as food for experimentation. 4-grams are taken mostly out of the blue sky, just from my experience that 3-grams won't work all too well; 5-grams and more might work better, even though you will have to deal with a pretty large table). It's also simplistic in the sense that it does not take into account the ending of the string - if it works for you otherwise, you'd probably need to think about fixing the endings.
This algorithm would run in a predictable time proportional to the length of the string that you are trying to split.
So, first: take a lot of human-readable texts. For each text, supposing it is in a single string str, run the following algorithm (pseudocode-ish notation, assumes the [] is a hashtable-like indexing, and that nonexistent indexes return '0'):
for(i=0;i<length(str)-5;i++) {
    // take 4-character substring starting at position i
    subs2 = substring(str, i, 4);
    if(has_space(subs2)) {
        // widen to 5 characters and drop the space to get a 4-gram of letters
        subs = substring(str, i, 5);
        delete_space(subs);
        yes_space[subs][position(space, subs2)]++;
    } else {
        subs = subs2;
        no_space[subs]++;
    }
}
This will build you the tables which will help to decide whether a given 4-gram would need to have a space in it inserted or not.
Then, take your string to split, I denote it as xstr, and do:
for(i=0;i<length(xstr)-5;i++) {
    subs = substring(xstr, i, 4);
    for(j=0;j<4;j++) {
        do_insert_space_here[i+j] -= no_space[subs];
    }
    for(j=0;j<4;j++) {
        do_insert_space_here[i+j] += yes_space[subs][j];
    }
}
Then you can walk the "do_insert_space_here[]" array - if an element at a given position is bigger than 0, then you should insert a space in that position in the original string. If it's less than zero, then you shouldn't.
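The two passes above could be sketched in Python as follows. Like the original, this is untested against real text. The names train and split_words are invented, the loops run to the end of the string rather than stopping 5 characters early, and windows that are cut off by the end of the text or contain more than one space are skipped; both are assumptions of this sketch.

```python
def train(texts, n=4):
    """Build the no_space / yes_space tables from texts that contain spaces."""
    no_space = {}     # n-gram -> times seen with no space inside
    yes_space = {}    # n-gram -> per-position counts of a space before that letter
    for text in texts:
        for i in range(len(text) - n + 1):
            window = text[i:i + n]
            if " " not in window:
                no_space[window] = no_space.get(window, 0) + 1
                continue
            wide = text[i:i + n + 1]
            # Assumption: ignore truncated windows and windows with several spaces.
            if len(wide) != n + 1 or wide.count(" ") != 1:
                continue
            pos = wide.index(" ")           # letters of the n-gram before the space
            gram = wide.replace(" ", "")
            yes_space.setdefault(gram, [0] * n)[pos] += 1
    return no_space, yes_space

def split_words(xstr, tables, n=4):
    """Vote for or against a space before each character, then insert spaces."""
    no_space, yes_space = tables
    score = [0] * len(xstr)
    for i in range(len(xstr) - n + 1):
        gram = xstr[i:i + n]
        penalty = no_space.get(gram, 0)
        votes = yes_space.get(gram, [0] * n)
        for j in range(n):
            score[i + j] += votes[j] - penalty
    out = []
    for idx, ch in enumerate(xstr):
        if idx > 0 and score[idx] > 0:      # positive score: space goes before ch
            out.append(" ")
        out.append(ch)
    return "".join(out)

tables = train(["ab cd"])
print(split_words("abcd", tables))
```

The toy training text only demonstrates the mechanics; useful tables would need a large corpus, as the answer says.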
Please drop a note here if you try it (or something of this sort) and it works (or does not work) for you :-)

Resources