All anagrams in a File - algorithm

Source : Microsoft Interview Question
We are given a file containing words. We need to find all the anagrams present in it.
Can someone suggest the most efficient algorithm for this?
The only way I know is to sort all the words and then check them.

It would be good to know more about the data before suggesting an algorithm, but let's just assume that the words are in English and in a single case.
Let's assign each letter a prime number from 2 to 101. For each word we can compute its "anagram number" by multiplying the numbers corresponding to its letters. Because prime factorization is unique, two words get the same number exactly when they are anagrams.
Let's declare a dictionary of {number, list} pairs, and one list to collect the resulting anagrams into.
Then we can collect anagrams in two steps: first traverse the file and put each word into the dictionary's list for its "anagram number"; then traverse the map and, for every pair whose list has length greater than 1, store its contents in a single big anagram list.
UPDATE:
import operator
from functools import reduce  # reduce is not a builtin in Python 3

words = ["thore", "ganamar", "notanagram", "anagram", "other"]
letter_code = {'a':2, 'b':3, 'c':5, 'd':7, 'e':11, 'f':13, 'g':17, 'h':19, 'i':23, 'j':29, 'k':31, 'l':37, 'm':41, 'n':43,
               'o':47, 'p':53, 'q':59, 'r':61, 's':67, 't':71, 'u':73, 'v':79, 'w':83, 'x':89, 'y':97, 'z':101}

def evaluate(word):
    # Multiply the primes assigned to the letters; the initial 1 covers empty words
    return reduce(operator.mul, [letter_code[letter] for letter in word], 1)

anagram_map = {}
anagram_list = []
for word in words:
    anagram_number = evaluate(word)
    if anagram_number in anagram_map:
        anagram_map[anagram_number] += [word]
    else:
        anagram_map[anagram_number] = [word]
    # The second word of a group brings the first one along; later words are added alone
    if len(anagram_map[anagram_number]) == 2:
        anagram_list += anagram_map[anagram_number]
    elif len(anagram_map[anagram_number]) > 2:
        anagram_list += [word]
print(anagram_list)
Of course the implementation can be optimized further. For instance, you don't really need a map of anagrams; just counters would do fine. But I guess the code illustrates the idea best as it is.
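The counter-only variant mentioned above could look roughly like the following sketch. It assumes lowercase ASCII words and Python 3.8+ (for `math.prod`); the names `anagram_number` and `find_anagrams` are illustrative, not part of the original answer.

```python
from collections import Counter
from math import prod

# The first 26 primes, one per lowercase letter (same mapping as letter_code above).
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]

def anagram_number(word):
    # Product of the primes assigned to each letter; an empty word gives 1.
    return prod(PRIMES[ord(ch) - ord('a')] for ch in word)

def find_anagrams(words):
    # First pass: count how many words share each anagram number.
    counts = Counter(anagram_number(w) for w in words)
    # Second pass: keep the words whose number occurs more than once.
    return [w for w in words if counts[anagram_number(w)] > 1]

words = ["thore", "ganamar", "notanagram", "anagram", "other"]
print(find_anagrams(words))
```

This trades the per-group lists for a second pass over the words, which may matter if the file is too large to hold every group in memory.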

You can use tries. A trie (the name derives from "retrieval") is a multi-way search tree used for prefix matching. Its basic use is in spell-check programs, but I think it can help in your case.
Have a look at this link http://ww0.java4.datastructures.net/handouts/Tries.pdf
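The answer does not say how a trie would be applied to anagrams. One possible reading, sketched below under that assumption, is to insert each word under its sorted letters so that all anagrams end at the same node. The nested-dict layout, the `"$words"` marker key, and the function names are hypothetical choices for illustration.

```python
def build_anagram_trie(words):
    # Each trie level is a dict keyed by a single character.
    root = {}
    for word in words:
        node = root
        for ch in sorted(word):
            node = node.setdefault(ch, {})
        # "$words" cannot clash with a child key because child keys are single characters.
        node.setdefault("$words", []).append(word)
    return root

def collect_groups(node):
    # Walk the trie and return every list that holds two or more words.
    groups = []
    for key, child in node.items():
        if key == "$words":
            if len(child) > 1:
                groups.append(child)
        else:
            groups.extend(collect_groups(child))
    return groups

trie = build_anagram_trie(["thore", "other", "anagram", "ganamar", "x"])
print(collect_groups(trie))
```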

I just did this one not too long ago, in a different way:
split the file content into an array of words
create a HashMap that maps a key string to a linked list of strings
for each word in the array, sort the letters in the word and use that as the key to a linked list of anagrams
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedList;

public static void allAnagrams2(String s) {
    // Keep only lowercase letters and whitespace, then split on runs of whitespace
    String[] input = s.toLowerCase().replaceAll("[^a-z\\s]", "").split("\\s+");
    HashMap<String, LinkedList<String>> hm = new HashMap<String, LinkedList<String>>();
    for (int i = 0; i < input.length; i++) {
        String current = input[i];
        char[] chars = current.toCharArray();
        Arrays.sort(chars);
        String key = new String(chars);
        LinkedList<String> ll = hm.containsKey(key) ? hm.get(key) : new LinkedList<String>();
        ll.add(current);
        if (!hm.containsKey(key))
            hm.put(key, ll);
    }
    // hm now maps each sorted key to the list of words that are anagrams of each other
}

A slightly different approach from the one above, returning a set of anagrams instead.
public static HashSet<String> anagrams(String[] list) {
    HashMap<String, String> hm = new HashMap<String, String>(); // sorted key -> first word seen
    HashSet<String> anagrams = new HashSet<String>();
    for (int i = 0; i < list.length; i++) {
        char[] chars = list[i].toCharArray();
        Arrays.sort(chars);
        String k = new String(chars); // chars.toString() would not return the characters
        if (hm.containsKey(k)) {
            anagrams.add(list[i]);
            anagrams.add(hm.get(k));
        } else {
            hm.put(k, list[i]);
        }
    }
    return anagrams;
}

Related

What's the most efficient way of filtering a string with numbers at the end (e.g. foo12)?

Here's a quiz I came up with that is very similar to a real-life problem I'm facing.
Say I have a list of strings (call it stringlist), and some of them have two-digit numbers attached at the end. For example, "foo", "foo01", "foo24".
I want to group those with the same letters (but with different two digit numbers at the end).
So, "foo", "foo01", and "foo24" would be under the group "foo".
However, I can't just check for any string that begins with "foo", because we can also have "food", "food08", "food42".
There are no duplicates.
It is possible to have numbers in the middle, e.g. "foo543food43" is under group "foo543food".
Or even multiple numbers at the end, e.g. "foo1234" is under group "foo12".
The most obvious solution I can think of is having a list of numbers.
numbers = ["0", "1", "2", ... "9"]
Then, I would do
grouplist = [[]] //Of the form: [[group_name1, word_index1, word_index2, ...], [group_name2, ...]]
for(word_index=0; word_index < len(stringlist); word_index++) //loop through stringlist
    for(char_index=0; char_index < len(stringlist[word_index]); char_index++) //loop through the word
        if(char_index == len(stringlist[word_index])-1) //Reached the end
            for(number1 in numbers)
                if(stringlist[word_index][char_index] == number1) //Found a number at the end
                    for(number2 in numbers)
                        if(stringlist[word_index][char_index-1] == number2) //Found another number one before the end
                            group_name = stringlist[word_index].substring(0,char_index-1)
                            for(group_element in grouplist)
                                if(group_element[0] == group_name) //Does that group name exist already? If so, add the index to the end. If not, add the group name and the index.
                                    group_element.append(word_index)
                                else
                                    grouplist.append([group_name, word_index])
                            break //If you found the second number, stop looping through numbers
                    break //If you found the first number, stop looping through numbers
Now this looks messy as hell. Any cleaner way you guys can think of?
Any of the data structures, including the one for the final result, can be whatever you want.
I would create a map that maps each group name to a list of all strings in the corresponding group.
Here is my approach in Java:
public Map<String, List<String>> createGroupMap(List<String> listOfAllStrings) {
    Map<String, List<String>> result = new HashMap<>();
    for (String s : listOfAllStrings) {
        addToMap(result, s);
    }
    return result;
}
private void addToMap(Map<String, List<String>> map, String s) {
    String group = getGroupName(s);
    if (!map.containsKey(group))
        map.put(group, new ArrayList<String>());
    map.get(group).add(s);
}
private String getGroupName(String s) {
    // Strip exactly two trailing digits, so "foo1234" falls under "foo12"
    return s.replaceFirst("\\d{2}$", "");
}
Maybe you can gain some speed by avoiding the RegExp in getGroupName(..) but you need to profile it to be sure that an implementation without RegExp would be faster.
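A regex-free version of the group-name step could look like this sketch, assuming (as the question states) that the suffix is exactly two trailing ASCII digits. `group_name` and `group_strings` are illustrative names, and whether this is actually faster would still need profiling.

```python
from collections import defaultdict

def is_ascii_digit(ch):
    return "0" <= ch <= "9"

def group_name(s):
    # Drop the last two characters only when both are digits.
    if len(s) >= 2 and is_ascii_digit(s[-1]) and is_ascii_digit(s[-2]):
        return s[:-2]
    return s

def group_strings(strings):
    # Map each group name to the strings that belong to it, in input order.
    groups = defaultdict(list)
    for s in strings:
        groups[group_name(s)].append(s)
    return dict(groups)

print(group_strings(["foo", "foo01", "foo24", "food", "food08", "foo1234"]))
```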
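You can divide the string into two parts like this:
#include <cctype>
#include <string>
#include <utility>
using namespace std;

// Splits off up to two trailing digits; returns {prefix, numeric suffix}
pair<string, int> divide(string s) {
    int r = 0;
    if (!s.empty() && isdigit(static_cast<unsigned char>(s.back()))) {
        r = s.back() - '0';
        s.pop_back();
        if (!s.empty() && isdigit(static_cast<unsigned char>(s.back()))) {
            r += 10 * (s.back() - '0');
            s.pop_back();
        }
    }
    return {s, r};
}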

Shortest word in a dictionary containing all letters

I was asked this question in an interview.
Given an array of characters, find the shortest word in a dictionary that contains all the characters. Also, propose an implementation for the dictionary that would optimize this function call.
For example, given char[] chars = { 'R' , 'C' }, the result should be the word "CAR".
I could not come up with anything that would run reasonably quickly. I thought of pre-processing the dictionary by building a hash table to retrieve all words of a particular length. Then I could only think of retrieving all words in increasing order of length and checking whether the required characters were present in any of them (maybe by using a bitmask).
This is a common software interview question, and its solution is this: sort the dictionary itself by length and sort the letters of each entry alphabetically. When given the characters, sort them and search for the needed letters.
First sort the dictionary in ascending order of length.
For each letter, construct a bit map of the locations in the dictionary of the words containing that letter. Each bit map will be long, but there will not be many.
For each search, take the intersection of the bitmaps for the letters in the array. The first one bit in the result will be at the index corresponding to the location in the dictionary of the shortest word containing all the letters.
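The bitmap idea can be sketched with Python integers standing in for the per-letter bitmaps. This is only an illustration under some assumptions: matching is case-insensitive, the query letters are treated as a set (repeated letters are not counted), and the names `build_index` and `shortest_word` are made up here.

```python
def build_index(dictionary):
    # Sort by length so that a lower bit index always means a shorter (or equal) word.
    words = sorted(dictionary, key=len)
    bitmaps = {}
    for i, word in enumerate(words):
        for ch in set(word.lower()):
            bitmaps[ch] = bitmaps.get(ch, 0) | (1 << i)
    return words, bitmaps

def shortest_word(words, bitmaps, chars):
    # Start with every word as a candidate, then intersect one letter at a time.
    result = (1 << len(words)) - 1
    for ch in chars:
        result &= bitmaps.get(ch.lower(), 0)
    if result == 0:
        return None
    # The lowest set bit is the first (shortest) word containing all the letters.
    lowest = (result & -result).bit_length() - 1
    return words[lowest]

words, bitmaps = build_index(["carpet", "car", "cat", "rock"])
print(shortest_word(words, bitmaps, ['R', 'C']))
```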
The other answers are better, but I realized this is entirely precomputable.
For each word
    sort the letters and remove duplicates
    The sequence of letters can be viewed as a bitmask: A = bit 0, B = bit 1, ..., Z = bit 25. Set the bits of a mask A according to the letters in this word.
    For each combination of set bits in the mask A, make a subset mask B
        If there is already a word associated with this mask B
            and this word is shorter, replace the associated word with this one
            otherwise try next B
        If there is no word associated with mask B
            Associate this word with mask B.
This would take a huge amount of setup time, and the subsequent association storage would be in the vicinity of 1.7GB, but you'd be able to find the shortest word containing a superset of the letters in O(1) time guaranteed.
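A scaled-down sketch of this precomputation is below. It assumes words made only of the letters a to z and stores the associations in a Python dict rather than a full table with one slot per possible mask, so it illustrates the idea without the memory footprint described above. `letter_mask` and `precompute` are hypothetical names.

```python
def letter_mask(letters):
    # Bit 0 stands for 'a', bit 1 for 'b', ..., bit 25 for 'z'.
    mask = 0
    for ch in letters.lower():
        mask |= 1 << (ord(ch) - ord('a'))
    return mask

def precompute(dictionary):
    best = {}  # subset mask -> shortest word whose letters cover that mask
    for word in dictionary:
        mask = letter_mask(word)
        sub = mask
        while True:
            current = best.get(sub)
            if current is None or len(word) < len(current):
                best[sub] = word
            if sub == 0:
                break
            # Standard trick for stepping through every submask of mask.
            sub = (sub - 1) & mask
    return best

best = precompute(["carpet", "car", "rock"])
print(best.get(letter_mask("rc")))
```

A lookup is then a single dictionary access on the mask of the query letters. Note that a word with k distinct letters contributes 2^k submasks, which is where the large setup cost comes from.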
The obvious preprocessing is to sort all words in the dictionary by their length and alphabetical re-ordering: "word" under "dorw", for example. Then you can use general search algorithms (e.g., regex) to search for the letters you need. An efficient (DFA) search requires only one pass over the dictionary in the worst case, and much less if the first match is short.
Here is a solution in C#:
using System.Collections.Generic;
using System.Linq;

public class ShortestWordFinder
{
    public ShortestWordFinder(IEnumerable<string> dictionary)
    {
        this.dictionary = dictionary;
    }

    public string ShortestWordContaining(IEnumerable<char> chars)
    {
        var wordsContaining = dictionary.Where(s =>
        {
            foreach (var c in chars)
            {
                if (!s.Contains(c))
                {
                    return false;
                }
                s = s.Remove(s.IndexOf(c), 1);
            }
            return true;
        }).ToList();
        if (!wordsContaining.Any())
        {
            return null;
        }
        var minLength = wordsContaining.Min(word => word.Length);
        return wordsContaining.First(word => word.Length == minLength);
    }

    private readonly IEnumerable<string> dictionary;
}
Simple test:
using System.Diagnostics;
using Xunit;

public class ShortestWordFinderTests
{
    [Fact]
    public void Works()
    {
        var words = new[] { "dog", "moose", "gargoyle" };
        var finder = new ShortestWordFinder(words);
        Trace.WriteLine(finder.ShortestWordContaining("o"));
        Trace.WriteLine(finder.ShortestWordContaining("oo"));
        Trace.WriteLine(finder.ShortestWordContaining("oy"));
        Trace.WriteLine(finder.ShortestWordContaining("eyg"));
        Trace.WriteLine(finder.ShortestWordContaining("go"));
        Assert.Null(finder.ShortestWordContaining("ooo"));
    }
}
Preprocessing
a. Sort the letters of each word into an alphabetic char array. Retain the mapping from sorted to original word.
b. Split the dictionary by word length, as you suggest.
c. Sort the entries in each word-length set alphabetically.
On function call
1. Sort the char array alphabetically.
2. Start with the group of words of the same length as the array.
3. Loop through the entries, testing for the characters, until the first letter of an entry is lexicographically greater than the first in your char array, then break. If there is a match, return the original word (see a above for the mapping).
4. Go back to step 2 for the next longest word group.
Interesting extension: multiple words might map to the same entry in a. Which one(s) should you return?

Search String By SubWords

What kind of algorithms and data structures would help me do this?
I have a file of about 10,000 lines loaded in memory in an ordered set. Given a search string, I want to get all the lines that contain words prefixed by the words found in the search string. Let me give an example to clarify:
Lines:
"A brow Fox flies."
"Boxes are full of food."
"Cats runs slow"
"Dogs hates eagles"
"Dolphins have eyes and teath"
Case 1:
search string = "fl b a"
"A brow Fox flies."
Explanation: the search string has three words, "fl", "b", and "a", and
the only line that has words prefixed by each of the words
from the search string is line 1.
Case 2:
search string = "e do ha"
"Dogs hates eagles", "Dolphins have eyes and teath"
Solution
(Fast enough for me: it took about 30 ms, including sorting the final result, on my PC for a set of 10k lines with 3 words per line.)
I used a trie, as suggested in the answer,
and some other hacky methods to filter out duplicate and false-positive results (mainly hash sets).
I think what you're probably wanting is a trie. Construct one for the set of all words in your document, and have each leaf point to a hashset containing the indices of the lines in which the key of the leaf appears.
To execute a search, you'd use each fragment of the search string to navigate to a node in the tree and take the union over the hashsets of all leaves in that node's subtree. Then you'd take the intersection of those unions over the set of fragments to get the list of lines satisfying the search string.
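A minimal sketch of this search is below. It takes one liberty that should be stated plainly: instead of storing line indices only at the leaves and taking the union over a subtree at query time, it stores at every node the set of lines whose words pass through that node, which precomputes those unions at the cost of memory. It also does not check that different fragments match different words. `TrieNode`, `build_trie`, and `search` are illustrative names.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.lines = set()  # indices of lines with a word that has this prefix

def build_trie(lines):
    root = TrieNode()
    for idx, line in enumerate(lines):
        for word in line.lower().split():
            node = root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
                node.lines.add(idx)
    return root

def search(root, lines, query):
    result = None
    for prefix in query.lower().split():
        node = root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []  # no word anywhere starts with this fragment
        # Intersect the line sets of all fragments.
        result = node.lines if result is None else result & node.lines
    return [lines[i] for i in sorted(result or ())]

lines = [
    "A brow Fox flies.",
    "Boxes are full of food.",
    "Cats runs slow",
    "Dogs hates eagles",
    "Dolphins have eyes and teath",
]
root = build_trie(lines)
print(search(root, lines, "fl b a"))
print(search(root, lines, "e do ha"))
```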
Here is my 2 cents:
class DicVal
{
public int OriginalValue;
public int CurrentValue;
public int LineNumber;
}
private static void Main()
{
var a = "A brow Fox flies.\r\n" +
"Boxes are full of food.\r\n" +
"Cats runs slow\r\n" +
"Dogs hates eagles\r\n" +
"A brow Fox flies. AA AB AC\r\n" +
"Dolphins have eyes and teath";
var lines = a.Split(new[] {Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries);
var dic = new Dictionary<string, DicVal>
{
{"fl", new DicVal { OriginalValue = 1, LineNumber = -1}},
{"b", new DicVal { OriginalValue = 1, LineNumber = -1}},
{"a", new DicVal { OriginalValue = 4, LineNumber = -1}}
};
var globalCount = dic.Sum(x => x.Value.OriginalValue);
var lineNumber = 0;
foreach(var line in lines)
{
var words = line.Split(' ');
var currentCount = globalCount;
foreach (var word in words.Select(x => x.ToLower()))
{
for (var i = 1; i <= word.Length; i++)
{
var substr = word.Substring(0, i);
if (dic.ContainsKey(substr))
{
if (dic[substr].LineNumber != lineNumber)
{
dic[substr].CurrentValue = dic[substr].OriginalValue;
dic[substr].LineNumber = lineNumber;
}
if (dic[substr].CurrentValue > 0)
{
currentCount--;
dic[substr].CurrentValue--;
}
}
}
}
if(currentCount == 0)
Console.WriteLine(line);
lineNumber++;
}
}
Not going to explain much, as code is the best documentation :P.
Output: A brow Fox flies. AA AB AC
Assuming you implement everything efficiently, the running time will be as good as possible, since you need to read every word at least ONCE.
Further optimization is possible by applying threading. You can look into the parallel aggregation pattern, as this problem can be parallelized easily.
Here's a fairly simple implementation that should be appropriate for your use case. The idea is that you can store all combinations of short prefixes for each line (and for each query) since you only have 10,000 lines and assuming each line doesn't contain too many words. Now look up each hash generated for the query string. For each hash match, we then check for an exact match. For my example code I consider only prefixes of length 1, however you could repeat this approach for prefixes of length 2 & 3 provided the prefixes in your query have those lengths too.
__author__ = 'www.google.com/+robertking'
from itertools import combinations
from collections import defaultdict

lines = [
    "A brow Fox flies.",
    "Boxes are full of food.",
    "Cats runs slow",
    "Dogs hates eagles",
    "Dolphins have eyes and teath"
]
lines = [line.lower() for line in lines]

def short_prefixes(line):
    for word in line.split():
        yield word[:1]

def get_hashes(line):
    starts = list(short_prefixes(line))
    for prefixes_in_hash in range(1, min(4, len(starts))):
        for hash_group in combinations(starts, r=prefixes_in_hash):
            yield tuple(sorted(hash_group))

def get_hash_map():
    possible_matches = defaultdict(list)
    for line_pos, line in enumerate(lines):
        for hash in get_hashes(line):
            possible_matches[hash].append(line_pos)
    return possible_matches

possible_matches = get_hash_map()

def ok(line, q):
    return all(line.startswith(prefix) or ((" " + prefix) in line) for prefix in q)

def query(search_string):
    search_string = search_string.lower()
    q = search_string.split()
    hashes = set(get_hashes(search_string))
    possible_lines = set()
    for hash in hashes:
        for line_pos in possible_matches[hash]:
            possible_lines.add(line_pos)
    for line_pos in possible_lines:
        if ok(lines[line_pos], q):
            yield lines[line_pos]

print(list(query("fl b a")))
#['a brow fox flies.']

Google search suggestion implementation

In a recent amazon interview I was asked to implement Google "suggestion" feature. When a user enters "Aeniffer Aninston", Google suggests "Did you mean Jeniffer Aninston". I tried to solve it by using hashing but could not cover the corner cases. Please let me know your thought on this.
There are four common types of errors:
Omitted letter: "stck" instead of "stack"
One letter typo: "styck" instead of "stack"
Extra letter: "starck" instead of "stack"
Adjacent letters swapped: "satck" instead of "stack"
By the way, we could swap any two letters rather than only adjacent ones, but that is not a common typo.
The initial state is the typed word. Run BFS/DFS from this initial vertex, applying the edits above at each step. The depth of the search is your own choice. Remember that increasing the depth dramatically increases the number of "probable corrections". I think a depth of about 4-5 is a good start.
After generating the "probable corrections", look up each generated candidate in a dictionary: either binary search in a sorted dictionary, or search in a trie populated with your dictionary.
A trie is faster, but binary search allows searching in a random access file without loading the dictionary into RAM. You only have to load a precomputed integer array, where array[i] gives the number of bytes to skip to reach the i-th word. The words in the random access file should be written in sorted order. If you have enough RAM to store the dictionary, use a trie.
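The offset-array lookup can be sketched as follows. To keep the example self-contained it uses an in-memory `io.BytesIO` in place of a real random access file, and assumes one UTF-8 word per line, sorted with Python's default string ordering. `build_offsets`, `read_word`, and `contains` are hypothetical helper names.

```python
import io

def build_offsets(sorted_words):
    # Write the words one per line and record the byte offset where each starts.
    data = io.BytesIO()
    offsets = []
    for word in sorted_words:
        offsets.append(data.tell())
        data.write(word.encode("utf-8") + b"\n")
    return data, offsets

def read_word(f, offsets, i):
    # Seek straight to the i-th word; nothing else is read from the file.
    f.seek(offsets[i])
    return f.readline().rstrip(b"\n").decode("utf-8")

def contains(f, offsets, word):
    # Plain binary search where each probe is one seek plus one short read.
    lo, hi = 0, len(offsets) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        candidate = read_word(f, offsets, mid)
        if candidate == word:
            return True
        if candidate < word:
            lo = mid + 1
        else:
            hi = mid - 1
    return False

f, offsets = build_offsets(sorted(["stack", "star", "stock", "apple"]))
print(contains(f, offsets, "stack"), contains(f, offsets, "stick"))
```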
Before suggesting corrections, check the typed word: if it is in the dictionary, suggest nothing.
UPDATE
Generating corrections should be done by BFS. When I tried DFS, entries like "Jeniffer" showed "edit distance = 3". DFS doesn't work, since it makes a lot of changes that could be done in one step, for example Jniffer->nJiffer->enJiffer->eJniffer->Jeniffer instead of Jniffer->Jeniffer.
Sample code for generating corrections by BFS
static class Pair
{
private String word;
private byte dist;
// dist is byte because dist<=128.
// Moreover, dist<=6 in real application
public Pair(String word,byte dist)
{
this.word = word;
this.dist = dist;
}
public String getWord()
{
return word;
}
public int getDist()
{
return dist;
}
}
public static void main(String[] args) throws Exception
{
HashSet<String> usedWords;
HashSet<String> dict;
ArrayList<String> corrections;
ArrayDeque<Pair> states;
usedWords = new HashSet<String>();
corrections = new ArrayList<String>();
dict = new HashSet<String>();
states = new ArrayDeque<Pair>();
// populate dictionary. In real usage should be populated from prepared file.
dict.add("Jeniffer");
dict.add("Jeniffert"); //depth 2 test
usedWords.add("Jniffer");
states.add(new Pair("Jniffer", (byte)0));
while(!states.isEmpty())
{
Pair head = states.pollFirst();
//System.out.println(head.getWord()+" "+head.getDist());
if(head.getDist()<=2)
{
// checking reached depth.
//4 is the first depth where we don't generate anything
// swap adjacent letters
for(int i=0;i<head.getWord().length()-1;i++)
{
// swap i-th and i+1-th letters
String newWord = head.getWord().substring(0,i)+head.getWord().charAt(i+1)+head.getWord().charAt(i)+head.getWord().substring(i+2);
// even if i==curWord.length()-2 and then i+2==curWord.length
//substring(i+2) doesn't throw exception and returns empty string
// the same for substring(0,i) when i==0
if(!usedWords.contains(newWord))
{
usedWords.add(newWord);
if(dict.contains(newWord))
{
corrections.add(newWord);
}
states.addLast(new Pair(newWord, (byte)(head.getDist()+1)));
}
}
// insert letters
for(int i=0;i<=head.getWord().length();i++)
for(char ch='a';ch<='z';ch++)
{
String newWord = head.getWord().substring(0,i)+ch+head.getWord().substring(i);
if(!usedWords.contains(newWord))
{
usedWords.add(newWord);
if(dict.contains(newWord))
{
corrections.add(newWord);
}
states.addLast(new Pair(newWord, (byte)(head.getDist()+1)));
}
}
}
}
for(String correction:corrections)
{
System.out.println("Did you mean "+correction+"?");
}
usedWords.clear();
corrections.clear();
// helper data structures must be cleared after each generateCorrections call - must be empty for the future usage.
}
Words in the dictionary: Jeniffer, Jeniffert (Jeniffert is just for testing).
Output:
Did you mean Jeniffer?
Did you mean Jeniffert?
Important!
I chose a generation depth of 2. In a real application the depth should be 4-6, but since the number of combinations grows exponentially, I didn't go that deep. There are some optimizations that reduce the number of branches in the search tree, but I haven't thought much about them. I wrote only the main idea.
Also, I used a HashSet both for storing the dictionary and for marking used words. It seems HashSet's constant factor is too large when it contains a million objects. Maybe you should use a trie both for checking whether a word is in the dictionary and for checking whether a word has already been marked.
I didn't implement the erase-letter and change-letter operations because I wanted to show only the main idea.
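For completeness, the two omitted operations (erasing a letter and changing a letter) might be generated along these lines. This is a separate Python sketch rather than an extension of the Java code above; it assumes a lowercase alphabet, and `neighbours` and `corrections` are illustrative names.

```python
import string

def neighbours(word, alphabet=string.ascii_lowercase):
    # All words reachable by erasing one letter or changing one letter.
    results = set()
    for i in range(len(word)):
        results.add(word[:i] + word[i + 1:])  # erase the i-th letter
        for ch in alphabet:
            if ch != word[i]:
                results.add(word[:i] + ch + word[i + 1:])  # change the i-th letter
    return results

def corrections(word, dictionary, max_depth=2):
    # Level-by-level BFS, so each candidate is reached at its smallest edit count.
    seen = {word}
    frontier = [word]
    found = []
    for _ in range(max_depth):
        next_frontier = []
        for current in frontier:
            for candidate in sorted(neighbours(current)):
                if candidate not in seen:
                    seen.add(candidate)
                    if candidate in dictionary:
                        found.append(candidate)
                    next_frontier.append(candidate)
        frontier = next_frontier
    return found

print(corrections("styck", {"stack"}, max_depth=1))
print(corrections("starck", {"stack"}, max_depth=1))
```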

Calculate each Word Occurrence in large document

I was wondering which data structure I could use to solve this problem. Can anyone explain this in detail? I was thinking of using a tree.
There is a large document that contains millions of words. How would you calculate the occurrence count of each word in an optimal way?
This question was asked at Microsoft. Any suggestions will be appreciated.
I'd just use a hash map (or Dictionary, since this is Microsoft ;) ) of strings to integers. For each word of the input, either add it to the dictionary if it's new, or increment its count otherwise. O(n) over the length of the input, assuming the hash map implementation is decent.
Using a dictionary or hash set will result in O(n) on average.
To solve it in O(n) worst case, a trie with a small change should be used:
add a counter to each word's representation in the trie; each time a word that is inserted already exists, increment its counter.
If you want to print all the counts at the end, you can keep the counters in a separate list and reference it from the trie instead of storing the counter in the trie.
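A counting trie along these lines could be sketched as below. One caveat on the worst-case claim: the children here are kept in Python dicts, which are themselves hash tables, so a strict worst-case bound would need a fixed-size child array per node instead. `Node`, `count_words`, `lookup`, and `all_counts` are illustrative names.

```python
class Node:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0  # number of times a word ending at this node was inserted

def count_words(words):
    root = Node()
    for word in words:
        node = root
        for ch in word:
            child = node.children.get(ch)
            if child is None:
                child = node.children[ch] = Node()
            node = child
        node.count += 1
    return root

def lookup(root, word):
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.count

def all_counts(root):
    # Iterative depth-first walk collecting every word with a non-zero count.
    counts = {}
    stack = [("", root)]
    while stack:
        prefix, node = stack.pop()
        if node.count:
            counts[prefix] = node.count
        for ch, child in node.children.items():
            stack.append((prefix + ch, child))
    return counts

root = count_words("a b a ab a".split())
print(all_counts(root))
```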
class IntValue
{
    public IntValue(int value)
    {
        Value = value;
    }
    public int Value;
}

static void Main(string[] args)
{
    // assuming document is an enumerator over the words in the document:
    Dictionary<string, IntValue> dict = new Dictionary<string, IntValue>();
    foreach (string word in document)
    {
        IntValue intValue;
        if (!dict.TryGetValue(word, out intValue))
        {
            intValue = new IntValue(0);
            dict.Add(word, intValue);
        }
        ++intValue.Value;
    }
    // now dict contains the counts
}
Tree would not work here.
Hashtable<String, Integer> ht = new Hashtable<String, Integer>();
// Read each word in the text in order; for each of them:
if (ht.containsKey(oneWord)) // contains() would test the values, not the keys
{
    Integer count = ht.get(oneWord);
    ht.put(oneWord, count + 1);
}
else
{
    ht.put(oneWord, 1);
}

Resources