Using primes to determine anagrams faster than looping through? - performance

I had a telephone recently for a SE role and was asked how I'd determine if two words were anagrams or not, I gave a reply that involved something along the lines of getting the character, iterating over the word, if it exists exit loop and so on. I think it was a N^2 solution as one loop per word with an inner loop for the comparing.
After the call I did some digging and wrote a new solution; one that I plan on handing over tomorrow at the next stage interview, it uses a hash map with a unique prime number representing each character of the alphabet.
I'm then looping through the list of words, calculating the value of the word and checking to see if it compares with the word I'm checking. If the values match we have a winner (the whole mathematical theorem business).
It means one loop instead of two which is much better but I've started to doubt myself and am wondering if the additional operations of the hashmap and multiplication are more expensive than the original suggestion.
I'm 99% certain the hash map is going to be faster but...
Can anyone confirm or deny my suspicions? Thank you.
Edit: I forgot to mention that I check the size of the words first before even considering doing anything.

An anagram contains all the letters of the original word, in a different order. You are on the right track to use a HashMap to process a word in linear time, but your prime number idea is an unnecessary complication.
Your data structure is a HashMap that maintains the counts of various letters. You can add letters from the first word in O(n) time. The key is the character, and the value is the frequency. If the letter isn't in the HashMap yet, put it with a value of 1. If it is, replace it with value + 1.
When iterating over the letters of the second word, subtract one from your count instead, removing a letter when it reaches 0. If you attempt to remove a letter that doesn't exist, then you can immediately state that it's not an anagram. If you reach the end and the HashMap isn't empty, it's not an anagram. Else, it's an anagram.
Alternatively, you can replace the HashMap with an array. The index of the array corresponds to the character, and the value is the same as before. It's not an anagram if a value drops to -1, and it's not an anagram at the end if any of the values aren't 0.
You can always compare the lengths of the original strings, and if they aren't the same, then they can't possibly be anagrams. Including this check at the beginning means that you don't have to check if all the values are 0 at the end. If the strings are the same length, then either something will produce a -1 or there will be all 0s at the end.

The problem with multiplying is that the numbers can get big. For example, if letter 'c' was 11, then a word with 10 c's would overflow a 32bit integer.
You could reduce the result modulo some other number, but then you risk having false positives.
If you use big integers, then it will go slowly for long words.
Alternative solutions are to sort the two words and then compare for equality, or to use a histogram of letter counts as suggested by chrylis in the comments.
The idea is to have an array initialized to zero containing the number of times each letter appears.
Go through the letters in the first word, incrementing the count for each letter. Then go through the letters in the second word, decrementing the count.
If the counts reach zero at the end of this process, then the words are anagrams.

Related

Look for a data structure to match words by letters

Given a list of lowercase radom words, each word with same length, and many patterns each with some letters at some positions are specified while other letters are unknown, find out all words that matches each pattern.
For example, words list is:
["ixlwnb","ivknmt","vvqnbl","qvhntl"]
And patterns are:
i-----
-v---l
-v-n-l
With a naive algorithm, one can do an O(NL) travel for each pattern, where N is the words count and L is the word length.
But since there may be a lot of patterns travel on the same words list, is there any good data structure to preprocess and store the words list, then give a sufficient matching for all patterns?
One simple idea is to use an inverted index. First, number your words -- you'll refer to them using these indices rather than the words themselves for speed and space efficiency. Probably the index fits in a 32-bit int.
Now your inverted index: for each letter in each position, construct a sorted list of IDs for words that have that letter in that location.
To do your search, you take the lists of IDs for each of the letters in the positions you're given, and take the intersection of the lists, using a an algorithm like the "merge" in merge-sort. All IDs in the intersection match the input.
Alternatively, if your words are short enough (12 characters or fewer), you could compress them into 64 bit words (using 5 bits per letter, with letters 1-26). Construct a bit-mask with binary 11111 in places where you have a letter, and 00000 in places where you have a blank. And a bit-test from your input with the 5-bit code for each letter in each place, using 00000 where you have blanks. For example, if your input is a-c then your bitmask will be binary 111110000011111 and your bittest binary 000010000000011. Go through your word-list, take this bitwise and of each word with the bit-mask and test to see if it's equal to the bit-test value. This is cache friendly and the inner loop is tight, so may be competitive with algorithms that look like they should be faster on paper.
I'll preface this with it's more of a comment and less of an answer (I don't have enough reputation to comment though). I can't think of any data structure that will satisfy the requirements of of the box. It was interesting to think about, and figured I'd share one potential solution that popped into my head.
I keyed in on the "same length" part, and figured I could come up with something based on that.
In theory we could have N(N being the length) maps of char -> set.
When strings are added, it goes through each character and adds the string to the corresponding set. psuedocode:
firstCharMap[s[0]].insert(s);
secondCharMap[s[1]].insert(s);
thirdCharMap[s[2]].insert(s);
fourthCharMap[s[3]].insert(s);
fifthCharMap[s[4]].insert(s);
sixthCharMap[s[5]].insert(s);
Then to determine which strings match the pattern, we take just do an intersection of the sets ex: "-v-n-l" would be:
intersection of sets: secondCharMap[v], fourthCharMap[n], sixthCharMap[l]
One edge case that jumps out is if I wanted to just get all of the strings, so if that's a requirement--we may also need an additional set of all of the strings.
This solution feels clunky, but I think it could work. Depending on the language, number of strings, etc--I wouldn't be surprised if it performed worse than just iterating over all strings and checking a predicate.

scrabble solving with maximum score

I was asked a question
You are given a list of characters, a score associated with each character and a dictionary of valid words ( say normal English dictionary ). you have to form a word out of the character list such that the score is maximum and the word is valid.
I could think of a solution involving a trie made out of dictionary and backtracking with available characters, but could not formulate properly. Does anyone know the correct approach or come up with one?
First iterate over your letters and count how many times do you have each of the characters in the English alphabet. Store this in a static, say a char array of size 26 where first cell corresponds to a second to b and so on. Name this original array cnt. Now iterate over all words and for each word form a similar array of size 26. For each of the cells in this array check if you have at least as many occurrences in cnt. If that is the case, you can write the word otherwise you can't. If you can write the word you compute its score and maximize the score in a helper variable.
This approach will have linear complexity and this is also the best asymptotic complexity you can possibly have(after all the input you're given is of linear size).
Inspired by Programmer Person's answer (initially I thought that approach was O(n!) so I discarded it). It needs O(nr of words) setup and then O(2^(chars in query)) for each question. This is exponential, but in Scrabble you only have 7 letter tiles at a time; so you need to check only 128 possibilities!
First observation is that the order of characters in query or word doesn't matter, so you want to process your list into a set of bag of chars. A way to do that is to 'sort' the word so "bac", "cab" become "abc".
Now you take your query, and iterate all possible answers. All variants of keep/discard for each letter. It's easier to see in binary form: 1111 to keep all, 1110 to discard the last letter...
Then check if each possibility exists in your dictionary (hash map for simplicity), then return the one with the maximum score.
import nltk
from string import ascii_lowercase
from itertools import product
scores = {c:s for s, c in enumerate(ascii_lowercase)}
sanitize = lambda w: "".join(c for c in w.lower() if c in scores)
anagram = lambda w: "".join(sorted(w))
anagrams = {anagram(sanitize(w)):w for w in nltk.corpus.words.words()}
while True:
query = input("What do you have?")
if not query: break
# make it look like our preprocessed word list
query = anagram(sanitize(query))
results = {}
# all variants for our query
for mask in product((True, False), repeat=len(query)):
# get the variant given the mask
masked = "".join(c for i, c in enumerate(query) if mask[i])
# check if it's valid
if masked in anagrams:
# score it, also getting the word back would be nice
results[sum(scores[c] for c in masked)] = anagrams[masked]
print(*max(results.items()))
Build a lookup trie of just the sorted-anagram of each word of the dictionary. This is a one time cost.
By sorted anagram I mean: if the word is eat you represent it as aet. It the word is tea, you represent it as aet, bubble is represent as bbbelu etc
Since this is scrabble, assuming you have 8 tiles (say you want to use one from the board), you will need to maximum check 2^8 possibilities.
For any subset of the tiles from the set of 8, you sort the tiles, and lookup in the anagram trie.
There are at most 2^8 such subsets, and this could potentially be optimized (in case of repeating tiles) by doing a more clever subset generation.
If this is a more general problem, where 2^{number of tiles} could be much higher than the total number of anagram-words in the dictionary, it might be better to use frequency counts as in Ivaylo's answer, and the lookups potentially can be optimized using multi-dimensional range queries. (In this case 26 dimensions!)
Sorry, this might not help you as-is (I presume you are trying to do some exercise and have constraints), but I hope this will help the future readers who don't have those constraints.
If the number of dictionary entries is relatively small (up to a few million) you can use brute force: For each word, create a 32 bit mask. Preprocess the data: Set one bit if the letter a/b/c/.../z is used. For the six most common English characters etaoin set another bit if the letter is used twice.
Create a similar bitmap for the letters that you have. Then scan the dictionary for words where all bits that are needed for the word are set in the bitmap for the available letters. You have reduced the problem to words where you have all needed characters once, and the six most common characters twice if the are needed twice. You'll still have to check if a word can be formed in case you have a word like "bubble" and the first test only tells you that you have letters b,u,l,e but not necessarily 3 b's.
By also sorting the list of words by point values before doing the check, the first hit is the best one. This has another advantage: You can count the points that you have, and don't bother checking words with more points. For example, bubble has 12 points. If you have only 11 points, then there is no need to check this word at all (have a small table with the indexes of the first word with any given number of points).
To improve anagrams: In the table, only store different bitmasks with equal number of points (so we would have entries for bubble and blue because they have different point values, but not for team and mate). Then store all the possible words, possibly more than one, for each bit mask and check them all. This should reduce the number of bit masks to check.
Here is a brute force approach in python, using an english dictionary containing 58,109 words. This approach is actually quite fast timing at about .3 seconds on each run.
from random import shuffle
from string import ascii_lowercase
import time
def getValue(word):
return sum(map( lambda x: key[x], word))
if __name__ == '__main__':
v = range(26)
shuffle(v)
key = dict(zip(list(ascii_lowercase), v))
with open("/Users/james_gaddis/PycharmProjects/Unpack Sentance/hard/words.txt", 'r') as f:
wordDict = f.read().splitlines()
f.close()
valued = map(lambda x: (getValue(x), x), wordDict)
print max(valued)
Here is the dictionary I used, with one hyphenated entry removed for convenience.
Can we assume that the dictionary is fixed and the score are fixed and that only the letters available will change (as in scrabble) ? Otherwise, I think there is no better than looking up each word of the dictionnary as previously suggested.
So let's assume that we are in this setting. Pick an order < that respects the costs of letters. For instance Q > Z > J > X > K > .. > A >E >I .. > U.
Replace your dictionary D with a dictionary D' made of the anagrams of the words of D with letters ordered by the previous order (so the word buzz is mapped to zzbu, for instance), and also removing duplicates and words of length > 8 if you have at most 8 letters in your game.
Then construct a trie with the words of D' where the children nodes are ordered by the value of their letters (so the first child of the root would be Q, the second Z, .., the last child one U). On each node of the trie, also store the maximal value of a word going through this node.
Given a set of available characters, you can explore the trie in a depth first manner, going from left to right, and keeping in memory the current best value found. Only explore branches whose node's value is larger than you current best value. This way, you will explore only a few branches after the first ones (for instance, if you have a Z in your game, exploring any branch that start with a one point letter as A is discarded, because it will score at most 8x1 which is less than the value of Z). I bet that you will explore only a very few branches each time.

Finding duplicate digits in integers

Lets say we have an integer array of N elements which consists of integers between 0 and 10000. We need to detect the numbers including a digit more than once e.g 1245 is valid while 1214 is not. How can we do this optimally? Thanks!
You need two loops. One loop you scan each element of the array.
In the inner loop, you determine if for the given element it's valid or not based on the criteria you indicated. To determine if a number has the same digit more than once, you need a routine that effectively extracts each digit one by one. I think the most optimal way to do that is to do "mod 10" on the number, then loop dividing the original by 10. keep doing that until you don't have number left (zero). Now that you have a routine for looking at each digit of an integer, the way to determine if there are duplicate digits the most optimally is to create an array of 10 booleans. Start with a cleared array. For every digit, use it as an index into the bool array and set it to true. If you see "true" again in that spot before you set it, that means that element in the bool array was visited before, thus it's a duplicate digit. So you break out of the loop altogether and say you found an invalid value.

Efficient algorithm to find most common phrases in a large volume of text

I am thinking about writing a program to collect for me the most common phrases in a large volume of the text. Had the problem been reduced to just finding words than that would be as simple as storing each new word in a hashmap and then increasing the count on each occurrence. But with phrases, storing each permutation of a sentence as a key seems infeasible.
Basically the problem is narrowed down to figuring out how to extract every possible phrase from a large enough text. Counting the phrases and then sorting by the number of occurrences becomes trivial.
I assume that you are searching for common patterns of consecutive words appearing in the same order (e.g. "top of the world" would not be counted as the same phrase as "top of a world" or "the world of top").
If so then I would recommend the following linear-time approach:
Split your text into words and remove things you don't consider significant (i.e. remove capitalisation, punctuation, word breaks, etc.)
Convert your text into an array of integers (one integer per unique word) (e.g. every instance of "cat" becomes 1, every "dog" becomes 2) This can be done in linear time by using a hash-based dictionary to store the conversions from words to numbers. If the word is not in the dictionary then assign a new id.
Construct a suffix-array for the array of integers (this is a sorted list of all the suffixes of your array and can be constructed by linear time - e.g. using the algorithm and C code here)
Construct the longest common prefix array for your suffix array. (This can also be done in linear-time, for example using this C code) This LCP array gives the number of common words at the start of each suffix between consecutive pairs in the suffix array.
You are now in a position to collect your common phrases.
It is not quite clear how you wish to determine the end of a phrase. One possibility is to simply collect all sequences of 4 words that repeat.
This can be done in linear time by working through your suffix array looking at places where the longest common prefix array is >= 4. Each run of indices x in the range [start+1...start+len] where the LCP[x] >= 4 (for all except the last value of x) corresponds to a phrase that is repeated len times. The phrase itself is given by the first 4 words of, for example, suffix start+1.
Note that this approach will potentially spot phrases that cross sentence ends. You may prefer to convert some punctuation such as full stops into unique integers to prevent this.

Finding Valid Words in a Matrix

Given a dictionary of words, two APIs
Is_word(string)
Is_prefix(string)
And a NxN matrix with each postion consisting of a character. If from any position (i,j) you can move
in any of the four directions, find out the all the valid words that can be formed in the matrix.
(looping is not allowed, i.e. for forming a word position if you start from (i,j) and move to (i-1,j) then
from this position you cannot go back to (i,j))
What I trie :: I can see an exponential solution where we go though all possibilities and Keep track of already visited Indexes. Can we have a better solution?
This is the best I can do:
Enumerate all the possible strings in half of the allowed directions. (You can reverse the strings to get the other directions.)
For each starting character position in each string, build up substrings until they are neither a word nor a prefix.
Edit: I may have misunderstood the question. I thought each individual word could only be made up of moves in the same direction.
I would be inclined to solve this by creating two parallel data structures. One is a list of words (where duplicates are removed). The other is a matrix of prefixes, where each element is a list. Hopefully, such data structures can be used. The matrix of lists could really be a list of triplets, with the word and coordinates in the original matrix.
In any case, go through the original matrix and call isword() on each element. If a word, then insert the word into the word list.
Then go through the original matrix, and compare call isprefix() on each element. If a prefix, insert in the prefix matrix.
Then, go through the prefix matrix and test the four combinations of the prefix with the additional letter. If a word, then put in word list. If a prefix, then add to the prefix matrix, in the location of the last letter. Remember to do both, since a word can be a prefix.
Also, in the prefix list, you need to keep a list not only of the letters, but the locations of the letters that have been used. This is used to prevent loops.
Then continue this process until there are no words and no prefixes.
It is hard to measure the complexity of this structure, since it seems to depend on the contents of the dictionary.
I think you can't beat exponential here. Proof?
Think about the case when you have a matrix where you have arrangement of letters such that each combination is a valid dictionary word i.e. when you start from [i,j] for e.g. [i,j] combined with either [i-1,j] or [i+1,j] or [i,j+1] or [i,j-1] are all 2 letter words and that continues recursively i.e. any combination of length n (according to your rules) is a word. In this case you can't do better than a brute force.

Resources