Search function for finding books in a data structure - algorithm

I want to implement a search filter, as efficient as possible, that manages book titles in my "library". The search should work as follows:
The user types in the first b letters of a book title. The number n of book titles that begin exactly with the entered sequence of letters is determined. k is a preset constant that specifies how many book titles should be output. That means if n ≤ k, an alphabetically sorted list of the n book titles is output.
The main problem I'm currently facing is that I don't know which data type to pick and on what data structure I should implement it, because I need it to be as efficient as possible.
And a follow-up question: if I use an array for this, I would definitely choose a sorted array, right?
Any help is gladly appreciated, and I'm NOT asking for an implementation.

This is a very good candidate for a trie. At each keystroke you know exactly how many books start with the current prefix. Walking the trie is linear in the prefix, so the search for a prefix of n characters takes O(n) time. In a simple trie implementation, to get the book titles you would need to traverse all the paths that continue from the prefix node, but this can be remedied by storing, at each node of the trie, pointers to the titles that start with that prefix (in other words, that continue from that node). You could also sort those pointers in advance.
So the whole operation of searching for all the titles that start with a prefix of length n and returning the titles is O(n + m), where m is the number of titles with this prefix.
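
A minimal sketch of that idea in Python (the class and method names are my own; the per-node title lists trade memory for query speed):

class TrieNode:
    def __init__(self):
        self.children = {}   # letter -> TrieNode
        self.titles = []     # sorted titles whose prefix reaches this node

class TitleTrie:
    def __init__(self, titles, k):
        self.k = k           # preset constant: max number of suggestions
        self.root = TrieNode()
        for title in sorted(titles):   # inserting in sorted order keeps each list sorted
            node = self.root
            for letter in title:
                node = node.children.setdefault(letter, TrieNode())
                node.titles.append(title)

    def suggest(self, prefix):
        node = self.root
        for letter in prefix:          # O(b) walk for a prefix of length b
            node = node.children.get(letter)
            if node is None:
                return []              # no title starts with this prefix
        n = len(node.titles)           # n = number of matching titles
        return node.titles if n <= self.k else node.titles[:self.k]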

Related

Reordering the alphabet to come in first in lexicographical order in fastest way

Consider we have a list of people's names, no two of which are the same. The maximum size of the list is:
Now the goal is to find out how many names (and which ones!) can come first in lexicographical ordering if we changed the order of the English alphabet.
For instance, if the list is:
ha haa st
then by changing the alphabet we can bring ha and st into first place, but no matter how we change it, haa will always come after ha, so two names can come first.
Of course there is a brute-force way to find the answer, but that needs to check all 26! possible orders of the alphabet for each word! Since the time limit on this problem is 1 second, I think an algorithm with O(n log n) or lower would do fine. However, I don't know how to approach the problem. I think using a trie would be helpful (since I encountered the problem while learning data structures!), but maybe graph algorithms could also help.
How can I find the right algorithm and approach to this problem, and how to implement it in code?
Let w be that first word.
Reordering the alphabet's letters does not change any word's length, so any name that can come in first place must be of length length(w).
Let L be the set of candidate words above. All the names are distinct per the initial formulation, so L is also made of unique names.
Only a name in L can be a solution, and any name in L is a solution. The answer to your problem is L's size.
tl;dr: count all the words of length length(w)
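
For what it's worth, a direct sketch of the counting this answer describes (using the example list from the question; the function name is mine):

def count_first_candidates(names):
    w = min(names)                              # the name that comes first under the current alphabet
    L = [x for x in names if len(x) == len(w)]  # the candidate set L from above
    return len(L), L

# count_first_candidates(["ha", "haa", "st"]) -> (2, ['ha', 'st'])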

data structure for finding the substring from large number of strings

My problem statement is that I am given millions of strings, and I have to find the strings that contain a given substring.
e.g. given "xyzoverflowasxs" and "werstackweq", etc., a query for the substring "stack" should return "werstackweq". What kind of data structure can we use to solve this problem?
I think we can use a suffix tree for this, but I wanted some more suggestions.
I think the way to go is with a dictionary holding the actual words, and another data structure pointing to entries within this dictionary. One option would be suffix trees and their variants, as mentioned in the question and the comments. I think the following is a far simpler (heuristic) alternative.
Say you choose some integer k. For each of your strings, computing the Rabin fingerprints of all its length-k substrings is efficient and easy (any language has an implementation).
So, for a given k, you could hold two data structures:
A dictionary of the words, say a hash table based on collision lists
A dictionary mapping each fingerprint to an array of the linked-list node pointers in the first data structure.
Given a query word of length k or greater, you would choose a length-k subword of it, calculate its Rabin fingerprint, find the words which contain this fingerprint, and check whether they indeed contain the query word.
The question is which k to use, and whether to use multiple values of k. I would determine this experimentally (starting with a few small values simultaneously, say k = 1, 2, and 3, and also a couple of larger ones). The performance of this heuristic depends on the distribution of your dictionary and queries anyway.
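
A rough sketch of the two structures, assuming a simple polynomial rolling hash as the Rabin fingerprint and sets of strings instead of linked-list node pointers (all names here are mine):

from collections import defaultdict

def fingerprints(s, k, base=256, mod=(1 << 61) - 1):
    # Rolling polynomial hash of every length-k substring of s.
    if len(s) < k:
        return
    power = pow(base, k - 1, mod)
    h = 0
    for ch in s[:k]:
        h = (h * base + ord(ch)) % mod
    yield h
    for i in range(1, len(s) - k + 1):
        h = ((h - ord(s[i - 1]) * power) * base + ord(s[i + k - 1])) % mod
        yield h

def build_index(strings, k):
    index = defaultdict(set)          # fingerprint -> strings containing it
    for s in strings:
        for fp in fingerprints(s, k):
            index[fp].add(s)
    return index

def find_containing(index, word, k):
    fp = next(fingerprints(word[:k], k))          # fingerprint of one length-k subword
    return [s for s in index[fp] if word in s]    # verify: fingerprints can collide

# index = build_index(["xyzoverflowasxs", "werstackweq"], 5)
# find_containing(index, "stack", 5) -> ['werstackweq']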

Find the number occurences of each word in a document?

I was asked this question in an interview. The interviewer told me to assume that there exists a function, say getNextWord(), that returns the next word in a given document. My task was to design a data structure for this, and give an algorithm that constructs a list of all words with their frequencies.
Being from a C++ background, my answer was to create a multimap of strings, insert all words into it, and later display the count of each. I was however told to do this in a more generic way. By generic he meant that he didn't want me to use a library feature. Also, I guess a multimap is implemented internally as a 2-3 tree or so, so for the multimap solution to be generic I would need to code the 2-3 tree as well.
Although tries did come to mind, implementing one during an interview was out of the question for me. So, I just wanted to know if there are better ways of achieving it. Or is there a way to implement it in a smooth manner using tries?
Any histogram-based algorithm would be both efficient and generic here. The idea is simple: build a histogram from the data. A generic interface for a histogram is a Map<String,Integer>.
Iterate over the document once (with the given getNextWord() method), while maintaining your histogram.
The best implementation of this interface, in terms of big-O notation, would probably be a trie, with an occurrence counter added to each node where a word ends.
Getting the actual (word, number) pairs from the trie is done by a simple DFS over the trie.
This solution gives you O(n * |S|) time complexity, where n is the number of words and |S| is the average length of a word.
The insertion algorithm for each word:
Each time you add a word, check if it already exists: if it does, increase its counter; if not, add the word to the dictionary with a counter value of 1.
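
A minimal sketch of that loop, using a plain dict as the Map<String,Integer> and assuming getNextWord() returns None at the end of the document:

def word_frequencies(get_next_word):
    histogram = {}                # word -> count
    word = get_next_word()
    while word is not None:       # assumption: None signals end of document
        histogram[word] = histogram.get(word, 0) + 1
        word = get_next_word()
    return histogram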
I'd try to implement a B-tree (or something quite similar) to store all the words. For each next word, I could then easily find it if it is already there and increase the associated counter in its node, or otherwise insert it.
The time complexity in that case would be O(n log n), where n is the total word count and log n is the cost of each tree operation.
I think the simplest solution would be a trie. O(N) per word is given in this case (both for insertion and for updating the count), where N is the word's length. Just store the count in an additional field at every node.
Basically, each node in the tree contains 26 links to the 26 possible children (one for each letter) plus one counter (for words that terminate in the current node).
See the Wikipedia article on tries for a diagram.
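
A sketch of such a counting trie (using a dict per node rather than a 26-slot array; the names are my own):

class CountingNode:
    def __init__(self):
        self.children = {}   # letter -> CountingNode (the 26 links)
        self.count = 0       # occurrences of the word ending at this node

def add_word(root, word):
    node = root
    for letter in word:
        node = node.children.setdefault(letter, CountingNode())
    node.count += 1          # O(N) per inserted word of length N

def frequencies(node, prefix=""):
    # DFS over the trie, yielding (word, count) pairs in sorted order.
    if node.count:
        yield prefix, node.count
    for letter in sorted(node.children):
        yield from frequencies(node.children[letter], prefix + letter)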

What is an efficient search algorithm to provide auto-completion?

I've got a list of 10000 keywords. What is an efficient search algorithm to provide auto-completion with that list?
Using a trie is an option, but tries are space-inefficient. They can be made more space-efficient by using a modified version known as a radix tree, or Patricia tree.
A ternary search tree would probably be a better option. Here is an article on the subject: "Efficient auto-complete with a ternary search tree." Another excellent article on the use of ternary search trees for spelling correction (a similar problem to auto-complete) is "Using Ternary DAGs for spelling correction."
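
In case it helps, here is a bare-bones ternary search tree sketch (my own naming, not taken from either article; assumes non-empty words and prefixes):

class TSTNode:
    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.is_word = False

class TST:
    def __init__(self):
        self.root = None

    def insert(self, word):
        self.root = self._insert(self.root, word, 0)

    def _insert(self, node, word, i):
        ch = word[i]
        if node is None:
            node = TSTNode(ch)
        if ch < node.ch:
            node.lo = self._insert(node.lo, word, i)
        elif ch > node.ch:
            node.hi = self._insert(node.hi, word, i)
        elif i + 1 < len(word):
            node.eq = self._insert(node.eq, word, i + 1)
        else:
            node.is_word = True
        return node

    def complete(self, prefix):
        # Find the node where the prefix ends, then collect its subtree.
        node, i = self.root, 0
        while node is not None:
            if prefix[i] < node.ch:
                node = node.lo
            elif prefix[i] > node.ch:
                node = node.hi
            elif i + 1 < len(prefix):
                node, i = node.eq, i + 1
            else:
                break
        if node is None:
            return []
        results = [prefix] if node.is_word else []
        self._collect(node.eq, prefix, results)
        return results

    def _collect(self, node, prefix, results):
        # In-order traversal, so results come out alphabetically sorted.
        if node is None:
            return
        self._collect(node.lo, prefix, results)
        if node.is_word:
            results.append(prefix + node.ch)
        self._collect(node.eq, prefix + node.ch, results)
        self._collect(node.hi, prefix, results)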
I think binary search works just fine for 10000 entries.
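
For a sorted list that small, auto-completion is just two binary searches to find the contiguous run of matches; a sketch (assumes no keyword contains U+FFFF):

import bisect

def complete(sorted_words, prefix):
    # All words sharing the prefix form one contiguous run of the sorted list.
    lo = bisect.bisect_left(sorted_words, prefix)
    hi = bisect.bisect_left(sorted_words, prefix + "\uffff")  # just past the run
    return sorted_words[lo:hi]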
A trie (http://en.wikipedia.org/wiki/Trie) gives you O(N) search time whenever you type a letter (I'm assuming you want new suggestions whenever a letter is typed). This should be fairly efficient if you have small words, and the search space is reduced with each new letter.
As you've already mentioned that you store your words in a database (see Auto-suggest Technologies and Options), create an index on those words and let the database do the work. Databases know how to do that efficiently.
A rather roundabout method was proposed on SO for crosswords.
It could be adapted here fairly easily :)
The idea is simple, yet quite efficient: it consists of indexing the words, building one index per letter position. It should be noted that after 4-5 letters, the subset of words available is so small that brute force is probably best... this would have to be measured, of course.
As for the idea, here is a Python way:
from collections import defaultdict
from functools import reduce

class AutoCompleter:
    def __init__(self, words):
        self.words = words
        self.index = defaultdict(set)   # (position, letter) -> words with that letter there
        self._build()

    def _build(self):
        for w in self.words:
            for i, letter in enumerate(w):
                self.index[(i, letter)].add(w)

    def complete(self, first_letters):
        # One set of candidate words for each typed letter
        sets = [self.index[(i, letter)] for i, letter in enumerate(first_letters)]
        # Order them so that the smallest set comes first
        sets.sort(key=len)
        # Intersect all sets, from smallest to biggest
        return reduce(set.intersection, sets) if sets else set()
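
For example:

ac = AutoCompleter(["ha", "hat", "hart", "horse", "house"])
print(ac.complete("ha"))   # -> {'ha', 'hat', 'hart'}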
The memory requirement is quite stringent: one entry for each letter at each position. However, an entry only exists where some word actually has that letter at that position, which is not the case for all (position, letter) pairs.
The speed seems quite good too. If you wish to auto-complete a 3-letter word (the classic threshold to trigger auto-completion), it costs:
3 lookups in a hash map
2 intersections of sets (definitely the expensive spot), but ordered so as to be as efficient as possible.
I would definitely need to try this out against the ternary tree and trie approaches to see how it fares.

What algorithm can you use to find duplicate phrases in a string?

Given an arbitrary string, what is an efficient method of finding duplicate phrases? We can say that phrases must be longer than a certain length to be included.
Ideally, you would end up with the number of occurrences for each phrase.
In theory
A suffix array is the 'best' answer since it can be implemented to use linear space and time to detect any duplicate substrings. However - the naive implementation actually takes time O(n^2 log n) to sort the suffixes, and it's not completely obvious how to reduce this down to O(n log n), let alone O(n), although you can read the related papers if you want to.
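
To make the duplicate-detection step concrete, here is a sketch using the naive construction just described: lexicographically adjacent suffixes that share a long prefix reveal a repeated phrase (the function name is mine):

def duplicated_phrases(s, min_len):
    # Naive suffix array: sort all suffixes -- the O(n^2 log n)
    # construction mentioned above, fine for modest inputs.
    order = sorted(range(len(s)), key=lambda i: s[i:])
    found = set()
    for a, b in zip(order, order[1:]):
        # Longest common prefix of two adjacent suffixes.
        lcp = 0
        while a + lcp < len(s) and b + lcp < len(s) and s[a + lcp] == s[b + lcp]:
            lcp += 1
        if lcp >= min_len:
            found.add(s[a:a + lcp])   # a substring occurring at least twice
    return found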
A suffix tree can take slightly more memory than a suffix array (still linear, though), but it is easier to build quickly, since you can use something like a radix-sort idea as you add things to the tree (see the Wikipedia article on suffix trees for details).
The KMP algorithm is also good to be aware of; it is specialized for searching for a particular substring within a longer string very quickly. If you only need this special case, just use KMP, and there is no need to build an index of suffixes first.
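
For reference, a textbook KMP sketch (assumes a non-empty pattern):

def kmp_search(text, pattern):
    # Failure table: fail[i] = length of the longest proper border
    # (prefix that is also a suffix) of pattern[:i+1].
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan the text without ever backing up: O(len(text) + len(pattern)).
    hits, k = [], 0
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):          # full match ending at position i
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits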
In practice
I'm guessing you're analyzing a document of actual natural language (e.g. English) words, and you actually want to do something with the data you collect.
In this case, you might just want to do a quick n-gram analysis for some small n, such as n=2 or 3. For example, you could tokenize your document into a list of words by stripping out punctuation and capitalization, and stemming words (running, runs both -> 'run') to increase semantic matches. Then just build a hash map (such as hash_map in C++, a dictionary in Python, etc.) from each adjacent pair of words to its number of occurrences so far. In the end you get some very useful data which was very fast to code and not crazy slow to run.
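
A sketch of such a bigram (n=2) count, with tokenization reduced to lower-casing and stripping punctuation (stemming omitted for brevity):

import re
from collections import Counter

def bigram_counts(text):
    # Tokenize: lower-case and keep only word characters.
    words = re.findall(r"[a-z']+", text.lower())
    # Map each adjacent pair of words to its number of occurrences.
    return Counter(zip(words, words[1:]))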
As the earlier folks mentioned, a suffix tree is the best tool for the job. My favorite site for suffix trees is http://www.allisons.org/ll/AlgDS/Tree/Suffix/. It enumerates all the nifty uses of suffix trees on one page and has a test JS application embedded for trying strings and working through examples.
Suffix trees are a good way to implement this. The bottom of that article has links to implementations in different languages.
Like jmah said, you can use suffix trees/suffix arrays for this.
There is a description of an algorithm you could use here (see Section 3.1).
You can find a more in-depth description in the book they cite (Gusfield, 1997), which is on Google Books.
Suppose you are given a sorted array A with n entries. Then duplicates can be marked in a single pass over adjacent pairs:

def mark_duplicates(A):
    # A is sorted, so equal elements are adjacent.
    duplicates = set()
    for i in range(len(A) - 1):
        if A[i] == A[i + 1]:
            duplicates.add(A[i])   # mark A[i] and A[i+1] as duplicates
    return duplicates

This algorithm runs in O(n) time.
