Find the number of occurrences of each word in a document? - algorithm

I was asked this question in an interview. The interviewer told me to assume that there exists a function, say getNextWord(), that returns the next word in a given document. My task was to design a data structure for the task and give an algorithm that constructs a list of all words with their frequencies.
Being from a C++ background, my answer was to create a multimap of strings, insert all the words into it, and later display the count of each. I was however told later to do this in a more generic way. By generic he meant that he didn't want me to use a library feature. I also guess a multimap is implemented internally as a 2-3 tree or the like, so for the multimap solution to be generic I would need to code that tree as well.
Although tries did come to mind, implementing one during an interview was out of the question for me. So I just wanted to know whether there are better ways of achieving this, or whether there is a way to implement it smoothly using tries?

Any histogram-based algorithm would be both efficient and generic here. The idea is simple: build a histogram from the data. A generic interface for a histogram is a Map<String,Integer>.
Iterate the document once (with your getNextWord() method), while maintaining your histogram.
The best implementation of this interface in terms of big-O notation would probably be a trie, with a counter of occurrences added to each node where a word terminates.
Getting the actual (word, count) pairs from the trie is then done by a simple DFS over the trie.
This solution gives you O(n * |S|) time complexity, where |S| is the average length of a word.
The insertion algorithm for each word:
Each time you add a word, check whether it already exists: if it does, increase its counter, else add the word to the dictionary with a counter value of 1.
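A minimal C++ sketch of this trie-based histogram, assuming words are lowercase a-z (the node layout and function names are illustrative, not from the original answer):

    #include <iostream>
    #include <string>

    struct TrieNode {
        TrieNode* child[26] = {nullptr};
        int count = 0;  // occurrences of the word that terminates at this node
    };

    void insertWord(TrieNode* root, const std::string& word) {
        TrieNode* node = root;
        for (char c : word) {
            int i = c - 'a';
            if (!node->child[i]) node->child[i] = new TrieNode();
            node = node->child[i];
        }
        ++node->count;  // first insertion takes the counter from 0 to 1
    }

    // DFS that prints every (word, count) pair stored in the trie.
    void dumpCounts(const TrieNode* node, std::string& prefix) {
        if (node->count > 0)
            std::cout << prefix << " " << node->count << "\n";
        for (int i = 0; i < 26; ++i) {
            if (!node->child[i]) continue;
            prefix.push_back(static_cast<char>('a' + i));
            dumpCounts(node->child[i], prefix);
            prefix.pop_back();
        }
    }

Calling dumpCounts(&root, s) with an initially empty string s emits the full frequency list.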

I'd try to implement a B-tree (or something quite similar) to store all the words. That way I could easily find the next word; if it is already there, I increase the counter in its node, otherwise I insert a new one.
The time complexity in that case would be O(n log n), where n is the total number of words and log n is the cost of a lookup or insert in such a tree.
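A minimal sketch of that counting loop, with std::map (typically a red-black tree) standing in for the hand-rolled B-tree; getNextWord() is the function assumed by the question, taken here to return an empty string at end of document:

    #include <map>
    #include <string>

    std::string getNextWord();  // assumed to exist, per the question

    // Count every word with a balanced search tree: O(n log n) overall.
    std::map<std::string, int> countWords() {
        std::map<std::string, int> counts;
        for (std::string w = getNextWord(); !w.empty(); w = getNextWord())
            ++counts[w];  // a new word starts at 0 and becomes 1; an existing word increments
        return counts;
    }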

I think the simplest solution would be a Trie. O(N) is given in this case (both for insertion and for getting the count). Just store the count in additional space at every node.
Basically, each node in the tree contains 26 links to the 26 possible children (one per letter) plus one counter (for words that terminate at the current node).
Just look at the link for a graphic image of a trie.

Related

Algorithm for searching a dictionary in both directions in O(log n) <= speed < O(n)

I am trying to implement a (Language1 - Language2) dictionary.
I want to make a search algorithm in both directions whose speed is faster than O(n).
For example, if (hello, hola) is one pair,
SEARCH_SPANISH_BY_ENGLISH (hello) = "hola", and
SEARCH_ENGLISH_BY_SPANISH (hola) = "hello"
If you have an idea for an algorithm, can you tell me how to set up a dictionary and implement a search algorithm? It seems like I have to use divide and conquer, but I am not sure how. Thanks.
The dictionary should be one-sided, which means I cannot build both an English-Spanish and a Spanish-English dictionary.
You can have two dictionaries, as the search time will remain constant. However, if you still wish to have a two-way dictionary, then Jon Skeet's version is here:
Getting key of value of a generic Dictionary?
Any balanced tree, sorted array, or hash table data structure will give you better than O(n) searching.
For the bidirectionality, you can just have two data structures, one for English-Español and the other for Español-English. That doesn't necessarily mean two complete dictionaries, just that you need a search index of some description for each direction.
However, since you appear to have a one-to-one mapping of words, you can just keep one dictionary as the persistent one and build the reverse one at run time.
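A minimal sketch of that run-time reversal, assuming a one-to-one std::unordered_map (names are illustrative):

    #include <string>
    #include <unordered_map>

    // Build the reverse index from the single persistent English->Spanish map.
    std::unordered_map<std::string, std::string>
    reverseIndex(const std::unordered_map<std::string, std::string>& en2es) {
        std::unordered_map<std::string, std::string> es2en;
        for (const auto& [en, es] : en2es)
            es2en[es] = en;  // safe only because the mapping is one-to-one
        return es2en;
    }

Both lookups are then average O(1); a std::map on each side would give O(log n) and keep the entries ordered.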

Hash-maps or search tree?

The problem is as follows: given a list of cities with their countries, populations and geo-coordinates, you should read this data, store it, and answer requests in an endless loop of the following type:
Request: a prefix (e.g., free).
Answer: all cities beginning with this prefix (case-insensitive)
and their associated data (country + population + geo-coordinates).
The cities should be sorted by population (highest population first).
Which data structures are the most suitable for the described problem?
First part: my thoughts waver between a trie and a hashmap, although I lean towards the trie because I'm dealing with prefix requests, and a trie is, according to Wikipedia:
"a trie, also called digital tree and sometimes radix tree or prefix tree (as they can be searched by prefixes), is a kind of search tree—an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings".
In addition to that, in terms of storage and of reading the data, the trie has the advantage over hash-maps.
Second part: returning the cities sorted by population would be a little challenging in terms of time complexity. If I'm thinking in the right direction, I should save the values of the keys as lists; then it will be easier to sort just the returned list, so I don't have to keep everything sorted, which saves some time.
Please share your thoughts and correct me if I'm wrong.
There are pros and cons to picking vanilla tries and vanilla hashmaps. In general, for autocomplete systems the structure of a trie is extremely useful, because you're usually searching for prefixes and the user would like to see the words that begin with the string they have just entered.
However, there is a method to make the best use of both of these data structures; it is called a hash trie (implementation: http://www.sanfoundry.com/java-program-implement-hash-trie/). The way you would implement this is by using the structure of the trie, but the final node holds the actual string it refers to. In Python, this is done by using dictionaries instead of lists when implementing the trie.
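A sketch of such a hash trie in C++, assuming the key is a lowercased city name while the terminal node keeps the original spelling (the type and function names are illustrative):

    #include <memory>
    #include <string>
    #include <unordered_map>

    // Children live in a hash map instead of a fixed 26-slot array, and a
    // terminal node stores the full original string.
    struct HashTrieNode {
        std::unordered_map<char, std::unique_ptr<HashTrieNode>> children;
        std::string word;  // non-empty only at nodes where a key terminates
    };

    void insert(HashTrieNode* root, const std::string& key, const std::string& original) {
        HashTrieNode* node = root;
        for (char c : key) {
            auto& child = node->children[c];
            if (!child) child = std::make_unique<HashTrieNode>();
            node = child.get();
        }
        node->word = original;  // e.g. key = "freiburg", original = "Freiburg"
    }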
For the second half of the question, a list would be your best bet: in essence, a list of (population, city) tuples; sort by the population and return the cities. Regarding it being "easier" to sort, I'm not sure I agree with this; "easy" is a relative term, and there's really no way of saying it's easier than, say, storing it in a tree and returning the in-order traversal of the tree. Essentially, if you're using a comparison-based sort, it won't get better than O(n log n).
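For instance, a minimal sketch of that final sort, with an illustrative City record:

    #include <algorithm>
    #include <string>
    #include <vector>

    struct City { std::string name; long long population; };

    // Sort the matched cities by population, highest first.
    void sortByPopulation(std::vector<City>& matches) {
        std::sort(matches.begin(), matches.end(),
                  [](const City& a, const City& b) { return a.population > b.population; });
    }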

What's the best way to traverse a large dictionary of words?

Let's say I'm looking for a word that may or may not be in a dictionary of 95k words, and I cannot use word length to facilitate the search. My question is about the fastest way to find the word without doing an O(n) lookup.
Here are my two thoughts:
First, store the words in a hash table; lookup of a word is O(1), which seems the best scenario in my mind. But going through different websites, using a trie was also suggested; my question regarding this is whether it's practical to have a trie that holds so many words.
The lookup would be O(k) in this case, where k is the length of the word.
So what is the optimal way of finding a word in a large dictionary?
Optimality depends on your use case: do you care about lookup time or about space? (Also, do you care about inserting new words?)
The best you can do time-wise is to use a hash table, but for a dictionary it is space-inefficient. A trie compresses the space requirement because it stores shared prefixes rather than every entire word, but it takes longer to look up. So, to answer your question, it is more space-efficient to have a trie with a large number of words than a hash table.
If you are just searching for a single word, the cost of setting up a hash table or tree structure would exceed that of a linear search. These structures become (very) efficient when their cost is amortized over (very) many uses.
If the dictionary is sorted (and why wouldn't a dictionary be?), then you can look for a single word in O(log n) time with a binary search through the file; no additional structures are needed.
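A sketch of that idea over an in-memory sorted word list (searching the file directly would replace the indexing with seeks, but the probing logic is the same):

    #include <algorithm>
    #include <string>
    #include <vector>

    // O(log n) membership test on a lexicographically sorted dictionary.
    bool contains(const std::vector<std::string>& sortedWords, const std::string& target) {
        return std::binary_search(sortedWords.begin(), sortedWords.end(), target);
    }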
I think the best way to find a word in a dictionary is a B+ tree, and let me explain why.
Let's say you have a root block of 10 strings. The strings in the block are sorted, and each of these 10 strings is followed by a pointer to another cell of 10 strings, and so on. So the only thing you have to do is string-compare your key word against them, starting with the first one, until you find a word that compares smaller (StringCompare).
If we take it as standard that each string has next to it a pointer to a cell with words that compare smaller, it will take you 5 steps and 5 comparisons to reach the final bracket of data that may or may not contain your key word.
In those 5 comparisons plus the comparisons in the final bracket, you are searching a dictionary of 10*10*10*10*10 = 100,000 words.
The algorithm runs in logarithmic time: log of 100,000 with base the number of strings in a cell. If each cell has 10 words, you need 5 steps.
I should mention that only the root of the tree needs to be stored in RAM; all the other blocks can be stored on the hard drive without a significant loss in performance, because only a few steps are needed.
Hope I explained it right :D At least I tried! Have fun.
A trie is preferable because this data structure can be faster than a hash table. Hash tables are O(1) only in the ideal case; in real-world applications collisions can occur. The various trie data structures don't suffer from this.
Another point is compression: tries are much more compact than hash tables. A hash table requires some extra space for efficient insert operations; if the load factor of the hash table is close to 100%, insert operations take a very long time.
With a hash table you must compare your key with at least one key from the dictionary, and that key comparison takes O(k), where k is the key length. With a trie you are doing much the same thing: your lookup operations are O(k).
Tries allow ordered traversal; hash tables don't.
There are many types of tries out there; for example, a ternary search trie is very good in this particular case. An array-mapped trie is also very fast compared to a regular hash table.

Most efficient sorting algorithm for a large set of numbers

I'm working on a large project; I won't bother to summarize it here, but this section of the project is to take a very large document of text (a minimum of around 50,000 words, not unique) and output each unique word in order from most used to least used (the top three will probably be "a", "an" and "the").
My question is, of course, what would be the best sorting algorithm to use? I was reading about counting sort, and I like it, but my concern is that the range of values will be too large compared to the number of unique words.
Any suggestions?
First, you will need a map of word -> count.
50,000 words is not much; it will easily fit in memory, so there's nothing to worry about. In C++ you can use the standard STL std::map.
Then, once you have the map, you can copy all the map keys to a vector.
Finally, sort this vector using a custom comparison operator: instead of comparing the words, compare their counts from the map. (Don't worry about the specific sorting algorithm; your array is not that large, so any standard library sort will work for you.)
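A sketch of those three steps in C++ (the function name and input form are illustrative):

    #include <algorithm>
    #include <map>
    #include <string>
    #include <vector>

    std::vector<std::string> wordsByFrequency(const std::vector<std::string>& words) {
        std::map<std::string, int> count;            // step 1: word -> count
        for (const auto& w : words) ++count[w];

        std::vector<std::string> keys;               // step 2: copy the keys
        keys.reserve(count.size());
        for (const auto& [word, n] : count) keys.push_back(word);

        std::sort(keys.begin(), keys.end(),          // step 3: sort keys by their counts
                  [&count](const std::string& a, const std::string& b) {
                      return count[a] > count[b];    // most frequent first
                  });
        return keys;
    }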
I'd start with a quicksort and go from there.
Check out the wiki page on sorting algorithms, though, to learn the differences.
You should try an MSD radix sort. It will sort your entries in lexicographical order. Here is a google code project you might be interested in.
Have a look at the link: a pictorial representation of how the different algorithms work. This will give you a hint!
Sorting Algorithms
You can get better performance than quicksort with this particular problem assuming that if two words occur the same number of times, then it doesn't matter in which order you output them.
First step: create a hash map with the words as keys and their frequencies as the associated values. You will fill this hash map in as you parse the file. While you are doing this, make sure to keep track of the highest frequency encountered. This step is O(n) complexity.
Second step: create a list with a number of entries equal to the highest frequency from the first step. Each slot in this list will hold a list of the words whose frequency count equals that slot's index; so words that occur 3 times in the document will go in list[3], for example. Iterate through the hash map and insert the words into the appropriate spots in the list. This step is O(n) complexity.
Third step: iterate through the list in reverse and output all the words. This step is O(n) complexity.
Overall this algorithm will accomplish your task in O(n) time rather than the O(n log n) required by quicksort.
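A hedged sketch of those three steps (names are illustrative):

    #include <algorithm>
    #include <string>
    #include <unordered_map>
    #include <vector>

    std::vector<std::string> byFrequency(const std::vector<std::string>& words) {
        std::unordered_map<std::string, int> freq;   // step 1: count, tracking the maximum
        int maxFreq = 0;
        for (const auto& w : words) maxFreq = std::max(maxFreq, ++freq[w]);

        std::vector<std::vector<std::string>> buckets(maxFreq + 1);
        for (const auto& [word, n] : freq)           // step 2: pigeonhole by frequency
            buckets[n].push_back(word);

        std::vector<std::string> result;             // step 3: read buckets in reverse
        for (int n = maxFreq; n >= 1; --n)
            for (const auto& w : buckets[n]) result.push_back(w);
        return result;
    }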
In almost every case I've ever tested, Quicksort worked the best for me. However, I did have two cases where Combsort was the best. Could have been that combsort was better in those cases because the code was so small, or due to some quirk in how ordered the data was.
Any time sorting shows up in my profile, I try the major sorts. I've never had anything that topped both Quicksort and Combsort.
I think you want to do something as explained in the below post:
http://karephul.blogspot.com/2008/12/groovy-closures.html
Languages which support closures make the solution much easier, like LINQ, as Eric mentioned.
For large sets you can use what is known as "sort-based indexing" in information retrieval, but for 50,000 words you can use the following (a sketch follows the steps):
1. Read the entire file into a buffer.
2. Parse the buffer and build a token vector with struct token { char *term; int termlen; }, where term is a pointer to the word in the buffer.
3. Sort the table by term (lexicographical order).
4. Set entrynum = 0 and iterate through the term vector; when a term is new, store it in a vector of struct { char *term; int frequency; } at index entrynum, set frequency to 1 and increment the entry number; otherwise increment frequency.
5. Sort the vector by frequency in descending order.
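A compact sketch of these steps, using std::string in place of the raw char*/length token for brevity (names are illustrative):

    #include <algorithm>
    #include <cctype>
    #include <string>
    #include <vector>

    struct Entry { std::string term; int frequency; };

    std::vector<Entry> sortBasedCount(const std::string& buffer) {
        std::vector<std::string> tokens;             // steps 1-2: tokenize the buffer
        std::string cur;
        for (char c : buffer) {
            if (std::isalpha(static_cast<unsigned char>(c)))
                cur += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
            else if (!cur.empty()) { tokens.push_back(cur); cur.clear(); }
        }
        if (!cur.empty()) tokens.push_back(cur);

        std::sort(tokens.begin(), tokens.end());     // step 3: lexicographical order

        std::vector<Entry> entries;                  // step 4: collapse adjacent duplicates
        for (const auto& t : tokens) {
            if (entries.empty() || entries.back().term != t) entries.push_back({t, 1});
            else ++entries.back().frequency;
        }

        std::sort(entries.begin(), entries.end(),    // step 5: most frequent first
                  [](const Entry& a, const Entry& b) { return a.frequency > b.frequency; });
        return entries;
    }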
You can also try implementing digital trees, also known as tries. Here is the link.

What algorithm can you use to find duplicate phrases in a string?

Given an arbitrary string, what is an efficient method of finding duplicate phrases? We can say that phrases must be longer than a certain length to be included.
Ideally, you would end up with the number of occurrences for each phrase.
In theory
A suffix array is the "best" answer, since it can be implemented to use linear space and time to detect any duplicate substrings. However, the naive implementation actually takes O(n^2 log n) time to sort the suffixes, and it's not completely obvious how to reduce this to O(n log n), let alone O(n), although you can read the related papers if you want to.
A suffix tree can take slightly more memory (still linear, though) than a suffix array, but it is easier to implement so that it builds quickly, since you can use something like a radix-sort idea as you add things to the tree (see the Wikipedia link from the name for details).
The KMP algorithm is also good to be aware of; it is specialized for searching for a particular substring within a longer string very quickly. If you only need this special case, just use KMP; there is no need to bother building an index of suffixes first.
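For illustration, a naive suffix-array sketch along these lines (this is the O(n^2 log n) version mentioned above; names and output format are illustrative):

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    // Sort all suffixes, then scan adjacent pairs for a shared prefix of at
    // least minLen characters: any such prefix is a duplicated phrase.
    void findDuplicatePhrases(const std::string& text, std::size_t minLen) {
        std::vector<std::size_t> suffix(text.size());
        for (std::size_t i = 0; i < text.size(); ++i) suffix[i] = i;
        std::sort(suffix.begin(), suffix.end(),
                  [&text](std::size_t a, std::size_t b) {
                      return text.compare(a, std::string::npos,
                                          text, b, std::string::npos) < 0;
                  });
        for (std::size_t i = 1; i < suffix.size(); ++i) {
            std::size_t lcp = 0;  // longest common prefix of neighboring suffixes
            while (suffix[i - 1] + lcp < text.size() && suffix[i] + lcp < text.size() &&
                   text[suffix[i - 1] + lcp] == text[suffix[i] + lcp])
                ++lcp;
            if (lcp >= minLen)
                std::cout << "duplicate: \"" << text.substr(suffix[i], lcp) << "\"\n";
        }
    }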
In practice
I'm guessing you're analyzing a document of actual natural-language (e.g. English) words, and you actually want to do something with the data you collect.
In this case, you might just want to do a quick n-gram analysis for some small n, such as n = 2 or 3. For example, you could tokenize your document into a list of words by stripping out punctuation and capitalization and by stemming words ("running" and "runs" both -> "run") to increase semantic matches. Then just build a hash map (such as hash_map in C++, a dictionary in Python, etc.) from each adjacent pair of words to its number of occurrences so far. In the end you get some very useful data which was very fast to code and is not crazy slow to run.
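For example, a minimal bigram (n = 2) counter, assuming the tokens have already been stripped of punctuation, lowercased and stemmed as described:

    #include <string>
    #include <unordered_map>
    #include <vector>

    // Map each adjacent pair of tokens to the number of times it occurs.
    std::unordered_map<std::string, int>
    countBigrams(const std::vector<std::string>& tokens) {
        std::unordered_map<std::string, int> counts;
        for (std::size_t i = 0; i + 1 < tokens.size(); ++i)
            ++counts[tokens[i] + " " + tokens[i + 1]];
        return counts;
    }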
As the earlier folks mention, a suffix tree is the best tool for the job. My favorite site for suffix trees is http://www.allisons.org/ll/AlgDS/Tree/Suffix/. It enumerates all the nifty uses of suffix trees on one page and has a test JS application embedded for trying out strings and working through examples.
Suffix trees are a good way to implement this. The bottom of that article has links to implementations in different languages.
Like jmah said, you can use suffix trees/suffix arrays for this.
There is a description of an algorithm you could use here (see Section 3.1).
You can find a more in-depth description in the book they cite (Gusfield, 1997), which is on google books.
Suppose you are given a sorted array A with n entries (i = 1, 2, 3, ..., n). A single linear scan comparing each element with its successor finds all duplicates:

    #include <iostream>
    #include <string>
    #include <vector>

    // Mark adjacent equal entries in a sorted array as duplicates.
    void markDuplicates(const std::vector<std::string>& A) {
        for (std::size_t i = 0; i + 1 < A.size(); ++i)
            if (A[i] == A[i + 1])
                std::cout << "duplicate: " << A[i] << "\n";
    }

This algorithm runs in O(n) time.
