Algorithm for searching a dictionary in both directions in O(log n) <= speed < O(n) - algorithm

I am trying to implement a (Language1 - Language2) dictionary.
I want to make a search algorithm that works in both directions and is faster than O(n).
For example, if (hello, hola) is one pair, then
SEARCH_SPANISH_BY_ENGLISH(hello) = "hola", and
SEARCH_ENGLISH_BY_SPANISH(hola) = "hello".
If you have an idea for an algorithm, can you tell me how to set up the dictionary and implement the search algorithm? It seems like I have to use divide and conquer, but I am not sure how. Thanks.
The stored dictionary should be one-sided, which means I cannot build both an English-Spanish and a Spanish-English dictionary.

You can have two dictionaries, since the lookup time will remain constant. However, if you still wish to have a two-way dictionary, Jon Skeet's version is here:
Getting key of value of a generic Dictionary?

Any balanced tree, sorted array, or hash table data structure will give you better than O(n) searching.
For the bidirectionality, you can just have two data structures, one for English-Español and the other for Español-English. That doesn't necessarily mean two complete dictionaries, just that you need a search index of some description for each direction.
However, since you appear to have a one-to-one mapping of words, you can keep just one dictionary as the persistent copy and build the reverse one at run time.
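A minimal Python sketch of that last suggestion, assuming a one-to-one word mapping (the variable and function names are only illustrative):

    # One persistent dictionary (English -> Spanish); the reverse index is
    # built at run time, so only one mapping has to be maintained.
    en_to_es = {"hello": "hola", "cat": "gato", "book": "libro"}

    # Derive the Spanish -> English index (valid because the mapping is one-to-one).
    es_to_en = {spanish: english for english, spanish in en_to_es.items()}

    def search_spanish_by_english(word):
        return en_to_es.get(word)      # average O(1) hash lookup

    def search_english_by_spanish(word):
        return es_to_en.get(word)      # average O(1) hash lookup

    print(search_spanish_by_english("hello"))  # hola
    print(search_english_by_spanish("hola"))   # hello

If you need ordered traversal instead of hashing, the same trick works with two sorted structures and O(log n) lookups.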

Related

Hash-maps or search tree?

The problem is as follows: given a list of cities with their country, population and geo-coordinates, you should read this data, store it, and answer requests of the following type in an endless loop:
Request: a prefix (e.g., "free").
Answer: all cities beginning with this prefix (case-insensitive)
and their associated data (country + population + geo-coordinates).
The cities should be sorted by population (highest population first).
Which data structure is the most suitable for the described problem?
First part: my thoughts are torn between a trie and a hash map, although I lean towards the trie because I'm dealing with prefix requests, and a trie is, according to Wikipedia:
"a trie, also called digital tree and sometimes radix tree or prefix tree (as they can be searched by prefixes), is a kind of search tree—an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings".
In addition, in terms of storage and of reading the data, the trie has the advantage over hash maps.
Second part: returning the cities sorted by population would be a little challenging in terms of time complexity. If I'm thinking in the right direction, I should store the values of the keys as lists, so that it is easier to sort just the returned list; that way I don't have to keep it sorted, which saves some time.
Please share your thoughts and correct me if I'm wrong.
There are pros and cons to picking vanilla tries and vanilla hash maps. In general, for autocomplete systems, the structure of a trie is extremely useful because you're usually searching for prefixes and the user would like to see the words that begin with the string they have just entered.
However, there is a way to get the best of both data structures: a hash trie (implementation: http://www.sanfoundry.com/java-program-implement-hash-trie/). The way you implement this is by using the structure of the trie, but the final node holds the actual string it refers to. In Python, this is done by using dictionaries instead of lists while implementing the trie.
For the second half of the question, a list would be your best bet: in essence a list of (population, city) tuples; sort by population and return the cities. Regarding it being "easier" to sort, I'm not sure I agree: "easy" is a relative term, and there's really no way of saying that it's easier than, say, storing the entries in a tree and returning the pre-order traversal of the tree. Essentially, if you're using a comparison-based sort, it won't get better than O(n log n).
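To make the trie idea concrete, here is a rough Python sketch of a dict-based trie with a case-insensitive prefix query that returns matches sorted by population; the record layout and function names are my own, not taken from the linked hash-trie implementation:

    # City records are hypothetical tuples: (name, country, population, lat, lon).

    def insert(root, record):
        node = root
        for ch in record[0].lower():
            node = node.setdefault(ch, {})             # children live in a dict
        node.setdefault("$records", []).append(record) # "$records" marks terminal data

    def search_prefix(root, prefix):
        node = root
        for ch in prefix.lower():
            if ch not in node:
                return []
            node = node[ch]
        # Collect every record in this subtree, then sort by population (descending).
        stack, found = [node], []
        while stack:
            n = stack.pop()
            found.extend(n.get("$records", []))
            stack.extend(child for key, child in n.items() if key != "$records")
        return sorted(found, key=lambda r: r[2], reverse=True)

    root = {}
    insert(root, ("Freetown", "Sierra Leone", 1055964, 8.48, -13.23))
    insert(root, ("Freiburg", "Germany", 231195, 47.99, 7.84))
    print(search_prefix(root, "fre"))   # both cities, largest population first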

Find the number of occurrences of each word in a document?

I was asked the question in an interview. The interviewer told me to assume that there exists a function say getNextWord() to return the next word in a given document. My task was to design a data structure to implement the task, and give an algorithm that constructs a list of all words with their frequencies.
Being from a C++ background, my answer was to create a multimap of strings, insert all the words into it, and later display the count. I was however told to do this in a more generic way; by generic he meant that he didn't want me to use a library feature. Also, I guess a multimap is implemented internally as a 2-3 tree or similar, so for the multimap solution to be generic I would need to code the 2-3 tree as well.
Although tries did come to mind, implementing one during an interview was out of the question for me. So, I just wanted to know if there are better ways of achieving this, or whether there is a way to implement it smoothly using tries.
Any histogram-based algorithm would be both efficient and generic here. The idea is simple: build a histogram from the data. A generic interface for a histogram is a Map<String,Integer>.
Iterate over the document once (with the given getNextWord() method), while maintaining your histogram.
The best implementation of this interface, in terms of big-O notation, would probably be a trie, where each terminal node holds a counter of occurrences.
Getting the actual (word, count) pairs from the trie is done by a simple DFS over the trie.
This solution gives you O(n * |S|) time complexity, where |S| is the average length of a string.
The insertion algorithm for each word:
Each time you add a word, check whether it already exists; if it does, increase the counter, otherwise add the word to the dictionary with a counter value of 1.
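A minimal sketch of that histogram interface in Python, using a plain hash map; get_next_word below is only a stand-in for the interviewer's getNextWord():

    def get_next_word(document):
        # Stand-in generator: yields words one at a time from a string.
        for word in document.lower().split():
            yield word

    def build_histogram(document):
        counts = {}                                   # word -> number of occurrences
        for word in get_next_word(document):
            counts[word] = counts.get(word, 0) + 1    # insert with 1 or increment
        return counts

    print(build_histogram("the cat saw the dog and the cat"))
    # {'the': 3, 'cat': 2, 'saw': 1, 'dog': 1, 'and': 1}

Swapping the dict for a trie changes only how the counters are stored, not the one-pass algorithm.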
I'd try to implement a B-tree (or something quite similar) to store all the words. I could then easily find the next word, and if it is already there increase the associated counter in the node, or otherwise insert a new one.
The time complexity in that case would be O(n log n), where n is the total word count and log n is the cost of a lookup in such a tree.
I think the simplest solution would be a trie. O(N) is achievable in this case (both for insertion and for getting the count); just store the count in an additional field at every node.
Basically, each node in the tree contains 26 links to the 26 possible children (one per letter) plus one counter (for the words that terminate at the current node).
Just look at the link for a graphic image of a trie.
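A small Python sketch of that 26-link node, assuming lowercase a-z words only (class and function names are illustrative):

    class TrieNode:
        def __init__(self):
            self.children = [None] * 26   # one slot per letter
            self.count = 0                # words terminating at this node

    def add_word(root, word):
        node = root
        for ch in word:
            i = ord(ch) - ord('a')
            if node.children[i] is None:
                node.children[i] = TrieNode()
            node = node.children[i]
        node.count += 1                   # O(|word|) per insertion

    def dump(node, prefix="", out=None):
        # DFS that collects (word, count) pairs.
        if out is None:
            out = []
        if node.count:
            out.append((prefix, node.count))
        for i, child in enumerate(node.children):
            if child is not None:
                dump(child, prefix + chr(ord('a') + i), out)
        return out

    root = TrieNode()
    for w in ["the", "cat", "the", "then"]:
        add_word(root, w)
    print(dump(root))   # [('cat', 1), ('the', 2), ('then', 1)]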

What is a good way to find pairs of numbers, each stored in a different array, such that the difference between the first and second number is 1?

Suppose you have several arrays of integers. What is a good way to find pairs of integers, not both from the same list, such that the difference between the first and second integer is 1?
Naturally I could write a naive algorithm that, for each number, just looks through each other list until it finds such a number or hits a bigger one. Is there a more elegant solution?
I only mention the condition that the difference be 1 because I'm guessing there might be some way to use that knowledge to speed up the computation. I imagine that if the condition for a 'hit' were something else, the algorithm would work just the same.
Some background: I'm engaged in a bit of research mathematics and I seek to find examples of a certain construction. Any help would be much appreciated.
I'd start by sorting each array, preferably with an algorithm that runs in O(n log(n)) time.
Once you've got a bunch of sorted arrays, you can set a pointer to the start of each array, check for any +/- 1 differences among the pointed-to values, and advance the pointer with the smallest value, repeating until you've reached the end of all but one of the arrays.
To further optimize, you could keep the pointed-to values in a sorted linked list and build the check into an insertion sort. For each increment, you would remove the previous value from the list and step through the list checking for a +/- 1 match until you reach a value that is too large to match. That way, if you're searching a bazillion arrays, you needn't check all bazillion pointer values - you only need to check until you find a value that is too big, and you can ignore all larger values.
If you've got any more information about the arrays (such as the range of the terms or number of arrays), I can see how you could take advantage of that to make much faster algorithms for this through clever uses of array properties.
This sounds like a good candidate for the classic merge sort, where the final stage is not a merge but a comparison.
And the magnitude of the difference wouldn't affect this, but thanks for adding the information.
Even though you state the second list is in an array, if you can put it in a hash map of some sort you can make this faster than the naive approach.
Basically:
Loop through the first array.
Look to see whether the hash map contains a value that is one larger (or smaller) than the current array value.
That way you can build up pairs of numbers that meet your requirement (see the sketch below).
I don't know if it would be as flexible as you would like, though.
Basically, you may want to consider other data structures to help you find a better solution.
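A minimal Python sketch of that hash-based idea, assuming two plain lists of integers (the names are illustrative):

    # Put one array in a set (a hash map of keys only), then scan the other
    # array and check for neighbours that differ by exactly 1.
    def pairs_with_difference_one(first, second):
        lookup = set(second)                  # expected O(len(second)) to build
        pairs = []
        for x in first:                       # expected O(1) per membership test
            if x + 1 in lookup:
                pairs.append((x, x + 1))
            if x - 1 in lookup:
                pairs.append((x, x - 1))
        return pairs

    print(pairs_with_difference_one([3, 10, 7], [4, 8, 20]))
    # [(3, 4), (7, 8)]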
You have O(n log n) from the sorting.
You can also do the search in O(log n) per element if you keep the data as a dynamic query set: sort the arrays, then for each element in the first array binary-search its upper_bound and lower_bound in the second array and check the difference.
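A sketch of that binary-search variant in Python, using the standard bisect module (the sample arrays are made up):

    import bisect

    # Sort the second array once, then binary-search for x + 1 and x - 1
    # for every x in the first array: O(n log n) sort + O(log n) per query.
    def pairs_by_binary_search(first, second):
        second = sorted(second)
        pairs = []
        for x in first:
            for target in (x + 1, x - 1):
                i = bisect.bisect_left(second, target)
                if i < len(second) and second[i] == target:
                    pairs.append((x, target))
        return pairs

    print(pairs_by_binary_search([3, 10, 7], [4, 8, 20]))
    # [(3, 4), (7, 8)]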

Most efficient sorting algorithm for a large set of numbers

I'm working on a large project; I won't bother to summarize it here, but this section of the project is to take a very large document of text (a minimum of around 50,000 words, not all unique) and output each unique word in order from most used to least used (probably the top three will be "a", "an" and "the").
My question is, of course: what would be the best sorting algorithm to use? I was reading about counting sort, and I like it, but my concern is that the range of values will be too large compared to the number of unique words.
Any suggestions?
First, you will need a map of word -> count.
50,000 words is not much - it will easily fit in memory, so there's nothing to worry about. In C++ you can use the standard STL std::map.
Then, once you have the map, you can copy all the map keys into a vector.
Finally, sort this vector using a custom comparison operator: instead of comparing the words, compare their counts from the map. (Don't worry about the specific sorting algorithm - your array is not that large, so any standard library sort will work for you.)
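A compact sketch of those same steps, shown in Python rather than C++ for brevity:

    # Build the word -> count map, then sort the words by their counts
    # in descending order, exactly as described above.
    text = "the cat saw the dog and the cat"

    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1

    words_by_use = sorted(counts, key=lambda w: counts[w], reverse=True)
    print(words_by_use)        # ['the', 'cat', 'saw', 'dog', 'and']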
I'd start with a quicksort and go from there.
Check out the wiki page on sorting algorithms, though, to learn the differences.
You should try an MSD radix sort. It will sort your entries in lexicographical order. Here is a Google Code project you might be interested in.
Have a look at the link: a pictorial representation of how the different algorithms work. This will give you a hint!
Sorting Algorithms
You can get better performance than quicksort on this particular problem, assuming that if two words occur the same number of times it doesn't matter in which order you output them.
First step: create a hash map with the words as keys and frequencies as the associated values. You will fill this hash map in as you parse the file. While you are doing this, make sure to keep track of the highest frequency encountered. This step is O(n).
Second step: create a list whose number of entries equals the highest frequency from the first step. Each index of this list holds a list of the words whose frequency count equals that index, so words that occur 3 times in the document go in list[3], for example. Iterate through the hash map and insert the words into the appropriate slots in the list. This step is O(n).
Third step: iterate through the list in reverse and output all the words. This step is O(n).
Overall this algorithm accomplishes your task in O(n) time rather than the O(n log n) required by quicksort.
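A minimal Python sketch of those three steps (variable names are my own):

    # Step 1: word -> frequency map, tracking the highest frequency seen.
    text = "the cat saw the dog and the cat"
    freq, max_freq = {}, 0
    for word in text.split():
        freq[word] = freq.get(word, 0) + 1
        max_freq = max(max_freq, freq[word])

    # Step 2: buckets[f] holds every word that occurs exactly f times.
    buckets = [[] for _ in range(max_freq + 1)]
    for word, f in freq.items():
        buckets[f].append(word)

    # Step 3: walk the buckets from the highest frequency down to 1.
    for f in range(max_freq, 0, -1):
        for word in buckets[f]:
            print(word, f)
    # the 3, cat 2, saw 1, dog 1, and 1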
In almost every case I've ever tested, Quicksort worked the best for me. However, I did have two cases where Combsort was the best. Could have been that combsort was better in those cases because the code was so small, or due to some quirk in how ordered the data was.
Any time sorting shows up in my profile, I try the major sorts. I've never had anything that topped both Quicksort and Combsort.
I think you want to do something like what is explained in the post below:
http://karephul.blogspot.com/2008/12/groovy-closures.html
Languages which support closures make the solution much easier, like LINQ, as Eric mentioned.
For large sets you can use what is known in information retrieval as "sort-based indexing", but for 50,000 words you can use the following:
read the entire file into a buffer;
parse the buffer and build a token vector with
struct token { char *term; int termlen; };
where term is a pointer to the word in the buffer;
sort the table by term (lexicographical order);
set entrynum = 0 and iterate through the term vector; whenever a term is new, store it in a vector of
struct { char *term; int frequency; };
at index entrynum, set its frequency to 1 and increment the entry number, otherwise increment the frequency;
finally, sort the vector by frequency in descending order.
You can also try implementing digital trees, also known as tries. Here is the link.

What sort of sorted data structure is optimized for finding items within a range?

Say I have a bunch of objects with dates and I regularly want to find all the objects that fall between two arbitrary dates. What sort of data structure would be good for this?
A binary search tree sounds like what you're looking for.
You can use it to find all the objects in O(log(N) + K), where N is the total number of objects and K is the number of objects that are actually in that range (provided that the tree is balanced). Insertion/removal is O(log(N)).
Most languages have a built-in implementation of this.
C++:
http://www.cplusplus.com/reference/stl/set/
Java:
http://java.sun.com/j2se/1.4.2/docs/api/java/util/TreeSet.html
You can find the lower bound of the range (in log(n)) and then iterate from there until you reach the upper bound.
Assuming that by sorted you mean sorted by date, an array will do it.
Do a binary search to find the index that's >= the start date. You can then either do another search to find the index that's <= the end date, leaving you with an offset and count of items, or, if you're going to process them anyway, just iterate through the list until you exceed the end date.
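A small Python sketch of that array approach, using the standard bisect module on a list kept sorted by date (the sample dates are made up):

    import bisect
    from datetime import date

    # Events kept in an array sorted by date.
    events = [date(2023, 1, 5), date(2023, 2, 14), date(2023, 3, 1), date(2023, 6, 30)]

    def in_range(events, start, end):
        lo = bisect.bisect_left(events, start)    # first index >= start
        hi = bisect.bisect_right(events, end)     # first index > end
        return events[lo:hi]                      # O(log N + K) overall

    print(in_range(events, date(2023, 2, 1), date(2023, 4, 1)))
    # [datetime.date(2023, 2, 14), datetime.date(2023, 3, 1)]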
It's hard to give a good answer without a little more detail.
What kind of performance do you need?
If linear is fine then I would just use a list of dates and iterate through the list collecting all dates that fall within the range. As Andrew Grant suggested.
Do you have duplicates in the list?
If you need to have repeated dates in your collection then most implementations of a binary tree would probably be out. Something like Java's TreeSet is a set implementation and doesn't allow repeated elements.
What are the access characteristics? Lots of lookups with few updates, vice-versa, or fairly even?
Most data structures have trade-offs between lookups and updates. If you're doing lots of updates then a data structure that is optimized for lookups won't be so great.
So what are the access characteristics of the data structure, what kind of performance do you need, and what are structural characteristics that it must support (e.g. must allow repeated elements)?
If you need to make random-access modifications: a tree, as in v3's answer. Find the bottom of the range by lookup, then count upwards. Inserting or deleting a node is O(log N). stbuton makes a good point that if you want to allow duplicates (as seems plausible for datestamped events), then you don't want a tree-based set.
If you do not need to make random-access modifications: a sorted array (or vector or whatever). Find the location of the start of the range by binary chop, then count upwards. Inserting or deleting is O(N) in the middle. Duplicates are easy.
Algorithmic performance of lookups is the same in both cases, O(M + log N), where M is the size of the range. But the array uses less memory per entry, and might be faster to count through the range, because after the binary chop it's just forward sequential memory access rather than following pointers.
In both cases you can arrange for insertion at the end to be (amortised) O(1). For the tree, keep a record of the end element at the head, and you get an O(1) bound. For the array, grow it exponentially and you get amortised O(1). This is useful if the changes you make are always or almost-always "add a new event with the current time", since time is (you'd hope) a non-decreasing quantity. If you're using system time then of course you'd have to check, to avoid accidents when the clock resets backwards.
Alternative answer: an SQL table, and let the database optimise how it wants. And Google's BigTable structure is specifically designed to make queries fast, by ensuring that the result of any query is always a consecutive sequence from a pre-prepared index :-)
You want a structure that keeps your objects sorted by date whenever you insert or remove one, and in which finding the boundary of the segment of all objects later than or earlier than a given date is easy.
A heap seems the perfect candidate. In practical applications, heaps are simply represented by an array in which all the objects are stored in order. Seeing that sorted array as a heap is simply a way to make insertions of new objects and deletions happen in the right place, in O(log(n)).
When you have to find all the objects between date A (excluded) and B (included), find the position of A (or its insert position, that is, the position of the earliest element later than A) and the position of B (or B's insert position), and return all the objects between those positions (which is simply the section between those positions in the array/heap).

Resources