Hash-maps or search tree? - sorting

The problem is as follows: Given is a list of cities and their countries, population and geo-coordinates. You should read this data, save it and answer it in an endless loop of the following type:
Request: a prefix (e.g., free).
Answer: all states beginning with this prefix ("case-insensitive")
and their associated data (country + population + geo-coordinates).
The cities should be sorted by population (highest population first).
Which data structure are the most suitable for the described problem ?
First Part : My Thoughts are hanging between Trie and Hashmap. Although i tend to the Trie more because i'm dealing with prefix requests , and Trie is basically according to Wikipedia :
"a trie, also called digital tree and sometimes radix tree or prefix tree (as they can be searched by prefixes), is a kind of search tree—an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings".
in addition to that in terms of Storage and reading data Trie has the advantage over Hash-maps.
Second part: returning the sorted cities by population would be a little bit challenging when we speak about Time Complexity.If i'm thinking in the right direction i should save the values of the keys as lists and it will be easier to sort just the returning list , so i don't have to save it sorted to save some times.
Please share you thoughts and correct me if i'm wrong .

There are pros of cons of picking vanilla tries and vanilla hashmaps. In general, for autocomplete systems, the structure of a trie is extremely useful because you're usually searching for prefixes and the user would like to see the words that begin with the string that they have just entered.
However, there is a method to make the best use of both of these data structures, it is called a Hash Trie (implementation: http://www.sanfoundry.com/java-program-implement-hash-trie/). So the way you would implement this is by using the structure of the trie, but the final node is the actual string it refers to. In python, this is done using dictionaries instead of lists while implementing the trie.
For the second half of the question, a list would be your best bet, in essence a list of tuples (population, city) and sort by the population and return the cities. Regarding it being "easier" to sort, I'm not sure if I agree with this, easy is a relevant term and there's really no way of saying that it's easier than, maybe storing it in a tree and then returning the Pre-Order Traversal of the tree. Essentially, if you're using comparison based sort, it won't get better than nlog (n).

Related

Algorithm for searching dictionary in both direction in O(log n) <= speed < O(n)

I am trying to implement an (Language1 - Language2) dictionary.
I want to make an search algorithm in both direction that speed is faster than O(n)
For example, if (hello, hola) is one pair,
SEARCH_SPANISH_BY_ENGLISH (hello) = "hola", and
SEARCH_ENGLISH_BY_SPANISH (hola) = "hello"
If you have an idea for an algorithm, can you tell me how to set up an dictionary and implement an search algorithm? It seems like I have to use an divide and conquer but I am not sure how. Thanks.
The side of dictionary should be singular, which means I cannot build both English-Spanish and Spanish-English dictionary.
You can have two dictionary as the search time will remain constant. However, if you still wish to have two way dictionary, then Jon Skeet's version is here:
Getting key of value of a generic Dictionary?
Any balanced tree, or sorted array, or hashtable data structure will give you better than O(n) when searching.
For the bidirectionality, you can just have two data structures, one for English-Español the other for Español-English. That doesn't necessarily mean two complete dictionaries, just that you need a search index of some description for either direction.
However, since you appear to have a one-to-one mapping of words, you can just maintain one dictionary as persistent and build the reverse one at run time.

data structure for finding the substring from large number of strings

My problem statement is that I am given millions of strings, and I have to find one sub-string which can be present in any of those strings.
e.g. given is "xyzoverflowasxs, werstackweq" etc. and I have to find a given sub string named as "stack", which should return "werstackweq". What kind of data structure we can use for solving this problem ?
I think we can use suffix tree for this , but wanted some more suggestions for this problem.
I think the way to go is with a dictionary holding the actual words, and another data structure pointing to entries within this dictionary. One way to go would be with suffix trees and their variants, as mentioned in the question and the comments. I think the following is a far simpler (heuristic) alternative.
Say you choose some integer k. For each of your strings, finding the k Rabin Fingerprints of length-k within each string should be efficient and easy (any language has an implementation).
So, for a given k, you could hold two data structures:
A dictionary of the words, say a hash table based on collision lists
A dictionary mapping each fingerprint to an array of the linked-list node pointers in the first data structure.
Given a word of length k or greater, you would choose a k subword, calculate its Rabin fingerprint, find the words which contain this fingerprint, and check if they indeed contain this word.
The question is which k to use, and whether to use multiple such k. I would try this experimentally (starting with simultaneously a few small k values for, say, 1, 2, and 3, and also a couple of larger ones). The performance of this heuristic anyway depends on the distribution of your dictionary and queries.

Data structure for Phonebook

A cellular phone company is going to launch new model of an existing smart phone having maximum of 2 gigabytes memory. Being a programmer, you are given a task to develop application for better utilization of its phone book resource.
You should keep in mind the fact that a single contact can be stored as “First Name”, “Last Name” and “phone number” in alphabetical order. With the passage of time phone book updates as new contact comes or removed from the phone book.
Following are two factors which you must keep in mind while performing the required task.
Space limitations, as you know the available space is limited.Time required for accessing a particular contact, which must not exceed a given threshold.
As a programmer, which data structure will you use to perform the said task, provide proper reasons to support your answer?
I will use trie.
In computer science, a trie, also called digital tree and sometimes radix tree or prefix tree (as they can be searched by prefixes), is an
ordered tree data structure that is used to store a dynamic set or
associative array where the keys are usually strings. Unlike a binary
search tree, no node in the tree stores the key associated with that
node; instead, its position in the tree defines the key with which it
is associated. All the descendants of a node have a common prefix of
the string associated with that node, and the root is associated with
the empty string. Values are not necessarily associated with every
node. Rather, values tend only to be associated with leaves, and with
some inner nodes that correspond to keys of interest. For the
space-optimized presentation of prefix tree, see compact prefix tree.
In the example shown, keys are listed in the nodes and values below
them. Each complete English word has an arbitrary integer value
associated with it. A trie can be seen as a tree-shaped deterministic
finite automaton. Each finite language is generated by a trie
automaton, and each trie can be compressed into a deterministic
acyclic finite state automaton.
Image of trie from Wikipedia page
A trie has a number of advantages over binary search trees.A trie can also be used to replace a hash table, over which it has the following advantages:
Looking up data in a trie is faster in the worst case, O(m) time
(where m is the length of a search string), compared to an imperfect
hash table. An imperfect hash table can have key collisions. The
worst-case lookup speed in an imperfect hash table is O(N) time, but
far more typically is O(1), with O(m) time spent evaluating the
hash.
There is no need to provide a hash function or to change hash
functions as more keys are added to a trie.
A trie can provide an alphabetical ordering of the entries by key.
According to Wikipedia page, Trie is a well-suited data structure for representing Predictive Text or Autocomplete dictionary.
For storing the phone numbers, we just need to add an additional node at the end of the trie which contains the phone number.
Also, we need to build another trie for storing the numbers. In this case, instead of letters, number become a node in the trie. The last node, that is leaf node contains the name of the person who owns that number. By using these two tries, we can easily implement phone book. And we can search with respect to the number and/or name of the person.
A Paragraph from Wikipedia article:
A common application of a trie is storing a predictive text or
autocomplete dictionary, such as found on a mobile telephone. Such
applications take advantage of a trie's ability to quickly search for,
insert, and delete entries
I'm not very experienced in programming, but I think that Hashing with Chaining could be an appropriate method to follow for the phonebook. I believe that this kind of structure covers all the requirements you asked for.
It allocates only the memory it needs for the data to store plus the pointers for the next node as it is implemented by using dynamically allocated nodes in linked lists.
Search, insertion and deletion all have O(n) worst case. More often 0(hash(x)).
If you hash the elements by the first letter of the Last name you can gain some sorting time. You will get 26 lists (if all first names begin with letters) which you will need to sort. And Linked Lists have O(n logn) worst case.
I hope i didn't mess up with my answer.

Self-sorted data structure with random access

I need to implement self-sorted data structure with random access. Any ideas?
A self sorted data structure can be binary search trees. If you want a self sorted data structure and a self balanced one. AVL tree is the way to go. Retrieval time will be O(lgn) for random access.
Maintaining a sorted list and accessing it arbitrarily requires at least O(lgN) / operation. So, look for AVL, red-black trees, treaps or any other similar data structure and enrich them to support random indexing. I suggest treaps since they are the easiest to understand/implement.
One way to enrich the treap tree is to keep in each node the count of nodes in the subtree rooted at that node. You'll have to update the count when you modify the tree (eg: insertion/deletion).
I'm not too much involved lately with data structures implementation. Probably this answer is not an answer at all... you should see "Introduction to algorithms" written by Thomas Cormen. That book has many "recipes" with explanations about the inner workings of many data structures.
On the other hand you have to take into account how much time do you want to spend writing an algorithm, the size of the input and the if there is an actual necessity of an special kind of datastructure.
I see one thing missing from the answers here, the Skiplist
https://en.wikipedia.org/wiki/Skip_list
You get order automatically, there is a probabilistic element to search and creation.
Fits the question no worse than binary trees.
Self sorting is a little bit to ambigious. First of all
What kind of data structure?
There are a lot of different data structures out there, such as:
Linked list
Double linked list
Binary tree
Hash set / map
Stack
Heap
And many more and each of them behave differently than others and have their benefits of course.
Now, not all of them could or should be self-sorting, such as the Stack, it would be weird if that one were self-sorting.
However, the Linked List and the Binary Tree could be self sorting, and for this you could sort it in different ways and on different times.
For Linked Lists
I would preffere Insertion sort for this, you can read various good articles about this on both wikis and other places. I like the pasted link though. Look at it and try to understand the concept.
If you want to sort after it is inserted, i.e. on random times, well then you can just implement a sorting algororithm different than insertion sort maybe, bubblesort or maybe quicksort, I would avoid bubblesort though, it's a lot slower! But easier to gasp the mind around.
Random Access
Random is always something thats being discusses around so have a read about how to perform good randomization and you will be on your way, if you have a linked list and have a "getAt"-method, you could just randomize an index between 0 and n and get the item at that index.

How Do I Choose Between a Hash Table and a Trie (Prefix Tree)?

So if I have to choose between a hash table or a prefix tree what are the discriminating factors that would lead me to choose one over the other. From my own naive point of view it seems as though using a trie has some extra overhead since it isn't stored as an array but that in terms of run time (assuming the longest key is the longest english word) it can be essentially O(1) (in relation to the upper bound). Maybe the longest english word is 50 characters?
Hash tables are instant look up once you get the index. Hashing the key to get the index however seems like it could easily take near 50 steps.
Can someone provide me a more experienced perspective on this? Thanks!
Advantages of tries:
The basics:
Predictable O(k) lookup time where k is the size of the key
Lookup can take less than k time if it's not there
Supports ordered traversal
No need for a hash function
Deletion is straightforward
New operations:
You can quickly look up prefixes of keys, enumerate all entries with a given prefix, etc.
Advantages of linked structure:
If there are many common prefixes, the space they require is shared.
Immutable tries can share structure. Instead of updating a trie in place, you can build a new one that's different only along one branch, elsewhere pointing into the old trie. This can be useful for concurrency, multiple simultaneous versions of a table, etc.
An immutable trie is compressible. That is, it can share structure on the suffixes as well, by hash-consing.
Advantages of hashtables:
Everyone knows hashtables, right? Your system will already have a nice well-optimized implementation, faster than tries for most purposes.
Your keys need not have any special structure.
More space-efficient than the obvious linked trie structure (see comments below)
It all depends on what problem you're trying to solve. If all you need to do is insertions and lookups, go with a hash table. If you need to solve more complex problems such as prefix-related queries, then a trie might be the better solution.
Everyone knows hash table and its uses but it is not exactly constant look up time , it depends on how big the hash table is , the computational complexity of the hash function.
Creating huge hash tables for efficient lookup is not an elegant solution in most of the industrial scenarios where even small latency/scalability matters (e.g.: high frequency trading). You have to care about the data structures to be optimized for space it takes up in memory too to reduce cache miss.
A very good example where trie better suits the requirements is messaging middleware . You have a million subscribers and publishers of messages to various categories (in JMS terms - Topics or exchanges) , in such cases if you want to filter out messages based on topics (which are actually strings) , you definitely do not want create hash table for the million subscriptions with million topics . A better approach is store the topics in trie , so when filtering is done based on topic match , its complexity is independent of number of topics/subscriptions/publishers (only depends on the length of string). I like it because you can be creative with this data structure to optimize space requirements and hence have lower cache miss.
Use a tree:
If you need auto complete feature
Find all words beginning with 'a' or 'axe' so on.
A suffix tree is a special form of a tree. Suffix trees have a whole list of advantages that hash cannot cover.
Insertion and lookup on a trie is linear with the lengh of the input string O(s).
A hash will give you a O(1) for lookup ans insertion, but first you have to calculate the hash based on the input string which again is O(s).
Conclussion, the asymptotic time complexity is linear in both cases.
The trie has some more overhead from data perspective, but you can choose a compressed trie which will put you again, more or less on a tie with the hash table.
To break the tie ask yourself this question: Do i need to lookup for full words only? Or do I need to return all words matching a prefix? (As in a predictive text input system ). For the first case, go for a hash. It is simpler and cleaner code. Easier to test and maintain. For a more ellaborated use case where prefixes or sufixes matter, go for a trie.
And if you do it just for fun, implementing a trie would put a Sunday afternoon to a good use.
There's something I haven't seen anyone mention explicitly that I think is important to keep in mind. Both hash tables and tries of various kinds will typically have O(k) operations, where k is the length of the string in bits (or equivalently in chars).
This is assuming you have a good hash function. If you don't want "farm" and "farm animals" to hash to the same value, then the hash function will have to use all the bits of the key, and so hashing "farm animals" should take about twice as long as "farm" (unless you're in some sort of rolling hash scenario, but there are somewhat similar operation-saving scenarios with tries too). And with a vanilla trie, it's clear why inserting "farm animals" will take about twice as long as just "farm". In the long run it's true with compressed tries as well.
HashTable implementation is space efficient as compared to basic Trie implementation. But with strings, ordering is necessary in most of the practical applications. But HashTable totally disturbs the lexographical order. Now, if your application is doing operations based on lexographical order (like partial search, all strings with given prefix, all words in sorted order), you should use Tries. For only lookup, HashTable should be used (as arguably, it gives minimum lookup time).
P.S.: Other than these, Ternary Search Trees (TSTs) would be an excellent choice. Its lookup time is more than HashTable, but is time-efficient in all other operations. Also, its more space efficient than tries.
Some (usually embedded, real-time) applications require that the processing time be independent of the data. In that case, a hash table can guarantee a known execution time, while a trie varies based on the data.

Resources