Data structure for Phonebook - data-structures

A cellular phone company is going to launch a new model of an existing smartphone with a maximum of 2 gigabytes of memory. As a programmer, you are given the task of developing an application that makes better use of its phone book resource.
Keep in mind that a single contact is stored as “First Name”, “Last Name” and “phone number”, in alphabetical order. Over time the phone book changes as new contacts are added and existing ones are removed.
You must keep the following two factors in mind while performing the task.
Space limitations: as you know, the available space is limited.
Time required for accessing a particular contact, which must not exceed a given threshold.
As a programmer, which data structure will you use for this task? Give proper reasons to support your answer.

I would use a trie.
In computer science, a trie, also called digital tree and sometimes radix tree or prefix tree (as they can be searched by prefixes), is an
ordered tree data structure that is used to store a dynamic set or
associative array where the keys are usually strings. Unlike a binary
search tree, no node in the tree stores the key associated with that
node; instead, its position in the tree defines the key with which it
is associated. All the descendants of a node have a common prefix of
the string associated with that node, and the root is associated with
the empty string. Values are not necessarily associated with every
node. Rather, values tend only to be associated with leaves, and with
some inner nodes that correspond to keys of interest. For the
space-optimized presentation of prefix tree, see compact prefix tree.
In the example shown, keys are listed in the nodes and values below
them. Each complete English word has an arbitrary integer value
associated with it. A trie can be seen as a tree-shaped deterministic
finite automaton. Each finite language is generated by a trie
automaton, and each trie can be compressed into a deterministic
acyclic finite state automaton.
Image of trie from Wikipedia page
A trie has a number of advantages over binary search trees. A trie can also be used to replace a hash table, over which it has the following advantages:
Looking up data in a trie is faster in the worst case, O(m) time
(where m is the length of a search string), compared to an imperfect
hash table. An imperfect hash table can have key collisions. The
worst-case lookup speed in an imperfect hash table is O(N) time, but
far more typically is O(1), with O(m) time spent evaluating the
hash.
There is no need to provide a hash function or to change hash
functions as more keys are added to a trie.
A trie can provide an alphabetical ordering of the entries by key.
According to the Wikipedia page, a trie is well suited to representing a predictive-text or autocomplete dictionary.
To store the phone numbers, we just need to add an additional node at the end of each name's path in the trie, and that node holds the phone number.
We also need to build a second trie for the numbers. In this one, digits rather than letters become the nodes, and the last node of each path (the leaf) holds the name of the person who owns that number. With these two tries we can easily implement the phone book and search by number, by name, or both.
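The two-trie idea can be sketched roughly as follows. This is only a minimal illustration, not a tuned implementation: the class names TrieNode, Trie and PhoneBook, the dictionary-based children, and the sample contacts are all assumptions made for the example.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # next character -> child node
        self.value = None    # payload stored on the node where a key ends


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key, value):
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.value = value

    def get(self, key):
        node = self.root
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.value


class PhoneBook:
    """Hypothetical wrapper: one trie keyed by name, one keyed by digits."""

    def __init__(self):
        self.by_name = Trie()    # letters -> phone number
        self.by_number = Trie()  # digits  -> name

    def add(self, name, number):
        self.by_name.insert(name.lower(), number)
        self.by_number.insert(number, name)

    def find_number(self, name):
        return self.by_name.get(name.lower())

    def find_name(self, number):
        return self.by_number.get(number)


book = PhoneBook()
book.add("Ada Lovelace", "5550100")   # made-up sample contacts
book.add("Adam Smith", "5550199")
print(book.find_number("ada lovelace"))
print(book.find_name("5550199"))
```

Both lookups walk one node per character of the key, so their cost depends on the length of the name or number rather than on how many contacts are stored.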
A paragraph from the Wikipedia article:
A common application of a trie is storing a predictive text or
autocomplete dictionary, such as found on a mobile telephone. Such
applications take advantage of a trie's ability to quickly search for,
insert, and delete entries

I'm not very experienced in programming, but I think that hashing with chaining could be an appropriate method for the phone book. I believe this kind of structure covers all the requirements you asked for.
It allocates only the memory needed for the stored data plus the pointers to the next node, since it is implemented with dynamically allocated linked-list nodes.
Search, insertion and deletion are all O(n) in the worst case. More often they cost O(hash(x)), i.e. the time to compute the hash.
If you hash the elements by the first letter of the last name, you save some sorting time: you get 26 lists (if all last names begin with letters), each of which you still need to sort. Sorting a linked list takes O(n log n) in the worst case.
I hope I didn't mess up my answer.

Related

Hash-maps or search tree?

The problem is as follows: given is a list of cities with their countries, populations and geo-coordinates. You should read this data, save it, and answer requests of the following type in an endless loop:
Request: a prefix (e.g., free).
Answer: all cities beginning with this prefix ("case-insensitive")
and their associated data (country + population + geo-coordinates).
The cities should be sorted by population (highest population first).
Which data structure is the most suitable for the described problem?
First part: I am torn between a trie and a hash map, although I lean toward the trie because I'm dealing with prefix requests, and according to Wikipedia a trie is basically:
"a trie, also called digital tree and sometimes radix tree or prefix tree (as they can be searched by prefixes), is a kind of search tree—an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings".
In addition, in terms of storing and reading data, the trie has the advantage over hash maps.
Second part: returning the cities sorted by population is a bit challenging in terms of time complexity. If I'm thinking in the right direction, I should store the values of the keys as lists and sort only the list being returned, so that I don't have to keep the data sorted all the time.
Please share your thoughts and correct me if I'm wrong.
There are pros and cons to picking vanilla tries and vanilla hash maps. In general, for autocomplete systems, the structure of a trie is extremely useful because you're usually searching for prefixes and the user would like to see the words that begin with the string they have just entered.
However, there is a way to make the best use of both data structures: the hash trie (implementation: http://www.sanfoundry.com/java-program-implement-hash-trie/). You implement it using the structure of a trie, except that the final node holds the actual string it refers to. In Python, this is done by using dictionaries instead of lists when implementing the trie.
For the second half of the question, a list is your best bet: in essence, a list of tuples (population, city) that you sort by population before returning the cities. As for it being "easier" to sort, I'm not sure I agree; easy is a relative term, and there's really no way of saying that it's easier than, say, storing the cities in a tree and returning an in-order traversal of it. Essentially, if you're using a comparison-based sort, it won't get better than O(n log n).
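To make the combination concrete, here is a small sketch of a dictionary-based trie that answers a case-insensitive prefix request and sorts only the returned list by population. The names Node, insert and with_prefix are made up for this example, and the countries and populations are placeholder values, not real data.

```python
class Node:
    def __init__(self):
        self.children = {}  # character -> child Node (a dict, as suggested above)
        self.cities = []    # (population, name, country) records ending at this node


def insert(root, name, country, population):
    node = root
    for ch in name.lower():          # lowercase keys give case-insensitive matching
        node = node.children.setdefault(ch, Node())
    node.cities.append((population, name, country))


def with_prefix(root, prefix):
    # Walk down to the node that represents the prefix.
    node = root
    for ch in prefix.lower():
        node = node.children.get(ch)
        if node is None:
            return []
    # Collect every record stored in the subtree below that node.
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        found.extend(current.cities)
        stack.extend(current.children.values())
    # Sort only the answer, highest population first.
    found.sort(key=lambda record: record[0], reverse=True)
    return found


root = Node()
insert(root, "Freetown", "country A", 300)   # placeholder populations
insert(root, "Freeport", "country B", 100)
insert(root, "Fresno", "country C", 200)
print(with_prefix(root, "FREE"))
```

Sorting at query time costs O(r log r) for r matching cities; keeping each node's list pre-sorted would move that cost to insertion instead.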

What is the proper data structure to store self-sorting list with repeating keys?

I need something that will work in O(log(n)) complexity, and I thought about AVL trees, but the problem is that some keys may repeat themselves (score of a person for example), so I can't think of how to implement it as a tree. What is a proper way to do this?
There are many options available. Most flavors of binary search trees can easily be modified to allow for nodes with duplicated values, since the balancing operations (usually) purely consist of rotations, which keep the sequence in order. For cases like these, you'd just do a normal BST insertion, but every time you see a duplicated value, you just arbitrarily move to the left or the right and continue as if the value were distinct.
Skiplists are particularly easy to update to support multiple copies of each key, since they don't do any complicated structural updates on insertions or deletions.
If you don't have auxiliary information associated with each key, then another simpler option would be to store a standard binary search tree, but to augment each node with a "count" field indicating how many logical copies of that key exist. Every time you do an insertion, if the key doesn't exist, you create it with count 1. If it already exists, you just increment the count in the existing node. Deletions would be implemented analogously.
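A rough sketch of the count-field idea follows. To keep it short it uses a plain, unbalanced binary search tree, so it does not by itself guarantee O(log n); an AVL or red-black tree would add rotations on top of the same logic. The names Node, insert and in_order and the sample scores are assumptions for illustration.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.count = 1      # number of logical copies of this key
        self.left = None
        self.right = None


def insert(node, key):
    """Insert key into the subtree rooted at node and return the subtree root."""
    if node is None:
        return Node(key)
    if key == node.key:
        node.count += 1     # duplicate: no new node, just bump the counter
    elif key < node.key:
        node.left = insert(node.left, key)
    else:
        node.right = insert(node.right, key)
    return node


def in_order(node):
    """Yield keys in sorted order, repeating each key `count` times."""
    if node is not None:
        yield from in_order(node.left)
        for _ in range(node.count):
            yield node.key
        yield from in_order(node.right)


root = None
for score in [70, 85, 70, 92, 85, 70]:
    root = insert(root, score)
print(list(in_order(root)))
```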
Of course, if you don't want to roll your own data structure, just go and find a good implementation of a multimap or multiset, which should get the job done for you quite nicely. Depending on your Programming Language of Choice, you might even find these in the standard libraries. :-)

What's the best way to traverse a large dictionary of words?

Let's say I'm looking for a word that may or may not be in a dictionary of 95k words. I cannot use word length to facilitate the search. My question is about the fastest way to find the word without doing an O(n) lookup.
Here are my two thoughts:
First, store the words in a hash table, where looking up a word is O(1). This seems the best scenario to me, but while going through different websites I saw that using a trie was also suggested. My question about this is whether it's practical to have a trie that holds so many words.
The lookup would be O(k) in this case.
So what is the best way of finding a word in a large dictionary?
Optimality depends on your use case: do you care about lookup time or space? (Also, do you care about inserting new words?)
The best you can do time-wise is to use a hash table, but for a dictionary, it is space-inefficient. A trie compresses the space requirement because it stores prefixes, not the entire word, but takes longer to look up. So, to answer your question, it is more space efficient to have a trie with a large number of words than a hash table.
If you are just searching for a single word, the cost of setting up a hash table or tree structure would exceed a linear search. These structures become (very) efficient when their costs are amortized over (very) many uses.
If the dictionary is sorted (and why wouldn't a dictionary be?), then you can look for a single word in log(n) time with a binary search through the file, no additional structures needed.
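As a small illustration of the sorted-dictionary approach, the sketch below uses Python's standard bisect module on an in-memory sorted list. It assumes the words have already been loaded into a list; searching directly inside a file on disk would need extra seek logic that is not shown. The function name contains and the sample words are made up.

```python
from bisect import bisect_left


def contains(sorted_words, word):
    """Binary search: O(log n) string comparisons on a sorted list."""
    i = bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word


words = sorted(["apple", "banana", "cherry", "date"])
print(contains(words, "cherry"))
print(contains(words, "coconut"))
```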
I think the best way to find a word in a dictionary is a B+ tree. Let me explain the reason.
Let's say you have a root block of 10 strings. The strings in the block are sorted. Each of these 10 strings is followed by a pointer to another block of 10 strings, and so on. So the only thing you have to do is compare your key word with the strings (StringCompare), starting with the first one, until you find a word that is smaller in comparison.
If we assume that each string has next to it a pointer to a block of words that are smaller in comparison, it takes 5 steps and 5 comparisons to reach the final block of data, which may or may not contain your key word.
With 5 comparisons plus the comparisons in the final block, you are searching a dictionary of 10*10*10*10*10 words.
The algorithm runs in logarithmic time: the log of 100000, with the number of strings per block as the base. If each block holds 10 words, you need 5 steps.
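The arithmetic behind that step count can be checked with a few lines. This is just a sketch of the calculation under the answer's simplifying assumption that every block holds the same number of words; levels_needed is a hypothetical helper name.

```python
def levels_needed(total_words, words_per_block):
    """Smallest number of levels whose combined fan-out covers total_words."""
    levels, capacity = 0, 1
    while capacity < total_words:
        capacity *= words_per_block   # each extra level multiplies the reach
        levels += 1
    return levels


print(levels_needed(100_000, 10))    # 10*10*10*10*10 words -> 5 levels
print(levels_needed(100_000, 100))   # wider blocks need fewer levels
```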
I must mention that only the root of the tree has to be kept in RAM. All the other blocks can be stored on the hard drive without significant loss in performance, because so few steps are needed.
Hope I explained that right :D At least I tried! Have fun.
A trie is preferable because this data structure can be faster than a hash table. Hash tables are O(1) only in the ideal case; in real-world applications collisions can occur. The various kinds of trie don't suffer from this.
Another point is compression. Tries are much more compact than hash tables. A hash table requires some spare space for efficient insert operations: if the load factor of the hash table is close to 100%, insert operations take a very long time.
With hash tables you must compare your key with at least one key from the dictionary, and that key comparison takes O(k), where k is the key length. With a trie you are doing the same thing: your lookup operation is O(k).
Tries allow ordered traversal; hash tables don't.
There are many types of tries out there; for example, a ternary search trie is very good in this particular case. Array-mapped tries are also very fast compared to a regular hash table.

What are the advantages of storing all elements in the leaf nodes?

I'm reading Advanced Data Structures by Peter Brass.
At the beginning of the chapter on search trees, he states that there are two models of search trees: one where nodes contain the actual object (the value, if the tree is used as a dictionary), and another where all objects are stored in leaves and internal nodes are used only for comparisons.
What are the advantages of the second model over the first one?
One of the big advantages of a binary tree where data is only in the leaf nodes is that you can partition based on elements that are not in your dataset.
For example, if I have a possible dataset of 0-1 million, but the vast majority of items are either at the high end or low end but not in the middle, I may still want my first compare against 500,000 - even though that number is not in my data set. If every node had data, I could not do this. While not normally needed in theory, I've run into many times that partitioning based on a value outside my data simplified implementation.
B+ trees are an example of a case where all key/values are stored in leaf nodes. The primary advantage here is that since all items are in the leaf nodes, the leaf nodes can be linked together to form a linked list, which allows rapid in-order traversal. If you access a particular element, you can always find the next element in the sequence without visiting any parents, because the leaf nodes are linked together. Filesystems and database storage systems can take advantage of this structure for range searches and similar operations.
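The linked-leaf idea can be sketched on its own, leaving out the internal nodes entirely. In a real B+ tree the starting leaf would be found by descending from the root; here the first leaf is passed in directly. Leaf and range_scan are hypothetical names used only for this illustration.

```python
class Leaf:
    def __init__(self, keys):
        self.keys = keys    # sorted keys stored in this leaf
        self.next = None    # link to the right sibling leaf


def range_scan(start_leaf, low, high):
    """Yield keys in [low, high] by walking leaf to leaf, never visiting a parent."""
    leaf = start_leaf
    while leaf is not None:
        for key in leaf.keys:
            if key > high:
                return          # keys are sorted, so nothing further can match
            if key >= low:
                yield key
        leaf = leaf.next


a, b, c = Leaf([1, 4]), Leaf([7, 9]), Leaf([12, 15])
a.next, b.next = b, c
print(list(range_scan(a, 4, 12)))
```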
Let's say you are building a tree over some objects based on some complex criterion, for example one calculated from multiple properties. Sometimes you can't change the object to store the calculated value, and calculating the criterion is expensive. So you calculate the criterion only once and store the objects in leaves based on the result. Then, once your tree is complete, you can find the required object much faster, because you don't have to calculate the criterion for each tree node on your path.
Well, storing information objects in the nodes (we are talking about a trie in this case) is useful for fast retrieval of information. It is faster than storing things in an array or hash table, where the worst-case access is O(n); in a trie it is O(m), where m is the length of the key.
Look here:
https://en.wikipedia.org/wiki/Trie
In a search tree these operations can be much more complicated (look at AVL trees, O(log n)), so they can be slower and are more complex to implement.
Which data structure should you choose?
Well, this depends on what you want to do.

How Do I Choose Between a Hash Table and a Trie (Prefix Tree)?

So if I have to choose between a hash table and a prefix tree, what are the discriminating factors that would lead me to choose one over the other? From my own naive point of view, it seems that a trie has some extra overhead since it isn't stored as an array, but that in terms of run time (assuming the longest key is the longest English word) it can be essentially O(1) (in relation to the upper bound). Maybe the longest English word is 50 characters?
Hash tables are instant lookup once you get the index. Hashing the key to get the index, however, seems like it could easily take nearly 50 steps.
Can someone provide me a more experienced perspective on this? Thanks!
Advantages of tries:
The basics:
Predictable O(k) lookup time where k is the size of the key
Lookup can take less than k time if it's not there
Supports ordered traversal
No need for a hash function
Deletion is straightforward
New operations:
You can quickly look up prefixes of keys, enumerate all entries with a given prefix, etc.
Advantages of linked structure:
If there are many common prefixes, the space they require is shared.
Immutable tries can share structure. Instead of updating a trie in place, you can build a new one that's different only along one branch, elsewhere pointing into the old trie. This can be useful for concurrency, multiple simultaneous versions of a table, etc.
An immutable trie is compressible. That is, it can share structure on the suffixes as well, by hash-consing.
Advantages of hashtables:
Everyone knows hashtables, right? Your system will already have a nice well-optimized implementation, faster than tries for most purposes.
Your keys need not have any special structure.
More space-efficient than the obvious linked trie structure (see comments below)
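The structure-sharing point from the trie advantages above can be illustrated with a small persistent insert: instead of mutating nodes, it copies only the nodes along the path of the new key and reuses every other subtree. This is a minimal sketch; Node, insert and contains are made-up names, and hash-consing of suffixes is not shown.

```python
class Node:
    def __init__(self, children=None, is_key=False):
        self.children = children or {}  # character -> child Node
        self.is_key = is_key            # True if a stored key ends here


def insert(node, key, i=0):
    """Return a new root that also contains key; the old trie stays unchanged."""
    if node is None:
        node = Node()
    if i == len(key):
        return Node(dict(node.children), True)
    ch = key[i]
    children = dict(node.children)      # shallow copy: other subtrees are shared
    children[ch] = insert(node.children.get(ch), key, i + 1)
    return Node(children, node.is_key)


def contains(node, key):
    for ch in key:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_key


v1 = insert(insert(None, "car"), "dog")
v2 = insert(v1, "cat")                        # new version; v1 is untouched
print(contains(v1, "cat"), contains(v2, "cat"))
print(v1.children["d"] is v2.children["d"])   # the "dog" branch is shared
```

Both versions stay usable after the update, which is what makes this style convenient for keeping several simultaneous versions of a table.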
It all depends on what problem you're trying to solve. If all you need to do is insertions and lookups, go with a hash table. If you need to solve more complex problems such as prefix-related queries, then a trie might be the better solution.
Everyone knows hash tables and their uses, but lookup time is not exactly constant: it depends on how big the hash table is and on the computational complexity of the hash function.
Creating huge hash tables for efficient lookup is not an elegant solution in most industrial scenarios where even small latency or scalability costs matter (e.g. high-frequency trading). You also have to care about optimizing the data structures for the space they take up in memory, to reduce cache misses.
A very good example where a trie better suits the requirements is messaging middleware. You have a million subscribers and publishers of messages to various categories (in JMS terms, topics or exchanges). In such cases, if you want to filter messages based on topics (which are actually strings), you definitely do not want to create a hash table for the million subscriptions with a million topics. A better approach is to store the topics in a trie, so that when filtering is done by topic match, its complexity is independent of the number of topics, subscriptions and publishers (it depends only on the length of the string). I like it because you can be creative with this data structure to optimize space requirements and hence have fewer cache misses.
Use a tree:
If you need auto complete feature
If you need to find all words beginning with 'a' or 'axe' and so on.
A suffix tree is a special form of a tree. Suffix trees have a whole list of advantages that hash cannot cover.
Insertion and lookup on a trie are linear in the length of the input string, O(s).
A hash will give you O(1) for lookup and insertion, but first you have to calculate the hash based on the input string, which again is O(s).
Conclusion: the asymptotic time complexity is linear in both cases.
The trie has some more overhead from a data perspective, but you can choose a compressed trie, which will put you more or less on a tie with the hash table again.
To break the tie, ask yourself this question: do I need to look up full words only, or do I need to return all words matching a prefix (as in a predictive text input system)? For the first case, go for a hash. It is simpler and cleaner code, and easier to test and maintain. For a more elaborate use case where prefixes or suffixes matter, go for a trie.
And if you do it just for fun, implementing a trie would put a Sunday afternoon to a good use.
There's something I haven't seen anyone mention explicitly that I think is important to keep in mind. Both hash tables and tries of various kinds will typically have O(k) operations, where k is the length of the string in bits (or equivalently in chars).
This is assuming you have a good hash function. If you don't want "farm" and "farm animals" to hash to the same value, then the hash function will have to use all the bits of the key, and so hashing "farm animals" should take proportionally longer than hashing "farm" (unless you're in some sort of rolling hash scenario, but there are somewhat similar operation-saving scenarios with tries too). And with a vanilla trie, it's clear why inserting "farm animals" will take proportionally longer than inserting just "farm". In the long run it's true with compressed tries as well.
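To see why both costs grow with key length, the sketch below counts the per-character steps of a toy polynomial hash next to the number of child hops a vanilla trie would make. The hash shown is purely illustrative and is not the algorithm used by any particular library; simple_hash and trie_steps are hypothetical names.

```python
def simple_hash(key, buckets=1024):
    """Toy polynomial string hash that must read every character of the key."""
    h, steps = 0, 0
    for ch in key:
        h = (h * 31 + ord(ch)) % buckets
        steps += 1
    return h, steps


def trie_steps(key):
    """A vanilla trie follows one child link per character of the key."""
    return len(key)


for key in ("farm", "farm animals"):
    _, hash_steps = simple_hash(key)
    print(key, hash_steps, trie_steps(key))
```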
A HashTable implementation is space efficient compared to a basic trie implementation. But with strings, ordering is necessary in most practical applications, and a HashTable completely destroys the lexicographical order. So if your application performs operations based on lexicographical order (like partial search, all strings with a given prefix, or all words in sorted order), you should use tries. For lookup only, a HashTable should be used (as it arguably gives the minimum lookup time).
P.S.: Other than these, Ternary Search Trees (TSTs) would be an excellent choice. Their lookup time is higher than a HashTable's, but they are time-efficient in all other operations. They are also more space efficient than tries.
Some (usually embedded, real-time) applications require that the processing time be independent of the data. In that case, a hash table can guarantee a known execution time, while a trie varies based on the data.

Resources