Data Structure to implement a Word Dictionary - algorithm

Recently, I was asked in an interview about the use of data structures.
The question was: which data structure would I use to implement an English dictionary? The dictionary will contain a number of words under each letter of the alphabet, and each word will have one meaning. Also, how would I implement the data structure so that I can update, search, and select different words?
What do you suggest, and what is the reason for your suggestion?

A hash table would be the preferred data structure to implement a dictionary with update, search and selection capabilities.
A hash table is a data structure that stores key-value pairs. It is essentially an array of slots, together with a hash function h() that computes an index into the array at which an element can be inserted or searched for. So when an insertion is required, the hash function is used to find the location where the element should be placed.
Under reasonable assumptions, insertion is O(1): each insert costs constant time, assuming the hash function itself runs in O(1).
Lookup works the same way. If we need to find the meaning of the word x, we compute h(x), which tells us where x is located in the hash table, so we can look up words in O(1) as well.
However, O(1) insertion and search do not always hold. Nothing guarantees that the hash function won't produce the same output for two different inputs, in which case there is a collision. Various strategies can handle this scenario, notably separate chaining and open addressing, but in the worst case search and insertion are no longer O(1).
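As a minimal sketch of this in Java, assuming the standard java.util.HashMap and made-up sample entries, the dictionary operations map directly onto put, get, and remove:

import java.util.HashMap;
import java.util.Map;

public class WordDictionary {
    // word -> meaning; HashMap gives expected O(1) put/get/remove
    private final Map<String, String> entries = new HashMap<>();

    public void addOrUpdate(String word, String meaning) {
        entries.put(word, meaning);   // insert, or overwrite the old meaning
    }

    public String lookup(String word) {
        return entries.get(word);     // null if the word is absent
    }

    public void remove(String word) {
        entries.remove(word);
    }

    public static void main(String[] args) {
        WordDictionary dict = new WordDictionary();
        dict.addOrUpdate("apple", "a round fruit with firm white flesh");
        System.out.println(dict.lookup("apple"));
    }
}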

Related

What is the fastest way to look up an item from a small set of items by key?

Say I have a class with a fields array. The fields each have a name. Basically, like a SQL table.
class X {
foo: String
bar: String
...
}
What is the way to construct a data structure and algorithm to fetch a field by key such that it is (a) fast, in terms of number of operations, and (b) minimal, in terms of memory / data-structure size?
Obviously, if you knew the index of the field, the fastest approach would be to look it up by index in the array. But I need to find these by key.
Now, the number of keys will be relatively small for each class. In this example there are only 2 keys/fields.
One way to do this would be to create a hash table, such as this one in JS. You give it the key, and it iterates through each character of the key, running it through some mixing function. But this is, for one, dependent on the size of the key. That's not too bad for the field names I am expecting, which shouldn't be too large; let's say they usually aren't longer than 100 characters.
Another way would be to create a trie. You first have to build the trie; then, since each node of the trie holds one character, a lookup takes name.length steps to find the field.
But I'm wondering, since the number of fields will be small, why do we need to iterate over the keys in the string? A possibly simpler approach, as long as the number of fields is small, is to just iterate through the fields and do a direct string match against each field name.
But all of these 3 techniques would be roughly the same in terms of number of iterations.
Is there any other type of magic that will give you the fewest number of iterations/steps?
It seems that there could be a hashing algorithm that takes advantage of the fact that the number of items in the hash table will be small. You would create a new hash table for each class, giving it a "size" (the number of fields on the specific class used for this hash table). Maybe it could use this size information to construct a simple hashing algorithm that minimizes the number of iterations.
Is anything like that possible? If so, how would you do it? If not, it would be interesting to know why it's not possible to get any more optimal than these.
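Something like this does exist: a perfect hash function chosen for a fixed, known key set. As a toy sketch in Java (not a production scheme; the brute-force seed search and its bounds here are illustrative assumptions), one can search for a seed that maps the known field names to distinct slots of a table no larger than the field count:

import java.util.HashSet;
import java.util.Set;

public class TinyPerfectHash {
    // Brute-force a multiplier that makes
    // floorMod(key.hashCode() * seed, tableSize) collision-free for
    // this exact key set -- the "use the known size" idea above.
    static int findSeed(String[] keys, int tableSize) {
        for (int seed = 1; seed < 1_000_000; seed++) {
            Set<Integer> used = new HashSet<>();
            boolean ok = true;
            for (String k : keys) {
                int slot = Math.floorMod(k.hashCode() * seed, tableSize);
                if (!used.add(slot)) { ok = false; break; }
            }
            if (ok) return seed;
        }
        throw new IllegalStateException("no seed found; grow the table");
    }

    public static void main(String[] args) {
        String[] fields = {"foo", "bar"};
        int seed = findSeed(fields, fields.length);  // table as small as the class
        for (String f : fields) {
            System.out.println(f + " -> slot "
                + Math.floorMod(f.hashCode() * seed, fields.length));
        }
    }
}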
How "small" is the field list?
If you keep the field list sorted by key, you can use binary search.
For a very small number of fields (e.g. 4), it will perform about the same number of iterations and key comparisons as linear search, considering linear search's worst case. (Linear search would be very efficient, in both speed and memory, for this case.)
To beat the average case of linear search, you'd need more fields (e.g. 8).
This is as memory efficient as your linear search solution, and more memory efficient than the trie solution.
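A minimal sketch of that layout in Java (the field names and values are stand-ins): keep the names in one sorted array with values at matching indices, and let Arrays.binarySearch do the lookup:

import java.util.Arrays;

public class FieldTable {
    // Names sorted once at construction; values[i] belongs to names[i].
    private final String[] names;
    private final Object[] values;

    FieldTable(String[] sortedNames, Object[] matchingValues) {
        this.names = sortedNames;
        this.values = matchingValues;
    }

    Object get(String key) {
        int i = Arrays.binarySearch(names, key);  // O(log n) comparisons
        return i >= 0 ? values[i] : null;
    }

    public static void main(String[] args) {
        FieldTable t = new FieldTable(
            new String[]{"bar", "foo"},           // already sorted
            new Object[]{"bar-value", "foo-value"});
        System.out.println(t.get("foo"));         // foo-value
    }
}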

How does a HashMap retrieve a value with a hashed key?

I'm confused about the HashMap/Hashtable concept, specifically when people say a HashMap is faster than a List. I'm clear on the hashing concept, in which a hash code is computed for the given key and the value is stored accordingly.
But how does retrieval work?
For example, say I store n strings under n different keys in a HashMap.
If I want to retrieve the specific value associated with a specific key, how can it be returned in O(1) time? Won't the hashed key be compared with all the other keys?
Let's go on a word journey: say you have a bunch of weird M&M's printed with all the letters.
Now your job is to vend people M&M's in the letter/color combo of their choosing.
You have some choices about how to organize your shop. (This act of organization will be, metaphorically, our hash function.)
You can sort your M&M's into buckets by color, by letter, or by both. The question follows: which organization provides the fastest retrieval time for a specific request?
The answer is rather intuitive: the sorting that leaves the fewest different M&M's in each bucket facilitates the most efficient querying.
Let's say someone asks whether you have any green Q's. If all your M&M's are in a single bin or list or bucket or otherwise unstructured container, the answer will be far from accessible in O(1), compared to keeping an organized shop.
This analogy relies on the concept of separate chaining, where each hash key corresponds to a container of multiple elements.
Without this concept, the idea of hashing is more generally to spread keys uniformly throughout an array such that the amortized performance is constant. Collisions can be resolved through a variety of method variations, and the Wikipedia article will tell you all about it.
http://en.wikipedia.org/wiki/Hash_table
"If the set of key-value pairs is fixed and known ahead of time (so insertions and deletions are not allowed), one may reduce the average lookup cost by a careful choice of the hash function, bucket table size, and internal data structures. In particular, one may be able to devise a hash function that is collision-free, or even perfect "

Data Structure for Ascending Order Key Value Pairs with Further Insertion

I am implementing a table in which each entry consists of two integers. The entries must be ordered in ascending order by key (according to the first integer of each set). All elements will be added to the table as the program is running and must be put in the appropriate slot. Time complexity is of utmost importance and I will only use the insert, remove, and iterate functions.
Which Java data structure is ideal for this implementation?
I was thinking LinkedHashMap, as it maps keys to values (each entry in my table is two values). It also provides O(1) insert/remove functionality. However, it is not sorted. If entries could be efficiently inserted in the appropriate order as they come in, this would not be a bad idea, since the data structure would stay sorted. But I have not read of or thought of an efficient way to do this. (Maybe with a comparator?)
TreeMap has a time complexity of O(log n) for both add and remove. It maintains sorted order and has an iterator. But can we do better than O(log n)?
LinkedList has O(1) add/remove, and I could find the insertion point with a loop, but that scan is O(n), which seems inefficient as well.
It seems like TreeMap is the way to go. But I am not sure.
Any thoughts on the ideal data structure for this program are much appreciated. If I have missed an obvious answer, please let me know.
(It can be a data structure with a Set interface, as there will not be duplicates.)
A key-value pair calls for a Map. Since you need key-based ordering, that narrows it down to a SortedMap, in your case a TreeMap. As far as keeping elements sorted in a data structure goes, you can't do better than O(log n) per insertion. Look no further.
The basic idea is that you need to insert each key at its proper place, and for that your code needs to search for that proper place. For comparison-based searching, you cannot do better than binary search, which is O(log n); that is why I don't think you can insert in better than O(log n).
Hence, again, a TreeMap is what I would advise you to use.
Moreover, if the keys, as you state (especially because there are no duplicates), can be enumerated (as integer numbers, serial numbers, or so), you could try using a statically allocated array. Then you might get a complexity of O(1), perhaps!
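A minimal sketch of the TreeMap approach (the sample entries are made up): inserts land in sorted position, and iteration visits keys in ascending order:

import java.util.Map;
import java.util.TreeMap;

public class SortedTable {
    public static void main(String[] args) {
        // key -> value; TreeMap keeps keys in ascending order
        TreeMap<Integer, Integer> table = new TreeMap<>();
        table.put(42, 7);    // O(log n) insert into sorted position
        table.put(3, 99);
        table.put(17, 0);
        table.remove(17);    // O(log n) remove

        for (Map.Entry<Integer, Integer> e : table.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
            // prints 3 -> 99 then 42 -> 7
        }
    }
}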

Chaining Hash Tables - Average number of table entries examined when adding to hash table

I know that in chained hashing, the average number of table entries examined in a successful search is approximately 1 + (load factor / 2).
Would the same formula hold for the number of table entries examined when adding elements to the hash table? I'm thinking it would; I just want to make sure I'm not thinking about this wrong.
Yes. "Insert" is effectively a lookup operation with an additional modification.
However, if your hashing scheme involves any kind of rebalancing or resizing operation, then there may be a higher amortized operation count for inserts than lookups.
No. If you're doing a successful search, then of the N elements linked from the hash bucket, you'll on average visit half of them before finding the element you want. When adding elements that aren't already in the table, but where the insert function isn't allowed to assume they're absent, you have to compare against all N elements linked from the relevant bucket before you've confirmed the element isn't already there: twice as many operations. (If the hash table implementation provided an insert_known_new(N) function, it could just append to the linked list at that hash bucket without any key comparisons against existing elements, but I've never seen a hash table provide such a function. It would hand control of the hash table's class invariants to the user, which breaks encapsulation, though in this case justifiably, in my opinion.)
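A sketch of the difference in Java (the chain type and names are illustrative, not from any particular library): a checked insert must walk the whole chain before appending, while a known-new insert appends with zero key comparisons:

import java.util.LinkedList;
import java.util.List;

public class ChainInsert {
    record Entry(String key, String value) {}

    // Checked insert: must scan all entries in the bucket's chain
    // before concluding the key is new -- ~N comparisons on a miss.
    static void insertChecked(List<Entry> chain, Entry e) {
        for (Entry existing : chain) {
            if (existing.key().equals(e.key())) return;  // already present
        }
        chain.add(e);
    }

    // "insert_known_new": caller guarantees the key is absent,
    // so we append with zero key comparisons.
    static void insertKnownNew(List<Entry> chain, Entry e) {
        chain.add(e);
    }

    public static void main(String[] args) {
        List<Entry> bucket = new LinkedList<>();
        insertChecked(bucket, new Entry("a", "1"));
        insertChecked(bucket, new Entry("a", "dup"));  // rejected after full scan
        insertKnownNew(bucket, new Entry("b", "2"));   // blind append
        System.out.println(bucket.size());             // 2
    }
}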

Data Structure for tuple indexing

I need a data structure that stores tuples and lets me run a query like: given a tuple (x,y,z) of integers, find the next one (an upper bound for it). By that I mean with respect to the natural ordering (a,b,c) <= (d,e,f) <=> a<=d and b<=e and c<=f. I have tried MSD radix sort, which splits items into buckets and sorts them (and does this recursively for all positions in the tuples). Does anybody have any other suggestion? Ideally the above query would run in O(log n), where n is the number of tuples.
Two options.
Use binary search on a sorted array. If you build the keys (assuming 32-bit ints) with (a<<64)|(b<<32)|c and hold them in a simple array, packed one beside the other, you can use binary search to locate the value you are searching for (in C there is even a library function to do this), and the next one is simply one position along. Worst-case performance is O(log N), and if you can use interpolation search (http://en.wikipedia.org/wiki/Interpolation_search) you might even approach O(log log N).
The problem with packed binary keys is that adding new values can be tricky, and you may need gyrations if you exceed available memory. But it is fast, with only a few random memory accesses on average.
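Here is a minimal sketch of the packed-key idea in Java, following the answer's lexicographic, packed-key interpretation. Since (a<<64)|(b<<32)|c would need 96 bits, which doesn't fit in a Java long, this sketch assumes each component fits in 21 bits:

import java.util.Arrays;

public class TupleIndex {
    // Pack three non-negative ints into one long key, 21 bits each.
    // (The 96-bit key from the text would need two words in Java.)
    static long pack(int a, int b, int c) {
        return ((long) a << 42) | ((long) b << 21) | c;
    }

    public static void main(String[] args) {
        long[] keys = {
            pack(1, 2, 3), pack(1, 5, 0), pack(2, 0, 0), pack(7, 7, 7)
        };
        Arrays.sort(keys);  // keep the array sorted once up front

        // Find the position just past (1, 5, 0).
        long probe = pack(1, 5, 0);
        int pos = Arrays.binarySearch(keys, probe);
        int next = (pos >= 0) ? pos + 1 : -pos - 1;  // insertion point if absent
        if (next < keys.length) {
            System.out.println("next key index: " + next);  // prints 2
        }
    }
}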
Alternatively, you could build a hash table by generating a key from a|b|c in some form, and have the hash data point to a structure that contains the next value, whatever that might be. This is possibly a little harder to create in the first place, since when generating the table you need to know the next value already.
The problems with the hash approach are that it will likely use more memory than the binary search method, and that performance is great until you get hash collisions, then starts to drop off, although there are variations of the algorithm that help in some cases. The hash approach possibly makes it much easier to insert new values.
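A minimal sketch of this hash option (the string key form and the sample successors are made up for illustration; it assumes each tuple's successor is already known when the table is built, as noted above):

import java.util.HashMap;
import java.util.Map;

public class SuccessorTable {
    public static void main(String[] args) {
        // Key: "a|b|c" string form; value: the precomputed next tuple.
        // The successors here are invented sample data.
        Map<String, int[]> next = new HashMap<>();
        next.put("1|2|3", new int[]{1, 5, 0});
        next.put("1|5|0", new int[]{2, 0, 0});

        int[] succ = next.get("1|2|3");   // O(1) expected lookup
        System.out.println(succ[0] + "," + succ[1] + "," + succ[2]);  // 1,5,0
    }
}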
I also see you had a similar question along these lines, so I guess the guts of what I am saying is: combine a, b, c to produce a single long key, and use that with binary search, a hash, or even a B-tree. If the length of the key is your problem (what language is this?), could you treat it as a string?
If this answer is completely off base, let me know and I will see if I can delete it, so your question remains unanswered rather than carrying a useless answer.
