I was asked the following interview question:
Suppose you have a HashSet implementation providing its ordinary
interface. How can you use one or more instances of HashSet to
implement a HashTable providing the ordinary HashTable interface with its ordinary time constraints?
I asked twice, but they meant it this way and not the other way around (implementing a HashSet using a HashTable is quite simple; Java's HashSet does this, for example).
I answered that it was not possible. This answer did not seem to satisfy the interviewer, so I am searching for a better answer. I could not find a solution, even when searching on the internet and on Stack Overflow.
I think it was a trick question, but to make sure I post this question here on SO.
One standard way to do this is to treat the hash table as a hash set of key/value pairs, where the hash code of a key/value pair is purely the hash code of its key, and the equality comparison function says that two key/value pairs are equal precisely when their keys are equal. That way, the normal hash set operations will store key/value pairs in a way where
no two key/value pairs with the same key are stored, and
looking up a key in the hash table will find the key/value pair object with that key, from which the value can be looked up.
Hope this helps!
Related
I am currently experimenting with clojure.core.cache, and I have run into the problem that I want to store values keyed by tuples of keys, and I do not know the best/most idiomatic way to do this.
I was considering storing one cache for every first key value, where I can look up the final result with the second key value, but this seems very inefficient to me.
Another way would be to concatenate the keys, since that would be unique as well, but this again seems a bit hacky.
Maybe the problem is too "big" for caches and I should use redis or create a mirroring db, but I want to expire the values after some time, so this does not seem optimal either.
Basically I have keys like (organization-id, user-id), and I want to retrieve values for them and store the results in a cache with some expiry time.
Why not just use a Clojure vector for the tuples? Clojure vectors are immutable values, are equal to each other exactly when their lengths are equal and all corresponding elements are equal, and can be used as keys in a map. I do not recall whether core.cache uses maps internally to represent the contents of the cache, but either way, it should be able to use vectors as cache keys just fine.
I am looking for suggestions in improving the query time access for unordered maps. My code essentially just consists of 2 steps. In the first step, I populate the unordered map. After the first step, no more entries are ever added to the map. In the second step, the unordered map is only queried. Since the map is essentially unchanging, is there something that can be done to speed up the query time?
For instance, does the standard library provide any function that can adjust the internal allocations in the map to improve query-time access? In other words, it is possible that more than one key was mapped to the same bucket in the unordered map. If more memory were allocated to the map, the chance of such collisions occurring would be reduced. In that sense, I am curious whether there is anything that can be done, knowing that the unordered map will remain unchanged.
If measurements show this is important for you, then I'd suggest taking measurements for other hash table implementations outside the Standard Library, e.g. Google's. Using closed hashing (aka open addressing) may well work better for you, especially if your hash table entries are small enough to store directly in the hash table buckets.
More generally, Marshall suggests finding a good hash function. Be careful though - sometimes a generally "bad" hash function performs better than a "good" one, if it meshes nicely with some of the properties of your keys. For example, if you tend to have incrementing numbers, perhaps with a few gaps, then an identity (aka trivial) hash function that just returns the key can select hash buckets with far fewer collisions than a cryptographic hash that pseudo-randomly (but repeatably) scatters keys differing by as little as a single bit into uncorrelated buckets. Identity hashing can also help if you're looking up several nearby key values, as their buckets are probably nearby too and you'll get better cache utilisation. But, you've told us nothing about your keys, values, number of entries etc. - so I'll leave the rest with you.
You have two knobs that you can twist: the hash function and the number of buckets in the map. One is fixed at compile time (the hash function), and the other you can modify (somewhat) at run time.
A good hash function will give you very few collisions (non-equal values that have the same hash value). If you have many collisions, then there's not really much you can do to improve your lookup times. Worst case (all inputs hash to the same value) gives you O(N) lookup times. So that's where you want to focus your effort.
Once you have a good hash function, then you can play games with the number of buckets (via rehash) which can reduce collisions further.
I have to implement a Trie of codes of a given fixed length. Each code is a sequence of integers, and since some patterns are common, I decided to implement a Trie in order to store all the codes.
I also need to iterate through the codes in lexicographic order, and I expect to work with millions (maybe billions) of codes.
This is why I considered implementing this particular Trie as a dictionary where each key is the index of a given prefix.
Let's say key 0 has a list of its prefix's children, and for each one I save the corresponding entry in the dictionary...
Example: If my first insertion is the code 231, then the dictionary would look like:
[0]->{(2,1)}
[1]->{(3,2)}
[2]->{(1,3)}
This way, if my second insertion would be 243, the dictionary would be updated this way:
[0]->{(2,1)}
[1]->{(3,2),(4,3)} *Here each list is sorted using a flat_map
[2]->{(1,endMark)}
[3]->{(3,endMark)}
My problem is that I have been using a vector for this purpose, because keeping the whole dictionary in contiguous memory gives me better performance while iterating over it.
Now, when I need to work with BIG instances of my problem, the vector's resizing means I cannot work with millions of codes (memory consumption can reach 200GB).
I have now tried Google's sparse hash instead of the vector, and my question is: do you have any suggestions? Any other alternative in mind? Is there any other way to work with integers as keys to improve performance?
I know I won't have any collisions, because each key is different from the rest.
Best regards,
Quentin
I am dealing with hundreds of thousands of files,
I have to process those files 1-by-1,
In doing so, I need to remember the files that are already processed.
All I can think of is storing the file path of each file in a long array, and then checking it every time for duplicates.
But I think there should be a better way.
Is it possible for me to generate a KEY (which is a number) or something, that just remembers all the files that have been processed?
You could use some kind of hash function (MD5, SHA1).
Pseudocode:
for each F in filelist:
    hash = md5(F.name)
    if hash not in storage:
        process file F
        store hash in storage to remember
see https://www.rfc-editor.org/rfc/rfc1321 for a C implementation of MD5
There are probabilistic methods that give approximate results, but if you want to know for sure whether a string is one you've seen before or not, you must store all the strings you've seen so far, or equivalent information. It's a pigeonhole principle argument. Of course you can get by without doing a linear search of the strings you've seen so far using all sorts of different methods like hash tables, binary trees, etc.
If I understand your question correctly, you want to create a SINGLE key that takes on a specific value, and from that value you should be able to deduce which files have already been processed? I don't know if you'll be able to do that, simply because your space is quite big, and generating unique key representations in such a huge space requires a lot of memory.
As mentioned, what you can do is simply to store each path URL in a HashSet. Putting a hundred thousand entries into the Set is not that bad, and lookup time is amortized constant time O(1), so it will be quite fast.
A Bloom filter can solve your problem.
The idea of a Bloom filter is simple. It begins with an empty bit array of some length, with all its members set to zero, plus K hash functions.
Whenever we insert an item into the Bloom filter, we hash the item with all K hash functions. These hash functions give K indexes into the bit array, and we set the members at those indexes to 1.
To check if an item exists in the Bloom filter, simply hash it with all K hash functions and check the corresponding array indexes. If all of them are 1s, the item is probably present in the Bloom filter.
Note that a Bloom filter can give false positive results, but it will never give a false negative. You need to tune the filter's size and number of hash functions to keep the false-positive rate acceptable.
What you need, IMHO, is some sort of tree- or hash-based set implementation. It is basically a data structure that supports very fast add, remove, and query operations, and keeps only one instance of each element (i.e. no duplicates). A few hundred thousand strings (assuming they are not themselves hundreds of thousands of characters long) should not be a problem for such a data structure.
Your programming language of choice probably already has one, so you don't need to write your own. C++ has std::set, Java has the Set implementations TreeSet and HashSet, and Python has set. They all allow you to add elements and check for the presence of an element very quickly (O(1) for hashtable-based sets, O(log n) for tree-based sets). Beyond those, there are lots of free implementations of sets, as well as general-purpose binary search trees and hashtables, that you can use.
What Data Structure could I use to find the Phone number of a person given the person's name?
Assuming you will only ever query using the person's name, the best option is to use an associative data structure. This is basically a data structure, usually implemented as a hashtable or a balanced binary search tree, that stores data as key=>value (or, stated in another way, as (key,value) pairs). You query the data structure by using the key and it returns the corresponding value. In your case, the key would be the name of the person and the value would be the phone number.
Rather than implementing a hashtable or a binary search tree yourself, check whether your language already has something like this in its library; most languages these days do. Python has dict, Perl has hashes, Java has Map, C# has Dictionary, and C++ has std::map in the STL.
Things can get a little trickier if you have several values for the same key (e.g. the same person having multiple phone numbers), but there are workarounds like using a list/vector as the value, or using a slightly different structure that supports multiple values for the same key (e.g. std::multimap). But you probably don't need to worry about that anyway.
An associative array, such as a hashtable.
Really, anything that maps keys to values. The specific data structure will depend on the language you are using (unless you want to implement your own, in which case you have free rein).