Referring to LRU cache design
I have a question regarding the answer.
Say my hash map is full (the interviewer gave me a max size). [I understand that if I need to fetch a pair already present in the map, I'll move the list entry to the front to indicate recent use.]
But what if I have an entry to add and its key hashes to the same position as a different key (a collision)? How do I go about it?
Do I do chaining or probing? If I do chaining, should I increase the map size?
If I remove the oldest entry, it empties a location in my hash map. But a new entry might not hash to this location; it might hash to another location that is already full (a different key-value pair).
How do I solve this?
This design will not include chaining, because what we're designing here is a direct-mapped cache. The known tradeoff is that such a cache considers ONLY the recency of an entry, not how frequently it is requested, when deciding what to remove.
The max size limit is imposed on the linked list, not on the hash map's buckets. Every time we try to add a new entry while the linked list is full, the least recently used entry (the tail of the linked list) and its corresponding map entry are removed. The location where the new entry is inserted is independent of what was removed.
For more details on concurrency check out this link.
The size of a map is the number of key-value pairs present in it, so it is independent of whether those pairs sit in the same hash bucket or in different ones.
If you look at the data structure behind a hash map, it is an array of linked lists, so when there is a hash collision the new entry is chained, and the size of the map still increases by one.
So if your new entry hashes to a location that is not null, you chain it there, just as you would add a node to a linked list.
PS: for LRU Cache you can see LinkedHashMap
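As a minimal sketch of the LinkedHashMap suggestion: the class name LruCache and the capacity parameter are made up for illustration, while the three-argument constructor (with accessOrder set to true) and the removeEldestEntry hook are part of the standard java.util.LinkedHashMap API. Note that the capacity counts entries, not hash buckets, so collisions are left entirely to the map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical name; a thin LRU wrapper around LinkedHashMap.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity; // maximum number of entries, not buckets

    LruCache(int capacity) {
        // accessOrder = true: iteration order runs from least to most recently accessed.
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Invoked by put() after an insertion; returning true evicts the eldest entry.
        return size() > capacity;
    }
}
```

With a capacity of 2, putting "a" and "b", reading "a", and then putting "c" should evict "b", since "b" is then the least recently used entry.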
For resolving hash collisions in the hash table data structure, we have one very popular strategy called separate chaining.
I'm aware that in the separate chaining strategy, keys that collide at the same index of the backing array (because they hash to the same value) are stored in a linked list.
I wonder whether the backing array is of type LinkedList<E>[] from the moment the hash table is created (when implementing separate chaining), or whether it is an int[] that gets converted to a LinkedList<E>[] after the first collision?
Having a linked list as each element of the backing array does not seem like the most optimal solution. It means that each of those linked lists is a list of elements which, in turn, are entries holding a key-value pair, and all of this consumes a lot of memory and resources, I reckon.
I did quite a bit of research in different books and academic articles, yet I still can't get a clear answer on this.
Yes, separate chaining will cost more memory than probing or re-hashing. But the benefit is that you get more items in the hash table before performance begins to suffer. At some point you still have to re-index: typically when you realize that some bucket is over-represented or when the total number of occupied buckets exceeds some threshold.
Note that the backing array itself isn't a linked list. The backing array for a hash table that uses probing or re-hashing will probably be a dynamically-sized array of entries. Your entry would be something like:
class Entry {
    String key;    // lookup key
    Object value;  // stored value (any type)
}
If you're using separate chaining, the Entry object gets an additional field: a reference to the next item that hashed to the same bucket:
class Entry {
    String key;    // lookup key
    Object value;  // stored value (any type)
    Entry next;    // next entry that hashed to the same bucket, or null
}
The memory difference for the first item really isn't enough to worry about.
It's possible to write the code so that if a bucket has but a single item, it will contain just the key and value, and the bucket is converted to a linked list only on first collision. There is perhaps a small memory win there, and an even smaller performance gain. But the code is more complex and the gains aren't huge unless you know that the majority of your buckets won't have any collisions. Not worth the trouble of implementing, testing, and maintaining two different code paths.
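To make the layout concrete, here is a minimal sketch of a map with separate chaining, assuming a fixed bucket count, non-null keys, and no resizing. The class name ChainedMap and its methods are hypothetical. The backing array holds references to entries (null for an empty bucket), and each entry carries the next reference described above.

```java
// Hypothetical minimal map using separate chaining; not a production design.
class ChainedMap<K, V> {
    private static final class Entry<K, V> {
        final K key;
        V value;
        Entry<K, V> next; // next entry that hashed to the same bucket

        Entry(K key, V value, Entry<K, V> next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private final Entry<K, V>[] buckets; // null element = empty bucket
    private int size;

    @SuppressWarnings("unchecked")
    ChainedMap(int bucketCount) {
        buckets = (Entry<K, V>[]) new Entry[bucketCount];
    }

    private int indexFor(K key) {
        return Math.floorMod(key.hashCode(), buckets.length);
    }

    // Inserts or updates; returns the previous value, or null if the key was absent.
    V put(K key, V value) {
        int i = indexFor(key);
        for (Entry<K, V> e = buckets[i]; e != null; e = e.next) {
            if (e.key.equals(key)) {
                V old = e.value;
                e.value = value;
                return old;
            }
        }
        buckets[i] = new Entry<>(key, value, buckets[i]); // prepend to the chain
        size++;
        return null;
    }

    V get(K key) {
        for (Entry<K, V> e = buckets[indexFor(key)]; e != null; e = e.next) {
            if (e.key.equals(key)) {
                return e.value;
            }
        }
        return null;
    }

    int size() {
        return size;
    }
}
```

Constructing it with a single bucket forces every key to collide, which is a simple way to check that the chain handles lookups and updates correctly.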
I know the title may be a little confusing; however, my actual question is basic, I think.
I'm working on a brand new LRU implementation, for which I use an Index Table that maps the name of an incoming packet to the index where the packet's content is stored in the CS.
As illustrated below, each incoming packet is stored in the CS and can be addressed through the Index Table.
Now suppose a new packet arrives. As we know, under LRU its index must be set to the top of the CS (zero), and the other indexes must be updated: they need to be incremented as a result.
One obvious solution is to loop over all entries in the Index Table and increment them.
Is there any solution or structure that is used for such a problem?
I don't see how you are establishing the order of your cache in the description. But to answer your question, it's possible to reduce the LRU store method to O(1) time complexity.
The classical way to do it is to have these two data structures:
Doubly Linked List : for order in the cache. Each node stores a data element (it plays the role of your content store).
HashMap that associates each key to the pointer to the node in the linked list. (it plays the role of your index table)
So when you access data already stored in your cache, it must move to the top of the list: you delete the corresponding node from the linked list (in O(1) time, because you have access to its previous and next nodes) and re-insert it at the head.
For new data it is simpler: just store it at the head of the list and store the key, together with the pointer to the new node, in the hashmap.
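The two structures above can be sketched as follows. This is one possible arrangement under stated assumptions, not a definitive implementation: the names LinkedLruCache, Node, unlink and linkFirst are made up for illustration, sentinel head and tail nodes are an added convenience, and eviction of the tail on overflow is the usual LRU policy rather than something spelled out in the answer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical LRU cache: HashMap for lookup, doubly linked list for recency order.
class LinkedLruCache<K, V> {
    private static final class Node<K, V> {
        K key;
        V value;
        Node<K, V> prev, next;

        Node(K key, V value) {
            this.key = key;
            this.value = value;
        }
    }

    private final int capacity;
    private final Map<K, Node<K, V>> index = new HashMap<>();
    private final Node<K, V> head = new Node<>(null, null); // sentinel before the most recent node
    private final Node<K, V> tail = new Node<>(null, null); // sentinel after the least recent node

    LinkedLruCache(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be at least 1");
        }
        this.capacity = capacity;
        head.next = tail;
        tail.prev = head;
    }

    V get(K key) {
        Node<K, V> n = index.get(key);
        if (n == null) {
            return null;
        }
        unlink(n);     // O(1): neighbours are reachable through prev/next
        linkFirst(n);  // mark as most recently used
        return n.value;
    }

    void put(K key, V value) {
        Node<K, V> n = index.get(key);
        if (n != null) {           // existing key: update and refresh recency
            n.value = value;
            unlink(n);
            linkFirst(n);
            return;
        }
        if (index.size() == capacity) {
            Node<K, V> lru = tail.prev;  // least recently used node
            unlink(lru);
            index.remove(lru.key);
        }
        n = new Node<>(key, value);
        linkFirst(n);
        index.put(key, n);
    }

    int size() {
        return index.size();
    }

    private void unlink(Node<K, V> n) {
        n.prev.next = n.next;
        n.next.prev = n.prev;
    }

    private void linkFirst(Node<K, V> n) {
        n.next = head.next;
        n.prev = head;
        head.next.prev = n;
        head.next = n;
    }
}
```

No index ever needs to be incremented: the position of an item is implied by its place in the list, so both get and put stay O(1) on average.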
Is there such a data structure that combines a Queue and a Hashmap?
In addition to the FIFO (enqueue/dequeue) behaviour that a queue normally has, I want:
when enqueuing, always enqueue with a key,
when peeking without the key, returns the head of the queue
when peeking with the key, returns the first element enqueued with this key
when dequeuing without the key, remove the first element ever enqueued
when dequeuing with the key, remove all elements having the key
I wonder if such a data structure already exists in the wild?
No, there is not. But you can combine both to achieve the behavior you want (though you will have to make tradeoffs along the way).
To do so, you will store:
A HashMap where the values are references to items in the queue: HashMap<Key, ReferenceToFIFOElement> or HashMap<Key, Set<ReferenceToFIFOElement>>.
An actual FIFO queue: FIFO<Item>
When you enqueue, you first add your element at the back of the queue. Then you update the hashmap with a reference to this newly created element if the key was not registered yet (or, in the set case, add the reference to the set mapped to the given key).
Peeking will be easy: just look up the key and access the referenced item (or the first referenced item in the set case, or the head of the queue if no key was provided).
Dequeuing is where the real tradeoff will take place:
If you only store a reference to the first item inserted with a given key in the hashmap, then you will have to iterate over the rest of the queue, starting from that item, to find the others. This means an overall higher time complexity.
If you store all the references to items with a given key in the hashmap (using a set), then you will just have to iterate over that set and remove the referenced elements from the queue. This increases the space complexity of the data structure.
However, in reality it can be more complicated depending on the data structure you choose to place under the hood of the FIFO:
Array list: cache friendly, random access... But can require reallocation as you insert/delete elements. This invalidates references -> store indices instead of actual references.
Linked list: not cache friendly but insertion and deletion are guaranteed to be O(1).
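Here is a rough sketch of the second variant (all references per key are kept), assuming it is acceptable to identify queue elements by a sequence number instead of a raw reference. The names KeyedQueue, Item, fifo and byKey are hypothetical. The FIFO is a LinkedHashMap keyed by sequence number, which keeps insertion order and allows O(1) removal from the middle; the per-key index is a HashMap of deques of sequence numbers.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical queue/hashmap hybrid; elements are addressed by sequence number.
class KeyedQueue<K, V> {
    private static final class Item<K, V> {
        final K key;
        final V value;

        Item(K key, V value) {
            this.key = key;
            this.value = value;
        }
    }

    private long nextSeq = 0;
    // Insertion-ordered: iteration starts at the oldest element still queued.
    private final Map<Long, Item<K, V>> fifo = new LinkedHashMap<>();
    // For each key, the sequence numbers of its queued elements, oldest first.
    private final Map<K, Deque<Long>> byKey = new HashMap<>();

    void enqueue(K key, V value) {
        long seq = nextSeq++;
        fifo.put(seq, new Item<>(key, value));
        byKey.computeIfAbsent(key, k -> new ArrayDeque<>()).addLast(seq);
    }

    // Head of the queue, or null if the queue is empty.
    V peek() {
        Iterator<Item<K, V>> it = fifo.values().iterator();
        return it.hasNext() ? it.next().value : null;
    }

    // First element enqueued with this key, or null if there is none.
    V peek(K key) {
        Deque<Long> seqs = byKey.get(key);
        return seqs == null ? null : fifo.get(seqs.peekFirst()).value;
    }

    // Removes and returns the oldest element, or null if the queue is empty.
    V dequeue() {
        Iterator<Map.Entry<Long, Item<K, V>>> it = fifo.entrySet().iterator();
        if (!it.hasNext()) {
            return null;
        }
        Item<K, V> item = it.next().getValue();
        it.remove();
        Deque<Long> seqs = byKey.get(item.key);
        seqs.removeFirst(); // the globally oldest element is also the oldest for its key
        if (seqs.isEmpty()) {
            byKey.remove(item.key);
        }
        return item.value;
    }

    // Removes every element enqueued with this key, returned oldest first.
    List<V> dequeueAll(K key) {
        List<V> removed = new ArrayList<>();
        Deque<Long> seqs = byKey.remove(key);
        if (seqs != null) {
            for (Long seq : seqs) {
                removed.add(fifo.remove(seq).value);
            }
        }
        return removed;
    }
}
```

The extra deque per key is the space cost mentioned above; in exchange, dequeuing by key only touches the elements that actually carry that key.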
Is there any data structure in which locating a piece of data is independent of the volume of data it holds?
"locating a data is independent of volume of data in it" - I assume this means O(1) for get operations. That would be a hash map.
This presumes that you fetch the object based on the hash.
If you have to check each element to see if an attribute matches a particular value, like your rson or ern or any other parts of it, then you have to make that value the key up front.
If you have several values that you need to search on (all of them must be unique and immutable), you can create several maps, one for each value. That lets you search on more than one. But they all have to be unique, immutable, and known up front.
If you don't establish the key up front it's O(N), which means you have to check every element in turn until you find what you want. On average, this time will increase as the size of the collection grows. That's what O(N) means.
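A small sketch of the "several maps" idea, assuming a made-up Person type whose id and email are both unique and immutable. The names PersonDirectory, Person, byId and byEmail are purely illustrative. Each map gives O(1) average lookup on its own attribute, at the cost of keeping the maps in sync on every insert.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical directory indexed by two unique, immutable attributes.
class PersonDirectory {
    static final class Person {
        final String id;
        final String email;
        final String name;

        Person(String id, String email, String name) {
            this.id = id;
            this.email = email;
            this.name = name;
        }
    }

    private final Map<String, Person> byId = new HashMap<>();
    private final Map<String, Person> byEmail = new HashMap<>();

    // Every index must be updated together, or lookups will disagree.
    void add(Person p) {
        byId.put(p.id, p);
        byEmail.put(p.email, p);
    }

    Person findById(String id) {
        return byId.get(id);
    }

    Person findByEmail(String email) {
        return byEmail.get(email);
    }
}
```

Searching by name, which has no map here, would still require checking every element in turn.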
What is the best way to remove an entry from a hashtable that uses linear probing? One way to do this would be to use a flag to indicate deleted elements? Are there any ways better than this?
An easy technique is to:
1. Find and remove the desired element.
2. Go to the next bucket.
3. If the bucket is empty, quit.
4. If the bucket is full, delete the element in that bucket and re-add it to the hash table using the normal means. The item must be removed before re-adding, because it is likely that the item could be added back into its original spot.
5. Repeat from step 2.
This technique keeps your table tidy at the expense of slightly slower deletions.
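The steps above can be sketched as follows, assuming a fixed-capacity set of int keys where the key itself serves as the hash and one slot is always left empty so that probing terminates. The class name ReinsertingProbeSet and its methods are hypothetical.

```java
// Hypothetical linear-probing set that deletes by re-inserting the rest of the cluster.
class ReinsertingProbeSet {
    private final Integer[] slots; // null = empty bucket
    private int size;

    ReinsertingProbeSet(int capacity) {
        slots = new Integer[capacity];
    }

    private int home(int key) {
        return Math.floorMod(key, slots.length);
    }

    boolean contains(int key) {
        for (int i = home(key); slots[i] != null; i = (i + 1) % slots.length) {
            if (slots[i] == key) {
                return true;
            }
        }
        return false;
    }

    boolean add(int key) {
        if (contains(key)) {
            return false;
        }
        if (size == slots.length - 1) { // keep one bucket empty so probes always stop
            throw new IllegalStateException("table full");
        }
        int i = home(key);
        while (slots[i] != null) {
            i = (i + 1) % slots.length;
        }
        slots[i] = key;
        size++;
        return true;
    }

    boolean remove(int key) {
        int i = home(key);
        while (slots[i] != null && slots[i] != key) {
            i = (i + 1) % slots.length;
        }
        if (slots[i] == null) {
            return false;
        }
        slots[i] = null;                 // step 1: remove the element
        size--;
        i = (i + 1) % slots.length;      // step 2: go to the next bucket
        while (slots[i] != null) {       // step 3: stop at the first empty bucket
            int moved = slots[i];        // step 4: take the element out, then re-add it
            slots[i] = null;
            size--;
            add(moved);
            i = (i + 1) % slots.length;  // step 5: continue with the next bucket
        }
        return true;
    }

    int size() {
        return size;
    }
}
```

Because every element after the hole is re-added, no lookup can be cut short by the newly emptied bucket, and no deleted markers accumulate in the table.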
It depends on how you handle overflow and whether (1) the item being removed is in an overflow slot or not, and (2) if there are overflow items beyond the item being removed, whether they have the hash key of the item being removed or possibly some other hash key. [Overlooking that double condition is a common source of bugs in deletion implementations.]
If collisions overflow into a linked list, it is pretty easy. You're either popping the head of the list (which may leave it empty) or deleting a member from the middle or end of the linked list. Those are fun and not particularly difficult. There can be other optimizations to avoid excessive memory allocations and frees to make this even more efficient.
For linear probing, Knuth suggests that a simple approach is to have a way to mark a slot as empty, deleted, or occupied. Mark a removed occupant slot as deleted so that overflow by linear probing will skip past it, but if an insertion is needed, you can fill the first deleted slot that you passed over [The Art of Computer Programming, vol.3: Sorting and Searching, section 6.4 Hashing, p. 533 (ed.2)]. This assumes that deletions are rather rare.
Knuth gives a nice refinement as Algorithm R6.4 [pp. 533-534] that instead marks the cell as empty rather than deleted, and then finds ways to move table entries back closer to their initial-probe location by moving the hole that was just made until it ends up next to another hole.
Knuth cautions that this will move existing still-occupied slot entries and is not a good idea if pointers to the slots are being held onto outside of the hash table. [If you have garbage-collected- or other managed-references in the slots, it is all right to move the slot, since it is the reference that is being used outside of the table and it doesn't matter where the slot that references the same object is in the table.]
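The simpler empty/deleted/occupied marking described above can be sketched like this. It is only an illustration under stated assumptions, not a transcription of Knuth's algorithms: the set holds int keys, the key is its own hash, the capacity is fixed, and the names TombstoneProbeSet and State are hypothetical.

```java
import java.util.Arrays;

// Hypothetical linear-probing set with three slot states.
class TombstoneProbeSet {
    private enum State { EMPTY, OCCUPIED, DELETED }

    private final State[] state;
    private final int[] keys;

    TombstoneProbeSet(int capacity) {
        state = new State[capacity];
        Arrays.fill(state, State.EMPTY);
        keys = new int[capacity];
    }

    // Returns the slot holding the key, or -1 if it is absent.
    private int find(int key) {
        int i = Math.floorMod(key, keys.length);
        for (int n = 0; n < keys.length; n++, i = (i + 1) % keys.length) {
            if (state[i] == State.EMPTY) {
                return -1; // a truly empty slot ends the probe sequence
            }
            if (state[i] == State.OCCUPIED && keys[i] == key) {
                return i;
            }
            // DELETED: skip past the marker and keep probing
        }
        return -1;
    }

    boolean contains(int key) {
        return find(key) >= 0;
    }

    boolean add(int key) {
        if (find(key) >= 0) {
            return false;
        }
        int i = Math.floorMod(key, keys.length);
        for (int n = 0; n < keys.length; n++, i = (i + 1) % keys.length) {
            if (state[i] != State.OCCUPIED) { // reuse the first empty or deleted slot passed
                state[i] = State.OCCUPIED;
                keys[i] = key;
                return true;
            }
        }
        throw new IllegalStateException("table full");
    }

    boolean remove(int key) {
        int i = find(key);
        if (i < 0) {
            return false;
        }
        state[i] = State.DELETED; // leave a marker so later probes continue past this slot
        return true;
    }
}
```

Deleted markers never turn back into empty slots here, so a table with many deletions makes unsuccessful lookups slower; that is why this approach suits workloads where deletions are rare.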
The Python hash table implementation (arguably very fast) uses dummy elements to mark deletions. As you grow or shrink your table (assuming you're not doing a fixed-size table), you can drop the dummies at the same time.
If you have access to a copy, have a look at the article in Beautiful Code about the implementation.
The best general solutions I can think of include:
If you can use a non-const iterator (à la C++ STL or Java), you should be able to remove them as you encounter them. Presumably, though, you wouldn't be asking this question unless you're using a const iterator or an enumerator which would be invalidated if the underlying collection is modified.
As you said, you could mark a deleted flag within the contained object. This doesn't release any memory or reduce collisions on the key, though, so it's not the best solution. Also requires the addition of a property on the class that probably doesn't really belong there. If this bothers you as much as it would me, or if you simply can't add a flag to the stored object (perhaps you don't control the class), you could store these flags in a separate hash table. This requires the most long-term memory use.
Push the keys of the to-be-removed items into a vector or array list while traversing the hash table. After releasing the enumerator, loop through this secondary list and remove the keys from the hash table. If you have a lot of items to remove and/or the keys are large (which they shouldn't be), this may not be the best solution.
If you're going to end up removing more items from the hash table than you're leaving in there, it may be better to create a new hash table, and as you traverse your original one, add to the new hash table only the items you're going to keep. Then replace your reference(s) to the old hash table with the new one. This saves a secondary list iteration, but it's probably only efficient if the new hash table will have significantly fewer items than the original one, and it definitely only works if you can change all the references to the original hash table, of course.
If your hash table gives you access to its collection of keys, you may be able to iterate through those and remove items from the hash table in one pass.
If your hash table or some helper in your library provides you with predicate-based collection modifiers, you may have a Remove() function to which you can pass a lambda expression or function pointer to identify the items to remove.
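In Java, several of these options map onto standard java.util calls. The sketch below uses a made-up rule (drop entries with negative values) and a hypothetical helper class RemovalExamples; Iterator.remove, Collection.removeIf and Map.keySet are the real library methods being illustrated.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical helpers showing ways to remove entries while or after traversing a map.
class RemovalExamples {
    // Remove through the iterator itself, which keeps the traversal valid.
    static void removeWithIterator(Map<String, Integer> map) {
        Iterator<Map.Entry<String, Integer>> it = map.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue() < 0) {
                it.remove();
            }
        }
    }

    // Collect the doomed keys first, then remove them after the traversal ends.
    static void removeWithKeyList(Map<String, Integer> map) {
        List<String> doomed = new ArrayList<>();
        for (Map.Entry<String, Integer> e : map.entrySet()) {
            if (e.getValue() < 0) {
                doomed.add(e.getKey());
            }
        }
        for (String key : doomed) {
            map.remove(key);
        }
    }

    // Predicate-based removal on the values view; changes write through to the map.
    static void removeWithPredicate(Map<String, Integer> map) {
        map.values().removeIf(v -> v < 0);
    }

    // Build a new table holding only the survivors.
    static Map<String, Integer> copySurvivors(Map<String, Integer> map) {
        Map<String, Integer> kept = new HashMap<>();
        for (Map.Entry<String, Integer> e : map.entrySet()) {
            if (e.getValue() >= 0) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }
}
```

Calling map.remove inside a for-each loop over the same HashMap, by contrast, typically fails with a ConcurrentModificationException on the next iteration step, which is the invalidation problem the list above works around.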
A common technique when time is a factor is to have a second table of deleted items, and clean up the main table when you have time. Commonly used in search engines.
How about enhancing the hash table to contain pointers like a linked list?
When you insert, if the bucket is full, create a pointer from this bucket to the bucket where the new entry is stored.
While deleting something from the hashtable, the solution will be equivalent to how you write a function to delete a node from linkedlist.