GRE CS: Which data structure would be most appropriate to implement a collection of values with the following three characteristics?
Items are retrieved and removed from the collection in FIFO order.
There is no a priori limit on the number of items in the collection.
The size of an item is large relative to the storage required for a memory address.
This was a multiple-choice question with these answers:
(A) Singly-linked list, with head and tail pointers
(B) Doubly-linked list, with only a head pointer
(C) Array
(D) Binary tree
(E) Hash table
I think (C), (D) and (E) are wrong.
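A does seem to be the correct answer. Because items are removed in FIFO order, you only ever need to operate on the first and last elements of the collection. A allows this in constant time, and C and E can be made to (for example, an array used as a circular buffer). B cannot, unless the list is circular: with only a head pointer, reaching the other end of the list means walking all of it.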
There is no limit on the number of items. This means that C and E are no longer as good as A because you will eventually need to re-size an array or hash table as it gets large or allocate far more than you need to start. With a linked list you can easily add as you go.
The size of an item is large. This further suggests that A is correct, because the extra storage for the link addresses is negligible next to the items themselves.
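To make option (A) concrete, here is a minimal sketch of a FIFO queue on a singly-linked list with head and tail pointers. The names `Node`, `LinkedQueue`, `enqueue` and `dequeue` are my own, not from the question. Both operations touch only the two ends, so they run in constant time, and each node costs one extra reference on top of the (large) item.

```python
class Node:
    """One queue cell: a reference to the item plus a single link."""

    def __init__(self, item):
        self.item = item
        self.next = None


class LinkedQueue:
    """FIFO queue on a singly-linked list with head and tail pointers."""

    def __init__(self):
        self.head = None  # oldest node, removed first
        self.tail = None  # newest node, appended last

    def enqueue(self, item):
        node = Node(item)
        if self.tail is None:
            self.head = node       # queue was empty
        else:
            self.tail.next = node  # link behind the current tail
        self.tail = node

    def dequeue(self):
        if self.head is None:
            raise IndexError("dequeue from empty queue")
        node = self.head
        self.head = node.next
        if self.head is None:
            self.tail = None       # queue became empty
        return node.item
```

There is no resizing step anywhere, which is the point made above about the missing a priori limit.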
Related
What is the best data structure for this case? Given N resources with ID from 0 to N-1, you can get a resource or free a resource.
We also need to consider the time & space complexity for get and free operations.
interface ResourcePool {
    int get();          // return an available ID
    void free(int id);  // mark ID as available
}
Follow up: what if N is a super large number, say 1 billion or 1 trillion.
Generally, you need 2 things:
A variable like int nextUnused that contains the smallest ID that's never been allocated
A list of free IDs less than nextUnused.
Allocating an ID will take it from the free list if the list is non-empty. Otherwise it will return nextUnused and increment it.
Freeing an ID will just add it to the free list.
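The two pieces above can be sketched roughly like this. It is only an illustration: the free list is kept as a separate Python list used as a stack, the method names follow the interface in the question, and `next_unused` / `free_ids` are my names. It assumes callers never free an ID twice or free an ID they never got.

```python
class ResourcePool:
    """IDs 0..n-1; get() and free() both run in O(1)."""

    def __init__(self, n):
        self.n = n
        self.next_unused = 0  # smallest ID that has never been allocated
        self.free_ids = []    # freed IDs, all less than next_unused

    def get(self):
        if self.free_ids:
            return self.free_ids.pop()  # reuse a freed ID first
        if self.next_unused >= self.n:
            raise RuntimeError("pool exhausted")
        rid = self.next_unused
        self.next_unused += 1
        return rid

    def free(self, rid):
        # Assumption: rid is currently allocated (no double-free check).
        self.free_ids.append(rid)
```

Space is proportional to the number of IDs that have been freed and not yet reused, not to N.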
There are lots of different representations for the free list, but if you need to reserve memory for allocated resources, then it's common to reuse the memory of the free ones as linked list nodes in the free list, so the free list itself doesn't consume any space. This kind of data structure is called... a "free list": https://en.wikipedia.org/wiki/Free_list
Alternatively, you can store the free list separately. Since IDs can be freed in any order, and you need to remember which ones are free, there is no choice but to store the whole list somehow.
If your ID space is really big, it's conceivable that you could adopt strategies for keeping this representation as small as possible, but I've never seen much effort put into that in practice. The other possibility is to move parts of the free list into disk storage when it gets too big.
If N is very large, you can represent your resource pool using a balanced binary search tree. Each node in the tree is a range of free IDs, represented by an upper and lower bound of ints. get() removes an arbitrary node from the tree, takes its lower bound as the result, increments the lower bound, then re-inserts the node if the range it represents is still non-empty. free(i) inserts a new node (i,i), then coalesces that node with its two neighbors, if possible. For instance, if the tree contains (7,9) and (11,17), then free(10) results in a tree with fewer nodes: (7,9), (10,10), and (11,17) are all removed, and (7,17) is there in their place. On the other hand, if the two neighbors of (10,10) are (7,9) and (12,17), then the result is (7,10) and (12,17), while if the two neighbors are (7,8) and (12,17), then no coalescing is possible and all three nodes, (7,8), (10,10), and (12,17), remain in the tree.
Both operations, get() and free(), take O(log P) time, where P is the number of reserved elements at the moment the operation begins. This is slower than a free list, but the advantage over a plain free list is that the tree holds at most P + 1 nodes (any two adjacent free ranges are separated by at least one reserved ID), so as long as P is much smaller than N, the space usage is low.
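Here is a sketch of the range-coalescing logic. To keep it short I swapped the balanced binary search tree for two parallel sorted Python lists searched with `bisect`, so finding the neighbors is O(log P) but inserting or deleting a range is O(P); a real balanced tree would be needed for the stated bound. `RangePool`, `los` and `his` are my names, and it assumes an ID is only freed while it is reserved.

```python
import bisect


class RangePool:
    """Free IDs stored as sorted, disjoint, non-adjacent inclusive ranges."""

    def __init__(self, n):
        self.los = [0] if n > 0 else []      # lower bounds, sorted
        self.his = [n - 1] if n > 0 else []  # matching upper bounds

    def get(self):
        if not self.los:
            raise RuntimeError("pool exhausted")
        rid = self.los[-1]       # take from the last range (arbitrary choice)
        if rid == self.his[-1]:
            self.los.pop()       # range is now empty, drop it
            self.his.pop()
        else:
            self.los[-1] = rid + 1
        return rid

    def free(self, rid):
        # Index of the first free range that starts after rid.
        i = bisect.bisect_left(self.los, rid)
        joins_left = i > 0 and self.his[i - 1] == rid - 1
        joins_right = i < len(self.los) and self.los[i] == rid + 1
        if joins_left and joins_right:
            self.his[i - 1] = self.his[i]  # merge both neighbors into one range
            del self.los[i], self.his[i]
        elif joins_left:
            self.his[i - 1] = rid
        elif joins_right:
            self.los[i] = rid
        else:
            self.los.insert(i, rid)        # isolated range (rid, rid)
            self.his.insert(i, rid)
```

With free ranges (7,9) and (11,17), `free(10)` leaves the single range (7,17), matching the example above.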
Is there a data structure like this:
There is a slow list-like data structure, such as a linked list or data stored on disk.
There is a relatively small array of pointers to some of the elements in the "slow list", ideally evenly distributed.
When you search, you first consult the array and then perform the normal search (a linked-list traversal, or a binary search in the case of disk data) within the narrowed range.
This looks very similar to jump search, sample search and skip lists, but I think it is a different algorithm.
Please note that I use a linked list or a file on disk as examples because they are slow structures.
I don't know if there's a name for this algorithm (I don't think it deserves one, though if there isn't, it could bear mine:), but I did implement something like that 10 years ago for an interview.
You can have an array of pointers to the elements of a list. An array of fixed size, say, of 256 pointers. When you construct the list or traverse it for the first time, you store pointers to its elements in the array. So, for a list of 256 or fewer elements you'd have a pointer to each element.
As the list grows beyond 256 elements, you drop every odd-numbered pointer by moving the 128 even-numbered pointers to the beginning of the array. When the array of pointers fills up again, you repeat the procedure. At every such point you double the step between the list elements whose addresses end up in the array of pointers. Initially you'd store every element's address, then every other element's, then one out of every four, and so on.
You end up with an array of pointers to list elements spaced roughly the list length / 256 apart (between 1/256 and 1/128 of the length, depending on how full the array currently is).
If the list is singly-linked, locating i-th element from the beginning or the end of it is reduced to searching in 1/256th of the list.
If the list is sorted, you can perform binary search on the array to locate the bin (the 1/256th portion of the list) where to look further.
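A rough sketch of the scheme described above follows; it is a reconstruction for illustration, not the original interview code. `SampledList`, `samples` and `stride` are names I made up, and the capacity is a parameter so the halving can be exercised with a small array.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None


class SampledList:
    """Singly-linked list plus a bounded array of pointers to every stride-th node."""

    def __init__(self, capacity=256):
        self.capacity = capacity
        self.samples = []  # samples[k] points at node number k * stride
        self.stride = 1
        self.head = self.tail = None
        self.length = 0

    def append(self, value):
        node = Node(value)
        if self.tail is None:
            self.head = node
        else:
            self.tail.next = node
        self.tail = node
        if self.length % self.stride == 0:
            if len(self.samples) == self.capacity:
                # Array is full: keep every other pointer and double the step.
                self.samples = self.samples[::2]
                self.stride *= 2
            if self.length % self.stride == 0:
                self.samples.append(node)
        self.length += 1

    def get(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        node = self.samples[i // self.stride]  # nearest sample at or before i
        for _ in range(i % self.stride):       # walk the remaining few links
            node = node.next
        return node.value
```

`get(i)` walks fewer than `stride` links instead of up to `i`. For a sorted list, the same `samples` array could be binary-searched by value to pick the bin.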
This question was asked in the interview:
Propose and implement a data structure that works with integer data from a finite, contiguous range of integers. The data structure should support O(1) insert and remove operations as well as findOldest (the oldest value inserted into the data structure).
No duplicates are allowed (i.e. if a value is already inside, it should not be added again).
Also, if needed, an init routine may be used for initialization.
I proposed a solution that uses an array (of the same size as the range) of 1/0 flags indicating whether each value is inside. It solves insert/remove and requires O(range size) initialization.
But I have no idea how to implement findOldest with the given constraints.
Any ideas?
P.S. No dynamic allocation is allowed.
I apologize if I've misinterpreted your question, but the sense I get is that
You have a fixed range of values you're considering (say, [0, N))
You need to support insertions and deletions without duplicates.
You need to support findOldest.
One option would be to build an array of length N, where each entry stores a boolean "is active" flag as well as a pointer. Additionally, each entry has a doubly-linked list cell in it. Intuitively, you're building a bitvector with a linked list threaded through it storing the insertion order.
Initially, all bits are set to false and the pointers are all NULL. When you do an insertion, set the bit on the appropriate cell to true (returning immediately if it's already set), then update the doubly-linked list of elements by appending this new cell to it. This takes time O(1). To do a findOldest step, just query the pointer to the oldest element. Finally, to do a removal step, clear the bit on the element in question and remove it from the doubly-linked list, updating the head and tail pointer if necessary.
All in all, all operations take time O(1) and no dynamic allocations are performed because the linked list cells are preallocated as part of the array.
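As a sketch of this idea (in Python for brevity, so "no dynamic allocation" is only approximated): the three arrays are allocated once in the constructor, and array indices play the role of the pointers. `OldestSet`, `NIL` and the field names are my own, and values are assumed to lie in [0, n).

```python
class OldestSet:
    """Values in [0, n); insert, remove and find_oldest are all O(1)."""

    NIL = -1  # stands in for a NULL pointer

    def __init__(self, n):
        # O(n) initialization; nothing is allocated after this point.
        self.active = [False] * n
        self.prev = [self.NIL] * n
        self.next = [self.NIL] * n
        self.head = self.NIL  # oldest value still present
        self.tail = self.NIL  # most recently inserted value

    def insert(self, v):
        if self.active[v]:
            return False  # duplicates are not added again
        self.active[v] = True
        self.prev[v] = self.tail
        self.next[v] = self.NIL
        if self.tail == self.NIL:
            self.head = v
        else:
            self.next[self.tail] = v
        self.tail = v
        return True

    def remove(self, v):
        if not self.active[v]:
            return False
        self.active[v] = False
        p, n = self.prev[v], self.next[v]
        if p == self.NIL:
            self.head = n  # v was the oldest
        else:
            self.next[p] = n
        if n == self.NIL:
            self.tail = p  # v was the newest
        else:
            self.prev[n] = p
        return True

    def find_oldest(self):
        return None if self.head == self.NIL else self.head
```

In C the same layout would be one statically sized array of structs holding the flag and the two links.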
Hope this helps!
Introduction to Algorithms (CLRS) states that a hash table using doubly linked lists is able to delete items more quickly than one with singly linked lists. Can anybody tell me what the advantage is of using doubly linked lists instead of singly linked lists for deletion in a hash table implementation?
The confusion here is due to the notation in CLRS. To be consistent with the original question, I use the CLRS notation in this answer.
We use the hash table to store key-value pairs. The value portion is not mentioned in the CLRS pseudocode, while the key portion is defined as k.
In my copy of CLR (I am working off of the first edition here), the routines listed for hashes with chaining are insert, search, and delete (with more verbose names in the book). The insert and delete routines take argument x, which is the linked list element associated with key key[x]. The search routine takes argument k, which is the key portion of a key-value pair. I believe the confusion is that you have interpreted the delete routine as taking a key, rather than a linked list element.
Since x is a linked list element, having it alone is sufficient to do an O(1) deletion from the linked list in the h(key[x]) slot of the hash table, if it is a doubly-linked list. If, however, it is a singly-linked list, having x is not sufficient. In that case, you need to start at the head of the linked list in slot h(key[x]) of the table and traverse the list until you finally hit x to get its predecessor. Only when you have the predecessor of x can the deletion be done, which is why the book states the singly-linked case leads to the same running times for search and delete.
Additional Discussion
Although CLRS says that you can do the deletion in O(1) time, assuming a doubly-linked list, it also requires you have x when calling delete. The point is this: they defined the search routine to return an element x. That search is not constant time for an arbitrary key k. Once you get x from the search routine, you avoid incurring the cost of another search in the call to delete when using doubly-linked lists.
The pseudocode routines are lower level than you would use if presenting a hash table interface to a user. For instance, a delete routine that takes a key k as an argument is missing. If that delete is exposed to the user, you would probably just stick to singly-linked lists and have a special version of search to find the x associated with k and its predecessor element all at once.
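To illustrate the distinction, here is a small sketch of a chained hash table whose delete takes the element x rather than a key. It mirrors the shape of the routines described above but is not the book's pseudocode; `Element`, `ChainedHashTable` and `_slot` are names I chose, and Python's built-in `hash` stands in for h.

```python
class Element:
    """A list element x holding a key-value pair and both chain links."""

    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.prev = None
        self.next = None


class ChainedHashTable:
    def __init__(self, slots=8):
        self.table = [None] * slots  # each slot is the head of a chain

    def _slot(self, key):
        return hash(key) % len(self.table)

    def insert(self, x):
        # Push x onto the front of its chain: O(1).
        i = self._slot(x.key)
        x.prev = None
        x.next = self.table[i]
        if x.next is not None:
            x.next.prev = x
        self.table[i] = x

    def search(self, key):
        # Walks the chain, so this is not constant time in the worst case.
        x = self.table[self._slot(key)]
        while x is not None and x.key != key:
            x = x.next
        return x

    def delete(self, x):
        # O(1): x already knows its predecessor, so no traversal is needed.
        if x.prev is not None:
            x.prev.next = x.next
        else:
            self.table[self._slot(x.key)] = x.next
        if x.next is not None:
            x.next.prev = x.prev
```

With singly-linked chains, `delete(x)` would have to walk the chain from `self.table[self._slot(x.key)]` to find the predecessor of x.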
Unfortunately my copy of CLRS is in another country right now, so I can't use it as a reference. However, here's what I think it is saying:
Basically, a doubly linked list supports O(1) deletions because if you know the address of the item, you can just do something like:
x.left.right = x.right;
x.right.left = x.left;
to delete the object from the linked list, whereas in a singly linked list, even if you have the address, you need to search through the list to find its predecessor in order to do:
pred.next = x.next
So, when you delete an item from the hash table, you look it up, which is O(1) on average due to the properties of hash tables, then delete it in O(1), since you now have the address.
If this was a singly linked list, you would need to find the predecessor of the object you wish to delete, which would take O(n).
However:
I am also slightly confused about this assertion in the case of chained hash tables, because of how lookup works. In a chained hash table, if there is a collision, you already need to walk through the linked list of values in order to find the item you want, and thus would need to also find its predecessor.
But, the way the statement is phrased gives clarification: "If the hash table supports deletion, then its linked lists should be doubly linked so that we can delete an item quickly. If the lists were only singly linked, then to delete element x, we would first have to find x in the list T[h(x.key)] so that we could update the next attribute of x’s predecessor."
This is saying that you already have element x, which means you can delete it in the above manner. If you were using a singly linked list, even if you had element x already, you would still have to find its predecessor in order to delete it.
I can think of one reason, but this isn't a very good one. Suppose we have a hash table of size 100. Now suppose values A and G are each added to the table. Maybe A hashes to slot 75. Now suppose G also hashes to 75, and our collision resolution policy is to jump forward by a constant step size of 80. So we try to jump to (75 + 80) % 100 = 55. Now, instead of starting at the front of the list and traversing forward 80, we could start at the current node and traverse backwards 20, which is faster. When we get to the node that G is at, we can mark it as a tombstone to delete it.
Still, I recommend using arrays when implementing hash tables.
A hash table is often implemented as a vector of lists, where the index into the vector is the hash of the key.
If you don't have more than one value per key and you are not interested in any logic regarding those values, a singly linked list is enough. A more complex or specific design for selecting one of the values may require a doubly linked list.
Let's design the data structures for a caching proxy. We need a map from URLs to content; let's use a hash table. We also need a way to find pages to evict; let's use a FIFO queue to track the order in which URLs were last accessed, so that we can implement LRU eviction. In C, the data structure could look something like
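#include <stddef.h> /* size_t */

struct node {
    struct node *queueprev, *queuenext;            /* links in the access-order queue */
    struct node **hashbucketprev, *hashbucketnext; /* links in the hash bucket chain */
    const char *url;
    const void *content;
    size_t contentlength;
};

struct node *queuehead;   /* circular doubly-linked list */
struct node **hashbucket; /* array of bucket heads */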
One subtlety: to avoid a special case and wasting space in the hash buckets, x->hashbucketprev points to the pointer that points to x. If x is first in the bucket, it points into hashbucket; otherwise, it points into another node. We can remove x from its bucket with
if (x->hashbucketnext != NULL) /* x may be the last node in its bucket */
    x->hashbucketnext->hashbucketprev = x->hashbucketprev;
*(x->hashbucketprev) = x->hashbucketnext;
When evicting, we iterate over the least recently accessed nodes via the queuehead pointer. Without hashbucketprev, we would need to hash each node and find its predecessor with a linear search, since we did not reach it via hashbucketnext. (Whether that's really bad is debatable, given that the hash should be cheap and the chain should be short. I suspect that the comment you're asking about was basically a throwaway.)
If the items in your hashtable are stored in "intrusive" lists, they can be aware of the linked list they are a member of. Thus, if the intrusive list is also doubly-linked, items can be quickly removed from the table.
(Note, though, that the "intrusiveness" can be seen as a violation of abstraction principles...)
An example: in an object-oriented context, an intrusive list might require all items to be derived from a base class.
class BaseListItem {
    BaseListItem *prev, *next;
    ...
public: // list operations
    void insertAfter(BaseListItem*);
    void insertBefore(BaseListItem*);
    void removeFromList();
};
The performance advantage is that any item can be quickly removed from its doubly-linked list without locating or traversing the rest of the list.
I have to implement a cache with normal cache operations along with the facility of fast retrieval of the maximum element from the cache.
Can you please suggest data structures to implement this?
I was thinking of using a hash map along with a list to maintain the maximum element.
Suggest other approaches with better complexity.
A heap is great for fast retrieval of the maximum element.
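One way this could look, as a sketch only: a dict for the normal cache operations plus a max-heap for the maximum. Python's `heapq` is a min-heap, so values are negated. Removing an arbitrary entry from a heap is awkward, so this sketch uses lazy deletion (stale heap entries are skipped when they surface), which is my addition rather than something the answer specifies. `MaxCache` and its method names are made up, "maximum element" is taken to mean the entry with the largest value, values are assumed numeric, and keys are assumed comparable so that ties can be broken.

```python
import heapq


class MaxCache:
    """Key-value cache with retrieval of the entry holding the largest value."""

    def __init__(self):
        self.data = {}  # key -> current value
        self.heap = []  # (-value, key) pairs; may contain stale entries

    def put(self, key, value):
        self.data[key] = value
        heapq.heappush(self.heap, (-value, key))  # O(log n)

    def get(self, key):
        return self.data.get(key)

    def remove(self, key):
        # The heap entry is left behind and discarded later by max().
        self.data.pop(key, None)

    def max(self):
        while self.heap:
            neg_value, key = self.heap[0]
            if self.data.get(key) == -neg_value:
                return key, -neg_value
            heapq.heappop(self.heap)  # stale: removed or overwritten
        return None
```

Reading the maximum is O(1) when the top entry is live; each stale entry costs one O(log n) pop, paid once.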
There is a type of structure that I call exponential lookaside lists that are frequently used by OS's for keeping track of free chunks of memory. You start with some base size N (somewhere between 8 bytes, and the page size of the OS) and then build an array (or stack) of lists:
[list N]
[list N*2]
[list N*4]
[list N*8]
...
And so on up to some maximum. To maintain them, you take the size of a new entry (S) and use LOG2(S/N), rounded down, as your offset into the array of lists to determine which list to add the new chunk to. When you need to release (or return) your largest chunk, you just scan from the highest-sized list down until you find the first non-empty list, then scan for the largest chunk in that list.
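A minimal sketch of that bookkeeping follows, tracking only chunk sizes rather than real memory. `LookasideLists`, `base` and `levels` are illustrative names; sizes below the base go into the first list and sizes beyond the last list are clamped into it, which is my assumption about the edge cases.

```python
class LookasideLists:
    """Chunk sizes bucketed by power of two: list k holds [base*2^k, base*2^(k+1))."""

    def __init__(self, base=8, levels=16):
        self.base = base
        self.lists = [[] for _ in range(levels)]

    def _index(self, size):
        # floor(log2(size / base)) via integer arithmetic, clamped to the array.
        idx = (size // self.base).bit_length() - 1
        return max(0, min(idx, len(self.lists) - 1))

    def add(self, size):
        self.lists[self._index(size)].append(size)

    def pop_largest(self):
        # Scan from the highest list down to the first non-empty one.
        for bucket in reversed(self.lists):
            if bucket:
                largest = max(bucket)  # linear scan within one bucket only
                bucket.remove(largest)
                return largest
        return None
```

Only one bucket is ever scanned linearly, and every chunk in it is within a factor of two of the largest (apart from the clamped last list).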