Hashtable with doubly linked lists? - algorithm

Introduction to Algorithms (CLRS) states that a hash table using doubly linked lists can delete items more quickly than one using singly linked lists. Can anybody tell me what the advantage is of using doubly linked lists instead of singly linked lists for deletion in a hash table implementation?

The confusion here is due to the notation in CLRS. To be consistent with the true question, I use the CLRS notation in this answer.
We use the hash table to store key-value pairs. The value portion is not mentioned in the CLRS pseudocode, while the key portion is defined as k.
In my copy of CLR (I am working off of the first edition here), the routines listed for hashes with chaining are insert, search, and delete (with more verbose names in the book). The insert and delete routines take argument x, which is the linked list element associated with key key[x]. The search routine takes argument k, which is the key portion of a key-value pair. I believe the confusion is that you have interpreted the delete routine as taking a key, rather than a linked list element.
Since x is a linked list element, having it alone is sufficient to do an O(1) deletion from the linked list in the h(key[x]) slot of the hash table, if it is a doubly-linked list. If, however, it is a singly-linked list, having x is not sufficient. In that case, you need to start at the head of the linked list in slot h(key[x]) of the table and traverse the list until you finally hit x to get its predecessor. Only when you have the predecessor of x can the deletion be done, which is why the book states the singly-linked case leads to the same running times for search and delete.
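To make the element-versus-key distinction concrete, here is a minimal Python sketch of chaining with doubly linked lists. The names Node, ChainedHashTable and the value field are illustrative assumptions, not the book's pseudocode; the only point is that delete receives the element x and never walks the chain.

```python
class Node:
    """A chain element x holding a key (and a value, which CLRS omits)."""
    def __init__(self, key, value=None):
        self.key = key
        self.value = value
        self.prev = None
        self.next = None


class ChainedHashTable:
    def __init__(self, slots=8):
        self.table = [None] * slots  # table[i] is the head of chain i

    def _h(self, key):
        return hash(key) % len(self.table)

    def insert(self, x):
        """Insert element x at the head of its chain: O(1)."""
        i = self._h(x.key)
        x.prev = None
        x.next = self.table[i]
        if x.next is not None:
            x.next.prev = x
        self.table[i] = x

    def search(self, key):
        """Return the element with this key, or None: O(chain length)."""
        x = self.table[self._h(key)]
        while x is not None and x.key != key:
            x = x.next
        return x

    def delete(self, x):
        """Unlink element x (not a key) without walking the chain: O(1)."""
        if x.prev is not None:
            x.prev.next = x.next
        else:
            self.table[self._h(x.key)] = x.next  # x was the head of its chain
        if x.next is not None:
            x.next.prev = x.prev
        x.prev = x.next = None
```

A caller that starts from a key would write `table.delete(table.search(k))`, paying for the search once and for the unlinking not at all.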
Additional Discussion
Although CLRS says that you can do the deletion in O(1) time, assuming a doubly-linked list, it also requires you have x when calling delete. The point is this: they defined the search routine to return an element x. That search is not constant time for an arbitrary key k. Once you get x from the search routine, you avoid incurring the cost of another search in the call to delete when using doubly-linked lists.
The pseudocode routines are lower level than you would use if presenting a hash table interface to a user. For instance, a delete routine that takes a key k as an argument is missing. If that delete is exposed to the user, you would probably just stick to singly-linked lists and have a special version of search to find the x associated with k and its predecessor element all at once.
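A key-based delete over singly linked chains might look like the following Python sketch. SNode, SinglyChainedTable and delete_key are hypothetical names chosen for illustration; the walk remembers the predecessor as it goes, so no second pass is needed.

```python
class SNode:
    def __init__(self, key, value, next=None):
        self.key = key
        self.value = value
        self.next = next


class SinglyChainedTable:
    def __init__(self, slots=8):
        self.table = [None] * slots  # table[i] is the head of chain i

    def _h(self, key):
        return hash(key) % len(self.table)

    def insert(self, key, value):
        """Push a new node onto the front of its chain (no duplicate check)."""
        i = self._h(key)
        self.table[i] = SNode(key, value, self.table[i])

    def delete_key(self, key):
        """Find the node and its predecessor in one walk, then unlink it."""
        i = self._h(key)
        pred, x = None, self.table[i]
        while x is not None and x.key != key:
            pred, x = x, x.next
        if x is None:
            return False              # key not present
        if pred is None:
            self.table[i] = x.next    # x was the head of the chain
        else:
            pred.next = x.next
        return True
```

The cost is the same chain walk that search already performs, which is why the singly linked variant loses nothing when the public interface deletes by key.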

Unfortunately my copy of CLRS is in another country right now, so I can't use it as a reference. However, here's what I think it is saying:
Basically, a doubly linked list supports O(1) deletions because if you know the address of the item, you can just do something like:
x.left.right = x.right;
x.right.left = x.left;
to delete the object from the linked list, whereas in a singly linked list, even if you have the address, you need to search through the list to find its predecessor to do:
pred.next = x.next
So, when you delete an item from the hash table, you look it up, which is O(1) on average due to the properties of hash tables, then delete it in O(1), since you now have the address.
If this were a singly linked list, you would need to find the predecessor of the object you wish to delete, which would take O(n) in the worst case, where n is the length of the chain.
However:
I am also slightly confused about this assertion in the case of chained hash tables, because of how lookup works. In a chained hash table, if there is a collision, you already need to walk through the linked list of values in order to find the item you want, and thus would need to also find its predecessor.
But, the way the statement is phrased gives clarification: "If the hash table supports deletion, then its linked lists should be doubly linked so that we can delete an item quickly. If the lists were only singly linked, then to delete element x, we would first have to find x in the list T[h(x.key)] so that we could update the next attribute of x’s predecessor."
This is saying that you already have element x, which means you can delete it in the above manner. If you were using a singly linked list, even if you had element x already, you would still have to find its predecessor in order to delete it.

I can think of one reason, but this isn't a very good one. Suppose we have a hash table of size 100. Now suppose values A and G are each added to the table. Maybe A hashes to slot 75. Now suppose G also hashes to 75, and our collision resolution policy is to jump forward by a constant step size of 80. So we try to jump to (75 + 80) % 100 = 55. Now, instead of stepping forward 80 slots and wrapping around, we could start at the current node and traverse backwards 20, which is faster. When we get to the node that G is at, we can mark it as a tombstone to delete it.
Still, I recommend using arrays when implementing hash tables.

A hash table is often implemented as a vector of lists, where the index into the vector is the hash of the key.
If you don't have more than one value per key and you are not interested in any logic regarding those values, a singly linked list is enough. A more complex or specific design for selecting one of the values may require a doubly linked list.

Let's design the data structures for a caching proxy. We need a map from URLs to content; let's use a hash table. We also need a way to find pages to evict; let's use a FIFO queue to track the order in which URLs were last accessed, so that we can implement LRU eviction. In C, the data structure could look something like
struct node {
    struct node *queueprev, *queuenext;
    struct node **hashbucketprev, *hashbucketnext;
    const char *url;
    const void *content;
    size_t contentlength;
};
struct node *queuehead; /* circular doubly-linked list */
struct node **hashbucket;
One subtlety: to avoid a special case and wasting space in the hash buckets, x->hashbucketprev points to the pointer that points to x. If x is first in the bucket, it points into hashbucket; otherwise, it points into another node. We can remove x from its bucket with
if (x->hashbucketnext != NULL) /* x may be the last node in its bucket */
    x->hashbucketnext->hashbucketprev = x->hashbucketprev;
*(x->hashbucketprev) = x->hashbucketnext;
When evicting, we iterate over the least recently accessed nodes via the queuehead pointer. Without hashbucketprev, we would need to hash each node and find its predecessor with a linear search, since we did not reach it via hashbucketnext. (Whether that's really bad is debatable, given that the hash should be cheap and the chain should be short. I suspect that the comment you're asking about was basically a throwaway.)

If the items in your hashtable are stored in "intrusive" lists, they can be aware of the linked list they are a member of. Thus, if the intrusive list is also doubly-linked, items can be quickly removed from the table.
(Note, though, that the "intrusiveness" can be seen as a violation of abstraction principles...)
An example: in an object-oriented context, an intrusive list might require all items to be derived from a base class.
class BaseListItem {
    BaseListItem *prev, *next;
    ...
public: // list operations
    void insertAfter(BaseListItem*);
    void insertBefore(BaseListItem*);
    void removeFromList();
};
The performance advantage is that any item can be quickly removed from its doubly-linked list without locating or traversing the rest of the list.
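As a rough Python illustration of the same idea, the sketch below keeps the links inside the items themselves. ListItem, Page and urls are hypothetical names, and the circular layout with a sentinel item is an assumption made here to avoid null checks; it is not implied by the class above.

```python
class ListItem:
    """Hypothetical base class: every item carries its own links."""
    def __init__(self):
        self.prev = self  # a lone item forms a one-element ring
        self.next = self

    def insert_after(self, other):
        """Link `other` into this item's ring, right after this item."""
        other.prev, other.next = self, self.next
        self.next.prev = other
        self.next = other

    def remove_from_list(self):
        """Unlink this item in O(1); no list head or traversal is needed."""
        self.prev.next = self.next
        self.next.prev = self.prev
        self.prev = self.next = self


class Page(ListItem):
    def __init__(self, url):
        super().__init__()
        self.url = url


def urls(sentinel):
    """Walk the ring once, skipping the sentinel that anchors it."""
    out, item = [], sentinel.next
    while item is not sentinel:
        out.append(item.url)
        item = item.next
    return out
```

Calling `page.remove_from_list()` needs nothing but the item itself, which is the property a hash table bucket can exploit.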

Related

What is the need for asymmetric linked list

I am studying data structures. I have come across Asymmetric linked list which states that it is a special type of double linked list in which
1. next link points to next node address
2. prev link points to current node address itself
But I wonder,
1. what are the advantages we get by designing such linked list?
2. what kind of applications this would be suitable for?
Could anyone kindly explain more about asymmetric linked lists? I googled but could not find relevant answers. Thank you.
Source :http://en.wikipedia.org/wiki/Doubly_linked_list#Asymmetric_doubly-linked_list
I agree the Wiki page is misleading. Here is the difference between LL and ALL:
Open Linked List:
node.next = nextNode
node.prev = prevNode
Asymmetric Linked List:
node.next = nextNode
node.prev = &prevNode.next
Note the difference: prevNode vs. the address of prevNode.next.
Pointing to a pointer within the previous node still preserves the ability to traverse the list backwards (you can get the prevNode address by subtracting the offset of the next field from that pointer), and it may simplify insertion and deletion operations on the list, especially at the first element.
Given a pointer to a node in a doubly linked list, we can reach all the nodes through 'prev' and 'next', while a singly linked list cannot do that if the given pointer doesn't point to the first node.
E.g., deleting a node from a linked list. With a singly linked list, you have to traverse the list from the head to find the specific node while also recording its previous node, which costs O(n) time. With a doubly linked list, however, you can delete the specific node in constant time.
In short, given a specific node in a singly linked list, an O(n) traversal from the head is inevitable if we need its previous node; a doubly linked list avoids that.
By the way, list in STL and LinkedList in Java are implemented with double linked list.
A picture is worth a thousand words here: in a diagram of the list, each "previous" field references the preceding node's "next" field rather than the previous element itself. This makes little difference between nodes, except for the first element: its previous field can point to the head pointer rather than pointing to the last element (as in a circular list) or being null.
The main advantage is for insertion and deletion: you don't need to take care of the head or check whether the element is the first one. Having a single pointer to an element is enough to perform an insert or a delete on the list.
One disadvantage versus a circular list: the only way to get the last element (e.g., to implement an "add last" operation) is to loop through the whole list.
You also lose the ability to loop through the list in reverse (because there is no pointer to the previous node), unless all elements have the same size and you are allowed to do pointer arithmetic (as in C/C++).

LinkedList does not provide index based access, so why does it have get(index) method?

I understand that ArrayList is an index-based data structure that allows you to access its elements by index, but LinkedList is not supposed to be index based, so why does it have a get(index) method that allows direct access to an element?
It may not be efficient to retrieve items from a linked list by index, but linked lists do have indices, and sometimes you just need to retrieve an item at a certain index. When that happens, it's much better to have a get method than to force users to grab an iterator and iterate to the desired position. As long as you don't call it too much or the list is small, it's fine.
This is really just an implementation decision. While an array would probably be a fairly useless data structure if you can't look up elements by index, adding a by-index lookup to a linked-list implementation doesn't do any harm (well, unless users assume it's fast - see below), and it does come in handy sometimes.
One can assign every element a number as follows:
             0            1            2            3            4
Head (Element0) -> Element1 -> Element2 -> Element3 -> Element4 -> NULL
From here, it's trivial to write a function to return the element at some given index.
Note that a by-index lookup on a linked-list will be slow - if you're looking for let's say the element in the middle, you'll need to work through half the list to get there.
The previous answers imply that LinkedLists have indices.
However, a fixed index for every element in the data structure would defeat the purpose of the LinkedList and e.g. make some remove/add operations slower because the structure would need to be reindexed every time. This would take linear time, even for elements at the beginning and at the end of the list, that are crucial for Java's LinkedList's efficiency.
From Java's LinkedList implementation you can see that there is no constant time index access to the element, but rather a linear traversal where the exact element is figured out on the go.
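A by-index lookup of that kind can be sketched in Python as follows. This is not Java's source; DNode and this LinkedList class are illustrative assumptions that mirror the idea of counting links from whichever end of a doubly linked list is closer, with no index stored on any node.

```python
class DNode:
    def __init__(self, item):
        self.item = item
        self.prev = None
        self.next = None


class LinkedList:
    def __init__(self, items=()):
        self.first = self.last = None
        self.size = 0
        for item in items:
            self.append(item)

    def append(self, item):
        """Add at the tail in O(1); no other node is touched or renumbered."""
        node = DNode(item)
        if self.last is None:
            self.first = self.last = node
        else:
            node.prev = self.last
            self.last.next = node
            self.last = node
        self.size += 1

    def get(self, index):
        """No stored indices: count links from whichever end is closer."""
        if not 0 <= index < self.size:
            raise IndexError(index)
        if index < self.size // 2:
            node = self.first
            for _ in range(index):
                node = node.next
        else:
            node = self.last
            for _ in range(self.size - 1 - index):
                node = node.prev
        return node.item
```

Each call costs up to size/2 link hops, so a loop that calls get(i) for every i is quadratic, whereas iterating the nodes directly is linear.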

Data structure supporting O(1) remove/insert/findOldest?

This question was asked in the interview:
Propose and implement a data structure that works with integer data from finite, contiguous ranges of integers. The data structure should support O(1) insert and remove operations as well as findOldest (the oldest value inserted into the data structure).
No duplication is allowed (i.e., if some value is already inside, it should not be added once more).
Also, if needed, some init routine might be used for initialization.
I proposed a solution to use an array (size as range size) of 1/0 indicating the value is inside. It solves insert/remove and requires O(range size) initialization.
But I have no idea how to implement findOldest with the given constraints.
Any ideas?
P.S. No dynamic allocation is allowed.
I apologize if I've misinterpreted your question, but the sense I get is that
You have a fixed range of values you're considering (say, [0, N))
You need to support insertions and deletions without duplicates.
You need to support findOldest.
One option would be to build an array of length N, where each entry stores a boolean "is active" flag as well as a pointer. Additionally, each entry has a doubly-linked list cell in it. Intuitively, you're building a bitvector with a linked list threaded through it storing the insertion order.
Initially, all bits are set to false and the pointers are all NULL. When you do an insertion, set the bit on the appropriate cell to true (returning immediately if it's already set), then update the doubly-linked list of elements by appending this new cell to it. This takes time O(1). To do a findOldest step, just query the pointer to the oldest element. Finally, to do a removal step, clear the bit on the element in question and remove it from the doubly-linked list, updating the head and tail pointer if necessary.
All in all, all operations take time O(1) and no dynamic allocations are performed because the linked list cells are preallocated as part of the array.
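A minimal Python sketch of this layout follows, assuming values in [0, n) and using array indices in place of pointers. OldestTracker and its method names are made up for illustration, and Python lists only approximate the "no dynamic allocation" constraint: all three arrays are sized once in the constructor and never grow.

```python
class OldestTracker:
    """Values in [0, n); all storage is allocated once in __init__."""
    NIL = -1

    def __init__(self, n):
        self.active = [False] * n     # the "is active" flags
        self.prev = [self.NIL] * n    # links threaded through the array
        self.next = [self.NIL] * n
        self.head = self.NIL          # oldest value
        self.tail = self.NIL          # newest value

    def insert(self, v):
        if self.active[v]:
            return False              # duplicates are ignored
        self.active[v] = True
        self.prev[v], self.next[v] = self.tail, self.NIL
        if self.tail == self.NIL:
            self.head = v
        else:
            self.next[self.tail] = v
        self.tail = v
        return True

    def remove(self, v):
        if not self.active[v]:
            return False
        self.active[v] = False
        p, n = self.prev[v], self.next[v]
        if p == self.NIL:
            self.head = n             # v was the oldest
        else:
            self.next[p] = n
        if n == self.NIL:
            self.tail = p             # v was the newest
        else:
            self.prev[n] = p
        return True

    def find_oldest(self):
        return None if self.head == self.NIL else self.head
```

Every operation touches a constant number of array cells, so insert, remove and find_oldest are all O(1) after the O(n) initialization.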
Hope this helps!

Programming : find the first unique string in a file in just 1 pass

Given a very long list of Product Names, find the first product name which is unique (occurred exactly once ). You can only iterate once in the file.
I am thinking of using a hash map and storing the (key, count) entries in a doubly linked list, basically a linked hash map. Can anyone optimize this or suggest a better approach?
Since you can only iterate the list once, you have to store
each string that occurs exactly once, because it could be the output
their relative position within the list
each string that occurs more than once (or their hash, if you're not afraid)
Notably, you don't have to store the relative positions of strings that occur more than once.
You need
efficient storage of the set of strings. A hash set is a good candidate, but a trie could offer better compression depending on the set of strings.
efficient lookup by value. This rules out a bare list. A hash-set is the clear winner, but a trie also performs well. You can store the leaves of the trie in a hash set.
efficient lookup of the minimum. This asks for a linked list.
Conclusion:
Use a linked hash-set for the set of strings, and a flag indicating if they're unique. If you're fighting for memory, use a linked trie. If a linked trie is too slow, store the trie leaves in a hash map for look-up. Include only the unique strings in the linked list.
In total, your nodes could look like: Node:{Node[] trieEdges, Node trieParent, String inEdge, Node nextUnique, Node prevUnique}; Node firstUnique, Node[] hashMap
If you strive for ease of implementation, you can have two hash-sets instead (one linked).
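The two-set variant can be sketched in Python, where a plain dict keeps insertion order (guaranteed since Python 3.7) and so stands in for the linked hash set. The function name first_unique and the sample product names are assumptions for illustration.

```python
def first_unique(names):
    """Return the first name that occurs exactly once, or None."""
    unique = {}         # insertion-ordered: names seen exactly once so far
    repeated = set()    # names seen more than once
    for name in names:  # the single pass over the input
        if name in repeated:
            continue
        if name in unique:
            del unique[name]     # second sighting: no longer a candidate
            repeated.add(name)
        else:
            unique[name] = True
    return next(iter(unique), None)
```

For the input tv, phone, tv, radio, phone, lamp this returns radio: tv and phone are moved to the repeated set on their second sighting, leaving radio as the earliest surviving candidate.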
The following algorithm solves it in O(N+M) time.
where
N=number of strings
M=total number of characters put together in all strings.
The steps are as follows:
1. Create a hash value for each string.
2. XOR the hash values together and find the one which didn't have a pair.
XOR has the useful property that a xor a = 0 and b xor 0 = b.
Tips to generate the hash value for a string:
Use a base-27 number system, and give a the value 1, b the value 2, and so on up to z, which gets 26. So if the string is "abc", we compute the hash value as:
H = 3*(27 power 0) + 2*(27 power 1) + 1*(27 power 2)
= 786
You could use the modulus operator to make hash values small enough to fit in 32-bit integers. If you do that, keep an eye out for collisions, which are two different strings that get the same hash value due to the modulus operation.
Mostly I guess you won't be needing it.
So compute the hash for each string, then start from the first hash and keep xor-ing; the result will hold the hash value of the string which didn't have a pair.
Caution: This is useful only when every other string occurs in pairs. Still, this is a good idea to start with, which is why I answered it.
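The hashing and XOR steps can be checked with a short Python sketch. base27_hash and unpaired_hash are hypothetical helper names, and the input is assumed to contain only lowercase letters a to z.

```python
from functools import reduce
from operator import xor


def base27_hash(s, modulus=None):
    """Treat lowercase s as a base-27 number with a=1, ..., z=26."""
    h = 0
    for ch in s:
        h = h * 27 + (ord(ch) - ord("a") + 1)
        if modulus is not None:
            h %= modulus  # optional: keep the value small, at the risk of collisions
    return h


def unpaired_hash(strings):
    """XOR all hashes; hashes that appear an even number of times cancel."""
    return reduce(xor, (base27_hash(s) for s in strings), 0)
```

Here `base27_hash("abc")` evaluates to 786, matching the hand computation. The XOR only yields a hash, so recovering the string itself would need a separate map from hash back to string, and the result is meaningful only under the pairing assumption stated in the caution.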
Using a linked hashmap is obvious enough. Otherwise, you could use a TreeMap style data structure where the strings are ordered by count. So as soon as you are done reading the input, the root of your tree is unique if a unique string exists. Insertion takes at most O(log n), whereas a linked hash map inserts in O(1) on average but can degrade to O(n) in the worst case. You can read up on TreeMaps for insight on how to augment a basic TreeMap into what you need. Also in your linked hashmap you may have to travel O(n) to find your first unique key. With a TreeMap style data structure, your look up is O(1) -- the root. Even if more unique keys exist, the first one you encountered will be the root. The subsequent ones will be children of the root.

Immutablity of Node-based data structures

Is there any general approach if one wanted to provide an immutable version of e.g. LinkedList, implemented as a linked sequence of nodes? I understand that in the case of ArrayList you would copy the underlying array, but in this case this is not that obvious to me...
Immutable lists are basically represented the same way as regular linked lists, except that all operations that would normally modify the list return a new one instead. This new list does not necessarily need to contain a copy of the entire previous list but can reuse elements of it.
I recommend implementing the following operations in the following ways:
Popping the element at the front: simply return a pointer to the next node. Complexity: O(1).
Pushing an element to the front: create a new node that points to the first node of the old list and return it. O(1).
Concatenating list a with list b: copy the entire list a and let the pointer in the final node point to the beginning of list b. Note that list b is shared, not copied. O(length(a)).
Inserting at position x: copy everything up to x, add a node with the new element to the back of the copy, and let that node point to the old list at position x. O(x).
Removing the element at position x: practically the same as inserting. O(x).
Sorting: you can just use plain quick- or mergesort. It's not much faster or slower than it would be on mutable lists. The only difference is that you can't sort in place but will have to sort to a copy. O(n*log n).
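The structure-sharing operations above can be sketched in Python with an immutable node type. Cons, push, pop, concat, insert, remove and to_list are illustrative names, None stands for the empty list, and the recursive helpers assume lists short enough to stay within Python's recursion limit.

```python
from typing import Any, NamedTuple, Optional


class Cons(NamedTuple):
    """One immutable node; None stands for the empty list."""
    head: Any
    tail: Optional["Cons"] = None


def push(lst, item):
    return Cons(item, lst)           # O(1): the new node shares all of lst


def pop(lst):
    return lst.tail                  # O(1): the rest of the list is reused


def concat(a, b):
    """Copy a's nodes and share b: O(length(a))."""
    return b if a is None else Cons(a.head, concat(a.tail, b))


def insert(lst, x, item):
    """Copy the first x nodes, then share the old list from position x."""
    if x == 0:
        return Cons(item, lst)
    return Cons(lst.head, insert(lst.tail, x - 1, item))


def remove(lst, x):
    """Copy the first x nodes, then share the old list from position x + 1."""
    if x == 0:
        return lst.tail
    return Cons(lst.head, remove(lst.tail, x - 1))


def to_list(lst):
    out = []
    while lst is not None:
        out.append(lst.head)
        lst = lst.tail
    return out
```

After `new = insert(old, 1, 9)`, the old list is unchanged and `new.tail.tail is old.tail` holds, which shows that the suffix is reused rather than copied.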
