How to implement a linked list with 1 million nodes?

I recently attended a Microsoft interview.
I was asked to implement a linked list with 1 million nodes. How would you access the 999,999th node?
What is the optimal design strategy and implementation for such a question?

A linked list has fairly few variations, because much variation means it would be something other than a linked list.
You can vary it by having single or double linking. Single linking is where you have a pointer to the head (the first node, A say) which points to B which points to C, etc. To turn that into a double linked list you would also add a link from C to B and B to A.
If you have a doubly linked list, it makes sense to keep a pointer to the tail (the last node) as well as the head. Accessing the last element is then cheap, and elements near the end are cheaper too, because you can work backwards as well as forwards. But you need to know that what you want is near the end of the list. And a linked list is still a linked list: if it is going to get very large and that is a problem for your use case, then a storage structure other than a linked list should probably be chosen.
You could hybridise your linked list, for example by indexing it, and there's nothing wrong with that in theory. But if you index ALL the nodes, the linked-list nature is no longer of much value, and if you index only some, the nodes between indexed nodes have to be ordered in some way so you can find a close node and work towards the target node. This would probably never be optimal, and a better data structure should be chosen.
Really a linked list should be used when you don't want to do things like get a specific node, but want to iterate nodes regardless.

I have no idea about what I'm going to say, but, here goes:
You could conceptually split the list into sqrt(1000000) = 1000 blocks, so that you would have "reference pointers" every 1000 elements.
Think of it as having 1000 linked lists each with 1000 elements representing your list with 1000000 elements.
This is what comes to mind!
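A minimal sketch of that idea, assuming a singly linked list that is built once and not modified afterwards. The names `CheckpointedList`, `checkpoints` and `stride` are made up for illustration; the reference pointers are simply kept in a side array.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None


class CheckpointedList:
    """Singly linked list with a reference pointer every `stride` nodes."""

    def __init__(self, values, stride=1000):
        self.stride = stride
        self.checkpoints = []  # checkpoints[k] is the node at index k * stride
        self.size = 0
        tail = None
        for value in values:
            node = Node(value)
            if tail is not None:
                tail.next = node
            if self.size % stride == 0:
                self.checkpoints.append(node)
            tail = node
            self.size += 1

    def get(self, index):
        """Jump to the nearest checkpoint at or before `index`, then walk."""
        if not 0 <= index < self.size:
            raise IndexError(index)
        node = self.checkpoints[index // self.stride]
        for _ in range(index % self.stride):
            node = node.next
        return node.value


big = CheckpointedList(range(1_000_000))
value = big.get(999_999)  # walks at most stride - 1 nodes after the jump
```

The checkpoints are only valid while the list is unchanged: an insertion or deletion shifts every later index, so the side array would have to be repaired, which is where this hybrid starts to lose the benefits of a linked list.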

As Michael said, you should first present the two classic variations of the linked list. The next thing you should do is ask about the insertion, search and deletion patterns.
These patterns will guide you towards a better-fitting data structure, because nobody wants a singly or doubly linked list with a million nodes.

A circular doubly linked list with a counter for the number of nodes could be quite helpful in this case.
What I am suggesting is a circular doubly linked list in which each node has a counter variable holding its index, plus a static variable holding the overall number of nodes in the list.
Now when you search for an index that is greater than 50% of the total node count, i.e. an element in the second half of the list, you can traverse the list in the reverse direction.
Say you have 10 nodes in your circular linked list and you want the node at zero-based index 8: you can start from the head and step in the opposite direction just 2 times.
This approach reduces the iterations needed to reach items near either end, but in the worst case you still have to traverse halfway through the list for items in the middle.
The only downside of this approach is the extra memory, which I am assuming is not a design concern here.
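A rough sketch of the direction choice, under the assumption that indices are zero-based and that only the total node count is stored (the per-node index counter is left out here). `CircularList` and the `steps` return value are illustrative names, not part of the original answer.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.prev = self
        self.next = self


class CircularList:
    def __init__(self):
        self.head = None
        self.size = 0  # overall number of nodes

    def append(self, value):
        node = Node(value)
        if self.head is None:
            self.head = node
        else:
            last = self.head.prev
            last.next = node
            node.prev = last
            node.next = self.head
            self.head.prev = node
        self.size += 1

    def get(self, index):
        """Return (value, steps taken) for a zero-based index."""
        if not 0 <= index < self.size:
            raise IndexError(index)
        node = self.head
        if index <= self.size // 2:
            steps = index              # walk forwards from the head
            for _ in range(steps):
                node = node.next
        else:
            steps = self.size - index  # walk backwards through the tail
            for _ in range(steps):
                node = node.prev
        return node.value, steps


ring = CircularList()
for v in range(10):
    ring.append(v)
result = ring.get(8)  # reached by stepping backwards from the head
```

With 10 nodes, index 8 is reached in 2 backward steps instead of 8 forward ones; index 5 still costs 5 steps, which is the halfway worst case.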

Related

Is it theoretically beneficial to implement a linked list which remembers the node previously accessed?

I intended to implement a linked list that remembers the node previously accessed, but I am not sure whether it is theoretically beneficial to do so.
For example, suppose the user first calls an API which retrieves the 300th element and then makes another call which retrieves the 310th element. A normal linked list would do two potentially expensive linear lookups, each starting from the head or the tail of the list. But if an implementation somehow remembers the 300th node during the first call and begins the second lookup from it, the second call would be much cheaper.
I failed to find related information about this topic. Could anyone advise whether my thinking is correct?
I don't see the point in using a linked list for lookups, as there are better alternatives: priority queue, binary search tree, hash table, ...
But to answer the question:
If you keep a reference to the last accessed node in the list, and first start searching from that node, there is a possibility that you will not find the target node in that first scan, and still have to do a second scan, starting this time from the head of the list (making sure not to go beyond the node from where you started the first scan).
If the node being searched for has an equal probability to be at any index in the list, then there is no gain.
To see this clearly, imagine for a moment that the list is circular, where the tail is linked to the head. This actually represents what I described above: that a failing first scan must continue from the head. As assumed, all nodes in this circular list have an equal probability to be the node being searched for, and so there really is no preferential node: all nodes have an equal role in the "circle". It can easily be seen that any node would serve just as well as starting point for the search as any other.
It is only when you have more information, and there is a higher probability that the node being searched for is closer to the end of the list than the beginning, that you may find benefit in starting the search from any node you may still have access to.
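To make the trade-off concrete, here is a small sketch of a singly linked list that remembers the last accessed node. `CursorList` and its fields are hypothetical names; `get` also returns the number of nodes walked so the effect of the cursor is visible.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None


class CursorList:
    def __init__(self, values):
        self.head = None
        self.size = 0
        tail = None
        for value in values:
            node = Node(value)
            if tail is None:
                self.head = node
            else:
                tail.next = node
            tail = node
            self.size += 1
        self._cursor_node = self.head  # last accessed node
        self._cursor_index = 0         # and its index

    def get(self, index):
        """Return (value, steps); start from the cursor when it is not past the target."""
        if not 0 <= index < self.size:
            raise IndexError(index)
        if index >= self._cursor_index:
            node, position = self._cursor_node, self._cursor_index
        else:
            node, position = self.head, 0  # cursor overshoots: restart at head
        steps = index - position
        for _ in range(steps):
            node = node.next
        self._cursor_node, self._cursor_index = node, index
        return node.value, steps


lst = CursorList(range(1000))
first = lst.get(300)   # walks from the head
second = lst.get(310)  # walks from the remembered node
```

The second call costs 10 steps instead of 310, but only because the access pattern is local and moves forward. For targets spread uniformly over the list the cursor gives no expected benefit, as argued above.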

Follow up on detecting loop in linked list

So there are several questions on how to detect a loop in a linked list. Here is one example. My question is, why do all these algorithms use two pointers? Couldn't you just loop through with one pointer and mark the nodes as visited? When you come to a node you've already visited, you know there is a loop, and when you reach the end of the linked list (next = null), you know there is no loop.
It's because to "mark the nodes as visited" you need either extra space in the nodes themselves to do it, or an auxiliary data structure whose size will increase with that of the list, whereas the two-pointer solutions require only enough extra space for one more pointer.
[EDITED to add:] ... And, perhaps, also because the two-pointer solutions are clever and people like clever solutions to things.
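The difference in extra space can be seen by putting the two approaches side by side. This is a sketch with made-up helper names (`has_loop_visited_set`, `has_loop_two_pointers`, `build`); the two-pointer version is the usual slow/fast pointer technique.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None


def has_loop_visited_set(head):
    """One pointer plus a set that grows with the list: O(n) extra space."""
    seen = set()
    node = head
    while node is not None:
        if id(node) in seen:
            return True
        seen.add(id(node))
        node = node.next
    return False


def has_loop_two_pointers(head):
    """Slow pointer moves one step, fast pointer two: O(1) extra space."""
    slow = fast = head
    while fast is not None and fast.next is not None:
        slow = slow.next
        fast = fast.next.next
        if slow is fast:
            return True
    return False


def build(n, loop_to=None):
    """Build an n-node list; optionally link the last node back to index loop_to."""
    nodes = [Node(i) for i in range(n)]
    for a, b in zip(nodes, nodes[1:]):
        a.next = b
    if loop_to is not None:
        nodes[-1].next = nodes[loop_to]
    return nodes[0]
```

Both functions give the same answer; only the visited-set version needs memory proportional to the number of nodes.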

When is doubly linked list more efficient than singly linked list?

In an interview today I got asked the question.
Apart from answering reversing the list and both forward and backward traversal there was something "fundamental" in it that the interviewer kept stressing. I gave up and of course did a bit of research after the interview. It seems that insertion and deletion are more efficient in a doubly linked list than in a singly linked list. I am not quite sure how that can be, since a doubly linked list obviously has more references to change.
Can anybody explain the secret behind this? I honestly did quite a bit of research and failed to understand it; my main trouble is that an O(n) search is still needed for the doubly linked list.
Insertion is clearly less work in a singly-linked list, as long as you are content to always insert at the head or after some known element. (That is, you cannot insert before a known element, but see below.)
Deletion, on the other hand, is trickier because you need to know the element before the element to be deleted.
One way of doing this is to make the delete API work with the predecessor of the element to be deleted. This mirrors the insert API, which takes the element which will be the predecessor of the new element, but it's not very convenient and it's hard to document. It's usually possible, though. Generally speaking, you arrive at an element in a list by traversing the list.
Of course, you could just search the list from the beginning to find the element to be deleted, so that you know what its predecessor was. That assumes that the delete API includes the head of the list, which is also inconvenient. Also, the search is stupidly slow.
The way that hardly anyone uses, but which is actually pretty effective, is to define a singly-linked list iterator to be the pointer to the element preceding the current target of the iterator. This is simple, only one indirection slower than using a pointer directly to the element, and makes both insertion and deletion fast. The downside is that deleting an element may invalidate other iterators to list elements, which is annoying. (It doesn't invalidate the iterator to the element being deleted, which is nice for traversals which delete some elements, but that's not much compensation.)
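A sketch of that iterator convention, assuming a dummy sentinel node in front of the list so that the first element also has a predecessor (the sentinel and all function names here are illustrative choices, not something the answer prescribes).

```python
class Node:
    def __init__(self, value=None, next=None):
        self.value = value
        self.next = next


class SinglyLinkedList:
    def __init__(self, values=()):
        self.sentinel = Node()  # dummy node before the first element
        tail = self.sentinel
        for value in values:
            tail.next = Node(value)
            tail = tail.next

    def begin(self):
        """An iterator is the node *before* the element it refers to."""
        return self.sentinel

    def to_list(self):
        out, node = [], self.sentinel.next
        while node is not None:
            out.append(node.value)
            node = node.next
        return out


def current(it):
    return it.next.value              # one extra indirection

def advance(it):
    return it.next

def insert_before_current(it, value):
    it.next = Node(value, it.next)    # O(1), no search needed

def delete_current(it):
    it.next = it.next.next            # O(1); `it` now refers to the next element


lst = SinglyLinkedList([1, 2, 3, 4])
it = lst.begin()
while it.next is not None:            # delete the even values while traversing
    if current(it) % 2 == 0:
        delete_current(it)
    else:
        it = advance(it)
```

Note how the traversal keeps using the same iterator after a deletion. Any other iterator that was pointing at the deleted node itself (that is, referring to its successor) would be left dangling, which is the invalidation problem mentioned above.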
If deletion is not important, perhaps because the datastructures are immutable, singly-linked lists offer another really useful property: they allow structure-sharing. A singly-linked list can happily be the tail of multiple heads, something which is impossible for a doubly-linked list. For this reason, singly-linked lists have traditionally been the simple datastructure of choice for functional languages.
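Structure sharing is easy to picture with a tiny immutable cons-cell sketch (the `Cons` class and `to_list` helper are assumed names for illustration):

```python
class Cons:
    """An immutable-by-convention singly linked cell."""
    def __init__(self, head, tail=None):
        self.head = head
        self.tail = tail


def to_list(cell):
    out = []
    while cell is not None:
        out.append(cell.head)
        cell = cell.tail
    return out


shared = Cons(2, Cons(3))  # the common tail: 2 -> 3
a = Cons(1, shared)        # 1 -> 2 -> 3
b = Cons(0, shared)        # 0 -> 2 -> 3, reusing the same two cells
```

Both lists reuse the same tail cells without copying. With a doubly linked list this cannot work, because the cell holding 2 would need a single `prev` pointer that points at both 1 and 0.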
Here is some code that made it clearer to me... Having:
class Node {
    Node next;
    Node prev;
}
DELETE a node in a SINGLE LINKED LIST -O(n)-
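You don't know which node is the preceding one, so you have to traverse the list from the head until you find it:
// Assumes node is not the head itself; the head case needs the list's head reference updated instead.
void deleteNode(Node head, Node node) {
    Node prevNode = head;
    Node tmpNode = head.next;
    while (tmpNode != null) {
        if (tmpNode == node) {
            prevNode.next = tmpNode.next; // unlink node
            return;
        }
        prevNode = tmpNode;
        tmpNode = tmpNode.next;
    }
}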
DELETE a node in a DOUBLE LINKED LIST -O(1)-
You can simply update the links like this:
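// The null checks cover the first and last node; a list-level head/tail reference must be updated by the caller.
void deleteNode(Node node) {
    if (node.prev != null) node.prev.next = node.next;
    if (node.next != null) node.next.prev = node.prev;
}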
Here are my thoughts on doubly linked lists:
You have ready access and insertion at both ends.
They can work as a queue and a stack at the same time.
Node deletion requires no additional pointers.
You can apply hill-climb traversal, since you already have access at both ends.
If you are storing numerical values and your list is sorted, you can keep a pointer to the median node; the search operation can then be optimized with a statistical approach.
If you are going to delete an element in a linked list, you will need to link the previous element to the next element. With a doubly linked list you have ready access to both elements because you have links to both of them.
This assumes that you already have a pointer to the element you need to delete and there is no searching involved.
'Apart from answering reversing the list and both forward and backward traversal there was something "fundamental"'.
Nobody seems to have mentioned: in a doubly linked list it is possible to reinsert a deleted element just by having a pointer to the deleted element. See Knuth's Dancing Links paper. I think that's pretty fundamental.
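A minimal sketch of that trick, assuming a circular doubly linked list with a header node (the helper names `unlink`, `relink`, `insert_after` and `values` are illustrative). The key point is that `unlink` leaves the removed node's own `prev` and `next` untouched, so the node remembers where it belongs.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.prev = self
        self.next = self


def insert_after(node, new):
    new.prev, new.next = node, node.next
    node.next.prev = new
    node.next = new


def unlink(node):
    """Remove node from the list but keep its own prev/next pointers."""
    node.prev.next = node.next
    node.next.prev = node.prev


def relink(node):
    """Undo unlink(node) using only the pointer to the removed node."""
    node.prev.next = node
    node.next.prev = node


def values(header):
    out, node = [], header.next
    while node is not header:
        out.append(node.value)
        node = node.next
    return out


header = Node(None)
nodes = {}
tail = header
for v in "ABC":
    nodes[v] = Node(v)
    insert_after(tail, nodes[v])
    tail = nodes[v]

unlink(nodes["B"])
after_unlink = values(header)
relink(nodes["B"])
after_relink = values(header)
```

This only restores the list correctly if removed nodes are relinked in the reverse order of their removal, which is how backtracking searches use it.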
Because doubly linked lists have immediate access to both the front and the end of the list, they can insert data at either end in O(1) and delete data at either end in O(1). Because they can insert at the end in O(1) time and delete from the front in O(1) time, they make a perfect underlying data structure for a queue. Queues are lists of items in which data can only be inserted at the end and removed from the beginning.
Queues are an example of an abstract data type, and we can also use an array to implement them under the hood.
Since queues insert at the end and delete from the beginning, arrays are only so good as the underlying data structure. While arrays are O(1) for insertions at the end (amortized, for a dynamic array), they're O(N) for deleting from the beginning.
A doubly linked list, on the other hand, is O(1) for both inserting at the end and deleting from the beginning. That's what makes it a perfect fit for serving as the queue's underlying data structure.
The doubly linked list is used in LRU cache design, since we need to remove the least recently used items frequently and deletion is fast. To evict the least recently used item, we just delete it from the end; to add a new item to the cache, we just insert a new node at the beginning of the list.
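One common way to build this is to pair the doubly linked list with a hash map from key to node, so that a cached node can be found and moved in O(1). The sketch below assumes a capacity of at least 1 and uses two sentinel nodes to avoid edge cases; the class and method names are illustrative.

```python
class Node:
    def __init__(self, key=None, value=None):
        self.key, self.value = key, value
        self.prev = self.next = None


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity  # assumed to be >= 1
        self.nodes = {}           # key -> node
        self.head = Node()        # sentinel on the most recently used side
        self.tail = Node()        # sentinel on the least recently used side
        self.head.next, self.tail.prev = self.tail, self.head

    def _unlink(self, node):
        node.prev.next, node.next.prev = node.next, node.prev

    def _push_front(self, node):
        node.prev, node.next = self.head, self.head.next
        self.head.next.prev = node
        self.head.next = node

    def get(self, key):
        node = self.nodes.get(key)
        if node is None:
            return None
        self._unlink(node)        # move the node to the front: it was just used
        self._push_front(node)
        return node.value

    def put(self, key, value):
        node = self.nodes.get(key)
        if node is not None:
            node.value = value
            self._unlink(node)
        else:
            if len(self.nodes) >= self.capacity:
                oldest = self.tail.prev  # least recently used item is at the end
                self._unlink(oldest)
                del self.nodes[oldest.key]
            node = Node(key, value)
            self.nodes[key] = node
        self._push_front(node)


cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" becomes the most recently used
cache.put("c", 3)  # evicts "b", the least recently used
```

Every operation touches only a constant number of links, which is what the O(1) deletion of a doubly linked list buys here.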
Doubly linked lists are used in navigation systems where both forward and backward navigation is required. Browsers also use them to implement backward and forward navigation through visited web pages, i.e. the back and forward buttons.
Singly Linked List vs Doubly Linked List vs Dynamic Arrays:
When comparing the three main data structures by time complexity, doubly linked lists do well in most operations that work on a node you already hold. They run in constant time for insert, remove, first, last and concatenation (and count, if the length is stored); the exception is access by index, which is linear time, O(n), because the list has to be walked node by node to reach the required index. Dynamic arrays are the reverse: access by index, last and count are constant time and appending is amortized constant time, but inserting or removing at the front or in the middle is linear time, O(n), and so is concatenation.
In terms of space complexity, all three are linear, O(n), but the constant differs: a dynamic array stores only the elements (plus any unused capacity), a singly linked list also stores one successor pointer per element (n extra pointers), and a doubly linked list stores both a predecessor and a successor pointer per element (2*n extra pointers).
If you have extremely limited memory, dynamic arrays or singly linked lists may be the better choice. However, memory is increasingly abundant nowadays, so doubly linked lists are often worth the extra space.
A doubly linked list is more effective than a singly linked list when the location of the element to be deleted is given, because only two links need to be updated (the predecessor's forward link and the successor's backward link), and only one neighbour link plus the head or tail pointer when the element is the first or the last node.
struct Node {
    int Value;
    struct Node *Fwd;
    struct Node *Bwd;
};
Only the line of code below is needed to delete the element, provided it is neither the first nor the last node.
X->Bwd->Fwd = X->Fwd; X->Fwd->Bwd = X->Bwd;

BTree- predetermined size?

I read this on wikipedia:
In B-trees, internal (non-leaf) nodes can have a variable number of
child nodes within some pre-defined range. When data is inserted or
removed from a node, its number of child nodes changes. In order to
maintain the pre-defined range, internal nodes may be joined or split.
Because a range of child nodes is permitted, B-trees do not need
re-balancing as frequently as other self-balancing search trees, but
may waste some space, since nodes are not entirely full.
We have to specify this range for B-trees. Even when I looked it up in CLRS (Introduction to Algorithms), it seemed to make use of arrays for keys and children. My question is: is there any way to reduce this wasted space by defining the keys and children as lists instead of predetermined arrays? Is this too much of a hassle?
Also, for the life of me I'm not able to find decent pseudocode for btreeDeleteNode. Any help here is appreciated too.
When you say "lists", do you mean linked lists?
An array of some kind of element takes up one element's worth of memory per slot, whether that slot is filled or not. A linked list only takes up memory for elements it actually contains, but for each one, it takes up one element's worth of memory, plus the size of one pointer (two if it's a doubly-linked list, unless you can use the xor trick to overlap them).
If you are storing pointers, and using a singly-linked list, then each list link is twice the size of each array slot. That means that unless the list is less than half full, a linked list will use more memory, not less.
If you're using a language whose runtime has per-object overhead (like Java, and like C unless you are handling memory allocation yourself), then you will also have to pay for that overhead on each list link, but only once on an array, and the ratio is even worse.
I would suggest that your balancing algorithm should keep tree nodes at least half full. If you split a node when it is full, you will create two half-full nodes. You then need to merge adjacent nodes when they are less than half full. You can then use an array, safe in the knowledge that it is more efficient than a linked list.
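The break-even point can be checked with a little arithmetic. This sketch counts memory in pointer-sized words and ignores allocator and per-object overhead, which only makes the linked list look worse; the function names are made up for illustration.

```python
def array_words(capacity):
    """One pointer-sized slot per position, whether it is filled or not."""
    return capacity


def linked_list_words(count):
    """Each link stores the element pointer plus one `next` pointer."""
    return 2 * count


def break_even_fill(capacity):
    """Largest element count at which the linked list is no bigger than the array."""
    return array_words(capacity) // 2


capacity = 100
half = break_even_fill(capacity)
```

For a node with 100 slots the two representations cost the same at 50 elements, and the linked list costs more for anything above that. A B-tree that keeps its nodes at least half full therefore never does worse with arrays under these assumptions.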
No idea about the details of deletion, sorry!
A B-tree node has an important characteristic: all keys in the node are sorted. When looking for a specific key, binary search is used to find the right position within the node. Using binary search keeps the complexity of the B-tree search algorithm at O(log n).
If you replace the preallocated array with some kind of linked list, you lose random access to the keys, so binary search within the node is no longer possible, unless you use a more complex data structure such as a skip list to keep the search at O(log n). But that is totally unnecessary: at that point a skip list by itself is the better choice.
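As a small illustration of why the sorted array matters, here is how the search step inside one node might look using Python's standard `bisect` module. The function name `find_in_node` and the convention of returning the child index are assumptions for this sketch.

```python
from bisect import bisect_left


def find_in_node(keys, target):
    """Binary-search a node's sorted key array.

    Returns (found, position). When the key is not in this node, `position`
    is the index of the child to descend into.
    """
    i = bisect_left(keys, target)
    found = i < len(keys) and keys[i] == target
    return found, i


node_keys = [10, 20, 30, 40]
hit = find_in_node(node_keys, 30)
miss = find_in_node(node_keys, 25)
```

`bisect_left` relies on indexing into the middle of the sequence in O(1), which is exactly what a plain linked list of keys cannot offer.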

Plain, linked and double linked lists: When and Why?

In what situations should I use each kind of list? What are the advantages of each one?
Plain list:
Stores each item sequentially, so random lookup is extremely fast (i.e. I can instantly say "I want the 657415671567th element" and go straight to it, because its memory address is at a fixed offset from the first item: the index times the element size). This has little or no memory overhead in storage. However, it has no way of automatically resizing: you have to create a new array, copy across all the values, and then delete the old one. Plain lists are useful when you need to look up data from anywhere in the list, and you know that your list will not be longer than a certain size.
Linked List:
Each item has a reference to the next item. This means that there is some overhead (to store the reference to the next item). Also, because they're not stored sequentially, you can't immediately go to the 657415671567th element - you have to start at the head (1st element), and then get its reference to go to the 2nd, and then get its reference, to get to the third, ... and then get its reference to get to the 657415671566th, and then get its reference to get to the 657415671567th. In this way, it is very inefficient for random lookup. However, it allows you to modify the length of the list. If your task is to go through each item sequentially, then it's about the same value as a plain list. If you need to change the length of the list, it could be better than a plain list. If you know the 566th element, and you're looking for the 567th, then all you need to do is follow the reference to the next one. However, if you know the 567th and you're looking for the 566th, the only way to find it is to start searching from the 1st element again. This is where Double Linked Lists come in handy...
Double Linked List:
Double linked lists also store a reference to the previous element. This means you can traverse the list backwards as well as forwards. This could be very useful in some situations (such as the example given in the Linked List section). Other than that, they have most of the same advantages and disadvantages as a Linked List.
Answer from comments section:
For use as a queue:
You'd have to take all of those advantages and disadvantages into account: Can you say with confidence that your queue will have a maximum size? If your queue could be anywhere from 1 to 10000000000 elements long, then a plain list will just waste memory (and then may not even be big enough). In that case, I'd go with a Linked List. However, rather than storing the index of the front and rear, you should actually store the node.
Recap: A linked list is made up of "nodes", and each node stores the item as well as the reference to the next node
So you should store a reference to the first node, and the last node. Thus, when you enqueue, you stick a new node onto the rear (by linking the old rear one to the new rear one), and remember this new rear node. And, when you dequeue, you remove the front node, and remember the second one as the new "front node". That way, you don't have to worry about any of the middle elements. You can thus ignore the length of the queue (although you can store that too if you really want)
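The enqueue and dequeue steps described above can be sketched as follows. `LinkedQueue` and its field names are illustrative; the length counter is the optional extra mentioned in the text.

```python
class Node:
    def __init__(self, item):
        self.item = item
        self.next = None


class LinkedQueue:
    def __init__(self):
        self.front = None  # node to dequeue next
        self.rear = None   # most recently enqueued node
        self.length = 0    # optional; not needed for the queue to work

    def enqueue(self, item):
        node = Node(item)
        if self.rear is None:
            self.front = node      # first element: it is both front and rear
        else:
            self.rear.next = node  # link the old rear to the new rear
        self.rear = node
        self.length += 1

    def dequeue(self):
        if self.front is None:
            raise IndexError("dequeue from empty queue")
        node = self.front
        self.front = node.next     # the second node becomes the new front
        if self.front is None:
            self.rear = None       # queue became empty
        self.length -= 1
        return node.item


q = LinkedQueue()
for item in ("a", "b", "c"):
    q.enqueue(item)
first_out = q.dequeue()
```

Neither operation touches the middle of the list, so both are O(1) regardless of how long the queue gets.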
Nobody mentioned my favorite linked list: circularly linked list with a pointer to the last element. You get constant-time insertion and deletion at either end, plus constant-time destructive append. The only cost is that empty lists are a bit tricky. It's a sweet data structure: list, queue, and stack all in one.
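A sketch of such a list, assuming the singly linked variant. In this variant insertion at either end, removal from the front and destructive append are all O(1); removing from the rear in O(1) would additionally need backward links, so it is left out. The names are illustrative.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = self


class CircularList:
    def __init__(self):
        self.last = None  # last.next is the first element; None means empty

    def push_front(self, value):
        node = Node(value)
        if self.last is None:
            self.last = node
        else:
            node.next = self.last.next
            self.last.next = node

    def push_back(self, value):
        self.push_front(value)
        self.last = self.last.next  # the new node becomes the last one

    def pop_front(self):
        if self.last is None:
            raise IndexError("pop from empty list")
        first = self.last.next
        if first is self.last:
            self.last = None        # the list held a single element
        else:
            self.last.next = first.next
        return first.value

    def extend_destructive(self, other):
        """Splice `other` onto the end in O(1); `other` is left empty."""
        if other.last is None:
            return
        if self.last is not None:
            first = self.last.next
            self.last.next = other.last.next
            other.last.next = first
        self.last = other.last
        other.last = None

    def to_list(self):
        out = []
        if self.last is not None:
            node = self.last.next
            while True:
                out.append(node.value)
                if node is self.last:
                    break
                node = node.next
        return out


a = CircularList()
a.push_back(1)
a.push_back(2)
a.push_front(0)
b = CircularList()
b.push_back(3)
b.push_back(4)
a.extend_destructive(b)
```

The special handling of `self.last is None` in almost every method is the "empty lists are a bit tricky" cost mentioned above.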
One advantage of a doubly-linked list is that removal of a node whose pointer is specified is O(1).
With singly linked lists you can only traverse forwards. With doubly linked lists you can traverse backwards as well as forwards through the list. In general if you are going to use a linked list, there is really no good reason not to use a doubly linked list. I have only used single linked in school.
Doubly-linked list provides several advantages over a singly linked list:
Easier traversal: With a doubly linked list, each node has a pointer to both the previous and next node, allowing for easy traversal in both directions. This is useful for certain types of algorithms that need to move both forwards and backwards through the list.
Faster deletion: In a singly linked list, when you want to delete a node, you need to traverse the list to find the node before it, so that you can update the next pointer. In a doubly linked list, the node you want to delete already has a pointer to the previous node, so you can update the previous node's next pointer directly, making deletion faster.
Easier insertion: Similar to deletion, to insert before a given node in a singly linked list you need to traverse the list to find that node's predecessor. With a doubly linked list, you can insert a new node directly before or after a given node, without the need to traverse the list.
Easier to implement in-place modification: With a doubly linked list, it is easy to move elements around within the list without creating new list elements or destroying old ones.
Easier to implement Queue and Stack : A doubly linked list makes it easy to implement queue and stack data structures.
