So I need to find a data structure for this situation that I'll describe:
This is not my problem but explains the data structure aspect i need more succinctly:
I have an army made up of platoons. Every platoon has a number of men and a rank number(highest being better). If an enemy were to attack my army, they would kill some POWER of my army, starting from the weakest platoon and working up, where it takes (platoon rank) amount of power to kill every soldier from a platoon.
I could easily simulate enemies attacking me by peeking and popping elements from my priority queue of platoons, ordered by rank number, but that is not what I need to do. What i need is to be able to allow enemies to view all the soldiers they would kill if they attacked me, without actually attacking, so without actually deleting elements from my priorityqueue(if i implemented it as a pq).
Sidenote: Java's PriorityQueue.Iterator() prints elements in a random order, I know an iterator is all I need, just fyi.
The problem is, if I implemented this as a pq, I can only see the top element, so I would have to pop platoons off as if they were dying and then push them back on when the thought of the attack has been calculated. I could also implement this as a linked list or array, but insertion takes too long. Ultimately I would love to use a priority queue I just need the ability to view either the (pick an index)'th element from the pq, or to have every object in the pq have a pointer to the next object in the pq, like a linked list.
Is this thought about maintaining pointers with a pq like a linked list possible within java's PriorityQueue? Is it implemented for me somewhere in PriorityQueue that I dont know about? is the index thing implemented? is there another data structure I can use that can better serve my purpose? Is it realistic for me to find the source code from Java's PriorityQueue and rewrite it on my machine to maintain these pointers like a linked list?
Any ideas are very welcome, not really sure which path I want to take on this one.
One thing you could do is an augmented binary search tree. That would allow efficient access to the nth smallest element while still keeping the elements ordered. You could also use a threaded binary search tree. That would allow you to step from one element to the next larger one in constant time, which is faster than in a normal binary tree. Both of these data structures are slower than a heap, though.
Related
I recently attended Microsoft Interview.
I was asked to implement linked list with 1 million nodes? How will you access 999999th node?
What is the optimal design strategy and implementation for such a question?
A linked list has fairly few variations, because much variation means it would be something other than a linked list.
You can vary it by having single or double linking. Single linking is where you have a pointer to the head (the first node, A say) which points to B which points to C, etc. To turn that into a double linked list you would also add a link from C to B and B to A.
If you have a double linked list then it's meaningful to retain a pointer to the list tail (the last node) as well as the head, which means accessing the last element is cheap, and elements near the end are cheaper, because you can work backwards or forwards... BUT... you would need to know what you want is at the end of the list... AND at the end of the day a linked list is still just that, and if it is going to get very large and that is a problem because of the nature of its use case, then a storage structure other than a linked list should probably be chosen.
You could hybridise your linked list of course, so you could index it or something for example, and there's nothing wrong with that in theory, but if you index ALL the nodes then the linked list nature is no longer of much value, and if you index only some, then the nodes in between indexed nodes have to be sorted or something so you can find a close node and work towards a target node... probably this would never be optimal and a better data structure should be chosen.
Really a linked list should be used when you don't want to do things like get a specific node, but want to iterate nodes regardless.
I have no idea about what I'm going to say, but, here goes:
You could conceptually split the list in sqrt(1000000) blocks, in such a way that you would have "reference pointers" every 1000 elements.
Think of it as having 1000 linked lists each with 1000 elements representing your list with 1000000 elements.
This is what comes to mind!
As Michael said you should present first the two classic variations of linked list. The next thing you should do is ask about insertion, search and deletion patterns.
These patterns will guide you towards a better fit data structure, because nobody wants a simple or double linked list with a million nodes.
A doubly circular Linked List with a static counter to point the index could be quite helpful in this case.
What I am suggesting is creating a Circular doubly Linked List having a counter variable which keeps track of the index of each node and a static variable which will hold overall number of nodes in the list.
Now when you have a search item for the index which is greater than 50% of the total nodes count i.e. searching elements at lower half you can start traversing the list from reverse direction.
Let say you have 10 nodes in your circular linked list and you want to search 8th node so you can quickly start traversing the list in opposite direction 2 times.
This approach reduces the iterations to search list item indexed at extremes but still in worst case you have to traverse half way through list for items in middle.
The only downfall in this approach is memory constraints which I am assuming is not an design concern here.
Just like it was asked here,
I fail to understand how we can find the index of a relaxed vertex in the heap.
Programming style-wise, the heap is a black box that abstracts away the details of a priority queue. Now if we need to maintain a hash table that maps vertex keys to corresponding indices in the heap array, that would need to be done in heap implementation, right?
But most standard heaps don't provide a hash table that does such mapping.
Another way to deal with this whole problem is to add the relaxed vertices to the heap regardless of anything. When we extract the minimum we'll get the best one. To prevent the same vertex being extracted multiple times, we can mark it visited.
So my exact question is, what is the typical way (in the industry) of dealing with this problem?
What are the pros and cons compared what the methods I mentioned?
Typically, you'd need a specially-constructed priority queue that supports the decreaseKey operation in order to get this to work. I've seen this implemented by having the priority queue explicitly keep track of a hash table of the indices (if using a binary heap), or by having an intrusive priority queue where elements stored are nodes in the heap (if using a binomial heap or Fibonacci heap, for example). Sometimes, the priority queue's insertion operation will return a pointer to the node in the priority queue that holds the newly-added key. As an example, here is an implementation of a Fibonacci heap that supports decreaseKey. It works by having each insert operation return a pointer to the node in the Fibonacci heap, which makes it possible to look up the node in O(1), assuming you keep track of the returned pointers.
Hope this helps!
You are asking some very valid questions but unfortunately they are kind of vague so we won't be able to give you a 100% solid "industry standard" answer. However, I'll try to go over your points anyway:
Programming style-wise, the heap is a black box that abstracts away the details of a priority queue
Technically, a priority queue is the abstract interface (insert elements with a priority, extract the lowest priority element) and a heap is a concrete implementation (array-based heap, binomial heap, fibonacci heap, etc).
What I'm trying to say is that using an array is only one particular way to implement a priority queue.
Now if we need to maintain a hash table that maps vertex keys to corresponding indices in the heap array, that would need to be done in heap implementation, right?
Yes, because everytime you move an element inside the array you will need to update the index in the hash table.
But most standard heaps don't provide a hash table that does such mapping.
Yes. This can be very annoying.
Another way to deal with this whole problem is to add the relaxed vertices to the heap regardless of anything.
I guess that could work but I dont think I ever saw anyone do that. The whole point of using a heap here is to increase performance and by adding redundant elements to the heap you kind of go against that. Sure, you preserve the "black-boxness" of the priority queue but I don't know if that is worth it. Additionally, there could be a chance that the extra pop_heap operations could negatively affect your asymptoptic complexity but I'd have to do the math to check.
what is the typical way (in the industry) of dealing with this problem?
First of all, ask yourself if you can get away with using a dumb array instead of a priority queue.
Sure, finding the minimum element in now O(N) instead of O(log n) but the implementation is the simplest (an advantage on its own). Additionally, using an array will be just as efficient if your graph is dense and even if your graph is sparse it might be efficient enough depending on how big your graph is.
If you really need a priority queue, then you are going to have to find one that has a decreaseKey operation implemented. If you can't find one, I would say its not that bad to implement it yourself - it might be less trouble than trying to find an existing implementation and then trying to fit it in with the rest of your code.
Finally, I would not recommend using the really fancy heap data structures (such as fibonacci heaps). While these often show up in textbooks as a way to get optimal asymptotics, in practice they have terrible constant factors and these constant factors are significant when compared with something that is logarithmic.
Programming style-wise, the heap is a black box that abstracts away the details of a priority queue.
Not necessarily. Both C++ and Python have heap libraries that provide functions on arrays rather than black box objects. Go abstracts a bit, but requires the programmer to provide an array-like data structure for its heap operations to work on.
All this abstraction leaking in standardized, industry-strength libraries has a reason: some algorithms (Dijkstra) require a heap with additional operations, which would degrade the performance of other algorithms. Yet other algorithms (heapsort) need heap operations that work in-place on input arrays. If your library's heap gives you a black-box object, and it doesn't suffice for some algorithm, then it's time to re-implement the operations as function on arrays, or find a library that does have the operations you need.
This is a great question and one of those details that algorithms books like CLRS just glaze over without mention.
There are a few ways to do handle this, either:
Use a custom heap implementation that supports decreaseKey operations
Every time you "relax" a vertex, you just add it back into the heap with the new lower weight, then you write a custom way to ignore the old elements later. You can take advantage of the fact that you only ever add a node into the heap/priority-queue if the weight has decreased.
Option #1 is definitely used. For example, if you are familiar with OpenSourceRoutingMachine (OSRM) it searches over graphs with many millions of nodes to compute road routing directions. It uses a Boost implementation of a d-ary heap specifically because it has better decreaseKey operations, source. Often the Fibonacci_heap is also mentioned for this purpose because it supports O(1) decrease key operations, but likewise you'd probably have to roll your own.
In option #2 you end up doing more insertions and removeMin operations in total. If D is the total number of "relax" operations you must do, you end up doing a total of D additional heap operations. So while this has a theoretically worse runtime complexity, in practice there is research evidence that option #2 can be more performant because you can take advantage of cache locality and avoid the additional overhead of keeping pointers to do the decreaseKey operations (see [1], specifically pg. 16). This approach also has the advantage of being simpler and allows you to use standard library heap/priority-queue implementations in most languages.
To give you some psuedocode for how option #2 would look:
// Imagine this is some lookup table that has the minimum weight
// so far for each node.
weights = {}
while Queue is not empty:
u = Queue.removeMin()
// This is our new logic to discard the duplicate entries.
if u.weight > weights[u]:
continue
visit neighbors[u] and relax() each one
As an alternative, you can also check out the the Python standard library heapq docs which describe another approach to keeping track of "dead" entries in the heap. Whether you find it helpful depends on what data structure you are using for your graph representation and storing of vertex distances.
[1] Priority Queues and Dijkstra’s Algorithm 2007
I am looking into using an Edit Distance algorithm to implement a fuzzy search in a name database.
I've found a data structure that will supposedly help speed this up through a divide and conquer approach - Burkhard-Keller Trees. The problem is that I can't find very much information on this particular type of tree.
If I populate my BK-tree with arbitrary nodes, how likely am I to have a balance problem?
If it is possibly or likely for me to have a balance problem with BK-Trees, is there any way to balance such a tree after it has been constructed?
What would the algorithm look like to properly balance a BK-tree?
My thinking so far:
It seems that child nodes are distinct on distance, so I can't simply rotate a given node in the tree without re-calibrating the entire tree under it. However, if I can find an optimal new root node this might be precisely what I should do. I'm not sure how I'd go about finding an optimal new root node though.
I'm also going to try a few methods to see if I can get a fairly balanced tree by starting with an empty tree, and inserting pre-distributed data.
Start with an alphabetically sorted list, then queue from the middle. (I'm not sure this is a great idea because alphabetizing is not the same as sorting on edit distance).
Completely shuffled data. (This relies heavily on luck to pick a "not so terrible" root by chance. It might fail badly and might be probabilistically guaranteed to be sub-optimal).
Start with an arbitrary word in the list and sort the rest of the items by their edit distance from that item. Then queue from the middle. (I feel this is going to be expensive, and still do poorly as it won't calculate metric space connectivity between all words - just each word and a single reference word).
Build an initial tree with any method, flatten it (basically like a pre-order traversal), and queue from the middle for a new tree. (This is also going to be expensive, and I think it may still do poorly as it won't calculate metric space connectivity between all words ahead of time, and will simply get a different and still uneven distribution).
Order by name frequency, insert the most popular first, and ditch the concept of a balanced tree. (This might make the most sense, as my data is not evenly distributed and I won't have pure random words coming in).
FYI, I am not currently worrying about the name-synonym problem (Bill vs William). I'll handle that separately, and I think completely different strategies would apply.
There is a lisp example in the article: http://cliki.net/bk-tree. About unbalancing the tree I think the data structure and the method seems to be complicated enough and also the author didn't say anything about unbalanced tree. When you experience unbalanced tree maybe it's not for you?
I need something that will work in O(log(n)) complexity, and I thought about AVL trees, but the problem is that some keys may repeat themselves (score of a person for example), so I can't think of how to implement it as a tree. What is a proper way to do this?
There are many options available. Most flavors of binary search trees can easily be modified to allow for nodes with duplicated values, since the balancing operations (usually) purely consist of rotations, which keep the sequence in order. For cases like these, you'd just do a normal BST insertion, but every time you see a duplicated value, you just arbitrarily move to the left or the right and continue as if the value were distinct.
Skiplists are particularly easy to update to support multiple copies of each key, since they don't do any complicated structural updates on insertions or deletions.
If you don't have auxiliary information associated with each key, then another simpler option would be to store a standard binary search tree, but to augment each node with a "count" field indicating how many logical copies of that field exist. Every time you do an insertion, if the key doesn't exist, you create it with count 1. If it already exists, you just increment the count in the existing node. Deletions would be implemented analogously.
Of course, if you don't want to roll your own data structure, just go and find a good implementation of a multimap or multiset, which should get the job done for you quite nicely. Depending on your Programming Language of Choice, you might even find these in the standard libraries. :-)
I was looking for some simple implemented data structure which gets my needs fulfilled in least possible time (in worst possible case) :-
(1)To pop nth element (I have to keep relative order of elements intact)
(2)To access nth element .
I couldn't use array because it can't pop and i dont want to have a gap after deleting ith element . I tried to remove the gap , by exchanging nth element with next again with next untill last but that proves time ineffecient though array's O(1) is unbeatable .
I tried using vector and used 'erase' for popup and '.at()' for access , but even this is not cheap for time effeciency though its better than array .
What you can try is skip list - it support the operation you are requesting in O(log(n)). Another option would be tiered vector that is just slightly easier to implement and takes O(sqrt(n)). both structures are quite cool but alas not very popular.
Well , tiered vector implemented on array would i think best fit your purpose . Though the tiered vector concept may be knew and little tricky to understand at first but then once you get it , it opens lot of question and you get a handy weapon to tackle many question's data structure part very effeciently . So it is recommended that you master tiered vectors implementation.
An array will give you O(1) lookup but O(n) delete of the element.
A list will give you O(n) lookup bug O(1) delete of the element.
A binary search tree will give you O(log n) lookup with O(1) delete of the element. But it doesn't preserve the relative order.
A binary search tree used in conjunction with the list will give you the best of both worlds. Insert a node into both the list (to preserve order) and the tree (fast lookup). Delete will be O(1).
struct node {
node* list_next;
node* list_prev;
node* tree_right;
node* tree_left;
// node data;
};
Note that if the nodes are inserted into the tree using the index as the sort value, you will end up with another linked list pretending to be a tree. The tree can be balanced however in O(n) time once it is built which you would only have to incur once.
Update
Thinking about this more this might not be the best approach for you. I'm used to doing lookups on the data itself not its relative position in a set. This is a data centric approach. Using the index as the sort value will break as soon as you remove a node since the "higher" indices will need to change.
Warning: Don't take this answer seriously.
In theory, you can do both in O(1). Assuming this are the only operations you want to optimize for. The following solution will need lots of space (and it will leak space), and it will take long to create the data structure:
Use an array. In every entry of the array, point to another array which is the same, but with that entry removed.