Find the position of an item in the queue

I have an array of arbitrary objects
Each object has a unique id
New objects are added at the end of the queue (tail)
Objects are removed from the top for the processing (FIFO)
Objects pending processing can be deleted if required.
The problem is to find an object's current position in the queue, measured from the tail, given its id.
What is the quickest way to do it? Just to be clear, I do not want to look up the object from the id, so a hash map alone is not the solution. What I really need is the position.
We thought of two ways:
Brute force: find it with a loop.
Add a new field to the object which stores a global index that is incremented for every object added to the queue. We can then quickly get the position by comparing the global index stored in the last item with the one in this item. The complication is that if one of the objects is deleted, the global index of all the items behind it needs to be updated.
Any better ideas? Please suggest.

The easiest way to do it is to represent the FIFO as a doubly-linked list. This list can either be of Objects (meaning that if you had an ID you would also need a mapping, either a hash map or some other approach, from ID to object), or of standalone FIFO nodes (in which case you'd have a mapping from ID to node address).

I believe I have an O(log n)-time solution. Construct a hash table that maps each ID to its node in our other data structure: a self-balancing binary search tree (red-black, AVL, whatever you like). In this tree, nodes are ordered by their relative priority / order in your queue. Each node should also store a pointer to its parent in the tree, and the size of the subtree rooted at itself. From this we can query for the number of elements with a lower priority / order in logarithmic time.
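A Fenwick (binary indexed) tree over insertion slots achieves the same O(log n) position query with much less code than a full order-statistic tree. The sketch below (Python, all names illustrative) gives each enqueued object a monotonically increasing slot and counts occupied slots; position from the tail is total minus prefix count.

```python
# Sketch: O(log n) queue positions via a Fenwick (binary indexed) tree.
# Each enqueued object gets the next slot index; a hash map stores id -> slot.
# The tree counts occupied slots, so a prefix sum gives the position.

class IndexedQueue:
    def __init__(self, capacity=1 << 16):
        self.bit = [0] * (capacity + 1)
        self.slot_of = {}            # id -> slot (1-based)
        self.next_slot = 1
        self.total = 0

    def _update(self, i, delta):
        while i < len(self.bit):
            self.bit[i] += delta
            i += i & -i

    def _prefix(self, i):            # number of occupied slots in [1, i]
        s = 0
        while i > 0:
            s += self.bit[i]
            i -= i & -i
        return s

    def enqueue(self, obj_id):
        slot = self.next_slot
        self.next_slot += 1
        self.slot_of[obj_id] = slot
        self._update(slot, 1)
        self.total += 1

    def delete(self, obj_id):        # also used when the head is processed
        slot = self.slot_of.pop(obj_id)
        self._update(slot, -1)
        self.total -= 1

    def position_from_tail(self, obj_id):
        # 0 = most recently added item
        return self.total - self._prefix(self.slot_of[obj_id])
```

Deletions anywhere in the queue only touch O(log n) counters, which is exactly the update cost the global-index idea in the question was trying to avoid paying n times.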

Related

A*, what's the best data structure for the open list?

Disclaimer: I really believe that this is not a duplicate of similar questions. I've read those, and they (mostly) recommend using a heap or a priority queue. My question is more of the "I don't understand how those would work in this case" kind.
In short:
I'm referring to the typical A* (A-star) pathfinding algorithm, as described (for example) on Wikipedia:
https://en.wikipedia.org/wiki/A*_search_algorithm
More specifically, I'm wondering about what's the best data structure (which can be a single well known data structure, or a combination of those) to use so that you never have O(n) performance on any of the operations that the algorithm requires to do on the open list.
As far as I understand (mostly from the Wikipedia article), the operations needed to be done on the open list are as follows:
The elements in this list need to be Node instances with the following properties:
position (or coordinates). For the sake of argument, let's say this is a positive integer ranging in value from 0 to 64516 (I'm limiting my A* area size to 254x254, which means that any set of coordinates can be bit-encoded on 16 bits)
F score. This is a positive floating-point value.
Given these, the operations are:
Add a node to the open list: if a node with the same position (coordinates) exists (but, potentially, with a different F score), replace it.
Retrieve (and remove) from the open list the node with the lowest F score
(Check if exists and) retrieve from the list a node for a given position (coordinates)
As far as I can see, the problems with using a heap or priority queue for the open list are:
These data structures use the F score as the sorting criterion
As such, adding a node to this kind of data structure is problematic: how do you efficiently check that a node with the same coordinates (but a different F score) doesn't already exist? Furthermore, even if you somehow are able to do this check and actually find such a node, but it is not at the top of the heap/queue, how do you remove it efficiently so that the heap/queue keeps its correct order?
Also, checking for existence and removing a node based on its position is not efficient, or even possible: if we use a priority queue, we have to check every node in it, and remove the corresponding one if found. For a heap, if such a removal is necessary, I imagine that all remaining elements need to be extracted and re-inserted, so that the heap still remains a heap.
The only remaining operation where such a data structure would be good is removing the node with the lowest F score. In this case the operation would be O(log n).
Also, if we make a custom data structure, such as one that combines a hash table (with position as key) and a priority queue, some operations would still require suboptimal processing on one of them: to keep them in sync (both should contain the same nodes), any given operation will always be suboptimal on one of the two structures. Adding or removing a node by position would be fast on the hash table but slow on the priority queue; removing the node with the lowest F score would be fast on the priority queue but slow on the hash table.
What I've done is make a custom hash table for the nodes that uses their position as the key and also keeps track of the node with the lowest F score. When adding a new node, it checks whether the new node's F score is lower than that of the currently stored lowest-F-score node, and if so, replaces it. The problem with this data structure comes when you want to remove a node (whether by position or the lowest-F-score one): to update the field holding the current lowest-F-score node, I need to iterate through all the remaining nodes to find which one now has the lowest F score.
So my question is: is there a better way to store these ?
You can combine the hash table and the heap without slow operations showing up.
Have the hash table map each position to the node's index in the heap, instead of to the node itself.
Any update to the heap can sync itself to the hash table (which requires the heap to know about the hash table, so this is invasive and not just a wrapper around two off-the-shelf implementations) with as many hash-table updates (each O(1), obviously) as the number of items that move in the heap; of course only O(log n) items can move for an insertion, remove-min, or update-key. The hash table finds the node (in the heap) whose key must be updated for the parent-updating/G-changing step of A*, so that's fast too.
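A minimal sketch of this invasive combination, in Python with illustrative names: a binary min-heap of [F score, position] entries plus a dict mapping each position to its current index in the heap array. Every swap inside the heap also updates the dict, so membership tests, decrease-key, and remove-min are all fast.

```python
# Sketch: indexed binary min-heap for an A* open list.
# heap holds [f, pos] pairs; index maps pos -> slot in heap, kept in sync
# by routing every element move through _swap.

class OpenList:
    def __init__(self):
        self.heap = []               # array-backed binary min-heap
        self.index = {}              # pos -> index into self.heap

    def _swap(self, i, j):
        self.heap[i], self.heap[j] = self.heap[j], self.heap[i]
        self.index[self.heap[i][1]] = i
        self.index[self.heap[j][1]] = j

    def _sift_up(self, i):
        while i > 0 and self.heap[i][0] < self.heap[(i - 1) // 2][0]:
            self._swap(i, (i - 1) // 2)
            i = (i - 1) // 2

    def _sift_down(self, i):
        n = len(self.heap)
        while True:
            small = i
            for c in (2 * i + 1, 2 * i + 2):
                if c < n and self.heap[c][0] < self.heap[small][0]:
                    small = c
            if small == i:
                return
            self._swap(i, small)
            i = small

    def push_or_update(self, pos, f):
        if pos in self.index:        # node already open: change its key
            i = self.index[pos]
            old = self.heap[i][0]
            self.heap[i][0] = f
            if f < old:
                self._sift_up(i)
            else:
                self._sift_down(i)
        else:
            self.heap.append([f, pos])
            self.index[pos] = len(self.heap) - 1
            self._sift_up(len(self.heap) - 1)

    def pop_min(self):
        f, pos = self.heap[0]
        self._swap(0, len(self.heap) - 1)
        self.heap.pop()
        del self.index[pos]
        if self.heap:
            self._sift_down(0)
        return pos, f

    def __contains__(self, pos):
        return pos in self.index
```

Note that `push_or_update` is exactly the "add, replacing any node with the same position" operation from the question, and `__contains__` is the existence check, both without scanning the heap.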

How to find nodes fast in an unordered tree

I have an unordered tree in the form of, for example:
Root
  A1
    A1_1
      A1_1_1
      A1_1_2
        A1_1_2_1
        A1_1_2_2
        A1_1_2_3
      A1_1_3
      A1_1_n
    A1_2
    A1_3
    A1_n
  A2
    A2_1
    A2_2
    A2_3
    A2_n
The tree is unordered
each child can have a random N count of children
each node stores an unique long value.
the value required can be at any position.
My problem: if I need the long value of A1_1_2_3, the first time I will traverse the nodes with a depth-first search to get it. However, on later calls for the same node I must get its value without a recursive search. Why? If this tree had hundreds of thousands of nodes before reaching my A1_1_2_3 node, the search would take too much time.
What I thought of is to leave some pointers after the first traversal. E.g. in my case, when I give back the long value for A1_1_2_3, I also give back an array with information for future searches of the same node, saying: to get to A1_1_2_3, I need:
first child of Root, which is A1
first child of A1, which is A1_1
second child of A1_1, which is A1_1_2
third child of A1_1_2, which is what I need: A1_1_2_3
So I figured I would store this information along with the value for A1_1_2_3 as an array of indexes: [0, 0, 1, 2]. By doing so, I could easily reach the node on subsequent calls for A1_1_2_3 and avoid recursion each time.
However, the nodes can change. On subsequent calls, I might have a new structure, so my indexes stored earlier would not match anymore. But if this happens, I thought that whenever I don't find the element anymore, I would recursively go back up a level and search for the item, and so on until I find it again, and then store the indexes again for future reference:
e.g. if my A1_1_2_3 is now situated in this new structure:
A1_1
  A1_1_0
  A1_1_1
  A1_1_2
    A1_1_2_1
    A1_1_2_2
    A1_1_21_22
    A1_1_2_3
... in this case the new element A1_1_0 ruined my stored structure, so I would go back up a level and search the children again recursively until I find it again.
Does this even make sense, what I thought of here, or am I overcomplicating things? I'm talking about an unordered tree which can have up to about three hundred thousand nodes, and it is vital that I can jump to nodes as fast as possible. But the tree can also be very small, under 10 nodes.
Is there a more efficient way to search in such a situation?
Thank you for any idea.
edit:
I forgot to add: what I need on subsequent calls is not just the same value; its position is also important, because I must get the next page of children after that child (since it's a tree structure, I'm paging through the nodes after the initially selected one). I hope it makes more sense now.
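The cached-index-path idea from the question can be sketched as follows (Python, illustrative names): replay the stored path, and if it no longer leads to the right node because siblings were inserted or removed, fall back to a fresh DFS and re-cache the corrected path.

```python
# Sketch: cache the child-index path (e.g. [0, 0, 1, 2]) to a node after the
# first DFS; later lookups replay the path in O(depth) and only re-search
# when the cached path has gone stale.

class Node:
    def __init__(self, value, children=None):
        self.value = value              # the unique long value
        self.children = children or []

def find_path(root, value):
    """Plain DFS; returns the index path to `value`, or None if absent."""
    if root.value == value:
        return []
    for i, child in enumerate(root.children):
        sub = find_path(child, value)
        if sub is not None:
            return [i] + sub
    return None

def get(root, value, cache):
    """cache: dict value -> index path. Returns the node, updating the cache."""
    path = cache.get(value)
    if path is not None:                # fast path: replay the cached indexes
        node, ok = root, True
        for i in path:
            if i >= len(node.children):
                ok = False
                break
            node = node.children[i]
        if ok and node.value == value:
            return node
    path = find_path(root, value)       # cache stale or absent: full DFS
    if path is None:
        return None
    cache[value] = path
    node = root
    for i in path:
        node = node.children[i]
    return node
```

For simplicity this sketch restarts the fallback search from the root; the question's refinement of walking back up one level at a time would narrow the re-search further. The cached path also gives you the node's position among its siblings, which is what the paging requirement in the edit needs.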

Efficient data structure for a leaderboard, i.e., a list of records (name, points) - Efficient Search(name), Search(rank) and Update(points)

Please suggest a data structure for representing a list of records in memory. Each record is made up of:
User Name
Points
Rank (based on Points) - optional field - can be either stored in the record or can be computed dynamically
The data structure should support implementation of the following operations efficiently:
Insert(record) - might change ranks of existing records
Delete(record) - might change ranks of existing records
GetRecord(name) - Probably a hash table will do.
GetRecord(rank)
Update(points) - might change ranks of existing records
My main problem is efficient implementation of GetRecord(rank), because ranks can change frequently.
I guess an in-memory DBMS would be a good solution, but please don't suggest it; please suggest a data structure.
Basically, you'll just want a pair of balanced search trees, which will allow O(lg n) insertion, deletion, and getRecord operations. The trick is that instead of storing the actual data in the trees, you'll store pointers to a set of record objects, where each record object will contain 5 fields:
user name
point value
rank
pointer back to the node in the name tree that references the object
pointer back to the node in the point tree that references the object.
The name tree is only modified when new records are added and when records are deleted. The point tree is modified for insertions and deletions, but also for updates, where the appropriate record is found, has its point-tree pointer removed, its point count updated, then a new pointer added to the point-tree.
As you mention, you can use a hash table instead for the name tree if you like. The key here is that you simply maintain separate sorted indexes into a set of otherwise unordered records that themselves contain pointers to their index nodes.
The point tree will be some variation on an order statistic tree, which, rather than being a specific data structure, is an umbrella term for a binary search tree whose operations are modified to maintain an invariant that makes the requested rank-related operations more efficient than walking the tree. The details of how the invariants are maintained depend on the underlying balanced search tree used (red-black tree, AVL tree, etc.).
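To make the two-index idea concrete, here is a deliberately simplified Python sketch: a dict serves as the name index, and a plain sorted array kept with bisect stands in for the point tree. The array makes insert/delete O(n); a production version would replace it with the order statistic tree described above to get O(log n) throughout. All names are illustrative.

```python
# Sketch: leaderboard with a name index (dict) and a point index (sorted
# array of (-points, name), so index 0 is the top-ranked record).
import bisect

class Leaderboard:
    def __init__(self):
        self.points = {}      # name index: name -> points
        self.order = []       # point index: (-points, name), ascending

    def insert(self, name, pts):
        if name in self.points:
            self.delete(name)
        self.points[name] = pts
        bisect.insort(self.order, (-pts, name))

    def delete(self, name):
        pts = self.points.pop(name)
        self.order.remove((-pts, name))

    def update(self, name, pts):      # may change ranks of other records
        self.insert(name, pts)

    def get_by_name(self, name):
        return self.points[name]

    def get_by_rank(self, rank):      # rank 1 = highest points
        neg, name = self.order[rank - 1]
        return name, -neg

    def rank_of(self, name):
        return bisect.bisect_left(self.order, (-self.points[name], name)) + 1
```

The structure of the interface (two indexes over one record set) is the point here; only the implementation behind `order` changes when you swap in the tree.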
A skiplist + hashmap should work.
Here is an implementation in Go: https://github.com/wangjia184/sortedset
Every node in the set is associated with these properties:
key is a unique identifier of the node, which is "User Name" in your case.
value is any value associated with the node.
score is a number that decides the order (rank) in the set, which is "Points" in your case.
Each node in the set is associated with a key. While keys are unique, scores may be repeated. Nodes are kept in order (from low score to high score) rather than being sorted afterwards. If scores are equal, nodes are ordered by key in lexicographic order. Each node in the set can also be accessed by rank, which represents its position in the sorted set.
A typical use case of a sorted set is a leaderboard in a massive online game, where every time a new score is submitted you update it using the AddOrUpdate() method. You can easily take the top users using the GetByRankRange() method; you can also, given a user name, return its rank in the listing using the FindRank() method. Using FindRank() and GetByRankRange() together, you can show users with scores similar to a given user's. All very quickly.
Look for a DBMS that includes a function to select a record by sequential record number.
See: How to select the nth row in a SQL database table?
Construct a table with a UserName column and a Points column. Make UserName the primary index. Construct a secondary non-unique maintained index on Points.
To get the record with rank R, select the index on Points and move to record R.
This makes the DBMS engine do most of the work and keeps your part simple.

Priority queue random access

I have a priority queue which sorts elements by some value (let's call it rating). I need to take elements from the queue by rating, so I need to implement a function queue_get(rating). This function also increases the rating, which is fine with a priority heap.
But the problem is that the levels of the heap are not ordered by rating; the elements only satisfy the heap property. So I cannot reliably return the N-th element by rating.
Are there any implementations of priority queue with such functionality?
Should I use another data structure?
The simplest solution is to use a self-balancing binary search tree, e.g. an AVL tree, splay tree, or red-black tree. It allows you to access elements by their key in O(log n) time and iterate through the objects in order in O(log n + k) time, where k is the number of elements iterated.
A collection class will usually give you some Map which is based on an ordered key, such as java.util.TreeMap or C++ std::map. Using this you can retrieve items in sorted order - you may have to invert the order if the class gives you items in increasing order. If all you want to do is to read the top N items this should be enough for you.
If you want random access to the Nth highest item, this can be done by annotating a tree data structure with the number of items beneath each node, but I am not aware of a widely available class library that gives you this.
Come to think of it, if you just want to retrieve the N highest items in order, you can do this with a priority queue if you are prepared to remove the items as you read them out - and put them back again later if you need to restore the original contents.
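That last suggestion is easy to sketch in Python (shown for a min-heap, i.e. the N lowest-rated items; for the N highest you would negate the keys). Note that the stdlib's heapq.nsmallest does the same job without mutating the heap.

```python
# Sketch: read the N smallest items from a heap in sorted order by popping
# them, then push them back to restore the queue's original contents.
import heapq

def peek_n_smallest(heap, n):
    taken = [heapq.heappop(heap) for _ in range(min(n, len(heap)))]
    for item in taken:              # restore the heap
        heapq.heappush(heap, item)
    return taken
```

This is O(n log n) per query, so it only makes sense when such queries are rare compared to normal queue operations.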

return inserted items for a given interval

How would one design a memory-efficient system which accepts items added into it and allows items to be retrieved for a given time interval (i.e. return items inserted between time T1 and time T2)? There is no DB involved; items are stored in memory. What data structure and associated algorithm would you use?
Updated:
Assume extremely high insertion rate compared to data query.
You can use a sorted data structure, keyed by time of arrival. Note the following:
items are not removed
items are inserted in order [if item i was inserted after item j then key(i) > key(j)].
For this reason, a tree is discouraged, since it is overpowered: insertion into it is O(log n), where you can get O(1) insertion. I suggest one of the following:
(1) Array: the array is always filled at its end. When the allocated array is full, reallocate a bigger [double-sized] array and copy the existing array into it.
Advantages: good caching is usually expected with arrays; O(1) amortized insertion; space used is at most 2 * elementSize * #elements.
Disadvantages: high latency: when the array is full, it takes O(n) to add an element, so you have to expect that once in a while there will be a costly operation.
(2) Skip list: a skip list also gives you O(log n) seek and O(1) insertion at the end, and it doesn't have the latency issue. However, it suffers more from cache misses than an array. Space used is on average elementSize * #elements + pointerSize * #elements * 2.
Advantages: O(1) insertion, no costly operations.
Disadvantages: bad caching is expected.
Suggestion:
I suggest using an array if latency is not an issue. If it is, you should better use a skip list.
In both, finding the desired interval is:
findInterval(T1, T2):
    start <- data.find(T1)
    end <- data.find(T2)
    for each element in data from start to end:
        yield element
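The array variant is particularly short in Python: because timestamps arrive in arrival order, the list is always sorted, so a range query is two binary searches plus a slice. Names below are illustrative.

```python
# Sketch: append-only time-indexed log with O(1) amortized insert and
# O(log n + k) range queries, where k is the number of items returned.
import bisect

class TimeLog:
    def __init__(self):
        self.times = []       # sorted by construction (arrival order)
        self.items = []

    def add(self, t, item):   # O(1) amortized append
        self.times.append(t)
        self.items.append(item)

    def query(self, t1, t2):  # items with t1 <= timestamp <= t2
        lo = bisect.bisect_left(self.times, t1)
        hi = bisect.bisect_right(self.times, t2)
        return self.items[lo:hi]
```

This matches the stated workload well: the extremely high insertion rate hits only the O(1) append path, and the rare queries pay the binary searches.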
Either BTree or Binary Search Tree could be a good in-memory data structure to accomplish the above. Just save the timestamp in each node and you can do a range query.
You can add them all to a simple array and sort them.
Do a binary search to locate both T1 and T2. All the array elements between them are what you are looking for.
This is helpful if the searching is done only after all the elements are added. If not, you can use an AVL or red-black tree.
How about a relational interval tree (encode your items as intervals containing only a single element, e.g., [a,a])? Although, as has been said already, the ratio of the anticipated operations matters (a lot, actually). But here are my two cents:
I suppose an item X that is inserted at time t(X) is associated with that timestamp, right? Meaning you don't insert an item now that has a timestamp from a week ago or something. If that's the case, go for the simple array and do interpolation search or something similar (your items will already be sorted according to the attribute that your query refers to, i.e., the time t(X)).
We already have an answer that suggests trees, but I think we need to be more specific: the only situation in which this is really a good solution is if you are very specific about how you build up the tree (and then I would say it's on par with the skip lists suggested in a different answer). The objective is to keep the tree as full as possible on the left; I'll make clearer in the following what that means. Make sure each node has a pointer to its (up to) two children and to its parent, and knows the depth of the subtree rooted at that node.
Keep a pointer to the root node so that you are able to do lookups in O(log(n)), and keep a pointer to the last inserted node N (which is necessarily the node with the highest key - its timestamp will be the highest). When you are inserting a node, check how many children N has:
If 0, then replace N with the new node you are inserting and make N its left child. (At this point you'll need to update the tree depth field of at most O(log(n)) nodes.)
If 1, then add the new node as its right child.
If 2, then things get interesting. Go up the tree from N until either you find a node that has only 1 child, or the root. If you find a node with only 1 child (this is necessarily the left child), then add the new node as its new right child. If all nodes up to the root have two children, then the current tree is full. Add the new node as the new root node and the old root node as its left child. Don't change the old tree structure otherwise.
Addendum: to improve cache behaviour and memory overhead, the best solution is probably to make a tree or skip list of arrays. Instead of each node holding a single timestamp and a single value, have each node hold an array of, say, 1024 timestamps and values. When an array fills up, you add a new one in the top-level data structure, but in most steps you just append a single element to the end of the "current array". This wouldn't affect big-O behaviour with respect to either memory or time, but it reduces the overhead by a factor of 1024, while latency stays very small.
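The chunked layout in the addendum can be sketched as follows (Python, illustrative names; a plain list stands in for the tree/skip list at the top level, and the range query scans chunks linearly for brevity):

```python
# Sketch: append-only log of fixed-size chunks. Most inserts are a plain
# append; a new chunk is allocated only when the current one fills up.
import bisect

CHUNK = 1024   # chunk size from the addendum

class ChunkedLog:
    def __init__(self):
        self.times = [[]]     # chunks of timestamps, globally sorted
        self.values = [[]]    # parallel chunks of values

    def add(self, t, value):
        if len(self.times[-1]) == CHUNK:    # current chunk full: open new one
            self.times.append([])
            self.values.append([])
        self.times[-1].append(t)
        self.values[-1].append(value)

    def query(self, t1, t2):                # values with t1 <= t <= t2
        out = []
        for ts, vs in zip(self.times, self.values):
            lo = bisect.bisect_left(ts, t1)
            hi = bisect.bisect_right(ts, t2)
            out.extend(vs[lo:hi])
        return out
```

Replacing the top-level list with the tree or skip list described above restores the O(log n) chunk lookup while keeping the per-chunk savings in pointer overhead.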
