Data structure / algorithms for getting best and worst-scoring object from set

My algorithm runs a loop in which a set of objects is maintained. In each iteration, objects are added to and removed from the set. Each object also carries some "measures" (integer values, possibly several of them), which can change at any time. From those measures and the current iteration number, a score can be calculated for each object.
Whenever the number of objects passes a certain threshold, I want to identify and remove the lowest-scoring objects until the number of objects is again below that threshold. That is, with n objects and threshold t, if n > t then remove the n-t lowest-scoring objects.
But also, periodically I want to get the highest-scoring objects.
I'm really at a loss as to what data structure I should use here to do this efficiently. A priority queue doesn't really work, as the measures change all the time, and in any case the "score" I want to use can be an arbitrarily complex function of those measures and the current iteration number. The obvious approach is probably a hash table storing object -> measures associations, with amortized O(1) add/remove/update operations, but then finding the lowest- or highest-scoring objects would be O(n) in the number of elements. n can easily be in the millions after a short while, so this isn't ideal. Is this the best I can do?
I realise this probably isn't trivial, but I'd like to know if anyone has suggestions as to how this could best be implemented.
PS: The language is OCaml but it really doesn't matter.

For this level of generality, the best approach would be something for quick access to the measures (storing them in the object or via a pointer would be best, but a hash table would also work), plus an additional data structure keeping an ordered view of your objects.
Every time you update the measures, you would recompute the score and update the ordered data structure. Something like a balanced BST would work well (RB-tree, AVL) and would guarantee O(log n) update complexity.
You can also keep a min-max heap instead of the BST. This has the advantage of using fewer pointers, which should lower the overhead of the solution. Complexity remains O(log n) per update.
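A minimal OCaml sketch of the hash-table-plus-ordered-tree variant (the names pool, update, enforce_threshold, the integer ids, float scores and the initial size 1024 are all illustrative assumptions, not from the question): the Hashtbl gives O(1) access to an object's current score, and the standard Set, itself a balanced tree keyed by (score, id), gives O(log n) updates and O(log n) access to both extremes.

module Scored = Set.Make (struct
  type t = float * int                (* (score, object id) *)
  let compare = compare
end)

type pool = {
  mutable ordered : Scored.t;         (* ordered view, lowest score first *)
  scores : (int, float) Hashtbl.t;    (* id -> current score *)
}

let create () = { ordered = Scored.empty; scores = Hashtbl.create 1024 }

(* Insert a new object or refresh its score after a measure changed:
   drop the stale (score, id) pair, then add the new one; O(log n) each. *)
let update pool id score =
  (match Hashtbl.find_opt pool.scores id with
   | Some old -> pool.ordered <- Scored.remove (old, id) pool.ordered
   | None -> ());
  Hashtbl.replace pool.scores id score;
  pool.ordered <- Scored.add (score, id) pool.ordered

let remove pool id =
  match Hashtbl.find_opt pool.scores id with
  | Some old ->
      Hashtbl.remove pool.scores id;
      pool.ordered <- Scored.remove (old, id) pool.ordered
  | None -> ()

(* Evict lowest-scoring objects until the size is back under [threshold]. *)
let enforce_threshold pool threshold =
  while Hashtbl.length pool.scores > threshold do
    let ((_, id) as lowest) = Scored.min_elt pool.ordered in
    pool.ordered <- Scored.remove lowest pool.ordered;
    Hashtbl.remove pool.scores id
  done

(* Highest-scoring object, if any. *)
let best pool = Scored.max_elt_opt pool.ordered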
You've mentioned that the score depends on the iteration number. This is bad for performance because it requires all entries to be updated every iteration. However, if you can isolate the impact (say the score is g(all_metrics) - f(iteration_number)) so that all elements are affected equally, then the relative order remains consistent and you can skip recomputing the scores every iteration.
If it's not constant but still isolated (something like f(iteration_number, important_time)), you can use the balanced BST, calculate at which iteration each element will swap with one of its neighbours, keep those swap times in a heap, and only update the elements that would actually swap.
If it's not isolated at all, then you need to update all the elements at each iteration anyway, so you might as well keep track of the highest value and the K lowest ones as you go through them to recompute the scores. This has a complexity of O(N log K), where K is the number of lowest values you want to remove (hopefully K is small, so it behaves almost like O(N)).
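If you do end up in that last case, the bookkeeping during the full rescan can itself be a small bounded set. A rough OCaml sketch along those lines (the k_lowest function and the (id, score) list it consumes are illustrative, not from the question):

module Worst = Set.Make (struct
  type t = float * int                (* (score, object id) *)
  let compare = compare
end)

(* One pass over freshly recomputed scores, keeping only the K lowest-scoring
   pairs in a set of size at most K: O(N log K) overall. *)
let k_lowest k scored =
  let step (acc, size) (id, score) =
    let acc = Worst.add (score, id) acc in
    if size >= k then (Worst.remove (Worst.max_elt acc) acc, size)
    else (acc, size + 1)
  in
  fst (List.fold_left step (Worst.empty, 0) scored)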

Related

Redis Sorted Set Member Size and Performance

Redis Sorted Sets primarily sort based on a Score; however, in cases where multiple members share the same Score, lexicographical (Alpha) sorting is used. The Redis zadd documentation indicates that the function complexity is:
"O(log(N)) where N is the number of elements in the sorted set"
I have to assume this remains true regardless of the member size/length; however, I have a case where there are only 4 scores, resulting in members being sorted lexicographically after Score.
I want to prepend a time-based key to each member so that the secondary sort is time-based, and also to add some uniqueness to the members. Something like:
"time-based-key:member-string"
My member-string can be larger JavaScript object literals like so:
JSON.stringify( {/* object literal */} )
Will the performance of zadd and the other sorted set functionality remain constant?
If not, by what magnitude will performance be affected?
The complexity comes from the number of elements that need to be tested (compared against the new element) to find the correct insertion point (presumably using a binary search algorithm).
It says nothing about how long it will take to perform each test, because that's considered a constant factor (in the sense that it doesn't vary when you add more items).
The amount of data which needs to be compared before determining that a new element should go before or after an existing one will affect the total clock time, but it will do so for each comparison equally.
So your overall clock time for an insert will be quickest when comparing scores only, and progressively slower the deeper into a pair of strings it has to look to determine their lexical order. This won't be any particular magnitude, though, just the concrete number of microseconds to be multiplied by the log(n) complexity factor.

Efficiently Filtering out Sorted Data With A Second Predicate

Let's say that I have a list (or a hashmap, etc., whatever makes this fastest) of objects that contain the following fields: name, time added, and time removed. The list given to me is already sorted by time removed. Now, given a time T, I want to filter out (remove from the list) all objects of the list where:
the time T is greater than an object's time removed OR T is less than an object's time added.
So after processing, the list should only contain objects where T falls in the range specified by time added and time removed.
I know I can do this easily in O(n) time by going through each individual object, but I was wondering if there was a more efficient way considering the list was already sorted by the first predicate (time removed).
Also, I know I can easily remove all objects with time removed less than T because the list is presorted (possibly in O(log n) time, since I can do a binary search to find the first element whose time removed is not less than T and then remove the part of the list before that object).
(Irrelevant additional info: I will be using C++ for any code that I write)
Unfortunately you are stuck with O(n) as your fastest option, unless there are hidden constraints on the difference between time added and time removed (such as a maximum time span) that can be exploited.
As you said, you can start at the first object whose time removed equals T (or is the first one greater than it). Unfortunately, you'll then need to go through the rest of the list to check whether each object's time added is not greater than your time.
Because a comparison sort is at best O(n log n), you cannot sort the objects again to improve your performance.
One thing: depending on the characteristics of the application, it may be beneficial to receive the data ordered by date added instead, but that is between you and wherever you get the data from.
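A sketch of that two-step filter (in OCaml here rather than the asker's C++, with made-up record fields and function names): the prefix whose time removed is below T can be dropped without looking past it, but the time-added test still has to visit everything that remains.

type entry = { name : string; added : int; removed : int }

(* [entries] is sorted ascending by [removed]. Drop the prefix removed before
   time [t]; this scan stops at the first survivor (with an array you could
   binary-search for the boundary instead). *)
let rec drop_removed_before t = function
  | e :: rest when e.removed < t -> drop_removed_before t rest
  | rest -> rest

(* Keep only entries whose [added, removed] interval contains [t]. The second
   pass is the unavoidable O(n) part. *)
let filter_at t entries =
  entries
  |> drop_removed_before t
  |> List.filter (fun e -> e.added <= t)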
Let's examine the data structures you offered:
A list (usually implemented as a linked list, or a dynamic array), or a hash map.
Linked list: cannot do binary search; finding the first occurrence of an element (even if the list is sorted) takes O(n), so there is no benefit from the fact that the data is sorted.
Dynamic array: removing a single element (or more) from an arbitrary location requires shifting all the following elements to the left, and is thus O(n). You cannot remove elements better than O(n), so no gain here from the fact that the data structure is sorted.
HashMap: unsorted by definition. Also, removing k elements is O(k); there is no way around this.
So, you cannot even improve performance from O(n) to O(log n) for the same field the list was sorted by.
Some data structures such as B+ trees do allow efficient range queries, and you can pretty efficiently [O(log n)] remove a range of elements from the tree.
However, it does not help you filter on the second field, which the tree is not sorted by; filtering according to it (unless there is some correlation you can exploit) will still need O(n) time.
If all you are going to do later is iterate over the new list, you can push the evaluation to the iteration step, but there won't be any real benefit from it: you are only delaying the processing to when it's needed, and avoiding it if it turns out not to be needed.

Data structure that allows accessing elements by index and delete them in O(1)

I have the following task (as part of a bigger task):
I need to take the k-th element from an array-like data structure and delete it (k can be any valid index). An array has O(n) deletion and a list has O(n) search. I would like to do both operations in O(1) time.
Which data structure should I use to meet this requirement?
Clarification:
Deleting the element at index 5 will move the element from index 6 to index 5.
This particular task is the TopCoder SRM 300 Div 2 500-point problem. It does not require such a sophisticated data structure (simple Java methods will do the job, since the maximum data size is really small), but I am curious how to deal with a much bigger problem using C-like thinking about data.
So maybe I am sticking too much to arrays for this problem? I will analyse it and edit the question later, after work (if you are really curious, you can see the task on TopCoder).
I believe what you're asking for is impossible.
However, if you can relax your requirement for indexing to O(log n), then ropes may be able to satisfy it, although I'm not sure if they have a probabilistic or deterministic guarantee (I think it's probabilistic).
Given the nature of the "dating" problem, it involves continuously choosing and removing the "best" member of a set: a classic priority queue. In fact, you'll need to build two of those (one for men and one for women). You'll either have to build them in O(N log N) time (a sorted list) for O(1) removal, or build them in linear time (a heap) for O(log N) removal. Overall you get O(N log N) either way, since you'll be removing all of one queue and most of the other.
So the question is what structure supports the other part of the task: choosing the "chooser" from the circle and removing him and his choice. Since this too must be done N times, any method that accomplishes the removal in O(log N) time won't increase the overall complexity of your algorithm. You can't get O(1) indexed access with fast deletions, given the re-indexing requirement. But you can get O(log N) for both indexed access and deletion with a tree (something like the rope mentioned above). That gives you O(N log N) overall, which is the best you can do anyway.
There is a solution that may be satisfactory in some cases: use an array together with a vector recording deletions. Every time you delete an element, you put its index in the vector. Every time you read an element at some index, you recalculate its index based on the previous deletions.
Say, you have an array of:
A = [3, 7, 6, 4, 3]
You delete the 3rd element:
A = [3, 7, 6, 4, 3] (no actual deletion)
d = [3]
And then read the 4th:
i = 4
3 < 4 => i += 1
A[i] = 3
This is not exactly O(1), but it does not depend on the array length, only on the number of deleted elements.
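An OCaml sketch of the same trick, assuming 0-based indexing and an integer payload (the type and function names are made up for illustration): the array is never compacted, and a sorted list of deleted physical indices translates a logical index into a physical one at a cost proportional to the number of deletions.

type lazy_array = {
  data : int array;
  mutable deleted : int list;     (* physical indices, kept sorted ascending *)
}

(* Translate a logical index (post-deletion view) into a physical index:
   every recorded deletion at or before the running position shifts it right. *)
let physical_index t i =
  List.fold_left (fun i d -> if d <= i then i + 1 else i) i t.deleted

let get t i = t.data.(physical_index t i)

(* Deleting only records the physical index; nothing is moved. *)
let delete t i =
  let p = physical_index t i in
  t.deleted <- List.sort compare (p :: t.deleted)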
The only data structure with a small overhead for adding and removing elements is a hashtable. The only overhead is the cost of the hash function (which is considered O(1) if you take a purely theoretical approach).
But, if you want it to be extremely efficient, you will need to:
Have an estimate of the number of elements you will have to put into your data structure (and allocate that capacity once and for all at the beginning).
Choose a hash function that avoids collisions given the way your keys are distributed (collisions destroy the efficiency of hashtables).
If you manage to get everything right, then you should be optimal.
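In OCaml terms, the first point amounts to passing the expected size to Hashtbl.create up front (the key/value types and the figure below are just an example):

(* Pre-size the table so it never has to grow and rehash during the load. *)
let table : (string, int) Hashtbl.t = Hashtbl.create 1_000_000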

Binary Search Tree for specific intent

We all know there are plenty of self-balancing binary search trees (BSTs), the most famous being the Red-Black and the AVL trees. It might be useful to take a look at AA-trees and scapegoat trees too.
I want to do deletions, insertions and searches, like any other BST. However, it will be common to delete all values in a given range, or to delete whole subtrees. So:
I want to insert, search, remove values in O(log n) (balanced tree).
I would like to delete a subtree, keeping the whole tree balanced, in O(log n) (worst-case or amortized)
It might be useful to delete several values in a row, before balancing the tree
I will most often insert 2 values at once, however this is not a rule (just a tip in case there is a tree data structure that takes this into account)
Is there a variant of AVL or RB trees that helps me with this? Scapegoat trees look closest to this, but they would also need some changes; can anyone who has experience with them share some thoughts?
More precisely, which balancing procedure and/or removal procedure would help me keep these actions time-efficient?
It is possible to delete a range of values from a BST in O(log n + number of objects in the range).
The easiest way I know is to work with the Deterministic Skip List data structure (you might want to read a bit about this data structure before you go on).
In a deterministic skip list, all of the real values are stored in the bottom level, and there are pointers in the upper levels to them. Insert, search and remove are done in O(log n).
The range deletion operation can be done according to the following algorithm:
Find the first element in the range: O(log n)
Go forward in the linked list and remove all elements that are still in the range. If an element has pointers in the upper levels, remove those too, up to the topmost level (removal from a linked list): O(number of deleted objects)
Fix the pointers to restore the deterministic skip list invariant (2-3 elements between consecutive upward pointers)
The total complexity of the range delete is O(logn + number of objects in the range).
Notice that if you choose to work with a randomized skip list, you get the same complexity, but on average rather than worst case. The plus is that you don't have to fix the upper-level pointers to meet the 2-3 invariant.
A deterministic skip list has a 1-1 mapping to a 2-3 tree, so with some more work, the procedure described above could work for a 2-3 tree as well.
Long ago, in the pre-STL days, I wrote my own B-Tree (BST) algorithm because I had a rather large data set at the time (roughly 700K items in 2 interdependent trees). I found that rebalancing after every 100-200 insertions/deletions was the peak performance I could get at the time, based on experimentation on 486 and SGI hardware. This number may be different now, or maybe not, since it does appear to be an algorithmic optimization limit unless you convert to a parallel model.
In short, you could trigger rebalancing after a set number of modifications, and allow a forced rebalance once you've completed all your modifications.
The improvement was remarkable. The initial straight load had not completed after 25 minutes (I killed the process). Rebalancing as we went was also killed after 15 minutes. The restricted-modification load with a rebalance every 100 modifications loaded and ran in less than 3 minutes. Note that during the "run" portion, there were 0-8 modifications to the tree per initial entry. You really need to consider whether you always need to be in balance when the tree will be modified again in the near term.
Hmm, what about B-trees? They are also balanced, and if you choose one with a large order (it depends on how many items you have), you will save a bunch of object creation/destruction time.
Regarding your second point: if you have a B-tree of order 100, you can remove up to 100 items with one function call.
Regarding your third point: this feature can be applied to almost any tree; just implement a RemoveSome() function that removes N items and then rebalances. For B-trees it's a bit trickier, but it can be done.
Note: I assumed you're a programmer. If you need a complete, tested, off-the-shelf solution, you need another answer.
It should be easy to implement deleting a node and its subtree in an AVL tree if every node stores its height instead of a balance factor. After deleting a node, keep rotating until the heights of the two child subtrees differ by no more than one. Then move up the tree and repeat. The only real difference from a normal deletion is a while loop instead of an if when testing the heights.
The Set implementation in the OCaml standard library is a purely functional AVL tree that satisfies all of your requirements and, in particular, has very efficient implementations of set theoretic operations (union, intersection, difference). Insertion and deletion are O(log n). You can remove subtrees and runs of elements by representing them as a set and using set difference. You can insert two elements simultaneously by creating a 2-element set and applying set union.
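For example, with OCaml's Set both operations can be expressed roughly like this (using split rather than building an explicit range set; IS, remove_range and insert_two are just illustrative names):

module IS = Set.Make (Int)

(* Delete every element in the closed range [lo, hi]: two splits and a union,
   which the balanced-tree implementation handles efficiently. *)
let remove_range lo hi s =
  let below, _, rest = IS.split lo s in     (* elements strictly below lo *)
  let _, _, above = IS.split hi rest in     (* elements strictly above hi *)
  IS.union below above

(* Insert two values at once via union with a 2-element set. *)
let insert_two a b s = IS.union s (IS.of_list [ a; b ])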

What sort of sorted datastructure is optimized for finding items within a range?

Say I have a bunch of objects with dates and I regularly want to find all the objects that fall between two arbitrary dates. What sort of datastructure would be good for this?
A binary search tree sounds like what you're looking for.
You can use it to find all the objects in O(log(N) + K), where N is the total number of objects and K is the number of objects that are actually in that range (provided that the tree is balanced). Insertion/removal is O(log(N)).
Most languages have a built-in implementation of this.
C++:
http://www.cplusplus.com/reference/stl/set/
Java:
http://java.sun.com/j2se/1.4.2/docs/api/java/util/TreeSet.html
You can find the lower bound of the range (in log(n)) and then iterate from there until you reach the upper bound.
Assuming that by sorted you mean sorted by date, an array will do it.
Do a binary search to find the index of the first item that is >= the start date. You can then either do another search to find the index of the last item that is <= the end date, leaving you with an offset and count of items, or, if you're going to process them anyway, just iterate through the list until you exceed the end date.
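A small OCaml sketch of that array approach, assuming dates are plain integers such as epoch seconds and that start <= stop (lower_bound and in_range are illustrative names):

(* Index of the first element >= key in a sorted array (classic binary search
   over a half-open interval). *)
let lower_bound dates key =
  let lo = ref 0 and hi = ref (Array.length dates) in
  while !lo < !hi do
    let mid = (!lo + !hi) / 2 in
    if dates.(mid) < key then lo := mid + 1 else hi := mid
  done;
  !lo

(* All dates d with start <= d <= stop, as a slice of the sorted array. *)
let in_range dates start stop =
  let first = lower_bound dates start in
  let past = lower_bound dates (stop + 1) in   (* first index beyond stop *)
  Array.sub dates first (past - first)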
It's hard to give a good answer without a little more detail.
What kind of performance do you need?
If linear is fine, then I would just use a list of dates and iterate through it, collecting all dates that fall within the range, as Andrew Grant suggested.
Do you have duplicates in the list?
If you need to have repeated dates in your collection, then most implementations of a binary tree would probably be out. Something like Java's TreeSet is a set implementation and doesn't allow repeated elements.
What are the access characteristics? Lots of lookups with few updates, vice-versa, or fairly even?
Most data structures have trade-offs between lookups and updates. If you're doing lots of updates, then a data structure optimized for lookups won't be so great.
So: what are the access characteristics of the data, what kind of performance do you need, and what are the structural characteristics it must support (e.g. must allow repeated elements)?
If you need to make random-access modifications: a tree, as in v3's answer. Find the bottom of the range by lookup, then count upwards. Inserting or deleting a node is O(log N). stbuton makes a good point that if you want to allow duplicates (as seems plausible for datestamped events), then you don't want a tree-based set.
If you do not need to make random-access modifications: a sorted array (or vector or whatever). Find the location of the start of the range by binary chop, then count upwards. Inserting or deleting is O(N) in the middle. Duplicates are easy.
Algorithmic performance of lookups is the same in both cases, O(M + log N), where M is the size of the range. But the array uses less memory per entry, and might be faster to count through the range, because after the binary chop it's just forward sequential memory access rather than following pointers.
In both cases you can arrange for insertion at the end to be (amortised) O(1). For the tree, keep a record of the end element at the head, and you get an O(1) bound. For the array, grow it exponentially and you get amortised O(1). This is useful if the changes you make are always or almost-always "add a new event with the current time", since time is (you'd hope) a non-decreasing quantity. If you're using system time then of course you'd have to check, to avoid accidents when the clock resets backwards.
Alternative answer: an SQL table, and let the database optimise how it wants. And Google's BigTable structure is specifically designed to make queries fast, by ensuring that the result of any query is always a consecutive sequence from a pre-prepared index :-)
You want a structure that keeps your objects sorted by date, whenever you insert or remove a new one, and where finding the boundary for the segment of all objects later than or earlier than a given date is easy.
A heap seems the perfect candidate. In practical applications, heaps are simply represented by an array, where all the objects are stored in order. Seeing that sorted array as a heap is simply a way to make insertions of new objects and deletions happen in the right place, and in O(log(n)).
When you have to find all the objects between date A (excluded) and date B (included), find the position of A (or its insertion position, that is, the position of the earliest element later than A) and the position of B (or its insertion position), and return all the objects between those positions (which is simply the section between those positions in the array/heap).
