What sort of sorted data structure is optimized for finding items within a range?

Say I have a bunch of objects with dates and I regularly want to find all the objects that fall between two arbitrary dates. What sort of datastructure would be good for this?

A binary search tree sounds like what you're looking for.
You can use it to find all the objects in O(log(N) + K), where N is the total number of objects and K is the number of objects that actually fall in the range, provided that the tree is balanced. Insertion and removal are O(log(N)).
Most languages have a built-in implementation of this.
C++:
http://www.cplusplus.com/reference/stl/set/
Java:
http://java.sun.com/j2se/1.4.2/docs/api/java/util/TreeSet.html
You can find the lower bound of the range (in log(n)) and then iterate from there until you reach the upper bound.

Assuming you mean by date when you say sorted, an array will do it.
Do a binary search to find the first index whose date is >= the start date. You can then either do another search to find the last index whose date is <= the end date, leaving you with an offset and a count of items, or, if you're going to process them anyway, just iterate through the list until you exceed the end date.
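A minimal sketch of the two-binary-search variant, in Python with the standard bisect module. The events list and the in_range helper are made up for illustration, and the range is assumed to be inclusive at both ends:

```python
from bisect import bisect_left, bisect_right
from datetime import date

# Hypothetical data: (date, payload) pairs kept sorted by date.
events = [
    (date(2009, 1, 5), "a"),
    (date(2009, 2, 10), "b"),
    (date(2009, 2, 10), "c"),  # duplicate dates are fine in a sorted array
    (date(2009, 3, 1), "d"),
    (date(2009, 4, 20), "e"),
]
dates = [d for d, _ in events]  # parallel list of keys for bisect


def in_range(start, end):
    """Return all events with start <= date <= end."""
    lo = bisect_left(dates, start)  # first index with date >= start
    hi = bisect_right(dates, end)   # first index with date > end
    return events[lo:hi]


result = in_range(date(2009, 2, 1), date(2009, 3, 31))
```

Both searches are O(log N), and the slice costs O(K) for the K matching items.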

It's hard to give a good answer without a little more detail.
What kind of performance do you need?
If linear is fine, then I would just use a list of dates and iterate through it, collecting all dates that fall within the range, as Andrew Grant suggested.
Do you have duplicates in the list?
If you need to have repeated dates in your collection, then most implementations of a binary tree would probably be out. Something like Java's TreeSet is a set implementation and doesn't allow repeated elements.
What are the access characteristics? Lots of lookups with few updates, vice-versa, or fairly even?
Most data structures have trade-offs between lookups and updates. If you're doing lots of updates, then some data structures that are optimized for lookups won't be so great.
So what are the access characteristics of the data structure, what kind of performance do you need, and what are structural characteristics that it must support (e.g. must allow repeated elements)?

If you need to make random-access modifications: a tree, as in v3's answer. Find the bottom of the range by lookup, then count upwards. Inserting or deleting a node is O(log N). stbuton makes a good point that if you want to allow duplicates (as seems plausible for datestamped events), then you don't want a tree-based set.
If you do not need to make random-access modifications: a sorted array (or vector or whatever). Find the location of the start of the range by binary chop, then count upwards. Inserting or deleting is O(N) in the middle. Duplicates are easy.
Algorithmic performance of lookups is the same in both cases, O(M + log N), where M is the size of the range. But the array uses less memory per entry, and might be faster to count through the range, because after the binary chop it's just forward sequential memory access rather than following pointers.
In both cases you can arrange for insertion at the end to be (amortised) O(1). For the tree, keep a record of the end element at the head, and you get an O(1) bound. For the array, grow it exponentially and you get amortised O(1). This is useful if the changes you make are always or almost-always "add a new event with the current time", since time is (you'd hope) a non-decreasing quantity. If you're using system time then of course you'd have to check, to avoid accidents when the clock resets backwards.
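For the array case, the append fast path with a check for a clock that went backwards might look roughly like this in Python. The add_event name and the integer timestamps are assumptions for the sake of the example:

```python
from bisect import insort


def add_event(timestamps, t):
    """Insert t into the sorted list timestamps, appending when possible."""
    if not timestamps or t >= timestamps[-1]:
        timestamps.append(t)   # common case: amortised O(1)
    else:
        insort(timestamps, t)  # clock went backwards: O(N) insert in the middle


log = []
for t in [10, 11, 11, 15, 13, 20]:  # 13 arrives out of order
    add_event(log, t)
```

The list stays sorted either way; only the out-of-order insert pays the O(N) shifting cost.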
Alternative answer: an SQL table, and let the database optimise how it wants. And Google's BigTable structure is specifically designed to make queries fast, by ensuring that the result of any query is always a consecutive sequence from a pre-prepared index :-)

You want a structure that keeps your objects sorted by date, whenever you insert or remove a new one, and where finding the boundary for the segment of all objects later than or earlier than a given date is easy.
A sorted array seems a good candidate. A heap is often suggested for this because heaps are also stored in a plain array, but a binary heap keeps its elements only partially ordered, so it cannot give you the boundaries of a date range; the array has to be kept fully sorted. With a sorted array, a binary search finds the right place for an insertion or deletion in O(log(n)), although shifting the elements after that place makes the update itself O(n).
When you have to find all the objects between date A (excluded) and B (included), find the position of A (or its insert position, that is, the position of the earliest element later than A) and the position of B (or the insert position of B), and return all the objects between those positions, which is simply a slice of the array.
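The lookup between A (excluded) and B (included) can be sketched in Python with two calls to bisect_right. The function name is made up, and plain integers stand in for dates:

```python
from bisect import bisect_right


def after_a_up_to_b(sorted_dates, a, b):
    """Return the elements x with a < x <= b from a sorted list."""
    lo = bisect_right(sorted_dates, a)  # first index with value > a
    hi = bisect_right(sorted_dates, b)  # first index with value > b
    return sorted_dates[lo:hi]


days = [1, 3, 3, 5, 8, 8, 9]
selected = after_a_up_to_b(days, 3, 8)
```

Using bisect_right for both ends is what makes the lower bound exclusive and the upper bound inclusive.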

Related

Data structure / algorithms for getting best and worst-scoring object from set

My algorithm runs a loop where a set of objects is maintained. In each iteration there are objects being added and removed from the set. Also, there are some "measures" (integer values, possibly several of them) for each object, which can change at any time. From those measures, a score can be calculated based on the measures and the iteration number.
Whenever the number of objects passes a certain threshold, I want to identify and remove the lowest-scoring objects until the number of objects is again below that threshold. That is: if there are n objects with threshold t, if n>t then remove the n-t lowest-scoring objects.
But also, periodically I want to get the highest-scoring
I'm really at a loss as to what data structure I should use here to do this efficiently. A priority queue doesn't really work as measures are changed all the time and anyway the "score" I want to use can be any arbitrarily complex function of those measures and the current iteration number. The obvious approach is probably a hash-table storing associations object -> measures, with amortized O(1) add/remove/update operations, but then finding the lowest or highest scoring objects would be O(n) in the number of elements. n can be easily in the millions after a short while so this isn't ideal. Is this the best I can do?
I realise this probably isn't very trivial but I'd like to know if anyone has any suggestions as to how this could be best implemented.
PS: The language is OCaml but it really doesn't matter.
For this level of generality the best would be to have something for quick access to the measures (storing them in the object or via a pointer would be best, but a hash-table would also work) and an additional data structure for keeping an ordered view of your objects.
Every time you update the measures you would want to refresh the score and update the ordered data structure. Something like a balanced BST would work well (RB-tree, AVL) and would guarantee O(log N) update complexity.
You can also keep a min-max heap instead of the BST. This has the advantage of using fewer pointers, which should lower the overhead of the solution. Complexity remains O(log N) per update.
You've mentioned that the score depends on iteration number. This is bad for performance because it requires all entries to be updated every iteration. However, if you can isolate the impact (say the score is g(all_metrics) - f(iteration_number)) so that all elements are impacted the same then the relative order should remain consistent and you can skip updating the score every iteration.
If it's not constant, but it's still isolated (something like f(iteration_number, important_time)), you can use the balanced BST and calculate when the iteration will swap each element with one of its neighbours, then keep the swap times in a heap, and only update the elements that would swap.
If it's not isolated at all then you would need at each iteration to update all the elements, so you might as well keep track of the highest value and the lowest ones when you go through them to recompute the scores. This at least will have a complexity of O(NlogK) where K is the number of lowest values you want to remove (hopefully it's very small so it should behave almost like O(N)).
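For that last case, where everything is rescored each iteration anyway, here is a rough Python sketch using heapq.nsmallest, which selects the K lowest in O(N log K). The score function and the sample objects are invented purely for illustration:

```python
import heapq


def score(measures, iteration):
    # Hypothetical scoring function; the real one can be arbitrarily complex.
    return sum(measures) + iteration * measures[0]


def lowest_and_highest(objects, iteration, k):
    """objects maps name -> tuple of measures.

    Returns (names of the k lowest-scoring objects, name of the highest).
    """
    scored = [(score(m, iteration), name) for name, m in objects.items()]
    lowest = heapq.nsmallest(k, scored)  # O(N log K)
    highest = max(scored)                # O(N)
    return [name for _, name in lowest], highest[1]


objects = {"a": (1, 2), "b": (5, 0), "c": (0, 0), "d": (3, 3)}
to_remove, best = lowest_and_highest(objects, iteration=2, k=2)
```

Here the scores at iteration 2 are a=5, b=15, c=0 and d=12, so the two lowest are c and a, and the highest is b.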

Efficiently Filtering out Sorted Data With A Second Predicate

Let's say that I have a list (or a hashmap etc., whatever makes this the fastest) of objects that contain the following fields: name, time added, and time removed. The list given to me is already sorted by time removed. Now, given a time T, I want to filter out (remove from the list) all objects where:
the time T is greater than an object's time removed OR T is less than an object's time added.
So after processing, the list should only contain objects where T falls in the range specified by time added and time removed.
I know I can do this easily in O(n) time by going through each individual object, but I was wondering if there was a more efficient way considering the list was already sorted by the first predicate (time removed).
*Also, I know I can easily remove all objects with time removed less than T because the list is presorted (possibly in O(log n) time, since I can do a binary search to find the first element whose time removed is not less than T and then remove the part of the list before that object).
(Irrelevant additional info: I will be using C++ for any code that I write)
Unfortunately, you are stuck with O(n) being your fastest option, unless there are hidden requirements about the difference between time added and time removed (such as a maximum time span) that can be exploited.
As you said, you can start the search at the first object whose time removed is equal to or greater than T. Unfortunately, you'll need to go through the rest of the list to see whether each time added is less than or equal to T.
Because a comparison sort is at best O(n*log(n)), you cannot sort the objects again to improve your performance.
One thing, based on the heuristics of the application it may be beneficial to receive the data in order of date added but that is between you and wherever you get the data from.
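As a rough illustration of that approach (binary search on time removed, then a linear scan on time added), here is a Python sketch. The record layout and the alive_at name are assumptions:

```python
from bisect import bisect_left

# Hypothetical records: (name, time_added, time_removed), sorted by time_removed.
records = [
    ("a", 1, 4),
    ("b", 2, 6),
    ("c", 7, 8),
    ("d", 3, 9),
    ("e", 10, 12),
]
removed_times = [r[2] for r in records]  # parallel list of sort keys


def alive_at(t):
    """Return records with time_added <= t <= time_removed."""
    # Skip everything removed before t: O(log n).
    start = bisect_left(removed_times, t)
    # time_added is not sorted, so the remainder still needs a linear scan.
    return [r for r in records[start:] if r[1] <= t]


names = [r[0] for r in alive_at(5)]
```

The binary search only shrinks the part that has to be scanned; the worst case is still O(n).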
Let's examine the data structures you offered:
A list (usually implemented as a linked list, or a dynamic array), or a hash map.
Linked List: Cannot do binary search; finding the first occurrence of an element (even if the list is sorted) is done in O(n), so there is no benefit from the fact that the data is sorted.
Dynamic Array: Removing a single element (or more) from arbitrary location requires shifting all the following elements to the left, and thus is O(n). You cannot remove elements from the list better than O(n), so no gain here from the fact the DS is sorted.
HashMap: is unsorted by definition. Also, removing k elements is O(k), no way to go around this.
So, you cannot even improve performance from O(n) to O(logn) for the same field the list was sorted by.
Some data structures such as B+ trees do allow efficient range queries, and you can pretty efficiently [O(logn)] remove a range of elements from the tree.
However, it does not help you filter on the 2nd field, which the tree is not sorted by. Filtering according to it (unless there is some correlation you can exploit) will still need O(n) time.
If all you are going to do is to later on iterate the new list, you can push the evaluation to the iteration step, but there won't be any real benefit from it - only delaying the processing to when it's needed, and avoiding it, if it is not needed.

Data structure for non overlapping ranges of integers?

I remember learning a data structure that stored a set of integers as ranges in a tree, but it's been 10 years and I can't remember the name of the data structure, and I'm a bit fuzzy on the details. If it helps, it's a functional data structure that was taught at CMU, I believe in 15-212 (Principles of Programming) in 2002.
Basically, I want to store a set of integers, most of which are consecutive. I want to be able to query for set membership efficiently, add a range of integers efficiently, and remove a range of integers efficiently. In particular, I don't care to preserve what the original ranges are. It's better if adjacent ranges are coalesced into a single larger range.
A naive implementation would be to simply use a generic set data structure such as a HashSet or TreeSet, and add all integers in a range when adding a range, or remove all integers in a range when removing a range. But of course, that would waste a lot of memory in addition to making add and remove slow.
I'm thinking of a purely functional data structure, but for my current use I don't need it to be. IIRC, lookup, insertion, and deletion were all O(log N), where N was the number of ranges in the set.
So, can you tell me the name of the data structure I'm trying to remember, or a suitable alternative?
I found the old homework, and the data structure I had in mind was the Discrete Interval Encoding Tree, or diet for short. Diets are described in detail in Diets for Fat Sets, Martin Erwig. Journal of Functional Programming, Vol. 8, No. 6, 627-632, 1998. A diet is basically a tree of intervals with the invariant that all of the intervals are non-overlapping and non-touching. There is a Haskell implementation in Hackage. I was hoping there would be an existing implementation for Scala, but I'm not seeing any.
The homework also included another data structure they called a Recursive Interval-Occluding Tree (RIOT), which rather than keeping only an interval at each node keeps an interval and another (possibly empty) RIOT of things removed from the interval. The assignment included benchmarks showing it did better than diets for random insertions and deletions. AFAICT it is simply something the TAs made up and never published as it no longer seems to exist anywhere on the Internets, at least not under that name.
You probably are looking for segment trees. This might be helpful: http://www.topcoder.com/tc?d1=tutorials&d2=lowestCommonAncestor&module=Static
You can also use binary search trees for the same, for which each node will have two data fields: min_val and max_val.
During insertion, you just need to call another merging operation to check whether the left child, parent, and right child form a sequence, so that they can be clubbed into a single node. This will take O(log n) time.
Other operations like deletion and look-up will take O(log n) time as usual, but special measures need to be taken during deletion.
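To make the merging and splitting concrete, here is a simplified Python sketch. Note that it swaps the tree for a sorted list of (min_val, max_val) pairs, so add and remove are O(n) in the number of ranges rather than O(log n); only membership is O(log n). The RangeSet class and its method names are made up for this example:

```python
from bisect import bisect_right


class RangeSet:
    """Integers stored as sorted, non-overlapping, non-touching [lo, hi] ranges."""

    def __init__(self):
        self.ranges = []  # list of inclusive (lo, hi) tuples, sorted by lo

    def __contains__(self, x):
        # Find the last range whose lo is <= x, if there is one.
        i = bisect_right(self.ranges, (x, float("inf"))) - 1
        return i >= 0 and self.ranges[i][1] >= x

    def add(self, lo, hi):
        kept = []
        for a, b in self.ranges:
            if b < lo - 1 or a > hi + 1:
                kept.append((a, b))              # neither overlaps nor touches
            else:
                lo, hi = min(lo, a), max(hi, b)  # coalesce into the new range
        kept.append((lo, hi))
        kept.sort()
        self.ranges = kept

    def remove(self, lo, hi):
        kept = []
        for a, b in self.ranges:
            if b < lo or a > hi:
                kept.append((a, b))              # untouched by the removal
            else:
                if a < lo:
                    kept.append((a, lo - 1))     # left remainder survives
                if b > hi:
                    kept.append((hi + 1, b))     # right remainder survives
        self.ranges = kept


s = RangeSet()
s.add(1, 3)
s.add(7, 9)
s.add(4, 6)      # touches both neighbours, so everything coalesces
merged = list(s.ranges)
s.remove(4, 5)   # splits the single range in two
```

The same touch-or-overlap test and the same left/right remainder logic carry over to a tree whose nodes hold min_val and max_val.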

Data Structure for tuple indexing

I need a data structure that stores tuples and would allow me to do a query like: given a tuple (x,y,z) of integers, find the next one (an upper bound for it). By that I mean considering the natural ordering (a,b,c)<=(d,e,f) <=> a<=d and b<=e and c<=f. I have tried MSD radix sort, which splits items into buckets and sorts them (and does this recursively for all positions in the tuples). Does anybody have any other suggestion? Ideally I would like the above query to happen within O(log n), where n is the number of tuples.
Two options.
Use binary search on a sorted array. If you build the keys (assuming 32-bit ints) with (a<<64)|(b<<32)|c and hold them in a simple array, packed one beside the other, you can use binary search to locate the value you are searching for (if using C, there is even a library function to do this, bsearch), and the next one is simply one position along. Worst-case performance is O(log N), and if you can use interpolation search (http://en.wikipedia.org/wiki/Interpolation_search) then you might even approach O(log log N).
The problem with packed binary keys is that it might be tricky to add new values, and you might need gyrations if you exceed available memory. But it is fast, with only a few random memory accesses on average.
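A small Python sketch of the packed-key idea, assuming each component is a non-negative integer below 2^32. The pack, unpack and next_after names are made up. Note that a packed key orders tuples lexicographically, which is not the same as the component-wise ordering in the question, so treat this as an illustration of the lookup only:

```python
from bisect import bisect_right

MASK = 0xFFFFFFFF  # assumes every component fits in 32 unsigned bits


def pack(a, b, c):
    """Pack three 32-bit components into a single 96-bit integer key."""
    return (a << 64) | (b << 32) | c


def unpack(key):
    return (key >> 64, (key >> 32) & MASK, key & MASK)


tuples = [(2, 0, 7), (1, 2, 3), (2, 0, 0), (1, 5, 0)]
keys = sorted(pack(*t) for t in tuples)


def next_after(t):
    """Return the smallest stored tuple strictly greater than t, or None."""
    i = bisect_right(keys, pack(*t))  # O(log n) binary search
    return unpack(keys[i]) if i < len(keys) else None
```

For example, next_after((1, 2, 3)) gives (1, 5, 0), and the query tuple itself does not need to be stored: next_after((1, 9, 9)) gives (2, 0, 0).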
Alternatively, you could build a hash table by generating a key with a|b|c in some form, and then have the hash data pointing to a structure that contains the next value, whatever that might be. Possibly a little harder to create in the first place as when generating the table you need to know the next value already.
The problems with the hash approach are that it will likely use more memory than the binary search method, and that performance is great if you don't get hash collisions but then starts to drop off, although there are variations on this algorithm to help in some cases. With the hash approach it is possibly much easier to insert new values.
I also see you had a similar question along these lines, so I guess the guts of what I am saying is: combine a, b, c to produce a single long key, and use that with binary search, a hash, or even a B-tree. If the length of the key is your problem (what language?), could you treat it as a string?
If this answer is completely off base, let me know and I will see if I can delete it, so your question remains unanswered rather than having a useless answer.

Best Data Structure to Store Large Amounts of Data with Dynamic and Non-unique Keys?

Basically, I have a large number of C structs to keep track of, that are essentially:
struct Data {
    int key;
    ... // More data
};
I need to periodically access lots (hundreds) of these, and they must be sorted from lowest to highest key values. The keys are not unique and they will be changed over the course of the program. To make matters even more interesting, the majority of the structures will be culled (based on criteria completely unrelated to the key values) from the pool right before being sorted, but I still need to keep references to them.
I've looked into using a binary search tree to store them, but the keys are not guaranteed to be unique and I'm not entirely sure how to restructure the tree once a key is changed or how to cull specific structures.
To recap in case that was unclear above, I need to:
Store a large number of structures with non-unique and dynamic keys.
Cull a large percentage of the structures (but not free them entirely because different structures are culled each time).
Sort the remaining structures from lowest to highest key value.
What data structure/algorithms would you use to solve this problem? The method needs to be as fast and/or memory efficient as possible, since this is a real-time application.
EDIT: The culling is done by iterating over all of the objects and making a decision for each one. The keys change between the culling/sorting runs. I should have stated that they don't change a lot, but they do change, and they can change multiple times between the culling/sorting runs. (If it helps, the key for each structure is actually a z-order for a Sprite. They need to be sorted before each drawing loop so the Sprites with lower z-orders are drawn first.)
Just stick 'em all in a big array.
When the time comes to do the cull and sort, start by doing the sort. Do an insertion sort. That's right - nothing clever, just an insertion sort.
After the sort, go through the sorted array, and for each object, make the culling decision, then immediately output the object if it isn't culled.
This is about as memory-efficient as it gets. It should also require very little computation: there's no bookkeeping on updates between cull/sort passes, and the sort will be cheap, because insertion sort is adaptive, and for an almost-sorted array like this it will be almost O(n). The one thing it doesn't do well is cache locality: there will be two separate passes over the array, one for the sort and one for the cull/output.
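A rough Python sketch of the sort-then-cull pass. The sprite dictionaries, the visible flag standing in for the culling decision, and the function names are all assumptions for illustration:

```python
def insertion_sort_by_key(items, key):
    """Sort items in place; close to O(n) when items is already almost sorted."""
    for i in range(1, len(items)):
        current = items[i]
        k = key(current)
        j = i - 1
        # Shift larger elements one slot right until current fits.
        while j >= 0 and key(items[j]) > k:
            items[j + 1] = items[j]
            j -= 1
        items[j + 1] = current


def draw_order(sprites):
    """Sort everything by z-order, then emit only the sprites that survive culling."""
    insertion_sort_by_key(sprites, key=lambda s: s["z"])
    return [s["name"] for s in sprites if s["visible"]]


sprites = [
    {"name": "bg", "z": 0, "visible": True},
    {"name": "tree", "z": 2, "visible": False},  # culled this frame, but kept
    {"name": "player", "z": 1, "visible": True},
    {"name": "hud", "z": 5, "visible": True},
]
order = draw_order(sprites)
```

Culled sprites stay in the array in sorted position, so the next pass starts from an almost-sorted array again.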
If you demand more cleverness, then instead of an insertion sort, you could use another adaptive, in-place sort that's faster. Timsort and smoothsort are good candidates; both are utterly fiendish to implement.
The big alternative to this is to only sort unculled objects, using a secondary, temporary list of such objects which you sort (or keep in a binary tree or whatever). But the thing is, if the keys don't change that much, then the win you get from using an adaptive sort on an almost-sorted array will (I reckon!) outweigh the win you would get from sorting a smaller dataset. It's O(n) vs O(n log n).
The general solution to this type of problem is to use a balanced search tree (e.g. AVL tree, red-black tree, B-tree), which guarantees O(log n) time (almost constant, but not quite) for insertion, deletion, and lookup, where n is the number of items currently stored in the tree. Guaranteeing no key is stored in the tree twice is quite trivial, and is done automatically by many implementations.
If you're working in C++, you could try using std::map<int, yourtype>, or std::multimap<int, yourtype> if several structs can share a key. If in C, find or implement some simple binary search tree code, and see if it's fast enough.
However, if you use such a tree and find it's too slow, you could look into some more fine-tuned approaches. One might be to put your structs in one big array, radix sort by the integer key, cull on it, then re-sort per pass. Another approach might be to use a Patricia tree.
