Efficiently Filtering out Sorted Data With A Second Predicate - algorithm

Let's say that I have a list (or a hashmap etc., whatever makes this the fastest) of objects that contain the following fields: name, time added, and time removed. The list given to me is already sorted by time removed. Now, given a time T, I want to filter out (remove from the list) all objects of the list where:
the time T is greater than an object's time removed OR T is less than an object's time added.
So after processing, the list should only contain objects where T falls in the range specified by time added and time removed.
I know I can do this easily in O(n) time by going through each individual object, but I was wondering if there was a more efficient way considering the list was already sorted by the first predicate (time removed).
*Also, I know I can easily remove all objects with time removed less than T because the list is presorted: the search is O(log n), since I can binary-search for the first element whose time removed is not less than T and then remove the first part of the list up to that object.
(Irrelevant additional info: I will be using C++ for any code that I write)

Unfortunately you are stuck with O(n) as your fastest option. That is, unless there are hidden constraints on the difference between time added and time removed (such as a maximum time span) that can be exploited.
As you said, you can start the search where the time removed equals T (or is the first value greater than T). Unfortunately you'll still need to go through the rest of the list to check each object's time added against T.
Because a comparison sort is at best O(n log n), you cannot sort the objects again (by time added) to improve your performance.
One thing: depending on the heuristics of the application, it may be beneficial to receive the data ordered by time added instead, but that is between you and wherever you get the data from.
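A minimal C++ sketch of this two-step plan (the `Obj` struct with just the two time fields is illustrative): a binary search trims the prefix whose time removed is below T in O(log n + k), then a linear pass applies the second predicate, matching the O(n) bound above.

```cpp
#include <algorithm>
#include <vector>

struct Obj { int added; int removed; };

// `objs` is sorted by `removed`. Keep only objects with added <= T <= removed.
void filter_active(std::vector<Obj>& objs, int T) {
    // Binary search: first element whose time removed is not less than T.
    auto first_kept = std::lower_bound(
        objs.begin(), objs.end(), T,
        [](const Obj& o, int t) { return o.removed < t; });
    objs.erase(objs.begin(), first_kept);
    // Linear pass for the second predicate: drop objects added after T.
    objs.erase(std::remove_if(objs.begin(), objs.end(),
                              [T](const Obj& o) { return o.added > T; }),
               objs.end());
}
```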

Let's examine the data structures you offered:
A list (usually implemented as a linked list or a dynamic array), or a hash map.
Linked list: cannot do binary search; finding the first occurrence of an element (even if the list is sorted) takes O(n), so there is no benefit from the fact that the data is sorted.
Dynamic array: removing a single element (or more) from an arbitrary location requires shifting all the following elements to the left, and is thus O(n). You cannot remove elements better than O(n), so no gain here from the fact that the DS is sorted.
HashMap: unsorted by definition. Also, removing k elements is O(k); there is no way around this.
So you cannot even improve performance from O(n) to O(log n) for the same field the list was sorted by.
Some data structures, such as B+ trees, do allow efficient range queries, and you can remove a range of elements from the tree fairly efficiently, in O(log n).
However, that does not help you filter on the 2nd field, which the tree is not sorted by; filtering on it (unless there is some correlation you can exploit) will still need O(n) time.
If all you are going to do is iterate over the new list later on, you can push the evaluation to the iteration step, but there won't be any real benefit from it - only delaying the processing until it's needed, and avoiding it entirely if it turns out not to be needed.

Related

Data structure / algorithms for getting best and worst-scoring object from set

My algorithm runs a loop where a set of objects is maintained. In each iteration there are objects being added and removed from the set. Also, there are some "measures" (integer values, possibly several of them) for each object, which can change at any time. From those measures, a score can be calculated based on the measures and the iteration number.
Whenever the number of objects passes a certain threshold, I want to identify and remove the lowest-scoring objects until the number of objects is again below that threshold. That is: if there are n objects with threshold t, if n>t then remove the n-t lowest-scoring objects.
But also, periodically I want to get the highest-scoring objects as well.
I'm really at a loss as to what data structure I should use here to do this efficiently. A priority queue doesn't really work as measures are changed all the time and anyway the "score" I want to use can be any arbitrarily complex function of those measures and the current iteration number. The obvious approach is probably a hash-table storing associations object -> measures, with amortized O(1) add/remove/update operations, but then finding the lowest or highest scoring objects would be O(n) in the number of elements. n can be easily in the millions after a short while so this isn't ideal. Is this the best I can do?
I realise this probably isn't very trivial but I'd like to know if anyone has any suggestions as to how this could be best implemented.
PS: The language is OCaml but it really doesn't matter.
For this level of generality the best would be to have something for quick access to the measures (storing them in object or via a pointer would be best, but a hash-table would also work) and having an additional data-structure for keeping an ordered view of your objects.
Every time you update the measures you would want to refresh the score and update the ordered data-structure. Something like a balanced BST would work well (RB-tree, AVL) and would guarantee O(log N) update complexity.
You can also keep a min-max heap instead of the BST. This has the advantage of using fewer pointers, which should lower the overhead of the solution. Complexity remains O(log N) per update.
You've mentioned that the score depends on iteration number. This is bad for performance because it requires all entries to be updated every iteration. However, if you can isolate the impact (say the score is g(all_metrics) - f(iteration_number)) so that all elements are impacted the same then the relative order should remain consistent and you can skip updating the score every iteration.
If it's not constant, but it's still isolated (something like f(iteration_number, important_time)), you can use the balanced BST, calculate when the iteration will swap each element with one of its neighbours, keep the swap times in a heap, and only update the elements that would swap.
If it's not isolated at all, then you would need to update all the elements at each iteration, so you might as well keep track of the highest and lowest values as you go through them to recompute the scores. This will have a complexity of O(N log K), where K is the number of lowest values you want to remove (hopefully K is very small, so it should behave almost like O(N)).
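A C++ sketch of the hash-map-plus-ordered-view combination described above (class and method names are illustrative, and object ids are plain ints): a `std::multiset` plays the role of the balanced BST, so an update is erase-plus-reinsert at O(log n).

```cpp
#include <cstddef>
#include <set>
#include <unordered_map>
#include <utility>

class ScoreBoard {
    std::multiset<std::pair<int, int>> by_score_;  // (score, id), ordered view
    std::unordered_map<int, int> score_of_;        // id -> current score
public:
    // Insert or re-score an object: O(log n).
    void update(int id, int score) {
        auto it = score_of_.find(id);
        if (it != score_of_.end())
            by_score_.erase(by_score_.find({it->second, id}));
        score_of_[id] = score;
        by_score_.insert({score, id});
    }
    int lowest_id() const { return by_score_.begin()->second; }
    int highest_id() const { return by_score_.rbegin()->second; }
    // Evict the lowest-scoring object: O(log n).
    void remove_lowest() {
        auto it = by_score_.begin();
        score_of_.erase(it->second);
        by_score_.erase(it);
    }
    std::size_t size() const { return by_score_.size(); }
};
```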

Finding proper data structure c++

I was looking for some simple, already implemented data structure that fulfills my needs in the least possible time (in the worst case):
(1) To pop the nth element (I have to keep the relative order of elements intact)
(2) To access the nth element.
I couldn't use an array because it can't pop, and I don't want to have a gap after deleting the ith element. I tried to remove the gap by exchanging the nth element with the next, again with the next, until the last, but that proved time-inefficient, though the array's O(1) access is unbeatable.
I tried using a vector, with 'erase' for popping and '.at()' for access, but even this is not cheap time-wise, though it's better than the array.
What you can try is a skip list (in its indexable variant) - it supports the operations you are requesting in O(log(n)). Another option would be a tiered vector, which is just slightly easier to implement and takes O(sqrt(n)). Both structures are quite cool, but alas not very popular.
Well, a tiered vector implemented on an array would, I think, best fit your purpose. The tiered vector concept may be new and a little tricky to understand at first, but once you get it, it opens up a lot of possibilities and you get a handy weapon to tackle the data-structure part of many problems very efficiently. So it is recommended that you master the tiered vector's implementation.
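A minimal tiered-vector sketch in C++ (fixed block size, `int` elements, rebalancing on insert omitted for brevity): elements live in fixed-size blocks, so access is O(1), and erase costs O(B + n/B), which is O(sqrt(n)) when the block size B is about sqrt(n).

```cpp
#include <cstddef>
#include <deque>
#include <vector>

class TieredVector {
    std::size_t block_size_;
    std::vector<std::deque<int>> blocks_;  // all blocks full except the last
    std::size_t size_ = 0;
public:
    explicit TieredVector(std::size_t block_size) : block_size_(block_size) {}

    void push_back(int v) {
        if (blocks_.empty() || blocks_.back().size() == block_size_)
            blocks_.emplace_back();
        blocks_.back().push_back(v);
        ++size_;
    }

    // O(1): all blocks except the last are kept full, so indexing is direct.
    int at(std::size_t i) const {
        return blocks_[i / block_size_][i % block_size_];
    }

    // Remove the i-th element, closing the gap by rotating one element
    // from the front of each later block into the previous one (O(1) per
    // block thanks to deque): O(B + n/B) total.
    void erase(std::size_t i) {
        std::size_t b = i / block_size_;
        blocks_[b].erase(blocks_[b].begin() + (i % block_size_));
        for (std::size_t j = b + 1; j < blocks_.size(); ++j) {
            blocks_[j - 1].push_back(blocks_[j].front());
            blocks_[j].pop_front();
        }
        if (blocks_.back().empty()) blocks_.pop_back();
        --size_;
    }

    std::size_t size() const { return size_; }
};
```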
An array will give you O(1) lookup but O(n) delete of the element.
A list will give you O(n) lookup but O(1) delete of the element.
A binary search tree will give you O(log n) lookup and O(log n) delete. But it doesn't preserve the relative order.
A binary search tree used in conjunction with the list will give you the best of both worlds. Insert a node into both the list (to preserve order) and the tree (fast lookup). Once the tree lookup has found the node, unlinking it from the list is O(1).
struct node {
node* list_next;
node* list_prev;
node* tree_right;
node* tree_left;
// node data;
};
Note that if the nodes are inserted into the tree using the index as the sort value, you will end up with another linked list pretending to be a tree. The tree can be balanced however in O(n) time once it is built which you would only have to incur once.
Update
Thinking about this more this might not be the best approach for you. I'm used to doing lookups on the data itself not its relative position in a set. This is a data centric approach. Using the index as the sort value will break as soon as you remove a node since the "higher" indices will need to change.
Warning: Don't take this answer seriously.
In theory, you can do both in O(1), assuming these are the only operations you want to optimize for. The following solution will need lots of space (and it will leak space), and it will take long to create the data structure:
Use an array. In every entry of the array, point to another array which is the same, but with that entry removed.

Should you sort a list when getting or setting it?

A decision I often run into is when to sort a list of items: keep the list sorted at all times as items are added, or sort it when the list is accessed.
Is there a best practice for better performance, or is it just the matter of saying: if the list is mostly accessed, sort it when it is changed or vice versa.
Sorting the list at every access is a bad idea. Instead, have a flag which you set when the collection is modified. Only if this flag is set do you need to sort, and then reset the flag.
But the best is if you have a data structure which is per definition always sorted. That means, if you insert a new element, the element is automatically inserted at the right index, thus keeping the collection sorted.
I don't know which platform / framework you are using. I know .NET provides a SortedList class which manages that kind of insertion-sort algorithm for you.
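In C++, the dirty-flag idea from the first answer can be sketched like this (the class and method names are illustrative, not from any library):

```cpp
#include <algorithm>
#include <vector>

// Sort lazily: only re-sort when something changed since the last access.
class LazySortedList {
    std::vector<int> items_;
    bool dirty_ = false;
public:
    void add(int v) {
        items_.push_back(v);
        dirty_ = true;  // mark the collection as modified
    }

    const std::vector<int>& sorted() {
        if (dirty_) {
            std::sort(items_.begin(), items_.end());
            dirty_ = false;
        }
        return items_;
    }
};
```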
The answer is a big depends. You should profile and apply a strategy that is best for your case.
If you want performance on access/finding elements a good decision will be to maintain the list sorted using InsertionSort (http://en.wikipedia.org/wiki/Insertion_sort).
Sorting the list on access may be an option only in some very particular scenarios: when there are many insertions, few accesses, and performance is not very important.
But there are many other options, like maintaining a flag that says "list is sorted" and sorting at every n-th insertion, on idle, or on access (if you need to).
I'm used to think in this way:
If the list is filled all at once and only after this is read, then add elements in non-sorted order and sort it just at the end of filling (in complexity terms it requires O(n log n) plus the complexity of filling, and that's usually faster than sorting while adding elements)
Conversely, if the list needs to be read before it is completely filled, then you have to add elements in sorted order (maybe using some special data structure doing the work for you, like sortedlist, red-black tree etc.)

Inserting items in a list that is frequently insertion sorted

I have a list that is frequently insertion sorted. Is there a good position (other than the end) for adding to this list to minimize the work that the insertion sort has to do?
The best place to insert would be where the element belongs in the sorted list. This would be similar to preemptively insertion sorting.
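In C++, for example, this preemptive sorted insertion is a one-liner with `std::upper_bound` (shown for a vector of ints):

```cpp
#include <algorithm>
#include <vector>

// Insert `value` directly at its sorted position: O(log n) to find the
// spot with binary search, O(n) worst case to shift elements, but the
// container never needs a full re-sort afterwards.
void insert_sorted(std::vector<int>& v, int value) {
    v.insert(std::upper_bound(v.begin(), v.end(), value), value);
}
```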
Your question doesn't quite make sense: if the list is insertion sorted, you can't choose to append to the end by definition; the element will always end up in the place where it belongs, otherwise the list wouldn't be sorted.
If you have to add lots of elements, then the best solution is to clone the list, add all elements, sort the new list once and then replace the first list with the clone.
[EDIT] In reply to your comments: After doing a couple of appends, you must sort the list before you can do the next sorted insertion. So the question isn't how to make the sorted insertion cheaper, but how to make the sort between appends and sorted insertions cheaper.
The answer is that most sorting algorithms do pretty good with partially sorted lists. The questions you need to ask are: What sorting algorithm is used, what properties does it have and, most importantly, why should you care.
The last question means that you should measure performance before you do any kind of optimization because you have a 90% chance that it will hurt more than it helps unless it's based on actual numbers.
Back to the sorting. Java uses a version of quicksort to sort arrays of primitives (object collections use a merge sort variant, but the reasoning below is about quicksort). Quicksort will select a pivot element to partition the collection. This selection is crucial for the performance of the algorithm. For best performance, the pivot element should be as close to the element in the middle of the result as possible. Usually, quicksort uses an element from the middle of the current partition as the pivot. Also, quicksort will start processing the list from the small indexes.
So adding the new elements at the end might not give you good performance. It won't affect the pivot element selection but quicksort will look at the new elements after it has checked all the sorted elements already. Adding the new elements in the middle will affect the pivot selection and we can't really tell whether that will have an influence on the performance or not. My instinctive guess is that the pivot element will be better if quicksort finds sorted elements in the middle of the partitions.
That leaves adding new elements at the beginning. This way, quicksort will usually find a perfect pivot element (since the middle of the list will be sorted) and it will pick up the new elements first. The drawback is that you must copy the whole array for every insert. There are two ways to avoid that: a) As I said elsewhere, today's PCs copy huge amounts of RAM in almost no time at all, so you can just ignore this small performance hit. b) You can use a second ArrayList, put all the new elements in it and then use addAll(). Java will do some optimizations internally for this case and just move the existing elements once.
[EDIT2] I completely misunderstood your question. For the insertion sort algorithm, the best place is probably somewhere in the middle. This should halve the chance that you have to move an element through the whole list. But since I'm not 100% sure, I suggest creating a couple of small tests to verify this.

What sort of sorted datastructure is optimized for finding items within a range?

Say I have a bunch of objects with dates and I regularly want to find all the objects that fall between two arbitrary dates. What sort of datastructure would be good for this?
A binary search tree sounds like what you're looking for.
You can use it to find all the objects in O(log(N) + K), where N is the total number of objects and K is the number of objects that are actually in that range. (provided that it's balanced). Insertion/removal is O(log(N)).
Most languages have a built-in implementation of this.
C++:
http://www.cplusplus.com/reference/stl/set/
Java:
http://java.sun.com/j2se/1.4.2/docs/api/java/util/TreeSet.html
You can find the lower bound of the range (in log(n)) and then iterate from there until you reach the upper bound.
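For example, with `std::multiset` in C++ (dates reduced to ints for illustration), the range query is exactly that lower-bound-then-iterate pattern:

```cpp
#include <set>
#include <vector>

// Range query on a std::multiset (a balanced BST underneath): O(log n)
// to find the start of the range, then O(k) to walk its k elements.
std::vector<int> in_range(const std::multiset<int>& dates, int lo, int hi) {
    std::vector<int> out;
    for (auto it = dates.lower_bound(lo); it != dates.end() && *it <= hi; ++it)
        out.push_back(*it);
    return out;
}
```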
Assuming that by sorted you mean sorted by date, an array will do it.
Do a binary search to find the index that's >= the start date. You can then either do another search to find the index that's <= the end date, leaving you with an offset & count of items, or, if you're going to process them anyway, just iterate through the list until you exceed the end date.
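A sketch of this two-binary-search approach on a sorted `std::vector` (dates reduced to ints for illustration); the matching objects are the contiguous slice between the two bounds:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Count the elements in [start, end] of a sorted vector: two O(log n)
// binary searches bound the contiguous matching slice.
std::size_t count_in_range(const std::vector<int>& dates, int start, int end) {
    auto first = std::lower_bound(dates.begin(), dates.end(), start);
    auto last  = std::upper_bound(dates.begin(), dates.end(), end);
    return static_cast<std::size_t>(last - first);
}
```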
It's hard to give a good answer without a little more detail.
What kind of performance do you need?
If linear is fine then I would just use a list of dates and iterate through the list collecting all dates that fall within the range. As Andrew Grant suggested.
Do you have duplicates in the list?
If you need to have repeated dates in your collection then most implementations of a binary tree would probably be out. Something like Java's TreeSet is a set implementation and doesn't allow repeated elements.
What are the access characteristics? Lots of lookups with few updates, vice-versa, or fairly even?
Most datastructures have trade-offs between lookups and updates. If you're doing lots of updates then a datastructure that is optimized for lookups won't be so great.
So what are the access characteristics of the data structure, what kind of performance do you need, and what are structural characteristics that it must support (e.g. must allow repeated elements)?
If you need to make random-access modifications: a tree, as in v3's answer. Find the bottom of the range by lookup, then count upwards. Inserting or deleting a node is O(log N). stbuton makes a good point that if you want to allow duplicates (as seems plausible for datestamped events), then you don't want a tree-based set.
If you do not need to make random-access modifications: a sorted array (or vector or whatever). Find the location of the start of the range by binary chop, then count upwards. Inserting or deleting is O(N) in the middle. Duplicates are easy.
Algorithmic performance of lookups is the same in both cases, O(M + log N), where M is the size of the range. But the array uses less memory per entry, and might be faster to count through the range, because after the binary chop it's just forward sequential memory access rather than following pointers.
In both cases you can arrange for insertion at the end to be (amortised) O(1). For the tree, keep a record of the end element at the head, and you get an O(1) bound. For the array, grow it exponentially and you get amortised O(1). This is useful if the changes you make are always or almost-always "add a new event with the current time", since time is (you'd hope) a non-decreasing quantity. If you're using system time then of course you'd have to check, to avoid accidents when the clock resets backwards.
Alternative answer: an SQL table, and let the database optimise how it wants. And Google's BigTable structure is specifically designed to make queries fast, by ensuring that the result of any query is always a consecutive sequence from a pre-prepared index :-)
You want a structure that keeps your objects sorted by date whenever you insert or remove one, and where finding the boundary of the segment of all objects later than or earlier than a given date is easy.
A sorted array fits that description. (A binary heap won't, despite also being stored in an array: its array representation is only heap-ordered, not fully sorted, so it cannot answer range queries.) Finding where a new object belongs takes O(log n) with binary search, although the insertion or deletion itself costs O(n) because of the shifting.
When you have to find all the objects between date A (excluded) and B (included), find the position of A (or rather its insert position, that is, the position of the earliest element later than A) and the insert position of B, and return all the objects between those positions, which is simply the contiguous section of the array between them.
