Linear Hashing: Fast Method for finding 'min' and 'max' - c++11

I implemented a min() and max() function for my linear hash table, but I have a small performance issue because I implemented it in a straightforward way:
I just assume that the first element found is the min/max and then compare the rest of the elements against it. Is there a faster way
to find the min/max in a hash table? I couldn't find anything in my books.
My second approach was to write every value into an array and then search through that, but I don't think this is faster.

A generic hash table is not a sorted list of elements. So it's going to be an O(n) operation to find the min() and max() of a given table.

In agreement with Bill Lynch, we'll have to spend O(n) time at least once; however, we can be as lazy as possible:
Store min and max alongside the table.
As long as the user is inserting or updating elements, keep min and max up to date.
Upon deletion, check whether the deleted item is the current min or max; if so, set a flag indicating that min or max has changed. If the flag is already set, there is no need to check against the current min/max.
Upon a lookup of min/max, check the flag. If it has not been set, you can use your cached value of min or max. If it has been set, spend O(n) time recomputing them.
Now we will only spend O(n) time when we need to, versus O(n) time every single time someone requests min or max. If there are many more insertions than deletions, you will in general see O(1) lookup time.
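A minimal sketch of this caching scheme in C++11, assuming the table stores ints and is wrapped in a small class (the name CachedMinMaxTable and the use of std::unordered_set are purely illustrative):

#include <unordered_set>
#include <algorithm>
#include <stdexcept>

// Hypothetical wrapper: caches min/max and recomputes lazily after a deletion.
class CachedMinMaxTable {
    std::unordered_set<int> table;
    int min_val = 0, max_val = 0;
    bool dirty = true;                // true => cached min/max may be stale

public:
    void insert(int v) {
        table.insert(v);
        if (!dirty) {                 // keep the cache up to date on insertion
            min_val = std::min(min_val, v);
            max_val = std::max(max_val, v);
        } else if (table.size() == 1) {
            min_val = max_val = v;
            dirty = false;
        }
    }
    void erase(int v) {
        if (table.erase(v) && !dirty && (v == min_val || v == max_val))
            dirty = true;             // the extremum might be gone; recompute lazily
    }
    int min() { refresh(); return min_val; }
    int max() { refresh(); return max_val; }

private:
    void refresh() {
        if (!dirty) return;
        if (table.empty()) throw std::runtime_error("empty table");
        auto mm = std::minmax_element(table.begin(), table.end()); // O(n), only when needed
        min_val = *mm.first;
        max_val = *mm.second;
        dirty = false;
    }
};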

Related

Data structure / algorithms for getting best and worst-scoring object from set

My algorithm runs a loop in which a set of objects is maintained. In each iteration there are objects being added to and removed from the set. Also, there are some "measures" (integer values, possibly several of them) for each object, which can change at any time. From those measures and the iteration number, a score can be calculated.
Whenever the number of objects passes a certain threshold, I want to identify and remove the lowest-scoring objects until the number of objects is again below that threshold. That is, with n objects and threshold t, if n > t then remove the n-t lowest-scoring objects.
But also, periodically I want to get the highest-scoring objects.
I'm really at a loss as to what data structure I should use here to do this efficiently. A priority queue doesn't really work, as the measures are changed all the time, and anyway the "score" I want to use can be an arbitrarily complex function of those measures and the current iteration number. The obvious approach is probably a hash table storing associations object -> measures, with amortized O(1) add/remove/update operations, but then finding the lowest- or highest-scoring objects would be O(n) in the number of elements. n can easily be in the millions after a short while, so this isn't ideal. Is this the best I can do?
I realise this probably isn't trivial, but I'd like to know if anyone has any suggestions as to how this could best be implemented.
PS: The language is OCaml but it really doesn't matter.
For this level of generality, the best would be to have something for quick access to the measures (storing them in the object or via a pointer would be best, but a hash table would also work) and an additional data structure keeping an ordered view of your objects.
Every time you update the measures, you would refresh the score and update the ordered data structure. Something like a balanced BST (RB-tree, AVL) would work well and would guarantee O(log N) update complexity.
You can also keep a min-max heap instead of the BST. This has the advantage of using fewer pointers, which should lower the overhead of the solution. Complexity remains O(log N) per update.
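A rough C++ sketch of that layout (the names and the Id/Score types are placeholders; std::multiset plays the role of the balanced BST, since it is typically a red-black tree):

#include <cstddef>
#include <cstdint>
#include <iterator>
#include <set>
#include <unordered_map>
#include <utility>

using Id = std::uint64_t;
using Score = long long;

// Hypothetical tracker: hash map for O(1) access to the current score,
// std::multiset for the ordered view with O(log N) updates.
struct ScoreTracker {
    std::unordered_map<Id, Score> scores;
    std::multiset<std::pair<Score, Id>> ordered;

    void upsert(Id id, Score s) {                 // insert or update, O(log N)
        auto it = scores.find(id);
        if (it != scores.end()) {
            ordered.erase(ordered.find(std::make_pair(it->second, id)));
            it->second = s;
        } else {
            scores.emplace(id, s);
        }
        ordered.insert(std::make_pair(s, id));
    }

    void remove_lowest(std::size_t k) {           // drop the k lowest-scoring objects
        while (k > 0 && !ordered.empty()) {
            auto it = ordered.begin();
            scores.erase(it->second);
            ordered.erase(it);
            --k;
        }
    }

    Id highest() const {                          // assumes the set is non-empty
        return std::prev(ordered.end())->second;
    }
};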
You've mentioned that the score depends on the iteration number. This is bad for performance because it requires all entries to be updated every iteration. However, if you can isolate the impact (say the score is g(all_metrics) - f(iteration_number)) so that all elements are impacted the same, then the relative order remains consistent and you can skip updating the scores every iteration.
If it's not constant but it's still isolated (something like f(iteration_number, important_time)), you can use the balanced BST, calculate at which iteration each element will swap with one of its neighbours, keep those swap times in a heap, and only update the elements that would swap.
If it's not isolated at all, then you need to update all the elements at each iteration anyway, so you might as well keep track of the highest and lowest values as you go through them to recompute the scores. This will have a complexity of O(N log K), where K is the number of lowest values you want to remove (hopefully K is very small, so it should behave almost like O(N)).

Redis Sorted Set Member Size and Performance

Redis Sorted Sets primarily sort based on a Score; however, in cases where multiple members share the same Score, lexicographical (Alpha) sorting is used. The Redis zadd documentation indicates that the function complexity is:
"O(log(N)) where N is the number of elements in the sorted set"
I have to assume this remains true regardless of the member size/length; however, I have a case where there are only 4 scores, resulting in members being sorted lexicographically after the Score.
I want to prepend a time-based key to each member so that the secondary sort is time based, and also to add some uniqueness to the members. Something like:
"time-based-key:member-string"
My member-string can be a larger JavaScript object literal, like so:
JSON.stringify( {/* object literal */} )
Will the performance of zadd and the other sorted set functionality remain constant?
If not, by what magnitude will performance be affected?
The complexity comes from the number of elements that need to be tested (compared against the new element) to find the correct insertion point (presumably using a binary search algorithm).
It says nothing about how long each test will take, because that's considered a constant factor (in the sense that it doesn't vary when you add more items).
The amount of data that needs to be compared before determining whether a new element should go before or after an existing one will affect the total clock time, but it will do so equally for each comparison.
So your overall clock time for an insert will be quickest when comparing scores only, and progressively slower the deeper into a pair of strings it has to look to determine their lexical order. This won't be any particular magnitude, though, just a concrete number of microseconds multiplied by the log(n) complexity factor.
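To make that constant factor concrete, here is a hedged C++ sketch (the member strings are invented): with a unique time-based prefix, a byte-wise lexical comparison resolves within the first few characters, whereas members that differ only deep inside the JSON force the comparison to walk the long common prefix first.

#include <iostream>
#include <string>

int main() {
    // Hypothetical members of the form "time-based-key:member-string".
    std::string a = "1700000001:{\"user\":\"alice\",\"data\":\"...\"}";
    std::string b = "1700000002:{\"user\":\"alice\",\"data\":\"...\"}";
    // A byte-wise lexical comparison stops at the first differing character,
    // so these two resolve inside the ten-character time prefix.
    std::cout << (a < b) << "\n";

    // Same prefix: now the comparison has to walk the long shared JSON part
    // before it finds a difference, which is the per-comparison constant
    // factor described above.
    std::string c = "1700000001:{\"user\":\"alice\",\"data\":\"x\"}";
    std::cout << (a < c) << "\n";
    return 0;
}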

Efficiently Filtering out Sorted Data With A Second Predicate

Let's say that I have a list (or a hash map, etc., whatever makes this the fastest) of objects that contain the following fields: name, time added, and time removed. The list given to me is already sorted by time removed. Now, given a time T, I want to filter out (remove from the list) all objects of the list where:
the time T is greater than an object's time removed, OR T is less than an object's time added.
So after processing, the list should only contain objects where T falls in the range specified by time added and time removed.
I know I can do this easily in O(n) time by going through each object individually, but I was wondering whether there is a more efficient way, considering that the list is already sorted by the first predicate (time removed).
*Also, I know I can easily remove all objects with time removed less than T because the list is presorted (possibly in O(log n) time, since I can do a binary search to find the first such element and then remove the first part of the list up to that object).
(Irrelevant additional info: I will be using C++ for any code that I write)
Unfortunately you are stuck with O(n) as your fastest option, unless there are hidden requirements about the difference between time added and time removed (such as a maximum time span) that can be exploited.
As you said, you can start the search at the first element whose time removed equals (or is the first greater than) T. Unfortunately you'll need to go through the rest of the list to check whether time added is less than your time.
Because a comparison sort is at best O(n*log(n)), you cannot sort the objects again to improve your performance.
One thing: based on the heuristics of the application, it may be beneficial to receive the data in order of date added, but that is between you and wherever you get the data from.
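A short C++ sketch of that two-step filter on a vector sorted by time removed (the Obj fields follow the question; the function name and the choice of long long timestamps are just for illustration):

#include <algorithm>
#include <string>
#include <vector>

struct Obj {
    std::string name;
    long long time_added;
    long long time_removed;
};

// Keeps only objects with time_added <= T <= time_removed.
// 'objs' must already be sorted ascending by time_removed.
void filter_alive_at(std::vector<Obj>& objs, long long T) {
    // Step 1: O(log n) search for the first object with time_removed >= T;
    // the erase of the prefix still shifts elements, so it is O(n) work.
    auto first_kept = std::lower_bound(objs.begin(), objs.end(), T,
        [](const Obj& o, long long t) { return o.time_removed < t; });
    objs.erase(objs.begin(), first_kept);

    // Step 2: the second predicate is not ordered, so this pass stays O(n).
    objs.erase(std::remove_if(objs.begin(), objs.end(),
        [T](const Obj& o) { return o.time_added > T; }), objs.end());
}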
Let's examine the data structures you offered:
A list (usually implemented as a linked list, or a dynamic array), or a hash map.
Linked list: you cannot do binary search, and finding the first occurrence of an element (even if the list is sorted) takes O(n), so there is no benefit from the fact that the data is sorted.
Dynamic array: removing a single element (or more) from an arbitrary location requires shifting all the following elements to the left, and is thus O(n). You cannot remove elements from the list in better than O(n), so there is no gain here from the fact that the structure is sorted.
Hash map: unsorted by definition. Also, removing k elements is O(k); there is no way around this.
So you cannot even improve performance from O(n) to O(log n) for the same field the list was sorted by.
Some data structures, such as B+ trees, do allow efficient range queries, and you can remove a range of elements from the tree pretty efficiently [O(log n)].
However, that does not help you filter on the second field, which the tree is not sorted by; filtering according to it (unless there is some correlation you can exploit) will still need O(n) time.
If all you are going to do is iterate over the new list later, you can push the evaluation to the iteration step, but there won't be any real benefit from it: you only delay the processing until it's needed, and avoid it if it is not needed.
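A hedged sketch of what pushing the work to the iteration step could look like in C++ (nothing is copied or erased; Obj is as in the earlier sketch):

#include <vector>

// Visit only the objects alive at time T; nothing is copied or erased,
// and the O(n) predicate work happens only if and when the iteration runs.
template <typename Obj, typename Visitor>
void for_each_alive_at(const std::vector<Obj>& objs, long long T, Visitor visit) {
    for (const Obj& o : objs)
        if (o.time_added <= T && T <= o.time_removed)
            visit(o);
}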

Data structure that allows accessing elements by index and delete them in O(1)

I have the following task (as part of a bigger task):
I need to take the k-th element from an array-like data structure and delete it (k can be any index). An array has O(n) deletion, and a list has O(n) search. I would like to do both operations in O(1) time.
Which data structure should I use to meet this requirement?
Clarification:
Deleting the element at index(5) will move the element from index(6) to index(5).
This particular task is the TopCoder SRM 300 Div2 500-point problem. It does not require such a sophisticated data structure (simple Java methods will do the job, since the maximum data size is really small), but I am curious how to deal with a much bigger problem using C-like thinking about data.
So maybe I am sticking too much to an array for this problem? But I will analyze it and edit the question later, after work (if you are really curious, you can see the task on TopCoder).
I believe what you're asking for is impossible.
However, if you can relax your requirement for indexing to O(log n), then ropes may be able to satisfy it, although I'm not sure if they have a probabilistic or deterministic guarantee (I think it's probabilistic).
Given the nature of the "dating" problem as given, it involves continuously choosing and removing the "best" member of a set: a classic priority queue. In fact, you'll need to build two of those (for men and women). You'll either have to build them in O(N log N) time (a sorted list) for O(1) removal, or else build them in linear time (a heap) for O(log N) removal. Overall you get O(N log N) either way, since you'll be removing all of one queue and most of the other.
So then the question is what structure supports the other part of the task: choosing the "chooser" from the circle and removing him and his choice. Since this too must be done N times, any method that accomplishes the removal in O(log N) time won't increase the overall complexity of your algorithm. You can't get O(1) indexed access with fast deletions, given the re-indexing requirement. But you can in fact get O(log N) for both indexed access and deletion with a tree (something like a rope, as mentioned). This will give you O(N log N) overall, which is the best you can do anyway.
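As one hedged illustration of O(log N) indexed access plus deletion, GCC ships a policy-based order-statistics tree as a non-standard extension; inserting the original positions 0..N-1 and repeatedly asking for the k-th remaining one behaves like an array whose indices re-pack themselves after every deletion:

// GCC-specific extension (__gnu_pbds); other compilers would need a hand-rolled
// balanced BST augmented with subtree sizes.
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>
#include <functional>
#include <iostream>

typedef __gnu_pbds::tree<int, __gnu_pbds::null_type, std::less<int>,
                         __gnu_pbds::rb_tree_tag,
                         __gnu_pbds::tree_order_statistics_node_update> indexed_set;

int main() {
    indexed_set circle;
    for (int i = 0; i < 10; ++i) circle.insert(i);    // original positions 0..9

    // "Remove whoever currently stands at index 4": O(log N) per operation.
    int original_position = *circle.find_by_order(4);
    circle.erase(original_position);
    std::cout << "removed original position " << original_position << "\n";

    // Indices re-pack automatically: index 4 now refers to original position 5.
    std::cout << *circle.find_by_order(4) << "\n";
    return 0;
}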
There is a solution that may be satisfying in some cases. You use an array plus a vector for recording deletions. Every time you delete an element, you put its index into the vector. Every time you read an element at some index, you recalculate that index based on the previous deletions.
Say, you have an array of:
A = [3, 7, 6, 4, 3]
You delete the 3rd element:
A = [3, 7, 6, 4, 3] (no actual deletion)
d = [3]
And then read the 4th:
i = 4
deleted index 3 <= 4, so i += 1 (i becomes 5)
A[5] = 3
This is not exactly O(1), but it does not depend on the array length, only on the number of deleted elements.
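A small C++ sketch of this bookkeeping, using 0-based indices (the class name and int element type are just for illustration):

#include <algorithm>
#include <utility>
#include <vector>

// Array with "logical" deletions recorded in a side vector, as described above.
// Reads and deletes cost O(d), where d is the number of deletions so far.
class ArrayWithDeletions {
    std::vector<int> data;
    std::vector<int> deleted;                 // physical indices, kept sorted
public:
    explicit ArrayWithDeletions(std::vector<int> v) : data(std::move(v)) {}

    int physical(int logical) const {         // shift past every earlier deletion
        int i = logical;
        for (int d : deleted) {
            if (d <= i) ++i;
            else break;                       // deleted is sorted, so we can stop
        }
        return i;
    }

    int read(int logical) const { return data[physical(logical)]; }

    void erase(int logical) {                 // no real deletion, just log the index
        int p = physical(logical);
        deleted.insert(std::upper_bound(deleted.begin(), deleted.end(), p), p);
    }

    int size() const { return (int)data.size() - (int)deleted.size(); }
};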
The only data structure that has a small overhead for adding and removing elements is a hash table. The only overhead is the cost of the hash function (which is considered O(1) if you take a purely theoretical approach).
But if you want it to be extremely efficient, you will need to:
Have an estimate of the number of elements you will put into your data structure (and allocate that capacity once, up front).
Choose a hash function that avoids collisions given the way your keys are distributed (collisions ruin the efficiency of hash tables).
If you manage to get everything right, then you should be optimal.
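In C++ terms, those two points roughly correspond to reserving bucket capacity up front and supplying your own hash when the default one distributes your keys poorly; a hedged sketch (the expected_count value and the MixHash functor are invented for illustration):

#include <cstddef>
#include <cstdint>
#include <unordered_set>

struct MixHash {                              // example bit-mixing hash for 64-bit
    std::size_t operator()(std::uint64_t x) const {   // keys with clustered low bits
        x ^= x >> 33;
        x *= 0xff51afd7ed558ccdULL;
        x ^= x >> 33;
        return (std::size_t)x;
    }
};

int main() {
    const std::size_t expected_count = 1000000;       // rough estimate, known up front
    std::unordered_set<std::uint64_t, MixHash> s;
    s.reserve(expected_count);                // buckets allocated once, no rehashing later
    for (std::uint64_t i = 0; i < expected_count; ++i)
        s.insert(i * 1000003ULL);
    return 0;
}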

What sort of sorted datastructure is optimized for finding items within a range?

Say I have a bunch of objects with dates and I regularly want to find all the objects that fall between two arbitrary dates. What sort of data structure would be good for this?
A binary search tree sounds like what you're looking for.
You can use it to find all the objects in O(log(N) + K), where N is the total number of objects and K is the number of objects that are actually in that range (provided that the tree is balanced). Insertion/removal is O(log(N)).
Most languages have a built-in implementation of this.
C++:
http://www.cplusplus.com/reference/stl/set/
Java:
http://java.sun.com/j2se/1.4.2/docs/api/java/util/TreeSet.html
You can find the lower bound of the range (in log(n)) and then iterate from there until you reach the upper bound.
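A brief C++ sketch of exactly that, using std::multiset so duplicate dates are allowed (the std::time_t values are made up): one O(log N) lookup for the lower bound, then an O(K) walk up to the upper bound.

#include <ctime>
#include <set>
#include <vector>

int main() {
    std::multiset<std::time_t> dates = {100, 200, 200, 300, 400};

    std::time_t from = 150, to = 300;                   // arbitrary range [from, to]
    std::vector<std::time_t> hits;
    auto last = dates.upper_bound(to);                  // one past the range
    for (auto it = dates.lower_bound(from); it != last; ++it)
        hits.push_back(*it);                            // collects 200, 200, 300
    return 0;
}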
Assuming you mean sorted by date, an array will do it.
Do a binary search to find the index of the first element that's >= the start date. You can then either do another search to find the index of the last element that's <= the end date, leaving you with an offset and count of items, or, if you're going to process them anyway, just iterate through the list until you exceed the end date.
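The same idea on a date-sorted vector, again as a hedged sketch: two binary chops yield the offset and count described above.

#include <algorithm>
#include <ctime>
#include <vector>

int main() {
    std::vector<std::time_t> dates = {100, 200, 200, 300, 400};      // sorted by date

    std::time_t from = 150, to = 300;
    auto first = std::lower_bound(dates.begin(), dates.end(), from); // first >= start date
    auto last  = std::upper_bound(first, dates.end(), to);           // one past last <= end date
    auto offset = first - dates.begin();                             // here: 1
    auto count  = last - first;                                      // here: 3 (200, 200, 300)
    (void)offset; (void)count;
    return 0;
}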
It's hard to give a good answer without a little more detail.
What kind of performance do you need?
If linear is fine then I would just use a list of dates and iterate through the list collecting all dates that fall within the range. As Andrew Grant suggested.
Do you have duplicates in the list?
If you need to have repeated dates in your collection, then most implementations of a binary tree would probably be out. Something like Java's TreeSet is a set implementation and doesn't allow repeated elements.
What are the access characteristics? Lots of lookups with few updates, vice-versa, or fairly even?
Most data structures have trade-offs between lookups and updates. If you're doing lots of updates, then some data structures that are optimized for lookups won't be so great.
So what are the access characteristics of the data structure, what kind of performance do you need, and what are structural characteristics that it must support (e.g. must allow repeated elements)?
If you need to make random-access modifications: a tree, as in v3's answer. Find the bottom of the range by lookup, then count upwards. Inserting or deleting a node is O(log N). stbuton makes a good point that if you want to allow duplicates (as seems plausible for datestamped events), then you don't want a tree-based set.
If you do not need to make random-access modifications: a sorted array (or vector or whatever). Find the location of the start of the range by binary chop, then count upwards. Inserting or deleting is O(N) in the middle. Duplicates are easy.
Algorithmic performance of lookups is the same in both cases, O(M + log N), where M is the size of the range. But the array uses less memory per entry, and might be faster to count through the range, because after the binary chop it's just forward sequential memory access rather than following pointers.
In both cases you can arrange for insertion at the end to be (amortised) O(1). For the tree, keep a record of the end element at the head, and you get an O(1) bound. For the array, grow it exponentially and you get amortised O(1). This is useful if the changes you make are always or almost-always "add a new event with the current time", since time is (you'd hope) a non-decreasing quantity. If you're using system time then of course you'd have to check, to avoid accidents when the clock resets backwards.
Alternative answer: an SQL table, and let the database optimise how it wants. And Google's BigTable structure is specifically designed to make queries fast, by ensuring that the result of any query is always a consecutive sequence from a pre-prepared index :-)
You want a structure that keeps your objects sorted by date whenever you insert or remove one, and where finding the boundaries of the segment of all objects later than or earlier than a given date is easy.
A sorted array is a natural fit here. (A binary heap proper is only partially ordered, so it cannot answer range queries directly; what you want is an array kept fully sorted.) Binary search finds where a new object belongs in O(log(n)), although the insertion or deletion itself has to shift elements, so it is O(n) in the worst case.
When you have to find all the objects between date A (excluded) and B (included), find the position of A (or its insert position, that is, the position of the earliest element later than A) and the position of B (or B's insert position), and return all the objects between those positions (which is simply the contiguous section of the array between them).

Resources