Redis Sorted Set Member Size and Performance

Redis Sorted Sets primarily sort based on a Score; however, when multiple members share the same Score, lexicographical (Alpha) sorting is used. The Redis zadd documentation indicates that the command's complexity is:
"O(log(N)) where N is the number of elements in the sorted set"
I have to assume this remains true regardless of the member size/length; however, I have a case where there are only 4 scores, so members end up sorted lexicographically after the Score.
I want to prepend a time-based key to each member so that the secondary sort is time based, and also to add some uniqueness to the members. Something like:
"time-based-key:member-string"
My member-string can be a fairly large serialized JavaScript object literal, like so:
JSON.stringify( {/* object literal */} )
Will the performance of zadd and the other sorted-set operations remain unaffected?
If not, by what magnitude will performance be affected?

The complexity comes from the number of elements that need to be tested (compared against the new element) to find the correct insertion point (Redis sorted sets are backed by a skip list, so this search is logarithmic).
It says nothing about how long it will take to perform each test, because that's considered a constant factor (in the sense that it doesn't vary when you add more items).
The amount of data which needs to be compared before determining that a new element should go before or after an existing one will affect the total clock time, but it will do so for each comparison equally.
So your overall clock time for an insert will be quickest when comparing scores only, and progressively slower the deeper into a pair of strings Redis has to look to determine their lexical order. This doesn't change the complexity class, though; it is just a constant number of microseconds per comparison, multiplied by the log(N) factor.
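To make the tie-breaking concrete, here is a minimal sketch (assuming redis-py and a local Redis instance; the "events" key name and the millisecond prefix are just illustrative). A zero-padded, fixed-width timestamp is needed for the lexicographic tie-break to actually be chronological; the length of the JSON payload only adds to the constant per-comparison cost.

import json
import time

import redis

r = redis.Redis()  # assumes a local Redis instance

def add_event(score, payload):
    # fixed-width, zero-padded timestamp so the lexicographic tie-break is chronological,
    # followed by the serialized object; member length affects only the comparison constant
    member = f"{int(time.time() * 1000):016d}:{json.dumps(payload, sort_keys=True)}"
    r.zadd("events", {member: score})  # O(log N) insertion regardless of member length

add_event(3, {"user": 42, "action": "click"})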

Related

Data structure / algorithms for getting best and worst-scoring object from set

My algorithm runs a loop in which a set of objects is maintained. In each iteration, objects are added to and removed from the set. Each object also has some "measures" (integer values, possibly several of them), which can change at any time. From those measures and the current iteration number, a score can be calculated for each object.
Whenever the number of objects passes a certain threshold, I want to identify and remove the lowest-scoring objects until the number of objects is again below that threshold. That is: if there are n objects with threshold t, if n>t then remove the n-t lowest-scoring objects.
But also, periodically I want to get the highest-scoring objects.
I'm really at a loss as to what data structure I should use here to do this efficiently. A priority queue doesn't really work as measures are changed all the time and anyway the "score" I want to use can be any arbitrarily complex function of those measures and the current iteration number. The obvious approach is probably a hash-table storing associations object -> measures, with amortized O(1) add/remove/update operations, but then finding the lowest or highest scoring objects would be O(n) in the number of elements. n can be easily in the millions after a short while so this isn't ideal. Is this the best I can do?
I realise this probably isn't very trivial but I'd like to know if anyone has any suggestions as to how this could be best implemented.
PS: The language is OCaml but it really doesn't matter.
For this level of generality, the best approach is to have something that gives quick access to the measures (storing them in the object or behind a pointer is best, but a hash table also works), plus an additional data structure that keeps an ordered view of your objects.
Every time you update the measures, you refresh the score and update the ordered data structure. Something like a balanced BST (red-black tree, AVL) works well and guarantees O(log N) updates.
You can also keep a min-max heap instead of the BST. It uses fewer pointers, which lowers the overhead of the solution; the complexity remains O(log N) per update.
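A rough sketch of that two-structure idea (the question mentions OCaml, but the shape is language-independent; this uses Python, with the third-party sortedcontainers.SortedList standing in for the balanced BST, and all names are illustrative):

from sortedcontainers import SortedList  # third-party; stands in for a balanced BST

class ScoredSet:
    def __init__(self):
        self.by_id = {}              # obj_id -> {"measures": ..., "score": ...}
        self.ordered = SortedList()  # (score, obj_id) pairs, kept ordered

    def upsert(self, obj_id, measures, score):
        old = self.by_id.get(obj_id)
        if old is not None:
            self.ordered.remove((old["score"], obj_id))  # O(log N)
        self.by_id[obj_id] = {"measures": measures, "score": score}
        self.ordered.add((score, obj_id))                # O(log N)

    def pop_lowest(self, k):
        """Remove and return the ids of the k lowest-scoring objects."""
        victims = [obj_id for _, obj_id in self.ordered[:k]]
        for obj_id in victims:
            self.ordered.remove((self.by_id[obj_id]["score"], obj_id))
            del self.by_id[obj_id]
        return victims

    def highest(self):
        return self.ordered[-1][1] if self.ordered else None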
You've mentioned that the score depends on the iteration number. This is bad for performance, because it requires every entry to be updated every iteration. However, if you can isolate the impact (say the score is g(all_measures) - f(iteration_number)) so that all elements are affected equally, the relative order stays the same and you can skip recomputing the scores each iteration.
If it's not constant but still isolated (something like f(iteration_number, important_time)), you can use the balanced BST, calculate at which iteration each element will swap with one of its neighbours, keep those swap times in a heap, and only update the elements that would actually swap.
If it's not isolated at all, then every iteration you have to update all the elements anyway, so you might as well track the highest value and the lowest ones while you go through them recomputing the scores. That has complexity O(N log K), where K is the number of lowest-scoring elements you want to remove (hopefully K is small, so it behaves almost like O(N)).
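For that last (non-isolated) case, the one-pass "recompute everything, track the K lowest" idea could look like this sketch (score_fn and the inputs are placeholders):

import heapq

def lowest_k(objects, score_fn, k):
    """One pass: recompute every score, keep only the k lowest seen so far.

    Uses a bounded max-heap (scores negated), so the cost is O(N log K).
    """
    worst = []  # max-heap of (-score, index, obj), size <= k
    for i, obj in enumerate(objects):
        s = score_fn(obj)
        if len(worst) < k:
            heapq.heappush(worst, (-s, i, obj))
        elif s < -worst[0][0]:
            heapq.heapreplace(worst, (-s, i, obj))
    return [obj for _, _, obj in worst]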

Is a zinterstore going to be faster/slower when one of the two input sets is a normal set?

I know I can do a zinterstore with a normal set as an argument (Redis: How to intersect a "normal" set with a sorted set?). Is that going to affect performance? Is it going to be faster/slower than working only with zsets?
According to the sorted-set source code, ZINTERSTORE treats a plain set like a sorted set whose members all have a score of 1; the relevant function is zunionInterGenericCommand.
Intersecting will take more or less time depending on preprocessing steps like this one, which orders the input sets by cardinality:
/* sort sets from the smallest to largest, this will improve our
* algorithm's performance */
qsort(src,setnum,sizeof(zsetopsrc),zuiCompareByCardinality);
There are also differences in how Sets and Zsets are stored, which affects how they are read: Redis decides how to encode a (Sorted) Set depending on how many elements it contains, so iterating through them requires different work.
However, for any practical purpose, your best bet is to use ZINTERSTORE, and here is why: it is hard to see how anything you write in your own source code will beat Redis's performance for the intersection you want to do.
If your concern is performance, you're getting too deep into the details. Your focus should be on the big-O of the operation instead, shown in the command documentation:
Time complexity: O(N*K)+O(M*log(M)) worst case with N being the smallest input sorted set, K being the number of input sorted sets and M being the number of elements in the resulting sorted set.
What this tells you is:
1-The size of the smallest set and the number of sets you plan to intersect determine the first part. So if you know that you'll always intersect 2 sets, one small and one huge, you can treat the first part as roughly constant. A good example would be intersecting a set of all available products in a store (where the score is how many are in stock) with a sorted set of products in a user's cart.
In this case you'll have only 2 sets, and you'll know one of them will be very small.
2-The size of the resulting sorted set M can cause a big performance issue. But there's a subtlety here: Redis stores small sorted sets as a ziplist and switches to a skip list once they grow past a threshold, so a large result is stored less compactly and costs more to build.
However, for an intersection you know that the result cannot be bigger than the smallest set you provide. For a union, the result contains all elements from all sets, so there the attention needs to be on the size of the bigger sets rather than the smallest.
In summary, the answer to the question of performance with (sorted) sets is: it depends on the sizes of the sets much more than on the actual datatype. Take into consideration that the resulting data structure will be a sorted set regardless of whether all the inputs are plain sets, so a big result will be stored (less compactly) as a skip list.
Knowing beforehand how many sets you plan to intersect (2, 3, depending on user input?) and the size of the smaller set (10? hundreds? thousands?) will give you a much better idea than the internal datatypes. The algorithm for intersecting is the same for both types.
Redis treats the normal set as a sorted set in which every element has the same default score, so performance should be essentially the same as intersecting two sorted sets.
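As a small illustration of mixing the two types (a sketch assuming redis-py and a local Redis instance; the cart/stock key names are made up, echoing the example above):

import redis

r = redis.Redis()  # assumes a local Redis instance

# plain set: products in the user's cart
r.sadd("cart:42", "sku:1", "sku:2", "sku:3")

# sorted set: stock level per product
r.zadd("stock", {"sku:1": 7, "sku:2": 0, "sku:3": 12, "sku:4": 5})

# the plain set is treated as a sorted set whose members all have score 1;
# with the default SUM aggregation the result score is stock + 1
r.zinterstore("cart:42:stock", ["cart:42", "stock"])
print(r.zrange("cart:42:stock", 0, -1, withscores=True))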

Sort in ascending or descending order (chosen arbitrarily; prefer whichever is cheaper)

I have an array of elements. This array could be:
Randomly shuffled (about 20% of the time)
Nearly sorted* in ascending order (about 40% of the time)
Nearly sorted in descending order (about 40% of the time)
But I do not know (in advance) which of these cases applies. I would prefer to sort the array into the order which it is already close to.
It does not matter whether the output is ascending or descending, but it must be one or the other (so I can perform a binary search on it.)
The sort need not be stable.
Some background info: The process goes roughly like this:
Populate the array
Sort on some attribute A
Do some processing (compute quantiles, and some other minor stuff)
Sort on some other attribute B
Do more processing
Sort on attribute C
Do more processing
A and B are often correlated with each other (but may be positively or negatively.) Same applies to B and C. Occasionally A == C.
* "nearly sorted" here means most elements are close to their final positions. But rarely exactly at their final positions (there is a lot of additive noise, and not many long sorted subsequences.) Still, there are usually a few "outliers" at the start and end of the array which are poor predictors of the order for the next sort. 
Is there an algorithm that can take advantage of the fact that I have no preference for ascending vs. descending to sort more cheaply (compared to the Timsort I am currently using)?
I'd continue using Timsort (however, a good alternative is Smoothsort*), but first probe the array to decide whether to sort in ascending or descending order. Look at the first and last elements and sort accordingly. If the array is unsorted, the choice is immaterial; if it is (partially) sorted, probing at a wide interval is more likely to correctly detect which way.
*Smoothsort has the same best, average, and worst case time as Timsort, and better space complexity. Like Timsort, it was specifically designed to take advantage of partially sorted data.
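A minimal sketch of that probe-then-sort idea (Python, whose list.sort is Timsort; the probe here is just the first and last elements, and could be moved inward to dodge the noisy ends mentioned in the question):

def sort_cheaply(a):
    """Sort a in place, ascending or descending, whichever it is already closer to."""
    if len(a) < 2:
        return a
    # probe widely separated elements to guess the prevailing direction
    descending = a[0] > a[-1]
    a.sort(reverse=descending)  # Python's list.sort is Timsort
    return a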
Another possibility to consider:
Start doing a (hand-rolled) insertion sort
As you go, count the number of inversions you perform
After you have done some small fixed number of insertions, compare the number of inversions you have counted to the maximum number of inversions that would have occurred by that point if the data were reverse-sorted to begin with:
If the proportion is close to 0, then (probably) the data is nearly sorted. Complete the insertion sort, which performs very well on nearly-sorted data. If you don't like the sound of "probably", keep counting inversions as you go and be ready to fall back to Timsort if the proportion climbs above a threshold.
If the proportion is close to 1, then (probably) the data is nearly-reverse-sorted, and you have a small number of sorted elements at the start. Move them to the end, reverse them, and complete an insertion sort with reversed comparator.
Otherwise the data is probably shuffled; use your favourite sorting algorithm. I'd say Timsort, but since that does well on nearly-sorted data, there must be some other algorithm that does at least a tiny bit better than Timsort on uniformly-shuffled data. Probably plain merge sort without the Tim.
The "small fixed number" can be a number for which insertion sort is fairly fast even in bad cases. I would guess 10-20 or so. It's possible to work out the probability of a false positive in uniformly shuffled data for any given number of insertions and any given threshold of "close to 0/1", but I'm too lazy.
You say the first and last few array elements typically buck the trend, in which case you could exclude them from the initial test insertion sort.
Obviously this approach is somewhat inspired by Timsort. But Timsort is fiendishly optimized for data that contains runs -- I have tried to fiendishly optimize only for data that's close to one big run (in either direction). Another feature of Timsort is that it's well tested, I don't claim to share that.
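A sketch of the inversion-counting probe described above (the prefix length of 16 and the 0.2/0.8 cut-offs are guesses, as the answer already admits):

def inversion_ratio(a, limit=16):
    """Insertion-sort a copy of the first `limit` elements, counting inversions.

    Returns the fraction of the maximum possible inversions for that prefix:
    ~0 suggests nearly ascending, ~1 suggests nearly descending.
    """
    n = min(limit, len(a))
    prefix = list(a[:n])
    inversions = 0
    for i in range(1, n):
        j = i
        while j > 0 and prefix[j - 1] > prefix[j]:
            prefix[j - 1], prefix[j] = prefix[j], prefix[j - 1]
            inversions += 1
            j -= 1
    max_inversions = n * (n - 1) // 2
    return inversions / max_inversions if max_inversions else 0.0

# ratio = inversion_ratio(data)
# ratio < 0.2 -> finish ascending (insertion sort / Timsort)
# ratio > 0.8 -> sort descending instead
# otherwise   -> data looks shuffled; plain Timsort (or merge sort)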

Efficiently estimating the number of unique elements in a large list

This problem is a little similar to the one solved by reservoir sampling, but not the same. I think it's also a rather interesting problem.
I have a large dataset (typically hundreds of millions of elements), and I want to estimate the number of unique elements in this dataset. There may be anywhere from a few, to millions of unique elements in a typical dataset.
Of course the obvious solution is to maintain a running hashset of the elements you encounter and count them at the end. This would yield an exact result, but would require me to carry a potentially large amount of state with me as I scan through the dataset (i.e. all unique elements encountered so far).
Unfortunately, in my situation this would require more RAM than is available to me (noting that the dataset may be far larger than available RAM).
I'm wondering if there would be a statistical approach to this that would allow me to do a single pass through the dataset and come up with an estimated unique element count at the end, while maintaining a relatively small amount of state while I scan the dataset.
The input to the algorithm would be the dataset (an Iterator in Java parlance), and it would return an estimated unique object count (probably a floating point number). It is assumed that these objects can be hashed (ie. you can put them in a HashSet if you want to). Typically they will be strings, or numbers.
You could use a Bloom Filter for a reasonable lower bound. You just do a pass over the data, counting and inserting items which were definitely not already in the set.
This problem is well-addressed in the literature; a good review of various approaches is http://www.edbt.org/Proceedings/2008-Nantes/papers/p618-Metwally.pdf. The simplest approach (and most compact for very high accuracy requirements) is called Linear Counting. You hash elements to positions in a bitvector just like you would a Bloom filter (except only one hash function is required), but at the end you estimate the number of distinct elements by the formula D = -total_bits * ln(unset_bits/total_bits). Details are in the paper.
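A compact sketch of Linear Counting as just described (the table size m and the hash function are arbitrary choices for illustration):

import hashlib
import math

def linear_count(items, m=1 << 20):
    """Estimate distinct items with an m-bit table and a single hash function."""
    bits = bytearray(m // 8)
    for item in items:
        h = int.from_bytes(
            hashlib.blake2b(str(item).encode(), digest_size=8).digest(), "big"
        ) % m
        bits[h // 8] |= 1 << (h % 8)
    unset = m - sum(bin(b).count("1") for b in bits)
    if unset == 0:
        raise ValueError("table saturated; pick a larger m")
    return -m * math.log(unset / m)  # D = -total_bits * ln(unset_bits/total_bits)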
If you have a hash function that you trust, then you could maintain a hashset just like you would for the exact solution, but throw out any item whose hash value is outside of some small range. E.g., use a 32-bit hash, but only keep items where the first two bits of the hash are 0. Then multiply by the appropriate factor at the end to approximate the total number of unique elements.
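A sketch of that hash-range trick, keeping roughly 1/2^prefix_bits of the distinct items exactly and scaling the count back up (the hash and the prefix width are illustrative):

import hashlib

def sampled_distinct_count(items, prefix_bits=2):
    """Keep only items whose 32-bit hash starts with prefix_bits zero bits,
    count those exactly, then scale the count back up."""
    kept = set()
    for item in items:
        h = int.from_bytes(
            hashlib.blake2b(str(item).encode(), digest_size=4).digest(), "big"
        )
        if h >> (32 - prefix_bits) == 0:
            kept.add(item)
    return len(kept) * (2 ** prefix_bits)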
Nobody has mentioned the approximate algorithm designed specifically for this problem: HyperLogLog.
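Redis itself ships a HyperLogLog implementation, so if Redis is already in the picture (as in the first question above), a sketch with redis-py may be the easiest route; the key name and sample items are placeholders:

import redis

r = redis.Redis()  # assumes a local Redis instance

# HyperLogLog keeps ~12 KB of state per key regardless of cardinality
for item in ["a", "b", "a", "c", "b"]:  # stands in for the real, huge dataset
    r.pfadd("dataset:hll", item)

print(r.pfcount("dataset:hll"))  # estimated distinct count, standard error ~0.81%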

What sort of sorted datastructure is optimized for finding items within a range?

Say I have a bunch of objects with dates and I regularly want to find all the objects that fall between two arbitrary dates. What sort of datastructure would be good for this?
A binary search tree sounds like what you're looking for.
You can use it to find all the objects in O(log(N) + K), where N is the total number of objects and K is the number of objects that actually fall in the range (provided the tree is balanced). Insertion/removal is O(log(N)).
Most languages have a built-in implementation of this.
C++:
http://www.cplusplus.com/reference/stl/set/
Java:
http://java.sun.com/j2se/1.4.2/docs/api/java/util/TreeSet.html
You can find the lower bound of the range (in log(n)) and then iterate from there until you reach the upper bound.
Assuming that by "sorted" you mean sorted by date, a sorted array will do it.
Do a binary search to find the index that's >= the start date. You can then either do another search to find the index that's <= the end date, leaving you with an offset and count of items, or, if you're going to process them anyway, just iterate through the list until you exceed the end date.
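A minimal sketch of that sorted-array approach (Python's bisect module; the sample events are made up):

import bisect
from datetime import date

# objects kept sorted by date, plus a parallel list of just the dates
events = [(date(2023, 1, 5), "a"), (date(2023, 2, 1), "b"), (date(2023, 3, 9), "c")]
dates = [d for d, _ in events]

def in_range(start, end):
    """All events with start <= date <= end, via two binary searches."""
    lo = bisect.bisect_left(dates, start)   # first index with date >= start
    hi = bisect.bisect_right(dates, end)    # first index with date > end
    return events[lo:hi]

print(in_range(date(2023, 1, 1), date(2023, 2, 28)))  # -> the first two events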
It's hard to give a good answer without a little more detail.
What kind of performance do you need?
If linear is fine, then I would just use a list of dates and iterate through the list collecting all dates that fall within the range, as Andrew Grant suggested.
Do you have duplicates in the list?
If you need repeated dates in your collection, then most implementations of a binary tree would probably be out. Something like Java's TreeSet is a set implementation and doesn't allow repeated elements.
What are the access characteristics? Lots of lookups with few updates, vice-versa, or fairly even?
Most datastructures trade off lookups against updates. If you're doing lots of updates, then a datastructure that is optimized for lookups won't be so great.
So what are the access characteristics of the data structure, what kind of performance do you need, and what are structural characteristics that it must support (e.g. must allow repeated elements)?
If you need to make random-access modifications: a tree, as in v3's answer. Find the bottom of the range by lookup, then count upwards. Inserting or deleting a node is O(log N). stbuton makes a good point that if you want to allow duplicates (as seems plausible for datestamped events), then you don't want a tree-based set.
If you do not need to make random-access modifications: a sorted array (or vector or whatever). Find the location of the start of the range by binary chop, then count upwards. Inserting or deleting is O(N) in the middle. Duplicates are easy.
Algorithmic performance of lookups is the same in both cases, O(M + log N), where M is the size of the range. But the array uses less memory per entry, and might be faster to count through the range, because after the binary chop it's just forward sequential memory access rather than following pointers.
In both cases you can arrange for insertion at the end to be (amortised) O(1). For the tree, keep a record of the end element at the head, and you get an O(1) bound. For the array, grow it exponentially and you get amortised O(1). This is useful if the changes you make are always or almost-always "add a new event with the current time", since time is (you'd hope) a non-decreasing quantity. If you're using system time then of course you'd have to check, to avoid accidents when the clock resets backwards.
Alternative answer: an SQL table, and let the database optimise how it wants. And Google's BigTable structure is specifically designed to make queries fast, by ensuring that the result of any query is always a consecutive sequence from a pre-prepared index :-)
You want a structure that keeps your objects sorted by date whenever you insert or remove one, and that makes it easy to find the boundary of the segment of all objects later than or earlier than a given date.
Note that a binary heap is not quite that structure: its backing array is kept in heap order, not fully sorted, so it cannot answer range queries by position. What you actually want is the sorted array itself (binary-search for the position in O(log(n)), though each insertion then costs O(n) to shift elements) or a balanced tree if you insert and delete often.
When you have to find all the objects between date A (excluded) and B (included), find the position of A (or its insert position, i.e. the position of the earliest element later than A) and the position of B (or its insert position), and return all the objects between those positions, which is simply that contiguous section of the array.
