Redis - performance of using a sorted set as a priority queue with a limited range of priorities

I am writing an application using Redis where I need a sort of priority queue: items are assigned a priority between 1 and 10, and then one of the highest-priority items is popped (order among equal priorities does not matter). From my understanding, the sorted set's ZADD and BZPOPMAX commands seem perfect for this use case.
However, I noticed in the Redis docs that both operations are O(log(N)), whereas the roughly equivalent operations for lists and sets are O(1).
This leads me to a couple of performance-related questions:
Even though I know my queue will have a very small range of priorities (1-10), will the practical big-O of the Redis sorted set operations still be O(log(N))?
Is it likely to be worth pursuing an alternative implementation (perhaps one that uses a couple of calls to lists and sets instead)? My queue may have hundreds of thousands or even millions of items in it.

Even though I know my queue will have a very small range of priorities (1-10), will the practical big-O of the Redis sorted set operations still be O(log(N))?
The time complexity of a Redis sorted set has nothing to do with the range of scores (in your case, the priorities); it depends on the number of items in the sorted set. So the practical big-O is still O(log(N)).
Is it likely to be worth pursuing an alternative implementation (perhaps one that uses a couple of calls to lists and sets instead)?
You cannot achieve the goal with a Redis list and a Redis set, because a Redis set is unordered, and searching a list or accessing items in the middle of a list (anything other than the head and tail) is O(N).
O(log(N)) is not slow: even with 1 million items, a call to ZPOPMAX only needs about 20 comparisons to get the result, which is quite fast.
Try the simple solution and do a benchmark before choosing a more complex algorithm.
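For illustration, here is a minimal sketch of the sorted-set approach, assuming the redis-py client and a Redis server on localhost; the key name "jobs" and the item names are placeholders, not anything from the question.

    import redis

    r = redis.Redis()  # assumes a Redis server on localhost:6379

    def push(item, priority):
        # ZADD is O(log N) per added member.
        r.zadd("jobs", {item: priority})

    def pop_highest(timeout=0):
        # BZPOPMAX blocks until a member is available and returns
        # (key, member, score); it is O(log N) as well.
        key, member, score = r.bzpopmax("jobs", timeout=timeout)
        return member, int(score)

    push("job:1", 5)
    push("job:2", 10)
    print(pop_highest())   # job:2 comes out first, since it has the higher priority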

Related

Data structure / algorithms for getting best and worst-scoring object from set

My algorithm runs a loop in which a set of objects is maintained. In each iteration, objects are added to and removed from the set. Also, there are some "measures" (integer values, possibly several of them) for each object, which can change at any time. From those measures and the iteration number, a score can be calculated for each object.
Whenever the number of objects passes a certain threshold, I want to identify and remove the lowest-scoring objects until the number of objects is again below that threshold. That is, with n objects and threshold t, if n > t then remove the n-t lowest-scoring objects.
But also, periodically I want to get the highest-scoring objects.
I'm really at a loss as to what data structure I should use here to do this efficiently. A priority queue doesn't really work, as the measures change all the time, and anyway the "score" I want to use can be an arbitrarily complex function of those measures and the current iteration number. The obvious approach is probably a hash table storing associations object -> measures, with amortized O(1) add/remove/update operations, but then finding the lowest- or highest-scoring objects would be O(n) in the number of elements. n can easily be in the millions after a short while, so this isn't ideal. Is this the best I can do?
I realise this probably isn't very trivial but I'd like to know if anyone has any suggestions as to how this could be best implemented.
PS: The language is OCaml but it really doesn't matter.
For this level of generality, the best approach would be to have something for quick access to the measures (storing them in the object or via a pointer would be best, but a hash table would also work) and an additional data structure that keeps an ordered view of your objects.
Every time you update the measures you would want to refresh the score and update the ordered data structure. Something like a balanced BST (red-black tree, AVL) would work well and would guarantee O(log n) update complexity.
You can also keep a min-max heap instead of the BST. This has the advantage of using fewer pointers, which should lower the overhead of the solution. Complexity remains O(log n) per update.
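As a rough sketch of the "hash table plus ordered view" idea (not code from the answer), the third-party sortedcontainers package can stand in for a balanced BST; the scoring function and names below are placeholders.

    from sortedcontainers import SortedList

    measures = {}            # object id -> measures
    ordered = SortedList()   # (score, object id), kept sorted

    def score(m):
        return sum(m)        # placeholder for the real scoring function

    def update(obj_id, new_measures):
        if obj_id in measures:
            ordered.remove((score(measures[obj_id]), obj_id))  # O(log n)
        measures[obj_id] = new_measures
        ordered.add((score(new_measures), obj_id))             # O(log n)

    def remove_lowest(count):
        # Drop the `count` lowest-scoring objects.
        for s, obj_id in list(ordered[:count]):
            ordered.remove((s, obj_id))
            del measures[obj_id]

    def highest():
        return ordered[-1]   # best-scoring object in O(1)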
You've mentioned that the score depends on the iteration number. This is bad for performance, because it requires all entries to be updated every iteration. However, if you can isolate the impact (say, the score is g(all_metrics) - f(iteration_number)) so that all elements are affected the same way, then the relative order should remain consistent and you can skip updating the scores every iteration.
If it's not constant but it's still isolated (something like f(iteration_number, important_time)), you can use the balanced BST, calculate at which iteration each element will swap with one of its neighbours, keep those swap times in a heap, and only update the elements that would actually swap.
If it's not isolated at all, then you need to update all the elements at each iteration anyway, so you might as well keep track of the highest value and the lowest ones while you go through them recomputing the scores. This at least has a complexity of O(N log K), where K is the number of lowest values you want to remove (hopefully K is very small, so it should behave almost like O(N)).

What data structure to use in maintaining k most frequently dialed numbers in a phone?

I was asked this question in an interview: "How do you maintain the k most frequently dialed numbers in a phone?" So what kind of data structure should be used in this case?
The tasks are:
Keep track of the number of times each number is dialed;
Keep track of the top-k numbers by count.
So, you'll have to use an augmented data structure. In your case, this will be a HashSet plus a PriorityQueue (i.e. a heap) of size k with the least-dialed number at the top.
Since the number of times a number has been dialed can only increase, our job is a bit easier: you will never have to pull a number out of the heap because its count went down. Instead, you only ever add a number that has been dialed and then remove the top of the heap, because the top is the least-dialed number.
The class PhoneNumber would contain:
the phone number;
the count of times it has been dialed; and,
a boolean telling whether it is currently among the top-k numbers or not.
General steps would be:
Whenever a number is dialed:
If it has never been dialed before, add it to the HashSet with a dial count of 1 and the boolean tracking its presence in the heap set to true;
If it is already present in the HashSet, increase its dial count by 1, making sure the hash function is independent of the dial count (otherwise you will not be able to retrieve the number from the HashSet);
If the number is already in the heap (which you can tell from the boolean in the PhoneNumber object), increase its count and heapify() the heap again;
If the number is not in the heap, add it to the heap and then remove the top, setting the tracking boolean of the removed number to false. This ensures that only the top-k dialed numbers are present in the heap;
Make sure you don't remove anything until the heap's size reaches k.
Space complexity: O(n) for the n numbers dialed so far, stored in the HashSet and referenced from the heap.
Time complexity: O(k + log(k)) for each dial, because you have to heapify on each new dial. Since the rearrangement of keys is done for only one number in the worst case, you iterate over the k heap entries and then occasionally do O(log(k)) work for exactly one number. Getting the top-k dialed numbers is O(1), as they are right there in your heap.
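A hedged sketch of that scheme in Python follows: a dict for the counts, a size-k min-heap, and a set playing the role of the boolean flag. The numbers and the value of k are made up.

    import heapq

    k = 3
    counts = {}       # phone number -> dial count
    heap = []         # min-heap of [count, number]; least-dialed on top
    in_heap = set()   # numbers currently tracked in the heap

    def dial(number):
        counts[number] = counts.get(number, 0) + 1
        if number in in_heap:
            for entry in heap:            # update the stored count...
                if entry[1] == number:
                    entry[0] = counts[number]
            heapq.heapify(heap)           # ...and heapify again, O(k)
        else:
            heapq.heappush(heap, [counts[number], number])
            in_heap.add(number)
            if len(heap) > k:             # evict the least-dialed number
                _, evicted = heapq.heappop(heap)
                in_heap.discard(evicted)

    for n in ["111", "222", "111", "333", "444", "111", "222"]:
        dial(n)
    print(sorted(heap, reverse=True))     # the k numbers currently tracked as top-k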
Priority queue (typically implemented as a max heap)
A max heap, which is exposed as a priority queue in many programming languages, can be used, where each entry is a <phone_number, count_of_dial> pair. The max heap is ordered by count_of_dial, and the top k items are the answer.
The purpose of this question is twofold:
To get you to ask questions.
To get you to talk through drawbacks and advantages of different approaches.
The interviewer isn't terribly interested in you getting the "right" answer, as much as he's interested in how you approach the problem. For example, the problem as stated is not well specified. You probably should have asked questions like:
Most frequent over what period? All time? Per month? Year to date?
How will this information be used?
How frequently will the information be queried?
How fast does response have to be?
How many phone numbers do you expect it to handle?
Is k constant? Or will users ask at one point for the top 10, and some other time for the top 100?
All of these questions are relevant to solving the problem.
Once you know all of the requirements, then you can start thinking about how to implement a solution. It could be as simple as maintaining a call count with every phone entry. Then, when the data is queried, you run a simple heap selection algorithm on the entire phone list, picking the top k items. That's a perfectly reasonable solution if queries are infrequent. The typical phone isn't going to have a huge number of called numbers, so this could work quite well.
Another possible solution would be to maintain the call count for each number, and then, after every call, run the heap selection algorithm and cache the result. The idea here is that the data can only update when a new call is made, and calls are very infrequent, in terms of computer time. If you could make a call every 15 seconds (quite unlikely), that's only 5,760 calls in a day. Even the slowest phone should be able to keep up with that.
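Either way, the heap-selection step itself could be as simple as Python's heapq.nlargest over the stored counts; the call counts below are invented for illustration.

    import heapq

    call_counts = {"555-0101": 12, "555-0102": 3, "555-0103": 47, "555-0104": 8}

    def top_k_dialed(k):
        # Heap selection over the whole phone list, O(n log k).
        return heapq.nlargest(k, call_counts.items(), key=lambda kv: kv[1])

    print(top_k_dialed(2))   # [('555-0103', 47), ('555-0101', 12)]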
There are other solutions, all of which have their advantages and disadvantages. Balancing memory use, CPU resources, simplicity, development time, and many other factors is a large part of what we do as software developers. The interviewer purposely under-specified a problem with a seemingly straightforward solution in order to see how you approach things.
If you did not do well on the interview, learn from it. Also understand that the interviewer probably thought you were technically competent, otherwise he wouldn't have moved on to see how well you approach problems. You're expected to ask questions. After all, anybody can regurgitate simple solutions to simple problems. But in the real world we don't get simple problems, and the people asking us to do things don't always fully specify the requirements. So we have to learn how to extract the requirements from them: usually by asking questions.
I'd use a structure that holds the number and how many times it was dialed, and put that in a B-tree organized according to the number of times each number was dialed.
Add: O(log(n)) [balanced]
Add: O(n) [not balanced]
Search: O(log(n))
Balance: O(log(n))
Add (not balanced) + balance: O(log(n))
In the worst case, searching + adding + balancing would be O(n). The average complexity of all operations on a B-tree is still O(log(n)).
A B-tree grows at the root, not at the leaves, so it is guaranteed to stay balanced all the time, since you keep it in shape during insertion by splitting full nodes and promoting their middle keys.
This specific case, where I never have to forget numbers that were once dialed, is even simpler.
The advantage is that the tree is always ordered, so what you are looking for would simply be the first 50 "nodes (key/value)" of the tree.

How to update key of a relaxed vertex in Dijkstra's algorithm?

Just like it was asked here,
I fail to understand how we can find the index of a relaxed vertex in the heap.
Programming style-wise, the heap is a black box that abstracts away the details of a priority queue. Now, if we need to maintain a hash table that maps vertex keys to their corresponding indices in the heap array, that would need to be done inside the heap implementation, right?
But most standard heaps don't provide a hash table that does such mapping.
Another way to deal with this whole problem is to add the relaxed vertices to the heap regardless of anything. When we extract the minimum, we'll get the best one. To prevent the same vertex from being extracted multiple times, we can mark it as visited.
So my exact question is, what is the typical way (in the industry) of dealing with this problem?
What are the pros and cons compared with the methods I mentioned?
Typically, you'd need a specially constructed priority queue that supports the decreaseKey operation in order to get this to work. I've seen this implemented by having the priority queue explicitly keep track of a hash table of indices (if using a binary heap), or by having an intrusive priority queue where the stored elements are themselves nodes in the heap (if using a binomial heap or Fibonacci heap, for example). Sometimes, the priority queue's insertion operation will return a pointer to the node in the priority queue that holds the newly added key. As an example, here is an implementation of a Fibonacci heap that supports decreaseKey. It works by having each insert operation return a pointer to the node in the Fibonacci heap, which makes it possible to look up the node in O(1), assuming you keep track of the returned pointers.
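To make the "hash table of indices" variant concrete, here is an illustrative (not production-grade) binary min-heap with a position map, so decreaseKey runs in O(log n); all names are mine, not from any library.

    class IndexedMinHeap:
        def __init__(self):
            self.heap = []   # list of (priority, key)
            self.pos = {}    # key -> index in self.heap

        def _swap(self, i, j):
            self.heap[i], self.heap[j] = self.heap[j], self.heap[i]
            self.pos[self.heap[i][1]] = i
            self.pos[self.heap[j][1]] = j

        def _sift_up(self, i):
            while i > 0 and self.heap[i][0] < self.heap[(i - 1) // 2][0]:
                self._swap(i, (i - 1) // 2)
                i = (i - 1) // 2

        def _sift_down(self, i):
            n = len(self.heap)
            while True:
                smallest = i
                for child in (2 * i + 1, 2 * i + 2):
                    if child < n and self.heap[child][0] < self.heap[smallest][0]:
                        smallest = child
                if smallest == i:
                    return
                self._swap(i, smallest)
                i = smallest

        def insert(self, key, priority):
            self.heap.append((priority, key))
            self.pos[key] = len(self.heap) - 1
            self._sift_up(len(self.heap) - 1)

        def decrease_key(self, key, new_priority):
            i = self.pos[key]                  # O(1) thanks to the index map
            self.heap[i] = (new_priority, key)
            self._sift_up(i)                   # O(log n)

        def extract_min(self):
            self._swap(0, len(self.heap) - 1)
            priority, key = self.heap.pop()
            del self.pos[key]
            if self.heap:
                self._sift_down(0)
            return key, priority

    pq = IndexedMinHeap()
    pq.insert("b", 7); pq.insert("a", 3); pq.insert("c", 9)
    pq.decrease_key("c", 1)
    print(pq.extract_min())   # ('c', 1)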
Hope this helps!
You are asking some very valid questions but unfortunately they are kind of vague so we won't be able to give you a 100% solid "industry standard" answer. However, I'll try to go over your points anyway:
Programming style-wise, the heap is a black box that abstracts away the details of a priority queue
Technically, a priority queue is the abstract interface (insert elements with a priority, extract the lowest-priority element) and a heap is a concrete implementation (array-based binary heap, binomial heap, Fibonacci heap, etc.).
What I'm trying to say is that using an array is only one particular way to implement a priority queue.
Now if we need to maintain a hash table that maps vertex keys to corresponding indices in the heap array, that would need to be done in heap implementation, right?
Yes, because every time you move an element inside the array you will need to update its index in the hash table.
But most standard heaps don't provide a hash table that does such mapping.
Yes. This can be very annoying.
Another way to deal with this whole problem is to add the relaxed vertices to the heap regardless of anything.
I guess that could work, but I don't think I've ever seen anyone do that. The whole point of using a heap here is to increase performance, and by adding redundant elements to the heap you kind of go against that. Sure, you preserve the "black-boxness" of the priority queue, but I don't know if that is worth it. Additionally, there is a chance that the extra pop_heap operations could negatively affect your asymptotic complexity, but I'd have to do the math to check.
what is the typical way (in the industry) of dealing with this problem?
First of all, ask yourself if you can get away with using a dumb array instead of a priority queue.
Sure, finding the minimum element is now O(N) instead of O(log N), but the implementation is the simplest possible (an advantage on its own). Additionally, using an array will be just as efficient if your graph is dense, and even if your graph is sparse it might be efficient enough, depending on how big your graph is.
If you really need a priority queue, then you are going to have to find one that has a decreaseKey operation implemented. If you can't find one, I would say it's not that bad to implement it yourself; it might be less trouble than trying to find an existing implementation and then trying to fit it in with the rest of your code.
Finally, I would not recommend using the really fancy heap data structures (such as fibonacci heaps). While these often show up in textbooks as a way to get optimal asymptotics, in practice they have terrible constant factors and these constant factors are significant when compared with something that is logarithmic.
Programming style-wise, the heap is a black box that abstracts away the details of a priority queue.
Not necessarily. Both C++ and Python have heap libraries that provide functions on arrays rather than black box objects. Go abstracts a bit, but requires the programmer to provide an array-like data structure for its heap operations to work on.
All this abstraction leaking in standardized, industrial-strength libraries has a reason: some algorithms (Dijkstra) require a heap with additional operations, which would degrade the performance of other algorithms. Yet other algorithms (heapsort) need heap operations that work in place on input arrays. If your library's heap gives you a black-box object, and it doesn't suffice for some algorithm, then it's time to re-implement the operations as functions on arrays, or to find a library that does have the operations you need.
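Python's heapq is a good example of this: the "heap" is just an ordinary list that the module's functions operate on in place.

    import heapq

    pq = [5, 1, 4]
    heapq.heapify(pq)          # the list itself becomes the heap
    heapq.heappush(pq, 2)
    print(heapq.heappop(pq))   # 1
    print(pq)                  # still a plain list you can inspect directly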
This is a great question and one of those details that algorithms books like CLRS just gloss over without mention.
There are a few ways to handle this:
Use a custom heap implementation that supports decreaseKey operations
Every time you "relax" a vertex, you just add it back into the heap with the new lower weight, then you write a custom way to ignore the old elements later. You can take advantage of the fact that you only ever add a node into the heap/priority-queue if the weight has decreased.
Option #1 is definitely used. For example, if you are familiar with the Open Source Routing Machine (OSRM), it searches over graphs with many millions of nodes to compute road routing directions. It uses a Boost implementation of a d-ary heap specifically because it has better decreaseKey performance (source). The Fibonacci heap is also often mentioned for this purpose because it supports O(1) decrease-key operations, but likewise you'd probably have to roll your own.
In option #2 you end up doing more insertions and removeMin operations in total. If D is the total number of "relax" operations you must do, you end up doing a total of D additional heap operations. So while this has a theoretically worse runtime complexity, in practice there is research evidence that option #2 can be more performant because you can take advantage of cache locality and avoid the additional overhead of keeping pointers to do the decreaseKey operations (see [1], specifically pg. 16). This approach also has the advantage of being simpler and allows you to use standard library heap/priority-queue implementations in most languages.
To give you some pseudocode for how option #2 would look:
    // Imagine this is some lookup table that has the minimum weight
    // so far for each node.
    weights = {}
    while Queue is not empty:
        u = Queue.removeMin()
        // This is our new logic to discard the duplicate entries.
        if u.weight > weights[u]:
            continue
        visit neighbors[u] and relax() each one
As an alternative, you can also check out the Python standard library heapq documentation, which describes another approach to keeping track of "dead" entries in the heap. Whether you find it helpful depends on what data structures you are using for your graph representation and for storing vertex distances.
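For reference, here is a runnable Python version of the lazy-deletion idea sketched above, using the standard heapq module; the tiny graph literal at the bottom is just an invented example.

    import heapq

    def dijkstra(graph, source):
        dist = {source: 0}
        queue = [(0, source)]                  # entries are (distance, vertex)
        while queue:
            d, u = heapq.heappop(queue)
            if d > dist[u]:                    # stale (duplicate) entry: skip it
                continue
            for v, w in graph[u]:
                new_dist = d + w
                if new_dist < dist.get(v, float("inf")):
                    dist[v] = new_dist         # "relax": re-insert instead of decrease-key
                    heapq.heappush(queue, (new_dist, v))
        return dist

    graph = {"a": [("b", 2), ("c", 5)], "b": [("c", 1)], "c": []}
    print(dijkstra(graph, "a"))   # {'a': 0, 'b': 2, 'c': 3}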
[1] Priority Queues and Dijkstra’s Algorithm 2007

Sorted queue with dropping out elements

I have a list of jobs and a queue of workers waiting for these jobs. All the jobs are the same, but the workers are different and are sorted by their ability to perform the job. That is, the first person can do the job best of all, the second does it just a little bit worse, and so on. A job is always assigned to the person with the highest skill among those who are free at that moment. When a person is assigned a job, he drops out of the queue for some time, but when he is done, he returns to his position. So, for example, at some moment in time the worker queue looks like:
[x, x, .83, x, .7, .63, .55, .54, .48, ...]
where the x's stand for missing workers and the numbers show the skill level of the remaining workers. When there's a new job, it is assigned to the 3rd worker, as the one with the highest skill among the available workers. So the next moment the queue looks like:
[x, x, x, x, .7, .63, .55, .54, .48, ...]
Let's say that at this moment worker #2 finishes his job and gets back into the list:
[x, .91, x, x, .7, .63, .55, .54, .48, ...]
I hope the process is completely clear now. My question is: what algorithm and data structure should I use to implement quick search for and deletion of a worker, plus insertion back into his position?
For the moment the best approach I can see is to use a Fibonacci heap, which has amortized O(log n) for deleting the minimal element (assigning a job and removing the worker from the queue) and O(1) for inserting him back, which is pretty good. But is there an even better algorithm / data structure, one that perhaps takes into account the fact that the elements are already sorted and only drop out of the queue from time to time?
As a theoretical exercise, you might consider pre-processing your data to reduce everything to small integers giving position in the full queue. The first thing that springs to mind then is http://en.wikipedia.org/wiki/Van_Emde_Boas_tree, which could in theory reduce log n to log log n. Note that at the end of this article there are some ideas for slightly less impractical solutions. The article off http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.137.8757 (Integer Priority Queues with Decrease Key in constant time...) claims in particular to be theoretically better than Fibonacci trees for the case of small integer keys, and does note the link with the sorting problem - and references another paper with links to sorting - also very theoretical.
Use a regular heap, which is easier to implement than a Fibonacci heap and also supports insertions and deletions in O(lg n) (you have as many deletions as insertions, so getting cheaper insertions isn't worth that much). As opposed to Fibonacci heaps, regular heaps are often available in standard libraries, such as priority_queue in the C++ STL.
If a faster data structure existed, you could use it to sort faster than the Omega(n lg n) lower bound for comparison sorting, which is impossible in the general case. If the skill-level numbers have some special properties (say, they are integers within a restricted range), sorting faster than Omega(n lg n) is possible, but I don't know whether faster priority queues exist in that case.
("rambo coder"'s comments are absolutely right, by the way; you should compare the actual performance with a heap to that of an unsorted list).

Looking for a sort algorithm with as few compare operations as possible

I want to sort items where the comparison is performed by humans:
Pictures
Priority of work items
...
For these tasks the number of comparisons is the limiting factor for performance.
What is the minimum number of comparisons needed (I assume > N for N items)?
Which algorithm guarantees this minimum number?
To answer this, we need to make a lot of assumptions.
Let's assume we are sorting pictures by cuteness. The goal is to get the maximum usable information from the human in the least amount of time. This interaction will dominate all other computation, so it's the only one that counts.
As someone else mentioned, humans can deal well with ordering several items in one interaction. Let's say we can get eight items in relative order per round.
Each round introduces seven edges into a directed graph where the nodes are the pictures. If node A is reachable from node B, then node A is cuter than node B. Keep this graph in mind.
Now, let me tell you about a problem the Navy and the Air Force solve differently. They both want to get a group of people in height order, and quickly. The Navy tells people to get in line; then, if you're shorter than the guy in front of you, you switch places, and this repeats until done. In the worst case that's N*N comparisons.
The Air Force tells people to stand in a square grid. They shuffle front-to-back along columns of sqrt(N) people, which means a worst case of sqrt(N)*sqrt(N) == N comparisons. However, the people are then only sorted along one dimension. So the people face left and do the same shuffle again. Now we're up to 2*N comparisons, and the sort is still imperfect, but it's good enough for government work. There's a short corner, a tall corner opposite, and a clear diagonal height gradient.
You can see how the Air Force method gets results in less time if you don't care about perfection. You can also see how to achieve perfection efficiently. You already know that the very shortest and very tallest people are in two corners. The second-shortest might be behind or beside the shortest, the third-shortest might be behind or beside him. In general, someone's height rank is also his maximum possible Manhattan distance from the short corner.
Looking back at the graph analogy, the eight nodes to present each round are eight of those with the currently most common length of longest inbound path. The length of the longest inbound path also represents the node's minimum possible sorted rank.
You'll use a lot of CPU following this plan, but you will make the best possible use of your human resources.
From an assignment I once did on this very subject ...
These comparison counts are for various sorting algorithms operating on data in random order:
Size QkSort HpSort MrgSort ModQk InsrtSort
2500 31388 48792 25105 27646 1554230
5000 67818 107632 55216 65706 6082243
10000 153838 235641 120394 141623 25430257
20000 320535 510824 260995 300319 100361684
40000 759202 1101835 561676 685937
80000 1561245 2363171 1203335 1438017
160000 3295500 5045861 2567554 3047186
These comparison counts are for various sorting algorithms operating on data that starts out 'nearly sorted'. Amongst other things, it shows the pathological case of quicksort.
Size QkSort HpSort MrgSort ModQk InsrtSort
2500 72029 46428 16001 70618 76050
5000 181370 102934 34503 190391 3016042
10000 383228 226223 74006 303128 12793735
20000 940771 491648 158015 744557 50456526
40000 2208720 1065689 336031 1634659
80000 4669465 2289350 712062 3820384
160000 11748287 4878598 1504127 10173850
From this we can see that merge sort is the best by number of comparisons.
I can't remember what the modifications to the quick sort algorithm were, but I believe it was something that used insertion sorts once the individual chunks got down to a certain size. This sort of thing is commonly done to optimise quicksort.
You might also want to look up Tadao Takaoka's 'Minimal Merge Sort', which is a more efficient version of the merge sort.
Pigeonhole sorting is O(N) and works well with humans if the data can be pigeonholed. A good example would be counting votes in an election.
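A tiny sketch of the vote-counting example (the ballots are made up): each candidate is a pigeonhole, and a single O(N) pass needs no comparisons between items at all.

    from collections import defaultdict

    ballots = ["alice", "bob", "alice", "carol", "alice", "bob"]

    holes = defaultdict(int)
    for ballot in ballots:        # one O(N) pass, no item-to-item comparisons
        holes[ballot] += 1
    print(dict(holes))            # {'alice': 3, 'bob': 2, 'carol': 1}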
You should consider that humans might make non-transitive comparisons, e.g. they favor A over B, B over C but also C over A. So when choosing your sort algorithm, make sure it doesn't completely break when that happens.
People are really good at ordering 5-10 things from best to worst, and they come up with more consistent results when doing so. I think trying to apply a classical sorting algorithm might not work here because of the typically human multi-compare approach.
I'd argue that you should have a round-robin type approach and try to bucket things into their most consistent groups each time. Each iteration would only make the result more certain.
It'd be interesting to write too :)
If comparisons are expensive relative to book-keeping costs, you might try the following algorithm which I call "tournament sort". First, some definitions:
Every node has a numeric "score" property (which must be able to hold values from 1 to the number of nodes), and "last-beat" and "fellow-loser" properties, which must be able to hold node references.
A node is "better" than another node if it should be output before the other.
An element is considered "eligible" if there are no elements known to be better than it which have been output, and "ineligible" if any element which has not been output is known to be better than it.
The "score" of a node is the number of nodes it's known to be better than, plus one.
To run the algorithm, initially assign every node a score of 1. Repeatedly compare the two lowest-scoring eligible nodes; after each comparison, mark the loser "ineligible", and add the loser's score to the winner's (the loser's score is unaltered). Set the loser's "fellow loser" property to the winner's "last-beat", and the winner's "last-beat" property to the loser. Iterate this until only one eligible node remains. Output that node, and make eligible all nodes the winner beat (using the winner's "last-beat" and the chain of "fellow-loser" properties). Then continue the algorithm on the remaining nodes.
The number of comparisons with 1,000,000 items was slightly lower than that of a stock library implementation of Quicksort; I'm not sure how the algorithm would compare against a more modern version of QuickSort. Bookkeeping costs are significant, but if comparisons are sufficiently expensive the savings could possibly be worth it. One interesting feature of this algorithm is that it will only perform comparisons relevant to determining the next node to be output; I know of no other algorithm with that feature.
I don't think you're likely to get a better answer than the Wikipedia page on sorting.
Summary:
For arbitrary comparisons (where you can't use something like radix sorting) the best you can achieve is O(n log n)
Various algorithms achieve this - see the "comparison of algorithms" section.
The commonly used QuickSort is O(n log n) in a typical case, but O(n^2) in the worst case; there are often ways to avoid this, but if you're really worried about the cost of comparisons, I'd go with something like MergeSort or a HeapSort. It partly depends on your existing data structures.
If humans are doing the comparisons, are they also doing the sorting? Do you have a fixed data structure you need to use, or could you effectively create a copy using a balanced binary tree insertion sort? What are the storage requirements?
Here is a comparison of algorithms. The two best candidates are quicksort and merge sort. Quicksort is in general better, but has worse worst-case performance.
Merge sort is definitely the way to go here, as you can use a Map/Reduce-type scheme to have several humans doing the comparisons in parallel.
Quicksort is essentially a single-threaded sort algorithm.
You could also tweak the merge sort algorithm so that instead of comparing two objects you present your human with a list of, say, five items and ask him or her to rank them.
Another possibility would be to use a ranking system like the one used by the famous "Hot or Not" web site. This requires many more comparisons, but the comparisons can happen in any sequence and in parallel, so this would work faster than a classic sort provided you have enough humans at your disposal.
The question raises more questions, really.
Are we talking about a single human performing the comparisons? It's a very different challenge if you are talking about a group of humans trying to arrange objects in order.
What about the questions of trust and error? Not everyone can be trusted to get everything right; certain sorts would go catastrophically wrong if, at any given point, you provided the wrong answer to a single comparison.
What about subjectivity? "Rank these pictures in order of cuteness." Once you get to this point, it could get really complex. As someone else mentions, something like "Hot or Not" is the simplest conceptually, but isn't very efficient. At its most complex, I'd say that Google is a way of sorting objects into an order, where the search engine infers the comparisons made by humans.
The best one would be merge sort.
The minimum run time is n*log(n) [base-2 logarithm].
The way it is implemented is:
If the list is of length 0 or 1, then it is already sorted.
Otherwise:
Divide the unsorted list into two sublists of about half the size.
Sort each sublist recursively by re-applying merge sort.
Merge the two sublists back into one sorted list.
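As a sketch of those steps, here is a merge sort where the comparison is pulled out into a single function that a human (rather than code) could answer; the placeholder below just compares numbers.

    def human_prefers(a, b):
        # Placeholder: a real tool would ask a person which item should come first.
        return a <= b

    def merge_sort(items):
        if len(items) <= 1:                      # a list of length 0 or 1 is sorted
            return items
        mid = len(items) // 2
        left = merge_sort(items[:mid])           # sort each half recursively
        right = merge_sort(items[mid:])
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):  # merge, one comparison at a time
            if human_prefers(left[i], right[j]):
                merged.append(left[i])
                i += 1
            else:
                merged.append(right[j])
                j += 1
        return merged + left[i:] + right[j:]

    print(merge_sort([3, 1, 4, 1, 5, 9, 2, 6]))   # [1, 1, 2, 3, 4, 5, 6, 9]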
