What's the best data structure to handle customer orders? - data-structures

I'm thinking of a queue (FIFO), but what if at some point I need to prioritize orders? For example: any order that includes milk should be pushed to the back of the queue until milk is available, and once it is, those orders should be put back in their previous positions.
If I use a queue here, I will end up with at least O(n log n) time complexity.
Any suggestions?

One possibility is to have two data structures. Use the FIFO to hold orders for which you have all the ingredients, and use a different data structure for orders that are waiting for something. Once that thing comes in, you can put the order into the FIFO. Adding to the end of the queue is of course O(1); putting an order back in its proper place will require O(n) time.
If you want the order that was held to get put back into the queue in the place it should have gone, then you probably want a priority queue that uses order number (if they're sequential), or time.
Although it takes O(log n) time to insert into or remove from a priority queue, that's not going to be a problem unless you're processing thousands of orders a second. Insertion is worst case O(log n), but in a system such as yours, where the priority queue is mostly just a FIFO with a few exceptions, expected insertion will be close to O(1).
I should clarify that most priority queue implementations (in C++, Java, Python, and other mainstream languages) use a binary heap for the backing store. Insertion into a binary heap is worst case O(log n), but analysis shows that it's usually closer to O(1). See Argument for O(1) average-case complexity of heap insertion. You could also implement a pairing heap or other advanced heap type, which has O(1) amortized insertion. But again, unless you're processing thousands of orders per second, that's probably overkill.
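A minimal sketch of this approach in Python, combining a held-orders map with a heapq-based queue keyed on sequential order numbers (the function names and the restock hook are assumptions for illustration):

    import heapq
    from collections import defaultdict

    ready = []                   # min-heap of (order_number, order): the "FIFO"
    waiting = defaultdict(list)  # missing ingredient -> orders held for it

    def place_order(number, order, missing=None):
        # Order numbers are assumed unique and sequential.
        if missing:
            waiting[missing].append((number, order))  # hold until restocked
        else:
            heapq.heappush(ready, (number, order))    # O(log n) worst case

    def restock(ingredient):
        for number, order in waiting.pop(ingredient, []):
            heapq.heappush(ready, (number, order))    # returns to its proper place

    def next_order():
        return heapq.heappop(ready)[1]                # lowest order number first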
Another option is, when you get these exception orders, just push them to the front of the queue once all the necessary items are available. That's easy to do with a double-ended queue. How effective that will be kind of depends on how long things usually sit in the queue, and how long it takes to re-stock items that you run out of.
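For that variant, Python's collections.deque supports O(1) insertion at both ends; a quick sketch:

    from collections import deque

    q = deque(["order 1", "order 2"])  # normal FIFO contents
    q.append("order 3")                # enqueue at the back: O(1)
    q.appendleft("held order")         # exception order jumps to the front: O(1)
    print(q.popleft())                 # -> "held order"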

Related

Clarification on Sedgewick "Algorithms" heapsort chapter remark (4th ed., chapter 2.4)

I'm currently reading the Algorithms book. The Q&A section for chapter 2.4, on the heapsort implementation based on a priority queue (p. 328), has the following passage (let's focus on the priority-queue heap, not on heapsort):
Q. I’m still not clear on the purpose of priority queues. Why exactly don’t we just sort and then consider the items in increasing order in the sorted array?
A. In some data-processing examples such as TopM and Multiway, the total amount of data is far too large to consider sorting (or even storing in memory). If you are looking for the top ten entries among a billion items, do you really want to sort a billion-entry array? With a priority queue, you can do it with a ten-entry priority queue. In other examples, all the data does not even exist together at any point in time: we take something from the priority queue, process it, and as a result of processing it perhaps add some more things to the priority queue.
TopM and Multiway are simple clients of a priority queue. The book speaks about two phases of heapsort:
heap construction (the author uses a priority-queue heap; this is the phase we're interested in)
sortdown
In my understanding, heap construction is almost sorting ("heap order"). In order to build a heap, you practically need to visit each item in the original dataset.
Question: can anyone illustrate the author's point that I put in bold in the quote above? How can we build a heap without visiting all items? What am I missing here? Cheers for clarification.
Of course, you have to visit all entries; just visiting them takes O(n) time. But sorting them usually requires O(n log n) time, and as the author states, you don't have to sort all of them, only the ten greatest elements. The basic program (shown here as runnable Python using heapq, with t playing the role of "ten") would look as follows:
    import heapq

    def top_t(entries, t):
        q = []                           # min-heap holding the best t entries so far
        for e in entries:                # visit each entry in the input
            if len(q) < t:
                heapq.heappush(q, e)     # queue not yet full: just insert
            elif e > q[0]:               # e beats the current minimum, min(q)
                heapq.heapreplace(q, e)  # pop min(q), insert e, restore heap order
        return q                         # the t greatest entries
The basic point here is that you remove elements from the queue as soon as you know that they are not amongst the top-t entries. Hence, the insertion and exchange do not take O(log n) time but only O(log t). This reduces the overall time from O(n log n) to O(n log t), where log t is usually much smaller than log n.

Queue with O(1) enqueue and dequeue on average

I am stuck on an exercise in Problem Solving with Algorithms and Data Structures. The exercise says that one can implement a queue in which enqueue and dequeue are both O(1) on average and that there is one circumstance when dequeue is O(n).
The only thing I could think of was to use a list in which the front (dequeue side) of the queue is tracked by an index into the list. Enqueue is then an append at the end (which is O(1)), and dequeue works by copying the current "front" element and then moving the front index forward. But this is massively costly in space, and it is not the target answer because both operations are always O(1).
Any thoughts on this?
There are lots of ways to implement a queue and use only O(n) space.
How to implement a queue using two stacks?
C Program to Implement Queue Data Structure using Linked List
Circular buffer
... and I don't think you need to know the implementation that takes O(n) time to dequeue.
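For what it's worth, the two-stack queue from the first link fits the exercise: enqueue is O(1), dequeue is O(1) on average, and the one O(n) circumstance is a dequeue that finds the out-stack empty. A minimal Python sketch:

    class TwoStackQueue:
        # Queue built from two stacks (Python lists): enqueue is O(1), dequeue
        # is O(1) on average. The one O(n) circumstance is a dequeue that finds
        # the out-stack empty and must move everything over from the in-stack.
        def __init__(self):
            self._in = []    # receives enqueued items
            self._out = []   # holds items in dequeue order

        def enqueue(self, item):
            self._in.append(item)                 # always O(1)

        def dequeue(self):
            if not self._out:                     # the O(n) case
                while self._in:
                    self._out.append(self._in.pop())
            return self._out.pop()                # O(1) otherwise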

Soft heaps: what is corruption and why is it useful?

I recently read Bernard Chazelle's paper "The Soft Heap: An Approximate Priority Queue with Optimal Error Rate" (http://www.link.cs.cmu.edu/15859-f07/papers/chazelle-soft-heap.pdf).
The paper talks a lot about "corruption." What is corruption, how do elements get corrupted, and how does it help you?
I have spent a lot of time reading through the paper and Googling and this still doesn't make sense.
In most research papers on priority queues, each element in the queue has an associated number called a priority that's set when the element is inserted. The elements are then dequeued in order of increasing priority. Most programming languages these days that support priority queues don't actually use explicit priorities and instead rely on a comparison function to rank elements, but the soft heap uses the "associated numerical priority" model.
Because priority queues dequeue elements in increasing order of priority, they can be used to sort a sequence of values - start by inserting every element into the priority queue with priority equal to its rank in the sequence, then dequeuing all the elements from the priority queue. This pulls the elements out in sorted order.
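That reduction is a few lines in Python with heapq (heapq compares elements directly; to match the paper's explicit-priority model you would push (priority, element) pairs instead):

    import heapq

    def pq_sort(values):
        q = []
        for v in values:
            heapq.heappush(q, v)   # n enqueues
        # n dequeues pull the elements back out in sorted order
        return [heapq.heappop(q) for _ in range(len(values))]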
This connection between priority queues and sorting comes at a cost, though. There are known lower bounds on comparison sorting: no comparison-based sort can run faster than Ω(n log n). Consequently, there's a lower bound on the runtime of any comparison-based priority queue. Specifically, n enqueues and n dequeues must have a total cost of at least Ω(n log n). Most of the time that's fine, but in some cases it isn't fast enough.
As long as the priority queue can be used to sort the input sequence, the runtime of n enqueues and n dequeues will never beat Ω(n log n). But what if the priority queue doesn't sort the input? Take it to the extreme: if the priority queue hands back elements in a totally arbitrary order, then it's possible to implement n enqueues and n dequeues in O(n) time, just by using a stack or a queue, for example.
Intuitively, you can think of a soft heap as a bridge between the two extremes of "always sorted" and "no guarantees whatsoever about the order." Each soft heap is parameterized over some quantity ε, called the "corruption parameter," that determines how close to sorted the values that come out of the soft heap will be. Specifically, as ε gets closer to 0 the output becomes progressively more sorted, and as ε gets closer to 1 the output becomes progressively more arbitrary. Appropriately, the runtime of the soft heap operations is O(log(1/ε)), so the operations get cheaper and cheaper as ε goes up (and, therefore, the output gets less sorted) and more expensive as ε goes down (in which case the output gets more and more sorted).
The soft heap precisely quantifies how unsorted the output will be using the new concept of "corruption." In a normal priority queue, once you insert an element/priority pair, the element's priority never changes. In a soft heap, the priority associated with an element can become corrupted while the element is inside the soft heap. When an element's priority is corrupted, its priority goes up by some amount. (Since the soft heap dequeues elements in increasing order of priority, an increased priority means the element will come out of the queue later than it normally should.) In other words, corruption causes elements not to come out in sorted order, since the priorities of the elements when they're dequeued aren't necessarily the same as when they were enqueued.
The choice of ε tunes how many different elements can have their priorities corrupted. With ε small, fewer elements have corrupted priorities, and with ε large, more elements will have corrupted priorities.
Now, to your specific questions - how do elements' priorities get corrupted, and how does that help you? Your first question is a good one - how does the data structure decide when to corrupt priorities? There are two ways of viewing this. First, you can think of a soft heap as a data structure where you specify in advance how much corruption is acceptable (that's the ε parameter), and the data structure then internally decides when and how to corrupt priorities so long as it doesn't exceed some total corruption level. If it seems weird to have a data structure make decisions like this, think about something like a Bloom filter or skiplist, where there really are random choices going on internally that can impact the observable behavior of the data structure. It turns out that the soft heap typically is not implemented using randomness (an impressive feature to have!), but that's not particularly relevant here.
Internally, the two known implementations of soft heaps (the one from Chazelle's original paper, and a later cleanup using binary trees) implement corruption using a technique called carpooling, where elements are grouped together and all share a common priority. The corruption occurs because the original priorities of all the elements in each group are forgotten and a new priority is used instead. The actual details of how the elements are grouped are frighteningly complex and aren't really worth looking into, so it's probably best to think of it as "the data structure chooses to corrupt however it wants, as long as it doesn't corrupt more elements than you specified when you picked ε."
Next, why is this useful? In practice, it isn't. The soft heap is almost exclusively of theoretical interest. The reason it's nice in theory is that the runtime of n insertions and n deletions from a soft heap can be O(n), faster than O(n log n), if ε is chosen correctly. Originally, soft heaps were used as a building block in a fast algorithm for building minimum spanning trees. They're also used in a new deterministic algorithm for linear-time selection, the first since the famous median-of-medians algorithm. In both cases, the soft heap is used to "approximately" sort the input elements in a way that gives the algorithm a rough approximation of a sorted sequence, at which point the algorithm does some extra logic to correct for the lack of perfection. You almost certainly will never see a soft heap used in practice, but if you do end up finding a case where you do, please leave a comment and let me know!
To summarize:
Corrupting priorities is a way of making a tradeoff between perfect sorting (exact, but slow) and arbitrary ordering (inexact, but very fast). The parameter ε determines where on the spectrum the amount of corruption lies.
Corruption works by changing the priorities of existing elements in the soft heap, in particular by raising the priorities of some elements. Low corruption corresponds to approximately sorted sequences, while high corruption corresponds to more arbitrary sequences.
The way corruption is performed is data-structure specific and very hard to understand. It's best to think of soft heaps as performing corruption when they need to, but never in a way that exceeds the limit imposed by the choice of ε.
Corruption is helpful in theoretical settings where sorting is too slow, but an approximately correctly sorted sequence is good enough for practical purposes. It's unlikely to be useful in practice.
Hope this helps!
The answer is on the second page of the paper:
"The soft heap may, at any time, increase the value of certain keys. Such keys, and
by extension, the corresponding items, are called corrupted. Corruption is entirely
at the discretion of the data structure and the user has no control over it.
Naturally, findmin returns the minimum current key, which might or might not
be corrupted. The benefit is speed: during heap updates, items travel together in
packets in a form of “car pooling,” in order to save time.
From an information-theoretic point of view, corruption is a way to decrease
the entropy of the data stored in the data structure, and thus facilitate its
treatment. The entropy is defined as the logarithm, in base two, of the number of
distinct key assignments (i.e., entropy of the uniform distribution over key
assignments). To see the soundness of this idea, push it to its limit, and observe
that if every key was corrupted by raising its value to `, then the set of keys
would have zero entropy and we could trivially perform all operations in constant
time. Interestingly, soft heaps show that the entropy need not drop to zero for
the complexity to become constant."
Is this a self-defeating data structure?

Best algorithm/data structure for a continually updated priority queue

I need to frequently find the minimum value object in a set that's being continually updated. I need to have a priority queue type of functionality. What's the best algorithm or data structure to do this? I was thinking of having a sorted tree/heap, and every time the value of an object is updated, I can remove the object, and re-insert it into the tree/heap. Is there a better way to accomplish this?
A binary heap is hard to beat for simplicity, but it has the disadvantage that decrease-key takes O(n) time. I know, the standard references say that it's O(log n), but first you have to find the item. That's O(n) for a standard binary heap.
By the way, if you do decide to use a binary heap, changing an item's priority doesn't require a remove and re-insert. You can change the item's priority in-place and then either bubble it up or sift it down as required.
If the performance of decrease-key is important, a good alternative is a pairing heap, which is theoretically slower than a Fibonacci heap but is much easier to implement and, in practice, faster due to lower constant factors. In practice, a pairing heap compares favorably with a binary heap, and outperforms it if you do a lot of decrease-key operations.
You could also marry a binary heap and a dictionary or hash map, and keep the dictionary updated with the position of the item in the heap. This gives you faster decrease-key at the cost of more memory and increased constant factors for the other operations.
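Here is a sketch of that combination in Python: a binary min-heap of (priority, item) pairs plus a dict mapping each item to its heap index, so change_priority finds an item in O(1) and re-sifts in O(log n). The class and method names are mine, and items are assumed to be unique and hashable:

    class IndexedMinHeap:
        # Binary min-heap plus a position map (item -> heap index), so
        # change_priority can locate an item in O(1) and re-sift in O(log n).
        def __init__(self):
            self._heap = []   # list of (priority, item) pairs
            self._pos = {}    # item -> index in self._heap

        def push(self, item, priority):
            self._heap.append((priority, item))
            self._pos[item] = len(self._heap) - 1
            self._sift_up(len(self._heap) - 1)

        def pop_min(self):
            if not self._heap:
                raise IndexError("pop from empty heap")
            self._swap(0, len(self._heap) - 1)
            priority, item = self._heap.pop()
            del self._pos[item]
            if self._heap:
                self._sift_down(0)
            return item, priority

        def change_priority(self, item, priority):
            i = self._pos[item]            # O(1) lookup instead of an O(n) scan
            old_priority, _ = self._heap[i]
            self._heap[i] = (priority, item)
            if priority < old_priority:
                self._sift_up(i)           # bubble up in place
            else:
                self._sift_down(i)         # sift down in place

        def _swap(self, i, j):
            self._heap[i], self._heap[j] = self._heap[j], self._heap[i]
            self._pos[self._heap[i][1]] = i
            self._pos[self._heap[j][1]] = j

        def _sift_up(self, i):
            while i > 0 and self._heap[i][0] < self._heap[(i - 1) // 2][0]:
                self._swap(i, (i - 1) // 2)
                i = (i - 1) // 2

        def _sift_down(self, i):
            n = len(self._heap)
            while True:
                smallest = i
                for child in (2 * i + 1, 2 * i + 2):
                    if child < n and self._heap[child][0] < self._heap[smallest][0]:
                        smallest = child
                if smallest == i:
                    return
                self._swap(i, smallest)
                i = smallest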
Quoting Wikipedia:
To improve performance, priority queues typically use a heap as their backbone, giving O(log n) performance for inserts and removals, and O(n) to build initially. Alternatively, when a self-balancing binary search tree is used, insertion and removal also take O(log n) time, although building trees from existing sequences of elements takes O(n log n) time; this is typical where one might already have access to these data structures, such as with third-party or standard libraries.
If you are looking for a better way, there must be something special about the objects in your priority queue. For example, if the keys are numbers from 1 to 10, a countsort-based approach may outperform the usual ones.
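As an illustration of that idea, a hypothetical bucket queue for integer keys in a small fixed range:

    class BucketQueue:
        # Counting-sort-style min-priority queue for integer keys in a small
        # fixed range; a sketch of the idea mentioned above.
        def __init__(self, max_key=10):
            self._buckets = [[] for _ in range(max_key + 1)]  # one list per key
            self._size = 0

        def push(self, item, key):
            self._buckets[key].append(item)  # O(1)
            self._size += 1

        def pop_min(self):
            if self._size == 0:
                raise KeyError("pop from empty queue")
            for key, bucket in enumerate(self._buckets):  # O(max_key): constant
                if bucket:
                    self._size -= 1
                    return bucket.pop(), key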
If your application looks anything like repeatedly choosing the next scheduled event in a discrete event simulation, you might consider the options listed in e.g. http://en.wikipedia.org/wiki/Discrete_event_simulation and http://www.acm-sigsim-mskr.org/Courseware/Fujimoto/Slides/FujimotoSlides-03-FutureEventList.pdf. The latter summarizes results from different implementations in this domain, including many of the options considered in other comments and answers, and a search will find a number of papers in this area. Priority queue overhead really does make a difference in how many times faster than real time your simulation can run; if you wish to simulate something that takes weeks of real time, this can be important.

Which data-structure to use for "dynamic" priority queueing?

I am looking for a datastructure to support a kind of advanced priority queueing. The idea is as follows. I need to sequentially process a number of items, and at any given point in time I know the "best" one to do next (based on some metric). The thing is, processing an item changes the metric for a few of the other items, so a static queue does not do the trick.
In my problem, I know which items need to have their priorities updated, so the datastructure I am looking for should have the methods
enqueue(item, priority)
dequeue()
requeue(item, new_priority)
Ideally I would like to requeue in O(log n) time. Any ideas?
There is an approach with time complexity close to what you require, though requeue runs in O(log n) only on average (amortized), if that is acceptable. With this approach you can use an existing priority queue without a requeue() function.
Assume you maintain a connection between the nodes in your graph and the elements in the priority queue, and let each element of the priority queue also store an extra bit called ignore. The algorithm for the modified dequeue runs as follows:
1. Call dequeue().
2. If the ignore bit in the element is true, go back to step 1; otherwise return the item.
The algorithm for the modified enqueue (which also performs the priority updates) runs as follows:
1. Call enqueue(item, priority).
2. Visit the neighbor nodes v of the item in the graph one by one:
   - change the ignore bit to true for the element currently linked to v in the queue
   - enqueue(v, new_priority(v))
   - change the connection of node v to the newly enqueued element
   - num_ignore++
3. If the number of ignored elements (num_ignore) is greater than the number of non-ignored elements, rebuild the priority queue: dequeue all elements, then enqueue only the non-ignored elements again.
In this scheme, setting the ignore bit takes constant time, so you basically delay the O(log n) "requeue" work until you have accumulated O(n) ignored elements, then clear them all at once, which takes O(n log n). On average, therefore, each "requeue" takes O(log n).
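Here is a small Python sketch of the same lazy-invalidation idea on top of heapq. It differs from the steps above in one detail: instead of counting ignored entries and rebuilding, it simply skips stale entries during dequeue, which is a common variant (all names are illustrative):

    import heapq
    import itertools

    class LazyPriorityQueue:
        # Priority queue with requeue via an "ignore" flag, on top of heapq.
        def __init__(self):
            self._heap = []     # entries: [priority, tie_breaker, item, valid]
            self._entry = {}    # item -> its current (valid) heap entry
            self._counter = itertools.count()

        def enqueue(self, item, priority):
            entry = [priority, next(self._counter), item, True]
            self._entry[item] = entry
            heapq.heappush(self._heap, entry)    # O(log n)

        def requeue(self, item, new_priority):
            self._entry[item][-1] = False        # set the ignore bit: O(1)
            self.enqueue(item, new_priority)     # O(log n)

        def dequeue(self):
            while self._heap:
                priority, _, item, valid = heapq.heappop(self._heap)
                if valid:                        # skip ignored entries
                    del self._entry[item]
                    return item, priority
            raise KeyError("dequeue from an empty queue")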
You cannot achieve the complexity you are asking for, since when updating elements the complexity must also depend on the number of elements updated.
However, if we assume that the number of updated elements in a given step is p, most typical heap implementations will give you O(1) to read the max element's value, O(log n) to dequeue, and O(p log n) for the update operations. I would personally go for a binary heap, as it is fairly easy to implement and will do what you are asking for.
A priority queue is exactly for this. You can implement it, for example, using a max-heap.
http://www.eecs.wsu.edu/~ananth/CptS223/Lectures/heaps.pdf describes the increaseKey(), decreaseKey(), and remove() operations, which would let you do what you want. I haven't figured out whether the C++ standard library implementation supports them yet.
Further, http://theboostcpplibraries.com/boost.heap seems to support update() for some of its heap classes, but I haven't found a full reference yet.
