How do queues provide O(1) performance on takes/puts?

I assume the constant-time performance of takes/puts is achieved by allowing consumers and producers to access the tail/head of the queue without locking each other. How is this achieved for in-memory queues? Does the answer change for durable queues (probably)? How is this solved in a system that imposes a limit of one producer and one consumer? How about when the system allows concurrent access?

A queue is often backed by a doubly linked list. In fact, a queue in Java is commonly declared like this:
Queue<SomeClass> q = new LinkedList<>();
LinkedList in Java is a doubly linked list.
offer(), which inserts at the tail, is always O(1) because you don't need to traverse the list, and the same holds for poll(), which removes the head and returns it.
As far as concurrent access is concerned, it does not change the time complexity of these operations. Note, however, that LinkedList itself is not thread-safe, so concurrent producers and consumers need external synchronization or a concurrent queue implementation.
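To make the constant-time claim concrete, here is a minimal sketch that pushes the same three items through three standard java.util queue implementations. The class name FifoDemo and the helper drain are made up for illustration; only the queue classes themselves come from the JDK.

```java
import java.util.ArrayDeque;
import java.util.LinkedList;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// FifoDemo and drain() are hypothetical names used only for this sketch.
class FifoDemo {
    // Offers three items and polls them back; returns the poll order.
    static String drain(Queue<String> q) {
        q.offer("a"); // appended at the tail, no traversal
        q.offer("b");
        q.offer("c");
        StringBuilder out = new StringBuilder();
        while (!q.isEmpty()) {
            out.append(q.poll()); // removed from the head, no traversal
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(drain(new LinkedList<>()));            // doubly linked list
        System.out.println(drain(new ArrayDeque<>()));            // resizable circular array
        System.out.println(drain(new ConcurrentLinkedQueue<>())); // thread-safe, non-blocking
    }
}
```

Each call returns the items in insertion order. For the one-producer/one-consumer and concurrent cases from the question, ConcurrentLinkedQueue (or a BlockingQueue implementation) is probably the first thing to try before writing anything custom.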

Related

Why can't a priority queue wrap around like an ordinary queue?

I know that, to improve efficiency, queues use the wrap-around method to avoid moving everything down each time an element is deleted.
However, I do not understand why priority queues can't wrap around like ordinary queues. From my point of view, priority queues behave more like a stack than like a queue. How is that possible?
The most common priority queue implementation is a binary heap, which would not benefit from wrapping around. You could create a priority queue that's implemented in a circular buffer, but performance would suffer.
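For reference, the wrap-around method the question mentions can be sketched as a fixed-capacity ring buffer. This is only an illustration: the class name RingQueue and its int payload are assumptions, not anything from the question.

```java
// RingQueue is a hypothetical name; a fixed-capacity FIFO queue of ints.
class RingQueue {
    private final int[] items;
    private int head = 0; // index of the oldest element
    private int size = 0; // number of stored elements

    RingQueue(int capacity) {
        items = new int[capacity];
    }

    // Adds at the tail; returns false if the buffer is full.
    boolean offer(int x) {
        if (size == items.length) {
            return false;
        }
        items[(head + size) % items.length] = x; // tail index wraps around
        size++;
        return true;
    }

    // Removes from the head; nothing is shifted, only the index moves.
    int poll() {
        if (size == 0) {
            throw new IllegalStateException("queue is empty");
        }
        int x = items[head];
        head = (head + 1) % items.length; // head index wraps around
        size--;
        return x;
    }

    int size() {
        return size;
    }
}
```

Wrapping works here because a plain queue only ever touches its two ends. A binary heap reorders elements along parent/child index paths instead, so there is no shifting for a wrap-around to avoid.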
It's important to remember that a priority queue is an abstract data type. It defines the operations, but not the implementation. You can implement a priority queue as a binary heap, a sorted array, an unsorted array, a binary tree, a skip list, a linked list, etc. There are many different ways to implement a priority queue.
Binary heap, on the other hand, is a specific implementation of the priority queue abstract data type.
As for stack vs queue: in actuality, stacks and queues are just specializations of the priority queue. If you consider time as the priority, then what we call a queue (a FIFO data structure), is actually a priority queue in which the oldest item is the highest priority. A stack (a LIFO data structure) is a priority queue in which the newest item is the highest priority.
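The "time as priority" idea can be tried out with java.util.PriorityQueue. In this sketch the arrival counter seq stands in for a timestamp, and the names TimeAsPriority, Stamped, fifo and lifo are invented for the example.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical demo class: a priority queue ordered by arrival time.
class TimeAsPriority {
    // An item tagged with its arrival order (a stand-in for a timestamp).
    static final class Stamped {
        final long seq;
        final String value;

        Stamped(long seq, String value) {
            this.seq = seq;
            this.value = value;
        }
    }

    // Inserts the values in order, then polls them using the given priority.
    static String drain(Comparator<Stamped> order, String... values) {
        PriorityQueue<Stamped> pq = new PriorityQueue<>(order);
        long seq = 0;
        for (String v : values) {
            pq.add(new Stamped(seq++, v));
        }
        StringBuilder out = new StringBuilder();
        while (!pq.isEmpty()) {
            out.append(pq.poll().value);
        }
        return out.toString();
    }

    // Oldest item has the highest priority: behaves like a queue.
    static String fifo(String... values) {
        return drain(Comparator.comparingLong((Stamped s) -> s.seq), values);
    }

    // Newest item has the highest priority: behaves like a stack.
    static String lifo(String... values) {
        return drain(Comparator.comparingLong((Stamped s) -> s.seq).reversed(), values);
    }
}
```

With the inputs "a", "b", "c", fifo gives back "abc" and lifo gives back "cba". Each operation costs O(log n) this way, which is why real stacks and queues use simpler structures.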

Are Priority Queues really Queues?

In a priority queue, elements are inserted and deleted according to their priority, so the insertion and deletion code for any priority queue works in terms of element priority.
Suppose you have a queue with elements 1, 5, 6, where the priority of each element is its value, and you need to insert an element of priority 3. The element is inserted at the second position in the queue, giving the new queue 1, 3, 5, 6.
But a queue is defined as a data structure in which elements can be inserted at the end and deleted at the beginning, but not in the middle, whereas in the case described above the element is inserted at the second position (that is, in the middle of the queue). So if priority queues do not obey the definition of a queue, are priority queues really queues?
Kindly explain.
Priority queues are "queues" in one sense of the word, in that elements wait their turn. They are not a subtype of the Queue abstract data type.
A queue is defined as a data structure in which elements may be inserted at the end and removed from the beginning, but not in the middle; in the case described above, however, the element is inserted at the second position (that is, in the middle of the queue).
Yes, a priority queue is still a queue in the sense that items are being served in the order in which they are located in the queue. However, in this case a priority is associated with each item and they are served accordingly.
A priority queue is a queue in the sense of the English word queue, not as a strict subtype of the other data structure named 'queue'. There is no inheritance going on there, they're just names that describe their purpose.
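The 1, 5, 6 example from the question can be checked with java.util.PriorityQueue, which is a binary heap. The class and method names below are made up for this sketch.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

// Hypothetical demo of the example from the question.
class PriorityOrderDemo {
    // Starts with 1, 5, 6, inserts 3, and returns the order in which
    // elements are served.
    static String serveOrder() {
        PriorityQueue<Integer> pq = new PriorityQueue<>(Arrays.asList(1, 5, 6));
        pq.add(3); // no position is chosen by the caller
        StringBuilder out = new StringBuilder();
        while (!pq.isEmpty()) {
            if (out.length() > 0) {
                out.append(",");
            }
            out.append(pq.poll()); // always the smallest remaining element
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(serveOrder());
    }
}
```

The elements are served as 1,3,5,6, but with a heap-backed implementation the internal storage is generally not kept in sorted order. "Inserted at the second position" describes the serving order, not necessarily where the element physically lives.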

How to store few millions of cache and then track down 20 oldest cache

I got an interview question saying I need to store a few million cache entries, keep track of the 20 oldest entries, and, as soon as the cache grows past its threshold, replace the 20 oldest with the next set of oldest entries.
I answered that I would keep a hash map for it. The follow-up question was how to access any element in the hash map quickly. I said that since it is a map, access would not take long, but the interviewer was not satisfied. So what would be the ideal approach for such scenarios?
A queue is well-suited to finding and removing the oldest members.
A queue implemented as a doubly linked list has O(1) insertion and deletion at both ends.
A priority queue lends itself to giving different weights to different items in the queue (e.g. some queue elements may be more expensive to re-create than others).
You can use a hash map to hold the actual elements and find them quickly based on the hash key, and a queue of the hash keys to track age of cache elements.
By using a doubly linked list for the queue and also maintaining a hash map of the elements, you should be able to make a cache that supports a max size (or even an LRU cache). This should result in references to objects being stored 3 times and not the object being stored twice; be sure to check for this if you implement it (a simple way to avoid this is to just queue the hash key).
When checking for overflow, you just pop the last item off the queue and then remove it from the hash map.
When accessing an item, you can use the hash map to find the cached item. Then, if you are implementing an LRU cache, you just remove it from the queue and add it back to the beginning.
By using this structure, Insert, Update, Read, and Delete are all going to be O(1).
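In Java, one way to get this hash map plus doubly linked list combination without writing the list by hand is java.util.LinkedHashMap in access order. This is a swapped-in shortcut rather than the hand-rolled structure described above, and the class name LruCache and the field maxEntries are assumptions for the sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical LRU cache built on LinkedHashMap, which internally keeps
// a hash table plus a doubly linked list of its entries.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        // accessOrder = true: get() and put() move an entry to the
        // most-recently-used end of the internal list.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    // Called by put(); returning true evicts the least recently used entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```

For example, with a capacity of 2, putting "a" and "b", reading "a", and then putting "c" evicts "b", because "a" was touched more recently. This class is not thread-safe as written; wrapping it with Collections.synchronizedMap is one simple option.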
The follow-on question to expect is for the interviewer to ask for items to have a time-to-live (TTL) that varies per cached item. For this you need another queue that keeps items ordered by time-to-live. The only problem is that inserts now become O(n), because you have to scan the TTL queue to find the right spot, so you have to decide whether the memory cost of storing the TTL queue as an n-tree is worthwhile (which yields O(log n) insert time). Alternatively, you could implement the TTL queue as buckets, one per ~1-minute interval or similar. You still get ~O(1) inserts and only slightly degrade the performance of the expiration background process (and it is a background process).
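A rough sketch of the tree-based TTL index, using java.util.TreeMap (a red-black tree) as a stand-in for the tree mentioned above. The class name TtlIndex, the method names, and the use of caller-supplied millisecond timestamps are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical TTL index: expiry time -> cache keys expiring at that time.
class TtlIndex {
    // Sorted by expiry time, so inserts cost O(log n).
    private final TreeMap<Long, List<String>> byExpiry = new TreeMap<>();

    void add(String key, long nowMillis, long ttlMillis) {
        byExpiry.computeIfAbsent(nowMillis + ttlMillis, t -> new ArrayList<>()).add(key);
    }

    // Removes and returns every key whose expiry time is <= nowMillis,
    // earliest expiry first.
    List<String> expire(long nowMillis) {
        List<String> expired = new ArrayList<>();
        Map<Long, List<String>> due = byExpiry.headMap(nowMillis, true);
        for (List<String> keys : due.values()) {
            expired.addAll(keys);
        }
        due.clear(); // the view is backed by the map, so this removes the entries
        return expired;
    }
}
```

The expired keys would then be removed from the main hash map by the background process. Rounding the expiry time to a coarse interval before inserting would turn this into the bucket variant.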

Is it possible to declare a maximum queue size with AMQP?

As the title says — is it possible to declare a maximum queue size and broker behaviour when this maximum size is reached? Or is this a broker-specific option?
I ask because I'm trying to learn about AMQP, not because I have this specific problem with any specific broker… But broker-specific answers would still be insightful.
AFAIK you can't declare maximum queue size with RabbitMQ.
Also there's no such setting in the AMQP spec:
http://www.rabbitmq.com/amqp-0-9-1-quickref.html#queue.declare
Depending on why you're asking, you might not actually need a maximum queue size. Since version 2.0, RabbitMQ will seamlessly persist large queues to disk instead of storing all the messages in RAM. So if your concern is the broker crashing because it exhausts its resources, this actually isn't much of a problem in most circumstances, assuming you aren't strapped for hard disk space.
In general this persistence actually has very little performance impact, because by definition the only "hot" parts of the queue are the head and tail, which stay in RAM; the majority of the backlog is "cold" so it makes little difference that it's sitting on disk instead.
We've recently discovered that at high throughput it isn't quite that simple - under some circumstances the throughput can deteriorate as the queue grows, which can lead to unbounded queue growth. But when that happens is a function of CPU, and we went for quite some time without hitting it.
You can read about RabbitMQ's maximum queue length implementation here: http://www.rabbitmq.com/maxlength.html
It does not block incoming messages; instead, messages are dropped from the head of the queue.
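To illustrate what "drop from the head" means, here is a plain in-memory simulation. It does not use any RabbitMQ API; the class DropHeadQueue and its methods are invented for this sketch, and it assumes a maximum length of at least 1.

```java
import java.util.ArrayDeque;

// Hypothetical in-memory model of a length-limited queue that drops
// its oldest message when a new one arrives at full capacity.
class DropHeadQueue {
    private final ArrayDeque<String> messages = new ArrayDeque<>();
    private final int maxLength; // assumed to be >= 1

    DropHeadQueue(int maxLength) {
        this.maxLength = maxLength;
    }

    // Never blocks or rejects the publisher.
    void publish(String message) {
        if (messages.size() == maxLength) {
            messages.pollFirst(); // discard the oldest message (the head)
        }
        messages.addLast(message);
    }

    String contents() {
        return messages.toString();
    }
}
```

With a maximum length of 2, publishing "a", "b", "c" leaves the queue holding b and c: the publisher is never slowed down, but the oldest data is lost.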
You should definitely read about Flow control here:
http://www.rabbitmq.com/memory.html
With Qpid, yes.
You can configure a maximum queue size and the policy to apply when the maximum is reached: ring, ignore messages, or break the connection.
You also have LVQ (last value) queues, which are very configurable.
There are some things that you can't do with brokers, but you can do in your app. For instance, there are two AMQP methods, basic.get and queue.declare, which return the number of messages in the queue. You can use this to periodically get a count of outstanding messages and take action (like start new consumer processes) if the message count gets too high.

A priority queue which allows efficient priority update?

UPDATE: Here's my implementation of Hashed Timing Wheels. Please let me know if you have an idea to improve the performance and concurrency. (20-Jan-2009)
// Sample usage:
public static void main(String[] args) throws Exception {
    Timer timer = new HashedWheelTimer();
    for (int i = 0; i < 100000; i++) {
        timer.newTimeout(new TimerTask() {
            public void run(Timeout timeout) throws Exception {
                // Extend another second.
                timeout.extend();
            }
        }, 1000, TimeUnit.MILLISECONDS);
    }
}
UPDATE: I solved this problem by using Hierarchical and Hashed Timing Wheels. (19-Jan-2009)
I'm trying to implement a special-purpose timer in Java which is optimized for timeout handling. For example, a user can register a task with a deadline, and the timer notifies the user's callback method when the deadline has passed. In most cases, a registered task will be done within a very short amount of time, so most tasks will be canceled (e.g. task.cancel()) or rescheduled to the future (e.g. task.rescheduleToLater(1, TimeUnit.SECONDS)).
I want to use this timer to detect an idle socket connection (e.g. close the connection when no message is received in 10 seconds) and a write timeout (e.g. raise an exception when the write operation is not finished in 30 seconds). In most cases, the timeout will not occur: the client will send a message and the response will be sent unless there's a weird network issue.
I can't use java.util.Timer or java.util.concurrent.ScheduledThreadPoolExecutor because they assume most tasks are supposed to be timed out. If a task is cancelled, the cancelled task is stored in its internal heap until ScheduledThreadPoolExecutor.purge() is called, and it's a very expensive operation. (O(NlogN) perhaps?)
In the traditional heaps or priority queues I learned about in my CS classes, updating the priority of an element was an expensive operation (O(logN) in many cases), because it can only be achieved by removing the element and re-inserting it with a new priority value. Some heaps, like the Fibonacci heap, have O(1) decreaseKey() and min() operations, but what I need at least is a fast increaseKey() and min() (or decreaseKey() and max()).
Do you know any data structure which is highly optimized for this particular use case? One strategy I'm thinking of is just storing all tasks in a hash table and iterating all tasks every second or so, but it's not that beautiful.
How about trying to separate the handling of the normal case, where things complete quickly, from the error cases?
Use both a hash table and a priority queue. When a task is started it gets put in the hash table and if it finishes quickly it gets removed in O(1) time.
Every second you scan the hash table, and any tasks that have been running for a long time, say 0.75 seconds, get moved to the priority queue. The priority queue should always be small and easy to handle. This assumes that one second is much less than the timeout times you are looking for.
If scanning the hash table is too slow, you could use two hash tables, essentially one for even-numbered seconds and one for odd-numbered seconds. When a task gets started it is put in the current hash table. Every second move all the tasks from the non-current hash table into the priority queue and swap the hash tables so that the current hash table is now empty and the non-current table contains the tasks started between one and two seconds ago.
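The two-table variant could look roughly like this. It is only a sketch under several assumptions: tasks are identified by string ids, the caller invokes tick() once per second, and the names TwoTableTracker, Task, start, finish and tick are all invented here.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical tracker: two hash tables for young tasks, a small
// priority queue for tasks that have been running for a while.
class TwoTableTracker {
    static final class Task {
        final String id;
        final long startedAt;

        Task(String id, long startedAt) {
            this.id = id;
            this.startedAt = startedAt;
        }
    }

    private Map<String, Task> current = new HashMap<>();  // started this second
    private Map<String, Task> previous = new HashMap<>(); // started last second
    private final PriorityQueue<Task> slow =
            new PriorityQueue<>(Comparator.comparingLong((Task t) -> t.startedAt));

    void start(String id, long now) {
        current.put(id, new Task(id, now));
    }

    // Common case: the task is still in a hash table, so removal is O(1).
    void finish(String id) {
        if (current.remove(id) != null || previous.remove(id) != null) {
            return;
        }
        slow.removeIf(t -> t.id.equals(id)); // rare case, linear in the small queue
    }

    // Call once per second: age out the older table and swap.
    void tick() {
        slow.addAll(previous.values());
        previous = current;
        current = new HashMap<>();
    }

    int slowCount() {
        return slow.size();
    }

    // Id of the oldest long-running task, or null if there is none.
    String oldestSlow() {
        Task t = slow.peek();
        return t == null ? null : t.id;
    }
}
```

Tasks that finish within a second or two never touch the priority queue at all, which is the point of the design.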
These options are a lot more complicated than just using a priority queue, but they are pretty easily implemented and should be stable.
To the best of my knowledge (I wrote a paper about a new priority queue, which also reviewed past results), no priority queue implementation gets the bounds of Fibonacci heaps, as well as constant-time increase-key.
There is a small problem with getting that literally. If you could get increase-key in O(1), then you could get delete in O(1) -- just increase the key to +infinity (you can handle the queue being full of lots of +infinitys using some standard amortization tricks). But if find-min is also O(1), that means delete-min = find-min + delete becomes O(1). That's impossible in a comparison-based priority queue because the sorting bound implies (insert everything, then remove one-by-one) that
n * insert + n * delete-min > n log n.
The point here is that if you want a priority-queue to support increase-key in O(1), then you must accept one of the following penalties:
Not be comparison based. Actually, this is a pretty good way to get around things, e.g. vEB trees.
Accept O(log n) for inserts and also O(n log n) for make-heap (given n starting values). This sucks.
Accept O(log n) for find-min. This is entirely acceptable if you never actually do find-min (without an accompanying delete).
But, again, to the best of my knowledge, no one has done the last option. I've always seen it as an opportunity for new results in a pretty basic area of data structures.
Use Hashed Timing Wheel - Google 'Hashed Hierarchical Timing Wheels' for more information. It's a generalization of the answers made by people here. I'd prefer a hashed timing wheel with a large wheel size to hierarchical timing wheels.
Some combination of hashes and O(logN) structures should do what you ask.
I'm tempted to quibble with the way you're analyzing the problem. In your comment above, you say
Because the update will occur very very frequently. Let's say we are sending M messages per connection then the overall time becomes O(MNlogN), which is pretty big. – Trustin Lee (6 hours ago)
which is absolutely correct as far as it goes. But most people I know would concentrate on the cost per message, on the theory that as your app has more and more work to do, obviously it's going to require more resources.
So if your application has a billion sockets open simultaneously (is that really likely?) the insertion cost is only about 60 comparisons per message.
I'll bet money that this is premature optimization: you haven't actually measured the bottlenecks in your system with a performance analysis tool like CodeAnalyst or VTune.
Anyway, there's probably an infinite number of ways of doing what you ask, once you just decide that no single structure will do what you want, and you want some combination of the strengths and weaknesses of different algorithms.
One possibility is to divide the socket domain N into some number of buckets of size B, and then hash each socket into one of those (N/B) buckets. In that bucket is a heap (or whatever) with O(log B) update time. If an upper bound on N isn't fixed in advance, but can vary, then you can create more buckets dynamically, which adds a little complication, but is certainly doable.
In the worst case, the watchdog timer has to search (N/B) queues for expirations, but I assume the watchdog timer is not required to kill idle sockets in any particular order!
That is, if 10 sockets went idle in the last time slice, it doesn't have to search that domain for the one that timed out first, deal with it, then find the one that timed out second, etc. It just has to scan the (N/B) set of buckets and enumerate all time-outs.
If you're not satisfied with a linear array of buckets, you can use a priority queue of queues, but you want to avoid updating that queue on every message, or else you're back where you started. Instead, define some time that's less than the actual time-out (say, 3/4 or 7/8 of it), and only put the low-level queue into the high-level queue if its longest time exceeds that.
And at the risk of stating the obvious, you don't want your queues keyed on elapsed time. The keys should be start time. For each record in the queues, elapsed time would have to be updated constantly, but the start time of each record doesn't change.
There's a VERY simple way to do all inserts and removes in O(1), taking advantage of the fact that 1) priority is based on time and 2) you probably have a small, fixed number of timeout durations.
Create a regular FIFO queue to hold all tasks that timeout in 10 seconds. Because all tasks have identical timeout durations, you can simply insert to the end and remove from the beginning to keep the queue sorted.
Create another FIFO queue for tasks with 30-second timeout duration. Create more queues for other timeout durations.
To cancel, remove the item from the queue. This is O(1) if the queue is implemented as a linked list.
Rescheduling can be done as cancel-insert, as both operations are O(1). Note that tasks can be rescheduled to different queues.
Finally, to combine all the FIFO queues into a single overall priority queue, have the head of every FIFO queue participate in a regular heap. The head of this heap will be the task with the soonest expiring timeout out of ALL tasks.
If you have m number of different timeout durations, the complexity for each operation of the overall structure is O(log m). Insertion is O(log m) due to the need to look up which queue to insert to. Remove-min is O(log m) for restoring the heap. Cancelling is O(1) but worst case O(log m) if you're cancelling the head of a queue. Because m is a small, fixed number, O(log m) is essentially O(1). It does not scale with the number of tasks.
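A sketch of this per-duration scheme follows. It makes two substitutions that should be stated plainly: each FIFO queue is a LinkedHashSet (insertion-ordered, with O(1) add and remove) instead of a hand-written linked list, and the soonest deadline is found by scanning the m queue heads instead of maintaining a heap over them. It also assumes that the now values passed in never decrease. All class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Map;

// Hypothetical timeout tracker with one FIFO queue per timeout duration.
class PerDurationTimeouts {
    static final class Task {
        final String id;
        final long deadline;
        final long duration;

        Task(String id, long deadline, long duration) {
            this.id = id;
            this.deadline = deadline;
            this.duration = duration;
        }
    }

    // duration -> tasks in insertion order, which is also deadline order
    // as long as "now" never goes backwards.
    private final Map<Long, LinkedHashSet<Task>> queues = new HashMap<>();
    private final Map<String, Task> byId = new HashMap<>();

    // Schedules a task, or reschedules it if the id is already present.
    void schedule(String id, long now, long duration) {
        cancel(id);
        Task t = new Task(id, now + duration, duration);
        byId.put(id, t);
        queues.computeIfAbsent(duration, d -> new LinkedHashSet<>()).add(t);
    }

    // O(1): hash lookup plus removal from the insertion-ordered set.
    void cancel(String id) {
        Task t = byId.remove(id);
        if (t != null) {
            queues.get(t.duration).remove(t);
        }
    }

    // Scans the head of each of the m queues; returns null if no task is pending.
    String nextToExpire() {
        Task best = null;
        for (LinkedHashSet<Task> q : queues.values()) {
            Iterator<Task> it = q.iterator();
            if (it.hasNext()) {
                Task head = it.next();
                if (best == null || head.deadline < best.deadline) {
                    best = head;
                }
            }
        }
        return best == null ? null : best.id;
    }
}
```

With only a handful of distinct durations, the scan over queue heads is cheap; a heap over the heads, as described above, would bring that step down to O(log m).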
Your specific scenario suggests a circular buffer to me. If the max. timeout is 30 seconds and we want to reap sockets at least every tenth of a second, then use a buffer of 300 doubly-linked lists, one for each tenth of a second in that period. To 'increaseTime' on an entry, remove it from the list it's in and add it to the one for its new tenth-second period (both constant-time operations). When a period ends, reap anything left over in the current list (maybe by feeding it to a reaper thread) and advance the current-list pointer.
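The circular buffer of per-period lists might be sketched like this. For brevity each slot is a HashSet rather than a doubly linked list (both give constant-time add and remove here), and the names TimeoutWheel, schedule, cancel and tick are assumptions, not an existing API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical timing wheel: one slot per tick, reused circularly.
class TimeoutWheel {
    private final List<Set<String>> slots = new ArrayList<>();
    private final Map<String, Integer> slotOf = new HashMap<>();
    private int current = 0; // slot that was reaped most recently

    TimeoutWheel(int slotCount) {
        for (int i = 0; i < slotCount; i++) {
            slots.add(new HashSet<>());
        }
    }

    // Schedules or reschedules id to expire ticksFromNow ticks from now
    // (1 <= ticksFromNow <= slot count). Constant time.
    void schedule(String id, int ticksFromNow) {
        if (ticksFromNow < 1 || ticksFromNow > slots.size()) {
            throw new IllegalArgumentException("timeout does not fit in the wheel");
        }
        cancel(id);
        int slot = (current + ticksFromNow) % slots.size();
        slots.get(slot).add(id);
        slotOf.put(id, slot);
    }

    // Constant time: look up the slot, remove the id from it.
    void cancel(String id) {
        Integer slot = slotOf.remove(id);
        if (slot != null) {
            slots.get(slot).remove(id);
        }
    }

    // Advances the wheel by one tick and returns the ids left in the new slot.
    List<String> tick() {
        current = (current + 1) % slots.size();
        Set<String> due = slots.get(current);
        List<String> expired = new ArrayList<>(due);
        for (String id : expired) {
            slotOf.remove(id);
        }
        due.clear();
        return expired;
    }
}
```

For the scenario above, that would be a wheel of 300 slots ticked every tenth of a second, with each received message calling schedule again for its socket.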
You've got a hard-limit on the number of items in the queue - there is a limit to TCP sockets.
Therefore the problem is bounded. I suspect any clever data structure will be slower than using built-in types.
Is there a good reason not to use java.util.PriorityQueue? Doesn't remove() handle your cancel operations in log(N) time? (Its documentation lists remove(Object) as linear time, so check that first.) Then implement your own waiting based on the time until the item at the front of the queue.
I think storing all the tasks in a list and iterating through them would be best.
You must be running (or planning to run) the server on a pretty beefy machine to reach the limits where this cost becomes important?

Resources