From everything I have read, circular queues seem to be better than linear queues in many ways.
Why implement Queues vs Circular Queues?
So is there any reason to ever use Linear Queues?
I assume the constant-time performance of takes/puts is achieved by allowing consumers and producers to access the tail/head of the queue without locking each other. How is this achieved for in-memory queues? Does the answer change for durable queues (probably)? How is this solved in a system that imposes a limit of one producer and one consumer? How about when the system allows concurrent access?
In Java, Queue is an interface, and LinkedList is a commonly used implementation of it. A queue is often declared like this:
Queue<SomeClass> q = new LinkedList<>();
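LinkedList in Java is a doubly linked list.
offer() inserts at the tail and poll() removes and returns the head. Both are always O(1), because neither needs to traverse the list.
As far as concurrent access is concerned, it should not have any effect on the time complexity of the code. Note, however, that LinkedList itself is not thread-safe, so concurrent producers and consumers need either external synchronization or a concurrent implementation such as ConcurrentLinkedQueue.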
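To make the offer()/poll() behaviour concrete, here is a minimal sketch. The class name FifoQueueDemo and the helper drainInOrder are made up for illustration; the Queue, LinkedList and ConcurrentLinkedQueue calls are standard java.util APIs.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

class FifoQueueDemo {
    // Hypothetical helper: adds every item at the tail, then removes
    // from the head until the queue is empty.
    static <T> List<T> drainInOrder(Queue<T> queue, List<T> items) {
        for (T item : items) {
            queue.offer(item);                   // appended at the tail
        }
        List<T> out = new ArrayList<>();
        T head;
        while ((head = queue.poll()) != null) {  // removed from the head
            out.add(head);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> items = List.of("a", "b", "c");
        // Single-threaded use: LinkedList is sufficient.
        System.out.println(drainInOrder(new LinkedList<>(), items));
        // Shared between threads: a thread-safe implementation is needed.
        System.out.println(drainInOrder(new ConcurrentLinkedQueue<>(), items));
    }
}
```

Both calls return the items in insertion order. The difference is only in whether the queue may safely be shared between threads.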
I am currently researching a queueing solution to handle medium sized messages of 1MB.
Beyond the feature differences between Redis, Kafka and RabbitMQ, I cannot find a good answer about their performance with messages of around 1MB.
Does anyone know how many 1MB messages each of these can handle?
Do you know any other queueing solutions which can perform better?
When you are evaluating Kafka vs Redis in your case, there are other factors which you have to take into account, besides message size. Here are some of them I can think of:
How many producers/consumers? Redis performance can suffer with a large number of producers/consumers because of how Redis works (push-based queue): Redis delivers the message to all the consumers at once, at the moment the message is put in the queue.
Do you need speed or reliability first? If speed is of utmost importance, use Redis, since it works in memory and will deliver messages faster. If you need reliability, use Kafka, since it persists messages even after they are delivered.
Do you want your consumers to get messages once they are ready, or do you want messages to be sent to the consumers immediately? In the first case use Kafka, because it is a pull-based mechanism (consumers have to ask for messages). In the second case use Redis, since it is a push-based mechanism (a message is pushed to the consumer once it is on the queue). RabbitMQ is also push-based (although there is a pull API with bad performance).
What is the number of messages expected? If it's not huge, use Redis, since you are limited by memory. Otherwise use Kafka. Best practice for RabbitMQ is to keep queues short, which means you should consume messages at close to the rate at which they appear on the queue. So if you have some long-running operation on the consumer side, RabbitMQ is probably not the best choice.
Scaling? Kafka scales horizontally really well (it's built with scalability in mind). RabbitMQ is usually scaled vertically. Redis also scales well horizontally if needed.
Clearly there is more than one criterion to weigh when you evaluate a queueing solution. There are best practices and recommendations for each of the queueing engines you are looking at. Think more about your specific use case. It is definitely worth the time, since it will save you from the cost of having chosen an inappropriate queueing engine.
I am answering for Kafka.
Kafka itself has very good performance even for big messages.
In our tests with 2 Kafka nodes we reached point-to-point throughput of 170 MB/s with smaller messages and 150 MB/s with bigger messages.
The only thing you need to remember is to configure the broker to accept bigger messages.
Here is a nice article: Configuring Kafka for Performance and Resource Management - Handling Large Messages
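As a rough sketch of what "configure the broker to accept bigger messages" involves, the snippet below collects the size-related settings as java.util.Properties. The property names are standard Kafka configuration keys. The 2 MB limit and the class name LargeMessageSettings are assumptions for illustration only, and the right values depend on your setup.

```java
import java.util.Properties;

class LargeMessageSettings {
    // Assumed limit: 2 MB, leaving headroom above 1 MB payloads.
    static final String LIMIT_BYTES = String.valueOf(2 * 1024 * 1024);

    // Settings that would go into the broker's server.properties.
    static Properties broker() {
        Properties p = new Properties();
        p.setProperty("message.max.bytes", LIMIT_BYTES);        // largest record batch the broker accepts
        p.setProperty("replica.fetch.max.bytes", LIMIT_BYTES);  // followers must be able to replicate it
        return p;
    }

    // Settings passed to the producer client.
    static Properties producer() {
        Properties p = new Properties();
        p.setProperty("max.request.size", LIMIT_BYTES);
        return p;
    }

    // Settings passed to the consumer client.
    static Properties consumer() {
        Properties p = new Properties();
        p.setProperty("max.partition.fetch.bytes", LIMIT_BYTES);
        return p;
    }

    public static void main(String[] args) {
        System.out.println("broker:   " + broker());
        System.out.println("producer: " + producer());
        System.out.println("consumer: " + consumer());
    }
}
```

The point of listing all three sides is that raising the broker limit alone is not enough if the producer or consumer still uses a smaller limit.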
I also know of another p2p solution that might be interesting if you have concrete requirements: take a look at YAMI4.
I was using Redis but only for very small messages, so I cannot say anything about 1MB.
Using a queue with first-in-first-out ordering seems like a standard way of dealing with requests to servers/databases/services. But could any other data structures and algorithms be used for dealing with large quantities of requests?
There is an article at http://queue.acm.org/detail.cfm?id=2991130 "Scaling Synchronization in Multicore Programs" which goes into the detail of reducing the locking costs when multiple cores are trying to update the same FIFO queue.
If different users or different requests have different priorities then a FIFO queue won't do, and you need something that takes account of the priority. I have often seen ordered maps used for this, such as TreeMap and std::map or std::multimap. It's not the use case these are designed for, but people know how to use them, and they are typically highly optimised, so they can be faster than most people's attempt at implementing a more specialised data structure.
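Here is a minimal sketch of the ordered-map approach using TreeMap. The class name PriorityRequestQueue, the convention that a lower number means higher priority, and the per-priority FIFO buckets are all assumptions made for the example, not a standard API.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.TreeMap;

class PriorityRequestQueue<T> {
    // Assumption: lower key = more urgent. Each priority keeps its own
    // FIFO bucket so equal-priority requests stay in arrival order.
    private final TreeMap<Integer, Deque<T>> buckets = new TreeMap<>();

    public void add(int priority, T request) {
        buckets.computeIfAbsent(priority, k -> new ArrayDeque<>()).addLast(request);
    }

    // Returns the oldest request of the most urgent priority, or null if empty.
    public T poll() {
        Map.Entry<Integer, Deque<T>> first = buckets.firstEntry();
        if (first == null) {
            return null;
        }
        T request = first.getValue().pollFirst();
        if (first.getValue().isEmpty()) {
            buckets.remove(first.getKey());  // drop empty buckets so firstEntry stays valid
        }
        return request;
    }

    public static void main(String[] args) {
        PriorityRequestQueue<String> q = new PriorityRequestQueue<>();
        q.add(2, "report");
        q.add(1, "login");
        q.add(2, "export");
        q.add(1, "logout");
        String next;
        while ((next = q.poll()) != null) {
            System.out.println(next);
        }
    }
}
```

Java also ships java.util.PriorityQueue, a binary heap, but it does not preserve arrival order among equal priorities, which is one reason the map-of-buckets layout is sometimes preferred.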
My topology has a bottleneck or two. The capacity metric in the Storm UI is useful for identifying these, but I'd be much more interested in the size of the bolts' queues.
My understanding is that each bolt has two queues, one for tuples pending execution and another for tuples pending emission. Is it possible to monitor the size of these queues?
I found some material online about adding an ITaskHook implementation to bolts, but it's not remotely clear how I can use this to monitor queue size. Can the methods in ITaskHook be used to monitor this?
You should be able to see the length of the queues for the components of your topology using the metrics mechanism. An easy way of doing this is to add conf.registerMetricsConsumer(LoggingMetricsConsumer.class) to the config of your topology.
Here is an example of what I get for one of my components:
4:fetch __sendqueue {write_pos=12122, read_pos=12122, capacity=1024, population=0}
4:fetch __receive {write_pos=8588, read_pos=8587, capacity=1024, population=1}
Is there a standard approach for deduplicating parallel event streams? Before I attempt to reinvent the wheel, I want to know whether this problem has known solutions.
My client component will be communicating with two servers. Each one is providing a near real-time event stream (~1 second). The events may occasionally be out of order. Assume I can uniquely identify the events. I need to send a single stream of events to the consuming code at the same near real-time performance.
A lot has been written about this kind of problem. Here's a foundational paper, by Leslie Lamport:
http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html#time-clocks
The Wikipedia article on Operational Transformation theory is a perfectly good starting point for further research:
http://en.wikipedia.org/wiki/Operational_transformation
As for your problem, you'll have to choose some arbitrary weight to measure the cost of delay vs the cost of dropped events. You can maintain two time-ordered priority queues where incoming events go. You'd do a merge on the heads of the two queues with some delay (to allow for out-of-order events), and throw away events that happened "before" the timestamp of whatever event you last sent. If that's no better than what you had in mind already, well, at least you get to read that cool Lamport paper!
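The delayed merge can be sketched roughly as follows. This version is simplified: instead of two priority queues it feeds both streams into a single time-ordered queue, which gives the same merged order. The class name DedupMerger, the Event type, and the rule "release anything older than now minus the delay" are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

class DedupMerger {
    // Hypothetical event shape: a unique id plus a timestamp in milliseconds.
    public static final class Event {
        public final String id;
        public final long timestamp;
        public Event(String id, long timestamp) {
            this.id = id;
            this.timestamp = timestamp;
        }
    }

    private final PriorityQueue<Event> pending =
            new PriorityQueue<>(Comparator.comparingLong((Event e) -> e.timestamp));
    // Note: grows without bound here; a real system would expire old ids.
    private final Set<String> seen = new HashSet<>();
    private final long delayMillis;
    private long lastEmitted = Long.MIN_VALUE;

    public DedupMerger(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    // Called for every event from either server.
    public void offer(Event e) {
        if (e.timestamp < lastEmitted) {
            return;  // arrived too late: older than what was already sent
        }
        if (!seen.add(e.id)) {
            return;  // duplicate of an event already accepted
        }
        pending.add(e);
    }

    // Emits, in timestamp order, every pending event older than now - delay.
    public List<Event> release(long now) {
        List<Event> out = new ArrayList<>();
        while (!pending.isEmpty() && pending.peek().timestamp <= now - delayMillis) {
            Event e = pending.poll();
            lastEmitted = e.timestamp;
            out.add(e);
        }
        return out;
    }
}
```

A larger delay tolerates more reordering at the cost of latency, while a smaller delay drops more late events. That is the delay-versus-loss trade-off mentioned above.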
I think the optimization might be OS-specific. From the task as you describe it, I picture two threads consuming incoming data and appending it to a common stream, with access guarded by mutexes. Both Linux and Win32 provide mutex primitives, but they may become slow if your data rate is very high. In that case I'd work in blocks of data, so that the mutex is taken less often. There is of course also a main thread that consumes the data, and it too accesses the stream under the mutex.
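The block-wise idea might look like this in Java, with ReentrantLock standing in for the OS mutex. The class name BatchedStream, the block size of 100, and the string events are assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

class BatchedStream {
    private final ReentrantLock lock = new ReentrantLock();
    private final List<String> stream = new ArrayList<>();

    // Reader threads: append a whole block under one lock acquisition.
    public void appendBlock(List<String> block) {
        lock.lock();
        try {
            stream.addAll(block);
        } finally {
            lock.unlock();
        }
    }

    // Main thread: take everything accumulated so far, also under the lock.
    public List<String> drain() {
        lock.lock();
        try {
            List<String> copy = new ArrayList<>(stream);
            stream.clear();
            return copy;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        BatchedStream shared = new BatchedStream();
        Runnable reader = () -> {
            List<String> block = new ArrayList<>();
            for (int i = 0; i < 1000; i++) {
                block.add(Thread.currentThread().getName() + "-" + i);
                if (block.size() == 100) {      // lock once per 100 events
                    shared.appendBlock(block);
                    block = new ArrayList<>();
                }
            }
        };
        Thread a = new Thread(reader, "server-a");
        Thread b = new Thread(reader, "server-b");
        a.start();
        b.start();
        a.join();
        b.join();
        System.out.println("events received: " + shared.drain().size());
    }
}
```

Each reader takes the lock 10 times instead of 1000. The cost is that events become visible to the main thread only once their block is full.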