I am writing events into Redis Stream.
But I would like to keep only the last 48 hours events.
According to the Redis documentations, I saw that I can manage my list size only using the MAXLEN which take affect by the records count and not by time frame.
Is there any way I can use the XADD function but to limit on insertion records oldest that the last 48 hours?
Thanks for the help!
This is yet not clear. I don't like the vanilla way of time capping a stream, that is, "trim by <seconds>", because it means that if there is a delay in the process XADD-ing items, later the next XADD will have to evict things potentially for seconds, causing latency spikes. Moreover it does not make a lot of sense semantically. Your real "capped resource" is memory, so it's not really so important how many items you want to store in the past VS how many items you can store, so the number of items limit makes more sense. Yet in certain applications where there are multiple streams with insertion rates that vary a lot between different producers, it makes sense to cap by time, to avoid wasting memory in certain producers that emit very few entries per unit of time. Maybe at some point I'll add some "best effort" time capping that does not do more work than a given amount, but that eventually will be able to trim the stream, given enough XADD calls.
AFAIK not yet. There were discussions about adding a timestamp cap (to XADD, and possible to XTRIM as well), but it doesn't look like this feature has been implemented in the latest release candidates.
A possible solution in nodejs based on trimming to a specified key (not on time per se).
https://gist.github.com/jakelowen/22cb8a233ac0cdbb8e77808e17e0e1fc
Proof of concept. Not battle tested.
I am using the Birman-Schiper-Stephenson protocol of distributed system with the current assumption that peer set of any node doesn't change. As the protocol dictates, the messages which have come out of causal order to a node have to be put in a 'delay queue'. My problem is with the organisation of the delay queue where we must implement some kind of order with the messages. After deciding the order we will have to make a 'Wake-Up' protocol which would efficiently search the queue after the current timestamp is modified to find out if one of the delayed messages can be 'woken-up' and accepted.
I was thinking of segregating the delayed messages into bins based on the points of difference of their vector-timestamps with the timestamp of this node. But the number of bins can be very large and maintaining them won't be efficient.
Please suggest some designs for such a queue(s).
Sorry about the delay -- didn't see your question until now. Anyhow, if you look at Isis2.codeplex.com you'll see that in Isis2, I have a causalsend implementation that employs the same vector timestamp scheme we described in the BSS paper. What I do is to keep my messages in a partial order, sorted by VT, and then when a delivery occurs I can look at the delayed queue and deliver off the front of the queue until I find something that isn't deliverable. Everything behind it will be undeliverable too.
But in fact there is a deeper insight here: you actually never want to allow the queue of delayed messages to get very long. If the queue gets longer than a few messages (say, 50 or 100) you run into the problem that the guy with the queue could be holding quite a few bytes of data and may start paging or otherwise running slowly. So it becomes a self-perpetuating cycle in which because he has a queue, he is very likely to be dropping messages and hence enqueuing more and more. Plus in any case from his point of view, the urgent thing is to recover that missed message that caused the others to be out of order.
What this adds up to is that you need a flow control scheme in which the amount of pending asynchronous stuff is kept small. But once you know the queue is small, searching every single element won't be very costly! So this deeper perspective says flow control is needed no matter what, and then because of flow control (if you have a flow control scheme that works) the queue is small, and because the queue is small, the search won't be costly!
I'm looking for a distributed message queue that will support millions of queues, with each queue handling tens of messages per second.
The messages will be small (tens of bytes), and I don't expect the queues to get very long--on the order of tens of messages per queue at maximum, but when the system is humming along, the queues should stay fairly empty.
I'm not sure how many nodes to expect in the cluster--probably depends on the specific solution, but if I had to guess, I would say ten nodes. I would prefer that queues were relatively resilient to individual node failures within the cluster, but a few lost messages here and there won't make me lose sleep.
Does such a message queue exist? Seems like most of the field is optimized toward handling hundreds of queues with high throughput. But what is SQS built on? Surely not magic.
Update:
By request, it may indeed help to shed light on my problem domain. (I'd left details out before so as not to muddy the waters.) I'm experimenting with distributed cellular automata, with an initial target of a million cells in simulation. In some CA models, it's useful to add an event model, so that a cell can send events to its neighbors. Hence, a million queues, each with one consumer and 8 or so producers.
Costs are a concern for now, as I'm funding the experiments myself. (Thus Amazon's SQS is probably out of reach.)
From your description, it looks like OMG's Data Distribution Service could be a good fit. It is related to message queueing technologies, but I would rather call it a distributed data management infrastructure. It is completely distributed and supports advanced features that give you a lot of control over how the data is distributed, by means of a rich set of Quality of Service settings.
Not knowing much about your problem, I could guess what an approach might be. DDS is about distributing the state of strongly-typed data-items, as structures with typed attributes. You could create a data-type describing the state of an automaton. One of its attributes could be an ID uniquely identifying the automaton in the system. If possible, that would be assigned according to a scheme such that every automaton knows what the ID's of its neighbors are (if they are present). Each automaton would publish its state as needed, resulting in a distributed data-space containing the current state of all automatons. DDS supports so-called partitioning of that data-space. If you took advantage of that, then each of the nodes in your machine would be responsible for a well-defined subset of all automatons. Communication over the wire would only happen for those automatons neighboring a different partition. Since automatons know the ID's of their neighbors, they would be able to query the data-space for the states of the automatons it's interested in.
It is a bit hard to explain without a white board, but the end-result would be a single instance (which are a sort of very light-weight message queues) for most automatons, and two or three instances for those automatons at the border of a partition. If you had ten nodes and one million automata, then each node would have to be able to hold administration for approximately hundred thousand automata. I have seen systems being built with DDS of that scale, and larger, with tens of updates per second for each instance. The nice thing is that this technology scales well with the number of nodes, so you could bring down the resource load per node by adding more nodes.
If this is a research project, then you might even be able to use a commercial product without charge. Just google on dds research license.
Is there a standard approach for deduping parallel event streams ? Before I attempt to reinvent the wheel, I want to know if this problem has some known approaches.
My client component will be communicating with two servers. Each one is providing a near real-time event stream (~1 second). The events may occasionally be out of order. Assume I can uniquely identify the events. I need to send a single stream of events to the consuming code at the same near real-time performance.
A lot has been written about this kind of problem. Here's a foundational paper, by Leslie Lamport:
http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html#time-clocks
The Wikipedia article on Operational Transformation theory is a perfectly good starting point for further research:
http://en.wikipedia.org/wiki/Operational_transformation
As for your problem, you'll have to choose some arbitrary weight to measure the cost of delay vs the cost of dropped events. You can maintain two priority queues, time-ordered, where incoming events go. You'd do a merge-and on the heads of the two queues with some delay (to allow for out-of-order events), and throw away events that happened "before" the timestamp of whatever event you last sent. If that's no better than what you had in mind already, well, at least you get to read that cool Lamport paper!
I think that the optimization might be OS-specific. From the task as you described it I think about two threads consuming incoming data and appending it to the common stream having access based on mutexes. Both Linux and Win32 have mutex-like procedures, but they may have slow performance if you have data rate is really great. In this case I'd operate by blocks of data, that will allow to use mutexes not so often. Sure there's a main thread that consumes the data and it also access it with a mutex.
UPDATE: Here's my implementation of Hashed Timing Wheels. Please let me know if you have an idea to improve the performance and concurrency. (20-Jan-2009)
// Sample usage:
public static void main(String[] args) throws Exception {
Timer timer = new HashedWheelTimer();
for (int i = 0; i < 100000; i ++) {
timer.newTimeout(new TimerTask() {
public void run(Timeout timeout) throws Exception {
// Extend another second.
timeout.extend();
}
}, 1000, TimeUnit.MILLISECONDS);
}
}
UPDATE: I solved this problem by using Hierarchical and Hashed Timing Wheels. (19-Jan-2009)
I'm trying to implement a special purpose timer in Java which is optimized for timeout handling. For example, a user can register a task with a dead line and the timer could notify a user's callback method when the dead line is over. In most cases, a registered task will be done within a very short amount of time, so most tasks will be canceled (e.g. task.cancel()) or rescheduled to the future (e.g. task.rescheduleToLater(1, TimeUnit.SECOND)).
I want to use this timer to detect an idle socket connection (e.g. close the connection when no message is received in 10 seconds) and write timeout (e.g. raise an exception when the write operation is not finished in 30 seconds.) In most cases, the timeout will not occur, client will send a message and the response will be sent unless there's a weird network issue..
I can't use java.util.Timer or java.util.concurrent.ScheduledThreadPoolExecutor because they assume most tasks are supposed to be timed out. If a task is cancelled, the cancelled task is stored in its internal heap until ScheduledThreadPoolExecutor.purge() is called, and it's a very expensive operation. (O(NlogN) perhaps?)
In traditional heaps or priority queues I've learned in my CS classes, updating the priority of an element was an expensive operation (O(logN) in many cases because it can only be achieved by removing the element and re-inserting it with a new priority value. Some heaps like Fibonacci heap has O(1) time of decreaseKey() and min() operation, but what I need at least is fast increaseKey() and min() (or decreaseKey() and max()).
Do you know any data structure which is highly optimized for this particular use case? One strategy I'm thinking of is just storing all tasks in a hash table and iterating all tasks every second or so, but it's not that beautiful.
How about trying to separate the handing of the normal case where things complete quickly from the error cases?
Use both a hash table and a priority queue. When a task is started it gets put in the hash table and if it finishes quickly it gets removed in O(1) time.
Every one second you scan the hash table and any tasks that have been a long time, say .75 seconds, get moved to the priority queue. The priority queue should always be small and easy to handle. This assumes that one second is much less than the timeout times you are looking for.
If scanning the hash table is too slow, you could use two hash tables, essentially one for even-numbered seconds and one for odd-numbered seconds. When a task gets started it is put in the current hash table. Every second move all the tasks from the non-current hash table into the priority queue and swap the hash tables so that the current hash table is now empty and the non-current table contains the tasks started between one and two seconds ago.
There options are a lot more complicated than just using a priority queue, but are pretty easily implemented should be stable.
To the best of my knowledge (I wrote a paper about a new priority queue, which also reviewed past results), no priority queue implementation gets the bounds of Fibonacci heaps, as well as constant-time increase-key.
There is a small problem with getting that literally. If you could get increase-key in O(1), then you could get delete in O(1) -- just increase the key to +infinity (you can handle the queue being full of lots of +infinitys using some standard amortization tricks). But if find-min is also O(1), that means delete-min = find-min + delete becomes O(1). That's impossible in a comparison-based priority queue because the sorting bound implies (insert everything, then remove one-by-one) that
n * insert + n * delete-min > n log n.
The point here is that if you want a priority-queue to support increase-key in O(1), then you must accept one of the following penalties:
Not be comparison based. Actually, this is a pretty good way to get around things, e.g. vEB trees.
Accept O(log n) for inserts and also O(n log n) for make-heap (given n starting values). This sucks.
Accept O(log n) for find-min. This is entirely acceptable if you never actually do find-min (without an accompanying delete).
But, again, to the best of my knowledge, no one has done the last option. I've always seen it as an opportunity for new results in a pretty basic area of data structures.
Use Hashed Timing Wheel - Google 'Hashed Hierarchical Timing Wheels' for more information. It's a generalization of the answers made by people here. I'd prefer a hashed timing wheel with a large wheel size to hierarchical timing wheels.
Some combination of hashes and O(logN) structures should do what you ask.
I'm tempted to quibble with the way you're analyzing the problem. In your comment above, you say
Because the update will occur very very frequently. Let's say we are sending M messages per connection then the overall time becomes O(MNlogN), which is pretty big. – Trustin Lee (6 hours ago)
which is absolutely correct as far as it goes. But most people I know would concentrate on the cost per message, on the theory that as you app has more and more work to do, obviously it's going to require more resources.
So if your application has a billion sockets open simultaneously (is that really likely?) the insertion cost is only about 60 comparisons per message.
I'll bet money that this is premature optimization: you haven't actually measured the bottlenecks in you system with a performance analysis tool like CodeAnalyst or VTune.
Anyway, there's probably an infinite number of ways of doing what you ask, once you just decide that no single structure will do what you want, and you want some combination of the strengths and weaknesses of different algorithms.
One possiblity is to divide the socket domain N into some number of buckets of size B, and then hash each socket into one of those (N/B) buckets. In that bucket is a heap (or whatever) with O(log B) update time. If an upper bound on N isn't fixed in advance, but can vary, then you can create more buckets dynamically, which adds a little complication, but is certainly doable.
In the worst case, the watchdog timer has to search (N/B) queues for expirations, but I assume the watchdog timer is not required to kill idle sockets in any particular order!
That is, if 10 sockets went idle in the last time slice, it doesn't have to search that domain for the one that time-out first, deal with it, then find the one that timed-out second, etc. It just has to scan the (N/B) set of buckets and enumerate all time-outs.
If you're not satisfied with a linear array of buckets, you can use a priority queue of queues, but you want to avoid updating that queue on every message, or else you're back where you started. Instead, define some time that's less than the actual time-out. (Say, 3/4 or 7/8 of that) and you only put the low-level queue into the high-level queue if it's longest time exceeds that.
And at the risk of stating the obvious, you don't want your queues keyed on elapsed time. The keys should be start time. For each record in the queues, elapsed time would have to be updated constantly, but the start time of each record doesn't change.
There's a VERY simple way to do all inserts and removes in O(1), taking advantage of the fact that 1) priority is based on time and 2) you probably have a small, fixed number of timeout durations.
Create a regular FIFO queue to hold all tasks that timeout in 10 seconds. Because all tasks have identical timeout durations, you can simply insert to the end and remove from the beginning to keep the queue sorted.
Create another FIFO queue for tasks with 30-second timeout duration. Create more queues for other timeout durations.
To cancel, remove the item from the queue. This is O(1) if the queue is implemented as a linked list.
Rescheduling can be done as cancel-insert, as both operations are O(1). Note that tasks can be rescheduled to different queues.
Finally, to combine all the FIFO queues into a single overall priority queue, have the head of every FIFO queue participate in a regular heap. The head of this heap will be the task with the soonest expiring timeout out of ALL tasks.
If you have m number of different timeout durations, the complexity for each operation of the overall structure is O(log m). Insertion is O(log m) due to the need to look up which queue to insert to. Remove-min is O(log m) for restoring the heap. Cancelling is O(1) but worst case O(log m) if you're cancelling the head of a queue. Because m is a small, fixed number, O(log m) is essentially O(1). It does not scale with the number of tasks.
Your specific scenario suggests a circular buffer to me. If the max. timeout is 30 seconds and we want to reap sockets at least every tenth of a second, then use a buffer of 300 doubly-linked lists, one for each tenth of a second in that period. To 'increaseTime' on an entry, remove it from the list it's in and add it to the one for its new tenth-second period (both constant-time operations). When a period ends, reap anything left over in the current list (maybe by feeding it to a reaper thread) and advance the current-list pointer.
You've got a hard-limit on the number of items in the queue - there is a limit to TCP sockets.
Therefore the problem is bounded. I suspect any clever data structure will be slower than using built-in types.
Is there a good reason not to use java.lang.PriorityQueue? Doesn't remove() handle your cancel operations in log(N) time? Then implement your own waiting based on the time until the item on the front of the queue.
I think storing all the tasks in a list and iterating through them would be best.
You must be (going to) run the server on some pretty beefy machine to get to the limits where this cost will be important?