Using a MessageGroupStoreReaper with an Aggregator - spring

The reference documentation recommends using a reaper with an aggregator in order to prevent memory leaks due to stacked-up MessageGroup metadata.
https://docs.spring.io/spring-integration/reference/html/message-routing.html#aggregator
Is this always the case, that a reaper is necessary? Or is there a combination of aggregator attributes like expire-groups-upon-completion and/or expire-groups-upon-timeout that can set up conditions such that MessageGroup data is removed?
thanks for any pointers

Your observation is correct. The expire-groups-upon-completion attribute ensures that completed groups are removed from the store. The expire-groups-upon-timeout attribute, in combination with a group-timeout, provides behavior similar to the reaper. Also, if you use a persistent store for the MessageGroupStore, all your groups are offloaded from memory to the database.
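For illustration, here is a rough Java-config sketch of that attribute combination (the channel names, timeouts, and DataSource are placeholders, not anything from the question):

import javax.sql.DataSource;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.aggregator.AggregatingMessageHandler;
import org.springframework.integration.aggregator.DefaultAggregatingMessageGroupProcessor;
import org.springframework.integration.annotation.ServiceActivator;
import org.springframework.integration.config.EnableIntegration;
import org.springframework.integration.expression.ValueExpression;
import org.springframework.integration.jdbc.store.JdbcMessageStore;
import org.springframework.integration.store.MessageGroupStore;

@Configuration
@EnableIntegration
public class AggregatorConfig {

    @Bean
    public MessageGroupStore messageGroupStore(DataSource dataSource) {
        // A persistent store keeps group state in the database rather than on the heap.
        return new JdbcMessageStore(dataSource);
    }

    @Bean
    @ServiceActivator(inputChannel = "aggregateChannel")
    public AggregatingMessageHandler aggregator(MessageGroupStore store) {
        AggregatingMessageHandler handler =
                new AggregatingMessageHandler(new DefaultAggregatingMessageGroupProcessor(), store);
        handler.setOutputChannelName("outputChannel");
        // expire-groups-upon-completion: drop a group's metadata as soon as it completes.
        handler.setExpireGroupsUponCompletion(true);
        // group-timeout + expire-groups-upon-timeout: groups that never complete are
        // expired after 30 seconds of inactivity, so nothing lingers without a reaper.
        handler.setGroupTimeoutExpression(new ValueExpression<>(30_000L));
        handler.setExpireGroupsUponTimeout(true);
        return handler;
    }
}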

Related

Spring Kafka and Re-balancing in the Middle of Processing

In this link https://www.oreilly.com/library/view/kafka-the-definitive/9781491936153/ch04.html the section entitled "Consuming Records with Specific Offsets" refers to a strategy of effectively updating the topic partition offsets in an external store "as you go", and then on partition revocation (e.g. a re-balancing) simply committing any interrupted transaction to the external store.
Now, I'm assuming that this strategy means that in the partition revocation callback I don't need to process the passed-in TopicPartition collection for the offsets, as any "in progress" transaction that was interrupted will be persisted and will contain the partition offsets that need to be committed/saved.
(If I'm wrong on this, please correct me.)
So, then, given that this is Spring Kafka and I'm making use of an @Transactional service to persist the necessary data, is the above strategy relevant/doable? In other words, I'm unsure how I'd resume/commit anything marked as @Transactional, since the transaction manager, boundary, etc. is all taken care of under the hood.
Is this even an issue? If so, what would be the best way to achieve this strategy? Manually track transactions (which sounds horrible across methods and callbacks)?
Or should I just go through the TopicPartition collection on partition revocation and update the partition offsets anyway?
Hopefully this makes sense as I'd like to make sure I get this right.
Thanks in advance.
Released September 2017
That book is quite old in Kafka terms; with modern versions, keeping the offsets in Kafka is much simpler. Just make sure your consumer can process all the records returned by a poll() within max.poll.interval.ms in order to avoid a rebalance altogether.
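As a rough illustration (the broker address, group id, and values below are placeholders, not recommendations), the relevant knobs on a Spring Kafka consumer factory look something like this:

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;

public class ConsumerFactoryConfig {

    public DefaultKafkaConsumerFactory<String, String> consumerFactory() {
        Map<String, Object> props = new HashMap<>();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
        // Let the listener container manage offset commits instead of auto-commit.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
        // Fewer records per poll() means each batch finishes sooner...
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 100);
        // ...and the whole batch must be processed within this interval,
        // otherwise the broker assumes the consumer is dead and rebalances.
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300_000);
        return new DefaultKafkaConsumerFactory<>(props, new StringDeserializer(), new StringDeserializer());
    }
}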

Using Chronicle-Queue as a file based FIFO queue

I got to know Chronicle-Queue from post:
Implementing a file based queue
Here's my use case:
I have a web server (say tomcat) which serves http requests
Each request processing might generate some tracing info.
I'll write this tracing info into a Chronicle-Queue (as byte[]; I'll do the marshalling/unmarshalling on my own, e.g. using protobuf)
I'll have a dedicated thread that uses a tailer to read from the Chronicle-Queue. Each message will be processed ONLY once; if processing fails, I'll have my own retry policy to put it back on the queue to allow another try.
Based on above use case, I have below questions:
How many appenders should be used? Should multiple threads share one appender, or should each thread have its own appender?
Is queue.acquireAppender() a heavy operation? Shall I cache the appender to avoid calls to acquireAppender()?
If for some reason the server goes down, can the tailer remember the last successfully read entry and continue with the next entry (like a milestone feature)?
How can I purge/delete old files? Any API to do the purge?
And another irrelevant question:
Is it possible to use Chronicle-Queue to implement a file based BlockingQueue?
Thanks
Leon
How many appenders should be used? Should multiple threads share one appender, or should each thread have its own appender?
I suggest you use queue.acquireAppender() and it will create Appenders as needed.
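For example (the queue path and payload handling are made up, and the exact API can differ between Chronicle Queue versions), a per-request write might look like:

import net.openhft.chronicle.queue.ExcerptAppender;
import net.openhft.chronicle.queue.impl.single.SingleChronicleQueue;
import net.openhft.chronicle.queue.impl.single.SingleChronicleQueueBuilder;

public class TraceWriter {

    private final SingleChronicleQueue queue =
            SingleChronicleQueueBuilder.binary("trace-queue").build();   // path is a placeholder

    public void writeTrace(byte[] protobufBytes) {
        // acquireAppender() hands each calling thread its own (cached) appender,
        // so request threads can call this concurrently without extra locking.
        ExcerptAppender appender = queue.acquireAppender();
        appender.writeBytes(b -> b.write(protobufBytes));
    }
}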
Is queue.acquireAppender() a heavy operation? Shall I cache the appender to avoid calls to acquireAppender()?
It's not free, but it costs on the order of ~100 nanoseconds.
If for some reason the server goes down, can the tailer remember the last successfully read entry and continue with the next entry (like a milestone feature)?
We suggest recording the outcomes of processing the first queue to a second queue. In it you can record the index the tailer is up to. We are considering adding this as a built-in feature so a second queue wouldn't be needed.
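A minimal sketch of that idea (queue paths, the "trace" field name, and the processing call are placeholders; it assumes the data entries were written with writeDocument(w -> w.write("trace").text(...))):

import net.openhft.chronicle.queue.ExcerptAppender;
import net.openhft.chronicle.queue.ExcerptTailer;
import net.openhft.chronicle.queue.impl.single.SingleChronicleQueue;
import net.openhft.chronicle.queue.impl.single.SingleChronicleQueueBuilder;
import net.openhft.chronicle.wire.DocumentContext;

public class ResumeFromProgressQueue {

    public static void main(String[] args) {
        try (SingleChronicleQueue data = SingleChronicleQueueBuilder.binary("trace-queue").build();
             SingleChronicleQueue progress = SingleChronicleQueueBuilder.binary("progress-queue").build()) {

            // On startup, replay the progress queue to find the last processed index.
            ExcerptTailer progressTailer = progress.createTailer();
            long lastProcessed = -1;
            for (String s; (s = progressTailer.readText()) != null; ) {
                lastProcessed = Long.parseLong(s);
            }

            ExcerptTailer tailer = data.createTailer();
            if (lastProcessed >= 0) {
                tailer.moveToIndex(lastProcessed);
                tailer.readingDocument().close();          // skip the entry we already processed
            }

            ExcerptAppender progressAppender = progress.acquireAppender();
            while (true) {
                try (DocumentContext dc = tailer.readingDocument()) {
                    if (!dc.isPresent()) {
                        break;                             // nothing more to read for now
                    }
                    String payload = dc.wire().read("trace").text();
                    System.out.println(payload);           // your processing / retry policy goes here
                    // Record the index of the entry we just handled.
                    progressAppender.writeText(Long.toString(dc.index()));
                }
            }
        }
    }
}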
How can I purge/delete old files? Any API to do the purge?
If you set a StoreFileListener on the builder you can be notified when a file isn't needed any more.
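For instance (the path, roll cycle, and the delete-vs-archive decision are placeholders):

import java.io.File;
import net.openhft.chronicle.queue.RollCycles;
import net.openhft.chronicle.queue.impl.single.SingleChronicleQueue;
import net.openhft.chronicle.queue.impl.single.SingleChronicleQueueBuilder;

public class PurgingQueueFactory {

    public SingleChronicleQueue build() {
        return SingleChronicleQueueBuilder.binary("trace-queue")
                .rollCycle(RollCycles.DAILY)
                // onReleased is called once no appender or tailer needs the roll file any more.
                .storeFileListener((int cycle, File file) -> file.delete())
                .build();
    }
}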

Memory consumption of Crossbar retained events

self.publish('foo.%s' % id, 'bar', options=PublishOptions(retain=True))
When using retained events, what's the memory consumption behaviour on the Crossbar router? Is the event stored forever, or is it purged after some time and the memory reclaimed?
I'm using wildcard topics, so there will be an ever growing backlog of retained events, unless old topics/retained events are purged at some point.
For full-on event history, you can configure the memory usage (https://crossbar.io/docs/Event-History/) but for retained events only the latest event for a topic is retained.
By "wildcard topics" you mean that you're publishing to foo.<something> and so there'll be an unbounded number of topics you're publishing to?
I can see two solutions (both require changes to Crossbar): add a Meta API to expire/remove particular retained events, or add some configuration option(s) to crossbar to limit retention somehow (maybe by time, maybe by number of events)?
Another solution if it works for your use-case would be to make the "topic" a fixed URI and add the ever-changing part ("id") as one of the arguments; then you could either use "retain" for just the latest one or use the "event history" feature if you want to keep a certain number around.

What is the expire-groups-on-timeout equivalent in Java Config?

As per the docs for expire-groups-on-timeout:
"When a group is completed due to a timeout (or by a MessageGroupStoreReaper), the group is expired (completely removed) by default. Late arriving messages will start a new group. Set this to false to complete the group but have its metadata remain so that late arriving messages will be discarded. Empty groups can be expired later using a MessageGroupStoreReaper together with the empty-group-min-timeout attribute. Default: 'true'."
How do I achieve that with Java Config? Basically, after a group times out, I want the late arriving messages to be discarded, and I also want the group to be expired once all the messages have arrived so that it doesn't produce a memory leak. For the latter part, I guess having the MessageGroupStoreReaper will work.
In general, hyphenated properties are converted to camelCase, so
ab-cd-ef
is generally a property
abCdEf
However, there's a typo in the reference manual, it's expire-groups-upon-timeout not expire-groups-on-timeout.
So, you need setExpireGroupsUponTimeout().
I want the late arriving messages to be discarded and also the group to be expired once all the messages have arrived so that it doesn't produce a memory leak.
expireGroupsUponCompletion will remove the metadata for a complete group. To discard late messages after a timeout, but also clean up at some time later, you need a reaper and an appropriate setting in setMinimumTimeoutForEmptyGroups().
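Putting the pieces together, a hedged Java-config sketch (channel names, timeouts, and the in-memory store are placeholders) might look like:

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.aggregator.AggregatingMessageHandler;
import org.springframework.integration.aggregator.DefaultAggregatingMessageGroupProcessor;
import org.springframework.integration.annotation.ServiceActivator;
import org.springframework.integration.config.EnableIntegration;
import org.springframework.integration.expression.ValueExpression;
import org.springframework.integration.store.MessageGroupStore;
import org.springframework.integration.store.MessageGroupStoreReaper;
import org.springframework.integration.store.SimpleMessageStore;

@Configuration
@EnableIntegration
public class TimeoutAggregatorConfig {

    @Bean
    public MessageGroupStore store() {
        return new SimpleMessageStore();
    }

    @Bean
    @ServiceActivator(inputChannel = "aggregateChannel")
    public AggregatingMessageHandler aggregator(MessageGroupStore store) {
        AggregatingMessageHandler handler =
                new AggregatingMessageHandler(new DefaultAggregatingMessageGroupProcessor(), store);
        handler.setOutputChannelName("outputChannel");
        // Complete groups are removed entirely.
        handler.setExpireGroupsUponCompletion(true);
        // On timeout, keep the (empty) group metadata so late arrivals are discarded.
        handler.setGroupTimeoutExpression(new ValueExpression<>(30_000L));
        handler.setExpireGroupsUponTimeout(false);
        handler.setDiscardChannelName("lateArrivals");
        // The reaper leaves empty groups alone until they are at least this old.
        handler.setMinimumTimeoutForEmptyGroups(60_000L);
        return handler;
    }

    @Bean
    public MessageGroupStoreReaper reaper(MessageGroupStore store) {
        MessageGroupStoreReaper reaper = new MessageGroupStoreReaper(store);
        reaper.setTimeout(60_000L);
        return reaper;   // schedule reaper.run() periodically, e.g. with a TaskScheduler
    }
}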

Queueing mechanism and Elasticsearch 1.4.0

I have a RabbitMQ broker, on which I post different messages that will end up as documents in Elasticsearch. There are multiple consumers from the broker, which are actually different threads in a task executor assigned to an amqp inbound gateway (using spring integration and spring amqp here).
Consider the following scenario: I have created a doc in ES with the structure
{
  "field1" : "value1",
  "field2" : "value2"
}
Afterwards I send two update requests, both updating the same field, let's say field1. If I send these messages one right after another (a common use case in production), my consumer threads will fetch the messages in the right order (AMQP allows this), but the processing could happen in the wrong order and the later updated value could be overwritten by the first one. I will end up having wrong data.
How can I make sure my data won't get corrupted? Having a single consumer thread is not enough, because if I want to scale out by adding more machines with my consuming app, I will still end up having multiple consumers. I might need ordering of messages, but with multiple machines I would probably need to create some sort of cluster-aware component; I am using SI, so this seems really hard to do in my opinion.
In pre-1.2 versions of ES, we used an external version, like a timestamp, and ES would have thrown a VersionConflictException in my scenario: the first update would have had, say, version 10000, the second 10001, and if the second had been processed first, ES would reject the late request with version 10000 as it's lower than the existing one. But in the latest versions, the ES guys have removed this functionality for update operations.
One solution might be to use multiple queues and have a single consumer on each queue; use a hash function to always route updates for the same document to the same queue. See the RabbitMQ Tutorials for the various options.
You can scale out by adding more queues (and changing your hash function).
For resiliency, consider running your consumers in Spring XD. You can have a single instance of each rabbit source (for each queue) and XD will take care of failing it over to another container node if it goes down.
Otherwise you could roll your own by having a warm standby - inbound adapters configured with auto-startup="false" and have something monitor and use a <control-bus/> to start a new instance if the active one goes down.
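To make the hashing idea concrete, a rough sketch (the exchange name, routing-key scheme, and bucket count are invented for illustration, not part of the answer above):

import org.springframework.amqp.rabbit.core.RabbitTemplate;

public class DocUpdateRouter {

    private static final int QUEUE_COUNT = 4;   // changing this means re-hashing/redistributing

    private final RabbitTemplate rabbitTemplate;

    public DocUpdateRouter(RabbitTemplate rabbitTemplate) {
        this.rabbitTemplate = rabbitTemplate;
    }

    public void send(String documentId, String updateJson) {
        // Same document id -> same routing key -> same queue, so the single consumer
        // on that queue sees the updates for this document in order.
        int bucket = Math.floorMod(documentId.hashCode(), QUEUE_COUNT);
        rabbitTemplate.convertAndSend("doc.updates", "updates." + bucket, updateJson);
    }
}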
EDIT:
In response to the fourth comment below.
As I said above, to scale out, you would have to change the hash function. So adding consumers automatically while running would be tricky.
You don't have to hard-code the queue names in the jar, you can use a property placeholder and fill it from properties, system properties, or an environment variable.
This solution is the simplest but does have these limitations.
You could, however, build a management app that could scale it out - stop the producer, wait for all queues to quiesce, reconfigure the consumers and restart the producer - Spring Integration provides a <control-bus/> to start/stop adapters; you can also do it via JMX.
Alternative solutions are possible but will generally require maintaining some shared state across a cluster (perhaps using zookeeper etc), so are much more complex; and you still have to deal with race conditions (where the second update might arrive at some consumer before the first).
You can use the default mechanism for consistency checks. Basically you want to verify that you have the latest version of whatever you are updating.
So for that you need to fetch the _version with the object. In queries you can do this by setting version=true at the top level. That will cause the _version to be returned along with your query results. Then when doing an update, you simply set the version parameter in the URL to the value you have, and it will generate a version conflict if it doesn't match.
A nicer approach is to handle updates using closures. Basically this works as follows: have an update method that fetches the object by id, applies a closure (a parameter of the update function) that encapsulates the modifications you want to make, and then stores the modified object. If you trap the still-possible version conflict, you can simply get the object again and re-apply the closure. We do this and added a random sleep before the retry as well; this vastly reduces the chance of multiple updates failing and is a nice design pattern. Keeping the read and write together minimizes the chance of a conflict, and retrying with a sleep before that minimizes it further. You could add multiple retries to further reduce the risk.
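A sketch of that read-apply-retry pattern against the ES 1.x Java client (the index/type names, retry count, and sleep range are placeholders):

import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.UnaryOperator;
import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.engine.VersionConflictEngineException;

public class OptimisticUpdater {

    public void updateWithRetry(Client client, String id,
                                UnaryOperator<Map<String, Object>> closure) throws InterruptedException {
        for (int attempt = 0; attempt < 5; attempt++) {
            // Read the current document together with its version.
            GetResponse get = client.prepareGet("myindex", "mytype", id).execute().actionGet();
            Map<String, Object> modified = closure.apply(get.getSourceAsMap());
            try {
                client.prepareIndex("myindex", "mytype", id)
                      .setSource(modified)
                      .setVersion(get.getVersion())   // rejected if someone updated in between
                      .execute().actionGet();
                return;
            } catch (VersionConflictEngineException e) {
                // Lost the race: sleep a random amount, then re-read and re-apply the closure.
                Thread.sleep(ThreadLocalRandom.current().nextLong(50, 250));
            }
        }
        throw new IllegalStateException("update for " + id + " kept conflicting");
    }
}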
