Right now I have a use case where I have a stream of events coming in. There are a few splitters and then, finally, an aggregator downstream. As the stream is never-ending, and given the number of splitters, we are unable to calculate the total number of messages we expect. For now we are using a simple SpEL release-strategy expression:
release-strategy-expression="size() == 10"
We are using a group-timeout and also have set send-partial-result-on-expiry=true.
Given this use case, am I right in concluding that there is no built-in way to preserve the original ordering of the stream of events?
I have tried using a SequenceSizeReleaseStrategy with releasePartialSequences set to true.
What I've observed is that this sends each message out as a separate group, because it relies on the sequence-size header, which defaults to zero here.
Am I missing anything? Is there a way to preserve the ordering in the aggregator for this use case?
For that purpose there is the Resequencer EIP: https://docs.spring.io/spring-integration/docs/5.3.0.M4/reference/html/message-routing.html#resequencer.
You place it just before the aggregator, and when the aggregator releases a group, the messages in the result list are in sequence order.
The resequencer can also release partial sequences as soon as the gaps in the sequence are filled.
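For illustration, a minimal Java DSL sketch of that arrangement (the channel names and the release strategy are placeholders; the XML namespace has an equivalent resequencer element with release-partial-sequences="true"):

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.dsl.IntegrationFlow;
import org.springframework.integration.dsl.IntegrationFlows;

@Configuration
public class OrderedAggregationConfig {

    // The resequencer sits directly in front of the aggregator and releases partial
    // sequences as gaps close, so every released group arrives in sequence order.
    @Bean
    public IntegrationFlow orderedAggregation() {
        return IntegrationFlows.from("splitterOutput")
                .resequence(r -> r.releasePartialSequences(true))
                .aggregate(a -> a.releaseStrategy(group -> group.size() == 10)) // same as the size() == 10 expression above
                .channel("aggregated")
                .get();
    }
}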
Related
I am in the process of scaling out an application horizontally, and have realised that read model updates (an external projection via an event handler) will need to be handled on a competing-consumers basis.
I initially assumed that I would need to ensure ordering, but this requirement is message dependent. In the case of shopping cart checkouts where I want to know totals, I can add the totals regardless of the order - get the message, update the SQL database, and ACK the message.
I am now racking my brain to think of a scenario or messages where ordering would matter, although I know such cases exist. Some extra clarity and examples would be immensely useful.
The questions I need help with are:
For what types of messages would ordering be important, and how would this be resolved using the messages as-is?
How would we know which event to resubscribe from when processes join/leave? I can see possible timing issues that could cause a subscription to be requested on a message that had just been processed by another process.
I see there is a Pinned consumer strategy for best-effort affinity of stream to subscriber, but this is not guaranteed. I could solve this by making a specific stream single-threaded, processing only those messages in order - is it possible for a process to have multiple subscriptions to different streams?
To use your example of a shopping cart, ordering would be potentially important for the following events:
Add item
Update item count
Remove item
You might have sequences like A: 'Add item, remove item' or B: 'Add item, Update item count (to 2), Update item count (to 3)'. For A, if you process the remove before the add, obviously you're in trouble. For B, if you process two update item counts out of order, you'll end up with the wrong final count.
This is normally scaled out by using some kind of sharding scheme, where a subset of all aggregates is allocated to each shard. For Event Store, I believe this can be done by creating a user-defined projection using partitionBy to partition the stream into multiple streams (aka 'shards'). Then you need to allocate partitions/shards to processing nodes in some way. Some technologies are built around this approach to horizontal scaling (Kafka and Kinesis spring to mind).
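As a rough illustration of the allocation step (not specific to Event Store; the modulo scheme and names are invented for this sketch), each aggregate id can be mapped deterministically to a shard, and every processing node then subscribes only to the shards it owns:

// Toy sketch of deterministic shard allocation; numShards and the hashing scheme
// are illustrative and not tied to any particular store or broker.
public final class ShardRouter {

    private final int numShards;

    public ShardRouter(int numShards) {
        this.numShards = numShards;
    }

    // Every event for the same aggregate (e.g. the same shopping cart) lands on the
    // same shard, so a single consumer per shard preserves per-aggregate ordering.
    public int shardFor(String aggregateId) {
        return Math.floorMod(aggregateId.hashCode(), numShards);
    }
}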
I'm building a Kafka Streams application that generates change events by comparing every new calculated object with the last known object.
So for every message on the input topic, I update an object in a state store and every once in a while (using punctuate), I apply a calculation on this object and compare the result with the previous calculation result (coming from another state store).
To make sure this operation is consistent, I do the following after the punctuate triggers:
write a tuple to the state store
compare the two values, create change events and context.forward them. So the events go to the results topic.
swap the tuple for the new_value and write it back to the state store
I use this tuple for scenarios where the application crashes or rebalances, so I can always send out the correct set of events before continuing.
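A rough sketch of those three steps, with invented store and type names (this is not the actual code, just an illustration of the write/compare-forward/swap sequence):

import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

public class ChangeEmitter {

    private final ProcessorContext context;
    private final KeyValueStore<String, String> resultStore;   // latest calculation result per key
    private final KeyValueStore<String, String[]> tupleStore;  // [oldValue, newValue] pair per key

    public ChangeEmitter(ProcessorContext context,
                         KeyValueStore<String, String> resultStore,
                         KeyValueStore<String, String[]> tupleStore) {
        this.context = context;
        this.resultStore = resultStore;
        this.tupleStore = tupleStore;
    }

    // Called from a scheduled Punctuator for every key that needs re-evaluation.
    void emitChanges(String key) {
        String newValue = resultStore.get(key);
        String[] previousTuple = tupleStore.get(key);
        String oldValue = previousTuple == null ? null : previousTuple[1];

        // 1. write the (old, new) tuple so the comparison can be replayed after a crash or rebalance
        tupleStore.put(key, new String[] { oldValue, newValue });

        // 2. compare the two values and forward change events to the results topic
        if (newValue != null && !newValue.equals(oldValue)) {
            context.forward(key, oldValue + " -> " + newValue);
        }

        // 3. swap the tuple to the new value only, marking the events as emitted
        tupleStore.put(key, new String[] { newValue, newValue });
    }
}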
Now, I noticed the resulting events are not always consistent, especially if the application rebalances frequently. It looks like in rare cases the Kafka Streams application emits events to the results topic while the changelog topic is not up to date yet. In other words, I produced something to the results topic, but my changelog topic is not in the same state yet.
So, when I do a stateStore.put() and the method call returns successfully, are there any guarantees when it will be on the changelog topic?
Can I enforce a changelog flush? When I do context.commit(), when will that flush+commit happen?
To get complete consistency, you will need to enable processing.guarantee="exactly_once" -- otherwise, in case of a failure, you might get inconsistent results.
If you want to stay with "at_least_once", you might want to use a single store and update the store after processing is done (i.e., after calling forward()). This minimizes the time window in which you can get inconsistencies.
And yes, if you call context.commit(), then before the input topic offsets are committed, all stores will be flushed to disk, and all pending producer writes will be flushed as well.
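A minimal config sketch for that setting (the application id and bootstrap servers are placeholders; newer client versions offer StreamsConfig.EXACTLY_ONCE_V2 instead):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class ChangeDetectorConfig {

    public static Properties streamsConfig() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "change-detector");    // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
        // Transactional writes: changelog records, result-topic records and
        // input-topic offsets are committed atomically.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
        return props;
    }
}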
We have a Spring Integration project where one large input file is burst into many individual files and then aggregated back together using the Spring Integration Aggregator.
There are a number of filters in the pipeline that will filter out unwanted individual files. We keep track of the number of files filtered for each correlated input file. Our @ReleaseStrategy checks whether we have received the number of individual files minus the number of filtered individual files.
What will happen if the last individual file to be processed is filtered out before it hits our ReleaseStrategy? The ReleaseStrategy is called for each individual file that reaches it, so it would not be polled again if the last individual file is filtered, but I would also expect Spring to have anticipated this use case and made some non-hackish provision for it that still provides me with the @Aggregator event. I don't get an @Aggregator event if I time out or if I make all the filter points check whether they are handling the last file.
Thanks!
The correct answer for the split-filter-aggregate pattern was to set the following properties on the aggregator bean:
send-partial-result-on-expiry="true"
group-timeout="5000"
These two properties work together to handle situations exactly like the one described, where our ReleaseStrategy is never called for the last record because the last record was filtered out. These settings cause whatever has been queued up to be released once the timeout is reached.
The timeout is a "quiet period": if no messages for a given CorrelationStrategy key are received within the timeout, the group is released. Each message received resets the timeout. See section (21):
http://docs.spring.io/spring-integration/reference/htmlsingle/
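For reference, a hypothetical Java DSL rendering of the same settings (the splitter, the filter predicate and the channel names below are placeholders for the real pipeline):

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.dsl.IntegrationFlow;
import org.springframework.integration.dsl.IntegrationFlows;

@Configuration
public class SplitFilterAggregateConfig {

    @Bean
    public IntegrationFlow splitFilterAggregate() {
        return IntegrationFlows.from("inputFiles")
                .split()
                // stand-in for the real filters that drop unwanted individual files
                .filter(String.class, name -> !name.endsWith(".skip"))
                .aggregate(a -> a
                        .groupTimeout(5000)                 // the "quiet period"
                        .sendPartialResultOnExpiry(true))   // release whatever has arrived
                .channel("mergedOutput")
                .get();
    }
}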
I'm trying to set up Storm to aggregate a stream, but with various (DRPC-queryable) metrics on the same stream.
E.g. the stream consists of messages that have a sender, a recipient, the channel through which the message arrived, and a gateway through which it was delivered. I'm having trouble deciding how to organize one or more topologies that could give me, e.g., the total count of messages by gateway and/or by channel. And besides the totals, counts per minute would be nice too.
The basic idea is to have a spout that will accept messaging events, and from there aggregate the data as needed. Currently I'm playing around with Trident and DRPC and I've come up with two possible topologies that solve the problem at this stage. I can't decide which approach is better, if any.
The entire source is available at this gist.
It has three classes:
RandomMessageSpout
used to emit the messaging data
simulates the real data source
SeparateTopology
creates a separate DRPC stream for each metric needed
also a separate query state is created for each metric
they all use the same spout instance
CombinedTopology
creates a single DRPC stream with all the metrics needed
creates a separate query state for each metric
each query state extracts the desired metric and groups results for it
Now, for the problems and questions:
SeparateTopology
is it necessary to use the same spout instance or can I just say new RandomMessageSpout() each time?
I like the idea that I don't need to persist grouped data by all the metrics, but just the groupings we need to extract later
is the spout-emitted data actually processed by all the state/query combinations, and not just by the first one that gets it?
would this also later enable dynamic addition of new state/query combinations at runtime?
CombinedTopology
I don't really like the idea that I need to persist data grouped by all the metrics since I don't need all the combinations
it came as a surprise that all the metrics always return the same data
e.g. channel and gateway queries return the status metric's data
I found that this was always the data grouped by the first field in the state definition
this topic explains the reasoning behind this behaviour
but I'm wondering if this is a good way of doing things in the first place (I will find a way around this issue if need be)
SnapshotGet vs TupleCollectionGet in stateQuery
with SnapshotGet things tended to work, but not always; only TupleCollectionGet solved the issue
any pointers as to what is the correct way of doing this?
I guess this is a longish question / topic, but any help is really appreciated!
Also, if I missed the architecture entirely, suggestions on how to accomplish this would be most welcome.
Thanks in advance :-)
You can't actually split a stream in SeparateTopology by invoking newStream() using the same spout instance, since that would create new instances of the same RandomMessageSpout spout, which would result in duplicate values being emitted to your topology by multiple, separate spout instances. (Spout parallelization is only possible in Storm with partitioned spouts, where each spout instance processes a partition of the whole dataset -- a Kafka partition, for example).
The correct approach here is to modify the CombinedTopology to split the stream into multiple streams, one for each metric you need (see below), and then do a groupBy() on that metric's field and a persistentAggregate() on each newly branched stream.
From the Trident FAQ,
"each" returns a Stream object, which you can store in a variable. You can then run multiple eaches on the same Stream to split it, e.g.:
Stream s = topology.each(...).groupBy(...).aggregate(...)
Stream branch1 = s.each(...)
Stream branch2 = s.each(...)
See this thread on Storm's mailing list, and this one for more information.
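Putting that together, a sketch of the branch-per-metric layout might look like the following (the field names come from the question; the state factory, DRPC function names, spout wiring and imports are illustrative and assume a recent Storm artifact):

import org.apache.storm.LocalDRPC;
import org.apache.storm.trident.Stream;
import org.apache.storm.trident.TridentState;
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.operation.builtin.MapGet;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.tuple.Fields;

public class BranchedMetricsTopology {

    public static TridentTopology build(LocalDRPC drpc) {
        TridentTopology topology = new TridentTopology();

        // One spout, one stream; each metric gets its own branch with its own state.
        Stream messages = topology.newStream("messages", new RandomMessageSpout());

        TridentState byGateway = messages
                .groupBy(new Fields("gateway"))
                .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));

        TridentState byChannel = messages
                .groupBy(new Fields("channel"))
                .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));

        // One DRPC query per metric; the DRPC argument is used as the lookup key.
        topology.newDRPCStream("count-by-gateway", drpc)
                .stateQuery(byGateway, new Fields("args"), new MapGet(), new Fields("count"));
        topology.newDRPCStream("count-by-channel", drpc)
                .stateQuery(byChannel, new Fields("args"), new MapGet(), new Fields("count"));

        return topology;
    }
}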
I am setting a JMS property on the producer side, i.e. jmsMessage.setObjectProperty("FILTER", filterId1).
This is a one-to-one relationship: the key FILTER is associated with only one value, filterId1, so the message is consumed only by the consumer whose filter value is filterId1.
But I want a one-to-many relationship, i.e. FILTER associated with many filter ids (filterId1 or filterId2 or filterId3 or filterId4 or filterId5), so that a consumer whose value matches any of these filter ids can consume the message.
Is there any functionality in JMS for this? If not, how can we achieve it programmatically?
You can use BETWEEN on the filter, but I suspect you should probably use a different queue for each of your sets. Overusing filters will hurt performance if many messages have to be scanned.
I would favor Subscriptions with a filter, or simply use multiple queues for the stuff you need.
But that's going a bit beyond simply answering your question; the simple answer is to use BETWEEN in the filter clause on your consumer.
(Also: there's no such thing as a JMS filter on the producer; a filter only applies to a consumer. I assume you meant setting some data that will be used by the filter.)
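A small sketch of both sides, assuming a numeric property (connection, session, queue and producer setup are omitted; the property name FILTER comes from the question, everything else is illustrative):

import javax.jms.JMSException;
import javax.jms.MessageConsumer;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;

public class FilterExample {

    // Producer side: tag each message with a numeric id.
    static void send(Session session, MessageProducer producer) throws JMSException {
        TextMessage message = session.createTextMessage("payload");
        message.setIntProperty("FILTER", 3);
        producer.send(message);
    }

    // Consumer side: the selector (the "filter") matches a whole range of ids,
    // so one consumer can receive messages tagged with any of them.
    static MessageConsumer subscribe(Session session, Queue queue) throws JMSException {
        return session.createConsumer(queue, "FILTER BETWEEN 1 AND 5");
        // For string ids, a selector such as "FILTER IN ('filterId1', 'filterId2')"
        // also works, provided the property is set as a String on the producer side.
    }
}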