Kafka Streams approach to timed window with max count - apache-kafka-streams

I have a system where we process text messages. Each message gets split up into sentences, and each sentence gets processed individually and the results of each sentence get published to a topic. This all happens asynchronously.
I want to be able to aggregate the results for the sentences.
The problem is that I want the window to end when the total number of sentences have been reached, or when a total amount of time has passed. Basically Tumbling time windows, but can end when a total number of results have been received.
Secondarily I want to be able to know when that window ends so that I can process the aggregation as an atomic event.

It's possible, but you have to implement a custom processor; your requirements are simply too specific for the high-level API to cater for.
Your processor would store messages into a state store and use punctuate to periodically check if the window expired. It would also keep a running counter and check if the max number of results have been received. If either condition is met, it does the aggregation, removes messages from the state store and sends the results downstream.
You'd have to think about what to do on restart (failover/re-balancing). When starting up, the processor should inspect its state store and calculate the current running count and the window expiry time.
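To make the logic above concrete, here is a minimal, framework-free sketch of the "close on count or on timeout" bookkeeping. It is not Kafka Streams code: CountOrTimeWindow, add and onPunctuate are made-up names, the two maps stand in for a state store, and the fixed maxCount is an assumption (in your case the expected count would probably come from the message itself). In a real processor, onPunctuate would be driven by a scheduled punctuator.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical model of the processor's bookkeeping; the maps stand in for a state store.
class CountOrTimeWindow {
    private final int maxCount;
    private final long windowMs;
    private final Map<String, List<String>> buffers = new HashMap<>();
    private final Map<String, Long> windowStart = new HashMap<>();

    CountOrTimeWindow(int maxCount, long windowMs) {
        this.maxCount = maxCount;
        this.windowMs = windowMs;
    }

    // Buffers one result. Returns the full batch if the count limit was just reached.
    Optional<List<String>> add(String key, String value, long nowMs) {
        windowStart.putIfAbsent(key, nowMs); // the first result opens the window
        List<String> buffer = buffers.computeIfAbsent(key, k -> new ArrayList<>());
        buffer.add(value);
        if (buffer.size() >= maxCount) {
            return Optional.of(close(key));
        }
        return Optional.empty();
    }

    // Periodic check: closes and returns every window whose time limit has passed.
    Map<String, List<String>> onPunctuate(long nowMs) {
        Map<String, List<String>> expired = new HashMap<>();
        for (String key : new ArrayList<>(windowStart.keySet())) {
            if (nowMs - windowStart.get(key) >= windowMs) {
                expired.put(key, close(key));
            }
        }
        return expired;
    }

    // Removes the window's state so that the batch is emitted exactly once.
    private List<String> close(String key) {
        windowStart.remove(key);
        return buffers.remove(key);
    }
}
```

Whichever condition fires first removes the state, so the aggregation can be handed downstream as a single event.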

Apache Kafka now offers a way to hold back results until the window closes. Here is the piece of code:
suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
For more, check it out.

Related

Intentionally drop state when using suppress for rate limiting updates to KTable

I am using Kafka Streams 2.3.1 suppress() operator to limit the number of updates being sent to the underlying KTable.
The use case here is that in my processing logic, I want to make an HTTP call, however to limit the number of calls, I am windowing the stream and aggregating source topic messages that fall into the same time window to make a single API call.
Code looks roughly as follows
// The chain ends in flatMapValues(), so the result is a KStream rather than a KTable.
KStream<String, Event> eventStream = inputKStream
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofSeconds(30)).grace(Duration.ofSeconds(5)))
    .aggregate(Aggregator::new, (key, value, aggregate) -> aggregate.aggregate(value), stateStore)
    // Emit at most every 5 seconds, or earlier once 500 records are buffered.
    .suppress(Suppressed.untilTimeLimit(Duration.ofSeconds(5), maxRecords(500).emitEarlyWhenFull()))
    .mapValues((windowedKey, groupedTriggerAggregator) -> { /* code here returning a list */ })
    .toStream((k, v) -> k.key())
    .flatMapValues((readOnlyKey, value) -> value);
The problem I am running into is that while the windows exceeding the record limit are emitted, the state is preserved. At some point the state for a single time window grows to multiple MB, causing the suppress store changelog message to exceed the topic's max.message.bytes limit. For our use case, as soon as a window is emitted we actually don't care about leftover state and it would be safe to drop it.
As we are sharing the Kafka cluster between multiple teams, the team running the cluster is hesitant to increase the cluster-level max.message.bytes property to the 10 MB that we require.
Do I have any options other than implementing my logic using transformValues? If not, are there any future Kafka Streams enhancements that would be able to handle this more out of the box?
For our use case, as soon as window is emitted we actually don't care about leftover state and it would be safe to drop it.
For this case, you can lower the store retention time (default is 1 day) via the aggregate() parameter Materialized.withRetention(...). The smallest value Kafka Streams accepts is the window size plus the specified grace period.
The problem I am running into is that while the windows exceeding the record limit are emitted, the state is preserved. At some point the state for a single time window grows to multiple MB, causing the suppress store changelog message to exceed the topic's max.message.bytes limit.
This is actually an interesting statement, and looking at your code, I just want to clarify something: as you limit by time and allow emitting early based on cache size, it seems that you have a lot of records that are out of order and update the state further even after an intermediate result was emitted. If you purge the state via retention time as described above, you need to consider the following:
Purging state won't affect any emits that are triggered based on cache size, because the state will only be purged after the retention time has passed.
Furthermore, purging state implies that all out-of-order records that appear after purging would not be processed at all, but would be dropped (because retention time implicitly marks input records with a smaller timestamp as "late").
However, overall it seems that you don't really care about out-of-order data and event-time windows, as it's OK for you to "arbitrarily" put records into a window, the only goal being to reduce the number of external API calls. Hence, it seems appropriate to switch to processing-time semantics by using WallclockTimestampExtractor (instead of the default extractor). To ensure that each record is only emitted once, you should change the suppress() configuration to only emit "final" results.

Is it possible to pause and resume Kafka Stream conditionally?

I have a requirement, as described at https://kafka.apache.org/21/documentation/streams/developer-guide/dsl-api.html#window-final-results, to wait until a window is closed in order to handle late, out-of-order events by buffering them for the duration of the window.
My understanding of this feature is that once windowing is created, the window works like wall clock processing, e.g. when creating a 1-hour window, the window starts ticking once the first event arrives. This 1-hour window is closed exactly one hour later, and all the events buffered so far are forwarded downstream. However, I need to be able to hold this window even longer, conditionally, for as long as required, e.g. based on state or information in an external system such as a database.
To be precise my requirement for event forwarding is (windows of 1 hour if external state record says it is good) or (hold for as long as required until external record says it's good and resume tracking of the event until the event make it fully 1hr, disregarding the time when external system is not good)
To elaborate on this second condition: if my window duration is 1 hr and my event starts at 00:00, and the external system goes down at 00:30 and is back to normal at 00:45, the window should extend until 01:15.
Is it possible to pause and resume the forwarding of events conditionally based on my requirement above ?
Do I have to use transformation / processor and use value store manually to track the first processing time of my event and conditionally forwarding buffered events in punctuator ?
I'd appreciate any kind of workaround or suggestion for this requirement.
the window works like wall clock processing
No. Kafka Streams works on event time; hence, the timestamps returned from the TimestampExtractor (by default the embedded record timestamp) are used to advance time.
To be precise my requirement for event forwarding is (windows of 1 hour if external state record says it is good)
This would need a custom solution IMHO.
or (hold for as long as required until external record says it's good and resume tracking of the event until the event make it fully 1hr, disregarding the time when external system is not good)
Not 100% sure I understand this part.
Is it possible to pause and resume the forwarding of events conditionally based on my requirement above ?
No.
Do I have to use transformation / processor and use value store manually to track the first processing time of my event and conditionally forwarding buffered events in punctuator ?
I think this might be required.
Check out this blog post, which explains in detail how suppress() works and when it emits based on observed event time: https://www.confluent.io/blog/kafka-streams-take-on-watermarks-and-triggers
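If you go the custom processor route, the "disregard the time when the external system is not good" rule boils down to tracking accumulated healthy time instead of a fixed end timestamp. A minimal sketch of that bookkeeping, using only java.time: PausableWindow and its methods are hypothetical names, the date in the usage note is arbitrary, and in a real processor pause/resume would be driven by your external health check, with shouldClose evaluated from a wall-clock punctuator.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper: a window that only "ticks" while the external system is healthy.
class PausableWindow {
    private final Duration required;            // healthy time needed before the window may close
    private Duration accumulated = Duration.ZERO;
    private Instant runningSince;               // null while paused

    PausableWindow(Instant start, Duration required) {
        this.required = required;
        this.runningSince = start;
    }

    // External system went down: bank the healthy time seen so far and stop the clock.
    void pause(Instant now) {
        if (runningSince != null) {
            accumulated = accumulated.plus(Duration.between(runningSince, now));
            runningSince = null;
        }
    }

    // External system is back: start the clock again.
    void resume(Instant now) {
        if (runningSince == null) {
            runningSince = now;
        }
    }

    // True once the accumulated healthy time reaches the required window duration.
    boolean shouldClose(Instant now) {
        Duration elapsed = accumulated;
        if (runningSince != null) {
            elapsed = elapsed.plus(Duration.between(runningSince, now));
        }
        return elapsed.compareTo(required) >= 0;
    }
}
```

With the numbers from the question (1 hr window starting at 00:00, pause at 00:30, resume at 00:45), the window has 30 healthy minutes banked at the pause and needs 30 more after the resume, so shouldClose first returns true at 01:15.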

CQRS - out of order messages

Suppose we have 3 different services producing events, each of them publishing to its own event store.
Each of these services consumes the events of the other producer services.
This is because each service has to process another service's events AND create its own projection. Each service runs on multiple instances.
The most straightforward way to do it (for me) was to put "something" in front of each ES that picks up events and publishes (pub/sub) them to the queues of every other service.
This is perfect because every service can subscribe to any topic it likes, while the event publisher does the job, and if a service is unavailable, events are still delivered. This seems to me to guarantee high scalability and availability.
My problem is the queue. I can't get an easily scalable queue that guarantees ordering of the messages. It actually guarantees "slightly out of order" with at-least once delivery: to be clear, it's AWS SQS.
So, the ordering problems are:
No order guaranteed across events from the same event stream.
No order guaranteed across events from the same ES.
No order guaranteed across events from different ES (different services).
I thought I could solve the first two problems just by keeping track of the "sequence number" of the events coming from the same ES.
This would be done by tracking the last sequence number of each topic from which we are consuming events
This should be easy for reacting to events and also building our projection.
Then, when I pop an event from the queue, if the eventSequenceNumber > previousAppliedEventSequenceNumber + 1, I re-enqueue it (or make it invisible for a certain time).
But it turns out that this solution destroys performance when events are produced at high rates (I can use a visibility timeout or other stuff, the result should be the same).
This is because when I'm expecting event 10 and I ignore event 11 for a moment, I should also ignore all events (from the same ES) with sequence numbers coming after event 11, until event 11 shows up again and is actually processed.
Other difficulties were:
where to keep track of the event's sequence number for building the projection.
how to keep track of the event's sequence number for building the projection so that, when applying it, I have a consistent lastSequenceNumber.
What am I missing?
P.S.: for the third problem, think of the following scenario. We have a UserService and a CartService. The CartService has a projection that keeps track of the products in the cart for each user. Each cart's projection must also have the user's name and other info that comes from the UserCreated event published by the UserService. If UserCreated comes after ProductAddedToCart the normal flow requires to throw an exception because the user doesn't exist yet.
What am I missing?
You are missing flow -- consumers pull messages from sources, rather than having sources push the messages to the consumers.
When I wake up, I check my bookmark to find out which of your messages I read last, and then ask you if there have been any since. If there have, I retrieve them from you in order (think "document message"), also writing down the new bookmarks. Then I go back to sleep.
The primary purpose of push notifications is to interrupt the sleep period (thereby reducing latency).
With SQS acting as a queue, the idea is that you read all of the enqueued messages at once. If there are no gaps, then you can order the collection then start processing them and acking them. If there are gaps, you either wait (leaving the messages in the queue) or you go to the event store to fetch copies of the missing messages.
There's no magic -- if the message pipeline is promising "at least once" delivery, then the consumers must take steps to recognize duplicate messages as they arrive.
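The "check the bookmark, detect gaps, apply in order, recognize duplicates" idea above can be sketched roughly as follows. This is a plain-Java illustration, not SQS client code: OrderedBatchConsumer, offer and nextExpected are made-up names, and it assumes every event carries a per-stream sequence number that continues right after the bookmark.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical consumer-side buffer for one event stream.
class OrderedBatchConsumer {
    private long lastApplied;                                   // the bookmark
    private final TreeMap<Long, String> pending = new TreeMap<>();

    OrderedBatchConsumer(long lastApplied) {
        this.lastApplied = lastApplied;
    }

    // Accepts events in any order; returns the events that can now be applied, in sequence order.
    List<String> offer(long sequence, String payload) {
        List<String> ready = new ArrayList<>();
        if (sequence <= lastApplied) {
            return ready;                                       // duplicate (at-least-once delivery)
        }
        pending.putIfAbsent(sequence, payload);
        // Drain the buffer as long as there is no gap after the bookmark.
        while (!pending.isEmpty() && pending.firstKey() == lastApplied + 1) {
            ready.add(pending.pollFirstEntry().getValue());
            lastApplied++;
        }
        return ready;
    }

    // The sequence number to wait for, or to fetch from the event store when a gap persists.
    long nextExpected() {
        return lastApplied + 1;
    }
}
```

With the bookmark at 9, offering events 11 and 12 returns nothing; offering event 10 then releases 10, 11 and 12 in order, and a redelivered event 11 is ignored. Nothing has to be re-enqueued while the gap is open.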
If UserCreated comes after ProductAddedToCart the normal flow requires to throw an exception because the user doesn't exist yet.
Review Race Conditions Don't Exist, by Udi Dahan: "A microsecond difference in timing shouldn’t make a difference to core business behaviors."
The basic issue is assuming we can get messages IN ORDER...
This is a fallacy in distributed computing...
I suggest you design for no message ordering in your system.
As for your issues, try and use UTC time in the message body/header created by the originator and try and work around this data point. Sequence numbers are going to fail unless you have a central deterministic sequence creator (which will be a non-scalable, single point of failure).
Using sagas or a state machine is one path that can help make sense of the ordering of (business) events.

Which guarantees does Kafka Stream provide when using a RocksDb state store with changelog?

I'm building a Kafka Streams application that generates change events by comparing every new calculated object with the last known object.
So for every message on the input topic, I update an object in a state store and every once in a while (using punctuate), I apply a calculation on this object and compare the result with the previous calculation result (coming from another state store).
To make sure this operation is consistent, I do the following after the punctuate triggers:
write a tuple to the state store
compare the two values, create change events and context.forward them. So the events go to the results topic.
swap the tuple by the new_value and write it to the state store
I use this tuple for scenarios where the application crashes or rebalances, so I can always send out the correct set of events before continuing.
Now, I noticed the resulting events are not always consistent, especially if the application frequently rebalances. It looks like in rare cases the Kafka Streams application emits events to the results topic, but the changelog topic is not up to date yet. In other words, I produced something to the results topic, but my changelog topic is not at the same state yet.
So, when I do a stateStore.put() and the method call returns successfully, are there any guarantees when it will be on the changelog topic?
Can I enforce a changelog flush? When I do context.commit(), when will that flush+commit happen?
To get complete consistency, you will need to enable processing.guarantee="exactly_once"; otherwise, in case of an error, you might get inconsistent results.
If you want to stay with "at_least_once", you might want to use a single store and update the store after processing is done (i.e., after calling forward()). This minimizes the time window in which inconsistencies can occur.
And yes, if you call context.commit(), before input topic offsets are committed, all stores will be flushed to disk, and all pending producer writes will also be flushed.
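For reference, enabling the guarantee is just a configuration entry. A minimal sketch using plain string keys so that it does not depend on a particular client version; StreamsConfigSketch and exactlyOnceConfig are made-up names, and the application id and bootstrap servers are placeholders you would replace. Note that newer Kafka Streams releases (3.0 and later) deprecate this value in favor of "exactly_once_v2", so check which one your version expects.

```java
import java.util.Properties;

// Hypothetical helper that builds the properties passed to the KafkaStreams constructor.
class StreamsConfigSketch {
    static Properties exactlyOnceConfig(String applicationId, String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("application.id", applicationId);
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Commit state store changelog writes, output records and input offsets atomically.
        props.setProperty("processing.guarantee", "exactly_once");
        return props;
    }
}
```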

hold messages in a channel until limit is reached

Is there a way to set up a Spring Integration channel so that, let's say, it only sends messages to the output channel once it has accumulated 50 incoming messages? To look at it from a polling perspective, I want the polling process to be based on the number of messages instead of a fixed time interval: somehow poll the previous channel, possibly multiple times, but only accept messages once it has enough to process.
Use an <aggregator/> with a release-strategy-expression="size == 50" and a correlation-strategy-expression="'foo'" (and expire-groups-on-completion="true"). The expire-groups setting allows the next group ('foo') to form.
Follow the aggregator with a simple <splitter /> (no expressions, just in/out channels).
The aggregator will accumulate messages until 50 arrive and then release them as a collection, and the splitter will split the collection back to single messages.
If you want to release based on size or elapsed time (release a short group if x seconds elapse) then configure a MessageGroupStoreReaper.
