Requirement:
Hold the data during DB downtime and process it at a 5-minute interval by keeping it in a dead letter queue.
I have tried the approaches below:
A Kafka retry topic, but there are limitations: I have no control over the listener to configure the interval; the @KafkaListener picks up the message as soon as we push it (see the sketch after this list).
Picking the message up in the Kafka listener and storing it in a HashSet, with a scheduler that scans the HashSet every 5 minutes and wipes it out (this approach is not practical since the set is in memory).
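For context, here is a minimal sketch of the retry-topic approach, assuming spring-kafka's non-blocking retries (2.7+); the topic name and the persistence call are hypothetical. The @Backoff delay is where the 5-minute interval would be configured:

import org.springframework.kafka.annotation.DltHandler;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.annotation.RetryableTopic;
import org.springframework.retry.annotation.Backoff;
import org.springframework.stereotype.Component;

@Component
public class OrderListener {

    @RetryableTopic(attempts = "4", backoff = @Backoff(delay = 300_000)) // 5 minutes between attempts
    @KafkaListener(topics = "orders") // hypothetical topic name
    public void listen(String message) {
        // Throws while the DB is down; the record is then routed to a retry
        // topic and re-delivered after the configured delay, not immediately.
        saveToDatabase(message);
    }

    @DltHandler
    public void handleDlt(String message) {
        // Records that exhausted all retry attempts land here.
    }

    private void saveToDatabase(String message) {
        // hypothetical persistence call that fails during DB downtime
    }
}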
Hello Stack Overflow community and anyone familiar with spring-kafka!
I am currently working on a project which leverages the @RetryableTopic feature from spring-kafka in order to reattempt the delivery of failed messages. The listener annotated with @RetryableTopic is consuming from a topic that has 50 partitions and 3 replicas. When the app is receiving a lot of traffic, it could be autoscaled up to 50 instances of the app (consumers) reading from those partitions. I read in the spring-kafka documentation that, by default, the retry topics that @RetryableTopic auto-creates are created with one partition and one replica, but you can change these values with autoCreateTopicsWith() in the configuration. From this, I have a few questions:
With the autoscaling in mind, is it recommended to just create the retry topics with the same number of partitions and replicas (50 & 3) as the original topic?
Is there some benefit to having differing numbers of partitions/replicas for the retry topics considering their default values are just one?
The retry topics should have at least as many partitions as the original (by default, records are sent to the same partition); otherwise you have to customize the destination resolution to avoid the warning log. See Destination resolver returned non-existent partition
50 partitions might be overkill unless you get a lot of retried records.
It's up to you how many replicas you want, but in general, yes, I would use the same number of replicas as the original.
Only you can decide what are the "correct" numbers.
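For illustration, a minimal sketch of setting those values via autoCreateTopicsWith(), the builder method mentioned in the question (bean wiring is assumed; adjust the template types to your app):

import org.springframework.context.annotation.Bean;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.retrytopic.RetryTopicConfiguration;
import org.springframework.kafka.retrytopic.RetryTopicConfigurationBuilder;

@Bean
public RetryTopicConfiguration retryTopicConfiguration(KafkaTemplate<String, String> template) {
    return RetryTopicConfigurationBuilder
            .newInstance()
            // Match the original topic: 50 partitions, replication factor 3.
            .autoCreateTopicsWith(50, (short) 3)
            .create(template);
}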
I am a newbie to Kafka Streams and I am experimenting with it to process a stream of messages.
Scenario
Incoming payload structure is:
"building-<M>, sensor-<N>.<parameter>, value, timestamp".
For example:
"building-1, sensor-1.temperature, 18, 2020-06-12T15:01:05Z"
"building-1, sensor-1.humidity, 75, 2020-06-12T15:01:05Z"
"building-1, sensor-2.temperature, 20, 2020-06-12T15:01:05Z"
"building-1, sensor-2.humidity, 70, 2020-06-12T15:01:05Z"
The message key in Kafka is the building ID.
The stream transforms this into a POJO for further downstream processing:
SensorData {
buildingId = "building-1"
sensorId = "sensor-1"
parameterName = "temperature"
parameterValue = 18
timestamp = 1592048743000
..
..
}
Each sensor sends all of its parameters at the same time, as separate records. Each sensor sends a set of readings every 5 minutes.
The timestamp extractor is set to take the time from the payload. It also rejects a record if its timestamp is way off (say, a 1-hour deviation from the current stream time).
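A minimal sketch of such an extractor (the class name and field positions are assumed from the payload format above); returning a negative timestamp tells Kafka Streams to drop the record:

import java.time.Duration;
import java.time.Instant;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class PayloadTimestampExtractor implements TimestampExtractor {

    private static final long MAX_DEVIATION_MS = Duration.ofHours(1).toMillis();

    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        // Payload: "building-1, sensor-1.temperature, 18, 2020-06-12T15:01:05Z"
        String[] fields = record.value().toString().split(",");
        long ts = Instant.parse(fields[3].trim()).toEpochMilli();
        // Reject records that deviate too far from the highest timestamp seen
        // so far; a negative return value makes Kafka Streams drop the record.
        if (partitionTime >= 0 && Math.abs(ts - partitionTime) > MAX_DEVIATION_MS) {
            return -1L;
        }
        return ts;
    }
}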
In my topology, at one point, I want to perform an aggregate operation combining all the data from one sensor. For example, in the above sample, I want to perform an aggregation for each sensor using the temperature and humidity reported by that sensor.
Topology
I group by "buildingId" and "sensorId", then apply a session window with a 2-minute inactivity gap and a 1-minute grace period.
kStreamBuilder
    .stream("building-sensor-updates", ...)
    // Had to clean up the key and also needed some data from the context
    .transform(() -> new String2SensorObjectConvertor())
    // triggers another re-partition
    .groupBy((key, value) -> value.buildingId + "-" + value.sensorId, ...)
    .windowedBy(SessionWindows.with(..))
    .aggregate(
        () -> new SensorDataAggregator(),
        ...,
        Materialized.<String, SensorDataAggregator, SessionStore<Bytes, byte[]>>as("session_aggregate_store"))
    .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
    .toStream()
    ...
    ...
As expected, this triggers a re-partition, and the sub-topology consumes records from the re-partition topic "sensor_data_processor-session_aggregate_store-repartition". I am seeing an issue there, as explained later.
Test input data
I am testing a scenario where past data is re-processed again, either from storage or from a Kafka offset. For testing, I am feeding data from a CSV using Kafka-spool-connect. The timestamps of the records in the input CSV file are kept in ascending order. For the same sensor, the next set of records has a timestamp increased by 5 minutes.
"building-1, sensor-1.temperature, 18, 2020-06-12T15:01:02Z"
"building-1, sensor-1.humidity, 75, 2020-06-12T15:01:05Z"
"building-1, sensor-2.temperature, 20, 2020-06-12T15:01:03Z"
"building-1, sensor-2.humidity, 70, 2020-06-12T15:01:06Z"
"building-1, sensor-1.temperature, 19, 2020-06-12T15:06:04Z"
"building-1, sensor-1.humidity, 65, 2020-06-12T15:06:08Z"
"building-1, sensor-2.temperature, 21, 2020-06-12T15:06:05Z"
"building-1, sensor-2.humidity, 73, 2020-06-12T15:06:09Z"
I inject the test data in bulk (200,000 records) without any delay.
Issue
When the sub-topology processes the records from this re-partition topic, I see the following WARNING message from KStreamSessionWindowAggregate and the records get skipped.
WARN
org.apache.kafka.streams.kstream.internals.KStreamSessionWindowAggregate
- Skipping record for expired window. key=[BUILDING-ID-1003-sensor-1] topic=[sensor_data_processor-session_aggregate_store-repartition]
partition=[0] offset=[1870] timestamp=[1591872043000]
window=[1591872043000,1591872043000] expiration=[1591951243000]
streamTime=[1591951303000]
If you look at the timestamps in the WARNING message:
The timestamp of the message is June 11, 2020 10:40:43Z.
The window expiration is June 12, 2020 08:40:43Z.
The stream time has already passed June 12, 2020 08:41:43Z.
I tried with a time window of 7 minutes and a 2-minute advance. I had a similar issue there as well.
Observations
As the key of the original messages is "building-id", all records from the same building (and hence the same sensor) should go into one partition, and the records from each sensor should be in order.
I am also doing a transform() at the beginning of the topology. I had to clean up the key and also wanted some data from the context. Though this may trigger a re-partition, it should not change the order of records within a sensor, as it only cleans up the key; the partitioning outcome keeps the same elements together in a partition. I will get rid of this transform() with some optimization.
My window grouping is based on building-id + sensor-id, so the elements from the same sensor in each re-partitioned group should also arrive in order.
Given all this, I was hoping that each partition/group's stream time would progress monotonically with the timestamps of the events in that partition, since their order is maintained. But I see a jump in the stream time. I looked at org.apache.kafka.streams.kstream.internals.KStreamSessionWindowAggregate and some kafka-streams documentation -
It appears to me that monotonic stream time is maintained per stream task, not per partition, and the same stream task may be used for processing multiple topic partitions. Because the records are injected in quick succession, the task may process a bulk of records from one partition, and when it picks up another topic partition the stream time may already be far ahead of the timestamps of the records in that partition, which results in them being expired.
Questions
For replaying records like this, how can this be handled, other than setting a large grace period for the window?
Even in a real-time scenario, this issue might happen when there is backpressure. Using a large grace period is not an option, as results would be delayed, since I am using Suppressed.untilWindowCloses(). What would be the best way to handle this?
If stream time is maintained per stream task, and the same task may be used for multiple topic partitions, is there any way we can keep a 1:1 mapping and stickiness between stream tasks and topic partitions? If so, what would be the implications, other than potential performance issues?
Why wouldn't kafka-streams maintain stream time per topic partition instead of per stream task?
When I looked at the "sensor_data_processor-session_aggregate_store-repartition" topic mentioned in the warning message, I see that mostly "temperature" records alone are getting published to that topic (yes, for each group, "temperature" comes first in the test data set). Why do only temperature records go into that topic? Is it just a timing coincidence?
For replaying records like this, how can this be handled, other than setting a large grace period for the window?
I guess you cannot. If you process today's data, and later data from yesterday, the data from yesterday would be discarded. What you could do is start a new application. In this case, on startup the app has no stream time, so it will initialize its stream time with "yesterday", and thus that data won't be discarded.
Even in a real-time scenario, this issue might happen when there is backpressure. Using a large grace period is not an option, as results would be delayed, since I am using Suppressed.untilWindowCloses(). What would be the best way to handle this?
Well, you have to pick your poison... Or you fall back to the Processor API and implement whatever logic you need manually.
If stream time is maintained per stream task, and the same task may be used for multiple topic partitions, is there any way we can keep a 1:1 mapping and stickiness between stream tasks and topic partitions? If so, what would be the implications, other than potential performance issues?
Stream time is definitely maintained per task, and there is a 1:1 mapping between tasks and partitions. Maybe the data is shuffled unexpectedly. "My window grouping is based on building-id + sensor-id, so the elements from the same sensor in each re-partitioned group should also arrive in order.": agreed; however, the data would still be shuffled, and thus, if one upstream task processes data faster than its "parallel" peers, it would lead to a fast advance of stream time in all downstream tasks, too.
Why wouldn't kafka-streams maintain stream time per topic partition instead of per stream task?
Not sure if I can follow. Each task tracks stream time individually, and there is a 1:1 mapping between tasks and partitions. Hence, both (tracking per partition or tracking per task, assuming there is only one input partition per task) amount to the same thing.
My Kafka topic contains statuses keyed by deviceId. I would like to use KStreamBuilder.stream().groupByKey().aggregate(...) to keep only the latest value of a status in a TimeWindow. I guess that, as long as the topic is partitioned by key, the aggregation function can always return the latest value in this fashion:
(key, value, older_value) -> value
Is this a guarantee I can expect from Kafka Streams? Should I roll my own processing method that checks the timestamp?
Kafka Streams guarantees ordering by offsets but not by timestamp. Thus, by default, the "last update wins" policy is based on offsets, not on timestamps. Late-arriving records ("late" defined on timestamps) are out of order with respect to timestamps, and they will not be reordered; the original offset order is preserved.
If you want your window to contain the latest value based on timestamps, you will need to use the Processor API (PAPI) to make this work.
Within Kafka Streams' DSL, you cannot access the record timestamp that is required to get the correct result. An easy way might be to put a .transform() before the .groupBy() and add the timestamp to the record (i.e., to its value) itself. Thus, you can use the timestamp within your Aggregator (btw: a .reduce(), which is simpler to use, might also work instead of .aggregate()). Finally, you need to do a .mapValues() after your .aggregate() to remove the timestamp from the value again; a sketch follows below.
Using this mix-and-match approach of the DSL and the PAPI should simplify your code, as you can use the DSL's windowing support and KTable, and do not need to do low-level time-window and state management.
Of course, you can also just do all this in a single low-level stateful processor, but I would not recommend it.
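A minimal sketch of that approach (status values are simplified to Strings; TimestampedStatus, the topic names, and the window size are assumptions; in a real app you would also configure a Serde for TimestampedStatus via Grouped/Materialized):

import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.ValueTransformer;
import org.apache.kafka.streams.processor.ProcessorContext;

public class LatestStatusTopology {

    // Hypothetical wrapper pairing a status value with its record timestamp.
    public record TimestampedStatus(String status, long timestamp) { }

    public static void build(StreamsBuilder builder) {
        builder.<String, String>stream("device-statuses") // hypothetical input topic
                // Step 1: copy the record timestamp into the value.
                .transformValues(() -> new ValueTransformer<String, TimestampedStatus>() {
                    private ProcessorContext context;

                    @Override
                    public void init(ProcessorContext context) { this.context = context; }

                    @Override
                    public TimestampedStatus transform(String value) {
                        return new TimestampedStatus(value, context.timestamp());
                    }

                    @Override
                    public void close() { }
                })
                .groupByKey()
                .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
                // Step 2: "latest timestamp wins" instead of "latest offset wins".
                .reduce((current, next) -> next.timestamp() >= current.timestamp() ? next : current)
                // Step 3: strip the timestamp from the value again.
                .mapValues(TimestampedStatus::status)
                .toStream((windowedKey, value) -> windowedKey.key())
                .to("latest-statuses"); // hypothetical output topic
    }
}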
I understand Spanner's read-only transaction within one Paxos group.
But how does a read-only transaction over more than one Paxos group work? The paper says that it uses TT.now().latest as the timestamp, and then performs a snapshot read at that timestamp. But why does this work?
In each replica, there is a safe time. The safe time is the timestamp of the last write transaction applied at that replica. A replica is up to date if the requested timestamp <= safe time.
The paper also says that the snapshot read at the given timestamp (the second phase of the read-only transaction) may need to wait until the replicas are up to date. What happens if, after the read transaction, no write transaction ever occurs? Will the safe time then never be updated, leaving the read transaction blocked forever?
AFAICT, the point is that, if a process sees that TT.now().latest has passed, no other process will ever be assigned that timestamp, so any future write transaction will have a commit time (and hence safe time) greater than it. So the process performing the snapshot read only needs to wait until that timestamp has passed.
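As a hypothetical pseudo-Java sketch of that argument (the helper methods are made up for illustration; TT is the TrueTime API from the paper):

// A replica serving a snapshot read at timestamp sRead, where
// sRead = TT.now().latest was picked by the reading process.
long snapshotRead(long sRead) throws InterruptedException {
    // Once sRead has passed, any later write gets a commit timestamp
    // greater than sRead, so applying it pushes safe time beyond sRead.
    while (safeTime() < sRead) {
        waitForSafeTimeAdvance(); // hypothetical: blocks until the replica applies a later Paxos write
    }
    return readAtTimestamp(sRead); // hypothetical: consistent snapshot at sRead
}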
Spanner is now available as a service on Google Cloud Platform.
Here are the docs on how the read-only transactions work:
https://cloud.google.com/spanner/docs/transactions#read-only_transactions
==
A Cloud Spanner read-only transaction executes a set of reads at a single logical point in time, both from the perspective of the read-only transaction itself and from the perspective of other readers and writers to the Cloud Spanner database. This means that read-only transactions always observe a consistent state of the database at a chosen point in the transaction history.
==