Kafka Streams with suppress() reprocessing changelog

I have a Spring Cloud Stream application (Kafka Streams version 2.1) using the Kafka Streams binder, and I am doing time-window aggregations where I only want to take some action (an API call) once the window closes. The behavior I'm observing is that on every application restart, my mapValues function is called for every record stored in the changelog, resulting in a huge number of calls being made to the API.
My understanding of suppress() is that for every closed time window, a tombstone record is sent to the aggregate changelog topic, effectively preventing it from being reprocessed, even after application restarts.
What could be causing messages to be reprocessed on an app restart?
I've already confirmed that the app is not re-consuming the source topic.
A snippet of the relevant code is below:
Serde<Aggregator> aggregatorSerde = new JsonSerde<>(Aggregator.class, objectMapper);

Materialized<String, Aggregator, WindowStore<Bytes, byte[]>> stateStore =
        Materialized.<String, Aggregator, WindowStore<Bytes, byte[]>>with(Serdes.String(), aggregatorSerde);

KStream<String, Event> windowedEventStream = inputKStream
        .groupByKey()
        .windowedBy(TimeWindows.of(Duration.ofSeconds(30)).grace(Duration.ofSeconds(5)))
        .aggregate(Aggregator::new, (key, value, aggregate) -> aggregate.aggregate(value), stateStore)
        .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()).withName(suppressStoreName))
        .mapValues((windowedKey, aggregator) -> { /* code here returning a List<Event> */ })
        .toStream((k, v) -> k.key())
        .flatMapValues((readOnlyKey, value) -> value);

Related

KafkaStreams: Handling Deserialize exception in KStream-KTable Join

Let's say we are doing an inner join between a KStream and a KTable as shown below:
StreamsBuilder sb = new StreamsBuilder();
JsonSerde<SensorMetaData> sensorMetaDataJsonSerde = new JsonSerde<>(SensorMetaData.class);
KTable<String, SensorMetaData> kTable = sb.stream("sensorMetadata",
        Consumed.with(Serdes.String(), sensorMetaDataJsonSerde)).toTable();
KStream<String, String> kStream = sb.stream("sensorValues",
        Consumed.with(Serdes.String(), Serdes.String()));
KStream<String, String> joined = kStream.join(kTable, (left, right) -> getJoinedOutput(left, right));
A few points about the application:
SensorMetaData is a POJO
public class SensorMetaData {
    String sensorId;
    String sensorMetadata;
}
DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG is set to org.apache.kafka.streams.errors.LogAndContinueExceptionHandler
The JsonSerde class throws a SerializationException if deserialization fails.
When I run the application and send messages to both topics, the join works as expected.
I then changed the schema of SensorMetaData as shown below and redeployed the application on a new node:
public class SensorMetaData {
    String sensorId;
    MetadataTag[] metadataTags;
}
After the application starts, when I send a message to the sensorValues topic (the KStream's topic), the application shuts down with org.apache.kafka.common.errors.SerializationException. Looking at the stack trace, I realized it is failing to deserialize SensorMetaData while performing the join, because of the schema change in SensorMetaData. A breakpoint in the deserialize method shows it is trying to deserialize data from the topic "app-KSTREAM-TOTABLE-STATE-STORE-0000000002-changelog".
So the question is: why is the application shutting down instead of skipping the bad record (i.e. the record with the old schema), even though DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG is set to org.apache.kafka.streams.errors.LogAndContinueExceptionHandler?
However, when the application encounters a bad record while reading from the topic "sensorMetadata" (i.e. sb.stream("sensorMetadata")), it successfully skips the record with the warning "Skipping record due to deserialization error".
Why is the join not skipping the bad record here? How can I handle this scenario? I want the application to skip the record and continue running instead of shutting down. Here is the stack trace:
at kafkastream.JsonSerde$2.deserialize(JsonSerde.java:51)
at org.apache.kafka.streams.state.internals.ValueAndTimestampDeserializer.deserialize(ValueAndTimestampDeserializer.java:54)
at org.apache.kafka.streams.state.internals.ValueAndTimestampDeserializer.deserialize(ValueAndTimestampDeserializer.java:27)
at org.apache.kafka.streams.state.StateSerdes.valueFrom(StateSerdes.java:160)
at org.apache.kafka.streams.state.internals.MeteredKeyValueStore.outerValue(MeteredKeyValueStore.java:207)
at org.apache.kafka.streams.state.internals.MeteredKeyValueStore.lambda$get$2(MeteredKeyValueStore.java:133)
at org.apache.kafka.streams.processor.internals.metrics.StreamsMetricsImpl.maybeMeasureLatency(StreamsMetricsImpl.java:821)
at org.apache.kafka.streams.state.internals.MeteredKeyValueStore.get(MeteredKeyValueStore.java:133)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl$KeyValueStoreReadWriteDecorator.get(ProcessorContextImpl.java:465)
at org.apache.kafka.streams.kstream.internals.KTableSourceValueGetterSupplier$KTableSourceValueGetter.get(KTableSourceValueGetterSupplier.java:49)
at org.apache.kafka.streams.kstream.internals.KStreamKTableJoinProcessor.process(KStreamKTableJoinProcessor.java:77)
at org.apache.kafka.streams.processor.internals.ProcessorNode.lambda$process$2(ProcessorNode.java:142)
at org.apache.kafka.streams.processor.internals.metrics.StreamsMetricsImpl.maybeMeasureLatency(StreamsMetricsImpl.java:806)
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:142)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:201)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:180)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:133)
at org.apache.kafka.streams.processor.internals.SourceNode.process(SourceNode.java:101)
at org.apache.kafka.streams.processor.internals.StreamTask.lambda$process$3(StreamTask.java:383)
at org.apache.kafka.streams.processor.internals.metrics.StreamsMetricsImpl.maybeMeasureLatency(StreamsMetricsImpl.java:806)
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:383)
at org.apache.kafka.streams.processor.internals.AssignedStreamsTasks.process(AssignedStreamsTasks.java:475)
at org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:550)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:802)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:697)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:670)
INFO stream-client [app-814c1c5b-a899-4cbf-8d85-2ed6eba81ccb] State transition from ERROR to PENDING_SHUTDOWN
Kafka Streams doesn't use the handler configured in DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG when it deserializes values read back from the local state store (note that the stack trace goes through StateSerdes and MeteredKeyValueStore). That's why it works fine for records coming from the source topic, but fails when deserializing the data stored for the table.
I'm not super experienced with Kafka, but I keep hearing the same advice over and over: if the format changes, either copy the data in the new format to another topic, or delete the data, reset offsets, and reprocess.
In this case, it may be better to delete the local KTable state and the internal topics backing the KTable, and let the app regenerate the KTable with the new structure.
This blog post explains the reset/delete process in more detail: https://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/
To share a bit of insight: Kafka is a very complex beast. To manage it successfully in production you need to build a good amount of tooling and code to maintain it, and (usually) change your deployment process to fit Kafka.
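If resetting isn't practical, another option (not from the original answer, just a sketch) is to make the value deserialization itself tolerant of old-schema records, since the deserialization exception handler is not consulted for state-store reads. The class below is hypothetical and assumes a Jackson ObjectMapper; returning null on a bad payload makes the table lookup behave as if no entry existed, instead of crashing the stream thread:
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.serialization.Deserializer;
import java.io.IOException;

public class TolerantJsonDeserializer<T> implements Deserializer<T> {
    private final ObjectMapper mapper = new ObjectMapper();
    private final Class<T> type;

    public TolerantJsonDeserializer(Class<T> type) {
        this.type = type;
    }

    @Override
    public T deserialize(String topic, byte[] data) {
        if (data == null) {
            return null;
        }
        try {
            return mapper.readValue(data, type);
        } catch (IOException e) {
            // Log and skip the unreadable (old-schema) record instead of
            // throwing a SerializationException that kills the stream thread.
            return null;
        }
    }
}
Wiring this deserializer into the Serde used for the KTable means records written with the old schema are silently treated as missing, which may or may not be acceptable for your join semantics.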

Kafka Streams: one of two TimestampExtractors doesn't get called

I'm having a strange problem in my sample Kafka Streams application.
I have the following two KStreams:
KStream<String, String> stream1 = builder.stream("topic1",
        Consumed.with(Serdes.String(), Serdes.String())
                .withTimestampExtractor(new Extractor1()))
...
KStream<String, String> stream2 = builder.stream("topic2",
        Consumed.with(Serdes.String(), Serdes.String())
                .withTimestampExtractor(new Extractor2()))
...
They are built from the same StreamsBuilder, which is configured with KafkaStreams properties that do not set StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG at all (I've also tried setting it).
The problem I am observing is that the extract() method of my first TimestampExtractor, Extractor1, gets called, but the extract() method of my second extractor, Extractor2, doesn't get called at all (even though messages flow through both streams).
What could be the reason?

Suppress triggers events only when new events are received on the stream

I am using Kafka streams 2.2.1.
I am using suppress to hold back events until a window closes, with event-time semantics.
However, suppressed messages are only emitted once a new message arrives on the stream.
The following code is a reduced sample of the problem:
KStream<UUID, String>[] branches = is
        .branch((key, msg) -> "a".equalsIgnoreCase(msg.split(",")[1]),
                (key, msg) -> "b".equalsIgnoreCase(msg.split(",")[1]),
                (key, value) -> true);
KStream<UUID, String> sideA = branches[0];
KStream<UUID, String> sideB = branches[1];
KStream<Windowed<UUID>, String> sideASuppressed = sideA
        .groupByKey(Grouped.with(new MyUUIDSerde(), Serdes.String()))
        .windowedBy(TimeWindows.of(Duration.ofMinutes(31)).grace(Duration.ofMinutes(32)))
        .reduce((v1, v2) -> v1)
        .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
        .toStream();
Messages are only emitted from sideASuppressed when a new message arrives on the sideA stream (messages arriving on sideB will not cause suppress to emit anything, even if the window-close time passed long ago).
Although the problem is unlikely to occur much in production due to high volume, there are enough cases where it is essential not to wait for a new message to arrive on the sideA stream.
Thanks in advance.
According to the Kafka Streams documentation:
Stream-time is only advanced if all input partitions over all input topics have new data (with newer timestamps) available. If at least one partition does not have any new data available, stream-time will not be advanced and thus punctuate() will not be triggered if PunctuationType.STREAM_TIME was specified. This behavior is independent of the configured timestamp extractor, i.e., using WallclockTimestampExtractor does not enable wall-clock triggering of punctuate().
I am not sure why this is the case, but it explains why suppressed messages are only emitted when new messages arrive on the input partitions the suppression reads from.
If anyone has an answer as to why the implementation works this way, I would be happy to learn. This behavior forces my implementation to emit extra messages just to get the suppressed messages out in time, and it makes the code much less readable.
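A common workaround (not from the original answer, just a sketch) is to publish periodic heartbeat records into the input topic so that stream time keeps advancing even when no real events arrive; the heartbeats are shaped so they never enter the real aggregation. The topic name, producer, and payload below are hypothetical:
// Scheduled "tick" producer: heartbeats keep stream time moving so suppress()
// can flush windows that have already closed. Note that a heartbeat must reach
// every input partition before the corresponding task advances its stream time.
ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
scheduler.scheduleAtFixedRate(() -> {
    // 'producer' is assumed to be an existing KafkaProducer<UUID, String>
    producer.send(new ProducerRecord<>("input-topic", UUID.randomUUID(), "tick,heartbeat"));
}, 0, 1, TimeUnit.MINUTES);
With the payload above, msg.split(",")[1] is "heartbeat", so the record falls through to the catch-all branch and never reaches sideA or sideB, yet its timestamp still advances stream time for the task.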

Build a Kafka Stream that returns the list of distinct ids within a time interval

I have a Kafka stream of object events:
KStream<String, MyObjectEvent> stream = builder.stream("mytopic", Consumed.with(Serdes.String(), new JsonSerde<>(MyObjectEvent.class)));
Each MyObjectEvent has a property idType (Long). I need to build a stream that returns the distinct idTypes within a time interval (for example: 10 minutes).
Is this possible using the Kafka Streams DSL? I can't find a solution.
Based on your use case, you are looking for a windowed aggregation. The Kafka Streams DSL has TimeWindowedKStream and SessionWindowedKStream, which should be able to solve your problem.
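As an illustration (a sketch, not from the original answer, assuming MyObjectEvent exposes a getIdType() accessor): re-key the stream by idType and count per 10-minute window, so every windowed key that appears is one distinct idType seen in that interval.
// Re-key by idType and count per 10-minute window; the set of windowed keys is
// the set of distinct idTypes observed in each interval.
KTable<Windowed<Long>, Long> distinctIdsPerWindow = builder
        .stream("mytopic", Consumed.with(Serdes.String(), new JsonSerde<>(MyObjectEvent.class)))
        .groupBy((key, event) -> event.getIdType(),
                Grouped.with(Serdes.Long(), new JsonSerde<>(MyObjectEvent.class)))
        .windowedBy(TimeWindows.of(Duration.ofMinutes(10)))
        .count();
Calling distinctIdsPerWindow.toStream() then yields one record per distinct idType per window, updated as new events arrive.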
I don't know Kafka Streams' API in depth, but in general streaming APIs you'd have a method that buffers messages over time (like buffer, groupedWithin, or something similar) where you can specify a duration (and/or a maximum number of messages).
Then your stream would be something like (pseudocode; search for the correct methods):
KStream stream = builder.stream("mytopic", Consumed.with(Serdes.String(), new JsonSerde<>(MyObjectEvent.class)))
        .map(record -> record.value().getId())      // assuming you get a stream of records; I don't know the Kafka Streams API
        .groupedWithin(Duration.ofMinutes(10));     // <-- pseudocode, search for the correct method
Then you'd get a stream that contains the ids over time.

Kafka Streams API: KStream to KTable

I have a Kafka topic where I send location events (key=user_id, value=user_location). I am able to read and process it as a KStream:
KStreamBuilder builder = new KStreamBuilder();
KStream<String, Location> locations = builder
        .stream("location_topic")
        .map((k, v) -> {
            // some processing here, omitted for clarity
            Location location = new Location(lat, lon);
            return new KeyValue<>(k, location);
        });
That works well, but I'd like to have a KTable with the last known position of each user. How could I do it?
I am able to do it by writing to and reading from an intermediate topic:
// write to intermediate topic
locations.to(Serdes.String(), new LocationSerde(), "location_topic_aux");
// build KTable from intermediate topic
KTable<String, Location> table = builder.table("location_topic_aux", "store");
Is there a simple way to obtain a KTable from a KStream? This is my first app using Kafka Streams, so I'm probably missing something obvious.
Update:
In Kafka 2.5, a new method KStream#toTable() will be added that provides a convenient way to transform a KStream into a KTable. For details see: https://cwiki.apache.org/confluence/display/KAFKA/KIP-523%3A+Add+KStream%23toTable+to+the+Streams+DSL
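For example (a sketch, assuming Kafka Streams 2.5+ and the LocationSerde from the question):
// With 2.5+, neither the intermediate topic nor the dummy reduce is needed:
KTable<String, Location> table = locations.toTable(
        Materialized.with(Serdes.String(), new LocationSerde()));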
Original Answer:
There is no straightforward way to do this at the moment. Your approach is absolutely valid, as discussed in the Confluent FAQ: http://docs.confluent.io/current/streams/faq.html#how-can-i-convert-a-kstream-to-a-ktable-without-an-aggregation-step
This is the simplest approach with regard to the code. However, it has the disadvantages that (a) you need to manage an additional topic and (b) it results in additional network traffic, because data is written to and re-read from Kafka.
There is one alternative, using a "dummy reduce":
KStreamBuilder builder = new KStreamBuilder();
KStream<String, Long> stream = ...; // some computation that creates the derived KStream
KTable<String, Long> table = stream.groupByKey().reduce(
    new Reducer<Long>() {
        @Override
        public Long apply(Long aggValue, Long newValue) {
            return newValue;
        }
    },
    "dummy-aggregation-store");
This approach is somewhat more complex with regard to the code compared to option 1, but has the advantages that (a) no manual topic management is required and (b) re-reading the data from Kafka is not necessary.
Overall, you need to decide for yourself which approach you like better:
In option 2, Kafka Streams will create an internal changelog topic to back up the KTable for fault tolerance. Thus, both approaches require some additional storage in Kafka and result in additional network traffic. Overall, it’s a trade-off between slightly more complex code in option 2 versus manual topic management in option 1.
