Kafka Streams: one of two TimestampExtractors doesn't get called - apache-kafka-streams

Having a strange problem in my sample Kafka Streams application.
I have the following 2 KStreams:
KStream stream1 = builder.stream("topic1", Consumed.with(Serdes.String(), Serdes.String())
.withTimestampExtractor(new Extractor1()))
...
KStream stream2 = builder.stream("topic2", Consumed.with(Serdes.String(), Serdes.String())
.withTimestampExtractor(new Extractor2()))
...
They are built from the same StreamsBuilder builder, which is configured with KafkaStreams properties without the StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG property set at all (I've also tried setting it).
The problem I am observing is that the extract() method of my first TimestampExtractor, Extractor1, gets called, but the extract() method of my second, Extractor2, doesn't get called at all (even though messages flow through both streams).
What could be the reason?
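For context, each extractor implements Kafka Streams' TimestampExtractor interface, along the lines of the sketch below (purely hypothetical contents for Extractor1; the comma-separated value format is an assumption, since the actual payloads aren't shown in the question):
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class Extractor1 implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        // Hypothetical logic: assume the event time is the first comma-separated
        // field of the String value; fall back to the record's own timestamp
        // if it can't be parsed.
        try {
            return Long.parseLong(((String) record.value()).split(",")[0]);
        } catch (Exception e) {
            return record.timestamp();
        }
    }
}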

Related

KafkaStreams: Handling Deserialize exception in KStream-KTable Join

Let's say we are doing an inner join between a KStream and a KTable, as shown below:
StreamsBuilder sb = new StreamsBuilder();
JsonSerde<SensorMetaData> sensorMetaDataJsonSerde = new JsonSerde<>(SensorMetaData.class);
KTable<String, SensorMetaData> kTable = sb.stream("sensorMetadata",
        Consumed.with(Serdes.String(), sensorMetaDataJsonSerde)).toTable();
KStream<String, String> kStream = sb.stream("sensorValues",
        Consumed.with(Serdes.String(), Serdes.String()));
KStream<String, String> joined = kStream.join(kTable, (left, right) -> getJoinedOutput(left, right));
A few points about the application:
SensorMetaData is a POJO
public class SensorMetaData {
    String sensorId;
    String sensorMetadata;
}
DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG is set to org.apache.kafka.streams.errors.LogAndContinueExceptionHandler (see the configuration sketch after these points).
The JsonSerde class will throw a SerializationException if deserialization fails.
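For reference, the handler is wired in through the streams configuration, roughly as in the sketch below (the broker address is a placeholder; the application id "app" is only inferred from the changelog topic name that appears in the stack trace further down):
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "app");               // inferred from the "app-KSTREAM-TOTABLE-..." changelog name
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker address
// The property this question is about: log bad records and keep processing.
props.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG,
        LogAndContinueExceptionHandler.class);
KafkaStreams streams = new KafkaStreams(sb.build(), props);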
When I run the application and send messages to both topics, the join works as expected.
Now I changed the schema of SensorMetaData as below and redeployed the application on a new node:
public class SensorMetaData {
    String sensorId;
    MetadataTag[] metadataTags;
}
After the application starts, when I send a message to the sensorValues topic (the stream topic), the application shuts down with org.apache.kafka.common.errors.SerializationException. Looking at the stack trace, I realized it is failing to deserialize SensorMetaData while performing the join, because of the schema change in SensorMetaData. A breakpoint in the deserialize method shows it is trying to deserialize data from the topic "app-KSTREAM-TOTABLE-STATE-STORE-0000000002-changelog".
So the question is: why is the application shutting down instead of skipping the bad record (i.e. the record with the old schema), even though DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG is set to org.apache.kafka.streams.errors.LogAndContinueExceptionHandler?
However, when the application encounters a bad record while reading from the topic "sensorMetadata" (i.e. sb.stream("sensorMetadata")), it successfully skips the record with the warning "Skipping record due to deserialization error".
Why is the join not skipping the bad record here? How can I handle this scenario? I want the application to skip the record and continue running instead of shutting down. Here is the stack trace:
at kafkastream.JsonSerde$2.deserialize(JsonSerde.java:51)
at org.apache.kafka.streams.state.internals.ValueAndTimestampDeserializer.deserialize(ValueAndTimestampDeserializer.java:54)
at org.apache.kafka.streams.state.internals.ValueAndTimestampDeserializer.deserialize(ValueAndTimestampDeserializer.java:27)
at org.apache.kafka.streams.state.StateSerdes.valueFrom(StateSerdes.java:160)
at org.apache.kafka.streams.state.internals.MeteredKeyValueStore.outerValue(MeteredKeyValueStore.java:207)
at org.apache.kafka.streams.state.internals.MeteredKeyValueStore.lambda$get$2(MeteredKeyValueStore.java:133)
at org.apache.kafka.streams.processor.internals.metrics.StreamsMetricsImpl.maybeMeasureLatency(StreamsMetricsImpl.java:821)
at org.apache.kafka.streams.state.internals.MeteredKeyValueStore.get(MeteredKeyValueStore.java:133)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl$KeyValueStoreReadWriteDecorator.get(ProcessorContextImpl.java:465)
at org.apache.kafka.streams.kstream.internals.KTableSourceValueGetterSupplier$KTableSourceValueGetter.get(KTableSourceValueGetterSupplier.java:49)
at org.apache.kafka.streams.kstream.internals.KStreamKTableJoinProcessor.process(KStreamKTableJoinProcessor.java:77)
at org.apache.kafka.streams.processor.internals.ProcessorNode.lambda$process$2(ProcessorNode.java:142)
at org.apache.kafka.streams.processor.internals.metrics.StreamsMetricsImpl.maybeMeasureLatency(StreamsMetricsImpl.java:806)
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:142)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:201)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:180)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:133)
at org.apache.kafka.streams.processor.internals.SourceNode.process(SourceNode.java:101)
at org.apache.kafka.streams.processor.internals.StreamTask.lambda$process$3(StreamTask.java:383)
at org.apache.kafka.streams.processor.internals.metrics.StreamsMetricsImpl.maybeMeasureLatency(StreamsMetricsImpl.java:806)
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:383)
at org.apache.kafka.streams.processor.internals.AssignedStreamsTasks.process(AssignedStreamsTasks.java:475)
at org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:550)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:802)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:697)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:670)
INFO stream-client [app-814c1c5b-a899-4cbf-8d85-2ed6eba81ccb] State transition from ERROR to PENDING_SHUTDOWN
Kafka Streams doesn't use the handler configured in DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG when it reads the RocksDB files (note that the stack trace mentions the class StateSerdes). That's why it works fine for records coming from the source topic, but fails when deserialising the data in the table.
I'm not super experienced with Kafka, but I keep hearing over and over again: if something changes, copy the data with the new format to another topic or delete the data, reset offsets and re-process.
In this case, maybe it's better to delete the KTable files and the internal topics used for the KTable, and let the app re-generate the KTable with the new structure.
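A minimal sketch of the local part of that (the internal/changelog topics would still need to be reset or deleted separately, e.g. with the kafka-streams-application-reset tool):
// Sketch: wipe this instance's local state before starting, so the KTable
// is rebuilt from scratch once the internal topics have been reset/deleted.
KafkaStreams streams = new KafkaStreams(sb.build(), props);
streams.cleanUp();  // deletes the local state directory for this application.id; call it before start()
streams.start();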
This blog post from a few months ago explains the process of deleting data in a bit more detail: https://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/
To share a bit of insight: Kafka is a very complex beast. To manage it successfully in production you need to build a good amount of tooling and code to maintain it, and (usually) change your deployment process to fit Kafka.

InvalidTopologyException(msg:Component: [x] subscribes from non-existent stream [y]

I'm trying to read data from Kafka and insert it into Cassandra using Storm. I've configured the topology, but I'm running into an issue and I don't have a clue why it is happening.
Here is my submitter piece.
TopologyBuilder topologyBuilder = new TopologyBuilder();
topologyBuilder.setSpout("spout", new KafkaSpout(spoutConfig));
topologyBuilder.setBolt("checkingbolt", new CheckingBolt("cassandraBoltStream")).shuffleGrouping("spout");
topologyBuilder.setBolt("cassandrabolt", new CassandraInsertBolt()).shuffleGrouping("checkingbolt");
Here, if I comment out the last line, I don't see any exceptions. With the last line, I get the error below:
InvalidTopologyException(msg:Component: [cassandrabolt] subscribes from non-existent stream: [default] of component [checkingbolt])
Can someone please help me, what is wrong here?
Here is the declareOutputFields implementation in CheckingBolt:
public void declareOutputFields(OutputFieldsDeclarer ofd) {
    ofd.declareStream(cassandraBoltStream, new Fields(new String[]{"jsonFields"}));
}
I don't have anything in the declareOutputFields method for CassandraInsertBolt, as that bolt doesn't emit any values.
TIA
The problem here is that you're mixing up stream names and component (i.e. spout/bolt) names. Component names are used for referring to different bolts, while stream names are used to refer to different streams coming out of the same bolt. For example, if you have a bolt named "evenOrOddBolt", it might emit two streams, an "even" stream and an "odd" stream. In many cases, though, you only have one stream coming out of a bolt, which is why Storm has some convenience methods that use a default stream name.
When you do .shuffleGrouping("checkingbolt"), you are using one of these convenience methods, effectively saying "I want this bolt to consume the default stream coming out of the checkingbolt". There is an overloaded version of this method you can use if you want to explicitly name the stream, but it's only useful if you have multiple streams coming out of the same bolt.
When you do ofd.declareStream(cassandraBoltStream, new Fields(new String[]{"jsonFields"}));, you are saying the bolt will emit on a stream named "cassandraBoltStream". This is probably not what you want to do; you want to declare that it will emit on the default stream, which you do by using the ofd.declare method instead.
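Concretely, the corrected method in CheckingBolt would look something like this sketch:
@Override
public void declareOutputFields(OutputFieldsDeclarer ofd) {
    // Declare the output fields on the default stream, so that
    // shuffleGrouping("checkingbolt") in the topology can subscribe to it.
    ofd.declare(new Fields("jsonFields"));
}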
Refer to the documentation for more details.

Kafka Streams with suppress() reprocessing changelog

I have a Spring Cloud Stream (Kafka Streams version 2.1) application with a Kafka Streams binder, and I am doing time-window aggregations where I only want to take some action (an API call) once the window closes. The behavior I'm observing is that on every application restart, my mapValues function is called for every record stored in the changelog, resulting in a huge number of calls being made to the API.
My understanding of suppress() is that for every closed time window, a tombstone record should be sent to the aggregate changelog topic, effectively preventing me from reprocessing it, even after application restarts.
What could be causing messages to be reprocessed on an app restart?
I've already confirmed that the app is not reconsuming the source topic.
Snippet of the relevant code below:
Serde<Aggregator> aggregatorSerde = new JsonSerde<>(Aggregator.class, objectMapper);
Materialized<String, Aggregator, WindowStore<Bytes, byte[]>> stateStore =
        Materialized.<String, Aggregator, WindowStore<Bytes, byte[]>>with(Serdes.String(), aggregatorSerde);
KStream<String, Event> windowedEventKStream = inputKStream
        .groupByKey()
        .windowedBy(TimeWindows.of(Duration.ofSeconds(30)).grace(Duration.ofSeconds(5)))
        .aggregate(Aggregator::new, (key, value, aggregate) -> aggregate.aggregate(value), stateStore)
        .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()).withName(supressStoreName))
        .mapValues((windowedKey, groupedTriggerAggregator) -> { /* code here returning a list */ })
        .toStream((k, v) -> k.key())
        .flatMapValues((readOnlyKey, value) -> value);

Build a Kafka Stream that returns the list of distinct ids into time interval

I have a Kafka stream of object events:
KStream<String, MyObjectEvent> stream = builder.stream("mytopic", Consumed.with(Serdes.String(), new JsonSerde<>(MyObjectEvent.class)));
Each MyObjectEvent has a property idType (Long). I need to build a stream that returns the distinct idTypes within a time interval (for example: 10 minutes).
Is this possible using the Kafka Streams DSL? I can't find a solution.
Based on your use case, you are looking for a windowed aggregation. The Kafka Streams DSL has TimeWindowedKStream and SessionWindowedKStream, which should be able to solve your problem.
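A minimal sketch of that idea, assuming MyObjectEvent exposes a getIdType() accessor: re-key the stream by idType and count per 10-minute window, so that the keys of each window are exactly the distinct idTypes seen in that interval.
KTable<Windowed<String>, Long> distinctIdTypes = stream
        // re-key by idType so each distinct idType forms its own group
        .map((key, event) -> KeyValue.pair(event.getIdType().toString(), event.getIdType()))
        .groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
        .windowedBy(TimeWindows.of(Duration.ofMinutes(10)))
        .count();
// distinctIdTypes.toStream() yields one entry per (window, idType) pair.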
I don't quite know Kafka Streams' API, but with a general streaming API you'd have a method that buffers messages over time (like buffer, groupedWithin, or something similar) where you can specify a time window (and/or a maximum number of messages).
Then your stream would be something like:
KStream stream = builder.stream("mytopic", Consumed.with(Serdes.String(), new JsonSerde<>(MyObjectEvent.class)))
        .map(record -> record.value().getId()) // assuming you get a stream of records, I don't know the KafkaStreams api
        .groupedWithin(Duration.ofMinutes(10)) // <-- pseudocode, search for correct method
Then you'd get a stream that contains the ids over time.

Kafka Streams API: KStream to KTable

I have a Kafka topic where I send location events (key=user_id, value=user_location). I am able to read and process it as a KStream:
KStreamBuilder builder = new KStreamBuilder();
KStream<String, Location> locations = builder
        .stream("location_topic")
        .map((k, v) -> {
            // some processing here, omitted for clarity
            Location location = new Location(lat, lon);
            return new KeyValue<>(k, location);
        });
That works well, but I'd like to have a KTable with the last known position of each user. How could I do it?
I am able to do it by writing to and reading from an intermediate topic:
// write to intermediate topic
locations.to(Serdes.String(), new LocationSerde(), "location_topic_aux");
// build KTable from intermediate topic
KTable<String, Location> table = builder.table("location_topic_aux", "store");
Is there a simple way to obtain a KTable from a KStream? This is my first app using Kafka Streams, so I'm probably missing something obvious.
Update:
In Kafka 2.5, a new method KStream#toTable() will be added that provides a convenient way to transform a KStream into a KTable. For details see: https://cwiki.apache.org/confluence/display/KAFKA/KIP-523%3A+Add+KStream%23toTable+to+the+Streams+DSL
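Once that is available, the conversion from the question becomes a one-liner (a sketch based on the locations stream above):
KTable<String, Location> table = locations.toTable();
// or, to name the backing state store explicitly:
// KTable<String, Location> table = locations.toTable(Materialized.as("latest-location-store"));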
Original Answer:
There is no straightforward way to do this at the moment. Your approach is absolutely valid, as discussed in the Confluent FAQ: http://docs.confluent.io/current/streams/faq.html#how-can-i-convert-a-kstream-to-a-ktable-without-an-aggregation-step
This is the simplest approach with regard to the code. However, it has the disadvantages that (a) you need to manage an additional topic and that (b) it results in additional network traffic because data is written to and re-read from Kafka.
There is one alternative, using a "dummy-reduce":
KStreamBuilder builder = new KStreamBuilder();
KStream<String, Long> stream = ...; // some computation that creates the derived KStream
KTable<String, Long> table = stream.groupByKey().reduce(
    new Reducer<Long>() {
        @Override
        public Long apply(Long aggValue, Long newValue) {
            return newValue;
        }
    },
    "dummy-aggregation-store");
This approach is somewhat more complex with regard to the code compared to option 1 but has the advantage that (a) no manual topic management is required and (b) re-reading the data from Kafka is not necessary.
Overall, you need to decide for yourself which approach you like better:
In option 2, Kafka Streams will create an internal changelog topic to back up the KTable for fault tolerance. Thus, both approaches require some additional storage in Kafka and result in additional network traffic. Overall, it’s a trade-off between slightly more complex code in option 2 versus manual topic management in option 1.
