Let's say we are doing an inner join between a KStream and a KTable, as shown below:
StreamsBuilder sb = new StreamsBuilder();
JsonSerde<SensorMetaData> sensorMetaDataJsonSerde = new JsonSerde<>(SensorMetaData.class);
// The table's value type follows the value serde, so it is SensorMetaData rather than String
KTable<String, SensorMetaData> kTable = sb.stream("sensorMetadata",
        Consumed.with(Serdes.String(), sensorMetaDataJsonSerde)).toTable();
KStream<String, String> kStream = sb.stream("sensorValues",
        Consumed.with(Serdes.String(), Serdes.String()));
KStream<String, String> joined = kStream.join(kTable, (left, right) -> getJoinedOutput(left, right));
A few points about the application:
SensorMetaData is a POJO
public class SensorMetaData{
String sensorId;
String sensorMetadata;
}
DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG is set to org.apache.kafka.streams.errors.LogAndContinueExceptionHandler
The JsonSerde class throws a SerializationException if deserialization fails.
When I run the application and send messages to both topics, the join works as expected.
I then changed the schema of SensorMetaData as shown below and redeployed the application on a new node:
public class SensorMetaData{
String sensorId;
MetadataTag[] metadataTags;
}
After the application starts, when I send a message to the sensorValues topic (the stream's topic), the application shuts down with org.apache.kafka.common.errors.SerializationException. Looking at the stack trace, I realized it's failing to deserialize SensorMetaData while performing the join, because of the schema change in SensorMetaData. A breakpoint in the deserialize method shows that it's trying to deserialize data from the topic "app-KSTREAM-TOTABLE-STATE-STORE-0000000002-changelog".
So the question is: why is the application shutting down instead of skipping the bad record (i.e. the record with the old schema), even though DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG is set to org.apache.kafka.streams.errors.LogAndContinueExceptionHandler?
However, when the application encounters a bad record while reading from the topic "sensorMetadata" (i.e. sb.stream("sensorMetadata")), it successfully skips the record with the warning "Skipping record due to deserialization error".
Why is the join not skipping the bad record here? How should I handle this scenario? I want the application to skip the record and continue running instead of shutting down. Here is the stack trace:
at kafkastream.JsonSerde$2.deserialize(JsonSerde.java:51)
at org.apache.kafka.streams.state.internals.ValueAndTimestampDeserializer.deserialize(ValueAndTimestampDeserializer.java:54)
at org.apache.kafka.streams.state.internals.ValueAndTimestampDeserializer.deserialize(ValueAndTimestampDeserializer.java:27)
at org.apache.kafka.streams.state.StateSerdes.valueFrom(StateSerdes.java:160)
at org.apache.kafka.streams.state.internals.MeteredKeyValueStore.outerValue(MeteredKeyValueStore.java:207)
at org.apache.kafka.streams.state.internals.MeteredKeyValueStore.lambda$get$2(MeteredKeyValueStore.java:133)
at org.apache.kafka.streams.processor.internals.metrics.StreamsMetricsImpl.maybeMeasureLatency(StreamsMetricsImpl.java:821)
at org.apache.kafka.streams.state.internals.MeteredKeyValueStore.get(MeteredKeyValueStore.java:133)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl$KeyValueStoreReadWriteDecorator.get(ProcessorContextImpl.java:465)
at org.apache.kafka.streams.kstream.internals.KTableSourceValueGetterSupplier$KTableSourceValueGetter.get(KTableSourceValueGetterSupplier.java:49)
at org.apache.kafka.streams.kstream.internals.KStreamKTableJoinProcessor.process(KStreamKTableJoinProcessor.java:77)
at org.apache.kafka.streams.processor.internals.ProcessorNode.lambda$process$2(ProcessorNode.java:142)
at org.apache.kafka.streams.processor.internals.metrics.StreamsMetricsImpl.maybeMeasureLatency(StreamsMetricsImpl.java:806)
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:142)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:201)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:180)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:133)
at org.apache.kafka.streams.processor.internals.SourceNode.process(SourceNode.java:101)
at org.apache.kafka.streams.processor.internals.StreamTask.lambda$process$3(StreamTask.java:383)
at org.apache.kafka.streams.processor.internals.metrics.StreamsMetricsImpl.maybeMeasureLatency(StreamsMetricsImpl.java:806)
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:383)
at org.apache.kafka.streams.processor.internals.AssignedStreamsTasks.process(AssignedStreamsTasks.java:475)
at org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:550)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:802)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:697)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:670)
INFO stream-client [app-814c1c5b-a899-4cbf-8d85-2ed6eba81ccb] State transition from ERROR to PENDING_SHUTDOWN
Kafka Streams doesn't use the handler configured in DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG when it reads from the state store's RocksDB files (note that the stack trace goes through the class StateSerdes). That's why it works fine for records coming from the source topic, but fails when deserialising the data in the table.
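If resetting the application is not an option, one workaround that is sometimes used is to make the value deserializer itself tolerant, so that a failed decode yields null instead of an exception. Below is a minimal sketch of that idea using only the JDK; TolerantDecoder and strictV2 are hypothetical names, and java.util.function.Function stands in for Kafka's Deserializer interface (a real version would implement org.apache.kafka.common.serialization.Deserializer and be used in the serde of the table's store). I would expect a null value from the store to be treated by the inner join as "no table entry", but that is an assumption worth testing on your Kafka version.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

/** Hypothetical wrapper: turns decode failures into null instead of an exception. */
final class TolerantDecoder<T> implements Function<byte[], T> {
    private final Function<byte[], T> delegate;
    private int skipped = 0;

    TolerantDecoder(Function<byte[], T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public T apply(byte[] bytes) {
        if (bytes == null) {
            return null; // nothing stored for this key
        }
        try {
            return delegate.apply(bytes);
        } catch (RuntimeException e) {
            skipped++;   // a real application should log the failure here
            return null; // treat the undecodable value as absent
        }
    }

    int skippedCount() {
        return skipped;
    }

    /** Toy "new schema" decoder (an assumption for the demo): accepts only payloads prefixed with "v2:". */
    static Function<byte[], String> strictV2() {
        return bytes -> {
            String s = new String(bytes, StandardCharsets.UTF_8);
            if (!s.startsWith("v2:")) {
                throw new IllegalArgumentException("old schema: " + s);
            }
            return s.substring(3);
        };
    }

    public static void main(String[] args) {
        TolerantDecoder<String> decoder = new TolerantDecoder<>(strictV2());
        System.out.println(decoder.apply("v2:tags".getBytes(StandardCharsets.UTF_8)));   // prints "tags"
        System.out.println(decoder.apply("v1:legacy".getBytes(StandardCharsets.UTF_8))); // prints "null"
        System.out.println(decoder.skippedCount());                                      // prints "1"
    }
}
```

The trade-off is that old-format entries are silently ignored rather than migrated, so this only hides the problem until the table has been rebuilt with the new schema.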
I'm not super experienced with Kafka, but I keep hearing over and over again: if something changes, copy the data with the new format to another topic or delete the data, reset offsets and re-process.
In this case, maybe it's better to delete the KTable files and the internal topics used for the KTable, and let the app re-generate the KTable with the new structure.
This blog post from a few months ago explains the process of deleting data in a bit more detail: https://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/
To share a bit of insight: Kafka is a very complex beast. To manage it successfully in production you need to build a good amount of tooling and code to maintain it, and (usually) change your deployment process to fit Kafka.
Related
Having a strange problem in my sample Kafka Streams application.
I have the following 2 KStreams:
KStream stream1 = builder.stream("topic1", Consumed.with(Serdes.String(), Serdes.String())
.withTimestampExtractor(new Extractor1()))
...
KStream stream2 = builder.stream("topic2", Consumed.with(Serdes.String(), Serdes.String())
.withTimestampExtractor(new Extractor2()))
...
They are built from the same StreamsBuilder builder, which is configured with KafkaStreams properties without the StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG property set at all (I've also tried setting it).
The problem I am observing is that the extract() method of my first TimestampExtractor, Extractor1, gets called, but the extract() method of my second, Extractor2, doesn't get called at all (even though the messages flow through both streams).
What could be the reason?
I have a Spring Cloud Stream (Kafka Streams version 2.1) application with a Kafka Streams binder and I am doing time window aggregations, where I only want to take some action (an API call) once the window closes. The behavior I'm observing is that on every application restart, my mapValues function is called for every record stored in the changelog, resulting in a huge number of calls being made to the API.
My understanding of suppress() is that for every closed time window, a tombstone record should be sent to the aggregate changelog topic, effectively preventing me from reprocessing it, even after application restarts.
What could be causing messages to be reprocessed on an app restart?
I've already confirmed that the app is not reconsuming the source topic.
Snippet of the relevant code below:
Serde<Aggregator> aggregatorSerde = new JsonSerde<>(Aggregator.class, objectMapper);
Materialized<String, Aggregator, WindowStore<Bytes, byte[]>> stateStore = Materialized.<String, Aggregator, WindowStore<Bytes, byte[]>>
with(Serdes.String(), aggregatorSerde);
KTable<Windowed<String>, List<Event>> windowedEventKTable = inputKStream
.groupByKey()
.windowedBy(TimeWindows.of(Duration.ofSeconds(30)).grace(Duration.ofSeconds(5)))
.aggregate(Aggregator::new, ((key, value, aggregate) -> aggregate.aggregate(value)), stateStore)
.suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()).withName(supressStoreName))
.mapValues((windowedKey, groupedTriggerAggregator) -> {//code here returning a list})
.toStream((k,v) -> k.key())
.flatMapValues((readOnlyKey, value) -> value);
I am writing my first Kafka consumer using Spring-Kafka. I had a look at the different options provided by the framework and have a few doubts about them. Can someone please clarify the points below if you have already worked with it?
Question - 1 : As per the Spring-Kafka documentation, there are 2 ways to implement a Kafka consumer: "You can receive messages by configuring a MessageListenerContainer and providing a message listener or by using the @KafkaListener annotation". Can someone tell me when I should choose one option over the other?
Question - 2 : I have chosen the KafkaListener approach for writing my application. For this I need to initialize a container factory instance, and inside the container factory there is an option to control concurrency. I just want to double check whether my understanding of concurrency is correct.
Suppose I have a topic named MyTopic which has 4 partitions. To consume messages from MyTopic, I've started 2 instances of my application, each with concurrency set to 2. So, ideally, as per the Kafka assignment strategy, 2 partitions should go to consumer1 and the other 2 partitions should go to consumer2. Since the concurrency is set to 2, will each of the consumers start 2 threads and consume data from the topic in parallel? Also, should we consider anything if we are consuming in parallel?
Question - 3 : I have chosen manual ack mode, and I am not managing the offsets externally (not persisting them to any database/filesystem). So do I need to write custom code to handle rebalance, or will the framework manage it automatically? I think not, as I am acknowledging only after processing all the records.
Question - 4 : Also, with manual ack mode, which listener gives better performance: the batch message listener or the normal message listener? I guess if I use the normal message listener, the offsets will be committed after processing each message.
Pasted the code below for your reference.
Batch Acknowledgement Consumer:
public void onMessage(List<ConsumerRecord<String, String>> records, Acknowledgment acknowledgment,
Consumer<?, ?> consumer) {
for (ConsumerRecord<String, String> record : records) {
System.out.println("Record : " + record.value());
// Process the message here..
listener.addOffset(record.topic(), record.partition(), record.offset());
}
acknowledgment.acknowledge();
}
Initialising container factory:
@Bean
public ConsumerFactory<String, String> consumerFactory() {
return new DefaultKafkaConsumerFactory<String, String>(consumerConfigs());
}
@Bean
public Map<String, Object> consumerConfigs() {
Map<String, Object> configs = new HashMap<String, Object>();
configs.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootStrapServer);
configs.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
configs.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, enablAutoCommit);
configs.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, maxPolInterval);
configs.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, autoOffsetReset);
configs.put(ConsumerConfig.CLIENT_ID_CONFIG, clientId);
configs.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
configs.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
return configs;
}
@Bean
public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory() {
ConcurrentKafkaListenerContainerFactory<String, String> factory = new ConcurrentKafkaListenerContainerFactory<String, String>();
// Not sure about the impact of this property, so going with 1
factory.setConcurrency(2);
factory.setBatchListener(true);
factory.getContainerProperties().setAckMode(AckMode.MANUAL);
factory.getContainerProperties().setConsumerRebalanceListener(RebalanceListener.getInstance());
factory.setConsumerFactory(consumerFactory());
factory.getContainerProperties().setMessageListener(new BatchAckConsumer());
return factory;
}
@KafkaListener is a message-driven "POJO"; it adds features like payload conversion, argument matching, etc. If you implement MessageListener, you can only get the raw ConsumerRecord from Kafka. See @KafkaListener Annotation.
Yes, the concurrency represents the number of threads; each thread creates a Consumer; they run in parallel; in your example, each would get 2 partitions.
"Also, should we consider anything if we are consuming in parallel?"
Your listener must be thread-safe (no shared state, or any such state needs to be protected by locks).
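To make the thread-safety point concrete, here is a small JDK-only sketch. ThreadSafeListenerSketch and its onMessage method are hypothetical stand-ins for a listener bean (in a real application the method would carry the @KafkaListener annotation and the container would supply the threads); the pool below merely simulates a concurrency of 2. The shared counter lives in a ConcurrentHashMap of LongAdder values, so no explicit locking is needed.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical listener whose shared state is safe to touch from several container threads. */
final class ThreadSafeListenerSketch {
    // Shared across all listener threads, so it must be a concurrent structure.
    private final ConcurrentHashMap<String, LongAdder> countsByKey = new ConcurrentHashMap<>();

    /** Stand-in for the method that would be annotated with @KafkaListener. */
    void onMessage(String key, String value) {
        countsByKey.computeIfAbsent(key, k -> new LongAdder()).increment();
    }

    long count(String key) {
        LongAdder adder = countsByKey.get(key);
        return adder == null ? 0L : adder.sum();
    }

    /** Simulates `threads` consumer threads each delivering `messagesPerThread` records. */
    static long simulate(int threads, int messagesPerThread) {
        ThreadSafeListenerSketch listener = new ThreadSafeListenerSketch();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < messagesPerThread; i++) {
                    listener.onMessage("sensor-1", "payload");
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("simulation timed out");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return listener.count("sensor-1");
    }

    public static void main(String[] args) {
        System.out.println(simulate(2, 10_000)); // prints 20000
    }
}
```

With a plain HashMap and a long counter instead, the same simulation could lose updates, which is the kind of bug that only shows up once concurrency is raised above 1.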
It's not clear what you mean by "handle rebalance events". When a rebalance occurs, the framework will commit any pending offsets.
It doesn't make a difference; message listener vs. batch listener is just a preference. Even with a message listener, with MANUAL ack mode, the offsets are committed when all the results from the poll have been processed. With MANUAL_IMMEDIATE mode, the offsets are committed one-by-one.
Q1:
From the documentation,
The @KafkaListener annotation is used to designate a bean method as a
listener for a listener container. The bean is wrapped in a
MessagingMessageListenerAdapter configured with various features, such
as converters to convert the data, if necessary, to match the method
parameters.
You can configure most attributes on the annotation with SpEL by using
#{…} or property placeholders (${…}). See the Javadoc for more information.
This approach can be useful for simple POJO listeners, and you do not need to implement any interfaces. You can also listen on any topics and partitions in a declarative way using the annotations. You can also potentially return the value you received, whereas with MessageListener you are bound by the signature of the interface.
Q2:
Ideally yes. If you have multiple topics to consume from, it gets more complicated though. Kafka by default uses RangeAssignor, which has its own behaviour (you can change this through the partition.assignment.strategy consumer property).
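As an illustration of why multiple topics complicate things, here is a simplified JDK-only model of range-style assignment. It is not Kafka's implementation: RangeAssignmentSketch is a hypothetical class, it assumes the consumer list is already sorted, and it ignores subscriptions that differ between consumers. It only captures the core rule that each topic is divided independently, with the first consumers taking any leftover partitions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified model of range-style assignment (an illustration, not Kafka's code). */
final class RangeAssignmentSketch {

    /** Assigns each topic's partitions in contiguous ranges; `consumers` is assumed to be sorted. */
    static Map<String, List<String>> assign(List<String> consumers, Map<String, Integer> partitionsPerTopic) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String consumer : consumers) {
            result.put(consumer, new ArrayList<>());
        }
        for (Map.Entry<String, Integer> topic : partitionsPerTopic.entrySet()) {
            int partitions = topic.getValue();
            int base = partitions / consumers.size();  // every consumer gets at least this many
            int extra = partitions % consumers.size(); // the first `extra` consumers get one more
            int next = 0;
            for (int i = 0; i < consumers.size(); i++) {
                int take = base + (i < extra ? 1 : 0);
                for (int p = next; p < next + take; p++) {
                    result.get(consumers.get(i)).add(topic.getKey() + "-" + p);
                }
                next += take;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> topics = new LinkedHashMap<>();
        topics.put("topicA", 3);
        topics.put("topicB", 3);
        // prints {c0=[topicA-0, topicA-1, topicB-0, topicB-1], c1=[topicA-2, topicB-2]}
        System.out.println(assign(Arrays.asList("c0", "c1"), topics));
    }
}
```

In this model, two topics with 3 partitions each and two consumers end up split 4 to 2, because the leftover partition of every topic goes to the same first consumer.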
Q3:
If your consumer dies, there will be rebalancing. If you acknowledge manually and your consumer dies before committing offsets, you do not need to do anything; Kafka handles that. But you could end up with some duplicate messages (at-least-once delivery).
Q4:
It depends what you mean by "performance". If you meant latency, then consuming each record as fast as possible will be the way to go. If you want to achieve high throughput, then batch consumption is more efficient.
I had written some samples using Spring kafka and various listeners - check out this repo
I am using Spring Kafka in my project as it seemed a natural choice in a Spring based project to consume Kafka messages. To consume messages, I can make use of the MessageListener interface. Spring Kafka internally takes care to invoke my onMessage method for each new message.
However, in my setting I prefer to explicitly poll for new messages and work on them sequentially (which will take a few seconds). As a workaround, I might just block inside my onMessage implementation, or buffer the messages internally. However, this seems to go against the core idea of Spring Kafka.
Kafka is designed so that consumers have to poll for new messages, which matches my requirements. Is there a way to make use of this "natural" workflow with Spring Kafka?
Should I refrain from using Spring Kafka for this use case?
The KafkaConsumer documentation states:
For use cases where message processing time varies unpredictably,
neither of these options may be sufficient. The recommended way to
handle these cases is to move message processing to another thread,
which allows the consumer to continue calling poll while the processor
is still working. Some care must be taken to ensure that committed
offsets do not get ahead of the actual position. Typically, you must
disable automatic commits and manually commit processed offsets for
records only after the thread has finished handling them (depending on
the delivery semantics you need). Note also that you will need to
pause the partition so that no new records are received from poll
until after thread has finished handling those previously returned.
Related issue: https://github.com/spring-projects/spring-kafka/issues/195
The issue with having to keep polling the consumer has now been resolved (in 0.10.1.x, by KIP-62), so that's no longer a problem as long as you don't exceed max.poll.interval.ms, which is 5 minutes by default but can be increased.
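For slow, sequential processing, the relevant consumer settings might look roughly like the sketch below. The property keys are standard consumer configuration names; the values (10 minutes, 10 records per poll) and the class name PollTuningSketch are assumptions chosen for illustration and would need tuning for a real workload.

```java
import java.util.Properties;

/** Hypothetical helper collecting consumer settings for slow per-record processing. */
final class PollTuningSketch {

    static Properties slowProcessingConsumerProps() {
        Properties props = new Properties();
        // Allow up to 10 minutes between two poll() calls before the consumer is considered failed (assumed value).
        props.setProperty("max.poll.interval.ms", "600000");
        // Fetch fewer records per poll so one batch fits comfortably inside that interval (assumed value).
        props.setProperty("max.poll.records", "10");
        // Commit offsets only after the records have actually been processed.
        props.setProperty("enable.auto.commit", "false");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(slowProcessingConsumerProps().getProperty("max.poll.interval.ms")); // prints 600000
    }
}
```

These entries would be merged into the configuration map handed to the consumer factory, alongside the bootstrap servers, group ID and deserializers.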
However, if you want to poll yourself, you can still use spring-kafka (e.g. to get the Spring Boot auto configuration goodness if you are using Boot), but you can get a Consumer from the DefaultKafkaConsumerFactory and poll() it directly.
Here is how I do it. This is in the context of an integration test configuration class, which I load in my JUnit with:
@Import(IntegrationTestConfiguration.class)
In my Test class I have the following:
@Autowired
Consumer<String, String> consumer;
In my test configuration class, I have:
@Bean
public Consumer<String, String> consumer() {
String bootstrapAddress = "server:port"; // fix this
String groupId = "my.group"; // fix this.
Map<String, Object> props = new HashMap<>();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapAddress);
props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
ConsumerFactory<String, String> fact = new DefaultKafkaConsumerFactory<>(props);
// Create the consumer, subscribe to the topic
Consumer<String, String> consumer = fact.createConsumer();
String topic = "my.topic"; // fix this.
List<String> topics = new ArrayList<String>();
topics.add(topic);
consumer.subscribe(topics);
return consumer;
}
Finally, in my test, I do:
@Test
public void testSomething() {
// Do stuff that will publish a message to Kafka
// Repeat a number of times until you get the message you want...
// Or you give up
Duration d = Duration.ofSeconds(2);
ConsumerRecords<String, String> records = consumer.poll(d);
}
I have a Kafka topic where I send location events (key=user_id, value=user_location). I am able to read and process it as a KStream:
KStreamBuilder builder = new KStreamBuilder();
KStream<String, Location> locations = builder
.stream("location_topic")
.map((k, v) -> {
// some processing here, omitted for clarity
Location location = new Location(lat, lon);
return new KeyValue<>(k, location);
});
That works well, but I'd like to have a KTable with the last known position of each user. How could I do it?
I am able to do it writing to and reading from an intermediate topic:
// write to intermediate topic
locations.to(Serdes.String(), new LocationSerde(), "location_topic_aux");
// build KTable from intermediate topic
KTable<String, Location> table = builder.table("location_topic_aux", "store");
Is there a simple way to obtain a KTable from a KStream? This is my first app using Kafka Streams, so I'm probably missing something obvious.
Update:
In Kafka 2.5, a new method KStream#toTable() will be added, that will provide a convenient way to transform a KStream into a KTable. For details see: https://cwiki.apache.org/confluence/display/KAFKA/KIP-523%3A+Add+KStream%23toTable+to+the+Streams+DSL
Original Answer:
There is no straightforward way to do this at the moment. Your approach is absolutely valid, as discussed in the Confluent FAQs: http://docs.confluent.io/current/streams/faq.html#how-can-i-convert-a-kstream-to-a-ktable-without-an-aggregation-step
This is the simplest approach with regard to the code. However, it has the disadvantages that (a) you need to manage an additional topic and that (b) it results in additional network traffic because data is written to and re-read from Kafka.
There is one alternative, using a "dummy-reduce":
KStreamBuilder builder = new KStreamBuilder();
KStream<String, Long> stream = ...; // some computation that creates the derived KStream
KTable<String, Long> table = stream.groupByKey().reduce(
new Reducer<Long>() {
@Override
public Long apply(Long aggValue, Long newValue) {
return newValue;
}
},
"dummy-aggregation-store");
This approach is somewhat more complex with regard to the code compared to option 1 but has the advantage that (a) no manual topic management is required and (b) re-reading the data from Kafka is not necessary.
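The effect of the "dummy-reduce" can be pictured without Kafka at all: each new record for a key simply replaces the previous value, which is exactly the upsert behaviour of a table. The sketch below models this with a plain Map; LatestValueTableSketch and toTable are hypothetical names, and the in-memory list stands in for the stream of records.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

/** JDK-only model of the "dummy reduce": the newest value per key wins. */
final class LatestValueTableSketch {

    static <K, V> Map<K, V> toTable(Iterable<? extends Map.Entry<K, V>> records) {
        // Same shape as the Reducer in the answer: ignore the aggregate, keep the new value.
        BinaryOperator<V> keepNewest = (aggValue, newValue) -> newValue;
        Map<K, V> table = new LinkedHashMap<>();
        for (Map.Entry<K, V> record : records) {
            // Map.merge rejects null values, so this model does not cover tombstones.
            table.merge(record.getKey(), record.getValue(), keepNewest);
        }
        return table;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> records = new ArrayList<>();
        records.add(new AbstractMap.SimpleEntry<>("alice", 1L));
        records.add(new AbstractMap.SimpleEntry<>("bob", 2L));
        records.add(new AbstractMap.SimpleEntry<>("alice", 3L));
        System.out.println(toTable(records)); // prints {alice=3, bob=2}
    }
}
```

For the location use case in the question, the key would be the user ID and the value the latest Location, so the table always holds the last known position per user.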
Overall, you need to decide for yourself which approach you like better.
In option 2, Kafka Streams will create an internal changelog topic to back up the KTable for fault tolerance. Thus, both approaches require some additional storage in Kafka and result in additional network traffic. Overall, it’s a trade-off between slightly more complex code in option 2 versus manual topic management in option 1.