ProducerFencedException after kafka consumer group rebalancing - spring

I cannot comment on a similar topic:
TransactionId prefix for producer-only and read-process-write - ProducerFencedException
so I will ask a new question.
Use case:
One topic with 2 partitions
Spring @KafkaListener with concurrency=1 (default) in consumer group "sample-consumer-group"
Two instances of the same application - both with the same "transaction-id-prefix"
1) I start the first app instance (let's call it "instance1") - everything is ok - a single consumer subscribes to both partitions. Log:
o.s.k.l.KafkaMessageListenerContainer : sample-consumer-group: partitions assigned: [sampleTopic-1, sampleTopic-0]
2) I start the second app instance (instance2) - everything seems ok - log from this instance:
o.s.k.l.KafkaMessageListenerContainer : sample-consumer-group: partitions assigned: [sampleTopic-1]
log from "instance1":
o.s.k.l.KafkaMessageListenerContainer : sample-consumer-group: partitions revoked: [sampleTopic-1, sampleTopic-0]
o.s.k.l.KafkaMessageListenerContainer : sample-consumer-group: partitions assigned: [sampleTopic-0]
Still seems ok...
But when I then try to send a message to any other topic (not from any @KafkaListener, but from some @Transactional method, so this is a producer-only transaction), the following errors occur:
ERROR 4395 --- [roducer-tx-prefix-0] o.a.k.clients.producer.internals.Sender : [Producer clientId=producer-tx-prefix-0, transactionalId=tx-prefix-0] Aborting producer batches due to fatal error
org.apache.kafka.common.errors.ProducerFencedException: Producer attempted an operation with an old epoch. Either there is a newer producer with the same transactionalId, or the producer's transaction has been expired by the broker.
ERROR 4395 --- [roducer-tx-prefix-0] o.s.k.support.LoggingProducerListener : Exception thrown when sending a message with key='sync-register' and payload='2020-04-21T13:52:12.148412Z' to topic anotherTopic
So, is it related to the problem that I should have a separate transaction-id-prefix on each instance for producer-only transactions? If yes, what is the current status of this, and how can I achieve it without using a separate KafkaTemplate for consumer-started and producer-started transactions?

See the documentation.
As mentioned in the overview, the producer factory is configured with this property to build the producer transactional.id property. There is rather a dichotomy when specifying this property in that, when running multiple instances of the application, it must be the same on all instances to satisfy fencing zombies (also mentioned in the overview) when producing records on a listener container thread. However, when producing records using transactions that are not started by a listener container, the prefix has to be different on each instance. Version 2.3 makes this simpler to configure, especially in a Spring Boot application. In previous versions, you had to create two producer factories and KafkaTemplates - one for producing records on a listener container thread and one for stand-alone transactions started by kafkaTemplate.executeInTransaction() or by a transaction interceptor on a @Transactional method.
Now, you can override the factory’s transactionalIdPrefix on the KafkaTemplate and the KafkaTransactionManager.
When using a transaction manager and template for a listener container, you would normally leave this to default to the producer factory’s property. This value should be the same for all application instances. For transactions started by the template (or the transaction manager for @Transactional) you should set the property on the template and transaction manager respectively. This property must have a different value on each application instance.
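For illustration, a minimal sketch (Spring Kafka 2.3 or later; the bean wiring and the random prefix are assumptions, not the documentation's example) of giving producer-only transactions an instance-unique prefix while the listener container keeps the factory's shared prefix:
@Bean
public KafkaTemplate<String, String> kafkaTemplate(ProducerFactory<String, String> pf) {
    KafkaTemplate<String, String> template = new KafkaTemplate<>(pf);
    // transactions started by the template (executeInTransaction) use this prefix;
    // it must differ on each application instance, hence the random suffix
    template.setTransactionIdPrefix("producer-only-tx-" + UUID.randomUUID() + "-");
    return template;
}

@Bean
public KafkaTransactionManager<String, String> producerOnlyTransactionManager(ProducerFactory<String, String> pf) {
    KafkaTransactionManager<String, String> tm = new KafkaTransactionManager<>(pf);
    // transactions started via @Transactional("producerOnlyTransactionManager") use this prefix;
    // the transaction manager wired into the listener container should keep the factory default
    tm.setTransactionIdPrefix("producer-only-tx-" + UUID.randomUUID() + "-");
    return tm;
}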

Here's the working workaround I've figured out, if you use Kafka < 2.5 and you have to satisfy the requirement of static transaction ids for consumer-based transactions together with random transaction ids for producer-based ones:
There is rather a dichotomy when specifying this property in that, when running multiple instances of the application, it must be the same on all instances to satisfy fencing zombies (also mentioned in the overview) when producing records on a listener container thread. However, when producing records using transactions that are not started by a listener container, the prefix has to be different on each instance.
import java.util.Map;
import java.util.UUID;

import org.apache.kafka.clients.producer.Producer;
import org.springframework.kafka.core.DefaultKafkaProducerFactory;
import org.springframework.kafka.support.TransactionSupport;
import org.springframework.lang.NonNull;

public class TransactionSourceAwareKafkaProducerFactory<K, V> extends DefaultKafkaProducerFactory<K, V> {

    public TransactionSourceAwareKafkaProducerFactory(Map<String, Object> configs) {
        super(configs);
    }

    @NonNull
    @Override
    public Producer<K, V> createProducer(String txIdPrefixArg) {
        if (isProducerBasedTransaction()) {
            // use some random id here, e.g. a UUID
            return super.createProducer(UUID.randomUUID().toString());
        }
        return super.createProducer(txIdPrefixArg);
    }

    protected boolean isConsumerBasedTransaction() {
        // the listener container sets a group/topic/partition suffix on the current thread
        // for consumer-initiated transactions; producer-only transactions have no suffix
        return TransactionSupport.getTransactionIdSuffix() != null;
    }

    protected boolean isProducerBasedTransaction() {
        return !isConsumerBasedTransaction();
    }
}
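To use it (a sketch with assumed bean names; buildProducerProperties() is the Spring Boot 2.x KafkaProperties accessor), the custom factory simply replaces the default producer factory, so both consumer-initiated and producer-only transactions go through the override:
@Bean
public ProducerFactory<String, String> producerFactory(KafkaProperties properties) {
    TransactionSourceAwareKafkaProducerFactory<String, String> factory =
            new TransactionSourceAwareKafkaProducerFactory<>(properties.buildProducerProperties());
    // shared prefix, used as-is for consumer-initiated (listener container) transactions
    factory.setTransactionIdPrefix("tx-prefix-");
    return factory;
}

@Bean
public KafkaTemplate<String, String> kafkaTemplate(ProducerFactory<String, String> producerFactory) {
    return new KafkaTemplate<>(producerFactory);
}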

Related

How to invoke CommonContainerStoppingErrorHandler once retries are exhausted with Batch listener

I am using Spring Boot (version 2.7.1) with the Spring Cloud Stream Kafka binder (2.8.5) for processing Kafka messages.
I have a functional-style consumer that consumes messages in batches. Right now it retries 10 times and commits the offsets for errored records.
I now want to introduce a retry mechanism for a certain number of attempts (which works using the error handler below), then stop processing messages and fail the entire batch without auto-committing the offsets.
I read through the documentation and understand that CommonContainerStoppingErrorHandler can be used to stop the container from consuming messages.
My handler looks like this now, and it retries exponentially.
@Bean
public ListenerContainerCustomizer<AbstractMessageListenerContainer<String, Message>> errorHandler() {
    return (container, destinationName, group) -> {
        container.getContainerProperties().setAckMode(ContainerProperties.AckMode.BATCH);
        ExponentialBackOffWithMaxRetries backOffWithMaxRetries = new ExponentialBackOffWithMaxRetries(2);
        backOffWithMaxRetries.setInitialInterval(1);
        backOffWithMaxRetries.setMultiplier(2.0);
        backOffWithMaxRetries.setMaxInterval(5);
        container.setCommonErrorHandler(new DefaultErrorHandler(backOffWithMaxRetries));
    };
}
How do I chain CommonContainerStoppingErrorHandler with the above error handler, so the failed batch is not committed and is replayed upon restart?
With BatchListenerFailedException thrown from the consumer, is it possible to fail the entire batch (including one or more valid records before any problematic record in that batch)?
Add a custom recoverer to the error handler - see this answer for an example: How do you exit spring boot application programmatically when retries are exhausted, to prevent kafka offset commit
No; records before the failed one will have their offsets committed.
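For the first part, here is a minimal sketch (not the exact code from the linked answer; stopping on a separate thread and rethrowing are illustrative choices), reusing the customizer from the question. Once the back-off is exhausted, the recoverer stops the container and rethrows, so the batch is not treated as recovered, its offsets stay uncommitted, and it is replayed after a restart:
@Bean
public ListenerContainerCustomizer<AbstractMessageListenerContainer<String, Message>> errorHandler() {
    return (container, destinationName, group) -> {
        container.getContainerProperties().setAckMode(ContainerProperties.AckMode.BATCH);
        ExponentialBackOffWithMaxRetries backOff = new ExponentialBackOffWithMaxRetries(2);
        backOff.setInitialInterval(1);
        backOff.setMultiplier(2.0);
        backOff.setMaxInterval(5);
        container.setCommonErrorHandler(new DefaultErrorHandler((record, ex) -> {
            // stop asynchronously: calling stop() on the consumer thread itself would block
            new Thread(container::stop, "stop-" + destinationName).start();
            // rethrow so the error handler does not consider the batch recovered
            throw new KafkaException("Stopping container; retries exhausted for " + destinationName, ex);
        }, backOff));
    };
}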

Should I use SeekToCurrentErrorHandler with API restTemplate retry mechanism?

I am trying to write a Kafka consumer application with spring-kafka. As a consumer, I have to make sure I am not missing any records and that all records get processed.
My application design is like this:
Topics --> Read records from topic --> dump it into a table A in oracle database --> pick records from a table A --> call rest api to update records in system table B --> update response of API in table a --> commit records
Retry Mechanism on API level:
Now, if any of the records fails, meaning the response code is not as desired (400, 500, etc.), we would retry those records 2 times.
Retry Mechanism on Topic level:
But what if I get an error while committing offsets? That means I need to have some kind of retry mechanism on the topic level as well. I went over the documentation and found an option: SeekToCurrentErrorHandler
@Bean
public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory() {
    ConcurrentKafkaListenerContainerFactory<String, String> factory = new ConcurrentKafkaListenerContainerFactory<>();
    factory.setConsumerFactory(consumerFactory());
    factory.getContainerProperties().setAckOnError(false);
    factory.getContainerProperties().setAckMode(AckMode.RECORD);
    factory.setErrorHandler(new SeekToCurrentErrorHandler(new FixedBackOff(1000L, 2L)));
    return factory;
}
Now, from what I understand, suppose I am not able to commit any offsets; then, after adding the above code, delivery will be retried up to 2 times (3 delivery attempts) with a back off of 1 second. So does this mean my whole flow will be replayed twice? If this is true, do I need to add a retry mechanism on the API level separately?
I am just trying to understand how I can make my consumer application more resilient so I don't miss any records, and have an error mechanism to handle any errored/missed records. Please suggest.
It's best to avoid situations where the offsets can't be committed (make sure the max.poll.interval.ms is sufficient).
But, yes, if committing the offsets fails (and commitSync is true), then the record will be redelivered to the application. If commitSync is false, the failure will simply be logged (or sent to your listener) and the "next" offset for that partition will be committed later (presumably).
Adding retry at the application level (e.g. using a RetryTemplate in the listener adapter - via the container factory) will still suffer from the same problem; it also can cause a rebalance if the retries take too long.
If you really want to avoid reprocessing in this situation, you need to make your listener code idempotent - e.g. store the topic/partition/offset someplace to indicate you have already processed that record.
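For illustration, a minimal sketch of that idea (the topic name, the process() call, and the in-memory set are placeholders; a real implementation would persist the marker, e.g. in the same database transaction that writes table A):
private final Set<String> processed = ConcurrentHashMap.newKeySet();

@KafkaListener(topics = "myTopic")
public void listen(ConsumerRecord<String, String> record) {
    String key = record.topic() + "-" + record.partition() + "-" + record.offset();
    if (!processed.add(key)) {
        // this offset was already handled on a previous delivery; skip reprocessing
        return;
    }
    process(record.value()); // dump to table A, call the REST API, update the response, ...
}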

DefaultMessageListenerContainer stops processing messages

I'm hoping this is a simple configuration issue but I can't seem to figure out what it might be.
Set-up
Spring-Boot 2.2.2.RELEASE
cloud-starter
cloud-starter-aws
spring-jms
spring-cloud-dependencies Hoxton.SR1
amazon-sqs-java-messaging-lib 1.0.8
Problem
My application starts up fine and begins to process messages from Amazon SQS. After some amount of time I see the following warning
2020-02-01 04:16:21.482 LogLevel=WARN 1 --- [ecutor-thread14] o.s.j.l.DefaultMessageListenerContainer : Number of scheduled consumers has dropped below concurrentConsumers limit, probably due to tasks having been rejected. Check your thread pool configuration! Automatic recovery to be triggered by remaining consumers.
The above warning gets printed multiple times and eventually I see the following two INFO messages
2020-02-01 04:17:51.552 LogLevel=INFO 1 --- [ecutor-thread40] c.a.s.javamessaging.SQSMessageConsumer : Shutting down ConsumerPrefetch executor
2020-02-01 04:18:06.640 LogLevel=INFO 1 --- [ecutor-thread40] com.amazon.sqs.javamessaging.SQSSession : Shutting down SessionCallBackScheduler executor
The above 2 messages will display several times, and at some point no more messages are consumed from SQS. I don't see any other messages in my log to indicate an issue, but there is no indication from my handlers (I have 2~) that they are processing messages, and I can see the AWS SQS queue growing in the number of messages and their age.
~: This exact code was working fine when I had a single handler; this problem started when I added the second one.
Configuration/Code
The first "WARNing" I realize is caused by the currency of the ThreadPoolTaskExecutor, but I can not get a configuration which works properly. Here is my current configuration for the JMS stuff, I have tried various levels of max pool size with no real affect other than the warings start sooner or later based on the pool size
public ThreadPoolTaskExecutor asyncAppConsumerTaskExecutor() {
    ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
    taskExecutor.setThreadGroupName("asyncConsumerTaskExecutor");
    taskExecutor.setThreadNamePrefix("asyncConsumerTaskExecutor-thread");
    taskExecutor.setCorePoolSize(10);
    // Allow the thread pool to grow up to 4 times the core size, evidently not
    // having the pool be larger than the max concurrency causes the JMS queue
    // to barf on itself with messages like
    // "Number of scheduled consumers has dropped below concurrentConsumers limit, probably due to tasks having been rejected. Check your thread pool configuration! Automatic recovery to be triggered by remaining consumers"
    taskExecutor.setMaxPoolSize(10 * 4);
    taskExecutor.setQueueCapacity(0); // do not queue up messages
    taskExecutor.setWaitForTasksToCompleteOnShutdown(true);
    taskExecutor.setAwaitTerminationSeconds(60);
    return taskExecutor;
}
Here is the JMS Container Factory we create
public DefaultJmsListenerContainerFactory jmsListenerContainerFactory(SQSConnectionFactory sqsConnectionFactory, ThreadPoolTaskExecutor asyncConsumerTaskExecutor) {
    DefaultJmsListenerContainerFactory factory = new DefaultJmsListenerContainerFactory();
    factory.setConnectionFactory(sqsConnectionFactory);
    factory.setDestinationResolver(new DynamicDestinationResolver());
    // The JMS processor will start 'concurrency' number of tasks
    // and supposedly will increase this to the max of '10 * 3'
    factory.setConcurrency(10 + "-" + (10 * 3));
    factory.setTaskExecutor(asyncConsumerTaskExecutor);
    // Let the task process 100 messages, default appears to be 10
    factory.setMaxMessagesPerTask(100);
    // Wait up to 5 seconds for a timeout, this keeps the task around a bit longer
    factory.setReceiveTimeout(5000L);
    factory.setSessionAcknowledgeMode(Session.CLIENT_ACKNOWLEDGE);
    return factory;
}
I added the setMaxMessagesPerTask and setReceiveTimeout calls based on things found on the internet; the problem persists without these and at various settings (50, 2500L, 25, 1000L, etc...).
We create a default SQS connection factory
public SQSConnectionFactory sqsConnectionFactory(AmazonSQS amazonSQS) {
    return new SQSConnectionFactory(new ProviderConfiguration(), amazonSQS);
}
Finally the handlers look like this
#JmsListener(destination = "consumer-event-queue")
public void receiveEvents(String message) throws IOException {
MyEventDTO myEventDTO = jsonObj.readValue(message, MyEventDTO.class);
//messageTask.process(myEventDTO);
}
#JmsListener(destination = "myalert-sqs")
public void receiveAlerts(String message) throws IOException, InterruptedException {
final MyAlertDTO myAlert = jsonObj.readValue(message, MyAlertDTO.class);
myProcessor.addAlertToQueue(myAlert);
}
You can see that in the first function (receiveEvents) we just take the message from the queue and exit; we have not implemented the processing code for that.
The second function (receiveAlerts) gets the message; the myProcessor.addAlertToQueue function creates a runnable object and submits it to a threadpool to be processed at some point in the future.
The problem (the warning, the info messages, and the failure to consume messages) only started when we added the receiveAlerts function; previously the other function was the only one present and we did not see this behavior.
More
This is part of a larger project and I am working on breaking this code out into a smaller test case to see if I can duplicate this issue. I will post a follow-up with the results.
In the Meantime
I'm hoping this is just a config issue and someone more familiar with this can tell me what I'm doing wrong, or that someone can provide some thoughts and comments on how to correct this to work properly.
Thank you!
After fighting this one for a bit I think I finally resolved it.
The issue appears to be due to the "DefaultJmsListenerContainerFactory"; this factory creates a new "DefaultJmsListenerContainer" for EACH method with a '@JmsListener' annotation. The person who originally wrote the code thought it was only called once for the application and that the created container would be re-used. So the issue was two-fold:
1) The 'ThreadPoolTaskExecutor' attached to the factory had 40 threads.
2) When the application had 1 '@JmsListener' method this worked fine, but when we added a second method each method got 10 threads (a total of 20) for listening. This is fine; however, since we stated that each listener could grow up to 30 consumers, we quickly ran out of threads in the pool mentioned in 1 above. This caused the "Number of scheduled consumers has dropped below concurrentConsumers limit" error.
This is probably obvious given the above, but I wanted to call it out explicitly: in the listener factory we set the concurrency to be "10-30"; however, all of the listeners have to share that pool. As such, the max concurrency has to be set up so that each listener's maximum is small enough that, if every listener reaches its maximum, the total does not exceed the number of threads in the pool (e.g. if we have 2 '@JmsListener' annotated methods and a pool with 40 threads, then the max value can be no more than 20). See the sizing sketch below.
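As a hypothetical sizing sketch for the configuration above (illustrative numbers only): with 2 @JmsListener methods sharing one executor, either grow the pool or shrink the per-listener concurrency so that listeners x max concurrency never exceeds the pool's max size.
// 2 @JmsListener methods share this executor, so size it for both
ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
taskExecutor.setCorePoolSize(20);   // 2 listeners x 10 initial consumers each
taskExecutor.setMaxPoolSize(40);    // 2 listeners x 20 max consumers each
taskExecutor.setQueueCapacity(0);

DefaultJmsListenerContainerFactory factory = new DefaultJmsListenerContainerFactory();
factory.setTaskExecutor(taskExecutor);
factory.setConcurrency("10-20");    // per listener; 2 x 20 = 40 fits the pool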
Hopefully this might help someone else with a similar issue in the future....

Spring-Kafka Concurrency Property

I am progressing on writing my first Kafka consumer using Spring-Kafka. I had a look at the different options provided by the framework, and I have a few doubts about them. Can someone please clarify the points below if you have already worked with it?
Question - 1 : As per the Spring-Kafka documentation, there are 2 ways to implement a Kafka consumer: "You can receive messages by configuring a MessageListenerContainer and providing a message listener or by using the @KafkaListener annotation". Can someone tell me when I should choose one option over the other?
Question - 2 : I have chosen the @KafkaListener approach for writing my application. For this I need to initialize a container factory instance, and inside the container factory there is an option to control concurrency. I just want to double-check whether my understanding of concurrency is correct.
Suppose I have a topic named MyTopic which has 4 partitions. To consume messages from MyTopic, I've started 2 instances of my application, and these instances are started with concurrency set to 2. So, ideally, as per the Kafka assignment strategy, 2 partitions should go to consumer1 and the other 2 partitions to consumer2. Since the concurrency is set to 2, will each consumer start 2 threads and consume data from the topic in parallel? Also, should we consider anything if we are consuming in parallel?
Question - 3 : I have chosen manual ack mode and am not managing the offsets externally (not persisting them to any database/filesystem). So do I need to write custom code to handle rebalancing, or will the framework manage it automatically? I think not, as I am acknowledging only after processing all the records.
Question - 4 : Also, with manual ack mode, which listener will give more performance: the batch message listener or the normal message listener? I guess if I use the normal message listener, the offsets will be committed after processing each message.
Pasted the code below for your reference.
Batch Acknowledgement Consumer:
public void onMessage(List<ConsumerRecord<String, String>> records, Acknowledgment acknowledgment,
        Consumer<?, ?> consumer) {
    for (ConsumerRecord<String, String> record : records) {
        System.out.println("Record : " + record.value());
        // Process the message here..
        listener.addOffset(record.topic(), record.partition(), record.offset());
    }
    acknowledgment.acknowledge();
}
Initialising container factory:
@Bean
public ConsumerFactory<String, String> consumerFactory() {
    return new DefaultKafkaConsumerFactory<String, String>(consumerConfigs());
}

@Bean
public Map<String, Object> consumerConfigs() {
    Map<String, Object> configs = new HashMap<String, Object>();
    configs.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootStrapServer);
    configs.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
    configs.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, enablAutoCommit);
    configs.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, maxPolInterval);
    configs.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, autoOffsetReset);
    configs.put(ConsumerConfig.CLIENT_ID_CONFIG, clientId);
    configs.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    configs.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    return configs;
}

@Bean
public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory() {
    ConcurrentKafkaListenerContainerFactory<String, String> factory = new ConcurrentKafkaListenerContainerFactory<String, String>();
    // Not sure about the impact of this property, so going with 1
    factory.setConcurrency(2);
    factory.setBatchListener(true);
    factory.getContainerProperties().setAckMode(AckMode.MANUAL);
    factory.getContainerProperties().setConsumerRebalanceListener(RebalanceListener.getInstance());
    factory.setConsumerFactory(consumerFactory());
    factory.getContainerProperties().setMessageListener(new BatchAckConsumer());
    return factory;
}
@KafkaListener is a message-driven "POJO"; it adds things like payload conversion, argument matching, etc. If you implement MessageListener, you can only get the raw ConsumerRecord from Kafka. See @KafkaListener Annotation.
Yes, the concurrency represents the number of threads; each thread creates a Consumer; they run in parallel; in your example, each would get 2 partitions.
Also, should we consider anything if we are consuming in parallel?
Your listener must be thread-safe (no shared state, or any such state needs to be protected by locks).
It's not clear what you mean by "handle rebalance events". When a rebalance occurs, the framework will commit any pending offsets.
It doesn't make a difference; message listener vs. batch listener is just a preference. Even with a message listener, with MANUAL ack mode, the offsets are committed when all the results from the poll have been processed. With MANUAL_IMMEDIATE ack mode, the offsets are committed one-by-one.
Q1:
From the documentation,
The @KafkaListener annotation is used to designate a bean method as a listener for a listener container. The bean is wrapped in a MessagingMessageListenerAdapter configured with various features, such as converters to convert the data, if necessary, to match the method parameters.
You can configure most attributes on the annotation with SpEL by using #{…} or property placeholders (${…}). See the Javadoc for more information.
This approach can be useful for simple POJO listeners, and you do not need to implement any interfaces. You can also listen on any topics and partitions in a declarative way using the annotation. You can also potentially return the value you received, whereas with a MessageListener you are bound by the signature of the interface.
Q2:
Ideally, yes. If you have multiple topics to consume from, it gets more complicated though. Kafka by default uses the RangeAssignor, which has its own behaviour (you can change this; see the config sketch below).
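For example (a config sketch; RoundRobinAssignor is just one alternative assignor, added here to the consumerConfigs() map from the question):
// replace the default RangeAssignor with a round-robin assignment across topics
configs.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, RoundRobinAssignor.class.getName());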
Q3:
If your consumer dies, there will be a rebalance. If you acknowledge manually and your consumer dies before committing offsets, you do not need to do anything; Kafka handles that. But you could end up with some duplicate messages (at-least-once semantics).
Q4:
It depends on what you mean by "performance". If you mean latency, then consuming each record as fast as possible is the way to go. If you want to achieve high throughput, then batch consumption is more efficient.
I had written some samples using Spring kafka and various listeners - check out this repo

Kafka Producer Thread, huge amount of threads even when no message is sent

I profiled my Kafka producer Spring Boot application and found many "kafka-producer-network-thread"s running (47 in total), which never stop running, even when no data is being sent.
My application looks a bit like this:
var kafkaSender = KafkaSender(kafkaTemplate, applicationProperties)
kafkaSender.sendToKafka(json, rs.getString("KEY"))
with the KafkaSender:
@Service
class KafkaSender(val kafkaTemplate: KafkaTemplate<String, String>, val applicationProperties: ApplicationProperties) {

    @Transactional(transactionManager = "kafkaTransactionManager")
    fun sendToKafka(message: String, stringKey: String) {
        kafkaTemplate.executeInTransaction { kt ->
            kt.send(applicationProperties.kafka.topic, System.currentTimeMillis().mod(10).toInt(),
                    System.currentTimeMillis().rem(10).toString(), message)
        }
    }

    companion object {
        val log = LoggerFactory.getLogger(KafkaSender::class.java)!!
    }
}
Since each time I want to send a message to Kafka I instantiate a new KafkaSender, I thought a new thread would be created which then sends the message to the kafka queue.
Currently it looks like a pool of producers is generated, but never cleaned up, even when none of them has anything to do.
Is this behaviour intended?
In my opinion the behaviour should be nearly the same as datasource pooling: keep the thread alive for some time, but when there is nothing to do, clean it up.
When using transactions, the producer cache grows on demand and is not reduced.
If you are producing messages on a listener container (consumer) thread; there is a producer for each topic/partition/consumer group. This is required to solve the zombie fencing problem, so that if a rebalance occurs and the partition moves to a different instance, the transaction id will remain the same so the broker can properly handle the situation.
If you don't care about the zombie fencing problem (and you can handle duplicate deliveries), set the producerPerConsumerPartition property to false on the DefaultKafkaProducerFactory and the number of producers will be much smaller.
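A minimal sketch of that (the bean name, the prefix, and the Boot 2.x buildProducerProperties() call are assumptions):
@Bean
public ProducerFactory<String, String> producerFactory(KafkaProperties properties) {
    DefaultKafkaProducerFactory<String, String> factory =
            new DefaultKafkaProducerFactory<>(properties.buildProducerProperties());
    factory.setTransactionIdPrefix("tx-"); // transactions stay enabled
    // no longer create one transactional producer per consumer group/topic/partition;
    // only acceptable if duplicate deliveries after a rebalance can be tolerated
    factory.setProducerPerConsumerPartition(false);
    return factory;
}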
EDIT
Starting with version 2.8, the default EOSMode is now V2 (aka BETA), which means it is no longer necessary to have a producer per topic/partition/group - as long as the broker version is 2.5 or later.
