I am writing my first Kafka consumer using Spring Kafka. I have looked at the different options provided by the framework and have a few doubts about them. Can someone please clarify the points below if you have already worked with it?
Question 1: As per the Spring Kafka documentation, there are 2 ways to implement a Kafka consumer: "You can receive messages by configuring a MessageListenerContainer and providing a message listener or by using the @KafkaListener annotation". Can someone tell me when I should choose one option over the other?
Question 2: I have chosen the @KafkaListener approach for writing my application. For this I need to initialize a container factory instance, and the container factory has an option to control concurrency. I just want to double-check whether my understanding of concurrency is correct.
Suppose I have a topic named MyTopic with 4 partitions. To consume messages from MyTopic, I've started 2 instances of my application, each with concurrency set to 2. Ideally, per the Kafka assignment strategy, 2 partitions should go to consumer1 and the other 2 partitions to consumer2. Since the concurrency is set to 2, will each instance start 2 threads and consume data from the topic in parallel? Also, is there anything we should consider when consuming in parallel?
Question 3: I have chosen manual ack mode and am not managing the offsets externally (not persisting them to any database/filesystem). Do I need to write custom code to handle a rebalance, or will the framework manage it automatically? I think not, since I am acknowledging only after processing all the records.
Question 4: Also, with manual ack mode, which listener gives better performance: the batch message listener or the normal message listener? I assume that if I use the normal message listener, the offsets will be committed after each message is processed.
I have pasted the code below for reference.
Batch Acknowledgement Consumer:
public class BatchAckConsumer implements BatchAcknowledgingConsumerAwareMessageListener<String, String> {

    @Override
    public void onMessage(List<ConsumerRecord<String, String>> records, Acknowledgment acknowledgment,
            Consumer<?, ?> consumer) {
        for (ConsumerRecord<String, String> record : records) {
            System.out.println("Record : " + record.value());
            // Process the message here..
            // 'listener' is an external offset tracker (not shown)
            listener.addOffset(record.topic(), record.partition(), record.offset());
        }
        acknowledgment.acknowledge();
    }
}
Initialising container factory:
@Bean
public ConsumerFactory<String, String> consumerFactory() {
    return new DefaultKafkaConsumerFactory<>(consumerConfigs());
}

@Bean
public Map<String, Object> consumerConfigs() {
    Map<String, Object> configs = new HashMap<>();
    configs.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootStrapServer);
    configs.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
    configs.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, enableAutoCommit);
    configs.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, maxPollInterval);
    configs.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, autoOffsetReset);
    configs.put(ConsumerConfig.CLIENT_ID_CONFIG, clientId);
    configs.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    configs.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    return configs;
}
@Bean
public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory() {
    ConcurrentKafkaListenerContainerFactory<String, String> factory =
            new ConcurrentKafkaListenerContainerFactory<>();
    // Concurrency is discussed in Question 2 above
    factory.setConcurrency(2);
    factory.setBatchListener(true);
    factory.getContainerProperties().setAckMode(AckMode.MANUAL);
    factory.getContainerProperties().setConsumerRebalanceListener(RebalanceListener.getInstance());
    factory.setConsumerFactory(consumerFactory());
    factory.getContainerProperties().setMessageListener(new BatchAckConsumer());
    return factory;
}
@KafkaListener is a message-driven "POJO"; it adds features such as payload conversion, argument matching, etc. If you implement MessageListener, you can only get the raw ConsumerRecord from Kafka. See the @KafkaListener Annotation documentation.
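For illustration, here is a rough sketch of both styles (class and method names are made up; only the topic name is taken from your question):
@KafkaListener(topics = "MyTopic")
public void listen(String payload) {
    // the payload has already been converted from the ConsumerRecord value
}

public class RawListener implements MessageListener<String, String> {

    @Override
    public void onMessage(ConsumerRecord<String, String> record) {
        // you work with the raw ConsumerRecord here
    }
}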
Yes, the concurrency represents the number of threads; each thread creates a Consumer; they run in parallel; in your example, each would get 2 partitions.
Also, is there anything we should consider when consuming in parallel?
Your listener must be thread-safe (no shared state, or any such state must be protected by locks).
It's not clear what you mean by "handle rebalance events". When a rebalance occurs, the framework will commit any pending offsets.
It doesn't make a difference; message listener vs. batch listener is just a preference. Even with a message listener and MANUAL ack mode, the offsets are committed when all the results from the poll have been processed. With MANUAL_IMMEDIATE mode, the offsets are committed one by one.
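For example, a record-level listener with manual acks might look something like this (a sketch; it assumes a record, not batch, container factory, and the method name is illustrative):
@KafkaListener(topics = "MyTopic")
public void listen(ConsumerRecord<String, String> record, Acknowledgment ack) {
    // process the record...
    // MANUAL: the commit is queued and performed after the whole poll is processed
    // MANUAL_IMMEDIATE: the commit is performed right away
    ack.acknowledge();
}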
Q1:
From the documentation,
The @KafkaListener annotation is used to designate a bean method as a listener for a listener container. The bean is wrapped in a MessagingMessageListenerAdapter configured with various features, such as converters to convert the data, if necessary, to match the method parameters. You can configure most attributes on the annotation with SpEL by using "#{…}" or property placeholders ("${…}"). See the Javadoc for more information.
This approach is useful for simple POJO listeners, and you do not need to implement any interfaces. You can also listen on any topics and partitions declaratively using the annotation. You can also return the value you received (as sketched below), whereas with MessageListener you are bound by the signature of the interface.
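As a sketch of the declarative style, assigning specific partitions and forwarding the received value as a reply (this assumes a reply template is configured on the container factory; the topic names are made up):
@KafkaListener(topicPartitions = @TopicPartition(topic = "MyTopic", partitions = { "0", "1" }))
@SendTo("MyTopic.replies")
public String listen(String payload) {
    return payload; // the return value is published to MyTopic.replies
}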
Q2:
Ideally, yes. If you have multiple topics to consume from, it gets more complicated, though. By default, Kafka uses the RangeAssignor, which has its own behaviour (you can change this; see the configuration sketch below).
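For example, to switch assignors you set partition.assignment.strategy in your consumer configs (RoundRobinAssignor shown here purely as an illustration):
configs.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
        RoundRobinAssignor.class.getName());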
Q3:
If your consumer dies, there will be a rebalance. If you acknowledge manually and your consumer dies before committing offsets, you do not need to do anything; Kafka handles that. However, you could end up with some duplicate messages (at-least-once delivery).
Q4:
It depends on what you mean by "performance". If you mean latency, then consuming each record as fast as possible is the way to go. If you want high throughput, then batch consumption is more efficient.
I have written some samples using Spring Kafka and various listeners; check out this repo.
I am trying to write a Kafka consumer application with spring-kafka. As a consumer, I have to make sure I am not missing any records and that all records get processed.
My application design is as follows:
Topics --> read records from the topics --> dump them into table A in an Oracle database --> pick records from table A --> call a REST API to update records in system table B --> update the API response in table A --> commit the records
Retry mechanism at the API level:
Now, if any of the records fails, meaning the response code is not as desired (400, 500, etc.), we retry those records 2 times.
Retry mechanism at the topic level:
But what if I get an error while committing offsets? That means I need some kind of retry mechanism at the topic level as well. I went through the documentation and found an option: SeekToCurrentErrorHandler.
@Bean
public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory() {
    ConcurrentKafkaListenerContainerFactory<String, String> factory =
            new ConcurrentKafkaListenerContainerFactory<>();
    factory.setConsumerFactory(consumerFactory());
    factory.getContainerProperties().setAckOnError(false);
    factory.getContainerProperties().setAckMode(AckMode.RECORD);
    factory.setErrorHandler(new SeekToCurrentErrorHandler(new FixedBackOff(1000L, 2L)));
    return factory;
}
Now, as I understand it, if I am not able to commit any offsets, then after adding the above code, delivery will be retried up to 2 more times (3 delivery attempts) with a back-off of 1 second. Does this mean my whole flow will be replayed twice? If this is true, do I need to add a retry mechanism at the API level separately?
I am just trying to understand how I can make my consumer application more resilient, so that I don't miss any records and have an error-handling mechanism for any failed/missed records. Please advise.
It's best to avoid situations where the offsets can't be committed (make sure the max.poll.interval.ms is sufficient).
But, yes, if committing the offsets fails (and commitSync is true), then the record will be redelivered to the application. If commitSync is false, the failure will simply be logged (or sent to your listener) and the "next" offset for that partition will presumably be committed later.
Adding retry at the application level (e.g. using a RetryTemplate in the listener adapter - via the container factory) will still suffer from the same problem; it also can cause a rebalance if the retries take too long.
If you really want to avoid reprocessing in this situation, you need to make your listener code idempotent - e.g. store the topic/partition/offset someplace to indicate you have already processed that record.
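A rough sketch of that idea (in a real application the "already processed" marker would live in a durable store such as your table A, not in memory; the names here are made up):
private final Set<String> processed = ConcurrentHashMap.newKeySet();

public void handle(ConsumerRecord<String, String> record) {
    String key = record.topic() + "-" + record.partition() + "-" + record.offset();
    if (!processed.add(key)) {
        return; // already handled; this is a redelivery after a failed commit
    }
    // process the record (dump into table A, call the REST API, update table A, ...)
}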
I profiled my Kafka producer Spring Boot application and found many "kafka-producer-network-thread"s running (47 in total). They never stop running, even when no data is being sent.
My application looks a bit like this:
var kafkaSender = KafkaSender(kafkaTemplate, applicationProperties)
kafkaSender.sendToKafka(json, rs.getString("KEY"))
with the KafkaSender:
@Service
class KafkaSender(val kafkaTemplate: KafkaTemplate<String, String>, val applicationProperties: ApplicationProperties) {

    @Transactional(transactionManager = "kafkaTransactionManager")
    fun sendToKafka(message: String, stringKey: String) {
        kafkaTemplate.executeInTransaction { kt ->
            kt.send(applicationProperties.kafka.topic, System.currentTimeMillis().mod(10).toInt(),
                    System.currentTimeMillis().rem(10).toString(), message)
        }
    }

    companion object {
        val log = LoggerFactory.getLogger(KafkaSender::class.java)!!
    }
}
Since I instantiate a new KafkaSender each time I want to send a message to Kafka, I thought a new thread would be created, which would then send the message to the Kafka queue.
Currently it looks like a pool of producers is generated, but never cleaned up, even when none of them has anything to do.
Is this behaviour intended?
In my opinion, the behaviour should be much like datasource pooling: keep the thread alive for some time, but clean it up when there is nothing to do.
When using transactions, the producer cache grows on demand and is not reduced.
If you are producing messages on a listener container (consumer) thread, there is a producer for each topic/partition/consumer group. This is required to solve the zombie-fencing problem, so that if a rebalance occurs and the partition moves to a different instance, the transactional id remains the same and the broker can properly handle the situation.
If you don't care about the zombie fencing problem (and you can handle duplicate deliveries), set the producerPerConsumerPartition property to false on the DefaultKafkaProducerFactory and the number of producers will be much smaller.
EDIT
Starting with version 2.8, the default EOSMode is now V2 (aka BETA), which means it is no longer necessary to have a producer per topic/partition/group, as long as the broker version is 2.5 or later.
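If you do decide to opt out (on versions before 2.8 / EOS V2, and you can tolerate duplicates), a minimal Java sketch of the producer factory might look like this (your app is Kotlin, but the property is the same; the server address, serializers, and transaction prefix are assumptions):
@Bean
public ProducerFactory<String, String> producerFactory() {
    Map<String, Object> props = new HashMap<>();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    DefaultKafkaProducerFactory<String, String> pf = new DefaultKafkaProducerFactory<>(props);
    pf.setTransactionIdPrefix("tx-"); // transactions are in use, as in your sender
    pf.setProducerPerConsumerPartition(false); // fewer producers; gives up zombie fencing pre-V2
    return pf;
}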
I have listeners configured in XML like this
<rabbit:listener-container connection-factory="connectionFactory" concurrency="1" acknowledge="manual">
<rabbit:listener ref="messageListener" queue-names="${address.queue.s1}" exclusive="true"/>
<rabbit:listener ref="messageListener" queue-names="${address.queue.s2}" exclusive="true"/>
<rabbit:listener ref="messageListener" queue-names="${address.queue.s3}" exclusive="true"/>
<rabbit:listener ref="messageListener" queue-names="${address.queue.s4}" exclusive="true"/>
<rabbit:listener ref="messageListener" queue-names="${address.queue.s5}" exclusive="true"/>
<rabbit:listener ref="messageListener" queue-names="${address.queue.s6}" exclusive="true"/>
</rabbit:listener-container>
I am trying to move that to Java Configuration and I don't see a way to add more than one MessageListener to a ListenerContainer. Creating multiple ListenerContainer beans is not an option in my case because I would not know the number of queues to consume from until runtime. Queue names will come from a configuration file.
I did the following
@PostConstruct
public void init() {
    for (String queue : queues.split(",")) {
        // The consumers would not connect if I didn't call the 'start()' method.
        messageListenerContainer(queue).start();
    }
}
@Bean
public SimpleMessageListenerContainer messageListenerContainer(String queue) {
    SimpleMessageListenerContainer container = new SimpleMessageListenerContainer(consumerConnectionFactory);
    container.setQueueNames(queue);
    container.setMessageListener(messageListener());
    // Set exclusive consumer 'ON'
    container.setExclusive(true);
    // Should be restricted to '1' to maintain data consistency.
    container.setConcurrentConsumers(1);
    container.setAcknowledgeMode(AcknowledgeMode.MANUAL);
    return container;
}
It "sort" of works BUT I see some weird behavior with lots of ghost channels getting opened which never used to happen with the XML configuration. So it makes me suspicious that I am doing something wrong. I would like to know the correct way of creating MessageListenerContainers in Java configuration? Simply put, "How does Spring convert 'rabbit:listener-container' with multiple 'rabbit:listener' to java objects properly?" Any help/insight into this would be greatly appreciated.
Business Case
We have a publisher that publishes user profile updates. The publisher could dispatch multiple updates for the same user, and we have to process them in the correct order to maintain data integrity in the data store.
Example: User ABC. Publish -> {UsrA:Change1, ..., UsrA:Change2, ..., UsrA:Change3} -> the consumer HAS to process {UsrA:Change1, ..., UsrA:Change2, ..., UsrA:Change3} in that order.
In our previous setup, we had 1 queue that received all the user updates, and a consumer app with concurrency = 5. There were multiple app servers running the consumer app. That resulted in 5 * (number of instances of the consumer app) channels/threads that could process the incoming messages. The speed was GREAT, but we often got out-of-order processing, resulting in data corruption.
To maintain strict FIFO order and still process messages in parallel as much as possible, we implemented queue sharding. We have an x-consistent-hash exchange with a hash-header on employee-id. Our publisher publishes messages to the hash exchange, and we have multiple sharded queues bound to it. The idea is that all changes for a given user (User A, for example) end up queued in the same shard. We then have our consumers connect to the sharded queues in 'exclusive' mode with 'concurrentConsumers = 1' and process the messages. That way we are sure to process messages in the correct order while still processing in parallel. We could make it more parallel by increasing the number of shards.
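For reference, a sketch of how the exchange and one shard binding might be declared with Spring AMQP (exchange and queue names are made up; with x-consistent-hash, the binding's routing key is its weight):
@Bean
public CustomExchange userUpdatesExchange() {
    Map<String, Object> args = new HashMap<>();
    args.put("hash-header", "employee-id"); // hash on the employee-id header
    return new CustomExchange("user.updates.hash", "x-consistent-hash", true, false, args);
}

@Bean
public Queue shard1() {
    return new Queue("user.queue.s1", true);
}

@Bean
public Binding shard1Binding() {
    return BindingBuilder.bind(shard1()).to(userUpdatesExchange()).with("1").noargs();
}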
Now on to the consumer configuration
We have the consumer app deployed on multiple app servers.
Original Approach:
I simply added multiple 'rabbit:listener' elements to my 'rabbit:listener-container' in the consumer app, as you can see above. It works great, except that the server that starts first gets an exclusive lock on all the sharded queues, and the other servers just sit there doing no work.
New Approach:
We moved the sharded queue names to the application configuration file, like so:
Consumer Instance 1 : Properties
queues=user.queue.s1,user.queue.s2,user.queue.s3
Consumer Instance 2 : Properties
queues=user.queue.s4,user.queue.s5,user.queue.s6
Also worth noting: we could have any number of consumer instances, and the shards could be distributed unevenly between instances depending on resource availability.
With the queue names moved to the configuration file, the XML configuration no longer works, because we cannot dynamically add 'rabbit:listener' elements to the 'rabbit:listener-container' like we did before.
So we decided to switch over to Java configuration. That is where we are STUCK.
We did this initially
@Bean
public SimpleMessageListenerContainer messageListenerContainer() {
    SimpleMessageListenerContainer container = new SimpleMessageListenerContainer(consumerConnectionFactory);
    container.setQueueNames(queues.split(","));
    container.setMessageListener(messageListener());
    container.setMissingQueuesFatal(false);
    // Set exclusive consumer 'ON'
    container.setExclusive(true);
    // Should be restricted to '1' to maintain data consistency.
    container.setConcurrentConsumers(1);
    container.setAcknowledgeMode(AcknowledgeMode.MANUAL);
    container.start();
    return container;
}
and it works, BUT all our queues are on one connection sharing 1 channel. That is NOT good for speed. What we want is one connection, with every queue getting its own channel.
Next Step
No success here YET. The Java configuration in my original question is where we are now.
I am baffled as to why this is so HARD to do. Clearly the XML configuration does something that is NOT easily doable in Java configuration (or at least it feels that way to me). I see this as a gap that needs to be filled, unless I am completely missing something. Please correct me if I am wrong. This is a genuine business case, NOT some fictitious edge case. Please feel free to comment if you think otherwise.
and it works, BUT all our queues are on one connection sharing 1 channel. That is NOT good for speed. What we want is one connection, with every queue getting its own channel.
If you switch to the DirectMessageListenerContainer, each queue in that configuration gets its own Channel.
See the documentation.
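Roughly, under the same assumptions as your current configuration (queue list from the properties file, exclusive consumers, manual acks), a DMLC version might look like this sketch:
@Bean
public DirectMessageListenerContainer messageListenerContainer() {
    DirectMessageListenerContainer container = new DirectMessageListenerContainer(consumerConnectionFactory);
    container.setQueueNames(queues.split(","));
    container.setConsumersPerQueue(1); // one consumer, and therefore one channel, per queue
    container.setMessageListener(messageListener());
    container.setExclusive(true);
    container.setAcknowledgeMode(AcknowledgeMode.MANUAL);
    return container;
}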
To answer your original question (pre-edit):
@Bean
public SimpleMessageListenerContainer messageListenerContainer1(@Value("${address.queue.s1}") String queue) {
    SimpleMessageListenerContainer container = new SimpleMessageListenerContainer(consumerConnectionFactory);
    container.setQueueNames(queue);
    container.setMessageListener(messageListener());
    // Set exclusive consumer 'ON'
    container.setExclusive(true);
    // Should be restricted to '1' to maintain data consistency.
    container.setConcurrentConsumers(1);
    container.setAcknowledgeMode(AcknowledgeMode.MANUAL);
    return container;
}
...
@Bean
public SimpleMessageListenerContainer messageListenerContainer6(@Value("${address.queue.s6}") String queue) {
    SimpleMessageListenerContainer container = new SimpleMessageListenerContainer(consumerConnectionFactory);
    container.setQueueNames(queue);
    container.setMessageListener(messageListener());
    // Set exclusive consumer 'ON'
    container.setExclusive(true);
    // Should be restricted to '1' to maintain data consistency.
    container.setConcurrentConsumers(1);
    container.setAcknowledgeMode(AcknowledgeMode.MANUAL);
    return container;
}
Here is the Java configuration for creating a SimpleMessageListenerContainer:
@Value("#{'${queue.names}'.split(',')}")
private String[] queueNames;

@Bean
public SimpleMessageListenerContainer listenerContainer(final ConnectionFactory connectionFactory) {
    final SimpleMessageListenerContainer container = new SimpleMessageListenerContainer();
    container.setConnectionFactory(connectionFactory);
    container.setQueueNames(queueNames);
    container.setMessageListener(vehiclesReceiver());
    setCommonQueueProperties(container);
    return container;
}
Each <rabbit:listener> creates its own SimpleMessageListenerContainer bean with the same ConnectionFactory. To do the same in Java config, you have to declare as many SimpleMessageListenerContainer beans as you have queues: one for each of them.
You may also consider using the @RabbitListener approach instead: https://docs.spring.io/spring-amqp/docs/2.0.4.RELEASE/reference/html/_reference.html#async-annotation-driven
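A hedged sketch of that approach, resolving the queue names from the same 'queues' property (the attribute values and method name are assumptions; manual acks as in your containers):
@RabbitListener(queues = "#{'${queues}'.split(',')}", exclusive = true, concurrency = "1")
public void onMessage(Message message, Channel channel) throws IOException {
    // process the message, then ack manually (the listener container factory uses AcknowledgeMode.MANUAL)
    channel.basicAck(message.getMessageProperties().getDeliveryTag(), false);
}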
Even after reading plenty of SO questions (1, 2) and articles, it is unclear to me which is the better option to set for consumers: multiple consumers or a higher prefetch value?
From what I understand, the SimpleRabbitListenerContainerFactory was originally designed around an amqp-client that had only one thread per connection. Does that mean that setting multiple consumers won't make much difference, since there is only one thread that actually consumes from RabbitMQ and then hands messages off to the multiple consumer threads?
Or are there actually several consumers consuming at the same time?
So what is the best practice for the Spring implementation of RabbitMQ concerning prefetch/consumers? When should one be used over the other? And should I switch to the new DirectRabbitListenerContainerFactory? Is it 'better', or does it just depend on the use case?
One downside I see with a high prefetch is that it might cause memory issues if an app consumes more messages than it can hold in its buffer (I haven't actually tested this yet, to be honest).
And when it comes to multiple consumers, I see the downside of having more file descriptors open at the OS level, and I saw this article saying that each consumer actually pings RabbitMQ for each ack, which makes it slower.
FYI, if it is relevant, I usually have my config set up like this:
@Bean
public ConnectionFactory connectionFactory() {
    final CachingConnectionFactory connectionFactory = new CachingConnectionFactory(server);
    connectionFactory.setUsername(username);
    connectionFactory.setPassword(password);
    connectionFactory.setVirtualHost(virtualHost);
    connectionFactory.setRequestedHeartBeat(requestedHeartBeat);
    return connectionFactory;
}

@Bean
public AmqpAdmin amqpAdmin() {
    AmqpAdmin admin = new RabbitAdmin(connectionFactory());
    admin.declareQueue(getRabbitQueue());
    return admin;
}

@Bean
public SimpleRabbitListenerContainerFactory rabbitListenerContainerFactory() {
    final SimpleRabbitListenerContainerFactory factory = new SimpleRabbitListenerContainerFactory();
    factory.setConnectionFactory(connectionFactory());
    factory.setConcurrentConsumers(concurrency);
    factory.setMaxConcurrentConsumers(maxConcurrency);
    factory.setPrefetchCount(prefetch);
    factory.setMissingQueuesFatal(false);
    return factory;
}

@Bean
public Queue getRabbitQueue() {
    final Map<String, Object> p = new HashMap<>();
    p.put("x-max-priority", 10);
    return new Queue(queueName, true, false, false, p);
}
No; the SMLC wasn't "designed for one thread per connection"; it was designed to address a limitation of the amqp-client at the time, which had only one thread per connection, so that thread handed off to consumer threads via an in-memory queue. That is no longer the case; the client is multi-threaded and there is one dedicated thread per consumer.
Having multiple consumers (increasing the concurrency) is completely effective (and was, even with the older client).
Prefetch is really to reduce network chatter and improve overall throughput. Whether you need to increase concurrency really is orthogonal to prefetch. You would typically increase concurrency if (a) your listener is relatively slow to process each message and (b) strict message ordering is not important.
The DirectListenerContainer was introduced to provide a different threading model, where the listener is invoked directly on the amqp-client thread.
The reasons for choosing one container over the other is described in Choosing a Container.
The following features are available with the SMLC, but not the DMLC:
txSize - with the SMLC, you can set this to control how many messages are delivered in a transaction and/or to reduce the number of acks, but it may cause the number of duplicate deliveries to increase after a failure. (The DMLC does have messagesPerAck, which can be used to reduce the acks, the same as with txSize and the SMLC, but it can’t be used with transactions - each message is delivered and ack’d in a separate transaction).
maxConcurrentConsumers and consumer scaling intervals/triggers - there is no auto-scaling in the DMLC; it does, however, allow you to programmatically change the consumersPerQueue property and the consumers will be adjusted accordingly.
However, the DMLC has the following benefits over the SMLC:
Adding and removing queues at runtime is more efficient; with the SMLC, the entire consumer thread is restarted (all consumers canceled and re-created); with the DMLC, unaffected consumers are not canceled.
The context switch between the RabbitMQ Client thread and the consumer thread is avoided.
Threads are shared across consumers rather than having a dedicated thread for each consumer in the SMLC. However, see the IMPORTANT note about the connection factory configuration in the section called “Threading and Asynchronous Consumers”.
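If you do try the DMLC route, a minimal factory sketch might look like this (bean name and values are illustrative, reusing the fields from your configuration):
@Bean
public DirectRabbitListenerContainerFactory directRabbitListenerContainerFactory() {
    final DirectRabbitListenerContainerFactory factory = new DirectRabbitListenerContainerFactory();
    factory.setConnectionFactory(connectionFactory());
    factory.setConsumersPerQueue(2); // adjusted programmatically; there is no auto-scaling
    factory.setMessagesPerAck(10);   // reduce acks (not usable with transactions)
    factory.setPrefetchCount(prefetch);
    return factory;
}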
I am using Spring Kafka in my project, as it seemed a natural choice for consuming Kafka messages in a Spring-based project. To consume messages, I can use the MessageListener interface. Spring Kafka internally takes care of invoking my onMessage method for each new message.
However, in my setting I prefer to explicitly poll for new messages and work on them sequentially (which will take a few seconds). As a workaround, I might just block inside my onMessage implementation, or buffer the messages internally. However, this seems to go against the core idea of Spring Kafka.
Kafka is designed so that consumers have to poll for new messages, which matches my requirements. Is there a way to make use of this "natural" workflow with Spring Kafka?
Should I refrain from using Spring Kafka for this use case?
The KafkaConsumer documentation states:
For use cases where message processing time varies unpredictably, neither of these options may be sufficient. The recommended way to handle these cases is to move message processing to another thread, which allows the consumer to continue calling poll while the processor is still working. Some care must be taken to ensure that committed offsets do not get ahead of the actual position. Typically, you must disable automatic commits and manually commit processed offsets for records only after the thread has finished handling them (depending on the delivery semantics you need). Note also that you will need to pause the partition so that no new records are received from poll until after the thread has finished handling those previously returned.
Related issue: https://github.com/spring-projects/spring-kafka/issues/195
The issue of having to keep polling the consumer has now been resolved (in 0.10.1.x, by KIP-62), so that is no longer a problem, as long as you don't exceed max.poll.interval.ms (5 minutes by default, but it can be increased).
However, if you want to poll yourself, you can still use spring-kafka (e.g. to get the Spring Boot auto-configuration goodness if you are using Boot): you can get a Consumer from the DefaultKafkaConsumerFactory and poll() it directly.
Here is how I do it. This is in the context of an integration test configuration class, which I load in my JUnit with:
@Import(IntegrationTestConfiguration.class)
In my Test class I have the following:
@Autowired
Consumer<String, String> consumer;
In my test configuration class, I have:
@Bean
public Consumer<String, String> consumer() {
    String bootstrapAddress = "server:port"; // fix this
    String groupId = "my.group"; // fix this.
    Map<String, Object> props = new HashMap<>();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapAddress);
    props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    ConsumerFactory<String, String> fact = new DefaultKafkaConsumerFactory<>(props);
    // Create the consumer and subscribe to the topic
    Consumer<String, String> consumer = fact.createConsumer();
    String topic = "my.topic"; // fix this.
    List<String> topics = new ArrayList<>();
    topics.add(topic);
    consumer.subscribe(topics);
    return consumer;
}
Finally, in my test, I do:
@Test
public void testSomething() {
    // Do stuff that will publish a message to Kafka
    // Repeat a number of times until you get the message you want...
    // Or you give up
    Duration d = Duration.ofSeconds(2);
    ConsumerRecords<String, String> records = consumer.poll(d);
}