Tuning Kafka for latency, packet loss, and unreachability - performance

I am trying to optimize the performance of Kafka in a scenario with high latency (>500 ms) and intermittent packet loss. I am working in Java with the 'kafka_2.13' API, version '2.5.0'. I have 24 nodes connected to a single broker, and each node tries to send a small message to all the other subscribers. All nodes can communicate when there is neither packet loss nor latency, but they no longer seem able to communicate as soon as I add latency and packet loss. I will do more tests on Monday, but I was wondering if anyone had suggestions for configuration improvements.
Below is the code I use to publish and receive messages, followed by the configurations I used for the consumers and producers.
boolean sendAsyncMessage (byte[] value, String topic) {
    ProducerRecord<Long, byte[]> record = new ProducerRecord<> (topic, System.currentTimeMillis (), value);
    long msStart = System.currentTimeMillis ();
    producer.send (record, (metadata, exception) -> {
        long msDelta = System.currentTimeMillis () - msStart;
        logger.info ("Message with topic {} sent at {}, was ack after {}", topic, msStart, msDelta);
        if (metadata == null) {
            logger.info ("An exception was triggered during send:" + exception.toString ());
        }
    });
    producer.flush ();
    return true;
}

while (keepGoing.get ()) {
    try {
        // java example do it every time!
        subscribe ();
        ConsumerRecords<Long, byte[]> consumerRecords = consumer.poll (Duration.ofMillis (2000));
        manageMessage (consumerRecords);
        //Thread processRecords = new Thread (() -> manageMessage (consumerRecords));
        //processRecords.start ();
    } catch (Exception e) {
        logger.error ("Problem in polling: " + e.toString ());
    }
}

properties.put (ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, KafkaBroker.KEY_SERIALIZER);
properties.put (ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaBroker.VALUE_SERIALIZER);
// acks is a string-typed setting, so both branches must be strings
properties.put (ProducerConfig.ACKS_CONFIG, reliable ? "1" : "0");
// host1:port1,host2:port2
properties.put (ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, server);
// how many bytes to buffer records waiting to be sent to the server
//properties.put (ProducerConfig.BUFFER_MEMORY_CONFIG, 33554432);
properties.put (ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip");
properties.put (ProducerConfig.CLIENT_ID_CONFIG, clientID);
properties.put (ProducerConfig.CONNECTIONS_MAX_IDLE_MS_CONFIG, 54000000);
// properties.put (ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 1048576);
// properties.put (ProducerConfig.RECONNECT_BACKOFF_MAX_MS_CONFIG, 1000);
properties.put (ProducerConfig.RECONNECT_BACKOFF_MS_CONFIG, 300);
properties.put (ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 300);
properties.put (ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, KafkaBroker.KEY_DESERIALIZER);
// host1:port1,host2:port2
properties.put (ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, server);
// should be the topic
properties.put (ConsumerConfig.GROUP_ID_CONFIG, groupID);
properties.put (ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "3000");
properties.put (ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "30000");
properties.put (ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "1000");

Before trying to change all the settings, I'd make a few changes to your logic:
Currently you call flush() after sending each message, which effectively makes every send synchronous. This is not recommended because it forces the Kafka client to issue a request to the cluster for every single message, which is inefficient. In most cases it is best to let the client decide when to actually send messages and not call flush() at all.
You also call subscribe() in each iteration, which is not needed; call subscribe() only when you want to change the subscription. Creating a new thread in each poll() loop is not recommended either: in addition to being slow, you risk creating hundreds or thousands of threads if the consumer starts fetching large amounts of messages.
Kafka runs over TCP, so lost packets are retransmitted automatically by the transport layer. By default, Kafka clients are also configured to retry most operations and to reconnect automatically to brokers if a connection is lost.
When running your tests, before changing configurations, look at how the Kafka client behaves by monitoring its metrics and logs. Are timeouts being reached because of the latency? Are messages being retried?
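To make the timeout question concrete, here is a minimal sketch of producer settings one might start from on a link with more than 500 ms of latency. The keys are standard producer configuration names, but every value is an assumption to be tuned against your own metrics, and the class and method names (HighLatencyProducerConfig, build, isConsistent) are hypothetical. Plain string keys are used so the sketch compiles without the Kafka client on the classpath.

```java
import java.util.Properties;

public class HighLatencyProducerConfig {

    // Builds producer settings for a slow, lossy link. All values are
    // illustrative assumptions, not recommendations from the Kafka docs.
    public static Properties build(boolean reliable) {
        Properties p = new Properties();
        // "1" waits for the partition leader to acknowledge, "0" does not wait at all.
        p.setProperty("acks", reliable ? "1" : "0");
        // Let the client batch briefly instead of flushing after every message.
        p.setProperty("linger.ms", "50");
        // How long to wait for a broker response before the request counts as failed.
        p.setProperty("request.timeout.ms", "10000");
        // Pause between retries of a failed request.
        p.setProperty("retry.backoff.ms", "500");
        // Upper bound on the total time before send() reports success or failure.
        p.setProperty("delivery.timeout.ms", "60000");
        return p;
    }

    // delivery.timeout.ms should be at least linger.ms + request.timeout.ms.
    public static boolean isConsistent(Properties p) {
        long delivery = Long.parseLong(p.getProperty("delivery.timeout.ms"));
        long linger = Long.parseLong(p.getProperty("linger.ms"));
        long request = Long.parseLong(p.getProperty("request.timeout.ms"));
        return delivery >= linger + request;
    }
}
```

The resulting Properties object can be merged into the producer properties shown in the question; whether these particular values help depends on what the client logs and metrics show.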

In the end, the biggest factor preventing my distributed system from communicating correctly was the producer acks option. Initially we had set it to all (the strictest option), and it seems that this, combined with the degraded network, prevented Kafka from performing comparably to other TCP-based protocols. We now use 0 for unreliable messages and 1 for reliable ones.


Vert.x AMQP client doesn't reconnect when the broker is down

I am trying to write a program that pulls messages from a message broker via the Vert.x AMQP client. I want the program to try to reconnect when the broker goes down. Currently, if I turn off the broker container, the program doesn't react. Below is my code. What am I missing?
public class BrokerConnector {
public void consumeEventsQueue() {
AmqpClientOptions options = new AmqpClientOptions();
AmqpClient amqpClient = AmqpClient.create(options);
amqpClient.connect(con -> {
if (con.failed()) {
System.out.println("Unable to connect to the broker");
} else {
System.out.println("Connection succeeded");
done -> {
if (done.failed()) {
System.out.println("Unable to create receiver");
} else {
AmqpReceiver receiver = done.result();
receiver.handler(msg -> {
System.out.println("Received " + msg.bodyAsString());
To my knowledge (and from peeking at the source), the Vert.x AMQP client doesn't have automatic reconnect, so it is quite normal that your application fails when the connection is lost. The client exposes an exception handler that you can hook into to recreate your client resources when the connection drops. Some AMQP clients do have automatic reconnect built in, such as Qpid JMS or the Qpid ProtonJ2 client.
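The shape of such hand-written reconnect logic can be sketched independently of Vert.x. The following is a minimal, blocking illustration using only the JDK: ReconnectLoop, backoffMillis and connectWithRetry are hypothetical names, the connect callback stands in for whatever recreates the client and receiver, and the 500 ms base and 30 s cap are assumptions. In a real Vert.x application the retry would more likely be scheduled on the event loop from the exception handler rather than run as a blocking loop.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongConsumer;

public class ReconnectLoop {

    // Exponential backoff: baseMillis * 2^attempt, capped at maxMillis.
    public static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis << Math.min(attempt, 20); // bound the shift to avoid overflow
        return Math.min(delay, maxMillis);
    }

    // Calls connect until it returns true or maxAttempts is reached.
    // Returns the number of attempts made, or -1 if every attempt failed.
    // The sleeper is injected so callers decide how to wait (and tests need not sleep).
    public static int connectWithRetry(BooleanSupplier connect, int maxAttempts, LongConsumer sleeper) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (connect.getAsBoolean()) {
                return attempt + 1;
            }
            if (attempt + 1 < maxAttempts) {
                sleeper.accept(backoffMillis(attempt, 500, 30_000));
            }
        }
        return -1;
    }
}
```

A caller would pass a function that tries to rebuild the connection and reports whether it succeeded, plus a sleeper such as a wrapper around Thread.sleep.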

How to detect if RSocket connection is successful?

I have the following program, through which I can detect a connection failure, i.e. via doBeforeRetry.
Can someone tell me how to detect a successful connection or reconnection? I want to integrate a health check that monitors this connection, but I am unable to capture the event that tells me the connection was successful.
requester = RSocketRequester.builder()
.rsocketConnector(connector -> {
.doBeforeRetry(e-> System.out.println("doBeforeRetry===>"+e))
.doAfterRetry(e-> System.out.println("doAfterRetry===>"+e))
.tcp("localhost", 7999);
I achieved the detection of successful connection or reconnection with the following approach.
Client Side (Connection initialization)
Mono<RSocketRequester> requester = Mono.just(RSocketRequester.builder()
// connector configuration goes here
.tcp("localhost", 7999)));
On the server side
public void connect(RSocketRequester requester, @Payload String callerName) {
LOG.info("Client Connection Handshake: [{}]", callerName);
.data("I am server")
On the client side, when I receive the callback on the method below, I know the connection is successful.
public Mono<ConsumerPreference> handshake(final String response){
LOG.info("Server Connection Handshake received : Server message [{}]", response.getCallerName());
return Mono.empty();
throw new InitializationException("Invalid response message received from Server");
Additionally, I created an application-level heartbeat to verify the liveness of the connection.
If you want to know if it's actually healthy, you should probably have a side task that is polling the health of the RSocket, by sending something like a custom ping protocol to your backend. You could time that and confirm that you have a healthy connection, record latencies and success/failures.
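As a rough illustration of the bookkeeping such a side task needs, here is a minimal sketch using only the JDK. PingHealth and its methods are hypothetical names; the actual ping (a request sent through the RSocketRequester to some route on your backend) is not shown, since it depends on your application. The polling task would time each ping and call recordSuccess or recordFailure, and the health check would read isHealthy.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

public class PingHealth {
    private final int maxConsecutiveFailures;
    private final AtomicInteger consecutiveFailures = new AtomicInteger();
    private final AtomicLong successes = new AtomicLong();
    private final AtomicLong failures = new AtomicLong();
    private final AtomicLong lastLatencyMillis = new AtomicLong(-1);

    public PingHealth(int maxConsecutiveFailures) {
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    // Called by the polling task when a ping round trip completes.
    public void recordSuccess(long latencyMillis) {
        successes.incrementAndGet();
        consecutiveFailures.set(0);
        lastLatencyMillis.set(latencyMillis);
    }

    // Called by the polling task when a ping fails or times out.
    public void recordFailure() {
        failures.incrementAndGet();
        consecutiveFailures.incrementAndGet();
    }

    // Healthy once at least one ping has succeeded and the recent
    // failure streak is below the configured limit.
    public boolean isHealthy() {
        return successes.get() > 0 && consecutiveFailures.get() < maxConsecutiveFailures;
    }

    public long lastLatencyMillis() { return lastLatencyMillis.get(); }
    public long successCount() { return successes.get(); }
    public long failureCount() { return failures.get(); }
}
```

Requiring a streak of failures rather than a single one is a design choice that avoids flapping on a lossy link; the threshold is an assumption to tune.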

google-cloud pubsub leaves messages in queue after sending ack

I have this subscriber code:
try {
final List<ReceivedMessage> messages = syncSubscriber.fetch(10, true);//get all current messages.
List<String> ackIds = new ArrayList<>();
for (ReceivedMessage message : messages) {
//preferred bulk ack, due to network performance
public void sendAck(Collection<String> ackIdList) {
if (ackIdList != null && ackIdList.size() != 0) {
String subscriptionName = SubscriptionName.format(this.getProjectId(), this.subscriptionId);
AcknowledgeRequest acknowledgeRequest = AcknowledgeRequest.newBuilder().setSubscription(subscriptionName).addAllAckIds(ackIdList).build();
I poll the Pub/Sub queue in a loop, and even though the code sends the ack, I still get the same messages.
How should I ack otherwise?
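My problem was that I had a breakpoint between receiving the message and sending the ack. My Pub/Sub subscription was configured with a 10-second timeout, so by the time I resumed from the breakpoint the ack arrived too late and the messages were delivered again.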
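The reasoning can be sketched as a small timing check. This is only an illustration using the JDK: AckDeadline and isExpired are hypothetical names, not part of the Pub/Sub client, and the 10-second value comes from the configuration described above.

```java
import java.time.Duration;
import java.time.Instant;

public class AckDeadline {

    // True if more time has passed since the message was received than the
    // subscription's acknowledgement deadline allows. An ack sent after that
    // point can be too late to prevent redelivery.
    public static boolean isExpired(Instant receivedAt, Instant now, Duration ackDeadline) {
        return Duration.between(receivedAt, now).compareTo(ackDeadline) > 0;
    }
}
```

With a 10-second deadline, an ack sent 3 seconds after receipt is in time, while one sent 45 seconds later (for example after sitting at a breakpoint) is not.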

AMQP Appender pending message count

We are sending audit log messages to a RabbitMQ cluster which is sometimes unavailable for reasons we cannot influence.
When the queue is not available, log messages start to accumulate locally and we eventually get an out-of-memory error on the client.
We are using an AMQP appender to submit our messages.
Is there a way we can query the count of pending log messages and raise an alert when messages start adding up?
Well, it isn't possible. There are simply no hooks for that.
You could, though, consider decreasing maxSenderRetries from its default of 30 to 1 or 2. Once the retries are exhausted, you'll start to lose log messages:
int retries = event.incrementRetries();
if (retries < AmqpAppender.this.maxSenderRetries) {
// Schedule a retry based on the number of times I've tried to re-send this
AmqpAppender.this.retryTimer.schedule(new TimerTask() {
public void run() {
}, (long) (Math.pow(retries, Math.log(retries)) * 1000));
else {
addError("Could not send log message " + logEvent.getMessage()
+ " after " + AmqpAppender.this.maxSenderRetries + " retries", e);
We might have to expose a queueSize option instead of relying on the default, unbounded queue:
public LinkedBlockingQueue() {
Feel free to raise a JIRA on the matter.
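To illustrate what a bounded queue with a visible pending count could look like, here is a minimal sketch using only the JDK. It is not the appender's code: BoundedEventBuffer, pendingCount and shouldAlert are hypothetical names, and the capacity and alert threshold are assumptions.

```java
import java.util.concurrent.LinkedBlockingQueue;

public class BoundedEventBuffer<E> {
    private final LinkedBlockingQueue<E> queue;
    private final int alertThreshold;

    public BoundedEventBuffer(int capacity, int alertThreshold) {
        // A capacity-bounded queue, unlike the no-argument constructor,
        // which allows up to Integer.MAX_VALUE elements.
        this.queue = new LinkedBlockingQueue<>(capacity);
        this.alertThreshold = alertThreshold;
    }

    // Non-blocking enqueue; returns false (event dropped) when the buffer is full.
    public boolean offer(E event) {
        return queue.offer(event);
    }

    // Removes and returns the oldest pending event, or null if none is pending.
    public E poll() {
        return queue.poll();
    }

    // Number of events waiting to be sent.
    public int pendingCount() {
        return queue.size();
    }

    // True once the backlog has reached the alert threshold.
    public boolean shouldAlert() {
        return queue.size() >= alertThreshold;
    }
}
```

Dropping events when the buffer is full trades lost log messages for bounded memory, which is the same trade-off as lowering maxSenderRetries.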

Subscribing to a removed queue with spring-websocket and RabbitMQ broker (Queue NOT_FOUND)

I have a spring-websocket (4.1.6) application on Tomcat8 that uses a STOMP RabbitMQ (3.4.4) message broker for messaging. When a client (Chrome 47) starts the application, it subscribes to an endpoint creating a durable queue. When this client unsubscribes from the endpoint, the queue will be cleaned up by RabbitMQ after 30 seconds as defined in a custom made RabbitMQ policy. When I try to reconnect to an endpoint that has a queue that was cleaned up, I receive the following exception in the RabbitMQ logs: "NOT_FOUND - no queue 'position-updates-user9zm_szz9' in vhost '/'\n". I don't want to use an auto-delete queue since I have some reconnect logic in case the websocket connection dies.
This problem can be reproduced by adding the following code to the spring-websocket-portfolio github example.
In the container div in the index.html add:
<button class="btn" onclick="appModel.subscribe()">SUBSCRIBE</button>
<button class="btn" onclick="appModel.unsubscribe()">UNSUBSCRIBE</button>
In portfolio.js replace:
stompClient.subscribe("/user/queue/position-updates", function(message) {
with:
positionUpdates = stompClient.subscribe("/user/queue/position-updates", function(message) {
and also add the following:
self.unsubscribe = function() {
self.subscribe = function() {
positionUpdates = stompClient.subscribe("/user/queue/position-updates", function(message) {
self.pushNotification("Position update " + message.body);
Now you can reproduce the problem by:
Launch the application
click unsubscribe
delete the position-updates queue in the RabbitMQ console
click subscribe
Find the error message in the websocket frame via the chrome devtools and in the RabbitMQ logs.
"reconnect logic in case the websocket connection dies" and "no queue 'position-updates-user9zm_szz9' in vhost" are entirely different stories.
I'd suggest you implement "re-subscribe" logic for the case of a deleted queue.
Actually, that is how STOMP works: it creates an auto-delete (generated) queue for the subscription and, yes, that queue is removed on unsubscribe.
See the RabbitMQ STOMP Adapter Manual for more information.
Alternatively, consider subscribing to an existing AMQP queue:
To address existing queues created outside the STOMP adapter, destinations of the form /amq/queue/<name> can be used.
The problem is that STOMP won't recreate the queue if it gets deleted by the RabbitMQ policy. I worked around it by creating the queue myself when the SessionSubscribeEvent is fired.
public void onApplicationEvent(AbstractSubProtocolEvent event) {
if (event instanceof SessionSubscribeEvent) {
MultiValueMap nativeHeaders = (MultiValueMap)event.getMessage().getHeaders().get("nativeHeaders");
List destination = (List)nativeHeaders.get("destination");
String queueName = ((String)destination.get(0)).substring("/queue/".length());
try {
Connection connection = connectionFactory.newConnection();
Channel channel = connection.createChannel();
channel.queueDeclare(queueName, true, false, false, null);
} catch (IOException e) {
