Kafka Producer 0.9.0 performance, large number of waiting-threads

We are writing messages at a rate of about 9000 records/sec into our Kafka cluster. At times the producer performance degrades considerably and never recovers. When this happens we see the error "unable to allocate buffer within timeout". Below are the JMX producer metrics taken when the process is running well and when it has reached the bad state. The "waiting-threads" metric is very high when the process degrades; any input would be appreciated.
The producer parameters are:
batch.size=1000000
linger.ms=30000
acks=-1
metadata.fetch.timeout.ms=1000
compression.type=none
max.request.size=10000000
Although the buffer is fully available, the errors are "org.apache.kafka.common.errors.TimeoutException: Failed to allocate memory within the configured max blocking time".

At some point you are starting to send batches of 1,000,000 bytes (batch.size is specified in bytes) and waiting up to 30 seconds to fill them; I think that's why your performance degrades. Try lowering that number or setting linger.ms lower.
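As a rough sketch of that suggestion (the broker address, topic name and serializers below are placeholders, and the smaller batch.size/linger.ms values are only starting points to experiment with, not recommendations):
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TunedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");  // placeholder broker list
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "-1");
        // Much smaller batches and a much shorter linger than the original
        // batch.size=1000000 / linger.ms=30000; tune these experimentally.
        props.put("batch.size", 65536);   // bytes accumulated per partition batch
        props.put("linger.ms", 100);      // max ms to wait for a batch to fill

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.send(new ProducerRecord<>("my-topic", "key", "value"));  // placeholder topic
        producer.close();
    }
}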

Related

Confluent kafka go client memory leak

My service consumes messages from one Kafka topic. While the consumer is idle and blocked waiting for messages, I see a continuous, linear increase in the pod's memory. Go pprof shows that the Go memory consumption is constant at around 40 MB, while at the same time the pod metrics show more than 100 MB consumed.
This leads me to the conclusion that the memory is being consumed in the C library librdkafka, as described here: https://zendesk.engineering/hunting-down-a-c-memory-leak-in-a-go-program-2d08b24b617d
The solution to the memory consumption in librdkafka in the link above was to consume the OffsetCommitResponse events that librdkafka produces. Here is the quote from the link:
It turned out that librdkafka was generating an event every time it
received an OffsetCommitResponse from the Kafka broker (which, with
our auto-commit interval set to 5 seconds, was pretty often), and
placing it in a queue for our app to handle. However, our application
was not actually handling events from that queue, so the size of that
queue grew without bound
Does anyone know how to consume these events in Go? Unfortunately, the link above doesn't mention the solution.
I solved this issue by counting the number of consumed messages in my service. When the number of consumed messages reaches a configured value, e.g. 100,000 in my case, I simply close and recreate the Kafka consumer and producer.
This solution is not elegant and doesn't solve the original issue, but hey, it stabilized my production. Now I have a flat memory consumption curve.

Decrease consume rate on RabbitMq server

We are running a single production RabbitMQ (3.7) server, with around 500 mobile applications connected as producers (MQTT) and around 10 server applications as consumers. Those 500 publishers push messages mostly into one queue and less often into another one.
Recently we had an issue with spikes of backed-up messages in all our queues. The number of backed-up messages went from 1 to 1000. This spike was caused by a drop in the consume rate.
I tried to find out what happened and how to eliminate the spikes in the queues, and whether I should limit the queue length or the connections. But we can't limit anything; we have to perform better. I looked at RabbitMQ memory and CPU usage, and the same for the consumers, and everything looks fine: RabbitMQ was running at around 50% of total load, and similarly for memory. The consumers don't seem to be the bottleneck either, because the consume rate went even higher after the queue length had grown.
I have a couple of questions:
Is RabbitMQ designed for such a large number of consumers?
I read that each queue is single-threaded; is it possible that RabbitMQ just can't handle 500 producers in one queue and the throughput gets lower?
What else can I use to tackle the cause of the lower consume rate? The number of threads in RabbitMQ?
What do you recommend for measuring or benchmarking the performance of a RabbitMQ server?

MQ slow persistent message reading

I am trying to track down an issue where a client cannot read messages as fast as it should. Persistent messages are written to a queue. At times, the GET rate is slower than the PUT rate and we see messages backing up.
Using tcpdump, I see the following:
MQGET: Convert, Fail_If_Quiescing, Accept_Truncated_Msg, Syncpoint, Wait
Message is sent
Notification
MQCMIT
MQCMIT_REPLY
In analyzing the dump, I sometimes see the delta between the MQCMIT and MQCMIT_REPLY in the 0.001-second range, and sometimes in the 0.1-second range. It seems like the 0.1 sec delay is slowing the message transfer down. Is there anything I can do to decrease the delta between MQCMIT and MQCMIT_REPLY? Should the client be reading multiple messages before the MQCMIT is sent?
This is MQ 8.0.0.3 on AIX 7.1.
The most straightforward way to increase message throughput on the receiving side is to batch MQGET operations: do not issue MQCMIT for every MQGET, but rather after a number of MQGET operations. MQCMIT is the most expensive operation for persistent messages, since it forces log writes on the queue manager and therefore suffers disk I/O latency. Experiment with the batch size - I often use 100, but some applications can go even higher. Too many outstanding MQGET operations can be problematic, though, since they keep the transaction running for a much longer time and prevent log switching.
And of course you can check whether your overall system tuning is satisfactory. You might have too high a latency between your client and the queue manager, your logs may reside on a slow device, or the logs may share the device with the queue files or an otherwise busy filesystem.
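To make the batching idea concrete, here is a rough sketch using the IBM MQ classes for Java; the queue manager name, queue name and batch size of 100 are placeholders, and a native MQI client would follow the same MQGMO_SYNCPOINT-plus-periodic-MQCMIT shape:
import com.ibm.mq.MQException;
import com.ibm.mq.MQGetMessageOptions;
import com.ibm.mq.MQMessage;
import com.ibm.mq.MQQueue;
import com.ibm.mq.MQQueueManager;
import com.ibm.mq.constants.CMQC;

public class BatchedGetter {
    public static void main(String[] args) throws MQException {
        MQQueueManager qmgr = new MQQueueManager("QMGR1");               // placeholder queue manager
        MQQueue queue = qmgr.accessQueue("APP.IN", CMQC.MQOO_INPUT_AS_Q_DEF);  // placeholder queue

        MQGetMessageOptions gmo = new MQGetMessageOptions();
        gmo.options = CMQC.MQGMO_SYNCPOINT        // get under a unit of work
                    | CMQC.MQGMO_WAIT
                    | CMQC.MQGMO_FAIL_IF_QUIESCING
                    | CMQC.MQGMO_CONVERT;
        gmo.waitInterval = 5000;                  // ms to wait for a message

        final int batchSize = 100;                // commit after this many gets
        int inBatch = 0;
        while (true) {
            MQMessage msg = new MQMessage();
            try {
                queue.get(msg, gmo);
                // ... process msg ...
                if (++inBatch >= batchSize) {
                    qmgr.commit();                // one log force per batch, not per message
                    inBatch = 0;
                }
            } catch (MQException e) {
                if (e.reasonCode == CMQC.MQRC_NO_MSG_AVAILABLE) {
                    if (inBatch > 0) { qmgr.commit(); inBatch = 0; }  // flush a partial batch on idle
                } else {
                    qmgr.backout();
                    throw e;
                }
            }
        }
    }
}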

Storm latency caused by ack

I was using kafka-storm to connect Kafka and Storm. I have 3 servers running ZooKeeper, Kafka and Storm. There is a topic 'test' in Kafka that has 9 partitions.
In the Storm topology, the number of KafkaSpout executors is 9 and, by default, the number of tasks should be 9 as well. The 'extract' bolt is the only bolt connected to the KafkaSpout, the 'log' spout.
From the UI, there is a huge failure rate in the spout. However, the number of executed messages in the bolt = the number of emitted messages - the number of failed messages in the bolt. This equation almost holds while the failed count is still zero at the beginning.
Based on my understanding, this means that the bolt did receive the messages from the spout, but the ack signals are suspended in flight. That's the reason why the number of acks in the spout is so small.
This problem might be solved by increasing the timeout and the spout's max pending message count, but this will use more memory and I cannot increase it indefinitely.
I was wondering if there is a way to force Storm to ignore the acks for some spout/bolt, so that it will not wait for that signal until timeout. This should increase the throughput significantly, but without guaranteed message processing.
If you set the number of ackers to 0, then Storm will automatically ack every tuple.
config.setNumAckers(0);
Please note that the UI only measures and shows 5% of the data flow, unless you set
config.setStatsSampleRate(1.0d);
Try increasing the bolt's timeout and reducing topology.max.spout.pending.
Also, make sure the spout's nextTuple() method is non-blocking and optimized.
I would also recommend profiling the code; maybe your Storm queues are filling up and you need to increase their sizes.
config.put(Config.TOPOLOGY_TRANSFER_BUFFER_SIZE, 32);
config.put(Config.TOPOLOGY_EXECUTOR_RECEIVE_BUFFER_SIZE, 16384);
config.put(Config.TOPOLOGY_EXECUTOR_SEND_BUFFER_SIZE, 16384);
Your capacity numbers are a bit high, leading me to believe that you're really maximizing the use of system resources (CPU, memory). In other words, the system seems to be bogged down a bit and that's probably why tuples are timing out. You might try using the topology.max.spout.pending config property to limit the number of inflight tuples from the spout. If you can reduce the number just enough, the topology should be able to efficiently handle the load without tuples timing out.
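For reference, a minimal sketch of how those knobs are set on the topology Config; the values are illustrative starting points only, and on Storm 0.9.x the class lives in backtype.storm.Config rather than org.apache.storm.Config:
import org.apache.storm.Config;

Config config = new Config();
config.setMaxSpoutPending(500);    // topology.max.spout.pending: cap on un-acked tuples per spout task
config.setMessageTimeoutSecs(60);  // topology.message.timeout.secs: the ack timeout
config.setNumAckers(1);            // keep acking enabled; 0 disables acking entirely
config.setStatsSampleRate(1.0d);   // make the UI count every tuple instead of the default 5%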

Storm topology processing slowing down gradually

I have been reading about Apache Storm and tried a few examples from storm-starter. I have also learnt about how to tune a topology and how to scale it to perform fast enough to meet the required throughput.
I have created an example topology with acking enabled, and I am able to achieve 3K-5K messages processed per second. It performs really fast in the initial 10 to 15 minutes, or around 1 to 2 million messages, and then it starts slowing down. On the Storm UI, I can see the overall latency going up gradually and never coming back down; after a while the processing drops to only a few hundred messages a second. I get the exact same behavior for all the topologies I tried. The simplest one just reads from Kafka using the KafkaSpout, sends it to a transform bolt that parses the message, and sends it to Kafka again using the KafkaBolt. The parser is very fast, taking less than a millisecond to parse a message. I tried a few options, such as increasing/decreasing the parallelism and changing the buffer sizes, but saw the same behavior. Please help me find the reason for the gradual slowdown in the topology. Here is the config I am using:
1 Nimbus machine (4 CPU) 24GB RAM
2 Supervisor machines (8CPU) and using 1 thread per core with 24GB RAM
4 Node kafka cluster running on above 2 supervisor machines (each topic has 4 partitions)
KafkaSpout(2 parallelism)-->TransformerBolt(8)-->KafkaBolt(2)
topology.executor.receive.buffer.size: 65536
topology.executor.send.buffer.size: 65536
topology.spout.max.batch.size: 65536
topology.transfer.buffer.size: 32
topology.receiver.buffer.size: 8
topology.max.spout.pending: 250
Storm UI and VisualVM screenshots (not shown here): at the start; after a few minutes; after 45 min, when the latency started going up; after 80 min, when the latency keeps going up and reaches about 100 sec by the time it hits 8 to 10 million messages; a VisualVM overview; the threads view.
Pay attention to the capacity metric on RT_LEFT_BOLT: it is very close to 1, which explains why your topology is slowing down.
From the Storm documentation:
The Storm UI has also been made significantly more useful. There are new stats "#executed", "execute latency", and "capacity" tracked for all bolts. The "capacity" metric is very useful and tells you what % of the time in the last 10 minutes the bolt spent executing tuples. If this value is close to 1, then the bolt is "at capacity" and is a bottleneck in your topology. The solution to at-capacity bolts is to increase the parallelism of that bolt.
Therefore, your solution is to add more executors (and tasks) to that bolt (RT_LEFT_BOLT). Another thing you can do is reduce the number of executors on RT_RIGHT_BOLT; its capacity indicates that you don't need that many executors there, and 1 or 2 can probably do the job.
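A minimal sketch of what that rebalancing could look like where the topology is built; the bolt class names and the wiring are placeholders, since the actual topology code isn't shown:
import org.apache.storm.topology.TopologyBuilder;  // backtype.storm.topology on older Storm

TopologyBuilder builder = new TopologyBuilder();
// spout declaration elided; assume it is registered under the id "spout"

// Give the saturated bolt (capacity close to 1) more executors and tasks...
builder.setBolt("RT_LEFT_BOLT", new RtLeftBolt(), 8)    // RtLeftBolt is a placeholder class name
       .setNumTasks(16)
       .shuffleGrouping("spout");

// ...and shrink the under-used one down to 1 or 2 executors.
builder.setBolt("RT_RIGHT_BOLT", new RtRightBolt(), 2)  // RtRightBolt is a placeholder class name
       .shuffleGrouping("RT_LEFT_BOLT");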
The issue was due to a GC setting with the new-generation params: the JVM was not using the allocated heap completely, so the internal Storm queues were getting full and running out of memory. The strange thing was that Storm did not throw an out-of-memory error, it just stalled; with the help of VisualVM I was able to trace it down.
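For anyone hitting the same thing: the worker JVM heap and new-generation sizing are controlled through the worker child options. The flags and sizes below are purely illustrative, not the poster's actual values:
import org.apache.storm.Config;  // backtype.storm.Config on older Storm

Config config = new Config();
// Fixed heap plus an explicit young-generation size, with GC logging
// turned on so VisualVM / the gc log shows whether the heap is actually used.
config.put(Config.TOPOLOGY_WORKER_CHILDOPTS,
        "-Xms2g -Xmx2g -Xmn512m -XX:+UseConcMarkSweepGC -verbose:gc");
The same JVM options can also be applied cluster-wide through worker.childopts in storm.yaml instead of per topology.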
