mq slow persistent message reading - ibm-mq

I am trying to track down an issue where a client cannot read messages as fast as it should. Persistent messages are written to a queue. At times, the GET rate is slower than the PUT rate and we see messages backing up.
Using tcpdump, I see the following:
MQGET: Convert, Fail_If_Quiescing, Accept_Truncated_Msg, Syncpoint, Wait
Message is sent
Notification
MQCMIT
MQCMIT_REPLY
In analyzing the dump, the delta between the MQCMIT and the MQCMIT_REPLY is sometimes in the 0.001 second range and sometimes in the 0.1 second range. It seems like the 0.1 second delays are slowing the message transfer down. Is there anything I can do to decrease the delta between the MQCMIT and the MQCMIT_REPLY? Should the client be reading multiple messages before the MQCMIT is sent?
This is MQ 8.0.0.3 on AIX 7.1.

The most straightforward way to increase message throughput on the receiving side is to batch MQGET operations: do not issue MQCMIT for every MQGET, but rather after a number of MQGET operations. MQCMIT is the most expensive operation for persistent messages, since it forces log writes on the queue manager and therefore incurs disk I/O latency. Experiment with the batch size - I often use 100, but some applications can go even higher. Too many outstanding MQGET operations can be problematic, since they keep the transaction open for much longer and prevent log switching.
And of course, check whether your overall system tuning is satisfactory. You might have too much latency between your client and the queue manager, your logs may reside on a slow device, or the logs may share a device with the queue files or an otherwise busy filesystem.
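By way of illustration only, here is a minimal sketch of that batching loop using the mq-golang bindings; this is not from the original question, and the queue object, buffer size, batch size of 100, and error handling are assumptions (connection and open code are omitted):

```go
// Sketch: read up to batchSize messages under syncpoint, then commit once.
// Assumes qMgr is an already-connected ibmmq.MQQueueManager and qObject is a
// queue opened with MQOO_INPUT_AS_Q_DEF (hypothetical names).
package mqbatch

import (
	"log"

	"github.com/ibm-messaging/mq-golang/v5/ibmmq"
)

func drainBatch(qMgr ibmmq.MQQueueManager, qObject ibmmq.MQObject, batchSize int) error {
	buffer := make([]byte, 32768)

	for i := 0; i < batchSize; i++ {
		getmqmd := ibmmq.NewMQMD()
		gmo := ibmmq.NewMQGMO()
		// Same GET options as in the trace above, but the commit is deferred
		// until the end of the batch instead of following every message.
		gmo.Options = ibmmq.MQGMO_SYNCPOINT | ibmmq.MQGMO_WAIT |
			ibmmq.MQGMO_CONVERT | ibmmq.MQGMO_FAIL_IF_QUIESCING |
			ibmmq.MQGMO_ACCEPT_TRUNCATED_MSG
		gmo.WaitInterval = 3000 // milliseconds

		datalen, err := qObject.Get(getmqmd, gmo, buffer)
		if err != nil {
			if mqret, ok := err.(*ibmmq.MQReturn); ok && mqret.MQRC == ibmmq.MQRC_NO_MSG_AVAILABLE {
				break // queue drained for now; commit what we have
			}
			qMgr.Back() // roll back the open unit of work on a real error
			return err
		}
		processMessage(buffer[:datalen]) // application-specific handling
	}

	// One log force for the whole batch instead of one per message.
	return qMgr.Cmit()
}

func processMessage(body []byte) {
	log.Printf("got %d bytes", len(body))
}
```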

Related

Confluent kafka go client memory leak

My service consumes messages from one Kafka topic. While the consumer is idle and blocked waiting for messages, I see a continuous and linear increase in the pod's memory. Go pprof shows that the Go memory consumption is constant at around 40 MB, while at the same time pod metrics show more than 100 MB consumed.
This leads me to the conclusion that memory is consumed in the C library librdkafka, as mentioned here: https://zendesk.engineering/hunting-down-a-c-memory-leak-in-a-go-program-2d08b24b617d
The solution to the memory consumption in librdkafka in the link above was to consume the OffsetCommitResponse events that librdkafka produces. Here is the quote from the link:
It turned out that librdkafka was generating an event every time it
received an OffsetCommitResponse from the Kafka broker (which, with
our auto-commit interval set to 5 seconds, was pretty often), and
placing it in a queue for our app to handle. However, our application
was not actually handling events from that queue, so the size of that
queue grew without bound
Does anyone know how to consume these events in Go? Unfortunately, the link above doesn't show the solution.
I solved this issue by counting the number of consumed messages in my service. When the number of consumed messages reaches a configured value e.g. 100,000 in my case, then I simply close and recreate the kafka consumer and producer.
This solution is not elegant and doesn't solve the original issue, but hey, it stabilized my production. Now I have a flat memory consumption curve.
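For reference, that workaround looks roughly like this with confluent-kafka-go; the broker address, group id, topic name, and the 100,000 threshold are placeholders rather than anything taken from the original post:

```go
package main

import (
	"log"
	"time"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func newConsumer() (*kafka.Consumer, error) {
	c, err := kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers": "localhost:9092", // placeholder
		"group.id":          "my-group",       // placeholder
		"auto.offset.reset": "earliest",
	})
	if err != nil {
		return nil, err
	}
	return c, c.SubscribeTopics([]string{"my-topic"}, nil) // placeholder topic
}

func main() {
	const recycleAfter = 100000 // messages consumed before recreating the client

	c, err := newConsumer()
	if err != nil {
		log.Fatal(err)
	}
	consumed := 0

	for {
		msg, err := c.ReadMessage(time.Second)
		if err != nil {
			continue // timeouts and transient errors: just poll again
		}
		log.Printf("processed message at %v", msg.TopicPartition)
		consumed++

		// Workaround described above: periodically close and recreate the
		// consumer so memory held by the underlying librdkafka client is released.
		if consumed >= recycleAfter {
			c.Close()
			if c, err = newConsumer(); err != nil {
				log.Fatal(err)
			}
			consumed = 0
		}
	}
}
```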

Kafka: is it better to have a lot of small messages or fewer, but bigger ones?

There is a microservice which receives batches of messages from outside and pushes them to Kafka. Each message is sent separately, so each batch has around 1000 messages of 100 bytes each. It seems like the messages take much more space internally, because the free disk space is going down much faster than I expected.
I'm thinking about changing the producer logic so that it puts the whole batch in one message (the consumer will then split it by itself). But I haven't found any information about space or performance issues with many small messages, nor any guidelines about the balance between size and count. And I don't know Kafka well enough to reach my own conclusion.
Thank you.
The producer will, by itself, batch messages that are destined to the same partition, in order to avoid unnecessary network calls.
The producer does this via its background sender thread, which accumulates several messages per partition into a batch before sending them.
If you also set compression on the producer side, it will compress the batches (GZIP, LZ4, and Snappy are the valid codecs) before sending them over the wire. This property can also be set on the broker side (so the messages are sent uncompressed by the producer and compressed by the broker).
Whether you prefer a slower producer (compression will slow it down) or a bigger load on the wire depends on your network capacity. Note that a high compression level on big messages may significantly affect your overall performance.
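As an illustration only (the broker address, topic, and numbers are placeholders), enabling batching and compression on the producer with the confluent-kafka-go client might look like this:

```go
package main

import (
	"log"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
	topic := "my-topic" // placeholder

	p, err := kafka.NewProducer(&kafka.ConfigMap{
		"bootstrap.servers":  "localhost:9092", // placeholder
		"compression.type":   "lz4",            // gzip, snappy, lz4 as mentioned above
		"linger.ms":          5,                // wait briefly so more messages join a batch
		"batch.num.messages": 1000,             // upper bound on messages per batch
	})
	if err != nil {
		log.Fatal(err)
	}
	defer p.Close()

	// The client batches per partition in the background; Produce only enqueues.
	for i := 0; i < 1000; i++ {
		err := p.Produce(&kafka.Message{
			TopicPartition: kafka.TopicPartition{Topic: &topic, Partition: kafka.PartitionAny},
			Value:          []byte("100-byte payload goes here"),
		}, nil)
		if err != nil {
			log.Printf("enqueue failed: %v", err)
		}
	}

	// Wait up to 15 seconds for outstanding messages to be delivered.
	p.Flush(15 * 1000)
}
```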
Anyway, I believe the big/small message problem hurts the consumer side a lot more. Sending messages to Kafka is easy and fast (the default behaviour is async, so the producer won't be too busy), but on the consumer side you'll have to look at the way you are processing the messages:
One Consumer-Worker
Here you couple consuming with processing. This is the simplest way: the consumer runs its own thread, reads a Kafka message, processes it, and then continues the loop.
One Consumer - Many workers
Here you decouple consuming and processing. In most cases, reading from kafka will be faster than the time you need to process the message. It is just physics. In this approach, one consumer feeds many separate worker threads that share the processing load.
More info about this here, just above the Constructors area.
Why do I explain this? Well, if your messages are too big and you choose the first option, your consumer may not call poll() within the timeout interval, so it will rebalance continuously. If your messages are big (and take some time to be processed), it is better to implement the second option, as the consumer will continue on its own way, calling poll() without falling into rebalances.
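A rough sketch of that second option (one consumer feeding a pool of workers), here with the confluent-kafka-go client; the worker count, channel buffer size, and shutdown handling are arbitrary choices, not part of the answer above:

```go
package consumerpool

import (
	"log"
	"time"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

// consumeWithWorkers decouples consuming from processing: one goroutine keeps
// calling the consumer (so polling stays frequent and no rebalance is triggered),
// while `workers` goroutines do the slow, application-specific processing.
// Assumes c is an already-subscribed consumer (construction omitted).
func consumeWithWorkers(c *kafka.Consumer, workers int, stop <-chan struct{}) {
	jobs := make(chan *kafka.Message, 100) // small buffer between consumer and workers

	for i := 0; i < workers; i++ {
		go func(id int) {
			for msg := range jobs {
				// Placeholder for the real (slow) processing.
				log.Printf("worker %d handled offset %v", id, msg.TopicPartition.Offset)
			}
		}(i)
	}

	for {
		select {
		case <-stop:
			close(jobs)
			return
		default:
		}
		msg, err := c.ReadMessage(500 * time.Millisecond)
		if err != nil {
			continue // timeout or transient error; keep polling
		}
		jobs <- msg
	}
}
```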
If the messages are too big and too many, you may have to start thinking about different structures that can buffer the messages in your memory. Pools, deques, and queues, for example, are different options to accomplish this.
You may also increase the poll timeout interval, but this can hide dead consumers from you, so I don't really recommend it.
So my answer would be: it depends, basically, on your network capacity, your required latency, and your processing capacity. If you are able to process big messages as fast as smaller ones, then I wouldn't worry much.
If you need to filter and reprocess older messages, I'd recommend partitioning the topics and sending smaller messages, but that's only one use case.

Achieve concurrency in Kafka consumers

We are working on parallelising our Kafka consumers to process more records and handle peak load. One thing we are already doing is spinning up as many consumers as there are partitions, within the same consumer group.
Our consumer makes an API call, which is synchronous as of now. We felt that making this API call asynchronous would let our consumer handle more load, so we are trying to make the API call asynchronous and advance the offset in its response handler. However, we are seeing an issue with this:
By making the API call asynchronous, we may get the response for the last record first, while the API calls for the previous records may not have started or finished by then. If we commit the offset as soon as we receive the response for the last record, the offset moves to that last record. If the consumer then restarts or a partition rebalances, we will not receive any records before the offset we committed, so we will miss the unprocessed records.
As of now we already have 25 partitions. We would like to understand whether anyone has achieved parallelism without increasing the number of partitions, or whether increasing the partitions is the only way to achieve it (and avoid the offset issues).
First, you need to decouple (if only at first) the reading of the messages from the processing of these messages. Next look at how many concurrent calls you can make to your API as it doesn't make any sense to call it more frequently than the server can handle, asynchronously or not. If the number of concurrent API calls is roughly equal to the number of partitions you have in your topic, then it doesn't make sense to call the API asynchronously.
If the number of partitions is significantly less than the max number of possible concurrent API calls then you have a few choices. You could try to make the max number of concurrent API calls with fewer threads (one per consumer) by calling the API asynchronously as you suggest, or you can create more threads and make your calls synchronously. Of course, then you get into the problem of how your consumers can hand their work off to a greater number of shared threads, but that's exactly what streaming execution platforms like Flink or Storm do for you. Streaming platforms (like Flink) that offer checkpoint processing can also handle your problem of how to handle offset commits when messages are processed out of order. You could roll your own checkpoint processing and roll your own shared thread management, but you'd have to really want to avoid using a streaming execution platform.
Finally, you might have more consumers than max possible concurrent API calls, but then I'd suggest that you just have fewer consumers and share partitions, not API calling threads.
And, of course, you can always change the number of your topic partitions to make your preferred option above more feasible.
Either way, to answer your specific question, you want to look at how Flink does checkpoint processing with Kafka offset commits. To oversimplify (because I don't think you want to roll your own), the Kafka consumers have to remember not only the offsets they just committed, but they also have to hold on to the previously committed offsets, and that defines a block of messages flowing through your application. Either that block of messages is processed all the way through in its entirety, or you need to roll back the processing state of each thread to the point where the last message in the previous block was processed. Again, that's a major oversimplification, but that's kinda how it's done.
You should look at Kafka batch processing. In a nutshell: you can set up a huge batch.size with a small number of partitions (or even a single one). Once the whole batch of messages is consumed on the consumer side (i.e. it is in RAM), you can parallelize processing of those messages any way you want.
I would really like to share links, but there are too many of them all over the web.
UPDATE
In terms of committing offsets, you can do this for the whole batch.
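For example, a per-batch commit loop with the confluent-kafka-go client could look roughly like this (it assumes "enable.auto.commit" is set to false in the consumer configuration; the batch size and timeout are arbitrary):

```go
package batchcommit

import (
	"time"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

// processInBatches reads up to batchSize messages, hands the whole in-memory
// batch to the caller (this is the point where you can fan it out to parallel
// workers), and then commits the offsets once for the entire batch.
func processInBatches(c *kafka.Consumer, batchSize int, handle func([]*kafka.Message)) error {
	for {
		batch := make([]*kafka.Message, 0, batchSize)

		// Gather a batch; stop early if the queue goes quiet for a moment.
		for len(batch) < batchSize {
			msg, err := c.ReadMessage(500 * time.Millisecond)
			if err != nil {
				break // most likely a timeout; process what we have so far
			}
			batch = append(batch, msg)
		}
		if len(batch) == 0 {
			continue
		}

		handle(batch) // process the batch, in parallel if you like

		// One commit for the whole batch.
		if _, err := c.Commit(); err != nil {
			return err
		}
	}
}
```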
In general, Kafka doesn't hit its performance targets by inflating the number of partitions, but rather by relying on batch processing.
I have already seen a lot of projects suffering from partition scaling (you may see issues later, during rebalancing, for example). The rule of thumb: look at every available batch setting first.

ZeroMQ pattern for load balancing work across workers based on idleness

I have a single producer and n workers that I only want to give work to when they're not already processing a unit of work, and I'm struggling to find a good ZeroMQ pattern.
1) REQ/REP
The producer is the requestor and creates a connection to each worker. It tracks which worker is busy and round-robins to idle workers
Problem:
How can the producer be notified of responses and still send new work to idle workers, without dedicating a thread in the producer to each worker?
2) PUSH/PULL
Producer pushes into one socket that all workers feed off, and workers push into another socket that the producer listens to.
Problem:
Has no concept of worker idleness, i.e. work gets stuck behind long units of work
3) PUB/SUB
Non-starter, since there is no way to make sure work doesn't get lost
4) Reverse REQ/REP
Each worker is the REQ end and requests work from the producer and then sends another request when it completes the work
Problem:
The producer has to block on a request for work until there is work (since each recv has to be paired with a send). This prevents workers from reporting work completion.
Could be fixed with a separate completion channel, but the producer still needs some polling mechanism to detect new work and stay on the same thread.
5) PAIR per worker
Each worker has its own PAIR connection allowing independent sending of work and receipt of results
Problem:
Same problem as REQ/REP: it requires a thread per worker
As much as ZeroMQ is non-blocking/async under the hood, I cannot find a pattern that allows my code to be asynchronous as well, rather than blocking in many dedicated threads or spinning in polling loops in fewer. Is this just not a good use case for ZeroMQ?
Your problem is solved with the Load Balancing Pattern in the ZMQ Guide. It's all about flow control whilst also being able to send and receive messages. The producer will only send work requests to idle workers, whilst the workers are able to send and receive other messages at all times, e.g. abort, shutdown, etc.
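To give a flavour of it, here is a heavily condensed sketch of that idea using the pebbe/zmq4 Go binding: a ROUTER socket on the producer, REQ sockets on the workers, and work handed only to workers that have signalled they are idle. The endpoint, worker count, "READY" convention, and job names are simplifications for illustration, not taken from the Guide verbatim:

```go
package main

import (
	"fmt"

	zmq "github.com/pebbe/zmq4"
)

// Worker: REQ socket. Sending "READY" (or a result) tells the producer this
// worker is idle; the next Recv blocks until the producer assigns it work.
func worker(id int) {
	sock, _ := zmq.NewSocket(zmq.REQ)
	defer sock.Close()
	sock.Connect("tcp://localhost:5671") // placeholder endpoint

	sock.Send("READY", 0)
	for {
		work, _ := sock.Recv(0)
		result := fmt.Sprintf("worker %d did %q", id, work) // placeholder processing
		sock.Send(result, 0)                                // result doubles as "I'm idle again"
	}
}

// Producer: ROUTER socket. It remembers which worker identities are idle and
// only dispatches to those, so work never queues behind a busy worker.
func main() {
	router, _ := zmq.NewSocket(zmq.ROUTER)
	defer router.Close()
	router.Bind("tcp://*:5671")

	for i := 0; i < 3; i++ {
		go worker(i)
	}

	idle := []string{}
	pending := []string{"job-1", "job-2", "job-3", "job-4", "job-5"} // placeholder work items

	for {
		// Messages from a REQ peer arrive as [identity, empty delimiter, payload].
		frames, _ := router.RecvMessage(0)
		identity, payload := frames[0], frames[2]
		if payload != "READY" {
			fmt.Println("result:", payload)
		}
		idle = append(idle, identity)

		// Dispatch as long as there is both an idle worker and outstanding work.
		for len(idle) > 0 && len(pending) > 0 {
			id, job := idle[0], pending[0]
			idle, pending = idle[1:], pending[1:]
			router.SendMessage(id, "", job)
		}
	}
}
```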
Push/Pull is your answer.
When you send a message in ZeroMQ, all that happens initially is that it sits in a queue waiting to be delivered to the destination(s). When it has been successfully transferred, it is removed from the queue. The queue is limited in length, and the limit can be changed by setting a socket's high water mark.
There is a/some background thread(s) that manage all this on your behalf, and your calls to the ZeroMQ API are simply issuing instructions to that/those threads. The threads at either end of a socket connection are collaborating to marshall the transfer of messages, i.e. a sender won't send a message unless the recipient can receive it.
Consider what this means in a push/pull set up. Suppose one of your pull workers is falling behind. It won't then be accepting messages. That means that messages being sent to it start piling up until the high water mark is reached. ZeroMQ will no longer send messages to that pull worker. In fact, AFAIK in ZeroMQ, a pull worker whose queue is more full than those of its peers will receive fewer messages, so the workload is evened out across all workers.
So What Does That Mean?
Just send the messages. Let 0MQ sort it out for you.
Whilst there's no explicit flag saying 'already busy', if messages can be sent at all then that means that some pull worker somewhere is able to receive them, solely because it has kept up with the workload. It will therefore be best placed to process new messages.
There are limitations. If all the workers are full up then no messages are sent and you get blocked in the push when it tries to send another message. You can discover this only (it seems) by timing how long the zmq_send() took.
Don't Forget the Network
There's also the matter of network bandwidth to consider. Messages queued in the push will transfer at the rate at which they're consumed by the recipients, or at the speed of the network (whichever is slower). If your network is fundamentally too slow, then it's the Wrong Network for the job.
Latency
Of course, messages piling up in buffers represents latency. This can be restricted by setting the high water mark to be quite low.
This won't cure a high latency problem, but it will allow you to find out that you have one. If you have an inadequate number of pull workers, a low high water mark will result in message sending failing/blocking sooner.
Actually I think in ZeroMQ it blocks for push/pull; you'd have to measure elapsed time in the call to zmq_send() to discover whether things had got bottled up.
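For illustration, lowering the high water mark and detecting a backed-up PUSH socket with a non-blocking send might look like this with the pebbe/zmq4 binding (the endpoint and numbers are arbitrary):

```go
package main

import (
	"log"
	"time"

	zmq "github.com/pebbe/zmq4"
)

func main() {
	push, err := zmq.NewSocket(zmq.PUSH)
	if err != nil {
		log.Fatal(err)
	}
	defer push.Close()

	// Keep very few messages buffered so a backlog surfaces quickly as send
	// failures instead of hidden latency.
	push.SetSndhwm(10)
	push.Bind("tcp://*:5557") // placeholder endpoint

	for i := 0; i < 1000; i++ {
		// DONTWAIT makes the send fail immediately (EAGAIN) instead of blocking
		// when all downstream PULL workers have reached their high water mark.
		if _, err := push.Send("work item", zmq.DONTWAIT); err != nil {
			log.Printf("workers can't keep up, backing off: %v", err)
			time.Sleep(100 * time.Millisecond)
			i-- // retry the same item after the pause
		}
	}
}
```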
Thought about Nanomsg?
Nanomsg is a reboot of ZeroMQ; one of the same guys is involved. There are many things I prefer about it, and ultimately I think it will replace ZeroMQ. It has some fancier patterns which are more universally usable (PAIR works on all transports, unlike in ZeroMQ). Also, the patterns are essentially a pluggable component in the source code, so it is far simpler for patterns to be developed and integrated than in ZeroMQ. There is a discussion on the differences here
Philosophical Discussion
Actor Model
ZeroMQ is definitely in the realms of Actor Model programming. Messages get stuffed into queues / channels / sockets, and at some undetermined point in time later they emerge at the recipient end to be processed.
The danger of this type of architecture is that it is possible to have the potential for deadlock without knowing it.
Suppose you have a system where messages pass both ways down a chain of processes, say instructions in one way and results in the other. It is possible that one of the processes will be trying to send a message whilst the recipient is actually also trying to send a message back to it.
That only works so long as the queues aren't full and can (temporarily) absorb the messages, allowing everyone to move on.
But suppose the network briefly became a little busy for some reason, and that delayed message transfer. The message send might then fail because the high water mark had been reached. Whoops! No one is then sending anything to anyone anymore!
CSP
A development of the Actor Model, called Communicating Sequential Processes, was invented to solve this problem. It has a restriction; there is no buffering of messages at all. No process can complete sending a message until the recipient has received all the data.
The theoretical consequence of this was that it was then possible to mathematically analyse a system design and pronounce it to be free of deadlock. The practical consequence is that if you've built a system that can deadlock, it will do so every time. That's actually not so bad; it'll show up in testing, not post-deployment.
Curiously, this is hinted at in the documentation of Microsoft's Task Parallel Library, where they advocate setting buffer lengths to zero in the interests of achieving a more robust application.
It'd be like setting the ZeroMQ high water mark to zero, except that in zmq_setsockopt() a value of 0 means 'no limit', not nought. The default is non-zero...
CSP is much more suited to real time applications. Any shortage of available workers immediately results in an inability to send messages (so your system knows it's failed to keep up with the real time demand) instead of resulting in an increased latency as data is absorbed by sockets, etc. (which is far harder to discover).
Unfortunately almost every communications technology we have (Ethernet, TCP/IP, ZeroMQ, nanomsg, etc) leans towards Actor Model. Everything has some sort of buffer somewhere, be it a packet buffer on a NIC or a socket buffer in an operating system.
Thus, to implement CSP in the real world, one has to implement flow control on top of the existing transports. This takes work, and it's slightly inefficient. But if a system needs it, it's definitely the way to go.
Personally, I'd love to see 0MQ and Nanomsg adopt it as a behavioural option.

MSMQ really slow performance with Recoverable=true

The performance of my Microsoft Message Queuing (MSMQ) setup is at least a factor of ten slower if I enable persistent messages by setting the Recoverable attribute to true. I did expect a drop in performance, since the messages are written to disk instead of being stored in memory, but not nearly by that much.
Can I do some performance tuning on my message queue?
Edit: My messages are about 2 kilobytes each. With the in-memory version I can create about 10 messages per second. With messages stored on disk, the rate is about 1 per second.
I completely agree that a performance penalty is expected, but 10 messages per second is already so slow that I thought it was the service writing the messages that was the bottleneck.
Non-recoverable messages still get written to disk, but MSMQ doesn't wait for confirmation of success. See "Why are my Express MSMQ messages being written to disk?"
10 Express messages per second is incredibly slow, as is one Recoverable message per second. There is something seriously wrong with the machine you are using, or the service.
On my desktop machine, I can send 1,000 Recoverable 2kb messages in 6-7 seconds.
Cheers
John Breakwell
