The performance of my Microsoft Message Queuing (MSMQ) setup is at least a factor of ten slower when I enable persistent messages by setting the Recoverable attribute to true. I expected a drop in performance, since the messages are written to disk instead of being stored in memory, but not nearly by that much.
Is there any performance tuning I can do on my message queue?
Edit: My messages are about 2 kilobytes each. With the in-memory version I can create about 10 messages per second. With messages stored on disk, the speed is about 1 per second.
I completely agree that a performance penalty is expected, but 10 messages per second is already so slow that I suspected the service writing the messages was the bottleneck.
Non-recoverable messages still get written to disk, but MSMQ doesn't wait for confirmation of success; see "Why are my Express MSMQ messages being written to disk?"
10 Express messages per second is incredibly slow, as is one Recoverable message per second. There is something seriously wrong with the machine you are using, or the service.
On my desktop machine, I can send 1,000 Recoverable 2kb messages in 6-7 seconds.
Cheers
John Breakwell
There is a microservice which receives batches of messages from outside and pushes them to Kafka. Each message is sent separately, so for each batch I have around 1,000 messages of 100 bytes each. It seems like the messages take up much more space internally, because the free space on the disk is going down much faster than I expected.
I'm thinking about changing the producer logic so that it puts the whole batch into one message (the consumer would then split it up itself). But I haven't found any information about space or performance issues with many small messages, nor any guidelines about the balance between size and count, and I don't know Kafka well enough to reach my own conclusion.
Thank you.
The producer will, by itself, batch messages that are destined for the same partition, in order to avoid unnecessary network calls.
The producer does this through its background sender thread, which accumulates several messages per partition into a batch before sending them.
If you also enable compression on the producer side, it will compress the batches (GZIP, LZ4 and Snappy are the valid codecs) before putting them on the wire. This property can also be set on the broker side (so the messages are sent uncompressed by the producer and compressed by the broker).
Whether you prefer a slower producer (compression costs CPU) or a bigger load on the wire depends on your network capacity. Note that a high compression level on large messages can hurt your overall performance considerably.
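For illustration, here is a minimal sketch of a producer with batching and compression enabled; the broker address, topic name and the concrete values are assumptions, not recommendations:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BatchingProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Batching: accumulate up to 16 KB per partition, or wait up to 10 ms, before sending.
        props.put("batch.size", "16384");
        props.put("linger.ms", "10");
        // Compress each batch before it goes on the wire.
        props.put("compression.type", "snappy");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1000; i++) {
                // send() is asynchronous; records are collected into batches by the background thread.
                producer.send(new ProducerRecord<>("my-topic", "key-" + i, "payload-" + i));
            }
        } // close() flushes any batches that are still buffered
    }
}
```

With small 100-byte messages, a non-zero linger.ms gives the sender thread a chance to fill batches instead of sending each record on its own.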
Anyway, I believe the big/small message problem hurts the consumer side a lot more. Sending messages to Kafka is easy and fast (the default behaviour is asynchronous, so the producer won't be too busy), but on the consumer side you have to look at how you process the messages:
One Consumer - One Worker
Here you couple consuming with processing. This is the simplest way: the consumer runs in its own thread, reads a Kafka message, processes it, and then continues the loop.
One Consumer - Many Workers
Here you decouple consuming and processing. In most cases, reading from Kafka will be faster than the time you need to process the message. It is just physics. In this approach, one consumer feeds many separate worker threads that share the processing load.
More info about this in the KafkaConsumer Javadoc, just above the Constructors area.
Why do I explain this? Well, if your messages are too big and you choose the first option, your consumer may not call poll() within the timeout interval, so it will rebalance continuously. If your messages are big (and take some time to be processed), it's better to implement the second option, as the consumer keeps going its own way, calling poll() without falling into rebalances (see the sketch below).
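A rough sketch of that second option, assuming a reasonably recent Java client; the broker address, group id, topic name and pool size are placeholders, and offset handling is left at the defaults to keep it short:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ConsumerWithWorkersSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "batch-processors");        // assumed group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // Separate pool that does the (possibly slow) processing; the size is arbitrary here.
        ExecutorService workers = Executors.newFixedThreadPool(8);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // assumed topic name
            while (true) {
                // This thread only polls, so poll() keeps being called on time and the
                // group does not rebalance, no matter how slow the processing is.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Note: with default auto-commit, offsets can be committed before a worker
                    // finishes, so a crash may skip messages; manage commits yourself if that matters.
                    workers.submit(() -> process(record.value()));
                }
            }
        }
    }

    private static void process(String value) {
        // placeholder for the real message handling
    }
}
```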
If the messages are too big and too many, you may have to start thinking about different structures that can buffer the messages in memory. Pools, deques and queues, for example, are different options to accomplish this.
You may also increase the poll timeout interval, but this can hide dead consumers from you, so I don't really recommend it.
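If you do go that route anyway, the relevant settings on a recent Java client look roughly like this, continuing the `props` object from the sketch above (the property names assume a client at 0.10.1 or later, so check your version):

```java
// Allow more time between poll() calls and fetch fewer records per poll.
props.put("max.poll.interval.ms", "600000"); // up to 10 minutes of processing between polls
props.put("max.poll.records", "100");        // fewer records handed back per poll()
props.put("session.timeout.ms", "30000");    // heartbeat-based liveness detection stays tight
```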
So my answer would be: it depends, basically, on your network capacity, your required latency and your processing capacity. If you can process big messages as fast as smaller ones, I wouldn't worry much.
Maybe if you need to filter and reprocess older messages I'd recommend partitioning the topics and sending smaller messages, but that's only one use case.
I'm running a 4-core Amazon EC2 instance (m3.xlarge) with 200,000 concurrent connections and no resource problems (each core at 10-20%, memory at 2/14 GB). However, if I emit a message to all connected users, the first user on a CPU core gets it within milliseconds, but the last connected user gets it with a delay of 1-3 seconds, and each CPU core goes up to 100% for 1-2 seconds. I noticed this problem even at "only" 50k concurrent users (12.5k per core).
How to reduce the delay?
I tried changing redis-adapter to mongo-adapter with no difference.
I'm using this code to get sticky sessions across multiple CPU cores:
https://github.com/elad/node-cluster-socket.io
The test was very simple: the clients just connect and do nothing more. The server only listens for a message and emits to all.
EDIT: I tested a single core without any cluster/adapter logic with 50k clients, with the same result.
I published the server, single-core-server, benchmark and html-client in one package: https://github.com/MickL/socket-io-benchmark-kit
OK, let's break this down a bit. 200,000 users on four cores. If perfectly distributed, that's 50,000 users per core. So, if sending a message to a given user takes .1ms each of CPU time, that would take 50,000 * .1ms = 5 seconds to send them all.
If you see CPU utilization go to 100% during this, then the bottleneck is probably CPU and maybe you need more cores on the problem. But there may be other bottlenecks too, such as network bandwidth, network adapters or the redis process. So, one thing to determine immediately is whether your end-to-end time scales with the number of clusters/CPUs you have: if you drop to 2 cores, does the end-to-end time double? If you go to 8, does it drop in half? If yes to both, that's good news, because it means you are probably only running into a CPU bottleneck at the moment, not other bottlenecks. If that's the case, then you need to figure out how to make 200,000 emits across multiple clusters more efficient by examining the node-cluster-socket.io code and finding ways to optimize your specific situation.
The best the code could do would be to have every CPU do all its housekeeping to gather exactly what it needs to send to its 50,000 users, and then have each CPU run a tight loop sending 50,000 network packets one right after the other. I can't really tell from the redis adapter code whether that is what happens or not.
A much worse case would be some process getting all 200,000 socket IDs and then looping over them, where for each socket ID it has to look up in redis which server holds that connection and then send that server a message telling it to deliver to that socket. That would be far less efficient than instructing each server to just send a message to all of its own connected users.
It would be worth trying to figure out (by studying the code) where in this spectrum the socket.io + redis combination falls.
Oh, and if you're using an SSL connection for each socket, you are also devoting some CPU to crypto on every send operation. There are ways to offload the SSL processing from your regular CPU (using additional hardware).
I am trying to track down an issue where a client cannot read messages as fast as it should. Persistent messages are written to a queue. At times, the GET rate is slower than the PUT rate and we see messages backing up.
Using tcpdump, I see the following:
MQGET: Convert, Fail_If_Quiescing, Accept_Truncated_Msg, Syncpoint, Wait
Message is sent
Notification
MQCMIT
MQCMIT_REPLY
In analyzing the dump, I sometimes see the delta between the MQCMIT and MQCMIT_REPLY in the 0.001-second range, and sometimes in the 0.1-second range. It seems like the 0.1-second delay is slowing the message transfer down. Is there anything I can do to decrease the delta between MQCMIT and MQCMIT_REPLY? Should the client be reading multiple messages before the MQCMIT is sent?
This is MQ 8.0.0.3 on AIX 7.1.
The most straightforward way to increase message throughput on the receiving side is to batch MQGET operations. That is, do not issue MQCMIT for every MQGET, but rather after a number of MQGET operations. MQCMIT is the most expensive operation for persistent messages, since it forces log writes on the queue manager and therefore suffers disk I/O latency. Experiment with the batch size - I often use 100, but some applications can go even higher. Too many outstanding MQGET operations can be problematic, since they keep the transaction running for a much longer time and prevent log switching.
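A rough sketch of that batching pattern using the IBM MQ classes for Java (the queue manager name, queue name and batch size are placeholders, and the option constants are quoted from memory, so verify them against your client; the same idea applies directly to the C MQI):

```java
import com.ibm.mq.MQException;
import com.ibm.mq.MQGetMessageOptions;
import com.ibm.mq.MQMessage;
import com.ibm.mq.MQQueue;
import com.ibm.mq.MQQueueManager;
import com.ibm.mq.constants.CMQC;

public class BatchedGetSketch {
    public static void main(String[] args) throws MQException {
        MQQueueManager qmgr = new MQQueueManager("QM1");                 // assumed queue manager
        MQQueue queue = qmgr.accessQueue("APP.IN",                       // assumed queue name
                CMQC.MQOO_INPUT_SHARED | CMQC.MQOO_FAIL_IF_QUIESCING);

        // Same options as in the trace: Convert, Fail_If_Quiescing, Accept_Truncated_Msg, Syncpoint, Wait.
        MQGetMessageOptions gmo = new MQGetMessageOptions();
        gmo.options = CMQC.MQGMO_CONVERT | CMQC.MQGMO_FAIL_IF_QUIESCING
                | CMQC.MQGMO_ACCEPT_TRUNCATED_MSG | CMQC.MQGMO_SYNCPOINT | CMQC.MQGMO_WAIT;
        gmo.waitInterval = 5000;                                         // milliseconds

        final int batchSize = 100;  // commit once per 100 gets instead of once per get
        int uncommitted = 0;
        while (true) {
            MQMessage msg = new MQMessage();
            try {
                queue.get(msg, gmo);
                // ... process msg ...
                if (++uncommitted >= batchSize) {
                    qmgr.commit();   // one forced log write covers the whole batch
                    uncommitted = 0;
                }
            } catch (MQException e) {
                if (e.reasonCode == CMQC.MQRC_NO_MSG_AVAILABLE) {
                    if (uncommitted > 0) {   // queue drained: commit whatever is outstanding
                        qmgr.commit();
                        uncommitted = 0;
                    }
                } else {
                    throw e;
                }
            }
        }
    }
}
```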
And of course you can check whether your overall system tuning is satisfactory. You might have too long a latency between your client and queue manager, or your logs may reside on a slow device, or the logs may share the device with the queue files or an otherwise busy filesystem.
I have MQ server 7.1 running on machine 1, and a Java app running on machine 2 that uses JMS to write messages to a queue on machine 1. The Java app handles hundreds of messages per second (the data comes from elsewhere). Currently it takes about 100 ms to write 200 text messages (average size 600 bytes) to the queue, i.e. about 2,000 messages per second. Is this reasonable performance? What are some things one can do to improve the performance further, i.e. make it faster?
There are a number of detailed recommendations available in the WebSphere MQ Performance Reports. These are published as SupportPacs. If you start at the SupportPac landing page, the ones you want are all named MPxx and are available per-platform and per-version.
As you will see from the SupportPacs, WMQ out of the box is tuned for a balance of speed and reliability across a wide variation of message sizes and types. There is considerable latitude for tuning through configuration and through design/architecture.
From the configuration perspective, there are buffers for persistent and non-persistent messages, an option to reduce disk write integrity from triple-write to single-write, tuning of log file sizes and numbers, connection multiplexing, etc., etc. You may infer from this that the more the QMgr is tuned to specific traffic characteristics, the faster you can get it to go. The flip side of this is that a QMgr tuned that tightly will tend to react badly if a new type of traffic shows up that is outside the tuning specifications.
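For illustration only, the kind of qm.ini entries that paragraph refers to look roughly like this; the attribute names are from memory and the values are placeholders, so check the Performance Reports and the product documentation for your version before changing anything (SingleWrite in particular is only safe on storage that guarantees the write integrity the log expects):

```
Log:
   LogPrimaryFiles=10
   LogSecondaryFiles=5
   LogFilePages=16384
   LogBufferPages=512
   LogWriteIntegrity=SingleWrite

TuningParameters:
   DefaultQBufferSize=1048576
   DefaultPQBufferSize=1048576
```

DefaultQBufferSize and DefaultPQBufferSize size the in-memory buffers for non-persistent and persistent messages per queue, while the Log attributes size the log files and buffers and select the disk write integrity mode mentioned above.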
I have also seen tremendous performance improvement allocating the WMQ filesystems to separate spindles. When a persistent message is written, it goes both to queue files and to log files. If both of those filesystems are in contention for the same disk read/write heads, this can degrade performance. This is why WMQ can sometimes run slower on a high-performance laptop than on a virtual machine or server of approximately the same size. If the laptop has physical spinning disk where the WMQ filesystems are both allocated and the server has SAN, there's no comparison.
From a design standpoint, much performance can be gained from parallelism. The Performance reports show that adding more client connections significantly improves performance, up to a point where it then levels off and eventually begins to decline. Fortunately, the top number of clients before it falls off is VERY large and the web app server typically bogs down before WMQ does, just from the number of Java threads required.
Another implementation detail that can make a big difference is the commit interval. If the app is such that many messages can be put or got at a time, doing so improves performance. A persistent message under syncpoint doesn't need to be flushed to disk until the COMMIT occurs. Writing multiple messages in a single unit of work allows WMQ to return control to the program faster, buffer the writes and then optimize them much more efficiently than writing one message at a time.
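As a sketch of that commit-interval idea on the sending side in JMS (the connection factory setup is omitted, and the queue name and batch size of 100 are placeholders):

```java
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;

public class BatchedPutSketch {
    // cf is assumed to be an already configured WMQ ConnectionFactory
    public static void sendBatched(ConnectionFactory cf, String[] payloads) throws Exception {
        Connection conn = cf.createConnection();
        try {
            // Transacted session: persistent messages are only hardened to the log at commit().
            Session session = conn.createSession(true, Session.SESSION_TRANSACTED);
            Queue queue = session.createQueue("APP.OUT");   // assumed queue name
            MessageProducer producer = session.createProducer(queue);

            int inBatch = 0;
            for (String payload : payloads) {
                TextMessage msg = session.createTextMessage(payload);
                producer.send(msg);
                // One disk force per 100 puts instead of one per put.
                if (++inBatch == 100) {
                    session.commit();
                    inBatch = 0;
                }
            }
            if (inBatch > 0) {
                session.commit();   // commit the final partial batch
            }
        } finally {
            conn.close();
        }
    }
}
```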
The Of Mice and Elephants article contains additional in-depth discussion of tuning options. It is part of the developerWorks Mission:Messaging series which contains some other articles which also touch on tuning.
I recommend reading this: Configuring and tuning WebSphere MQ for performance on Windows and UNIX.
I developed a server for a custom TCP/IP-based protocol with Netty. Writing it was a pleasure.
Right now I am testing performance. I wrote a test application on Netty that simply connects lots (20,000+) of "clients" to the server (a for-loop with Thread.wait(1) after each bootstrap connect). As soon as a client channel is connected, it sends a login request to the server, which checks the account and sends a login response.
The overall performance seems to be quite OK. All clients are logged in within 60 seconds. But what's not so good is the spread of waiting times per connection. I see extremely fast logins and extremely slow ones, varying from 9 ms to 40,000 ms over the whole test time. Is it somehow possible to share waiting time among the requesting channels (FIFO)?
I measured a lot of significant timestamps and found a strange phenomenon. For many connections the server's "channel connected" timestamp is way after the client's (up to 19 seconds). I also have the "normal" case, where they match and just the time between client send and server reception is several seconds. And there are cases of everything in between. How can the client's and the server's "channel connected" timestamps be so far apart?
What is certain is that the client receives the server's login response immediately after it has been sent.
Tuning:
I think I read most of the performance articles around here. I am using the OrderedMemoryAwareThreadPoolExecutor with 200 threads on a 4-CPU Hyper-Threading i7 for the incoming connections, and I also start the server application with the known aggressive options. I also completely tweaked my Win7 TCP stack.
The server runs very smoothly on my machine. CPU usage and memory consumption are at about 50% of what could be used.
Too much information:
I also started 2 of my test apps from 2 separate machines, "attacking" the server in parallel with 15,000 connections each. There I had about 800 connections that got a timeout from the server. Any comments here?
Best regards and cheers to Netty,
Martin
Netty has a dedicated boss thread that accepts incoming connections. When the boss thread accepts a new connection, it forwards it to a worker thread. Because of this, the latency between acceptance and the actual socket read can be larger than expected under load. Although we are looking into different ways to improve the situation, in the meantime you might want to increase the number of worker threads so that each worker thread handles fewer connections.
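For Netty 3.x, which the question appears to use, the worker count can be raised when creating the channel factory; this is just a sketch and the value of 32 is a placeholder to tune, not a recommendation:

```java
import java.util.concurrent.Executors;
import org.jboss.netty.bootstrap.ServerBootstrap;
import org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory;

public class ServerBootstrapSketch {
    public static ServerBootstrap create() {
        int workerCount = 32; // placeholder: raise this so each worker owns fewer connections
        // One boss thread accepts connections; workerCount NIO threads handle the accepted channels.
        NioServerSocketChannelFactory factory = new NioServerSocketChannelFactory(
                Executors.newCachedThreadPool(),  // boss executor
                Executors.newCachedThreadPool(),  // worker executor
                workerCount);
        return new ServerBootstrap(factory);
    }
}
```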
If you think it's performing much worse than a non-Netty application, please feel free to file an issue with a reproducing test case. We will try to reproduce and fix the problem.