WebSphere MQ Performance

I have MQ server 7.1 running on machine 1. I have a Java app running on machine 2 that uses JMS to write messages to a queue on machine 1. The Java app handles hundreds of messages per second (data coming from elsewhere). Currently it takes about 100ms to write 200 text messages (average size 600 bytes), i.e. about 2,000 messages per second. Is this reasonable performance? What are some things one can do to make it faster?

There are a number of detailed recommendations available in the WebSphere MQ Performance Reports. These are published as SupportPacs. If you start at the SupportPac landing page, the ones you want are all named MPxx and are available per-platform and per-version.
As you will see from the SupportPacs, WMQ out of the box is tuned for a balance of speed and reliability across a wide variation of message sizes and types. There is considerable latitude for tuning through configuration and through design/architecture.
From the configuration perspective, there are buffers for persistent and non-persistent messages, an option to reduce disk write integrity from triple-write to single-write, tuning of log file sizes and numbers, connection multiplexing, etc., etc. You may infer from this that the more the QMgr is tuned to specific traffic characteristics, the faster you can get it to go. The flip side of this is that a QMgr tuned that tightly will tend to react badly if a new type of traffic shows up that is outside the tuning specifications.
I have also seen tremendous performance improvement allocating the WMQ filesystems to separate spindles. When a persistent message is written, it goes both to queue files and to log files. If both of those filesystems are in contention for the same disk read/write heads, this can degrade performance. This is why WMQ can sometimes run slower on a high-performance laptop than on a virtual machine or server of approximately the same size. If the laptop has physical spinning disk where the WMQ filesystems are both allocated and the server has SAN, there's no comparison.
From a design standpoint, much performance can be gained from parallelism. The Performance reports show that adding more client connections significantly improves performance, up to a point where it then levels off and eventually begins to decline. Fortunately, the top number of clients before it falls off is VERY large and the web app server typically bogs down before WMQ does, just from the number of Java threads required.
Another implementation detail that can make a big difference is the commit interval. If the app is such that many messages can be put or got at a time, doing so improves performance. A persistent message under syncpoint doesn't need to be flushed to disk until the COMMIT occurs. Writing multiple messages in a single unit of work allows WMQ to return control to the program faster, buffer the writes and then optimize them much more efficiently than writing one message at a time.
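To make the commit interval concrete, here is a minimal sketch of batching puts under syncpoint with a transacted JMS session (the JNDI names and the batch size of 100 are illustrative assumptions, not anything from the question):

import javax.jms.*;
import javax.naming.InitialContext;

public class BatchedSender {
    public static void main(String[] args) throws Exception {
        InitialContext ctx = new InitialContext();
        // JNDI names are placeholders; adjust for your environment.
        ConnectionFactory cf = (ConnectionFactory) ctx.lookup("jms/ConnectionFactory");
        Queue queue = (Queue) ctx.lookup("jms/MyQueue");

        Connection conn = cf.createConnection();
        // 'true' makes the session transacted: sends are buffered until commit().
        Session session = conn.createSession(true, Session.SESSION_TRANSACTED);
        MessageProducer producer = session.createProducer(queue);

        int batchSize = 100;
        for (int i = 1; i <= 1000; i++) {
            producer.send(session.createTextMessage("message " + i));
            if (i % batchSize == 0) {
                session.commit(); // one disk force per batch instead of per message
            }
        }
        session.commit(); // flush any remainder
        conn.close();
    }
}

The single commit() per batch is what lets the queue manager buffer and coalesce the log writes; committing after every send is back to one disk force per message.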
The Of Mice and Elephants article contains additional in-depth discussion of tuning options. It is part of the developerWorks Mission:Messaging series which contains some other articles which also touch on tuning.

I recommend reading this: Configuring and tuning WebSphere MQ for performance on Windows and UNIX

Related

Azure Redis cache latency

I am working on an application with a web job and an Azure function app. The web job generates the Redis cache for the function app to consume. The cache size is around 10 MB. I am using lazy loading and following the other recommendations. I still find that the overall cache operation is slow. Depending on the size of the file I am processing, I may end up calling the Redis cache up to 100,000 times. I am wondering if I need to hold the cache data in a local variable instead of reading it from Redis every time. Has anyone experienced latency in accessing Redis? Does it make sense to create a singleton object in the C# function app and refresh it based on a timer or other logic?
Consider these points in your usage; they cover some good practices for Azure Redis Cache:
Redis works best with smaller values, so consider chopping up bigger data into multiple keys. In this Redis discussion, 100kb is considered "large". Read this article for an example problem that can be caused by large values.
Use Standard or Premium Tier for Production systems. The Basic Tier is a single node system with no data replication and no SLA. Also, use at least a C1 cache. C0 caches are really meant for simple dev/test scenarios since they have a shared CPU core, very little memory, are prone to "noisy neighbor", etc.
Remember that Redis is an in-memory data store, so be aware of scenarios where data loss can occur.
Reuse connections - Creating new connections is expensive and increases latency, so reuse connections as much as possible (see the sketch after this list). If you choose to create new connections, make sure to close the old connections before you release them (even in managed-memory languages like .NET or Java).
Locate your cache instance and your application in the same region. Connecting to a cache in a different region can significantly increase latency and reduce reliability. Connecting from outside of Azure is supported, but not recommended especially when using Redis as a cache (as opposed to a key/value store where latency may not be the primary concern).
Configure your maxmemory-reserved setting to improve system responsiveness under memory pressure conditions, especially for write-heavy workloads or if you are storing larger values (100KB or more) in Redis. I would recommend starting with 10% of the size of your cache, then increase if you have write-heavy loads. See some considerations when selecting a value.
Avoid Expensive Commands - Some Redis operations, like the "KEYS" command, are VERY expensive and should be avoided.
Configure your client library to use a "connect timeout" of at least 10 to 15 seconds, giving the system time to connect even under higher CPU conditions. If your client or server tend to be under high load, use an even larger value. If you use a large number of connections in a single application, consider adding some type of staggered reconnect logic to prevent a flood of connections hitting the server at the same time.
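To illustrate the connection-reuse and connect-timeout points above: the question mentions a C# function app, but here is what the pattern looks like in Java with the Jedis client (host, port, and timeout values are placeholders; Azure caches additionally require a password and TLS, omitted here for brevity):

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;

public class RedisClientHolder {
    // One shared pool per process: connections are created once and reused,
    // instead of paying the TCP handshake cost on every cache call.
    private static final JedisPool POOL = new JedisPool(
            new JedisPoolConfig(),
            "mycache.example.net", // placeholder host
            6379,
            15000 /* timeout in ms, per the 10-15 second guidance above */);

    public static String get(String key) {
        // Borrow a connection from the pool; close() returns it to the pool.
        try (Jedis jedis = POOL.getResource()) {
            return jedis.get(key);
        }
    }
}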

How many concurrent connections can MarkLogic server process?

Is there an upper limit on how many concurrent connections that MarkLogic can process? For example, ASP.NET is restricted to processing 10 requests concurrently, regardless of infrastructure and hardware. Is there a similar restriction in MarkLogic server? If not, are there any benchmarks that give some indication as to how many connections a typical instance can handle?
Given a large enough budget there is no practical limit on the number of concurrent connections.
The basic limit is the application server thread count, although excess requests will also pile up in the backlog queue. According to groups.xsd each application server is limited to at most 256 threads. The backlog seems to have no maximum, but most operating systems will silently limit it to something between 256-4096. So depending on whether or not you count the backlog, a single app server on a single host could have 256-4352 concurrent connections.
After that you can use multiple app servers, and add hosts to the cluster. Use a load balancer if necessary. Most operating systems will impose a limit of around 32,000 - 64,000 open sockets per host, but there is no hard limit on the number of hosts or app servers. Eventually request ids might be a problem, but those are 64-bit numbers so there is a lot of headroom.
Of course none of this guarantees that your CPU, memory, disk, and network can keep up with the demand. That is a separate problem, and highly application-specific.

Reduce RabbitMQ memory usage

I'm trying to run RabbitMQ on a small VPS (512mb RAM) along with Nginx and a few other programs. I've been able to tweak the memory usage of everything else without difficulty, but I can't seem to get RabbitMQ to use any less RAM.
I think I need to reduce the number of threads Erlang uses for RabbitMQ, but I've not been able to get it to work. I've also tried setting the vm_memory_high_watermark to a few different values below the default (of 40%), even as low as 5%.
Part of the problem might be that the VPS provider (MediaTemple) allows me to go over my allocated memory, so when using free or top, it shows that the server has around 900mb.
Any suggestions to reduce memory usage by RabbitMQ, or limit the number of threads that Erlang will create? I believe Erlang is using 30 threads, based on the -A30 flag I've seen on the process's command line.
Ideally I'd like RabbitMQ mem usage to be below 100mb.
Edit:
With vm_memory_high_watermark set to 5% (or 0.05 in the config file), the RabbitMQ logs report that RabbitMQ's memory limit is set to 51MB. I'm not sure where 51MB is coming from. The current VPS allocated memory is 924MB, so 5% of that should be around 46MB.
According to htop/free, before starting RabbitMQ I'm sitting at around 453MB of used RAM, and after starting RabbitMQ I'm at around 650MB, nearly a 200MB increase. Could it be that 200MB is the lower limit that RabbitMQ will run with?
Edit 2
Here are some screenshots of ps aux and free before and after starting RabbitMQ and a graph showing the memory spike when RabbitMQ is started.
Edit 3
I also checked with no plugins enabled, and it made very little difference. It seems the plugins I had (management and its prerequisites) only added about 8mb of ram usage.
Edit 4
I no longer have this server to test with; however, there is a config setting, delegate_count, that defaults to 16. As far as I know, this spawns 16 supervisor processes for RabbitMQ. Lowering this number on smaller servers may help reduce the memory footprint. No idea if this actually works or how it impacts performance, but it's something to try.
The appropriate way to limit memory usage in RabbitMQ is using the vm_memory_high_watermark. You said:
I've also tried setting the vm_memory_high_watermark to a few different values below the default (of 40%), even as low as 5%.
This should work, but it might not be behaving the way you expect. In the logs, you'll find a line that tells you what the absolute memory limit is, something like this:
=INFO REPORT==== 29-Oct-2009::15:43:27 ===
Memory limit set to 2048MB.
You need to tweak the memory limit as needed - Rabbit might be seeing your system as having a lot more RAM than you think it has if you're running on a VPS environment.
Sometimes, Rabbit can't tell what system it's on and uses 1GB as the base point (so you get a limit of 410MB by default). That would also explain the 51MB limit in your logs: 5% of a 1GB base is 51MB.
Also, make sure you are running on a version of RabbitMQ that supports the vm_memory_high_watermark setting - ideally you should run with the latest stable release.
Make sure to set an appropriate QoS prefetch value. By default, if there's a client, the Rabbit server will send any messages it has for that client's queue to the client. This results in extensive memory usage both on the client & the server.
Drop the prefetch limit down to something reasonable, like say 100, and Rabbit will keep the remaining messages on disk on the server until the client is really ready to process them, and your memory usage will go way way down on both the client & the server.
Note that the suggestion of 100 is just a reasonable place to start - it sure beats infinity. To really optimize that number, you'll want to take into consideration the messages/sec your client is able to process, the latency of your network, and also how large each of your messages is on average.
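For reference, setting the prefetch is one line in the current RabbitMQ Java client; a minimal consumer sketch (queue name and host are placeholders):

import com.rabbitmq.client.*;

public class PrefetchConsumer {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // placeholder broker host
        Connection conn = factory.newConnection();
        Channel channel = conn.createChannel();

        // Without a prefetch limit, the broker pushes every ready message to
        // the consumer at once; with basicQos(100), at most 100 unacked
        // messages are in flight and the rest stay on the server.
        channel.basicQos(100);

        channel.basicConsume("work-queue" /* placeholder */, false /* manual acks */,
            (consumerTag, delivery) -> {
                // ... process delivery.getBody() ...
                channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
            },
            consumerTag -> { /* consumer cancelled */ });
    }
}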

Low-latency, large-scale message queuing

I'm going through a bit of a re-think of large-scale multiplayer games in the age of Facebook applications and cloud computing.
Suppose I were to build something on top of existing open protocols, and I want to serve 1,000,000 simultaneous players, just to scope the problem.
Suppose each player has an incoming message queue (for chat and whatnot), and on average one more incoming message queue (guilds, zones, instances, auction, ...), so we have 2,000,000 queues. A player will listen to 1-10 queues at a time. Each queue will have on average maybe 1 message per second, but certain queues will have a much higher rate and a higher number of listeners (say, an "entity location" queue for a level instance). Let's assume no more than 100 milliseconds of system queuing latency, which is OK for mildly action-oriented games (but not games like Quake or Unreal Tournament).
From other systems, I know that serving 10,000 users on a single 1U or blade box is a reasonable expectation (assuming there's nothing else expensive going on, like physics simulation or whatnot).
So, with a crossbar cluster system, where clients connect to connection gateways, which in turn connect to message queue servers, we'd get 10,000 users per gateway with 100 gateway machines, and 20,000 message queues per queue server with 100 queue machines. Again, just for general scoping. The number of connections on each MQ machine would be tiny: about 100, to talk to each of the gateways. The number of connections on the gateways would be a lot higher: 10,100 for the clients plus connections to all the queue servers. (On top of this, add some connections for game world simulation servers or whatnot, but I'm trying to keep that separate for now.)
If I didn't want to build this from scratch, I'd have to use some messaging and/or queuing infrastructure that exists. The two open protocols I can find are AMQP and XMPP. The intended use of XMPP is a little more like what this game system would need, but the overhead is quite noticeable (XML, plus the verbose presence data, plus various other channels that have to be built on top). The actual data model of AMQP is closer to what I describe above, but all the users seem to be large, enterprise-type corporations, and the workloads seem to be workflow related, not real-time game update related.
Does anyone have any day-job experience with these technologies, or implementations thereof, that you can share?
@MSalters
Re 'message queue':
RabbitMQ's default operation is exactly what you describe: transient pubsub. But with TCP instead of UDP.
If you want guaranteed eventual delivery and other persistence and recovery features, then you CAN have that too - it's an option. That's the whole point of RabbitMQ and AMQP -- you can have lots of behaviours with just one message delivery system.
The model you describe is the DEFAULT behaviour, which is transient, "fire and forget", and routing messages to wherever the recipients are. People use RabbitMQ to do multicast discovery on EC2 for just that reason. You can get UDP type behaviours over unicast TCP pubsub. Neat, huh?
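To make that concrete, here is a minimal sketch of the transient pubsub pattern with the current RabbitMQ Java client (exchange name and host are placeholders): a non-durable fanout exchange, a server-named auto-delete queue per subscriber, and non-persistent publishes.

import com.rabbitmq.client.*;

public class TransientPubSub {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // placeholder
        Connection conn = factory.newConnection();
        Channel ch = conn.createChannel();

        // Fanout exchange: copies each message to every bound queue.
        ch.exchangeDeclare("chat", "fanout", false /* not durable */);

        // Server-named, exclusive, auto-delete queue: it vanishes with the
        // subscriber, so nothing accumulates for absent recipients.
        String q = ch.queueDeclare().getQueue();
        ch.queueBind(q, "chat", "");

        // Non-persistent publish: fire and forget, never written to disk.
        ch.basicPublish("chat", "", MessageProperties.TEXT_PLAIN,
                "hello".getBytes("UTF-8"));
    }
}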
Re UDP:
I am not sure if UDP would be useful here. If you turn off Nagling then RabbitMQ single message roundtrip latency (client-broker-client) has been measured at 250-300 microseconds. See here for a comparison with Windows latency (which was a bit higher) http://old.nabble.com/High%28er%29-latency-with-1.5.1--p21663105.html
I cannot think of many multiplayer games that need roundtrip latency lower than 300 microseconds. You could get below 300us with TCP. TCP windowing is more expensive than raw UDP, but if you use UDP to go faster, and add a custom loss-recovery or seqno/ack/resend manager then that may slow you down again. It all depends on your use case. If you really really really need to use UDP and lazy acks and so on, then you could strip out RabbitMQ's TCP and probably pull that off.
I hope this helps clarify why I recommended RabbitMQ for Jon's use case.
I am building such a system now, actually.
I have done a fair amount of evaluation of several MQs, including RabbitMQ, Qpid, and ZeroMQ. The latency and throughput of any of those are more than adequate for this type of application. What is not good, however, is queue creation time in the midst of half a million queues or more. Qpid in particular degrades quite severely after a few thousand queues. To circumvent that problem, you will typically have to create your own routing mechanisms (a smaller number of total queues, where the consumers on those queues receive some messages they have no interest in).
My current system will probably use ZeroMQ, but in a fairly limited way, inside the cluster. Connections from clients are handled with a custom sim daemon that I built using libev and that is entirely single-threaded (and is showing very good scaling: it should be able to handle 50,000 connections on one box without any problems; our sim tick rate is quite low though, and there is no physics).
XML (and therefore XMPP) is very much not suited to this, as you'll peg the CPU processing XML long before you become bound on I/O, which isn't what you want. We're using Google Protocol Buffers, at the moment, and those seem well suited to our particular needs. We're also using TCP for the client connections. I have had experience using both UDP and TCP for this in the past, and as pointed out by others, UDP does have some advantage, but it's slightly more difficult to work with.
Hopefully when we're a little closer to launch, I'll be able to share more details.
Jon, this sounds like an ideal use case for AMQP and RabbitMQ.
I am not sure why you say that AMQP users are all large enterprise-type corporations. More than half of our customers are in the 'web' space ranging from huge to tiny companies. Lots of games, betting systems, chat systems, twittery type systems, and cloud computing infras have been built out of RabbitMQ. There are even mobile phone applications. Workflows are just one of many use cases.
We try to keep track of what is going on here:
http://www.rabbitmq.com/how.html (make sure you click through to the lists of use cases on del.icio.us too!)
Please do take a look. We are here to help. Feel free to email us at info@rabbitmq.com or hit me on twitter (@monadic).
My experience was with a non-open alternative, BizTalk. The most painful lesson we learnt is that these complex systems are NOT fast. And as you figured from the hardware requirements, that translates directly into significant costs.
For that reason, don't even go near XML for the core interfaces. Your server cluster will be parsing 2 million messages per second. That could easily be 2-20 GB/sec of XML! However, most messages will be for a few queues, while most queues are in fact low-traffic.
Therefore, design your architecture so that it's easy to start with COTS queue servers and then move each queue (type) to a custom queue server when a bottleneck is identified.
Also, for similar reasons, don't assume that a message queue architecture is the best for all the communication needs your application has. Take your "entity location in an instance" example. This is a classic case where you don't want guaranteed message delivery. The reason you need to share this information is that it changes all the time. So, if a message is lost, you don't want to spend time recovering it: that would only get you the old location of the affected entity. Instead, you'd want to send the current location of that entity. Technology-wise this means you want UDP, not TCP, and a custom loss-recovery mechanism.
FWIW, for cases where intermediate results are not important (like positioning info) Qpid has a "last-value queue" that can deliver only the most recent value to a subscriber.
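If your broker doesn't have a last-value queue, the idea is easy to approximate in application code. A hypothetical sketch in Java: writers overwrite a per-entity slot, and a sender thread periodically ships only the newest values, so a lost or skipped update costs nothing.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LastValueBuffer {
    // One slot per entity; a newer position simply overwrites the older one,
    // so a slow consumer only ever sees the latest value.
    private final Map<Long, double[]> latest = new ConcurrentHashMap<>();

    public void update(long entityId, double x, double y) {
        latest.put(entityId, new double[] { x, y });
    }

    public void startSender() {
        // Every 50 ms, ship a snapshot of current positions (placeholder send).
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            latest.forEach((id, pos) ->
                System.out.printf("send entity %d at (%.1f, %.1f)%n", id, pos[0], pos[1]));
        }, 50, 50, TimeUnit.MILLISECONDS);
    }
}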

Are there any tools to optimize the number of consumer and producer threads on a JMS queue?

I'm working on an application that is distributed over two JBoss instances and that produces/consumes JMS messages on several JMS queues.
When we configured the application we had to determine which threading model we would use, in particular the number of producing and consuming threads per queue. We have done this in a rather ad-hoc fashion but after reading the most recent columns by Herb Sutter in Dr Dobbs (in particular this one) I would like to size our threads in a more rigorous manner.
Are there any methods/tools to measure the throughput of JMS queues (in particular JBoss Messaging queues) as a function of the number of producing/consuming threads?
This is not really about a specific tool, but may be helpful.
Consumers:
Not sure what your inner architecture is, but let's assume it's an MDB reading in messages. I assert that your only requirement here for rigorous thread count sizing is to choose a maximum cap. If your MDB uses resources from a finite supplier like a JDBC connection pool, consider the maximum cap as the highest number of concurrent instances from that resource that you can tolerate taking. If the MDB's queue is remote, you probably want to consider remote connections (or technically, JMS sessions) a finite resource. If the MDB has less finite requirements (and the queue is local), your maximum cap becomes the number of threads, memory used and/or flat out CPU consumed by the working threads. The reasoning here is that the JBoss MDB container will simply keep allocating more MDB instances (and therefore threads) until the queue is empty or the maximum cap is reached. The only reason I can think of that you would really agonize over the minimum would be if the container's elapsed time or overhead to create new instances is above your tolerance and those operations are usually pretty small potatoes.
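For example, on JBoss that cap is typically expressed on the MDB itself. A hedged sketch (the maxSession activation property is the usual knob, but property names vary by container and version, so verify against your JBoss release; the queue name is a placeholder):

import javax.ejb.ActivationConfigProperty;
import javax.ejb.MessageDriven;
import javax.jms.Message;
import javax.jms.MessageListener;

@MessageDriven(activationConfig = {
    @ActivationConfigProperty(propertyName = "destinationType",
                              propertyValue = "javax.jms.Queue"),
    @ActivationConfigProperty(propertyName = "destination",
                              propertyValue = "queue/orders"), // placeholder
    // Maximum cap: no more than 15 concurrent MDB instances/threads,
    // e.g. matched to a 15-connection JDBC pool.
    @ActivationConfigProperty(propertyName = "maxSession", propertyValue = "15")
})
public class OrderConsumer implements MessageListener {
    public void onMessage(Message msg) {
        // ... process the message ...
    }
}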
Producers
A general axiom of messaging is that producers nearly always outperform consumers. You would think this is pretty arbitrary, but it is a pattern I see recurring all the time, even in widely different messaging scenarios. Anyways, it's tough to say how the threading should work for the producer without knowing a bit about the application, but are you basically capable of [indefinitely] proportionally increasing the number of producer threads and the number of messages generated, or do you have some sort of cap where additional threads simply do not generate more messages ? I would guess it is the latter since most useful work has some limited data or calculation supplier. As I see it, the two drivers here are ordering and persistence.
First off, if you have strict message ordering where messages must be processed in strict First Produced First Processed (FPFP) order, then you're in a bit of a bind, because you almost have to drop down to single-threaded throughput unless you can devise some form of logical message demarcation (e.g. a client number where any given client's messages are always sent to the same queue, but you may have multiple queues, each serviced by one thread, so each client is effectively FPFP).
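A minimal sketch of that demarcation idea, assuming a numeric client id: hash each client to one of N queues, give each queue exactly one consumer thread, and per-client ordering is preserved while the queues run in parallel.

import javax.jms.Queue;

public class QueuePartitioner {
    private final Queue[] queues; // each serviced by exactly one consumer thread

    public QueuePartitioner(Queue[] queues) {
        this.queues = queues;
    }

    // All messages for a given client hash to the same queue, so that
    // client's messages stay in production order, while different
    // clients are processed in parallel across the queues.
    public Queue queueFor(long clientId) {
        return queues[Math.floorMod(Long.hashCode(clientId), queues.length)];
    }
}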
Ordering aside, persistence is the next consideration: if you have reliable and extensive message persistence (or a very high tolerance for message loss), just let the producer threads go to town. The messages will queue up reliably and eventually the consumers will [hopefully] catch up. However, if your persisted message counts or simple queue depths can give you the willies when they get too high, here's where a tool might come in useful. If your producer thread count can be dynamically modified (which it can in many Java ThreadPool implementations), then you could sample the queue depths and raise or lower the producer thread count in accordance with the queue depth ranges you define, optionally to the point where, if the consumers basically stall, so will the producers. I do not know of a specific tool that does this, but between two JBoss servers this is fairly simple to whip up. Picking your queue-depth-to-producer-thread-count mapping will be trickier.
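A rough sketch of that sampling loop, assuming a hypothetical queueDepth() probe (for example a JMX query against the queue's message count; the thresholds are illustrative):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ProducerThrottle {
    private final ThreadPoolExecutor producers;

    public ProducerThrottle(ThreadPoolExecutor producers) {
        this.producers = producers;
    }

    public void start() {
        ScheduledExecutorService sampler = Executors.newSingleThreadScheduledExecutor();
        sampler.scheduleAtFixedRate(() -> {
            long depth = queueDepth();
            // Thresholds are illustrative; pick yours from observed depths.
            if (depth > 50_000) {
                producers.setCorePoolSize(1);  // consumers stalled: back way off
            } else if (depth > 10_000) {
                producers.setCorePoolSize(4);
            } else {
                producers.setCorePoolSize(8);  // full speed
            }
            // Note: running producer tasks only notice the change once they
            // finish their current work and a pool thread becomes idle.
        }, 5, 5, TimeUnit.SECONDS);
    }

    // Hypothetical depth probe; wire this to your queue's depth metric.
    private long queueDepth() {
        return 0; // placeholder
    }
}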
Having said all that, I am going to actually read the article you linked to.....
I've got the perfect thing for you: IBM provide a free command line tool called perfharness.
It's aimed at benchmarking JMS providers, i.e. measuring the throughput of queues (single or multiple) given different numbers of producing or consuming threads.
Some features:
Send and consume messages at a fixed rate (msg/s) or at maximum rate possible on the queue
Use a specific number of threads
Use either JMS or native MQ
Can use data either generated randomly or taken from a file
Generates statistics telling you exactly how fast your queue is performing
The only downside is that it's not super intuitive, given the number of operations it supports. And IBM haven't open-sourced it, which is a shame. However, it sounds perfect for your purposes.
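That said, if you just want a rough number before setting up perfharness, a hand-rolled probe is not much code. A minimal sketch, assuming placeholder JNDI names for the connection factory and queue; run it with different thread counts and compare the printed rates:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.JMSException;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.naming.InitialContext;

public class ThroughputProbe {
    public static void main(String[] args) throws Exception {
        int threads = Integer.parseInt(args[0]); // vary between runs: 1, 2, 4, 8...
        InitialContext ctx = new InitialContext();
        // JNDI names are placeholders; adjust for your server's configuration.
        ConnectionFactory cf = (ConnectionFactory) ctx.lookup("ConnectionFactory");
        Queue queue = (Queue) ctx.lookup("queue/TestQueue");

        AtomicLong sent = new AtomicLong();
        long end = System.currentTimeMillis() + 30_000; // 30-second run
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                try {
                    Connection conn = cf.createConnection();
                    try {
                        Session s = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
                        MessageProducer p = s.createProducer(queue);
                        String body = new String(new char[600]).replace('\0', 'x');
                        while (System.currentTimeMillis() < end) {
                            p.send(s.createTextMessage(body));
                            sent.incrementAndGet();
                        }
                    } finally {
                        conn.close();
                    }
                } catch (JMSException e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(60, TimeUnit.SECONDS);
        System.out.printf("%d threads -> %d msg/s%n", threads, sent.get() / 30);
    }
}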
