Diagnosing pathological behavior of a piece of cluster software - windows

I'm using a kind of load balancer over a small cluster that can achieve >2000 rps on zero-duration requests (i.e. ones that are immediately satisfied by the worker nodes).
But as soon as the requests stop being zero-duration and start taking even 1 ms, performance immediately drops by more than 10x. The data being transferred in both directions is identical and is about 2 KB in size.
This is for sure not related to saturation of the cluster or network throughput, because 200rps of 1ms requests is a very tiny load and the network is 10Gbit. Besides, the CPU load is just some 2-5% both on the load balancer and on the worker nodes.
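As a rough sanity check on the numbers above, here is a small Python sketch. The function names are made up for illustration, and the "one request at a time" model is only an assumption used to size the unexplained delay, not a known property of the broker.

```python
def in_flight(rate_rps: float, service_time_s: float) -> float:
    """Little's law: average number of requests being serviced at once."""
    return rate_rps * service_time_s


def per_request_budget_ms(rate_rps: float) -> float:
    """Wall-clock time per request if requests were handled strictly one at a time."""
    return 1000.0 / rate_rps


# Observed rates from the description above.
fast_ms = per_request_budget_ms(2000)   # zero-duration requests
slow_ms = per_request_budget_ms(200)    # 1 ms requests

# Delay per request that neither the fixed overhead nor the 1 ms of work explains
# (only meaningful under the assumed serial-dispatch model).
unexplained_ms = slow_ms - fast_ms - 1.0

# Average concurrency needed to sustain 200 rps of 1 ms requests.
busy_workers = in_flight(200, 0.001)

print(fast_ms, slow_ms, unexplained_ms, busy_workers)
```

Under that assumption each 1 ms request costs about 5 ms end to end, so roughly 3.5 ms per request goes somewhere other than the work itself, while on average well under one worker is busy.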
I wonder whether this might be related to some pathological behavior of the OS scheduler or the OS network stack (i.e. some special-case behavior for very short interactions).
How might I diagnose the reason? Which perfcounters to watch? What tools or methodologies to use?
(Just in case someone simply knows the answer to my particular problem, I'm talking about the MS HPC Server 2008 R2's "WCF Broker", running on Windows Server 2008 R2 over Hyper-V)

One thing you can do is use ETW tracing to try and understand what the nodes are doing while your WCF job is running. On HPC server, I sometimes clusrun xperf to collect traces on all or specific nodes. There are a number of tools that you can use for analyzing ETW traces, including xperf itself. I haven't done any serious work using HPC SOA (WCF), but I did write a simple WCF raytracer app and then used xperf to profile it on several of the nodes.

It turned out to be a completely network-unrelated issue, having to do with peculiarities of the scheduling mechanism of HPC Server. I resolved it by setting the configuration option "serviceRequestPrefetchCount" to 0 in the loadBalancing section of the WCF service config file.

I'm assuming that there are some shared resources with some kind of locking system in place? Is locking a bottleneck? It's hard to guess without seeing the system.
Do you have a way to profile the workers? What are they spending most of their time on, especially in the fast vs slow scenarios?


NServiceBus: is this an illusion or a best practice?

Given an NServiceBus microservice that uses MSMQ: when I deploy a few instances of that service on the same machine, am I scaling out my application? Am I improving performance, or is one instance enough? Should I instead get a more powerful machine to handle the messages?
No, running multiple instances on a single machine will not make things run faster; it only makes execution less efficient.
However, it might be that a single instance isn't giving you the expected performance even though your system monitoring shows plenty of unused resources. In that case you might want to tweak the configuration of your NServiceBus endpoint by raising the number of messages it is allowed to process in parallel.
On the following link you see how you can increase the concurrency:
You can scale out further by actually using multiple machines, but if all these endpoints share the same central database, your network or database server can easily become the bottleneck. If you consider deploying or scaling out your endpoints across multiple machines, make sure that any storage solutions are also scaled out so that they do not become the bottleneck.
Zero downtime upgrades/deployments
The only reason to have multiple instances on the same box is a case such as deploying a new version: you can temporarily run the current and the new version side by side to achieve zero-downtime deployments.

How to profile a set of processes in FreeBSD?

I am trying to debug the performance of a service. The service internally spawns instances of the same binary. To improve throughput, I am planning to increase the number of instances of the binary. Beyond a certain number of processes, however, throughput stops increasing. Now I am trying to work out why this is happening.
I need some help on where to start and on which tools are available for process-level profiling. I am using the FreeBSD platform.
If using more processes doesn't improve output, then your service isn't CPU bound. It might be constrained by e.g. disk or network throughput instead.
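To pin down where the scaling stops, it may help to record throughput at several process counts and look for the point where adding processes no longer pays off. A minimal Python sketch follows; the function name, the 5% threshold, and the sample numbers are all invented for illustration and are not measurements from your service.

```python
from typing import Optional, Sequence, Tuple


def find_plateau(samples: Sequence[Tuple[int, float]],
                 min_gain: float = 0.05) -> Optional[int]:
    """Return the process count after which throughput stops improving.

    samples:  (process_count, measured_throughput) pairs.
    min_gain: relative gain between consecutive samples below which
              the step is treated as "no real improvement".
    """
    ordered = sorted(samples)
    for (n_prev, t_prev), (_n_next, t_next) in zip(ordered, ordered[1:]):
        if t_prev <= 0:
            continue  # cannot compute a relative gain from a zero baseline
        gain = (t_next - t_prev) / t_prev
        if gain < min_gain:
            return n_prev
    return None  # throughput was still growing at the last sample


# Made-up measurements, purely for illustration.
measurements = [(1, 100.0), (2, 195.0), (4, 370.0), (8, 380.0), (16, 378.0)]
plateau = find_plateau(measurements)
print(plateau)
```

Watching the system at the plateau point and just past it is where the comparison is most likely to show which resource has run out.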
Start with systat. Especially systat -vmstat. See man systat.
This will show you several aspects (like memory usage, interrupts, processor usage and disk activity) of how busy your system is.
If your program does a lot of network activity, using systat -tcp might give insight as well.
If your service is an HTTP server, you might want to look at varnish.

Node.js CPU load balancing

I created a JMeter test to measure the performance of the Ghost blogging platform. Ghost is written in Node.js and was installed on a cloud server with 1 GB RAM and 1 CPU.
I noticed that JMeter starts getting errors above 400 concurrent users. Up to 400 concurrent users the load is handled normally. I decided to increase the CPU count and added 1 CPU.
The errors persisted, so I added 2 more CPUs, for a total of 4. The problem still occurs above 400 concurrent users.
I don't understand why 1 CPU can handle 400 users and 4 CPUs give the same result.
While monitoring, I noticed that only one CPU is busy and the other 3 CPUs are idle. The JMeter summary in the console showed errors on about 5% of requests. See screenshot.
I would like to know whether it is possible to balance the load between CPUs.
Are you using the cluster module to load-balance, on Node 0.10.x?
If so, please update Node.js to 0.11.x.
Node 0.10.x used the balancing algorithm provided by the operating system. In 0.11.x the algorithm was changed, so the load is distributed more evenly.
Node.js is famously single-threaded (see this answer): a single node process will only use one core (see this answer for a more in-depth look), which is why you see that your program fully uses one core, and that all other cores are idle.
The usual solution is to use the cluster core module of Node, which helps you launch a cluster of Node processes to handle the load, by allowing you to create child processes that all share the same server ports.
However, you can't really use this without fixing Ghost's code. An option is to use pm2, which can wrap a node program, by using the cluster module for you. For instance, with four cores:
$ pm2 start app.js -i 4
In theory this should work, except if Ghost relies on some global variables (that can't be shared by every process).
Use the cluster core module, and nginx for load balancing. That's the bad part about Node.js: it is a fantastic framework, but the developer has to get into the load-balancing mess, while Java and other runtimes make it seamless. Anyway, nothing is perfect.

MarkLogic latency: Document not found

I am working on a clustered MarkLogic environment where we have 10 nodes. All nodes are shared E&D nodes.
Problem that we are facing:
When a page is written to MarkLogic, it takes some time (up to 3 seconds) for all the nodes in the cluster to be updated. If I do a read during this time to fetch the page that was just written, it is not found.
Has anyone experienced this latency issue and found a way to eliminate it? If so, please let me know.
It's normal for a new document to only appear after the database transaction commits. But it is not normal for a commit to take 3-sec.
Which version of MarkLogic Server?
Which OS and version?
Can you describe the hardware configuration?
How large are these documents? All other things equal, update time should be proportional to document size.
Can you reproduce this with a standalone host? That should eliminate cluster-related network latency from the transaction, which might tell you something. Possibly your cluster network has problems, or possibly one or more of the hosts has problems.
If you can reproduce the problem with a standalone host, use system monitoring to see what that host is doing at the time. On linux I favor something like iostat -Mxz 5 and top, but other tools can also help. The problem could be disk I/O - though it would have to be really slow to result in 3-sec commits. Or it might be that your servers are low on RAM, so they are paging during the commit phase.
If you can't reproduce it with a standalone host, then I think you'll have to run similar system monitoring on all the hosts in the cluster. That's harder, but for 10 hosts it is just barely manageable.
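To compare the standalone and clustered cases with numbers rather than impressions, you could measure the write-to-visible delay directly. Below is a minimal Python sketch: `write_doc` and `read_doc` are hypothetical callables that you would implement with whatever client you already use, and `FakeStore` is only a stand-in so the sketch runs on its own. None of these names come from MarkLogic.

```python
import time
from typing import Callable, Optional


def time_until_visible(write_doc: Callable[[], None],
                       read_doc: Callable[[], bool],
                       timeout_s: float = 5.0,
                       poll_interval_s: float = 0.05) -> Optional[float]:
    """Write once, then poll until a read succeeds.

    Returns the elapsed seconds from just before the write until the first
    successful read, or None if the document never showed up within timeout_s.
    """
    start = time.monotonic()
    write_doc()
    while True:
        if read_doc():
            return time.monotonic() - start
        if time.monotonic() - start >= timeout_s:
            return None
        time.sleep(poll_interval_s)


class FakeStore:
    """Stand-in for a real client: a write becomes readable after delay_s."""

    def __init__(self, delay_s: float) -> None:
        self.delay_s = delay_s
        self._visible_at: Optional[float] = None

    def write(self) -> None:
        self._visible_at = time.monotonic() + self.delay_s

    def read(self) -> bool:
        return self._visible_at is not None and time.monotonic() >= self._visible_at


store = FakeStore(delay_s=0.1)
elapsed = time_until_visible(store.write, store.read, poll_interval_s=0.01)
print(f"visible after {elapsed:.3f} s")
```

Running the same measurement against a standalone host and against each host in the cluster, alongside the system monitoring, should show whether the delay follows particular hosts or the cluster as a whole.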

IIS7 Performance Issues for Web-services

We are experiencing slow processing of requests under heavy load. When looking at the currently running requests during these bursts I can see many requests to our web-service code.
The number of requests is not that large but they appear to be stuck in a preprocessing state. Below is an example:
We are running an IIS7 app pool in classic mode due to the need to support some legacy code.
Other requests continue to be processed but these stuck requests gradually seem to fill up the available threads leading to slow processing of other pages.
Does anyone have any idea where these requests are getting stuck?
There appears to be no resource issue with the DB, and the request state shown suggests this is all preprocessing.
We have run load tests on the code involved on local machines and cannot replicate the issue.
Another possible factor is we are making use of MVC and UrlRouting.
Many thanks for any help.
Unfortunately, some issues only happen on production servers, as a load test can never fully simulate real-world users.
You can try to capture hang dumps when performance is bad, and then analyze them (on your own or open a support case via http://support.microsoft.com to work with Microsoft support).
Often the cause is the well-known thread pool bottleneck, http://support.microsoft.com/kb/821268. Dump analysis can identify the culprit and help locate a solution.
Why not move them into their own AppPool to separate them from the Classic ASP app - you'll then have more options to tune.
