Low throughput of spray application - performance

I am working on a spray application that is expected to run as part of a real-time bidder, so both latency and throughput are very important (~80 ms maximum latency, N*10000 rps required). But right now I am seeing very low throughput (less than 1000 rps) on an AWS c4.2xlarge machine (8 CPU cores, 16GB memory, ulimit 500000). As far as I can tell from JVM and system monitoring, it is far from being limited by CPU, memory, disk IO or networking.
The system contains mainly two pools of actors on the request critical path, and I don't see throughput increase when I enlarge the pools (when I do, I mostly see increased latency). So I think the problem may be context switching, and I'd be glad to get advice on actor dispatcher configuration.
Some more details about the application architecture.
The system contains two sets of actors on the request critical path: a pool of ServerActor (extends spray HttpService) and a pool of BidderAgentActor. ServerActor just sends a message with the ask pattern (the route is handled with its future) to BidderAgentActor. BidderAgentActor handles the request mostly synchronously. It uses Futures to do work concurrently where possible, but waits for the result of the final future. No external calls are made in this process (no DB queries, no HTTP calls...). There are several more actors in the application, and BidderAgentActor sends them messages for offline processing.
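Roughly, the critical path looks like the following sketch (hypothetical message types, route and dispatcher name, not the actual code):

import akka.actor.{Actor, ActorRef, Props}
import akka.pattern.ask
import akka.routing.RoundRobinPool
import akka.util.Timeout
import scala.concurrent.duration._
import spray.routing.HttpService

// Hypothetical messages, only to make the sketch self-contained.
case class BidRequest(payload: String)
case class BidResponse(body: String)

class BidderAgentActor extends Actor {
  def receive = {
    case BidRequest(payload) =>
      // CPU-bound bidding logic runs here; waiting on internal futures
      // blocks a dispatcher thread for the whole request.
      sender() ! BidResponse("bid for " + payload)
  }
}

class ServerActor(agents: ActorRef) extends Actor with HttpService {
  def actorRefFactory = context
  implicit val timeout = Timeout(80.millis) // matches the latency budget
  import context.dispatcher                 // ExecutionContext for mapping the future

  def receive = runRoute {
    path("bid") {
      post {
        extract(_.request.entity.asString) { payload =>
          complete((agents ? BidRequest(payload)).mapTo[BidResponse].map(_.body))
        }
      }
    }
  }
}

// Wiring (illustrative): a round-robin pool of agents on a dedicated dispatcher;
// "bidder-dispatcher" would have to be defined in application.conf.
// val agents = system.actorOf(
//   RoundRobinPool(8).props(Props[BidderAgentActor].withDispatcher("bidder-dispatcher")),
//   "bidder-agents")

The dedicated dispatcher name is only there to make the dispatcher-configuration question concrete: the idea would be to keep the CPU-heavy agent pool off the default dispatcher that spray's IO and routing actors run on.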
Updated
Here is a link to slightly simplified code, both for the BidderAgentActor pool and for a BidderAgentActor created per request: gist
Versions:
Scala - 2.11.8
Akka - 2.3.12
spray - 1.3.1
OS: Ubuntu 14.04

Related

Play2 performance surprises

Some introduction first.
Our Play2 (ver. 2.5.10) web service provides responses in JSON, and the response size can be relatively large: up to 350KB.
We've been running our web service in standalone mode for a while. We had gzip compression enabled at the Play2 level, which reduces the response body ~10 times, i.e. we had up to ~35KB response bodies.
In that mode the service can handle up to 200 queries per second while running on AWS EC2 m4.xlarge (4 vCPU, 16GB RAM, 750 Mbit network) inside a Docker container. The performance is completely CPU-bound: most of the time (75%) is spent on JSON serialization, and the rest (25%) on gzip compression. The business logic is very fast and is not even visible on the perf graphs.
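"Gzip compression enabled at the Play2 level" refers, presumably, to the standard play.filters.gzip.GzipFilter; a minimal sketch of that wiring in Play 2.5 (assuming the filters dependency is on the classpath, and not our actual code) looks roughly like this:

import javax.inject.Inject
import play.api.http.HttpFilters
import play.filters.gzip.GzipFilter

// Picked up automatically when the class is named Filters in the root package,
// or registered explicitly via play.http.filters in application.conf.
class Filters @Inject() (gzipFilter: GzipFilter) extends HttpFilters {
  def filters = Seq(gzipFilter)
}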
Now we have introduced a separate front-end node (running Nginx) to handle some specific functions: authentication, authorization and, crucially to this question, traffic compression. We hoped to offload compression from the Play2 back-end to the front-end and spend those 25% of CPU cycles on main tasks.
However, instead of improving, performance got much worse! Now our web service can only handle up to 80 QPS. At that point, most of the CPU is already consumed by something inside the JVM. Our metrics show it's not garbage collection, and it's also not in our code, but rather something inside Play.
It's important to note that at this load (80 QPS at ~350KB per response) we generate ~30 MB/s of traffic. This number, while significant, doesn't saturate the EC2 networking, so that shouldn't be the bottleneck.
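The bandwidth arithmetic behind that claim, spelled out with the same numbers (illustrative only):

object TrafficEstimate extends App {
  val qps           = 80
  val responseKb    = 350
  val mbPerSecond   = qps * responseKb / 1024.0  // ≈ 27.3 MB/s
  val mbitPerSecond = mbPerSecond * 8            // ≈ 219 Mbit/s, vs ~750 Mbit/s available
  println(f"$mbPerSecond%.1f MB/s ≈ $mbitPerSecond%.0f Mbit/s")
}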
So my question, I guess, is as follows: does anyone have an explanation and a mitigation plan for this problem? Some hints about how to get to the root cause of this would also be helpful.

Uncaught Exception java.lang.OutOfMemoryError: "unable to create new native thread" error while running JMeter in non-GUI mode

My scenario:
Step 1: I have set my thread group to 1000 threads and 500 seconds.
Step 2: Configured heap space: HEAP=-Xms1024m -Xmx1024m
Step 3: Ran JMeter in non-GUI mode.
In this scenario, the error "Uncaught Exception java.lang.OutOfMemoryError: unable to create new native thread" occurs on my system.
My system configuration:
Processor: Intel® Pentium(R) CPU G2010 @ 2.80GHz × 2
OS type: 32-bit
Disk: 252.6 GB
Memory: 3.4 GiB
Kindly give me a solution for this scenario.
Thanks,
Vairamuthu.
You don't have enough memory in your machine to run 1000 threads. It is clearly visible from the error that your machine cannot create 1000 native threads. You will have to tune your setup to resolve this situation.
You have to consider these points:
JMeter is a Java tool and runs on the JVM. To get the most out of it, we need to give JMeter as many resources as we can during execution. First, we need to increase the heap size (in the JMeter bin directory you will find jmeter.bat/jmeter.sh):
HEAP=-Xms512m -Xmx512m
This means the default allocated heap size is a minimum of 512MB and a maximum of 512MB. Configure it according to your own PC configuration. Keep in mind that the OS also needs some memory, so don't allocate all of your physical RAM.
Then, set the young generation ("new") size:
NEW=-XX:NewSize=128m -XX:MaxNewSize=512m
These set the lower and upper bounds of the JVM's young generation. Be careful: if your load generation is very high right at the beginning, you may need to increase these values. Keep in mind that if the range is too broad, the heap inside the JVM can become fragmented, and then the garbage collector has to work harder to clean up.
JMeter is a Java GUI application, and the GUI is quite resource-intensive (CPU/RAM). It also has a non-GUI mode; if we run JMeter in non-GUI mode, it consumes fewer resources and we can run more threads.
Disable all Listeners during the test run. They are only for debugging; use them while designing your script.
Listeners should be disabled during load tests. Enabling them causes additional overhead, which consumes valuable resources (more memory) that are needed by more important elements of your test.
Always try to use up-to-date software: keep your Java and JMeter versions current.
Don't forget that storing request and response headers, assertion results and response data can consume a lot of memory, so try not to store these values in JMeter unless it's absolutely necessary.
Also, monitor whether your machine's memory consumption and CPU usage stay below 80%. If they exceed 80%, consider those test results unreliable.
After all of this, if you still can't generate 1000 threads from your machine, then you should try distributed load testing.
Here is a document for JMeter Distributed Testing Step-by-step.
For a better and more elaborate understanding, these two blog posts, How many users JMeter can support? and 9 Easy Solutions for a JMeter Load Test "Out of Memory" Failure, should help.
I have also found this article very helpful for understanding these errors and how to handle them.
The error is due to a lack of free RAM.
Looking at your hardware, it doesn't seem you will be able to produce a load of 1k users, so I would recommend reconsidering your approach.
For example, you anticipate 1000 simultaneous users working with your application. However, that doesn't necessarily mean 1000 concurrent requests, as:
real users don't hammer the application non-stop; they need some time to "think" between operations. This "think time" differs depending on the nature of the application, but you should keep it as close to reality as possible
the application's response time should be added on top of the think time
So, given you have 1000 users, each of them "thinks" for 10 seconds between operations, and the application response time is 2 seconds, each user will be able to send 5 requests per minute (60 / (10 + 2)).
In the above scenario, 1000 users will send 5000 requests per minute, which gives us ~83 requests per second, and that seems achievable with your current hardware.
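The same arithmetic, spelled out (illustrative numbers from the scenario above):

object ThroughputEstimate extends App {
  val users           = 1000
  val thinkTimeSec    = 10.0 // pause between operations
  val responseTimeSec = 2.0  // application response time

  val requestsPerUserPerMinute = 60.0 / (thinkTimeSec + responseTimeSec) // = 5
  val requestsPerMinute        = users * requestsPerUserPerMinute        // = 5000
  val requestsPerSecond        = requestsPerMinute / 60.0                // ≈ 83

  println(f"$requestsPerSecond%.0f requests/second")
}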
So if you are not in a position to get more powerful hardware or more similar machines to run JMeter in distributed mode, the options are:
Add "think times" between operations using i.e. Constant Timer or Uniform Random Timer
Change your test scenario logic to simulate "requests per second" rather than "concurrent users". You can do it using Constant Throughput Timer or Throughput Shaping Timer.
Your issue is due to using a 32-bit OS: in this mode you are limited both in how much heap you can allocate (depending on the OS, you will not be able to exceed roughly 1.6 to 2.1 GB) and in native thread creation.
I'd suggest switching to a 64-bit OS and a 64-bit JDK.
But if you don't have any other option, try setting this in JVM_ARGS in jmeter.sh:
-Xss128k
Or, if that is too low:
-Xss256k
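As a rough illustration of why the stack size matters (all numbers below are assumptions for a typical 32-bit JVM, not measurements of this particular box): every native thread gets its own stack outside the Java heap, and the whole 32-bit process only has a couple of GB of address space to split between heap, JVM internals and those stacks.

object ThreadStackEstimate extends App {
  // Illustrative assumptions, not measured values.
  val addressSpaceMb = 3 * 1024 // ~3 GB usable for a 32-bit process on Linux
  val heapMb         = 1024     // -Xmx1024m from the scenario above
  val jvmOverheadMb  = 512      // code cache, permgen, native libraries, C heap, ...

  val leftForStacksMb = addressSpaceMb - heapMb - jvmOverheadMb

  // Upper bound on thread count for a given -Xss; reality is lower because
  // other native allocations compete for the same address space.
  def maxThreads(stackKb: Int): Int = leftForStacksMb * 1024 / stackKb

  println(s"-Xss512k: at most ~${maxThreads(512)} threads")
  println(s"-Xss256k: at most ~${maxThreads(256)} threads")
  println(s"-Xss128k: at most ~${maxThreads(128)} threads")
}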

Unable to load an IIS-based web service beyond 30% CPU

I am load testing an IIS-based web service.
I need to find out the maximum throughput it can support.
Both the server and the load generators are set up in AWS.
The problem is that the throughput of the web service does not go beyond 1500 req/sec even when increasing the users from 500 to 3000; only the response time increases (PS: I am using 15GB RAM, 8-core AWS machines for load generation).
The weirder part is that CPU usage is not 100%; it is merely 30-40%.
Even memory utilization is not high; it is around 20%.
I tried many counters in PerfMon and did not see anything that could point to a possible bottleneck.
When I use a single machine to generate load it shows ~1500 req/sec of throughput; if I add one more load generator, the throughput on the original machine visibly drops to about half, still giving me a combined total of ~1500 requests/sec.
What am I missing here?
Thanks for your help in advance
Check the IIS configuration and the thread pool settings in IIS. This is a quite well-known issue: if few threads are available, throughput won't grow even when CPU or memory is available, because requests are queued up waiting. Also check the processor queue length counter in PerfMon; it could be some IO issue if the queue stays long throughout the test.

Windows network IOCP scalability over multiple cores

The behavior is the following: one server worker with, for example, 200 sockets handles 100K echoes per second. Starting another server worker on the same port (with the same number of sockets, or half as many for each worker, it does not matter) immediately decreases the first worker's performance to about 50% and only slightly improves the overall per-machine performance (each worker serves around 50K echoes per second).
So the performance of a 6-core machine is approximately the same as that of a 1-core machine.
I've tried different approaches, e.g. one independent IOCP for each worker (specifying NumberOfConcurrentThreads as 1 in CreateIoCompletionPort), or one IOCP shared by all workers (NumberOfConcurrentThreads equal to the number of workers); the performance is the same. My workers share zero data, so there are no locks, etc.
I hope I'm missing something and it's not a Windows kernel network scalability problem.
I'm using Windows 7 Enterprise x64.
Of course, the expectation was approximately linear performance scaling.
Does anybody know about the practical scalability of IOCP over multiple cores on one machine?
What should I expect as the number of active sockets increases?
Thank you!
The usual approach for non-NUMA systems is to have a single IOCP for all connections and a set of threads (usually tunable in size) that service the IOCP.
You can then tune the number of threads based on the number of CPUs and whether any of the work done by the threads is blocking in nature.
Performance should scale well unless you have some shared resource which all connections must access at which point contention for the shared resource will affect your scalability.
I have some free IOCP code available here and a simple multiple client test which allows you to run thousands of concurrent connections here.
For NUMA systems things can be slightly more complex as, ideally, you want to have a single IOCP, thread pool and buffer allocator per NUMA node to keep memory accesses to the local node.

Windows, multiple processes vs multiple threads

We have to make our system highly scalable, and it has been developed for the Windows platform using VC++. Say that, initially, we would like to process 100 requests (from MSMQ) simultaneously. What would be the best approach: a single process with 100 threads, or 2 processes with 50 threads each? What is the gain, apart from process memory, in the second approach? In Windows, is CPU time first allocated to the process and then split between the threads of that process, or does the OS count the number of threads for each process and allocate CPU on the basis of threads rather than processes? We notice that in the first case CPU utilization is 15-25%, and we want to consume more CPU. Keep in mind that we are after optimal performance, so 100 requests is just an example. We have also noticed that if we increase the number of threads in the process above 120, performance degrades due to context switches.
One more point: our product already supports clustering, but we want to utilize more CPU on a single node.
Any suggestions will be highly appreciated.
You can't process more requests at a time than you have CPU cores. "Fast" scalable solutions involve setting up thread pools where the number of active (not blocked on IO) threads equals the number of CPU cores, so creating 100 threads because you want to service 100 MSMQ requests is not good design.
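The question is about VC++, but the "active threads == CPU cores" idea is language-independent; here is a minimal sketch of it (JVM/Scala, purely illustrative) before getting to the Windows-specific mechanism below:

import java.util.concurrent.{Executors, TimeUnit}

object CoreSizedPool extends App {
  // One worker per CPU core: for CPU-bound work, more threads than cores
  // only adds context switching; blocking work would need a separate pool.
  val cores = Runtime.getRuntime.availableProcessors()
  val pool  = Executors.newFixedThreadPool(cores)

  // Queue 100 "requests"; at most `cores` of them are processed at a time.
  (1 to 100).foreach { i =>
    pool.submit(new Runnable {
      def run(): Unit = {
        var x = 0L; var n = 0
        while (n < 10000000) { x += n; n += 1 } // simulate CPU-bound handling
        println(s"request $i handled by ${Thread.currentThread.getName}")
      }
    })
  }

  pool.shutdown()
  pool.awaitTermination(1, TimeUnit.MINUTES)
}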
Windows has a thread pooling mechanism called IO Completion Ports.
Using IO Completion Ports does push the design towards a single process, as in a multi-process design each process would have its own IO Completion Port thread pool that it manages independently, and hence you could get a lot more threads contending for the CPU cores.
The "core" idea of an IO Completion Port is that it is a kernel-mode queue - you can manually post events to the queue, or get asynchronous IO completions posted to it automatically by associating file (file, socket, pipe) handles with the port.
On the other side, the IO Completion Port mechanism automatically dequeues events onto waiting worker threads - but it does NOT dequeue jobs if it detects that the number of currently "active" threads in the thread pool is >= the number of CPU cores.
Using IO Completion Ports can potentially increase the scalability of a service a lot; usually, however, the gain is a lot smaller than expected, as other factors quickly come into play when all the CPU cores are contending for the service's other resources.
If your services are developed in C++, you might find that serialized access to the heap is a big performance minus - although Windows version 6.1 seems to have implemented a low-contention heap, so this might be less of an issue.
To summarize: theoretically your biggest performance gains would come from a design using thread pools managed in a single process. But you are heavily dependent on the libraries you are using not serializing access to critical resources, which can quickly lose you all the theoretical performance gains.
If you do have library code serializing your nicely thread-pooled service (as in the case of C++ object creation and destruction being serialized because of heap contention), then you need to change your use of the library, switch to a low-contention version of the library, or just scale out to multiple processes.
The only way to know is to write test cases that stress the server in various ways and measure the results.
The standard approach on Windows is multiple threads. Not saying that is always your best solution, but there is a price to be paid for each thread or process, and on Windows a process is more expensive. As for the scheduler, I'm not sure, but you can set the priority of the process and of its threads. The real benefit of threads is their shared address space and the ability to communicate without IPC; however, synchronization must be carefully maintained.
If your system is already developed, which it appears to be, it is likely to be easier to implement a multi-process solution, especially if there is a chance that later more than one machine may be utilized, since IPC between two processes on one machine can generally scale to multiple machines. Most attempts at massive parallelization fail because the entire system is not evaluated for bottlenecks; for example, if you implement 100 threads that all write to the same database, you may gain little in actual performance and just end up waiting on your database.
Just my .02
