Socket.io: How to reduce emit delay with many concurrent connections?

I'm running a 4-core Amazon EC2 instance (m3.xlarge) with 200,000 concurrent connections and no resource problems (each core at 10-20%, memory at 2/14 GB). However, if I emit a message to all connected users, the first user connected on a CPU core gets it within milliseconds, but the last connected user gets it with a delay of 1-3 seconds, and each CPU core goes up to 100% for 1-2 seconds. I noticed this problem even at "only" 50k concurrent users (12.5k per core).
How to reduce the delay?
I tried switching from the redis adapter to the mongo adapter, with no difference.
I'm using this code to get sticky sessions across multiple CPU cores:
https://github.com/elad/node-cluster-socket.io
The test was very simple: the clients just connect and do nothing more. The server only listens for a message and emits it to all connected clients.
EDIT: I tested a single core without any cluster/adapter logic with 50k clients, with the same result.
I published the server, single-core-server, benchmark and html-client in one package: https://github.com/MickL/socket-io-benchmark-kit

OK, let's break this down a bit. 200,000 users on four cores. If perfectly distributed, that's 50,000 users per core. So, if sending a message to a given user takes 0.1 ms of CPU time each, it would take 50,000 * 0.1 ms = 5 seconds for one core to send to all of them.
If you see CPU utilization go to 100% during this, then the bottleneck probably is CPU, and maybe you need more cores on the problem. But there may be other bottlenecks too, such as network bandwidth, network adapters, or the redis process. So one thing to determine immediately is whether your end-to-end time scales inversely with the number of clusters/CPUs you have. If you drop to 2 cores, does the end-to-end time double? If you go to 8, does it drop in half? If yes for both, that's good news, because it means you are probably only running into a CPU bottleneck at the moment, not other bottlenecks. If that's the case, then you need to figure out how to make 200,000 emits across multiple clusters more efficient by examining the node-cluster-socket.io code and finding ways to optimize your specific situation.
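To make that proportionality check concrete, here is a tiny back-of-the-envelope model (plain Java, purely illustrative; the 0.1 ms per emit is the assumed figure from above, not a measured one):

```java
// Rough model of per-core fan-out time: clientsPerCore * assumed per-emit CPU cost.
public class FanOutEstimate {
    static double fanOutSeconds(int totalClients, int cores, double perEmitMillis) {
        return (totalClients / (double) cores) * perEmitMillis / 1000.0;
    }

    public static void main(String[] args) {
        for (int cores : new int[] {2, 4, 8}) {
            // 200,000 clients, assumed 0.1 ms of CPU per emit:
            // 2 cores -> 10.0 s, 4 cores -> 5.0 s, 8 cores -> 2.5 s
            System.out.printf("%d cores -> %.1f s%n", cores, fanOutSeconds(200_000, cores, 0.1));
        }
    }
}
```

If measured end-to-end times follow this shape when you change the core count, CPU is the dominant bottleneck; if they don't, something else (network, redis, the adapter) is in the way.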
In the best case, every CPU would do all its housekeeping to gather exactly what it needs to send to its 50,000 users and then run a tight loop, sending 50,000 network packets one right after the other. I can't really tell from the redis adapter code whether that is what happens.
A much worse case would be one where some process gets all 200,000 socket IDs and then loops over them, and inside that loop has to look up in redis which server holds each connection and then send a message to that server telling it to send to that one socket. That would be far less efficient than instructing each server to simply send a message to all of its own connected users.
It would be worth trying to figure out (by studying the code) where on this spectrum the socket.io + redis combination falls.
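To make the two ends of that spectrum concrete, here is a minimal sketch (written in Java purely for illustration; LocalSocket, sendToNode and the registry map are hypothetical stand-ins, not socket.io or redis-adapter APIs):

```java
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the two fan-out strategies described above.
// None of these types correspond to real socket.io classes.
class BroadcastSketch {

    interface LocalSocket { void write(byte[] payload); }

    // Best case: each worker/CPU only touches the sockets it owns and
    // writes to them in one tight loop -- no per-socket lookups.
    static void broadcastLocally(List<LocalSocket> mySockets, byte[] payload) {
        for (LocalSocket s : mySockets) {
            s.write(payload);            // 50k quick writes, back to back
        }
    }

    // Worst case: one process walks ALL socket IDs and, for every single one,
    // asks a central registry (think redis) which node owns it, then sends a
    // directed message to that node for that one socket.
    static void broadcastViaCentralLookup(List<String> allSocketIds,
                                          Map<String, String> socketToNode, // stand-in for a redis lookup
                                          byte[] payload) {
        for (String socketId : allSocketIds) {
            String node = socketToNode.get(socketId); // per-socket lookup
            sendToNode(node, socketId, payload);      // per-socket inter-node message
        }
    }

    static void sendToNode(String node, String socketId, byte[] payload) {
        // placeholder: one inter-process message per socket -- 200,000 of them
    }
}
```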
Oh, and if you're using an SSL connection for each socket, you are also devoting some CPU to crypto on every send operation. There are ways to offload the SSL processing from your regular CPU (using additional hardware).

Related

How to load test 10k requests per second using JMeter?

I need to load test my website with 10k req/sec for 1 hour using JMeter. I am confused about the values of loop count, number of threads, ramp-up period and duration.
Also, will my laptop (i5, 8 GB) be able to do that? If not, what is the alternative?
PS: I checked every question/answer on Stack Overflow for this but I couldn't find any help. Please don't mark this as a duplicate question.
You can use the "Constant Throughput Timer", define the target throughput, and select "all active threads" as the basis for the throughput calculation.
Define a maximum user count in your script that is high enough to reach 10K req/sec.
Also, if you are using a Windows machine, I think you will run into this issue: "https://www.baselogic.com/2011/11/23/solved-java-net-bindexception-address-use-connect-issue-windows/"
I would recommend using distributed testing, i.e. more than one machine.
The easiest way of configuring JMeter to send X requests per second is to use either the Precise Throughput Timer or the Throughput Shaping Timer in combination with the Concurrency Thread Group. The number of threads needs to be sufficient; the exact number mainly depends on your application's response time: if the response time is 1 second you will need 10k threads, if it's 500 ms you will need 5k threads, if it's 2 seconds you will need 20k threads, and so on.
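That rule of thumb (required threads ≈ target throughput × response time) is just Little's Law; a minimal sketch of the arithmetic using the figures above:

```java
public class ThreadCountEstimate {
    // Little's Law: concurrent threads needed ≈ target throughput (req/s) * response time (s)
    static long requiredThreads(double targetRps, double responseTimeSeconds) {
        return (long) Math.ceil(targetRps * responseTimeSeconds);
    }

    public static void main(String[] args) {
        System.out.println(requiredThreads(10_000, 0.5)); //  5000 threads at 500 ms
        System.out.println(requiredThreads(10_000, 1.0)); // 10000 threads at 1 s
        System.out.println(requiredThreads(10_000, 2.0)); // 20000 threads at 2 s
    }
}
```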
Only you can answer whether your laptop can kick off the required number of virtual users, as there are too many factors to consider: the nature of the test, the size of the requests/responses, the number of pre-/post-processors and assertions, etc. Make sure to follow JMeter Best Practices and monitor CPU, RAM, network, etc. usage, e.g. with the JMeter PerfMon Plugin. If your laptop is overloaded, JMeter won't be able to send requests fast enough and you will not reach 10k requests per second even if the server could support it. If your laptop's hardware specification is too low for the test scenario, you will have to go for Distributed Testing.
You have a number of issues in play:
Test design. Use more than one load generator; in fact, use no fewer than three, evenly matched in hardware. Take one and run only one user of each type on it. This is your control set. If this set degrades at the same rate as your other load generators, then you have a common issue, likely the site. If the control set does not degrade but the other load generators do, then you likely have an overloaded generator. On the commercial test tool side of the fence, generating all load from one host has never been considered good practice in performance testing.
10K requests per second. This is substantial. I have worked on some top-20 eCommerce sites and I can tell you that even they do not receive this type of traffic at the origin servers. Why? Cache! Either there is a Content Delivery Network where the load is spread across the country, OR there is a cache node directly in front of the load balancer(s) for the site (think Varnish Cache or equivalent), OR both for a multi-staged cache. You might want to look for an objective reference in production to pin this to as a validation point, if and only if (IFF) your goal is to represent end-user behavior. Running a count of requests grouped by second from the HTTP access logs should be able to validate this number (a small sketch of such a count follows after the next point). Also, check the cache plan for fixed assets; it could be poorly managed, and load would drop significantly just by better managing the site's cache settings towards the client. If your goal is simply to saturate a SOAP/REST interface to the point of destruction, then the next point offers a better path.
If you are looking to take a particular set of SOAP or REST remote procedure calls to the point of destruction, consider a classical stress test. Start your test at zero load and increase it in the smallest step interval possible over the longest possible period of time. The physical analogy would be the classical hospital-style stress test where a nurse comes around every minute and increases the speed OR the incline on the treadmill OR both, until some end-of-test condition is reached. For a hospital-style test that condition is moving into oxygen debt, an inability to keep pace, etc. For your application/interface it could be response times doubling from what is acceptable, saturation of a finite resource (CPU, disk, memory, network) on the back-end hosts, etc.
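As a sketch of the access-log check mentioned above (assuming common-log-style lines with a bracketed timestamp; adjust the parsing to your actual log format), counting requests grouped by second might look like this:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Counts requests per second from an access log and prints the busiest second.
// Assumes lines like: 1.2.3.4 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 512
public class RequestsPerSecond {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> perSecond = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            int open = line.indexOf('[');
            int close = line.indexOf(']');
            if (open < 0 || close < 0) continue;
            // "10/Oct/2023:13:55:36 +0000" -> keep everything before the space = one-second resolution
            String second = line.substring(open + 1, close).split(" ")[0];
            perSecond.merge(second, 1, Integer::sum);
        }
        perSecond.entrySet().stream()
                 .max(Map.Entry.comparingByValue())
                 .ifPresent(e -> System.out.println("Peak: " + e.getValue() + " req/s at " + e.getKey()));
    }
}
```

If the peak you observe in production is nowhere near 10k req/s at the origin, the target for the test is probably an artifact of caching assumptions rather than real end-user demand.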

Fewer records are inserted into the database when we increase the thread group count from 100 to 200 in JMeter

Initially I ran a load test with 100 users for 10 minutes, and 1000 records were inserted into the database for the scenarios below.
Employee Creation -- Test script design took 1 minute
Employee Update -- Test script design took 2 minutes
Then I ran the same load test with 200 users for 10 minutes and 1100 records were inserted, without any error logs or deadlocks.
My question: when we double the thread group count from 100 to 200, record insertion should also double, or approximately double. Why is that not happening? The same goes for the number of requests/samples.
You have reached the maximum throughput of your test at about 110 records per minute: 1000 records in 10 minutes is 100 records/min with 100 users, and 1100 records in 10 minutes is 110 records/min with 200 users. In other words, you have a bottleneck on the client or the server which doesn't allow 200 users to process requests concurrently and/or within the same amount of time (either some users wait until they can start processing a request, or each request takes longer, so the total number of requests is lower).
Some bottlenecks can be resolved by you (if they are related to the script, the JMeter configuration or the JMeter machine), others have to be resolved on the server side (by whoever has access to it), and some cannot be resolved at all (they are true bottlenecks of your app).
Without knowing your application, it's hard to suggest anything beyond general "checklist" items:
Verify the JMeter script and check whether it has any places where it may wait, take a long time, and so on. For example, if your ramp-up period is too long, the "first" user may finish execution before the "last" user has even started. Scriptable samplers and pre-/post-processors may cause delays as well.
Make sure JMeter is configured properly to handle 200 concurrent threads. For example, if the JMeter heap is set too low, JMeter may be very slow because it constantly needs to run GC. See this question for how to inspect and configure memory (it discusses an out-of-memory error, but even without that error, inadequate memory can cause slowness).
Make sure the JMeter machine is configured correctly to allow 200+ concurrent HTTP connections. A common issue on both Windows and Linux machines is that people assume they can have 65535 connections (the maximum number of ports), but in reality both Windows and Linux limit the range of ports allowed to be used by default. Also, after use a port may remain in TIME_WAIT or CLOSE_WAIT state for several minutes, which makes it unusable. As a result, running out of ports is quite common. Here's how to monitor and resolve this issue on Windows and Linux.
Check JMeter machine performance as a whole: does it have enough CPU, memory; is it swapping memory, etc.
If none of the above is a problem, you need to look at how requests arrive at the server. If the client is capable of sending 200 concurrent requests (which you should have established in the previous steps), but the server receives them at a slower rate, then maybe something in the network slows things down. For example, slow DNS resolution or slow routing between JMeter and the server can cause issues.
Also, item #3 above applies to the server as well as to the client.
If requests do arrive at the server at the same speed as they are sent from the client, then their processing by the server probably slows down as the number of parallel requests goes up. This is where you are in dev and DevOps territory, and you probably need to work with them to identify bottlenecks on the server side. It could be the configuration of the web or application server, the application itself, ... pretty much anything along the application's path.
Performance testing is 10% execution, and 90% analysis and identification of bottlenecks, so here you go.

Understanding RESTful Web Service stress test results

I'm trying to stress-test my Spring RESTful Web Service.
I run my Tomcat server on an Intel Core 2 Duo notebook with 4 GB of RAM. I know it's not a real server machine, but it's all I have, and it's only for study purposes.
For the test, I run JMeter on a remote machine and communication goes through a private WLAN with a central wireless router. I prefer to test over a wireless connection because the service would be accessed from mobile clients. With JMeter I run a group of 50 threads, starting one thread per second, so after 50 seconds all threads are running. Each thread repeatedly sends an HTTP request to the server containing a small JSON object to be processed, and sleeps on each iteration for the sum of a 100 millisecond constant delay and a random value drawn from a Gaussian distribution with a standard deviation of 100 milliseconds. I use some JMeter plugins for graphs.
Here are the results:
I can't figure out why my hits per second don't pass the 100 threshold (in the graph they are multiplied by 10), because with this configuration it should be higher than that (50 threads sending at least three times per second would generate 150 hits/sec). I don't get any error messages from the server, and everything seems to work fine. I've tried more and more configurations, but I can't get more than 100 hits/sec.
Why?
[EDIT] Many times I notice a substantial performance degradation from some point on without any visible cause: no error responses on the client, only OK HTTP responses, and everything seems to work well on the server too, but looking at the reports:
As you can see, something happens between 01:54 and 02:14: hits per second decrease and response time increases. OK, it could be a server overload, but what about the CPU decreasing? That is not compatible with the congestion hypothesis.
I want to note that you've chosen very well which rows to display on the Composite Graph. It's enough to draw some conclusions:
Note that Hits Per Second correlates perfectly with CPU usage. This means you have a "CPU-bound" system, and the maximum performance is mostly limited by CPU. This is very important to remember: server resources are spent by hits, not by active users. You could disable your sleep timers entirely and still receive the same 80-90 hits/s.
The maximum CPU level is somewhere around 80%, so I assume you run Windows (Win7?) on your machine. In my experience it is hard to reach 100% CPU utilization on a Windows machine; it just doesn't seem to spend the last 20%. If you have reached that maximum, then you are seeing your installation's capacity limit: it simply doesn't have enough CPU resources to serve more requests. To fight this bottleneck you should either add CPU (use another server with better CPU hardware), configure the OS to let you use up to 100% (I don't know whether that is possible), or optimize your system (code, OS settings) to spend less CPU per request.
For the second graph, I'd suppose something is being downloaded via the router, or something is happening on the JMeter machine. "Something happens" means some task is running. This may be your friend who just wanted to run a quick "grep error.log", or a scheduled task. To pin this down you should look at the router's resources and the JMeter machine's resources during the degradation. There must be a process that swallows CPU/disk/network.

Spread waiting time among connection requests and performance issues

I developed a server for a custom TCP/IP-based protocol with Netty. Writing it was a pleasure.
Right now I am testing performance. I wrote a test application on Netty that simply connects lots (20,000+) of "clients" to the server (a for-loop with Thread.wait(1) after each bootstrap connect). As soon as a client channel is connected it sends a login request to the server, which checks the account and sends a login response.
The overall performance seems to be quite OK. All clients are logged in within 60s. What's not so good is the spread of waiting times per connection: I have extremely fast logins and extremely slow logins, varying from 9 ms to 40,000 ms over the whole test run. Is it somehow possible to share waiting time among the requesting channels (FIFO)?
I measured a lot of significant timestamps and found a strange phenomenon. For many connections the server's "channel connected" timestamp is way after the client's (by up to 19 seconds). I also have the "normal" case, where they match and only the time between client send and server reception is several seconds. And there are cases of everything in between. How can the client's and server's "channel connected" timestamps be so far apart?
What is certain is that the client receives the server's login response immediately after it has been sent.
Tuning:
I think I have read most of the performance articles around here. I am using the OrderedMemoryAwareThreadPoolExecutor with 200 threads on a 4-core Hyper-Threading i7 for the incoming connections, and I also start the server application with the known aggressive options. I have also completely tweaked my Win7 TCP stack.
The server runs very smoothly on my machine. CPU usage and memory consumption are at roughly 50% of what could be used.
Too much information:
I also started 2 of my test apps from 2 separate machines, "attacking" the server in parallel with 15,000 connections each. There I had about 800 connections that got a timeout from the server. Any comments here?
Best regards and cheers to Netty,
Martin
Netty has a dedicated boss thread that accepts incoming connections. When the boss thread accepts a new connection, it forwards the connection to a worker thread. Because of this, the latency between acceptance and the first actual socket read might be larger than expected under load. Although we are looking into different ways to improve the situation, in the meantime you might want to increase the number of worker threads so that each worker thread handles fewer connections.
If you think it's performing much worse than a non-Netty application, please feel free to file an issue with a reproducing test case. We will try to reproduce and fix the problem.
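For example, a minimal sketch of giving the worker pool more threads, shown with the Netty 4 NioEventLoopGroup API rather than the Netty 3 execution handler setup described in the question; the thread count, port, and handler are placeholders to adapt:

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;

// Sketch: one boss thread accepts connections, a larger worker pool services them,
// so each worker event loop is responsible for fewer channels.
public class ServerSketch {
    public static void main(String[] args) throws InterruptedException {
        EventLoopGroup boss = new NioEventLoopGroup(1);      // accepts connections
        EventLoopGroup workers = new NioEventLoopGroup(32);  // tune upward if each worker owns too many channels
        try {
            ServerBootstrap b = new ServerBootstrap();
            b.group(boss, workers)
             .channel(NioServerSocketChannel.class)
             .childHandler(new ChannelInitializer<SocketChannel>() {
                 @Override
                 protected void initChannel(SocketChannel ch) {
                     // ch.pipeline().addLast(new YourLoginHandler()); // placeholder for the protocol handlers
                 }
             });
            b.bind(8080).sync().channel().closeFuture().sync();
        } finally {
            boss.shutdownGracefully();
            workers.shutdownGracefully();
        }
    }
}
```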

How do UDP sockets actually work internally?

I am trying to reduce packet manipulation to a minimum in order to improve the efficiency of a specific program I am working on, but I am struggling with the time it takes to send through a UDP socket using sendto/recvfrom. I am using 2 very basic processes (applications): one is sending, the other one receiving.
I would like to understand how Linux works internally when using these function calls...
Here are my observations:
when sending packets at:
10Kbps, the time it takes for the messages to go from one application to the other is about 28us
400Kbps, the time it takes for the messages to go from one application to the other is about 25us
4Mbps, the time it takes for the messages to go from one application to the other is about 20us
40Mbps, the time it takes for the messages to go from one application to the other is about 18us
When using different CPUs, the times are obviously different, but consistent with those observations. There must be some sort of setting that lets the socket's queue be read faster depending on the traffic flow on that socket... how can that be controlled?
When using a node as a forwarding node only, going in and out takes about 8us with a 400Kbps flow; I want to converge to that value as much as I can. 25us is not acceptable and deemed too slow (obviously this is still far less than the delay between packets anyway... but the point is to eventually be able to process a much larger number of packets, hence this time needs to be shortened!). Is there anything faster than sendto/recvfrom? (I must use 2 different applications (processes); I know I cannot use one monolithic block, so the information needs to be sent over a socket.)
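Not an answer to the kernel-internals question, but one way to test whether scheduler wake-up latency is what dominates those 18-28 us numbers is to keep the receiver spinning on a non-blocking socket so it never has to be woken up. Below is a minimal sketch of that idea using Java NIO (the port and buffer size are arbitrary placeholders; the equivalent in C would be a non-blocking recvfrom in a tight loop), at the cost of pinning one core at 100%:

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;

// Busy-polling UDP receiver: trades one spinning core for lower wake-up latency.
public class BusyPollReceiver {
    public static void main(String[] args) throws Exception {
        DatagramChannel ch = DatagramChannel.open();
        ch.bind(new InetSocketAddress(9000));
        ch.configureBlocking(false);              // never park the thread in the kernel
        ByteBuffer buf = ByteBuffer.allocateDirect(2048);
        while (true) {
            buf.clear();
            if (ch.receive(buf) != null) {        // returns immediately when nothing is queued
                long receivedAt = System.nanoTime();
                // handle the datagram; 'receivedAt' can be compared against a send timestamp in the payload
            }
            // no sleep: spin so the next datagram is picked up as soon as it hits the socket queue
        }
    }
}
```

If latency drops sharply with the receiver spinning, the variable cost you measured is mostly the receiver being descheduled between packets, which also explains why higher packet rates showed lower per-packet delay.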
