TCP connection limit/timeout in virtual machine and native macOS/ARM-based Mac gRPC Go client? - macos

I am currently working on a gRPC microservice, which is deployed to a Kubernetes cluster. As usual, I am benchmarking and load-/stress-testing my service, testing different load balancing settings, impact of SSL and so forth.
Initially, I used my Macbook and my gRPC client written in Go and executed this setup either in Docker or directly in containerd with nerdctl. The framework I use for this is called Colima and basically builds on a lean Alpine VM to provide the container engine. Herein, I ran into issues with connection timeouts and refusals once I crossed a certain number of parallel sessions, which I guess is a result from the container engine.
Therefore, I went ahead and ran my Go client natively on macOS. This setup somehow runs into the default 20s keepalive timeout for gRPC (https://grpc.github.io/grpc/cpp/md_doc_keepalive.html) the moment my parallel connections exceed the number of traffic I can work out by some margin (#1).
When I run the very same Go client on an x86 Ubuntu 22 desktop, there are no such issues whatsoever and I can start way more sessions in parallel, which are then processed accordingly without any issues with the 20s keepalive timeout.
Any ideas how that comes to being and if I could make some changes to my setup to be able to run my stress-test benchmarks from macOS?
#1: Let's say I can process and reply 1 request per second with my service. For stress testing, I now start 20 parallel sessions and would expect them to be processed sequentially.

Related

Load balancer and WebSockets

Our infrastructure is composed by
1 F5 load balancer
3 nodes
We have an application which uses websockets, so when a user goes to our site, it opens a websocket to the balancer which it connects to the first available node, and it works as expected.
Our truobles arrives with maintenance tasks, when we have to update our software, we need to turn offline 1 node at a time, deploy the new release and then turn it on again. Doing this task, the balancer drops the open websocket connections to the node and the clients retries to connect after few seconds to the first available nodes, creating an inconvenience for the client because he could miss a signal (or more).
How we can keep the connection between the client and the balancer, changing the backend websocket server? Is the load balancer enough to achieve our goal or we need to change our infrastructure?
To avoid this kind of problems I recommend to read about the Azure SignalR. With this you don't need to thing about stuff like load balancer, redis backplane and other infrastructures that you possibly need to a WebSockets connection.
Basically the clients will not connected to your node directly but redirected to Azure SignalR. You can read more about it here: https://learn.microsoft.com/en-us/azure/azure-signalr/signalr-overview
Since it is important to your application to maintain the connection, I don't see how any other way to archive no connection drop to your nodes, since you need to shut them down.
It's important to understand that the F5 is a full TCP proxy. This means that the F5 is the server to the client and the client to the server. If you are using the websockets protocol then you must apply a websockets profile to the F5 Virtual Server in order for the websockets application to be handled properly by the Load Balancer.
Details of the websockets profile can be found here: https://support.f5.com/csp/article/K14754
If a websockets and an HTTP profile are applied to the Virtual Server - meaning that you have websockets and web traffic using the same port and LB nodes - then the F5 will allow the websockets traffic as passthrough. Also keep in mind that if this is an HTTPS virtual sever that you will need to ensure a client and server side HTTPS profile (SSL offload) are applied to the Virtual Server.
While there are a variety of ways that you can fiddle with load balancers to minimize the downtime caused by a software upgrade, none of them solve the problem, which is that your application-layer protocol seems to not tolerate some small network outages.
Even if you have a perfect load balancer and your software deploys cause zero downtime, the customer's computer may be on flaky wifi which causes a network dropout for half a second - or going over ethernet and someone reconfigures some routing on their LAN, etc.
I'd suggest having your server maintain a queue of messages for clients (up to some size/time limit) so that when a client drops a connection - whether it be due to load balancers/upgrades - or any other reason, it can continue without disruption.

How do I find out what port another process is running on besides the web process on Heroku?

I have a webhook URL and a normal web server (running HapiJS).
I'd like to proxy certain requests in HapiJS to the webhook server that's running on a private port but I need to know what the $PORT is on the other non web process.
Is there a way to find this port number?
There is no way to find that port number.
Heroku dynos run on different runtimes. So even if you did know it, you would also need to figure out the IP address of that server, which would change with every deployment and once every 24 hours.
This would also not be very scalable, as the strength of heroku is to allow you to boot more dynos easily. If you rely on knowing where the other dyno is, you're losing that easy scaling.
You don't necessarily need this to communicate between processes though. Using a redis queue, you could enqueue asynchronous jobs to be processed by your worker process. Both processes would communicate, and they wouldn't need to know where the other one is.

OpenFire, HTTP-BIND and performance

I'm looking into getting an openfire server started and setting up a strophe.js client to connect to it. My concern is that using http-bind might be costly in terms of performance versus making a straight on XMPP connection.
Can anyone tell me whether my concern is relevant or not? And if so, to what extend?
The alternative would be to use a flash proxy for all communication with OpenFire.
Thank you
BOSH is more verbose than normal XMPP, especially when idle. An idle BOSH connection might be about 2 HTTP requests per minute, while a normal connection can sit idle for hours or even days without sending a single packet (in theory, in practice you'll have pings and keepalives to combat NATs and broken firewalls).
But, the only real way to know is to benchmark. Depending on your use case, and what your clients are (will be) doing, the difference might be negligible, or not.
Basics:
Socket - zero overhead.
HTTP - requests even on IDLE session.
I doubt that you will have 1M users at once, but if you are aiming for it, then conection-less protocol like http will be much better, as I'm not sure that any OS can support that kind of connected socket volume.
Also, you can tie your OpenFires together, form a farm, and you'll have nice scalability there.
we used Openfire and BOSH with about 400 concurrent users in the same MUC Channel.
What we noticed is that Openfire leaks memory. We had about 1.5-2 GB of memory used and got constant out of memory exceptions.
Also the BOSH Implementation of Openfire is pretty bad. We switched then to punjab which was better but couldn't solve the openfire issue.
We're now using ejabberd with their built-in http-bind implementation and it scales pretty well. Load on the server having the ejabberd running is nearly 0.
At the moment we face the problem that our 5 webservers which we use to handle the chat load are sometimes overloaded at about 200 connected Users.
I'm trying to use websockets now but it seems that it doesn't work yet.
Maybe redirecting the http-bind not via Apache rewrite rule but directly on a loadbalancer/proxy would solve the issue but I couldn't find a way on how to do this atm.
Hope this helps.
I ended up using node.js and http://code.google.com/p/node-xmpp-bosh as I faced some difficulties to connect directly to Openfire via BOSH.
I have a production site running with node.js configured to proxy all BOSH requests and it works like a charm (around 50 concurrent users). The only downside so far: in the Openfire admin console you will not see the actual IP address of the connected clients, only the local server address will show up as Openfire get's the connection from the node.js server.

Windows temporarily shuts down my TCP stack when I stress test my HTTP server

I'm building an HTTP server for Windows that uses IO Completion ports (IOCP). I have a stress test app that hits the server continuously with HTTP requests. After a couple seconds (a varying, unpredictable interval), my machine is unable to open any new TCP connections. I know this because my browser is unable to open any new connections, and the server just waits for an AcceptEx call to complete. If I cool off the stress process, then everything comes back to life again after a few seconds. I don't think it's a backlog issue because the stresser is synchronous - it waits for a result before issuing the next request. The stresser does run a couple (call it N) threads in parallel, but that can't cause more than a backlog of N (small HTTP) requests.
I'm on Windows 7 Pro. Will test on a Windows Server OS on Monday. What is causing this behaviour?
Are you running out of TCP ports due to a very large number of them staying in the TIMED_WAIT state?
Google Windows ephemeral ports.

Detecting dead applications while server is alive in NLB

Windows NLB works great and removes computer from the cluster when the computer is dead.
But what happens if the application dies but the server still works fine? How have you solved this issue?
Thanks
By not using NLB.
Hardware load balancers often have configurable "probe" functions to determine if a server is responding to requests. This can be by accessing the real application port/URL, or some specific "healthcheck" URL that returns only if the application is healthy.
Other options on these look at the queue/time taken to respond to requests
Cisco put it like this:
The Cisco CSM continually monitors server and application availability
using a variety of probes, in-band
health monitoring, return code
checking, and the Dynamic Feedback
Protocol (DFP). When a real server or
gateway failure occurs, the Cisco CSM
redirects traffic to a different
location. Servers are added and
removed without disrupting
service—systems easily are scaled up
or down.
(from here: http://www.cisco.com/en/US/products/hw/modules/ps2706/products_data_sheet09186a00800887f3.html#wp1002630)
Presumably with Windows NLB there is some way to programmatically set the weight of nodes? The nodes should self-monitor and if there is some problem (e.g. a particular node is low on disc space), set its weight to zero so it receives no further traffic.
However, this needs to be carefully engineered and have further human monitoring to ensure that you don't end up with a situation where one fault causes the entire cluster to announce itself down.
You can't really hope to deal with a "byzantine general" situation in network load balancing; an appropriately broken node may think it's fine, appear fine, but while being completely unable to do any actual work. The trick is to try to minimise the possibility of these situations happening in production.
There are multiple levels of health check for a network application.
is the server machine up?
is the application (service) running?
is the service accepting network connections?
does the service respond appropriately to a "are you ok" request?
does the service perform real work? (this will also check back-end systems behind the service your are probing)
My experience with NLB may be incomplete, but I'll describe what I know. NLB can do 1 and 2. With custom coding you can add the other levels with varying difficulty. With some network architectures this can be very difficult.
Most hardware load balancers from vendors like Cisco or F5 can be easily configured to do 3 or 4. Level 5 testing still requires custom coding.
We start in the situation where all nodes are part of the cluster but inactive.
We run a custom service monitor which makes a request on the service locally via the external interface. If the response was successful we start the node (allow it to start handling NLB traffic). If the response failed we stop the node from receiving traffic.
All the intermediate steps described by Darron are irrelevant. Did it work or not is the only thing we care about. If the machine is inaccessible then the rest of the NLB cluster will treat it as failed.

Resources