Load balancing according to incoming traffic in haproxy - high availability

I am using haproxy with round robin and it works fine, but now I am facing a problem: one of my backend servers is overloaded.
I want to know whether I can balance the traffic according to the load on each backend server. Also, if one server reaches its maxconn limit, does traffic go to the other backend servers?
Also, what is the difference between leastconn and round robin, and between global maxconn, default maxconn, and server maxconn?

If a server is more loaded than other ones, then mechanically it will see more concurrent connections for the same request rate. That's where it becomes useful to switch to the leastconn algorithm, which will ensure that all servers always run with the same number of concurrent connections. This is useful for instance if some of your requests are much longer than other ones (eg: complex requests in a database).
For the second point, I'll be short because everything is in the doc, but leastconn focuses on the number of concurrent connections while round robin focuses on the cumulative number of connections. With round robin, each server gets a request in turn, so the requests on a given server are optimally spaced. This is normally better for static servers, or for applications with stickiness where users make a large number of requests once attached to a server, since it ensures you have the same number of users on each server.
Global maxconn is the total number of concurrent connections a single haproxy process will support; haproxy stops accepting incoming connections when that limit is reached. The default maxconn applies to frontends only, and when a frontend's maxconn is reached, that frontend alone stops accepting new connections. The server maxconn ensures that haproxy never sends too many connections to a server. When that limit is reached, another server is selected when possible (no cookie, etc.), or the request is queued until the server releases a connection. If your servers are overloaded, you should check the number of connections at which they start to struggle and set a server maxconn slightly below that value to protect them.
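To make the directives concrete, here is a minimal haproxy configuration sketch. The backend names, addresses, and limits are illustrative only; tune the numbers to what your servers can actually sustain.

```haproxy
global
    maxconn 4096          # total concurrent connections this haproxy process will accept

defaults
    mode http
    maxconn 2000          # per-frontend limit unless a frontend sets its own
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend www
    bind *:80
    default_backend app

backend app
    balance leastconn     # pick the server with the fewest concurrent connections
    server app1 10.0.0.1:8080 maxconn 100 check   # per-server cap; excess requests are queued
    server app2 10.0.0.2:8080 maxconn 100 check
```

With this layout, a slow server accumulates connections and therefore receives fewer new requests, and the per-server maxconn keeps either server from being pushed past its limit.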

Related

How does an AWS Application Load Balancer select a target within a target group? How do I load balance websocket traffic?

I have an AWS Application Load Balancer to distribute the HTTP(S) traffic.
Problem 1:
Suppose I have a target group with 2 EC2 instances: micro and xlarge. Obviously they can handle different traffic levels. Does the load balancer manage traffic proportionally to instance sizes or just round robin? If only round robin is used and no other factors taken into account, then it's not really balancing load, because at some point the micro instance will be suffering from the traffic, while xlarge will starve.
Problem 2:
Suppose I have a target group with 2 EC2 instances, both the same size. But my service does not use a classic HTTP request/response flow. It uses websockets over HTTP, i.e. a client makes an HTTP request just once, to establish a socket, and then keeps the socket open for a long time, sending and receiving messages (e.g. a chat service). Let's suppose my load balancer is using round robin and both EC2 instances have 1000 clients connected each. Now suppose one of the EC2 instances goes down and its 1000 connected clients drop their socket connections. The instance comes back up quickly and is ready to accept websocket calls again, and the 1000 clients who dropped try to reconnect. Now, if the load balancer uses pure round robin, I'll end up with 1500 clients connected to instance #1 and 500 clients connected to instance #2, thus not really balancing the load correctly.
Basically, I'm trying to find out if some more advanced logic is being used to select a target in a group, or is it just a naive round robin selection. If it's round robin only, then how can I really balance the websocket connections load?
Websockets start out as http or https connections, so a load balancer can dispatch them to a server. Once the server accepts the http connection, both the server and the client "upgrade" the connection to use the websocket protocol. They then leave the connection open to use for websocket traffic. As far as the load balancer can tell, the connection is simply a long-lasting http connection.
Taking a server down when it has websocket connections to clients requires your application to retry lost connections. Reconnecting on connection failure is one of the trickiest parts of websocket client programming. Your application cannot be robust without reconnect logic.
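As a minimal sketch of such reconnect logic, the loop below retries with exponential backoff plus random jitter so that thousands of dropped clients don't all reconnect at the same instant. `connectWebsocket` is a hypothetical placeholder for whatever dial call your websocket library provides; it is not a real API.

```go
package main

import (
	"errors"
	"log"
	"math/rand"
	"time"
)

// connectWebsocket is a hypothetical placeholder for your library's dial call.
// It is expected to block while the connection is healthy and return an error
// when the connection drops or cannot be established.
func connectWebsocket(url string) error {
	return errors.New("connection lost") // stub for illustration
}

func runWithReconnect(url string) {
	backoff := time.Second
	const maxBackoff = 30 * time.Second

	for {
		err := connectWebsocket(url)
		log.Printf("connection ended: %v", err)

		// Sleep for the current backoff plus random jitter so reconnects
		// from many clients are spread out instead of arriving in a burst.
		jitter := time.Duration(rand.Int63n(int64(backoff)))
		time.Sleep(backoff + jitter)

		if backoff < maxBackoff {
			backoff *= 2
		}
	}
}

func main() {
	runWithReconnect("wss://example.com/socket")
}
```

In a real client you would also reset the backoff after a connection that stayed healthy for a while.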
AWS's load balancer has no built-in knowledge of the capabilities of the servers behind it. You have observed that it sends requests equally to big and small servers. That can overwhelm the small ones.
I have managed this by building a /healthcheck endpoint in my servers. It's a straightforward https://example.com/healthcheck web page. You can put a little bit of content on the page announcing how many websocket connections are currently open, or anything else. Don't password protect it or require a session to hit it.
My /healthcheck endpoints, whenever hit, measure the server load. I simply use the number of current websocket connections, but you can use any metric you want. I compare the current load to a load threshold configured for each server. For example, on a micro instance I can handle 20 open websockets, and on a production instance I can handle 400.
If the server load is too high, my endpoint gives back a 503 http error status along with its content. 503 typically means "I am overloaded, please try again later." It can also mean "I will shut down when all my connections are closed. Please don't use me for any more connections."
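A minimal sketch of such an endpoint in Go, assuming a counter of open websocket connections and a per-server threshold (both names and numbers are illustrative):

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

var openSockets int64  // increment/decrement wherever websockets are opened and closed
const maxSockets = 400 // per-server threshold; e.g. 20 on a micro instance

func healthcheck(w http.ResponseWriter, r *http.Request) {
	n := atomic.LoadInt64(&openSockets)
	if n >= maxSockets {
		// Tell the load balancer we are overloaded so it takes us out of rotation.
		w.WriteHeader(http.StatusServiceUnavailable) // 503
	}
	fmt.Fprintf(w, "open websocket connections: %d / %d\n", n, maxSockets)
}

func main() {
	http.HandleFunc("/healthcheck", healthcheck)
	http.ListenAndServe(":8080", nil)
}
```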
Then I configure the load balancer to perform those health checks every couple of minutes on all the servers in the server pool (AWS calls the pool a "target group"). The health check operation detects "unhealthy" servers and temporarily takes them out of its rotation. (The health check also detects crashed servers, which is good.)
You need this load balancer health check for a large-scale production setup.
All that being said, you will get best results if all your server instances in your pool have roughly the same capacity as each other.

HTTPS connection keep alive performance

Does anyone know what’s the difference, in milliseconds and percentage, between the total time it takes to make an HTTPS request that is allowed to use keep alive vs one that doesn’t? For the sake of this question, let’s assume a web server that has one GET endpoint called /time that simply returns the server’s local time, and that clients call this endpoint on average once a minute.
My guess is that, putting the server on my home LAN, and calling /time from my laptop on the LAN would take 200ms. With keep-alive it’s probably going to be 150ms. So that’s 50ms difference, and 25% improvement.
My second question is similar, but only considers server processing time. Let’s say the server takes 100ms to process a GET /time request, but only 50ms to process the same with keep-alive. That’s 50ms faster, but a 50% performance gain, which is very meaningful as it increases the server’s capacity.
I think you have confused a lot of things here. The keep-alive header in the HTTP protocol suggests to a client that the server wouldn't mind accepting multiple requests over the same connection.
A connection is a concept of the underlying TCP protocol, and there is overhead (the three-way handshake) in establishing one. On the other hand, too many open connections at once hurt a server's performance. That's why those options exist.
HTTPS adds a security layer on top of the HTTP protocol, and I suspect it bears no relevance in the context of your question whatsoever.
So if we are talking about one request a minute, there is no noticeable difference. The overhead of connection establishment is on the order of dozens of milliseconds, so you will only notice a difference starting at hundreds of requests per second.
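If you want to measure the difference yourself rather than guess, a small Go client can time the same request with and without keep-alive; the URL and request count below are placeholders for your own /time endpoint.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func timeRequests(client *http.Client, url string, n int) time.Duration {
	start := time.Now()
	for i := 0; i < n; i++ {
		resp, err := client.Get(url)
		if err != nil {
			panic(err)
		}
		io.Copy(io.Discard, resp.Body) // drain the body so the connection can be reused
		resp.Body.Close()
	}
	return time.Since(start)
}

func main() {
	url := "https://example.com/time" // placeholder endpoint
	const n = 100

	withKeepAlive := &http.Client{Transport: &http.Transport{}}
	withoutKeepAlive := &http.Client{Transport: &http.Transport{DisableKeepAlives: true}}

	fmt.Println("keep-alive:   ", timeRequests(withKeepAlive, url, n))
	fmt.Println("no keep-alive:", timeRequests(withoutKeepAlive, url, n))
}
```

With HTTPS the gap tends to be larger than with plain HTTP, because each new connection also repeats the TLS handshake.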
Here is my experiment. It's not HTTP, but it illustrates well the benefits of keeping the connection alive.
My setup is a network of servers that create their own secure connections.
I wrote a stress test that creates 100 threads on Server1. Each thread opens a TCP connection and establishes a secure channel with Server2. The thread on Server2 sends the numbers 1..1000, and the thread on Server1 simply reads them and sends "OK" back to Server2. The TCP connections and secure channels are "kept alive".
First run:
100 threads are created on Server1
100 TCP connections are established between Server1 and Server2
100 threads are created on Server2 (to serve the Server1 requests)
100 secure channels are established, one per thread
total runtime: 10 seconds
Second run:
100 threads are created on Server1 (but those might have been reused by the JVM from the previous runs)
No new TCP connections are needed. The old ones are reused.
No threads are created on Server2. They are still waiting for requests.
No secure channels are established
total runtime: 1 second

How to scale websocket connection load as one adds/removes servers?

To explain the problem:
With HTTP:
Assume there are 100 requests/second arriving.
If there are 4 servers, the load balancer (LB) can distribute the load across them evenly, 25/second per server.
If I add a server (5 servers total), the LB spreads the load more thinly, to 20/second per server.
If I remove a server (3 servers total), the load per server rises to about 33.3/second.
So the load per server is automatically rebalanced as I add/remove servers, since each connection is so short-lived.
With Websockets
Assume there are 100 clients, 2 servers (behind a LB)
The LB initially balances each incoming connection evenly, so each server has 50 connections.
However, if I add a server (3 servers total), the 3rd server gets 0 connections, since the existing 100 clients are already connected to the other 2 servers.
If I remove a server (1 server total), all those 100 connections reconnect and are now served by 1 server.
Problem
Since websocket connections are persistent, adding/removing a server does not increase/decrease load per server until the clients decide to reconnect.
How does one then efficiently scale websockets and manage load per server?
This is similar to problems the gaming industry has been trying to solve for a long time. That is an area where you have many concurrent connections and you have to have fast communication between many clients.
Options:
Slave/master architecture where master retains connection to slaves to monitor health, load, etc. When someone joins the session/application they ping the master and the master responds with the next server. This is kind of client side load balancing except you are using server side heuristics.
This prevents your clients from blowing up a single server. You'll have to have the client poll the master before establishing the WS connection, but that is simple (see the sketch after this list of options).
This way you can also scale out to multi master if you need to and put them behind load balancers.
If you need to send a message between servers there are many options for that (handle it yourself, queues, etc).
This is how my drawing app Pixmap for Android, which I built last year, works. Works very well too.
Client side load balancing where client connects to a random host name. This is how Watch.ly works. Each host can then be its own load balancer and cluster of servers for safety. Risky but simple.
Traditional load balancing, i.e. round robin. Hard to beat haproxy. This should be your first approach and will scale to many thousands of concurrent users. It doesn't solve the problem of redistributing load, though. One way to solve that with this setup is to push an event to your clients telling them to reconnect (and have each attempt to reconnect with a random timeout so you don't kill your servers).
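For the first option (master/slave lookup), here is a minimal sketch of a "master" endpoint that returns the least-loaded server. The server list and load counters are illustrative; in practice the slaves would report their connection counts to the master periodically, and clients would call this endpoint before opening their websocket.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

// registry is an illustrative in-memory view of slave load.
type registry struct {
	mu    sync.Mutex
	loads map[string]int // server address -> current websocket connection count
}

func (r *registry) leastLoaded() string {
	r.mu.Lock()
	defer r.mu.Unlock()
	best := ""
	bestLoad := -1
	for addr, load := range r.loads {
		if bestLoad == -1 || load < bestLoad {
			best, bestLoad = addr, load
		}
	}
	return best
}

func main() {
	reg := &registry{loads: map[string]int{
		"wss://ws1.example.com": 120, // placeholder addresses and counts
		"wss://ws2.example.com": 80,
	}}

	// Clients hit this endpoint first and then connect their websocket
	// to the address returned in the response body.
	http.HandleFunc("/next-server", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, reg.leastLoaded())
	})
	http.ListenAndServe(":9000", nil)
}
```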

Using HTTP/2, how can I limit the number of concurrent requests?

We have a system client <-> server working over HTTP/1.1. The client is making hundreds (sometimes thousands) of concurrent requests to the server.
Because of browsers' default limitations on HTTP/1.1 connections, the client actually makes these requests in batches of 6~8 concurrent requests. We think we can get some performance improvement if we can increase the number of concurrent requests.
We moved the system to HTTP/2 and we see the client issuing all the requests simultaneously, as we wanted.
The problem now is the opposite: the server cannot handle so many concurrent requests.
How can we limit the number of concurrent requests the client makes to something more manageable for the server, say 50~100 concurrent requests?
We were assuming that HTTP/2 would let us regulate the number of concurrent connections:
https://developers.google.com/web/fundamentals/performance/http2/
With HTTP/2 the client remains in full control of how server push is used. The client can limit the number of concurrently pushed streams; adjust the initial flow control window to control how much data is pushed when the stream is first opened; or disable server push entirely. These preferences are communicated via the SETTINGS frames at the beginning of the HTTP/2 connection and may be updated at any time.
Also here:
https://stackoverflow.com/a/36847527/316700
Or maybe, if possible, we can limit this on the server side (which I think is more maintainable).
But it looks like these solutions are talking about server push, while what we have is the client pulling.
In case help in any way our architecture looks like this:
Client ==[http 2]==> ALB(AWS Beanstalk) ==[http 1.1]==> nginx ==[http 1.0]==> Puma
There is a special setting in the SETTINGS frame: you can set SETTINGS_MAX_CONCURRENT_STREAMS to 100 on the server side.
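As an illustration of what that setting does (not of your specific ALB/nginx/Puma stack), a Go server that terminates HTTP/2 itself can advertise the limit via golang.org/x/net/http2; the address and certificate paths are placeholders.

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"golang.org/x/net/http2"
)

func main() {
	srv := &http.Server{
		Addr: ":8443",
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprintln(w, "hello")
		}),
	}

	// Advertise SETTINGS_MAX_CONCURRENT_STREAMS = 100; a compliant client will
	// not run more than 100 concurrent requests on one connection.
	if err := http2.ConfigureServer(srv, &http2.Server{MaxConcurrentStreams: 100}); err != nil {
		log.Fatal(err)
	}

	// HTTP/2 with browsers requires TLS; cert.pem/key.pem are placeholders.
	log.Fatal(srv.ListenAndServeTLS("cert.pem", "key.pem"))
}
```

Note that the limit has to be applied by whichever component terminates HTTP/2 toward the client; in your diagram that is the ALB, since nginx and Puma only see HTTP/1.x.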

Go(lang): about MaxIdleConnsPerHost in the http client's transport

If MaxIdleConnsPerHost is set to a high number, let's say 1000, the number of open connections will still depend on the other host, right? I mean, allowing 1000 idle connections to the same host will result in 1000 open connections as long as these are not closed by the other host?
So, effectively, setting this value to a high number will result in never closing a connection, but waiting for the other host to do it? Am I interpreting this correctly?
Your understanding is correct. MaxIdleConnsPerHost restricts how many connections there are which are not actively serving requests, but which the client has not closed.
Idle connections are useful for web browsers because they can keep reusing connections for subsequent HTTP requests to the same server. Idle connections have a cost for the server, though. They use kernel resources, and you may run up against per process limits or kernel limits on the number of open connections, files, or handles, which may cause unexpected errors in your program, or even for other programs on the same machine.
As such, be careful when increasing MaxIdleConnsPerHost to a large number. It only makes sense to increase idle connections if you are seeing many connections in a short period from the same clients.
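For reference, a minimal sketch of configuring these limits on the client, adding IdleConnTimeout so idle connections are eventually closed even if the server never closes them (the numbers are illustrative):

```go
package main

import (
	"net/http"
	"time"
)

func main() {
	transport := &http.Transport{
		MaxIdleConns:        100,              // cap on idle connections across all hosts
		MaxIdleConnsPerHost: 10,               // cap per host; the default is only 2
		IdleConnTimeout:     90 * time.Second, // close connections that stay idle this long
	}
	client := &http.Client{Transport: transport}
	_ = client // use client.Get / client.Do as usual
}
```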
