What's the auto scaling strategy of AWS ELB? - websocket

We are trying to build a WebSocket server(with 500k concurrent connections, but very low traffic) on AWS with ELB.
We tested single server with Public IP, it can handle more than 500k connections.
But when we put the servers after the ELB, every server can only get 65536 connections.
And we can see a lot of Spillover Count on ELB Monitoring.
The document of ELB says they will auto scaling the ELB by changing the IP list in DNS.
But when I dig my domain, I always get the same IP list.
The ELB auto scaling seems not work.

We submitted a ticket to AWS. And they said their ELB is based on requests.
One new connection just count as one connection.
So ELB can't use for long connection server.

Related

How does AWS Application Load balancer select a target within a target group? How to load balance the websocket traffic?

I have an AWS Application load balancer to distribute the http(s) traffic.
Problem 1:
Suppose I have a target group with 2 EC2 instances: micro and xlarge. Obviously they can handle different traffic levels. Does the load balancer manage traffic proportionally to instance sizes or just round robin? If only round robin is used and no other factors taken into account, then it's not really balancing load, because at some point the micro instance will be suffering from the traffic, while xlarge will starve.
Problem 2:
Suppose I have target group with 2 EC2 instances, both are same size. But my service is not using a classic http request/response flow. It is using HTTP websockets, i.e. a client makes HTTP request just once, to establish a socket, and then keeps the socket open for longer time, sending and receiving messages (e.g. a chat service). Let's suppose my load balancer is using round robin and both EC2 instances have 1000 clients connected each. Now suppose one of the EC2 instances goes down and 1000 connected clients drop their socket connections. The instance gets back up quickly and is ready to accept websocket calls again. The 1000 clients who dropped are trying to reconnect. Now, if the load balancer would use pure round robin, I'll end up with 1500 clients connected to instance #1 and 500 clients connected to instance #2, thus not really balancing the load correctly.
Basically, I'm trying to find out if some more advanced logic is being used to select a target in a group, or is it just a naive round robin selection. If it's round robin only, then how can I really balance the websocket connections load?
Websockets start out as http or https connections, so a load balancer can dispatch them to a server. Once the server accepts the http connection, both the server and the client "upgrade" the connection to use the websocket protocol. They then leave the connection open to use for websocket traffic. As far as the load balancer can tell, the connection is simply a long-lasting http connection.
Taking a server down when it has websocket connections to clients requires your application to retry lost connections. Reconnecting on connection failure is one of the trickiest parts of websocket client programming. Your application cannot be robust without reconnect logic.
AWS's load balancer has no built-in knowledge of the capabilities of the servers behind it. You have observed that it sends requests equally to big and small servers. That can overwhelm the small ones.
I have managed this by building a /healthcheck endpoint in my servers. It's a straightforward https://example.com/heathcheck web page. You can put a little bit of content on the page announcing how many websocket connections are currently open, or anything else. Don't password protect it or require a session to hit it.
My /healthcheck endpoints, whenever hit, measure the server load. I simply use the number of current websocket connections, but you can use any metric you want. I compare the current load to a load threshold configured for each server. For example, on a micro instance I can handle 20 open websockets, and on a production instance I can handle 400.
If the server load is too high, my endpoint gives back a 503 http error status along with its content. 503 typically means "I am overloaded, please try again later." It can also mean "I will shut down when all my connections are closed. Please don't use me for any more connections."
Then I configure the load balancer to perform those health checks every couple of minutes on all the servers in the server pool (AWS calls the pool a "target group"). The health check operation detects "unhealthy" servers and temporarily takes them out of its rotation. (The health check also detects crashed servers, which is good.)
You need this loadbalancer health check for a large-scale production setup.
All that being said, you will get best results if all your server instances in your pool have roughly the same capacity as each other.

How to whitelist IP/DNS of servers under loadbalancer?

I have two servers (EC2 instances provided by AWS) under a loadbalancer (To be more specific ELB provided by AWS).
These servers run identical restful API code. (State is saved in mutual database).
Lets say these servers provide a service called X.
These two servers are in an autoscaling group (ASG provided by AWS) (meaning that if one dies or more computing power is needed - an additional server will be created).
Every server has it's own IP adress.
Lets say I am a consumer of this infrastructure. I send a HTTP POST request to a loadbalancer endpoint and i expect at some point to receive a HTTP POST request. The HTTP POST request would be sent from one of two servers under loadbalancer. How can I be sure that the HTTP POST request is sent from the service X and not by someone other? (Initially I want to whitelist IPs or DNSs to be safe)
Note that i can not simply white list IP adress of every machine under a loadbalancer since machines are killed and spawned regularly (assigning new IP adresses every time).

Loadbalancing web sockets - AWS Elastic Loadbalancer

I have a question about how to load balance web sockets with AWS elastic load balancer.
I have 2 EC2 instances behind AWS elastic load balancer.
When any user login, the user session will be established with one of the server, say EC2 instance1. Now, all the requests from the same user will be routed to EC2 instance1.
Now, I have a different stateless request coming from a different system. This request will have userId in it. This request might end up going to a EC2 instance2. We are supposed to send a notification to the user based on the userId in the request.
Now,
1) Assume, the user session is with the EC2 instance1, but the notification is originating from the EC2 instance2.
I am not sure how to notify the user browser in this case.
2) Is there any limitation on the websocket connection like 64K and how to overcome with multiple servers, since user is coming thru Load balancer.
Thanks
You will need something else to notify the browser's websocket's server end about the event coming from the other system. There are a couple of publish-subscribe based solution which might help, but without knowing more details it is a bit hard to figure out which solution fits the best. Redis is generally a good answer, and Elasticache supports it.
I found this regarding to AWS ELB's limits:
http://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html#limits_elastic_load_balancer
But none of them seems to be related to your question.
Websocket requests start with HTTP communication before handing over to websockets. In theory if you could include a cookie in that initial HTTP request then the sticky session features of ELB would allow you to direct websockets to specific EC2 instances. However, your websocket client may not support this.
A preferred solution would be to make your EC2 instances stateless. Store the websocket session data in AWS Elasticache (Either Redis or Memcached) and then incoming connections will be able to access the session regardless of which EC2 instance is used.
The advantage of this solution is that you remove the dependency on individual EC2 instances and your application will scale and handle failures better.
If the ELB has too many incoming connections, then it should scale automatically. Although I can't find a reference for that. ELB's are relatively slow to scale - minutes rather than seconds, if you are expecting surges in traffic then AWS can "pre-warm" more ELB resource for you. This is done via support requests.
Also, factor in the ELB connection time out. By default this is 60 seconds, it can be increased via the AWS console or API. Your application needs to send at least 1 byte of traffic before the timeout or the ELB will drop the connection.
Recently had to hook up crossbar.io websockets with ALB. Basically there are two things to consider. 1) You need to set stickiness to 1 day on the target group attributes. 2) You either need something on the same port that returns static webpage if connection is not upgraded, or a separate port serving a static webpage with a custom health check specifying that port on the target group. Go for a ALB over ELB, ALB's have support for ws:// and wss://, they only lack the health check over websockets.

tcp_tw_recycle behind application level load balancer?

Given that our linux servers never open direct connections to our clients, is it safe to use tcp_tw_recycle on them ?
Those servers are behind a application level load-balancer and all the connections i see on them are between internal 10.x.x.x addresses.
Thanks
We have such a load balancer provided by AWS (ELB), so I'll provide my advice based on that:
Why gamble? If your overhead/port-consumption is coming from quick client connections, Amazon recommends enabling persistent connections on your ELB instead. (I asked them about this question specifically and got that recommendation...our Amazon contact does not recommend enabling tcp_tw_recycle).
That said, if, say it's another internal box they're struggling to establish rapid connections with (apache-php chatting with MySQL on behalf of the client without persistent connections), you might be able to get away with it:
If ALL client connections will be via the ELB (please set your security group accordingly), then technically speaking you shouldn't encounter problems for the tcp_tw_recycle timestamp jumping cases I'm aware of:
ELB is a termination point on behalf of the client (their NAT firewall won't factor in, and ELB is not NAT based)
The ELB box(es) will not reset themselves, acquire the same IP address, and still be assigned as your ELB (will be someone else's if it happens at all)
The ELB box(es) will not be replaced by another ELB machine using the same IP and still be serving your traffic as your ELB (will be someone else's if it happens at all)
*2 and 3 are not a guarantee from Amazon, but it does appear to be their behavior, just as stop/start will get you a new private IP for EC2 boxes). If that did happen, I'd imagine it is a thing of extremely low probability.
You could theoretically run into issues restarting your own boxes if they communicate with other service machines (like MySQL or memcached) and you restart (not stop/start) one of your boxes, or move their elastic IP to another box and are not using private IPs for internal chatter. But you have some control over this. However, if it's all on the AWS cloud (or your fast internal network), issues are extremely unlikely (unless your AWS zone is having a bad day, and you're restarting/replacing your systems for that reason).
A buddy and I had a long-standing argument about this, and he won by proving his point with a long running 4k browser (fast script) load test via Neustar...there were no connection issues from the client side via ELB, and eliminating the overhead helped quite a bit :-)
If you haven't already, consider tcp_tw_reuse (we were using this to keep the ephemeral port range active before the above mentioned test showed the additional merit of eliminating the overhead with tcp_tw_recycle for us). Be sure to watch your counters on ifconfig if you do decide to disable that chunk of the protocol ;-P.
The following is also a good summary resource on the topic of timestamps jumping: Dropping of connections with tcp_tw_recycle

WebSockets and Load Balancing, a bottleneck?

When having a bunch of systems that act as WebSocket drones and a Load Balancer in front of those drones. When a WebSocket request comes into the LB it chooses a WebSocket drone, and the WebSocket is established. (I use AWS ELB tcp SSL-terminated at ELB)
Question:
Now does the created WebSocket go through the LB, or does the LB forward the WebSocket request to a WebSocket drone and thus there is a direct link between client and a WebSocket drone?
If the WebSocket connection goes through the LB, this would make the LB a huge bottleneck.
Removing the LB and handing clients a direct IP of a WebSocket drone could circumvent this bottleneck but requires creating this logic myself, which I'm planning to do (depending on this questions' answers).
So are my thoughts on how this works correct?
AWS ELB as LB
After looking at the possible duplicate suggested by Pavel K I conclude that the WebSocket connection will go through the AWS ELB, as in:
Browser <--WebSocket--> LB <--WebSocket--> WebSocketServer
This makes the ELB a bottleneck, what I would have wanted is:
Browser <--WebSocket--> WebSocketServer
Where the ELB is only used to give the client a hostname/IP of an available WebSocketServer.
DNS as LB
The above problem could be circumvented by balancing on DNS-level, as explained in the possible duplicate. Since that way DNS will give an IP of an available WebSocketServer when ws.myapp.com is requested.
Downside is that this would require constantly updating DNS with up/down WebSocketServer changes (if your app is elastic this becomes even more of a problem).
Custom LB
Another option could be to create a custom LB which constantly monitors WebSocketServers and gives back the IP of an available WebSocketServer when the client requests so.
Downside is that the client needs to perform a separate (AJAX) request to get the IP of an available WebSocketServer, whereas with AWS ELB the Load Balancing happens implicitly.
Conclusion
Choosing the better evil..

Resources