AWS ALB returning 502 without any log entries - amazon-ec2

We're using node js backend servers running in AWS ECS, behind an ALB. We then have AWS API gateway with a proxy lambda calling the ALB. This has been running in production for months, when suddenly a few days ago we started seeing 502 errors from some API calls.
I've checked the proxy lambda logs to see that the 502 is returned from the ALB. However, when I check my node application logs, there are no failing requests, in fact no requests seem to have reached the application at these timestamps. I then enabled access logs on the ALB, which only shows 200/201 responses - no 5xx whatsoever. I'm now a bit confused as to where to look next. What could cause my ALB to return 502 without this being present in the ALB access logs? And what could cause the requests to not reach my node app in ECS? Does anyone have any idea on what logs to check next or what to do to pinpoint the errors? Could some layer within ECS cause those symptoms? I can't see any errors in my docker containers or anything.
It seems to happen in bursts, up to 50 failed requests within a period of time, then all ok for several hours.

It could be due to a number of reasons. The below may be applicable to you -
The load balancer received a TCP RST from the target when attempting
to establish a connection.
The load balancer received an unexpected response from the target,
such as "ICMP Destination unreachable (Host unreachable)", when
attempting to establish a connection. Check whether traffic is allowed
from the load balancer subnets to the targets on the target port.
The target closed the connection with a TCP RST or a TCP FIN while the
load balancer had an outstanding request to the target. Check whether
the keep-alive duration of the target is shorter than the idle timeout
value of the load balancer.
The target response is malformed or contains HTTP headers that are not
valid.
The load balancer encountered an SSL handshake error or SSL handshake
timeout (10 seconds) when connecting to a target.
reference docs

This turned out to be memory leaks in my container applications. The RAM usage grew with every request until crash. At that point it took a while for ECS and ALB to react, so a bunch of requests were routed to the dead instance.
The problem was resolved by fixing the leak, but I'd have wanted better built in support for alarms on high memory usage from ECS/cloudwatch with triggers to replace instances on high usage gracefully. Seems i have to build that from scratch.

Related

How does AWS Application Load balancer select a target within a target group? How to load balance the websocket traffic?

I have an AWS Application load balancer to distribute the http(s) traffic.
Problem 1:
Suppose I have a target group with 2 EC2 instances: micro and xlarge. Obviously they can handle different traffic levels. Does the load balancer manage traffic proportionally to instance sizes or just round robin? If only round robin is used and no other factors taken into account, then it's not really balancing load, because at some point the micro instance will be suffering from the traffic, while xlarge will starve.
Problem 2:
Suppose I have target group with 2 EC2 instances, both are same size. But my service is not using a classic http request/response flow. It is using HTTP websockets, i.e. a client makes HTTP request just once, to establish a socket, and then keeps the socket open for longer time, sending and receiving messages (e.g. a chat service). Let's suppose my load balancer is using round robin and both EC2 instances have 1000 clients connected each. Now suppose one of the EC2 instances goes down and 1000 connected clients drop their socket connections. The instance gets back up quickly and is ready to accept websocket calls again. The 1000 clients who dropped are trying to reconnect. Now, if the load balancer would use pure round robin, I'll end up with 1500 clients connected to instance #1 and 500 clients connected to instance #2, thus not really balancing the load correctly.
Basically, I'm trying to find out if some more advanced logic is being used to select a target in a group, or is it just a naive round robin selection. If it's round robin only, then how can I really balance the websocket connections load?
Websockets start out as http or https connections, so a load balancer can dispatch them to a server. Once the server accepts the http connection, both the server and the client "upgrade" the connection to use the websocket protocol. They then leave the connection open to use for websocket traffic. As far as the load balancer can tell, the connection is simply a long-lasting http connection.
Taking a server down when it has websocket connections to clients requires your application to retry lost connections. Reconnecting on connection failure is one of the trickiest parts of websocket client programming. Your application cannot be robust without reconnect logic.
AWS's load balancer has no built-in knowledge of the capabilities of the servers behind it. You have observed that it sends requests equally to big and small servers. That can overwhelm the small ones.
I have managed this by building a /healthcheck endpoint in my servers. It's a straightforward https://example.com/heathcheck web page. You can put a little bit of content on the page announcing how many websocket connections are currently open, or anything else. Don't password protect it or require a session to hit it.
My /healthcheck endpoints, whenever hit, measure the server load. I simply use the number of current websocket connections, but you can use any metric you want. I compare the current load to a load threshold configured for each server. For example, on a micro instance I can handle 20 open websockets, and on a production instance I can handle 400.
If the server load is too high, my endpoint gives back a 503 http error status along with its content. 503 typically means "I am overloaded, please try again later." It can also mean "I will shut down when all my connections are closed. Please don't use me for any more connections."
Then I configure the load balancer to perform those health checks every couple of minutes on all the servers in the server pool (AWS calls the pool a "target group"). The health check operation detects "unhealthy" servers and temporarily takes them out of its rotation. (The health check also detects crashed servers, which is good.)
You need this loadbalancer health check for a large-scale production setup.
All that being said, you will get best results if all your server instances in your pool have roughly the same capacity as each other.

Does ALB over grpc protocol return network related errors when scaling concurrent load?

We were experimenting load balancing startegies for grpc based services in aws cloud. In addition to client side load balancing recommened in grpc platform, we also wanted to try the ALB offered in aws over the grpc protocol. We created a grpc service written in golang with two instances and followed all the steps like creating Target groups, configuring an ALB over grpc protocol and health checks. We wrote a load generation[in golang] tool to send concurrent requests to the service. The load generation tool creates a single grpc client connection and uses the same to send concurrent requests. When the concurrency[workers] is increased[~1000] and run for a period of time, some requests are failing with below error.
code = Unavailable desc = transport is closing
For 250K requests to the ALB in 20mins, around 1k requests were failing in small batches with the above error.
Then to identify the root cause, we used a NLB to test the same load and didn't get any errors.
Note: We are aware that NLB won't load balance requests over single client to multiple instances. This is done just to identify the cause of error.
We added channelz to the service and monitored the number of failed messages in all channels/sockets. The number of failures are below hunder[~70] in the channelz stats.
We also noticed that the monitoring stats for the alb showed 4xx error codes.
Please share suggestions to debug the failures from ALB or articles around the internals of AWS ALB to figure out the solution.

GKE + WebSocket + NodePort 30s dropped connections

I have a golang service that implements a WebSocket client using gorilla that is exposed to a Google Container Engine (GKE)/k8s cluster via a NodePort (30002 in this case).
I've got a manually created load balancer (i.e. NOT at k8s ingress/load balancer) with HTTP/HTTPS frontends (i.e. 80/443) that forward traffic to nodes in my GKE/k8s cluster on port 30002.
I can get my JavaScript WebSocket implementation in the browser (Chrome 58.0.3029.110 on OSX) to connect, upgrade and send / receive messages.
I log ping/pongs in the golang WebSocket client and all looks good until 30s in. 30s after connection my golang WebSocket client gets an EOF / close 1006 (abnormal closure) and my JavaScript code gets a close event. As far as I can tell, neither my Golang or JavaScript code is initiating the WebSocket closure.
I don't particularly care about session affinity in this case AFAIK, but I have tried both IP and cookie based affinity in the load balancer with long lived cookies.
Additionally, this exact same set of k8s deployment/pod/service specs and golang service code works great on my KOPS based k8s cluster on AWS through AWS' ELBs.
Any ideas where the 30s forced closures might be coming from? Could that be a k8s default cluster setting specific to GKE or something on the GCE load balancer?
Thanks for reading!
-- UPDATE --
There is a backend configuration timeout setting on the load balancer which is for "How long to wait for the backend service to respond before considering it a failed request".
The WebSocket is not unresponsive. It is sending ping/pong and other messages right up until getting killed which I can verify by console.log's in the browser and logs in the golang service.
That said, if I bump the load balancer backend timeout setting to 30000 seconds, things "work".
Doesn't feel like a real fix though because the load balancer will continue to feed actual unresponsive services traffic inappropriately, never mind if the WebSocket does become unresponsive.
I've isolated the high timeout setting to a specific backend setting using a path map, but hoping to come up with a real fix to the problem.
I think this may be Working as Intended. Google just updated the documentation today (about an hour ago).
LB Proxy Support docs
Backend Service Components docs
Cheers,
Matt
Check out the following example: https://github.com/kubernetes/ingress-gce/tree/master/examples/websocket

If the number of requests are huge, can load balancer cause the issue while sending responses to respective clients?

I do have architecture of a Load balancer followed by two Web Application server and Database, I am hitting thousands of HTTP requests to the server from Jmeter distributed testing environment.
At the time of getting response back, few request does not get response back from the server.
I checked Database logs, 100 % requests were responded.
Checked with Web Application servers access logs, 100 % requests were responded.
Can Load balancer cause the damage traversing these pending responses to the respective clients?
Every time different different request are getting stuck.
Thanks in Advance!!
If you suspect load balancer, look at 3 typical causes first:
Server takes longer to respond than load balancer is waiting
Client has shorter timeout than it takes for server to respond.
Port/thread/connection exhaustion on load balancer, or other LB configuration problems
In all three cases, I suggest looking at the load balancer logs. Since you didn't specify which LB you are using, I cannot say exactly how the log looks, but typically LB log gives you option to see:
How long it took for a request to be sent to a web server and for the response from the web server to return to load balancer. You can them compare those numbers to timeouts configured for load balancer and the client (problem 1 and 2).
How long it took for a request from the client to be processed by LB and how long LB took to respond to a client. If it takes long, then something is not right with load balancer (problem 3)
And then of course if you have any errors on load balancer, they may just explain what's going on.
If you cannot review logs for load balancer, I suggest changing your JMeter test temporarily to target servers behind load balancer directly. You can even configure your script to evenly distribute load between all servers (for example by using multiple thread groups). That would allow you to isolate the problem, and get more information on what's going on.

What does the Amazon ELB automatic health check do and what does it expect?

Here is the thing:
We've implemented a C++ RESTful API Server, with built-in HTTP parser and no standard HTTP server like apache or anything of the kind
It has been in use for several months in Amazon structure, using both plain and SSL communications, and no problems have been identified, related to Amazon infra-structure
We are deploying our first backend using Amazon ELB
Amazon ELB has a customizable health check system but also as an automatic one, as stated here
We've found no documentation of what data is sent by the health check system
The backend simple hangs on the socket read instruction and, eventually, the connection is closed
I'm not looking for a solution for the problem since the backend is not based on a standard web server, just if someone knows what kind of message is being sent by the ELB health check system, since we've found no documentation about this, anywhere.
Help is much appreciated. Thank you.
Amazon ELB has a customizable health check system but also as an
automatic one, as stated here
With customizable you are presumably referring to the health check configurable via the AWS Management Console (see Configure Health Check Settings) or via the API (see ConfigureHealthCheck).
The requirements to pass health checks configured this way are outlined in field Target of the HealthCheck data type documentation:
Specifies the instance being checked. The protocol is either TCP,
HTTP, HTTPS, or SSL. The range of valid ports is one (1) through
65535.
Note
TCP is the default, specified as a TCP: port pair, for example
"TCP:5000". In this case a healthcheck simply attempts to open a TCP
connection to the instance on the specified port. Failure to connect
within the configured timeout is considered unhealthy.
SSL is also specified as SSL: port pair, for example, SSL:5000.
For HTTP or HTTPS protocol, the situation is different. You have to
include a ping path in the string. HTTP is specified as a
HTTP:port;/;PathToPing; grouping, for example
"HTTP:80/weather/us/wa/seattle". In this case, a HTTP GET request is
issued to the instance on the given port and path. Any answer other
than "200 OK" within the timeout period is considered unhealthy.
The total length of the HTTP ping target needs to be 1024 16-bit
Unicode characters or less.
[emphasis mine]
With automatic you are presumably referring to the health check described in paragraph Cause within Why is the health check URL different from the URL displayed in API and Console?:
In addition to the health check you configure for your load balancer,
a second health check is performed by the service to protect against
potential side-effects caused by instances being terminated without
being deregistered. To perform this check, the load balancer opens a
TCP connection on the same port that the health check is configured to
use, and then closes the connection after the health check is
completed. [emphasis mine]
The paragraph Solution clarifies the payload being zero here, i.e. it is similar to the non HTTP/HTTPS method described for the configurable health check above:
This extra health check does not affect the performance of your
application because it is not sending any data to your back-end
instances. You cannot disable or turn off this health check.
Summary / Solution
Assuming your RESTful API Server, with built-in HTTP parser is supposed to serve HTTP only indeed, you will need to handle two health checks:
The first one you configured yourself as a HTTP:port;/;PathToPing - you'll receive a HTTP GET request and must answer with 200 OK within the specified timeout period to be considered healthy.
The second one configured automatically by the service - it will open a TCP connection on the HTTP port configured above, won't send any data, and then closes the connection after the health check is completed.
In conclusion it seems that your server might be behaving perfectly fine already and you are just irritated by the 2nd health check's behavior - does ELB actually consider your server to be unhealthy?
As far as I know it's just an HTTP GET request looking for a 200 OK http response.

Resources