GCP Cloud Endpoints latency - proxy

The product overview of Cloud Endpoints states:
Extensible Service Proxy delivers security and insight in less than 1ms per call.
But I'm observing more than 10ms of added latency (sometimes closer to 100ms).
Our server settings are:
we have a GKE cluster, which has:
a Kubernetes deployment for pods, each of which has an ESP container and our own container, which serves a gRPC service
a Kubernetes service (of LoadBalancer type) whose target refers to the ESP container
we have an endpoint configuration for the gRPC service, which has only the basic stuff, as below.
we issued an API key for clients
We had a client program in another GKE cluster in the same zone for this experiment.
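For reference, the experiment client can be sketched roughly as below. This is a hypothetical Go client, not our actual code: the address, generated stubs, and method name are placeholders, and it assumes the API key is passed to ESP in the x-api-key metadata entry, which is the usual approach for gRPC calls through Cloud Endpoints.

package main

import (
    "context"
    "log"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/metadata"

    pb "example.com/bar/foo" // placeholder: generated stubs for com.bar.FooService
)

func main() {
    // Placeholder address: the LoadBalancer IP of the Kubernetes Service fronting ESP.
    conn, err := grpc.Dial("LOADBALANCER_IP:80", grpc.WithInsecure()) // plaintext for the in-zone experiment
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()
    client := pb.NewFooServiceClient(conn)

    // 15ms deadline, as in the experiment.
    ctx, cancel := context.WithTimeout(context.Background(), 15*time.Millisecond)
    defer cancel()
    // Attach the API key for ESP as the x-api-key metadata entry.
    ctx = metadata.AppendToOutgoingContext(ctx, "x-api-key", "OUR_API_KEY")

    start := time.Now()
    _, err = client.DoSomething(ctx, &pb.DoSomethingRequest{}) // placeholder method
    log.Printf("call took %v, err = %v", time.Since(start), err)
}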
With this setting, our experiments showed:
with a 15ms timeout on the client's end, more than 95% of calls timed out
on the GCP Endpoints dashboard, the majority of requests took more than 100ms
in Stackdriver Trace, all of the latency is attributed to "Backend"
when measured at our own container, the latency was below 5ms
The server's CPU load was very low (below 10%), and there was no sign of overload at the time.
Assuming gRPC does not add much latency, we think the latency was probably from the ESP.
So, we ran another experiment with the ESP bypassed:
we modified the Kubernetes Service so that it targets our own container instead of the ESP container
After this change, the latency measured at the client dropped to 5ms.
So, if our experiments were correct, it seems the ESP container adds far more latency than the 1ms advertised in the product overview. Are we missing something?
Endpoint configuration:
type: google.api.Service
config_version: 3
name: foo.endpoints.bar.cloud.goog
title: foo in bar
apis:
- name: com.bar.FooService

Related

How does AWS Application Load balancer select a target within a target group? How to load balance the websocket traffic?

I have an AWS Application load balancer to distribute the http(s) traffic.
Problem 1:
Suppose I have a target group with 2 EC2 instances: micro and xlarge. Obviously they can handle different traffic levels. Does the load balancer distribute traffic proportionally to instance size, or does it just round-robin? If only round robin is used and no other factors are taken into account, then it's not really balancing load, because at some point the micro instance will be suffering from the traffic while the xlarge one will starve.
Problem 2:
Suppose I have a target group with 2 EC2 instances, both the same size. But my service is not using a classic HTTP request/response flow. It is using websockets, i.e. a client makes an HTTP request just once, to establish a socket, and then keeps the socket open for a long time, sending and receiving messages (e.g. a chat service). Let's suppose my load balancer is using round robin and both EC2 instances have 1000 clients connected each. Now suppose one of the EC2 instances goes down and its 1000 connected clients drop their socket connections. The instance gets back up quickly and is ready to accept websocket calls again. The 1000 clients who dropped try to reconnect. Now, if the load balancer uses pure round robin, I'll end up with 1500 clients connected to instance #1 and 500 clients connected to instance #2, thus not really balancing the load correctly.
Basically, I'm trying to find out whether some more advanced logic is used to select a target in a group, or whether it's just naive round-robin selection. If it's round robin only, then how can I really balance the websocket connection load?
Websockets start out as http or https connections, so a load balancer can dispatch them to a server. Once the server accepts the http connection, both the server and the client "upgrade" the connection to use the websocket protocol. They then leave the connection open to use for websocket traffic. As far as the load balancer can tell, the connection is simply a long-lasting http connection.
Taking a server down when it has websocket connections to clients requires your application to retry lost connections. Reconnecting on connection failure is one of the trickiest parts of websocket client programming. Your application cannot be robust without reconnect logic.
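As a rough illustration (not from the original answer), a minimal reconnect loop in Go with gorilla/websocket might look like the sketch below; the URL and backoff values are placeholders.

package main

import (
    "log"
    "time"

    "github.com/gorilla/websocket"
)

func runClient(url string) {
    backoff := time.Second
    for {
        conn, _, err := websocket.DefaultDialer.Dial(url, nil)
        if err != nil {
            log.Printf("dial failed: %v; retrying in %v", err, backoff)
            time.Sleep(backoff)
            if backoff < 30*time.Second {
                backoff *= 2 // exponential backoff, capped at 30s
            }
            continue
        }
        backoff = time.Second // reset after a successful connect

        // Read until the connection drops, then fall through and reconnect.
        for {
            _, msg, err := conn.ReadMessage()
            if err != nil {
                log.Printf("connection lost: %v", err)
                conn.Close()
                break
            }
            log.Printf("received: %s", msg)
        }
    }
}

func main() {
    runClient("wss://example.com/ws") // placeholder URL
}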
AWS's load balancer has no built-in knowledge of the capabilities of the servers behind it. You have observed that it sends requests equally to big and small servers. That can overwhelm the small ones.
I have managed this by building a /healthcheck endpoint in my servers. It's a straightforward https://example.com/healthcheck web page. You can put a little bit of content on the page announcing how many websocket connections are currently open, or anything else. Don't password-protect it or require a session to hit it.
My /healthcheck endpoints, whenever hit, measure the server load. I simply use the number of current websocket connections, but you can use any metric you want. I compare the current load to a load threshold configured for each server. For example, on a micro instance I can handle 20 open websockets, and on a production instance I can handle 400.
If the server load is too high, my endpoint gives back a 503 http error status along with its content. 503 typically means "I am overloaded, please try again later." It can also mean "I will shut down when all my connections are closed. Please don't use me for any more connections."
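As a sketch of the idea (not the author's actual code; the connection counter and threshold are placeholders for whatever load metric you use):

package main

import (
    "fmt"
    "net/http"
    "sync/atomic"
)

// currentConnections is incremented/decremented by the websocket handlers (not shown).
var currentConnections int64

// maxConnections is the per-instance threshold, e.g. 20 on a micro, 400 on a bigger box.
const maxConnections int64 = 400

func healthcheckHandler(w http.ResponseWriter, r *http.Request) {
    open := atomic.LoadInt64(&currentConnections)
    if open >= maxConnections {
        // 503 tells the load balancer this instance is overloaded and should be
        // taken out of rotation until the load drops.
        w.WriteHeader(http.StatusServiceUnavailable)
    } else {
        w.WriteHeader(http.StatusOK)
    }
    fmt.Fprintf(w, "open websocket connections: %d / %d\n", open, maxConnections)
}

func main() {
    http.HandleFunc("/healthcheck", healthcheckHandler)
    http.ListenAndServe(":8080", nil)
}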
Then I configure the load balancer to perform those health checks every couple of minutes on all the servers in the server pool (AWS calls the pool a "target group"). The health check operation detects "unhealthy" servers and temporarily takes them out of its rotation. (The health check also detects crashed servers, which is good.)
You need this load balancer health check for a large-scale production setup.
All that being said, you will get best results if all your server instances in your pool have roughly the same capacity as each other.

Does ALB over the gRPC protocol return network-related errors when scaling concurrent load?

We were experimenting with load balancing strategies for gRPC-based services in the AWS cloud. In addition to the client-side load balancing recommended by the gRPC platform, we also wanted to try the ALB offered by AWS over the gRPC protocol. We created a gRPC service written in golang with two instances and followed all the steps: creating target groups, configuring an ALB over the gRPC protocol, and setting up health checks. We wrote a load generation tool [in golang] to send concurrent requests to the service. The load generation tool creates a single gRPC client connection and uses it to send concurrent requests. When the concurrency [workers] is increased [~1000] and the test runs for a period of time, some requests fail with the below error.
code = Unavailable desc = transport is closing
For 250K requests to the ALB over 20 minutes, around 1K requests failed in small batches with the above error.
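For context, a load generator of the shape described above might look roughly like the sketch below (a hypothetical reconstruction, not the actual tool; the stubs, RPC name, and ALB address are placeholders): one shared ClientConn, many goroutines issuing calls on it.

package main

import (
    "context"
    "crypto/tls"
    "log"
    "sync"
    "sync/atomic"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials"

    pb "example.com/service" // placeholder: generated stubs for the golang service
)

func main() {
    // A single client connection to the ALB, shared by every worker.
    conn, err := grpc.Dial("my-alb.example.com:443",
        grpc.WithTransportCredentials(credentials.NewTLS(&tls.Config{})))
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()
    client := pb.NewMyServiceClient(conn)

    const workers = 1000
    deadline := time.Now().Add(20 * time.Minute)
    var failures int64
    var wg sync.WaitGroup

    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for time.Now().Before(deadline) {
                ctx, cancel := context.WithTimeout(context.Background(), time.Second)
                _, err := client.DoWork(ctx, &pb.DoWorkRequest{}) // placeholder RPC
                cancel()
                if err != nil {
                    atomic.AddInt64(&failures, 1)
                    log.Printf("request failed: %v", err) // e.g. "transport is closing"
                }
            }
        }()
    }
    wg.Wait()
    log.Printf("total failures: %d", atomic.LoadInt64(&failures))
}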
Then, to identify the root cause, we used an NLB to run the same load and didn't get any errors.
Note: We are aware that an NLB won't load balance requests from a single client across multiple instances. This was done just to identify the cause of the error.
We added channelz to the service and monitored the number of failed messages across all channels/sockets. The number of failures is below a hundred [~70] in the channelz stats.
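For reference, wiring channelz into a Go gRPC server is just a registration call on the server; a minimal sketch (the port is a placeholder, and the application's own service registration is omitted):

package main

import (
    "log"
    "net"

    "google.golang.org/grpc"
    channelzsvc "google.golang.org/grpc/channelz/service"
)

func main() {
    lis, err := net.Listen("tcp", ":50051") // placeholder port
    if err != nil {
        log.Fatal(err)
    }
    s := grpc.NewServer()
    // Expose channelz data (channels, subchannels, sockets and their failure counts)
    // so it can be queried while the load test runs.
    channelzsvc.RegisterChannelzServiceToServer(s)
    // ... register the application's own gRPC services here ...
    log.Fatal(s.Serve(lis))
}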
We also noticed that the monitoring stats for the ALB showed 4xx error codes.
Please share suggestions for debugging the failures from the ALB, or articles on the internals of the AWS ALB that could help figure out a solution.

How to improve/minimize varying response time of api

I created a REST API and I am not very happy with its performance. I spent some time investigating and stumbled across a tool that easily tracks the performance of my API (www.apiscience.com).
They split the overall response time into 4 categories: connect, resolve, processing, and transfer. The resolve part often takes about 150ms, while the processing of the call itself only takes about 18ms, which results in an average response time of 160ms (the call I tried here is really simple, so the average would normally be higher).
My question is: how can I improve/minimize the resolve time for my calls?
(side info: my servers are placed in Ireland and I chose Ireland as location for the tests too)
Thanks in advance!
Edit - What do they mean by Resolve Time?
(https://www.apiscience.com/blog/what-do-api-sciences-curl-based-timings-mean/)
API Science’s “Resolve Time” is the equivalent of Ken’s “DNS Lookup.” DNS stands for Domain Name System. A URL consists of text (and sometimes numbers); however, the communication addresses that compose the Internet are formulated as IP (Internet Protocol) addresses, for example, 208.80.152.2. Before a request can be routed between the requesting client and the server that will process the request, the IP address that the URL refers to must be looked up. A request is sent to a DNS resolver by curl, and the resolver returns the correlated IP address. API Science’s “Resolve Time” is the time in milliseconds that it took this operation to complete.
As the documentation mentions, the DNS resolution time is the amount of time an API-consuming client waits before finding out where to route the actual calls to your API server - the mapping between your server's name and its IP address.
Where you host your DNS can be completely independent from both where you host your API service and where your domain name is registered, and there are multiple choices in the market for DNS hosting. DNSPerf (with which I have no affiliation) compares DNS hosting services and is probably a good starting point for further research if you'd like to select a new DNS provider.
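If you want to measure the lookup on its own before switching providers, a small Go sketch like the one below times just the DNS resolution (the hostname is a placeholder; note that a warm local resolver cache will make this much faster than the uncached lookups a remote monitoring agent performs):

package main

import (
    "context"
    "fmt"
    "log"
    "net"
    "time"
)

func main() {
    host := "api.example.com" // placeholder: your API's hostname

    start := time.Now()
    addrs, err := net.DefaultResolver.LookupIPAddr(context.Background(), host)
    if err != nil {
        log.Fatal(err)
    }
    // Only the lookup is timed here, independent of connect/processing/transfer.
    fmt.Printf("resolved %s to %v in %v\n", host, addrs, time.Since(start))
}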

GKE + WebSocket + NodePort 30s dropped connections

I have a golang service that implements a WebSocket client using gorilla and is exposed to a Google Container Engine (GKE)/k8s cluster via a NodePort (30002 in this case).
I've got a manually created load balancer (i.e. NOT a k8s ingress/load balancer) with HTTP/HTTPS frontends (i.e. 80/443) that forwards traffic to nodes in my GKE/k8s cluster on port 30002.
I can get my JavaScript WebSocket implementation in the browser (Chrome 58.0.3029.110 on OSX) to connect, upgrade and send / receive messages.
I log ping/pongs in the golang WebSocket client and all looks good until 30s in. 30 seconds after connecting, my golang WebSocket client gets an EOF / close 1006 (abnormal closure) and my JavaScript code gets a close event. As far as I can tell, neither my Golang nor my JavaScript code is initiating the WebSocket closure.
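(For illustration only, not the asker's actual code: on the Go side that accepts the browser connections, gorilla ping/pong logging can be wired up roughly as in the sketch below; the path, port, and intervals are placeholders.)

package main

import (
    "log"
    "net/http"
    "time"

    "github.com/gorilla/websocket"
)

var upgrader = websocket.Upgrader{CheckOrigin: func(r *http.Request) bool { return true }}

func wsHandler(w http.ResponseWriter, r *http.Request) {
    conn, err := upgrader.Upgrade(w, r, nil)
    if err != nil {
        log.Println("upgrade:", err)
        return
    }
    defer conn.Close()

    conn.SetPongHandler(func(appData string) error {
        log.Println("pong received")
        return nil
    })

    // Send a ping every 10s; failures get logged as well.
    go func() {
        for range time.Tick(10 * time.Second) {
            if err := conn.WriteControl(websocket.PingMessage, nil, time.Now().Add(time.Second)); err != nil {
                log.Println("ping failed:", err)
                return
            }
            log.Println("ping sent")
        }
    }()

    // Read loop; the EOF / close 1006 at ~30s shows up here as a read error.
    for {
        _, msg, err := conn.ReadMessage()
        if err != nil {
            log.Println("read error:", err)
            return
        }
        log.Printf("message: %s", msg)
    }
}

func main() {
    http.HandleFunc("/ws", wsHandler) // placeholder path; the Service maps this port to NodePort 30002
    log.Fatal(http.ListenAndServe(":8080", nil))
}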
I don't particularly care about session affinity in this case AFAIK, but I have tried both IP and cookie based affinity in the load balancer with long lived cookies.
Additionally, this exact same set of k8s deployment/pod/service specs and golang service code works great on my KOPS based k8s cluster on AWS through AWS' ELBs.
Any ideas where the 30s forced closures might be coming from? Could that be a k8s default cluster setting specific to GKE or something on the GCE load balancer?
Thanks for reading!
-- UPDATE --
There is a backend configuration timeout setting on the load balancer which is for "How long to wait for the backend service to respond before considering it a failed request".
The WebSocket is not unresponsive. It is sending ping/pong and other messages right up until getting killed which I can verify by console.log's in the browser and logs in the golang service.
That said, if I bump the load balancer backend timeout setting to 30000 seconds, things "work".
It doesn't feel like a real fix, though, because the load balancer would keep feeding traffic to genuinely unresponsive services, to say nothing of the case where the WebSocket itself becomes unresponsive.
I've isolated the high timeout setting to a specific backend setting using a path map, but hoping to come up with a real fix to the problem.
I think this may be Working as Intended. Google just updated the documentation today (about an hour ago).
LB Proxy Support docs
Backend Service Components docs
Cheers,
Matt
Check out the following example: https://github.com/kubernetes/ingress-gce/tree/master/examples/websocket

PaaS for WebSocket

I am looking for a WebSocket-enabled PaaS. So far I have only experimented with Heroku, and it works quite well. Would you recommend other services?
Side question: I'm slightly worried about the billing. In the case of Heroku, it seems that usage is calculated based on the time dynos are busy. I guess that in the case of a WebSocket connection, there may be a lot of idle time between data exchanges, and it would be fully billed anyway. Is that correct?
Heroku will bill you for the time the dyno is up, whether or not it is being used at all.
We've used Pusher as a complete websocket service, which allows you to asynchronously publish events from your main Heroku app and off-load the websocket connections and event publishing to Pusher.
They charge based on the volume of websocket traffic, which might be cheaper if you have a small volume or peaky traffic, and don't want to pay for a consistent set of dynos needed to service your peak traffic.
