kube-apiserver: Slow pod creation - performance

For the past week we have been experiencing very slow response times from the kube-apiserver when trying to create pods. The kube-apiserver responds in about 30s on average (previously it was between 2 and 5s). Because of this, many applications are timing out or not behaving as expected. How can this problem be debugged server-side, given that GKE provides no visibility into the master nodes?
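Since GKE hides the control plane, one client-side step is to time the bare API call itself, which at least separates apiserver latency from scheduling and kubelet delays. A minimal sketch using the Python kubernetes client, assuming kubeconfig access; the namespace and image are arbitrary choices for illustration:

```python
# Minimal probe: time how long a bare pod create takes against the apiserver.
# Namespace and image are arbitrary; requires the `kubernetes` Python package.
import time
from kubernetes import client, config

config.load_kube_config()  # uses the current kubectl context
v1 = client.CoreV1Api()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(generate_name="latency-probe-"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[client.V1Container(name="pause", image="registry.k8s.io/pause:3.9")],
    ),
)

start = time.monotonic()
created = v1.create_namespaced_pod(namespace="default", body=pod)
print(f"pod create took {time.monotonic() - start:.2f}s")

# Clean up the probe pod.
v1.delete_namespaced_pod(created.metadata.name, "default")
```

If the create call itself takes ~30s, the slowness is on the apiserver path (for example admission webhooks or etcd) rather than in scheduling or on the nodes; if it returns quickly, the delay is further down the line.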

Related

GKE and RPS - 'Usage is at capacity' - and performance issues

We have a GKE cluster with Ingress (kubernetes.io/ingress.class: "gce") where one backend is serving our production site.
The cluster is a regional one with 3 zones (autoscaling enabled).
The backend serving the production site is a Varnish server running as a Deployment with a single replica. Behind Varnish there are multiple Nginx/PHP pods running under a HorizontalPodAutoscaler.
The performance of the site is slow. Using the GCP console we have noticed that all traffic is routed to only one backend, and there is only 1/1 healthy endpoint in one zone.
We are getting an exclamation mark next to the serving backend with the messages 'Usage is at capacity, max = 1' and 'Backend utilization: 0%'. The other backend in the second zone has no endpoint configured, and there is no third backend in the third zone at all.
Initially we were getting a lot of 5xx responses from the backend at around 80 RPS, so we turned on CDN via BackendConfig.
This reduced the 5xx responses and brought the backend down to around 9 RPS, with around 83% of requests now served from the CDN.
We are trying to figure out whether it is possible to improve our backend utilization, as serving 80 RPS from one Varnish server with many pods behind it should be easily achievable. We cannot find any underperforming pod (Varnish itself or Nginx/PHP) in this scenario.
Is GKE/GCP throttling the backend/endpoint to support only 1 RPS?
Is there any way to increase the RPS per endpoint and the number of endpoints, at least one per zone?
Is there any documentation that explains how to scale such an architecture on GKE/GCP?
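On the endpoint-count question, one direction often suggested for GKE Ingress is container-native load balancing, where the Service is annotated for NEGs so that each pod becomes its own backend endpoint instead of a single instance-group backend per zone. This is only a hedged sketch, not a confirmed fix for the 'at capacity' warning; the Service name and namespace below are placeholders for the Varnish Service:

```python
# Hedged sketch: add the GKE NEG annotation to the Service behind the Ingress so
# the load balancer targets individual pods rather than instance groups.
# Service name and namespace are placeholders.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

patch = {
    "metadata": {
        "annotations": {
            # GKE annotation enabling NEG-backed (container-native) load balancing
            "cloud.google.com/neg": '{"ingress": true}',
        }
    }
}
v1.patch_namespaced_service(name="varnish", namespace="default", body=patch)
```

Check the GKE container-native load balancing documentation for the caveats around switching an existing Ingress backend between modes.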

Traefik adds huge overhead to requests in Kubernetes

I am using Traefik as the ingress controller for my Kubernetes setup. I decided to run some performance tests for my application, but I saw a huge difference when I sent the requests through Traefik.
The test consists of sending 10K requests in parallel; the application returns the compiled result, and based on its logs it needs around 5 milliseconds to process one request. The results of the performance test are as follows:
Native application:
Execution time in milliseconds: 61062
Application on Kubernetes (without going through Traefik and just using its IP):
Execution time in milliseconds: 62337
Application on Kubernetes and using Traefik:
Execution time in milliseconds: 159499
My question is why this huge difference exists, and whether there is a way to reduce it (other than adding more replicas).
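For context, the test harness itself is not shown above; a benchmark of the shape described (10K parallel requests against one endpoint, reporting total wall-clock time) could look roughly like this, with the URL and concurrency as placeholders:

```python
# Rough sketch of the parallel-request benchmark described above; the endpoint
# URL and the concurrency level are placeholders. Requires the `requests` package.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://traefik.example.com/app"  # placeholder: pod IP or Traefik entrypoint
N_REQUESTS = 10_000
CONCURRENCY = 100

def hit(_):
    return requests.get(URL, timeout=10).status_code

start = time.monotonic()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    statuses = list(pool.map(hit, range(N_REQUESTS)))

elapsed_ms = (time.monotonic() - start) * 1000
print(f"{statuses.count(200)}/{N_REQUESTS} OK, execution time in milliseconds: {elapsed_ms:.0f}")
```

Running the same script once against the pod/service IP and once against the Traefik entrypoint gives the two numbers being compared.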
I am using these yaml files for setting up Traefik:
https://raw.githubusercontent.com/containous/traefik/v1.7/examples/k8s/traefik-rbac.yaml
https://raw.githubusercontent.com/containous/traefik/v1.7/examples/k8s/traefik-ds.yaml
I tried Ambassador as my API gateway in Kubernetes and its result was much better than Traefik's and very close to using the IP of the container (63394 milliseconds). Obviously, Traefik is not as good as people think.

Elasticsearch speed vs. Cloud (localhost to production)

I have a single ELK stack with a single node running in a Vagrant virtual box on my machine. It has 3 indexes of 90 MB, 3.6 GB, and 38 GB.
At the same time, I also have a JavaScript application running on the host machine, consuming data from Elasticsearch. Locally it runs with no problem; the speed and everything else are perfect.
The issue comes when I put my JavaScript application into production, as the Elasticsearch endpoint in the application has to change from localhost:9200 to MyDomainName.com:9200. The application runs fine within the company, but when I access it from home the speed drops drastically and it often crashes. However, when I go to Kibana from home, running queries there is fine.
The company is using BT broadband with a download speed of 60 Mbps and 20 Mbps upload. It doesn't use a fixed IP, so I have to update the A record manually whenever the IP changes, but I don't think that is relevant to the problem.
Is the internet speed the main issue affecting the loading speed outside of the company? How do I improve this? Is the cloud (a CDN?) the only option that would make things run faster? If so, how much would it cost to host it in the cloud, assuming I would index a lot of documents the first time but a maximum of about 10 MB of indexing per day afterwards?
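A quick way to see where the time goes is to run the same query against the local node and the public endpoint and compare the timings; a rough sketch, with the index name as a placeholder (requires the requests package):

```python
# Rough comparison of query latency against the local node and the public
# endpoint. The index name is a placeholder; requires the `requests` package.
import requests

ENDPOINTS = ["http://localhost:9200", "http://MyDomainName.com:9200"]
QUERY = {"query": {"match_all": {}}, "size": 1}

for base in ENDPOINTS:
    try:
        r = requests.post(f"{base}/my-index/_search", json=QUERY, timeout=30)
        # r.elapsed covers the time from sending the request until the headers arrive
        print(f"{base}: HTTP {r.status_code}, elapsed {r.elapsed.total_seconds():.2f}s")
    except requests.RequestException as exc:
        print(f"{base}: failed ({exc})")
```

If the public endpoint is slow even for a tiny query like this, the problem is the network path rather than Elasticsearch itself.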
UPDATE1: Metrics from sending a request from Home using Chrome > Network
Queued at 32.77s
Started at 32.77s
Resource Scheduling
- Queueing 0.37 ms
Connection Start
- Stalled 38.32s
- DNS Lookup 0.22ms
- Initial Connection
Request/Response
- Request sent 48 μs
- Waiting (TTFB) 436.61 ms
- Content Download 0.58 ms
UPDATE2:
The stalling period seems to be much shorter when I use a VPN?

Cloudera NODE_MANAGER_UNEXPECTED_EXITS every hour

I have a Cloudera 5.x cluster running in Azure. Everything was running fine, and then a few days ago I started getting "NODE_MANAGER_UNEXPECTED_EXITS" health notifications via email every hour.
This seems to happen at the 43rd minute of every hour.
Most of the forums I've come across suggest OutOfMemory errors, though I'm not seeing any of these in the log files. For good measure I've tried upping the Java heap size allocated to the NodeManager, but this has not solved the problem.
I've stopped all jobs on the cluster - it is essentially sitting idle, but every hour I'm getting these alerts.
Example of the health alert that comes in the email:
NODE_MANAGER_UNEXPECTED_EXITS Role health test bad Critical The health test result for NODE_MANAGER_UNEXPECTED_EXITS has become bad: This role encountered 1 unexpected exit(s) in the previous 5 minute(s). Critical threshold: any.
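Something along these lines might help narrow down what fires around the :43 mark, by pulling only the NodeManager log lines from that window; the log directory below is a typical CDH default and may differ on your hosts:

```python
# Sketch: scan NodeManager logs for WARN/ERROR/FATAL lines written between the
# 42nd and 45th minute of any hour. The log directory is an assumed CDH default.
import re
from pathlib import Path

LOG_DIR = Path("/var/log/hadoop-yarn")  # assumption: adjust to your role log directory
stamp = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:4[2-5]:")  # timestamps at minutes 42-45

for log in LOG_DIR.glob("*NODEMANAGER*.log*"):
    for line in log.read_text(errors="replace").splitlines():
        if stamp.match(line) and any(lvl in line for lvl in ("WARN", "ERROR", "FATAL")):
            print(f"{log.name}: {line}")
```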
Any help is greatly appreciated

What to do when ECS-agent is disconnected?

I have an issue where, from time to time, one of the EC2 instances within my cluster has its ECS agent disconnected. This silently removes the EC2 instance from the cluster (i.e. it is no longer eligible to run any services) and silently drains my cluster of serving servers. I have my cluster backed by an autoscaling group that spawns servers to keep the healthy count up. But the servers whose ECS agent is disconnected are not marked as unhealthy, so the autoscaling group thinks everything is alright.
I have the feeling there must be something (easy) to mitigate this, or else I have a big problem with having chosen ECS and using it in production.
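One mitigation that targets this gap directly is a small scheduled job (for example a Lambda on a timer) that looks for container instances whose agent is disconnected and marks the underlying EC2 instance unhealthy, so the autoscaling group replaces it. A rough boto3 sketch, with the cluster name as a placeholder and pagination omitted:

```python
# Rough sketch: mark EC2 instances with a disconnected ECS agent as unhealthy so
# the autoscaling group replaces them. Cluster name is a placeholder; pagination
# of list_container_instances is omitted for brevity.
import boto3

CLUSTER = "my-cluster"

ecs = boto3.client("ecs")
autoscaling = boto3.client("autoscaling")

arns = ecs.list_container_instances(cluster=CLUSTER)["containerInstanceArns"]
if arns:
    described = ecs.describe_container_instances(cluster=CLUSTER, containerInstances=arns)
    for ci in described["containerInstances"]:
        if not ci["agentConnected"]:
            print(f"agent disconnected on {ci['ec2InstanceId']}, marking unhealthy")
            autoscaling.set_instance_health(
                InstanceId=ci["ec2InstanceId"],
                HealthStatus="Unhealthy",
                ShouldRespectGracePeriod=False,
            )
```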
We had this issue for a long time. With each new AWS ECS-optimized AMI it got better, but as of 3 months ago it still happened from time to time. As mcheshier mentioned, make sure to always use the latest AMI, or at least the latest AWS ECS agent.
The only way we were able to resolve it was through:
- Timed autoscale rotations: we would try to prevent it by scaling up and down at random times.
- Good CloudWatch alerts.
We happened to have our application set up as a bunch of microservices that were all queue (SQS) based, so we could scale up and down based on the queues. We had decent monitoring set up that let us approximate the processing rate of each queue across the number of ECS containers. When we detected that the rate was off, we would rotate that whole ECS instance. E.g. say our cluster deployed 4 running containers of worker-1 and we approximate that each worker does 1000 messages per 5 minutes; if our queue rate was 3000 per 5 minutes with 4 workers, then 1 was not working as expected. We had some scripts set up in Lambda to find the faulty one and terminate the entire instance that ran that container.
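In code, that kind of check might look something like the sketch below; the queue name, service name, and per-worker rate are assumptions standing in for the real setup:

```python
# Sketch of the queue-rate heuristic: compare the SQS consumption rate over the
# last 5 minutes with what the running workers should achieve. Queue name,
# service name, and the per-worker rate are placeholder assumptions.
from datetime import datetime, timedelta, timezone

import boto3

QUEUE_NAME = "worker-1-queue"       # placeholder
CLUSTER, SERVICE = "my-cluster", "worker-1"
EXPECTED_PER_WORKER = 1000          # messages per 5 minutes

cloudwatch = boto3.client("cloudwatch")
ecs = boto3.client("ecs")

now = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/SQS",
    MetricName="NumberOfMessagesDeleted",
    Dimensions=[{"Name": "QueueName", "Value": QUEUE_NAME}],
    StartTime=now - timedelta(minutes=5),
    EndTime=now,
    Period=300,
    Statistics=["Sum"],
)
consumed = sum(point["Sum"] for point in stats["Datapoints"])

running = ecs.describe_services(cluster=CLUSTER, services=[SERVICE])["services"][0]["runningCount"]

# With 4 workers expected to do 1000 msgs each per 5 minutes, a rate of ~3000
# suggests one worker is stuck and its instance should be rotated.
if running and consumed < EXPECTED_PER_WORKER * (running - 0.5):
    print(f"{SERVICE}: only {consumed:.0f} messages in 5 minutes from {running} workers")
```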
I hope this helps. I realize it's specific to our in-house application, but the advice I can give you and anyone else is to take the initiative and put out as many metrics as you can. This will let you do some neat analytics and look for kinks in the system, this being one of them.
