Kubernetes and/or Terraform configuration issue causing performance degradation?

I'm migrating an e2e test stack from a Docker Compose-based setup to Kubernetes. As part of this migration, I'm also creating Terraform modules for the individual services that make up the product.
A single e2e stack is currently composed of ~50 pods and starts up in about 5 minutes (I run dedicated DBs, in-memory data stores, ESB integration tools, mocked external services, etc., per stack, hence the high pod count).
During testing I would like to start up as many of those complete stacks as possible.
Currently I have a k8s cluster with 9 nodes:
six nodes with 64 GB RAM, 512 GB SSD, and latest-gen i5 CPUs (max-pods at the default 110), and
three nodes with 256 GB RAM, 1 TB SSD, and 18-core Xeon CPUs (max-pods set to 330 on each).
I'm using my Terraform modules to start up the stacks (the modules mostly define Kubernetes resources).
My expectation would be that I can fire up ~30 stacks in parallel without major hiccups. I would also expect startup times to follow, to a reasonable extent, a function like st = ceil(#s / #n) * ss-st, where:
st denotes the overall startup time,
#s denotes the number of stacks,
#n denotes the number of nodes and finally
ss-st denotes the single-stack startup time.
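As a sanity check, here is that expectation written out as a minimal sketch, assuming stacks start in waves of #n across the nodes (the 300 s per stack is illustrative):

    import math

    def idealized_startup_time(num_stacks, num_nodes, single_stack_startup_s):
        # Stacks beyond the node count queue up in "waves", so the total
        # time grows with ceil(#s / #n).
        return math.ceil(num_stacks / num_nodes) * single_stack_startup_s

    # Example: 30 stacks on 9 nodes at ~300 s per stack -> 4 waves -> 1200 s
    print(idealized_startup_time(30, 9, 300))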
However, reality is very different. The blue columns show the actual startup times (measured in seconds), while the red ones show my idealized expectation.
I installed the Prometheus operator and have some metrics, but none of them explain to me (yet) where the bottleneck is in this case (disk utilization seems to get maxed out on the master nodes from time to time, but that alone does not seem to explain the end figures).
What am I doing wrong?

Related

Kubernetes number of replicas vs performance

I have just gotten into Kubernetes and really like its ability to orchestrate containers. I had assumed that when the app starts to grow, I could simply increase the replicas to handle the demand. However, now that I have run some benchmarks, the results confuse me.
I am running Laravel 6.2 with Apache on GKE with a single g1-small machine as the node. I'm only using a NodePort service to expose the app, since a LoadBalancer seems expensive.
The benchmarking tools used are wrk and ab. When the replica count is increased to 2, requests/s somehow drop. I would expect requests/s to increase, since there are 2 pods available to serve requests. Is there a bottleneck occurring somewhere, or is my understanding flawed? I hope someone can point out what I'm missing.
A g1-small instance is really tiny: you get 50% utilization of a single core and 1.7 GB of RAM. You don't describe what your application does or how you've profiled it, but if it's CPU-bound, then adding more replicas of the process won't help you at all; you're still limited by the amount of CPU that GCP gives you. If you're hitting the memory limit of the instance that will dramatically reduce your performance, whether you swap or one of the replicas gets OOM-killed.
The other thing that can affect this benchmark is that, sometimes, for a limited time, you can be allowed to burst up to 100% CPU utilization. So if you got an instance and ran the first benchmark, it might have used a burst period and seen higher performance, but then re-running the second benchmark on the same instance might not get to do that.
In short, you can't just crank up the replica count on a Deployment and expect better performance. You need to identify where in the system the actual bottleneck is. Monitoring tools like Prometheus that can report high-level statistics on per-pod CPU utilization can help. In a typical database-backed Web application the database itself is the bottleneck, and there's nothing you can do about that at the Kubernetes level.
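On the monitoring point, here is a minimal sketch of pulling per-pod CPU rates from the Prometheus HTTP API (the service URL is a placeholder for your setup, and note that older cadvisor versions expose the pod label as pod_name):

    import requests

    # Placeholder URL; point this at your Prometheus service.
    PROM_URL = "http://prometheus.monitoring.svc:9090/api/v1/query"

    # Per-pod CPU usage averaged over the last 5 minutes.
    query = 'sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))'

    resp = requests.get(PROM_URL, params={"query": query})
    resp.raise_for_status()
    for result in resp.json()["data"]["result"]:
        pod = result["metric"].get("pod", "<unknown>")
        print(f'{pod}: {float(result["value"][1]):.2f} cores')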

Worker node-status on a Ray EC2 cluster: update-failed

I now have a Ray cluster working on EC2 (Ubuntu 16.04) with a c4.8xlarge master node and one identical worker. I wanted to check whether multi-threading was being used, so I ran tests to time increasing numbers (n) of the same 9-second task. Since the instance has 18 CPUs, I expected to see the job take about 9 s for up to n <= 35 (assuming one CPU for cluster management) and then either a fault, or an increase to about 18 s when switching to 36 vCPUs per node.
Instead, the cluster handled only up to 14 tasks in parallel, and then the execution time jumped to 40 s and continued to increase for increasing n. When I tried a c4.xlarge master (4 CPUs), the times were directly proportional to n, i.e. the tasks were running serially. So I surmise that the master actually requires 4 CPUs for the system, and that the worker node is not being used at all. However, if I add a second worker, the times for n > 14 are about 40 s less than without it. I also tried a value for target_utilization_factor less than 1.0, but that made no difference.
There were no reported errors, but I did notice that the ray-node-status for the worker in the EC2 Instances console was "update-failed". Is this significant? Can anyone enlighten me about this behaviour?
The cluster did not appear to be using the workers, so the trace showed only 18 actual CPUs dealing with the task. The monitor (ray exec ray_conf.yaml 'tail -n 100 -f /tmp/ray/session_/logs/monitor') identified that the "update-failed" status is significant: the setup commands, called by the Ray updater.py, were failing on the worker nodes. Specifically, it was the attempt to install the C build-essential compiler package on them that, presumably, exceeded the worker memory allocation. I was only doing this in order to suppress a "setproctitle" installation warning, which I now understand can be safely ignored anyway.
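For reference, a minimal sketch of the kind of timing test described above, assuming it runs on the head node of an already-started cluster (the task length and n values are illustrative):

    import time
    import ray

    ray.init(address="auto")  # attach to the running cluster

    @ray.remote
    def nine_second_task():
        time.sleep(9)
        return True

    # Time increasing batches of identical tasks to see how many run in parallel.
    for n in (1, 14, 15, 36):
        start = time.time()
        ray.get([nine_second_task.remote() for _ in range(n)])
        print(f"n={n}: {time.time() - start:.1f}s")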

Apache NiFi slow cluster issue

I am using Apache NiFi for one of my clickstream projects to do some ETL.
I am currently getting traffic of around 300 messages per second with the following infra:
RAM - 16 GB
Swap - 6 GB
CPU - 16 cores
Disk - 100 GB (persistence not required)
Cluster - 6 nodes
The entire cluster UI has become extremely slow, with the following issues:
Processors giving back pressure when some failure happens, which consumes a lot of threads
Provenance writing becomes very slow
Heartbeat across nodes (cluster heartbeat) becomes slow
I have the following questions on the setup:
Is RPG (Remote Process Group) use recommended? It is an HTTP call, which I am using to spread load across all the nodes, since there is an existing issue with the EMQTT processor for consumer groups.
What is the recommended thread count per core?
What are the guidelines for infrastructure sizing?
What are the tuning parameters for a large cluster with a high incoming request rate and a lot of heavy JSON parsing for transformation?
A couple of suggestions:
Yes, RPG usage is recommended; at least in my experience, RPG seems to offer better distribution. Take a look at [3] below.
Some processors are more CPU-intensive than others, so there's no clear-cut answer for what value to set for Concurrent Tasks. This is more of a trial-and-error, test-and-fine-tune approach that you'd have to master. One caution: if you set too many Concurrent Tasks on a CPU-intensive processor, it will have a serious impact on the nodes. A rough starting point is sketched after the links below.
Hortonworks has published a detailed guide on this; see [1] below.
Some best practices and handy guides:
[1] https://community.hortonworks.com/articles/7882/hdfnifi-best-practices-for-setting-up-a-high-perfo.html
[2] http://ijokarumawak.github.io/nifi/2016/11/22/nifi-jolt/
[3] https://pierrevillard.com/2017/02/23/listfetch-pattern-and-remote-process-group-in-apache-nifi/
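On the thread-count-per-core question: the guide linked as [1] suggests, as a rule of thumb, sizing the Maximum Timer Driven Thread Count at roughly 2-4x the core count of a single node; treat the exact multiplier as an assumption to validate under load. A trivial sketch:

    # Rule-of-thumb sizing (assumption taken from the guide linked as [1]):
    # Max Timer Driven Thread Count ~= 2-4x the cores of a single node,
    # then adjust based on observed CPU saturation.
    cores_per_node = 16  # the cluster described above

    low, high = 2 * cores_per_node, 4 * cores_per_node
    print(f"Suggested Max Timer Driven Thread Count: {low}-{high}")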

Choose Amazon EC2 Instance Types

What Amazon EC2 instance type should I choose for an application that only receives JSON, transforms it, saves it to a database, and returns JSON?
Java(Spring) + PostgreSQL
Expected req/sec 10k.
Your application is CPU-bound, so you should choose a compute-optimized instance; C4 is the latest generation in the compute-optimized family.
I had a similar application requirement, and with a c4.xlarge I could get 40k requests/min on a single server within an SLA of 10 ms per request. You can also benchmark your application by running a stress test on the different C4-generation instance types.
You should check out the https://aws.amazon.com/ec2/instance-types/ doc by AWS on the different instance types and their use cases.
You can also check the CPU usage on your instance by looking at the CloudWatch metrics or by running the top command on your Linux instance. Make sure that your instance is not running at more than 75% CPU utilization.
You can start with a smaller instance and gradually move up to larger servers in the C4 category if you see CPU utilization becoming the bottleneck. This is how I found the right instance type for my application while keeping the SLA within 10 ms of server time.
P.S.: in my case the DB was also deployed on the same server, so throughput was lower; it will increase if you have the DB installed on a separate server.
Let me know if you need any other info.
Let's say that every request requires 20 ms of CPU processing time (thus not taking into account the waits between I/O operations); each core will then be able to process around 50 requests per second. In order to process 10k requests per second you will need 200 cores, which can be achieved, for example, with 16 instances of 16 vCPUs each.
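The same estimate, written out (the 20 ms of CPU per request is the assumption everything else follows from):

    import math

    cpu_ms_per_request = 20    # assumed pure CPU time, no I/O waits
    target_rps = 10_000

    requests_per_core = 1000 / cpu_ms_per_request             # 50 req/s per core
    cores_needed = math.ceil(target_rps / requests_per_core)  # 200 cores
    # e.g. sixteen 16-vCPU instances give 256 cores of headroom
    print(cores_needed)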
Having said that, you can then select the right instance for your needs using an EC2 instance selector tool. For instance:
these are all the instance types with 16x16 cores for less than $10k/year;
if, otherwise, you're fine with "just" 64 cores in total, then take a look at these.
If you have other constraints, or if my assumptions weren't correct, you can change the filters accordingly and choose the type that best suits your needs.

Rapid AWS autoscaling

How do you configure AWS autoscaling to scale up quickly? I've set up an AWS autoscaling group with an ELB. All is working well, except that it takes several minutes before new instances are added and come online. I came across the following in a post about Puppet and autoscaling:
The time to scale can be lowered from several minutes to a few seconds if the AMI you use for a group of nodes is already up to date.
http://puppetlabs.com/blog/rapid-scaling-with-auto-generated-amis-using-puppet/
Is this true? Can the time to scale be reduced to a few seconds? Would using Puppet add any performance boost?
I also read that smaller instances start quicker than larger ones:
Small Instance 1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit), 160 GB of instance storage, 32-bit platform with a base install of CentOS 5.3 AMI
Amount of time from launch of instance to availability:
Between 5 and 6 minutes us-east-1c
Large Instance 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of instance storage, 64-bit platform with a base install of CentOS 5.3 AMI
Amount of time from launch of instance to availability:
Between 11 and 18 minutes us-east-1c
Both were started via the command line using Amazon's tools.
http://www.philchen.com/2009/04/21/how-long-does-it-take-to-launch-an-amazon-ec2-instance
I note that the article is old and my c1.xlarge instances are certainly not taking 18 minutes to launch. Nonetheless, would configuring an autoscaling group with 50 micro instances (with a scale-up policy of a 100% capacity increase) be more efficient than one with 20 large instances? Or potentially creating two autoscaling groups: one of micros for quick launch time, and one of large instances to add CPU grunt a few minutes later? All else being equal, how much quicker does a t1.micro come online than a c1.xlarge?
You can increase or decrease the reaction time of the autoscaler by playing with the "--cooldown" value (in seconds).
Regarding the instance types to use, this is mostly based on the application type, and a decision on this topic should be taken after close performance monitoring and production tuning.
The time to scale can be lowered from several minutes to a few seconds if the AMI you use for a group of nodes is already up to date. This way, when Puppet runs on boot, it has to do very little, if anything, to configure the instance with the node's assigned role.
The advice here is talking about having your AMI (the snapshot of your operating system) as up to date as possible. This way, when autoscaling brings up a new machine, Puppet doesn't have to install lots of software like it normally would on a blank AMI; it may just need to pull some updated application files.
Depending on how much work your Puppet scripts do (apt-get install, compiling software, etc) this could save you 5-20 minutes.
The two other factors you have to worry about are:
How long it takes your load balancer to determine you need more resources (e.g. a policy that dictates "new machines should be added when CPU is above 90% for more than 5 minutes" would be less responsive and more likely to lead to timeouts than "new machines should be added when CPU is above 60% for more than 1 minute"; see the sketch after this list)
How long it takes to provision a new EC2 instance (smaller instance types tend to take shorter times to provision)
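To make the first factor concrete, here is a hedged boto3 sketch of the more responsive alarm (every name and the policy ARN are placeholders, not values from this post):

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Scale when average CPU is above 60% for one minute, rather than
    # above 90% for five minutes.
    cloudwatch.put_metric_alarm(
        AlarmName="scale-up-fast",                      # placeholder name
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": "my-asg"}],
        Statistic="Average",
        Period=60,
        EvaluationPeriods=1,
        Threshold=60.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:autoscaling:..."],       # your scaling policy ARN
    )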
How soon the ASG responds depends on three things (a sketch of the corresponding API call follows the list):
1. Step: how much to increase by, in percent or a fixed number. With a large step you can increase rapidly; the ASG will launch the entire step in one go.
2. Cooldown period: this governs how soon the next increase can happen. If the previous increase step is still within the defined cooldown period (in seconds), the ASG will wait and not take action on the next increase yet. A small cooldown period enables the next step sooner.
3. AMI type: how much time an AMI takes to launch depends on the type of AMI; many factors come into play. All things being equal, fully baked AMIs launch much faster.
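A minimal boto3 sketch of points 1 and 2 together (the group and policy names are illustrative placeholders):

    import boto3

    autoscaling = boto3.client("autoscaling")

    autoscaling.put_scaling_policy(
        AutoScalingGroupName="my-asg",          # placeholder
        PolicyName="double-capacity",           # placeholder
        PolicyType="SimpleScaling",
        AdjustmentType="PercentChangeInCapacity",
        ScalingAdjustment=100,  # the "step": +100% capacity in one go
        Cooldown=60,            # short cooldown -> next step can fire sooner
    )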
