I would like to scale Kubernetes nodes according to unscheduled pods. If I have pods that can't be scheduled because of their resource requirements, I want to add a new node to the cluster.
Looking at the autoscaling feature of managed instance groups in GCE, this doesn't seem to be possible, as their model requires a metric per node in the cluster, while my metric is global.
Can anyone confirm that this can't be achieved with the current GCE solution?
Does anyone know of an existing tool, blog post, or anything else that could help with implementing a solution?
Assuming I'm going to roll my own, I'm having trouble finding an API that controls GCE managed instance groups (i.e., lets me add or remove a node).
Thanks,
Nathan
If you are fine with the standard per-node metrics, read the "Horizontal auto-scaling of nodes (GCE)" section of the Kubernetes cluster management guide to enable the autoscaler.
If you want custom metrics, you can check out the GCE documentation.
There is also a similar question on Stack Overflow, and the author of one of the answers said that after writing their own custom metrics, they found the standard per-node metrics to be just as good, if not better, for their use case.
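If you do end up rolling your own controller, the managed instance group size can also be changed programmatically through the Compute Engine API. Here is a minimal sketch using google-api-python-client; it assumes Application Default Credentials are configured, and the project, zone, and group names are placeholders:

```python
# A minimal sketch with google-api-python-client, assuming Application Default
# Credentials are configured; project, zone, and group names are placeholders.
from googleapiclient import discovery

compute = discovery.build("compute", "v1")

def resize_group(project, zone, group, new_size):
    """Set the target size of a managed instance group; GCE then adds or
    removes instances until the group reaches that size."""
    request = compute.instanceGroupManagers().resize(
        project=project,
        zone=zone,
        instanceGroupManager=group,
        size=new_size,
    )
    return request.execute()

# Example: grow a hypothetical node group by setting its target size to 4.
# resize_group("my-project", "us-central1-a", "k8s-node-group", 4)
```

A home-grown controller along these lines would watch for pods stuck in Pending due to insufficient resources, bump the target size, and let the new node register itself with the cluster.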
I am exploring Vertex AI Pipelines and understand that it is a managed alternative to, say, AI Platform Pipelines (where you have to deploy a GKE cluster to be able to run Kubeflow pipelines). What I am not clear on is whether Vertex AI will autoscale the cluster depending on the load. In the answer to a similar question, it is mentioned that for pipeline steps that use GCP resources such as Dataflow, autoscaling is done automatically. In the Google docs, it is mentioned that for components one can set resources such as CPU_LIMIT, GPU_LIMIT, etc.

My question is: can these limits be set for any type of component, i.e., Google Cloud pipeline components as well as custom components, whether Python function-based or packaged as a container image? Secondly, do these limits mean that a component's resources will autoscale until they hit those limits? And if these options are not specified at all, how are the resources allocated, and will they autoscale as Vertex AI sees fit?
Links to relevant docs and resources would be really helpful.
To answer your questions,
1. Can these limits be set for any type of component?
Yes. These limits apply to all Kubeflow components and are not specific to any particular type of component; any component can be implemented to perform its task with a set amount of resources (see the sketch below this list).
2. Do these limits mean that the component resources will autoscale till they hit the limits?
No, there is no autoscaling performed by Vertex AI. Based on the limits set, Vertex AI chooses one suitable VM to perform the task.
Having a pool of workers is supported in Google Cloud Pipeline Components such as “CustomContainerTrainingJobRunOp” and “CustomPythonPackageTrainingJobRunOp” as part of Distributed Training in Vertex AI. Otherwise, only 1 machine is used per step.
3. What happens if these limits are not specified? Does Vertex AI scale the resources as it sees fit?
If the limits are not specified, an “e2-standard-4” VM is used for task execution as the default option.
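For reference, here is a minimal sketch of how the limits from question 1 are set with the Kubeflow Pipelines SDK (kfp v2 is assumed; older SDK versions use slightly different method names, e.g. set_gpu_limit), with placeholder component and pipeline names:

```python
# A minimal sketch with the Kubeflow Pipelines SDK (kfp v2 is assumed; older
# SDK versions use slightly different method names, e.g. set_gpu_limit).
from kfp import dsl

@dsl.component
def train(epochs: int) -> str:
    # Placeholder training step; the body is only a stand-in.
    return f"trained for {epochs} epochs"

@dsl.pipeline(name="resource-limits-demo")
def pipeline():
    task = train(epochs=10)
    # Vertex AI picks a single machine that satisfies these limits for this
    # step; it does not autoscale the step beyond that one VM.
    task.set_cpu_limit("8")
    task.set_memory_limit("32G")
    # Accelerators can be requested similarly (the type name is an example):
    # task.set_accelerator_type("NVIDIA_TESLA_T4")
    # task.set_accelerator_limit(1)
```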
EDIT: I have updated the links with the latest version of the documentation.
So I am currently running dual clusters for data processing: one is a Kubernetes cluster and the other is a Hadoop cluster.
The K8s cluster is taken care of in terms of monitoring, since it was quite easy to deploy Prometheus and Grafana on it.
For the Hadoop cluster, however, I am still looking for a good way to do that.
The goal is to have a unified monitoring solution, so I thought it would be a good idea to go with Prometheus since I am already familiar with it, but it looks like it's not straightforward.
Hadoop by default exposes some metrics through an HTTP API, but those metrics are not "Prometheus-friendly".
I would appreciate it if you could explain how I can achieve this.
I suggest you look at this:
https://github.com/marcelmay/hadoop-hdfs-fsimage-exporter
In most cases, when an application does not expose Prometheus metrics natively, you can use an exporter; there are a lot of them.
They collect the metrics and expose them in a Prometheus-friendly manner.
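If you can't find a ready-made exporter that covers what you need, writing a small one is also an option. Here is a minimal sketch in Python; the NameNode address, bean name, and field names are assumptions and vary between Hadoop versions:

```python
# A minimal custom-exporter sketch; the NameNode address, bean name, and field
# names are assumptions and vary between Hadoop versions.
import time

import requests
from prometheus_client import Gauge, start_http_server

HDFS_CAPACITY_USED = Gauge("hdfs_capacity_used_bytes", "HDFS capacity used in bytes")

def scrape_namenode(jmx_url):
    # Hadoop daemons expose their metrics as JSON under /jmx.
    beans = requests.get(jmx_url, timeout=5).json()["beans"]
    for bean in beans:
        if bean.get("name") == "Hadoop:service=NameNode,name=FSNamesystem":
            HDFS_CAPACITY_USED.set(bean["CapacityUsed"])

if __name__ == "__main__":
    start_http_server(9100)  # Prometheus scrapes this port.
    while True:
        scrape_namenode("http://namenode:9870/jmx")
        time.sleep(30)
```

Point Prometheus at port 9100 of wherever this runs and the Hadoop metrics show up alongside your Kubernetes ones in Grafana.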
I am working on a project where we want to test an autoscaling algorithm which involves making a few changes to the default horizontal pod autoscaling algorithm on Kubernetes. Where should we make changes in the https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler repo so that we can test our algorithm?
The Cluster Autoscaler is a cluster-level autoscaler: it changes the number of nodes in the cluster when either of the following happens:
- there are pods that failed to run in the cluster due to insufficient resources
- there are nodes in the cluster that have been underutilized for an extended period of time and their pods can be placed on other existing nodes
If you want a custom pod scaling algorithm, I would recommend implementing it as a simple service that does just that, rather than altering the Cluster Autoscaler (which scales nodes, not pods). Here's a similar solution to your problem:
https://stackoverflow.com/a/58447811/5617140
You can alter the service logic to suit your requirements. It is probably a good idea to use the Kubernetes API to perform the scaling itself.
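As a rough illustration, here is a minimal sketch of such a service using the official Kubernetes Python client; the deployment name, namespace, interval, and desired_replicas() logic are placeholders for your own algorithm:

```python
# A minimal custom-scaler sketch; deployment name, namespace, interval, and
# desired_replicas() are placeholders for your own algorithm.
import time

from kubernetes import client, config

def desired_replicas(current: int) -> int:
    # Stand-in for your algorithm: read your metrics and return a target.
    return current

def main():
    config.load_kube_config()  # use load_incluster_config() when running in a pod
    apps = client.AppsV1Api()
    while True:
        scale = apps.read_namespaced_deployment_scale("my-app", "default")
        target = desired_replicas(scale.spec.replicas)
        if target != scale.spec.replicas:
            apps.patch_namespaced_deployment_scale(
                "my-app", "default", {"spec": {"replicas": target}}
            )
        time.sleep(30)

if __name__ == "__main__":
    main()
```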
I am relatively new to all of these, and I'm having trouble getting a clear picture of the listed technologies.
All of these try to solve different problems, but they do have things in common too. I would like to understand what they have in common and how they differ. It is likely that a combination of a few of them would be a great fit; if so, which ones?
I am listing a few of them along with questions, but it would be great if someone could cover all of them in detail and answer the questions.
Kubernetes vs Mesos:
This link, "What's the difference between Apache's Mesos and Google's Kubernetes", provides good insight into the differences, but I'm unable to understand why Kubernetes should run on top of Mesos. Is it more to do with two open-source solutions coming together?
Kubernetes vs CoreOS Fleet:
If I use Kubernetes, is Fleet required?
How does Docker Swarm fit into all of the above?
Disclosure: I'm a lead engineer on Kubernetes
I think that Mesos and Kubernetes are largely aimed at solving similar problems of running clustered applications; they just have different histories and different approaches to solving the problem.
Mesos focuses its energy on very generic scheduling, and plugging in multiple different schedulers. This means that it enables systems like Hadoop and Marathon to co-exist in the same scheduling environment. Mesos is less focused on running containers. Mesos existed prior to widespread interest in containers and has been re-factored in parts to support containers.
In contrast, Kubernetes was designed from the ground up to be an environment for building distributed applications from containers. It includes replication and service discovery as core primitives, whereas such things are added via frameworks in Mesos. The primary goal of Kubernetes is to be a system for building, running, and managing distributed systems.
Fleet is a lower-level task distributor. It is useful for bootstrapping a cluster system; for example, CoreOS uses it to distribute the Kubernetes agents and binaries out to the machines in a cluster in order to bring up a Kubernetes cluster. It is not really intended to solve the same distributed application development problems; think of it more like systemd/init.d/upstart for your cluster. It's not required if you run Kubernetes; you can use other tools (e.g. Salt, Puppet, Ansible, Chef, ...) to accomplish the same binary distribution.
Swarm is an effort by Docker to extend the existing Docker API to make a cluster of machines look like a single Docker API. Fundamentally, our experience at Google and elsewhere indicates that the node API is insufficient for a cluster API. You can see a bunch of discussion on this here: https://github.com/docker/docker/pull/8859 and here: https://github.com/docker/docker/issues/8781
Join us on IRC at #google-containers if you want to talk more.
I think the simplest answer is that there is no simple answer. The swift rise to power of containers, and Docker in particular, has left a power vacuum for "container scheduling and orchestration", whatever that might mean. In reality, that means you have a number of technologies that can work in harmony on some levels, but with certain aspects in competition. For example, Kubernetes can be used as a one-stop shop for deploying and managing containers on a compute cluster (as Google originally designed it), but could also sit atop Fleet, making use of the resilience tier that Fleet provides on CoreOS.
As this Google video states, Kubernetes is not a complete out-of-the-box container scaling solution, but it is a good place to start. In the same way, you would at some stage expect Apache Mesos to be able to work with Kubernetes, but not with Marathon, inasmuch as Marathon appears to fulfil the same role as Kubernetes. Somewhere I think I've read these could become part of the same effort, but I could be wrong about that; it's really about the strategic direction of Mesosphere and the corresponding adoption of Kubernetes principles.
In the DockerCon keynote, Solomon Hykes suggested Swarm would be a tier that could provide a common interface onto the many orchestration and scheduling frameworks. From what I can see, Swarm is designed to provide a smooth Docker deployment workflow, working with some existing container workflow frameworks such as Deis, but flexible enough to yield to "heavyweight" deployment and resource management such as Mesos.
Hope this helps; this could be an enormous post. I think the key is that these are young, evolving services that will likely merge and become interoperable, but we need to ride out the next 12 months to see how it plays out. There are some very clever people working on the problem, so the future looks very bright.
As far as I understand it:
Mesos, Kubernetes and Fleet are all trying to solve a very similar problem. The idea is that you abstract away all your hardware from developers and the 'cluster management tool' sorts it all out for you. Then all you need to do is give a container to the cluster, give it some info (keep it running permanently, scale up if X happens etc) and the cluster manager will make it happen.
With Mesos, it does all the cluster management for you, but it doesn't include the scheduler. The scheduler is the bit that says: OK, this process needs 2 CPUs and 512MB RAM, and I have a machine over there with that free, so I'll run it on that machine. There are plugin schedulers available for Mesos, such as Marathon and Chronos, and you can write your own. This gives you a lot of power over resource distribution, cluster scaling, etc.
Fleet and Kubernetes seem to abstract away those sorts of details (so you don't have to write your own scheduler basically). This means you have to define your tasks and submit them in the format/manner defined by Fleet or Kubernetes and then they take over and schedule the tasks (containers) for you.
So I guess: Using Mesos may mean a bit more work in writing your own scheduler, but potentially provides more flexibility if required.
I think the idea of running Kubernetes on top of Mesos is that Kubernetes acts as the scheduler for Mesos. Personally I'm not sure what benefits this brings over running one or the other on its own though (hopefully someone will jump in and explain!)
As MikeB said, it's early days and it's all up for grabs (keep an eye on Amazon's ECS as well), so there are many competing standards and a lot of overlap!
Edit: I didn't mention Docker Swarm as I don't really have much experience with it.
For anyone coming to this after 2017: Fleet is deprecated. Do not use it anymore.
The Fleet docs say "fleet is no longer actively developed or maintained by CoreOS" and link to "Container orchestration: Moving from fleet to Kubernetes". Fleet was removed from Container Linux (formerly known as CoreOS Linux) and replaced with the Kubernetes kubelet (agent). This coincided with a corporate pivot to offer Tectonic (a Kubernetes distro) as their primary product.
We are working on setting up our Elasticsearch backend for a production environment. Up until a few weeks ago we were using Solr, but we decided to switch to Elasticsearch for a few reasons, the biggest being the distributed nature of the backend.
With that said, we've been looking for documentation and best practices on deploying Elasticsearch using Amazon's services.
For the moment, we were considering using an extra-large box and then scaling out from there, but we aren't sure that is the best approach. For example, it may be better to have three medium instances than one extra-large.
We intend to index around 100K to 150K documents per day, up to around ten million docs.
The question is: can anyone provide a general environment/deployment diagram for Elasticsearch, or best practices in general?
There are some docs for Elasticsearch that talk about EC2 deployment. There's an auto-discovery plugin that can work off EC2 tags, security groups, or whatever you like. You can also choose S3 for persistence, although that may not really be necessary.
I'd advise launching it in a VPC so you can have permanent internal IPs; in regular EC2 your internal IPs will change with every reboot, even if you're using Elastic IPs.
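Once the nodes are up, a quick sanity check that discovery worked is to ask the cluster health API how many nodes joined. A minimal sketch with the official Python client; the address is a placeholder for a VPC-internal node IP, and the constructor differs slightly between client versions:

```python
# A quick sanity check that node discovery worked; the address is a
# placeholder for a VPC-internal node IP.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://10.0.0.10:9200")

health = es.cluster.health()
print(health["status"], health["number_of_nodes"])
```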