Scale Google Cloud Composer cluster down to zero nodes? - etl

I have a Cloud Composer cluster running about a dozen DAGs a day. They all run during a 5-hour period in the middle of the night. The biggest DAG takes ~3 hours to complete on 5 nodes, and the bulk of the work is highly parallelizable (that is, if we scaled it up to, say, 15 nodes, it would finish much sooner). In an effort to keep costs low (or even reduce them) and improve our throughput, it would be great if I could scale the cluster up while the big DAG is running, then scale it back down for the remaining almost 20 hours in the day when nothing is happening in the cluster. The UI only lets me scale the cluster down to 3 nodes.
My question: Is there a way to completely "shut down" the Cloud Composer cluster for part of the day? Failing that, can I at least bring it down to a single node? Ideally, this would be an automated task.

The Traveloka team solved the same problem and wrote a detailed article about the process. However, in their idle case they run 1 node, not zero.
https://medium.com/traveloka-engineering/enabling-autoscaling-in-google-cloud-composer-ac84d3ddd60

You can enable autoscaling at the node level:
Workloads > your Composer cluster name > enable Autoscaling
PROJECT=[provide your gcp project id]
COMPOSER_NAME=[provide your composer environment name]
COMPOSER_LOCATION=[provide the selected composer’s location e.g. us-central1]
CLUSTER_ZONE=[provide the selected composer’s zone e.g. us-central1-a]
GKE_CLUSTER=$(gcloud composer environments describe \
${COMPOSER_NAME} \
--location ${COMPOSER_LOCATION} \
--format="value(config.gkeCluster)" \
--project ${PROJECT} | \
grep -o '[^\/]*$')
gcloud container clusters update ${GKE_CLUSTER} --enable-autoscaling \
--min-nodes 1 \
--max-nodes 10 \
--zone ${CLUSTER_ZONE} \
--node-pool=default-pool \
--project ${PROJECT}
At the worker level, we apply Kubernetes’ Horizontal Pod Autoscaler (HPA) to the airflow-worker Deployment in Composer.
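A minimal sketch of attaching such an HPA with kubectl follows; the namespace is a placeholder (Composer creates the airflow-worker Deployment in an environment-specific namespace), and the replica bounds and CPU target are illustrative values, not recommendations:
# Locate the namespace that contains the airflow-worker Deployment
kubectl get deployments --all-namespaces | grep airflow-worker
# Attach an HPA to that Deployment (bounds and CPU target are illustrative)
kubectl autoscale deployment airflow-worker \
  --namespace <airflow-worker-namespace> \
  --min=1 --max=10 --cpu-percent=70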

Cloud Composer also has costs you cannot do anything about:
frontend (App Engine flex)
database (Cloud SQL)
These costs are a significant part of a small Composer environment.
If you want to scale down to 0, I suggest running Airflow on a VM instead of a managed Composer environment. After Airflow has completed its runs, you can shut down the VM to reduce costs.
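As a rough sketch of that approach, assuming a Compute Engine instance named airflow-vm in zone us-central1-a (both names are made up), the start/stop could be driven from cron or Cloud Scheduler:
# Start the VM shortly before the nightly window
gcloud compute instances start airflow-vm --zone us-central1-a
# ... DAGs run ...
# Stop the VM afterwards; a stopped instance no longer incurs vCPU/RAM charges
gcloud compute instances stop airflow-vm --zone us-central1-a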
The GKE cluster that runs Composer cannot scale down to 0 nodes, because it also runs some Kubernetes system services that need CPU and RAM.
Other than that, you should check out the link posted by SANN3, as that post gives some detailed insight into how to achieve autoscaling.

Related

What is the main constraint on running larger YARN jobs and how do I increase it?

What is the main constraint on running larger YARN jobs (Hadoop version HDP-3.1.0.0 (3.1.0.0-78)), and how do I increase it? Basically, I want to run more sqoop jobs (all of which are pretty large) concurrently.
I am currently assuming that I need to increase the Resource Manager heap size (since that is what I see going up on the Ambari dashboard when I run YARN jobs). How do I add more resources to the RM heap, and why does the RM heap appear to be such a small fraction of the total RAM available (to YARN?) across the cluster?
Looking in Ambari: YARN cluster memory is 55 GB, but the RM heap is only 900 MB.
Could anyone with more experience tell me what the difference is, and which is the limiting factor in running more YARN applications (and again, how do I increase it)? Anything else I should be looking at? Any docs explaining this in more detail?
The convenient way to tune your YARN and MapReduce memory is to use the yarn-utils script.
Download the companion files (ref):
wget http://public-repo-1.hortonworks.com/HDP/tools/2.6.0.3/hdp_manual_install_rpm_helper_files-2.6.0.3.8.tar.gz
tar zxvf hdp_manual_install_rpm_helper_files-2.6.0.3.8.tar.gz
Execute the YARN utility script (ref):
You can execute the yarn-utils.py Python script by providing the number of available cores, the available memory, the number of disks, and whether HBase is installed.
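For example, an invocation might look like the following (flag meanings as described in the HDP companion-file documentation; the hardware numbers are made up):
# -c cores, -m memory in GB, -d number of data disks, -k True if HBase is installed
python yarn-utils.py -c 16 -m 64 -d 4 -k False
The script prints suggested values for settings such as yarn.nodemanager.resource.memory-mb and yarn.scheduler.maximum-allocation-mb, which you can then apply via Ambari.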
If you have a heterogeneous Hadoop cluster, then you have to create configuration groups based on the nodes' specifications. If you need more info on that, let me know and I will update my answer accordingly.

Are there any downsides to running Elasticsearch on a multi-purpose (i.e. non-dedicated) cluster?

I just set up an Elasticsearch (ES) 3-node cluster using one of GKE's click-to-deploy configurations. Each node is of the n1-standard-4 machine type (4 vCPUs / 15 GB RAM). I have always run ES on clusters dedicated to that single purpose (performance reasons, separation of concerns, making my life easier when debugging machine faults), and currently this GKE cluster is the same.
However, I have a group of batch jobs I would like to port to run on a GKE cluster. Since they update several large files, I would like this to also run on a stateful cluster (just like ES) so I can move the updated files to the cloud once a day rather than round-tripping on every run. The batch jobs in question run at 5-minute, 15-minute, or daily frequency for about 18 hours every day.
My question now is, what is the best way to deploy this batch process given the existing ES cluster...
Create an entirely new cluster?
Create another node pool?
Create a separate namespace and increase the cluster's autoscaling?
Some other approach I'm missing?
Note: I'm a few days into using GKE and containerization in general
Based on my knowledge, I would go for another node pool or the autoscaler.
Create an entirely new cluster?
For me it would be overkill just for running the jobs.
Create another node pool?
I would say it's the best option, on par with the autoscaler: create a new node pool just for the jobs that scales down to 0 when there is nothing left to do.
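A sketch of that with gcloud (the cluster name, zone, machine type, and node caps are placeholders or illustrative values):
# Dedicated pool for the batch jobs; it can scale all the way down to 0 when idle
gcloud container node-pools create batch-pool \
  --cluster <your-gke-cluster> \
  --zone <cluster-zone> \
  --machine-type n1-standard-4 \
  --enable-autoscaling --min-nodes 0 --max-nodes 3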
Create a separate namespace and increase the cluster's autoscaling?
Same as another node pool, but from my point of view, if you would like to do that, you would have to label your nodes for Elasticsearch so the jobs can't take any resources from them. So, answering your question from the comment:
my question is more about if doing this with autoscaler within the same cluster would in any way affect elasticsearch esp with all the ES specific yaml configs?
It shouldn't. As I said above, you can always label the 3 specific nodes (default node pool) to work only with Elasticsearch so that nothing else will take their resources; the cluster will scale up when it needs more resources for the jobs and scale back to the 3 ES nodes when the jobs end their 18-hour workday.
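A sketch of the labelling idea (the node names and the label key/value are made up):
# Reserve the three default-pool nodes for Elasticsearch
kubectl label nodes <es-node-1> <es-node-2> <es-node-3> dedicated=elasticsearch
# Then add a matching nodeSelector (dedicated: elasticsearch) to the Elasticsearch
# Pod template; the batch jobs, having no such selector, will be scheduled onto
# the autoscaled pool instead.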
Also with regards to the 6h node pool doing nothing comment, wouldn't I be able to avoid this on a new cluster or node pool with a minimum scaling parameter of zero?
Based on the GCP documentation, it would work for a node pool, but not for a new cluster.
If you specify a minimum of zero nodes, an idle node pool can scale down completely. However, at least one node must always be available in the cluster to run system Pods.
tl;dr: Go for the autoscaler or another node pool; if you're worried about resources for your ES, label the 3 nodes just for ES.
I hope this answers your question. Let me know if you have any more questions.

Is the etcd clustering design suitable for reliable key/value replication applications?

I know this isn't a pure programming question, but I'm developing a distributed appliance management application that has a need for reliably sharing/replicating data across multiple servers/VMs, without the heavy orchestration of a typical database.
Is etcd clustering actually a reliable architecture for key/value replication in a cluster?
This is all supposed to be automated, and since I'm starting with Atomic/CoreOS and etcd is already present, I stood up a 4-node cluster with full TLS and some user/role configuration to play with. I went as far as developing a library to read/write/export the data structures my application requires.
The etcd v3 documentation isn't as clear on some of the more revealing points that the v2 documentation was.
I was hoping for peer-to-peer replication of data, assuming connectivity to at least 1 other node in the cluster. If a node was unable to reach the rest of the cluster, its store would still be available locally, and I could codify how I want my appliance to deal with being "offline" for too long, until it re-establishes its availability.
I'm going to describe some of the limitations I've encountered with clustered etcd, and maybe someone can comment on whether it's the right solution for my job. This isn't a complaint session; I just want to understand if I'm using the wrong solution.
Availability/failover is impossible when only 2 nodes are in a cluster: I started with a 2-node configuration and set up a working cluster (e.g. put/get data from one to the other, replication works, etc.). When one goes down, the other enters endless leader-election mode. It's the only one left, yet Raft needs at least 1 other node available to elect a leader. So failover needs at least 3 etcd nodes, so that if 1 goes down, the other 2 can pick a leader. I'm really floored by this limitation. I must be doing something wrong.
Joining a new etcd node to a cluster requires almost all details about the existing cluster: When joining a new node, the member needs to be added in advance (with its peering connection details); then, on the new node, you have to start etcd with cluster-state "existing" and initial-cluster set to contain exactly all nodes in the cluster with their names and peer URLs. This is a pretty synchronized procedure, because after you join, everything is stored in the data directory, so subsequent starts don't need any cluster configuration at all. I find this makes adding a node kind of fragile, and kind of unnecessary, as the cluster already knows about the new node (you had to add it as a member).
During the leader-election process you can't access any node's keys/values: When a node goes down, no get/put requests are serviced. If you only have 2 nodes in the cluster, the election process never ends (see #1), so your application will never get access to that data, resulting in a bit of a cascading failure.
If anyone is interested in testing etcd out for this kind of thing, I've included the scripts I used below (I left out the TLS configuration bits). I don't expect anyone to test this for me.
# On the 1st node (I used Centos7 minimal, with etcd installed)
sudo firewall-cmd --permanent --add-port=2379/tcp
sudo firewall-cmd --permanent --add-port=2380/tcp
sudo firewall-cmd --reload
export CL_NAME=etcd1
export HOST=$(hostname)
export IP_ADDR=$(ip -4 addr show ens33 | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
export TOKEN=$(date +%s | sha256sum | base64 | head -c 32 ; echo)
# turn on etcdctl v3 api support, why is this not default?!
export ETCDCTL_API=3
sudo etcd --name $CL_NAME --data-dir ~/data --advertise-client-urls=https://127.0.0.1:2379,https://$IP_ADDR:2379 --listen-client-urls=https://0.0.0.0:2379 --initial-advertise-peer-urls https://$IP_ADDR:2380 --listen-peer-urls https://$IP_ADDR:2380 --initial-cluster-state new --initial-cluster="${CL_NAME}=https://$IP_ADDR:2380" --initial-cluster-token $TOKEN
# Next come up with a name for the next node, etcd2 then etcd3, etc... Get their hostname/IP and add them as a future member
etcdctl --endpoints="https://127.0.0.1:2379" member add etcd2 --peer-urls="https://<next node's IP address>:2380"
# 1st etcd is now running, with peering available and members
# added for next nodes
# copy "echo $TOKEN" for next steps where its needed
# On the 2nd/next node (I used Centos7 minimal, with etcd installed)
sudo firewall-cmd --permanent --add-port=2379/tcp
sudo firewall-cmd --permanent --add-port=2380/tcp
sudo firewall-cmd --reload
export CL_NAME=etcd2
export HOST=$(hostname)
export IP_ADDR=$(ip -4 addr show ens33 | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
export ETCDCTL_API=3
export TOKEN=<TOKEN string from above>
sudo etcd --name $CL_NAME --data-dir ~/data --advertise-client-urls=https://127.0.0.1:2379,https://$IP_ADDR:2379 --listen-client-urls=https://0.0.0.0:2379 --initial-advertise-peer-urls https://$IP_ADDR:2380 --listen-peer-urls https://$IP_ADDR:2380 --initial-cluster-state existing --initial-cluster-token $TOKEN --initial-cluster="etcd1=https://<IP of 1st node>:2380,etcd2=https://$IP_ADDR:2380"
# NOTE the --initial-cluster ALWAYS MUST have all nodes in the
# cluster, with their names and peer urls, otherwise it won't join
# Here's an example for the 3rd node
sudo etcd --name $CL_NAME --data-dir ~/data --advertise-client-urls=https://127.0.0.1:2379,https://$IP_ADDR:2379 --listen-client-urls=https://0.0.0.0:2379 --initial-advertise-peer-urls https://$IP_ADDR:2380 --listen-peer-urls https://$IP_ADDR:2380 --initial-cluster-state existing --initial-cluster-token $TOKEN --initial-cluster="etcd1=https://<IP of 1st node>:2380,etcd2=https://<IP of 2nd node>:2380,etcd3=https://<IP of 3rd node>:2380"
When you get down to 2 running nodes (kill one of them), a new leader will be elected right away. When you're down to 1, the whole thing is useless. If you bring one back up, it will all work again.
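While killing nodes, the following is handy for seeing what the cluster thinks is going on (the endpoints are placeholders; member list, endpoint status, and endpoint health are standard etcdctl v3 subcommands):
export ETCDCTL_API=3
ENDPOINTS="https://<node1>:2379,https://<node2>:2379,https://<node3>:2379"
# Who is in the cluster, who is the leader, and which endpoints still answer
etcdctl --endpoints="$ENDPOINTS" member list
etcdctl --endpoints="$ENDPOINTS" endpoint status --write-out=table
etcdctl --endpoints="$ENDPOINTS" endpoint health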
Thank you for your insightful and well-asked question. I'm currently doing a similar investigation. I'm using etcd 3.4, where the API is v3 (not v3beta etc.). I have some partial answers for you at least (formed in part by my own understanding of distributed systems). I'll answer your questions 1 and 3 first, then 2 at the end.
2-node clusters are particularly troublesome where mutable state is concerned, largely because of the 'split-brain' problem: there is no quorum, and neither side (of a network split) has enough knowledge to know it is still authoritative; neither side can safely write any data, because there is no useful (or at least generic) way to update the cluster when quorum returns.
2-node clusters in practice are more tenable for active/passive deployments, such as hosting a floating IP. You can use the likes of Pacemaker to implement HA solutions that coordinate resources such as a floating IP, DRBD block device, filesystem, and MariaDB stack (I've done so)... BUT this still comes with the proviso that a 2-node cluster is a compromise, in the sense that sometimes the risk of split-brain can be accepted.
To make 2-node 'clusters' more robust (to mitigate the risk of split-brain), something else is needed. Here are a few examples:
Pacemaker recommends the use of STONITH (Shoot The Other Node In The Head), which is an example of a 'fencing' mechanism. So if the network in between the two nodes goes away and both nodes now think they are the sole remaining node, the idea is that each node will try to power off the other (first one wins). This might be done by interacting with a power controller in a cluster, or with a hypervisor. As you can see, this requires a certain level of privilege with the supporting infrastructure.
Network appliances do often operate as a 2-node cluster, and generally active/passive (it's hard enough at times selling the idea of 2 identical pieces of equipment and only using one at a time). In a physical networking deployment, the HA link may be required to be a point-to-point link, such as a single fibre pair over which heartbeats are sent. In routing protocols such as BGP this is even a requirement, the idea being that if the remote side powers off, the link is seen to go down... there are lots of subtleties here, such as what happens if one fibre in a fibre pair were to be broken/degraded.
I've also seen HA solutions that use some other data to decide which side of a two-node cluster gets to be the leader. Things like (and these can be combined, and applied in a particular order):
comparison of IP address (which is numerically greater or lesser)
uptime (last observed uptime)
appliance serial number
statically configured weighting
All this concern over quorum, split-brain, etc. relates to the CAP theorem (https://en.wikipedia.org/wiki/CAP_theorem):
Consistency: Every read receives the most recent write or an error
Availability: Every request receives a (non-error) response, without the guarantee that it contains the most recent write
Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes
In the event of a network partition, you have to choose between Consistency and Availability. Etcd has made the choice of Consistency, which is good, because etcd is a data store for critical infrastructure. After an event, you really want things to come back up ASAP and without manual, error-prone recovery (i.e. just guessing which side of the cluster we should mark as the authoritative side).
Now, you might be thinking that etcd could at least revert to being able to satisfy a read attempt in the event of a partition; but then that would violate Consistency by risking the delivery of out-of-date information. Etcd also provides some other fundamental cluster services, such as making it easier to do leader elections and locks for your applications; so if etcd were to offer a potentially stale response, that would likely break promises and off-load a lot of the correctness responsibility onto the clients (so you can bet it would tend not to be done correctly, particularly by anything that consumes the data via some REST/gRPC interface).
This is no different for systems such as ZooKeeper, which make the requirement for a three (or five) node cluster much more overt, along with cautions about performance concerns that might cause heartbeat failures (e.g. due to garbage-collection pauses).
You mention you're on CoreOS, so your application is presumably running in a container. In terms of HA strategies, you therefore have another option that people often haven't really considered: just restart it automatically. At least in Kubernetes (or OpenShift, which embodies Kubernetes, and now CoreOS), if you have HA built into your cluster, then if a container or node fails, it can be rescheduled onto a different node.
So if you don't do a three-node etcd service, do a single-node etcd service that can be restarted on failure. That's a viable HA strategy; it's not a load-balancing strategy, though.
There is also the concept of mirroring, which is supported by etcd; etcdctl has built-in functionality for this (make-mirror). You could use it to create multiple independent HA clusters (say, for a remote site or availability zone). You should generally never operate a cluster over a WAN; instead you should replicate a cluster.
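A minimal sketch of that, run against the source cluster with a separate destination cluster as the target (the endpoints are placeholders):
# Continuously copy keys from the local cluster to an independent one
# (e.g. in another site or availability zone); runs until interrupted.
ETCDCTL_API=3 etcdctl --endpoints="https://<source-node>:2379" \
  make-mirror https://<destination-node>:2379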
You'll find this same sort of pattern for etcd, elasticsearch, kafka, etc.
With regard to your second question, etcd3 does have a cluster bootstrap facility (https://etcd.io/docs/v3.4.0/dev-internal/discovery_protocol/). In short, you can point your node at another information store, and it will get all the same information you would otherwise have had to specify.
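A sketch of that bootstrap flow using the public discovery service (you can also run your own; the cluster size and variable names are illustrative):
# Ask the discovery service for a token sized for a 3-node cluster
DISCOVERY_URL=$(curl -s 'https://discovery.etcd.io/new?size=3')
# Start each node with --discovery instead of spelling out --initial-cluster;
# the members find each other via the token.
sudo etcd --name etcd1 --data-dir ~/data \
  --initial-advertise-peer-urls https://$IP_ADDR:2380 \
  --listen-peer-urls https://$IP_ADDR:2380 \
  --advertise-client-urls https://$IP_ADDR:2379 \
  --listen-client-urls https://0.0.0.0:2379 \
  --discovery $DISCOVERY_URL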
To understand the thinking around why etcd3 doesn't work the same way as, say, Kafka or Elasticsearch cluster membership, you have to think of etcd as providing an infrastructure service, the base of the pyramid. You don't want too many dependencies (or assumptions about what traffic might be able to pass, or what discovery features might be available).
Etcd could in theory use tools such as cloud-init to obtain things like AWS inventory, but that flexibility is devolved to the user. You should perhaps expect that to be embodied in a script or deployment playbook; it's too entangled to do within etcd, doing so would be a sign of violating separation of concerns, and the expected deployment mechanism should already have all that inventory knowledge.
Kafka and Elasticsearch are kind of interesting to compare here. Kafka gets its cluster membership information from ZooKeeper (which is what etcd was designed to replace, and which solves much the same sort of problem).
Elasticsearch is interesting because it makes clustering remarkably easy. But that's because it uses sharding and replication, and it ends up being quite different in terms of the CAP theorem. That's not to say that it sits in the same space as etcd or ZooKeeper, though. In my experience with Elasticsearch I always manually specify all the nodes in my playbook, but Elasticsearch does have more support for discovering its other nodes in environments such as AWS. (cf. https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-hosts-providers.html)
Hope that's useful for you,
Cameron

3 Node Cluster for Elastic, Kafka and Cassandra - On 3 Machines

We are creating a 3-node Elasticsearch cluster, but we want to use each of our 3 Elasticsearch nodes for other things as well, like Kafka and Cassandra. We need high availability, so we want to have 3 nodes for everything, but we don't want to have 9 machines; we just want to co-locate everything on 3 bigger machines. Is this a typical scenario?
I would say no.
One sandbox machine running a PoC with all the components local? Sure, why not. But for production with HA requirements, you are just asking for trouble by putting everything in one place. They're going to compete for resources, one blowing up the box kills the others, touching the machine to change one risks the others, etc., etc.
IMO keep your architecture clean and deploy on separate nodes for each component.

Scale for Elastic MapReduce (EMR)?

I'm evaluating EC2/EMR for running a ~20-node Hadoop cluster (custom JAR cluster). I've run the simple WordCount example on a single-node 3.3 GHz / 2 GB RAM local VMware instance, where it takes less than 10 seconds to complete. The same WordCount example takes 3 minutes to complete on EMR with 2 c1.medium instances (excluding the startup time of 3-5 minutes), and the same time on 2 m1.small instances. There will be some overhead for running a job on EMR, and maybe this problem size is too small, so this seems understandable.
At about what size problems do you begin to see the performance advantage of the cloud? Or at about how many nodes or compute units?
If you're spinning up an EMR job, that essentially means you're asking Amazon to provide you with an on-demand cluster of N machines, and the simple fact of provisioning and handing over these machines can easily take several minutes, not to mention that these machines need to be set up, can have bootstrap actions, and so on. I've rarely seen EMR jobs (even big ones) take more than 10 minutes to have the cluster ready, but I've also rarely seen a cluster be up in less than a couple of minutes.
If you have a job that you're running frequently (for example every hour), then the cost of setting up and shutting down your EMR cluster might be too big; in that case it would be a good idea to create your cluster with some reserved instances on EC2. With reserved instances, you will have your own cluster always up and administered by you, so there is no time lost setting up/shutting down the cluster, and it behaves like a regular Hadoop cluster.
What I've been doing for the past couple of years is use an EC2 cluster on reserved instances that is always up, with all the jobs running on it; but for some jobs that are very large and couldn't fit on my cluster, I run them on EMR, where I can choose how many nodes I want, and since these are large jobs the time to set up/shut down the cluster is small compared to the total runtime. I would not recommend using EMR for small/frequent jobs.
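For the large, infrequent jobs, a present-day equivalent with the AWS CLI would look roughly like this (the release label, instance type/count, bucket paths, and JAR are placeholders; --auto-terminate tears the cluster down when the step finishes, which is what keeps the setup/teardown overhead acceptable for big jobs):
# Spin up a transient cluster sized for the job, run one step, then shut down
aws emr create-cluster \
  --name "big-batch-job" \
  --release-label emr-6.10.0 \
  --applications Name=Hadoop \
  --instance-type m5.xlarge --instance-count 20 \
  --use-default-roles \
  --auto-terminate \
  --steps Type=CUSTOM_JAR,Name=wordcount,Jar=s3://<bucket>/wordcount.jar,Args=[s3://<bucket>/input,s3://<bucket>/output]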
