3 Node Cluster for Elastic, Kafka and Cassandra - On 3 Machines - elasticsearch

We are creating a 3 node elastic cluster, but want to use each of our 3 elastic nodes for other things, like Kafka and Cassandra. We need high availability, so we want to have 3 nodes for everything, but we don't want to have 9 machines, we just want to put them on one bigger computer. Is this a typical scenario?

I would say no.
One sandbox machine running a PoC with all the components local, sure, why not. But Production with HA requirements, you are just asking for trouble putting everything in one place. They're going to compete for resource, one blowing the box up kills the others, touching the machine to change one risks the others, etc, etc.
IMO keep your architecture clean and deploy on separate nodes for each component.

Related

Can I use the same flow.xml.gz for two different Nifi cluster?

We have a 13 nodes nifi cluster with around 50k processors. The size of the flow.xml.gz is around 300MB. To bring up the 13 nodes Nifi cluster, it usually takes 8-10 hours. Recently we split the cluster into two parts, 5nodes cluster and 8 nodes cluster with the same 300MB flow.xml.gz in both. Since then we are not able to get the Nifi up in both the clusters. Also we are not seeing any valid logs related to this issue. Is it okay to have the same flow.xml.gz . What are the best practices we could be missing here when splitting the Nifi Cluster.
You ask a number of questions that all boil down to "How to improve performance of our NiFi cluster with a very large flow.xml.gz".
Without a lot more details on your cluster and the flows in it, I can't give a definite or guaranteed-to-work answer, but I can point out some of the steps.
Splitting the cluster is no good without splitting the flow.
Yes, you will reduce cluster communications overhead somewhat, but you probably have a number of input processors that are set to "Primary Node only". If you load the same flow.xml.gz on two clusters, both will have a primary node executing these, leading to contention issues.
More importantly, since every node still loads all of the flow.xml.gz (probably 4 Gb unzipped), you don't have any other performance benefits and verifying the 50k processors in the flow at startup still takes ages.
How to split the cluster
Splitting the cluster in the way you did probably left references to nodes that are now in the other cluster, for example in the local state directory. For NiFi clustering, that may cause problems electing a new cluster coordinator and primary node, because a quorum can't be reached.
It would be cleaner to disconnect, offload and delete those nodes first from the cluster GUI so that these references are deleted. Those nodes can then be configured as a fresh cluster with an empty flow. Even if you use the old flow again later, test it out with an empty flow to make it a lot quicker.
Since you already split the cluster, I would try to start one node of the 8 member cluster and see if you can access the cluster menu to delete the split-off nodes (disconnecting and offloading probably doesn't work anymore). Then for the other 7 members of the cluster, delete the flow.xml.gz and start them. They should copy over the flow from the running node. You should adjust the number of candidates expected in nifi.properties (nifi.cluster.flow.election.max.candidates) so that is not larger than the number of nodes to slightly speed up this process.
If successful, you then have the 300 MB flow running on the 8 member cluster and an empty flow on the new 5 member cluster.
Connect the new cluster to your development pipeline (NiFi registry, templates or otherwise). Then you can stop process groups on the 8 member cluster, import them on the new and after verifying that the flows are running on the new cluster, delete the process group from the old, slowly shrinking it.
If you have no pipeline or it's too much work to recreate all the controllers and parameter contexts, you could take a copy of the flow.xml.gz to one new node, start only that node and delete all the stuff you don't need. Only after that should you start the others (with their empty flow.xml.gz) again.
For more expert advice, you should also try the Apache NiFi Users email list. If you supply enough relevant details in your question, someone there may know what is going wrong with your cluster.

Are there any downsides to running Elasticsearch on a multi-purpose (i.e. non-dedicated) cluster?

I just set up an Elasticsearch (ES) 3 node cluster using one of GKE's click to deploy configurations. Each node is of n1-standard-4 machine type (4vCPUs/15GB RAM). I have always run ES on clusters dedicated to that single purpose (performance reasons, separation of concerns, make my life easier to debug machine faults), and currently, this GKE cluster is the same.
However, i have a group of batch jobs i would like to port to run on a GKE cluster. Since it updates several large files, I would like this to also run on a stateful cluster (just like ES) so I can move updated files to the cloud once a day rather than round tripping on every run. The batch jobs in question run at 5min, 15min or daily frequency for about 18hrs every day.
My question now is, what is the best way to deploy this batch process given the existing ES cluster...
Create an entirely new cluster?
Create another node pool?
Create a separate namespace and increase the cluster's autoscaling?
Some other approach i'm missing?
Note: I'm a few days into using GKE and containerization in general
Based on my knowledge I would go for another nodepool or autoscaler.
Create an entirely new cluster?
For me it would be an overkill for just running the jobs.
Create another node pool?
I would say it's the best option equally with the autoscaler, create a new nodepool just for the jobs which would scale down to 0 if there is nothing more to do.
Create a separate namespace and increase the cluster's autoscaling?
Same as another node pool, but from my point of view if you would like to do that, then you would have to label your nodes to the Elasticsearch, then jobs can't take any resources from them, so answering your question from comment
my question is more about if doing this with autoscaler within the same cluster would in any way affect elasticsearch esp with all the ES specific yaml configs?
It shouldn't, as I said above, you can always label the 3 specific nodes(default nodepool) to work only with elasticsearch then nothing will take their resources, cluster will rescale when it will need more resources for jobs and rescale to 3 ES nodes when jobs end their 18hrs work.
Also with regards to the 6h node pool doing nothing comment, wouldn't I be able to avoid this on a new cluster or node pool with a minimum scaling parameter of zero?
Based on gcp documentation it would work for nodepool, but not for new cluster.
If you specify a minimum of zero nodes, an idle node pool can scale down completely. However, at least one node must always be available in the cluster to run system Pods.
tldr Go for the autoscaler or another nodepool, if you're worried about resources for your ES label the 3 nodes just for ES.
I hope it answer your question. Let me know if you have any more questions.

Running multiple Elasticsearch production nodes in one machine

Is it a recommended practice to run multiple Elasticsearch nodes in one physical (virtual) machine? I'm speaking about production environment.
I currently have three virtual machines that unicast each other. Setup:
node.name:"VM1"
master:true
data:true
node.name:"VM2"
master:true
data:true
node.name:"VM3"
master:false
data:true
There's a request to have a dedicated master node in first virtual machine (next to VM1). I'm trying to avoid that and looking for strong arguments that I shouldn't do this.
Please advice.
Having a dedicated master makes sense in a larger environment to me. I would say if your nodes are not that busy having a data node also be a master would not be the end of the world. I would be more comfortable having 3 data nodes for high availability.

Elasticsearch: 2-node cluster with failover

I am using Elasticsearch 1.5.2 and trying to setup a 2-node cluster. These 2 nodes are primarily for failover strategy (if any one node goes down, the other one is still there to handle requests), I don't need to divide primary shards or something like that, (total data is no more than 500mb on hard-disk).
Everything goes well, until Split Brains thing kicks in. Now, since I don't have much data, I don't feel any requirement of 3 nodes. And I want to have failover mechanism too. Which means, discovery.zen.minimum_master_nodes cannot be more than 1.
Now, I have two questions:
Is there any configuration possible, which could overcome 2 master nodes or Split Brains problem?
If not, what all other options do I have to make it work? Like, keeping both in different clusters (one online, other one offline) and updating offline with online, time to time, for the time when online cluster goes down. Or, do I have to go for 3-node cluster?
I am going on production environment. Please help.

How to setup ElasticSearch cluster with auto-scaling on Amazon EC2?

There is a great tutorial elasticsearch on ec2 about configuring ES on Amazon EC2. I studied it and applied all recommendations.
Now I have AMI and can run any number of nodes in the cluster from this AMI. Auto-discovery is configured and the nodes join the cluster as they really should.
The question is How to configure cluster in way that I can automatically launch/terminate nodes depending on cluster load?
For example I want to have only 1 node running when we don't have any load and 12 nodes running on peak load. But wait, if I terminate 11 nodes in cluster what would happen with shards and replicas? How to make sure I don't lose any data in cluster if I terminate 11 nodes out of 12 nodes?
I might want to configure S3 Gateway for this. But all the gateways except for local are deprecated.
There is an article in the manual about shards allocation. May be I'm missing something very basic but I should admit I failed to figure out if it is possible to configure one node to always hold all the shards copies. My goal is to make sure that if this would be the only node running in the cluster we still don't lose any data.
The only solution I can imagine now is to configure index to have 12 shards and 12 replicas. Then when up to 12 nodes are launched every node would have copy of every shard. But I don't like this solution cause I would have to reconfigure cluster if I might want to have more then 12 nodes on peak load.
Auto scaling doesn't make a lot of sense with ElasticSearch.
Shard moving and re-allocation is not a light process, especially if you have a lot of data. It stresses IO and network, and can degrade the performance of ElasticSearch badly. (If you want to limit the effect you should throttle cluster recovery using settings like cluster.routing.allocation.cluster_concurrent_rebalance, indices.recovery.concurrent_streams, indices.recovery.max_size_per_sec . This will limit the impact but will also slow the re-balancing and recovery).
Also, if you care about your data you don't want to have only 1 node ever. You need your data to be replicated, so you will need at least 2 nodes (or more if you feel safer with a higher replication level).
Another thing to remember is that while you can change the number of replicas, you can't change the number of shards. This is configured when you create your index and cannot be changed (if you want more shards you need to create another index and reindex all your data). So your number of shards should take into account the data size and the cluster size, considering the higher number of nodes you want but also your minimal setup (can fewer nodes hold all the shards and serve the estimated traffic?).
So theoretically, if you want to have 2 nodes at low time and 12 nodes on peak, you can set your index to have 6 shards with 1 replica. So on low times you have 2 nodes that hold 6 shards each, and on peak you have 12 nodes that hold 1 shard each.
But again, I strongly suggest rethinking this and testing the impact of shard moving on your cluster performance.
In cases where the elasticity of your application is driven by a variable query load you could setup ES nodes configured to not store any data (node.data = false, http.enabled = true) and then put them in for auto scaling. These nodes could offload all the HTTP and result conflation processing from your main data nodes (freeing them up for more indexing and searching).
Since these nodes wouldn't have shards allocated to them bringing them up and down dynamically shouldn't be a problem and the auto-discovery should allow them to join the cluster.
I think this is a concern in general when it comes to employing auto-scalable architecture to meet temporary demands, but data still needs to be saved. I think there is a solution that leverages EBS
map shards to specific EBS volumes. Lets say we need 15 shards. We will need 15 EBS Volumes
amazon allows you to mount multiple volumes, so when we start we can start with few instances that have multiple volumes attached to them
as load increase, we can spin up additional instance - upto 15.
The above solution is only advised if you know your max capacity requirements.
I can give you an alternative approach using aws elastic search service(it will cost little bit more than normal ec2 elasticsearch).Write a simple script which continuously monitor the load (through api/cli)on the service and if the load goes beyond the threshold, programatically increase the nodes of your aws elasticsearch-service cluster.Here the advantage is aws will take care of the scaling(As per the documentation they are taking a snaphost and launching a completely new cluster).This will work for scale down also.
Regarding Auto-scaling approach there is some challenges like shard movement has an impact on the existing cluster, also we need to more vigilant while scaling down.You can find a good article on scaling down here which I have tested.If you can do some kind of intelligent automation of the steps in the above link through some scripting(python, shell) or through automation tools like Ansible, then the scaling in/out is achievable.But again you need to start the scaling up well before the normal limits since the scale up activities can have an impact on existing cluster.
Question: is possible to configure one node to always hold all the shards copies?
Answer: Yes,its possible by explicit shard routing.More details here
I would be tempted to suggest solving this a different way in AWS. I dont know what ES data this is or how its updated etc... Making a lot of assumptions I would put the ES instance behind a ALB (app load balancer) I would have a scheduled process that creates updated AMI's regularly (if you do it often then it will be quick to do), then based on load of your single server I would trigger more instances to be created from the latest instance you have available. Add the new instances to the ALB to share some of the load. As this quiet down I would trigger the termination of the temp instances. If you go this route here are a couple more things to consider
Use spot instances since they are cheaper and if it fits your use case
The "T" instances dont fit well here since they need time to build up credits
Use lambdas for the task of turning things on and off, if you want to be fancy you can trigger it based on a webhook to the aws gateway
Making more assumptions about your use case, consider putting a Varnish server in front of your ES machine so that you can more cheaply provide scale based on a cache strategy (lots of assumptions here) based on the stress you can dial in the right TTL for cache eviction. Check out the soft-purge feature for our ES stuff we have gotten a lot of good value from this.
if you do any of what i suggest here make sure to make your spawned ES instances report any logs back to a central addressable place on the persistent ES machine so you don't lose logs when the machines die

Resources