Controlling where shards are allocated - elasticsearch

My setup:
Two zones, fast and slow, with 5 nodes each.
fast nodes have ephemeral storage, whereas the slow nodes are NFS based.
Running Elasticsearch OSS v7.7.1. (I have no control over the version)
I have the following cluster setting: cluster.routing.allocation.awareness.attributes: zone
My index has 2 replicas, so 3 shard instances (1x primary, 2x replica)
I am trying to ensure the following:
1 of the 3 shard instances to be located in zone fast.
2 of the 3 shard instances to be located in zone slow (because it has persistent storage)
Queries to be run in shard in zone fast where available.
Inserts to only return as written once they have been replicated.
Is this setup possible?
Link to a related question: How do I control where my primary and replica shards are located?
EDIT to add extra information:
Both fast and slow nodes run on a PaaS offering where we are not in control of hardware restarts meaning there can technically be non-graceful shutdowns/restarts at any point.
I'm worried about unflushed data and/or index corruption so I am looking for multiple replicas to be on the slow zone nodes backed by NFS to reduce the likelihood of data loss, despite the fact that this will "overload" the slow zone with redundant data.
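
For reference, the setup described above can be written down as request bodies. This is only a sketch of the existing configuration, not an answer to the allocation question. The helper names are made up for illustration, the bodies are meant for PUT _cluster/settings and PUT <index>/_settings, and nothing here talks to a cluster.

```python
import json


def awareness_cluster_settings(attribute: str) -> dict:
    """Body for PUT _cluster/settings that enables allocation awareness
    on a single node attribute (here: the 'zone' attribute from the question)."""
    return {
        "persistent": {
            "cluster.routing.allocation.awareness.attributes": attribute
        }
    }


def replica_index_settings(replicas: int) -> dict:
    """Body for PUT <index>/_settings that sets the replica count."""
    return {"index": {"number_of_replicas": replicas}}


def copies_per_shard(replicas: int) -> int:
    """One primary plus its replicas."""
    return 1 + replicas


if __name__ == "__main__":
    print(json.dumps(awareness_cluster_settings("zone"), indent=2))
    print(json.dumps(replica_index_settings(2), indent=2))
    print("shard instances per shard:", copies_per_shard(2))
```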

Related

How to set the number of replicas properly when I try to start a TDengine cluster?

I was wondering how to set the replica parameter properly when starting a TDengine cluster, to balance storage against high availability. According to the TDengine documentation, the default value of replica is 1, which means there are no extra copies of each vnode (the vGroup size is 1 as well), and replica can be changed dynamically to maintain high availability of the cluster. However, the extra vnode copies have to be physically created when running with multiple replicas. So the question arises: how should a real company choose the value of replica to increase availability without taking on too much overhead (storage and performance) when using a TDengine cluster?
Replication means keeping a copy of the same data on multiple machines that are connected via a network. There are several reasons you might want to replicate data:
To keep data geographically close to your users (and thus reduce latency)
To allow the system to continue working even if some of its parts have failed (and thus increase availability)
To scale out the number of machines that can serve read queries (and thus increase read throughput)
(Quoted from Designing Data-Intensive Applications, DDIA.)

How does elasticsearch prevent cascading failure after node outage due to disk pressure?

We operate an elasticsearch stack which uses 3 nodes to store log data. Our current config is to have indices with 3 primaries and 1 replica. (We have just eyeballed this config and are happy with the performance, so we decided to not (yet) spend time for optimization)
After a node outage (let's assume a full disk), I have observed that elasticsearch automatically redistributes its shards to the remaining instances - as advertised.
However, this increases disk usage on the remaining two instances, making them candidates for a cascading failure.
Durability of the log data is not paramount. I am therefore thinking about reconfiguring elasticsearch to not create a new replica after a node outage. Instead, it could just run on the primaries only. This means that after a single node outage, we would run without redundancy. But that seems better than a cascading failure. (This is a one time cost)
An alternative would be to just increase disk size. (This is an ongoing cost)
My question
(How) can I configure elasticsearch to not create new replicas after the first node has failed? Or is this considered a bad idea and the canonical way is to just increase disk capacity?
Rebalancing is expensive
When a node leaves the cluster, some additional load is generated on the remaining nodes:
Promoting a replica shard to primary to replace any primaries that were on the node.
Allocating replica shards to replace the missing replicas (assuming there are enough nodes).
Rebalancing shards evenly across the remaining nodes.
This can lead to quite some data being moved around.
Sometimes a node is only missing for a short period of time, and a full rebalance is not justified in such a case. To account for that, when a node goes down, elasticsearch immediately promotes a replica shard to primary for each primary that was on the missing node, but then waits for one minute before creating new replicas, to avoid unnecessary copying.
Only rebalance when required
The duration of this delay is a tradeoff and can therefore be configured. Waiting longer means less chance of useless copying but also more chance for downtime due to reduced redundancy.
Increasing the delay to a few hours results in what I am looking for. It gives our engineers some time to react, before a cascading failure can be created from the additional rebalancing load.
I learned that from the official elasticsearch documentation.
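
As a minimal sketch of the delay described above: the one-minute wait is controlled by the index setting index.unassigned.node_left.delayed_timeout, which can be updated on live indices. The helper name and the "4h" value are illustrative assumptions. The body is meant for PUT _all/_settings (or a single index), and the sketch only builds the request body without contacting a cluster.

```python
import json


def delayed_allocation_settings(timeout: str) -> dict:
    """Body for PUT _all/_settings that delays re-allocation of replicas
    after a node leaves. 'timeout' uses Elasticsearch time units, e.g. '5m' or '4h'."""
    return {
        "settings": {
            "index.unassigned.node_left.delayed_timeout": timeout
        }
    }


if __name__ == "__main__":
    # "4h" is an example for "a few hours"; pick a value that matches
    # how quickly your engineers can react to a node outage.
    print(json.dumps(delayed_allocation_settings("4h"), indent=2))
```

Note that this only delays the creation of replacement replicas. Promotion of replicas to primaries still happens immediately.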

elasticsearch cluster without replica

I want to create an Elasticsearch cluster without replicas, because the cluster runs on a VMInfra that is fully redundant and the disks are configured in RAID.
Is it fine?
Also, what would be the advantage of creating a cluster without replicas?
Replicas increase availability. So let's say you have a data node that's overloaded or down, if you have another node with a replica of the same shard on it, your data will still be available. I'm not familiar with VMInfra so I'm not sure whether "fully redundant" means you can get the same level of availability in those cases.
For most production uses you'll want to have at least one replica for each shard - meaning, the primary is on one node, and the replica is on another, and therefore at least two separate data nodes.
Having more replicas might make sense as load increases. Replicas have costs in memory and disk space - you're essentially storing the data several times.
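
To make the disk cost concrete, here is a rough back-of-the-envelope sketch. It assumes every replica is a full copy of its primary and ignores transient overhead such as segment merges. The function name and the 100 GB figure are illustrative only.

```python
def total_store_size_gb(primary_size_gb: float, replicas: int) -> float:
    """Approximate on-disk size of an index across the cluster:
    the primary data plus one full copy per replica."""
    if replicas < 0:
        raise ValueError("replicas must be >= 0")
    return primary_size_gb * (1 + replicas)


if __name__ == "__main__":
    for replicas in (0, 1, 2):
        size = total_store_size_gb(100.0, replicas)
        print(f"{replicas} replica(s): about {size:.0f} GB for 100 GB of primary data")
```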
Having no replica shards at all is a bad trade-off, even in the optimistic case. Replicas provide redundant copies of your data to protect against hardware failure. They also increase search throughput, because queries can be executed on all replicas in parallel without interfering with each other, so they help search performance too. On the other hand, having too many shards adds overhead to the system.
If your virtual server environment is well configured, I recommend always keeping at least one replica shard on a different hard drive. You may find out more on pros and cons in Here and an old discussion on how replica shards work on this stackoverflow chat.

Configuring Elastic Search cluster with machines of different capacity(CPU, RAM) for rolling upgrades

Due to cost restrictions, I only have the following types of machines at disposal for setting up an ES cluster.
Node A: Lean(w.r.t. CPU, RAM) Instance
Node B: Beefy(w.r.t. CPU,RAM) Instance
Node M: "Leaner than A"(w.r.t. CPU, RAM) Instance
Disk-wise, both A and B have the same size.
My plan is to set up Node A and Node B acting as Master Eligible, Data node and Node M as Master-Eligible Only node(no data storing).
Because the two data nodes are NOT identical, what would be the implications?
I am going to make it a cluster of 3 machines only for the possibility of Rolling Upgrades(current volume of data and expected growth for few years can be managed with vertical scaling and leaving the default no. of shards and replica would enable me to scale horizontally if there is a need)
There is absolutely no need for your machines to have the same specs. You will need 3 master-eligible nodes not just for rolling-upgrades, but for high availability in general.
If you want to scale horizontally, you can do so either by creating more indices to hold your data, or by configuring your index to have multiple primary and/or replica shards. Since version 7, new indices are created with 1 primary and 1 replica shard by default. A single index like this does not really allow you to scale horizontally.
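
As a small sketch of the second option, the body below creates an index with several primaries and replicas. The helper names and the shard counts are illustrative assumptions. The body is meant for PUT <index> at creation time, since the number of primary shards cannot simply be changed later through a settings update.

```python
import json


def index_creation_body(primaries: int, replicas: int) -> dict:
    """Body for PUT <index> that sets the shard layout at creation time."""
    return {
        "settings": {
            "index": {
                "number_of_shards": primaries,
                "number_of_replicas": replicas,
            }
        }
    }


def total_shard_copies(primaries: int, replicas: int) -> int:
    """Number of shard copies the cluster has to place on data nodes."""
    return primaries * (1 + replicas)


if __name__ == "__main__":
    print(json.dumps(index_creation_body(3, 1), indent=2))
    print("shard copies to place:", total_shard_copies(3, 1))
```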
Update:
With respect to load and shard allocation (where to put data), Elasticsearch by default simply considers the amount of storage available. When you start up an instance of Elasticsearch, it introspects the hardware and configures its thread pools (number of threads and size of queue) for various tasks accordingly, so the number of threads available to process tasks can vary between nodes. If I'm not mistaken, the coordinating node (the node receiving the external request) distributes indexing/write requests in a round-robin fashion, without taking load into consideration. Depending on your version of Elasticsearch, this is different for search/read requests, where the coordinating node uses adaptive replica selection and takes the load and response time of the various replicas into account when distributing requests.
Besides this, sizing and scaling is too complex a topic to be answered comprehensively in a simple response. It typically also involves testing to figure out the limits of a single node.
BTW: the number of default primary shards got changed in v7.x of Elasticsearch, as too much oversharding was one of the most common issues Elasticsearch users were facing. A “reasonable” shard size is in the tens of Gigabytes.

how does elasticsearch decide how many shards can be restored in parallel

I made a snapshot of an index with 24 shards. The index is of size 700g.
When restoring, it restores 4 shards in parallel.
My cluster is a new cluster with only one machine w/o replica nodes.
The machine is AWS c3.8xlarge with 32 vCPUs and 60G memory.
I also followed How to speed up Elasticsearch recovery?.
When restoring, the memory usage is full. How does elastic search decide how many shards can be restored in parallel?
I was wondering how I can tune my machines' hardware config to make restore faster. If my cluster has more machines, can the restoring speed be improved linearly?
Basically, for ES 6.x there are two settings that determine how fast primary shards are recovered:
cluster.routing.allocation.node_initial_primaries_recoveries Sets the number of primary shards that recover in parallel on one node. Defaults to 4. So, for a cluster with N machines, the total number of shards recovering in parallel is N*node_initial_primaries_recoveries (See https://www.elastic.co/guide/en/elasticsearch/reference/current/shards-allocation.html#_shard_allocation_settings)
indices.recovery.max_bytes_per_sec Limits the recovery traffic (bytes per second) on each node. Defaults to 40mb. (See https://www.elastic.co/guide/en/elasticsearch/reference/current/recovery.html)
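
The N*node_initial_primaries_recoveries relation can be sketched as simple arithmetic. This is an idealized estimate: it assumes shards are spread evenly across nodes and ignores the bandwidth throttle, memory, and disk limits. The function names are made up for illustration.

```python
import math


def parallel_primary_recoveries(nodes: int, per_node: int = 4) -> int:
    """Upper bound on primaries recovering at once across the cluster
    (nodes * node_initial_primaries_recoveries)."""
    return nodes * per_node


def recovery_waves(shards: int, nodes: int, per_node: int = 4) -> int:
    """Number of sequential batches needed to restore all primary shards."""
    return math.ceil(shards / parallel_primary_recoveries(nodes, per_node))


if __name__ == "__main__":
    # The setup from the question: 24 shards, one node, default of 4 per node.
    print(parallel_primary_recoveries(1))  # 4
    print(recovery_waves(24, 1))           # 6
    # With three nodes, the same restore needs fewer batches in this model.
    print(recovery_waves(24, 3))           # 2
```

In this simplified model, adding machines reduces the number of batches roughly linearly, but real restore speed also depends on the per-node throttle and on how fast the snapshot repository can be read.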
