What exactly is a shard? - elasticsearch

I am going through the documentation,
An index can potentially store a large amount of data that can exceed the hardware limits of a single node. For example, a single index of a billion documents taking up 1TB of disk space may not fit on the disk of a single node or may be too slow to serve search requests from a single node alone.
To solve this problem, Elasticsearch provides the ability to subdivide your index into multiple pieces called shards. When you create an index, you can simply define the number of shards that you want. Each shard is in itself a fully-functional and independent "index" that can be hosted on any node in the cluster.
What exactly constitutes an Elasticsearch shard? Is it a Lucene thread that is configured with memory? Is it possible to adjust settings for an individual shard?

In addition to this answer, which should help, I can add that a shard actually wraps a full-fledged Lucene search engine.
You cannot change settings for individual shards; instead, you change settings at the index level and Elasticsearch applies them to the index's shards.
So Elasticsearch gives you the ability to split the workload on an index among all the shards (i.e. Lucene engines) of that index, which are located on different nodes.
Very simply put: Elasticsearch = distributed Lucene!

Related

In an Elasticsearch cluster, is there a way to allocate shards to a particular node at creation time?

I have a multi-node Elasticsearch cluster. On that cluster, I want to divide the shards of the same index across different nodes.
Suppose a document with various key-value pairs is to be ingested into the index. Based on a given key-value pair, I want my master node to pick the specific data node that already holds the documents with the same key-value pair.
My approach is to have a single index across the nodes available in the cluster, with the shards of this index distributed so that documents with the same key-value pair end up on the same node. Is there a way to do this?
I also want to increase the number of shards in an index, but I am getting the error "index <index_name> must be read-only to resize index." How do I increase the number of shards?
There is the _routing field, which can group documents into a particular shard, but you cannot automatically assign a shard with a given value to a specific node. The closest you could get would be to handle it manually via reroute.
However, it is not clear why you would want to do that, and it is definitely not recommended, as it is a lot of manual control over something that Elasticsearch is pretty good at handling.
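For reference, a manual move via the cluster reroute API boils down to posting a "move" command. The snippet below is only a sketch of how such a request body could be built in Python; the index name and node names (my_index, node-1, node-2) are made up for illustration, and the body would be sent to POST /_cluster/reroute with whatever HTTP client you use.

```python
import json


def build_move_command(index, shard, from_node, to_node):
    """Build the JSON body for POST /_cluster/reroute that moves one shard copy."""
    return {
        "commands": [
            {
                "move": {
                    "index": index,          # index that owns the shard
                    "shard": shard,          # shard number, starting at 0
                    "from_node": from_node,  # node currently holding the copy
                    "to_node": to_node,      # node that should receive it
                }
            }
        ]
    }


# Hypothetical names, used only to show the shape of the request body.
body = build_move_command("my_index", 0, "node-1", "node-2")
print(json.dumps(body, indent=2))
```

Keep in mind that the cluster may rebalance shards again afterwards, so a manual move is not a permanent pin.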

Max value of number_of_routing_shards in Elasticsearch 6.x

What is the max recommended value of number_of_routing_shards for an index?
Can I specify a very high value like 30000? What are the side effects if I do so?
Shards are "slices" of an index that Elasticsearch creates so it has the flexibility to distribute indexed data, for example among several data nodes.
At a low level, shards are independent sets of Lucene segments that work autonomously and can be queried independently. This is what makes high performance possible: search operations can be split into independent processes.
The more shards you have, the more flexible the storage assignment for a given index becomes. This has some caveats.
Distributed searches must wait for each other so that the per-shard results can be merged into a consistent response. If there are many shards, the query must be sliced into more parts, which has a computing overhead. The query is sent to each shard whose hash matches the current search (not all shards are necessarily hit by every query), so the busiest (slowest) shard determines the overall performance of your search.
It's better to have a balanced number of indexes. Each index has a memory footprint that is stored in the cluster state. The more indexes you have, the bigger the cluster state and the more time it takes to share it among all cluster nodes.
The more shards an index has, the more complex it becomes, so it takes more space when serialized into the cluster state, slowing things down globally.
This will give you an index with 30,000 shards (according to https://www.elastic.co/guide/en/elasticsearch/reference/6.x/indices-split-index.html), which is ... useless.
As with all software tuning, recommended values vary with your:
use case
hardware (VM / network / disk ...)
metrics
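To make the role of number_of_routing_shards more concrete, here is a small Python sketch based on my reading of the split index documentation linked above: a target shard count has to be a multiple of the source shard count and a factor of number_of_routing_shards. The second helper lists the two requests a split needs, which is also where the "must be read-only to resize index" error comes from (the source index needs index.blocks.write set to true first). The function names and index names are made up for illustration.

```python
def valid_split_targets(source_shards, routing_shards):
    """Return the shard counts an index could be split into.

    Assumes the rule from the split index docs: the target count is a
    multiple of the source count and a factor of number_of_routing_shards.
    """
    if routing_shards % source_shards != 0:
        raise ValueError("routing shards must be a multiple of source shards")
    return [
        n
        for n in range(source_shards * 2, routing_shards + 1, source_shards)
        if routing_shards % n == 0
    ]


def split_requests(source_index, target_index, target_shards):
    """List the (method, path, body) steps for a split, without sending them."""
    return [
        # Step 1: block writes, otherwise the resize is rejected.
        ("PUT", f"/{source_index}/_settings",
         {"settings": {"index.blocks.write": True}}),
        # Step 2: split into a new index with more primary shards.
        ("POST", f"/{source_index}/_split/{target_index}",
         {"settings": {"index.number_of_shards": target_shards}}),
    ]


# The example from the docs: 5 shards with 30 routing shards.
print(valid_split_targets(5, 30))
for step in split_requests("my_index", "my_index_split", 10):
    print(step)
```

A very large number_of_routing_shards mostly buys you a longer list of possible targets; it does not by itself create that many shards.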

Shards and replicas elastic search

Suppose I did not set any replicas when I created the index, and I later used the update settings API to change the number of replicas to 1. With 2 nodes, the replica should be created on the second node, because a replica is never created on the node that holds its primary. However, the cluster status shows yellow: the shards are not being allocated to node2 even though we set the number of replicas to 1.
Please tell me why the replica shards are not being allocated to node2.
On cluster startup, however, the nodes show that they detected and joined each other.
Here are the basic concepts of Elasticsearch:
Basic Concepts
There are a few concepts that are core to Elasticsearch. Understanding these concepts from the outset will tremendously help ease the learning process.
Near Realtime (NRT)
Elasticsearch is a near real time search platform. What this means is there is a slight latency (normally one second) from the time you index a document until the time it becomes searchable.
Cluster
A cluster is a collection of one or more nodes (servers) that together holds your entire data and provides federated indexing and search capabilities across all nodes. A cluster is identified by a unique name which by default is "elasticsearch". This name is important because a node can only be part of a cluster if the node is set up to join the cluster by its name.
Make sure that you don’t reuse the same cluster names in different environments, otherwise you might end up with nodes joining the wrong cluster. For instance you could use logging-dev, logging-stage, and logging-prod for the development, staging, and production clusters.
Note that it is valid and perfectly fine to have a cluster with only a single node in it. Furthermore, you may also have multiple independent clusters each with its own unique cluster name.
Node
A node is a single server that is part of your cluster, stores your data, and participates in the cluster’s indexing and search capabilities. Just like a cluster, a node is identified by a name which by default is a random Universally Unique IDentifier (UUID) that is assigned to the node at startup. You can define any node name you want if you do not want the default. This name is important for administration purposes where you want to identify which servers in your network correspond to which nodes in your Elasticsearch cluster.
A node can be configured to join a specific cluster by the cluster name. By default, each node is set up to join a cluster named elasticsearch which means that if you start up a number of nodes on your network and—assuming they can discover each other—they will all automatically form and join a single cluster named elasticsearch.
In a single cluster, you can have as many nodes as you want. Furthermore, if there are no other Elasticsearch nodes currently running on your network, starting a single node will by default form a new single-node cluster named elasticsearch.
Index
An index is a collection of documents that have somewhat similar characteristics. For example, you can have an index for customer data, another index for a product catalog, and yet another index for order data. An index is identified by a name (that must be all lowercase) and this name is used to refer to the index when performing indexing, search, update, and delete operations against the documents in it.
In a single cluster, you can define as many indexes as you want.
Type
Within an index, you can define one or more types. A type is a logical category/partition of your index whose semantics is completely up to you. In general, a type is defined for documents that have a set of common fields. For example, let’s assume you run a blogging platform and store all your data in a single index. In this index, you may define a type for user data, another type for blog data, and yet another type for comments data.
Document
A document is a basic unit of information that can be indexed. For example, you can have a document for a single customer, another document for a single product, and yet another for a single order. This document is expressed in JSON (JavaScript Object Notation), which is a ubiquitous internet data interchange format.
Within an index/type, you can store as many documents as you want. Note that although a document physically resides in an index, a document actually must be indexed/assigned to a type inside an index.
Shards & Replicas
An index can potentially store a large amount of data that can exceed the hardware limits of a single node. For example, a single index of a billion documents taking up 1TB of disk space may not fit on the disk of a single node or may be too slow to serve search requests from a single node alone.
To solve this problem, Elasticsearch provides the ability to subdivide your index into multiple pieces called shards. When you create an index, you can simply define the number of shards that you want. Each shard is in itself a fully-functional and independent "index" that can be hosted on any node in the cluster.
Sharding is important for two primary reasons:
It allows you to horizontally split/scale your content volume
It allows you to distribute and parallelize operations across shards (potentially on multiple nodes) thus increasing performance/throughput
The mechanics of how a shard is distributed and also how its documents are aggregated back into search requests are completely managed by Elasticsearch and are transparent to you as the user.
In a network/cloud environment where failures can be expected anytime, it is very useful and highly recommended to have a failover mechanism in case a shard/node somehow goes offline or disappears for whatever reason. To this end, Elasticsearch allows you to make one or more copies of your index’s shards into what are called replica shards, or replicas for short.
Replication is important for two primary reasons:
It provides high availability in case a shard/node fails. For this reason, it is important to note that a replica shard is never allocated on the same node as the original/primary shard that it was copied from.
It allows you to scale out your search volume/throughput since searches can be executed on all replicas in parallel.
To summarize, each index can be split into multiple shards. An index can also be replicated zero (meaning no replicas) or more times. Once replicated, each index will have primary shards (the original shards that were replicated from) and replica shards (the copies of the primary shards). The number of shards and replicas can be defined per index at the time the index is created. After the index is created, you may change the number of replicas dynamically anytime but you cannot change the number of shards after-the-fact.
By default, each index in Elasticsearch is allocated 5 primary shards and 1 replica which means that if you have at least two nodes in your cluster, your index will have 5 primary shards and another 5 replica shards (1 complete replica) for a total of 10 shards per index.
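The arithmetic in that last paragraph can be written down as a one-line helper. This is just an illustrative sketch; the function name is made up.

```python
def total_shard_copies(number_of_shards, number_of_replicas):
    """Primaries plus one full set of copies per replica."""
    return number_of_shards * (1 + number_of_replicas)


# The defaults quoted above: 5 primaries and 1 replica.
print(total_shard_copies(5, 1))  # 10
```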
Note:
Each Elasticsearch shard is a Lucene index. There is a maximum number of documents you can have in a single Lucene index. As of LUCENE-5843, the limit is 2,147,483,519 (= Integer.MAX_VALUE - 128) documents. You can monitor shard sizes using the _cat/shards api.
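As a rough illustration of what that limit means in practice, the sketch below computes the smallest number of primary shards that could hold a given number of documents, assuming documents were spread perfectly evenly (which real routing does not guarantee). The helper name is hypothetical, and this is a hard lower bound, not a sizing recommendation.

```python
# Limit quoted above: Integer.MAX_VALUE - 128 documents per Lucene index.
LUCENE_MAX_DOCS = 2**31 - 1 - 128


def min_primary_shards(doc_count):
    """Smallest shard count that keeps every shard under the Lucene limit,
    assuming a perfectly even spread of documents."""
    if doc_count <= 0:
        return 1
    # Integer ceiling division, avoids floating point rounding.
    return -(-doc_count // LUCENE_MAX_DOCS)


print(LUCENE_MAX_DOCS)                     # 2147483519
print(min_primary_shards(1_000_000_000))   # 1
print(min_primary_shards(10_000_000_000))  # 5
```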

elasticsearch - how to add new shards and split index content

My index grew too fast and now has 60 million docs in 3 shards (single node).
I want to buy more machines and split the content into more shards. How can I do this?
Is it just a matter of connecting new nodes to the cluster and updating the number of shards on the master?
AFAIK Elasticsearch cannot yet redistribute indexed documents automatically (see here). You would have to reindex all content. The problem behind this is that documents are distributed to shards according to a hash value modulo the number of shards. Just adding shards and continuing to index would keep adding documents to the old shards too.
Elasticsearch allows you to distribute documents according to a custom function (the routing parameter). You could route all new content to the new shards, but this makes deletions difficult, because now you have to know whether a document is "old" or "new". It also ruins your uniform index statistics, which may bias ranking in non-obvious ways.
Bottom line: adding shards to an existing index requires reindexing all contents or some heavy hacking.
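To see why the hash-modulo scheme forces a reindex, here is a toy Python sketch. It uses zlib.crc32 as a stand-in hash (an assumption for illustration only; Elasticsearch uses its own hash function internally) and counts how many documents would map to a different shard if the shard count went from 3 to 5.

```python
import zlib


def shard_for(doc_id, number_of_shards):
    """Toy routing: hash of the document id modulo the number of shards.
    crc32 is only a stand-in for the hash Elasticsearch really uses."""
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards


doc_ids = [f"doc-{i}" for i in range(1000)]

# A document "moves" if its shard under 5 shards differs from its shard under 3.
moved = sum(1 for d in doc_ids if shard_for(d, 3) != shard_for(d, 5))
print(f"{moved} of {len(doc_ids)} documents would live on a different shard")
```

Any document whose shard number changes can no longer be found where the new routing looks for it, which is why the existing data has to be rewritten.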
You already have 3 shards, so if you add 2 nodes Elasticsearch will automatically reallocate 2 of the shards to the other 2 nodes, giving each shard up to 3 times more power.
If you want to add more shards, you need to reindex your data. This can be done by creating a new index with the desired number of shards and copying your data to that index (see https://www.elastic.co/guide/en/elasticsearch/guide/current/reindex.html)

Understanding Elastic Search

Sorry to say this but ES' documentation ( http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index.html ) is confusing me.
Thanks to the glossary I understand the terms for database, table and row but I have read substantial sections of the documentation and I cannot find answers to:
Why do I need to add number_of_shards and number_of_replicas at index creation? I did look here http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules.html but ironically it leaves those two settings out :/
How can I have 3 shards with 2 replicas? If the glossary is anything to go by, shouldn't that be impossible, considering that a shard "is a single Lucene instance"?
If I add more nodes later, how can I change these values to span the new nodes?
How does sharding work in ES?
How do replica sets work in ES?
How can I manage sharding? I understand it is auto join ( http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-configuration.html#cluster-name ) but how do I define the difference between replicas and shards?
How can I manage replica sets? I.e. how do I add replicas, promote primaries, etc.?
For reference I read these links first:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/glossary.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index.html
If that information exists in the documentation then I would be very grateful if you can point me towards it.
Edit:
I am also unsure how auto-discovery works on a distributed network. Short of pinging every public network around, how does it connect to the right one, which could possibly be on the other side of the world?
Please see below for answers to your points.
Why do I need to add number_of_shards and number_of_replicas at index creation? I did look here http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules.html but ironically it leaves those two settings out :/
You don't "have" to, but you probably should, and you will especially want to in production. The default is five shards and one replica.
The number of replicas is just the number of times your entire index is copied across the nodes in your Elasticsearch cluster. Think of it as having multiple read copies of an RDBMS database (but in this case, we read from and write to all copies).
The number of shards is the number of pieces I split, or shard, an index into. So, I can have an index with a single shard, or I can have an index with multiple shards. This is similar in concept to sharding an RDBMS database by primary key, but not identical.
So, the total number of shards you will have for an index is number_of_shards multiplied by (1 + number_of_replicas): the primaries plus one full set of copies per replica.
When you do a search, Elasticsearch will distribute your search to all the nodes containing shards of your index and aggregate the results for you. You can think of this as a map/reduce where the map is sending the search out to each shard and the reduce is collecting the results.
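The map/reduce analogy can be sketched in a few lines of Python. This is a toy model, not how Elasticsearch is implemented: each "shard" is just a list of (score, doc_id) hits, every shard returns its own top k, and a coordinating step merges them into the global top k. All names here are made up.

```python
import heapq


def search_shard(shard_hits, k):
    """'Map' step: each shard returns only its own k best hits."""
    return heapq.nlargest(k, shard_hits, key=lambda hit: hit[0])


def search_index(shards, k):
    """'Reduce' step: merge the per-shard results into the global top k."""
    partial_results = [search_shard(hits, k) for hits in shards]
    all_hits = (hit for hits in partial_results for hit in hits)
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[0])


# Three toy shards holding (score, doc_id) pairs.
shards = [
    [(0.9, "a"), (0.2, "b")],
    [(0.7, "c"), (0.8, "d")],
    [(0.1, "e")],
]
top = search_index(shards, 2)
print(top)  # [(0.9, 'a'), (0.8, 'd')]
```

Because each shard only needs to return its own k best hits, the merge step stays small even when the shards themselves are large.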
Also, you can change number_of_replicas at any time, but you can never change number_of_shards; it must be set at index creation.
How can I have 3 shards with 2 replicas? If the glossary is anything to go by, shouldn't that be impossible, considering that a shard "is a single Lucene instance"?
I think the above mostly answers this, but it's important to remember that Elasticsearch is primarily a distributed computing solution for search. We are splitting the work up across multiple shards and possibly machines.
If I add more nodes later, how can I change these values to span the new nodes?
Once the cluster is aware of another node, no other action is needed from you. The settings propagate throughout the cluster on their own. In your example of three shards and two replicas there are nine shard copies in total, so if you had two nodes initially and added a third, each node ends up with three shard copies on average. This shard movement happens without your intervention (again, provided the cluster is aware of the new node).
How does sharding work in ES?
See above
How do replica sets work in ES?
See above
How can I manage sharding? I understand it is auto join ( http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-configuration.html#cluster-name ) but how do I define the difference between replicas and shards?
You don't have to "manage" it actively. As stated earlier, sharding and everything else you define at index creation is propagated to new nodes within the cluster.
You define replicas and shards like this:
{
  "settings": {
    "index": {
      "number_of_shards": 20,
      "number_of_replicas": 1
    }
  },
  "mappings": {
    "some_type": {
      "properties": {
        "some_field": {
          "type": "long"
        }
      }
    }
  }
}
How can I manage replica sets? I.e. how do I add replicas, promote primaries etc?
You do that through the update index settings API; the documentation for this specific case is found on their site here:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-update-settings.html
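As a rough sketch, changing the replica count on a live index is a single PUT to the index's _settings endpoint. The Python helper below only builds the request (method, path and body) so you can send it with whatever HTTP client you prefer; the helper name and the index name are made up for illustration.

```python
import json


def replicas_update_request(index, replicas):
    """Build the update-settings request that changes the replica count."""
    if replicas < 0:
        raise ValueError("number_of_replicas cannot be negative")
    body = {"index": {"number_of_replicas": replicas}}
    return ("PUT", f"/{index}/_settings", body)


method, path, body = replicas_update_request("my_index", 2)
print(method, path)
print(json.dumps(body))
```

Promoting a replica to primary is not something you trigger through this setting; Elasticsearch does that on its own when the node holding a primary goes away.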
I just noticed your edit, please see below:
I am also unsure how auto-discovery works on a distributed network.
In the YML config file you set the unicast like this:
discovery.zen.ping.multicast.enabled: false
#discovery.zen.minimum_master_nodes: 3
discovery.zen.ping.unicast.hosts: ["ip.add.r.ess", "ip.add.r.ess"]
The middle setting is an important one, but I commented it out here. That number should always be (number of master-eligible nodes / 2) + 1. This is to avoid split-brain situations. Generally I set all nodes to master-eligible.
These settings are for unicast, which is what I think you are going for with your question and not multicast.
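The quorum formula for discovery.zen.minimum_master_nodes uses integer division. A tiny Python sketch (the function name is made up):

```python
def minimum_master_nodes(master_eligible_nodes):
    """Quorum: more than half of the master-eligible nodes."""
    return master_eligible_nodes // 2 + 1


for n in (1, 2, 3, 4, 5):
    print(n, "master-eligible nodes ->", minimum_master_nodes(n))
```

With 3 master-eligible nodes the value is 2, so a single isolated node can never elect itself master.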
In short, an index is broken into shards. Shards can be replicated, meaning multiple copies of the same shard can exist in the same cluster. So if an index has 3 shards and 2 replicas, you have nine shards in total, of which six are replicas of the three primary shards.
ES will try to balance shards and their replicas across the cluster so that if a node goes down it can fail over from the primary shards on that node to replicas. This can confuse some people: "primary" in Elasticsearch refers to a shard, not to a node (the master node is a separate, cluster-level role). So a single node can hold a mix of replica and primary shards.
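The following toy Python model shows the one hard rule behind this balancing: two copies of the same shard never share a node. It is a deliberately simplified sketch, not Elasticsearch's real allocator, and the node names and labels ("0p" for primary 0, "0r" for a replica of shard 0) are made up. Copies that cannot be placed stay unassigned, which is what a yellow cluster status reports.

```python
def allocate(number_of_shards, number_of_replicas, nodes):
    """Place each copy of a shard on a different node, round-robin.
    Copies that would have to share a node stay unassigned."""
    assignment = {node: [] for node in nodes}
    unassigned = []
    for shard in range(number_of_shards):
        for copy in range(1 + number_of_replicas):
            label = f"{shard}{'p' if copy == 0 else 'r'}"
            if copy < len(nodes):
                node = nodes[(shard + copy) % len(nodes)]
                assignment[node].append(label)
            else:
                unassigned.append(label)
    return assignment, unassigned


def health(unassigned):
    """red: a primary is missing; yellow: only replicas are missing."""
    if any(label.endswith("p") for label in unassigned):
        return "red"
    return "yellow" if unassigned else "green"


for nodes in (["n1", "n2"], ["n1", "n2", "n3"]):
    assignment, unassigned = allocate(3, 2, nodes)
    print(assignment, unassigned, health(unassigned))
```

With 3 shards and 2 replicas, two nodes leave one replica per shard unassigned (yellow), while three nodes hold three copies each (green).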
If you come from the Lucene world, a Lucene index is not the same thing as an Elasticsearch index. An Elasticsearch index is a logical group of indexed documents with types, mappings and documents, more or less the same as a database schema. A Lucene index, on the other hand, is a group of files that contain indexed data. When Elasticsearch creates an index, what it does is create several Lucene indexes (one for each shard), and when it replicates, it is basically copying the files of these Lucene indexes around.
You can't change the number of shards for an index but you can change the number of replicas. Typically what you do when you need to have more shards is create a new index and reindex the data.
In terms of shard management beyond deciding on the number of shards, there's not much to manage by default, and ES is pretty good at coordinating things by itself. There are a ton of options you can fiddle with once you gain a better understanding of how it works. Defaults are pretty OK for most. In terms of cluster management, you can do a lot via the API: shutting down nodes in a controlled way, using index aliases, changing the number of replicas, etc.
As for autodiscovery, ES uses local network multicast by default. You can switch to unicast, and you probably want to change the default cluster name to prevent accidents (I had some fun in coffee shops with unintended clusters forming). You probably don't want to cluster globally. I don't see that ending well.
It's quite a coincidence that about 80% of your questions are answered in the video presentation given by Shay Banon (the creator of Elasticsearch). The presentation also covers much more than you can find anywhere else. Hope this helps.
http://www.infoq.com/presentations/ElasticSearch
This video is a bit low-resolution, so if you want the code shown in the presentation, follow this link:
https://github.com/kimchy/talks/tree/master/2011/wsnparis
