Understanding Elastic Search - elasticsearch

Sorry to say this but ES' documentation ( http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index.html ) is confusing me.
Thanks to the glossary I understand the terms for database, table and row but I have read substantial sections of the documentation and I cannot find answers to:
Why do I need do to add number_of_shards and number_of_replicas to index creation? I did look here http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules.html but ironically it leaves those two settings out :/
How can I have 3 shards with 2 replicas? If the glossary is anything to go by shouldn't that be impossible considering that a shard is "is a single Lucene instance"?
If I add more nodes later how can I change these values to span the new nodes?
How does sharding work in ES?
How does replica sets work in ES?
How can I manage sharding? I understand it is auto join ( http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-configuration.html#cluster-name ) but how do I define the difference between replicas and shards?
How can I manage replica sets? I.e. how do I add replicas, promote primaries etc?
For reference I read these links first:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/glossary.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index.html
If that information exists in the documentation then I would be very grateful if you can point me towards it.
Edit:
I am also unsure how auto-discovery works on a distributed network. Short if pinging every public network around how does it connect to the right one that could possibly be on the other side of the world?

Please see below for answers to your points.
Why do I need do to add number_of_shards and number_of_replicas to index creation? I did look here
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules.html
but ironically it leaves those two settings out :/
You don't "have" to, but you probably should in especially will want to in production. The default is five shards and one replication.
The number of replications defined is just the number of times your entire index is replicated throughout all of the nodes in your elasticsearch cluster. Think of it as being multiple read copies of a RDBMS database (but in this case, we read and write all copies).
A shard is the number of times I split up, or shard, an index. So, I can have an index with a single shard, or I can have an index with multiple shards. This is similar in concept to sharding a RDBMS database by primary key, but not identical.
So, the total number of shards you will have in an index is the product of number_of_shards and number_of_replicas.
When you do a search, elasticsearch will distribute your search to all possible nodes containing the shards in your index and aggregagate the result for you. You can think of this as a map/ reduce where the map is sending the search out to each shard and the reduce is collecting the results.
Also, you can change the replication number_of_replicas at any time, but you can never change the number_of_shards. This must be set at index creation.
How can I have 3 shards with 2 replicas? If the glossary is anything to go by shouldn't that be impossible considering that a
shard is "is a single Lucene instance"?
I think the above mostly answers this, but it's important to remember that elasticsearch is primarily a distributed computing solution to search. We are splitting the work up to multiple shards and possibly machines.
If I add more nodes later how can I change these values to span the new nodes?
Once the cluster is aware of another node in the cluster, no other action is needed by you. The settings propagate throughout the cluster on their own. In your above example of three shards and two replicas, if you had two nodes initially and added a third, each node will have on average two shards per node, this shard movement happens without your intervention (again, provided the cluster is aware of the new node)
How does sharding work in ES?
See above
How does replica sets work in ES?
See above
How can I manage sharding? I understand it is auto join ( http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-configuration.html#cluster-name ) but how do I define the difference between replicas and shards?
You don't have to "manage" it actively. As stated earlier, sharding and everything else you define at index creation, is propagated to new nodes within the cluster.
You define replicas and shards like this:
{
"settings": {
"index": {
"number_of_shards": 20,
"number_of_replicas": 1
}
},
"mappings": {
"some_type": {
"properties": {
"some_field": {
"type": "long"
}
}
}
}
}
How can I manage replica sets? I.e. how do I add replicas, promote primaries etc?
You do that through the update indices API, documentation for this specific case is found on there site here:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-update-settings.html
I just noticed your edit, please see below:
I am also unsure how auto-discovery works on a distributed network.
In the YML config file you set the unicast like this:
discovery.zen.ping.multicast.enabled: false
#discovery.zen.minimum_master_nodes: 3
discovery.zen.ping.unicast.hosts: ["ip.add.r.ess", "ip.add.r.ess"]
The middle setting is an important setting, but I commented it out here. That number should always be number of (master nodes / 2) +1. This is to avoid split brain situations. Generally I set all nodes to master eligible.
These settings are for unicast, which is what I think you are going for with your question and not multicast.

In short, an index is broken into shards. Shards can be replicated, meaning multiple copies of the same shard can exist in the same cluster. So if an index has 3 shards and 2 replica's, that means you have nine shards in total of which six are replicas of the three master shards.
ES, will try to balance shards and their replica's across the cluster so that if a node goes down it can fail over from the master shards on that node to replicas. This can confuse some people: a master in elastic search refers to shards, not the actual node. So a single node can have a mix of replica's and master shards.
If you come from the lucene world, a lucene index is not the same thing as an elastic search index. An elastic search index is a logical group of indexed documents with types, mappings and documents. More or less the same as a database schema. A lucene index on the other hand is a group of several files that contains indexed data. When Elastic search creates indexes, what it does is create several lucene indexes (one for each field and shard) and when it replicates, it is basically copying the files of these lucene indices around.
You can't change the number of shards for an index but you can change the number of replicas. Typically what you do when you need to have more shards is create a new index and reindex the data.
In terms of shard management beyond deciding on the number of shards, there's not much to manage by default and ES is pretty good coordinating things by itself, There are a ton of options you can fiddle with once you gain a bit better understanding of how it works. Defaults are pretty OK for most. In terms of cluster management, you can do a lot via the API in terms of shutting down nodes in a controlled way, using index aliases, changing number of replica's, etc.
As for autodiscovery, ES uses local network multicast by default. You can switch to unicast and you probably want to change the default clustername to prevent accidents (had some fun in coffeeshops with unintended clusters forming). You probably don't want to cluster globally. I don't see that ending well.

It's a quite incident that about 80% of your questions are answered in the Video Presentation given by Shay Banon (The creater of ElastiSearch). Though this presentation has much more than you can find anywhere else. Hope this helps.
http://www.infoq.com/presentations/ElasticSearch
This video is a bit low-resolution, so if you want code shown in presentation follow this
https://github.com/kimchy/talks/tree/master/2011/wsnparis

Related

What exactly is a shard?

I am going through the documentation,
An index can potentially store a large amount of data that can exceed the hardware limits of a single node. For example, a single index of a billion documents taking up 1TB of disk space may not fit on the disk of a single node or may be too slow to serve search requests from a single node alone.
To solve this problem, Elasticsearch provides the ability to subdivide your index into multiple pieces called shards. When you create an index, you can simply define the number of shards that you want. Each shard is in itself a fully-functional and independent "index" that can be hosted on any node in the cluster.
What exactly constitutes elastic search shard ? Is it a lucene thread which is configured with memory ? Is it possible to adjust setting for individual shard ?
In addition to this answer which should help, I can add that a shard actually wraps a full-fledge Lucene search engine.
You cannot change settings for individual shards, instead you can change settings at the index-level and Elasticsearch will apply them on the index shards.
So Elasticsearch gives you the ability to split the workload on an index among all the shards (i.e. Lucene engines) of that index which are located on different nodes.
Very simply put: Elasticsearch = distributed Lucene !

what is the impact of shard number in elasticsearch?

I am curious about the impact of #shards in Elasticsearch. Particularly, I am looking for the pros and cons for having big #shards and small #shards.
For example, I have a two-node cluster. Assuming replica is one, for one index, should I create two shards which are spread across these two nodes? Or should I use the default i.e., 5 shards? My thinking is the first one.
It seems to me there is no reason to have more than one share per node per index as one Lucene instance can have better caching than a few Lucene instances.
Edit:
Let's say I have only one node and want to create one index. How does the shard number affect the performance in this case? My thinking is I should have only one shard in such a case. Is that right?
You can find the answer in the elasticsearch documentation here

What are the ideal number of shards count for 3 node elasticsearch setup

I wanted to know how many primary shards and replicas are ideal to a three node cluster and wanted to know the rule of thumb to set the Primary shard and replicas depending on the servers. How can we change the number of shards safely in elasticsearch cluster. (Elasticsearch has great documentation but it was hard for me to get a answer for this, my bad). Thanks in advance
The ideal number of shards and replicas will all depend on how much data is in each index and how you want to query it. My recommendation would be to leave all the shards and replicas at their default values until such time as you need to tune them. This is what ES production support is ideal for.

Shards and replicas elastic search

Suppose at the time of index creation I didn't set any replica for that if I update using update settings API and changed replica status to 1.If I have 2 node the replica should be create on second node because on primary node side replica will not create due to that cluster status is showing yellow the shards not allocating to node2 even though we enabled the replicas to 1.
please share me why replica shard not allocating to node2?
but on cluster startup nodes are showing they detected and join each other.
Here are the Basic concepts of the Elastic search
Installation »
Basic Concepts
There are a few concepts that are core to Elasticsearch. Understanding these concepts from the outset will tremendously help ease the learning process.
Near Realtime (NRT)
Elasticsearch is a near real time search platform. What this means is there is a slight latency (normally one second) from the time you index a document until the time it becomes searchable.
Cluster
A cluster is a collection of one or more nodes (servers) that together holds your entire data and provides federated indexing and search capabilities across all nodes. A cluster is identified by a unique name which by default is "elasticsearch". This name is important because a node can only be part of a cluster if the node is set up to join the cluster by its name.
Make sure that you don’t reuse the same cluster names in different environments, otherwise you might end up with nodes joining the wrong cluster. For instance you could use logging-dev, logging-stage, and logging-prod for the development, staging, and production clusters.
Note that it is valid and perfectly fine to have a cluster with only a single node in it. Furthermore, you may also have multiple independent clusters each with its own unique cluster name.
Node
A node is a single server that is part of your cluster, stores your data, and participates in the cluster’s indexing and search capabilities. Just like a cluster, a node is identified by a name which by default is a random Universally Unique IDentifier (UUID) that is assigned to the node at startup. You can define any node name you want if you do not want the default. This name is important for administration purposes where you want to identify which servers in your network correspond to which nodes in your Elasticsearch cluster.
A node can be configured to join a specific cluster by the cluster name. By default, each node is set up to join a cluster named elasticsearch which means that if you start up a number of nodes on your network and—assuming they can discover each other—they will all automatically form and join a single cluster named elasticsearch.
In a single cluster, you can have as many nodes as you want. Furthermore, if there are no other Elasticsearch nodes currently running on your network, starting a single node will by default form a new single-node cluster named elasticsearch.
Index
An index is a collection of documents that have somewhat similar characteristics. For example, you can have an index for customer data, another index for a product catalog, and yet another index for order data. An index is identified by a name (that must be all lowercase) and this name is used to refer to the index when performing indexing, search, update, and delete operations against the documents in it.
In a single cluster, you can define as many indexes as you want.
Type
Within an index, you can define one or more types. A type is a logical category/partition of your index whose semantics is completely up to you. In general, a type is defined for documents that have a set of common fields. For example, let’s assume you run a blogging platform and store all your data in a single index. In this index, you may define a type for user data, another type for blog data, and yet another type for comments data.
Document
A document is a basic unit of information that can be indexed. For example, you can have a document for a single customer, another document for a single product, and yet another for a single order. This document is expressed in JSON (JavaScript Object Notation) which is an ubiquitous internet data interchange format.
Within an index/type, you can store as many documents as you want. Note that although a document physically resides in an index, a document actually must be indexed/assigned to a type inside an index.
Shards & Replicas
An index can potentially store a large amount of data that can exceed the hardware limits of a single node. For example, a single index of a billion documents taking up 1TB of disk space may not fit on the disk of a single node or may be too slow to serve search requests from a single node alone.
To solve this problem, Elasticsearch provides the ability to subdivide your index into multiple pieces called shards. When you create an index, you can simply define the number of shards that you want. Each shard is in itself a fully-functional and independent "index" that can be hosted on any node in the cluster.
Sharding is important for two primary reasons:
It allows you to horizontally split/scale your content volume
It allows you to distribute and parallelize operations across shards (potentially on multiple nodes) thus increasing performance/throughput
The mechanics of how a shard is distributed and also how its documents are aggregated back into search requests are completely managed by Elasticsearch and is transparent to you as the user.
In a network/cloud environment where failures can be expected anytime, it is very useful and highly recommended to have a failover mechanism in case a shard/node somehow goes offline or disappears for whatever reason. To this end, Elasticsearch allows you to make one or more copies of your index’s shards into what are called replica shards, or replicas for short.
Replication is important for two primary reasons:
It provides high availability in case a shard/node fails. For this reason, it is important to note that a replica shard is never allocated on the same node as the original/primary shard that it was copied from.
It allows you to scale out your search volume/throughput since searches can be executed on all replicas in parallel.
To summarize, each index can be split into multiple shards. An index can also be replicated zero (meaning no replicas) or more times. Once replicated, each index will have primary shards (the original shards that were replicated from) and replica shards (the copies of the primary shards). The number of shards and replicas can be defined per index at the time the index is created. After the index is created, you may change the number of replicas dynamically anytime but you cannot change the number of shards after-the-fact.
By default, each index in Elasticsearch is allocated 5 primary shards and 1 replica which means that if you have at least two nodes in your cluster, your index will have 5 primary shards and another 5 replica shards (1 complete replica) for a total of 10 shards per index.
Note:
Each Elasticsearch shard is a Lucene index. There is a maximum number of documents you can have in a single Lucene index. As of LUCENE-5843, the limit is 2,147,483,519 (= Integer.MAX_VALUE - 128) documents. You can monitor shard sizes using the _cat/shards api.

When do you start additional Elasticsearch nodes? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 2 years ago.
Improve this question
I'm in the middle of attempting to replace a Solr setup with Elasticsearch. This is a new setup, which has not yet seen production, so I have lots of room to fiddle with things and get them working well.
I have very, very large amounts of data. I'm indexing some live data and holding onto it for 7 days (by using the _ttl field). I do not store any data in the index (and disabled the _source field). I expect my index to stabilize around 20 billion rows. I will be putting this data into 2-3 named indexes. Search performance so far with up to a few billion rows is totally acceptable, but indexing performance is an issue.
I am a bit confused about how ES uses shards internally. I have created two ES nodes, each with a separate data directory, each with 8 indexes and 1 replica. When I look at the cluster status, I only see one shard and one replica for each node. Doesn't each node keep multiple indexes running internally? (Checking the on-disk storage location shows that there is definitely only one Lucene index present). -- Resolved, as my index setting was not picked up properly from the config. Creating the index using the API and specifying the number of shards and replicas has now produced exactly what I would've expected to see.
Also, I tried running multiple copies of the same ES node (from the same configuration), and it recognizes that there is already a copy running and creates its own working area. These new instances of nodes also seem to only have one index on-disk. -- Now that each node is actually using multiple indices, a single node with many indices is more than sufficient to throttle the entire system, so this is a non-issue.
When do you start additional Elasticsearch nodes, for maximum indexing performance? Should I have many nodes each running with 1 index 1 replica, or fewer nodes with tons of indexes? Is there something I'm missing with my configuration in order to have single nodes doing more work?
Also: Is there any metric for knowing when an HTTP-only node is overloaded? Right now I have one node devoted to HTTP only, but aside from CPU usage, I can't tell if it's doing OK or not. When is it time to start additional HTTP nodes and split up your indexing software to point to the various nodes?
Let's clarify the terminology a little first:
Node: an Elasticsearch instance running (a java process). Usually every node runs on its own machine.
Cluster: one or more nodes with the same cluster name.
Index: more or less like a database.
Type: more or less like a database table.
Shard: effectively a lucene index. Every index is composed of one or more shards. A shard can be a primary shard (or simply shard) or a replica.
When you create an index you can specify the number of shards and number of replicas per shard. The default is 5 primary shards and 1 replica per shard. The shards are automatically evenly distributed over the cluster. A replica shard will never be allocated on the same machine where the related primary shard is.
What you see in the cluster status is weird, I'd suggest to check your index settings using the using the get settings API. Looks like you configured only one shard, but anyway you should see more shards if you have more than one index. If you need more help you can post the output that you get from elasticsearch.
How many shards and replicas you use really depends on your data, the way you access them and the number of available nodes/servers. It's best practice to overallocate shards a little in order to redistribute them in case you add more nodes to your cluster, since you can't (for now) change the number of shards once you created the index. Otherwise you can always change the number of shards if you are willing to do a complete reindex of your data.
Every additional shard comes with a cost since each shard is effectively a Lucene instance. The maximum number of shards that you can have per machine really depends on the hardware available and your data as well. Good to know that having 100 indexes with each one shard or one index with 100 shards is really the same since you'd have 100 lucene instances in both cases.
Of course at query time if you want to query a single elasticsearch index composed of 100 shards elasticsearch would need to query them all in order to get proper results (unless you used a specific routing for your documents to then query only a specific shard). This would have a performance cost.
You can easily check the state of your cluster and nodes using the Cluster Nodes Info API through which you can check a lot of useful information, all you need in order to know whether your nodes are running smoothly or not. Even easier, there are a couple of plugins to check those information through a nice user interface (which internally uses the elasticsearch APIs anyway): paramedic and bigdesk.

Resources