I am very new to Elasticsearch and its applications. I found that Elasticsearch saves data (indexes) to disk, so I wondered: are there any limits on the number of indexes that can be created, or can I create as many as I like since I have a lot of disk space?
Currently I have Elasticsearch deployed as a single-node cluster with Docker. I have read something about shards and their limitations, but I was not able to understand it properly.
Can anyone on SO shed some light on these questions for a newbie, in layman's terms?
What is a single-node cluster, and how does my data get saved to disk? Also, what are shards and how are they related to Elasticsearch?
I guess the best answer is "it depends". Generally there is no hard limit on the number of indexes. Every index has its own mapping and is independent of other indexes by default. An index is a logical container inside the cluster rather than the raw data itself; you can think of it as roughly an entire database on its own. Many variables affect the answer. For example, if you plan to replicate the shards of an index, you may hit limits because of the size of the documents you plan to ingest into that index.
As another note, you may first want to ask why you need many indexes. Is it to improve search performance or query throughput? If so, it is probably better to use replica shards alongside the primary shards in a single index, because queries can be executed in parallel across replica shards, and you can think of each shard as a standalone index inside your main index. In conclusion, there is no limit as long as you have enough free space to save new data (that is, to grow the inverted index built for each field), but depending on your needs it may be better to have primary and replica shards inside one index.
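To make the idea of primary and replica shards inside one index more concrete, here is a minimal sketch that builds the JSON body one could send to the create-index API (PUT /<index-name>). The index name store-items and the shard counts are assumptions chosen for illustration, not recommendations.

```python
import json


def create_index_body(primary_shards: int, replicas: int) -> dict:
    """Build the settings body for creating an index (PUT /<index-name>)."""
    if primary_shards < 1 or replicas < 0:
        raise ValueError("need at least 1 primary shard and 0 or more replicas")
    return {
        "settings": {
            "index": {
                # The primary shard count cannot be changed in place later.
                "number_of_shards": primary_shards,
                # The replica count can be changed on a live index.
                "number_of_replicas": replicas,
            }
        }
    }


# Hypothetical index name and sizes, used only for this example.
index_name = "store-items"
body = create_index_body(primary_shards=3, replicas=1)
print(f"PUT /{index_name}")
print(json.dumps(body, indent=2))
```

On a single-node cluster the replicas requested here would stay unassigned, so 0 replicas is the usual choice there.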
Is there any limit on how many indexes we can create in Elasticsearch?
Can 100 000 indexes be created in Elasticsearch?
I have read that a maximum of 600-1000 indices can be created. Can it be scaled?
E.g.: I have a number of stores, and each store has items. Each store will have its own index where its items will be indexed.
There is no limit as such, but you don't want to create too many indices (how many is too many depends on your cluster, nodes, size of indices, etc.). In general it's not advisable, as it can have a severe impact on cluster functioning and performance.
Please check Loggly's blog; its first point is about proper provisioning, and the relevant text from that blog is quoted below.
ES makes it very easy to create a lot of indices and lots and lots of
shards, but it’s important to understand that each index and shard
comes at a cost. If you have too many indices or shards, the
management load alone can degrade your ES cluster performance,
potentially to the point of making it unusable. We’re focusing on
management load here, but running too many indices/shards can also
have pretty significant impacts on your indexing and search
performance.
The biggest factor we’ve found to impact management overhead is the
size of the Cluster State, which contains all of the mappings for
every index in the cluster. At one point, we had a single cluster with
a Cluster State size of over 900MB! The cluster was alive but not
usable.
Edit: Thanks to @Silas, who pointed out that from ES 2.x, cluster state updates are much less costly (as only the diff is sent in the update call). More info on this change can be found in this ES issue.
I'm new to Elasticsearch and would like someone to help me clarify a few concepts.
I'm designing a small cluster with the following requirements:
everything should still work when restarting one of the machines, one at a time (eg: OS updates)
a single disk failure is ok
heavy indexing should not impact query performance
How many master, data, ingest nodes should I have?
or do I need 2 clusters?
the indexing workload is purely indexing structured text documents, no processing/rules... do I even need an ingest node?
Also, does each node have a complete copy of all the data? Or does only the cluster as a whole have the complete copy?
Be sure to read the documentation about Elasticsearch terminology at the very least.
With the default of 1 replica (primary shard and one replica shard) you can survive the failure of 1 Elasticsearch node (failed disk, restart, upgrade,...).
"heavy indexing should not impact query performance": You'll need to size your cluster correctly to handle both the indexing and searching. If you want to read current data and you do heavy updates, that will take up resources and you won't be able to fully decouple it.
By default every node is a data, ingest, and master-eligible node. The minimum HA setting needs 3 nodes. If you don't use ingest that's fine; it won't take up resources when you're not using it.
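The figure of 3 nodes can be understood through majority voting among master-eligible nodes: the cluster needs more than half of them to agree on a master. The sketch below is only that arithmetic, written with illustrative function names; it does not model any other part of cluster behaviour.

```python
def quorum(master_eligible_nodes: int) -> int:
    """Smallest majority of the master-eligible nodes."""
    return master_eligible_nodes // 2 + 1


def tolerated_failures(master_eligible_nodes: int) -> int:
    """Master-eligible nodes that can be lost while a majority remains."""
    return master_eligible_nodes - quorum(master_eligible_nodes)


for n in (1, 2, 3):
    print(f"{n} node(s): quorum={quorum(n)}, tolerated failures={tolerated_failures(n)}")
```

With 2 nodes the majority is still 2, so losing either one leaves no majority; 3 is the smallest count that tolerates one node being restarted.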
To understand which node has which data, you need to read up on the concept of shards. Basically every index is broken up into 1 to N shards (the default was 5 at the time of writing; it is 1 since Elasticsearch 7.0) and there is one primary and one replica copy of each one of them (by default).
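As a rough sketch of what this means for where data lives, the snippet below counts shard copies for an index with 5 primaries and 1 replica, and how many of them can be placed on a given number of nodes. It relies only on the rule that two copies of the same shard are never placed on the same node; the function names are made up for the example and the real allocator considers more than this.

```python
def total_shard_copies(primaries: int, replicas: int) -> int:
    """Primaries plus all of their replica copies."""
    return primaries * (1 + replicas)


def assignable_shard_copies(primaries: int, replicas: int, nodes: int) -> int:
    """Copies that can be placed when copies of one shard never share a node."""
    return primaries * min(1 + replicas, nodes)


for nodes in (1, 2, 3):
    total = total_shard_copies(5, 1)
    placed = assignable_shard_copies(5, 1, nodes)
    print(f"{nodes} node(s): {placed} of {total} copies assigned, {total - placed} unassigned")
```

So a node does not automatically hold all the data: the cluster as a whole holds every shard, and each node holds whichever copies were allocated to it.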
Background
We're designing the architecture of a new system using Elasticsearch now, and we plan to use Elastic Cloud based on reviews contrasting their service with AWS's, and self-hosting on an EC2 instance. As we design the system, I'm trying to learn from a small test project my team deployed on Elastic Cloud 6 months ago. While I've spent a lot of time reading the Elasticsearch Docs, Elasticsearch: The Definitive Guide, and Elastic Cloud's Docs, there are some concepts here that I'm still not understanding.
Our Test Project's issues
Our test project uses the default of 5 primary shards and 1 replica shard per primary. It was configured using the default deployment options on Elastic Cloud with a single node, currently with 2GB of memory. Because there is only one node, and because replica shards are never assigned to the same node as their primary shard (reason 2), none of the replicas are getting assigned. Also, this project uses time-based data, and is creating one index per account per day, resulting in about 10 indexes per day (or 100 shards), and over time, the proverbial Kagillion Shards. This system was only ever meant to have several months of data on it at a time, so the solution has been to manually delete old data when memory on this deployment runs out.
The New System
Our new system is meant to have 5 years' worth of time-based data on it, which is projected to grow to 250 GB in size. The current implementation uses a single index for the time-based data, with 6 primary shards and 1 replica per primary. This decision was made based on reading that a single shard should aim for a maximum of 30GB in size.
Questions
Our old system had one node with too many indexes (over 100) and too many shards (over 1000), and it seems like our new one is being designed with too few (one index for 5+ years of data). It seems a better indexing strategy according to the time-based data recommendations would be to create one index per week or month? That being said, according to another answer on SO the optimal number of indexes per node is 1, so what is the utility in creating multiple indices for time-based data in the first place if we're only running on one node?
How does one add a node to an ES deployment in Elastic Cloud? Currently all of the replica nodes in the test project are unassigned, because the deployment only has one node. There is a slider which allows you to easily choose the memory of each node in a deployment (between 1GB and 250B), however I see no way to add multiple nodes, which is confusing because it seems like basic functionality for Elasticsearch.
Our test project's node has restarted several times, always when there is lots of old data on the node, and therefore memory pressure. The solution has been to delete old data (as the test project was only meant to have several months of data at a time), but it appears the node didn't lose data when it restarted. Why would this be?
Our test project has taken no snapshots, which are supposed to happen automatically on Elastic Cloud every 30 minutes. I've asked their support about this, but just curious to see if anyone knows what could cause this and how to resolve it?
Our test project uses the default of 5 primary shards and 1 replica shard per primary. It was configured using the default deployment options on Elastic Cloud with a single node
Clearly, on a single node, you cannot have replicas. So your index should have been configured with 0 replicas, and you can change this dynamically to get your cluster back to green (PUT index/_settings {"index.number_of_replicas": 0}). It's as simple as that.
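For anyone who prefers to script the settings call above, here is a minimal sketch using only the Python standard library. It builds the PUT request without sending it; the base URL http://localhost:9200 and the index name my-index are placeholders assumed for the example, and a real Elastic Cloud deployment would also need authentication.

```python
import json
import urllib.request


def build_replica_update(base_url: str, index: str, replicas: int) -> urllib.request.Request:
    """Build a PUT <index>/_settings request that sets the replica count."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Placeholder URL and index name, assumed for illustration only.
req = build_replica_update("http://localhost:9200", "my-index", 0)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# Sending it would be urllib.request.urlopen(req), which needs a reachable cluster.
```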
Also, this project uses time-based data, and is creating one index per account per day, resulting in about 10 indexes per day (or 100 shards)
I cannot tell whether 50 new primary shards (10 indices) per day is reasonable or not, because you don't give any information regarding the volume of data in your test project. But it's probably too many.
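To see how quickly that adds up, here is a small back-of-the-envelope sketch. The per-day figures come from the question; the 90-day retention is an assumption standing in for "several months".

```python
def shard_copies_created_per_day(indices_per_day: int,
                                 primaries_per_index: int,
                                 replicas_per_primary: int) -> int:
    """Shard copies (primaries and replicas) created each day."""
    return indices_per_day * primaries_per_index * (1 + replicas_per_primary)


# 10 indices per day, 5 primaries each, 1 replica per primary (from the question).
per_day = shard_copies_created_per_day(10, 5, 1)

# Assumed retention period; the question only says "several months".
retention_days = 90

print(f"shard copies created per day: {per_day}")
print(f"shard copies after {retention_days} days: {per_day * retention_days}")
```

Half of those copies are replicas that can never be assigned on a single node, but the primaries alone still grow by 50 per day.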
It seems a better indexing strategy according to the time-based data recommendations would be to create one index per week or month?
Having five years' worth of data in a single index is perfectly possible. It doesn't really depend on how old the data is, but on how big it grows. You mention 250GB and that a shard shouldn't grow over 30GB (which again depends on the spec of the underlying hardware, more on that later). Since you have only 6 shards for that index, each shard will grow to over 40GB (which is ok according to this), but to be on the safe side you should probably increase to 8-9 shards, or split your data into yearly/monthly indices.
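The shard-size arithmetic behind those numbers can be sketched as follows. It assumes data is spread evenly across primary shards, which is a simplification; the 250GB total and 30GB target are the figures from the question.

```python
import math


def shard_size_gb(total_gb: float, primary_shards: int) -> float:
    """Average size of one primary shard, assuming an even spread."""
    return total_gb / primary_shards


def min_primary_shards(total_gb: float, max_shard_gb: float) -> int:
    """Fewest primary shards that keep the average shard at or under the target."""
    return math.ceil(total_gb / max_shard_gb)


total_gb = 250      # projected data size from the question
max_shard_gb = 30   # target shard size from the question

for shards in (6, 8, 9):
    print(f"{shards} shards: about {shard_size_gb(total_gb, shards):.1f} GB each")

print("minimum shards for the target:", min_primary_shards(total_gb, max_shard_gb))
```

With 8 shards the average is slightly above 30GB and with 9 it is below, which is why 8-9 is the suggested range.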
The 30GB-ish limit per shard is also dependent on how much heap your nodes have. If you have nodes with 2GB heap, then having 30GB shards is clearly too big. Since you're on ES Cloud and you plan to have 250GB of data, you must have chosen a node capacity of 16GB heap + 384GB storage (or bigger). So with 16GB heap, it's reasonable to have 30GB shards, but you'll need several nodes in my opinion. You can verify how many nodes you have using GET _cat/nodes?v.
That being said, according to another answer on SO the optimal number of indexes per node is 1...
What Chris is saying is a theoretical/ideal setting, which is almost never possible/advisable/desired to do in reality. You do want to have several shards in your index and the reason is that when your data grows, you want to be able to scale to more than one node, that's the whole point of ES, otherwise you'd be better off embedding the Lucene library directly in your project.
..., so what is the utility in creating multiple indices for time-based data in the first place if we're only running on one node?
First check how many nodes you have in your cluster using GET _cat/nodes?v, but clearly if you're assigned a single node for 250GB of data split on 6-8 shards, a single node is not ideal, indeed.
How does one add a node to an ES deployment in Elastic Cloud?
Right now, you can't. However, at the last Elastic{ON} conference, Elastic announced that it will be possible to pick the number of nodes or the kind of deployment (hot/warm, etc) you want to set up.
Currently all of the replica nodes in the test project are unassigned, because the deployment only has one node.
You don't really need replicas in a test project, right?
The solution has been to delete old data (as the test project was only meant to have several months of data at a time), but it appears the node didn't lose data when it restarted. Why would this be?
How did you delete the data? Between the time you deleted the data and before the node restarted, did you witness that the data was indeed gone?
Our test project has taken no snapshots, which are supposed to happen automatically on Elastic Cloud every 30 minutes.
This is weird, since on ES cloud your cluster generally gets snapshotted every 30 minutes. What do you see under Deployments > cluster-id > Elasticsearch > Snapshots? What does the ES Cloud support say about it? What do you get when running GET _cat/repositories?v and GET _cat/snapshots/found-snapshots?v? (update your question with the results)
I am curious about the impact of #shards in Elasticsearch. Particularly, I am looking for the pros and cons for having big #shards and small #shards.
For example, I have a two-node cluster. Assuming replica is one, for one index, should I create two shards which are spread across these two nodes? Or should I use the default i.e., 5 shards? My thinking is the first one.
It seems to me there is no reason to have more than one shard per node per index, as one Lucene instance can have better caching than a few Lucene instances.
Edit:
Let's say I have only one node and want to create one index. How does the shard number affect the performance in this case? My thinking is I should have only one shard in such a case. Is that right?
You can find the answer in the elasticsearch documentation here