I have an index with 12 shards allocated across 3 nodes. The index holds around 5 million documents. Yesterday, these 5 million documents took up around 35GB; if I could reindex all of them at their new, reduced size, the total index size would be around 10GB.
The problem is that if I reindex these documents over the existing ones using upserts (that is, updating them all), the index size does not shrink at all.
I've been looking at the Elasticsearch _forcemerge API, but at the moment I'm not using any replicas, and they don't really recommend using this API on a set of primary shards.
I'm still learning Elasticsearch internals, and I'm working with a PROD environment, so I can't simply wipe the index and reindex from scratch.
Can anyone shed some light on this?
My Elasticsearch cluster is constantly serving search queries. Once a week I get a batch of new documents that I need to add to the index. Adding them greatly slows down searches while Elasticsearch is indexing, merging, or moving shards.
What is the best way to avoid slowdown?
My solution so far:
1. Spin up an empty single-node Elasticsearch cluster.
2. Restore the index I need to update from a snapshot.
3. Add the new documents to this index.
4. Force merge the shards.
5. Snapshot resulting index.
6. Restore updated index on production cluster.
7. Update aliases to use updated index and delete old index.
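Step 7 can in principle be done as a single atomic call to the `_aliases` API, so that searches never see a moment without an index behind the alias. A minimal sketch of the request body, assuming hypothetical index names `products_v1` and `products_v2` and a hypothetical alias `products` (none of these names come from the setup described above):

```python
import json


def alias_swap_body(alias, old_index, new_index, delete_old=True):
    """Build the body for POST /_aliases that moves `alias` to `new_index`.

    All actions in one request are applied atomically by Elasticsearch.
    """
    actions = [
        {"remove": {"index": old_index, "alias": alias}},
        {"add": {"index": new_index, "alias": alias}},
    ]
    if delete_old:
        # remove_index deletes the old index as part of the same request
        actions.append({"remove_index": {"index": old_index}})
    return {"actions": actions}


if __name__ == "__main__":
    # Hypothetical names, for illustration only
    body = alias_swap_body("products", "products_v1", "products_v2")
    print(json.dumps(body, indent=2))
```

The body would be sent with whatever HTTP client is already in use; the function only builds the JSON, so it can be checked without a running cluster.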
I expect that restoring from a snapshot shouldn't consume many resources, though the restored index will probably need warming up for better performance.
Is this a normal solution, or is it too complicated?
Maybe Elasticsearch has a proper way to add documents without downtime or cluster slowdown?
500GB on one primary shard: I would definitely fix this before doing anything else. You have 10 nodes, so you need to spread the load over all of them. Adding nodes will not help at all.
The official recommendation is to keep shards within roughly 10-50GB. So in your case I would split that index into 10 primary shards (+1 replica each), so that each node can handle a part of the job. Otherwise, there's always only one node doing the write job and two nodes doing the read job, which is not optimal.
So before coming up with a way to circumvent the issue, fix the issue as I described above. Your cluster will be much better off, because 10 nodes should definitely handle 5TB easily without having to resort to a complex update procedure like the one you listed.
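The arithmetic behind the "10 primary shards" figure can be sketched as below. This is only a rough sizing helper under the 50GB upper bound mentioned above; the function name is made up for illustration and real sizing also depends on query load and hardware.

```python
import math


def primary_shard_count(index_size_gb, target_shard_size_gb):
    """Smallest number of primary shards that keeps each shard at or
    below target_shard_size_gb, assuming documents spread evenly."""
    if index_size_gb <= 0 or target_shard_size_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(index_size_gb / target_shard_size_gb))


if __name__ == "__main__":
    # 500GB index, shards capped at 50GB each
    print(primary_shard_count(500, 50))
```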
Try it out...
I need to load 1.2 billion documents into Elasticsearch. As of today we have 6 nodes in the cluster. To distribute the shards equally among the 6 nodes, I set the number of shards to 42. I use Spark, and it takes me almost 3 days to load the index. The shard distribution looks way off.
Node 6 has only two shards, while node 2 has almost 10. The size distribution is also uneven: some shards are 114.6GB while others are just 870MB within the same node.
I have tried to find a solution too. I can include
index.routing.allocation.total_shards_per_node: 7
while creating the index to make the shards distribute evenly. Will forcing a fixed number of shards onto a node crash the node if it does not have enough resources available?
I want to size the shards evenly. My index size is approximately 900GB, and I want each shard to be at least 20GB. Could I use the following setting while creating the index?
max_primary_shard_size: 25gb
Is setting a maximum shard size only possible through an ILM policy, and will I need a rollover policy for that? I am not too familiar with ILM. Sorry if this does not make sense.
The main reason I am trying to optimize the index is that I am getting timeout errors in my application when I query Elasticsearch. I know I can increase the timeout in my application and do some query optimization, but first I want to optimize my index and make my application as fast as possible.
I load the index only once and do not write any documents to it after that one-time load. For additional data, which I load every 15 days, I create a different index and use an alias over both indexes to query. If there are any suggestions for optimizing my indexes other than sharding, I would really appreciate them. It takes me 3 days just to load the data, so it is quite difficult to experiment.
Are you using custom routing values in your indexing approach? That might explain the shard size differences.
And if you aren't already, disable replicas and refreshes while doing your bulk indexing, as that will speed things up.
Finally, your shard size of 20GB is probably a little low; I would suggest doubling that, aiming for <50GB.
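As a rough sketch of the "disable replicas and refreshes" advice, the two settings involved are `index.number_of_replicas` and `index.refresh_interval`. The helpers below only build the bodies for `PUT /<index>/_settings`; the function names and the restored values (1 replica, `1s` refresh) are assumptions, so substitute whatever the index normally uses.

```python
def bulk_load_settings():
    """Settings to apply before a large bulk load:
    no replicas, and periodic refresh disabled."""
    return {"index": {"number_of_replicas": 0, "refresh_interval": "-1"}}


def post_load_settings(replicas=1, refresh_interval="1s"):
    """Settings to restore once the bulk load has finished.
    The defaults here are assumed values, not taken from the question."""
    return {
        "index": {
            "number_of_replicas": replicas,
            "refresh_interval": refresh_interval,
        }
    }


if __name__ == "__main__":
    print(bulk_load_settings())
    print(post_load_settings())
```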
Is there any limit on how many indexes we can create in Elasticsearch?
Can 100,000 indexes be created in Elasticsearch?
I have read that a maximum of 600-1000 indices can be created. Can this be scaled?
E.g.: I have a number of stores, and each store has items. Each store will have its own index where its items will be indexed.
There is no limit as such, but obviously you don't want to create too many indices (how many is too many depends on your cluster, nodes, size of indices, etc.). In general it's not advisable, as it can have a severe impact on cluster functioning and performance.
Please check Loggly's blog; their first point is about proper provisioning, and below is the relevant text from that blog.
ES makes it very easy to create a lot of indices and lots and lots of
shards, but it’s important to understand that each index and shard
comes at a cost. If you have too many indices or shards, the
management load alone can degrade your ES cluster performance,
potentially to the point of making it unusable. We’re focusing on
management load here, but running too many indices/shards can also
have pretty significant impacts on your indexing and search
performance.
The biggest factor we’ve found to impact management overhead is the
size of the Cluster State, which contains all of the mappings for
every index in the cluster. At one point, we had a single cluster with
a Cluster State size of over 900MB! The cluster was alive but not
usable.
Edit: Thanks to @Silas, who pointed out that from ES 2.X, cluster state updates are much less costly (as only the diff is sent in the update call). More info on this change can be found on this ES issue
I am using Elasticsearch to index logs from automation runs of test cases. I am creating an index for each run (a run can have anywhere from 1,000 to a million events), which comes to about 200 indices per day. Is it a good approach to create an index for each run, or should I just have 1 index and put the logs from multiple runs into it?
The amount of data is huge, so I chose separate indices. I am expecting 200 logs every day, each with 1 million events. Please help me.
Depends how long you want to retain your data and the size of your cluster. At 200 indices per day, each with lots of associated files, you're looking at a lot of file handles. So, that doesn't sound like it would scale beyond a few weeks or months on a very small cluster since you'll be running out of file handles.
A better strategy might be to do what logstash does by default which is to create a new index every day. Then your next choice will be to play with the number of shards and nodes in the cluster. Assuming you want to store a worst case of 200M log entries per day on a 3 or 5 node cluster, probably the default of 5 shards is fine. If you go for more nodes, you'll probably want more shards so that each shard is smaller. Also consider using elasticsearch curator to e.g. close older indices and optimize them.
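A minimal sketch of the one-index-per-day naming scheme, and of picking out the indices that are old enough to close. The `logstash-YYYY.MM.DD` pattern mirrors the Logstash default; the helper names and the 7-day retention in the example are assumptions for illustration.

```python
from datetime import date, timedelta


def daily_index_name(day, prefix="logstash"):
    """Index name for a given day, e.g. logstash-2024.01.05."""
    return f"{prefix}-{day:%Y.%m.%d}"


def indices_older_than(today, retention_days, known_days, prefix="logstash"):
    """Names of daily indices strictly older than the retention window,
    i.e. candidates for closing or deleting."""
    cutoff = today - timedelta(days=retention_days)
    return [daily_index_name(d, prefix) for d in sorted(known_days) if d < cutoff]


if __name__ == "__main__":
    days = [date(2024, 1, 1), date(2024, 1, 3), date(2024, 1, 9)]
    print(daily_index_name(date(2024, 1, 5)))
    print(indices_older_than(date(2024, 1, 10), 7, days))
```

Every run's events for a given day then go to the same index, with the run ID stored as a field on each document so individual runs can still be filtered.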
So my index grew too fast and now has 60 million docs in 3 shards (single node).
I want to buy more machines and split content into more shards. How can I do this?
Is it just a matter of connecting new nodes to the cluster and updating the number of shards on the master?
AFAIK Elasticsearch cannot yet redistribute indexed documents automatically (see here). You would have to reindex all content. The underlying problem is that documents are assigned to shards by a hash value modulo the number of shards. Just adding shards and continuing to index would keep adding documents to the old shards too.
Elasticsearch allows you to distribute documents according to a custom function (the routing parameter). You could route all new content to the new shards, but this makes deletions difficult, because now you have to know whether a document is "old" or "new". It also ruins your uniform index statistics, which may bias ranking in non-obvious ways.
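To see why hash-modulo placement breaks when the shard count changes, here is a small simulation. It uses MD5 as a stand-in hash purely for illustration (Elasticsearch uses its own hash of the routing value), and the document IDs are made up.

```python
import hashlib


def shard_for(doc_id, num_shards):
    """Pick a shard as hash(doc_id) modulo the number of shards.
    MD5 is only a stand-in for the hash Elasticsearch really uses."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


def count_moved(doc_ids, old_shards, new_shards):
    """How many documents would map to a different shard number
    if the shard count changed from old_shards to new_shards."""
    return sum(
        1 for d in doc_ids if shard_for(d, old_shards) != shard_for(d, new_shards)
    )


doc_ids = [f"doc-{i}" for i in range(1000)]  # hypothetical IDs

if __name__ == "__main__":
    moved = count_moved(doc_ids, 3, 5)
    print(f"{moved} of {len(doc_ids)} documents would map to a different shard")
```

Most documents end up mapped to a different shard number, so lookups by ID would miss documents that are physically still sitting on their old shard.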
Bottom line: adding shards to an existing index requires reindexing all contents or some heavy hacking.
You already have 3 shards, so if you add 2 nodes Elasticsearch will automatically reallocate 2 of the shards to the 2 new nodes, so each shard gets a node to itself and roughly three times the resources it had before.
If you want to add more shards, you need to reindex your data. This can be done by creating a new index with the desired number of shards and copying your data to that index (see https://www.elastic.co/guide/en/elasticsearch/guide/current/reindex.html).
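A minimal sketch of the two request bodies involved, assuming a version of Elasticsearch that ships the `_reindex` API (older versions need a scroll plus bulk loop instead, as the linked guide describes). The index names and the shard count in the example are hypothetical.

```python
def create_index_body(number_of_shards, number_of_replicas=1):
    """Body for PUT /<new_index> that fixes the shard count up front."""
    return {
        "settings": {
            "index": {
                "number_of_shards": number_of_shards,
                "number_of_replicas": number_of_replicas,
            }
        }
    }


def reindex_body(source_index, dest_index):
    """Body for POST /_reindex that copies every document across."""
    return {"source": {"index": source_index}, "dest": {"index": dest_index}}


if __name__ == "__main__":
    # Hypothetical names: copy docs_v1 (3 shards) into docs_v2 (6 shards)
    print(create_index_body(6))
    print(reindex_body("docs_v1", "docs_v2"))
```

Once the copy has finished, searches can be pointed at the new index (ideally through an alias) and the old index deleted.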