Elasticsearch reindex gets stuck

Context
We have two Elasticsearch clusters, one with 6 nodes and one with 3. The 6-node cluster is the one we use in the production environment, and the 3-node cluster is for testing. (We have the same problem in both clusters.) All the nodes have the following characteristics:
Elasticsearch 7.4.2
1TB HDD disk
8 GB RAM
In our case, we need to reindex some of the indexes. Those indexes have billions of documents and a size between 50GB and 250GB.
Problem
Whenever we start reindexing, either internally or from a remote source, the task starts working correctly, but it reaches a point where it stops reindexing for no apparent reason. We can't see anything in the logs. The task is not cancelled or anything; it simply stops reindexing documents, as if the task were stuck. We tried changing GC strategies, using CMS and Shenandoah, but nothing changed.
Has anyone run into the same problem?

It's difficult to find the root cause (RCA) of issues like this without debugging them, especially with the little information provided (cluster and index configuration, index slow logs, Elasticsearch error logs, and Elasticsearch hot threads output are missing, to name a few).
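When a reindex looks stuck, a first step is to check the running task and what the nodes are busy with. A minimal sketch using standard Elasticsearch APIs (adjust the host as needed):
# List running reindex tasks, including how many documents have been processed so far
curl -XGET 'localhost:9200/_tasks?detailed=true&actions=*reindex'
# Dump hot threads to see what each node is actually spending CPU time on
curl -XGET 'localhost:9200/_nodes/hot_threads'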

Related

Limiting Elasticsearch data retention below disk space

Scenario:
We use Elasticsearch & logstash to do application logging for a moderately high traffic system
This system generates ~200gb of logs every single day
We use 4 sharded instances and want to retain roughly the last 3 days' worth of logs
So, we implemented a "cleanup" system, running daily, which removes all data older than 3 days
So far so good. However, a few days ago, some subsystem generated a persistent spike of log data, filling up all available disk space within a few hours, which turned the cluster red. This also meant that the cleanup system wasn't able to connect to ES, as the entire cluster was down on account of the disk being full. This is extremely problematic, as it limits our visibility into what's going on and blocks our ability to see what caused this in the first place.
Doing root cause analysis here, a few questions pop out:
How can we look at the system in eg Kibana when the cluster status is red?
How can we tell ES to throw away (oldest-first) logs if there is no more space, rather than going status=red?
In what ways can we make sure this does not happen ever again?
Date-based index patterns are tricky with spiky loads. There are two things to combine for a smooth setup that doesn't need manual intervention:
Switch to rollover indices. You can then define that a new index should be created once the existing one has reached X GB. You then no longer care about the log volume per day; you can simply keep as many indices around as you have disk space (and leave some buffer / fine-tune the watermarks). A sketch of the rollover API call is shown below.
To automate the rollover, removal of indices, and optionally setting of an alias, we have Elastic Curator:
Example for rollover
Example for delete index, but you want to combine this with the count filtertype
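A minimal sketch of the rollover call itself, assuming a write alias named logs_write already points at the current index; the 50gb threshold is an example value, and the max_size condition requires a reasonably recent Elasticsearch (6.1+):
# Roll over to a new index once the current one exceeds ~50 GB
curl -XPOST 'localhost:9200/logs_write/_rollover' -H 'Content-Type: application/json' -d '
{"conditions": {"max_size": "50gb"}}'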
PS: There will soon be another solution, called Index Lifecycle Management. It's built directly into Elasticsearch and can be configured through Kibana, but it's only just around the corner at the moment.
How can we look at the system in eg Kibana when the cluster status is red?
Kibana can't connect to ES if ES is already down. It's best to poll the Cluster Health API to get the cluster's current state.
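For reference, the health check is a single call with no special options:
# Returns status (green/yellow/red), node counts, and unassigned shard counts
curl -XGET 'localhost:9200/_cluster/health?pretty'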
How can we tell ES to throw away (oldest-first) logs if there is no more space, rather than going status=red?
This option is not built into Elasticsearch. The best way is to monitor disk space using Watcher or some other tool, and have your monitoring send out an alert and trigger a job that cleans up old logs if disk usage rises above a specified threshold.
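The cleanup job can be as simple as deleting the oldest daily index once the alert fires; the index name below is a placeholder:
# Delete the oldest daily index (placeholder name)
curl -XDELETE 'localhost:9200/logstash-2018.01.01'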
In what ways can we make sure this does not happen ever again?
Monitor the disk space of your cluster nodes.
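For a quick view of per-node disk usage without external tooling, the cat allocation API shows the used and available disk for each node:
# Disk used/available and shard count per node
curl -XGET 'localhost:9200/_cat/allocation?v'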

Search/Filter/Sort on constantly changing 1 million documents

My use case is that I have at most 1 million documents, and the documents are updated constantly (once every 5 minutes). Each document has almost 40 columns, and I have sort/filter/search requirements on almost every column.
Since the documents are changing constantly, a document's values from 5 minutes earlier are not valid anymore. I am thinking that an ideal DB component would need to run in memory. For the other use cases in the application (where documents do not change constantly), I am using an ElasticSearch cluster. So, to be consistent with search elsewhere in the application, I want to explore whether I can run a separate ES node/cluster purely in memory for the use case above. I could not find any examples or precedents for running ElasticSearch in production in a pure in-memory configuration.
If not ES, can I run Apache Solr in memory? I can try out any technology which allows me to run in a pure in-memory mode and provides functionality similar to ES (free-text search at a per-column level).
What would you recommend for this use-case?

Elasticsearch reindex store sizes vary greatly

I am running Elasticsearch 6.2.4. I have a program that will automatically create an index for me as well as the mappings necessary for my data. For this issue, I created an index called "landsat" but it needs to actually be named "landsat_8", so I chose to reindex. The original "landsat" index has 2 shards and 0 read replicas. The store size is ~13.4gb with ~6.6gb per shard and the index holds just over 515k documents.
I created a new index called "landsat_8" with 5 shards, 1 read replica, and started a reindex with no special options. On a very small Elastic Cloud cluster (4GB RAM), it finished in 8 minutes. It was interesting to see that the final store size was only 4.2gb, yet it still held all 515k documents.
After it was finished, I realized that I had failed to create my mappings before reindexing, so I blew it away and started over. I was shocked to find that after an hour, the /_cat/indices endpoint showed that only 7.5gb of data and 154,800 documents had been reindexed. 4 hours later, the entire job seemed to have died at 13.1gb, but it showed that only 254,000 documents had been reindexed.
On this small 4gb cluster, this reindex operation was maxing out the CPU. I increased the cluster to the biggest one Elastic Cloud offered (64gb RAM), 5 shards, 0 RR, and started the job again. This time, I set the refresh_interval on the new index to -1 and changed the batch size for the reindex operation to 2000. Long story short, this job finished somewhere between 1h10m and 1h19m. However, this time I ended up with a total store size of 25gb, where each shard held ~5gb.
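For reference, a sketch of the two settings described above, using the index names from this question; the Content-Type header is required on 6.x:
# Disable refresh on the destination index during the reindex
curl -XPUT 'localhost:9200/landsat_8/_settings' -H 'Content-Type: application/json' -d '
{"index": {"refresh_interval": "-1"}}'
# Reindex with a scroll batch size of 2000 documents
curl -XPOST 'localhost:9200/_reindex' -H 'Content-Type: application/json' -d '
{"source": {"index": "landsat", "size": 2000}, "dest": {"index": "landsat_8"}}'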
I'm very confused as to why the reindex operation causes such wildly different results in store size and reindex performance. Why, when I don't explicitly define any mappings and let ES automatically create mappings, is the store size so much smaller? And why, when I use the exact same mappings as the original index, is the store so much bigger?
Any advice would be greatly appreciated. Thank you!
UPDATE 1:
Here are the only differences in mappings:
The left image is "landsat" and the right image is "landsat_8". There is a root-level "type" field and a nested "properties.type" field in the original "landsat" index. I forgot that one of my goals was to remove the "properties.type" field from the data during the reindex. I seem to have been successful in doing so, but at the same time I accidentally renamed the root-level "type" field mapping to "provider", so "landsat_8" has an unused "provider" mapping and an auto-created "type" mapping.
So there are some problems here, but I wouldn't think this would nearly double my store size...
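For what it's worth, a reindex that only drops "properties.type" (without touching the root-level "type" field) could look like the sketch below; it assumes every document actually has a "properties" object:
# Copy documents while removing the nested properties.type field
curl -XPOST 'localhost:9200/_reindex' -H 'Content-Type: application/json' -d '
{
  "source": {"index": "landsat"},
  "dest": {"index": "landsat_8"},
  "script": {"lang": "painless", "source": "ctx._source.properties.remove(\"type\")"}
}'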

Docker Elasticsearch Bulk index timeout

I am running Elasticsearch 2.3 using the official Docker builds. I am trying to bulk index a fairly large dataset. The dataset in question is about 700mb and takes around 30 minutes on a non-dockerized setup. Around 24 hours ago I started the bulk index operation on the Docker Elasticsearch container. As of yet it still hasn't completed; worse, there is no load on the server, which indicates it's not even attempting to index.
I know the bulk indexing works because I can index a smaller dataset and it works without a problem.
Are there any specific settings that I need to be aware of when indexing data over a certain size, or any way to check why it errored?
Thanks in advance.
For any future people reading this, firstly Hello from the past!
Secondly, Elasticsearch has a default maximum HTTP request size of 100mb (the http.max_content_length setting), so make sure your requests (including posted files) are below that.
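One way to stay under the limit is to split the bulk file into smaller chunks and post them one at a time. A sketch, assuming a newline-delimited bulk file named bulk.ndjson containing only index/create actions (two lines per document), so an even line count keeps the action/source pairs together:
# Split into chunks of 10,000 lines (5,000 documents each), then post each chunk
split -l 10000 bulk.ndjson chunk_
for f in chunk_*; do
  curl -s -XPOST 'localhost:9200/_bulk' --data-binary "@$f"; echo
done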

Elasticsearch, how many clusters, indexes do I need for 8 applications

I have an ELK Stack set up and accepting log data from 2 of my applications, and everything is working OK. It's been running for 25 days and I have nearly 4GB of data/documents on a 25GB server.
My question
I have 8 applications in total that I would like to hook up to my ELK Stack.
Is one cluster OK for this, or do I need to add more clusters, say a cluster for each application's data? If so, how do I do that without having to re-index my data?
Why does cluster health say "yellow (244 of 488)"?
Should I index each application into its own index rather than the default "logstash-{todays-date}", like my-app-1-{todays-date}, my-app-2-{todays-date}, etc.?
your help is greatly appreciated
G
Your cluster is yellow because your logstash-* indices are configured with 1 replica and you probably have a single node. 244 of 488 means that you have 488 shards in all your indices but only 244 are assigned on your single node and 244 remain to be assigned to new nodes. This is not a problem per se, but if your current node were to fail for some reason, you'd probably lose some data, whereas if you had 2+ nodes, the data would be replicated on other nodes, your cluster would be green (and you'd see 488 of 488) and you'd have a lower risk of losing data.
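You can confirm this by listing the shards for the logstash indices; on a single-node cluster, the replica copies will show up as UNASSIGNED:
# One row per shard copy, with its state and the node it is assigned to
curl -XGET 'localhost:9200/_cat/shards/logstash-*?v'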
As for your second question, nothing prevents you from storing all the logs from your eight applications in the same daily logstash indices. You just need to make sure that your logstash configuration accounts for each different app and adds a field with the application name (e.g. app: app1, app: app2, etc.) to the indexed log events, so that you can then distinguish within Kibana which app each log event came from.
I have only used Elasticsearch and not the complete ELK stack, but I can give some ideas and guess what is going on. 488 = 2 x 244, so I guess there are unassigned replica shards in the single-machine cluster. You can update this setting ad hoc and set it to zero:
# Disable replicas on an existing index (replace my_index with your index name)
curl -XPUT 'localhost:9200/my_index/_settings' -H 'Content-Type: application/json' -d '
{"index" : {"number_of_replicas" : 0}}'
You should update the logstash index template not to use replicas when you are running just a single machine. Also, your shards seem to be only about 20 MB in size, so I'd recommend each index use just one shard instead of five, since each shard consumes extra resources. Having multiple shards increases indexing speed but slows down queries, so you should check whether one is sufficient or not.
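A sketch of such a template using the legacy template API (the template name is an assumption, and the "template" key applies to older Elasticsearch versions; 6.0+ uses "index_patterns" instead):
# Apply single-shard, zero-replica settings to future logstash indices
curl -XPUT 'localhost:9200/_template/logstash_single_shard' -H 'Content-Type: application/json' -d '
{
  "template": "logstash-*",
  "order": 1,
  "settings": {"number_of_shards": 1, "number_of_replicas": 0}
}'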
One index per application per day would speed up querying if the dashboards are mostly application-specific, and you can create a day-specific alias to be used by cross-application queries.
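A sketch of such an alias, with placeholder index and alias names:
# Group the per-application daily indices under one day-specific alias
curl -XPOST 'localhost:9200/_aliases' -H 'Content-Type: application/json' -d '
{
  "actions": [
    {"add": {"index": "my-app-1-2016.05.01", "alias": "all-apps-2016.05.01"}},
    {"add": {"index": "my-app-2-2016.05.01", "alias": "all-apps-2016.05.01"}}
  ]
}'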
