Limiting Elasticsearch data retention below disk space - elasticsearch

Scenario:
We use Elasticsearch & logstash to do application logging for a moderately high traffic system
This system generates ~200gb of logs every single day
We use 4 sharded instances and want to retain roughly the last 3 days' worth of logs
So, we implemented a "cleanup" system, running daily, which removes all data older than 3 days
So far so good. However, a few days ago, some subsystem generated a persistent spike of log data that filled up all available disk space within a few hours and turned the cluster red. This also meant that the cleanup system couldn't connect to ES, as the entire cluster was down because the disks were full. This is extremely problematic: it limits our visibility into what's going on and blocks our ability to see what caused this in the first place.
Doing root cause analysis here, a few questions pop out:
How can we look at the system in eg Kibana when the cluster status is red?
How can we tell ES to throw away (oldest-first) logs if there is no more space, rather than going status=red?
In what ways can we make sure this does not happen ever again?

Date-based index patterns are tricky with spiky loads. There are two things to combine for a smooth setup that needs no manual intervention:
Switch to rollover indices. You can then define that you want to create a new index once your existing one has reached X GB. Then you don't care about the log volume per day any more, but you can simply keep as many indices around as you have disk space (and leave some buffer / fine tune the watermarks).
To automate the rollover, removal of indices, and optionally setting of an alias, we have Elastic Curator:
Example for rollover
Example for delete index, but you want to combine this with the count filtertype
PS: There will be another solution soon, called Index Lifecycle Management. It's built into Elasticsearch directly and can be configured through Kibana, but it's only around the corner at the moment.
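The "keep as many indices as you have disk space" idea can be sketched in a few lines. This is only an illustration of the selection logic, not Curator's implementation: the function name, the index names, and the byte budget are hypothetical, and the input is assumed to be a list of rolled-over indices already ordered oldest first.

```python
from typing import List, Tuple


def indices_to_delete(indices: List[Tuple[str, int]], budget_bytes: int) -> List[str]:
    """Pick the oldest indices to drop so the rest fit within budget_bytes.

    `indices` holds (name, size_in_bytes) pairs ordered oldest first.
    """
    total = sum(size for _, size in indices)
    doomed = []
    for name, size in indices:
        if total <= budget_bytes:
            break  # the remaining indices already fit
        doomed.append(name)
        total -= size
    return doomed
```

Because rollover caps the size of each index, the number of indices that fit in the budget stays roughly constant no matter how spiky the daily volume is.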

How can we look at the system in eg Kibana when the cluster status is red?
Kibana can't connect to ES if it's already down. Best to poll Cluster health API to get cluster's current state.
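A minimal sketch of polling the Cluster health API from outside Kibana, using only the Python standard library. The base URL and the helper names are assumptions; `status` and `unassigned_shards` are fields of the `_cluster/health` response.

```python
import json
import urllib.request


def fetch_cluster_health(base_url: str, timeout: float = 5.0) -> dict:
    """GET <base_url>/_cluster/health and return the parsed JSON body."""
    with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=timeout) as resp:
        return json.load(resp)


def summarize_health(health: dict) -> str:
    """Reduce the health response to a short one-line summary."""
    status = health.get("status", "unknown")
    unassigned = health.get("unassigned_shards", 0)
    return f"status={status} unassigned_shards={unassigned}"


def check(base_url: str) -> str:
    """Poll once; a connection failure is itself a signal worth reporting."""
    try:
        return summarize_health(fetch_cluster_health(base_url))
    except OSError as exc:  # URLError and socket timeouts are OSError subclasses
        return f"unreachable: {exc}"
```

For example, `check("http://localhost:9200")` (a hypothetical address) could be run from cron or a monitoring agent on a machine that does not depend on the cluster being healthy.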
How can we tell ES to throw away (oldest-first) logs if there is no more space, rather than going status=red?
This option is not built into Elasticsearch. The best way is to monitor disk space using Watcher or some other tool, and have your monitoring send out an alert and trigger a job that cleans up old logs if the disk usage goes above a specified threshold.
In what ways can we make sure this does not happen ever again?
Monitor the disk space of your cluster nodes.
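As a rough sketch of such a check, the snippet below measures disk usage on a node and decides whether a cleanup job should be triggered. The 80% threshold, the function names, and the data path are all assumptions for illustration; a real setup would more likely use Watcher or an existing monitoring agent.

```python
import shutil


def usage_fraction(path: str) -> float:
    """Fraction of the filesystem holding `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def needs_cleanup(fraction_used: float, threshold: float = 0.80) -> bool:
    """True once usage reaches the (assumed) threshold."""
    return fraction_used >= threshold
```

A scheduled job could call `needs_cleanup(usage_fraction("/var/lib/elasticsearch"))` (a hypothetical data path) every few minutes and delete the oldest indices when it returns `True`, well before the cluster runs out of space.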

Related

Implements popular keyword in ElasticSearch

I'm using ElasticSearch on AWS EC2.
I want to implement a "today's popular keywords" feature in ES.
There are 3 indexes (place, genre, name), and I want to see today's popular keywords for the name index only.
I tried to use the ES slowlog and logstash, but the slowlog records one entry per shard.
(e.g. with 5 shards, 5 query log entries are saved for a single search.)
Is there any good and easy way to implement popular keyword in ES?
As far as I know, this is not supported by Elasticsearch and you need to build your own custom solution.
The design you mentioned using the slowlog is not good. As you said, it works on a per-shard basis, and even if you do some extra computing to merge the entries and relate them to a single search at the index level, it would still not be good, because:
You have to change the slow log configuration, and every index needs its own threshold. You can set it to 0ms to make sure all search queries land in the slow logs, but that would take a huge amount of disk space and would not be good for Elasticsearch performance.
You have to parse the slow log in your application, and doing that at runtime would be very costly.
I think you can maintain a distributed cache in your application where you store the top searched keyword like the leaderboard of a multi-player gaming app, which is changing very frequently but in your case, you don't even have to update this cache very frequently. I would not go into much implementation details, but simple Hashmap of search term as key and count as value would solve the issue.
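A minimal sketch of the hashmap idea, assuming the application records each search term itself before sending the query to Elasticsearch. The class and method names are hypothetical, and this in-process counter only stands in for the distributed cache mentioned above.

```python
from collections import Counter
from typing import List, Tuple


class KeywordLeaderboard:
    """Counts search terms in memory; key = term, value = number of searches."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record(self, term: str) -> None:
        """Register one search for `term` (case and outer whitespace ignored)."""
        normalized = term.strip().lower()
        if normalized:
            self._counts[normalized] += 1

    def top(self, n: int = 10) -> List[Tuple[str, int]]:
        """Return the n most searched terms with their counts."""
        return self._counts.most_common(n)

    def reset(self) -> None:
        """Clear all counts, e.g. once a day for a "today's keywords" view."""
        self._counts.clear()
```

Calling `record()` only for searches against the name index and `reset()` at midnight would give the per-day, per-index view asked for in the question.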
Hope this helps. let me know if you have questions.

Elasticsearch reindex gets stuck

Context
We have two Elasticsearch clusters with 6 and 3 nodes each. The cluster with 6 nodes is the one we use in production environment and we use the one with 3 nodes for testing purposes. (We have the same problem in both clusters). All the nodes have the following characteristics:
Elasticsearch 7.4.2
1TB HDD disk
8 GB RAM
In our case, we need to reindex some of the indexes. Those indexes have billions of documents and a size between 50GB and 250GB.
Problem
Whenever we start reindexing, internally or from a remote source, the task starts working correctly but it reaches a point where it stops reindexing, without apparent reason. We can't see anything in the logs. The task is not cancelled or anything, it only stops reindexing documents, it looks like the task gets stuck. We tried changing GC strategies, we used CMS and Shenandoah but nothing changes.
Has anyone run into the same problem?
It's difficult to find the root cause of issues like this without debugging, and with the little information you provided (the cluster and index configuration, index slow logs, Elasticsearch error logs, and Elasticsearch hot threads are missing, to name a few).

Search/Filter/Sort on constantly changing 1 million documents

My use-case is I have max 1 Million documents and documents getting updated constantly (once every 5 mins). Each document has almost 40 columns and I have sort/filter/search requirements on almost every column.
Since the documents are changing constantly, the doc value 5 minutes earlier is not valid anymore. I am thinking that an ideal DB component will need to be running in memory. For the other use-cases in the application (where documents do not change constantly), I am using ElasticSearch cluster. So to be consistent with the search elsewhere in the application, I want to explore if I can run a separate ES node/cluster purely in memory for my use-case above. I could not find any examples or precursors for running ElasticSearch in production in a pure in-memory configuration.
If not ES, can I run Apache Solr in memory? I can try out any technology which allows me to run in a pure in-memory mode, and provide functionality similar to ES (free text search at a per-column level).
What would you recommend for this use-case?

How to stop auto reindexing in elastic search if any update happens?

I am having a big use case with elasticsearch which has millions of records in it.
I will be updating the records frequently, say 1000 records per hour.
I don't want elastic search to reindex for my every update.
I am planning to reindex it on weekly basis.
Any Idea how to stop auto-reindex while update ?
Or any other better suggestion is welcome . Thanks in advance :)
Elasticsearch (ES) updates an existing doc in the following manner:
1. Deletes the old doc.
2. Indexes a new doc with the changes applied to it.
According to ES docs :-
In Elasticsearch, this lightweight process of writing and opening a
new segment is called a refresh. By default, every shard is refreshed
automatically once every second. This is why we say that Elasticsearch
has near real-time search: document changes are not visible to search
immediately, but will become visible within 1 second.
Note that these changes are not visible/searchable until ES writes them to a new segment in the filesystem cache, which is controlled by the soft commit (the ES refresh interval, 1 second by default). The hard commit (flush) actually writes the documents to disk so they cannot be lost; it is a costlier operation than a soft commit.
You need to make sure, you tune your ES refresh interval, and do proper load testing, as setting it very low and very high has its own pros and cons.
For example, setting it very low (say 1 second) while many updates are happening hurts performance and might crash your system. Setting it very high (say 1 hour) means you no longer have NRT (near real-time) search; in the meantime memory could hold millions of docs (depending on your app), which can cause an out-of-memory error, and committing such a large buffer is very costly.
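As an illustration, the refresh interval is a dynamic index setting that can be changed through the update index settings API (`PUT /<index>/_settings`). The sketch below only builds the HTTP request with the standard library; the host, the index name `records`, and the `30s` value are assumptions, and the right interval should come out of your own load testing.

```python
import json
import urllib.request


def build_refresh_interval_request(base_url: str, index: str, interval: str) -> urllib.request.Request:
    """Build (but do not send) a PUT <index>/_settings request."""
    body = json.dumps({"index": {"refresh_interval": interval}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical values; send with urllib.request.urlopen(req) when ready.
req = build_refresh_interval_request("http://localhost:9200", "records", "30s")
```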

Is solr cloud applicable for use case where indexing is offline?

Solr cloud seems to be the suggested method to scale solr in future. I understand that legacy scaling methods (like master slave and replication) still exists. My use case with solr does not have to be near real time (NRT). It is fine if the newly indexed data is visible for searchers after about 1 day.
In the master slave (legacy scaling), I could replicate it once a day. In Solr cloud do I have an option like this?
Also I don't want the indexing to impact the searcher performance during index time. Is there a way to isolate the indexer from searcher shards in solr cloud?
You could skip SolrCloud and just index on a dedicated separate collection.
Then, you bring the new content to each machine individually and do a Core Swap.
Or similar thing using Aliases to point to the newest core/collection. Which also allows you to segment old content and new content into different collections and search them together.
I also used collection aliases in such cases. You can build your index once a day and when it is ready you simply change the alias. I'll give an example
At the very beginning you create an index called index_2014_12_01 and point the alias at it. The next day you build index_2014_12_02 and change the alias to point to index_2014_12_02 instead of index_2014_12_01.
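A small sketch of the daily alias switch, assuming the Collections API `CREATEALIAS` action is used (calling it again for an existing alias re-points that alias). The alias name `index`, the Solr base URL, and the helper name are hypothetical; the code only builds the URL and does not send it.

```python
from urllib.parse import urlencode


def create_alias_url(solr_base: str, alias: str, collection: str) -> str:
    """Build the Collections API URL that points `alias` at `collection`."""
    params = urlencode({
        "action": "CREATEALIAS",
        "name": alias,
        "collections": collection,
    })
    return f"{solr_base}/admin/collections?{params}"


# Day 2: re-point the (assumed) alias "index" at the freshly built collection.
url = create_alias_url("http://localhost:8983/solr", "index", "index_2014_12_02")
```

Searchers keep querying the alias, so the switch needs no change on the client side.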
