How do I configure elasticsearch to retain documents up to 30 days? - elasticsearch

Is there a default data retention period in elasticsearch? If yes can you help me find the configuration?

There is no retention setting as such, and the per-document _ttl that used to serve this purpose is no longer supported in Elasticsearch 5.0 or greater. The best practice is to create indexes periodically (daily is most common) and then delete the index when the data gets old enough.
Here's a reference to how to delete an index (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-delete-index.html)
This article (though it's old enough to reference _ttl) also gives some insight: https://www.elastic.co/blog/using-elasticsearch-and-logstash-to-serve-billions-of-searchable-events-for-customers
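For the cleanup itself, here is a minimal sketch using the official Python client; the index naming scheme (logs-YYYY.MM.DD), cluster address, and 30-day window are assumptions on my part, not anything from the linked docs:

# Sketch: drop daily indices older than 30 days (assumes names like logs-2024.01.05)
from datetime import datetime, timedelta, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed cluster address
cutoff = datetime.now(timezone.utc) - timedelta(days=30)

for name in es.indices.get(index="logs-*"):
    day = datetime.strptime(name.split("-", 1)[1], "%Y.%m.%d").replace(tzinfo=timezone.utc)
    if day < cutoff:
        es.indices.delete(index=name)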
As a reminder, it's best to protect your Elasticsearch cluster from the outside world via a proxy and restrict the methods that can be sent to your cluster. This way you can prevent your cluster from being ransomed.

Yes, you can set a TTL on the data (this is the legacy _ttl field mentioned above, removed in 5.x). Take a look here for the configuration options available:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-ttl-field.html

Elasticsearch curator is the tool to use if you want to manage your indexes: https://www.elastic.co/guide/en/elasticsearch/client/curator/current/index.html
Here's an example of how to delete indices based on age: https://www.elastic.co/guide/en/elasticsearch/client/curator/current/ex_delete_indices.html
Combine with cron to have this done at regular intervals.
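For example, a nightly crontab entry along these lines (the binary and file paths are placeholders, not from the Curator docs) would apply a delete_indices action file once a day:

# Run Curator at 01:00 every night; config and action-file paths are assumptions
0 1 * * * /usr/local/bin/curator --config /etc/curator/curator.yml /etc/curator/delete_indices_over_30_days.yml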

There is no default retention period, but newer versions of Elasticsearch have index lifecycle management (ILM), which among other things can delete stale indices to enforce data retention standards (see the ILM documentation).
Simple policy example:
PUT _ilm/policy/my_policy
{
  "policy": {
    "phases": {
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
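A policy on its own does nothing until indices reference it. As a rough sketch of how you might attach it, here with the 8.x Python client and an index template (the template name, index pattern, and client version are assumptions on my part):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed cluster address
# Any new index matching logs-* picks up the 30-day delete policy defined above
es.indices.put_index_template(
    name="logs-template",            # hypothetical template name
    index_patterns=["logs-*"],       # hypothetical index pattern
    template={"settings": {"index.lifecycle.name": "my_policy"}},
)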
If you use OpenSearch in AWS then take a look at this documentation for the same thing.
Pretty old question, but I just ran into the same one; maybe this will be helpful for somebody else.

Just FYI if you are using AWS's Elasticsearch service, they have a great doc on Using Curator to Rotate Data which includes some sample python code that can be used in a Lambda function.

Related

clear prometheus metrics from collector

I'm trying to modify prometheus mesos exporter to expose framework states:
https://github.com/mesos/mesos_exporter/pull/97/files
A bit about the mesos exporter: it collects data from both the Mesos /metrics/snapshot endpoint and the /state endpoint.
The issue with the latter, both with the changes in my PR and with the existing metrics reported on slaves, is that once a metric is created it lasts forever (until the exporter is restarted).
So if for example a framework was completed, the metrics reported for this framework will be stale (e.g. it will still show the framework is using CPU).
So I'm trying to figure out how I can clear those stale metrics. If I could just clear the entire mesosStateCollector each time before collect is done it would be awesome.
There is a delete method on the different p8s vectors (e.g. GaugeVec), but in order to delete a metric I need not only the label name but also the label value for the relevant metric.
OK, so it seems it was easier than I thought (if only I had been familiar with Go before approaching this task).
Just need to cast the collector to GaugeVec and reset it:
prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Help:      "Total slave CPUs (fractional)",
	Namespace: "mesos",
	Subsystem: "slave",
	Name:      "cpus",
}, labels): func(st *state, c prometheus.Collector) {
	c.(*prometheus.GaugeVec).Reset() // <-- added this for each GaugeVec
	for _, s := range st.Slaves {
		c.(*prometheus.GaugeVec).WithLabelValues(s.PID).Set(s.Total.CPUs)
	}
},
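For context, Reset() deletes every child metric currently stored in the vector, so on each collection only the slaves (or frameworks) present in the current /state snapshot are re-populated, and anything that has gone away simply stops being exported.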

Elasticsearch Delete by Query Version Conflict

I am using Elasticsearch version 5.6.10. I have a query that deletes records for a given agency, so they can later be updated by a nightly script.
The query is in elasticsearch-dsl and looks like this:
def remove_employees_from_search(jurisdiction_slug, year):
    s = EmployeeDocument.search()
    s = s.filter('term', year=year)
    s = s.query('nested', path='jurisdiction', query=Q("term", **{'jurisdiction.slug': jurisdiction_slug}))
    response = s.delete()
    return response
The problem is I am getting a ConflictError exception when trying to delete the records via that function. I have read this occurs because the documents were different between the time the delete process started and executed. But I don't know how this can be, because nothing else is modifying the records during the delete process.
I am going to add s = s.params(conflicts='proceed') in order to silence the exception. But this is a band-aid as I do not understand why the delete is not processing as expected. Any ideas on how to troubleshoot this? A snapshot of the error is below:
ConflictError: TransportError(409,
u'{
  "took": 10,
  "timed_out": false,
  "total": 55,
  "deleted": 0,
  "batches": 1,
  "version_conflicts": 55,
  "noops": 0,
  "retries": {
    "bulk": 0,
    "search": 0
  },
  "throttled_millis": 0,
  "requests_per_second": -1.0,
  "throttled_until_millis": 0,
  "failures": [
    {
      "index": "employees",
      "type": "employee_document",
      "id": "24681043",
      "cause": {
        "type": "version_conflict_engine_exception",
        "reason": "[employee_document][24681043]: version conflict, current version [5] is different than the one provided [4]",
        "index_uuid": "G1QPF-wcRUOCLhubdSpqYQ",
        "shard": "0",
        "index": "employees"
      },
      "status": 409
    },
    {
      "index": "employees",
      "type": "employee_document",
      "id": "24681063",
      "cause": {
        "type": "version_conflict_engine_exception",
        "reason": "[employee_document][24681063]: version conflict, current version [5] is different than the one provided [4]",
        "index_uuid": "G1QPF-wcRUOCLhubdSpqYQ",
        "shard": "0",
        "index": "employees"
      },
      "status": 409
    }
You could try making it do a refresh first
client.indices.refresh(index='your-index')
source https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/api-reference.html#_indices_refresh
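In elasticsearch-dsl terms (which the question uses), combining that refresh with the conflicts='proceed' flag the asker mentions would look roughly like this sketch; the connection setup is an assumption, and the index name comes from the error output:

from elasticsearch_dsl import Q
from elasticsearch_dsl.connections import connections

# EmployeeDocument is the asker's Document class from the question
def remove_employees_from_search(jurisdiction_slug, year):
    # Refresh first so the delete-by-query snapshot matches what is currently searchable
    connections.get_connection().indices.refresh(index='employees')

    s = EmployeeDocument.search()
    s = s.filter('term', year=year)
    s = s.query('nested', path='jurisdiction',
                query=Q("term", **{'jurisdiction.slug': jurisdiction_slug}))
    # Skip documents whose version changed instead of raising ConflictError
    s = s.params(conflicts='proceed')
    return s.delete()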
First, this is a question that was asked 2 years ago, so take my response with a grain of salt due to the time gap.
I am using the JavaScript API, but I would bet that the flags are similar. When you index or delete there is a refresh flag which lets you force the operation to become visible to search before the call returns.
I am not an Elasticsearch guru, but the engine must perform some systematic maintenance on the indices and shards to move them to a stable state. That probably happens over time, so you would not necessarily get an immediate state update. Furthermore, from personal experience, I have seen cases where a delete does not seem to remove the item from the index: it may be marked as "deleted" and given a new version number, but it seems to stick around (probably until general maintenance sweeps run).
Here I am showing the js API for delete, but it is the same for index and some of the other calls.
client.delete({
  id: string,
  index: string,
  type: string,
  wait_for_active_shards: string,
  refresh: 'true' | 'false' | 'wait_for',
  routing: string,
  timeout: string,
  if_seq_no: number,
  if_primary_term: number,
  version: number,
  version_type: 'internal' | 'external' | 'external_gte' | 'force'
})
https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/api-reference.html#_delete
refresh
'true' | 'false' | 'wait_for' - If true then refresh the affected shards to make this operation visible to search, if wait_for then wait for a refresh to make this operation visible to search, if false (the default) then do nothing with refreshes.
For additional reference, here is the page on Elasticsearch refresh info and what might be a fairly relevant blurb for you.
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-refresh.html
Use the refresh API to explicitly refresh one or more indices. If the request targets a data stream, it refreshes the stream’s backing indices. A refresh makes all operations performed on an index since the last refresh available for search.
By default, Elasticsearch periodically refreshes indices every second, but only on indices that have received one search request or more in the last 30 seconds. You can change this default interval using the index.refresh_interval setting.

Exclude information from metrics endpoint

The metrics endpoint can respond with a lot of data, especially for the cache.* and gauge.* metrics. Is there a way to exclude these and only show the system metrics?
I'm asking this because the endpoint is polled by Logstash and it exceeds the maximum number of fields per index.
I found a solution for the cache.* metrics. Add this to one of your configuration classes:
@Bean
public CachePublicMetrics emptyCachePublicMetrics() {
    return null;
}
Also, in my particular case with Logstash, the problem can be solved by adding a prune filter to the Logstash config:
prune {
  blacklist_names => [ "^cache\.", "^gauge\." ]
}

Are ElasticSearch scripts safe for concurrency issues?

I'm running a process which updates user documents on Elasticsearch. This process can run on multiple instances on different machines. If two instances try to run a script that updates the same document at the same time, can some of the data be lost because of a race condition, or is the internal script mechanism safe (e.g. using the version property for optimistic locking, or some other way)?
See the official ES scripts documentation.
Using the version attribute is safe for that kind of job.
Do the search with version: true:
GET /index/type/_search
{
  "version": true,
  your_query...
}
Then for the update, add a version attribute corresponding to the number returned during the search.
POST /index/type/the_id_to_update/_update?version=3 // <- returned by the search
{
  "doc": {
    "ok": "name"
  }
}
https://www.elastic.co/guide/en/elasticsearch/guide/current/version-control.html
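If the document was modified in between, Elasticsearch rejects the update carrying the stale version with a 409 version_conflict_engine_exception (the same error shown in the delete-by-query question above), so the losing writer simply re-reads the document and retries with the fresh version instead of silently overwriting the other change.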

can elasticsearch guarantee correctness when updating one doc at high frequency?

I have been working on a project which involves massive updates to Elasticsearch, and I found that when updates are applied to one single doc at high frequency, consistency cannot be guaranteed.
For each update, this is what we do (Scala code). Notice that we have to explicitly remove the original fields and replace them with the new ones, because a 'merge' is not what we want (_update is in fact a merge in Elasticsearch).
def replaceFields(alarmId: String, newFields: Map[String, Any]): Future[BulkResponse] = {
  def removeField(fieldName: String): UpdateDefinition = {
    log.info("script: " + s"""ctx._source.remove("${fieldName}")""")
    update id alarmId in IndexType script s"""ctx._source.remove("${fieldName}")"""
  }

  client.execute {
    bulk(
      {newFields.toList.map(ele => removeField(ele._1)) :+
        {update id alarmId in IndexType doc (newFields)}} : _*
    )
  }
}
It cannot. You can increase the write quorum level to all (see Understanding the write_consistency and quorum rule of Elasticsearch for some discussion around this; also see the docs https://www.elastic.co/guide/en/elasticsearch/reference/2.4/docs-index_.html#index-consistency) and that would get you closer. But Elasticsearch does not make any linearizability guarantees (e.g. https://aphyr.com/posts/317-jepsen-elasticsearch for examples and https://aphyr.com/posts/313-strong-consistency-models for definitions), and it is not difficult to cook up scenarios in which ES will not be consistent.
That being said, it tends to be consistent most of the time. But in a high-update environment you are going to be putting a lot of GC pressure on your JVM to clean out the old docs. I assume you know how updates work under the hood in ES, but in case you don't, it is also worth reading https://www.elastic.co/guide/en/elasticsearch/reference/current/_updating_documents.html
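For reference, on Elasticsearch 5.x and later the write-consistency knob that answer points at is called wait_for_active_shards. A minimal sketch with the 8.x Python client; the index name, document body, and client version are my assumptions:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed cluster address
# Require all shard copies (primary + replicas) to be active before the write proceeds
es.index(
    index="alarms",                   # hypothetical index
    id="alarm-1",                     # hypothetical document id
    document={"status": "firing"},    # hypothetical document body
    wait_for_active_shards="all",
)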
