Is it possible with Elasticsearch Curator to delete indexes matching a field value?

We have several indexes on our Elasticsearch cluster. They come from a FluentD plugin sending logs from our Docker containers. We would like to delete old indexes not only when they are older than a specific number of days based on the index name, but by applying different delete rules depending on log fields.
Here is an example of log:
{
  "_index": "fluentd-2018.03.28",
  "_type": "fluentd",
  "_id": "o98123bcbd_kqpowkd",
  "_version": 1,
  "_score": null,
  "_source": {
    "container_id": "bbd72ec5e46921ab8896a05684a7672ef113a79e842285d932f",
    "container_name": "/redis-10981239d5",
    "source": "stdout",
    "log": "34:M 28 Mar 15:07:51.086 * 10 changes in 300 seconds. Saving...\r34:M 28 Mar 15:07:51.188 * Background saving terminated with success\r",
    "@timestamp": "2018-03-28T15:07:56.217739954+00:00",
    "@log_name": "docker.redis"
  },
  "fields": {
    "@timestamp": [
      "2018-03-28T15:07:56.217Z"
    ]
  }
}
In this case, we would like to delete all logs matching @log_name = docker.redis that are older than 7 days.
Is it possible to define a Curator action which deletes indices filtered by such a field value?
We tried different filters without any success. The only action we managed to perform successfully is based on the index name:
actions:
  1:
    action: delete_indices
    description: >-
      Delete indices older than 30 days
    options:
      ignore_empty_list: True
      disable_action: True
    filters:
      - filtertype: pattern
        kind: prefix
        value: fluentd-
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'
        unit: days
        unit_count: 30

Curator offers only index-level retention configuration. If you need retention at the document level, you can try a script that executes a delete-by-query.
Otherwise, with Curator, you need to separate your data into different indices in order to apply different retention policies.
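For example, here is a minimal delete-by-query sketch using the official Python client (elasticsearch-py 7.x style); the endpoint, index pattern and 7-day window are assumptions based on the question, and the field names come from the example log above.

# Sketch of a document-level retention script using delete-by-query.
# Endpoint, index pattern and retention window are assumptions; add one
# query per retention rule you need.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # assumed endpoint

# Delete docker.redis log documents older than 7 days across all fluentd-* indices.
es.delete_by_query(
    index="fluentd-*",
    body={
        "query": {
            "bool": {
                "filter": [
                    {"match": {"@log_name": "docker.redis"}},
                    {"range": {"@timestamp": {"lt": "now-7d"}}},
                ]
            }
        }
    },
)

Keep in mind that a delete-by-query is much more expensive than dropping a whole index, so running it once a day (e.g. from cron) is usually enough.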

Related

Conditional indexing not working in ingest node pipelines

I am trying to implement an index template with data stream enabled and then set conditions in ingest node pipelines, so that I could get metrics with the below-mentioned index format:
.ds-metrics-kubernetesnamespace
I had tried this some time back, doing the things mentioned above, and it gave me metrics in that format, but now when I implement the same it does not change anything in my index. I cannot see anything in the OpenShift cluster logs, so ingest seems to be working fine (when I add a doc and test it, it works fine).
PUT _ingest/pipeline/metrics-index
{
  "processors": [
    {
      "set": {
        "field": "_index",
        "value": "metrics-{{kubernetes.namespace}}",
        "if": "ctx.kubernetes?.namespace==\"dev\""
      }
    }
  ]
}
This is the ingest node condition I have used for indexing, and below is my Metricbeat configuration:
metricbeatConfig:
  metricbeat.yml: |
    metricbeat.modules:
      - module: kubernetes
        enabled: true
        metricsets:
          - state_node
          - state_daemonset
          - state_deployment
          - state_replicaset
          - state_statefulset
          - state_pod
          - state_container
          - state_job
          - state_cronjob
          - state_resourcequota
          - state_service
          - state_persistentvolume
          - state_persistentvolumeclaim
          - state_storageclass
          - event
Since you're using Metricbeat, you have another way to do this which is much better.
Simply configure your elasticsearch output like this:
output.elasticsearch:
  hosts: ["http://<host>:<port>"]
  indices:
    - index: "%{[kubernetes.namespace]}"
      mappings:
        dev: "metrics-dev"
      default: "metrics-default"
or like this:
output.elasticsearch:
  hosts: ["http://<host>:<port>"]
  indices:
    - index: "metrics-%{[kubernetes.namespace]}"
      when.equals:
        kubernetes.namespace: "dev"
      default: "metrics-default"
or simply like this would also work if you have plenty of different namespaces and you don't want to manage different mappings:
output.elasticsearch:
  hosts: ["http://<host>:<port>"]
  index: "metrics-%{[kubernetes.namespace]}"
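Whichever variant you use, you can quickly confirm that events are landing in the expected per-namespace indices; a small sketch using requests (the host is an assumption):

# Sanity check (sketch): list the metrics-* indices Metricbeat has created
# to confirm the per-namespace routing works. Host is assumed.
import requests

resp = requests.get("http://localhost:9200/_cat/indices/metrics-*?v&h=index,docs.count")
print(resp.text)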
Steps to create data streams in the Elastic Stack:
Create an ILM policy.
Create an index template whose index pattern matches the index pattern of your metrics/logs (set the number of primary/replica shards and the mapping in the index template).
Set a condition in the ingest pipeline (make sure no index with that name already exists).
If these conditions are met, a data stream is created; logs/metrics get a backing index whose name starts with .ds-, and it is hidden in index management. A sketch of the first two steps follows below.
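Here is a minimal sketch of those first two steps over the REST API; the policy name, template name, index pattern, shard counts and lifecycle thresholds are illustrative assumptions, and data_stream templates require ES 7.9+.

# Sketch of steps 1-2: create an ILM policy and a data-stream index template.
# Names, pattern, shard counts and thresholds are assumptions for illustration.
import requests

ES = "http://localhost:9200"  # assumed endpoint

# 1. ILM policy with a simple rollover + delete lifecycle.
requests.put(f"{ES}/_ilm/policy/metrics-policy", json={
    "policy": {
        "phases": {
            "hot": {"actions": {"rollover": {"max_size": "50gb", "max_age": "1d"}}},
            "delete": {"min_age": "7d", "actions": {"delete": {}}},
        }
    }
})

# 2. Index template matching the indices set by the ingest pipeline; the empty
#    "data_stream" object is what turns matching indices into data streams.
requests.put(f"{ES}/_index_template/metrics-template", json={
    "index_patterns": ["metrics-*"],
    "data_stream": {},
    "template": {
        "settings": {
            "number_of_shards": 1,
            "number_of_replicas": 1,
            "index.lifecycle.name": "metrics-policy",
        }
    }
})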
In my case the issue was that I did not have enough permissions to create a custom index. When I checked the OpenShift logs I found that Metricbeat was complaining about the missing privilege, so I granted superuser permission and then used an ingest pipeline to set up the conditional indexing:
PUT _ingest/pipeline/metrics-index
{
  "processors": [
    {
      "set": {
        "field": "_index",
        "value": "metrics-{{kubernetes.namespace}}",
        "if": "ctx.kubernetes?.namespace==\"dev\""
      }
    }
  ]
}
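The condition can be sanity-checked with the ingest simulate API before relying on it; a small sketch (the host and sample documents are assumptions):

# Sketch: test the conditional set processor with the ingest simulate API.
# The sample documents are made up; only kubernetes.namespace matters here.
import requests

ES = "http://localhost:9200"  # assumed endpoint

resp = requests.post(f"{ES}/_ingest/pipeline/metrics-index/_simulate", json={
    "docs": [
        {"_source": {"kubernetes": {"namespace": "dev"}}},   # expected to be routed to metrics-dev
        {"_source": {"kubernetes": {"namespace": "prod"}}},  # expected to keep its original index
    ]
})
print(resp.json())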

Does the Elasticsearch Curator rollover action not support date math in the name?

I'm trying to use date math in the Elasticsearch Curator rollover action, but it seems it doesn't support an alias name built with date math such as '<indexname-{now/d}>'.
---
# Remember, leave a key empty if there is no value. None will be a string,
# not a Python "NoneType"
#
# Also remember that all examples have 'disable_action' set to True. If you
# want to use this action as a template, be sure to set this to False after
# copying it.
actions:
  1:
    action: rollover
    description: >-
      Rollover the index associated with alias 'indexname-{now/d}', index should be
      in the format of indexname-{now/d}-000001.
    options:
      disable_action: False
      name: '<indexname-{now/d}>'
      conditions:
        max_age: 1d
        max_docs: 1000000
        max_size: 50g
      extra_settings:
        index.number_of_shards: 3
        index.number_of_replicas: 1
It takes that name '<indexname-{now/d}>' as a literal string/alias name and gives this error:
Failed to complete action: rollover. <class 'ValueError'>: Unable to perform index rollover with alias "<indexname-{now/d}>".
I would suggest adding support for date math in the alias name for the rollover action in Elasticsearch Curator.
What it appears you are trying to do is roll over an alias named indexname-2021.10.28. Is that correct? I mention this because the name directive is for the alias name rather than the index name. Additionally, this pattern would look for an alias carrying today's date ({now/d}), while the rollover conditions are looking for something older than 1 day (or over 1M docs, or over 50g). An alias named with today's date has most likely not been created yet, so the lookup will fail.
I presume you are more likely looking for an alias with a plain name like indexname that points to indices that look like indexname-YYYY.MM.dd. Did you know that this behavior is automatic if the original index and alias combination are created with date math?
For example, if I had created this index + alias combination yesterday (and it's URLencoded for use in the dev tools console):
# PUT <my-index-{now/d}-000001>
PUT %3Cmy-index-%7Bnow%2Fd%7D-000001%3E
{
  "aliases": {
    "my-index": {
      "is_write_index": true
    }
  }
}
The results would say:
{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "my-index-2021.10.27-000001"
}
And if I forced a rollover today:
POST my-index/_rollover
{
  "conditions": {
    "max_age": "1d"
  }
}
This is the resulting output:
{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "old_index" : "my-index-2021.10.27-000001",
  "new_index" : "my-index-2021.10.28-000002",
  "rolled_over" : true,
  "dry_run" : false,
  "conditions" : {
    "[max_age: 1d]" : true
  }
}
With this behavior, it's very simple to get a date in the index name while still using default rollover behavior.
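If you create that initial index + alias combination outside the dev tools console, you only need to URL-encode the date-math name yourself; a small sketch in Python (the host is an assumption):

# Sketch: create the initial date-math index and write alias by URL-encoding
# the name (same request as the PUT shown above). Host is assumed.
import requests
from urllib.parse import quote

ES = "http://localhost:9200"  # assumed endpoint

name = quote("<my-index-{now/d}-000001>", safe="")  # -> %3Cmy-index-%7Bnow%2Fd%7D-000001%3E
requests.put(f"{ES}/{name}", json={
    "aliases": {"my-index": {"is_write_index": True}}
})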

ElasticSearch BulkShardRequest failed due to org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor

I am storing logs into Elasticsearch from my reactive Spring application. I am getting the following error in Elasticsearch:
Elasticsearch exception [type=es_rejected_execution_exception, reason=rejected execution of processing of [129010665][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[logs-dev-2020.11.05][1]] containing [index {[logs-dev-2020.11.05][_doc][0d1478f0-6367-4228-9553-7d16d2993bc2], source[n/a, actual length: [4.1kb], max length: 2kb]}] and a refresh, target allocation id: WwkZtUbPSAapC3C-Jg2z2g, primary term: 1 on EsThreadPoolExecutor[name = 10-110-23-125-common-elasticsearch-apps-dev-v1/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@6599247a[Running, pool size = 2, active threads = 2, queued tasks = 221, completed tasks = 689547]]]
My index settings:
{
  "logs-dev-2020.11.05": {
    "settings": {
      "index": {
        "highlight": {
          "max_analyzed_offset": "5000000"
        },
        "number_of_shards": "3",
        "provided_name": "logs-dev-2020.11.05",
        "creation_date": "1604558592095",
        "number_of_replicas": "2",
        "uuid": "wjIOSfZOSLyBFTt1cT-whQ",
        "version": {
          "created": "7020199"
        }
      }
    }
  }
}
I have gone through this article:
https://www.elastic.co/blog/why-am-i-seeing-bulk-rejections-in-my-elasticsearch-cluster
I thought adjusting the "write" thread pool queue size would resolve this, but the article advises against it:
Adjusting the queue sizes is therefore strongly discouraged, as it is like putting a temporary band-aid on the problem rather than actually fixing the underlying issue.
So what else can we do to improve the situation?
Other info:
Elasticsearch version 7.2.1
Cluster health is good and there are 3 nodes in the cluster
Indices are created on a daily basis, with 3 shards per index
While you are right that increasing the thread pool queue size is not a permanent solution, you will be glad to know that Elasticsearch itself increased the write thread pool queue size (used by your bulk requests) from 200 to 10k in just a minor version upgrade; compare the queue size of 200 in ES 7.8 with 10k in ES 7.9.
If you are on a 7.x version, you can likewise increase the queue size, if not to 10k then at least to 1k, to avoid rejecting requests.
If you want a proper fix, you need to do the following:
Find out whether the rejections are constant or just a short-duration burst of write requests that clears up after a while (see the sketch below).
If they are constant, check that all the usual write optimizations are in place; please refer to my short tips to improve indexing speed.
Check whether you have reached the full capacity of your data nodes, and if so, scale your cluster to handle the increased (legitimate) load.
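To tell a short burst from a sustained overload (the first point above), the write thread pool can be polled per node; a small sketch, where the host and polling interval are assumptions:

# Sketch: watch write thread-pool queue depth and rejections per node to tell
# a short burst from a sustained overload. Host and interval are assumed.
import time

import requests

ES = "http://localhost:9200"  # assumed endpoint

for _ in range(10):
    resp = requests.get(f"{ES}/_cat/thread_pool/write?v&h=node_name,active,queue,rejected")
    print(resp.text)
    time.sleep(30)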

Fluent-bit - Splitting json log into structured fields in Elasticsearch

I am trying to find a way in the Fluent Bit config to tell/force ES to store plain JSON-formatted logs (the log field below, which comes from Docker stdout/stderr) in a structured way. For example, apart from (or along with) storing the log as a plain JSON string under the log field, I would like to store each of its properties as a separate field.
The documentation for filters and parsers is really sparse and unclear. On top of that, the forward input doesn't have a "parser" option. I tried the json/docker/regex parsers but with no luck. My regex is here if I have to use regex. Currently using ES (7.1), Fluent Bit (1.1.3) and Kibana (7.1), not Kubernetes.
If anyone can point me to an example or provide one, it would be much appreciated.
Thanks
{
  "_index": "hello",
  "_type": "logs",
  "_id": "T631e2sBChSKEuJw-HO4",
  "_version": 1,
  "_score": null,
  "_source": {
    "@timestamp": "2019-06-21T21:34:02.000Z",
    "tag": "php",
    "container_id": "53154cf4d4e8d7ecf31bdb6bc4a25fdf2f37156edc6b859ba0ddfa9c0ab1715b",
    "container_name": "/hello_php_1",
    "source": "stderr",
    "log": "{\"time_local\":\"2019-06-21T21:34:02+0000\",\"client_ip\":\"-\",\"remote_addr\":\"192.168.192.3\",\"remote_user\":\"\",\"request\":\"GET / HTTP/1.1\",\"status\":\"200\",\"body_bytes_sent\":\"0\",\"request_time\":\"0.001\",\"http_referrer\":\"-\",\"http_user_agent\":\"curl/7.38.0\",\"request_id\":\"91835d61520d289952b7e9b8f658e64f\"}"
  },
  "fields": {
    "@timestamp": [
      "2019-06-21T21:34:02.000Z"
    ]
  },
  "sort": [
    1561152842000
  ]
}
Here is my current conf:
[SERVICE]
    Flush        5
    Daemon       Off
    Log_Level    debug
    Parsers_File parsers.conf

[INPUT]
    Name   forward
    Listen 0.0.0.0
    Port   24224

[OUTPUT]
    Name            es
    Match           hello_*
    Host            elasticsearch
    Port            9200
    Index           hello
    Type            logs
    Include_Tag_Key On
    Tag_Key         tag
Solution is as follows.
[SERVICE]
    Flush        5
    Daemon       Off
    Log_Level    debug
    Parsers_File parsers.conf

[INPUT]
    Name         forward
    storage.type filesystem
    Listen       my_fluent_bit_service
    Port         24224

[FILTER]
    Name         parser
    Parser       docker
    Match        hello_*
    Key_Name     log
    Reserve_Data On
    Preserve_Key On

[OUTPUT]
    Name            es
    Host            my_elasticsearch_service
    Port            9200
    Match           hello_*
    Index           hello
    Type            logs
    Include_Tag_Key On
    Tag_Key         tag

[PARSER]
    Name        docker
    Format      json
    Time_Key    time
    Time_Format %Y-%m-%dT%H:%M:%S.%L
    Time_Keep   On
    # Command       | Decoder      | Field | Optional Action
    # ==============|==============|=======|================
    Decode_Field_As   escaped_utf8   log     do_next
    Decode_Field_As   json           log
You can use the Fluent Bit Nest filter for that purpose, please refer to the following documentation:
https://docs.fluentbit.io/manual/filter/nest

Finding out on which data path shard is located in Elasticsearch

I have multiple path.data directories configured for my Elasticsearch cluster.
The official documentation states that only a single path is used for a single shard, so a shard is never split across multiple paths.
I'd like to find a way of finding out which path on which node is used for a specific shard (primary or replica), e.g. index my-index, primary shard 0 → node RQzJvAgLTDOnEnmIjYU9FA, path /mnt/data1. I tried /_nodes, /_stats, /_segments, /_shard_stores, but there are no references to paths in any of them.
You can find that info using the indices stats API by specifying the level=shards parameter
GET index/_stats?level=shards
will return a structure like this
"indices": {
  "listings-master": {
    "primaries": {
      ...
    },
    "total": {
      ...
    },
    "shards": {
      "0": [
        {
          "shard_path": {
            "state_path": "/app/data/nodes/0",
            "data_path": "/app/data/nodes/0",
            "is_custom_data_path": false
          },
          ...
        }
      ...
Not easily, but with a small Python script I got the info I want. Here is the script:
import json

# Load the output of GET _all/_stats?level=shards previously saved to a file
with open('shard.json') as json_file:
    data = json.load(json_file)

indices = data['indices']
for indice in indices:
    shards = indices[indice]['shards']
    for nshard, shard in shards.items():
        # Each shard number maps to a list of shard copies (primary + replicas)
        for elt in shard:
            path = elt['shard_path']['data_path']
            node = elt['routing']['node']
            print(indice, '\t', nshard, '\t', node, '\t', path)
Then you obtain output like:
log-2020.11.06 1 oxx /datassd/elasticsearch/nodes/0
log-2020.11.06 0 oxx /datassd/elasticsearch/nodes/0
log-2020.11.05 1 oxx /datassd/elasticsearch/nodes/0
