Conditional indexing not working in ingest node pipelines - elasticsearch

Am trying to implement an index template with datastream enabled and then set contains in ingest node pipelines. So that I could get metrics with below-mentioned index format :
.ds-metrics-kubernetesnamespace
I had tried this sometime back and I did these things as mentioned above and it was giving metrics in such format but now when I implement the same it's not changing anything in my index. I cannot see any logs in openshift cluster so ingest seems to be working fine(when I add a doc and test it works fine)
PUT _ingest/pipeline/metrics-index
{
"processors": [
{
"set": {
"field": "_index",
"value": "metrics-{{kubernetes.namespace}}",
"if": "ctx.kubernetes?.namespace==\"dev\""
}
}
]
}
This is the ingest node condition I have used for indexing.
metricbeatConfig:
metricbeat.yml: |
metricbeat.modules:
- module: kubernetes
enabled: true
metricsets:
- state_node
- state_daemonset
- state_deployment
- state_replicaset
- state_statefulset
- state_pod
- state_container
- state_job
- state_cronjob
- state_resourcequota
- state_service
- state_persistentvolume
- state_persistentvolumeclaim
- state_storageclass
- event

Since you're using Metricbeat, you have another way to do this which is much better.
Simply configure your elasticsearch output like this:
output.elasticsearch:
hosts: ["http://<host>:<port>"]
indices:
- index: "%{[kubernetes.namespace]}"
mappings:
dev: "metrics-dev"
default: "metrics-default"
or like this:
output.elasticsearch:
hosts: ["http://<host>:<port>"]
indices:
- index: "metrics-%{[kubernetes.namespace]}"
when.equals:
kubernetes.namespace: "dev"
default: "metrics-default"
or simply like this would also work if you have plenty of different namespaces and you don't want to manage different mappings:
output.elasticsearch:
hosts: ["http://<host>:<port>"]
index: "metrics-%{[kubernetes.namespace]}"

Steps to create datastreams in elastic stack:
create an ILM policy
Create an index template that has an index pattern that matches with the index pattern of metrics/logs.(Set number of primary shards/replica shards and mapping in index template)
Set a condition in ingest pipeline.(Make sure no such index exist)
If these conditions meet it will create a data stream and logs/metrics would have an index starting with .ds- and it will be hidden in index management.
In my case the issue was I did not have enough permission to create a custom index. When I checked my OpenShift logs I could find metricbeat was complaining about the privilege. So I gave Superuser permission and then used ingest node to set conditional indexing
PUT _ingest/pipeline/metrics-index
{
"processors": [
{
"set": {
"field": "_index",
"value": "metrics-{{kubernetes.namespace}}",
"if": "ctx.kubernetes?.namespace==\"dev\""
}
}
]
}

Related

Filtering Filebeat input with or without Logstash

In our current setup we use Filebeat to ship logs to an Elasticsearch instance. The application logs are in JSON format and it runs in AWS.
For some reason AWS decided to prefix the log lines in a new platform release, and now the log parsing doesn't work.
Apr 17 06:33:32 ip-172-31-35-113 web: {"#timestamp":"2020-04-17T06:33:32.691Z","#version":"1","message":"Tomcat started on port(s): 5000 (http) with context path ''","logger_name":"org.springframework.boot.web.embedded.tomcat.TomcatWebServer","thread_name":"main","level":"INFO","level_value":20000}
Before it was simply:
{"#timestamp":"2020-04-17T06:33:32.691Z","#version":"1","message":"Tomcat started on port(s): 5000 (http) with context path ''","logger_name":"org.springframework.boot.web.embedded.tomcat.TomcatWebServer","thread_name":"main","level":"INFO","level_value":20000}
The question would be whether we can avoid using Logstash to convert the log lines into the old format? If not, how do I drop the prefix? Which filter is the best choice for this?
My current Filebeat configuration looks like this:
filebeat.inputs:
- type: log
paths:
- /var/log/web-1.log
json.keys_under_root: true
json.ignore_decoding_error: true
json.overwrite_keys: true
fields_under_root: true
fields:
environment: ${ENV_NAME:not_set}
app: myapp
cloud.id: "${ELASTIC_CLOUD_ID:not_set}"
cloud.auth: "${ELASTIC_CLOUD_AUTH:not_set}"
I would try to leverage the dissect and decode_json_fields processors:
processors:
# first ignore the preamble and only keep the JSON data
- dissect:
tokenizer: "%{?ignore} %{+ignore} %{+ignore} %{+ignore} %{+ignore}: %{json}"
field: "message"
target_prefix: ""
# then parse the JSON data
- decode_json_fields:
fields: ["json"]
process_array: false
max_depth: 1
target: ""
overwrite_keys: false
add_error_key: true
There is a plugin in Logstash called JSON filter that includes all the raw log line in a field called "message" (for instance).
filter {
json {
source => "message"
}
}
If you do not want to include the beginning part of the line, use the dissect filter in Logstash. It would be something like this:
filter {
dissect {
mapping => {
"message" => "%{}: %{message_without_prefix}"
}
}
}
Maybe in Filebeat there are these two features available as well. But in my experience, I prefer working with Logstash when parsing/manipulating logging data.

Can filebeat convert log lines output to json without logstash in pipeline?

We have standard log lines in our Spring Boot web applications (non json).
We need to centralize our logging and ship them to an elastic search as json.
(I've heard the later versions can do some transformation)
Can Filebeat read the log lines and wrap them as a json ? i guess it could append some meta data aswell. no need to parse the log line.
expected output :
{timestamp : "", beat: "", message: "the log line..."}
i have no code to show unfortunately.
filebeat supports several outputs including Elastic Search.
Config file filebeat.yml can look like this:
# filebeat options: https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-reference-yml.html
filebeat.inputs:
- type: log
enabled: true
paths:
- /var/log/../file.err.log
processors:
- drop_fields:
# Prevent fail of Logstash (https://www.elastic.co/guide/en/beats/libbeat/current/breaking-changes-6.3.html#custom-template-non-versioned-indices)
fields: ["host"]
- dissect:
# tokenizer syntax: https://www.elastic.co/guide/en/logstash/current/plugins-filters-dissect.html.
tokenizer: "%{} %{} [%{}] {%{}} <%{level}> %{message}"
field: "message"
target_prefix: "spring boot"
fields:
log_type: spring_boot
output.elasticsearch:
hosts: ["https://localhost:9200"]
username: "filebeat_internal"
password: "YOUR_PASSWORD"
Well it seems to do it by default. this is my result when i tried it locally to read log lines. it wraps it exactly like i wanted.
{
"#timestamp":"2019-06-12T11:11:49.094Z",
"#metadata":{
"beat":"filebeat",
"type":"doc",
"version":"6.2.4"
},
"message":"the log line...",
"source":"/Users/myusername/tmp/hej.log",
"offset":721,
"prospector":{
"type":"log"
},
"beat":{
"name":"my-macbook.local",
"hostname":"my-macbook.local",
"version":"6.2.4"
}
}

Finding out on which data path shard is located in Elasticsearch

I have multiple path.datas configured for my Elasticsearch cluster.
The official documentation states that only a single path is used for a single shard, so it's never splitted across multiple paths.
I'd like to find a way to finding out which path on which node is used for some specific shard (primary or replica), like index my-index primary shard 0 → node RQzJvAgLTDOnEnmIjYU9FA path /mnt/data1. Tried /_nodes, /_stats, /_segments, /_shard_stores, but there are no any references to paths.
You can find that info using the indices stats API by specifying the level=shards parameter
GET index/_stats?level=shards
will return a structure like this
"indices": {
"listings-master": {
"primaries": {
...
},
"total": {
...
},
"shards": {
"0": [
{
"shard_path": {
"state_path": "/app/data/nodes/0",
"data_path": "/app/data/nodes/0",
"is_custom_data_path": false
},
...
}
...
Not easily but but by doing a small python script I've the info I want, here the script
import json
with open('shard.json') as json_file:
data = json.load(json_file)
print(data.keys())
data=data['indices']
for indice in data:
#print(indice)
d1=data[indice]
shards=d1['shards']
#print(shards,type(shards),shards.keys())
for nshard in shards.keys():
shard=shards[nshard]
#print(shard,type(shard))
for elt in shard:
path=elt['shard_path']['data_path']
node=elt['routing']['node']
#print(repr(elt['shard_path']['data_path']))
#print("=========================")
print(indice,'\t',nshard,'\t',node,'\t',path)
They you obtain stuff like
log-2020.11.06 1 oxx /datassd/elasticsearch/nodes/0
log-2020.11.06 0 oxx /datassd/elasticsearch/nodes/0
log-2020.11.05 1 oxx /datassd/elasticsearch/nodes/0

Is it possible with Elastic Curator to delete indexes matching field value

We have on our elasticsearch several indexes. They come from FluentD pluging sendings logs fron our docker containers. We would like to delete old indexes not only older than specific amount of days based on index name but applying different delete rules depending on log fields.
Here is an example of log:
{
"_index": "fluentd-2018.03.28",
"_type": "fluentd",
"_id": "o98123bcbd_kqpowkd",
"_version": 1,
"_score": null,
"_source": {
"container_id": "bbd72ec5e46921ab8896a05684a7672ef113a79e842285d932f",
"container_name": "/redis-10981239d5",
"source": "stdout",
"log": "34:M 28 Mar 15:07:51.086 * 10 changes in 300 seconds. Saving...\r34:M 28 Mar 15:07:51.188 * Background saving terminated with success\r",
"#timestamp": "2018-03-28T15:07:56.217739954+00:00",
"#log_name": "docker.redis"
},
"fields": {
"#timestamp": [
"2018-03-28T15:07:56.217Z"
]
}
}
In that case, we would like to delete all logs matching #log_name = docker.redis older than 7 days.
Is it possible to define a Curator action which deletes indexes filtered by such a field value?
We tried different filtering without any success. The only action we manage to perform successfully is based on index name:
actions:
1:
action: delete_indices
description: >-
Delete indices older than 30 days
options:
ignore_empty_list: True
disable_action: True
filters:
- filtertype: pattern
kind: prefix
value: fluentd-
- filtertype: age
source: name
direction: older
timestring: '%Y.%m.%d'
unit: days
unit_count: 30
Curator offer only an index level retention configuration. If you need a retention based on document level, you can try with a script that execute a delete by query.
Otherwise, using curator, you need to separate your data in different indexes for applying different retention.

Elasticsearch best_compression is not working

I am parsing Apache access log from Logstash and indexing it into a Elasticsearch index. I have also indexed geoip and agent fields.. While indexing I observed elasticsearch index size is 6.7x bigger than the actual file size (space on disk). So I just want to understand this is the correct behavior or I am doing something wrong here? I am using Elasticsearch 5.0, Logstash 5.0 and Kibana 5.0 version. I also tried best_compression but it's taking same disk size. Here is the complete observation with configuration file I tried so far.
My Observations:
Use Case 1:
Logstash Conf
Template File
Apache Log file Size : 211 MB
Total number of lines: 1,000,000
Index Size: 1.5 GB
Observation: Index is 6.7x bigger than the file size.
Use Case 2:
Logstash Conf
Template File
I have found a few solutions to compress elasticsearch index, then I tried it as well.
- Disable `_all` fields
- Remove unwanted fields that has been created by `geoip` and `agent` parsing.
- Enable `best_compression` [ index.codec": "best_compression"]
Apache Log file Size : 211 MB
Total number of lines: 1,000,000
Index Size: 1.3 GB
Observation: Index is 6.16x bigger than the file size
Log File Format:
127.0.0.1 - - [24/Nov/2016:02:03:08 -0800] "GET /wp-admin HTTP/1.0" 200 4916 "http://trujillo-carpenter.com/" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 5.01; Trident/5.1)"
I found Logstash + Elasticsearch Storage Experients they are saying they have reduced index size from 6.23x to 1.57x. But that is pretty old solutions and these solution are no more working in Elasticsearch 5.0.
Some more reference I have already tried:
- Part 2.0: The true story behind Elasticsearch storage requirements
- https://github.com/elastic/elk-index-size-tests
Is there any better way to optimize the Elasticseach index size when your purpose is only show the visualization on Kibana?
I was facing this issue due to index settings were not applied to the index. My index name and template name were different. After using the same template name and index name compression is applied properly.
In the below example I was using index name apache_access_logs and template name elk_workshop.
Sharing corrected template and logstash configuration.
Logstash.conf
output {
elasticsearch {
hosts => ["localhost:9200"]
index => "apache_access_logs"
template => "apache_sizing_2.json"
template_name => "apache_access_logs" /* it was elk_workshop */
template_overwrite => true
}
}
Template:
{
"template": "apache_access_logs", /* it was elk_workshop */
"settings": {
"index.refresh_interval": "5s",
"index.refresh_interval": "30s",
"number_of_shards": 5,
"number_of_replicas": 0
},
..
}
Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-templates.html#indices-templates

Resources