Filebeat - Logstash - Multiple Config Files - Duplicate data - elasticsearch

I am new to Logstash and Filebeat. I am trying to set up multiple config files for my Logstash instance, using Filebeat to send data to Logstash. Even though I have filters defined in both Logstash config files, I am getting duplicate data.
Logstash config file - 1:
input {
  beats {
    port => 5045
  }
}

filter {
  if [fields][env] == "prod" {
    grok {
      match => { "message" => "%{LOGLEVEL:loglevel}] %{GREEDYDATA:message}$" }
      overwrite => [ "message" ]
    }
  }
}

output {
  stdout {
    codec => rubydebug
  }
  elasticsearch {
    hosts => ["https://172.17.0.2:9200"]
    index => "logstash-myapp-%{+YYYY.MM.dd}"
    user => "elastic"
    password => "password"
    ssl => true
    cacert => "/usr/share/logstash/certs/http_ca.crt"
  }
}
Logstash config file - 2:
input {
  beats {
    port => 5044
  }
}

filter {
  if [fields][env] == "dev" {
    grok {
      match => { "message" => "%{LOGLEVEL:loglevel}] %{GREEDYDATA:message}$" }
      overwrite => [ "message" ]
    }
  }
}

output {
  stdout {
    codec => rubydebug
  }
  elasticsearch {
    hosts => ["https://172.17.0.2:9200"]
    index => "logstash-myapp-%{+YYYY.MM.dd}"
    user => "elastic"
    password => "password"
    ssl => true
    cacert => "/usr/share/logstash/certs/http_ca.crt"
  }
}
Logfile Content:
[INFO] First Line
[INFO] Second Line
[INFO] Third Line
Filebeat config:
filebeat.inputs:
  - type: filestream
    enabled: true
    paths:
      - /root/data/logs/*.log
    fields:
      app: test
      env: dev

output.logstash:
  # The Logstash hosts
  hosts: ["172.17.0.4:5044"]
I know that even with multiple config files, Logstash runs every event through all the filters present in all the config files. That is why we have guarded the filter in each config file with a condition on "fields.env".
I am expecting 3 lines to be sent to Elasticsearch because "fields.env" is "dev", but 6 lines are being sent, i.e. the data is duplicated.
Please help.

The problem is that your two configuration files get merged: not only the filters but also the outputs.
So each log line that makes it into the pipeline through either input will go through all the filters (subject to their conditions, of course) and all the outputs (which in your configuration are not wrapped in any condition).
So the first log line, [INFO] First Line, coming in on port 5044 will only go through the filter guarded by [fields][env] == "dev", but it will then go through each of the two outputs, which is why it ends up twice in your Elasticsearch.
The easy solution is to remove the output section from one of the configuration files, so that log lines only go through a single output.
The better solution is to create separate pipelines, as sketched below.
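For instance, a minimal sketch of a pipelines.yml for this setup, assuming the two config files are saved as /etc/logstash/conf.d/prod.conf (port 5045) and /etc/logstash/conf.d/dev.conf (port 5044); the paths are placeholders, adjust them to your installation:

# config/pipelines.yml: each pipeline keeps its own inputs, filters and outputs
- pipeline.id: prod
  path.config: "/etc/logstash/conf.d/prod.conf"
- pipeline.id: dev
  path.config: "/etc/logstash/conf.d/dev.conf"

With separate pipelines, an event received on port 5044 only ever sees the filters and the single output of the dev pipeline, so nothing is indexed twice.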

Related

Defining multiple outputs in Logstash whilst handling potential unavailability of an Elasticsearch instance

I have two outputs configured for Logstash as I need the data to be delivered to two separate Elasticsearch nodes in different locations.
A snippet of the configuration is below (redacted where required):
output {
  elasticsearch {
    hosts => [ "https://host1.local:9200" ]
    cacert => '/etc/logstash/config/certs/ca.crt'
    user => XXXXX
    password => XXXXX
    index => "%{[@metadata][beat]}-%{[@metadata][version]}-%{+YYYY.MM.dd}"
  }
}

output {
  elasticsearch {
    hosts => [ "https://host2.local:9200" ]
    cacert => '/etc/logstash/config/certs/ca.crt'
    user => XXXXX
    password => XXXXX
    index => "%{[@metadata][beat]}-%{[@metadata][version]}-%{+YYYY.MM.dd}"
  }
}
During testing I've noticed that if one of the ES instances, host1.local or host2.local is unavailable, Logstash fails to process/deliver data to the other, even though it's available.
Is there a modification I can make to the configuration that will allow data to be delivered to the available Elasticsearch instance, even if the other dies?
Logstash has an at-least-once delivery model. If persistent queues are not enabled, data can be lost across a restart, but otherwise Logstash will deliver events to all of the outputs at least once. As a result, if one output becomes unreachable, the queue (either in-memory or persistent) backs up and blocks processing. You can use persistent queues and pipeline-to-pipeline communication with an output isolator pattern to avoid stalling one output when another is unavailable.
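For reference, a minimal sketch of enabling persistent queues globally in logstash.yml (the size shown is just an illustrative value, not a recommendation):

# logstash.yml
queue.type: persisted   # buffer events on disk instead of only in memory
queue.max_bytes: 4gb    # illustrative cap; once the queue is full, inputs are back-pressured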
To follow up on @Badger's answer, you will need to use the output isolator pattern they suggested. As you observed, when one output is blocked, it prevents the other outputs from functioning. This is also noted in the docs:
Logstash, by default, is blocked when any single output is down.
This solution can be used for any output and is not unique to Elasticsearch. Here is a code sample of how it would work for your example.
# config/pipelines.yml
- pipeline.id: intake
  config.string: |
    input { ... }
    output { pipeline { send_to => [es-host1, es-host2] } }
- pipeline.id: buffered-es-host1
  queue.type: persisted
  config.string: |
    input { pipeline { address => es-host1 } }
    output {
      elasticsearch {
        hosts => [ "https://host1.local:9200" ]
        cacert => '/etc/logstash/config/certs/ca.crt'
        user => XXXXX
        password => XXXXX
        index => "%{[@metadata][beat]}-%{[@metadata][version]}-%{+YYYY.MM.dd}"
      }
    }
- pipeline.id: buffered-es-host2
  queue.type: persisted
  config.string: |
    input { pipeline { address => es-host2 } }
    output {
      elasticsearch {
        hosts => [ "https://host2.local:9200" ]
        cacert => '/etc/logstash/config/certs/ca.crt'
        user => XXXXX
        password => XXXXX
        index => "%{[@metadata][beat]}-%{[@metadata][version]}-%{+YYYY.MM.dd}"
      }
    }
You can follow this document for running multiple pipelines; note that the path.settings parameter has to point to the directory that contains the new pipelines.yml file, e.g.:
bin/logstash --path.settings config

Logstash and filebeat set event.dataset value

I have a configuration in which Filebeat fetches logs from some files (using a custom format) and sends those logs to a Logstash instance.
In Logstash I apply a grok filter in order to split some of the fields, and then I send the output to my Elasticsearch instance.
The pipeline works fine and the data is correctly loaded into Elasticsearch, but no event data is present (such as event.dataset or event.module). So I'm looking for the piece of code that would add such information to my events.
Here is my filebeat configuration:
filebeat.config:
  modules:
    path: ${path.config}/modules.d/*.yml
    reload.enabled: false

filebeat.inputs:
  - type: log
    paths:
      - /var/log/*/info.log
      - /var/log/*/warning.log
      - /var/log/*/error.log

output.logstash:
  hosts: '${ELK_HOST:logstash}:5044'
Here is my logstash pipeline:
input {
  beats {
    port => 5044
  }
}

filter {
  grok {
    match => { "message" => "MY PATTERN" }
  }
  mutate {
    add_field => { "logLevelLower" => "%{logLevel}" }
  }
  mutate {
    lowercase => [ "logLevelLower" ]
  }
}

output {
  elasticsearch {
    hosts => "elasticsearch:9200"
    user => "USER"
    password => "PASSWORD"
    index => "%{[@metadata][beat]}-%{logLevelLower}-%{[@metadata][version]}"
  }
}
You can do it like this easily with a mutate/add_field filter:
filter {
  mutate {
    add_field => {
      "[ecs][version]" => "1.5.0"
      "[event][kind]" => "event"
      "[event][category]" => "host"
      "[event][type]" => ["info"]
      "[event][dataset]" => "module.dataset"
    }
  }
}
The Elastic Common Schema documentation explains how to pick values for kind, category and type.
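Alternatively, the same values can be attached on the Filebeat side with the fields option; a minimal sketch, where the dataset name is only a placeholder:

filebeat.inputs:
  - type: log
    paths:
      - /var/log/*/info.log
    fields:
      event.dataset: "myapp.logs"   # placeholder, choose your own module.dataset value
    fields_under_root: true         # place the field at the event root instead of under "fields"

Setting it in Filebeat keeps the metadata close to the source, while the mutate filter above keeps it centralized in Logstash; either works.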

Ship filebeat logs to logstash to index with docker metadata

I am trying to index into Elasticsearch with the help of Filebeat and Logstash. Here is the filebeat.yml:
filebeat.inputs:
  - type: docker
    combine_partial: true
    containers:
      path: "/usr/share/dockerlogs/data"
      stream: "stdout"
      ids:
        - "*"
    exclude_files: ['\.gz$']
    ignore_older: 10m

processors:
  # decode the log field (sub JSON document) if JSON encoded, then map its fields to elasticsearch fields
  - decode_json_fields:
      fields: ["log", "message"]
      target: ""
      # overwrite existing target elasticsearch fields while decoding json fields
      overwrite_keys: true
  - add_docker_metadata:
      host: "unix:///var/run/docker.sock"

filebeat.config.modules:
  path: ${path.config}/modules.d/*.yml
  reload.enabled: false

# setup filebeat to send output to logstash
output.logstash:
  hosts: ["xxx.xx.xx.xx:5044"]

# Write Filebeat's own logs only to file to avoid catching them with itself in docker log files
logging.level: info
logging.to_files: false
logging.to_syslog: false
logging.metrics.enabled: false
logging.files:
  path: /var/log/filebeat
  name: filebeat
  keepfiles: 7
  permissions: 0644

ssl.verification_mode: none
And here is the logstash.conf:
input {
  beats {
    port => 5044
    host => "0.0.0.0"
  }
}

output {
  stdout {
    codec => dots
  }
  elasticsearch {
    hosts => "http://xxx.xx.xx.x:9200"
    index => "%{[docker][container][labels][com][docker][swarm][service][name]}-%{+xxxx.ww}"
  }
}
I am trying to index with the docker name so it would be more readable and clearer than the usual pattern we see all the time, like "filebeat-xxxxxx.some-date".
I tried several things:
- index => "%{[docker][container][labels][com][docker][swarm][service][name]}-%{+xxxx.ww}"
- index => "%{[docker][container][labels][com][docker][swarm][service][name]}-%{+YYYY.MM}"
- index => "%{[docker][swarm][service][name]}-%{+xxxx.ww}"
But nothing worked. What am I doing wrong? Maybe I am missing something in the filebeat.yml file; it could be that too.
Thanks for any help or any lead.
Looks like you're unsure of what docker metadata fields are being added. It might be a good idea to just get successful indexing first with the default index name (ex. "filebeat-xxxxxx.some-date" or whatever) and then view the log events to see the format of your docker metadata fields.
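One hedged way to do that inspection (the host and index name here are placeholders) is to temporarily keep a default-style index and print each event with the rubydebug codec:

output {
  # temporary: print the full event so the docker.* fields added by add_docker_metadata are visible
  stdout {
    codec => rubydebug
  }
  elasticsearch {
    hosts => "http://xxx.xx.xx.x:9200"
    index => "filebeat-debug-%{+YYYY.MM.dd}"   # placeholder default-style index name
  }
}

Once the field layout is visible, the exact path to the swarm service name label can be copied into the index setting.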
I don't have the same setup as you, but for reference, I'm on AWS ECS so the format of my docker fields are:
"docker": {
"container": {
"name": "",
"labels": {
"com": {
"amazonaws": {
"ecs": {
"cluster": "",
"container-name": "",
"task-definition-family": "",
"task-arn": "",
"task-definition-version": ""
}
}
}
},
"image": "",
"id": ""
}
}
After seeing the format and fields available, I was able to add a custom "application_name" field using the above. This field is generated in my input plugin, which is redis in my case, but all input plugins should have the add_field option (https://www.elastic.co/guide/en/logstash/current/plugins-inputs-beats.html#plugins-inputs-beats-add_field):
input {
  redis {
    host => "***"
    data_type => "list"
    key => "***"
    codec => json
    add_field => {
      "application_name" => "%{[docker][container][labels][com][amazonaws][ecs][task-definition-family]}"
    }
  }
}
After getting this new custom field, I was able to run specific filters (grok, json, kv, etc.) for different "application_name" values, as they had different log formats, but the important part for you is that you can use it in your output to Elasticsearch for index names:
output {
  elasticsearch {
    user => ***
    password => ***
    hosts => [ "***" ]
    index => "logstash-%{application_name}-%{+YYYY.MM.dd}"
  }
}
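One caveat: if an event arrives without that docker label, Logstash leaves the sprintf reference unexpanded and the literal %{...} string ends up in the index name. A small hedged guard in the filter section could look like this (the fallback name is a placeholder):

filter {
  if ![docker][container][labels][com][amazonaws][ecs][task-definition-family] {
    mutate {
      # the label was missing, so application_name still holds the literal "%{...}" reference;
      # replace it with a placeholder fallback so the index name stays valid
      replace => { "application_name" => "unknown-app" }
    }
  }
}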

Duplicate field values for grok-parsed data

I have a Filebeat that captures logs from a uwsgi application running in Docker. The data is sent to Logstash, which parses it and forwards it to Elasticsearch.
Here is the logstash conf file:
input {
  beats {
    port => 5044
  }
}

filter {
  grok {
    match => { "log" => "\[pid: %{NUMBER:worker.pid}\] %{IP:request.ip} \{%{NUMBER:request.vars} vars in %{NUMBER:request.size} bytes} \[%{HTTPDATE:timestamp}] %{URIPROTO:request.method} %{URIPATH:request.endpoint}%{URIPARAM:request.params}? => generated %{NUMBER:response.size} bytes in %{NUMBER:response.time} msecs(?: via sendfile\(\))? \(HTTP/%{NUMBER:request.http_version} %{NUMBER:response.code}\) %{NUMBER:headers} headers in %{NUMBER:response.size} bytes \(%{NUMBER:worker.switches} switches on core %{NUMBER:worker.core}\)" }
  }
  date {
    # 29/Oct/2018:06:50:38 +0700
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
  kv {
    source => "request.params"
    field_split => "&?"
    target => "request.query"
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "test-index"
  }
}
Everything was fine, but I've noticed that all values captured by the grok pattern are duplicated. Here is how it looks in Kibana:
Note that raw fields such as log, which are not grok output, are fine. I've seen that the kv filter has an allow_duplicate_values parameter, but it doesn't apply to grok.
What is wrong with my configuration? Also, is it possible to rerun grok patterns on existing data in Elasticsearch?
Maybe your Filebeat is already doing the job and creating these fields.
Did you try adding this parameter to your grok?
overwrite => [ "request.ip", "request.endpoint", ... ]
In order to rerun grok on already indexed data, you need to use the elasticsearch input plugin to read the data back from ES and re-index it after grok, as sketched below.
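A minimal sketch of such a re-indexing pipeline, assuming the source index is test-index and the corrected documents go to a new placeholder index:

input {
  # read the already-indexed documents back out of Elasticsearch
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "test-index"
    query => '{ "query": { "match_all": {} } }'
  }
}

filter {
  grok {
    # re-apply the corrected pattern here (shortened placeholder shown) and
    # overwrite the fields so the values are not duplicated again
    match => { "log" => "\[pid: %{NUMBER:worker.pid}\] %{GREEDYDATA:rest}" }
    overwrite => [ "worker.pid", "rest" ]
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "test-index-regrokked"   # placeholder target index
  }
}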

Logstash configuration issue

I am just getting started with Logstash and Elasticsearch, and I would like to index .pdf or .doc files in Elasticsearch via Logstash.
I configured Logstash with the multiline codec to get each file into a single message in Elasticsearch. Below is my configuration file:
input {
  file {
    path => "D:/BaseCV/*"
    codec => multiline {
      # Grok pattern names are valid! :)
      pattern => ""
      what => "previous"
    }
  }
}

output {
  stdout {
    codec => "rubydebug"
  }
  elasticsearch {
    hosts => "localhost"
    index => "cvindex"
    document_type => "file"
  }
}
When Logstash starts, the first file I add is indexed into Elasticsearch as a single message, but the following files are spread over several messages. I would like a one-to-one correspondence: 1 file = 1 message.
Is this possible? What should I change in my setup to solve the problem?
Thank you for your feedback.
