Using multiple pipelines in Logstash with beats input - elasticsearch

As per an earlier discussion (Defining multiple outputs in Logstash whilst handling potential unavailability of an Elasticsearch instance) I'm now using pipelines in Logstash in order to send data received from Beats (on TCP 5044) to multiple Elasticsearch hosts. The relevant extract from pipelines.yml is shown below.
- pipeline.id: beats
  queue.type: persisted
  config.string: |
    input {
      beats {
        port => 5044
        ssl => true
        ssl_certificate_authorities => '/etc/logstash/config/certs/ca.crt'
        ssl_key => '/etc/logstash/config/certs/forwarder-001.pkcs8.key'
        ssl_certificate => '/etc/logstash/config/certs/forwarder-001.crt'
        ssl_verify_mode => "force_peer"
      }
    }
    output { pipeline { send_to => [es100, es101] } }
- pipeline.id: es100
  path.config: "/etc/logstash/pipelines/es100.conf"
- pipeline.id: es101
  path.config: "/etc/logstash/pipelines/es101.conf"
In each of the pipeline .conf files I define the corresponding virtual address; for example, the file /etc/logstash/pipelines/es101.conf contains the following:
input {
  pipeline {
    address => es101
  }
}
This configuration works well in normal operation: data is received by both of the Elasticsearch hosts, es100 and es101.
I need to ensure that if one of these hosts is unavailable, the other still receives data. Thanks to a previous tip I'm now using pipelines, which I understand should allow for this. However, I'm evidently missing something key in this configuration, because a host stops receiving data when the other one is unavailable. Any suggestions are gratefully welcomed.

Firstly, you should configure persistent queues on the downstream pipelines (es100, es101), and size them to contain all of the data that arrives during an outage. Even with persistent queues, though, Logstash has an at-least-once delivery model: if a persistent queue fills up, back-pressure will cause the beats input to stop accepting data. As the documentation on the output isolator pattern says, "If any of the persistent queues of the downstream pipelines ... become full, both outputs will stop".

If you really want to make sure one output is never blocked because the other output is unavailable, then you will need to introduce some software with a different delivery model. For example, configure Filebeat to write to Kafka, then have two pipelines that read from Kafka and write to Elasticsearch. If Kafka is configured with an at-most-once delivery model (the default) then it will lose data if it cannot deliver it.
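For reference, a minimal sketch of that first suggestion in pipelines.yml; the queue size is a placeholder you would tune so each queue can hold the data expected during an outage:

- pipeline.id: es100
  path.config: "/etc/logstash/pipelines/es100.conf"
  queue.type: persisted
  queue.max_bytes: 4gb   # placeholder: size to cover the expected outage window
- pipeline.id: es101
  path.config: "/etc/logstash/pipelines/es101.conf"
  queue.type: persisted
  queue.max_bytes: 4gb   # placeholder: size to cover the expected outage window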

Related

Filebeat: Send different logs from Filebeat to different Logstash pipelines

I want one Filebeat instance to be able to send data to different Logstash pipelines.
Is this possible?
I have configured one Logstash service with two pipelines, and each pipeline is given its own port.
Let's say Pipeline1 (port 5044) and Pipeline2 (port 5045).
Now I want to send data to Logstash using Filebeat. I have two types of log file, let's say log1 and log2.
I want to send log1 to Pipeline1 and log2 to Pipeline2.
I am running only one instance of Filebeat; how can I do this?
Filebeat can have only one output, so you will either need to run another Filebeat instance or change your Logstash pipeline to listen on a single port and then filter the data based on tags; it is easier to filter in Logstash than to run two instances.
In Filebeat you can specify a tag for each input and then use those tags in Logstash to send each log to the desired pipeline.
For example, events with the tag log1 will be sent to pipeline1 and events with the tag log2 will be sent to pipeline2.
Your Filebeat configuration needs to look something like this:
- type: log
  enabled: true
  paths:
    - "/path/to/your/logs/*.json"
  tags: ["logN"]
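Applied to the question's two log types, a minimal sketch could look like the following (the paths are hypothetical placeholders, and it assumes a Filebeat version that uses filebeat.inputs):

filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - "/path/to/log1/*.log"   # hypothetical location of the log1 files
    tags: ["log1"]
  - type: log
    enabled: true
    paths:
      - "/path/to/log2/*.log"   # hypothetical location of the log2 files
    tags: ["log2"]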
Then, in Logstash, you will need a conditional in your filters and outputs for each tag:
filter {
  if "logN" in [tags] {
    # filters for this log type
  }
}
output {
  if "logN" in [tags] {
    # output for this log type
  }
}
Filebeat can have only one output, but this can also be achieved by putting a messaging layer between Filebeat and Logstash; in my case I use Kafka between Filebeat and Logstash to achieve this.
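A rough sketch of that Kafka-based setup is shown below; the broker address and topic name are assumptions for illustration, not taken from the posts above. Filebeat publishes to a Kafka topic and each Logstash pipeline consumes from Kafka, so the tags set on the Filebeat inputs remain available for routing:

# filebeat.yml (assumed broker address and topic)
output.kafka:
  hosts: ["kafka-broker:9092"]
  topic: "filebeat-logs"

# Logstash pipeline (assumed broker address and topic)
input {
  kafka {
    bootstrap_servers => "kafka-broker:9092"
    topics => ["filebeat-logs"]
    codec => "json"
  }
}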

Many Logstash instances reading from Redis

I have one Logstash process running on one node, consuming from a Redis list, but I'm afraid a single process cannot handle the data throughput without significant delay.
I was wondering whether running one more Logstash process on the same machine would perform a little better, but I'm not certain about that. I know that my ES index is not the bottleneck.
Would Logstash duplicate my data if both processes consume the same list? Does this approach seem like the right thing to do?
Thanks!
Here my input configuration:
input {
  redis {
    data_type => "list"
    batch_count => 300
    key => "flight_pricing_stats"
    host => "my-redis-host"
  }
}
Instead of running another Logstash process on the same machine, you could try adjusting the threads setting on the redis input. The default is 1.
input {
  redis {
    data_type => "list"
    batch_count => 300
    key => "flight_pricing_stats"
    host => "my-redis-host"
    threads => 2
  }
}
You could run more than one Logstash instance against the same Redis list; events should not get duplicated. But I'm not sure that would help.
If you're not certain what's going on, I recommend the Logstash monitoring API. It can help you narrow down your real bottleneck.
There is also an interesting post from Elastic on the subject: Logstash Lines: Introducing a benchmarking tool for Logstash.
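For reference, a minimal way to query that monitoring API for per-pipeline statistics, assuming Logstash's default API port of 9600 on the same host:

# returns event counts and per-plugin timings, which helps locate the slow stage
curl -XGET 'http://localhost:9600/_node/stats/pipelines?pretty'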

MetricBeat - Kafka's consumergroup metricset doesn't send any data?

I have ZooKeeper and a single Kafka broker running, and I want to collect metrics with Metricbeat, index them with Elasticsearch and display them with Kibana.
However, Metricbeat only gets data from the partition metricset; nothing comes from the consumergroup metricset.
Since the kafka module is defined with a period in metricbeat.yml, shouldn't it send some data on its own, rather than just waiting for user interaction (for example, a write to a topic)?
To make sure, I tried creating a consumer group and writing to and consuming from a topic, but still no data was collected by the consumergroup metricset.
consumergroup is defined in both metricbeat.template.json and metricbeat.template-es2x.json.
metricbeat.full.yml is completely commented out; this is the kafka module definition in my metricbeat.yml:
- module: kafka
  metricsets: ["partition", "consumergroup"]
  enabled: true
  period: 10s
  hosts: ["localhost:9092"]
  client_id: metricbeat1
  retries: 3
  backoff: 250ms
  topics: []
In the /logs directory of Metricbeat, lines like this show up:
INFO Non-zero metrics in the last 30s:
libbeat.es.published_and_acked_events=109
libbeat.es.publish.write_bytes=88050
libbeat.publisher.messages_in_worker_queues=109
libbeat.es.call_count.PublishEvents=5
fetches.kafka-partition.events=106
fetches.kafka-consumergroup.success=2
libbeat.publisher.published_events=109
libbeat.es.publish.read_bytes=2701
fetches.kafka-partition.success=2
fetches.zookeeper-mntr.events=3
fetches.zookeeper-mntr.success=3
For ZooKeeper's mntr and Kafka's partition I can see both events= and success= values, but for consumergroup there is only success; it looks like no events are being produced.
The partition and mntr data are properly visible in Kibana, while consumergroup data is missing.
The data stored in Elasticsearch is not human-readable (there are internal strings used for directory names), and the logs do not contain any useful information.
Can anybody help me understand what is going on and fix it (probably in Metricbeat) so the data is sent to Elasticsearch? Thanks :)
You need to have an active consumer consuming from the topics in order for the consumergroup metricset to generate events.
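For example, a simple way to keep an active consumer running is the console consumer that ships with Kafka; the topic name and group id below are made up for illustration:

# keeps a consumer group alive so the consumergroup metricset has events to report
bin/kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 \
  --topic test-topic \
  --group metricbeat-demo \
  --from-beginning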

Ship only a percentage of logs to logstash

How can I configure Filebeat to ship only a percentage of logs (a sample, if you will) to Logstash?
In my application's log folder the logs are chunked into files of about 20 MB each. I want Filebeat to ship only about 1/300th of that log volume to Logstash.
I need to pare down the log volume before I send it over the wire, so I cannot do this filtering in Logstash; it needs to happen on the endpoint before the data leaves the server.
I asked this question in the ES forum and someone said it was not possible with Filebeat: https://discuss.elastic.co/t/ship-only-a-percentage-of-logs-to-logstash/77393/2
Is there really no way I can extend Filebeat to do this? Can nxlog or another product do this?
To the best of my knowledge, there is no way to do that with FileBeat. You can do it with Logstash, though.
filter {
  drop {
    percentage => 99.7
  }
}
This may be a use-case where you would use Logstash in shipping mode on the server, rather than FileBeat.
input {
  file {
    path => "/var/log/hugelogs/*.log"
    tags => [ 'sampled' ]
  }
}
filter {
  drop {
    percentage => 99.7
  }
}
output {
  tcp {
    host => 'logstash.prod.internal'
    port => '3390'
  }
}
It means installing Logstash on your servers, but you configure it as minimally as possible: just an input, enough filters to get the desired effect, and a single output (TCP in this case, but it could be anything). The full filtering will happen further down the pipeline.
There's no way to configure Filebeat to drop arbitrary events based on a probability. But Filebeat does have the ability to drop events based on conditions. There are two ways to filter events.
Filebeat has a way to specify lines to include or exclude when reading the file. This is the most efficient place to apply the filtering because it happens early. This is done using include_lines and exclude_lines in the config file.
filebeat.prospectors:
  - paths:
      - /var/log/myapp/*.log
    exclude_lines: ['^DEBUG']
All Beats have "processors" that allow you to apply an action based on a condition. One such action is drop_event, and the available conditions include regexp, contains, equals, and range.
processors:
  - drop_event:
      when:
        regexp:
          message: '^DEBUG'

How to add dynamic hosts in Elasticsearch and logstash

I have a prototype working in which devices send logs, and Logstash parses them and puts them into Elasticsearch.
Logstash output code:
output {
  if [type] == "json" {
    elasticsearch {
      hosts => ["host1:9200","host2:9200","host3:9200"]
      index => "index-metrics-%{+xxxx.ww}"
    }
  }
}
Now my question is:
I will be taking this solution to production. For simplicity, assume that I have one cluster which currently has 5 nodes.
I know I can give an array of the 5 nodes' IPs / hostnames in the elasticsearch output plugin and it will round-robin to distribute the data.
How can I avoid putting all my node IPs / hostnames into the Logstash config file?
As the system goes into production, I don't want to manually go into each Logstash instance and update these hosts.
What are the best practices to follow in this case?
My requirement is:
I want to run my ES cluster and be able to add / remove / update any number of nodes at any time, and I need all of my Logstash instances to keep sending data irrespective of changes on the ES side.
Thanks.
If you want to add/remove/update hosts, you will need to run sed or some other kind of string replacement before the service starts. Logstash configs are "compiled" at startup and cannot be changed on the fly.
hosts => [$HOSTS]
...
$ HOSTS="\"host1:9200\",\"host2:9200\""
$ sed "s/\$HOSTS/$HOSTS/g" $config
Your other option is to use environment variables for the dynamic portion, but that won't allow you to use a dynamic number of hosts.
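As a sketch of the environment-variable approach (the variable names here are made up for illustration), Logstash substitutes ${VAR} references at startup, so the host values can change between restarts, but the number of hosts stays fixed in the config:

output {
  elasticsearch {
    # ES_HOST1 and ES_HOST2 are hypothetical environment variables set before Logstash starts,
    # e.g. ES_HOST1=host1:9200
    hosts => ["${ES_HOST1}", "${ES_HOST2}"]
    index => "index-metrics-%{+xxxx.ww}"
  }
}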
