How to add dynamic hosts in Elasticsearch and Logstash

I have a prototype working where devices send logs, and Logstash parses them and puts them into Elasticsearch.
Logstash output code:
output {
  if [type] == "json" {
    elasticsearch {
      hosts => ["host1:9200","host2:9200","host3:9200"]
      index => "index-metrics-%{+xxxx.ww}"
    }
  }
}
Now my question is:
I will be putting this solution into production. For simplicity, assume that I have one cluster and right now 5 nodes inside that cluster.
I know I can give an array of the 5 nodes' IPs/hostnames in the elasticsearch output plugin, and it will round-robin to distribute data.
How can I avoid putting all my node IPs/hostnames into the Logstash config file?
As the system goes into production I don't want to manually go into each Logstash instance and update these hosts.
What are the best practices one should follow in this case?
My requirement is:
I want to run my ES cluster and add/remove/update any number of nodes at any time. I need all of my Logstash instances to keep sending data irrespective of changes on the ES side.
Thanks.

If you want to add/remove/update hosts, you will need to run sed or some other kind of string replacement before the service starts. Logstash configs are "compiled" and cannot be changed that way.
hosts => [$HOSTS]
...
$ HOSTS="\"host1:9200\",\"host2:9200\""
$ sed "s/\$HOSTS/$HOSTS/g" $config
Your other option is to use environment variables for the dynamic portion, but that won't allow you to use a dynamic number of hosts.
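For reference, a minimal sketch of that environment-variable approach, assuming hypothetical ES_HOST1/ES_HOST2 variables exported before Logstash starts (Logstash's ${VAR} substitution in pipeline configs; older releases required starting Logstash with --allow-env). Note the number of entries is still fixed in the config:
output {
  if [type] == "json" {
    elasticsearch {
      # ES_HOST1 / ES_HOST2 are hypothetical environment variables set on the host
      hosts => ["${ES_HOST1}:9200", "${ES_HOST2}:9200"]
      index => "index-metrics-%{+xxxx.ww}"
    }
  }
}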

Related

Using multiple pipelines in Logstash with beats input

As per an earlier discussion (Defining multiple outputs in Logstash whilst handling potential unavailability of an Elasticsearch instance), I'm now using pipelines in Logstash in order to send input data (from Beats on TCP 5044) to multiple Elasticsearch hosts. The relevant extract from pipelines.yml is shown below.
- pipeline.id: beats
  queue.type: persisted
  config.string: |
    input {
      beats {
        port => 5044
        ssl => true
        ssl_certificate_authorities => '/etc/logstash/config/certs/ca.crt'
        ssl_key => '/etc/logstash/config/certs/forwarder-001.pkcs8.key'
        ssl_certificate => '/etc/logstash/config/certs/forwarder-001.crt'
        ssl_verify_mode => "force_peer"
      }
    }
    output { pipeline { send_to => [es100, es101] } }
- pipeline.id: es100
  path.config: "/etc/logstash/pipelines/es100.conf"
- pipeline.id: es101
  path.config: "/etc/logstash/pipelines/es101.conf"
In each of the pipeline .conf files I have the related virtual address, i.e. the file /etc/logstash/pipelines/es101.conf includes the following:
input {
  pipeline {
    address => es101
  }
}
This configuration seems to work well, i.e. data is received by each of the Elasticsearch hosts es100 and es101.
I need to ensure that if one of these hosts is unavailable, the other still receives data, and thanks to a previous tip I'm now using pipelines, which I understand allow for this. However, I'm obviously missing something key in this configuration, as the data isn't received by one host if the other is unavailable. Any suggestions are gratefully welcomed.
Firstly, you should configure persistent queues on the downstream pipelines (es100, es101), and size them to contain all the data that arrives during an outage.
But even with persistent queues Logstash has an at-least-once delivery model. If the persistent queue fills up then back-pressure will cause the beats input to stop accepting data. As the documentation on the output isolator pattern says, "If any of the persistent queues of the downstream pipelines ... become full, both outputs will stop".
If you really want to make sure an output is never blocked because another output is unavailable then you will need to introduce some software with a different delivery model. For example, configure filebeat to write to kafka, then have two pipelines that read from kafka and write to elasticsearch. If kafka is configured with an at-most-once delivery model (the default) then it will lose data if it cannot deliver it.
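As a rough sketch of the first point, persistent queues can be enabled and sized per downstream pipeline in pipelines.yml (the queue.max_bytes value below is an illustrative placeholder, not a tuned recommendation):
- pipeline.id: es100
  path.config: "/etc/logstash/pipelines/es100.conf"
  queue.type: persisted
  queue.max_bytes: 4gb   # size this to cover the longest outage you want to absorb
- pipeline.id: es101
  path.config: "/etc/logstash/pipelines/es101.conf"
  queue.type: persisted
  queue.max_bytes: 4gb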

Filebeat: Send different logs from Filebeat to different Logstash pipelines

I want the functionality where one Filebeat instance can send data to different Logstash pipelines.
Is this possible?
I have configured one Logstash service with two pipelines, and each pipeline is given a separate port.
Let's say Pipeline1 (port 5044) and Pipeline2 (port 5045).
Now I want to send data to Logstash using Filebeat. I have two types of log files, let's say log1 and log2.
I want to send log1 to Pipeline1 and log2 to Pipeline2.
I am running only one instance of Filebeat; how can I do this?
Filebeat can have only one output. You will need to run another Filebeat instance, or change your Logstash pipeline to listen on only one port and then filter the data based on tags; it is easier to filter in Logstash than to run two instances.
In Filebeat you can specify a tag for each input that you have and use those tags in your Logstash configuration to send the logs to the desired pipeline.
For example, events with the tag log1 will be sent to pipeline1 and events with the tag log2 will be sent to pipeline2.
Your configuration needs to be something like this in Filebeat:
- type: log
  enabled: true
  paths:
    - "/path/to/your/logs/*.json"
  tags: ["logN"]
And then you will need a conditional in your Logstash filters and outputs for each tag you want:
filter {
  if "logN" in [tags] {
    filters
  }
}
output {
  if "logN" in [tags] {
    output
  }
}
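As a concrete (hypothetical) version of that skeleton, assuming the tags log1 and log2 and an Elasticsearch reachable on localhost:9200, the output section might look something like this:
output {
  if "log1" in [tags] {
    elasticsearch {
      hosts => ["localhost:9200"]
      index => "log1-%{+YYYY.MM.dd}"   # illustrative index name
    }
  }
  if "log2" in [tags] {
    elasticsearch {
      hosts => ["localhost:9200"]
      index => "log2-%{+YYYY.MM.dd}"
    }
  }
}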
Filebeat can have only one output, but this can be achieved by using a messaging medium between Filebeat and Logstash. In my case I am using Kafka between Filebeat and Logstash to achieve the above requirement.
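A minimal sketch of that Kafka-in-the-middle approach (broker address, topic names, and the log_type field are illustrative assumptions, not values from the original setup): Filebeat publishes each input to its own topic, and each Logstash pipeline consumes only its topic.
# filebeat.yml - route each input to a topic via a custom field
output.kafka:
  hosts: ["kafka-broker:9092"]
  topic: "logs-%{[fields.log_type]}"
# Logstash pipeline1 - consumes only the log1 topic
input {
  kafka {
    bootstrap_servers => "kafka-broker:9092"
    topics => ["logs-log1"]
  }
}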

Many Logstash instances reading from Redis

I have one Logstash process running on one node consuming from a Redis list, but I'm afraid that just one process cannot handle the data throughput without a great delay.
I was wondering whether running one more Logstash process on this same machine would perform a little better, but I'm not certain about that. I know that my ES index is not a bottleneck.
Would Logstash duplicate my data if I consume the same list? Does this approach seem like the right thing to do?
Thanks!
Here is my input configuration:
input {
  redis {
    data_type => "list"
    batch_count => 300
    key => "flight_pricing_stats"
    host => "my-redis-host"
  }
}
If you are going to run another Logstash process on the same machine, you could instead try adjusting the Logstash input threads; the default is 1.
input {
  redis {
    data_type => "list"
    batch_count => 300
    key => "flight_pricing_stats"
    host => "my-redis-host"
    threads => 2
  }
}
You could run more than one Logstash against the same Redis, and events should not get duplicated, since each consumer pops its own elements off the list. But I'm not sure that would help.
If you're not certain what's going on, I recommend the Logstash monitoring API; it can help you narrow down your real bottleneck.
There is also an interesting post from Elastic on the subject: Logstash Lines: Introducing a benchmarking tool for Logstash.
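If it helps, the node stats endpoint of that monitoring API can be queried directly, for example (9600 is the default API port; adjust if you have changed it):
curl -XGET 'http://localhost:9600/_node/stats/pipelines?pretty'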

Ship only a percentage of logs to logstash

How can I configure filebeat to only ship a percentage of logs (a sample if you will) to logstash?
In my application's log folder the logs are chunked to about 20 MB each. I want Filebeat to ship only about 1/300th of that log volume to Logstash.
I need to pare down the log volume before I send it over the wire to Logstash, so I cannot do this filtering from Logstash; it needs to happen on the endpoint before it leaves the server.
I asked this question in the ES forum and someone said it was not possible with filebeat: https://discuss.elastic.co/t/ship-only-a-percentage-of-logs-to-logstash/77393/2
Is there really no way I can extend Filebeat to do this? Can nxlog or another product do this?
To the best of my knowledge, there is no way to do that with FileBeat. You can do it with Logstash, though.
filter {
  drop {
    percentage => 99.7
  }
}
This may be a use-case where you would use Logstash in shipping mode on the server, rather than FileBeat.
input {
  file {
    path => "/var/log/hugelogs/*.log"
    tags => [ 'sampled' ]
  }
}
filter {
  drop {
    percentage => 99.7
  }
}
output {
  tcp {
    host => 'logstash.prod.internal'
    port => 3390
  }
}
It means installing Logstash on your servers. However, you configure it as minimally as possible: just an input, enough filters to get your desired effect, and a single output (TCP in this case, but it could be anything). Full filtering will happen further down the pipeline.
There's no way to configure Filebeat to drop arbitrary events based on a probability. But Filebeat does have the ability to drop events based on conditions. There are two ways to filter events.
Filebeat has a way to specify lines to include or exclude when reading the file. This is the most efficient place to apply the filtering because it happens early. This is done using include_lines and exclude_lines in the config file.
filebeat.prospectors:
- paths:
    - /var/log/myapp/*.log
  exclude_lines: ['^DEBUG']
All Beats have "processors" that allow you to apply an action based on a condition. One action is drop_event, and the conditions are regexp, contains, equals, and range.
processors:
- drop_event:
    when:
      regexp:
        message: '^DEBUG'

logstash with elasticsearch_http

Apparently the Logstash OnDemand account does not work when I wanted to post an issue.
Anyway, I have a Logstash setup with Redis, Elasticsearch, and Kibana. My Logstash instances are collecting logs from several files and putting them into Redis just fine.
Logstash version 1.3.3
Elasticsearch version 1.0.1
The only thing I have in elasticsearch_http for Logstash is the host name. This whole setup seems to glue together just fine.
The problem is that elasticsearch_http is not consuming the Redis entries as they come. What I have seen by running it in debug mode is that it flushes about 100 entries every 1 minute (flush_size and idle_flush_time's default values). The documentation, however, states (from what I understand) that it will force a flush in case the flush_size of 100 is not reached (for example, if we had only 10 messages in the last minute). But it seems to work the other way: it's flushing only about 100 messages every minute. I changed the size to 2000 and it flushes about 2000 every minute or so.
Here is my logstash-indexer.conf
input {
  redis {
    host => "1xx.xxx.xxx.93"
    data_type => "list"
    key => "testlogs"
    codec => json
  }
}
output {
  elasticsearch_http {
    host => "1xx.xxx.xxx.93"
  }
}
Here is my elasticsearch.yml
cluster.name: logger
node.name: "logstash"
transport.tcp.port: 9300
http.port: 9200
discovery.zen.ping.unicast.hosts: ["1xx.xxx.xxx.93:9300"]
discovery.zen.ping.multicast.enabled: false
#discovery.zen.ping.unicast.enabled: true
network.bind_host: 1xx.xxx.xxx.93
network.publish_host: 1xx.xxx.xxx.93
The indexer, Elasticsearch, Redis, and Kibana are on the same server. The log collection from files is done on another server.
So I'm going to suggest a couple of different approaches to solve your problem. Logstash, as you are discovering, can be a bit quirky, so I've found these approaches useful in dealing with unexpected behavior from Logstash.
1. Use the elasticsearch output instead of elasticsearch_http. You can get the same functionality by using the elasticsearch output with protocol set to http. The elasticsearch output is more mature (milestone 2 vs milestone 3) and I've seen this change make a difference before.
2. Set idle_flush_time and flush_size explicitly. There have been issues with Logstash defaults previously, and I've found it a lot safer to set them explicitly (see the sketch after this list). idle_flush_time is in seconds, flush_size is the number of records to flush.
3. Upgrade to a more recent version of Logstash. There is enough of a change in how Logstash is deployed with version 1.4.X (http://logstash.net/docs/1.4.1/release-notes) that I'd bite the bullet and upgrade. It's also significantly easier to get attention if you still have a problem with the most recent stable major release.
4. Make certain your Redis version matches those supported by your Logstash version.
5. Experiment with setting the batch, batch_events and batch_timeout values for the Redis output. You are using the list data_type. list supports various batch options, and as with some other parameters it's best not to assume the defaults are always being set correctly.
6. Do all of the above. In addition to trying the first set of suggestions, I'd try all of them together in various combinations. Keep careful records of each test run. It seems obvious, but between all the variations above it's easy to lose track - I'd keep careful records and try to change only one variation at a time.
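A sketch of what points 1 and 2 might look like in logstash-indexer.conf on a 1.4.x-era Logstash (option names are from that generation of the docs; the flush values are illustrative, so verify both against the documentation for the version you actually run):
output {
  elasticsearch {
    host => "1xx.xxx.xxx.93"
    protocol => "http"        # same HTTP transport as elasticsearch_http
    flush_size => 100         # flush after this many events...
    idle_flush_time => 1      # ...or after this many seconds, whichever comes first
  }
}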
