Logstash Duplicate Events - elasticsearch

I have two configuration files for Logstash: test1.conf and test2.conf.
Each one of them has its own flow of input -> filter -> output.
Both of them have the same filter and the same Elasticsearch output, writing to the same index.
My problem is that Logstash writes duplicate events to the Elasticsearch index no matter which input I test (every event becomes two identical events instead of one).
How can I fix this?

By default, Logstash has one pipeline named main, which automatically picks up all .conf files in the conf.d folder; this is configured in the pipelines.yml file:
- pipeline.id: main
  path.config: "/etc/logstash/conf.d/*.conf"
If you have multiple .conf files under one pipeline, Logstash merges them into a single configuration, so every filter and output is applied to every input. In this case, no matter which input receives an event, the event goes through both filter/output paths and gets written to Elasticsearch twice (as identical events, since the filters/outputs are the same in both .conf files).
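To make the merging concrete, here is a minimal sketch of what the main pipeline effectively runs when test1.conf and test2.conf sit in the same conf.d directory (the input plugins, paths, and index name below are placeholders, not taken from the question):
# test1.conf and test2.conf, concatenated by the main pipeline.
# Every event from either input runs through both filter blocks and both outputs.

# --- from test1.conf ---
input  { file { path => "/var/log/app1.log" } }   # placeholder input
filter { # common filter
}
output { elasticsearch { hosts => ["localhost:9200"] index => "events" } }

# --- from test2.conf ---
input  { file { path => "/var/log/app2.log" } }   # placeholder input
filter { # same common filter
}
output { elasticsearch { hosts => ["localhost:9200"] index => "events" } }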
Solutions
1. Move filter/output into a separate file
If your filters/outputs are the same across the config files, move the filter/output sections into a separate file. You then have two .conf files, one for each input, and a third .conf file for the shared filter/output. With this setup, each input goes through only one processing path.
For example:
input1.conf
input {
  # input 1
}
input2.conf
input {
  # input 2
}
filter_output.conf
filter {
  # common filter
}
output {
  # common output
}
You can check out this answer for another example of when this solution is the right choice.
Note that if the filters/outputs are the same but you still want to treat them as completely separate processing paths, keep reading.
2. Split the .conf files to different pipelines
If you need every .conf file to be independent, split the .conf files into different pipelines.
To do that, edit the pipelines.yml file.
For example:
pipelines.yml
- pipeline.id: test1
  path.config: "/etc/logstash/conf.d/test1.conf"
- pipeline.id: test2
  path.config: "/etc/logstash/conf.d/test2.conf"
Read more about Multiple Pipelines
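Note (based on the Logstash documentation, not part of the original question): pipelines.yml is only consulted when Logstash is started without the -e or -f options. If you launch Logstash with something like -f /etc/logstash/conf.d/test1.conf, the pipeline definitions above are ignored and a warning is logged.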
3. Separate by types
Tag each input with a different type and check it later in the filters/outputs with an if statement.
You can read more about it in this answer.
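For example, a minimal sketch of the type-based approach (the input plugins, paths, type names, and index names are placeholders, not taken from the question):
input {
  file { path => "/var/log/app1/*.log" type => "app1" }
  file { path => "/var/log/app2/*.log" type => "app2" }
}
filter {
  if [type] == "app1" {
    # filter applied only to app1 events
  } else if [type] == "app2" {
    # filter applied only to app2 events
  }
}
output {
  if [type] == "app1" {
    elasticsearch { hosts => ["localhost:9200"] index => "app1-%{+YYYY.MM.dd}" }
  } else {
    elasticsearch { hosts => ["localhost:9200"] index => "app2-%{+YYYY.MM.dd}" }
  }
}
Because each event carries a type, only one filter branch and one output branch apply to it, even though everything lives in a single pipeline.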

Related

Reference Conf File within Conf File or Apply Rule to All Listening RSyslog Ports

We have a number of individual conf files, each with its own ruleset bound to a unique port. We want to create a single conf file that filters/drops specific things, for example: if the msg comes from a given IP, drop it; if the msg contains x, drop it. And we want that drop filtering to apply to all listening ports. Is this possible? Should we avoid using rulesets?
We're trying to avoid updating the drop/filter rules in each conf file for each port every time the filter gets a new update.
Would anyone happen to know if either of the following is possible with RSyslog?
Have one conf file that listens on all rsyslog ports and is processed first, without specifying each open port?
Have a conf file that calls another file with a rule in it?
Appreciate any help with this.
Typically, the default configuration file, say /etc/rsyslog.conf, will contain a line near the start saying something like
$IncludeConfig /etc/rsyslog.d/*.conf
or the equivalent RainerScript syntax
include(file="/etc/rsyslog.d/*.conf")
If not, you can add it.
This will include all files matching the glob pattern, in alphabetical order. So you can optionally put any configuration in that directory, for example in arbitrarily named files 00-some.conf, and 10-somemore.conf and so on.
One file could have lots of input() statements like:
module(load="imtcp" MaxSessions="500")
input(type="imtcp" port="514")
input(type="imtcp" port="10514")
input(type="imtcp" port="20514")
assuming you are expecting to receive incoming TCP connections from remote clients. See imtcp.
All the data from those remotes will be affected by any following rules.
For example, the last included file in the directory could hold lines like:
if ($msg contains "Password: ") then stop
if ($msg startswith "Debug") then stop
if ($hostname startswith "test") then stop
These will stop further processing of any matching input messages, effectively
deleting them.
The above inputs are all collected into a single global input queue.
All the if rules are applied to all the messages from that queue.
If you want to, you can partition some of the inputs into a new queue,
and write rules that will only apply to that new independent queue. The rest of the
configuration will know nothing about this new queue and rules.
This is called a ruleset. See
here and
here.
For example, you can have a ruleset called "myrules". Move one or more
inputs into the ruleset by adding the extra option:
input(type="imtcp" port="514" ruleset="myrules")
input(type="imtcp" port="10514" ruleset="myrules")
Move the rules to apply to that queue into a ruleset definition:
ruleset(name="myrules") {
  if ($msg contains "Password: ") then stop
  if ($msg startswith "Debug") then stop
  *.* /var/log/mylogfile
}
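Conversely, an input that does not name the ruleset stays on the default queue, so the rules inside "myrules" never see its messages (the port below is just an example):
# Still bound to the default queue; unaffected by ruleset "myrules".
input(type="imtcp" port="20514")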

Ship only a percentage of logs to logstash

How can I configure filebeat to only ship a percentage of logs (a sample if you will) to logstash?
In my application's log folder the logs are chunked to about 20 megs each. I want filebeat to ship only about 1/300th of that log volume to logstash.
I need to pare down the log volume before I send it over the wire to logstash, so I cannot do this filtering in logstash; it needs to happen on the endpoint, before the data leaves the server.
I asked this question in the ES forum and someone said it was not possible with filebeat: https://discuss.elastic.co/t/ship-only-a-percentage-of-logs-to-logstash/77393/2
Is there really no way I can extend filebeat to do this? Can nxlog or another product do this?
To the best of my knowledge, there is no way to do that with FileBeat. You can do it with Logstash, though.
filter {
  drop {
    percentage => 99.7
  }
}
This may be a use-case where you would use Logstash in shipping mode on the server, rather than FileBeat.
input {
  file {
    path => "/var/log/hugelogs/*.log"
    tags => [ 'sampled' ]   # the file input's common option is "tags"
  }
}
filter {
  drop {
    percentage => 99.7
  }
}
output {
  tcp {
    host => 'logstash.prod.internal'
    port => '3390'
  }
}
It means installing Logstash on your servers. However, you configure it as minimally as possible: just an input, enough filters to get your desired effect, and a single output (TCP in this case, but it could be anything). Full filtering will happen further down the pipeline.
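For completeness, a minimal sketch of the receiving side of that setup, assuming the central Logstash listens on the same port 3390 used above (the codec and output details are placeholders):
input {
  tcp {
    port => 3390
    # Use a codec that matches the shipper's tcp output, e.g. json_lines on both ends.
  }
}
filter {
  # full parsing/enrichment happens here, on the central instance
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
  }
}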
There's no way to configure Filebeat to drop arbitrary events based on a probability, but Filebeat does have the ability to drop events based on conditions. There are two ways to filter events.
Filebeat has a way to specify lines to include or exclude when reading the file. This is the most efficient place to apply the filtering because it happens early. This is done using include_lines and exclude_lines in the config file.
filebeat.prospectors:
- paths:
    - /var/log/myapp/*.log
  exclude_lines: ['^DEBUG']
All Beats have "processors" that allow you to apply an action based on a condition. One action is drop_event, and the conditions are regexp, contains, equals, and range.
processors:
- drop_event:
    when:
      regexp:
        message: '^DEBUG'

Wait for File Processing to be finished

I am using Spring Integration to process/load data from CSV files.
My configuration is:
1) Poll for incoming files
2) Split the file using a splitter - this gives me the individual lines (records) of the file
3) Tokenize each line - this gives me the values, or columns
4) Use an aggregator to collect the lines (records) and write them to the database in batches
Poller -> Splitter -> Tokenizer -> Aggregator
Now I want to wait until all the content of the file has been written to the database and then move the file to a different folder.
But how do I identify when the file processing is finished?
The problem is: if the file has 1 million records and my aggregator has a batch size of 500, how would I know when every record of my file has been aggregated and written to the database?
The FileSplitter can optionally add markers (BOF, EOF) to the output - you would have to filter and/or route them before your secondary splitter.
See FileSplitter.
(markers) Set to true to emit start/end of file marker messages before and after the file data. Markers are messages with FileSplitter.FileMarker payloads (with START and END values in the mark property). Markers might be used when sequentially processing files in a downstream flow where some lines are filtered. They enable the downstream processing to know when a file has been completely processed. In addition, a header file_marker containing START or END is added to these messages. The END marker includes a line count. If the file is empty, only START and END markers are emitted with 0 as the lineCount. Default: false. When true, apply-sequence is false by default. Also see markers-json.

Filebeat duplicating events

I am running a basic ELK stack setup using Filebeat > Logstash > Elasticsearch > Kibana - all on version 5.2.
When I remove Filebeat and configure Logstash to look directly at a file, it ingests the correct number of events.
If I delete the data and re-ingest the same log file using Filebeat to pass it to Logstash, I get over 10% more events created. I have checked a number of these and confirmed the duplicates are being created by Filebeat.
Has anyone seen this issue? Or have any suggestions why this would happen?
First, I need to understand what you mean by removing Filebeat.
Possibility-1
If you have uninstalled Filebeat and installed it again, it will obviously read the data from the path again (the data you have re-ingested) and post it to Logstash -> Elasticsearch -> Kibana (assuming the old data has not been removed from the Elasticsearch node), hence the duplicates.
Possibility-2
You may have just stopped Filebeat, configured it for Logstash, and restarted it, and perhaps your registry file was not updated properly during shutdown. (As you know, Filebeat reads line by line and updates the registry file with the last line it has successfully published to Logstash/Elasticsearch/Kafka etc. If any of those output servers has difficulty processing the load of input coming from Filebeat, Filebeat waits until they are available again. Once they are, Filebeat reads the registry file to find the last line it published and resumes publishing from the next line onwards.)
Sample registry file will be like
{
"source": "/var/log/sample/sample.log",
"offset": 88,
"FileStateOS": {
"inode": 243271678,
"device": 51714
},
"timestamp": "2017-02-03T06:22:36.688837822-05:00",
"ttl": -2
}
As you can see, it maintains a timestamp in the registry file.
So this is one of the reasons for duplicates.
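If you want to inspect the registry yourself, a minimal filebeat.yml sketch for Filebeat 5.x follows (the path is an assumption matching the deb/rpm default; adjust it for your install). Note that deleting the registry while old data is still in Elasticsearch makes Filebeat re-read files from the beginning, which is another way to end up with duplicates:
# filebeat.yml (Filebeat 5.x): location of the registry file (assumed default path)
filebeat.registry_file: /var/lib/filebeat/registry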
For further reference, you can follow the links below:
https://discuss.elastic.co/t/filebeat-sending-old-logs-on-restart/46189
https://discuss.elastic.co/t/deleting-filebeat-registry-file/46112
https://discuss.elastic.co/t/filebeat-stop-cleaning-registry/58902
Hope that helps.

How to add dynamic hosts in Elasticsearch and logstash

I have a prototype working where devices send logs, and Logstash parses them and puts them into Elasticsearch.
Logstash output config:
output {
  if [type] == "json" {
    elasticsearch {
      hosts => ["host1:9200","host2:9200","host3:9200"]
      index => "index-metrics-%{+xxxx.ww}"
    }
  }
}
Now my question is:
I will be taking this solution to production. For simplicity, assume that I have one cluster with 5 nodes in it right now.
I know I can give an array of 5 node IPs / hostnames in the elasticsearch output plugin, and it will round-robin to distribute the data.
How can I avoid putting all of my node IPs / hostnames into the Logstash config file?
As the system goes into production, I don't want to manually go into each Logstash instance and update these hosts.
What are the best practices to follow in this case?
My requirement is:
I want to run my ES cluster and add / remove / update any number of nodes at any time. I need all of my Logstash instances to keep sending data regardless of changes on the ES side.
Thanks.
If you want to add/remove/update hosts, you will need to run sed or some other kind of string replacement before the service starts. Logstash configs are "compiled" at startup and cannot be changed at runtime.
hosts => [$HOSTS]
...
$ HOSTS="\"host1:9200\",\"host2:9200\""
$ sed "s/\$HOSTS/$HOSTS/g" $config
Your other option is to use environment variables for the dynamic portion, but that won't allow you to use a dynamic number of hosts.
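For example, a minimal sketch of the environment-variable approach (the variable name ES_HOST is an assumption; depending on the Logstash version, ${VAR} substitution in config files may need to be enabled, and older releases required the --allow-env flag):
output {
  elasticsearch {
    # Resolved from the environment when Logstash starts, e.g. ES_HOST=host1:9200.
    # This expands to a single string, so it cannot grow into a list of hosts.
    hosts => ["${ES_HOST}"]
    index => "index-metrics-%{+xxxx.ww}"
  }
}
You would export ES_HOST before starting Logstash, for example via the service's environment file.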
