logstash output to elasticsearch with document_id; what to do when I don't have a document_id? - elasticsearch

I have some Logstash input where I use the document_id to remove duplicates. However, most input doesn't have a document_id. The following plumbs the actual document_id through, but if it doesn't exist, it is taken literally as %{document_id}, which means most documents are seen as duplicates of each other. Here's what my output block looks like:
output {
elasticsearch_http {
host => "127.0.0.1"
document_id => "%{document_id}"
}
}
I thought I might be able to use a conditional in the output. It fails, and the error is given below the code.
output {
elasticsearch_http {
host => "127.0.0.1"
if document_id {
document_id => "%{document_id}"
}
}
}
Error: Expected one of #, => at line 101, column 8 (byte 3103) after output {
elasticsearch_http {
host => "127.0.0.1"
if
I tried a few "if" statements and they all fail, which is why I assume the problem is having a conditional of any sort in that block. Here are the alternatives I tried:
if document_id <> "" {
if [document_id] <> "" {
if [document_id] {
if "hello" <> "" {

You're close with the conditional idea, but you can't place a conditional inside a plugin block. Do this instead:
output {
if [document_id] {
elasticsearch_http {
host => "127.0.0.1"
document_id => "%{document_id}"
}
} else {
elasticsearch_http {
host => "127.0.0.1"
}
}
}
(But the suggestion in one of the other answers to use the uuid filter is good too.)

One way to solve this is to make sure a document_id is always available. You can achieve this by adding a uuid filter in the filter section that creates the document_id field if it is not present.
filter {
if "" in [document_id] {
uuid {
target => "document_id"
}
}
}
Edited per Magnus Bäck's suggestion. Thanks!
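With the uuid filter in place, the output block from the question can keep referencing %{document_id} unconditionally, along these lines:
output {
  elasticsearch_http {
    host => "127.0.0.1"
    # document_id now always exists: either from the input or generated by the uuid filter
    document_id => "%{document_id}"
  }
}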

Reference : docinfo_fields
For any document added to Elasticsearch, the _id is auto-generated if it is not specified during insert. We can use this same _id later for update/delete/search queries by using the docinfo_fields feature of the elasticsearch filter.
Example :
filter {
json {
source => "message"
}
elasticsearch {
hosts => "http://localhost:9200/"
user => elastic
password => elastic
query => "..."
docinfo_fields => {
"_id" => "docid"
"_index" => "document_index"
}
}
if ("_elasticsearch_lookup_failure" not in [tags]) {
#... doc update logic ...
}
}
output {
elasticsearch {
hosts => "http://localhost:9200/"
user => elastic
password => elastic
index => "%{document_index}"
action => "update"
doc_as_upsert => true
document_id => "%{docid}"
}
}

Related

Logstash aggregate fields

I am trying to configure Logstash to aggregate similar syslog messages based on the message field within a specific time window.
To make my case clear, this is an example of what I would like to do.
Example: I have these junk syslog messages coming through my Logstash:
timestamp. message
13:54:24. hello
13:54:35. hello
What I would like to do is have a condition that checks whether the messages are the same and occur within a specific timespan (for example 10 minutes); if so, I would like to aggregate them into one row and increase the count.
The output I am expecting to see is as follows:
timestamp. message. count
13.54.35. hello. 2
I know that it is possible to aggregate fields, but I was wondering whether this aggregation can be done over a specific time range.
If anyone can help me I would be extremely grateful, as I am new to Logstash and my server is receiving tons of junk syslog that I would like to reduce.
So far I have done some cleaning with this configuration:
input {
syslog {
port => 514
}
}
filter {
prune {
whitelist_names =>["timestamp","message","newfield"]
}
mutate {
add_field => {"newfield" => "%{#timestamp}%{message}"}
}
}
output {
elasticsearch {
hosts => ["localhost:9200"]
index => "logstash_index"
}
stdout {
codec => rubydebug
}
}
Now I just need to do the aggregation.
Thank you so much for your help guys
EDIT:
Following the documentation, I put in place this configuration:
input {
syslog {
port => 514
}
}
filter {
prune {
whitelist_names =>["timestamp","message","newfield"]
}
mutate {
add_field => {"newfield" => "%{#timestamp}%{message}"}
}
if [message] =~ "MESSAGE FROM" {
aggregate {
task_id => "%{message}"
code => "map['message'] ||= 0; map['message'] += 1;"
push_map_as_event_on_timeout => true
timeout_task_id_field => "message"
timeout => 60
inactivity_timeout => 50
timeout_tags => ['_aggregatetimeout']
timeout_code => "event.set('count_message', event.get('message') > 1)"
}
}
}
output {
elasticsearch {
hosts => ["localhost:9200"]
index => "logstash_index"
}
stdout {
codec => rubydebug
}
}
I don't get any error but the output is not what I am expecting.
The actual output is that it creates a tags field (good) containing an array with _aggregatetimeout and _aggregateexception:
{
"message" => "<88>MESSAGE FROM\r\n",
"tags" => [
[0] "_aggregatetimeout",
[1] "_aggregateexception"
],
"#timestamp" => 2021-07-23T12:10:45.646Z,
"#version" => "1"
}
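For what it's worth, the _aggregateexception above is most likely caused by the timeout_code comparing the string field message against a number, and by the code block overwriting message with a counter. A rough, untested sketch of a fix that keeps the count in its own field (field and tag names are only illustrative):
filter {
  if [message] =~ "MESSAGE FROM" {
    aggregate {
      task_id => "%{message}"
      # count duplicates in a separate map key instead of overwriting 'message'
      code => "map['count'] ||= 0; map['count'] += 1;"
      push_map_as_event_on_timeout => true
      timeout_task_id_field => "message"
      timeout => 600   # 10-minute window, as described above
      timeout_tags => ['_aggregatetimeout']
    }
  }
}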

Read a CSV in Logstash level and filter on basis of the extracted data

I am using Metricbeat to get process-level data and push it to Elasticsearch using Logstash.
Now, the aim is to categorize the processes into two tags, i.e. the running process is either a browser or it is something else.
I am able to do that statically using this block of code:
input {
beats {
port => 5044
}
}
filter{
if [process][name]=="firefox.exe" or [process][name]=="chrome.exe" {
mutate {
add_field => { "process.type" => "browsers" }
convert => {
"process.type" => "string"
}
}
}
else {
mutate {
add_field => { "process.type" => "other" }
}
}
}
output {
elasticsearch {
hosts => "localhost:9200"
# manage_template => false
index => "metricbeatlogstash"
}
}
But when I try to make that if condition dynamic by reading the process list from a CSV, I am not getting any valid results in Kibana, nor an error at the Logstash level.
The CSV config file code is as follows:
input {
beats {
port => 5044
}
file{
path=>"filePath"
start_position=>"beginning"
sincedb_path=>"NULL"
}
}
filter{
csv{
separator=>","
columns=>["processList","IT"]
}
if [process][name] in [processList] {
mutate {
add_field => { "process.type" => "browsers" }
convert => {
"process.type" => "string"
}
}
}
else {
mutate {
add_field => { "process.type" => "other" }
}
}
}
output {
elasticsearch {
hosts => "localhost:9200"
# manage_template => false
index => "metricbeatlogstash2"
}
}
What you are trying to do does not work that way in Logstash; the events in a Logstash pipeline are independent of each other.
The events received by your beats input have no knowledge of the events received by your file input, so you can't use fields from different events in a conditional.
To do what you want you can use the translate filter with the following config.
translate {
field => "[process][name]"
destination => "[process][type]"
dictionary_path => "process.csv"
fallback => "others"
refresh_interval => 300
}
This filter checks the value of the field [process][name] against a dictionary loaded into memory from the file process.csv. The dictionary is a .csv file with two columns: the first is the name of the browser process and the second is always browser.
chrome.exe,browser
firefox.exe,browser
If the filter gets a match, it populates the field [process][type] (not process.type) with the value from the second column, in this case always browser. If there is no match, it populates [process][type] with the value of the fallback option, in this case others. It also reloads the content of the process.csv file every 300 seconds (5 minutes).
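For example, with the two-line dictionary above, an event whose [process][name] is chrome.exe comes out with [process][type] set to browser, while an event for a process not in the file, say notepad.exe, gets others.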

How to send different logstash event to different output

There are many fields that are extracted from the message field in the Logstash filter section, like below:
match => ["message", "%{type1:f1} %{type2:f2} %{type3:f3}"]
The purpose is to send f1, f2, and f3 to one output and only f1 and f3 to the other output plugin, such that:
output {
elasticsearch {
action => "index"
hosts => "localhost"
index =>"indx1-%{+YYYY-MM}"
.
}
}
output {
elasticsearch {
action => "index"
hosts => "localhost"
index =>"indx2-%{+YYYY-MM}"
}
}
The problem is that all events go through every output plugin, but I want to control which events go to which output plugin. Is it possible to do this?
I found a solution by using Filebeat to forward data to Logstash.
If you run two instances of Filebeat and one instance of Logstash, each Filebeat forwards its input data to the same Logstash but with a different type, like:
document_type: type1
In Logstash, the appropriate filter and output is executed using an if clause:
filter {
if [type] == "type1" {
}
else {
}
}
output {
if [type] == "type1" {
elasticsearch {
action => "index"
hosts => "localhost"
index => "%{type}-%{+YYYY.MM}"
}
}
else {
elasticsearch {
action => "index"
hosts => "localhost"
index => "%{type}-%{+YYYY.MM}"
}
}
}
If you have two distinct matching patterns in the "filter" section, then you can add specific "tags" for each match. Then in the output section use something like this:
if "matchtype1" in [tags] {
elasticsearch {
hosts => "localhost"
index => "indxtype1-%{+YYYY.MM}"
}
}
if "matchtype2" in [tags]{
elasticsearch {
hosts => "localhost"
index => "indxtype2-%{+YYYY.MM}"
}
}
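For completeness, a rough sketch of the filter side that adds those tags (the grok patterns and tag names are placeholders for whatever matching you already do):
filter {
  grok {
    # placeholder: pattern that only matches the first message format
    match => { "message" => "%{WORD:f1} %{WORD:f2} %{WORD:f3}" }
    add_tag => ["matchtype1"]
    tag_on_failure => []
  }
  grok {
    # placeholder: pattern that only matches the second message format
    match => { "message" => "%{WORD:f1}:%{WORD:f3}" }
    add_tag => ["matchtype2"]
    tag_on_failure => []
  }
}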

Logstash conditional output to elasticsearch (index per filebeat hostname)

I have several web servers with Filebeat installed, and I want to have a separate index per host.
My current configuration looks like this:
input {
beats {
ports => 1337
}
}
filter {
grok {
match => { "message" => "%{COMBINEDAPACHELOG}"}
}
geoip {
source => "clientip"
}
}
output {
elasticsearch {
if [beat][hostname] == "luna"
{
hosts => "10.0.1.1:9200"
manage_template => true
index => "lunaindex-%{+YYYY.MM.dd}"
document_type => "apache"
}
}
}
However the above conf results to
The given configuration is invalid. Reason: Expected one of #, => at
line 22, column 6 (byte 346)
which is where the if statement takes place. Any help?
I would like to have the above in a nested format, such as:
if [beat][hostname] == "lina"
{
index = lina
}
else if [beat][hostname] == "lona"
{
index = lona
}
etc. Any help please?
The thread is old, but hopefully somebody will find this useful. Plugin definitions don't allow conditionals inside them, hence the error. The conditional must enclose the entire plugin definition, like below. Also, see the documentation for details.
output {
if [beat][hostname] == "luna" {
elasticsearch {
hosts => "10.0.1.1:9200"
manage_template => true
index => "lunaindex-%{+YYYY.MM.dd}"
document_type => "apache"
}
} else{
elasticsearch {
# alternate configuration
}
}
}
To access any nested field you have to enclose it in %{}.
Try this
%{[beat][hostname]}
See this for more explanations.
UPDATE:
Using %{[beat][hostname]} with == will not work; try:
if "lina" in [beat][hostname]{
index = lina
}
A solution can be:
In each of your Filebeat configuration files, in the prospector section, define the document type:
document_type: luna
And in your pipeline conf file, check the type field:
if [type] == "luna"
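Put together, the output section could then look something like this (the host comes from the question; the else-branch index name is just a placeholder):
output {
  if [type] == "luna" {
    elasticsearch {
      hosts => "10.0.1.1:9200"
      index => "lunaindex-%{+YYYY.MM.dd}"
    }
  } else {
    elasticsearch {
      hosts => "10.0.1.1:9200"
      index => "otherindex-%{+YYYY.MM.dd}"
    }
  }
}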
Hope this helps.

How do I exclude the "fingerprint" field from elasticsearch

I'm using the fingerprint filter in Logstash to create a fingerprint field that I set to document_id in the elasticsearch output.
Configuration is as follows:
filter {
fingerprint {
method => "SHA1"
key => "KEY"
}
}
output {
elasticsearch {
host => localhost
document_id => "%{fingerprint}"
}
}
This results in a redundant fingerprint field in Elasticsearch that has the same value as _id. How do I prevent this redundant field from being saved to ES?
If you're using Logstash 1.5 or higher, you can put your field under @metadata, and then it will not be sent to Elasticsearch as part of the regular document.
Example:
filter {
fingerprint {
...
target => "[#metadata][fingerprint]"
}
}
output {
elasticsearch {
...
document_id => "%{[#metadata][fingerprint]}"
}
}
You could set document_id as the fingerprint target:
filter {
fingerprint {
method => "SHA1"
key => "KEY"
target => "document_id"
}
}
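The output would then reference that field in the same way as in the question, for example:
output {
  elasticsearch {
    host => "localhost"
    # uses the fingerprint stored in the document_id field by the filter above
    document_id => "%{document_id}"
  }
}
Note that this still stores a document_id field in the indexed document, so if the goal is to avoid the redundant field entirely, the @metadata approach above is the cleaner option.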
