Can't set document_id for deduplicating docs in Filebeat - elasticsearch

What are you trying to do?
I have location data from some sensors, and I want to run geo-spatial queries to find which sensors are in a specific area (query by polygon, bounding box, etc.). The location data (lat/lon) for these sensors may change in the future. I should be able to drop JSON files in ndjson format into the watched folder and overwrite the existing location data for each sensor with the new data.
I also have another filestream input for indexing the logs of these sensors.
I went through the docs for deduplication and for the filestream input's ndjson parser and followed them exactly.
Show me your configs.
# ============================== Filebeat inputs ===============================
filebeat.inputs:
- type: filestream
  id: "log"
  enabled: true
  paths:
    - D:\EFK\Data\Log\*.json
  parsers:
    - ndjson:
        keys_under_root: true
        add_error_key: true
  fields.doctype: "log"
- type: filestream
  id: "loc"
  enabled: true
  paths:
    - D:\EFK\Data\Location\*.json
  parsers:
    - ndjson:
        keys_under_root: true
        add_error_key: true
        document_id: "Id" # Not working as expected.
  fields.doctype: "location"
  processors:
    - copy_fields:
        fields:
          - from: "Lat"
            to: "fields.location.lat"
        fail_on_error: false
        ignore_missing: true
    - copy_fields:
        fields:
          - from: "Long"
            to: "fields.location.lon"
        fail_on_error: false
        ignore_missing: true
# ---------------------------- Elasticsearch Output ----------------------------
output.elasticsearch:
  hosts: ["localhost:9200"]
  index: "sensor-%{[fields.doctype]}"
setup.ilm.enabled: false
setup.template:
  name: "sensor_template"
  pattern: "sensor-*"
# ------------------------------ Global Processors --------------------------
processors:
  - drop_fields:
      fields: ["agent", "ecs", "input", "log", "host"]
What does your input file look like?
{"Id":1,"Lat":19.000000,"Long":20.00000,"key1":"value1"}
{"Id":2,"Lat":19.000000,"Long":20.00000,"key1":"value1"}
{"Id":3,"Lat":19.000000,"Long":20.00000,"key1":"value1"}
It's the 'Id' field here that I want to use for deduplicating (i.e. overwriting existing documents with new data).
Update 10/05/22:
I have also tried working with:
json.document_id: "Id"
filebeat.inputs:
- type: filestream
  id: "loc"
  enabled: true
  paths:
    - D:\EFK\Data\Location\*.json
  json.document_id: "Id"
ndjson.document_id: "Id"
filebeat.inputs:
- type: filestream
  id: "loc"
  enabled: true
  paths:
    - D:\EFK\Data\Location\*.json
  ndjson.document_id: "Id"
Straight up document_id: "Id"
filebeat.inputs:
- type: filestream
  id: "loc"
  enabled: true
  paths:
    - D:\EFK\Data\Location\*.json
  document_id: "Id"
Trying to overwrite _id using copy_fields
processors:
  - copy_fields:
      fields:
        - from: "Id"
          to: "@metadata_id"
      fail_on_error: false
      ignore_missing: true
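The copy_fields attempt above is close to the documented approach: as far as I know, the Elasticsearch output uses an event's `@metadata._id` (note the dot before `_id`) as the document `_id` whenever it is set. A sketch of that variant, based on the deduplication guide rather than tested against 8.1.3:

processors:
  - copy_fields:
      fields:
        - from: "Id"
          to: "@metadata._id"   # the output reads _id from event metadata
      fail_on_error: false
      ignore_missing: true

If `Id` is numeric, the resulting `_id` is its string form, which is fine for overwriting: repeated ingests of the same `Id` then update the same document.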
The Elasticsearch config has nothing special other than disabled security, and it's all running on localhost.
Version used for Elasticsearch, Kibana and Filebeat: 8.1.3
Please do comment if you need more info :)
References:
Deduplication in Filebeat: https://www.elastic.co/guide/en/beats/filebeat/8.2/filebeat-deduplication.html#_how_can_i_avoid_duplicates
Filebeat ndjson input: https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-filestream.html#_ndjson
Copy_fields in Filebeat: https://www.elastic.co/guide/en/beats/filebeat/current/copy-fields.html#copy-fields

Related

Elasticsearch/Kibana shows the wrong timestamp

I transfer logfiles with Filebeat to Elasticsearch.
The data is analyzed with Kibana.
Now to my problem:
Kibana does not show the timestamp from the logfile.
Kibana shows the time of the transmission in @timestamp.
I want to show the timestamp from the logfile in Kibana, but the timestamp in the logfile is overwritten.
Where is my mistake?
Does anyone have a solution for my problem?
Here is an example from my logfile and my Filebeat config.
{"@timestamp":"2022-06-23T10:40:25.852+02:00","@version":1,"message":"Could not refresh JMS Connection]","logger_name":"org.springframework.jms.listener.DefaultMessageListenerContainer","level":"ERROR","level_value":40000}
## Filebeat configuration
## https://github.com/elastic/beats/blob/master/deploy/docker/filebeat.docker.yml
#
filebeat.config:
  modules:
    path: ${path.config}/modules.d/*.yml
    reload.enabled: false
filebeat.autodiscover:
  providers:
    # The Docker autodiscover provider automatically retrieves logs from Docker
    # containers as they start and stop.
    - type: docker
      hints.enabled: true
filebeat.inputs:
- type: filestream
  id: pls-logs
  paths:
    - /usr/share/filebeat/logs/*.log
  parsers:
    - ndjson:
processors:
  - add_cloud_metadata: ~
output.elasticsearch:
  hosts: ['http://elasticsearch:9200']
  username: elastic
  password:
## HTTP endpoint for health checking
## https://www.elastic.co/guide/en/beats/filebeat/current/http-endpoint.html
#
http.enabled: true
http.host: 0.0.0.0
Thanks for any support!
Based upon the question, one potential option would be to use Filebeat processors. What you could do is write the initial @timestamp value to another field, like event.ingested, using the script below:
# Script to move the timestamp to the event.ingested field
- script:
    lang: javascript
    id: init_format
    source: >
      function process(event) {
        var fieldTest = event.Get("@timestamp");
        event.Put("event.ingested", fieldTest);
      }
And then the last processor you write could move that event.ingested value back into @timestamp using the following processor:
# Set @timestamp back to the original event time preserved in event.ingested
- timestamp:
    field: event.ingested
    layouts:
      - '2006-01-02T15:04:05Z'
      - '2006-01-02T15:04:05.999Z'
      - '2006-01-02T15:04:05.999-07:00'
    test:
      - '2019-06-22T16:33:51Z'
      - '2019-11-18T04:59:51.123Z'
      - '2020-08-03T07:10:20.123456+02:00'
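An alternative worth testing (my own suggestion, not part of the question's setup): since the log lines are JSON, the filestream input's ndjson parser has an `overwrite_keys` option, so the `@timestamp` parsed from the log line can replace Filebeat's own timestamp directly, without the script/timestamp round-trip:

parsers:
  - ndjson:
      keys_under_root: true
      overwrite_keys: true  # parsed @timestamp overwrites Filebeat's own
      add_error_key: true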

how to take duplicate configurations out in filebeat.yaml

I have a list of inputs in filebeat, for example
- path: /xxx/xx.log
  enabled: true
  type: log
  fields:
    topic: topic_1
- path: /xxx/ss.log
  enabled: true
  type: log
  fields:
    topic: topic_2
So can I factor the duplicated config out as a reference variable? For example:
- path: /xxx/xx.log
  ${vars}
  fields:
    topic: topic_1
- path: /xxx/ss.log
  ${vars}
  fields:
    topic: topic_2
You can use YAML's inheritance: your first input is used as a model, and the others can override its parameters.
- &default-log
  path: /xxx/xx.log
  enabled: true
  type: log
  fields:
    topic: topic_1
- <<: *default-log
  path: /xxx/ss.log
  fields:
    topic: topic_2
AFAIK there is no way to define an "abstract" default, meaning your &default-log should be one of your inputs (not just an abstract model).
(YAML syntax verified with YAMLlint)
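For clarity, once the anchor and merge key are expanded, the two inputs above are equivalent to the following (note that the overriding `fields:` block replaces the inherited one entirely rather than merging with it; `<<` merge keys are a YAML 1.1 feature, so this assumes your YAML parser supports them, as YAMLlint does):

- path: /xxx/xx.log
  enabled: true
  type: log
  fields:
    topic: topic_1
- path: /xxx/ss.log
  enabled: true
  type: log
  fields:
    topic: topic_2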

Filtering Filebeat input with or without Logstash

In our current setup we use Filebeat to ship logs to an Elasticsearch instance. The application logs are in JSON format and it runs in AWS.
For some reason AWS decided to prefix the log lines in a new platform release, and now the log parsing doesn't work.
Apr 17 06:33:32 ip-172-31-35-113 web: {"@timestamp":"2020-04-17T06:33:32.691Z","@version":"1","message":"Tomcat started on port(s): 5000 (http) with context path ''","logger_name":"org.springframework.boot.web.embedded.tomcat.TomcatWebServer","thread_name":"main","level":"INFO","level_value":20000}
Before it was simply:
{"@timestamp":"2020-04-17T06:33:32.691Z","@version":"1","message":"Tomcat started on port(s): 5000 (http) with context path ''","logger_name":"org.springframework.boot.web.embedded.tomcat.TomcatWebServer","thread_name":"main","level":"INFO","level_value":20000}
The question is whether we can avoid using Logstash to convert the log lines into the old format. If not, how do I drop the prefix? Which filter is the best choice for this?
My current Filebeat configuration looks like this:
filebeat.inputs:
- type: log
  paths:
    - /var/log/web-1.log
  json.keys_under_root: true
  json.ignore_decoding_error: true
  json.overwrite_keys: true
  fields_under_root: true
  fields:
    environment: ${ENV_NAME:not_set}
    app: myapp
cloud.id: "${ELASTIC_CLOUD_ID:not_set}"
cloud.auth: "${ELASTIC_CLOUD_AUTH:not_set}"
I would try to leverage the dissect and decode_json_fields processors:
processors:
  # first ignore the preamble and only keep the JSON data
  - dissect:
      tokenizer: "%{?ignore} %{+ignore} %{+ignore} %{+ignore} %{+ignore}: %{json}"
      field: "message"
      target_prefix: ""
  # then parse the JSON data
  - decode_json_fields:
      fields: ["json"]
      process_array: false
      max_depth: 1
      target: ""
      overwrite_keys: false
      add_error_key: true
There is a Logstash plugin called the JSON filter that parses the raw log line contained in a field (for instance "message"):
filter {
  json {
    source => "message"
  }
}
If you do not want to include the beginning part of the line, use the dissect filter in Logstash. It would be something like this:
filter {
  dissect {
    mapping => {
      "message" => "%{}: %{message_without_prefix}"
    }
  }
}
These two features may be available in Filebeat as well, but in my experience I prefer working with Logstash when parsing/manipulating logging data.

Can filebeat convert log lines output to json without logstash in pipeline?

We have standard (non-JSON) log lines in our Spring Boot web applications.
We need to centralize our logging and ship the logs to Elasticsearch as JSON.
(I've heard the later versions can do some transformation.)
Can Filebeat read the log lines and wrap them as JSON? I guess it could append some metadata as well; there is no need to parse the log line.
Expected output:
{timestamp : "", beat: "", message: "the log line..."}
I have no code to show, unfortunately.
Filebeat supports several outputs, including Elasticsearch.
Config file filebeat.yml can look like this:
# filebeat options: https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-reference-yml.html
filebeat.inputs:
- type: log
enabled: true
paths:
- /var/log/../file.err.log
processors:
- drop_fields:
# Prevent fail of Logstash (https://www.elastic.co/guide/en/beats/libbeat/current/breaking-changes-6.3.html#custom-template-non-versioned-indices)
fields: ["host"]
- dissect:
# tokenizer syntax: https://www.elastic.co/guide/en/logstash/current/plugins-filters-dissect.html.
tokenizer: "%{} %{} [%{}] {%{}} <%{level}> %{message}"
field: "message"
target_prefix: "spring boot"
fields:
log_type: spring_boot
output.elasticsearch:
hosts: ["https://localhost:9200"]
username: "filebeat_internal"
password: "YOUR_PASSWORD"
Well, it seems to do it by default. This is my result when I tried it locally to read log lines; it wraps the line exactly like I wanted:
{
  "@timestamp": "2019-06-12T11:11:49.094Z",
  "@metadata": {
    "beat": "filebeat",
    "type": "doc",
    "version": "6.2.4"
  },
  "message": "the log line...",
  "source": "/Users/myusername/tmp/hej.log",
  "offset": 721,
  "prospector": {
    "type": "log"
  },
  "beat": {
    "name": "my-macbook.local",
    "hostname": "my-macbook.local",
    "version": "6.2.4"
  }
}

how can i store in two index using two JSON formated log files using filebeat and output to elasticsearch

Below is my Filebeat configuration file, located at /etc/filebeat/filebeat.yml. It throws an error of:
Failed to publish events: temporary bulk send failure
filebeat.prospectors:
- paths:
    - /var/log/nginx/virus123.log
  input_type: log
  fields:
    type:virus123
  json.keys_under_root: true
- paths:
    - /var/log/nginx/virus1234.log
  input_type: log
  fields:
    type:virus1234
  json.keys_under_root: true
setup.template.name: "filebeat-%{[beat.version]}"
setup.template.pattern: "filebeat-%{[beat.version]}-*"
setup.template.overwrite: true
processors:
  - drop_fields:
      fields: ["beat","source"]
output.elasticsearch:
  index: index: "filebeat-%{[beat.version]}-%{[fields.type]:other}-%{+yyyy.MM.dd}"
  hosts: ["http://127.0.0.1:9200"]
I think I found your problem, although I'm not sure it is the only one:
index: index: "filebeat-%{[beat.version]}-%{[fields.type]:other}-%{+yyyy.MM.dd}"
should be:
index: "filebeat-%{[beat.version]}-%{[fields.type]:other}-%{+yyyy.MM.dd}"
I saw a similar problem with a wrong index which caused the same error that you showed.
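One more thing worth checking (an assumption on my part, since the error message does not point at it): `type:virus123` under `fields:` is missing the space after the colon, so YAML parses it as the plain scalar string "type:virus123" rather than as a key/value pair. It should be:

fields:
  type: virus123   # the space after the colon makes it a key/value mapping

Without the space, `fields.type` never exists, and the `%{[fields.type]:other}` part of the index name always falls back to "other".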
