Logstash -> Elasticsearch: update document @timestamp if newer, discard if older

Using the elasticsearch output in Logstash, how can I update only the @timestamp of a log message when the incoming one is newer?
I don't want to reindex the whole document, nor have the same log message indexed twice.
Also, if the incoming @timestamp is older, it must not update/replace the current version.
Currently, I'm doing this:
filter {
  if ("cloned" in [tags]) {
    fingerprint {
      add_tag => [ "lastlogin" ]
      key => "lastlogin"
      method => "SHA1"
    }
  }
}
output {
  if ("cloned" in [tags]) {
    elasticsearch {
      action => "update"
      doc_as_upsert => true
      document_id => "%{fingerprint}"
      index => "lastlogin-%{+YYYY.MM}"
      sniffing => true
      template_overwrite => true
    }
  }
}
It is similar to "How to deduplicate documents while indexing into elasticsearch from logstash", but I do not want to always update the message field; only when the @timestamp field is more recent.

You can't decide at the Logstash level whether a document needs to be updated or left alone; this has to be decided at the Elasticsearch level, which means you need to experiment and test with the _update API.
I suggest looking at https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html#upserts. With a scripted upsert, if the document exists the script is executed (where you can check, for example, the @timestamp); otherwise the content of upsert is indexed as a new document.
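For illustration, here is a rough, untested sketch of how that scripted upsert could be wired directly into the Logstash elasticsearch output, using the plugin's script-related options and an inline Painless script; the script body and option values are assumptions to adapt and verify against your Logstash and Elasticsearch versions:

output {
  if ("cloned" in [tags]) {
    elasticsearch {
      action => "update"
      document_id => "%{fingerprint}"
      index => "lastlogin-%{+YYYY.MM}"
      # Run the script even when the document does not exist yet,
      # instead of using doc_as_upsert.
      scripted_upsert => true
      script_type => "inline"
      script_lang => "painless"
      # Overwrite the stored document only if the incoming @timestamp is newer;
      # otherwise tell Elasticsearch to do nothing (a noop).
      script => "
        if (ctx._source['@timestamp'] == null ||
            Instant.parse(params.event['@timestamp']).isAfter(Instant.parse(ctx._source['@timestamp']))) {
          ctx._source.putAll(params.event);
        } else {
          ctx.op = 'none';
        }
      "
    }
  }
}

The fingerprint filter above stays unchanged; only the output changes, and doc_as_upsert => true is dropped in favour of scripted_upsert.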

Related

Logstash Elasticsearch plugin. Compare results from two sources

I have two deployed Elasticsearch clusters. The data should supposedly be the same in both clusters. My main aim is to compare the _source field of each Elasticsearch document between the source and target ES clusters.
I created a Logstash config in which I define an Elasticsearch input plugin that runs over each document in the source cluster; next, using the elasticsearch filter, I look up the target Elasticsearch cluster, query it for the document by the _id I took from the source cluster, and match the _source fields of both documents.
Could you please help me to implement such a config?
input {
  elasticsearch {
    hosts => ["source_cluster:9200"]
    ssl => true
    user => "user"
    password => "password"
    index => "my_index_pattern"
  }
}
filter {
  mutate {
    remove_field => ["@version", "@timestamp"]
  }
  elasticsearch {
    hosts => ["target_cluster:9200"]
    ssl => true
    user => "user"
    password => "password"
    query => ???????
    match _source field ????
  }
}
output {
  stdout { codec => rubydebug }
}
Maybe print some results of comparison...
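A rough, unverified sketch of how this could be wired up: it assumes docinfo is enabled on the input so the source document's _id is available (depending on the plugin version the docinfo fields land under [@metadata] or under [@metadata][input][elasticsearch]), and it compares only a single illustrative field (message); the query string and the compared field names are assumptions to adapt:

input {
  elasticsearch {
    hosts => ["source_cluster:9200"]
    ssl => true
    user => "user"
    password => "password"
    index => "my_index_pattern"
    docinfo => true   # expose _index, _type, _id of each source document
  }
}
filter {
  mutate {
    remove_field => ["@version", "@timestamp"]
  }
  # Look up the same document in the target cluster by its _id and
  # copy one field of interest into the event for comparison.
  elasticsearch {
    hosts => ["target_cluster:9200"]
    ssl => true
    user => "user"
    password => "password"
    query => "_id:%{[@metadata][_id]}"
    fields => { "message" => "target_message" }
  }
  # Flag whether the source and target values match.
  ruby {
    code => 'event.set("source_matches_target", event.get("message") == event.get("target_message"))'
  }
}
output {
  stdout { codec => rubydebug }
}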

After adding Prune filter along with KV filter - logs are not going to Elasticsearch

I am learning ELK and doing a POC for my project. I am applying the kv filter to sample integration logs from my project, and a lot of extra fields show up as a result, so I applied the prune filter and whitelisted certain fields. I can see the logs being printed by the Logstash server, but the logs are not going to Elasticsearch. If I remove the prune filter, they do reach Elasticsearch. Please advise how to debug this issue further.
filter {
  kv {
    field_split => "{},?\[\]"
    transform_key => "capitalize"
    transform_value => "capitalize"
    trim_key => "\s"
    trim_value => "\s"
    include_brackets => false
  }
  prune {
    whitelist_names => [ "App_version", "Correlation_id", "Env", "Flow_name", "host", "Instance_id", "log_level", "log_thread", "log_timestamp", "message", "patient_id", "status_code", "type", "detail" ]
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "mule-%{+YYYY.MM.dd}"
    #user => "elastic"
    #password => "changeme"
  }
  stdout { codec => rubydebug }
}
I also need two more suggestions.
I am also trying to use a grok filter on the initial logs to extract the log-level fields (time and log type) from the sample log and send the remaining text to the kv filter. If there is any reference for this, please share it. This is what I have tried, but I am getting a _grokparsefailure. I have passed msgbody to the kv filter with the source option.
grok {
  match => { "message" => "%{TIMESTAMP_ISO8601:timestamp}\s+%{LOGLEVEL:loglevel}\s+%{GREEDYDATA:msgbody}" }
  overwrite => [ "msgbody" ]
}
I have message fields inside the sample logs as shown below. When the data goes to Kibana, I can see two message field tags: one with the full log and the other with the correct (highlighted) message. Will mutate work for this case? Is there any way we can rename the full-log field to something else?
[2020-02-10 11:20:07.172] INFO Mule.api [[MuleRuntime].cpuLight.04:
[main-api-test].api-main.CPU_LITE #256c5cf5:
[main-api-test].main-api-main/processors/0/processors/0.CPU_LITE
#378f34b0]: event:00000003 {app_name=main-api-main, app_version=v1,
env=Test, timestamp=2020-02-10T11:20:07.172Z,
log={correlation_id=00000003, patient_id=12345678,
instance_id=hospital, message=Start of System API,
flow_name=main-api-main}}
prune filter error
Your prune filter does not have the @timestamp field in its whitelist_names list. Your output is date based (%{+YYYY.MM.dd}), and Logstash needs the @timestamp field in the event to extract the date.
I've run your pipeline with your sample message and it worked as expected: with the prune filter the message is sent to Elasticsearch, but it is stored in an index named mule- without any datetime in its name.
Without the prune filter your message uses the time when Logstash received the event as the @timestamp, since you do not have any date filter to change it.
If you created the index pattern for mule-* with a datetime field like @timestamp, you won't see in Kibana any documents in an index that doesn't have that datetime field.
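In other words, a minimal fix (sketched from the explanation above) is simply to add @timestamp to the whitelist:

prune {
  whitelist_names => [ "@timestamp", "App_version", "Correlation_id", "Env", "Flow_name", "host", "Instance_id", "log_level", "log_thread", "log_timestamp", "message", "patient_id", "status_code", "type", "detail" ]
}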
grok error
Your grok is wrong; you need to escape the square brackets surrounding your timestamp. Kibana has a grok debugger where you can try your patterns.
The following grok works; move your kv filter to run after the grok, with msgbody as its source.
grok {
  match => { "message" => "\[%{TIMESTAMP_ISO8601:timestamp}\]\s+%{LOGLEVEL:loglevel}\s+%{GREEDYDATA:msgbody}" }
  overwrite => [ "msgbody" ]
}
kv {
  source => "msgbody"
  field_split => "{},?\[\]"
  transform_key => "capitalize"
  transform_value => "capitalize"
  trim_key => "\s"
  trim_value => "\s"
  include_brackets => false
}
Just run it with output only to stdout to see the fields you get, then change your prune filter accordingly.
duplicated message fields
If you put your kv filter after the grok you won't have duplicated message fields. Since your kv is capitalizing your fields, you will end up with a message field containing your full log and a Message field containing your internal message; Logstash field names are case sensitive.
However, you can rename any field using the mutate filter.
mutate {
  rename => { "message" => "fullLogMessage" }
}

Elastic Document's @version not incrementing when updating via Logstash

I want to load the issue data from a JIRA instance into my Elastic Stack on a regular basis. I don't want to create a new Elasticsearch document every time I pull the data from the JIRA API, but instead update the existing document, which means there should only be one document per JIRA issue. When updating, I would expect the @version field to increment automatically when setting the document_id field of the elasticsearch output plugin.
Currently working setup
Elastic Stack: Version 7.4.0 running on Ubuntu in Docker containers
Logstash Input stage: get the JIRA issue data via http_poller input plugin
Logstash Filter stage: use the split filter plugin to modify the JSON data as needed
Logstash Output stage: pipe the data to Elasticsearch and make it visible in Kibana
Where I am struggling
The data is correctly registered in Elasticsearch and shown in Kibana. As expected there is one document per issue. However, the document is being overwritten and @version stays at value 1. I assumed that using action => "update", doc_as_upsert => true and document_id => "%{[@metadata][id]}" would be enough to make Elasticsearch realize that it needs to increment the version of the document.
I am also wondering in general whether this is the correct approach to make the JIRA issue data searchable over time. For example, will I be able to find the status quo of a JIRA ticket at a past @version? Or will the @version value only tell me how often the document was updated, without giving me the individual versions' values?
logstash.conf (certain data was removed and replaced with <> tags)
input {
  http_poller {
    urls => {
      data => {
        method => get
        url => "https://<myjira>.com/jira/rest/api/2/search?<searchJQL>"
        headers => {
          Authorization => "Basic <censored>"
          Accept => "application/json"
          "Content-Type" => "application/json"
        }
      }
    }
    request_timeout => 60
    schedule => { every => "10s" } # low value for debugging
    codec => "json"
  }
}
filter {
  split {
    field => "issues"
    add_field => {
      "key" => "%{[issues][key]}"
      "Summary" => "%{[issues][fields][summary]}"
      "[@metadata][id]" => "%{[issues][id]}" # unique ID of a JIRA issue; the JIRA issue key could also be used
    }
    remove_field => [ "startAt", "total", "maxResults", "expand", "issues" ]
  }
}
output {
  stdout { codec => rubydebug }
  elasticsearch {
    index => "gsep"
    user => ["<usr>"]
    password => ["<pw>"]
    hosts => ["elasticsearch:9200"]
    action => "update"
    document_id => "%{[@metadata][id]}"
    doc_as_upsert => true
  }
}
Screenshots from Document Data in Kibana
I had to censor some information, but the missing details should not be relevant. On the screenshot you can see that the same _id is correctly set, but @version stays at 1. In Elasticsearch/Kibana there exists exactly one document for the respective issue/_id.
The @version field comes from Logstash and is just an indicator of the version of your log message format. There is no auto-increment functionality.
Please note that there is also a _version field in Elasticsearch documents.
_version is an automatically incremented value used for optimistic locking in concurrency scenarios.
Just to be clear, Elasticsearch can't give you what you are expecting in terms of versioning out of the box. You can't access a different version of the same document by relying on _version. There are design patterns for implementing such a document history in Elasticsearch, but that's a broad question with many answers and out of scope here.

Can't access Elasticsearch index name metadata in Logstash filter

I want to add the Elasticsearch index name as a field in the event when processing in Logstash. This is supposed to be pretty straightforward, but the index name does not get printed out. Here is the complete Logstash config.
input {
  elasticsearch {
    hosts => "elasticsearch.example.com"
    index => "*-logs"
  }
}
filter {
  mutate {
    add_field => {
      "log_source" => "%{[@metadata][_index]}"
    }
  }
}
output {
  elasticsearch {
    index => "logstash-%{+YYYY.MM}"
  }
}
This results in log_source being set to the literal %{[@metadata][_index]} rather than the actual name of the index. I have tried this with _id, and without the underscores, but it always just outputs the reference and not the value.
Doing just %{[@metadata]} crashes Logstash with an error about accessing the list incorrectly, so [@metadata] is being set, but it seems like _index or any other values are missing.
Does anyone have a another way of assigning the index name to the event?
I am using 5.0.1 of both Logstash and Elasticsearch.
You're almost there; you're simply missing the docinfo setting, which is false by default:
input {
  elasticsearch {
    hosts => "elasticsearch.example.com"
    index => "*-logs"
    docinfo => true
  }
}

Copy ElasticSearch-Index with Logstash

I have a ready-built Apache index on one machine that I would like to clone to another machine using Logstash. Fairly easy, I thought:
input {
  elasticsearch {
    host => "xxx.xxx.xxx.xxx"
    index => "logs"
  }
}
filter {
}
output {
  elasticsearch {
    cluster => "Loa"
    host => "127.0.0.1"
    protocol => http
    index => "logs"
    index_type => "apache_access"
  }
}
That pulls over the docs, but it doesn't stop, since it uses the default query "*" (the original index has ~50,000 docs, and I killed the former script when the new index was over 600,000 docs and rising).
Next I tried to make sure the docs would get updated instead of duplicated, but this commit hasn't made it in yet, so I don't have a primary...
Then I remembered sincedb, but I don't seem to be able to use that in the query (or is that possible?).
Any advice? Maybe a completely different approach? Thanks a lot!
Assuming that the elasticsearch input creates a Logstash event with the document id (I assume it will be _id or something similar), try setting up the elasticsearch output the following way:
output {
  elasticsearch {
    cluster => "Loa"
    host => "127.0.0.1"
    protocol => http
    index => "logs"
    index_type => "apache_access"
    document_id => "%{_id}"
  }
}
That way, even if the elasticsearch input, for whatever reason, keeps pushing the same documents indefinitely, Elasticsearch will merely update the existing documents instead of creating new documents with new ids.
Once you reach 50,000, you can stop.
