How to fix duplicated documents in Elasticsearch when indexing by Logstash? - elasticsearch

I'm using the Elastic Stack to handle my log files, but it is generating duplicated documents in Elasticsearch.
I've done some research and already tried adding a "document_id", but it did not solve the problem.
This is my Logstash configuration:
input {
  beats {
    port => 5044
  }
}
filter {
  fingerprint {
    source => "message"
    target => "[fingerprint]"
    method => "SHA1"
    key => "key"
    base64encode => true
  }
  if [doctype] == "audit-log" {
    grok {
      match => { "message" => "^\(%{GREEDYDATA:loguser}@%{IPV4:logip}\) \[%{DATESTAMP:logtimestamp}\] %{JAVALOGMESSAGE:logmessage}$" }
    }
    mutate {
      remove_field => ["host"]
    }
    date {
      match => [ "logtimestamp" , "dd/MM/yyyy HH:mm:ss" ]
      target => "@timestamp"
      locale => "EU"
      timezone => "America/Sao_Paulo"
    }
  }
}
output {
  elasticsearch {
    hosts => "192.168.0.200:9200"
    document_id => "%{[fingerprint]}"
  }
}
Here are the duplicated documents:
{
  "_index": "logstash-2019.05.02-000001",
  "_type": "_doc",
  "_id": "EbncP00tf9yMxXoEBU4BgAAX/gc=",
  "_version": 1,
  "_score": null,
  "_source": {
    "@version": "1",
    "fingerprint": "EbncP00tf9yMxXoEBU4BgAAX/gc=",
    "message": "(thiago.alves@192.168.0.200) [06/05/2019 18:50:08] Logout do usuário 'thiago.alves'. (cookie=9d6e545860c24a9b8e3004e5b2dba4a6). IP=192.168.0.200",
    ...
  }
}
######### DUPLICATED #########
{
  "_index": "logstash-2019.05.02-000001",
  "_type": "_doc",
  "_id": "V7ogj2oB8pjEaraQT_cg",
  "_version": 1,
  "_score": null,
  "_source": {
    "@version": "1",
    "fingerprint": "EbncP00tf9yMxXoEBU4BgAAX/gc=",
    "message": "(thiago.alves@192.168.0.200) [06/05/2019 18:50:08] Logout do usuário 'thiago.alves'. (cookie=9d6e545860c24a9b8e3004e5b2dba4a6). IP=192.168.0.200",
    ...
  }
}
That's it. I still don't know why it is duplicating. Does anyone have any idea?
Thank you in advance.

I had this problem once, and after many attempts to solve it I realized that I had saved a backup of my conf file into the 'pipeline' folder, and Logstash was using that backup file to process its input rules. Be careful, because Logstash will use other files in the pipeline folder even if their extension is different from '.conf'.
So please check whether you have other files in the 'pipeline' folder.
Please let me know if this was useful to you.
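If you run Logstash as a service, one way to guard against this (a sketch, assuming the default DEB/RPM layout; adjust the paths to your install) is to point the pipeline at an explicit *.conf glob in pipelines.yml, so stray backup files are never picked up:
# /etc/logstash/pipelines.yml
- pipeline.id: main
  # only files ending in .conf are loaded; backups such as foo.conf.bak are ignored
  path.config: "/etc/logstash/conf.d/*.conf"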

Generate a UUID key for each document and your issue will be solved.
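If this refers to the logstash-filter-uuid plugin, a minimal sketch might look like the one below (the field name "uuid" is just an illustration). Note that a random UUID is unique per event, so it guarantees distinct ids, whereas the fingerprint in the question derives the id from the message content.
filter {
  uuid {
    # store a random UUID in a field that the output can reference
    target => "uuid"
  }
}
output {
  elasticsearch {
    hosts => "192.168.0.200:9200"
    document_id => "%{uuid}"
  }
}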

Your code seems fine and shouldn't allow duplicates. Maybe the duplicated document was added before you added document_id => "%{[fingerprint]}" to your Logstash config, so Elasticsearch generated a unique id for it that won't be overridden by other ids. Remove the duplicate (the one whose _id differs from the fingerprint) manually and try again; it should work.
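For the manual cleanup, deleting by _id works; a sketch using the delete API with the index and _id shown in the question:
curl -X DELETE "http://192.168.0.200:9200/logstash-2019.05.02-000001/_doc/V7ogj2oB8pjEaraQT_cg"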

We noticed that Logstash 7.5.2 was not working properly: it duplicated the logs coming from Filebeat. The actual issue we found is that the bundled Beats input plugin has a bug, so we removed the existing plugin and installed the stable version (6.0.14). The steps are as follows: download the plugin gem, then run
./bin/logstash-plugin remove logstash-input-beats
./bin/logstash-plugin install /{file path}/logstash-input-beats-6.0.14-java.gem
./bin/logstash-plugin list --verbose

Related

How to get fields inside message array from Logstash?

I've been trying to configure a Logstash pipeline whose input type is snmptrap, along with yamlmibdir. Here's the code:
input {
  snmptrap {
    host => "abc"
    port => 1062
    yamlmibdir => "/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/snmp-1.3.2/data/ruby/snmp/mibs"
  }
}
filter {
  mutate {
    gsub => ["message","^\"{","{"]
    gsub => ["message","}\"$","}"]
    gsub => ["message","[\\]",""]
  }
  json { source => "message" }
  split {
    field => "message"
    target => "events"
  }
}
output {
  elasticsearch {
    hosts => "xyz"
    index => "logstash-%{+YYYY.MM.dd}"
  }
  stdout { codec => rubydebug }
}
and the result shown in Kibana (JSON format)
{
  "_index": "logstash-2019.11.18-000001",
  "_type": "_doc",
  "_id": "Y_5zjG4B6M9gb7sxUJwG",
  "_version": 1,
  "_score": null,
  "_source": {
    "@version": "1",
    "@timestamp": "2019-11-21T05:33:07.675Z",
    "tags": [
      "_jsonparsefailure"
    ],
    "1.11.12.13.14.15": "teststring",
    "message": "#<SNMP::SNMPv1_Trap:0x244bf33f @enterprise=[1.2.3.4.5.6], @timestamp=#<SNMP::TimeTicks:0x196a1590 @value=55>, @varbind_list=[#<SNMP::VarBind:0x21f5e155 @name=[1.11.12.13.14.15], @value=\"teststring\">], @specific_trap=99, @source_ip=\"xyz\", @agent_addr=#<SNMP::IpAddress:0x5a5c3c5f @value=\"xC0xC1xC2xC3\">, @generic_trap=6>",
    "host": "xyz"
  },
  "fields": {
    "@timestamp": [
      "2019-11-21T05:33:07.675Z"
    ]
  },
  "sort": [
    1574314387675
  ]
}
As you can see, the message field is an array, so how can I get all the fields inside the array and also be able to select those fields to display in Kibana?
ps1. I still get the _jsonparsefailure tag if I select type 'Table' in the expanded document.
ps2. Even though I use gsub to remove '\' from the expected JSON result, why do I still get a result with '\'?

Filter jdbc data in Logstash

In my DB, I have data in the format below:
But in Elasticsearch I want to push the data with respect to item types, so each record in Elasticsearch will list all item names and their values per item type.
Like this:
{
  "_index": "daily_needs",
  "_type": "id",
  "_id": "10",
  "_source": {
    "item_type": "10",
    "fruits": "20",
    "veggies": "32",
    "butter": "11"
  }
}
{
  "_index": "daily_needs",
  "_type": "id",
  "_id": "11",
  "_source": {
    "item_type": "11",
    "hair gel": "50",
    "shampoo": "35"
  }
}
{
  "_index": "daily_needs",
  "_type": "id",
  "_id": "12",
  "_source": {
    "item_type": "12",
    "tape": "9",
    "10mm screw": "7",
    "blinker fluid": "78"
  }
}
Can I achieve this in Logstash?
I'm new to Logstash, but as per my understanding it can be done in a filter. However, I'm not sure which filter to use, or whether I have to create a custom filter for this.
Current conf example:
input {
  jdbc {
    jdbc_driver_library => "ojdbc6.jar"
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "myjdbc-configs"
    jdbc_user => "dbuser"
    jdbc_password => "dbpasswd"
    schedule => "* * * * *"
    statement => "SELECT * from item_table"
  }
}
filter {
  ## WHAT TO WRITE HERE??
}
output {
  elasticsearch {
    hosts => [ "http://myeshost/" ]
    index => "myindex"
  }
}
Kindly suggest. Thank you.
You can achieve this using the aggregate filter plugin. I have not tested the snippet below, but it should give you an idea.
filter {
  aggregate {
    task_id => "%{item_type}"
    code => "
      map['item_type'] = event.get('item_type')
      map[event.get('item_name')] = event.get('item_value')
      map['tags'] ||= ['aggregated']  # tag the pushed event so it survives the drop below
    "
    push_previous_map_as_event => true
    timeout => 3600
    timeout_tags => ['_aggregatetimeout']
  }
  if "aggregated" not in [tags] {
    drop {}
  }
}
Important caveats for using the aggregate filter:
The SQL query MUST order the results by item_type, so the events are not out of order.
Column names in the SQL query should match the field names used in the filter's map[].
You should use ONLY ONE worker thread for aggregations, otherwise events may be processed out of sequence and unexpected results will occur (see the note below).
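A single worker can be forced per pipeline; for example from the command line (the conf file name is a placeholder, and the same setting exists as pipeline.workers in logstash.yml):
bin/logstash -f your-pipeline.conf -w 1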

Dynamic elasticsearch index_type using logstash

I am working on storing data in Elasticsearch using Logstash, from a RabbitMQ server.
My Logstash command looks like:
logstash -e 'input {
  rabbitmq {
    exchange => "redwine_log"
    key => "info.redwine"
    host => "localhost"
    durable => true
    user => "guest"
    password => "guest"
  }
}
output {
  elasticsearch {
    host => "localhost"
    index => "redwine"
  }
}
filter {
  json {
    source => "message"
    remove_field => [ "message" ]
  }
}'
But I need Logstash to put the data into different types in the Elasticsearch cluster. What I mean by type is:
"hits": {
"total": 3,
"max_score": 1,
"hits": [
{
"_index": "logstash-2014.11.19",
"_type": "logs",
"_id": "ZEea8HBOSs-QwH67q1Kcaw",
"_score": 1,
"_source": {
"context": [],
"level": 200,
"level_name": "INFO",
This is part of the search result, where you can see that Logstash by default creates a type named "logs" (_type: "logs"). In my project I need the type to be dynamic and created based on the input data.
For example, my input data looks like:
{
  "data": "some data",
  "type": "type_1"
}
and I need Logstash to create a new type in Elasticsearch with the name "type_1".
I tried using grok, but couldn't get this specific requirement to work.
It worked for me this way:
elasticsearch {
  host => "localhost"
  index_type => "%{type}"
}
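For reference, index_type was later renamed: on Logstash versions that still support document types, a roughly equivalent sketch (hosts and index taken from the question) is shown below. Document types were removed in Elasticsearch 7, so this mapping no longer applies there.
output {
  elasticsearch {
    hosts => "localhost"
    index => "redwine"
    document_type => "%{type}"
  }
}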

Logstash and ElasticSearch filter Date @timestamp issue

I'm trying to index some data from a file into Elasticsearch using Logstash.
If I don't use the date filter to replace @timestamp, everything works very well, but when I use the filter I do not get all the data.
I can't figure out why there is a difference between the Logstash command line and Elasticsearch in the @timestamp value.
Logstash conf
filter {
  mutate {
    replace => {
      "type" => "dashboard_a"
    }
  }
  grok {
    match => [ "message", "%{DATESTAMP:Logdate} \[%{WORD:Severity}\] %{JAVACLASS:Class} %{GREEDYDATA:Stack}" ]
  }
  date {
    match => [ "Logdate", "dd-MM-yyyy hh:mm:ss,SSS" ]
  }
}
Logstash Command line trace
{
  "@timestamp" => "2014-08-26T08:16:18.021Z",
  "message" => "26-08-2014 11:16:18,021 [DEBUG] com.fnx.snapshot.mdb.SnapshotMDB - SnapshotMDB Ctor is called\r",
  "@version" => "1",
  "host" => "bts10d1",
  "path" => "D:\\ElasticSearch\\logstash-1.4.2\\Dashboard_A\\Log_1\\6.log",
  "type" => "dashboard_a",
  "Logdate" => "26-08-2014 11:16:18,021",
  "Severity" => "DEBUG",
  "Class" => "com.fnx.snapshot.mdb.SnapshotMDB",
  "Stack" => " - SnapshotMDB Ctor is called\r"
}
ElasticSearch result
{
  "_index": "logstash-2014.08.28",
  "_type": "dashboard_a",
  "_id": "-y23oNeLQs2mMbyz6oRyew",
  "_score": 1,
  "_source": {
    "@timestamp": "2014-08-28T14:31:38.753Z",
    "message": "15:07,565 [DEBUG] com.fnx.snapshot.mdb.SnapshotMDB - SnapshotMDB Ctor is called\r",
    "@version": "1",
    "host": "bts10d1",
    "path": "D:\\ElasticSearch\\logstash-1.4.2\\Dashboard_A\\Log_1\\6.log",
    "type": "dashboard_a",
    "tags": ["_grokparsefailure"]
  }
}
Please make sure all your logs are in the same format!
You can see in the Logstash command line trace that the log is
26-08-2014 11:16:18,021 [DEBUG] com.fnx.snapshot.mdb.SnapshotMDB - SnapshotMDB Ctor is called\r
but in Elasticsearch the log is
15:07,565 [DEBUG] com.fnx.snapshot.mdb.SnapshotMDB - SnapshotMDB Ctor is called\r
The two logs have different times and their formats are not the same; the second one has no date information at all, which is why it causes a grok parse failure. Please check the original logs, or provide a sample of them for further discussion if they are all supposed to be in the same format.
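If a few lines really are in a different format, one possible way to keep them visible while you investigate (index names here are only an illustration) is to route grok failures to their own index instead of mixing them in:
output {
  if "_grokparsefailure" in [tags] {
    # unparsed lines go to a separate index for inspection
    elasticsearch { index => "unparsed-%{+YYYY.MM.dd}" }
  } else {
    elasticsearch { index => "logstash-%{+YYYY.MM.dd}" }
  }
}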

Leave out default Logstash fields in ElasticSearch

After processing data with input | filter | output > Elasticsearch, the format it gets stored in is somewhat like:
"_index": "logstash-2012.07.02",
"_type": "stdin",
"_id": "JdRaI5R6RT2do_WhCYM-qg",
"_score": 0.30685282,
"_source": {
"#source": "stdin://dist/",
"#type": "stdin",
"#tags": [
"tag1",
"tag2"
],
"#fields": {},
"#timestamp": "2012-07-02T06:17:48.533000Z",
"#source_host": "dist",
"#source_path": "/",
"#message": "test"
}
I filter/store most of the important information in specific fields. Is it possible to leave out default fields like @source_path and @source_host? In the near future it's going to store 8 billion logs/month, and I would like to run some performance tests with these default fields excluded (I just don't use them).
This removes fields from output:
filter {
  mutate {
    # remove duplicate fields
    # this leaves timestamp from message and source_path for source
    remove => ["@timestamp", "@source"]
  }
}
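Note that remove comes from the old Logstash 1.x mutate filter; on current versions the equivalent option is remove_field, e.g. for the two fields mentioned in the question:
filter {
  mutate {
    remove_field => [ "@source_host", "@source_path" ]
  }
}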
Some of that will depend on what web interface you are using to view your logs. I'm using Kibana and a custom logger (C#) that indexes the following:
{
  "_index": "logstash-2013.03.13",
  "_type": "logs",
  "_id": "n3GzIC68R1mcdj6Wte6jWw",
  "_version": 1,
  "_score": 1,
  "_source": {
    "@source": "File",
    "@message": "Shalom",
    "@fields": {
      "tempor": "hit"
    },
    "@tags": [
      "tag1"
    ],
    "level": "Info",
    "@timestamp": "2013-03-13T21:47:51.9838974Z"
  }
}
This shows up in Kibana, and the source fields are not there.
To exclude certain fields you can use the prune filter plugin.
filter {
  prune {
    blacklist_names => [ "@timestamp", "@source" ]
  }
}
The prune filter is not a default Logstash plugin and must be installed first:
bin/logstash-plugin install logstash-filter-prune
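If it is easier to say what you want to keep than what you want to drop, prune also supports a whitelist; a sketch with illustrative field names:
filter {
  prune {
    # keep only the listed fields and drop everything else
    whitelist_names => [ "@timestamp", "@message", "level" ]
  }
}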
