How to avoid elasticsearch duplicate documents - elasticsearch

How do I avoid duplicate documents in Elasticsearch?
The Elasticsearch index docs count (20,010,253) doesn't match the log line count (13,411,790).
From the file input plugin documentation:
File rotation is detected and handled by this input,
regardless of whether the file is rotated via a rename or a copy operation.
NiFi:
A real-time NiFi pipeline copies logs from the NiFi server to the ELK server.
NiFi has rolling log files.
Log line count on the ELK server:
wc -l /mnt/elk/logstash/data/from/nifi/dev/logs/nifi/*.log
13,411,790 total
Elasticsearch index docs count:
curl -XGET 'ip:9200/_cat/indices?v&pretty'
docs.count = 20,010,253
Logstash input conf file:
cat /mnt/elk/logstash/input_conf_files/test_4.conf
input {
  file {
    path => "/mnt/elk/logstash/data/from/nifi/dev/logs/nifi/*.log"
    type => "test_4"
    sincedb_path => "/mnt/elk/logstash/scripts/sincedb/test_4"
  }
}
filter {
  if [type] == "test_4" {
    grok {
      match => {
        "message" => "%{DATE:date} %{TIME:time} %{WORD:EventType} %{GREEDYDATA:EventText}"
      }
    }
  }
}
output {
  if [type] == "test_4" {
    elasticsearch {
      hosts => "ip:9200"
      index => "test_4"
    }
  } else {
    stdout {
      codec => rubydebug
    }
  }
}

You can use the fingerprint filter plugin: https://www.elastic.co/guide/en/logstash/current/plugins-filters-fingerprint.html
This can e.g. be used to create consistent document IDs when inserting events into Elasticsearch, allowing events in Logstash to cause existing documents to be updated rather than new documents to be created.
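For illustration, a minimal sketch of how the fingerprint filter could be combined with the config above, reusing the message field and the test_4 index from the question (the key value is a placeholder, and the snippet would need to be merged into the existing type conditionals):
filter {
  fingerprint {
    source => "message"                    # hash the raw log line
    target => "[@metadata][fingerprint]"   # keep the hash out of the indexed document
    method => "SHA256"
    key => "some-key"                      # placeholder; when set, a keyed (HMAC) digest is used
  }
}
output {
  elasticsearch {
    hosts => "ip:9200"
    index => "test_4"
    document_id => "%{[@metadata][fingerprint]}"   # same line re-read => same _id => update, not a duplicate
  }
}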

Related

Duplicate field values for grok-parsed data

I have a Filebeat that captures logs from a uWSGI application running in Docker. The data is sent to Logstash, which parses it and forwards it to Elasticsearch.
Here is the Logstash conf file:
input {
  beats {
    port => 5044
  }
}
filter {
  grok {
    match => { "log" => "\[pid: %{NUMBER:worker.pid}\] %{IP:request.ip} \{%{NUMBER:request.vars} vars in %{NUMBER:request.size} bytes} \[%{HTTPDATE:timestamp}] %{URIPROTO:request.method} %{URIPATH:request.endpoint}%{URIPARAM:request.params}? => generated %{NUMBER:response.size} bytes in %{NUMBER:response.time} msecs(?: via sendfile\(\))? \(HTTP/%{NUMBER:request.http_version} %{NUMBER:response.code}\) %{NUMBER:headers} headers in %{NUMBER:response.size} bytes \(%{NUMBER:worker.switches} switches on core %{NUMBER:worker.core}\)" }
  }
  date {
    # 29/Oct/2018:06:50:38 +0700
    match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
  kv {
    source => "request.params"
    field_split => "&?"
    target => "request.query"
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "test-index"
  }
}
Everything was fine, but I've noticed that all values captured by the grok pattern are duplicated. Here is how it looks in Kibana:
Note that raw fields such as log, which are not grok output, are fine. I've seen that the kv filter has an allow_duplicate_values parameter, but it doesn't apply to grok.
What is wrong with my configuration? Also, is it possible to rerun grok patterns on existing data in Elasticsearch?
Maybe your Filebeat is already doing the job and creating these fields.
Did you try adding this parameter to your grok?
overwrite => [ "request.ip", "request.endpoint", ... ]
To rerun grok on already indexed data, you need to use the elasticsearch input plugin to read the data from ES and re-index it after grok.
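Putting both suggestions together, a rough sketch rather than a tested pipeline: read the already indexed events back with the elasticsearch input, re-run the same grok with overwrite, and write to a new index. The query and the test-index-reparsed target name are placeholders:
input {
  elasticsearch {
    hosts => "localhost:9200"
    index => "test-index"                           # the index that already holds the duplicated fields
    query => '{ "query": { "match_all": {} } }'     # placeholder query: re-read everything
  }
}
filter {
  grok {
    # same pattern as in the question, with overwrite so re-parsed values replace the stored ones
    match => { "log" => "\[pid: %{NUMBER:worker.pid}\] %{IP:request.ip} \{%{NUMBER:request.vars} vars in %{NUMBER:request.size} bytes} \[%{HTTPDATE:timestamp}] %{URIPROTO:request.method} %{URIPATH:request.endpoint}%{URIPARAM:request.params}? => generated %{NUMBER:response.size} bytes in %{NUMBER:response.time} msecs(?: via sendfile\(\))? \(HTTP/%{NUMBER:request.http_version} %{NUMBER:response.code}\) %{NUMBER:headers} headers in %{NUMBER:response.size} bytes \(%{NUMBER:worker.switches} switches on core %{NUMBER:worker.core}\)" }
    overwrite => [ "request.ip", "request.endpoint" ]   # extend with the other captured field names as needed
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "test-index-reparsed"                      # placeholder target index
  }
}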

Multiple logs in a single config file to elasticsearch

I want to send logs from different locations to Elasticsearch using a Logstash conf file.
input {
  file {
    path => "C:/Users/611166850/Desktop/logstash-5.0.2/logstash-5.0.2/logs/logs/ce.log"
    type => "CE"
    start_position => "beginning"
  }
  file {
    path => "C:/Users/611166850/Desktop/logstash-5.0.2/logstash-5.0.2/logs/logs/spovp.log"
    type => "SP"
    start_position => "beginning"
  }
  file {
    path => "C:/Users/611166850/Desktop/logstash-5.0.2/logstash-5.0.2/logs/logs/ovpportal_log"
    type => "OVP"
    start_position => "beginning"
  }
}
output {
  elasticsearch {
    action => "index"
    hosts => "localhost:9200"
    index => "multiple"
    codec => json
    workers => 1
  }
}
This is the config file I use, but Kibana is not recognising this index. Can someone help with this?
Thanks in advance, Rashmi
Check Logstash's log file for errors (maybe your config file is not correct).
Also query ES directly for the expected index; maybe the problem is not Kibana and you simply don't have any index with this name.
Try starting Logstash in debug mode to see if there are any logs in it.
You can also try writing the Logstash output to a file on the local system rather than sending it directly to Elasticsearch. Uncomment a block as per your requirement:
# block-1
# if "_grokparsefailure" not in [tags] {
#   stdout {
#     codec => rubydebug { metadata => true }
#   }
# }
# block-2
# if "_grokparsefailure" not in [tags] {
#   file {
#     path => "/tmp/out-try1.logstash"
#   }
# }
By either of these methods you can get the output to the console or to a file. Comment out the _grokparsefailure check in case you don't see any output in the file.
Note: in Kibana, default indices have @timestamp in their fields, so check
1. whether Kibana is able to recognize the index if you uncheck the time-field checkbox on the page where you create the new index pattern
2. whether your logs are properly parsed; if not, you need to work out grok filters with patterns matching your logs
All Elasticsearch indices are visible at http://elasticsearch-ip:9200/_cat/indices?v (your Elasticsearch IP), so try that too and share what you find.
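For example, the last two checks from a shell (the conf file name is a placeholder; on Windows use bin\logstash.bat):
# start Logstash with verbose logging to watch the file inputs and the elasticsearch output
bin/logstash -f multiple.conf --log.level=debug
# list all indices directly on Elasticsearch; "multiple" should show up here if indexing worked
curl -XGET 'http://elasticsearch-ip:9200/_cat/indices?v'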

Can't access Elasticsearch index name metadata in Logstash filter

I want to add the Elasticsearch index name as a field in the event when processing in Logstash. This is supposed to be pretty straightforward, but the index name does not get printed out. Here is the complete Logstash config.
input {
  elasticsearch {
    hosts => "elasticsearch.example.com"
    index => "*-logs"
  }
}
filter {
  mutate {
    add_field => {
      "log_source" => "%{[@metadata][_index]}"
    }
  }
}
output {
  elasticsearch {
    index => "logstash-%{+YYYY.MM}"
  }
}
This results in log_source being set to the literal %{[@metadata][_index]} and not the actual name of the index. I have tried this with _id and without the underscores, but it always just outputs the reference and not the value.
Doing just %{[@metadata]} crashes Logstash with an error that it's trying to access the list incorrectly, so [@metadata] is being set, but it seems like _index or any other values are missing.
Does anyone have another way of assigning the index name to the event?
I am using 5.0.1 of both Logstash and Elasticsearch.
You're almost there, you're simply missing the docinfo setting, which is false by default:
input {
  elasticsearch {
    hosts => "elasticsearch.example.com"
    index => "*-logs"
    docinfo => true
  }
}
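Put together, a minimal sketch of the whole pipeline: with docinfo enabled, the default docinfo_fields (_index, _type, _id) are exposed under [@metadata], so the original mutate should now resolve:
input {
  elasticsearch {
    hosts => "elasticsearch.example.com"
    index => "*-logs"
    docinfo => true                 # copy _index, _type and _id into [@metadata] for each event
  }
}
filter {
  mutate {
    add_field => { "log_source" => "%{[@metadata][_index]}" }   # resolves to the source index name
  }
}
output {
  elasticsearch {
    index => "logstash-%{+YYYY.MM}"
  }
}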

Logstash configuration issue

I am new to Logstash and Elasticsearch, and I would like to index .pdf or .doc files in Elasticsearch via Logstash.
I configured Logstash with the multiline codec to get each file as a single message in Elasticsearch. Below is my configuration file:
input {
  file {
    path => "D:/BaseCV/*"
    codec => multiline {
      # Grok pattern names are valid! :)
      pattern => ""
      what => "previous"
    }
  }
}
output {
  stdout {
    codec => "rubydebug"
  }
  elasticsearch {
    hosts => "localhost"
    index => "cvindex"
    document_type => "file"
  }
}
When Logstash starts, the first file I add is indexed in Elasticsearch as a single message, but the following files are spread over several messages. I would like a one-to-one correspondence: 1 file = 1 message.
Is this possible? What should I change in my setup to solve the problem?
Thank you for your feedback.
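A hedged guess rather than a confirmed fix: the multiline codec caps an event at max_lines (500 by default) and max_bytes (10 MiB by default), which would explain larger files being split into several messages. A sketch that raises those limits and, if the codec version supports it, flushes the buffered event once the file goes idle (the limit values are placeholders):
input {
  file {
    path => "D:/BaseCV/*"
    codec => multiline {
      pattern => ""                  # as in the question: every line is folded into the previous event
      what => "previous"
      max_lines => 100000            # placeholder: raise the default 500-line cap
      max_bytes => "50 MiB"          # placeholder: raise the default 10 MiB cap
      auto_flush_interval => 5       # emit the buffered event after 5 s without new lines
    }
  }
}
Note that this only controls how lines are grouped into events; it does not extract text from binary .pdf/.doc content.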

Logstash does not insert into elasticsearch (from file piped into stdin)

I am trying to insert the NYC taxi data into Elasticsearch via Logstash,
with the command bin/logstash -f myconf.conf < /trip_data1.csv
But I get...
Logstash startup completed
Logstash shutdown completed
where apparently nothing happened. No index has been created or modified in Elasticsearch either.
What am I doing wrong?
Here is my conf file:
input {
  stdin {}
}
filter {
  csv {
    columns => ["medallion","hack_license","vendor_id","rate_code","store_and_fwd_flag","pickup_datetime","dropoff_datetime","passenger_count","trip_time_in_secs","trip_distance","pickup_longitude","pickup_latitude","dropoff_longitude","dropoff_latitude"]
    separator => ","
  }
}
output {
  elasticsearch {
    index => "samples1"
    document_type => "sample"
  }
  stdout { codec => rubydebug }
}
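In the spirit of the earlier answers, a hedged debugging variant rather than a confirmed fix: read the CSV with the file input and start_position => "beginning" instead of piping it through stdin, make the Elasticsearch hosts explicit, and keep rubydebug on stdout to see whether events flow at all. The path, sincedb location and host are placeholders:
input {
  file {
    path => "/trip_data1.csv"                  # placeholder: absolute path to the CSV file
    start_position => "beginning"              # read the existing content, not just new lines
    sincedb_path => "/tmp/trip_data1.sincedb"  # placeholder: where Logstash remembers its read position
  }
}
filter {
  csv {
    columns => ["medallion","hack_license","vendor_id","rate_code","store_and_fwd_flag","pickup_datetime","dropoff_datetime","passenger_count","trip_time_in_secs","trip_distance","pickup_longitude","pickup_latitude","dropoff_longitude","dropoff_latitude"]
    separator => ","
  }
}
output {
  stdout { codec => rubydebug }                # if nothing prints here, events never enter the pipeline
  elasticsearch {
    hosts => ["http://localhost:9200"]         # placeholder: the original config relied on the default host
    index => "samples1"
    document_type => "sample"
  }
}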
