Upsert documents in Elasticsearch using custom ID field - elasticsearch

I am trying to load/ingest data from some log files that are almost a replica of the data stored in a 3rd-party vendor's DB. The data consists of pipe-separated key-value pairs, and I am able to split it up using the kv filter plugin in Logstash.
Sample data -
1.) TABLE="TRADE"|TradeID="1234"|Qty=100|Price=100.00|BuyOrSell="BUY"|Stock="ABCD Inc."
If we receive a modification to the above record,
2.) TABLE="TRADE"|TradeID="1234"|Qty=120|Price=101.74|BuyOrSell="BUY"|Stock="ABCD Inc."
we need to update the record that was created by the first entry. So I need to make TradeID the document ID and upsert the records so there is no duplication of records with the same TradeID.
My logstash.conf looks somewhat like this:
input {
  file {
    path => "some path"
  }
}
filter {
  kv {
    source => "message"
    field_split => "\|"
    value_split => "="
  }
}
output {
  elasticsearch {
    hosts => ["https://localhost:9200"]
    cacert => "path of .cert file"
    ssl => true
    ssl_certificate_verification => true
    index => "trade-index"
    user => "elastic"
    password => ""
  }
}

You need to update your elasticsearch output like below:
output {
  elasticsearch {
    hosts => ["https://localhost:9200"]
    cacert => "path of .cert file"
    ssl => true
    ssl_certificate_verification => true
    index => "trade-index"
    user => "elastic"
    password => ""
    # add the following to make it work as an upsert
    action => "update"
    document_id => "%{TradeID}"
    doc_as_upsert => true
  }
}
So when Logstash reads the first trade, the document with ID 1234 will not exist and will be upserted (i.e. created). When the second trade is read, the document exists and will be simply updated.
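Behind the scenes, each event is then sent as an update operation with doc_as_upsert enabled rather than a plain index operation. For the second sample line this is roughly equivalent to the request below (a sketch only; the plugin actually batches these through the _bulk API):
POST /trade-index/_update/1234
{
  "doc": {
    "TABLE": "TRADE",
    "TradeID": "1234",
    "Qty": "120",
    "Price": "101.74",
    "BuyOrSell": "BUY",
    "Stock": "ABCD Inc."
  },
  "doc_as_upsert": true
}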

Related

Update existing document of Elasticsearch and insert current record through logstash

I am trying to insert a record into Elasticsearch and also update a field of an existing document whose _id I'll be getting from the current record. After searching online, I found that we can use the _update_by_query API with the http output plugin in Logstash. This is the configuration below.
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "my_index_*"
    document_id => "%{id_field}"
  }
  http {
    url => "http://localhost:9200/my_index_*/_update_by_query"
    http_method => "post"
    content_type => "application/json"
    format => "message"
    message => '{"query":{"match":{"_id":"%{previous_record_id}"}},"script":{"source":"ctx._source.field_to_be_updated=xyz","lang":"painless"}}'
  }
}
Elasticsearch has no password protection, so I haven't added an authorization header.
But when I start Logstash, the current record gets inserted, but I always get the below error from the http plugin.
[2022-05-05T11:31:51,916][ERROR][logstash.outputs.http ][logstash_txe] [HTTP Output Failure] Encountered non-2xx HTTP code 400 {:response_code=>400, :url=>"http://localhost:9200/my_index_*/_update_by_query", :event=>#<LogStash::Event:0x192606f8>}
That's not how you're supposed to do it; you can simply use the elasticsearch output for both use cases.
The first one indexes the new record and the following one partially updates another record whose id is previous_record_id. The event data can be accessed in params.event within the script:
elasticsearch {
  hosts => ["localhost:9200"]
  index => "my_index_xyz"
  document_id => "%{previous_record_id}"
  action => "update"
  script => "ctx._source.field_to_be_updated = params.event.xyz"
  script_lang => "painless"
  script_type => "inline"
}
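Put together, the output section could look like the sketch below, assuming the index name my_index_xyz and the field names id_field, previous_record_id, field_to_be_updated and xyz used in the question:
output {
  # index the current record as its own document
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "my_index_xyz"
    document_id => "%{id_field}"
  }
  # partially update the previously indexed document referenced by previous_record_id
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "my_index_xyz"
    document_id => "%{previous_record_id}"
    action => "update"
    script => "ctx._source.field_to_be_updated = params.event.xyz"
    script_lang => "painless"
    script_type => "inline"
  }
}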

Logstash to Elasticsearch adding new data in the fields instead of overwriting the existing data?

My pipeline is like this: CouchDB -> Logstash -> Elasticsearch. Every time I update a field value in CouchDB, the data in Elasticsearch is overwritten. My requirement is that, when data in a field is updated in CouchDB, a new document should be created in Elasticsearch instead of overwriting the existing one.
My current logstash.conf is like this:
input {
  couchdb_changes {
    host => "<ip>"
    port => <port>
    db => "test_database"
    keep_id => false
    keep_revision => true
    initial_sequence => 0
    always_reconnect => true
    #sequence_path => "/usr/share/logstash/config/seqfile"
  }
}
output {
  if([doc][doc_type] == "HR") {
    elasticsearch {
      hosts => ["http://elasticsearch:9200"]
      index => "hrindex_new_1"
      document_id => "%{[doc][_id]}"
      user => elastic
      password => changeme
    }
  }
  if([doc][doc_type] == "SoftwareEngg") {
    elasticsearch {
      hosts => ["http://elasticsearch:9200"]
      index => "softwareenggindex_new"
      document_id => "%{[doc][_id]}"
      user => elastic
      password => changeme
    }
  }
}
How to do this?
You are using the document_id option in your elasticsearch output. What this option does is tell Elasticsearch to index the document using this value as the document id, which must be unique.
document_id => "%{[doc][_id]}"
So, if the field [doc][_id] in your source document has, for example, the value 1000, the _id field in Elasticsearch will have the same value.
When you change something in the source document that has [doc][_id] equal to 1000, it will replace the document with _id equal to 1000 in Elasticsearch, because the _id is unique.
To achieve what you want, you need to remove the document_id option from your outputs; this way Elasticsearch will generate a unique value for the _id field of each document.
elasticsearch {
  hosts => ["http://elasticsearch:9200"]
  index => "softwareenggindex_new"
  user => elastic
  password => changeme
}
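For illustration, after two consecutive updates of the same CouchDB document you would then end up with two separate documents in the index, each with its own auto-generated _id (the _id values below are made up, and the source is reduced to the fields mentioned in the question):
{ "_id": "kT5rr4ABx2a9cPq1Zk3f", "_source": { "doc": { "_id": "1000", "doc_type": "SoftwareEngg" } } }
{ "_id": "kj5rr4ABx2a9cPq1aU0q", "_source": { "doc": { "_id": "1000", "doc_type": "SoftwareEngg" } } }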

How hit_cache_size in logstash dns filter works?

I am using the dns filter in Logstash for my CSV file. In my CSV file, I have two fields: website and count.
Here's the sample content of my csv file:
|website|n|
|www.google.com|n1|
|www.yahoo.com|n2|
|www.bing.com|n3|
|www.stackoverflow.com|n4|
|www.smackcoders.com|n5|
|www.zoho.com|n6|
|www.quora.com|n7|
|www.elastic.co|n8|
Here's my logstash config file:
input {
  file {
    path => "/home/paulsteven/log_cars/cars_dns.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}
filter {
  csv {
    separator => ","
    columns => ["website","n"]
  }
  dns {
    resolve => [ "website" ]
    action => "replace"
    hit_cache_size => 8000
    hit_cache_ttl => 300
    failed_cache_size => 1000
    failed_cache_ttl => 10
  }
}
output {
  elasticsearch {
    hosts => "localhost:9200"
    index => "dnsfilter03"
    document_type => "details"
  }
  stdout {}
}
Here's the sample data passing through logstash:
{
  "@version" => "1",
  "path" => "/home/paulsteven/log_cars/cars_dns.csv",
  "website" => "104.28.5.86",
  "n" => "n21",
  "host" => "smackcoders",
  "message" => "www.smackcoders.com,n21",
  "@timestamp" => 2019-04-23T10:41:15.680Z
}
In the Logstash config file, I want to know about hit_cache_size. What is the use of it? I read the dns filter guide on the Elastic website but was unable to figure it out. I added the option to my Logstash config but nothing happened. Can I get any examples for it? I want to know what job hit_cache_size does in the dns filter.
The hit_cache_size option allows you to store the result of a successful request, so if you need to resolve the same host again, the filter will look into the cache instead and will only do a DNS lookup if the host is not cached.
If your data has only unique hosts, then there is no reason to use hit_cache_size, since each host appears only once.
The hit_cache_ttl option works together with hit_cache_size and says for how many seconds the result will be stored in the cache.
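As an illustration, here is the dns filter from the question again with the cache options annotated (the sizes and TTLs are just example values):
filter {
  dns {
    resolve => [ "website" ]
    action => "replace"
    # keep up to 8000 successful lookups in memory
    hit_cache_size => 8000
    # evict a cached successful lookup after 300 seconds, forcing a fresh resolution
    hit_cache_ttl => 300
    # also remember up to 1000 failed lookups so they are not retried on every event
    failed_cache_size => 1000
    # retry a failed host after 10 seconds
    failed_cache_ttl => 10
  }
}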

Logstash Update a document in elasticsearch

Trying to update a specific field in Elasticsearch through Logstash. Is it possible to update only a set of fields through Logstash?
Please find the code below,
input {
  file {
    path => "/**/**/logstash/bin/*.log"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    type => "multi"
  }
}
filter {
  csv {
    separator => "|"
    columns => ["GEOREFID","COUNTRYNAME", "G_COUNTRY", "G_UPDATE", "G_DELETE", "D_COUNTRY", "D_UPDATE", "D_DELETE"]
  }
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "logstash-data-monitor"
    query => "GEOREFID:%{GEOREFID}"
    fields => [["JSON_COUNTRY","G_COUNTRY"],
               ["XML_COUNTRY","D_COUNTRY"]]
  }
  if [G_COUNTRY] {
    mutate {
      update => { "D_COUNTRY" => "%{D_COUNTRY}" }
    }
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "logstash-data-monitor"
    document_id => "%{GEOREFID}"
  }
}
We are using the above configuration; when we use it, the null-value fields get removed instead of the null-value updates being skipped.
Data comes from 2 different sources. One is from an XML file and the other is from a JSON file.
XML log format : GEO-1|CD|23|John|892|Canada|31-01-2017|QC|-|-|-|-|-
JSON log format : GEO-1|AS|33|-|-|-|-|-|Mike|123|US|31-01-2017|QC
When one log is added, a new document will get created in the index. When the second log file is read, the existing document should get updated. The update should happen only in the first 5 fields if the log file is XML and in the last 5 fields if the log file is JSON. Please suggest how to do this in Logstash.
I tried with the above code. Please check; can anyone help with how to fix this?
For the Elasticsearch output to do any action other than index, you need to tell it to do something else.
elasticsearch {
  hosts => ["localhost:9200"]
  index => "logstash-data-monitor"
  action => "update"
  document_id => "%{GEOREFID}"
}
This should probably be wrapped in a conditional to ensure you're only updating records that need updating. There is another option, though: doc_as_upsert.
elasticsearch {
  hosts => ["localhost:9200"]
  index => "logstash-data-monitor"
  action => "update"
  doc_as_upsert => true
  document_id => "%{GEOREFID}"
}
This tells the plugin to insert if it is new, and update if it is not.
However, you're attempting to use two inputs to define a document. This makes things complicated. Also, you're not providing both inputs, so I'll improvise. To provide different output behavior, you will need to define two outputs.
input {
  file {
    path => "/var/log/xmlhome.log"
    [other details]
  }
  file {
    path => "/var/log/jsonhome.log"
    [other details]
  }
}
filter { [some stuff ] }
output {
  if [path] == '/var/log/xmlhome.log' {
    elasticsearch {
      [XML file case]
    }
  } else if [path] == '/var/log/jsonhome.log' {
    elasticsearch {
      [JSON file case]
      action => "update"
    }
  }
}
Setting it up like this will allow you to change the ElasticSearch behavior based on where the event originated.
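One possible way to fill in the placeholders, reusing the index, document_id and doc_as_upsert settings from earlier in this answer (a sketch only; adjust the paths and any credentials to your environment):
output {
  if [path] == '/var/log/xmlhome.log' {
    # XML records create the document for a GEOREFID (or update it if it already exists)
    elasticsearch {
      hosts => ["localhost:9200"]
      index => "logstash-data-monitor"
      action => "update"
      doc_as_upsert => true
      document_id => "%{GEOREFID}"
    }
  } else if [path] == '/var/log/jsonhome.log' {
    # JSON records only update the document created from the XML side
    elasticsearch {
      hosts => ["localhost:9200"]
      index => "logstash-data-monitor"
      action => "update"
      document_id => "%{GEOREFID}"
    }
  }
}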

separate indexes on logstash

Currently I have a Logstash configuration that pushes data to Redis, and an Elasticsearch server that pulls the data using the default index 'logstash'.
I've added another shipper and have successfully managed to move its data using the default index as well. My goal is to move and restore that data in a separate index; what is the best way to achieve this?
This is my current configuration using the default index:
shipper output:
output {
  redis {
    host => "my-host"
    data_type => "list"
    key => "logstash"
    codec => json
  }
}
elk input:
input {
  redis {
    host => "my-host"
    data_type => "list"
    key => "logstash"
    codec => json
  }
}
Try setting the index field in the output. Give it the name you want and then run that, so a separate index will be created for it.
input {
  redis {
    host => "my-host"
    data_type => "list"
    key => "logstash"
    codec => json
  }
}
output {
  stdout { codec => rubydebug }
  elasticsearch {
    index => "redis-logs"
    cluster => "cluster name"
  }
}
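If both shippers push to the same Redis list, another option is to tag events on the second shipper and route them by tag on the Elasticsearch side. This is only a sketch under that assumption; the tag name shipper2 is made up:
# on the second shipper, before the redis output
filter {
  mutate { add_tag => [ "shipper2" ] }
}

# on the indexer that reads from redis
output {
  if "shipper2" in [tags] {
    elasticsearch { index => "redis-logs" }
  } else {
    elasticsearch { index => "logstash-%{+YYYY.MM.dd}" }
  }
}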
