How to update an index with new data in Logstash? - elasticsearch

I have a PostgreSQL 10 database with a table. About 7000 new rows come into the table every hour.
In Logstash 6.4 I have the following .conf file, which creates an index in Elasticsearch.
.conf:
input {
  jdbc {
    jdbc_connection_string => "jdbc:postgresql://#host:#port/#database"
    jdbc_user => "#username"
    jdbc_password => "#password"
    jdbc_driver_library => "C:\postgresql-42.2.5.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    statement => "SELECT * from table_name"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "table_name"
  }
}
Questions:
How to update the existing index with new data that appears in the table?
What is the maximum amount of data an index can store? Could there be an overflow?

How to update the existing index with new data that appears in the table?
The table_name index is automatically updated with new entries added to your database table. However, if existing entries are updated in the database table, they are added to the index as new documents with new document ids. If you would instead like the existing documents in ES to be updated, use a column that has unique values and assign it as the document id. That way, when an existing entry in the database is updated, the corresponding document in ES is overwritten with the latest values.
Use document_id => "%{column_name_with_unique_values}" in the output configuration.
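For example, here is a minimal sketch of the output, assuming the table has a unique id column that shows up on the event as a field named id (a hypothetical column name):
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "table_name"
    # "id" is assumed to be a column with unique values;
    # an updated row then overwrites the existing document instead of adding a new one
    document_id => "%{id}"
  }
}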
What is the maximum amount of data an index can store? Could there be an overflow?
It really depends on your resources. However, for optimal performance it is recommended to keep your shard size between 20 and 40 GB. If your index has 5 primary shards, you can store about 200 GB of data in a single index. Above that, consider storing data in a new index. Ideally, use time-series indices (daily or monthly, for example) so that the data becomes easier to maintain, e.g. to archive, back up and then purge.
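As a sketch of the time-series approach, the index name in the output can embed a date pattern so that, for example, a new monthly index is created automatically (the pattern shown is only an illustration):
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    # one index per month, e.g. table_name-2018.11
    index => "table_name-%{+YYYY.MM}"
  }
}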

Related

logstash elasticsearch input plugin query does not execute on schedule

I want to import the latest data from one Elasticsearch cluster index to another index, so I have made the following input plugin settings:
elasticsearch {
  hosts => ["xxxxx"]
  user => "xxxx"
  password => "xxxxx"
  index => "xxxxxx"
  query => '{"sort":[{"timestamp":{"order":"desc"}}]}'
  schedule => "*/1 * * * *"
  size => 10000
  scroll => "1m"
  docinfo => true
}
The data was imported into the target index successfully, but the timestamp was not the latest: for the data I get in the target index, the timestamp is the time I started Logstash, and the query is not executed on the schedule.
I want to know whether the elasticsearch input plugin will keep importing data until everything found by the query has been ingested, and only then start another job.

Elasticsearch: maintaining a unique _id across the indices of an alias

We have ES data where several indexes belong to the same alias. One of them is the write index.
How can we keep the _id of documents unique across the indexes belonging to the same alias?
Right now we have duplicated _id values in our alias: each index has one record with the same id. We only want the latest record for that _id in our data; the newer one should overwrite the older one.
If I understand the problem correctly, you can get uniqueness of data by using a fingerprint value as the _id via Logstash (assuming it is being used).
You can have something like the below in your logstash filter:
fingerprint {
  source => ["session_id"]
  method => "SHA1"
}
The value in the fingerprint field can then be used as the document id when putting the data into an index, so an already existing document gets updated instead of duplicated.
Below is an example of the output section in Logstash:
elasticsearch {
  hosts => ["http://elasticsearch:9200"]
  index => "indexname"
  action => "update"
  document_id => "%{fingerprint}"
  doc_as_upsert => true
}

update multiple records in elastic using logstash

Hi guys, I have an issue with updating multiple records in Elasticsearch using Logstash.
My Logstash configuration is below:
output {
  elasticsearch {
    hosts => "******"
    user => "xxxxx"
    password => "yyyyyy"
    index => "index_name"
    document_type => "doc_type"
    action => "update"
    script_lang => "painless"
    script_type => "inline"
    document_id => "%{Id}"
    script => 'ctx._source.Tags = params.event.get("Tags");'
  }
}
My output to logstash dump folder looks like:
{"index_name":"feed_name","doc_type":"doc_type","Id":["b504d808-f82d-4eaa-b192-446ec0ba487f", "1bcbc54f-fa7a-4079-90e7-71da527f56a5"],"es_action":"update","Tags": ["tag1","tag2"]}
My biggest issue here is that I am not able to update those two records at once; instead I have to create two events, each with a different ID.
Is there a way to solve this by writing a query in my output configuration?
In SQL that would look something like this:
UPDATE Table
SET Tags
WHERE ID IN (guid1, guid2)
I know that in this case I can add two records in Logstash and the problem is solved, but I also need to solve a second issue where I need to replace all records that have tag1 and give them newTag.
Have you considered using the split filter to clone the event into separate events, one per id? It seems that filter can help you.
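As a rough sketch, assuming the event carries the Id array exactly as in the dump above, a split filter would turn it into one event per Id value, which the existing output then updates via %{Id}:
filter {
  # one event with Id => [guid1, guid2] becomes two events,
  # each carrying a single Id value (Tags stays on both)
  split {
    field => "Id"
  }
}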

How to control number of shards in Elastic index from logstash?

I would like to control how many shards a new index should have in my logstash output file. Ex:
10-output.conf:
output {
  if [type] == "mytype" {
    elasticsearch {
      hosts => [ "1.1.1.1:9200" ]
      index => "logstash-mytype-%{+YYYY.ww}"
      workers => 8
      flush_size => 1000
      ? <====== what option to control the number of index shards goes here?
    }
  }
}
From what I understand of the Logstash elasticsearch output options, this is not possible, and a new index will default to 5 shards?
The Logstash-Elasticsearch combination is designed to work differently from what you expect: in Elasticsearch you define an index template, and the number of shards is a setting of that template.
Whenever Logstash creates a new index by sending documents to it, Elasticsearch uses that index template (by matching the new index name against the template's index pattern) to actually create the index.
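A minimal sketch of such a template, assuming Elasticsearch 6.x and the example index name above (the shard count is just an illustration):
PUT _template/logstash-mytype
{
  "index_patterns": ["logstash-mytype-*"],
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}
The Logstash elasticsearch output also has manage_template, template and template_name options that can install a custom template file for you, if you prefer to keep the template next to the pipeline configuration.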

Logstash doc_as_upsert cross index in Elasticsearch to eliminate duplicates

I have a logstash configuration that uses the following in the output block in an attempt to mitigate duplicates.
output {
  if [type] == "usage" {
    elasticsearch {
      hosts => ["elastic4:9204"]
      index => "usage-%{+YYYY-MM-dd-HH}"
      document_id => "%{[@metadata][fingerprint]}"
      action => "update"
      doc_as_upsert => true
    }
  }
}
The fingerprint is calculated from a SHA1 hash of two unique fields.
This works when Logstash sees the same doc in the same index, but since the command that generates the input data doesn't have a reliable rate at which different documents appear, Logstash will sometimes insert duplicate docs in a differently date-stamped index.
For example, the command that Logstash runs to get the input generally returns the last two hours of data. However, since I can't definitively tell when a doc will appear/disappear, I run the command every fifteen minutes.
This is fine when the duplicates occur within the same hour. However, when the hour or day date stamp rolls over and the document still appears, elastic/logstash thinks it's a new doc.
Is there a way to make the upsert work cross index? These would all be the same type of doc, they would simply apply to every index that matches "usage-*"
A new index is an entirely new keyspace, and there's no way to tell ES not to index two documents with the same ID in two different indices.
However, you could prevent this by adding an elasticsearch filter to your pipeline which looks up the document in all indices and, if it finds one, drops the event.
Something like this would do (note that usages would be an alias spanning all usage-* indices):
filter {
  elasticsearch {
    hosts => ["elastic4:9204"]
    index => "usages"
    query => "_id:%{[@metadata][fingerprint]}"
    fields => {"_id" => "other_id"}
  }
  # if the document was found, drop this one
  if [other_id] {
    drop {}
  }
}
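For completeness, here is a sketch of creating the usages alias referred to above (alias name and index pattern taken from the answer; adjust to your setup):
POST /_aliases
{
  "actions": [
    { "add": { "index": "usage-*", "alias": "usages" } }
  ]
}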
