Elasticsearch maintenance an unique _id across the aliases - elasticsearch

We have ES data where we have several indexes belong to the same alias. One of them is a written index.
How can we keep the _id of documents is unique across the indexes belong to the same alias?
We are right now having a duplicated _id on our alias. Each index has 1 record of the same id. We only want the lastest record of that _id on our data, the newer will overwrite the older.

If i correctly understand the problem, you can have uniqueness of data by using _id value as a fingerprint value via logstash [ assuming its being used].
You can have something like the below in your logstash filter:
fingerprint{
source => ["session_id"]
method => "SHA1"
}
This value in the fingerprint field can then be used to put the data in an index and updated on top of an already existing document.
Below is an example of output section in logstash:
elasticsearch {
hosts => ["http://elasticsearch:9200"]
index => "indexname"
action => "update"
document_id => "%{fingerprint}"
doc_as_upsert => true
}

Related

How in Logstash update index with new data?

I have PostgreSQL 10 database with table. 7000 new data comes into the table every hour.
In Logstash 6.4 I have such .conf file which create index in Elasticsearch.
.conf:
input {
jdbc {
jdbc_connection_string => "jdbc:postgresql://#host:#port/#database"
jdbc_user => "#username"
jdbc_password => "#password"
jdbc_driver_library => "C:\postgresql-42.2.5.jar"
jdbc_driver_class => "org.postgresql.Driver"
statement => "SELECT * from table_name"
}
}
output {
elasticsearch {
hosts => ["localhost:9200"]
index => "table_name"
}
}
Questions:
How update existing index with new data which appeared in table?
What is the maximum amount of data index can store? Could there be an overflow?
How update existing index with new data which appeared in table?
Index table_name is automatically updated with new entries added to your database table. However, if any existing entries are updated in database table then they are added into the index as new documents with a new document id. Instead, if you would like the existing document in ES updated, use a column name which has unique values and assign it as document id. This way if an existing entry in database is updated the corresponding document in ES is overwritten with latest values.
Use document_id => "%{column_name_with_unique_values>}" in output configuration
What is the maximum amount of data index can store? Could there be an overflow?
It depends on your resources really. However, for optimal performance it is recommended to keep your shard size between 20 - 40 GB. If your index has 5 primary shards you can store about 200 GB of data in a single index. Anything above that consider storing data in a new index. Ideally, use time series indices such as daily or monthly such that it becomes easier to maintain for ex. to archive & backup and then purge.

update multiple records in elastic using logstash

Hi guy i have issue with updating multiple records in elastic using logstash.
My logstash configuration is bellow
output {
elasticsearch {
hosts => "******"
user => "xxxxx"
password => "yyyyyy"
index => "index_name"
document_type => "doc_type"
action => "update"
script_lang => "painless"
script_type => "inline"
document_id => "%{Id}"
script => 'ctx._source.Tags = params.event.get("Tags");'
}
}
My output to logstash dump folder looks like:
{"index_name":"feed_name","doc_type":"doc_type","Id":["b504d808-f82d-4eaa-b192-446ec0ba487f", "1bcbc54f-fa7a-4079-90e7-71da527f56a5"],"es_action":"update","Tags": ["tag1","tag2"]}
My biggest issue here is that I am not able to update those two recods at once but I need to create two records each with different ID.
Is there a why to solve this by writing query in my output configuration?
In sql that would look someting like this:
Update Table
SET Tags
WHERE ID in (guid1, guid2)
I know that in this case I can add two records in logstash and problem solved but I need to solve second issue where I need to replace all records that have one tag1 and give it newTag.
Have you considered to use the split filter in order to clone the event in events with one id each one? It seems the filter can help you.

Logstash doc_as_upsert cross index in Elasticsearch to eliminate duplicates

I have a logstash configuration that uses the following in the output block in an attempt to mitigate duplicates.
output {
if [type] == "usage" {
elasticsearch {
hosts => ["elastic4:9204"]
index => "usage-%{+YYYY-MM-dd-HH}"
document_id => "%{[#metadata][fingerprint]}"
action => "update"
doc_as_upsert => true
}
}
}
The fingerprint is calculated from a SHA1 hash of two unique fields.
This works when logstash sees the same doc in the same index, but since the command that generates the input data doesn't have a reliable rate at which different documents appear, logstash will sometimes insert duplicates docs in a different date stamped index.
For example, the command that logstash runs to get the input generally returns the last two hours of data. However, since I can't definitively tell when a doc will appear/disappear, I tun the command every fifteen minutes.
This is fine when the duplicates occur within the same hour. However, when the hour or day date stamp rolls over, and the document still appears, elastic/logstash thinks it's a new doc.
Is there a way to make the upsert work cross index? These would all be the same type of doc, they would simply apply to every index that matches "usage-*"
A new index is an entirely new keyspace and there's no way to tell ES to not index two documents with the same ID in two different indices.
However, you could prevent this by adding an elasticsearch filter to your pipeline which would look up the document in all indices and if it finds one, it could drop the event.
Something like this would do (note that usages would be an alias spanning all usage-* indices):
filter {
elasticsearch {
hosts => ["elastic4:9204"]
index => "usages"
query => "_id:%{[#metadata][fingerprint]}"
fields => {"_id" => "other_id"}
}
# if the document was found, drop this one
if [other_id] {
drop {}
}
}

Logstash Dynamic Index From Document Field Fails

I still face problems to figure out, how to tell Logstash to send a dynamic index, based on a document field. Furthermore, this Field must be transformed in order to get the "real" index at the very end.
Given, that there is a field "time" (which is a UNIX Timestamp). This Field gets already transformed with a "date" Filter to a DateTime Object for Elastic.
Additionally, it should server as index (YYYYMM). The index should NOT be derived from #Timestamp, which is not touched.
Example:
{...,"time":1453412341,...}
Shall go to the Index: 201601
I use the following Config:
filter {
date {
match => [ "time", "UNIX" ]
target => "time"
timezone => "Europe/Berlin"
}
}
output {
elasticsearch {
index => "%{time}%{+YYYYMM}"
document_type => "..."
document_id => "%{ID}"
hosts => "..."
}
}
Sadly, its not working. Any idea, how to achieve that?
Thanks a lot!
The "%{+YYYYMM}" says to use the date values from #timestamp. If you want an index named after the YYYYMM in %{time}, you need to make a string out of that date field and then reference that string in the output stanza. There might be a mutate{} that would do it, or drop into ruby{}.
In most installations, you want to set #timestamp to the event's value. The default of logstash's own time is not very useful (imagine if your events were delayed by an hour during processing). If you did that, then %{+YYYYMM}" would work just fine.
This is caused because the index name is created based on UTC time by default.

To copy an index from one machine to another in elasticsearch

I have some indexes in one of my machines. I need to copy them to another machine, how can i do that in elasticsearch.
I did get some good documentation here, but since im an newbie to elasticsearch ecosystem and since im toying with lesser data indices, I thought I would use some plugins or ways which would be less time consuming.
I would use Logstash with an elasticsearch input plugin and an elasticsearch output plugin.
After installing Logstash, you can create a configuration file copy.conf that looks like this:
input {
elasticsearch {
hosts => "localhost:9200" <--- source ES host
index => "source_index"
}
}
filter {
mutate {
remove_field => [ "#version", "#timestamp" ] <--- remove added junk
}
}
output {
elasticsearch {
host => "localhost" <--- target ES host
port => 9200
protocol => "http"
manage_template => false
index => "target_index"
document_id => "%{id}" <--- name of your ID field
workers => 1
}
}
And then after setting the correct values (source/target host + source/target index), you can run this with bin/logstash -f copy.conf
I can see 3 options here
Snapshot/Restore - You can move your data across geographical locations.
Logstash reindex - As pointed out by Val
Stream2ES - This is a more simpler solution
You can use Snapshot and restore feature as well, where you can take snapshot (backup) of one index and then can Restore to somewhere else.
Just have a look at
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-snapshots.html

Resources