How to make Logstash replace old data?

I have an Oracle DB. Logstash retrieves data from Oracle and puts it into Elasticsearch.
But when Logstash runs its scheduled export every 5 minutes, Elasticsearch fills up with copies, because the old data is still there. This is expected: the Oracle data barely changes during those 5 minutes; say 2-3 rows are added and 4-5 deleted.
How can we replace the old data with the new data, without duplicates?
For example:
Delete the whole old index;
Create a new index with the same name and the same configuration (nGram configuration and mapping);
Add all new data;
Wait for 5 minutes and repeat.

It's pretty easy: create a new index for each import and apply the mappings, then switch your alias to the most recent index. Remove old indices if needed. Your current data stays searchable while you index the most recent data.
Here are the sources you'll probably need to read:
Use aliases (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-aliases.html) to point searches at the most current data (by the way, it's always a good idea to have aliases in place).
Use the rollover API (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-rollover-index.html) to create a new index for each import run; note the alias handling there too.
Use index templates (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-templates.html) to automatically apply the mappings/settings to your newly created indices.
Shrink, close and/or delete old indices so your cluster only handles the data you really need. Have a look at Curator (https://github.com/elastic/curator) as a standalone tool.
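Putting those pieces together, here is a minimal sketch of one import cycle, assuming a cluster at localhost:9200, an index template already matching the hypothetical oracle-data-* pattern, and a made-up alias name:

```python
import requests
from datetime import datetime, timezone

ES = "http://localhost:9200"   # assumed local cluster
ALIAS = "oracle-data"          # hypothetical alias name

# 1. Create a fresh index for this import run. An index template matching
#    "oracle-data-*" is assumed to apply the nGram settings and mapping.
index = f"oracle-data-{datetime.now(timezone.utc):%Y%m%d%H%M%S}"
requests.put(f"{ES}/{index}").raise_for_status()

# 2. ... bulk-load the fresh Oracle export into `index` here ...

# 3. Atomically repoint the alias, so searches never see a half-built
#    index. (On the very first run, just "add" the alias; the "remove"
#    would fail because no index holds it yet.)
requests.post(f"{ES}/_aliases", json={"actions": [
    {"remove": {"index": "oracle-data-*", "alias": ALIAS}},
    {"add":    {"index": index, "alias": ALIAS}},
]}).raise_for_status()

# 4. Delete indices older than the current one, or leave that to Curator.
```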

You just need to use a fingerprint/hash of each document, or a hash of the unique fields in each document, as the document id. That way, every import overwrites the same documents with their updated versions in place, while still adding new documents (a sketch follows below).
But this approach will not handle rows deleted from Oracle.
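In Logstash this is typically done with the fingerprint filter feeding the document_id option of the elasticsearch output. Outside Logstash, the same idea looks roughly like this (index and column names are made up):

```python
import hashlib
import json
import requests

ES = "http://localhost:9200"   # assumed local cluster

def doc_id(row):
    # Hash the columns that uniquely identify the row (names are made up);
    # the same row always maps to the same _id, so re-imports overwrite
    # in place instead of piling up duplicates.
    key = json.dumps([row["customer_id"], row["order_no"]])
    return hashlib.sha1(key.encode("utf-8")).hexdigest()

row = {"customer_id": 42, "order_no": "A-1001", "status": "shipped"}
requests.put(f"{ES}/orders/_doc/{doc_id(row)}", json=row).raise_for_status()
```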

Related

Elasticsearch: Update of existing record (which has custom routing param set) results in duplicate record, if custom routing is not set during update

Env Details:
Elasticsearch version 7.8.1
The routing param is optional in index settings.
As per ElasticSearch docs - https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-routing-field.html
When indexing documents specifying a custom _routing, the uniqueness of the _id is not guaranteed across all of the shards in the index. In fact, documents with the same _id might end up on different shards if indexed with different _routing values.
We have landed in exactly this scenario: earlier we were using a custom routing param (let's say customerId), and for some reason we now need to remove custom routing.
That means the docId will now be used as the default routing value. During index operations this creates a duplicate record with the same id on a different shard, whereas earlier (before removing custom routing) it resulted in an update of the record, as expected.
I am thinking of the following approaches to get out of this; please advise if you have a better approach to suggest. The key here is to AVOID DOWNTIME.
Approach 1:
As we receive an update request, let the duplicate record get created. Once the record without custom routing has been created, issue a delete request for the record with custom routing (see the sketch below).
CONS: Records that never receive an update will linger around with custom routing; we want to avoid this, as it might result in unforeseen scenarios in the future.
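For what it's worth, that delete has to carry the original routing value, or it won't reach the shard the old copy lives on. A minimal sketch (index name, document id and routing value are made up):

```python
import requests

ES = "http://localhost:9200"   # assumed local cluster

# The old copy was indexed with routing=customerId, so the delete must
# pass the same routing value to be sent to the right shard.
requests.delete(f"{ES}/customers/_doc/123",
                params={"routing": "42"}).raise_for_status()
```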
Approach 2:
We use the Reindex API to migrate data to a new index (turning off custom routing during the migration). The application will use the new index after a successful migration.
CONS: Some of our indexes are huge and take 12+ hours to reindex, and since the Reindex API works from a point-in-time snapshot, it will not migrate newer records created during that 12-hour window. This forces a downtime approach.
Please suggest an alternative if you have faced this before.
Thanks @Val. I also found a few other approaches, such as writing to both indexes and reading from the old one, then shifting reads to the new one once re-indexing is finished. Something along the following lines (see the sketch after this list):
Create aliases pointing to the old indices (*_v1).
Point the application at these aliases instead of the actual indices.
Create new indices (*_v2) with the same mapping.
Move data from the old indices to the new ones using reindexing, making sure we don't retain custom routing during this.
After re-indexing, change the aliases to point to the new indices instead of the old ones (need to verify this, but there are easy alternatives if it doesn't work).
Once verification is done, delete the old indices.
What do we do in the transition period (the window between reindexing start and reindexing finish)?
Write to both indices (old and new) and read from the old indices via the aliases.
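A rough sketch of the reindex and alias-swap steps, assuming a cluster at localhost:9200 and hypothetical index/alias names; the dest "routing": "discard" option is what strips the stored custom routing during the copy:

```python
import requests

ES = "http://localhost:9200"   # assumed local cluster

# Reindex v1 -> v2. By default _reindex preserves each document's custom
# routing; "discard" drops it, so v2 routes purely by _id.
requests.post(f"{ES}/_reindex?wait_for_completion=false", json={
    "source": {"index": "customers_v1"},
    "dest": {"index": "customers_v2", "routing": "discard"},
}).raise_for_status()

# ... dual-write to both indices while the task runs, read via the alias ...

# Once v2 has caught up, swap reads atomically.
requests.post(f"{ES}/_aliases", json={"actions": [
    {"remove": {"index": "customers_v1", "alias": "customers"}},
    {"add": {"index": "customers_v2", "alias": "customers"}},
]}).raise_for_status()
```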

Editing and re-indexing large amounts of data in elasticsearch (millions of records)

I recently made a new version of an index for my elasticsearch data with some new fields included. I re-indexed from the old index, so the new index has all of the old data plus the new mapping that includes the new fields.
Now, I'd like to update all of my elasticsearch data in the index to include these new fields, which I can calculate by making some separate database + api calls to other sources.
What is the best way to do this, given that there are millions of records in the index?
Logistically speaking I'm not sure how to accomplish this... as in, how can I keep track of the records that I've updated? I've been reading about the scroll API, but I'm not certain it is viable because of the maximum scroll time of 24 hours (what if it takes longer than that?). Another serious consideration: since I need to make other database calls to calculate the new field values, I don't want to hammer that database for too long in a single session.
Would there be some way to run an update for say 10 minutes every night, but keep track of what records have been updated/need updating?
I'm just not sure about a lot of this, would appreciate any insights or other ideas on how to go about it.
You would need to run an update by query on your original index, which is expensive.
Alternatively, you might be able to use aliases pointing to the indices behind them; when you want to make a change, create a new index with the new mappings etc. and attach it to the alias, so new data coming in gets written correctly. Then reindex the "old" data into the new index.
That will depend on the details of what you're doing, though.
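One way to sidestep the scroll-timeout worry entirely: since the job adds a field that old documents lack, an exists query doubles as the progress tracker, so the work can run in bounded batches (say, from a 10-minute nightly job) and resume where it left off. A sketch, with every name hypothetical:

```python
import json
import requests

ES = "http://localhost:9200"   # assumed local cluster
INDEX = "myindex"              # hypothetical index name
NEW_FIELD = "new_field"        # hypothetical new field

def compute_value(source):
    # Placeholder for the external database/API calls that derive the
    # new field's value for one document.
    return "computed"

# Grab a batch of documents that still lack the new field. Because the
# update below adds that field, finished documents stop matching, so
# this query also tracks what is left to do.
batch = requests.post(f"{ES}/{INDEX}/_search", json={
    "size": 500,
    "query": {"bool": {"must_not": {"exists": {"field": NEW_FIELD}}}},
}).json()["hits"]["hits"]

# Partial bulk update: only the new field is touched.
lines = []
for hit in batch:
    lines.append(json.dumps({"update": {"_id": hit["_id"]}}))
    lines.append(json.dumps({"doc": {NEW_FIELD: compute_value(hit["_source"])}}))
if lines:
    requests.post(f"{ES}/{INDEX}/_bulk", data="\n".join(lines) + "\n",
                  headers={"Content-Type": "application/x-ndjson"}).raise_for_status()
```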

How to populate data for existing documents after new fields are added in mapping

Recently we came across a requirement to add 3 new fields to our existing index. We pull this data from our source database using Logstash, and we already have thousands of documents stored in the current index. In the past we were told that whenever an existing index changes (such as adding a new field), we need to reindex with a complete data reload, so that the previously indexed documents get the new fields populated with data.
Is there any other way we can achieve this without dropping the existing index or deleting any documents and reloading? I was hoping there is a better way of doing this with the latest 7.x version.
No need to drop the index and recreate it. After you have updated the mapping of the index, just upsert the documents into the index again with the new fields; each existing document will be overwritten by the new version.
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-put-mapping.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
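Since the data already flows through Logstash, the elasticsearch output can do these upserts directly (action => "update" with doc_as_upsert => true). For illustration, the raw bulk equivalent might look like this; note that a partial update merges the new fields into the stored _source rather than replacing the whole document (index name and field values are made up):

```python
import json
import requests

ES = "http://localhost:9200"   # assumed local cluster
INDEX = "myindex"              # hypothetical index name

# One partial update per document: "doc" merges the three new fields into
# the existing _source, and doc_as_upsert creates the document if missing.
docs = {"1": {"field_a": "x", "field_b": 1, "field_c": True}}  # made-up data
lines = []
for _id, new_fields in docs.items():
    lines.append(json.dumps({"update": {"_id": _id}}))
    lines.append(json.dumps({"doc": new_fields, "doc_as_upsert": True}))

requests.post(f"{ES}/{INDEX}/_bulk", data="\n".join(lines) + "\n",
              headers={"Content-Type": "application/x-ndjson"}).raise_for_status()
```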

Updating existing documents in ElasticSearch (ES) while using rollover API

I have a data source which will create a high number of entries that I'm planning to store in ElasticSearch.
The source creates two entries for the same document in ElasticSearch:
the 'init' part which records init-time and other details under a random key in ES
the 'finish' part which contains the main data, and updates the initially created document (merges) in ES under the init's random key.
I will need to use time-based indices in Elasticsearch, with an alias pointing to the actual index, using the rollover API.
For updates I'll use the update API to merge init and finish.
Question: if the init document with the random key is not in the current index (but in an older one that has already rolled over), would updating it by its key succeed? If not, what is the best practice to perform the update?
After some silence I set out to test it myself.
Short answer: after the index is rolled over under an alias, an update operation using the alias refers to the new index only, so it will create the document in the new index, resulting in two separate documents.
One way of solving this is to perform a search over the last 2 (or more, if needed) indices and figure out which concrete (non-alias) index name to use for the update.
The solution I prefer is to avoid the rollover altogether: calculate the index name from the required date field of the document, and create new indices from the application, using a template to define the mapping. This way, event sourcing and replaying the documents in order will yield the same indices.
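A minimal sketch of that preferred approach, assuming the date field is present in both the 'init' and 'finish' parts (field name, key, and index naming pattern are all made up); because both halves compute the same index name and use doc_as_upsert, arrival order doesn't matter:

```python
from datetime import datetime, timezone
import requests

ES = "http://localhost:9200"   # assumed local cluster

def index_for(doc):
    # Derive the index from the document's own date field, so the 'init'
    # and 'finish' parts always target the same index, no matter when
    # each of them arrives.
    ts = datetime.fromisoformat(doc["init_time"]).astimezone(timezone.utc)
    return f"events-{ts:%Y.%m.%d}"

# Whichever half arrives second merges into the first via doc_as_upsert.
init = {"init_time": "2020-05-01T10:00:00+00:00", "status": "init"}
requests.post(f"{ES}/{index_for(init)}/_update/some-random-key",
              json={"doc": init, "doc_as_upsert": True}).raise_for_status()
```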

Does ElasticSearch Snapshot/Restore functionality cause the data to be analyzed again during restore?

I have a decent amount of data in my Elasticsearch index. I changed the default analyzer for the index, so essentially I need to reindex my data so that it is analyzed again using the new analyzer. Instead of creating a test script that deletes all of the existing data in the ES index and re-adds it, I wondered whether there is a back-up/restore module I could use. I found the snapshot/restore module that ES supports - ElasticSearch-SnapshotAndRestore.
My question is: if I use the above ES snapshot/restore module, will it actually cause the data to be re-analyzed? Since I changed the default analyzer, I need the data to be reanalyzed. If not, is there an alternate tool/module you would suggest that allows a pure export and import of data, so the data is re-analyzed during import?
No, it does not re-analyze the data. You will need to reindex your data.
Fortunately that's fairly straightforward with Elasticsearch, as by default it stores the source of your documents:
Reindexing your data
While you can add new types to an index, or add new fields to a type, you can’t add new analyzers or make changes to existing fields. If you were to do so, the data that has already been indexed would be incorrect and your searches would no longer work as expected.
The simplest way to apply these changes to your existing data is just to reindex: create a new index with the new settings and copy all of your documents from the old index to the new index.
One of the advantages of the _source field is that you already have the whole document available to you in Elasticsearch itself. You don’t have to rebuild your index from the database, which is usually much slower.
To reindex all of the documents from the old index efficiently, use scan & scroll to retrieve batches of documents from the old index, and the bulk API to push them into the new index.
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/reindex.html
I'd read up on Scan and Scroll prior to taking this approach:
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/scan-scroll.html
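For reference, the scan & scroll plus bulk loop described above looks roughly like this (index names are hypothetical; on recent Elasticsearch versions the server-side _reindex API does the same job in one call):

```python
import json
import requests

ES = "http://localhost:9200"          # assumed local cluster
SRC, DST = "old_index", "new_index"   # hypothetical; DST has the new analyzer

# Open a scroll over the old index ...
resp = requests.post(f"{ES}/{SRC}/_search?scroll=5m",
                     json={"size": 1000, "query": {"match_all": {}}}).json()

# ... and stream each batch into the new index via the bulk API, where the
# documents are re-analyzed with the new settings on the way in.
while resp["hits"]["hits"]:
    lines = []
    for hit in resp["hits"]["hits"]:
        lines.append(json.dumps({"index": {"_index": DST, "_id": hit["_id"]}}))
        lines.append(json.dumps(hit["_source"]))
    requests.post(f"{ES}/_bulk", data="\n".join(lines) + "\n",
                  headers={"Content-Type": "application/x-ndjson"}).raise_for_status()
    resp = requests.post(f"{ES}/_search/scroll",
                         json={"scroll": "5m", "scroll_id": resp["_scroll_id"]}).json()
```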
TaskRabbit did open-source an import/export tool; I've not used it, so I can't vouch for it, but it is worth a look:
https://github.com/taskrabbit/elasticsearch-dump
