How to upsert documents in an existing elasticsearch index? - elasticsearch

I have an Elasticsearch index which has multiple documents. Now I want to update the index with some new documents, which might also contain duplicates of the existing documents. What's the best way to do this? I'm using elasticsearch-py for all CRUD operations.

Every update in Elasticsearch deletes the old document and creates a new one, because the smallest unit of document storage in Elasticsearch is the segment, and segments are immutable. When you index a new document or update an existing one, it goes into a new segment, and segments are merged into bigger segments during the merge process.
So even if you have duplicate data but with the same id, it will simply replace the existing document. That is fine, and it is more performant than first fetching the document, comparing the two to see if they are duplicates, and then discarding the update/upsert request in your application. Just index whatever is coming, and Elasticsearch will overwrite the existing docs rather than inserting duplicates.
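A minimal sketch of that approach with elasticsearch-py, assuming an index called my-index and a doc_id field in the source data that serves as the document id (both names are placeholders):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# New batch of documents; "doc_id" is a hypothetical field that uniquely
# identifies each document in the source data.
new_docs = [
    {"doc_id": "42", "title": "hello", "updated_at": "2023-01-01"},
    {"doc_id": "43", "title": "world", "updated_at": "2023-01-02"},
]

# Index every document under its own _id: existing documents with the same
# _id are simply replaced, new ones are created -- no duplicate check needed.
actions = (
    {
        "_op_type": "index",      # "index" = create or overwrite
        "_index": "my-index",     # assumed index name
        "_id": doc["doc_id"],
        "_source": doc,
    }
    for doc in new_docs
)

helpers.bulk(es, actions)
```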

Related

How to populate data for existing documents after new fields are added in mapping

Recently we came across a requirement where we need to add 3 new fields to our existing index. We pull this data from our source database using Logstash. We already have thousands of documents stored in the current index. In the past, we were told that whenever a change happens to an existing index (such as adding a new field) we need to reindex with a complete data reload, since we want the previous documents to have these new fields populated with data.
Is there any other way we can achieve this without dropping the existing index or deleting any documents and reloading? I was hoping there is a better way of doing this with the latest 7.x version.
No need to drop the index and recreate one. After you have updated the mapping of the index, you just upsert the documents in the index again, with the new fields. Each document will be overwritten by the new one.
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-put-mapping.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
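A rough sketch of those two steps with elasticsearch-py; the index name, the field names and the fetch_from_source_db() helper are placeholders, not anything from the original setup:

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# 1. Add the three new fields to the existing mapping
#    (field names are made up for the example).
es.indices.put_mapping(
    index="my-index",
    body={
        "properties": {
            "field_a": {"type": "keyword"},
            "field_b": {"type": "date"},
            "field_c": {"type": "float"},
        }
    },
)

# 2. Re-push the documents, re-pulled from the source database with the new
#    fields populated; the same _id means each one overwrites the existing doc.
docs_from_source = fetch_from_source_db()   # hypothetical helper

helpers.bulk(
    es,
    (
        {"_op_type": "index", "_index": "my-index", "_id": d["id"], "_source": d}
        for d in docs_from_source
    ),
)
```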

Elasticsearch index with historical versions of documents

I have an Elasticsearch index continuously being updated, and I'm creating a second index with the same mappings for doing offline analytics: I need to store changes for certain fields, in order to retrieve the values that were associated at a specific time in the past. Therefore, in this second index I store multiple versions of the same document (same id but different _id fields).
My objective is to get ranked results for a given query and reference date. I've tried with aggregations, but instead of modifying the hits field you get an extra aggregations field with unordered results.
Is there any way other than removing duplicates at the client side?
This is similar to, but different from, a previous question, since the proposed solution there (just having a boolean current field) only allows removing duplicates when querying the present.
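For illustration, a sketch of the aggregation approach described above, assuming the logical id is stored in a doc_id field and the version date in a version_ts field (the field names, index name and query are all made up):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Keep only versions indexed before the reference date and pick the latest
# version per logical document id.
resp = es.search(
    index="analytics-index",   # assumed name of the second index
    body={
        "query": {
            "bool": {
                "must": [{"match": {"text": "my query"}}],
                "filter": [{"range": {"version_ts": {"lte": "2021-06-01"}}}],
            }
        },
        "size": 0,
        "aggs": {
            "by_doc": {
                "terms": {"field": "doc_id", "size": 100},
                "aggs": {
                    "latest": {
                        "top_hits": {
                            "size": 1,
                            "sort": [{"version_ts": {"order": "desc"}}],
                        }
                    }
                },
            }
        },
    },
)

# As noted above, the buckets come back keyed by doc_id rather than ranked
# by relevance, which is exactly the limitation the question describes.
```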

Updating existing documents in ElasticSearch (ES) while using rollover API

I have a data source which will create a high number of entries that I'm planning to store in ElasticSearch.
The source creates two entries for the same document in ElasticSearch:
the 'init' part which records init-time and other details under a random key in ES
the 'finish' part which contains the main data, and updates the initially created document (merges) in ES under the init's random key.
I will need to use time-based indices in Elasticsearch, with an alias pointing to the actual index, using the rollover API.
For updates I'll use the update API to merge init and finish.
Question: If the init document with the random key is not in the current index (but in an older one that has already been rolled over), would updating it using its key succeed? If not, what is the best practice to perform the update?
After some silence here, I set out to test it myself.
Short answer: After the index is rolled over under an alias, an update operation using the alias refers to the new index only, so it will create the document in the new index, resulting in two separate documents.
One way of solving it is to perform a search in the last 2 (or more if needed) indexes and figure out which non-alias index name to use for the update.
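A sketch of that lookup with elasticsearch-py, assuming the alias is called events-alias and the 'finish' data is a simple partial document (all names and fields here are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

doc_id = "random-init-key"    # the init document's random key

# Search through the alias (which covers the rolled-over indices as well)
# to find which concrete index actually holds the init document.
resp = es.search(
    index="events-alias",     # assumed alias name
    body={"query": {"ids": {"values": [doc_id]}}, "size": 1},
)

if resp["hits"]["hits"]:
    concrete_index = resp["hits"]["hits"][0]["_index"]
    # Update against the concrete index, not the alias, so the 'finish' data
    # is merged into the existing document instead of creating a second one
    # in the newest index.
    es.update(
        index=concrete_index,
        id=doc_id,
        body={"doc": {"finish_time": "2021-06-01T12:00:00Z", "status": "done"}},
    )
```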
The other solution, which I prefer, is to avoid using rollover and instead calculate the index name from the required date field of the document, creating new indices from the application and using a template to define the mapping. This way event sourcing and replaying the documents in order will yield the same indices.

How to delete the documents that fall above a particular timeline in elasticsearch?

I have indexed nearly a million documents in my Elasticsearch index, but due to a flaw in the code, some of the documents were indexed with timestamp values in the future.
Now when I check the last indexed documents, it shows the documents with future timestamp values, so I need to delete such documents from my index.
How can I do that?
You need to fire a query to fetch the document ids with future timestamp values using the scroll/scan API and then issue a bulk request to delete them.
If you are using ES < 2.0, you can also do it in a single query using Delete by Query; it was deprecated in v2.0.
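A sketch of that two-step approach with the elasticsearch-py helpers, assuming the field is called timestamp and the index my-index (both placeholders):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# Scroll over every document whose timestamp lies in the future
# ("timestamp" is an assumed field name) ...
future_docs = helpers.scan(
    es,
    index="my-index",
    query={"query": {"range": {"timestamp": {"gt": "now"}}}},
    _source=False,   # we only need the ids
)

# ... and delete them in bulk.
delete_actions = (
    {"_op_type": "delete", "_index": hit["_index"], "_id": hit["_id"]}
    for hit in future_docs
)
helpers.bulk(es, delete_actions)
```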

Does ElasticSearch Snapshot/Restore functionality cause the data to be analyzed again during restore?

I have a decent amount of data in my ElasticSearch index. I changed the default analyzer for the index, and hence I essentially need to reindex my data so that it is analyzed again using the new analyzer. So instead of creating a test script that deletes all of the existing data in the ES index and re-adds it, I wondered if there is a back-up/restore module that I could use. As part of that, I found the snapshot/restore module that ES supports - ElasticSearch-SnapshotAndRestore.
My question is: if I use the above ES snapshot/restore module, will it actually cause the data to be re-analyzed? Since I changed the default analyzer, I need the data to be reanalyzed. If not, is there an alternate tool/module you would suggest that allows a pure export and import of data, so that the data is re-analyzed during import?
No it does not re-analyze the data. You will need to reindex your data.
Fortunately that's fairly straightforward with Elasticsearch as it by default stores the source of your documents:
Reindexing your data

While you can add new types to an index, or add new fields to a type, you can't add new analyzers or make changes to existing fields. If you were to do so, the data that has already been indexed would be incorrect and your searches would no longer work as expected.

The simplest way to apply these changes to your existing data is just to reindex: create a new index with the new settings and copy all of your documents from the old index to the new index.

One of the advantages of the _source field is that you already have the whole document available to you in Elasticsearch itself. You don't have to rebuild your index from the database, which is usually much slower.

To reindex all of the documents from the old index efficiently, use scan & scroll to retrieve batches of documents from the old index, and the bulk API to push them into the new index.
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/reindex.html
I'd read up on Scan and Scroll prior to taking this approach:
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/scan-scroll.html
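As a rough sketch of the scan & scroll plus bulk approach with the elasticsearch-py helpers (the index names are placeholders, and the new index is assumed to already exist with the new analyzer):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# Pull every document out of the old index with scan/scroll...
docs = helpers.scan(es, index="old-index", query={"query": {"match_all": {}}})

# ...and push it into the new index via the bulk API, keeping the same _id,
# so the documents are analyzed again with the new analyzer on the way in.
actions = (
    {"_index": "new-index", "_id": hit["_id"], "_source": hit["_source"]}
    for hit in docs
)
helpers.bulk(es, actions)
```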
TaskRabbit open-sourced an import/export tool; I've not used it so I can't vouch for it, but it is worth a look:
https://github.com/taskrabbit/elasticsearch-dump
