I am trying to update the data in my Elasticsearch indices with zero downtime, but I am not sure how to achieve this. Can anyone assist me with how I can do so?
For example: if I have an index named my_es_index and I want to update the data in that particular index with zero downtime, the old data should still be served from one of the nodes while someone is running a query, while in parallel we update the data on that index in the backend.
Is this possible to achieve? If yes, please help me with how I can proceed.
You build/create another index (call it the new index), then switch from the old index to the new index, then delete the old index.
Read more at https://medium.com/craftsmenltd/rebuild-elasticsearch-index-without-downtime-168363829ea4
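A minimal sketch of that create/reindex/switch flow using the Python Elasticsearch client (index names, the alias, and the toy mapping are placeholders; elasticsearch-py 7.x style assumed):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# 1. Create the new index with whatever updated mapping you need.
es.indices.create(index="my_es_index_v2", body={
    "mappings": {"properties": {"title": {"type": "text"}}}
})

# 2. Copy the data across. For big indices, pass wait_for_completion=False
#    and poll the returned task instead.
es.reindex(body={
    "source": {"index": "my_es_index_v1"},
    "dest": {"index": "my_es_index_v2"},
})

# 3. Atomically flip the alias the application queries, then drop the old index.
es.indices.update_aliases(body={"actions": [
    {"remove": {"index": "my_es_index_v1", "alias": "my_es_index"}},
    {"add": {"index": "my_es_index_v2", "alias": "my_es_index"}},
]})
es.indices.delete(index="my_es_index_v1")

Because the alias swap is a single atomic action, queries against the alias never see an empty or half-built index.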
Unless you have to update the mapping of an existing field and preserving the field's name is required, I don't think taking the cluster down is needed.
While the above article is a good read and might be treated as best practice, ES is a lot more flexible than MySQL/SQL: it allows you to update existing documents in place.
Adding a new field
Let's call the new field to be added x.
add a mapping for x to the index.
make the code changes such that, going forward, all new documents have this new field x.
while all new documents now have the field x, write a script which updates the older documents and adds this field x (a sketch follows this list).
once you are sure that all documents have the field x, you may enable the feature you added this field for.
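A rough sketch of steps 1 and 3 (the field name x, its type, and the default value are illustrative only; elasticsearch-py 7.x assumed):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Step 1: add the mapping for the new field x.
es.indices.put_mapping(index="my_es_index", body={
    "properties": {"x": {"type": "keyword"}}
})

# Step 3: backfill older documents that don't have x yet.
# conflicts="proceed" skips documents that live writes touched mid-run.
es.update_by_query(
    index="my_es_index",
    body={
        "query": {"bool": {"must_not": {"exists": {"field": "x"}}}},
        "script": {"source": "ctx._source.x = 'default-value'", "lang": "painless"},
    },
    conflicts="proceed",
    wait_for_completion=False,  # run as a background task on large indices
)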
Updating the mapping of a field
Let's again call the field to be updated x (assuming the name of the field is not the prime concern).
create a new field, say new_x (add the correct mapping to the index; see the sketch after these steps).
follow the above steps to ensure new_x has the data (with the slight change that both x and new_x need to carry the data during the transition).
once all the documents in the index have the field new_x, simply refactor the code to use new_x instead of x.
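A sketch of those steps, again with illustrative names and types:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Add new_x alongside x, with the corrected mapping.
es.indices.put_mapping(index="my_es_index", body={
    "properties": {"new_x": {"type": "keyword"}}
})

# Copy x into new_x for every document that doesn't have new_x yet.
es.update_by_query(
    index="my_es_index",
    body={
        "query": {"bool": {"must_not": {"exists": {"field": "new_x"}}}},
        "script": {"source": "ctx._source.new_x = ctx._source.x", "lang": "painless"},
    },
    conflicts="proceed",
    wait_for_completion=False,
)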
While one might argue that the above two approaches are hacks of a sort, they save you the time, effort and cost of standing up a new index and managing the aliases.
Env details:
Elasticsearch version 7.8.1
The routing param is optional in index settings.
As per the Elasticsearch docs - https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-routing-field.html
When indexing documents specifying a custom _routing, the uniqueness of the _id is not guaranteed across all of the shards in the index. In fact, documents with the same _id might end up on different shards if indexed with different _routing values.
We have landed in the same scenario: earlier we were using a custom routing param (let's say customerId), and for some reason we need to remove custom routing now.
This means the doc id will now be used as the default routing param, which creates duplicate records with the same id across different shards during index operations. Earlier (before removing custom routing) the same operation resulted in an update of the record, as expected.
I am thinking of the following approaches to get out of this; please advise if you have a better approach to suggest. The key here is to AVOID DOWNTIME.
Approach 1:
As we receive an update request, let the duplicate record get created. Once the record without custom routing is created, issue a delete request for the record with custom routing (see the sketch below).
CONS: if a record never gets updated, it will linger around with its custom routing; we want to avoid this, as it might result in unforeseen scenarios in the future.
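A sketch of what that write path could look like (index name and helper are hypothetical; customerId is the old routing value from the question):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def update_without_custom_routing(doc_id, customer_id, doc):
    # Write the document with default (_id based) routing ...
    es.index(index="my_index", id=doc_id, body=doc)
    # ... then delete the duplicate sitting on the shard that the old
    # custom routing value (customerId) pointed at.
    es.delete(index="my_index", id=doc_id, routing=customer_id, ignore=[404])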
Approach 2:
We use the Reindex API to migrate data to a new index (turning off custom routing during migration). The application will use the new index after a successful migration.
CONS: some of our indexes are huge and take 12 hrs+ to reindex, and since the Elasticsearch Reindex API works off a snapshot of the source, it will not migrate the newer records created during that 12-hour window. This forces a downtime approach.
Please suggest an alternative if you have faced this before.
Thanks @Val. I also found a few other approaches, like writing to both indices and reading from the old one, then shifting reads to the new one after re-indexing is finished. Something along the following lines:
Create aliases pointing to the old indices (*_v1)
Point the application to these aliases instead of the actual indices
Create new indices (*_v2) with the same mapping
Move data from the old indices to the new ones using the Reindex API, making sure we don't retain the custom routing during this
Post re-indexing, change the aliases to point to the new indices instead of the old ones (need to verify this, but there are easy alternatives if it doesn't work)
Once verification is done, delete the old indices
What do we do in the transition period (the window between reindexing start and reindexing finish)?
Write to both indices (old and new) and read from the old indices via the aliases. A sketch of the reindex-and-flip steps follows.
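Index and alias names below are placeholders. The Reindex API's dest.routing option can be set to discard so the copied documents are routed by _id only:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Copy everything into the v2 index, dropping the custom routing on the way.
es.reindex(body={
    "source": {"index": "my_index_v1"},
    "dest": {"index": "my_index_v2", "routing": "discard"},
}, wait_for_completion=False)

# After the copy is verified, repoint the alias atomically.
es.indices.update_aliases(body={"actions": [
    {"remove": {"index": "my_index_v1", "alias": "my_index"}},
    {"add": {"index": "my_index_v2", "alias": "my_index"}},
]})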
I recently made a new version of an index for my Elasticsearch data with some new fields included. I re-indexed from the old index, so the new index has all of the old data along with the new mapping that includes the new fields.
Now I'd like to update all of my Elasticsearch data in the index to include these new fields, which I can calculate by making some separate database and API calls to other sources.
What is the best way to do this, given that there are millions of records in the index?
Logistically speaking, I'm not sure how to accomplish this: how can I keep track of the records that I've updated? I've been reading about the scroll API, but I'm not certain it's viable because of the max scroll time of 24 hours (what if it takes longer than that?). Another serious consideration is that since I need to make other database calls to calculate the new field values, I don't want to hammer that database for too long in a single session.
Would there be some way to run an update for, say, 10 minutes every night, but keep track of which records have been updated or still need updating?
I'm just not sure about a lot of this; I would appreciate any insights or other ideas on how to go about it.
You would need to run an update by query on your original index, which is expensive.
You might be able to use an alias and point it at indices behind it: when you want to make a change, create a new index with the new mappings etc. and attach it to the alias, so new data coming in gets written correctly. Then reindex the "old" data into the new index.
That will depend on the details of what you're doing, though. One incremental approach is sketched below.
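One way to do the rate-limited nightly backfill without scroll: since every update adds the new field, a must_not exists query is its own progress tracker. A sketch, where the index name, field name, and the enrichment lookup are all hypothetical:

import time
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")
BATCH_SIZE = 500
BUDGET_SECONDS = 10 * 60  # stop after roughly ten minutes per nightly run

def lookup_new_fields(doc_ids):
    """Hypothetical: fetch the new field values from your other databases/APIs."""
    return {doc_id: {"new_field": "..."} for doc_id in doc_ids}

deadline = time.time() + BUDGET_SECONDS
while time.time() < deadline:
    # Documents still missing the field are, by definition, not yet processed.
    hits = es.search(index="my_index", body={
        "size": BATCH_SIZE,
        "query": {"bool": {"must_not": {"exists": {"field": "new_field"}}}},
        "_source": False,
    })["hits"]["hits"]
    if not hits:
        break  # backfill complete
    enriched = lookup_new_fields([h["_id"] for h in hits])
    helpers.bulk(es, ({
        "_op_type": "update",
        "_index": "my_index",
        "_id": doc_id,
        "doc": fields,
    } for doc_id, fields in enriched.items()))
    # Make the updates visible so the next search doesn't re-fetch them.
    es.indices.refresh(index="my_index")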
I have 2 questions indexed in Elasticsearch: FIRE_DETECTED and SMOKE:DETECTED.
Goal
I want to search with query = 'fire' -> result: FIRE_DETECTED
query = 'detected' -> result: FIRE_DETECTED and SMOKE:DETECTED
Some solutions
Add more settings to the analyzer
We need to create a new index with a new setting (add the token filter word_delimiter_graph)
Reindex
Problem: how do we add the setting in production without affecting customers?
Add 1 more field into Elasticsearch, filterd_question
Split the data on : and _
Save the split data in this filterd_question field
Problem: we need 1 more field
What is the best solution for this? (Add more solutions if needed.)
Again, this is a really good and very common scenario while working with Elasticsearch: requirements keep changing, and in order to support them we have to change the way we index the data in ES.
Both of the approaches you mentioned are used by companies, both have their trade-offs, and you have to choose the one which suits your requirements.
Changing/adding the analyzer will require the steps below in order to make it work:
Close the index
Add/edit the analyzer definition
Open the index
Reindex all the documents (you should use an index alias to do this with zero downtime and minimize the impact on end-users)
After step 4, your new searches will work. A sketch of the close/edit/open cycle follows.
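A sketch of steps 1 to 4, assuming the field's mapping already references an analyzer defined by this name in the index settings (analysis settings can only be edited while the index is closed; names are illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# 1-3: close the index, redefine the analyzer, reopen.
es.indices.close(index="my_index")
es.indices.put_settings(index="my_index", body={
    "analysis": {
        "analyzer": {
            "my_analyzer": {  # must match the name the field mapping uses
                "type": "custom",
                "tokenizer": "keyword",
                "filter": ["word_delimiter_graph", "lowercase"],
            }
        }
    }
})
es.indices.open(index="my_index")

# 4: push every existing document back through the new analysis chain in place.
es.update_by_query(index="my_index", conflicts="proceed", wait_for_completion=False)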
Pros: it won't create new fields, hence it saves space; this is the more efficient and cleaner way of making the change.
Cons: the re-index might take a lot of time, depending on the number of documents, and it is a comparatively complex process.
Add a custom analyzer and then add a new field using the newly added analyzer
This case also requires closing/opening the index, unless you are using an inbuilt analyzer; here, though, your new documents (and documents which get updated) will have the new field, so searches using the new analyzer/logic will bring partial results. This could be fine, depending on your use-case.
Pros: a relatively simpler approach that doesn't require full re-indexing in all cases.
Cons: extra space if the old field is no longer used, and the complexity varies according to the use-case.
If you don't want to change/add an analyzer, you can try using a wildcard query, although the con would be performance. For example:
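A small sketch of that option (the field name question.keyword is assumed):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Matches both FIRE_DETECTED and SMOKE:DETECTED without any analyzer changes,
# but the leading wildcard forces a scan over many terms, hence the cost.
resp = es.search(index="my_index", body={
    "query": {"wildcard": {"question.keyword": {"value": "*DETECTED*"}}}
})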
I am designing an e-commerce site with multiple warehouses. All the warehouses have the same set of products.
I am using Elasticsearch for my search engine.
There are 40 fields in each ES document. 20 of them will differ in value per warehouse; the remaining 20 fields will contain the same values for all warehouses.
I want to use multiple types (1 type for each warehouse) in 1 index. All of the types will have the same mappings. Please advise if my approach is correct for such a scenario.
A few things are not clear to me:
Will the inverted index be created only once for all types in the same index?
If a new type (new warehouse) is added in future, how will it be merged with the previously stored data?
How would the query time be impacted if I used only one type in one index?
Given that all types are assigned to the same index, the inverted index will only be created once, and
if a new type is added, its information is added to the existing inverted index as well: new terms are added to the index, pointers to existing terms are added, and doc values are added for each newly inserted document.
I honestly can't answer that one, though it is simple to test in a proof of concept.
In a previous project I experienced the same setting while implementing a search engine with Elasticsearch on a multi-shop platform. In that case we had all shops in one type, and when searching per shop the relevant filters were applied (see the sketch below). That said, the approach of separating shop data by _type seems pretty clean to me; we applied it the other way only because my implementation was already able to cover it with filters at the moment of the feature request.
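For illustration, the filter-per-shop variant might look like this (index and field names are invented):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# All warehouses share one index; a term filter scopes each search to one of them.
resp = es.search(index="products", body={
    "query": {
        "bool": {
            "must": {"match": {"name": "coffee"}},
            "filter": {"term": {"warehouse_id": "wh-17"}},
        }
    }
})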
Cheers, Dominik
Hi, I am using Elasticsearch along with Titan.
Right now I have a mixed index on a specific property for a specific type of vertex:
PropertyKey textProp = mgmt.getPropertyKey(EntityProps.text);
VertexLabel entityClass = mgmt.getVertexLabel(VertexLabels.Entity);
mgmt.buildIndex("EntityTextFull", Vertex.class)
.indexOnly(entityClass)
.addKey(textProp)
.buildMixedIndex("search");
The indexed key values are not unique. I wonder if there is a way to update some properties, including the indexed property, for a specific vertex, and then somehow reindex this specific vertex against this specific index.
Thanks,
Michail
You could potentially use the _update endpoint and do partial updates:
Externally, it appears as though we are partially updating a document in place. Internally, however, the update API simply manages the same retrieve-change-reindex process that we have already described. The difference is that this process happens within a shard, thus avoiding the network overhead of multiple requests. By reducing the time between the retrieve and reindex steps, we also reduce the likelihood of there being conflicting changes from other processes.
I've no experience with Titan, but I suppose you could translate the raw queries into it.
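For reference, a bare _update call against the underlying ES index could look like this (index name, document id and fields are made up; note that Titan manages its own backing documents, so updating them directly like this bypasses Titan):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Partial update: only the supplied fields are merged into the document; ES
# performs the retrieve-change-reindex cycle inside the shard.
es.update(index="my_index", id="vertex-42", body={
    "doc": {"text": "updated text value"}
})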