How to re-index Elasticsearch without stale reads? - elasticsearch

I have indices with heavy read/write operations.
My indices have a read and a write alias.
When I need to update the mapping in my indices I go through this process:
create a new index with the new mapping,
add the write-alias to the new index,
delete the write-alias from the old index,
reindex the data like this:
POST _reindex?wait_for_completion=false
{
  "conflicts": "proceed",
  "source": {
    "index": "old-index"
  },
  "dest": {
    "op_type": "create",
    "index": "new-index"
  }
}
While re-indexing, the read-alias points to the old index and the write-alias points to the new index.
When the re-indexing is complete, I create a read-alias on the new index and delete the read-alias on the old index.
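The switch itself can be done as one atomic _aliases call, roughly like this (same index and alias names as above):
POST _aliases
{
  "actions": [
    { "remove": { "index": "old-index", "alias": "read-alias" } },
    { "add": { "index": "new-index", "alias": "read-alias" } }
  ]
}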
This process works fine, but there is one caveat: while re-indexing, the data is stale for the reading applications, i.e. updates cannot be read until I have switched the read-alias to the new index.
Since I have quite large indices, the re-indexing takes many hours.
Is there any way to handle re-indexing without reading stale data?
I would of course like to write to both indices at the same time while re-indexing, but as I understand it, that's not possible.
The only workaround I can think of is to change the client-side code so that, during re-indexing, every write goes to both indices in two separate requests.
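Roughly, every write during the re-indexing window would then become two requests (the document ID and body here are just placeholders):
PUT old-index/_doc/1
{ "title": "updated value" }

PUT new-index/_doc/1
{ "title": "updated value" }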
Any ideas or comments are much appreciated 🙏

Related

Is it possible to partition an ElasticSearch index?

I have a large amount of source code which changes frequently on disk. The source code is organized (and probably best managed) in chunks of "projects". I would like to maintain a current index of the source code so that they can be searched. Historical versions of the documents are not required.
To avoid infinitely growing indexes from the delete/add process, I would like to manage the index in chunks (partitions?). The ingestion process would drop the chunk corresponding to a project before re-indexing the project. A brief absence of the data during re-indexing is tolerable.
When I execute a query, I need to hit all of the chunks. Management of the indexes is my primary concern -- performance less so.
I can imagine that there could be two ways this might work:
partition an index. Drop a partition, then rebuild it.
a meta-index. Each project would be created as an individual index, but some sort of a "meta" construct would allow all of them to be queried in a single operation.
From what I have read, this does not seem to be a candidate for rollover indexes.
There are more than 1000 projects. Specifying a list of projects when the query is executed is not practical.
Is it possible to partition an index so that I can manage (drop and reindex) it in named chunks, while maintaining the ability to query it as a single unified index?
Yes, you can achieve this using aliases.
Let's say you have the "old" version of the project data in index "project-1" and that index also has an alias "project".
Then you index the "new" version of the project data in index "project-2". All the queries are done on the alias "project" instead of querying the index directly.
So when you're done reindexing the new version of the data, you can simply switch the alias from "project-1" to "project-2". No interruption of service for your queries.
That's it!
POST _aliases
{
  "actions": [
    {
      "remove": {
        "index": "project-1",
        "alias": "project"
      }
    },
    {
      "add": {
        "index": "project-2",
        "alias": "project"
      }
    }
  ]
}
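Queries always go through the alias, so they never need to know which physical index is currently live; for example:
GET project/_search
{
  "query": {
    "match_all": {}
  }
}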

ElasticSearch reindex nested field as new documents

I am currently changing my ElasticSearch schema.
I previously had one type Product in my index with a nested field Product.users.
I now want to have 2 different indices, one for Product and another one for User, and make the links between both in code.
I use the reindex API to reindex all my Product documents to the new index, removing the Product.users field with this script:
ctx._source.remove('users');
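i.e. roughly the following request (the index names here are placeholders):
POST _reindex
{
  "source": {
    "index": "product"
  },
  "dest": {
    "index": "product-v2"
  },
  "script": {
    "lang": "painless",
    "source": "ctx._source.remove('users')"
  }
}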
But I don't know how to reindex all my Product.users entries to the new User index, since in a script I'll get an ArrayList of users and I want to create one User document for each of them.
Does anyone knows how to achieve that?
For those who may face this situation, I finally ended up reindexing the users nested field using both the scroll and bulk APIs:
I used the scroll API to get batches of Product documents.
For each batch, I iterate over those Product documents.
For each document, I iterate over Product.users.
I create a new User document and add it to a bulk request.
I send the bulk request when I finish iterating over the Product batch.
It does the job <3
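In terms of raw API calls, a rough sketch of that loop (index names, field names and batch size are just examples):
# 1) pull a batch of products, fetching only the users field
POST product/_search?scroll=2m
{ "size": 500, "_source": ["users"] }

# 2) keep fetching the next batches with the returned _scroll_id
POST _search/scroll
{ "scroll": "2m", "scroll_id": "<scroll_id from the previous response>" }

# 3) for every user of every product hit, add one action to a bulk request
POST user/_bulk
{ "index": {} }
{ "name": "jane", "product_id": "42" }
{ "index": {} }
{ "name": "john", "product_id": "42" }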
What you need is called ETL (Extract, Transform, Load).
Most of the time it is handier to write a small Python script that does exactly what you want, but with Elasticsearch there is a combination I love: Apache Spark + the elasticsearch-hadoop plugin.
Also, sometimes Logstash can do the trick, but with Spark you get:
SQL syntax or support for Java/Scala/Python code
very fast reads/writes to Elasticsearch thanks to distributed workers (1 ES shard = 1 Spark worker)
fault tolerance (a worker crashes? no problem)
clustering (ideal if you have billions of documents)
Use it with Apache Zeppelin (a notebook with Spark packaged & ready); you will love it!
The simplest solution I can think of is to run the reindex command twice: once selecting the Product fields and re-indexing into the new_Products index, and once for the users:
POST _reindex
{
  "source": {
    "index": "Product",
    "type": "_doc",
    "_source": ["fields", "to keep in", "new Products"],
    "query": {
      "match_all": {}
    }
  },
  "dest": {
    "index": "new_Products"
  }
}
Then you should be able to run the re-index again into the new_User index, selecting only Product.users in the second re-index.
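Something along these lines for the second pass (index names as in the question; note that _reindex copies documents one-to-one, so each new_User document will still contain the whole users array):
POST _reindex
{
  "source": {
    "index": "Product",
    "_source": ["users"],
    "query": {
      "match_all": {}
    }
  },
  "dest": {
    "index": "new_User"
  }
}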

ElasticSearch : Concurrent updates to index while _reindex for the same index in progress

We have been using this link as a reference to accommodate any change in the mappings for a field in our index with zero downtime.
Question:
Considering the same example taken in the above link: when we reindex the data from my_index_v1 to my_index_v2 using the _reindex API, does Elasticsearch guarantee that any concurrent updates happening in my_index_v1 will make it to my_index_v2?
For example, a document might get updated in my_index_v1 before or after it is reindexed by the API to my_index_v2.
Ultimately, while we did not want any downtime for the mapping changes (hence the _reindex using aliases and other cool stuff from ES), we also want to ensure that no adds/updates were missed while this huge reindex was in progress, as we are talking about reindexing >50 GB of data.
Thanks,
Sandeep
The reindex API will not consider changes made after the process has started.
One thing you can do is, once the reindexing process is done, run the process again with version_type: external.
This will copy from the source index to the destination index only the documents that have a different version or are not present yet.
Here is an example:
POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "version_type": "external"
  }
}
Setting version_type to external will cause Elasticsearch to preserve the version from the source, create any documents that are missing, and update any documents that have an older version in the destination index than they do in the source index.
One way to solve this is by using two aliases instead of one. One for queries (let’s call it read_alias), and one for indexing (write_alias). We can write our code so that all indexing happens through the write_alias and all queries go through the read_alias. Let's consider three periods of time:
Before rebuild
read_alias: points to current_index
write_alias: points to current_index
All queries return current data.
All modifications go into current_index.
During rebuild
read_alias: points to current_index
write_alias: points to new_index
All queries keep getting data as it existed before the rebuild, since searching code uses read_alias.
All rows, including modified ones, get indexed into the new_index, since both the rebuilding loop and the DB trigger use the write_alias.
After rebuild
read_alias: points to new_index
write_alias: points to new_index
All queries return new data, including the modifications made during rebuild.
All modifications go into new_index.
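Each of these transitions is a single atomic _aliases call; for example, moving write_alias over to new_index at the start of the rebuild looks roughly like this:
POST _aliases
{
  "actions": [
    { "remove": { "index": "current_index", "alias": "write_alias" } },
    { "add": { "index": "new_index", "alias": "write_alias" } }
  ]
}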
It should even be possible to get the modified data from queries while rebuilding, if we make the DB trigger code index modified rows into both the indices while the rebuild is going on (i.e., while the aliases point to different indices).
It is often better to rebuild the index from source data using custom code instead of relying on the _reindex API, since that way we can add new fields that may not have been stored in the old index.
This article has some more details.
It looks like _reindex works off a snapshot of the source index.
That would suggest to me that it can't reasonably honor changes to the source happening in the middle of the process. You avoid downtime on the search side, but I think you would need to pause updates on the indexing side during this process.
Something you could do is keep track, in your index, of when each document was last modified. Then, once you finish reindexing and switch the alias, you query the old index for what changed in the meantime. Propagate those changes over to the new index and you get eventual consistency.
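A rough sketch of that catch-up step, assuming each document carries a last_modified timestamp and reusing the index names from the question (the date is just a placeholder for when the first reindex pass started):
POST _reindex
{
  "conflicts": "proceed",
  "source": {
    "index": "my_index_v1",
    "query": {
      "range": {
        "last_modified": { "gte": "2020-01-01T00:00:00Z" }
      }
    }
  },
  "dest": {
    "index": "my_index_v2",
    "version_type": "external"
  }
}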

Updating all data elasticsearch

Is there any way to update all data in Elasticsearch?
In the example below, the update is done for document '1' of type external.
curl -XPOST 'localhost:9200/customer/external/1/_update?pretty' -d '
{
"doc": { "name": "Jane Doe", "age": 20 }
}'
Similarly, I need to update all my data in external. Is there any way or query to update all the data?
Updating all documents in an index means that, under the hood, all documents are deleted and new ones are indexed, which leaves lots of "marked-as-deleted" documents.
When you run a query, ES automatically filters out those "marked-as-deleted" documents, which has an impact on the response time of the query. How much impact depends on the data, the use case and the query.
Also, if you update all documents, then unless you run a _forcemerge there will be segments (especially the larger ones) that still contain "marked-as-deleted" documents, and those segments are unlikely to be merged automatically by Lucene/Elasticsearch.
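If you do end up updating everything in place, the merge can be triggered manually, for example (this is an expensive operation on a large index):
POST customer/_forcemerge?only_expunge_deletes=true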
My suggestion, if your indexing process is not too complex (like getting the data from a relational database and processing it before indexing into ES, for example), is to drop the index completely and index fresh data. It might be more effective than updating all the documents.

Retrieve data after deleting mapping in Elastic Search

I am fairly new to Elasticsearch. Just this weekend I started trying out stuff in it, and while I think it's a pretty neat way to store documents, I came across the following problem. I was fooling around a bit with the mappings (without actually knowing at the time what they were and what they were for), and I accidentally deleted the mapping of my index, along with all the data stored in it, by performing:
DELETE tst_environment/object/_mapping
{
  "properties" : {
    "title" : { "type": "string" }
  }
}
Is there any way to retrieve the lost data or am I, well .. fucked? Any information regarding the issue is more than welcome :)
Unless you have taken a snapshot of the index, it is not possible to retrieve the data once you have deleted the mapping.
You would have to reindex the data from the initial source.
FWIW, the upcoming v2.0 of Elasticsearch does not allow one to delete mappings.
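If a snapshot does exist, restoring the index looks roughly like this (the repository and snapshot names here are placeholders, and the damaged index would have to be closed or deleted first):
POST _snapshot/my_backup/snapshot_1/_restore
{
  "indices": "tst_environment"
}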
