Speeding up mapping creation in ElasticSearch

Speeding up mapping creation in ElasticSearch - elasticsearch

With each index that we create, we need to create mapping for 10 types.
While indexes are created and documents are indexed blazingly fast, bottleneck that we keep hitting is slow mapping creation. In some cases (when we need to create multiple indexes at the same time) it even breaks when ElasticSearch rejects request because mapping was not created in 30 seconds.
Is there any way to speed up mapping creation, or send mappings in bulk?

I think you have to use Index templates, that will allow you to define templates that will automatically be applied to new indices created. The templates include both settings and mappings, and a simple pattern template that controls if the template will be applied to the index created.
More details here.
Regards,
Alain

Related

Elastic Search:Update of existing Record (which has custom routing param set) results in duplicate record, if custom routing is not set during update

Env Details:
Elastic Search version 7.8.1
routing param is an optional in Index settings.
As per ElasticSearch docs - https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-routing-field.html
When indexing documents specifying a custom _routing, the uniqueness of the _id is not guaranteed across all of the shards in the index. In fact, documents with the same _id might end up on different shards if indexed with different _routing values.
We have landed up in same scenario where earlier we were using custom routing param(let's say customerId). And for some reason we need to remove custom routing now.
Which means now docId will be used as default routing param. This is creating duplicate record with same id across different shard during Index operation. Earlier it used to (before removing custom routing) it resulted in update of record (expected)
I am thinking of following approaches to come out of this, please advise if you have better approach to suggest, key here is to AVOID DOWNTIME.
Approach 1:
As we receive the update request, let duplicate record get created. Once record without custom routing gets created, issue a delete request for a record with custom routing.
CONS: If there is no update on records, then all those records will linger around with custom routing, we want to avoid this as this might results in unforeseen scenario in future.
Approach 2
We use Re-Index API to migrate data to new index (turning off custom routing during migration). Application will use new index after successful migration.
CONS: Some of our Indexes are huge, they take 12 hrs+ for re-index operation, and since elastic search re-index API will not migrate the newer records created between this 12hr window, as it uses snapshot mechanism. This needs a downtime approach.
Please suggest alternative if you have faced this before.

Thanks #Val, also found few other approaches like write to both indexes and read from old. And then shift to read new one after re-indexing is finished. Something on following lines -
Create an aliases pointing to the old indices (*_v1)
Point the application to these aliases instead of actual indices
Create a new indices (*_v2) with the same mapping
Move data from old indices to new using re-indexing and make sure we don't
retain custom routing during this.
Post re-indexing, change the aliases to point to new index instead of old
(need to verify this though, but there are easy alternatives if this
doesn't work)
Once verification is done, delete the old Indices
What do we do in transition period (window between reindexing start to reindexing finish) -
Write to both Indices (old and new) and read from old indices via aliases

Is there any performance benefit to creating an index mapping for Elasticsearch

I was wondering for people who have used Elasticsearch at scale if there is a performance benefit while searching if I create an index mapping and then put documents in it compared to not creating a mapping and just directly putting documents in

It is usually preferable to create the explicit mapping for an index, where possible.
For a search case, this is crucial in order to index data with the analysis chains needed to service the search strategy.
For a log use case, it may not be possible to know what the explicit mapping should be for log records that will be ingested, as there may be dynamic fields in the data that is not known ahead of time. Dynamic templates can help here, as can adopting a unified logging structure like Elastic Common Schema (ECS), either converting data to ECS format whilst logging, or converting whilst ingesting into Elasticsearch with ingest pipelines

Yes it is always better to use explicit mapping before putting the documents rather than depending on the dynamic mapping. If at all you are dependent on the dynamic mapping you may not be able to visualize on few data types like text. And also when you maintain mapping your index will always have the same kind of data. Please refer to this blog:
[https://qbox.io/blog/maximize-guide-elasticsearch-indexing-performance-part-1/][1]

ElasticSearch multiple types with same mapping in single index

I am designing an e-Commerce site with multiple warehouse. All the warehouses have same set of products.
I am using ElasticSearch for my search engine.
There are 40 fields each ES document. 20 out of them will differ in value per warehouse, rest 20 fields will contain same values for all warehouses.
I want to use multiple types (1 type for each warehouse) in 1 index. All of the types will have same mappings. Please advise if my approach is correct for such scenario.
Few things not clear to me,
Will the inverted index be created only once for all types in same index?
If new type (new warehouse) is added in future how it will be merged with the previously stored data.
How it will impact the query time if I would have used only one type in one index.

Depending on all types being assigned to the same index, it will only created once and
If a new type is added, its information is added to the existing inverted index as well - adding new terms to the index, adding pointers to existing terms in the index, adding data to doc values per new inserted document.
I honestly can't answer that one, though it is simple to test this in a proof of concept.
In my previous project, I experienced the same setting implementing a search engine with Elasticsearch on a multishop-platform. In that case we had all shops in one type and when searching per shop relevant filters were applied. Though, the approach to separate shop-data by "_type" seems pretty clean to me. We applied it the other way, since my implementation was already able to cover it by filters at the moment of the feature request.
Cheers, Dominik

Is Elasticsearch suitable as a final storage solution?

I'm currently learning Elasticsearch, and I have noticed that a lot of operations for modifying indices require reindexing of all documents, such as adding a field to all documents, which from my understanding means retrieving the document, performing the desirable operation, deleting the original document from the index and reindex it. This seems to be somewhat dangerous and a backup of the original index seems to be preferable before performing this (obviously).
This made me wonder if Elasticsearch actually is suitable as a final storage solution at all, or if I should keep the raw documents that makes up an index separately stored to be able to recreate an index from scratch if necessary. Or is a regular backup of the index safe enough?

You are talking about two issues here:
Deleting old documents and re-indexing on schema change: You don't always have to delete old documents when you add new fields. There are various options to change the schema. Have a look at this blog which explains changing the schema without any downtime.
http://www.elasticsearch.org/blog/changing-mapping-with-zero-downtime/
Also, look at the Update API which gives you the ability to add/remove fields.
The update API allows to update a document based on a script provided. The operation gets the document (collocated with the shard) from the index, runs the script (with optional script language and parameters), and index back the result (also allows to delete, or ignore the operation). It uses versioning to make sure no updates have happened during the "get" and "reindex".
Note, this operation still means full reindex of the document, it just removes some network roundtrips and reduces chances of version conflicts between the get and the index. The _source field need to be enabled for this feature to work.
Using Elasticsearch as a final storage solution at all : It depends on how you intend to use Elastic Search as storage. Do you need RDBMS , key Value store, column based datastore or a document store like MongoDb? Elastic Search is definitely well suited when you need a distributed document store (json, html, xml etc) with Lucene based advanced search capabilities. Have a look at the various use cases for ES especially the usage at The Guardian:http://www.elasticsearch.org/case-study/guardian/

I'm pretty sure, that search engines shouldn't be viewed as a storage solution, because of the nature of these applications. I've never heard about this kind of a practice to backup index of search engine.
Usual schema when you using ElasticSearch or Solr or whatever search engine you have:
You have some kind of a datasource (it could be database, legacy mainframe, excel papers, some REST service with data or whatever)
You have search engine that should index this datasource to add to your system capability for search. When datasource is changed - you could reindex it, or index only changed part with the help of incremental indexation.
If something happen to search engine index - you could easily reindex all your data.

Does ElasticSearch Snapshot/Restore functionality cause the data to be analyzed again during restore?

I have a decent amount of data in my ElasticSearch index. I changed the default analyzer for the index and hence essentially I need to reindex my data so that it is analyzed again using the new analyzer. So instead of creating a test script that will delete all of the existing data in the ES index and re-add the data I thought if there is a back-up/restore module that I could use. As part of that, I found the snapshot/restore module that ES supports - ElasticSearch-SnapshotAndRestore.
My question is - If I use the above ES snapshot/restore module will it actually cause the data to be re-analyzed? Since I changed the default analyzer, I need the data to be reanalyzed. If not, is there an alternate tool/module you will suggest that will allow for pure export and import of data and hence cause the data to be re-analyzed during import?
DevUser

No it does not re-analyze the data. You will need to reindex your data.
Fortunately that's fairly straightforward with Elasticsearch as it by default stores the source of your documents:
Reindexing your data
While you can add new types to an index, or add new fields to a type,
you can’t add new analyzers or make changes to existing fields. If you
were to do so, the data that has already been indexed would be
incorrect and your searches would no longer work as expected.
The simplest way to apply these changes to your existing data is just
to reindex: create a new index with the new settings and copy all of
your documents from the old index to the new index.
One of the advantages of the _source field is that you already have
the whole document available to you in Elasticsearch itself. You don’t
have to rebuild your index from the database, which is usually much
slower.
To reindex all of the documents from the old index efficiently, use
scan & scroll to retrieve batches of documents from the old index, and
the bulk API to push them into the new index.
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/reindex.html
I'd read up on Scan and Scroll prior to taking this approach:
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/scan-scroll.html
TaskRabbit did opensource an import/export tool but I've not used it so cannot recommend but it is worth a look:
https://github.com/taskrabbit/elasticsearch-dump

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio