Hibernate Search Mass Indexer Doesn't Drop Mappings - elasticsearch

I am currently using the MassIndexer like so to reindex my entities:
fullTextSession.createIndexer().startAndWait();
However, I have learned that the MassIndexer does not drop the existing mappings. It seems like the only way to drop mappings is to set the index_schema_management_strategy to 'drop-and-create', which is not recommended to be used in a production environment.
I have tried hitting Elasticsearch directly with the delete index API before reindexing with the MassIndexer, but that introduces strange behavior with our mappings.
What is the recommended way to drop an index and its mappings, and then rebuild that index using the MassIndexer?

I've found a way to get Hibernate Search to completely drop mappings and indexes, although it is quite a workaround.
By instantiating an ElasticsearchService like so:
SearchIntegrator si = org.hibernate.search.orm.spi.SearchIntegratorHelper.extractFromEntityManagerFactory( fullTextSession.getEntityManagerFactory() );
ElasticsearchService elasticsearchService = si.getServiceManager().requestService(ElasticsearchService.class);
You can then access the following classes:
ElasticsearchSchemaDropper schemaDropper = elasticsearchService.getSchemaDropper();
ElasticsearchSchemaCreator schemaCreator = elasticsearchService.getSchemaCreator();
And do the following:
schemaDropper.drop(URLEncodedString.fromString("index you want to drop"), options);
schemaCreator.createIndex( indexMetadata, options );
schemaCreator.createMappings( indexMetadata, options );
This is essentially what the drop-and-create configuration setting will do for you. We plan on setting this up as some external service to hit whenever we want to completely rebuild our index - both the mappings and the documents.
Unfortunately, this feels very hacky, and it is curious that there doesn't seem to be a better way to do this.

The purpose of the MassIndexer is to help update the index content as a disaster recovery strategy or to build the initial index state, as normally the index is kept in sync automatically.
It is not meant to perform more advanced lifecycle operations on the index such as live schema changes.
For that, I would suggest at least stopping/restarting the Hibernate application (so you can use that other property), and most likely invoking index management operations via external scripts or as part of your release process.
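A rough sketch of what such an external step could look like (the index name is hypothetical; Hibernate Search derives it from the indexed entity unless configured otherwise): with the application stopped, drop the index from a release script, restart with index_schema_management_strategy set to create so Hibernate Search recreates the index and mappings, then repopulate the documents with the MassIndexer as shown above.

# issued against Elasticsearch from a release script, while the application is down
DELETE /com.acme.model.book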
Letting Hibernate Search manage the index schema is meant as a convenience during development.

Related

Elasticsearch: update of an existing record (which has a custom routing param set) results in a duplicate record if custom routing is not set during the update

Env Details:
Elasticsearch version 7.8.1
The routing param is optional in index settings.
As per the Elasticsearch docs - https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-routing-field.html
When indexing documents specifying a custom _routing, the uniqueness of the _id is not guaranteed across all of the shards in the index. In fact, documents with the same _id might end up on different shards if indexed with different _routing values.
We have ended up in the same scenario: earlier we were using a custom routing param (let's say customerId), and for some reason we now need to remove custom routing.
This means docId will now be used as the default routing param, which creates duplicate records with the same id across different shards during index operations. Earlier (before removing custom routing) the same operation resulted in an update of the record, as expected.
I am thinking of the following approaches to get out of this; please advise if you have a better approach to suggest. The key here is to AVOID DOWNTIME.
Approach 1:
When we receive the update request, let the duplicate record get created. Once the record without custom routing is created, issue a delete request for the record with custom routing (see the delete request sketched below).
CONS: If some records never receive an update, they will linger around with custom routing; we want to avoid this as it might result in unforeseen scenarios in the future.
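For reference, the cleanup delete in Approach 1 has to carry the old routing value, otherwise Elasticsearch will look for the document on the wrong shard (index name, document id and routing value below are made up):

DELETE /customers/_doc/101?routing=customer-42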
Approach 2:
We use the Reindex API to migrate data to a new index (turning off custom routing during migration). The application will use the new index after successful migration.
CONS: Some of our indexes are huge; they take 12+ hrs for the reindex operation, and since the Elasticsearch Reindex API works off a snapshot of the source, it will not migrate the newer records created during that 12 hr window. This would require downtime.
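A hedged sketch of the Approach 2 reindex call (index names are invented). The dest routing value "discard" asks the Reindex API to drop the routing copied from the source documents; worth double-checking against your Elasticsearch version:

POST /_reindex
{
  "source": { "index": "customers_v1" },
  "dest": { "index": "customers_v2", "routing": "discard" }
}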
Please suggest an alternative if you have faced this before.
Thanks @Val. I also found a few other approaches, like writing to both indexes and reading from the old one, then switching reads to the new one after re-indexing is finished. Something along the following lines:
Create aliases pointing to the old indices (*_v1)
Point the application to these aliases instead of the actual indices
Create new indices (*_v2) with the same mapping
Move data from the old indices to the new ones using re-indexing, making sure we don't retain custom routing during this
Post re-indexing, change the aliases to point to the new indices instead of the old ones (need to verify this, but there are easy alternatives if it doesn't work)
Once verification is done, delete the old indices
What do we do in the transition period (the window between reindexing start and reindexing finish)?
Write to both indices (old and new) and read from the old indices via the aliases
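The alias flip in the steps above can be done atomically in a single call, roughly like this (index and alias names are illustrative):

POST /_aliases
{
  "actions": [
    { "remove": { "index": "customers_v1", "alias": "customers" } },
    { "add": { "index": "customers_v2", "alias": "customers" } }
  ]
}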

Keeping the .enrich index updated with the source index - elasticsearch

I'm using the new enrich API of Elasticsearch (ver 7.11).
To my understanding, I need to execute the policy ("PUT /_enrich/policy/my-policy/_execute") each time the source index changes, which leads to the creation of a new .enrich index.
Is there an option to make this happen automatically and avoid index creation on every change of the source index?
This is not (yet) supported and there have been other reports of similar needs.
It seems to be complex to provide the ability to regularly update an enrich index based on a changing source index, and the issue above explains why.
That feature might become available some day; something seems to be in the works. I agree it would be super useful.
You can add a default pipeline to your index; that pipeline will process the documents.
See here.
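As a rough sketch of that suggestion (pipeline, policy, index and field names are all invented): define an ingest pipeline with an enrich processor and make it the index's default pipeline. Note that the enrich index it reads from is still only refreshed when the policy is executed.

PUT /_ingest/pipeline/customer-enrich
{
  "processors": [
    {
      "enrich": {
        "policy_name": "my-policy",
        "field": "customer_id",
        "target_field": "customer"
      }
    }
  ]
}

PUT /orders/_settings
{
  "index.default_pipeline": "customer-enrich"
}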

Disable index mapping creation for Hibernate Search using Elasticsearch

I'm using Hibernate Search for Elasticsearch (5.8.2.Final) and Elasticsearch (6.0). I'm new to Hibernate Search and I'm aware that Hibernate Search for Elasticsearch is experimental. I'm also aware that Hibernate Search 6 is going to bring some improvements for use with ES. However, in the meantime, I'm finding that the annotations do not let me create the kinds of index mappings I want, and I was wondering if there was a way to disable the creation of the index mapping entirely. I'd like to allow ES to apply an index template to my index when Hibernate first creates it. I've read the docs and stepped through the code, but I am not seeing anything that would allow me to do this. Is this possible?
Thank you.
You can disable index and mapping creation altogether (see Sanne's answer), but you cannot currently ask Hibernate Search to create indexes without creating mappings.
One solution would be for you to create your indexes beforehand. After all, if you're fine with adding templates, why not add the indexes too?
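For instance, the index could be created as part of your deployment, before the application starts; any index template whose pattern matches the name is applied at creation time (the index name below is only an illustration, Hibernate Search derives the real one from the entity unless configured otherwise):

# settings and mappings from any matching index template are applied at creation
PUT /com.example.model.book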
Another solution: you could try the update strategy. If your custom mapping is compatible with the one generated by Hibernate Search, they might simply get merged. Be careful not to use this in production though, since a failure with the update strategy will leave you in deep trouble (see the warnings in the documentation).
To skip creating the index definitions:
hibernate.search.default.elasticsearch.index_schema_management_strategy none
See also 11.3.4. Hibernate Search configuration

Speeding up mapping creation in ElasticSearch

With each index that we create, we need to create mappings for 10 types.
While indexes are created and documents are indexed blazingly fast, the bottleneck we keep hitting is slow mapping creation. In some cases (when we need to create multiple indexes at the same time) it even breaks: Elasticsearch rejects the request because the mapping was not created within 30 seconds.
Is there any way to speed up mapping creation, or send mappings in bulk?
I think you have to use index templates, which allow you to define templates that are automatically applied to newly created indices. A template includes both settings and mappings, plus a simple name pattern that controls whether the template is applied to a newly created index.
More details here.
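A minimal sketch of such a template (name, pattern and fields are invented; on older Elasticsearch versions the pattern key is template, on newer ones it is index_patterns). Every index whose name matches the pattern gets these settings and type mappings at creation time, so no separate put-mapping calls are needed:

PUT /_template/events_template
{
  "template": "events-*",
  "settings": { "number_of_shards": 5 },
  "mappings": {
    "click": {
      "properties": { "url": { "type": "string", "index": "not_analyzed" } }
    },
    "purchase": {
      "properties": { "amount": { "type": "double" } }
    }
  }
}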
Regards,
Alain

Is Elasticsearch suitable as a final storage solution?

I'm currently learning Elasticsearch, and I have noticed that a lot of operations for modifying indices require reindexing all documents. For example, adding a field to all documents means, as I understand it, retrieving each document, performing the desired operation, deleting the original document from the index and reindexing it. This seems somewhat dangerous, and a backup of the original index seems preferable before performing such an operation (obviously).
This made me wonder whether Elasticsearch is actually suitable as a final storage solution at all, or whether I should keep the raw documents that make up an index stored separately, so that I can recreate the index from scratch if necessary. Or is a regular backup of the index safe enough?
You are talking about two issues here:
Deleting old documents and re-indexing on schema change: You don't always have to delete old documents when you add new fields. There are various options to change the schema. Have a look at this blog which explains changing the schema without any downtime.
http://www.elasticsearch.org/blog/changing-mapping-with-zero-downtime/
Also, look at the Update API which gives you the ability to add/remove fields.
The update API allows to update a document based on a script provided. The operation gets the document (collocated with the shard) from the index, runs the script (with optional script language and parameters), and index back the result (also allows to delete, or ignore the operation). It uses versioning to make sure no updates have happened during the "get" and "reindex".
Note, this operation still means full reindex of the document, it just removes some network roundtrips and reduces chances of version conflicts between the get and the index. The _source field needs to be enabled for this feature to work.
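A minimal illustration of such a scripted update (index, type, document id and field name are invented; recent Elasticsearch versions use POST /index/_update/id and a structured script object instead):

POST /my-index/my-type/1/_update
{
  "script": "ctx._source.view_count += 1"
}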
Using Elasticsearch as a final storage solution: it depends on how you intend to use Elasticsearch as storage. Do you need an RDBMS, a key-value store, a column-based datastore, or a document store like MongoDB? Elasticsearch is definitely well suited when you need a distributed document store (JSON, HTML, XML etc.) with Lucene-based advanced search capabilities. Have a look at the various use cases for ES, especially the usage at The Guardian: http://www.elasticsearch.org/case-study/guardian/
I'm pretty sure that search engines shouldn't be viewed as a storage solution, because of the nature of these applications. I've never heard of the practice of backing up a search engine's index.
The usual setup when you are using Elasticsearch or Solr or whatever search engine you have:
You have some kind of datasource (it could be a database, a legacy mainframe, Excel sheets, some REST service with data, or whatever)
You have a search engine that indexes this datasource to add search capability to your system. When the datasource changes, you can reindex it fully, or index only the changed part with the help of incremental indexing.
If something happens to the search engine index, you can easily reindex all your data.
