Hibernate Search Dynamic Mapping - elasticsearch

I am attempting to utilize elastic search with hibernate by using the hibernate-search elastic search integration. I have multi-tenant data that uses a discriminator strategy, so it would be great to index entities with that tenant identifier automatically added. This seems to work so far:
Session session = sessionFactory
.withOptions()
.tenantIdentifier("parkId-" + p.getId().toString())
.openSession();
However, during the index process, elastic search complains because of a strict_dynamic_mapping_exception:
Response:
{
"index": {
"_index": "entities.productmodel",
"_type": "entities.ProductModel",
"_id": "parkId-1_29426",
"status": 400,
"error": {
"type": "strict_dynamic_mapping_exception",
"reason": "mapping set to strict, dynamic introduction of [__HSearch_TenantId] within [entities.ProductModel] is not allowed"
}
}
}
This is all despite the fact that I am overriding the default behavior of hibernate search and setting dynamic mapping to true, as is shown in the docs:
configuration.setProperty("hibernate.search.default.elasticsearch.dynamic_mapping", "true");
(Other settings are properly being set via this method, so I know that is not the issue.)
Any idea what I'm missing? Even setting the dynamic_mapping to false results in no changes - elastic search still complains that the mapping is set to strict. My elastic search cluster is running locally via docker.

Make sure to re-generate your schema before each attempt when developping. You shouldn't need the dynamic_mapping setting for this, but if you generated the schema before you tried adding multitenancy, and did not update the schema since, you will experience errors like this one.
Just drop the indexes in your Elasticsearch cluster, or set the property hibernate.search.default.elasticsearch.index_schema_management_strategy to drop-and-create (NOT FOR USE IN PRODUCTION, YOU WILL LOSE ALL INDEX DATA). See this section of the documentation for more information about schema generation.

The tenant id field should be part of the schema if you have set hibernate.multiTenancy in your ORM configuration.
Did you?
If so, we might have a bug somewhere and a test case would help. We have a test case template here: https://github.com/hibernate/hibernate-test-case-templates/tree/master/search/hibernate-search-elasticsearch/hibernate-search-elasticsearch-5 .

Related

Elasticsearch: when inserting a record to index I don't want to create an index mapping

Elasticsearch default behavior when inserting a document to an index, is to create an index mapping if it's not exist.
I know that I can change this behavior on the cluster level using this call
PUT _cluster/settings
{
"persistent": {
"action.auto_create_index": "false"
}
}
but I can't control the customer's elasticsearch.
I'm asking is there a parameter which I can send with the index a document request that will tell elastic not to create the index in case it doesn't exist but to fail instead?
If you couldn’t change cluster settings or settings in elasticsearch.yml, I’m afraid it’s not possible, since there are no special parameters during POST/PUT of the documents.
Another possible solution could be to create an API level, which will prevent going to Elasticsearch completely, if there is no such index.
There is an issue on Github, that is proposing to set action.auto_create_index to false by default, but unfortunately, I couldn’t see if there is any progress on it.

Duplicate documents in Elasticsearch index with the same _uid

We've discovered some duplicate documents in one of our Elasticsearch indices and we haven't been able to work out the cause. There are two copies of each of the affected documents, and they have exactly the same _id, _type and _uid fields.
A GET request to /index-name/document-type/document-id just returns one copy, but searching for the document with a query like this returns two results, which is quite surprising:
POST /index-name/document-type/_search
{
"filter": {
"term": {
"_id": "document-id"
}
}
}
Aggregating on the _uid field also identifies the duplicate documents:
POST /index-name/_search
{
"size": 0,
"aggs": {
"duplicates": {
"terms": {
"field": "_uid",
"min_doc_count": 2
}
}
}
}
The duplicates are all on different shards. For example, a document might have one copy on primary shard 0 and one copy on primary shard 1. We've verified this by running the aggregate query above on each shard in turn using the preference parameter: it does not find any duplicates within a single shard.
Our best guess is that something has gone wrong with the routing, but we don't understand how the copies could have been routed to different shards. According to the routing documentation, the default routing is based on the document ID, and should consistently route a document to the same shard.
We are not using custom routing parameters that would override the default routing. We've double-checked this by making sure that the duplicate documents don't have a _routing field.
We also don't define any parent/child relationships which would also affect routing. (See this question in the Elasticsearch forum, for example, which has the same symptoms as our problem. We don't think the cause is the same because we're not setting any document parents).
We fixed the immediate problem by reindexing into a new index, which squashed the duplicate documents. We still have the old index around for debugging.
We haven't found a way of replicating the problem. The new index is indexing documents correctly, and we've tried rerunning an overnight processing job which also updates documents but it hasn't created any more duplicates.
The cluster has 3 nodes, 3 primary shards and 1 replica (i.e. 3 replica shards). minimum_master_nodes is set to 2, which should prevent the split-brain issue. We're running Elasticsearch 2.4 (which we know is old - we're planning to upgrade soon).
Does anyone know what might cause these duplicates? Do you have any suggestions for ways to debug it?
We found the answer! The problem was that the index had unexpectedly switched the hashing algorithm it used for routing, and this caused some updated documents to be stored on different shards to their original versions.
A GET request to /index-name/_settings revealed this:
"version": {
"created": "1070599",
"upgraded": "2040699"
},
"legacy": {
"routing": {
"use_type": "false",
"hash": {
"type": "org.elasticsearch.cluster.routing.DjbHashFunction"
}
}
}
"1070599" refers to Elasticsearch 1.7, and "2040699" is ES 2.4.
It looks like the index tried to upgrade itself from 1.7 to 2.4, despite the fact that it was already running 2.4. This is the issue described here: https://github.com/elastic/elasticsearch/issues/18459#issuecomment-220313383
We think this is what happened to trigger the change:
Back when we upgraded the index from ES 1.7 to 2.4, we decided not to upgrade Elasticsearch in-place, since that would cause downtime. Instead, we created a separate ES 2.4 cluster.
We loaded data into the new cluster using a tool that copied over all the index settings as well as the data, including the version setting which you should not set in ES 2.4.
While dealing with a recent issue, we happened to close and reopen the index. This normally preserves all the data, but because of the incorrect version setting, it caused Elasticsearch to think that an upgrade was in processed.
ES automatically set the legacy.routing.hash.type setting because of the false upgrade. This meant that any data indexed after this point used the old DjbHashFunction instead of the default Murmur3HashFunction which had been used to route the data originally.
This means that reindexing the data into a new index was the right thing to do to fix the issue. The new index has the correct version setting and no legacy hash function settings:
"version": {
"created": "2040699"
}

How to reindex AWS Elasticsearch?

My Ruby/Sinatra app connects to an AWS ES cluster using the elasticsearch-ruby gem to index text documents that authorised (by indexing using their user ID) users can search through. Now, I want to copy a document from one index to another to make a document query-able by a different, authorised user. I tried the _reindex endpoint as documented on this file only to get the following error:
Elasticsearch::Transport::Transport::Errors::Unauthorized - [401] {"Message":"Your request: '/_reindex' is not allowed."}:
Googling around, I stumbled across an Amazon docs page that lists all supported operations on both their API's, and for some twisted reason _reindex isn't there yet. Why is that? More importantly,
how do I get around this efficiently and achieve what I want to do?
You should double check the Elasticsearch version deployed by AWS ES. The _reindex API became available in version 2.2 I believe. You can check the version number by GETting the ES root ip & port with curl e.g. and checking version.number.
To work around not having the _reindex endpoint, I would recommend you implement it yourself. This isn't too bad. You can use a scroll to iterate through all the documents you want to reindex. If it is the entire index, you can use a matchall query with the scroll. You can then manipulate the documents as you wish or simply use the bulk api to post (i.e. reindex) the documents to the new index.
Make sure to have created the new index with the mapping template you want ahead of time.
This procedure above is best for reindexing lots of documents; if you just want to move a few or one (which it sounds like you do). Grab the document from its existing index by id and submit it to your second index.
AWS Elasticsearch now supports remote reindex, check this documentation:
https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/remote-reindex.html
Example below:
'''
POST <local-domain-endpoint>/_reindex
{
"source": {
"remote": {
"host": "https://remote-domain-endpoint:443"
},
"index": "remote_index"
},
"dest": {
"index": "local_index"
}
}
'''

Specifying data type and analyzer while creating index

I am using elastic search to index my data and i was able to do it with the following POST request
http://localhost:9200/index/type/id
{
JSON data over here
}
Yesterday while i was going through some of the elastic tutorials i found one person mentioning about setting analyzer to those fields where we are planning to do full text search.I found during my googling that mapping API can be used to update datatypes and analyzer, but in my case i want to do it as i am creating index.
How can i do it?
You can create index with custom settings (and mappings) in the first request and then index your data with second request. In this case you can not do both at the same time.
However if you index your data first and index does not exist yet, it will be created automatically with default settings. You can then update your mappings.
Source: Index

What is the best way to index Couchbase data on Elastic Search

I work with Couchbase DB and I want to index part of its data on Elastic Search (ES).
The data from Couchbase should be synced, i.e. if the document on CB changes, it should change the document on ES.
I have several questions about what is the best way to do it:
What is the best way to sync the data ? I saw that there is a CB plugin for ES (http://www.couchbase.com/couchbase-server/connectors/elasticsearch), but it that the recommended way ?
I don't want to store all the CB document on ES, but only part of it, e.g. some of the fields I want to store and some not - how can I do it ?
My documents may have different attributes and the difference may be big (e.g. 50 different attributes/fields). Assuming I want to index all these attributes to ES, will it effect the performance because I have a lot of fields indexed ?
10x,
Given the doc link, I am assuming you are using Couchbase and not CouchDB.
You are following the correct link for use of Elastic Search with Couchbase. Per the documentation, configure the Cross Data Center Replication (XDCR) capabilities of Couchbase to push data to ES automatically as mutations occur.
Without a defined mapping file, ES will create a default mapping. You can provide your own mapping file (or alter the one it generates) to control which fields get indexed. Refer to the enabled property in the ES documentation at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-object-type.html.
Yes, indexing all fields will affect performance. You can find some performance management tips for the Couchbase integration at http://docs.couchbase.com/couchbase-elastic-search/#managing-performance. The preferred approach to the integration is perform the search in ES and only get keys back for the matched documents. You then make a multiget call against the Couchbase cluster to retrieve the document details themselves. So while ES will index many fields, you do not store all fields there nor do you retrieve their values from ES. The in-memory multiget against Couchbase is the fastest way to retrieve the matching documents, using the IDs from ES.
Lot of questions..!
Let me answer one by one:
1)The best way and already available solution to use river plugin to dynamically sync the data.And also it ll index the changed document alone..It ll help a lot in performance.
2)yes you can restrict the field to be indexed in river plugin. refer
The documents of plugin is available in couchbase website itself.
Refer: http://docs.couchbase.com/couchbase-elastic-search/
Github river is still in development.,but you can use the code and modify as your need.
https://github.com/mschoch/elasticsearch-river-couchbase
3)If you index all the fields, yes there will be some lag in performance.So better to index the needed fields alone. if you need to store some field just to store, then mention in mapping as not analyzed to specific.It will decrease indexing time and also searching time.
HOpe it helps..!
You might find this additional explanation regarding Don Stacy's answer to question 2 useful:
When replicating from Couchbase, there are 3 ways in which you can interfere with Elasticsearch's default mapping (before you start XDCR) and thus, as desired, not store certain fields by setting "store" = false:
Create manual mappings on your index
Create a dynamic template
Edit couchbase_template.json
Hints:
Note that when we do XDCR from Couchbase to Elasticsearch, Couchbase wraps the original document in a "doc" field. This means that you have to take this modified structure into account when you create your mapping. It would look something like this:
curl -XPUT 'http://localhost:9200/test/couchbaseDocument/_mapping' -d '
{
"couchbaseDocument": {
"_source": {
"enabled": false
},
"properties": {
"doc": {
"properties": {
"your_field_name": {
"store": true,
...
},
...
}
}
}
}
}'
Documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html
Including/Excluding fields from _source: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html
Documentation: https://www.elastic.co/guide/en/elasticsearch/reference/2.0/dynamic-templates.html
https://forums.couchbase.com/t/about-elasticsearch-plugin/2433
https://forums.couchbase.com/t/custom-maps-for-jsontypes-with-elasticsearch-plugin/395

Resources