Is it possible to add a new similarity metric to an existing index in Elasticsearch? - elasticsearch

Let's say there is an existing index with a customized BM25 similarity metric like this:
{
  "settings": {
    "index": {
      "similarity": {
        "BM25_v1": {
          "type": "BM25",
          "b": 1.0
        }
      },
      "number_of_replicas": 0,
      "number_of_shards": 3,
      "refresh_interval": "120s"
    }
  }
}
And this similarity metric is used for two fields:
{
  "some_field": {
    "type": "text",
    "norms": "true",
    "similarity": "BM25_v1"
  },
  "another_field": {
    "type": "text",
    "norms": "true",
    "similarity": "BM25_v1"
  }
}
Now, I was wondering if it's possible to add another similarity metric (BM25_v2) to the same index and use this new metric for the another_field, like this:
"index": {
"similarity": {
# The existing metric, not changed.
"BM25_v1": {
"type": "BM25",
"b": 1.0
},
# The new similarity metric for this index.
"BM25_v2": {
"type": "BM25",
"b": 0.0
}
}
}
# ... and use the new metric for one of the fields:
{
  "some_field": {
    "type": "text",
    "norms": "true",
    "similarity": "BM25_v1"   # This field uses the same old metric.
  },
  "another_field": {
    "type": "text",
    "norms": "true",
    "similarity": "BM25_v2"   # The new metric is used for this field.
  }
}
I couldn't find any example for this scenario in the documentation, so I wasn't sure if this is possible at all.
Update: I have already seen this old, still-open issue, which concerns dynamic updates of similarity metrics in Elasticsearch, but it is not completely clear from that discussion what is and isn't possible. There have also been some attempts at achieving some level of similarity update, though I think they are not documented (e.g. it is possible to change the parameters of an existing similarity metric, say b or k1 in an existing BM25-based one).

TL;DR:
I believe you can't.
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "Mapper for [title] conflicts with existing mapper:\n\tCannot update parameter [similarity] from [my_similarity] to [my_similarity_v2]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "Mapper for [title] conflicts with existing mapper:\n\tCannot update parameter [similarity] from [my_similarity] to [my_similarity_v2]"
  },
  "status" : 400
}
If you want to change the similarity of an existing field, I believe you will have to create a new field and re-index the data.
To reproduce
PUT /70973345
{
  "settings": {
    "index": {
      "similarity": {
        "my_similarity": {
          "type": "BM25",
          "b": 1.0
        }
      }
    }
  }
}

PUT /70973345/_mapping
{
  "properties" : {
    "title" : { "type" : "text", "similarity" : "my_similarity" }
  }
}
We insert some dummy data, and retrieve it.
POST /70973345/_doc
{
  "title": "I love rock'n roll"
}

POST /70973345/_doc
{
  "title": "I love pasta al'arabita"
}

POST /70973345/_doc
{
  "title": "pasta rock's"
}

GET /70973345/_search?explain=true
{
  "query": {
    "match": {
      "title": "pasta"
    }
  }
}
If we try to update the settings without closing the index first, we get an error:
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "Can't update non dynamic settings ...."
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "Can't update non dynamic settings ...."
  },
  "status" : 400
}
POST /70973345/_close?wait_for_active_shards=0

PUT /70973345/_settings
{
  "index": {
    "similarity": {
      "my_similarity": {
        "type": "BM25",
        "b": 1.0
      },
      "my_similarity_v2": {
        "type": "BM25",
        "b": 0
      }
    }
  }
}
The update works fine, BUT:

PUT /70973345/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "similarity": "my_similarity_v2"
    }
  }
}

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "Mapper for [title] conflicts with existing mapper:\n\tCannot update parameter [similarity] from [my_similarity] to [my_similarity_v2]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "Mapper for [title] conflicts with existing mapper:\n\tCannot update parameter [similarity] from [my_similarity] to [my_similarity_v2]"
  },
  "status" : 400
}
It will not work, regardless of the open/closed status of the index.
Which makes me believe this is not possible; you would need to re-index the existing data into a new index.
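A minimal sketch of that workaround (the new index name 70973345_v2 is hypothetical): create the new index with both similarity definitions, point the field at my_similarity_v2 from the start, then copy the data over with the _reindex API.

```
PUT /70973345_v2
{
  "settings": {
    "index": {
      "similarity": {
        "my_similarity":    { "type": "BM25", "b": 1.0 },
        "my_similarity_v2": { "type": "BM25", "b": 0.0 }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": { "type": "text", "similarity": "my_similarity_v2" }
    }
  }
}

POST /_reindex
{
  "source": { "index": "70973345" },
  "dest":   { "index": "70973345_v2" }
}
```

Once the new index is verified, an index alias can be swapped over to it so that clients don't need to know the new name.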

Related

ElasticSearch Accessing Nested Documents in Script - Null Pointer Exception

Gist: Trying to write a custom filter on nested documents using painless. Want to write error checks when there are no nested documents to surpass null_pointer_exception
I have a mapping as such (simplified and obfuscated)
{
  "video_entry" : {
    "aliases" : { },
    "mappings" : {
      "properties" : {
        "captions_added" : {
          "type" : "boolean"
        },
        "category" : {
          "type" : "keyword"
        },
        "is_votable" : {
          "type" : "boolean"
        },
        "members" : {
          "type" : "nested",
          "properties" : {
            "country" : {
              "type" : "keyword"
            },
            "date_of_birth" : {
              "type" : "date"
            }
          }
        }
      }
    }
  }
}
Each video_entry document can have zero or more nested members documents.
Sample Document
{
  "captions_added": true,
  "category" : "Mental Health",
  "is_votable" : true,
  "members": [
    {"country": "Denmark", "date_of_birth": "1998-04-04T00:00:00"},
    {"country": "Denmark", "date_of_birth": "1999-05-05T00:00:00"}
  ]
}
If one or more nested documents exist, we want to write some painless scripts to check certain fields across all the nested documents. My script works on mappings with a few documents, but when I try it on a larger set of documents I get null pointer exceptions despite having every null check possible. I've tried various access patterns and error-checking mechanisms, but I keep getting exceptions.
POST /video_entry/_search
{
  "query": {
    "script": {
      "script": {
        "source": """
          // various null checks that I already tried
          // also tried short-circuiting on finding null values
          if (!params['_source'].empty && params['_source'].containsKey('members')) {
            def total = 0;
            for (item in params._source.members) {
              // custom logic here
              // if the above logic holds true:
              // total += 1;
            }
            return total > 3;
          }
          return true;
        """,
        "lang": "painless"
      }
    }
  }
}
Other Statements That I've Tried
if (params._source == null) {
  return true;
}

if (params._source.members == null) {
  return true;
}

if (!ctx._source.contains('members')) {
  return true;
}

if (!params['_source'].empty && params['_source'].containsKey('members') &&
    params['_source'].members.value != null) {
  // logic here
}

if (doc.containsKey('members')) {
  for (mem in params._source.members) {
  }
}
Error Message
&& params._source.members",
^---- HERE"
"caused_by" : {
"type" : "null_pointer_exception",
"reason" : null
}
I've looked into changing the structure (flattening the document) and the usage of must_not as indicated in this answer. They don't suit our use case as we need to incorporate some more custom logic.
Different tutorials use ctx or doc, and some use params. To add to the confusion, Debug.explain(doc.members) and Debug.explain(params._source.members) return empty responses, and I'm having a hard time figuring out the types.
Any help is appreciated.
TL;DR:
Elasticsearch flattens inner objects, such that
{
  "group" : "fans",
  "user" : [
    {
      "first" : "John",
      "last" : "Smith"
    },
    {
      "first" : "Alice",
      "last" : "White"
    }
  ]
}
turns into:
{
  "group" : "fans",
  "user.first" : [ "alice", "john" ],
  "user.last" : [ "smith", "white" ]
}
To access a members inner value you need to reference it as doc['members.<field>'], as members will not exist as a field of its own.
Details
As you may know, Elasticsearch handles inner documents in its own way. [doc]
So you will need to reference them accordingly.
Here is what I did to make it work.
By the way, I have been using the Dev Tools console in Kibana.
PUT /so_test/

PUT /so_test/_mapping
{
  "properties" : {
    "captions_added" : {
      "type" : "boolean"
    },
    "category" : {
      "type" : "keyword"
    },
    "is_votable" : {
      "type" : "boolean"
    },
    "members" : {
      "properties" : {
        "country" : {
          "type" : "keyword"
        },
        "date_of_birth" : {
          "type" : "date"
        }
      }
    }
  }
}
POST /so_test/_doc/
{
  "captions_added": true,
  "category" : "Mental Health",
  "is_votable" : true,
  "members": [
    {"country": "Denmark", "date_of_birth": "1998-04-04T00:00:00"},
    {"country": "Denmark", "date_of_birth": "1999-05-05T00:00:00"}
  ]
}

POST /so_test/_doc/
{
  "captions_added": true,
  "category" : "Mental breakdown",
  "is_votable" : true,
  "members": []
}

POST /so_test/_doc/
{
  "captions_added": true,
  "category" : "Mental success",
  "is_votable" : true,
  "members": [
    {"country": "France", "date_of_birth": "1998-04-04T00:00:00"},
    {"country": "Japan", "date_of_birth": "1999-05-05T00:00:00"}
  ]
}
And then I ran this query (it is only a bool filter, but adapting it to your own use case should not prove too difficult):
GET /so_test/_search
{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "lang": "painless",
            "source": """
              def flag = false;
              // /!\ notice how the field is referenced /!\
              if (doc['members.country'].size() != 0) {
                for (item in doc['members.country']) {
                  if (item == params.country) {
                    flag = true;
                  }
                }
              }
              return flag;
            """,
            "params": {
              "country": "Japan"
            }
          }
        }
      }
    }
  }
}
By the way, you said you were a bit confused about the execution contexts for painless; you can find details about them in the documentation. [doc]
In this case, the filter context is the one we want to look at.
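Note that the doc['members.country'] access above works because the mapping in this reproduction does not declare members as nested (unlike the mapping in the question). If members were mapped with "type": "nested", its fields would not be visible in the root document's doc values, and the usual route would be a nested query instead; a hedged sketch, assuming the same field names:

```
GET /so_test/_search
{
  "query": {
    "nested": {
      "path": "members",
      "query": {
        "term": { "members.country": "Japan" }
      }
    }
  }
}
```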

Add geoIP data to old data from Elasticsearch index

I recently added a GeoIP processor to my ingest pipeline in Elasticsearch. This works well and adds new fields to newly ingested documents.
I wanted to add the GeoIP fields to older data by running an _update_by_query on an index; however, it seems that it doesn't accept "processors" as a parameter.
What I want to do is something like this:
POST my_index*/_update_by_query
{
  "refresh": true,
  "processors": [
    {
      "geoip" : {
        "field": "doc['client_ip']",
        "target_field" : "geo",
        "database_file" : "GeoLite2-City.mmdb",
        "properties": ["continent_name", "country_iso_code", "country_name", "city_name", "timezone", "location"]
      }
    }
  ],
  "script": {
    "day_of_week": {
      "type": "long",
      "script": "emit(doc['#timestamp'].value.withZoneSameInstant(ZoneId.of(doc['geo.timezone'])).getDayOfWeek().getValue())"
    },
    "hour_of_day": {
      "type": "long",
      "script": "emit(doc['#timestamp'].value.withZoneSameInstant(ZoneId.of(doc['geo.timezone'])).getHour())"
    },
    "office_hours": {
      "script": "if (doc['day_of_week'].value < 6 && doc['day_of_week'].value > 0) { if (doc['hour_of_day'].value > 7 && doc['hour_of_day'].value < 19) { return 1; } else { return -1; } } else { return -1; }"
    }
  }
}
I receive the following error:
{
  "error" : {
    "root_cause" : [
      {
        "type" : "parse_exception",
        "reason" : "Expected one of [source] or [id] fields, but found none"
      }
    ],
    "type" : "parse_exception",
    "reason" : "Expected one of [source] or [id] fields, but found none"
  },
  "status" : 400
}
Since you have the ingestion pipeline ready, you simply need to reference it in your call to the _update_by_query endpoint, like this:
POST my_index*/_update_by_query?pipeline=my-pipeline
^
|
add this
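For completeness, a sketch under the assumption that the pipeline is registered as my-pipeline (substitute your real pipeline id). Note that in an ingest pipeline the geoip processor's field is a plain field name, not a doc['...'] expression; for large indices, wait_for_completion=false runs the update as a background task you can poll:

```
PUT _ingest/pipeline/my-pipeline
{
  "processors": [
    {
      "geoip": {
        "field": "client_ip",
        "target_field": "geo",
        "database_file": "GeoLite2-City.mmdb",
        "properties": ["continent_name", "country_iso_code", "country_name", "city_name", "timezone", "location"]
      }
    }
  ]
}

POST my_index*/_update_by_query?pipeline=my-pipeline&wait_for_completion=false
```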

Complex document - provide mapping for a few fields only but keep the rest as is

I have some pretty complex documents I want to store in Elasticsearch so I can retrieve them and make them searchable. I don't know the whole structure of the data, so I would like Elasticsearch to "swallow" everything as I put it in, but define some indexed fields to make searching possible.
As soon as I provide a mapping I get errors; if I don't define mappings, I fear that the index will grow too big because Elasticsearch will index too much.
I create the index using PHP, but it boils down to this:
PUT localhost:9200/name-of-index
{
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text"
        }
      }
    }
  }
}
Then, when I add one object to test everything, I get the following error:
status 400
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Rejecting mapping update to [name-of-index] as the final mapping would have more than 1 type: [_doc, 2187]"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Rejecting mapping update to [name-of-index] as the final mapping would have more than 1 type: [_doc, 2187]"
  },
  "status": 400
}
The command I use to post the document is roughly:
POST localhost:9200/name-of-index/2187
{
  "title": "some title",
  "otherField": "other value",
  "obj": {
    "nestedProp": "nestedValue",
    "deepObj": {
      "someStorage": [
        ...
        {
          "someVeryDeepProp": 1
        }
        ...
      ]
    }
  },
  "obj2": [
    "str1",
    "str2"
  ]
}
The node names are not real, of course, and the structure is much more complex than that, but I doubt that is the cause of my problem.
So how could I define a partial index and keep everything else as it is?
Update: I forgot some Elasticsearch information:
{
  "name" : "KNJ_3Eg",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "4M9p8XiaQHKPz7N2AAuVlw",
  "version" : {
    "number" : "6.6.0",
    "build_flavor" : "default",
    "build_type" : "deb",
    "build_hash" : "a9861f4",
    "build_date" : "2019-01-24T11:27:09.439740Z",
    "build_snapshot" : false,
    "lucene_version" : "7.6.0",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}
Okay, that was just a silly mistake: I forgot to add the type.
So the correct request to add a document should be:
POST localhost:9200/name-of-index/_doc/2187
{
  "title": "some title",
  "otherField": "other value",
  "obj": {
    "nestedProp": "nestedValue",
    "deepObj": {
      "someStorage": [
        ...
        {
          "someVeryDeepProp": 1
        }
        ...
      ]
    }
  },
  "obj2": [
    "str1",
    "str2"
  ]
}
From version 7 onwards mapping types are deprecated, though "_doc" still appears in the document APIs as a fixed path placeholder.
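As for the "index only a few fields" part of the question, which the fix above doesn't address: one option worth considering is dynamic mapping control. If I read the dynamic mapping docs correctly, setting "dynamic": false makes Elasticsearch keep unmapped fields in _source (so they are stored and returned) without indexing them, so only the explicitly mapped fields are searchable. A sketch, using the same 6.x single-type syntax as above:

```
PUT localhost:9200/name-of-index
{
  "mappings": {
    "_doc": {
      "dynamic": false,
      "properties": {
        "title": {
          "type": "text"
        }
      }
    }
  }
}
```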

How do I query a null date inside an array in elasticsearch?

In an elasticsearch query I am trying to search Document objects that have an array of approval notifications. The notifications are considered complete when dateCompleted is populated with a date, and considered pending when either dateCompleted doesn't exist or exists with null. If the document does not contain an array of approval notifications then it is out of the scope of the search.
I am aware of putting null_value for field dateCompleted and setting it to some arbitrary old date but that seems hackish to me.
I've tried bool queries with must exist on doc.approvalNotifications and must_not exist on doc.approvalNotifications.dateCompleted, but that does not work if a document contains a mix of complete and pending approvalNotifications: it only returns the document with ID 2 below, while I expect documents with IDs 1 and 2 to be found.
How can I find pending approval notifications using elasticsearch?
PUT my_index/_mapping/Document
{
  "properties" : {
    "doc" : {
      "properties" : {
        "approvalNotifications" : {
          "properties" : {
            "approvalBatchId" : {
              "type" : "text",
              "fields" : {
                "keyword" : { "type" : "keyword", "ignore_above" : 256 }
              }
            },
            "approvalTransitionState" : {
              "type" : "text",
              "fields" : {
                "keyword" : { "type" : "keyword", "ignore_above" : 256 }
              }
            },
            "approvedByUser" : {
              "type" : "text",
              "fields" : {
                "keyword" : { "type" : "keyword", "ignore_above" : 256 }
              }
            },
            "dateCompleted" : {
              "type" : "date"
            }
          }
        }
      }
    }
  }
}
Documents:
{
  "id": 1,
  "status": "Pending Notifications",
  "approvalNotifications": [
    {
      "approvalBatchId": "e6c39194-5475-4168-9729-8ddcf46cf9ab",
      "dateCompleted": "2018-11-15T16:09:15.346+0000"
    },
    {
      "approvalBatchId": "05eaeb5d-d802-4a28-b699-5e593a59d445"
    }
  ]
}

{
  "id": 2,
  "status": "Pending Notifications",
  "approvalNotifications": [
    {
      "approvalBatchId": "e6c39194-5475-4168-9729-8ddcf46cf9ab"
    }
  ]
}

{
  "id": 3,
  "status": "Complete",
  "approvalNotifications": [
    {
      "approvalBatchId": "e6c39194-5475-4168-9729-8ddcf46cf9ab",
      "dateCompleted": "2018-11-15T16:09:15.346+0000"
    },
    {
      "approvalBatchId": "05eaeb5d-d802-4a28-b699-5e593a59d445",
      "dateCompleted": "2018-11-16T16:09:15.346+0000"
    }
  ]
}

{
  "id": 4,
  "status": "No Notifications"
}
You are almost there; you can achieve the desired behavior by using the nested datatype for the approvalNotifications field.
What happens is that Elasticsearch flattens your approvalNotifications objects, treating their subfields as subfields of the original document. The nested type instead tells ES to index each inner object as an implicit separate document, though related to the original one.
To query nested objects, one should use a nested query.
Hope that helps!
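A hedged sketch of how that could look, using the field names from the question (simplified mapping; adjust to your real index): map approvalNotifications as nested, then search for documents that contain at least one notification without a dateCompleted. Since the nested query only evaluates over nested objects that actually exist, document 4 (no notifications) is excluded automatically, and documents 1 and 2 should match:

```
PUT my_index
{
  "mappings": {
    "properties": {
      "approvalNotifications": {
        "type": "nested",
        "properties": {
          "approvalBatchId": { "type": "keyword" },
          "dateCompleted":   { "type": "date" }
        }
      }
    }
  }
}

GET my_index/_search
{
  "query": {
    "nested": {
      "path": "approvalNotifications",
      "query": {
        "bool": {
          "must_not": {
            "exists": { "field": "approvalNotifications.dateCompleted" }
          }
        }
      }
    }
  }
}
```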

ElasticSearch - Copy one field value to other field for all documents

We have a field "name" in the index. We recently added a new field "alias".
I want to copy the name field value to the new alias field for all documents.
Is there any update query that will do this?
If that is not possible, help me to achieve this.
Thanks in advance.
I am trying this query
http://URL/index/profile/_update_by_query
{
  "query": {
    "constant_score" : {
      "filter" : {
        "exists" : { "field" : "name" }
      }
    }
  },
  "script" : "ctx._source.alias = name;"
}
In the script, I am not sure how to reference the name field.
I am getting this error:
{
  "error": {
    "root_cause": [
      {
        "type": "class_cast_exception",
        "reason": "java.lang.String cannot be cast to java.util.Map"
      }
    ],
    "type": "class_cast_exception",
    "reason": "java.lang.String cannot be cast to java.util.Map"
  },
  "status": 500
}
Indeed, the syntax has changed a little bit since then. You need to modify your query like this:
POST index/_update_by_query
{
  "query": {
    "constant_score" : {
      "filter" : {
        "exists" : { "field" : "name" }
      }
    }
  },
  "script" : {
    "inline": "ctx._source.alias = ctx._source.name;"
  }
}
UPDATE for ES 6
Use source instead of inline
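So on ES 6 and later, the same call would look like this (only the script key changes from inline to source):

```
POST index/_update_by_query
{
  "query": {
    "constant_score" : {
      "filter" : {
        "exists" : { "field" : "name" }
      }
    }
  },
  "script" : {
    "source": "ctx._source.alias = ctx._source.name;"
  }
}
```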