Elasticsearch partial update script: Clear array and replace with new values

I have documents like:
{
MyProp: ["lorem", "ipsum", "dolor"]
... lots of stuff here ...
}
My documents can be quite big (but these MyProp fields are not), and expensive to generate from scratch.
Sometimes I need to update batches of these - it would therefore be beneficial to do a partial update (to save "indexing client" processing power and bandwidth, and thus time) and replace the MyProp values with new values.
Example of original document:
{
MyProp: ["lorem", "ipsum", "dolor"]
... lots of stuff here ...
}
Example of updated document (or rather how it should look):
{
MyProp: ["dolor", "sit"]
... lots of stuff here ...
}
From what I have seen, this involves scripting.
Can anyone enlighten me with the remaining bits of the puzzle?
Bounty added:
I'd also like some instructions on how to do this in a batch statement, if possible.

You can use the update by query API to do batch updates. It is available from ES 2.3 onwards; on earlier versions you need to install a plugin.
POST index/_update_by_query
{
"script": {
"inline": "ctx._source.MyProp += newProp",
"params": {
"newProp": "sit"
}
},
"query": {
"match_all": {}
}
}
You can of course use whatever query you want in order to select the documents on which MyProp needs to be updated. For instance, you could have a query to select documents having some specific MyProp values to be replaced.
The above will only add a new value to the existing array. If you need to completely replace the MyProp array, then you can also change the script to this:
POST index/_update_by_query
{
"script": {
"inline": "ctx._source.MyProp = newProps",
"params": {
"newProps": ["dolor", "sit"]
}
},
"query": {
"match_all": {}
}
}
Note that you also need to enable dynamic scripting in order for this to work.
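For readers on newer clusters, here is a minimal sketch of building the same "replace the array" request body from client code. It assumes Elasticsearch 5.x or later, where Painless is the default language, the script text goes under "source" rather than "inline", and parameters are read through params. The helper name build_replace_body and the parameter name newValues are made up for illustration; the snippet only builds the JSON and does not send it.

```python
import json


def build_replace_body(field, new_values, query=None):
    """Build an _update_by_query body that overwrites `field` with `new_values`.

    Hypothetical helper: assumes a Painless-era cluster (ES 5.x+).
    """
    return {
        "script": {
            "lang": "painless",
            # Only the field name is interpolated; the values travel as params.
            "source": f"ctx._source['{field}'] = params.newValues",
            "params": {"newValues": list(new_values)},
        },
        # Default to every document, as in the examples above.
        "query": query if query is not None else {"match_all": {}},
    }


body = build_replace_body("MyProp", ["dolor", "sit"])
print(json.dumps(body, indent=2))
```

Passing the new values as params rather than concatenating them into the script text keeps the script string identical between calls, which lets the cluster reuse the compiled script.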
UPDATE
If you simply want to update a single document you can use the partial document update API, like this:
POST test/type1/1/_update
{
"doc" : {
"MyProp" : ["dolor", "sit"]
}
}
This will effectively replace the MyProp array in the specified document.
If you want to go the bulk route, you don't need scripting to achieve what you want:
POST index/type/_bulk
{ "update" : {"_id" : "1"} }
{ "doc" : {"MyProp" : ["dolor", "sit"] } }
{ "update" : {"_id" : "2"} }
{ "doc" : {"MyProp" : ["dolor", "sit"] } }
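If the list of IDs comes from your own code, the bulk body above can be generated rather than written by hand. This is a rough sketch using only the standard library; the function name build_bulk_partial_updates is hypothetical, and sending the payload (HTTP client, Content-Type header, index and type in the URL) is left out.

```python
import json


def build_bulk_partial_updates(doc_ids, partial_doc):
    """Return a newline-delimited JSON body with one partial update per ID."""
    lines = []
    for doc_id in doc_ids:
        # Action line: which document to update.
        lines.append(json.dumps({"update": {"_id": doc_id}}))
        # Payload line: the fields to merge into the existing document.
        lines.append(json.dumps({"doc": partial_doc}))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


payload = build_bulk_partial_updates(["1", "2"], {"MyProp": ["dolor", "sit"]})
print(payload, end="")
```

Each document gets exactly two lines, so a batch of N documents produces 2N lines plus the trailing newline.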

Would a _bulk update work for you?
POST test/type1/_bulk
{"update":{"_id":1}}
{"script":{"inline":"ctx._source.MyProp += new_param","params":{"new_param":"bla"},"lang":"groovy"}}
{"update":{"_id":2}}
{"script":{"inline":"ctx._source.MyProp += new_param","params":{"new_param":"bla"},"lang":"groovy"}}
{"update":{"_id":3}}
{"script":{"inline":"ctx._source.MyProp += new_param","params":{"new_param":"bla"},"lang":"groovy"}}
....
You would also need to enable inline scripting for groovy. The request above appends the value bla to the MyProp field of each listed document. Depending on your requirements, the script can of course perform many other changes.

Related

Elastic search apply boost based on nested field value

Below is my indexed document
{
"defaultBoostValue":1.01,
"boostDetails": [
{
"Type": "Type1",
"value": 1.0001
},
{
"Type": "Type2",
"value": 1.002
},
{
"Type": "Type3",
"value": 1.0005
}
]
}
I want to apply a boost based on a value passed in at query time. For example, if I pass Type1, the boost applied should be 1.0001; if Type1 does not exist in the document, it should fall back to defaultBoostValue.
Below is my query, which works but is quite slow. Is there any way to optimize it further?
Original question
The query works but is slow because it reads from _source:
{
"query": {
"function_score": {
"boost_mode": "multiply",
"functions": [
{
"script_score": {
"script": {
"source": """
double findBoost(Map params_copy) {
for (def group : params_copy._source.boostDetails) {
if (group['Type'] == params_copy.preferredBoostType ) {
return group['value'];
}
}
return params_copy._source['defaultBoostValue'];
}
return findBoost(params)
""",
"params": {
"preferredBoostType": "Type1"
}
}
}
}
]
}
}
}
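To make the intended lookup explicit outside of Elasticsearch, here is a small reference version of the findBoost logic. It is only a sketch for checking expected boosts against a document's source on the client side; the function name find_boost is made up, and the sample document is the one shown at the top of the question.

```python
def find_boost(source, preferred_boost_type):
    """Return the boost for the preferred type, or the document default."""
    for group in source.get("boostDetails", []):
        if group["Type"] == preferred_boost_type:
            return group["value"]
    # No matching entry: fall back to the per-document default.
    return source["defaultBoostValue"]


doc = {
    "defaultBoostValue": 1.01,
    "boostDetails": [
        {"Type": "Type1", "value": 1.0001},
        {"Type": "Type2", "value": 1.002},
        {"Type": "Type3", "value": 1.0005},
    ],
}

print(find_boost(doc, "Type1"))
print(find_boost(doc, "Type9"))
```

The linear scan over boostDetails is the same work the script does per matching document, which is part of why the _source-based version is slow.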
I have removed the condition of not having dynamic mapping. If changing the structure of the boostDetails mapping can help, I am OK with that, but please explain how it helps and why it would be faster to query. If the answer involves modifying the mapping, please also give the mapping types and the modified structure.
Using dynamic mappings (lots of fields)
It looks like you adjusted the doc structure compared to your original question.
The query above was designed for nested fields, which cannot be easily iterated in a script for performance reasons. That said, it is an even slower workaround because it accesses each doc's _source and iterates its contents. Keep in mind that accessing _source in scripts is not recommended!
If your docs aren't nested anymore, you can access the so-called doc values which are much more optimized for query-time access:
{
"query": {
"function_score": {
...
"functions": [
{
...
"script_score": {
"script": {
"lang": "painless",
"source": """
try {
if (doc['boost.boostType.keyword'].value == params.preferredBoostType) {
return doc['boost.boostFactor'].value;
} else {
throw new Exception();
}
} catch(Exception e) {
return doc['fallbackBoostFactor'].value;
}
""",
"params": {
"preferredBoostType": "Type1"
}
}
}
}
]
}
}
}
thus speeding up your function score query.
Alternative using an ordered list of values
Since the nested iteration is slow and dynamic mappings are blowing up your index, you could store your boosts in a standardized ordered list in each document:
"boostValues": [1.0001, 1.002, 1.0005, ..., 1.1]
and keep track of the corresponding boost types' order in the backend where you construct the queries:
var boostTypes = ["Type1", "Type2", "Type3", ..., "TypeN"]
So something like n-hot vectors.
Then, as you construct the Elasticsearch query, you'd look up the position of the requested boost type in boostTypes and pass that array index to the script query from above, which would access the corresponding boostValues doc value.
This is guaranteed to be faster than _source access. But it's required that you always keep your boostTypes and boostValues in sync -- preferably append-only (as you add new boostTypes, the list grows in one dimension).
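The backend half of that idea (mapping a boost type to its position and handing the position to the script) might look roughly like this. It is a sketch under assumptions: the names BOOST_TYPES, boost_script_params and the parameter boostIndex are invented here, and using -1 to signal "unknown type, use the fallback" is just one possible convention.

```python
# Hypothetical, append-only list kept in the backend; order must match boostValues.
BOOST_TYPES = ["Type1", "Type2", "Type3"]


def boost_script_params(preferred_boost_type, boost_types=BOOST_TYPES):
    """Return script params carrying the array index for the requested type."""
    try:
        index = boost_types.index(preferred_boost_type)
    except ValueError:
        # Unknown type: let the script fall back to the default boost.
        index = -1
    return {"boostIndex": index}


print(boost_script_params("Type2"))
print(boost_script_params("TypeX"))
```

One thing worth verifying before relying on this: doc values for multi-valued fields are generally returned in sorted order rather than insertion order, so positional access in the script may not line up with boostTypes for your mapping.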

Is there a way to update a document with a Painless script without changing the order of unaffected fields?

I'm using Elasticsearch's Update by Query API to update some documents with a Painless script like this (the actual query is more complicated):
POST ts-scenarios/_update_by_query?routing=test
{
"query": {
"term": { "routing": { "value": "test" } }
},
"script": {
"source": """ctx._source.tagIDs = ["5T8QLHIBB_kDC9Ugho68"]"""
}
}
This works, except that upon reindexing, other fields get reordered, including some classes which are automatically (de)serialized using JSON.NET's type handling. That means a document with the following source before the update:
{
"routing" : "testsuite",
"activities" : [
{
"$type" : "Test.Models.SomeActivity, Test"
},
{
"$type" : "Test.Models.AnotherActivity, Test",
"CustomParameter" : 1,
"CustomSetting" : false
}
]
}
ends up as
{
"routing" : "testsuite",
"activities" : [
{
"$type" : "Test.Models.SomeActivity, Test"
},
{
"CustomParameter" : 1,
"CustomSetting" : false,
"$type" : "Test.Models.AnotherActivity, Test"
}
],
"tagIDs" : [
"5T8QLHIBB_kDC9Ugho68"
]
}
which JSON.NET can't deserialize. Is there a way I can tell the script (or the Update by Query API) not to change the order of those other fields?
In case it matters, I'm using Elasticsearch OSS version 7.6.1 on macOS. I haven't checked whether an Ingest pipeline would work here, as I'm not familiar with them.
(It turns out I can make the deserialization more flexible by setting the MetadataPropertyHandling property to ReadAhead, as mentioned here. That works, but as mentioned it may hurt performance and there might be other situations where field order matters. Technically, it shouldn't; JSON isn't XML, but there are always edge cases where it does matter.)

Elasticsearch. Painless script to search based on the last result

Let's see if someone can shed some light on this one, which seems to be a little hard.
We need to correlate data from multiple indices and various fields. We are trying a Painless script.
Example:
We search an index to gather the queueids of mails sent by someone#domain
Once we have the queueids, we need to store them in an array and iterate over it to run new searches that gather data such as email receivers, spam checks, postfix results and so on.
Problem: How can we store the data from one search and use it later in a second search?
We are testing something like:
GET here_an_index/_search
{
"query": {
"bool" : {
"must": [
{
"range": {
"@timestamp": {
"gte": "now-15m",
"lte": "now"
}
}
}
],
"filter" : {
"script" : {
"script" : {
"source" : "doc['postfix_from'].value == params.from; qu = doc['postfix_queueid'].value; return qu",
"params" : {
"from" : "someona#mdomain"
}
}
}
}
}
}
}
And, of course, it throws an error.
"doc['postfix_from'].value ...",
"^---- HERE"
So, in a nutshell: is there any way to execute a search looking for some field value based on a filter (like from:someone#dfomain) and use those values in later searches?
We have evaluated using script fields or nested documents, but for architectural reasons, and because of what those changes would entail, they cannot be used right now.
Thank you very much!
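For what it's worth, a script inside a query is evaluated per document and cannot carry values over to another request, so one option is to do the correlation in the client: run the first search, collect the queueids from the hits, and feed them into a terms filter for the second search. The sketch below only shows that glue logic. The helper names and the sample response are made up, and it assumes postfix_queueid is returned in _source and is mapped so that exact-match terms queries work.

```python
def extract_queue_ids(search_response, field="postfix_queueid"):
    """Collect distinct values of `field` from a search response, in hit order."""
    ids = []
    for hit in search_response.get("hits", {}).get("hits", []):
        value = hit.get("_source", {}).get(field)
        if value is not None and value not in ids:
            ids.append(value)
    return ids


def build_followup_query(queue_ids, field="postfix_queueid"):
    """Build the second search body: match any of the collected queueids."""
    return {"query": {"bool": {"filter": [{"terms": {field: queue_ids}}]}}}


# Made-up response shaped like a real _search result, for illustration only.
sample_response = {
    "hits": {
        "hits": [
            {"_source": {"postfix_queueid": "A1B2C3"}},
            {"_source": {"postfix_queueid": "D4E5F6"}},
            {"_source": {"postfix_queueid": "A1B2C3"}},
        ]
    }
}

queue_ids = extract_queue_ids(sample_response)
followup = build_followup_query(queue_ids)
print(followup)
```

The follow-up body can then be sent to whichever indices hold the receiver, spam check and postfix result records.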

Match query return records only if query contains all words of object's field

I read about match and multiword queries but it seems that I need to do something a bit different.
Let's say I have the following query: "this is a test", and I want to search for it in a field called "text". I want to get objects that match part of that query (it doesn't matter how many words), but only those objects whose text field consists entirely of words that appear in the query.
Example for query: "this is a test". I want get those objects:
obj1: {"text":"this is a test"}
obj2: {"text":"this is a"}
obj3 : { "text" : "is a" }
obj4 : { "text" : "test" }
But if an object has anything more in its text field, it should not be returned. For example:
obj5: {"text":"this is a test and something more"}
Is it possible to achieve this using Elasticsearch?
It's kind of a hack, but I was able to get it to work with a script filter:
POST /test_index/_search
{
"query": {
"match": {
"text": "this is a test"
}
},
"filter": {
"script": {
"script": "for(val in doc[\"text\"].values){ if(!(val in terms)){ return false; }}; return true;",
"params": {
"terms": ["this", "is", "a", "test"]
}
}
}
}
I thought there would be a better way to do this, but wasn't immediately able to come up with one. Using scripting can be problematic in production, unless your ES cluster is behind an auth wall of some kind.
Anyway, here's the code I used to test it:
http://sense.qbox.io/gist/3929abc89d71ebf724e6121b1b5ba6da54501088
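The check the script filter performs is essentially "every term of the document's text is one of the query terms". A plain-Python rendering of that condition, handy for sanity-checking expected results, could look like this. It assumes simple whitespace tokenization, which is only an approximation of what the field's analyzer does; the function name text_is_covered is made up.

```python
def text_is_covered(text_tokens, query_terms):
    """True if every token of the document text appears among the query terms."""
    allowed = set(query_terms)
    return all(token in allowed for token in text_tokens)


terms = ["this", "is", "a", "test"]
texts = [
    "this is a test",
    "this is a",
    "is a",
    "test",
    "this is a test and something more",
]

for text in texts:
    print(text, "->", text_is_covered(text.split(), terms))
```

Only the last text fails the check, because "and", "something" and "more" are not among the query terms.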

How to use special document fields in scripts in elastic?

I'm trying to write query with custom script in elasticsearch:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-script-filter.html#query-dsl-script-filter
https://www.elastic.co/guide/en/elasticsearch/reference/master/modules-scripting.html.
This is useful when you need to compare two document fields.
Everything worked fine until I decided to use a special document field (e.g. _id, _uid, etc.). The query always returns empty results, with no errors, if I use it like this: doc['_id'].value.
So how can I use, for example, the "_id" field of a document in a custom script?
The _id is indexed in the _uid field, using this format: type#id.
So, your script should look like this (for a type called my_type and an ID of 1):
{
"query": {
"filtered": {
"filter": {
"script" : {
"script" : "doc['_uid'].value == 'my_type#1'"
}
}
}
}
}
A more elaborate solution, which extracts the ID the Elasticsearch way, looks like this:
{
"query": {
"filtered": {
"filter": {
"script": {
"script": "org.elasticsearch.index.mapper.Uid.splitUidIntoTypeAndId(new org.apache.lucene.util.BytesRef(doc['_uid'].value))[1].utf8ToString() == '1'"
}
}
}
}
}
where org.elasticsearch.index.mapper.Uid.splitUidIntoTypeAndId(new org.apache.lucene.util.BytesRef(doc['_uid'].value))[1] is the id and org.elasticsearch.index.mapper.Uid.splitUidIntoTypeAndId(new org.apache.lucene.util.BytesRef(doc['_uid'].value))[0] is the type.
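If the _uid value is already available on the client side, the same type#id split can be done without touching Elasticsearch internals. A minimal sketch, assuming the value always contains a # and that the type name itself contains none (the function name split_uid is hypothetical):

```python
def split_uid(uid):
    """Split a 'type#id' string into (type, id), splitting at the first '#'."""
    type_name, separator, doc_id = uid.partition("#")
    if not separator:
        raise ValueError(f"not a type#id value: {uid!r}")
    return type_name, doc_id


doc_type, doc_id = split_uid("my_type#1")
print(doc_type, doc_id)
```

Splitting at the first # keeps IDs that themselves contain a # intact.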

Resources