Elasticsearch: count query in _search script

I'm trying to make a single query that updates one field value in an ES index.
I have an index pages which contains information about pages (id, name, time, parent_page_id, child_count, etc.).
I want to update the field child_count with the number of documents that have this page's id as their parent_page_id.
I can update the field with a hard-coded value like this:
PUT HOST_ADDRESS/pages/_update_by_query
{
  "script": {
    "source": "def child_count = 0; ctx._source.child_count = child_count;",
    "lang": "painless"
  },
  "query": {
    "match_all": {}
  }
}
I'm trying to get the child count with this script, but it's not working:
"source": "def child_count = client.prepareSearch('pages').setQuery(QueryBuilders.termQuery('parent_page_id', ctx._source.id)).get().getTotal().getDocs().getCount(); ctx._source.child_count = child_count;",
"lang": "painless"
My question is: how can I run a sub count-query inside the script so that the variable child_count holds the real child count?

Scripting doesn't work like this: you cannot use the Java DSL in there. There's no concept of client or QueryBuilders in the Painless contexts.
As such, you'll need to obtain the counts before you update the doc(s) with a script.
Tip: scripts become reusable when you store them:
POST HOST_ADDRESS/_scripts/update_child_count
{
  "script": {
    "lang": "painless",
    "source": "ctx._source.child_count = params.child_count"
  }
}
and then apply via the id:
PUT HOST_ADDRESS/pages/_update_by_query
{
  "script": {
    "id": "update_child_count",  // no need to write the Painless code again
    "params": {
      "child_count": 987
    }
  },
  "query": {
    "term": {
      "parent_page_id": 123
    }
  }
}
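Since Painless cannot issue a search from inside the script, the counting has to happen client-side before the update. A minimal Python sketch of the two request bodies involved (the page id 123 and the stored-script id follow the examples above; the HTTP transport is omitted):

```python
# Two-step flow: first count the children of a page, then feed that
# count into the stored update script as a parameter.

def build_count_request(parent_page_id):
    # Body for GET pages/_count: counts documents whose parent_page_id
    # points at the given page.
    return {"query": {"term": {"parent_page_id": parent_page_id}}}

def build_update_request(page_id, child_count):
    # Body for POST pages/_update_by_query: writes the pre-computed
    # count into the matching page document via the stored script.
    return {
        "script": {
            "id": "update_child_count",
            "params": {"child_count": child_count},
        },
        "query": {"term": {"_id": page_id}},
    }

count_body = build_count_request(123)
update_body = build_update_request(123, 7)  # 7 = count returned by step 1
```

One round trip per parent page is the cost of this approach; if there are many parents, a terms aggregation on parent_page_id can fetch all counts in a single request instead.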

Related

Elasticsearch: create a dynamic field in the response

I am working on an Elasticsearch project. I want to get an additional column in the response when an index is queried. For example, if I have an index with two columns num1 and num2, a query should return those two columns (num1 and num2) plus an additional column add_result (the sum of the two). If I query normally like below, it responds with just the two columns:
{
  "query": {
    "match_all": {}
  }
}
In my use case I have tried:
{
  "runtime_mappings": {
    "add_result": {
      "type": "double",
      "script": "emit(doc['file_count'].value + doc['follower_count'].value)"
    }
  },
  "query": {
    "match_all": {}
  }
}
Yes, there are 2 ways:
1. Using runtime field
This feature is available since Elasticsearch 7.12. Simply make a GET request to the _search endpoint with the request body like this:
{
  "runtime_mappings": {
    "add_result": {
      "type": "double",
      "script": "emit(doc['num1'].value + doc['num2'].value)"
    }
  },
  "fields": ["add_result", "num*"],
  "query": {
    "match_all": {}
  }
}
You need to explicitly specify that you want to get your runtime fields back in the fields parameter.
2. Using script_field
The request looks like this:
{
  "query": {
    "match_all": {}
  },
  "fields": ["num*"],
  "script_fields": {
    "add_result": {
      "script": {
        "lang": "painless",
        "source": "doc['num1'].value + doc['num2'].value"
      }
    }
  }
}
Note that you still need the fields parameter, but you don't need to include your script field (add_result in this case) in it.
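Both approaches compute the same per-document value at query time. A Python sketch of what the add_result script evaluates for each hit (field names num1/num2 follow the question; the hit dicts are illustrative):

```python
# Mirror of the script field: add_result = doc['num1'].value + doc['num2'].value,
# applied client-side to a list of search hits.

def add_result(hit):
    src = hit["_source"]
    return src["num1"] + src["num2"]

hits = [
    {"_source": {"num1": 1, "num2": 2}},
    {"_source": {"num1": 10.5, "num2": 4.5}},
]

results = [add_result(h) for h in hits]  # [3, 15.0]
```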

Elasticsearch: deleting all nested objects with a specific datetime

I'm using Elasticsearch 5.6 and I have a schedule nested field with nested objects that look like this:
{
  "status": "open",
  "starts_at": "2020-10-13T17:00:00-05:00",
  "ends_at": "2020-10-13T18:00:00-05:00"
},
{
  "status": "open",
  "starts_at": "2020-10-13T18:00:00-05:00",
  "ends_at": "2020-10-13T19:30:00-05:00"
}
What I'm looking for is a Painless query that deletes every nested object whose starts_at equals a given value. I've tried multiple approaches but none worked; they run without errors but don't delete the targeted objects.
Was able to do this by looping over the array and using SimpleDateFormat:
POST index/_update_by_query
{
  "script": {
    "source": """
      for (int i = 0; i < ctx._source.schedule.length; i++) {
        SimpleDateFormat sdformat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
        boolean equalDateTime = sdformat.parse(ctx._source.schedule[i].starts_at).compareTo(sdformat.parse(params.starts_at)) == 0;
        if (equalDateTime) {
          ctx._source.schedule.remove(i);
          i--;  // removing shifts the next element into slot i, so stay in place
        }
      }
    """,
    "params": {
      "starts_at": "2020-10-13T17:00:00-05:00"
    },
    "lang": "painless"
  },
  "query": {
    "bool": {
      "must": [
        { "terms": { "_id": ["12345"] } }
      ]
    }
  }
}
You can use _update_by_query for the same:
POST <indexName>/<type>/_update_by_query
{
  "query": {  // <======== filter out the parent documents containing the specified nested date
    "match": {
      "schedule.starts_at": "2020-10-13T17:00:00-05:00"
    }
  },
  "script": {  // <============ use the script to remove the schedule entries containing that start date
    "inline": "ctx._source.schedule.removeIf(e -> e.starts_at == '2020-10-13T17:00:00-05:00')"
  }
}
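The removeIf predicate is the cleaner of the two approaches because it never mutates the list while indexing into it. A Python sketch of what it does to the nested array (the sample schedule entries follow the question):

```python
# Equivalent of: ctx._source.schedule.removeIf(e -> e.starts_at == target)
# Keep every entry whose starts_at differs from the target.

def remove_schedules(schedule, starts_at):
    return [e for e in schedule if e["starts_at"] != starts_at]

schedule = [
    {"status": "open", "starts_at": "2020-10-13T17:00:00-05:00"},
    {"status": "open", "starts_at": "2020-10-13T18:00:00-05:00"},
]

remaining = remove_schedules(schedule, "2020-10-13T17:00:00-05:00")
# remaining holds only the 18:00 entry
```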

Elasticsearch - How to process all fields in a document with a pipeline processor

I am using the processor below, but I want to apply it to all fields. Will I need to list every field in field, or is there another way to do it?
"description": "my pipeline that remvoves empty string and null strings",
"processors": [
{
"remove": {
"field": "my_field",
"ignore_missing": true,
"if": "ctx.my_field == \"null\" || ctx.my_field == \"\""
}
}
}
The remove processor doesn't allow you to use the wildcard * to target all fields. Instead you can use the script processor and do it yourself in a generic way:
{
  "script": {
    "source": """
      // find all fields that contain an empty string or "null"
      def remove = ctx.keySet().stream()
        .filter(field -> ctx[field] == "null" || ctx[field] == "")
        .collect(Collectors.toList());
      // remove them in one go
      for (field in remove) {
        ctx.remove(field);
      }
    """
  }
}
You can give comma-separated field names (i.e. an array of fields), or try * (not sure whether that's supported; I am trying it), but an explicit list definitely works, as shown in the official doc:
https://www.elastic.co/guide/en/elasticsearch/reference/master/remove-processor.html
{
"remove": {
"field": ["user_agent", "url"]
}
}
A complete example based on the accepted answer that removes all fields prefixed with prefix_ from a single document, multiple documents, or all documents. Tested on Elasticsearch 6.8.
Single document:
POST /<index>/_doc/<id>/_update
{
  "script": {
    "lang": "painless",
    "source": """
      def fieldsToDelete = ctx._source.keySet().stream()
        .filter(field -> field.startsWith(params.get("prefix")))
        .collect(Collectors.toList());
      fieldsToDelete.stream()
        .forEach(field -> ctx._source.remove(field));
    """,
    "params": {
      "prefix": "prefix_"
    }
  }
}
Multiple documents:
POST /<index>/_update_by_query?conflicts=proceed
{
  "query": {
    <query-based-on-your-requirements>
  },
  "script": {
    "lang": "painless",
    "source": """
      def fieldsToDelete = ctx._source.keySet().stream()
        .filter(field -> field.startsWith(params.get("prefix")))
        .collect(Collectors.toList());
      fieldsToDelete.stream()
        .forEach(field -> ctx._source.remove(field));
    """,
    "params": {
      "prefix": "prefix_"
    }
  }
}
All documents: Same as above, just omit the query attribute of the request body.
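The two-phase pattern in these scripts (collect the names first, then remove) exists because a map cannot be modified while its key set is being iterated. A Python sketch of the same logic (the sample field names are illustrative):

```python
# Equivalent of the Painless keySet/filter/remove pattern: gather the
# matching field names into a separate list, then delete them, so the
# map is never mutated during iteration.

def remove_prefixed_fields(source, prefix):
    to_delete = [f for f in list(source) if f.startswith(prefix)]
    for f in to_delete:
        del source[f]
    return source

doc = {"prefix_a": 1, "prefix_b": 2, "keep": 3}
remove_prefixed_fields(doc, "prefix_")  # doc is left with only "keep"
```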

Elasticsearch - Script Filter over a list of nested objects

I am trying to figure out how to solve these two problems that I have with my ES 5.6 index.
"mappings": {
"my_test": {
"properties": {
"Employee": {
"type": "nested",
"properties": {
"Name": {
"type": "keyword",
"normalizer": "lowercase_normalizer"
},
"Surname": {
"type": "keyword",
"normalizer": "lowercase_normalizer"
}
}
}
}
}
}
I need to create two separate scripted filters:
1 - Filter documents where size of employee array is == 3
2 - Filter documents where the first element of the array has "Name" == "John"
I took some first steps, but I am unable to iterate over the list: I always get a null pointer exception.
{
  "bool": {
    "must": {
      "nested": {
        "path": "Employee",
        "query": {
          "bool": {
            "filter": [
              {
                "script": {
                  "script": """
                    int array_length = 0;
                    for (int i = 0; i < params._source['Employee'].length; i++) {
                      array_length += 1;
                    }
                    if (array_length == 3) {
                      return true
                    } else {
                      return false
                    }
                  """
                }
              }
            ]
          }
        }
      }
    }
  }
}
As Val noticed, you can't access the _source of documents in script queries in recent versions of Elasticsearch.
But Elasticsearch does allow access to _source in the score context.
So a possible workaround (but you need to be careful about the performance) is to use a script score combined with a min_score in your query.
You can find an example of this behavior in this Stack Overflow post: Query documents by sum of nested field values in elasticsearch.
In your case, a query like this can do the job:
POST <your_index>/_search
{
  "min_score": 0.1,
  "query": {
    "function_score": {
      "query": {
        "match_all": {}
      },
      "functions": [
        {
          "script_score": {
            "script": {
              "source": """
                if (params._source["Employee"].length == params.nbEmployee) {
                  def firstEmployee = params._source["Employee"].get(0);
                  if (firstEmployee.Name == params.name) {
                    return 1;
                  } else {
                    return 0;
                  }
                } else {
                  return 0;
                }
              """,
              "params": {
                "nbEmployee": 3,
                "name": "John"
              }
            }
          }
        }
      ]
    }
  }
}
The number of employees and the first name should be passed as params to avoid script recompilation for every variant of this query.
But remember it can be very heavy on your cluster, as Val already mentioned. You should narrow the set of documents the script runs on by adding filters to the function_score query (match_all in my example).
And in any case, this is not how Elasticsearch is meant to be used, so you can't expect great performance from such a hacked query.
1 - Filter documents where size of employee array is == 3
For the first problem, the best thing to do is to add another root-level field (e.g. NbEmployees) that contains the number of items in the Employee array so that you can use a range query and not a costly script query.
Then, whenever you modify the Employee array, you also update that NbEmployees field accordingly. Much more efficient!
2 - Filter documents where the first element of the array has "Name" == "John"
Regarding this one, you need to know that nested fields are separate (hidden) documents in Lucene, so there is no way to get access to all the nested docs at once in the same query.
If you know you need to check the first employee's name in your queries, just add another root-level field FirstEmployeeName and run your query on that one.
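The denormalization advice above amounts to enriching each document at index time so both filters become cheap term/range queries. A Python sketch of such an enrichment step (the helper itself is hypothetical; the field names NbEmployees and FirstEmployeeName follow the answer):

```python
# Compute the denormalized fields client-side whenever the document is
# (re)indexed, instead of counting nested docs at query time.

def enrich(doc):
    employees = doc.get("Employee", [])
    doc["NbEmployees"] = len(employees)
    doc["FirstEmployeeName"] = employees[0]["Name"] if employees else None
    return doc

doc = enrich({"Employee": [{"Name": "John"}, {"Name": "Jane"}, {"Name": "Bob"}]})
# doc now carries NbEmployees = 3 and FirstEmployeeName = "John"
```

The trade-off is that every write path touching Employee must also refresh these fields, but reads become plain term/range filters with no scripting at all.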

Iterate over array update_by_query

I have an array called tags in the _source of all my Elasticsearch docs in a particular index. I am trying to lowercase all values in the tags array using an _update_by_query Painless script.
This seems like a simple operation, here is what I have tried:
POST my_index/_update_by_query
{
  "script": {
    "source": """
      for (int i = 0; i < ctx._source['tags'].length; ++i) {
        ctx._source['tags'][i].value = ctx._source['tags'][i].value.toLowerCase()
      }
    """,
    "lang": "painless"
  },
  "query": {
    "match_all": {}
  }
}
I am getting a null pointer exception when executing the above code. I think my syntax may be slightly off. I'm having lots of trouble getting this to work and would appreciate any help.
I fixed the issue. There were multiple small syntax errors (the tags are plain strings, so there is no .value), but mainly I needed to add an exists check:
POST my_index/_update_by_query
{
  "script": {
    "source": """
      if (ctx._source.containsKey('tags')) {
        for (int i = 0; i < ctx._source['tags'].length; ++i) {
          ctx._source['tags'][i] = ctx._source['tags'][i].toLowerCase()
        }
      }
    """,
    "lang": "painless"
  },
  "query": {
    "match_all": {}
  }
}
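The fix combines two things: guarding against documents with no tags field (the source of the null pointer exception) and treating the entries as plain strings. A Python sketch of the same transformation:

```python
# Mirror of the fixed Painless script: lowercase each tag in place, but
# only when the document actually has a tags field.

def lowercase_tags(source):
    if "tags" in source:
        source["tags"] = [t.lower() for t in source["tags"]]
    return source

doc = lowercase_tags({"tags": ["Foo", "BAR"]})      # tags become ["foo", "bar"]
no_tags = lowercase_tags({"other": 1})              # left untouched
```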
