Elasticsearch: Aggregation For Random Fields - elasticsearch

I have a document like the one shown below. It has a "contents" field that holds many sub-objects under random keys (the keys have no fixed format; they may look like UUIDs). I want to find the maximum value of start_time across all keys in "contents" with an ES query. How can I do this?
The document:
{"contents": {
"key1": {
"start_time": "2020-08-01T00:00:19.500Z",
"last_event_published_time": "2020-08-01T23:59:03.738Z",
"last_event_timestamp": "2020-08-01T23:59:03.737Z",
"size": 1590513,
"read_offset": 1590513,
"name": "key1_name"
},
"key2": {
"start_time": "2020-08-01T00:00:19.500Z",
"last_event_published_time": "2020-08-01T23:59:03.738Z",
"last_event_timestamp": "2020-08-01T23:59:03.737Z",
"size": 1590513,
"read_offset": 1590513,
"name": "key2_name"
}
}}
I tried Joe's solution and it works. But when I change the document to the following:
{
"timestamp": "2020-08-01T23:59:59.359Z",
"type": "beats_stats",
"beats_stats": {
"metrics": {
"filebeat": {
"harvester": {
"files": {
"d47f60db-ac59-4b51-a928-0772a815438a": {
"start_time": "2020-08-01T00:00:18.320Z",
"last_event_published_time": "2020-08-01T23:59:03.738Z",
"last_event_timestamp": "2020-08-01T23:59:03.737Z",
"size": 1590513,
"read_offset": 1590513,
"name": "/data/logs/galogs/ga_log_2020-08-01.log"
},
"e47f60db-ac59-4b51-a928-0772a815438a": {
"start_time": "2020-08-01T00:00:19.500Z",
"last_event_published_time": "2020-08-01T23:59:03.738Z",
"last_event_timestamp": "2020-08-01T23:59:03.737Z",
"size": 1590513,
"read_offset": 1590513,
"name": "/data/logs/galogs/ga_log_2020-08-01.log"
}
}
}
}
}
}}
It goes wrong:
"error" : {
"root_cause" : [
{
"type" : "script_exception",
"reason" : "runtime error",
"script_stack" : [
"for (def entry : params._source['beats_stats.metrics.filebeat.harvester.files'].values()) {\n ",
" ^---- HERE"
],
"script" : "\n for (def entry : params._source['beats_stats.metrics.filebeat.harvester.files'].values()) {\n state.start_millis_arr.add(\n Instant.parse(entry.start_time).toEpochMilli()\n );\n }\n ",
"lang" : "painless"
}
],
"type" : "search_phase_execution_exception",
"reason" : "all shards failed",
"phase" : "query",
"grouped" : true,
"failed_shards" : [
{
"shard" : 0,
"index" : "agg-test-index-1",
"node" : "B4mXZVgrTe-MsAQKMVhHUQ",
"reason" : {
"type" : "script_exception",
"reason" : "runtime error",
"script_stack" : [
"for (def entry : params._source['beats_stats.metrics.filebeat.harvester.files'].values()) {\n ",
" ^---- HERE"
],
"script" : "\n for (def entry : params._source['beats_stats.metrics.filebeat.harvester.files'].values()) {\n state.start_millis_arr.add(\n Instant.parse(entry.start_time).toEpochMilli()\n );\n }\n ",
"lang" : "painless",
"caused_by" : {
"type" : "null_pointer_exception",
"reason" : null
}
}
}
]}

You can use a scripted_metric to calculate those. It's quite onerous but certainly possible.
Mimicking your index & syncing a few docs:
POST myindex/_doc
{"contents":{"randomKey1":{"start_time":"2020-08-06T11:01:00.515Z"}}}
POST myindex/_doc
{"contents":{"35431fsf31_s35dfas":{"start_time":"2021-08-06T11:01:00.515Z"}}}
POST myindex/_doc
{"contents":{"999bc_123":{"start_time":"2019-08-06T11:01:00.515Z"}}}
Get the max date of unknown random sub-objects:
GET myindex/_search
{
"size": 0,
"aggs": {
"max_start_date": {
"scripted_metric": {
"init_script": "state.start_millis_arr = [];",
"map_script": """
for (def entry : params._source['contents'].values()) {
state.start_millis_arr.add(
Instant.parse(entry.start_time).toEpochMilli()
);
}
""",
"combine_script": """
// sort in-place
Collections.sort(state.start_millis_arr, Collections.reverseOrder());
return DateTimeFormatter.ISO_INSTANT.format(
Instant.ofEpochMilli(
// first is now the highest
state.start_millis_arr[0]
)
);
""",
"reduce_script": "return states"
}
}
}
}
BTW: @Sahil Gupta's comment is right -- never use images where pasting the text is possible (and helpful).
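On the follow-up error in the question: params._source is a nested map, so a single dotted key such as 'beats_stats.metrics.filebeat.harvester.files' is most likely looked up as one literal key, which does not exist, and .values() is then called on null. Walking the path one level at a time (for example params._source.beats_stats.metrics.filebeat.harvester.files) should avoid that. The Python sketch below only models this behaviour with plain dicts; it is not Painless, and dig and max_start_time are hypothetical helper names.

```python
from datetime import datetime

# A trimmed copy of the beats_stats document from the question.
source = {
    "beats_stats": {"metrics": {"filebeat": {"harvester": {"files": {
        "d47f60db-ac59-4b51-a928-0772a815438a": {"start_time": "2020-08-01T00:00:18.320Z"},
        "e47f60db-ac59-4b51-a928-0772a815438a": {"start_time": "2020-08-01T00:00:19.500Z"},
    }}}}}
}


def dig(doc, dotted_path):
    """Walk a nested dict one key at a time; return None if any level is missing."""
    node = doc
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def max_start_time(doc, dotted_path):
    """Return the latest start_time among the randomly named sub-objects, or None."""
    files = dig(doc, dotted_path)
    if not files:
        return None
    # %z accepts the trailing "Z" on Python 3.7+
    return max(
        (entry["start_time"] for entry in files.values()),
        key=lambda s: datetime.strptime(s, "%Y-%m-%dT%H:%M:%S.%f%z"),
    )


path = "beats_stats.metrics.filebeat.harvester.files"
# The dotted string is not a key of the top-level map, which mirrors the null in the script.
print(source.get(path))
print(max_start_time(source, path))
```

Separately, reduce_script in the query above returns states as-is, which is one maximum per shard; on a multi-shard index a final max over states would be needed to get a single value.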

Related

Why can't I pass index field value into painless script?

Here's my index:
PUT my-index-000001/_doc/1
{
"virtual": "/testss/3-1.pdf",
"file": "3-1",
"caseno": "testss"
}
I am trying to pass the file value "3-1" into the following script and then conditionally either return the value or divide it by 100:
GET my-index-000001/_search
{
"script_fields": {
"mynewfield": {
"script": {
"source":"""
List i=Arrays.asList(doc['file'].value.splitOnToken("-"));
if (i.length==1){
return Float.parseFloat(i[0]);
}
if (i.length==2){
return Float.parseFloat(i[0])+Float.parseFloat(i[1])/100;
}
"""
}
}
}
}
And I get the following error:
{
"error" : {
"root_cause" : [
{
"type" : "script_exception",
"reason" : "runtime error",
"script_stack" : [
"org.elasticsearch.index.mapper.TextFieldMapper$TextFieldType.fielddataBuilder(TextFieldMapper.java:757)",
"org.elasticsearch.index.fielddata.IndexFieldDataService.getForField(IndexFieldDataService.java:116)",
"org.elasticsearch.index.query.QueryShardContext.lambda$lookup$0(QueryShardContext.java:330)",
"org.elasticsearch.search.lookup.LeafDocLookup$1.run(LeafDocLookup.java:97)",
"org.elasticsearch.search.lookup.LeafDocLookup$1.run(LeafDocLookup.java:94)",
"java.base/java.security.AccessController.doPrivileged(AccessController.java:312)",
"org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:94)",
"org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:41)",
"i=Arrays.asList(doc['file'].value.splitOnToken(\"-\"));\n ",
" ^---- HERE"
],
"script" : "\n \n List i=Arrays.asList(doc['file'].value.splitOnToken(\"-\"));\n if (i.length==1){\n return Float.parseFloat(i[0]);\n }\n if (i.length==2){\n return Float.parseFloat(i[0])+Float.parseFloat(i[1])/100;\n }\n \n \n ",
"lang" : "painless",
"position" : {
"offset" : 39,
"start" : 19,
"end" : 78
}
}
],
"type" : "search_phase_execution_exception",
"reason" : "all shards failed",
"phase" : "query",
"grouped" : true,
"failed_shards" : [
{
"shard" : 0,
"index" : "my-index-000001",
"node" : "wJdb2G1VQCyaDNduQLS4SQ",
"reason" : {
"type" : "script_exception",
"reason" : "runtime error",
"script_stack" : [
"org.elasticsearch.index.mapper.TextFieldMapper$TextFieldType.fielddataBuilder(TextFieldMapper.java:757)",
"org.elasticsearch.index.fielddata.IndexFieldDataService.getForField(IndexFieldDataService.java:116)",
"org.elasticsearch.index.query.QueryShardContext.lambda$lookup$0(QueryShardContext.java:330)",
"org.elasticsearch.search.lookup.LeafDocLookup$1.run(LeafDocLookup.java:97)",
"org.elasticsearch.search.lookup.LeafDocLookup$1.run(LeafDocLookup.java:94)",
"java.base/java.security.AccessController.doPrivileged(AccessController.java:312)",
"org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:94)",
"org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:41)",
"i=Arrays.asList(doc['file'].value.splitOnToken(\"-\"));\n ",
" ^---- HERE"
],
"script" : "\n \n List i=Arrays.asList(doc['file'].value.splitOnToken(\"-\"));\n if (i.length==1){\n return Float.parseFloat(i[0]);\n }\n if (i.length==2){\n return Float.parseFloat(i[0])+Float.parseFloat(i[1])/100;\n }\n \n \n ",
"lang" : "painless",
"position" : {
"offset" : 39,
"start" : 19,
"end" : 78
},
"caused_by" : {
"type" : "illegal_argument_exception",
"reason" : "Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [file] in order to load field data by uninverting the inverted index. Note that this can use significant memory."
}
}
}
]
},
"status" : 400
}
What should I do differently here?
I tried referencing the file value "3-1" with doc['file'].value in the script, but it doesn't seem to work.
This ended up working for me:
POST my-index-000001/_update_by_query
{
"script": {
"lang": "painless",
"source": """
ctx._source.caseno = Arrays.asList(ctx._source.virtual.splitOnToken('/'))[1];
ctx._source.file = Arrays.asList(Arrays.asList(ctx._source.virtual.splitOnToken('/'))[2].splitOnToken('.'))[0];
List i = Arrays.asList(Arrays.asList(Arrays.asList(ctx._source.virtual.splitOnToken('/'))[2].splitOnToken('.'))[0].splitOnToken('-'));
if (i.length == 1) { ctx._source.mynewfield = Float.parseFloat(i[0]); }
if (i.length == 2) { ctx._source.mynewfield = Float.parseFloat(i[0]) + Float.parseFloat(i[1]) / 100; }
"""
}
}
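The error message itself says that file is a text field, which doc['file'] cannot read unless fielddata is enabled. If the index relies on the default dynamic mapping, a file.keyword sub-field probably exists and doc['file.keyword'].value may work in the original script_fields request; that is an assumption about the mapping, which the question does not show. The string handling and arithmetic of the scripts can be checked on their own, here as a Python sketch (file_to_number and split_virtual are hypothetical names):

```python
def file_to_number(file_value):
    """Mirror the script: "3" -> 3.0, "3-1" -> 3 + 1/100."""
    parts = file_value.split("-")
    if len(parts) == 1:
        return float(parts[0])
    if len(parts) == 2:
        return float(parts[0]) + float(parts[1]) / 100
    # The original script returns nothing for other shapes; None makes that explicit.
    return None


def split_virtual(virtual):
    """Split "/testss/3-1.pdf" into (caseno, file) the way the update script does."""
    segments = virtual.split("/")  # ["", "testss", "3-1.pdf"]
    return segments[1], segments[2].split(".")[0]


caseno, file_value = split_virtual("/testss/3-1.pdf")
print(caseno, file_value, file_to_number(file_value))
```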

ElasticSearch Accessing Nested Documents in Script - Null Pointer Exception

Gist: I am trying to write a custom filter on nested documents using Painless, and I want to add checks for the case where there are no nested documents so that I avoid a null_pointer_exception.
I have a mapping as such (simplified and obfuscated)
{
"video_entry" : {
"aliases" : { },
"mappings" : {
"properties" : {
"captions_added" : {
"type" : "boolean"
},
"category" : {
"type" : "keyword"
},
"is_votable" : {
"type" : "boolean"
},
"members" : {
"type" : "nested",
"properties" : {
"country" : {
"type" : "keyword"
},
"date_of_birth" : {
"type" : "date"
}
}
}
}
}
}
}
Each video_entry document can have 0 or more members nested documents.
Sample Document
{
"captions_added": true,
"category" : "Mental Health",
"is_votable": true,
"members": [
{"country": "Denmark", "date_of_birth": "1998-04-04T00:00:00"},
{"country": "Denmark", "date_of_birth": "1999-05-05T00:00:00"}
]
}
If one or more nested documents exist, we want to write Painless scripts that check certain fields across all of them. My script works on an index with a few documents, but when I try it on a larger set of documents I get null pointer exceptions despite adding every null check I could think of. I've tried various access patterns and error-checking mechanisms, but I still get exceptions.
POST /video_entry/_search
{
"query": {
"script": {
"script": {
"source": """
// various NULL checks that I already tried
// also tried short circuiting on finding null values
if (!params['_source'].empty && params['_source'].containsKey('members')) {
def total = 0;
for (item in params._source.members) {
// custom logic here
// if above logic holds true
// total += 1;
}
return total > 3;
}
return true;
""",
"lang": "painless"
}
}
}
}
Other Statements That I've Tried
if (params._source == null) {
return true;
}
if (params._source.members == null) {
return true;
}
if (!ctx._source.contains('members')) {
return true;
}
if (!params['_source'].empty && params['_source'].containsKey('members') &&
params['_source'].members.value != null) {
// logic here
}
if (doc.containsKey('members')) {
for (mem in params._source.members) {
}
}
Error Message
&& params._source.members",
^---- HERE"
"caused_by" : {
"type" : "null_pointer_exception",
"reason" : null
}
I've looked into changing the structure (flattening the document) and the usage of must_not as indicated in this answer. They don't suit our use case as we need to incorporate some more custom logic.
Different tutorials use ctx, some use doc, and some use params. To add to the confusion, Debug.explain(doc.members) and Debug.explain(params._source.members) return empty responses, and I'm having a hard time figuring out the types.
Any help is appreciated.
TL;DR
Elasticsearch flattens arrays of objects, so that a document like
{
"group" : "fans",
"user" : [
{
"first" : "John",
"last" : "Smith"
},
{
"first" : "Alice",
"last" : "White"
}
]
}
turns into:
{
"group" : "fans",
"user.first" : [ "alice", "john" ],
"user.last" : [ "smith", "white" ]
}
To access the inner values of members, reference them as doc['members.<field>'], because members does not exist as a field on its own.
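The flattening described above can be pictured with a small Python sketch. It is only a model of the idea, not how Elasticsearch is implemented: flatten is a hypothetical helper, and the lowercasing and ordering visible in the example output above come from text analysis, which is not modelled here.

```python
from collections import defaultdict


def flatten(doc, prefix=""):
    """Collect leaf values under dotted field names, merging arrays of objects."""
    flat = defaultdict(list)
    for key, value in doc.items():
        name = prefix + key
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                # Recurse into the inner object and merge its dotted fields.
                for sub_name, sub_values in flatten(item, name + ".").items():
                    flat[sub_name].extend(sub_values)
            else:
                flat[name].append(item)
    return dict(flat)


doc = {
    "group": "fans",
    "user": [
        {"first": "John", "last": "Smith"},
        {"first": "Alice", "last": "White"},
    ],
}
flat = flatten(doc)
print(flat)
```

Note that there is no "user" entry in the result, only "user.first" and "user.last", and that an empty array produces no entries at all, which is the situation the size() check guards against.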
Details
As you may know, Elastic handles inner documents in its own way. [doc]
So you will need to reference them accordingly.
Here is what I did to make it work. Note that members is mapped as a plain object here, not as nested.
By the way, I used Kibana Dev Tools for these requests.
PUT /so_test/
PUT /so_test/_mapping
{
"properties" : {
"captions_added" : {
"type" : "boolean"
},
"category" : {
"type" : "keyword"
},
"is_votable" : {
"type" : "boolean"
},
"members" : {
"properties" : {
"country" : {
"type" : "keyword"
},
"date_of_birth" : {
"type" : "date"
}
}
}
}
}
POST /so_test/_doc/
{
"captions_added": true,
"category" : "Mental Health",
"is_votable" : true,
"members": [
{"country": "Denmark", "date_of_birth": "1998-04-04T00:00:00"},
{"country": "Denmark", "date_of_birth": "1999-05-05T00:00:00"}
]
}
POST /so_test/_doc/
{
"captions_added": true,
"category" : "Mental breakdown",
"is_votable" : true,
"members": []
}
POST /so_test/_doc/
{
"captions_added": true,
"category" : "Mental success",
"is_votable" : true,
"members": [
{"country": "France", "date_of_birth": "1998-04-04T00:00:00"},
{"country": "Japan", "date_of_birth": "1999-05-05T00:00:00"}
]
}
And then I did this query (it is only a bool filter, but I guess making it work for your own use case should not prove too difficult)
GET /so_test/_search
{
"query":{
"bool": {
"filter": {
"script": {
"script": {
"lang": "painless",
"source": """
def flag = false;
// /!\ notice how the field is referenced /!\
if(doc['members.country'].size() != 0)
{
for (item in doc['members.country']) {
if (item == params.country){
flag = true;
}
}
}
return flag;
""",
"params": {
"country": "Japan"
}
}
}
}
}
}
}
BTW, you said you were a bit confused about the contexts for Painless. You can find some details about them in the documentation.
[doc]
In this case the filter context is the one we want to look at.

Change field values using Elasticsearch Painless

I want to check the length of each field under the alert object.
If a value is longer than X characters, I want to change it to "value was truncated since it was too long".
"_source" : {
"random" : 123455,
"@timestamp" : "2021-10-15T21:55:12.938+0000",
"application" : "app",
"alert" : {
"app_key" : "XYZ",
"host" : "Hostname",
"check" : "CPU overloaded",
"status" : "ok"
}
}
For some reason, I get this error when trying to apply my script:
"type" : "script_exception",
"reason" : "runtime error",
"script_stack" : [
"java.base/java.util.HashMap$HashIterator.nextNode(HashMap.java:1584)",
"java.base/java.util.HashMap$EntryIterator.next(HashMap.java:1617)",
"java.base/java.util.HashMap$EntryIterator.next(HashMap.java:1615)",
"for (alert in ctx._source.alert.entrySet())\n {\n ",
" ^---- HERE"
],
"script" : " ...",
"lang" : "painless",
"position" : {
"offset" : 38,
"start" : 7,
"end" : 65
},
"caused_by" : {
"type" : "concurrent_modification_exception",
"reason" : null
Script:
{
"script": {
"lang": "painless",
"source": """
for (alert in ctx._source.alert.entrySet())
{
if (alert.getValue().length() > params.number_of_characters) {
ctx._source.alert[alert] = "value was truncated since it was too long"
}
}
""",
"params": {
"number_of_characters": 5
}
}
}
You cannot modify the map directly while iterating over the set returned by entrySet, only through the setValue method of the entry.
Per the doc for entrySet():
Returns a Set view of the mappings contained in this map. The set is
backed by the map, so changes to the map are reflected in the set, and
vice-versa. If the map is modified while an iteration over the set is
in progress (except through the iterator's own remove operation, or
through the setValue operation on a map entry returned by the
iterator) the results of the iteration are undefined. The set supports
element removal, which removes the corresponding mapping from the map,
via the Iterator.remove, Set.remove, removeAll, retainAll and clear
operations. It does not support the add or addAll operations.
Try with the following script:
for (alert in ctx._source.alert.entrySet()){
if (alert.getValue().length() > params.number_of_characters) {
alert.setValue("value was truncated")
}
}
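One likely reason the original loop fails: ctx._source.alert[alert] uses the entry object itself as the key, so it inserts a brand-new key instead of overwriting an existing one, and inserting while iterating is a structural modification. Python's dict behaves in a comparable way, which the sketch below uses as an analogy. It is not Painless, and broken and truncate_values are hypothetical names.

```python
alert = {"app_key": "XYZ", "host": "Hostname", "check": "CPU overloaded", "status": "ok"}


def broken(mapping, limit):
    """Insert under a brand-new key (the item tuple) while iterating: raises RuntimeError."""
    for item in mapping.items():
        if len(item[1]) > limit:
            mapping[item] = "value was truncated"


def truncate_values(mapping, limit):
    """Overwrite values of existing keys only, which leaves the key set unchanged."""
    for key, value in mapping.items():
        if isinstance(value, str) and len(value) > limit:
            mapping[key] = "value was truncated"
    return mapping


try:
    broken(dict(alert), 5)
except RuntimeError as exc:
    print("broken loop:", exc)

print(truncate_values(dict(alert), 5))
```

The second function plays the role of entry.setValue(...): the set of keys never changes, so the iteration stays valid.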
Eventually, with the help of Nicolas, I used this:
PUT _ingest/pipeline/rawpayload-ingest/
{
"processors": [
{
"script": {
"description": "Loop through all fields in alert object and shorten them if needed",
"lang": "painless",
"source": """
for (field in ctx.alert.entrySet()) {
if (field.getValue() instanceof String) {
if (field.getValue().length() > params.number_of_characters) {
field.setValue(field.getValue().substring(0, params.number_of_characters) + "...truncated")
}
}
}
""",
"params": {
"number_of_characters": 512
}
}
}
]
}

ElasticSearch, simple two fields comparison with painless

I'm trying to run a query such as SELECT * FROM indexPeople WHERE info.Age > info.AgeExpectancy
Note that the two fields are NOT nested; they are just properties of a JSON object.
POST /indexPeople/_search
{
"from" : 0,
"size" : 200,
"query" : {
"bool" : {
"filter" : [
{
"bool" : {
"must" : [
{
"script" : {
"script" : {
"source" : "doc['info.Age'].value > doc['info.AgeExpectancy'].value",
"lang" : "painless"
},
"boost" : 1.0
}
}
],
"adjust_pure_negative" : true,
"boost" : 1.0
}
}
],
"adjust_pure_negative" : true,
"boost" : 1.0
}
},
"_source" : {
"includes" : [
"info"
],
"excludes" : [ ]
}
}
However, this query fails with:
{
"error" : {
"root_cause" : [
{
"type" : "script_exception",
"reason" : "runtime error",
"script_stack" : [
"org.elasticsearch.index.fielddata.ScriptDocValues$Longs.get(ScriptDocValues.java:121)",
"org.elasticsearch.index.fielddata.ScriptDocValues$Longs.getValue(ScriptDocValues.java:115)",
"doc['info.Age'].value > doc['info.AgeExpectancy'].value",
" ^---- HERE"
],
"script" : "doc['info.Age'].value > doc['info.AgeExpectancy'].value",
"lang" : "painless",
"position" : {
"offset" : 22,
"start" : 0,
"end" : 70
}
}
],
"type" : "search_phase_execution_exception",
"reason" : "all shards failed",
"phase" : "query",
"grouped" : true,
"failed_shards" : [
{
"shard" : 0,
"index" : "indexPeople",
"node" : "c_Dv3IrlQmyvIVpLoR9qVA",
"reason" : {
"type" : "script_exception",
"reason" : "runtime error",
"script_stack" : [
"org.elasticsearch.index.fielddata.ScriptDocValues$Longs.get(ScriptDocValues.java:121)",
"org.elasticsearch.index.fielddata.ScriptDocValues$Longs.getValue(ScriptDocValues.java:115)",
"doc['info.Age'].value > doc['info.AgeExpectancy'].value",
" ^---- HERE"
],
"script" : "doc['info.Age'].value > doc['info.AgeExpectancy'].value",
"lang" : "painless",
"position" : {
"offset" : 22,
"start" : 0,
"end" : 70
},
"caused_by" : {
"type" : "illegal_state_exception",
"reason" : "A document doesn't have a value for a field! Use doc[<field>].size()==0 to check if a document is missing a field!"
}
}
}
]
},
"status" : 400
}
Is there a way to achieve this?
What is the best way to debug it? I wanted to print the objects or look at the logs (which aren't there), but I couldn't find a way to do either.
The mapping is:
{
"mappings": {
"_doc": {
"properties": {
"info": {
"properties": {
"Age": {
"type": "long"
},
"AgeExpectancy": {
"type": "long"
}
}
}
}
}
}
}
Perhaps you have already solved the issue. The reason why the query failed is clear:
"caused_by" : {
"type" : "illegal_state_exception",
"reason" : "A document doesn't have a value for a field! Use doc[<field>].size()==0 to check if a document is missing a field!"
}
Basically, one or more documents are missing at least one of the queried fields. You can get the result you need by using an if to check that both fields exist, and simply returning false when they do not:
{
"script": """
if (doc['info.Age'].size() > 0 && doc['info.AgeExpectancy'].size() > 0) {
return doc['info.Age'].value > doc['info.AgeExpectancy'].value;
}
return false;
"""
}
I tested it with Elasticsearch 7.10.2 and it works.
What is the best way to debug it
That is a tough question; perhaps someone has a better answer for it. I will try to list some options. Obviously, debugging requires reading the error messages carefully.
PAINLESS LAB
If you have a pretty recent version of Kibana, you can try to use the painless lab to simulate your documents and get the errors quicker and in a more focused environment.
KIBANA Scripted Field
You can try to create a boolean scripted field named condition in the index pattern. Before clicking create, remember to click "preview result".
MINIMAL EXAMPLE
Create a minimal example to reduce the complexity.
For this answer I used a sample index with four documents with all possible cases.
No info: { "message": "ok"}
Info.Age but not AgeExpectancy: {"message":"ok","info":{"Age":14}}
Info.AgeExpectancy but not Age: {"message":"ok","info":{"AgeExpectancy":12}}
Info.Age and AgeExpectancy: {"message":"ok","info":{"Age":14, "AgeExpectancy": 12}}
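To see what the guarded script should return for these four cases, the same check can be replayed outside Elasticsearch. This is a Python sketch of the logic only (matches is a hypothetical name), not the Painless script itself:

```python
docs = [
    {"message": "ok"},
    {"message": "ok", "info": {"Age": 14}},
    {"message": "ok", "info": {"AgeExpectancy": 12}},
    {"message": "ok", "info": {"Age": 14, "AgeExpectancy": 12}},
]


def matches(doc):
    """Return True only when both fields exist and Age > AgeExpectancy."""
    info = doc.get("info", {})
    age = info.get("Age")
    expectancy = info.get("AgeExpectancy")
    if age is None or expectancy is None:
        return False  # same role as the size() > 0 guard in the script
    return age > expectancy


print([matches(d) for d in docs])
```

Only the last document has both fields, so it is the only one that can match.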

Update multi level nested document in elasticsearch

Using Elasticsearch 1.7.1, I have the following document structure
"_source" : {
"questions" : {
"defaultQuestion" : {
"tag" : 0,
"gid" : 0,
"rid" : 0,
"caption" : "SRID"
},
"tableQuestion" : {
"rows" : [{
"ids" : {
"answerList" : ["3547", "3548"],
"tag" : "0",
"caption" : "Accounts"
},
"name" : {
"answerList" : ["Some Name"],
"tag" : "0",
"caption" : "Name"
}
}
],
"caption" : "BPI 1500541753537",
"id" : 644251570,
"tag" : ""
}
},
"id" : "447722821"
}
I want to add a new object to questions.tableQuestion.rows. My current script replaces the existing object with the new one. Kindly suggest how to append it instead. My update script is as follows.
{ "update": {"_id": "935663867", "_retry_on_conflict" : 3} }
{ "script" : "ctx._source.questions += param1", "params" : {"param1" : {"tableQuestion": {"rows" : [ NEWROWOBJECT ]} } }}
You can spell out the full path through the nested fields, right down to the rows property, and then use the += operator. It is also a good idea to check whether the rows array is null and initialize it in that case.
Checked with ES 2.4, but should be similar for earlier versions:
POST http://127.0.0.1:9200/sample/demo/{document_id}/_update
{
"script": {
"inline": "if (ctx._source.questions.tableQuestion.rows == null) ctx._source.questions.tableQuestion.rows = new ArrayList(); ctx._source.questions.tableQuestion.rows += param1;",
"params" : {
"param1" : {
"ids": {
"answerList": [
"478",
"255"
],
"tag": "2",
"caption": "My Test"
},
"name": {
"answerList": [
"My Name"
],
"tag": "1",
"caption": "My Demo"
}
}
}
}
}
For ES 5.x and Painless language the script is a bit different:
POST http://127.0.0.1:9200/sample/demo/{document_id}/_update
{
"script": {
"inline": "if (ctx._source.questions.tableQuestion.rows == null) { ctx._source.questions.tableQuestion.rows = new ArrayList();} ctx._source.questions.tableQuestion.rows.add(params.param1);",
"params" : {
"param1" : {
...
}
}
}
}
Update to the additional comment
If some part of the path is dynamic, you can also use parameters to build the path - with get(param_name) method - try this syntax (I removed the null check for simplicity):
{
"script": {
"inline": "ctx._source.questions.get(param2).rows += param1;",
"params" : {
"param2" : "6105243",
"param1" : {
"ids": {
"answerList": [
"478",
"255"
],
"tag": "2",
"caption": "My Test"
},
"name": {
"answerList": [
"My Name"
],
"tag": "1",
"caption": "My Demo"
}
}
}
}
}
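The difference between the original script and the ones above can be modelled with plain dictionaries. This Python sketch is only an analogy for the script behaviour (merge_top_level and append_row are hypothetical names, and the merge is a rough stand-in for what += on the questions map appears to do):

```python
import copy

source = {
    "questions": {
        "tableQuestion": {
            "rows": [{"name": {"answerList": ["Some Name"], "caption": "Name"}}],
            "caption": "BPI 1500541753537",
        }
    }
}
new_row = {"name": {"answerList": ["My Name"], "caption": "My Demo"}}


def merge_top_level(doc, param1):
    """Roughly the original script: merging into questions replaces tableQuestion whole."""
    doc["questions"].update(param1)
    return doc


def append_row(doc, question_key, row):
    """Navigate down to rows, initialise it if missing, then append."""
    question = doc["questions"][question_key]
    if question.get("rows") is None:
        question["rows"] = []
    question["rows"].append(row)
    return doc


replaced = merge_top_level(copy.deepcopy(source), {"tableQuestion": {"rows": [new_row]}})
appended = append_row(copy.deepcopy(source), "tableQuestion", new_row)
print(len(replaced["questions"]["tableQuestion"]["rows"]))
print(len(appended["questions"]["tableQuestion"]["rows"]))
```

Passing the question key as an argument corresponds to the get(param2) variant for dynamic paths.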
