Elasticsearch - Script Filter over a list of nested objects

I am trying to figure out how to solve these two problems that I have with my ES 5.6 index. Here is the mapping:
"mappings": {
"my_test": {
"properties": {
"Employee": {
"type": "nested",
"properties": {
"Name": {
"type": "keyword",
"normalizer": "lowercase_normalizer"
},
"Surname": {
"type": "keyword",
"normalizer": "lowercase_normalizer"
}
}
}
}
}
}
I need to create two separate scripted filters:
1 - Filter documents where size of employee array is == 3
2 - Filter documents where the first element of the array has "Name" == "John"
I made some first attempts, but I am unable to iterate over the list; I always get a null pointer exception.
{
"bool": {
"must": {
"nested": {
"path": "Employee",
"query": {
"bool": {
"filter": [
{
"script": {
"script" : """
int array_length = 0;
for(int i = 0; i < params._source['Employee'].length; i++)
{
array_length +=1;
}
if(array_length == 3)
{
return true
} else
{
return false
}
"""
}
}
]
}
}
}
}
}
}

As Val noticed, you can't access the _source of documents in script queries in recent versions of Elasticsearch.
But Elasticsearch does allow you to access _source in the "score context".
So a possible workaround (but be careful about performance) is to use a scripted score combined with a min_score in your query.
You can find an example of this approach in this Stack Overflow post: Query documents by sum of nested field values in elasticsearch.
In your case, a query like this can do the job:
POST <your_index>/_search
{
"min_score": 0.1,
"query": {
"function_score": {
"query": {
"match_all": {}
},
"functions": [
{
"script_score": {
"script": {
"source": """
if (params["_source"]["Employee"].size() == params.nbEmployee) {
def firstEmployee = params._source["Employee"].get(0);
if (firstEmployee.Name == params.name) {
return 1;
} else {
return 0;
}
} else {
return 0;
}
""",
"params": {
"nbEmployee": 3,
"name": "John"
}
}
}
}
]
}
}
}
The number of employees and the first name should be passed in params to avoid script recompilation for every variation of this query.
But remember this can be very heavy on your cluster, as Val already mentioned. You should narrow the set of documents on which the script will run by adding filters to the function_score query (match_all in my example); see the sketch below.
And in any case, this is not the way Elasticsearch is meant to be used, and you can't expect great performance from such a hacked-together query.
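For completeness, here is a minimal, untested sketch of that narrowing, reusing only fields from the question: a nested term query first restricts candidates to organizations with at least one employee named John (the lowercase_normalizer stores the value as "john", so the term value is lowercase), and only those documents are scored by the script.
POST <your_index>/_search
{
  "min_score": 0.1,
  "query": {
    "function_score": {
      "query": {
        "nested": {
          "path": "Employee",
          "query": {
            "term": { "Employee.Name": "john" }
          }
        }
      },
      "functions": [
        {
          "script_score": {
            "script": {
              "source": "if (params._source['Employee'].size() == params.nbEmployee) { def first = params._source['Employee'].get(0); return first.Name == params.name ? 1 : 0; } return 0;",
              "params": { "nbEmployee": 3, "name": "John" }
            }
          }
        }
      ]
    }
  }
}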

1 - Filter documents where size of employee array is == 3
For the first problem, the best thing to do is to add another root-level field (e.g. NbEmployees) that contains the number of items in the Employee array so that you can use a range query and not a costly script query.
Then, whenever you modify the Employee array, you also update that NbEmployees field accordingly. Much more efficient!
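For illustration, a minimal sketch of that idea, assuming you add an integer NbEmployees field to the existing my_test mapping and keep it in sync whenever you index or update a document (a term query is used here as one way to express "exactly 3"):
PUT <your_index>/_mapping/my_test
{
  "properties": {
    "NbEmployees": { "type": "integer" }
  }
}

POST <your_index>/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "NbEmployees": 3 } }
      ]
    }
  }
}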
2 - Filter documents where the first element of the array has "Name" == "John"
Regarding this one, you need to know that nested fields are separate (hidden) documents in Lucene, so there is no way to get access to all the nested docs at once in the same query.
If you know you need to check the first employee's name in your queries, just add another root-level field FirstEmployeeName and run your query on that one.
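Similarly for the second problem, a sketch assuming a root-level FirstEmployeeName keyword field (with the same lowercase_normalizer) that you populate with Employee[0].Name at index time:
POST <your_index>/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "FirstEmployeeName": "john" } }
      ]
    }
  }
}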

Related

How to search by non-tokenized field length in ElasticSearch

Say I create an index people whose entries will have two properties: name and friends:
PUT /people
{
"mappings": {
"properties": {
"friends": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
and I put in two entries, each of which has two friends.
POST /people/_doc
{
"name": "Jack",
"friends": [
"Jill", "John"
]
}
POST /people/_doc
{
"name": "Max",
"friends": [
"John", "John" # Max will have two friends, but both named John
]
}
Now I want to search for people that have multiple friends
GET /people/_search
{
"query": {
"bool": {
"filter": [
{
"script": {
"script": {
"source": "doc['friends.keyword'].length > 1"
}
}
}
]
}
}
}
This will only return Jack and ignore Max. I assume this is because we are actually traversing the inverted index, and "John" and "John" create only one token ('john'), so the length of the tokens is actually 1 here.
Since my index is relatively small and performance is not key, I would like to actually traverse the source and not the inverted index:
GET /people/_search
{
"query": {
"bool": {
"filter": [
{
"script": {
"script": {
"source": "ctx._source.friends.length > 1"
}
}
}
]
}
}
}
But according to https://github.com/elastic/elasticsearch/issues/20068, the source is supported only when updating, not when searching, so I cannot do that.
One obvious solution seems to be to take the length of the field and store it in the index, something like friends_count: 2, and then filter based on that. But that requires reindexing, and this also seems like something that should be solvable in some obvious way I am missing.
Thanks a lot.
There is a new feature in ES 7.11 called runtime fields. A runtime field is a field that is evaluated at query time. Runtime fields enable you to:
Add fields to existing documents without reindexing your data
Start working with your data without understanding how it’s structured
Override the value returned from an indexed field at query time
Define fields for a specific use without modifying the underlying schema
You can find more information about runtime fields here. As for how to use them, you can do something like this:
Index Time:
PUT my-index/
{
"mappings": {
"runtime": {
"friends_count": {
"type": "long",
"script": {
"source": "emit(params._source['friends'].size())"
}
}
},
"properties": {
"@timestamp": {"type": "date"}
}
}
}
You can also use runtime fields at search time; for more information, check here.
Search Time
GET my-index/_search
{
"runtime_mappings": {
"friends_count": {
"type": "long",
"script": {
"source": "emit(params._source['friends'].size())"
}
}
}
}
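To actually filter on that search-time runtime field, you can put a query on it in the same request (ES 7.11+). A sketch combining the definition above with a range filter, run against the people index from the question:
GET /people/_search
{
  "runtime_mappings": {
    "friends_count": {
      "type": "long",
      "script": {
        "source": "emit(params._source['friends'].size())"
      }
    }
  },
  "query": {
    "range": {
      "friends_count": { "gt": 1 }
    }
  }
}
Because this counts the entries in _source rather than unique doc values, Max (two friends both named John) is returned as well.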
Update:
POST mytest/_update_by_query
{
"query": {
"match_all": {}
},
"script": {
"source": "ctx._source.arrayLength = ctx._source.friends.size()"
}
}
You can update all of your documents with the query above and then adjust your search query to filter on the new arrayLength field.
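Once arrayLength has been populated by the update-by-query above (and dynamically mapped as a numeric field), the original filter becomes a cheap, non-scripted query; for example, against the people index from the question (the update-by-query example targets mytest, so adjust the index name accordingly):
GET /people/_search
{
  "query": {
    "bool": {
      "filter": [
        { "range": { "arrayLength": { "gt": 1 } } }
      ]
    }
  }
}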
For everyone wondering about the same issue, I think @Kaveh's answer is the most likely way to go, but I did not manage to make it work in my case. It seems to me that the source is created after the query is performed, and therefore you cannot access the source for the purposes of a filtering query.
This leaves you with two options:
filter the result at the application level (an ugly and slow solution)
actually save the field length in a separate field, such as friends_count
Possibly there is another option I don't know about(?)

How to display Due, Over Due and Not Due based on a date field in Saved Search

For example, I have a date field delivery_datetime in the index, and I have to show the user, for the current day, whether a particular parcel is Due today, Over Due, or Not Due.
I can't create a separate field and reindex because the status depends on the current date, which changes every day; if I had to calculate it while indexing, I would have to reindex every day, and that's not feasible because I have a lot of data.
I could use update by query, but my index is frequently updated via a Python script, and since we don't have ACID properties here we would run into version conflicts.
To my knowledge, my only option is to use a scripted field.
If I have to write the logic in pseudocode:
Due - delivery_datetime.dateOnly == now.dateOnly
Over Due - delivery_datetime.dateOnly < now.dateOnly
Not Due - delivery_datetime.dateOnly > now.dateOnly
Though I have a lot of data, if I generate a CSV I don't want the scripted field to have a major impact on cluster performance.
So I need some help doing this efficiently in a scripted field; if there is a completely different solution, that would also be greatly helpful.
I'm expecting help with a Painless script if a scripted field is the only solution.
Once we've ruled out doc upserts/updates there are essentially 2 approaches to this: script_fields or filter aggregations.
Let's first assume your mapping looks similar to:
{
"mappings": {
"properties": {
"delivery_datetime": {
"type": "object",
"properties": {
"dateOnly": {
"type": "date",
"format": "dd.MM.yyyy"
}
}
}
}
}
}
Now, if we filter our parcels down to one by, say, its ID and want to know which due-state it is in, we can create 3 script fields like so:
GET parcels/_search
{
"_source": "timeframe_*",
"script_fields": {
"timeframe_due": {
"script": {
"source": "doc['delivery_datetime.dateOnly'].value.dayOfMonth == params.nowDayOfMonth",
"params": {
"nowDayOfMonth": 8
}
}
},
"timeframe_overdue": {
"script": {
"source": "doc['delivery_datetime.dateOnly'].value.dayOfMonth < params.nowDayOfMonth",
"params": {
"nowDayOfMonth": 8
}
}
},
"timeframe_not_due": {
"script": {
"source": "doc['delivery_datetime.dateOnly'].value.dayOfMonth > params.nowDayOfMonth",
"params": {
"nowDayOfMonth": 8
}
}
}
}
}
which'll return something along the lines of:
...
"fields" : {
"timeframe_due" : [
true
],
"timeframe_not_due" : [
false
],
"timeframe_overdue" : [
false
]
}
It's trivial, but the date math has a significant weak point that'll be addressed below.
Alternatively, we can use 3 filter aggregations and similarly narrow things down to the 1 document in question, like so:
GET parcels/_search
{
"size": 0,
"query": {
"ids": {
"values": [
"my_id_thats_due_today"
]
}
},
"aggs": {
"due": {
"filter": {
"range": {
"delivery_datetime.dateOnly": {
"gte": "now/d",
"lte": "now/d"
}
}
}
},
"overdue": {
"filter": {
"range": {
"delivery_datetime.dateOnly": {
"lt": "now/d"
}
}
}
},
"not_due": {
"filter": {
"range": {
"delivery_datetime.dateOnly": {
"gt": "now/d"
}
}
}
}
}
}
yielding
...
"aggregations" : {
"overdue" : {
"doc_count" : 0
},
"due" : {
"doc_count" : 1
},
"not_due" : {
"doc_count" : 0
}
}
Now the advantages of the 2nd approach are as follows:
There are no scripts involved -> faster execution.
More importantly, you don't have to worry about day-of-month math, like Dec 15th being later than Nov 20th even though a naive day-of-month comparison (15 vs 20) would say otherwise. You can implement something similar in your scripts, but more complexity means worse execution speed.
You can ditch the ID filtering and use those aggregated counts in an internal dashboard, possibly even a customer dashboard, though regular customers rarely have enough parcels to make aggregating worthwhile.
Answering my own question, here is what worked for me.
Scripted Field Script:
def diffDays = 0;
if (!doc['delivery_datetime'].empty) {
// Convert each timestamp to whole days: 1000*60*60*24 = 86400000 ms per day
diffDays = (new Date().getTime() / 86400000) - (doc['delivery_datetime'].value.getMillis() / 86400000);
}
doc['delivery_datetime'].empty ? "No Due Date" : (diffDays == 0 ? "Due" : (diffDays > 0 ? "Over Due" : "Not Due"))
I specifically used the ternary operator because with if/else I would have to use return, and when I used return I got a search_phase_execution_exception while adding filters on the scripted field.

How do I get the length of an array using an elasticsearch query in the ELK stack?

I am using Kibana and have an index that looks like this
GET index_name/
{
"index_name": {
"aliases": {},
"mappings": {
"json": {
"properties": {
"scores": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}
I would like to get the length of the scores array (i.e., how many text elements it has) for each record, with the end goal of filtering out records whose length is greater than or equal to 20. So far I'm able to identify (highlight) each of the records that IS "20", but I can't seem to build a filter that I could then turn into a boolean value (1 for true) for later use / summing records that satisfy the condition. I am putting this into the Discover panel's filter, after clicking on 'Edit Query DSL':
{
"query": {
"match": {
"scores": {
"query": "20",
"type": "phrase"
}
}
}
}
EDIT: an example of this field in the document is:
scores:12, 12, 12, 20, 20, 20
In the table tab view, it has a t next to it, signifying text. The length of this field varies anywhere from 1 to over 20 items from record to record. I also don't know how to get the length of this field (only) returned to me with a query, but I have seen some other answers that suggest something like this (which produces an error for me):
"filter" : {
"script" : {
"script" : "doc['score'].values.length > 10"
}
}
There are a couple of options.
The first finds documents where the number of items (of any length, separated by ", ") is greater than 20:
{
"query": {
"bool": {
"filter": {
"script": {
"script": {
"source":"/, /.split(doc['scores.keyword'].value).length > 20"
}
}
}
}
}
}
NOTE: for the above solution setting script.painless.regex.enabled: true in elasticsearch.yml is required.
If all the scores are of a specific size (i.e. all just two digits), a string length (as you were attempting) would work:
{
"query": {
"bool": {
"filter": {
"script": {
"script": {
"source":"doc['scores.keyword'].value.length() > 78"
}
}
}
}
}
}
I chose 78 because each item (assuming 2 digits) is 2 digits plus the ", " separator, i.e. 4 characters; 20 items is 19 * 4 + 2 = 78 characters, so anything longer has more than 20 items.
If you are often concerned with the size of this array of scores, you should probably store it (or its length) as such. You can do the processing in your ingest pipeline with the split processor to achieve this; a sketch follows.
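A sketch of that ingest-pipeline idea, assuming scores arrives as a single comma-separated string; the pipeline name split_scores and the fields scores_list and scores_count are hypothetical, but the split and script processors are standard ingest processors:
PUT _ingest/pipeline/split_scores
{
  "processors": [
    {
      "split": {
        "field": "scores",
        "separator": ", ",
        "target_field": "scores_list"
      }
    },
    {
      "script": {
        "source": "ctx.scores_count = ctx.scores_list.size()"
      }
    }
  ]
}
Documents indexed with ?pipeline=split_scores then carry a numeric scores_count you can filter with a plain range query instead of a script.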

Elasticsearch script query involving root and nested values

Suppose I have a simplified Organization document with nested publication values like so (ES 2.3):
{
"organization" : {
"dateUpdated" : 1395211600000,
"publications" : [
{
"dateCreated" : 1393801200000
},
{
"dateCreated" : 1401055200000
}
]
}
}
I want to find all Organizations that have a publication dateCreated < the organization's dateUpdated:
{
"query": {
"nested": {
"path": "publications",
"query": {
"bool": {
"filter": [
{
"script": {
"script": "doc['publications.dateCreated'].value < doc['dateUpdated'].value"
}
}
]
}
}
}
}
}
My problem is that when I perform a nested query, the nested query does not have access to the root document values, so doc['dateUpdated'].value is invalid and I get 0 hits.
Is there a way to pass a value into the nested query? Or is my nested approach completely off here? I would like to avoid creating a separate document just for publications if possible.
Thanks.
You cannot access the root values from the nested query context; nested objects are indexed as separate documents. From the documentation:
The nested clause “steps down” into the nested comments field. It no
longer has access to fields in the root document, nor fields in any
other nested document.
You can get the desired results with the help of the copy_to parameter. Another way to do this would be to use include_in_parent or include_in_root, but those might be deprecated in the future, and they also increase the index size because every field of the nested type gets copied into the root document, so in this case the copy_to approach is better.
This is a sample index
PUT nested_index
{
"mappings": {
"blogpost": {
"properties": {
"rootdate": {
"type": "date"
},
"copy_of_nested_date": {
"type": "date"
},
"comments": {
"type": "nested",
"properties": {
"nested_date": {
"type": "date",
"copy_to": "copy_of_nested_date"
}
}
}
}
}
}
}
Here, every value of nested_date is copied into copy_of_nested_date, so copy_of_nested_date will look something like [1401055200000, 1393801200000, 1221542100000], and then you can use a simple query like this to get the results:
{
"query": {
"bool": {
"filter": [
{
"script": {
"script": "doc['rootdate'].value < doc['copy_of_nested_date'].value"
}
}
]
}
}
}
You don't have to change your nested structure, but you will have to reindex the documents after adding copy_to to the publications' dateCreated field.
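Translated to the fields from the question, a minimal sketch would look something like this (the index name and the copy_of_dateCreated field name are illustrative):
PUT organizations
{
  "mappings": {
    "organization": {
      "properties": {
        "dateUpdated": { "type": "date" },
        "copy_of_dateCreated": { "type": "date" },
        "publications": {
          "type": "nested",
          "properties": {
            "dateCreated": {
              "type": "date",
              "copy_to": "copy_of_dateCreated"
            }
          }
        }
      }
    }
  }
}
After reindexing, a non-nested script along the lines of the answer's example, comparing doc['copy_of_dateCreated'] with doc['dateUpdated'] on the root document, can answer the original question.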

Elastic search multiple terms in a dictionary

I have mapping like:
"profile": {
"properties": {
"educations": {
"properties": {
"university": {
"type": "string"
},
"graduation_year": {
"type": "string"
}
}
}
}
}
which obviously holds the education history of people. Each person can have multiple educations. What I want to do is search for people who graduated from "SFU" in "2012". To do that I am using a filtered search:
"filtered": {
"filter": {
"and": [
{
"term": {
"educations.university": "SFU"
}
},
{
"term": {
"educations.graduation_year": "2012"
}
}
]
}
}
But what this query does is find documents that have "SFU" and "2012" anywhere in their educations, so this document would match, which is wrong:
educations[0] = {"university": "SFU", "graduation_year": 2000}
educations[1] = {"university": "UBC", "graduation_year": 2012}
Is there any way I could apply both term filters to each education entry individually?
You need to define a nested type for educations and use a nested filter to query it; otherwise Elasticsearch internally flattens the inner objects into a single object and returns the wrong results.
You can refer to these for detailed explanations and samples:
http://www.elasticsearch.org/blog/managing-relations-inside-elasticsearch/
http://www.spacevatican.org/2012/6/3/fun-with-elasticsearch-s-children-and-nested-documents/
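For reference, a sketch of what that looks like for the educations case, using 1.x-era syntax to match the query in the question (the index name is illustrative):
PUT profiles
{
  "mappings": {
    "profile": {
      "properties": {
        "educations": {
          "type": "nested",
          "properties": {
            "university": { "type": "string" },
            "graduation_year": { "type": "string" }
          }
        }
      }
    }
  }
}
and both term filters go inside a single nested filter so that they have to match the same education entry (depending on your analyzer you may need lowercase term values or a not_analyzed mapping for exact matches):
"filtered": {
  "filter": {
    "nested": {
      "path": "educations",
      "filter": {
        "bool": {
          "must": [
            { "term": { "educations.university": "SFU" } },
            { "term": { "educations.graduation_year": "2012" } }
          ]
        }
      }
    }
  }
}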
