Elasticsearch - Script filter on array

I'm a newbie in ES and I want to use a script filter to match all documents where the array has at least one element less than max and greater than min (max and min are params of the script).
The document looks like:
{
  "number": "5",
  "array": {
    "key": [
      10,
      5,
      9,
      20
    ]
  }
}
I tried the following script, but it does not work:
{
  "script": {
    "lang": "groovy",
    "params": {
      "max": 64,
      "min": 6
    },
    "script": "for(element in doc['array.key'].values){element>= min + doc['number'].value && element <=max + doc['number'].value}"
  }
}
There is no error message, but the search result is wrong. Is there a way to iterate over an array field?
Thank you all.

Yes, it's doable; your script is not doing that, though. Try using Groovy's any() method instead:
doc['array.key'].values.any{ it -> it >= min + doc['number'].value && it <= max + doc['number'].value }
A few things:
Your script just iterates over a collection and evaluates a condition, but it never returns a boolean value, and that is what a filter script must do.
You might consider changing the mapping of number to an integer type.
I'm not really sure why you have a field array and inside it a nested field key. Couldn't you just have a field array that would be... an array? ;-)
Remember that in ES, by default, each field can hold a single value or an array of values.
As @Val has mentioned, you need to enable dynamic scripting in conf/elasticsearch.yml, but I'm guessing you've already done that, otherwise you'd be getting exceptions (see the config sketch just below).
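For reference, enabling dynamic/inline scripting in conf/elasticsearch.yml looked roughly like this in the Groovy era (a sketch; the exact setting names changed between versions, so check the scripting docs for your release):
# ES 1.x
script.disable_dynamic: false
# ES 2.x
script.inline: true
script.indexed: true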
A very simple mapping like this should work:
{
  "mappings": {
    "document": {
      "properties": {
        "number": {
          "type": "integer"
        },
        "key": {
          "type": "integer"
        }
      }
    }
  }
}
Example:
POST /documents/document/1
{
  "number": 5,
  "key": [
    10,
    5,
    9,
    20
  ]
}
POST /documents/document/2
{
  "number": 5,
  "key": [
    70,
    72
  ]
}
Query:
GET /documents/document/_search
{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "lang": "groovy",
          "params": {
            "max": 64,
            "min": 6
          },
          "script": "doc['key'].values.any{ it -> it >= min + doc['number'].value && it <= max + doc['number'].value }"
        }
      }
    }
  }
}
Result:
{
  "took": 22,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0,
    "hits": [
      {
        "_index": "documents",
        "_type": "document",
        "_id": "1",
        "_score": 0,
        "_source": {
          "number": 5,
          "key": [
            10,
            5,
            9,
            20
          ]
        }
      }
    ]
  }
}
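Note that Groovy scripting was removed in later Elasticsearch versions (6.x+), where the same filter translates naturally to Painless. A rough, untested sketch against the same documents (field names as above; types are gone in newer versions, hence /documents/_search):
GET /documents/_search
{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "lang": "painless",
            "params": {
              "max": 64,
              "min": 6
            },
            "source": "long n = doc['number'].value; for (def item : doc['key']) { if (item >= params.min + n && item <= params.max + n) { return true; } } return false;"
          }
        }
      }
    }
  }
}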

Related

How to get values count in different ranges from array in elasticsearch which is separate for each document?

I want to get counts of values in different ranges from an array for each document. For example, a student document contains an array "grades" which holds a different score per subject, e.g. Maths - score 71, Science - score 91, etc. So, I want to get ranges per student, like Grade A - 2 subjects, Grade B - 1 subject.
So, the mapping is something like this:
{
  "student-grades": {
    "mappings": {
      "properties": {
        "grades": {
          "type": "nested",
          "properties": {
            "subject": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "score": {
              "type": "float"
            }
          }
        },
        ...
      }
    }
  }
}
For counting ranges I have created a query something like this:
GET student-grades/_search
{
  "aggs": {
    "nestedAgg": {
      "nested": {
        "path": "grades"
      },
      "aggs": {
        "gradeRanges": {
          "range": {
            "field": "grades.score",
            "ranges": [
              {
                "key": "range1",
                "to": 35.01
              },
              {
                "key": "range2",
                "from": 35.01,
                "to": 50.01
              },
              {
                "key": "range3",
                "from": 50.01,
                "to": 60.01
              },
              {
                "key": "range4",
                "from": 60.01,
                "to": 70.01
              },
              {
                "key": "range5",
                "from": 70.01
              }
            ]
          },
          "aggs": {
            "perDoc": {
              "top_hits": {
                "size": 10
              }
            }
          }
        }
      }
    }
  }
}
But it gives a single list of ranges for all documents combined. The number of subjects is not fixed, so I can't set an arbitrary size in top_hits. So, how can I get all ranges per document? (I am new to Elasticsearch, so I am not aware of all its features.)
I used a Painless script to solve this problem. Here is the query:
GET grades/_search
{
  "script_fields": {
    "gradeRanges": {
      "script": {
        "source": """
          List gradeList = doc['grades.score'];
          List gradeRanges = new ArrayList();
          // one counter per grade range
          for (int i = 0; i < 5; i++) {
            gradeRanges.add(0);
          }
          // assign each score to its range
          for (int i = 0; i < gradeList.length; i++) {
            if (gradeList[i] <= 35.0) {
              gradeRanges[0]++;
            } else if (gradeList[i] <= 50.0) {
              gradeRanges[1]++;
            } else if (gradeList[i] <= 60.0) {
              gradeRanges[2]++;
            } else if (gradeList[i] <= 70.0) {
              gradeRanges[3]++;
            } else {
              gradeRanges[4]++;
            }
          }
          return gradeRanges;
        """
      }
    }
  },
  "size": 100
}
This will give me separate grade ranges for each document, like this:
{
  "took": 38,
  "timed_out": false,
  "_shards": {
    "total": 2,
    "successful": 2,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 12,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "grades",
        "_id": "11",
        "_score": 1,
        "fields": {
          "gradeRanges": [
            0,
            2,
            0,
            3,
            1
          ]
        }
      },
      {
        "_index": "grades",
        "_id": "12",
        "_score": 1,
        "fields": {
          "gradeRanges": [
            0,
            1,
            0,
            1,
            1
          ]
        }
      }
    ]
  }
}

Select documents between two dates from Elasticsearch

I have an index that contains documents structured as follows:
{
  "year": 2020,
  "month": 10,
  "day": 5,
  "some_other_data": { ... }
}
The ID of each document is constructed from the date plus some additional data from the some_other_data object, like this: _id: "20201005_some_other_unique_data". There is no explicit _timestamp on the documents.
I can easily get the most recent additions by doing the following query:
{
  "query": {
    "match_all": {}
  },
  "sort": [
    { "_uid": "desc" }
  ]
}
Now, the question is: how do I get documents whose date is between day A and day B, where A is, for instance, 2020-07-12 and B is 2020-09-11? You can assume that the input dates can be integers, strings, or anything really, as I can manipulate them beforehand.
edit: As requested, I'm including a sample result from the following query:
{
  "size": 4,
  "query": {
    "match": {
      "month": 7
    }
  },
  "sort": [
    { "_uid": "asc" }
  ]
}
The response:
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1609,
    "max_score": null,
    "hits": [
      {
        "_index": "my_index",
        "_type": "nested",
        "_id": "20200703_andromeda_cryptic",
        "_score": null,
        "_source": {
          "year": 2020,
          "month": 7,
          "day": 3,
          "yara": {
            "strain": "Andromeda"
          },
          "parent_yara": {
            "strain": "CrypticMut"
          }
        },
        "sort": [
          "nested#20200703_andromeda_cryptic"
        ]
      },
      {
        "_index": "my_index",
        "_type": "nested",
        "_id": "20200703_betabot_boaxxe",
        "_score": null,
        "_source": {
          "year": 2020,
          "month": 7,
          "day": 3,
          "yara": {
            "strain": "BetaBot"
          },
          "parent_yara": {
            "strain": "Boaxxe"
          }
        },
        "sort": [
          "nested#20200703_betabot_boaxxe"
        ]
      },
      {
        "_index": "my_index",
        "_type": "nested",
        "_id": "20200703_darkcomet_zorex",
        "_score": null,
        "_source": {
          "year": 2020,
          "month": 7,
          "day": 3,
          "yara": {
            "strain": "DarkComet"
          },
          "parent_yara": {
            "strain": "Zorex"
          }
        },
        "sort": [
          "nested#20200703_darkcomet_zorex"
        ]
      },
      {
        "_index": "my_index",
        "_type": "nested",
        "_id": "20200703_darktrack_fake_template",
        "_score": null,
        "_source": {
          "year": 2020,
          "month": 7,
          "day": 3,
          "yara": {
            "strain": "Darktrack"
          },
          "parent_yara": {
            "strain": "CrypticFakeTempl"
          }
        },
        "sort": [
          "nested#20200703_darktrack_fake_template"
        ]
      }
    ]
  }
}
The above query returns all documents that match the given month, so basically anything that was inserted in July of any year. What I want to achieve, if at all possible, is to get all documents inserted after one certain date and before another.
Unfortunately, I cannot migrate the data so that it has a timestamp or otherwise nicely sortable fields. Essentially, I need to express a query that says: give me all documents inserted after July 1st and before August 2nd. The problem is that there are plenty of edge cases, like what to do when the start and end dates fall in different years, different months, and so on.
edit: I have solved it using Painless scripting, as suggested by Briomkez, with small changes to the script itself, as follows:
getQueryForRange(dateFrom: String, dateTo: String, querySize: Number) {
  let script = `
    DateTimeFormatter formatter = new DateTimeFormatterBuilder().appendPattern("yyyy-MM-dd")
      .parseDefaulting(ChronoField.NANO_OF_DAY, 0)
      .toFormatter()
      .withZone(ZoneId.of("Z"));
    ZonedDateTime l = ZonedDateTime.parse(params.l, formatter);
    ZonedDateTime h = ZonedDateTime.parse(params.h, formatter);
    ZonedDateTime x = ZonedDateTime.of(doc['year'].value.intValue(), doc['month'].value.intValue(), doc['day'].value.intValue(), 0, 0, 0, 0, ZoneId.of('Z'));
    ZonedDateTime first = l.isAfter(h) ? h : l;
    ZonedDateTime last = first.equals(l) ? h : l;
    return (x.isAfter(first) || x.equals(first)) && (x.equals(last) || x.isBefore(last));
  `;
  return {
    size: querySize,
    query: {
      bool: {
        filter: {
          script: {
            script: {
              source: script,
              lang: "painless",
              params: {
                l: dateFrom,
                h: dateTo,
              },
            },
          },
        },
      },
    },
    sort: [{ _uid: "asc" }],
  };
}
With these changes, the query works well for my version of Elasticsearch (7.2), and the order of the two dates is not important.
I see (at least) two alternatives here: either use a script query or plain bool queries.
A. USE SCRIPT QUERIES
Basically, the idea is to build a timestamp at query time by exploiting the datetime API available in Painless.
{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "source": "<INSERT-THE-SCRIPT-HERE>",
            "lang": "painless",
            "params": {
              "l": "2020-07-12",
              "h": "2020-09-11"
            }
          }
        }
      }
    }
  }
}
The script can be the following one:
// Building a ZonedDateTime from params.l
ZonedDateTime l = ZonedDateTime.parse(params.l, DateTimeFormatter.ISO_LOCAL_DATE);
// Building a ZonedDateTime from params.h
ZonedDateTime h = ZonedDateTime.parse(params.h, DateTimeFormatter.ISO_LOCAL_DATE);
// Building a ZonedDateTime from the doc's year/month/day fields
ZonedDateTime x = ZonedDateTime.of(doc['year'].value, doc['month'].value, doc['day'].value, 0, 0, 0, 0, ZoneId.of('Z'));
// l <= x <= h (inclusive)
return (x.isAfter(l) || x.equals(l)) && (x.equals(h) || x.isBefore(h));
B. ALTERNATIVE: splitting the problem into its building blocks
Let x denote the document you are searching for, and let l and h be the lower and higher dates. We write x.year, x.month, and x.day to access the subfields.
Then x is contained in the range [l, h] iff
[Condition-1] l <= x AND
[Condition-2] x <= h
The first condition is met if the disjunction of the following conditions holds:
[Condition-1.1] l.year < x.year
[Condition-1.2] l.year == x.year AND l.month < x.month
[Condition-1.3] l.year == x.year AND l.month == x.month AND l.day <= x.day
Similarly, the second condition can be expressed as the disjunction of the following conditions:
[Condition-2.1] h.year > x.year
[Condition-2.2] h.year == x.year AND h.month > x.month
[Condition-2.3] h.year == x.year AND h.month == x.month AND x.day <= h.day
It remains to express these conditions in Elasticsearch DSL:
B-1. Using script query
Given this idea we can write a simple script query, substituting the script into the following request:
{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "source": "<INSERT SCRIPT HERE>",
            "lang": "painless",
            "params": {
              "l": {
                "year": 2020,
                "month": 7,
                "day": 1
              },
              "h": {
                "year": 2020,
                "month": 9,
                "day": 1
              }
            }
          }
        }
      }
    }
  }
}
In Painless you can express these conditions, considering that:
x.year is doc['year'].value, x.month is doc['month'].value, x.day is doc['day'].value
h.year is params.h.year, etc.
l.year is params.l.year, etc.
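For illustration, a rough sketch of such a script (untested, and assuming the params structure shown above):
int xy = (int) doc['year'].value;
int xm = (int) doc['month'].value;
int xd = (int) doc['day'].value;
// [Condition-1]: l <= x, as the disjunction of Conditions 1.1-1.3
boolean geLow = params.l.year < xy
  || (params.l.year == xy && params.l.month < xm)
  || (params.l.year == xy && params.l.month == xm && params.l.day <= xd);
// [Condition-2]: x <= h, as the disjunction of Conditions 2.1-2.3
boolean leHigh = params.h.year > xy
  || (params.h.year == xy && params.h.month > xm)
  || (params.h.year == xy && params.h.month == xm && xd <= params.h.day);
return geLow && leHigh;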
B-2. Using boolean query
Now we should transform these conditions into bool queries. The pseudo-code is the following:
{
  "query": {
    "bool": {
      // AND of two conditions
      "must": [
        {
          // Condition 1
        },
        {
          // Condition 2
        }
      ]
    }
  }
}
Each Condition-X block will look like this:
{
  "bool": {
    // OR
    "should": [
      { // Condition-X.1 },
      { // Condition-X.2 },
      { // Condition-X.3 }
    ],
    "minimum_should_match": 1
  }
}
So, for example, to express [Condition-2.3] with h = 2020-09-11 we can use this range query:
{
  "bool": {
    "must": [
      {
        "range": {
          "year": {
            "gte": 2020,
            "lte": 2020
          }
        }
      },
      {
        "range": {
          "month": {
            "gte": 9,
            "lte": 9
          }
        }
      },
      {
        "range": {
          "day": {
            "lte": 11
          }
        }
      }
    ]
  }
}
Writing out the entire query is feasible, but I think it would be very long :)
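For instance, [Condition-1] with l = 2020-07-12 could be sketched as follows (one should clause per Condition-1.x; untested, and assuming numeric year/month/day fields):
{
  "bool": {
    "should": [
      { "range": { "year": { "gt": 2020 } } },
      {
        "bool": {
          "must": [
            { "term": { "year": 2020 } },
            { "range": { "month": { "gt": 7 } } }
          ]
        }
      },
      {
        "bool": {
          "must": [
            { "term": { "year": 2020 } },
            { "term": { "month": 7 } },
            { "range": { "day": { "gte": 12 } } }
          ]
        }
      }
    ],
    "minimum_should_match": 1
  }
}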

How to do nested aggregation in ElasticSearch where aggregation is over results of sub-agg

I'm using Elasticsearch 5.3. If you could guide me on how this is done in ES or in Kibana, it would be appreciated. I have read the docs, especially on scoped, nested, and pipeline aggregations, and have not been able to get any of them to work or produce what I'm after.
Instead of describing what I want in generic terms, I'd like to formulate my problem as a relational DB problem:
This is my table:
CREATE TABLE metrics
  (`host` varchar(50), `counter` int, `time` int);

INSERT INTO metrics (`host`, `counter`, `time`)
VALUES
  ('host1', 3, 1),
  ('host2', 2, 2),
  ('host3', 1, 3),
  ('host1', 5, 4),
  ('host2', 2, 5),
  ('host3', 2, 6),
  ('host1', 9, 7),
  ('host2', 3, 8),
  ('host3', 5, 9);
I want to get the total value of the counter across all hosts. Note that each host emits an ever-increasing value of the counter, so I cannot simply add up the counters of all records. Instead I need to use the following SQL:
select sum(max_counter)
from (
  select max(counter) as max_counter
  from metrics
  where time > 0 AND time < 10
  group by host
) as temptable;
which produces the correct result of: 17 (= 9 + 3 + 5)
You can achieve it with a pipeline aggregation:
{
  "size": 0,
  "aggs": {
    "hosts": {
      "terms": {
        "field": "host"
      },
      "aggs": {
        "maxCounter": {
          "max": {
            "field": "counter"
          }
        }
      }
    },
    "sumCounter": {
      "sum_bucket": {
        "buckets_path": "hosts>maxCounter"
      }
    }
  },
  "query": {
    "range": {
      "time": {
        "gt": 0.0,
        "lt": 10.0
      }
    }
  }
}
First, you group your entries by the host field in the hosts aggregation. Inside it, you apply a max aggregation. Then you add a sum_bucket pipeline aggregation, which takes the results of the previous one and returns the required sum. You also filter your entries with a range query.
Here is the result:
{
  "took": 22,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 9,
    "max_score": 0.0,
    "hits": []
  },
  "aggregations": {
    "hosts": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "host1",
          "doc_count": 3,
          "maxCounter": {
            "value": 9.0
          }
        },
        {
          "key": "host2",
          "doc_count": 3,
          "maxCounter": {
            "value": 3.0
          }
        },
        {
          "key": "host3",
          "doc_count": 3,
          "maxCounter": {
            "value": 5.0
          }
        }
      ]
    },
    "sumCounter": {
      "value": 17.0
    }
  }
}
sumCounter is equal to 17.
Just in case, here is the original mapping:
{
  "mappings": {
    "metrics": {
      "properties": {
        "host": {
          "type": "text",
          "fielddata": true
        },
        "counter": {
          "type": "integer"
        },
        "time": {
          "type": "integer"
        }
      }
    }
  }
}
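As a side note, host is mapped here as text with fielddata enabled only so that the terms aggregation can run on it; on ES 5.x and later the more idiomatic choice would be a keyword field. A hedged sketch of that alternative mapping:
{
  "mappings": {
    "metrics": {
      "properties": {
        "host": {
          "type": "keyword"
        },
        "counter": {
          "type": "integer"
        },
        "time": {
          "type": "integer"
        }
      }
    }
  }
}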

ElasticSearch - Average aggregation/sort over multivalued non-unique numeric fields

I am trying to sort on the average of a multivalued field called 'rating_average'. In the example I'm giving you, the values of this field are [1, 2, 2]. I expect the average to be (1+2+2)/3 = 1.66666667. In reality I'm getting 1.5 as the average.
After a few tests and after analyzing the extended stats, I've discovered that this happens because the average is calculated over the unique values only. So the statistical operators are applied to the set [1, 2] instead of [1, 2, 2]. I've verified this by adding an aggregations section to my query to double-check that the average calculated for the sort block is identical to the one in the stats aggregation.
An example document is the following:
{
  "_source": {
    "content_uri": "http://data.semint.co.uk/resource/testContent1",
    "rating_average": [
      "1",
      "2",
      "2"
    ],
    "forDesk": "http://data.semint.co.uk/resource/kMFMJd1rtKD"
  }
}
The query I'm performing is the following:
{
  "from": 0,
  "size": 20,
  "aggs": {
    "rating_stats": {
      "extended_stats": {
        "field": "rating_average"
      }
    }
  },
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "terms": {
                "mediaType": [
                  "http://data.semint.co.uk/resource/testMediaType3"
                ],
                "execution": "and"
              }
            }
          ]
        }
      }
    }
  },
  "fields": [ "content_uri", "rating_average" ],
  "sort": [
    {
      "rating_average": {
        "order": "desc",
        "mode": "avg"
      }
    }
  ]
}
And these are the results I get from executing the query over the aforementioned document:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": null,
    "hits": [
      {
        "_index": "travel_content6",
        "_type": "semantic-index",
        "_id": "http://data.semint.co.uk/resource/testContent1",
        "_score": null,
        "fields": {
          "content_uri": [
            "http://data.semint.co.uk/resource/testContent1"
          ],
          "rating_average": [1, 2, 2]
        },
        "sort": [
          1.5
        ]
      }
    ]
  },
  "aggregations": {
    "rating_stats": {
      "count": 2,
      "min": 1,
      "max": 2,
      "avg": 1.5,
      "sum": 3,
      "sum_of_squares": 5,
      "variance": 0.25,
      "std_deviation": 0.5,
      "std_deviation_bounds": {
        "upper": 2.5,
        "lower": 0.5
      }
    }
  }
}

Get Percentage of Values in Elasticsearch

I have some test documents that look like:
"hits": {
...
"_source": {
"student": "DTWjkg",
"name": "My Name",
"grade": "A"
...
"student": "ggddee",
"name": "My Name2",
"grade": "B"
...
"student": "ggddee",
"name": "My Name3",
"grade": "A"
And I want to get the percentage of students that have a grade of B; the result would be "33%", assuming there were only 3 students.
How would I do this in Elasticsearch?
So far I have this aggregation, which I feel like is close:
"aggs": {
"gradeBPercent": {
"terms": {
"field" : "grade",
"script" : "_value == 'B'"
}
}
}
This returns:
"aggregations": {
"gradeBPercent": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "false",
"doc_count": 2
},
{
"key": "true",
"doc_count": 1
}
]
}
}
I'm not necessarily looking for an exact answer; perhaps just terms and keywords I could google. I've read over the Elasticsearch docs and haven't found anything that could help.
First off, you shouldn't need a script for this aggregation. If you want to limit your results to everyone where value == 'B', then you should do that using a filter, not a script.
Elasticsearch won't return you a percentage exactly, but you can easily calculate it using the result of a terms aggregation.
Example:
GET devdev/audittrail/_search
{
  "size": 0,
  "aggs": {
    "a1": {
      "terms": {
        "field": "uIDRequestID"
      }
    }
  }
}
That returns:
{
  "took": 12,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 25083,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "a1": {
      "doc_count_error_upper_bound": 9,
      "sum_other_doc_count": 1300,
      "buckets": [
        {
          "key": 556,
          "doc_count": 34
        },
        {
          "key": 393,
          "doc_count": 28
        },
        {
          "key": 528,
          "doc_count": 15
        }
      ]
    }
  }
}
So what does that return mean?
The hits.total field is the total number of records matching your query.
The doc_count field tells you how many items are in each bucket.
So for my example here: the key 556 shows up in 34 of 25083 documents, so it accounts for a percentage of (34 / 25083) * 100.
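That works out to roughly 0.14%. Applied back to the original question: with hits.total = 3 and a doc_count of 1 for the "true" (grade B) bucket, the percentage of grade-B students is (1 / 3) * 100 ≈ 33%, matching the expected result.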
