I have an index that contains documents structured as follows:
{
"year": 2020,
"month": 10,
"day": 05,
"some_other_data": { ... }
}
The ID of each document is constructed from the date plus some additional data taken from the some_other_data object, like this: _id: "20201005_some_other_unique_data". There is no explicit _timestamp on the documents.
I can easily get the most recent additions by doing the following query:
{
"query": {
"match_all": {}
},
"sort": [
{"_uid": "desc"}
]
}
Now, the question is: how do I get documents whose date falls between day A and day B, where A is, for instance, 2020-07-12 and B is 2020-09-11? You can assume that the input dates can be integers, strings, or anything really, as I can manipulate them beforehand.
edit: As requested, I'm including a sample result from the following query:
{
"size": 4,
"query": {
"match": {
"month": 7
}
},
"sort": [
{"_uid": "asc"}
]
}
The response:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1609,
"max_score": null,
"hits": [
{
"_index": "my_index",
"_type": "nested",
"_id": "20200703_andromeda_cryptic",
"_score": null,
"_source": {
"year": 2020,
"month": 7,
"day": 3,
"yara": {
"strain": "Andromeda",
},
"parent_yara": {
"strain": "CrypticMut",
},
},
"sort": [
"nested#20200703_andromeda_cryptic"
]
},
{
"_index": "my_index",
"_type": "nested",
"_id": "20200703_betabot_boaxxe",
"_score": null,
"_source": {
"year": 2020,
"month": 7,
"day": 3,
"yara": {
"strain": "BetaBot",
},
"parent_yara": {
"strain": "Boaxxe",
},
},
"sort": [
"nested#20200703_betabot_boaxxe"
]
},
{
"_index": "my_index",
"_type": "nested",
"_id": "20200703_darkcomet_zorex",
"_score": null,
"_source": {
"year": 2020,
"month": 7,
"day": 3,
"yara": {
"strain": "DarkComet",
},
"parent_yara": {
"strain": "Zorex",
},
},
"sort": [
"nested#20200703_darkcomet_zorex"
]
},
{
"_index": "my_index",
"_type": "nested",
"_id": "20200703_darktrack_fake_template",
"_score": null,
"_source": {
"year": 2020,
"month": 7,
"day": 3,
"yara": {
"strain": "Darktrack",
},
"parent_yara": {
"strain": "CrypticFakeTempl",
},
},
"sort": [
"nested#20200703_darktrack_fake_template"
]
}
]
}
}
The above query returns all documents that match the month, so basically anything that was put there in July of any year. What I want to achieve, if at all possible, is to get all documents inserted after one date and before another.
Unfortunately, I cannot migrate the data so that it has a timestamp or otherwise nicely sortable fields. Essentially, I need logic that says: give me all documents inserted after July 1st and before August 2nd. The problem is that there are plenty of edge cases, such as when the start and end dates fall in different years, different months, and so on.
edit: I have solved it using Painless scripting, as suggested by Briomkez, with small changes to the script itself, as follows:
getQueryForRange(dateFrom: String, dateTo: String, querySize: Number) {
let script = `
DateTimeFormatter formatter = new DateTimeFormatterBuilder().appendPattern("yyyy-MM-dd")
.parseDefaulting(ChronoField.NANO_OF_DAY, 0)
.toFormatter()
.withZone(ZoneId.of("Z"));
ZonedDateTime l = ZonedDateTime.parse(params.l, formatter);
ZonedDateTime h = ZonedDateTime.parse(params.h, formatter);
ZonedDateTime x = ZonedDateTime.of(doc['year'].value.intValue(), doc['month'].value.intValue(), doc['day'].value.intValue(), 0, 0, 0, 0, ZoneId.of('Z'));
ZonedDateTime first = l.isAfter(h) ? h : l;
ZonedDateTime last = first.equals(l) ? h : l;
return (x.isAfter(first) || x.equals(first)) && (x.equals(last) || x.isBefore(last));
`
return {
size: querySize,
query: {
bool: {
filter: {
script: {
script: {
source: script,
lang: "painless",
params: {
l: dateFrom,
h: dateTo,
},
},
},
},
},
},
sort: [{ _uid: "asc" }],
}
}
With these changes, the query works well for my version of Elasticsearch (7.2), and the order of the dates is not important.
I see (at least) two alternatives here: either use a script query or plain bool queries.
A. USE SCRIPT QUERIES
Basically, the idea is to build a timestamp at query time, by exploiting the datetime API in Painless.
{
"query": {
"bool": {
"filter": {
"script": {
"script": {
"source": "<INSERT-THE-SCRIPT-HERE>",
"lang": "painless",
"params": {
"l": "2020-07-12",
"h": "2020-09-11"
}
}
}
}
}
}
}
The script can be the following one:
// Building a ZonedDateTime from params.l
ZonedDateTime l = ZonedDateTime.parse(params.l,DateTimeFormatter.ISO_LOCAL_DATE);
// Building a ZonedDateTime from params.h
ZonedDateTime h = ZonedDateTime.parse(params.h,DateTimeFormatter.ISO_LOCAL_DATE);
// Building a ZonedDateTime from the doc (the doc values are longs, hence intValue())
ZonedDateTime x = ZonedDateTime.of(doc['year'].value.intValue(), doc['month'].value.intValue(), doc['day'].value.intValue(), 0, 0, 0, 0, ZoneId.of('Z'));
return (x.isAfter(l) || x.equals(l)) && (x.equals(h) || x.isBefore(h));
B. ALTERNATIVE: splitting the problem in its building blocks
Let us denote by x the document you are searching for, and let l and h be our lower and higher dates. Let x.year, x.month and x.day denote access to the subfields.
So x is contained in the range [l, h] iff
[Condition-1] l <= x AND
[Condition-2] x <= h
The first condition is met if the disjunction of the following conditions holds:
[Condition-1.1] l.year < x.year
[Condition-1.2] l.year == x.year AND l.month < x.month
[Condition-1.3] l.year == x.year AND l.month == x.month AND l.day <= x.day
Similarly, the second condition can be expressed as the disjunction of the following conditions:
[Condition-2.1] h.year > x.year
[Condition-2.2] h.year == x.year AND h.month > x.month
[Condition-2.3] h.year == x.year AND h.month == x.month AND x.day <= h.day
It remains to express these conditions in Elasticsearch DSL:
B-1. Using script query
Given this idea we can write a simple script query. We just need to substitute the script into the query below:
{
"query": {
"bool": {
"filter": {
"script": {
"script": {
"source": "<INSERT SCRIPT HERE>",
"lang": "painless",
"params": {
"l": {
"year": 2020,
"month": 07,
"day": 01
},
"h": {
"year": 2020,
"month": 09,
"day": 01
}
}
}
}
}
}
}
In Painless you can express the conditions (see the sketch after this list), considering that:
x.year is doc['year'].value, x.month is doc['month'].value, x.day is doc['day'].value
h.year is params.h.year, etc.
l.year is params.l.year, etc.
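As a rough sketch (hypothetical; it assumes the l and h params are objects with year, month and day keys, as in the query above), the script source could be:
// Read the document's date parts as ints (doc values are longs)
int y = (int) doc['year'].value;
int m = (int) doc['month'].value;
int d = (int) doc['day'].value;
// Condition-1: l <= x
boolean lowerOk = params.l.year < y
  || (params.l.year == y && params.l.month < m)
  || (params.l.year == y && params.l.month == m && params.l.day <= d);
// Condition-2: x <= h
boolean upperOk = params.h.year > y
  || (params.h.year == y && params.h.month > m)
  || (params.h.year == y && params.h.month == m && d <= params.h.day);
return lowerOk && upperOk;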
B-2. Using boolean query
Now we should transform these conditions into bool queries. The pseudo-code is the following:
{
"query": {
"bool": {
// AND of two conditions
"must": [
{
// Condition 1
},
{
// Condition 2
}
]
}
}
}
Each Condition-X block will look like this:
{
"bool": {
// OR
"should": [
{ // Condition-X.1 },
{ // Condition-X.2 },
{ // Condition-X.3 },
],
"minimum_should_match" : 1
}
}
So, for example, to express [Condition-2.3] with h = 2020-09-11 we can use this query:
{
"bool": {
"must": [
{
"range": {
"year": {
"gte": 2020,
"lte": 2020
}
}
},
{
"range": {
"month": {
"gte": 9,
"lte": 9
}
}
},
{
"range": {
"day": {
"lte": 11
}
}
}
]
}
}
Writing the entire query is feasible, but I think it would be very long :)
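For illustration, here is a sketch of what Condition-1 alone might look like for l = 2020-07-01 (using the same gte/lte range style as above; Condition-2 would be built the same way, and both would go into the outer must array):
{
  "bool": {
    "should": [
      { "range": { "year": { "gt": 2020 } } },
      {
        "bool": {
          "must": [
            { "range": { "year": { "gte": 2020, "lte": 2020 } } },
            { "range": { "month": { "gt": 7 } } }
          ]
        }
      },
      {
        "bool": {
          "must": [
            { "range": { "year": { "gte": 2020, "lte": 2020 } } },
            { "range": { "month": { "gte": 7, "lte": 7 } } },
            { "range": { "day": { "gte": 1 } } }
          ]
        }
      }
    ],
    "minimum_should_match": 1
  }
}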
I want to get a count of values for different ranges from an array, for each document. For example, a student document contains an array "grades" which has a different score per subject, e.g. Maths - score 71, Science - score 91, etc. So, I want to get ranges per student, like Grade A - 2 subjects, Grade B - 1 subject.
So, the mapping is something like this:
{
"student-grades": {
"mappings": {
"properties": {
"grades": {
"type": "nested",
"properties": {
"subject": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"score": {
"type": "float"
}
}
},...
}
}
}
}
For counting ranges I have created a query something like this:
GET student-grades/_search
{
"aggs": {
"nestedAgg": {
"nested": {
"path": "grades"
},
"aggs": {
"gradeRanges": {
"range": {
"field": "grades.score",
"ranges": [
{
"key": "range1",
"to": 35.01
},
{
"key": "range2",
"from": 35.01,
"to": 50.01
},
{
"key": "range3",
"from": 50.01,
"to": 60.01
},
{
"key": "range4",
"from": 60.01,
"to": 70.01
},
{
"key": "range5",
"from": 70.01
}
]
},
"aggs": {
"perDoc": {
"top_hits": {
"size": 10
}
}
}
}
}
}
}
}
But it is giving a single list of ranges for all documents combined. The number of subjects is not fixed, so I can't set an arbitrary size in top_hits. So, how can I get the ranges per document? (I am new to Elasticsearch, so I am not aware of all its features.)
I used a Painless script to solve this problem. Here is the query:
GET grades/_search
{
"script_fields": {
"gradeRanges": {
"script": {
"source": """
// all scores for this document, read from doc values
List gradeList = doc['grades.score'];
// one counter per grade range
List gradeRanges = new ArrayList();
for (int i = 0; i < 5; i++) {
  gradeRanges.add(0);
}
// bucket each score into its range
for (int i = 0; i < gradeList.size(); i++) {
  if (gradeList[i] <= 35.0) {
    gradeRanges[0]++;
  } else if (gradeList[i] <= 50.0) {
    gradeRanges[1]++;
  } else if (gradeList[i] <= 60.0) {
    gradeRanges[2]++;
  } else if (gradeList[i] <= 70.0) {
    gradeRanges[3]++;
  } else {
    gradeRanges[4]++;
  }
}
return gradeRanges;
"""
}
}
},
"size": 100
}
This will give me separate grade ranges for each document, like this:
{
"took": 38,
"timed_out": false,
"_shards": {
"total": 2,
"successful": 2,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 12,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "grades",
"_id": "11",
"_score": 1,
"fields": {
"gradeRanges": [
0,
2,
0,
3,
1
]
}
},
{
"_index": "grades",
"_id": "12",
"_score": 1,
"fields": {
"gradeRanges": [
0,
1,
0,
1,
1
]
}
}
]
}
}
My type has a field which is an array of times in ISO 8601 format. I want to get all the listings which have a time on a certain day, and then order them by the earliest time they occur on that specific day. The problem is that my query is ordering based on the earliest time across all days.
You can reproduce the problem below.
curl -XPUT 'localhost:9200/listings?pretty'
curl -XPOST 'localhost:9200/listings/listing/_bulk?pretty' -d '
{"index": { } }
{ "name": "second on 6th (3rd on the 5th)", "times": ["2018-12-05T12:00:00","2018-12-06T11:00:00"] }
{"index": { } }
{ "name": "third on 6th (1st on the 5th)", "times": ["2018-12-05T10:00:00","2018-12-06T12:00:00"] }
{"index": { } }
{ "name": "first on the 6th (2nd on the 5th)", "times": ["2018-12-05T11:00:00","2018-12-06T10:00:00"] }
'
# because ES takes time to add them to the index
sleep 2
echo "Query listings on the 6th!"
curl -XPOST 'localhost:9200/listings/_search?pretty' -d '
{
"sort": {
"times": {
"order": "asc",
"nested_filter": {
"range": {
"times": {
"gte": "2018-12-06T00:00:00",
"lte": "2018-12-06T23:59:59"
}
}
}
}
},
"query": {
"bool": {
"filter": {
"range": {
"times": {
"gte": "2018-12-06T00:00:00",
"lte": "2018-12-06T23:59:59"
}
}
}
}
}
}'
curl -XDELETE 'localhost:9200/listings?pretty'
Adding the above script to a .sh file and running it reproduces the issue. You'll see the ordering is based on the 5th and not the 6th. Elasticsearch converts the times to an epoch_millis number for sorting; you can see the epoch number in the sort field of each hit, e.g. 1544007600000. When doing an asc sort, it takes the smallest number in the array (order not important) and sorts based off that.
Somehow I need it to be ordered by the earliest time that occurs on the queried day, i.e. the 6th.
I'm currently using Elasticsearch 2.4, but even if someone can show me how it's done in a current version, that would be great.
Here is their doc on nested queries and scripting if that helps.
I think the problem here is that the nested sorting is meant for nested objects, not for arrays.
If you convert the document into one that uses an array of nested objects instead of the simple array of dates, then you can construct a nested filtered sort that works.
The following is for Elasticsearch 6.0 - they've changed the syntax a bit from 6.1 onwards, and I'm not sure how much of this works with 2.x:
Mappings:
PUT nested-listings
{
"mappings": {
"listing": {
"properties": {
"name": {
"type": "keyword"
},
"openTimes": {
"type": "nested",
"properties": {
"date": {
"type": "date"
}
}
}
}
}
}
}
Data:
POST nested-listings/listing/_bulk
{"index": { } }
{ "name": "second on 6th (3rd on the 5th)", "openTimes": [ { "date": "2018-12-05T12:00:00" }, { "date": "2018-12-06T11:00:00" }] }
{"index": { } }
{ "name": "third on 6th (1st on the 5th)", "openTimes": [ {"date": "2018-12-05T10:00:00"}, { "date": "2018-12-06T12:00:00" }] }
{"index": { } }
{ "name": "first on the 6th (2nd on the 5th)", "openTimes": [ {"date": "2018-12-05T11:00:00" }, { "date": "2018-12-06T10:00:00" }] }
So instead of the "nextNexpectionOpenTimes", we have an "openTimes" nested object, and each listing contains an array of openTimes.
Now the search:
POST nested-listings/_search
{
"sort": {
"openTimes.date": {
"order": "asc",
"nested_path": "openTimes",
"nested_filter": {
"range": {
"openTimes.date": {
"gte": "2018-12-06T00:00:00",
"lte": "2018-12-06T23:59:59"
}
}
}
}
},
"query": {
"nested": {
"path": "openTimes",
"query": {
"bool": {
"filter": {
"range": {
"openTimes.date": {
"gte": "2018-12-06T00:00:00",
"lte": "2018-12-06T23:59:59"
}
}
}
}
}
}
}
}
The main difference here is the slightly different query, since you need to use a "nested" query to filter on nested objects.
And this gives the following result:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": null,
"hits": [
{
"_index": "nested-listings",
"_type": "listing",
"_id": "vHH6e2cB28sphqox2Dcm",
"_score": null,
"_source": {
"name": "first on the 6th (2nd on the 5th)"
},
"sort": [
1544090400000
]
},
{
"_index": "nested-listings",
"_type": "listing",
"_id": "unH6e2cB28sphqox2Dcm",
"_score": null,
"_source": {
"name": "second on 6th (3rd on the 5th)"
},
"sort": [
1544094000000
]
},
{
"_index": "nested-listings",
"_type": "listing",
"_id": "u3H6e2cB28sphqox2Dcm",
"_score": null,
"_source": {
"name": "third on 6th (1st on the 5th)"
},
"sort": [
1544097600000
]
}
]
}
}
I don't think you can actually select a single value from an array in ES, so for sorting, you were always going to be sorting on all the values. The best you can do with a plain array is choose how that array is treated for sorting purposes (use the lowest, the highest, the mean, etc.).
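For reference, with a plain array that choice is made with the sort mode option, e.g. as below. Note that this still considers every value in the array, so it does not solve the per-day problem by itself:
POST listings/_search
{
  "sort": [
    { "times": { "order": "asc", "mode": "min" } }
  ],
  "query": { "match_all": {} }
}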
I have the following kind of documents.
document 1
{
"doc": {
"id": 1,
"errors": {
"e1":5,
"e2":20,
"e3":30
},
"warnings": {
"w1":1,
"w2":2
}
}
}
document 2
{
"doc": {
"id": 2,
"errors": {
"e1":10
},
"warnings": {
"w1":1,
"w2":2,
"w3":33,
}
}
}
I would like to get the following sum stats in one or more calls. Is it possible? I tried various solutions, but they all work only when the keys are known. In my case the map keys (e1, e2, etc.) are not known.
{
"errors": {
"e1": 15,
"e2": 20,
"e3": 30
},
"warnings": {
"w1": 2,
"w2": 4,
"w3": 33
}
}
There are two solutions, neither of them pretty. I have to point out that option 2 should be the preferred way to go, since option 1 uses an experimental feature.
1. Dynamic mapping, [experimental] scripted aggregation
Inspired by this answer and the Scripted Metric Aggregation page of the ES docs, I began by just inserting your documents into a non-existing index (which by default creates a dynamic mapping).
NB: I tested this on ES 5.4, but the documentation suggests that this feature is available from at least 2.0.
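For example, the documents from the question can be indexed as-is; the index and type names below are just chosen to match the aggregation query that follows:
PUT /my_index/my_type/1
{
  "doc": {
    "id": 1,
    "errors": { "e1": 5, "e2": 20, "e3": 30 },
    "warnings": { "w1": 1, "w2": 2 }
  }
}
PUT /my_index/my_type/2
{
  "doc": {
    "id": 2,
    "errors": { "e1": 10 },
    "warnings": { "w1": 1, "w2": 2, "w3": 33 }
  }
}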
The resulting query for aggregation is the following:
POST /my_index/my_type/_search
{
"size": 0,
"query" : {
"match_all" : {}
},
"aggs": {
"errors": {
"scripted_metric": {
"init_script" : "params._agg.errors = [:]",
"map_script" : "for (t in params['_source']['doc']['errors'].entrySet()) { params._agg.errors[t.key] = t.value } ",
"combine_script" : "return params._agg.errors",
"reduce_script": "Map res = [:] ; for (a in params._aggs) { for (t in a.entrySet()) { res[t.key] = res.containsKey(t.key) ? res[t.key] + t.value : t.value } } return res"
}
},
"warnings": {
"scripted_metric": {
"init_script" : "params._agg.errors = [:]",
"map_script" : "for (t in params['_source']['doc']['warnings'].entrySet()) { params._agg.errors[t.key] = t.value } ",
"combine_script" : "return params._agg.errors",
"reduce_script": "Map res = [:] ; for (a in params._aggs) { for (t in a.entrySet()) { res[t.key] = res.containsKey(t.key) ? res[t.key] + t.value : t.value } } return res"
}
}
}
}
Which produces this output:
{
...
"aggregations": {
"warnings": {
"value": {
"w1": 2,
"w2": 4,
"w3": 33
}
},
"errors": {
"value": {
"e1": 15,
"e2": 20,
"e3": 30
}
}
}
}
If you are following this path you might be interested in the JavaDoc of what params['_source'] is underneath.
Warning: I believe that scripted aggregation is not efficient, and for better performance you should check out option 2 or a different data processing engine.
What does experimental mean:
This functionality is experimental and may be changed or removed
completely in a future release. Elastic will take a best effort
approach to fix any issues, but experimental features are not subject
to the support SLA of official GA features.
With this in mind we proceed to option 2.
2. Static nested mapping, nested aggregation
Here the idea is to store your data differently, and essentially be able to query and aggregate it differently. First, we need to create a mapping using the nested data type.
PUT /my_index_nested/
{
"mappings": {
"my_type": {
"properties": {
"errors": {
"type": "nested",
"properties": {
"name": {"type": "keyword"},
"val": {"type": "integer"}
}
},
"warnings": {
"type": "nested",
"properties": {
"name": {"type": "keyword"},
"val": {"type": "integer"}
}
}
}
}
}
}
A document in such an index will look like this:
{
"_index": "my_index_nested",
"_type": "my_type",
"_id": "1",
"_score": 1,
"_source": {
"errors": [
{
"name": "e1",
"val": 5
},
{
"name": "e2",
"val": 20
},
{
"name": "e3",
"val": 30
}
],
"warnings": [
{
"name": "w1",
"val": 1
},
{
"name": "w2",
"val": 2
}
]
}
}
Next we need to write the aggregation query. First we use a nested aggregation, which allows us to query this special nested data type. But since we actually want to aggregate by name and sum the values of val, we need a sub-aggregation.
The resulting query is as follows (I am adding comments alongside the query for clarity):
POST /my_index_nested/my_type/_search
{
"size": 0,
"aggs": {
"errors_top": {
"nested": {
// declare which nested objects we want to work with
"path": "errors"
},
"aggs": {
"errors": {
// what we are aggregating - different values of name
"terms": {"field": "errors.name"},
// sub aggregation
"aggs": {
"error_sum": {
// sum all val for same name
"sum": {"field": "errors.val"}
}
}
}
}
},
"warnings_top": {
// analogous to errors
}
}
}
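For completeness, the elided warnings_top block would mirror the errors one; the sub-aggregation names used here (warnings, warning_sum) are just illustrative:
"warnings_top": {
  "nested": {
    "path": "warnings"
  },
  "aggs": {
    "warnings": {
      "terms": {"field": "warnings.name"},
      "aggs": {
        "warning_sum": {
          "sum": {"field": "warnings.val"}
        }
      }
    }
  }
}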
The output of this query will be like:
{
...
"aggregations": {
"errors_top": {
"doc_count": 4,
"errors": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "e1",
"doc_count": 2,
"error_sum": {
"value": 15
}
},
{
"key": "e2",
"doc_count": 1,
"error_sum": {
"value": 20
}
},
{
"key": "e3",
"doc_count": 1,
"error_sum": {
"value": 30
}
}
]
}
},
"warnings_top": {
...
}
}
}
I'm a newbie in ES, and I want to use a script filter to get all matches where the array has at least one element less than max and greater than min (max and min are params in the script).
The document like:
{
"number": "5",
"array": {
"key": [
10,
5,
9,
20
]
}
}
I tried this script, but it does not work:
{
"script": {
"lang": "groovy",
"params": {
"max": 64,
"min": 6
},
"script": "for(element in doc['array.key'].values){element>= min + doc['number'].value && element <=max + doc['number'].value}"
}
}
There is no error message, but the search result is wrong. Is there a way to iterate over an array field?
Thank you all.
Yes, it's doable; your script is not doing that, though. Try using Groovy's any() method instead:
doc['array.key'].values.any{ it -> it >= min + doc['number'] && it <= max + doc['number'] }
A few things:
Your script just iterates over a collection and checks a condition, but it doesn't return a boolean value, which is what you need
you might consider changing the mapping for number into an integer type
not really sure why you have a field array and inside it a nested field key. Couldn't you just have a field array that would be... an array? ;-)
remember that in ES by default each field can be a single value or an array.
As @Val has mentioned, you need to enable dynamic scripting in your conf/elasticsearch.yml, but I'm guessing you've done that, otherwise you'd be getting exceptions.
A very simple mapping like this should work:
{
"mappings": {
"document": {
"properties": {
"value": {
"type": "integer"
},
"key": {
"type": "integer"
}
}
}
}
}
Example:
POST /documents/document/1
{
"number": 5,
"key": [
10,
5,
9,
20
]
}
POST /documents/document/2
{
"number": 5,
"key": [
70,
72
]
}
Query:
GET /documents/document/_search
{
"query": {
"bool": {
"filter": {
"script": {
"lang": "groovy",
"params": {
"max": 64,
"min": 6
},
"script": "doc['key'].values.any{ it -> it >= min + doc['number'] && it <= max + doc['number'] }"
}
}
}
}
}
Result:
{
"took": 22,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0,
"hits": [
{
"_index": "documents",
"_type": "document",
"_id": "1",
"_score": 0,
"_source": {
"number": 5,
"key": [
10,
5,
9,
20
]
}
}
]
}
}
I need an Elasticsearch GET query to view the total score of each student, by adding up the marks they earned in all subjects; instead, I am getting the total score of all students across every subject.
GET /testindex/testindex/_search
{
"query" : {
"filtered" : {
"query" : {
"match_all" : {}
}
}
},
"aggs": {
"total": {
"sum": {
"script" : "doc['physics'].value + doc['maths'].value + doc['chemistry'].value"
}
}
}
}
Output
{
....
"hits": [
{
"_index": "testindex",
"_type": "testindex",
"_id": "1",
"_score": 1,
"_source": {
"personalDetails": {
"name": "viswa",
"age": "33"
},
"marks": {
"physics": 18,
"maths": 5,
"chemistry": 34
},
"remarks": [
"hard working",
"intelligent"
]
}
},
{
"_index": "testindex",
"_type": "testindex",
"_id": "2",
"_score": 1,
"_source": {
"personalDetails": {
"name": "bob",
"age": "13"
},
"marks": {
"physics": 48,
"maths": 45,
"chemistry": 44
},
"remarks": [
"hard working",
"intelligent"
]
}
}
]
},
"aggregations": {
"total": {
"value": 194
}
}
}
Expected Output:
I would like to get the total marks earned across subjects by each student, rather than the total for all students.
What changes do I need to make to the query to achieve this?
{
"query": {
"filtered": {
"query": {
"match_all": {}
}
}
},
"aggs": {
"student": {
"terms": {
"field": "personalDetails.name",
"size": 10
},
"aggs": {
"total": {
"sum": {
"script": "doc['physics'].value + doc['maths'].value + doc['chemistry'].value"
}
}
}
}
}
}
But be careful: for the student terms aggregation you need a "unique" field (something that makes that student unique, like a personal ID), maybe the _id itself, but you need to store it.
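For instance, assuming each document also carried a unique keyword field such as a hypothetical personalDetails.studentId, the terms aggregation could key on that instead of the name:
{
  "aggs": {
    "student": {
      "terms": {
        "field": "personalDetails.studentId",
        "size": 10
      },
      "aggs": {
        "total": {
          "sum": {
            "script": "doc['physics'].value + doc['maths'].value + doc['chemistry'].value"
          }
        }
      }
    }
  }
}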