Using aggregation functions in Elasticsearch queries - elasticsearch

I'm using elasticsearch 0.90.10 and I want to perform a search on it using a query with aggregation functions like sum(), avg(), min().
Suppose my data looks like this:
[
{
"name" : "Alice",
"grades" : [40, 50, 60, 70]
},
{
"name" : "Bob",
"grades" : [10, 20, 30, 40]
},
{
"name" : "Charlie",
"grades" : [70, 80, 90, 100]
}
]
Let's say I need to fetch the students whose average grade is at least 75 (i.e. avg(grades) >= 75). How can I write such a query in ES using the DSL, filters or scripting?
Thanks in advance.

The newly released ES 1.0.0.RC1 might have better ways to do this with aggregations, but here is a simple (and very verbose) script filter that works:
POST /test_one/grades/_search
{
"query" : {
"match_all": {}
},
"filter" : {
"script" : {
"script" : " sum=0; foreach( grade : doc['grades'].values) { sum = sum + grade }; avg = sum/doc['grades'].values.length; avg > 25; "
}
}
}
Data I tested with:
POST /test_one/grades
{
"name": "chicken",
"grades": [35,55,65]
}
POST /test_one/grades
{
"name": "pork",
"grades": [15,35,45]
}
POST /test_one/grades
{
"name": "kale",
"grades": [5,10,20]
}
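To make the script's logic easier to follow, here is a minimal Python sketch of the same per-document check, run against the three test documents above. It is only an illustration of what the filter computes, not something Elasticsearch executes; the helper names are made up for this example.

```python
# Plain-Python restatement of the script filter: keep a document only if the
# mean of its "grades" array exceeds a threshold.
docs = [
    {"name": "chicken", "grades": [35, 55, 65]},
    {"name": "pork", "grades": [15, 35, 45]},
    {"name": "kale", "grades": [5, 10, 20]},
]


def average(values):
    # Floating-point mean. The script's own division may round differently,
    # depending on the scripting language's integer rules.
    return sum(values) / len(values)


def filter_by_average(documents, threshold):
    # Mirrors the final "avg > 25" expression of the script filter.
    return [d["name"] for d in documents if average(d["grades"]) > threshold]


matching = filter_by_average(docs, 25)
```

With the threshold of 25 used in the script, "chicken" and "pork" pass and "kale" does not.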

Related

Elasticsearch Birth date aggregation

I need my filter to work like this:
18-24 | (16,635,890)
25-34 | (2,478,382)
35-44 | (1,129,493)
45-54 | (5,689,393)
55-64 | (4,585,933)
This is my ES mapping:
{
"dynamic": "strict",
"properties": {
"birthdate": {
"type": "date",
"format": "m/d/yyyy"
},
"first_name": {
"type": "keyword"
},
"last_name": {
"type": "keyword"
}
}
}
I would like to know if it's possible to do this with this mapping. I'm not very experienced in ES, I believe that to do this I need advanced knowledge in ES.
Also, I tried to do this to test, but without any aggregation :/
age: {
terms: {
field: 'birthdate'
}
}
--------------------
"doc_count_error_upper_bound" => 0,
"sum_other_doc_count" => 0,
"buckets" => [
{
"key" => 1072915200000,
"key_as_string" => "0/1/2004",
"doc_count" => 1
}
]
},
I tried reading the documentation and searching some forums, but without success. Thanks.
A good candidate for this would be the range aggregation, but since your birthdate is stored as a date, you need to calculate the age as of now before calculating the buckets. You can do so through a Painless script.
Putting it all together:
POST your-index/_search
{
"size": 0,
"aggs": {
"price_ranges": {
"range": {
"script": {
"source": "return (params.now_ms - doc['birthdate'].value.millis) / 1000 / 60 / 60 / 24 / 365",
"params": {
"now_ms": 1617958584396
}
},
"ranges": [
{
"from": 18,
"to": 24,
"key": "18-24"
},
{
"from": 25,
"to": 34,
"key": "25-34"
}
]
}
}
}
}
would return:
...
"aggregations" : {
"price_ranges" : {
"buckets" : [
{
"key" : "18-24",
"from" : 18.0,
"to" : 24.0,
"doc_count" : 0
},
{
"key" : "25-34",
"from" : 25.0,
"to" : 34.0,
"doc_count" : 2
},
...
]
}
}
Note that the current timestamp wasn't obtained through a dynamic new Date() call but rather hardcoded as a parametrized now_ms variable. This is the recommended way of doing date math due to the distributed nature of Elasticsearch. For more info on this, check my answer to How to get current time as unix timestamp for script use.
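As a rough illustration of what the script and the ranges do, here is a small Python sketch. It assumes the script's divisions are whole-number divisions (as they are for Painless long values) and relies on the range aggregation treating "from" as inclusive and "to" as exclusive. The function names are hypothetical.

```python
# Same arithmetic as the Painless script: milliseconds -> whole years,
# using a 365-day year and truncating division at every step.
def age_in_years(now_ms, birth_ms):
    age = now_ms - birth_ms
    for step in (1000, 60, 60, 24, 365):
        age //= step
    return age


# (key, from, to) triples copied from the request above.
RANGES = [("18-24", 18, 24), ("25-34", 25, 34)]


def bucket_for(age, ranges=RANGES):
    # "from" is inclusive, "to" is exclusive.
    for key, lower, upper in ranges:
        if lower <= age < upper:
            return key
    return None


now_ms = 1617958584396
thirty_years_ms = 30 * 365 * 24 * 60 * 60 * 1000
example_age = age_in_years(now_ms, now_ms - thirty_years_ms)
example_bucket = bucket_for(example_age)
```

One thing this makes visible: with whole-number ages, a 24-year-old or a 34-year-old lands in no bucket, because "to" is exclusive. If that matters for your data, setting "to" to 25 and 35 (while keeping the keys) would probably be what you want.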
Shameless plug: if you're relatively new to ES, you might find my recently released Elasticsearch Handbook useful. One of the chapters is dedicated solely to aggregations and one to Painless scripting!

Get the number of appearances of a particular term in an elasticsearch field

I have an elasticsearch index (posts) with following mappings:
{
"id": "integer",
"title": "text",
"description": "text"
}
I simply want to find the number of occurrences of a particular term inside the description field of a single document (I have the document id and the term to find).
E.g. I have a post like this: {id: 123, title:"some title", description: "my city is LA, this post description has two occurrences of word city "}.
I have the document id / post id for this post and just want to find how many times the word "city" appears in the description of this particular post (the result should be 2 in this case).
I can't seem to find a way to do this search. I don't want the occurrences across ALL documents, just for a single document and within one of its fields. Please suggest a query for this. Thanks.
Elasticsearch Version: 7.5
You can use a terms aggregation on your description field, but you need to make sure fielddata is set to true on it.
PUT kamboh/
{
"mappings": {
"properties": {
"id": {
"type": "integer"
},
"title": {
"type": "text"
},
"description": {
"type": "text",
"fields": {
"simple_analyzer": {
"type": "text",
"fielddata": true,
"analyzer": "simple"
},
"keyword": {
"type": "keyword"
}
}
}
}
}
}
Ingesting a sample doc:
PUT kamboh/_doc/1
{
"id": 123,
"title": "some title",
"description": "my city is LA, this post description has two occurrences of word city "
}
Aggregating:
GET kamboh/_search
{
"size": 0,
"aggregations": {
"terms_agg": {
"terms": {
"field": "description.simple_analyzer",
"size": 20
}
}
}
}
Yielding:
"aggregations" : {
"terms_agg" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "city",
"doc_count" : 1
},
{
"key" : "description",
"doc_count" : 1
},
...
]
}
}
Now, as you can see, the simple analyzer split the string into words and made them lowercase but it also got rid of the duplicate city in your string! I could not come up with an analyzer that'd keep the duplicates... With that being said,
It's advisable to do these word counts before you index!
You would split your string by whitespace and index them as an array of words instead of a long string.
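A minimal Python sketch of that pre-indexing step could look like this. The field names description_words and description_word_counts are made up for the example, and how you would map them in your index is left open.

```python
from collections import Counter


def with_word_counts(doc, field="description"):
    # Split on any run of whitespace; punctuation stays attached ("LA,").
    words = doc[field].split()
    enriched = dict(doc)
    enriched["description_words"] = words  # hypothetical field name
    enriched["description_word_counts"] = dict(Counter(words))  # hypothetical
    return enriched


post = {
    "id": 123,
    "title": "some title",
    "description": "my city is LA, this post description has two occurrences of word city ",
}
enriched = with_word_counts(post)
city_count = enriched["description_word_counts"]["city"]
```

Here city_count comes out as 2, which is the number the question asks for.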
This is also possible at search time, although it's very expensive and does not scale well, and you need to have script.painless.regex.enabled: true in your elasticsearch.yml:
GET kamboh/_search
{
"size": 0,
"aggregations": {
"terms_script": {
"scripted_metric": {
"params": {
"word_of_interest": ""
},
"init_script": "state.map = [:];",
"map_script": """
if (!doc.containsKey('description')) return;
def split_by_whitespace = / /.split(doc['description.keyword'].value);
for (def word : split_by_whitespace) {
// skip words other than the requested one (an empty parameter counts every word)
if (params['word_of_interest'] != "" && params['word_of_interest'] != word) {
continue;
}
if (state.map.containsKey(word)) {
state.map[word] += 1;
continue;
}
state.map[word] = 1;
}
""",
"combine_script": "return state.map;",
"reduce_script": "return states;"
}
}
}
}
yielding
...
"aggregations" : {
"terms_script" : {
"value" : [
{
"occurrences" : 1,
"post" : 1,
"city" : 2, <------
"LA," : 1,
"of" : 1,
"this" : 1,
"description" : 1,
"is" : 1,
"has" : 1,
"my" : 1,
"two" : 1,
"word" : 1
}
]
}
}
...

How to query if a time is between two field values

How do I search for documents whose start and end time fields bracket a given time? For example, I want to query the following document using a time only, like "18:33" or "21:32". "18:33" would return the following document and "21:32" wouldn't. I don't care about the date part or the seconds.
{
"my start time field": "2020-01-23T18:32:21.768Z",
"my end time field": "2020-01-23T20:32:21.768Z"
}
I've reviewed "Using the range query with date fields", but I'm not sure how to look only at times. Also, I want to see if a time is between two fields, not if a field is between two times.
Essentially, the Elasticsearch equivalent of BETWEEN for SQL Server. Like this answer except I don't want to use the current time but a variable.
DECLARE @blah datetime2 = GETDATE()
SELECT *
FROM Table1 T
WHERE CAST(@blah AS TIME)
BETWEEN cast(T.StartDate as TIME) AND cast(T.EndDate as TIME)
Following the OP's suggestion and the link he provided, and in keeping with Stack Overflow's rules, I'm including the second solution here:
Solution 2: Insert separate fields for hour minute as hh:mm
Note that the mapping uses the hour_minute date format. You can find the list of available formats under the aforementioned link.
Basically you re-ingest the documents with a separate field that would have hour and minute values and execute range queries to get what you want.
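As a sketch of that re-ingestion step, this is roughly how the hour_minute values could be derived from the original timestamps before indexing. It is written in Python purely for illustration; the helper name is made up, and it assumes the timestamps always look like the ones in the question (UTC with a trailing Z).

```python
from datetime import datetime


def to_hour_minute(timestamp):
    # Parse e.g. "2020-01-23T18:32:21.768Z" and keep only "HH:mm".
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.strftime("%H:%M")


source = {
    "my start time field": "2020-01-23T18:32:21.768Z",
    "my end time field": "2020-01-23T20:32:21.768Z",
}

# Document body to index into the hour_minute-mapped fields.
reindexed = {
    "start_time": to_hour_minute(source["my start time field"]),
    "end_time": to_hour_minute(source["my end time field"]),
}
```

The resulting reindexed document has the same shape as the sample document used with this mapping.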
Mapping:
PUT my_time_index
{
"mappings": {
"properties": {
"start_time":{
"type": "date",
"format": "hour_minute"
},
"end_time":{
"type": "date",
"format": "hour_minute"
}
}
}
}
Sample Document:
POST my_time_index/_doc/1
{
"start_time": "18:32",
"end_time": "20:32"
}
Query Request:
POST my_time_index/_search
{
"query": {
"bool": {
"must": [
{
"range": {
"start_time": {
"lte": "18:33"
}
}
},
{
"range": {
"end_time": {
"gte": "18:33"
}
}
}
]
}
}
}
Let me know if this helps!
Based upon this discussion, don't store times in a datetime datatype.
If you want to filter on a specific hour of the day, you need to extract it into its own field.
Via the Kibana Dev Tools -> Console
Create some mock data:
POST between-research/_doc/1
{
"my start hour": 0,
"my end hour": 12
}
POST between-research/_doc/2
{
"my start hour": 13,
"my end hour": 23
}
Perform "between" search
POST between-research/_search
{
"query": {
"bool": {
"must": [
{
"range": {
"my start hour": {
"lte": 10
}
}
},
{
"range": {
"my end hour": {
"gte": 10
}
}
}
]
}
}
}
Solution 1: Existing Date Format
If you'd rather not re-ingest your data with separate hour and minute fields, the solution below works against the existing date fields. It isn't an elegant way of doing this in ES, but it certainly works.
I've created a sample mapping, document, the query and response based on the data you've provided.
Mapping:
PUT my_date_index
{
"mappings": {
"properties": {
"start_time":{
"type": "date",
"format": "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
},
"end_time":{
"type": "date",
"format": "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
}
}
}
}
Sample Documents:
POST my_date_index/_doc/1
{
"start_time": "2020-01-23T18:32:21.768Z",
"end_time": "2020-01-23T20:32:21.768Z"
}
POST my_date_index/_doc/2
{
"start_time": "2020-01-23T19:32:21.768Z",
"end_time": "2020-01-23T20:32:21.768Z"
}
POST my_date_index/_doc/3
{
"start_time": "2020-01-23T21:32:21.768Z",
"end_time": "2020-01-23T22:32:21.768Z"
}
Query Request:
POST my_date_index/_search
{
"query": {
"bool": {
"must": [
{
"script": {
"script": {
"source": """
ZonedDateTime zstart_time = doc['start_time'].value;
int zstart_hour = zstart_time.getHour();
int zstart_minute = zstart_time.getMinute();
int zstart_total_minutes = zstart_hour * 60 + zstart_minute;
ZonedDateTime zend_time = doc['end_time'].value;
int zend_hour = zend_time.getHour();
int zend_minute = zend_time.getMinute();
int zend_total_minutes = zend_hour * 60 + zend_minute;
int my_input_total_minutes = params.my_input_hour * 60 + params.my_input_minute;
if(zstart_total_minutes <= my_input_total_minutes && zend_total_minutes >= my_input_total_minutes){
return true;
}
return false;
""",
"params": {
"my_input_hour": 20,
"my_input_minute": 10
}
}
}
}
]
}
}
}
In short, the script:
calculates the number of minutes since midnight for start_time
calculates the number of minutes since midnight for end_time
calculates the number of minutes from params.my_input_hour and params.my_input_minute
returns true when start_time <= input <= end_time, comparing the minute values of all three, so that only the matching documents are returned.
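The steps above can be sketched in Python as follows, using the three sample documents and the 20:10 input from the query. This is only a restatement of the script's logic for checking it by hand; the function names are made up. Like the script, it does not handle intervals that cross midnight.

```python
from datetime import datetime


def minutes_of_day(timestamp):
    # Minutes since midnight; the date part and the seconds are ignored.
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.hour * 60 + parsed.minute


def contains_time(doc, hour, minute):
    # start_time <= input <= end_time, compared in minutes of the day.
    wanted = hour * 60 + minute
    return minutes_of_day(doc["start_time"]) <= wanted <= minutes_of_day(doc["end_time"])


docs = {
    1: {"start_time": "2020-01-23T18:32:21.768Z", "end_time": "2020-01-23T20:32:21.768Z"},
    2: {"start_time": "2020-01-23T19:32:21.768Z", "end_time": "2020-01-23T20:32:21.768Z"},
    3: {"start_time": "2020-01-23T21:32:21.768Z", "end_time": "2020-01-23T22:32:21.768Z"},
}

matches = [doc_id for doc_id, doc in docs.items() if contains_time(doc, 20, 10)]
```

For 20:10, documents 1 and 2 satisfy the condition and document 3 does not.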
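Response (this output was captured against my_time_index from Solution 2, as the _index and _source values show):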
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 2.0,
"hits" : [
{
"_index" : "my_time_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 2.0,
"_source" : {
"start_time" : "18:32",
"end_time" : "20:32"
}
}
]
}
}
Test solution 1 thoroughly for performance issues: script queries generally hurt performance, but they come in handy when you have no other option.
Let me know if this helps!

Elasticsearch boosting score for values in array

I am trying to implement scoring of documents based on certain values stored in an array via Elasticsearch. For example, if my documents contain an array of objects like this:
Document 1:
{
id: "test",
marks: [{
"classtype" : "x1",
"value": 90
}]
}
Document 2:
{
id: "test2",
marks: [{
"classtype" : "x1",
"value": 50
},{
"classtype" : "x2",
"value": 60
}]
}
I want my output scores to be boosted by choosing boosting factor on basis of "classtype", but applicable on "value".
The equivalent code would be:
var boostingfactor = {
"x1" : 1,
"x2" : 10
}
var smartscore = 0;
marks.forEach(function(mark){
return smartscore += mark.value * boostingfactor[mark.classtype];
});
return smartscore;
I have tried an Elasticsearch query on integer values, but I'm not sure if the same can be done for values inside an array. I also tried writing scripts in Elasticsearch's Painless language, but couldn't find the right way to filter values based on classtype.
POST /student/_search
{
"query": {
"function_score": {
"script_score" : {
"script" : {
"params": {
"x1": 1,
"x2": 10
},
"source": "params[doc['marks.classtype']] * marks.value"
}
}
}
}
}
Expected result is scoring of 90 (90*1) for sample document 1 and 650 (50*1+60*10) for document 2 but above query fails with exception:
{
"type": "script_exception",
"reason": "runtime error",
"script_stack": [
"params[doc['marks.classtype'].value]",
" ^---- HERE"
],
"script": "params[doc['marks.classtype'].value]",
"lang": "painless"
}
Is it possible to get this result by modifying the script?
Elasticsearch version: 7.1.0
I was able to iterate over the array values using the following script:
"script_score" : {
"script" : {
"params": {
"x1": 5,
"x2": 10
},
"source": "double sum = 0.0; for (item in params._source.marks) { sum += item.value * params[item.classtype]; } return sum;"
}
}
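For checking the expected numbers by hand, here is the same weighted sum as a small Python sketch, using the boosting factors from the question (x1 = 1, x2 = 10) rather than the ones in the script above. The function name is made up.

```python
def smart_score(marks, boosting_factor):
    # Sum of value * factor, where the factor is looked up by classtype.
    return sum(mark["value"] * boosting_factor[mark["classtype"]] for mark in marks)


boosting_factor = {"x1": 1, "x2": 10}

doc1_marks = [{"classtype": "x1", "value": 90}]
doc2_marks = [
    {"classtype": "x1", "value": 50},
    {"classtype": "x2", "value": 60},
]

score1 = smart_score(doc1_marks, boosting_factor)
score2 = smart_score(doc2_marks, boosting_factor)
```

This gives 90 for document 1 and 650 for document 2, matching the expected result stated in the question.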

ElasticSearch: aggregation on _score field?

I would like to use the stats or extended_stats aggregation on the _score field but can't find any examples of this being done (i.e., seems like you can only use aggregations with actual document fields).
Is it possible to request aggregations on calculated "metadata" fields for each hit in an ElasticSearch query response (e.g., _score, _type, _shard, etc.)?
I'm assuming the answer is 'no' since fields like _score aren't indexed...
Note: The original answer is now outdated in terms of the latest version of Elasticsearch. The equivalent script using Groovy scripting would be:
{
...,
"aggregations" : {
"grades_stats" : {
"stats" : {
"script" : "_score"
}
}
}
}
In order to make this work, you will need to enable dynamic scripting or, even better, store a file-based script and execute it by name (for added security by not enabling dynamic scripting)!
You can use a script and refer to the score using doc.score. More details are available in ElasticSearch's scripting documentation.
A sample stats aggregation could be:
{
...,
"aggregations" : {
"grades_stats" : {
"stats" : {
"script" : "doc.score"
}
}
}
}
And the results would look like:
"aggregations": {
"grades_stats": {
"count": 165,
"min": 0.46667441725730896,
"max": 3.1525731086730957,
"avg": 0.8296855776598959,
"sum": 136.89812031388283
}
}
A histogram may also be a useful aggregation:
"aggs": {
"grades_histogram": {
"histogram": {
"script": "doc.score * 10",
"interval": 3
}
}
}
Histogram results:
"aggregations": {
"grades_histogram": {
"buckets": [
{
"key": 3,
"doc_count": 15
},
{
"key": 6,
"doc_count": 103
},
{
"key": 9,
"doc_count": 46
},
{
"key": 30,
"doc_count": 1
}
]
}
}
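To show what these two aggregations compute, here is a minimal Python sketch over a made-up list of scores (the values are invented for the example and are not the ones behind the results above). The histogram key follows the documented rule of rounding each value down to a multiple of the interval.

```python
import math


def stats(scores):
    # Same five numbers the stats aggregation reports.
    total = sum(scores)
    return {
        "count": len(scores),
        "min": min(scores),
        "max": max(scores),
        "avg": total / len(scores),
        "sum": total,
    }


def histogram(values, interval):
    # Bucket key = floor(value / interval) * interval (offset assumed to be 0).
    buckets = {}
    for value in values:
        key = math.floor(value / interval) * interval
        buckets[key] = buckets.get(key, 0) + 1
    return dict(sorted(buckets.items()))


scores = [0.5, 0.8, 3.1]  # hypothetical _score values
summary = stats(scores)
buckets = histogram([score * 10 for score in scores], 3)
```

Multiplying the score by 10 before bucketing, as the script does, simply spreads small scores over more buckets for a given interval.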
doc.score doesn't seem to work anymore. Using _score seems to work perfectly.
Example:
{
...,
"aggregations" : {
"grades_stats" : {
"stats" : {
"script" : "_score"
}
}
}
}
