ElasticSearch: aggregation on _score field? - elasticsearch

I would like to use the stats or extended_stats aggregation on the _score field but can't find any examples of this being done (i.e., seems like you can only use aggregations with actual document fields).
Is it possible to request aggregations on calculated "metadata" fields for each hit in an ElasticSearch query response (e.g., _score, _type, _shard, etc.)?
I'm assuming the answer is 'no' since fields like _score aren't indexed...

Note: the original answer below is now outdated with respect to recent versions of Elasticsearch. The equivalent script using Groovy scripting would be:
{
  ...,
  "aggregations" : {
    "grades_stats" : {
      "stats" : {
        "script" : "_score"
      }
    }
  }
}
In order to make this work, you will need to enable dynamic scripting or, better still, store a file-based script and execute it by name (more secure, since dynamic scripting stays disabled)!
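On more recent versions (5.x and later, where stored scripts and Painless have replaced file-based Groovy scripts), roughly the same thing can be done by storing the script once and referencing it by id. This is only a sketch; the index name my_index and the script id score_stats are illustrative:

PUT _scripts/score_stats
{
  "script": {
    "lang": "painless",
    "source": "_score"
  }
}

POST my_index/_search
{
  "size": 0,
  "aggregations": {
    "grades_stats": {
      "stats": {
        "script": { "id": "score_stats" }
      }
    }
  }
}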
You can use a script and refer to the score using doc.score. More details are available in ElasticSearch's scripting documentation.
A sample stats aggregation could be:
{
  ...,
  "aggregations" : {
    "grades_stats" : {
      "stats" : {
        "script" : "doc.score"
      }
    }
  }
}
And the results would look like:
"aggregations": {
  "grades_stats": {
    "count": 165,
    "min": 0.46667441725730896,
    "max": 3.1525731086730957,
    "avg": 0.8296855776598959,
    "sum": 136.89812031388283
  }
}
A histogram may also be a useful aggregation:
"aggs": {
  "grades_histogram": {
    "histogram": {
      "script": "doc.score * 10",
      "interval": 3
    }
  }
}
Histogram results:
"aggregations": {
  "grades_histogram": {
    "buckets": [
      {
        "key": 3,
        "doc_count": 15
      },
      {
        "key": 6,
        "doc_count": 103
      },
      {
        "key": 9,
        "doc_count": 46
      },
      {
        "key": 30,
        "doc_count": 1
      }
    ]
  }
}

doc.score doesn't seem to work anymore. Using _score seems to work perfectly.
Example:
{
  ...,
  "aggregations" : {
    "grades_stats" : {
      "stats" : {
        "script" : "_score"
      }
    }
  }
}
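For completeness, on clusters where Painless is the default scripting language, the same aggregation is typically written with an explicit script object. This is a minimal sketch; my_index and the match query on a hypothetical title field stand in for your own data:

POST my_index/_search
{
  "size": 0,
  "query": { "match": { "title": "elasticsearch" } },
  "aggregations": {
    "grades_stats": {
      "stats": {
        "script": {
          "lang": "painless",
          "source": "_score"
        }
      }
    }
  }
}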

Related

Elastic Search - Aggregating on Sub Aggregations

I am looking for a way to group aggregation results so I can filter them down. Currently my response is pretty large (> 1mb) and I'm hoping to return only the top matching filters.
I'm not sure if Elastic is capable of grouping aggregations by the sub aggregation without using nesting, but I figured I would give it a try.
The filter data is stored in an array on each of my objects:
// document a
"attributeValues" : [
  "A12345|V12345",
  "A22345|V22345",
  ...
]

// document b
"attributeValues" : [
  "A12345|V15555",
  "A22345|V22345",
  ...
]
I am currently aggregating on the values and getting results like this:
{
  "key": "A12345|V12345",
  "doc_count": 10
},
{
  "key": "A12345|V15555",
  "doc_count": 7
},
{
  "key": "A22345|V22345",
  "doc_count": 5
},
I would like to be able to group these aggregations by the first part of the string so that I can return only the top 10 matches and get something like this:
"topAttributes" : {
  "buckets" : [
    {
      "key" : "A12345",
      "doc_count" : 17,
      "attributes" : {
        "buckets" : [
          {
            "key": "A12345|V12345",
            "doc_count": 10
          },
          {
            "key": "A12345|V15555",
            "doc_count": 7
          },
I have tried to do this with a script on the field, but I cannot seem to find anywhere online (I've checked many questions) how to group on the sub-aggregation's results.
The script would look something like this:
GET test_index/_search
{
  "size" : 0,
  "aggs": {
    "attributeValuesTop": {
      "terms": {
        "size": 10,
        "script": {
          "source": """
            return attributes.splitOnToken('|')[1];
          """
        }
      },
      "aggs": {
        "attributes": {
          "terms": {
            "field": "attributeValues",
            "size": 10000
          }
        }
      }
    }
  }
}
NOTE: I know we could use a nested solution, but nested is too slow for the number of documents we have (millions of records) and our target of sub-300ms searches.
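A rough sketch of the script-based grouping the question is attempting is shown below. It assumes attributeValues is mapped as a keyword, uses a value script (which runs once per value, with element 0 of the split being the attribute prefix rather than element 1), and has not been benchmarked against the sub-300ms target:

GET test_index/_search
{
  "size": 0,
  "aggs": {
    "topAttributes": {
      "terms": {
        "field": "attributeValues",
        "size": 10,
        "script": {
          "lang": "painless",
          "source": "_value.splitOnToken('|')[0]"
        }
      },
      "aggs": {
        "attributes": {
          "terms": {
            "field": "attributeValues",
            "size": 10000
          }
        }
      }
    }
  }
}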

Average of differences calculated between two date fields

I'm working on a project that uses Elasticsearch to store data and show some complex statistics.
I have an index that looks like this:
Reservation {
  id: number
  check_in: Date
  check_out: Date
  created_at: Date
  // other fields...
}
I need to calculate the average difference in days between check_in and created_at for my Reservations in a specific date range, and show the result as a single number.
I tried this query:
{
  "script_fields": {
    "avgDates": {
      "script": {
        "lang": "expression",
        "source": "doc['created_at'].value - doc['check_in'].value"
      }
    }
  },
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "created_at": {
              "gte": "{{lastMountTimestamp}}",
              "lte": "{{currentTimestamp}}"
            }
          }
        }
      ]
    }
  },
  "size": 0,
  "aggs": {
    "avgBetweenDates": {
      "avg": {
        "field": "avgDates"
      }
    }
  }
}
Date fields are stored in ISO 8601 format (e.g. 2020-03-11T14:25:15+00:00); I don't know if this could cause issues.
The query catches some hits, so it definitely works, but it always returns null as the value of the avgBetweenDates aggregation.
I need a result like this:
"aggregations": {
  "avgBetweenDates": {
    "value": 3.14159 // Π is just an example!
  }
}
Any ideas will help!
Thank you.
Scripted fields are not stored fields in ES; they are created on the fly, so you can only aggregate on fields that actually exist in the index.
You can simply move the script logic into the average aggregation as shown below. For the sake of clarity, I've included a sample mapping, documents, the query, and its response.
Mapping:
PUT my_date_index
{
  "mappings": {
    "properties": {
      "check_in": {
        "type": "date",
        "format": "date_time"
      },
      "check_out": {
        "type": "date",
        "format": "date_time"
      },
      "created_at": {
        "type": "date",
        "format": "date_time"
      }
    }
  }
}
Sample Documents:
POST my_date_index/_doc/1
{
  "check_in": "2019-01-15T00:00:00.000Z",
  "check_out": "2019-01-20T00:00:00.000Z",
  "created_at": "2019-01-17T00:00:00.000Z"
}

POST my_date_index/_doc/2
{
  "check_in": "2019-01-15T00:00:00.000Z",
  "check_out": "2019-01-22T00:00:00.000Z",
  "created_at": "2019-01-20T00:00:00.000Z"
}
Aggregation Query:
POST my_date_index/_search
{
  "size": 0,
  "aggs": {
    "my_dates_diff": {
      "avg": {
        "script": """
          ZonedDateTime d1 = doc['created_at'].value;
          ZonedDateTime d2 = doc['check_in'].value;
          long differenceInMillis = ChronoUnit.MILLIS.between(d1, d2);
          return Math.abs(differenceInMillis / 86400000);
        """
      }
    }
  }
}
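A slightly more direct variant (just a sketch, assuming Painless on a recent version where the doc values are ZonedDateTime) lets ChronoUnit count whole days instead of dividing milliseconds by hand; the result for the two sample documents is the same 3.5:

POST my_date_index/_search
{
  "size": 0,
  "aggs": {
    "my_dates_diff": {
      "avg": {
        "script": """
          // whole-day difference between check_in and created_at
          return Math.abs(ChronoUnit.DAYS.between(doc['check_in'].value, doc['created_at'].value));
        """
      }
    }
  }
}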
Note that you wanted the difference as a number of days; the logic above does exactly that.
Response:
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "my_dates_diff" : {
      "value" : 3.5        <---- Average in Number of Days
    }
  }
}
Hope this helps!
Scripted fields created within the _search context can only be consumed within that scope; they're not visible to the aggregations. This means you'll have to go with either
moving your script to the aggs section and doing the avg there,
a scripted_metric aggregation (quite slow and difficult to get right),
or creating a dateDifference field at index time (preferably an integer difference of the timestamps), which lets you run powerful numeric aggs like extended_stats that produce statistically useful output like:
{
  ...
  "aggregations": {
    "grades_stats": {
      "count": 2,
      "min": 50.0,
      "max": 100.0,
      "avg": 75.0,
      "sum": 150.0,
      "sum_of_squares": 12500.0,
      "variance": 625.0,
      "std_deviation": 25.0,
      "std_deviation_bounds": {
        "upper": 125.0,
        "lower": 25.0
      }
    }
  }
}
and are always faster than computing the timestamp differences with a script.
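If you go the index-time route, one way to populate such a field (a sketch only; the pipeline name date_diff_pipeline and the field name dateDifference are illustrative) is an ingest pipeline with a script processor that parses the incoming ISO 8601 strings and stores the day difference:

PUT _ingest/pipeline/date_diff_pipeline
{
  "processors": [
    {
      "script": {
        "source": """
          // compute the difference in days once, at index time
          ZonedDateTime checkIn = ZonedDateTime.parse(ctx.check_in);
          ZonedDateTime createdAt = ZonedDateTime.parse(ctx.created_at);
          ctx.dateDifference = Math.abs(ChronoUnit.DAYS.between(checkIn, createdAt));
        """
      }
    }
  ]
}

POST my_date_index/_doc/3?pipeline=date_diff_pipeline
{
  "check_in": "2019-01-15T00:00:00.000Z",
  "check_out": "2019-01-22T00:00:00.000Z",
  "created_at": "2019-01-18T00:00:00.000Z"
}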

Elastic Search get top grouped sums with additional filters (Elasticsearch version 5.3)

This is my Mapping :
{
  "settings" : {
    "number_of_shards" : 2,
    "number_of_replicas" : 1
  },
  "mappings" : {
    "cpt_logs_mapping" : {
      "properties" : {
        "channel_id" : { "type": "integer", "store": "yes", "index": "not_analyzed" },
        "playing_date" : { "type": "string", "store": "yes", "index": "not_analyzed" },
        "country_code" : { "type": "text", "store": "yes", "index": "analyzed" },
        "playtime_in_sec" : { "type": "integer", "store": "yes", "index": "not_analyzed" },
        "channel_name" : { "type": "text", "store": "yes", "index": "analyzed" },
        "device_report_tag" : { "type": "text", "store": "yes", "index": "analyzed" }
      }
    }
  }
}
I want to query the index similar to the way I do using the following MySQL query :
SELECT
    channel_name,
    SUM(`playtime_in_sec`) AS playtime_in_sec
FROM
    channel_play_times_bar_chart
WHERE
    country_code = 'country' AND
    device_report_tag = 'device' AND
    channel_name = 'channel' AND
    playing_date BETWEEN 'date_range_start' AND 'date_range_end'
GROUP BY channel_id
ORDER BY SUM(`playtime_in_sec`) DESC
LIMIT 30;
So far my QueryDSL looks like this
{
  "size": 0,
  "aggs": {
    "ch_agg": {
      "terms": {
        "field": "channel_id",
        "size": 30,
        "order": {
          "sum_agg": "desc"
        }
      },
      "aggs": {
        "sum_agg": {
          "sum": {
            "field": "playtime_in_sec"
          }
        }
      }
    }
  }
}
QUESTION 1
The query DSL I have written does return the top 30 channel_ids by playtime, but I am confused about how to add the other filters to the search as well, i.e. country_code, device_report_tag and playing_date.
QUESTION 2
Another issue is that the result set contains only the channel_id and playtime fields, unlike the MySQL result set, which returns the channel_name and playtime_in_sec columns. In other words, I want to aggregate on the channel_id field, but the result set should also return the corresponding channel_name for each group.
NOTE: Performance is a top priority here, as this is supposed to run behind a graph generator querying millions of docs or more.
TEST DATA
hits: [
  {
    _index: "cpt_logs_index",
    _type: "cpt_logs_mapping",
    _id: "",
    _score: 1,
    _source: {
      ChID: 1453,
      playtime_in_sec: 35,
      device_report_tag: "mydev",
      channel_report_tag: "Sony Six",
      country_code: "SE",
      #timestamp: "2017-08-11",
    }
  },
  {
    _index: "cpt_logs_index",
    _type: "cpt_logs_mapping",
    _id: "",
    _score: 1,
    _source: {
      ChID: 145,
      playtime_in_sec: 25,
      device_report_tag: "mydev",
      channel_report_tag: "Star Movies",
      country_code: "US",
      #timestamp: "2017-08-11",
    }
  },
  {
    _index: "cpt_logs_index",
    _type: "cpt_logs_mapping",
    _id: "",
    _score: 1,
    _source: {
      ChID: 12,
      playtime_in_sec: 15,
      device_report_tag: "mydev",
      channel_report_tag: "HBO",
      country_code: "PK",
      #timestamp: "2017-08-12",
    }
  }
]
QUESTION 1:
Are you looking to add a filter/query to the example above? If so you can simply add a "query" node to the query document:
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "terms": { "country_code": ["pk", "us", "se"] } },
        { "range": { "#timestamp": { "gt": "2017-01-01", "lte": "2017-08-11" } } }
      ]
    }
  },
  "aggs": {
    "ch_agg": {
      "terms": {
        "field": "ChID",
        "size": 30
      },
      "aggs": {
        "ch_report_tag_agg": {
          "terms": {
            "field": "channel_report_tag.keyword"
          },
          "aggs": {
            "sum_agg": {
              "sum": {
                "field": "playtime_in_sec"
              }
            }
          }
        }
      }
    }
  }
}
You can use all of Elasticsearch's normal queries/filters to pre-filter your search before you start aggregating. (Regarding performance, Elasticsearch applies any filters/queries before it starts aggregating, so any filtering you can do here will help a lot.)
QUESTION 2:
Off the top of my head I would suggest one of two solutions (unless I'm completely misunderstanding the question):
Add aggs levels for the fields you want in the output in the order you want to drill down. (you can nest aggs within aggs quite deeply without issues and get the bonus of count on each level)
Use the top_hits aggregation on the "lowest" level of aggs, and specify which fields you want in the output using "_source": { "include": [/fields/] }
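For example, a top_hits sub-aggregation under ch_agg could pull the channel name out of one representative document per bucket. This is only a sketch, with channel_doc as an illustrative aggregation name and field names taken from the test data above:

"aggs": {
  "ch_agg": {
    "terms": {
      "field": "ChID",
      "size": 30
    },
    "aggs": {
      "channel_doc": {
        "top_hits": {
          "size": 1,
          "_source": { "includes": [ "channel_report_tag" ] }
        }
      },
      "sum_agg": {
        "sum": { "field": "playtime_in_sec" }
      }
    }
  }
}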
Can you provide a few records of test data?
Also, it is useful to know which version of ElasticSearch you're running as the syntax and behaviour change a lot between major versions.

How to limit ElasticSearch results by a field value?

We've got a system that indexes resume documents in ElasticSearch using the mapper attachment plugin. Alongside the indexed document, I store some basic info, such as whether it's tied to an applicant or an employee, their name, and the ID they're assigned in the system. The query that runs might look something like this when it hits ES:
{
  "size" : 100,
  "query" : {
    "query_string" : {
      "query" : "software AND (developer OR engineer)",
      "default_field" : "fileData"
    }
  },
  "_source" : {
    "includes" : [ "applicant.*", "employee.*" ]
  }
}
And gets me results like:
"hits": [100]
  0: {
    "_index": "careers"
    "_type": "resume"
    "_id": "AVEW8FJcqKzY6y-HB4tr"
    "_score": 0.4530588
    "_source": {
      "applicant": {
        "name": "John Doe"
        "id": 338338
      }
    }
  }...
What I'm trying to do is limit the results, so that if John Doe with id 338338 has three different resumes in the system that all match the query, I only get back one match, preferably the highest scoring one (though that's not as important, as long as I can find the person). I've been trying different options with filters and aggregates, but I haven't stumbled across a way to do this.
There are various approaches I can take in the app that calls ES to tackle this after I get results back, but if I can do it on the ES side, that would be preferable. Since I'm limiting the query to say, 100 results, I'd like to get back 100 individual people, rather than getting back 100 results and then finding out that 25% of them are docs tied to the same person.
What you want to do is an aggregation to get the top 100 unique records, and then a sub aggregation asking for the "top_hits". Here is an example from my system. In my example I'm:
setting the result size to 0 because I only care about the aggregations
setting the size of the aggregation to 100
getting the top 1 result for each bucket
GET index1/type1/_search
{
  "size": 0,
  "aggs": {
    "a1": {
      "terms": {
        "field": "input.user.name",
        "size": 100
      },
      "aggs": {
        "topHits": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}
There's a simpler way to accomplish what @ckasek is looking to do by making use of Elasticsearch's collapse functionality.
Field Collapsing, as described in the Elasticsearch docs:
Allows to collapse search results based on field values. The collapsing is done by selecting only the top sorted document per collapse key.
Based on the original query example above, you would modify it like so:
{
  "size" : 100,
  "query" : {
    "query_string" : {
      "query" : "software AND (developer OR engineer)",
      "default_field" : "fileData"
    }
  },
  "collapse": {
    "field": "id"
  },
  "_source" : {
    "includes" : [ "applicant.*", "employee.*" ]
  }
}
Using the answer above and the link from IanGabes, I was able to restructure my search like so:
{
  "size": 0,
  "query": {
    "query_string": {
      "query": "software AND (developer OR engineer)",
      "default_field": "fileData"
    }
  },
  "aggregations": {
    "employee": {
      "terms": {
        "field": "employee.id",
        "size": 100
      },
      "aggregations": {
        "score": {
          "max": {
            "script": "scores"
          }
        }
      }
    },
    "applicant": {
      "terms": {
        "field": "applicant.id",
        "size": 100
      },
      "aggregations": {
        "score": {
          "max": {
            "script": "scores"
          }
        }
      }
    }
  }
}
This gets me back two sets of buckets, one containing all the applicant IDs together with the highest score from the matched docs, and the same for employees. The script is nothing more than a Groovy script stored on the shard whose entire content is '_score'.
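Where inline scripting is allowed (or on newer versions where Painless is the default), roughly the same max-score sub-aggregation can presumably be written inline rather than referencing a stored script, along these lines:

"aggregations": {
  "score": {
    "max": {
      "script": {
        "source": "_score"
      }
    }
  }
}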

Get the number of unique terms in a field in elasticsearch

Here are some sample documents that I have
doc1
{
  "occassion" : "Birthday",
  "dessert" : "gingerbread"
}

doc2
{
  "occassion" : "Wedding",
  "dessert" : "friand"
}

doc3
{
  "occassion" : "Bethrothal",
  "dessert" : "gingerbread"
}
When I run a simple terms aggregation on the field "dessert", I get results like the ones below:
"aggregations": {
  "desserts": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "gingerbread",
        "doc_count": 2
      },
      {
        "key": "friand",
        "doc_count": 1
      }
    ]
  }
}
The issue here is that if there are many documents and I need to know how many unique keywords exist under the field "dessert", it would take me a long time to work that out from the buckets. Is there a workaround to get just the number of unique terms in the specified field?
The cardinality aggregation seems to be what you're looking for: https://www.elastic.co/guide/en/elasticsearch/guide/current/cardinality.html
Querying this:
{
  "size" : 0,
  "aggs" : {
    "distinct_desserts" : {
      "cardinality" : {
        "field" : "dessert"
      }
    }
  }
}
Would return something like this:
"aggregations": {
  "distinct_desserts": {
    "value": 2
  }
}
I would suggest using cardinality with a higher precision_threshold for a more accurate result, since the count it returns is approximate by default.
GET /cars/transactions/_search
{
  "size" : 0,
  "aggs" : {
    "count_distinct_desserts" : {
      "cardinality" : {
        "field" : "dessert",
        "precision_threshold" : 100
      }
    }
  }
}
