Elastic Search get top grouped sums with additional filters (Elasticsearch version5.3) - elasticsearch

This is my Mapping :
{
"settings" : {
"number_of_shards" : 2,
"number_of_replicas" : 1
},
"mappings" :{
"cpt_logs_mapping" : {
"properties" : {
"channel_id" : {"type":"integer","store":"yes","index":"not_analyzed"},
"playing_date" : {"type":"string","store":"yes","index":"not_analyzed"},
"country_code" : {"type":"text","store":"yes","index":"analyzed"},
"playtime_in_sec" : {"type":"integer","store":"yes","index":"not_analyzed"},
"channel_name" : {"type":"text","store":"yes","index":"analyzed"},
"device_report_tag" : {"type":"text","store":"yes","index":"analyzed"}
}
}
}
}
I want to query the index similar to the way I do using the following MySQL query :
SELECT
channel_name,
SUM(`playtime_in_sec`) as playtime_in_sec
FROM
channel_play_times_bar_chart
WHERE
country_code = 'country' AND
device_report_tag = 'device' AND
channel_name = 'channel'
playing_date BETWEEN 'date_range_start' AND 'date_range_end'
GROUP BY channel_id
ORDER BY SUM(`playtime_in_sec`) DESC
LIMIT 30;
So far my QueryDSL looks like this
{
"size": 0,
"aggs": {
"ch_agg": {
"terms": {
"field": "channel_id",
"size": 30 ,
"order": {
"sum_agg": "desc"
}
},
"aggs": {
"sum_agg": {
"sum": {
"field": "playtime_in_sec"
}
}
}
}
}
}
QUESTION 1
Although the QueryDSL I have made does return me the top 30 channel_ids w.r.t playtimes but I am confused how to add other filters too within the search i.e country_code, device_report_tag & playing_date.
QUESTION 2
Another issue is that the result set contains only the channel_id and playtime fields unlike the MySQL result set which returns me channel_name and playtime_in_sec columns. This means I want to achieve aggregation using channel_id field but result set should instead return corresponding channel_name name of the group.
NOTE: Performance over here is a top priority as this is supposed to be running behind a graph generator querying millions or even more docs.
TEST DATA
hits: [
{
_index: "cpt_logs_index",
_type: "cpt_logs_mapping",
_id: "",
_score: 1,
_source: {
ChID: 1453,
playtime_in_sec: 35,
device_report_tag: "mydev",
channel_report_tag: "Sony Six",
country_code: "SE",
#timestamp: "2017-08-11",
}
},
{
_index: "cpt_logs_index",
_type: "cpt_logs_mapping",
_id: "",
_score: 1,
_source: {
ChID: 145,
playtime_in_sec: 25,
device_report_tag: "mydev",
channel_report_tag: "Star Movies",
country_code: "US",
#timestamp: "2017-08-11",
}
},
{
_index: "cpt_logs_index",
_type: "cpt_logs_mapping",
_id: "",
_score: 1,
_source: {
ChID: 12,
playtime_in_sec: 15,
device_report_tag: "mydev",
channel_report_tag: "HBO",
country_code: "PK",
#timestamp: "2017-08-12",
}
}
]

QUESTION1:
Are you looking to add a filter/query to the example above? If so you can simply add a "query" node to the query document:
{
"size": 0,
"query":{
"bool":{
"must":[
{"terms": { "country_code": ["pk","us","se"] } },
{"range": { "#timestamp": { "gt": "2017-01-01", "lte": "2017-08-11" } } }
]
}
},
"aggs": {
"ch_agg": {
"terms": {
"field": "ChID",
"size": 30
},
"aggs":{
"ch_report_tag_agg": {
"terms":{
"field" :"channel_report_tag.keyword"
},
"aggs":{
"sum_agg":{
"sum":{
"field":"playtime_in_sec"
}
}
}
}
}
}
}
}
You can use all normal queries/filters for elastic to pre-filter your search before you start aggregating (Regarding performance, elasticsearch will apply any filters / queries before starting to aggregate, so any filtering you can do here will help a lot)
Question2:
On the top of my head I would suggest one of two solutions (unless I'm not completely misunderstanding the question):
Add aggs levels for the fields you want in the output in the order you want to drill down. (you can nest aggs within aggs quite deeply without issues and get the bonus of count on each level)
Use the top_hits aggregation on the "lowest" level of aggs, and specify which fields you want in the output using "_source": { "include": [/fields/] }
Can you provide a few records of test data?
Also, it is useful to know which version of ElasticSearch you're running as the syntax and behaviour change a lot between major versions.

Related

Average of differences calculated between two date fields

I'm working on a project that uses Elasticsearch to store data and show some complex statistics.
I have an index in that looks like this:
Reservation {
id: number
check_in: Date
check_out: Date
created_at: Date
// other fields...
}
I need to calculate the average days' difference between check_in and created_at of my Reservations in a specific date range and show the result as a number.
I tried this query:
{
"script_fields": {
"avgDates": {
"script": {
"lang": "expression",
"source": "doc['created_at'].value - doc['check_in'].value"
}
}
},
"query": {
"bool": {
"must": [
{
"range": {
"created_at": {
"gte": "{{lastMountTimestamp}}",
"lte": "{{currentTimestamp}}"
}
}
}
]
}
},
"size": 0,
"aggs": {
"avgBetweenDates": {
"avg": {
"field": "avgDates"
}
}
}
}
Dates fields are saved in ISO 8601 form (eg: 2020-03-11T14:25:15+00:00), I don't know if this could produce issues.
It catches some hits, So, the query works for sure! but, it always returns null as the value of the avgBetweenDates aggregation.
I need a result like this:
"aggregations": {
"avgBetweenDates": {
"value": 3.14159 // Π is just an example!
}
}
Any ideas will help!
Thank you.
Scripted Fields are not stored fields in ES. You can only perform aggregation on the stored fields as scripted fields are created on the fly.
You can simply move the script logic in the Average Aggregation as shown below. Note that for the sake of understanding, I've created sample mapping, documents, query and its response.
Mapping:
PUT my_date_index
{
"mappings": {
"properties": {
"check_in":{
"type":"date",
"format": "date_time"
},
"check_out":{
"type": "date",
"format": "date_time"
},
"created_at":{
"type": "date",
"format": "date_time"
}
}
}
}
Sample Documents:
POST my_date_index/_doc/1
{
"check_in": "2019-01-15T00:00:00.000Z",
"check_out": "2019-01-20T00:00:00.000Z",
"created_at": "2019-01-17T00:00:00.000Z"
}
POST my_date_index/_doc/2
{
"check_in": "2019-01-15T00:00:00.000Z",
"check_out": "2019-01-22T00:00:00.000Z",
"created_at": "2019-01-20T00:00:00.000Z"
}
Aggregation Query:
POST my_date_index/_search
{
"size": 0,
"aggs": {
"my_dates_diff": {
"avg": {
"script": """
ZonedDateTime d1 = doc['created_at'].value;
ZonedDateTime d2 = doc['check_in'].value;
long differenceInMillis = ChronoUnit.MILLIS.between(d1, d2);
return Math.abs(differenceInMillis/86400000);
"""
}
}
}
}
Notice, that you wanted difference in number of days. The above logic does that.
Response:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"my_dates_diff" : {
"value" : 3.5 <---- Average in Number of Days
}
}
}
Hope this helps!
Scripted fields created within the _search context can only be consumed within that scope. They're not visible within the aggregations! This means you'll have to go with either
moving your script to the aggs section and doing the avg there
a scripted metric aggregation (quite slow and difficult to get right)
or creating a dateDifference field at index time (preferably an int -- a difference of the timestamps) which will enable you to perform powerful numeric aggs like extended stats which provide a statistically useful output like:
{
...
"aggregations": {
"grades_stats": {
"count": 2,
"min": 50.0,
"max": 100.0,
"avg": 75.0,
"sum": 150.0,
"sum_of_squares": 12500.0,
"variance": 625.0,
"std_deviation": 25.0,
"std_deviation_bounds": {
"upper": 125.0,
"lower": 25.0
}
}
}
}
and are always faster than computing the timestamp differences with a script.

How to apply exact match on single field and distinct on multiple fields together in ElasticSearch?

I recently started working on ElasticSearch, and I am trying search for following criteria
I want to apply exact match on ENAME & distinct on both EID & ENAME on above data.
Let say for matching, I have string ABC.
So result should be like as below
[
{"EID" :111, "ENAME" : "ABC"},
{"EID" : 444, "ENAME" : "ABC"}
]
You can achieve this via a combination of term query and terms aggregation.
Assuming that you have the following mapping:
PUT my_index
{
"mappings": {
"doc": {
"properties": {
"EID": {
"type": "keyword"
},
"ENAME": {
"type": "keyword"
}
}
}
}
}
And inserted the documents like this:
POST my_index/doc/3
{
"EID": "111",
"ENAME": "ABC"
}
POST my_index/doc/4
{
"EID": "222",
"ENAME": "XYZ"
}
POST my_index/doc/12
{
"EID": "444",
"ENAME": "ABC"
}
The query that will do the job might look like this:
POST my_index/doc/_search
{
"query": {
"term": { 1️⃣
"ENAME": "ABC"
}
},
"size": 0, 3️⃣
"aggregations": {
"by EID": {
"terms": { 2️⃣
"field": "EID"
}
}
}
}
Let me explain how it works:
1️⃣ - term query asks Elasticsearch to filter on exact value of a keyword field "ENAME";
2️⃣ - terms aggregation collects the list of all possible values of another keyword field "EID" and gives back the first N most frequent ones;
3️⃣ - "size": 0 tells Elasticsearch not to return any search hits (we are only interested in the aggregations).
The output of the query will look like this:
{
"hits": {
"total": 2,
"max_score": 0,
"hits": []
},
"aggregations": {
"by EID": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "111", <== Here is the first "distinct" value that we wanted
"doc_count": 3
},
{
"key": "444", <== Here is another "distinct" value
"doc_count": 2
}
]
}
}
}
The output does not look exactly like what you posted in the question, but I believe it is the closest what you can achieve with Elasticsearch.
However, this output is equivalent:
"ENAME" is implicitly present (since its value was used for filtering)
"EID" is present under the "buckets" of the aggregations section.
Note that under "doc_count" you will find the number of documents having such "EID".
What if I want to do a DISTINCT on several fields?
For a more complex scenario (e.g. when you need to do a distinct on many fields) see this answer.
More information about aggregations is available here.
Hope that helps!

Return distinct values in Elasticsearch

I am trying to solve an issue where I have to get distinct result in the search.
{
"name" : "ABC",
"favorite_cars" : [ "ferrari","toyota" ]
}, {
"name" : "ABC",
"favorite_cars" : [ "ferrari","toyota" ]
}, {
"name" : "GEORGE",
"favorite_cars" : [ "honda","Hyundae" ]
}
When I perform a term query on favourite cars "ferrari". I get two results whose name is ABC. I simply want that the result returned should be one in this case. So my requirement will be if I can apply a distinct on name field to receive one 1 result.
Thanks
One way to achieve what you want is to use a terms aggregation on the name field and then a top_hits sub-aggregation with size 1, like this:
{
"size": 0,
"query": {
"term": {
"favorite_cars": "ferrari"
}
},
"aggs": {
"names": {
"terms": {
"field": "name"
},
"aggs": {
"single_result": {
"top_hits": {
"size": 1
}
}
}
}
}
}
That way, you'll get a single term ABC and then nested into it a single matching document

How to limit ElasticSearch results by a field value?

We've got a system that indexes resume documents in ElasticSearch using the mapper attachment plugin. Alongside the indexed document, I store some basic info, like if it's tied to an applicant or employee, their name, and the ID they're assigned in the system. A query that runs might look something like this when it hits ES:
{
"size" : 100,
"query" : {
"query_string" : {
"query" : "software AND (developer OR engineer)",
"default_field" : "fileData"
}
},
"_source" : {
"includes" : [ "applicant.*", "employee.*" ]
}
}
And gets me results like:
"hits": [100]
0: {
"_index": "careers"
"_type": "resume"
"_id": "AVEW8FJcqKzY6y-HB4tr"
"_score": 0.4530588
"_source": {
"applicant": {
"name": "John Doe"
"id": 338338
}
}
}...
What I'm trying to do is limit the results, so that if John Doe with id 338338 has three different resumes in the system that all match the query, I only get back one match, preferably the highest scoring one (though that's not as important, as long as I can find the person). I've been trying different options with filters and aggregates, but I haven't stumbled across a way to do this.
There are various approaches I can take in the app that calls ES to tackle this after I get results back, but if I can do it on the ES side, that would be preferable. Since I'm limiting the query to say, 100 results, I'd like to get back 100 individual people, rather than getting back 100 results and then finding out that 25% of them are docs tied to the same person.
What you want to do is an aggregation to get the top 100 unique records, and then a sub aggregation asking for the "top_hits". Here is an example from my system. In my example I'm:
setting the result size to 0 because I only care about the aggregations
setting the size of the aggregation to 100
for each aggregation, get the top 1 result
GET index1/type1/_search
{
"size": 0,
"aggs": {
"a1": {
"terms": {
"field": "input.user.name",
"size": 100
},
"aggs": {
"topHits": {
"top_hits": {
"size": 1
}
}
}
}
}
}
There's a simpler way to accomplish what #ckasek is looking to do by making use of Elasticsearch's collapse functionality.
Field Collapsing, as described in the Elasticsearch docs:
Allows to collapse search results based on field values. The collapsing is done by selecting only the top sorted document per collapse key.
Based on the original query example above, you would modify it like so:
{
"size" : 100,
"query" : {
"query_string" : {
"query" : "software AND (developer OR engineer)",
"default_field" : "fileData"
}
},
"collapse": {
"field": "id",
},
"_source" : {
"includes" : [ "applicant.*", "employee.*" ]
}
}
Using the answer above and the link from IanGabes, I was able to restructure my search like so:
{
"size": 0,
"query": {
"query_string": {
"query": "software AND (developer OR engineer)",
"default_field": "fileData"
}
},
"aggregations": {
"employee": {
"terms": {
"field": "employee.id",
"size": 100
},
"aggregations": {
"score": {
"max": {
"script": "scores"
}
}
}
},
"applicant": {
"terms": {
"field": "applicant.id",
"size": 100
},
"aggregations": {
"score": {
"max": {
"script": "scores"
}
}
}
}
}
}
This gets me back two buckets, one containing all the applicant Ids and the highest score from the matched docs, as well as the same for employees. The script is nothing more than a groovy script on the shard that contains '_score' as the content.

ElasticSearch: aggregation on _score field?

I would like to use the stats or extended_stats aggregation on the _score field but can't find any examples of this being done (i.e., seems like you can only use aggregations with actual document fields).
Is it possible to request aggregations on calculated "metadata" fields for each hit in an ElasticSearch query response (e.g., _score, _type, _shard, etc.)?
I'm assuming the answer is 'no' since fields like _score aren't indexed...
Note: The original answer is now outdated in terms of the latest version of Elasticsearch. The equivalent script using Groovy scripting would be:
{
...,
"aggregations" : {
"grades_stats" : {
"stats" : {
"script" : "_score"
}
}
}
}
In order to make this work, you will need to enable dynamic scripting or, even better, store a file-based script and execute it by name (for added security by not enabling dynamic scripting)!
You can use a script and refer to the score using doc.score. More details are available in ElasticSearch's scripting documentation.
A sample stats aggregation could be:
{
...,
"aggregations" : {
"grades_stats" : {
"stats" : {
"script" : "doc.score"
}
}
}
}
And the results would look like:
"aggregations": {
"grades_stats": {
"count": 165,
"min": 0.46667441725730896,
"max": 3.1525731086730957,
"avg": 0.8296855776598959,
"sum": 136.89812031388283
}
}
A histogram may also be a useful aggregation:
"aggs": {
"grades_histogram": {
"histogram": {
"script": "doc.score * 10",
"interval": 3
}
}
}
Histogram results:
"aggregations": {
"grades_histogram": {
"buckets": [
{
"key": 3,
"doc_count": 15
},
{
"key": 6,
"doc_count": 103
},
{
"key": 9,
"doc_count": 46
},
{
"key": 30,
"doc_count": 1
}
]
}
}
doc.score doesn't seem to work anymore. Using _score seems to work perfectly.
Example:
{
...,
"aggregations" : {
"grades_stats" : {
"stats" : {
"script" : "_score"
}
}
}
}

Resources