ElasticSearch: search in two different ranges with different aggregations for each

This is an odd question, but I'm trying to avoid calling ES twice to obtain different data from two different time ranges.
Let's say that:
from "2016-10-01 to 2016-10-31" I want to SUM the field "orders.total_sales" (just an example) and another sum "reviews.count".
And from "2016-09-01 to 2016-09-30"
I only want to sum "orders.total_sales".
(The truth is I need like 50 sum aggregations on the first range), but for the second range, I only need 2).
I know it's possible to filter by two ranges of anything using should instead of must. But is it possible to distinguish the result from each range in order to operate with them (aggregations sum).
I don't think it's possible, but just in case someone has come with this issue before.
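For clarity, by filtering two ranges with should I mean a bool query along these lines (field names and dates are placeholders); it matches documents falling in either range, but gives no way to tell afterwards which range a document belonged to:
{
  "query": {
    "bool": {
      "should": [
        { "range": { "date": { "gte": "2016-10-01", "lte": "2016-10-31" } } },
        { "range": { "date": { "gte": "2016-09-01", "lte": "2016-09-30" } } }
      ],
      "minimum_should_match": 1
    }
  }
}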
Thanks in advance.

You can use the filter aggregation for this purpose. You basically write two filters, one per range, and then add whatever sub-aggregations you want under each.
{
  "size": 0,
  "aggs": {
    "range_one": {
      "filter": {
        "range": {
          "your_date_field": {
            "gte": "2016-01-01",
            "lte": "2016-02-02"
          }
        }
      },
      "aggs": {
        "sum_orders": {
          "sum": {
            "field": "your_sum_field1"
          }
        }
      }
    },
    "range_two": {
      "filter": {
        "range": {
          "your_date_field": {
            "gte": "2016-02-01",
            "lte": "2016-03-02"
          }
        }
      },
      "aggs": {
        "sum_orders": {
          "sum": {
            "field": "your_sum_field2"
          }
        }
      }
    }
  }
}

Thank you very much! It worked, though not with filter; the idea is the same.
I ended up writing something like this (after a few ES errors until I got it working):
{
  "timeout": 1500,
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "businessId": "101598"
          }
        },
        {
          "range": {
            "date": {
              "from": "2016-10-15T03:00:00.000Z",
              "to": "2016-10-31T03:00:00.000Z",
              "include_lower": true,
              "include_upper": true
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "range_one": {
      "date_range": {
        "field": "date",
        "ranges": [
          {
            "from": "2016-10-15T03:00:00.000Z",
            "to": "2016-10-22T03:00:00.000Z"
          }
        ]
      },
      "aggs": {
        "sum_orders_sales": {
          "sum": {
            "field": "orders.totalSales"
          }
        }
      }
    },
    "range_two": {
      "date_range": {
        "field": "date",
        "ranges": [
          {
            "from": "2016-10-23T03:00:00.000Z",
            "to": "2016-10-31T03:00:00.000Z"
          }
        ]
      },
      "aggs": {
        "sum_orders_count": {
          "sum": {
            "field": "orders.orderCount"
          }
        }
      }
    }
  }
}
In my case performance and speed are important, and since my two ranges are consecutive, I figured I could filter by the businessId I need and by the full span from the oldest date (the start of the first range) to the newest date (the end of the second range), on the assumption that the aggregations operate on the result of the query (otherwise ES would search all documents, and it is much better to have it run the aggregation operations over a result set fetched just once). I'm new to ES, so I'm not sure I'm seeing it right, but it's working like a charm!
Thanks a lot!

Related

Query returns result with small size that is not my intention in elasticsearch

I am using the REST API to query results from Elasticsearch.
Below is the API query:
GET /..../_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "#timestamp": {
              "time_zone": "+09:00",
              "gte": "2023-01-24T00:00:00.000Z",
              "lt": "2023-01-24T03:03:00.000Z"
            }
          }
        },
        {
          "term": {
            "serviceid.keyword": {
              "value": "430011397"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "by_day": {
      "auto_date_histogram": {
        "field": "#timestamp",
        "minimum_interval": "minute"
      },
      "aggs": {
        "agg-type": {
          "terms": {
            "field": "nxlogtype.keyword",
            "size": 100000
          },
          "aggs": {
            "my-sub-agg-name": {
              "avg": {
                "field": "size"
              }
            }
          }
        }
      }
    }
  }
}
As you can see, I specified a time range of about three hours in the gte and lt fields.
However, the result returns only 6 buckets, each covering a 30-minute interval.
I expected many buckets at one-minute intervals across the timespan I specified, but the result is always the same even when I extend the time range.
Since I am quite new to Elasticsearch, I am not familiar with query usage.
How can I resolve this?
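For context: auto_date_histogram chooses the interval itself so that the response stays close to a target bucket count (the buckets parameter, 10 by default); minimum_interval only sets a lower bound. That is why widening the time range does not produce more buckets. If fixed one-minute buckets are really what is wanted, a plain date_histogram is the closer fit; a rough sketch of just the aggregation section, reusing the field names from the query above:
"aggs": {
  "by_minute": {
    "date_histogram": {
      "field": "#timestamp",
      "fixed_interval": "1m"
    },
    "aggs": {
      "agg-type": {
        "terms": { "field": "nxlogtype.keyword" },
        "aggs": {
          "my-sub-agg-name": { "avg": { "field": "size" } }
        }
      }
    }
  }
}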

Search and aggregation on two indices

Two indices are created, each with a date field.
First index mapping:
PUT /index_one
{
  "mappings": {
    "properties": {
      "date_start": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss.SSSZZ||yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      }
    }
  }
}
Second index mapping:
PUT /index_two
{
  "mappings": {
    "properties": {
      "date_end": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss.SSSZZ||yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      }
    }
  }
}
I need to find dates in a certain range and compute the average difference between the dates.
I tried a request like this:
GET /index_one,index_two/_search?scroll=1m&q=[2021-01-01+TO+2021-12-31]&filter_path=aggregations,hits.total.value,hits.hits
{
  "aggs": {
    "filtered_dates": {
      "filter": {
        "bool": {
          "must": [
            {
              "exists": {
                "field": "date_start"
              }
            },
            {
              "exists": {
                "field": "date_end"
              }
            }
          ]
        }
      },
      "aggs": {
        "avg_date": {
          "avg": {
            "script": {
              "lang": "painless",
              "source": "doc['date_end'].value.toInstant().toEpochMilli() - doc['date_begin'].value.toInstant().toEpochMilli()"
            }
          }
        }
      }
    }
  }
}
I get the following response to the request:
{
  "hits": {
    "total": {
      "value": 16508
    },
    "hits": [
      {
        "_index": "index_one",
        "_type": "_doc",
        "_id": "93a34c5b-101b-45ea-9965-96a2e0446a28",
        "_score": 1.0,
        "_source": {
          "date_begin": "2021-02-26 07:26:29.732+0300"
        }
      }
    ]
  },
  "aggregations": {
    "filtered_dates": {
      "meta": {},
      "doc_count": 0,
      "avg_date": {
        "value": null
      }
    }
  }
}
Can you please tell me if it is possible to make a query with search and aggregation over two indices in Elasticsearch? If so, how?
If you stored date_start on the document which contains date_end, it'd be much easier to figure out the average — check my answer to Store time related data in ElasticSearch.
Now, the script context operates on one single document at a time and has "no clue" about the other, potentially related docs. So if you don't store both dates at the same time in at least one doc, you'd need to somehow connect the docs nonetheless.
One option would be to use their ids:
POST index_one/_doc
{ "id":1, "date_start": "2021-01-01" }
POST index_two/_doc
{ "id":1, "date_end": "2021-12-31" }
POST index_one/_doc/2
{ "id":2, "date_start": "2021-01-01" }
POST index_two/_doc/2
{ "id":2, "date_end": "2021-01-31" }
After that, it's possible to:
Target multiple indices — as you already do.
Group the docs by their IDs and select only those that include at least 2 buckets (assuming two buckets represent the start & the end).
Obtain the min & max dates — essentially cherry-picking the date_start and date_end to be used later down the line.
Use a bucket_script aggregation to calculate their difference (in milliseconds).
Leverage a top-level average bucket aggregation to run over all the difference buckets and ... average them.
In concrete terms:
GET /index_one,index_two/_search?scroll=1m&q=[2021-01-01+TO+2021-12-31]&filter_path=aggregations,hits.total.value,hits.hits
{
  "aggs": {
    "grouped_by_id": {
      "terms": {
        "field": "id",
        "min_doc_count": 2,
        "size": 10
      },
      "aggs": {
        "min_date": {
          "min": {
            "field": "date_start"
          }
        },
        "max_date": {
          "max": {
            "field": "date_end"
          }
        },
        "diff": {
          "bucket_script": {
            "buckets_path": {
              "min": "min_date",
              "max": "max_date"
            },
            "script": "params.max - params.min"
          }
        }
      }
    },
    "avg_duration_across_the_board": {
      "avg_bucket": {
        "buckets_path": "grouped_by_id>diff",
        "gap_policy": "skip"
      }
    }
  }
}
If everything goes right, you'll end up with:
...
"aggregations" : {
  "grouped_by_id" : {
    ...
  },
  "avg_duration_across_the_board" : {
    "value" : 1.70208E10    <-- 17,020,800,000 milliseconds ~ 4,728 hrs
  }
}
⚠️ Caveat: note that the 2nd level terms aggregation has an adjustable size. You'll probably need to increase it to cover more docs. But there are theoretical and practical limits as to how far it makes sense to increase it.
📖 Shameless plug: this was inspired in part by the chapter Aggregations & Buckets in my recently published Elasticsearch Handbook — containing lots of other real-world, non-trivial examples 🙌

Elasticsearch aggregate field between dates

I want to compare two buckets against each other and find new occurrences that appear in the second bucket. The query below returns all entries in the "query.keyword" field between the two UNIX timestamps provided, but I want the UNIX timestamps to be part of the aggregation section itself.
GET _search
{
  "size": 0,
  "query": {
    "range": {
      "ts": {
        "gte": 1535155200,
        "lte": 1535414399
      }
    }
  },
  "aggs": {
    "domains": {
      "terms": {
        "field": "query.keyword"
      }
    }
  }
}
I've also tried this but received the error:
"Found two aggregation type definitions in [domains_prev]: [range] and [terms]",
GET _search
{
  "size": 0,
  "aggs": {
    "domains_prev": {
      "range": {
        "field": "ts",
        "ranges": [
          { "to": 1535414399 },
          { "from": 1535155200 }
        ]
      },
      "terms": {
        "field": "query.keyword"
      }
    }
  }
}
The goal is to have something similar to this:
Agg1
"domains_prev"
"field":"query.keyword"
date:gte:timestamp, lte:timestamp
Agg2
"domains_today"
"field":"query.keyword"
date:today
show all "query.keyword" in agg2 that does not appear in agg1.
This is the SQL query that I use to achieve the intended result:
select domains FROM table WHERE date >= 20171123 and domains NOT IN (SELECT domains FROM table WHERE date < 20171123 group by domains)
You'll want to do a nested bucket aggregation starting with date range:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-daterange-aggregation.html
From their page, start with an aggregation like this at the top level:
{
  "aggs": {
    "range": {
      "date_range": {
        "field": "date",
        "format": "MM-yyy",
        "ranges": [
          { "to": "now-10M/M" },
          { "from": "now-10M/M" }
        ]
      }
    }
  }
}
Then nest your existing terms aggregation using query.keyword under that.
The end result should be something like:
{
  "aggs": {
    "range": {
      "date_range": {
        "field": "date",
        "format": "MM-yyy",
        "ranges": [
          { "to": "now-10M/M" },
          { "from": "now-10M/M" }
        ]
      },
      "aggs": {
        "domains": {
          "terms": {
            "field": "query.keyword"
          }
        }
      }
    }
  }
}

Elasticsearch derivative of a deep metric

I have a web crawler that collects data and stores snapshots several times a day. My query has some aggregations that group the snapshots together per day and return the last snapshot of each day using top_hits.
The documents look like this:
"_source": {
"taken_at": "2016-02-01T11:27:09.184-03:00",
... ,
"my_metric": 113
}
I'd like to be able to calculate the derivative of a certain metric, say my_metric, over the documents returned by top_hits (i.e., the derivative of my_metric taken across each day's last snapshot).
Here's what I have so far:
{
  "aggs": {
    "filtered_snapshots": {
      "filter": {
        // ...
      },
      "aggs": {
        "grouped_data": {
          "date_histogram": {
            "field": "taken_at",
            "interval": "day",
            "format": "YYYY-MM-dd",
            "order": { "_key": "asc" }
          },
          "aggs": {
            "resource_by_date": {
              "terms": { "field": "remote_id" },
              "aggs": {
                "latest_snapshots": {
                  "top_hits": {
                    "sort": { "taken_at": { "order": "asc" } },
                    "size": 1
                  }
                }
              }
            },
            "my_metric_deriv": {
              "derivative": {
                "buckets_path": "resource_by_date>latest_snapshots>my_metric"
              }
            }
          }
        }
      }
    }
  }
}
I get a "No aggregation [my_metric] found for path ..." error with the query above.
Am I using a wrong bucket_path? I've read through the bucket_path and the derivative documentation and haven't found much that could help.
The documentation mentions briefly "deep metrics", stating that they can be limited in some ways, which I couldn't quite understand. I'm not sure how or if the limitations affect my case.
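A couple of notes that may explain the error: buckets_path entries have to name aggregations, and my_metric here is a document field rather than a sub-aggregation, which is why ES reports that no aggregation with that name exists; in addition, a derivative cannot read values out of a top_hits aggregation. One possible reshaping, assuming the last snapshot of a day can be approximated with a plain max on my_metric (or that only one snapshot per day is kept), is to make the terms on remote_id the outermost aggregation and take the derivative per resource:
{
  "aggs": {
    "per_resource": {
      "terms": { "field": "remote_id" },
      "aggs": {
        "per_day": {
          "date_histogram": {
            "field": "taken_at",
            "interval": "day"
          },
          "aggs": {
            "day_metric": {
              "max": { "field": "my_metric" }
            },
            "my_metric_deriv": {
              "derivative": { "buckets_path": "day_metric" }
            }
          }
        }
      }
    }
  }
}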

Calculating sum of nested fields with date_histogram aggregation in Elasticsearch

I'm having trouble getting the sum of a nested field in Elasticsearch using a date_histogram, and I'm hoping somebody can lend me a hand.
I have a mapping that looks like this:
"client" : {
// various irrelevant stuff here...
"associated_transactions" : {
"type" : "nested",
"include_in_parent" : true,
"properties" : {
"amount" : {
"type" : "double"
},
"effective_at" : {
"type" : "date",
"format" : "dateOptionalTime"
}
}
}
}
I'm trying to get a date_histogram that shows total revenue by month across all clients, i.e. a time series showing the sum of associated_transactions.amount bucketed by associated_transactions.effective_at. I tried running this query:
{
  "query": {
    // ...
  },
  "aggregations": {
    "revenue": {
      "date_histogram": {
        "interval": "month",
        "min_doc_count": 0,
        "field": "associated_transactions.effective_at"
      },
      "aggs": {
        "monthly_revenue": {
          "sum": {
            "field": "associated_transactions.amount"
          }
        }
      }
    }
  }
}
But the sum it's giving me isn't right. It seems that what ES is doing is finding all clients who have any transaction in a given month, then summing all of the transactions (from any time) for those clients. That is, it's a sum of the amount spent in the lifetime of a client who made a purchase in a given month, not the sum of purchases in a given month.
Is there any way to get the data I'm looking for, or is this a limitation in how ES handles nested fields?
Thanks very much in advance for your help!
David
Try this?
{
  "query": {
    // ...
  },
  "aggregations": {
    "revenue": {
      "date_histogram": {
        "interval": "month",
        "min_doc_count": 0,
        "field": "associated_transactions.effective_at",
        "aggs": {
          "monthly_revenue": {
            "sum": {
              "field": "associated_transactions.amount"
            }
          }
        }
      }
    }
  }
}
i.e. move the "aggs" key into the "date_histogram" field.
I stumbled upon this question while trying to solve a similar problem with my own ES implementation.
It seems that Elasticsearch currently looks at the position of an aggregation in the JSON request body tree, not at the inheritance of its objects and fields. So you should not put your sum aggregation inside "date_histogram", but place it outside, on the same JSON tree level.
This worked for me:
{
  "size": 0,
  "aggs": {
    "histogram_aggregation": {
      "date_histogram": {
        "field": "date_field",
        "calendar_interval": "day"
      },
      "aggs": {
        "views": {
          "sum": {
            "field": "the_field_i_want_to_sum"
          }
        }
      }
    }
  },
  "query": {
    // some query
  }
}
The OP's mistake was placing the sum aggregation inside the date_histogram aggregation instead of beside it.
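One further note on the original problem: the mapping above declares associated_transactions as nested with include_in_parent enabled, which is what flattens every transaction's values into the parent client document and is likely what produces the "lifetime total per client" sums described in the question. The usual way to keep the sum scoped to the individual transactions is a nested aggregation; a minimal, untested sketch reusing the field names from the mapping above:
{
  "size": 0,
  "aggs": {
    "transactions": {
      "nested": { "path": "associated_transactions" },
      "aggs": {
        "revenue": {
          "date_histogram": {
            "field": "associated_transactions.effective_at",
            "interval": "month",
            "min_doc_count": 0
          },
          "aggs": {
            "monthly_revenue": {
              "sum": { "field": "associated_transactions.amount" }
            }
          }
        }
      }
    }
  }
}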
