Elasticsearch derivative of a deep metric

I have a web crawler that collects data and stores snapshots several times a day. My query has some aggregations that group the snapshots together per day and return the last snapshot of each day using top_hits.
The documents look like this:
"_source": {
"taken_at": "2016-02-01T11:27:09.184-03:00",
... ,
"my_metric": 113
}
I'd like to be able to calculate the derivative of a certain metric, say my_metric, of the documents returned by top_hits (i.e., the derivative of the last snapshots of each day's my_metric).
Here's what I have so far:
{
"aggs": {
"filtered_snapshots": {
"filter": {
// ...
},
"aggs" : {
"grouped_data": {
"date_histogram": {
"field": "taken_at",
"interval": "day",
"format": "YYYY-MM-dd",
"order": { "_key" : "asc" }
},
"aggs": {
"resource_by_date": {
"terms": { "field": "remote_id" },
"aggs": {
"latest_snapshots": {
"top_hits": {
"sort": { "taken_at": { "order": "asc" }},
"size" : 1
}
}
}
},
"my_metric_deriv": {
"derivative": {
"buckets_path": "resource_by_date>latest_snapshots>my_metric"
}
}
}
}
}
}
}
}
I get a "No aggregation [my_metric] found for path ..." error with the query above.
Am I using a wrong buckets_path? I've read through the buckets_path and derivative documentation and haven't found much that could help.
The documentation mentions briefly "deep metrics", stating that they can be limited in some ways, which I couldn't quite understand. I'm not sure how or if the limitations affect my case.

Related

Search and aggregation on two indices

Two indexes are created with the dates.
First index mapping:
PUT /index_one
{
"mappings": {
"properties": {
"date_start": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss.SSSZZ||yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
}
}
}
}
Second index mapping:
PUT /index_two
{
"mappings": {
"properties": {
"date_end": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss.SSSZZ||yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
}
}
}
}
I need to find dates within a certain range and compute the average of the differences between the dates.
I tried a request like this:
GET /index_one,index_two/_search?scroll=1m&q=[2021-01-01+TO+2021-12-31]&filter_path=aggregations,hits.total.value,hits.hits
{
"aggs": {
"filtered_dates": {
"filter": {
"bool": {
"must": [
{
"exists": {
"field": "date_start"
}
},
{
"exists": {
"field": "date_end"
}
}
]
}
},
"aggs": {
"avg_date": {
"avg": {
"script": {
"lang": "painless",
"source": "doc['date_end'].value.toInstant().toEpochMilli() - doc['date_begin'].value.toInstant().toEpochMilli()"
}
}
}
}
}
}
}
I get the following response to the request:
{
"hits": {
"total": {
"value": 16508
},
"hits": [
{
"_index": "index_one",
"_type": "_doc",
"_id": "93a34c5b-101b-45ea-9965-96a2e0446a28",
"_score": 1.0,
"_source": {
"date_begin": "2021-02-26 07:26:29.732+0300"
}
}
]
},
"aggregations": {
"filtered_dates": {
"meta": {},
"doc_count": 0,
"avg_date": {
"value": null
}
}
}
}
Can you please tell me if it is possible to make a query with search and aggregation over two indices in Elasticsearch? If so, how?
If you stored date_start on the document which contains date_end, it'd be much easier to figure out the average — check my answer to Store time related data in ElasticSearch.
Now, the script context operates on one single document at a time and has "no clue" about the other, potentially related docs. So if you don't store both dates at the same time in at least one doc, you'd need to somehow connect the docs nonetheless.
One option would be to use their ids:
POST index_one/_doc
{ "id":1, "date_start": "2021-01-01" }
POST index_two/_doc
{ "id":1, "date_end": "2021-12-31" }
POST index_one/_doc/2
{ "id":2, "date_start": "2021-01-01" }
POST index_two/_doc/2
{ "id":2, "date_end": "2021-01-31" }
After that, it's possible to:
1. Target multiple indices, as you already do.
2. Group the docs by their IDs and keep only the groups that contain at least 2 docs (assuming the two docs represent the start and the end).
3. Obtain the min and max dates, essentially cherry-picking the date_start and date_end to be used later down the line.
4. Use a bucket_script aggregation to calculate their difference (in milliseconds).
5. Leverage a top-level avg_bucket aggregation to run over all the difference buckets and average them.
In concrete terms:
GET /index_one,index_two/_search?scroll=1m&q=[2021-01-01+TO+2021-12-31]&filter_path=aggregations,hits.total.value,hits.hits
{
"aggs": {
"grouped_by_id": {
"terms": {
"field": "id",
"min_doc_count": 2,
"size": 10
},
"aggs": {
"min_date": {
"min": {
"field": "date_start"
}
},
"max_date": {
"max": {
"field": "date_end"
}
},
"diff": {
"bucket_script": {
"buckets_path": {
"min": "min_date",
"max": "max_date"
},
"script": "params.max - params.min"
}
}
}
},
"avg_duration_across_the_board": {
"avg_bucket": {
"buckets_path": "grouped_by_id>diff",
"gap_policy": "skip"
}
}
}
}
If everything goes right, you'll end up with:
...
"aggregations" : {
"grouped_by_id" : {
...
},
"avg_duration_across_the_board" : {
"value" : 1.70208E10 <-- 17,020,800,000 milliseconds ~ 4,728 hrs
}
}
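The figure above can be sanity-checked outside Elasticsearch. The sketch below is plain Python and not part of the original answer. It assumes the four sample documents indexed earlier and mirrors the grouped_by_id > diff > avg_bucket chain by hand; the helper names to_epoch_ms and average_duration_ms are made up for illustration.

```python
from datetime import date, datetime, timezone

# Sample docs from the answer, keyed by their shared "id" field.
starts = {1: date(2021, 1, 1), 2: date(2021, 1, 1)}   # index_one
ends = {1: date(2021, 12, 31), 2: date(2021, 1, 31)}  # index_two


def to_epoch_ms(d):
    """Midnight UTC of a date, expressed as epoch milliseconds."""
    dt = datetime(d.year, d.month, d.day, tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)


def average_duration_ms(starts, ends):
    """Mimic terms(min_doc_count=2) -> bucket_script(max - min) -> avg_bucket."""
    # Only ids present in both indices form a bucket with two docs.
    shared = starts.keys() & ends.keys()
    diffs = [to_epoch_ms(ends[i]) - to_epoch_ms(starts[i]) for i in shared]
    return sum(diffs) / len(diffs)


avg_ms = average_duration_ms(starts, ends)
print(avg_ms)              # 17020800000.0
print(avg_ms / 3_600_000)  # 4728.0
```

The two durations are 364 days and 30 days, so the average of 197 days matches the 1.70208E10 milliseconds reported by avg_bucket.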
⚠️ Caveat: note that the grouped_by_id terms aggregation has an adjustable size (10 in the example above). You'll probably need to increase it to cover more IDs. But there are theoretical and practical limits as to how far it makes sense to increase it.
📖 Shameless plug: this was inspired in part by the chapter Aggregations & Buckets in my recently published Elasticsearch Handbook — containing lots of other real-world, non-trivial examples 🙌

Elasticsearch Pipelined search?

I've been using Elasticsearch for a while at my company, and it seems to have been working well so far for our searches.
We've been seeing more complex use cases from our customers that need more "ad-hoc/advanced" query capabilities and inter-document relationships (or joins in the traditional sense).
I understand that ES isn't built for joins and that denormalisation is the recommended way. We have been denormalising the documents to support every use case so far, and that in itself has become overly complex and expensive: our customers have to wait a long time for each such code change to be rolled out.
We are often criticized by our business: "Hey, your data model isn't right. It isn't suited for smarter queries." It gets harder every time for the team to make them understand why denormalisation is required.
A few examples of the problems:
"Find me all the persons having the same birthdays"
"Find me all the persons travelling to the same cities within the same time frame"
Imagine every event document is a person record with their travel details.
So is there a concept of a pipeline search where I can break the search into multiple search queries and pass the output of one as an input to another?
Or is there any other recommended way to solve these types of problems without having to boil the ocean?
The two queries above can be solved with aggregations.
I'm assuming the following sample document/schema:
{
"firstName": "John",
"lastName": "Doe",
"birthDate": "1998-04-02",
"travelDate": "2019-10-31",
"city": "London"
}
The first one can be solved with a terms aggregation on the birthDate field (reduced to month and day by a script) and min_doc_count: 2, e.g.:
{
"size": 0,
"aggs": {
"birthdays": {
"terms": {
"script": "return LocalDate.parse(params._source.birthDate).format(DateTimeFormatter.ofPattern('MM/dd'))",
"min_doc_count": 2
},
"aggs": {
"persons": {
"top_hits": {}
}
}
}
}
}
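To illustrate what the scripted terms aggregation with min_doc_count: 2 computes, here is a rough in-memory equivalent in Python. It is only a sketch: the sample records are invented, and shared_birthdays is a hypothetical helper, not an Elasticsearch API.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample records following the schema assumed in the answer.
people = [
    {"firstName": "John", "lastName": "Doe", "birthDate": "1998-04-02"},
    {"firstName": "Jane", "lastName": "Roe", "birthDate": "1985-04-02"},
    {"firstName": "Max", "lastName": "Poe", "birthDate": "1990-12-24"},
]


def shared_birthdays(records, min_doc_count=2):
    """Group records by MM/dd of birthDate; keep groups with enough members."""
    buckets = defaultdict(list)
    for rec in records:
        # Same key the Painless script produces: month/day without the year.
        key = date.fromisoformat(rec["birthDate"]).strftime("%m/%d")
        buckets[key].append(rec["firstName"])
    return {k: v for k, v in buckets.items() if len(v) >= min_doc_count}


print(shared_birthdays(people))  # {'04/02': ['John', 'Jane']}
```

Each surviving key corresponds to one bucket of the birthdays aggregation, and the list of names plays the role of the top_hits sub-aggregation.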
The second one by aggregating with a terms aggregation on the city field and constrained with a range query on the travelDate field for the desired time frame:
{
"size": 0,
"query": {
"range": {
"travelDate": {
"gte": "2019-10-01",
"lt": "2019-11-01"
}
}
},
"aggs": {
"cities": {
"terms": {
"field": "city.keyword"
},
"aggs": {
"persons": {
"top_hits": {}
}
}
}
}
}
The second query can also be done with field collapsing:
{
"_source": false,
"query": {
"range": {
"travelDate": {
"gte": "2019-10-01",
"lt": "2019-11-01"
}
}
},
"collapse": {
"field": "city.keyword",
"inner_hits": {
"name": "people"
}
}
}
If you need both aggregations at the same time, it is definitely possible to do so:
{
"size": 0,
"aggs": {
"birthdays": {
"terms": {
"script": "return LocalDate.parse(params._source.birthDate).format(DateTimeFormatter.ofPattern('MM/dd'))",
"min_doc_count": 2
},
"aggs": {
"persons": {
"top_hits": {}
}
}
},
"travels": {
"filter": {
"range": {
"travelDate": {
"gte": "2019-10-01",
"lt": "2019-11-01"
}
}
},
"aggs": {
"cities": {
"terms": {
"field": "city.keyword"
},
"aggs": {
"persons": {
"top_hits": {}
}
}
}
}
}
}
}

How to get specific _source fields in aggregation

I am exploring Elasticsearch for use in an application that will handle large volumes of data and generate statistical results over them. My requirement is to retrieve certain statistics for a particular field. For example, for a given field, I would like to retrieve its unique values and the document frequency of each value, along with the length of the value. The value lengths are indexed along with each document.
So far, I have experimented with Terms Aggregation, with the following query:
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"type_count": {
"terms": {
"field": "val.keyword",
"size": 100
}
}
}
}
The query returns all the values in the field val with the number of documents in which each value occurs. I would like the field val_len to be returned as well. Is it possible to achieve this using ElasticSearch? In other words, is it possible to include specific _source fields in buckets? I have looked through the documentation available online, but I haven't found a solution yet.
Hoping somebody could point me in the right direction. Thanks in advance!
I tried to include _source in the following ways:
"aggs": {
"type_count": {
"terms": {
"field": "val.keyword",
"size": 100
},
"_source":["val_len"]
}
}
and
"aggs": {
"type_count": {
"terms": {
"field": "val.keyword",
"size": 100,
"_source":["val_len"]
}
}
}
But I guess this isn't the right way, because both gave me parsing errors.
You need to use another sub-aggregation called top_hits, like this:
"aggs": {
"type_count": {
"terms": {
"field": "val.keyword",
"size": 100
},
"aggs": {
"hits": {
"top_hits": {
"_source":["val_len"],
"size": 1
}
}
}
}
}
Another way of doing it is to use an avg sub-aggregation, which also lets you sort on it:
"aggs": {
"type_count": {
"terms": {
"field": "val.keyword",
"size": 100,
"order": {
"length": "desc"
}
},
"aggs": {
"length": {
"avg": {
"field": "val_len"
}
}
}
}
}
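As a rough picture of what either variant returns per bucket, the Python sketch below computes the same three things in memory: the unique value, its document count, and its length. The documents and the helper value_stats are invented for illustration, and the sketch assumes (as the question states) that val_len is stored with every document.

```python
from collections import Counter

# Hypothetical documents; val_len is stored alongside val, as in the question.
docs = [
    {"val": "alpha", "val_len": 5},
    {"val": "beta", "val_len": 4},
    {"val": "alpha", "val_len": 5},
]


def value_stats(documents):
    """Return (value, doc_count, val_len) rows, most frequent value first."""
    counts = Counter(d["val"] for d in documents)
    # Like top_hits with size 1: take val_len from one document per value.
    lengths = {d["val"]: d["val_len"] for d in documents}
    return [(val, n, lengths[val]) for val, n in counts.most_common()]


print(value_stats(docs))  # [('alpha', 2, 5), ('beta', 1, 4)]
```

Because every document with the same val carries the same val_len, taking it from a single hit (top_hits) or averaging it (avg) yields the same number.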

ElasticSearch: search in two different ranges with different aggregations for each

This is an odd question, but I'm trying to avoid calling ES twice to obtain different data from two different time ranges.
Let's say that:
From 2016-10-01 to 2016-10-31, I want to sum the field "orders.total_sales" (just an example) and also sum "reviews.count".
From 2016-09-01 to 2016-09-30, I only want to sum "orders.total_sales".
(In reality I need about 50 sum aggregations on the first range, but only 2 for the second range.)
I know it's possible to filter by two ranges using should instead of must. But is it possible to distinguish the results from each range in order to operate on them separately (sum aggregations)?
I don't think it's possible, but I'm asking just in case someone has come across this issue before.
Thanks in advance.
You can use the filter aggregation for this purpose. You would basically write two filters, one for each range, and then add whatever sub-aggregations you want under each.
{
"size": 0,
"aggs": {
"range_one": {
"filter": {
"range": {
"your_date_field": {
"gte": "2016-01-01",
"lte": "2016-02-02"
}
}
},
"aggs": {
"sum_orders": {
"sum": {
"field": "your_sum_field1"
}
}
}
},
"range_two": {
"filter": {
"range": {
"your_date_field": {
"gte": "2016-02-01",
"lte": "2016-03-02"
}
}
},
"aggs": {
"sum_orders": {
"sum": {
"field": "your_sum_field2"
}
}
}
}
}
}
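With around 50 sums on one range, writing the body by hand gets tedious. The following Python sketch generates the same shape of request programmatically. It is an illustration under assumptions: range_sums is a hypothetical helper, "date" is an assumed date field name, and the upper bound uses lt with the first day of the next month instead of the lte used above.

```python
import json


def range_sums(date_field, gte, lt, sum_fields):
    """Build one filter aggregation with a sum sub-aggregation per field."""
    return {
        "filter": {"range": {date_field: {"gte": gte, "lt": lt}}},
        "aggs": {
            # Dots are replaced so each aggregation gets a simple name.
            f"sum_{f.replace('.', '_')}": {"sum": {"field": f}}
            for f in sum_fields
        },
    }


# Field names are taken from the question; "date" is an assumed field name.
body = {
    "size": 0,
    "aggs": {
        "october": range_sums("date", "2016-10-01", "2016-11-01",
                              ["orders.total_sales", "reviews.count"]),
        "september": range_sums("date", "2016-09-01", "2016-10-01",
                                ["orders.total_sales"]),
    },
}

print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a single _search request, and each range's sums come back under its own named bucket.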
Thank you very much! It worked, though not with filter; the idea is the same.
After some trial and error with ES errors, I ended up with something like this:
{
"timeout" : 1500,
"query" : {
"bool" : {
"must" : [
{
"term" : {
"businessId" : "101598"
}
} ,
{
"range" : {
"date" : {
"from" : "2016-10-15T03:00:00.000Z",
"to" : "2016-10-31T03:00:00.000Z",
"include_lower" : true,
"include_upper" : true
}
}
}]
}
},
"aggs": {
"range_one": {
"date_range": {
"field": "date",
"ranges": [
{
"from": "2016-10-15T03:00:00.000Z",
"to": "2016-10-22T03:00:00.000Z"
}
]
},
"aggs": {
"sum_orders_sales": {
"sum": {
"field": "orders.totalSales"
}
}
}
},
"range_two": {
"date_range": {
"field": "date",
"ranges": [
{
"from": "2016-10-23T03:00:00.000Z",
"to": "2016-10-31T03:00:00.000Z"
}
]
},
"aggs": {
"sum_orders_count": {
"sum": {
"field": "orders.orderCount"
}
}
}
}
}
}
In my case performance and speed are important. Since my two ranges are consecutive, I thought I could filter by the businessId I need and by date, from the oldest date (start of the first range) to the newest date (end of the second range). This assumes that aggregations work on the result of the query; otherwise they would search all documents, and it would be great to have the aggregations run over a result set obtained just once. I'm new to ES, so I'm not sure I'm seeing it right. However, it's working like a charm!
Thanks a lot!

Calculating sum of nested fields with date_histogram aggregation in Elasticsearch

I'm having trouble getting the sum of a nested field in Elasticsearch using a date_histogram, and I'm hoping somebody can lend me a hand.
I have a mapping that looks like this:
"client" : {
// various irrelevant stuff here...
"associated_transactions" : {
"type" : "nested",
"include_in_parent" : true,
"properties" : {
"amount" : {
"type" : "double"
},
"effective_at" : {
"type" : "date",
"format" : "dateOptionalTime"
}
}
}
}
I'm trying to get a date_histogram that shows total revenue by month across all clients, i.e. a time series showing the sum of associated_transactions.amount in a histogram determined by associated_transactions.effective_at. I tried running this query:
{
"query": {
// ...
},
"aggregations": {
"revenue": {
"date_histogram": {
"interval": "month",
"min_doc_count": 0,
"field": "associated_transactions.effective_at"
},
"aggs": {
"monthly_revenue": {
"sum": {
"field": "associated_transactions.amount"
}
}
}
}
}
}
But the sum it's giving me isn't right. It seems that what ES is doing is finding all clients who have any transaction in a given month, then summing all of the transactions (from any time) for those clients. That is, it's a sum of the amount spent in the lifetime of a client who made a purchase in a given month, not the sum of purchases in a given month.
Is there any way to get the data I'm looking for, or is this a limitation in how ES handles nested fields?
Thanks very much in advance for your help!
David
Try this?
{
"query": {
// ...
},
"aggregations": {
"revenue": {
"date_histogram": {
"interval": "month",
"min_doc_count": 0,
"field": "associated_transactions.effective_at",
"aggs": {
"monthly_revenue": {
"sum": {
"field": "associated_transactions.amount"
}
}
}
}
}
}
}
i.e. move the "aggs" key into the "date_histogram" field.
I stumbled upon this question while trying to solve a similar problem with my implementation of ES.
It seems that Elasticsearch currently looks at the position of an aggregation in the JSON request body tree, not at the inheritance of its objects and fields. So you should not put your sum aggregation "inside" "date_histogram", but place it outside, on the same JSON tree level.
This worked for me:
{
"size": 0,
"aggs": {
"histogram_aggregation": {
"date_histogram": {
"field": "date_field",
"calendar_interval": "day"
},
"aggs": {
"views": {
"sum": {
"field": "the_field_i_want_to_sum"
}
}
}
}
},
"query": {
// some query
}
}
The OP made the mistake of placing the sum aggregation inside the date_histogram aggregation.
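To make the placement rule concrete, here is a small Python sketch (not from the original answers) that builds the request body as a dictionary. The helper histogram_with_sum and the field names are placeholders; the only point is that "aggs" sits next to "date_histogram", not inside it.

```python
import json


def histogram_with_sum(date_field, sum_field, interval="day"):
    """Return an aggregation where "aggs" is a sibling of "date_histogram"."""
    return {
        "date_histogram": {"field": date_field, "calendar_interval": interval},
        # Sub-aggregations live next to the bucket aggregation's own
        # definition, on the same JSON level, not inside it.
        "aggs": {"views": {"sum": {"field": sum_field}}},
    }


agg = histogram_with_sum("date_field", "the_field_i_want_to_sum")
body = {"size": 0, "aggs": {"histogram_aggregation": agg}}

print(json.dumps(body, indent=2))
```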
