Elasticsearch pipelined search?

I've been using Elasticsearch for a while at my company, and it has worked well for our searches so far.
Our customers now bring more complex use cases that need ad-hoc or advanced query capabilities and inter-document relationships (joins, in the traditional sense).
I understand that ES isn't built for joins and that denormalisation is the recommended approach. So far we have denormalised the documents to support every use case, but that has become overly complex and expensive: our customers have to wait a long time for each such code change to be rolled out.
The business increasingly criticizes us: "Hey, your data model isn't right. It isn't suited for smarter queries." Every time, it is painfully hard for the team to explain why denormalisation is required.
A few examples of the problems:
"Find me all the persons having the same birthdays"
"Find me all the persons travelling to the same cities within the same time frame"
Imagine every event document is a person record with their travel details.
So is there a concept of a pipeline search where I can break the search into multiple search queries and pass the output of one as an input to another?
Or is there any other recommended way to solve these types of problems without having to boil the ocean?

The two queries above can be solved with aggregations.
I'm assuming the following sample document/schema:
{
"firstName": "John",
"lastName": "Doe",
"birthDate": "1998-04-02",
"travelDate": "2019-10-31",
"city": "London"
}
The first one can be solved with a terms aggregation on the birthDate field (reduced to month and day) and min_doc_count: 2, e.g.:
{
"size": 0,
"aggs": {
"birthdays": {
"terms": {
"script": "return LocalDate.parse(params._source.birthDate).format(DateTimeFormatter.ofPattern('MM/dd'))",
"min_doc_count": 2
},
"aggs": {
"persons": {
"top_hits": {}
}
}
}
}
}
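To make the grouping logic explicit, here is a minimal client-side sketch in Python of what that aggregation computes: bucket by month/day and keep only buckets with at least two members. The sample records and the function name same_birthdays are made up for illustration; this does not call Elasticsearch.

```python
from collections import defaultdict
from datetime import date

def same_birthdays(persons, min_doc_count=2):
    """Group persons by month/day of birthDate and keep the shared birthdays."""
    buckets = defaultdict(list)
    for person in persons:
        born = date.fromisoformat(person["birthDate"])
        # Same key as the script in the query: month and day, year dropped.
        buckets[born.strftime("%m/%d")].append(person["firstName"])
    return {key: names for key, names in buckets.items()
            if len(names) >= min_doc_count}

# Hypothetical sample data, not taken from a real index.
people = [
    {"firstName": "John", "birthDate": "1998-04-02"},
    {"firstName": "Jane", "birthDate": "1985-04-02"},
    {"firstName": "Alan", "birthDate": "1990-07-15"},
]
print(same_birthdays(people))  # {'04/02': ['John', 'Jane']}
```

Alan is dropped because his bucket holds a single person, which is the effect of min_doc_count: 2.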
The second one can be solved with a terms aggregation on the city field, constrained by a range query on the travelDate field for the desired time frame:
{
"size": 0,
"query": {
"range": {
"travelDate": {
"gte": "2019-10-01",
"lt": "2019-11-01"
}
}
},
"aggs": {
"cities": {
"terms": {
"field": "city.keyword"
},
"aggs": {
"persons": {
"top_hits": {}
}
}
}
}
}
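The same idea as a small Python sketch, assuming the documents are already in memory: first apply the date range, then bucket by city. The sample records and the name travellers_by_city are hypothetical.

```python
from collections import defaultdict
from datetime import date

def travellers_by_city(persons, start, end):
    """Keep travel dates in [start, end) and group the travellers by city."""
    buckets = defaultdict(list)
    for person in persons:
        if start <= date.fromisoformat(person["travelDate"]) < end:
            buckets[person["city"]].append(person["firstName"])
    return dict(buckets)

# Hypothetical sample data, not taken from a real index.
people = [
    {"firstName": "John", "travelDate": "2019-10-31", "city": "London"},
    {"firstName": "Jane", "travelDate": "2019-10-05", "city": "London"},
    {"firstName": "Bob", "travelDate": "2019-10-10", "city": "Paris"},
    {"firstName": "Alan", "travelDate": "2019-11-02", "city": "Paris"},
]
october = travellers_by_city(people, date(2019, 10, 1), date(2019, 11, 1))
print(october)  # {'London': ['John', 'Jane'], 'Paris': ['Bob']}
```

Paris still appears with a single traveller because the query sets no min_doc_count on the cities aggregation; adding min_doc_count: 2 there would keep only cities shared by at least two people.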
The second query can also be done with field collapsing:
{
"_source": false,
"query": {
"range": {
"travelDate": {
"gte": "2019-10-01",
"lt": "2019-11-01"
}
}
},
"collapse": {
"field": "city.keyword",
"inner_hits": {
"name": "people"
}
}
}
If you need both aggregations at the same time, it is definitely possible to do so:
{
"size": 0,
"aggs": {
"birthdays": {
"terms": {
"script": "return LocalDate.parse(params._source.birthDate).format(DateTimeFormatter.ofPattern('MM/dd'))",
"min_doc_count": 2
},
"aggs": {
"persons": {
"top_hits": {}
}
}
},
"travels": {
"filter": {
"range": {
"travelDate": {
"gte": "2019-10-01",
"lt": "2019-11-01"
}
}
},
"aggs": {
"cities": {
"terms": {
"field": "city.keyword"
},
"aggs": {
"persons": {
"top_hits": {}
}
}
}
}
}
}
}

Related

Search and aggregation on two indices

I have created two indices, each with a date field.
First index mapping:
PUT /index_one
{
"mappings": {
"properties": {
"date_start": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss.SSSZZ||yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
}
}
}
}
Second index mapping:
PUT /index_two
{
"mappings": {
"properties": {
"date_end": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss.SSSZZ||yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
}
}
}
}
I need to find the dates that fall in a certain range and compute the average of the differences between the two dates.
I tried a request like this:
GET /index_one,index_two/_search?scroll=1m&q=[2021-01-01+TO+2021-12-31]&filter_path=aggregations,hits.total.value,hits.hits
{
"aggs": {
"filtered_dates": {
"filter": {
"bool": {
"must": [
{
"exists": {
"field": "date_start"
}
},
{
"exists": {
"field": "date_end"
}
}
]
}
},
"aggs": {
"avg_date": {
"avg": {
"script": {
"lang": "painless",
"source": "doc['date_end'].value.toInstant().toEpochMilli() - doc['date_begin'].value.toInstant().toEpochMilli()"
}
}
}
}
}
}
}
I get the following response to the request:
{
"hits": {
"total": {
"value": 16508
},
"hits": [
{
"_index": "index_one",
"_type": "_doc",
"_id": "93a34c5b-101b-45ea-9965-96a2e0446a28",
"_score": 1.0,
"_source": {
"date_begin": "2021-02-26 07:26:29.732+0300"
}
}
]
},
"aggregations": {
"filtered_dates": {
"meta": {},
"doc_count": 0,
"avg_date": {
"value": null
}
}
}
}
Can you please tell me if it is possible to make a query with search and aggregation over two indices in Elasticsearch? If so, how?
If you stored date_start on the document which contains date_end, it'd be much easier to figure out the average — check my answer to Store time related data in ElasticSearch.
Now, the script context operates on one single document at a time and has "no clue" about the other, potentially related docs. So if you don't store both dates at the same time in at least one doc, you'd need to somehow connect the docs nonetheless.
One option would be to use their ids:
POST index_one/_doc
{ "id":1, "date_start": "2021-01-01" }
POST index_two/_doc
{ "id":1, "date_end": "2021-12-31" }
POST index_one/_doc/2
{ "id":2, "date_start": "2021-01-01" }
POST index_two/_doc/2
{ "id":2, "date_end": "2021-01-31" }
After that, it's possible to:
Target multiple indices — as you already do.
Group the docs by their IDs and select only those groups that contain at least 2 docs (assuming the two docs represent the start & the end).
Obtain the min & max dates — essentially cherry-picking the date_start and date_end to be used later down the line.
Use a bucket_script aggregation to calculate their difference (in milliseconds).
Leverage a top-level average bucket aggregation to run over all the difference buckets and ... average them.
In concrete terms:
GET /index_one,index_two/_search?scroll=1m&q=[2021-01-01+TO+2021-12-31]&filter_path=aggregations,hits.total.value,hits.hits
{
"aggs": {
"grouped_by_id": {
"terms": {
"field": "id",
"min_doc_count": 2,
"size": 10
},
"aggs": {
"min_date": {
"min": {
"field": "date_start"
}
},
"max_date": {
"max": {
"field": "date_end"
}
},
"diff": {
"bucket_script": {
"buckets_path": {
"min": "min_date",
"max": "max_date"
},
"script": "params.max - params.min"
}
}
}
},
"avg_duration_across_the_board": {
"avg_bucket": {
"buckets_path": "grouped_by_id>diff",
"gap_policy": "skip"
}
}
}
}
If everything goes right, you'll end up with:
...
"aggregations" : {
"grouped_by_id" : {
...
},
"avg_duration_across_the_board" : {
"value" : 1.70208E10 <-- 17,020,800,000 milliseconds ~ 4,728 hrs
}
}
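The arithmetic behind that value can be checked with a short Python sketch that replays the same steps on the four sample documents: group by id, take the min start and max end, subtract, then average. Dates are interpreted as midnight UTC, and the helper names are made up for this illustration.

```python
from datetime import datetime, timezone

def to_millis(day):
    """Epoch milliseconds for a yyyy-MM-dd date at midnight UTC."""
    parsed = datetime.strptime(day, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(parsed.timestamp() * 1000)

# The four sample documents from above, merged across both indices.
docs = [
    {"id": 1, "date_start": "2021-01-01"},
    {"id": 1, "date_end": "2021-12-31"},
    {"id": 2, "date_start": "2021-01-01"},
    {"id": 2, "date_end": "2021-01-31"},
]

def average_duration_millis(docs):
    groups = {}
    for doc in docs:
        groups.setdefault(doc["id"], []).append(doc)
    diffs = []
    for members in groups.values():
        if len(members) < 2:  # mirrors min_doc_count: 2
            continue
        starts = [to_millis(d["date_start"]) for d in members if "date_start" in d]
        ends = [to_millis(d["date_end"]) for d in members if "date_end" in d]
        if starts and ends:  # groups missing either date are skipped
            diffs.append(max(ends) - min(starts))
    return sum(diffs) / len(diffs) if diffs else None

average = average_duration_millis(docs)
print(average, average / 3_600_000)  # 17020800000.0 4728.0
```

The two groups last 364 and 30 days, so the mean is 197 days, i.e. 17,020,800,000 ms or 4,728 hours.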
⚠️ Caveat: note that the grouped_by_id terms aggregation has an adjustable size. You'll probably need to increase it to cover more docs. But there are theoretical and practical limits as to how far it makes sense to increase it.
📖 Shameless plug: this was inspired in part by the chapter Aggregations & Buckets in my recently published Elasticsearch Handbook — containing lots of other real-world, non-trivial examples 🙌

Elasticsearch Pagination with timestamp range

The official Elasticsearch documentation says that pagination can be implemented with composite aggregations.
A composite aggregation has to be requested several times to fetch all the results.
So my question is: can I use a range from now-1h to now when I run a composite aggregation?
If I can, how does the composite aggregation keep the source data unchanged when every request resolves now to a different time?
If I can't: my query below returns no error and the result seems to be right.
{
"size": 0,
"query": {
"bool": {
"filter": [
{
"range": {
"timestamp": {
"gte": "now-1h"
}
}
}
]
}
},
"aggs": {
"user_device": {
"composite": {
"after": {
"user_name": "alen.lv"
},
"size": 100,
"sources": [
{
"user_name": {
"terms": {
"field": "user_name"
}
}
}
]
},
"aggs": {
"user_mac": {
"terms": {
"field": "user_mac",
"size": 1000
}
}
}
}
}
}
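Since now is resolved again on every request, one possible way to keep the window identical across pages is to compute the bounds once on the client and send absolute timestamps. The Python sketch below only builds the request bodies and loops over pages. The search argument is a hypothetical callable that sends a body and returns the parsed response, the upper bound lt is an addition not present in the query above, and whether ISO-8601 strings are accepted depends on the mapping of the timestamp field.

```python
from datetime import datetime, timedelta, timezone

def build_body(start, end, after=None, size=100):
    """Composite request whose time window is fixed by the caller."""
    composite = {
        "size": size,
        "sources": [{"user_name": {"terms": {"field": "user_name"}}}],
    }
    if after is not None:
        composite["after"] = after  # after_key returned by the previous page
    return {
        "size": 0,
        "query": {"bool": {"filter": [
            {"range": {"timestamp": {"gte": start, "lt": end}}}
        ]}},
        "aggs": {"user_device": {
            "composite": composite,
            "aggs": {"user_mac": {"terms": {"field": "user_mac", "size": 1000}}},
        }},
    }

def all_buckets(search, start, end):
    """Yield every bucket, reusing the same start/end on each request."""
    after = None
    while True:
        agg = search(build_body(start, end, after))["aggregations"]["user_device"]
        yield from agg["buckets"]
        after = agg.get("after_key")
        if after is None:  # no after_key means there are no further pages
            break

# Resolve the window once, on the client, instead of sending "now-1h".
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)
first_page = build_body(start.isoformat(), end.isoformat())
```

Every page then filters on the same interval, so documents indexed while paging is in progress cannot shift the window.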

Elasticsearch Aggregations: Only return results of one of them?

I'm trying to find a way to only return the results of one aggregation in an Elasticsearch query. I have a max bucket aggregation (the one that I want to see) that is calculated from a sum bucket aggregation based on a date histogram aggregation. Right now, I have to go through 1,440 results to get to the one I want to see. I've already removed the results of the base query with the size: 0 modifier, but is there a way to do something similar with the aggregations as well? I've tried slipping the same thing into a few places with no luck.
Here's the query:
{
"size": 0,
"query": {
"range": {
"timestamp": {
"gte": "2018-11-28",
"lte": "2018-11-28"
}
}
},
"aggs": {
"hits_per_minute": {
"date_histogram": {
"field": "timestamp",
"interval": "minute"
},
"aggs": {
"total_hits": {
"sum": {
"field": "hits_count"
}
}
}
},
"max_transactions_per_minute": {
"max_bucket": {
"buckets_path": "hits_per_minute>total_hits"
}
}
}
}
Fortunately enough, you can do that with bucket_sort aggregation, which was added in Elasticsearch 6.4.
Do it with bucket_sort
POST my_index/doc/_search
{
"size": 0,
"query": {
"range": {
"timestamp": {
"gte": "2018-11-28",
"lte": "2018-11-28"
}
}
},
"aggs": {
"hits_per_minute": {
"date_histogram": {
"field": "timestamp",
"interval": "minute"
},
"aggs": {
"total_hits": {
"sum": {
"field": "hits_count"
}
},
"max_transactions_per_minute": {
"bucket_sort": {
"sort": [
{"total_hits": {"order": "desc"}}
],
"size": 1
}
}
}
}
}
}
This will give you a response like this:
{
...
"aggregations": {
"hits_per_minute": {
"buckets": [
{
"key_as_string": "2018-11-28T21:10:00.000Z",
"key": 1543957800000,
"doc_count": 3,
"total_hits": {
"value": 11
}
}
]
}
}
}
Note that there is no extra aggregation in the output and the output of hits_per_minute is truncated (because we asked to give exactly one, topmost bucket).
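As a rough illustration of what bucket_sort does here, the following Python sketch sorts a list of histogram buckets by their total_hits value in descending order and keeps the first one. The bucket values are invented sample data and top_buckets is a hypothetical helper, not an Elasticsearch API.

```python
def top_buckets(buckets, metric="total_hits", size=1):
    """Sort buckets by a metric value, descending, and truncate to `size`."""
    ordered = sorted(buckets, key=lambda b: b[metric]["value"], reverse=True)
    return ordered[:size]

# Hypothetical per-minute buckets, shaped like a date_histogram response.
buckets = [
    {"key_as_string": "2018-11-28T21:09:00.000Z", "total_hits": {"value": 4}},
    {"key_as_string": "2018-11-28T21:10:00.000Z", "total_hits": {"value": 11}},
    {"key_as_string": "2018-11-28T21:11:00.000Z", "total_hits": {"value": 7}},
]
print(top_buckets(buckets))
```

Doing this inside the aggregation rather than on the client means only one bucket travels over the wire instead of all 1,440.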
Do it with filter_path
There is also a generic way to filter the output of Elasticsearch: Response filtering, as this answer suggests.
In this case it will be enough to just do the following query:
POST my_index/doc/_search?filter_path=aggregations.max_transactions_per_minute
{ ... (original query) ... }
That would give the response:
{
"aggregations": {
"max_transactions_per_minute": {
"value": 11,
"keys": [
"2018-12-04T21:10:00.000Z"
]
}
}
}

Elasticsearch derivative of a deep metric

I have a web crawler that collects data and stores snapshots several times a day. My query has some aggregations that group the snapshots together per day and return the last snapshot of each day using top_hits.
The documents look like this:
"_source": {
"taken_at": "2016-02-01T11:27:09.184-03:00",
... ,
"my_metric": 113
}
I'd like to be able to calculate the derivative of a certain metric, say my_metric, of the documents returned by top_hits (i.e., the derivative of the last snapshots of each day's my_metric).
Here's what I have so far:
{
"aggs": {
"filtered_snapshots": {
"filter": {
// ...
},
"aggs" : {
"grouped_data": {
"date_histogram": {
"field": "taken_at",
"interval": "day",
"format": "YYYY-MM-dd",
"order": { "_key" : "asc" }
},
"aggs": {
"resource_by_date": {
"terms": { "field": "remote_id" },
"aggs": {
"latest_snapshots": {
"top_hits": {
"sort": { "taken_at": { "order": "asc" }},
"size" : 1
}
}
}
},
"my_metric_deriv": {
"derivative": {
"buckets_path": "resource_by_date>latest_snapshots>my_metric"
}
}
}
}
}
}
}
}
I get a "No aggregation [my_metric] found for path ..." error with the query above.
Am I using a wrong buckets_path? I've read through the buckets_path and the derivative documentation and haven't found much that could help.
The documentation mentions briefly "deep metrics", stating that they can be limited in some ways, which I couldn't quite understand. I'm not sure how or if the limitations affect my case.
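For reference, the computation being asked for (last snapshot of each day, then the difference between consecutive days) can be sketched client-side in Python. This is not an Elasticsearch query. It assumes a single resource (no grouping by remote_id) and takes the day from the timestamp's own offset. The sample values are made up.

```python
from datetime import datetime

def daily_derivative(snapshots):
    """Difference of my_metric between the last snapshots of consecutive days."""
    last_per_day = {}
    ordered = sorted(snapshots, key=lambda s: datetime.fromisoformat(s["taken_at"]))
    for snap in ordered:
        day = snap["taken_at"][:10]
        last_per_day[day] = snap["my_metric"]  # later snapshots overwrite earlier ones
    days = sorted(last_per_day)
    return {cur: last_per_day[cur] - last_per_day[prev]
            for prev, cur in zip(days, days[1:])}

# Hypothetical snapshots for one resource.
snapshots = [
    {"taken_at": "2016-02-01T08:00:00.000-03:00", "my_metric": 100},
    {"taken_at": "2016-02-01T11:27:09.184-03:00", "my_metric": 113},
    {"taken_at": "2016-02-02T10:00:00.000-03:00", "my_metric": 120},
    {"taken_at": "2016-02-03T09:30:00.000-03:00", "my_metric": 118},
]
print(daily_derivative(snapshots))  # {'2016-02-02': 7, '2016-02-03': -2}
```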

Filter/Query support in Elasticsearch Top hits Aggregation

The Elasticsearch documentation states: "The top_hits aggregation returns regular search hits, because of this many per hit features can be supported." Crucially, the list includes "Named filters and queries".
But trying to add any filter or query throws SearchParseException: Unknown key for a START_OBJECT
Use case: I have items, each with a list of nested comments:
items{id} -> comments {date, rating}
I want to get the top-rated comment from the last week for each item.
{
"query": {
"match_all": {}
},
"aggs": {
"items": {
"terms": {
"field": "id",
"size": 10
},
"aggs": {
"comment": {
"nested": {
"path": "comments"
},
"aggs": {
"top_comment": {
"top_hits": {
"size": 1,
//need filter here to select only comments of last week
"sort": {
"comments.rating": {
"order": "desc"
}
}
}
}
}
}
}
}
}
}
So is the documentation wrong, or is there any way to add a filter?
https://www.elastic.co/guide/en/elasticsearch/reference/2.1/search-aggregations-metrics-top-hits-aggregation.html
Are you sure you have mapped them as nested? I've just run such a query on my own data and it worked fine.
If so, you could simply add a filter aggregation, right after nested aggregation (hopefully I haven't messed up curly brackets):
POST data/_search
{
"size": 0,
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"nested": {
"path": "comments",
"query": {
"range": {
"comments.date": {
"gte": "now-1w",
"lte": "now"
}
}
}
}
}
}
},
"aggs": {
"items": {
"terms": {
"field": "id",
"size": 10
},
"aggs": {
"nested": {
"nested": {
"path": "comments"
},
"aggs": {
"filterComments": {
"filter": {
"range": {
"comments.date": {
"gte": "now-1w",
"lte": "now"
}
}
},
"aggs": {
"topComments": {
"top_hits": {
"size": 1,
"sort": {
"comments.rating": "desc"
}
}
}
}
}
}
}
}
}
}
}
P.S. Always include FULL path for nested objects.
So this query will:
Filter the documents down to those that actually have comments newer than one week, which narrows the set for aggregation (filtered query)
Run a terms aggregation on the id field
Open the nested sub-documents (comments)
Filter them by date
Return the most badass one (highest rated)
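The steps above can be mirrored in a few lines of Python on in-memory data, which may help when checking the query's results by hand. The items, the fixed now value and the name top_recent_comment are assumptions made for this sketch.

```python
from datetime import datetime, timedelta

def top_recent_comment(items, now, window=timedelta(weeks=1)):
    """Per item, return the highest-rated comment dated within the window."""
    result = {}
    for item in items:
        recent = [c for c in item["comments"]
                  if now - window <= datetime.fromisoformat(c["date"]) <= now]
        if recent:  # items without recent comments are dropped entirely
            result[item["id"]] = max(recent, key=lambda c: c["rating"])
    return result

# Hypothetical items with nested comments.
items = [
    {"id": 1, "comments": [
        {"date": "2016-01-14", "rating": 3},
        {"date": "2016-01-10", "rating": 5},
        {"date": "2015-12-01", "rating": 9},
    ]},
    {"id": 2, "comments": [{"date": "2015-11-20", "rating": 4}]},
]
now = datetime(2016, 1, 15, 12, 0)
print(top_recent_comment(items, now))  # {1: {'date': '2016-01-10', 'rating': 5}}
```

The comment rated 9 is ignored because it is older than a week, and item 2 disappears because none of its comments fall inside the window. Those two effects correspond to the filter aggregation and the nested filter in the query, respectively.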
