Search and aggregation on two indices - elasticsearch

Two indices are created, each holding one of the dates.
First index mapping:
PUT /index_one
{
  "mappings": {
    "properties": {
      "date_start": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss.SSSZZ||yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      }
    }
  }
}
Second index mapping:
PUT /index_two
{
  "mappings": {
    "properties": {
      "date_end": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss.SSSZZ||yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      }
    }
  }
}
I need to find documents whose dates fall within a certain range and compute the average difference between the start and end dates.
I tried a request like this:
GET /index_one,index_two/_search?scroll=1m&q=[2021-01-01+TO+2021-12-31]&filter_path=aggregations,hits.total.value,hits.hits
{
  "aggs": {
    "filtered_dates": {
      "filter": {
        "bool": {
          "must": [
            {
              "exists": {
                "field": "date_start"
              }
            },
            {
              "exists": {
                "field": "date_end"
              }
            }
          ]
        }
      },
      "aggs": {
        "avg_date": {
          "avg": {
            "script": {
              "lang": "painless",
              "source": "doc['date_end'].value.toInstant().toEpochMilli() - doc['date_start'].value.toInstant().toEpochMilli()"
            }
          }
        }
      }
    }
  }
}
I get the following response to the request:
{
  "hits": {
    "total": {
      "value": 16508
    },
    "hits": [
      {
        "_index": "index_one",
        "_type": "_doc",
        "_id": "93a34c5b-101b-45ea-9965-96a2e0446a28",
        "_score": 1.0,
        "_source": {
          "date_start": "2021-02-26 07:26:29.732+0300"
        }
      }
    ]
  },
  "aggregations": {
    "filtered_dates": {
      "meta": {},
      "doc_count": 0,
      "avg_date": {
        "value": null
      }
    }
  }
}
Can you please tell me if it is possible to make a query with search and aggregation over two indices in Elasticsearch? If so, how?

If you stored date_start on the document which contains date_end, it'd be much easier to figure out the average — check my answer to Store time related data in ElasticSearch.
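For illustration, a minimal sketch of that combined-document setup (the index name combined_index is made up; dynamic mapping will detect the dates):
POST combined_index/_doc
{ "date_start": "2021-01-01", "date_end": "2021-01-31" }

GET combined_index/_search?size=0
{
  "aggs": {
    "avg_duration_millis": {
      "avg": {
        "script": "doc['date_end'].value.toInstant().toEpochMilli() - doc['date_start'].value.toInstant().toEpochMilli()"
      }
    }
  }
}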
Now, the script context operates on one single document at a time and has "no clue" about the other, potentially related docs. So if you don't store both dates at the same time in at least one doc, you'd need to somehow connect the docs nonetheless.
One option would be to use their ids:
POST index_one/_doc
{ "id":1, "date_start": "2021-01-01" }
POST index_two/_doc
{ "id":1, "date_end": "2021-12-31" }
POST index_one/_doc/2
{ "id":2, "date_start": "2021-01-01" }
POST index_two/_doc/2
{ "id":2, "date_end": "2021-01-31" }
After that, it's possible to:
Target multiple indices — as you already do.
Group the docs by their IDs and keep only the buckets containing at least 2 docs (assuming one doc holds the start and the other the end).
Obtain the min & max dates — essentially cherry-picking the date_start and date_end to be used later down the line.
Use a bucket_script aggregation to calculate their difference (in milliseconds).
Leverage a top-level average bucket aggregation to run over all the difference buckets and ... average them.
In concrete terms:
GET /index_one,index_two/_search?scroll=1m&q=[2021-01-01+TO+2021-12-31]&filter_path=aggregations,hits.total.value,hits.hits
{
  "aggs": {
    "grouped_by_id": {
      "terms": {
        "field": "id",
        "min_doc_count": 2,
        "size": 10
      },
      "aggs": {
        "min_date": {
          "min": {
            "field": "date_start"
          }
        },
        "max_date": {
          "max": {
            "field": "date_end"
          }
        },
        "diff": {
          "bucket_script": {
            "buckets_path": {
              "min": "min_date",
              "max": "max_date"
            },
            "script": "params.max - params.min"
          }
        }
      }
    },
    "avg_duration_across_the_board": {
      "avg_bucket": {
        "buckets_path": "grouped_by_id>diff",
        "gap_policy": "skip"
      }
    }
  }
}
If everything goes right, you'll end up with:
...
"aggregations" : {
  "grouped_by_id" : {
    ...
  },
  "avg_duration_across_the_board" : {
    "value" : 1.70208E10   <-- 17,020,800,000 milliseconds ~ 4,728 hrs
  }
}
⚠️ Caveat: note that the grouped_by_id terms aggregation has an adjustable size. You'll probably need to increase it to cover more docs. But there are theoretical and practical limits as to how far it makes sense to increase it.
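If you're unsure how high to go, a quick (approximate) cardinality check on the connecting id field gives you a ballpark; this is a plain cardinality aggregation, nothing specific to this setup:
GET /index_one,index_two/_search?size=0
{
  "aggs": {
    "distinct_ids": {
      "cardinality": { "field": "id" }
    }
  }
}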
📖 Shameless plug: this was inspired in part by the chapter Aggregations & Buckets in my recently published Elasticsearch Handbook — containing lots of other real-world, non-trivial examples 🙌

Related

Composite aggregation query with bucket_sort does not work properly

I have an index to store financial transactions:
{
  "mappings": {
    "_doc": {
      "properties": {
        "amount": {
          "type": "long"
        },
        "currencyCode": {
          "type": "keyword"
        },
        "merchantId": {
          "type": "keyword"
        },
        "merchantName": {
          "type": "text"
        },
        "partnerId": {
          "type": "keyword"
        },
        "transactionDate": {
          "type": "date"
        },
        "userId": {
          "type": "keyword"
        }
      }
    }
  }
}
Here's my query:
GET /transactions/_search
{
  "aggs": {
    "date_merchant": {
      "aggs": {
        "amount": {
          "sum": {
            "field": "amount"
          }
        },
        "amount_sort": {
          "bucket_sort": {
            "sort": [
              {
                "amount": {
                  "order": "desc"
                }
              }
            ]
          }
        },
        "top_hit": {
          "top_hits": {
            "_source": {
              "includes": [
                "merchantName",
                "currencyCode"
              ]
            },
            "size": 1
          }
        }
      },
      "composite": {
        "size": 1,
        "sources": [
          {
            "date": {
              "date_histogram": {
                "calendar_interval": "day",
                "field": "transactionDate"
              }
            }
          },
          {
            "merchant": {
              "terms": {
                "field": "merchantId"
              }
            }
          }
        ]
      }
    }
  },
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "userId": "AAA"
          }
        },
        {
          "term": {
            "partnerId": "BBB"
          }
        },
        {
          "range": {
            "transactionDate": {
              "gte": "2022-07-01"
            }
          }
        },
        {
          "term": {
            "currencyCode": "EUR"
          }
        }
      ]
    }
  },
  "size": 0
}
Please note the "size": 1 in the composite aggregation.
If I change it to 3 (based on my data)... I get different results!
That means the bucket_sort operation doesn't work on the whole list of buckets, but just on the returned ones (and if only one is returned, it's not going to be sorted at all!).
How can I sort on ALL the buckets instead?
EDIT
Based on Benjamin's answer, I changed my query to use normal aggregations instead of composite, and a large bucket size for merchant IDs (the default is 10, while for a date histogram there's no limit). A sketch of the reworked query is shown below.
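Roughly (this is an illustrative shape, not my exact production query; the merchant terms size of 10000 is a placeholder to tune against the real merchantId cardinality):
GET /transactions/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "userId": "AAA" } },
        { "term": { "partnerId": "BBB" } },
        { "range": { "transactionDate": { "gte": "2022-07-01" } } },
        { "term": { "currencyCode": "EUR" } }
      ]
    }
  },
  "aggs": {
    "date": {
      "date_histogram": {
        "field": "transactionDate",
        "calendar_interval": "day"
      },
      "aggs": {
        "merchant": {
          "terms": {
            "field": "merchantId",
            "size": 10000
          },
          "aggs": {
            "amount": {
              "sum": { "field": "amount" }
            },
            "amount_sort": {
              "bucket_sort": {
                "sort": [ { "amount": { "order": "desc" } } ]
              }
            },
            "top_hit": {
              "top_hits": {
                "_source": { "includes": [ "merchantName", "currencyCode" ] },
                "size": 1
              }
            }
          }
        }
      }
    }
  }
}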
Composite agg design
The composite aggregation is designed to iterate all buckets in the most efficient way possible.
How can I sort on ALL the buckets instead?
To fully sort over ALL buckets, all buckets would have to be enumerated ahead of time, defeating the design of the composite aggregation.
So, how do you actually sort over all buckets?
Aggregate over all buckets in a single call: set your size to the largest number of buckets available within your query.
The number of buckets will be the cardinality of merchantId multiplied by the number of days in the date histogram.
Another option is to paginate over all the composite buckets and then sort them client side. If you choose this path, it may be good to have each page of the composite aggregation be sorted so that sorting them client side will be faster.
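If you take the pagination route, each subsequent page is requested by passing the previous response's after_key back in the after parameter; the key values below are placeholders:
GET /transactions/_search
{
  "size": 0,
  "aggs": {
    "date_merchant": {
      "composite": {
        "size": 1000,
        "sources": [
          { "date": { "date_histogram": { "field": "transactionDate", "calendar_interval": "day" } } },
          { "merchant": { "terms": { "field": "merchantId" } } }
        ],
        "after": { "date": 1656633600000, "merchant": "SOME-MERCHANT-ID" }
      },
      "aggs": {
        "amount": { "sum": { "field": "amount" } }
      }
    }
  }
}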

Aggregation performance issue in Elasticsearch with hotel availability data

I am building a small app to find hotel room availability, like booking.com, using Elasticsearch 6.8.0.
Basically, I have a document per day and room that specifies whether the room is available and its rate for that day. I need to run a query with these requirements:
Input:
The days of the desired staying.
The max amount of money I am willing to spend.
The page of the results I want to see.
The number of results per page.
Output:
A list of the cheapest offer per hotel that fulfills the requirements, sorted in ascending order.
Documents schema:
{
  "mappings": {
    "_doc": {
      "properties": {
        "room_id": {
          "type": "keyword"
        },
        "available": {
          "type": "boolean"
        },
        "rate": {
          "type": "float"
        },
        "hotel_id": {
          "type": "keyword"
        },
        "day": {
          "type": "date",
          "format": "yyyyMMdd"
        }
      }
    }
  }
}
I have an index per month, and at the moment I only search within the same month.
I came up with this query:
GET /hotels_201910/_search?filter_path=aggregations.hotel.buckets.min_price.value,aggregations.hotel.buckets.key
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "day": { "gte": "20191001", "lte": "20191010" }
          }
        },
        {
          "term": {
            "available": true
          }
        }
      ]
    }
  },
  "aggs": {
    "hotel": {
      "terms": {
        "field": "hotel_id",
        "min_doc_count": 1,
        "size": 1000000
      },
      "aggs": {
        "room": {
          "terms": {
            "field": "room_id",
            "min_doc_count": 10,
            "size": 1000000
          },
          "aggs": {
            "sum_price": {
              "sum": {
                "field": "rate"
              }
            },
            "max_price": {
              "bucket_selector": {
                "buckets_path": {
                  "price": "sum_price"
                },
                "script": "params.price <= 600"
              }
            }
          }
        },
        "min_price": {
          "min_bucket": {
            "buckets_path": "room>sum_price"
          }
        },
        "sort_by_min_price": {
          "bucket_sort": {
            "sort": [ { "min_price": { "order": "asc" } } ],
            "from": 0,
            "size": 20
          }
        }
      }
    }
  }
}
And it works, but it has several issues.
It is too slow. With 100K daily rooms, it takes about 500 ms to return on my computer, with no other queries running, so in a live system it would perform very badly.
I need to set the "size" to a big number in the terms aggregations, otherwise not all hotels and rooms are considered.
Is there a way to improve the performance of this aggregation? I have tried splitting the index into multiple shards, but it did not help.
I am almost sure the approach is wrong, and that is why it is slow. Any recommendation on how to achieve a faster query response time in this case?
Before getting to the answer: I didn't understand why you are using the condition/aggregation below.
"min_price": {
"min_bucket": {
"buckets_path": "room>sum_price"
}
}
Can you clarify why you need this?
Now, to answer your main question:
Why do you want a terms aggregation on room_id as well as hotel_id? You can get all the rooms from your search and then group them by hotel_id on the application side.
The query below will get you all docs grouped by room_id, along with the sum metric. You can use the same bucket_selector script for the price <= 600 condition.
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "day": { "gte": "20191001", "lte": "20191010" }
          }
        },
        {
          "term": {
            "available": true
          }
        }
      ]
    }
  },
  "aggs": {
    "by_room_id": {
      "composite": {
        "size": 100,
        "sources": [
          {
            "room_id": {
              "terms": {
                "field": "room_id"
              }
            }
          }
        ]
      },
      "aggregations": {
        "price_on_required_dates": {
          "sum": { "field": "rate" }
        },
        "include_source": {
          "top_hits": {
            "size": 1,
            "_source": true
          }
        },
        "price_bucket_sort": {
          "bucket_sort": {
            "sort": [
              { "price_on_required_dates": { "order": "desc" } }
            ]
          }
        }
      }
    }
  }
}
Also, to improve search performance, see:
https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-search-speed.html

Elasticsearch Pipelined search?

I've been using Elasticsearch for a while at my company, and it seems to have been working well so far for our searches.
We've been seeing more complex use cases from our customers that need more "ad-hoc/advanced" query capabilities and inter-document relationships (joins, in the traditional sense).
I understand that ES isn't built for joins and that denormalisation is the recommended way. We have been denormalising the documents to support every use case so far, and that in itself has become overly complex and expensive for us, as our customers have to wait a long time for such a code change to be rolled out.
We've often been criticized by our business: "Hey, your data model isn't right. It isn't suited for smarter queries." It's painful for the team to explain, every time, why denormalisation is required.
A few examples of the problems:
"Find me all the persons having the same birthdays"
"Find me all the persons travelling to the same cities within the same time frame"
Imagine every event document is a person record with their travel details.
So is there a concept of a pipeline search where I can break the search into multiple search queries and pass the output of one as an input to another?
Or is there any other recommended way to solve these types of problems without having to boil the ocean?
The two queries above can be solved with aggregations.
I'm assuming the following sample document/schema:
{
  "firstName": "John",
  "lastName": "Doe",
  "birthDate": "1998-04-02",
  "travelDate": "2019-10-31",
  "city": "London"
}
The first one can be solved by aggregating with a terms aggregation on the birthDate field (keyed by month/day) and min_doc_count: 2, e.g.:
{
  "size": 0,
  "aggs": {
    "birthdays": {
      "terms": {
        "script": "return LocalDate.parse(params._source.birthDate).format(DateTimeFormatter.ofPattern('MM/dd'))",
        "min_doc_count": 2
      },
      "aggs": {
        "persons": {
          "top_hits": {}
        }
      }
    }
  }
}
The second one by aggregating with a terms aggregation on the city field and constrained with a range query on the travelDate field for the desired time frame:
{
  "size": 0,
  "query": {
    "range": {
      "travelDate": {
        "gte": "2019-10-01",
        "lt": "2019-11-01"
      }
    }
  },
  "aggs": {
    "cities": {
      "terms": {
        "field": "city.keyword"
      },
      "aggs": {
        "persons": {
          "top_hits": {}
        }
      }
    }
  }
}
The second query can also be done with field collapsing:
{
  "_source": false,
  "query": {
    "range": {
      "travelDate": {
        "gte": "2019-10-01",
        "lt": "2019-11-01"
      }
    }
  },
  "collapse": {
    "field": "city.keyword",
    "inner_hits": {
      "name": "people"
    }
  }
}
If you need both aggregations at the same time, it is definitely possible to do so:
{
  "size": 0,
  "aggs": {
    "birthdays": {
      "terms": {
        "script": "return LocalDate.parse(params._source.birthDate).format(DateTimeFormatter.ofPattern('MM/dd'))",
        "min_doc_count": 2
      },
      "aggs": {
        "persons": {
          "top_hits": {}
        }
      }
    },
    "travels": {
      "filter": {
        "range": {
          "travelDate": {
            "gte": "2019-10-01",
            "lt": "2019-11-01"
          }
        }
      },
      "aggs": {
        "cities": {
          "terms": {
            "field": "city.keyword"
          },
          "aggs": {
            "persons": {
              "top_hits": {}
            }
          }
        }
      }
    }
  }
}

Aggregate over multiple fields without subaggregation

I have documents in my Elasticsearch index which have two fields. I want to build an aggregate over the combination of these, kind of like SQL's GROUP BY field_A, field_B, and get a row per existing combination. I read everywhere that I should use sub-aggregation for this.
{
  "aggs": {
    "sales_by_article": {
      "terms": {
        "field": "catalogs.article_grouping",
        "size": 1000000,
        "order": {
          "total_amount": "desc"
        }
      },
      "aggs": {
        "total_amount": {
          "sum": {
            "script": "Math.round(doc['amount.value'].value*100)/100.0"
          }
        },
        "sales_by_submodel": {
          "terms": {
            "field": "catalogs.submodel_grouping",
            "size": 1000,
            "order": {
              "total_amount": "desc"
            }
          },
          "aggs": {
            "total_amount": {
              "sum": {
                "script": "Math.round(doc['amount.value'].value*100)/100.0"
              }
            }
          }
        }
      }
    }
  },
  "size": 0
}
With the following simplified result:
{
  "aggregations": {
    "sales_by_article": {
      "buckets": [
        {
          "key": "19114",
          "total_amount": {
            "value": 426794.25
          },
          "sales_by_submodel": {
            "buckets": [
              {
                "key": "12",
                "total_amount": {
                  "value": 51512.200000000004
                }
              },
              ...
            ]
          }
        },
        ...
      ]
    }
  }
}
However, the problem with this is that the ordering is not what I want. In this particular case, it first orders the articles based on total_amount per article, and then within an article it orders the submodels based on total_amount per submodel. However, what I want to achieve is to only have the deepest level and get an aggregation for the combination of article and submodel, ordered by the total_amount of this combination. This is the result I would like:
{
  "aggregations": {
    "sales_by_article_and_submodel": {
      "buckets": [
        {
          "key": "1911412",
          "total_amount": {
            "value": 51512.200000000004
          }
        },
        ...
      ]
    }
  }
}
It's discussed in the docs a bit here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_multi_field_terms_aggregation
Basically, you can use a script to create a term which is derived from each document (using as many fields as you want) at query run time, but it will be slow. If you are doing it for ad hoc analysis, it'll work fine. If you need to serve these requests at some high rate, then you probably want to make a field in your model that is a combination of the two fields you're interested in, so the index is populated for you already (see the ingest-pipeline sketch after the example query).
Example query using the script approach:
GET agreements/agreement/_search?size=0
{
  "aggs": {
    "myAggregationName": {
      "terms": {
        "script": {
          "source": "doc['owningVendorCode'].value + '|' + doc['region'].value",
          "lang": "painless"
        }
      }
    }
  }
}
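If you go the precomputed-field route, one option (a sketch, not from the original answer) is an ingest pipeline that concatenates the two values at index time; the pipeline and target field names are made up:
PUT _ingest/pipeline/combine-vendor-region
{
  "processors": [
    {
      "set": {
        "field": "vendorRegion",
        "value": "{{owningVendorCode}}|{{region}}"
      }
    }
  ]
}
Index (or reindex) your documents through this pipeline, map vendorRegion as a keyword, and a plain terms aggregation on it then replaces the script.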
I have since learned that I should use a composite aggregation for this.
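A sketch of that composite version, using the same field names as above (note that composite buckets come back ordered by their keys, so ordering by total_amount across all combinations still has to be done client-side after paging):
{
  "size": 0,
  "aggs": {
    "sales_by_article_and_submodel": {
      "composite": {
        "size": 1000,
        "sources": [
          { "article": { "terms": { "field": "catalogs.article_grouping" } } },
          { "submodel": { "terms": { "field": "catalogs.submodel_grouping" } } }
        ]
      },
      "aggs": {
        "total_amount": {
          "sum": {
            "script": "Math.round(doc['amount.value'].value*100)/100.0"
          }
        }
      }
    }
  }
}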

How to count the number of objects in a nested field in Elasticsearch?

How do I count the number of objects in a nested field in Elasticsearch?
Sample mapping:
"base_keywords": {
"type": "nested",
"properties": {
"base_key": {
"type": "text"
},
"category": {
"type": "text"
},
"created_at": {
"type": "date"
},
"date": {
"type": "date"
},
"rank": {
"type": "integer"
}
}
}
I would like to count the number of objects in the nested field 'base_keywords'.
You would need to do this with an inline script. This is what worked for me (using ES 6.x):
GET your-indices/_search
{
  "aggs": {
    "whatever": {
      "sum": {
        "script": {
          "inline": "params._source.base_keywords.size()"
        }
      }
    }
  }
}
Aggs are normally good for counting and grouping; for nested documents you can use a nested agg:
"aggs": {
"MyAggregation1": {
"terms": {
"field": "FieldA",
"size": 0
},
"aggs": {
"BaseKeyWords": {
"nested": { "path": "base_keywords" },
"aggs": {
"BaseKeys": {
"terms": {
"field": "base_keywords.base_key.keyword",
"size": 0
}
}
}
}
}
}
}
You don't specify what you want to count, but aggs are quite flexible for grouping and counting data.
The "doc_count" and "key" behave similarly to an SQL GROUP BY + COUNT().
Updated (this assumes you have a .keyword sub-field for the "keys" values, since a property of type "text" can't be aggregated or counted):
{
  "aggs": {
    "MyKeywords1Agg": {
      "nested": { "path": "keywords1" },
      "aggs": {
        "NestedKeywords": {
          "terms": {
            "field": "keywords1.keys.keyword",
            "size": 0
          }
        }
      }
    }
  }
}
For simply counting the number of nested keys, you could do this:
{
  "aggs": {
    "MyKeywords1Agg": {
      "nested": { "path": "keywords1" }
    }
  }
}
If you want to get some grouping on the field values on the "main" document or the nested documents, you will have to extend your mapping / data model to include terms that are aggregatable, which includes most data types in elasticsearch except "text", ex.: dates, numbers, geolocations, keywords.
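For example, the base_key text field from the question's mapping could get a keyword sub-field via a standard multi-field mapping (shown in ES 7-style typeless syntax; the index name is made up). Existing documents then need a reindex or update before the sub-field is populated:
PUT my-index/_mapping
{
  "properties": {
    "base_keywords": {
      "type": "nested",
      "properties": {
        "base_key": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword" }
          }
        }
      }
    }
  }
}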
Edit:
An example of aggregating on a unique identifier for each top-level document, assuming it has a property called "WordMappingId" of type integer:
{
  "aggs": {
    "word_maping_agg": {
      "terms": {
        "field": "WordMappingId",
        "size": 0,
        "missing": -1
      },
      "aggs": {
        "Keywords1Agg": {
          "nested": { "path": "keywords1" }
        }
      }
    }
  }
}
If you don't add any properties to the "word_maping" document at the top level, there is no way to do an aggregation for each unique document. The built-in _id field is by default not aggregatable, and I suggest you include a unique identifier from the source data at the top level to aggregate on.
Note: the "missing" parameter will put all documents that don't have the WordMappingId property set into a bucket with the supplied value; this makes sure you're not missing any documents in the search results.
Aggs can support behaviour similar to a GROUP BY in SQL, but you need something to actually group by, and according to the mapping you supplied there are no such fields currently in your index.
I was trying to do something similar to understand the distribution of production data.
The following query helped me find the top 5:
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "n_base_keywords": {
      "nested": { "path": "base_keywords" },
      "aggs": {
        "top_count": { "terms": { "field": "_id", "size": 5 } }
      }
    }
  }
}
