Aggregation performance issue in Elasticsearch with hotel availability data

I am building a small app to find hotel room availability, similar to booking.com, using Elasticsearch 6.8.0.
Basically, I have one document per day and room, which specifies whether the room is available and its rate for that day. I need to run a query with these requirements:
Input:
The days of the desired stay.
The maximum amount of money I am willing to spend.
The page of results I want to see.
The number of results per page.
Output:
A list of the cheapest offer per hotel that fulfills the requirements, sorted by price in ascending order.
Documents schema:
{
"mappings": {
"_doc": {
"properties": {
"room_id": {
"type": "keyword"
},
"available": {
"type": "boolean"
},
"rate": {
"type": "float"
},
"hotel_id": {
"type": "keyword"
},
"day": {
"type": "date",
"format": "yyyyMMdd"
}
}
}
}
}
I have an index per month, and at the moment I only search within the same month.
I came up with this query:
GET /hotels_201910/_search?filter_path=aggregations.hotel.buckets.min_price.value,aggregations.hotel.buckets.key
{
"size": 0,
"query": {
"bool": {
"filter": [
{
"range": {
"day": { "gte" : "20191001", "lte" : "20191010" }
}
},
{
"term": {
"available": true
}
}
]
}
},
"aggs": {
"hotel": {
"terms": {
"field": "hotel_id",
"min_doc_count": 1,
"size" : 1000000
},
"aggs": {
"room": {
"terms": {
"field": "room_id",
"min_doc_count": 10,
"size" : 1000000
},
"aggs": {
"sum_price": {
"sum": {
"field": "rate"
}
},
"max_price": {
"bucket_selector": {
"buckets_path": {
"price": "sum_price"
},
"script": "params.price <= 600"
}
}
}
},
"min_price": {
"min_bucket": {
"buckets_path": "room>sum_price"
}
},
"sort_by_min_price" : {
"bucket_sort" :{
"sort": [{"min_price" : { "order" : "asc" }}],
"from" : 0,
"size" : 20
}
}
}
}
}
}
And it works, but it has several issues.
It is too slow. With 100K daily rooms, it takes about 500 ms to return on my computer, where no other query is running, so it would perform very badly in a live system.
I need to set "size" to a large number in the terms aggregations; otherwise not all hotels and rooms are considered.
Is there a way to improve the performance of this aggregation? I have tried splitting the index into multiple shards, but it did not help.
I am almost sure that the approach is wrong, and that this is why it is slow. Any recommendations on how to achieve a faster query response time in this case?

Before getting to the answer: I didn't understand why you are using the condition/aggregation below.
"min_price": {
"min_bucket": {
"buckets_path": "room>sum_price"
}
}
Can you clarify why you need this?
Now, the answer to your main question:
Why do you want to aggregate on room_id as well as hotel_id? You can fetch all the rooms matching your search and then group them by hotel_id on the application side.
The query below returns all docs grouped by room_id, with the sum of the rates as a metric. You can reuse the bucket_selector script from your query for the price condition (params.price <= 600).
{
"size": 0,
"query": {
"bool": {
"filter": [
{
"range": {
"day": { "gte" : "20191001", "lte" : "20191010" }
}
},
{
"term": {
"available": true
}
}
]
}
},
"aggs": {
"by_room_id": {
"composite" : {
"size": 100,
"sources" : [
{
"room_id": {
"terms" : {
"field": "room_id"
}
}
}
]
},
"aggregations": {
"price_on_required_dates": {
"sum": { "field": "rate" }
},
"include_source": {
"top_hits": {
"size": 1,
"_source": true
}
},
"price_bucket_sort": {
"bucket_sort": {
"sort": [
{"price_on_required_dates": {"order": "desc"}}
]
}
}
}
}
}
}
Also, to improve search performance, see:
https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-search-speed.html
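The application-side grouping suggested above could look roughly like this Python sketch. It assumes the composite buckets have already been collected into a list with the shape the by_room_id aggregation produces (doc_count, price_on_required_dates.value, and one top hit whose _source carries hotel_id). The names cheapest_offer_per_hotel and make_bucket and the sample data are hypothetical, and the nights check is my reading of the min_doc_count: 10 condition in the question.

```python
from typing import Any, Dict, List, Tuple


def cheapest_offer_per_hotel(
    room_buckets: List[Dict[str, Any]],
    nights: int,
    max_price: float,
    page: int = 0,
    page_size: int = 20,
) -> List[Tuple[str, float]]:
    """Return one page of (hotel_id, cheapest total price), cheapest first."""
    best: Dict[str, float] = {}
    for bucket in room_buckets:
        # A room qualifies only if it has an available document for every night.
        if bucket["doc_count"] < nights:
            continue
        total = bucket["price_on_required_dates"]["value"]
        if total > max_price:
            continue
        source = bucket["include_source"]["hits"]["hits"][0]["_source"]
        hotel_id = source["hotel_id"]
        if hotel_id not in best or total < best[hotel_id]:
            best[hotel_id] = total
    # Sort by price, then by hotel_id so ties are ordered deterministically.
    ranked = sorted(best.items(), key=lambda item: (item[1], item[0]))
    start = page * page_size
    return ranked[start:start + page_size]


def make_bucket(room_id: str, hotel_id: str, days: int, total: float) -> Dict[str, Any]:
    """Build a hand-made bucket in the assumed response shape (illustration only)."""
    return {
        "key": {"room_id": room_id},
        "doc_count": days,
        "price_on_required_dates": {"value": total},
        "include_source": {"hits": {"hits": [{"_source": {"hotel_id": hotel_id}}]}},
    }


sample_buckets = [
    make_bucket("r1", "h1", 10, 500.0),
    make_bucket("r2", "h1", 10, 450.0),
    make_bucket("r3", "h2", 10, 700.0),  # over budget
    make_bucket("r4", "h2", 9, 300.0),   # not available on every night
    make_bucket("r5", "h3", 10, 400.0),
]

offers = cheapest_offer_per_hotel(sample_buckets, nights=10, max_price=600)
print(offers)
```

Since the filtering, grouping and paging all happen in the application, every composite page has to be fetched before the first result page can be returned, so whether this beats the nested terms aggregation depends on how many rooms match.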

Related

Composite aggregation query with bucket_sort does not work properly

I have an index to store financial transactions:
{
"mappings": {
"_doc": {
"properties": {
"amount": {
"type": "long"
},
"currencyCode": {
"type": "keyword"
},
"merchantId": {
"type": "keyword"
},
"merchantName": {
"type": "text"
},
"partnerId": {
"type": "keyword"
},
"transactionDate": {
"type": "date"
},
"userId": {
"type": "keyword"
}
}
}
}
}
Here's my query:
GET /transactions/_search
{
"aggs": {
"date_merchant": {
"aggs": {
"amount": {
"sum": {
"field": "amount"
}
},
"amount_sort": {
"bucket_sort": {
"sort": [
{
"amount": {
"order": "desc"
}
}
]
}
},
"top_hit": {
"top_hits": {
"_source": {
"includes": [
"merchantName",
"currencyCode"
]
},
"size": 1
}
}
},
"composite": {
"size": 1,
"sources": [
{
"date": {
"date_histogram": {
"calendar_interval": "day",
"field": "transactionDate"
}
}
},
{
"merchant": {
"terms": {
"field": "merchantId"
}
}
}
]
}
}
},
"query": {
"bool": {
"filter": [
{
"term": {
"userId": "AAA"
}
},
{
"term": {
"partnerId": "BBB"
}
},
{
"range": {
"transactionDate": {
"gte": "2022-07-01"
}
}
},
{
"term": {
"currencyCode": "EUR"
}
}
]
}
},
"size": 0
}
Please note the "size": 1 in the composite aggregation.
If I change it to 3 (based on my data)... I get different results!
That means the bucket_sort operation doesn't work on the whole list of buckets, but just on the returned ones (if it's just one, that means it's not going to be sorted at all!)
How can I sort on ALL the buckets instead?
EDIT
Based on Benjamin's answer I changed my query to use normal aggregations instead of composite, and a large bucket size for merchant IDs (default is 10, while for date histogram there's no limit)
Composite agg design
The composite aggregation is designed to iterate all buckets in the most efficient way possible.
How can I sort on ALL the buckets instead?
To fully sort over ALL buckets, all buckets would have to be enumerated ahead of time, defeating the design of the composite aggregation.
So, how to actually sort over all buckets?
Then aggregate over all buckets in a single call: set size to the largest number of buckets your query can produce.
The number of buckets is at most the cardinality of merchantId multiplied by the number of days in the date histogram.
Another option is to paginate over all the composite buckets and then sort them client side. If you choose this path, it may be good to have each page of the composite aggregation be sorted so that sorting them client side will be faster.
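A rough Python sketch of the second option (paging through the composite aggregation, then sorting client side) follows. The search argument is a hypothetical stand-in for whatever client call sends the request body and returns the parsed JSON response. fake_search and its three buckets are made up so the sketch runs without a cluster. The "after" request parameter and the "after_key" response field are the composite aggregation's own pagination mechanism.

```python
import copy
from typing import Any, Callable, Dict, List

Search = Callable[[Dict[str, Any]], Dict[str, Any]]


def collect_composite_buckets(
    search: Search, body: Dict[str, Any], agg_name: str
) -> List[Dict[str, Any]]:
    """Fetch every page of a composite aggregation by following after_key."""
    buckets: List[Dict[str, Any]] = []
    after = None
    while True:
        request = copy.deepcopy(body)  # leave the caller's body untouched
        if after is not None:
            request["aggs"][agg_name]["composite"]["after"] = after
        agg = search(request)["aggregations"][agg_name]
        if not agg["buckets"]:
            break
        buckets.extend(agg["buckets"])
        after = agg.get("after_key")
        if after is None:
            break
    return buckets


def sort_by_amount(buckets: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Sort the collected buckets by the 'amount' sum, largest first."""
    return sorted(buckets, key=lambda b: b["amount"]["value"], reverse=True)


def fake_search(request: Dict[str, Any]) -> Dict[str, Any]:
    """In-memory stand-in for a real search call (hypothetical data)."""
    comp = request["aggs"]["date_merchant"]["composite"]
    all_buckets = [
        {"key": {"date": 1, "merchant": "a"}, "amount": {"value": 10}},
        {"key": {"date": 1, "merchant": "b"}, "amount": {"value": 30}},
        {"key": {"date": 2, "merchant": "a"}, "amount": {"value": 20}},
    ]
    start = 0
    if "after" in comp:
        start = [b["key"] for b in all_buckets].index(comp["after"]) + 1
    page = all_buckets[start:start + comp["size"]]
    result: Dict[str, Any] = {"buckets": page}
    if page:
        result["after_key"] = page[-1]["key"]
    return {"aggregations": {"date_merchant": result}}


query_body = {
    "size": 0,
    "aggs": {"date_merchant": {"composite": {"size": 1, "sources": []}}},
}
all_sorted = sort_by_amount(
    collect_composite_buckets(fake_search, query_body, "date_merchant")
)
print([b["amount"]["value"] for b in all_sorted])
```

Even with a page size of 1, the final order covers all buckets because the sort runs only after the last page has been fetched.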

Translate MySQL aggregation query to ElasticSearch

I have a comments table that over the past year has grown considerably and I'm moving it to ElasticSearch.
The problem is that I need to adapt a query that I currently have in MySQL which returns the total number of comments for each day in the last 7 days for a given post.
Here's the MySQL query that I have now:
SELECT count(*) AS number, DATE(created_at) AS date
FROM `comments`
WHERE `post_id` = ?
GROUP BY `date`
ORDER BY `date` DESC
LIMIT 7
My index looks like this:
{
"mappings": {
"_doc": {
"properties": {
"id": {
"type": "keyword"
},
"post_id": {
"type": "integer"
},
"subject": {
"analyzer": "custom_html_strip",
"type": "text"
},
"body": {
"analyzer": "custom_html_strip",
"type": "text"
},
"created_at": {
"format": "yyyy-MM-dd HH:mm:ss",
"type": "date"
}
}
}
}
}
Is it possible to reproduce that query in ElasticSearch? If so, what would it look like?
My ElasticSearch knowledge is kinda limited, I know that it offers aggregation, but I don't really know how to put it all together.
Use the following query to get all the comments on a given "post_id" for the last 7 days.
{
"query": {
"bool": {
"must": [
{
"term": {
"post_id": {
"value": 123
}
}
},
/*** only include this clause if you want the most recent 7 days ***/
{
"range": {
"created_at": {
"gte": "now-7d/d",
"lt": "now/d"
}
}
}
]
}
},
"aggs": {
"posts_per_day": {
"date_histogram": {
"field": "created_at",
"calendar_interval": "day",
"order" : {"_key" : "desc"}
}
}
}
}
From the aggregation, pick up the first 7 buckets in your client application.
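Picking the first 7 buckets on the client could be as simple as the Python sketch below. It assumes the parsed search response is available as a dict and that the buckets arrive newest first, as the "order" setting in the query requests. The function name newest_daily_counts and the sample response are made up for illustration.

```python
from typing import Any, Dict, List, Tuple


def newest_daily_counts(
    response: Dict[str, Any], agg_name: str = "posts_per_day", limit: int = 7
) -> List[Tuple[str, int]]:
    """Return (day, comment count) for the first `limit` histogram buckets."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [(b["key_as_string"], b["doc_count"]) for b in buckets[:limit]]


# Hand-made response with nine daily buckets, newest first (illustration only).
sample_response = {
    "aggregations": {
        "posts_per_day": {
            "buckets": [
                {"key_as_string": f"2023-01-{day:02d} 00:00:00", "doc_count": day}
                for day in range(9, 0, -1)
            ]
        }
    }
}

daily_counts = newest_daily_counts(sample_response)
print(daily_counts)
```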
Elasticsearch supports SQL queries as well (though with limitations),
so with a small change you can use something like this:
GET _sql
{
"query": """
SELECT count(*) AS number, created_at AS date
FROM comments
WHERE post_id = 123
GROUP BY date
ORDER BY date DESC
LIMIT 7
"""
}
You can see the corresponding query using _sql/translate, which will return
{
"size" : 0,
"query" : {
"term" : {
"post_id" : {
"value" : 123,
"boost" : 1.0
}
}
},
"_source" : false,
"stored_fields" : "_none_",
"aggregations" : {
"groupby" : {
"composite" : {
"size" : 7,
"sources" : [
{
"31239" : {
"terms" : {
"field" : "created_at",
"missing_bucket" : true,
"order" : "desc"
}
}
}
]
}
}
}
}
That being said, some of the stuff used in the translated query is not needed, so this will be a better native query
{
"query": {
"term": {
"post_id": {
"value": 123
}
}
},
"aggs": {
"unq_dates": {
"terms": {
"field": "created_at",
"size": 7,
"order": {
"_key": "desc"
}
}
}
}
}
First, if you aggregate on a time field regularly, it is recommended to index a field that stores the date directly in 'yyyy-MM-dd' format; aggregating on that field will improve performance and reduce resource consumption.
{
"query":
{
"bool":
{
"filter":
[
{
"term":
{
"post_id":
{
"value": 1
}
}
}
]
}
},
"_source": false,
"aggs":
{
"created_at":
{
"auto_date_histogram":
{
"field": "created_at",
"buckets": 7,
"format": "yyyy-MM-dd"
}
}
}
}

How to define percentage of result items with specific field in Elasticsearch query?

I have a search query that returns all items matching users that have type manager or lead.
{
"from": 0,
"size": 20,
"query": {
"bool": {
"should": [
{
"terms": {
"type": ["manager", "lead"]
}
}
]
}
}
}
Is there a way to define what percentage of the results should be of type "manager"?
In other words, I want the results to have 80% of users with type manager and 20% with type lead.
I suggest using a bucket_script pipeline aggregation with buckets_path. As far as I know, this aggregation has to run as a sub-aggregation of a multi-bucket aggregation such as a date histogram. Since you have such a date field in your mapping, I think this query should work for you:
{
"size": 0,
"aggs": {
"NAME": {
"date_histogram": {
"field": "my_datetime",
"interval": "month"
},
"aggs": {
"role_type": {
"terms": {
"field": "type",
"size": 10
},
"aggs": {
"count": {
"value_count": {
"field": "_id"
}
}
}
},
"role_1_ratio": {
"bucket_script": {
"buckets_path": {
"role_1": "role_type['manager']>count",
"role_2": "role_type['lead']>count"
},
"script": "params.role_1 / (params.role_1+params.role_2)*100"
}
},
"role_2_ratio": {
"bucket_script": {
"buckets_path": {
"role_1": "role_type['manager']>count",
"role_2": "role_type['lead']>count"
},
"script": "params.role_2 / (params.role_1+params.role_2)*100"
}
}
}
}
}
}
Please let me know if it didn't work well for you.
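For reference, the arithmetic in the two bucket_script entries can be checked outside Elasticsearch with a few lines of Python. This only mirrors the formula in the scripts. The function name role_ratios and the zero-total guard are my additions, not part of the query above.

```python
from typing import Tuple


def role_ratios(manager_count: int, lead_count: int) -> Tuple[float, float]:
    """Mirror role_1 / (role_1 + role_2) * 100 for both roles."""
    total = manager_count + lead_count
    if total == 0:
        # Added guard: avoid a division by zero when a bucket has neither role.
        return (0.0, 0.0)
    return (manager_count / total * 100, lead_count / total * 100)


manager_pct, lead_pct = role_ratios(3, 1)
print(manager_pct, lead_pct)  # prints 75.0 25.0
```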

Unable to create nested date aggregation query

I am trying to create an ElasticSearch aggregation query which can generate sum or average of value in all my ingested documents.
The documents are of the format -
{
"weather":"cold",
"date_1":"2017/07/05",
"feedback":[
{
"date_2":"2017/08/07",
"value":28,
"comment":"not cold"
},{
"date_2":"2017/08/09",
"value":48,
"comment":"a bit chilly"
},{
"date_2":"2017/09/07",
"value":18,
"comment":"very cold"
}, ...
]
}
I am able to create a sum aggregation of all "feedback.value" using "date_1" by using the following request -
GET _search
{
"query": {
"query_string": {
"query": "cold"
}
},
"size": 0,
"aggs": {
"temperature": {
"date_histogram":{
"field" : "date_1",
"interval" : "month"
},
"aggs":{
"temperature_agg":{
"terms": {
"field": "feedback.value"
}
}
}
}
}
}
However, I need the same aggregation across all documents based on "feedback.date_2" instead. I am not sure whether ElasticSearch supports such an aggregation or how to approach it. Any guidance would be helpful.
[EDIT]
Mapping file (I only define the nested items; ES identifies the other fields on its own)
{
"mappings": {
"catalog_item": {
"properties": {
"feedback":{
"type":"nested",
"properties":{
"date_2":{
"type": "date",
"format":"YYYY-MM-DD"
},
"value": {
"type": "float"
},
"comment": {
"type": "text"
}
}
}
}
}
}
}
You would need to make use of nested documents and sum aggregation.
Here's a working example:
Sample Mapping:
PUT test
{
"mappings": {
"doc": {
"properties": {
"feedback": {
"type": "nested"
}
}
}
}
}
Add Sample document:
PUT test/doc/1
{
"date_1": "2017/08/07",
"feedback": [
{
"date_2": "2017/08/07",
"value": 28,
"comment": "not cold"
},
{
"date_2": "2017/08/09",
"value": 48,
"comment": "a bit chilly"
},
{
"date_2": "2017/09/07",
"value": 18,
"comment": "very cold"
}
]
}
Calculate both the sum and average based on date_2.
GET test/_search
{
"size": 0,
"aggs": {
"temperature_aggregation": {
"nested": {
"path": "feedback"
},
"aggs": {
"temperature": {
"date_histogram": {
"field": "feedback.date_2",
"interval": "month"
},
"aggs": {
"sum": {
"sum": {
"field": "feedback.value"
}
},
"avg": {
"avg": {
"field": "feedback.value"
}
}
}
}
}
}
}
}
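As a sanity check on what this aggregation should return for the sample document, the same monthly grouping can be reproduced in plain Python. This is only an illustration of the arithmetic and not a replacement for the query. The function name monthly_sum_and_avg is hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from typing import Any, Dict, List

# The feedback entries from the sample document above.
feedback = [
    {"date_2": "2017/08/07", "value": 28},
    {"date_2": "2017/08/09", "value": 48},
    {"date_2": "2017/09/07", "value": 18},
]


def monthly_sum_and_avg(entries: List[Dict[str, Any]]) -> Dict[str, Dict[str, float]]:
    """Group entries by the month of date_2 and compute sum and average."""
    by_month: Dict[str, List[float]] = defaultdict(list)
    for entry in entries:
        month = datetime.strptime(entry["date_2"], "%Y/%m/%d").strftime("%Y-%m")
        by_month[month].append(entry["value"])
    return {
        month: {"sum": sum(values), "avg": sum(values) / len(values)}
        for month, values in sorted(by_month.items())
    }


monthly = monthly_sum_and_avg(feedback)
print(monthly)
```

For the sample document this gives a sum of 76 and an average of 38 for August 2017, and 18 for both in September 2017, which is what the "sum" and "avg" sub-aggregations should report per monthly bucket.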

how do I get the latest document grouped by a field?

I have an index with many documents in this format:
{
"userId": 1234,
"locationDate": "2016-07-19T19:24:51+0000",
"location": {
"lat": -47.38163,
"lon": 26.38916
}
}
In this index I have incremental positions from the user, updated every few seconds.
I would like to execute a search that would return me the latest position (sorted by locationDate) from each user (grouped by userId)
Is this possible with Elasticsearch? The best I could do was get all the positions from the last 30 seconds, using this:
{"query":{
"filtered" : {
"query" : {
"match_all" : { }
},
"filter" : {
"range" : {
"locationDate" : {
"from" : "2016-07-19T18:54:51+0000",
"to" : null,
"include_lower" : true,
"include_upper" : true
}
}
}
}
}}
After that I sort them out by hand, but I would like to do this directly in Elasticsearch.
IMPORTANT: I am using Elasticsearch 1.5.2
Try this (with aggregations):
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"range": {
"locationDate": {
"from": "2016-07-19T18:54:51+0000",
"to": null,
"include_lower": true,
"include_upper": true
}
}
}
}
},
"aggs": {
"byUser": {
"terms": {
"field": "userId",
"size": 10
},
"aggs": {
"firstOne": {
"top_hits": {
"size": 1,
"sort": [
{
"locationDate": {
"order": "desc"
}
}
]
}
}
}
}
}
}
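Reading the result of this aggregation on the client might look like the Python sketch below. It assumes the parsed response is a dict in the usual terms/top_hits shape, with one bucket per userId and a single hit under firstOne. The function name latest_position_per_user and the sample response are made up for illustration.

```python
from typing import Any, Dict


def latest_position_per_user(response: Dict[str, Any]) -> Dict[Any, Dict[str, Any]]:
    """Map each userId bucket key to the _source of its newest hit."""
    latest: Dict[Any, Dict[str, Any]] = {}
    for bucket in response["aggregations"]["byUser"]["buckets"]:
        hits = bucket["firstOne"]["hits"]["hits"]
        if hits:
            latest[bucket["key"]] = hits[0]["_source"]
    return latest


# Hand-made response with a single user bucket (illustration only).
sample_response = {
    "aggregations": {
        "byUser": {
            "buckets": [
                {
                    "key": 1234,
                    "doc_count": 3,
                    "firstOne": {
                        "hits": {
                            "hits": [
                                {
                                    "_source": {
                                        "userId": 1234,
                                        "locationDate": "2016-07-19T19:24:51+0000",
                                        "location": {"lat": -47.38163, "lon": 26.38916},
                                    }
                                }
                            ]
                        }
                    },
                }
            ]
        }
    }
}

positions = latest_position_per_user(sample_response)
print(positions[1234]["locationDate"])
```

Note that the terms aggregation in the query uses "size": 10, so only ten users come back unless that value is raised.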

Resources