Filtering, sorting and paginating by sub-aggregations in ElasticSearch 6 - elasticsearch

I have a collection of documents, where each document indicates the available rooms for a given hotel and day, and their cost for that day:
{
  "hotel_id": 2016021519381313,
  "day": "20200530",
  "rooms": [
    {
      "room_id": "00d70230ca0142a6874358919336e53f",
      "rate": 87
    },
    {
      "room_id": "675a5ec187274a45ae7a5fdc20f72201",
      "rate": 53
    }
  ]
}
The mapping is:
{
  "properties": {
    "day": {
      "type": "keyword"
    },
    "hotel_id": {
      "type": "long"
    },
    "rooms": {
      "type": "nested",
      "properties": {
        "rate": {
          "type": "long"
        },
        "room_id": {
          "type": "keyword"
        }
      }
    }
  }
}
I am trying to figure out how to write a query that returns the available rooms for a set of days whose total cost is less than a given amount, ordered by total cost in ascending order and paginated.
So far I have come up with a way of getting the rooms available for the set of days and their total cost: basically filtering by the days and grouping by hotel and room IDs, requiring that the minimum document count in the aggregation equals the number of days I am looking for.
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "terms": {
            "day": ["20200423", "20200424", "20200425"]
          }
        }
      ]
    }
  },
  "aggs": {
    "hotel": {
      "terms": {
        "field": "hotel_id"
      },
      "aggs": {
        "rooms": {
          "nested": {
            "path": "rooms"
          },
          "aggs": {
            "rooms": {
              "terms": {
                "field": "rooms.room_id",
                "min_doc_count": 3
              },
              "aggs": {
                "sum_price": {
                  "sum": { "field": "rooms.rate" }
                }
              }
            }
          }
        }
      }
    }
  }
}
So now I am interested in ordering the result buckets in ascending order at the "hotel" level based on the value of the "rooms" sub-aggregation, and also in filtering out the buckets that do not contain enough documents or whose "sum_price" is bigger than a given budget. But I cannot figure out how to do it.
I have been taking a look at "bucket_sort", but I cannot find a way to sort based on a sub-aggregation. I have also been taking a look at "bucket_selector", but it gives me empty buckets when they do not fit the predicate. I am probably not using them correctly in my case.
What would be the right way to accomplish this?

Here is the query without pagination:
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "terms": {
            "day": [
              "20200530",
              "20200531",
              "20200601"
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "rooms": {
      "nested": {
        "path": "rooms"
      },
      "aggs": {
        "rooms": {
          "terms": {
            "field": "rooms.room_id",
            "min_doc_count": 3,
            "order": {
              "sum_price": "asc"
            }
          },
          "aggs": {
            "sum_price": {
              "sum": {
                "field": "rooms.rate"
              }
            },
            "max_price": {
              "bucket_selector": {
                "buckets_path": {
                  "var1": "sum_price"
                },
                "script": "params.var1 < 100"
              }
            }
          }
        }
      }
    }
  }
}
Please note that the following values should be adjusted to get the desired results:
day
min_doc_count
script in max_price
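To add the hotel-level ordering and pagination asked about above, one possible pattern is sketched below. It assumes that a min_bucket pipeline can traverse the nested aggregation path (rooms>rooms>sum_price) to expose the cheapest qualifying room per hotel, and then applies bucket_selector at the room level and bucket_sort at the hotel level. The aggregation names "within_budget", "min_price" and "sort_by_min_price", the terms size of 10000, and the budget of 100 are illustrative, not taken from the original question, and the query has not been tested against this mapping:
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "terms": { "day": ["20200530", "20200531", "20200601"] } }
      ]
    }
  },
  "aggs": {
    "hotel": {
      "terms": { "field": "hotel_id", "size": 10000 },
      "aggs": {
        "rooms": {
          "nested": { "path": "rooms" },
          "aggs": {
            "rooms": {
              "terms": {
                "field": "rooms.room_id",
                "min_doc_count": 3,
                "order": { "sum_price": "asc" }
              },
              "aggs": {
                "sum_price": { "sum": { "field": "rooms.rate" } },
                "within_budget": {
                  "bucket_selector": {
                    "buckets_path": { "total": "sum_price" },
                    "script": "params.total < 100"
                  }
                }
              }
            }
          }
        },
        "min_price": {
          "min_bucket": { "buckets_path": "rooms>rooms>sum_price" }
        },
        "sort_by_min_price": {
          "bucket_sort": {
            "sort": [ { "min_price": { "order": "asc" } } ],
            "from": 0,
            "size": 20
          }
        }
      }
    }
  }
}
This mirrors the min_bucket / bucket_sort combination used in the hotel-availability question further down. The same caveats apply there too: hotels whose room buckets are all pruned by the bucket_selector may still appear with a null min_price, and the hotel terms size has to be large enough to cover all hotels.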

Related

elasticsearch - can you give weight to newer documents?

If we have 10,000 documents with the same score, but we limit the search to 1,000, is there a way to give more weight to newer documents so the newer 1,000 show up?
If all the documents have the same score then the most straightforward way to go is just sorting by creation date:
https://www.elastic.co/guide/en/elasticsearch/reference/current/sort-search-results.html
Example with _score as the first criterion and the date as a tiebreaker:
GET /my-index-000001/_search
{
  "sort": [
    "_score",
    { "post_date": { "order": "desc" } }
  ],
  "query": {
    "term": { "user": "kimchy" }
  }
}
If you want to add extra score on top of the query score, you can use a distance_feature query on the creation date field.
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-distance-feature-query.html
PUT /items
{
  "mappings": {
    "properties": {
      "name": {
        "type": "keyword"
      },
      "creation_date": {
        "type": "date"
      }
    }
  }
}
PUT /items/_doc/1?refresh
{
  "name": "chocolate",
  "creation_date": "2018-02-01"
}
PUT /items/_doc/2?refresh
{
  "name": "chocolate",
  "creation_date": "2018-01-01"
}
PUT /items/_doc/3?refresh
{
  "name": "chocolate",
  "creation_date": "2017-12-01"
}
GET /items/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "name": "chocolate"
        }
      },
      "should": {
        "distance_feature": {
          "field": "creation_date",
          "pivot": "7d",
          "origin": "now"
        }
      }
    }
  }
}
origin defines the starting point from which you want to give more weight to nearby documents; in the example, the closer a document's creation_date is to "now", the more weight it gets.
pivot is the distance from the origin at which a document receives half of the boost.
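For reference, the score contribution documented for this query on date fields is boost * pivot / (pivot + distance), where distance is the difference between the origin and the field value. With the values above (the default boost of 1 and a pivot of 7d), a document created right now adds roughly 1.0 to the score, one created seven days ago adds 0.5, and older documents add progressively less.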

Aggregation performance issue in Elasticsearch with hotel availability data

I am building a small app to find hotel room availability like booking.com using Elasticsearch 6.8.0.
Basically, I have a document per day and room that specifies whether the room is available and its rate for that day. I need to run a query with these requirements:
Input:
The days of the desired staying.
The max amount of money I am willing to spend.
The page of the results I want to see.
The number of results per page.
Output:
List of the cheapest offer per hotel that fulfills the requirements, ordered by price in ascending order.
Documents schema:
{
"mappings": {
"_doc": {
"properties": {
"room_id": {
"type": "keyword"
},
"available": {
"type": "boolean"
},
"rate": {
"type": "float"
},
"hotel_id": {
"type": "keyword"
},
"day": {
"type": "date",
"format": "yyyyMMdd"
}
}
}
}
}
I have an index per month, and at the moment I only search within the same month.
I came up with this query:
GET /hotels_201910/_search?filter_path=aggregations.hotel.buckets.min_price.value,aggregations.hotel.buckets.key
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "day": { "gte": "20191001", "lte": "20191010" }
          }
        },
        {
          "term": {
            "available": true
          }
        }
      ]
    }
  },
  "aggs": {
    "hotel": {
      "terms": {
        "field": "hotel_id",
        "min_doc_count": 1,
        "size": 1000000
      },
      "aggs": {
        "room": {
          "terms": {
            "field": "room_id",
            "min_doc_count": 10,
            "size": 1000000
          },
          "aggs": {
            "sum_price": {
              "sum": {
                "field": "rate"
              }
            },
            "max_price": {
              "bucket_selector": {
                "buckets_path": {
                  "price": "sum_price"
                },
                "script": "params.price <= 600"
              }
            }
          }
        },
        "min_price": {
          "min_bucket": {
            "buckets_path": "room>sum_price"
          }
        },
        "sort_by_min_price": {
          "bucket_sort": {
            "sort": [{ "min_price": { "order": "asc" } }],
            "from": 0,
            "size": 20
          }
        }
      }
    }
  }
}
And it works, but it has several issues:
It is too slow. With 100K daily rooms it takes about 500 ms to return on my computer, with no other queries running, so in a live system it would be much worse.
I need to set "size" to a very big number in the terms aggregations, otherwise not all hotels and rooms are considered.
Is there a way to improve the performance of this aggregation? I have tried splitting the index into multiple shards, but it did not help.
I am almost sure that the approach is wrong, and that is why it is slow. Any recommendation on how to achieve a faster query response time in this case?
Before going to the answer, I didn't understand why you are using the below condition/aggregation:
"min_price": {
"min_bucket": {
"buckets_path": "room>sum_price"
}
}
Can you give me more clarification on why you need this?
Now, the answer to your main question:
Why do you want a terms aggregation on room_id as well as hotel_id? You can get all the rooms from your search and then group them by hotel_id on the application side.
The logic below will get you all the docs grouped by room_id, with the sum metric. You can use the same script filter for the 600 budget condition.
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "day": { "gte": "20191001", "lte": "20191010" }
          }
        },
        {
          "term": {
            "available": true
          }
        }
      ]
    }
  },
  "aggs": {
    "by_room_id": {
      "composite": {
        "size": 100,
        "sources": [
          {
            "room_id": {
              "terms": {
                "field": "room_id"
              }
            }
          }
        ]
      },
      "aggregations": {
        "price_on_required_dates": {
          "sum": { "field": "rate" }
        },
        "include_source": {
          "top_hits": {
            "size": 1,
            "_source": true
          }
        },
        "price_bucket_sort": {
          "bucket_sort": {
            "sort": [
              { "price_on_required_dates": { "order": "desc" } }
            ]
          }
        }
      }
    }
  }
}
Also, to improve search performance,
https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-search-speed.html

How can I filter doc_count value which is a result of a nested aggregation

How can I filter the doc_count value which is a result of a nested aggregation?
Here is my query:
"aggs": {
"CDIDs": {
"terms": {
"field": "CDID.keyword",
"size": 1000
},
"aggs": {
"my_filter": {
"filter": {
"range": {
"transactionDate": {
"gte": "now-1M/M"
}
}
}
},
"in_active": {
"bucket_selector": {
"buckets_path": {
"doc_count": "_count"
},
"script": "params.doc_count > 4"
}
}
}
}
}
The result of the query looks like:
{
  "aggregations": {
    "CDIDs": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 2386,
      "buckets": [
        {
          "key": "1234567",
          "doc_count": 5,
          "my_filter": {
            "doc_count": 4
          }
        },
        {
          "key": "12345",
          "doc_count": 5,
          "my_filter": {
            "doc_count": 5
          }
        }
      ]
    }
  }
}
I'm trying to filter on the second doc_count value here (the one inside my_filter). Let's say I want only the docs where it is > 4, so the result should contain just one bucket, the one with doc_count = 5. Can anyone help with how to do this filter? Please let me know if any additional information is required.
Take a close look at the bucket_selector aggregation. You simply need to specify the aggregation name in the buckets_path section, i.e. "doc_count": "my_filter>_count".
Pipeline aggregations (buckets_path) have their own syntax, where > acts as a separator. Refer to the documentation on buckets_path syntax for more information.
Aggregation Query
POST <your_index_name>/_search
{
  "size": 0,
  "aggs": {
    "CDIDs": {
      "terms": {
        "field": "CDID.keyword",
        "size": 1000
      },
      "aggs": {
        "my_filter": {
          "filter": {
            "range": {
              "transactionDate": {
                "gte": "now-1M/M"
              }
            }
          }
        },
        "in_active": {
          "bucket_selector": {
            "buckets_path": {
              "doc_count": "my_filter>_count"
            },
            "script": "params.doc_count > 4"
          }
        }
      }
    }
  }
}
Hope it helps!

Range Query on a score returned by match Query in Elastic Search

Suppose I have a set of documents like:
{
  "Name": "Random String 1",
  "Type": "Keyword",
  "City": "Lousiana",
  "Quantity": "10"
}
Now I want to implement a full-text search using an n-gram analyzer on the fields Name and City.
After that, I want to keep only the results whose "_score" returned by Elasticsearch is greater than 1.2 (maybe by a range query / aggregation method).
And after that, apply a terms aggregation on the property "Type" and return the top results in each bucket using the "top_hits" aggregation.
How can I do so?
I've been able to implement everything apart from the Range Query on score returned by a search query.
If you want to score the documents organically, you can use min_score in the query to filter the matched documents by score.
For the n-gram analyzer I added a whitespace tokenizer and a lowercase filter.
Mappings
PUT index1
{
  "settings": {
    "analysis": {
      "analyzer": {
        "edge_n_gram_analyzer": {
          "tokenizer": "whitespace",
          "filter": ["lowercase", "ednge_gram_filter"]
        }
      },
      "filter": {
        "ednge_gram_filter": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 10
        }
      }
    }
  },
  "mappings": {
    "document_type": {
      "properties": {
        "Name": {
          "type": "text",
          "analyzer": "edge_n_gram_analyzer"
        },
        "City": {
          "type": "text",
          "analyzer": "edge_n_gram_analyzer"
        },
        "Type": {
          "type": "keyword"
        }
      }
    }
  }
}
Index Document
POST index1/document_type
{
  "Name": "Random String 1",
  "Type": "Keyword",
  "City": "Lousiana",
  "Quantity": "10"
}
Query
POST index1/_search
{
  "min_score": 1.2,
  "size": 0,
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "Name": {
              "value": "string"
            }
          }
        },
        {
          "term": {
            "City": {
              "value": "string"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "type_terms": {
      "terms": {
        "field": "Type",
        "size": 10
      },
      "aggs": {
        "type_term_top_hits": {
          "top_hits": {
            "size": 10
          }
        }
      }
    }
  }
}
Hope this helps

filter by child frequency in ElasticSearch

I currently have parents (documents) indexed in Elasticsearch and children (comments) related to these documents.
My first objective was to search for a document with more than N comments, based on a child query. Here is how I did it:
documents/document/_search
{
  "min_score": 0,
  "query": {
    "has_child": {
      "type": "comment",
      "score_type": "sum",
      "boost": 1,
      "query": {
        "range": {
          "date": {
            "lte": 20130204,
            "gte": 20130201,
            "boost": 1
          }
        }
      }
    }
  }
}
I used the score to calculate the number of comments a document has and then filtered the documents by this amount, using "min_score".
Now, my objective is to search not just comments, but several other child documents related to the document, always based on frequency. Something like the query below:
documents/document/_search
{
  "query": {
    "match_all": {}
  },
  "filter": {
    "and": [
      {
        "query": {
          "has_child": {
            "type": "comment",
            "query": {
              "range": {
                "date": {
                  "lte": 20130204,
                  "gte": 20130201
                }
              }
            }
          }
        }
      },
      {
        "or": [
          {
            "query": {
              "has_child": {
                "type": "comment",
                "query": {
                  "match": {
                    "text": "Finally"
                  }
                }
              }
            }
          },
          {
            "query": {
              "has_child": {
                "type": "comment",
                "query": {
                  "match": {
                    "text": "several"
                  }
                }
              }
            }
          }
        ]
      }
    ]
  }
}
The query above works fine, but it doesn't filter based on frequency as the first one does. As filters are computed before scores are calculated, I cannot use min_score to filter each child query.
Any solutions to this problem?
There is no score at all associated with filters. I'd suggest moving the whole logic to the query part and using a bool query to combine the different queries together.
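A rough sketch of that suggestion, assuming the same parent/child mapping and the has_child options used in the first query; the min_score threshold and the nesting of the two "or" clauses inside a should are illustrative only:
documents/document/_search
{
  "min_score": 2,
  "query": {
    "bool": {
      "must": [
        {
          "has_child": {
            "type": "comment",
            "score_type": "sum",
            "query": {
              "range": {
                "date": {
                  "gte": 20130201,
                  "lte": 20130204
                }
              }
            }
          }
        },
        {
          "bool": {
            "should": [
              {
                "has_child": {
                  "type": "comment",
                  "score_type": "sum",
                  "query": { "match": { "text": "Finally" } }
                }
              },
              {
                "has_child": {
                  "type": "comment",
                  "score_type": "sum",
                  "query": { "match": { "text": "several" } }
                }
              }
            ]
          }
        }
      ]
    }
  }
}
Because every has_child clause contributes to the final score, min_score here filters on the combined comment frequency across all clauses rather than per clause, so the threshold and any boosts would need tuning for your data.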
