Translate MySQL aggregation query to ElasticSearch

I have a comments table that over the past year has grown considerably and I'm moving it to ElasticSearch.
The problem is that I need to adapt a query that I currently have in MySQL which returns the total number of comments for each day in the last 7 days for a given post.
Here's the MySQL query that I have now:
SELECT count(*) AS number, DATE(created_at) AS date
FROM `comments`
WHERE `post_id` = ?
GROUP BY `date`
ORDER BY `date` DESC
LIMIT 7
My index looks like this:
{
"mappings": {
"_doc": {
"properties": {
"id": {
"type": "keyword"
},
"post_id": {
"type": "integer"
},
"subject": {
"analyzer": "custom_html_strip",
"type": "text"
},
"body": {
"analyzer": "custom_html_strip",
"type": "text"
},
"created_at": {
"format": "yyyy-MM-dd HH:mm:ss",
"type": "date"
}
}
}
}
}
Is it possible to reproduce that query in Elasticsearch? If so, what would it look like?
My Elasticsearch knowledge is kinda limited; I know that it offers aggregations, but I don't really know how to put it all together.

Use the following query to get the comments for a given "post_id", bucketed per day. The range clause restricts the results to the last 7 days; include it only if you want the most recent 7 days.
{
"query": {
"bool": {
"must": [
{
"term": {
"id": {
"value": "the_post_id"
}
}
},
{
"range": {
"created_at": {
"gte": "now-7d/d",
"lt": "now/d"
}
}
}
]
}
},
"aggs": {
"posts_per_day": {
"date_histogram": {
"field": "created_at",
"calendar_interval": "day",
"order" : {"_key" : "desc"}
}
}
}
}
From the aggregation, pick up the first 7 buckets in your client application.
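For reference, the relevant part of the response should look roughly like this (the dates and doc_count values are purely illustrative); if you only care about the counts, you can also add "size": 0 at the top level so no hits are returned:
{
  "aggregations": {
    "posts_per_day": {
      "buckets": [
        { "key_as_string": "2023-06-07 00:00:00", "key": 1686096000000, "doc_count": 14 },
        { "key_as_string": "2023-06-06 00:00:00", "key": 1686009600000, "doc_count": 9 }
      ]
    }
  }
}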

Elasticsearch supports SQL queries as well (though they are limited), so with a little change you can use something like this:
GET _sql
{
"query": """
SELECT count(*) AS number, created_at AS date
FROM comments
WHERE post_id = 123
GROUP BY date
ORDER BY date DESC
LIMIT 7
"""
}
You can see the corresponding query using _sql/translate, which will return
{
"size" : 0,
"query" : {
"term" : {
"post_id" : {
"value" : 123,
"boost" : 1.0
}
}
},
"_source" : false,
"stored_fields" : "_none_",
"aggregations" : {
"groupby" : {
"composite" : {
"size" : 7,
"sources" : [
{
"31239" : {
"terms" : {
"field" : "created_at",
"missing_bucket" : true,
"order" : "desc"
}
}
}
]
}
}
}
}
That being said, some of the stuff in the translated query is not needed, so this is a cleaner native query:
{
"query": {
"term": {
"post_id": {
"value": 123
}
}
},
"aggs": {
"unq_dates": {
"terms": {
"field": "created_at",
"size": 7,
"order": {
"_term": "desc"
}
}
}
}
}
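As a side note, the _sql endpoint also accepts a response format and positional parameters, so you don't have to interpolate the post id into the query string yourself. A small sketch, assuming a recent version where the SQL API supports the params field:
GET _sql?format=txt
{
  "query": """
    SELECT count(*) AS number, created_at AS date
    FROM comments
    WHERE post_id = ?
    GROUP BY date
    ORDER BY date DESC
    LIMIT 7
  """,
  "params": [123]
}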

First, if you aggregate on time fields regularly, storing the value directly in a 'yyyy-MM-dd' (day-granularity) format will improve performance and reduce resource consumption:
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "post_id": {
              "value": 1
            }
          }
        }
      ]
    }
  },
  "_source": false,
  "aggs": {
    "created_at": {
      "auto_date_histogram": {
        "field": "created_at",
        "buckets": 7,
        "format": "yyyy-MM-dd"
      }
    }
  }
}
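If you do go down that route, a minimal mapping sketch could look like the following; comments_v2 and created_day are just example names, and you would populate the day field at index time and aggregate on it directly (this assumes 7.x, where mapping types are gone; on 6.x you would nest the properties under "_doc"):
PUT comments_v2
{
  "mappings": {
    "properties": {
      "post_id":     { "type": "integer" },
      "created_at":  { "type": "date", "format": "yyyy-MM-dd HH:mm:ss" },
      "created_day": { "type": "date", "format": "yyyy-MM-dd" }
    }
  }
}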

Related

Optimize ES query with too many terms elements

We are processing a dataset of billions of records, currently all of the data are saved in ElasticSearch, and all of the queries and aggregations are performed with ElasticSearch.
The simplified query body is below. We put the device ids in terms clauses and then combine them with should to work around the limit of 1024 values per terms clause; the total number of values can reach 100,000, and the query has become very slow.
{
"_source": {
"excludes": [
"raw_msg"
]
},
"query": {
"filter": {
"bool": {
"must": [
{
"range": {
"create_ms": {
"gte": 1664985600000,
"lte": 1665071999999
}
}
}
],
"should": [
{
"terms": {
"device_id": [
"1328871",
"1328899",
"1328898",
"1328934",
"1328919",
"1328976",
"1328977",
"1328879",
"1328910",
"1328902",
... # more values; since terms does not accept more than 1024 values, we combine multiple terms clauses with should
]
}
},
{
"terms": {
"device_id": [
"1428871",
"1428899",
"1428898",
"1428934",
"1428919",
"1428976",
"1428977",
"1428879",
"1428910",
"1428902",
...
]
}
},
... # more terms clauses, until all 100,000 values are included
],
"minimum_should_match": 1
}
},
"aggs": {
"create_ms": {
"date_histogram": {
"field": "create_ms",
"interval": "hour",
}
}
},
"size": 0}
My question is: is there a way to optimize this case, or is there a better way to do this kind of search?
Real-time or near real-time is a must; another engine is acceptable.
simplified schema of the data:
"id" : {
"type" : "long"
},
"content" : {
"type" : "text"
},
"device_id" : {
"type" : "keyword"
},
"create_ms" : {
"type" : "date"
},
... # more fields
You can use a terms query with a terms lookup to specify a larger list of values. Store your ids in a dedicated document, with an id like 'device_ids', and reference it from the query:
"should": [
{
"terms": {
"device_id": {
"index": "your-index-name",
"id": "device_ids",
"path": "field-name"
}
}
}
]
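For completeness, here is a rough sketch of how the lookup document could be created and referenced; the index names, document id, and field name are placeholders. Note that a terms lookup is still subject to the index.max_terms_count setting (65,536 by default):
PUT device-id-lists/_doc/device_ids
{
  "ids": ["1328871", "1328899", "1328898"]
}

GET your-data-index/_search
{
  "query": {
    "terms": {
      "device_id": {
        "index": "device-id-lists",
        "id": "device_ids",
        "path": "ids"
      }
    }
  }
}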

Elasticsearch : How to do 'group by' with painless in scripted fields?

I would like to do something like the following using painless:
select day,sum(price)/sum(quantity) as ratio
from data
group by day
Is it possible?
I want to do this in order to visualize the ratio field in kibana, since kibana itself doesn't have the ability to divide aggregated values, but I would gladly listen to alternative solutions beyond scripted fields.
Yes, it's possible, you can achieve this with the bucket_script pipeline aggregation:
{
"aggs": {
"days": {
"date_histogram": {
"field": "dateField",
"interval": "day"
},
"aggs": {
"price": {
"sum": {
"field": "price"
}
},
"quantity": {
"sum": {
"field": "quantity"
}
},
"ratio": {
"bucket_script": {
"buckets_path": {
"sumPrice": "price",
"sumQuantity": "quantity"
},
"script": "params.sumPrice / params.sumQuantity"
}
}
}
}
}
}
UPDATE:
You can use the above query through the Transform API which will create an aggregated index out of the source index.
For instance, I've indexed a few documents in a test index, and then we can dry-run the above aggregation query in order to see what the target aggregated index would look like:
POST _transform/_preview
{
"source": {
"index": "test2",
"query": {
"match_all": {}
}
},
"dest": {
"index": "transtest"
},
"pivot": {
"group_by": {
"days": {
"date_histogram": {
"field": "#timestamp",
"calendar_interval": "day"
}
}
},
"aggregations": {
"price": {
"sum": {
"field": "price"
}
},
"quantity": {
"sum": {
"field": "quantity"
}
},
"ratio": {
"bucket_script": {
"buckets_path": {
"sumPrice": "price",
"sumQuantity": "quantity"
},
"script": "params.sumPrice / params.sumQuantity"
}
}
}
}
}
The response looks like this:
{
"preview" : [
{
"quantity" : 12.0,
"price" : 1000.0,
"days" : 1580515200000,
"ratio" : 83.33333333333333
}
],
"mappings" : {
"properties" : {
"quantity" : {
"type" : "double"
},
"price" : {
"type" : "double"
},
"days" : {
"type" : "date"
}
}
}
}
What you see in the preview array are the documents that are going to be indexed into the transtest target index, which you can then visualize in Kibana like any other index.
So what a transform actually does is run the aggregation query I gave you above and store each resulting bucket as a document in another index that you can use.
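If the preview looks right, the next step is to actually create and start the transform so the destination index gets built. Roughly like this, where daily-ratio is just an example transform id:
PUT _transform/daily-ratio
{
  "source": { "index": "test2" },
  "dest": { "index": "transtest" },
  "pivot": {
    "group_by": {
      "days": {
        "date_histogram": { "field": "#timestamp", "calendar_interval": "day" }
      }
    },
    "aggregations": {
      "price": { "sum": { "field": "price" } },
      "quantity": { "sum": { "field": "quantity" } },
      "ratio": {
        "bucket_script": {
          "buckets_path": { "sumPrice": "price", "sumQuantity": "quantity" },
          "script": "params.sumPrice / params.sumQuantity"
        }
      }
    }
  }
}

POST _transform/daily-ratio/_start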
I found a solution to get the ratio of sums with a TSVB visualization in Kibana.
First, you have to create two sum aggregations, one that sums price and another that sums quantity. Then you choose the 'Bucket Script' aggregation to divide the two sums, using a short Painless script.
The only drawback I found is that you cannot aggregate on multiple columns.

Aggregation performance issue in Elasticsearch with hotel availability data

I am building a small app to find hotel room availability like booking.com using Elasticsearch 6.8.0.
Basically, I have a document per day and room that specifies whether it is available and the rate for that day. I need to run a query with these requirements:
Input:
The days of the desired staying.
The max amount of money I am willing to spend.
The page of the results I want to see.
The number of results per page.
Output:
The cheapest offer per hotel that fulfills the requirements, sorted in ascending order.
Documents schema:
{
"mappings": {
"_doc": {
"properties": {
"room_id": {
"type": "keyword"
},
"available": {
"type": "boolean"
},
"rate": {
"type": "float"
},
"hotel_id": {
"type": "keyword"
},
"day": {
"type": "date",
"format": "yyyyMMdd"
}
}
}
}
}
I have an index per month, and at the moment I only search within the same month.
I came up with this query:
GET /hotels_201910/_search?filter_path=aggregations.hotel.buckets.min_price.value,aggregations.hotel.buckets.key
{
"size": 0,
"query": {
"bool": {
"filter": [
{
"range": {
"day": { "gte" : "20191001", "lte" : "20191010" }
}
},
{
"term": {
"available": true
}
}
]
}
},
"aggs": {
"hotel": {
"terms": {
"field": "hotel_id",
"min_doc_count": 1,
"size" : 1000000
},
"aggs": {
"room": {
"terms": {
"field": "room_id",
"min_doc_count": 10,
"size" : 1000000
},
"aggs": {
"sum_price": {
"sum": {
"field": "rate"
}
},
"max_price": {
"bucket_selector": {
"buckets_path": {
"price": "sum_price"
},
"script": "params.price <= 600"
}
}
}
},
"min_price": {
"min_bucket": {
"buckets_path": "room>sum_price"
}
},
"sort_by_min_price" : {
"bucket_sort" :{
"sort": [{"min_price" : { "order" : "asc" }}],
"from" : 0,
"size" : 20
}
}
}
}
}
}
And it works, but it has several issues.
It is too slow. With 100K daily rooms, it takes about 500 ms to return on my computer, with no other query running, so in a live system it would be very bad.
I need to set "size" to a big number in the terms aggregations, otherwise not all hotels and rooms are considered.
Is there a way to improve the performance of this aggregation? I have tried splitting the index into multiple shards, but it did not help.
I am almost sure that the approach is wrong, and that is why it is slow. Any recommendation on how to achieve a faster query response time in this case?
Before getting to the answer, I didn't understand why you are using the aggregation below:
"min_price": {
"min_bucket": {
"buckets_path": "room>sum_price"
}
}
Can you give me more clarification on why you need this?
Now, to answer your main question:
Why do you want to bucket by room_id as well as by hotel_id? You can get all the rooms from your search and then group them by hotel_id on the application side.
The query below will get you all the docs grouped by room_id, each with a sum metric. You can add the same bucket_selector script for the price <= 600 condition.
{
"size": 0,
"query": {
"bool": {
"filter": [
{
"range": {
"day": { "gte" : "20191001", "lte" : "20191010" }
}
},
{
"term": {
"available": true
}
}
]
}
},
"by_room_id": {
"composite" : {
"size": 100,
"sources" : [
{
"room_id": {
"terms" : {
"field": "room_id"
}
}
}
]
},
"aggregations": {
"price_on_required_dates": {
"sum": { "field": "rate" }
},
"include_source": {
"top_hits": {
"size": 1,
"_source": true
}
},
"price_bucket_sort": {
"bucket_sort": {
"sort": [
{"price_on_required_dates": {"order": "desc"}}
]
}
}
}
}
}
}
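One more thing to keep in mind: with "size": 100 the composite aggregation returns at most 100 room buckets per request, and you page through the rest by passing the after_key from the previous response back in an after clause. A sketch of such a follow-up request (the "room_0100" value is hypothetical and would come from after_key):
GET hotels_201910/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "range": { "day": { "gte": "20191001", "lte": "20191010" } } },
        { "term": { "available": true } }
      ]
    }
  },
  "aggs": {
    "by_room_id": {
      "composite": {
        "size": 100,
        "after": { "room_id": "room_0100" },
        "sources": [
          { "room_id": { "terms": { "field": "room_id" } } }
        ]
      },
      "aggs": {
        "price_on_required_dates": { "sum": { "field": "rate" } }
      }
    }
  }
}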
Also, to improve search performance,
https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-search-speed.html

Elasticsearch: aggregation and select docs only having max value of field

I am using Elasticsearch 6.5.
Basically, my query can return multiple documents from my index, and I need only those documents that have the max value for a particular field.
E.g.
{
"query": {
"bool": {
"must": [
{
"match": { "header.date" : "2019-07-02" }
},
{
"match": { "header.field" : "ABC" }
},
{
"bool": {
"should": [
{
"regexp": { "body.meta.field": "myregex1" }
},
{
"regexp": { "body.meta.field": "myregex2" }
}
]
}
}
]
}
},
"size" : 10000
}
The above query will return lots of documents/messages as per the query. The sample data returned is:
"header" : {
"id" : "Text_20190702101200123_111",
"date" : "2019-07-02"
"field": "ABC"
},
"body" : {
"meta" : {
"field" : "myregex1",
"timestamp": "2019-07-02T10:12:00.123Z",
}
}
-----------------
"header" : {
"id" : "Text_20190702151200123_121",
"date" : "2019-07-02"
"field": "ABC"
},
"body" : {
"meta" : {
"field" : "myregex2",
"timestamp": "2019-07-02T15:12:00.123Z",
}
}
-----------------
"header" : {
"id" : "Text_20190702081200133_124",
"date" : "2019-07-02"
"field": "ABC"
},
"body" : {
"meta" : {
"field" : "myregex1",
"timestamp": "2019-07-02T08:12:00.133Z",
}
}
So based on the above 3 documents, I only want the one with the max timestamp to be shown, i.e. "timestamp": "2019-07-02T15:12:00.123Z".
I only want one document in the above example.
I tried doing it as below:
{
"query": {
"bool": {
"must": [
{
"match": { "header.date" : "2019-07-02" }
},
{
"match": { "header.field" : "ABC" }
},
{
"bool": {
"should": [
{
"regexp": { "body.meta.field": "myregex1" }
},
{
"regexp": { "body.meta.field": "myregex2" }
}
]
}
}
]
}
},
"aggs": {
"group": {
"terms": {
"field": "header.id",
"order": { "group_docs" : "desc" }
},
"aggs" : {
"group_docs": { "max" : { "field": "body.meta.tiemstamp" } }
}
}
},
"size": "10000"
}
Executing the above, I am still getting all 3 documents instead of only one.
I do get the buckets, but I need only one of them, not all of them.
The output, in addition to all the records, includes:
"aggregations": {
"group": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Text_20190702151200123_121",
"doc_count": 29,
"group_docs": {
"value": 1564551683867,
"value_as_string": "2019-07-02T15:12:00.123Z"
}
},
{
"key": "Text_20190702101200123_111",
"doc_count": 29,
"group_docs": {
"value": 1564551633912,
"value_as_string": "2019-07-02T10:12:00.123Z"
}
},
{
"key": "Text_20190702081200133_124",
"doc_count": 29,
"group_docs": {
"value": 1564510566971,
"value_as_string": "2019-07-02T08:12:00.133Z"
}
}
]
}
}
What am I missing here?
Please note that I can have more than one message for the same timestamp, and I want all of them, i.e. all the messages/documents belonging to the max timestamp.
In the above example there are 29 messages for the same timestamp (it can be any number), so 29 * 3 messages are retrieved by my query after using the above aggregation.
Basically I am able to group correctly; I am looking for something like HAVING in SQL.

how do I get the latest document grouped by a field?

I have an index with many documents in this format:
{
"userId": 1234,
"locationDate" "2016-07-19T19:24:51+0000",
"location": {
"lat": -47.38163,
"lon": 26.38916
}
}
In this index I have incremental positions from the user, updated every few seconds.
I would like to execute a search that would return me the latest position (sorted by locationDate) from each user (grouped by userId)
Is this possible with Elasticsearch? The best I could do was get all the positions from the last 30 seconds, using this:
{"query":{
"filtered" : {
"query" : {
"match_all" : { }
},
"filter" : {
"range" : {
"locationDate" : {
"from" : "2016-07-19T18:54:51+0000",
"to" : null,
"include_lower" : true,
"include_upper" : true
}
}
}
}
}}
And then after that I sort them out by hand, but I would like to do this directly in Elasticsearch.
IMPORTANT: I am using Elasticsearch 1.5.2
Try this (with aggregations):
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"range": {
"locationDate": {
"from": "2016-07-19T18:54:51+0000",
"to": null,
"include_lower": true,
"include_upper": true
}
}
}
}
},
"aggs": {
"byUser": {
"terms": {
"field": "userId",
"size": 10
},
"aggs": {
"firstOne": {
"top_hits": {
"size": 1,
"sort": [
{
"locationDate": {
"order": "desc"
}
}
]
}
}
}
}
}
}
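The latest position per user then comes back in the aggregation part of the response, roughly in this shape (doc_count is illustrative). If you expect more than 10 users in the time window, raise the size of the byUser terms aggregation accordingly:
{
  "aggregations": {
    "byUser": {
      "buckets": [
        {
          "key": 1234,
          "doc_count": 42,
          "firstOne": {
            "hits": {
              "hits": [
                {
                  "_source": {
                    "userId": 1234,
                    "locationDate": "2016-07-19T19:24:51+0000",
                    "location": { "lat": -47.38163, "lon": 26.38916 }
                  }
                }
              ]
            }
          }
        }
      ]
    }
  }
}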
