Elasticsearch: How to do 'group by' with painless in scripted fields?

I would like to do something like the following using painless:
select day,sum(price)/sum(quantity) as ratio
from data
group by day
Is it possible?
I want to do this in order to visualize the ratio field in Kibana, since Kibana itself doesn't have the ability to divide aggregated values, but I would gladly consider alternative solutions beyond scripted fields.

Yes, it's possible. You can achieve this with the bucket_script pipeline aggregation:
{
  "aggs": {
    "days": {
      "date_histogram": {
        "field": "dateField",
        "interval": "day"
      },
      "aggs": {
        "price": {
          "sum": {
            "field": "price"
          }
        },
        "quantity": {
          "sum": {
            "field": "quantity"
          }
        },
        "ratio": {
          "bucket_script": {
            "buckets_path": {
              "sumPrice": "price",
              "sumQuantity": "quantity"
            },
            "script": "params.sumPrice / params.sumQuantity"
          }
        }
      }
    }
  }
}
UPDATE:
You can use the above query through the Transform API, which will create an aggregated index out of the source index.
For instance, I've indexed a few documents in a test index; we can then dry-run the above aggregation query to see what the target aggregated index would look like:
POST _transform/_preview
{
  "source": {
    "index": "test2",
    "query": {
      "match_all": {}
    }
  },
  "dest": {
    "index": "transtest"
  },
  "pivot": {
    "group_by": {
      "days": {
        "date_histogram": {
          "field": "@timestamp",
          "calendar_interval": "day"
        }
      }
    },
    "aggregations": {
      "price": {
        "sum": {
          "field": "price"
        }
      },
      "quantity": {
        "sum": {
          "field": "quantity"
        }
      },
      "ratio": {
        "bucket_script": {
          "buckets_path": {
            "sumPrice": "price",
            "sumQuantity": "quantity"
          },
          "script": "params.sumPrice / params.sumQuantity"
        }
      }
    }
  }
}
The response looks like this:
{
  "preview" : [
    {
      "quantity" : 12.0,
      "price" : 1000.0,
      "days" : 1580515200000,
      "ratio" : 83.33333333333333
    }
  ],
  "mappings" : {
    "properties" : {
      "quantity" : {
        "type" : "double"
      },
      "price" : {
        "type" : "double"
      },
      "days" : {
        "type" : "date"
      }
    }
  }
}
What you see in the preview array are the documents that will be indexed into the transtest target index, which you can then visualize in Kibana like any other index.
So what a transform actually does is run the aggregation query given above and store each resulting bucket as a document in another index.
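To materialize that index, you create the transform with the same body and then start it. A minimal sketch, assuming ES 7.5+ (earlier versions use the _data_frame/transforms endpoint); the transform id my-ratio-transform is hypothetical:
PUT _transform/my-ratio-transform
{
  "source": {
    "index": "test2"
  },
  "dest": {
    "index": "transtest"
  },
  "pivot": {
    "group_by": {
      "days": {
        "date_histogram": {
          "field": "@timestamp",
          "calendar_interval": "day"
        }
      }
    },
    "aggregations": {
      "price": { "sum": { "field": "price" } },
      "quantity": { "sum": { "field": "quantity" } },
      "ratio": {
        "bucket_script": {
          "buckets_path": {
            "sumPrice": "price",
            "sumQuantity": "quantity"
          },
          "script": "params.sumPrice / params.sumQuantity"
        }
      }
    }
  }
}

POST _transform/my-ratio-transform/_start
Without a sync section this runs as a one-off batch transform; add one if you want it to keep the target index up to date continuously.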

I found a solution to get the ratio of sums with the TSVB visualization in Kibana.
First, you have to create two sum aggregations, one that sums price and another that sums quantity. Then you choose the 'Bucket Script' aggregation to divide the aforementioned sums, with the use of a Painless script.
The only drawback I found is that you cannot aggregate on multiple columns.

Related

Translate MySQL aggregation query to ElasticSearch

I have a comments table that over the past year has grown considerably and I'm moving it to ElasticSearch.
The problem is that I need to adapt a query that I currently have in MySQL which returns the total number of comments for each day in the last 7 days for a given post.
Here's the MySQL query that I have now:
SELECT count(*) AS number, DATE(created_at) AS date
FROM `comments`
WHERE `post_id` = ?
GROUP BY `date`
ORDER BY `date` DESC
LIMIT 7
My index looks like this:
{
  "mappings": {
    "_doc": {
      "properties": {
        "id": {
          "type": "keyword"
        },
        "post_id": {
          "type": "integer"
        },
        "subject": {
          "analyzer": "custom_html_strip",
          "type": "text"
        },
        "body": {
          "analyzer": "custom_html_strip",
          "type": "text"
        },
        "created_at": {
          "format": "yyyy-MM-dd HH:mm:ss",
          "type": "date"
        }
      }
    }
  }
}
Is it possible to reproduce that query in Elasticsearch? If so, what would it look like?
My Elasticsearch knowledge is kinda limited; I know it offers aggregations, but I don't really know how to put it all together.
Use the following query to get all the comments on a given "post_id" for the last 7 days.
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "post_id": {
              "value": "the_post_id"
            }
          }
        },
        /*** only include this clause if you want the most recent 7 days ***/
        {
          "range": {
            "created_at": {
              "gte": "now-7d/d",
              "lt": "now/d"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "posts_per_day": {
      "date_histogram": {
        "field": "created_at",
        "calendar_interval": "day",
        "order": { "_key": "desc" }
      }
    }
  }
}
From the aggregation, pick up the first 7 buckets in your client application.
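Alternatively, you can truncate server-side: a bucket_sort sub-aggregation with only size set keeps the parent's ordering and returns just the first 7 buckets. A minimal sketch of the aggs section under the same query (the last_7_days name is hypothetical):
"aggs": {
  "posts_per_day": {
    "date_histogram": {
      "field": "created_at",
      "calendar_interval": "day",
      "order": { "_key": "desc" }
    },
    "aggs": {
      "last_7_days": {
        "bucket_sort": {
          "size": 7
        }
      }
    }
  }
}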
Elasticsearch supports SQL queries as well (though they are limited), so with a little change you can use something like this:
GET _sql
{
  "query": """
    SELECT count(*) AS number, created_at AS date
    FROM comments
    WHERE post_id = 123
    GROUP BY date
    ORDER BY date DESC
    LIMIT 7
  """
}
You can see the corresponding native query by passing the same SQL to the _sql/translate endpoint. A sketch of the request:
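POST _sql/translate
{
  "query": """
    SELECT count(*) AS number, created_at AS date
    FROM comments
    WHERE post_id = 123
    GROUP BY date
    ORDER BY date DESC
    LIMIT 7
  """
}

which will return: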
{
  "size" : 0,
  "query" : {
    "term" : {
      "post_id" : {
        "value" : 123,
        "boost" : 1.0
      }
    }
  },
  "_source" : false,
  "stored_fields" : "_none_",
  "aggregations" : {
    "groupby" : {
      "composite" : {
        "size" : 7,
        "sources" : [
          {
            "31239" : {
              "terms" : {
                "field" : "created_at",
                "missing_bucket" : true,
                "order" : "desc"
              }
            }
          }
        ]
      }
    }
  }
}
That being said, some of the constructs in the translated query are not needed, so this is a leaner native query:
{
  "query": {
    "term": {
      "post_id": {
        "value": 123
      }
    }
  },
  "aggs": {
    "unq_dates": {
      "terms": {
        "field": "created_at",
        "size": 7,
        "order": {
          "_key": "desc"
        }
      }
    }
  }
}
First, since you regularly aggregate on time, it is recommended to store the date directly in a day-granularity 'yyyy-MM-dd' format; this will improve performance and reduce resource consumption:
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "post_id": {
              "value": 1
            }
          }
        }
      ]
    }
  },
  "_source": false,
  "aggs": {
    "created_at": {
      "auto_date_histogram": {
        "field": "created_at",
        "buckets": 7,
        "format": "yyyy-MM-dd"
      }
    }
  }
}
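For the day-granularity storage suggested above, a minimal mapping sketch (assuming ES 7+ typeless mappings; the created_day field name is hypothetical):
PUT comments
{
  "mappings": {
    "properties": {
      "created_day": {
        "type": "date",
        "format": "yyyy-MM-dd"
      }
    }
  }
}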

Aggregation performance issue in Elasticsearch with hotel availability data

I am building a small app to find hotel room availability like booking.com using Elasticsearch 6.8.0.
Basically, I have one document per day and room that specifies whether the room is available and its rate for that day. I need to run a query with these requirements:
Input:
The days of the desired stay.
The max amount of money I am willing to spend.
The page of the results I want to see.
The number of results per page.
Output:
A list with the cheapest offer per hotel that fulfills the requirements, sorted in ascending order.
Document schema:
{
  "mappings": {
    "_doc": {
      "properties": {
        "room_id": {
          "type": "keyword"
        },
        "available": {
          "type": "boolean"
        },
        "rate": {
          "type": "float"
        },
        "hotel_id": {
          "type": "keyword"
        },
        "day": {
          "type": "date",
          "format": "yyyyMMdd"
        }
      }
    }
  }
}
I have an index per month, and at the moment I only search within the same month.
I came up with this query:
GET /hotels_201910/_search?filter_path=aggregations.hotel.buckets.min_price.value,aggregations.hotel.buckets.key
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "day": { "gte": "20191001", "lte": "20191010" }
          }
        },
        {
          "term": {
            "available": true
          }
        }
      ]
    }
  },
  "aggs": {
    "hotel": {
      "terms": {
        "field": "hotel_id",
        "min_doc_count": 1,
        "size": 1000000
      },
      "aggs": {
        "room": {
          "terms": {
            "field": "room_id",
            "min_doc_count": 10,
            "size": 1000000
          },
          "aggs": {
            "sum_price": {
              "sum": {
                "field": "rate"
              }
            },
            "max_price": {
              "bucket_selector": {
                "buckets_path": {
                  "price": "sum_price"
                },
                "script": "params.price <= 600"
              }
            }
          }
        },
        "min_price": {
          "min_bucket": {
            "buckets_path": "room>sum_price"
          }
        },
        "sort_by_min_price": {
          "bucket_sort": {
            "sort": [{ "min_price": { "order": "asc" } }],
            "from": 0,
            "size": 20
          }
        }
      }
    }
  }
}
It works, but it has several issues:
It is too slow: with 100K daily rooms, it takes about 500 ms on my computer with no other query running, so in a live system it would be very bad.
I need to set "size" to a big number in the terms aggregations, otherwise not all hotels and rooms are considered.
Is there a way to improve the performance of this aggregation? I have tried splitting the index into multiple shards, but it did not help.
I am almost sure that the approach is wrong, and that is why it is slow. Any recommendation on how to achieve a faster query response time in this case?
Before going to the answer, I didn't understand why you are using the below condition/aggregation:
"min_price": {
"min_bucket": {
"buckets_path": "room>sum_price"
}
}
Can you give me more clarification on why you need this?
Now, to answer your main question: why do you want a terms bucket on room_id as well as hotel_id? You can get all the rooms of your search and then group them by hotel_id on the application side.
The logic below will get you all the docs grouped by room_id, along with the sum metric. You can use the same bucket_selector script for the price <= 600 condition.
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "day": { "gte": "20191001", "lte": "20191010" }
          }
        },
        {
          "term": {
            "available": true
          }
        }
      ]
    }
  },
  "aggs": {
    "by_room_id": {
      "composite": {
        "size": 100,
        "sources": [
          {
            "room_id": {
              "terms": {
                "field": "room_id"
              }
            }
          }
        ]
      },
      "aggregations": {
        "price_on_required_dates": {
          "sum": { "field": "rate" }
        },
        "include_source": {
          "top_hits": {
            "size": 1,
            "_source": true
          }
        },
        "price_bucket_sort": {
          "bucket_sort": {
            "sort": [
              { "price_on_required_dates": { "order": "desc" } }
            ]
          }
        }
      }
    }
  }
}
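The composite aggregation returns pages of (here) 100 room buckets; to walk through all rooms you pass the after_key from one response as "after" in the next request. A minimal sketch of the follow-up aggregation (the key value room-123 is hypothetical):
"by_room_id": {
  "composite": {
    "size": 100,
    "after": { "room_id": "room-123" },
    "sources": [
      {
        "room_id": {
          "terms": { "field": "room_id" }
        }
      }
    ]
  }
}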
Also, to improve search performance, see https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-search-speed.html

Elastic Search - Pagination on Aggregations

I have an index and I run an aggregation on it. Instead of returning the whole aggregation at once, I want it returned in chunks, that is, in small blocks. Is it possible to do so in Elasticsearch?
Try using the bucket_sort aggregation:
POST /sales/_search
{
  "size": 0,
  "aggs": {
    "sales_per_month": {
      "date_histogram": {
        "field": "date",
        "interval": "month"
      },
      "aggs": {
        "total_sales": {
          "sum": {
            "field": "price"
          }
        },
        "sales_bucket_sort": {
          "bucket_sort": {
            "sort": [
              { "total_sales": { "order": "desc" } }
            ],
            "size": 3,
            "from": 10
          }
        }
      }
    }
  }
}
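Note that bucket_sort only pages within the buckets computed for a single response (the parent histogram still builds them all). If you need true pagination over all buckets, a composite aggregation is an alternative; a minimal sketch on the same sales data, assuming ES 7.2+ for calendar_interval:
POST /sales/_search
{
  "size": 0,
  "aggs": {
    "sales_per_month": {
      "composite": {
        "size": 3,
        "sources": [
          {
            "month": {
              "date_histogram": {
                "field": "date",
                "calendar_interval": "month"
              }
            }
          }
        ]
      },
      "aggs": {
        "total_sales": {
          "sum": { "field": "price" }
        }
      }
    }
  }
}
Each response contains an after_key; pass it as "after" in the next request to fetch the following chunk.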

Faceted search in webshop with Elastic

I've seen a few examples of faceted search in Elastic, but all of them know in advance which fields to create buckets on.
How should I work when I have a webshop with multiple categories, where the products have different properties in every category?
Is there a way to describe what properties your documents have after you run a query (e.g. filter by category)?
I have this query right now:
{
  "from": 0, "size": 10,
  "query": {
    "bool": {
      "must": [
        { "terms": { "color": ["red", "green", "purple"] } },
        { "terms": { "make": ["honda", "toyota", "bmw"] } }
      ]
    }
  },
  "aggregations": {
    "all_cars": {
      "global": {},
      "aggs": {
        "colors": {
          "filter": { "terms": { "make": ["honda", "toyota", "bmw"] } },
          "aggregations": {
            "filtered_colors": { "terms": { "field": "color.keyword" } }
          }
        },
        "makes": {
          "filter": { "terms": { "color": ["red", "green"] } },
          "aggregations": {
            "filtered_makes": { "terms": { "field": "make.keyword" } }
          }
        }
      }
    }
  }
}
How can I know which fields I can aggregate on? Is there a way to describe the properties of a document after running a query, so I can know what the possible fields to aggregate on are?
Right now I am storing all properties of my article in an array and I can quickly aggregate them like this:
{
"size": 0,
"aggregations": {
"array_aggregation": {
"terms": {
"field": "properties.keyword",
"size": 10
}
}
}
}
This is a step in the right direction, but that way I don't know what the type of a property is.
Here's a sample object:
"price": 10000,
"color": "red",
"make": "honda",
"sold": "2014-10-28",
"properties": [
"price",
"color",
"make",
"sold"
]
You can use the filter aggregation to narrow down the documents and then nest a terms aggregation inside it.
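A minimal sketch of that idea, reusing the fields from the question (the bucket names red_cars and makes are hypothetical):
{
  "size": 0,
  "aggs": {
    "red_cars": {
      "filter": { "term": { "color.keyword": "red" } },
      "aggs": {
        "makes": { "terms": { "field": "make.keyword" } }
      }
    }
  }
}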

sub field aggregation group by order by in elasticsearch

I am unable to find the correct syntax to get an aggregation of a sub object ordered by a count field.
A good example of this is a twitter document:
{
  "properties": {
    "id": {
      "type": "long"
    },
    "message": {
      "type": "string"
    },
    "user": {
      "type": "object",
      "properties": {
        "id": {
          "type": "long"
        },
        "screenName": {
          "type": "string"
        },
        "followers": {
          "type": "long"
        }
      }
    }
  }
}
How would I go about getting the Top Influencers for a given set of tweets? This would be a unique list of the top 10 "user" objects ordered by the "user.followers" field.
I have tried using top_hits but get an exception:
org.elasticsearch.common.breaker.CircuitBreakingException: [FIELDDATA]
Data too large, data for [user.id]
"aggs": {
"top-influencers": {
"terms": {
"field": "user.id",
"order": {
"top_hit": "desc"
}
},
"aggs": {
"top_tags_hits": {
"top_hits": {}
},
"top_hit": {
"max": {
"field": "user.followers"
}
}
}
}
}
I can get almost what I want using the "sort" field on the query (no aggregation); however, if a user has multiple tweets they will appear multiple times in the result. I need to be able to group by the sub-object "user" and only return each user once.
---UPDATE---
I have managed to get a list of the top users returning in very good time. Unfortunately it still isn't unique. Also, the docs say top_hits is designed to be a sub-aggregation, yet I am using it as a top-level aggregation...
"aggs": {
"top_influencers": {
"top_hits": {
"sort": [
{
"user.followers": {
"order": "desc"
}
}
],
"_source": {
"include": [
"user.id",
"user.screenName",
"user.followers"
]
},
"size": 10
}
}
}
Try this:
{
  "aggs": {
    "GroupByType": {
      "terms": {
        "field": "user.id",
        "size": 10000
      },
      "aggs": {
        "Group": {
          "top_hits": {
            "size": 1,
            "_source": {
              "includes": ["user.id", "user.screenName", "user.followers"]
            },
            "sort": [
              {
                "user.followers": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}
You can then take the top 10 results of this query. Note that a normal search in Elasticsearch only goes up to 10,000 records.
