Aggregator of type top_hits cannot accept sub-aggregations with Percentiles - elasticsearch

I have the following documents:
{"id": 1, "type": "bags", "brand": "Louis Vuitton", "condition": "new", "price": 500}
{"id": 2, "type": "bags", "brand": "Louis Vuitton", "condition": "new", "price": 450}
{"id": 3, "type": "bags", "brand": "Louis Vuitton", "condition": "new", "price": 420}
{"id": 4, "type": "bags", "brand": "Louis Vuitton", "condition": "like new", "price": 150}
{"id": 5, "type": "bags", "brand": "Louis Vuitton", "condition": "like new", "price": 150}
{"id": 6, "type": "bags", "brand": "Louis Vuitton", "condition": "like new", "price": 100}
{"id": 7, "type": "bags", "brand": "Louis Vuitton" "condition": "used", "price": 400}
{"id": 8, "type": "bags", "brand": "Louis Vuitton", "condition": "used", "price": 350}
{"id": 9, "type": "bags", "brand": "Louis Vuitton", "condition": "used", "price": 300}
I am looking to write a query that will return the percentiles of prices for the top 2 documents for each condition. In other words, I want to perform some calculation after getting the top 2 best-scoring documents for each item condition (new, like new, used). I have tried the following, but I am getting the error Aggregator of type top_hits cannot accept sub-aggregations:
{
  "query": {
    "match": {
      "brand": "Louis Vuitton"
    }
  },
  "aggs": {
    "item_conditions": {
      "terms": {
        "field": "condition"
      },
      "aggs": {
        "top_two": {
          "top_hits": {
            "size": 2
          },
          "aggs": {
            "top_two_percentiles": {
              "percentiles": {
                "field": "price"
              }
            }
          }
        }
      }
    }
  }
}
Is there another way to achieve this, or do I have to do some post-processing myself after getting the results back from ES? The end result I want is to be able to supply this data to charts to make it look like this: https://ibb.co/y5FpV80

"... the percentiles of prices for the top two documents ..." is somewhat arbitrary. What's the metric that determines the score? A terms aggregation would score the buckets equally. The only differentiating factor would be the bucket count... What I'm saying is, you'll need to first determine what puts a given bucket in the top 2 and go from there.
In any event, you can:
1. Order any terms aggregation by the result of one of its numeric child aggregations.
2. Limit that aggregation to 2 buckets.
3. Use a percentiles bucket aggregation to calculate the percentiles of the two top prices.
In concrete terms:
POST your-index/_search?filter_path=aggregations.*.buckets.key,aggregations.*.buckets.doc_count,aggregations.*.buckets.percentiles_top_two_prices
{
  "size": 0,
  "query": {
    "match": {
      "brand": "Louis Vuitton"
    }
  },
  "aggs": {
    "item_conditions": {
      "terms": {
        "field": "condition"
      },
      "aggs": {
        "top_two": {
          "terms": {
            "field": "price",
            "size": 2,
            "order": {
              "max_score": "desc"   <-- here's how you enforce the top 2 docs
            }
          },
          "aggs": {
            "max_score": {
              "max": {
                "script": "_score"   <-- how you determine what happens here is up to you. _score will be equal across all buckets (I believe), so pick some other metric.
              }
            },
            "just_the_price": {
              "min": {
                "field": "price"   <-- there's no "identity" agg in ES, so I'm using min. There will be only one value per bucket because you're already under the parent which aggregates by price.
              }
            }
          }
        },
        "percentiles_top_two_prices": {
          "percentiles_bucket": {
            "buckets_path": "top_two>just_the_price"
          }
        }
      }
    }
  }
}
yielding something along the lines of:
{
  "aggregations" : {
    "item_conditions" : {
      "buckets" : [
        {
          "key" : "like new",
          "doc_count" : 3,
          "percentiles_top_two_prices" : {
            "values" : {
              "1.0" : 100.0,
              "5.0" : 100.0,
              "25.0" : 100.0,
              "50.0" : 150.0,
              "75.0" : 150.0,
              "95.0" : 150.0,
              "99.0" : 150.0
            }
          }
        },
        {
          "key" : "new",
          "doc_count" : 3,
          "percentiles_top_two_prices" : {
            "values" : {
              "1.0" : 420.0,
              "5.0" : 420.0,
              "25.0" : 420.0,
              "50.0" : 450.0,
              "75.0" : 450.0,
              "95.0" : 450.0,
              "99.0" : 450.0
            }
          }
        },
        {
          "key" : "used",
          "doc_count" : 3,
          "percentiles_top_two_prices" : {
            "values" : {
              "1.0" : 300.0,
              "5.0" : 300.0,
              "25.0" : 300.0,
              "50.0" : 350.0,
              "75.0" : 350.0,
              "95.0" : 350.0,
              "99.0" : 350.0
            }
          }
        }
      ]
    }
  }
}
I'm frankly not sure what these stats would bring you (when based on only two values) but this is how it could be done 😉
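If "top 2" simply means the two highest-priced documents per condition (rather than the two best-scoring ones), a slightly simpler variant is to order the inner price terms aggregation by its key instead of a script-based metric. A sketch, assuming a reasonably recent Elasticsearch version where terms ordering supports _key (older versions use _term):
POST your-index/_search
{
  "size": 0,
  "query": {
    "match": {
      "brand": "Louis Vuitton"
    }
  },
  "aggs": {
    "item_conditions": {
      "terms": {
        "field": "condition"
      },
      "aggs": {
        "top_two": {
          "terms": {
            "field": "price",
            "size": 2,
            "order": {
              "_key": "desc"   <-- the two highest prices per condition
            }
          },
          "aggs": {
            "just_the_price": {
              "min": {
                "field": "price"
              }
            }
          }
        },
        "percentiles_top_two_prices": {
          "percentiles_bucket": {
            "buckets_path": "top_two>just_the_price"
          }
        }
      }
    }
  }
}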

Related

Elasticsearch group, filter and aggregation in the same request

I have a use case to filter and group by two separate fields in Elasticsearch, while aggregating values in the rest of the fields. I'm looking for help formulating a query for this. Here is an example: each document has a locationId, category, productId, visitCount, purchaseCount and sampleCount. I want to view the sums of visitCount, purchaseCount and sampleCount for each category within each location. Note that productId is unique across all entries.
I have tried reading the Elasticsearch documentation but could not find a good source to learn how to do grouping, filtering and aggregation all together. Please note that this is for a website use case where we show this data in a table with pages. Due to the number of locations and categories, it is likely that several groups will span more than one page. Please help formulate a query for Elasticsearch.
Sample documents:
[{
  "locationId": 12345,
  "category": "Food",
  "productId": "JKHNG98",
  "visitCount": 10,
  "purchaseCount": 9,
  "sampleCount": 7
}, {
  "locationId": 12345,
  "category": "Food",
  "productId": "HJUSY68",
  "visitCount": 1,
  "purchaseCount": 15,
  "sampleCount": 7
}, {
  "locationId": 12345,
  "category": "Entertainment",
  "productId": "KGUJKHG78",
  "visitCount": 20,
  "purchaseCount": 15,
  "sampleCount": 10
}, {
  "locationId": 12345,
  "category": "Entertainment",
  "productId": "67912HYK",
  "visitCount": 5,
  "purchaseCount": 15,
  "sampleCount": 10
}, {
  "locationId": 54321,
  "category": "Food",
  "productId": "9823HYKN",
  "visitCount": 15,
  "purchaseCount": 12,
  "sampleCount": 5
}, {
  "locationId": 54321,
  "category": "Food",
  "productId": "KJHKJSAHD22",
  "visitCount": 55,
  "purchaseCount": 12,
  "sampleCount": 5
}, {
  "locationId": 54321,
  "category": "Entertainment",
  "productId": "SDJFHSF788",
  "visitCount": 45,
  "purchaseCount": 44,
  "sampleCount": 23
}, {
  "locationId": 54321,
  "category": "Entertainment",
  "productId": "2131286JH",
  "visitCount": 80,
  "purchaseCount": 44,
  "sampleCount": 23
}]
Input can be multiple location IDs but always just 1 category.
Expected result for filter input category "Food":
[{
  "locationId": 12345,
  "category": "Food",
  "sumOfVisitCount": 11,
  "sumOfPurchaseCount": 24,
  "sumOfSampleCount": 14
}, {
  "locationId": 54321,
  "category": "Food",
  "sumOfVisitCount": 70,
  "sumOfPurchaseCount": 24,
  "sumOfSampleCount": 10
}]
Expected result for filter input of location "12345":
[{
  "locationId": 12345,
  "category": "Food",
  "sumOfVisitCount": 11,
  "sumOfPurchaseCount": 24,
  "sumOfSampleCount": 14
}, {
  "locationId": 12345,
  "category": "Entertainment",
  "sumOfVisitCount": 25,
  "sumOfPurchaseCount": 30,
  "sumOfSampleCount": 20
}]
Expected result for filter input of location "12345" and category "Food":
[{
  "locationId": 12345,
  "category": "Food",
  "sumOfVisitCount": 11,
  "sumOfPurchaseCount": 24,
  "sumOfSampleCount": 14
}]
You need to use a combination of a query and an aggregation to achieve your required result.
Search Query for filter input category "Food"
The query part uses a term query to match the documents whose category equals Food.
A terms aggregation creates buckets based on locationId, and a sum aggregation under each bucket computes the sums of sampleCount, visitCount and purchaseCount.
POST idxtest/_search
{
  "size": 0,
  "query": {
    "term": {
      "category.keyword": "Food"
    }
  },
  "aggs": {
    "NAME": {
      "terms": {
        "field": "locationId"
      },
      "aggs": {
        "sumOfPurchaseCount": {
          "sum": {
            "field": "purchaseCount"
          }
        },
        "sumOfVisitCount": {
          "sum": {
            "field": "visitCount"
          }
        },
        "sumOfSampleCount": {
          "sum": {
            "field": "sampleCount"
          }
        }
      }
    }
  }
}
Search Result:
"aggregations" : {
"NAME" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 12345,
"doc_count" : 2,
"sumOfSampleCount" : {
"value" : 14.0
},
"sumOfVisitCount" : {
"value" : 11.0
},
"sumOfPurchaseCount" : {
"value" : 24.0
}
},
{
"key" : 54321,
"doc_count" : 2,
"sumOfSampleCount" : {
"value" : 10.0
},
"sumOfVisitCount" : {
"value" : 70.0
},
"sumOfPurchaseCount" : {
"value" : 24.0
}
}
]
}
Search Query for filter input of location "12345" and category "Food"
POST idxtest/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "category.keyword": "Food"
          }
        },
        {
          "term": {
            "locationId": 12345
          }
        }
      ]
    }
  },
  "aggs": {
    "NAME": {
      "terms": {
        "field": "locationId"
      },
      "aggs": {
        "sumOfPurchaseCount": {
          "sum": {
            "field": "purchaseCount"
          }
        },
        "sumOfVisitCount": {
          "sum": {
            "field": "visitCount"
          }
        },
        "sumOfSampleCount": {
          "sum": {
            "field": "sampleCount"
          }
        }
      }
    }
  }
}
Search Result:
"aggregations" : {
  "NAME" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      {
        "key" : 12345,
        "doc_count" : 2,
        "sumOfSampleCount" : {
          "value" : 14.0
        },
        "sumOfVisitCount" : {
          "value" : 11.0
        },
        "sumOfPurchaseCount" : {
          "value" : 24.0
        }
      }
    ]
  }
}
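The question also mentions showing this data in a paged table, which the queries above do not address. If your Elasticsearch version supports it (6.1+), one option worth looking at is a composite aggregation, which lets you page through buckets with an after key instead of relying on the terms size. A sketch against the same idxtest index (the aggregation name byLocation is just a placeholder):
POST idxtest/_search
{
  "size": 0,
  "query": {
    "term": {
      "category.keyword": "Food"
    }
  },
  "aggs": {
    "byLocation": {
      "composite": {
        "size": 10,
        "sources": [
          { "locationId": { "terms": { "field": "locationId" } } }
        ]
      },
      "aggs": {
        "sumOfVisitCount": { "sum": { "field": "visitCount" } },
        "sumOfPurchaseCount": { "sum": { "field": "purchaseCount" } },
        "sumOfSampleCount": { "sum": { "field": "sampleCount" } }
      }
    }
  }
}
Each response includes an after_key; passing it back as "after" inside the composite block fetches the next page of location buckets.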

Grouped bar-chart in Kibana using Vega-lite

The example at https://vega.github.io/editor/#/examples/vega-lite/bar_grouped shows how to create a grouped bar chart from a table of data.
In my case the data comes from Elasticsearch, so it is not in tabular form.
I can't figure out a way to create two bars for the sum metrics on each bucket.
"buckets" : [
{
"key_as_string" : "03/Dec/2019:00:00:00 +0900",
"key" : 1575298800000,
"doc_count" : 11187,
"deploy_agg" : {
"buckets" : {
"deploy_count" : {
"doc_count" : 43
}
}
},
"start_agg" : {
"buckets" : {
"start_count" : {
"doc_count" : 171
}
}
},
"sum_start_agg" : {
"value" : 171.0
},
"sum_deploy_agg" : {
"value" : 43.0
}
},..
I want to create two bars, one representing the value of sum_start_agg and the other representing the value of sum_deploy_agg.
This is what I had for one bar chart.
"encoding": {
"x": {
"field": "key",
"type": "temporal",
"axis": {"title": "DATE"}
},
"y": {
"field": "deploy_agg.buckets.deploy_count.doc_count",
"type": "quantitative",
"axis": {"title": "deploy_count"}
}
"color": {"value": "green"}
"tooltip": [
{
"field": "deploy_agg.buckets.deploy_count.doc_count",
"type": "quantitative",
"title":"value"
}
]
}
You can use the Fold Transform to fold your two columns so that they can be referenced in an encoding. It might look something like this:
{
  "data": {
    "values": [
      {
        "key_as_string": "03/Dec/2019:00:00:00 +0900",
        "key": 1575298800000,
        "doc_count": 11187,
        "deploy_agg": {"buckets": {"deploy_count": {"doc_count": 43}}},
        "start_agg": {"buckets": {"start_count": {"doc_count": 171}}},
        "sum_start_agg": {"value": 171},
        "sum_deploy_agg": {"value": 43}
      }
    ]
  },
  "transform": [
    {
      "fold": ["sum_start_agg.value", "sum_deploy_agg.value"],
      "as": ["entry", "value"]
    }
  ],
  "mark": "bar",
  "encoding": {
    "x": {"field": "entry", "type": "nominal", "axis": null},
    "y": {"field": "value", "type": "quantitative"},
    "column": {"field": "key", "type": "temporal"},
    "color": {"field": "entry", "type": "nominal"}
  }
}
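Inside Kibana, the spec does not have to embed the values by hand: Kibana's Vega/Vega-Lite integration accepts an Elasticsearch request in data.url and a format.property pointing at the buckets array. A sketch, where your-index, the @timestamp field, the summed field names and the outer aggregation name over_time are all placeholders standing in for whatever produced the buckets shown in the question:
"data": {
  "url": {
    "%context%": true,
    "index": "your-index",
    "body": {
      "size": 0,
      "aggs": {
        "over_time": {
          "date_histogram": {"field": "@timestamp", "interval": "1d"},
          "aggs": {
            "sum_start_agg": {"sum": {"field": "start_count"}},
            "sum_deploy_agg": {"sum": {"field": "deploy_count"}}
          }
        }
      }
    }
  },
  "format": {"property": "aggregations.over_time.buckets"}
}
The fold transform and encoding from the snippet above can then keep referencing sum_start_agg.value and sum_deploy_agg.value unchanged.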

How to filter query based on a field value

I'm working with the Elasticsearch Query DSL, and I can't find a way to express the following:
Return results that have "price" > min budget and "price" < max budget and has_price = true, and also return all results that have has_price = false.
In other words, I would like to apply the range filter only to results that have has_price set to true; results that have has_price set to false should not be affected by the filter.
Here's the mapping:
{
  "formations": {
    "mappings": {
      "properties": {
        "code": {
          "type": "text"
        },
        "date": {
          "type": "date",
          "format": "dd/MM/yyyy"
        },
        "description": {
          "type": "text"
        },
        "has_price": {
          "type": "boolean"
        },
        "place": {
          "type": "text"
        },
        "price": {
          "type": "float"
        },
        "title": {
          "type": "text"
        }
      }
    }
  }
}
The following query combines the two scenarios as two should clauses in a bool query. Since there are only should clauses, minimum_should_match defaults to 1, meaning that at least one should clause has to match:
Abstract Code Snippet
GET formations/_search
{
  "query": {
    "bool": {
      "should": [
        { <1st scenario: has_price = false> },
        { <2nd scenario: has_price = true AND price IN budget_range> }
      ]
    }
  }
}
Actual Sample Code Snippets
# 1. Create the index and populate it with some sample documents
POST formations/_bulk
{"index": {"_id": 1}}
{"has_price": true, "price": 2.0}
{"index": {"_id": 2}}
{"has_price": true, "price": 3.0}
{"index": {"_id": 3}}
{"has_price": true, "price": 4.0}
{"index": {"_id": 4}}
{"has_price": false, "price": 2.0}
{"index": {"_id": 5}}
{"has_price": false, "price": 3.0}
{"index": {"_id": 6}}
{"has_price": false, "price": 4.0}
# 2. Query assuming min_budget = 2.0 and max_budget = 4.0
GET formations/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "filter": {
              "term": {
                "has_price": false
              }
            }
          }
        },
        {
          "bool": {
            "filter": [
              {
                "term": {
                  "has_price": true
                }
              },
              {
                "range": {
                  "price": {
                    "gt": 2,
                    "lt": 4
                  }
                }
              }
            ]
          }
        }
      ]
    }
  }
}
# 3. Result Snippet (4 hits: 3 from 1st scenario & 1 from 2nd scenario)
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
...
Don't forget to add the clause "minimum_should_match": 1 to your bool query in case you add another non-should clause to it.
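For illustration only, a sketch of how that might look if you later add a (hypothetical) match clause on the title field from the mapping above:
GET formations/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "something" } }   <-- hypothetical extra non-should clause
      ],
      "should": [
        { "term": { "has_price": false } },
        {
          "bool": {
            "filter": [
              { "term": { "has_price": true } },
              { "range": { "price": { "gt": 2, "lt": 4 } } }
            ]
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}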
Let me know if this answers your question & solves your issue.

Elasticsearch aggregate on nested JSON data

I have to do some aggregation on JSON data. I saw multiple answers here on Stack Overflow but nothing worked for me.
I have multiple rows, and the timeCountry column holds an array of JSON objects with the keys count, country_name and s_name.
I have to find the sum across all rows grouped by s_name.
For example, the first row's timeCountry holds an array like the one below:
[{
  "count": 12,
  "country_name": "america",
  "s_name": "us"
}, {
  "count": 10,
  "country_name": "new zealand",
  "s_name": "nz"
}, {
  "count": 20,
  "country_name": "India",
  "s_name": "Ind"
}]
Row 2 data is like below
[{
  "count": 12,
  "country_name": "america",
  "s_name": "us"
}, {
  "count": 10,
  "country_name": "South Africa",
  "s_name": "sa"
}, {
  "count": 20,
  "country_name": "india",
  "s_name": "ind"
}]
and so on.
I need a result like the one below:
[{
  "count": 24,
  "country_name": "america",
  "s_name": "us"
}, {
  "count": 10,
  "country_name": "new zealand",
  "s_name": "nz"
}, {
  "count": 40,
  "country_name": "India",
  "s_name": "Ind"
}, {
  "count": 10,
  "country_name": "South Africa",
  "s_name": "sa"
}]
The above data covers only a couple of rows; I have many rows, and timeCountry is the column that holds the array in each.
This is what I tried writing for the aggregation:
{
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "records": {
      "nested": {
        "path": "timeCountry"
      },
      "aggregations": {
        "ids": {
          "terms": {
            "field": "timeCountry.country_name"
          }
        }
      }
    }
  }
}
But it's not working. Please help.
I tried this on my local Elasticsearch cluster and I was able to get aggregated data on the nested documents. Depending on your index mapping, the answer may vary from mine. Following is the DSL that I tried for the aggregation:
{
  "aggs" : {
    "records" : {
      "nested" : {
        "path" : "timeCountry"
      },
      "aggs" : {
        "ids" : {
          "terms" : {
            "field" : "timeCountry.country_name.keyword"
          },
          "aggs": {
            "sum_name": {
              "sum" : {
                "field" : "timeCountry.count"
              }
            }
          }
        }
      }
    }
  }
}
Following is the mapping of my index:
{
  "settings" : {
    "number_of_shards" : 1
  },
  "mappings": {
    "agg_data" : {
      "properties" : {
        "timeCountry" : {
          "type" : "nested"
        }
      }
    }
  }
}
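For reference, with the two sample rows from the question indexed under that mapping, the response should come back roughly in this shape (values computed from the sample data; note that country_name.keyword is case-sensitive, so "India"/"india" and "Ind"/"ind" end up in separate buckets unless you normalize the values at index time):
"aggregations": {
  "records": {
    "doc_count": 6,
    "ids": {
      "buckets": [
        {
          "key": "america",
          "doc_count": 2,
          "sum_name": { "value": 24.0 }
        },
        ...
      ]
    }
  }
}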

Elasticsearch - bump individual result to the top

I'm working with Elasticsearch. I have an array of documents, and I'm trying to sort documents by the property price, except that I'd like a particular document to be the first result no matter what.
Below is the "sort" array I'm using in my attempt to return document 1213 first, with all following documents ordered by price ascending.
[
  {
    "id": {
      "mode": "max",
      "order": "desc",
      "nested_filter": {
        "term": {
          "id": 1213
        }
      },
      "missing": "_last"
    }
  },
  {
    "price": {
      "order": "asc"
    }
  }
]
This doesn't appear to be working, though—document 1213 doesn't appear first. What am I doing wrong here?
As an example—the ideal returned result:
[{"id": 1213, "name": "Blue Sunglasses", "price": 12},
{"id": 1000, "name": "Green Sunglasses", "price": 2},
{"id": 1031, "name": "Purple Sunglasses", "price: 4},
{"id": 5923, "name": "Yellow Sunglasses, "price": 18}]
Instead, I get:
[{"id": 1000, "name": "Green Sunglasses", "price": 2},
{"id": 1031, "name": "Purple Sunglasses", "price: 4},
{"id": 1213, "name": "Blue Sunglasses", "price": 12},
{"id": 5923, "name": "Yellow Sunglasses, "price": 18}]
As others have already asked, what is the reason for the nested_filter?
There are many possible ways to do what you need. Here is one which fits the simple requirements you've mentioned so far:
{
  "query" : {
    "custom_filters_score" : {
      "query" : {
        "match_all" : {}
      },
      "filters" : [
        {
          "filter" : {
            "term" : {
              "id" : "1213"
            }
          },
          "boost" : 2
        }
      ]
    }
  },
  "sort" : [
    "_score",
    "price"
  ]
}
The assumption here is that your query is something simple like match_all and does not affect the scores in any way. If you have something more complicated, you can wrap it in a constant_score query so that it does not affect the scores. Ideally you end up with the document set you want where all the documents have the same score, and then the custom_filters_score query boosts the score of the document you want. You can do this for any number of documents by adding further filters, or, if they should all be boosted equally, use a single terms filter. In the end, sort by the score and then by the price.
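As a sketch of those two points (the constant_score wrapper and a terms filter for several promoted documents), staying with the same custom_filters_score syntax as above, it might look something like this; the term on name and the second id 1005 are purely hypothetical:
{
  "query" : {
    "custom_filters_score" : {
      "query" : {
        "constant_score" : {
          "filter" : { "term" : { "name" : "sunglasses" } }   <-- your real query, wrapped so every match gets the same score
        }
      },
      "filters" : [
        {
          "filter" : { "terms" : { "id" : ["1213", "1005"] } },   <-- promote several documents at once
          "boost" : 2
        }
      ]
    }
  },
  "sort" : [
    "_score",
    "price"
  ]
}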
In this case you need to use function_score to modify the score of each doc.
{
  "query": {
    "function_score": {
      "functions": [
        {
          "filter": {
            "term": {
              "id": "1213"
            }
          },
          "weight": 1
        },
        {
          "script_score": {
            "script": "(1 / doc['price'].value)"
          }
        }
      ],
      "score_mode": "sum",
      "boost_mode": "replace",
      "query": {
        //YOUR QUERY GOES HERE
      }
    }
  }
}
Explanation:
{
  "script_score": {
    "script": "(1 / doc['price'].value)"
  }
}
This computes a score based on the price and gives a value < 1: the higher the price, the smaller the score (so the docs end up in ascending price order). If you want descending order instead, just replace it with
"script": "(1 - (1 / doc['price'].value))"
{
  "filter": {
    "term": {
      "id": "1213"
    }
  },
  "weight": 1
}
This will give any doc with "id" = 1213 an extra score of 1. The total score at the end will be the sum of those two functions.
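With the sample prices from the question this works out roughly as: id 1213 gets 1 + 1/12 ≈ 1.08, id 1000 gets 1/2 = 0.5, id 1031 gets 1/4 = 0.25 and id 5923 gets 1/18 ≈ 0.06, so 1213 sorts first and the rest follow in ascending price order, matching the ideal result above. On recent Elasticsearch versions (7.4+) there is also a pinned query built for promoting specific documents; a minimal sketch, assuming the documents are indexed with _id values matching their id field:
{
  "query": {
    "pinned": {
      "ids": ["1213"],
      "organic": {
        "match_all": {}
      }
    }
  },
  "sort": [
    "_score",
    { "price": "asc" }
  ]
}
The pinned document gets an artificially high score, while the organic hits all score the same under match_all, so the secondary sort on price takes over for them.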
