Elasticsearch group, filter and aggregation in the same request - elasticsearch

I need to filter and group by two separate fields in Elasticsearch while aggregating values in the remaining fields, and I am looking for help formulating the query. Each document has a locationId, category, productId, visitCount, purchaseCount and sampleCount. I want the sums of visitCount, purchaseCount and sampleCount for each category within each location. Note that productId is unique across all documents. I have read the Elasticsearch documentation but could not find a good source that shows how to do grouping, filtering and aggregation together. This is for a website where the data is shown in a paginated table; given the number of locations and categories, the groups will likely span more than one page.
Sample documents:
[{
"locationId": 12345,
"category": "Food",
"productId": "JKHNG98",
"visitCount": 10,
"purchaseCount": 9,
"sampleCount": 7
}, {
"locationId": 12345,
"category": "Food",
"productId": "HJUSY68",
"visitCount": 1,
"purchaseCount": 15,
"sampleCount": 7
}, {
"locationId": 12345,
"category": "Entertainment",
"productId": "KGUJKHG78",
"visitCount": 20,
"purchaseCount": 15,
"sampleCount": 10
}, {
"locationId": 12345,
"category": "Entertainment",
"productId": "67912HYK",
"visitCount": 5,
"purchaseCount": 15,
"sampleCount": 10
}, {
"locationId": 54321,
"category": "Food",
"productId": "9823HYKN",
"visitCount": 15,
"purchaseCount": 12,
"sampleCount": 5
}, {
"locationId": 54321,
"category": "Food",
"productId": "KJHKJSAHD22",
"visitCount": 55,
"purchaseCount": 12,
"sampleCount": 5
}, {
"locationId": 54321,
"category": "Entertainment",
"productId": "SDJFHSF788",
"visitCount": 45,
"purchaseCount": 44,
"sampleCount": 23
}, {
"locationId": 54321,
"category": "Entertainment",
"productId": "2131286JH",
"visitCount": 80,
"purchaseCount": 44,
"sampleCount": 23
}]
Input can be multiple location IDs but always just 1 category.
Expected result for filter input category "Food":
[{
"locationId": 12345,
"category": "Food",
"sumOfVisitCount": 11,
"sumOfPurchaseCount": 24,
"sumOfSampleCount": 14
},{
"locationId": 54321,
"category": "Food",
"sumOfVisitCount": 70,
"sumOfPurchaseCount": 24,
"sumOfSampleCount": 10
}]
Expected result for filter input of location "12345":
[{
"locationId": 12345,
"category": "Food",
"sumOfVisitCount": 11,
"sumOfPurchaseCount": 24,
"sumOfSampleCount": 14
}, {
"locationId": 12345,
"category": "Entertainment",
"sumOfVisitCount": 25,
"sumOfPurchaseCount": 30,
"sumOfSampleCount": 20
}]
Expected result for filter input of location "12345" and category "Food":
[{
"locationId": 12345,
"category": "Food",
"sumOfVisitCount": 11,
"sumOfPurchaseCount": 24,
"sumOfSampleCount": 14
}]

You need to combine a query with aggregations to get the required result.
Search Query for filter input category "Food"
The query part uses a term query to match the documents whose category equals Food.
A terms aggregation creates one bucket per locationId, and sum sub-aggregations compute the totals of sampleCount, visitCount and purchaseCount within each bucket.
POST idxtest/_search
{
"size": 0,
"query": {
"term": {
"category.keyword": "Food"
}
},
"aggs": {
"NAME": {
"terms": {
"field": "locationId"
},
"aggs": {
"sumOfPurchaseCount": {
"sum": {
"field": "purchaseCount"
}
},
"sumOfVisitCount": {
"sum": {
"field": "visitCount"
}
},
"sumOfSampleCount": {
"sum": {
"field": "sampleCount"
}
}
}
}
}
}
Search Result:
"aggregations" : {
"NAME" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 12345,
"doc_count" : 2,
"sumOfSampleCount" : {
"value" : 14.0
},
"sumOfVisitCount" : {
"value" : 11.0
},
"sumOfPurchaseCount" : {
"value" : 24.0
}
},
{
"key" : 54321,
"doc_count" : 2,
"sumOfSampleCount" : {
"value" : 10.0
},
"sumOfVisitCount" : {
"value" : 70.0
},
"sumOfPurchaseCount" : {
"value" : 24.0
}
}
]
}
}
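To get from this bucket structure to the flat rows shown in the question, a small client-side step is enough. The following is a minimal Python sketch; the function name `flatten_location_buckets` is made up for illustration, and the sample response is trimmed to the fields used here, with values copied from the result above.

```python
def flatten_location_buckets(response, category):
    """Turn the terms-aggregation buckets into flat rows, one per locationId."""
    buckets = response["aggregations"]["NAME"]["buckets"]
    return [
        {
            "locationId": bucket["key"],
            # The category comes from the query filter, not from the buckets.
            "category": category,
            "sumOfVisitCount": bucket["sumOfVisitCount"]["value"],
            "sumOfPurchaseCount": bucket["sumOfPurchaseCount"]["value"],
            "sumOfSampleCount": bucket["sumOfSampleCount"]["value"],
        }
        for bucket in buckets
    ]


# Trimmed copy of the search result above.
sample_response = {
    "aggregations": {
        "NAME": {
            "buckets": [
                {
                    "key": 12345,
                    "doc_count": 2,
                    "sumOfSampleCount": {"value": 14.0},
                    "sumOfVisitCount": {"value": 11.0},
                    "sumOfPurchaseCount": {"value": 24.0},
                },
                {
                    "key": 54321,
                    "doc_count": 2,
                    "sumOfSampleCount": {"value": 10.0},
                    "sumOfVisitCount": {"value": 70.0},
                    "sumOfPurchaseCount": {"value": 24.0},
                },
            ]
        }
    }
}

rows = flatten_location_buckets(sample_response, "Food")
```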
Search Query for filter input of location "12345" and category "Food"
POST idxtest/_search
{
"size": 0,
"query": {
"bool": {
"must": [
{
"term": {
"category.keyword": "Food"
}
},
{
"term": {
"locationId": 12345
}
}
]
}
},
"aggs": {
"NAME": {
"terms": {
"field": "locationId"
},
"aggs": {
"sumOfPurchaseCount": {
"sum": {
"field": "purchaseCount"
}
},
"sumOfVisitCount": {
"sum": {
"field": "visitCount"
}
},
"sumOfSampleCount": {
"sum": {
"field": "sampleCount"
}
}
}
}
}
}
Search Result :
"aggregations" : {
"NAME" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 12345,
"doc_count" : 2,
"sumOfSampleCount" : {
"value" : 14.0
},
"sumOfVisitCount" : {
"value" : 11.0
},
"sumOfPurchaseCount" : {
"value" : 24.0
}
}
]
}
}
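The question also asks for paging and for grouping by category within each location, which a single terms aggregation on locationId does not cover. One possible approach is a composite aggregation, which buckets on several sources at once and pages with an `after` key. The sketch below only builds the request body as a Python dict; `build_paged_request`, the aggregation name `groups` and the default page size are assumptions for illustration, and `category.keyword` assumes the same default mapping as the queries above.

```python
def build_paged_request(category=None, location_ids=None, page_size=10, after_key=None):
    """Build a search body that groups by (locationId, category) and pages the groups."""
    filters = []
    if category is not None:
        filters.append({"term": {"category.keyword": category}})
    if location_ids:
        filters.append({"terms": {"locationId": list(location_ids)}})

    composite = {
        "size": page_size,  # number of groups per page
        "sources": [
            {"locationId": {"terms": {"field": "locationId"}}},
            {"category": {"terms": {"field": "category.keyword"}}},
        ],
    }
    if after_key is not None:
        # Pass the "after_key" object from the previous response to get the next page.
        composite["after"] = after_key

    query = {"bool": {"filter": filters}} if filters else {"match_all": {}}
    return {
        "size": 0,
        "query": query,
        "aggs": {
            "groups": {
                "composite": composite,
                "aggs": {
                    "sumOfVisitCount": {"sum": {"field": "visitCount"}},
                    "sumOfPurchaseCount": {"sum": {"field": "purchaseCount"}},
                    "sumOfSampleCount": {"sum": {"field": "sampleCount"}},
                },
            }
        },
    }


first_page = build_paged_request(location_ids=[12345], page_size=10)
```

Each bucket in the response then carries a `key` object with both `locationId` and `category`, and the `after_key` returned alongside the buckets is what goes into the next call.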

Related

Aggregator of type top_hits cannot accept sub-aggregations with Percentiles

I have the following documents:
{"id": 1, "type": "bags", "brand": "Louis Vuitton", "condition": "new", "price": 500}
{"id": 2, "type": "bags", "brand": "Louis Vuitton", "condition": "new", "price": 450}
{"id": 3, "type": "bags", "brand": "Louis Vuitton", "condition": "new", "price": 420}
{"id": 4, "type": "bags", "brand": "Louis Vuitton", "condition": "like new", "price": 150}
{"id": 5, "type": "bags", "brand": "Louis Vuitton", "condition": "like new", "price": 150}
{"id": 6, "type": "bags", "brand": "Louis Vuitton", "condition": "like new", "price": 100}
{"id": 7, "type": "bags", "brand": "Louis Vuitton", "condition": "used", "price": 400}
{"id": 8, "type": "bags", "brand": "Louis Vuitton", "condition": "used", "price": 350}
{"id": 9, "type": "bags", "brand": "Louis Vuitton", "condition": "used", "price": 300}
I am looking to write a query that returns the percentiles of prices for the top 2 documents for each condition. In other words, I want to perform some calculation after getting the 2 best-scoring documents for each item condition (new, like new, used). I tried the query below, but I get the error "Aggregator of type top_hits cannot accept sub-aggregations":
{
"query": {
"match": {
"brand": "Louis Vuitton"
}
},
"aggs": {
"item_conditions": {
"terms": {
"field": "condition"
},
"aggs": {
"top_two": {
"top_hits": {
"size": 2
},
"aggs": {
"top_two_percentiles": {
"percentiles": {
"field": "price"
}
}
}
}
}
}
}
}
Is there another way to achieve this, or do I have to do some post-processing myself after getting the results back from ES? The end result I want is to be able to supply this data to charts to make it look like this: https://ibb.co/y5FpV80
"... the percentiles of prices for the top two documents ..." is somewhat arbitrary. What's the metric that determines the score? A terms aggregation would score the buckets equally. The only differentiating factor would be the bucket count... What I'm saying is, you'll need to first determine what puts a given bucket in the top 2 and go from there.
In any event, you can:
Order any terms aggregation by the result of one of its numeric child aggregations.
After that, you can limit it to 2 buckets.
When that's done, you can use a percentiles bucket aggregation to calculate the percentiles of the two top prices.
In concrete terms:
POST your-index/_search?filter_path=aggregations.*.buckets.key,aggregations.*.buckets.doc_count,aggregations.*.buckets.percentiles_top_two_prices
{
"size": 0,
"query": {
"match": {
"brand": "Louis Vuitton"
}
},
"aggs": {
"item_conditions": {
"terms": {
"field": "condition"
},
"aggs": {
"top_two": {
"terms": {
"field": "price",
"size": 2,
"order": {
"max_score": "desc" <-- here's how you enforce the top 2 docs
}
},
"aggs": {
"max_score": {
"max": {
"script": "_score" <-- how you determine what happens here is up to you. _score will be equal across all buckets (I believe) so pick some other metric.
}
},
"just_the_price": {
"min": {
"field": "price" <-- there's no "identity" agg in ES so I'm using min. There will be only one value per bucket because you're already under the parent which aggregates on the price.
}
}
}
},
"percentiles_top_two_prices": {
"percentiles_bucket": {
"buckets_path": "top_two>just_the_price"
}
}
}
}
}
}
yielding something along the lines of:
{
"aggregations" : {
"item_conditions" : {
"buckets" : [
{
"key" : "like new",
"doc_count" : 3,
"percentiles_top_two_prices" : {
"values" : {
"1.0" : 100.0,
"5.0" : 100.0,
"25.0" : 100.0,
"50.0" : 150.0,
"75.0" : 150.0,
"95.0" : 150.0,
"99.0" : 150.0
}
}
},
{
"key" : "new",
"doc_count" : 3,
"percentiles_top_two_prices" : {
"values" : {
"1.0" : 420.0,
"5.0" : 420.0,
"25.0" : 420.0,
"50.0" : 450.0,
"75.0" : 450.0,
"95.0" : 450.0,
"99.0" : 450.0
}
}
},
{
"key" : "used",
"doc_count" : 3,
"percentiles_top_two_prices" : {
"values" : {
"1.0" : 300.0,
"5.0" : 300.0,
"25.0" : 300.0,
"50.0" : 350.0,
"75.0" : 350.0,
"95.0" : 350.0,
"99.0" : 350.0
}
}
}
]
}
}
}
I'm frankly not sure what these stats would bring you (when based on only two values) but this is how it could be done 😉
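The other route raised in the question, post-processing on the client, is also possible: a terms aggregation with a top_hits sub-aggregation is valid as long as top_hits itself has no children, and the prices can then be read from the hits. Below is a minimal Python sketch of that step. The function name is invented, the sample response is hand-written in the shape top_hits returns (it is not real output), and which two documents end up on top depends on the scoring.

```python
from statistics import median


def prices_per_condition(response, hits_agg="top_two"):
    """Collect the prices of the top hits inside each item_conditions bucket."""
    result = {}
    for bucket in response["aggregations"]["item_conditions"]["buckets"]:
        hits = bucket[hits_agg]["hits"]["hits"]
        result[bucket["key"]] = [hit["_source"]["price"] for hit in hits]
    return result


# Hand-written illustration of a terms + top_hits response (assumed values).
sample_response = {
    "aggregations": {
        "item_conditions": {
            "buckets": [
                {
                    "key": "new",
                    "doc_count": 3,
                    "top_two": {"hits": {"hits": [
                        {"_source": {"price": 500}},
                        {"_source": {"price": 450}},
                    ]}},
                },
                {
                    "key": "used",
                    "doc_count": 3,
                    "top_two": {"hits": {"hits": [
                        {"_source": {"price": 400}},
                        {"_source": {"price": 350}},
                    ]}},
                },
            ]
        }
    }
}

prices = prices_per_condition(sample_response)
# Any statistic can now be computed locally, for example the median per condition.
medians = {condition: median(values) for condition, values in prices.items()}
```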

aggregation elastic search query with sum

This is my current data. I want an aggregation query that returns, for each variantId, the sum of quantity split by type (in/out).
hits: {
total: {
value: 5,
relation: "eq",
},
max_score: 1,
hits: [
{
_index: "transactions",
_type: "_doc",
_id: "out2391",
_score: 1,
_source: {
date: "2021-03-08",
transactionId: 2391,
brandId: 1112,
outletId: 121222,
variantId: 1321,
qty: 1,
closing: 10,
type: "out",
}
}
],
},
I want a result that returns the sum of quantity for type in/out for each variant:
[{
variantId: 1321,
in: sum(qty),
out: sum(qty)
},
{
variantId: 13211,
in: sum(qty),
out: sum(qty)
}
]
Ingest test documents
POST test_shaheer/_doc
{
"date": "2021-03-08",
"transactionId": 2391,
"brandId": 1112,
"outletId": 121222,
"variantId": 1321,
"qty": 1,
"closing": 10,
"type": "out"
}
POST test_shaheer/_doc
{
"date": "2021-03-08",
"transactionId": 2391,
"brandId": 1112,
"outletId": 121222,
"variantId": 1321,
"qty": 1,
"closing": 10,
"type": "out"
}
POST test_shaheer/_doc
{
"date": "2021-03-08",
"transactionId": 2391,
"brandId": 1112,
"outletId": 121222,
"variantId": 1321,
"qty": 5,
"closing": 10,
"type": "in"
}
POST test_shaheer/_doc
{
"date": "2021-03-08",
"transactionId": 2391,
"brandId": 1112,
"outletId": 121222,
"variantId": 1321,
"qty": 2,
"closing": 10,
"type": "in"
}
To achieve what you need, you have to nest aggregations: first group by variantId, then split each variantId by type, and finally sum the qty field inside each type.
Query
POST test_shaheer/_search
{
"size": 0,
"aggs": {
"variant_ids": {
"terms": {
"field": "variantId",
"size": 10
},
"aggs": {
"types": {
"terms": {
"field": "type.keyword",
"size": 10
},
"aggs": {
"qty_sum": {
"sum": {
"field": "qty"
}
}
}
}
}
}
}
}
Note that size is set to 0 so that no hits are returned, only the aggregations.
Response
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"variant_ids" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 1321,
"doc_count" : 4,
"types" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "in",
"doc_count" : 2,
"qty_sum" : {
"value" : 7.0
}
},
{
"key" : "out",
"doc_count" : 2,
"qty_sum" : {
"value" : 2.0
}
}
]
}
}
]
}
}
}
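The nested buckets can be pivoted into the one-row-per-variant shape from the question on the client side. A minimal Python sketch follows; `pivot_variant_totals` is a made-up helper name, defaulting a missing type to 0 is an assumption, and the sample is the aggregations part of the response above.

```python
def pivot_variant_totals(response):
    """Build one row per variantId with the summed qty for "in" and "out"."""
    rows = []
    for variant in response["aggregations"]["variant_ids"]["buckets"]:
        # Default to 0 so a variant with only one type still has both columns.
        row = {"variantId": variant["key"], "in": 0.0, "out": 0.0}
        for type_bucket in variant["types"]["buckets"]:
            row[type_bucket["key"]] = type_bucket["qty_sum"]["value"]
        rows.append(row)
    return rows


# Aggregations part of the response above, trimmed to the fields used here.
sample_response = {
    "aggregations": {
        "variant_ids": {
            "buckets": [
                {
                    "key": 1321,
                    "doc_count": 4,
                    "types": {
                        "buckets": [
                            {"key": "in", "doc_count": 2, "qty_sum": {"value": 7.0}},
                            {"key": "out", "doc_count": 2, "qty_sum": {"value": 2.0}},
                        ]
                    },
                }
            ]
        }
    }
}

rows = pivot_variant_totals(sample_response)
# -> [{'variantId': 1321, 'in': 7.0, 'out': 2.0}]
```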

How to get multiple fields returned in elasticsearch query?

How to get multiple fields returned that are unique using elasticsearch query?
All of my documents have duplicate name and job fields. I would like to use an es query to get all the unique values which include the name and job in the same response, so they are tied together.
[
{
"name": "albert",
"job": "teacher",
"dob": "11/22/91"
},
{
"name": "albert",
"job": "teacher",
"dob": "11/22/91"
},
{
"name": "albert",
"job": "teacher",
"dob": "11/22/91"
},
{
"name": "justin",
"job": "engineer",
"dob": "1/2/93"
},
{
"name": "justin",
"job": "engineer",
"dob": "1/2/93"
},
{
"name": "luffy",
"job": "rubber man",
"dob": "1/2/99"
}
]
Expected result (in any format). I tried using aggs but I only get one field back:
[
{
"name": "albert",
"job": "teacher"
},
{
"name": "justin",
"job": "engineer"
},
{
"name": "luffy",
"job": "rubber man"
}
]
This is what I tried so far
GET name.test.index/_search
{
"size": 0,
"aggs" : {
"name" : {
"terms" : { "field" : "name.keyword" }
}
}
}
Using the above query gets me this, which is good in that the names are unique:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 95,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"name" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Justin",
"doc_count" : 56
},
{
"key" : "Luffy",
"doc_count" : 31
},
{
"key" : "Albert",
"doc_count" : 8
}
]
}
}
}
I tried doing nested aggregation but that did not work. Is there an alternative solution for getting multiple unique values or am I missing something?
That's a good start! There are a few ways to achieve what you want, each providing a different response format, so you can decide which one you prefer.
The first option is to leverage the top_hits sub-aggregation and return the two fields for each name bucket:
GET name.test.index/_search
{
"size": 0,
"aggs": {
"name": {
"terms": {
"field": "name.keyword"
},
"aggs": {
"top": {
"top_hits": {
"_source": [
"name",
"job"
],
"size": 1
}
}
}
}
}
}
The second option is to use a script in your terms aggregation instead of a field to return a compound value:
GET name.test.index/_search
{
"size": 0,
"aggs": {
"name": {
"terms": {
"script": "doc['name.keyword'].value + ' - ' + doc['job.keyword'].value"
}
}
}
}
The third option is to use two levels of field collapsing:
GET name.test.index/_search
{
"collapse": {
"field": "name.keyword",
"inner_hits": {
"name": "by_job",
"collapse": {
"field": "job.keyword"
},
"size": 1
}
}
}
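For the first option, the name and job values sit a few levels deep in the response, so a short extraction step gives the flat list from the question. This is a minimal Python sketch; the helper name is hypothetical and the sample response is hand-written in the shape a terms + top_hits response has, not real output. With `"size": 1` only one job is returned per name, so a name that appears with several jobs would show just one of them.

```python
def unique_name_job_pairs(response):
    """Read the single top hit of every name bucket and keep name and job."""
    pairs = []
    for bucket in response["aggregations"]["name"]["buckets"]:
        source = bucket["top"]["hits"]["hits"][0]["_source"]
        pairs.append({"name": source["name"], "job": source["job"]})
    return pairs


# Hand-written illustration of the response shape (assumed values).
sample_response = {
    "aggregations": {
        "name": {
            "buckets": [
                {
                    "key": "albert",
                    "doc_count": 3,
                    "top": {"hits": {"hits": [
                        {"_source": {"name": "albert", "job": "teacher"}}
                    ]}},
                },
                {
                    "key": "justin",
                    "doc_count": 2,
                    "top": {"hits": {"hits": [
                        {"_source": {"name": "justin", "job": "engineer"}}
                    ]}},
                },
            ]
        }
    }
}

pairs = unique_name_job_pairs(sample_response)
```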

Grouped bar-chart in Kibana using Vega-lite

The example at https://vega.github.io/editor/#/examples/vega-lite/bar_grouped shows how to create a grouped bar chart from a table of data.
In my case the data comes from Elasticsearch, so it is not in tabular form.
I can't figure out how to create two bars, one for each sum metric in a bucket.
"buckets" : [
{
"key_as_string" : "03/Dec/2019:00:00:00 +0900",
"key" : 1575298800000,
"doc_count" : 11187,
"deploy_agg" : {
"buckets" : {
"deploy_count" : {
"doc_count" : 43
}
}
},
"start_agg" : {
"buckets" : {
"start_count" : {
"doc_count" : 171
}
}
},
"sum_start_agg" : {
"value" : 171.0
},
"sum_deploy_agg" : {
"value" : 43.0
}
},..
I want to create two bars, one representing value of sum_start_agg and another one representing sum_deploy_agg value.
This is what I had for one bar chart.
"encoding": {
"x": {
"field": "key",
"type": "temporal",
"axis": {"title": "DATE"}
},
"y": {
"field": "deploy_agg.buckets.deploy_count.doc_count",
"type": "quantitative",
"axis": {"title": "deploy_count"}
},
"color": {"value": "green"},
"tooltip": [
{
"field": "deploy_agg.buckets.deploy_count.doc_count",
"type": "quantitative",
"title":"value"
}
]
}
You can use the Fold Transform to fold your two columns so that they can be referenced in an encoding. It might look something like this:
{
"data": {
"values": [
{
"key_as_string": "03/Dec/2019:00:00:00 +0900",
"key": 1575298800000,
"doc_count": 11187,
"deploy_agg": {"buckets": {"deploy_count": {"doc_count": 43}}},
"start_agg": {"buckets": {"start_count": {"doc_count": 171}}},
"sum_start_agg": {"value": 171},
"sum_deploy_agg": {"value": 43}
}
]
},
"transform": [
{
"fold": ["sum_start_agg.value", "sum_deploy_agg.value"],
"as": ["entry", "value"]
}
],
"mark": "bar",
"encoding": {
"x": {"field": "entry", "type": "nominal", "axis": null},
"y": {"field": "value", "type": "quantitative"},
"column": {"field": "key", "type": "temporal"},
"color": {"field": "entry", "type": "nominal"}
}
}
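To see what the fold transform does to the data, here is a rough Python equivalent: every input row becomes one output row per folded field, with the field name and its value stored under the two names given in `as`, while the original fields are kept. This is only an illustration of the reshaping, not Vega-Lite code, and it uses flat field names for simplicity.

```python
def fold(rows, fields, as_names=("key", "value")):
    """Emit one copy of each row per folded field, tagged with field name and value."""
    key_name, value_name = as_names
    folded = []
    for row in rows:
        for field in fields:
            new_row = dict(row)  # original fields are kept
            new_row[key_name] = field
            new_row[value_name] = row.get(field)
            folded.append(new_row)
    return folded


# One bucket from the question, flattened by hand for this illustration.
rows = [
    {"key": 1575298800000, "sum_start_agg.value": 171, "sum_deploy_agg.value": 43}
]

folded = fold(
    rows,
    ["sum_start_agg.value", "sum_deploy_agg.value"],
    as_names=("entry", "value"),
)
```

The result has two rows for the same date, one per metric, which is why `entry` can drive both the x position and the color of the bars.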

Elasticsearch aggregate on nested JSON data

I have to do some aggregation on JSON data. I saw multiple answers here on Stack Overflow, but nothing worked for me.
I have multiple rows, and in the timeCountry column I have an array that stores JSON objects with the keys count, country_name and s_name.
I have to find the sum of count across all rows, grouped by s_name.
For example, the timeCountry array in the first row looks like this:
[ {
"count": 12,
"country_name": "america",
"s_name": "us"
},
{
"count": 10,
"country_name": "new zealand",
"s_name": "nz"
},
{
"count": 20,
"country_name": "India",
"s_name": "Ind"
}]
Row 2 data is like below
[{
"count": 12,
"country_name": "america",
"s_name": "us"
},
{
"count": 10,
"country_name": "South Africa",
"s_name": "sa"
},
{
"count": 20,
"country_name": "india",
"s_name": "ind"
}]
and so on.
I need result like below
[{
"count": 24,
"country_name": "america",
"s_name": "us"
}, {
"count": 10,
"country_name": "new zealand",
"s_name": "nz"
},
{
"count": 40,
"country_name": "India",
"s_name": "Ind"
}, {
"count": 10,
"country_name": "South Africa",
"s_name": "sa"
}
]
The data above is for only one row; I have multiple rows, and timeCountry is the column.
What I tried writing for aggregation
{
"query": {
"match_all": {}
},
"aggregations":{
"records" :{
"nested":{
"path":"timeCountry"
},
"aggregations":{
"ids":{
"terms":{
"field": "timeCountry.country_name"
}
}
}
}
}
}
But it's not working. Please help.
I tried this on my local Elasticsearch cluster and was able to get aggregated data on the nested documents. Depending on your index mapping, your results may differ from mine. This is the aggregation DSL I tried:
{
"aggs" : {
"records" : {
"nested" : {
"path" : "timeCountry"
},
"aggs" : {
"ids" : { "terms" : {
"field" : "timeCountry.country_name.keyword"
},
"aggs": {"sum_name": { "sum" : { "field" : "timeCountry.count" } } }
}
}
}
}
}
Following is the mapping of my index:
{
"settings" : {
"number_of_shards" : 1
},
"mappings": {
"agg_data" : {
"properties" : {
"timeCountry" : {
"type" : "nested"
}
}
}
}
}
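As a cross-check of the totals the aggregation should produce, the same sums can be computed on the client from the two sample rows in the question. The sketch below is a minimal Python version that groups by s_name, as the question asks (the aggregation above groups on country_name instead). Note that terms on a keyword field are case-sensitive, so "Ind" and "ind" would land in separate buckets; lowercasing the key here is an assumption made to match the expected total of 40.

```python
from collections import defaultdict


def sum_counts_by_short_name(rows):
    """Sum the count of every timeCountry entry, grouped by lowercased s_name."""
    totals = defaultdict(int)
    for row in rows:
        for entry in row["timeCountry"]:
            # Lowercase the key so "Ind" and "ind" are counted together (assumption).
            totals[entry["s_name"].lower()] += entry["count"]
    return dict(totals)


# The two sample rows from the question.
rows = [
    {"timeCountry": [
        {"count": 12, "country_name": "america", "s_name": "us"},
        {"count": 10, "country_name": "new zealand", "s_name": "nz"},
        {"count": 20, "country_name": "India", "s_name": "Ind"},
    ]},
    {"timeCountry": [
        {"count": 12, "country_name": "america", "s_name": "us"},
        {"count": 10, "country_name": "South Africa", "s_name": "sa"},
        {"count": 20, "country_name": "india", "s_name": "ind"},
    ]},
]

totals = sum_counts_by_short_name(rows)
# -> {'us': 24, 'nz': 10, 'ind': 40, 'sa': 10}
```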
