Aggregate over multiple fields without subaggregation - elasticsearch

I have documents in my Elasticsearch index which have two fields. I want to build an aggregation over the combination of these, similar to SQL's GROUP BY field_A, field_B, and get one row per existing combination. I read everywhere that I should use sub-aggregations for this:
{
  "aggs": {
    "sales_by_article": {
      "terms": {
        "field": "catalogs.article_grouping",
        "size": 1000000,
        "order": {
          "total_amount": "desc"
        }
      },
      "aggs": {
        "total_amount": {
          "sum": {
            "script": "Math.round(doc['amount.value'].value*100)/100.0"
          }
        },
        "sales_by_submodel": {
          "terms": {
            "field": "catalogs.submodel_grouping",
            "size": 1000,
            "order": {
              "total_amount": "desc"
            }
          },
          "aggs": {
            "total_amount": {
              "sum": {
                "script": "Math.round(doc['amount.value'].value*100)/100.0"
              }
            }
          }
        }
      }
    }
  },
  "size": 0
}
This gives the following simplified result:
{
  "aggregations": {
    "sales_by_article": {
      "buckets": [
        {
          "key": "19114",
          "total_amount": {
            "value": 426794.25
          },
          "sales_by_submodel": {
            "buckets": [
              {
                "key": "12",
                "total_amount": {
                  "value": 51512.200000000004
                }
              },
              ...
            ]
          }
        },
        ...
      ]
    }
  }
}
However, the ordering here is not what I want. This query first orders the articles by total_amount per article, and then, within each article, orders the submodels by total_amount per submodel. What I want instead is only the deepest level: one bucket per combination of article and submodel, ordered by the total_amount of that combination. This is the result I would like:
{
  "aggregations": {
    "sales_by_article_and_submodel": {
      "buckets": [
        {
          "key": "1911412",
          "total_amount": {
            "value": 51512.200000000004
          }
        },
        ...
      ]
    }
  }
}

It's discussed in the docs a bit here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_multi_field_terms_aggregation
Basically, you can use a script to derive a term from each document (using as many fields as you want) at query time, but it will be slow. If you are doing ad hoc analysis, that works fine. If you need to serve these requests at a high rate, you probably want to add a field to your documents that combines the two fields you're interested in, so the index is already populated with the combined value (a sketch of that index-time approach follows the script example below).
Example query using the script approach:
GET agreements/agreement/_search?size=0
{
  "aggs": {
    "myAggregationName": {
      "terms": {
        "script": {
          "source": "doc['owningVendorCode'].value + '|' + doc['region'].value",
          "lang": "painless"
        }
      }
    }
  }
}
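For the index-time alternative, here is a minimal sketch (not from the original answer) using an ingest pipeline with a set processor to precompute the combined key; the pipeline name and the vendor_region field are assumptions, and vendor_region would need a keyword mapping to be aggregatable:

PUT _ingest/pipeline/combine-vendor-region
{
  "description": "Hypothetical pipeline: precompute the combined aggregation key at index time",
  "processors": [
    {
      "set": {
        "field": "vendor_region",
        "value": "{{{owningVendorCode}}}|{{{region}}}"
      }
    }
  ]
}

Documents indexed through this pipeline (e.g. with ?pipeline=combine-vendor-region) can then be bucketed with a plain terms aggregation on vendor_region, with no per-document scripting at query time.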

I have since learned that I should use composite aggregations for this.
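For reference, a minimal sketch of what that composite aggregation could look like for the two fields from the question (the request target and sizes are placeholders). Note that composite buckets come back ordered by the composite key, which is what enables paging, so ordering by total_amount would still need to happen client side:

GET _search
{
  "size": 0,
  "aggs": {
    "sales_by_article_and_submodel": {
      "composite": {
        "size": 1000,
        "sources": [
          { "article": { "terms": { "field": "catalogs.article_grouping" } } },
          { "submodel": { "terms": { "field": "catalogs.submodel_grouping" } } }
        ]
      },
      "aggs": {
        "total_amount": {
          "sum": {
            "script": "Math.round(doc['amount.value'].value*100)/100.0"
          }
        }
      }
    }
  }
}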

Related

Deduplicate and perform composite aggregation on deduped result

I have an index in Elasticsearch which contains daily transaction data. Each doc mainly has the following fields:
TxnId, Status, TxnType, userId
Two documents can have the same TxnId.
I'm looking for a query that aggregates over status and txnType for unique TxnIds. Basically I'm looking for something like: select unique txnIds from user_table group by status, txnType.
I have an ES query which dedups on TxnId, and another ES query which performs a composite aggregation on status and txnType. I want to do both in a single query.
I tried the collapse feature, as well as the cardinality and dedup features, but the query is not giving correct output:
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "streamSource": 3
          }
        }
      ]
    }
  },
  "collapse": {
    "field": "txnId"
  },
  "aggs": {
    "buckets": {
      "composite": {
        "size": 30,
        "sources": [
          {
            "status": {
              "terms": {
                "field": "status"
              }
            }
          },
          {
            "txnType": {
              "terms": {
                "field": "txnType"
              }
            }
          }
        ]
      }
    }
  }
}
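Note that collapse only affects the search hits, not the aggregations, which is why combining it with the composite aggregation has no effect on the buckets. As a sketch (not a verified answer), a cardinality sub-aggregation gives an approximate count of distinct TxnIds per status/txnType bucket:

{
  "size": 0,
  "aggs": {
    "buckets": {
      "composite": {
        "size": 30,
        "sources": [
          { "status": { "terms": { "field": "status" } } },
          { "txnType": { "terms": { "field": "txnType" } } }
        ]
      },
      "aggs": {
        "unique_txn_ids": {
          "cardinality": {
            "field": "txnId"
          }
        }
      }
    }
  }
}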

Elasticsearch sort data upon all buckets

I am trying to build an ES sort but I am struggling.
The gist of my data is that I have product definitions which can consist of various products (we call them abstract and concrete).
Let's say I have an abstract product A which can consist of products B, C, D (called concretes).
I also have, for example, a product E that can have F as a concrete, and so on.
I want to aggregate the products by their abstract (to show only one concrete per abstract) and then sort all concretes based on some criteria.
I have written the following, which doesn't work as expected:
"aggs": {
"category:58": {
"aggs": {
"products": {
"aggs": {
"abstract": {
"top_hits": {
"size": 1,
"sort": [
{
"criteria1": {
"order": "desc"
}
},
{
"_score": {
"order": "desc"
}
},
{
"criteria3": {
"missing": "_last",
"order": "asc",
"unmapped_type": "integer"
}
}
]
}
}
},
"terms": {
"field": "abstract_id",
"size": 10
}
}
},
"filter": {
"term": {
"categories.id": {
"value": "58"
}
}
}
}
},
If I understand it correctly, this creates 10 buckets, each holding one product, and my sort then sorts within a single-product bucket, whereas I should be sorting the entire result. The question is where to place the sort that currently sits in aggs->abstract.
If I remove the grouping by abstract_id and change it to something unique, the sorting does work, but then all concretes of one abstract product can be displayed, which I don't want.
I saw that I can't sort on terms, so I'm kind of clueless now.
I ended up using multiple aggregations and then doing a bucket sort.
The query I ended up with looks like this:
"aggs": {
"abstract": {
"top_hits": {
"size": 1
}
},
"criteria3": {
"sum": {
"field": "custom_filed_foo_bar"
}
},
"criteria1": {
"sum": {
"field": "boosted_value"
}
},
"criteria2": {
"max": {
"script":{
"source": "_score"
}
}
},
"sorting": {
"bucket_sort": {
"sort": [
{
"criteria1": {
"order": "desc"
}
},
{
"criteria2": {
"order": "desc"
}
},
{
"criteria3": {
"order": "desc"
}
}
]
}
}
I don't know if it's the correct approach, but it seems to be working.
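For context, these sub-aggregations presumably sit under the terms aggregation on abstract_id from the original attempt; here is a minimal sketch of the full request shape, with the field names assumed from the question:

{
  "size": 0,
  "aggs": {
    "products": {
      "terms": {
        "field": "abstract_id",
        "size": 10
      },
      "aggs": {
        "abstract": {
          "top_hits": { "size": 1 }
        },
        "criteria1": {
          "sum": { "field": "boosted_value" }
        },
        "criteria2": {
          "max": { "script": { "source": "_score" } }
        },
        "criteria3": {
          "sum": { "field": "custom_filed_foo_bar" }
        },
        "sorting": {
          "bucket_sort": {
            "sort": [
              { "criteria1": { "order": "desc" } },
              { "criteria2": { "order": "desc" } },
              { "criteria3": { "order": "desc" } }
            ]
          }
        }
      }
    }
  }
}

bucket_sort is a pipeline aggregation that reorders (and can truncate) the buckets of its parent multi-bucket aggregation, which is why it works here where terms ordering alone could not.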

Search and aggregation on two indices

Two indices are created, each holding one of the dates.
First index mapping:
PUT /index_one
{
  "mappings": {
    "properties": {
      "date_start": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss.SSSZZ||yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      }
    }
  }
}
Second index mapping:
PUT /index_two
{
  "mappings": {
    "properties": {
      "date_end": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss.SSSZZ||yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      }
    }
  }
}
I need to find dates in a certain range and aggregate the average of the difference between the dates.
I tried a request like this:
GET /index_one,index_two/_search?scroll=1m&q=[2021-01-01+TO+2021-12-31]&filter_path=aggregations,hits.total.value,hits.hits
{
  "aggs": {
    "filtered_dates": {
      "filter": {
        "bool": {
          "must": [
            {
              "exists": {
                "field": "date_start"
              }
            },
            {
              "exists": {
                "field": "date_end"
              }
            }
          ]
        }
      },
      "aggs": {
        "avg_date": {
          "avg": {
            "script": {
              "lang": "painless",
              "source": "doc['date_end'].value.toInstant().toEpochMilli() - doc['date_begin'].value.toInstant().toEpochMilli()"
            }
          }
        }
      }
    }
  }
}
I get the following response to the request:
{
  "hits": {
    "total": {
      "value": 16508
    },
    "hits": [
      {
        "_index": "index_one",
        "_type": "_doc",
        "_id": "93a34c5b-101b-45ea-9965-96a2e0446a28",
        "_score": 1.0,
        "_source": {
          "date_begin": "2021-02-26 07:26:29.732+0300"
        }
      }
    ]
  },
  "aggregations": {
    "filtered_dates": {
      "meta": {},
      "doc_count": 0,
      "avg_date": {
        "value": null
      }
    }
  }
}
Can you please tell me if it is possible to make a query with search and aggregation over two indices in Elasticsearch? If so, how?
If you stored date_start on the document which contains date_end, it'd be much easier to figure out the average — check my answer to Store time related data in ElasticSearch.
Now, the script context operates on one single document at a time and has "no clue" about the other, potentially related docs. So if you don't store both dates at the same time in at least one doc, you'd need to somehow connect the docs nonetheless.
One option would be to use their ids:
POST index_one/_doc
{ "id":1, "date_start": "2021-01-01" }
POST index_two/_doc
{ "id":1, "date_end": "2021-12-31" }
POST index_one/_doc/2
{ "id":2, "date_start": "2021-01-01" }
POST index_two/_doc/2
{ "id":2, "date_end": "2021-01-31" }
After that, it's possible to:
1. Target multiple indices, as you already do.
2. Group the docs by their IDs and keep only those groups that contain at least 2 docs (assuming the two docs represent the start and the end).
3. Obtain the min & max dates, essentially cherry-picking the date_start and date_end to be used later down the line.
4. Use a bucket_script aggregation to calculate their difference (in milliseconds).
5. Leverage a top-level avg_bucket aggregation to run over all the difference buckets and ... average them.
In concrete terms:
GET /index_one,index_two/_search?scroll=1m&q=[2021-01-01+TO+2021-12-31]&filter_path=aggregations,hits.total.value,hits.hits
{
  "aggs": {
    "grouped_by_id": {
      "terms": {
        "field": "id",
        "min_doc_count": 2,
        "size": 10
      },
      "aggs": {
        "min_date": {
          "min": {
            "field": "date_start"
          }
        },
        "max_date": {
          "max": {
            "field": "date_end"
          }
        },
        "diff": {
          "bucket_script": {
            "buckets_path": {
              "min": "min_date",
              "max": "max_date"
            },
            "script": "params.max - params.min"
          }
        }
      }
    },
    "avg_duration_across_the_board": {
      "avg_bucket": {
        "buckets_path": "grouped_by_id>diff",
        "gap_policy": "skip"
      }
    }
  }
}
If everything goes right, you'll end up with:
...
"aggregations" : {
  "grouped_by_id" : {
    ...
  },
  "avg_duration_across_the_board" : {
    "value" : 1.70208E10    <-- 17,020,800,000 milliseconds ~ 4,728 hrs
  }
}
⚠️ Caveat: note that the 2nd level terms aggregation has an adjustable size. You'll probably need to increase it to cover more docs. But there are theoretical and practical limits as to how far it makes sense to increase it.
📖 Shameless plug: this was inspired in part by the chapter Aggregations & Buckets in my recently published Elasticsearch Handbook — containing lots of other real-world, non-trivial examples 🙌

Dividing counts of two different queries in kibana

I am trying to create a Lucene expression that displays the division of the counts of two queries. Both queries match textual information in the message field. I am not sure how to write this correctly. So far, without any luck, I have tried:
doc['message'].value/doc['message'].value
For the first query, the message contains text like "404 not found".
For the second query, the message contains text like "500 error".
What I want is count(404 not found)/count(500 error).
I would appreciate any help.
I'm going to add the disclaimer that it would be significantly cleaner to just run two separate counts and perform the calculation on the client side like this:
GET /INDEX/_search
{
  "size": 0,
  "aggs": {
    "types": {
      "terms": {
        "field": "type",
        "size": 10
      }
    }
  }
}
Which would return something like this (except with your distinct keys instead of the types in my example):
"aggregations": {
"types": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Article",
"doc_count": 881
},
{
"key": "Page",
"doc_count": 301
}
]
}
Using that, take your two counts and calculate the ratio on the client side.
With the above stated, here is the hacky way I was able to put together to do it in a single request:
GET /INDEX/_search
{
  "size": 0,
  "aggs": {
    "parent_agg": {
      "terms": {
        "script": "'This approach is a weird hack'"
      },
      "aggs": {
        "four_oh_fours": {
          "filter": {
            "term": {
              "message": "404 not found"
            }
          },
          "aggs": {
            "count": {
              "value_count": {
                "field": "_index"
              }
            }
          }
        },
        "five_hundreds": {
          "filter": {
            "term": {
              "message": "500 error"
            }
          },
          "aggs": {
            "count": {
              "value_count": {
                "field": "_index"
              }
            }
          }
        },
        "404s_over_500s": {
          "bucket_script": {
            "buckets_path": {
              "four_oh_fours": "four_oh_fours.count",
              "five_hundreds": "five_hundreds.count"
            },
            "script": "return params.four_oh_fours / (params.five_hundreds == 0 ? 1 : params.five_hundreds)"
          }
        }
      }
    }
  }
}
This should return an aggregate value based on the calculation within the script.
If someone can offer an approach aside from these two, I would love to see it. Hope this helps.
Edit - The same script done via the "expression" type rather than Painless (the default). Just replace the above script value with the following:
"script": {
"inline": "four_oh_fours / (five_hundreds == 0 ? 1 : five_hundreds)",
"lang": "expression"
}
Updated the script here to accomplish the same thing via Lucene expressions

Aggregations for categories, sorted by category sequence

I have an Elasticsearch index, in which each document contains the following:
category {
  "id": 4,
  "name": "Green",
  "seq": 2
}
I can use aggregations to get me the doc count for each of the categories:
{
  "size": 0,
  "aggs": {
    "category": {
      "terms": {
        "field": "category.name"
      }
    }
  }
}
This is fine, but the aggs are sorted by the doc count. What I'd like is to have the buckets sorted by the seq value, something that's easy in SQL.
Any suggestions?
Thanks!
Take a look at ordering terms aggregations.
Something like this could work, but only if "name" and "seq" have the right relationship (one-to-one, or it otherwise works out):
POST /test_index/_search
{
  "size": 0,
  "aggs": {
    "category": {
      "terms": {
        "field": "category.name",
        "order": { "seq_num": "asc" }
      },
      "aggs": {
        "seq_num": {
          "max": {
            "field": "category.seq"
          }
        }
      }
    }
  }
}
Here is some code I used for testing:
http://sense.qbox.io/gist/4e551b2faec81eb0343e0e6d0cc9b10f20d7d4c1
