Elasticsearch: how to scope aggregations to your query and filter?

I have been playing around with Elasticsearch queries and filters for some time now, but I had never worked with aggregations before. The idea that aggregations can be scoped by the query seems quite amazing to me, but I want to understand how to do it properly so that I do not make any mistakes. Currently, all my search queries are designed this way:
{
"query": {
},
"filter": {
},
"from": 0,
"size": 60
}
Now, when I added some aggregation buckets, the structure became this:
{
"aggs": {
"all_colors": {
"terms": {
"field": "color.name"
}
},
"all_brands": {
"terms": {
"field": "brand_slug"
}
},
"all_sizes": {
"terms": {
"field": "sizes"
}
}
},
"query": {
},
"filter": {
},
"from": 0,
"size": 60
}
However, the aggregation results are always the same, irrespective of what I provide in the filter.
Now, when I changed the query structure to something like this, it started showing different results:
{
"aggs": {
"all_colors": {
"terms": {
"field": "color.name"
}
},
"all_brands": {
"terms": {
"field": "brand_slug"
}
},
"all_sizes": {
"terms": {
"field": "sizes"
}
}
},
"query": {
"filtered": {
"query": {
},
"filter": {
}
}
},
"from": 0,
"size": 60
}
Does this mean I will have to change the structure of my search queries everywhere to this new filtered structure? Is there any other workaround that lets me achieve the desired results without having to change that much code?
Another thing I observed is that if my brand_slug field contains multiple words, like "peter england", then each word is returned in a separate bucket, like this:
{
"buckets": [
{
"key": "england",
"doc_count": 368
},
{
"key": "peter",
"doc_count": 368
}
]
}
How can I ensure that both of these end up in the same bucket, like this:
{
"buckets": [
{
"key": "peter england",
"doc_count": 368
}
]
}
UPDATE: I have been able to accomplish this second part by indexing brand, color and sizes differently, like this:
"sizes": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
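With that mapping in place, the aggregations just need to target the raw sub-field so that the whole value lands in a single bucket, e.g. (a sketch, assuming brand_slug gets the same raw sub-field):
"all_brands": {
  "terms": {
    "field": "brand_slug.raw"
  }
}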

What you've noticed is by design. Have a look at my answer to a similar question on SO. Basically, the input to both the aggregation and filter sections is the output of the query section. A filtered query, as you've suggested, would be the best way to achieve the results you desire. There is another way too: you can use a filter aggregation. Then you would not need to change your query and filter sections, but would simply copy the filter section inside each aggregation, which in my opinion would be overkill and a violation of the DRY principle in general.
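For reference, the filter-aggregation variant would look roughly like this (a sketch: your existing filter section is copied into each { ... } placeholder and the original terms aggregations are nested underneath; only two of the three aggregations are shown):
{
  "query": { ... },
  "aggs": {
    "all_colors": {
      "filter": { ... },
      "aggs": {
        "colors": {
          "terms": { "field": "color.name" }
        }
      }
    },
    "all_brands": {
      "filter": { ... },
      "aggs": {
        "brands": {
          "terms": { "field": "brand_slug" }
        }
      }
    }
  },
  "from": 0,
  "size": 60
}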

Related

Aggregate over multiple fields without subaggregation

I have documents in my Elasticsearch index that have two fields. I want to build an aggregation over the combination of these, much like SQL's GROUP BY field_A, field_B, getting a row per existing combination. I read everywhere that I should use a sub-aggregation for this.
{
"aggs": {
"sales_by_article": {
"terms": {
"field": "catalogs.article_grouping",
"size": 1000000,
"order": {
"total_amount": "desc"
}
},
"aggs": {
"total_amount": {
"sum": {
"script": "Math.round(doc['amount.value'].value*100)/100.0"
}
},
"sales_by_submodel": {
"terms": {
"field": "catalogs.submodel_grouping",
"size": 1000,
"order": {
"total_amount": "desc"
}
},
"aggs": {
"total_amount": {
"sum": {
"script": "Math.round(doc['amount.value'].value*100)/100.0"
}
}
}
}
}
}
},
"size": 0
}
With the following simplified result:
{
"aggregations": {
"sales_by_article": {
"buckets": [
{
"key": "19114",
"total_amount": {
"value": 426794.25
},
"sales_by_submodel": {
"buckets": [
{
"key": "12",
"total_amount": {
"value": 51512.200000000004
}
},
...
]
}
},
...
]
}
}
}
However, the problem with this is that the ordering is not what I want. In this particular case, it first orders the articles by total_amount per article, and then, within each article, orders the submodels by total_amount per submodel. What I want instead is only the deepest level: an aggregation for each combination of article and submodel, ordered by the total_amount of that combination. This is the result I would like:
{
"aggregations": {
"sales_by_article_and_submodel": {
"buckets": [
{
"key": "1911412",
"total_amount": {
"value": 51512.200000000004
}
},
...
]
}
}
}
It's discussed in the docs a bit here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_multi_field_terms_aggregation
Basically, you can use a script to create a term derived from each document (using as many fields as you want) at query run time, but it will be slow. If you are doing it for ad hoc analysis, it'll work fine. If you need to serve these requests at a high rate, then you probably want to add a field to your model that is a combination of the two fields you're interested in, so the index is already populated for you.
Example query using the script approach:
GET agreements/agreement/_search?size=0
{
"aggs" : {
"myAggregationName" : {
"terms" : {
"script" : {
"source": "doc['owningVendorCode'].value + '|' + doc['region'].value",
"lang": "painless"
}
}
}
}
}
I have since learned that I should use a composite aggregation for this, as sketched below.
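For reference, a sketch of that composite aggregation (available from Elasticsearch 6.1 onward; note that composite buckets are ordered by the composite key, not by a metric, so ordering by total_amount would still need to happen client-side):
{
  "size": 0,
  "aggs": {
    "sales_by_article_and_submodel": {
      "composite": {
        "size": 1000,
        "sources": [
          { "article": { "terms": { "field": "catalogs.article_grouping" } } },
          { "submodel": { "terms": { "field": "catalogs.submodel_grouping" } } }
        ]
      },
      "aggs": {
        "total_amount": {
          "sum": {
            "script": "Math.round(doc['amount.value'].value*100)/100.0"
          }
        }
      }
    }
  }
}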

ElasticSearch calculate percentage for each bucket from total

I'm using Elasticsearch v5. I'm trying to do something similar to what is described in Elasticsearch analytics percent: I have a terms aggregation and I want to calculate a percentage, which is a value from each bucket over the total of all buckets. This is my request:
{
"query": {
"match_all": {}
},
"aggs": {
"periods": {
"terms": {
"field": "periods",
"size": 3
},
"aggs": {
"balance": {
"sum": {
"field": "balance"
}
}
}
},
"total_balance": {
"sum_bucket": {
"buckets_path": "periods>balance"
}
}
}
}
The result I get back looks like this:
{
"aggregations": {
"periods": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 1018940846,
"buckets": [
{
"key": 1177977600000,
"doc_count": 11615418,
"balance": {
"value": 2492032741768.1616
}
},
{
"key": 1185926400000,
"doc_count": 11592425,
"balance": {
"value": 2575365325406.6533
}
},
{
"key": 1175385600000,
"doc_count": 11477402,
"balance": {
"value": 2456256695380.8306
}
}
]
},
"total_balance": {
"value": 7523654762555.645
}
}
}
How do I calculate "balance"/"total_balance" for each item in the bucket in Elasticsearch? I tried a bucket script aggregation at the bucket (periods) level, but I cannot set its buckets_path to total_balance. This post https://discuss.elastic.co/t/combining-two-aggregations-to-get-term-percentage/22201 talks about using the significant terms aggregation, but I need the calculation to use specific fields, not doc_count. I know I can do this as a simple calculation on the client side, but I would like to do it all within Elasticsearch if possible.
No, you can't do that. As of the time I'm writing this post, we're on version 6.1.
According to
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-pipeline.html#buckets-path-syntax,
there are only two major types of pipeline aggregations: parent and sibling.
So, in order to reference the total_balance aggregation from within the periods buckets, we would have to reference an "uncle" aggregation from the buckets_path attribute, which is not possible.
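To illustrate the distinction: a parent pipeline such as bucket_script can only reference paths inside its own bucket, so something like the following works (a sketch dividing each period's balance by its document count via the built-in _count path), but no buckets_path syntax reaches the sibling total_balance:
{
  "size": 0,
  "aggs": {
    "periods": {
      "terms": { "field": "periods", "size": 3 },
      "aggs": {
        "balance": { "sum": { "field": "balance" } },
        "balance_per_doc": {
          "bucket_script": {
            "buckets_path": { "bal": "balance", "count": "_count" },
            "script": "params.bal / params.count"
          }
        }
      }
    }
  }
}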

Getting description when aggregating with Elasticsearch

When we use the aggregation feature in Elasticsearch, we get back the value of the field we are aggregating on, but we also want the description of that field. We have to use sectors.id, as other parts of our API use it later on.
For example, our data looks like this:
[{
"id":"123"
"sectors":[{
"id":"sector-1",
"name":"Automotive"
}]
},
{
"id":"123"
"sectors":[{
"id":"sector-2",
"name":"Biology"
}]
}]
When we aggregate over sectors.id our response looks like:
"aggregations": {
"sector": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "sector-2",
"doc_count": 19672
},
{
"key": "sector-1",
"doc_count": 11699
}]
}
}
Is there any way to get sectors.name as well as the key in the results?
It seems like sectors should be a nested field. Now, assuming that the sector name is unique per sector id, you may use sub-aggregations to figure out the related keys:
GET _search
{
"size": 0,
"aggs": {
"sectors": {
"nested": {
"path": "sectors"
},
"aggs": {
"sector_id": {
"terms": {
"field": "sectors.id"
},
"aggs": {
"sector_name": {
"terms": {
"field": "sectors.name"
}
}
}
}
}
}
}
}
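With that query, each sectors.id bucket carries a sector_name sub-aggregation containing the single matching name, so the response would look roughly like this (a sketch based on the sample data above):
"aggregations": {
  "sectors": {
    "sector_id": {
      "buckets": [
        {
          "key": "sector-2",
          "doc_count": 19672,
          "sector_name": {
            "buckets": [
              {
                "key": "Biology",
                "doc_count": 19672
              }
            ]
          }
        },
        ...
      ]
    }
  }
}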

Aggregating with multiple fields returned in ElasticSearch

Suppose I have a relatively simple index with the following fields...
"testdata": {
"properties": {
"code": {
"type": "integer"
},
"name": {
"type": "string"
},
"year": {
"type": "integer"
},
"value": {
"type": "integer"
}
}
}
I can write a query to get the total sum of the values aggregated by the code like so:
{
"from":0,
"size":0,
"aggs": {
"by_code": {
"terms": {
"field": "code"
},
"aggs": {
"total_value": {
"sum": {
"field": "value"
}
}
}
}
}
}
And this returns the following (abridged) results:
"aggregations": {
"by_code": {
"doc_count_error_upper_bound": 478,
"sum_other_doc_count": 328116,
"buckets": [
{
"key": 236948,
"doc_count": 739,
"total_value": {
"value": 12537
}
},
However, this data is being fed to a web front-end, where both the code and the name need to be displayed. So the question is: is it possible to amend the query somehow to also return the name field, as well as the code field, in the results?
So, for example, the results can look a bit like this:
"aggregations": {
"by_code": {
"doc_count_error_upper_bound": 478,
"sum_other_doc_count": 328116,
"buckets": [
{
"key": 236948,
"code": 236948,
"name": "Test Name",
"doc_count": 739,
"total_value": {
"value": 12537
}
},
I've read up on sub-aggregations, but in this case there is a one-to-one relationship between code and name (so you wouldn't have different names for the same key). Also, in my real case, there are 5 other fields, like description, that I would like to return, so I am wondering if there is another way to do it.
In SQL (from which this data originally came from before it was swapped to ElasticSearch) I would write the following query
SELECT Code, Name, SUM(Value) AS Total_Value
FROM [TestData]
GROUP BY Code, Name
You can achieve this using scripting, i.e. instead of specifying a field, you specify a combination of fields:
{
"from":0,
"size":0,
"aggs": {
"by_code": {
"terms": {
"script": "[doc.code.value, doc.name.value].join('-')"
},
"aggs": {
"total_value": {
"sum": {
"field": "value"
}
}
}
}
}
}
Note: you need to make sure dynamic scripting is enabled for this to work.
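With the script above, each bucket key becomes the joined combination, so the (abridged) results would look something like this. The front-end can then split the key on the separator, so it is worth choosing a separator that cannot occur in the name field:
"aggregations": {
  "by_code": {
    "buckets": [
      {
        "key": "236948-Test Name",
        "doc_count": 739,
        "total_value": {
          "value": 12537
        }
      },
      ...
    ]
  }
}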

How to use ElasticSearch to bucket historical data from midnight to now?

So I have an index with timestamps in the following format:
2015-03-20T12:00:00+0500
What I would like to do in the SQL equivalent is the following:
select date(timestamp), sum(orders)
from data
where time(timestamp) < time(now)
group by date(timestamp)
I know I need an aggregation, but for now I've tried the basic search query below and I'm getting a malformed error:
{
"size": 0,
"query":
{
"filtered":
{
"query":
{
"match_all" : {}
},
"filter":
{
"range":
{
"#timestamp":
{
"from": "00:00:01.000",
"to": "15:00:00.000"
}
}
}
}
}
}
You do indeed want an aggregation, specifically the date histogram aggregation. Something like:
{
"query": {"match_all": {}},
"aggs": {
"by_date": {
"date_histogram": {
"field": "timestamp",
"interval": "day"
},
"aggs": {
"order_sum": {
"sum": {"field": "foo"}
}
}
}
}
}
First you have a bucketing aggregation that groups your documents by date, then inside it a metric aggregation that computes a value (in this case a sum) for each bucket. This would return data of the form:
{
...
"aggregations": {
"by_date": {
"buckets": [
{
"key_as_string": "2015-03-01T00:00:00.000Z",
"key": 1425168000000,
"doc_count": 8644,
"order_sum": {
"value": 1234
}
},
{
"key_as_string": "2015-03-02T00:00:00.000Z",
"key": 1425254400000,
"doc_count": 8819,
"order_sum": {
"value": 45678
}
},
...
]
}
}
}
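For the "midnight to now" part of the question, assuming the timestamp field is mapped as a date, the match_all query could be swapped for a filtered query with a date-math range (a sketch; on ES 2.x+ the same filter would go in a bool query's filter clause):
{
  "size": 0,
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "range": {
          "timestamp": {
            "gte": "now/d",
            "lte": "now"
          }
        }
      }
    }
  },
  "aggs": { ... }
}
Here "now/d" rounds down to midnight of the current day.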
There is a good intro to aggregations on the elasticsearch blog (part 1 and part 2) if you want to do some more reading.
