I have the following simple aggregation:
GET index1/type1/_search
{
"size": 0,
"aggs": {
"incidentID": {
"terms": {
"field": "incidentID",
"size": 5
}
}
}
}
Results are:
"aggregations": {
"incidentID": {
"buckets": [
{
"key": "0A631EB1-01EF-DC28-9503-FC28FE695C6D",
"doc_count": 233
},
{
"key": "DF107D2B-CA1E-85C9-E01A-C966DC6F7051",
"doc_count": 226
},
{
"key": "60B8955F-38FD-8DFE-D374-4387668C8368",
"doc_count": 220
},
{
"key": "B787868A-F72E-63DC-D837-B3A864D9FFC6",
"doc_count": 174
},
{
"key": "C597EC5F-C60F-F3BA-61CB-4990F12C1893",
"doc_count": 174
}
]
}
}
What I want to do is get "statistics" on the doc_count values returned. I want:
Min Value
Max Value
Average
Standard Deviation
No, this is not currently possible; here is the issue tracking support for it:
https://github.com/elasticsearch/elasticsearch/issues/8110
Obviously, it is possible to do this client side if you are able to pull the full list of all buckets into memory.
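That said, if your cluster is on a version where the linked issue has since landed (pipeline aggregations shipped in Elasticsearch 2.0), an extended_stats_bucket sibling aggregation should return min, max, avg, and std_deviation over the bucket doc counts. A minimal sketch, assuming the same index and field as above:
GET index1/type1/_search
{
  "size": 0,
  "aggs": {
    "incidentID": {
      "terms": {
        "field": "incidentID",
        "size": 5
      }
    },
    "doc_count_stats": {
      "extended_stats_bucket": {
        "buckets_path": "incidentID>_count"
      }
    }
  }
}
Note that this computes statistics only over the buckets actually returned (here the top 5), not over every distinct incidentID.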
I am querying ES with a Terms aggregation to find the first N unique values of a string field foo where the field contains a substring bar, and the document matches some other constraints.
Currently I am able to sort the results by the key string alphabetically:
{
"query": {other constraints},
"aggs": {
"my_values": {
"terms": {
"field": "foo.raw",
"include": ".*bar.*",
"order": {"_key": "asc"},
"size": N
}
}
}
}
This gives results like
{
...
"aggregations": {
"my_values": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 145,
"buckets": [
{
"key": "aa_bar_aa",
"doc_count": 1
},
{
"key": "iii_bar_iii",
"doc_count": 1
},
{
"key": "z_bar_z",
"doc_count": 1
}
]
}
}
}
How can I change the order option so that the buckets are sorted by the length of the strings in the foo key field, giving results like
{
...
"aggregations": {
"my_values": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 145,
"buckets": [
{
"key": "z_bar_z",
"doc_count": 1
},
{
"key": "aa_bar_aa",
"doc_count": 1
},
{
"key": "iii_bar_iii",
"doc_count": 1
}
]
}
}
}
This is desired because a shorter string is closer to the search substring, so it is considered a 'better' match and should appear earlier in the results than a longer string.
Any alternative way to sort the buckets by how similar they are to the original substring would also be helpful.
I need the sorting to occur in ES so that I only have to load the top N results from ES.
I worked out a way to do this.
I used a sub-aggregation per dynamic bucket to calculate the length of the key string as another field.
Then I was able to sort by this new length field first, then by the actual key, so that keys of the same length are sorted alphabetically.
{
"query": {other constraints},
"aggs": {
"my_values": {
"terms": {
"field": "foo.raw",
"include": ".*bar.*",
"order": [
{"key_length": "asc"},
{"_key": "asc"}
],
"size": N
},
"aggs": {
"key_length": {
"max": {"script": "doc['foo.raw'].value.length()" }
}
}
}
}
}
This gave me results like
{
...
"aggregations": {
"my_values": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 145,
"buckets": [
{
"key": "z_bar_z",
"doc_count": 1
},
{
"key": "aa_bar_aa",
"doc_count": 1
},
{
"key": "dd_bar_dd",
"doc_count": 1
},
{
"key": "bbb_bar_bbb",
"doc_count": 1
}
]
}
}
}
which is what I wanted.
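One version note, hedged: the inline script string above is the older syntax. On Elasticsearch 6.x and later, the same sub-aggregation would normally be written as a Painless script with an explicit source key (assuming foo.raw is a keyword field):
"aggs": {
  "key_length": {
    "max": {
      "script": {
        "source": "doc['foo.raw'].value.length()"
      }
    }
  }
}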
I had to create an aggregation that counts the number of documents falling in date ranges.
My query looks like:
{
  "query": {
    "range": {
      "doc.createdTime": {
        "gte": 1483228800000,
        "lte": 1485907199999
      }
    }
  },
  "size": 0,
  "aggs": {
    "by_day": {
      "date_histogram": {
        "field": "doc.createdTime",
        "interval": "604800000ms",
        "format": "yyyy-MM-dd'T'HH:mm:ssZZ",
        "min_doc_count": 0,
        "extended_bounds": {
          "min": 1483228800000,
          "max": 1485907199999
        }
      }
    }
  }
}
Interval: 604800000 ms (7 × 24 × 60 × 60 × 1000) equals 7 days.
As a result, I receive this:
"aggregations": {
"by_day": {
"buckets": [
{
"key_as_string": "2016-12-29T00:00:00+00:00",
"key": 1482969600000,
"doc_count": 0
},
{
"key_as_string": "2017-01-05T00:00:00+00:00",
"key": 1483574400000,
"doc_count": 603
},
{
"key_as_string": "2017-01-12T00:00:00+00:00",
"key": 1484179200000,
"doc_count": 3414
},
{
"key_as_string": "2017-01-19T00:00:00+00:00",
"key": 1484784000000,
"doc_count": 71551
},
{
"key_as_string": "2017-01-26T00:00:00+00:00",
"key": 1485388800000,
"doc_count": 105652
}
]
}
}
As you can see, my buckets start from 29/12/2016, even though the range query does not cover that date. I expect my buckets to start from 01/01/2017, as specified in the range query. This problem occurs only with intervals spanning more than one day; with any other interval it works fine. I've tried days, months, and hours, and those work fine.
I've also tried using a filter aggregation first and only then the date_histogram. The result is the same.
I'm using Elasticsearch version 2.2.0.
The question is: how can I force date_histogram to start from the date I need?
Try adding the offset parameter, with its value calculated from the following formula:
offset = start_date_in_ms % week_in_ms = 1483228800000 % 604800000 = 259200000
This shifts each bucket's start so that the first bucket begins at 1482969600000 + 259200000 = 1483228800000, i.e. 2017-01-01, instead of 2016-12-29.
{
"query": {
"range": {
"doc.createdTime": {
"gte": 1483228800000,
"lte": 1485907199999
}
}
},
"size": 0,
"aggs": {
"by_day": {
"date_histogram": {
"field": "doc.createdTime",
"interval": "604800000ms",
"offset": "259200000ms",
"format": "yyyy-MM-dd'T'HH:mm:ssZZ",
"min_doc_count": 0,
"extended_bounds": {
"min": 1483228800000,
"max": 1485907199999
}
}
}
}
}
Which aggregation should I use when I want the same functionality as histogram, but specifying only the number of buckets instead of the interval?
Something like: give me aggs for price, and split it into 5 buckets...
I don't want to run a min+max aggregation and then calculate the 5 intervals myself before sending my query, because that means one extra round trip to the server: first ask for min+max, then send the actual query.
STANDARD HISTOGRAM AGGS QUERY:
"aggs":{
"prices":{
"histogram": {
"field": "variants.priceVat.d1",
"interval": 500
}
}
}
STANDARD RESULT (min 10, max 850 = 2 buckets, because the interval is 500):
"prices": {
"doc_count": 67,
"prices": {
"buckets": [
{
"key": 10,
"doc_count": 56
},
{
"key": 500,
"doc_count": 13
}
]
}
}
WHAT I WANT (five buckets with automatic range: min 10, max 850 = one bucket interval of 168):
"prices": {
"doc_count": 67,
"prices":{
"buckets": [
{
"key": 10,
"doc_count": 42
},
{
"key": 178,
"doc_count": 10
},
{
"key": 346,
"doc_count": 4
},
{
"key": 514,
"doc_count": 7
},
{
"key": 682,
"doc_count": 2
}
]
}
}
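For reference, here is a minimal sketch of the two-request workaround described above, i.e. the extra round trip the question wants to avoid: first fetch min and max with a stats aggregation, then compute the interval client side and send the standard histogram query.
"aggs": {
  "price_stats": {
    "stats": {
      "field": "variants.priceVat.d1"
    }
  }
}
With min 10 and max 850 this gives interval = (850 - 10) / 5 = 168, which is then plugged into the standard histogram query shown earlier.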
I'm using Elasticsearch to build a search filter and I need to find all the distinct values saved in the "cambio" field, without repeating values.
The values are saved as follows: "Manual de 5 marchas" or "Manual de 6 marchas"....
I created this query to return all saved values:
GET /crawler10/crawler-vehicles10/_search
{
"size": 0,
"aggregations": {
"my_agg": {
"terms": {
"field": "cambio"
}
}
}
}
But when I run it, the returned values look like this:
"aggregations": {
"my_agg": {
"doc_count_error_upper_bound": 2,
"sum_other_doc_count": 2613,
"buckets": [
{
"key": "de",
"doc_count": 2755
},
{
"key": "marchas",
"doc_count": 2714
},
{
"key": "manual",
"doc_count": 2222
},
{
"key": "modo",
"doc_count": 1097
},
{
"key": "5",
"doc_count": 1071
},
{
"key": "d",
"doc_count": 1002
},
{
"key": "n",
"doc_count": 1002
},
{
"key": "automática",
"doc_count": 935
},
{
"key": "com",
"doc_count": 919
},
{
"key": "6",
"doc_count": 698
}
]
}
}
Aggregations are based on the mapping type of the saved field. The field type for cambio seems to be set to analyzed (the default), so the text is tokenized and each token becomes its own bucket. Please create an index where the cambio field is mapped as not_analyzed.
You can create the index with a PUT request as below (if your ES version is less than 5); you will then need to re-index your data into the crawler10 index.
PUT crawler10
{
  "mappings": {
    "crawler-vehicles10": {
      "properties": {
        "cambio": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
For ES v5 or greater
PUT crawler10
{
  "mappings": {
    "crawler-vehicles10": {
      "properties": {
        "cambio": {
          "type": "keyword"
        }
      }
    }
  }
}
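A hedged alternative for ES 5+: if the crawler10 index was created with dynamic mapping, string fields usually receive an automatic keyword sub-field, so you may be able to aggregate on cambio.keyword without re-indexing at all:
GET /crawler10/crawler-vehicles10/_search
{
  "size": 0,
  "aggregations": {
    "my_agg": {
      "terms": {
        "field": "cambio.keyword"
      }
    }
  }
}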
Here is the mapping of my index PublicationsLikes:
id : String
account : String
api : String
date : Date
I'm currently running an aggregation in ES that groups the result counts by the id (of the publication).
{
"key": "<publicationId-1>",
"doc_count": 25
},
{
"key": "<publicationId-2>",
"doc_count": 387
},
{
"key": "<publicationId-3>",
"doc_count": 7831
}
The returned "key" (the id) is useful information, but I also need to select other fields of the publication, such as account and api. A bit like this:
{
"key": "<publicationId-1>",
"api": "Facebook",
"accountId": "65465z4fe6ezf456ezdf",
"doc_count": 25
},
{
"key": "<publicationId-2>",
"api": "Twitter",
"accountId": "afaez5f4eaz",
"doc_count": 387
}
How can I manage this?
Thanks.
This requirement is best achieved with the top_hits aggregation: it lets you sort the documents in each bucket, pick the first one, and control which fields are returned:
{
"size": 0,
"aggs": {
"publications": {
"terms": {
"field": "id"
},
"aggs": {
"sample": {
"top_hits": {
"size": 1,
"_source": ["api","accountId"]
}
}
}
}
}
}
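Each bucket in the response should then carry a sample hits section containing only the requested source fields, roughly like this (values are illustrative, reusing the examples from the question):
{
  "key": "<publicationId-1>",
  "doc_count": 25,
  "sample": {
    "hits": {
      "hits": [
        {
          "_source": {
            "api": "Facebook",
            "accountId": "65465z4fe6ezf456ezdf"
          }
        }
      ]
    }
  }
}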
You can use a sub-aggregation for this.
GET /PublicationsLikes/_search
{
"aggs" : {
"ids": {
"terms": {
"field": "id"
},
"aggs": {
"accounts": {
"terms": {
"field": "account",
"size": 1
}
}
}
}
}
}
Your result will not be exactly what you want, but it will be similar:
{
"key": "<publicationId-1>",
"doc_count": 25,
"accounts": {
"buckets": [
{
"key": "<account-1>",
"doc_count": 25
}
]
}
},
{
"key": "<publicationId-2>",
"doc_count": 387,
"accounts": {
"buckets": [
{
"key": "<account-2>",
"doc_count": 387
}
]
}
},
{
"key": "<publicationId-3>",
"doc_count": 7831,
"accounts": {
"buckets": [
{
"key": "<account-3>",
"doc_count": 7831
}
]
}
}
You can also check the link for more information.
Thanks both for your quick replies. I think the first solution is the most "beautiful" (in terms of the request but also of retrieving the results), but both seem to be sub-aggregation queries.
{
"size": 0,
"aggs": {
"publications": {
"terms": {
"size": 0,
"field": "publicationId"
},
"aggs": {
"sample": {
"top_hits": {
"size": 1,
"_source": ["accountId", "api"]
}
}
}
}
}
}
I think I must be careful with the size=0 parameter, so, since I work with the Java API, I decided to use Integer.MAX_VALUE instead of 0.
Thanks a lot guys.
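A caveat on that approach: in Elasticsearch 5.0 and later, "size": 0 on a terms aggregation is rejected outright, and a very large size such as Integer.MAX_VALUE can be extremely memory-hungry. On 6.1+, the supported way to page through all buckets is a composite aggregation, which also accepts the same top_hits sub-aggregation; a minimal sketch (the version availability is the assumption here):
{
  "size": 0,
  "aggs": {
    "publications": {
      "composite": {
        "size": 1000,
        "sources": [
          { "publicationId": { "terms": { "field": "publicationId" } } }
        ]
      },
      "aggs": {
        "sample": {
          "top_hits": {
            "size": 1,
            "_source": ["accountId", "api"]
          }
        }
      }
    }
  }
}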