When are "buckets": [] in an aggregation? - elasticsearch

My query is a nested aggregation
aggs: {
  src: {
    terms: {
      field: "dst_ip",
      size: 1000,
    },
    aggs: {
      dst: {
        terms: {
          field: "a_field_which_changes",
          size: 2000,
        },
      },
    },
  },
A typical doc the query is run against is below (the mappings are all of type keyword):
{
  "_index": "honey",
  "_type": "event",
  "_id": "AWHzRjHrjNgIX_EoDcfV",
  "_score": 1,
  "_source": {
    "dst_ip": "10.101.146.166",
    "src_ip": "10.10.16.1",
    "src_port": "38",
  }
},
There are actually two queries I make, one after the other. They differ by the value of a_field_which_changes, which is "src_ip" in one query and "src_port" in the other.
In the first query all the results are fine. The aggregation is one element large, and the buckets show what that element matched:
{
  "key": "10.6.17.218",        <--- "dst_ip" field
  "doc_count": 1,
  "dst": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "-1",           <--- "src_port" field
        "doc_count": 1
      }
    ]
  }
},
The other query yields two different kinds of results:
{
  "key": "10.6.17.218",
  "doc_count": 1,
  "dst": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": []
  }
},
{
  "key": "10.237.78.19",
  "doc_count": 1,
  "dst": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "10.12.67.89",
        "doc_count": 1
      }
    ]
  }
},
The first result is problematic: it does not give the details of the buckets. The document is no different from the others, yet the details are missing.
Why is that, and most importantly, how can I force Elasticsearch to display the details of the buckets?
The documentation goes into detail on how to influence the aggregation, but I could not find anything relevant there.
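In case it helps future readers: empty inner buckets like this usually mean the matching documents have no value at all for the inner field, so the nested terms aggregation has nothing to bucket. If that is the cause here, the terms aggregation's missing parameter gives those documents a bucket of their own instead of leaving the array empty. A minimal sketch, keeping the field names from the question (the "N/A" key is an arbitrary placeholder):

aggs: {
  src: {
    terms: {
      field: "dst_ip",
      size: 1000,
    },
    aggs: {
      dst: {
        terms: {
          field: "a_field_which_changes",
          size: 2000,
          missing: "N/A",      <--- documents with no value for the field are counted under this key
        },
      },
    },
  },
},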

Related

Elasticsearch - Sort results of Terms aggregation by key string length

I am querying ES with a Terms aggregation to find the first N unique values of a string field foo where the field contains a substring bar, and the document matches some other constraints.
Currently I am able to sort the results by the key string alphabetically:
{
  "query": {other constraints},
  "aggs": {
    "my_values": {
      "terms": {
        "field": "foo.raw",
        "include": ".*bar.*",
        "order": {"_key": "asc"},
        "size": N
      }
    }
  }
}
This gives results like
{
  ...
  "aggregations": {
    "my_values": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 145,
      "buckets": [
        {
          "key": "aa_bar_aa",
          "doc_count": 1
        },
        {
          "key": "iii_bar_iii",
          "doc_count": 1
        },
        {
          "key": "z_bar_z",
          "doc_count": 1
        }
      ]
    }
  }
}
How can I change the order option so that the buckets are sorted by the length of the strings in the foo key field, giving results like:
{
  ...
  "aggregations": {
    "my_values": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 145,
      "buckets": [
        {
          "key": "z_bar_z",
          "doc_count": 1
        },
        {
          "key": "aa_bar_aa",
          "doc_count": 1
        },
        {
          "key": "iii_bar_iii",
          "doc_count": 1
        }
      ]
    }
  }
}
This is desired because a shorter string is closer to the search substring, so it is considered a 'better' match and should appear earlier in the results than a longer string.
Any alternative way to sort the buckets by how similar they are to the original substring would also be helpful.
I need the sorting to occur in ES so that I only have to load the top N results from ES.
I worked out a way to do this.
I used a sub-aggregation per dynamic bucket to calculate the length of the key string as another field.
Then I was able to sort by this new length field first, then by the actual key so keys of the same length are sorted alphabetically.
{
  "query": {other constraints},
  "aggs": {
    "my_values": {
      "terms": {
        "field": "foo.raw",
        "include": ".*bar.*",
        "order": [
          {"key_length": "asc"},
          {"_key": "asc"}
        ],
        "size": N
      },
      "aggs": {
        "key_length": {
          "max": {"script": "doc['foo.raw'].value.length()"}
        }
      }
    }
  }
}
This gave me results like
{
  ...
  "aggregations": {
    "my_values": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 145,
      "buckets": [
        {
          "key": "z_bar_z",
          "doc_count": 1
        },
        {
          "key": "aa_bar_aa",
          "doc_count": 1
        },
        {
          "key": "dd_bar_dd",
          "doc_count": 1
        },
        {
          "key": "bbb_bar_bbb",
          "doc_count": 1
        }
      ]
    }
  }
}
which is what I wanted.
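A side note in case you are on a newer Elasticsearch release: the bare script string above may need to be wrapped in a script object with a source key, which recent versions expect. A hedged equivalent of the key_length sub-aggregation, same expression, just the object form:

"aggs": {
  "key_length": {
    "max": {
      "script": {
        "source": "doc['foo.raw'].value.length()"
      }
    }
  }
}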

Elasticsearch return document ids while doing aggregate query

Is it possible to get an array of Elasticsearch document IDs while grouping, i.e. to turn the current output below into the desired output?
Current output
"aggregations": {,
"types": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Text Document",
"doc_count": 3310
},
{
"key": "Unknown",
"doc_count": 15
},
{
"key": "Document",
"doc_count": 13
}
]
}
}
Desired output
"aggregations": {,
"types": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Text Document",
"doc_count": 3310,
"ids":["doc1","doc2", "doc3"....]
},
{
"key": "Unknown",
"doc_count": 15,
"ids":["doc11","doc12", "doc13"....]
},
{
"key": "Document",
"doc_count": 13
"ids":["doc21","doc22", "doc23"....]
}
]
}
}
I'm not sure whether this is possible in Elasticsearch or not; below is my aggregation query:
{
  "size": 0,
  "aggs": {
    "types": {
      "terms": {
        "field": "docType",
        "size": 10
      }
    }
  }
}
Elasticsearch version:
6.3.2
You can use the top_hits aggregation, which returns the documents under each bucket. Using source filtering, you can select which fields appear under hits.
Query:
"aggs": {
"district": {
"terms": {
"field": "docType",
"size": 10
},
"aggs": {
"docs": {
"top_hits": {
"size": 10,
"_source": ["ids"]
}
}
}
}
}
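If the goal is only to collect the document IDs per bucket, note that each hit returned by top_hits carries its _id in the hit metadata regardless of source filtering, so the source can be disabled entirely to keep the response small. A sketch along those lines, reusing the docType field from the question; the IDs then show up as hits.hits[*]._id inside each bucket:

{
  "size": 0,
  "aggs": {
    "types": {
      "terms": {
        "field": "docType",
        "size": 10
      },
      "aggs": {
        "docs": {
          "top_hits": {
            "size": 100,
            "_source": false
          }
        }
      }
    }
  }
}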
For anyone interested, another solution is to create a custom key using a script that builds a string of delimited values from the doc, including the id. It may not be pretty, but you can then parse it out later, and if you just need something minimal like the doc id, it may be worth it.
{
  "size": 0,
  "aggs": {
    "types": {
      "terms": {
        "script": "doc['docType'].value+'::'+doc['_id'].value",
        "size": 10
      }
    }
  }
}

Elasticsearch determine bucket length in aggregation

Please help me understand nested bucket aggregations in Elasticsearch. I have the following aggregation results:
[...]
{
  "key": "key1",
  "doc_count": 1166,
  "range_keys": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "happy",
        "doc_count": 1166
      }
    ]
  }
},
{
  "key": "key2",
  "doc_count": 1123,
  "range_keys": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "cookies",
        "doc_count": 1122
      },
      {
        "key": "happy",
        "doc_count": 1
      }
    ]
  }
},
[...]
As you can see, I have results whose buckets contain only "happy", but I need to get only the results that contain both "happy" and "cookies".
To achieve this I tried the "size" argument, but that only gave me results with size or fewer buckets.
How can I determine the "bucket" length in a nested query?
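One approach that might work here, sketched only since the original query is not shown: add a cardinality sub-aggregation that counts the distinct nested keys, then a bucket_selector pipeline aggregation that keeps only the parent buckets where that count is at least 2. The field names parent_field and nested_field below are placeholders, and the script assumes a version where Painless is the default scripting language:

"aggs": {
  "parents": {
    "terms": { "field": "parent_field" },
    "aggs": {
      "range_keys": {
        "terms": { "field": "nested_field" }
      },
      "distinct_keys": {
        "cardinality": { "field": "nested_field" }
      },
      "keep_multi_key_parents": {
        "bucket_selector": {
          "buckets_path": { "count": "distinct_keys" },
          "script": "params.count >= 2"
        }
      }
    }
  }
}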

Elasticsearch: date_histogram and computing intervals

Let's say I have many stores and I want to show the stores which had the most growth in number of visits between January and February.
So far I'm using date_histogram to get the numbers per month and per store with this query:
query: {
  range: {
    visited_at: {
      gte: "2016-01-01T00:00:00Z",
      lt: "2016-03-01T00:00:00Z"
    }
  }
},
size: 0,
aggs: {
  months: {
    date_histogram: {
      field: 'visited_at',
      interval: "month"
    },
    aggs: {
      stores: {
        terms: {
          size: 0,
          field: 'store_id'
        }
      }
    }
  }
}
And it returns something like this:
"aggregations": {
"months": {
"buckets": [
{
"key_as_string": "2016-01-01T00:00:00.000+00:00",
"key": 1451574000000,
"doc_count": 300,
"stores": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 1,
"doc_count": 100
},
{
"key": 2,
"doc_count": 200
}
]
}
},
{
"key_as_string": "2016-02-01T00:00:00.000+00:00",
"key": 1454252400000,
"doc_count": 550,
"stores": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 1,
"doc_count": 150
},
{
"key": 2,
"doc_count": 400
}
]
}
}
]
}
}
With this I'm fetching the data for all the stores and then comparing the growth in my code, but I'm hoping there is a query that would let Elasticsearch calculate the growth and return only the top n.
I tried some pipeline aggregations but I couldn't manage to get what I wanted.
I guess another way to improve this would be to have a batch job compute the monthly growth at the end of each month and then store it. Does Elasticsearch have something that could do this automatically?
FYI I'm on Elasticsearch 2.2 and I'm using this for the growth: (feb_result - jan_result) / jan_result
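Not a full answer, but one pipeline shape that may be worth trying, sketched under the assumption that inverting the nesting is acceptable: put the terms aggregation on store_id on the outside and the date_histogram inside, then attach a derivative to each store's monthly buckets. The derivative yields the absolute month-over-month change in doc_count per store; the ratio from the formula above would still need a bucket_script or a client-side division, and since 2.2 has no bucket_sort, picking the top n by growth still happens client side.

size: 0,
query: { ... same range on visited_at as above ... },
aggs: {
  stores: {
    terms: {
      field: 'store_id',
      size: 0
    },
    aggs: {
      months: {
        date_histogram: {
          field: 'visited_at',
          interval: "month"
        },
        aggs: {
          growth: {
            derivative: {
              buckets_path: "_count"    <--- current month's doc_count minus the previous month's
            }
          }
        }
      }
    }
  }
}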

How to get the count of most frequent pattern in elasticsearch?

I want to get the ten most frequent patterns from a search with Elasticsearch.
Example:
"cgn:4189, dfsdkfldslfs"
"cgn:4210, aezfvdsvgds"
"cgn:4189, fdsmpfjdjs"
"cgn:4195, cvsf"
"cgn:4189, mkpjd"
"cgn:4210, mfsfgkpjd"
I want to get:
4189 : 3
4210 : 2
4195 : 1
I know how to do that in MySQL or via awk/sort/head, but with Elasticsearch I'm lost.
Exactly how it will work depends on your analyzer, but if you are just using the default, standard analyzer, you can probably get what you want pretty easily with a terms aggregation.
As a simple example, I set up a trivial index:
PUT /test_index
{
  "settings": {
    "number_of_shards": 1
  }
}
Then I indexed the data you posted, using the bulk API:
POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"msg":"cgn:4189, dfsdkfldslfs"}
{"index":{"_id":2}}
{"msg":"cgn:4210, aezfvdsvgds"}
{"index":{"_id":3}}
{"msg":"cgn:4189, fdsmpfjdjs"}
{"index":{"_id":4}}
{"msg":"cgn:4195, cvsf"}
{"index":{"_id":5}}
{"msg":"cgn:4189, mkpjd"}
{"index":{"_id":6}}
{"msg":"cgn:4210, mfsfgkpjd"}
Then I can run a simple terms aggregation to get back all the terms and how often they occur (ordered by descending doc count by default):
POST /test_index/_search?search_type=count
{
  "aggs": {
    "msg_terms": {
      "terms": {
        "field": "msg"
      }
    }
  }
}
which returns:
{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 6,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "msg_terms": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "cgn",
          "doc_count": 6
        },
        {
          "key": "4189",
          "doc_count": 3
        },
        {
          "key": "4210",
          "doc_count": 2
        },
        {
          "key": "4195",
          "doc_count": 1
        },
        {
          "key": "aezfvdsvgds",
          "doc_count": 1
        },
        {
          "key": "cvsf",
          "doc_count": 1
        },
        {
          "key": "dfsdkfldslfs",
          "doc_count": 1
        },
        {
          "key": "fdsmpfjdjs",
          "doc_count": 1
        },
        {
          "key": "mfsfgkpjd",
          "doc_count": 1
        },
        {
          "key": "mkpjd",
          "doc_count": 1
        }
      ]
    }
  }
}
Here is the code I used:
http://sense.qbox.io/gist/a827095b675596c4e3d545ce963cde3fae932156
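If only the numeric codes are wanted in the buckets (and not tokens like cgn or the random strings), the terms aggregation's include parameter can filter the terms with a regular expression. A possible variation of the aggregation above, assuming the codes are always plain digit runs:

POST /test_index/_search?search_type=count
{
  "aggs": {
    "msg_terms": {
      "terms": {
        "field": "msg",
        "include": "[0-9]+"
      }
    }
  }
}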
