Let's say I have many stores and I want to show the stores that had the most growth in the number of visits between January and February.
So far I'm using a date_histogram to get the numbers per month and per store with this query:
query: {
  range: {
    visited_at: {
      gte: "2016-01-01T00:00:00Z",
      lt: "2016-03-01T00:00:00Z"
    }
  }
},
size: 0,
aggs: {
  months: {
    date_histogram: {
      field: 'visited_at',
      interval: "month"
    },
    aggs: {
      stores: {
        terms: {
          size: 0,
          field: 'store_id'
        }
      }
    }
  }
}
And it returns something like this:
"aggregations": {
"months": {
"buckets": [
{
"key_as_string": "2016-01-01T00:00:00.000+00:00",
"key": 1451574000000,
"doc_count": 300,
"stores": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 1,
"doc_count": 100
},
{
"key": 2,
"doc_count": 200
}
]
}
},
{
"key_as_string": "2016-02-01T00:00:00.000+00:00",
"key": 1454252400000,
"doc_count": 550,
"stores": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 1,
"doc_count": 150
},
{
"key": 2,
"doc_count": 400
}
]
}
}
]
}
}
With this I'm fetching the data for all the stores and then comparing the growth in my code, but I'm hoping there is a query that would let Elasticsearch calculate the growth and return only the top n.
I tried some pipeline aggregations but I couldn't manage to get what I wanted.
I guess another way to improve this would be to have a batch job compute the monthly growth at the end of each month and then store it. Does Elasticsearch have something that could do this automatically?
FYI I'm on Elasticsearch 2.2 and I'm using this for the growth: (feb_result - jan_result) / jan_result
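A minimal sketch of the pipeline-aggregation direction, assuming the same visited_at and store_id fields as above: nest the date_histogram under a terms aggregation and add a derivative over the document count. This only yields the absolute month-over-month difference per store; the relative growth and the top-n selection still have to happen client-side, since a terms aggregation cannot be ordered by a pipeline result.

{
  "size": 0,
  "query": {
    "range": {
      "visited_at": {
        "gte": "2016-01-01T00:00:00Z",
        "lt": "2016-03-01T00:00:00Z"
      }
    }
  },
  "aggs": {
    "stores": {
      "terms": { "field": "store_id", "size": 0 },
      "aggs": {
        "months": {
          "date_histogram": { "field": "visited_at", "interval": "month" },
          "aggs": {
            "growth": { "derivative": { "buckets_path": "_count" } }
          }
        }
      }
    }
  }
}

With the data above, store 1's February bucket would carry "growth": { "value": 50 } and store 2's "growth": { "value": 200 }, leaving only the division by the January count and the sorting to the client.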
Related
I am querying ES with a Terms aggregation to find the first N unique values of a string field foo where the field contains a substring bar, and the document matches some other constraints.
Currently I am able to sort the results by the key string alphabetically:
{
  "query": {other constraints},
  "aggs": {
    "my_values": {
      "terms": {
        "field": "foo.raw",
        "include": ".*bar.*",
        "order": {"_key": "asc"},
        "size": N
      }
    }
  }
}
This gives results like
{
  ...
  "aggregations": {
    "my_values": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 145,
      "buckets": [
        {
          "key": "aa_bar_aa",
          "doc_count": 1
        },
        {
          "key": "iii_bar_iii",
          "doc_count": 1
        },
        {
          "key": "z_bar_z",
          "doc_count": 1
        }
      ]
    }
  }
}
How can I change the order option so that the buckets are sorted by the length of the strings in the foo key field, giving results like
{
  ...
  "aggregations": {
    "my_values": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 145,
      "buckets": [
        {
          "key": "z_bar_z",
          "doc_count": 1
        },
        {
          "key": "aa_bar_aa",
          "doc_count": 1
        },
        {
          "key": "iii_bar_iii",
          "doc_count": 1
        }
      ]
    }
  }
}
This is desired because a shorter string is closer to the search substring, so it is considered a 'better' match and should appear earlier in the results than a longer string.
Any alternative way to sort the buckets by how similar they are to the original substring would also be helpful.
I need the sorting to occur in ES so that I only have to load the top N results from ES.
I worked out a way to do this.
I used a sub-aggregation per dynamic bucket to calculate the length of the key string as another field.
Then I was able to sort by this new length field first, then by the actual key, so that keys of the same length are sorted alphabetically.
{
  "query": {other constraints},
  "aggs": {
    "my_values": {
      "terms": {
        "field": "foo.raw",
        "include": ".*bar.*",
        "order": [
          {"key_length": "asc"},
          {"_key": "asc"}
        ],
        "size": N
      },
      "aggs": {
        "key_length": {
          "max": {"script": "doc['foo.raw'].value.length()"}
        }
      }
    }
  }
}
This gave me results like
{
  ...
  "aggregations": {
    "my_values": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 145,
      "buckets": [
        {
          "key": "z_bar_z",
          "doc_count": 1
        },
        {
          "key": "aa_bar_aa",
          "doc_count": 1
        },
        {
          "key": "dd_bar_dd",
          "doc_count": 1
        },
        {
          "key": "bbb_bar_bbb",
          "doc_count": 1
        }
      ]
    }
  }
}
which is what I wanted.
Please help me understand nested bucket aggregation in Elasticsearch. I have the following query aggregation results:
[...]
{
  "key": "key1",
  "doc_count": 1166,
  "range_keys": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "happy",
        "doc_count": 1166
      }
    ]
  }
},
{
  "key": "key2",
  "doc_count": 1123,
  "range_keys": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "cookies",
        "doc_count": 1122
      },
      {
        "key": "happy",
        "doc_count": 1
      }
    ]
  }
},
[...]
As you can see, I have query results with only "happy", but I need to get all the results, with both "happy" and "cookies".
To achieve this I tried to use the "size" argument, but it only gave me size or fewer results per query.
How can I determine the "bucket" length in a nested query?
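A minimal sketch of the size idea for reference, assuming Elasticsearch 2.x, where "size": 0 on a terms aggregation means "return all buckets" (the field names are placeholders, since the original mapping is not shown):

{
  "aggs": {
    "keys": {
      "terms": { "field": "key_field", "size": 0 },
      "aggs": {
        "range_keys": {
          "terms": { "field": "range_field", "size": 0 }
        }
      }
    }
  }
}

On 5.x and later "size": 0 is no longer accepted, and the usual workaround is to set size to a value safely above the expected number of distinct terms.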
My query is a nested aggregation
aggs: {
  src: {
    terms: {
      field: "dst_ip",
      size: 1000,
    },
    aggs: {
      dst: {
        terms: {
          field: "a_field_which_changes",
          size: 2000,
        },
      },
    },
  },
}
A typical doc the query is run against is below (the mappings are all of type keyword):
{
  "_index": "honey",
  "_type": "event",
  "_id": "AWHzRjHrjNgIX_EoDcfV",
  "_score": 1,
  "_source": {
    "dst_ip": "10.101.146.166",
    "src_ip": "10.10.16.1",
    "src_port": "38"
  }
},
There are actually two queries I make, one after the other. They differ by the value of a_field_which_changes, which is "src_ip" in one query and "src_port" in the other.
In the first query all the results are fine. The aggregation is one element large, and the buckets specify what that element matched:
{
  "key": "10.6.17.218",        <--- "dst_ip" field
  "doc_count": 1,
  "dst": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "-1",           <--- "src_port" field
        "doc_count": 1
      }
    ]
  }
},
The other query yields two different kinds of results:
{
  "key": "10.6.17.218",
  "doc_count": 1,
  "dst": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": []
  }
},
{
  "key": "10.237.78.19",
  "doc_count": 1,
  "dst": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "10.12.67.89",
        "doc_count": 1
      }
    ]
  }
},
The first result is problematic: it does not give the details of the buckets. It is no different from the other one, yet somehow the details are missing.
Why is that, and most importantly, how can I force Elasticsearch to return the details of the buckets?
The documentation goes into detail on how to tune the aggregation, but I could not find anything relevant there.
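A diagnostic sketch that may narrow this down, assuming the empty dst bucket comes from a document that has a dst_ip but no value at all for the sub-aggregated field (a terms sub-aggregation simply produces no bucket for a document that lacks the field; the IP is taken from the results above):

{
  "query": {
    "bool": {
      "filter": { "term": { "dst_ip": "10.6.17.218" } },
      "must_not": { "exists": { "field": "src_ip" } }
    }
  }
}

If this returns the document, the parent bucket is counting a document that has nothing to contribute to the sub-buckets, which would explain the empty buckets array.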
I am a newbie to Elasticsearch and I am trying to find out how to handle the scenario described here. I have a schema where a document may contain data such as:
{
  "country": "US",
  "zone": "East",
  "cluster": "Cluster1",
  "time_taken": 4500,
  "status": 0
},
{
  "country": "US",
  "zone": "East",
  "cluster": "Cluster1",
  "time_taken": 5000,
  "status": 0
},
{
  "country": "US",
  "zone": "East",
  "cluster": "Cluster1",
  "time_taken": 5000,
  "status": 1
},
{
  "country": "US",
  "zone": "East",
  "cluster": "Cluster2",
  "time_taken": 5000,
  "status": 0
}
Here status = 0 means success and 1 means failure.
I want to show results that reflect the hierarchy, with success percentages like
US/East/Cluster1 = 66% (which is basically 2 success and 1 failure)
US/East/Cluster2 = 100% (which is basically 1 success)
US/East = 75%
US = 75%
Alternatively, if there is also a way to get the average time taken for the success and failure scenarios across this hierarchy, that would be great.
I think a terms aggregation should get the job done for you.
In order to satisfy your first query examples (% success per cluster), try something like this:
{
  "aggs": {
    "byCluster": {
      "terms": {
        "field": "cluster"
      },
      "aggs": {
        "success_or_fail": {
          "terms": {
            "field": "status"
          }
        }
      }
    }
  }
}
This returns a result that looks something like this:
"aggregations": {
"byCluster": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "cluster1",
"doc_count": 3,
"success_or_fail": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 0,
"doc_count": 2
},
{
"key": 1,
"doc_count": 1
}
]
}
},
{
"key": "cluster2",
"doc_count": 1,
"success_or_fail": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 0,
"doc_count": 1
}
]
}
}
]
}
}
You can take the doc_count for the 0 bucket of the "success_or_fail" (arbitrary name) aggregation and divide it by the doc_count for the corresponding cluster. This will give you the % success for each cluster. (2/3 for "cluster1" and 1/1 for "cluster2").
The same type of aggregation could be used to group by "country" and "zone".
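For the full US/East/Cluster1-style rollup, the terms aggregations can simply be nested one inside the other; a sketch (the byCountry and byZone names are arbitrary):

{
  "aggs": {
    "byCountry": {
      "terms": { "field": "country" },
      "aggs": {
        "byZone": {
          "terms": { "field": "zone" },
          "aggs": {
            "byCluster": {
              "terms": { "field": "cluster" },
              "aggs": {
                "success_or_fail": {
                  "terms": { "field": "status" }
                }
              }
            }
          }
        }
      }
    }
  }
}

Each level's doc_count then provides the denominator for the percentage at that level (e.g. the "East" zone bucket's doc_count of 4 against its 3 successes gives the 75%).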
UPDATE
You can also nest an avg aggregation under the "success_or_fail" terms aggregation, in order to get the average time taken you were looking for.
As in:
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "byCluster": {
      "terms": {
        "field": "cluster"
      },
      "aggs": {
        "success_or_fail": {
          "terms": {
            "field": "status"
          },
          "aggs": {
            "avg_time_taken": {
              "avg": {
                "field": "time_taken"
              }
            }
          }
        }
      }
    }
  }
}
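If you would rather have Elasticsearch compute the percentage itself, a bucket_script pipeline aggregation might work; this is a sketch, assuming a version with pipeline aggregations and Painless scripting (5.x+), where success_docs and success_rate are made-up names:

{
  "aggs": {
    "byCluster": {
      "terms": { "field": "cluster" },
      "aggs": {
        "success_docs": {
          "filter": { "term": { "status": 0 } }
        },
        "success_rate": {
          "bucket_script": {
            "buckets_path": {
              "success": "success_docs>_count",
              "total": "_count"
            },
            "script": "params.success * 100.0 / params.total"
          }
        }
      }
    }
  }
}

The same pattern can be repeated at each level of the country/zone/cluster hierarchy.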
I have the following simple aggregation:
GET index1/type1/_search
{
  "size": 0,
  "aggs": {
    "incidentID": {
      "terms": {
        "field": "incidentID",
        "size": 5
      }
    }
  }
}
Results are:
"aggregations": {
"incidentID": {
"buckets": [
{
"key": "0A631EB1-01EF-DC28-9503-FC28FE695C6D",
"doc_count": 233
},
{
"key": "DF107D2B-CA1E-85C9-E01A-C966DC6F7051",
"doc_count": 226
},
{
"key": "60B8955F-38FD-8DFE-D374-4387668C8368",
"doc_count": 220
},
{
"key": "B787868A-F72E-63DC-D837-B3A864D9FFC6",
"doc_count": 174
},
{
"key": "C597EC5F-C60F-F3BA-61CB-4990F12C1893",
"doc_count": 174
}
]
}
}
What I want to do is get "statistics" on the "doc_count" values returned. I want:
Min Value
Max Value
Average
Standard Deviation
No, this is not currently possible; here is the issue tracking support for it:
https://github.com/elasticsearch/elasticsearch/issues/8110
Obviously, it is possible to do this client side if you are able to pull the full list of all buckets into memory.
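For what it's worth, later releases added pipeline aggregations that cover exactly this. A sketch, assuming a 2.x+ version where extended_stats_bucket is available (doc_count_stats is a made-up name); it returns min, max, avg, sum, and std_deviation computed over the doc counts of the returned sibling buckets:

GET index1/type1/_search
{
  "size": 0,
  "aggs": {
    "incidentID": {
      "terms": {
        "field": "incidentID",
        "size": 5
      }
    },
    "doc_count_stats": {
      "extended_stats_bucket": {
        "buckets_path": "incidentID>_count"
      }
    }
  }
}

Note that it only sees the buckets actually returned (the top 5 here), which matches the "doc_count" values above.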