So I know my total hits are 182 documents:
"hits": {
"total": {
"value": 182,
"relation": "eq"
},
"max_score": null,
"hits": []
},
Then I run an aggregation to find out how many documents have the source instagram or twitter, and it returns:
"bySource": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "instagram",
"doc_count": 162
},
{
"key": "twitter",
"doc_count": 20
}
]
}
Is it possible to get the percentage of documents for each source, twitter and instagram?
Here, the percentage of documents with source instagram would be 89% and with source twitter 11%.
My aggregation code looks like this:
"aggs": {
"bySource": {
"terms": {
"field": "profile.source.keyword"
}
}
}
Let me know if this is possible.
Thank you
Sure, it is possible using the 'Bucket Script Aggregation'.
An example query might look like this:
{
"size": 0,
"aggs": {
"filters_agg": {
"filters": {
"filters": {
"sourceCount": {
"match_all": {}
}
}
},
"aggs": {
"bySource": {
"terms": {
"field": "profile.source.keyword"
}
},
"instagram_count_percentage": {
"bucket_script": {
"buckets_path": {
"instagram_count": "bySource['instagram']>_count",
"total_count": "_count"
},
"script": "Math.round((params.instagram_count * 100)/params.total_count)"
}
},
"twitter_count_percentage": {
"bucket_script": {
"buckets_path": {
"twitter_count": "bySource['twitter']>_count",
"total_count": "_count"
},
"script": "Math.round((params.twitter_count * 100)/params.total_count)"
}
}
}
}
}
}
And the response could be something like this:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 182,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"filters_agg": {
"buckets": {
"sourceCount": {
"doc_count": 182,
"bySource": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "instagram",
"doc_count": 162
},
{
"key": "twitter",
"doc_count": 20
}
]
},
"instagram_count_percentage": {
"value": 89
},
"twitter_count_percentage": {
"value": 11
}
}
}
}
}
}
Adjust it to fit your own use case and mapping.
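If a pipeline aggregation is not an option, the same percentages can also be derived on the client from the plain terms aggregation response. Below is a minimal Python sketch, assuming the response has already been parsed into a dict; the function name `source_percentages` is made up for illustration and is not part of any Elasticsearch client.

```python
def source_percentages(response, agg_name="bySource"):
    """Return {bucket key: rounded percentage of the total hit count}."""
    total = response["hits"]["total"]["value"]
    buckets = response["aggregations"][agg_name]["buckets"]
    if total == 0:
        # Avoid dividing by zero when nothing matched.
        return {bucket["key"]: 0 for bucket in buckets}
    return {
        bucket["key"]: round(bucket["doc_count"] * 100 / total)
        for bucket in buckets
    }


# Trimmed copy of the response shown in the question.
response = {
    "hits": {"total": {"value": 182, "relation": "eq"}},
    "aggregations": {"bySource": {"buckets": [
        {"key": "instagram", "doc_count": 162},
        {"key": "twitter", "doc_count": 20},
    ]}},
}

print(source_percentages(response))  # {'instagram': 89, 'twitter': 11}
```

This only adds up to 100% when every document has a source and `sum_other_doc_count` is 0, as in the example. It also relies on `hits.total.value` being exact (`"relation": "eq"`).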
I sent the following request (I tweaked a few of the terms to make it easier to understand):
{
"size": 0,
"aggs": {
"byMonth": {
"date_histogram": {
"field": "date_time",
"order": {
"_key": "desc"
},
"interval": "month",
"format": "yyyy-MM",
"extended_bounds": {
"max": "2022-02",
"min": "2022-01"
}
},
"aggs": {
"byTest": {
"terms": {
"field": "test_cate_m",
"size": 100,
"order": {
"_count": "desc"
}
}
}
}
}
}
}
and the response is:
{
"took": 15,
"timed_out": false,
"_shards": {
"total": 183,
"successful": 183,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 10000,
"relation": "gte"
},
"max_score": null,
"hits": [
]
},
"aggregations": {
"byMonth": {
"buckets": [
{
"key_as_string": "2022-02",
"key": 1643673600000,
"doc_count": 600,
"byTest": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "test1",
"doc_count": 100
},
{
"key": "test2",
"doc_count": 200
},
{
"key": "test3",
"doc_count": 300
}
]
}
},
{
"key_as_string": "2022-01",
"key": 1640995200000,
"doc_count": 100,
"byTest": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "test3",
"doc_count": 100
}
]
}
}
]
}
}
}
In the nested buckets for 2022-01 there are no 'test1' and 'test2' entries. I'd like to get 'test1' and 'test2' in the buckets of every month so I can compare them, even if there is no data.
Also, if possible, can I do calculations on both results within the query? That is, I'd like to compare each key's doc_count across months in one query, not just fetch the data. Can I do this?
If you can help me out, it would be a huge help :)
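One way to line the months up for comparison is to fill the gaps on the client after the response arrives. The Python sketch below is only an illustration of that idea: it reads the `byMonth` buckets from a response shaped like the one above, adds a count of 0 for every key missing from a month, and computes the difference per key between two months. Both function names are hypothetical.

```python
def counts_by_month(response, keys):
    """Map each month to {key: doc_count}, using 0 for keys with no bucket."""
    table = {}
    for month in response["aggregations"]["byMonth"]["buckets"]:
        found = {b["key"]: b["doc_count"] for b in month["byTest"]["buckets"]}
        table[month["key_as_string"]] = {k: found.get(k, 0) for k in keys}
    return table


def difference(table, newer, older):
    """Per-key change in doc_count between two months of the table."""
    return {k: table[newer][k] - table[older][k] for k in table[newer]}


# Trimmed copy of the response shown above.
response = {"aggregations": {"byMonth": {"buckets": [
    {"key_as_string": "2022-02", "byTest": {"buckets": [
        {"key": "test1", "doc_count": 100},
        {"key": "test2", "doc_count": 200},
        {"key": "test3", "doc_count": 300},
    ]}},
    {"key_as_string": "2022-01", "byTest": {"buckets": [
        {"key": "test3", "doc_count": 100},
    ]}},
]}}}

table = counts_by_month(response, ["test1", "test2", "test3"])
print(table["2022-01"])                        # {'test1': 0, 'test2': 0, 'test3': 100}
print(difference(table, "2022-02", "2022-01"))  # {'test1': 100, 'test2': 200, 'test3': 200}
```

The terms aggregation also accepts `"min_doc_count": 0`, which may be worth trying if the empty buckets should come back from the server instead.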
Sorry if this has been asked already, but I've been lurking around SO and couldn't find anything that suits my needs.
Basically, what I'm trying to achieve in my first quick tries with ES is to add further counters within a Terms Aggregation.
Giving it a quick try I'm sending the following request to ES.
POST http://localhost:9200/people/_search
{
"size": 0,
"aggs": {
"agg_by_name": {
"terms": { "field": "name"}
}
}
}
And what I'm getting right now is just what the sample shows in the docs.
{
"took": 89,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 10000,
"relation": "gte"
},
"max_score": null,
"hits": []
},
"aggregations": {
"agg_by_name": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 9837,
"buckets": [
{
"key": "James",
"doc_count": 437
},
{
"key": "Eduard",
"doc_count": 367
},
{
"key": "Leonardo",
"doc_count": 235
},
{
"key": "George",
"doc_count": 209
},
{
"key": "Harrison",
"doc_count": 180
}, ...
However, I can't figure out how to include further inner aggregations in each bucket, so that a bucket would look something like this.
{
"key": "Harrison",
"doc_count": 180,
"lives_in_NY": 40,
"lives_in_CA": 140,
"distinct_surnames": [ ... ]
}
How should I structure my aggregation so that those are included bucket-wise?
You could try something like this:
{
"size": 0,
"aggs": {
"getAllTheNames": {
"terms": {
"field": "name",
"size": 100
},
"aggs": {
"getAllTheSurnames": {
"terms": {
"field": "surname",
"size": 100
}
}
}
}
}
}
For the city each person lives in, it could be something like:
{
"size": 0,
"aggs": {
"getAllTheNames": {
"terms": {
"field": "name",
"size": 100
},
"aggs": {
"getAllTheCities": {
"terms": {
"field": "city",
"size": 100
}
}
}
}
}
}
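The two sub-aggregations can also sit side by side under the same terms aggregation, so that one request returns both surnames and cities for each name. The Python sketch below builds such a request body and flattens one returned bucket into a shape close to the one asked for. The `surname` and `city` field names come from the queries above; the helper names and the sample bucket (including the surname "Ford") are made up for illustration.

```python
def build_body(size=100):
    """Request body with both sub-aggregations under one terms aggregation."""
    return {
        "size": 0,
        "aggs": {
            "getAllTheNames": {
                "terms": {"field": "name", "size": size},
                "aggs": {
                    "getAllTheSurnames": {"terms": {"field": "surname", "size": size}},
                    "getAllTheCities": {"terms": {"field": "city", "size": size}},
                },
            }
        },
    }


def flatten_bucket(bucket):
    """Turn one name bucket into a flat dict with per-city counts."""
    flat = {"key": bucket["key"], "doc_count": bucket["doc_count"]}
    for city in bucket["getAllTheCities"]["buckets"]:
        flat["lives_in_" + str(city["key"])] = city["doc_count"]
    flat["distinct_surnames"] = [
        surname["key"] for surname in bucket["getAllTheSurnames"]["buckets"]
    ]
    return flat


# Hypothetical bucket, shaped like what the body above would return.
sample_bucket = {
    "key": "Harrison",
    "doc_count": 180,
    "getAllTheCities": {"buckets": [
        {"key": "NY", "doc_count": 40},
        {"key": "CA", "doc_count": 140},
    ]},
    "getAllTheSurnames": {"buckets": [{"key": "Ford", "doc_count": 180}]},
}

print(flatten_bucket(sample_bucket))
```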
I'm running an aggregation on the hash of the docs in my set.
Within each bucket I select the oldest and most recent.
I want an overview:
total number of docs
most recent
oldest
I have managed to get the total to work but am struggling with the oldest and most recent.
My query (limited to 2 results in the aggregation until I get it right):
{
"size": 0,
"query": {
"bool": {
"must_not": [
{
"term": {
"Text_SHA2": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
}
}
]
}
},
"aggs": {
"overall_Total": {
"sum_bucket": {
"buckets_path": "by_SHA2>_count"
}
},
"overall_MostRecent": {
"max_bucket": {
"buckets_path": "by_SHA2>the_MostRecent"
}
},
"by_SHA2": {
"terms": {
"field": "Text_SHA2",
"size": 2
},
"aggs": {
"the_MostRecent": {
"max": {
"field": "ReceivedDateUTC"
}
},
"the_Oldest": {
"min": {
"field": "ReceivedDateUTC"
}
}
}
}
}
}
What I get back:
{
"took": 341,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1163611,
"max_score": 0.0,
"hits": []
},
"aggregations": {
"by_SHA2": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 1163388,
"buckets": [
{
"key": "0683dcdcd26c16315292ecf02307e9d819a08522b35dff933b406688d8d3edb9",
"doc_count": 119,
"the_Oldest": {
"value": 1.54284803E12,
"value_as_string": "2018-11-22T00:53:50.000"
},
"the_MostRecent": {
"value": 1.572209574E12,
"value_as_string": "2019-10-27T20:52:54.000"
}
},
{
"key": "e757c30feeea67425ba02d8821295954d23bb9f6bf979fb8113d2cdf8f79b378",
"doc_count": 104,
"the_Oldest": {
"value": 1.545930842E12,
"value_as_string": "2018-12-27T17:14:02.000"
},
"the_MostRecent": {
"value": 1.572340576E12,
"value_as_string": "2019-10-29T09:16:16.000"
}
}
]
},
"overall_Total": {
"value": 223.0
},
"overall_MostRecent": {
"value": 1.572340576E12,
"keys": [
"e757c30feeea67425ba02d8821295954d23bb9f6bf979fb8113d2cdf8f79b378"
]
}
}
}
What I'd like to get back (please see difference in "overall_MostRecent" at the end):
{
"took": 341,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1163611,
"max_score": 0.0,
"hits": []
},
"aggregations": {
"by_SHA2": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 1163388,
"buckets": [
{
"key": "0683dcdcd26c16315292ecf02307e9d819a08522b35dff933b406688d8d3edb9",
"doc_count": 119,
"the_Oldest": {
"value": 1.54284803E12,
"value_as_string": "2018-11-22T00:53:50.000"
},
"the_MostRecent": {
"value": 1.572209574E12,
"value_as_string": "2019-10-27T20:52:54.000"
}
},
{
"key": "e757c30feeea67425ba02d8821295954d23bb9f6bf979fb8113d2cdf8f79b378",
"doc_count": 104,
"the_Oldest": {
"value": 1.545930842E12,
"value_as_string": "2018-12-27T17:14:02.000"
},
"the_MostRecent": {
"value": 1.572340576E12,
"value_as_string": "2019-10-29T09:16:16.000"
}
}
]
},
"overall_Total": {
"value": 223.0
},
"overall_MostRecent": {
"value": 1.572340576E12,
"value_as_string": "2019-10-29T09:16:16.000"
}
}
}
There's obviously something wrong with my "overall_MostRecent" section of the query. If anyone could point that out to me I'd be much obliged.
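As the response above shows, max_bucket reports the numeric value and the key of the winning bucket but no formatted date. If the string is only needed for display, one workaround is to format the value on the client. This is a Python sketch under the assumption that ReceivedDateUTC values are epoch milliseconds in UTC, which matches the value/value_as_string pairs in the response; the function names are hypothetical.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # naive datetime, interpreted as UTC


def millis_to_string(value):
    """Format epoch milliseconds the way the response does, e.g. 2019-10-29T09:16:16.000."""
    moment = EPOCH + timedelta(milliseconds=int(value))
    return moment.isoformat(timespec="milliseconds")


def add_value_as_string(aggregations, name="overall_MostRecent"):
    """Return a copy of the named aggregation with value_as_string filled in."""
    agg = dict(aggregations[name])
    agg["value_as_string"] = millis_to_string(agg["value"])
    return agg


aggregations = {
    "overall_MostRecent": {
        "value": 1.572340576E12,
        "keys": ["e757c30feeea67425ba02d8821295954d23bb9f6bf979fb8113d2cdf8f79b378"],
    }
}

print(add_value_as_string(aggregations)["value_as_string"])  # 2019-10-29T09:16:16.000
```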
I'm executing a query in elasticsearch. I need to have the number of hits for my attribute "end_date_ut" (type is Date and format is dateOptionalTime) for each month represented in the index.
For that, I'm using a date_histogram aggregation.
My query is just below:
GET inc/_search
{
"size": 0,
"aggs": {
"appli": {
"date_histogram": {
"field": "end_date_ut",
"interval": "month"
}
}
}
}
And here is a part of the result:
"hits": {
"total": 517478,
"max_score": 0,
"hits": []
},
"aggregations": {
"appli": {
"buckets": [
{
"key_as_string": "2009-08-01T00:00:00.000Z",
"key": 1249084800000,
"doc_count": 0
},
{
"key_as_string": "2009-09-01T00:00:00.000Z",
"key": 1251763200000,
"doc_count": 1
},
{
"key_as_string": "2009-10-01T00:00:00.000Z",
"key": 1254355200000,
"doc_count": 2362
},
{
"key_as_string": "2009-11-01T00:00:00.000Z",
"key": 1257033600000,
"doc_count": 5336
},
{
"key_as_string": "2009-12-01T00:00:00.000Z",
"key": 1259625600000,
"doc_count": 7536
},
{
"key_as_string": "2010-01-01T00:00:00.000Z",
"key": 1262304000000,
"doc_count": 8864
}
The problem is that I have too many buckets (results). When I'm using "terms aggregation", I don't have any problems because I can set a size, but with "date_histogram aggregation" I can't find a way to put a limit on my query result.
{
"size": 0,
"aggs": {
"by_minute": {
"date_histogram": {
"field": "createTime",
"interval": "1m",
"order": {
"_count": "desc"
}
},
"aggs": {
"top2": {
"bucket_sort": {
"sort": [],
"size": 2
}
}
}
}
}
}
{
"took": 28,
"timed_out": false,
"_shards": {
"total": 2,
"successful": 2,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 999999,
"max_score": 0.0,
"hits": []
},
"aggregations": {
"by_minute": {
"buckets": [
{
"key_as_string": "2019-12-21T16:13:00.000Z",
"key": 1576944780000,
"doc_count": 6374
},
{
"key_as_string": "2019-12-21T16:10:00.000Z",
"key": 1576944600000,
"doc_count": 6327
}
]
}
}
}
I suggest setting min_doc_count to 1 so that only buckets that have data are included, i.e. the buckets with 0 documents would not come back in the response.
GET inc/_search
{
"size": 0,
"aggs": {
"appli": {
"date_histogram": {
"field": "end_date_ut",
"interval": "month",
"min_doc_count": 1
}
}
}
}
If possible, you can also add a range query to restrict the time interval over which the aggregation runs.
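Combining both suggestions could look roughly like the Python sketch below, which builds the request body as a dict. The helper name and the example dates are assumptions chosen for illustration; the field name and the `interval` setting are taken from the query above.

```python
def monthly_histogram_body(start, end, field="end_date_ut"):
    """Body for a monthly date_histogram limited to the range [start, end)."""
    return {
        "size": 0,
        # Only documents inside the range reach the aggregation.
        "query": {"range": {field: {"gte": start, "lt": end}}},
        "aggs": {
            "appli": {
                "date_histogram": {
                    "field": field,
                    "interval": "month",
                    # Skip months without any documents.
                    "min_doc_count": 1,
                }
            }
        },
    }


# Example dates are placeholders, not values from the question.
body = monthly_histogram_body("2009-09-01", "2010-01-01")
print(body["query"])
```

On recent Elasticsearch versions `interval` is deprecated in favor of `calendar_interval`, so the key may need adjusting there.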
Say I have the following documents:
1st doc:
{
"productName": "product1",
tags: [
{
"name":"key1",
"value":"value1"
},
{
"name":"key2",
"value":"value2"
}
]
}
2nd doc:
{
"productName": "product2",
tags: [
{
"name":"key1",
"value":"value1"
},
{
"name":"key2",
"value":"value3"
}
]
}
I know if I want to group by productName, I could use a terms aggregation
"terms": {
"field": "productName"
}
which will give me two buckets with two different keys "product1", "product2".
However, what should the query be if I would like to group by tag key? i.e. I would like to group by tag with name==key1, then I am expecting one bucket with key="value1"; while if I group by tag with name==key2, I am expecting the result to be two buckets with keys "value2", "value3".
What should the query look like if I would like to group by the 'value' inside a nested array but not group by the 'key'? Any suggestion?
It sounds like a nested terms aggregation is what you're looking for.
With the two documents you posted, this query:
POST /test_index/_search
{
"size": 0,
"aggs": {
"product_name_terms": {
"terms": {
"field": "product_name"
}
},
"nested_tags": {
"nested": {
"path": "tags"
},
"aggs": {
"tags_name_terms": {
"terms": {
"field": "tags.name"
}
},
"tags_value_terms": {
"terms": {
"field": "tags.value"
}
}
}
}
}
}
returns this:
{
"took": 67,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0,
"hits": []
},
"aggregations": {
"product_name_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": []
},
"nested_tags": {
"doc_count": 4,
"tags_name_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "key1",
"doc_count": 2
},
{
"key": "key2",
"doc_count": 2
}
]
},
"tags_value_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "value1",
"doc_count": 2
},
{
"key": "value2",
"doc_count": 1
},
{
"key": "value3",
"doc_count": 1
}
]
}
}
}
}
Here is some code I used to test it:
http://sense.qbox.io/gist/a9a172f41dbd520d5e61063a9686055681110522
EDIT: Filter by Nested Value
As per your comment, if you want to filter the nested results by a value (of the nested results), you can add another "layer" of aggregation making use of the filter aggregation as follows:
POST /test_index/_search
{
"size": 0,
"aggs": {
"nested_tags": {
"nested": {
"path": "tags"
},
"aggs": {
"filter_tag_name": {
"filter": {
"term": {
"tags.name": "key1"
}
},
"aggs": {
"tags_name_terms": {
"terms": {
"field": "tags.name"
}
},
"tags_value_terms": {
"terms": {
"field": "tags.value"
}
}
}
}
}
}
}
}
which returns:
{
"took": 10,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0,
"hits": []
},
"aggregations": {
"nested_tags": {
"doc_count": 4,
"filter_tag_name": {
"doc_count": 2,
"tags_name_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "key1",
"doc_count": 2
}
]
},
"tags_value_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "value1",
"doc_count": 2
}
]
}
}
}
}
}
Here's the updated code:
http://sense.qbox.io/gist/507c3aabf36b8f6ed8bb076c8c1b8552097c5458
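To group by the values of an arbitrary tag name rather than the hard-coded "key1", the filtered query above can be generated. The Python sketch below does that and pulls the bucket keys out of the response. It assumes `tags` is mapped as a `nested` field, which the nested aggregation requires; both function names are made up for illustration.

```python
def values_for_tag(tag_name, path="tags"):
    """Body that groups by tags.value, restricted to tags with the given name."""
    return {
        "size": 0,
        "aggs": {
            "nested_tags": {
                "nested": {"path": path},
                "aggs": {
                    "filter_tag_name": {
                        "filter": {"term": {path + ".name": tag_name}},
                        "aggs": {
                            "tags_value_terms": {
                                "terms": {"field": path + ".value"}
                            }
                        },
                    }
                },
            }
        },
    }


def bucket_keys(response):
    """Keys of the value buckets in a response to the body above."""
    filtered = response["aggregations"]["nested_tags"]["filter_tag_name"]
    return [bucket["key"] for bucket in filtered["tags_value_terms"]["buckets"]]


# Trimmed copy of the filtered response shown above (tag name "key1").
response = {"aggregations": {"nested_tags": {"doc_count": 4, "filter_tag_name": {
    "doc_count": 2,
    "tags_value_terms": {"buckets": [{"key": "value1", "doc_count": 2}]},
}}}}

print(bucket_keys(response))  # ['value1']
```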