Say I have the following documents:
1st doc:
{
productName: "product1",
tags: [
{
"name":"key1",
"value":"value1"
},
{
"name":"key2",
"value":"value2"
}
]
}
2nd doc:
{
productName: "product2",
tags: [
{
"name":"key1",
"value":"value1"
},
{
"name":"key2",
"value":"value3"
}
]
}
I know that if I want to group by productName, I can use a terms aggregation:
"terms": {
"field": "productName"
}
which will give me two buckets with the keys "product1" and "product2".
However, what should the query be if I want to group by a tag key? That is, if I group by the tag with name == key1, I expect one bucket with key "value1"; if I group by the tag with name == key2, I expect two buckets with keys "value2" and "value3".
What should the query look like if I want to group by the 'value' inside a nested array rather than by the 'key'? Any suggestions?
It sounds like a nested terms aggregation is what you're looking for.
With the two documents you posted, this query:
POST /test_index/_search
{
"size": 0,
"aggs": {
"product_name_terms": {
"terms": {
"field": "product_name"
}
},
"nested_tags": {
"nested": {
"path": "tags"
},
"aggs": {
"tags_name_terms": {
"terms": {
"field": "tags.name"
}
},
"tags_value_terms": {
"terms": {
"field": "tags.value"
}
}
}
}
}
}
returns this:
{
"took": 67,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0,
"hits": []
},
"aggregations": {
"product_name_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": []
},
"nested_tags": {
"doc_count": 4,
"tags_name_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "key1",
"doc_count": 2
},
{
"key": "key2",
"doc_count": 2
}
]
},
"tags_value_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "value1",
"doc_count": 2
},
{
"key": "value2",
"doc_count": 1
},
{
"key": "value3",
"doc_count": 1
}
]
}
}
}
}
Here is some code I used to test it:
http://sense.qbox.io/gist/a9a172f41dbd520d5e61063a9686055681110522
EDIT: Filter by Nested Value
As per your comment, if you want to filter the nested results by a value (of the nested results), you can add another "layer" of aggregation making use of the filter aggregation as follows:
POST /test_index/_search
{
"size": 0,
"aggs": {
"nested_tags": {
"nested": {
"path": "tags"
},
"aggs": {
"filter_tag_name": {
"filter": {
"term": {
"tags.name": "key1"
}
},
"aggs": {
"tags_name_terms": {
"terms": {
"field": "tags.name"
}
},
"tags_value_terms": {
"terms": {
"field": "tags.value"
}
}
}
}
}
}
}
}
which returns:
{
"took": 10,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0,
"hits": []
},
"aggregations": {
"nested_tags": {
"doc_count": 4,
"filter_tag_name": {
"doc_count": 2,
"tags_name_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "key1",
"doc_count": 2
}
]
},
"tags_value_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "value1",
"doc_count": 2
}
]
}
}
}
}
}
Here's the updated code:
http://sense.qbox.io/gist/507c3aabf36b8f6ed8bb076c8c1b8552097c5458
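To make the original expectation concrete (key1 gives one bucket, key2 gives two), here is a minimal Python sketch. The helper names `build_tag_value_agg` and `simulate_value_buckets` are made up for illustration; the first only builds the request body shown above for an arbitrary tag name, and the second is a plain-Python stand-in for what the aggregation counts, not a call to Elasticsearch.

```python
from collections import Counter

# The two sample documents from the question.
DOCS = [
    {"productName": "product1",
     "tags": [{"name": "key1", "value": "value1"},
              {"name": "key2", "value": "value2"}]},
    {"productName": "product2",
     "tags": [{"name": "key1", "value": "value1"},
              {"name": "key2", "value": "value3"}]},
]


def build_tag_value_agg(tag_name):
    """Build the body: nested -> filter on tags.name -> terms on tags.value."""
    return {
        "size": 0,
        "aggs": {
            "nested_tags": {
                "nested": {"path": "tags"},
                "aggs": {
                    "filter_tag_name": {
                        "filter": {"term": {"tags.name": tag_name}},
                        "aggs": {
                            "tags_value_terms": {
                                "terms": {"field": "tags.value"}
                            }
                        },
                    }
                },
            }
        },
    }


def simulate_value_buckets(docs, tag_name):
    """Local stand-in for the aggregation: count values of the matching tags."""
    return Counter(
        tag["value"]
        for doc in docs
        for tag in doc["tags"]
        if tag["name"] == tag_name
    )


print(dict(simulate_value_buckets(DOCS, "key1")))
print(dict(simulate_value_buckets(DOCS, "key2")))
```

Sending `build_tag_value_agg("key2")` as the search body should then give the two buckets "value2" and "value3", assuming tags is mapped as a nested field.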
I know my search matches 182 documents in total:
"hits": {
"total": {
"value": 182,
"relation": "eq"
},
"max_score": null,
"hits": []
},
Then I run an aggregation to find out how many documents have the source instagram or twitter, and it returns:
"bySource": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "instagram",
"doc_count": 162
},
{
"key": "twitter",
"doc_count": 20
}
]
}
Is it possible to get the percentage of documents for each source, twitter and instagram?
In this case, 89% of the documents have the source instagram and 11% have twitter.
My aggregation looks like this:
"aggs": {
"bySource": {
"terms": {
"field": "profile.source.keyword"
}
}
}
Let me know if this is possible.
Thank you
Sure, it is possible using the 'Bucket Script Aggregation'.
An example query might look like this:
{
"size": 0,
"aggs": {
"filters_agg": {
"filters": {
"filters": {
"sourceCount": {
"match_all": {}
}
}
},
"aggs": {
"bySource": {
"terms": {
"field": "profile.source.keyword"
}
},
"instagram_count_percentage": {
"bucket_script": {
"buckets_path": {
"instagram_count": "bySource['instagram']>_count",
"total_count": "_count"
},
"script": "Math.round((params.instagram_count * 100)/params.total_count)"
}
},
"twitter_count_percentage": {
"bucket_script": {
"buckets_path": {
"twitter_count": "bySource['twitter']>_count",
"total_count": "_count"
},
"script": "Math.round((params.twitter_count * 100)/params.total_count)"
}
}
}
}
}
}
And the response could be something like this:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 182,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"filters_agg": {
"buckets": {
"sourceCount": {
"doc_count": 182,
"bySource": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "instagram",
"doc_count": 162
},
{
"key": "twitter",
"doc_count": 20
}
]
},
"instagram_count_percentage": {
"value": 89
},
"twitter_count_percentage": {
"value": 11
}
}
}
}
}
}
Adapt it to your own use case and mapping.
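If hard-coding one bucket_script per source is inconvenient, the same arithmetic can also be done on the client from the plain terms buckets. This is only a sketch of that alternative, not part of the query above; `bucket_percentages` is a hypothetical helper name and the input is the bySource result from the question.

```python
def bucket_percentages(buckets, total):
    """Return {key: rounded percentage of total} for terms-aggregation buckets."""
    if total == 0:
        # Avoid dividing by zero when the search matched nothing.
        return {bucket["key"]: 0 for bucket in buckets}
    return {
        bucket["key"]: round(bucket["doc_count"] * 100 / total)
        for bucket in buckets
    }


# Buckets and total hit count as shown in the question.
by_source = [
    {"key": "instagram", "doc_count": 162},
    {"key": "twitter", "doc_count": 20},
]
percentages = bucket_percentages(by_source, total=182)
print(percentages)  # {'instagram': 89, 'twitter': 11}
```

This works for any number of sources without changing the query, at the cost of doing the calculation outside Elasticsearch.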
I don't know if it is possible to return additional fields in the response for each bucket.
The current request returns correct results, but I'm missing additional field information required for later processing.
{
"query": {
"bool": {
"must": {
"match_all": {}
}
}
},
"track_total_hits": true,
"from": 0,
"size": 0,
"aggs": {
"strings": {
"nested": {
"path": "filter_data.string_facet"
},
"aggs": {
"names": {
"terms": {
"field": "filter_data.string_facet.facet-name"
},
"aggs": {
"values": {
"terms": {
"field": "filter_data.string_facet.facet-value"
}
}
}
}
}
}
}
}
Here is the result. Note how the nested fields are structured in filter_data.
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 1,
"hits": [{
"_index": "my_index",
"_type": "_doc",
"_id": "7000043",
"_score": 1,
"_source": {
"item_data": {
"doc_id": 7000043,
"id": 7000043,
"live_state": 1,
"item_sku": "7000043",
"manufacturer_id": 1394
},
"filter_data": {
"string_facet": [{
"facet-name": "Thread size",
"facet-value": "G1/2",
"facet-name-id": 12,
"facet-value-id": 34
}]
}
}
}]
},
"aggregations": {
"strings": {
"doc_count": 5,
"names": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [{
"key": "Thread size",
"doc_count": 2,
"values": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [{
"key": "G1 1/4",
"doc_count": 1
}, {
"key": "G1/2",
"doc_count": 1
}]
}
}]
}
}
}
}
Is it possible to add additional fields to each bucket? Ideally the response would have the format below, with the fields facet-name-id and facet-value-id added to each bucket.
....
"buckets": [{
"key": "Thread size",
"doc_count": 2,
"values": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [{
"key": "G1 1/4",
"facet-name-id": 12,
"facet-value-id": 34,
"doc_count": 1
}, {
"key": "G1/2",
"facet-name-id": 12,
"facet-value-id": 35,
"doc_count": 1
}]
}
}]
...
If this is not possible, what would you recommend?
Thanks.
Sure, you can use top_hits as a sub-aggregation of your deepest facet-value aggregation:
POST my_index/_search?filter_path=aggregations.*.*.buckets.key,aggregations.*.*.buckets.values.buckets.key,aggregations.*.*.buckets.values.buckets.*.hits.hits._source
{
"query": {
"bool": {
"must": {
"match_all": {}
}
}
},
"track_total_hits": true,
"from": 0,
"size": 0,
"aggs": {
"strings": {
"nested": {
"path": "filter_data.string_facet"
},
"aggs": {
"names": {
"terms": {
"field": "filter_data.string_facet.facet-name"
},
"aggs": {
"values": {
"terms": {
"field": "filter_data.string_facet.facet-value"
},
"aggs": {
"my_top_hits": {
"top_hits": {
"size": 10,
"_source": ["filter_data.string_facet"]
}
}
}
}
}
}
}
}
}
}
which'd yield:
{
"aggregations" : {
"strings" : {
"names" : {
"buckets" : [
{
"key" : "Thread size",
"values" : {
"buckets" : [
{
"key" : "G1/2",
"my_top_hits" : {
"hits" : {
"hits" : [
{
"_source" : {
"facet-value" : "G1/2",
"facet-name" : "Thread size",
"facet-value-id" : 34,
"facet-name-id" : 12
}
}
]
}
}
}
]
}
}
]
}
}
}
}
Notice that my_top_hits is an array of string_facet objects instead of an object as you requested. That's because although you're already 2 facets deep (facet-name and then facet-value), there may still be multiple different facet-value-id and facet-name-id combinations covered by a given facet-value bucket.
Having said that, you can of course limit the number of top hits with the size parameter, but then you can't say with certainty whether the first top hit's facet IDs are representative of the whole bucket.
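To get closer to the flat format asked for, the top_hits output can be post-processed on the client. The sketch below assumes the response shape shown above (with the sub-aggregation named my_top_hits); `flatten_facet_buckets` is a hypothetical helper, and it flags buckets whose hits carry more than one ID pair instead of silently picking one.

```python
def flatten_facet_buckets(response, top_hits_name="my_top_hits"):
    """Return one row per facet-value bucket with the ID pairs seen in its hits."""
    rows = []
    for name_bucket in response["aggregations"]["strings"]["names"]["buckets"]:
        for value_bucket in name_bucket["values"]["buckets"]:
            hits = value_bucket[top_hits_name]["hits"]["hits"]
            # Collect the distinct (facet-name-id, facet-value-id) pairs.
            id_pairs = sorted({
                (hit["_source"]["facet-name-id"],
                 hit["_source"]["facet-value-id"])
                for hit in hits
            })
            rows.append({
                "facet-name": name_bucket["key"],
                "key": value_bucket["key"],
                "id_pairs": id_pairs,
                # True only when every returned hit agrees on the IDs.
                "unambiguous": len(id_pairs) == 1,
            })
    return rows


# Trimmed response, copied from the example output above.
response = {"aggregations": {"strings": {"names": {"buckets": [{
    "key": "Thread size",
    "values": {"buckets": [{
        "key": "G1/2",
        "my_top_hits": {"hits": {"hits": [{"_source": {
            "facet-value": "G1/2", "facet-name": "Thread size",
            "facet-value-id": 34, "facet-name-id": 12}}]}},
    }]},
}]}}}}

rows = flatten_facet_buckets(response)
for row in rows:
    print(row)
```

Note that the "unambiguous" flag only covers the hits actually returned, so it is subject to the same size caveat described above.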
Here is my query result:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 502,
"max_score": 0,
"hits": []
},
"aggregations": {
"HIGH_RISK_USERS": {
"doc_count": 1004,
"USERS_COUNT": {
"doc_count_error_upper_bound": 5,
"sum_other_doc_count": 437,
"buckets": [
{
"key": "49",
"doc_count": 502,
"NAME": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": []
}
},
{
"key": "02122219455#53.205.223.157",
"doc_count": 44,
"NAME": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "caller",
"doc_count": 42
},
{
"key": "CallFrom",
"doc_count": 2
}
]
}
},
{
"key": "+02129916178#53.205.223.157",
"doc_count": 2,
"NAME": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "caller",
"doc_count": 2
}
]
}
}
]
}
}
}
}
Here is my query:
{
"size": 0,
"query": {
"bool": {
"filter": [
{
"nested": {
"path": "x_nova_extensions.entities",
"query": {
"bool": {
"filter": [
{
"match": {
"x_nova_extensions.entities.text": "49"
}
},
{
"terms": {
"x_nova_extensions.entities.type": [
"sourceCountryCode",
"CallerIPCountryCode",
"CallerIPCountryName",
"CallerIPCountryCode",
"CallerPhoneCountryName"
]
}
}
]
}
}
}
}
]
}
},
"aggs": {
"HIGH_RISK_USERS": {
"nested": {
"path": "x_nova_extensions.entities"
},
"aggs": {
"USERS_COUNT": {
"terms": {
"field": "x_nova_extensions.entities.text",
"size": 10,
"order": {
"_count": "desc"
}
},
"aggs": {
"NAME": {
"terms": {
"field": "x_nova_extensions.entities.type",
"include": [
"caller",
"callee",
"CallFrom",
"CallTo"
]
}
}
}
}
}
}
}
}
I want my query to return only the buckets whose NAME sub-aggregation is not empty (buckets[].size > 0).
I searched the internet but couldn't find a specific keyword for this, and I'm not even sure whether Elasticsearch supports it.
Is there a keyword for this, or how else can I handle it?
Thanks
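I think what you are looking for is a pipeline aggregation, specifically bucket_selector.
With it, you can read the number of buckets in NAME and filter the parent buckets accordingly. Add it next to NAME inside the "aggs" of USERS_COUNT (the two trailing braces below close "aggs" and USERS_COUNT):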
"min_bucket_selector": {
"bucket_selector": {
"buckets_path": {
"nameCount": "NAME._bucket_count"
},
"script": {
"source": "params.nameCount != 0"
}
}
}
}
}
But please pay attention to your Elasticsearch version: the way this is applied can differ between versions.
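Since the fragment above does not show its surroundings, here is a sketch of the full "aggs" section as a Python dict, with the selector placed as a sibling of NAME under USERS_COUNT. The field names come from the question; `drop_empty_name_buckets` is a hypothetical client-side fallback for versions where the selector is not available, and the sample buckets are trimmed from the response in the question.

```python
import json

# Same selector as above: keep a USERS_COUNT bucket only if NAME has buckets.
selector = {
    "bucket_selector": {
        "buckets_path": {"nameCount": "NAME._bucket_count"},
        "script": {"source": "params.nameCount != 0"},
    }
}

aggs = {
    "HIGH_RISK_USERS": {
        "nested": {"path": "x_nova_extensions.entities"},
        "aggs": {
            "USERS_COUNT": {
                "terms": {
                    "field": "x_nova_extensions.entities.text",
                    "size": 10,
                    "order": {"_count": "desc"},
                },
                "aggs": {
                    "NAME": {
                        "terms": {
                            "field": "x_nova_extensions.entities.type",
                            "include": ["caller", "callee",
                                        "CallFrom", "CallTo"],
                        }
                    },
                    # Sibling of NAME, so the path "NAME._bucket_count" resolves.
                    "min_bucket_selector": selector,
                },
            }
        },
    }
}


def drop_empty_name_buckets(buckets):
    """Client-side fallback: keep buckets whose NAME sub-aggregation is non-empty."""
    return [bucket for bucket in buckets if bucket["NAME"]["buckets"]]


sample = [
    {"key": "49", "doc_count": 502, "NAME": {"buckets": []}},
    {"key": "02122219455#53.205.223.157", "doc_count": 44,
     "NAME": {"buckets": [{"key": "caller", "doc_count": 42},
                          {"key": "CallFrom", "doc_count": 2}]}},
]

print(json.dumps(aggs, indent=2))
print([bucket["key"] for bucket in drop_empty_name_buckets(sample)])
```

As far as I know, the selector is applied after the terms aggregation has picked its top 10 buckets, so the response may contain fewer than 10 buckets rather than the next 10 non-empty ones.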
With this mapping:
PUT pizzas
{
"mappings": {
"pizza": {
"properties": {
"name": {
"type": "keyword"
},
"types": {
"type": "nested",
"properties": {
"topping": {
"type": "keyword"
},
"base": {
"type": "keyword"
}
}
}
}
}
}
}
And this data:
PUT pizzas/pizza/1
{
"name": "meat",
"types": [
{
"topping": "bacon",
"base": "normal"
},
{
"topping": "pepperoni",
"base": "normal"
}
]
}
PUT pizzas/pizza/2
{
"name": "veg",
"types": [
{
"topping": "broccoli",
"base": "normal"
}
]
}
If I run this nested aggregation query:
GET pizzas/_search
{
"size": 0,
"aggs": {
"types_agg": {
"nested": {
"path": "types"
},
"aggs": {
"base_agg": {
"terms": {
"field": "types.base"
}
}
}
}
}
}
I get this result:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0,
"hits": []
},
"aggregations": {
"types_agg": {
"doc_count": 3,
"base_agg": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "normal",
"doc_count": 3
}
]
}
}
}
}
I expected my aggregation to return a doc_count of 2, because only two documents match my query. However, it seems that because of how nested objects are indexed, it finds 3 matches and reports them as 3 documents.
Is there any way to get it to return unique document counts?
(tested in Elasticsearch 5.4.3)
Just discovered the answer shortly after asking the question.
Changing the aggregation query to be:
GET pizzas/_search
{
"size": 0,
"aggs": {
"types_agg": {
"nested": {
"path": "types"
},
"aggs": {
"base_agg": {
"terms": {
"field": "types.base"
},
"aggs": {
"top_reverse_nested": {
"reverse_nested": {}
}
}
}
}
}
}
}
Yields the result:
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0,
"hits": []
},
"aggregations": {
"types_agg": {
"doc_count": 3,
"base_agg": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "normal",
"doc_count": 3,
"top_reverse_nested": {
"doc_count": 2
}
}
]
}
}
}
}
The important part which was added to the query was:
"aggs": {
"top_reverse_nested": {
"reverse_nested": {}
}
}
reverse_nested joins back to the root of the document, so the doc_count of top_reverse_nested counts unique parent documents (2) instead of nested objects (3).
You can read more in the reverse_nested aggregation documentation.
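The difference between the two counts can be reproduced locally. This is only a plain-Python illustration of the counting logic on the two pizzas from the question, not how Elasticsearch implements it; `count_by_base` is a made-up name.

```python
from collections import defaultdict

# The two documents from the question, keyed by document id.
PIZZAS = {
    1: {"name": "meat", "types": [{"topping": "bacon", "base": "normal"},
                                  {"topping": "pepperoni", "base": "normal"}]},
    2: {"name": "veg", "types": [{"topping": "broccoli", "base": "normal"}]},
}


def count_by_base(pizzas):
    """For each base, count nested objects and distinct parent documents."""
    nested_counts = defaultdict(int)   # what the terms bucket's doc_count reports
    parent_ids = defaultdict(set)      # what reverse_nested reports
    for doc_id, pizza in pizzas.items():
        for nested in pizza["types"]:
            nested_counts[nested["base"]] += 1
            parent_ids[nested["base"]].add(doc_id)
    return {
        base: {"doc_count": nested_counts[base],
               "top_reverse_nested": len(parent_ids[base])}
        for base in nested_counts
    }


counts = count_by_base(PIZZAS)
print(counts)  # {'normal': {'doc_count': 3, 'top_reverse_nested': 2}}
```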
How can a field of type string be included in the result set of an aggregation?
For example, given the following mapping:
{
"sport": {
"mappings": {
"runners": {
"properties": {
"name": {
"type": "string"
},
"city": {
"type": "string"
},
"region": {
"type": "string"
},
"sport": {
"type": "string"
}
}
}
}
}
}
Sample data:
curl -XPOST "http://localhost:9200/sport/_bulk" -d'
{"index":{"_index":"sport","_type":"runner"}}
{"name":"Gary", "city":"New York","region":"A","sport":"Soccer"}
{"index":{"_index":"sport","_type":"runner"}}
{"name":"Bob", "city":"New York","region":"A","sport":"Tennis"}
{"index":{"_index":"sport","_type":"runner"}}
{"name":"Mike", "city":"Atlanta","region":"B","sport":"Soccer"}
'
How can the name field be included in the result set of this aggregation?
{
"size": 0,
"aggregations": {
"agg": {
"terms": {
"field": "city"
}
}
}
}
This seems to do what you want, if I'm understanding you correctly:
POST /sport/_search
{
"size": 0,
"aggregations": {
"city_terms": {
"terms": {
"field": "city"
},
"aggs": {
"name_terms": {
"terms": {
"field": "name"
}
}
}
}
}
}
With the data you provided, it returns:
{
"took": 43,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0,
"hits": []
},
"aggregations": {
"city_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "new",
"doc_count": 2,
"name_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "bob",
"doc_count": 1
},
{
"key": "gary",
"doc_count": 1
}
]
}
},
{
"key": "york",
"doc_count": 2,
"name_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "bob",
"doc_count": 1
},
{
"key": "gary",
"doc_count": 1
}
]
}
},
{
"key": "atlanta",
"doc_count": 1,
"name_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "mike",
"doc_count": 1
}
]
}
}
]
}
}
}
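(The bucket keys are lowercased single words such as "new" and "york" because the string fields are analyzed. You may want to add "index":"not_analyzed" to one or both fields in your mapping, if these results are not what you were expecting.)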
Here's the code I used to test it:
http://sense.qbox.io/gist/07735aadc082c1c60409931c279f3fd85a340dbb
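To see why "New York" shows up as the two buckets "new" and "york", here is a rough local illustration. It assumes the default analyzer can be approximated by lowercasing and splitting on whitespace, which is a simplification; `analyzed_terms` and `city_buckets` are hypothetical names and nothing here talks to Elasticsearch.

```python
from collections import Counter

# Sample data from the question.
RUNNERS = [
    {"name": "Gary", "city": "New York", "region": "A", "sport": "Soccer"},
    {"name": "Bob", "city": "New York", "region": "A", "sport": "Tennis"},
    {"name": "Mike", "city": "Atlanta", "region": "B", "sport": "Soccer"},
]


def analyzed_terms(text):
    """Very rough stand-in for the default analyzer: lowercase, split on spaces."""
    return set(text.lower().split())


def city_buckets(runners, analyzed):
    """Count documents per indexed term of the city field."""
    counts = Counter()
    for runner in runners:
        # An analyzed field indexes each token; not_analyzed keeps the whole value.
        terms = analyzed_terms(runner["city"]) if analyzed else {runner["city"]}
        counts.update(terms)
    return counts


print(dict(city_buckets(RUNNERS, analyzed=True)))
print(dict(city_buckets(RUNNERS, analyzed=False)))
```

With analyzed=False, which corresponds to "index":"not_analyzed", the buckets are the full city names "New York" and "Atlanta".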