Getting _id fields of aggregated records in Elastic Search - elasticsearch

I am using ES to aggregate results on a field. In addition to that, I would like to retrieve the _id of the records that went into each aggregated bucket. Is that possible?
For example, for the following query
{
  "aggs" : {
    "genders" : {
      "terms" : { "field" : "gender" }
    }
  }
}
the response would be something like this
{
  ...
  "aggregations" : {
    "genders" : {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets" : [
        {
          "key" : "male",
          "doc_count" : 14
        },
        {
          "key" : "female",
          "doc_count" : 14
        }
      ]
    }
  }
}
Now, here I want the _id of all the 14 male and 14 female records that make up the aggregation as well.
Why would I need that?
Say, because I need to do some post-processing on these records, i.e. insert a new field into them based on their gender. Of course, it's not as trivial as that, but my use case is something along those lines.
Thanks in advance!

Create a nested aggregation, something like:
{
  "aggs" : {
    "genders" : {
      "terms" : { "field" : "gender" },
      "aggs" : {
        "ids" : {
          "terms" : { "field" : "_uid" }
        }
      }
    }
  }
}
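Note that `_uid` only exists in older Elasticsearch versions (it was replaced by `_id` in 6.x, and aggregating on `_id` is discouraged there). A more portable way to see which documents fall into each bucket is a `top_hits` sub-aggregation, which returns the actual hits, including their `_id`, per bucket. A sketch (the sub-aggregation name `docs` and the `size` are arbitrary choices, not from the question):

```json
{
  "aggs": {
    "genders": {
      "terms": { "field": "gender" },
      "aggs": {
        "docs": {
          "top_hits": {
            "size": 100,
            "_source": false
          }
        }
      }
    }
  }
}
```

Each bucket then contains a `hits` array whose entries carry the `_id` of the matching documents.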

Related

How can I aggregate the whole field value in Elasticsearch

I am using Elasticsearch 7.15 and need to aggregate on a field and sort the buckets by document count.
My document saved in Elasticsearch looks like:
{
  "logGroup" : "/aws/lambda/myLambda1",
  ...
},
{
  "logGroup" : "/aws/lambda/myLambda2",
  ...
}
I need to find out which logGroup has the most documents. In order to do that, I tried to use an aggregation in Elasticsearch:
GET /my-index/_search?size=0
{
  "aggs": {
    "types_count": {
      "terms": {
        "field": "logGroup",
        "size": 10000
      }
    }
  }
}
The output of this query looks like:
"aggregations" : {
  "types_count" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      {
        "key" : "aws",
        "doc_count" : 26303620
      },
      {
        "key" : "lambda",
        "doc_count" : 25554470
      },
      {
        "key" : "myLambda1",
        "doc_count" : 25279201
      }
      ...
  }
}
As you can see from the above output, the logGroup value is split into terms and aggregated per term, not as a whole string. Is there a way to aggregate on the whole string?
I expect the output looks like:
"buckets" : [
  {
    "key" : "/aws/lambda/myLambda1",
    "doc_count" : 26303620
  },
  {
    "key" : "/aws/lambda/myLambda2",
    "doc_count" : 25554470
  },
The logGroup field in the index mapping is:
"logGroup" : {
  "type" : "text",
  "fielddata" : true
},
Can I achieve it without updating the index?
In order to get what you expect you need to change your mapping to this:
"logGroup" : {
  "type" : "keyword"
},
Failing to do that, your log groups will get analyzed by the standard analyzer, which splits the whole string into tokens, and you won't be able to aggregate on full log groups.
If you don't want or can't change the mapping and reindex everything, what you can do is the following:
First, add a keyword sub-field to your mapping, like this:
PUT /my-index/_mapping
{
  "properties": {
    "logGroup" : {
      "type" : "text",
      "fields": {
        "keyword": {
          "type" : "keyword"
        }
      }
    }
  }
}
And then run the following so that all existing documents pick up this new field:
POST my-index/_update_by_query?wait_for_completion=false
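With `wait_for_completion=false`, the call returns immediately with a task ID and the update runs in the background; progress can be checked through the tasks API. A sketch (the task ID below is a made-up placeholder):

```
POST my-index/_update_by_query?wait_for_completion=false

{ "task": "oTUltX4IQMOUUVeiohTt8A:12345" }

GET _tasks/oTUltX4IQMOUUVeiohTt8A:12345
```

The aggregation on the new sub-field will only see documents indexed (or updated) after the mapping change, which is why this step is needed.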
Finally, you'll be able to achieve what you want with the following query:
GET /my-index/_search
{
  "size": 0,
  "aggs": {
    "types_count": {
      "terms": {
        "field": "logGroup.keyword",
        "size": 10000
      }
    }
  }
}

ElasticSearch: Get all elements where a parameter is not unique

I know there is an aggregation to get the count of all unique value for a field.
For example
{
  "query" : {
    "match_all" : {}
  },
  "aggs" : {
    "type_count" : {
      "cardinality" : {
        "field" : "name"
      }
    }
  },
  "size": 0
}
With this query I get the count of all unique names.
But what I want is the list of all the names that are in the index more than once.
I want all the non unique names.
What is the best way to achieve that?
You can use the terms aggregation with a min_doc_count of 2, like this:
{
  "query" : {
    "match_all" : {}
  },
  "aggs" : {
    "type_count" : {
      "terms" : {
        "field" : "name",
        "min_doc_count": 2
      }
    }
  },
  "size": 0
}
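If you also need the documents behind each duplicated name, not just the names themselves, a `top_hits` sub-aggregation can be added under the terms aggregation. A sketch (the sub-aggregation name `duplicate_docs` and its `size` are arbitrary):

```json
{
  "size": 0,
  "aggs": {
    "type_count": {
      "terms": {
        "field": "name",
        "min_doc_count": 2
      },
      "aggs": {
        "duplicate_docs": {
          "top_hits": { "size": 10 }
        }
      }
    }
  }
}
```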

Error:Class cast exception in elastic search while sorting buckets in aggregation

Error:
ClassCastException[org.elasticsearch.search.aggregations.support.ValuesSource$Bytes$WithOrdinals$FieldData cannot be cast to org.elasticsearch.search.aggregations.support.ValuesSource$Numeric]}{[vTHdFzpuTEGMGR8MES_b9g]
My Query:
GET _search
{
  "size" : 0,
  "query" : {
    "filtered" : {
      "query" : {
        "dis_max" : {
          "tie_breaker" : 0.7,
          "queries" : [ {
            "bool" : {
              "should" : [ {
                "match" : {
                  "post.body" : {
                    "query" : "check",
                    "type" : "boolean"
                  }
                }
              }, {
                "match" : {
                  "post.parentBody" : {
                    "query" : "check",
                    "type" : "boolean",
                    "boost" : 2.0
                  }
                }
              } ]
            }
          } ]
        }
      }
    }
  },
  "aggregations" : {
    "by_parent_id" : {
      "terms" : {
        "field" : "post.parentId",
        "order" : {
          "max_score" : "desc"
        }
      },
      "aggregations" : {
        "max_score" : {
          "max" : {}
        },
        "top_post" : {
          "top_hits" : {
            "size" : 1
          }
        }
      }
    }
  }
}
I want to sort the buckets by max_score rather than by doc_count, which is Elasticsearch's default behaviour.
I am aggregating posts (which contain body and parentBody) by parentId, sorting the buckets by max_score, and fetching the top_hits within each bucket. I get the above error as soon as I define the max_score aggregation; everything else works if I remove it. Every post object has parentId, body and parentBody. I used the following references while writing this:
Elasticsearch Aggregation: How to Sort Bucket Order
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-top-hits-aggregation.html#_field_collapse_example
What am I doing wrong? I have shared the query above.
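No answer is included here, but the error itself is telling: `"max" : {}` has no field or script of its own, so the sub-aggregation inherits the values source of the enclosing terms aggregation (`post.parentId`, a string field) and then fails to cast it to a numeric source. On the 1.x-era syntax this query uses (note the `filtered` query), one common fix is to point the `max` aggregation at the score explicitly via a script; a hedged sketch of just the sub-aggregations block:

```json
"aggregations" : {
  "max_score" : {
    "max" : { "script" : "_score" }
  },
  "top_post" : {
    "top_hits" : { "size" : 1 }
  }
}
```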

How to get all unique tags from 2 collections in Elasticsearch?

I have a set of tags stored in document.tags and document.fields.articleTags.
This is how I get all the tags from both namespaces, but how can I get the result merged into one array in the response from ES?
{
  "query" : {
    "match_all" : { }
  },
  "size": 0,
  "aggs" : {
    "tags" : {
      "terms" : { "field" : "tags" }
    },
    "articleTags" : {
      "terms" : { "field" : "articleTags" }
    }
  }
}
Result
I get the tags listed in articleTags.buckets and tags.buckets. Is it possible to have the result delivered in one bucket?
{
  "aggregations": {
    "articleTags": {
      "buckets": [
        {
          "key": "halloween"
        }
      ]
    },
    "tags": {
      "buckets": [
        {
          "key": "news"
        }
Yes, you can, by using a single terms aggregation with a script that "joins" the two arrays (i.e. concatenates them). It goes like this:
{
  "query" : {
    "match_all" : { }
  },
  "size": 0,
  "aggs" : {
    "all_tags" : {
      "terms" : { "script" : "doc.tags.values + doc.articleTags.values" }
    }
  }
}
Note that you need to make sure to enable dynamic scripting in order for this query to work.
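A script-free alternative, if the mapping can be changed, is `copy_to`: both tag fields are copied into one combined field at index time, and the terms aggregation runs on that field instead. A sketch (the combined field name `all_tags` and the `keyword` types are assumptions; existing documents must be reindexed, e.g. via `_update_by_query`, before they show up in the combined field):

```json
PUT /my-index/_mapping
{
  "properties": {
    "tags":        { "type": "keyword", "copy_to": "all_tags" },
    "articleTags": { "type": "keyword", "copy_to": "all_tags" },
    "all_tags":    { "type": "keyword" }
  }
}

GET /my-index/_search
{
  "size": 0,
  "aggs": {
    "all_tags": { "terms": { "field": "all_tags" } }
  }
}
```

Since the join happens at index time rather than per-document at query time, this also avoids the performance cost of scripting.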

Elasticsearch, how to return unique values of two fields

I have an index with 20 different fields. I need to be able to pull the documents for which the combination of the fields "cat" and "sub" is unique.
In SQL it would look like this: SELECT DISTINCT cat, sub FROM table_A;
I can do it for one field this way:
{
  "size": 0,
  "aggs" : {
    "unique_set" : {
      "terms" : { "field" : "cat" }
    }
  }
}
but how do I add another field to check uniqueness across two fields?
Thanks,
SQL's SELECT DISTINCT [cat], [sub] can be imitated with a Composite Aggregation.
{
  "size": 0,
  "aggs": {
    "cat_sub": {
      "composite": {
        "sources": [
          { "cat": { "terms": { "field": "cat" } } },
          { "sub": { "terms": { "field": "sub" } } }
        ]
      }
    }
  }
}
Returns...
"buckets" : [
  {
    "key" : {
      "cat" : "a",
      "sub" : "x"
    },
    "doc_count" : 1
  },
  {
    "key" : {
      "cat" : "a",
      "sub" : "y"
    },
    "doc_count" : 2
  },
  {
    "key" : {
      "cat" : "b",
      "sub" : "y"
    },
    "doc_count" : 3
  }
]
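Unlike a plain terms aggregation, composite buckets are paginated: each response carries an `after_key`, and the next page is fetched by passing it back via `after`. A sketch, with an assumed page `size` and the last key from the buckets above:

```json
{
  "size": 0,
  "aggs": {
    "cat_sub": {
      "composite": {
        "size": 100,
        "after": { "cat": "b", "sub": "y" },
        "sources": [
          { "cat": { "terms": { "field": "cat" } } },
          { "sub": { "terms": { "field": "sub" } } }
        ]
      }
    }
  }
}
```

Repeating this until an empty `buckets` array comes back enumerates every distinct pair, which a size-limited terms aggregation cannot guarantee.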
Probably the only way to solve this is with nested aggregations:
{
  "size": 0,
  "aggs" : {
    "unique_set_1" : {
      "terms" : {
        "field" : "cat"
      },
      "aggregations" : {
        "unique_set_2": {
          "terms": { "field": "sub" }
        }
      }
    }
  }
}
Quote:
I need to be able to pull unique docs where combination of fields "cat" and "sub" are unique.
As stated, this is ambiguous: you can have tens of unique pairs {cat, sub}, hundreds of unique triplets {cat, sub, field_3}, and thousands of unique documents Doc{cat, sub, field3, field4, ...}.
If you are interested in the number of unique pairs {"Category X", "Subcategory Y"}, you can use a Cardinality aggregation. For two or more fields you will need to use scripting, which comes with a performance hit.
Example:
{
  "aggs" : {
    "multi_field_cardinality" : {
      "cardinality" : {
        "script": "doc['cat'].value + ' _my_custom_separator_ ' + doc['sub'].value"
      }
    }
  }
}
Alternate solution: use nested terms aggregations.