Terms aggregation on an inner object and retrieving bucket metadata - elasticsearch

We index the following products:
{
"id": "1",
"name": "the-name",
"categories": [
{
"id" : 10,
"name" : "cat-1"
},
{
"id" : 20,
"name" : "cat-2"
}
]
}
We aggregate on categories.id using:
REQUEST:
//...
"aggs": {
"by_cat": {
"terms": {
"field": "categories.id",
"size": 10
}
}
}
---
RESPONSE:
// ...
"by_cat" : {
"buckets" : [
{
"key" : 10,
"doc_count" : 804
},
{
"key" : 20,
"doc_count" : 327
},
This works well; however, each bucket contains only the categories.id in the key field. We would like each bucket to also carry the name of the category, for example:
// ...
"buckets" : [
{
"key" : 10,
"metadata": {
"name": "cat-1"
},
"doc_count" : 804
},
{
"key" : 20,
"metadata": {
"name": "cat-2"
},
"doc_count" : 327
},
What is the right way to do that? We found two ways to get this information, but both look "hackish":
Using top_hits with size 1 and the source limited to categories, which retrieves one document per bucket containing the information we need. This does not look efficient, and the more aggregations we have, the more bloated the response becomes.
Adding a new field id_name that concatenates id and name, and running the terms aggregation on it. This looks even more like a hack, and may get complicated if many fields are involved.
We also tried mixing field and script in the terms aggregation, but it did not help.
The metadata option looked like exactly what we wanted, but it is global to all the buckets rather than dynamic per bucket.
Is there another way to retrieve this information?
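For illustration only, here is a minimal sketch of what the client side of the second workaround (aggregating on a concatenated id_name field) could look like. The separator "|" and the helper name splitIdNameBuckets are assumptions made for this example, not part of the original setup; the function only reshapes buckets into the desired key/metadata/doc_count form.

```javascript
// Hypothetical separator used when building the combined id_name value at index time.
const SEPARATOR = "|";

// Turn buckets keyed by "<id>|<name>" into { key, metadata: { name }, doc_count }.
function splitIdNameBuckets(buckets, separator = SEPARATOR) {
  return buckets.map((bucket) => {
    const raw = String(bucket.key);
    const cut = raw.indexOf(separator);
    // Split on the first separator only, so a name may itself contain the separator.
    const id = cut === -1 ? raw : raw.slice(0, cut);
    const name = cut === -1 ? null : raw.slice(cut + separator.length);
    return { key: Number(id), metadata: { name }, doc_count: bucket.doc_count };
  });
}

// Hand-written sample buckets, shaped like a terms aggregation on id_name.
const sampleIdNameBuckets = [
  { key: "10|cat-1", doc_count: 804 },
  { key: "20|cat-2", doc_count: 327 },
];
console.log(splitIdNameBuckets(sampleIdNameBuckets));
```

This keeps the response small compared with top_hits, at the cost of maintaining the extra field at index time.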

Related

Counting unique buckets from aggregation

I am trying to get the unique count for all labels used on a set of documents. To do that, and to have the JSON returned in buckets (cardinality doesn't return the keys and the count together), I need to write a pipeline query.
My query gets me halfway there, but I'm missing the second part, which counts the number of buckets a label is in.
Here's my query:
{
"size": 0,
"aggs": {
"unique_count": {
"composite": {
"sources": [
{ "metadataId": { "terms": { "field": "document.metadata.id" } } },
{ "label": { "terms": { "field": "document.label" } } }
]
}
}
}
}
This produces
...
"buckets" : [
{
"key" : {
"metadataId" : "1",
"label" : "label one"
},
"doc_count" : 2
},
{
"key" : {
"metadataId" : "2",
"label" : "label one"
},
"doc_count" : 1
},
{
"key" : {
"metadataId" : "3",
"label" : "label three"
},
"doc_count" : 3
}
]
...
The problem I'm facing is that each bucket is considered unique, and what I would like to return is the number of such unique buckets per label. For example, in the buckets above the label "label one" is contained within two buckets, so its doc_count should be 2, while "label three" should have a doc_count of 1.
After the last phase in the pipeline I'd like to see the following output:
"buckets" : [
{
"label" : "label one",
"doc_count" : 2
},
{
"label" : "label three",
"doc_count" : 1
}
]
I've tried all sorts of things, but they're just not getting me close to the output I need. Can anyone point me in the right direction?
Try nested terms aggregations, where the first-level aggregation is on the label field and the second-level one is on the metadataId field. The aggs block should look something like:
"aggs" : {
"labels": {
"terms": {
"field": "label.keyword",
"size": 1000
},
"aggs": {
"metadata": {
"terms": {
"field": "metadataId.keyword",
"size": 1000
}
}
}
}
}
In the output, you get one bucket per label, with the label value as key and doc_count set to the number of docs matching that label. Each label bucket contains nested metadataId buckets, with the metadataId value as key and doc_count set to the number of docs matching both that label and that metadataId. The number of metadataId buckets under a label is the unique count you are looking for.
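To get from that response to the output the question asks for, the nested buckets can be collapsed client-side. The following is a minimal sketch, assuming the aggregation names labels and metadata from the query above; the function name countMetadataPerLabel and the sample data are made up for illustration.

```javascript
// Count how many distinct metadataId buckets each label bucket contains.
// Assumes the aggregation names "labels" and "metadata" used in the answer.
function countMetadataPerLabel(aggregations) {
  return aggregations.labels.buckets.map((labelBucket) => ({
    label: labelBucket.key,
    doc_count: labelBucket.metadata.buckets.length,
  }));
}

// Hand-written sample shaped like a nested terms response (not real output).
const sampleLabelAggregations = {
  labels: {
    buckets: [
      {
        key: "label one",
        doc_count: 3,
        metadata: { buckets: [{ key: "1", doc_count: 2 }, { key: "2", doc_count: 1 }] },
      },
      {
        key: "label three",
        doc_count: 3,
        metadata: { buckets: [{ key: "3", doc_count: 3 }] },
      },
    ],
  },
};
console.log(countMetadataPerLabel(sampleLabelAggregations));
```

Note that the count is capped by the size of the inner terms aggregation (1000 in the query above), so it is only exact when a label has fewer distinct metadataId values than that.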

Elasticsearch aggregation by array size

I need some statistics from Elasticsearch, but I can't work out the request.
I would like to know the number of people per appointment.
Sample document from the appointment index:
{
"id" : "383577",
"persons" : [
{
"id" : "1"
},
{
"id" : "2"
}
]
}
What I would like:
"buckets" : [
{
"key" : "1", <--- appointment of 1 person
"doc_count" : 1241891
},
{
"key" : "2", <--- appointment of 2 persons
"doc_count" : 10137
},
{
"key" : "3", <--- appointment of 3 persons
"doc_count" : 8064
}
]
Thank you
The easiest way to do this is to create another integer field containing the length of the persons array and aggregating on that field.
{
"id" : "383577",
"personsCount": 2, <---- add this field
"persons" : [
{
"id" : "1"
},
{
"id" : "2"
}
]
}
A less efficient alternative is to use a script that returns the length of the persons array at query time. Be aware that this can harm your cluster, depending on the volume of data you have:
GET /_search
{
"aggs": {
"persons": {
"terms": {
"script": "doc['persons.id'].size()"
}
}
}
}
If you want to update all your documents to create that field you can do it like this:
POST index/_update_by_query
{
"script": {
"source": "ctx._source.personsCount = ctx._source.persons.length"
}
}
However, you'll also need to modify the logic of your indexing application to create that new field.
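On the application side, that change can be as small as deriving the field just before the document is sent for indexing. A minimal sketch, assuming appointments are plain objects shaped like the sample above; the helper name withPersonsCount is hypothetical.

```javascript
// Return a copy of the appointment with personsCount derived from the persons array.
function withPersonsCount(appointment) {
  // Treat a missing or non-array persons field as an empty list.
  const persons = Array.isArray(appointment.persons) ? appointment.persons : [];
  return { ...appointment, personsCount: persons.length };
}

const sampleAppointment = { id: "383577", persons: [{ id: "1" }, { id: "2" }] };
console.log(withPersonsCount(sampleAppointment));
```

A plain terms aggregation on personsCount then gives one bucket per group size without any scripting at query time.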

Elasticsearch Aggregation most common list of integers

I am looking for an Elasticsearch aggregation and mapping
that will return the most common list for a certain field.
For example, for these docs:
{"ToneCurvePV2012": [1,2,3]}
{"ToneCurvePV2012": [1,5,6]}
{"ToneCurvePV2012": [1,7,8]}
{"ToneCurvePV2012": [1,2,3]}
I would like the aggregation result to be:
[1,2,3] (since it appears twice).
So far, every aggregation I have tried returns: 1
This is not possible with the default terms aggregation. You need to use a terms aggregation with a script. Please note that this might impact your cluster performance.
Here, I have used a script that builds a string from the array and uses it as the aggregation key. So if the array value is [1,2,3], the script creates the string representation '[1,2,3]', and that key is used for the aggregation.
Below is a sample query you can use to get the aggregation you expect:
POST index1/_search
{
"size": 0,
"aggs": {
"tone_s": {
"terms": {
"script": {
"source": "def value='['; for(int i=0;i<doc['ToneCurvePV2012'].length;i++){value= value + doc['ToneCurvePV2012'][i] + ',';} value+= ']'; value = value.replace(',]', ']'); return value;"
}
}
}
}
}
Output:
{
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"tone_s" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "[1,2,3]",
"doc_count" : 2
},
{
"key" : "[1,5,6]",
"doc_count" : 1
},
{
"key" : "[1,7,8]",
"doc_count" : 1
}
]
}
}
}
PS: the key comes back as a string, not as an array, in the aggregation response.
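Since the keys are strings such as "[1,2,3]", the client can turn the top bucket back into an array. A small sketch, assuming the bucket keys have exactly the format produced by the script above (which happens to be valid JSON); the function name mostCommonList is made up for this example.

```javascript
// Pick the bucket with the highest doc_count and parse its key back into an array.
function mostCommonList(buckets) {
  if (buckets.length === 0) return null;
  const top = buckets.reduce((best, bucket) =>
    bucket.doc_count > best.doc_count ? bucket : best
  );
  // Keys look like "[1,2,3]", so JSON.parse is enough to recover the list.
  return JSON.parse(top.key);
}

// Buckets copied from the output shown above.
const sampleToneBuckets = [
  { key: "[1,2,3]", doc_count: 2 },
  { key: "[1,5,6]", doc_count: 1 },
  { key: "[1,7,8]", doc_count: 1 },
];
console.log(mostCommonList(sampleToneBuckets));
```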

How to get distinct results in ElasticSearch if a field is the same

I have an Elasticsearch 1.4 service with an index of 40M records.
Some records share the same parent_id. I would like to get only one result per parent.
Ex:
{
"id": "7835",
"isbn": "3985",
"parent_id": "7819"
},
{
"id": "1835",
"isbn": "4935",
"parent_id": "7719"
},
{
"id": "2835",
"isbn": "9985",
"parent_id": "7819"
}
The expected result that I would like to have is:
{
"id": "7835",
"isbn": "3985",
"parent_id": "7819"
},
{
"id": "1835",
"isbn": "4935",
"parent_id": "7719"
}
I have checked out aggregations:
ElasticSearch - Return Unique Values
{
"aggs" : {
"parentId" : {
"terms" : { "field" : "parent_id" }
}
}
}
However, the response still shows all 3 items (so the last one doesn't get ignored). The aggregations section that follows contains term buckets per key, which is not useful to me: it only seems to tell me how many documents there are per key, which is not the desired output.
To avoid returning the original documents, add "size": 0 above the aggregation.
The buckets field of the response then shows only the number of documents per parent_id.
{
"size" : 0,
"aggs" : {
"parentId" : {
"terms" : { "field" : "parent_id" }
}
}
}
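The buckets alone only give counts. One possible way to also get a single document per parent_id, sketched here under assumptions, is to nest a top_hits sub-aggregation of size 1 in each bucket and read the document back out of the response. This is not part of the answer above: the sub-aggregation name one_per_parent and both helper names are hypothetical, it assumes a version that supports top_hits, and with many distinct parents the terms size limits how many are returned.

```javascript
// Build a request body: one bucket per parent_id, one document per bucket.
// "one_per_parent" is a hypothetical sub-aggregation name.
function buildOnePerParentBody(maxParents = 10) {
  return {
    size: 0,
    aggs: {
      parentId: {
        terms: { field: "parent_id", size: maxParents },
        aggs: { one_per_parent: { top_hits: { size: 1 } } },
      },
    },
  };
}

// Pull the single _source document out of every parent_id bucket.
function firstDocPerParent(response) {
  return response.aggregations.parentId.buckets
    .map((bucket) => bucket.one_per_parent.hits.hits[0])
    .filter((hit) => hit !== undefined)
    .map((hit) => hit._source);
}

// Hand-written response fragment for illustration (not real output).
const sampleParentResponse = {
  aggregations: {
    parentId: {
      buckets: [
        {
          key: "7819",
          doc_count: 2,
          one_per_parent: {
            hits: { hits: [{ _source: { id: "7835", isbn: "3985", parent_id: "7819" } }] },
          },
        },
        {
          key: "7719",
          doc_count: 1,
          one_per_parent: {
            hits: { hits: [{ _source: { id: "1835", isbn: "4935", parent_id: "7719" } }] },
          },
        },
      ],
    },
  },
};
console.log(firstDocPerParent(sampleParentResponse));
```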

Show all Elasticsearch aggregation results/buckets and not just 10

I'm trying to list all buckets on an aggregation, but it seems to be showing only the first 10.
My search:
curl -XPOST "http://localhost:9200/imoveis/_search?pretty=1" -d'
{
"size": 0,
"aggregations": {
"bairro_count": {
"terms": {
"field": "bairro.raw"
}
}
}
}'
Returns:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 16920,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"bairro_count" : {
"buckets" : [ {
"key" : "Barra da Tijuca",
"doc_count" : 5812
}, {
"key" : "Centro",
"doc_count" : 1757
}, {
"key" : "Recreio dos Bandeirantes",
"doc_count" : 1027
}, {
"key" : "Ipanema",
"doc_count" : 927
}, {
"key" : "Copacabana",
"doc_count" : 842
}, {
"key" : "Leblon",
"doc_count" : 833
}, {
"key" : "Botafogo",
"doc_count" : 594
}, {
"key" : "Campo Grande",
"doc_count" : 456
}, {
"key" : "Tijuca",
"doc_count" : 361
}, {
"key" : "Flamengo",
"doc_count" : 328
} ]
}
}
}
I have many more than 10 keys for this aggregation. In this example I'd have 145 keys, and I want the count for each of them. Is there some pagination on buckets? Can I get all of them?
I'm using Elasticsearch 1.1.0
The size parameter has to be set inside the terms aggregation, for example:
curl -XPOST "http://localhost:9200/imoveis/_search?pretty=1" -d'
{
"size": 0,
"aggregations": {
"bairro_count": {
"terms": {
"field": "bairro.raw",
"size": 10000
}
}
}
}'
Use size: 0 in the terms aggregation for ES version 2 and prior.
Setting size: 0 is deprecated from 2.x onwards because of the memory pressure it puts on your cluster with high-cardinality field values. You can read more about it in the corresponding GitHub issue.
It is recommended to explicitly set a reasonable value for size, a number between 1 and 2147483647.
How to show all buckets?
{
"size": 0,
"aggs": {
"aggregation_name": {
"terms": {
"field": "your_field",
"size": 10000
}
}
}
}
Note
"size": 10000 (inside terms) returns at most 10000 buckets. The default is 10.
"size": 0 (top level) suppresses the search hits. By default, "hits" contains 10 documents, and we don't need them.
By default, the buckets are ordered by doc_count in decreasing order.
Why do I get Fielddata is disabled on text fields by default error?
Because fielddata is disabled on text fields by default. If you have not explicitly chosen a field type mapping, the field gets the default dynamic mapping for strings: a text field with a keyword sub-field.
So, instead of writing "field": "your_field", you need "field": "your_field.keyword".
If you want to get all unique values without setting a magic number (size: 10000), then use COMPOSITE AGGREGATION (ES 6.5+).
From official documentation:
"If you want to retrieve all terms or all combinations of terms in a nested terms aggregation you should use the COMPOSITE AGGREGATION which allows to paginate over all possible terms rather than setting a size greater than the cardinality of the field in the terms aggregation. The terms aggregation is meant to return the top terms and does not allow pagination."
Implementation example in JavaScript:
const ITEMS_PER_PAGE = 1000;
const body = {
"size": 0, // Returning only aggregation results: https://www.elastic.co/guide/en/elasticsearch/reference/current/returning-only-agg-results.html
"aggs" : {
"langs": {
"composite" : {
"size": ITEMS_PER_PAGE,
"sources" : [
{ "language": { "terms" : { "field": "language" } } }
]
}
}
}
};
const uniqueLanguages = [];
while (true) {
const result = await es.search(body);
const currentUniqueLangs = result.aggregations.langs.buckets.map(bucket => bucket.key);
uniqueLanguages.push(...currentUniqueLangs);
const after = result.aggregations.langs.after_key;
if (after) {
// continue paginating unique items
body.aggs.langs.composite.after = after;
} else {
break;
}
}
console.log(uniqueLanguages);
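The same loop can be wrapped in a function so that it runs outside a top-level await context and can be tried without a cluster. This is only a sketch: the search argument is an assumption standing in for whatever client call you use (any async function that takes a request body and resolves to the search response), and makeFakeSearch is a made-up in-memory stand-in used for the demo.

```javascript
// Collect every composite bucket key by following after_key until it disappears.
// `search` is injected so the loop does not depend on a specific client library.
async function collectCompositeKeys(search, aggName, source, pageSize = 1000) {
  const body = {
    size: 0,
    aggs: { [aggName]: { composite: { size: pageSize, sources: [source] } } },
  };
  const keys = [];
  for (;;) {
    const result = await search(body);
    const agg = result.aggregations[aggName];
    keys.push(...agg.buckets.map((bucket) => bucket.key));
    if (!agg.after_key) return keys;
    // Continue from the last key of the previous page.
    body.aggs[aggName].composite.after = agg.after_key;
  }
}

// In-memory stand-in for a client: serves the given pages, then an empty page.
function makeFakeSearch(pages) {
  let call = 0;
  return async () => {
    const buckets = pages[call] || [];
    call += 1;
    const last = buckets[buckets.length - 1];
    const agg = last ? { buckets, after_key: last.key } : { buckets };
    return { aggregations: { langs: agg } };
  };
}

const fakeSearch = makeFakeSearch([
  [{ key: { language: "en" }, doc_count: 3 }],
  [{ key: { language: "fr" }, doc_count: 1 }],
]);
collectCompositeKeys(fakeSearch, "langs", { language: { terms: { field: "language" } } })
  .then((keys) => console.log(keys));
```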
Increase the size in your terms aggregation (the second size) to 10000 and you will get up to 10000 buckets. By default it is set to 10.
Also, if you want to see search results as well, set the first size to 1 and you will get 1 document back, since ES supports searching and aggregating in the same request.
curl -XPOST "http://localhost:9200/imoveis/_search?pretty=1" -d'
{
"size": 1,
"aggregations": {
"bairro_count": {
"terms": {
"field": "bairro.raw",
"size": 10000
}
}
}
}'

Resources