Get the number of unique terms in a field in elasticsearch - elasticsearch

Here are some sample documents that I have
doc1
{
"occassion" : "Birthday",
"dessert": "gingerbread"
}
doc2
{
"occassion" : "Wedding",
"dessert": "friand"
}
doc3
{
"occassion":"Bethrothal" ,
"dessert":"gingerbread"
}
When I give simple terms aggregation, on the field "dessert", i get like the results like below
"aggregations": {
"desserts": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "gingerbread",
"doc_count": 2
},
{
"key": "friand",
"doc_count": 1
}
]
}
}
}
But if the issue here is if there are many documents and I need to know how many unique keywords were existing under the field name "desserts",it would take me a lot of time to figure it out. Is there a work around to get just the number of unique terms under the specified field name?

The cardinality aggregation seems to be what you're looking for: https://www.elastic.co/guide/en/elasticsearch/guide/current/cardinality.html
Querying this:
{
"size" : 0,
"aggs" : {
"distinct_desserts" : {
"cardinality" : {
"field" : "dessert"
}
}
}
}
Would return something like this:
"aggregations": {
"distinct_desserts": {
"value": 2
}
}

I would suggest cardinality with higher precision_threshold for accurate result.
GET /cars/transactions/_search
{
"size" : 0,
"aggs" : {
"count_distinct_desserts" : {
"cardinality" : {
"field" : "dessert",
"precision_threshold" : 100
}
}
}
}

Related

Elasticsearch aggregation on a field with dynamic properties

Given the following mapping where variants are a nested type and options is a flattened type:
{
"doc_type" : "product",
"id" : 1,
"variants" : [
{
"options" : {
"Size" : "XS",
},
"price" : 1,
},
{
"options" : {
"Size" : "S",
"Material": "Wool"
},
"price" : 6.99,
},
]
}
I want to run an aggregation that produces data in the following format:
{
"variants.options.Size": {
"buckets" : [
{
"key" : "XS",
"doc_count" : 1
},
{
"key" : "S",
"doc_count" : 1
},
],
},
"variants.options.Material": {
"buckets" : [
{
"key" : "Wool",
"doc_count" : 1
}
],
},
}
I could very easily do something like:
"aggs": {
"variants.options.Size": {
"terms": {
"field": "variants.options.Size"
}
},
"variants.options.Material": {
"terms": {
"field": "variants.options.Material"
}
}
}
The caveat here is that we're using the flattened type for options because the fields in options are dynamic and so there is no way for me to know before hand that we want to aggregate on Size and Material.
Essentially, I want to tell Elasticsearch that it should aggregate on whatever keys it finds under options. Is there a way to do this?
I want to tell Elasticsearch that it should aggregate on whatever keys it finds under options. Is there a way to do this?
Not directly. I had the same question a while back. I haven't found a clean solution to this day and I'm convinced there isn't one.
Luckily, there's a scripted_metric workaround that I outlined here. Applying it to your use case:
POST your_index/_search
{
"size": 0,
"aggs": {
"dynamic_variant_options": {
"scripted_metric": {
"init_script": "state.buckets = [:];",
"map_script": """
def variants = params._source['variants'];
for (def variant : variants) {
for (def entry : variant['options'].entrySet()) {
def key = entry.getKey();
def value = entry.getValue();
def path = "variants.options." + key;
if (state.buckets.containsKey(path)) {
if (state.buckets[path].containsKey(value)) {
state.buckets[path][value] += 1;
} else {
state.buckets[path][value] = 1;
}
} else {
state.buckets[path] = [value:1];
}
}
}
""",
"combine_script": "return state",
"reduce_script": "return states"
}
}
}
}
would yield:
"aggregations" : {
"dynamic_variant_options" : {
"value" : [
{
"buckets" : {
"variants.options.Size" : {
"S" : 1,
"XS" : 1
},
"variants.options.Material" : {
"Wool" : 1
}
}
}
]
}
}
You'll need to adjust the painless code if you want the buckets to be arrays of key-doc_count pairs instead of hash maps like in my example.

Return distinct values in Elasticsearch

I am trying to solve an issue where I have to get distinct result in the search.
{
"name" : "ABC",
"favorite_cars" : [ "ferrari","toyota" ]
}, {
"name" : "ABC",
"favorite_cars" : [ "ferrari","toyota" ]
}, {
"name" : "GEORGE",
"favorite_cars" : [ "honda","Hyundae" ]
}
When I perform a term query on favourite cars "ferrari". I get two results whose name is ABC. I simply want that the result returned should be one in this case. So my requirement will be if I can apply a distinct on name field to receive one 1 result.
Thanks
One way to achieve what you want is to use a terms aggregation on the name field and then a top_hits sub-aggregation with size 1, like this:
{
"size": 0,
"query": {
"term": {
"favorite_cars": "ferrari"
}
},
"aggs": {
"names": {
"terms": {
"field": "name"
},
"aggs": {
"single_result": {
"top_hits": {
"size": 1
}
}
}
}
}
}
That way, you'll get a single term ABC and then nested into it a single matching document

Bucket by fields present in returned documents using Elasticsearch

Our indexed documents do not have a completely fixed schema, that is, not every field is in every document. Is there a way to create buckets based on the fields present in a set of documents (i.e. in response to a query) with the count of how many documents contain those fields? For example, these documents that I just made up comprise the results of a query:
{"name":"Bob","field1":"value","field2":"value2","field3":"value3"}
{"name":"Sue","field2":"value4","field3":"value5"}
{"name":"Ali","field1":"value6","field2":"value7"}
{"name":"Joe","field3":"value8"}
This is the information (not format) I want to extract:
name: 4
field1: 2
field2: 3
field3: 3
Is there a way I can aggregate and count to get those results?
Yeah, I think you can do it like this:
GET /some_index/some_type/_search?search_type=count
{
"aggs": {
"name_bucket": {
"filter" : { "exists" : { "field" : "name" } }
},
"field1_bucket": {
"filter" : { "exists" : { "field" : "field1" } }
},
"field2_bucket": {
"filter" : { "exists" : { "field" : "field2" } }
},
"field3_bucket": {
"filter" : { "exists" : { "field" : "field3" } }
}
}
}
And you get something like this:
"aggregations": {
"field3_bucket": {
"doc_count": 3
},
"field1_bucket": {
"doc_count": 2
},
"field2_bucket": {
"doc_count": 3
},
"name_bucket": {
"doc_count": 4
}
}

ElasticSearch: aggregation on _score field?

I would like to use the stats or extended_stats aggregation on the _score field but can't find any examples of this being done (i.e., seems like you can only use aggregations with actual document fields).
Is it possible to request aggregations on calculated "metadata" fields for each hit in an ElasticSearch query response (e.g., _score, _type, _shard, etc.)?
I'm assuming the answer is 'no' since fields like _score aren't indexed...
Note: The original answer is now outdated in terms of the latest version of Elasticsearch. The equivalent script using Groovy scripting would be:
{
...,
"aggregations" : {
"grades_stats" : {
"stats" : {
"script" : "_score"
}
}
}
}
In order to make this work, you will need to enable dynamic scripting or, even better, store a file-based script and execute it by name (for added security by not enabling dynamic scripting)!
You can use a script and refer to the score using doc.score. More details are available in ElasticSearch's scripting documentation.
A sample stats aggregation could be:
{
...,
"aggregations" : {
"grades_stats" : {
"stats" : {
"script" : "doc.score"
}
}
}
}
And the results would look like:
"aggregations": {
"grades_stats": {
"count": 165,
"min": 0.46667441725730896,
"max": 3.1525731086730957,
"avg": 0.8296855776598959,
"sum": 136.89812031388283
}
}
A histogram may also be a useful aggregation:
"aggs": {
"grades_histogram": {
"histogram": {
"script": "doc.score * 10",
"interval": 3
}
}
}
Histogram results:
"aggregations": {
"grades_histogram": {
"buckets": [
{
"key": 3,
"doc_count": 15
},
{
"key": 6,
"doc_count": 103
},
{
"key": 9,
"doc_count": 46
},
{
"key": 30,
"doc_count": 1
}
]
}
}
doc.score doesn't seem to work anymore. Using _score seems to work perfectly.
Example:
{
...,
"aggregations" : {
"grades_stats" : {
"stats" : {
"script" : "_score"
}
}
}
}

Show all Elasticsearch aggregation results/buckets and not just 10

I'm trying to list all buckets on an aggregation, but it seems to be showing only the first 10.
My search:
curl -XPOST "http://localhost:9200/imoveis/_search?pretty=1" -d'
{
"size": 0,
"aggregations": {
"bairro_count": {
"terms": {
"field": "bairro.raw"
}
}
}
}'
Returns:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 16920,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"bairro_count" : {
"buckets" : [ {
"key" : "Barra da Tijuca",
"doc_count" : 5812
}, {
"key" : "Centro",
"doc_count" : 1757
}, {
"key" : "Recreio dos Bandeirantes",
"doc_count" : 1027
}, {
"key" : "Ipanema",
"doc_count" : 927
}, {
"key" : "Copacabana",
"doc_count" : 842
}, {
"key" : "Leblon",
"doc_count" : 833
}, {
"key" : "Botafogo",
"doc_count" : 594
}, {
"key" : "Campo Grande",
"doc_count" : 456
}, {
"key" : "Tijuca",
"doc_count" : 361
}, {
"key" : "Flamengo",
"doc_count" : 328
} ]
}
}
}
I have much more than 10 keys for this aggregation. In this example I'd have 145 keys, and I want the count for each of them. Is there some pagination on buckets? Can I get all of them?
I'm using Elasticsearch 1.1.0
The size param should be a param for the terms query example:
curl -XPOST "http://localhost:9200/imoveis/_search?pretty=1" -d'
{
"size": 0,
"aggregations": {
"bairro_count": {
"terms": {
"field": "bairro.raw",
"size": 10000
}
}
}
}'
Use size: 0 for ES version 2 and prior.
Setting size:0 is deprecated in 2.x onwards, due to memory issues inflicted on your cluster with high-cardinality field values. You can read more about it in the github issue here .
It is recommended to explicitly set reasonable value for size a number between 1 to 2147483647.
How to show all buckets?
{
"size": 0,
"aggs": {
"aggregation_name": {
"terms": {
"field": "your_field",
"size": 10000
}
}
}
}
Note
"size":10000 Get at most 10000 buckets. Default is 10.
"size":0 In result, "hits" contains 10 documents by default. We don't need them.
By default, the buckets are ordered by the doc_count in decreasing order.
Why do I get Fielddata is disabled on text fields by default error?
Because fielddata is disabled on text fields by default. If you have not wxplicitly chosen a field type mapping, it has the default dynamic mappings for string fields.
So, instead of writing "field": "your_field" you need to have "field": "your_field.keyword".
If you want to get all unique values without setting a magic number (size: 10000), then use COMPOSITE AGGREGATION (ES 6.5+).
From official documentation:
"If you want to retrieve all terms or all combinations of terms in a nested terms aggregation you should use the COMPOSITE AGGREGATION which allows to paginate over all possible terms rather than setting a size greater than the cardinality of the field in the terms aggregation. The terms aggregation is meant to return the top terms and does not allow pagination."
Implementation example in JavaScript:
const ITEMS_PER_PAGE = 1000;
const body = {
"size": 0, // Returning only aggregation results: https://www.elastic.co/guide/en/elasticsearch/reference/current/returning-only-agg-results.html
"aggs" : {
"langs": {
"composite" : {
"size": ITEMS_PER_PAGE,
"sources" : [
{ "language": { "terms" : { "field": "language" } } }
]
}
}
}
};
const uniqueLanguages = [];
while (true) {
const result = await es.search(body);
const currentUniqueLangs = result.aggregations.langs.buckets.map(bucket => bucket.key);
uniqueLanguages.push(...currentUniqueLangs);
const after = result.aggregations.langs.after_key;
if (after) {
// continue paginating unique items
body.aggs.langs.composite.after = after;
} else {
break;
}
}
console.log(uniqueLanguages);
Increase the size(2nd size) to 10000 in your term aggregations and you will get the bucket of size 10000. By default it is set to 10.
Also if you want to see the search results just make the 1st size to 1, you can see 1 document, since ES does support both searching and aggregation.
curl -XPOST "http://localhost:9200/imoveis/_search?pretty=1" -d'
{
"size": 1,
"aggregations": {
"bairro_count": {
"terms": {
"field": "bairro.raw",
"size": 10000
}
}
}
}'

Resources