Elasticsearch terms query size to include some terms - elasticsearch

Elasticsearch docs shows below example for size and includes
GET /_search
{
"aggs": {
"JapaneseCars": {
"terms": {
"field": "make",
"size": 10
"include": [ "mazda", "honda" ]
}
}
}
}
But here "include" only includes "mazda" and "honda" in results, i want result to include those 2 as well other results based on doc_count since i am using size in query, is there any way to achieve this.

terms aggregation always return buckets with highest number of documents.
You cannot define an aggregation to always include buckets for some specified keys AND other top buckets.
But you could define two separate aggregations and merge buckets in your application
GET /_search
{
"aggs": {
"JapaneseCars": {
"terms": {
"field": "make",
"include": [ "mazda", "honda" ]
}
},
"OtherCars": {
"terms": {
"field": "make",
"exclude": [ "mazda", "honda" ]
}
},
}
}

You can skip the include keyword right:
GET /_search
{
"aggs": {
"JapaneseCars": {
"terms": {
"field": "make",
"size": 10
}
}
}
}

Related

Elasticsearch: Aggregate all unique values of a field and apply a condition or filter by another field

My documents look like this:
{
"ownID": "Val_123",
"parentID": "Val_456",
"someField": "Val_78",
"otherField": "Val_90",
...
}
I am trying to get all (unique, as in one instance) results for a list of ownID values, while filtering by a list of parentID values and vice-versa.
What I did so far is:
Get (separate!) unique values for ownID and parentID in key1 and key2
{
"size": 0,
"aggs": {
"key1": {
"terms": {
"field": "ownID",
"include": {
"partition": 0,
"num_partitions": 10
},
"size": 100
}
},
"key2": {
"terms": {
"field": "parentID",
"include": {
"partition": 0,
"num_partitions": 10
},
"size": 100
}
}
}
}
Use filter to get (some) results matching either ownID OR parentID
{
"size": 0,
"query": {
"bool": {
"should": [
{
"terms": {
"ownID": ["Val_1","Val_2","Val_3"]
}
},
{
"terms": {
"parentID": ["Val_8","Val_9"]
}
}
]
}
},
"aggs": {
"my_filter": {
"top_hits": {
"size": 30000,
"_source": {
"include": ["ownID", "parentID","otherField"]
}
}
}
}
}
However, I need to get separate results for each filter in the second query, and get:
(1) the parentID of the documents matching some value of ownID
(2) the ownID for the documents matching some value of parentID.
So far I managed to do it using two similar queries (see below for (1)), but I would ideally want to combine them and query only once.
{
"size": 0,
"query": {
"bool": {
"should": [
{
"terms": {
"ownID": [ "Val1", Val_2, Val_3 ]
}
}
]
}
},
"aggs": {
"my_filter": {
"top_hits": {
"size": 30000,
"_source": {
"include": "parentID"
}
}
}
}
}
I'm using Elasticsearch version 5.2
If I got your question correctly then you need to get all the aggregations count correct irrespective of the filter query but in search hits you want the filtered documents only, so for this elasticsearch has another type of filter : "post filter" : refer to this : https://www.elastic.co/guide/en/elasticsearch/reference/5.5/search-request-post-filter.html
its really simple, it will just filter the results after the aggregations have been computed.

How do I filter after an aggregation?

I am trying to filter after a top hits aggregation to get if the first apparition of an error was in a given range but I can't find a way.
I have seen something about bucket selector but can't get it to work
POST log-*/_search/
{
"size": 100,
"aggs": {
"group":{
"terms": {
"field": "errorID.keyword",
"size": 100
},
"aggs": {
"group_docs": {
"top_hits": {
"size": 1,
"sort": [
{
"#timestamp": {
"order": "asc"
}
}
]
}
},
}
}
}
}
}
With this top hits I get the first apparition of a concrete errorID as I have many documents with the same errorID, but what I want to find is if the first apparition is within a given range of dates.
I think that a valid solution would be to filter the results of the aggregation to check if it is in the range, but I don't know how could I do that.

Aggregated results show less items than doc_count?

I have an ElasticSearch query which aggregates the result on a certain field, called _aggregate. Now I have this strange situation given this query:
"size": 100,
"aggregations": {
"results": {
"terms": {
"field": "_aggregate",
"size": 1000,
"order": {
"_count": "desc"
}
},
"aggregations": {
"bundled": {
"top_hits": {
"sort": [
{
"_weight": "asc"
}
]
}
}
}
}
},
"query": {
"bool": {
"must": [
{
"term": {
"_aggregate": "5713618784853"
}
}
]
}
}
}
When I do this search, it returns 8 hits (like expected). However, when I take a look at the aggregated results, I see a doc_count of 8 (so far so good), but it only returns 3 hits.
Increasing the size of the _aggregate field does not have any effect.
Does anyone know how this is possible, or what can possibly cause this?
This is because the top_hits metric aggregation returns 3 hits by default. You can override this
"aggregations": {
"bundled": {
"top_hits": {
"size": 10, <--- add this
"sort": [
{
"_weight": "asc"
}
]
}
}
}

Filter Elasticsearch Aggregation by Bucket Key Value

I have an Elasticsearch index of documents in which there is a field that contains a list of URLs. Aggregating on this field gives me the count of unique URLs, as expected.
GET models*/_search
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"links": {
"terms": {
"field": "links.keyword",
"size": 10
}
}
}
}
I then want to filter out the buckets whose keys do not contain a certain string. I've tried doing so with the Bucket Selector Aggregation.
This attempt:
GET models*/_search
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"links": {
"terms": {
"field": "links.keyword",
"size": 10
}
},
"links_key_filter": {
"bucket_selector": {
"buckets_path": {
"key": "links"
},
"script": "!key.contains('foo')"
}
}
}
}
Fails with:
Invalid pipeline aggregation named [links_key_filter] of type
[bucket_selector]. Only sibling pipeline aggregations are allowed at
the top level
Putting the bucket selector inside the links aggregation, like so:
GET models*/_search
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"links": {
"terms": {
"field": "links.keyword",
"size": 10
},
"bucket_selector": {
"buckets_path": {
"key": "links"
},
"script": "!key.contains('foo')"
}
}
}
}
fails with:
Found two aggregation type definitions in [links]: [terms] and [bucket_selector]
I'm going to keep tinkering but am a bit stuck at the moment :(
You won't be able to use the bucket_selector because its bucket_path
must reference either a number value or a single value numeric metric aggregation [source]
and what a terms aggregation produces is denoted as StringTerms — and that simply won't work, regardless of whether you force a placeholder multibucket aggregation or not.
Having said that, each terms aggregation supports the exclude filter.
Assuming that your links are arrays of keywords:
POST models/_doc/1
{
"links": [
"google.com",
"wikipedia.org"
]
}
POST models/_doc/2
{
"links": [
"reddit.com",
"google.com"
]
}
and you'd like to group everything except reddit, you can use the following regex:
POST models*/_search
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"links": {
"terms": {
"field": "links.keyword",
"exclude": ".*reddit.*", <--
"size": 10
}
}
}
}
BTW, There are some non-trivial implications arising from the usage of such regexes, esp. when you imagine a case-sensitive scenario in which you'd need a query-time-generated regex — as discussed in How to correctly query inside of terms aggregate values in elasticsearch, using include and regex?
GET models*/_search
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"links": {
"terms": {
"field": "links.keyword",
"size": 10
}
},
"bucket_selector": {
"buckets_path": {
"key": "links"
},
"script": "!key.contains('foo')"
}
}
}
Your selector should come a level up, it should be directly in the aggs and parallel to your selector group.
I am not sure about the key filtering
You can use "_key" to get keys:
GET models*/_search
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"links": {
"terms": {
"field": "links.keyword",
"size": 10
},
"bucket_selector": {
"buckets_path": {
"key": "_key"
},
"script": "!params.key.contains('foo')"
}
}
}
}

ElasticSearch - Ordering aggregation by nested aggregation on nested field

{
"query": {
"match_all": {}
},
"from": 0,
"size": 0,
"aggs": {
"itineraryId": {
"terms": {
"field": "iid",
"size": 2147483647,
"order": [
{
"price>price>price.max": "desc"
}
]
},
"aggs": {
"duration": {
"stats": {
"field": "drn"
}
},
"price": {
"nested": {
"path": "prl"
},
"aggs": {
"price": {
"filter": {
"terms": {
"prl.cc.keyword": [
"USD"
]
}
},
"aggs": {
"price": {
"stats": {
"field": "prl.spl.vl"
}
}
}
}
}
}
}
}
}
}
Here, I am getting the error:
"Invalid terms aggregation order path [price>price>price.max]. Terms
buckets can only be sorted on a sub-aggregator path that is built out
of zero or more single-bucket aggregations within the path and a final
single-bucket or a metrics aggregation at the path end. Sub-path
[price] points to non single-bucket aggregation"
query works fine if I order by duration aggregation like
"order": [
{
"duration.max": "desc"
}
So is there any way to Order aggregation by nested aggregation on nested field i.e something like below ?
"order": [
{
"price>price>price.max": "desc"
}
As Val has pointed out in the comments ES does not support it yet.
Till then you can first aggregate the nested aggregation and then use the reverse nested aggregation to aggregate the duration, that is present in the root of the document.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-reverse-nested-aggregation.html

Resources