Terms Aggregation return multiple fields (min_doc_count: 0) - elasticsearch

I'm making a Terms Aggregation but I want to return multiple fields. I want a user to select buckets via "slug" (my-name), but show the actual "name" (My Name).
At the moment I'm using a top_hits sub-aggregation, like this:
"organisation": {
"aggregations": {
"label": {
"top_hits": {
"_source": {
"includes": [
"organisations.name"
]
},
"size": 1
}
}
},
"terms": {
"field": "organisations.slug",
"min_doc_count": 0,
"size": 20
}
}
This gives the desired result when my query actually finds some buckets/results.
As you can see, I've set min_doc_count to 0, which also returns buckets with a doc count of 0. The problem I'm facing is that for those buckets the top_hits response is empty, which means I can't render the proper name to the client.
Example response:
"organisation": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "my-name",
"doc_count": 27,
"label": {
"hits": {
"total": 27,
"max_score": 1,
"hits": [
{
"_index": "users",
"_type": "doc",
"_id": "4475",
"_score": 1,
"_source": {
"organisations": [
{
"name": "My name"
}]
}
}]
}
}
},
{
"key": "my-name-2",
"doc_count": 0,
"label": {
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
},
.....
Has anyone accomplished this? I feel like top_hits won't help me here; it should always fetch the name.
What I've also tried:
Working with a terms sub aggregation. (same result)
Working with a significant terms sub aggregation. (same result)
What I think could be a solution, but feels dirty:
Index a new field with "organisations.slug___organisations.name" and work the magic via this.
Manually query the name field where the count is 0 (i.e. where top_hits is empty)
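For the second workaround, the fallback lookup can live entirely client-side. A minimal sketch in Python, assuming a hypothetical slug_to_name map built beforehand (e.g. from a separate query or a local cache):

```python
def resolve_labels(buckets, slug_to_name):
    """Map each terms bucket key (slug) to a display name.

    When the top_hits sub-aggregation is empty (doc_count == 0), fall
    back to the precomputed slug -> name lookup; otherwise take the
    name from the top hit itself.
    """
    labels = {}
    for bucket in buckets:
        hits = bucket["label"]["hits"]["hits"]
        if hits:  # doc_count > 0: the name is in the top hit
            labels[bucket["key"]] = hits[0]["_source"]["organisations"][0]["name"]
        else:     # doc_count == 0: top_hits is empty, use the fallback map
            labels[bucket["key"]] = slug_to_name.get(bucket["key"], bucket["key"])
    return labels
```

This keeps the aggregation unchanged and only patches the zero-count buckets after the fact.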
Kind regards,
Thanks in advance

Related

Limit to max records to be searched in Elastic Search Group by query

We have a strange issue where one of our customers has a lot of records for a certain field x. When the user triggers a group-by query on that x field, the Elasticsearch cluster becomes unstable and restarts with an OOM error.
Is there a way to limit the maximum number of records Elasticsearch should look at while aggregating on a certain field, so that the cluster can be saved from going OOM?
PS: The group by can go on multiple fields such as x, y, z, and w, and the user is only searching the last 30 days of data.
If you wish to restrict the number of documents taken into account for an aggregation (in this case, a terms aggregation), wrap it in a sampler aggregation:
Index Data:
{
  "role": "example",
  "number": 1
}
{
  "role": "example1",
  "number": 2
}
{
  "role": "example2",
  "number": 3
}
Search Query:
{
  "size": 0,
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 2   // max documents (per shard) to take into account for the aggregation
      },
      "aggs": {
        "unique_roles": {
          "terms": {
            "field": "role.keyword"
          }
        }
      }
    }
  }
}
Search Result:
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"sample": {
"doc_count": 2, // Note this
"unique_roles": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "example",
"doc_count": 1
},
{
"key": "example1",
"doc_count": 1
}
]
}
}
}

How to get the number of documents for each occurrence in Elastic?

I have an Elastic index (say file) where I append a document every time a file is downloaded by a client.
Each document is quite basic; it contains a filename field and a when date field indicating the time of the download.
What I want to achieve is to get, for each file, the number of times it has been downloaded in the last 3 months.
For the moment, the closest I've got is with this query:
{
  "query": {
    "range": {
      "when": {
        "gte": "now-3M"
      }
    }
  },
  "aggs": {
    "downloads": {
      "terms": {
        "field": "filename.keyword"
      }
    }
  }
}
The result is something like that:
{
  "took": 793,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "file",
        "_type": "_doc",
        "_id": "8DkTFHQB3kG435svAA3O",
        "_score": 1.0,
        "_source": {
          "filename": "taz",
          "id": 24009,
          "when": "2020-08-21T08:11:54.943Z"
        }
      },
      ...
    ]
  },
  "aggregations": {
    "downloads": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 418486,
      "buckets": [
        {
          "key": "file1",
          "doc_count": 313873
        },
        {
          "key": "file2",
          "doc_count": 281504
        },
        ...,
        {
          "key": "file10",
          "doc_count": 10662
        }
      ]
    }
  }
}
So I am quite interested in aggregations.downloads.buckets, but it is limited to 10 results.
What do I need to change in my query to get the whole list (in my case, ~15,000 different files)?
Thanks.
The size of the terms buckets defaults to 10. If you want to increase it, go with
{
  "query": {
    "range": {
      "when": {
        "gte": "now-3M"
      }
    }
  },
  "aggs": {
    "downloads": {
      "terms": {
        "field": "filename.keyword",
        "size": 15000   <-------
      }
    }
  }
}
Note that there are strategies to paginate those buckets using a composite aggregation.
Also note that as your index grows, you may hit the search.max_buckets soft limit as well. It's a dynamic cluster-wide setting, so it can be changed.
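The composite pagination mentioned above can be sketched as a loop over after_key. A Python sketch, where search is a hypothetical stand-in for your client's search call (it takes a request body dict and returns the parsed response):

```python
def iter_all_buckets(search, size=1000):
    """Yield every filename bucket by paging a composite aggregation.

    `search` is a placeholder for whatever issues the request
    (e.g. an Elasticsearch client); each page resumes from the
    previous page's after_key.
    """
    after = None
    while True:
        composite = {
            "size": size,
            "sources": [{"filename": {"terms": {"field": "filename.keyword"}}}],
        }
        if after is not None:
            composite["after"] = after  # resume where the last page stopped
        body = {"size": 0, "aggs": {"downloads": {"composite": composite}}}
        agg = search(body)["aggregations"]["downloads"]
        page = agg.get("buckets", [])
        yield from page
        after = agg.get("after_key")
        if after is None or len(page) < size:  # short or final page: done
            break
```

Unlike the plain terms size bump, this keeps each request small no matter how many distinct filenames exist.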

Elasticsearch - Any way to find out all the documents with field value as text

In the Elasticsearch cluster, I accidentally pushed some text into a field which should ideally be a number. Later, I fixed that and pushed number-type values. Now I want to replace the old text values with some number, and for that I need to find all the documents that have text in this field.
Is there any elasticsearch query that I can use to get this information?
I think that is possible using nested aggregations.
At the top level, use a terms aggregation to find the distinct text values; at the sub-level, use a top_hits aggregation to get the documents containing those values.
For instance:
GET example_index/_search
{
  "size": 0,
  "aggs": {
    "NAME": {
      "terms": {
        "field": "example_field.keyword",
        "size": 10
      },
      "aggs": {
        "documents": {
          "top_hits": {
            "size": 10
          }
        }
      }
    }
  }
}
This query will return the distinct values of the field, with the related documents at the sub-level, something like:
{
  "aggregations": {
    "NAME": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "mistake",
          "doc_count": 2,
          "documents": {
            "hits": {
              "total": 2,
              "max_score": 1,
              "hits": [
                {
                  "_index": "example_index",
                  "_type": "example_index",
                  "_id": "2QoDoXEBOCkJkkpwq5P0",
                  "_score": 1,
                  "_source": {
                    "example_field": "mistake"
                  }
                },
                {
                  "_index": "example_index",
                  "_type": "example_index",
                  "_id": "qAoDoXEBOCkJkkpwq5T0",
                  "_score": 1,
                  "_source": {
                    "example_field": "mistake"
                  }
                }
              ]
            }
          }
        },
        {
          "key": "520",
          "doc_count": 2,
          "documents": {
            "hits": {
              "total": 1,
              "max_score": 1,
              "hits": [
                {
                  "_index": "example_index",
                  "_type": "example_index",
                  "_id": "5goDoXEBOCkJkkpwq5P0",
                  "_score": 1,
                  "_source": {
                    "example_field": "1"
                  }
                }
              ]
            }
          }
        }
      ]
    }
  }
}
In the example above, we need to delete the documents with the mistake value; you can simply delete them by ID.
NOTE: if you have a big index, it's better to write a function in your code that builds the aggregation, reads the response, skips values that can be parsed to a number, and then removes the remaining documents by ID.
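That filtering step can be sketched client-side. A minimal Python version, working on the terms/top_hits response shape above (deleting by ID is left to your client and is not shown):

```python
def find_text_value_docs(buckets):
    """Collect the IDs of documents whose field value is not parseable
    as a number, i.e. the documents that need fixing or deleting.
    """
    bad_ids = []
    for bucket in buckets:
        try:
            float(bucket["key"])      # numeric keys are fine, skip them
        except ValueError:            # non-numeric key: keep its doc IDs
            for hit in bucket["documents"]["hits"]["hits"]:
                bad_ids.append(hit["_id"])
    return bad_ids
```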

How to select buckets of aggregation results based on top hit document attribute?

I am trying to get results for the following Elasticsearch query, and I got the response shown below. Now I want to select buckets based on the top-hit document field source.
POST /data/_search?size=0
{
  "aggs": {
    "by_partyIds": {
      "terms": {
        "field": "id.keyword"
      },
      "aggs": {
        "oldest_record": {
          "top_hits": {
            "sort": [
              {
                "createdate.keyword": {
                  "order": "asc"
                }
              }
            ],
            "_source": [
              "source"
            ],
            "size": 1
          }
        }
      }
    }
  }
}
Response:
{
  "aggregations": {
    "by_partyIds": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "1",
          "doc_count": 3,
          "oldest_record": {
            "hits": {
              "total": 3,
              "max_score": null,
              "hits": [
                {
                  "_index": "data",
                  "_type": "osr",
                  "_id": "DcagSm4B9WnM0Ke-MgGk",
                  "_score": null,
                  "_source": {
                    "source": "US"
                  },
                  "sort": [
                    "20-09-18 05:45:26.000000000AM"
                  ]
                }
              ]
            }
          }
        },
        {
          "key": "2",
          "doc_count": 3,
          "oldest_record": {
            "hits": {
              "total": 3,
              "max_score": null,
              "hits": [
                {
                  "_index": "data",
                  "_type": "osr",
                  "_id": "7caiSm4B9WnM0Ke-HwGx",
                  "_score": null,
                  "_source": {
                    "source": "UK"
                  },
                  "sort": [
                    "22-09-18 05:45:26.000000000AM"
                  ]
                }
              ]
            }
          }
        }
      ]
    }
  }
}
Now I want to get only the buckets whose source is US. Can we write a query for that? I tried the bucket_selector aggregation, a parent pipeline aggregation that executes a script determining whether the current bucket is retained in the parent multi-bucket aggregation. The specified metric must be numeric and the script must return a boolean value (if the script language is expression, a numeric return value is permitted, in which case 0.0 evaluates to false and all other values to true).
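Since bucket_selector's buckets_path must point at a numeric metric, a string field coming from top_hits can't feed it directly. One workaround is to filter the buckets client-side after receiving the response. A sketch in Python, working on the response shape shown above:

```python
def buckets_with_source(buckets, wanted):
    """Keep only the buckets whose oldest record has the wanted source."""
    kept = []
    for bucket in buckets:
        hits = bucket["oldest_record"]["hits"]["hits"]
        # top_hits with size 1: at most one hit per bucket
        if hits and hits[0]["_source"].get("source") == wanted:
            kept.append(bucket)
    return kept
```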

Latest document for each category?

I have documents in ElasticSearch with structure like this:
{
"created_on": [timestamp],
"source_id": [a string ID for the source],
"type": [a term],
... other fields
}
Obviously, I can select these documents in Kibana, show them in Discover, produce (for example) a pie chart of the type terms, and so on.
However, the requirement I've been given is to use only the most recent document for each source_id.
The approach I've tried is to map the documents into one bucket per source_id, then for each bucket, reduce to remove all but the document with the latest created_on.
However, when I used the terms aggregator, the result only contained counts, not whole documents I could further process:
"aggs" : {
"sources" : {
"terms" : { "field" : "source_id" }
}
}
How can I make this query?
If I understood correctly what you're trying to do, one way to accomplish it is to use the top_hits aggregation under a terms aggregation; top_hits lets you pick documents, by any criterion you like, for each bucket of its parent aggregation. Following your example, you could do something like
{
  "aggs": {
    "by_source_id": {
      "terms": {
        "field": "source_id"
      },
      "aggs": {
        "most_recent": {
          "top_hits": {
            "sort": {
              "created_on": "desc"
            },
            "size": 1
          }
        }
      }
    }
  }
}
So you are grouping by source_id, which creates a bucket for each one, and then you get the top hit for each bucket according to the sorting criteria set in the top_hits agg, in this case the created_on field.
The result you should expect would be something like
....
"buckets": [
  {
    "key": 3,
    "doc_count": 2,
    "most_recent": {
      "hits": {
        "total": 2,
        "max_score": null,
        "hits": [
          {
            "_index": "so_sample02",
            "_type": "items",
            "_id": "2",
            "_score": null,
            "_source": {
              "created_on": "2018-05-01 07:00:01",
              "source_id": 3,
              "type": "a"
            },
            "sort": [
              1525158001000
            ]
          }
        ]
      }
    }
  },
  {
    "key": 5,
    "doc_count": 2, .... and so on
Notice how within each bucket, under most_recent, we get the corresponding hit. You can further limit the fields returned by specifying "_source": { "includes": ["fieldA", "fieldB", ...] } in your top_hits agg.
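Once the response is back, flattening it into one latest document per source_id is straightforward. A Python sketch over the bucket list above:

```python
def latest_per_source(buckets):
    """Map each source_id bucket key to the _source of its most recent hit."""
    out = {}
    for bucket in buckets:
        hits = bucket["most_recent"]["hits"]["hits"]
        if hits:  # guard against an empty hit list
            out[bucket["key"]] = hits[0]["_source"]
    return out
```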
Hope that helps.
