Removing duplicates and sorting (aggs + sort) - elasticsearch

I'm trying to find the best way to have a query return a de-duplicated, sorted result set. I use aggs to remove the duplicates, and that works fine. However, when I add a sort to the query, e.g.
"query": {..},
"sort": {.. "body.make": "asc" ..}
I'd like the aggs to return their results in that order as well, but they always seem to be ordered by the query score.
// Here I'm collecting all body.vin values to remove duplicates
// and then returning only the first document in each bucket.
"aggs": {
  "dedup": {
    "terms": {
      "size": 8,
      "field": "body.vin"
    },
    "aggs": {
      "dedup_docs": {
        "top_hits": {
          "size": 1,
          "_source": false
        }
      }
    }
  }
},
I've tried putting a terms aggregation in between to see if that would sort:
// Same idea again, but here I attempt to sort on body.make.
// I now realize that because each bucket is itself a collection
// of duplicates, this sorts within each bucket of duplicates
// rather than across the final results.
"aggs": {
"dedup": {
"terms": {
"size": 8,
"field": "body.vin"
},
"aggs": {
"order": {
"terms": {
"field": "body.make",
"order": {
"_term": "asc"
}
},
"aggs": {
"dedup_docs": {
"top_hits": {
"size": 1,
"_source": false
}
}
}
}
}
}
},
But the results from the aggregation are still ordered by score.
I've also toyed with the idea of adjusting the scores based on the query sort; that way the aggregation would return the proper order, since it orders by score. But there doesn't seem to be any way of doing this with sort: {}.
If anyone has had success sorting results while removing duplicates, or has ideas/suggestions, please let me know.

Trying to explain the problem made me realize how it could be done, once I grasped the concept of buckets, or rather how they are passed down. This is not the most ideal solution, since it only allows sorting on one field; the best would be to adjust scores/boosts on sorted results, and I'd still be interested in that sort + score-adjust solution. But via aggregations, this works:
// Here we first aggregate on body.make, so the first bucket might be
// {"toyota": [{body.vin: 123}, {body.vin: 123}, ...]} and the next
// bucket passed into the dedup aggregation would be, say, {"nissan": ...}
"aggs": {
"sort": {
"terms": {
"size": 8,
"field": "body.make",
"order": {
"_term": "desc"
}
},
"aggs": {
"dedup": {
"terms": {
"size": 8,
"field": "body.vin"
},
"aggs": {
"dedup_docs": {
"top_hits": {
"size": 1,
"_source": false
}
}
}
}
}
}
},
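As a side note (not from the original thread): if your Elasticsearch version supports field collapsing (5.3+), the same de-duplication can be done at query level while keeping the query-level sort. A minimal sketch, where the index name cars is hypothetical and body.make is assumed to be a keyword (or otherwise sortable) field:

GET cars/_search
{
  "query": { "match_all": {} },
  "collapse": { "field": "body.vin" },
  "sort": [ { "body.make": "asc" } ],
  "size": 8
}

Each body.vin then appears at most once in the hits, ordered by body.make.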

Related

Is it possible to make elasticsearch aggregations faster when I only want the top 10 buckets?

I understand that Elasticsearch aggregation queries are slow by nature, especially on high-cardinality fields. In our use case, we only need to bring back the first x buckets, sorted alphabetically. Given that we only need 10 buckets, is there a way to make the query faster? Is there a way to get Elasticsearch to look at only the first 10 buckets in each shard and compare just those?
Here is my query...
{
  "size": "600",
  "timeout": "60s",
  "query": {
    "bool": {
      "must": [
        {
          "match_all": {
            "boost": 1
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1
    }
  },
  "aggregations": {
    "firstname": {
      "terms": {
        "field": "firstname.keyword",
        "size": 10,
        "min_doc_count": 1,
        "shard_min_doc_count": 0,
        "show_term_doc_count_error": false,
        "order": {
          "_key": "asc"
        },
        "include": ".*.*",
        "exclude": ""
      }
    }
  }
}
I think I am on to something using a composite aggregation instead. Like this...
"home_address1": {
"composite": {
"size": 10,
"sources": [
{
"home_address1": {
"terms": {
"field": "home_address1.keyword",
"order": "asc"
}
}
}
]
}
}
Testing in Postman shows that this request is way faster. Is this expected? If so, awesome. How can I add the include/exclude attributes to the composite query? For example, sometimes I only want to include buckets whose value matches "A.*".
If this query shouldn't be any faster, then why does it appear to be?
Composite aggregations unfortunately don't support include, exclude, and many other standard terms-aggregation parameters, so you've got 2 options:
Filter out the docs that don't match your prefix from within the query, as @Val pointed out (sketched just below),
or
use a script to do the filtering for you. This should be your last resort, though -- scripts are pretty much guaranteed to run more slowly than standard query filters.
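A minimal sketch of option 1, assuming the prefix you care about is the literal "A":

{
  "size": 0,
  "query": {
    "prefix": {
      "home_address1.keyword": "A"
    }
  },
  "aggs": {
    "home_address1": {
      "composite": {
        "size": 10,
        "sources": [
          {
            "home_address1": {
              "terms": {
                "field": "home_address1.keyword",
                "order": "asc"
              }
            }
          }
        ]
      }
    }
  }
}

And option 2, using a script as the composite source: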
{
  "size": 0,
  "aggs": {
    "home_address1": {
      "composite": {
        "size": 10,
        "sources": [
          {
            "home_address1": {
              "terms": {
                "order": "asc",
                "script": {
                  "source": """
                    // skip docs without a value to avoid a runtime error
                    if (doc['home_address1.keyword'].size() == 0) { return null; }
                    def val = doc['home_address1.keyword'].value;
                    // keep only values starting with "A"; docs for which the
                    // script returns null are left out of the buckets
                    if (val.startsWith('A')) {
                      return val;
                    }
                    return null;
                  """,
                  "lang": "painless"
                }
              }
            }
          }
        ]
      }
    }
  }
}
BTW, your original terms agg already looks optimized, so it surprises me that the composite is faster.
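One more knob that may help here (an assumption on my part, not from the original answer): the terms aggregation accepts a shard_size parameter that controls how many buckets each shard returns for the final reduce. Lowering it trades cross-shard accuracy for less per-shard data, which is close to the "only look at the first 10 buckets per shard" behavior the question asks about:

"aggregations": {
  "firstname": {
    "terms": {
      "field": "firstname.keyword",
      "size": 10,
      "shard_size": 10,
      "order": { "_key": "asc" }
    }
  }
}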

How do I filter after an aggregation?

I am trying to filter after a top_hits aggregation, to determine whether the first occurrence of an error falls within a given date range, but I can't find a way.
I have seen something about bucket_selector but can't get it to work.
POST log-*/_search
{
  "size": 100,
  "aggs": {
    "group": {
      "terms": {
        "field": "errorID.keyword",
        "size": 100
      },
      "aggs": {
        "group_docs": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "@timestamp": {
                  "order": "asc"
                }
              }
            ]
          }
        }
      }
    }
  }
}
With this top_hits I get the first occurrence of a given errorID, since I have many documents with the same errorID. What I want to find out is whether that first occurrence is within a given range of dates.
I think a valid solution would be to filter the results of the aggregation to check whether each one is in the range, but I don't know how I could do that.
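One approach (a hedged sketch, not from the original thread): compute the first occurrence per bucket with a min aggregation on the timestamp, and drop buckets outside the range with a bucket_selector pipeline aggregation. The epoch-millisecond bounds below are placeholders for the desired range:

POST log-*/_search
{
  "size": 0,
  "aggs": {
    "group": {
      "terms": {
        "field": "errorID.keyword",
        "size": 100
      },
      "aggs": {
        "first_seen": {
          "min": { "field": "@timestamp" }
        },
        "in_range": {
          "bucket_selector": {
            "buckets_path": { "firstSeen": "first_seen" },
            "script": "params.firstSeen >= 1546300800000L && params.firstSeen < 1548979200000L"
          }
        }
      }
    }
  }
}

A min aggregation on a date field returns epoch milliseconds, so the script compares plain numbers; the top_hits sub-aggregation from the question can still be added alongside first_seen to fetch the document itself.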

Paging the top_hits aggregation in ElasticSearch

Right now I'm doing a top_hits aggregation in Elasticsearch that groups my data by a field, sorts each group by a date, and picks the top 1.
I need to somehow paginate the results of this aggregation, passing a pageSize and a pageNumber, but I don't know how.
In addition, I also need the total number of results of this aggregation, so we can show it in a table in our web interface.
The aggregation looks like this:
POST my_index/_search
{
  "size": 0,
  "aggs": {
    "top_artifacts": {
      "terms": {
        "field": "artifactId.keyword"
      },
      "aggs": {
        "top_artifacts_hits": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "date": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}
If I understand what you want, you should be able to paginate through a composite aggregation. You still pass a size (the page size), but instead of a from you pass the key of the last bucket of the previous page in after.
POST my_index/_search
{
  "size": 0,
  "aggs": {
    "top_artifacts": {
      "composite": {
        "sources": [
          {
            "artifact": {
              "terms": {
                "field": "artifactId.keyword"
              }
            }
          }
        ],
        "size": 1,              // OPTIONAL SIZE (how many buckets per page)
        "after": {
          "artifact": "FOO_BAZ" // buckets after this bucket key
        }
      },
      "aggs": {
        "hits": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "timestamp": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}
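The question also asked for the total number of results, which the composite response doesn't return. A sketch of one way to get it (my suggestion, not part of the original answer): add a sibling cardinality aggregation on the same field, which gives an approximate distinct-group count you can request once and display as the table total:

"aggs": {
  "total_artifacts": {
    "cardinality": {
      "field": "artifactId.keyword"
    }
  }
}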

How to get specific _source fields in aggregation

I am exploring Elasticsearch for use in an application that will handle large volumes of data and generate some statistical results over them. My requirement is to retrieve certain statistics for a particular field. For example, for a given field, I would like to retrieve its unique values and the document frequency of each value, along with the length of the value. The value lengths are indexed along with each document.
So far, I have experimented with Terms Aggregation, with the following query:
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "type_count": {
      "terms": {
        "field": "val.keyword",
        "size": 100
      }
    }
  }
}
The query returns all the values in the field val with the number of documents in which each value occurs. I would like the field val_len to be returned as well. Is it possible to achieve this using ElasticSearch? In other words, is it possible to include specific _source fields in buckets? I have looked through the documentation available online, but I haven't found a solution yet.
Hoping somebody could point me in the right direction. Thanks in advance!
I tried to include _source in the following ways:
"aggs": {
"type_count": {
"terms": {
"field": "val.keyword",
"size": 100
},
"_source":["val_len"]
}
}
and
"aggs": {
"type_count": {
"terms": {
"field": "val.keyword",
"size": 100,
"_source":["val_len"]
}
}
}
But I guess this isn't the right way, because both gave me parsing errors.
You need to use another sub-aggregation called top_hits, like this:
"aggs": {
"type_count": {
"terms": {
"field": "val.keyword",
"size": 100
},
"aggs": {
"hits": {
"top_hits": {
"_source":["val_len"],
"size": 1
}
}
}
}
}
Another way of doing it is to use an avg sub-aggregation instead, so you can also sort on it:
"aggs": {
"type_count": {
"terms": {
"field": "val.keyword",
"size": 100,
"order": {
"length": "desc"
}
},
"aggs": {
"length": {
"avg": {
"field": "val_len"
}
}
}
}
}

Result filter and pagination in Elasticsearch

I need some help, or an idea of the correct procedure.
I have already indexed a large set of documents. Now I've found that there are some documents with almost the same content, e.g.
{
  "title": "myDocument",
  "date": "2017-09-18",
  "page": 1
}
{
  "title": "myDocument",
  "date": "2017-09-18",
  "page": 2
}
The title field is mapped as text, date as date, and page as integer. As you can see, the only difference is the page value.
Now I want to run a query and filter out these duplicates. Field collapsing seems like a good way to do it, but in that case I can't get the correct count of results, and that's important for me.
Another way would be to fetch all results first and then filter out duplicates "manually", but then I have a problem with pagination.
Try something like this (note that, given the mapping above, date and page are date/integer fields, so they have no .keyword sub-fields and can be aggregated on directly):
GET index/type/_search
{
  "aggs": {
    "count_by_title_date_page": {
      "terms": {
        "field": "title.keyword",
        "size": 100
      },
      "aggs": {
        "date": {
          "terms": {
            "field": "date",
            "size": 100
          },
          "aggs": {
            "page": {
              "terms": {
                "field": "page",
                "size": 100
              }
            }
          }
        }
      }
    }
  }
}
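A side note (a sketch of my own, not from the original answer): field collapsing can still be combined with a cardinality aggregation to get the de-duplicated total, which keeps from/size pagination working. collapse accepts a single field, so this assumes duplicates can be identified by title alone; otherwise index a combined keyword field (e.g. a hypothetical title_date) and collapse on that instead:

GET index/_search
{
  "query": { "match_all": {} },
  "collapse": { "field": "title.keyword" },
  "from": 0,
  "size": 10,
  "aggs": {
    "dedup_total": {
      "cardinality": { "field": "title.keyword" }
    }
  }
}

The hits come back one per distinct title, and dedup_total carries the (approximate) count of distinct groups for the pagination UI.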
