Finding unique documents in an index in Elasticsearch

I have duplicate entries in my index and I want to retrieve only the unique documents. The top_hits aggregation solves this problem, but my other requirement is to support sorting of the results (across buckets), so I can't use top_hits.
The other options I can think of are writing a plugin or using a Painless script.
I need help solving this. It would be great if you could point me to some examples.

The top_hits aggregation returns the matching documents themselves, whereas the cardinality aggregation returns only an approximate count of the distinct values of a field.
You can use the cardinality aggregation like this:
{
  "aggs": {
    "UNIQUE_COUNT": {
      "cardinality": {
        "field": "your_field"
      }
    }
  }
}
Keep in mind that the cardinality aggregation returns an approximate count, so read the Elasticsearch documentation below to understand its trade-offs.
Link: Cardinality Aggregation
For sorting, see the example below, where the sub-aggregation is passed to the order parameter of the terms aggregation that creates the buckets:
{
  "aggs": {
    "AGG_NAME": {
      "terms": {
        "field": "your_field",
        "size": 10,
        "order": {
          "UNIQUE_COUNT": "asc"
        },
        "min_doc_count": 1
      },
      "aggs": {
        "UNIQUE_COUNT": {
          "cardinality": {
            "field": "your_field"
          }
        }
      }
    }
  }
}
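As a rough sketch, the request body above could also be built programmatically. The helper name unique_count_terms_body and its parameters are made up for illustration and are not part of any Elasticsearch client; the sketch assumes the buckets are ordered by the single-value cardinality sub-aggregation, referenced by its name.

```python
import json


def unique_count_terms_body(group_field, count_field, size=10, direction="asc"):
    """Build a terms aggregation ordered by a cardinality sub-aggregation.

    group_field and count_field are placeholders for your own field names.
    """
    if direction not in ("asc", "desc"):
        raise ValueError("direction must be 'asc' or 'desc'")
    return {
        "aggs": {
            "AGG_NAME": {
                "terms": {
                    "field": group_field,
                    "size": size,
                    # A single-value metric is referenced by its name only.
                    "order": {"UNIQUE_COUNT": direction},
                    "min_doc_count": 1,
                },
                "aggs": {
                    "UNIQUE_COUNT": {"cardinality": {"field": count_field}}
                },
            }
        }
    }


body = unique_count_terms_body("your_field", "your_field")
print(json.dumps(body, indent=2))
```

Keeping the grouping field and the counted field as separate parameters makes it easy to count distinct values of one field inside buckets built from another.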

Related

Paginate an aggregation sorted by hits on Elastic index

I have an Elastic index (say file) where I append a document every time a file is downloaded by a client. Each document is quite basic: it contains a filename field and a date field, when, that indicates the time of the download.
What I want to achieve is to get, for each file, the number of times it has been downloaded in the last 3 months. Thanks to another question, I have a query that returns all the results:
{
  "query": {
    "range": {
      "when": {
        "gte": "now-3M"
      }
    }
  },
  "aggs": {
    "downloads": {
      "terms": {
        "field": "filename.keyword",
        "size": 1000
      }
    }
  },
  "size": 0
}
Now, I want a paginated result. The terms aggregation cannot be paginated, so I use a composite aggregation. Of course, if there is a better aggregation, it can be used here...
So for the moment, I have something like this:
{
  "query": {
    "range": {
      "when": {
        "gte": "now-3M"
      }
    }
  },
  "aggs": {
    "downloads_agg": {
      "composite": {
        "size": 100,
        "sources": [
          {
            "downloads": {
              "terms": {
                "field": "filename.keyword"
              }
            }
          }
        ]
      }
    }
  },
  "size": 0
}
This aggregation allows me to paginate (thanks to the after_key value in the response), but it is not sorted by the number of downloads: it is sorted by the filename.
How can I sort that composite aggregation on the number of documents for each filename in my index?
Thanks.
Composite aggregations don't allow sorting based on a value such as the document count.
Excerpt from the discussion on the Elastic forum:
it's designed as a memory-friendly way to paginate over aggregations.
Part of the tradeoff is that you lose things like ordering by doc
count, since that isn't known until after all the docs have been
collected.
I have no experience with Transforms (part of X-Pack and subject to licensing), but you could try them out. Apart from that, I don't see a way to get the expected output.
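If the number of distinct filenames is small enough to hold in memory, one possible workaround is to page through the composite aggregation with after_key, collect every bucket, and sort by doc_count on the client. The sketch below assumes a search callable that takes a request body and returns the parsed response; that callable, fetch_download_counts and the in-memory fake_search stand-in are all hypothetical names used only for illustration.

```python
from typing import Callable, List, Optional


def fetch_download_counts(search: Callable[[dict], dict], page_size: int = 100) -> List[dict]:
    """Collect all composite buckets, then sort them by doc_count (descending)."""
    buckets: List[dict] = []
    after: Optional[dict] = None
    while True:
        composite = {
            "size": page_size,
            "sources": [{"downloads": {"terms": {"field": "filename.keyword"}}}],
        }
        if after is not None:
            composite["after"] = after  # resume from the previous page
        body = {
            "query": {"range": {"when": {"gte": "now-3M"}}},
            "aggs": {"downloads_agg": {"composite": composite}},
            "size": 0,
        }
        agg = search(body)["aggregations"]["downloads_agg"]
        buckets.extend(agg["buckets"])
        after = agg.get("after_key")
        if after is None or not agg["buckets"]:
            break
    return sorted(buckets, key=lambda b: b["doc_count"], reverse=True)


def fake_search(body: dict) -> dict:
    """In-memory stand-in for a real search call, with made-up counts."""
    counts = {"a.txt": 2, "b.txt": 7, "c.txt": 5}
    comp = body["aggs"]["downloads_agg"]["composite"]
    after = comp.get("after", {}).get("downloads")
    names = sorted(n for n in counts if after is None or n > after)[: comp["size"]]
    agg = {"buckets": [{"key": {"downloads": n}, "doc_count": counts[n]} for n in names]}
    if names:
        agg["after_key"] = {"downloads": names[-1]}
    return {"aggregations": {"downloads_agg": agg}}


ranked = fetch_download_counts(fake_search, page_size=2)
print([(b["key"]["downloads"], b["doc_count"]) for b in ranked])
# [('b.txt', 7), ('c.txt', 5), ('a.txt', 2)]
```

Pages can then be taken as slices of the sorted list, for example ranked[0:100]. The cost is that every bucket has to be fetched before the first page can be shown.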

Get max bucket of terms aggregation (with pipeline aggregation)

I was wondering how to get the bucket with the highest doc_count when using a terms aggregation with Elasticsearch. I'm using the Kibana sample data kibana_sample_data_flights:
GET kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "destinations": {
      "terms": {
        "field": "DestCityName"
      }
    }
  }
}
If there were a single bucket with the max doc_count value, I could set the size of the terms aggregation to 1; however, this doesn't work if there are two buckets with the same max doc_count value.
Since I came across pipeline aggregations, I feel there should be an easy way to achieve this. The max bucket aggregation seems to be able to deal with multiple max buckets, since the guide says this:
[...] which identifies the bucket(s) with the maximum value of [...]
However the only way to make this work was using a work-around with a sub-aggregation using value_count:
GET kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "destinations": {
      "terms": {
        "field": "DestCityName"
      },
      "aggs": {
        "counter": {
          "value_count": {
            "field": "_id"
          }
        }
      }
    },
    "max_destination": {
      "max_bucket": {
        "buckets_path": "destinations>counter"
      }
    }
  }
}
a) Is there a better way in general, to find the terms bucket with the max value?
b) Is there a better way using pipeline aggregations?
Thanks in advance!
You can simplify this as shown below: the value_count sub-aggregation isn't needed, because buckets_path can reference each bucket's document count directly through the special _count path.
Unfortunately, however, max_bucket is still the only way to get what you are looking for.
POST <your_index_name>/_search
{
  "size": 0,
  "aggs": {
    "destinations": {
      "terms": {
        "field": "DestCityName"
      }
    },
    "max_destination": {
      "max_bucket": {
        "buckets_path": "destinations>_count"
      }
    }
  }
}
Hope this helps!
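For comparison, here is a small client-side sketch that picks every bucket tied for the highest doc_count out of a terms aggregation response. The function name max_buckets and the sample buckets are made up for illustration. Note that a terms aggregation only returns its top size buckets (10 by default), so ties beyond that limit would not be visible to this code.

```python
def max_buckets(buckets):
    """Return every bucket whose doc_count equals the maximum doc_count."""
    if not buckets:
        return []
    top = max(b["doc_count"] for b in buckets)
    return [b for b in buckets if b["doc_count"] == top]


# Hypothetical buckets shaped like response["aggregations"]["destinations"]["buckets"]
sample = [
    {"key": "Zurich", "doc_count": 5},
    {"key": "Rome", "doc_count": 5},
    {"key": "Oslo", "doc_count": 3},
]
print([b["key"] for b in max_buckets(sample)])
# ['Zurich', 'Rome']
```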

How to filter the results from a composite aggregation?

I want to filter the results of a composite aggregation that contains a top_hits sub-aggregation. I group my data with top_hits, which I use as a sub-aggregation inside a composite aggregation that has a single source based on an ID, and I don't know how to filter those grouped results.
I've tried using the filters aggregation, but I'm not sure how, since the composite aggregation must be the parent of all other aggregations. I tried different combinations of these aggregations, but none of them gives me the results I want.
{
  "size": 0,
  "aggs": {
    "grouped_data": {
      "composite": {
        "sources": [
          {
            "artifact": {
              "terms": {
                "field": "artifactId.keyword"
              }
            }
          }
        ],
        "size": 20
      },
      "aggs": {
        "top_artifacts_hits": {
          "top_hits": {
            "size": 1,
            "sort": [{
              "initialDate": {
                "order": "desc"
              }
            }]
          }
        }
      }
    }
  }
}
I tried filtering with the query API, but that is not a good option for me, since the filters I want to apply are meant for the grouped results. Putting a query before the main aggregation makes Elasticsearch query first and then group; I need it the other way around. I'm using ES 6.3 on AWS.
So my documents look something like this:
{
  "artifactId": "foo",
  "clientId": "bar",
  "artifactState": "foozz",
  "initialDate": 1559745246
}
What I need to do is to get the last artifactState based on the initialDate for each different artifactId so this is why I'm using top_hits + composite.
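One way to get "group first, then filter" is to apply the filter on the client after the response comes back. The sketch below is only that, a client-side assumption rather than an Elasticsearch feature: filter_latest_by_state is a made-up helper, and the sample buckets are invented to mirror the shape of the composite response with its top_artifacts_hits sub-aggregation.

```python
def filter_latest_by_state(buckets, wanted_state):
    """Keep buckets whose most recent hit has the given artifactState."""
    kept = []
    for bucket in buckets:
        hits = bucket["top_artifacts_hits"]["hits"]["hits"]
        # top_hits is sorted by initialDate desc with size 1, so hits[0] is the latest.
        if hits and hits[0]["_source"].get("artifactState") == wanted_state:
            kept.append(bucket)
    return kept


def _bucket(artifact_id, state, date):
    """Build a hypothetical composite bucket for the demonstration."""
    source = {"artifactId": artifact_id, "artifactState": state, "initialDate": date}
    return {
        "key": {"artifact": artifact_id},
        "doc_count": 1,
        "top_artifacts_hits": {"hits": {"hits": [{"_source": source}]}},
    }


sample_buckets = [
    _bucket("foo", "foozz", 1559745246),
    _bucket("baz", "other", 1559745300),
]
matching = filter_latest_by_state(sample_buckets, "foozz")
print([b["key"]["artifact"] for b in matching])
# ['foo']
```

Because the filter runs after each page of 20 buckets is fetched, a filtered page can contain fewer than 20 entries.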

Elasticsearch 5 (Searchkick) Aggregation Bucket Averages

We have an ES index holding scores given for different products. What we're trying to do is aggregate on product names and then get the average score for each product name 'bucket'. Currently the default aggregation functionality only gives us the counts for each bucket. Is it possible to extend this to give us the average score per product name?
We've looked at pipeline aggregations but the documentation is pretty dense and doesn't seem to quite match what we're trying to do.
Here's where we've got to:
{
  "aggs"=>{
    "prods"=>{
      "terms"=>{
        "field"=>"product_name"
      },
      "aggs"=>{
        "avgscore"=>{
          "avg"=>{
            "field"=>"score"
          }
        }
      }
    }
  }
}
Either this is wrong, or could it be that something in how Searchkick compiles its ES queries is breaking things?
Thanks!
I think this is the pipeline aggregation you want:
POST /_search
{
  "size": 0,
  "aggs": {
    "product_count": {
      "terms": {
        "field": "product"
      },
      "aggs": {
        "total_score": {
          "sum": {
            "field": "score"
          }
        }
      }
    },
    "avg_score": {
      "avg_bucket": {
        "buckets_path": "product_count>total_score"
      }
    }
  }
}
Hopefully I have that the right way round, if not - switch the first two buckets.
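To make the difference between the two requests in this thread concrete, here is a plain-Python sketch over made-up documents. It mimics what a terms aggregation with an avg sub-aggregation computes (one average per product) and what avg_bucket over a per-bucket sum computes (a single number: the mean of the bucket totals). The sample data and variable names are purely illustrative.

```python
from collections import defaultdict

# Hypothetical documents, for illustration only
docs = [
    {"product_name": "kettle", "score": 4},
    {"product_name": "kettle", "score": 2},
    {"product_name": "toaster", "score": 5},
]

# Group scores by product, like the buckets of a terms aggregation
scores = defaultdict(list)
for doc in docs:
    scores[doc["product_name"]].append(doc["score"])

# Counterpart of a terms aggregation with an avg sub-aggregation
avg_per_product = {name: sum(vals) / len(vals) for name, vals in scores.items()}

# Counterpart of avg_bucket applied to a per-bucket sum
totals = [sum(vals) for vals in scores.values()]
avg_of_totals = sum(totals) / len(totals)

print(avg_per_product)  # {'kettle': 3.0, 'toaster': 5.0}
print(avg_of_totals)    # 5.5
```

If the goal is an average score per product name, the first computation is the one that matches it.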

Aggregation on top 100 documents sorted by a field

I would like to do a terms aggregation on top 100 documents sorted on a field (not relevance score!).
I know how to do the aggregation:
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "mydata_agg": {
      "terms": {
        "field": "title"
      }
    }
  }
}
and I know how to get top 100 documents sorted on a field:
{
  "query": {
    "match_all": {}
  },
  "sort": {
    "units_sold": {
      "order": "desc"
    }
  },
  "size": 100
}
But how do I run the terms aggregation on those 100 sorted documents? I could use a range filter, but then I would need to specify the cutoff value of units_sold that yields the top 100 documents myself. I would prefer to do everything in one query. Is that possible?
I have searched for a couple of hours but was unable to find a solution.
The terms aggregation creates buckets, and we need to sort the outcome of that first aggregation. This can be done using bucket_sort. Read this article for more information.
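As a hedged sketch of what a bucket_sort request might look like here, the helper below builds a terms aggregation on one field, sums another field per bucket, and sorts the buckets by that sum. The names terms_with_bucket_sort, total_units and top_buckets are made up for illustration. Note that this sorts and truncates buckets, not documents, so it is not the same as aggregating over the top 100 documents.

```python
import json


def terms_with_bucket_sort(term_field, metric_field, size=100):
    """Build a terms aggregation whose buckets are sorted by a summed metric."""
    return {
        "size": 0,
        "aggs": {
            "mydata_agg": {
                # The terms aggregation picks its own top buckets by doc_count first.
                "terms": {"field": term_field, "size": size},
                "aggs": {
                    "total_units": {"sum": {"field": metric_field}},
                    "top_buckets": {
                        "bucket_sort": {
                            "sort": [{"total_units": {"order": "desc"}}],
                            "size": size,
                        }
                    },
                },
            }
        },
    }


body = terms_with_bucket_sort("title", "units_sold")
print(json.dumps(body, indent=2))
```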
