Aggregation on top 100 documents sorted by a field - elasticsearch

I would like to do a terms aggregation on the top 100 documents sorted on a field (not relevance score!).
I know how to do the aggregation:
{
"query": {
"match_all" : {}
},
"aggs" : {
"mydata_agg" : {
"terms": {
"field" : "title"
}
}
}
}
and I know how to get top 100 documents sorted on a field:
{
"query": {
"match_all": {}
},
"sort": {
"units_sold": {
"order": "desc"
}
},
"size": 100
}
But how do I run the terms aggregation on those 100 sorted documents? I could use a range filter, but then I would need to specify the cutoff value of units_sold that results in the top 100 documents myself. I would prefer to do everything in one query. Is that possible?
I have searched for a couple of hours but was unable to find a solution.

The terms aggregation creates buckets, and we need to sort the outcome of the first aggregation. This can be done using bucket_sort. Read this article for more information.
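A minimal sketch of what a request body using bucket_sort might look like, written as a Python dict so it can be serialized with json.dumps. The sub-aggregation names total_units and sort_by_units, and the choice of a sum over units_sold as the sort key, are assumptions made for illustration; only title and units_sold come from the question. Note that bucket_sort orders the buckets of its parent terms aggregation; it does not by itself restrict the aggregation to the 100 top-sorted documents.

```python
import json


def build_bucket_sort_query(term_field="title", metric_field="units_sold", size=10):
    """Return a search body whose terms buckets are re-ordered by bucket_sort."""
    return {
        "size": 0,  # only the aggregation results are of interest
        "query": {"match_all": {}},
        "aggs": {
            "mydata_agg": {
                "terms": {"field": term_field, "size": size},
                "aggs": {
                    # Hypothetical metric: total units sold within each bucket.
                    "total_units": {"sum": {"field": metric_field}},
                    # Pipeline aggregation that sorts the parent's buckets.
                    "sort_by_units": {
                        "bucket_sort": {
                            "sort": [{"total_units": {"order": "desc"}}]
                        }
                    },
                },
            }
        },
    }


body = build_bucket_sort_query()
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a _search request against the index in question.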

Related

How to limit number of documents returned for each term in Elasticsearch terms query?

I try to get documents with specified list of terms, like this:
GET /_search
{
"query" : {
"terms" : {
"md5" : ["file_1", "file_2"]
}
}
}
Is it possible to limit Elasticsearch results to just one document for each term? As a result, I should have one document for "file_1", one document for "file_2" and so on.
What I am trying to accomplish is to get the Elasticsearch _id of the most recent document for each term in the list. Can I do it this way, or is it necessary to send a separate request for each term?
You have two different ways to get the N documents for each term.
One way is by performing one request for each term.
The other way is by using the top hits aggregation (see the documentation here).
GET /_search
{
"query" : {
"terms" : {
"md5" : ["file_1", "file_2"]
}
},
"aggs": {
"top-docs": {
"terms": {
"field": "md5",
"size": 3
},
"aggs": {
"top_tag_hits": {
"top_hits": {
"size" : 1
}
}
}
}
}
}
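The query above returns one hit per bucket but does not say which one. Since the question asks for the most recent document, a sort inside top_hits is probably needed. The sketch below builds such a body as a Python dict; the date field name timestamp is hypothetical (the question does not name one), and "_source": False is an optional assumption that keeps the response down to metadata such as _id.

```python
import json


def latest_doc_per_term(md5_values, date_field="timestamp"):
    """Build a search body returning the newest document for each md5 value."""
    values = list(md5_values)
    return {
        "size": 0,  # the hits of interest live inside the aggregation
        "query": {"terms": {"md5": values}},
        "aggs": {
            "top-docs": {
                # One bucket per requested term.
                "terms": {"field": "md5", "size": len(values)},
                "aggs": {
                    "top_tag_hits": {
                        "top_hits": {
                            "size": 1,
                            # Hypothetical date field; newest document first.
                            "sort": [{date_field: {"order": "desc"}}],
                            "_source": False,
                        }
                    }
                },
            }
        },
    }


print(json.dumps(latest_doc_per_term(["file_1", "file_2"]), indent=2))
```

In the response, each _id would then sit under aggregations, top-docs, buckets, top_tag_hits, hits, hits, in the first element of that inner list.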

Finding unique documents in an index in elastic search

I have duplicate entries in my index and I want to find only the unique documents in the index. The top hits aggregation solves this problem, but my other requirement is to support sorting on the results (across buckets), so I can't use the top hits aggregation.
Other options I can think of are to write a plugin or use a Painless script.
I need help solving this. It would be great if you could point me to some examples.
The top hits aggregation finds the value from the complete result set, while cardinality gives only the filtered result set.
You can use cardinality aggregation like below:
{
"aggs" : {
"UNIQUE_COUNT" : {
"cardinality" : {
"field" : "your_field"
}
}
}
}
This aggregation comes with some caveats (its counts are approximate), so see the Elasticsearch documentation below to understand it better.
Link: Cardinality Aggregation
For sorting, you can refer to the example below, where the sub-aggregation is passed in the order of the terms aggregation that creates your buckets:
{
"aggs": {
"AGG_NAME": {
"terms": {
"field": "your_field",
"size": 10,
"order": {
"UNIQUE_COUNT": "asc"
},
"min_doc_count": 1
},
"aggs": {
"UNIQUE_COUNT": {
"cardinality": {
"field": "your_field"
}
}
}
}
}
}
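As a rough sketch of reading the per-bucket values back out of a response to a request like this one, the helper below walks the standard terms aggregation layout (buckets carrying key, doc_count, and one object per sub-aggregation). The function name and the sample response are made up for illustration; the sample is hand-written, not real Elasticsearch output.

```python
def bucket_metric_values(response, agg_name="AGG_NAME", metric_name="UNIQUE_COUNT"):
    """Return (key, doc_count, metric value) for each terms bucket, in response order."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [(b["key"], b["doc_count"], b[metric_name]["value"]) for b in buckets]


# Hand-written response fragment used only to exercise the helper.
sample_response = {
    "aggregations": {
        "AGG_NAME": {
            "buckets": [
                {"key": "alpha", "doc_count": 3, "UNIQUE_COUNT": {"value": 1}},
                {"key": "beta", "doc_count": 1, "UNIQUE_COUNT": {"value": 1}},
            ]
        }
    }
}

for key, doc_count, unique in bucket_metric_values(sample_response):
    print(key, doc_count, unique)
```

Buckets whose doc_count is greater than 1 are the ones holding duplicates of the same field value.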

How do I select the top term buckets based on a rescore function in Elasticsearch

Consider the following query for Elasticsearch 5.6:
{
"size": 0,
"query": {
"match_all": {}
},
"rescore": [
{
"window_size": 10000,
"query": {
"rescore_query": {
"function_score": {
"boost_mode": "replace",
"script_score": {
"script": {
"source": "doc['topic_score'].value"
}
}
}
},
"query_weight": 0,
"rescore_query_weight": 1
}
}
],
"aggs": {
"distinct": {
"terms": {
"field": "identical_id",
"order": {
"top_score": "desc"
}
},
"aggs": {
"best_unique_result": {
"top_hits": {
"size": 1
}
},
"top_score": {
"max": {
"script": {
"inline": "_score"
}
}
}
}
}
}
}
This is a simplified version where the real query has a more complex main query and the rescore function is far more intensive.
Let me explain its purpose first, in case I'm about to spend 1000 hours developing a pen that writes in space when a pencil would actually solve my problem. I'm performing a fast initial query, then rescoring the top results with a much more intensive function. From those results I want to show the top distinct values, i.e. no two results should have the same identical_id. If there's a better way to do this I'd also consider that an answer.
I expected a query like this would order results by the rescore query, group all the results that had the same identical_id and display the top hit for each such distinct group. I also assumed that since I'm ordering those term aggregation buckets by the max parent _score, they would be ordered to reflect the best result they contain as determined from the original rescore query.
The reality is that the term buckets are ordered by the maximum query score and not the rescore query score. Strangely the top hits within the buckets do seem to use the rescore.
Is there a better way to achieve the end result that I want, or some way I can fix this query to work the way I expect it to?
From the documentation:
The query rescorer executes a second query only on the Top-K results returned by the query and post_filter phases. The number of docs which will be examined on each shard can be controlled by the window_size parameter, which defaults to 10.
As the rescore query kicks in after the post_filter phase, I assume the term aggregation buckets are already fixed.
I have no idea on how you can combine rescore and aggregations. Sorry :(
I think I have a pretty great solution to this problem, but I'll let the bounty continue to expiration in case someone comes up with a better approach.
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"sample": {
"sampler": {
"shard_size": 10000
},
"aggs": {
"distinct": {
"terms": {
"field": "identical_id",
"order": {
"top_score": "desc"
}
},
"aggs": {
"best_unique_result": {
"top_hits": {
"size": 1,
"sort": [
{
"_script": {
"type": "number",
"script": {
"source": "doc['topic_score'].value"
},
"order": "desc"
}
}
]
}
},
"top_score": {
"max": {
"script": {
"source": "doc['topic_score'].value"
}
}
}
}
}
}
}
}
}
The sampler aggregation takes the top N hits per shard from the core query and runs the aggregations over those. Then, in the max aggregation that defines the bucket order, I use exactly the same script as the one I use to pick a top hit from the bucket. Now the buckets and the top hits run over the same top N set of items, and the buckets are ordered by the max of the same score, generated from the same script. Unfortunately I still need to run the script twice: once to order the buckets and once to pick a top hit within each bucket. You could use the rescore instead for the top hits ordering, but either way it has to run twice, and I found it was faster as a sort script than as a rescore.
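To make the flow concrete, here is a small local simulation of the steps described above, in plain Python rather than Elasticsearch. It assumes a single shard, in-memory documents with hypothetical _score, identical_id and topic_score keys, and a default of 10 buckets; it only mirrors the logic and says nothing about performance.

```python
from collections import defaultdict


def distinct_top_hits(docs, sample_size, max_buckets=10):
    """Simulate sampler -> terms(identical_id) -> top_hits, ordered by max topic_score."""
    # Sampler: keep only the top `sample_size` documents by the core query score.
    sample = sorted(docs, key=lambda d: d["_score"], reverse=True)[:sample_size]

    # Terms aggregation: one group per identical_id.
    groups = defaultdict(list)
    for doc in sample:
        groups[doc["identical_id"]].append(doc)

    # top_hits with a script sort: the best document of each group by topic_score.
    best = [max(group, key=lambda d: d["topic_score"]) for group in groups.values()]

    # Bucket order: the best document's topic_score equals the bucket's max.
    best.sort(key=lambda d: d["topic_score"], reverse=True)
    return best[:max_buckets]


docs = [
    {"id": "a", "identical_id": "x", "_score": 3.0, "topic_score": 0.2},
    {"id": "b", "identical_id": "x", "_score": 2.0, "topic_score": 0.9},
    {"id": "c", "identical_id": "y", "_score": 2.5, "topic_score": 0.5},
    {"id": "d", "identical_id": "z", "_score": 0.1, "topic_score": 1.0},
]

print([d["id"] for d in distinct_top_hits(docs, sample_size=3)])
```

Document d has the highest topic_score but falls outside the sample of three, so it never reaches the buckets. That is the trade-off of sampling on the core query score first.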

Elasticsearch 5 (Searchkick) Aggregation Bucket Averages

We have an ES index holding scores given for different products. What we're trying to do is aggregate on product names and then get the average score for each of the product name 'buckets'. Currently the default aggregation functionality only gives us the counts for each bucket. Is it possible to extend this to give us the average score per product name?
We've looked at pipeline aggregations but the documentation is pretty dense and doesn't seem to quite match what we're trying to do.
Here's where we've got to:
{
"aggs"=>{
"prods"=>{
"terms"=>{
"field"=>"product_name"
},
"aggs"=>{
"avgscore"=>{
"avg"=>{
"field"=>"score"
}
}
}
}
}
}
Either this is wrong, or could it be that something in how Searchkick compiles its ES queries is breaking things?
Thanks!
Think this is the pipeline aggregation you want...
POST /_search
{
"size": 0,
"aggs": {
"product_count" : {
"terms" : {
"field" : "product"
},
"aggs": {
"total_score": {
"sum": {
"field": "score"
}
}
}
},
"avg_score": {
"avg_bucket": {
"buckets_path": "product_count>total_score"
}
}
}
}
Hopefully I have that the right way round, if not - switch the first two buckets.
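It may help to see what the two approaches compute. The sketch below mimics them locally in plain Python on made-up (product, score) rows: an avg sub-aggregation under terms yields one mean per product bucket, whereas avg_bucket over a per-bucket sum yields a single number, the mean of the bucket totals. The function names and data are hypothetical.

```python
from collections import defaultdict


def per_product_average(rows):
    """Mimic terms + avg sub-aggregation: the mean score for each product."""
    scores = defaultdict(list)
    for product, score in rows:
        scores[product].append(score)
    return {product: sum(vals) / len(vals) for product, vals in scores.items()}


def average_of_bucket_sums(rows):
    """Mimic terms + sum sub-aggregation followed by a sibling avg_bucket."""
    totals = defaultdict(float)
    for product, score in rows:
        totals[product] += score
    return sum(totals.values()) / len(totals)


rows = [("kettle", 4), ("kettle", 2), ("toaster", 5)]

print(per_product_average(rows))
print(average_of_bucket_sums(rows))
```

If the goal is an average score per product name, the first shape (the avg sub-aggregation from the question) looks like the closer match; avg_bucket answers a different question.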

Getting the terms with high document frequency

How can I get the top 10 terms with highest document frequencies?
I have an analyzed field called article.
I am using ES 2.3.0.
You can simply use an aggregation:
POST /my_articles/_search
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"term_count":{
"terms": {
"field":"article",
"size" : 10
}
}
}
}
For each word, it will return the number of documents in which it can be found. It does not take into account whether a word appears multiple times in the field.
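The difference between the two counts can be illustrated locally. This is a plain-Python sketch with made-up articles, where lowercasing and whitespace splitting stand in for whatever analyzer the article field really uses.

```python
from collections import Counter


def document_frequency(articles):
    """Count, for each word, the number of articles that contain it at least once."""
    df = Counter()
    for text in articles:
        # set() so that a word repeated within one article is counted once.
        df.update(set(text.lower().split()))
    return df


def term_frequency(articles):
    """Count every occurrence of each word across all articles."""
    tf = Counter()
    for text in articles:
        tf.update(text.lower().split())
    return tf


articles = ["red red blue", "blue green", "red"]

print(document_frequency(articles).most_common(10))
print(term_frequency(articles).most_common(10))
```

Here "red" occurs three times overall but in only two articles, and the terms aggregation reports the per-document kind of count.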

Resources