ElasticSearch Query: Can't get all aggregated results

In Elasticsearch, I have 5 shards, and each shard holds documents like this:
[{"userId": "XX", "name": "YYY", "bookId": "123abc"}, ....]
I group by "bookId" and try to get all books that have been borrowed by more than one user:
{
  "size": 0,
  "aggs": {
    "group_by_bookId": {
      "terms": {
        "field": "bookId.keyword",
        "size": 10000
      },
      "aggs": {
        "having_several": {
          "bucket_selector": {
            "buckets_path": {
              "the_doc_count": "_count"
            },
            "script": "params.the_doc_count > 1"
          }
        }
      }
    }
  }
}
This query returns about 500 buckets, which is fewer than 10,000, so it appears to return everything it found. But when I search for the single field "bookId" = "123abc", five documents come back, yet the key "123abc" does not appear in the returned bucket key list. I know we have five documents with "bookId" = "123abc", one on each shard. My understanding is that the aggregation runs on each shard first and then combines the per-shard results; since each shard holds only one document with "bookId" = "123abc", that document gets dropped on every shard.
So my question is: is there an Elasticsearch query that finds all documents with duplicated values, no matter which shards those documents are on?
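As a quick way to verify the merged count for a single key (an illustrative check, not part of the original question), the terms aggregation's include parameter can restrict the aggregation to that one value:
{
  "size": 0,
  "aggs": {
    "check_one_book": {
      "terms": {
        "field": "bookId.keyword",
        "include": "123abc"
      }
    }
  }
}
If this returns a bucket with "doc_count": 5, the per-shard counts are being merged correctly and the problem lies in how the top terms are selected.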

I think that min_doc_count could help you; set it to 2, since you want only terms that occur more than once. The query will look like this:
{
  "size": 0,
  "aggs": {
    "group_by_bookId": {
      "terms": {
        "field": "bookId.keyword",
        "size": 10000,
        "min_doc_count": 2
      }
    }
  }
}
From the Elasticsearch documentation, min_doc_count is applied after the merge phase:
The min_doc_count criterion is only applied after merging local terms
statistics of all shards.
If this doesn't work, you can try playing with the shard_size parameter in the query.
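For completeness, here is a sketch of what raising shard_size looks like, so that each shard contributes more candidate terms before the merge (the value 20000 is an arbitrary illustration, not from the original answer):
{
  "size": 0,
  "aggs": {
    "group_by_bookId": {
      "terms": {
        "field": "bookId.keyword",
        "size": 10000,
        "shard_size": 20000,
        "min_doc_count": 2
      }
    }
  }
}
A larger shard_size makes the merged term counts more accurate, at the cost of more memory and network traffic during the reduce phase.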

Related

Paginate an aggregation sorted by hits on Elastic index

I have an Elastic index (say file) where I append a document every time a file is downloaded by a client. Each document is quite basic: it contains a filename field and a when date field indicating the time of the download.
What I want to achieve is to get, for each file, the number of times it has been downloaded in the last 3 months. Thanks to another question, I have a query that returns all the results:
{
  "query": {
    "range": {
      "when": {
        "gte": "now-3M"
      }
    }
  },
  "aggs": {
    "downloads": {
      "terms": {
        "field": "filename.keyword",
        "size": 1000
      }
    }
  },
  "size": 0
}
Now, I want to have a paginated result. The terms aggregation cannot be paginated, so I use a composite aggregation. Of course, if there is a better aggregation, it can be used here...
So for the moment, I have something like this:
{
  "query": {
    "range": {
      "when": {
        "gte": "now-3M"
      }
    }
  },
  "aggs": {
    "downloads_agg": {
      "composite": {
        "size": 100,
        "sources": [
          {
            "downloads": {
              "terms": {
                "field": "filename.keyword"
              }
            }
          }
        ]
      }
    }
  },
  "size": 0
}
This aggregation allows me to paginate (thanks to the after_key value in the response), but it is not sorted by the number of downloads; it is sorted by the filename.
How can I sort that composite aggregation on the number of documents for each filename in my index?
Thanks.
Composite aggregations don't allow sorting based on a computed value such as the doc count.
Excerpt from the discussion on the Elastic forum:
it's designed as a memory-friendly way to paginate over aggregations.
Part of the tradeoff is that you lose things like ordering by doc
count, since that isn't known until after all the docs have been
collected.
I have no experience with Transforms (part of X-Pack and licensed), but you can try that out. Apart from this, I don't see a way to get the expected output.
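If you do go down that road, here is a minimal sketch of a batch transform that pivots the file index into a per-filename summary; the transform id, the destination index name downloads_summary, and the assumption of Elasticsearch 7.5+ syntax are all mine, not from the answer:
PUT _transform/downloads_by_file
{
  "source": {
    "index": "file",
    "query": {
      "range": { "when": { "gte": "now-3M" } }
    }
  },
  "dest": { "index": "downloads_summary" },
  "pivot": {
    "group_by": {
      "filename": { "terms": { "field": "filename.keyword" } }
    },
    "aggregations": {
      "download_count": { "value_count": { "field": "when" } }
    }
  }
}
Once the transform has been started and has run, downloads_summary contains one document per filename, so it can be searched, sorted on download_count, and paginated with plain from/size.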

Why am I getting number of buckets always equal to the specified size in terms aggregations?

I am a newbie in Elasticsearch. I am using a terms aggregation to get only the unique documents based on a field from the index. I have specified the maximum size of unique documents in my query; why is the bucket count always equal to size?
{
  "aggs": {
    "name": {
      "terms": {
        "field": "fieldname",
        "size": 10000
      }
    }
  }
}
Why am I getting 10,000 buckets when the number of unique values may be less than that?
10,000 is the upper cap for the number of results returned in a query, and your index will have more than 10,000 records. To get the actual count, use the count API
GET index/_count
OR
{
  "size": 0,
  "aggs": {
    "total_doc_count": {
      "value_count": {
        "field": "fieldname"
      }
    }
  }
}
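Note that value_count returns the total number of values for the field, not the number of distinct ones. If what you actually want is the number of unique values, a cardinality aggregation is the usual tool (a sketch I am adding for completeness; its counts are approximate for very high-cardinality fields):
{
  "size": 0,
  "aggs": {
    "unique_value_count": {
      "cardinality": {
        "field": "fieldname"
      }
    }
  }
}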
To fetch more than 10,000 documents in a query, you have to use the scroll API.
POST /index-name/_search?scroll=1m   --> opens a scroll context
{
  "size": 10000,   --> returns docs in chunks of 10,000
  "query": {
    "match_all": {}
  }
}
POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ=="   --> returned by the previous request
}
If there are only 100 documents, the terms aggregation will return only 100 buckets, not 10,000.

How to limit search results from each index in a multi index search query?

I am using Elasticsearch version 6.3 and I want to make queries across multiple indices. Elasticsearch supports this: I can give multiple indices as comma-separated values in the URL with one query in the request body, plus a size parameter to limit the number of search results returned. However, this limits the size of the overall result set and might lead to no results from some indices, so instead I want to fetch the first n results from each index.
I tried using the multi search API (_msearch), but with that it seems I have to give the same query and size for all indices. That works, but I am not able to get a single aggregation over the entire result. Is there any way to address both issues?
Solution 1:
You're on the right path with the _msearch query. What I would do is issue one query per index (no aggregations!) with the size you want for that index, plus one more query just for the aggregations, like this:
{ "index": "index1" }
{ "size": 5, "query": { ... }}
{ "index": "index2" }
{ "size": 5, "query": { ... }}
{ "index": "index3" }
{ "size": 5, "query": { ... }}
{ "index": "index1,index2,index3" }
{ "size": 0, "query": { ... }, "aggs": { ... } }
So the first three queries will return document hits from each of the three indexes and the last query will return the aggregation computed on all indexes, but no documents.
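One mechanical note, since _msearch is strict about its body format: each header and each query must be a single line of JSON, the request must be sent with Content-Type: application/x-ndjson, and the body must end with a newline. A sketch, assuming the lines above (with real queries in place of the { ... } placeholders) are saved to a file named msearch.ndjson and the cluster runs on localhost:9200 (both assumptions mine):
curl -s -H 'Content-Type: application/x-ndjson' \
     -XPOST 'http://localhost:9200/_msearch' \
     --data-binary @msearch.ndjson
The responses come back in the same order as the sub-requests, so the first three entries carry the per-index hits and the fourth carries the global aggregation.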
Solution 2:
Another way to tackle this, if you have a small size, is to have a single query in the query part and then aggregate on the index name, retrieving hits from each index using top_hits, like this:
POST index1,index2,index3/_search
{
  "size": 0,
  "query": { ... },
  "aggs": {
    "indexes": {
      "terms": {
        "field": "_index",
        "size": 50
      },
      "aggs": {
        "hits": {
          "top_hits": {
            "size": 5
          }
        }
      }
    }
  }
}

How do I select the top term buckets based on a rescore function in Elasticsearch

Consider the following query for Elasticsearch 5.6:
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "rescore": [
    {
      "window_size": 10000,
      "query": {
        "rescore_query": {
          "function_score": {
            "boost_mode": "replace",
            "script_score": {
              "script": {
                "source": "doc['topic_score'].value"
              }
            }
          }
        },
        "query_weight": 0,
        "rescore_query_weight": 1
      }
    }
  ],
  "aggs": {
    "distinct": {
      "terms": {
        "field": "identical_id",
        "order": {
          "top_score": "desc"
        }
      },
      "aggs": {
        "best_unique_result": {
          "top_hits": {
            "size": 1
          }
        },
        "top_score": {
          "max": {
            "script": {
              "inline": "_score"
            }
          }
        }
      }
    }
  }
}
This is a simplified version; the real query has a more complex main query and a far more intensive rescore function.
Let me explain its purpose first, in case I'm about to spend 1,000 hours developing a pen that writes in space when a pencil would actually solve my problem. I'm performing a fast initial query, then rescoring the top results with a much more intensive function. From those results I want to show the top distinct values, i.e. no two results should have the same identical_id. If there's a better way to do this, I'd also consider that an answer.
I expected a query like this would order results by the rescore query, group all the results that had the same identical_id, and display the top hit for each such distinct group. I also assumed that since I'm ordering those terms aggregation buckets by the max parent _score, they would be ordered to reflect the best result they contain, as determined by the original rescore query.
The reality is that the term buckets are ordered by the maximum query score and not the rescore query score. Strangely, the top hits within the buckets do seem to use the rescore.
Is there a better way to achieve the end result I want, or some way I can fix this query to work the way I expect it to?
From documentation :
The query rescorer executes a second query only on the Top-K results returned by the query and post_filter phases. The number of docs which will be examined on each shard can be controlled by the window_size parameter, which defaults to 10.
As the rescore query kicks in after the post_filter phase, I assume the term aggregation buckets are already fixed.
I have no idea on how you can combine rescore and aggregations. Sorry :(
I think I have a pretty great solution to this problem, but I'll let the bounty run to expiration in case someone comes up with a better approach.
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 10000
      },
      "aggs": {
        "distinct": {
          "terms": {
            "field": "identical_id",
            "order": {
              "top_score": "desc"
            }
          },
          "aggs": {
            "best_unique_result": {
              "top_hits": {
                "size": 1,
                "sort": [
                  {
                    "_script": {
                      "type": "number",
                      "script": {
                        "source": "doc['topic_score'].value"
                      },
                      "order": "desc"
                    }
                  }
                ]
              }
            },
            "top_score": {
              "max": {
                "script": {
                  "source": "doc['topic_score'].value"
                }
              }
            }
          }
        }
      }
    }
  }
}
The sampler aggregation will take the top N hits per shard from the core query and run the aggregations over those. In the max aggregation that defines the bucket order, I then use the exact same script as the one I use to pick the top hit within each bucket. Now the buckets and the top hits run over the same top-N sets of items, and the buckets are ordered by the max of the same score, generated from the same script.

Unfortunately I still need to run the script once to order the buckets and once to pick the top hit within each bucket. You could use the rescore instead for the top-hits ordering, but either way the script has to run twice, and I found it was faster as a sort script than as a rescore.

Elasticsearch term aggregation with range doc count

I want to aggregate a field and return only those buckets whose doc count is within 10 to 20, for example.
So far, the documentation says that we can provide a min_doc_count parameter.
Is there any way to provide a max_doc_count as well, so I only get the required buckets?
Thanks
You can use the following query to filter the buckets using a bucket_selector pipeline aggregation; the Elasticsearch documentation on pipeline aggregations and buckets_path covers the details.
In the following example I am aggregating documents on the product.name field, where product is of type object for me.
{
  "size": 0,
  "aggs": {
    "values": {
      "terms": {
        "field": "product.name.raw",
        "size": 10
      },
      "aggs": {
        "final_filter": {
          "bucket_selector": {
            "buckets_path": {
              "values": "_count"
            },
            "script": "params.values >= 10 && params.values <= 20"
          }
        }
      }
    }
  }
}
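One caveat worth adding (my note, not part of the original answer): bucket_selector filters buckets only after the terms aggregation has picked its top terms, so with "size": 10 at most ten terms are ever considered. If the matching buckets may fall outside that top ten, raise size (and optionally shard_size) so the candidate set is wide enough before filtering; the values below are illustrative:
{
  "size": 0,
  "aggs": {
    "values": {
      "terms": {
        "field": "product.name.raw",
        "size": 1000,
        "shard_size": 2000
      },
      "aggs": {
        "final_filter": {
          "bucket_selector": {
            "buckets_path": {
              "values": "_count"
            },
            "script": "params.values >= 10 && params.values <= 20"
          }
        }
      }
    }
  }
}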
Hope this helps
Thanks
