ElasticSearch aggregations for all pages - elasticsearch

I use size and from keywords for pagination across ElasticSearch results and each page change requires another search query to be executed.
I would like to compute facets with the aggregations feature; however, aggregations are computed only on the results constrained by the size and from keywords, e.g. when I ask for records 20-30 from the list, the aggregations are computed only on the 10 records that are returned. Of course, I would like global facets computed on all the matching records, so that they do not change while I switch pages.
Any ideas how to do it apart from performing an additional global (unconstrained by size and from) search?

Aggregations are computed on all documents that match "query". The scope of aggregations has nothing to do with "size" and "from" values.
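To make this concrete, here is a minimal sketch of the request bodies for two different pages. The field names (title, category) and the helper name build_page_request are made up for illustration; only query, from, size and aggs are actual search body keys. The point is that the aggs section is identical on every page and is evaluated against everything that matches query, while from and size only pick which hits come back.

```python
import json


def build_page_request(query, page, page_size, facet_field):
    """Build a search body for one page of hits plus a terms aggregation.

    Hypothetical helper: it only assembles the JSON body, it does not
    talk to a cluster.
    """
    return {
        "query": query,
        # "from" and "size" only select which hits are returned.
        "from": page * page_size,
        "size": page_size,
        # The aggregation is scoped by "query", not by "from"/"size".
        "aggs": {
            "facet": {"terms": {"field": facet_field}}
        },
    }


# Assumed example query and field names.
query = {"match": {"title": "elasticsearch"}}
page_two = build_page_request(query, page=2, page_size=10, facet_field="category")
page_three = build_page_request(query, page=3, page_size=10, facet_field="category")

print(json.dumps(page_two, indent=2))
```

If the facets do change between pages, it is worth checking whether the query itself (or a filter inside it) differs between the requests, rather than the paging parameters.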

Related

Does an Elasticsearch aggregation have a limit on what it can process or aggregate?

Hello Community, I am fairly new to Elasticsearch and have stumbled upon this issue.
My Elasticsearch application requires aggregation of data to get top hits. The query works perfectly in almost all use cases and we get the values in buckets. But in some cases, where the field on which the aggregation is performed contains very long text, we do not get the desired results: the aggregation does not happen and we get buckets as an empty array. So my question is: does Elasticsearch aggregation have a size limit, such that it cannot aggregate very long text fields?

ElasticSearch: given a document and a query, what is the relevance score?

Once a query is executed on ElasticSearch, a relevance _score is calculated for each retrieved document.
Given a specific document (e.g. by doc ID) and a specific query, I would like to see what is its _score?
One way is perhaps to query ES, retrieve all the hit documents, and look up the desired document out of all the retrieved documents to see its score.
I assume there should be a more efficient way to do this. Given a query and a document ID, what is its _score?
I'm using ElasticSearch 7.x
PS: I need this for a learning-to-rank scenario (to create my judgment list). I have in fact a complex query that was created from various should and must clauses over different fields. My major requirement was to get the score value for each individual sub-query, for which there seems to be no solution. I want to understand which part of this complex query is more useful and which one is less. The only way I've come up with is to execute each sub-query separately to get the score, but I do not want to actually execute that query; I just want to ask what the score of a specific document is for that sub-query.
The score of a document depends not only on the document itself and the other documents in the index, but also on various factors such as:
_score is calculated per shard, not per index, by default, although you can change this behavior by using the DFS Query Then Fetch search type in your query. More info on this in the official blog.
Whether any boost is applied at index or query time (index-time boosting has been deprecated since 5.x).
Whether any custom scoring function is used in addition to the default ES scoring algorithm (TF/IDF in old versions, BM25 in the latest versions).
Edit: Based on the comments from other community members, rephrasing the statement below:
To answer your question: using the _explain API, you can see how Elasticsearch computes the score for a given query and a specific document. It also gives useful feedback on whether a document matched or didn’t match a specific query.
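As a rough sketch of how that could look for the per-sub-query case, the snippet below builds one _explain request per sub-query for a single document. The index name (products), document ID, field names and sub-queries are all made up; the helper build_explain_request is hypothetical and only assembles the method, path and body. The request shape (GET /<index>/_explain/<id> with a body containing query) is how the explain API is called in 7.x.

```python
import json


def build_explain_request(index, doc_id, query):
    """Return (method, path, body) for an _explain call on one document.

    Hypothetical helper: it does not send anything, it only shows the
    shape of the request.
    """
    path = f"/{index}/_explain/{doc_id}"
    body = {"query": query}
    return "GET", path, body


# Assumed sub-queries taken out of a larger bool query.
sub_queries = {
    "title_match": {"match": {"title": "wireless headphones"}},
    "brand_term": {"term": {"brand": "acme"}},
}

requests_to_send = {}
for name, sub_query in sub_queries.items():
    requests_to_send[name] = build_explain_request("products", "42", sub_query)
    method, path, body = requests_to_send[name]
    print(name, method, path, json.dumps(body))
```

Each call evaluates the sub-query against that one document only, so it avoids retrieving the full hit list. Note that the explanation is computed on the shard that holds the document, so the value can differ from a score obtained with DFS Query Then Fetch.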

Compute Aggregations before running Filter Query

I have a simple scenario:
I search for some text, and Elasticsearch returns documents and aggregations.
I then filter that search with values in the fields returned from those aggregations. I'm using a Terms Query inside a filter.
I want the documents to be filtered by my filter conditions, which is working fine.
But I want the aggregation buckets without the filter condition applied (because if I get the buckets after applying the filter, I'll just get that one value).
My workaround to get the aggregations without applying filters: send two requests to Elasticsearch. In the first request, send the query with filters applied, and in the second request, send the query without filters applied.
Question: Is there a better way to achieve this? I looked around SO and I guess I can set global:{} while defining aggregations, but I'm not sure!
Or better put, Is there a way I can get aggregation results before filters are applied to a document?
EDIT
I did some searching and it looks like post_filter was designed for cases like this, i.e. when you don't want your filter to affect aggregations. But there was also a lot of discussion about the performance of post_filter.
Now I wonder if sending two requests is better than using post_filter in terms of performance.
I think post_filter performance is not as bad as you are saying. It just applies a filter to the search results after aggregation, so all matching docs have to pass through this filter. I think you should go with post_filter because:
It will save you a network round trip, so you will have minimal latency.
It will save ES the overhead of allocating resources for handling a new request.
Your post-filtered search and your aggregation will be served from the same set of documents, so most of the values they need will already be in main memory or cache (things like doc_values).
So, performance should not be affected too much. You can also do profiling and analyse it yourself.
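A minimal sketch of what the single-request version could look like is below. The field names (description, brand) and the helper build_faceted_request are assumptions for illustration; query, aggs, post_filter and the terms query are real search body elements. The text search goes in query, the buckets in aggs, and the user's facet selection goes in post_filter so that it narrows the hits but not the buckets.

```python
import json


def build_faceted_request(text, selected_brands):
    """Build one search body that returns unfiltered buckets and filtered hits.

    Hypothetical helper with made-up field names.
    """
    body = {
        # The aggregation is computed on everything this query matches.
        "query": {"match": {"description": text}},
        "aggs": {"brands": {"terms": {"field": "brand"}}},
    }
    if selected_brands:
        # Applied to the hits only, after the aggregations are computed.
        body["post_filter"] = {"terms": {"brand": selected_brands}}
    return body


unfiltered = build_faceted_request("running shoes", [])
filtered = build_faceted_request("running shoes", ["acme"])
print(json.dumps(filtered, indent=2))
```

Both bodies share the same query and aggs, so the buckets stay stable while the user toggles facet values.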

ElasticSearch: post_filter or filter?

Let's say I have a similar situation explained here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-post-filter.html
Before I stumbled upon this article, I have been using filter instead of post_filter for this kind of scenario, and it produced output just like the post_filter.
My question is: Are they the same thing? If not, which one is the recommended and more efficient method to use and why?
As far as search hits are concerned, they are the same thing, i.e. the hits you get will be correctly filtered according to either your filter in a filtered query or the filter in your post_filter.
However, as far as aggregations are concerned, the end result will not be the same. The difference between both boils down to what document set the aggregations will be computed on.
If your filter is in a filtered query, then your aggregations will be computed on the document set selected by the query(ies) and the filter(s) in your filtered query, i.e. the same set of documents that you will get in the response.
If your filter is in a post_filter, then your aggregations will be computed on the document set selected by your various query(ies). Once aggregations have been computed on that document set, the latter is further filtered by the filter(s) in your post_filter before returning the matching documents.
To sum it up,
a filtered query affects both search results and aggregations
while a post_filter only affects the search results but NOT the aggregations
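The difference can be illustrated with a toy in-memory model. This is not Elasticsearch code, just plain Python over a made-up list of documents that mimics the order of operations described above: a filter inside the query narrows the set before aggregation, a post_filter narrows it afterwards.

```python
from collections import Counter

# Made-up documents for illustration.
DOCS = [
    {"id": 1, "text": "red shirt", "color": "red"},
    {"id": 2, "text": "blue shirt", "color": "blue"},
    {"id": 3, "text": "red shirt slim", "color": "red"},
    {"id": 4, "text": "green hat", "color": "green"},
]


def is_red(doc):
    return doc["color"] == "red"


def search(docs, term, query_filter=None, post_filter=None):
    """Toy model: returns (hit ids, color bucket counts)."""
    # Step 1: the query (plus any filter inside it) selects the document set.
    matched = [d for d in docs if term in d["text"].split()]
    if query_filter:
        matched = [d for d in matched if query_filter(d)]
    # Step 2: aggregations are computed on that set.
    buckets = dict(Counter(d["color"] for d in matched))
    # Step 3: a post_filter narrows the hits only.
    hits = [d for d in matched if post_filter(d)] if post_filter else matched
    return [d["id"] for d in hits], buckets


print(search(DOCS, "shirt", query_filter=is_red))  # ([1, 3], {'red': 2})
print(search(DOCS, "shirt", post_filter=is_red))   # ([1, 3], {'red': 2, 'blue': 1})
```

The hits are the same in both cases; only the buckets differ. In recent Elasticsearch versions the "filtered query" case corresponds to putting the filter in the filter clause of a bool query.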
Another important difference between filter and post_filter that wasn't mentioned in any of the answers: performance.
TL;DR
Don't use post_filter unless you actually need it for aggregations.
From The Definitive Guide:
WARNING: Performance consideration
Use a post_filter only if you need to differentially filter search
results and aggregations. Sometimes people will use post_filter for
regular searches.
Don’t do this! The nature of the post_filter means it runs after
the query, so any performance benefit of filtering (such as caches) is
lost completely.
The post_filter should be used only in combination with
aggregations, and only when you need differential filtering.
In my tests, I found that filter behaves exactly like post_filter: both affect only the hits section.

What is the maximum size of an array in elasticsearch

What would be the maximum recommended size for an array field in an elasticsearch document? I am looking into using a document field in order to keep a list of the IDs of "followers" of a user, in order to boost the document in the query results when the user executing the query is following a potential match.
Also, is there perhaps a better way to improve the relevance of a user document than this rather brutal approach?
