ElasticSearch: post_filter or filter?

Let's say I have a situation similar to the one explained here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-post-filter.html
Before I stumbled upon this article, I had been using filter instead of post_filter for this kind of scenario, and it produced the same output as post_filter.
My question is: Are they the same thing? If not, which one is the recommended and more efficient method to use and why?

As far as search hits are concerned, they are the same thing, i.e. the hits you get will be correctly filtered according to either your filter in a filtered query or the filter in your post_filter.
However, as far as aggregations are concerned, the end result will not be the same. The difference between both boils down to what document set the aggregations will be computed on.
If your filter is in a filtered query, then your aggregations will be computed on the document set selected by the query(ies) and the filter(s) in your filtered query, i.e. the same set of documents that you will get in the response.
If your filter is in a post_filter, then your aggregations will be computed on the document set selected by your various query(ies). Once aggregations have been computed on that document set, the latter is further filtered by the filter(s) in your post_filter before returning the matching documents.
To sum it up,
a filtered query affects both search results and aggregations
while a post_filter only affects the search results but NOT the aggregations
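The difference summarized above can be illustrated with a small in-memory model. This is only a sketch of the evaluation order, not Elasticsearch code: the documents, the search function and the predicates are all made up for illustration. In current Elasticsearch versions the role of the filtered query is played by a filter clause inside a bool query.

```python
from collections import Counter

# Hypothetical sample documents (not from the original post).
DOCS = [
    {"id": 1, "brand": "gucci", "color": "red"},
    {"id": 2, "brand": "gucci", "color": "blue"},
    {"id": 3, "brand": "gucci", "color": "red"},
    {"id": 4, "brand": "prada", "color": "red"},
]


def search(docs, query, filter_=None, post_filter=None, agg_field="color"):
    """Toy model of the order in which a search request is evaluated."""
    # 1. The query and any filter inside it select the working set.
    matched = [d for d in docs if query(d) and (filter_ is None or filter_(d))]
    # 2. Aggregations are computed on that working set.
    buckets = dict(Counter(d[agg_field] for d in matched))
    # 3. post_filter narrows the hits only, after aggregations are done.
    hits = [d for d in matched if post_filter is None or post_filter(d)]
    return [d["id"] for d in hits], buckets


def is_gucci(d):
    return d["brand"] == "gucci"


def is_red(d):
    return d["color"] == "red"


# Same hits in both cases, but different aggregation buckets.
hits_a, aggs_a = search(DOCS, is_gucci, filter_=is_red)
hits_b, aggs_b = search(DOCS, is_gucci, post_filter=is_red)
print(hits_a, aggs_a)
print(hits_b, aggs_b)
```

With the filter inside the query, the color buckets only contain "red"; with post_filter, the buckets still cover every color of the queried brand while the hits stay restricted to red.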

Another important difference between filter and post_filter that wasn't mentioned in any of the answers: performance.
TL;DR
Don't use post_filter unless you actually need it for aggregations.
From The Definitive Guide:
WARNING: Performance consideration
Use a post_filter only if you need to differentially filter search
results and aggregations. Sometimes people will use post_filter for
regular searches.
Don’t do this! The nature of the post_filter means it runs after
the query, so any performance benefit of filtering (such as caches) is
lost completely.
The post_filter should be used only in combination with
aggregations, and only when you need differential filtering.

In my tests, filter behaved exactly like post_filter: both affected only the hits section.

Related

ElasticSearch: given a document and a query, what is the relevance score?

Once a query is executed on ElasticSearch, a relevance _score is calculated for each retrieved document.
Given a specific document (e.g. by doc ID) and a specific query, I would like to see what is its _score?
One way is perhaps to query ES, retrieve all the hit documents, and look up the desired document out of all the retrieved documents to see its score.
I assume there should be a more efficient way to do this. Given a query and a document ID, what is its _score?
I'm using ElasticSearch 7.x
PS: I need this for a learning-to-rank scenario (to create my judgment list). I have a complex query built from various should and must clauses over different fields. My main requirement is to get the score value for each individual sub-query, for which there seems to be no solution. I want to understand which parts of this complex query are more useful and which are less. The only way I've come up with is to execute each sub-query separately to get its score, but I do not want to actually execute that query; I only want to ask what the score of a specific document is for that sub-query.
The score of a document does not depend only on the document itself and the other documents in the index; it also depends on various factors, such as:
_score is calculated per shard, not per index, by default, although you can change this behavior by using the DFS Query Then Fetch search type (search_type=dfs_query_then_fetch) in your request. More info on this official blog.
Whether any boost is applied at index time or query time (index-time boosts are deprecated since 5.x).
Whether any custom scoring function is used in addition to the default ES scoring algorithm (TF/IDF in old versions, BM25 in the latest versions).
Edit: Based on the comments from the other respected community members, rephrasing the below statement:
To answer your question: using the _explain API, you can get a score explanation for a query and a specific document. This also gives useful feedback on whether the document matches the query or not.
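As a rough sketch of how the _explain API could be used for this, the snippet below builds the request for one document and flattens the nested explanation tree of a response. The index name, document ID, query and the sample response are all made up for illustration; only the endpoint shape (/<index>/_explain/<id> in 7.x) and the value/description/details structure of an explanation are taken from Elasticsearch. No request is actually sent here.

```python
import json


def build_explain_request(index, doc_id, query):
    """Return (method, path, body) for an explain call on one document."""
    # Elasticsearch 7.x exposes explain at /<index>/_explain/<id>.
    return "GET", f"/{index}/_explain/{doc_id}", json.dumps({"query": query})


def flatten_explanation(node, depth=0):
    """Walk the nested explanation tree, yielding (depth, value, description)."""
    yield depth, node["value"], node["description"]
    for child in node.get("details", []):
        yield from flatten_explanation(child, depth + 1)


# Hypothetical index, document ID and query.
method, path, body = build_explain_request(
    "my-index", "42", {"match": {"title": "elasticsearch"}}
)

# Made-up response fragment, shaped like an explain response.
sample_response = {
    "matched": True,
    "explanation": {
        "value": 3.0,
        "description": "sum of:",
        "details": [
            {"value": 2.0, "description": "clause on title (made up)", "details": []},
            {"value": 1.0, "description": "clause on body (made up)", "details": []},
        ],
    },
}

rows = list(flatten_explanation(sample_response["explanation"]))
for depth, value, description in rows:
    print("  " * depth + f"{value} {description}")
```

For a bool query, the children of the top-level explanation generally correspond to the individual clauses, which may be enough to compare the contribution of each sub-query without running them separately.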

Compute Aggregations before running Filter Query

I've a simple scenario:
I search for some text, and elastic returns documents and
aggregations.
I then filter that search with values from the fields returned by those aggregations. I'm using a Terms Query inside a filter.
I want the documents to be filtered by my filter conditions, which is working fine.
But I want the aggregation buckets computed without the filter condition applied (because if I get the buckets after applying the filter, I'll just get that one value).
My workaround to get the aggregations without the filters applied: send two requests to Elasticsearch. The first request sends the query with the filters applied, and the second sends the query without them.
Question: Is there a better way to achieve this? I looked around SO and I guess I can set global:{} while defining aggregations, but I'm not sure!
Or better put, Is there a way I can get aggregation results before filters are applied to a document?
EDIT
I did some searching and it looks like post_filter was designed for cases like this, i.e., when you don't want your filter to affect aggregations. But there was also a lot of talk about the performance of post_filter.
Now I wonder whether sending two requests is better than using post_filter in terms of performance.
I think post_filter performance is not as bad as you are saying. It just applies a filter to the search results after aggregation, so all matching docs have to pass through this filter. I think you should go with post_filter because:
It will save you a network round trip, so you will have minimal latency.
It will save ES the overhead of allocating resources for handling a new request.
Your post-filtered search and your aggregation will be served by the same set of documents, so most of the values they need will already be in main memory or cache memory (things like doc_values).
So performance should not be affected too much. You can also do profiling and analyse it yourself.
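A single request along these lines might look like the sketch below, written as a Python dict that could be sent as the search body. The helper name and the "description" field are assumptions made for illustration; query, aggs, post_filter and the terms query/aggregation are the standard search body elements.

```python
import json


def build_faceted_search(text, field, selected_values, size=10):
    """One request: aggregations see the unfiltered query, hits are filtered."""
    body = {
        "size": size,
        # "description" is a hypothetical full-text field.
        "query": {"match": {"description": text}},
        # Buckets are computed on everything the query matches.
        "aggs": {"by_" + field: {"terms": {"field": field}}},
    }
    if selected_values:
        # Applied to the hits only, after the aggregations are computed.
        body["post_filter"] = {"terms": {field: list(selected_values)}}
    return body


request_body = build_faceted_search("running shoes", "brand", ["nike"])
print(json.dumps(request_body, indent=2))
```

The same body without the selected values is just the original unfiltered search, so one helper covers both the initial search and the refined one.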

ElasticSearch aggregations for all pages

I use size and from keywords for pagination across ElasticSearch results and each page change requires another search query to be executed.
I would like to compute facets with the aggregations feature. However, the aggregations seem to be computed only on the results constrained by the size and from keywords, e.g. when I ask for records 20-30 from the list, the aggregations are computed only on the 10 records that are returned. And of course I would like to have global facets, computed on all the matching records, that do not change while I switch pages.
Any ideas how to do it apart from performing an additional global (unconstrained by size and from) search?
Aggregations are computed on all documents that match "query". The scope of aggregations has nothing to do with "size" and "from" values.
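The behaviour described in the answer can be pictured with a toy model in which buckets are computed over every match and only the hits are sliced by from and size. The data and the page function are made up for illustration and are not Elasticsearch code.

```python
from collections import Counter

# Hypothetical matches for some query: 25 documents with a "type" field.
MATCHES = [{"id": i, "type": "a" if i % 5 else "b"} for i in range(25)]


def page(matches, from_, size, agg_field="type"):
    """Toy model: buckets use every match, hits use only the requested slice."""
    buckets = dict(Counter(d[agg_field] for d in matches))
    hits = [d["id"] for d in matches[from_:from_ + size]]
    return hits, buckets


first_hits, first_buckets = page(MATCHES, 0, 10)
last_hits, last_buckets = page(MATCHES, 20, 10)
print(first_hits, first_buckets)
print(last_hits, last_buckets)
```

The hits change from page to page while the buckets stay the same.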

Elasticsearch questions: search, performance and caching

I'm new to elasticsearch, have been reading their API and some things are not clear to me
1) It is said that filters are cached. What does that mean? If I send a query with a filter on it, what gets cached? The results of that query? If I send a different query with the same filter, will the cache help me somehow?
I know the question is kinda vague, but so is ElasticSearch's documentation for this.
2) Is there a real performance difference between a query matching a term X against the "_all" field and one matching it against a specific field? As far as I understand, both queries will be compared against all documents that contain X in one of their fields, and the only difference is in how many fields will be matched against X in these documents. Is that correct?
1) For your first question take a look at this link.
To quote from the post
"Filters don’t score documents – they simply include or exclude. If a document matches a filter, it is represented with a one in the BitSet; otherwise a zero. This means that Elasticsearch can store an entire segment’s filter state (“who matches this particular filter?”) in a single, compact BitSet.
The first time Elasticsearch executes a filter, it parses Lucene segment data structures to determine what matches your filter. Instead of throwing away this information, it caches it inside a BitSet.
The next time the same filter is executed, Elasticsearch can reference the compact BitSet instead of the Lucene segments. This has huge performance benefits."
2) "The idea of the _all field is that it includes the text of one or more other fields within the document indexed. It can come very handy especially for search requests, where we want to execute a search query against the content of a document, without knowing which fields to search on. This comes at the expense of CPU cycles and index size." (link)
So if you know which fields you are going to query, search on those specific fields.
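The bitset caching described in the quote for question 1 can be sketched with a toy per-segment cache. This is a simplified model, not how Elasticsearch is implemented: the class, the cache key and the documents are made up, and the real engine makes its own decisions about which filters are worth caching.

```python
class FilterCache:
    """Toy per-segment filter cache: one bit per document, stored in an int."""

    def __init__(self, segment):
        self.segment = segment  # list of documents in this segment
        self.bitsets = {}       # cache key -> int used as a bitset
        self.scans = 0          # how many times the segment was scanned

    def matches(self, key, predicate):
        """Return the bitset for a filter, scanning the segment only once."""
        if key not in self.bitsets:
            self.scans += 1
            bits = 0
            for position, doc in enumerate(self.segment):
                if predicate(doc):
                    bits |= 1 << position
            self.bitsets[key] = bits
        return self.bitsets[key]

    def search(self, key, predicate, query):
        """Apply the cached filter bitset, then the (uncached) query."""
        bits = self.matches(key, predicate)
        return [doc["title"] for position, doc in enumerate(self.segment)
                if (bits >> position) & 1 and query(doc)]


# Hypothetical documents.
segment = [
    {"status": "published", "title": "red shoes"},
    {"status": "draft", "title": "red hat"},
    {"status": "published", "title": "blue hat"},
]


def is_published(doc):
    return doc["status"] == "published"


cache = FilterCache(segment)
# Two different queries that share the same filter.
red = cache.search("status:published", is_published, lambda d: "red" in d["title"])
hats = cache.search("status:published", is_published, lambda d: "hat" in d["title"])
print(red, hats, cache.scans)
```

In this model it is the filter's match set that is cached, not the results of the whole query, so a different query using the same filter still benefits from the cache.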

How to write fast Elastic Search queries

Is there a guide to writing ES queries: what to do, what to avoid, this sort of stuff? The official site describes all the various ways to search, but provides little guidance as to when to select what.
In my particular instance I have a list of providers; each one has a name, an address and a number of IDs. I want to give the user a box where he can type in anything he knows about the provider and run a search based on whatever is provided. Essentially I would like to match every word from the box against the records (documents) in the index.
For the end user this should look like a simple keyword search.
Matching should cover exact matches, wild card matches, phonetic matches, synonyms (for names). Also some fuzziness should be included too.
The official site describes various ways to do that, but how do I combine them? For instance, to support wildcard search, do I use a wildcard query, or do I index the field with NGrams and just do a text query?
With SQL queries, a sure way to get this sort of information is to check the execution plan for the query. If the SQL optimizer tells you that it will use a table scan against a table of considerable size, you know you should change your query or maybe add an index. AFAIK there is no equivalent for this powerful feature in ES, and I am not even sure it is possible to build one.
But at least some generic considerations...? Pretty please...
There is no single best way to go about doing things, because a lot of the time it depends on what you are indexing and how you map your data into fields within Elasticsearch.
Some rules of thumb that you should look out for:
a. Faceted Queries in Elasticsearch work in sequences:
{
  "query": {
    // data will be searched from this block first
  },
  "facets": {
    // after the data is received, it will be processed into facets
  }
}
Hence if your query size is huge, you are going to slow down your query further by faceting. Monitor the results of your query.
b. Filters vs Queries
Filters work on a subset of your query: they take the entire result of your query and then keep or discard documents according to what you do or do not want.
Queries are usually direct searches for data.
Hence, if you can make your query as specific as possible before you do a filter, it should yield faster results.
c. Queries are cached; running them again and again will generally yield faster responses. The Warmers API should be able to make your queries even quicker if you are always going to use the same set of queries
Again, all of these are rules of thumb and cannot be followed strictly, because what you index into specific fields will affect processing times. A string is different from a long, and analyzed strings are different from non-analyzed ones. What you probably need to do is experiment with your queries to get a better judgement.
One correction to the above: filters are cacheable by ES, not queries. Queries do the extra steps of relevance scoring and full-text search. So wherever full-text search is not needed, using a filter is advised.
Also, design your mappings with the correct index values (not_analyzed, no, analyzed).
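Following the correction about filters, one way to split a provider search into a scored part and a filtered part is sketched below as a Python dict for the search body. The helper name and the "state" field are hypothetical; "name" and "address" come from the question, and the bool query with must and filter clauses is the standard way to combine scoring and non-scoring conditions in current versions.

```python
import json


def build_provider_search(text, state=None):
    """Scored full-text part in 'must', exact-match part in 'filter'."""
    bool_query = {
        # Full-text clause: analyzed, scored, tolerant of small typos.
        "must": [{
            "multi_match": {
                "query": text,
                "fields": ["name", "address"],
                "fuzziness": "AUTO",
            }
        }],
    }
    if state is not None:
        # Yes/no clause: no scoring, and eligible for filter caching.
        # "state" is a hypothetical exact-value field.
        bool_query["filter"] = [{"term": {"state": state}}]
    return {"query": {"bool": bool_query}}


provider_body = build_provider_search("john smith main street", state="NY")
print(json.dumps(provider_body, indent=2))
```

Conditions that only include or exclude documents go into filter, and only the part that needs relevance ranking stays in must.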
