Validating my understanding of the dis_max query in Elasticsearch

I have tried to understand how the dis_max query works and want to validate my understanding. Please tell me whether I understood it correctly.
According to the documentation, a dis_max query is:
A query that generates the union of documents produced by its
subqueries, and that scores each document with the maximum score for
that document as produced by any subquery, plus a tie breaking
increment for any additional matching subqueries.
Suppose the documents in our ES cluster are as follows:
{"FOO":"ABC"}, {"FOO":"XYZ"}, {"FOO":"ABC XYZ"}, {"FOO":"ABC DEF"}, {"FOO":"DEF"}
and the dis_max query is:
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "FOO": "ABC" } },
        { "match": { "FOO": "XYZ" } }
      ]
    }
  }
}
So, as per the documentation, let us first find the union of the documents returned by the dis_max subqueries. The union would be {"FOO":"ABC"}, {"FOO":"XYZ"}, {"FOO":"ABC XYZ"}, {"FOO":"ABC DEF"}. In the next step, each document is scored with the maximum score produced for it by any subquery, something like this:
{"FOO":"ABC"} will be scored against {"match":{"FOO": "ABC"}} and {"match":{"FOO": "XYZ"}}, and the maximum score returned will be used.
Similarly, {"FOO":"XYZ"} will be scored against {"match":{"FOO": "ABC"}} and {"match":{"FOO": "XYZ"}}, and the maximum score returned will be used. This is done for every document in the union, and finally the documents are returned sorted by score.
Is this how dismax query works? Or did I misunderstand or miss out anything?
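To make the scoring rule in the quoted documentation concrete, here is a minimal Python sketch of how dis_max combines subquery scores for a single document. It is not Elasticsearch code: the function name dis_max_score and the per-subquery scores are made up for illustration (real scores depend on the index), and it assumes that only the subqueries a document actually matches contribute a score. tie_breaker is the real dis_max parameter, which defaults to 0.0.

```python
from typing import Dict, List


def dis_max_score(matching_scores: List[float], tie_breaker: float = 0.0) -> float:
    """Score one document from the scores of the subqueries it matched."""
    if not matching_scores:
        raise ValueError("a document that matches no subquery is not in the union")
    ordered = sorted(matching_scores, reverse=True)
    # Best subquery score, plus tie_breaker times the scores of the other matches.
    return ordered[0] + tie_breaker * sum(ordered[1:])


# Hypothetical scores, one entry per subquery that the document matched.
# {"FOO": "DEF"} matches neither subquery, so it never enters the result set.
matched: Dict[str, List[float]] = {
    "ABC": [1.0],
    "XYZ": [1.0],
    "ABC XYZ": [0.75, 0.75],
    "ABC DEF": [0.75],
}

for doc, scores in matched.items():
    print(doc, dis_max_score(scores), dis_max_score(scores, tie_breaker=0.5))
```

With tie_breaker left at 0.0, "ABC XYZ" gets only its best subquery score; with a tie_breaker above 0 the second matching subquery adds to it.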

Related

Scoring of term vs. terms query is different

I am retrieving documents by filtering and using a term query to apply a score.
The query should match all animals having a specified color - the more colors are matched, the higher the score of a doc. Strange thing is, term and terms query result in a different scoring.
{
  "query": {
    "bool": {
      "should": [
        { "terms": { "color": ["brown", "darkbrown"] } }
      ]
    }
  }
}
should be the same as using
{"term": {"color": {"value": "brown"} } },
{"term": {"color": {"value": "darkbrown"} } }
Query no. 1 gives me exactly the same score for a document whether 1 or 2 terms are matched. The latter, of course, returns a higher score if more colors are matched.
According to the coordination factor, the returned score should be higher if more terms are matched. Therefore these two queries should result in the same score. Or is it because term queries do not analyze the search term?
My field is indexed as text. Strings are indexed as an "array" of strings, e.g. "brown","darkbrown".
Difference between the term and terms queries:
The term query returns documents that contain an exact term in a provided field.
The terms query is the same as the term query, except that you can search for multiple values: it returns documents that contain one or more of the exact terms.
Warning: Avoid using the term query for text fields.
As far as this part of your question is concerned:
or is because term queries do not analyze the search term?
Yes. Term-level queries do not analyze the search term; they only match the exact term as given.
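The score difference reported in the question can be mimicked with a small Python sketch. It assumes, consistent with what the question observes, that terms gives every matching document the same constant score, while a bool query adds up the scores of its matching should clauses. The function names and the per-clause scores are made up for illustration, and this is a toy model rather than how Elasticsearch is implemented internally.

```python
from typing import Dict, Iterable, Optional


def terms_score(doc_terms: Iterable[str], query_terms: Iterable[str],
                boost: float = 1.0) -> Optional[float]:
    """Assumed behaviour of terms: a constant score once any term matches."""
    if set(doc_terms) & set(query_terms):
        return boost
    return None  # no match: the document is not returned


def should_term_score(doc_terms: Iterable[str],
                      clause_scores: Dict[str, float]) -> Optional[float]:
    """bool/should over term clauses: the scores of matching clauses add up."""
    present = set(doc_terms)
    matched = [score for term, score in clause_scores.items() if term in present]
    return sum(matched) if matched else None


query_terms = ["brown", "darkbrown"]
clause_scores = {"brown": 0.5, "darkbrown": 0.5}  # hypothetical per-clause scores

one_color = ["brown"]
two_colors = ["brown", "darkbrown"]
print(terms_score(one_color, query_terms), terms_score(two_colors, query_terms))
print(should_term_score(one_color, clause_scores),
      should_term_score(two_colors, clause_scores))
```

In this model the terms variant cannot distinguish one matching color from two, whereas the should variant rewards each additional match.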

What is the difference between must and filter in Query DSL in elasticsearch?

I am new to Elasticsearch and I am confused between must and filter. I want to perform an AND operation between my terms, so I did this:
POST /xyz/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "city": "city1"
          }
        },
        {
          "term": {
            "saleType": "sale_type1"
          }
        }
      ]
    }
  }
}
which gave me the required results matching both the terms, and on using filter like this
POST /xyz/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "city": "city1"
          }
        }
      ],
      "filter": {
        "term": {
          "saleType": "sale_type1"
        }
      }
    }
  }
}
I get the same result, so when should I use must and when should I use filter? What is the difference?
must contributes to the score. In filter, the score of the query is ignored.
In both must and filter, the clause (query) must appear in matching documents. This is the reason for getting the same results.
You may check this link
Score
The relevance score of each document is represented by a positive floating-point number called the _score. The higher the _score, the more relevant the document.
A query clause generates a _score for each document.
To learn how the score is calculated, refer to this link
must returns a score for every matching document. This score helps you rank the matching documents and compare their relative relevance (using the magnitude of each document's score).
With this, one can say how many times more relevant Doc 1 is than Doc 2, or that Docs 1 to 7 are of much higher relevance than Doc 8 and beyond.
For how the relative score is determined, you can refer to the references below.
Briefly, it is related to the number of term occurrences in the document, the document length, and the average number of term occurrences in your database index.
filter doesn't return a score. All one can say is that all matching documents are relevant; it won't help in evaluating whether one is more relevant than another. You can think of filter as a must with only 2 scores, zero or non-zero, where all zero-scored documents are dropped.
filter is helpful if you just want to whitelist/blacklist, e.g. all documents belonging to the topic "pets".
In summary, there are 3 points that will help you in deciding when to use what:
must is your only choice when comparing/ranking documents by relevance
filter excludes all documents that don't match
filter is typically faster because Elasticsearch doesn't need to compute a relevance score and can cache frequently used filters
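The points above can be illustrated with a toy Python model of a bool query that has only must and filter clauses. The function bool_score and the clause scores are hypothetical; the sketch only mirrors the rule that both clause types have to match while only must adds to the score.

```python
from typing import List, Optional


def bool_score(must_scores: List[Optional[float]],
               filter_matches: List[bool]) -> Optional[float]:
    """Toy model: None means the document is excluded from the results.

    must_scores holds one entry per must clause: its score, or None if the
    clause did not match. filter_matches holds one yes/no entry per filter clause.
    """
    if any(score is None for score in must_scores):
        return None  # a must clause failed
    if not all(filter_matches):
        return None  # a filter clause failed
    # Only must clauses contribute; filter clauses are pure yes/no checks.
    return float(sum(score for score in must_scores if score is not None))


# Hypothetical clause scores for one document that matches city and saleType.
both_in_must = bool_score(must_scores=[1.5, 0.5], filter_matches=[])
one_in_filter = bool_score(must_scores=[1.5], filter_matches=[True])
filter_fails = bool_score(must_scores=[1.5], filter_matches=[False])
print(both_in_must, one_in_filter, filter_fails)
```

The same document is returned in the first two cases, only with a different score, which matches the observation in the question.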
References:
Query vs Filter: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html
Computation of Relevance: https://www.infoq.com/articles/similarity-scoring-elasticsearch/

Boosting the relevance score based on the unique keyword found

I am in a scenario where I need to give more relevance to a document in the index if it contains a unique keyword. Let me describe the scenario.
Let's say I search for the terms znkdref unsuccessfull. The result will contain documents that have znkdref, unsuccessfull, or both. I want documents containing both znkdref and unsuccessfull to have the highest relevance, documents containing only znkdref to have lower relevance, and documents containing only unsuccessfull to have the lowest relevance.
Is there a way to achieve this? I would be glad to get any help.
You want to use Query Time Boosting, in particular Prioritized Clauses.
In short you need to extract the keywords that you want boosted and build a query that boosts the parts that you want.
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "content": {
              "query": "znkdref",
              "boost": 2
            }
          }
        },
        {
          "match": {
            "content": {
              "query": "unsuccessfull"
            }
          }
        }
      ]
    }
  }
}
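Building that request by hand gets tedious once there are several keywords, so here is a small Python sketch that generates the same kind of body from a keyword-to-boost mapping. The helper name prioritized_should is made up; the content field and the keywords come from the example above, and the explicit boost of 1.0 for the second keyword is just the default spelled out. Because the scores of matching should clauses add up, a document containing both keywords would normally outrank one containing only znkdref, other things being equal.

```python
import json
from typing import Dict


def prioritized_should(field: str, boosts: Dict[str, float]) -> dict:
    """Build a bool/should search body with one boosted match clause per keyword."""
    clauses = [
        {"match": {field: {"query": keyword, "boost": boost}}}
        for keyword, boost in boosts.items()
    ]
    return {"query": {"bool": {"should": clauses}}}


# Higher boost for the keyword that should weigh more.
body = prioritized_should("content", {"znkdref": 2.0, "unsuccessfull": 1.0})
print(json.dumps(body, indent=2))
```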
Update based on comment:
If you want to know why a document got the score that it did (maybe to identify "keywords"), you can pass explain=true as a query parameter or set "explain": true at the root of the POST payload. The result will then include document frequency counts and sub-scores.
Do you mean that "znkdref" is a unique keyword, for example a special name of something? If so:
Of course, documents that match the whole query string "znkdref unsuccessfull" will generally have the highest relevance score.
Documents that contain "znkdref" will usually have a higher relevance score than documents that contain "unsuccessfull", because the TF-IDF score of a rare term such as "znkdref" is higher than that of a more common term such as "unsuccessfull".
The relevance score function is described at https://www.elastic.co/guide/en/elasticsearch/guide/current/practical-scoring-function.html
I hope that my answer is helpful for you.

elasticsearch: boost query based on values of a variable

I understand how to boost a query in Elasticsearch depending on the absolute value of a variable. For example:
{
  "query": {
    "bool": {
      "should": [
        { "match": { "field1": { "query": 10, "boost": 2 } } }
      ]
    }
  }
}
What I need to do is make sure that field1 influences the score, but I don't know any absolute value. For example, a document with field1 = 20 should get a higher score than a document with field1 = 10. However, this is different from sorting, because sorting is absolute. I just want this variable to contribute to the overall score; it is not the only field controlling the overall score.
The best solution here would be the function_score query.
It can be seen as the Swiss Army knife for customizing scores.
You can use its field_value_factor function to achieve what you are looking for.
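As a rough sketch of that suggestion, the Python below builds a function_score request body that uses field_value_factor on field1. The title match is a made-up stand-in for whatever the rest of your query is, and the choice of log1p, factor and boost_mode is only one reasonable configuration, not the only one. The helper log1p_factor reflects my reading of the documented log1p modifier (common logarithm of 1 plus factor times the field value), so treat it as an assumption; it is there to show that the contribution grows with the field value without dominating the score.

```python
import json
import math


def field_value_factor_body(base_query: dict, field: str, factor: float = 1.0) -> dict:
    """Build a search body whose score also depends on a numeric field."""
    return {
        "query": {
            "function_score": {
                "query": base_query,
                "field_value_factor": {
                    "field": field,
                    "factor": factor,
                    "modifier": "log1p",  # dampens large values
                    "missing": 0,         # value used when the field is absent
                },
                "boost_mode": "sum",      # add the factor to the query score
            }
        }
    }


def log1p_factor(value: float, factor: float = 1.0) -> float:
    """Assumed effect of the log1p modifier: log10(1 + factor * value)."""
    return math.log10(1.0 + factor * value)


# "title" and its text are placeholders for the rest of your query.
body = field_value_factor_body({"match": {"title": "example"}}, "field1")
print(json.dumps(body, indent=2))
print(log1p_factor(10), log1p_factor(20))
```

A document with field1 = 20 gets a larger contribution than one with field1 = 10, but the text match still takes part in the final score, which is the difference from sorting on field1.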

Constant Score Query elasticsearch boosting

My understanding of the Constant Score Query in Elasticsearch is that the boost factor is assigned as the score of every matching document. The documentation says:
A query that wraps a filter or another query and simply returns a constant score equal to the query boost for every document in the filter.
However when I send this query:
"query": {
  "constant_score": {
    "filter": {
      "term": {
        "source": "BBC"
      }
    },
    "boost": 3
  }
},
"fields": ["title", "source"]
all the matching documents are given a score of 1?! I cannot figure out what I am doing wrong, and had also tried with query instead of filter in constant_score.
Scores are only meant to be relative to all other scores in a given result set, so a result set where everything has the score of 3 is the same as a result set where everything has the score of 1.
Really, the only purpose of the relevance _score is to sort the results of the current query in the correct order. You should not try to compare the relevance scores from different queries. - Elasticsearch Guide
Either the constant score is being ignored because it's not being combined with another query, or it's being normalized. As @keety said, check the output of explain to see exactly what's going on.
The constant score query gives an equal score to every matching document, irrespective of scoring factors such as TF and IDF. It can be used when you don't care how well a document matched, only whether it matched, and you still want a score, unlike with filter.
If you want a score of literally 3 for all the documents matching a particular query, then you should use the function score query, something like:
"query": {
  "function_score": {
    "functions": [
      {
        "filter": { "term": { "source": "BBC" } },
        "weight": 3
      }
    ]
  }
  ...
}
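For what it's worth, here is a hedged Python sketch of a complete request body along those lines. Compared with the snippet above it adds two things that are my own assumptions: a query clause, so that only the BBC documents are returned, and boost_mode set to replace, so that the function result is used as the score instead of being multiplied with the query score. The helper names fixed_score_body and combine are hypothetical, combine is only a toy model of three boost_mode options, and I have not run this against a cluster, so check the result with explain.

```python
import json


def fixed_score_body(field: str, value: str, score: float) -> dict:
    """Build a function_score body that aims to score every match as `score`."""
    term_filter = {"term": {field: value}}
    return {
        "query": {
            "function_score": {
                # Restrict the result set to matching documents.
                "query": {"constant_score": {"filter": term_filter}},
                "functions": [{"filter": term_filter, "weight": score}],
                # Use the function result instead of multiplying it with the query score.
                "boost_mode": "replace",
            }
        }
    }


def combine(query_score: float, function_score: float,
            boost_mode: str = "multiply") -> float:
    """Toy model of how boost_mode merges the query and function scores."""
    if boost_mode == "multiply":
        return query_score * function_score
    if boost_mode == "sum":
        return query_score + function_score
    if boost_mode == "replace":
        return function_score
    raise ValueError(f"unsupported boost_mode in this sketch: {boost_mode}")


body = fixed_score_body("source", "BBC", 3)
print(json.dumps(body, indent=2))
print(combine(1.0, 3.0), combine(0.4, 3.0, "replace"))
```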
