Elasticsearch fuzzy query too many hits bad score - sorting

How can I add a script that calculates the percentage of match from fuzzy query? And showed the matched value
Example patient has 5 symptoms and fist the possible (A)diagnosis has 5 possible symptoms and the second diagnosis (B) has 6 symptoms and
last(C) had 4 symptoms
Given some of the patience symptoms are similar, he actually matches 5/4 for C, 4/6 for B and 5/5 for A ...
The question is, how do you add such a script that calculates and sorts the matches, how do you put the value from match example for C?

You can use function_score for this requirement, like the one mentioned below.
Assuming symptoms as the field name where the symptom(s) are stored.
GET <index-of-symptoms>/_search?pretty
{
"size": 20,
"query": {
"function_score": {
"query": {
"match_all": {} #Write a master query to respond with all possible symptoms
},
"functions": [
{
"filter": {
"term": {
"symptoms": "headache"
}
},
"weight": 2 #Add weight to the score of the filter, this will be multiplied with the previous score each time
},
{
"filter": {
"term": {
"tags.keyword": "nausea"
}
},
"weight": 2
},
{
"filter": {
"term": {
"tags.keyword": "fever"
}
},
"weight": 2
}
]
}
}
}
This will give you a score on the basis of how many matches you qualify from the functions section.

Related

Is it possible to affect execution order of filters in Elasticsearch?

We have a query of the form:
{
"query": {
"bool": {
"filter": [
{
"term": {
"userId": {
"value": "a_user_id",
"boost": 1
}
}
},
{
"range": {
"date": {
"from": 1648598400000,
"to": 1648684799999,
"boost": 1
}
}
},
{
"query_string": {
"query": "*MyQuery*",
"fields": [
"aField^1.0",
"anotherField^1.0",
"thirdField^1.0"
],
"boost": 1
}
}
],
"boost": 1
}
}
}
If we remove the third filter (the query_string one), performance is dramatically improved (typically going from around 2000 to 20 ms) for different variants of the above query.
The thing is, the first two filters (on userId and the date range) will always result in only a handful of search hits (say 50 or so).
So, if it was possible to hint that to Elasticsearch, or otherwise affect the query plan, it could solve our issue.
In old (1.x) versions of ES it seems that this was affected by the order of filters. from Elasticsearch: Order of filters for best performance:
"The order of filters in a bool clause is important for performance. More-specific filters should be placed before less-specific filters in order to exclude as many documents as possible, as early as possible. If Clause A could match 10 million documents, and Clause B could match only 100 documents, then Clause B should be placed before Clause A."
But newer versions are smarter - https://www.elastic.co/blog/elasticsearch-query-execution-order:
Q: Does the order in which I put my queries/filters in the query DSL matter?
A: No, because they will be automatically reordered anyway based on their respective costs and match costs.
But is it still possible to reach the desired outcome here by modifying the ES search request somehow?
Your query should be like below, so that filters run first and will only select ~50 or so documents and then your costly query_string (because of the leading wildcard) will only run on those 50 docs.
{
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "*MyQuery*",
"fields": [
"aField^1.0",
"anotherField^1.0",
"thirdField^1.0"
],
"boost": 1
}
}
],
"filter": [
{
"term": {
"userId": {
"value": "a_user_id",
"boost": 1
}
}
},
{
"range": {
"date": {
"from": 1648598400000,
"to": 1648684799999,
"boost": 1
}
}
}
],
"boost": 1
}
}
}

Elasticsearch Query for good title keyword results

We have a elasticsearch index containing a catalog of products, that we want to search by title and description.
We want it to have the following constraints:
We are searching title and description for occurences (matches in title should be twice as important as description)
We want it to have a very light fuzzy search result (but still accurate results)
Not matching results to the searchterm should not be filtered out, but only shown later (so matching results should be on top and worse results should be at the bottom)
category_id should filter products out (so no results of other categories should be shown)
The created_at attribute should be valued very high in sorting as well.
products should lose score the "older" they get. (This is very important, because they lose importance with every day)
I have tried to create a query like that, but the results are really less than accurate. Sometimes finding completely unrelated stuff. I think that's because of the wildcard query.
Also I think there must be a more elegant solution for the "created_at" scoring. Right?
I am using Elasticsearch 6.2
This is my current code. I would be happy to see an elegant solution for this:
{
"sort": [
{
"_score": {
"order": "desc"
}
}
],
"min_score": 0.3,
"size": 12,
"from": 0,
"query": {
"bool": {
"filter": {
"terms": {
"category_id": [
"212",
"213"
]
}
},
"should": [
{
"match": {
"title_completion": {
"query": "Development",
"boost": 20
}
}
},
{
"wildcard": {
"title": {
"value": "*Development*",
"boost": 1
}
}
},
{
"wildcard": {
"title_completion": {
"value": "*Development*",
"boost": 10
}
}
},
{
"match": {
"title": {
"query": "Development",
"operator": "and",
"fuzziness": 1
}
}
},
{
"range": {
"created_at": {
"gte": 1563264817998,
"boost": 11
}
}
},
{
"range": {
"created_at": {
"gte": 1563264040398,
"boost": 4
}
}
},
{
"range": {
"created_at": {
"gte": 1563256264398,
"boost": 1
}
}
}
]
}
}
}
First of all, building a request returning relevant results is usually a difficult task. It can't be done without knowing the content of the documents. That said, I can give you hints to fulfill your requirements and avoid unrelevant results.
We are searching title and description for occurences (matches in title should be twice as important as description)
You can use boost as you did in your query to give more importance to matches on title compared to description.
We want it to have a very light fuzzy search result (but still accurate results)
You should use AUTO value for the fuzzy field to define different values of fuzziness depending on the length of the term. E.g., by default terms having less than 3 letters (most common terms where a change in letter can result in a different word) will not allows changes. Terms with more than 3 letters will allow one change and more than 5 will allow 2 changes. You can change this behavior depending of your tests.
Not matching results to the searchterm should not be filtered out, but only shown later (so matching results should be on top and worse results should be at the bottom)
Use a should clause in the bool statement. Clauses in a should statements does not filter documents (unless specified otherwise). The queries in should clause are only used to increase the score.
category_id should filter products out (so no results of other categories should be shown)
Use a must of filter clause in the bool statement to ensure that all documents validate a constraint. If you don't want the subqueries to contribute to the score (I believe its your case), use filter instead of match because filter will be able to cache the results. Your query is ok for this requirement.
The created_at attribute should be valued very high in sorting as well. products should lose score the "older" they get. (This is very important, because they lose importance with every day)
You should use a function score with a decay function. If decay function are not clear for you, you can skip the equations in the document and jump to the figure which self explanatory. The following query is an example using a gauss decay function.
{
"function_score": {
// Name of the decay function
"gauss": {
// Field to use
"created_at": {
"origin": "now", // "now" is the default so you can omit this field
"offset": "1d", // Values with less than 1 day will not be impacted
"scale": "10d", // Duration for which the scores will be scaled using a gauss function
"decay" : 0.01 // Score for values further than scale
}
}
}
}
Hints for writing queries
Avoid wildcard queries: If you use * they are not efficient and will consume a lot of memory. If you want to be able to search in part of a term (finding "penthouse" when the user search "house") you should create a subfield using ngram tokenizer and write a standard match query using the subfield.
Avoid setting a minimum score: The score is a relative value. A small score or a high score does not mean that the document is relevant or not. You can read this article about the subject.
Be carefull with fuzzy queries: Fuzzy can generate a lot of noise and confuse users. In general, I would recommend to increase the default AUTO threshold for fuzzy and accept that some queries with mispelling does not return good results. Usually, it is simpler for a user to detect a mispelling in his input compared to understanding why he has completly unrelated results.
Example query
This is just an example that you will need to adapt with your data.
{
"size": 12,
"query": {
"bool": {
"filter": {
"terms": {
"category_id": <CATEGORY_IDS>
}
},
"should": [
{
"match": {
"title": {
"query": <QUERY>,
"fuzziness": AUTO:4:12,
"boost": 3
}
}
},
{
"match": {
"title_completion": {
"query": <QUERY>,
"boost": 1
}
}
},
{
"match": {
// title_completion field with ngram tokenizer
"title_completion.ngram": {
"query": <QUERY>,
// Use lower boost because it match only partially
"boost": 0.5
}
}
}
]
},
"function_score": {
// Name of the decay function
"gauss": {
// Field to use
"created_at": {
"origin": "now", // "now" is the default so you can omit this field
"offset": "1d", // Values with less than 1 day will not be impacted
"scale": "10d", // Duration for which the scores will be scaled using a gauss function
"decay" : 0.01 // Score for values further than scale
}
}
}
}
}

Give more score to documents that contains all query terms

I have a problem with scoring in elasticsearch. When user enter a query that contains 3 terms, sometimes a document that has two words a lot, outscores a document that contains all three words. for example if user enters "elasticsearch query tutorial", I want documents that contains all these words score higher than a document with a lot of "tutorial" and "elasticsearch" terms in it.
PS: I am using minimum should match and shingls in my query. also they made ranking a lot better, they did not solve this problem completely. I need something like query coordination in lucene's practical scoring function. is there anything like that in elastic with BM-25?
One of the possible solutions could be using function score:
{
"query": {
"function_score": {
"query": { "match_all": {} },
"functions": [
{
"filter": { "match": { "title": "elasticserch" } },
"weight": 1
},
{
"filter": { "match": { "title": "tutorial" } },
"weight": 1
}
],
"score_mode": "sum"
}
}
}
In this case, you would have clearly a better position for documents with more matches. However, this would completely ignore TF-IDF or any other parameters.

Must match multiple values

I have a query that works fine when I need the property of a document
to match just one value.
However I also need to be able to search with must with two values.
So if a banana has id 1 and a lemon has id 2 and I search for yellow
I will get both if I have 1 and 2 in the must clause.
But if i have just 1 I will only get the banana.
{
"from": 0,
"size": 20,
"query": {
"bool": {
"should": [
{ "match":
{ "fruit.color": "yellow" }}
],
"must" : [
{ "match": { "fruit.id" : "1" } }
]
}
}
}
I havenĀ“t found a way to search with two values with must.
is that possible?
If the document "must" be returned only if the id is 1 or 2, that sounds like another should clause. If I'm understanding your question properly, you want documents with either id 1 OR id 2. Additionally, if the color is yellow, give it a higher score.
Here's one way you might achieve what you're looking for:
{
"query": {
"bool": {
"should": {
"match": {
"fruit.color": "yellow"
}
},
"must": {
"bool": {
"should": [
{
"match": {
"fruit.id": "1"
}
},
{
"match": {
"fruit.id": "2"
}
}
]
}
}
}
}
}
Here I put the two match queries in the should clause of a separate bool query. This achieves the OR behavior you are looking for.
Have another look at the Bool Query documentation and take note of the nuances of should. It behaves differently by default depending on whether or not there is a sibling must clause and whether or not the bool query is being executed in filter context.
Another key option that is adjustable and can help you achieve your expected results is the minimum_should_match parameter. Have a look at this documentation page.
Instead of a match query, you could simply try the terms query for ORing between multiple terms.
Match queries are generally used for analyzed fields. For exact matching, you should use term queries
{
"from": 0,
"size": 20,
"query": {
"bool": {
"should": [
{ "match": { "fruit.color": "yellow" } }
],
"must" : [
{ "terms": { "fruit.id": ["1","2"] } }
]
}
}
}
term or terms query is the perfect way to fetch the exact text or id, using match query result in search inside the id or text
Ex:
id = '4'
id = '44'
Search using match query with id = 4 return both 4 & 44 since it matches 4 in both. This is where terms query come into play.
same search using terms query will return 4 only.
So the accepted is absolutely wrong. Use the #Rahul answer. Just one more thing you need to do, Instead of text you need to analyse the field as a keyword
Example for indexing a field both as a text and keyword (mapping is for flat level for nested change it accordingly).
{
"index_patterns": [ "test" ],
"mappings": {
"kb_mapping_doc": {
"_source": {
"enabled": true
},
"properties": {
"id": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
}
using #Rahul's answer doesn't worked because you might be analysed as a text.
id - access a text field
id.keyword - access a keyword field
it would be
{
"from": 0,
"size": 20,
"query": {
"bool": {
"should": [{
"match": {
"color": "yellow"
}
}],
"must": [{
"terms": {
"id.keyword": ["1", "2"]
}
}]
}
}
}
So I would say accepted answer will return falsy results Please use #Rahul's answer with the corresponding mapping.

Elastic Search - Sort By Doc Type

I have an elastic search index with 2 different doc types: 'a' and 'b'. I would like to sort my results by type and give preference to type='b' (even if it has a low score). I had been consuming the results of the search below at the client end and sorting them but I've realized that this approach does not work well since I am only inspecting the first 10 results which often does not contain any b's. Increasing the return results is not ideal. I'd like to get the elastic search to do the work.
http://<server>:9200/my_index/_search?q=london
You would need to play with function_score and, depending on how you already score your documents, test some weight values, boost_modes and score_modes for each type. For example:
GET /some_index/a,b/_search
{
"query": {
"function_score": {
"query": {
# your query here
},
"functions": [
{
"filter": {
"type": {
"value": "b"
}
},
"weight": 3
},
{
"filter": {
"type": {
"value": "a"
}
},
"weight": 1
}
],
"score_mode": "first",
"boost_mode": "multiply"
}
}
}
Its working for me.you will execute below commands at command Prompt.
curl -XGET localhost:9200/index_v1,index_v2/_search?pretty -d #boost.json
boost.json
{
"indices_boost" : {
"index_v2" : 1.4,
"index_v1" : 1.3
}
}

Resources