Elasticsearch fuzzy matching: How can I get direct hits first? - elasticsearch

I'm using Elasticsearch to search names in a database, and I want it to be fuzzy to allow for minor spelling errors. Based on the advice I've found on the matter, I'm using "match" and "fuzziness" instead of "fuzzy", which definitely seems to be more accurate. This is my query:
{ "query":
{ "match":
{ "last_name":
{ "query": "Beach",
"type": "phrase",
"fuzziness": 2
}
}
}
}
However, even though I have numerous results with last_name "Beach" (I know there's at least 100), I also get results with last_name "Beech" and "Berch" in the first 10 hits returned by my query. Can someone help me figure out how to get the exact matches first?

Try changing your query to a boolean query with 2 should queries.
The first one being your current query, and then second being a query that only gives exact matches, then give that one a big boost (like 10.0).
That should get your exact matches on top while still listing your partial matches.

I tried to edit "Constantijn" answer above to include sample based on his answer, but still not appearing (pending approval). So, I will just put a sample here instead...
{
"query": {
"bool": {
"should": [
{
"match": {
"last_name": {
"query": "Beach",
"fuzziness": 2,
"boost": 1
}
}
},
{
"match": {
"last_name": {
"query": "Beach",
"boost": 10
}
}
}
]
}
}
}

Related

Elasticsearch - Impact of adding Boost to query

I have a very simple Elastic query mentioned below.
{
"query": {
"bool": {
"must": [
{
"bool": {
"minimum_should_match": 1,
"should": [
{
"match": {
"tag": {
"query": "Audience: PRO Brand: Samsung",
"boost": 3,
"operator": "and"
}
}
},
{
"match": {
"tag": {
"query": "audience: PRO brand samsung",
"boost": 2,
"operator": "or"
}
}
}
]
}
}
]
}
}
}
I want to know if I add a boost in the query, will there be any performance impact because of this, and also will boosting help if you have a very large data set, where the occurrence of a search word is common.
Elasticsearch adds boost param with default value, IMO giving different value won't make much difference in the performance, but you should be able to measure it yourself.
Reg. your second question, adding boost definitely makes sense where the occurrence of your search words are common, this will help you to find the relevant document. for example: suppose you are searching for query in a index containing Elasticsearch posts(query will be very common on Elasticsearch posts), but you want the give more weight to documents which have tag elasticsearch-query. Adding boosts in this case, will provide you more relevant results.

How to limit the results in a multi match query?

i had used multi match phrase when I make search. However I have to put limit result of all math phrase seperately. I mean, I want to take only 2 result for each multi match. I can't find any limit/size attributes. Do you know any solution?
Example Code:
"query": {
"bool": {
"should": [
{
"match_phrase": {
"text": {
"query": " Home is clear and big ",
"slop": 2
}
}
},
{
"match_phrase": {
"text": {
"query": "365 different company use our system in test",
"slop": 2
}
}
}
]}}
use
{"limit" : 3, "from":0, "query": ...}
The simplest solution is to make to individual searches for each of the conditions. The size parameter can be set to retrieve only the first 2 results for each query.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-from-size.html
The boolean should query will not distinguish which condition has been satisfied: it returns documents for which at least one of the two conditions holds. The scores for the two matches will be combined into a single score but it will be impossible to tell which s

Must match multiple values

I have a query that works fine when I need the property of a document
to match just one value.
However I also need to be able to search with must with two values.
So if a banana has id 1 and a lemon has id 2 and I search for yellow
I will get both if I have 1 and 2 in the must clause.
But if i have just 1 I will only get the banana.
{
"from": 0,
"size": 20,
"query": {
"bool": {
"should": [
{ "match":
{ "fruit.color": "yellow" }}
],
"must" : [
{ "match": { "fruit.id" : "1" } }
]
}
}
}
I haven´t found a way to search with two values with must.
is that possible?
If the document "must" be returned only if the id is 1 or 2, that sounds like another should clause. If I'm understanding your question properly, you want documents with either id 1 OR id 2. Additionally, if the color is yellow, give it a higher score.
Here's one way you might achieve what you're looking for:
{
"query": {
"bool": {
"should": {
"match": {
"fruit.color": "yellow"
}
},
"must": {
"bool": {
"should": [
{
"match": {
"fruit.id": "1"
}
},
{
"match": {
"fruit.id": "2"
}
}
]
}
}
}
}
}
Here I put the two match queries in the should clause of a separate bool query. This achieves the OR behavior you are looking for.
Have another look at the Bool Query documentation and take note of the nuances of should. It behaves differently by default depending on whether or not there is a sibling must clause and whether or not the bool query is being executed in filter context.
Another key option that is adjustable and can help you achieve your expected results is the minimum_should_match parameter. Have a look at this documentation page.
Instead of a match query, you could simply try the terms query for ORing between multiple terms.
Match queries are generally used for analyzed fields. For exact matching, you should use term queries
{
"from": 0,
"size": 20,
"query": {
"bool": {
"should": [
{ "match": { "fruit.color": "yellow" } }
],
"must" : [
{ "terms": { "fruit.id": ["1","2"] } }
]
}
}
}
term or terms query is the perfect way to fetch the exact text or id, using match query result in search inside the id or text
Ex:
id = '4'
id = '44'
Search using match query with id = 4 return both 4 & 44 since it matches 4 in both. This is where terms query come into play.
same search using terms query will return 4 only.
So the accepted is absolutely wrong. Use the #Rahul answer. Just one more thing you need to do, Instead of text you need to analyse the field as a keyword
Example for indexing a field both as a text and keyword (mapping is for flat level for nested change it accordingly).
{
"index_patterns": [ "test" ],
"mappings": {
"kb_mapping_doc": {
"_source": {
"enabled": true
},
"properties": {
"id": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
}
using #Rahul's answer doesn't worked because you might be analysed as a text.
id - access a text field
id.keyword - access a keyword field
it would be
{
"from": 0,
"size": 20,
"query": {
"bool": {
"should": [{
"match": {
"color": "yellow"
}
}],
"must": [{
"terms": {
"id.keyword": ["1", "2"]
}
}]
}
}
}
So I would say accepted answer will return falsy results Please use #Rahul's answer with the corresponding mapping.

Elasticsearch rescore all results ignoring base score

I'm trying to rescore my results with the following query:
POST /archive/item/_search
{
"query": {
"multi_match": {
"fields": ["title", "description"],
"query": "1 złoty",
"operator": "and"
}
},
"rescore": {
"window_size": 50,
"query": {
"rescore_query": {
"multi_match": {
"type": "phrase",
"fields": ["title", "description"],
"query": "1 złoty",
"slop": 10
}
},
"query_weight": 0,
"rescore_query_weight": 1
}
}
}
I'm doing this because I want to score by proximity mainly.
Also, I want to ignore source field length impact on the score.
Am I doing this right? If not, what's the best practice here?
And the second question. Why window_size is needed anyway?
I don't want top results only.
The main query atcs like a filter, so all the results it returns are relevant.
I quess something like "window_size": "all" would be perfect, but I couldn't find anything in the docs.
To answer your second question, the reason it's needed is because it's designed to be for top results only. Basically it's a cost issue - the assumption is that the secondary algorithm is more expensive so it was only designed to be run on the top results. There's more discussion about this here:
https://github.com/elasticsearch/elasticsearch/issues/2640
and here:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-rescore.html
Personally I think the "all" option is a great idea, maybe you should open an issue on github?
If you want to score with proximity match all results returned by some other filter this should do:
{
"query": {
"filtered" : {
"query" : {
"multi_match": {
"type": "phrase",
"fields": ["title", "description"],
"query": "1 złoty",
"slop": 10
}
},
"filter" : {
"query": {
"multi_match": {
"fields": ["title", "description"],
"query": "1 złoty",
"operator": "and"
}
}
}
}
}
}
According to this, the filter is run before the query, so the performance shouldn't be bad as well. What's more you don't score twice, because filters don't calculate scores. Another advantage is that filters can be cached which should speed things significantly.
Keep in mind that I did short tests only, mostly focusing on syntax not results. You might want to double check it.

elasticsearch boost importance of exact phrase match

Is there a way in elasticsearch to boost the importance of the exact phrase appearing in the the document?
For example if I was searching for the phrase "web developer" and if the words "web developer" appeared together they would be boosted by 5 compared to "web" and "developer" appearing separately throughout the document. Thereby any document that contained "web developer" together would appear first in the results.
You can combine different queries together using a bool query, and you can assing a different boost to them as well. Let's say you have a regular match query for both the terms, regardless of their positions, and then a phrase query with a higher boost.
Something like the following:
{
"query": {
"bool": {
"should": [
{
"match": {
"field": "web developer"
}
},
{
"match_phrase": {
"field": "web developer",
"boost": 5
}
}
],
"minimum_number_should_match": 1
}
}
}
As an alternative to javanna's answer, you could do something similar with must and should clauses within a bool query:
{
"query": {
"bool": {
"must": {
"match": {
"field": "web developer",
"operator": "and"
}
},
"should": {
"match_phrase": {
"field": "web developer"
}
}
}
}
}
Untested, but I believe the must clause here will match results containing both 'web' and 'developer' and the should clause will score phrases matching 'web developer' higher.
You could try using rescore to run an exact phrase match on your initial results. From the docs:
"Rescoring can help to improve precision by reordering just the top (eg 100 - 500) documents returned by the query and post_filter phases, using a secondary (usually more costly) algorithm, instead of applying the costly algorithm to all documents in the index."
https://www.elastic.co/guide/en/elasticsearch/reference/current/filter-search-results.html#rescore
I used below sample query in my case which is working. It brings exact + fuzzy results but exact ones are boosted!
{ "query": {
"bool": {
"should": [
{
"match": {
"name": "pala"
}
},
{
"fuzzy": {
"name": "pala"
}
}
]
}}}
I do not have enough reputation to comment on James Adison's answer, which I agree with.
What is still missing is the boost factor, which can be done using the following syntax:
{
"match_phrase":
{
"fieldName": {
"query": "query string for exact match",
"boost": 10
}
}
}
I think its default behaviour already with match query "or" operator. It'll filter phrase "web developer" first and then terms like "web" or "develeper". Though you can boost your query using above answers. Correct me if I'm wrong.

Resources