Custom score for exact, phonetic and fuzzy matching in Elasticsearch

I have a requirement for custom scoring on a name field. To keep it simple, let's say that if I search for 'Smith' against names in the index, the logic should be:
if input = exact 'Smith' then
    score = 100%
else if input = phonetic match then
    score = <depending upon fuzziness of input's match with name>%
end if
I'm able to search documents with a fuzziness of 1, but I don't know how to assign a custom score depending on how fuzzy the match is. Thanks!
Update:
I went through a post with the same requirement as mine, in which the author mentioned solving it with native scripts. My question still remains: how do I actually get a score based on the similarity distance so that it can be used in a native script?
The post for reference:
https://discuss.elastic.co/t/fuzzy-query-scoring-based-on-levenshtein-distance/11116
The text to look for in the post:
"For future readers I solved this issue by creating a custom score query and
writing a (native) script to handle the scoring."

You can implement this search logic using the function_score query.
Here is a possible example:
{
  "query": {
    "function_score": {
      "query": {
        "match": { "input": "Smith" }
      },
      "boost": "5",
      "functions": [
        {
          "filter": { "match": { "input.keyword": "Smith" } },
          "weight": 23
        }
      ]
    }
  }
}
In this example the mapping indexes the input field both as text and as keyword (input.keyword is used for the exact match). We re-score the documents that exactly match the term "Smith" higher than the rest of the documents matched by the base query (a plain match here, but in your case it will be the query with fuzziness).
You can control the re-scoring effect by tuning the weight parameter.
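As a sketch of that substitution (the fuzziness value and the boost_mode setting are assumptions, not part of the original answer), the base query can itself be fuzzy while exact matches still receive the extra weight:
{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "input": {
            "query": "Smith",
            "fuzziness": 1
          }
        }
      },
      "functions": [
        {
          "filter": { "match": { "input.keyword": "Smith" } },
          "weight": 23
        }
      ],
      "boost_mode": "multiply"
    }
  }
}
With boost_mode multiply, fuzzier hits keep their naturally lower relevance score, and exact hits are lifted above them by the weight factor.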

Related

Why is Elasticsearch with Wildcard Query always 1.0?

When I do a search in Elasticsearch with a wildcard query (wildcard at the end), the score for all hits is 1.0.
Is this by design? Can I change this behavior somewhere?
Elasticsearch is basically saying that all results are equally relevant, as you've provided an unqualified search (a wildcard, equivalent to a match_all). As soon as you add some additional context through the various types of queries, you will notice changes in the scoring.
Depending on your ultimate goal, you may want to look into the Function Score query - reference: https://www.elastic.co/guide/en/elasticsearch/reference/6.7/query-dsl-function-score-query.html
The first example provided would give you essentially random scores for all documents in your cluster:
GET /_search
{
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "boost": "5",
      "random_score": {},
      "boost_mode": "multiply"
    }
  }
}
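If you do want wildcard hits to be scored individually instead, one option (a sketch, assuming a text field named title) is the rewrite parameter of the wildcard query, which controls how the query is rewritten into term queries:
GET /_search
{
  "query": {
    "wildcard": {
      "title": {
        "value": "elastic*",
        "rewrite": "scoring_boolean"
      }
    }
  }
}
With scoring_boolean, each expanded term contributes to scoring like a regular term query, instead of every hit receiving the same constant score from the default constant_score rewrite.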

Elasticsearch - Edit distance using fuzzy is inaccurate

I am using ES 5.5 and my requirement is to allow up to two edits while matching a field.
In ES, I have the value 124456788 and the query comes in as 123456789:
"fuzzy": {
"idkey": {
"value": **"123456789"**,
"fuzziness": "20"
}
}
To my knowledge the edit distance between these two values is 2, but it does not match even with the fuzziness property set to 20.
I did an explain API call and here is what I am seeing:
"description": "no match on required clause (((idkey:012345789)^0.7777778 (idkey:012346789)^0.7777778 (idkey:013456789)^0.7777778 (idkey:023456789)^0.8888889 (idkey:102345678)^0.7777778 (idkey:112345678)^0.7777778 (idkey:113456789)^0.8888889 (idkey:120456589)^0.7777778 (idkey:121345678)^0.7777778 (idkey:122345678)^0.7777778 (idkey:122345679)^0.7777778 (idkey:122456789)^0.8888889 (idkey:123006789)^0.7777778 (idkey:123045678)^0.7777778 (idkey:123096789)^0.7777778 (idkey:123106789)^0.7777778 (idkey:123145678)^0.7777778 (idkey:123146789)^0.7777778 (idkey:123226789)^0.7777778 (idkey:123256789)^0.8888889 (idkey:123345678)^0.7777778 (idkey:123345689)^0.7777778 (idkey:123346789)^0.7777778 (idkey:123406784)^0.7777778 (idkey:123415678)^0.7777778 (idkey:123435678)^0.7777778 (idkey:123446789)^0.8888889 (idkey:123453789)^0.8888889 (idkey:123454789)^0.8888889 (idkey:123455789)^0.8888889 (idkey:123456289)^0.8888889 (idkey:123456489)^0.8888889 (idkey:123456709)^0.8888889 (idkey:123456779)^0.8888889 (idkey:123456780)^0.8888889 (idkey:123456781)^0.8888889 (idkey:123456783)^0.8888889 (idkey:123456785)^0.8888889 (idkey:123456786)^0.8888889 (idkey:123456787)^0.8888889 (idkey:123456889)^0.8888889 (idkey:123457789)^0.8888889 (idkey:123466789)^0.8888889 (idkey:123496789)^0.8888889 (idkey:123556789)^0.8888889 (idkey:126456789)^0.8888889 (idkey:223456789)^0.8888889 (idkey:423456789)^0.8888889 (idkey:623456789)^0.8888889 (idkey:723456789)^0.8888889)^5.0)",
The value I am expecting to match is 124456788, but the ES query internally does not generate it as one of the candidate terms of the fuzzy query.
Do I need to use a different ES method to make this work?
Here is a simple indexing and search example:
PUT /myindex/type1/1
{
  "key": "123456789",
  "name": "test"
}

GET /myindex/_search
{
  "query": {
    "fuzzy": {
      "key": {
        "value": "124456799",
        "fuzziness": 2
      }
    }
  }
}
It always matches the given key; a fuzziness value of 2 or greater is fine.
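One plausible cause of the original 124456788 case failing (an assumption, since the poster's full setup is not shown) is the max_expansions parameter of the fuzzy query: only 50 candidate terms are generated by default, and the capped expansion list in the explain output above simply may not include 124456788. Raising the limit is worth trying:
GET /myindex/_search
{
  "query": {
    "fuzzy": {
      "idkey": {
        "value": "123456789",
        "fuzziness": 2,
        "max_expansions": 200
      }
    }
  }
}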

Elasticsearch wrong explanation validate api

I'm using Elasticsearch 5.2. I'm executing the below query against an index that has only one document
Query:
GET test/val/_validate/query?pretty&explain=true
{
  "query": {
    "bool": {
      "should": {
        "multi_match": {
          "query": "alkis stackoverflow",
          "fields": [ "name", "job" ],
          "type": "most_fields",
          "operator": "AND"
        }
      }
    }
  }
}
Document:
PUT test/val/1
{
  "name": "alkis stackoverflow",
  "job": "developer"
}
The explanation of the query is
+(((+job:alkis +job:stackoverflow) (+name:alkis +name:stackoverflow))) #(#_type:val)
I read this as:
field job must have alkis and stackoverflow
AND
field name must have alkis and stackoverflow.
This is not the case with my document, though. The AND between the two fields is actually an OR (judging from the result I'm getting).
When I change the type to best_fields I get
+(((+job:alkis +job:stackoverflow) | (+name:alkis +name:stackoverflow))) #(#_type:val)
Which is the correct explanation.
Is there a bug with the validate api? Have I misunderstood something? Isn't the scoring the only difference between these two types?
Since you picked the most_fields type with an explicit AND operator, one match query is generated per field, and all terms must be present in a single field for a document to match. That is your case: both terms alkis and stackoverflow are present in the name field, hence the document matches.
So, in the explanation of the corresponding Lucene query, i.e.
+(((+job:alkis +job:stackoverflow) (+name:alkis +name:stackoverflow)))
when no explicit operator appears between the clauses, the default is an OR.
So you need to read this as: field job must have both alkis and stackoverflow, OR field name must have both alkis and stackoverflow.
The AND operator you apply only concerns the terms of your query within a single field; it's not an AND between the fields. Said differently, your query is executed as two match queries (one per field) in a bool/should clause, like this:
{
  "query": {
    "bool": {
      "should": [
        { "match": { "job": "alkis stackoverflow" } },
        { "match": { "name": "alkis stackoverflow" } }
      ]
    }
  }
}
In summary, the most_fields type is most useful when querying multiple fields that contain the same text analyzed in different ways. This is not your case, and you'd probably be better off using cross_fields or best_fields depending on your use case, but certainly not most_fields.
UPDATE
When using the best_fields type, ES generates a dis_max query instead of a bool/should, and the | sign (which is not an OR!) separates the sub-queries of the dis_max query.
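For comparison, with best_fields the same multi_match behaves roughly like the following dis_max query (a sketch of the rewriting, not output captured from ES):
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "job":  { "query": "alkis stackoverflow", "operator": "AND" } } },
        { "match": { "name": { "query": "alkis stackoverflow", "operator": "AND" } } }
      ]
    }
  }
}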

Is there a way to score fuzzy hits with the same score as exact hits?

I'm trying to use Elasticsearch as an integration tool that can match records from different sources. I'm combining filters and a query for this: the filters weed out irrelevant records and let candidate matches through, and all of those candidates are then scored. I'm using fuzzy matching because some of the records might contain a misspelling (Nicolson Way/Nicholson Way). I would like them to be scored equally, regardless of whether it is a fuzzy match or an exact match.
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/fuzzy-scoring.html
Is there a way to achieve this with Elasticsearch?
Use a constant_score to give it a score of your choice:
{
  "query": {
    "constant_score": {
      "filter": {
        "query": {
          "fuzzy": { "text": "whatever" }
        }
      },
      "boost": 1
    }
  }
}
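Every document matching the wrapped fuzzy query receives exactly the same score (the boost), so exact and fuzzy hits become indistinguishable score-wise. As a sketch of how this can sit inside the filter-plus-candidates setup described in the question (field names here are made up, and the query sits directly under filter as in more recent ES versions):
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "source": "crm" } }
      ],
      "must": [
        {
          "constant_score": {
            "filter": {
              "match": {
                "street": {
                  "query": "Nicholson Way",
                  "fuzziness": 1
                }
              }
            },
            "boost": 1
          }
        }
      ]
    }
  }
}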

How do I build an elastic search query such that each token in a document field is matched?

I need to make sure that each token of a field is matched by at least one token in a user's search.
This is a generalized example for the sake of simplification.
Let Store_Name = "Square Steakhouse"
It is simple to build a query that matches this document when the user searches for Square or Steakhouse. Furthermore, with the kstem filter attached to the default analyzer, Steakhouses is also likely to match.
{
  "size": 30,
  "query": {
    "match": {
      "Store_Name": {
        "query": "Square",
        "operator": "AND"
      }
    }
  }
}
Unfortunately, I need each token of the Store_Name field to be matched. I need the following behavior:
Query: Square Steakhouse Result: Match
Query: Square Steakhouses Result: Match
Query: Squared Steakhouse Result: Match
Query: Square Result: No Match
Query: Steakhouse Result: No Match
In summary
It is not an option to use not_analyzed, as I do need to take advantage of analyzer features
I intend to use kstem, custom synonyms, a custom char_filter, a lowercase filter, as well as a standard tokenizer
However, I need to make sure that each token of the field is matched.
Is this possible in Elasticsearch?
Here is a good method.
It is not perfect, but it is a good compromise in terms of simplicity, computation, and storage.
Index the token count of the field
Obtain the token count of the search text
Perform a filtered query that requires the stored token count to equal the token count of the search text
You will want to use the analyze API in order to get the token count. Make sure to use the same analyzer as the field in question. Here is a VB.NET function to obtain token count:
Private Function GetTokenCount(ByVal RawString As String, Optional ByVal Analyzer As String = "default") As Integer
    If Trim(RawString) = "" Then Return 0
    Dim client = New ElasticConnection()
    'Submit the analyze request using the PlainElastic.NET API
    Dim result = client.Post("http://localhost:9200/myindex/_analyze?analyzer=" & Analyzer, RawString)
    Dim J = JObject.Parse(result.ToString()) 'Parse the response into a JSON.NET JObject
    Return (From X In J("tokens")).Count() 'Count the entries of the "tokens" array
End Function
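For reference, the underlying call that function issues looks like this (the legacy query-string form of the analyze API, matching the PlainElastic.NET call above; myindex is a placeholder):
GET /myindex/_analyze?analyzer=default
Square Steakhouse
and it returns a response whose tokens array is what gets counted (trimmed here; the offset, type, and position fields are omitted):
{
  "tokens": [
    { "token": "square" },
    { "token": "steakhouse" }
  ]
}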
You will want to use this at index time to store the token count of the field in question. Make sure there is an entry in the mapping for TokenCount; a sketch follows below.
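A minimal mapping sketch for that (field names follow the query below; the string type matches the pre-2.x era of the filtered query, and integer for the count field is an assumption):
{
  "mappings": {
    "mytype": {
      "properties": {
        "MyField":    { "type": "string" },
        "TokenCount": { "type": "integer" }
      }
    }
  }
}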
Here is a good Elasticsearch query for utilizing this new token-count information:
{
  "size": 30,
  "query": {
    "filtered": {
      "query": {
        "match": {
          "MyField": {
            "query": "[query]",
            "operator": "AND"
          }
        }
      },
      "filter": {
        "term": {
          "TokenCount": [tokencount]
        }
      }
    }
  }
}
Replace [query] with the search terms.
Replace [tokencount] with the number of tokens in the search terms (obtained with the GetTokenCount function above).
Together, the AND operator and the token-count filter make sure that every token in MyField is matched by the search.
There are some drawbacks to the above. For example, if the field holds "blue red" and the user searches for "blue blue", the above will trigger a match: both sides count two tokens, and the AND operator is satisfied because "blue" is present, yet "red" is never matched. So, you may want to use a unique token filter; see the sketch below. You may also wish to adjust the term filter, for example to a range, if your use case tolerates small differences in token count.
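A sketch of such an analyzer using the unique token filter (the analyzer name and exact filter chain are illustrative, based on the filters the question mentions):
{
  "settings": {
    "analysis": {
      "analyzer": {
        "dedup_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "kstem", "unique" ]
        }
      }
    }
  }
}
This collapses repeated tokens such as "blue blue" down to a single "blue" before the count is taken.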
Reference
Clinton Gormley inspired the solution.
