Elasticsearch - Edit distance using fuzzy is inaccurate

I am using ES 5.5 and my requirement is to allow up to two edits while matching a field.
In ES, I have the value 124456788 indexed, and the query comes in as 123456789:
"fuzzy": {
"idkey": {
"value": "123456789",
"fuzziness": "20"
}
}
To my knowledge, the edit distance between these two numbers is 2, but the document does not match even with fuzziness set to 20.
I ran an explain API call, and here is what I am seeing:
"description": "no match on required clause (((idkey:012345789)^0.7777778 (idkey:012346789)^0.7777778 (idkey:013456789)^0.7777778 (idkey:023456789)^0.8888889 (idkey:102345678)^0.7777778 (idkey:112345678)^0.7777778 (idkey:113456789)^0.8888889 (idkey:120456589)^0.7777778 (idkey:121345678)^0.7777778 (idkey:122345678)^0.7777778 (idkey:122345679)^0.7777778 (idkey:122456789)^0.8888889 (idkey:123006789)^0.7777778 (idkey:123045678)^0.7777778 (idkey:123096789)^0.7777778 (idkey:123106789)^0.7777778 (idkey:123145678)^0.7777778 (idkey:123146789)^0.7777778 (idkey:123226789)^0.7777778 (idkey:123256789)^0.8888889 (idkey:123345678)^0.7777778 (idkey:123345689)^0.7777778 (idkey:123346789)^0.7777778 (idkey:123406784)^0.7777778 (idkey:123415678)^0.7777778 (idkey:123435678)^0.7777778 (idkey:123446789)^0.8888889 (idkey:123453789)^0.8888889 (idkey:123454789)^0.8888889 (idkey:123455789)^0.8888889 (idkey:123456289)^0.8888889 (idkey:123456489)^0.8888889 (idkey:123456709)^0.8888889 (idkey:123456779)^0.8888889 (idkey:123456780)^0.8888889 (idkey:123456781)^0.8888889 (idkey:123456783)^0.8888889 (idkey:123456785)^0.8888889 (idkey:123456786)^0.8888889 (idkey:123456787)^0.8888889 (idkey:123456889)^0.8888889 (idkey:123457789)^0.8888889 (idkey:123466789)^0.8888889 (idkey:123496789)^0.8888889 (idkey:123556789)^0.8888889 (idkey:126456789)^0.8888889 (idkey:223456789)^0.8888889 (idkey:423456789)^0.8888889 (idkey:623456789)^0.8888889 (idkey:723456789)^0.8888889)^5.0)",
The value I expect to match is 124456788, but ES does not include it among the candidate terms that the fuzzy query expands to internally.
Do I need to use a different ES method to make this work?
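The edit-distance claim in the question can be checked outside Elasticsearch. The following is a minimal Python sketch of the classic Levenshtein algorithm. The helper fuzzy_boost is hypothetical: it assumes the per-term boosts in the explain output follow 1 - distance / length, which is consistent with the 0.8888889 and 0.7777778 values shown for 9-character terms, but is not confirmed by the question itself.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming Levenshtein distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def fuzzy_boost(term: str, candidate: str) -> float:
    """Assumed boost formula: 1 - distance / shorter term length."""
    distance = levenshtein(term, candidate)
    return 1.0 - distance / min(len(term), len(candidate))


print(levenshtein("123456789", "124456788"))  # 2
print(round(fuzzy_boost("123456789", "023456789"), 7))
```

The two values differ by two substitutions, so the indexed term is within the allowed distance; the open question is only why it does not appear among the expanded candidate terms.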

This is a simple indexing and search example.
PUT /myindex/type1/1
{
"key":"123456789",
"name":"test"
}
GET /myindex/_search
{
"query": {
"fuzzy": {
"key": {
"value": "124456799",
"fuzziness": 2
}
}
}
}
The query always matches the indexed key; a fuzziness value of 2 or greater is fine.

Related

Terms Set Query's minimum_should_match_field does not behave as expected when the provided field has value zero

I am wondering why, in a "terms set" query, a field specified by minimum_should_match_field that has the value 0 behaves as if it had the value 1.
To replicate the problem, I took the example from the Elasticsearch docs and constructed the three steps below.
Step 1:
Create a new index
PUT /job-candidates
{
"mappings": {
"properties": {
"name": {
"type": "keyword"
},
"programming_languages": {
"type": "keyword"
},
"required_matches": {
"type": "long"
}
}
}
}
Step 2:
Create two docs with required_matches set to zero
PUT /job-candidates/_doc/1?refresh
{
"name": "Jane",
"programming_languages": [ "c++", "java" ],
"required_matches": 0
}
and also
PUT /job-candidates/_doc/2?refresh
{
"name": "Ben",
"programming_languages": [ "python" ],
"required_matches": 0
}
Step 3:
Search for docs with the following query
GET /job-candidates/_search
{
"query": {
"terms_set": {
"programming_languages": {
"terms": [ "c++", "java"],
"minimum_should_match_field": "required_matches"
}
}
}
}
Expected results: I expect step 3 to return both docs, "Jane" and "Ben".
Actual results: it only returns the "Jane" doc.
I don't understand. If minimum_should_match is 0, doesn't that mean a returned doc does not need to match any terms, and therefore the "Ben" doc should also be returned?
Some links I found but still can't answer my question:
minimum_should_match
It looks like minimum_should_match cannot be zero, but the docs do not say how the search works if it is indeed zero or greater than the number of optional values.
A discussion of default value for minimum_should_match
But they didn't discuss the "terms set" query in particular.
Any clarification will be appreciated! Thanks.
When looking at the terms_set source code, we can see that the underlying Lucene query being used is called CoveringQuery.
So the explanation can be found in Lucene's source code of CoveringQuery, whose documentation says
Per-document long value that records how many queries should match. Values that are less than 1 are treated like 1: only documents that have at least one matching clause will be considered matches. Documents that do not have a value for minimumNumberMatch do not match.
And a little further, the code that sets minimumNumberMatch is pretty self-explanatory:
final long minimumNumberMatch = Math.max(1, minMatchValues.longValue());
We can simply sum it up by stating that it doesn't really make sense to send a terms_set query with minimum_should_match: 0 as it would be equivalent to a match_all query.
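The rule quoted from CoveringQuery can be mimicked in a few lines of Python. This is only a sketch of the matching semantics described above, not Lucene's implementation; the function name terms_set_matches and the plain-dict documents are assumptions made for illustration.

```python
def terms_set_matches(doc_terms, query_terms, required_matches):
    """Mimic the quoted rule: a per-document minimum below 1 is treated as 1."""
    if required_matches is None:
        # Documents without a value for the minimum never match.
        return False
    minimum = max(1, required_matches)  # mirrors Math.max(1, value)
    matched = len(set(doc_terms) & set(query_terms))
    return matched >= minimum


query_terms = ["c++", "java"]
jane = {"programming_languages": ["c++", "java"], "required_matches": 0}
ben = {"programming_languages": ["python"], "required_matches": 0}

for name, doc in (("Jane", jane), ("Ben", ben)):
    hit = terms_set_matches(
        doc["programming_languages"], query_terms, doc["required_matches"]
    )
    print(name, hit)
```

With required_matches set to 0 for both documents, only "Jane" shares at least one term with the query, which matches the result observed in the question.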

Custom score for exact, phonetic and fuzzy matching in elasticsearch

I have a requirement for custom scoring on names. To keep it simple, let's say that if I search for 'Smith' against names in the index, the logic should be:
if input = exact 'Smith' then score = 100%
else
if input = phonetic match then
score = <depending upon fuzziness match of input with name>%
end if
end if;
I'm able to search documents with a fuzziness of 1 but I don't know how to give it custom score depending upon how fuzzy it is. Thanks!
Update:
I went through a post that had the same requirement as mine and it was mentioned that the person solved it by using native scripts. My question still remains, how to actually get the score based on the similarity distance such that it can be used in the native scripts:
The post for reference:
https://discuss.elastic.co/t/fuzzy-query-scoring-based-on-levenshtein-distance/11116
The text to look for in the post:
"For future readers I solved this issue by creating a custom score query and
writing a (native) script to handle the scoring."
You can implement this search logic using the function_score query.
Here is a possible example:
{
"query": {
"function_score": {
"query": { "match": {
"input": "Smith"
} },
"boost": "5",
"functions": [
{
"filter": { "match": { "input.keyword": "Smith" } },
"random_score": {},
"weight": 23
}
]
}
}
}
In this example, the mapping indexes the input field both as text and as keyword (input.keyword is for exact matching). Documents that exactly match the term "Smith" are re-scored higher than the other documents matched by the base query (a match in this example, but in your case it would be the query with fuzziness).
You can control the re-scoring effect by tuning the weight parameter.
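As a variation on the example above, the request body can be assembled in Python with a fuzzy match as the base query and a weight applied only to exact hits. This is a sketch under stated assumptions: the function name exact_boosted_fuzzy_query is hypothetical, the field is assumed to have a .keyword sub-field as in the mapping described above, and the weight of 10 is an arbitrary illustration.

```python
import json


def exact_boosted_fuzzy_query(field, text, exact_weight=10.0, fuzziness="AUTO"):
    """Build a function_score body: fuzzy match for recall, weight for exact hits."""
    return {
        "query": {
            "function_score": {
                # Base query: matches exact and near-miss spellings.
                "query": {"match": {field: {"query": text, "fuzziness": fuzziness}}},
                "functions": [
                    {
                        # Only documents whose keyword value equals the input
                        # get this weight (assumes a .keyword sub-field exists).
                        "filter": {"term": {field + ".keyword": text}},
                        "weight": exact_weight,
                    }
                ],
                # Multiply the base score by the weight for exact matches.
                "boost_mode": "multiply",
            }
        }
    }


body = exact_boosted_fuzzy_query("input", "Smith")
print(json.dumps(body, indent=2))
```

This ranks exact matches above fuzzy ones but does not by itself map the edit distance to a percentage; that part would still need a script or client-side post-processing.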

Elasticsearch wrong explanation validate api

I'm using Elasticsearch 5.2. I'm executing the below query against an index that has only one document
Query:
GET test/val/_validate/query?pretty&explain=true
{
"query": {
"bool": {
"should": {
"multi_match": {
"query": "alkis stackoverflow",
"fields": [
"name",
"job"
],
"type": "most_fields",
"operator": "AND"
}
}
}
}
}
Document:
PUT test/val/1
{
"name": "alkis stackoverflow",
"job": "developer"
}
The explanation of the query is
+(((+job:alkis +job:stackoverflow) (+name:alkis +name:stackoverflow))) #(#_type:val)
I read this as:
Field job must have alkis and stackoverflow
AND
Field name must have alkis and stackoverflow
This is not the case with my document though. The AND between the two fields is actually OR (as it seems from the result I'm getting)
When I change the type to best_fields I get
+(((+job:alkis +job:stackoverflow) | (+name:alkis +name:stackoverflow))) #(#_type:val)
Which is the correct explanation.
Is there a bug with the validate api? Have I misunderstood something? Isn't the scoring the only difference between these two types?
Since you picked the most_fields type with an explicit AND operator, the reasoning is that one match query is going to be generated per field and all terms must be present in a single field for a document to match, which is your case, i.e. both terms alkis and stackoverflow are present in the name field, hence why the document matches.
So in the explanation of the corresponding Lucene query, i.e.
+(((+job:alkis +job:stackoverflow) (+name:alkis +name:stackoverflow)))
when no specific operator is specified between the terms, the default one is an OR
So you need to read this as: Field job must have both alkis and stackoverflow OR field name must have both alkis and stackoverflow.
The AND operator that you apply only concerns the terms of your query within a single field; it is not an AND between the fields. Said differently, your query will be executed as two match queries (one per field, each with the and operator) in a bool/should clause, like this:
{
"query": {
"bool": {
"should": [
{ "match": { "job": { "query": "alkis stackoverflow", "operator": "and" }}},
{ "match": { "name": { "query": "alkis stackoverflow", "operator": "and" }}}
]
}
}
}
In summary, the most_fields type is most useful when querying multiple fields that contain the same text analyzed in different ways. This is not your case, and you would probably be better off using cross_fields or best_fields depending on your use case, but certainly not most_fields.
UPDATE
When using the best_fields type, ES generates a dis_max query instead of a bool/should and the | (which is not an OR !!) sign separates all sub-queries in a dis_max query.
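The two expansions described in this answer can be written down as a small Python sketch. The function expand_multi_match is hypothetical and only covers the two types discussed here; it illustrates the shape of the rewritten query, not how Elasticsearch performs the rewrite internally.

```python
def expand_multi_match(query, fields, match_type="best_fields", operator="or"):
    """Return the per-field match queries wrapped as described in the answer."""
    clauses = [
        {"match": {field: {"query": query, "operator": operator}}}
        for field in fields
    ]
    if match_type == "most_fields":
        # One optional clause per field; scores of matching clauses add up.
        return {"bool": {"should": clauses}}
    if match_type == "best_fields":
        # dis_max keeps the score of the best matching clause.
        return {"dis_max": {"queries": clauses}}
    raise ValueError("unsupported type in this sketch: " + match_type)


print(expand_multi_match("alkis stackoverflow", ["name", "job"], "most_fields", "and"))
print(expand_multi_match("alkis stackoverflow", ["name", "job"], "best_fields", "and"))
```

In both shapes a document matches as soon as one field contains all the terms, so the document from the question matches through its name field either way.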

Scope Elasticsearch Results to Specific Ids

I have a question about the Elasticsearch DSL.
I would like to do a full text search, but scope the searchable records to a specific array of database ids.
In SQL world, it would be the functional equivalent of WHERE id IN(1, 2, 3, 4).
I've been researching, but I find the Elasticsearch query DSL documentation a little cryptic and devoid of useful examples. Can anyone point me in the right direction?
Here is an example query which might work for you. This assumes that the _all field is enabled on your index (which is the default). It will do a full text search across all the fields in your index. Additionally, with the added ids filter, the query will exclude any document whose id is not in the given array.
{
"bool": {
"must": {
"match": {
"_all": "your search text"
}
},
"filter": {
"ids": {
"values": ["1","2","3","4"]
}
}
}
}
Hope this helps!
As discussed by Ali Beyad, the ids query can do that for you. To complement his answer, here is a working example in case anyone needs it in the future.
GET index_name/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"field": "your query"
}
},
{
"ids" : {
"values" : ["0aRM6ngBFlDmSSLpu_J4", "0qRM6ngBFlDmSSLpu_J4"]
}
}
]
}
}
}
You can create a bool query that contains an Ids query in a MUST clause:
https://www.elastic.co/guide/en/elasticsearch/reference/2.0/query-dsl-ids-query.html
By using a MUST clause in a bool query, your search will be further limited by the Ids you specify. I'm assuming here by Ids you mean the _id value for your documents.
According to the ES docs, the ids query returns documents based on their IDs:
GET /_search
{
"query": {
"ids" : {
"values" : ["1", "4", "100"]
}
}
}
With elasticaBundle on Symfony 5.2:
$query = new Query();
$IdsQuery = new Query\Ids();
$IdsQuery->setIds($id);
$query->setQuery($IdsQuery);
$this->finder->find($query, $limit);
You have two options.
The ids query:
GET index/_search
{
"query": {
"ids": {
"values": ["1", "2", "3"]
}
}
}
or
The terms query:
GET index/_search
{
"query": {
"terms": {
"yourNonPrimaryIdField": ["1", "2","3"]
}
}
}
The ids query targets the document's internal _id field (= the primary ID). But it often happens that documents contain secondary (and more) IDs, which you'd target through the terms query.
Note that if your secondary IDs contain uppercase chars and you don't set their field's mapping to keyword, they'll be normalized (and lowercased) and the terms query will appear broken because it only works with exact matches. More on this here: Only getting results when elasticsearch is case sensitive
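The answers in this thread can be combined into one request body: a full-text match scoped by either the ids query or a terms query. The Python sketch below assumes a hypothetical helper name (scoped_search) and placeholder field names; it puts the ID restriction in the filter clause so that it limits the results without affecting the score.

```python
def scoped_search(text_field, text, ids, id_field=None):
    """Full-text search limited to a set of IDs (like WHERE id IN (...))."""
    # Each ID is its own array element; "1, 2, 3" as one string would be one ID.
    id_values = [str(value) for value in ids]
    if id_field is None:
        # Restrict on the document's own _id.
        id_filter = {"ids": {"values": id_values}}
    else:
        # Restrict on a secondary ID field (ideally mapped as keyword).
        id_filter = {"terms": {id_field: id_values}}
    return {
        "query": {
            "bool": {
                "must": {"match": {text_field: text}},
                "filter": id_filter,
            }
        }
    }


print(scoped_search("field", "your query", [1, 2, 3, 4]))
print(scoped_search("field", "your query", ["a1", "b2"], id_field="yourNonPrimaryIdField"))
```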

Is there a way to score fuzzy hits with the same score as exact hits?

I'm trying to use Elasticsearch as an integration tool that can match records from different sources. I'm combining filters and a query for this. The filters remove irrelevant records and let candidate matches through; all of those candidates are then scored. I'm using a fuzzy match because some of the records might contain a misspelling (Nicolson Way/Nicholson Way). I would like them to be scored equally, regardless of whether it is a fuzzy match or an exact match.
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/fuzzy-scoring.html
Is there a way to achieve this with Elasticsearch?
Use a constant_score to give it a score of your choice:
{
"query": {
"constant_score": {
"filter": {
"query": {
"fuzzy": {"text": "whatever"}
}
},
"boost": 1
}
}
}
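The "query" wrapper inside "filter" in the snippet above reflects older Elasticsearch syntax. As far as I know, later releases expect the query directly under "filter"; treat that as an assumption to check against the version in use. A minimal Python sketch of that form follows, with a hypothetical helper name and placeholder field and value.

```python
def constant_fuzzy_query(field, value, fuzziness="AUTO", boost=1.0):
    """Wrap a fuzzy query in constant_score so every hit gets the same score."""
    return {
        "query": {
            "constant_score": {
                # Assumed newer syntax: the fuzzy query sits directly under filter.
                "filter": {
                    "fuzzy": {field: {"value": value, "fuzziness": fuzziness}}
                },
                # Every matching document receives this score.
                "boost": boost,
            }
        }
    }


print(constant_fuzzy_query("text", "whatever"))
```

Because the fuzzy query runs in filter context, an exact hit and a misspelled hit both receive the boost value as their score.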
