Fuzzy Matching in Elasticsearch gives different results in two different versions

I have a mapping in Elasticsearch where the field's analyzer uses this tokenizer:
"tokenizer": {
"3gram_tokenizer": {
"type": "nGram",
"min_gram": "3",
"max_gram": "3",
"token_chars": [
"letter",
"digit"
]
}
}
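For reference, the tokens this tokenizer produces can be checked with the _analyze API (the index name my_index here is just a placeholder for the real index):
POST my_index/_analyze
{
  "tokenizer": "3gram_tokenizer",
  "text": "acinash"
}
For "acinash" this returns the tokens [aci, cin, ina, nas, ash].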
Now I am trying to search for the name "avinash" in Elasticsearch with the query "acinash".
The query formed is:
{
  "size": 5,
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "acinash",
            "fields": [
              "name"
            ],
            "type": "best_fields",
            "operator": "AND",
            "slop": 0,
            "fuzziness": "1",
            "prefix_length": 0,
            "max_expansions": 50,
            "zero_terms_query": "NONE",
            "auto_generate_synonyms_phrase_query": false,
            "fuzzy_transpositions": false,
            "boost": 1.0
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1.0
    }
  }
}
But in ES version 6.8 I am getting the desired result (because of fuzziness), i.e. "avinash" is found when querying "acinash", whereas in ES version 7.1 I am not getting the result.
The same happens when searching for "avinash" with "avinaah": in 6.8 I get results, but in 7.1 I do not.
ES converts the query into the tokens [aci, cin, ina, nas, ash], which should match against the tokenized inverted index containing [avi, vin, ina, nas, ash].
But why is it not matching in 7.1?

It's not related to the ES version.
Update max_expansions to more than 50.
max_expansions: the maximum number of term variations created.
With 3-grams and letter & digit token_chars, a suitable max_expansions is (26 letters + 10 digits) * 3 = 108.
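For instance, a minimal sketch of the same search with max_expansions raised above that bound (other parameters trimmed down from the original query; only max_expansions changes):
{
  "size": 5,
  "query": {
    "multi_match": {
      "query": "acinash",
      "fields": ["name"],
      "operator": "AND",
      "fuzziness": "1",
      "max_expansions": 110
    }
  }
}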

Related

Elasticsearch exception for a multi_match query of type phrase when using a combination of number and alphabet without space

I am getting an exception for the query below:
"multi_match": {
"query": "\"73a\"",
"fields": [],
"type": "phrase",
"operator": "AND",
"analyzer": "custom_analyzer",
"slop": 0,
"prefix_length": 0,
"max_expansions": 50,
"zero_terms_query": "NONE",
"auto_generate_synonyms_phrase_query": true,
"fuzzy_transpositions": true,
"boost": 1.0
}
Exception I am getting:
error" : {
"root_cause" : [
{
"type" : "illegal_state_exception",
"reason" : "field \"log_no.keyword\" was indexed without position data; cannot run SpanTermQuery (term=73)"
},
{
"type" : "illegal_state_exception",
"reason" : "field \"airplanes_data.keyword\" was indexed without position data; cannot run SpanTermQuery (term=73)"
}
],
Note: 1) When I change the type from "phrase" to "best_fields", I do not get any error and get proper results for "query": "\"73a\"" (see the sketch below).
2) Using type "phrase" and putting a space between the number and the letters, e.g. "query": "\"73 a\"", also returns results without error.
My question is: why, with type "phrase", do I get an error when there is no space between the number and letter combination in the query, e.g. "query": "\"443abx\"" or "query": "\"73222aaa\""?
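For reference, a minimal sketch of the note 1) variant that runs without the exception (only the type is changed; the remaining parameters are as in the original query):
"multi_match": {
  "query": "\"73a\"",
  "fields": [],
  "type": "best_fields",
  "operator": "AND",
  "analyzer": "custom_analyzer"
}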
I am new to Elasticsearch. Any help is appreciated. Thanks :)

Elasticsearch fuzzy query seems to ignore Brazilian stopwords

I have stopwords for Brazilian Portuguese configured on my index, but if I search for the term "ios" (it's an iOS course), a bunch of other documents are returned, because the term "nos" (a Brazilian stopword) seems to be treated as a valid term by the fuzzy query.
But if I search just for the term "nos", nothing is returned. Shouldn't only the iOS course be returned by the fuzzy query? I'm confused.
Is there any alternative to this? The main purpose is that when users search for "ios", documents that only match a stopword like "nos" are not returned, while fuzziness is kept for the other, more complex searches users make.
An example of query:
GET /index/_search
{
"explain": true,
"query": {
"bool" : {
"must" : [
{
"terms" : {
"document_type" : [
"COURSE"
],
"boost" : 1.0
}
},
{
"multi_match" : {
"query" : "ios",
"type" : "best_fields",
"operator" : "OR",
"slop" : 0,
"fuzziness" : "AUTO",
"prefix_length" : 0,
"max_expansions" : 50,
"zero_terms_query" : "NONE",
"auto_generate_synonyms_phrase_query" : true,
"fuzzy_transpositions" : true,
"boost" : 1.0
}
}
],
"adjust_pure_negative" : true,
"boost" : 1.0
}
}
}
Part of the explain output:
"description": "weight(corpo:nos in 52) [PerFieldSimilarity], result of:",
image with the config of stopwords
thanks
I tried adding a prefix_length, but I want the stopwords to be ignored.
I believe the correct way to handle stopwords per language is shown below:
PUT idx_teste
{
"settings": {
"analysis": {
"filter": {
"brazilian_stop_filter": {
"type": "stop",
"stopwords": "_brazilian_"
}
},
"analyzer": {
"teste_analyzer": {
"tokenizer": "standard",
"filter": ["brazilian_stop_filter"]
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "teste_analyzer"
}
}
}
}
POST idx_teste/_analyze
{
"analyzer": "teste_analyzer",
"text":"course nos advanced"
}
Note that the term "nos" was removed:
{
"tokens": [
{
"token": "course",
"start_offset": 0,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "advanced",
"start_offset": 11,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 2
}
]
}
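With that mapping in place, a fuzzy search for "ios" (a minimal sketch, reusing the idx_teste index and the name field from the example above) can no longer expand to "nos", because the stopword is dropped at index time and never reaches the inverted index:
GET idx_teste/_search
{
  "query": {
    "match": {
      "name": {
        "query": "ios",
        "fuzziness": "AUTO"
      }
    }
  }
}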

Elasticsearch wildcard query on numeric fields without using mapping

I'm looking for a creative solution because I can't change the mapping, as the solution is already in production.
I have this query:
{
"size": 4,
"query": {
"bool": {
"filter": [
{
"range": {
"time": {
"from": 1597249812405,
"to": null,
}
}
},
{
"query_string": {
"query": "*181*",
"fields": [
"deId^1.0",
"deTag^1.0",
],
"type": "best_fields",
"default_operator": "or",
"max_determinized_states": 10000,
"enable_position_increments": true,
"fuzziness": "AUTO",
"fuzzy_prefix_length": 0,
"fuzzy_max_expansions": 50,
"phrase_slop": 0,
"escape": false,
"auto_generate_synonyms_phrase_query": true,
"fuzzy_transpositions": true,
"boost": 1
}
}
],
"adjust_pure_negative": true,
"boost": 1
}
},
"sort": [
{
"time": {
"order": "asc"
}
}
]
}
"deId" field is an integer in elasticsearch and the query returns nothing (though should),
Is there a solution to search for wildcards in numeric fields without using the multi field option which requires mapping?
Once you index an integer, ES does not treat the individual digits as position-sensitive tokens. In other words, it's not directly possible to perform wildcards on numeric datatypes.
There are some sub-optimal ways of solving this (think scripting & String.substring) but the easiest would be to convert those integers to strings.
Let's look at an example deId of 123181994:
POST prod/_doc
{
"deId_str": "123181994"
}
then
GET prod/_search
{
"query": {
"bool": {
"filter": [
{
"query_string": {
"query": "*181*",
"fields": [
"deId_str"
]
}
}
]
}
}
}
works like a charm.
Since your index/mapping is already in production, look into _update_by_query and stringify all the necessary numbers in a single call. After that, if you don't want to (and/or cannot) pass the strings at index time, use ingest pipelines to do the conversion for you.
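For example, a minimal _update_by_query sketch that adds a stringified copy of deId to every document that has it (the index name prod and the target field deId_str follow the example above; the Painless one-liner is just one way to do the conversion):
POST prod/_update_by_query
{
  "query": {
    "exists": { "field": "deId" }
  },
  "script": {
    "lang": "painless",
    "source": "ctx._source.deId_str = ctx._source.deId.toString()"
  }
}
After that, the *181* query_string search can target deId_str as shown above.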

I am new to ES and have a multi-match query in ES, and want to consider a field based on its availability

{
"multi_match": {
"query": "TEST",
"fields": [
"description.regexkeyword^1.0",
"logical_name.regexkeyword^1.0",
"logical_table_name.regexkeyword^1.0",
"physical_name.regexkeyword^1.0",
"presentation_name.regexkeyword^1.0",
"table_name.regexkeyword^1.0"
],
"type": "best_fields",
"operator": "AND",
"slop": 0,
"prefix_length": 0,
"max_expansions": 50,
"lenient": false,
"zero_terms_query": "NONE",
"boost": 1
}
}
There is a field, edited_description. If edited_description exists in a document, then edited_description.regexkeyword^1.0 should be considered; otherwise description, i.e. description.regexkeyword^1.0, should be used.
You can't define an if condition in a multi_match query. But what you can do is look at the problem statement in a different way: if both edited_description and description exist, then a match in the edited_description field should be given higher preference.
This can be achieved by setting a slightly higher boost value for the edited_description field.
{
"multi_match": {
"query": "TEST",
"fields": [
"description.regexkeyword^1.0",
"edited_description.regexkeyword^1.2",
"logical_name.regexkeyword^1.0",
"logical_table_name.regexkeyword^1.0",
"physical_name.regexkeyword^1.0",
"presentation_name.regexkeyword^1.0",
"table_name.regexkeyword^1.0"
],
"type": "best_fields",
"operator": "AND",
"slop": 0,
"prefix_length": 0,
"max_expansions": 50,
"lenient": false,
"zero_terms_query": "NONE",
"boost": 1
}
}
This will cause documents with a match in edited_description to be ranked higher. You can adjust the boost value to your needs.

Confusing query_string search results

I've got Elasticsearch set up and am running queries against it, but I'm getting odd results, and can't figure out why:
For example, here's one relevant portion of my mapping:
"classification": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
},
And here are some of the queries and results. For all of these, there are objects with a classification value of "Jewelry & Adornment":
Query:
"query": {
"bool": {
"must": [
{
"match_all": {}
},
{
"query_string": {
"query": "(classification:/jewel.*/)"
}
}
]
}
}
Result:
"hits": {
"total": 2541,
"max_score": 1.4142135,
"hits": [
{
...
Yet if I add "ry":
Query:
"query_string": {
"query": "(classification:/jewelry.*/)"
}
Result:
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
I've also tried running the queries:
"query_string": {
"query": "(classification\\*:/jewelry.*/)"
}
(should match either "classification" or "classification.raw")
And:
"query_string": {
"query": "(classification.raw:/jewelry.*/)"
}
I've also tried case variations, e.g. "Jewelry" vs. "jewelry", to no effect. All of these return no results. This makes no sense to me. Even when querying "classification.raw" with "Jewelry" (same case, and on a completely unanalyzed field), I get no results. Any ideas?
UPDATE
As per request of #keety
{
"tokens": [
{
"token": "jewelri",
"start_offset": 0,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "adorn",
"start_offset": 10,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 2
}
]
}
I imagine the fact that it's stemming "jewelry" to "jewelri" is my problem, but I'm not sure why it's doing that or how to fix it.
UPDATE #2
These are the analyzers in play:
"analyzer": {
"default_index": {
"type": "custom",
"tokenizer": "icu_tokenizer",
"filter": [
"icu_folding",
"custom_stem",
"porter_stem",
"index_filter"
],
"char_filter": [
"html_strip",
"quotes"
]
},
"default_search": {
"type": "custom",
"tokenizer": "icu_tokenizer",
"filter": [
"icu_folding",
"custom_stem",
"porter_stem",
"search_filter"
],
"char_filter": [
"html_strip",
"quotes"
]
}
}
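The stemmed tokens from the first update can be reproduced directly against the index analyzer (a minimal sketch; my_index stands in for the real index name):
POST my_index/_analyze
{
  "analyzer": "default_index",
  "text": "Jewelry & Adornment"
}
This returns "jewelri" and "adorn", which points at porter_stem in the filter chain as the reason the indexed term is "jewelri" rather than "jewelry".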
UPDATE #3
I ran an _explain query on one of the objects that should be matching but isn't and got the following:
"matched": false,
"explanation": {
"value": 0,
"description": "Failure to meet condition(s) of required/prohibited clause(s)",
"details": [
{
"value": 0.70710677,
"description": "ConstantScore(*:*), product of:",
"details": [
{
"value": 1,
"description": "boost"
},
{
"value": 0.70710677,
"description": "queryNorm"
}
]
},
{
"value": 0,
"description": "no match on required clause (ConstantScore())"
}
]
}
I don't know what "required clause (ConstantScore())" is. The only related thing I can find is the Constant Score Query, but I'm not employing that particular query anywhere.
UPDATE #4
Okay, this is getting a little long-winded. Sorry about that. However, I just discovered that the problem seems to lie in using the regex syntax. If I just use a basic wildcard (along with "analyze_wildcard": true), then all my queries start working.
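For reference, a minimal sketch of the wildcard form that works here (field and term as above; with "analyze_wildcard": true the term is run through the search analyzer, so "jewelry*" becomes "jewelri*" and matches the indexed token):
"query_string": {
  "query": "classification:jewelry*",
  "analyze_wildcard": true
}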
