Why does a fuzzy query return a match but a match query with fuzziness doesn't on the same input? - elasticsearch

I created the following index in Elasticsearch:
PUT /my-index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "whitespace",
"filter": ["lowercase", "3_5_edgegrams"]
}
},
"filter": {
"3_5_edgegrams": {
"type": "edge_ngram",
"min_gram": 3,
"max_gram": 10
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
Then I inserted the following document:
{
"name": "Nuvus Gro Corp"
}
When I make the following query (let's call it fuzzy_query):
GET /my-index/_search
{
"query": {
"fuzzy": {
"name": {
"value": "qnuv"
}
}
}
}
I get a match for the above document.
When I make the query (let's call the query match_with_fuzziness):
GET /my-index/_search
{
"query": {
"match": {
"name": {
"query": "qnuv",
"fuzziness": "AUTO"
}
}
}
}
I don't get a match. If I make the following query:
GET /my-index/_search
{
"query": {
"match": {
"name": {
"query": "nuvq",
"fuzziness": "AUTO"
}
}
}
}
I again get a match. I don't understand why when I make the match_with_fuzziness query I don't get any matches.
EDIT: I analyzed the queries with Kibana Profiler and according to the profiler match_with_fuzziness is a SynonymQuery Synonym(name:qnu name:qnuv) query while fuzzy_query is a BoostQuery (name:nuv)^0.6666666

Very similar problem to the one explained in your other question.
The problem is that you haven't specified a search_analyzer, so at search time qnuv and nuvq are also analyzed by my_analyzer and edge-ngrammed, which explains the results you're seeing.
In the first query, the fuzzy query is a term-level query, so the search term qnuv is not analyzed. It matches nuv (the first indexed edge-ngram token) with an edit distance of 1 (i.e. the leading q is "tolerated"), which is what the fuzzy query allows by default (with "fuzziness: AUTO").
In the third query, nuv (the first edge-ngramed token of the search term) will match nuv (the first indexed edge-ngramed token).
The second query is a special case. Here is how the documentation describes the fuzziness parameter in the context of match queries:
Fuzzy matching is not applied to terms with synonyms or in cases where the analysis process produces multiple tokens at the same position. Under the hood these terms are expanded to a special synonym query that blends term frequencies, which does not support fuzzy expansion.
The second condition (multiple tokens at the same position) is what applies to your case. Since the search term qnuv is analyzed by my_analyzer, it produces the two tokens qnu and qnuv at the same position, and that does not support fuzzy matching.
You need to add a search_analyzer line to your mapping, as shown here, and it will work the way you expect, i.e. all three queries will return your document:
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer",
"search_analyzer": "standard"
}
}
}
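The token behaviour described in this answer can be reproduced outside Elasticsearch. The following is only a rough Python sketch, not how Lucene implements it: the helper names edge_ngrams, analyze and edit_distance are made up for illustration, and it uses plain Levenshtein distance, whereas Elasticsearch counts transpositions as a single edit by default (which makes no difference for these inputs).

```python
def edge_ngrams(token, min_gram=3, max_gram=10):
    """Prefixes of `token` with lengths from min_gram to max_gram."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def analyze(text):
    """Whitespace tokenizer + lowercase + edge n-grams, as (position, term) pairs."""
    return [(pos, gram)
            for pos, word in enumerate(text.lower().split())
            for gram in edge_ngrams(word)]


def edit_distance(a, b):
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


# Terms stored in the index for the example document.
indexed = {term for _, term in analyze("Nuvus Gro Corp")}

print(sorted(indexed))               # edge n-grams of nuvus, gro and corp
print(analyze("qnuv"))               # two tokens sharing position 0
print(edit_distance("qnuv", "nuv"))  # distance seen by the unanalyzed fuzzy query
print(analyze("nuvq"))               # contains nuv, which is an indexed term
```

With my_analyzer applied at search time, qnuv yields qnu and qnuv at the same position, and neither is an indexed term, so without fuzzy expansion nothing matches. The raw term qnuv, on the other hand, is one edit away from the indexed nuv.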

Related

elasticsearch query string(full text search) which can't be searched

We have the document below. I can't find it by searching for financialmarkets, but I can find it by searching for industry_icon_financialmarkets.png. Can anyone tell me the reason?
content is the text type field.
document:
{
"title":"test",
"content":"industry_icon_financialmarkets.png"
}
Query:
{
"from": 0,
"size": 2,
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "\"industry_icon_financialmarkets.png\""
}
}
]
}
}
}
The default analyzer for a text field is standard, which won't split industry_icon_financialmarkets.png into tokens on _, so financialmarkets never becomes a searchable term on its own. I would suggest using the simple analyzer instead, which breaks text into terms whenever it encounters a character that is not a letter.
You can also add a sub-field of type keyword to retain the original value.
So the mapping of the field should be:
{
"content": {
"type": "text",
"analyzer": "simple",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
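To see what the simple analyzer would do with this value, here is a small Python approximation. The function simple_analyzer is a hypothetical stand-in that splits on anything that is not a letter and lowercases the result; it is not the Lucene implementation, and edge cases around unusual Unicode characters may differ.

```python
import re


def simple_analyzer(text):
    """Approximate the simple analyzer: split on non-letters, then lowercase."""
    # [^\W\d_] keeps word characters that are neither digits nor underscores.
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]


print(simple_analyzer("industry_icon_financialmarkets.png"))
```

Because financialmarkets comes out as its own term, a search for it can match the document.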
When creating the index, define an explicit mapping for each field based on its type to get the expected results.
Mapping
PUT relevance
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 2,
    "max_ngram_diff": 27,
    "analysis": {
      "analyzer": {
        "my_analyzer": { "tokenizer": "my_tokenizer" }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 30,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "ID": { "type": "long" },
      "title": { "type": "keyword" },
      "content": { "type": "text", "analyzer": "my_analyzer", "search_analyzer": "my_analyzer" }
    }
  }
}
Then start inserting documents. Replace special characters with spaces in your application before indexing into ES:
POST relevance/_doc/1
{
"name": "1elastic",
"content": "working fine"
}
Query
GET relevance/_search
{"size":20,"query":{"bool":{"must":[{"match":{"content":
{"query":"fine","fuzziness":1}}}]}}}

Tokenize a big word into combination of words

Suppose Super Bowl is the value of a document's property in Elasticsearch. How can the term query superbowl match Super Bowl?
I read about the letter tokenizer and the word delimiter filter, but neither seems to solve my problem. Basically, I want to be able to split one long word into a meaningful combination of words.
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-word-delimiter-tokenfilter.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-letter-tokenizer.html
I know this is quite late, but you could use a synonym filter.
You could define that super bowl is the same as "s bowl", "SuperBowl", etc.
There are ways to do this without changing what you actually index. For example, if you are on at least 5.2 (where normalizers were introduced; earlier versions can work too, but 5.x makes it easier), you can define a normalizer that lowercases your text without otherwise changing it, and then use a fuzzy query at search time to account for the space between super and bowl. This solution is specific to the example you have given, though. As is usually the case with Elasticsearch, one needs to think about what kind of data goes in and what is required at search time.
In any case, if you are interested in an approach here it is:
DELETE test
PUT /test
{
"settings": {
"analysis": {
"normalizer": {
"my_normalizer": {
"type": "custom",
"char_filter": [],
"filter": ["lowercase", "asciifolding"]
}
}
}
},
"mappings": {
"test": {
"properties": {
"title": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"normalizer": "my_normalizer"
}
}
}
}
}
}
}
POST test/test/1
{"title":"Super Bowl"}
GET /test/_search
{
"query": {
"fuzzy": {
"title.keyword": "superbowl"
}
}
}
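Why this works can be checked with a few lines of Python. This is only an illustrative sketch: normalize is a rough stand-in for the lowercase + asciifolding normalizer, edit_distance is plain Levenshtein distance, and both names are invented here rather than taken from Elasticsearch.

```python
import unicodedata


def normalize(value):
    """Rough stand-in for a lowercase + asciifolding normalizer."""
    decomposed = unicodedata.normalize("NFKD", value.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def edit_distance(a, b):
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


indexed_term = normalize("Super Bowl")  # what title.keyword would hold
print(indexed_term)
print(edit_distance("superbowl", indexed_term))
```

The keyword sub-field holds the whole normalized value as a single term, and the missing space is one edit. With the default fuzziness of AUTO, a term longer than five characters is allowed up to two edits, so superbowl is within range.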

Getting results for multi_match cross_fields query in elasticsearch with custom analyzer

I have an Elasticsearch 5.3 server with products.
Each product has a 14 digit product code that has to be searchable by the following rules. The complete code should match as well as a search term with only the last 9 digits, the last 6, the last 5 or the last 4 digits.
To achieve this, I created a custom analyzer that creates the appropriate tokens at index time using the pattern capture token filter. This seems to be working correctly: the _analyze API shows that the correct terms are created.
To fetch the documents from elastic search I'm using a multi_match cross_fields bool query to search a number of fields simultaneously.
When I have a query string that has a part that matches a product code and a part that matches any of the other fields no results are returned, but when I search for each part separately the appropriate results are returned. Also when I have multiple parts spanning any of the fields except the product code the correct results are returned.
My mapping and analyzer:
PUT /store
{
"mappings": {
"products":{
"properties":{
"productCode":{
"analyzer": "ProductCode",
"search_analyzer": "standard",
"type": "text"
},
"description": {
"type": "text"
},
"remarks": {
"type": "text"
}
}
}
},
"settings": {
"analysis": {
"filter": {
"ProductCodeNGram": {
"type": "pattern_capture",
"preserve_original": "true",
"patterns": [
"\\d{5}(\\d{9})",
"\\d{8}(\\d{6})",
"\\d{9}(\\d{5})",
"\\d{10}(\\d{4})"
]
}
},
"analyzer": {
"ProductCode": {
"filter": ["ProductCodeNGram"],
"type": "custom",
"preserve_original": "true",
"tokenizer": "standard"
}
}
}
}
}
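For reference, the effect of the ProductCodeNGram filter on a single token can be approximated in Python. This is a sketch under assumptions: pattern_capture is a made-up helper, the 14-digit code is hypothetical, and the real filter's token ordering and handling of duplicates are not modelled.

```python
import re

# The four patterns from the ProductCodeNGram filter definition.
PATTERNS = [r"\d{5}(\d{9})", r"\d{8}(\d{6})", r"\d{9}(\d{5})", r"\d{10}(\d{4})"]


def pattern_capture(token, patterns=PATTERNS, preserve_original=True):
    """Collect the original token plus every capture group found in it."""
    terms = [token] if preserve_original else []
    for pattern in patterns:
        for match in re.finditer(pattern, token):
            terms.extend(group for group in match.groups() if group is not None)
    return terms


# Hypothetical 14-digit product code, used only for illustration.
print(pattern_capture("00000123456789"))
```

For a 14-digit code, each pattern captures one suffix (the last 9, 6, 5 and 4 digits), and the original code is kept alongside them.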
The query
GET /store/products/_search
{
"query": {
"bool": {
"must": [
{
"multi_match": {
"query": "[query_string]",
"fields": ["productCode", "description", "remarks"],
"type": "cross_fields",
"operator": "and"
}
}
]
}
}
}
Sample data
POST /store/products
{
"productCode": "99999123456789",
"description": "Foo bar",
"remarks": "Foobar"
}
The following query strings all return one result:
"456789", "foo", "foobar", "foo foobar".
But the query_string "foo 456789" returns no results.
I am very curious as to why the last search does not return any results. I am convinced that it should.
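The problem is that you are running cross_fields over fields with different analyzers. cross_fields only works in its term-centric way for fields that share the same analyzer: it first groups the fields by analyzer and then evaluates each group separately. With the operator set to and, all terms must then be found within a single group, which is why "foo 456789" returns nothing while each part on its own matches. You can find more information in this documentation: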
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-multi-match-query.html#_literal_cross_field_literal_and_analysis
Although cross_fields needs the same analyzer across the fields it operates on, I've had luck using the tie_breaker parameter to allow other fields (that use different analyzers) to be weighed for the total score.
This has the added benefit of allowing per-field boosting to be calculated in the final score, too.
Here's an example using your query:
GET /store/products/_search
{
"query": {
"bool": {
"must": [
{
"multi_match": {
"query": "[query_string]",
"fields": ["productCode", "description", "remarks"],
"type": "cross_fields",
"tie_breaker": 1 # You may need to tweak this
}
}
]
}
}
}
I also removed the operator field, as I believe using the "AND" operator will cause fields that don't have the same analyzer to be scored inappropriately.

Proper way to query documents using an 'uppercase' token filter

We have an Elasticsearch index with some fields that use custom analyzers. One of the analyzers includes an uppercase token filter in order to get rid of case sensitivity when querying (e.g. we want "ball" to also match "Ball" or "BALL").
The issue is that with regular expressions, the pattern is matched against the term in the index, which is all uppercase. So "app*" won't match "Apple" in our index, because behind the scenes it's really indexed as "APPLE".
Is there a way to get this to work without doing some hacky things outside of ES?
I might play around with "query_string" instead and see if that has any different results.
This all depends on the type of the query you are using. If that type will use the analyzer of the field itself to analyze the input string then it should be fine.
If you are using the regexp query, it will NOT analyze the input string, so if you pass app.* to it, the pattern stays as it is and that is what it will use for the search.
But if you use the query_string query properly, it should work:
{
"settings": {
"analysis": {
"analyzer": {
"my": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"uppercase"
]
}
}
}
},
"mappings": {
"test": {
"properties": {
"some_field": {
"type": "text",
"analyzer": "my"
}
}
}
}
}
And the query itself:
{
"query": {
"query_string": {
"query": "some_field:app*"
}
}
}
To make sure it's doing what I think it is, I always use the _validate api:
GET /_validate/query?explain&index=test
{
"query": {
"query_string": {
"query": "some_field:app*"
}
}
}
which will show what ES is doing to the input string:
"explanations": [
{
"index": "test",
"valid": true,
"explanation": "some_field:APP*"
}
]
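The difference between the two query types can be mimicked in plain Python. This is a toy model only: indexed_terms is a hypothetical set of terms as they would look after the uppercase filter, and regexp_query and analyzed_wildcard are invented helpers, not Elasticsearch APIs. Python's re syntax also differs from Lucene's regular expression syntax in general, although app.* means the same in both.

```python
import re

# Hypothetical index terms after the uppercase token filter has run.
indexed_terms = {"APPLE", "APPLY", "BALL"}


def regexp_query(pattern):
    """Term-level: the pattern is used as typed and must match the whole term."""
    return sorted(t for t in indexed_terms if re.fullmatch(pattern, t))


def analyzed_wildcard(prefix):
    """Mimics query_string's app* once the uppercase filter is applied to it."""
    return sorted(t for t in indexed_terms if t.startswith(prefix.upper()))


print(regexp_query("app.*"))     # no hits: the index only holds uppercase terms
print(regexp_query("APP.*"))     # hits once the pattern is uppercased by hand
print(analyzed_wildcard("app"))  # hits: the input is uppercased before matching
```

This matches the _validate output above, where the input app* was rewritten to APP* before being run against the index.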

elasticsearch scoring unique terms vs ngram terms

i've figured out how to return results on a partial word result using ngrams. but now i'd like to arrange (score or sort) my results based on the term first and then a partial term.
for example, the user searches a movie db for 'we'. i want 'we are marshall' and similar to show up at the top, and not 'north by northwest'. (the 'we' is in 'northwest').
currently this is my mapping for this title field:
"title": {
"type": "string",
"analyzer": "ngramAnalyer",
"fields": {
"term": {
"type": "string",
"analyzer": "fullTermCaseInsensitive"
},
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
i've created a multifield where ngramAnalyzer is a custom ngram analyzer, term uses a keyword tokenizer with a standard filter, and raw is not_analyzed.
my query is as follows:
"query": {
"function_score": {
"functions": [
{
"script_score": {
"script": "_score * (1+ (1 / doc['salesrank'].value) )"
}
}
],
"query": {
"bool": {
"must": [
{
"match_phrase": {
"title": {
"query": "we",
"max_expansions": 10
}
}
}
],
"should":{
"term" : {
"title.term" : {
"value" : "we",
"boost" : 10
}
}
}
}
}
}
i'm basically requiring that the ngram must be matched, and the term 'we' should be matched, and if so, boost it.
this isn't working of course.
any ideas?
edit
to add further complexity ... how would i match first on exact title, then on a custom score?
i've taken some stabs at it, but doesn't seem to work.
for example:
input: 'game'
results should be ordered by exact match 'game'
followed by a custom score based on a sales rank (integer)
so that the next results after 'game' might be something like 'hunger games'
What about a bool combination of boosted queries, where the first clause matches the full term with a 10x boost factor and the second matches the ngram field with the standard boost factor?
