Elasticsearch: why does an exact match have a lower score than a partial match?

my question
I search for the word form, but the exact match form is not first in the results. Is there any way to solve this problem?
my search query
{
"query": {
"match": {
"word": "form"
}
}
}
result
word score
--------------------------
formulation 10.864353
formaldehyde 10.864353
formless 10.864353
formal 10.84412
formerly 10.84412
forma 10.84412
formation 10.574185
formula 10.574185
formulate 10.574185
format 10.574185
formally 10.574185
form 10.254687
former 10.254687
formidable 10.254687
formality 10.254687
formative 10.254687
ill-formed 10.054999
in form 10.035862
pro forma 9.492243
POST my_index/_analyze
At search time, the word form produces a single token: form.
At index time, form produces the tokens ["f", "fo", "for", "form"]; formulation produces the tokens ["f", "fo", ..., "formulatio", "formulation"].
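To make the token lists above concrete, here is a minimal Python sketch that imitates what an edge_ngram filter with min_gram 1 and max_gram 20 emits for a single lowercase word. The function name edge_ngrams is hypothetical, and the sketch ignores the tokenizer and the other filters in the analyzer.

```python
def edge_ngrams(word, min_gram=1, max_gram=20):
    """Return the prefixes of `word` with lengths from min_gram to max_gram."""
    longest = min(len(word), max_gram)
    return [word[:n] for n in range(min_gram, longest + 1)]


# The search token "form" is one of the indexed tokens of both words.
print(edge_ngrams("form"))
print("form" in edge_ngrams("formulation"))
```

Since the single search token form is present in the token list of every word that starts with form, all of those documents match the query.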
my config
filter
"edgengram_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
analyzer
"analyzer": {
"abc_vocab_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"keyword_repeat",
"lowercase",
"asciifolding",
"edgengram_filter",
"unique"
]
},
"abc_vocab_search_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"keyword_repeat",
"lowercase",
"asciifolding",
"unique"
]
}
}
mapping
"word": {
"type": "text",
"analyzer": "abc_vocab_analyzer",
"search_analyzer": "abc_vocab_search_analyzer"
}

You get these results because you use an edge_ngram filter, and form is a prefix of all the words similar to it. In the inverted index, the token form therefore also points to the documents that contain formulation, formal, etc.
The relevance score is computed on that basis. You can refer to this link; I'd specifically suggest going through the sections Default Similarity and BM25. Although the current default similarity is BM25, that link will help you understand how scoring works.
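As a rough illustration of the BM25 idea mentioned above, the following Python sketch computes a textbook BM25 score for one term in one field. It assumes the commonly documented defaults k1 = 1.2 and b = 0.75; the function name and the sample numbers are made up, and the exact constants Lucene applies can differ between versions, so this is not meant to reproduce the scores in the question.

```python
import math


def bm25_term_score(freq, doc_len, avg_doc_len, doc_count, docs_with_term,
                    k1=1.2, b=0.75):
    """Textbook BM25 score of one term for one document field."""
    # Rarer terms (small docs_with_term) get a larger idf.
    idf = math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Fields longer than average are penalised through this normalisation.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    tf = freq * (k1 + 1) / (freq + k1 * length_norm)
    return idf * tf


# Hypothetical numbers: same term, same frequency, different field lengths.
short_field = bm25_term_score(freq=1, doc_len=4, avg_doc_len=8,
                              doc_count=100, docs_with_term=20)
long_field = bm25_term_score(freq=1, doc_len=11, avg_doc_len=8,
                             doc_count=100, docs_with_term=20)
print(short_field > long_field)
```

The inputs that drive the score are the term frequency, the field length relative to the average, and how many documents contain the term. These statistics depend on the index contents, which is why the same query can score similar words differently.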
You need to create a sibling field that you can query in a should clause. You could create a keyword sub-field and use a Term Query on it, but then you need to be careful about case sensitivity.
Instead, as mentioned by @Val, you can create a text sub-field that uses the standard analyzer.
Mapping:
{
"word":{
"type": "text",
"analyzer": "abc_vocab_analyzer",
"search_analyzer": "abc_vocab_search_analyzer",
"fields":{
"standard":{
"type": "text"
}
}
}
}
Query:
POST <your_index_name>/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"word": "form"
}
}
],
"should": [ <---- Note this
{
"match": {
"word.standard": "form"
}
}
]
}
}
}
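If the request body is assembled in application code, the same must/should shape can be produced by a small helper. This is only a sketch in plain Python: the function name exact_boost_query and its parameters are hypothetical, and it builds the JSON body without sending it anywhere.

```python
import json


def exact_boost_query(field, exact_subfield, text):
    """Build a bool query: `must` on the n-gram field, `should` on the exact sub-field."""
    return {
        "query": {
            "bool": {
                # Every hit has to match the edge n-gram field.
                "must": [{"match": {field: text}}],
                # Hits that also match the exact sub-field get extra score.
                "should": [{"match": {f"{field}.{exact_subfield}": text}}],
            }
        }
    }


body = exact_boost_query("word", "standard", "form")
print(json.dumps(body, indent=2))
```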
Let me know if this helps!

Because this field is of type text, ES performs full-text analysis on it, and the search returns the results most similar to the word you have given. To search for exactly the word "form", change your search method to match_phrase. You can also read the articles below to learn more about the different ES search methods:
https://www.cnblogs.com/yjf512/p/4897294.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase.html

It looks like there is an issue in your custom analyzer. I created my own autocomplete analyzer, which uses the edge_ngram and lowercase filters, and for your query it returns the exact match on top. As I understand it, Elasticsearch by default favours the exact match on token matches, so there is no need to explicitly create another field and boost it.
Index def
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 10
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
Index a few docs
{
"title" : "formless"
}
{
"title" : "form"
}
{
"title" : "formulation"
}
Search query on title field as provided in the question
{
"query": {
"match": {
"title": "form"
}
}
}
Search result, with the exact match listed first (note that all four hits have the same score)
"hits": [
{
"_index": "so-60523240-score",
"_type": "_doc",
"_id": "1",
"_score": 0.16410133,
"_source": {
"title": "form"
}
},
{
"_index": "so-60523240-score",
"_type": "_doc",
"_id": "2",
"_score": 0.16410133,
"_source": {
"title": "formulation"
}
},
{
"_index": "so-60523240-score",
"_type": "_doc",
"_id": "3",
"_score": 0.16410133,
"_source": {
"title": "formaldehyde"
}
},
{
"_index": "so-60523240-score",
"_type": "_doc",
"_id": "4",
"_score": 0.16410133,
"_source": {
"title": "formless"
}
}
]

Related

Elastic returns unexpected result from Search using edge_ngram

I am working out how to store my data in Elasticsearch. First I tried the fuzzy function, and while that worked okay, I did not receive the expected results. Afterwards I tried the ngram and then the edge_ngram tokenizer. The edge_ngram tokenizer looked like it works like an autocomplete, which is exactly what I needed. But it still gives unexpected results. I configured min 1 and max 5 to get all results starting with the first letter I search for. While this works, I still get those results as I continue typing.
Example: I have a name field filled with documents named The New York Times and The Guardian. Now when I search for T, both occur as expected. But the same happens when I search for TT, TTT and so on.
In that case it does not matter whether I execute the search in Kibana or from my application (which uses MultiMatch on all fields). Kibana even shows me that it matched the single letter T.
So what did I miss, and how can I get autocomplete-like results without getting too many results?
When defining your index mapping, you need to specify search_analyzer as standard. If no search_analyzer is defined explicitly, then by default elasticsearch considers search_analyzer to be the same as that of analyzer specified.
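The effect of the search analyzer can be imitated in a few lines of Python. The sketch below is only a model, not Elasticsearch itself: edge_ngrams and matches are hypothetical helpers, and only the single word "The" is considered. It compares a query analyzed with the same edge n-gram analyzer against a query kept as one lowercase token, which is what the standard analyzer would produce for TT.

```python
def edge_ngrams(text, min_gram=1, max_gram=5):
    """Lowercase `text` and return its prefixes as a set of tokens."""
    text = text.lower()
    longest = min(len(text), max_gram)
    return {text[:n] for n in range(min_gram, longest + 1)}


def matches(index_tokens, query_tokens):
    """A match query with the default OR operator needs one shared token."""
    return bool(index_tokens & query_tokens)


indexed = edge_ngrams("The")             # tokens stored for the word "The"
same_analyzer_query = edge_ngrams("TT")  # query is split into "t" and "tt"
standard_query = {"tt"}                  # query stays a single token

print(matches(indexed, same_analyzer_query))
print(matches(indexed, standard_query))
```

With the same analyzer on both sides, the query TT still contains the token t, which explains why TT and TTT keep matching.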
Adding a working example with index data, mapping, search query and search result
Index Mapping:
{
"settings": {
"analysis": {
"analyzer": {
"autocomplete": {
"tokenizer": "autocomplete",
"filter": [
"lowercase"
]
}
},
"tokenizer": {
"autocomplete": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 5,
"token_chars": [
"letter"
]
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard" // note this
}
}
}
}
Index Data:
{
"name":"The Guardian"
}
{
"name":"The New York Times"
}
Search Query:
{
"query": {
"match": {
"name": "T"
}
}
}
Search Result:
"hits": [
{
"_index": "69027911",
"_type": "_doc",
"_id": "1",
"_score": 0.23092544,
"_source": {
"name": "The New York Times"
}
},
{
"_index": "69027911",
"_type": "_doc",
"_id": "2",
"_score": 0.20824991,
"_source": {
"name": "The Guardian"
}
}
]

Space handling in Elastic Search

If the document I am searching for (say, a merchant name) has no space in it and the user searches with a space in it, the result won't show up in Elasticsearch. How can this be improved to get results?
For example:
Merchant name is "DeliBites"
The user searches by typing "Deli Bites", and the above merchant does not appear in the results. The merchant only appears in suggestions when I have typed just "Deli" or "Deli" followed by a space or "Deli."
As another option, you can also use an n-gram token filter (the index definition below uses the ngram type), which works in most cases and is simple to set up and use.
Working example on your data
Index definition
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "ngram",
"min_gram": 1,
"max_gram": 10
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
},
"index.max_ngram_diff" : 10
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
Index sample doc
{
"title" : "DeliBites"
}
Search query
{
"query": {
"match": {
"title": {
"query": "Deli Bites"
}
}
}
}
And search results
"hits": [
{
"_index": "65489013",
"_type": "_doc",
"_id": "1",
"_score": 0.95894027,
"_source": {
"title": "DeliBites"
}
}
]
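Why the query "Deli Bites" finds "DeliBites" with this setup can be checked with a small Python model of the n-gram filter. The helper name ngrams is hypothetical, and the sketch assumes the query is simply lowercased and split on whitespace, which is a simplification of the standard analyzer.

```python
def ngrams(word, min_gram=1, max_gram=10):
    """Return every substring of `word` with a length in [min_gram, max_gram]."""
    word = word.lower()
    grams = set()
    for start in range(len(word)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(word):
                grams.add(word[start:start + size])
    return grams


indexed = ngrams("DeliBites")
query_tokens = "Deli Bites".lower().split()

# Each query token is looked up among the indexed n-grams.
print([token in indexed for token in query_tokens])
```

Both deli and bites are substrings of delibites, so both query tokens are found among the indexed tokens.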
I suggest using the synonym token filter.
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-graph-tokenfilter.html
You should have a dictionary of all the words that you want to be searchable, something like this:
DeliBites => Deli Bites
ipod => i pod
Before implementing synonyms, be sure you understand all aspects of them.
https://www.elastic.co/blog/boosting-the-power-of-elasticsearch-with-synonyms

How to do partial search and get relevant score in Elasticsearch

I am new to Elasticsearch, trying to do some search.
I have names of objects like :
Homework
work
jobroles
jobs
I am using a wildcard query, but it returns a score of 1.0 for every doc.
I want the score to be based on how well the doc matched.
Ex. If I type
work
score of work > homework
It's a good question. You can't directly get the exact match on top; what you need is an ngram analyzer, which provides the partial matches, and another field that stores the exact tokens in lowercase (a text field with the standard analyzer will do).
I've reproduced your issue and solved it using the approach mentioned above. Please refer to my blog on autocomplete and to this SO answer of mine for an in-depth read on the various autocomplete/partial searches and the why/what/how of each.
Working example
Create index mapping
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "ngram",
"min_gram": 1,
"max_gram": 10
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
},
"index.max_ngram_diff" : 10
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
},
"title_lowercase" :{
"type" : "text"
}
}
}
}
Index your sample docs
{
"title" : "Homework",
"title_lowercase" : "Homework"
}
{
"title" : "work",
"title_lowercase" : "work"
}
Search query
{
"query": {
"bool": {
"should": [
{
"match": {
"title": {
"query": "work"
}
}
},
{
"match": {
"title_lowercase": {
"query": "work"
}
}
}
]
}
}
}
And expected result
"hits": [
{
"_index": "internaledge",
"_type": "_doc",
"_id": "1",
"_score": 0.9926754, // note: the score of work is much higher than that of Homework
"_source": {
"title": "work",
"title_lowercase": "work"
}
},
{
"_index": "internaledge",
"_type": "_doc",
"_id": "2",
"_score": 0.2995283,
"_source": {
"title": "Homework",
"title_lowercase": "Homework"
}
}
]

Elasticsearch case insensitive wildcard search with spaced words

The field priorityName is of the search_as_you_type datatype.
My use case is that I want to find the documents with the following search words:
"let's" -> should give both the results
"DOING" -> should give both the results
"are you" -> should give both the results
"Are You" -> should give both the results
"you do" (short of you doing)-> should give both the results
"re you" -> should give both the results
Out of the 6, only the first 5 give me the desired result using multi_match.
How can I support the 6th use case, where the query contains an incomplete word that does not start with the word's first characters?
Sample docs
"_index": "priority",
"_type": "_doc",
"_id": "vaCI_HAB31AaC-t5TO9H",
"_score": 1,
"_source": {
"priorityName": "What are you doing along Let's Go out"
}
},
{
"_index": "priority",
"_type": "_doc",
"_id": "vqCQ_HAB31AaC-t5wO8m",
"_score": 1,
"_source": {
"priorityName": "what are you doing along let's go for shopping"
}
}
]
}
For the last search, re you, you need infix tokens, and by default they are not included in the search_as_you_type datatype. I would suggest creating a custom analyzer that generates infix tokens and lets you match all 6 of your queries.
I have created such a custom analyzer and tested it with your sample documents; all 6 queries return both sample documents.
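The difference between prefix tokens and infix tokens can be shown with a short Python sketch. The helpers prefixes and infixes are hypothetical and only model the idea for single lowercase words; they do not reproduce how search_as_you_type is implemented internally.

```python
def prefixes(word):
    """Tokens that only cover the start of the word (edge n-gram style)."""
    return {word[:n] for n in range(1, len(word) + 1)}


def infixes(word, max_gram=8):
    """Tokens for every substring up to max_gram characters (n-gram style)."""
    return {
        word[start:start + size]
        for start in range(len(word))
        for size in range(1, max_gram + 1)
        if start + size <= len(word)
    }


# The query "re you" is split into "re" and "you"; "re" is the problem token.
print("re" in prefixes("are"))
print("re" in infixes("are"))
print("you" in infixes("you"))
```

The token re does not start the word are, so prefix-only tokens miss it, while n-gram tokens generated from every position contain it.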
Index mapping
PUT /infix-index
{
"settings": {
"max_ngram_diff": 50,
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "ngram",
"min_gram": 1,
"max_gram": 8
}
},
"analyzer": {
"autocomplete_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"autocomplete_filter"
]
},
"lowercase_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"properties": {
"priorityName": {
"type": "text",
"analyzer": "autocomplete_analyzer",
"search_analyzer": "standard" --> note this
}
}
}
}
Index your sample docs
{
"priorityName" : "What are you doing along Let's Go out"
}
{
"priorityName" : "what are you doing along let's go for shopping"
}
Search query for the last case, re you
{
"query": {
"match" : {
"priorityName" : "re you"
}
}
}
And result
"hits": [
{
"_index": "ngram",
"_type": "_doc",
"_id": "1",
"_score": 1.4652853,
"_source": {
"priorityName": "What are you doing along Let's Go out"
}
},
{
"_index": "ngram",
"_type": "_doc",
"_id": "2",
"_score": 1.4509768,
"_source": {
"priorityName": "what are you doing along let's go for shopping"
}
}
The other queries also returned both documents; I am not including them here to keep this answer short.
Note: Below are some important links for understanding the answer in detail.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html

elasticsearch: How to rank first appearing words or phrases higher

For example, if I have the following documents:
1. Casa Road
2. Jalan Casa
Say my query term is "cas". On searching, both documents have the same score. I want the document in which casa appears earlier (i.e. document 1 here) to rank first in my query output.
I am using an edgeNGram analyzer. I am also using aggregations, so I cannot use the normal sorting that happens after querying.
You can use the Bool Query to boost the items that start with the search query:
{
"bool" : {
"must" : {
"match" : { "name" : "cas" }
},
"should": {
"prefix" : { "name" : "cas" }
}
}
}
I'm assuming the values you gave are in the name field, and that the field is not analyzed. If it is analyzed, maybe look at this answer for more ideas.
The way it works is:
Both documents will match the query in the must clause, and will receive the same score for that. A document won't be included if it doesn't match the must query.
Only the document with the term starting with cas will match the query in the should clause, causing it to receive a higher score. A document won't be excluded if it doesn't match the should query.
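The two rules above can be written down as a toy scoring function. This Python sketch only models the described behaviour with made-up clause scores; bool_score is a hypothetical helper, and None stands for "clause did not match".

```python
def bool_score(must_scores, should_scores):
    """Combine per-clause scores the way the explanation above describes.

    Each list holds one entry per clause: a float score, or None if the
    clause does not match the document.
    """
    if any(score is None for score in must_scores):
        return None  # a failed must clause excludes the document
    # Matching should clauses add to the score; failed ones are ignored.
    return sum(must_scores) + sum(s for s in should_scores if s is not None)


# Made-up clause scores: both documents match the must clause equally.
casa_road = bool_score(must_scores=[1.0], should_scores=[1.0])
jalan_casa = bool_score(must_scores=[1.0], should_scores=[None])
print(casa_road, jalan_casa)
```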
This might be a bit more involved, but it should work.
Basically, you need the position of the term within the text itself and also the number of terms in the text. The actual scoring is computed using scripts, so you need to enable dynamic scripting in the elasticsearch.yml config file:
script.engine.groovy.inline.search: on
This is what you need:
a mapping that sets term_vector to with_positions, uses an edgeNGram analyzer, and has a sub-field of type token_count:
PUT /test
{
"mappings": {
"test": {
"properties": {
"text": {
"type": "string",
"term_vector": "with_positions",
"index_analyzer": "edgengram_analyzer",
"search_analyzer": "keyword",
"fields": {
"word_count": {
"type": "token_count",
"store": "yes",
"analyzer": "standard"
}
}
}
}
}
},
"settings": {
"analysis": {
"filter": {
"name_ngrams": {
"min_gram": "2",
"type": "edgeNGram",
"max_gram": "30"
}
},
"analyzer": {
"edgengram_analyzer": {
"type": "custom",
"filter": [
"standard",
"lowercase",
"name_ngrams"
],
"tokenizer": "standard"
}
}
}
}
}
test documents:
POST /test/test/1
{"text":"Casa Road"}
POST /test/test/2
{"text":"Jalan Casa"}
the query itself:
GET /test/test/_search
{
"query": {
"bool": {
"must": [
{
"function_score": {
"query": {
"term": {
"text": {
"value": "cas"
}
}
},
"script_score": {
"script": "termInfo=_index['text'].get('cas',_POSITIONS);wordCount=doc['text.word_count'].value;if (termInfo) {for(pos in termInfo){return (wordCount-pos.position)/wordCount}};"
},
"boost_mode": "sum"
}
}
]
}
}
}
and the results:
"hits": {
"total": 2,
"max_score": 1.3715843,
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "1",
"_score": 1.3715843,
"_source": {
"text": "Casa Road"
}
},
{
"_index": "test",
"_type": "test",
"_id": "2",
"_score": 0.8715843,
"_source": {
"text": "Jalan Casa"
}
}
]
}
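The arithmetic inside the script can be checked by hand. The Python sketch below re-implements only the formula (wordCount - position) / wordCount from the script; position_score is a hypothetical name, and positions are assumed to be zero-based, as the results above suggest.

```python
def position_score(position, word_count):
    """Earlier positions score closer to 1.0, later positions closer to 0."""
    return (word_count - position) / word_count


first = position_score(position=0, word_count=2)   # "Casa Road"
second = position_score(position=1, word_count=2)  # "Jalan Casa"
print(first, second, first - second)
```

The gap of 0.5 between the two values equals the gap between the two _score values in the results (1.3715843 and 0.8715843), which is consistent with boost_mode sum adding the script value to an identical query score.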
