Scoring higher for shorter fields - elasticsearch

I'm trying to get a higher score (or at least the same score) for the shortest values in Elasticsearch.
Let's say I have these documents: "Abc", "Abca", "Abcb", "Abcc". The field label.ngram uses an edge_ngram analyzer.
With a really simple query like this:
{
"query": {
"match": {
"label.ngram": {
"query": "Ab"
}
}
}
}
I always get the documents "Abca", "Abcb", and "Abcc" first, ahead of "Abc".
How can I get "Abc" first?
(should I use this: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html?)
Thanks!

This is happening due to field-length normalization. To get the same score, you have to disable norms on the field.
Norms store various normalization factors that are later used at query
time in order to compute the score of a document relatively to a
query.
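Note that norms can also be disabled on an existing field through the update mapping API, although they cannot be re-enabled afterwards. A sketch against the example index below (the field definition, including its analyzer, is repeated with norms set to false):
PUT /65953349/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "my_analyzer",
      "norms": false
    }
  }
}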
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 10,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"norms": false,
"analyzer": "my_analyzer"
}
}
}
}
Index Data:
{
"title": "Abca"
}
{
"title": "Abcb"
}
{
"title": "Abcc"
}
{
"title": "Abc"
}
Search Query:
{
"query": {
"match": {
"title": {
"query": "Ab"
}
}
}
}
Search Result:
"hits": [
{
"_index": "65953349",
"_type": "_doc",
"_id": "1",
"_score": 0.1424427,
"_source": {
"title": "Abca"
}
},
{
"_index": "65953349",
"_type": "_doc",
"_id": "2",
"_score": 0.1424427,
"_source": {
"title": "Abcb"
}
},
{
"_index": "65953349",
"_type": "_doc",
"_id": "3",
"_score": 0.1424427,
"_source": {
"title": "Abcc"
}
},
{
"_index": "65953349",
"_type": "_doc",
"_id": "4",
"_score": 0.1424427,
"_source": {
"title": "Abc"
}
}
]

As mentioned by @ESCoder, disabling norms fixes the scoring, but it is not very useful if you actually want to rank your search results: it causes all matching documents to get the same score, which hurts the relevance of your results big time.
You could instead tweak the document-length normalization parameter of the default similarity algorithm (BM25) if you are on ES 5.x or higher. I tried this with your dataset and my settings but could not make it work.
The second option, which will mostly work, is the one you suggested: store the length of the field in a separate field. You would have to populate it from your application, since the analysis process generates a varying number of tokens for the same field. This is extra overhead, though, and I would prefer tweaking the similarity parameters, as sketched below.
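For reference, here is roughly what that similarity tweak looks like; only a sketch, since as said I could not get it to produce the desired ordering on this dataset. b controls how strongly field length is penalized (default 0.75) and k1 controls term-frequency saturation (default 1.2); the edge_ngram analysis settings from the example above are omitted for brevity:
PUT /my-index
{
  "settings": {
    "index": {
      "similarity": {
        "length_sensitive_bm25": {
          "type": "BM25",
          "b": 1.0,
          "k1": 1.2
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "similarity": "length_sensitive_bm25"
      }
    }
  }
}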

Related

Elastic returns unexpected result from Search using edge_ngram

I am working out how to store my data in Elasticsearch. First I tried the fuzzy function, and while that worked okay, I did not receive the expected results. Afterwards I tried the ngram and then the edge_ngram tokenizer. The edge_ngram tokenizer looked like it works like an autocomplete, which is exactly what I needed. But it still gives unexpected results. I configured min 1 and max 5 to get all results starting with the first letter I search for. While this works, I still get those results as I continue typing.
Example: I have a name field filled with documents named The New York Times and The Guardian. When I search for T, both occur as expected. But the same happens when I search for TT, TTT, and so on.
In that case it does not matter whether I execute the search in Kibana or from my application (which uses MultiMatch on all fields). Kibana even shows me that it matched the single letter T.
So what did I miss, and how can I get autocomplete-like results without too many matches?
When defining your index mapping, you need to specify search_analyzer as standard. If no search_analyzer is defined explicitly, Elasticsearch uses the same analyzer at search time as the one specified for indexing.
Adding a working example with index data, mapping, search query and search result
Index Mapping:
{
"settings": {
"analysis": {
"analyzer": {
"autocomplete": {
"tokenizer": "autocomplete",
"filter": [
"lowercase"
]
}
},
"tokenizer": {
"autocomplete": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 5,
"token_chars": [
"letter"
]
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard" // note this
}
}
}
}
Index Data:
{
"name":"The Guardian"
}
{
"name":"The New York Times"
}
Search Query:
{
"query": {
"match": {
"name": "T"
}
}
}
Search Result:
"hits": [
{
"_index": "69027911",
"_type": "_doc",
"_id": "1",
"_score": 0.23092544,
"_source": {
"name": "The New York Times"
}
},
{
"_index": "69027911",
"_type": "_doc",
"_id": "2",
"_score": 0.20824991,
"_source": {
"name": "The Guardian"
}
}
]
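You can verify why this works with the _analyze API: at index time the autocomplete analyzer turns The into the edge n-grams t, th, the, while at search time the standard analyzer keeps the query TT as the single token tt, which matches none of them:
POST /69027911/_analyze
{
  "analyzer": "autocomplete",
  "text": "The"
}
POST /69027911/_analyze
{
  "analyzer": "standard",
  "text": "TT"
}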

Username search in Elasticsearch

I want to implement a simple username search within Elasticsearch. I don't want weighted username searches yet, so I would expect it wouldn't be too hard to find resources on how to do this. But in the end I came across NGrams and a lot of outdated Elasticsearch tutorials, and I completely lost track of the best practice for doing this.
This is my current setup, but it is really bad because it matches so many unrelated usernames:
{
"settings": {
"index" : {
"max_ngram_diff": "11"
},
"analysis": {
"analyzer": {
"username_analyzer": {
"tokenizer": "username_tokenizer",
"filter": [
"lowercase"
]
}
},
"tokenizer": {
"username_tokenizer": {
"type": "ngram",
"min_gram": "1",
"max_gram": "12"
}
}
}
},
"mappings": {
"properties": {
"_all" : { "enabled" : false },
"username": {
"type": "text",
"analyzer": "username_analyzer"
}
}
}
}
I am using the newest Elasticsearch and I just want to query similar/exact usernames. I have a user DB, and users should be able to search for each other; nothing too fancy.
If you want to search for exact usernames, you can use the term query.
The term query returns documents that contain an exact term in a provided field. If you have not defined an explicit index mapping, you need to add .keyword to the field, which uses the keyword analyzer instead of the standard analyzer.
There is no need for an n-gram tokenizer if you only want to search for exact terms.
Adding a working example with index data, index mapping, search query, and search result
Index Mapping:
{
"mappings": {
"properties": {
"username": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
Index Data:
{
"username": "Jack"
}
{
"username": "John"
}
Search Query:
{
"query": {
"term": {
"username.keyword": "Jack"
}
}
}
Search Result:
"hits": [
{
"_index": "68844541",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"username": "Jack"
}
}
]
Edit 1:
To match similar terms, you can use the fuzziness parameter along with the match query.
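This assumes one additional document has been indexed alongside Jack and John; it appears as _id 3 in the result below:
{
  "username": "something"
}
Search Query: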
{
"query": {
"match": {
"username": {
"query": "someting",
"fuzziness":"auto"
}
}
}
}
Search Result:
"hits": [
{
"_index": "68844541",
"_type": "_doc",
"_id": "3",
"_score": 0.6065038,
"_source": {
"username": "something"
}
}
]
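Since you want both exact and similar matches, one option is to combine the two in a bool query so that exact hits rank first; a minimal sketch (the boost of 2 is an arbitrary choice):
{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "username.keyword": {
              "value": "Jack",
              "boost": 2.0
            }
          }
        },
        {
          "match": {
            "username": {
              "query": "Jack",
              "fuzziness": "auto"
            }
          }
        }
      ]
    }
  }
}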

Ignore term frequency but use positions

I have an index with a text field, and I want to ignore term frequencies in scoring but keep positions so that match_phrase searches still work.
My index is defined like this:
curl --location --request PUT 'localhost:9200/my-index-001' \
--header 'Content-Type: application/json' \
--data-raw '{
"mappings": {
"autocomplete": {
"properties": {
"title": {
"type": "text",
"analyzer": "row_autocomplete"
},
"name": {
"type": "text",
"analyzer": "row_autocomplete"
}
}
}
},
"settings": {
"analysis": {
"analyzer": {
"row_autocomplete": {
"tokenizer": "icu_tokenizer",
"filter": ["icu_folding", "autocomplete_filter", "lowercase"]
}
},
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
}
}
}
}'
Index Data:
[
{
"title": "university",
"name": "london and EC london English"
},
{
"title": "city",
"name": "london"
}
]
I want city to get the higher score when I execute a match query like this:
POST _search
{
"query": {
"bool": {
"should": [
{
"match": {
"name": {
"query": "london"
}
}
},
{
"match_phrase": {
"name": {
"query": "london",
}
}
}
]
}
}
}
They get different scores (university actually scores higher than city) because of term frequency. What I want is for term frequency to count only once; then, since city's fieldLength is smaller than university's, ignoring the repeated termFreq would make city score higher than university according to Elasticsearch's formula:
GET _explain
# city's _explain
{
"value": 2.0785222,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 6.0,
"description": "termFreq=6.0",
"details": []
},
{
"value": 2.0,
"description": "fieldLength",
"details": []
},
...
]
}
# university's explain
{
"value": 2.1087635,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 24.0,
"description": "termFreq=24.0",
"details": []
},
{
"value": 29.0,
"description": "fieldLength",
"details": []
},
...
]
}
There are some things I have tried. For example, in the index mapping I can set index_options=docs to ignore term frequencies, but this disables term positions and I can't use match_phrase queries anymore.
Does anyone have any idea?
Thanks in advance.
You can use a constant score query, which wraps a filter query and returns every matching document with a relevance score equal to the boost parameter value.
With a constant score query, your match query will contribute no score beyond the constant boost: it acts only as a filter that checks whether the query matches, not as full-text relevance matching.
A constant_score query takes a boost argument that is set as the score
for every returned document when combined with other queries. By
default boost is set to 1.
Refer to this for a detailed explanation of the bool filter, and to this SO answer to understand the difference between a constant score query and a bool filter.
Adding a working example with index data, search query and search result.
Index Data:
{ "name": "london only" }
{ "name": "london and London" }
Search Query:
{
"query": {
"constant_score": {
"filter": {
"bool": {
"must": [
{
"match": {
"name": "london"
}
}
]
}
}
}
}
}
Search Result:
"hits": [
{
"_index": "my_index",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"name": "london and london"
}
},
{
"_index": "my_index",
"_type": "_doc",
"_id": "2",
"_score": 1.0,
"_source": {
"name": "london only"
}
}
]
I indexed both of the sample documents you provided using the default index mapping, so both title and name are text fields. I used the same query as yours, and it returns the higher score for the doc that contains just london, as shown below:
"hits": [
{
"_index": "matchrel",
"_type": "_doc",
"_id": "1",
"_score": 0.51518387,
"_source": {
"title": "city",
"name": "london"
}
},
{
"_index": "matchrel",
"_type": "_doc",
"_id": "2",
"_score": 0.41750965,
"_source": {
"title": "university",
"name": "london university and EC London English"
}
}
]
Also, as you have not explained your use case in detail, with the limited info it seems this can easily be achieved with the query below, which also returns a higher score for the london doc:
{
"query": {
"match_phrase": {
"name": "london"
}
}
}
And its search result:
"hits": [
{
"_index": "matchrel",
"_type": "_doc",
"_id": "1",
"_score": 0.25759193, // note score
"_source": {
"title": "city",
"name": "london"
}
},
{
"_index": "matchrel",
"_type": "_doc",
"_id": "2",
"_score": 0.20875482,
"_source": {
"title": "university",
"name": "london university and EC London English"
}
}
]
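If you really do need to ignore term frequency while keeping positions (so that match_phrase still works), another option is a scripted similarity. A rough sketch, not tuned for production: the term-frequency contribution is fixed at 1 while a simple length normalization is kept, so repeated occurrences of london no longer help but the shorter city document still wins (index and field names are placeholders; the formula is a simplified TF-IDF variant, not BM25):
PUT /my-index-001
{
  "settings": {
    "index": {
      "similarity": {
        "ignore_tf": {
          "type": "scripted",
          "script": {
            "source": "double idf = Math.log((field.docCount + 1.0) / (term.docFreq + 1.0)) + 1.0; double norm = 1.0 / Math.sqrt(doc.length); return query.boost * idf * norm;"
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "similarity": "ignore_tf"
      }
    }
  }
}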

ngram matching gives same score to less relevant documents

I am searching for Bob Smith in my Elasticsearch index. The results Bob Smith and Bobbi Smith both come back in the response with the same score. I want Bob Smith to score higher so that it appears first in my result set. Why are the scores equivalent?
Here is my query:
{
"query": {
"query_string": {
"query": "Bob Smith",
"fields": [
"text_field"
]
}
}
}
Below are my index's settings. I am using the ngram token filter described here: https://qbox.io/blog/an-introduction-to-ngrams-in-elasticsearch
{
"contacts_5test": {
"aliases": {},
"mappings": {
"properties": {
"text_field": {
"type": "text",
"term_vector": "yes",
"analyzer": "ngram_filter_analyzer"
}
}
},
"settings": {
"index": {
"number_of_shards": "1",
"provided_name": "contacts_5test",
"creation_date": "1588987227997",
"analysis": {
"filter": {
"ngram_filter": {
"type": "nGram",
"min_gram": "4",
"max_gram": "4"
}
},
"analyzer": {
"ngram_filter_analyzer": {
"filter": [
"lowercase",
"ngram_filter"
],
"type": "custom",
"tokenizer": "standard"
}
}
},
"number_of_replicas": "1",
"uuid": "HqOXu9bNRwCHSeK39WWlxw",
"version": {
"created": "7060199"
}
}
}
}
}
Here are the results from my query...
"hits": [
{
"_index": "contacts_5test",
"_type": "_doc",
"_id": "1",
"_score": 0.69795835,
"_source": {
"text_field": "Bob Smith"
}
},
{
"_index": "contacts_5test",
"_type": "_doc",
"_id": "2",
"_score": 0.69795835,
"_source": {
"text_field": "Bobbi Smith"
}
}
]
If I instead search for Bobbi Smith, Elasticsearch returns both documents, but with a higher score for Bobbi Smith. That makes more sense.
I was able to reproduce your issue. The reason is your ngram_filter, which doesn't create any token for bob because you set min_gram to 4: the standard tokenizer creates the token bob, but it then gets filtered out by your ngram_filter.
I also tried lowering min_gram to 3, which does create the tokens, but then both Bob Smith and Bobbi Smith contain the same bob token, so their scores are still the same.
When you search for Bobbi Smith, however, the n-grams generated from bobbi (such as bobb and obbi) are present in only one document, hence the higher score.
Note: use the analyze API and the explain API to inspect the tokens generated and how they are matched; this will help you understand the issue and this explanation in detail.
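For example, running the question's analyzer through the _analyze API shows that Bob Smith produces only the 4-grams smit and mith, with bob dropped entirely:
POST /contacts_5test/_analyze
{
  "analyzer": "ngram_filter_analyzer",
  "text": "Bob Smith"
}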

Elasticsearch case insensitive wildcard search with spaced words

The field priorityName is of the search_as_you_type data type.
My use case: I want to find the documents with the following search terms:
"let's" -> should give both the results
"DOING" -> should give both the results
"are you" -> should give both the results
"Are You" -> should give both the results
"you do" (short of you doing)-> should give both the results
"re you" -> should give both the results
Out of the 6, only the first 5 give me the desired result using multi_match.
How can I handle the 6th case, where the search term is an incomplete word that does not start with the word's first characters?
Sample docs
"hits": [
{
"_index": "priority",
"_type": "_doc",
"_id": "vaCI_HAB31AaC-t5TO9H",
"_score": 1,
"_source": {
"priorityName": "What are you doing along Let's Go out"
}
},
{
"_index": "priority",
"_type": "_doc",
"_id": "vqCQ_HAB31AaC-t5wO8m",
"_score": 1,
"_source": {
"priorityName": "what are you doing along let's go for shopping"
}
}
]
}
For the last search, re you, you need infix tokens, and by default the search_as_you_type data type does not generate them. I would suggest creating a custom analyzer that produces infix tokens and lets you match all 6 of your queries.
I have created such a custom analyzer and tested it with your sample documents; all 6 queries return both sample documents.
Index mapping:
PUT /infix-index
{
"settings": {
"max_ngram_diff": 50,
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "ngram",
"min_gram": 1,
"max_gram": 8
}
},
"analyzer": {
"autocomplete_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"autocomplete_filter"
]
},
"lowercase_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"properties": {
"priorityName": {
"type": "text",
"analyzer": "autocomplete_analyzer",
"search_analyzer": "standard" --> note this
}
}
}
}
Index your sample docs
{
"priorityName" : "What are you doing along Let's Go out"
}
{
"priorityName" : "what are you doing along let's go for shopping"
}
Search query for the last case, re you:
{
"query": {
"match" : {
"priorityName" : "re you"
}
}
}
And the result:
"hits": [
{
"_index": "ngram",
"_type": "_doc",
"_id": "1",
"_score": 1.4652853,
"_source": {
"priorityName": "What are you doing along Let's Go out"
}
},
{
"_index": "ngram",
"_type": "_doc",
"_id": "2",
"_score": 1.4509768,
"_source": {
"priorityName": "what are you doing along let's go for shopping"
}
}
]
The other queries also returned both documents, but I am not including them to keep this answer short.
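To see why re you matches, you can inspect the tokens the index-time analyzer produces with the _analyze API: are yields the n-grams a, ar, are, r, re, e and you yields y, yo, you, o, ou, u, so the standard-analyzed query tokens re and you both find a match:
POST /infix-index/_analyze
{
  "analyzer": "autocomplete_analyzer",
  "text": "are you"
}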
Note: Below are some important links to understand the answer in detail.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html
