Elastic returns unexpected result from Search using edge_ngram - elasticsearch

I am working out how to store my data in Elasticsearch. First I tried the fuzzy query, and while that worked okay, I did not receive the expected results. Afterwards I tried the ngram and then the edge_ngram tokenizer. The edge_ngram tokenizer looked like it works like an autocomplete, which is exactly what I needed. But it still gives unexpected results. I configured min_gram 1 and max_gram 5 to get all results starting with the first letter I search for. While this works, I still get those results as I continue typing.
Example: I have a name field filled with documents named The New York Times and The Guardian. Now when I search for T both occur as expected. But the same happens when I search for TT, TTT and so on.
In that case it does not matter whether I execute the search in Kibana or from my application (which uses MultiMatch on all fields). Kibana even shows me that it matched the single letter T.
So what did I miss and how can I achieve getting the results like with an autocomplete but without having too many results?

When defining your index mapping, you need to set search_analyzer to standard. If no search_analyzer is defined explicitly, Elasticsearch uses the same analyzer at search time as the analyzer specified for indexing, so your query is also split into edge n-grams.
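To see why using the same analyzer on both sides causes this, here is a rough Python sketch (an illustration of the behaviour, not Elasticsearch code) of an edge_ngram tokenizer with min_gram 1 and max_gram 5:

```python
def edge_ngrams(word, min_gram=1, max_gram=5):
    """Rough sketch of the edge_ngram tokenizer: emit the prefixes of
    each lowercased word, from min_gram to max_gram characters."""
    token = word.lower()
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

# index-time tokens for one document
index_tokens = set()
for word in "The New York Times".split():
    index_tokens.update(edge_ngrams(word))

# with the SAME analyzer at search time, "TT" is expanded to ["t", "tt"]
# and the "t" gram matches -- the surprising behaviour from the question
print(any(gram in index_tokens for gram in edge_ngrams("TT")))   # True

# with search_analyzer "standard" the query stays a single token "tt",
# which was never indexed, so there is no match
print("tt" in index_tokens)                                      # False
```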
Adding a working example with index data, mapping, search query and search result
Index Mapping:
{
"settings": {
"analysis": {
"analyzer": {
"autocomplete": {
"tokenizer": "autocomplete",
"filter": [
"lowercase"
]
}
},
"tokenizer": {
"autocomplete": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 5,
"token_chars": [
"letter"
]
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard" // note this
}
}
}
}
Index Data:
{
"name":"The Guardian"
}
{
"name":"The New York Times"
}
Search Query:
{
"query": {
"match": {
"name": "T"
}
}
}
Search Result:
"hits": [
{
"_index": "69027911",
"_type": "_doc",
"_id": "1",
"_score": 0.23092544,
"_source": {
"name": "The New York Times"
}
},
{
"_index": "69027911",
"_type": "_doc",
"_id": "2",
"_score": 0.20824991,
"_source": {
"name": "The Guardian"
}
}
]

Related

Username search in Elasticsearch

I want to implement a simple username search within Elasticsearch. I don't want weighted username searches yet, so I would expect it wouldn't be too hard to find resources on how to do this. But in the end I came across NGrams and a lot of outdated Elasticsearch tutorials, and I completely lost track of the best practice for doing this.
This is now my setup, but it is really bad because it matches so many unrelated usernames:
{
"settings": {
"index" : {
"max_ngram_diff": "11"
},
"analysis": {
"analyzer": {
"username_analyzer": {
"tokenizer": "username_tokenizer",
"filter": [
"lowercase"
]
}
},
"tokenizer": {
"username_tokenizer": {
"type": "ngram",
"min_gram": "1",
"max_gram": "12"
}
}
}
},
"mappings": {
"properties": {
"_all" : { "enabled" : false },
"username": {
"type": "text",
"analyzer": "username_analyzer"
}
}
}
}
I am using the newest Elasticsearch and I just want to query similar/exact usernames. I have a user db and users should be able to search for each other, nothing too fancy.
If you want to search for exact usernames, then you can use the term query.
A term query returns documents that contain an exact term in a provided field. If you have not defined an explicit index mapping, you need to add .keyword to the field; the keyword sub-field stores the value as a single unanalyzed term instead of running it through the standard analyzer.
There is no need for an n-gram tokenizer if you only want to search for exact terms.
Adding a working example with index data, index mapping, search query, and search result
Index Mapping:
{
"mappings": {
"properties": {
"username": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
Index Data:
{
"username": "Jack"
}
{
"username": "John"
}
Search Query:
{
"query": {
"term": {
"username.keyword": "Jack"
}
}
}
Search Result:
"hits": [
{
"_index": "68844541",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"username": "Jack"
}
}
]
Edit 1:
To match similar terms, you can use the fuzziness parameter along with the match query:
{
"query": {
"match": {
"username": {
"query": "someting",
"fuzziness":"auto"
}
}
}
}
Search Result will be
"hits": [
{
"_index": "68844541",
"_type": "_doc",
"_id": "3",
"_score": 0.6065038,
"_source": {
"username": "something"
}
}
]
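As background, AUTO fuzziness maps the length of each query term to a maximum Levenshtein edit distance (the 0-2 / 3-5 / 6+ thresholds below are the documented defaults). A small Python sketch of why "someting" matches "something":

```python
def auto_fuzziness(term):
    """AUTO fuzziness: terms of length 0-2 must match exactly,
    3-5 characters allow one edit, longer terms allow two."""
    n = len(term)
    return 0 if n <= 2 else 1 if n <= 5 else 2

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# "someting" is one edit away from "something" and its length (8)
# allows up to two edits, so the fuzzy match succeeds
print(levenshtein("someting", "something"))  # 1
print(levenshtein("someting", "something") <= auto_fuzziness("someting"))  # True
```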

Elastic Enterprise Search: Search API doesn't match “33” for “033” in document title

I couldn't figure out how to match a title such as "Test title No.033" with the query "33".
When I search for "033", it returns the document. But for just "33" it doesn't return anything :frowning:
The guide is not very helpful for me (Search API | Elastic App Search Documentation [7.12] | Elastic)
Could you please help me with this?
What other information should I provide?
If no analyzer is specified, Elasticsearch uses the standard analyzer.
The tokens generated will be "test", "title", "no", "033".
You can use the ngram tokenizer to do a partial match on the "title" field.
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 2,
"max_gram": 5,
"token_chars": [
"letter",
"digit"
]
}
}
},
"max_ngram_diff": 10
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
Analyze API
GET /<your_index_name>/_analyze
{
"analyzer" : "my_analyzer",
"text" : "Test title No.033"
}
The tokens generated will contain both "033" and "33"
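A quick Python sketch of the difference (an illustration, not the actual Lucene tokenizers): the ngram tokenizer emits infix grams such as "33", while edge_ngram would only emit prefixes, so "33" would never be indexed:

```python
def char_ngrams(token, min_gram=2, max_gram=5):
    """Sketch of the ngram tokenizer: every substring of the token
    between min_gram and max_gram characters long."""
    return {token[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(token) - n + 1)}

def edge_ngrams(token, min_gram=2, max_gram=5):
    """Sketch of the edge_ngram tokenizer: prefixes only."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

print("33" in char_ngrams("033"))   # True  -- infix gram is indexed
print("33" in edge_ngrams("033"))   # False -- only "03" and "033"
```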
Index Data:
{
"title":"Test title No.033"
}
Search Query:
{
"query":{
"match":{
"title":"33"
}
}
}
Search Result:
"hits": [
{
"_index": "67091386",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"title": "Test title No.033"
}
}
]
Elastic App Search doesn't allow you to configure the tokenizer as you can in Elasticsearch. However, you can tune the results using the relevance tuning API.
You can give more weight to your title, and it will start showing when you search for either "33" or "033".
Note: Relevance tuning has multiple components, like Weights and Functions, that you can apply. I'd recommend trying it out in the Elastic App Search Console.

Scoring higher for shorter fields

I'm trying to get a higher score (or at least the same score) for the shortest values in Elasticsearch.
Let's say I have these documents: "Abc", "Abca", "Abcb", "Abcc". The field label.ngram uses an EdgeNgram analyser.
With a really simple query like that:
{
"query": {
"match": {
"label.ngram": {
"query": "Ab"
}
}
}
}
I always get the documents "Abca", "Abcb", "Abcc" first, instead of "Abc".
How can I get "Abc" first?
(should I use this: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html?)
Thanks!
This happens due to field-length normalization; to get the same score, you have to disable norms on the field.
Norms store various normalization factors that are later used at query time in order to compute the score of a document relatively to a query.
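To see how norms affect the score, here is a simplified single-term BM25 sketch (Lucene's actual implementation also quantizes the stored length norm into a single byte, so real scores will differ). Disabling norms is equivalent to dropping the field-length factor, i.e. setting b = 0, which makes documents of different lengths score identically; the field lengths and document counts below are illustrative assumptions:

```python
import math

def bm25_term_score(tf, dl, avgdl, df, N, k1=1.2, b=0.75):
    """Single-term BM25, simplified. dl is the field length in tokens;
    with norms disabled the dl/avgdl factor disappears (b = 0)."""
    idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))

# a shorter field (fewer n-gram tokens) vs a longer one
with_norms = [bm25_term_score(1, dl, 2.75, 4, 4) for dl in (2, 3)]
without_norms = [bm25_term_score(1, dl, 2.75, 4, 4, b=0) for dl in (2, 3)]

print(with_norms[0] > with_norms[1])         # True: shorter field scores higher
print(without_norms[0] == without_norms[1])  # True: length no longer matters
```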
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 10,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"norms": false,
"analyzer": "my_analyzer"
}
}
}
}
Index Data:
{
"title": "Abca"
}
{
"title": "Abcb"
}
{
"title": "Abcc"
}
{
"title": "Abc"
}
Search Query:
{
"query": {
"match": {
"title": {
"query": "Ab"
}
}
}
}
Search Result:
"hits": [
{
"_index": "65953349",
"_type": "_doc",
"_id": "1",
"_score": 0.1424427,
"_source": {
"title": "Abca"
}
},
{
"_index": "65953349",
"_type": "_doc",
"_id": "2",
"_score": 0.1424427,
"_source": {
"title": "Abcb"
}
},
{
"_index": "65953349",
"_type": "_doc",
"_id": "3",
"_score": 0.1424427,
"_source": {
"title": "Abcc"
}
},
{
"_index": "65953349",
"_type": "_doc",
"_id": "4",
"_score": 0.1424427,
"_source": {
"title": "Abc"
}
}
]
As mentioned by @ESCoder, you can fix the scoring by disabling norms, but this is not very useful if you want to rank your search results: it causes all the documents in your search results to get the same score, which hurts the relevance of your search results big time.
Maybe you should tweak the document-length norm parameter of the default similarity algorithm (BM25) if you are on ES 5.x or higher. I tried doing this with your dataset and my settings but couldn't make it work.
The second option, which will mostly work, is the one you suggested: store the length of the field in a separate field, populated from your application (since the analysis process generates multiple tokens for the same field). This is extra overhead, though, and I would prefer tweaking the similarity parameters.

Elasticsearch case insensitive wildcard search with spaced words

field priorityName is of search_as_you_type dataType.
My use case is that I want to search for documents containing the following words:
"let's" -> should give both the results
"DOING" -> should give both the results
"are you" -> should give both the results
"Are You" -> should give both the results
"you do" (short for you doing) -> should give both the results
"re you" -> should give both the results
Out of 6, only the first 5 are giving me the desired result using multi_match.
How can I handle the 6th case, where the partial word does not start at the first character?
Sample docs
"hits": [
{
"_index": "priority",
"_type": "_doc",
"_id": "vaCI_HAB31AaC-t5TO9H",
"_score": 1,
"_source": {
"priorityName": "What are you doing along Let's Go out"
}
},
{
"_index": "priority",
"_type": "_doc",
"_id": "vqCQ_HAB31AaC-t5wO8m",
"_score": 1,
"_source": {
"priorityName": "what are you doing along let's go for shopping"
}
}
]
}
For the last search, re you, you need infix tokens, and by default they are not included in the search_as_you_type data type. I would suggest creating a custom analyzer that generates infix tokens and allows you to match all 6 of your queries.
I have already created such a custom analyzer and tested it with your sample documents; all 6 queries return both sample documents.
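The idea behind the analyzer below can be sketched in a few lines of Python (an illustration, not Elasticsearch code): the ngram token filter emits every substring gram of each whitespace token, so the infix "re" of "are" ends up in the index, while the standard search analyzer keeps the query words whole:

```python
def char_ngrams(token, min_gram=1, max_gram=8):
    """Sketch of the ngram token filter: all substring grams of a token,
    which (unlike edge_ngram) includes infix grams."""
    return {token[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(token) - n + 1)}

# index-time: whitespace tokenizer + lowercase + ngram filter
index_tokens = set()
for word in "what are you doing along let's go for shopping".split():
    index_tokens.update(char_ngrams(word))

# search-time: the standard analyzer keeps "re you" as the whole
# terms "re" and "you"; "re" matches as an infix gram of "are"
print(all(term in index_tokens for term in "re you".split()))  # True
```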
Index mapping
PUT /infix-index
{
"settings": {
"max_ngram_diff": 50,
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "ngram",
"min_gram": 1,
"max_gram": 8
}
},
"analyzer": {
"autocomplete_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"autocomplete_filter"
]
},
"lowercase_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"properties": {
"priorityName": {
"type": "text",
"analyzer": "autocomplete_analyzer",
"search_analyzer": "standard" --> note this
}
}
}
}
Index your sample docs
{
"priorityName" : "What are you doing along Let's Go out"
}
{
"priorityName" : "what are you doing along let's go for shopping"
}
Search query for the last case, re you
{
"query": {
"match" : {
"priorityName" : "re you"
}
}
}
And result
"hits": [
{
"_index": "ngram",
"_type": "_doc",
"_id": "1",
"_score": 1.4652853,
"_source": {
"priorityName": "What are you doing along Let's Go out"
}
},
{
"_index": "ngram",
"_type": "_doc",
"_id": "2",
"_score": 1.4509768,
"_source": {
"priorityName": "what are you doing along let's go for shopping"
}
}
]
The other queries also returned both documents, but I am not including them here to keep this answer short.
Note: Below are some important links to understand the answer in detail.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html

Elasticsearch: why exact match has lower score than partial match

my question
I search for the word form, but the exact match form is not the first result. Is there any way to solve this problem?
my search query
{
"query": {
"match": {
"word": "form"
}
}
}
result
word score
--------------------------
formulation 10.864353
formaldehyde 10.864353
formless 10.864353
formal 10.84412
formerly 10.84412
forma 10.84412
formation 10.574185
formula 10.574185
formulate 10.574185
format 10.574185
formally 10.574185
form 10.254687
former 10.254687
formidable 10.254687
formality 10.254687
formative 10.254687
ill-formed 10.054999
in form 10.035862
pro forma 9.492243
POST my_index/_analyze
The word form in search has only one token form.
In index, form tokens are ["f", "fo", "for", "form"]; formulation tokens are ["f", "fo", ..., "formulatio", "formulation"].
my config
filter
"edgengram_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
analyzer
"analyzer": {
"abc_vocab_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"keyword_repeat",
"lowercase",
"asciifolding",
"edgengram_filter",
"unique"
]
},
"abc_vocab_search_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"keyword_repeat",
"lowercase",
"asciifolding",
"unique"
]
}
}
mapping
"word": {
"type": "text",
"analyzer": "abc_vocab_analyzer",
"search_analyzer": "abc_vocab_search_analyzer"
}
You get the results you see because you've implemented an edge-ngram filter and form is a prefix of the words similar to it. In the inverted index, the posting list for the token form also stores the document ids of the documents that contain formulation, formal, etc.
Therefore, your relevancy is also computed that way. You can refer to this link; I'd specifically suggest going through the sections Default Similarity and BM25. Although the current default similarity is BM25, that link will help you understand how scoring works.
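As a toy illustration (plain Python with hypothetical doc ids, not Elasticsearch internals): with an edge-ngram filter, every word beginning with form contributes the gram form to the index, so the posting list for form contains all of these documents and they all match the query:

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    """Sketch of the edgengram_filter from the question's config."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

# build a tiny inverted index over three hypothetical documents
postings = {}
for doc_id, word in enumerate(["form", "formulation", "formal"]):
    for gram in edge_ngrams(word.lower()):
        postings.setdefault(gram, set()).add(doc_id)

# the query token "form" retrieves every document, exact match or not
print(sorted(postings["form"]))  # [0, 1, 2]
```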
You would need to create a sibling field which you can apply in a should clause. You could create a keyword sub-field and use a term query, but you would need to be careful about case sensitivity.
Instead, as mentioned by @Val, you can create a sibling text field with the standard analyzer.
Mapping:
{
"word":{
"type": "text",
"analyzer": "abc_vocab_analyzer",
"search_analyzer": "abc_vocab_search_analyzer",
"fields":{
"standard":{
"type": "text"
}
}
}
}
Query:
POST <your_index_name>/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"word": "form"
}
}
],
"should": [ <---- Note this
{
"match": {
"word.standard": "form"
}
}
]
}
}
}
Let me know if this helps!
Because the type of this field is text, ES will do full-text analysis on it, and the ES search process finds the results most similar to the word you have given. To search for the exact word "form", change your search method to match_phrase. Furthermore, you could read the articles below to learn more about the different ES search methods:
https://www.cnblogs.com/yjf512/p/4897294.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase.html
It looks like there is some issue in your custom analyzer. I created my own custom autocomplete analyzer, which uses the edge_ngram and lowercase filters, and it works fine for me for your query: it returns the exact match on top. This is how Elasticsearch works; exact matches on tokens are boosted by default, so there is no need to explicitly create another field and boost it.
Index def
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 10
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
Index a few docs
{
"title" : "formless"
}
{
"title" : "form"
}
{
"title" : "formulation"
}
Search query on title field as provided in the question
{
"query": {
"match": {
"title": "form"
}
}
}
Search result with exact match having highest score
"hits": [
{
"_index": "so-60523240-score",
"_type": "_doc",
"_id": "1",
"_score": 0.16410133,
"_source": {
"title": "form"
}
},
{
"_index": "so-60523240-score",
"_type": "_doc",
"_id": "2",
"_score": 0.16410133,
"_source": {
"title": "formulation"
}
},
{
"_index": "so-60523240-score",
"_type": "_doc",
"_id": "3",
"_score": 0.16410133,
"_source": {
"title": "formaldehyde"
}
},
{
"_index": "so-60523240-score",
"_type": "_doc",
"_id": "4",
"_score": 0.16410133,
"_source": {
"title": "formless"
}
}
]
