Elastic Enterprise Search: Search API doesn't match “33” for “033” in document title - elasticsearch

I couldn't figure out how to match a title such as "Test title No.033" with the query "33".
When I search for "033", the document is returned. But for just "33" nothing comes back. :frowning:
The guide is not very helpful for me (Search API | Elastic App Search Documentation [7.12] | Elastic).
Could you please help me with this?
What other information should I provide?

If no analyzer is specified, Elasticsearch uses the standard analyzer.
The tokens generated for "Test title No.033" will be "test", "title", "no", and "033", so a query for "33" has no token to match.
You can use the ngram tokenizer to do a partial match on the "title" field.
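You can confirm this with the _analyze API; a quick sketch (the standard analyzer is built in, so no index is needed):

GET /_analyze
{
  "analyzer": "standard",
  "text": "Test title No.033"
}

The response contains the tokens test, title, no, and 033; there is no token "33" for the query to match.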
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 5,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    },
    "max_ngram_diff": 10
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
Analyze API:
GET /_analyze
{
  "analyzer": "my_analyzer",
  "text": "Test title No.033"
}
The tokens generated will contain both "033" and "33" (from "033" alone, the 2-5 character grams are "03", "33", and "033"), so a match query for "33" now finds the document.
Index Data:
{
  "title": "Test title No.033"
}
Search Query:
{
  "query": {
    "match": {
      "title": "33"
    }
  }
}
Search Result:
"hits": [
{
"_index": "67091386",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"title": "Test title No.033"
}
}
]

Elastic App Search doesn't let you configure the tokenizer the way Elasticsearch does. However, you can tune the results using the relevance tuning API.
You can give more weight to your title field, and the document will start showing up when you search for either "33" or "033".
Note: relevance tuning has multiple components, such as Weights and Functions, that you can apply. I'd recommend trying it out in the Elastic App Search console.
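For illustration, a sketch of setting a weight through the App Search Search Settings API (my-engine is a placeholder engine name; check the docs for your version before relying on the exact endpoint and body):

PUT /api/as/v1/engines/my-engine/search_settings
{
  "search_fields": {
    "title": {
      "weight": 10
    }
  }
}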

Related

Elastic returns unexpected result from Search using edge_ngram

I am working out how to store my data in Elasticsearch. First I tried the fuzzy function, and while that worked okay, I did not receive the expected results. Afterwards I tried the ngram and then the edge_ngram tokenizer. The edge_ngram tokenizer looked like it works like an autocomplete, exactly what I needed. But it still gives unexpected results. I configured min 1 and max 5 to get all results starting with the first letter I search for. While this works, I still get those results as I continue typing.
Example: I have a name field filled with documents named The New York Times and The Guardian. Now when I search for T, both occur as expected. But the same happens when I search for TT, TTT, and so on.
In that case it does not matter whether I execute the search in Kibana or from my application (which uses MultiMatch on all fields). Kibana even shows me that it matched the single letter T.
So what did I miss, and how can I achieve results like an autocomplete without getting too many results?
When defining your index mapping, you need to specify search_analyzer as standard. If no search_analyzer is defined explicitly, Elasticsearch uses the same analyzer for searching as the one specified in analyzer, so your query gets edge-ngrammed too.
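To see why "TT" still matched before, you can run the query text through the index-time analyzer; a sketch, using the autocomplete analyzer defined in the mapping below (<your_index> is a placeholder):

GET /<your_index>/_analyze
{
  "analyzer": "autocomplete",
  "text": "TT"
}

This returns the tokens t and tt, and the one-letter gram t matches both documents. With search_analyzer set to standard, the query stays a single token tt, which matches no indexed gram.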
Adding a working example with index data, mapping, search query and search result
Index Mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 5,
          "token_chars": [
            "letter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}
Index Data:
{
  "name": "The Guardian"
}
{
  "name": "The New York Times"
}
Search Query:
{
  "query": {
    "match": {
      "name": "T"
    }
  }
}
Search Result:
"hits": [
{
"_index": "69027911",
"_type": "_doc",
"_id": "1",
"_score": 0.23092544,
"_source": {
"name": "The New York Times"
}
},
{
"_index": "69027911",
"_type": "_doc",
"_id": "2",
"_score": 0.20824991,
"_source": {
"name": "The Guardian"
}
}
]

Search in Elasticsearch for a string containing the "not" keyword

I am using Elasticsearch on AWS (version 7.9) and I am trying to distinguish between two strings.
My main goal is to split the search results into "Found" and "Not found".
The generic question is how to search for the "not" keyword.
Two example messages are below:
"CachingServiceOne:Found in cache - Retrieve."
"CachingServiceThree:Not found in cache - Create new."
You can use the ngram tokenizer to search for "not" in the "title" field.
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 5,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    },
    "max_ngram_diff": 10
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
Index Data:
{
  "title": "CachingServiceThree:Not found in cache - Create new."
}
{
  "title": "CachingServiceOne:Found in cache - Retrieve."
}
Search Query:
{
  "query": {
    "match": {
      "title": "Not"
    }
  }
}
Search Result:
"hits": [
{
"_index": "67093372",
"_type": "_doc",
"_id": "2",
"_score": 0.6720003,
"_source": {
"title": "CachingServiceThree:Not found in cache - Create new."
}
}
]
Well, the problem turned out to be the way the default analyzer works, not the fact that I could not search for the word "not". That is why I accepted the answer. But I would like to add another take, for the sake of simplicity.
The default analyzer does not split words on ":".
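You can confirm this with the _analyze API; a quick sketch using the built-in standard analyzer:

GET /_analyze
{
  "analyzer": "standard",
  "text": "CachingServiceThree:Not found in cache - Create new."
}

The first token comes back as cachingservicethree:not, in one piece.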
That means we have to search for title:CachingServiceThree\:Not, where title is the field name and ":" must be escaped as \:.
What did the trick was title:*\:Not and title:*\:Found, using the KQL syntax.
Using the wildcard did the trick to fetch everything; I am wondering whether using an array of all the actual values would be quicker.
That translated through the Inspect panel into:
{
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "should": [
              {
                "query_string": {
                  "fields": [
                    "title"
                  ],
                  "query": "*\\:Not"
                }
              }
            ],
            "minimum_should_match": 1
          }
        }
      ]
    }
  }
}

Space handling in Elasticsearch

If a document (say, a merchant name) that I am searching for has no space in it, and the user searches by adding a space, the result won't show in Elasticsearch. How can that be improved to get results?
For example:
The merchant name is "DeliBites".
The user searches by typing "Deli Bites"; then the above merchant does not appear in results. The merchant only appears in suggestions when I have typed just "Deli", or "Deli" followed by a space, or "Deli."
Adding another option: you can also use an n-gram token filter, which works in most cases and is simple to set up and use.
Working example on your data
Index definition
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 10
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    },
    "index.max_ngram_diff": 10
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}
Index sample doc
{
  "title": "DeliBites"
}
Search query
{
  "query": {
    "match": {
      "title": {
        "query": "Deli Bites"
      }
    }
  }
}
And search results
"hits": [
{
"_index": "65489013",
"_type": "_doc",
"_id": "1",
"_score": 0.95894027,
"_source": {
"title": "DeliBites"
}
}
]
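To see why this works, you can inspect the grams the index-time analyzer produces (a sketch, using the example index name from the search result above): delibites is split into every 1-10 character substring, which includes both deli and bites, and the standard search analyzer turns "Deli Bites" into exactly those two tokens.

GET /65489013/_analyze
{
  "analyzer": "autocomplete",
  "text": "DeliBites"
}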
I suggest using the synonym token filter.
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-graph-tokenfilter.html
You should have a dictionary of all the words you want to search, something like this:
DeliBites => Deli Bites
ipod => i pod
Before implementing synonyms, be sure you understand all aspects of them:
https://www.elastic.co/blog/boosting-the-power-of-elasticsearch-with-synonyms
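As a minimal sketch of what that could look like in the index settings (using an inline synonym list; for a large dictionary you would point synonyms_path at a file instead):

{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym_graph",
          "synonyms": [
            "delibites => deli bites",
            "ipod => i pod"
          ]
        }
      },
      "analyzer": {
        "synonym_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonyms"
          ]
        }
      }
    }
  }
}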

Elasticsearch: why exact match has lower score than partial match

My question:
I search for the word form, but the exact match form is not first in the results. Is there any way to solve this problem?
My search query:
{
  "query": {
    "match": {
      "word": "form"
    }
  }
}
Result:
word score
--------------------------
formulation 10.864353
formaldehyde 10.864353
formless 10.864353
formal 10.84412
formerly 10.84412
forma 10.84412
formation 10.574185
formula 10.574185
formulate 10.574185
format 10.574185
formally 10.574185
form 10.254687
former 10.254687
formidable 10.254687
formality 10.254687
formative 10.254687
ill-formed 10.054999
in form 10.035862
pro forma 9.492243
Checking with POST my_index/_analyze: the search text form produces only one token, form. At index time, form produces the tokens ["f", "fo", "for", "form"], while formulation produces ["f", "fo", ..., "formulatio", "formulation"].
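The two calls, reconstructed as a sketch (the analyzer names are the ones from the config below):

POST my_index/_analyze
{
  "analyzer": "abc_vocab_search_analyzer",
  "text": "form"
}

POST my_index/_analyze
{
  "analyzer": "abc_vocab_analyzer",
  "text": "formulation"
}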
My config:
Filter:
"edgengram_filter": {
  "type": "edge_ngram",
  "min_gram": 1,
  "max_gram": 20
}
Analyzer:
"analyzer": {
  "abc_vocab_analyzer": {
    "type": "custom",
    "tokenizer": "standard",
    "filter": [
      "keyword_repeat",
      "lowercase",
      "asciifolding",
      "edgengram_filter",
      "unique"
    ]
  },
  "abc_vocab_search_analyzer": {
    "type": "custom",
    "tokenizer": "standard",
    "filter": [
      "keyword_repeat",
      "lowercase",
      "asciifolding",
      "unique"
    ]
  }
}
Mapping:
"word": {
  "type": "text",
  "analyzer": "abc_vocab_analyzer",
  "search_analyzer": "abc_vocab_search_analyzer"
}
You get this result because you've implemented an edge-ngram filter, and form is a prefix gram of the words similar to it; in the inverted index, the gram form also points to the documents containing formulation, formal, etc.
Your relevancy is therefore computed that way. You can refer to this link, and I'd specifically suggest you go through the sections Default Similarity and BM25. Although the current default similarity is BM25, that link will help you understand how scoring works.
You would need to create a sibling field which you can apply in a should clause. You could create a keyword sub-field and use a Term Query, but you'd need to be careful about case-sensitivity.
Instead, as mentioned by @Val, you can create a sibling text field with the standard analyzer.
Mapping:
{
  "word": {
    "type": "text",
    "analyzer": "abc_vocab_analyzer",
    "search_analyzer": "abc_vocab_search_analyzer",
    "fields": {
      "standard": {
        "type": "text"
      }
    }
  }
}
Query (note the should clause on word.standard):
POST <your_index_name>/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "word": "form"
          }
        }
      ],
      "should": [
        {
          "match": {
            "word.standard": "form"
          }
        }
      ]
    }
  }
}
Let me know if this helps!
Because your field type is text, ES performs full-text analysis on this field, and the ES search process essentially finds the results most similar to the word you have given. To search for the word "form" exactly, change your search method to match_phrase. Furthermore, you can read the articles below to learn more about the different ES search methods:
https://www.cnblogs.com/yjf512/p/4897294.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase.html
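For reference, a sketch of the suggested match_phrase query (note that a single search term analyzes to one token, so its behavior here is close to match; the difference matters more for multi-word phrases):

{
  "query": {
    "match_phrase": {
      "word": "form"
    }
  }
}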
It looks like there is some issue in your custom analyzer. I created my own autocomplete analyzer, which uses the edge_ngram filter and a lowercase filter, and it works fine for me on your query: it returns the exact match on top. This is how Elasticsearch works; exact matches on tokens are boosted by default, so there is no need to explicitly create another field and boost it.
Index definition:
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}
Index a few docs:
{
  "title": "formless"
}
{
  "title": "form"
}
{
  "title": "formulation"
}
Search query on title field as provided in the question
{
  "query": {
    "match": {
      "title": "form"
    }
  }
}
Search result with exact match having highest score
"hits": [
{
"_index": "so-60523240-score",
"_type": "_doc",
"_id": "1",
"_score": 0.16410133,
"_source": {
"title": "form"
}
},
{
"_index": "so-60523240-score",
"_type": "_doc",
"_id": "2",
"_score": 0.16410133,
"_source": {
"title": "formulation"
}
},
{
"_index": "so-60523240-score",
"_type": "_doc",
"_id": "3",
"_score": 0.16410133,
"_source": {
"title": "formaldehyde"
}
},
{
"_index": "so-60523240-score",
"_type": "_doc",
"_id": "4",
"_score": 0.16410133,
"_source": {
"title": "formless"
}
}
]

Case-insensitive Elasticsearch with uppercase or lowercase

I am working with Elasticsearch and I am facing a problem. If anybody gives me a hint, I will be really thankful.
I want to analyze a field "name" or "description" which consists of different entries. E.g., someone wants to search for Sara. If they enter SARA, SAra, or sara, they should be able to get Sara.
Elasticsearch uses an analyzer which makes everything lowercase.
I want to implement this case-insensitively, so regardless of whether the user enters an uppercase or lowercase name, they should get results.
I am using an ngram filter to search names, plus lowercase, which makes it case insensitive. But I want to make sure that a person gets results even if they enter uppercase or lowercase.
Is there any way to do this in Elasticsearch?
{"settings": {
"analysis": {
"filter": {
"ngram_filter": {
"type": "ngram",
"min_gram": 1,
"max_gram": 80
}
},
"analyzer": {
"index_ngram": {
"type": "custom",
"tokenizer": "keyword",
"filter": [ "ngram_filter", "lowercase" ]
},
I am attaching an example.js file, which includes a JSON example, and a search.txt file, to explain my problem. I hope my problem is clearer now.
This is the link to OneDrive where I keep both files:
https://1drv.ms/f/s!AsW4Pb3Y55Qjb34OtQI7qQotLzc
Is there any specific reason you are using ngram? Elasticsearch uses the same analyzer on the query as on the text you index, unless a search_analyzer is explicitly specified, as mentioned by @Adam in his answer. In your case it might be enough to use a standard tokenizer with a lowercase filter.
I created an index with the following settings and mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "typehere": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "custom_analyzer"
        },
        "description": {
          "type": "string",
          "analyzer": "custom_analyzer"
        }
      }
    }
  }
}
Indexed two documents
Doc 1:
PUT /test_index/test_mapping/1
{
  "name": "Sara Connor",
  "Description": "My real name is Sarah Connor."
}
Doc 2:
PUT /test_index/test_mapping/2
{
  "name": "John Connor",
  "Description": "I might save humanity someday."
}
Do a simple search
POST /test_index/_search
{
  "query": {
    "match": {
      "name": "SARA"
    }
  }
}
And I get back only the first document. I tried with "sara" and "Sara" as well; same results.
{
  "took": 12,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.19178301,
    "hits": [
      {
        "_index": "test_index",
        "_type": "test_mapping",
        "_id": "1",
        "_score": 0.19178301,
        "_source": {
          "name": "Sara Connor",
          "Description": "My real name is Sarah Connor."
        }
      }
    ]
  }
}
The analysis process is executed twice for full-text (analyzed) fields: first when data is stored, and a second time when you search. It's worth noting that the input JSON is returned unchanged as the output of a search query; the analysis process is only used to create tokens for an inverted index. The key to your solution is the following two steps:
1. Create two analyzers, one with the ngram filter and a second without it, because you don't need to analyze the input search query with ngram: you have an exact value that you want to search for.
2. Define the mappings correctly for your fields. There are two settings in the mapping that let you specify analyzers: one is used at index time (analyzer) and the second is used at search time (search_analyzer). If you specify only the analyzer field, then that analyzer is used at both index and search time.
You can read more about it here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html
And your code should look like this:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "ngram_filter": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 5
        }
      },
      "analyzer": {
        "index_store_ngram": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "ngram_filter",
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "index_store_ngram",
          "search_analyzer": "standard"
        }
      }
    }
  }
}
POST /my_index/my_type/1
{
  "name": "Sara_11_01"
}
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "name": "sara"
    }
  }
}
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "name": "SARA"
    }
  }
}
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "name": "SaRa"
    }
  }
}
Edit 1: updated code for a new example provided in the question
This answer is in the context of Elasticsearch 7.14. So, let me re-state the ask of this question another way:
Irrespective of the actual case provided in the match query, you would like to be able to get the documents that have been analyzed with:
"tokenizer": "keyword",
"filter": [ "ngram_filter", "lowercase" ]
Now, coming to the answer part:
It will not be possible to get the match query to return the docs that have been analyzed with the lowercase filter when the match query contains uppercase letters. The analysis that you have applied in the settings is applied both while indexing and while searching. Although it is also possible to apply different analyzers for indexing and searching, I do not see that helping your case. You would have to convert the match query value to lowercase before making the query. So, if your filter is lowercase, you cannot match by, say, Sara or SARA or sAra. The match param should be all lowercase, just as it is in your analyzer.
