I indexed my Elasticsearch index with ngrams to make fuzzy matching and prefix searches fast. I notice that if I search for documents containing "Bob" in the name field, only documents where name = Bob are returned. I would like the response to include documents with name = Bob, but also documents with name = Bobbi, Bobbette, and so on. The Bob results should have a relatively high score, and the results that don't match exactly should still appear in the result set, but with lower scores. How can I achieve this with ngrams?
I am using a very small simple index to test. The index contains two documents.
{
"_index": "contacts_4",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"full_name": "Bob Smith"
}
},
{
"_index": "contacts_4",
"_type": "_doc",
"_id": "2",
"_score": 1.0,
"_source": {
"full_name": "Bobby Smith"
}
}
Here is a working example (using n-gram tokenizer):
Mapping
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"token_chars": [
"letter",
"digit"
],
"min_gram": "3",
"type": "ngram",
"max_gram": "4"
}
}
}
},
"mappings": {
"properties": {
"full_name": {
"type": "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
Indexing documents
POST my_index/_doc/1
{
"full_name":"Bob Smith"
}
POST my_index/_doc/2
{
"full_name":"Bobby Smith"
}
POST my_index/_doc/3
{
"full_name":"Bobbette Smith"
}
Search Query
GET my_index/_search
{
"query": {
"match": {
"full_name": "Bob"
}
}
}
Results
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.1626403,
"_source" : {
"full_name" : "Bob Smith"
}
},
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.13703513,
"_source" : {
"full_name" : "Bobby Smith"
}
},
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "3",
"_score" : 0.11085624,
"_source" : {
"full_name" : "Bobbette Smith"
}
}
]
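To see why the exact match scores highest, you can inspect the tokens this analyzer emits (an optional check against the index above, not required for the setup):
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Bobby"
}
With min_gram 3 and max_gram 4, "Bobby" should break into grams such as bob, bobb, obb, obby, bby, while "Bob" yields only bob. Every document containing the gram bob matches the query, but longer names carry extra grams, so their scores come out lower.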
Hope this helps
Related
I would like to apply an analyzer that satisfies the search below. Let's take an example. Suppose I have entered the following text into documents.
I have stored similar kinds of sentences as specializations in OpenSearch.
Cardiologist Doctor.
Cardiac surgeon.
neuro surgeon.
cardiac specialist.
nursing care
Anatomy.
Anaesthesiology.
So, if I search for cardiac surgeon, the result should be ['cardiologist', 'cardiac surgeon', 'cardiac specialist'], and it should not return 'neuro surgeon' or 'nursing care'.
Also, if I search for anatomy, the result should be ['anatomy'], and it should not return Anaesthesiology.
I have tried an ngram_filter, but when I search for cardiologist it returns both cardiologist and nursing care instead of cardiologist only.
"ngram_filter": {
"type": "edge_ngram",
"min_gram": 3,
"max_gram": 15
},
My suggestion using synonyms:
PUT synonyms
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonyms_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"synonyms_filter"
]
}
},
"filter": {
"synonyms_filter": {
"type": "synonym",
"synonyms": [
"cardiac surgeon, cardiologist, cardiac surgeon, cardiac specialist"
]
}
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text",
"search_analyzer": "synonyms_analyzer"
}
}
}
}
POST _bulk
{ "index" : { "_index" : "synonyms", "_id" : "1"}}
{ "name" : "Cardiac surgeon" }
{ "index" : { "_index" : "synonyms", "_id" : "2"}}
{ "name" : "Cardiologist Doctor" }
{ "index" : { "_index" : "synonyms", "_id" : "3"}}
{ "name" : "neuro surgeon" }
{ "index" : { "_index" : "synonyms", "_id" : "4"}}
{ "name" : "cardiac specialist" }
{ "index" : { "_index" : "synonyms", "_id" : "5"}}
{ "name" : "nursing care" }
{ "index" : { "_index" : "synonyms", "_id" : "6"}}
{ "name" : "Anatomy" }
{ "index" : { "_index" : "synonyms", "_id" : "7"}}
{ "name" : "Anaesthesiology" }
GET synonyms/_search
{
"query": {
"match": {
"name": "cardiac surgeon"
}
}
}
Hits:
"hits": [
{
"_index": "synonyms",
"_id": "1",
"_score": 13.066887,
"_source": {
"name": "Cardiac surgeon"
}
},
{
"_index": "synonyms",
"_id": "4",
"_score": 7.9681025,
"_source": {
"name": "cardiac specialist"
}
},
{
"_index": "synonyms",
"_id": "2",
"_score": 1.567127,
"_source": {
"name": "Cardiologist Doctor"
}
}
]
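To confirm how the synonym filter rewrites the query at search time, you can run the search analyzer directly (a quick check against the index above; the exact positions of multi-word synonym tokens may vary):
GET synonyms/_analyze
{
  "analyzer": "synonyms_analyzer",
  "text": "cardiac surgeon"
}
The output should include the original terms plus the configured expansions (cardiologist, cardiac specialist), which explains why those documents are scored highly.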
I have hundreds of chemical records in my index climate_change.
I'm using an ngram search, and these are the settings I'm using for the index.
{
"settings": {
"index.max_ngram_diff": 30,
"index": {
"analysis": {
"analyzer": {
"analyzer": {
"tokenizer": "test_ngram",
"filter": [
"lowercase"
]
},
"search_analyzer": {
"tokenizer": "test_ngram",
"filter": [
"lowercase"
]
}
},
"tokenizer": {
"test_ngram": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 30,
"token_chars": [
"letter",
"digit"
]
}
}
}
}
}
}
My main problem is that if I try to do a query like this one
GET climate_change/_search?size=1000
{
"query": {
"match": {
"description": {
"query":"oxygen"
}
}
}
}
I see that a lot of results have the same score (7.381186), which is strange:
{
"_index" : "climate_change",
"_type" : "_doc",
"_id" : "XXX",
"_score" : 7.381186,
"_source" : {
"recordtype" : "chemicals",
"description" : "carbon/oxygen"
}
},
{
"_index" : "climate_change",
"_type" : "_doc",
"_id" : "YYY",
"_score" : 7.381186,
"_source" : {
"recordtype" : "chemicals",
"description" : "oxygen"
}
How is this possible?
In the example above, since I'm using ngrams and searching for oxygen in the description field, I would expect the second result to have a higher score than the first one.
I've also tried specifying the "standard" and "whitespace" tokenizer types in the settings, but it did not help.
Maybe it is the '/' character inside the description?
Thanks a lot!
You need to define the analyzer in the mapping for the description field as well; otherwise the field is analyzed with the default standard analyzer and your edge_ngram settings are never applied.
Adding a working example with index data, mapping, search query, and search result.
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "test_ngram",
"filter": [
"lowercase"
]
},
"search_analyzer": {
"tokenizer": "test_ngram",
"filter": [
"lowercase"
]
}
},
"tokenizer": {
"test_ngram": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 30,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"properties": {
"description": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
Index Data:
{
"recordtype": "chemicals",
"description": "carbon/oxygen"
}
{
"recordtype": "chemicals",
"description": "oxygen"
}
Search Query:
{
"query": {
"match": {
"description": {
"query":"oxygen"
}
}
}
}
Search Result:
"hits": [
{
"_index": "67180160",
"_type": "_doc",
"_id": "2",
"_score": 0.89246297,
"_source": {
"recordtype": "chemicals",
"description": "oxygen"
}
},
{
"_index": "67180160",
"_type": "_doc",
"_id": "1",
"_score": 0.6651374,
"_source": {
"recordtype": "chemicals",
"description": "carbon/oxygen"
}
}
]
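To verify that the analyzer is now actually applied to the field, you can analyze text through the field itself (an optional check; use your own index name in place of the example one):
GET climate_change/_analyze
{
  "field": "description",
  "text": "oxygen"
}
With the edge_ngram tokenizer (min_gram 1, max_gram 30) this should return o, ox, oxy, oxyg, oxyge, oxygen. Once the field is analyzed this way, the longer carbon/oxygen description scores lower than the one that is just oxygen, as shown above.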
I am using Elasticsearch 7.9.0 on Windows.
I have the following mapping:
"Name": {
"type": "text",
"fields": {
"my-tokenizer": {
"type": "text",
"analyzer": "my-tokenizer"
},
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
However, when I do this in my query my results are not sorted by name as I expect:
"sort": [
{
"name.keyword": {"order": "asc" }
}
],
My results contain a sort null value - is that significant? Does it tell us anything?
"_type" : "_doc",
"_id" : "ABC",
"_score" : null,
"_source" : {
"name" : "Liverpool Football Club"
},
"sort" : [
null
]
},
(p.s. I have this as a decorator in my code Name = "name").
You created a mapping for the Name field (as mentioned above), but you have indexed documents with the name field, so the documents are probably being indexed with dynamic mapping.
Adding a working example with index data, search query, and search result. (I have not created any explicit mapping.)
Index Data:
{
"name": "Liverpool Football Club"
}
{
"name": "quick brown f"
}
{
"name": "this is a test"
}
Search Query:
{
"sort": [
{
"name.keyword": "asc"
}
]
}
Search Result:
"hits": [
{
"_index": "64977683",
"_type": "_doc",
"_id": "1",
"_score": null,
"_source": {
"name": "Liverpool Football Club"
},
"sort": [
"Liverpool Football Club"
]
},
{
"_index": "64977683",
"_type": "_doc",
"_id": "3",
"_score": null,
"_source": {
"name": "quick brown f"
},
"sort": [
"quick brown f"
]
},
{
"_index": "64977683",
"_type": "_doc",
"_id": "2",
"_score": null,
"_source": {
"name": "this is a test"
},
"sort": [
"this is a test"
]
}
]
Can you give the full mapping and index requests?
Actually, I suspect that the documents are being indexed with dynamic mapping and not using the one you defined.
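One quick way to confirm this is to look at the mapping the index actually ended up with (a diagnostic sketch; substitute your real index name):
GET your_index/_mapping
If dynamic mapping kicked in, you will likely see a lowercase name field of type text with a name.keyword sub-field, rather than (or alongside) the Name field you defined; comparing that output with the mapping you intended should explain why sorting on name.keyword returns null for some documents.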
I have a field indexed with a custom analyzer with the configuration below:
"COMPNAYNAME" : {
"type" : "text",
"analyzer" : "textAnalyzer"
}
"textAnalyzer" : {
"filter" : [
"lowercase"
],
"char_filter" : [ ],
"type" : "custom",
"tokenizer" : "ngram_tokenizer"
}
"tokenizer" : {
"ngram_tokenizer" : {
"type" : "ngram",
"min_gram" : "2",
"max_gram" : "3"
}
}
When I search for the text "ikea", I get the results below.
Query:
GET company_info_test_1/_search
{
"query": {
"match": {
"COMPNAYNAME": {"query": "ikea"}
}
}
}
Following are the results:
1.mikea
2.likeable
3.maaikeart
4.likeables
5.ikea b.v. <------
6.likeachef
7.ikea breda <------
8.bernikeart
9.ikea duiven
10.mikea media
I expect the exact match result to be boosted above the rest of the results.
Could you please help me with the best way to index if I have to search with exact matches as well as with fuzziness?
Thanks in advance.
You can use the ngram tokenizer along with "search_analyzer": "standard". Refer to the search_analyzer documentation to know more about it.
As pointed out by @EvaldasBuinauskas, you can also use the edge_ngram tokenizer here if you want the tokens to be generated from the beginning only and not from the middle.
Adding a working example with index data, mapping, search query, and result
Index Data:
{ "title": "ikea b.v."}
{ "title" : "mikea" }
{ "title" : "maaikeart"}
Index Mapping
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 2,
"max_gram": 10,
"token_chars": [
"letter",
"digit"
]
}
}
},
"max_ngram_diff": 50
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "my_analyzer",
"search_analyzer": "standard"
}
}
}
}
Search Query:
{
"query": {
"match" : {
"title" : "ikea"
}
}
}
Search Result:
"hits": [
{
"_index": "normal",
"_type": "_doc",
"_id": "4",
"_score": 0.1499838, <-- note this
"_source": {
"title": "ikea b.v."
}
},
{
"_index": "normal",
"_type": "_doc",
"_id": "1",
"_score": 0.13562363, <-- note this
"_source": {
"title": "mikea"
}
},
{
"_index": "normal",
"_type": "_doc",
"_id": "3",
"_score": 0.083597526,
"_source": {
"title": "maaikeart"
}
}
]
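To see why the standard search_analyzer makes a difference, you can compare the tokens produced by the two analyzers (an optional check against the example index, named normal in the results above):
GET normal/_analyze
{
  "analyzer": "my_analyzer",
  "text": "ikea"
}
GET normal/_analyze
{
  "analyzer": "standard",
  "text": "ikea"
}
The custom analyzer should emit grams such as ik, ike, ikea, ke, kea, ea, while the standard analyzer keeps the single token ikea. At search time the query therefore only has to match the full gram ikea, and shorter titles in which that gram makes up more of the field, such as ikea b.v., score higher.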
This question is similar to another question of mine, which Val answered.
I have an index containing 3 documents.
{
"firstname": "Anne",
"lastname": "Borg",
}
{
"firstname": "Leanne",
"lastname": "Ray"
},
{
"firstname": "Anne",
"middlename": "M",
"lastname": "Stone"
}
When I search for "Ann", I would like elastic to return all 3 of these documents (because they all match the term "Ann" to a degree). BUT, I would like Leanne Ray to have a lower score (relevance ranking) because the search term "Ann" appears at a later position in this document than the term appears in the other two documents.
Here are my index settings...
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"token_chars": [
"letter",
"digit",
"custom"
],
"custom_token_chars": "'-",
"min_gram": "1",
"type": "ngram",
"max_gram": "2"
}
}
}
},
"mappings": {
"properties": {
"firstname": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
},
"copy_to": [
"full_name"
]
},
"lastname": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
},
"copy_to": [
"full_name"
]
},
"middlename": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"copy_to": [
"full_name"
]
},
"full_name": {
"type": "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
The following query brings back the expected documents, but attributes a higher score to Leanne Ray than to Anne Borg.
{
"query": {
"bool": {
"must": {
"query_string": {
"query": "Ann",
"fields": ["full_name"]
}
},
"should": {
"match": {
"full_name": "Ann"}
}
}
}
}
Here are the results...
"hits": [
{
"_index": "contacts_4",
"_type": "_doc",
"_id": "2",
"_score": 6.6333585,
"_source": {
"firstname": "Anne",
"middlename": "M",
"lastname": "Stone"
}
},
{
"_index": "contacts_4",
"_type": "_doc",
"_id": "1",
"_score": 6.142234,
"_source": {
"firstname": "Leanne",
"lastname": "Ray"
}
},
{
"_index": "contacts_4",
"_type": "_doc",
"_id": "3",
"_score": 6.079495,
"_source": {
"firstname": "Anne",
"lastname": "Borg"
}
}
Using an ngram token filter and an ngram tokenizer together seems to fix this problem...
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"filter": [
"ngram"
],
"tokenizer": "ngram"
}
}
}
},
"mappings": {
"properties": {
"firstname": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
},
"copy_to": [
"full_name"
]
},
"lastname": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
},
"copy_to": [
"full_name"
]
},
"middlename": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
},
"copy_to": [
"full_name"
]
},
"full_name": {
"type": "text",
"analyzer": "my_analyzer",
"search_analyzer": "my_analyzer"
}
}
}
}
The same query brings back the expected results with the desired relative scoring. Why does this work? Note that above, I am using an ngram tokenizer with a lowercase filter and the only difference here is that I am using an ngram filter instead of the lowercase filter.
Here are the results. Notice that Leanne Ray scored lower than both Anne Borg and Anne M Stone, as desired.
"hits": [
{
"_index": "contacts_4",
"_type": "_doc",
"_id": "3",
"_score": 4.953257,
"_source": {
"firstname": "Anne",
"lastname": "Borg"
}
},
{
"_index": "contacts_4",
"_type": "_doc",
"_id": "2",
"_score": 4.87168,
"_source": {
"firstname": "Anne",
"middlename": "M",
"lastname": "Stone"
}
},
{
"_index": "contacts_4",
"_type": "_doc",
"_id": "1",
"_score": 1.0364896,
"_source": {
"firstname": "Leanne",
"lastname": "Ray"
}
}
By the way, this query also brings back a lot of false positive results when the index contains other documents as well. It's not a big problem, because these false positives have very low scores relative to the scores of the desirable hits, but it's still not ideal. For example, if I add {firstname: Gideon, lastname: Grossma} to the index, the above query will bring back that document in the result set as well, albeit with a much lower score than the documents containing the string "Ann".
The answer is the same as in the linked thread. Since you're ngramming all the indexed data, it works the same way with Ann as with Anne. You'll get the exact same response (see below), though with different scores:
"hits" : [
{
"_index" : "test",
"_type" : "_doc",
"_id" : "5Jr-DHIBhYuDqANwSeiw",
"_score" : 4.8442974,
"_source" : {
"firstname" : "Anne",
"lastname" : "Borg"
}
},
{
"_index" : "test",
"_type" : "_doc",
"_id" : "5pr-DHIBhYuDqANwSeiw",
"_score" : 4.828779,
"_source" : {
"firstname" : "Anne",
"middlename" : "M",
"lastname" : "Stone"
}
},
{
"_index" : "test",
"_type" : "_doc",
"_id" : "5Zr-DHIBhYuDqANwSeiw",
"_score" : 0.12874341,
"_source" : {
"firstname" : "Leanne",
"lastname" : "Ray"
}
}
]
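You can see this overlap directly by analyzing both spellings against the full_name field (a quick sketch using the index from the question; with the small gram sizes in either of your mappings the output is mostly one and two character tokens):
GET contacts_4/_analyze
{
  "field": "full_name",
  "text": "Anne"
}
GET contacts_4/_analyze
{
  "field": "full_name",
  "text": "Ann"
}
Both spellings decompose into nearly the same set of grams (a, an, n, nn, ...), so from the scorer's point of view Ann and Anne are practically interchangeable.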
UPDATE
Here is a modified query that you can use to check for parts (i.e. ann vs anne). Again, the casing makes no difference here, since the analyzer lowercases everything before indexing.
{
"query": {
"bool": {
"must": {
"query_string": {
"query": "ann",
"fields": [
"full_name"
]
}
},
"should": [
{
"match_phrase_prefix": {
"firstname": {
"query": "ann",
"boost": "10"
}
}
},
{
"match_phrase_prefix": {
"lastname": {
"query": "ann",
"boost": "10"
}
}
}
]
}
}
}