I would like to apply an analyzer that satisfies the search behaviour described below. Let's take an example: I have stored the following similar kinds of sentences as specializations in OpenSearch.
Cardiologist Doctor.
Cardiac surgeon.
neuro surgeon.
cardiac specialist.
nursing care
Anatomy.
Anaesthesiology.
So, if I search for cardiac surgeon, the result should be ['cardiologist', 'cardiac surgeon', 'cardiac specialist'], and it should not return 'neuro surgeon' or 'nursing care'.
Also, if I search for anatomy, the result should be ['anatomy'], and it should not return 'Anaesthesiology'.
I have tried an ngram_filter, but when I search for cardiologist it returns both cardiologist and nursing care instead of cardiologist only:
"ngram_filter": {
"type": "edge_ngram",
"min_gram": 3,
"max_gram": 15
},
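Most likely this happens because, with min_gram set to 3, both cardiologist and the care in nursing care emit the 3-gram car, so an edge_ngram analyzer applied at search time makes them match each other. You can confirm this with the _analyze API (a quick check that defines the same filter inline):
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "edge_ngram",
      "min_gram": 3,
      "max_gram": 15
    }
  ],
  "text": ["cardiologist", "nursing care"]
}
Both inputs produce the token car, which explains the unwanted match.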
Instead, my suggestion is to use synonyms:
PUT synonyms
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonyms_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"synonyms_filter"
]
}
},
"filter": {
"synonyms_filter": {
"type": "synonym",
"synonyms": [
"cardiac surgeon, cardiologist, cardiac surgeon, cardiac specialist"
]
}
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text",
"search_analyzer": "synonyms_analyzer"
}
}
}
}
POST _bulk
{ "index" : { "_index" : "synonyms", "_id" : "1"}}
{ "name" : "Cardiac surgeon" }
{ "index" : { "_index" : "synonyms", "_id" : "2"}}
{ "name" : "Cardiologist Doctor" }
{ "index" : { "_index" : "synonyms", "_id" : "3"}}
{ "name" : "neuro surgeon" }
{ "index" : { "_index" : "synonyms", "_id" : "4"}}
{ "name" : "cardiac specialist" }
{ "index" : { "_index" : "synonyms", "_id" : "5"}}
{ "name" : "nursing care" }
{ "index" : { "_index" : "synonyms", "_id" : "6"}}
{ "name" : "Anatomy" }
{ "index" : { "_index" : "synonyms", "_id" : "7"}}
{ "name" : "Anaesthesiology" }
GET synonyms/_search
{
"query": {
"match": {
"name": "cardiac surgeon"
}
}
}
Hits:
"hits": [
{
"_index": "synonyms",
"_id": "1",
"_score": 13.066887,
"_source": {
"name": "Cardiac surgeon"
}
},
{
"_index": "synonyms",
"_id": "4",
"_score": 7.9681025,
"_source": {
"name": "cardiac specialist"
}
},
{
"_index": "synonyms",
"_id": "2",
"_score": 1.567127,
"_source": {
"name": "Cardiologist Doctor"
}
}
]
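To verify how the query is expanded at search time, you can run the analyzer directly against the index created above (an optional check):
GET synonyms/_analyze
{
  "analyzer": "synonyms_analyzer",
  "text": "cardiac surgeon"
}
The output shows the synonym terms (cardiologist, specialist, etc.) injected alongside cardiac and surgeon, which is why only the cardiology-related documents match. Since the tokens are whole words rather than ngrams, a search for anatomy will only match the Anatomy document and not Anaesthesiology.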
I used Elasticsearch a few years ago (version 6.4.0), and there was no provision for highlighting a "copy_to" field. I would like to know whether this is supported now.
Yes, highlighting can be enabled on a copy_to field in the latest version.
Please check the example below, which I have tried.
Index Mapping:
PUT my-index-000001
{
"mappings": {
"properties": {
"first_name": {
"type": "text",
"copy_to": "full_name"
},
"last_name": {
"type": "text",
"copy_to": "full_name"
},
"full_name": {
"type": "text"
}
}
}
}
Indexing a document:
PUT my-index-000001/_doc/1
{
"first_name": "John",
"last_name": "Smith"
}
Query:
GET my-index-000001/_search
{
"query": {
"match": {
"full_name": {
"query": "John Smith",
"operator": "and"
}
}
},
"highlight": {
"fields": {
"full_name": {}
}
}
}
Result:
"hits" : [
{
"_index" : "my-index-000001",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.5753642,
"_source" : {
"first_name" : "John",
"last_name" : "Smith"
},
"highlight" : {
"full_name" : [
"<em>Smith</em>",
"<em>John</em>"
]
}
}
]
Update 1: Search using the copy_to field and highlight the match on a particular field
In the example below, the search happens on the full_name field (the copy_to target), while the highlight is applied to the first_name field.
Query:
GET my-index-000001/_search
{
"query": {
"match": {
"full_name": {
"query": "John Smith",
"operator": "and"
}
}
},
"highlight": {
"require_field_match": "false",
"fields": {
"first_name": {}
}
}
}
Result:
{
"_index" : "my-index-000001",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.5753642,
"_source" : {
"first_name" : "John",
"last_name" : "Smith"
},
"highlight" : {
"first_name" : [
"<em>John</em>"
]
}
}
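Along the same lines, you can search on the copied field and highlight both original fields at once (a small variant of the query above):
GET my-index-000001/_search
{
  "query": {
    "match": {
      "full_name": {
        "query": "John Smith",
        "operator": "and"
      }
    }
  },
  "highlight": {
    "require_field_match": "false",
    "fields": {
      "first_name": {},
      "last_name": {}
    }
  }
}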
I indexed my Elasticsearch index with ngrams to make fast fuzzy matching and prefix searches possible. I notice that if I search for documents containing "Bob" in the name field, only results with name = Bob are returned. I would like the response to include documents with name = Bob, but also documents with name = Bobbi, Bobbette, etcetera. The Bob results should have a relatively high score; the other results that don't match exactly should still appear in the result set, but with lower scores. How can I achieve this with ngrams?
I am using a very small simple index to test. The index contains two documents.
{
"_index": "contacts_4",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"full_name": "Bob Smith"
}
},
{
"_index": "contacts_4",
"_type": "_doc",
"_id": "2",
"_score": 1.0,
"_source": {
"full_name": "Bobby Smith"
}
}
Here is a working example using the ngram tokenizer:
Mapping
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"token_chars": [
"letter",
"digit"
],
"min_gram": "3",
"type": "ngram",
"max_gram": "4"
}
}
}
},
"mappings": {
"properties": {
"full_name": {
"type": "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
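To see what actually ends up in the index, you can inspect the analyzer output (an optional check):
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Bobby Smith"
}
The query Bob is analyzed into the single gram bob, which occurs in all three documents indexed below; the exact match still ranks highest mainly because the shorter Bob Smith field produces fewer grams overall (length normalization).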
Indexing documents
POST my_index/_doc/1
{
"full_name":"Bob Smith"
}
POST my_index/_doc/2
{
"full_name":"Bobby Smith"
}
POST my_index/_doc/3
{
"full_name":"Bobbette Smith"
}
Search Query
GET my_index/_search
{
"query": {
"match": {
"full_name": "Bob"
}
}
}
Results
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.1626403,
"_source" : {
"full_name" : "Bob Smith"
}
},
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.13703513,
"_source" : {
"full_name" : "Bobby Smith"
}
},
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "3",
"_score" : 0.11085624,
"_source" : {
"full_name" : "Bobbette Smith"
}
}
]
Hope this helps
I have an index like this:
PUT job_offers
{
"mappings": {
"properties": {
"location": {
"properties": {
"slug": {
"type": "keyword"
},
"name": {
"type": "text"
}
},
"type": "nested"
},
"experience": {
"properties": {
"slug": {
"type": "keyword"
},
"name": {
"type": "text"
}
},
"type": "nested"
}
}
}
}
I insert this object:
POST job_offers/_doc
{
"title": "Junior Ruby on Rails Developer",
"location": [
{
"slug": "new-york",
"name": "New York"
},
{
"slug": "atlanta",
"name": "Atlanta"
},
{
"slug": "remote",
"name": "Remote"
}
],
"experience": [
{
"slug": "junior",
"name": "Junior"
}
]
}
This query returns 0 documents.
GET job_offers/_search
{
"query": {
"terms": {
"location.slug": [
"remote",
"new-york"
]
}
}
}
Can you explain why? I thought it would return documents where location.slug is remote or new-york.
Nested fields are indexed as hidden sub-documents, so a plain terms query on location.slug does not see them. A nested query has a different syntax:
GET job_offers/_search
{
"query": {
"nested": {
"path": "location",
"query": {
"terms": {
"location.slug": ["remote","new-york"]
}
}
}
}
}
Result:
"hits" : [
{
"_index" : "job_offers",
"_type" : "_doc",
"_id" : "wWjoXnEBs0rCGpYsvUf4",
"_score" : 1.0,
"_source" : {
"title" : "Junior Ruby on Rails Developer",
"location" : [
{
"slug" : "new-york",
"name" : "New York"
},
{
"slug" : "atlanta",
"name" : "Atlanta"
},
{
"slug" : "remote",
"name" : "Remote"
}
],
"experience" : [
{
"slug" : "junior",
"name" : "Junior"
}
]
}
}
]
It will return the entire document where location.slug matches "remote" or "new-york". If you want to get the matched nested documents, you need to use inner_hits:
GET job_offers/_search
{
"query": {
"nested": {
"path": "location",
"query": {
"terms": {
"location.slug": ["remote","new-york"]
}
},
"inner_hits": {} --> note
}
}
}
Result:
"hits" : [
{
"_index" : "job_offers",
"_type" : "_doc",
"_id" : "wWjoXnEBs0rCGpYsvUf4",
"_score" : 1.0,
"_source" : {
"title" : "Junior Ruby on Rails Developer",
"location" : [
{
"slug" : "new-york",
"name" : "New York"
},
{
"slug" : "atlanta",
"name" : "Atlanta"
},
{
"slug" : "remote",
"name" : "Remote"
}
],
"experience" : [
{
"slug" : "junior",
"name" : "Junior"
}
]
},
"inner_hits" : { --> will give matched nested object
"location" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "job_offers",
"_type" : "_doc",
"_id" : "wWjoXnEBs0rCGpYsvUf4",
"_nested" : {
"field" : "location",
"offset" : 0
},
"_score" : 1.0,
"_source" : {
"slug" : "new-york",
"name" : "New York"
}
},
{
"_index" : "job_offers",
"_type" : "_doc",
"_id" : "wWjoXnEBs0rCGpYsvUf4",
"_nested" : {
"field" : "location",
"offset" : 2
},
"_score" : 1.0,
"_source" : {
"slug" : "remote",
"name" : "Remote"
}
}
]
}
}
}
}
]
Also, I see that you are using two fields for the same data with different types. If the data is the same in both fields (name and slug) and only the data type differs, you can use multi-fields for that.
It is often useful to index the same field in different ways for
different purposes. This is the purpose of multi-fields. For instance,
a string field could be mapped as a text field for full-text search,
and as a keyword field for sorting or aggregations:
In that case, your mapping would look like the one below:
PUT job_offers
{
"mappings": {
"properties": {
"location": {
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
},
"type": "nested"
},
"experience": {
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
},
"type": "nested"
}
}
}
}
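With this mapping, the earlier query would target the keyword sub-field of name instead of slug (a sketch, assuming the documents are re-indexed against the new mapping; note that the keyword sub-field is not lowercased, so the values must match exactly):
GET job_offers/_search
{
  "query": {
    "nested": {
      "path": "location",
      "query": {
        "terms": {
          "location.name.keyword": [
            "Remote",
            "New York"
          ]
        }
      },
      "inner_hits": {}
    }
  }
}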
I am using Elasticsearch 6.8 and running the query below:
curl localhost:9200/twitter/_search?pretty=true -H 'Content-Type: application/json' -d '
{ "query": {"match_phrase": { "name": ".C" }}}'
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.2876821,
"hits" : [
{
"_index" : "twitter",
"_type" : "1",
"_id" : "2",
"_score" : 0.2876821,
"_source" : {
"name" : "my name C 100"
}
},
{
"_index" : "twitter",
"_type" : "1",
"_id" : "1",
"_score" : 0.2876821,
"_source" : {
"name" : "my name .C 100"
}
}
]
}
}
You can see that two documents are returned, but I don't expect the first one, which doesn't contain .C, to be returned. I have tried to escape the dot with {"match_phrase": { "name": "\\.C" }}, but it doesn't work.
I don't want to change the type of name to keyword because I still need a tokenizer.
I have also put . in protected_words in the index settings, as shown below:
#curl localhost:9200/twitter/_settings?
{
"twitter" : {
"settings" : {
"index" : {
"number_of_shards" : "5",
"provided_name" : "twitter",
"creation_date" : "1579489541087",
"analysis" : {
"filter" : {
"word_delim_filter" : {
"type" : "word_delimiter",
"protected_words" : [
"."
]
}
},
"analyzer" : {
"content" : {
"type" : "custom",
"tokenizer" : "whitespace"
},
"custom_synonyms_delim" : {
"filter" : [
"word_delim_filter"
],
"tokenizer" : "whitespace"
}
}
},
"number_of_replicas" : "1",
"uuid" : "nYr7NPdVRCqIcTzzM_iBeQ",
"version" : {
"created" : "6080299"
}
}
}
}
}
How can I escape the dot in the query?
Here is a working example of how to handle the dot in your scenario. Note that protected_words only protects tokens that exactly match an entry (here the standalone "."), so it does not keep ".C" together; reclassifying the dot character via type_table works instead, and the custom analyzer also needs to be applied to the name field in the mapping:
Mapping
PUT my_index
{
"settings": {
"analysis": {
"filter": {
"word_delim_filter": {
"type": "word_delimiter",
"type_table": [
". => ALPHANUM"
]
}
},
"analyzer": {
"content": {
"type": "custom",
"tokenizer": "whitespace"
},
"custom_synonyms_delim": {
"filter": [
"word_delim_filter"
],
"type": "custom",
"tokenizer": "keyword"
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"name": {
"type": "text",
"analyzer": "custom_synonyms_delim",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
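You can confirm that the dot is now preserved by running the analyzer (an optional check):
GET my_index/_analyze
{
  "analyzer": "custom_synonyms_delim",
  "text": "my name .C 100"
}
With ". => ALPHANUM" in the type_table, the word_delimiter filter no longer splits on the dot, so .C is kept as a single token and the match_phrase query below only matches the document that actually contains it.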
Indexing documents
POST my_index/_doc/1
{
"name" : "my name C 100"
}
POST my_index/_doc/2
{
"name" : "my name .C 100"
}
Search Query
GET my_index/_search
{
"query": {
"match_phrase": {
"name": ".C"
}
}
}
Results
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.6931472,
"_source" : {
"name" : "my name .C 100"
}
}
]
Hope this helps
I want to pass a list of emails in an Elasticsearch query, so I tried the query below to achieve that, but didn't get any results.
{
"query": {
"terms": {
"email": [ "andrew#gmail.com", "michel#gmail.com" ]
}
}
}
When I used id instead of emails, it worked!
{
"query": {
"terms": {
"id": [ 43, 67 ]
}
}
}
Could you please explain what's wrong with my email query and how to make it work?
A terms query is not analyzed, so it looks for the exact value you provide; but your email field appears to be mapped as text, and the standard analyzer splits each address into several tokens, so the whole address is never indexed as a single term. If you want email addresses recognized as single tokens, you should use the uax_url_email tokenizer.
UAX URL Email Tokenizer
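You can see the default splitting with _analyze (a quick check):
GET _analyze
{
  "analyzer": "standard",
  "text": "andrew@gmail.com"
}
This produces the tokens andrew and gmail.com rather than the full address.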
A working example:
Mappings
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_email_analyzer": {
"type": "custom",
"tokenizer": "my_tokenizer",
"filter": ["lowercase", "stop"]
}
},
"tokenizer": {
"my_tokenizer":{
"type": "uax_url_email"
}
}
}
},
"mappings": {
"properties": {
"email": {
"type": "text",
"analyzer": "my_email_analyzer",
"search_analyzer": "my_email_analyzer",
"fields": {
"keyword":{
"type":"keyword"
}
}
}
}
}
}
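After creating the index, you can confirm that the custom analyzer keeps the whole address together (an optional check):
GET my_index/_analyze
{
  "analyzer": "my_email_analyzer",
  "text": "andrew@gmail.com"
}
This emits the single token andrew@gmail.com, so full addresses can be matched.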
POST a few documents
POST my_index/_doc/1
{
"email":"andrew#gmail.com"
}
POST my_index/_doc/2
{
"email":"michel#gmail.com"
}
Search Query
GET my_index/_search
{
"query": {
"multi_match": {
"query": "andrew#gmail.com michel#gmail.com",
"fields": ["email"]
}
}
}
Results
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.6931472,
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.6931472,
"_source" : {
"email" : "andrew#gmail.com"
}
},
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.6931472,
"_source" : {
"email" : "michel#gmail.com"
}
}
]
}
Another option is to use the keyword sub-field defined in the mapping above. A terms query is not analyzed, so it matches the exact value stored in email.keyword.
Search Query
GET my_index/_search
{
"query": {
"terms": {
"email.keyword": [
"andrew#gmail.com",
"michel#gmail.com"
]
}
}
}
In my opinion, using the uax_url_email tokenizer is the better solution.
Hope this helps