Synonyms relevance issue in Elasticsearch - elasticsearch

I am trying to configured synonyms in elasticsearch and done the sample configuration as well. But not getting expected relevancy when i am searching data.
Below is index Mapping configuration:
PUT /test_index
{
"settings": {
"index": {
"analysis": {
"filter": {
"my_synonyms": {
"type": "synonym",
"synonyms": [
"mind, brain",
"brainstorm,brain storm"
]
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase"
]
},
"my_search_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"my_synonyms"
]
}
}
}
}
},
"mappings": {
"properties": {
"my_field": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
Below is sample data which i have indexed:
POST test_index/_bulk
{ "index" : { "_id" : "1" } }
{"my_field": "This is a brainstorm" }
{ "index" : { "_id" : "2" } }
{"my_field": "A different brain storm" }
{ "index" : { "_id" : "3" } }
{"my_field": "About brainstorming" }
{ "index" : { "_id" : "4" } }
{"my_field": "I had a storm in my brain" }
{ "index" : { "_id" : "5" } }
{"my_field": "I envisaged something like that" }
Below is query which i am trying:
GET test_index/_search
{
"query": {
"match": {
"my_field": {
"query": "brainstorm",
"analyzer": "my_search_analyzer"
}
}
}
}
Current Result:
"hits" : [
{
"_index" : "test_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.8185701,
"_source" : {
"my_field" : "A different brain storm"
}
},
{
"_index" : "test_index",
"_type" : "_doc",
"_id" : "4",
"_score" : 1.4100728,
"_source" : {
"my_field" : "I had a storm in my brain"
}
},
{
"_index" : "test_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.90928507,
"_source" : {
"my_field" : "This is a brainstorm"
}
}
]
I am expecting document which is matching exect with query on top and document which is matching with synonyms should come with low score.
so here my expectation is document with value "This is a brainstorm" should come at position one.
Could you please suggest me how i can achive.
I have tried to applied boosting and weightage as well but no luck.
Thanks in advance !!!

Elasticsearch "replaces" every instance of a synonym all other synonyms, and does so on both indexing and searching (unless you provide a separate search_analyzer) so you're losing the exact token. To keep this information, use a subfield with standard analyzer and then use multi_match query to match either synonyms or exact value + boost the exact field.

I have got answer from Elastic Forum here. I have copied below for quick referance.
Hello there,
Since you are indexing synonyms into your inverted index, brain storm and brainstorm are all different tokens after analyzer does its thing. So Elasticsearch on query time uses your analyzer to create tokens for brain, storm and brainstorm from your query and match multiple tokens with indexes 2 and 4, your index 2 has lesser words so tf/idf scores it higher between the two and index number 1 only matches brainstorm.
You can also see what your analyzer does to your input with this;
POST test_index/_analyze
{
"analyzer": "my_search_analyzer",
"text": "I had a storm in my brain"
}
I did some trying out so, you should change your index analyzer to my_analyzer;
PUT /test_index
{
"settings": {
"index": {
"analysis": {
"filter": {
"my_synonyms": {
"type": "synonym",
"synonyms": [
"mind, brain",
"brainstorm,brain storm"
]
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase"
]
},
"my_search_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"my_synonyms"
]
}
}
}
}
},
"mappings": {
"properties": {
"my_field": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
Then you want to boost your exact matches, but you also want to get hits from my_search_analyzer tokens as well so i have changed your query a bit;
GET test_index/_search
{
"query": {
"bool": {
"should": [
{
"match": {
"my_field": {
"query": "brainstorm",
"analyzer": "my_search_analyzer"
}
}
},
{
"match_phrase": {
"my_field": {
"query": "brainstorm"
}
}
}
]
}
}
}
result:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 2.3491273,
"hits" : [
{
"_index" : "test_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 2.3491273,
"_source" : {
"my_field" : "This is a brainstorm"
}
},
{
"_index" : "test_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.8185701,
"_source" : {
"my_field" : "A different brain storm"
}
}
]
}
}

Related

How to extract matching groups when searching regex in elasticsearch

Am using elasticsearch to index some data (a text article) and mapped it as
"article": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
, then I use regexp queries to search for some matches, the result returns the correct documents, but is there a way to also return the matching groups/text which triggered the regex hit ?
You can use highlghting functionality of Elasticsearch.
Lets consider below is your sample document:
{
"article":"Elasticsearch Documentation"
}
Query:
{
"query": {
"regexp": {
"article": "el.*ch"
}
},
"highlight": {
"fields": {
"article": {}
}
}
}
Response
{
"_index" : "index1",
"_type" : "_doc",
"_id" : "cHzAH4IBgPd6xUeLm9QF",
"_score" : 1.0,
"_source" : {
"article" : "Elasticsearch Documentation"
},
"highlight" : {
"article" : [
"<em>Elasticsearch</em> Documentation"
]
}
}

How to Order Completion Suggester with Fuzziness

When using a Completion Suggester with Fuzziness defined the ordering of results for suggestions are alphabetical instead of most relevant. It seems that whatever the fuzzines is set to is removed from the search/query term at the end of the term. This is not what I expected from reading the Completion Suggester Fuzziness docs which state:
Suggestions that share the longest prefix to the query prefix will be scored higher.
But that is not true. Here is a use case that proves this:
PUT test/
{
"mappings":{
"properties":{
"id":{
"type":"integer"
},
"title":{
"type":"keyword",
"fields": {
"suggest": {
"type": "completion"
}
}
}
}
}
}
POST test/_bulk
{ "index" : {"_id": "1"}}
{ "title": "HOLARAT" }
{ "index" : {"_id": "2"}}
{ "title": "HOLBROOK" }
{ "index" : {"_id": "3"}}
{ "title": "HOLCONNEN" }
{ "index" : {"_id": "4"}}
{ "title": "HOLDEN" }
{ "index" : {"_id": "5"}}
{ "title": "HOLLAND" }
The above creates an index and adds some data.
If a suggestion query is done on said data:
POST test/_search
{
"_source": {
"includes": [
"title"
]
},
"suggest": {
"title-suggestion": {
"completion": {
"fuzzy": {
"fuzziness": "1"
},
"field": "title.suggest",
"size": 3
},
"prefix": "HOLL"
}
}
}
It returns the first 3 results in alphabetical order of the last matching character, instead of the longest prefix (which would be HOLLAND):
{
...
"suggest" : {
"title-suggestion" : [
{
"text" : "HOLL",
"offset" : 0,
"length" : 4,
"options" : [
{
"text" : "HOLARAT",
"_index" : "test",
"_type" : "_doc",
"_id" : "1",
"_score" : 3.0,
"_source" : {
"title" : "HOLARAT"
}
},
{
"text" : "HOLBROOK",
"_index" : "test",
"_type" : "_doc",
"_id" : "2",
"_score" : 3.0,
"_source" : {
"title" : "HOLBROOK"
}
},
{
"text" : "HOLCONNEN",
"_index" : "test",
"_type" : "_doc",
"_id" : "3",
"_score" : 3.0,
"_source" : {
"title" : "HOLCONNEN"
}
}
]
}
]
}
}
If the size param is removed then we can see that the score is the same for all entries, instead of the longest prefix being higher as stated.
With this being the case, how can results from Completion Suggesters with Fuzziness defined be ordered with the longest prefix at the top?
This has been reported in the past and this behavior is actually by design.
What I usually do in this case is to send two suggest queries (similar to what has been suggested here), one for exact match and another for fuzzy match. If the exact match contains a suggestion, I use it, otherwise I resort to using the fuzzy ones.
With the suggest query below, you'll get HOLLAND as exact-suggestion and then the fuzzy matches in fuzzy-suggestion:
POST test/_search
{
"_source": {
"includes": [
"title"
]
},
"suggest": {
"fuzzy-suggestion": {
"completion": {
"fuzzy": {
"fuzziness": "1"
},
"field": "title.suggest",
"size": 3
},
"prefix": "HOLL"
},
"exact-suggestion": {
"completion": {
"field": "title.suggest",
"size": 3
},
"prefix": "HOLL"
}
}
}

Elasticsearch English stemming not working correctly

I've added an english stemmer analyzer and filter to our query but it doesn't seem to be working correctly with plurals stemming from 'y' => 'ies'.
For example, when I search 'raspberry' the results never include 'raspberries' and so on.
I've tried both english and minimal_english but I still get the same result.
Here's the analyzer and settings:
analysis: {
analyzer: {
custom_analyzer: {
type: "custom",
tokenizer: "standard",
filter: ["lowercase", "english_stemmer"],
},
},
filter: {
english_stemmer: {
type: "stemmer",
language: "english",
},
},
},
}
What am I doing wrong?
Though english should work for the e.g. you mentioned, you can even go for porter_stem instead. This is equivalent to stemmer with language english.
porter_stem in action:
POST /_analyze
{
"tokenizer": "standard",
"filter": ["porter_stem"],
"text": ["raspberry", "raspberries"]
}
Response of above request:
{
"tokens" : [
{
"token" : "raspberri",
"start_offset" : 0,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "raspberri",
"start_offset" : 10,
"end_offset" : 21,
"type" : "<ALPHANUM>",
"position" : 101
}
]
}
You can see both raspberry and raspberries get tokenise to raspberri. Therefore searching for raspberry will also match raspberries and vice-versa.
Make sure that the field against which you are indexing and searching has defined the analyzer as custom_analyzer (according to settings you stated in your question).
Working e.g.
Mapping:
PUT test
{
"settings": {
"analysis": {
"analyzer": {
"custom_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"english_stemmer"
]
}
},
"filter": {
"english_stemmer": {
"type": "stemmer",
"language": "english"
}
}
}
},
"mappings": {
"properties": {
"field1": {
"type": "text",
"analyzer": "custom_analyzer"
}
}
}
}
Indexing:
PUT test/_doc/1
{
"field1": "raspberries"
}
PUT test/_doc/2
{
"field1": "raspberry"
}
Search:
GET test/_search
{
"query": {
"match": {
"field1": {
"query": "raspberry"
}
}
}
}
Response:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.18232156,
"hits" : [
{
"_index" : "test",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.18232156,
"_source" : {
"field1" : "raspberries"
}
},
{
"_index" : "test",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.18232156,
"_source" : {
"field1" : "raspberry"
}
}
]
}
}
You can also have a look at other stemmer kstem.
Unfortunately, porter_stem doesn't always work, e.g. virus and viruses. Someone suggested snowball - but I haven't tried it yet...

Multi match query and the scoring calculation in Elasticsearch

I have couple documents in the index, and 2 of them are
{
"id" : "c0706549-d06c-4043-8086-1b4b3ec1ef95",
"title" : "Google Pixel XL",
"memory" : "4GB",
"quantity" : 3
}
{
"id" : "23ecaecd-6b3f-4592-b79f-f46a20157221",
"title" : "Google Pixel XL",
"memory" : "6GB",
"quantity" : 1
}
And for the query
{
"query": { "multi_match": { "query": "pixel xl 6gb", "fields": ["title", "memory"] } }
}
I get the response
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "c0706549-d06c-4043-8086-1b4b3ec1ef95",
"_score" : 2.4280763,
"_source" : {
"id" : "c0706549-d06c-4043-8086-1b4b3ec1ef95",
"title" : "Google Pixel XL",
"memory" : "4GB",
"quantity" : 3
}
},
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "23ecaecd-6b3f-4592-b79f-f46a20157221",
"_score" : 2.4280763,
"_source" : {
"id" : "23ecaecd-6b3f-4592-b79f-f46a20157221",
"title" : "Google Pixel XL",
"memory" : "6GB",
"quantity" : 1
}
}
But I expect that the document with the memory field 6GB will be on top, can you please advise why this happens and how to fix it?
Index mapping
{
"mappings" : {
"properties" : {
"memory" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
},
"fielddata" : true
},
"title" : {
"type" : "text",
"analyzer" : "synonym_analyzer"
}
}
}
}
Index settings
{
"index" : {
"analysis" : {
"filter" : {
"synonym_filter" : {
"type" : "synonym",
"synonyms" : [
"laptop, notebook"
]
}
},
"analyzer" : {
"synonym_analyzer" : {
"tokenizer" : "standard",
"filter" : ["lowercase", "synonym_filter"]
}
}
}
}
}
Elasticsearch version 7.7.0
I just tried this locally and I am getting much higher score(~2X) for a document containing the 6GB and if you note carefully in your case, both the document has exactly the same score (2.4280763), that means both of them has exactly the same relevance and its just the order in Elasticsearch response that is different.
Think of it, you need to sort the numbers 1,2,3,1 then 1,1,2,3 will be the order
it doesn't matter which 1 comes before or after.
Also, you need to provide your mapping and index configuration(number of shards) and elasticsearch version(as older version uses tf/idf while the new one uses BM25) for score calculation.
I tried this on ES 7.7 version and as mentioned earlier with my mapping and your sample data got 2X better score.
Index mapping
{
"mappings": {
"properties": {
"title" : {
"type": "text"
},
"memory" : {
"type" : "text"
}
}
}
}
Index your 2 docs
{
"title" : "Google Pixel XL",
"memory" : "6GB"
}
{
"title" : "Google Pixel XL",
"memory" : "4GB"
}
Search query
{
"query": {
"multi_match": {
"query": "pixel xl 6gb",
"fields": [
"title",
"memory"
]
}
}
}
And search result
"hits": [
{
"_index": "multma",
"_type": "_doc",
"_id": "2",
"_score": 0.6931471, --> note this
"_source": {
"title": "Google Pixel XL",
"memory": "6GB" --> 6 GB one has a better score and coming on top
}
},
{
"_index": "multma",
"_type": "_doc",
"_id": "1",
"_score": 0.36464313,
"_source": {
"title": "Google Pixel XL",
"memory": "4GB"
}
}
]

Is there any way to add the field in document but hide it from _source, also document should be analysed and searchable

I want to add one field to the document which should be searchable but when we do get/search it should not appear under _source.
I have tried index and store options but its not achievable through it.
Its more like _all or copy_to, but in my case value is provided by me (not collecting from other fields of the document.)
I am looking for mapping through which I can achieve below cases.
When I put document :
PUT my_index/_doc/1
{
"title": "Some short title",
"date": "2015-01-01",
"content": "A very long content field..."
}
and do search
GET my_index/_search
output should be
{
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"title" : "Some short title",
"date" : "2015-01-01"
}
}
]
}
}
also when I do the below search
GET my_index/_search
{
"query": {
"query_string": {
"default_field": "content",
"query": "long content"
}
}
}
it should result me
"hits" : {
"total" : 1,
"max_score" : 0.5753642,
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.5753642,
"_source" : {
"title" : "Some short title",
"date" : "2015-01-01"
}
}
]
}
Simply use source filtering to exclude the content field:
GET my_index/_search
{
"_source": {
"excludes": [ "content" ]
},
"query": {
"query_string": {
"default_field": "content",
"query": "long content"
}
}
}
We can achieve this using below mapping :
PUT my_index
{
"mappings": {
"_doc": {
"_source": {
"excludes": [
"content"
]
},
"properties": {
"title": {
"type": "text",
"store": true
},
"date": {
"type": "date",
"store": true
},
"content": {
"type": "text"
}
}
}
}
}
Add document :
PUT my_index/_doc/1
{
"title": "Some short title",
"date": "2015-01-01",
"content": "A very long content field..."
}
When you run the query to search content on the field 'content' :
GET my_index/_search
{
"query": {
"query_string": {
"default_field": "content",
"query": "long content"
}
}
}
You will get the result with hits as below:
"hits" : {
"total" : 1,
"max_score" : 0.5753642,
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.5753642,
"_source" : {
"date" : "2015-01-01",
"title" : "Some short title"
}
}
]
}
It hides the field 'content'. :)
Hence achieved it with the help of mapping. You don't need to exclude it from query each time you make get/search call.
More read on source :
https://www.elastic.co/guide/en/elasticsearch/reference/6.6/mapping-source-field.html

Resources