How to Order Completion Suggester with Fuzziness - elasticsearch

When using a Completion Suggester with Fuzziness defined the ordering of results for suggestions are alphabetical instead of most relevant. It seems that whatever the fuzzines is set to is removed from the search/query term at the end of the term. This is not what I expected from reading the Completion Suggester Fuzziness docs which state:
Suggestions that share the longest prefix to the query prefix will be scored higher.
But that is not true. Here is a use case that proves this:
PUT test/
{
"mappings":{
"properties":{
"id":{
"type":"integer"
},
"title":{
"type":"keyword",
"fields": {
"suggest": {
"type": "completion"
}
}
}
}
}
}
POST test/_bulk
{ "index" : {"_id": "1"}}
{ "title": "HOLARAT" }
{ "index" : {"_id": "2"}}
{ "title": "HOLBROOK" }
{ "index" : {"_id": "3"}}
{ "title": "HOLCONNEN" }
{ "index" : {"_id": "4"}}
{ "title": "HOLDEN" }
{ "index" : {"_id": "5"}}
{ "title": "HOLLAND" }
The above creates an index and adds some data.
If a suggestion query is done on said data:
POST test/_search
{
"_source": {
"includes": [
"title"
]
},
"suggest": {
"title-suggestion": {
"completion": {
"fuzzy": {
"fuzziness": "1"
},
"field": "title.suggest",
"size": 3
},
"prefix": "HOLL"
}
}
}
It returns the first 3 results in alphabetical order of the last matching character, instead of the longest prefix (which would be HOLLAND):
{
...
"suggest" : {
"title-suggestion" : [
{
"text" : "HOLL",
"offset" : 0,
"length" : 4,
"options" : [
{
"text" : "HOLARAT",
"_index" : "test",
"_type" : "_doc",
"_id" : "1",
"_score" : 3.0,
"_source" : {
"title" : "HOLARAT"
}
},
{
"text" : "HOLBROOK",
"_index" : "test",
"_type" : "_doc",
"_id" : "2",
"_score" : 3.0,
"_source" : {
"title" : "HOLBROOK"
}
},
{
"text" : "HOLCONNEN",
"_index" : "test",
"_type" : "_doc",
"_id" : "3",
"_score" : 3.0,
"_source" : {
"title" : "HOLCONNEN"
}
}
]
}
]
}
}
If the size param is removed then we can see that the score is the same for all entries, instead of the longest prefix being higher as stated.
With this being the case, how can results from Completion Suggesters with Fuzziness defined be ordered with the longest prefix at the top?

This has been reported in the past and this behavior is actually by design.
What I usually do in this case is to send two suggest queries (similar to what has been suggested here), one for exact match and another for fuzzy match. If the exact match contains a suggestion, I use it, otherwise I resort to using the fuzzy ones.
With the suggest query below, you'll get HOLLAND as exact-suggestion and then the fuzzy matches in fuzzy-suggestion:
POST test/_search
{
"_source": {
"includes": [
"title"
]
},
"suggest": {
"fuzzy-suggestion": {
"completion": {
"fuzzy": {
"fuzziness": "1"
},
"field": "title.suggest",
"size": 3
},
"prefix": "HOLL"
},
"exact-suggestion": {
"completion": {
"field": "title.suggest",
"size": 3
},
"prefix": "HOLL"
}
}
}

Related

How to extract matching groups when searching regex in elasticsearch

Am using elasticsearch to index some data (a text article) and mapped it as
"article": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
, then I use regexp queries to search for some matches, the result returns the correct documents, but is there a way to also return the matching groups/text which triggered the regex hit ?
You can use highlghting functionality of Elasticsearch.
Lets consider below is your sample document:
{
"article":"Elasticsearch Documentation"
}
Query:
{
"query": {
"regexp": {
"article": "el.*ch"
}
},
"highlight": {
"fields": {
"article": {}
}
}
}
Response
{
"_index" : "index1",
"_type" : "_doc",
"_id" : "cHzAH4IBgPd6xUeLm9QF",
"_score" : 1.0,
"_source" : {
"article" : "Elasticsearch Documentation"
},
"highlight" : {
"article" : [
"<em>Elasticsearch</em> Documentation"
]
}
}

Synonyms relevance issue in Elasticsearch

I am trying to configured synonyms in elasticsearch and done the sample configuration as well. But not getting expected relevancy when i am searching data.
Below is index Mapping configuration:
PUT /test_index
{
"settings": {
"index": {
"analysis": {
"filter": {
"my_synonyms": {
"type": "synonym",
"synonyms": [
"mind, brain",
"brainstorm,brain storm"
]
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase"
]
},
"my_search_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"my_synonyms"
]
}
}
}
}
},
"mappings": {
"properties": {
"my_field": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
Below is sample data which i have indexed:
POST test_index/_bulk
{ "index" : { "_id" : "1" } }
{"my_field": "This is a brainstorm" }
{ "index" : { "_id" : "2" } }
{"my_field": "A different brain storm" }
{ "index" : { "_id" : "3" } }
{"my_field": "About brainstorming" }
{ "index" : { "_id" : "4" } }
{"my_field": "I had a storm in my brain" }
{ "index" : { "_id" : "5" } }
{"my_field": "I envisaged something like that" }
Below is query which i am trying:
GET test_index/_search
{
"query": {
"match": {
"my_field": {
"query": "brainstorm",
"analyzer": "my_search_analyzer"
}
}
}
}
Current Result:
"hits" : [
{
"_index" : "test_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.8185701,
"_source" : {
"my_field" : "A different brain storm"
}
},
{
"_index" : "test_index",
"_type" : "_doc",
"_id" : "4",
"_score" : 1.4100728,
"_source" : {
"my_field" : "I had a storm in my brain"
}
},
{
"_index" : "test_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.90928507,
"_source" : {
"my_field" : "This is a brainstorm"
}
}
]
I am expecting document which is matching exect with query on top and document which is matching with synonyms should come with low score.
so here my expectation is document with value "This is a brainstorm" should come at position one.
Could you please suggest me how i can achive.
I have tried to applied boosting and weightage as well but no luck.
Thanks in advance !!!
Elasticsearch "replaces" every instance of a synonym all other synonyms, and does so on both indexing and searching (unless you provide a separate search_analyzer) so you're losing the exact token. To keep this information, use a subfield with standard analyzer and then use multi_match query to match either synonyms or exact value + boost the exact field.
I have got answer from Elastic Forum here. I have copied below for quick referance.
Hello there,
Since you are indexing synonyms into your inverted index, brain storm and brainstorm are all different tokens after analyzer does its thing. So Elasticsearch on query time uses your analyzer to create tokens for brain, storm and brainstorm from your query and match multiple tokens with indexes 2 and 4, your index 2 has lesser words so tf/idf scores it higher between the two and index number 1 only matches brainstorm.
You can also see what your analyzer does to your input with this;
POST test_index/_analyze
{
"analyzer": "my_search_analyzer",
"text": "I had a storm in my brain"
}
I did some trying out so, you should change your index analyzer to my_analyzer;
PUT /test_index
{
"settings": {
"index": {
"analysis": {
"filter": {
"my_synonyms": {
"type": "synonym",
"synonyms": [
"mind, brain",
"brainstorm,brain storm"
]
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase"
]
},
"my_search_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"my_synonyms"
]
}
}
}
}
},
"mappings": {
"properties": {
"my_field": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
Then you want to boost your exact matches, but you also want to get hits from my_search_analyzer tokens as well so i have changed your query a bit;
GET test_index/_search
{
"query": {
"bool": {
"should": [
{
"match": {
"my_field": {
"query": "brainstorm",
"analyzer": "my_search_analyzer"
}
}
},
{
"match_phrase": {
"my_field": {
"query": "brainstorm"
}
}
}
]
}
}
}
result:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 2.3491273,
"hits" : [
{
"_index" : "test_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 2.3491273,
"_source" : {
"my_field" : "This is a brainstorm"
}
},
{
"_index" : "test_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.8185701,
"_source" : {
"my_field" : "A different brain storm"
}
}
]
}
}

Skip duplicates on field in a Elasticsearch search result

Is it possible to remove duplicates on a given field?
For example the following query:
{
"query": {
"term": {
"name_admin": {
"value": "nike"
}
}
},
"_source": [
"name_admin",
"parent_sku",
"sku"
],
"size": 2
}
is retrieving
"hits" : [
{
"_index" : "product",
"_type" : "_doc",
"_id" : "central30603",
"_score" : 4.596813,
"_source" : {
"parent_sku" : "SSP57",
"sku" : "SSP57816401",
"name_admin" : "NIKE U NSW PRO CAP NIKE AIR"
}
},
{
"_index" : "product",
"_type" : "_doc",
"_id" : "central156578",
"_score" : 4.596813,
"_source" : {
"parent_sku" : "SSP57",
"sku" : "SSP57816395",
"name_admin" : "NIKE U NSW PRO CAP NIKE AIR"
}
}
]
I'd like to skip duplicates on parent_sku so I only have one result per parent_sku like it's possible with suggestion by doing something like "skip_duplicates": true.
I know I cloud achieve this with an aggregation but I'd like to stick with a search, as my query is a bit more complicated and as I'm using the scroll API which doesn't work with aggregations.
Field collapsing should help here
{
"query": {
"term": {
"name_admin": {
"value": "nike"
}
}
},
"collapse" : {
"field" : "parent_sku",
"inner_hits": {
"name": "parent",
"size": 1
}
},
"_source": false,
"size": 2
}
The above query will return one document par parent_sku.

How to get results weighted by references from ElasticSearch

I have a dataset consisting of Notes referencing other Notes.
{id:"aaa", note: "lorem ipsum", references: ["ccc"]},
{id:"bbb", note: "lorem ipsum", references: ["aaa","ccc"]},
{id:"ccc", note: "lorem ipsum", references: ["bbb"]},
I want elastic search to use the references to weight the results, so in this case if I search for lorem I should get id "ccc" back since it has the most references. According to their docs, their graph solution does exactly this, but I also see examples where they are doing similar things.
But no explanation of how this is mapped to the Index. So my question is: how does one set up an ES index that uses references (indices)?
Other answers gave some clues, but then #7379490 provided the answer in another channel:
There is no way of doing this directly in ES. There are two possible solutions:
pre calculate references and pass them into ES by mapping a new value to the document.
Or aggregate and use aggregation to sort the response:
{
"query": {
"function_score": {
"query": {
"match": {
"note": "lorem"
}
},
"aggs": {
"references":{
"terms" : {
"field" : "references.keyword",
"order": { "_count": "desc" }
}
}
}
}
}
Function_score can be used to give higher score to documents with more values for reference field
Mapping:
{
"mappings": {
"properties": {
"id": {
"type": "integer"
},
"note": {
"type": "text"
},
"references": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
Query:
{
"query": {
"function_score": {
"query": {
"match": {
"note": "lorem"
}
},
"functions": [
{
"script_score": {
"script": "_score * doc['references.keyword'].length" --> references length
}
}
]
}
}
}
Result:
"hits" : [
{
"_index" : "index71",
"_type" : "_doc",
"_id" : "iJObRHIBXTOhZcHPaYks",
"_score" : 0.035661265,
"_source" : {
"id" : 2,
"note" : "lorem ipsum",
"references" : [
1,
3
]
}
},
{
"_index" : "index71",
"_type" : "_doc",
"_id" : "h5ObRHIBXTOhZcHPYok2",
"_score" : 0.017830633,
"_source" : {
"id" : 1,
"note" : "lorem ipsum",
"references" : [
3
]
}
},
{
"_index" : "index71",
"_type" : "_doc",
"_id" : "iZObRHIBXTOhZcHPb4k4",
"_score" : 0.017830633,
"_source" : {
"id" : 3,
"note" : "lorem ipsum",
"references" : [
2
]
}
}
]
If your data contain referencesGiven and you need to search by referencesReceived, I would recommend a 2-way pass:
insert all documents with an empty field for referencesReceived: []
(or referencesReceivedCount: 0 if that is enough)
for each document, for each item in referencesGiven, update the document receiving the reference

Is there any way to add the field in document but hide it from _source, also document should be analysed and searchable

I want to add one field to the document which should be searchable but when we do get/search it should not appear under _source.
I have tried index and store options but its not achievable through it.
Its more like _all or copy_to, but in my case value is provided by me (not collecting from other fields of the document.)
I am looking for mapping through which I can achieve below cases.
When I put document :
PUT my_index/_doc/1
{
"title": "Some short title",
"date": "2015-01-01",
"content": "A very long content field..."
}
and do search
GET my_index/_search
output should be
{
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"title" : "Some short title",
"date" : "2015-01-01"
}
}
]
}
}
also when I do the below search
GET my_index/_search
{
"query": {
"query_string": {
"default_field": "content",
"query": "long content"
}
}
}
it should result me
"hits" : {
"total" : 1,
"max_score" : 0.5753642,
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.5753642,
"_source" : {
"title" : "Some short title",
"date" : "2015-01-01"
}
}
]
}
Simply use source filtering to exclude the content field:
GET my_index/_search
{
"_source": {
"excludes": [ "content" ]
},
"query": {
"query_string": {
"default_field": "content",
"query": "long content"
}
}
}
We can achieve this using below mapping :
PUT my_index
{
"mappings": {
"_doc": {
"_source": {
"excludes": [
"content"
]
},
"properties": {
"title": {
"type": "text",
"store": true
},
"date": {
"type": "date",
"store": true
},
"content": {
"type": "text"
}
}
}
}
}
Add document :
PUT my_index/_doc/1
{
"title": "Some short title",
"date": "2015-01-01",
"content": "A very long content field..."
}
When you run the query to search content on the field 'content' :
GET my_index/_search
{
"query": {
"query_string": {
"default_field": "content",
"query": "long content"
}
}
}
You will get the result with hits as below:
"hits" : {
"total" : 1,
"max_score" : 0.5753642,
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.5753642,
"_source" : {
"date" : "2015-01-01",
"title" : "Some short title"
}
}
]
}
It hides the field 'content'. :)
Hence achieved it with the help of mapping. You don't need to exclude it from query each time you make get/search call.
More read on source :
https://www.elastic.co/guide/en/elasticsearch/reference/6.6/mapping-source-field.html

Resources