Elasticsearch advanced autocomplete - elasticsearch

I want to autocomplete user input with Elasticsearch. Now There are tons of tutorials out there how to do so, but none go into the really detailed stuff.
The last issue I'm having with my query is that it should score Results that are not real "autocompletions" lower. Example:
IS:
I type: "Bed"
I find: "Bed", "Bigbed", "Fancy Bed", "Bed Frame"
WANT:
I type: "Bed"
I find: "Bed", "Bed Frame", [other "Bed XXX" results], "Fancy Bed", "Bigbed"
So i want Elasticsearch to first complete "to the right" if that makes sense. And then use results that have words in front of it.
I've tried the completion suggester I doesn't do other stuff I want but also has the same issue.
In German there are lots of examples of words like Bigbed (which isn't a real word in English, I know. But I don't want those words as high results. But since they match closer than Bed Frame (because that is 2 Tokens) they show up so high.
This is my query currently:
POST autocompletion/_search?pretty
{
"query": {
"function_score": {
"query": {
"match": {
"keyword": {
"query": "Bed",
"fuzziness": 1,
"minimum_should_match": "100%"
}
}
},
"field_value_factor": {
"field": "bias",
"factor": 1
}
}
}
}

If you use elasticsearch completion suggester, as explained at https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-completion.html, when querying like:
{
"suggest": {
"song-suggest" : {
"prefix" : "bed",
"completion" : {
"field" : "suggest"
}
}
}
}
You will get:
{
"took": 13,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 0,
"max_score": 0.0,
"hits": []
},
"suggest": {
"song-suggest": [
{
"text": "bed",
"offset": 0,
"length": 3,
"options": [
{
"text": "Bed",
"_index": "autocomplete",
"_type": "_doc",
"_id": "1",
"_score": 34.0,
"_source": {
"suggest": {
"input": [
"Bed"
],
"weight": 34
}
}
},
{
"text": "Bed Frame",
"_index": "autocomplete",
"_type": "_doc",
"_id": "3",
"_score": 34.0,
"_source": {
"suggest": {
"input": [
"Bed Frame"
],
"weight": 34
}
}
}
]
}
]
}
}
If you want to use the search API instead, you can use 2 queries:
prefix query "bed ****"
with a term starting by "bed"
Here the mapping:
{
"mappings": {
"_doc" : {
"properties" : {
"suggest" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
}
}
}
}
}
}
Here the search query:
{
"query" : {
"bool" : {
"must" : [
{
"match" : {
"suggest" : "Bed"
}
}
],
"should" : [
{
"prefix" : {
"suggest.keyword" : "Bed"
}
}
]
}
}
}
The should clause will boost document starting by "Bed". Et voilĂ !

Related

Elasticsearch: index boost with completion suggester

Is it possible to use index boost when using completion suggester in Elasticsearch? I have tried many different ways but doesn't seem to work. Haven't found any reference in the documentation claiming that it does not work for completion suggester. Example:
POST index1,index2/_search
{
"suggest" : {
"name_suggest" : {
"text" : "my_query",
"completion" : {
"field" : "name_suggest",
"size" : 7,
"fuzzy" :{}
}
}
},
"indices_boost" : [
{ "index1" : 2 },
{ "index2" : 1.5 }
]
}
The above does not return boosted scores. The scores are the same compared to running it without the indices_boost parameter.
Tried few options but these didn't work directly, instead, you can define the weight of a document at index-time, and these could be used as a workaround to get the boosted document, below is the complete example.
Index mapping same for index1, index2
{
"mappings": {
"properties": {
"suggest": {
"type": "completion"
},
"title": {
"type": "keyword"
}
}
}
}
Index doc 1 with weight in index-1
{
"suggest": {
"input": [
"Nevermind",
"Nirvana"
],
"weight": 30
}
}
Similar doc is inserted in index-2 with diff weight
{
"suggest": {
"input": [
"Nevermind",
"Nirvana"
],
"weight": 10 --> note less weight
}
}
And the simple search will now sort it according to weight
{
"suggest": {
"song-suggest": {
"prefix": "nir",
"completion": {
"field": "suggest"
}
}
}
}
And search result
{
"text": "Nirvana",
"_index": "index-1",
"_type": "_doc",
"_id": "1",
"_score": 34.0,
"_source": {
"suggest": {
"input": [
"Nevermind",
"Nirvana"
],
"weight": 30
}
}
},
{
"text": "Nirvana",
"_index": "index-2",
"_type": "_doc",
"_id": "1",
"_score": 30.0,
"_source": {
"suggest": {
"input": [
"Nevermind",
"Nirvana"
],
"weight": 10
}
}
}
]

Elasticsearch query showing weird behavior : bug?

To sum up things quickly, we are using Elasticsearch 6.8.4 and have documents with fields such as "statutPublicOuInterne" (public or internal state) or "identifiant" (identifier).
I cannot share the whole JSON (_source) for security reasons (corporate restrictions), but it looks like the following:
"_source": {
"dateCreation": "2020-11-05T16:31:28.404+01:00",
"dateDerModif": "2020-11-05T16:31:49.183+01:00",
"contenu": { ... }
"langue": "fr",
"observations": null,
"statutPublicOuInterne": "enAttenteTraitementCommissionTask",
"identifiant": "SFB-20201105-ELUH",
(...)
}
Some of the "statutPublicOuInterne" can have values such as "enAttenteTraitementCommissionTask" or "enCoursTraitementCommissionTask".
1st question: for some reason, when I search for statutPublicOuInterne=enCoursTraitementCommissionTask, it doesn't work, but if I search for statutPublicOuInterne=enCoursTraitementCommission (without "Task"), it works! That seems so weird to me and I really can't explain it.
2nd question: if I assume I need to search without the "Task" at the end, then searching for statutPublicOuInterne=enCoursTraitementCommission works but statutPublicOuInterne=enAttenteTraitementCommission doesn't work! (nor does statutPublicOuInterne=enAttenteTraitementCommissionTask work)
The query is as follows:
{
"query": {
"bool" : {
"must" : [
{
"match" : {
"statutPublicOuInterne" : {
"query" : "enAttenteTraitementCommission"
}
}
}
]
}
}
}
I just can't understand why it doesn't find anything, because if I search for this document with its "identifiant" field, then it works:
{
"query": {
"bool" : {
"must" : [
{
"match" : {
"identifiant" : {
"query" : "SFB-20201105-ELUH"
}
}
}
]
}
}
}
The response is:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 2.0283146,
"hits": [
{
"_index": "some-index",
"_type": "demandes",
"_id": "SFB-20201105-ELUH",
"_score": 2.0283146,
"_source": {
"dateCreation": "2020-11-05T16:31:28.404+01:00",
"dateDerModif": "2020-11-05T16:31:49.183+01:00",
"contenu": { ... }
"langue": "fr",
"observations": null,
"statutPublicOuInterne": "enAttenteTraitementCommissionTask",
"identifiant": "SFB-20201105-ELUH",
(...)
}
}
]
}
}
We can clearly see "statutPublicOuInterne": "enAttenteTraitementCommissionTask" in the response.
Am I missing something?
Many thanks in advance for your help!
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
"mappings": {
"properties": {
"statutPublicOuInterne": {
"type": "text"
}
}
}
}
Index Data:
{
"dateCreation": "2020-11-05T16:31:28.404+01:00",
"dateDerModif": "2020-11-05T16:31:49.183+01:00",
"langue": "fr",
"observations": null,
"statutPublicOuInterne": "enAttenteTraitementCommissionTask",
"identifiant": "SFB-20201105-ELUH"
}
Search Query:
{
"query": {
"bool": {
"must": [
{
"match": {
"statutPublicOuInterne": {
"query": "enAttenteTraitementCommissionTask"
}
}
}
]
}
}
}
Search Result:
"hits": [
{
"_index": "64700803",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"dateCreation": "2020-11-05T16:31:28.404+01:00",
"dateDerModif": "2020-11-05T16:31:49.183+01:00",
"langue": "fr",
"observations": null,
"statutPublicOuInterne": "enAttenteTraitementCommissionTask",
"identifiant": "SFB-20201105-ELUH"
}
}
]

pick up objects inside a polygon, using elasticsearch

I am trying to collect all my items within a polygon, but I am encountering an error.
"reason": "failed to find geo_point field [SitePoint.coordinates]",
I spent some time trying to understand and I can't understand what's wrong
my index
{
"took": 43,
"timed_out": false,
"hits": {
"total": 5,
"max_score": 0,
"hits": [
{
"_index": "my_index",
"_type": "doc",
"_id": "xxxx",
"_score": 0,
"_source": {
"SitePoint": {
"type": "Point",
"coordinates": [
18.85491,
-33.92305
]
}
}
}
]
}
}
my query
GET my_index/_search
{
"query": {
"bool" : {
"must" : {
"match_all" : {}
},
"filter" : {
"geo_polygon" : {
"SitePoint.coordinates" : {
"points" : [
[18.85096,-33.96311],
[18.87787,-33.92564],
[18.85096,-33.96311]
]
}
}
}
}
}
}
Can someone help me please?
The issue seems to be that you're using a geo_polygon query which operates on geo_point fields, whereas SitePoint seems to be a geo_shape.
You need to use a geo_shape query instead of geo_polygon:
Try this query:
GET my_index/_search
{
"query": {
"bool" : {
"must" : {
"match_all" : {}
},
"filter" : {
"geo_shape" : {
"SitePoint" : {
"shape": {
"type": "polygon",
"coordinates" : [
[18.85096,-33.96311],
[18.87787,-33.92564],
[18.85096,-33.96311]
]
},
"relation": "within"
}
}
}
}
}
}

Cross Field Search with Multiple Complete and Incomplete Phrases in Each Field

Example data:
PUT /test/test/1
{
"text1":"cats meow",
"text2":"12345",
"text3":"toy"
}
PUT /test/test/2
{
"text1":"dog bark",
"text2":"98765",
"text3":"toy"
}
And an example query:
GET /test/test/_search
{
"size": 25,
"query": {
"multi_match" : {
"fields" : [
"text1",
"text2",
"text3"
],
"query" : "meow cats toy",
"type" : "cross_fields"
}
}
}
Returns the cat hit first and then the dog, which is what I want.
BUT when you query cat toy, both the cat and dog have the same relevance score. I want to be able to take into consideration the prefix of that word (and maybe a few other words inside that field), and run cross_fields.
So if I search:
GET /test/test/_search
{
"size": 25,
"query": {
"multi_match" : {
"fields" : [
"text1",
"text2",
"text3"
],
"query" : "cat toy",
"type" : "phrase_prefix"
}
}
}
or
GET /test/test/_search
{
"size": 25,
"query": {
"multi_match" : {
"fields" : [
"text1",
"text2",
"text3"
],
"query" : "meow cats",
"type" : "phrase_prefix"
}
}
}
I should get the cat/ID 1, but I did not.
I found that using cross_fields achieves multi-word phrases, but not multi-incomplete phrases. And phrase_prefix achieves incomplete phrases, but not multiple incomplete phrases...
Sifting through the documentation really isn't helping me discover how to combine these two.
Yeah, I had to apply an analyzer...
The analyzer is applied to the fields when creating the index before you add any data. I couldn't find an easier way to do this after you add the data.
The solution I have found is exploding all of the phrases into each individual prefixes so cross_fields can do it's magic. You can learn more about the use of edge-ngram here.
So instead of cross_field just searching the cats phrase, it's now going to search: c, ca, cat, and cats and every phrase after... So the text1 field will look like this to elastic: c ca cat cats m me meo meow.
~~~
Here are the steps to make the above question example work:
First you create and name the analyzer. To learn a bit more what the filter's values mean, I recommend you take a look at this.
PUT /test
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
}
}
Then I attached this analyzer to each field.
I changed the text1 to match the field I was applying this to.
PUT /test/_mapping/test
{
"test": {
"properties": {
"text1": {
"type": "string",
"analyzer": "autocomplete"
}
}
}
}
I ran GET /test/_mapping to be sure everything worked.
Then to add the data:
POST /test/test/_bulk
{ "index": { "_id": 1 }}
{ "text1": "cats meow", "text2": "12345", "text3": "toy" }
{ "index": { "_id": 2 }}
{ "text1": "dog bark", "text2": "98765", "text3": "toy" }
And the search!
{
"size": 25,
"query": {
"multi_match" : {
"fields" : [
"text1",
"text2",
"text3"
],
"query" : "cat toy",
"type" : "cross_fields"
}
}
}
Which returns:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.70778143,
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "1",
"_score": 0.70778143,
"_source": {
"text1": "cats meow",
"text2": "12345",
"text3": "toy"
}
},
{
"_index": "test",
"_type": "test",
"_id": "2",
"_score": 0.1278426,
"_source": {
"text1": "dog bark",
"text2": "98765",
"text3": "toy"
}
}
]
}
}
This creates contrast between the two when you search cat toy, where as before the score was the same. But now, the cat hit has a higher score, as it should. This is achieved by taking into consideration every prefix (max 20 characters in this case/phrase) for each phrase and then seeing how relevant the data is with cross_fields.

Make a flat array from Elasticsearch query results

I have an index with the following documents (simplified):
{
"user" : "j.johnson",
"certifications" : [{
"certification_date" : "2013-02-09T00:00:00+03:00",
"previous_level" : "No Level",
"obtained_level" : "Junior"
}, {
"certification_date" : "2014-05-26T00:00:00+03:00",
"previous_level" : "Junior",
"obtained_level" : "Middle"
}
]
}
I want just to have a flat list of all certifications passed by all users where certification_date > 2014-01-01. It should be a pretty large array like this:
[{
"certification_date" : "2014-09-08T00:00:00+03:00",
"previous_level" : "No Level",
"obtained_level" : "Junior"
}, {
"certification_date" : "2014-05-26T00:00:00+03:00",
"previous_level" : "Junior",
"obtained_level" : "Middle"
}, {
"certification_date" : "2015-01-26T00:00:00+03:00",
"previous_level" : "Junior",
"obtained_level" : "Middle"
}
...
]
It doesn't seems to be a hard task, but I wasn't able to find an easy way to do that.
I would do it with a parent/child relationship, though you will have to reorganize your data. I don't think you can get what you want with your current schema.
More concretely, I set up an index like this, with user as parent and certification as child:
PUT /test_index
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"mappings": {
"user": {
"properties": {
"user_name": { "type": "string" }
}
},
"certification":{
"_parent": { "type": "user" },
"properties": {
"certification_date": { "type": "date" },
"previous_level": { "type": "string" },
"obtained_level": { "type": "string" }
}
}
}
}
added some docs:
POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"user","_id":1}}
{"user_name":"j.johnson"}
{"index":{"_index":"test_index","_type":"certification","_parent":1}}
{"certification_date" : "2013-02-09T00:00:00+03:00","previous_level" : "No Level","obtained_level" : "Junior"}
{"index":{"_index":"test_index","_type":"certification","_parent":1}}
{"certification_date" : "2014-05-26T00:00:00+03:00","previous_level" : "Junior","obtained_level" : "Middle"}
{"index":{"_index":"test_index","_type":"user","_id":2}}
{ "user_name":"b.bronson"}
{"index":{"_index":"test_index","_type":"certification","_parent":2}}
{"certification_date" : "2013-09-05T00:00:00+03:00","previous_level" : "No Level","obtained_level" : "Junior"}
{"index":{"_index":"test_index","_type":"certification","_parent":2}}
{"certification_date" : "2014-07-20T00:00:00+03:00","previous_level" : "Junior","obtained_level" : "Middle"}
Now I can just search certifications with a range filter:
POST /test_index/certification/_search
{
"query": {
"constant_score": {
"filter": {
"range": {
"certification_date": {
"gte": "2014-01-01"
}
}
}
}
}
}
...
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "certification",
"_id": "QGXHp7JZTeafWYzb_1FZiA",
"_score": 1,
"_source": {
"certification_date": "2014-05-26T00:00:00+03:00",
"previous_level": "Junior",
"obtained_level": "Middle"
}
},
{
"_index": "test_index",
"_type": "certification",
"_id": "yvO2A9JaTieI5VHVRikDfg",
"_score": 1,
"_source": {
"certification_date": "2014-07-20T00:00:00+03:00",
"previous_level": "Junior",
"obtained_level": "Middle"
}
}
]
}
}
This structure is still not completely flat the way you asked for, but I think this is as close as ES will let you get.
Here is the code I used:
http://sense.qbox.io/gist/3c733ec75e6c0856fa2772cc8f67bd7c00aba637

Resources