I'm trying to achieve Google-style autocomplete and autocorrection with Elasticsearch.
Mappings :
POST music
{
"settings": {
"analysis": {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 20,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
}
},
"analyzer": {
"nGram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding",
"nGram_filter"
]
},
"whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
},
"mappings": {
"song": {
"properties": {
"song_field": {
"type": "string",
"analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer"
},
"suggest": {
"type": "completion",
"analyzer": "simple",
"search_analyzer": "simple",
"payloads": true
}
}
}
}
}
Docs:
POST music/song
{
"song_field" : "beautiful queen",
"suggest" : "beautiful queen"
}
POST music/song
{
"song_field" : "beautiful",
"suggest" : "beautiful"
}
I expect that when a user types "beaatiful q", he will get something like "beautiful queen" ("beaatiful" is corrected to "beautiful" and "q" is completed to "queen").
I've tried the following query:
POST music/song/_search?search_type=dfs_query_then_fetch
{
"size": 10,
"suggest": {
"didYouMean": {
"text": "beaatiful q",
"completion": {
"field": "suggest"
}
}
},
"query": {
"match": {
"song_field": {
"query": "beaatiful q",
"fuzziness": 2
}
}
}
}
Unfortunately, the completion suggester doesn't tolerate typos, so I get this response:
"suggest": {
"didYouMean": [
{
"text": "beaatiful q",
"offset": 0,
"length": 11,
"options": []
}
]
}
In addition, the search gave me these results ("beautiful" ranked higher, although the user had started typing "queen"):
"hits": [
{
"_index": "music",
"_type": "song",
"_id": "AVUj4Y5NancUpEdFLeLo",
"_score": 0.51315063,
"_source": {
"song_field": "beautiful",
"suggest": "beautiful"
}
},
{
"_index": "music",
"_type": "song",
"_id": "AVUj4XFAancUpEdFLeLn",
"_score": 0.32071912,
"_source": {
"song_field": "beautiful queen",
"suggest": "beautiful queen"
}
}
]
UPDATE !!!
I found out that I can use a fuzzy query with the completion suggester, but now I get no suggestions when querying (fuzziness only supports an edit distance of up to 2):
POST music/song/_search
{
"size": 10,
"suggest": {
"didYouMean": {
"text": "beaatefal q",
"completion": {
"field": "suggest",
"fuzzy" : {
"fuzziness" : 2
}
}
}
}
}
I still expect "beautiful queen" as suggestion response.
When you want to provide two or more words as search suggestions, I have found out (the hard way) that it's not worth it to use ngrams or edge ngrams in Elasticsearch.
Using the shingle token filter and a shingle analyzer will provide you with multi-word phrases, and if you couple that with match_phrase_prefix it should give you the functionality you're looking for.
Basically something like this:
PUT /my_index
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"my_shingle_filter": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 2,
"output_unigrams": false
}
},
"analyzer": {
"my_shingle_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"my_shingle_filter"
]
}
}
}
}
}
And don't forget to do your mapping:
{
"my_type": {
"properties": {
"title": {
"type": "string",
"fields": {
"shingles": {
"type": "string",
"analyzer": "my_shingle_analyzer"
}
}
}
}
}
}
Ngrams and edge ngrams are going to tokenize single characters, whereas the shingle analyzer and filter group words together, providing a much more efficient way of producing and searching for phrases. I spent a lot of time messing with the two above until I saw shingles mentioned and read up on them. Much better.
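To tie it together, here is a hedged sketch of the kind of query this enables, using the title field and its shingles subfield from the mapping above: match_phrase_prefix does the prefix completion, while the should clause on the shingled subfield boosts documents where the words actually appear as a phrase.
GET /my_index/_search
{
"query": {
"bool": {
"must": {
"match_phrase_prefix": {
"title": "quick fo"
}
},
"should": {
"match": {
"title.shingles": "quick fo"
}
}
}
}
}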
Related
I have an index whose values are 2-4 characters with no spaces, but users often search for the "full term", which I don't have indexed and which has 3 extra characters after a blank space.
Ex: I index "A1", "A1B", or "A1B2", and the "full term" is something like
"A1 11A", "A1B ABA", or "A1B2 2C8".
This is current mapping:
"code": {
"type": "text"
},
If he searches "A1", it brings all of them, which is correct; if he types "A1B", I want to bring only the last two; and if he searches "A1B2 2C8", I want to bring only the last one.
Is that possible? If so, what would be the best search/index strategy?
Index Mapping:
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 10
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
},
"mappings": {
"properties": {
"code": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
Index data:
{
"code": "A1"
}
{
"code": "A1B"
}
{
"code": "A1B2"
}
Search Query:
{
"query": {
"match": {
"code": {
"query": "A1B2 2C8"
}
}
}
}
Search Result:
"hits": [
{
"_index": "65067196",
"_type": "_doc",
"_id": "3",
"_score": 1.3486402,
"_source": {
"code": "A1B2"
}
}
]
I am currently implementing Elasticsearch in my application. Please assume that "Hello World" is the data we need to search. Our requirement is that we should get the result by entering "h", "Hello World", or "Hello Worlds" as the keyword.
This is our current query.
{
"query": {
"wildcard" : {
"title" : {
"value" : "h*"
}
}
}
}
By using this we are getting the right result for the keyword "h". But we also need to get results in the case of small spelling mistakes.
You need to use the english analyzer, which stems tokens to their root form. More info can be found here.
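If you want to see what the english analyzer does to your text, the _analyze API is a quick way to check (a sketch; no index required):
GET /_analyze
{
"analyzer": "english",
"text": "Hello Worlds"
}
This should return the stemmed tokens hello and world, which is why a query for "Hello Worlds" can still match a document containing "Hello World".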
I implemented it by taking your example data, query and expected results using the edge n-gram analyzer and match query.
Index Mapping
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 10
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "english"
}
}
}
}
Index document
{
"title" : "Hello World"
}
Search query for h and its result
{
"query": {
"match": {
"title": "h"
}
}
}
"hits": [
{
"_index": "so-60524477-partial-key",
"_type": "_doc",
"_id": "1",
"_score": 0.42763555,
"_source": {
"title": "Hello World"
}
}
]
Search query for Hello Worlds and same document comes in result
{
"query": {
"match": {
"title": "Hello worlds"
}
}
}
Result
"hits": [
{
"_index": "so-60524477-partial-key",
"_type": "_doc",
"_id": "1",
"_score": 0.8552711,
"_source": {
"title": "Hello World"
}
}
]
Edge n-grams and n-grams have better performance than wildcards. For a wildcard, all documents have to be scanned to see which match the pattern; n-grams instead break text into small tokens at index time.
Ex: "Quick Foxes" will be stored as [ Qu, Qui, Quic, Quick, Fo, Fox, Foxe, Foxes ], depending on the min_gram and max_gram sizes.
Fuzziness can be used to find similar terms
Mapping
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"properties": {
"text":{
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
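Before running the query, you can check what the edge_ngram tokenizer actually emits with the _analyze API (a sketch against the index above):
GET my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "hello"
}
With min_gram 1, this should return the tokens h, he, hel, hell, hello.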
Query
GET my_index/_search
{
"query": {
"match": {
"text": {
"query": "hello worlds",
"fuzziness": 1
}
}
}
}
I have implemented auto-suggest using Elasticsearch, where I give suggestions to users based on the typed value, e.g. 'where'. Most of it works fine if I type the full word or a few starting characters of a word. I want to highlight only the specific characters typed by the user: say the user types 'ca', then suggestions should highlight only the 'Ca' in 'California' and not the whole word.
The highlight tag should produce a result like <b>Ca</b>lifornia, not <b>California</b>.
Here is my index settings
{
"settings": {
"index": {
"analysis": {
"filter": {
"edge_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 50
},
"lowercase_filter":{
"type":"lowercase",
"language": "greek"
},
"metro_synonym": {
"type": "synonym",
"synonyms_path": "metro_synonyms.txt"
},
"profession_specialty_synonym": {
"type": "synonym",
"synonyms_path": "profession_specialty_synonyms.txt"
}
},
"analyzer": {
"auto_suggest_analyzer": {
"filter": [
"lowercase",
"edge_filter"
],
"type": "custom",
"tokenizer": "whitespace"
},
"auto_suggest_search_analyzer": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "whitespace"
},
"lowercase": {
"filter": [
"trim",
"lowercase"
],
"type": "custom",
"tokenizer": "keyword"
}
}
}
}
},
"mappings": {
"properties": {
"what_auto_suggest": {
"type": "text",
"analyzer": "auto_suggest_analyzer",
"search_analyzer": "auto_suggest_search_analyzer",
"fields": {
"raw":{
"type":"keyword"
}
}
},
"company": {
"type": "text",
"analyzer": "lowercase"
},
"where_auto_suggest": {
"type": "text",
"analyzer": "auto_suggest_analyzer",
"search_analyzer": "auto_suggest_search_analyzer",
"fields": {
"raw":{
"type":"keyword"
}
}
},
"tags_auto_suggest": {
"type": "text",
"analyzer": "auto_suggest_analyzer",
"search_analyzer": "auto_suggest_search_analyzer",
"fields": {
"raw":{
"type":"keyword"
}
}
}
}
}
}
Query i am using to pull suggestions -
GET /autosuggest_index_test/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"where_auto_suggest": {
"query": "ca",
"operator": "and"
}
}
}
]
}
},
"aggs": {
"NAME": {
"terms": {
"field": "where_auto_suggest.raw",
"size": 10
}
}
},
"highlight": {
"pre_tags": [
"<b>"
],
"post_tags": [
"</b>"
],
"fields": {
"where_auto_suggest": {
}
}
}
}
One of json result that I am getting -
{
"_index" : "autosuggest_index_test",
"_type" : "_doc",
"_id" : "Calabasas CA",
"_score" : 5.755663,
"_source" : {
"where_auto_suggest" : "Calabasas CA"
},
"highlight" : {
"where_auto_suggest" : [
"<b>Calabasas</b> <b>CA</b>"
]
}
}
Can someone please suggest how to get output here (in where_auto_suggest) like "<b>Ca</b>labasas <b>CA</b>"?
I don't really know why, but if you use an edge_ngram tokenizer instead of an edge_ngram filter, you will get highlighted characters instead of highlighted words.
So in your settings, you could declare such a tokenizer :
"settings": {
"index": {
"analysis": {
"tokenizer": {
"edge_tokenizer": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 50,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
}
},
...
}
}
}
And change your analyzer to :
"analyzer": {
"auto_suggest_analyzer": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "edge_tokenizer"
}
...
}
Thus your example request will return
{
...
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.2876821,
"hits": [
{
"_index": "autosuggest_index_test",
"_type": "_doc",
"_id": "grIzo28BY9R4-IxJhcFv",
"_score": 0.2876821,
"_source": {
"where_auto_suggest": "california"
},
"highlight": {
"where_auto_suggest": [
"<b>ca</b>lifornia"
]
}
}
]
}
...
}
Here's a simplification of what I have:
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"autocomplete": {
"tokenizer": "autocomplete",
"filter": [
"lowercase"
]
},
"autocomplete_search": {
"tokenizer": "lowercase"
}
},
"tokenizer": {
"autocomplete": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 10,
"token_chars": [
"letter"
]
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "autocomplete_search"
}
}
}
}
}
PUT my_index/_doc/1
{
"title": "Quick Foxes"
}
PUT my_index/_doc/2
{
"title": "Quick Fuxes"
}
PUT my_index/_doc/3
{
"title": "Foxes Quick"
}
PUT my_index/_doc/4
{
"title": "Foxes Slow"
}
I am trying to search for Quick Fo to test the autocomplete:
GET my_index/_search
{
"query": {
"match": {
"title": {
"query": "Quick Fo",
"operator": "and"
}
}
}
}
The problem is that this query also returns "Foxes Quick", where I expected only "Quick Foxes":
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.5753642,
"hits": [
{
"_index": "my_index",
"_type": "_doc",
"_id": "1",
"_score": 0.5753642,
"_source": {
"title": "Quick Foxes"
}
},
{
"_index": "my_index",
"_type": "_doc",
"_id": "3",
"_score": 0.5753642,
"_source": {
"title": "Foxes Quick" <<<----- WHY???
}
}
]
}
}
What can I tweak so that I get a classic "autocomplete", where "Quick Fo" surely won't return "Foxes Quick" but only "Quick Foxes"?
---- ADDITIONAL INFO -----------------------
This worked for me:
PUT my_index1
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"text": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
}
PUT my_index1/_doc/1
{
"text": "Quick Brown Fox"
}
PUT my_index1/_doc/2
{
"text": "Quick Frown Fox"
}
PUT my_index1/_doc/3
{
"text": "Quick Fragile Fox"
}
GET my_index1/_search
{
"query": {
"match": {
"text": {
"query": "quick br",
"operator": "and"
}
}
}
}
The issue is due to your search analyzer autocomplete_search, in which you are using the lowercase tokenizer: your search term Quick Fo is divided into two terms, quick and fo (note the lowercase), and these are matched against the tokens generated by the autocomplete analyzer on your indexed docs.
Now, the title Foxes Quick uses the autocomplete analyzer and will have both quick and fo among its tokens, hence it matches the search-term tokens.
You can simply use the _analyze API to check the tokens generated for your documents, as well as for your search term, to understand this better.
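For example (a sketch against the index above):
GET my_index/_analyze
{
"analyzer": "autocomplete",
"text": "Foxes Quick"
}
Among the returned tokens you should see both fo and quick (min_gram is 2), which is exactly why the lowercased search terms quick and fo match this document.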
Please refer to the official ES doc https://www.elastic.co/guide/en/elasticsearch/guide/master/_index_time_search_as_you_type.html on how to implement autocomplete. They also use a different search-time analyzer, but there are certain limitations and it can't solve all use cases (esp. if you have docs like yours), hence I implemented it using a different design based on the business requirements.
Hope I was clear on explaining why it returns the second doc in your case.
EDIT: Also, in your case, IMO match_phrase_prefix would be more useful.
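A hedged sketch of what that could look like here; match_phrase_prefix respects word order, so "Quick Fo" would match "Quick Foxes" but not "Foxes Quick":
GET my_index/_search
{
"query": {
"match_phrase_prefix": {
"title": "Quick Fo"
}
}
}
Note that this works against a plain text field without any edge n-gram analysis, at the cost of doing the prefix expansion at query time.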
I've read a lot, and it seems that using edge n-grams is a good way to implement an autocomplete feature for search applications. I've already configured them in the settings for my index:
PUT /bigtestindex
{
"settings":{
"analysis":{
"analyzer":{
"autocomplete":{
"type":"custom",
"tokenizer":"standard",
"filter":[ "standard", "stop", "kstem", "ngram" ]
}
},
"filter":{
"edgengram":{
"type":"ngram",
"min_gram":2,
"max_gram":15
}
},
"highlight": {
"pre_tags" : ["<em>"],
"post_tags" : ["</em>"],
"fields": {
"title.autocomplete": {
"number_of_fragments": 1,
"fragment_size": 250
}
}
}
}
}
}
So if I have the edge n-gram filter configured in my settings, how do I add it to the search query?
What I have so far is a match query with highlight:
GET /bigtestindex/doc/_search
{
"query": {
"match": {
"content": {
"query": "thing and another thing",
"operator": "and"
}
}
},
"highlight": {
"pre_tags" : ["<em>"],
"post_tags" : ["</em>"],
"field": {
"_source.content": {
"number_of_fragments": 1,
"fragment_size": 250
}
}
}
}
How would I add autocomplete to the search query using the edge n-grams configured in the settings for the index?
UPDATE
For the mapping, would it be ideal to do something like this:
"title": {
"type": "string",
"index_analyzer": "autocomplete",
"search_analyzer": "standard"
},
Or do I need to use multi_field type:
"title": {
"type": "multi_field",
"fields": {
"title": {
"type": "string"
},
"autocomplete": {
"type": "string",
"analyzer": "autocomplete"
}
}
},
I'm using ES 1.4.1 and want to use the title field for autocomplete purposes. Which is right?
Short answer: you need to use it in a field mapping. As in:
PUT /test_index
{
"settings": {
"analysis": {
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"standard",
"stop",
"kstem",
"ngram"
]
}
},
"filter": {
"edgengram": {
"type": "ngram",
"min_gram": 2,
"max_gram": 15
}
}
}
},
"mappings": {
"doc": {
"properties": {
"field1": {
"type": "string",
"index_analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
}
For a bit more discussion, see:
http://blog.qbox.io/multi-field-partial-word-autocomplete-in-elasticsearch-using-ngrams
and
http://blog.qbox.io/an-introduction-to-ngrams-in-elasticsearch
Also, I don't think you want the "highlight" section in your index definition; that belongs in the query.
EDIT: Upon trying out your code, I found a couple of problems with it. One was the highlight issue I already mentioned. Another is that you named your filter "edgengram", even though it is of type "ngram" rather than type "edgeNGram", but then referenced the filter "ngram" in your analyzer, which uses the default ngram filter and probably doesn't give you what you want. (Hint: you can use term vectors to figure out what your analyzer is doing to your documents; you probably want to turn them off in production, though.)
So what you actually want is probably something like this:
PUT /test_index
{
"settings": {
"analysis": {
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"standard",
"stop",
"kstem",
"edgengram_filter"
]
}
},
"filter": {
"edgengram_filter": {
"type": "edgeNGram",
"min_gram": 2,
"max_gram": 15
}
}
}
},
"mappings": {
"doc": {
"properties": {
"content": {
"type": "string",
"index_analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
}
When I indexed these two docs:
POST test_index/doc/_bulk
{"index":{"_id":1}}
{"content":"hello world"}
{"index":{"_id":2}}
{"content":"goodbye world"}
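As a side note on the term-vectors hint above, this is a sketch of how to inspect the stored tokens for one of these docs (on ES 1.x the endpoint is _termvector; later versions use _termvectors):
GET /test_index/doc/1/_termvector?fields=content
The response lists every token the autocomplete analyzer produced for the content field, e.g. the edge n-grams of "hello" and "world".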
And ran this query (there was an error in your "highlight" block as well; it should have said "fields" rather than "field"):
POST /test_index/doc/_search
{
"query": {
"match": {
"content": {
"query": "good wor",
"operator": "and"
}
}
},
"highlight": {
"pre_tags": [
"<em>"
],
"post_tags": [
"</em>"
],
"fields": {
"content": {
"number_of_fragments": 1,
"fragment_size": 250
}
}
}
}
I get back this response, which seems to be what you're looking for, if I understand you correctly:
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.2712221,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "2",
"_score": 0.2712221,
"_source": {
"content": "goodbye world"
},
"highlight": {
"content": [
"<em>goodbye</em> <em>world</em>"
]
}
}
]
}
}
Here is some code I used to test it out:
http://sense.qbox.io/gist/3092992993e0328f7c4ee80e768dd508a0bc053f