I am currently implementing Elasticsearch in my application. Assume that "Hello World" is the data we need to search. Our requirement is that we should get the result whether the keyword entered is "h", "Hello World", or "Hello Worlds".
This is our current query.
{
"query": {
"wildcard" : {
"message" : {
"title" : "h*"
}
}
}
}
Using this we get the right result for the keyword "h". But we also need to get results when there are small spelling mistakes.
You need to use the english analyzer, which stems tokens to their root form; more info can be found in the official docs on language analyzers.
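As a quick check (a sketch using the built-in _analyze API), you can see that the english analyzer reduces "Worlds" to "world", which is why a query for "Hello Worlds" can still match:
POST _analyze
{
  "analyzer": "english",
  "text": "Hello Worlds"
}
This returns the tokens hello and world.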
I implemented it below using your example data, query and expected results, with an edge n-gram analyzer at index time and a match query.
Index Mapping
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 10
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "english"
}
}
}
}
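To see why a single letter like h matches, inspect what the autocomplete analyzer produces at index time (a sketch, assuming the index is named so-60524477-partial-key as in the results below):
POST so-60524477-partial-key/_analyze
{
  "analyzer": "autocomplete",
  "text": "Hello World"
}
This returns h, he, hel, hell, hello, w, wo, wor, worl and world, so the single-character query term h matches an indexed token exactly.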
Index document
{
"title" : "Hello World"
}
Search query for h and its result
{
"query": {
"match": {
"title": "h"
}
}
}
"hits": [
{
"_index": "so-60524477-partial-key",
"_type": "_doc",
"_id": "1",
"_score": 0.42763555,
"_source": {
"title": "Hello World"
}
}
]
Search query for "Hello Worlds"; the same document comes back in the result
{
"query": {
"match": {
"title": "Hello worlds"
}
}
}
Result
"hits": [
{
"_index": "so-60524477-partial-key",
"_type": "_doc",
"_id": "1",
"_score": 0.8552711,
"_source": {
"title": "Hello World"
}
}
]
Edge n-grams and n-grams perform better than wildcards. For a wildcard query, every term in the index has to be scanned to see which ones match the pattern, whereas n-grams break text into small tokens at index time.
For example, Quick Foxes will be stored as [ Qu, Qui, Quic, Quick, Fo, Fox, Foxe, Foxes ], depending on the min_gram and max_gram sizes.
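You can verify this with the _analyze API (a sketch using an inline edge_ngram tokenizer with min_gram 2 and max_gram 5):
POST _analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 2,
    "max_gram": 5,
    "token_chars": [ "letter" ]
  },
  "text": "Quick Foxes"
}
which returns exactly the tokens listed above.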
Fuzziness can then be used to match terms with small spelling mistakes.
Mapping
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"properties": {
"text":{
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
Query
GET my_index/_search
{
"query": {
"match": {
"text": {
"query": "hello worlds",
"fuzziness": 1
}
}
}
}
Related
We have some fields containing article numbers. An article number looks like AB 987 g567 323. When I search for "AB 987 g" I find the right product, but when I search without whitespace I don't find anything. I tried pattern_replace, but it doesn't work:
"whitespace_filter": {
"alphabets_char_filter": {
"type": "pattern_replace",
"pattern": " ",
"replacement": ""
}
How can I search for article numbers both with and without whitespace?
You need to use an edge_ngram tokenizer along with a char_filter to achieve this use case.
Adding a working example
Index Mapping:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer",
"char_filter": [
"replace_whitespace"
]
}
},
"tokenizer": {
"my_tokenizer": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 10,
"token_chars": [
"letter",
"digit"
]
}
},
"char_filter": {
"replace_whitespace": {
"type": "mapping",
"mappings": [
"\\u0020=>"
]
}
}
}
},
"mappings": {
"properties": {
"articlenumbers": {
"type": "text",
"fields": {
"analyzed": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
}
}
Index Data:
{
"articlenumbers": "AB 987 g567 323"
}
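To confirm that the char_filter strips the spaces before the edge n-grams are built, you can run _analyze against the index (a sketch; the index is named 65936531 in the result below):
POST 65936531/_analyze
{
  "analyzer": "my_analyzer",
  "text": "AB 987 g567 323"
}
This produces AB, AB9, AB98, AB987, AB987g and so on up to max_gram, which is why the query AB987g below finds the document.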
Search Query:
{
"query": {
"multi_match": {
"query": "AB987g",
"fields": [
"articlenumbers",
"articlenumbers.analyzed"
]
}
}
}
Search Result:
"hits": [
{
"_index": "65936531",
"_type": "_doc",
"_id": "1",
"_score": 1.4384104,
"_source": {
"articlenumbers": "AB 987 g567 323"
}
}
]
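The variant with whitespace should match as well; the same multi_match (a sketch) finds the document through the plain articlenumbers field:
{
  "query": {
    "multi_match": {
      "query": "AB 987 g",
      "fields": [
        "articlenumbers",
        "articlenumbers.analyzed"
      ]
    }
  }
}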
I have indexed values that are 2-4 characters with no spaces, but users often search for the "full term", which I don't have indexed and which has 3 extra characters after a blank space.
Ex: I index "A1" or "A1B" or "A1B2" and the "full term" is something like
"A1 11A" or "A1B ABA" or "A1B2 2C8".
This is the current mapping:
"code": {
"type": "text"
},
If the user searches "A1" it brings all of them, which is correct; if they type "A1B" I want to bring only the last two; and if they search "A1B2 2C8" I want to bring only the last one.
Is that possible? If so, what would be the best search/index strategy?
Index Mapping:
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 10
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
},
"mappings": {
"properties": {
"code": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
Index data:
{
"code": "A1"
}
{
"code": "A1B"
}
{
"code": "A1B2"
}
Search Query:
{
"query": {
"match": {
"code": {
"query": "A1B2 2C8"
}
}
}
}
Search Result:
"hits": [
{
"_index": "65067196",
"_type": "_doc",
"_id": "3",
"_score": 1.3486402,
"_source": {
"code": "A1B2"
}
}
]
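Likewise (a sketch), searching for A1B returns only the last two documents, because the standard search analyzer produces the single token a1b, which exists only among the edge n-grams of A1B and A1B2:
{
  "query": {
    "match": {
      "code": {
        "query": "A1B"
      }
    }
  }
}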
I have this query:
{
"query": {
"match": {
"tag": {
"query": "john smith",
"operator": "and"
}
}
}
}
With the and operator I managed to return documents where the words "john" and "smith" must both be present in the tag field, in any position and any order. But I need to return documents where all partial words, like "joh" and "smit", are present in the tag field. I tried this:
{
"query": {
"match": {
"tag": {
"query": "*joh* *smit*",
"operator": "and"
}
}
}
}
but nothing returns. How can I solve this?
You can use an edge_ngram token filter and a bool query with multiple must clauses (based on your second example) to get the desired output.
Working example:
Index Def
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram", --> note this
"min_gram": 1,
"max_gram": 10
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
Index two sample docs, one which should match and one which shouldn't
{
"title" : "john bravo" --> show;dn't match
}
{
"title" : "john smith" --> should match
}
Boolean Search query with must clause
{
"query": {
"bool": {
"must": [ --> this means both `jon` and `smit` match clause must match, replacement of your `and` operator.
{
"match": {
"title": "joh"
}
},
{
"match": {
"title": "smit"
}
}
]
}
}
}
Search result
"hits": [
{
"_index": "so_partial",
"_type": "_doc",
"_id": "1",
"_score": 1.2840209,
"_source": {
"title": "john smith"
}
}
]
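Alternatively (a sketch), a single match query with the and operator behaves the same way here, because the standard search analyzer splits the input into the two required terms joh and smit:
{
  "query": {
    "match": {
      "title": {
        "query": "joh smit",
        "operator": "and"
      }
    }
  }
}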
Here's a simplification of what I have:
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"autocomplete": {
"tokenizer": "autocomplete",
"filter": [
"lowercase"
]
},
"autocomplete_search": {
"tokenizer": "lowercase"
}
},
"tokenizer": {
"autocomplete": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 10,
"token_chars": [
"letter"
]
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "autocomplete_search"
}
}
}
}
}
PUT my_index/_doc/1
{
"title": "Quick Foxes"
}
PUT my_index/_doc/2
{
"title": "Quick Fuxes"
}
PUT my_index/_doc/3
{
"title": "Foxes Quick"
}
PUT my_index/_doc/4
{
"title": "Foxes Slow"
}
I am trying to search for Quick Fo to test the autocomplete:
GET my_index/_search
{
"query": {
"match": {
"title": {
"query": "Quick Fo",
"operator": "and"
}
}
}
}
The problem is that this query also returns 'Foxes Quick', where I expected only 'Quick Foxes':
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.5753642,
"hits": [
{
"_index": "my_index",
"_type": "_doc",
"_id": "1",
"_score": 0.5753642,
"_source": {
"title": "Quick Foxes"
}
},
{
"_index": "my_index",
"_type": "_doc",
"_id": "3",
"_score": 0.5753642,
"_source": {
"title": "Foxes Quick" <<<----- WHY???
}
}
]
}
}
What can I tweak so that I can query a classic "autocomplete" where "Quick Fo" surely won't return "Foxes Quick"..... but only "Quick Foxes"?
---- ADDITIONAL INFO -----------------------
This worked for me:
PUT my_index1
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"text": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
}
PUT my_index1/_doc/1
{
"text": "Quick Brown Fox"
}
PUT my_index1/_doc/2
{
"text": "Quick Frown Fox"
}
PUT my_index1/_doc/3
{
"text": "Quick Fragile Fox"
}
GET my_index1/_search
{
"query": {
"match": {
"text": {
"query": "quick br",
"operator": "and"
}
}
}
}
The issue is due to your search analyzer autocomplete_search, which uses the lowercase tokenizer. Your search term Quick Fo is therefore split into two terms, quick and fo (note the lowercase), which are matched against the tokens generated by the autocomplete analyzer on your indexed docs.
The title Foxes Quick is indexed with the autocomplete analyzer and therefore contains both the quick and fo tokens, hence it matches the search-term tokens.
You can simply use the _analyze API to check the tokens generated for your documents, as well as for your search term, to understand this better.
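For example (a sketch against the my_index mapping above):
GET my_index/_analyze
{
  "analyzer": "autocomplete_search",
  "text": "Quick Fo"
}
GET my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "Foxes Quick"
}
The first returns quick and fo; the second returns fo, fox, foxe, foxes, qu, qui, quic and quick, so both search terms are present in the indexed tokens and the document matches.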
Please refer to the official ES guide https://www.elastic.co/guide/en/elasticsearch/guide/master/_index_time_search_as_you_type.html on how to implement index-time search-as-you-type. It also uses a different search-time analyzer, but there are limitations to that approach and it can't solve all use cases (especially with docs like yours), hence I implemented it with a different design based on the business requirements.
Hope I was clear on explaining why it's returning the second doc in your case.
EDIT: Also, in your case, IMO a match_phrase_prefix query would be more useful.
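A minimal sketch of that idea, assuming title is also indexed into a plain sub-field with the standard analyzer (a hypothetical title.raw here), since phrase matching relies on token positions rather than on the edge n-gram tokens:
GET my_index/_search
{
  "query": {
    "match_phrase_prefix": {
      "title.raw": {
        "query": "Quick Fo"
      }
    }
  }
}
This only matches documents where a token starting with fo directly follows quick, so Foxes Quick is excluded.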
I'm trying to achieve Google-style autocomplete & autocorrection with Elasticsearch.
Mappings :
POST music
{
"settings": {
"analysis": {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 20,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
}
},
"analyzer": {
"nGram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding",
"nGram_filter"
]
},
"whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
},
"mappings": {
"song": {
"properties": {
"song_field": {
"type": "string",
"analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer"
},
"suggest": {
"type": "completion",
"analyzer": "simple",
"search_analyzer": "simple",
"payloads": true
}
}
}
}
}
Docs:
POST music/song
{
"song_field" : "beautiful queen",
"suggest" : "beautiful queen"
}
POST music/song
{
"song_field" : "beautiful",
"suggest" : "beautiful"
}
I expect that when a user types "beaatiful q" they will get something like "beautiful queen" ("beaatiful" is corrected to "beautiful" and "q" is completed to "queen").
I've tried the following query:
POST music/song/_search?search_type=dfs_query_then_fetch
{
"size": 10,
"suggest": {
"didYouMean": {
"text": "beaatiful q",
"completion": {
"field": "suggest"
}
}
},
"query": {
"match": {
"song_field": {
"query": "beaatiful q",
"fuzziness": 2
}
}
}
}
Unfortunately, the completion suggester doesn't allow any typos, so I get this response:
"suggest": {
"didYouMean": [
{
"text": "beaatiful q",
"offset": 0,
"length": 11,
"options": []
}
]
}
In addition, the search gave me these results ("beautiful" ranked higher, although the user had started to write "queen"):
"hits": [
{
"_index": "music",
"_type": "song",
"_id": "AVUj4Y5NancUpEdFLeLo",
"_score": 0.51315063,
"_source": {
"song_field": "beautiful"
"suggest": "beautiful"
}
},
{
"_index": "music",
"_type": "song",
"_id": "AVUj4XFAancUpEdFLeLn",
"_score": 0.32071912,
"_source": {
"song_field": "beautiful queen"
"suggest": "beautiful queen"
}
}
]
UPDATE !!!
I found out that I can use a fuzzy query with the completion suggester, but now I get no suggestions when querying (fuzzy only supports an edit distance of 2):
POST music/song/_search
{
"size": 10,
"suggest": {
"didYouMean": {
"text": "beaatefal q",
"completion": {
"field": "suggest",
"fuzzy" : {
"fuzziness" : 2
}
}
}
}
}
I still expect "beautiful queen" as suggestion response.
When you want to provide 2 or more words as search suggestions, I have found out (the hard way) that it's not worth it to use n-grams or edge n-grams in Elasticsearch.
Using the shingle token filter and a shingle-based analyzer will provide you with multi-word phrases, and if you couple that with match_phrase_prefix it should give you the functionality you're looking for (see the query sketch after the mapping below).
Basically something like this:
PUT /my_index
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"my_shingle_filter": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 2,
"output_unigrams": false
}
},
"analyzer": {
"my_shingle_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"my_shingle_filter"
]
}
}
}
}
}
And don't forget to do your mapping:
{
"my_type": {
"properties": {
"title": {
"type": "string",
"fields": {
"shingles": {
"type": "string",
"analyzer": "my_shingle_analyzer"
}
}
}
}
}
}
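A minimal query sketch against that mapping (assuming your documents are reindexed into it with a title field):
GET my_index/_search
{
  "query": {
    "match_phrase_prefix": {
      "title.shingles": "beautiful q"
    }
  }
}
The query is shingled into the single term beautiful q, which is a prefix of the indexed shingle beautiful queen, so "beautiful queen" is returned, while the single-word title "beautiful" produces no shingles and is not.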
N-grams and edge n-grams tokenize down to single characters, whereas the shingle analyzer and filter group adjacent words into phrases, providing a much more efficient way of producing and searching for phrases. I spent a lot of time messing with the two above until I saw shingles mentioned and read up on it. Much better.