Ignore a specific character in a fuzzy search analyzer in Elasticsearch

I have a fuzzy search analyzer in Elasticsearch with the following index settings and documents:
PUT test_index
{
"settings": {
"index": {
"max_ngram_diff": 40
},
"analysis": {
"analyzer": {
"autocomplete": {
"tokenizer": "whitespace",
"filter": [
"lowercase",
"autocomplete"
]
},
"autocomplete_search": {
"tokenizer": "whitespace",
"filter": [
"lowercase"
]
}
},
"filter": {
"autocomplete": {
"type": "ngram",
"min_gram": 2,
"max_gram": 40
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "autocomplete_search"
}
}
}
}
PUT test_index/_doc/1
{ "title": "HRT 2018-BN18 N-SB" }
PUT test_index/_doc/2
{ "title": "GMC 2019-BN18 A-SB" }
How can I ignore the hyphen ('-') during my fuzzy search so that GMC 2019-BN18 A-SB, gmc 2019, gmc 2019-BN18 A-SB and GMC 2019-BN18 ASB all return the same document?
I tried to create a separate analyzer, but I am not sure how to apply multiple analyzers to the same field:
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"char_filter": [
"my_char_filter"
]
}
},
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": [
"- => "
]
}
}
}
}

You're on the right path. You just need to add that character filter to both analyzers so that the hyphens are removed at both index time and search time:
PUT test_index
{
"settings": {
"index": {
"max_ngram_diff": 40
},
"analysis": {
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": [
"- => "
]
}
},
"analyzer": {
"autocomplete": {
"char_filter": [
"my_char_filter"
],
"tokenizer": "whitespace",
"filter": [
"lowercase",
"autocomplete"
]
},
"autocomplete_search": {
"char_filter": [
"my_char_filter"
],
"tokenizer": "whitespace",
"filter": [
"lowercase"
]
}
},
"filter": {
"autocomplete": {
"type": "ngram",
"min_gram": 2,
"max_gram": 40
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "autocomplete_search"
}
}
}
}
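To see why adding the character filter to both analyzers is enough, here is a minimal sketch that emulates the two analysis chains in plain Python. It is not Elasticsearch code: the function names are made up for illustration, and the all-terms check is a simplification of how a match query is scored.

```python
# Plain-Python emulation of the two analyzers above (illustrative only).

def strip_hyphens(text):
    # Stand-in for the "mapping" char filter with the rule "- => ".
    return text.replace("-", "")

def ngrams(token, min_gram=2, max_gram=40):
    # Stand-in for the "ngram" token filter: every substring of an allowed length.
    longest = min(max_gram, len(token))
    return {
        token[start:start + size]
        for size in range(min_gram, longest + 1)
        for start in range(len(token) - size + 1)
    }

def index_terms(text):
    # "autocomplete": char filter -> whitespace tokenizer -> lowercase -> ngram.
    terms = set()
    for token in strip_hyphens(text).split():
        terms |= ngrams(token.lower())
    return terms

def search_terms(text):
    # "autocomplete_search": char filter -> whitespace tokenizer -> lowercase.
    return [token.lower() for token in strip_hyphens(text).split()]

def all_terms_match(query, document):
    # Simplification: a document "matches" when every search term was indexed.
    indexed = index_terms(document)
    return all(term in indexed for term in search_terms(query))

doc_1 = "HRT 2018-BN18 N-SB"
doc_2 = "GMC 2019-BN18 A-SB"
queries = ["GMC 2019-BN18 A-SB", "gmc 2019", "gmc 2019-BN18 A-SB", "GMC 2019-BN18 ASB"]

for query in queries:
    print(query, "->", all_terms_match(query, doc_2), all_terms_match(query, doc_1))
```

In this emulation every query matches document 2 and none matches document 1. Without the hyphen removal, A-SB would be indexed as grams such as a-, -s and a-sb, so the search term asb would find nothing.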

Related

How to make edge_ngram tokens match with a certain number of words between them?

I'm trying to write a search request that returns results only when fewer than 5 words separate the requested tokens.
{
"settings": {
"index": {
"analysis": {
"filter": {
"stopWords": {
"type": "stop",
"stopwords": [
"_english_"
]
}
},
"normalizer": {
"lowercaseNormalizer": {
"filter": [
"lowercase",
"asciifolding"
],
"type": "custom",
"char_filter": []
}
},
"analyzer": {
"autoCompleteAnalyzer": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "autoCompleteTokenizer"
},
"autoCompleteSearchAnalyzer": {
"type": "custom",
"tokenizer": "lowercase"
},
"charGroupAnalyzer": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "charGroupTokenizer"
}
},
"tokenizer": {
"charGroupTokenizer": {
"type": "char_group",
"max_token_length": "20",
"tokenize_on_chars": [
"whitespace",
"-",
"\n"
]
},
"autoCompleteTokenizer": {
"token_chars": [
"letter"
],
"min_gram": "3",
"type": "edge_ngram",
"max_gram": "20"
}
}
}
}
}
}
The mappings:
{
"mappings": {
"_doc": {
"properties": {
"description": {
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 64
}
},
"analyzer": "autoCompleteAnalyzer",
"search_analyzer": "autoCompleteSearchAnalyzer"
},
"text": {
"type": "text",
"analyzer": "charGroupAnalyzer"
}
}
}
}
}
}
}
And I make a bool query like this:
{
"query": {
"bool": {
"must": [
{
"multi_match": {
"fields": [
"description.name"
],
"operator": "and",
"query": "rounded elephant",
"fuzziness": 1
}
},
{
"match_phrase": {
"description.text": {
"analyzer": "charGroupAnalyzer",
"query": "rounded elephant",
"slop": 5,
"boost": 20
}
}
}
]
}
}
}
I expect the request to retrieve documents whose description contains:
... rounded very interesting elephant ...
This works well when I use complete words, like rounded elephant.
But when I enter word prefixes, like round eleph, it fails.
The fields description.name and description.text obviously have different tokenizers (name contains ngram tokens, while text contains word tokens), so I get completely wrong results.
How can I configure the mappings and the search so that I can use ngrams with a distance between tokens?

Synonyms for Sankt and St

I'm trying to get synonyms working for my existing setup. Currently I have these settings:
PUT city
{
"settings": {
"analysis": {
"analyzer": {
"autocomplete": {
"tokenizer": "autocomplete",
"filter": [
"lowercase",
"my_synonym_filter",
"german_normalization",
"my_ascii_folding"
]
},
"autocomplete_search": {
"tokenizer": "lowercase",
"filter": [
"lowercase",
"my_synonym_filter",
"german_normalization",
"my_ascii_folding"
]
}
},
"filter": {
"my_ascii_folding": {
"type": "asciifolding",
"preserve_original": true
},
"my_synonym_filter": {
"type": "synonym",
"ignore_case": "true",
"synonyms": [
"sankt, st => sankt"
]
}
},
"tokenizer": {
"autocomplete": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 15,
"token_chars": [
"letter",
"digit",
"symbol"
]
}
}
}
},
"mappings": {
"city": {
"properties": {
"name": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "autocomplete_search"
}
}
}
}
}
In this city index I have documents like St. Wolfgang or Sankt Wolfgang, and so on. For me, St. and Sankt are synonyms, so if I search for Sankt, both documents should appear.
I created a new filter and added it to my autocomplete analyzer:
"my_synonym_filter": {
"type": "synonym",
"ignore_case": "true",
"synonyms": [
"sankt, st."
]
}
So far so good. But I ran into the following issues:
It's clear that the dot after st is not analyzed and not searchable at the moment, but the dot is important for the synonym.
The second issue is that if I search for sankt, the synonym is st, which returns all documents that start with st, like Stuttgart. This also happens because the dot is not used.
Do you have any idea how I can achieve this? If you need any more information, please let me know.
Update:
After some discussion I made these changes to my settings:
changed the edge_ngram tokenizer to a standard tokenizer.
added an edgeNGram filter and added this filter to my analyzer.
removed the german_normalization and my_ascii_folding filters from my analyzer to simplify the tests.
PUT city
{
"settings": {
"analysis": {
"analyzer": {
"autocomplete": {
"tokenizer": "autocomplete",
"filter": [
"lowercase",
"my_synonym_filter",
"edge_filter"
]
},
"autocomplete_search": {
"tokenizer": "autocomplete",
"filter": [
"my_synonym_filter",
"lowercase"
]
}
},
"filter": {
"edge_filter": {
"type": "edgeNGram",
"min_gram": 1,
"max_gram": 15
},
"my_synonym_filter": {
"type": "synonym",
"ignore_case": "true",
"synonyms": [
"sankt, st => sankt"
]
}
},
"tokenizer": {
"autocomplete": {
"type": "standard"
}
}
}
},
"mappings": {
"city": {
"properties": {
"name": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "autocomplete_search"
}
}
}
}
}
I added these 3 documents to the index:
"name":"Sankt Wolfgang",
"name":"Stuttgart",
"name":"St. Wolfgang"
Query String - Result
st -> "St. Wolfgang", "Stuttgart"
st. -> "St. Wolfgang", "Sankt Wolfgang"
sankt -> "St. Wolfgang", "Sankt Wolfgang"
This works pretty well for me. The main point here is to make sure to
put the synonym filter after the lowercase one
put the edge-n-gram filter at the end
use the edge-n-gram only at indexing time
So we create the index:
PUT city
{
"settings": {
"analysis": {
"analyzer": {
"autocomplete": {
"tokenizer": "standard",
"filter": [
"lowercase",
"my_synonym_filter",
"edge_filter"
]
},
"autocomplete_search": {
"tokenizer": "standard",
"filter": [
"lowercase",
"my_synonym_filter"
]
}
},
"filter": {
"edge_filter": {
"type": "edgeNGram",
"min_gram": 1,
"max_gram": 15
},
"my_synonym_filter": {
"type": "synonym",
"ignore_case": "true",
"synonyms": [
"sankt, st. => sankt"
]
}
}
}
},
"mappings": {
"city": {
"properties": {
"name": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "autocomplete_search"
}
}
}
}
}
Then we index data:
PUT city/city/1
{
"name":"St. Wolfgang"
}
PUT city/city/2
{
"name":"Stuttgart"
}
PUT city/city/3
{
"name":"Sankt Wolfgang"
}
Finally, searching for either st or sankt returns only documents 1 and 3, but not 2:
POST city/_search?q=name:st
POST city/_search?q=name:sankt
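The effect of the filter order can be sketched in plain Python. This is only an emulation under stated assumptions: the regular expression is a rough stand-in for the standard tokenizer, the SYNONYMS table assumes that the rule's st. ends up as the token st once the dot is dropped, and all function names are hypothetical.

```python
import re

# Assumed effect of the rule "sankt, st. => sankt" after tokenization.
SYNONYMS = {"st": "sankt"}

def tokenize(text):
    # Rough stand-in for the standard tokenizer plus the lowercase filter.
    return re.findall(r"\w+", text.lower())

def apply_synonyms(tokens):
    # Stand-in for my_synonym_filter, applied after lowercasing.
    return [SYNONYMS.get(token, token) for token in tokens]

def edge_ngrams(token, min_gram=1, max_gram=15):
    # Stand-in for edge_filter: prefixes of the token only.
    longest = min(max_gram, len(token))
    return {token[:size] for size in range(min_gram, longest + 1)}

def index_terms(text):
    # "autocomplete": tokenize -> lowercase -> synonyms -> edge n-grams.
    terms = set()
    for token in apply_synonyms(tokenize(text)):
        terms |= edge_ngrams(token)
    return terms

def search_terms(text):
    # "autocomplete_search": tokenize -> lowercase -> synonyms (no n-grams).
    return apply_synonyms(tokenize(text))

DOCS = {1: "St. Wolfgang", 2: "Stuttgart", 3: "Sankt Wolfgang"}

def search(query):
    wanted = search_terms(query)
    return sorted(
        doc_id
        for doc_id, name in DOCS.items()
        if all(term in index_terms(name) for term in wanted)
    )

print(search("st"))     # [1, 3]
print(search("sankt"))  # [1, 3]
print(search("stu"))    # [2]
```

Because st is rewritten to sankt before the prefixes are generated, Stuttgart is never indexed under sankt, while a longer prefix such as stu still finds it.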

Search partial word in elasticsearch

I'm fairly new to Elasticsearch, and I would like to search for a partial match inside a word.
For example, if the indexed word is "helloworld", is it possible to find it by typing only "world"?
Right now it works perfectly for "hello": Elasticsearch returns the suggestion helloworld.
Here is the code:
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
},
"mappings": {
"word": {
"properties": {
"text": {
"type": "string",
"analyzer": "autocomplete"
}
}
}
}
}
Can anyone give me a suggestion?
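For context, the behaviour described here follows from what an edge_ngram filter emits: only prefixes of each token. The sketch below is plain Python with hypothetical helper names, not Elasticsearch code. It compares those prefixes with the grams a plain ngram filter would produce, which is the filter type used in the first question on this page.

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    # Prefixes only, as an edge_ngram filter produces.
    longest = min(max_gram, len(token))
    return [token[:size] for size in range(min_gram, longest + 1)]

def ngrams(token, min_gram=1, max_gram=20):
    # Every substring of an allowed length, as a plain ngram filter produces.
    longest = min(max_gram, len(token))
    return [
        token[start:start + size]
        for size in range(min_gram, longest + 1)
        for start in range(len(token) - size + 1)
    ]

word = "helloworld"
print("hello" in edge_ngrams(word))  # True
print("world" in edge_ngrams(word))  # False
print("world" in ngrams(word))       # True
```

So "world" is never indexed as a term with the mapping above, whereas a plain ngram filter would index it, at the cost of many more terms per word.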

Implement autocomplete using custom analyzers in elasticsearch

Referring to this post, I've created the following mapping:
POST music
{
"song": {
"settings": {
"analysis": {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 20,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
},
"analyzer": {
"nGram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding",
"nGram_filter"
]
},
"whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
}
},
"mappings": {
"song_field_1": {
"type": "string",
"index": "not_analyzed",
"index_analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer"
}
}
}
}
Inserted the following document:
POST music/song
{
"song_field_1" : "Premeditiated fella"
}
And sent this query:
POST music/song/_search
{
"size": 10,
"query": {
"match": {
"_all": {
"query": "pre"
}
}
}
}
I expected to get the document as an autocomplete option, but didn't get any result.
You need to create your index like this:
POST music
{
"settings": {
"analysis": {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 20,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
}
},
"analyzer": {
"nGram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding",
"nGram_filter"
]
},
"whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
},
"mappings": {
"song": {
"properties": {
"song_field_1": {
"type": "string",
"index_analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer"
}
}
}
}
}
So:
song goes inside mappings
No need for "index": "not_analyzed" since you're specifying analyzers

Boost if result begin with the word

I use Elasticsearch to search with autocompletion with an ngram filter. I need to boost a result if it starts with the search keyword.
My query is simple:
"query": {
"match": {
"name": {
"query": "re",
"operator": "and"
}
}
}
And these are my results:
Restaurants
Couture et retouches
Restauration rapide
But I want them like this:
Restaurants
Restauration rapide
Couture et retouches
How can I boost a result starting with the keyword?
In case it helps, here is my mapping:
{
"settings": {
"analysis": {
"analyzer": {
"partialAnalyzer": {
"type": "custom",
"tokenizer": "ngram_tokenizer",
"filter": ["asciifolding", "lowercase"]
},
"searchAnalyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["asciifolding", "lowercase"]
}
},
"tokenizer": {
"ngram_tokenizer": {
"type": "edge_ngram",
"min_gram": "1",
"max_gram": "15",
"token_chars": [ "letter", "digit" ]
}
}
}
},
"mappings": {
"place": {
"properties": {
"name": {
"type": "string",
"index_analyzer": "partialAnalyzer",
"search_analyzer": "searchAnalyzer",
"term_vector": "with_positions_offsets"
}
}
}
}
}
How about this idea? I'm not 100% sure about it, since I think it depends on the data:
create a sub-field of your name field that is analyzed with a keyword analyzer (so the value stays pretty much as is)
change the query to a bool with two should clauses
one should clause is the query you have now
the other should clause is a match with phrase_prefix on the sub-field.
The mapping:
{
"settings": {
"analysis": {
"analyzer": {
"partialAnalyzer": {
"type": "custom",
"tokenizer": "ngram_tokenizer",
"filter": [
"asciifolding",
"lowercase"
]
},
"searchAnalyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"asciifolding",
"lowercase"
]
},
"keyword_lowercase": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"asciifolding",
"lowercase"
]
}
},
"tokenizer": {
"ngram_tokenizer": {
"type": "edge_ngram",
"min_gram": "1",
"max_gram": "15",
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"place": {
"properties": {
"name": {
"type": "string",
"index_analyzer": "partialAnalyzer",
"search_analyzer": "searchAnalyzer",
"term_vector": "with_positions_offsets",
"fields": {
"as_is": {
"type": "string",
"analyzer": "keyword_lowercase"
}
}
}
}
}
}
}
The query:
{
"query": {
"bool": {
"should": [
{
"match": {
"name": {
"query": "re",
"operator": "and"
}
}
},
{
"match": {
"name.as_is": {
"query": "re",
"type": "phrase_prefix"
}
}
}
]
}
}
}
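The idea behind the two should clauses can be sketched in plain Python. This is an emulation with hypothetical function names, not Elasticsearch code. It counts matching clauses instead of computing real relevance scores, and it leaves out ASCII folding because the sample names are plain ASCII.

```python
import re

def edge_ngram_terms(text, min_gram=1, max_gram=15):
    # Stand-in for partialAnalyzer: split on anything that is not a letter
    # or digit, lowercase, then keep the prefixes of each word.
    terms = set()
    for word in re.findall(r"[^\W_]+", text.lower()):
        longest = min(max_gram, len(word))
        for size in range(min_gram, longest + 1):
            terms.add(word[:size])
    return terms

def matching_should_clauses(name, keyword):
    keyword = keyword.lower()
    clauses = 0
    # Clause 1: the existing match query on the n-grammed name field.
    if keyword in edge_ngram_terms(name):
        clauses += 1
    # Clause 2: phrase_prefix on name.as_is, where the whole lowercased
    # value is a single token, so this is a prefix test on the full name.
    if name.lower().startswith(keyword):
        clauses += 1
    return clauses

names = ["Restaurants", "Couture et retouches", "Restauration rapide"]
ranked = sorted(names, key=lambda name: -matching_should_clauses(name, "re"))
print(ranked)
```

Names that start with the keyword satisfy both clauses and sort first, while Couture et retouches satisfies only the first one. In Elasticsearch the final order depends on the actual scores of both clauses, so this only illustrates the direction of the boost.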
