Highlight with fuzziness and ngram - elasticsearch

I guess the title of the topic already gives it away :D
I use edge_ngram and highlight to build an autocomplete search. I have added fuzziness to the query so that users can misspell their search terms, but it slightly breaks the highlighting.
When I type Sport, this is what I get:
<em>Spor</em>t
<em>Spor</em>t mécanique
<em>Spor</em>t nautique
I guess it's because it matches with the token spor generated by the ngram tokenizer.
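To make that guess concrete, here is a small Python sketch (not Elasticsearch code) that mimics the edge n-grams produced for the word sport and checks which of them fall within the edit distance that AUTO allows for a 5-character term. The helper names are made up for illustration, and plain Levenshtein distance is used as an approximation of what the fuzzy query does.

```python
def edge_ngrams(word, min_gram=1, max_gram=15):
    """Prefixes of `word`, mimicking an edge_ngram tokenizer on a single word."""
    return [word[:n] for n in range(min_gram, min(len(word), max_gram) + 1)]


def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def auto_fuzziness(term):
    """AUTO thresholds: 0 edits up to 2 chars, 1 edit for 3-5 chars, 2 beyond."""
    if len(term) <= 2:
        return 0
    return 1 if len(term) <= 5 else 2


query = "sport"
indexed = edge_ngrams("sport")
matching = [t for t in indexed
            if edit_distance(query, t) <= auto_fuzziness(query)]
print(indexed)   # ['s', 'sp', 'spo', 'spor', 'sport']
print(matching)  # ['spor', 'sport']
```

Both spor and sport qualify as matches for the fuzzy clause, which is consistent with the highlighter marking only the first four characters.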
The query:
{
"query": {
"bool": {
"should": [
{
"match": {
"name": {
"query": "sport",
"operator": "and",
"fuzziness": "AUTO"
}
}
},
{
"match_phrase_prefix": {
"name.raw": {
"query": "sport"
}
}
}
]
}
},
"highlight": {
"fields": {
"name": {
"term_vector": "with_positions_offsets"
}
}
}
}
And the mapping:
{
"settings": {
"analysis": {
"analyzer": {
"partialAnalyzer": {
"type": "custom",
"tokenizer": "ngram_tokenizer",
"filter": ["asciifolding", "lowercase"]
},
"keywordAnalyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": ["asciifolding", "lowercase"]
},
"searchAnalyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["asciifolding", "lowercase"]
}
},
"tokenizer": {
"ngram_tokenizer": {
"type": "edge_ngram",
"min_gram": "1",
"max_gram": "15",
"token_chars": [ "letter", "digit" ]
}
}
}
},
"mappings": {
"place": {
"properties": {
"name": {
"type": "string",
"index_analyzer": "partialAnalyzer",
"search_analyzer": "searchAnalyzer",
"term_vector": "with_positions_offsets",
"fields": {
"raw": {
"type": "string",
"analyzer": "keywordAnalyzer"
}
}
}
}
}
}
}
I tried adding another match clause without fuzziness to the query, hoping the exact keyword would match before the fuzzy clause, but it changed nothing.
"match": {
"name": {
"query": "sport",
"operator": "and"
}
}
Any idea how I can handle this?
Regards, Raphaël

You could do that with highlight_query, I guess.
Try this in the highlight section of your request:
"highlight": {
"fields": {
"name": {
"term_vector": "with_positions_offsets",
"highlight_query": {
"match": {
"name.raw": {
"query": "spotr",
"fuzziness": 2
}
}
}
}
}
}
I hope it helps.
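If the request is assembled in application code, the same idea could look like the Python sketch below. The function name build_search_body is hypothetical, only the highlight_query part of the snippet above is reproduced, and reusing the user's input in the highlight query is my assumption about how you would wire it up.

```python
import json


def build_search_body(user_input, fuzziness=2):
    """Build a search body whose highlighter uses its own query (hypothetical helper)."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Same clauses as in the question.
                    {"match": {"name": {"query": user_input,
                                        "operator": "and",
                                        "fuzziness": "AUTO"}}},
                    {"match_phrase_prefix": {"name.raw": {"query": user_input}}},
                ]
            }
        },
        "highlight": {
            "fields": {
                "name": {
                    # The highlighter takes its terms from this query
                    # instead of from the main search query.
                    "highlight_query": {
                        "match": {"name.raw": {"query": user_input,
                                               "fuzziness": fuzziness}}
                    }
                }
            }
        },
    }


body = build_search_body("sport")
print(json.dumps(body, indent=2))
```

The resulting dict can be sent as the body of a regular _search request.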

Related

How to make edge_ngram tokens match with a certain number of words between them?

I'm trying to make a search request that retrieves results only when fewer than 5 words are between the requested tokens.
The index settings:
{
"settings": {
"index": {
"analysis": {
"filter": {
"stopWords": {
"type": "stop",
"stopwords": [
"_english_"
]
}
},
"normalizer": {
"lowercaseNormalizer": {
"filter": [
"lowercase",
"asciifolding"
],
"type": "custom",
"char_filter": []
}
},
"analyzer": {
"autoCompleteAnalyzer": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "autoCompleteTokenizer"
},
"autoCompleteSearchAnalyzer": {
"type": "custom",
"tokenizer": "lowercase"
},
"charGroupAnalyzer": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "charGroupTokenizer"
}
},
"tokenizer": {
"charGroupTokenizer": {
"type": "char_group",
"max_token_length": "20",
"tokenize_on_chars": [
"whitespace",
"-",
"\n"
]
},
"autoCompleteTokenizer": {
"token_chars": [
"letter"
],
"min_gram": "3",
"type": "edge_ngram",
"max_gram": "20"
}
}
}
}
}
}
The mappings:
{
"mappings": {
"_doc": {
"properties": {
"description": {
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 64
}
},
"analyzer": "autoCompleteAnalyzer",
"search_analyzer": "autoCompleteSearchAnalyzer"
},
"text": {
"type": "text",
"analyzer": "charGroupAnalyzer"
}
}
}
}
}
}
}
And I run a bool query like this:
{
"query": {
"bool": {
"must": [
{
"multi_match": {
"fields": [
"description.name"
],
"operator": "and",
"query": "rounded elephant",
"fuzziness": 1
}
},
{
"match_phrase": {
"description.text": {
"analyzer": "charGroupAnalyzer",
"query": "rounded elephant",
"slop": 5,
"boost": 20
}
}
}
]
}
}
}
I expect the request to retrieve documents where the description contains:
... rounded very interesting elephant ...
This works well when I use complete words, like rounded elephant.
But when I enter word prefixes, like round eleph, it fails.
The cause seems clear: description.name and description.text use different tokenizers (name contains ngram tokens, while text contains whole-word tokens), so I get completely wrong results.
How can I configure the mappings and the search so that I can use ngrams together with a maximum distance between tokens?

Elasticsearch term query to number token

I need help explaining some weird behavior of a term query against an Elasticsearch index when the string contains a number. The query is pretty simple:
{
"query": {
"bool": {
"should": [
{
"term": {
"address.street": "8 kvetna"
}
}
]
}
}
}
The problem is that the term 8 kvetna returns an empty result. I tried to _analyze it and it produces regular tokens like 8, k, kv, kve .... I am also pretty sure that the value 8 kvetna exists in the index.
Here is the mapping for the field:
{
"settings": {
"index": {
"refresh_interval": "1m",
"number_of_shards": "1",
"number_of_replicas": "1",
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": "1",
"max_gram": "20"
}
},
"analyzer": {
"autocomplete": {
"filter": [
"lowercase",
"asciifolding",
"autocomplete_filter"
],
"type": "custom",
"tokenizer": "standard"
},
"default": {
"filter": [
"lowercase",
"asciifolding"
],
"type": "custom",
"tokenizer": "standard"
}
}
}
}
},
"mappings": {
"doc": {
"dynamic": "strict",
"_all": {
"enabled": false
},
"properties": {
"address": {
"properties": {
"city": {
"type": "text",
"analyzer": "autocomplete"
},
"street": {
"type": "text",
"analyzer": "autocomplete"
}
}
}
}
}
}
}
What caused this weird result? I don't understand it. Thanks for any help.
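Great start so far! Your only issue is that you're using a term query where you should use a match query. A term query is not analyzed, so it looks for the exact term 8 kvetna in the index, and no such term exists there because the field was indexed as edge n-grams of each word. That's not what you want. The following query will work: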
{
"query": {
"bool": {
"should": [
{
"match": { <--- change this
"address.street": "8 kvetna"
}
}
]
}
}
}
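To illustrate the difference, here is a rough Python sketch (not Elasticsearch code) that imitates the autocomplete analyzer from the question and compares a term-style lookup with a match-style lookup. The function names are made up, asciifolding is left out, and word splitting is approximated with a regular expression.

```python
import re


def autocomplete_analyze(text, min_gram=1, max_gram=20):
    """Rough stand-in for the analyzer: split into words, lowercase, edge n-grams."""
    terms = []
    for word in re.findall(r"\w+", text.lower()):
        terms.extend(word[:n]
                     for n in range(min_gram, min(len(word), max_gram) + 1))
    return terms


# What ends up in the index for the stored value.
indexed_terms = set(autocomplete_analyze("8 kvetna"))


def term_query(value):
    """term: the value is looked up as-is, without analysis."""
    return value in indexed_terms


def match_query(text):
    """match: the text is analyzed first, then any resulting token may hit."""
    return any(tok in indexed_terms for tok in autocomplete_analyze(text))


print(sorted(indexed_terms))    # ['8', 'k', 'kv', 'kve', 'kvet', 'kvetn', 'kvetna']
print(term_query("8 kvetna"))   # False
print(match_query("8 kvetna"))  # True
```

The literal string 8 kvetna never appears among the indexed terms, so only the analyzed lookup finds the document.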

elasticsearch match query is not working for numbers

I have a search query that searches on the report name.
I have indexed the field with an autocomplete analyzer based on edge_ngram.
Searching on the name normally works, but when the name contains a number or a year it does not.
Query :
{
"query": {
"function_score": {
"query": {
"bool": {
"should": [
{
"match": {
"field_name": {
"query": "hybrid seeds india 2017",
"operator": "and"
}
}
}
]
}
}
}
},
"from": 0,
"size": 10
}
Settings and mappings:
{
"mappings": {
"pages": {
"properties": {
"report_name": {
"fields": {
"autocomplete": {
"search_analyzer": "report_name_search",
"analyzer": "report_name_index",
"type": "string"
},
"report_name": {
"index": "not_analyzed",
"type": "string"
}
},
"type": "multi_field"
}
}
}
},
"settings": {
"analysis": {
"filter": {
"report_name_ngram": {
"max_gram": 150,
"min_gram": 2,
"type": "edge_ngram"
}
},
"analyzer": {
"report_name_index": {
"filter": [
"lowercase",
"report_name_ngram"
],
"tokenizer": "keyword"
},
"report_name_search": {
"filter": [
"lowercase"
],
"tokenizer": "keyword"
}
}
}
}
}
Can you help me out with this?
Thanks in advance.
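One way to reason about the analyzers above is to imitate them in a few lines of Python (not Elasticsearch code). The keyword tokenizer keeps the whole value as a single token, so the index holds prefixes of the full report name and the search side produces one token for the whole query. The function names and the sample report names are hypothetical. Note also that the query as posted targets field_name, while the analyzers are attached to report_name.autocomplete, which may be worth checking.

```python
def index_terms(report_name, min_gram=2, max_gram=150):
    """report_name_index: keyword tokenizer + lowercase + edge n-grams."""
    text = report_name.lower()
    return {text[:n] for n in range(min_gram, min(len(text), max_gram) + 1)}


def search_term(query):
    """report_name_search: keyword tokenizer + lowercase, a single token."""
    return query.lower()


def matches(report_name, query):
    """The single search token must equal one of the indexed prefixes."""
    return search_term(query) in index_terms(report_name)


# Hypothetical report names, for illustration only.
print(matches("Hybrid Seeds India 2017 Report", "hybrid seeds india 2017"))
print(matches("Hybrid Seeds Market in India 2017", "hybrid seeds india 2017"))
print(matches("Hybrid Seeds India 2017 Report", "india 2017"))
```

Under these assumptions only the first call matches: the query has to be a prefix of the entire report name, whether or not it contains a number.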

Elastic Search FVH highlights the smallest matching token

Settings:
{
"settings": {
"analysis": {
"analyzer": {
"idx_analyzer_ngram": {
"type": "custom",
"filter": [
"lowercase",
"asciifolding",
"edgengram_filter_1_32"
],
"tokenizer": "ngram_alltokenchar_tokenizer_1_32"
},
"ngrm_srch_analyzer": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "keyword"
}
},
"tokenizer": {
"ngram_alltokenchar_tokenizer_1_32": {
"token_chars": [
"letter",
"whitespace",
"punctuation",
"symbol",
"digit"
],
"min_gram": "1",
"type": "nGram",
"max_gram": "32"
}
}
}
}
}
Mappings:
{
"properties": {
"TITLE": {
"type": "string",
"fields": {
"untouched": {
"index": "not_analyzed",
"type": "string"
},
"ngramanalyzed": {
"search_analyzer": "ngrm_srch_analyzer",
"index_analyzer": "idx_analyzer_ngram",
"type": "string",
"term_vector": "with_positions_offsets"
}
}
}
}
}
Query:
{
"query": {
"filtered": {
"query": {
"query_string": {
"query": "have some ha",
"fields": [
"TITLE.ngramanalyzed"
],
"default_operator": "and"
}
}
}
},
"highlight": {
"fields": {
"TITLE.ngramanalyzed": {}
}
}
}
I have a document indexed with the TITLE have some happy meal. When I search for have some, I get the proper highlights:
<em>have</em> <em>some</em> happy meal
As I type more, e.g. have some ha, the highlight results are not as expected:
<em>ha</em>ve <em>some</em> <em>ha</em>ppy meal
The word have gets only partially highlighted, as ha.
I would expect the longest matching token to be highlighted. With ngrams of min size 1, I get a highlight of 1 or more characters even though there is another matching token of 4 or 5 characters (for example, have should be highlighted in full, along with ha in happy).
I have not been able to find a solution for this. Please suggest one.

Boost if result begin with the word

I use Elasticsearch to search with autocompletion with an ngram filter. I need to boost a result if it starts with the search keyword.
My query is simple :
{
"query": {
"match": {
"name": {
"query": "re",
"operator": "and"
}
}
}
}
And these are my results:
Restaurants
Couture et retouches
Restauration rapide
But I want them like this :
Restaurants
Restauration rapide
Couture et retouches
How can I boost a result starting with the keyword?
In case it helps, here is my mapping:
{
"settings": {
"analysis": {
"analyzer": {
"partialAnalyzer": {
"type": "custom",
"tokenizer": "ngram_tokenizer",
"filter": ["asciifolding", "lowercase"]
},
"searchAnalyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["asciifolding", "lowercase"]
}
},
"tokenizer": {
"ngram_tokenizer": {
"type": "edge_ngram",
"min_gram": "1",
"max_gram": "15",
"token_chars": [ "letter", "digit" ]
}
}
}
},
"mappings": {
"place": {
"properties": {
"name": {
"type": "string",
"index_analyzer": "partialAnalyzer",
"search_analyzer": "searchAnalyzer",
"term_vector": "with_positions_offsets"
}
}
}
}
}
Regards,
How about this idea (I'm not 100% sure of it, as I think it depends on the data):
create a sub-field of your name field that is analyzed with a keyword-based analyzer (so the value stays pretty much as is)
change the query to a bool with should clauses
one should is the query you have now
the other should is a match of type phrase_prefix on the sub-field.
The mapping:
{
"settings": {
"analysis": {
"analyzer": {
"partialAnalyzer": {
"type": "custom",
"tokenizer": "ngram_tokenizer",
"filter": [
"asciifolding",
"lowercase"
]
},
"searchAnalyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"asciifolding",
"lowercase"
]
},
"keyword_lowercase": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"asciifolding",
"lowercase"
]
}
},
"tokenizer": {
"ngram_tokenizer": {
"type": "edge_ngram",
"min_gram": "1",
"max_gram": "15",
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"place": {
"properties": {
"name": {
"type": "string",
"index_analyzer": "partialAnalyzer",
"search_analyzer": "searchAnalyzer",
"term_vector": "with_positions_offsets",
"fields": {
"as_is": {
"type": "string",
"analyzer": "keyword_lowercase"
}
}
}
}
}
}
}
The query:
{
"query": {
"bool": {
"should": [
{
"match": {
"name": {
"query": "re",
"operator": "and"
}
}
},
{
"match": {
"name.as_is": {
"query": "re",
"type": "phrase_prefix"
}
}
}
]
}
}
}
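As a toy model of why this reorders the results, the Python sketch below (not Elasticsearch code) counts how many of the two should clauses each name would satisfy and sorts by that count. Real scoring is more involved than counting clauses, the helper names are made up, and asciifolding is ignored.

```python
import re


def edge_ngrams(word, min_gram=1, max_gram=15):
    """Prefixes of a single word, like the edge_ngram tokenizer emits."""
    return [word[:n] for n in range(min_gram, min(len(word), max_gram) + 1)]


def ngram_terms(name):
    """Terms indexed in `name`: edge n-grams of every lowercased word."""
    terms = set()
    for word in re.findall(r"[a-z0-9]+", name.lower()):
        terms.update(edge_ngrams(word))
    return terms


def clause_hits(name, query):
    """Count the should clauses that match (toy stand-in for the score)."""
    q = query.lower()
    hits = 0
    if q in ngram_terms(name):        # match on name (n-gram field)
        hits += 1
    if name.lower().startswith(q):    # phrase_prefix on name.as_is
        hits += 1
    return hits


names = ["Restaurants", "Couture et retouches", "Restauration rapide"]
ranked = sorted(names, key=lambda n: -clause_hits(n, "re"))
print(ranked)
# ['Restaurants', 'Restauration rapide', 'Couture et retouches']
```

Names that start with the keyword satisfy both clauses, so they move ahead of names that only contain a word starting with it.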
