I have a value in the "handle" field of my users index. Let's say the exact string of this value is "exactstring". My query will return the correct result for "exactstrin" but it will fail (return nothing) for "exactstring". N-grams, like "actstr", also return the correct result. WTF?
{
"query": {
"bool": {
"must": [{
"multi_match": {
"fields": ["handle","name","bio"],
"prefix_length": 2,
"query": "exacthandle",
"fuzziness": "AUTO"
}
}],
"should" : [{
"multi_match": {
"fields": ["handle^6", "name^3", "bio"],
"query": "exacthandle"
}
}]
}
},
"size": 100,
"from": 0
}
Here are my settings:
"settings": {
"analysis": {
"analyzer": {
"search_term_analyzer": {
"type": "custom",
"stopwords": "_none_",
"filter": [
"standard",
"lowercase",
"asciifolding",
"no_stop"
],
"tokenizer": "whitespace"
},
"ngram_token_analyzer": {
"type": "custom",
"stopwords": "_none_",
"filter": [
"standard",
"lowercase",
"asciifolding",
"no_stop",
"ngram_filter"
],
"tokenizer": "whitespace"
}
},
"filter": {
"no_stop": {
"type": "stop",
"stopwords": "_none_"
},
"ngram_filter": {
"type": "nGram",
"min_gram": "2",
"max_gram": "9"
}
}
}
}
And my mappings:
"handle": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"bio": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
This is happening because max_gram (the maximum length of characters in a gram) is set too low. When you query for exactstring, the tokens generated from exactstring have a maximum length of 9 (i.e. exactstri). Because of this, the query matches exactstri but not exactstring.
You need to specify the analyzer of the field. When mapping an index, you can use the analyzer mapping parameter to specify an analyzer for each text field. If no analyzer is specified, the standard analyzer is used.
You also need to increase max_gram to at least 11 for ngram_token_analyzer in your index settings, and set "max_ngram_diff": "20". The tokens generated (which you can check using the Analyze API, as shown below) will then include both exactstrin and exactstring, and your query will match on both.
Index Mapping:
{
"settings": {
"analysis": {
"analyzer": {
"search_term_analyzer": {
"type": "custom",
"stopwords": "_none_",
"filter": [
"lowercase",
"asciifolding",
"no_stop"
],
"tokenizer": "whitespace"
},
"ngram_token_analyzer": {
"type": "custom",
"stopwords": "_none_",
"filter": [
"lowercase",
"asciifolding",
"no_stop",
"ngram_filter"
],
"tokenizer": "whitespace"
}
},
"filter": {
"no_stop": {
"type": "stop",
"stopwords": "_none_"
},
"ngram_filter": {
"type": "nGram",
"min_gram": "2",
"max_gram": "11" <-- note this
}
}
},
"max_ngram_diff": 20
},
"mappings": {
"properties": {
"handle": {
"type": "text",
"analyzer": "ngram_token_analyzer"
}
}
}
}
Index Data:
{
"handle":"exactstring"
}
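Before running the search, you can confirm what got indexed by calling the Analyze API against the new index (my_index is just a placeholder for the real index name):
GET /my_index/_analyze
{
"analyzer": "ngram_token_analyzer",
"text": "exactstring"
}
With min_gram 2 and max_gram 11 the token list now contains the full-length token exactstring as well as exactstrin, so the search below matches either form.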
Search Query:
{
"query": {
"bool": {
"must": [
{
"multi_match": {
"fields": [
"handle",
"name",
"bio"
],
"prefix_length": 2,
"query": "exactstring",
"fuzziness": "AUTO"
}
}
]
}
}
}
Search Result:
"hits": [
{
"_index": "64957062",
"_type": "_doc",
"_id": "1",
"_score": 0.6292809,
"_source": {
"handle": "exactstring"
}
}
]
I ended up using Multi-fields to solve this. Querying the field "handle.text" now returns exact string matches of arbitrary length.
I was also able to create ngram and edge_ngram filters with different min and max grams and then use them simultaneously.
The mapping looks like this now:
"handle": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
},
"text": {
"type": "text"
},
"edge_ngram": {
"type": "text",
"analyzer": "edge_ngram_token_analyzer"
},
"ngram": {
"type": "text",
"analyzer": "ngram_token_analyzer"
}
}
}
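As a quick sanity check of that claim (a sketch; my_index is a placeholder for the real index name), a plain match query on handle.text returns the exact handle no matter how long it is:
GET /my_index/_search
{
"query": {
"match": {
"handle.text": "exactstring"
}
}
}
Because handle.text keeps the default standard analyzer, a handle like exactstring is indexed as one un-ngrammed token, so its length no longer matters.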
The multi_match query can now use different analyzers on the same field simultaneously. (The big difference between the must and should queries comes down to fuzziness: the bio field tends to contain a lot of small tokens, and fuzziness there produces a lot of bad hits, which is why the fuzzy should clause leaves bio out.)
{
"query": {
"bool": {
"must" : [{
"multi_match": {
"fields": [
"handle.text^6",
"handle.edge_ngram^4",
"handle.ngram^2",
"bio.text^6",
"bio.keyword^4",
"name.text^6",
"name.keyword^5",
"name.edge_ngram^2",
"name.ngram^3"
],
"query": "exactstring",
"prefix_length": 2
}
}],
"should" : [{
"multi_match": {
"fields": [
"handle.text^6",
"handle.edge_ngram^2",
"handle.ngram^1",
"name.text^6",
"name.keyword^5",
"name.edge_ngram^2",
"name.ngram^3"
],
"query": "exactstring",
"prefix_length": 2,
"fuzziness": "AUTO"
}
}]
}
},
"size": 100,
"from": 0
}
For completeness, here are the settings of my index:
"settings": {
"analysis": {
"analyzer": {
"search_term_analyzer": {
"type": "custom",
"stopwords": "_none_",
"filter": [
"standard",
"lowercase",
"asciifolding",
"no_stop"
],
"tokenizer": "whitespace"
},
"ngram_token_analyzer": {
"type": "custom",
"stopwords": "_none_",
"filter": [
"standard",
"lowercase",
"asciifolding",
"no_stop",
"ngram_filter"
],
"tokenizer": "whitespace"
},
"edge_ngram_token_analyzer": {
"type": "custom",
"stopwords": "_none_",
"filter": [
"standard",
"lowercase",
"asciifolding",
"no_stop",
"edge_ngram_filter"
],
"tokenizer": "whitespace"
}
},
"filter": {
"no_stop": {
"type": "stop",
"stopwords": "_none_"
},
"ngram_filter": {
"type": "nGram",
"min_gram": "3",
"max_gram": "12"
},
"edge_ngram_filter": {
"type": "edgeNGram",
"min_gram": "3",
"max_gram": "50"
}
}
},
"max_ngram_diff": 50
}
h/t to #bhavya for max_ngram_diff
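For reference, this is roughly what the two custom analyzers do with exactstring; you can verify it with the Analyze API (a sketch, with my_index standing in for the real index):
GET /my_index/_analyze
{
"analyzer": "edge_ngram_token_analyzer",
"text": "exactstring"
}
With min_gram 3 and max_gram 50 this returns only prefixes (exa, exac, ..., exactstring), while ngram_token_analyzer (min_gram 3, max_gram 12) additionally returns inner substrings such as act and actstr, which is what lets non-prefix fragments still hit the handle.ngram sub-field.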
Related
I'm trying to make a search request that retrieves results only when fewer than 5 words appear between the requested tokens.
The settings:
{
"settings": {
"index": {
"analysis": {
"filter": {
"stopWords": {
"type": "stop",
"stopwords": [
"_english_"
]
}
},
"normalizer": {
"lowercaseNormalizer": {
"filter": [
"lowercase",
"asciifolding"
],
"type": "custom",
"char_filter": []
}
},
"analyzer": {
"autoCompleteAnalyzer": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "autoCompleteTokenizer"
},
"autoCompleteSearchAnalyzer": {
"type": "custom",
"tokenizer": "lowercase"
},
"charGroupAnalyzer": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "charGroupTokenizer"
}
},
"tokenizer": {
"charGroupTokenizer": {
"type": "char_group",
"max_token_length": "20",
"tokenize_on_chars": [
"whitespace",
"-",
"\n"
]
},
"autoCompleteTokenizer": {
"token_chars": [
"letter"
],
"min_gram": "3",
"type": "edge_ngram",
"max_gram": "20"
}
}
}
}
}
}
The mappings:
{
"mappings": {
"_doc": {
"properties": {
"description": {
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 64
}
},
"analyzer": "autoCompleteAnalyzer",
"search_analyzer": "autoCompleteSearchAnalyzer"
},
"text": {
"type": "text",
"analyzer": "charGroupAnalyzer"
}
}
}
}
}
}
}
And I make a bool request like this:
{
"query": {
"bool": {
"must": [
{
"multi_match": {
"fields": [
"description.name"
],
"operator": "and",
"query": "rounded elephant",
"fuzziness": 1
}
},
{
"match_phrase": {
"description.text": {
"analyzer": "charGroupAnalyzer",
"query": "rounded elephant",
"slop": 5,
"boost": 20
}
}
}
]
}
}
}
I expect the request to retrieve documents where the description contains:
... rounded very interesting elephant ...
This works well when I use complete words, like rounded elephant.
But when I enter prefixed words, like round eleph, it fails.
It's obvious that description.name and description.text have different tokenizers (name contains ngram tokens, but text contains word tokens), so I get completely wrong results.
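For illustration, comparing the two analyzers with the Analyze API (a sketch, assuming the settings above are applied to an index called my_index) shows the mismatch for the prefixed input:
GET /my_index/_analyze
{
"analyzer": "charGroupAnalyzer",
"text": "round eleph"
}
This returns the whole-word tokens round and eleph, which were never indexed for description.text (only rounded and elephant were), whereas autoCompleteAnalyzer indexed the edge ngrams rou, roun, round, ..., ele, elep, eleph, ..., so description.name still matches the prefixes.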
How can I configure mappings and search to be able to use ngrams with distance between tokens?
Settings:
{
"settings": {
"analysis": {
"analyzer": {
"idx_analyzer_ngram": {
"type": "custom",
"filter": [
"lowercase",
"asciifolding",
"edgengram_filter_1_32"
],
"tokenizer": "ngram_alltokenchar_tokenizer_1_32"
},
"ngrm_srch_analyzer": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "keyword"
}
},
"tokenizer": {
"ngram_alltokenchar_tokenizer_1_32": {
"token_chars": [
"letter",
"whitespace",
"punctuation",
"symbol",
"digit"
],
"min_gram": "1",
"type": "nGram",
"max_gram": "32"
}
}
}
}
}
Mappings:
{
"properties": {
"TITLE": {
"type": "string",
"fields": {
"untouched": {
"index": "not_analyzed",
"type": "string"
},
"ngramanalyzed": {
"search_analyzer": "ngrm_srch_analyzer",
"index_analyzer": "idx_analyzer_ngram",
"type": "string",
"term_vector": "with_positions_offsets"
}
}
}
}
}
Query:
{
"query": {
"filtered": {
"query": {
"query_string": {
"query": "have some ha",
"fields": [
"TITLE.ngramanalyzed"
],
"default_operator": "and"
}
}
}
},
"highlight": {
"fields": {
"TITLE.ngramanalyzed": {}
}
}
}
I have a document indexed with TITLE have some happy meal. When I search for have some, I get the proper highlights:
<em>have</em> <em>some</em> happy meal
As I type more, e.g. have some ha, the highlight results are not as expected:
<em>ha</em>ve <em>some</em> <em>ha</em>ppy meal
The word have gets only partially highlighted, as ha.
I would expect it to highlight the longest matching token: with ngrams of min size 1 this gives me a highlight of 1 or more characters, while there should also be a longer matching token of 4 or 5 characters (for example, have should be highlighted in full along with ha).
I am not able to find any solution for this. Please suggest.
I am trying to use a synonym analyzer at query time and am not getting the expected results. Can someone throw some light on this?
Here is my mapping for the index:
{
"jobs_user_profile_v2": {
"mappings": {
"profile": {
"_all": {
"enabled": false
},
"_ttl": {
"enabled": true
},
"properties": {
"rsa": {
"type": "nested",
"properties": {
"answer": {
"type": "string",
"index_analyzer": "autocomplete",
"search_analyzer": "synonym",
"position_offset_gap": 100
},
"answerId": {
"type": "long"
},
"answerOriginal": {
"type": "string",
"index": "not_analyzed"
},
"createdAt": {
"type": "long"
},
"label": {
"type": "string",
"index": "not_analyzed"
},
"labelOriginal": {
"type": "string",
"index": "not_analyzed"
},
"question": {
"type": "string",
"index": "not_analyzed"
},
"questionId": {
"type": "long"
},
"questionOriginal": {
"type": "string"
},
"source": {
"type": "integer"
},
"updatedAt": {
"type": "long"
}
}
}
}
}
}
}
}
The field to focus on is rsa.answer, which is the field I am querying.
My synonym mapping:
Beautician,Stylist,Make up artist,Massage therapist,Therapist,Spa,Hair Dresser,Salon,Beauty Parlour,Parlor => Beautician
Carpenter,Wood Worker,Furniture Carpenter => Carpenter
Cashier,Store Manager,Store Incharge,Purchase Executive,Billing Executive,Billing Boy => Cashier
Content Writer,Writer,Translator,Writing,Copywriter,Content Creation,Script Writer,Freelance Writer,Freelance Content Writer => Content Writer
My Search Query:
http://{{domain}}/jobs_user_profile_v2/_search
{
"query": {
"nested":{
"path": "rsa",
"query":{
"query_string": {
"query": "hair dresser",
"fields": ["answer"],
"analyzer" :"synonym"
}
},
"inner_hits": {
"explain": true
}
}
},
"explain" : true,
"sort" : [ {
"_score" : { }
} ]
}
It is showing proper Beautician and Cashier profiles for the search queries Hair Dresser and billing executive, but it is not showing anything for the wood worker => carpenter case.
My analyzer results:
http://{{domain}}/jobs_user_profile_v2/_analyze?analyzer=synonym&text=hair dresser
{
"tokens": [
{
"token": "beautician",
"start_offset": 0,
"end_offset": 12,
"type": "SYNONYM",
"position": 1
}
]
}
and for the wood worker case:
http://{{domain}}/jobs_user_profile_v2/_analyze?analyzer=synonym&text=wood worker
{
"tokens": [
{
"token": "carpenter",
"start_offset": 0,
"end_offset": 11,
"type": "SYNONYM",
"position": 1
}
]
}
It is also not working in a few other cases.
My analyzer settings for the index:
"analysis": {
"filter": {
"synonym": {
"ignore_case": "true",
"type": "synonym",
"synonyms_path": "synonym.txt"
},
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": "3",
"max_gram": "10"
}
},
"analyzer": {
"text_en_splitting_search": {
"type": "custom",
"filter": [
"stop",
"lowercase",
"porter_stem",
"word_delimiter"
],
"tokenizer": "whitespace"
},
"synonym": {
"filter": [
"stop",
"lowercase",
"synonym"
],
"type": "custom",
"tokenizer": "standard"
},
"autocomplete": {
"filter": [
"lowercase",
"autocomplete_filter"
],
"type": "custom",
"tokenizer": "standard"
},
"text_en_splitting": {
"filter": [
"lowercase",
"porter_stem",
"word_delimiter"
],
"type": "custom",
"tokenizer": "whitespace"
},
"text_general": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "standard"
},
"edge_ngram_analyzer": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "edge_ngram_tokenizer"
},
"autocomplete_analyzer": {
"filter": [
"lowercase"
],
"tokenizer": "whitespace"
}
},
"tokenizer": {
"edge_ngram_tokenizer": {
"token_chars": [
"letter",
"digit"
],
"min_gram": "2",
"type": "edgeNGram",
"max_gram": "10"
}
}
}
For the above case, multi_match is a better fit than query_string.
Unlike query_string, multi_match does not tokenize the query terms before analyzing them; query_string splits the input into terms first, which is why multi-word synonyms may not work as expected with it.
Example:
{
"query": {
"nested": {
"path": "rsa",
"query": {
"multi_match": {
"query": "wood worker",
"fields": [
"rsa.answer"
],
"type" : "cross_fields",
"analyzer": "synonym"
}
}
}
}
}
If for some reason you prefer query-string then you would need to pass the entire query in double quotes to ensure it is not tokenized:
Example:
POST test/_search
{
"query": {
"nested": {
"path": "rsa",
"query": {
"query_string": {
"query": "\"wood worker\"",
"fields": [
"rsa.answer"
],
"analyzer": "synonym"
}
}
}
}
}
I use Elasticsearch for autocomplete search with an ngram filter. I need to boost a result if it starts with the search keyword.
My query is simple:
"query": {
"match": {
"query": "re",
"operator": "and"
}
}
And these are my results:
Restaurants
Couture et retouches
Restauration rapide
But I want them like this:
Restaurants
Restauration rapide
Couture et retouches
How can I boost a result starting with the keyword?
In case it helps, here is my mapping:
{
"settings": {
"analysis": {
"analyzer": {
"partialAnalyzer": {
"type": "custom",
"tokenizer": "ngram_tokenizer",
"filter": ["asciifolding", "lowercase"]
},
"searchAnalyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["asciifolding", "lowercase"]
}
},
"tokenizer": {
"ngram_tokenizer": {
"type": "edge_ngram",
"min_gram": "1",
"max_gram": "15",
"token_chars": [ "letter", "digit" ]
}
}
}
},
"mappings": {
"place": {
"properties": {
"name": {
"type": "string",
"index_analyzer": "partialAnalyzer",
"search_analyzer": "searchAnalyzer",
"term_vector": "with_positions_offsets"
}
}
}
}
}
How about this idea? I'm not 100% sure of it, as I think it depends on the data:
create a sub-field in your name field that is analyzed with the keyword analyzer (pretty much keeping the value as is)
change the query to a bool with two should clauses
one should is the query you have now
the other should is a match with phrase_prefix on the sub-field
The mapping:
{
"settings": {
"analysis": {
"analyzer": {
"partialAnalyzer": {
"type": "custom",
"tokenizer": "ngram_tokenizer",
"filter": [
"asciifolding",
"lowercase"
]
},
"searchAnalyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"asciifolding",
"lowercase"
]
},
"keyword_lowercase": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"asciifolding",
"lowercase"
]
}
},
"tokenizer": {
"ngram_tokenizer": {
"type": "edge_ngram",
"min_gram": "1",
"max_gram": "15",
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"place": {
"properties": {
"name": {
"type": "string",
"index_analyzer": "partialAnalyzer",
"search_analyzer": "searchAnalyzer",
"term_vector": "with_positions_offsets",
"fields": {
"as_is": {
"type": "string",
"analyzer": "keyword_lowercase"
}
}
}
}
}
}
}
The query:
{
"query": {
"bool": {
"should": [
{
"match": {
"name": {
"query": "re",
"operator": "and"
}
}
},
{
"match": {
"name.as_is": {
"query": "re",
"type": "phrase_prefix"
}
}
}
]
}
}
}
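The second should clause is what produces the boost: keyword_lowercase keeps each name as a single lowercase token, and phrase_prefix then treats the typed text as a prefix of that whole token, so only names that start with it get the extra score. You can check the single-token behaviour the same way as any analyzer (my_index being a placeholder):
GET /my_index/_analyze?analyzer=keyword_lowercase&text=Restauration rapide
This returns the single token restauration rapide; couture et retouches does not start with re, so it is only scored by the first should clause and should therefore sort below Restaurants and Restauration rapide.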
The char_filter section of the Elasticsearch mapping documentation is kind of vague, and I'm having a lot of difficulty understanding if and how to use the mapping char filter: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-mapping-charfilter.html
Basically, the data we are storing in the index are IDs of type string that look like this: "008392342000". I want to be able to find such IDs when the query term actually contains a hyphen or a trailing space, like this: "008392342-000 ".
How would you advise I set up the analyzer?
Currently this is the definition of the field:
"mappings": {
"client": {
"properties": {
"ucn": {
"type": "multi_field",
"fields": {
"ucn_autoc": {
"type": "string",
"index": "analyzed",
"index_analyzer": "autocomplete_index",
"search_analyzer": "autocomplete_search"
},
"ucn": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
Here are the settings for the index, containing the analyzers etc.
"settings": {
"analysis": {
"filter": {
"autocomplete_ngram": {
"max_gram": 15,
"min_gram": 1,
"type": "edge_ngram"
},
"ngram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 8
}
},
"analyzer": {
"lowercase_analyzer": {
"filter": [
"lowercase"
],
"tokenizer": "keyword"
},
"autocomplete_index": {
"filter": [
"lowercase",
"autocomplete_ngram"
],
"tokenizer": "keyword"
},
"ngram_index": {
"filter": [
"ngram_filter",
"lowercase"
],
"tokenizer": "keyword"
},
"autocomplete_search": {
"filter": [
"lowercase"
],
"tokenizer": "keyword"
},
"ngram_search": {
"filter": [
"lowercase"
],
"tokenizer": "keyword"
}
},
"index": {
"number_of_shards": 6,
"number_of_replicas": 1
}
}
}
You haven't provided your actual analyzers, what data goes in and what your expectations are, but based on the info you provided I would start with this:
{
"settings": {
"analysis": {
"char_filter": {
"my_mapping": {
"type": "mapping",
"mappings": [
"-=>"
]
}
},
"analyzer": {
"autocomplete_search": {
"tokenizer": "keyword",
"char_filter": [
"my_mapping"
],
"filter": [
"trim"
]
},
"autocomplete_index": {
"tokenizer": "keyword",
"filter": [
"trim"
]
}
}
}
},
"mappings": {
"test": {
"properties": {
"ucn": {
"type": "multi_field",
"fields": {
"ucn_autoc": {
"type": "string",
"index": "analyzed",
"index_analyzer": "autocomplete_index",
"search_analyzer": "autocomplete_search"
},
"ucn": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}
The char_filter replaces - with nothing (-=>). I would also use the trim filter to get rid of any leading or trailing whitespace. I have no idea what your autocomplete_index analyzer looks like, so I just used a keyword one.
Testing the analyzer GET /my_index/_analyze?analyzer=autocomplete_search&text= 0123-34742-000 results in:
"tokens": [
{
"token": "012334742000",
"start_offset": 0,
"end_offset": 17,
"type": "word",
"position": 1
}
]
which means it does eliminate the - and the whitespace.
And the typical query would be:
{
"query": {
"match": {
"ucn.ucn_autoc": " 0123-34742-000 "
}
}
}
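To tie it together, a minimal end-to-end sketch (reusing the my_index name from the _analyze test above and the test type from the mapping): index one of the IDs from the question, then search for it with a hyphen and a trailing space.
PUT /my_index/test/1
{
"ucn": "008392342000"
}
POST /my_index/_search
{
"query": {
"match": {
"ucn.ucn_autoc": " 008392342-000 "
}
}
}
At index time autocomplete_index stores the clean value as one keyword token, and at search time autocomplete_search's char_filter drops the hyphen and the trim filter drops the surrounding spaces, so both sides end up as 008392342000 and the document matches.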