Some multi-word synonyms are not working in elasticsearch for nested fields

I am trying to use a synonym analyzer at query time and am not getting the expected results. Can someone shed some light on this?
Here is my mapping for the index:
{
"jobs_user_profile_v2": {
"mappings": {
"profile": {
"_all": {
"enabled": false
},
"_ttl": {
"enabled": true
},
"properties": {
"rsa": {
"type": "nested",
"properties": {
"answer": {
"type": "string",
"index_analyzer": "autocomplete",
"search_analyzer": "synonym",
"position_offset_gap": 100
},
"answerId": {
"type": "long"
},
"answerOriginal": {
"type": "string",
"index": "not_analyzed"
},
"createdAt": {
"type": "long"
},
"label": {
"type": "string",
"index": "not_analyzed"
},
"labelOriginal": {
"type": "string",
"index": "not_analyzed"
},
"question": {
"type": "string",
"index": "not_analyzed"
},
"questionId": {
"type": "long"
},
"questionOriginal": {
"type": "string"
},
"source": {
"type": "integer"
},
"updatedAt": {
"type": "long"
}
}
}
}
}
}
}
}
The field to focus on is rsa.answer, which is the field I am querying.
My synonym mapping:
Beautician,Stylist,Make up artist,Massage therapist,Therapist,Spa,Hair Dresser,Salon,Beauty Parlour,Parlor => Beautician
Carpenter,Wood Worker,Furniture Carpenter => Carpenter
Cashier,Store Manager,Store Incharge,Purchase Executive,Billing Executive,Billing Boy => Cashier
Content Writer,Writer,Translator,Writing,Copywriter,Content Creation,Script Writer,Freelance Writer,Freelance Content Writer => Content Writer
My Search Query:
http://{{domain}}/jobs_user_profile_v2/_search
{
"query": {
"nested":{
"path": "rsa",
"query":{
"query_string": {
"query": "hair dresser",
"fields": ["answer"],
"analyzer" :"synonym"
}
},
"inner_hits": {
"explain": true
}
}
},
"explain" : true,
"sort" : [ {
"_score" : { }
} ]
}
It shows the proper Beautician and Cashier profiles for the search queries 'Hair Dresser' and 'billing executive', but returns nothing for the 'wood worker => carpenter' case.
My analyzer results:
http://{{domain}}/jobs_user_profile_v2/_analyze?analyzer=synonym&text=hair dresser
{
"tokens": [
{
"token": "beautician",
"start_offset": 0,
"end_offset": 12,
"type": "SYNONYM",
"position": 1
}
]
}
and for the wood worker case:
http://{{domain}}/jobs_user_profile_v2/_analyze?analyzer=synonym&text=wood worker
{
"tokens": [
{
"token": "carpenter",
"start_offset": 0,
"end_offset": 11,
"type": "SYNONYM",
"position": 1
}
]
}
It is also not working in a few other cases.
My analyzer settings for the index:
"analysis": {
"filter": {
"synonym": {
"ignore_case": "true",
"type": "synonym",
"synonyms_path": "synonym.txt"
},
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": "3",
"max_gram": "10"
}
},
"analyzer": {
"text_en_splitting_search": {
"type": "custom",
"filter": [
"stop",
"lowercase",
"porter_stem",
"word_delimiter"
],
"tokenizer": "whitespace"
},
"synonym": {
"filter": [
"stop",
"lowercase",
"synonym"
],
"type": "custom",
"tokenizer": "standard"
},
"autocomplete": {
"filter": [
"lowercase",
"autocomplete_filter"
],
"type": "custom",
"tokenizer": "standard"
},
"text_en_splitting": {
"filter": [
"lowercase",
"porter_stem",
"word_delimiter"
],
"type": "custom",
"tokenizer": "whitespace"
},
"text_general": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "standard"
},
"edge_ngram_analyzer": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "edge_ngram_tokenizer"
},
"autocomplete_analyzer": {
"filter": [
"lowercase"
],
"tokenizer": "whitespace"
}
},
"tokenizer": {
"edge_ngram_tokenizer": {
"token_chars": [
"letter",
"digit"
],
"min_gram": "2",
"type": "edgeNGram",
"max_gram": "10"
}
}
}

For the above case, multi_match is a better fit than query_string.
Unlike query_string, multi_match does not tokenize the query text before analyzing it; query_string splits the query on whitespace first, so each term is analyzed on its own and multi-word synonyms such as wood worker may not work as expected.
Example:
{
"query": {
"nested": {
"path": "rsa",
"query": {
"multi_match": {
"query": "wood worker",
"fields": [
"rsa.answer"
],
"type" : "cross_fields",
"analyzer": "synonym"
}
}
}
}
}
If for some reason you prefer query_string, then you would need to pass the entire query in double quotes to ensure it is not tokenized before analysis:
Example:
POST test/_search
{
"query": {
"nested": {
"path": "rsa",
"query": {
"query_string": {
"query": "\"wood worker\"",
"fields": [
"rsa.answer"
],
"analyzer": "synonym"
}
}
}
}
}

Related

Elasticsearch succeeds for partial match and fails for exact match

I have a value in the "handle" field of my users index. Let's say the exact string of this value is "exactstring". My query will return the correct result for "exactstrin" but it will fail (return nothing) for "exactstring". N-grams, like "actstr", also return the correct result. WTF?
{
"query": {
"bool": {
"must": [{
"multi_match": {
"fields": ["handle","name","bio"],
"prefix_length": 2,
"query": "exacthandle",
"fuzziness": "AUTO"
}
}],
"should" : [{
"multi_match": {
"fields": ["handle^6", "name^3", "bio"],
"query": "exacthandle"
}
}]
}
},
"size": 100,
"from": 0
}
Here's my settings:
"settings": {
"analysis": {
"analyzer": {
"search_term_analyzer": {
"type": "custom",
"stopwords": "_none_",
"filter": [
"standard",
"lowercase",
"asciifolding",
"no_stop"
],
"tokenizer": "whitespace"
},
"ngram_token_analyzer": {
"type": "custom",
"stopwords": "_none_",
"filter": [
"standard",
"lowercase",
"asciifolding",
"no_stop",
"ngram_filter"
],
"tokenizer": "whitespace"
}
},
"filter": {
"no_stop": {
"type": "stop",
"stopwords": "_none_"
},
"ngram_filter": {
"type": "nGram",
"min_gram": "2",
"max_gram": "9"
}
}
}
}
And my mappings:
"handle": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"bio": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
This is happening because the value you have set for max_gram (the maximum length of characters in a gram) is too small. With max_gram set to 9, the longest token generated for exactstring is 9 characters (i.e. exactstri). Because of this, the query matches exactstri but not exactstring.
You need to specify the analyzer of the field. When mapping an index, you can use the analyzer mapping parameter to specify an analyzer for each text field. If no analyzer is specified, the standard analyzer is used.
You also need to increase max_gram to at least 11 for ngram_token_analyzer in your index settings, and set "max_ngram_diff": 20 as well. The tokens generated (which you can check using the Analyze API) will then include both exactstrin and exactstring, and your query will match both.
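For example, once you have created the index with the mapping shown below, you can inspect the generated tokens with a request like this (the index name 64957062 is taken from the search result further down; substitute your own):
POST 64957062/_analyze
{
"analyzer": "ngram_token_analyzer",
"text": "exactstring"
}
With max_gram raised to 11, the returned tokens include exactstri, exactstrin and exactstring, so the query can now match the full string.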
Index Mapping:
{
"settings": {
"analysis": {
"analyzer": {
"search_term_analyzer": {
"type": "custom",
"stopwords": "_none_",
"filter": [
"lowercase",
"asciifolding",
"no_stop"
],
"tokenizer": "whitespace"
},
"ngram_token_analyzer": {
"type": "custom",
"stopwords": "_none_",
"filter": [
"lowercase",
"asciifolding",
"no_stop",
"ngram_filter"
],
"tokenizer": "whitespace"
}
},
"filter": {
"no_stop": {
"type": "stop",
"stopwords": "_none_"
},
"ngram_filter": {
"type": "nGram",
"min_gram": "2",
"max_gram": "11" <-- note this
}
}
},
"max_ngram_diff": 20
},
"mappings": {
"properties": {
"handle": {
"type": "text",
"analyzer": "ngram_token_analyzer"
}
}
}
}
Index Data:
{
"handle":"exactstring"
}
Search Query:
{
"query": {
"bool": {
"must": [
{
"multi_match": {
"fields": [
"handle",
"name",
"bio"
],
"prefix_length": 2,
"query": "exactstring",
"fuzziness": "AUTO"
}
}
]
}
}
}
Search Result:
"hits": [
{
"_index": "64957062",
"_type": "_doc",
"_id": "1",
"_score": 0.6292809,
"_source": {
"handle": "exactstring"
}
}
]
I ended up using Multi-fields to solve this. Querying the field "handle.text" now returns exact string matches of arbitrary length.
I was also able to create ngram and edge_ngram filters with different min and max grams and then use them simultaneously.
The mapping looks like this now:
"handle": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
},
"text": {
"type": "text"
},
"edge_ngram": {
"type": "text",
"analyzer": "edge_ngram_token_analyzer"
},
"ngram": {
"type": "text",
"analyzer": "ngram_token_analyzer"
}
}
}
The multi-match query can now use different analyzers on the same field simultaneously (the big difference between the must and should queries comes down to fuzziness; the bio field tends to contain a lot of small tokens, and fuzziness produces a lot of bad hits):
{
"query": {
"bool": {
"must" : [{
"multi_match": {
"fields": [
"handle.text^6",
"handle.edge_ngram^4",
"handle.ngram^2",
"bio.text^6",
"bio.keyword^4",
"name.text^6",
"name.keyword^5",
"name.edge_ngram^2",
"name.ngram^3"
],
"query": "exactstring",
"prefix_length": 2
}
}],
"should" : [{
"multi_match": {
"fields": [
"handle.text^6",
"handle.edge_ngram^2",
"handle.ngram^1",
"name.text^6",
"name.keyword^5",
"name.edge_ngram^2",
"name.ngram^3"
],
"query": "exactstring",
"prefix_length": 2,
"fuzziness": "AUTO"
}
}]
}
},
"size": 100,
"from": 0
}
For completeness, here are the settings of my index:
"settings": {
"analysis": {
"analyzer": {
"search_term_analyzer": {
"type": "custom",
"stopwords": "_none_",
"filter": [
"standard",
"lowercase",
"asciifolding",
"no_stop"
],
"tokenizer": "whitespace"
},
"ngram_token_analyzer": {
"type": "custom",
"stopwords": "_none_",
"filter": [
"standard",
"lowercase",
"asciifolding",
"no_stop",
"ngram_filter"
],
"tokenizer": "whitespace"
},
"edge_ngram_token_analyzer": {
"type": "custom",
"stopwords": "_none_",
"filter": [
"standard",
"lowercase",
"asciifolding",
"no_stop",
"edge_ngram_filter"
],
"tokenizer": "whitespace"
}
},
"filter": {
"no_stop": {
"type": "stop",
"stopwords": "_none_"
},
"ngram_filter": {
"type": "nGram",
"min_gram": "3",
"max_gram": "12"
},
"edge_ngram_filter": {
"type": "edgeNGram",
"min_gram": "3",
"max_gram": "50"
}
}
},
"max_ngram_diff": 50
}
h/t to #bhavya for max_ngram_diff

Must specify either an analyzer type, or a tokenizer

I am basically new to Elasticsearch. I am trying to implement fuzzy search, synonym search, edge ngram and autocomplete on the "name_auto" field, but it seems like my index creation is failing.
Another question: can I implement all of the analyzers on the "name" field, and if so, how can I do it?
{
"settings": {
"index": {
"analysis": {
"filter": {
"synonym": {
"ignore_case": "true",
"type": "synonym",
"format": "wordnet",
"synonyms_path": "analysis/wn_s.pl"
}
},
"analyzer": {
"synonym": {
"tokenizer": "whitespace",
"filter": [
"synonym"
]
},
"keyword_analyzer": {
"filter": [
"lowercase",
"asciifolding",
"trim"
],
"char_filter": [],
"type": "custom",
"tokenizer": "keyword"
},
"edge_ngram_analyzer": {
"filter": [
"lowercase"
],
"tokenizer": "edge_ngram_tokenizer"
},
"edge_ngram_search_analyzer": {
"tokenizer": "lowercase"
},
"tokenizer": {
"edge_ngram_tokenizer": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 25,
"token_chars": [
"letter"
]
}
}
},
"mappings": {
"properties": {
"firebaseId": {
"type": "text"
},
"name": {
"fielddata": true,
"type": "text",
"analyzer": "standard"
},
"name_auto": {
"type": "text",
"fields": {
"keywordstring": {
"type": "text",
"analyzer": "keyword_analyzer"
},
"edgengram": {
"type": "text",
"analyzer": "edge_ngram_analyzer",
"search_analyzer": "edge_ngram_search_analyzer"
},
"completion": {
"type": "completion"
},
"synonym_analyzer": {
"type": "synonym",
"analyzer": "synonym"
}
}
}
}
}
}
}
}
}
This is the output:
{
"error": {
"root_cause": [
{
"type": "illegal_argument_exception",
"reason": "analyzer [tokenizer] must specify either an analyzer type, or a tokenizer"
}
],
"type": "illegal_argument_exception",
"reason": "analyzer [tokenizer] must specify either an analyzer type, or a tokenizer"
},
"status": 400
}
Where am I going wrong? Please guide me in the right direction.
Your tokenizer section is located inside the analyzer section, which is not correct. Note also that mappings must be a top-level section (a sibling of settings, not nested inside it), and the synonym_analyzer sub-field needs a valid field type such as text (there is no synonym field type). Try with this instead, it should work:
{
"settings": {
"index": {
"analysis": {
"filter": {
"synonym": {
"ignore_case": "true",
"type": "synonym",
"format": "wordnet",
"synonyms_path": "analysis/wn_s.pl"
}
},
"analyzer": {
"synonym": {
"tokenizer": "whitespace",
"filter": [
"synonym"
]
},
"keyword_analyzer": {
"filter": [
"lowercase",
"asciifolding",
"trim"
],
"char_filter": [],
"type": "custom",
"tokenizer": "keyword"
},
"edge_ngram_analyzer": {
"filter": [
"lowercase"
],
"tokenizer": "edge_ngram_tokenizer"
},
"edge_ngram_search_analyzer": {
"tokenizer": "lowercase"
}
},
"tokenizer": {
"edge_ngram_tokenizer": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 25,
"token_chars": [
"letter"
]
}
}
}
}
},
"mappings": {
"properties": {
"firebaseId": {
"type": "text"
},
"name": {
"fielddata": true,
"type": "text",
"analyzer": "standard"
},
"name_auto": {
"type": "text",
"fields": {
"keywordstring": {
"type": "text",
"analyzer": "keyword_analyzer"
},
"edgengram": {
"type": "text",
"analyzer": "edge_ngram_analyzer",
"search_analyzer": "edge_ngram_search_analyzer"
},
"completion": {
"type": "completion"
},
"synonym_analyzer": {
"type": "synonym",
"analyzer": "synonym"
}
}
}
}
}
}
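Once the index is created successfully, a quick way to verify the analyzers is the Analyze API; for example, against the edge ngram analyzer (the index name test is only a placeholder):
POST test/_analyze
{
"analyzer": "edge_ngram_analyzer",
"text": "Fuzzy"
}
This should return the lowercased edge ngrams f, fu, fuz, fuzz and fuzzy, which is what the name_auto.edgengram sub-field will be indexed with.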

Not able to search a phrase in elasticsearch 5.4

I am searching for a phrase in an email body. I need the results filtered exactly: if I search for 'Avenue New', it should return only results which contain the phrase 'Avenue New', not 'Avenue Street', 'Park Avenue', etc.
My mapping is like:
{
"exchangemailssql": {
"aliases": {},
"mappings": {
"email": {
"dynamic_templates": [
{
"_default": {
"match": "*",
"match_mapping_type": "string",
"mapping": {
"doc_values": true,
"type": "keyword"
}
}
}
],
"properties": {
"attachments": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"body": {
"type": "text",
"analyzer": "keylower",
"fielddata": true
},
"count": {
"type": "short"
},
"emailId": {
"type": "long"
}
}
}
},
"settings": {
"index": {
"refresh_interval": "3s",
"number_of_shards": "1",
"provided_name": "exchangemailssql",
"creation_date": "1500527793230",
"analysis": {
"filter": {
"nGram": {
"min_gram": "4",
"side": "front",
"type": "edge_ngram",
"max_gram": "100"
}
},
"analyzer": {
"keylower": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "keyword"
},
"email": {
"filter": [
"lowercase",
"unique",
"nGram"
],
"type": "custom",
"tokenizer": "uax_url_email"
},
"full": {
"filter": [
"lowercase",
"snowball",
"nGram"
],
"type": "custom",
"tokenizer": "standard"
}
}
},
"number_of_replicas": "0",
"uuid": "2XTpHmwaQF65PNkCQCmcVQ",
"version": {
"created": "5040099"
}
}
}
}
}
I have given the search query like:
{
"query": {
"match_phrase": {
"body": "Avenue New"
}
},
"highlight": {
"fields" : {
"body" : {}
}
}
}
The problem here is that you're tokenizing the full body content using the keyword tokenizer, i.e. it will be one big lowercase string and you cannot search inside of it.
If you simply change the analyzer of your body field to standard instead of keylower, you'll find what you need using the match_phrase query.
"body": {
"type": "text",
"analyzer": "standard", <---change this
"fielddata": true
},
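To see why this works, compare the two analyzers with the Analyze API (a quick sketch; exchangemailssql is the index from the mapping above and the sample text is made up):
POST exchangemailssql/_analyze
{
"analyzer": "keylower",
"text": "4 Park Avenue New York"
}
The keylower analyzer returns the whole body as one token (4 park avenue new york), so the phrase "avenue new" never exists as its own sequence of tokens and the match_phrase query finds nothing. Running the same request with "analyzer": "standard" returns the separate tokens 4, park, avenue, new, york, which lets the phrase query match avenue immediately followed by new.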

Boost if result begins with the word

I use Elasticsearch to search with autocompletion with an ngram filter. I need to boost a result if it starts with the search keyword.
My query is simple :
"query": {
"match": {
"query": "re",
"operator": "and"
}
}
And these are my results:
Restaurants
Couture et retouches
Restauration rapide
But I want them like this :
Restaurants
Restauration rapide
Couture et retouches
How can I boost a result starting with the keyword?
In case it helps, here is my mapping:
{
"settings": {
"analysis": {
"analyzer": {
"partialAnalyzer": {
"type": "custom",
"tokenizer": "ngram_tokenizer",
"filter": ["asciifolding", "lowercase"]
},
"searchAnalyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["asciifolding", "lowercase"]
}
},
"tokenizer": {
"ngram_tokenizer": {
"type": "edge_ngram",
"min_gram": "1",
"max_gram": "15",
"token_chars": [ "letter", "digit" ]
}
}
}
},
"mappings": {
"place": {
"properties": {
"name": {
"type": "string",
"index_analyzer": "partialAnalyzer",
"search_analyzer": "searchAnalyzer",
"term_vector": "with_positions_offsets"
}
}
}
}
}
How about this idea? I'm not 100% sure of it, as I think it depends on the data:
- create a sub-field in your name field that is analyzed with a keyword analyzer (pretty much keeping the value as is)
- change the query to be a bool with two should clauses
- one should is the query you have now
- the other should is a match with phrase_prefix on the sub-field.
The mapping:
{
"settings": {
"analysis": {
"analyzer": {
"partialAnalyzer": {
"type": "custom",
"tokenizer": "ngram_tokenizer",
"filter": [
"asciifolding",
"lowercase"
]
},
"searchAnalyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"asciifolding",
"lowercase"
]
},
"keyword_lowercase": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"asciifolding",
"lowercase"
]
}
},
"tokenizer": {
"ngram_tokenizer": {
"type": "edge_ngram",
"min_gram": "1",
"max_gram": "15",
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"place": {
"properties": {
"name": {
"type": "string",
"index_analyzer": "partialAnalyzer",
"search_analyzer": "searchAnalyzer",
"term_vector": "with_positions_offsets",
"fields": {
"as_is": {
"type": "string",
"analyzer": "keyword_lowercase"
}
}
}
}
}
}
}
The query:
{
"query": {
"bool": {
"should": [
{
"match": {
"name": {
"query": "re",
"operator": "and"
}
}
},
{
"match": {
"name.as_is": {
"query": "re",
"type": "phrase_prefix"
}
}
}
]
}
}
}
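The second should clause is what produces the boost: the keyword_lowercase analyzer keeps the whole name as a single token, so the phrase_prefix match on name.as_is only succeeds when the name actually starts with the typed text. You can check this with the Analyze API (the index name places is hypothetical):
GET /places/_analyze?analyzer=keyword_lowercase&text=Restauration rapide
This returns the single token restauration rapide, which starts with re, so Restaurants and Restauration rapide pick up the extra should score, while Couture et retouches (token couture et retouches) does not.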

How to match query terms containing hyphens or trailing space in elasticsearch

The char_filter section of the Elasticsearch mapping documentation is kind of vague, and I'm having a lot of difficulty understanding if and how to use the mapping char filter: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-mapping-charfilter.html
Basically the data we are storing in the index are ids of type String that look like this: "008392342000". I want to be able to search such ids when query terms actually contain a hyphen or trailing space like this: "008392342-000 ".
How would you advise I set up the analyzer?
Currently this is the definition of the field:
"mappings": {
"client": {
"properties": {
"ucn": {
"type": "multi_field",
"fields": {
"ucn_autoc": {
"type": "string",
"index": "analyzed",
"index_analyzer": "autocomplete_index",
"search_analyzer": "autocomplete_search"
},
"ucn": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
Here is the settings for the index containing analyzer etc.
"settings": {
"analysis": {
"filter": {
"autocomplete_ngram": {
"max_gram": 15,
"min_gram": 1,
"type": "edge_ngram"
},
"ngram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 8
}
},
"analyzer": {
"lowercase_analyzer": {
"filter": [
"lowercase"
],
"tokenizer": "keyword"
},
"autocomplete_index": {
"filter": [
"lowercase",
"autocomplete_ngram"
],
"tokenizer": "keyword"
},
"ngram_index": {
"filter": [
"ngram_filter",
"lowercase"
],
"tokenizer": "keyword"
},
"autocomplete_search": {
"filter": [
"lowercase"
],
"tokenizer": "keyword"
},
"ngram_search": {
"filter": [
"lowercase"
],
"tokenizer": "keyword"
}
},
"index": {
"number_of_shards": 6,
"number_of_replicas": 1
}
}
}
You haven't provided your actual analyzers, the data that goes in, or what your expectations are, but based on the info you provided I would start with this:
{
"settings": {
"analysis": {
"char_filter": {
"my_mapping": {
"type": "mapping",
"mappings": [
"-=>"
]
}
},
"analyzer": {
"autocomplete_search": {
"tokenizer": "keyword",
"char_filter": [
"my_mapping"
],
"filter": [
"trim"
]
},
"autocomplete_index": {
"tokenizer": "keyword",
"filter": [
"trim"
]
}
}
}
},
"mappings": {
"test": {
"properties": {
"ucn": {
"type": "multi_field",
"fields": {
"ucn_autoc": {
"type": "string",
"index": "analyzed",
"index_analyzer": "autocomplete_index",
"search_analyzer": "autocomplete_search"
},
"ucn": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}
The char_filter replaces - with nothing: -=>. I would also use the trim filter to get rid of any trailing or leading white spaces. I don't know what your autocomplete_index analyzer looks like, so I just used a keyword-based one.
Testing the analyzer GET /my_index/_analyze?analyzer=autocomplete_search&text= 0123-34742-000 results in:
"tokens": [
{
"token": "012334742000",
"start_offset": 0,
"end_offset": 17,
"type": "word",
"position": 1
}
]
which means it does eliminate the - and the white spaces.
And the typical query would be:
{
"query": {
"match": {
"ucn.ucn_autoc": " 0123-34742-000 "
}
}
}
