Currently I am using the dynamic template below, which applies an n-gram analyzer to all string fields.
However, to improve efficiency I would like to apply n-gram to specific fields only, not to all string fields.
{
"template": "*",
"settings": {
"analysis": {
"filter": {
"ngram_filter": {
"type": "ngram",
"min_gram": 1,
"max_gram": 25
}
},
"analyzer": {
"case_insensitive": {
"tokenizer": "whitespace",
"filter": [
"ngram_filter",
"lowercase"
]
},
"search_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": "lowercase"
}
}
}
},
"mappings": {
"my_type": {
"dynamic_templates": [
{
"strings": {
"match_mapping_type": "string",
"mapping": {
"type": "string",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "case_insensitive",
"search_analyzer": "search_analyzer"
}
}
}
]
}
}
}
I have a payload like this:
{
"userId":"abc123-pqr180-xyz124-njd212",
"email" : "someuser#test.com",
"name" : "somename",
.
.
20 more fields
}
Now I want to apply n-gram only to "email" and "userId".
How can I do this?
Since you cannot rename the fields, I suggest duplicating the dynamic template, one copy per field. The example below does this for the name and email fields; change the match values to the fields you actually need (for instance userId).
{
"template": "*",
"settings": {
"analysis": {
"filter": {
"ngram_filter": {
"type": "ngram",
"min_gram": 1,
"max_gram": 25
}
},
"analyzer": {
"case_insensitive": {
"tokenizer": "whitespace",
"filter": [
"ngram_filter",
"lowercase"
]
},
"search_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": "lowercase"
}
}
}
},
"mappings": {
"my_type": {
"dynamic_templates": [
{
"names": {
"match_mapping_type": "string",
"match": "name",
"mapping": {
"type": "string",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "case_insensitive",
"search_analyzer": "search_analyzer"
}
}
},
{
"emails": {
"match_mapping_type": "string",
"match": "email",
"mapping": {
"type": "string",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "case_insensitive",
"search_analyzer": "search_analyzer"
}
}
}
]
}
}
}
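If more fields need the n-gram analyzer later, the per-field templates can be generated instead of copied by hand. This is only a minimal sketch: the helper name ngram_templates and the "<field>_ngram" template names are made up for illustration, and the mapping body is copied as-is from the template above.

```python
import json


def ngram_templates(field_names):
    """Build one dynamic template per field name (hypothetical helper)."""
    templates = []
    for name in field_names:
        templates.append({
            # Template name is arbitrary; "<field>_ngram" is just a convention used here.
            f"{name}_ngram": {
                "match_mapping_type": "string",
                "match": name,
                # Mapping body copied unchanged from the template in the answer.
                "mapping": {
                    "type": "string",
                    "fields": {
                        "keyword": {"type": "keyword", "ignore_above": 256}
                    },
                    "analyzer": "case_insensitive",
                    "search_analyzer": "search_analyzer",
                },
            }
        })
    return templates


templates = ngram_templates(["userId", "email"])
print(json.dumps({"dynamic_templates": templates}, indent=2))
```

The printed list can be pasted into the "dynamic_templates" section of the index template. String fields that match none of these templates fall back to the default dynamic mapping.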
I am new to Elasticsearch. I am trying to implement fuzzy search, synonym search, edge n-gram and autocomplete on the "name_auto" field, but my index creation is failing.
Another question: can I apply all of these analyzers to the "name" field, and if so, how?
{
"settings": {
"index": {
"analysis": {
"filter": {
"synonym": {
"ignore_case": "true",
"type": "synonym",
"format": "wordnet",
"synonyms_path": "analysis/wn_s.pl"
}
},
"analyzer": {
"synonym": {
"tokenizer": "whitespace",
"filter": [
"synonym"
]
},
"keyword_analyzer": {
"filter": [
"lowercase",
"asciifolding",
"trim"
],
"char_filter": [],
"type": "custom",
"tokenizer": "keyword"
},
"edge_ngram_analyzer": {
"filter": [
"lowercase"
],
"tokenizer": "edge_ngram_tokenizer"
},
"edge_ngram_search_analyzer": {
"tokenizer": "lowercase"
},
"tokenizer": {
"edge_ngram_tokenizer": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 25,
"token_chars": [
"letter"
]
}
}
},
"mappings": {
"properties": {
"firebaseId": {
"type": "text"
},
"name": {
"fielddata": true,
"type": "text",
"analyzer": "standard"
},
"name_auto": {
"type": "text",
"fields": {
"keywordstring": {
"type": "text",
"analyzer": "keyword_analyzer"
},
"edgengram": {
"type": "text",
"analyzer": "edge_ngram_analyzer",
"search_analyzer": "edge_ngram_search_analyzer"
},
"completion": {
"type": "completion"
},
"synonym_analyzer": {
"type": "synonym",
"analyzer": "synonym"
}
}
}
}
}
}
}
}
}
This is the output:
> {
> "error": {
> "root_cause": [
> {
> "type": "illegal_argument_exception",
> "reason": "analyzer [tokenizer] must specify either an analyzer type, or a tokenizer"
> }
> ],
> "type": "illegal_argument_exception",
> "reason": "analyzer [tokenizer] must specify either an analyzer type, or a tokenizer"
> },
> "status": 400
> }
Where am I going wrong? Please point me in the right direction.
Your tokenizer section is located inside the analyzer section, which is not correct: it must be a sibling of analyzer, directly under analysis. Try this instead, it should work:
{
"settings": {
"index": {
"analysis": {
"filter": {
"synonym": {
"ignore_case": "true",
"type": "synonym",
"format": "wordnet",
"synonyms_path": "analysis/wn_s.pl"
}
},
"analyzer": {
"synonym": {
"tokenizer": "whitespace",
"filter": [
"synonym"
]
},
"keyword_analyzer": {
"filter": [
"lowercase",
"asciifolding",
"trim"
],
"char_filter": [],
"type": "custom",
"tokenizer": "keyword"
},
"edge_ngram_analyzer": {
"filter": [
"lowercase"
],
"tokenizer": "edge_ngram_tokenizer"
},
"edge_ngram_search_analyzer": {
"tokenizer": "lowercase"
}
},
"tokenizer": {
"edge_ngram_tokenizer": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 25,
"token_chars": [
"letter"
]
}
}
}
}
},
"mappings": {
"properties": {
"firebaseId": {
"type": "text"
},
"name": {
"fielddata": true,
"type": "text",
"analyzer": "standard"
},
"name_auto": {
"type": "text",
"fields": {
"keywordstring": {
"type": "text",
"analyzer": "keyword_analyzer"
},
"edgengram": {
"type": "text",
"analyzer": "edge_ngram_analyzer",
"search_analyzer": "edge_ngram_search_analyzer"
},
"completion": {
"type": "completion"
},
"synonym_analyzer": {
"type": "text",
"analyzer": "synonym"
}
}
}
}
}
}
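To see why the error message names an analyzer called [tokenizer], here is a rough sketch of the rule it describes: every entry under "analyzer" needs a "type" or a "tokenizer". This only approximates that one check for illustration; it is not Elasticsearch's actual validation code, and the function name invalid_analyzers is made up.

```python
def invalid_analyzers(analysis):
    """Return names under analysis["analyzer"] that define neither "type" nor "tokenizer"."""
    return [
        name
        for name, body in analysis.get("analyzer", {}).items()
        if "type" not in body and "tokenizer" not in body
    ]


# Shape of the failing request: the "tokenizer" section sits inside "analyzer",
# so it is read as an analyzer that happens to be named "tokenizer".
broken = {
    "analyzer": {
        "edge_ngram_search_analyzer": {"tokenizer": "lowercase"},
        "tokenizer": {"edge_ngram_tokenizer": {"type": "edge_ngram"}},
    }
}

# Shape of the corrected request: "tokenizer" is a sibling of "analyzer".
fixed = {
    "analyzer": {"edge_ngram_search_analyzer": {"tokenizer": "lowercase"}},
    "tokenizer": {"edge_ngram_tokenizer": {"type": "edge_ngram"}},
}

print(invalid_analyzers(broken))  # ['tokenizer']
print(invalid_analyzers(fixed))   # []
```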
I am looking for a way to make ES search the data with multiple analyzers: an n-gram analyzer plus one or more language analyzers.
One possible solution is to use multi-fields and explicitly declare which analyzer to use for each sub-field.
For example, with the following mappings:
"mappings": {
"my_entity": {
"properties": {
"my_field": {
"type": "text",
"fields": {
"ngram": {
"type": "string",
"analyzer": "ngram_analyzer"
},
"spanish": {
"type": "string",
"analyzer": "spanish"
},
"english": {
"type": "string",
"analyzer": "english"
}
}
}
}
}
}
The problem is that I have to explicitly list every field and its analyzers in each search query.
It also does not allow searching "_all" with multiple analyzers.
Is there a way to make an "_all" query use multiple analyzers?
Something like "_all.ngram" and "_all.spanish", without using copy_to to duplicate the data?
Is it possible to combine an n-gram analyzer with Spanish (or any other language) analysis into a single custom analyzer?
I have tested the following settings, but they did not work:
PUT /ngrams_index
{
"settings": {
"number_of_shards": 1,
"analysis": {
"tokenizer": {
"ngram_tokenizer": {
"type": "nGram",
"min_gram": 3,
"max_gram": 3
}
},
"filter": {
"ngram_filter": {
"type": "nGram",
"min_gram": 3,
"max_gram": 3
},
"spanish_stop": {
"type": "stop",
"stopwords": "_spanish_"
},
"spanish_keywords": {
"type": "keyword_marker",
"keywords": ["ejemplo"]
},
"spanish_stemmer": {
"type": "stemmer",
"language": "light_spanish"
}
},
"analyzer": {
"ngram_analyzer": {
"type": "custom",
"tokenizer": "ngram_tokenizer",
"filter": [
"lowercase",
"spanish_stop",
"spanish_keywords",
"spanish_stemmer"
]
}
}
}
},
"mappings": {
"my_entity": {
"_all": {
"enabled": true,
"analyzer": "ngram_analyzer"
},
"properties": {
"my_field": {
"type": "text",
"fields": {
"analyzer1": {
"type": "string",
"analyzer": "ngram_analyzer"
},
"analyzer2": {
"type": "string",
"analyzer": "spanish"
},
"analyzer3": {
"type": "string",
"analyzer": "english"
}
}
}
}
}
}
}
GET /ngrams_index/_analyze
{
"field": "_all",
"text": "Hola, me llamo Juan."
}
This returns just the n-gram results, without Spanish analysis,
whereas
GET /ngrams_index/_analyze
{
"field": "my_field.analyzer2",
"text": "Hola, me llamo Juan."
}
properly analyzes the search string.
Is it possible to build a custom analyzer which combines Spanish and ngram?
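There is a way to create a custom ngram+language analyzer: tokenize with the standard tokenizer, apply the language filters to whole words, and put the ngram filter last in the chain: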
PUT /ngrams_index
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"ngram_filter": {
"type": "nGram",
"min_gram": 3,
"max_gram": 3
},
"spanish_stop": {
"type": "stop",
"stopwords": "_spanish_"
},
"spanish_keywords": {
"type": "keyword_marker",
"keywords": [
"ejemplo"
]
},
"spanish_stemmer": {
"type": "stemmer",
"language": "light_spanish"
}
},
"analyzer": {
"ngram_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"spanish_stop",
"spanish_keywords",
"spanish_stemmer",
"ngram_filter"
]
}
}
}
},
"mappings": {
"my_entity": {
"_all": {
"enabled": true,
"analyzer": "ngram_analyzer"
},
"properties": {
"my_field": {
"type": "text",
"analyzer": "ngram_analyzer"
}
}
}
}
}
GET /ngrams_index/_analyze
{
"field": "my_field",
"text": "Hola, me llamo Juan."
}
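The order of the steps is what matters here. The sketch below imitates both chains in plain Python to show the difference. It is a simplification under stated assumptions: the regex tokenizer only approximates the standard tokenizer, STOPWORDS is a one-word stand-in for the real _spanish_ list, and the keyword_marker and stemmer steps are left out.

```python
import re

# Tiny illustrative stand-in for the "_spanish_" stop list (assumption).
STOPWORDS = {"me"}


def ngrams(text, n=3):
    """All substrings of length n, like an ngram with min_gram = max_gram = n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def analyze_language_first(text):
    """Word tokenizer -> lowercase -> stop filter -> ngram filter (chain from the answer)."""
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [g for t in tokens for g in ngrams(t)]


def analyze_ngram_first(text):
    """ngram tokenizer -> lowercase -> stop filter (chain from the question)."""
    grams = [g.lower() for g in ngrams(text)]
    # The stop filter only drops tokens equal to a stopword; 3-char fragments never are.
    return [g for g in grams if g not in STOPWORDS]


print(analyze_language_first("Hola, me llamo Juan."))
# ['hol', 'ola', 'lla', 'lam', 'amo', 'jua', 'uan']
print([g for g in analyze_ngram_first("Hola, me llamo Juan.") if "me" in g])
# [' me', 'me ']
```

With the n-gram step first, the language filters only ever see fragments, so stopwords leak through and a stemmer has no whole word to work on. With the n-gram filter last, the stopword is gone before the grams are produced.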
I am searching for a phrase in an email body. I need exact phrase filtering: if I search for 'Avenue New', it should return only results that contain the phrase 'Avenue New', not 'Avenue Street', 'Park Avenue', etc.
My mapping looks like this:
{
"exchangemailssql": {
"aliases": {},
"mappings": {
"email": {
"dynamic_templates": [
{
"_default": {
"match": "*",
"match_mapping_type": "string",
"mapping": {
"doc_values": true,
"type": "keyword"
}
}
}
],
"properties": {
"attachments": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"body": {
"type": "text",
"analyzer": "keylower",
"fielddata": true
},
"count": {
"type": "short"
},
"emailId": {
"type": "long"
}
}
}
},
"settings": {
"index": {
"refresh_interval": "3s",
"number_of_shards": "1",
"provided_name": "exchangemailssql",
"creation_date": "1500527793230",
"analysis": {
"filter": {
"nGram": {
"min_gram": "4",
"side": "front",
"type": "edge_ngram",
"max_gram": "100"
}
},
"analyzer": {
"keylower": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "keyword"
},
"email": {
"filter": [
"lowercase",
"unique",
"nGram"
],
"type": "custom",
"tokenizer": "uax_url_email"
},
"full": {
"filter": [
"lowercase",
"snowball",
"nGram"
],
"type": "custom",
"tokenizer": "standard"
}
}
},
"number_of_replicas": "0",
"uuid": "2XTpHmwaQF65PNkCQCmcVQ",
"version": {
"created": "5040099"
}
}
}
}
}
My search query is:
{
"query": {
"match_phrase": {
"body": "Avenue New"
}
},
"highlight": {
"fields" : {
"body" : {}
}
}
}
The problem here is that you're tokenizing the full body content using the keyword tokenizer, i.e. the whole body is indexed as one big lowercase token and you cannot search inside it.
If you simply change the analyzer of your body field to standard instead of keylower, you'll find what you need using the match_phrase query.
"body": {
"type": "text",
"analyzer": "standard", <---change this
"fielddata": true
},
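A small plain-Python model of why the analyzer matters for match_phrase. It assumes a phrase query matches when the query tokens appear as consecutive tokens in the document; the regex tokenizer is only a rough stand-in for the standard analyzer, and the sample bodies are invented.

```python
import re


def keylower_tokens(text):
    """keyword tokenizer + lowercase: the whole input becomes one token."""
    return [text.lower()]


def standard_tokens(text):
    """Rough stand-in for the standard analyzer: lowercase word tokens."""
    return [t.lower() for t in re.findall(r"\w+", text)]


def phrase_match(doc_tokens, query_tokens):
    """True if the query tokens occur consecutively in the document tokens."""
    n = len(query_tokens)
    return any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


body = "Our office is at 5th Avenue New York"  # invented sample body
query = "Avenue New"

# keylower: one giant token, so the phrase can only match an identical body.
print(phrase_match(keylower_tokens(body), keylower_tokens(query)))  # False
# standard: word tokens with positions, so the phrase is found inside the body.
print(phrase_match(standard_tokens(body), standard_tokens(query)))  # True
# A body that only contains one of the words does not match.
print(phrase_match(standard_tokens("Park Avenue"), standard_tokens(query)))  # False
```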
The char_filter section of the Elasticsearch documentation is rather vague, and I'm having a lot of difficulty understanding whether and how to use the mapping char filter: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-mapping-charfilter.html
Basically, the data we store in the index are IDs of type String that look like this: "008392342000". I want to be able to find such IDs when the query terms contain a hyphen or a trailing space, like this: "008392342-000 ".
How would you advise I set up the analyzer?
Currently this is the definition of the field:
"mappings": {
"client": {
"properties": {
"ucn": {
"type": "multi_field",
"fields": {
"ucn_autoc": {
"type": "string",
"index": "analyzed",
"index_analyzer": "autocomplete_index",
"search_analyzer": "autocomplete_search"
},
"ucn": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
Here are the settings for the index, including the analyzers:
"settings": {
"analysis": {
"filter": {
"autocomplete_ngram": {
"max_gram": 15,
"min_gram": 1,
"type": "edge_ngram"
},
"ngram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 8
}
},
"analyzer": {
"lowercase_analyzer": {
"filter": [
"lowercase"
],
"tokenizer": "keyword"
},
"autocomplete_index": {
"filter": [
"lowercase",
"autocomplete_ngram"
],
"tokenizer": "keyword"
},
"ngram_index": {
"filter": [
"ngram_filter",
"lowercase"
],
"tokenizer": "keyword"
},
"autocomplete_search": {
"filter": [
"lowercase"
],
"tokenizer": "keyword"
},
"ngram_search": {
"filter": [
"lowercase"
],
"tokenizer": "keyword"
}
},
"index": {
"number_of_shards": 6,
"number_of_replicas": 1
}
}
}
You haven't provided your actual analyzers, what data goes in and what your expectations are, but based on the info you provided I would start with this:
{
"settings": {
"analysis": {
"char_filter": {
"my_mapping": {
"type": "mapping",
"mappings": [
"-=>"
]
}
},
"analyzer": {
"autocomplete_search": {
"tokenizer": "keyword",
"char_filter": [
"my_mapping"
],
"filter": [
"trim"
]
},
"autocomplete_index": {
"tokenizer": "keyword",
"filter": [
"trim"
]
}
}
}
},
"mappings": {
"test": {
"properties": {
"ucn": {
"type": "multi_field",
"fields": {
"ucn_autoc": {
"type": "string",
"index": "analyzed",
"index_analyzer": "autocomplete_index",
"search_analyzer": "autocomplete_search"
},
"ucn": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}
The char_filter replaces - with nothing: -=>. I would also use the trim filter to get rid of any leading or trailing whitespace. I don't know what your autocomplete_index analyzer looks like, so I just used a keyword one.
Testing the analyzer GET /my_index/_analyze?analyzer=autocomplete_search&text= 0123-34742-000 results in:
"tokens": [
{
"token": "012334742000",
"start_offset": 0,
"end_offset": 17,
"type": "word",
"position": 1
}
]
which means it does eliminate the - and the white spaces.
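The same three steps can be imitated in plain Python to check other inputs without a cluster. This is only a model of the autocomplete_search and autocomplete_index analyzers defined above, not a call into Elasticsearch.

```python
def autocomplete_search(text):
    """Model of the search analyzer: char_filter "-=>" -> keyword tokenizer -> trim."""
    filtered = text.replace("-", "")  # mapping char filter: "-" becomes nothing
    token = filtered                  # keyword tokenizer: whole input is one token
    return token.strip()              # trim filter: drop leading/trailing whitespace


def autocomplete_index(text):
    """Model of the index analyzer: keyword tokenizer -> trim."""
    return text.strip()


print(autocomplete_search(" 0123-34742-000 "))  # 012334742000

# The ID from the question: stored without a hyphen, queried with a hyphen and a trailing space.
print(autocomplete_index("008392342000") == autocomplete_search("008392342-000 "))  # True
```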
And the typical query would be:
{
"query": {
"match": {
"ucn.ucn_autoc": " 0123-34742-000 "
}
}
}
I have the following mapping for elasticsearch
{
"mappings": {
"hotel": {
"properties": {"name": {
"type": "string",
"search_analyzer": "str_search_analyzer",
"index_analyzer": "str_index_analyzer"},
"destination": {"properties": {"en": {
"type": "string",
"search_analyzer": "str_search_analyzer",
"index_analyzer": "str_index_analyzer"}}},
"country": {"properties": {"en": {
"type": "string",
"search_analyzer": "str_search_analyzer",
"index_analyzer": "str_index_analyzer"}}},
"destination_facets": {"properties": {"en": {
"type": "string",
"search_analyzer": "facet_analyzer"
}}}
}
}
},
"settings": {
"analysis": {
"analyzer": {
"str_search_analyzer": {
"tokenizer": "keyword",
"filter": ["lowercase"]
},
"str_index_analyzer": {
"tokenizer": "keyword",
"filter": ["lowercase", "substring"]
},
"facet_analyzer": {
"type": "keyword",
"tokenizer": "keyword"
}
},
"filter": {
"substring": {
"type": "edgeNGram",
"min_gram": 1,
"max_gram": 20
}
}
}
}
}
I want my destination_facets field not to be tokenized, but it comes out whitespace-tokenized. Is there a way to disable all tokenization?
You probably need to set your facet_analyzer not only as the search_analyzer but also as the index_analyzer (Elasticsearch probably uses the latter for faceting; the search_analyzer is only used to parse query strings).
Note that if you want the same analyzer for both, you can just use the analyzer property in your mapping.
Example:
{
"mappings": {
"hotel": {
...
"destination_facets": {"properties": {"en": {
"type": "string",
"analyzer": "facet_analyzer"
}}}
}
}
},
"settings": {
...
}
}
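A toy model of why the index-side analyzer decides what the facet values look like, assuming facet counts are computed over the terms produced at index time. The sample destinations and function names are invented for illustration.

```python
from collections import Counter


def whitespace_terms(value):
    """Index-time analysis that splits on whitespace."""
    return value.split()


def keyword_terms(value):
    """Index-time analysis with the keyword analyzer: the whole value is one term."""
    return [value]


def facet_counts(values, index_analyzer):
    """Count, per indexed term, how many documents contain it."""
    counts = Counter()
    for value in values:
        counts.update(set(index_analyzer(value)))
    return counts


destinations = ["New York", "New York", "San Francisco"]  # invented sample data

print(facet_counts(destinations, whitespace_terms))
# counts per word: New 2, York 2, San 1, Francisco 1
print(facet_counts(destinations, keyword_terms))
# counts per full value: New York 2, San Francisco 1
```

Setting only the search_analyzer changes how the query string is parsed, not which terms were stored, so the facet keeps showing split words until the index-side analyzer is the keyword one and the data is reindexed.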