Elasticsearch facets come back whitespace tokenized - elasticsearch

I have the following mapping for Elasticsearch:
{
"mappings": {
"hotel": {
"properties": {"name": {
"type": "string",
"search_analyzer": "str_search_analyzer",
"index_analyzer": "str_index_analyzer"},
"destination": {'properties': {'en': {
"type": "string",
"search_analyzer": "str_search_analyzer",
"index_analyzer": "str_index_analyzer"}}},
"country": {"properties": {"en": {
"type": "string",
"search_analyzer": "str_search_analyzer",
"index_analyzer": "str_index_analyzer"}}},
"destination_facets": {"properties": {"en": {
"type": "string",
"search_analyzer": "facet_analyzer"
}}}
}
}
},
"settings": {
"analysis": {
"analyzer": {
"str_search_analyzer": {
"tokenizer": "keyword",
"filter": ["lowercase"]
},
"str_index_analyzer": {
"tokenizer": "keyword",
"filter": ["lowercase", "substring"]
},
"facet_analyzer": {
"type": "keyword",
"tokenizer": "keyword"
}
},
"filter": {
"substring": {
"type": "edgeNGram",
"min_gram": 1,
"max_gram": 20,
}
}
}
}
}
I want my destination_facets not to be tokenized, but the facet values come back whitespace-tokenized. Is there a way to skip tokenization entirely?

You probably need to set your facet_analyzer not only as the search_analyzer but also as the index_analyzer (Elasticsearch probably uses the index analyzer for faceting; the search_analyzer is only used to parse query strings).
Note that if you want the same analyzer for both, you can just use the analyzer key in your mapping.
Ex:
{
"mappings": {
"hotel": {
...
"destination_facets": {"properties": {"en": {
"type": "string",
"analyzer": "facet_analyzer"
}}}
}
},
"settings": {
...
}
}
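Spelled out more fully, the relevant pieces would look roughly like this (a sketch; only the destination_facets field is shown, the other fields stay as in the question, and the redundant tokenizer entry is dropped because the built-in keyword analyzer does not need one):
{
  "mappings": {
    "hotel": {
      "properties": {
        "destination_facets": {
          "properties": {
            "en": {
              "type": "string",
              "analyzer": "facet_analyzer"
            }
          }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "facet_analyzer": {
          "type": "keyword"
        }
      }
    }
  }
}
Keep in mind that changing the analyzer of an existing field requires reindexing your documents before the facet values come back untokenized.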

Related

Elastic search: Run multiple analyzers on the same data

I am looking for a way to make Elasticsearch search the same data with multiple analyzers: an nGram analyzer plus one or more language analyzers.
A possible solution would be to use multi-fields and explicitly declare which analyzer to use for each field.
For example, with the following mappings:
"mappings": {
"my_entity": {
"properties": {
"my_field": {
"type": "text",
"fields": {
"ngram": {
"type": "string",
"analyzer": "ngram_analyzer"
},
"spanish": {
"type": "string",
"analyzer": "spanish"
},
"english": {
"type": "string",
"analyzer": "english"
}
}
}
}
}
}
The problem with that is that I have to explicitly list every field and its analyzers in the search query, and it does not allow searching against "_all" with multiple analyzers.
Is there a way to make an "_all" query use multiple analyzers? Something like "_all.ngram" and "_all.spanish", without using copy_to to duplicate the data?
Is it possible to combine an ngram analyzer with Spanish (or any other language) analysis into a single custom analyzer?
I have tested the following settings but these did not work:
PUT /ngrams_index
{
"settings": {
"number_of_shards": 1,
"analysis": {
"tokenizer": {
"ngram_tokenizer": {
"type": "nGram",
"min_gram": 3,
"max_gram": 3
}
},
"filter": {
"ngram_filter": {
"type": "nGram",
"min_gram": 3,
"max_gram": 3
},
"spanish_stop": {
"type": "stop",
"stopwords": "_spanish_"
},
"spanish_keywords": {
"type": "keyword_marker",
"keywords": ["ejemplo"]
},
"spanish_stemmer": {
"type": "stemmer",
"language": "light_spanish"
}
},
"analyzer": {
"ngram_analyzer": {
"type": "custom",
"tokenizer": "ngram_tokenizer",
"filter": [
"lowercase",
"spanish_stop",
"spanish_keywords",
"spanish_stemmer"
]
}
}
}
},
"mappings": {
"my_entity": {
"_all": {
"enabled": true,
"analyzer": "ngram_analyzer"
},
"properties": {
"my_field": {
"type": "text",
"fields": {
"analyzer1": {
"type": "string",
"analyzer": "ngram_analyzer"
},
"analyzer2": {
"type": "string",
"analyzer": "spanish"
},
"analyzer3": {
"type": "string",
"analyzer": "english"
}
}
}
}
}
}
}
GET /ngrams_index/_analyze
{
"field": "_all",
"text": "Hola, me llamo Juan."
}
returns just ngram results, without any Spanish analysis, whereas
GET /ngrams_index/_analyze
{
"field": "my_field.analyzer2",
"text": "Hola, me llamo Juan."
}
properly analyzes the search string.
Is it possible to build a custom analyzer which combines Spanish and ngram analysis?
There is a way to create a custom ngram+language analyzer:
PUT /ngrams_index
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"ngram_filter": {
"type": "nGram",
"min_gram": 3,
"max_gram": 3
},
"spanish_stop": {
"type": "stop",
"stopwords": "_spanish_"
},
"spanish_keywords": {
"type": "keyword_marker",
"keywords": [
"ejemplo"
]
},
"spanish_stemmer": {
"type": "stemmer",
"language": "light_spanish"
}
},
"analyzer": {
"ngram_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"spanish_stop",
"spanish_keywords",
"spanish_stemmer",
"ngram_filter"
]
}
}
}
},
"mappings": {
"my_entity": {
"_all": {
"enabled": true,
"analyzer": "ngram_analyzer"
},
"properties": {
"my_field": {
"type": "text",
"analyzer": "ngram_analyzer"
}
}
}
}
}
GET /ngrams_index/_analyze
{
"field": "my_field",
"text": "Hola, me llamo Juan."
}
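With the single combined analyzer applied at index time (and on _all), a match query now benefits from both the Spanish processing and the 3-gram partial matching. A minimal sketch, assuming a document containing "Hola, me llamo Juan." has been indexed into my_field (the query text "Jua" is just an illustrative partial term):
GET /ngrams_index/_search
{
  "query": {
    "match": {
      "my_field": "Jua"
    }
  }
}
The query text runs through the same standard + Spanish + 3-gram chain, so "Jua" becomes the gram "jua", which matches one of the grams produced for "Juan" at index time.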

Elastic search highlights the whole word instead of the part of word

If the query is Brid I want to get <em>Brid</em>gitte in the highlighted fields, not the whole word <em>Bridgitte</em>.
My index looks like this (I've added an ngram analyzer as suggested in Highlighting part of word in elasticsearch):
{
"myindex": {
"aliases": {},
"mappings": {
"mytype": {
"properties": {
"myarrayproperty": {
"properties": {
"mystringproperty1": {
"type": "string",
"term_vector": "with_positions_offsets",
"analyzer": "index_ngram_analyzer",
"search_analyzer": "search_term_analyzer"
},
"mystringproperty2": {
"type": "string",
"term_vector": "with_positions_offsets",
"analyzer": "index_ngram_analyzer",
"search_analyzer": "search_term_analyzer"
}
}
},
"mylongproperty": {
"type": "long"
},
"mydateproperty": {
"type": "date",
"format": "strict_date_optional_time||epoch_millis"
},
"mystringproperty3": {
"type": "string",
"term_vector": "with_positions_offsets",
"analyzer": "index_ngram_analyzer",
"search_analyzer": "search_term_analyzer"
},
"mystringproperty4": {
"type": "string",
"term_vector": "with_positions_offsets",
"analyzer": "index_ngram_analyzer",
"search_analyzer": "search_term_analyzer"
}
}
}
},
"settings": {
"index": {
"creation_date": "1498030893611",
"analysis": {
"analyzer": {
"search_term_analyzer": {
"filter": "lowercase",
"type": "custom",
"tokenizer": "ngram_tokenizer"
},
"index_ngram_analyzer": {
"filter": ["lowercase"],
"type": "custom",
"tokenizer": "ngram_tokenizer"
}
},
"tokenizer": {
"ngram_tokenizer": {
"token_chars": ["letter", "digit"],
"min_gram": "1",
"type": "nGram",
"max_gram": "15"
}
}
},
"number_of_shards": "5",
"number_of_replicas": "1",
"uuid": "e5kBX-XRTKOqeAScO1Fs0w",
"version": {
"created": "2040499"
}
}
},
"warmers": {}
}
}
This is an embedded Elasticsearch instance, not sure if it's relevant.
My query looks like this:
MatchQueryBuilder queryBuilder = matchPhrasePrefixQuery("_all", query).maxExpansions(50);
final SearchResponse response = client.prepareSearch("myindex")
.setQuery(queryBuilder)
.addHighlightedField("mystringproperty3", 0, 0)
.addHighlightedField("mystringproperty4", 0, 0)
.addHighlightedField("myarrayproperty.mystringproperty1", 0, 0)
.setHighlighterRequireFieldMatch(false)
.execute().actionGet();
And it doesn't work. I've tried changing the query to queryStringQuery, but it doesn't seem to support searching by part of a word. Any suggestions?
It's not possible. Elasticsearch indexes whole words; from a tokenization perspective, you cannot do much here.
You may need to write a wrapper over the search result yourself (this is not Elasticsearch specific).

ElasticSearch mapping for multiple fields

Currently I am using a dynamic template as follows, where I am applying an n-gram analyzer to all the string fields.
However, to improve efficiency I would like to apply n-grams only to specific fields and not to all string fields.
{
"template": "*",
"settings": {
"analysis": {
"filter": {
"ngram_filter": {
"type": "ngram",
"min_gram": 1,
"max_gram": 25
}
},
"analyzer": {
"case_insensitive": {
"tokenizer": "whitespace",
"filter": [
"ngram_filter",
"lowercase"
]
},
"search_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": "lowercase"
}
}
}
},
"mappings": {
"my_type": {
"dynamic_templates": [
{
"strings": {
"match_mapping_type": "string",
"mapping": {
"type": "string",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "case_insensitive",
"search_analyzer": "search_analyzer"
}
}
}
]
}
}
}
I have a payload like this:
{
"userId":"abc123-pqr180-xyz124-njd212",
"email" : "someuser#test.com",
"name" : "somename",
.
.
20 more fields
}
Now I want to apply n-grams only to the "email" and "userId" fields.
How can we do this?
Since you cannot rename the fields, I suggest the following solution, i.e. duplicating the dynamic template for the name and email fields.
{
"template": "*",
"settings": {
"analysis": {
"filter": {
"ngram_filter": {
"type": "ngram",
"min_gram": 1,
"max_gram": 25
}
},
"analyzer": {
"case_insensitive": {
"tokenizer": "whitespace",
"filter": [
"ngram_filter",
"lowercase"
]
},
"search_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": "lowercase"
}
}
}
},
"mappings": {
"my_type": {
"dynamic_templates": [
{
"names": {
"match_mapping_type": "string",
"match": "name",
"mapping": {
"type": "string",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "case_insensitive",
"search_analyzer": "search_analyzer"
}
}
},
{
"emails": {
"match_mapping_type": "string",
"match": "email",
"mapping": {
"type": "string",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "case_insensitive",
"search_analyzer": "search_analyzer"
}
}
}
]
}
}
}
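If the other string fields should still be indexed, just without n-grams, a catch-all template can be appended after these two. Dynamic templates are evaluated in order and the first matching one wins, so the specific email and name templates keep the n-gram analyzer. A sketch of such a third entry (the name other_strings is illustrative):
{
  "other_strings": {
    "match_mapping_type": "string",
    "mapping": {
      "type": "string",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    }
  }
}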

elastic search edge_ngram issue?

I have edge_ngram configured for a field.
Suppose the word indexed with edge_ngram is: quick
and it is analyzed as: q, qu, qui, quic, quick
When I try to search for quickfull, words containing quick also come up in the results.
I want results only when the word contains quickfull, otherwise no results.
This is my mapping:
{
"john_search": {
"aliases": {},
"mappings": {
"drugs": {
"properties": {
"chemical": {
"type": "string"
},
"cutting_allowed": {
"type": "boolean"
},
"id": {
"type": "long"
},
"is_banned": {
"type": "boolean"
},
"is_discontinued": {
"type": "boolean"
},
"manufacturer": {
"type": "string"
},
"name": {
"type": "string",
"boost": 2,
"fields": {
"exact": {
"type": "string",
"boost": 4,
"analyzer": "standard"
},
"phenotic": {
"type": "string",
"analyzer": "dbl_metaphone"
}
},
"analyzer": "autocomplete"
},
"price": {
"type": "string",
"index": "not_analyzed"
},
"refrigerated": {
"type": "boolean"
},
"sell_freq": {
"type": "long"
},
"xtra_name": {
"type": "string"
}
}
}
},
"settings": {
"index": {
"creation_date": "1475061490060",
"analysis": {
"filter": {
"my_metaphone": {
"replace": "false",
"type": "phonetic",
"encoder": "metaphone"
},
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": "3",
"max_gram": "100"
}
},
"analyzer": {
"autocomplete": {
"filter": [
"lowercase",
"autocomplete_filter"
],
"type": "custom",
"tokenizer": "standard"
},
"dbl_metaphone": {
"filter": "my_metaphone",
"tokenizer": "standard"
}
}
},
"number_of_shards": "1",
"number_of_replicas": "1",
"uuid": "qoRll9uATpegMtrnFTsqIw",
"version": {
"created": "2040099"
}
}
},
"warmers": {}
}
}
Any help would be appreciated.
It's because your name field has "analyzer": "autocomplete", which means that the autocomplete analyzer will also be applied at search time, hence the search term quickfull will be tokenized to q, qu, qui, quic, quick, quickf, quickfu, quickful and quickfull and that matches quick as well.
In order to prevent this, you need to set "search_analyzer": "standard" on the name field to override the index-time analyzer.
"name": {
"type": "string",
"boost": 2,
"fields": {
"exact": {
"type": "string",
"boost": 4,
"analyzer": "standard"
},
"phenotic": {
"type": "string",
"analyzer": "dbl_metaphone"
}
},
"analyzer": "autocomplete",
"search_analyzer": "standard" <--- add this
},
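You can double-check the search-time behaviour with the _analyze API; with the standard analyzer the query term is kept intact instead of being split into edge n-grams (index name taken from the question):
GET /john_search/_analyze?analyzer=standard&text=quickfull
This returns the single token quickfull, so only documents whose indexed name produces that full gram (i.e. words starting with quickfull) will match; quick alone will not.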

How to match query terms containing hyphens or trailing space in elasticsearch

The char_filter section of the Elasticsearch mapping documentation is kind of vague, and I'm having a lot of difficulty understanding if and how to use the mapping char filter: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-mapping-charfilter.html
Basically the data we are storing in the index are ids of type String that look like this: "008392342000". I want to be able to search such ids when query terms actually contain a hyphen or trailing space like this: "008392342-000 ".
How would you advise I set up the analyzer?
Currently this is the definition of the field:
"mappings": {
"client": {
"properties": {
"ucn": {
"type": "multi_field",
"fields": {
"ucn_autoc": {
"type": "string",
"index": "analyzed",
"index_analyzer": "autocomplete_index",
"search_analyzer": "autocomplete_search"
},
"ucn": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
Here are the settings for the index, containing the analyzers etc.:
"settings": {
"analysis": {
"filter": {
"autocomplete_ngram": {
"max_gram": 15,
"min_gram": 1,
"type": "edge_ngram"
},
"ngram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 8
}
},
"analyzer": {
"lowercase_analyzer": {
"filter": [
"lowercase"
],
"tokenizer": "keyword"
},
"autocomplete_index": {
"filter": [
"lowercase",
"autocomplete_ngram"
],
"tokenizer": "keyword"
},
"ngram_index": {
"filter": [
"ngram_filter",
"lowercase"
],
"tokenizer": "keyword"
},
"autocomplete_search": {
"filter": [
"lowercase"
],
"tokenizer": "keyword"
},
"ngram_search": {
"filter": [
"lowercase"
],
"tokenizer": "keyword"
}
}
},
"index": {
"number_of_shards": 6,
"number_of_replicas": 1
}
}
You haven't provided your actual analyzers, what data goes in, or what your expectations are, but based on the info you provided I would start with this:
{
"settings": {
"analysis": {
"char_filter": {
"my_mapping": {
"type": "mapping",
"mappings": [
"-=>"
]
}
},
"analyzer": {
"autocomplete_search": {
"tokenizer": "keyword",
"char_filter": [
"my_mapping"
],
"filter": [
"trim"
]
},
"autocomplete_index": {
"tokenizer": "keyword",
"filter": [
"trim"
]
}
}
}
},
"mappings": {
"test": {
"properties": {
"ucn": {
"type": "multi_field",
"fields": {
"ucn_autoc": {
"type": "string",
"index": "analyzed",
"index_analyzer": "autocomplete_index",
"search_analyzer": "autocomplete_search"
},
"ucn": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}
The char_filter replaces - with nothing: -=>. I would also use the trim filter to get rid of any trailing or leading whitespace. I don't know what your autocomplete_index analyzer looks like, so I just used a keyword one.
Testing the analyzer GET /my_index/_analyze?analyzer=autocomplete_search&text= 0123-34742-000 results in:
"tokens": [
{
"token": "012334742000",
"start_offset": 0,
"end_offset": 17,
"type": "word",
"position": 1
}
]
which means it does eliminate the - and the whitespace.
And the typical query would be:
{
"query": {
"match": {
"ucn.ucn_autoc": " 0123-34742-000 "
}
}
}
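If you also want prefix-style matching on these ids (which your original autocomplete_index analyzer hints at), the same char_filter can be combined with the edge_ngram filter from your original settings on the index side. A sketch, reusing the autocomplete_ngram filter defined in the question:
"autocomplete_index": {
  "tokenizer": "keyword",
  "char_filter": ["my_mapping"],
  "filter": ["lowercase", "trim", "autocomplete_ngram"]
}
With that in place, the same match query would also return the document when only the first few digits of the id are typed.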

Resources