Mapping search cities by name - elasticsearch

I'm building an autocomplete input to search in a city database.
When I type "Wash", I'd like to propose Washington. Now when I type "Wash" I have nothing, With "washingt" ES found "washington".
I'm using FOS elastica.
mapping:
indexes:
    cities:
        finder: ~
        use_alias: false
        settings:
            analysis:
                analyzer:
                    text_analyzer:
                        tokenizer: 'whitespace'
                        filter: [ lowercase, trim, asciifolding, elision, name_filter ]
                filter:
                    name_filter:
                        type: edgeNGram
                        max_gram: 100
                        min_gram: 2
        properties:
            zipCode:
                type: keyword
            name:
                analyzer: text_analyzer
ES request:
{
  "query": {
    "query_string": {
      "query": "washingt",
      "fields": [
        "name",
        "zipCode"
      ]
    }
  },
  "from": 0,
  "size": 1000
}
Any ideas what I'm missing?
Edit:
POST cities/_analyze
{ "field": "name", "text": ["washington"] }
The response:
{
"tokens": [
{
"token": "w",
"start_offset": 0,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 0
}
,
{
"token": "wa",
"start_offset": 0,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 0
}
,
{
"token": "was",
"start_offset": 0,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 0
}
,
{
"token": "wash",
"start_offset": 0,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 0
}
,
{
"token": "washi",
"start_offset": 0,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 0
}
,
{
"token": "washin",
"start_offset": 0,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 0
}
,
{
"token": "washing",
"start_offset": 0,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 0
}
,
{
"token": "washingt",
"start_offset": 0,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 0
}
,
{
"token": "washingto",
"start_offset": 0,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 0
}
,
{
"token": "washington",
"start_offset": 0,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 0
}
]
}

Related

Separators in standard analyzer of elasticsearch

I know that Elasticsearch's standard analyzer uses the standard tokenizer to generate tokens.
The Elasticsearch docs say it does grammar-based tokenization, but the separators used by the standard tokenizer are not clear.
My use case is as follows:
In my Elasticsearch index I have some fields which use the default standard analyzer.
In those fields I want the # character to be searchable and . to act as one more separator.
Can I achieve my use case with the standard analyzer?
I checked which tokens it generates for the string hey john.s #100 is a test name.
POST _analyze
{
"text": "hey john.s #100 is a test name",
"analyzer": "standard"
}
It generated the following tokens:
{
"tokens": [
{
"token": "hey",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "john.s",
"start_offset": 4,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "100",
"start_offset": 12,
"end_offset": 15,
"type": "<NUM>",
"position": 2
},
{
"token": "is",
"start_offset": 16,
"end_offset": 18,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "a",
"start_offset": 19,
"end_offset": 20,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "test",
"start_offset": 21,
"end_offset": 25,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "name",
"start_offset": 26,
"end_offset": 30,
"type": "<ALPHANUM>",
"position": 6
}
]
}
So now I'm wondering: is only whitespace used as a separator by the standard tokenizer?
Thank you in advance.
Let's first see why it is not breaking tokens on . for some of the words:
The standard analyzer does use the standard tokenizer, but the standard tokenizer provides grammar-based tokenization based on the Unicode Text Segmentation algorithm (you can read more about the algorithm here, here and here); it is not using the whitespace tokenizer.
Now let's see how you can tokenize on . (dot) and not on #:
You can use the character group tokenizer and provide the list of characters on which you want to apply tokenization.
POST _analyze
{
  "tokenizer": {
    "type": "char_group",
    "tokenize_on_chars": [
      "whitespace",
      ".",
      "\n"
    ]
  },
  "text": "hey john.s #100 is a test name"
}
Response:
{
"tokens": [
{
"token": "hey",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "john",
"start_offset": 4,
"end_offset": 8,
"type": "word",
"position": 1
},
{
"token": "s",
"start_offset": 9,
"end_offset": 10,
"type": "word",
"position": 2
},
{
"token": "#100",
"start_offset": 11,
"end_offset": 15,
"type": "word",
"position": 3
},
{
"token": "is",
"start_offset": 16,
"end_offset": 18,
"type": "word",
"position": 4
},
{
"token": "a",
"start_offset": 19,
"end_offset": 20,
"type": "word",
"position": 5
},
{
"token": "test",
"start_offset": 21,
"end_offset": 25,
"type": "word",
"position": 6
},
{
"token": "name",
"start_offset": 26,
"end_offset": 30,
"type": "word",
"position": 7
}
]
}
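If you want the same behaviour at index time rather than only in _analyze, the char_group tokenizer can be wrapped in a custom analyzer in the index settings. A minimal sketch (the index name my_index, analyzer name my_char_group_analyzer and field name title are made up for illustration; the typeless mapping assumes Elasticsearch 7 or later):
PUT my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_char_group_tokenizer": {
          "type": "char_group",
          "tokenize_on_chars": [ "whitespace", ".", "\n" ]
        }
      },
      "analyzer": {
        "my_char_group_analyzer": {
          "type": "custom",
          "tokenizer": "my_char_group_tokenizer",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_char_group_analyzer"
      }
    }
  }
}
With this mapping, "john.s" is indexed as john and s, while "#100" stays as a single #100 token.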

elasticsearch edgengram copy_to field partial search not working

Below is the Elasticsearch mapping, with one field called hostname and another field called catch_all, which is basically a copy_to field (there will be many more fields copying values into it):
{
  "settings": {
    "analysis": {
      "filter": {
        "myNGramFilter": {
          "type": "edgeNGram",
          "min_gram": 1,
          "max_gram": 40
        }
      },
      "analyzer": {
        "myNGramAnalyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "myNGramFilter"]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "catch_all": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "store": true,
              "ignore_above": 256
            },
            "grams": {
              "type": "text",
              "store": true,
              "analyzer": "myNGramAnalyzer"
            }
          }
        },
        "hostname": {
          "type": "text",
          "copy_to": "catch_all"
        }
      }
    }
  }
}
When I run the following analyze request:
GET index/_analyze
{
"analyzer": "myNGramAnalyzer",
"text": "Dell PowerEdge R630"
}
{
"tokens": [
{
"token": "d",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "de",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "del",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "dell",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "p",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "po",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "pow",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "powe",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "power",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "powere",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "powered",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "poweredg",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "poweredge",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "r",
"start_offset": 15,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "r6",
"start_offset": 15,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "r63",
"start_offset": 15,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "r630",
"start_offset": 15,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 2
}
]
}
There is a token called "poweredge".
Right now we search with the query below:
{
  "query": {
    "multi_match": {
      "fields": ["catch_all.grams"],
      "query": "poweredge",
      "operator": "and"
    }
  }
}
When we query with "poweredge" we get 1 result, but when we search for just "edge" there is no result.
Even a match query does not yield results for the search word "edge".
Can somebody help here?
I suggest not querying with the multi_match API for your use case, but using a match query instead. The edge n-gram works this way: it tries to build n-grams from the tokens generated by a whitespace tokenizer on your text. As written in the documentation - read here:
The edge_ngram tokenizer first breaks text down into words whenever it
encounters one of a list of specified characters, then it emits
N-grams of each word where the start of the N-gram is anchored to the
beginning of the word.
As your call to the _analyze API shows, it doesn't produce "edge" (from poweredge) as an n-gram, because it only produces n-grams anchored to the beginning of the word - look at the output of your _analyze call. Take a look here: https://www.elastic.co/guide/en/elasticsearch/guide/master/ngrams-compound-words.html
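If you really do need matches in the middle of a word (like "edge" inside "poweredge"), the approach from the linked guide is a plain ngram filter instead of edgeNGram. A rough sketch of how the settings above could be adapted (the index name my_ngram_index and the gram sizes 3-4 are just an example; the typeless mapping assumes a recent Elasticsearch, and full n-grams grow the index considerably, so a plain search_analyzer is usually set so the query itself is not n-grammed):
PUT my_ngram_index
{
  "settings": {
    "analysis": {
      "filter": {
        "myNGramFilter": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 4
        }
      },
      "analyzer": {
        "myNGramAnalyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "myNGramFilter" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "catch_all": {
        "type": "text",
        "analyzer": "myNGramAnalyzer",
        "search_analyzer": "standard"
      }
    }
  }
}
A match query on catch_all for "edge" should then hit, since "edge" is one of the grams produced from "poweredge" at index time.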

How to see the analyzed data that was indexed?

How do you see the analyzed data that is stored after you index something?
I know you can just do a search to see it, like this:
http://localhost:9200/local_products_fr/fields/_search/
But what I want to see is the actual analyzed data, not the _source -
something like what you get when you call _analyze:
http://localhost:9200/local_products_fr/_analyze?text=<p>my <b>super</b> text</p>&analyzer=analyzer_fr
{
"tokens": [
{
"token": "my",
"start_offset": 3,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "b",
"start_offset": 7,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "sup",
"start_offset": 9,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "b",
"start_offset": 16,
"end_offset": 17,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "text",
"start_offset": 19,
"end_offset": 23,
"type": "<ALPHANUM>",
"position": 4
}
]
}
I use this to get the inverted index for a field per document:
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "_id": {
              "value": "2"
            }
          }
        }
      ]
    }
  },
  "script_fields": {
    "terms": {
      "script": "doc['product.name'].values"
    }
  }
}
Hope this works for you
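Another way to inspect what was actually indexed for a single document is the term vectors API; as far as I know it recomputes the terms from _source if term vectors were not stored. A sketch reusing the index, type and field names from the examples above (the exact URL shape depends on your Elasticsearch version):
GET local_products_fr/fields/2/_termvectors?fields=product.name
On newer, typeless versions the form is GET local_products_fr/_termvectors/2?fields=product.name. The response lists each indexed term for product.name together with its positions and offsets.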

Edge NGram with phrase matching

I need to autocomplete phrases. For example, when I search "dementia in alz", I want to get "dementia in alzheimer's".
For this, I configured an Edge NGram tokenizer. I tried both edge_ngram_analyzer and standard as the analyzer in the query body; nevertheless, I can't get results when trying to match a phrase.
What am I doing wrong?
My query:
{
  "query": {
    "multi_match": {
      "query": "dementia in alz",
      "type": "phrase",
      "analyzer": "edge_ngram_analyzer",
      "fields": ["_all"]
    }
  }
}
My mappings:
...
"type" : {
  "_all" : {
    "analyzer" : "edge_ngram_analyzer",
    "search_analyzer" : "standard"
  },
  "properties" : {
    "field" : {
      "type" : "string",
      "analyzer" : "edge_ngram_analyzer",
      "search_analyzer" : "standard"
    },
...
"settings" : {
  ...
  "analysis" : {
    "filter" : {
      "stem_possessive_filter" : {
        "name" : "possessive_english",
        "type" : "stemmer"
      }
    },
    "analyzer" : {
      "edge_ngram_analyzer" : {
        "filter" : [ "lowercase" ],
        "tokenizer" : "edge_ngram_tokenizer"
      }
    },
    "tokenizer" : {
      "edge_ngram_tokenizer" : {
        "token_chars" : [ "letter", "digit", "whitespace" ],
        "min_gram" : "2",
        "type" : "edgeNGram",
        "max_gram" : "25"
      }
    }
  }
...
My documents:
{
"_score": 1.1152233,
"_type": "Diagnosis",
"_id": "AVZLfHfBE5CzEm8aJ3Xp",
"_source": {
"#timestamp": "2016-08-02T13:40:48.665Z",
"type": "Diagnosis",
"Document_ID": "Diagnosis_1400541",
"Diagnosis": "F00.0 - Dementia in Alzheimer's disease with early onset",
"#version": "1",
},
"_index": "carenotes"
},
{
"_score": 1.1152233,
"_type": "Diagnosis",
"_id": "AVZLfICrE5CzEm8aJ4Dc",
"_source": {
"#timestamp": "2016-08-02T13:40:51.240Z",
"type": "Diagnosis",
"Document_ID": "Diagnosis_1424351",
"Diagnosis": "F00.1 - Dementia in Alzheimer's disease with late onset",
"#version": "1",
},
"_index": "carenotes"
}
Analysis of the "dementia in alzheimer" phrase:
{
"tokens": [
{
"end_offset": 2,
"token": "de",
"type": "word",
"start_offset": 0,
"position": 0
},
{
"end_offset": 3,
"token": "dem",
"type": "word",
"start_offset": 0,
"position": 1
},
{
"end_offset": 4,
"token": "deme",
"type": "word",
"start_offset": 0,
"position": 2
},
{
"end_offset": 5,
"token": "demen",
"type": "word",
"start_offset": 0,
"position": 3
},
{
"end_offset": 6,
"token": "dement",
"type": "word",
"start_offset": 0,
"position": 4
},
{
"end_offset": 7,
"token": "dementi",
"type": "word",
"start_offset": 0,
"position": 5
},
{
"end_offset": 8,
"token": "dementia",
"type": "word",
"start_offset": 0,
"position": 6
},
{
"end_offset": 9,
"token": "dementia ",
"type": "word",
"start_offset": 0,
"position": 7
},
{
"end_offset": 10,
"token": "dementia i",
"type": "word",
"start_offset": 0,
"position": 8
},
{
"end_offset": 11,
"token": "dementia in",
"type": "word",
"start_offset": 0,
"position": 9
},
{
"end_offset": 12,
"token": "dementia in ",
"type": "word",
"start_offset": 0,
"position": 10
},
{
"end_offset": 13,
"token": "dementia in a",
"type": "word",
"start_offset": 0,
"position": 11
},
{
"end_offset": 14,
"token": "dementia in al",
"type": "word",
"start_offset": 0,
"position": 12
},
{
"end_offset": 15,
"token": "dementia in alz",
"type": "word",
"start_offset": 0,
"position": 13
},
{
"end_offset": 16,
"token": "dementia in alzh",
"type": "word",
"start_offset": 0,
"position": 14
},
{
"end_offset": 17,
"token": "dementia in alzhe",
"type": "word",
"start_offset": 0,
"position": 15
},
{
"end_offset": 18,
"token": "dementia in alzhei",
"type": "word",
"start_offset": 0,
"position": 16
},
{
"end_offset": 19,
"token": "dementia in alzheim",
"type": "word",
"start_offset": 0,
"position": 17
},
{
"end_offset": 20,
"token": "dementia in alzheime",
"type": "word",
"start_offset": 0,
"position": 18
},
{
"end_offset": 21,
"token": "dementia in alzheimer",
"type": "word",
"start_offset": 0,
"position": 19
}
]
}
Many thanks to rendel, who helped me find the right solution!
The solution of Andrei Stefan is not optimal.
Why? First, the absence of a lowercase filter in the search analyzer makes searching inconvenient; the case would have to match exactly. A custom analyzer with a lowercase filter is needed instead of "analyzer": "keyword".
Second, the analysis part is wrong!
At index time the string "F00.0 - Dementia in Alzheimer's disease with early onset" is analyzed by edge_ngram_analyzer, which produces the following array of tokens:
{
"tokens": [
{
"end_offset": 2,
"token": "f0",
"type": "word",
"start_offset": 0,
"position": 0
},
{
"end_offset": 3,
"token": "f00",
"type": "word",
"start_offset": 0,
"position": 1
},
{
"end_offset": 6,
"token": "0 ",
"type": "word",
"start_offset": 4,
"position": 2
},
{
"end_offset": 9,
"token": " ",
"type": "word",
"start_offset": 7,
"position": 3
},
{
"end_offset": 10,
"token": " d",
"type": "word",
"start_offset": 7,
"position": 4
},
{
"end_offset": 11,
"token": " de",
"type": "word",
"start_offset": 7,
"position": 5
},
{
"end_offset": 12,
"token": " dem",
"type": "word",
"start_offset": 7,
"position": 6
},
{
"end_offset": 13,
"token": " deme",
"type": "word",
"start_offset": 7,
"position": 7
},
{
"end_offset": 14,
"token": " demen",
"type": "word",
"start_offset": 7,
"position": 8
},
{
"end_offset": 15,
"token": " dement",
"type": "word",
"start_offset": 7,
"position": 9
},
{
"end_offset": 16,
"token": " dementi",
"type": "word",
"start_offset": 7,
"position": 10
},
{
"end_offset": 17,
"token": " dementia",
"type": "word",
"start_offset": 7,
"position": 11
},
{
"end_offset": 18,
"token": " dementia ",
"type": "word",
"start_offset": 7,
"position": 12
},
{
"end_offset": 19,
"token": " dementia i",
"type": "word",
"start_offset": 7,
"position": 13
},
{
"end_offset": 20,
"token": " dementia in",
"type": "word",
"start_offset": 7,
"position": 14
},
{
"end_offset": 21,
"token": " dementia in ",
"type": "word",
"start_offset": 7,
"position": 15
},
{
"end_offset": 22,
"token": " dementia in a",
"type": "word",
"start_offset": 7,
"position": 16
},
{
"end_offset": 23,
"token": " dementia in al",
"type": "word",
"start_offset": 7,
"position": 17
},
{
"end_offset": 24,
"token": " dementia in alz",
"type": "word",
"start_offset": 7,
"position": 18
},
{
"end_offset": 25,
"token": " dementia in alzh",
"type": "word",
"start_offset": 7,
"position": 19
},
{
"end_offset": 26,
"token": " dementia in alzhe",
"type": "word",
"start_offset": 7,
"position": 20
},
{
"end_offset": 27,
"token": " dementia in alzhei",
"type": "word",
"start_offset": 7,
"position": 21
},
{
"end_offset": 28,
"token": " dementia in alzheim",
"type": "word",
"start_offset": 7,
"position": 22
},
{
"end_offset": 29,
"token": " dementia in alzheime",
"type": "word",
"start_offset": 7,
"position": 23
},
{
"end_offset": 30,
"token": " dementia in alzheimer",
"type": "word",
"start_offset": 7,
"position": 24
},
{
"end_offset": 33,
"token": "s ",
"type": "word",
"start_offset": 31,
"position": 25
},
{
"end_offset": 34,
"token": "s d",
"type": "word",
"start_offset": 31,
"position": 26
},
{
"end_offset": 35,
"token": "s di",
"type": "word",
"start_offset": 31,
"position": 27
},
{
"end_offset": 36,
"token": "s dis",
"type": "word",
"start_offset": 31,
"position": 28
},
{
"end_offset": 37,
"token": "s dise",
"type": "word",
"start_offset": 31,
"position": 29
},
{
"end_offset": 38,
"token": "s disea",
"type": "word",
"start_offset": 31,
"position": 30
},
{
"end_offset": 39,
"token": "s diseas",
"type": "word",
"start_offset": 31,
"position": 31
},
{
"end_offset": 40,
"token": "s disease",
"type": "word",
"start_offset": 31,
"position": 32
},
{
"end_offset": 41,
"token": "s disease ",
"type": "word",
"start_offset": 31,
"position": 33
},
{
"end_offset": 42,
"token": "s disease w",
"type": "word",
"start_offset": 31,
"position": 34
},
{
"end_offset": 43,
"token": "s disease wi",
"type": "word",
"start_offset": 31,
"position": 35
},
{
"end_offset": 44,
"token": "s disease wit",
"type": "word",
"start_offset": 31,
"position": 36
},
{
"end_offset": 45,
"token": "s disease with",
"type": "word",
"start_offset": 31,
"position": 37
},
{
"end_offset": 46,
"token": "s disease with ",
"type": "word",
"start_offset": 31,
"position": 38
},
{
"end_offset": 47,
"token": "s disease with e",
"type": "word",
"start_offset": 31,
"position": 39
},
{
"end_offset": 48,
"token": "s disease with ea",
"type": "word",
"start_offset": 31,
"position": 40
},
{
"end_offset": 49,
"token": "s disease with ear",
"type": "word",
"start_offset": 31,
"position": 41
},
{
"end_offset": 50,
"token": "s disease with earl",
"type": "word",
"start_offset": 31,
"position": 42
},
{
"end_offset": 51,
"token": "s disease with early",
"type": "word",
"start_offset": 31,
"position": 43
},
{
"end_offset": 52,
"token": "s disease with early ",
"type": "word",
"start_offset": 31,
"position": 44
},
{
"end_offset": 53,
"token": "s disease with early o",
"type": "word",
"start_offset": 31,
"position": 45
},
{
"end_offset": 54,
"token": "s disease with early on",
"type": "word",
"start_offset": 31,
"position": 46
},
{
"end_offset": 55,
"token": "s disease with early ons",
"type": "word",
"start_offset": 31,
"position": 47
},
{
"end_offset": 56,
"token": "s disease with early onse",
"type": "word",
"start_offset": 31,
"position": 48
}
]
}
As you can see, the whole string is tokenized into tokens of 2 to 25 characters. The string is tokenized linearly, with all spaces included, and the position is incremented by one for every new token.
There are several problems with it:
The edge_ngram_analyzer produces useless tokens which will never be searched for, for example: "0 ", " ", " d", "s d", "s disease w", etc.
At the same time, it fails to produce many useful tokens, for example: "disease", "early onset", etc. There will be 0 results if you try to search for any of these words.
Notice that the last token is "s disease with early onse". Where is the final "t"? Because of "max_gram": "25" we "lost" some text in all fields. You can't search for that text anymore because there are no tokens for it.
The trim filter only obscures the problem by filtering out extra spaces, something the tokenizer should be handling.
The edge_ngram_analyzer increments the position of each token, which is problematic for positional queries such as phrase queries. One should use the edge_ngram filter instead, which preserves the position of the token when generating the n-grams.
The optimal solution.
The mapping settings to use:
...
"mappings": {
  "Type": {
    "_all": {
      "analyzer": "edge_ngram_analyzer",
      "search_analyzer": "keyword_analyzer"
    },
    "properties": {
      "Field": {
        "search_analyzer": "keyword_analyzer",
        "type": "string",
        "analyzer": "edge_ngram_analyzer"
      },
...
...
"settings": {
  "analysis": {
    "filter": {
      "english_poss_stemmer": {
        "type": "stemmer",
        "name": "possessive_english"
      },
      "edge_ngram": {
        "type": "edgeNGram",
        "min_gram": "2",
        "max_gram": "25",
        "token_chars": ["letter", "digit"]
      }
    },
    "analyzer": {
      "edge_ngram_analyzer": {
        "filter": ["lowercase", "english_poss_stemmer", "edge_ngram"],
        "tokenizer": "standard"
      },
      "keyword_analyzer": {
        "filter": ["lowercase", "english_poss_stemmer"],
        "tokenizer": "standard"
      }
    }
  }
}
...
Look at the analysis:
{
"tokens": [
{
"end_offset": 5,
"token": "f0",
"type": "word",
"start_offset": 0,
"position": 0
},
{
"end_offset": 5,
"token": "f00",
"type": "word",
"start_offset": 0,
"position": 0
},
{
"end_offset": 5,
"token": "f00.",
"type": "word",
"start_offset": 0,
"position": 0
},
{
"end_offset": 5,
"token": "f00.0",
"type": "word",
"start_offset": 0,
"position": 0
},
{
"end_offset": 17,
"token": "de",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "dem",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "deme",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "demen",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "dement",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "dementi",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "dementia",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 20,
"token": "in",
"type": "word",
"start_offset": 18,
"position": 3
},
{
"end_offset": 32,
"token": "al",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alz",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzh",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzhe",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzhei",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzheim",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzheime",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzheimer",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 40,
"token": "di",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 40,
"token": "dis",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 40,
"token": "dise",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 40,
"token": "disea",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 40,
"token": "diseas",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 40,
"token": "disease",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 45,
"token": "wi",
"type": "word",
"start_offset": 41,
"position": 6
},
{
"end_offset": 45,
"token": "wit",
"type": "word",
"start_offset": 41,
"position": 6
},
{
"end_offset": 45,
"token": "with",
"type": "word",
"start_offset": 41,
"position": 6
},
{
"end_offset": 51,
"token": "ea",
"type": "word",
"start_offset": 46,
"position": 7
},
{
"end_offset": 51,
"token": "ear",
"type": "word",
"start_offset": 46,
"position": 7
},
{
"end_offset": 51,
"token": "earl",
"type": "word",
"start_offset": 46,
"position": 7
},
{
"end_offset": 51,
"token": "early",
"type": "word",
"start_offset": 46,
"position": 7
},
{
"end_offset": 57,
"token": "on",
"type": "word",
"start_offset": 52,
"position": 8
},
{
"end_offset": 57,
"token": "ons",
"type": "word",
"start_offset": 52,
"position": 8
},
{
"end_offset": 57,
"token": "onse",
"type": "word",
"start_offset": 52,
"position": 8
},
{
"end_offset": 57,
"token": "onset",
"type": "word",
"start_offset": 52,
"position": 8
}
]
}
At index time the text is tokenized by the standard tokenizer, then the separate words are filtered by the lowercase, possessive_english and edge_ngram filters. Tokens are produced only from words.
At search time the text is tokenized by the standard tokenizer, then the separate words are filtered by lowercase and possessive_english. The searched words are matched against the tokens that were created at index time.
Thus we make incremental search possible!
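As a quick sanity check (a sketch; carenotes is the index from the documents above, and the _analyze request body format varies a bit between versions), the search-side analyzer can be inspected directly:
POST carenotes/_analyze
{
  "analyzer": "keyword_analyzer",
  "text": "Dementia in Alzheimer's"
}
This should return just dementia, in and alzheimer, i.e. exactly the kind of whole-word tokens that the edge_ngram_analyzer also emitted at index time.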
Now, because we build the n-grams on separate words, we can even execute queries like
{
  "query": {
    "multi_match": {
      "query": "dem in alzh",
      "type": "phrase",
      "fields": ["_all"]
    }
  }
}
and get correct results.
No text is "lost", everything is searchable, and there is no need to deal with spaces via the trim filter anymore.
I believe your query is wrong: while you need nGrams at indexing time, you don't need them at search time. At search time you need the text to be as "fixed" as possible.
Try this query instead:
{
  "query": {
    "multi_match": {
      "query": "  dementia in alz",
      "analyzer": "keyword",
      "fields": ["_all"]
    }
  }
}
Notice the two whitespaces before dementia: they come from the way your analyzer tokenized the text (the indexed n-grams keep the spaces). To get rid of them you need the trim token filter:
"edge_ngram_analyzer": {
"filter": [
"lowercase","trim"
],
"tokenizer": "edge_ngram_tokenizer"
}
And then this query will work (no whitespaces before dementia):
{
  "query": {
    "multi_match": {
      "query": "dementia in alz",
      "analyzer": "keyword",
      "fields": ["_all"]
    }
  }
}

Pass "-" in Elastic Search Query

When we pass a query containing special characters, Elasticsearch splits the text.
E.g. if we pass "test-test" in the query, how can we make Elasticsearch treat it as a single word and not split it up?
Analyzer used on the field we are searching:
"text_search_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 15
},
"standard_stop_filter": {
"type": "stop",
"stopwords": "_english_"
}
},
"analyzer": {
"text_search_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding",
"text_search_filter"
]
}
}
Also the query for search:
"query": {
"multi_match": {
"query": "test-test",
"type": "cross_fields",
"fields": [
"FIELD_NAME"
],
}
}
{
"tokens": [
{
"token": "'",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "'t",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "'te",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "'tes",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "'test",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "'test-",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "'test-t",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "'test-te",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "'test-tes",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "'test-test",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "'test-test'",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
}
]
}
In my code I catch all words which contain "-" and add quotes around them.
Example:
joe-doe -> "joe-doe"
Java code for this:
static String placeWordsWithDashInQuote(String value) {
return Arrays.stream(value.split("\\s"))
.filter(v -> !v.isEmpty())
.map(v -> v.contains("-") && !v.startsWith("\"") ? "\"" + v + "\"" : v)
.collect(Collectors.joining(" "));
}
After this, the example query looks like:
{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "fields": [
              "lastName",
              "firstName"
            ],
            "query": "\"joe-doe\"",
            "default_operator": "AND"
          }
        }
      ]
    }
  },
  "sort": [],
  "from": 0,
  "size": 10
}
