Elasticsearch tokenization for international languages

Elasticsearch tokenization for international languages - elasticsearch

I wanted to find out how elasticsearch tokens the languages other than english and i tried out the analyze api provided by it. But I cannot understand the output at all. Take for example
GET myindex/_analyze?analyzer=hindi&text="में कहता हूँ और तुम सुनना "
Now in the above text there are 6 words in total so I expect at max 6 tokens( believing that text contains no stop words) but the output is somewhat like this
{
"tokens": [
{
"token": "2350",
"start_offset": 3,
"end_offset": 7,
"type": "<NUM>",
"position": 1
},
{
"token": "2375",
"start_offset": 10,
"end_offset": 14,
"type": "<NUM>",
"position": 2
},
{
"token": "2306",
"start_offset": 17,
"end_offset": 21,
"type": "<NUM>",
"position": 3
},
{
"token": "2325",
"start_offset": 25,
"end_offset": 29,
"type": "<NUM>",
"position": 4
},
{
"token": "2361",
"start_offset": 32,
"end_offset": 36,
"type": "<NUM>",
"position": 5
},
{
"token": "2340",
"start_offset": 39,
"end_offset": 43,
"type": "<NUM>",
"position": 6
},
{
"token": "2366",
"start_offset": 46,
"end_offset": 50,
"type": "<NUM>",
"position": 7
},
{
"token": "2361",
"start_offset": 54,
"end_offset": 58,
"type": "<NUM>",
"position": 8
},
{
"token": "2370",
"start_offset": 61,
"end_offset": 65,
"type": "<NUM>",
"position": 9
},
{
"token": "2305",
"start_offset": 68,
"end_offset": 72,
"type": "<NUM>",
"position": 10
},
{
"token": "2324",
"start_offset": 76,
"end_offset": 80,
"type": "<NUM>",
"position": 11
},
{
"token": "2352",
"start_offset": 83,
"end_offset": 87,
"type": "<NUM>",
"position": 12
},
{
"token": "2340",
"start_offset": 91,
"end_offset": 95,
"type": "<NUM>",
"position": 13
},
{
"token": "2369",
"start_offset": 98,
"end_offset": 102,
"type": "<NUM>",
"position": 14
},
{
"token": "2350",
"start_offset": 105,
"end_offset": 109,
"type": "<NUM>",
"position": 15
},
{
"token": "2360",
"start_offset": 113,
"end_offset": 117,
"type": "<NUM>",
"position": 16
},
{
"token": "2369",
"start_offset": 120,
"end_offset": 124,
"type": "<NUM>",
"position": 17
},
{
"token": "2344",
"start_offset": 127,
"end_offset": 131,
"type": "<NUM>",
"position": 18
},
{
"token": "2344",
"start_offset": 134,
"end_offset": 138,
"type": "<NUM>",
"position": 19
},
{
"token": "2366",
"start_offset": 141,
"end_offset": 145,
"type": "<NUM>",
"position": 20
}
]
}
That means instead of six elasticsearch has detected around 20 tokens and all of type NUM(I don't know what's that)
I am really confused why this is happening. Can someone enlighten me what is happening. What am I doing doing wrong or where I lack in my understanding?

How are you calling the elasticsearch API - possibly the Hindi characters are getting messed up by your client?
It works okay for me (at least the Hindi chars are appearing in the result) on Linux with curl:
curl -XPOST 'http://localhost:9200/myindex/_analyze?analyzer=hindi&pretty' -d 'में कहता हूँ और तुम सुनना '
{
"tokens" : [ {
"token" : "कह",
"start_offset" : 4,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 2
}, {
"token" : "हुं",
"start_offset" : 9,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 3
}, {
"token" : "तुम",
"start_offset" : 16,
"end_offset" : 19,
"type" : "<ALPHANUM>",
"position" : 5
}, {
"token" : "सुन",
"start_offset" : 20,
"end_offset" : 25,
"type" : "<ALPHANUM>",
"position" : 6
} ]
}

Related

How to stop storing special characters in content while indexing

This is a sample document with the following points:
Pharmaceutical
Marketing
Building â€“
responsibilities.Â Â
Mass. â€“ Aug. 13, 2020 â€“Â
How to remove the special characters or non ascii unicode chars from content while indexing? I'm using ES 7.x and storm crawler 1.17

Looks like an incorrect detection of charset. You could normalise the content before indexing by writing a custom parse filter and remove the unwanted characters there.

If writing a custom parse filter and normalization looks difficult for you. you can simply add the asciifolding token filter in your analyzer definition which would convert the non-ascii char to their ascii char as shown below
POST http://{{hostname}}:{{port}}/_analyze
{
"tokenizer": "standard",
"filter": [
"asciifolding"
],
"text": "Pharmaceutical Marketing Building â responsibilities.Â Â Mass. â Aug. 13, 2020 âÂ"
}
And generated tokens for your text.
{
"tokens": [
{
"token": "Pharmaceutical",
"start_offset": 0,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "Marketing",
"start_offset": 15,
"end_offset": 24,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "Building",
"start_offset": 25,
"end_offset": 33,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "a",
"start_offset": 34,
"end_offset": 35,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "responsibilities.A",
"start_offset": 36,
"end_offset": 54,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "A",
"start_offset": 55,
"end_offset": 56,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "Mass",
"start_offset": 57,
"end_offset": 61,
"type": "<ALPHANUM>",
"position": 6
},
{
"token": "a",
"start_offset": 63,
"end_offset": 64,
"type": "<ALPHANUM>",
"position": 7
},
{
"token": "Aug",
"start_offset": 65,
"end_offset": 68,
"type": "<ALPHANUM>",
"position": 8
},
{
"token": "13",
"start_offset": 70,
"end_offset": 72,
"type": "<NUM>",
"position": 9
},
{
"token": "2020",
"start_offset": 74,
"end_offset": 78,
"type": "<NUM>",
"position": 10
},
{
"token": "aA",
"start_offset": 79,
"end_offset": 81,
"type": "<ALPHANUM>",
"position": 11
}
]
}

Edge NGram with phrase matching

I need to autocomplete phrases. For example, when I search "dementia in alz", I want to get "dementia in alzheimer's".
For this, I configured Edge NGram tokenizer. I tried both edge_ngram_analyzer and standard as the analyzer in the query body. Nevertheless, I can't get results when I'm trying to match a phrase.
What am I doing wrong?
My query:
{
"query":{
"multi_match":{
"query":"dementia in alz",
"type":"phrase",
"analyzer":"edge_ngram_analyzer",
"fields":["_all"]
}
}
}
My mappings:
...
"type" : {
"_all" : {
"analyzer" : "edge_ngram_analyzer",
"search_analyzer" : "standard"
},
"properties" : {
"field" : {
"type" : "string",
"analyzer" : "edge_ngram_analyzer",
"search_analyzer" : "standard"
},
...
"settings" : {
...
"analysis" : {
"filter" : {
"stem_possessive_filter" : {
"name" : "possessive_english",
"type" : "stemmer"
}
},
"analyzer" : {
"edge_ngram_analyzer" : {
"filter" : [ "lowercase" ],
"tokenizer" : "edge_ngram_tokenizer"
}
},
"tokenizer" : {
"edge_ngram_tokenizer" : {
"token_chars" : [ "letter", "digit", "whitespace" ],
"min_gram" : "2",
"type" : "edgeNGram",
"max_gram" : "25"
}
}
}
...
My documents:
{
"_score": 1.1152233,
"_type": "Diagnosis",
"_id": "AVZLfHfBE5CzEm8aJ3Xp",
"_source": {
"#timestamp": "2016-08-02T13:40:48.665Z",
"type": "Diagnosis",
"Document_ID": "Diagnosis_1400541",
"Diagnosis": "F00.0 - Dementia in Alzheimer's disease with early onset",
"#version": "1",
},
"_index": "carenotes"
},
{
"_score": 1.1152233,
"_type": "Diagnosis",
"_id": "AVZLfICrE5CzEm8aJ4Dc",
"_source": {
"#timestamp": "2016-08-02T13:40:51.240Z",
"type": "Diagnosis",
"Document_ID": "Diagnosis_1424351",
"Diagnosis": "F00.1 - Dementia in Alzheimer's disease with late onset",
"#version": "1",
},
"_index": "carenotes"
}
Analysis of the "dementia in alzheimer" phrase:
{
"tokens": [
{
"end_offset": 2,
"token": "de",
"type": "word",
"start_offset": 0,
"position": 0
},
{
"end_offset": 3,
"token": "dem",
"type": "word",
"start_offset": 0,
"position": 1
},
{
"end_offset": 4,
"token": "deme",
"type": "word",
"start_offset": 0,
"position": 2
},
{
"end_offset": 5,
"token": "demen",
"type": "word",
"start_offset": 0,
"position": 3
},
{
"end_offset": 6,
"token": "dement",
"type": "word",
"start_offset": 0,
"position": 4
},
{
"end_offset": 7,
"token": "dementi",
"type": "word",
"start_offset": 0,
"position": 5
},
{
"end_offset": 8,
"token": "dementia",
"type": "word",
"start_offset": 0,
"position": 6
},
{
"end_offset": 9,
"token": "dementia ",
"type": "word",
"start_offset": 0,
"position": 7
},
{
"end_offset": 10,
"token": "dementia i",
"type": "word",
"start_offset": 0,
"position": 8
},
{
"end_offset": 11,
"token": "dementia in",
"type": "word",
"start_offset": 0,
"position": 9
},
{
"end_offset": 12,
"token": "dementia in ",
"type": "word",
"start_offset": 0,
"position": 10
},
{
"end_offset": 13,
"token": "dementia in a",
"type": "word",
"start_offset": 0,
"position": 11
},
{
"end_offset": 14,
"token": "dementia in al",
"type": "word",
"start_offset": 0,
"position": 12
},
{
"end_offset": 15,
"token": "dementia in alz",
"type": "word",
"start_offset": 0,
"position": 13
},
{
"end_offset": 16,
"token": "dementia in alzh",
"type": "word",
"start_offset": 0,
"position": 14
},
{
"end_offset": 17,
"token": "dementia in alzhe",
"type": "word",
"start_offset": 0,
"position": 15
},
{
"end_offset": 18,
"token": "dementia in alzhei",
"type": "word",
"start_offset": 0,
"position": 16
},
{
"end_offset": 19,
"token": "dementia in alzheim",
"type": "word",
"start_offset": 0,
"position": 17
},
{
"end_offset": 20,
"token": "dementia in alzheime",
"type": "word",
"start_offset": 0,
"position": 18
},
{
"end_offset": 21,
"token": "dementia in alzheimer",
"type": "word",
"start_offset": 0,
"position": 19
}
]
}

Many thanks to rendel who helped me to find the right solution!
The solution of Andrei Stefan is not optimal.
Why? First, the absence of the lowercase filter in the search analyzer makes search inconvenient; the case must be matched strictly. A custom analyzer with lowercase filter is needed instead of "analyzer": "keyword".
Second, the analysis part is wrong!
During index time a string "F00.0 - Dementia in Alzheimer's disease with early onset" is analyzed by edge_ngram_analyzer. With this analyzer, we have the following array of dictionaries as the analyzed string:
{
"tokens": [
{
"end_offset": 2,
"token": "f0",
"type": "word",
"start_offset": 0,
"position": 0
},
{
"end_offset": 3,
"token": "f00",
"type": "word",
"start_offset": 0,
"position": 1
},
{
"end_offset": 6,
"token": "0 ",
"type": "word",
"start_offset": 4,
"position": 2
},
{
"end_offset": 9,
"token": " ",
"type": "word",
"start_offset": 7,
"position": 3
},
{
"end_offset": 10,
"token": " d",
"type": "word",
"start_offset": 7,
"position": 4
},
{
"end_offset": 11,
"token": " de",
"type": "word",
"start_offset": 7,
"position": 5
},
{
"end_offset": 12,
"token": " dem",
"type": "word",
"start_offset": 7,
"position": 6
},
{
"end_offset": 13,
"token": " deme",
"type": "word",
"start_offset": 7,
"position": 7
},
{
"end_offset": 14,
"token": " demen",
"type": "word",
"start_offset": 7,
"position": 8
},
{
"end_offset": 15,
"token": " dement",
"type": "word",
"start_offset": 7,
"position": 9
},
{
"end_offset": 16,
"token": " dementi",
"type": "word",
"start_offset": 7,
"position": 10
},
{
"end_offset": 17,
"token": " dementia",
"type": "word",
"start_offset": 7,
"position": 11
},
{
"end_offset": 18,
"token": " dementia ",
"type": "word",
"start_offset": 7,
"position": 12
},
{
"end_offset": 19,
"token": " dementia i",
"type": "word",
"start_offset": 7,
"position": 13
},
{
"end_offset": 20,
"token": " dementia in",
"type": "word",
"start_offset": 7,
"position": 14
},
{
"end_offset": 21,
"token": " dementia in ",
"type": "word",
"start_offset": 7,
"position": 15
},
{
"end_offset": 22,
"token": " dementia in a",
"type": "word",
"start_offset": 7,
"position": 16
},
{
"end_offset": 23,
"token": " dementia in al",
"type": "word",
"start_offset": 7,
"position": 17
},
{
"end_offset": 24,
"token": " dementia in alz",
"type": "word",
"start_offset": 7,
"position": 18
},
{
"end_offset": 25,
"token": " dementia in alzh",
"type": "word",
"start_offset": 7,
"position": 19
},
{
"end_offset": 26,
"token": " dementia in alzhe",
"type": "word",
"start_offset": 7,
"position": 20
},
{
"end_offset": 27,
"token": " dementia in alzhei",
"type": "word",
"start_offset": 7,
"position": 21
},
{
"end_offset": 28,
"token": " dementia in alzheim",
"type": "word",
"start_offset": 7,
"position": 22
},
{
"end_offset": 29,
"token": " dementia in alzheime",
"type": "word",
"start_offset": 7,
"position": 23
},
{
"end_offset": 30,
"token": " dementia in alzheimer",
"type": "word",
"start_offset": 7,
"position": 24
},
{
"end_offset": 33,
"token": "s ",
"type": "word",
"start_offset": 31,
"position": 25
},
{
"end_offset": 34,
"token": "s d",
"type": "word",
"start_offset": 31,
"position": 26
},
{
"end_offset": 35,
"token": "s di",
"type": "word",
"start_offset": 31,
"position": 27
},
{
"end_offset": 36,
"token": "s dis",
"type": "word",
"start_offset": 31,
"position": 28
},
{
"end_offset": 37,
"token": "s dise",
"type": "word",
"start_offset": 31,
"position": 29
},
{
"end_offset": 38,
"token": "s disea",
"type": "word",
"start_offset": 31,
"position": 30
},
{
"end_offset": 39,
"token": "s diseas",
"type": "word",
"start_offset": 31,
"position": 31
},
{
"end_offset": 40,
"token": "s disease",
"type": "word",
"start_offset": 31,
"position": 32
},
{
"end_offset": 41,
"token": "s disease ",
"type": "word",
"start_offset": 31,
"position": 33
},
{
"end_offset": 42,
"token": "s disease w",
"type": "word",
"start_offset": 31,
"position": 34
},
{
"end_offset": 43,
"token": "s disease wi",
"type": "word",
"start_offset": 31,
"position": 35
},
{
"end_offset": 44,
"token": "s disease wit",
"type": "word",
"start_offset": 31,
"position": 36
},
{
"end_offset": 45,
"token": "s disease with",
"type": "word",
"start_offset": 31,
"position": 37
},
{
"end_offset": 46,
"token": "s disease with ",
"type": "word",
"start_offset": 31,
"position": 38
},
{
"end_offset": 47,
"token": "s disease with e",
"type": "word",
"start_offset": 31,
"position": 39
},
{
"end_offset": 48,
"token": "s disease with ea",
"type": "word",
"start_offset": 31,
"position": 40
},
{
"end_offset": 49,
"token": "s disease with ear",
"type": "word",
"start_offset": 31,
"position": 41
},
{
"end_offset": 50,
"token": "s disease with earl",
"type": "word",
"start_offset": 31,
"position": 42
},
{
"end_offset": 51,
"token": "s disease with early",
"type": "word",
"start_offset": 31,
"position": 43
},
{
"end_offset": 52,
"token": "s disease with early ",
"type": "word",
"start_offset": 31,
"position": 44
},
{
"end_offset": 53,
"token": "s disease with early o",
"type": "word",
"start_offset": 31,
"position": 45
},
{
"end_offset": 54,
"token": "s disease with early on",
"type": "word",
"start_offset": 31,
"position": 46
},
{
"end_offset": 55,
"token": "s disease with early ons",
"type": "word",
"start_offset": 31,
"position": 47
},
{
"end_offset": 56,
"token": "s disease with early onse",
"type": "word",
"start_offset": 31,
"position": 48
}
]
}
As you can see, the whole string tokenized with token size from 2 to 25 characters. The string is tokenized in a linear way together with all spaces and position incremented by one for every new token.
There are several problems with it:
The edge_ngram_analyzer produced unuseful tokens which will never be searched for, for example: "0 ", " ", " d", "s d", "s disease w" etc.
Also, it didn't produce a lot of useful tokens that could be used, for example: "disease", "early onset" etc. There will be 0 results if you try to search for any of these words.
Notice, the last token is "s disease with early onse". Where is the final "t"? Because of the "max_gram" : "25" we “lost” some text in all fields. You can't search for this text anymore because there are no tokens for it.
The trim filter only obfuscates the problem filtering extra spaces when it could be done by a tokenizer.
The edge_ngram_analyzer increments the position of each token which is problematic for positional queries such as phrase queries. One should use the edge_ngram_filter instead that will preserve the position of the token when generating the ngrams.
The optimal solution.
The mappings settings to use:
...
"mappings": {
"Type": {
"_all":{
"analyzer": "edge_ngram_analyzer",
"search_analyzer": "keyword_analyzer"
},
"properties": {
"Field": {
"search_analyzer": "keyword_analyzer",
"type": "string",
"analyzer": "edge_ngram_analyzer"
},
...
...
"settings": {
"analysis": {
"filter": {
"english_poss_stemmer": {
"type": "stemmer",
"name": "possessive_english"
},
"edge_ngram": {
"type": "edgeNGram",
"min_gram": "2",
"max_gram": "25",
"token_chars": ["letter", "digit"]
}
},
"analyzer": {
"edge_ngram_analyzer": {
"filter": ["lowercase", "english_poss_stemmer", "edge_ngram"],
"tokenizer": "standard"
},
"keyword_analyzer": {
"filter": ["lowercase", "english_poss_stemmer"],
"tokenizer": "standard"
}
}
}
}
...
Look at the analysis:
{
"tokens": [
{
"end_offset": 5,
"token": "f0",
"type": "word",
"start_offset": 0,
"position": 0
},
{
"end_offset": 5,
"token": "f00",
"type": "word",
"start_offset": 0,
"position": 0
},
{
"end_offset": 5,
"token": "f00.",
"type": "word",
"start_offset": 0,
"position": 0
},
{
"end_offset": 5,
"token": "f00.0",
"type": "word",
"start_offset": 0,
"position": 0
},
{
"end_offset": 17,
"token": "de",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "dem",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "deme",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "demen",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "dement",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "dementi",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "dementia",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 20,
"token": "in",
"type": "word",
"start_offset": 18,
"position": 3
},
{
"end_offset": 32,
"token": "al",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alz",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzh",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzhe",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzhei",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzheim",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzheime",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzheimer",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 40,
"token": "di",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 40,
"token": "dis",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 40,
"token": "dise",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 40,
"token": "disea",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 40,
"token": "diseas",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 40,
"token": "disease",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 45,
"token": "wi",
"type": "word",
"start_offset": 41,
"position": 6
},
{
"end_offset": 45,
"token": "wit",
"type": "word",
"start_offset": 41,
"position": 6
},
{
"end_offset": 45,
"token": "with",
"type": "word",
"start_offset": 41,
"position": 6
},
{
"end_offset": 51,
"token": "ea",
"type": "word",
"start_offset": 46,
"position": 7
},
{
"end_offset": 51,
"token": "ear",
"type": "word",
"start_offset": 46,
"position": 7
},
{
"end_offset": 51,
"token": "earl",
"type": "word",
"start_offset": 46,
"position": 7
},
{
"end_offset": 51,
"token": "early",
"type": "word",
"start_offset": 46,
"position": 7
},
{
"end_offset": 57,
"token": "on",
"type": "word",
"start_offset": 52,
"position": 8
},
{
"end_offset": 57,
"token": "ons",
"type": "word",
"start_offset": 52,
"position": 8
},
{
"end_offset": 57,
"token": "onse",
"type": "word",
"start_offset": 52,
"position": 8
},
{
"end_offset": 57,
"token": "onset",
"type": "word",
"start_offset": 52,
"position": 8
}
]
}
On index time a text is tokenized by standard tokenizer, then separate words are filtered by lowercase, possessive_english and edge_ngram filters. Tokens are produced only for words.
On search time a text is tokenized by standard tokenizer, then separate words are filtered by lowercase and possessive_english. The searched words are matched against the tokens which had been created during the index time.
Thus we make the incremental search possible!
Now, because we do ngram on separate words, we can even execute queries like
{
'query': {
'multi_match': {
'query': 'dem in alzh',
'type': 'phrase',
'fields': ['_all']
}
}
}
and get correct results.
No text is "lost", everything is searchable and there is no need to deal with spaces by trim filter anymore.

I believe your query is wrong: while you need nGrams at indexing time, you don't need them at search time. At search time you need the text to be as "fixed" as possible.
Try this query instead:
{
"query": {
"multi_match": {
"query": " dementia in alz",
"analyzer": "keyword",
"fields": [
"_all"
]
}
}
}
You notice two whitespaces before dementia. Those are accounted for by your analyzer from the text. To get rid of those you need the trim token_filter:
"edge_ngram_analyzer": {
"filter": [
"lowercase","trim"
],
"tokenizer": "edge_ngram_tokenizer"
}
And then this query will work (no whitespaces before dementia):
{
"query": {
"multi_match": {
"query": "dementia in alz",
"analyzer": "keyword",
"fields": [
"_all"
]
}
}
}

Elasticsearch - using ngrams as a tokenizer and filter gives different outputs

Can someone explain why using ngrams as a tokenzier gives a different output compared to when using it as a filter. For example using it as a tokenizer for "Paracetamol" I get:
{
"tokens": [
{
"token": "par",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "para",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "parac",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "parace",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "paracet",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "paraceta",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "paracetam",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "paracetamo",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "paracetamol",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "ara",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "arac",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "arace",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "aracet",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "araceta",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "aracetam",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "aracetamo",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "aracetamol",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "rac",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "race",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "racet",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "raceta",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "racetam",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "racetamo",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "racetamol",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "ace",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "acet",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "aceta",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "acetam",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "acetamo",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "acetamol",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "cet",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "ceta",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "cetam",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "cetamo",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "cetamol",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "eta",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "etam",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "etamo",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "etamol",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "tam",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "tamo",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "tamol",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "amo",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "amol",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "mol",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
}
]
}
Where as using it as a filter I get:
{
"tokens": [
{
"token": "par",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 1
},
{
"token": "para",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 2
},
{
"token": "parac",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 3
},
{
"token": "parace",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 4
},
{
"token": "paracet",
"start_offset": 0,
"end_offset": 7,
"type": "word",
"position": 5
},
{
"token": "paraceta",
"start_offset": 0,
"end_offset": 8,
"type": "word",
"position": 6
},
{
"token": "paracetam",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 7
},
{
"token": "paracetamo",
"start_offset": 0,
"end_offset": 10,
"type": "word",
"position": 8
},
{
"token": "paracetamol",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 9
},
{
"token": "ara",
"start_offset": 1,
"end_offset": 4,
"type": "word",
"position": 10
},
{
"token": "arac",
"start_offset": 1,
"end_offset": 5,
"type": "word",
"position": 11
},
{
"token": "arace",
"start_offset": 1,
"end_offset": 6,
"type": "word",
"position": 12
},
{
"token": "aracet",
"start_offset": 1,
"end_offset": 7,
"type": "word",
"position": 13
},
{
"token": "araceta",
"start_offset": 1,
"end_offset": 8,
"type": "word",
"position": 14
},
{
"token": "aracetam",
"start_offset": 1,
"end_offset": 9,
"type": "word",
"position": 15
},
{
"token": "aracetamo",
"start_offset": 1,
"end_offset": 10,
"type": "word",
"position": 16
},
{
"token": "aracetamol",
"start_offset": 1,
"end_offset": 11,
"type": "word",
"position": 17
},
{
"token": "rac",
"start_offset": 2,
"end_offset": 5,
"type": "word",
"position": 18
},
{
"token": "race",
"start_offset": 2,
"end_offset": 6,
"type": "word",
"position": 19
},
{
"token": "racet",
"start_offset": 2,
"end_offset": 7,
"type": "word",
"position": 20
},
{
"token": "raceta",
"start_offset": 2,
"end_offset": 8,
"type": "word",
"position": 21
},
{
"token": "racetam",
"start_offset": 2,
"end_offset": 9,
"type": "word",
"position": 22
},
{
"token": "racetamo",
"start_offset": 2,
"end_offset": 10,
"type": "word",
"position": 23
},
{
"token": "racetamol",
"start_offset": 2,
"end_offset": 11,
"type": "word",
"position": 24
},
{
"token": "ace",
"start_offset": 3,
"end_offset": 6,
"type": "word",
"position": 25
},
{
"token": "acet",
"start_offset": 3,
"end_offset": 7,
"type": "word",
"position": 26
},
{
"token": "aceta",
"start_offset": 3,
"end_offset": 8,
"type": "word",
"position": 27
},
{
"token": "acetam",
"start_offset": 3,
"end_offset": 9,
"type": "word",
"position": 28
},
{
"token": "acetamo",
"start_offset": 3,
"end_offset": 10,
"type": "word",
"position": 29
},
{
"token": "acetamol",
"start_offset": 3,
"end_offset": 11,
"type": "word",
"position": 30
},
{
"token": "cet",
"start_offset": 4,
"end_offset": 7,
"type": "word",
"position": 31
},
{
"token": "ceta",
"start_offset": 4,
"end_offset": 8,
"type": "word",
"position": 32
},
{
"token": "cetam",
"start_offset": 4,
"end_offset": 9,
"type": "word",
"position": 33
},
{
"token": "cetamo",
"start_offset": 4,
"end_offset": 10,
"type": "word",
"position": 34
},
{
"token": "cetamol",
"start_offset": 4,
"end_offset": 11,
"type": "word",
"position": 35
},
{
"token": "eta",
"start_offset": 5,
"end_offset": 8,
"type": "word",
"position": 36
},
{
"token": "etam",
"start_offset": 5,
"end_offset": 9,
"type": "word",
"position": 37
},
{
"token": "etamo",
"start_offset": 5,
"end_offset": 10,
"type": "word",
"position": 38
},
{
"token": "etamol",
"start_offset": 5,
"end_offset": 11,
"type": "word",
"position": 39
},
{
"token": "tam",
"start_offset": 6,
"end_offset": 9,
"type": "word",
"position": 40
},
{
"token": "tamo",
"start_offset": 6,
"end_offset": 10,
"type": "word",
"position": 41
},
{
"token": "tamol",
"start_offset": 6,
"end_offset": 11,
"type": "word",
"position": 42
},
{
"token": "amo",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 43
},
{
"token": "amol",
"start_offset": 7,
"end_offset": 11,
"type": "word",
"position": 44
},
{
"token": "mol",
"start_offset": 8,
"end_offset": 11,
"type": "word",
"position": 45
}
]
}

These two approaches may results equal outputs.
But Depending on the circumstances one approach may be better than the other.
If you need special characters in your search terms, you will probably need to use the ngram tokenizer in your mapping. It's useful to know how to use both.
Reference

ElasticSearch - Problems with edgeNGram tokenizer

I use ElasticSearch for indexing database. I'm trying to use edgeNGram tokenizer to cut strings to shoter ones with requirement "new string must be longer then 4 chars".
I use following code to create index:
PUT test
POST /test/_close
PUT /test/_settings
{
"analysis": {
"analyzer": {
"index_edge_ngram" : {
"type": "custom",
"filter": ["custom_word_delimiter"],
"tokenizer" : "left_tokenizer"
}
},
"filter" : {
"custom_word_delimiter" : {
"type": "word_delimiter",
"generate_word_parts": "true",
"generate_number_parts": "true",
"catenate_words": "false",
"catenate_numbers": "false",
"catenate_all": "false",
"split_on_case_change": "false",
"preserve_original": "false",
"split_on_numerics": "true",
"ignore_case": "true"
}
},
"tokenizer" : {
"left_tokenizer" : {
"max_gram" : 30,
"min_gram" : 5,
"type" : "edgeNGram"
}
}
}
}
POST /test/_open
Now I run test to overview the results
GET /test/_analyze?analyzer=index_edge_ngram&text=please pay for multiple wins with only one payment
and get the results
{
"tokens": [
{
"token": "pleas",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 1
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 2
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 3
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 4
},
{
"token": "p",
"start_offset": 7,
"end_offset": 8,
"type": "word",
"position": 5
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 6
},
{
"token": "pa",
"start_offset": 7,
"end_offset": 9,
"type": "word",
"position": 7
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 8
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 9
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 10
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 11
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 12
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 13
},
{
"token": "f",
"start_offset": 11,
"end_offset": 12,
"type": "word",
"position": 14
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 15
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 16
},
{
"token": "fo",
"start_offset": 11,
"end_offset": 13,
"type": "word",
"position": 17
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 18
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 19
},
{
"token": "for",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 20
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 21
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 22
},
{
"token": "for",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 23
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 24
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 25
},
{
"token": "for",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 26
},
{
"token": "m",
"start_offset": 15,
"end_offset": 16,
"type": "word",
"position": 27
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 28
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 29
},
{
"token": "for",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 30
},
{
"token": "mu",
"start_offset": 15,
"end_offset": 17,
"type": "word",
"position": 31
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 32
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 33
},
{
"token": "for",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 34
},
{
"token": "mul",
"start_offset": 15,
"end_offset": 18,
"type": "word",
"position": 35
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 36
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 37
},
{
"token": "for",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 38
},
{
"token": "mult",
"start_offset": 15,
"end_offset": 19,
"type": "word",
"position": 39
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 40
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 41
},
{
"token": "for",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 42
},
{
"token": "multi",
"start_offset": 15,
"end_offset": 20,
"type": "word",
"position": 43
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 44
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 45
},
{
"token": "for",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 46
},
{
"token": "multip",
"start_offset": 15,
"end_offset": 21,
"type": "word",
"position": 47
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 48
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 49
},
{
"token": "for",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 50
},
{
"token": "multipl",
"start_offset": 15,
"end_offset": 22,
"type": "word",
"position": 51
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 52
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 53
},
{
"token": "for",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 54
},
{
"token": "multiple",
"start_offset": 15,
"end_offset": 23,
"type": "word",
"position": 55
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 56
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 57
},
{
"token": "for",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 58
},
{
"token": "multiple",
"start_offset": 15,
"end_offset": 23,
"type": "word",
"position": 59
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 60
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 61
},
{
"token": "for",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 62
},
{
"token": "multiple",
"start_offset": 15,
"end_offset": 23,
"type": "word",
"position": 63
},
{
"token": "w",
"start_offset": 24,
"end_offset": 25,
"type": "word",
"position": 64
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 65
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 66
},
{
"token": "for",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 67
},
{
"token": "multiple",
"start_offset": 15,
"end_offset": 23,
"type": "word",
"position": 68
},
{
"token": "wi",
"start_offset": 24,
"end_offset": 26,
"type": "word",
"position": 69
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 70
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 71
},
{
"token": "for",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 72
},
{
"token": "multiple",
"start_offset": 15,
"end_offset": 23,
"type": "word",
"position": 73
},
{
"token": "win",
"start_offset": 24,
"end_offset": 27,
"type": "word",
"position": 74
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 75
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 76
},
{
"token": "for",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 77
},
{
"token": "multiple",
"start_offset": 15,
"end_offset": 23,
"type": "word",
"position": 78
},
{
"token": "wins",
"start_offset": 24,
"end_offset": 28,
"type": "word",
"position": 79
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 80
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 81
},
{
"token": "for",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 82
},
{
"token": "multiple",
"start_offset": 15,
"end_offset": 23,
"type": "word",
"position": 83
},
{
"token": "wins",
"start_offset": 24,
"end_offset": 28,
"type": "word",
"position": 84
},
{
"token": "please",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 85
},
{
"token": "pay",
"start_offset": 7,
"end_offset": 10,
"type": "word",
"position": 86
},
{
"token": "for",
"start_offset": 11,
"end_offset": 14,
"type": "word",
"position": 87
},
{
"token": "multiple",
"start_offset": 15,
"end_offset": 23,
"type": "word",
"position": 88
},
{
"token": "wins",
"start_offset": 24,
"end_offset": 28,
"type": "word",
"position": 89
},
{
"token": "w",
"start_offset": 29,
"end_offset": 30,
"type": "word",
"position": 90
}
]
}
Here are my questions:
Why there are tokens shoter then 5 characters?
Why "position" property shows position of the token, but not the position of the word in the text? Looks like the other tokenizers works in that way.
Why there are not all the words in the output? Looks like it stops on the "wins".
Why there are so many repeats of the same token?

When building custom analyzers, it's worth going step-by-step and checking what is generated by each step in the analysis chain:
first the tokenizer slices and dices your input into tokens
then tokens filters take the tokens from step 1 as input and do their thing
finally char filters are applied
In your case, if you check what comes out of the tokenizer phase, it goes like this. See we're just specifying the tokenizer (i.e. left_tokenizer) as parameter.
curl -XGET 'localhost:9201/test/_analyze?tokenizer=left_tokenizer&pretty' -d 'please pay for multiple wins with only one payment'
The result is:
{
"tokens" : [ {
"token" : "pleas",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 1
}, {
"token" : "please",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 2
}, {
"token" : "please ",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 3
}, {
"token" : "please p",
"start_offset" : 0,
"end_offset" : 8,
"type" : "word",
"position" : 4
}, {
"token" : "please pa",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 5
}, {
"token" : "please pay",
"start_offset" : 0,
"end_offset" : 10,
"type" : "word",
"position" : 6
}, {
"token" : "please pay ",
"start_offset" : 0,
"end_offset" : 11,
"type" : "word",
"position" : 7
}, {
"token" : "please pay f",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 8
}, {
"token" : "please pay fo",
"start_offset" : 0,
"end_offset" : 13,
"type" : "word",
"position" : 9
}, {
"token" : "please pay for",
"start_offset" : 0,
"end_offset" : 14,
"type" : "word",
"position" : 10
}, {
"token" : "please pay for ",
"start_offset" : 0,
"end_offset" : 15,
"type" : "word",
"position" : 11
}, {
"token" : "please pay for m",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 12
}, {
"token" : "please pay for mu",
"start_offset" : 0,
"end_offset" : 17,
"type" : "word",
"position" : 13
}, {
"token" : "please pay for mul",
"start_offset" : 0,
"end_offset" : 18,
"type" : "word",
"position" : 14
}, {
"token" : "please pay for mult",
"start_offset" : 0,
"end_offset" : 19,
"type" : "word",
"position" : 15
}, {
"token" : "please pay for multi",
"start_offset" : 0,
"end_offset" : 20,
"type" : "word",
"position" : 16
}, {
"token" : "please pay for multip",
"start_offset" : 0,
"end_offset" : 21,
"type" : "word",
"position" : 17
}, {
"token" : "please pay for multipl",
"start_offset" : 0,
"end_offset" : 22,
"type" : "word",
"position" : 18
}, {
"token" : "please pay for multiple",
"start_offset" : 0,
"end_offset" : 23,
"type" : "word",
"position" : 19
"position" : 20
}, {
"token" : "please pay for multiple w",
"start_offset" : 0,
"end_offset" : 25,
"type" : "word",
"position" : 21
}, {
"token" : "please pay for multiple wi",
"start_offset" : 0,
"end_offset" : 26,
"type" : "word",
"position" : 22
}, {
"token" : "please pay for multiple win",
"start_offset" : 0,
"end_offset" : 27,
"type" : "word",
"position" : 23
}, {
"token" : "please pay for multiple wins",
"start_offset" : 0,
"end_offset" : 28,
"type" : "word",
"position" : 24
}, {
"token" : "please pay for multiple wins ",
"start_offset" : 0,
"end_offset" : 29,
"type" : "word",
"position" : 25
}, {
"token" : "please pay for multiple wins w",
"start_offset" : 0,
"end_offset" : 30,
"type" : "word",
"position" : 26
} ]
}
Then, your token filters will take each of the tokens above and do their job. For instance,
the first token pleas will come out as pleas
the second token please as please
the third token please (note the space at the end), as please
the fourth token please p as the two tokens please and p
the fifth token please pa as the two tokens please and pa
etc
So, your left_tokenizer considers the whole sentence as a single token input and tokenizes it from 5 characters to 30, which is why it stops at wins (that answers question 3)
As you can see above, some tokens are repeated because the word_delimiter token filter treats each token from the tokenizer in isolation, hence the "duplicates" (that answers question 4) and the tokens shorter than 5 characters (that answers question 1)
I don't think this is the way you want it to work, but it's not clear from your question how you want it to work, i.e. the kind of searches you want to be able to do. All I'm offering here is an explanation of what you're seeing.

What Elasticsearch analyzer should I use to search hybrid english words, product information?

My team is attempting to index our item information and need a sanity check on what I have created so far. Here is an example of some of the text we need to search on:
AA4VG90EP4DM1/32R-NSF52F001DX-S WITH DAMAN MANIFOLD 0281CS0011 SIEMENS SPEC 74/07104909/10 REV L WOMACK SYSTEMS
As you can see there is a mix of english words and random numbers and letters. After doing some research online I decided to go with a word delimiter and whitespace tokenizer. Here is the analyzer I am currently using:
{
{
   "itemindex": {
      "settings": {
         "index": {
            "uuid": "1HxasKSCSW2iRHf6pYfkWw",
            "analysis": {
               "analyzer": {
                  "my_analyzer": {
                     "type": "custom",
                     "filter": [
                        "lowercase",
                        "my_word_delimiter"
                     ],
                     "tokenizer": "whitespace"
                  }
               },
               "filter": {
                  "my_word_delimiter": {
                     "type_table": "/ => ALPHANUM",
                     "preserve_original": "true",
                     "catenate_words": "true",
                     "type": "word_delimiter"
                  }
               }
            },
            "number_of_replicas": "1",
            "number_of_shards": "5",
            "version": {
               "created": "1000099"
            }
         }
      }
   }
}
Here is the output from the analyze api for the above description:
{
"tokens": [
{
"token": "aa4vg90ep4dm1/32r-nsf52f001dx-s",
"start_offset": 0,
"end_offset": 31,
"type": "word",
"position": 1
},
{
"token": "aa",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 1
},
{
"token": "4",
"start_offset": 2,
"end_offset": 3,
"type": "word",
"position": 2
},
{
"token": "vg",
"start_offset": 3,
"end_offset": 5,
"type": "word",
"position": 3
},
{
"token": "90",
"start_offset": 5,
"end_offset": 7,
"type": "word",
"position": 4
},
{
"token": "ep",
"start_offset": 7,
"end_offset": 9,
"type": "word",
"position": 5
},
{
"token": "4",
"start_offset": 9,
"end_offset": 10,
"type": "word",
"position": 6
},
{
"token": "dm",
"start_offset": 10,
"end_offset": 12,
"type": "word",
"position": 7
},
{
"token": "1/32",
"start_offset": 12,
"end_offset": 16,
"type": "word",
"position": 8
},
{
"token": "r",
"start_offset": 16,
"end_offset": 17,
"type": "word",
"position": 9
},
{
"token": "nsf",
"start_offset": 18,
"end_offset": 21,
"type": "word",
"position": 10
},
{
"token": "rnsf",
"start_offset": 16,
"end_offset": 21,
"type": "word",
"position": 10
},
{
"token": "52",
"start_offset": 21,
"end_offset": 23,
"type": "word",
"position": 11
},
{
"token": "f",
"start_offset": 23,
"end_offset": 24,
"type": "word",
"position": 12
},
{
"token": "001",
"start_offset": 24,
"end_offset": 27,
"type": "word",
"position": 13
},
{
"token": "dx",
"start_offset": 27,
"end_offset": 29,
"type": "word",
"position": 14
},
{
"token": "s",
"start_offset": 30,
"end_offset": 31,
"type": "word",
"position": 15
},
{
"token": "dxs",
"start_offset": 27,
"end_offset": 31,
"type": "word",
"position": 15
},
{
"token": "with",
"start_offset": 32,
"end_offset": 36,
"type": "word",
"position": 16
},
{
"token": "daman",
"start_offset": 37,
"end_offset": 42,
"type": "word",
"position": 17
},
{
"token": "manifold",
"start_offset": 43,
"end_offset": 51,
"type": "word",
"position": 18
},
{
"token": "0281cs0011",
"start_offset": 52,
"end_offset": 62,
"type": "word",
"position": 19
},
{
"token": "0281",
"start_offset": 52,
"end_offset": 56,
"type": "word",
"position": 19
},
{
"token": "cs",
"start_offset": 56,
"end_offset": 58,
"type": "word",
"position": 20
},
{
"token": "0011",
"start_offset": 58,
"end_offset": 62,
"type": "word",
"position": 21
},
{
"token": "siemens",
"start_offset": 63,
"end_offset": 70,
"type": "word",
"position": 22
},
{
"token": "spec",
"start_offset": 71,
"end_offset": 75,
"type": "word",
"position": 23
},
{
"token": "74/07104909/10",
"start_offset": 76,
"end_offset": 90,
"type": "word",
"position": 24
},
{
"token": "rev",
"start_offset": 91,
"end_offset": 94,
"type": "word",
"position": 25
},
{
"token": "l",
"start_offset": 95,
"end_offset": 96,
"type": "word",
"position": 26
},
{
"token": "womack",
"start_offset": 98,
"end_offset": 104,
"type": "word",
"position": 27
},
{
"token": "systems",
"start_offset": 105,
"end_offset": 112,
"type": "word",
"position": 28
}
]
}
Finally, here is the NEST query I am using:
var results = client.Search<SearchResult>(s => s.Index("itemindex")
.Query(q => q
.QueryString(qs=> qs
.OnFields(f=> f.Description, f=> f.VendorPartNumber, f=> f.ItemNumber)
.Operator(Operator.or)
.Query(query + "*")))
.SortDescending("_score")
.Highlight(h => h
.OnFields(f => f
.OnField(e => e.Description)
.BoundaryCharacters(" ,")
.PreTags("<b>")
.PostTags("</b>")))
.From(start)
.Size(size));

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Elasticsearch tokenization for international languages - elasticsearch

Related

How to stop storing special characters in content while indexing

Edge NGram with phrase matching

Elasticsearch - using ngrams as a tokenizer and filter gives different outputs

ElasticSearch - Problems with edgeNGram tokenizer

What Elasticsearch analyzer should I use to search hybrid english words, product information?

Categories

Resources