Elasticsearch analyzer to enable matching searches such as C#, C++, A+

I am trying to create a custom analyzer in Elasticsearch to enable matching terms such as C#, C++, and A+; currently it only matches C, C, and A.
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "keyword",
          "type_table": [
            "# => ALPHANUM",
            "+ => ALPHANUM"
          ],
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}
I've tried testing the analyzer with the _analyze API using the following:
{
"analyzer": "my_custom_analyzer",
"text": "CSS, A++, C#.Net, ASP.Net Hospitals is Africa's leading and the fastest growing super specialty care and multi-organ transplantation hospital. Designed the User Interfaces, User Controls according the requirements\n· Developed Cascading Style Sheets (CSS) for User Interface uniformity throughout the application\n· Involved in programming the business logic layer and data access layer\n· Involved in in developing pages in ASP.Net with C#.Net"
}
Result:
{
"tokens": [
{
"token": "CSS, A++, C#.Net, ASP.Net Hospitals is Africa's leading and the fastest growing super specialty care and multi-organ transplantation hospital. Designed the User Interfaces, User Controls according the requirements\n· Developed Cascading Style Sheets (CSS) for User Interface uniformity throughout the application\n· Involved in programming the business logic layer and data access layer\n· Involved in in developing pages in ASP.Net with C#.Net",
"start_offset": 0,
"end_offset": 443,
"type": "word",
"position": 0
}
]
}
Also, I am not sure how to enable the analyzer; should this be done in the mappings?
{
"properties": {
"attachment.content": {
"type": "my_custom_analyzer"
}
}
}
Response when trying to use it in mappings:
{
"error": {
"root_cause": [
{
"type": "mapper_parsing_exception",
"reason": "No handler for type [my_custom_analyzer] declared on field [attachment.content]"
}
],
"type": "mapper_parsing_exception",
"reason": "No handler for type [my_custom_analyzer] declared on field [attachment.content]"
},
"status": 400
}
Any help will be highly appreciated.

I managed to get a proper response from the ES API using the following. It's not 100% yet, but it's on the right track: highlighting still doesn't work, which is a problem, but when I test with the _analyze API I get a response that I think is heading in the right direction.
{
  "settings": {
    "analysis": {
      "filter": {
        "my_delimeter": {
          "type": "word_delimiter",
          "type_table": [
            "# => ALPHANUM",
            "+ => ALPHANUM",
            ". => ALPHANUM"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "filter": ["lowercase", "my_delimeter"]
        }
      }
    }
  }
}
Text that I am analyzing:
{
"analyzer": "my_analyzer",
"text": "CSS, A++, C#.Net, ASP.Net Hospitals is Africa's leading and the fastest growing super specialty care and multi-organ transplantation hospital. Designed the User Interfaces, User Controls according the requirements\n· Developed Cascading Style Sheets (CSS) for User Interface uniformity throughout the application\n· Involved in programming the business logic layer and data access layer\n· Involved in in developing pages in ASP.Net with C#.Net"
}
Response:
{
"tokens": [
{
"token": "css",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "a++",
"start_offset": 5,
"end_offset": 8,
"type": "word",
"position": 1
},
{
"token": "c#.net",
"start_offset": 10,
"end_offset": 16,
"type": "word",
"position": 2
},
{
"token": "asp.net",
"start_offset": 18,
"end_offset": 25,
"type": "word",
"position": 3
},
{
"token": "hospitals",
"start_offset": 26,
"end_offset": 35,
"type": "word",
"position": 4
},
{
"token": "is",
"start_offset": 36,
"end_offset": 38,
"type": "word",
"position": 5
},
{
"token": "africa",
"start_offset": 39,
"end_offset": 45,
"type": "word",
"position": 6
},
{
"token": "leading",
"start_offset": 48,
"end_offset": 55,
"type": "word",
"position": 7
},
{
"token": "and",
"start_offset": 56,
"end_offset": 59,
"type": "word",
"position": 8
},
{
"token": "the",
"start_offset": 60,
"end_offset": 63,
"type": "word",
"position": 9
},
{
"token": "fastest",
"start_offset": 64,
"end_offset": 71,
"type": "word",
"position": 10
},
{
"token": "growing",
"start_offset": 72,
"end_offset": 79,
"type": "word",
"position": 11
},
{
"token": "super",
"start_offset": 80,
"end_offset": 85,
"type": "word",
"position": 12
},
{
"token": "specialty",
"start_offset": 86,
"end_offset": 95,
"type": "word",
"position": 13
},
{
"token": "care",
"start_offset": 96,
"end_offset": 100,
"type": "word",
"position": 14
},
{
"token": "and",
"start_offset": 101,
"end_offset": 104,
"type": "word",
"position": 15
},
{
"token": "multi",
"start_offset": 105,
"end_offset": 110,
"type": "word",
"position": 16
},
{
"token": "organ",
"start_offset": 111,
"end_offset": 116,
"type": "word",
"position": 17
},
{
"token": "transplantation",
"start_offset": 117,
"end_offset": 132,
"type": "word",
"position": 18
},
{
"token": "hospital.",
"start_offset": 133,
"end_offset": 142,
"type": "word",
"position": 19
},
{
"token": "designed",
"start_offset": 143,
"end_offset": 151,
"type": "word",
"position": 20
},
{
"token": "the",
"start_offset": 152,
"end_offset": 155,
"type": "word",
"position": 21
},
{
"token": "user",
"start_offset": 156,
"end_offset": 160,
"type": "word",
"position": 22
},
{
"token": "interfaces",
"start_offset": 161,
"end_offset": 171,
"type": "word",
"position": 23
},
{
"token": "user",
"start_offset": 173,
"end_offset": 177,
"type": "word",
"position": 24
},
{
"token": "controls",
"start_offset": 178,
"end_offset": 186,
"type": "word",
"position": 25
},
{
"token": "according",
"start_offset": 187,
"end_offset": 196,
"type": "word",
"position": 26
},
{
"token": "the",
"start_offset": 197,
"end_offset": 200,
"type": "word",
"position": 27
},
{
"token": "requirements",
"start_offset": 201,
"end_offset": 213,
"type": "word",
"position": 28
},
{
"token": "developed",
"start_offset": 216,
"end_offset": 225,
"type": "word",
"position": 29
},
{
"token": "cascading",
"start_offset": 226,
"end_offset": 235,
"type": "word",
"position": 30
},
{
"token": "style",
"start_offset": 236,
"end_offset": 241,
"type": "word",
"position": 31
},
{
"token": "sheets",
"start_offset": 242,
"end_offset": 248,
"type": "word",
"position": 32
},
{
"token": "css",
"start_offset": 250,
"end_offset": 253,
"type": "word",
"position": 33
},
{
"token": "for",
"start_offset": 255,
"end_offset": 258,
"type": "word",
"position": 34
},
{
"token": "user",
"start_offset": 259,
"end_offset": 263,
"type": "word",
"position": 35
},
{
"token": "interface",
"start_offset": 264,
"end_offset": 273,
"type": "word",
"position": 36
},
{
"token": "uniformity",
"start_offset": 274,
"end_offset": 284,
"type": "word",
"position": 37
},
{
"token": "throughout",
"start_offset": 285,
"end_offset": 295,
"type": "word",
"position": 38
},
{
"token": "the",
"start_offset": 296,
"end_offset": 299,
"type": "word",
"position": 39
},
{
"token": "application",
"start_offset": 302,
"end_offset": 313,
"type": "word",
"position": 40
},
{
"token": "involved",
"start_offset": 316,
"end_offset": 324,
"type": "word",
"position": 41
},
{
"token": "in",
"start_offset": 325,
"end_offset": 327,
"type": "word",
"position": 42
},
{
"token": "programming",
"start_offset": 328,
"end_offset": 339,
"type": "word",
"position": 43
},
{
"token": "the",
"start_offset": 340,
"end_offset": 343,
"type": "word",
"position": 44
},
{
"token": "business",
"start_offset": 344,
"end_offset": 352,
"type": "word",
"position": 45
},
{
"token": "logic",
"start_offset": 353,
"end_offset": 358,
"type": "word",
"position": 46
},
{
"token": "layer",
"start_offset": 359,
"end_offset": 364,
"type": "word",
"position": 47
},
{
"token": "and",
"start_offset": 365,
"end_offset": 368,
"type": "word",
"position": 48
},
{
"token": "data",
"start_offset": 369,
"end_offset": 373,
"type": "word",
"position": 49
},
{
"token": "access",
"start_offset": 374,
"end_offset": 380,
"type": "word",
"position": 50
},
{
"token": "layer",
"start_offset": 381,
"end_offset": 386,
"type": "word",
"position": 51
},
{
"token": "involved",
"start_offset": 389,
"end_offset": 397,
"type": "word",
"position": 52
},
{
"token": "in",
"start_offset": 398,
"end_offset": 400,
"type": "word",
"position": 53
},
{
"token": "in",
"start_offset": 401,
"end_offset": 403,
"type": "word",
"position": 54
},
{
"token": "developing",
"start_offset": 404,
"end_offset": 414,
"type": "word",
"position": 55
},
{
"token": "pages",
"start_offset": 415,
"end_offset": 420,
"type": "word",
"position": 56
},
{
"token": "in",
"start_offset": 421,
"end_offset": 423,
"type": "word",
"position": 57
},
{
"token": "asp.net",
"start_offset": 424,
"end_offset": 431,
"type": "word",
"position": 58
},
{
"token": "with",
"start_offset": 432,
"end_offset": 436,
"type": "word",
"position": 59
},
{
"token": "c#.net",
"start_offset": 437,
"end_offset": 443,
"type": "word",
"position": 60
}
]
}
Tried these mappings:
{
  "properties": {
    "attachment.content": {
      "type": "text",
      "search_analyzer": "my_analyzer",
      "analyzer": "my_analyzer",
      "fields": {
        "content": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}
The highlight response is still:
"highlight": {
"skills": [
"<em>C</em>#",
"Microsoft Visual Studio <em>C</em># (Windows Form and Web APP) and Java Eclipse"
]
}

You have to add a custom char filter in your analyzer, in the settings of your index:
"char_filter": {
"languages_filter": {
"type": "mapping",
"mappings": ["c++ => cpp", "C++ => cpp", "IT => _IT_", "a+ => ap", "A+ => ap", "C# => csharp", "c# => csharp"]
}
}
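For completeness, a full settings block that wires this char filter into a custom analyzer might look like the following sketch (the analyzer name my_custom_analyzer and the standard tokenizer are illustrative choices):
PUT my-index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "languages_filter": {
          "type": "mapping",
          "mappings": ["c++ => cpp", "C++ => cpp", "IT => _IT_", "a+ => ap", "A+ => ap", "C# => csharp", "c# => csharp"]
        }
      },
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": ["languages_filter"],
          "filter": ["lowercase"]
        }
      }
    }
  }
}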
Then you will add this custom analyzer to the mapping of your field:
PUT my-index/_mapping
{
  "properties": {
    "my-field": {
      "type": "text",
      "analyzer": "my_custom_analyzer"
    }
  }
}
Note: You cannot change the analyzer on an existing field. You will have to add a new field or reindex.
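If you do need to move an existing field onto the new analyzer, one common approach is to create a new index with the updated settings and mapping and copy the documents over with the _reindex API, for example (index names are illustrative):
POST _reindex
{
  "source": { "index": "my-index" },
  "dest": { "index": "my-index-v2" }
}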

Related

How to stop storing special characters in content while indexing

This is a sample document with the following points:
Pharmaceutical
Marketing
Building –
responsibilities. Â
Mass. – Aug. 13, 2020 –Â
How do I remove the special characters or non-ASCII Unicode characters from the content while indexing? I'm using ES 7.x and StormCrawler 1.17.
Looks like an incorrect detection of the charset. You could normalise the content before indexing by writing a custom parse filter and removing the unwanted characters there.
If writing a custom parse filter and normalizing looks difficult, you can simply add the asciifolding token filter to your analyzer definition, which converts non-ASCII characters to their ASCII equivalents, as shown below:
POST http://{{hostname}}:{{port}}/_analyze
{
"tokenizer": "standard",
"filter": [
"asciifolding"
],
"text": "Pharmaceutical Marketing Building â responsibilities.  Mass. â Aug. 13, 2020 âÂ"
}
And the generated tokens for your text:
{
"tokens": [
{
"token": "Pharmaceutical",
"start_offset": 0,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "Marketing",
"start_offset": 15,
"end_offset": 24,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "Building",
"start_offset": 25,
"end_offset": 33,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "a",
"start_offset": 34,
"end_offset": 35,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "responsibilities.A",
"start_offset": 36,
"end_offset": 54,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "A",
"start_offset": 55,
"end_offset": 56,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "Mass",
"start_offset": 57,
"end_offset": 61,
"type": "<ALPHANUM>",
"position": 6
},
{
"token": "a",
"start_offset": 63,
"end_offset": 64,
"type": "<ALPHANUM>",
"position": 7
},
{
"token": "Aug",
"start_offset": 65,
"end_offset": 68,
"type": "<ALPHANUM>",
"position": 8
},
{
"token": "13",
"start_offset": 70,
"end_offset": 72,
"type": "<NUM>",
"position": 9
},
{
"token": "2020",
"start_offset": 74,
"end_offset": 78,
"type": "<NUM>",
"position": 10
},
{
"token": "aA",
"start_offset": 79,
"end_offset": 81,
"type": "<ALPHANUM>",
"position": 11
}
]
}
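If that output looks right for your data, a minimal index definition that bakes asciifolding into a custom analyzer and applies it to the indexed field could look like this sketch (index, analyzer, and field names are illustrative; the syntax assumes ES 7.x as mentioned in the question):
PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "folding_analyzer"
      }
    }
  }
}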

Analyze API does not work for Elasticsearch 1.7

We are running Elasticsearch 1.7 (planning to upgrade very soon) and I am trying to use the Analyze API to understand what the different analyzers do, but the result returned by Elasticsearch is not what I expect.
If I run the following query against our elasticsearch instance
GET _analyze
{
"analyzer": "stop",
"text": "Extremely good food! We had the happiest waiter and the crowd's always flowing!"
}
I get this result:
{
"tokens": [
{
"token": "analyzer",
"start_offset": 6,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "stop",
"start_offset": 18,
"end_offset": 22,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "text",
"start_offset": 30,
"end_offset": 34,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "extremely",
"start_offset": 38,
"end_offset": 47,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "good",
"start_offset": 48,
"end_offset": 52,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "food",
"start_offset": 53,
"end_offset": 57,
"type": "<ALPHANUM>",
"position": 6
},
{
"token": "we",
"start_offset": 59,
"end_offset": 61,
"type": "<ALPHANUM>",
"position": 7
},
{
"token": "had",
"start_offset": 62,
"end_offset": 65,
"type": "<ALPHANUM>",
"position": 8
},
{
"token": "the",
"start_offset": 66,
"end_offset": 69,
"type": "<ALPHANUM>",
"position": 9
},
{
"token": "happiest",
"start_offset": 70,
"end_offset": 78,
"type": "<ALPHANUM>",
"position": 10
},
{
"token": "waiter",
"start_offset": 79,
"end_offset": 85,
"type": "<ALPHANUM>",
"position": 11
},
{
"token": "and",
"start_offset": 86,
"end_offset": 89,
"type": "<ALPHANUM>",
"position": 12
},
{
"token": "the",
"start_offset": 90,
"end_offset": 93,
"type": "<ALPHANUM>",
"position": 13
},
{
"token": "crowd's",
"start_offset": 94,
"end_offset": 101,
"type": "<ALPHANUM>",
"position": 14
},
{
"token": "always",
"start_offset": 102,
"end_offset": 108,
"type": "<ALPHANUM>",
"position": 15
},
{
"token": "flowing",
"start_offset": 109,
"end_offset": 116,
"type": "<ALPHANUM>",
"position": 16
}
]
}
which does not make sense to me. I am using the stop analyzer, so why are the words "and" and "the" in the result? I have tried changing the stop analyzer to both whitespace and standard, but I get the exact same result as above; there is no difference between them.
However, if I run the exact same query against an instance of Elasticsearch 5.x, the result no longer contains "and" and "the" and it looks much more like what I expect.
Is this because we are using 1.7 or is it something in our setup of Elasticsearch that is causing this issue?
Edit:
I am using the Sense plugin in Chrome to do my queries, and the plugin does not support GET with a request body, so it changes the request to a POST. The Analyze API in Elasticsearch 1.7 does not seem to support POST requests :( If I change the query to GET _analyze?analyzer=stop&text=THIS+is+a+test&pretty it works.
In 1.x the syntax is different from 2.x and 5.x. According to the 1.x documentation, you should be using the _analyze API like this:
GET _analyze?analyzer=stop
{
"text": "Extremely good food! We had the happiest waiter and the crowd's always flowing!"
}
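Equivalently, on 1.x you can pass everything in the query string, which also avoids the Sense GET-to-POST issue described in the edit above, for example:
GET _analyze?analyzer=stop&text=Extremely+good+food&pretty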

ElasticSearch comma inside the number

I want to remove commas inside numbers. Take, for instance, "Warhammer 40,000: Dawn of War III":
I want it to match if you search for "40000".
But currently my tokenizer gives me:
{
"tokens": [
{
"token": "warhammer",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "warhammer 40",
"start_offset": 0,
"end_offset": 12,
"type": "shingle",
"position": 0
},
{
"token": "40",
"start_offset": 10,
"end_offset": 12,
"type": "word",
"position": 1
},
{
"token": "000:",
"start_offset": 13,
"end_offset": 17,
"type": "word",
"position": 102
},
{
"token": "000: 000",
"start_offset": 13,
"end_offset": 16,
"type": "shingle",
"position": 102
},
{
"token": "000: 000 dawn",
"start_offset": 13,
"end_offset": 22,
"type": "shingle",
"position": 102
},
{
"token": "000: 000 dawn of",
"start_offset": 13,
"end_offset": 25,
"type": "shingle",
"position": 102
},
{
"token": "000: 000 dawn of war",
"start_offset": 13,
"end_offset": 29,
"type": "shingle",
"position": 102
},
{
"token": "000: 000 dawn of war 3",
"start_offset": 13,
"end_offset": 33,
"type": "shingle",
"position": 102
},
{
"token": "000",
"start_offset": 13,
"end_offset": 16,
"type": "word",
"position": 103
},
{
"token": "000 dawn",
"start_offset": 13,
"end_offset": 22,
"type": "shingle",
"position": 103
},
{
"token": "000 dawn of",
"start_offset": 13,
"end_offset": 25,
"type": "shingle",
"position": 103
},
{
"token": "000 dawn of war",
"start_offset": 13,
"end_offset": 29,
"type": "shingle",
"position": 103
},
{
"token": "000 dawn of war 3",
"start_offset": 13,
"end_offset": 33,
"type": "shingle",
"position": 103
},
{
"token": "dawn",
"start_offset": 18,
"end_offset": 22,
"type": "word",
"position": 104
},
{
"token": "dawn of",
"start_offset": 18,
"end_offset": 25,
"type": "shingle",
"position": 104
},
{
"token": "dawn of war",
"start_offset": 18,
"end_offset": 29,
"type": "shingle",
"position": 104
},
{
"token": "dawn of war 3",
"start_offset": 18,
"end_offset": 33,
"type": "shingle",
"position": 104
},
{
"token": "of war",
"start_offset": 23,
"end_offset": 29,
"type": "shingle",
"position": 105
},
{
"token": "of war 3",
"start_offset": 23,
"end_offset": 33,
"type": "shingle",
"position": 105
},
{
"token": "war",
"start_offset": 26,
"end_offset": 29,
"type": "word",
"position": 106
},
{
"token": "war 3",
"start_offset": 26,
"end_offset": 33,
"type": "shingle",
"position": 106
},
{
"token": "3",
"start_offset": 30,
"end_offset": 33,
"type": "SYNONYM",
"position": 107
}
]
}
The main issue here is that "40" and "000" are different tokens. I think it's best to treat them as a single token, "40000". Is there a token filter that can merge the two?
EDIT:
Ohhhh!
I tried:
"analyzer": {
"default": {
"tokenizer": "keyword"
}}
The result of:
http://localhost:9200/i/_analyze?text=Warhammer%2040,000:%20Dawn%20of%20War%20III
Gave me:
{
"tokens": [
{
"token": "Warhammer 40",
"start_offset": 0,
"end_offset": 12,
"type": "word",
"position": 0
},
{
"token": "000: Dawn of War III",
"start_offset": 13,
"end_offset": 33,
"type": "word",
"position": 101
}
]
}
You can merge the parts of such numbers using a character filter. In the following snippet, the character filter called "decimal_mark_filter" removes any comma that appears between digits before tokenization takes place.
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "decimal_mark_filter"
          ]
        }
      },
      "char_filter": {
        "decimal_mark_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+),(?=\\d)",
          "replacement": "$1"
        }
      }
    }
  }
}
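For reference, the token list below can be reproduced with an _analyze call along these lines (the index name is illustrative; it must exist with the settings above):
GET my-index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Warhammer 40,000: Dawn of War III"
}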
The analyzer gives the following tokens:
{
"tokens": [
{
"token": "Warhammer",
"start_offset": 0,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "40000",
"start_offset": 10,
"end_offset": 16,
"type": "<NUM>",
"position": 1
},
{
"token": "Dawn",
"start_offset": 18,
"end_offset": 22,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "of",
"start_offset": 23,
"end_offset": 25,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "War",
"start_offset": 26,
"end_offset": 29,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "III",
"start_offset": 30,
"end_offset": 33,
"type": "<ALPHANUM>",
"position": 5
}
]
}
This is just a modification of an example in the official Elasticsearch documentation.

Edge NGram with phrase matching

I need to autocomplete phrases. For example, when I search "dementia in alz", I want to get "dementia in alzheimer's".
For this, I configured an Edge NGram tokenizer. I tried both edge_ngram_analyzer and standard as the analyzer in the query body. Nevertheless, I can't get any results when I try to match a phrase.
What am I doing wrong?
My query:
{
"query":{
"multi_match":{
"query":"dementia in alz",
"type":"phrase",
"analyzer":"edge_ngram_analyzer",
"fields":["_all"]
}
}
}
My mappings:
...
"type" : {
"_all" : {
"analyzer" : "edge_ngram_analyzer",
"search_analyzer" : "standard"
},
"properties" : {
"field" : {
"type" : "string",
"analyzer" : "edge_ngram_analyzer",
"search_analyzer" : "standard"
},
...
"settings" : {
...
"analysis" : {
"filter" : {
"stem_possessive_filter" : {
"name" : "possessive_english",
"type" : "stemmer"
}
},
"analyzer" : {
"edge_ngram_analyzer" : {
"filter" : [ "lowercase" ],
"tokenizer" : "edge_ngram_tokenizer"
}
},
"tokenizer" : {
"edge_ngram_tokenizer" : {
"token_chars" : [ "letter", "digit", "whitespace" ],
"min_gram" : "2",
"type" : "edgeNGram",
"max_gram" : "25"
}
}
}
...
My documents:
{
"_score": 1.1152233,
"_type": "Diagnosis",
"_id": "AVZLfHfBE5CzEm8aJ3Xp",
"_source": {
"#timestamp": "2016-08-02T13:40:48.665Z",
"type": "Diagnosis",
"Document_ID": "Diagnosis_1400541",
"Diagnosis": "F00.0 - Dementia in Alzheimer's disease with early onset",
"#version": "1",
},
"_index": "carenotes"
},
{
"_score": 1.1152233,
"_type": "Diagnosis",
"_id": "AVZLfICrE5CzEm8aJ4Dc",
"_source": {
"#timestamp": "2016-08-02T13:40:51.240Z",
"type": "Diagnosis",
"Document_ID": "Diagnosis_1424351",
"Diagnosis": "F00.1 - Dementia in Alzheimer's disease with late onset",
"#version": "1",
},
"_index": "carenotes"
}
Analysis of the "dementia in alzheimer" phrase:
{
"tokens": [
{
"end_offset": 2,
"token": "de",
"type": "word",
"start_offset": 0,
"position": 0
},
{
"end_offset": 3,
"token": "dem",
"type": "word",
"start_offset": 0,
"position": 1
},
{
"end_offset": 4,
"token": "deme",
"type": "word",
"start_offset": 0,
"position": 2
},
{
"end_offset": 5,
"token": "demen",
"type": "word",
"start_offset": 0,
"position": 3
},
{
"end_offset": 6,
"token": "dement",
"type": "word",
"start_offset": 0,
"position": 4
},
{
"end_offset": 7,
"token": "dementi",
"type": "word",
"start_offset": 0,
"position": 5
},
{
"end_offset": 8,
"token": "dementia",
"type": "word",
"start_offset": 0,
"position": 6
},
{
"end_offset": 9,
"token": "dementia ",
"type": "word",
"start_offset": 0,
"position": 7
},
{
"end_offset": 10,
"token": "dementia i",
"type": "word",
"start_offset": 0,
"position": 8
},
{
"end_offset": 11,
"token": "dementia in",
"type": "word",
"start_offset": 0,
"position": 9
},
{
"end_offset": 12,
"token": "dementia in ",
"type": "word",
"start_offset": 0,
"position": 10
},
{
"end_offset": 13,
"token": "dementia in a",
"type": "word",
"start_offset": 0,
"position": 11
},
{
"end_offset": 14,
"token": "dementia in al",
"type": "word",
"start_offset": 0,
"position": 12
},
{
"end_offset": 15,
"token": "dementia in alz",
"type": "word",
"start_offset": 0,
"position": 13
},
{
"end_offset": 16,
"token": "dementia in alzh",
"type": "word",
"start_offset": 0,
"position": 14
},
{
"end_offset": 17,
"token": "dementia in alzhe",
"type": "word",
"start_offset": 0,
"position": 15
},
{
"end_offset": 18,
"token": "dementia in alzhei",
"type": "word",
"start_offset": 0,
"position": 16
},
{
"end_offset": 19,
"token": "dementia in alzheim",
"type": "word",
"start_offset": 0,
"position": 17
},
{
"end_offset": 20,
"token": "dementia in alzheime",
"type": "word",
"start_offset": 0,
"position": 18
},
{
"end_offset": 21,
"token": "dementia in alzheimer",
"type": "word",
"start_offset": 0,
"position": 19
}
]
}
Many thanks to rendel, who helped me find the right solution!
The solution of Andrei Stefan is not optimal.
Why? First, the absence of a lowercase filter in the search analyzer makes searching inconvenient; the case must match exactly. A custom analyzer with a lowercase filter is needed instead of "analyzer": "keyword".
Second, the analysis part is wrong!
At index time the string "F00.0 - Dementia in Alzheimer's disease with early onset" is analyzed by edge_ngram_analyzer. With this analyzer, we get the following array of tokens for the analyzed string:
{
"tokens": [
{
"end_offset": 2,
"token": "f0",
"type": "word",
"start_offset": 0,
"position": 0
},
{
"end_offset": 3,
"token": "f00",
"type": "word",
"start_offset": 0,
"position": 1
},
{
"end_offset": 6,
"token": "0 ",
"type": "word",
"start_offset": 4,
"position": 2
},
{
"end_offset": 9,
"token": " ",
"type": "word",
"start_offset": 7,
"position": 3
},
{
"end_offset": 10,
"token": " d",
"type": "word",
"start_offset": 7,
"position": 4
},
{
"end_offset": 11,
"token": " de",
"type": "word",
"start_offset": 7,
"position": 5
},
{
"end_offset": 12,
"token": " dem",
"type": "word",
"start_offset": 7,
"position": 6
},
{
"end_offset": 13,
"token": " deme",
"type": "word",
"start_offset": 7,
"position": 7
},
{
"end_offset": 14,
"token": " demen",
"type": "word",
"start_offset": 7,
"position": 8
},
{
"end_offset": 15,
"token": " dement",
"type": "word",
"start_offset": 7,
"position": 9
},
{
"end_offset": 16,
"token": " dementi",
"type": "word",
"start_offset": 7,
"position": 10
},
{
"end_offset": 17,
"token": " dementia",
"type": "word",
"start_offset": 7,
"position": 11
},
{
"end_offset": 18,
"token": " dementia ",
"type": "word",
"start_offset": 7,
"position": 12
},
{
"end_offset": 19,
"token": " dementia i",
"type": "word",
"start_offset": 7,
"position": 13
},
{
"end_offset": 20,
"token": " dementia in",
"type": "word",
"start_offset": 7,
"position": 14
},
{
"end_offset": 21,
"token": " dementia in ",
"type": "word",
"start_offset": 7,
"position": 15
},
{
"end_offset": 22,
"token": " dementia in a",
"type": "word",
"start_offset": 7,
"position": 16
},
{
"end_offset": 23,
"token": " dementia in al",
"type": "word",
"start_offset": 7,
"position": 17
},
{
"end_offset": 24,
"token": " dementia in alz",
"type": "word",
"start_offset": 7,
"position": 18
},
{
"end_offset": 25,
"token": " dementia in alzh",
"type": "word",
"start_offset": 7,
"position": 19
},
{
"end_offset": 26,
"token": " dementia in alzhe",
"type": "word",
"start_offset": 7,
"position": 20
},
{
"end_offset": 27,
"token": " dementia in alzhei",
"type": "word",
"start_offset": 7,
"position": 21
},
{
"end_offset": 28,
"token": " dementia in alzheim",
"type": "word",
"start_offset": 7,
"position": 22
},
{
"end_offset": 29,
"token": " dementia in alzheime",
"type": "word",
"start_offset": 7,
"position": 23
},
{
"end_offset": 30,
"token": " dementia in alzheimer",
"type": "word",
"start_offset": 7,
"position": 24
},
{
"end_offset": 33,
"token": "s ",
"type": "word",
"start_offset": 31,
"position": 25
},
{
"end_offset": 34,
"token": "s d",
"type": "word",
"start_offset": 31,
"position": 26
},
{
"end_offset": 35,
"token": "s di",
"type": "word",
"start_offset": 31,
"position": 27
},
{
"end_offset": 36,
"token": "s dis",
"type": "word",
"start_offset": 31,
"position": 28
},
{
"end_offset": 37,
"token": "s dise",
"type": "word",
"start_offset": 31,
"position": 29
},
{
"end_offset": 38,
"token": "s disea",
"type": "word",
"start_offset": 31,
"position": 30
},
{
"end_offset": 39,
"token": "s diseas",
"type": "word",
"start_offset": 31,
"position": 31
},
{
"end_offset": 40,
"token": "s disease",
"type": "word",
"start_offset": 31,
"position": 32
},
{
"end_offset": 41,
"token": "s disease ",
"type": "word",
"start_offset": 31,
"position": 33
},
{
"end_offset": 42,
"token": "s disease w",
"type": "word",
"start_offset": 31,
"position": 34
},
{
"end_offset": 43,
"token": "s disease wi",
"type": "word",
"start_offset": 31,
"position": 35
},
{
"end_offset": 44,
"token": "s disease wit",
"type": "word",
"start_offset": 31,
"position": 36
},
{
"end_offset": 45,
"token": "s disease with",
"type": "word",
"start_offset": 31,
"position": 37
},
{
"end_offset": 46,
"token": "s disease with ",
"type": "word",
"start_offset": 31,
"position": 38
},
{
"end_offset": 47,
"token": "s disease with e",
"type": "word",
"start_offset": 31,
"position": 39
},
{
"end_offset": 48,
"token": "s disease with ea",
"type": "word",
"start_offset": 31,
"position": 40
},
{
"end_offset": 49,
"token": "s disease with ear",
"type": "word",
"start_offset": 31,
"position": 41
},
{
"end_offset": 50,
"token": "s disease with earl",
"type": "word",
"start_offset": 31,
"position": 42
},
{
"end_offset": 51,
"token": "s disease with early",
"type": "word",
"start_offset": 31,
"position": 43
},
{
"end_offset": 52,
"token": "s disease with early ",
"type": "word",
"start_offset": 31,
"position": 44
},
{
"end_offset": 53,
"token": "s disease with early o",
"type": "word",
"start_offset": 31,
"position": 45
},
{
"end_offset": 54,
"token": "s disease with early on",
"type": "word",
"start_offset": 31,
"position": 46
},
{
"end_offset": 55,
"token": "s disease with early ons",
"type": "word",
"start_offset": 31,
"position": 47
},
{
"end_offset": 56,
"token": "s disease with early onse",
"type": "word",
"start_offset": 31,
"position": 48
}
]
}
As you can see, the whole string is tokenized with token sizes from 2 to 25 characters. The string is tokenized in a linear way, together with all the spaces, and the position is incremented by one for every new token.
There are several problems with it:
The edge_ngram_analyzer produced useless tokens which will never be searched for, for example: "0 ", " ", " d", "s d", "s disease w", etc.
Also, it didn't produce many of the useful tokens that could be used, for example: "disease", "early onset", etc. There will be 0 results if you try to search for any of these words.
Notice that the last token is "s disease with early onse". Where is the final "t"? Because of "max_gram": "25" we "lost" some text in all fields. You can't search for this text anymore because there are no tokens for it.
The trim filter only obfuscates the problem by filtering extra spaces, when that could be handled by the tokenizer.
The edge_ngram_analyzer increments the position of each token, which is problematic for positional queries such as phrase queries. One should use the edge_ngram filter instead, which preserves the position of the token when generating the ngrams.
The optimal solution:
The mapping and settings to use:
...
"mappings": {
  "Type": {
    "_all": {
      "analyzer": "edge_ngram_analyzer",
      "search_analyzer": "keyword_analyzer"
    },
    "properties": {
      "Field": {
        "search_analyzer": "keyword_analyzer",
        "type": "string",
        "analyzer": "edge_ngram_analyzer"
      },
...
...
"settings": {
  "analysis": {
    "filter": {
      "english_poss_stemmer": {
        "type": "stemmer",
        "name": "possessive_english"
      },
      "edge_ngram": {
        "type": "edgeNGram",
        "min_gram": "2",
        "max_gram": "25",
        "token_chars": ["letter", "digit"]
      }
    },
    "analyzer": {
      "edge_ngram_analyzer": {
        "filter": ["lowercase", "english_poss_stemmer", "edge_ngram"],
        "tokenizer": "standard"
      },
      "keyword_analyzer": {
        "filter": ["lowercase", "english_poss_stemmer"],
        "tokenizer": "standard"
      }
    }
  }
}
...
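For reference, the token stream below can be reproduced with an _analyze request roughly like this (the index name is illustrative):
GET my-index/_analyze
{
  "analyzer": "edge_ngram_analyzer",
  "text": "F00.0 - Dementia in Alzheimer's disease with early onset"
}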
Look at the analysis:
{
"tokens": [
{
"end_offset": 5,
"token": "f0",
"type": "word",
"start_offset": 0,
"position": 0
},
{
"end_offset": 5,
"token": "f00",
"type": "word",
"start_offset": 0,
"position": 0
},
{
"end_offset": 5,
"token": "f00.",
"type": "word",
"start_offset": 0,
"position": 0
},
{
"end_offset": 5,
"token": "f00.0",
"type": "word",
"start_offset": 0,
"position": 0
},
{
"end_offset": 17,
"token": "de",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "dem",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "deme",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "demen",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "dement",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "dementi",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "dementia",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 20,
"token": "in",
"type": "word",
"start_offset": 18,
"position": 3
},
{
"end_offset": 32,
"token": "al",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alz",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzh",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzhe",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzhei",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzheim",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzheime",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzheimer",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 40,
"token": "di",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 40,
"token": "dis",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 40,
"token": "dise",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 40,
"token": "disea",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 40,
"token": "diseas",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 40,
"token": "disease",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 45,
"token": "wi",
"type": "word",
"start_offset": 41,
"position": 6
},
{
"end_offset": 45,
"token": "wit",
"type": "word",
"start_offset": 41,
"position": 6
},
{
"end_offset": 45,
"token": "with",
"type": "word",
"start_offset": 41,
"position": 6
},
{
"end_offset": 51,
"token": "ea",
"type": "word",
"start_offset": 46,
"position": 7
},
{
"end_offset": 51,
"token": "ear",
"type": "word",
"start_offset": 46,
"position": 7
},
{
"end_offset": 51,
"token": "earl",
"type": "word",
"start_offset": 46,
"position": 7
},
{
"end_offset": 51,
"token": "early",
"type": "word",
"start_offset": 46,
"position": 7
},
{
"end_offset": 57,
"token": "on",
"type": "word",
"start_offset": 52,
"position": 8
},
{
"end_offset": 57,
"token": "ons",
"type": "word",
"start_offset": 52,
"position": 8
},
{
"end_offset": 57,
"token": "onse",
"type": "word",
"start_offset": 52,
"position": 8
},
{
"end_offset": 57,
"token": "onset",
"type": "word",
"start_offset": 52,
"position": 8
}
]
}
At index time the text is tokenized by the standard tokenizer, then the separate words are filtered by the lowercase, possessive_english and edge_ngram filters. Tokens are produced only for whole words.
At search time the text is tokenized by the standard tokenizer, then the separate words are filtered by lowercase and possessive_english. The searched words are matched against the tokens that were created at index time.
Thus we make incremental search possible!
Now, because we build ngrams on separate words, we can even execute queries like
{
  "query": {
    "multi_match": {
      "query": "dem in alzh",
      "type": "phrase",
      "fields": ["_all"]
    }
  }
}
and get correct results.
No text is "lost", everything is searchable, and there is no longer any need to handle spaces with the trim filter.
I believe your query is wrong: while you need nGrams at indexing time, you don't need them at search time. At search time you need the text to be as "fixed" as possible.
Try this query instead:
{
"query": {
"multi_match": {
"query": " dementia in alz",
"analyzer": "keyword",
"fields": [
"_all"
]
}
}
}
Notice the two whitespace characters before "dementia". Those come from your analyzer's treatment of the text. To get rid of them you need the trim token filter:
"edge_ngram_analyzer": {
"filter": [
"lowercase","trim"
],
"tokenizer": "edge_ngram_tokenizer"
}
And then this query will work (no whitespaces before dementia):
{
"query": {
"multi_match": {
"query": "dementia in alz",
"analyzer": "keyword",
"fields": [
"_all"
]
}
}
}

What Elasticsearch analyzer should I use to search hybrid english words, product information?

My team is attempting to index our item information and need a sanity check on what I have created so far. Here is an example of some of the text we need to search on:
AA4VG90EP4DM1/32R-NSF52F001DX-S WITH DAMAN MANIFOLD 0281CS0011 SIEMENS SPEC 74/07104909/10 REV L WOMACK SYSTEMS
As you can see, there is a mix of English words and random numbers and letters. After doing some research online I decided to go with a word delimiter filter and a whitespace tokenizer. Here is the analyzer I am currently using:
{
   "itemindex": {
      "settings": {
         "index": {
            "uuid": "1HxasKSCSW2iRHf6pYfkWw",
            "analysis": {
               "analyzer": {
                  "my_analyzer": {
                     "type": "custom",
                     "filter": [
                        "lowercase",
                        "my_word_delimiter"
                     ],
                     "tokenizer": "whitespace"
                  }
               },
               "filter": {
                  "my_word_delimiter": {
                     "type_table": "/ => ALPHANUM",
                     "preserve_original": "true",
                     "catenate_words": "true",
                     "type": "word_delimiter"
                  }
               }
            },
            "number_of_replicas": "1",
            "number_of_shards": "5",
            "version": {
               "created": "1000099"
            }
         }
      }
   }
}
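The token output shown next can be reproduced with an analyze call roughly like this (on this 1.x index the analyzer goes in the query string, as in the 1.7 question earlier on this page):
GET itemindex/_analyze?analyzer=my_analyzer
{
  "text": "AA4VG90EP4DM1/32R-NSF52F001DX-S WITH DAMAN MANIFOLD 0281CS0011 SIEMENS SPEC 74/07104909/10 REV L WOMACK SYSTEMS"
}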
Here is the output from the _analyze API for the above description:
{
"tokens": [
{
"token": "aa4vg90ep4dm1/32r-nsf52f001dx-s",
"start_offset": 0,
"end_offset": 31,
"type": "word",
"position": 1
},
{
"token": "aa",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 1
},
{
"token": "4",
"start_offset": 2,
"end_offset": 3,
"type": "word",
"position": 2
},
{
"token": "vg",
"start_offset": 3,
"end_offset": 5,
"type": "word",
"position": 3
},
{
"token": "90",
"start_offset": 5,
"end_offset": 7,
"type": "word",
"position": 4
},
{
"token": "ep",
"start_offset": 7,
"end_offset": 9,
"type": "word",
"position": 5
},
{
"token": "4",
"start_offset": 9,
"end_offset": 10,
"type": "word",
"position": 6
},
{
"token": "dm",
"start_offset": 10,
"end_offset": 12,
"type": "word",
"position": 7
},
{
"token": "1/32",
"start_offset": 12,
"end_offset": 16,
"type": "word",
"position": 8
},
{
"token": "r",
"start_offset": 16,
"end_offset": 17,
"type": "word",
"position": 9
},
{
"token": "nsf",
"start_offset": 18,
"end_offset": 21,
"type": "word",
"position": 10
},
{
"token": "rnsf",
"start_offset": 16,
"end_offset": 21,
"type": "word",
"position": 10
},
{
"token": "52",
"start_offset": 21,
"end_offset": 23,
"type": "word",
"position": 11
},
{
"token": "f",
"start_offset": 23,
"end_offset": 24,
"type": "word",
"position": 12
},
{
"token": "001",
"start_offset": 24,
"end_offset": 27,
"type": "word",
"position": 13
},
{
"token": "dx",
"start_offset": 27,
"end_offset": 29,
"type": "word",
"position": 14
},
{
"token": "s",
"start_offset": 30,
"end_offset": 31,
"type": "word",
"position": 15
},
{
"token": "dxs",
"start_offset": 27,
"end_offset": 31,
"type": "word",
"position": 15
},
{
"token": "with",
"start_offset": 32,
"end_offset": 36,
"type": "word",
"position": 16
},
{
"token": "daman",
"start_offset": 37,
"end_offset": 42,
"type": "word",
"position": 17
},
{
"token": "manifold",
"start_offset": 43,
"end_offset": 51,
"type": "word",
"position": 18
},
{
"token": "0281cs0011",
"start_offset": 52,
"end_offset": 62,
"type": "word",
"position": 19
},
{
"token": "0281",
"start_offset": 52,
"end_offset": 56,
"type": "word",
"position": 19
},
{
"token": "cs",
"start_offset": 56,
"end_offset": 58,
"type": "word",
"position": 20
},
{
"token": "0011",
"start_offset": 58,
"end_offset": 62,
"type": "word",
"position": 21
},
{
"token": "siemens",
"start_offset": 63,
"end_offset": 70,
"type": "word",
"position": 22
},
{
"token": "spec",
"start_offset": 71,
"end_offset": 75,
"type": "word",
"position": 23
},
{
"token": "74/07104909/10",
"start_offset": 76,
"end_offset": 90,
"type": "word",
"position": 24
},
{
"token": "rev",
"start_offset": 91,
"end_offset": 94,
"type": "word",
"position": 25
},
{
"token": "l",
"start_offset": 95,
"end_offset": 96,
"type": "word",
"position": 26
},
{
"token": "womack",
"start_offset": 98,
"end_offset": 104,
"type": "word",
"position": 27
},
{
"token": "systems",
"start_offset": 105,
"end_offset": 112,
"type": "word",
"position": 28
}
]
}
Finally, here is the NEST query I am using:
var results = client.Search<SearchResult>(s => s.Index("itemindex")
.Query(q => q
.QueryString(qs=> qs
.OnFields(f=> f.Description, f=> f.VendorPartNumber, f=> f.ItemNumber)
.Operator(Operator.or)
.Query(query + "*")))
.SortDescending("_score")
.Highlight(h => h
.OnFields(f => f
.OnField(e => e.Description)
.BoundaryCharacters(" ,")
.PreTags("<b>")
.PostTags("</b>")))
.From(start)
.Size(size));
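For anyone not using NEST, this roughly corresponds to the following raw search request (the example term, the field names, and the trimmed-down highlight options are illustrative and depend on how NEST serializes your POCO properties):
POST itemindex/_search
{
  "query": {
    "query_string": {
      "fields": ["description", "vendorPartNumber", "itemNumber"],
      "default_operator": "OR",
      "query": "0281CS0011*"
    }
  },
  "highlight": {
    "pre_tags": ["<b>"],
    "post_tags": ["</b>"],
    "fields": {
      "description": {}
    }
  }
}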
