How to build a backward edge n-gram tokenizer - elasticsearch

I only see the n-gram and edge n-gram tokenizers, and both start from the first letter.
I would like to create a tokenizer that can produce the following tokens.
For example:
600140 -> 0, 40, 140, 0140, 00140, 600140

You can leverage the reverse token filter twice coupled with the edge_ngram one:
PUT reverse
{
"settings": {
"analysis": {
"analyzer": {
"reverse_edgengram": {
"tokenizer": "keyword",
"filter": [
"reverse",
"edge",
"reverse"
]
}
},
"filter": {
"edge": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 25
}
}
}
},
"mappings": {
"properties": {
"string_field": {
"type": "text",
"analyzer": "reverse_edgengram"
}
}
}
}
Then you can test it:
POST reverse/_analyze
{
"analyzer": "reverse_edgengram",
"text": "600140"
}
Which yields this:
{
"tokens" : [
{
"token" : "40",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 0
},
{
"token" : "140",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 0
},
{
"token" : "0140",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 0
},
{
"token" : "00140",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 0
},
{
"token" : "600140",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 0
}
]
}
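Note that with "min_gram": 2 the single-character suffix 0 from the example above is not produced. If you also need it, you can lower min_gram to 1 in the edge filter (only the filter is shown here; everything else stays the same):
"filter": {
  "edge": {
    "type": "edge_ngram",
    "min_gram": 1,
    "max_gram": 25
  }
}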

Related

How to find out capital word with ngram tokenizer in Elasticsearch 7

I need to find the accounts temp123 and TEMP456 by searching for the word temp or TEMP.
Here is my index with an ngram tokenizer and some sample docs:
# index
PUT /demo
{
"settings": {
"index": {
"max_ngram_diff": "20",
"analysis": {
"analyzer": {
"account_analyzer": {
"tokenizer": "account_tokenizer"
}
},
"tokenizer": {
"account_tokenizer": {
"token_chars": [
"letter",
"digit"
],
"min_gram": "1",
"type": "ngram",
"max_gram": "15"
}
}
}
}
},
"mappings": {
"properties": {
"account": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "account_analyzer",
"search_analyzer": "standard"
}
}
}
}
# docs
PUT /demo/_doc/1
{
"account": "temp123"
}
PUT /demo/_doc/2
{
"account": "TEMP456"
}
With the following queries, I expect to get both docs back, but I only get doc 1.
It seems I cannot match the doc containing the uppercase word.
What should I do to get both docs back when searching for temp or TEMP?
POST /demo/_search/
{
"query": {
"bool": {
"must": [
{
"match": {
"account": {
"query": "temp",
"fuzziness": "AUTO"
}
}
}
]
}
}
}
POST /demo/_search/
{
"query": {
"bool": {
"must": [
{
"match": {
"account": {
"query": "TEMP",
"fuzziness": "AUTO"
}
}
}
]
}
}
}
You can use _analyze to check the tokens that your analyzer is generating.
GET demo/_analyze
{
"analyzer": "account_analyzer",
"text": ["TEMP123"]
}
"tokens" : [
{
"token" : "T",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "TE",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 1
},
{
"token" : "TEM",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 2
},
{
"token" : "TEMP",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 3
},
{
"token" : "TEMP1",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 4
},
{
"token" : "TEMP12",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 5
},
{
"token" : "TEMP123",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 6
},
{
"token" : "E",
"start_offset" : 1,
"end_offset" : 2,
"type" : "word",
"position" : 7
},
{
"token" : "EM",
"start_offset" : 1,
"end_offset" : 3,
"type" : "word",
"position" : 8
},
{
"token" : "EMP",
"start_offset" : 1,
"end_offset" : 4,
"type" : "word",
"position" : 9
},
{
"token" : "EMP1",
"start_offset" : 1,
"end_offset" : 5,
"type" : "word",
"position" : 10
},
{
"token" : "EMP12",
"start_offset" : 1,
"end_offset" : 6,
"type" : "word",
"position" : 11
},
{
"token" : "EMP123",
"start_offset" : 1,
"end_offset" : 7,
"type" : "word",
"position" : 12
},
{
"token" : "M",
"start_offset" : 2,
"end_offset" : 3,
"type" : "word",
"position" : 13
},
{
"token" : "MP",
"start_offset" : 2,
"end_offset" : 4,
"type" : "word",
"position" : 14
},
{
"token" : "MP1",
"start_offset" : 2,
"end_offset" : 5,
"type" : "word",
"position" : 15
},
{
"token" : "MP12",
"start_offset" : 2,
"end_offset" : 6,
"type" : "word",
"position" : 16
},
{
"token" : "MP123",
"start_offset" : 2,
"end_offset" : 7,
"type" : "word",
"position" : 17
},
{
"token" : "P",
"start_offset" : 3,
"end_offset" : 4,
"type" : "word",
"position" : 18
},
{
"token" : "P1",
"start_offset" : 3,
"end_offset" : 5,
"type" : "word",
"position" : 19
},
{
"token" : "P12",
"start_offset" : 3,
"end_offset" : 6,
"type" : "word",
"position" : 20
},
{
"token" : "P123",
"start_offset" : 3,
"end_offset" : 7,
"type" : "word",
"position" : 21
},
{
"token" : "1",
"start_offset" : 4,
"end_offset" : 5,
"type" : "word",
"position" : 22
},
{
"token" : "12",
"start_offset" : 4,
"end_offset" : 6,
"type" : "word",
"position" : 23
},
{
"token" : "123",
"start_offset" : 4,
"end_offset" : 7,
"type" : "word",
"position" : 24
},
{
"token" : "2",
"start_offset" : 5,
"end_offset" : 6,
"type" : "word",
"position" : 25
},
{
"token" : "23",
"start_offset" : 5,
"end_offset" : 7,
"type" : "word",
"position" : 26
},
{
"token" : "3",
"start_offset" : 6,
"end_offset" : 7,
"type" : "word",
"position" : 27
}
]
}
You need to add a lowercase filter to your analyzer so that all generated tokens are lowercased:
{
"settings": {
"index": {
"max_ngram_diff": "20",
"analysis": {
"analyzer": {
"account_analyzer": {
"tokenizer": "account_tokenizer",
"filter": [ ----> note
"lowercase"
]
}
},
"tokenizer": {
"account_tokenizer": {
"token_chars": [
"letter",
"digit"
],
"min_gram": "1",
"type": "ngram",
"max_gram": "15"
}
}
}
}
},
"mappings": {
"properties": {
"account": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "account_analyzer",
"search_analyzer": "standard"
}
}
}
}
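After recreating the index with this analyzer and re-adding the two sample docs, both documents should come back regardless of casing, since the standard search analyzer lowercases the query just like the new lowercase filter does at index time. A quick check against the index and field from the question:
POST /demo/_search
{
  "query": {
    "match": {
      "account": "TEMP"
    }
  }
}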

Elasticsearch: custom tokenizer split by words and dots

I'm trying to create a tokenizer that will work this way:
POST dev_threats/_analyze
{
"tokenizer": "my_tokenizer",
"text": "some.test.domain.com"
}
and get tokens like:
[some, some.test, some.test.domain, some.test.domain.com, test, test.domain, test.domain.com, domain, domain.com]
I tried the ngram tokenizer:
"ngram_domain_tokenizer": {
"type": "ngram",
"min_gram": 1,
"max_gram": 63,
"token_chars": [
"letter",
"digit",
"punctuation"
]
},
But for long values, it generates too many tokens...
Any idea how to get such a result?
You don't need two different analyzers for this; there's another solution using shingles, and it goes this way.
First, you need to create an index with the proper analyzer, which I called domain_shingler:
PUT dev_threats
{
"settings": {
"analysis": {
"analyzer": {
"domain_shingler": {
"type": "custom",
"tokenizer": "dot_tokenizer",
"filter": [
"shingles",
"joiner"
]
}
},
"tokenizer": {
"dot_tokenizer": {
"type": "char_group",
"tokenize_on_chars": [
"punctuation"
]
}
},
"filter": {
"shingles": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 4,
"output_unigrams": true
},
"joiner": {
"type": "pattern_replace",
"pattern": """\s""",
"replacement": "."
}
}
}
},
"mappings": {
"properties": {
"domain": {
"type": "text",
"analyzer": "domain_shingler",
"search_analyzer": "standard"
}
}
}
}
If you try to analyze some.test.domain.com with that analyzer, you'll get the following tokens:
POST dev_threats/_analyze
{
"analyzer": "domain_shingler",
"text": "some.test.domain.com"
}
Results:
{
"tokens" : [
{
"token" : "some",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 0
},
{
"token" : "some.test",
"start_offset" : 0,
"end_offset" : 9,
"type" : "shingle",
"position" : 0,
"positionLength" : 2
},
{
"token" : "some.test.domain",
"start_offset" : 0,
"end_offset" : 16,
"type" : "shingle",
"position" : 0,
"positionLength" : 3
},
{
"token" : "some.test.domain.com",
"start_offset" : 0,
"end_offset" : 20,
"type" : "shingle",
"position" : 0,
"positionLength" : 4
},
{
"token" : "test",
"start_offset" : 5,
"end_offset" : 9,
"type" : "word",
"position" : 1
},
{
"token" : "test.domain",
"start_offset" : 5,
"end_offset" : 16,
"type" : "shingle",
"position" : 1,
"positionLength" : 2
},
{
"token" : "test.domain.com",
"start_offset" : 5,
"end_offset" : 20,
"type" : "shingle",
"position" : 1,
"positionLength" : 3
},
{
"token" : "domain",
"start_offset" : 10,
"end_offset" : 16,
"type" : "word",
"position" : 2
},
{
"token" : "domain.com",
"start_offset" : 10,
"end_offset" : 20,
"type" : "shingle",
"position" : 2,
"positionLength" : 2
},
{
"token" : "com",
"start_offset" : 17,
"end_offset" : 20,
"type" : "word",
"position" : 3
}
]
}
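Since the field's search_analyzer is standard, a dotted query string such as test.domain.com is kept as a single token at search time and matches the corresponding shingle. A small end-to-end check (the document id is arbitrary):
PUT dev_threats/_doc/1
{
  "domain": "some.test.domain.com"
}
POST dev_threats/_search
{
  "query": {
    "match": {
      "domain": "test.domain.com"
    }
  }
}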
You can use the path_hierarchy tokenizer:
PUT my-index
{
"settings": {
"analysis": {
"analyzer": {
"custom_path_tree": {
"tokenizer": "custom_hierarchy"
},
"custom_path_tree_reversed": {
"tokenizer": "custom_hierarchy_reversed"
}
},
"tokenizer": {
"custom_hierarchy": {
"type": "path_hierarchy",
"delimiter": "."
},
"custom_hierarchy_reversed": {
"type": "path_hierarchy",
"delimiter": ".",
"reverse": "true"
}
}
}
}
}
POST my-index/_analyze
{
"analyzer": "custom_path_tree",
"text": "some.test.domain.com"
}
POST my-index/_analyze
{
"analyzer": "custom_path_tree_reversed",
"text": "some.test.domain.com"
}
Results:
{
"tokens" : [
{
"token" : "some",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 0
},
{
"token" : "some.test",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 0
},
{
"token" : "some.test.domain",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 0
},
{
"token" : "some.test.domain.com",
"start_offset" : 0,
"end_offset" : 20,
"type" : "word",
"position" : 0
}
]
}
{
"tokens" : [
{
"token" : "some.test.domain.com",
"start_offset" : 0,
"end_offset" : 20,
"type" : "word",
"position" : 0
},
{
"token" : "test.domain.com",
"start_offset" : 5,
"end_offset" : 20,
"type" : "word",
"position" : 0
},
{
"token" : "domain.com",
"start_offset" : 10,
"end_offset" : 20,
"type" : "word",
"position" : 0
},
{
"token" : "com",
"start_offset" : 17,
"end_offset" : 20,
"type" : "word",
"position" : 0
}
]
}
It will create path-like tokens by splitting on the given delimiter. Using the normal and reversed analyzers, you can get tokens in both directions.
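Note that these settings only define the analyzers; to search with them you still have to attach them to a field in the mappings. A minimal sketch (ES 7.x syntax; the field name domain is borrowed from the question), indexing the same value both forward and reversed through a multi-field:
PUT my-index/_mapping
{
  "properties": {
    "domain": {
      "type": "text",
      "analyzer": "custom_path_tree",
      "fields": {
        "reversed": {
          "type": "text",
          "analyzer": "custom_path_tree_reversed"
        }
      }
    }
  }
}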

Elasticsearch 6.8 match_phrase search N-gram tokenizer works not well

I use the Elasticsearch N-gram tokenizer and match_phrase for fuzzy matching.
My index and test data are as below:
DELETE /m8
PUT m8
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 1,
"max_gram": 3,
"custom_token_chars":"_."
}
}
},
"max_ngram_diff": 10
},
"mappings": {
"table": {
"properties": {
"dataSourceId": {
"type": "long"
},
"dataSourceType": {
"type": "integer"
},
"dbName": {
"type": "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
PUT /m8/table/1
{
"dataSourceId":1,
"dataSourceType":2,
"dbName":"rm.rf"
}
PUT /m8/table/2
{
"dataSourceId":1,
"dataSourceType":2,
"dbName":"rm_rf"
}
PUT /m8/table/3
{
"dataSourceId":1,
"dataSourceType":2,
"dbName":"rmrf"
}
Check _analyze:
POST m8/_analyze
{
"tokenizer": "my_tokenizer",
"text": "rm.rf"
}
_analyze result:
{
"tokens" : [
{
"token" : "r",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "rm",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 1
},
{
"token" : "rm.",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 2
},
{
"token" : "m",
"start_offset" : 1,
"end_offset" : 2,
"type" : "word",
"position" : 3
},
{
"token" : "m.",
"start_offset" : 1,
"end_offset" : 3,
"type" : "word",
"position" : 4
},
{
"token" : "m.r",
"start_offset" : 1,
"end_offset" : 4,
"type" : "word",
"position" : 5
},
{
"token" : ".",
"start_offset" : 2,
"end_offset" : 3,
"type" : "word",
"position" : 6
},
{
"token" : ".r",
"start_offset" : 2,
"end_offset" : 4,
"type" : "word",
"position" : 7
},
{
"token" : ".rf",
"start_offset" : 2,
"end_offset" : 5,
"type" : "word",
"position" : 8
},
{
"token" : "r",
"start_offset" : 3,
"end_offset" : 4,
"type" : "word",
"position" : 9
},
{
"token" : "rf",
"start_offset" : 3,
"end_offset" : 5,
"type" : "word",
"position" : 10
},
{
"token" : "f",
"start_offset" : 4,
"end_offset" : 5,
"type" : "word",
"position" : 11
}
]
}
When I search for 'rm', nothing is found:
GET /m8/table/_search
{
"query": {
"bool": {
"must": [
{
"match_phrase": {
"dbName": "rm"
}
}
]
}
}
}
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}
But '.rf' can be found:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.7260926,
"hits" : [
{
"_index" : "m8",
"_type" : "table",
"_id" : "1",
"_score" : 1.7260926,
"_source" : {
"dataSourceId" : 1,
"dataSourceType" : 2,
"dbName" : "rm.rf"
}
}
]
}
}
My question:
Why can't 'rm' be found even though _analyze has split out these tokens?
my_analyzer will be used during search time as well.
"mapping":{
"dbName": {
"type": "text",
"analyzer": "my_analyzer"
"search_analyzer":"my_analyzer" // <==== If you don't provide a search analyzer then what you defined in analyzer will be used during search time as well.
The match_phrase query matches phrases taking the positions of the analyzed tokens into account; e.g. searching for "Kal ho" will match a document that has "Kal" at position X and "ho" at position X+1 in the analyzed text.
When you search for 'rm' (the first query), the query text also gets analyzed with my_analyzer, which converts it into n-grams, and the phrase search then runs on top of those positions. Hence the unexpected result.
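You can see this by analyzing the query string itself with the same tokenizer:
POST m8/_analyze
{
  "tokenizer": "my_tokenizer",
  "text": "rm"
}
This yields r, rm and m at positions 0, 1 and 2, and match_phrase then requires exactly that token sequence at consecutive positions in the indexed field, which never occurs in the n-grammed documents.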
Solution:
Use the standard analyzer with a simple match query:
GET /m8/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"dbName": {
"query": "rm",
"analyzer": "standard" // <=========
}
}
}
]
}
}
}
Or define it in the mapping and use a match query (not match_phrase):
"mapping":{
"dbName": {
"type": "text",
"analyzer": "my_analyzer"
"search_analyzer":"standard" //<==========
Follow-up question: why do you want to use a match_phrase query with an n-gram tokenizer?

Elasticsearch - at&t and procter&gamble cases

By default, Elasticsearch with the English analyzer breaks at&t into the tokens at and t, and then removes at as a stopword.
POST _analyze
{
"analyzer": "english",
"text": "A word AT&T Procter&Gamble"
}
As a result, the tokens look like:
{
"tokens" : [
{
"token" : "word",
"start_offset" : 2,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "t",
"start_offset" : 10,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "procter",
"start_offset" : 12,
"end_offset" : 19,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "gambl",
"start_offset" : 20,
"end_offset" : 26,
"type" : "<ALPHANUM>",
"position" : 5
}
]
}
I want to be able to match at&t exactly, and at the same time to be able to search for procter&gamble exactly as well as for e.g. only procter.
So I want to build an analyzer that creates the tokens
at&t and t for the at&t string
and
procter, gambl, procter&gamble for procter&gamble.
Is there a way to create such an analyzer? Or should I create 2 index fields: one with the regular English analyzer and the other one with English analysis minus the tokenization on & ?
Mappings: you can tokenize on whitespace and use a word delimiter filter to create tokens for at&t:
{
"settings": {
"analysis": {
"analyzer": {
"whitespace_with_acronymns": {
"tokenizer": "whitespace",
"filter": [
"lowercase",
"acronymns"
]
}
},
"filter": {
"acronymns": {
"type": "word_delimiter_graph",
"catenate_all": true
}
}
}
}
}
Tokens:
{
"analyzer": "whitespace_with_acronymns",
"text": "A word AT&T Procter&Gamble"
}
Result: at&t is tokenized as at, t, and att, so you can search for it with at, t, or at&t.
{
"tokens" : [
{
"token" : "a",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "word",
"start_offset" : 2,
"end_offset" : 6,
"type" : "word",
"position" : 1
},
{
"token" : "at",
"start_offset" : 7,
"end_offset" : 9,
"type" : "word",
"position" : 2
},
{
"token" : "att",
"start_offset" : 7,
"end_offset" : 11,
"type" : "word",
"position" : 2
},
{
"token" : "t",
"start_offset" : 10,
"end_offset" : 11,
"type" : "word",
"position" : 3
},
{
"token" : "procter",
"start_offset" : 12,
"end_offset" : 19,
"type" : "word",
"position" : 4
},
{
"token" : "proctergamble",
"start_offset" : 12,
"end_offset" : 26,
"type" : "word",
"position" : 4
},
{
"token" : "gamble",
"start_offset" : 20,
"end_offset" : 26,
"type" : "word",
"position" : 5
}
]
}
If you want to remove the stopword "at", you can add a stop filter:
{
"settings": {
"analysis": {
"analyzer": {
"whitespace_with_acronymns": {
"tokenizer": "whitespace",
"filter": [
"lowercase",
"acronymns",
"english_possessive_stemmer",
"english_stop",
"english_keywords",
"english_stemmer"
]
}
},
"filter": {
"acronymns": {
"type": "word_delimiter_graph",
"catenate_all": true
},
"english_stop": {
"type": "stop",
"stopwords": "_english_"
},
"english_keywords": {
"type": "keyword_marker",
"keywords": ["example"]
},
"english_stemmer": {
"type": "stemmer",
"language": "english"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
}
}
}
}
}
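In both variants the analyzer only takes effect once it is attached to a field in the mappings; a minimal sketch (the field name company is just a placeholder, not from the question):
"mappings": {
  "properties": {
    "company": {
      "type": "text",
      "analyzer": "whitespace_with_acronymns"
    }
  }
}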

Split text containing <number><unit> into 3 tokens

We index a lot of documents that may contain titles like "lightbulb 220V", "Box 23cm" or "Varta Super-charge battery 74Ah".
However, our users, when searching, tend to separate the number and the unit with whitespace, so when they search for "Varta 74 Ah" they do not get what they expect.
The above is a simplification of the problem, but the main question is hopefully valid: how can I analyze "Varta Super-charge battery 74Ah" so that (on top of other tokens) 74, Ah and 74Ah are created?
Thanks,
Michal
I guess this will help you:
PUT index_name
{
"settings": {
"analysis": {
"filter": {
"custom_filter": {
"type": "word_delimiter",
"split_on_numerics": true
}
},
"analyzer": {
"custom_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": ["custom_filter"]
}
}
}
}
}
You can use the split_on_numerics property in your custom filter. This will give you the response below.
Request:
POST /index_name/_analyze
{
"analyzer": "custom_analyzer",
"text": "Varta Super-charge battery 74Ah"
}
Response
{
"tokens" : [
{
"token" : "Varta",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "Super",
"start_offset" : 6,
"end_offset" : 11,
"type" : "word",
"position" : 1
},
{
"token" : "charge",
"start_offset" : 12,
"end_offset" : 18,
"type" : "word",
"position" : 2
},
{
"token" : "battery",
"start_offset" : 19,
"end_offset" : 26,
"type" : "word",
"position" : 3
},
{
"token" : "74",
"start_offset" : 27,
"end_offset" : 29,
"type" : "word",
"position" : 4
},
{
"token" : "Ah",
"start_offset" : 29,
"end_offset" : 31,
"type" : "word",
"position" : 5
}
]
}
You would need to create a custom analyzer that implements the ngram tokenizer and then apply it to the text field you create.
Below are the sample mapping, document, query and response.
Mapping:
PUT my_split_index
{
"settings": {
"index":{
"max_ngram_diff": 3
},
"analysis": {
"analyzer": {
"my_analyzer": { <---- Custom Analyzer
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 2,
"max_gram": 5,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"properties": {
"product":{
"type": "text",
"analyzer": "my_analyzer", <--- Note this as how custom analyzer is applied on this field
"fields": {
"keyword":{
"type": "keyword"
}
}
}
}
}
}
The feature that you are looking for is called ngram, which creates multiple tokens from a single token. The sizes of the tokens depend on the min_gram and max_gram settings mentioned above.
Note that I've set max_ngram_diff to 3; that is because in version 7.x, ES's default value is 1. Looking at your use case, I've set it to 3. This value is simply max_gram - min_gram.
Sample Documents:
POST my_split_index/_doc/1
{
"product": "Varta 74 Ah"
}
POST my_split_index/_doc/2
{
"product": "lightbulb 220V"
}
Query Request:
POST my_split_index/_search
{
"query": {
"match": {
"product": "74Ah"
}
}
}
Response:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.7029606,
"hits" : [
{
"_index" : "my_split_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.7029606,
"_source" : {
"product" : "Varta 74 Ah"
}
}
]
}
}
Additional Info:
To understand what tokens are actually generated, you can make use of the Analyze API below:
POST my_split_index/_analyze
{
"analyzer": "my_analyzer",
"text": "Varta 74 Ah"
}
You can see that the tokens below get generated when I execute the above API:
{
"tokens" : [
{
"token" : "Va",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
},
{
"token" : "Var",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 1
},
{
"token" : "Vart",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 2
},
{
"token" : "Varta",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 3
},
{
"token" : "ar",
"start_offset" : 1,
"end_offset" : 3,
"type" : "word",
"position" : 4
},
{
"token" : "art",
"start_offset" : 1,
"end_offset" : 4,
"type" : "word",
"position" : 5
},
{
"token" : "arta",
"start_offset" : 1,
"end_offset" : 5,
"type" : "word",
"position" : 6
},
{
"token" : "rt",
"start_offset" : 2,
"end_offset" : 4,
"type" : "word",
"position" : 7
},
{
"token" : "rta",
"start_offset" : 2,
"end_offset" : 5,
"type" : "word",
"position" : 8
},
{
"token" : "ta",
"start_offset" : 3,
"end_offset" : 5,
"type" : "word",
"position" : 9
},
{
"token" : "74",
"start_offset" : 6,
"end_offset" : 8,
"type" : "word",
"position" : 10
},
{
"token" : "Ah",
"start_offset" : 9,
"end_offset" : 11,
"type" : "word",
"position" : 11
}
]
}
Notice that the query mentioned in the Query Request section is 74Ah, yet it still returns the document. That is because ES applies the analyzer both at index time and at search time. By default, if you do not specify a search_analyzer, the same analyzer you applied at indexing time also gets applied at query time.
Hope this helps!
You can define your index mapping as below and see that it generates the tokens you mentioned in your question. It also doesn't create a lot of tokens, hence the size of your index will be smaller.
Index mapping
{
"settings": {
"analysis": {
"filter": {
"my_filter": {
"type": "word_delimiter",
"split_on_numerics": "true",
"catenate_words": "true",
"preserve_original": "true"
}
},
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"my_filter",
"lowercase"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
And check the tokens generated using the _analyze API:
{
"text": "Varta Super-charge battery 74Ah",
"analyzer" : "my_analyzer"
}
Tokens generated
{
"tokens": [
{
"token": "varta",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "super-charge",
"start_offset": 6,
"end_offset": 18,
"type": "word",
"position": 1
},
{
"token": "super",
"start_offset": 6,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "supercharge",
"start_offset": 6,
"end_offset": 18,
"type": "word",
"position": 1
},
{
"token": "charge",
"start_offset": 12,
"end_offset": 18,
"type": "word",
"position": 2
},
{
"token": "battery",
"start_offset": 19,
"end_offset": 26,
"type": "word",
"position": 3
},
{
"token": "74ah",
"start_offset": 27,
"end_offset": 31,
"type": "word",
"position": 4
},
{
"token": "74",
"start_offset": 27,
"end_offset": 29,
"type": "word",
"position": 4
},
{
"token": "ah",
"start_offset": 29,
"end_offset": 31,
"type": "word",
"position": 5
}
]
}
Edit: the tokens generated by the two approaches might look the same at first glance, but I made sure this satisfies all the requirements given in the question, and on close inspection the generated tokens are quite different; the details are below.
The tokens generated here are all lowercase, to provide the case-insensitive search functionality that is implicit in all search engines.
The critical thing to note is the tokens 74ah and supercharge; these are mentioned in the question, and my analyzer provides them as well.
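As a quick check of the original use case (the index name my_split_test and the document are placeholders, assuming the mapping above was created under that name), a title containing 74Ah is now found when the user separates the number and the unit:
PUT my_split_test/_doc/1
{
  "title": "Varta Super-charge battery 74Ah"
}
POST my_split_test/_search
{
  "query": {
    "match": {
      "title": "Varta 74 Ah"
    }
  }
}
The query string goes through the same my_analyzer, so it becomes varta, 74 and ah, all of which are present in the index for this document.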
