We index a lot of documents that may contain titles like "lightbulb 220V" or "Box 23cm" or "Varta Super-charge battery 74Ah".
However our users, when searching, tend to separate the number and unit with whitespace, so when they search for "Varta 74 Ah" they do not get what they expect.
The above is a simplification of the problem, but the main question is hopefully valid. How can I analyze "Varta Super-charge battery 74Ah" so that (on top of other tokens) 74, Ah and 74Ah are created?
Thanks,
Michal
I guess this will help you:
PUT index_name
{
"settings": {
"analysis": {
"filter": {
"custom_filter": {
"type": "word_delimiter",
"split_on_numerics": true
}
},
"analyzer": {
"custom_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": ["custom_filter"]
}
}
}
}
}
You can use the split_on_numerics property in your custom filter. This will give you the following response:
POST /index_name/_analyze
{
"analyzer": "custom_analyzer",
"text": "Varta Super-charge battery 74Ah"
}
Response
{
"tokens" : [
{
"token" : "Varta",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "Super",
"start_offset" : 6,
"end_offset" : 11,
"type" : "word",
"position" : 1
},
{
"token" : "charge",
"start_offset" : 12,
"end_offset" : 18,
"type" : "word",
"position" : 2
},
{
"token" : "battery",
"start_offset" : 19,
"end_offset" : 26,
"type" : "word",
"position" : 3
},
{
"token" : "74",
"start_offset" : 27,
"end_offset" : 29,
"type" : "word",
"position" : 4
},
{
"token" : "Ah",
"start_offset" : 29,
"end_offset" : 31,
"type" : "word",
"position" : 5
}
]
}
You would need to create a Custom Analyzer which implements an Ngram Tokenizer and then apply it to the text field you create.
Below are the sample mapping, documents, query and response:
Mapping:
PUT my_split_index
{
"settings": {
"index":{
"max_ngram_diff": 3
},
"analysis": {
"analyzer": {
"my_analyzer": { <---- Custom Analyzer
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 2,
"max_gram": 5,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"properties": {
"product":{
"type": "text",
"analyzer": "my_analyzer", <--- Note this as how custom analyzer is applied on this field
"fields": {
"keyword":{
"type": "keyword"
}
}
}
}
}
}
The feature that you are looking for is called N-gram, which creates multiple tokens from a single token. The size of the tokens depends on the min_gram and max_gram settings mentioned above. For example, with min_gram 2 and max_gram 5, the single token 74Ah yields 74, 74A, 74Ah, 4A, 4Ah and Ah.
Note that I've set max_ngram_diff to 3; that is because in version 7.x ES's default value is 1. Looking at your use case I've set it to 3. This value is nothing but max_gram - min_gram.
Sample Documents:
POST my_split_index/_doc/1
{
"product": "Varta 74 Ah"
}
POST my_split_index/_doc/2
{
"product": "lightbulb 220V"
}
Query Request:
POST my_split_index/_search
{
"query": {
"match": {
"product": "74Ah"
}
}
}
Response:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.7029606,
"hits" : [
{
"_index" : "my_split_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.7029606,
"_source" : {
"product" : "Varta 74 Ah"
}
}
]
}
}
Additional Info:
To understand what tokens are actually generated, you can make use of the Analyze API as below:
POST my_split_index/_analyze
{
"analyzer": "my_analyzer",
"text": "Varta 74 Ah"
}
You can see that the below tokens got generated when I executed the above API:
{
"tokens" : [
{
"token" : "Va",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
},
{
"token" : "Var",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 1
},
{
"token" : "Vart",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 2
},
{
"token" : "Varta",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 3
},
{
"token" : "ar",
"start_offset" : 1,
"end_offset" : 3,
"type" : "word",
"position" : 4
},
{
"token" : "art",
"start_offset" : 1,
"end_offset" : 4,
"type" : "word",
"position" : 5
},
{
"token" : "arta",
"start_offset" : 1,
"end_offset" : 5,
"type" : "word",
"position" : 6
},
{
"token" : "rt",
"start_offset" : 2,
"end_offset" : 4,
"type" : "word",
"position" : 7
},
{
"token" : "rta",
"start_offset" : 2,
"end_offset" : 5,
"type" : "word",
"position" : 8
},
{
"token" : "ta",
"start_offset" : 3,
"end_offset" : 5,
"type" : "word",
"position" : 9
},
{
"token" : "74",
"start_offset" : 6,
"end_offset" : 8,
"type" : "word",
"position" : 10
},
{
"token" : "Ah",
"start_offset" : 9,
"end_offset" : 11,
"type" : "word",
"position" : 11
}
]
}
Notice that the query I've mentioned in the Query Request section is 74Ah, yet it still returns the document. That is because ES applies the analyzer twice: at index time and at search time. By default, if you do not specify a search_analyzer in your mapping, the same analyzer you applied at indexing time is also applied at query time.
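If you want to see why the 74Ah query matches even though the indexed text is Varta 74 Ah, you can run the query string itself through the same analyzer; among the n-grams it produces are 74 and Ah, which are exactly tokens that were indexed for the document:
POST my_split_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "74Ah"
}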
Hope this helps!
You can define your index mapping as below and see that it generates the tokens you mentioned in your question. Also, it doesn't create a lot of tokens, hence the size of your index will be smaller.
Index mapping
{
"settings": {
"analysis": {
"filter": {
"my_filter": {
"type": "word_delimiter",
"split_on_numerics": "true",
"catenate_words": "true",
"preserve_original": "true"
}
},
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"my_filter",
"lowercase"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
And check the tokens generated using the _analyze API:
{
"text": "Varta Super-charge battery 74Ah",
"analyzer" : "my_analyzer"
}
Tokens generated
{
"tokens": [
{
"token": "varta",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "super-charge",
"start_offset": 6,
"end_offset": 18,
"type": "word",
"position": 1
},
{
"token": "super",
"start_offset": 6,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "supercharge",
"start_offset": 6,
"end_offset": 18,
"type": "word",
"position": 1
},
{
"token": "charge",
"start_offset": 12,
"end_offset": 18,
"type": "word",
"position": 2
},
{
"token": "battery",
"start_offset": 19,
"end_offset": 26,
"type": "word",
"position": 3
},
{
"token": "74ah",
"start_offset": 27,
"end_offset": 31,
"type": "word",
"position": 4
},
{
"token": "74",
"start_offset": 27,
"end_offset": 29,
"type": "word",
"position": 4
},
{
"token": "ah",
"start_offset": 29,
"end_offset": 31,
"type": "word",
"position": 5
}
]
}
Edit: The tokens generated by this answer and the other one might look the same at first glance, but on closer inspection they are quite different, and I made sure they satisfy all the requirements given in the question; the details are below:
The tokens generated here are all lowercase, to provide the case-insensitive search functionality that is implicit in all search engines.
The critical thing to note is the tokens 74ah and supercharge: these are mentioned in the question, and my analyzer provides these tokens as well.
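To verify the behaviour end to end, you can index the title and run a whitespace-separated query against it. This is only an illustrative sketch; it assumes you created an index named my_index with the mapping above, and the document and query strings are examples:
PUT my_index/_doc/1
{
  "title": "Varta Super-charge battery 74Ah"
}

POST my_index/_search
{
  "query": {
    "match": {
      "title": "Varta 74 Ah"
    }
  }
}

The query is analyzed with the same my_analyzer, so it produces the tokens varta, 74 and ah, all of which were indexed for the document, and the search returns it. A query for 74Ah works the same way.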
Related
I have to search for the accounts temp123 and TEMP456 with the word temp OR TEMP.
Here is my index with an ngram tokenizer and some sample docs:
# index
PUT /demo
{
"settings": {
"index": {
"max_ngram_diff": "20",
"analysis": {
"analyzer": {
"account_analyzer": {
"tokenizer": "account_tokenizer"
}
},
"tokenizer": {
"account_tokenizer": {
"token_chars": [
"letter",
"digit"
],
"min_gram": "1",
"type": "ngram",
"max_gram": "15"
}
}
}
}
},
"mappings": {
"properties": {
"account": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "account_analyzer",
"search_analyzer": "standard"
}
}
}
}
# docs
PUT /demo/_doc/1
{
"account": "temp123"
}
PUT /demo/_doc/2
{
"account": "TEMP456"
}
With the following queries, I expect to get both docs back, but I got only doc 1.
It seems like I cannot get the doc containing the capitalized word.
What should I do to get both docs back with temp or TEMP?
POST /demo/_search/
{
"query": {
"bool": {
"must": [
{
"match": {
"account": {
"query": "temp",
"fuzziness": "AUTO"
}
}
}
]
}
}
}
POST /demo/_search/
{
"query": {
"bool": {
"must": [
{
"match": {
"account": {
"query": "TEMP",
"fuzziness": "AUTO"
}
}
}
]
}
}
}
You can use _analyze to check the tokens that your analyzer is generating.
GET demo/_analyze
{
"analyzer": "account_analyzer",
"text": ["TEMP123"]
}
"tokens" : [
{
"token" : "T",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "TE",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 1
},
{
"token" : "TEM",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 2
},
{
"token" : "TEMP",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 3
},
{
"token" : "TEMP1",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 4
},
{
"token" : "TEMP12",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 5
},
{
"token" : "TEMP123",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 6
},
{
"token" : "E",
"start_offset" : 1,
"end_offset" : 2,
"type" : "word",
"position" : 7
},
{
"token" : "EM",
"start_offset" : 1,
"end_offset" : 3,
"type" : "word",
"position" : 8
},
{
"token" : "EMP",
"start_offset" : 1,
"end_offset" : 4,
"type" : "word",
"position" : 9
},
{
"token" : "EMP1",
"start_offset" : 1,
"end_offset" : 5,
"type" : "word",
"position" : 10
},
{
"token" : "EMP12",
"start_offset" : 1,
"end_offset" : 6,
"type" : "word",
"position" : 11
},
{
"token" : "EMP123",
"start_offset" : 1,
"end_offset" : 7,
"type" : "word",
"position" : 12
},
{
"token" : "M",
"start_offset" : 2,
"end_offset" : 3,
"type" : "word",
"position" : 13
},
{
"token" : "MP",
"start_offset" : 2,
"end_offset" : 4,
"type" : "word",
"position" : 14
},
{
"token" : "MP1",
"start_offset" : 2,
"end_offset" : 5,
"type" : "word",
"position" : 15
},
{
"token" : "MP12",
"start_offset" : 2,
"end_offset" : 6,
"type" : "word",
"position" : 16
},
{
"token" : "MP123",
"start_offset" : 2,
"end_offset" : 7,
"type" : "word",
"position" : 17
},
{
"token" : "P",
"start_offset" : 3,
"end_offset" : 4,
"type" : "word",
"position" : 18
},
{
"token" : "P1",
"start_offset" : 3,
"end_offset" : 5,
"type" : "word",
"position" : 19
},
{
"token" : "P12",
"start_offset" : 3,
"end_offset" : 6,
"type" : "word",
"position" : 20
},
{
"token" : "P123",
"start_offset" : 3,
"end_offset" : 7,
"type" : "word",
"position" : 21
},
{
"token" : "1",
"start_offset" : 4,
"end_offset" : 5,
"type" : "word",
"position" : 22
},
{
"token" : "12",
"start_offset" : 4,
"end_offset" : 6,
"type" : "word",
"position" : 23
},
{
"token" : "123",
"start_offset" : 4,
"end_offset" : 7,
"type" : "word",
"position" : 24
},
{
"token" : "2",
"start_offset" : 5,
"end_offset" : 6,
"type" : "word",
"position" : 25
},
{
"token" : "23",
"start_offset" : 5,
"end_offset" : 7,
"type" : "word",
"position" : 26
},
{
"token" : "3",
"start_offset" : 6,
"end_offset" : 7,
"type" : "word",
"position" : 27
}
]
}
You need to add a lowercase filter to your analyzer so that all the tokens that are generated are lowercase:
{
"settings": {
"index": {
"max_ngram_diff": "20",
"analysis": {
"analyzer": {
"account_analyzer": {
"tokenizer": "account_tokenizer",
"filter": [ ----> note
"lowercase"
]
}
},
"tokenizer": {
"account_tokenizer": {
"token_chars": [
"letter",
"digit"
],
"min_gram": "1",
"type": "ngram",
"max_gram": "15"
}
}
}
}
},
"mappings": {
"properties": {
"account": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "account_analyzer",
"search_analyzer": "standard"
}
}
}
}
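Keep in mind that analysis settings are applied at index time, so after adding the lowercase filter you need to recreate the index with the new settings (or close it, update the analysis settings and reopen it) and reindex your documents. You can then verify the change with _analyze; the generated tokens should now all be lowercased:
GET demo/_analyze
{
  "analyzer": "account_analyzer",
  "text": ["TEMP456"]
}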
I'm trying to create a tokenizer that will work this way:
POST dev_threats/_analyze
{
"tokenizer": "my_tokenizer",
"text": "some.test.domain.com"
}
and get tokens like:
[some, some.test, some.test.domain, some.test.domain.com, test, test.domain, test.domain.com, domain, domain.com]
I tried ngram tokenizer:
"ngram_domain_tokenizer": {
"type": "ngram",
"min_gram": 1,
"max_gram": 63,
"token_chars": [
"letter",
"digit",
"punctuation"
]
},
But for long values, it generates too many tokens...
Any idea how to get such a result?
You don't need two different analyzers for this. There's another solution using shingles and it goes this way:
First you need to create an index with the proper analyzer, which I called domain_shingler:
PUT dev_threats
{
"settings": {
"analysis": {
"analyzer": {
"domain_shingler": {
"type": "custom",
"tokenizer": "dot_tokenizer",
"filter": [
"shingles",
"joiner"
]
}
},
"tokenizer": {
"dot_tokenizer": {
"type": "char_group",
"tokenize_on_chars": [
"punctuation"
]
}
},
"filter": {
"shingles": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 4,
"output_unigrams": true
},
"joiner": {
"type": "pattern_replace",
"pattern": """\s""",
"replacement": "."
}
}
}
},
"mappings": {
"properties": {
"domain": {
"type": "text",
"analyzer": "domain_shingler",
"search_analyzer": "standard"
}
}
}
}
If you try to analyze some.test.domain.com with that analyzer, you'll get the following tokens:
POST dev_threats/_analyze
{
"analyzer": "domain_shingler",
"text": "some.test.domain.com"
}
Results:
{
"tokens" : [
{
"token" : "some",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 0
},
{
"token" : "some.test",
"start_offset" : 0,
"end_offset" : 9,
"type" : "shingle",
"position" : 0,
"positionLength" : 2
},
{
"token" : "some.test.domain",
"start_offset" : 0,
"end_offset" : 16,
"type" : "shingle",
"position" : 0,
"positionLength" : 3
},
{
"token" : "some.test.domain.com",
"start_offset" : 0,
"end_offset" : 20,
"type" : "shingle",
"position" : 0,
"positionLength" : 4
},
{
"token" : "test",
"start_offset" : 5,
"end_offset" : 9,
"type" : "word",
"position" : 1
},
{
"token" : "test.domain",
"start_offset" : 5,
"end_offset" : 16,
"type" : "shingle",
"position" : 1,
"positionLength" : 2
},
{
"token" : "test.domain.com",
"start_offset" : 5,
"end_offset" : 20,
"type" : "shingle",
"position" : 1,
"positionLength" : 3
},
{
"token" : "domain",
"start_offset" : 10,
"end_offset" : 16,
"type" : "word",
"position" : 2
},
{
"token" : "domain.com",
"start_offset" : 10,
"end_offset" : 20,
"type" : "shingle",
"position" : 2,
"positionLength" : 2
},
{
"token" : "com",
"start_offset" : 17,
"end_offset" : 20,
"type" : "word",
"position" : 3
}
]
}
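Because the mapping uses search_analyzer: standard, a query like test.domain is left as a single dotted token at search time, which then has to match one of the indexed shingles. A small illustrative sketch, assuming a document like the following has been indexed:
PUT dev_threats/_doc/1
{
  "domain": "some.test.domain.com"
}

POST dev_threats/_search
{
  "query": {
    "match": {
      "domain": "test.domain"
    }
  }
}

This should return the document, since test.domain is one of the shingle tokens produced at index time.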
You can use the path hierarchy tokenizer:
PUT my-index
{
"settings": {
"analysis": {
"analyzer": {
"custom_path_tree": {
"tokenizer": "custom_hierarchy"
},
"custom_path_tree_reversed": {
"tokenizer": "custom_hierarchy_reversed"
}
},
"tokenizer": {
"custom_hierarchy": {
"type": "path_hierarchy",
"delimiter": "."
},
"custom_hierarchy_reversed": {
"type": "path_hierarchy",
"delimiter": ".",
"reverse": "true"
}
}
}
}
}
POST my-index/_analyze
{
"analyzer": "custom_path_tree",
"text": "some.test.domain.com"
}
POST my-index/_analyze
{
"analyzer": "custom_path_tree_reversed",
"text": "some.test.domain.com"
}
Results:
{
"tokens" : [
{
"token" : "some",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 0
},
{
"token" : "some.test",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 0
},
{
"token" : "some.test.domain",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 0
},
{
"token" : "some.test.domain.com",
"start_offset" : 0,
"end_offset" : 20,
"type" : "word",
"position" : 0
}
]
}
{
"tokens" : [
{
"token" : "some.test.domain.com",
"start_offset" : 0,
"end_offset" : 20,
"type" : "word",
"position" : 0
},
{
"token" : "test.domain.com",
"start_offset" : 5,
"end_offset" : 20,
"type" : "word",
"position" : 0
},
{
"token" : "domain.com",
"start_offset" : 10,
"end_offset" : 20,
"type" : "word",
"position" : 0
},
{
"token" : "com",
"start_offset" : 17,
"end_offset" : 20,
"type" : "word",
"position" : 0
}
]
}
It will create path-like tokens by splitting on the given delimiter. Using the normal and reverse options, you can get tokens in both directions.
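If you want both directions available on the same data, one option (just a sketch; the field name domain is an example) is to map the field with one analyzer and add the reversed one as a sub-field:
PUT my-index/_mapping
{
  "properties": {
    "domain": {
      "type": "text",
      "analyzer": "custom_path_tree",
      "fields": {
        "reversed": {
          "type": "text",
          "analyzer": "custom_path_tree_reversed"
        }
      }
    }
  }
}

You can then query domain for prefix-style matches and domain.reversed for suffix-style matches.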
By default, Elasticsearch with the English analyzer breaks at&t into the tokens at and t, and then removes at as a stopword.
POST _analyze
{
"analyzer": "english",
"text": "A word AT&T Procter&Gamble"
}
As a result, the tokens look like:
{
"tokens" : [
{
"token" : "word",
"start_offset" : 2,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "t",
"start_offset" : 10,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "procter",
"start_offset" : 12,
"end_offset" : 19,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "gambl",
"start_offset" : 20,
"end_offset" : 26,
"type" : "<ALPHANUM>",
"position" : 5
}
]
}
I want to be able to match at&t exactly and, at the same time, to be able to search for procter&gamble exactly as well as for e.g. only procter.
So I want to build an analyzer which creates both tokens
at&t and t for the at&t string
and
procter, gambl, procter&gamble for procter&gamble.
Is there a way to create such an analyzer? Or should I create 2 index fields - one with the regular English analyzer and the other one with English analysis except for tokenization on & ?
Mappings: You can tokenize on whitespace and use a word delimiter graph filter to create tokens for at&t:
{
"settings": {
"analysis": {
"analyzer": {
"whitespace_with_acronymns": {
"tokenizer": "whitespace",
"filter": [
"lowercase",
"acronymns"
]
}
},
"filter": {
"acronymns": {
"type": "word_delimiter_graph",
"catenate_all": true
}
}
}
}
}
Tokens:
{
"analyzer": "whitespace_with_acronymns",
"text": "A word AT&T Procter&Gamble"
}
Result: at&t is tokenized as at, t and att, so you can search for it by at, t or at&t.
{
"tokens" : [
{
"token" : "a",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "word",
"start_offset" : 2,
"end_offset" : 6,
"type" : "word",
"position" : 1
},
{
"token" : "at",
"start_offset" : 7,
"end_offset" : 9,
"type" : "word",
"position" : 2
},
{
"token" : "att",
"start_offset" : 7,
"end_offset" : 11,
"type" : "word",
"position" : 2
},
{
"token" : "t",
"start_offset" : 10,
"end_offset" : 11,
"type" : "word",
"position" : 3
},
{
"token" : "procter",
"start_offset" : 12,
"end_offset" : 19,
"type" : "word",
"position" : 4
},
{
"token" : "proctergamble",
"start_offset" : 12,
"end_offset" : 26,
"type" : "word",
"position" : 4
},
{
"token" : "gamble",
"start_offset" : 20,
"end_offset" : 26,
"type" : "word",
"position" : 5
}
]
}
If you want to remove the stopword "at", you can add a stop filter:
{
"settings": {
"analysis": {
"analyzer": {
"whitespace_with_acronymns": {
"tokenizer": "whitespace",
"filter": [
"lowercase",
"acronymns",
"english_possessive_stemmer",
"english_stop",
"english_keywords",
"english_stemmer"
]
}
},
"filter": {
"acronymns": {
"type": "word_delimiter_graph",
"catenate_all": true
},
"english_stop": {
"type": "stop",
"stopwords": "_english_"
},
"english_keywords": {
"type": "keyword_marker",
"keywords": ["example"]
},
"english_stemmer": {
"type": "stemmer",
"language": "english"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
}
}
}
}
}
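Assuming you create an index with these settings (called my_index here only as an example), you can re-run the same _analyze request to confirm the change; the standalone a and at tokens should no longer appear, while att is kept:
POST my_index/_analyze
{
  "analyzer": "whitespace_with_acronymns",
  "text": "A word AT&T Procter&Gamble"
}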
I only see n-gram and edge n-gram, and both of them start from the first letter. I would like to create a tokenizer which can produce the following tokens.
For example:
600140 -> 0, 40, 140, 0140, 00140, 600140
You can leverage the reverse token filter twice, coupled with the edge_ngram one: reversing the token, taking edge n-grams of the reversed string and reversing those again effectively produces suffixes instead of prefixes:
PUT reverse
{
"settings": {
"analysis": {
"analyzer": {
"reverse_edgengram": {
"tokenizer": "keyword",
"filter": [
"reverse",
"edge",
"reverse"
]
}
},
"filter": {
"edge": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 25
}
}
}
},
"mappings": {
"properties": {
"string_field": {
"type": "text",
"analyzer": "reverse_edgengram"
}
}
}
}
Then you can test it:
POST reverse/_analyze
{
"analyzer": "reverse_edgengram",
"text": "600140"
}
Which yields this:
{
"tokens" : [
{
"token" : "40",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 0
},
{
"token" : "140",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 0
},
{
"token" : "0140",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 0
},
{
"token" : "00140",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 0
},
{
"token" : "600140",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 0
}
]
}
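As a quick sanity check, you can index a document and search with one of the suffixes. Since no search_analyzer is set, the query string goes through the same reverse/edge_ngram/reverse chain and matches (the document and query values are just examples):
PUT reverse/_doc/1
{
  "string_field": "600140"
}

POST reverse/_search
{
  "query": {
    "match": {
      "string_field": "0140"
    }
  }
}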
This question is a continuation of this previous SO question of mine. I have some text on which I want to perform search both on numbers and text.
My Text:-
8080.foobar.getFooLabelFrombar(test.java:91)
And I want to search on getFooLabelFrombar, fooBar, 8080 and 91.
Earlier I was using the simple analyzer, which tokenized the above text into the tokens below:
"tokens": [
{
"token": "foobar",
"start_offset": 10,
"end_offset": 16,
"type": "word",
"position": 2
},
{
"token": "getfoolabelfrombar",
"start_offset": 17,
"end_offset": 35,
"type": "word",
"position": 3
},
{
"token": "test",
"start_offset": 36,
"end_offset": 40,
"type": "word",
"position": 4
},
{
"token": "java",
"start_offset": 41,
"end_offset": 45,
"type": "word",
"position": 5
}
]
}
Because of this, searching on foobar and getFooLabelFrombar gave search results, but 8080 and 91 did not, as the simple analyzer doesn't tokenize numbers.
Then, as suggested in the previous SO post, I changed the analyzer to standard, which makes the numbers searchable but not the other two word search strings, as the standard analyzer creates the tokens below:
{
"tokens": [
{
"token": "8080",
"start_offset": 0,
"end_offset": 4,
"type": "<NUM>",
"position": 1
},
{
"token": "foobar.getfoolabelfrombar",
"start_offset": 5,
"end_offset": 35,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "test.java",
"start_offset": 36,
"end_offset": 45,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "91",
"start_offset": 46,
"end_offset": 48,
"type": "<NUM>",
"position": 4
}
]
}
I went through all the existing analyzers in ES, but nothing seems to fulfil my requirement. I tried creating my own custom analyzer, below, but it doesn't work either.
{
"analysis" : {
"analyzer" : {
"my_analyzer" : {
"tokenizer" : "letter"
"filter" : ["lowercase", "extract_numbers"]
}
},
"filter" : {
"extract_numbers" : {
"type" : "keep_types",
"types" : [ "<NUM>","<ALPHANUM>","word"]
}
}
}
}
Please suggest how I can build a custom analyzer to suit my requirements.
What about using a character filter to replace dots with spaces?
PUT /my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"char_filter": ["replace_dots"]
}
},
"char_filter": {
"replace_dots": {
"type": "mapping",
"mappings": [
". => \\u0020"
]
}
}
}
}
}
POST /my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "8080.foobar.getFooLabelFrombar(test.java:91)"
}
Which outputs what you want:
{
"tokens" : [
{
"token" : "8080",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<NUM>",
"position" : 0
},
{
"token" : "foobar",
"start_offset" : 10,
"end_offset" : 16,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "getFooLabelFrombar",
"start_offset" : 17,
"end_offset" : 35,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "test",
"start_offset" : 36,
"end_offset" : 40,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "java",
"start_offset" : 41,
"end_offset" : 45,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "91",
"start_offset" : 46,
"end_offset" : 48,
"type" : "<NUM>",
"position" : 6
}
]
}
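One caveat: with the standard tokenizer alone, the token getFooLabelFrombar keeps its original case, so a query typed with different casing (e.g. getfoolabelfrombar or fooBar) may not match. If you also want case-insensitive search, a possible variation of the same analyzer (just a sketch) is to add a lowercase filter:
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["replace_dots"],
          "filter": ["lowercase"]
        }
      },
      "char_filter": {
        "replace_dots": {
          "type": "mapping",
          "mappings": [
            ". => \\u0020"
          ]
        }
      }
    }
  }
}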