Edge_ngram explanation - elasticsearch

Please explain what happens in the following case:
Index creation
curl --insecure -H "Authorization: ApiKey $ESAPIKEY" -X PUT "https://localhost:9200/tstind?pretty" -H 'Content-Type: application/json' -d'
{
"settings": {
"analysis": {
"filter": {
"cna_edge_ngram":
{ "type": "edge_ngram",
"min_gram": 3,
"max_gram": 10
}
},
"analyzer": {
"cna": {
"type": "custom",
"tokenizer": "whitespace",
"filter": ["lowercase", "cna_edge_ngram"]
}
}
}
},
"mappings": {
"properties": {
"cn": {
"type": "text",
"analyzer": "cna",
"fielddata": "true"
}
}
}
}
'
Test data
curl --insecure -H "Authorization: ApiKey $ESAPIKEY" -X POST "https://localhost:9200/tstind/_bulk?pretty" -H 'Content-Type: application/json' -d'
{"index": {"_index": "tstind", "_id": "1"}}
{"cn": "carrot"}
{"index": {"_index": "tstind", "_id": "2"}}
{"cn": "apple banana"}
{"index": {"_index": "tstind", "_id": "3"}}
{"cn": "redapple apple orange"}
{"index": {"_index": "tstind", "_id": "4"}}
{"cn": "orange"}
{"index": {"_index": "tstind", "_id": "5"}}
{"cn": "apple"}
{"index": {"_index": "tstind", "_id": "6"}}
{"cn": "cucumber"}
'
Request
curl --insecure -H "Authorization: ApiKey $ESAPIKEY" -X GET "https://localhost:9200/tstind/_search?pretty" -H 'Content-Type: application/json' -d'
{
"query": {
"match": {
"cn": {"query": "appls"}
}
}
}
'
Result:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 1.246899,
"hits" : [
{
"_index" : "tstind",
"_id" : "5",
"_score" : 1.246899,
"_source" : {
"cn" : "apple"
}
},
{
"_index" : "tstind",
"_id" : "2",
"_score" : 1.1766877,
"_source" : {
"cn" : "apple banana"
}
},
{
"_index" : "tstind",
"_id" : "3",
"_score" : 1.113962,
"_source" : {
"cn" : "redapple apple orange"
}
}
]
}
}
I expected the result to be empty, with no hits. Why does the result contain documents that do not contain the requested phrase "appls"?
To investigate, I used _analyze to see which tokens exist in the index:
curl --insecure -H "Authorization: ApiKey $ESAPIKEY" -X GET "https://localhost:9200/tstind/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
"analyzer": "cna",
"text" : "apple banana"
}
'
{
"tokens" : [
{
"token" : "app",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "appl",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "apple",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "ban",
"start_offset" : 6,
"end_offset" : 12,
"type" : "word",
"position" : 1
},
{
"token" : "bana",
"start_offset" : 6,
"end_offset" : 12,
"type" : "word",
"position" : 1
},
{
"token" : "banan",
"start_offset" : 6,
"end_offset" : 12,
"type" : "word",
"position" : 1
},
{
"token" : "banana",
"start_offset" : 6,
"end_offset" : 12,
"type" : "word",
"position" : 1
}
]
}
It looks like everything is OK, but the result is not what I expected. Please explain what is happening in this case.
I found a way to filter the results with post_filter, but I don't think that is the best approach.

In this case, look at the tokens generated for the search term "appls":
{
"tokens": [
{
"token": "app",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "appl",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "appls",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
}
]
}
Now, the tokens for the indexed term "apple":
{
"tokens": [
{
"token": "app",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "appl",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "apple",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
}
]
}
The match occurs because tokens such as "app" and "appl" are generated both for the query term "appls" and for the indexed value "apple". Since the cn field defines no search_analyzer, the same edge_ngram analyzer is applied to the query at search time, so the query and the documents end up sharing tokens.
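A common way to avoid this, sketched below, is to keep the edge_ngram analyzer for indexing only and use a plain analyzer at query time via search_analyzer (the standard analyzer and the new index name tstind2 are just illustrative choices):
curl --insecure -H "Authorization: ApiKey $ESAPIKEY" -X PUT "https://localhost:9200/tstind2?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "filter": {
        "cna_edge_ngram": { "type": "edge_ngram", "min_gram": 3, "max_gram": 10 }
      },
      "analyzer": {
        "cna": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "cna_edge_ngram"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "cn": {
        "type": "text",
        "analyzer": "cna",
        "search_analyzer": "standard"
      }
    }
  }
}
'
With this mapping the query "appls" is analyzed into the single token appls, which is not in the index, so the search returns no hits, while a genuine prefix such as "app" or "appl" still matches.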

Related

Elasticsearch does not match a partial query

I'm currently trying to create an analyzer that will match part of a query. The main use case is the term "3D mammogram": for some reason, the autocomplete analyzer below produces no results. After removing the "operator": "AND" option, Elasticsearch started returning results, but the results I expect score lower than others for some reason.
Here are the settings and the mappings for my index:
MAPPINGS:
{
"index": {
"properties": {
"code": {
"type": "text"
},
"type": {
"type": "text"
},
"term": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "index_search"
}
}
}
}
SETTINGS:
{
"index" : {
"settings" : {
"index" : {
"number_of_shards" : "5",
"provided_name" : "index",
"creation_date" : ".......",
"analysis" : {
"filter" : {
"case_transition_filter" : {
"split_on_numerics" : "true",
"type" : "word_delimiter",
"preserve_original" : "true",
"stem_english_possessive" : "false"
},
"autocomplete_filter" : {
"type" : "edge_ngram",
"min_gram" : "2",
"max_gram" : "15"
},
"hyphen-filter" : {
"pattern" : "-",
"type" : "pattern_replace",
"replacement" : " "
}
},
"analyzer" : {
"autocomplete" : {
"filter" : [ "case_transition_filter", "lowercase", "hyphen-filter", "autocomplete_filter" ],
"type" : "custom",
"tokenizer" : "keyword"
},
"index_search" : {
"type" : "standard"
}
}
},
"number_of_replicas" : "1",
"uuid" : ".....g",
"version" : {
"created" : "..."
}
}
}
}
}
As you can see I'm using two different analyzers - the autocomplete one for indexing and a standard one for search.
From my backend I'm hitting the elastic index with these two match queries wrapped in a bool query:
{
"bool" : {
"should" : [
{
"match" : {
"term" : {
"query" : "3d mammogram",
"operator" : "AND",
"analyzer" : "keyword",
"fuzziness" : "1",
"prefix_length" : 1,
"max_expansions" : 50,
"fuzzy_transpositions" : true,
"lenient" : false,
"zero_terms_query" : "NONE",
"auto_generate_synonyms_phrase_query" : true,
"boost" : 2.0
}
}
},
{
"match" : {
"term" : {
"query" : "3d mammogram",
"operator" : "AND",
"fuzziness" : "1",
"prefix_length" : 1,
"max_expansions" : 50,
"fuzzy_transpositions" : true,
"lenient" : false,
"zero_terms_query" : "NONE",
"auto_generate_synonyms_phrase_query" : true,
"boost" : 1.0
}
}
}
],
"adjust_pure_negative" : true,
"minimum_should_match" : "1",
"boost" : 1.0
}
}
Both queries as written produce no results, but after removing "operator": "AND" from the second query I start getting results, though not the ones I expect.
Here are the results from the second query:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 93,
"max_score" : 20.951433,
"hits" : [
{
"_index" : "index",
"_type" : "index",
"_id" : ".....",
"_score" : 20.951433,
"_source" : {
"id" : null,
"careNeedCode" : "...",
"careNeedType" : "...",
"term" : "Routine mammogram"
}
},
{
"_index" : "...",
"_type" : "...",
"_id" : "...",
"_score" : 19.059473,
"_source" : {
"id" : null,
"careNeedCode" : "...",
"careNeedType" : "...",
"term" : "Mammogram"
}
},
{
"_index" : "....",
"_type" : "...",
"_id" : "...",
"_score" : 18.515629,
"_source" : {
"id" : null,
"careNeedCode" : "...",
"careNeedType" : "...",
"term" : "Screening mammogram"
}
},
{
"_index" : "...",
"_type" : "search-term",
"_id" : "....",
"_score" : 18.515629,
"_source" : {
"id" : null,
"careNeedCode" : "...",
"careNeedType" : "treatment procedures",
"term" : "Diagnostic mammogram"
}
},
{
"_index" : "....",
"_type" : "...",
"_id" : "...",
"_score" : 18.515629,
"_source" : {
"id" : null,
"careNeedCode" : "...",
"careNeedType" : "...",
"term" : "Digital mammogram"
}
},
{
"_index" : "...",
"_type" : "...",
"_id" : "...",
"_score" : 18.480751,
"_source" : {
"id" : null,
"careNeedCode" : "...",
"careNeedType" : "...",
"term" : "Screening 3D mammogram"
}
},
{
"_index" : "...",
"_type" : "...",
"_id" : "...",
"_score" : 18.376223,
"_source" : {
"id" : null,
"careNeedCode" : "...",
"careNeedType" : "t...",
"term" : "Diagnostic 3D mammogram"
}
},
{
"_index" : "...",
"_type" : "...",
"_id" : "...",
"_score" : 17.930023,
"_source" : {
"id" : null,
"careNeedCode" : "...",
"careNeedType" : "...",
"term" : "Mammography"
}
},
{
"_index" : "...",
"_type" : "...",
"_id" : "....",
"_score" : 17.287262,
"_source" : {
"id" : null,
"careNeedCode" : "...",
"careNeedType" : "...",
"term" : "Screening mammography"
}
},
{
"_index" : "....",
"_type" : "...",
"_id" : "...",
"_score" : 17.287262,
"_source" : {
"id" : null,
"careNeedCode" : "...",
"careNeedType" : "...",
"term" : "Abnormal mammography"
}
}
]
}
}
As you can see, the results containing "3D mammogram" are ranked well below results that only contain "mammogram". I'm not sure what I am missing here.
Based on your index mapping and settings, the tokens generated for "Screening 3D mammogram" will be
{
"tokens": [
{
"token": "sc",
"start_offset": 0,
"end_offset": 22,
"type": "word",
"position": 0
},
{
"token": "scr",
"start_offset": 0,
"end_offset": 22,
"type": "word",
"position": 0
},
{
"token": "scre",
"start_offset": 0,
"end_offset": 22,
"type": "word",
"position": 0
},
{
"token": "scree",
"start_offset": 0,
"end_offset": 22,
"type": "word",
"position": 0
},
{
"token": "screen",
"start_offset": 0,
"end_offset": 22,
"type": "word",
"position": 0
},
{
"token": "screeni",
"start_offset": 0,
"end_offset": 22,
"type": "word",
"position": 0
},
{
"token": "screenin",
"start_offset": 0,
"end_offset": 22,
"type": "word",
"position": 0
},
{
"token": "screening",
"start_offset": 0,
"end_offset": 22,
"type": "word",
"position": 0
},
{
"token": "screening ",
"start_offset": 0,
"end_offset": 22,
"type": "word",
"position": 0
},
{
"token": "screening 3",
"start_offset": 0,
"end_offset": 22,
"type": "word",
"position": 0
},
{
"token": "screening 3d",
"start_offset": 0,
"end_offset": 22,
"type": "word",
"position": 0
},
{
"token": "screening 3d ",
"start_offset": 0,
"end_offset": 22,
"type": "word",
"position": 0
},
{
"token": "screening 3d m",
"start_offset": 0,
"end_offset": 22,
"type": "word",
"position": 0
},
{
"token": "screening 3d ma",
"start_offset": 0,
"end_offset": 22,
"type": "word",
"position": 0
},
{
"token": "sc",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "scr",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "scre",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "scree",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "screen",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "screeni",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "screenin",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "screening",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "ma",
"start_offset": 13,
"end_offset": 22,
"type": "word",
"position": 3
},
{
"token": "mam",
"start_offset": 13,
"end_offset": 22,
"type": "word",
"position": 3
},
{
"token": "mamm",
"start_offset": 13,
"end_offset": 22,
"type": "word",
"position": 3
},
{
"token": "mammo",
"start_offset": 13,
"end_offset": 22,
"type": "word",
"position": 3
},
{
"token": "mammog",
"start_offset": 13,
"end_offset": 22,
"type": "word",
"position": 3
},
{
"token": "mammogr",
"start_offset": 13,
"end_offset": 22,
"type": "word",
"position": 3
},
{
"token": "mammogra",
"start_offset": 13,
"end_offset": 22,
"type": "word",
"position": 3
},
{
"token": "mammogram",
"start_offset": 13,
"end_offset": 22,
"type": "word",
"position": 3
}
]
}
There is no token generated for 3d. This is because you are using "tokenizer": "keyword" for the autocomplete analyzer: the whole phrase reaches the word_delimiter filter as a single token, so 3D only survives as the separate one-character pieces 3 and d, which are shorter than min_gram and are dropped by the edge_ngram filter. You need to modify your index settings and change the tokenizer from keyword to standard.
The modified analyzer definition will be:
"analyzer" : {
"autocomplete" : {
"filter" : [ "case_transition_filter", "lowercase", "hyphen-filter", "autocomplete_filter" ],
"type" : "custom",
"tokenizer" : "standard" // note this
},
You need to reindex the data again with this new index mapping.
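If your version supports it, the _reindex API can copy the documents for you; a minimal sketch (the index names old_index and new_index are placeholders, and the destination index must already exist with the corrected analyzer):
POST _reindex
{
  "source": {
    "index": "old_index"
  },
  "dest": {
    "index": "new_index"
  }
}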
Adding a working example with index data, index mapping, search query, and search result.
Index Mapping:
{
"settings": {
"analysis": {
"filter": {
"case_transition_filter": {
"split_on_numerics": "true",
"type": "word_delimiter",
"preserve_original": "true",
"stem_english_possessive": "false"
},
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": "2",
"max_gram": "15"
},
"hyphen-filter": {
"pattern": "-",
"type": "pattern_replace",
"replacement": " "
}
},
"analyzer": {
"autocomplete": {
"filter": [
"case_transition_filter",
"lowercase",
"hyphen-filter",
"autocomplete_filter"
],
"type": "custom",
"tokenizer": "standard" // note this
},
"search_term_search": {
"type": "standard"
}
}
},
"max_ngram_diff": 20
},
"mappings": {
"properties": {
"term": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "search_term_search"
}
}
}
}
The tokens generated will now include both "3d" and "mammogram".
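You can verify this with the _analyze API once the index has been recreated (the index name my_index is a placeholder for whatever you created with the mapping above):
POST my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "Screening 3D mammogram"
}
The output should now contain a standalone 3d token in addition to the prefixes of screening and mammogram.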
Index Data:
{
"term": "Screening mammogram"
}
{
"term": "Diagnostic 3D mammogram"
}
{
"term": "Mammography"
}
Search Query:
{
"query": {
"match": {
"term": {
"query": "3D mammogram",
"operator": "and"
}
}
}
}
Search Result:
"hits": [
{
"_index": "67607194",
"_type": "_doc",
"_id": "4",
"_score": 1.4572026,
"_source": {
"term": "Diagnostic 3D mammogram"
}
}
]

Elasticsearch 6.8 match_phrase search with N-gram tokenizer does not work well

I use the Elasticsearch N-gram tokenizer and match_phrase for fuzzy matching.
My index and test data are below:
DELETE /m8
PUT m8
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 1,
"max_gram": 3,
"custom_token_chars":"_."
}
}
},
"max_ngram_diff": 10
},
"mappings": {
"table": {
"properties": {
"dataSourceId": {
"type": "long"
},
"dataSourceType": {
"type": "integer"
},
"dbName": {
"type": "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
PUT /m8/table/1
{
"dataSourceId":1,
"dataSourceType":2,
"dbName":"rm.rf"
}
PUT /m8/table/2
{
"dataSourceId":1,
"dataSourceType":2,
"dbName":"rm_rf"
}
PUT /m8/table/3
{
"dataSourceId":1,
"dataSourceType":2,
"dbName":"rmrf"
}
check _analyze:
POST m8/_analyze
{
"tokenizer": "my_tokenizer",
"text": "rm.rf"
}
_analyze result:
{
"tokens" : [
{
"token" : "r",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "rm",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 1
},
{
"token" : "rm.",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 2
},
{
"token" : "m",
"start_offset" : 1,
"end_offset" : 2,
"type" : "word",
"position" : 3
},
{
"token" : "m.",
"start_offset" : 1,
"end_offset" : 3,
"type" : "word",
"position" : 4
},
{
"token" : "m.r",
"start_offset" : 1,
"end_offset" : 4,
"type" : "word",
"position" : 5
},
{
"token" : ".",
"start_offset" : 2,
"end_offset" : 3,
"type" : "word",
"position" : 6
},
{
"token" : ".r",
"start_offset" : 2,
"end_offset" : 4,
"type" : "word",
"position" : 7
},
{
"token" : ".rf",
"start_offset" : 2,
"end_offset" : 5,
"type" : "word",
"position" : 8
},
{
"token" : "r",
"start_offset" : 3,
"end_offset" : 4,
"type" : "word",
"position" : 9
},
{
"token" : "rf",
"start_offset" : 3,
"end_offset" : 5,
"type" : "word",
"position" : 10
},
{
"token" : "f",
"start_offset" : 4,
"end_offset" : 5,
"type" : "word",
"position" : 11
}
]
}
When I search for 'rm', nothing is found:
GET /m8/table/_search
{
"query": {
"bool": {
"must": [
{
"match_phrase": {
"dbName": "rm"
}
}
]
}
}
}
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}
But '.rf' can be found:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.7260926,
"hits" : [
{
"_index" : "m8",
"_type" : "table",
"_id" : "1",
"_score" : 1.7260926,
"_source" : {
"dataSourceId" : 1,
"dataSourceType" : 2,
"dbName" : "rm.rf"
}
}
]
}
}
My question:
Why can't 'rm' be found, even though _analyze has split the phrase into these tokens?
my_analyzer will be used during search time as well.
"mapping":{
"dbName": {
"type": "text",
"analyzer": "my_analyzer"
"search_analyzer":"my_analyzer" // <==== If you don't provide a search analyzer then what you defined in analyzer will be used during search time as well.
The match_phrase query matches phrases taking the positions of the analyzed tokens into account, e.g. searching for "Kal ho" matches documents that have "Kal" at position X and "ho" at position X+1 in the analyzed text.
When you search for 'rm', the text is analyzed with my_analyzer, which converts it into n-grams (r, rm, m at consecutive positions), and the phrase search is then run on top of those n-grams. The indexed documents do not contain that exact token sequence at consecutive positions, hence the unexpected outcome.
Solution:
Use standard analyzer with simple match query
GET /m8/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"dbName": {
"query": "rm",
"analyzer": "standard" // <=========
}
}
}
]
}
}
}
OR Define during mapping & use a match query (not match_phrase)
"mapping":{
"dbName": {
"type": "text",
"analyzer": "my_analyzer"
"search_analyzer":"standard" //<==========
Follow-up question: why do you want to use a match_phrase query with an n-gram tokenizer?

Split text containing <number><unit> into 3 tokens

We index a lot of documents that may contain titles like "lightbulb 220V" or "Box 23cm" or "Varta Super-charge battery 74Ah".
However, our users tend to separate the number and the unit with whitespace when searching, so when they search for "Varta 74 Ah" they do not get what they expect.
The above is a simplification of the problem, but the main question is hopefully valid. How can I analyze "Varta Super-charge battery 74Ah" so that (on top of other tokens) 74, Ah and 74Ah are created?
Thanks,
Michal
I guess this will help you:
PUT index_name
{
"settings": {
"analysis": {
"filter": {
"custom_filter": {
"type": "word_delimiter",
"split_on_numerics": true
}
},
"analyzer": {
"custom_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": ["custom_filter"]
}
}
}
}
}
You can use the split_on_numerics property in your custom filter. This will give you the following response:
POST
POST /index_name/_analyze
{
"analyzer": "custom_analyzer",
"text": "Varta Super-charge battery 74Ah"
}
Response
{
"tokens" : [
{
"token" : "Varta",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "Super",
"start_offset" : 6,
"end_offset" : 11,
"type" : "word",
"position" : 1
},
{
"token" : "charge",
"start_offset" : 12,
"end_offset" : 18,
"type" : "word",
"position" : 2
},
{
"token" : "battery",
"start_offset" : 19,
"end_offset" : 26,
"type" : "word",
"position" : 3
},
{
"token" : "74",
"start_offset" : 27,
"end_offset" : 29,
"type" : "word",
"position" : 4
},
{
"token" : "Ah",
"start_offset" : 29,
"end_offset" : 31,
"type" : "word",
"position" : 5
}
]
}
You would need to create a custom analyzer that implements the ngram tokenizer and then apply it to the text field you create.
Below is the sample mapping, document, query and the response:
Mapping:
PUT my_split_index
{
"settings": {
"index":{
"max_ngram_diff": 3
},
"analysis": {
"analyzer": {
"my_analyzer": { <---- Custom Analyzer
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 2,
"max_gram": 5,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"properties": {
"product":{
"type": "text",
"analyzer": "my_analyzer", <--- Note this as how custom analyzer is applied on this field
"fields": {
"keyword":{
"type": "keyword"
}
}
}
}
}
}
The feature you are looking for is called ngram, which creates multiple tokens from a single token. The size of the tokens depends on the min_gram and max_gram settings mentioned above.
Note that I've set max_ngram_diff to 3, because in version 7.x ES's default value is 1. Looking at your use case I've set it to 3; this value is simply max_gram - min_gram.
Sample Documents:
POST my_split_index/_doc/1
{
"product": "Varta 74 Ah"
}
POST my_split_index/_doc/2
{
"product": "lightbulb 220V"
}
Query Request:
POST my_split_index/_search
{
"query": {
"match": {
"product": "74Ah"
}
}
}
Response:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.7029606,
"hits" : [
{
"_index" : "my_split_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.7029606,
"_source" : {
"product" : "Varta 74 Ah"
}
}
]
}
}
Additional Info:
To understand which tokens are actually generated, you can use the _analyze API:
POST my_split_index/_analyze
{
"analyzer": "my_analyzer",
"text": "Varta 74 Ah"
}
You can see that the tokens below are generated when executing the above API:
{
"tokens" : [
{
"token" : "Va",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
},
{
"token" : "Var",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 1
},
{
"token" : "Vart",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 2
},
{
"token" : "Varta",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 3
},
{
"token" : "ar",
"start_offset" : 1,
"end_offset" : 3,
"type" : "word",
"position" : 4
},
{
"token" : "art",
"start_offset" : 1,
"end_offset" : 4,
"type" : "word",
"position" : 5
},
{
"token" : "arta",
"start_offset" : 1,
"end_offset" : 5,
"type" : "word",
"position" : 6
},
{
"token" : "rt",
"start_offset" : 2,
"end_offset" : 4,
"type" : "word",
"position" : 7
},
{
"token" : "rta",
"start_offset" : 2,
"end_offset" : 5,
"type" : "word",
"position" : 8
},
{
"token" : "ta",
"start_offset" : 3,
"end_offset" : 5,
"type" : "word",
"position" : 9
},
{
"token" : "74",
"start_offset" : 6,
"end_offset" : 8,
"type" : "word",
"position" : 10
},
{
"token" : "Ah",
"start_offset" : 9,
"end_offset" : 11,
"type" : "word",
"position" : 11
}
]
}
Notice that the query in the Query Request section above is 74Ah, yet it still returns the document. That is because ES analyzes text twice: at index time and at search time. By default, if you do not specify a search_analyzer, the analyzer applied at index time is also applied at query time.
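You can see the search-time expansion for yourself by running the query text through the same analyzer (against the my_split_index created above):
POST my_split_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "74Ah"
}
Among the grams this produces are 74 and Ah, both of which also appear in the token list for "Varta 74 Ah" shown above, which is why the match query finds the document.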
Hope this helps!
You can define your index mapping as below and see it generates tokens, as you mentioned in your question. Also, it doesn't create a lot of tokens. Hence the size of your index would be smaller.
Index mapping
{
"settings": {
"analysis": {
"filter": {
"my_filter": {
"type": "word_delimiter",
"split_on_numerics": "true",
"catenate_words": "true",
"preserve_original": "true"
}
},
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"my_filter",
"lowercase"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
And check the tokens generated using _analyze API
{
"text": "Varta Super-charge battery 74Ah",
"analyzer" : "my_analyzer"
}
Tokens generated
{
"tokens": [
{
"token": "varta",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "super-charge",
"start_offset": 6,
"end_offset": 18,
"type": "word",
"position": 1
},
{
"token": "super",
"start_offset": 6,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "supercharge",
"start_offset": 6,
"end_offset": 18,
"type": "word",
"position": 1
},
{
"token": "charge",
"start_offset": 12,
"end_offset": 18,
"type": "word",
"position": 2
},
{
"token": "battery",
"start_offset": 19,
"end_offset": 26,
"type": "word",
"position": 3
},
{
"token": "74ah",
"start_offset": 27,
"end_offset": 31,
"type": "word",
"position": 4
},
{
"token": "74",
"start_offset": 27,
"end_offset": 29,
"type": "word",
"position": 4
},
{
"token": "ah",
"start_offset": 29,
"end_offset": 31,
"type": "word",
"position": 5
}
]
}
Edit: the tokens generated by the two answers might look the same at first glance, but I made sure this one satisfies all the requirements given in the question, and on closer inspection the tokens are quite different. The details are below:
The tokens my analyzer generates are all lowercase, which provides the case-insensitive search behaviour that is implicit in all search engines.
The critical thing to note is the tokens 74ah and supercharge: the combined form 74Ah is asked for in the question, and my analyzer provides these tokens as well.
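To check the original use case end to end, here is a sketch, assuming the settings and mapping above were used to create an index called my_wd_index (the name is only for illustration):
PUT my_wd_index/_doc/1
{
  "title": "Varta Super-charge battery 74Ah"
}
POST my_wd_index/_search
{
  "query": {
    "match": {
      "title": "Varta 74 Ah"
    }
  }
}
Since no separate search_analyzer is defined, the query is analyzed with the same analyzer into the terms varta, 74 and ah, all of which appear in the token list above, so the document is returned even though its title spells the value as 74Ah.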

how to build a backward edge n-gram tokenizer

I only see the n-gram and edge n-gram tokenizers, and both of them start from the first letter.
I would like to create a tokenizer that can produce the following tokens.
For example:
600140 -> 0, 40, 140, 0140, 00140, 600140
You can leverage the reverse token filter twice coupled with the edge_ngram one:
PUT reverse
{
"settings": {
"analysis": {
"analyzer": {
"reverse_edgengram": {
"tokenizer": "keyword",
"filter": [
"reverse",
"edge",
"reverse"
]
}
},
"filter": {
"edge": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 25
}
}
}
},
"mappings": {
"properties": {
"string_field": {
"type": "text",
"analyzer": "reverse_edgengram"
}
}
}
}
Then you can test it:
POST reverse/_analyze
{
"analyzer": "reverse_edgengram",
"text": "600140"
}
Which yields this:
{
"tokens" : [
{
"token" : "40",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 0
},
{
"token" : "140",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 0
},
{
"token" : "0140",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 0
},
{
"token" : "00140",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 0
},
{
"token" : "600140",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 0
}
]
}
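Note that with "min_gram": 2 the single-character suffix 0 from your example is not produced. If you also need one-character suffixes, lowering min_gram to 1 in the filter above should cover it:
"filter": {
  "edge": {
    "type": "edge_ngram",
    "min_gram": 1,
    "max_gram": 25
  }
}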

elasticsearch custom analyzer by specific character

How do I create a custom analyzer that tokenizes a field on '/' characters only?
I have URL strings in my field, for example: "https://stackoverflow.com/questions/ask"
I want it tokenized like this: "http", "stackoverflow.com", "questions" and "ask".
This seems to do what you want, using a pattern tokenizer:
PUT /test_index
{
"settings": {
"number_of_shards": 1,
"analysis": {
"analyzer": {
"slash_analyzer": {
"type": "pattern",
"pattern": "[/:]+",
"lowercase": true
}
}
}
},
"mappings": {
"doc": {
"properties": {
"url": {
"type": "string",
"index_analyzer": "slash_analyzer",
"search_analyzer": "standard",
"term_vector": "yes"
}
}
}
}
}
PUT /test_index/doc/1
{
"url": "http://stackoverflow.com/questions/ask"
}
I added term vectors in the mapping (you probably don't want to do this in production), so we can see what terms are generated:
GET /test_index/doc/1/_termvector
...
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_version": 1,
"found": true,
"took": 1,
"term_vectors": {
"url": {
"field_statistics": {
"sum_doc_freq": 4,
"doc_count": 1,
"sum_ttf": 4
},
"terms": {
"ask": {
"term_freq": 1
},
"http": {
"term_freq": 1
},
"questions": {
"term_freq": 1
},
"stackoverflow.com": {
"term_freq": 1
}
}
}
}
}
Here's the code I used:
http://sense.qbox.io/gist/669fbdd681895d7e9f8db13799865c6e8be75b11
The standard analyzer already does that for you.
curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'http://stackoverflow.com/questions/ask'
You get this:
{
"tokens" : [ {
"token" : "http",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "stackoverflow.com",
"start_offset" : 7,
"end_offset" : 24,
"type" : "<ALPHANUM>",
"position" : 2
}, {
"token" : "questions",
"start_offset" : 25,
"end_offset" : 34,
"type" : "<ALPHANUM>",
"position" : 3
}, {
"token" : "ask",
"start_offset" : 35,
"end_offset" : 38,
"type" : "<ALPHANUM>",
"position" : 4
} ]
}
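On more recent Elasticsearch versions the analyzer and text are passed in the request body rather than as URL parameters; the equivalent call in the form used elsewhere in this thread would be:
GET _analyze
{
  "analyzer": "standard",
  "text": "http://stackoverflow.com/questions/ask"
}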
