Elasticsearch Analyzer first 4 and last 4 characters

With Elasticsearch, I would like to specify a search analyzer where the first 4 characters and last 4 characters are tokenized.
For example: supercalifragilisticexpialidocious => ["supe", "ious"]
I have had a go with an ngram tokenizer, as follows:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 4,
          "max_gram": 4
        }
      }
    }
  }
}
I am testing the analyzer as follows:
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "supercalifragilisticexpialidocious."
}
And I get back `supe`, loads of tokens I don't want in between, and finally `ous.`. The problem for me is how I can take only the first and last tokens from the ngram tokenizer specified above.
{
  "tokens": [
    {
      "token": "supe",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "uper",
      "start_offset": 1,
      "end_offset": 5,
      "type": "word",
      "position": 1
    },
    ...
    {
      "token": "ciou",
      "start_offset": 29,
      "end_offset": 33,
      "type": "word",
      "position": 29
    },
    {
      "token": "ious",
      "start_offset": 30,
      "end_offset": 34,
      "type": "word",
      "position": 30
    },
    {
      "token": "ous.",
      "start_offset": 31,
      "end_offset": 35,
      "type": "word",
      "position": 31
    }
  ]
}

One way to achieve this is to leverage the pattern_capture token filter and take the first 4 and last 4 characters.
First, define your index like this:
PUT my_index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": [
              "lowercase",
              "first_last_four"
            ]
          }
        },
        "filter": {
          "first_last_four": {
            "type": "pattern_capture",
            "preserve_original": false,
            "patterns": [
              """(\w{4}).*(\w{4})"""
            ]
          }
        }
      }
    }
  }
}
Then, you can test your new custom analyzer:
POST my_index/_analyze
{
  "text": "supercalifragilisticexpialidocious",
  "analyzer": "my_analyzer"
}
And see that the tokens you expect are there:
{
  "tokens": [
    {
      "token": "supe",
      "start_offset": 0,
      "end_offset": 34,
      "type": "word",
      "position": 0
    },
    {
      "token": "ious",
      "start_offset": 0,
      "end_offset": 34,
      "type": "word",
      "position": 0
    }
  ]
}
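One caveat worth noting: the pattern (\w{4}).*(\w{4}) only matches tokens with at least 8 word characters. For shorter inputs the pattern does not match, and the pattern_capture filter should then emit the original token unchanged rather than drop it; treat that as an assumption to verify against your Elasticsearch version with a quick check:
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "cake"
}
If whole short words in the output are unwanted, a length token filter or an adjusted pattern could be layered on top.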

Related

Which Analyzer can meet my need in elasticsearch?

In my situation, my field is like "abc,123", and I want it to be searchable by either "abc" or "123".
My index mapping is like the code below:
{
  "myfield": {
    "type": "text",
    "analyzer": "stop",
    "search_analyzer": "stop"
  }
}
But when I use the ES _analyze API to test, I get this result:
{
  "tokens": [
    {
      "token": "abc",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    }
  ]
}
"123" was lost.
If I want to meet my situation, do I need to choose some other analyzer or just to add some special configs?
You need to choose the standard analyzer instead, as the stop analyzer breaks text into terms whenever it encounters a character which is not a letter and also removes stop words like 'the'. In your case, "abc,123" results in the single token abc when using the stop analyzer. Using the standard analyzer, it returns abc and 123, as shown below:
POST _analyze
{
  "analyzer": "standard",
  "text": "abc, 123"
}
Output:
{
  "tokens": [
    {
      "token": "abc",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "123",
      "start_offset": 5,
      "end_offset": 8,
      "type": "<NUM>",
      "position": 1
    }
  ]
}
EDIT 1: Using the Simple Pattern Split Tokenizer
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "simple_pattern_split",
          "pattern": ","
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "abc,123"
}
Output:
{
  "tokens": [
    {
      "token": "abc",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "123",
      "start_offset": 4,
      "end_offset": 7,
      "type": "word",
      "position": 1
    }
  ]
}
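To wire this up to the field from the question, reference the analyzer in the mapping. A minimal sketch, assuming a recent Elasticsearch version with typeless mappings (the index name here is illustrative):
PUT my_split_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "simple_pattern_split",
          "pattern": ","
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "myfield": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
With this in place, a match query on myfield for either abc or 123 should find a document containing "abc,123".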

Whitespaces in queries

I have an analyzer which ignores whitespaces. When I search for a string without space, it returns proper results. This is the analyzer:
{
  "index": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "word_joiner": {
          "type": "word_delimiter",
          "catenate_all": true
        }
      },
      "analyzer": {
        "word_join_analyzer": {
          "type": "custom",
          "filter": [
            "word_joiner"
          ],
          "tokenizer": "keyword"
        }
      }
    }
  }
}
This is how it works:
curl -XGET "http://localhost:9200/cake/_analyze?analyzer=word_join_analyzer&pretty" -d 'ONE"\ "TWO'
Result:
{
  "tokens": [
    {
      "token": "ONE",
      "start_offset": 1,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "ONETWO",
      "start_offset": 1,
      "end_offset": 13,
      "type": "word",
      "position": 0
    },
    {
      "token": "TWO",
      "start_offset": 7,
      "end_offset": 13,
      "type": "word",
      "position": 1
    }
  ]
}
What I want is to also get a token "ONE TWO" out of this analyzer. How can I do this?
Thanks!
You need to enable the preserve_original setting, which is false by default:
{
  "index": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "word_joiner": {
          "type": "word_delimiter",
          "catenate_all": true,
          "preserve_original": true <--- add this
        }
      },
      "analyzer": {
        "word_join_analyzer": {
          "type": "custom",
          "filter": [
            "word_joiner"
          ],
          "tokenizer": "keyword"
        }
      }
    }
  }
}
This will yield:
{
  "tokens": [
    {
      "token": "ONE TWO",
      "start_offset": 0,
      "end_offset": 7,
      "type": "word",
      "position": 0
    },
    {
      "token": "ONE",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "ONETWO",
      "start_offset": 0,
      "end_offset": 7,
      "type": "word",
      "position": 0
    },
    {
      "token": "TWO",
      "start_offset": 4,
      "end_offset": 7,
      "type": "word",
      "position": 1
    }
  ]
}
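As a quick end-to-end check, the _analyze API should now show the untouched "ONE TWO" token alongside the split and catenated ones. A minimal sketch, assuming the settings above are applied to the cake index from the question (shown in the newer request-body form of _analyze; older 1.x versions used query-string parameters as in the curl call above):
GET cake/_analyze
{
  "analyzer": "word_join_analyzer",
  "text": "ONE TWO"
}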

Using autocomplete with email in Elasticsearch doesn't work

I have a field with the following mapping defined:
"my_field": {
  "properties": {
    "address": {
      "type": "string",
      "analyzer": "email",
      "search_analyzer": "whitespace"
    }
  }
}
My email analyser looks like this:
{
  "analysis": {
    "filter": {
      "email_filter": {
        "type": "edge_ngram",
        "min_gram": "3",
        "max_gram": "255"
      }
    },
    "analyzer": {
      "email": {
        "type": "custom",
        "filter": [
          "lowercase",
          "email_filter",
          "unique"
        ],
        "tokenizer": "uax_url_email"
      }
    }
  }
}
When I try to search for an email ID like test.xyz@example.com, searching for terms like tes, test.xy, etc. doesn't work. But if I search for test.xyz or test.xyz@example.com, it works fine. I tried analyzing the tokens using my email filter and it works as expected.
Ex. Hitting http://localhost:9200/my_index/_analyze?analyzer=email&text=test.xyz@example.com
I get:
{
  "tokens": [
    { "token": "tes", "start_offset": 0, "end_offset": 20, "type": "word", "position": 0 },
    { "token": "test", "start_offset": 0, "end_offset": 20, "type": "word", "position": 0 },
    { "token": "test.", "start_offset": 0, "end_offset": 20, "type": "word", "position": 0 },
    { "token": "test.x", "start_offset": 0, "end_offset": 20, "type": "word", "position": 0 },
    { "token": "test.xy", "start_offset": 0, "end_offset": 20, "type": "word", "position": 0 },
    { "token": "test.xyz", "start_offset": 0, "end_offset": 20, "type": "word", "position": 0 },
    { "token": "test.xyz@", "start_offset": 0, "end_offset": 20, "type": "word", "position": 0 },
    { "token": "test.xyz@e", "start_offset": 0, "end_offset": 20, "type": "word", "position": 0 },
    { "token": "test.xyz@ex", "start_offset": 0, "end_offset": 20, "type": "word", "position": 0 },
    { "token": "test.xyz@exa", "start_offset": 0, "end_offset": 20, "type": "word", "position": 0 },
    { "token": "test.xyz@exam", "start_offset": 0, "end_offset": 20, "type": "word", "position": 0 },
    { "token": "test.xyz@examp", "start_offset": 0, "end_offset": 20, "type": "word", "position": 0 },
    { "token": "test.xyz@exampl", "start_offset": 0, "end_offset": 20, "type": "word", "position": 0 },
    { "token": "test.xyz@example", "start_offset": 0, "end_offset": 20, "type": "word", "position": 0 },
    { "token": "test.xyz@example.", "start_offset": 0, "end_offset": 20, "type": "word", "position": 0 },
    { "token": "test.xyz@example.c", "start_offset": 0, "end_offset": 20, "type": "word", "position": 0 },
    { "token": "test.xyz@example.co", "start_offset": 0, "end_offset": 20, "type": "word", "position": 0 },
    { "token": "test.xyz@example.com", "start_offset": 0, "end_offset": 20, "type": "word", "position": 0 }
  ]
}
So I know that the tokenisation works. But while searching, it fails to match partial strings.
For example, looking for http://localhost:9200/my_index/my_field/_search?q=test, the result shows no hits.
Details of my index:
{
  "my_index": {
    "aliases": {
      "alias_default": {}
    },
    "mappings": {
      "my_field": {
        "properties": {
          "address": {
            "type": "string",
            "analyzer": "email",
            "search_analyzer": "whitespace"
          },
          "boost": {
            "type": "long"
          },
          "createdat": {
            "type": "date",
            "format": "strict_date_optional_time||epoch_millis"
          },
          "instanceid": {
            "type": "long"
          },
          "isdeleted": {
            "type": "integer"
          },
          "object": {
            "type": "string"
          },
          "objecthash": {
            "type": "string"
          },
          "objectid": {
            "type": "string"
          },
          "parent": {
            "type": "short"
          },
          "parentid": {
            "type": "integer"
          },
          "updatedat": {
            "type": "date",
            "format": "strict_date_optional_time||epoch_millis"
          }
        }
      }
    },
    "settings": {
      "index": {
        "creation_date": "1480342980403",
        "number_of_replicas": "1",
        "max_result_window": "100000",
        "uuid": "OUuiTma8CA2VNtw9Og",
        "analysis": {
          "filter": {
            "email_filter": {
              "type": "edge_ngram",
              "min_gram": "3",
              "max_gram": "255"
            },
            "autocomplete_filter": {
              "type": "edge_ngram",
              "min_gram": "3",
              "max_gram": "20"
            }
          },
          "analyzer": {
            "autocomplete": {
              "type": "custom",
              "filter": [
                "lowercase",
                "autocomplete_filter"
              ],
              "tokenizer": "standard"
            },
            "email": {
              "type": "custom",
              "filter": [
                "lowercase",
                "email_filter",
                "unique"
              ],
              "tokenizer": "uax_url_email"
            }
          }
        },
        "number_of_shards": "5",
        "version": {
          "created": "2010099"
        }
      }
    },
    "warmers": {}
  }
}
OK, everything looks correct except your query.
You simply need to specify the address field in your query like this and it will work:
http://localhost:9200/my_index/my_field/_search?q=address:test
If you don't specify the address field, the query will work on the _all field whose search analyzer is the standard one by default, hence why you're not finding anything.
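For completeness, the same field-scoped search can be written in the query DSL. A rough sketch (a match query rather than the query_string semantics that ?q= uses, so treat it as an approximation, not an exact translation):
GET my_index/_search
{
  "query": {
    "match": {
      "address": "test"
    }
  }
}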

Elasticsearch custom analyzer being ignored

I'm using Elasticsearch 2.2.0 and I'm trying to use the lowercase + asciifolding filters on a field.
This is the output of http://localhost:9200/myindex/
{
  "myindex": {
    "aliases": {},
    "mappings": {
      "products": {
        "properties": {
          "fold": {
            "analyzer": "folding",
            "type": "string"
          }
        }
      }
    },
    "settings": {
      "index": {
        "analysis": {
          "analyzer": {
            "folding": {
              "token_filters": [
                "lowercase",
                "asciifolding"
              ],
              "tokenizer": "standard",
              "type": "custom"
            }
          }
        },
        "creation_date": "1456180612715",
        "number_of_replicas": "1",
        "number_of_shards": "5",
        "uuid": "vBMZEasPSAyucXICur3GVA",
        "version": {
          "created": "2020099"
        }
      }
    },
    "warmers": {}
  }
}
And when I try to test the folding custom analyzer using the _analyze API, this is what I get as the output of http://localhost:9200/myindex/_analyze?analyzer=folding&text=%C3%89sta%20est%C3%A1%20loca
{
  "tokens": [
    {
      "token": "Ésta",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "está",
      "start_offset": 5,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "loca",
      "start_offset": 10,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
As you can see, the returned tokens are Ésta, está, loca instead of esta, esta, loca. What's going on? It seems that this folding analyzer is being ignored.
Looks like a simple typo when you are creating your index.
In your "analysis":{"analyzer":{...}} block, this:
"token_filters": [...]
Should be
"filter": [...]
Check the documentation for confirmation of this. Because your filter array wasn't named correctly, ES completely ignored it, and just decided to use the standard analyzer. Here is a small example written using the Sense chrome plugin. Execute them in order:
DELETE /test

PUT /test
{
  "analysis": {
    "analyzer": {
      "folding": {
        "type": "custom",
        "filter": [
          "lowercase",
          "asciifolding"
        ],
        "tokenizer": "standard"
      }
    }
  }
}

GET /test/_analyze
{
  "analyzer": "folding",
  "text": "Ésta está loca"
}
And the results of the last GET /test/_analyze:
"tokens": [
{
"token": "esta",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "esta",
"start_offset": 5,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "loca",
"start_offset": 10,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 2
}
]
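One way to catch this class of mistake early is to read the settings back after creating the index and confirm that the analyzer definition round-trips exactly as you wrote it (a general sanity check, not part of the original answer; newer Elasticsearch versions also validate analysis settings more strictly, so a typo like token_filters may simply be rejected at creation time):
GET /test/_settings?pretty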

Custom tokenizer not generating tokens as expected if text contains special characters like @, #

I have defined the following tokenizer:
PUT /testanlyzer2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ngram_analyzer": {
          "tokenizer": "my_ngram_tokenizer"
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": "1",
          "max_gram": "3",
          "token_chars": [ "letter", "digit", "symbol", "currency_symbol", "modifier_symbol", "other_symbol" ]
        }
      }
    }
  }
}
For the following request:
GET /testanlyzer2/_analyze?analyzer=my_ngram_analyzer&text="i a#m not available 9177"
Result is:
{
  "tokens": [
    {
      "token": "i",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "a",
      "start_offset": 3,
      "end_offset": 4,
      "type": "word",
      "position": 2
    }
  ]
}
For the following request:
GET /testanlyzer2/_analyze?analyzer=my_ngram_analyzer&text="i a#m not available 9177"
Result is:
Request failed to get to the server (status code: 0).
The expected result should contain these special characters (@, #, currency symbols, etc.) as tokens. Please correct me if anything is wrong in my custom tokenizer.
--Thanks
# is a special character in Sense (if you are using Marvel's Sense dashboard) and it comments out the rest of the line.
To rule out any HTML escaping/Sense special-character issues, I would test it like this:
PUT /testanlyzer2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ngram_analyzer": {
          "tokenizer": "keyword",
          "filter": [
            "substring"
          ]
        }
      },
      "filter": {
        "substring": {
          "type": "nGram",
          "min_gram": 1,
          "max_gram": 3
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "my_ngram_analyzer"
        }
      }
    }
  }
}

POST /testanlyzer2/test/1
{
  "text": "i a@m not available 9177"
}

POST /testanlyzer2/test/2
{
  "text": "i a#m not available 9177"
}

GET /testanlyzer2/test/_search
{
  "fielddata_fields": ["text"]
}
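To verify that special characters survive analysis with this keyword-tokenizer plus nGram-filter setup, an _analyze call can be run directly. A minimal check (the request-body form of _analyze shown here works on 2.x and later; earlier versions used query-string parameters):
GET /testanlyzer2/_analyze
{
  "analyzer": "my_ngram_analyzer",
  "text": "a#m"
}
Because the keyword tokenizer emits the whole input as a single token and the nGram token filter has no token_chars restriction, grams such as a#, #m, and a#m should all appear in the output.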
