Defining custom tokenizer in Elasticsearch - elasticsearch

This is how I am trying to define a custom tokenizer in Elasticsearch:
"pattern" : "[\-s+]",
but when I run this I get the response shown below:
"pattern" : """[-s+]""",
Notice that in the output I get additional quotes at the beginning and the end: "pattern" : """[-s+]""". If I don't use any escape characters this works fine, but when I use an escape character the value comes back wrapped in these extra quotes. Any help?

\ is a reserved Lucene operator, so you have to escape it.
https://www.elastic.co/guide/en/elasticsearch/reference/current/regexp-syntax.html
Please try it this way:
PUT test_varun
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "[\\-s+]"
        }
      }
    }
  }
}
If that doesn't work, please attach an example input/output so I can reproduce it on my end.
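For example (a quick sketch, assuming the index above was created and using made-up sample text), you can verify what the tokenizer produces with the _analyze API; with the pattern [\\-s+], the text below should be split on the - and + characters:
POST test_varun/_analyze
{
  "analyzer": "my_analyzer",
  "text": "foo-bar+baz"
}
If the index was created with the escaped pattern above, this should return the tokens foo, bar and baz.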

Related

Extend Elasticsearch's standard Analyzer with additional characters to tokenize on

I basically want the functionality of the inbuilt standard analyzer that additionally tokenizes on underscores.
Currently the standard analyzer keeps brown_fox_has as a single token, but I want [brown, fox, has] instead. The simple analyzer loses some functionality compared to the standard one, so I want to keep the standard behaviour as much as possible.
The docs only show how to add filters and other non-tokenizer changes, but I want to keep all of the standard tokenizer's behaviour while additionally tokenizing on underscores.
I could create a character filter to map _ to - and the standard tokenizer will do the job for me, but is there a better way?
es.indices.create(index="mine", body={
    "settings": {
        "analysis": {
            "analyzer": {
                "default": {
                    "type": "custom",
                    # "tokenize_on_chars": ["_"],  # I want this to work with the standard tokenizer without using char_group
                    "tokenizer": "standard",
                    "filter": ["lowercase"]
                }
            }
        },
    }
})
res = es.indices.analyze(index="mine", body={
    "field": "text",
    "text": "the quick brown_fox_has to be split"
})
Use a mapping char_filter and define it along with your preferred standard tokenizer:
POST /_analyze
{
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "_ =>\\u0020" // replace underscore with whitespace
      ]
    }
  ],
  "tokenizer": "standard",
  "text": "the quick brown_fox_has to be split"
}
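If the output looks right, the same idea can be persisted in the index settings instead of being passed to _analyze each time. A minimal sketch, reusing the index name "mine" from the question; the char filter name underscore_to_space is just chosen here for illustration:
PUT mine
{
  "settings": {
    "analysis": {
      "char_filter": {
        "underscore_to_space": {
          "type": "mapping",
          "mappings": [ "_ =>\\u0020" ]
        }
      },
      "analyzer": {
        "default": {
          "type": "custom",
          "char_filter": [ "underscore_to_space" ],
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}
With this in place, "brown_fox_has" should be indexed as the tokens brown, fox and has, while everything else still goes through the standard tokenizer.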

How to use Elasticsearch standard analyser without lower case

I'm trying to create an analyser in Elasticsearch using the presets of the "standard" analyser but with one change - no lowercasing of words.
I've tried chaining the whitespace and standard analysers like so:
PUT /standard_uppercase
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_standard": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "whitespace"
          ]
        }
      }
    }
  }
}
But this does not give the required results. Is there a way to override only the lowercase part of an analyser but retain all the existing features of the standard analyser?
Thanks in advance.
According to the documentation, the standard analyzer consists of:

Tokenizer
- Standard Tokenizer

Token Filters
- Standard Token Filter
- Lower Case Token Filter
- Stop Token Filter (disabled by default)

So, you could achieve your purpose this way:
PUT /standard_uppercase
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_standard": {
          "tokenizer": "standard",
          "filter": [
            "standard"
          ]
        }
      }
    }
  }
}
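As a quick check (a sketch, assuming a version where the standard token filter still exists; it is a no-op and was removed in newer releases), _analyze should now return tokens with their original case preserved:
GET standard_uppercase/_analyze
{
  "analyzer": "rebuilt_standard",
  "text": "The Quick BROWN Fox"
}
The expected tokens are The, Quick, BROWN and Fox, i.e. no lowercasing is applied.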

Analyze all uppercase tokens in a field

I would like to analyze the value of a text field in 2 ways: using the standard analysis, and a custom analysis that indexes only the all-uppercase tokens in the text.
For example, if the value is "This WHITE cat is very CUTE.", the only tokens that should be indexed by the custom analysis are "WHITE" and "CUTE". For this, I am using the Pattern Capture Token Filter with the pattern "(\b[A-Z]+\b)+?", but it is indexing all tokens and not just the uppercase ones.
Is Pattern Capture Token Filter the right one to use for this task? If yes, what am I doing wrong? If not, how do I get this done? Please help.
You should instead use a pattern_replace char_filter:
PUT test
{
  "settings": {
    "analysis": {
      "char_filter": {
        "filter_lowercase": {
          "type": "pattern_replace",
          "pattern": "[A-Z][a-z]+|[a-z]+",
          "replacement": ""
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "filter_lowercase"
          ]
        }
      }
    }
  }
}
GET test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "This WHITE cat is very CUTE"
}
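If you also want the default analysis alongside the uppercase-only one, as the question asks, one option is a multi-field mapping. A sketch, assuming Elasticsearch 7+ and a field called text; the sub-field name uppercase_only is just illustrative:
PUT test/_mapping
{
  "properties": {
    "text": {
      "type": "text",
      "fields": {
        "uppercase_only": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}
Then text is indexed with the default standard analysis and text.uppercase_only keeps only the all-uppercase tokens.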

Elasticsearch strange filter behaviour

I'm trying to replace a particular string inside a field, so I used a custom analyser and a character filter just as described in the docs, but it didn't work.
Here are my index settings:
{
  "settings": {
    "analysis": {
      "char_filter": {
        "doule_colon_to_space": {
          "type": "mapping",
          "mappings": [ "::=> " ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [ "doule_colon_to_space" ],
          "tokenizer": "standard"
        }
      }
    }
  }
}
which should replace all double colons (::) in a field with spaces. I then update my mapping to use the analyzer:
{
  "posts": {
    "properties": {
      "id": {
        "type": "long"
      },
      "title": {
        "type": "string",
        "analyzer": "my_analyzer",
        "fields": {
          "simple": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      }
    }
  }
}
Then I put a document in the index:
{
  "id": 1,
  "title": "Person::Bruce Wayne"
}
I then test whether the analyzer works, but it appears it doesn't - when I send https://localhost:/first_test/_analyze?analyzer=my_analyzer&text=Person::Someone+Close, I get two tokens back - 'PersonSomeone' (together) and 'Close'. Am I doing this right? Maybe I should escape the space somehow? I use Elasticsearch 1.3.4.
I think the whitespace in your char_filter pattern is being ignored. Try using the unicode escape sequence for a single space instead:
"mappings": [ "::=>\\u0020"]
Update:
In response to your comment, the short answer is yes, the example is wrong. The docs do suggest that you can use a mapping character filter to replace a token with another one which is padded by whitespace, but the code disagrees.
The source code for the MappingCharFilterFactory uses this regex to parse the settings:
// source => target
private static Pattern rulePattern = Pattern.compile("(.*)\\s*=>\\s*(.*)\\s*$");
This regex matches (and effectively discards) any whitespace (\\s*) surrounding the second replacement token ((.*)), so it seems that you cannot use leading or trailing whitespace as part of your replacement mapping (though it could include interstitial whitespace). Even if the regex were different, the matched token is trim()ed, which would have removed any leading and trailing whitespace.
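Putting the pieces together, a corrected version of the settings from the question would use the unicode escape in the mapping (a sketch, reusing the index and analyzer names from the question):
PUT first_test
{
  "settings": {
    "analysis": {
      "char_filter": {
        "doule_colon_to_space": {
          "type": "mapping",
          "mappings": [ "::=>\\u0020" ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [ "doule_colon_to_space" ],
          "tokenizer": "standard"
        }
      }
    }
  }
}
Analyzing Person::Someone Close with my_analyzer should then yield the tokens Person, Someone and Close.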

Semi-exact (complete) match in ElasticSearch

Is there a way to require a complete (though not necessarily exact) match in ElasticSearch?
For instance, if a field has the term "I am a little teapot short and stout", I would like to match on " i am a LITTLE TeaPot short and stout! " but not just "teapot short and stout". I've tried the term filter, but that requires an actual exact match.
If your "not necessarily exact" definition refers to uppercase/lowercase letters combination and the punctuation marks (like ! you have in your example), this would be a solution, not too simple and obvious tough:
The mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_keyword_lowercase": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "trim",
            "my_pattern_replace"
          ]
        }
      },
      "filter": {
        "my_pattern_replace": {
          "type": "pattern_replace",
          "pattern": "!",
          "replacement": ""
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "my_keyword_lowercase"
        }
      }
    }
  }
}
The idea here is the following:
- use a keyword tokenizer to keep the text as is, rather than splitting it into tokens
- use the lowercase filter to get rid of the mixed uppercase/lowercase characters
- use the trim filter to get rid of the trailing and leading whitespace
- use a pattern_replace filter to get rid of the punctuation. This is needed because the keyword tokenizer won't touch the characters inside the text; a standard analyzer would, but the standard analyzer would also split the text, whereas you need it as is.
And this is the query you would use for the mapping above:
{
  "query": {
    "match": {
      "text": " i am a LITTLE TeaPot short and stout! "
    }
  }
}
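As a sanity check (a sketch; the index name below is just a placeholder for wherever the settings above were applied), analyzing the query text should collapse it into a single lowercased token with the exclamation mark stripped:
GET my_index/_analyze
{
  "analyzer": "my_keyword_lowercase",
  "text": " i am a LITTLE TeaPot short and stout! "
}
The expected output is the single token "i am a little teapot short and stout", which is exactly what gets indexed for the original document, so the match query above finds it.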
