Extend Elasticsearch's standard analyzer with additional characters to tokenize on

I basically want the functionality of the inbuilt standard analyzer, but tokenizing on underscores as well.
Currently the standard analyzer keeps brown_fox_has as a single token, but I want [brown, fox, has] instead. The simple analyzer loses some of the standard analyzer's functionality, so I want to stay as close to the standard analyzer as possible.
The docs only show how to add filters and make other non-tokenizer changes, but I want to keep all of the standard tokenizer's behavior while additionally splitting on underscores.
I could create a character filter that maps _ to - so that the standard tokenizer does the job for me, but is there a better way?
es.indices.create(index="mine", body={
    "settings": {
        "analysis": {
            "analyzer": {
                "default": {
                    "type": "custom",
                    # "tokenize_on_chars": ["_"],  # I want this to work with the standard tokenizer, without using char_group
                    "tokenizer": "standard",
                    "filter": ["lowercase"]
                }
            }
        }
    }
})
res = es.indices.analyze(index="mine", body={
    "field": "text",
    "text": "the quick brown_fox_has to be split"
})

Use a mapping character filter and define it along with the standard tokenizer:
POST /_analyze
{
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "_ => \\u0020" // replace underscore with whitespace
      ]
    }
  ],
  "tokenizer": "standard",
  "text": "the quick brown_fox_has to be split"
}
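
If you want this baked into the index rather than passed ad hoc to _analyze (as in the Python snippet above), the same mapping char filter can be attached to the custom analyzer. A minimal sketch; the filter name underscore_to_space is made up for illustration:
PUT /mine
{
  "settings": {
    "analysis": {
      "char_filter": {
        "underscore_to_space": {
          "type": "mapping",
          "mappings": ["_ => \\u0020"] // turn _ into a space before tokenizing
        }
      },
      "analyzer": {
        "default": {
          "type": "custom",
          "char_filter": ["underscore_to_space"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
Analyzing "the quick brown_fox_has to be split" against this index should then yield [the, quick, brown, fox, has, to, be, split].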

Related

Elasticsearch how to extend analyzer?

Elasticsearch ships with built-in analyzers, such as the standard analyzer.
Suppose you want to modify the tokenizer of the standard analyzer. Can you do something like
"my_analyzer": {
  "tokenizer": "my_tokenizer",
  "char_filter": "standard_char_filter",
  "filter": "standard_filter"
}
i.e., base your analyzer on an existing analyzer and customize it?
If you want to redefine the standard analyzer you need to define a custom one, like this:
PUT /standard_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard", // <-- change this
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}
Note that since the standard analyzer uses the standard tokenizer, if you change the tokenizer to something else it is no longer the standard analyzer; it becomes your own custom analyzer with a lowercase token filter. That said, it's perfectly fine to create your custom analyzer based on the standard one.
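
As a quick sanity check, assuming the standard_example index above exists, you can exercise the custom analyzer directly:
GET /standard_example/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The QUICK Brown fox"
}
With the standard tokenizer plus the lowercase filter, this should return [the, quick, brown, fox], the same as the built-in standard analyzer.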

Re-using inbuilt language filters?

I saw the question here, which shows how one can create a custom analyzer to have both synonym support and support for languages.
However, it seems to create its own stemmer and stopwords collection as well.
What if I want to add synonyms to the inbuilt "danish" analyzer? Can I refer to the inbuilt Danish stemmer and stopwords filter? For example, are they just called danish_stemmer and danish_stopwords?
Perhaps a list of inbuilt filters would help: where can I see their names?
For each pre-built language analyzer there is an example of how to rebuild it. For Danish there is this example:
PUT /danish_example
{
  "settings": {
    "analysis": {
      "filter": {
        "danish_stop": {
          "type": "stop",
          "stopwords": "_danish_"
        },
        "danish_keywords": {
          "type": "keyword_marker",
          "keywords": ["eksempel"]
        },
        "danish_stemmer": {
          "type": "stemmer",
          "language": "danish"
        }
      },
      "analyzer": {
        "rebuilt_danish": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "danish_stop",
            "danish_keywords",
            "danish_stemmer"
          ]
        }
      }
    }
  }
}
This is essentially building your own custom analyzer.
The list of available stemmers can be found here. The list of available pre-built stopwords lists can be found here.
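
To answer the synonym part of the question: since the rebuilt analyzer is just a filter chain, you can splice a synonym filter into it while keeping the same stop and stemmer definitions. A sketch; the index name, filter name, and synonym pair are made up for illustration, and placing the synonym filter before the stemmer is one reasonable choice (it lets you write the synonym list in unstemmed form):
PUT /danish_synonym_example
{
  "settings": {
    "analysis": {
      "filter": {
        "danish_stop": {
          "type": "stop",
          "stopwords": "_danish_"
        },
        "danish_synonyms": {
          "type": "synonym",
          "synonyms": ["hund, vovse"] // hypothetical synonym pair
        },
        "danish_stemmer": {
          "type": "stemmer",
          "language": "danish"
        }
      },
      "analyzer": {
        "danish_with_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "danish_stop",
            "danish_synonyms",
            "danish_stemmer"
          ]
        }
      }
    }
  }
}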
Hope that helps!

How to use Elasticsearch standard analyser without lower case

I'm trying to create an analyser in Elasticsearch using the presets of the "standard" analyser, but with one change: no lowercasing of words.
I've tried chaining the whitespace and standard filters like so:
PUT /standard_uppercase
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_standard": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "whitespace"
          ]
        }
      }
    }
  }
}
But this does not give the required results. Is there a way to override only the lowercase part of an analyser but retain all the existing features of the standard analyser?
Thanks in advance.
According to the documentation:
Definition
The standard analyzer consists of:
Tokenizer:
  Standard Tokenizer
Token Filters:
  Standard Token Filter
  Lower Case Token Filter
  Stop Token Filter (disabled by default)
So you can achieve your purpose this way:
PUT /standard_uppercase
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_standard": {
          "tokenizer": "standard",
          "filter": [
            "standard"
          ]
        }
      }
    }
  }
}
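
Note that the standard token filter was a no-op and was removed in Elasticsearch 7.0 (deprecated in 6.x), so on current versions the equivalent is simply the standard tokenizer with an empty filter chain. A sketch:
PUT /standard_uppercase
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_standard": {
          "tokenizer": "standard",
          "filter": [] // no lowercase filter, so case is preserved
        }
      }
    }
  }
}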

What keywords are used by the Swedish analyzer?

On this part of the Elasticsearch docs, it says that the Swedish analyzer can be reimplemented like this:
PUT /swedish_example
{
  "settings": {
    "analysis": {
      "filter": {
        "swedish_stop": {
          "type": "stop",
          "stopwords": "_swedish_"
        },
        "swedish_keywords": {
          "type": "keyword_marker",
          "keywords": ["exempel"]
        },
        "swedish_stemmer": {
          "type": "stemmer",
          "language": "swedish"
        }
      },
      "analyzer": {
        "swedish": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "swedish_stop",
            "swedish_keywords",
            "swedish_stemmer"
          ]
        }
      }
    }
  }
}
My question is, how does this analyser recognise keywords? Sure, the keywords can be defined in the settings.analysis.filter.swedish_keywords.keywords field, but what if I'm too lazy to do that? Does Elasticsearch fall back on some pre-defined list of Swedish keywords? In the example above, no such list is provided in the settings.
In other words, is it solely up to me to define keywords, or does Elasticsearch consult some default list?
Yes, you need to specify this list yourself; otherwise, the filter won't do anything.
As per the documentation of Elasticsearch:
Keyword Marker Token Filter
Protects words from being modified by stemmers. Must be placed before any stemming filters.
Alternatively, you could specify:
keywords_path
A path (either relative to the config location, or absolute) to a list of words.
keywords_pattern
A regular expression pattern to match against words in the text.
More information about this filter - https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-keyword-marker-tokenfilter.html
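
If you'd rather maintain the list in a file than inline, the keyword_marker filter can point at one via keywords_path. A sketch; the index name and file path are hypothetical, and the path is resolved relative to the node's config directory:
PUT /swedish_example_file
{
  "settings": {
    "analysis": {
      "filter": {
        "swedish_keywords": {
          "type": "keyword_marker",
          "keywords_path": "analysis/swedish_keywords.txt" // hypothetical file, one word per line
        },
        "swedish_stemmer": {
          "type": "stemmer",
          "language": "swedish"
        }
      },
      "analyzer": {
        "swedish": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "swedish_keywords",
            "swedish_stemmer"
          ]
        }
      }
    }
  }
}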

Analyze all uppercase tokens in a field

I would like to analyze the value of a text field in two ways: using the standard analyzer, and using a custom analyzer that indexes only the all-uppercase tokens in the text.
For example, if the value is "This WHITE cat is very CUTE.", the only tokens that should be indexed by the custom analysis are "WHITE" and "CUTE". For this, I am using the Pattern Capture Token Filter with the pattern "(\b[A-Z]+\b)+?". But this is indexing all tokens, not just the uppercase ones.
Is the Pattern Capture Token Filter the right one for this task? If yes, what am I doing wrong? If not, how do I get this done? Please help.
You should instead use a pattern_replace char_filter:
PUT test
{
  "settings": {
    "analysis": {
      "char_filter": {
        "filter_lowercase": {
          "type": "pattern_replace",
          "pattern": "[A-Z][a-z]+|[a-z]+",
          "replacement": ""
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "filter_lowercase"
          ]
        }
      }
    }
  }
}
GET test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "This WHITE cat is very CUTE"
}
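
The char filter strips capitalized and all-lowercase words before tokenization, so the _analyze call should come back with only the all-uppercase tokens. A trimmed sketch of the response (the real one also carries offsets, positions, and types):
{
  "tokens": [
    { "token": "WHITE" },
    { "token": "CUTE" }
  ]
}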
