What keywords are used by the Swedish analyzer? - elasticsearch

On this part of the elasticsearch docs, it says that the Swedish analyzer can be reimplemented like this:
PUT /swedish_example
{
  "settings": {
    "analysis": {
      "filter": {
        "swedish_stop": {
          "type": "stop",
          "stopwords": "_swedish_"
        },
        "swedish_keywords": {
          "type": "keyword_marker",
          "keywords": ["exempel"]
        },
        "swedish_stemmer": {
          "type": "stemmer",
          "language": "swedish"
        }
      },
      "analyzer": {
        "swedish": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "swedish_stop",
            "swedish_keywords",
            "swedish_stemmer"
          ]
        }
      }
    }
  }
}
My question is, how does this analyser recognise keywords? Sure, the keywords can be defined in the settings.analysis.filter.swedish_keywords.keywords field, but what if I'm too lazy to do that? Does Elasticsearch look at some other keywords list of pre-defined Swedish keywords? Because in the example above it looks like there is no such list provided in the settings.
In other words, is it solely up to me to define keywords or does Elasticsearch look at some other list to find keywords by default?

Yes, you need to specify this list yourself. Otherwise, the filter does nothing.
As per the Elasticsearch documentation:
Keyword Marker Token Filter
Protects words from being modified by stemmers. Must be placed before
any stemming filters.
Alternatively, you could specify:
keywords_path
A path (either relative to config location, or absolute) to a list of
words.
keywords_pattern
A regular expression pattern to match against words in the text.
More information about this filter - https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-keyword-marker-tokenfilter.html
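To make the effect of the keyword list concrete, here is a minimal Python sketch of the idea, not of Elasticsearch's actual implementation. The `toy_stem` suffix stripper is a made-up stand-in for the Swedish stemmer and `analyze` is a hypothetical helper; only the ordering (mark keywords first, stem afterwards) mirrors the filter chain in the question.

```python
def toy_stem(token):
    """Strip a few common suffixes. A toy stand-in, NOT the real Swedish stemmer."""
    for suffix in ("arna", "erna", "en", "et", "ar", "er"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def analyze(text, keywords=frozenset()):
    """Lowercase, split on whitespace, and stem every token not listed in keywords."""
    tokens = text.lower().split()
    return [tok if tok in keywords else toy_stem(tok) for tok in tokens]


# With an empty keyword list, every token goes through the stemmer.
unprotected = analyze("exemplet bilarna")
# Listing a word protects it, which is the role keyword_marker plays.
protected = analyze("exemplet bilarna", keywords={"exemplet"})

print(unprotected)
print(protected)
```

With an empty keyword list the marker step changes nothing, which is the behavior the answer describes.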

Related

Extend Elasticsearch's standard Analyzer with additional characters to tokenize on

I basically want the functionality of the inbuilt standard analyzer that additionally tokenizes on underscores.
Currently the standard analyzer will keep brown_fox_has as a single token, but I want [brown, fox, has] instead. The simple analyzer loses some functionality compared to the standard one, so I want to stay as close to the standard analyzer as possible.
The docs only show how to add filters and other non-tokenizer changes, but I want to keep all of the standard tokenizer's behavior while additionally splitting on underscores.
I could create a character filter to map _ to - and the standard tokenizer will do the job for me, but is there a better way?
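from elasticsearch import Elasticsearch

# Assumes a local node; adjust the URL for your cluster.
es = Elasticsearch("http://localhost:9200")

es.indices.create(index="mine", body={
    "settings": {
        "analysis": {
            "analyzer": {
                "default": {
                    "type": "custom",
                    # "tokenize_on_chars": ["_"],  # I want this to work with the standard tokenizer without using char_group
                    "tokenizer": "standard",
                    "filter": ["lowercase"]
                }
            }
        },
    }
})
# Run the sample text through the index's default analyzer
res = es.indices.analyze(index="mine", body={
    "field": "text",
    "text": "the quick brown_fox_has to be split"
})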
Use a mapping character filter together with the standard tokenizer. The filter replaces each underscore with a space before the text reaches the tokenizer:
POST /_analyze
{
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "_ =>\\u0020" // replace underscore with a space
      ]
    }
  ],
  "tokenizer": "standard",
  "text": "the quick brown_fox_has to be split"
}
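To carry the same idea into the index settings from the question, the character filter can be registered by name and attached to the custom analyzer. The sketch below only builds the request body; `underscore_to_space` is a hypothetical filter name, and `rough_tokens` is a local approximation for sanity-checking the expected split, not the standard tokenizer itself.

```python
import json
import re

# "underscore_to_space" is a made-up name; "default" comes from the question.
index_body = {
    "settings": {
        "analysis": {
            "char_filter": {
                "underscore_to_space": {
                    "type": "mapping",
                    # Serializes to "_ =>\\u0020" in JSON, as in the answer above.
                    "mappings": ["_ =>\\u0020"],
                }
            },
            "analyzer": {
                "default": {
                    "type": "custom",
                    "char_filter": ["underscore_to_space"],
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                }
            },
        }
    }
}


def rough_tokens(text):
    """Approximation only: map "_" to " ", lowercase, split on non-word characters."""
    return re.findall(r"\w+", text.replace("_", " ").lower())


print(json.dumps(index_body, indent=2))
print(rough_tokens("the quick brown_fox_has to be split"))
```

The body could then be passed to `es.indices.create(index="mine", body=index_body)` as in the question, assuming a client named `es` is already configured.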

Elasticsearch: search with wildcard and custom analyzer

Requirement: Search with special characters in a text field.
My solution so far: Use a wildcard query with a custom analyzer. I want to use wildcards because it seems the easiest way to do partial searches in a long string with multiple search keys. See the ES query below.
I have an index called "invoices", and it has a document with one of its fields set to
"searchString" : "I000010-1 000010 3901 North Saginaw Road add 2 Midland MI 48640 US MS Dhoni MSD-Company MSD (777) 777-7777 (333) 333-3333 sandeep#xyz.io msd-company msdhoni Dhoni, MS (3241480)"
Note: This field acts as the deprecated _all field in ES.
Index Mapping for this field:
"searchString": {"type": "text","analyzer": "multi_level_analyzer"},
Analyzer settings:
PUT invoices
{
  "settings": {
    "analysis": {
      "analyzer": {
        "multi_level_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}
My query looks something like this:
GET invoices/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "wildcard": {
            "searchString": {
              "value": "msd-company*",
              "boost": 1.0
            }
          }
        },
        {
          "wildcard": {
            "searchString": {
              "value": "Saginaw*",
              "boost": 1.0
            }
          }
        }
      ]
    }
  }
}
My question:
Earlier, when I was not using a custom analyzer, the above query worked, BUT I could not search for words with special characters like "msd-company".
After attaching the custom analyzer (multi_level_analyzer), the above query fails to return any results. I changed the wildcard query by prepending an asterisk to the search key, and for some reason it works now. (referred this answer)
I want to know the impact of using "*msd-company*" instead of "msd-company*" in the wildcard query for the text field.
How can I still use the wildcard query "msd-company*" with the custom analyzer?
I am open to suggestions for any other approach to my problem statement.
I have solved my problem by changing the mapping of the said field to this:
"searchString": {"type": "text","analyzer": "multi_level_analyzer", "search_analyzer": "standard"},
But since wildcard queries are expensive, I would still like to know if there exists a better solution to satisfy my search use case.
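A wildcard query is matched against the individual terms stored in the index, not against the original string. The sketch below imitates that locally with Python's `fnmatchcase` over tokens produced by a whitespace split plus lowercasing, loosely mirroring multi_level_analyzer. It is an illustration under those assumptions only: `index_terms` and `wildcard_hits` are hypothetical helpers, and the sketch does not model html_strip, asciifolding, or any normalization Elasticsearch may apply to the pattern itself, which can differ by version and mapping.

```python
from fnmatch import fnmatchcase


def index_terms(text):
    """Whitespace tokenizer followed by a lowercase filter (simplified)."""
    return [token.lower() for token in text.split()]


def wildcard_hits(pattern, terms):
    """Return the indexed terms that a case-sensitive wildcard pattern matches."""
    return [term for term in terms if fnmatchcase(term, pattern)]


# A shortened version of the searchString value from the question.
terms = index_terms("3901 North Saginaw Road MSD-Company msd-company msdhoni")

for pattern in ("msd-company*", "*msd-company*", "saginaw*", "Saginaw*"):
    print(pattern, "->", wildcard_hits(pattern, terms))
```

In this model a leading asterisk only changes the result when the wanted text starts in the middle of a term. The Elasticsearch documentation advises against patterns that begin with `*` or `?` because they increase the number of terms that must be checked.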

how to remove a stop word from default _english_ stop-words list in elasticsearch?

I am filtering the text using the default English stop words. I found that 'and' is a stop word in English, but I need to search for results containing 'and'. I just want to remove the word 'and' from the default English stop-word filter and keep using the other stop words as usual. My Elasticsearch schema looks similar to the one below.
"settings": {
"analysis": {
"analyzer": {
"default": {
"tokenizer": "whitespace" ,
"filter": ["stop_english"]
}
}....,
"filter":{
"stop_english": {
"type": "stop",
"stopwords": "_english_"
}
}
I expect the _search API to return the docs containing the word 'and'.
You can set the stop words for a given index manually like this:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": ["and", "is", "the"]
        }
      }
    }
  }
}
I also found the list of English stop words used by Elasticsearch here. If you configure a new index with that same list minus "and" and reindex your data into it, you should be good to go!
Regarding reindexing your data, check out the reindex API. I believe it is required because tokenization happens at ingestion time, so you need to redo the ingestion by reindexing. It is required most of the time when changing index settings or making some mapping changes (not 100% sure, but I think it makes sense).
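The "same list minus and" step can be sketched in Python as follows. The word list is the English stop set as I understand it from Lucene, so treat it as an assumption and verify it against the list linked above for your version. The filter and analyzer names are taken from the question; everything else is illustrative.

```python
import json

# Assumed contents of the _english_ stop list; verify against your version.
ENGLISH_STOPWORDS = [
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
]

# Keep every stop word except the one that must stay searchable.
custom_stopwords = [word for word in ENGLISH_STOPWORDS if word != "and"]

settings_body = {
    "settings": {
        "analysis": {
            "filter": {
                "stop_english": {"type": "stop", "stopwords": custom_stopwords}
            },
            "analyzer": {
                "default": {
                    "tokenizer": "whitespace",
                    "filter": ["stop_english"],
                }
            },
        }
    }
}

print(json.dumps(settings_body, indent=2))
```

The printed body is what a new index would be created with before reindexing the existing data into it.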

Re-using inbuilt language filters?

I saw the question here, which shows how one can create a custom analyzer to have both synonym support and support for languages.
However, it seems to create its own stemmer and stopwords collection as well.
What if I want to add synonyms to the "danish" inbuilt analyzer? Can I refer to the inbuilt Danish stemmer and stopwords filter? As an example, is it just called danish_stemmer and danish_stopwords?
Perhaps a list of inbuilt filters would help - where can I see the names of these inbuilt filters?
For each pre-built language analyzer there is an example of how to rebuild it. For Danish there is this example:
PUT /danish_example
{
  "settings": {
    "analysis": {
      "filter": {
        "danish_stop": {
          "type": "stop",
          "stopwords": "_danish_"
        },
        "danish_keywords": {
          "type": "keyword_marker",
          "keywords": ["eksempel"]
        },
        "danish_stemmer": {
          "type": "stemmer",
          "language": "danish"
        }
      },
      "analyzer": {
        "rebuilt_danish": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "danish_stop",
            "danish_keywords",
            "danish_stemmer"
          ]
        }
      }
    }
  }
}
This is essentially building your own custom analyzer.
The list of available stemmers can be found here. The list of available pre-built stopwords lists can be found here.
Hope that helps!
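Starting from the rebuilt analyzer, adding synonyms comes down to defining one more filter and naming it in the chain. The sketch below builds such a request body in Python. The names `danish_synonyms` and `danish_with_synonyms` and the synonym pair are hypothetical, the keyword_marker filter is left out because it only matters when there are words to protect from stemming, and placing the synonym filter before the stemmer is one plausible position, not a confirmed recommendation.

```python
import json

danish_body = {
    "settings": {
        "analysis": {
            "filter": {
                "danish_stop": {"type": "stop", "stopwords": "_danish_"},
                "danish_stemmer": {"type": "stemmer", "language": "danish"},
                # Hypothetical synonym filter with a made-up example pair.
                "danish_synonyms": {
                    "type": "synonym",
                    "synonyms": ["bil, automobil"],
                },
            },
            "analyzer": {
                "danish_with_synonyms": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "danish_stop",
                        "danish_synonyms",
                        "danish_stemmer",
                    ],
                }
            },
        }
    }
}

analysis = danish_body["settings"]["analysis"]
chain = analysis["analyzer"]["danish_with_synonyms"]["filter"]
# "lowercase" is built in; every other name must be defined under "filter".
undefined = [name for name in chain if name != "lowercase" and name not in analysis["filter"]]

print(json.dumps(danish_body, indent=2))
print("undefined filters:", undefined)
```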

Analyze all uppercase tokens in a field

I would like to analyze the value of a text field in two ways: using standard analysis, and using a custom analysis that indexes only the all-uppercase tokens in the text.
For example, if the value is "This WHITE cat is very CUTE.", the only tokens that should be indexed for the custom analysis are "WHITE" and "CUTE". For this, I am using the Pattern Capture Token Filter with the pattern "(\b[A-Z]+\b)+?". But this is indexing all tokens and not just the uppercase ones.
Is Pattern Capture Token Filter the right one to use for this task? If yes, what am I doing wrong? If not, how do I get this done? Please help.
Use a pattern_replace character filter instead, which removes the non-uppercase words before tokenization:
PUT test
{
  "settings": {
    "analysis": {
      "char_filter": {
        "filter_lowercase": {
          "type": "pattern_replace",
          "pattern": "[A-Z][a-z]+|[a-z]+",
          "replacement": ""
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "filter_lowercase"
          ]
        }
      }
    }
  }
}
GET test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "This WHITE cat is very CUTE"
}
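The regular expression can be tried locally before it goes into the index settings. This Python sketch applies the same pattern and then splits on non-word characters as a rough stand-in for the standard tokenizer; `uppercase_tokens` is a hypothetical helper, and Python's `re` is assumed to treat this simple pattern the same way as the Java regex engine Elasticsearch uses.

```python
import re

# Same pattern as the filter_lowercase character filter above.
LOWERCASE_WORDS = re.compile(r"[A-Z][a-z]+|[a-z]+")


def uppercase_tokens(text):
    """Delete capitalized and lowercase runs, then split what is left into tokens."""
    stripped = LOWERCASE_WORDS.sub("", text)
    return re.findall(r"\w+", stripped)


print(uppercase_tokens("This WHITE cat is very CUTE."))
print(uppercase_tokens("WHITEcat"))
```

One caveat the sketch makes visible: because the pattern works on characters rather than whole tokens, a mixed-case word such as "WHITEcat" is reduced to "WHIT" instead of being dropped.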
