Re-using inbuilt language filters? - elasticsearch

I saw the question here, which shows how one can create a custom analyzer to have both synonym support and support for languages.
However, it seems to create its own stemmer and stopwords collection as well.
What if I want to add synonyms to the "danish" inbuilt analyzer? Can I refer to the inbuilt Danish stemmer and stopwords filter? As an example, is it just called danish_stemmer and danish_stopwords?
Perhaps a list of inbuilt filters would help - where can I see the names of these inbuilt filters?

For each pre-built language analyzer the documentation provides an example of how to rebuild it. For Danish there is this example:
PUT /danish_example
{
  "settings": {
    "analysis": {
      "filter": {
        "danish_stop": {
          "type": "stop",
          "stopwords": "_danish_"
        },
        "danish_keywords": {
          "type": "keyword_marker",
          "keywords": ["eksempel"]
        },
        "danish_stemmer": {
          "type": "stemmer",
          "language": "danish"
        }
      },
      "analyzer": {
        "rebuilt_danish": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "danish_stop",
            "danish_keywords",
            "danish_stemmer"
          ]
        }
      }
    }
  }
}
This is essentially building your own custom analyzer.
The list of available stemmer languages is documented with the stemmer token filter, and the available pre-built stop word lists with the stop token filter.
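If the goal is to add synonyms on top of the Danish analyzer, a minimal sketch would be to slot a synonym token filter into that same chain, typically after lowercase and before the stemmer. The filter name danish_synonyms and the synonym pair below are my own examples (not part of the built-in analyzer), and the keyword_marker filter is omitted for brevity:
PUT /danish_example_with_synonyms
{
  "settings": {
    "analysis": {
      "filter": {
        "danish_stop": {
          "type": "stop",
          "stopwords": "_danish_"
        },
        "danish_synonyms": {
          "type": "synonym",
          "synonyms": ["hus, bolig"]
        },
        "danish_stemmer": {
          "type": "stemmer",
          "language": "danish"
        }
      },
      "analyzer": {
        "rebuilt_danish": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "danish_stop",
            "danish_synonyms",
            "danish_stemmer"
          ]
        }
      }
    }
  }
}
Because the synonym filter sits before the stemmer here, the synonym entries are matched against lowercased, unstemmed tokens; placing it after danish_stemmer would require listing stemmed forms instead.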
Hope that helps!

Related

Extend Elasticsearch's standard Analyzer with additional characters to tokenize on

I basically want the functionality of the inbuilt standard analyzer that additionally tokenizes on underscores.
Currently the standard analyzer keeps brown_fox_has as a single token, but I want [brown, fox, has] instead. The simple analyzer loses some functionality compared to the standard one, so I want to keep the standard behaviour as much as possible.
The docs only show how you would add filters and other non-tokenizer changes, but I want to keep all of the standard tokenizer behaviour while additionally splitting on underscores.
I could create a character filter to map _ to - and the standard tokenizer will do the job for me, but is there a better way?
from elasticsearch import Elasticsearch

es = Elasticsearch()

es.indices.create(index="mine", body={
    "settings": {
        "analysis": {
            "analyzer": {
                "default": {
                    "type": "custom",
                    # "tokenize_on_chars": ["_"],  # I want this to work with the standard tokenizer without using char_group
                    "tokenizer": "standard",
                    "filter": ["lowercase"]
                }
            }
        }
    }
})
res = es.indices.analyze(index="mine", body={
    "field": "text",
    "text": "the quick brown_fox_has to be split"
})
Use a mapping char_filter and define it along with the standard tokenizer:
POST /_analyze
{
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "_ => \\u0020"
      ]
    }
  ],
  "tokenizer": "standard",
  "text": "the quick brown_fox_has to be split"
}
The mapping replaces each underscore with a space (\u0020), so the standard tokenizer then splits brown_fox_has into [brown, fox, has].
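If you want that replacement applied at index time rather than only in an ad-hoc _analyze call, the same mapping char_filter could be wired into the index's default analyzer. A rough sketch based on the question's index "mine" (the char_filter name underscore_to_space is my own):
PUT /mine
{
  "settings": {
    "analysis": {
      "char_filter": {
        "underscore_to_space": {
          "type": "mapping",
          "mappings": ["_ => \\u0020"]
        }
      },
      "analyzer": {
        "default": {
          "type": "custom",
          "char_filter": ["underscore_to_space"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}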

Elasticsearch - Do searches for alternative country codes

I have a document with a field called 'countryCode'. I have a term query that searches against its keyword value, but I'm having some issues:
Some records say UK while others say GB
Some records say US while others say USA
And the list goes on...
Can I instruct my index to handle all those variations somehow, instead of me having to expand the terms on my query filter?
What you are looking for is a way to make tokens match equivalent tokens that may not share any characters at all. This is only possible using synonyms.
Elasticsearch lets you configure synonyms and have your queries use them, returning results accordingly.
I have configured a field with a custom analyzer that uses a synonym token filter. Below is a sample mapping and query so that you can play with it and see if it fits your needs.
Mapping
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "usa, us",
            "uk, gb"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "mydocs": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "my_synonyms"
        }
      }
    }
  }
}
Sample Document
POST my_index/mydocs/1
{
  "name": "uk is pretty cool country"
}
When you run the query below, it returns the document above as well.
Query
GET my_index/mydocs/_search
{
  "query": {
    "match": {
      "name": "gb"
    }
  }
}
Refer to the official documentation on the synonym token filter for more detail. Hope this helps!
To handle this within ES itself, without using Logstash, I'd suggest a simple ingest pipeline with a gsub processor to update the field in place:
{
  "gsub": {
    "field": "countryCode",
    "pattern": "GB",
    "replacement": "UK"
  }
}
https://www.elastic.co/guide/en/elasticsearch/reference/master/gsub-processor.html
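For completeness, the gsub processor lives inside an ingest pipeline that you reference when indexing. A sketch, assuming a pipeline id of country_code_fix (my own name) and one gsub processor per variant you want to normalise:
PUT _ingest/pipeline/country_code_fix
{
  "description": "Normalise country code variants",
  "processors": [
    {
      "gsub": {
        "field": "countryCode",
        "pattern": "GB",
        "replacement": "UK"
      }
    },
    {
      "gsub": {
        "field": "countryCode",
        "pattern": "USA",
        "replacement": "US"
      }
    }
  ]
}

PUT my_index/mydocs/2?pipeline=country_code_fix
{
  "countryCode": "GB"
}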

What keywords are used by the Swedish analyzer?

On this part of the elasticsearch docs, it says that the Swedish analyzer can be reimplemented like this:
PUT /swedish_example
{
  "settings": {
    "analysis": {
      "filter": {
        "swedish_stop": {
          "type": "stop",
          "stopwords": "_swedish_"
        },
        "swedish_keywords": {
          "type": "keyword_marker",
          "keywords": ["exempel"]
        },
        "swedish_stemmer": {
          "type": "stemmer",
          "language": "swedish"
        }
      },
      "analyzer": {
        "swedish": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "swedish_stop",
            "swedish_keywords",
            "swedish_stemmer"
          ]
        }
      }
    }
  }
}
My question is, how does this analyser recognise keywords? Sure, keywords can be defined in the settings.analysis.filter.swedish_keywords.keywords field, but what if I'm too lazy to do that? Does Elasticsearch fall back to some other pre-defined list of Swedish keywords? Because in the example above no such list is provided in the settings.
In other words, is it solely up to me to define keywords or does Elasticsearch look at some other list to find keywords by default?
Yes, you need to specify this list yourself; otherwise, this filter won't do anything.
As per documentation of Elasticsearch:
Keyword Marker Token Filter
Protects words from being modified by stemmers. Must be placed before
any stemming filters.
Alternatively, you could specify:
keywords_path
A path (either relative to config location, or absolute) to a list of
words.
keywords_pattern
A regular expression pattern to match against words in the text.
More information about this filter - https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-keyword-marker-tokenfilter.html
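For example, instead of listing the keywords inline you could keep them in a file and point the filter at it. A small sketch (the path analysis/swedish_keywords.txt is my own example; it is resolved relative to the Elasticsearch config directory):
"swedish_keywords": {
  "type": "keyword_marker",
  "keywords_path": "analysis/swedish_keywords.txt"
}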

Can I have multiple filters in an Elasticsearch index's settings?

I want an Elasticsearch index that simply stores "names" of features. I want to be able to issue phonetic queries and also type-ahead style queries separately. I would think I would be able to create one index with two analyzers and two filters; each analyzer could use one of the filters. But I do not seem to be able to do this.
Here is the index settings json I'm trying to use:
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "autocomplete_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["standard", "lowercase", "ngram"]
        }
      },
      "analyzer": {
        "phonetic_analyzer": {
          "tokenizer": "standard",
          "filter": "double_metaphone_filter"
        }
      },
      "filter": {
        "double_metaphone_filter": {
          "type": "phonetic",
          "encoder": "double_metaphone"
        }
      },
      "filter": {
        "ngram": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 15
        }
      }
    }
  }
}
When I attempt to create an index with these settings:
http://hostname:9200/index/type
I get an HTTP 400, saying
Custom Analyzer [phonetic_analyzer] failed to find filter under name [double_metaphone_filter]
Don't get me wrong, I fully realize what that sentence means. I looked and looked for an erroneous comma or quote but I don't see any. Otherwise, everything is there and formatted correctly.
If I delete the phonetic analyzer, the index is created but ONLY with the autocomplete analyzer and ngram filter.
If I delete the ngram filter, the index is created but ONLY with the phonetic analyzer and phonetic filter.
I have a feeling I'm missing a fundamental concept of ES, like only one analyzer per index, or one filter per index, or I must have some other logical dependencies set up correctly, etc. It sure would be nice to have a logical diagram or complete API spec of the Elasticsearch infrastructure, i.e. any index can have 1..n analyzers, only 1 filter, query must need any one of bool, match, etc. But that unicorn does not seem to exist.
I see tons of documentation, blog posts, etc on how to do each of these functionalities, but with only one analyzer and one filter on the index. I'd really like to do this dual functionality on one index (for reasons out of scope).
Can someone offer some help, advice here?
You are just missing the proper formatting for your settings object. You cannot have two analyzer or filter keys, as there can only be one value per key in a JSON object; when the index was being created, the second key was overriding the first. Defining all of your filters under a single filter object (and all analyzers under a single analyzer object) works just fine.
Look here:
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"double_metaphone_filter": {
"type": "phonetic",
"encoder": "double_metaphone"
},
"ngram": {
"type": "ngram",
"min_gram": 2,
"max_gram": 15
}
},
"analyzer": {
"autocomplete_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["standard", "lowercase", "ngram"]
},
"phonetic_analyzer": {
"tokenizer": "standard",
"filter": "double_metaphone_filter"
}
}
}
}
I downloaded the plugin to confirm this works.
You can now test this out at the _analyze endpoint with a payload:
{
  "analyzer": "autocomplete_analyzer",
  "text": "Jonnie Smythe"
}
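To actually use both analyzers side by side on a single field, one option is a multi-field mapping so you can query name.autocomplete and name.phonetic separately. A sketch only: the mapping type "feature" and the sub-field names are my own, and the field type would be string rather than text on pre-5.x versions:
"mappings": {
  "feature": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "autocomplete": {
            "type": "text",
            "analyzer": "autocomplete_analyzer"
          },
          "phonetic": {
            "type": "text",
            "analyzer": "phonetic_analyzer"
          }
        }
      }
    }
  }
}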

Elasticsearch combining language and char_filter in an analyzer

I'm trying to combine a language analyzer with a char_filter, but when I look at the _termvectors for the field I can see terms that come from the HTML/XML markup, such as attributes of custom XML tags like "22anchor_titl".
My idea was to extend the German language analyzer:
settings:
  analysis:
    analyzer:
      node_body_analyzer:
        type: 'german'
        char_filter: ['html_strip']
mappings:
  node:
    body:
      type: 'string'
      analyzer: 'node_body_analyzer'
      search_analyzer: 'node_search_analyzer'
Is there an error in my configuration, or is deriving a new analyzer from 'german' by adding a char_filter simply not possible? If so, would I have to define a type: 'custom' analyzer, implement the whole thing as in the documentation, and add the char_filter there?
Cheers
Yes, you need to do that. What if you wanted to add another token filter? Where should ES have placed it in the list of already existing token filters (since the order matters)? You need something like this:
"analysis": {
"filter": {
"german_stop": {
"type": "stop",
"stopwords": "_german_"
},
"german_keywords": {
"type": "keyword_marker",
"keywords": ["ghj"]
},
"german_stemmer": {
"type": "stemmer",
"language": "light_german"
}
},
"analyzer": {
"my_analyzer": {
"type":"custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"german_stop",
"german_keywords",
"german_normalization",
"german_stemmer"
],
"char_filter":"html_strip"
}
}
}
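To verify that html_strip runs before tokenization, you could test the custom analyzer with _analyze on the index where it is defined. A quick sanity check; the index name and the sample text are my own:
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>Dies ist ein <span anchor_title=\"22\">Beispiel</span></p>"
}
The tag names and attribute values should not appear among the returned tokens.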

Resources