Elasticsearch combining language and char_filter in an analyzer - elasticsearch

I'm trying to combine a language analyzer with a char_filter, but when I look at the _termvectors for the field I can see values that come from the html/xml markup, e.g. attributes of custom xml tags such as "22anchor_titl".
My idea was to extend the german language analyzer:
settings:
  analysis:
    analyzer:
      node_body_analyzer:
        type: 'german'
        char_filter: ['html_strip']

mappings:
  node:
    body:
      type: 'string'
      analyzer: 'node_body_analyzer'
      search_analyzer: 'node_search_analyzer'
Is there an error in my configuration, or is the concept of deriving a new analyzer from 'german' by adding a char_filter simply not possible? If so, would I have to create a type: 'custom' analyzer, implement the whole thing as shown in the documentation, and add the char_filter there?
Cheers

Yes, you need to do that. What if you wanted to add another token filter? Where should ES have placed it in the list of already existing token filters (since the order matters)? You need something like this:
"analysis": {
"filter": {
"german_stop": {
"type": "stop",
"stopwords": "_german_"
},
"german_keywords": {
"type": "keyword_marker",
"keywords": ["ghj"]
},
"german_stemmer": {
"type": "stemmer",
"language": "light_german"
}
},
"analyzer": {
"my_analyzer": {
"type":"custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"german_stop",
"german_keywords",
"german_normalization",
"german_stemmer"
],
"char_filter":"html_strip"
}
}
}
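For completeness, a sketch of how the rebuilt analyzer could then be attached to the body field from the question (type and field names are taken from the question; a separate search analyzer would be defined the same way):
"mappings": {
  "node": {
    "properties": {
      "body": {
        "type": "string",
        "analyzer": "my_analyzer"
      }
    }
  }
}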

Related

Extend Elasticsearch's standard Analyzer with additional characters to tokenize on

I basically want the functionality of the inbuilt standard analyzer that additionally tokenizes on underscores.
Currently the standard analyzer keeps brown_fox_has as a single token, but I want [brown, fox, has] instead. The simple analyzer loses some functionality compared to the standard one, so I want to keep the standard behavior as much as possible.
The docs only show how to add filters and other non-tokenizer changes, but I want to keep all of the standard tokenizer's behavior while additionally splitting on underscores.
I could create a character filter to map _ to - and the standard tokenizer will do the job for me, but is there a better way?
es.indices.create(index="mine", body={
    "settings": {
        "analysis": {
            "analyzer": {
                "default": {
                    "type": "custom",
                    # "tokenize_on_chars": ["_"],  # I want this to work with the standard tokenizer without using char_group
                    "tokenizer": "standard",
                    "filter": ["lowercase"]
                }
            }
        }
    }
})
res = es.indices.analyze(index="mine", body={
    "field": "text",
    "text": "the quick brown_fox_has to be split"
})
Use a mapping char_filter and define it along with your preferred standard tokenizer:
POST /_analyze
{
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "_ => \\u0020" // replace underscore with whitespace
      ]
    }
  ],
  "tokenizer": "standard",
  "text": "the quick brown_fox_has to be split"
}
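If you want this at index time rather than only in an ad-hoc _analyze call, here is a minimal sketch of wiring the same mapping char_filter into a custom analyzer (the index and filter names my_index and underscore_to_space are just illustrative):
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "underscore_to_space": {
          "type": "mapping",
          "mappings": ["_ => \\u0020"]
        }
      },
      "analyzer": {
        "default": {
          "type": "custom",
          "char_filter": ["underscore_to_space"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}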

Re-using inbuilt language filters?

I saw the question here, which shows how one can create a custom analyzer to have both synonym support and support for languages.
However, it seems to create its own stemmer and stopwords collection as well.
What if I want to add synonyms to the "danish" inbuilt analyzer? Can I refer to the inbuilt Danish stemmer and stopwords filter? For example, are they just called danish_stemmer and danish_stopwords?
Perhaps a list of inbuilt filters would help - where can I see the names of these inbuilt filters?
For each pre-built language analyzer there is an example in the docs of how to rebuild it. For Danish there is this example:
PUT /danish_example
{
  "settings": {
    "analysis": {
      "filter": {
        "danish_stop": {
          "type": "stop",
          "stopwords": "_danish_"
        },
        "danish_keywords": {
          "type": "keyword_marker",
          "keywords": ["eksempel"]
        },
        "danish_stemmer": {
          "type": "stemmer",
          "language": "danish"
        }
      },
      "analyzer": {
        "rebuilt_danish": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "danish_stop",
            "danish_keywords",
            "danish_stemmer"
          ]
        }
      }
    }
  }
}
This is essentially building your own custom analyzer.
The list of available stemmers can be found here. The list of available pre-built stopwords lists can be found here.
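To get back to the original question about synonyms: one way is to reuse the filters defined in the example above and slot a synonym filter into the chain, typically after lowercase and before stemming. A sketch (the danish_synonyms filter and its synonym list are made up for illustration):
"filter": {
  "danish_synonyms": {
    "type": "synonym",
    "synonyms": ["hund, vovse"]
  }
},
"analyzer": {
  "rebuilt_danish": {
    "tokenizer": "standard",
    "filter": [
      "lowercase",
      "danish_synonyms",
      "danish_stop",
      "danish_keywords",
      "danish_stemmer"
    ]
  }
}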
Hope that helps!

Can I have multiple filters in an Elasticsearch index's settings?

I want an Elasticsearch index that simply stores "names" of features. I want to be able to issue phonetic queries and also type-ahead style queries separately. I would think I would be able to create one index with two analyzers and two filters; each analyzer could use one of the filters. But I do not seem to be able to do this.
Here is the index settings json I'm trying to use:
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "autocomplete_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["standard", "lowercase", "ngram"]
        }
      },
      "analyzer": {
        "phonetic_analyzer": {
          "tokenizer": "standard",
          "filter": "double_metaphone_filter"
        }
      },
      "filter": {
        "double_metaphone_filter": {
          "type": "phonetic",
          "encoder": "double_metaphone"
        }
      },
      "filter": {
        "ngram": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 15
        }
      }
    }
  }
}
When I attempt to create an index with these settings:
http://hostname:9200/index/type
I get an HTTP 400, saying
Custom Analyzer [phonetic_analyzer] failed to find filter under name [double_metaphone_filter]
Don't get me wrong, I fully realize what that sentence means. I looked and looked for an erroneous comma or quote but I don't see any. Otherwise, everything is there and formatted correctly.
If I delete the phonetic analyzer, the index is created but ONLY with the autocomplete analyzer and ngram filter.
If I delete the ngram filter, the index is created but ONLY with the phonetic analyzer and phonetic filter.
I have a feeling I'm missing a fundamental concept of ES, like only one analyzer per index, only one filter per index, or some other logical dependency I must have set up correctly. It sure would be nice to have a logical diagram or complete API spec of the Elasticsearch infrastructure, i.e. any index can have 1..n analyzers, only 1 filter, a query must use one of bool, match, etc. But that unicorn does not seem to exist.
I see tons of documentation, blog posts, etc on how to do each of these functionalities, but with only one analyzer and one filter on the index. I'd really like to do this dual functionality on one index (for reasons out of scope).
Can someone offer some help, advice here?
You are just missing the proper formatting for your settings object. You cannot have two analyzer or filter keys, as there can only be one value per key in this settings map object. Providing a list of your filters seems to work just fine. When you were creating your index object, the second key was overriding the first.
Look here:
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"double_metaphone_filter": {
"type": "phonetic",
"encoder": "double_metaphone"
},
"ngram": {
"type": "ngram",
"min_gram": 2,
"max_gram": 15
}
},
"analyzer": {
"autocomplete_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["standard", "lowercase", "ngram"]
},
"phonetic_analyzer": {
"tokenizer": "standard",
"filter": "double_metaphone_filter"
}
}
}
}
I downloaded the plugin to confirm this works.
You can now test this out at the _analyze endpoint with a payload:
{
  "analyzer": "autocomplete_analyzer",
  "text": "Jonnie Smythe"
}
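And the same for the phonetic side, just switching the analyzer name:
{
  "analyzer": "phonetic_analyzer",
  "text": "Jonnie Smythe"
}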

Keep non-stemmed tokens on Elasticsearch

I'm using a stemmer (for Brazilian Portuguese) when I index documents in Elasticsearch. This is what my default analyzer looks like (never mind minor mistakes here, I've copied this by hand from my code on the server):
{
  "analysis": {
    "filter": {
      "my_asciifolding": {
        "type": "asciifolding",
        "preserve_original": true
      },
      "stop_pt": {
        "type": "stop",
        "ignore_case": true,
        "stopwords": "_brazilian_"
      },
      "stemmer_pt": {
        "type": "stemmer",
        "language": "brazilian"
      }
    },
    "analyzer": {
      "default": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "my_asciifolding",
          "stop_pt",
          "stemmer_pt"
        ]
      }
    }
  }
}
I haven't really touched my type mappings (apart from a few numeric fields I've declared "type":"long") so I expect most fields to be using this default analyzer I've specified above.
This works as expected, but the thing is that some users are frustrated because (since tokens are being stemmed) the query "vulnerabilities" and the query "vulnerable" return the same results, which is misleading because they expect results with an exact match to be ranked first.
What is the standard way (if any) to do this in Elasticsearch? (Maybe keep the unstemmed tokens in the index as well as the stemmed tokens?) I'm using version 1.5.1.
I ended up using the "fields" property (multi-fields) to index my attributes in different ways. Not sure whether this is optimal, but this is how I'm handling it right now:
Add another analyzer (I called it "no_stem_analyzer") with all the filters that the "default" analyzer has, minus the stemmer.
For each attribute where I want to keep both the non-stemmed and stemmed variants, I did this (example for the field "DESCRIPTION"):
"mappings":{
"_default_":{
"properties":{
"DESCRIPTION":{
"type"=>"string",
"fields":{
"no_stem":{
"type":"string",
"index":"analyzed",
"analyzer":"no_stem_analyzer"
},
"stemmed":{
"type":"string",
"index":"analyzed",
"analyzer":"default"
}
}
}
},//.. other attributes here
}
}
At search time (using query_string_query) I must also indicate, via the "fields" parameter, that I want to search all sub-fields (e.g. "DESCRIPTION.*").
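For example, a sketch of such a query (the index name my_index and the query text are made up; the wildcard expands to both DESCRIPTION.no_stem and DESCRIPTION.stemmed):
POST /my_index/_search
{
  "query": {
    "query_string": {
      "fields": ["DESCRIPTION.*"],
      "query": "vulnerable"
    }
  }
}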
I also based my approach on this answer: elasticsearch customize score for synonyms/stemming.

Difference in handling possessive (apostrophes) with english stemmer between 1.2 and 1.4

We have two instances of Elasticsearch, one running 1.2.1 and one running 1.4. The settings and the mappings are identical on the indices running on both instances, yet the results are different.
The setting for the default analyzer:
....
analysis: {
  filter: {
    ourEnglishStopWords: {
      type: "stop",
      stopwords: "_english_"
    },
    ourEnglishFilter: {
      type: "stemmer",
      name: "english"
    }
  },
  analyzer: {
    default: {
      filter: [
        "asciifolding",
        "lowercase",
        "ourEnglishStopWords",
        "ourEnglishFilter"
      ],
      tokenizer: "standard"
    }
  }
},
...
The difference between Elasticsearch versions appears when indexing/searching for possessive forms: whereas in 1.2.1 "player", "players" and "player's" would return the same results, in 1.4 the first two ("player" and "players") return an identical result set, while "player's" does not match that set.
Is this a known difference? What is the right way to get the same behavior in 1.4 and up?
I think this is the change, introduced in 1.3.0:
The StemmerTokenFilter had a number of issues:
english returned the slow snowball English stemmer
porter2 returned the snowball Porter stemmer (v1)
Changes:
english now returns the fast PorterStemmer (for indices created from
v1.3.0 onwards)
porter2 now returns the snowball English stemmer (for indices created from v1.3.0 onwards)
According to that GitHub issue, you can either change your mapping to:
"ourEnglishFilter": {
"type": "stemmer",
"name": "porter2"
}
or try something else:
"filter": {
"ourEnglishStopWords": {
"type": "stop",
"stopwords": "_english_"
},
"ourEnglishFilter": {
"type": "stemmer",
"name": "english"
},
"possesiveEnglish": {
"type": "stemmer",
"name": "possessive_english"
}
},
"analyzer": {
"default": {
"filter": [
"asciifolding",
"lowercase",
"ourEnglishStopWords",
"possesiveEnglish",
"ourEnglishFilter"
],
"tokenizer": "standard"
}
}
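The possessive_english stemmer only strips the trailing 's, so running it before the main English stemmer should make "player's" reduce to the same term as "player" and "players". A quick way to check (a sketch; it assumes the settings above are applied to an index called my_index, and note that on 1.x you may have to pass the analyzer as a query parameter to _analyze rather than in the request body):
POST /my_index/_analyze
{
  "analyzer": "default",
  "text": "player players player's"
}
All three words should now come out as the same token.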
