Difference in handling possessives (apostrophes) with the english stemmer between 1.2 and 1.4 - elasticsearch

We have two instances of Elasticsearch, one running 1.2.1 and one running 1.4. The settings and the mapping are identical on the indices running on both instances, yet the results are different.
The settings for the default analyzer:
....
analysis: {
    filter: {
        ourEnglishStopWords: {
            type: "stop",
            stopwords: "_english_"
        },
        ourEnglishFilter: {
            type: "stemmer",
            name: "english"
        }
    },
    analyzer: {
        default: {
            filter: [
                "asciifolding",
                "lowercase",
                "ourEnglishStopWords",
                "ourEnglishFilter"
            ],
            tokenizer: "standard"
        }
    }
},
...
The difference between Elasticsearch versions appears when indexing/searching for possessive forms: whereas in 1.2.1 "player", "players" and "player's" would return the same results, in 1.4 the first two ("player" and "players") have an identical result set, while "player's" does not match that set.
Is this a known difference? What is the right way to get the same behavior in 1.4 and up?

I think this is the change, introduced in 1.3.0:
The StemmerTokenFilter had a number of issues:
english returned the slow snowball English stemmer
porter2 returned the snowball Porter stemmer (v1)
Changes:
english now returns the fast PorterStemmer (for indices created from v1.3.0 onwards)
porter2 now returns the snowball English stemmer (for indices created from v1.3.0 onwards)
According to that GitHub issue, you can either change your mapping to:
"ourEnglishFilter": {
"type": "stemmer",
"name": "porter2"
}
or try something else:
"filter": {
"ourEnglishStopWords": {
"type": "stop",
"stopwords": "_english_"
},
"ourEnglishFilter": {
"type": "stemmer",
"name": "english"
},
"possesiveEnglish": {
"type": "stemmer",
"name": "possessive_english"
}
},
"analyzer": {
"default": {
"filter": [
"asciifolding",
"lowercase",
"ourEnglishStopWords",
"possesiveEnglish",
"ourEnglishFilter"
],
"tokenizer": "standard"
}
}
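To verify which option restores the old behavior, you can run the three terms through the analyzer once the index has been recreated. A quick sketch (the index name my_index is a placeholder; on 1.x/2.x the analyzer and text are passed as query parameters):

    GET /my_index/_analyze?analyzer=default&text=player's
    GET /my_index/_analyze?analyzer=default&text=players
    GET /my_index/_analyze?analyzer=default&text=player

If the possessive is handled before stemming, all three calls should return the same single token, player.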

Related

Re-using inbuilt language filters?

I saw the question here, which shows how one can create a custom analyzer to have both synonym support and support for languages.
However, it seems to create its own stemmer and stopwords collection as well.
What if I want to add synonyms to the "danish" inbuilt analyzer? Can I refer to the inbuilt Danish stemmer and stopwords filter? As an example, is it just called danish_stemmer and danish_stopwords?
Perhaps a list of inbuilt filters would help - where can I see the names of these inbuilt filters?
For each pre-built language analyzer there is an example of how to rebuild it. For danish there is this example:
PUT /danish_example
{
    "settings": {
        "analysis": {
            "filter": {
                "danish_stop": {
                    "type": "stop",
                    "stopwords": "_danish_"
                },
                "danish_keywords": {
                    "type": "keyword_marker",
                    "keywords": ["eksempel"]
                },
                "danish_stemmer": {
                    "type": "stemmer",
                    "language": "danish"
                }
            },
            "analyzer": {
                "rebuilt_danish": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "danish_stop",
                        "danish_keywords",
                        "danish_stemmer"
                    ]
                }
            }
        }
    }
}
This is essentially building your own custom analyzer.
The list of available stemmers can be found here. The list of available pre-built stopwords lists can be found here.
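To answer the synonym part of the question: since the rebuilt analyzer is just a custom analyzer, you can slot a synonym token filter into its chain. A minimal sketch, where the filter name danish_synonyms and the synonym entries are placeholders, and the synonym filter is placed before stemming so that synonyms get stemmed consistently (adjust placement to your needs):

PUT /danish_example
{
    "settings": {
        "analysis": {
            "filter": {
                "danish_stop": { "type": "stop", "stopwords": "_danish_" },
                "danish_keywords": { "type": "keyword_marker", "keywords": ["eksempel"] },
                "danish_stemmer": { "type": "stemmer", "language": "danish" },
                "danish_synonyms": {
                    "type": "synonym",
                    "synonyms": ["hund, vovse"]
                }
            },
            "analyzer": {
                "rebuilt_danish": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "danish_stop",
                        "danish_keywords",
                        "danish_synonyms",
                        "danish_stemmer"
                    ]
                }
            }
        }
    }
}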
Hope that helps!

Can I have multiple filters in an Elasticsearch index's settings?

I want an Elasticsearch index that simply stores "names" of features. I want to be able to issue phonetic queries and also type-ahead style queries separately. I would think I would be able to create one index with two analyzers and two filters; each analyzer could use one of the filters. But I do not seem to be able to do this.
Here is the index settings json I'm trying to use:
{
    "settings": {
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "autocomplete_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["standard", "lowercase", "ngram"]
                }
            },
            "analyzer": {
                "phonetic_analyzer": {
                    "tokenizer": "standard",
                    "filter": "double_metaphone_filter"
                }
            },
            "filter": {
                "double_metaphone_filter": {
                    "type": "phonetic",
                    "encoder": "double_metaphone"
                }
            },
            "filter": {
                "ngram": {
                    "type": "ngram",
                    "min_gram": 2,
                    "max_gram": 15
                }
            }
        }
    }
}
When I attempt to create an index with these settings:
http://hostname:9200/index/type
I get an HTTP 400, saying
Custom Analyzer [phonetic_analyzer] failed to find filter under name [double_metaphone_filter]
Don't get me wrong, I fully realize what that sentence means. I looked and looked for an erroneous comma or quote but I don't see any. Otherwise, everything is there and formatted correctly.
If I delete the phonetic analyzer, the index is created but ONLY with the autocomplete analyzer and ngram filter.
If I delete the ngram filter, the index is created but ONLY with the phonetic analyzer and phonetic filter.
I have a feeling I'm missing a fundamental concept of ES, like only one analyzer per index, or one filter per index, or that I must have some other logical dependencies set up correctly, etc. It sure would be nice to have a logical diagram or complete API spec of the Elasticsearch infrastructure, i.e. any index can have 1..n analyzers, only 1 filter, a query must use one of bool, match, etc. But that unicorn does not seem to exist.
I see tons of documentation, blog posts, etc on how to do each of these functionalities, but with only one analyzer and one filter on the index. I'd really like to do this dual functionality on one index (for reasons out of scope).
Can someone offer some help, advice here?
You are just missing the proper formatting for your settings object. You cannot have two analyzer or filter keys, since there can only be one value per key in this settings map; when you created your index, the second key was overriding the first. Defining all of your filters under a single filter key (and all analyzers under a single analyzer key) works just fine.
Look here:
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"double_metaphone_filter": {
"type": "phonetic",
"encoder": "double_metaphone"
},
"ngram": {
"type": "ngram",
"min_gram": 2,
"max_gram": 15
}
},
"analyzer": {
"autocomplete_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["standard", "lowercase", "ngram"]
},
"phonetic_analyzer": {
"tokenizer": "standard",
"filter": "double_metaphone_filter"
}
}
}
}
I downloaded the plugin to confirm this works.
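(If the phonetic plugin is not installed yet, it has to be installed on every node and the node restarted; depending on your Elasticsearch version the command is roughly one of:

    bin/plugin install analysis-phonetic                  # ES 2.x
    bin/elasticsearch-plugin install analysis-phonetic    # ES 5.x and later
)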
You can now test this out at the _analyze endpoint with a payload:
{
    "analyzer": "autocomplete_analyzer",
    "text": "Jonnie Smythe"
}
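For completeness, a sketch of the full requests against both analyzers (the index name index is taken from the URL used above; the JSON body form of _analyze works on recent versions, on older versions pass analyzer and text as query parameters instead):

GET /index/_analyze
{
    "analyzer": "autocomplete_analyzer",
    "text": "Jonnie Smythe"
}

GET /index/_analyze
{
    "analyzer": "phonetic_analyzer",
    "text": "Jonnie Smythe"
}

The first should return lowercase ngram tokens, the second the double-metaphone codes, confirming that both analyzers coexist in the one index.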

Elasticsearch combining language and char_filter in an analyzer

I'm trying to combine a language analyzer with a char_filter, but when I look at the _termvectors for the field I can see values in there that come from the HTML/XML markup, such as attributes of custom XML tags like "22anchor_titl".
My idea was to extend the german language analyzer:
settings:
  analysis:
    analyzer:
      node_body_analyzer:
        type: 'german'
        char_filter: ['html_strip']
mappings:
  node:
    body:
      type: 'string'
      analyzer: 'node_body_analyzer'
      search_analyzer: 'node_search_analyzer'
Is there an error in my configuration, or is the concept of deriving a new analyzer from 'german' by adding a char_filter simply not possible? If so, would I have to make a type: 'custom' analyzer, implement the whole thing as in the documentation, and add the filter?
Cheers
Yes, you need to do that. What if you wanted to add another token filter? Where should ES have placed that one in the list of already existing token filters (since the order matters)? You need something like this:
"analysis": {
"filter": {
"german_stop": {
"type": "stop",
"stopwords": "_german_"
},
"german_keywords": {
"type": "keyword_marker",
"keywords": ["ghj"]
},
"german_stemmer": {
"type": "stemmer",
"language": "light_german"
}
},
"analyzer": {
"my_analyzer": {
"type":"custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"german_stop",
"german_keywords",
"german_normalization",
"german_stemmer"
],
"char_filter":"html_strip"
}
}
}
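A quick way to verify that the markup is now stripped is to run the custom analyzer over a snippet containing tags. A sketch (the index name my_german_index and the sample text are made up for illustration; on older versions pass analyzer and text as query parameters to _analyze):

GET /my_german_index/_analyze
{
    "analyzer": "my_analyzer",
    "text": "<anchor title=\"Beispiel\">Häuser</anchor>"
}

With html_strip in place, the tag name and its attributes should no longer show up as tokens; only the stemmed form of the visible text should remain.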

Elasticsearch support for traditional chinese

I am trying to index and search Chinese text in Elasticsearch. By using the Smart Chinese Analysis (elasticsearch-analysis-smartcn) plugin I have managed to search characters and words in both simplified and traditional Chinese. I have tried to insert the same text in both simplified and traditional Chinese, but the search returns only one result (depending on how the search is performed); since the text is the same I would expect both results to be returned. I have read here that in order to support traditional Chinese I must also install the STConvert Analysis (elasticsearch-analysis-stconvert) plugin. Can anyone provide a working example that uses these two plugins? (Or an alternative method that achieves the same result.)
The test index is created as
{
    "settings": {
        "analysis": {
            "analyzer": {
                "chinese": {
                    "type": "smartcn"
                }
            }
        }
    },
    "mappings": {
        "testType": {
            "properties": {
                "message": {
                    "store": "yes",
                    "type": "string",
                    "index": "analyzed",
                    "analyzer": "chinese"
                },
                "documentText": {
                    "store": "compress",
                    "type": "string",
                    "index": "analyzed",
                    "analyzer": "chinese",
                    "termVector": "with_positions_offsets"
                }
            }
        }
    }
}
and the two indexing requests with the same text, in traditional and simplified Chinese respectively, are
{
    "message": "汉字",
    "documentText": "制造器官的噴墨打印機 這是一種制造人體器官的裝置。這種裝置是利用打印機噴射生物 細胞、 生長激素、凝膠體,形成三維的生物活體組織。凝膠體主要是為細胞提供生長的平台,之后逐步形成所想要的器官或組織。這項技術可以人工方式制造心臟、肝臟、腎臟。這項研究已經取得了一定進展,目前正在研究如何將供應營養的血管印出來。這個創意目前已經得到了佳能等大公司的贊助"
}
{
    "message": "汉字",
    "documentText": "制造器官的喷墨打印机 这是一种制造人体器官的装置。这种装置是利用打印机喷射生物 细胞、 生长激素、凝胶体,形成叁维的生物活体组织。凝胶体主要是为细胞提供生长的平台,之后逐步形成所想要的器官或组织。这项技术可以人工方式制造心脏、肝脏、肾脏。这项研究已经取得了一定进展,目前正在研究如何将供应营养的血管印出来。这个创意目前已经得到了佳能等大公司的赞助"
}
Finally, a sample search that I want to return two results is
{
    "query": {
        "query_string": {
            "query": "documentText : 制造器官的喷墨打印机",
            "default_operator": "AND"
        }
    }
}
After many attempts I found a configuration that works. I did not manage to make smartcn work with the stconvert plugin, so instead I used Elasticsearch's cjk analyzer components together with the icu_tokenizer. By using t2s and s2t as filters, each character is stored in both forms, traditional and simplified.
{
    "settings": {
        "analysis": {
            "filter": {
                "english_stop": {
                    "type": "stop",
                    "stopwords": "_english_"
                },
                "t2s_convert": {
                    "type": "stconvert",
                    "delimiter": ",",
                    "convert_type": "t2s"
                },
                "s2t_convert": {
                    "type": "stconvert",
                    "delimiter": ",",
                    "convert_type": "s2t"
                }
            },
            "analyzer": {
                "my_cjk": {
                    "tokenizer": "icu_tokenizer",
                    "filter": [
                        "cjk_width",
                        "lowercase",
                        "cjk_bigram",
                        "english_stop",
                        "t2s_convert",
                        "s2t_convert"
                    ]
                }
            }
        }
    },
    "mappings": {
        "testType": {
            "properties": {
                "message": {
                    "store": "yes",
                    "type": "string",
                    "index": "analyzed",
                    "analyzer": "my_cjk"
                },
                "documentText": {
                    "store": "compress",
                    "type": "string",
                    "index": "analyzed",
                    "analyzer": "my_cjk",
                    "termVector": "with_positions_offsets"
                }
            }
        }
    }
}
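Before re-indexing, you can sanity-check the analyzer with a short phrase. A sketch (the index name test is a placeholder; on 1.x/2.x analyzer and text go in the query string of _analyze):

    GET /test/_analyze?analyzer=my_cjk&text=打印機

If the conversion filters behave as described above, the token output should contain the bigrams in both scripts, and the search from the question should then match both the traditional and the simplified document.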

Keep non-stemmed tokens on Elasticsearch

I'm using a stemmer (for Brazilian Portuguese) when I index documents in Elasticsearch. This is what my default analyzer looks like (never mind minor mistakes here, because I've copied this by hand from my code on the server):
{
    "analysis": {
        "filter": {
            "my_asciifolding": {
                "type": "asciifolding",
                "preserve_original": true
            },
            "stop_pt": {
                "type": "stop",
                "ignore_case": true,
                "stopwords": "_brazilian_"
            },
            "stemmer_pt": {
                "type": "stemmer",
                "language": "brazilian"
            }
        },
        "analyzer": {
            "default": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                    "lowercase",
                    "my_asciifolding",
                    "stop_pt",
                    "stemmer_pt"
                ]
            }
        }
    }
}
I haven't really touched my type mappings (apart from a few numeric fields I've declared "type":"long") so I expect most fields to be using this default analyzer I've specified above.
This works as expected, but the thing is that some users are frustrated because (since tokens are being stemmed) the query "vulnerabilities" and the query "vulnerable" return the same results, which is misleading because they expect results with an exact match to be ranked first.
What is the standard way (if any) to do this in Elasticsearch? (Maybe keep the unstemmed tokens in the index as well as the stemmed tokens?) I'm using version 1.5.1.
I ended up using the "fields" property (multi-fields) to index my attributes in different ways. Not sure whether this is optimal, but this is the way I'm handling it right now:
Add another analyzer (I called it "no_stem_analyzer") with all the filters that the "default" analyzer has, minus the stemmer; a sketch of that is shown below.
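A sketch of what that extra analyzer could look like, simply the "default" analyzer from the question with "stemmer_pt" removed (the surrounding "filter" definitions stay exactly the same):

"analyzer": {
    "default": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": ["lowercase", "my_asciifolding", "stop_pt", "stemmer_pt"]
    },
    "no_stem_analyzer": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": ["lowercase", "my_asciifolding", "stop_pt"]
    }
}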
For each attribute for which I want to keep both the non-stemmed and the stemmed variant, I did this (example for the field "DESCRIPTION"):
"mappings":{
"_default_":{
"properties":{
"DESCRIPTION":{
"type"=>"string",
"fields":{
"no_stem":{
"type":"string",
"index":"analyzed",
"analyzer":"no_stem_analyzer"
},
"stemmed":{
"type":"string",
"index":"analyzed",
"analyzer":"default"
}
}
}
},//.. other attributes here
}
}
At search time (using a query_string query) I must also indicate (via the "fields" parameter) that I want to search all sub-fields (e.g. "DESCRIPTION.*"), as in the sketch below.
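A minimal sketch of such a search request; the ^2 boost on the non-stemmed sub-field is one illustrative way to rank exact matches higher, not something prescribed by the original answer:

{
    "query": {
        "query_string": {
            "query": "vulnerable",
            "fields": ["DESCRIPTION.no_stem^2", "DESCRIPTION.stemmed"]
        }
    }
}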
I also based my approach on this answer: elasticsearch customize score for synonyms/stemming.
