Elasticsearch German stemmer doesn't do plural

I'm working on a basic German analyzer in Elasticsearch, which is defined as follows:
{
  "settings": {
    "analysis": {
      "filter": {
        "german_stemmer": {
          "type": "snowball",
          "language": "German"
        },
        "german_stop": {
          "type": "stop",
          "stopwords": "_german_"
        }
      },
      "analyzer": {
        "german_search": {
          "filter": ["lowercase", "german_stop", "german_stemmer"],
          "tokenizer": "standard"
        }
      }
    }
  }
}
While testing it, I realized that it does not handle Kürbis and Kürbisse well: stemming those two words produces different output, although from my understanding (just what I read online) Kürbis means pumpkin and Kürbisse is its plural. It looks like the stemmer is not dealing well with plurals.
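(For reference, the mismatch can be reproduced with the _analyze API; the request below is only a sketch, and the index name is a placeholder for an index created with the settings above.)
POST /german_test/_analyze
{
  "analyzer": "german_search",
  "text": "Kürbis Kürbisse"
}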
Any ideas on how I can solve this?

Related

Elasticsearch: Problem with Italian analyzer

I noticed that the ES Italian analyzer does not stem words shorter than 6 characters, and this obviously creates a problem for my work. I tried to solve it by customizing the analyzer but unfortunately did not succeed. So I implemented a hunspell analyzer in the index, but it isn't very scalable, and I want to keep the analyzer algorithmic. Does anyone have a suggestion on how to solve this problem?
The default Italian language stemmer in Elasticsearch is not the normal snowball stemmer, but a light version called light_italian. I was able to reproduce that it doesn't stem some tokens that are shorter than 6 characters, as you described:
POST /_analyze
{
  "analyzer": "italian",
  "text": "pronto propio logie logia morte"
}
But Elasticsearch includes another Italian stemmer token filter, called italian, that does stem these tokens. You can test it with this code:
PUT /my-italian-stemmer-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_stemmer"
          ]
        }
      },
      "filter": {
        "my_stemmer": {
          "type": "stemmer",
          "language": "italian"
        }
      }
    }
  }
}

POST /my-italian-stemmer-index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "pronto propio logie logia morte"
}
If you want to use it, you should rebuild the original Italian analyzer and swap out the token filter:
PUT /italian_example
{
  "settings": {
    "analysis": {
      "filter": {
        "italian_elision": {
          "type": "elision",
          "articles": [
            "c", "l", "all", "dall", "dell",
            "nell", "sull", "coll", "pell",
            "gl", "agl", "dagl", "degl", "negl",
            "sugl", "un", "m", "t", "s", "v", "d"
          ],
          "articles_case": true
        },
        "italian_stop": {
          "type": "stop",
          "stopwords": "_italian_"
        },
        "italian_keywords": {
          "type": "keyword_marker",
          "keywords": ["esempio"]
        },
        "italian_stemmer": {
          "type": "stemmer",
          "language": "italian"
        }
      },
      "analyzer": {
        "rebuilt_italian": {
          "tokenizer": "standard",
          "filter": [
            "italian_elision",
            "lowercase",
            "italian_stop",
            "italian_keywords",
            "italian_stemmer"
          ]
        }
      }
    }
  }
}
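To double-check, you can run the same test sentence through the rebuilt analyzer; this is just a verification sketch against the italian_example index created above:
POST /italian_example/_analyze
{
  "analyzer": "rebuilt_italian",
  "text": "pronto propio logie logia morte"
}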

Custom stopword analyzer is not working properly

I have created an index with a custom analyzer for stop words. I want Elasticsearch to ignore these words at search time. I then added one document to the Elasticsearch mapping.
But when I query in Kibana for the keyword "the" with the query below, it should not return a match, because in my_analyzer I have put "the" in the my_stop_word section. Yet it does return a match. I have read that if you specify an analyzer for a field in the mapping at index time, that analyzer is also used by default at query time.
Please help!
PUT /pandey
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "my_stemmer",
            "english_stop",
            "my_stop_word",
            "lowercase"
          ]
        }
      },
      "filter": {
        "my_stemmer": {
          "type": "stemmer",
          "name": "english"
        },
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "my_stop_word": {
          "type": "stop",
          "stopwords": ["robot", "love", "affection", "play", "the"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "dialog": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
PUT pandey/_doc/1
{
  "dailog" : "the boy is a robot. he is in love. i play cricket"
}

GET pandey/_search
{
  "query": {
    "match": {
      "dailog": "the"
    }
  }
}
A small spelling mistake can lead to this.
You defined the mapping for dialog but added the document with the field name dailog. Elasticsearch's dynamic field mapping behaviour will index it without error (we can disable this, though).
So the query "dailog": "the" matches, because the dailog field was indexed with the default analyzer instead of my_analyzer.
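If you want such typos to be rejected instead of silently indexed, one option (a sketch, not part of the original fix) is to set dynamic mapping to strict; the analysis settings from the question stay unchanged and only the mappings block changes:
"mappings": {
  "dynamic": "strict",
  "properties": {
    "dialog": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}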

Elastic Search - how to use language analyzer with UTF-8 filter?

I have a problem with the Elasticsearch language analyzer. I am working with the Lithuanian language, so I am using the Lithuanian language analyzer. The analyzer works fine and I get all the word cases I need. For example, I index the Lithuanian city "Klaipėda":
PUT /cities/city/1
{
  "name": "Klaipėda"
}
The problem is that I also need to get a result when I search for "Klaipėda" using only the Latin alphabet ("Klaipeda"), and in all Lithuanian cases:
Nominative case: "Klaipeda"
Genitive case: "Klaipedos"
...
Locative case: "Klaipedoje"
"Klaipėda", "Klaipėdos", "Klaipėdoje" work, but "Klaipeda", "Klaipedos", "Klaipedoje" do not.
My index:
PUT /cities
{
  "mappings": {
    "city": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "lithuanian",
          "fields": {
            "folded": {
              "type": "string",
              "analyzer": "md_folded_analyzer"
            }
          }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "md_folded_analyzer": {
          "type": "lithuanian",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "lithuanian_stop",
            "lithuanian_keywords",
            "lithuanian_stemmer"
          ]
        }
      }
    }
  }
}
and the search query:
GET /cities/_search
{
  "query": {
    "multi_match": {
      "type": "most_fields",
      "query": "klaipeda",
      "fields": [ "name", "name.folded" ]
    }
  }
}
What am I doing wrong? Thanks for the help.
The technique you are using here is so-called multi-fields. The limitation of the underlying name.folded field is that you can't search against it this way - you can only use name.folded for sorting and aggregations.
To work around this I've come up with the following set-up:
Separate-fields set-up (to eliminate duplicates, just specify copy_to):
curl -XPUT http://localhost:9200/cities -d '
{
  "mappings": {
    "city": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "lithuanian",
          "copy_to": "folded"
        },
        "folded": {
          "type": "string",
          "analyzer": "md_folded_analyzer"
        }
      }
    }
  }
}'
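With this set-up the search would then point at the two separate fields; the query below is only an illustrative sketch adapted from the question:
curl -XGET http://localhost:9200/cities/_search -d '
{
  "query": {
    "multi_match": {
      "type": "most_fields",
      "query": "klaipeda",
      "fields": [ "name", "folded" ]
    }
  }
}'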
Change the type of your analyzer to custom as described here, because otherwise the asciifolding filter doesn't make it into the config. More importantly, asciifolding should go after all the stemming / stop-word filters for Lithuanian, because after folding a word can lose its intended sense.
curl -XPUT http://localhost:9200/my_cities -d '
{
  "settings": {
    "analysis": {
      "filter": {
        "lithuanian_stop": {
          "type": "stop",
          "stopwords": "_lithuanian_"
        },
        "lithuanian_stemmer": {
          "type": "stemmer",
          "language": "lithuanian"
        }
      },
      "analyzer": {
        "md_folded_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "lithuanian_stop",
            "lithuanian_stemmer",
            "asciifolding"
          ]
        }
      }
    }
  }
}'
Sorry, I've left out lithuanian_keywords - it requires additional set-up, which I skipped here. But I hope you get the idea.
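For completeness, that missing piece would just be a keyword_marker filter along these lines (the keyword list is only a placeholder):
"lithuanian_keywords": {
  "type": "keyword_marker",
  "keywords": ["pavyzdys"]
}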

Why does the match_phrase_prefix query return wrong results with different phrase lengths?

I have a very simple query:
POST /indexX/document/_search
{
  "query": {
    "match_phrase_prefix": {
      "surname": "grab"
    }
  }
}
with this mapping:
"surname": {
"type": "string",
"analyzer": "polish",
"copy_to": [
"full_name"
]
}
and the index definition (I use the Stempel (Polish) Analysis plugin for Elasticsearch):
POST /indexX
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms_path": "analysis/synonyms.txt"
          },
          "polish_stop": {
            "type": "stop",
            "stopwords_path": "analysis/stopwords.txt"
          },
          "polish_my_stem": {
            "type": "stemmer",
            "rules_path": "analysis/stems.txt"
          }
        },
        "analyzer": {
          "polish_with_synonym": {
            "tokenizer": "standard",
            "filter": [
              "synonym",
              "lowercase",
              "polish_stop",
              "polish_stem",
              "polish_my_stem"
            ]
          }
        }
      }
    }
  }
}
For this query I get zero results. When I change the phrase to GRA or GRABA it returns 1 result (GRABARZ is the surname). Why is this happening?
I tried max_expansions with values as high as 1200, and that didn't help.
At first glance, your analyzer stems the search term ("grab") and renders it unusable ("grabić").
Without going into details on how to resolve this, please consider getting rid of the polish analyzer here. We are talking about people's names, not "ordinary" Polish words.
I have seen different techniques used in this case: multi-field searches, fuzzy searches, phonetic searches, dedicated plugins.
Some links:
https://www.elastic.co/blog/multi-field-search-just-got-better
http://www.basistech.com/fuzzy-search-names-in-elasticsearch/
https://www.found.no/play/gist/6c6434c9c638a8596efa
But I guess that in the case of Polish surnames, some kind of prefix query on a non-analyzed field would suffice...
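To illustrate that last idea (only a sketch, and a slight variation: a lowercased keyword-tokenized sub-field instead of a plain non-analyzed one, so the prefix can stay lowercase - none of this is in the original mapping):
"analyzer": {
  "surname_prefix": {
    "tokenizer": "keyword",
    "filter": [ "lowercase" ]
  }
}

"surname": {
  "type": "string",
  "analyzer": "polish",
  "fields": {
    "raw": {
      "type": "string",
      "analyzer": "surname_prefix"
    }
  }
}

POST /indexX/document/_search
{
  "query": {
    "prefix": {
      "surname.raw": "grab"
    }
  }
}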

Erroneous match using snowball analyzer

I am using the snowball analyzer for stemming words, but it has mapped the words "insider" and "inside" to the same stem "insid", which is totally wrong. How can I improve this kind of stemming in Elasticsearch?
That's how the snowball analyzer works. It's simply too aggressive. Based on my own experience, I tend to use a modified English analyzer:
I remove the stemmer filter, as it's too aggressive, just like snowball.
I replace it with kstem, which is a lightweight English stemmer. It does a great job.
I add a hunspell dictionary token filter to normalize words (instead of the stemmer).
I add an asciifolding filter to normalize letters, so things such as rôle and role are treated as equal.
That would be it:
{
  "settings": {
    "analysis": {
      "filter": {
        "english_hunspell": {
          "type": "hunspell",
          "language": "en_GB"
        },
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "english": {
          "tokenizer": "standard",
          "filter": [
            "asciifolding",
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "kstem",
            "english_hunspell"
          ]
        }
      }
    }
  }
}
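To see the difference, you can run the problem words through the analyzer with the _analyze API; the index name below is a placeholder, and the hunspell filter needs the en_GB dictionary installed under the config/hunspell directory:
POST /my_english_index/_analyze
{
  "analyzer": "english",
  "text": "insider inside rôle role"
}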
