Erroneour match using snowball analyzer - elasticsearch

Iam using snowball analyzer for the stemming of the words. But this has mapped the words "insider" and "inside" to the same stem word "insid" which is totally wrong. How can I improve these kind of stemming of words in elasticsearch.

That's how snowball analyzer works. It's simply too aggressive. Based on my own experience, I tend to use modified English analyzer
I remove stemmer filter as it's too aggressive. Just like snowball
I replace it with kstem, which is lightweight english filter. Does a great job.
I add hunspell dictionary token filter to normalize words (instead of stemmer)
I add asciifolding filter to normalize letter, so things such as rôle and role would be equal.
That would be it:
{
"settings": {
"analysis": {
"filter": {
"english_hunspell" : {
"type" : "hunspell",
"language" : "en_GB"
},
"english_stop": {
"type": "stop",
"stopwords": "_english_"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
}
},
"analyzer": {
"english": {
"tokenizer": "standard",
"filter": [
"asciifolding",
"english_possessive_stemmer",
"lowercase",
"english_stop",
"kstem",
"english_hunspell"
]
}
}
}
}
}

Related

Elasticsearch German stemmer doesn't do plural

I'm working on a basic German analyzer in Elasticsearch which is defined as follows
{
"settings": {
"analysis": {
"filter": {
"german_stemmer": {
"type": "snowball",
"language": "German"
},
"german_stop": {
"type": "stop",
"stopwords": "_german_"
}
},
"analyzer": {
"german_search": {
"filter": ["lowercase", "german_stop", "german_stemmer"],
"tokenizer": "standard"
}
}
}
}
}
While testing it I realized that it is not dealing well with Kürbis and Kürbisse. Stemming those two words brings different output while from my understanding (just what I read online) Kurbis stands for Pumpkin and Kurbisse is Pumpkins. It looks like the stemmer is not dealing well with plurals.
Any ideas on how can I solve this?

Custom stopword analyzer is not woring properly

I have created an index with a custom analyzer for stop words. I want that elastic-search to ignore these words at the time of searching. Then I added one document data in elasticsearch mapping.
but when I am querying in kibana for "the" keyword with the query. It should not show any successful match, because in my_analzer I have put "the" in my_stop_word section. But it is showing the match. I have studied that if you mention one analyzer at the time of indexing in the mapping field. then it takes that analyzer by default at the time of the query.
please help!
PUT /pandey
{
"settings":
{
"analysis":
{
"analyzer":
{
"my_analyzer":
{
"tokenizer": "standard",
"filter": [
"my_stemmer",
"english_stop",
"my_stop_word",
"lowercase"
]
}
},
"filter": {
"my_stemmer": {
"type": "stemmer",
"name": "english"
},
"english_stop":{
"type": "stop",
"stopwords": "_english_"
},
"my_stop_word": {
"type": "stop",
"stopwords": ["robot", "love", "affection", "play", "the"]
}
}
}
},
"mappings": {
"properties": {
"dialog": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
PUT pandey/_doc/1
{
"dailog" : "the boy is a robot. he is in love. i play cricket"
}
GET pandey/_search
{
"query": {
"match": {
"dailog": "the"
}
}
}
A small spelling mistake can lead to this.
You defined mapping for dialog but added document with field name dailog. the dynamic field mappings behavior of elastic will index it without error. we can disable it though.
So the query, "dailog": "the" will get the result using default analyzer.

Elasticsearch word_delimiter filter with uppercase token dont match

I built an ElasticSearch index using a custom analyzer which uses lowercase and custom word_delimiter filter with keyword tokenizer.
"merged_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase",
"asciifolding",
"word_delim",
"trim"
]
},
"merged_search_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase",
"asciifolding"
]
}
"word_delim": {
"type": "word_delimiter",
"catenate_words": true,
"generate_word_parts": false,
"generate_number_parts": false,
"preserve_original": true
}
"properties": {
"lastName": {
"type": "keyword",
"normalizer": "keyword_normalizer",
"fields": {
"merged": {
"type": "text",
"analyzer": "merged_analyzer",
"search_analyzer": "merged_search_analyzer"
}
}
}
}
Then I tried searching for documents containing dash-separated sub-words, e.g. 'Abc-Xyz'. using the .merged field. Both 'abc-xyz' and 'abcxyz' (in lowercase) match, it's exactly what I expected, but I want my analyzer matchs also with uppercase letters or whitespace (e.g. 'Abc-Xyz', 'abc-xyz ').
It seems like the filters trim and lowercase have no effect on my analyzer
Any idea what I could be doing wrong?
I use elastic 6.2.4
I'm not sure, but it might be that the search analyzer is different from the index analyzer. there are two things you can do to check this.
configure a search_analyzer: https://www.elastic.co/guide/en/elasticsearch/reference/6.2/search-analyzer.html which would analyze using your merged_analyzer.
use the Analyze API: https://www.elastic.co/guide/en/elasticsearch/reference/6.2/indices-analyze.html
in order to check if your search tokens are as expected.

Why is my elastic search prefix query case-sensitive despite using lowercase filters on both index and search?

The Problem
I am working on an autocompleter using ElasticSearch 6.2.3. I would like my query results (a list of pages with a Name field) to be ordered using the following priority:
Prefix match at start of "Name" (Prefix query)
Any other exact (whole word) match within "Name" (Term query)
Fuzzy match (this is currently done on a different field to Name using a ngram tokenizer ... so I assume cannot be relevant to my problem but I would like to apply this on the Name field as well)
My Attempted Solution
I will be using a Bool/Should query consisting of three queries (corresponding to the three priorities above), using boost to define relative importance.
The issue I am having is with the Prefix query - it appears to not be lowercasing the search query despite my search analyzer having the lowercase filter. For example, the below query returns "Harry Potter" for 'harry' but returns zero results for 'Harry':
{ "query": { "prefix": { "Name.raw" : "Harry" } } }
I have verified using the _analyze API that both my analyzers do indeed lowercase the text "Harry" to "harry". Where am I going wrong?
From the ES documentation I understand I need to analyze the Name field in two different ways to enable use of both Prefix and Term queries:
using the "keyword" tokenizer to enable the Prefix query (I have applied this on a .raw field)
using a standard analyzer to enable the Term (I have applied this on the Name field)
I have checked duplicate questions such as this one but the answers have not helped
My mapping and settings are below
ES Index Mapping
{
"myIndex": {
"mappings": {
"pages": {
"properties": {
"Id": {},
"Name": {
"type": "text",
"fields": {
"raw": {
"type": "text",
"analyzer": "keywordAnalyzer",
"search_analyzer": "pageSearchAnalyzer"
}
},
"analyzer": "pageSearchAnalyzer"
},
"Tokens": {}, // Other fields not important for this question
}
}
}
}
}
ES Index Settings
{
"myIndex": {
"settings": {
"index": {
"analysis": {
"filter": {
"ngram": {
"type": "edgeNGram",
"min_gram": "2",
"max_gram": "15"
}
},
"analyzer": {
"keywordAnalyzer": {
"filter": [
"trim",
"lowercase",
"asciifolding"
],
"type": "custom",
"tokenizer": "keyword"
},
"pageSearchAnalyzer": {
"filter": [
"trim",
"lowercase",
"asciifolding"
],
"type": "custom",
"tokenizer": "standard"
},
"pageIndexAnalyzer": {
"filter": [
"trim",
"lowercase",
"asciifolding",
"ngram"
],
"type": "custom",
"tokenizer": "standard"
}
}
},
"number_of_replicas": "1",
"uuid": "l2AXoENGRqafm42OSWWTAg",
"version": {}
}
}
}
}
Prefix queries don't analyze the search terms, so the text you pass into it bypasses whatever would be used as the search analyzer (in your case, the configured search_analyzer: pageSearchAnalyzer) and evaluates Harry as-is directly against the keyword-tokenized, custom-filtered harry potter that was the result of the keywordAnalyzer applied at index time.
In your case here, you'll need to do one of a few different things:
Since you're using a lowercase filter on the field, you could just always use lowercase terms in your prefix query (using application-side lowercasing if necessary)
Run a match query against an edge_ngram-analyzed field instead of a prefix query like described in the ES search_analyzer docs
Here's an example of the latter:
1) Create the index w/ ngram analyzer and (recommended) standard search analyzer
PUT my_index
{
"settings": {
"index": {
"analysis": {
"filter": {
"ngram": {
"type": "edgeNGram",
"min_gram": "2",
"max_gram": "15"
}
},
"analyzer": {
"pageIndexAnalyzer": {
"filter": [
"trim",
"lowercase",
"asciifolding",
"ngram"
],
"type": "custom",
"tokenizer": "keyword"
}
}
}
}
},
"mappings": {
"pages": {
"properties": {
"name": {
"type": "text",
"fields": {
"ngram": {
"type": "text",
"analyzer": "pageIndexAnalyzer",
"search_analyzer": "standard"
}
}
}
}
}
}
}
2) Index some sample docs
POST my_index/pages/_bulk
{"index":{}}
{"name":"Harry Potter"}
{"index":{}}
{"name":"Hermione Granger"}
3) Run the a match query against the ngram field
POST my_index/pages/_search
{
"query": {
"match": {
"query": "Har",
"operator": "and"
}
}
}
I think it is better to use match_phrase_prefix query without using .keyword suffix. Check the docs at here https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase-prefix.html

Why match_phrase_prefix query returns wrong results with diffrent length of phrase?

I have very simple query:
POST /indexX/document/_search
{
"query": {
"match_phrase_prefix": {
"surname": "grab"
}
}
}
with mapping:
"surname": {
"type": "string",
"analyzer": "polish",
"copy_to": [
"full_name"
]
}
and definition for index (I use Stempel (Polish) Analysis for Elasticsearch plugin):
POST /indexX
{
"settings": {
"index": {
"analysis": {
"filter": {
"synonym" : {
"type": "synonym",
"synonyms_path": "analysis/synonyms.txt"
},
"polish_stop": {
"type": "stop",
"stopwords_path": "analysis/stopwords.txt"
},
"polish_my_stem": {
"type": "stemmer",
"rules_path": "analysis/stems.txt"
}
},
"analyzer": {
"polish_with_synonym": {
"tokenizer": "standard",
"filter": [
"synonym",
"lowercase",
"polish_stop",
"polish_stem",
"polish_my_stem"
]
}
}
}
}
}
}
For this query I get zero results. When I change phrase to GRA or GRABA it returns 1 result (GRABARZ is the surname). Why is this happening?
I tried max_expansions with values even as high as 1200 and that didn't help.
At the first glance, your analyzer stems the search term ("grab") and renders it unusable ("grabić").
Without going into details on how to resolve this, please consider getting rid of polish analyzer here. We are talking about people's names, not "ordinary" polish words.
I saw different techniques used in this case: multi-field searches, fuzzy searches, phonetic searches, dedicated plugins.
Some links:
https://www.elastic.co/blog/multi-field-search-just-got-better
http://www.basistech.com/fuzzy-search-names-in-elasticsearch/
https://www.found.no/play/gist/6c6434c9c638a8596efa
But I guess in case of polish names some kind of prefix query on non-analyzed field would suffice...

Resources