Why does the Elasticsearch Arabic analyzer come alongside the Persian analyzer in the Persian example?

I recently checked the language analyzers in the Elasticsearch docs and wondered why, in this example (analysis-lang-analyzer), the Persian analyzer comes after the Arabic analyzer. Is that needed for some reason? In other words, is the Persian analyzer alone not enough for the Persian language?
Persian analyzer
The Persian analyzer could be reimplemented as a custom analyzer as follows:
PUT /persian_example
{
  "settings": {
    "analysis": {
      "char_filter": {
        "zero_width_spaces": {
          "type": "mapping",
          "mappings": [ "\\u200C=>\\u0020" ]
        }
      },
      "filter": {
        "persian_stop": {
          "type": "stop",
          "stopwords": "_persian_"
        }
      },
      "analyzer": {
        "rebuilt_persian": {
          "tokenizer": "standard",
          "char_filter": [ "zero_width_spaces" ],
          "filter": [
            "lowercase",
            "decimal_digit",
            "arabic_normalization",
            "persian_normalization",
            "persian_stop"
          ]
        }
      }
    }
  }
}

I don't think it's required. You can remove arabic_normalization, index a few sentences, and use the analyze API to check the tokens generated by the Persian analyzer to see whether it still produces the expected tokens.
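For example, you can rebuild the analyzer without arabic_normalization in a scratch index and inspect the output with the analyze API. A sketch, where the index name persian_test, the analyzer name persian_no_arabic, and the sample text are all just illustrations:

PUT /persian_test
{
  "settings": {
    "analysis": {
      "char_filter": {
        "zero_width_spaces": {
          "type": "mapping",
          "mappings": [ "\\u200C=>\\u0020" ]
        }
      },
      "filter": {
        "persian_stop": { "type": "stop", "stopwords": "_persian_" }
      },
      "analyzer": {
        "persian_no_arabic": {
          "tokenizer": "standard",
          "char_filter": [ "zero_width_spaces" ],
          "filter": [ "lowercase", "decimal_digit", "persian_normalization", "persian_stop" ]
        }
      }
    }
  }
}

POST /persian_test/_analyze
{
  "analyzer": "persian_no_arabic",
  "text": "کتاب‌های خوب"
}

Comparing this output with the tokens produced by the full rebuilt_persian analyzer shows exactly what arabic_normalization contributes.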
You can also open an issue on the Elasticsearch repo (https://github.com/elastic/elasticsearch/issues), which is the best place to ask this question, as someone from Elastic can comment on whether it's a documentation issue or a real one.

These analyzers come from Lucene, so you should check what arabic_normalization does in Lucene and whether it is necessary.
This is the ArabicNormalizer class description:
Normalizer for Arabic. Normalization is defined as:
- Normalization of hamza with alef seat to a bare alef.
- Normalization of teh marbuta to heh.
- Normalization of dotless yeh (alef maksura) to yeh.
- Removal of Arabic diacritics (the harakat).
- Removal of tatweel (stretching character).
(As someone who knows Persian) I think it is better for Persian indexing to run the Arabic normalizer first: Persian is written in a variant of the Arabic script, so artifacts covered by the Arabic normalizer, such as diacritics (harakat), tatweel, and hamza-seated alef forms, show up in Persian text as well.
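To see that in isolation, you can run both filter chains ad hoc through the _analyze API. In the sketch below, the sample word contains tatweel (stretching) characters, which persian_normalization leaves in place but arabic_normalization strips:

POST /_analyze
{
  "tokenizer": "standard",
  "filter": [ "persian_normalization" ],
  "text": "کتـــاب"
}

POST /_analyze
{
  "tokenizer": "standard",
  "filter": [ "arabic_normalization", "persian_normalization" ],
  "text": "کتـــاب"
}

The first request returns the token with the tatweel intact; the second normalizes it to کتاب.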

Related

Extend Elasticsearch's standard Analyzer with additional characters to tokenize on

I basically want the functionality of the inbuilt standard analyzer that additionally tokenizes on underscores.
Currently the standard analyzer will keep brown_fox_has as a single token, but I want [brown, fox, has] instead. The simple analyzer loses some functionality compared to the standard one, so I want to keep the standard behavior as much as possible.
The docs only show how to add filters and other non-tokenizer changes, but I want to keep all of the standard tokenizer's behavior while additionally splitting on underscores.
I could create a character filter to map _ to - and the standard tokenizer will do the job for me, but is there a better way?
es.indices.create(index="mine", body={
    "settings": {
        "analysis": {
            "analyzer": {
                "default": {
                    "type": "custom",
                    # "tokenize_on_chars": ["_"],  # I want this to work with the standard tokenizer, without using char_group
                    "tokenizer": "standard",
                    "filter": ["lowercase"]
                }
            }
        }
    }
})
res = es.indices.analyze(index="mine", body={
    "field": "text",
    "text": "the quick brown_fox_has to be split"
})
Use a mapping character filter along with the standard tokenizer to replace underscores with whitespace before tokenization (note that in the _analyze API, inline char filter definitions go inside an array):

POST /_analyze
{
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "_ => \\u0020"
      ]
    }
  ],
  "tokenizer": "standard",
  "text": "the quick brown_fox_has to be split"
}
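To make this the index's default behavior rather than an ad hoc _analyze call, the same char filter can be wired into the settings from the Python snippet above. A sketch, where the filter name underscore_to_space is just illustrative:

es.indices.create(index="mine", body={
    "settings": {
        "analysis": {
            "char_filter": {
                "underscore_to_space": {
                    "type": "mapping",
                    # map underscore to a plain space before tokenization
                    "mappings": ["_ => \\u0020"]
                }
            },
            "analyzer": {
                "default": {
                    "type": "custom",
                    "char_filter": ["underscore_to_space"],
                    "tokenizer": "standard",
                    "filter": ["lowercase"]
                }
            }
        }
    }
})

The standard tokenizer then sees "brown fox has" and splits it into [brown, fox, has] while keeping all of its other behavior.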

How to remove a stop word from the default _english_ stop-words list in Elasticsearch?

I am filtering text using the default English stop words. I found that 'and' is a stop word in English, but I need to search for results containing 'and'. I just want to remove the word 'and' from this default English stop-words filter and use the other stop words as usual. My Elasticsearch schema looks similar to the following.
"settings": {
"analysis": {
"analyzer": {
"default": {
"tokenizer": "whitespace" ,
"filter": ["stop_english"]
}
}....,
"filter":{
"stop_english": {
"type": "stop",
"stopwords": "_english_"
}
}
I expect to see the docs containing the word AND with the _search API.
You can set the stop words for a given index manually like this:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": ["and", "is", "the"]
        }
      }
    }
  }
}
I also found the list of English stop words used by Elasticsearch here. If you manually set the same list of stop words minus "and" in an index and reindex your data into the newly configured index, you should be good to go!
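A sketch of that configuration, assuming the standard Lucene English stop-word set with "and" taken out (the index and filter names are illustrative):

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "whitespace",
          "filter": ["stop_english_no_and"]
        }
      },
      "filter": {
        "stop_english_no_and": {
          "type": "stop",
          "stopwords": [
            "a", "an", "are", "as", "at", "be", "but", "by",
            "for", "if", "in", "into", "is", "it", "no", "not",
            "of", "on", "or", "such", "that", "the", "their",
            "then", "there", "these", "they", "this", "to",
            "was", "will", "with"
          ]
        }
      }
    }
  }
}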
Regarding reindexing your data, you should check out the reindex API. I believe it is required, since the tokenization of your data happens at ingestion time, so you need to redo the ingestion by reindexing. It is needed most of the time when changing index settings or making certain mapping changes.
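A minimal reindex call, assuming your existing index is called my_old_index and the newly configured one is my_index:

POST /_reindex
{
  "source": { "index": "my_old_index" },
  "dest": { "index": "my_index" }
}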

Re-using inbuilt language filters?

I saw the question here, which shows how one can create a custom analyzer to have both synonym support and support for languages.
However, it seems to create its own stemmer and stop-words collection as well.
What if I want to add synonyms to the inbuilt "danish" analyzer? Can I refer to the inbuilt Danish stemmer and stop-words filter? For example, are they just called danish_stemmer and danish_stopwords?
Perhaps a list of inbuilt filters would help - where can I see the names of these inbuilt filters?
For each pre-built language analyzer there is an example of how to rebuild it. For danish there is this example:
PUT /danish_example
{
  "settings": {
    "analysis": {
      "filter": {
        "danish_stop": {
          "type": "stop",
          "stopwords": "_danish_"
        },
        "danish_keywords": {
          "type": "keyword_marker",
          "keywords": ["eksempel"]
        },
        "danish_stemmer": {
          "type": "stemmer",
          "language": "danish"
        }
      },
      "analyzer": {
        "rebuilt_danish": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "danish_stop",
            "danish_keywords",
            "danish_stemmer"
          ]
        }
      }
    }
  }
}
This is essentially building your own custom analyzer.
The list of available stemmers can be found here. The list of available pre-built stopwords lists can be found here.
Hope that helps!
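Since the question was about adding synonyms, here is a sketch of how a synonym filter could be slotted into that rebuilt chain. The synonym pair is purely illustrative, and the keyword_marker filter is omitted for brevity; placing the synonym filter before the stemmer keeps synonym terms and document terms stemmed consistently:

PUT /danish_synonym_example
{
  "settings": {
    "analysis": {
      "filter": {
        "danish_stop": {
          "type": "stop",
          "stopwords": "_danish_"
        },
        "danish_synonyms": {
          "type": "synonym",
          "synonyms": [ "hund, vovse" ]
        },
        "danish_stemmer": {
          "type": "stemmer",
          "language": "danish"
        }
      },
      "analyzer": {
        "danish_with_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "danish_stop",
            "danish_synonyms",
            "danish_stemmer"
          ]
        }
      }
    }
  }
}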

What keywords are used by the Swedish analyzer?

On this part of the elasticsearch docs, it says that the Swedish analyzer can be reimplemented like this:
PUT /swedish_example
{
  "settings": {
    "analysis": {
      "filter": {
        "swedish_stop": {
          "type": "stop",
          "stopwords": "_swedish_"
        },
        "swedish_keywords": {
          "type": "keyword_marker",
          "keywords": ["exempel"]
        },
        "swedish_stemmer": {
          "type": "stemmer",
          "language": "swedish"
        }
      },
      "analyzer": {
        "swedish": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "swedish_stop",
            "swedish_keywords",
            "swedish_stemmer"
          ]
        }
      }
    }
  }
}
My question is, how does this analyser recognise keywords? Sure, the keywords can be defined in the settings.analysis.filter.swedish_keywords.keywords field, but what if I'm too lazy to do that? Does Elasticsearch look at some other keywords list of pre-defined Swedish keywords? Because in the example above it looks like there is no such list provided in the settings.
In other words, is it solely up to me to define keywords or does Elasticsearch look at some other list to find keywords by default?
Yes, you need to specify this list yourself. Otherwise, this filter won't do anything.
As per documentation of Elasticsearch:
Keyword Marker Token Filter
Protects words from being modified by stemmers. Must be placed before
any stemming filters.
Alternatively, you could specify:
keywords_path
A path (either relative to config location, or absolute) to a list of
words.
keywords_pattern
A regular expression pattern to match against words in the text.
More information about this filter - https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-keyword-marker-tokenfilter.html
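For example, instead of listing the keywords inline, you could point the filter at a file (the path below is illustrative, resolved relative to the Elasticsearch config directory):

"swedish_keywords": {
  "type": "keyword_marker",
  "keywords_path": "analysis/swedish_keywords.txt"
}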

Elasticsearch "simple_query_string" vs. "query_string" field analysis bug?

Recently we discovered that, since we aren't sanitizing search terms as they come into our system, we would get occasional parsing exceptions in Elasticsearch when special characters such as / (forward slash) were used with "query_string". So, we decided to switch to "simple_query_string". However, we discovered that the same analyzers do not appear to be used for each. I reviewed "When Analyzers Are Used" to see if it indicated there would be a difference between simple and regular query string, but it did not, so I'm wondering if this is a bug. For example:
"query_string": { "query": "sales", "fields": [ "title" ] }
will use the analyzer for the "title" field which is our "en_analyzer" (see definition below) and properly stem "sales" to "sale" and find the matching documents. Simply changing "query_string" to "simple_query_string" will not. We have to search for "sale" or add an analyzer to the query, like so:
"simple_query_string": { "query": "sales", "fields": [ "title" ], "analyzer": "en_analyzer" }
Of course, not all our fields are analyzed the same way, so the default behavior described in the documentation I referenced above makes perfect sense, and it's what we want. Is this a bug, or does "simple_query_string" simply not behave the same way with respect to field analysis during a query? We are using ES 1.7.2.
The relevant parts of our definition for "en_analyzer" are:
"en_analyzer": { "type": "custom", "tokenizer": "icu_tokenizer", "filter": [ "icu_normalizer", "en_stop_filter", "en_stem_filter", "icu_folding", "shingle_filter" ], "char_filter": [ "html_strip" ] }
with:
"en_stop_filter": { "type": "stop", "stopwords": [ "_english_" ] }, "en_stem_filter": { "type": "stemmer", "name": "minimal_english" }
Link to my same question on Github ... though I edited this one better after I asked on Github first. So far no response there.
In 1.7.2, simple_query_string will use the default standard analyzer when none is specified and won't use any search analyzer defined on the field being searched. When the documentation doesn't tell, one shall turn to the ultimate source of knowledge, i.e. the source code. In SimpleQueryStringParser.java, the class comment states:
analyzer: analyzer to be used for analyzing tokens to determine which kind of query they should be converted into, defaults to "standard"
And a bit further down in the same class, we can read:
Use standard analyzer by default
And that behavior hasn't changed in the ES 2.x releases. As can be seen in the source code for SimpleQueryStringBuilder.java, if no analyzer is specified in the query, then the standard analyzer is used.
Quoting a comment from the source linked above:
Use standard analyzer by default if none specified
So to answer your question, that's not a bug, but the intended behavior.
