Ignoring specific characters with Elasticsearch asciifolding - elasticsearch

In my analyzer, I have added the asciifolding filter. In most cases this works very well, but when working with the Danish language, I would like to not normalize the øæå characters, since "rød" and "rod" are very different words.
We are using the hosted Elastic Cloud cluster, so if possible I'd prefer a solution that does not require any non-standard deployments through the cloud platform.
Is there any way to do asciifolding, but whitelist certain characters?
Currently running on ES version 6.8

You should probably be using the ICU Folding Token Filter.
From the documentation:
Case folding of Unicode characters based on UTR#30, like the
ASCII-folding token filter on steroids.
It lets you do everything the ASCII folding filter does, but in addition it allows you to exclude a set of characters from folding through the unicodeSetFilter property.
In this case, you want to ignore æ,ø,å,Æ,Ø,Å:
"unicodeSetFilter": "[^æøåÆØÅ]"
Complete example:
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "danish_analyzer": {
            "tokenizer": "icu_tokenizer",
            "filter": [
              "danish_folding",
              "lowercase"
            ]
          }
        },
        "filter": {
          "danish_folding": {
            "type": "icu_folding",
            "unicodeSetFilter": "[^æøåÆØÅ]"
          }
        }
      }
    }
  }
}
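A quick way to sanity-check the behaviour, assuming the analysis-icu plugin is enabled on the cluster (Elastic Cloud offers it as an official plugin), is the _analyze API; the sample text below is purely illustrative:

GET icu_sample/_analyze
{
  "analyzer": "danish_analyzer",
  "text": "rød rod café"
}

With the filter above, rød should come back unfolded while café is folded to cafe.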

You are already using the ASCII folding token filter, but since it is a token filter it cannot simply skip certain characters; the analysis process consists of the three sequential steps below:
char filter (here you can filter or replace certain characters)
tokenizer (this step generates the tokens)
token filter (this can modify the tokens generated by the tokenizer)
There is no out-of-the-box option that addresses your issue efficiently (normalizing everything except a few characters).
Referring to the Definitive Guide to Elasticsearch chapter on this:
you can use the preserve_original parameter on the token filter, which keeps the original token at the same position as the folded one, but this hurts relevance and makes it hard to get an exact match on the original word.
Hence the same book advises indexing the original word in a separate field and then searching with a multi_match query of type most_fields.
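A minimal sketch of that multi-field approach for ES 6.8 (index, field and analyzer names here are illustrative, not taken from the question): the unfolded text stays in the main field, an ASCII-folded copy goes into a subfield, and most_fields ranks documents matching both fields higher:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folded": {
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "standard",
          "fields": {
            "folded": {
              "type": "text",
              "analyzer": "folded"
            }
          }
        }
      }
    }
  }
}

GET my_index/_search
{
  "query": {
    "multi_match": {
      "type": "most_fields",
      "query": "rød",
      "fields": ["title", "title.folded"]
    }
  }
}

A document containing rød matches on both fields and therefore outranks a document that only contains rod.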

Related

Elastic search filename search not working with dots in filename

I have an Elasticsearch mapping as follows:
{
  "info": {
    "properties": {
      "timestamp": { "type": "date", "format": "epoch_second" },
      "user": { "type": "keyword" },
      "filename": { "type": "text" }
    }
  }
}
When I try a match query on filename, it works properly when I don't include a dot in the search input, but when a dot is included, it returns many false results.
I learnt that the standard analyzer is the issue: it breaks the search input on dots and then searches on the pieces. What analyzer can I use in this case? The filenames can run into the millions and I don't want something that takes a lot of memory and time. Please suggest.
As you are talking about filenames here, I would suggest using the keyword analyzer. This will not split the string and will just index it as it is.
You could also just change your mapping from text to keyword instead.
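A sketch of what that mapping change could look like (field names come from the question; this assumes a 5.x+ cluster where the keyword field type exists):

PUT my_index
{
  "mappings": {
    "info": {
      "properties": {
        "timestamp": { "type": "date", "format": "epoch_second" },
        "user": { "type": "keyword" },
        "filename": { "type": "keyword" }
      }
    }
  }
}

With a keyword field the whole filename is indexed as one token, so an exact lookup becomes a term query on filename rather than a match query.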

How do you search for exact terms (which may include special characters) with trailing/leading wildcard matching in Elasticsearch?

I am trying to figure out how to create Elasticsearch queries that allow for exact matches containing reserved characters while supporting trailing or leading wildcard expansion. I am using logstash dynamic templates which automatically also creates a raw field for each of my terms.
To sum up as concisely as possible, I want to create queries that can support two generic types of matching across all values:
Searching terms such as 'abc' to return results like 'abc.xyz.com'. In this case the standard analyzer keeps 'abc.xyz.com' as a single token, and wildcard matching can succeed using the following command:
{
  "query": {
    "wildcard": {
      "_all": "*abc*"
    }
  }
}
Searching terms such as fullpaths like '/Intel/1938138191(1).zip' to return results like 'C:/Program Files (x86)/Intel/1938138191(1).zip'. In this case, even if I backslash all of the reserved characters, doing a wildcard match like
{
  "query": {
    "wildcard": {
      "_all": "*/Intel/1938138191(1).zip*"
    }
  }
}
will not work. And this is because _all defaults to using the standard analyzer, so the path will be split and an exact match cannot be made. However, if I SPECIFICALLY query the raw field like below (both when I escape / do not escape the special characters), I get the correct result:
{
  "query": {
    "wildcard": {
      "field.raw": "*/Intel/1938138191(1).zip*"
    }
  }
}
So my question is: is there any way to run wildcard queries across both the tokens produced by the standard analyzer and the raw fields that are not analyzed at all, in one query? That is, some way of generically wrapping the searched terms so that both of my examples above return the correct result? For reference I am using Elasticsearch version 1.7. I have also tried query string matching and term matching, all to no avail.
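One pattern that may help (purely a sketch using the field names from the question, not a verified answer for 1.7) is to combine both wildcards in a bool/should, so either the analyzed _all field or the non-analyzed raw subfield can produce the hit:

{
  "query": {
    "bool": {
      "should": [
        { "wildcard": { "_all": "*abc*" } },
        { "wildcard": { "field.raw": "*abc*" } }
      ],
      "minimum_should_match": 1
    }
  }
}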

Is there a stemmer for elasticsearch that can change "broken" to "break"

Here is what I'd like the stemmer to do:
breaking: break
broke: break
broken: break
entering: enter
entered: enter
enter: enter
I've indexed the field as follows:
"body": {
"type": "text",
"fields": {
"stemmed": {
"type": "text",
"analyzer": "english"
}
}
}
When I query “breaking and entering”, I can see that what is searched for in the body.stemmed field is: "break and enter". Seems good.
However, when I query “broke and entered”, I get: “broke and enter”. Thus, apparently, “broke” does not become “break” when the "english" stemmer is used.
Likewise, “broken and entered” becomes: “broken and enter”. So, ES apparently does not change either “broke” or “broken” to “break” (which, according to this: snowball, I guess explains why if this is what is used).
So, is there a way to specify a "known" stemmer that will accomplish what I'm trying to do?
Your requirement can be fulfilled by a dictionary stemmer, which looks words up in a dictionary to find their root form. Algorithmic stemmers have no knowledge of the root words; they simply apply rules.
Look at the Hunspell stemmer; I think it will do the job:
https://www.elastic.co/guide/en/elasticsearch/guide/current/hunspell.html
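A minimal sketch of a Hunspell-based analyzer (this assumes the en_US dictionary and affix files have been placed under the node's config/hunspell/en_US directory; index and filter names are illustrative):

PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "en_US_hunspell": {
          "type": "hunspell",
          "locale": "en_US",
          "dedup": true
        }
      },
      "analyzer": {
        "en_hunspell": {
          "tokenizer": "standard",
          "filter": ["lowercase", "en_US_hunspell"]
        }
      }
    }
  }
}

Whether broke and broken actually stem to break then depends on the dictionary files, so it is worth checking the output with the _analyze API before switching the body.stemmed subfield over to this analyzer.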

Elastic search query string regex

I am having an issue querying a field (title) using a query string regex.
This works: "title:/test/"
This does not : "title:/^test$/"
However they mention it is supported https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-regexp-query.html#regexp-syntax
My goal is to do an exact match, but the match should not be partial; it should match the whole field value.
Does anybody have an idea what might be wrong here?
From the documentation
The Lucene regular expression engine is not Perl-compatible but supports a smaller range of operators.
You are using the anchors ^ and $, which are not supported because there is no need for them; again from the docs:
Lucene’s patterns are always anchored. The pattern provided must match the entire string
If you are looking for the phrase query kind of match you could use double quotes like this
{
  "query": {
    "query_string": {
      "default_field": "title",
      "query": "\"test phrase\""
    }
  }
}
but this would also match documents with title like test phrase someword
If you want an exact match, you should look at term queries: either map your title field with "index": "not_analyzed", or use a custom analyzer built from the keyword tokenizer with a lowercase filter for a case-insensitive match. Your query would look like this:
{
  "query": {
    "term": {
      "title": {
        "value": "my title"
      }
    }
  }
}
This will give you an exact match.
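For the case-insensitive variant mentioned above, a sketch (pre-5.x style mapping, with illustrative index and type names) is a custom analyzer built from the keyword tokenizer plus a lowercase filter, so the whole title is indexed as one lowercased token:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "lowercase_keyword"
        }
      }
    }
  }
}

Since term queries are not analyzed, the search value should be lowercased by the client; a term query for my title will then match a document indexed as My Title.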
Usually in regex the ^ and $ symbols are used to indicate that the text should be located at the start/end of the string. This is called anchoring. Lucene regex patterns are anchored by default.
So the pattern "test" in Elasticsearch is the equivalent of "^test$" in, say, Java.
You have to work to "unanchor" your pattern, for example by using "te.*" to match "test", "testing" and "teeth", because the pattern "test" would only match "test".
Note that this requires that the field is not analyzed, and also note that it has terrible performance. For an exact match use a term filter as described in the answer by ChintanShah25.

Search keyword using double quotes to get exact match in elasticsearch

If the user searches with quotes around a keyword, like "flowers and mulch", then exact matches should be displayed.
I tried using query_string, which is almost working, but I am not satisfied with the results.
Can anyone help me out, please?
{
  "query": {
    "query_string": {
      "fields": ["body"],
      "query": "\"flowers and mulch\""
    }
  }
}
You should be using match_phrase for exact matches of phrases:
{
  "query": {
    "match_phrase": {
      "body": "flowers and mulch"
    }
  }
}
Phrase matching
In the same way that the match query is the “go-to” query for standard
full text search, the match_phrase query is the one you should reach
for when you want to find words that are near to each other.
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/phrase-matching.html
As I put in the comment on the question, knowing what the OP found unsatisfying about query_string would be great. I would recommend using query_string for these cases. Note that there are multiple options that can be set, such as auto_generate_phrase_queries, split_on_whitespace, or quote_field_suffix, which makes it quite versatile.
A query like one "two three" could be addressed using the default parameters of query_string.
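For instance, quote_field_suffix lets the quoted part of the query hit a differently analyzed subfield; the body.exact subfield below is an assumption, not something from the question, and would need to exist in the mapping:

{
  "query": {
    "query_string": {
      "fields": ["body"],
      "quote_field_suffix": ".exact",
      "query": "\"flowers and mulch\""
    }
  }
}

Unquoted terms still go through body, while the quoted phrase is run against body.exact.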
