Elasticsearch filename search not working with dots in filename - elasticsearch

I have elasticsearch mapping as follows:
{
  "info": {
    "properties": {
      "timestamp": { "type": "date", "format": "epoch_second" },
      "user": { "type": "keyword" },
      "filename": { "type": "text" }
    }
  }
}
When I try to do a match query on filename, it works properly when there is no dot in the search input, but when a dot is included, it returns many false results.
I learnt that the standard analyzer is the issue: it breaks the search input on dots and then searches on the pieces. Which analyzer can I use in this case? There can be millions of filenames, and I don't want something that takes a lot of memory and time. Please suggest.

As you are talking about filenames here, I would suggest using the keyword analyzer. It will not split the string and will index it as it is.
You could also just change your mapping from text to keyword instead.
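For illustration, here is how each option could look against the mapping from the question. This is only a sketch: the index name files is a placeholder, and the typed (info) mapping syntax assumes a pre-7.0 cluster.
PUT files
{
  "mappings": {
    "info": {
      "properties": {
        "timestamp": { "type": "date", "format": "epoch_second" },
        "user": { "type": "keyword" },
        "filename": { "type": "text", "analyzer": "keyword" }
      }
    }
  }
}
Or, with the second option, simply map the field as "filename": { "type": "keyword" }, which also skips analysis and additionally supports sorting and aggregations. With either variant, a match or term query compares against the whole filename, so dots are no longer split on.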

Related

Ignoring specific characters with Elasticsearch asciifolding

In my analyzer, I have added the asciifolding filter. In most cases this works very well, but when working with the danish language, I would like to not normalize the øæå characters, since "rød" and "rod" are very different words.
We are using a hosted Elastic Cloud cluster, so if possible I'd like a solution that does not require any non-standard deployments through the cloud platform.
Is there any way to do asciifolding, but whitelist certain characters?
Currently running on ES version 6.8
You should probably be using the ICU Folding Token Filter.
From the documentation:
Case folding of Unicode characters based on UTR#30, like the
ASCII-folding token filter on steroids.
It lets you do everything that the ASCII folding filter does, but in addition it allows you to exclude a range of characters through the unicodeSetFilter property.
In this case, you want to ignore æ,ø,å,Æ,Ø,Å:
"unicodeSetFilter": "[^æøåÆØÅ]"
Complete example:
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "danish_analyzer": {
            "tokenizer": "icu_tokenizer",
            "filter": [
              "danish_folding",
              "lowercase"
            ]
          }
        },
        "filter": {
          "danish_folding": {
            "type": "icu_folding",
            "unicodeSetFilter": "[^æøåÆØÅ]"
          }
        }
      }
    }
  }
}
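To sanity-check the behaviour, an _analyze request against the sample index above should keep ø intact while still folding other accents (the sample text is just an illustration):
GET icu_sample/_analyze
{
  "analyzer": "danish_analyzer",
  "text": "rød café"
}
This should come back with tokens along the lines of rød and cafe: é is folded to e, while ø is left alone because it is excluded by the unicodeSetFilter.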
You are already using the ASCII folding token filter, but since it is a token filter it cannot selectively skip certain characters, as the analysis process consists of three sequential steps:
char filter (here you can filter or replace certain chars)
tokenizer (this step generates the tokens)
token filter (can modify the tokens generated by the tokenizer)
There is no out-of-the-box solution which could efficiently address your issue (not normalizing only a few chars).
The Definitive Guide to Elasticsearch has an article on this.
You can use the preserve_original parameter on the token filter, which keeps the original token at the same position as the folded one, but this comes with relevance issues and does not reliably give an exact match on the original word.
Hence the same book advises indexing the original text in a separate field and then using a multi_match query with most_fields; see the sketch below and the book chapter for more detail.
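A rough sketch of that multi-field approach, assuming an ES 6.x cluster with a single _doc type; the index, field, and analyzer names are placeholders, not from the question:
PUT danish_sample
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folded": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "asciifolding" ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "standard",
          "fields": {
            "folded": { "type": "text", "analyzer": "folded" }
          }
        }
      }
    }
  }
}

GET danish_sample/_search
{
  "query": {
    "multi_match": {
      "query": "rød",
      "type": "most_fields",
      "fields": [ "title", "title.folded" ]
    }
  }
}
Documents containing the literal rød match on both title and title.folded and therefore score higher than documents that only contain rod, which match on the folded sub-field alone.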

Elasticsearch - text type regexp

Does elasticsearch support regex search on text type string?
I created a document like below.
{
  "T": "a$b$c$d"
}
and I tried to search for this document with the query below.
{
  "query": {
    "query_string": {
      "query": "T:/a.*/"
    }
  }
}
It seems to work for me, BUT when I try to query with the '$' symbol, it's unable to find the document.
{
  "query": {
    "query_string": {
      "query": "T:/a$.*/"
    }
  }
}
What should I do to find the document? This field should be of text type (not keyword), since its value can be longer than the keyword max length.
You should be aware of some things, here:
If your field is analyzed (and tokenized in the process) you will only find matches in fields containing a token (not the whole "text") that matches your RegExp. If you want the whole content of the field to match, you must use a keyword field or at least a Keyword Analyzer that doesn't tokenize your text.
The $ symbol has a special meaning in Regular Expressions (it marks the end of a string), so you'll have to escape it: a\$.*
Your RegExp must match a whole token to get a hit. That's why there's no point in using $ as a (non-escaped) RegExp symbol: your RegExp must match a whole token from beginning to end anyway. So (to stick to your example) to match fields where a is followed by c, you'd need .*?a[^c]*c.*, or if you need the $s in there, escape them: .*?a\$[^c]*c\$.*
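A sketch of a setup along those lines, assuming a 7.x-style (typeless) mapping and using the built-in keyword analyzer so the whole value stays a single token; the index name is illustrative:
PUT regex_sample
{
  "mappings": {
    "properties": {
      "T": { "type": "text", "analyzer": "keyword" }
    }
  }
}

GET regex_sample/_search
{
  "query": {
    "query_string": {
      "query": "T:/a\\$.*/"
    }
  }
}
With the keyword analyzer the whole value a$b$c$d is indexed as one token, so the always-anchored Lucene pattern can match it end to end. The $ is escaped twice: once for the regular expression and once more because of JSON string escaping.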

How do you search for exact terms (which may include special characters) with trailing/leading wildcard matching in Elasticsearch?

I am trying to figure out how to create Elasticsearch queries that allow for exact matches containing reserved characters while supporting trailing or leading wildcard expansion. I am using logstash dynamic templates, which automatically also create a raw field for each of my fields.
To sum up as concisely as possible, I want to create queries that can support two generic types of matching across all values:
Searching for terms such as 'abc' to return results like 'abc.xyz.com'. In this case, the standard analyzer keeps 'abc.xyz.com' as a single token, and wildcard matching can succeed using the following command:
{
  "query": {
    "wildcard": {
      "_all": "*abc*"
    }
  }
}
Searching for terms such as full paths like '/Intel/1938138191(1).zip' to return results like 'C:/Program Files (x86)/Intel/1938138191(1).zip'. In this case, even if I backslash-escape all of the reserved characters, a wildcard match like
{
  "query": {
    "wildcard": {
      "_all": "*/Intel/1938138191(1).zip*"
    }
  }
}
will not work. This is because _all defaults to using the standard analyzer, so the path is split into tokens and an exact match cannot be made. However, if I specifically query the raw field as below (whether or not I escape the special characters), I get the correct result:
{
  "query": {
    "wildcard": {
      "field.raw": "*/Intel/1938138191(1).zip*"
    }
  }
}
So my question is: is there any way to support wildcard queries across both the fields analyzed by the standard analyzer and the raw fields which are not analyzed at all, in one query? That is, some way of generically encapsulating the searched terms so that in both of my examples above I would get the correct result? For reference, I am using Elasticsearch version 1.7. I have also tried looking into query string matching and term matching, all to no avail.

Elastic search query string regex

I am having an issue querying a field (title) using a query_string regex.
This works: "title:/test/"
This does not : "title:/^test$/"
However they mention it is supported https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-regexp-query.html#regexp-syntax
My goal is to do an exact match, but this match should not be partial; it should match the whole field value.
Does anybody have an idea what might be wrong here?
From the documentation
The Lucene regular expression engine is not Perl-compatible but supports a smaller range of operators.
You are using anchors ^ and $, which are not supported because there is no need for that, again from the docs
Lucene’s patterns are always anchored. The pattern provided must match the entire string
If you are looking for a phrase-query kind of match, you could use double quotes like this:
{
  "query": {
    "query_string": {
      "default_field": "title",
      "query": "\"test phrase\""
    }
  }
}
but this would also match documents with a title like "test phrase someword".
If you want an exact match, you should look at term queries: either set your title field mapping to "index": "not_analyzed", or use the keyword analyzer with a lowercase filter for a case-insensitive match. Your query would look like this:
{
  "query": {
    "term": {
      "title": {
        "value": "my title"
      }
    }
  }
}
This will give you an exact match.
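A sketch of the mapping side for the case-insensitive variant, assuming a pre-5.x cluster where string fields and not_analyzed are still in use; the index and analyzer names are illustrative:
PUT titles
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "title": { "type": "string", "analyzer": "lowercase_keyword" }
      }
    }
  }
}
The whole title is indexed as a single lowercased token, so a term query for the lowercased value ("my title" rather than "My Title") gives a case-insensitive exact match on the full field value.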
Usually in regex the ^ and $ symbols are used to indicate that the text should be located at the start/end of the string. This is called anchoring. Lucene regex patterns are anchored by default.
So the pattern "test" with Elasticsearch is the equivalent of "^test$" in say Java.
You have to work to "unanchor" your pattern, for example by using "te.*" to match "test", "testing" and "teeth". Because the pattern "test" would only match "test".
Note that this requires that the field is not analyzed and also note that it has terrible performance. For exact match use a term filter as described in the answer by ChintanShah25.

Search keyword using double quotes to get exact match in elasticsearch

If a user searches by putting quotes around a keyword, like "flowers and mulch", then only exact matches should be displayed.
I tried using query_string, which is almost working, but I'm not satisfied with the results.
Can anyone help me out, please?
{
  "query": {
    "query_string": {
      "fields": ["body"],
      "query": "\"flowers and mulch\""
    }
  }
}
You should be using match_phrase for exact matches of phrases:
{
  "query": {
    "match_phrase": {
      "body": "flowers and mulch"
    }
  }
}
Phrase matching
In the same way that the match query is the “go-to” query for standard
full text search, the match_phrase query is the one you should reach
for when you want to find words that are near to each other.
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/phrase-matching.html
As I put in a comment on the question, it would help to know what the OP found unsatisfying about query_string. I would recommend using query_string for these cases. Note that there are multiple options that can be set, such as auto_generate_phrase_queries, split_on_whitespace, or quote_field_suffix (example: here), which makes it quite versatile.
A case like one "two three" could be addressed using the default parameters of query_string.
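For example, a sketch of a query_string request that mixes an unquoted term with a quoted phrase, using the body field from the question (everything else is illustrative):
GET /_search
{
  "query": {
    "query_string": {
      "fields": ["body"],
      "query": "one \"two three\""
    }
  }
}
The quoted part is run as a phrase query while the unquoted term is matched normally. If truly exact matching of the quoted part is needed, quote_field_suffix can point quoted text at a differently analyzed sub-field, for instance one using the keyword analyzer.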
