Search with optional punctuation - elasticsearch

I'm trying to configure Elasticsearch to search for a word with optional punctuation, but I don't know how to do this with my settings.
For example, when I search for the word "computer", I would also like the query to match the following variants: "computer.", "(computer", "computer)", "computer,"...

To achieve this you should understand how Elasticsearch mappings and text fields work. A text field is analyzed with the analyzer you set in the mapping, and that analyzer turns your text into terms (tokens). For example, if you run the standard analyzer on the text "This computer is fast. (Computer)", this is the result:
GET _analyze
{
  "analyzer": "standard",
  "text": "This computer is fast. (Computer)"
}
Result:
{
"tokens" : [
{
"token" : "this",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "computer",
"start_offset" : 5,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "is",
"start_offset" : 14,
"end_offset" : 16,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "fast",
"start_offset" : 17,
"end_offset" : 21,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "computer",
"start_offset" : 24,
"end_offset" : 32,
"type" : "<ALPHANUM>",
"position" : 4
}
]
}
As you can see in the result, punctuation is removed at indexing time, so when you search for "computer" with a match query you will get all of those document variants back. For example:
POST _search
{
  "query": {
    "match": {
      "your_field": "Computer"
    }
  }
}
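The same thing happens at query time: the match query runs the search text through the field's analyzer, so punctuation you type in the query is stripped as well. You can confirm this with _analyze on one of the punctuated variants; with the standard analyzer it produces the single token computer, which is why "computer.", "(computer" or "computer," will still match documents containing computer:
GET _analyze
{
  "analyzer": "standard",
  "text": "(computer),"
}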
You can check the full Elasticsearch query DSL in the official documentation.

Related

Why do elasticsearch queries require a certain number of characters to return results?

It seems like there is a minimum number of characters needed to get results from Elasticsearch for a specific property I am searching. It is called 'guid' and has the following configuration:
"guid": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
I have a document with the following GUID: 3e49996c-1dd8-4230-8f6f-abe4236a6fc4
The following query returns the document as expected:
{"match":{"query":"9996c-1dd8*","fields":["guid"]}}
However this query does not:
{"match":{"query":"9996c-1dd*","fields":["guid"]}}
I have the same result with multi_match and query_string queries. I haven't been able to find anything in the documentation about a character minimum, so what is happening here?
Elasticsearch does not require a minimum number of characters. What matters are the tokens that get generated.
An exercise that helps in understanding this is to use the _analyze API to see your index's tokens.
GET index_001/_analyze
{
  "field": "guid",
  "text": [
    "3e49996c-1dd8-4230-8f6f-abe4236a6fc4"
  ]
}
You pass in the term 3e49996c-1dd8-4230-8f6f-abe4236a6fc4.
Look at how the tokens come out:
"tokens" : [
{
"token" : "3e49996c",
"start_offset" : 0,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "1dd8",
"start_offset" : 9,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "4230",
"start_offset" : 14,
"end_offset" : 18,
"type" : "<NUM>",
"position" : 2
},
{
"token" : "8f6f",
"start_offset" : 19,
"end_offset" : 23,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "abe4236a6fc4",
"start_offset" : 24,
"end_offset" : 36,
"type" : "<ALPHANUM>",
"position" : 4
}
]
When you perform the search, the same analyzer used at indexing time is applied to the search text.
So when you search for the term "9996c-1dd8*":
GET index_001/_analyze
{
  "field": "guid",
  "text": [
    "9996c-1dd8*"
  ]
}
The generated tokens are:
{
"tokens" : [
{
"token" : "9996c",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "1dd8",
"start_offset" : 6,
"end_offset" : 10,
"type" : "<ALPHANUM>",
"position" : 1
}
]
}
Note that the inverted index contains the token 1dd8, and the term "9996c-1dd8*" also generated the token "1dd8", so the match took place.
When you test with the term "9996c-1dd*", no tokens match, so there are no results.
GET index_001/_analyze
{
  "field": "guid",
  "text": [
    "9996c-1dd*"
  ]
}
Tokens:
{
"tokens" : [
{
"token" : "9996c",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "1dd",
"start_offset" : 6,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 1
}
]
}
Token "1dd" is not equal to "1dd8".

Query in Kibana doesn't return logs with Regexp

I have a field named log.file.path in Elasticsearch and it contains the value /var/log/dev-collateral/uaa.2020-09-26.log. I tried to retrieve all logs whose log.file.path field starts with /var/log/dev-collateral/uaa.
I used the regexp below, but it doesn't work:
{
  "regexp": {
    "log.file.path": "/var/log/dev-collateral/uaa.*"
  }
}
Let's see why it is not working. I've indexed two documents using the Kibana UI as shown below:
PUT myindex/_doc/1
{
"log.file.path" : "/var/log/dev-collateral/uaa.2020-09-26.log"
}
PUT myindex/_doc/2
{
"log.file.path" : "/var/log/dev-collateral/uaa.2020-09-26.txt"
}
When I try to see the tokens generated for the text of the log.file.path field using the _analyze API:
POST _analyze
{
  "text": "/var/log/dev-collateral/uaa.2020-09-26.log"
}
It gives me,
{
"tokens" : [
{
"token" : "var",
"start_offset" : 1,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "log",
"start_offset" : 5,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "dev",
"start_offset" : 9,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "collateral",
"start_offset" : 13,
"end_offset" : 23,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "uaa",
"start_offset" : 24,
"end_offset" : 27,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "2020",
"start_offset" : 28,
"end_offset" : 32,
"type" : "<NUM>",
"position" : 5
},
{
"token" : "09",
"start_offset" : 33,
"end_offset" : 35,
"type" : "<NUM>",
"position" : 6
},
{
"token" : "26",
"start_offset" : 36,
"end_offset" : 38,
"type" : "<NUM>",
"position" : 7
},
{
"token" : "log",
"start_offset" : 39,
"end_offset" : 42,
"type" : "<ALPHANUM>",
"position" : 8
}
]
}
You can see that Elasticsearch split your input text into tokens when you indexed it. That's because Elasticsearch uses the standard analyzer by default when indexing documents: it splits the text into small parts (tokens), removes punctuation, lowercases the text, and so on. That's why your current regexp query doesn't work.
GET myindex/_search
{
  "query": {
    "match": {
      "log.file.path": "var"
    }
  }
}
If you try it this way it will work, but in your case you need to match every log.file.path that starts with /var/log/dev-collateral/uaa. So what to do now? Just don't analyze the field when indexing: the keyword type stores the string you provide exactly as it is.
Create a mapping with the keyword type:
PUT myindex2/
{
  "mappings": {
    "properties": {
      "log.file.path": {
        "type": "keyword"
      }
    }
  }
}
Index documents,
PUT myindex2/_doc/1
{
"log.file.path" : "/var/log/dev-collateral/uaa.2020-09-26.log"
}
PUT myindex2/_doc/2
{
"log.file.path" : "/var/log/dev-collateral/uaa.2020-09-26.txt"
}
Search with regexp,
GET myindex2/_search
{
  "query": {
    "regexp": {
      "log.file.path": "/var/log/dev-collateral/uaa.2020-09-26.*"
    }
  }
}
I used this query and it works!
{
  "query": {
    "regexp": {
      "log.file.path.keyword": {
        "value": "/var/log/dev-collateral/uaa.*",
        "flags": "ALL",
        "max_determinized_states": 10000,
        "rewrite": "constant_score"
      }
    }
  }
}
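That query targets log.file.path.keyword, which exists because Elasticsearch's default dynamic mapping indexes an unmapped string both as an analyzed text field and as an unanalyzed keyword sub-field. If you want to confirm that the sub-field is there before relying on it (just a sanity check, not part of the original answer), you can inspect the mapping:
GET myindex/_mapping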

Elastic Search issue with simple_query_string

I could not find a good solution for the following situation; I hope someone can help here.
Suppose the following documents are present for a type called "groups":
{"name": "Orders 1.0"}
{"name": "Reports & Analysis 1.0"}
{"name": "Rebates 1.0"}
When I search documents using the simple_query_string below, I get all three records instead of only one record (i.e. #1):
{
  "size" : 20,
  "query" : {
    "bool" : {
      "must" : [
        {
          "bool" : {
            "should" : [
              {
                "simple_query_string" : {
                  "query" : "Orders 1.0*",
                  "fields" : [
                    "name^1.0"
                  ],
                  "flags" : -1,
                  "default_operator" : "or",
                  "lenient" : false,
                  "analyze_wildcard" : true,
                  "boost" : 1.0
                }
              }
            ],
            "disable_coord" : false,
            "adjust_pure_negative" : true,
            "boost" : 1.0
          }
        }
      ],
      "disable_coord" : false,
      "adjust_pure_negative" : true,
      "boost" : 1.0
    }
  }
}
I want only one record to be returned: the one with the name Orders 1.0.
The reason is this setting: "default_operator" : "or".
I assume that the field name is an analyzed field. If it uses an analyzer, then it will be tokenized. The same goes for your input. So you have the following (for the standard analyzer):
Orders 1.0 -> ["orders", "1.0"]
Reports & Analysis 1.0 -> ["reports", "analysis", "1.0"]
Rebates 1.0 -> ["rebates", "1.0"]
GET _analyze
{
  "analyzer": "standard",
  "text": ["Orders 1.0", "Reports & Analysis 1.0", "Rebates 1.0"]
}
{
"tokens" : [
{
"token" : "orders",
"start_offset" : 0,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "1.0",
"start_offset" : 7,
"end_offset" : 10,
"type" : "<NUM>",
"position" : 1
},
{
"token" : "reports",
"start_offset" : 11,
"end_offset" : 18,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "analysis",
"start_offset" : 21,
"end_offset" : 29,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "1.0",
"start_offset" : 30,
"end_offset" : 33,
"type" : "<NUM>",
"position" : 4
},
{
"token" : "rebates",
"start_offset" : 34,
"end_offset" : 41,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "1.0",
"start_offset" : 42,
"end_offset" : 45,
"type" : "<NUM>",
"position" : 6
}
]
}
Since the default operator is OR, only one of the analyzed tokens needs to match, not all of them. And since you've included 1.0, which is present in all of the documents, the query matches all of them. One solution is to change the default operator to AND, as in the sketch below.
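A minimal sketch of the adjusted query, with the outer bool wrappers stripped for brevity; with "default_operator" : "and", both the orders and 1.0 tokens have to match, so only the "Orders 1.0" document is returned:
GET _search
{
  "size" : 20,
  "query" : {
    "simple_query_string" : {
      "query" : "Orders 1.0*",
      "fields" : [ "name^1.0" ],
      "analyze_wildcard" : true,
      "default_operator" : "and"
    }
  }
}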

using word_delimiter with edgeNGram ignores Word_Delimiter Token

I have my custom analyzer as below, but I don't understand how to achieve my goal.
My goal is to have a whitespace-separated inverted index, but also to have an autocomplete feature once the user has entered a minimum of 3 characters. For that I thought of combining the word_delimiter and edgeNGram token filters as below:
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "whitespace",
            "filter": [
              "standard",
              "lowercase",
              "my_word_delimiter",
              "my_edge_ngram_analyzer"
            ],
            "type": "custom"
          }
        },
        "filter": {
          "my_word_delimiter": {
            "catenate_all": true,
            "type": "word_delimiter"
          },
          "my_edge_ngram_analyzer": {
            "min_gram": 3,
            "max_gram": 10,
            "type": "edgeNGram"
          }
        }
      }
    }
  }
}
This gives the result below for "Brother TN-200". But I was expecting "tn" to also be in the inverted index, since I have the word_delimiter token filter. Why is it not in the inverted index? How can I achieve this?
curl -XGET "localhost:9200/myIndex/_analyze?analyzer=my_analyzer&pretty=true" -d "Brother TN-200"
{
  "tokens" : [ {
"token" : "bro",
"start_offset" : 14,
"end_offset" : 21,
"type" : "word",
"position" : 2
}, {
"token" : "brot",
"start_offset" : 14,
"end_offset" : 21,
"type" : "word",
"position" : 2
}, {
"token" : "broth",
"start_offset" : 14,
"end_offset" : 21,
"type" : "word",
"position" : 2
}, {
"token" : "brothe",
"start_offset" : 14,
"end_offset" : 21,
"type" : "word",
"position" : 2
}, {
"token" : "brother",
"start_offset" : 14,
"end_offset" : 21,
"type" : "word",
"position" : 2
}, {
"token" : "tn2",
"start_offset" : 22,
"end_offset" : 28,
"type" : "word",
"position" : 3
}, {
"token" : "tn20",
"start_offset" : 22,
"end_offset" : 28,
"type" : "word",
"position" : 3
}, {
"token" : "tn200",
"start_offset" : 22,
"end_offset" : 28,
"type" : "word",
"position" : 3
}, {
"token" : "200",
"start_offset" : 25,
"end_offset" : 28,
"type" : "word",
"position" : 4
}]
}
UPDATE:
Of course, if I use "min_gram": 2, "tn" will be in the inverted index, but I don't want this, because if any other word contains "tn" inside it, it will appear in the result list.
For example, take the "hp" keyword: I get products for "Hewlett Packard" because my products are named like "hp xxx", but I also get a product called "tech hpc". I don't want that product to be displayed until I type "hpc". That's the reason I set 3.
If I don't use the edgeNGram filter but only word_delimiter, "tn" is in the inverted index, since Brother TN-200 is indexed as brother, tn and 200. That's why I expected word_delimiter to put "tn" in the inverted index. Does it have no effect if I use it together with edgeNGram?
In my_edge_ngram_analyzer the min_gram setting is 3; as a result, any token shorter than 3 codepoints will not show up.
You would need to set this to 2 if you want TN to show up.
Example:
get <my_index>/_analyze?tokenizer=whitespace&filters=my_edge_ngram_analyzer&text=TN
The above call would return 0 tokens.
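If you need both behaviors at once (exact whitespace/word_delimiter tokens such as tn, plus 3-character edge n-grams for autocomplete), one common approach is to index the field twice via a multi-field, one sub-field per analyzer. This is not part of the original answer, just a sketch assuming a recent Elasticsearch version and a hypothetical title field; older versions spell the filter edgeNGram and require a mapping type:
PUT myindex
{
  "settings": {
    "analysis": {
      "filter": {
        "my_word_delimiter": { "type": "word_delimiter", "catenate_all": true },
        "my_edge_ngram": { "type": "edge_ngram", "min_gram": 3, "max_gram": 10 }
      },
      "analyzer": {
        "plain_words": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [ "lowercase", "my_word_delimiter" ]
        },
        "autocomplete": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [ "lowercase", "my_word_delimiter", "my_edge_ngram" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "plain_words",
        "fields": {
          "autocomplete": {
            "type": "text",
            "analyzer": "autocomplete"
          }
        }
      }
    }
  }
}
A query on title then matches the exact tokens (brother, tn, 200), while a query on title.autocomplete matches the 3+ character prefixes.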

Elasticsearch analyzer ignored from index settings and only working when specified directly in query

I'm trying to change the haystack default settings to something very simple:
'settings': {
"analyzer": "spanish"
}
It looks right after rebuilding the index:
$ curl -XGET 'http://localhost:9200/haystack/_settings?pretty=true'
{
  "haystack" : {
    "settings" : {
      "index.analyzer" : "spanish",
      "index.number_of_shards" : "5",
      "index.number_of_replicas" : "1",
      "index.version.created" : "191199"
    }
  }
}
But when testing it with some stop words it doesn't work as expected: it should filter out "esto" and "que", but instead it filters out "is" and "a" from the English stop words:
$ curl -XGET 'localhost:9200/haystack/_analyze?text=esto+is+a+test+que&pretty=true'
{
"tokens" : [ {
"token" : "esto",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "test",
"start_offset" : 10,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 4
}, {
"token" : "que",
"start_offset" : 15,
"end_offset" : 18,
"type" : "<ALPHANUM>",
"position" : 5
} ]
}
And it only works when I specify the analyzer explicitly in the query:
$ curl -XGET 'localhost:9200/haystack/_analyze?text=esto+is+a+test+que&analyzer=spanish&pretty=true'
{
"tokens" : [ {
"token" : "is",
"start_offset" : 5,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 2
}, {
"token" : "test",
"start_offset" : 10,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 4
} ]
}
Any idea what I am doing wrong?
Thanks.
It should be
"settings": {
"index.analysis.analyzer.default.type" : "spanish"
}
And to apply it to just the "haystack" index:
{
  "haystack" : {
    "settings" : {
      "index.analysis.analyzer.default.type" : "spanish"
    }
  }
}
Thanks to imotov for his suggestion.
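After rebuilding the index with that setting in place, you can re-run the _analyze call from the question without an explicit analyzer parameter; if the Spanish analyzer really is the index default, "esto" and "que" should now be filtered out as stop words instead of "is" and "a":
curl -XGET 'localhost:9200/haystack/_analyze?text=esto+is+a+test+que&pretty=true'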
