I could not find a good solution for the following situation; I hope someone can help.
Suppose the following documents are present for a type called "groups":
{"name": "Orders 1.0"}
{"name": "Reports & Analysis 1.0"}
{"name": "Rebates 1.0"}
When I search documents using the simple_query_string query below, I get all three records instead of only the first one (the "Orders 1.0" document):
{
  "size" : 20,
  "query" : {
    "bool" : {
      "must" : [
        {
          "bool" : {
            "should" : [
              {
                "simple_query_string" : {
                  "query" : "Orders 1.0*",
                  "fields" : [
                    "name^1.0"
                  ],
                  "flags" : -1,
                  "default_operator" : "or",
                  "lenient" : false,
                  "analyze_wildcard" : true,
                  "boost" : 1.0
                }
              }
            ],
            "disable_coord" : false,
            "adjust_pure_negative" : true,
            "boost" : 1.0
          }
        }
      ],
      "disable_coord" : false,
      "adjust_pure_negative" : true,
      "boost" : 1.0
    }
  }
}
I want only the record with the name "Orders 1.0" to be returned.
The reason is this setting: "default_operator" : "or".
I assume that the field name is an analyzed field. If it uses an analyzer, it will be tokenized, and the same goes for your query input. So you have the following (for the standard analyzer):
Orders 1.0 -> ["orders", "1.0"]
Reports & Analysis 1.0 -> ["reports", "analysis", "1.0"]
Rebates 1.0 -> ["rebates", "1.0"]
GET _analyze
{
"analyzer": "standard",
"text": ["Orders 1.0", "Reports & Analysis 1.0", "Rebates 1.0"]
}
{
"tokens" : [
{
"token" : "orders",
"start_offset" : 0,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "1.0",
"start_offset" : 7,
"end_offset" : 10,
"type" : "<NUM>",
"position" : 1
},
{
"token" : "reports",
"start_offset" : 11,
"end_offset" : 18,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "analysis",
"start_offset" : 21,
"end_offset" : 29,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "1.0",
"start_offset" : 30,
"end_offset" : 33,
"type" : "<NUM>",
"position" : 4
},
{
"token" : "rebates",
"start_offset" : 34,
"end_offset" : 41,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "1.0",
"start_offset" : 42,
"end_offset" : 45,
"type" : "<NUM>",
"position" : 6
}
]
}
Since the default operator is OR, only one of the analyzed tokens needs to match, not all of them. And since your query includes 1.0, which is present in all three documents, it matches all of them. One solution is to change the default operator to AND.
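A minimal sketch of the adjusted query, with the default operator switched to AND (the outer bool wrapping is omitted for brevity; the field and options are the ones from the query above):
GET _search
{
  "size" : 20,
  "query" : {
    "simple_query_string" : {
      "query" : "Orders 1.0*",
      "fields" : [ "name^1.0" ],
      "default_operator" : "and",
      "analyze_wildcard" : true
    }
  }
}
With AND, both analyzed tokens ("orders" and "1.0") must be present in the name field, so only the "Orders 1.0" document is returned.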
It seems like there is a character minimum needed to get results from Elasticsearch for a specific property I am searching. The property is called 'guid' and has the following configuration:
"guid": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
I have a document with the following GUID: 3e49996c-1dd8-4230-8f6f-abe4236a6fc4
The following query returns the document as expected:
{"match":{"query":"9996c-1dd8*","fields":["guid"]}}
However this query does not:
{"match":{"query":"9996c-1dd*","fields":["guid"]}}
I have the same result with multi_match and query_string queries. I haven't been able to find anything in the documentation about a character minimum, so what is happening here?
Elasticsearch does not require a minimum number of characters. What matters are the generated tokens.
A useful exercise is to use the _analyze API to see your index's tokens.
GET index_001/_analyze
{
"field": "guid",
"text": [
"3e49996c-1dd8-4230-8f6f-abe4236a6fc4"
]
}
You pass in the term 3e49996c-1dd8-4230-8f6f-abe4236a6fc4.
Look at the resulting tokens:
"tokens" : [
{
"token" : "3e49996c",
"start_offset" : 0,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "1dd8",
"start_offset" : 9,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "4230",
"start_offset" : 14,
"end_offset" : 18,
"type" : "<NUM>",
"position" : 2
},
{
"token" : "8f6f",
"start_offset" : 19,
"end_offset" : 23,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "abe4236a6fc4",
"start_offset" : 24,
"end_offset" : 36,
"type" : "<ALPHANUM>",
"position" : 4
}
]
When you perform a search, the same analyzer that was used at index time is also applied to the query terms.
When you search for the term "9996c-1dd8*":
GET index_001/_analyze
{
"field": "guid",
"text": [
"9996c-1dd8*"
]
}
The generated tokens are:
{
"tokens" : [
{
"token" : "9996c",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "1dd8",
"start_offset" : 6,
"end_offset" : 10,
"type" : "<ALPHANUM>",
"position" : 1
}
]
}
Note that the inverted index contains the token 1dd8, and the term "9996c-1dd8*" also generated the token "1dd8", so the match took place.
When you test with the term "9996c-1dd*", no tokens match, so there are no results.
GET index_001/_analyze
{
"field": "guid",
"text": [
"9996c-1dd*"
]
}
Tokens:
{
"tokens" : [
{
"token" : "9996c",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "1dd",
"start_offset" : 6,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 1
}
]
}
Token "1dd" is not equal to "1dd8".
I'm trying to configure Elasticsearch to search for a word with optional punctuation, but I don't know how to do this with my settings.
For example, when I search for the word "computer", I would also like it to match the following variants: "computer.", "(computer", "computer)", "computer,", and so on.
To achieve this you should understand how Elasticsearch mappings and text fields work. A text field is analyzed with the analyzer you set in the mapping, and that analyzer breaks your text into terms. For example, if you use the standard analyzer on the text This computer is fast. (Computer), this will be the result:
GET _analyze
{
"analyzer": "standard",
"text": "This computer is fast. (Computer)"
}
Result:
{
"tokens" : [
{
"token" : "this",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "computer",
"start_offset" : 5,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "is",
"start_offset" : 14,
"end_offset" : 16,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "fast",
"start_offset" : 17,
"end_offset" : 21,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "computer",
"start_offset" : 24,
"end_offset" : 32,
"type" : "<ALPHANUM>",
"position" : 4
}
]
}
As you can see in the result, punctuation is removed at indexing time, so when you search for computer with a match query you will get all of those documents back, for example:
POST _search
{
  "query": {
    "match": {
      "Your_Field": "Computer"
    }
  }
}
You can check the full Elasticsearch query DSL in the official documentation.
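Note that the match query analyzes the query text with the same analyzer as the field, so searching for a punctuated variant such as "(Computer)" also reduces to the single token computer. A quick way to confirm this (a sketch using the standard analyzer) is:
GET _analyze
{
  "analyzer": "standard",
  "text": "(Computer)"
}
which returns the single token computer and therefore matches the documents indexed above.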
I have a field named log.file.path in Elasticsearch whose value is /var/log/dev-collateral/uaa.2020-09-26.log, and I am trying to retrieve all logs whose log.file.path field starts with /var/log/dev-collateral/uaa.
I used the regexp query below, but it doesn't work:
{
  "regexp": {
    "log.file.path": "/var/log/dev-collateral/uaa.*"
  }
}
Let's see why it is not working. I've indexed two documents using the Kibana UI like below:
PUT myindex/_doc/1
{
"log.file.path" : "/var/log/dev-collateral/uaa.2020-09-26.log"
}
PUT myindex/_doc/2
{
"log.file.path" : "/var/log/dev-collateral/uaa.2020-09-26.txt"
}
When I try to see the tokens of the text in the log.file.path field using the _analyze API:
POST _analyze
{
"text": "/var/log/dev-collateral/uaa.2020-09-26.log"
}
It gives me:
{
"tokens" : [
{
"token" : "var",
"start_offset" : 1,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "log",
"start_offset" : 5,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "dev",
"start_offset" : 9,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "collateral",
"start_offset" : 13,
"end_offset" : 23,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "uaa",
"start_offset" : 24,
"end_offset" : 27,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "2020",
"start_offset" : 28,
"end_offset" : 32,
"type" : "<NUM>",
"position" : 5
},
{
"token" : "09",
"start_offset" : 33,
"end_offset" : 35,
"type" : "<NUM>",
"position" : 6
},
{
"token" : "26",
"start_offset" : 36,
"end_offset" : 38,
"type" : "<NUM>",
"position" : 7
},
{
"token" : "log",
"start_offset" : 39,
"end_offset" : 42,
"type" : "<ALPHANUM>",
"position" : 8
}
]
}
You can see that Elasticsearch has split your input text into tokens when you indexed the documents. This is because Elasticsearch uses the standard analyzer by default: it splits the text into small tokens, removes punctuation, lowercases the text, and so on. That's why your current regexp query doesn't work.
GET myindex/_search
{
  "query": {
    "match": {
      "log.file.path": "var"
    }
  }
}
A match query like the one above will work, but in your case you need to match every log.file.path that starts with /var/log/dev-collateral/uaa. So what to do? Don't analyze the field at indexing time: the keyword type stores the string you provide as it is.
Create a mapping with the keyword type:
PUT myindex2/
{
  "mappings": {
    "properties": {
      "log.file.path": {
        "type": "keyword"
      }
    }
  }
}
Index the documents:
PUT myindex2/_doc/1
{
"log.file.path" : "/var/log/dev-collateral/uaa.2020-09-26.log"
}
PUT myindex2/_doc/2
{
"log.file.path" : "/var/log/dev-collateral/uaa.2020-09-26.txt"
}
Search with regexp:
GET myindex2/_search
{
  "query": {
    "regexp": {
      "log.file.path": "/var/log/dev-collateral/uaa.2020-09-26.*"
    }
  }
}
I used this query and it works!
{
  "query": {
    "regexp": {
      "log.file.path.keyword": {
        "value": "/var/log/dev-collateral/uaa.*",
        "flags": "ALL",
        "max_determinized_states": 10000,
        "rewrite": "constant_score"
      }
    }
  }
}
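This works because, with default dynamic mapping, Elasticsearch indexes a string field both as analyzed text and as an unanalyzed keyword sub-field (here log.file.path.keyword), and the regexp runs against that unanalyzed value. If you are unsure whether the sub-field exists in your index, you can check its mapping (a sketch, assuming the myindex index from the examples above):
GET myindex/_mapping/field/log.file.path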
I'd like to concatenate words and then ngram the result. What's the correct Elasticsearch setting for this?
In English:
from: stack overflow
==> stackoverflow : concatenate first,
==> sta / tac / ack / cko / kov / ... etc. (min_gram: 3, max_gram: 10)
To do the concatenation I'm assuming that you just want to remove all spaces from your input data. To do this, you need to implement a pattern_replace char filter that replaces space with nothing.
Setting up the ngram tokenizer should be easy - just specify your token min/max lengths.
It's worth adding a lowercase token filter too - to make searching case insensitive.
curl -XPOST localhost:9200/my_index -d '{
  "index": {
    "analysis": {
      "analyzer": {
        "my_new_analyzer": {
          "filter": [
            "lowercase"
          ],
          "tokenizer": "my_ngram_tokenizer",
          "char_filter": ["my_pattern"],
          "type": "custom"
        }
      },
      "char_filter": {
        "my_pattern": {
          "type": "pattern_replace",
          "pattern": "\u0020",
          "replacement": ""
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": "3",
          "max_gram": "10",
          "token_chars": ["letter", "digit", "punctuation", "symbol"]
        }
      }
    }
  }
}'
testing this:
curl -XGET 'localhost:9200/my_index/_analyze?analyzer=my_new_analyzer&pretty' -d 'stack overflow'
gives the following (just a small part shown below):
{
"tokens" : [ {
"token" : "sta",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 1
}, {
"token" : "stac",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 2
}, {
"token" : "stack",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 3
}, {
"token" : "stacko",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 4
}, {
"token" : "stackov",
"start_offset" : 0,
"end_offset" : 8,
"type" : "word",
"position" : 5
  }, ...
  ]
}
I'm trying to change the haystack default settings to something very simple:
'settings': {
"analyzer": "spanish"
}
It looks right after rebuilding the index:
$ curl -XGET 'http://localhost:9200/haystack/_settings?pretty=true'
{
  "haystack" : {
    "settings" : {
      "index.analyzer" : "spanish",
      "index.number_of_shards" : "5",
      "index.number_of_replicas" : "1",
      "index.version.created" : "191199"
    }
  }
}
But when testing it with some stop words, it doesn't work as expected: it should filter out "esto" and "que", but instead it filters out "is" and "a", the English stop words:
$ curl -XGET 'localhost:9200/haystack/_analyze?text=esto+is+a+test+que&pretty=true'
{
"tokens" : [ {
"token" : "esto",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "test",
"start_offset" : 10,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 4
}, {
"token" : "que",
"start_offset" : 15,
"end_offset" : 18,
"type" : "<ALPHANUM>",
"position" : 5
  } ]
}
And only when I specify the analyzer in the query does it work:
$ curl -XGET 'localhost:9200/haystack/_analyze?text=esto+is+a+test+que&analyzer=spanish&pretty=true'
{
"tokens" : [ {
"token" : "is",
"start_offset" : 5,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 2
}, {
"token" : "test",
"start_offset" : 10,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 4
  } ]
}
Any idea what I am doing wrong?
Thanks.
It should be:
"settings": {
"index.analysis.analyzer.default.type" : "spanish"
}
And to apply it to just the "haystack" index:
{
  "haystack" : {
    "settings" : {
      "index.analysis.analyzer.default.type" : "spanish"
    }
  }
}
Thanks to imotov for his suggestion.
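For reference, a minimal sketch of creating an index with this setting and then verifying it (the index name haystack_es is just an example; with the default analyzer set to spanish, the _analyze call should now drop the Spanish stop words "esto", "que", and "a" instead of the English ones):
curl -XPUT 'localhost:9200/haystack_es/' -d '{
  "settings": {
    "index.analysis.analyzer.default.type": "spanish"
  }
}'
curl -XGET 'localhost:9200/haystack_es/_analyze?text=esto+is+a+test+que&pretty=true'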