I am using Elasticsearch 7.15.0.
I want to organize search by priority, with and without spaces.
What does that mean?
Query: "surf coff"
I want to see records that contain or start with "surf coff" first (SURF COFFEE, SURF CAFFETERIA, SURFCOFFEE MAN),
then records that contain or start with "surf" (SURF, SURF LOVE, ENDLESS SURF),
then records that contain or start with "coff" (LOVE COFFEE, COFFEE MAN).
Query: "surfcoff"
I want to see only records that contain or start with "surfcoff" (SURF COFFEE, SURF CAFFETERIA, SURFCOFFEE MAN).
I created an analyzer with these filters:
lowercase
word_delimiter_graph
shingle
edge_ngram
pattern_replace (to remove spaces)
{
"settings":{
"index": {
"max_shingle_diff" : 9,
"max_ngram_diff": 9
},
"analysis":{
"analyzer":{
"word_join_analyzer":{
"tokenizer":"standard",
"filter":[
"lowercase",
"word_delimiter_graph",
"my_shingle",
"my_edge_ngram",
"my_char_filter"
]
}
},
"filter":{
"my_shingle":{
"type":"shingle",
"min_shingle_size": 2,
"max_shingle_size": 10
},
"my_edge_ngram": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 10,
"token_chars": ["letter", "digit"]
},
"my_char_filter": {
"type": "pattern_replace",
"pattern": " ",
"replacement": ""
}
}
}
}
}
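For reference, here is a minimal sketch of the _analyze call that produces the output below (the index name my_index is just a placeholder for the index holding these settings):
POST /my_index/_analyze
{
  "analyzer": "word_join_analyzer",
  "text": "SURF COFFEE"
}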
So when I analyzed the text "SURF COFFEE", I got this result:
{
"tokens": [
{
"token": "su",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "sur",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "surf",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "su",
"start_offset": 0,
"end_offset": 11,
"type": "shingle",
"position": 0,
"positionLength": 2
},
{
"token": "sur",
"start_offset": 0,
"end_offset": 11,
"type": "shingle",
"position": 0,
"positionLength": 2
},
{
"token": "surf",
"start_offset": 0,
"end_offset": 11,
"type": "shingle",
"position": 0,
"positionLength": 2
},
{
"token": "surf",
"start_offset": 0,
"end_offset": 11,
"type": "shingle",
"position": 0,
"positionLength": 2
},
{
"token": "surfc",
"start_offset": 0,
"end_offset": 11,
"type": "shingle",
"position": 0,
"positionLength": 2
},
{
"token": "surfco",
"start_offset": 0,
"end_offset": 11,
"type": "shingle",
"position": 0,
"positionLength": 2
},
{
"token": "surfcof",
"start_offset": 0,
"end_offset": 11,
"type": "shingle",
"position": 0,
"positionLength": 2
},
{
"token": "surfcoff",
"start_offset": 0,
"end_offset": 11,
"type": "shingle",
"position": 0,
"positionLength": 2
},
{
"token": "surfcoffe",
"start_offset": 0,
"end_offset": 11,
"type": "shingle",
"position": 0,
"positionLength": 2
},
{
"token": "co",
"start_offset": 5,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "cof",
"start_offset": 5,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "coff",
"start_offset": 5,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "coffe",
"start_offset": 5,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "coffee",
"start_offset": 5,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 1
}
]
}
As you can see, there is a "surfcoff" token.
How should my search be organized?
I've tried to combine approaches in a bool should query with
query_string, match_phrase_prefix, match_bool_prefix and others,
but none of them gave correct results.
Can you please help me?
How should my query be built?
Or maybe I should try other analyzer filters?
For example, this query:
{
"query": {
"bool": {
"should": [
{
"query_string": {
"query": "surf coff",
"default_field": "text",
"default_operator": "AND"
}
},
{
"query_string": {
"query": "surf",
"default_field": "text"
}
},
{
"query_string": {
"query": "coff",
"default_field": "text"
}
}
]
}
}
}
or this query
{
"query": {
"bool": {
"should": [
{
"query_string": {
"query": "(surf coff) OR (surf) OR (coff)",
"default_field": "text"
}
}
]
}
}
}
or this query
{
"query": {
"bool": {
"should": [
{
"query_string": {
"query": "((surf AND coff)^3 OR (surf)^2 OR (coff)^1)",
"default_field": "text"
}
}
]
}
}
}
or
{
"query": {
"match_bool_prefix" : {
"text" : "surf coff"
}
}
}
gives
SURF COFFEE SURFING NEVER ALONE
CONOSUR COLCHAGUA CONO SUR
SUNRISE CONCHA TORO SUNRISE 300 DAYS
SUN COFFEE
SURF COFFEE PROPAGANDA
....
but this is strange to me; I think I misunderstand something.
Update: I ended up with this wildcard query with boosts:
{
"query": {
"bool": {
"should": [
{
"query_string": {
"query": "(surf* AND coff*)^3 OR (surf*)^2 OR (coff*)^1",
"default_field": "text"
}
}
]
}
}
}
and these index settings, with the edge n-gram filter removed:
{
"settings":{
"index": {
"max_shingle_diff" : 9,
"max_ngram_diff": 9
},
"analysis":{
"analyzer":{
"word_join_analyzer":{
"tokenizer":"standard",
"filter":[
"lowercase",
"word_delimiter_graph",
"my_shingle",
"my_char_filter"
]
}
},
"filter":{
"my_shingle":{
"type":"shingle",
"min_shingle_size": 2,
"max_shingle_size": 10
},
"my_char_filter": {
"type": "pattern_replace",
"pattern": " ",
"replacement": ""
}
}
}
}
}
Removing the edge n-gram filter and adding the wildcard query with boosts resolved my question.
But I still didn't understand why the edge n-gram didn't work.
I finally resolved it with these filters:
"filter":[
"lowercase",
"word_delimiter_graph",
"my_shingle",
"my_edge_ngram",
"my_char_filter"
]
The problem was with the search_analyzer, because the docs say: "Sometimes, though, it can make sense to use a different analyzer at search time, such as when using the edge_ngram tokenizer for autocomplete or when using search-time synonyms."
So I added the standard search_analyzer to my text field:
"text": { "type": "text", "analyzer": "word_join_analyzer", "search_analyzer": "standard" }
Search query:
{
"query": {
"bool": {
"should": [
{
"query_string": {
"query": "surf coff",
"default_field": "text"
}
}
]
}
}
}
I have to search for documents where the text field "Body" includes "Balance for subscriber with SAN" and excludes "was not found after invoking reip-adapter". I created this KQL request in Kibana:
Body : "Balance for subscriber with SAN" and not Body : "was not found after invoking reip-adapter"
But the results include documents that contain both phrases. Why do my results contain both "Balance for subscriber with SAN" AND "was not found after invoking reip-adapter"?
The inspected KQL request:
"query": {
"bool": {
"must": [],
"filter": [
{
"bool": {
"filter": [
{
"bool": {
"should": [
{
"match_phrase": {
"Body": "Balance for subscriber with SAN"
}
}
],
"minimum_should_match": 1
}
},
{
"bool": {
"must_not": {
"bool": {
"should": [
{
"match_phrase": {
"Body": "was not found after invoking reip-adapter"
}
}
],
"minimum_should_match": 1
}
}
}
}
]
}
},
{
"range": {
"Timestamp": {
"format": "strict_date_optional_time",
"gte": "2020-08-29T08:24:55.067Z",
"lte": "2020-08-29T10:24:55.067Z"
}
}
}
],
"should": [],
"must_not": []
}
}
"and not" condition don`t working, Response:
-----omitted--------
"_source": {
"prospector": {},
"Severity": "INFO",
"uuid": "e71b207a-42a6-4b2c-98d1-b1094c578776",
"Body": "Balance for subscriber with SAN=0400043102was not found after invoking reip-adapter.",
"tags": [
"iptv",
"beats_input_codec_plain_applied"
],
"source": "/applogs/Iptv/app.log",
"host": {
"name": "e38"
},
"offset": 23097554,
"pid": "2473",
"Configuration": "IptvFacadeBean",
"Timestamp": "2020-08-29T10:24:50.040Z",
"#timestamp": "2020-08-29T10:24:50.446Z",
"input": {}
}
-----omitted--------
The data you are indexing into the Body field is:
"Body": "Balance for subscriber with SAN=0400043102was not found after
invoking reip-adapter."
There is no gap between the number and "was" (0400043102was), so the tokens generated are:
POST /_analyze
{
"analyzer" : "standard",
"text" : "Balance for subscriber with SAN=0400043102was not found after invoking reip-adapter."
}
The tokens are:
{
"tokens": [
{
"token": "balance",
"start_offset": 0,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "for",
"start_offset": 8,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "subscriber",
"start_offset": 12,
"end_offset": 22,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "with",
"start_offset": 23,
"end_offset": 27,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "san",
"start_offset": 28,
"end_offset": 31,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "0400043102was", <-- note this
"start_offset": 32,
"end_offset": 45,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "not",
"start_offset": 46,
"end_offset": 49,
"type": "<ALPHANUM>",
"position": 6
},
{
"token": "found",
"start_offset": 50,
"end_offset": 55,
"type": "<ALPHANUM>",
"position": 7
},
{
"token": "after",
"start_offset": 56,
"end_offset": 61,
"type": "<ALPHANUM>",
"position": 8
},
{
"token": "invoking",
"start_offset": 62,
"end_offset": 70,
"type": "<ALPHANUM>",
"position": 9
},
{
"token": "reip",
"start_offset": 71,
"end_offset": 75,
"type": "<ALPHANUM>",
"position": 10
},
{
"token": "adapter",
"start_offset": 76,
"end_offset": 83,
"type": "<ALPHANUM>",
"position": 11
}
]
}
Therefore, when you are trying to do a match_phrase like this:
"should": [
{
"match_phrase": {
"Body": "was not found after invoking reip-adapter"
}
}
]
No standalone "was" token was generated, so the phrase does not match, the must_not clause excludes nothing, and the document is still returned.
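To see the tokens the phrase query is actually looking for, you can run the query text through the same analyzer (a quick check with the default standard analyzer):
POST /_analyze
{
  "analyzer": "standard",
  "text": "was not found after invoking reip-adapter"
}
This yields the tokens was, not, found, after, invoking, reip, adapter. The phrase needs a standalone "was" token in the document, but the indexed document only has "0400043102was", so the phrase never matches.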
Index Data:
{ "Body":"Balance for subscriber with SAN=0400043102" }
{ "Body":"Balance for subscriber with SAN=0400043102was not found after invoking reip-adapter." }
Search Query:
{
"query": {
"bool": {
"must": {
"match_phrase": {
"Body": "Balance for subscriber with SAN"
}
},
"must_not": {
"match_phrase": {
"Body": "not found after invoking reip-adapter"
}
}
}
}
}
Search Result:
"hits": [
{
"_index": "my_index",
"_type": "_doc",
"_id": "2",
"_score": 1.055546,
"_source": {
"Body": "Balance for subscriber with SAN=0400043102"
}
}
]
I have a field with the following mapping defined:
"my_field": {
"properties": {
"address": {
"type": "string",
"analyzer": "email",
"search_analyzer": "whitespace"
}
}
}
My email analyser looks like this:
{
"analysis": {
"filter": {
"email_filter": {
"type": "edge_ngram",
"min_gram": "3",
"max_gram": "255"
}
},
"analyzer": {
"email": {
"type": "custom",
"filter": [
"lowercase",
"email_filter",
"unique"
],
"tokenizer": "uax_url_email"
}
}
}
}
When I try to search for an email id like test.xyz#example.com,
searching for terms like tes, test.xy, etc. doesn't work. But if I search for
test.xyz or test.xyz#example.com, it works fine. I tried analyzing the tokens using my email analyzer and it works as expected.
For example, hitting http://localhost:9200/my_index/_analyze?analyzer=email&text=test.xyz#example.com
I get:
{
"tokens": [{
"token": "tes",
"start_offset": 0,
"end_offset": 20,
"type": "word",
"position": 0
}, {
"token": "test",
"start_offset": 0,
"end_offset": 20,
"type": "word",
"position": 0
}, {
"token": "test.",
"start_offset": 0,
"end_offset": 20,
"type": "word",
"position": 0
}, {
"token": "test.x",
"start_offset": 0,
"end_offset": 20,
"type": "word",
"position": 0
}, {
"token": "test.xy",
"start_offset": 0,
"end_offset": 20,
"type": "word",
"position": 0
}, {
"token": "test.xyz",
"start_offset": 0,
"end_offset": 20,
"type": "word",
"position": 0
}, {
"token": "test.xyz#",
"start_offset": 0,
"end_offset": 20,
"type": "word",
"position": 0
}, {
"token": "test.xyz#e",
"start_offset": 0,
"end_offset": 20,
"type": "word",
"position": 0
}, {
"token": "test.xyz#ex",
"start_offset": 0,
"end_offset": 20,
"type": "word",
"position": 0
}, {
"token": "test.xyz#exa",
"start_offset": 0,
"end_offset": 20,
"type": "word",
"position": 0
}, {
"token": "test.xyz#exam",
"start_offset": 0,
"end_offset": 20,
"type": "word",
"position": 0
}, {
"token": "test.xyz#examp",
"start_offset": 0,
"end_offset": 20,
"type": "word",
"position": 0
}, {
"token": "test.xyz#exampl",
"start_offset": 0,
"end_offset": 20,
"type": "word",
"position": 0
}, {
"token": "test.xyz#example",
"start_offset": 0,
"end_offset": 20,
"type": "word",
"position": 0
}, {
"token": "test.xyz#example.",
"start_offset": 0,
"end_offset": 20,
"type": "word",
"position": 0
}, {
"token": "test.xyz#example.c",
"start_offset": 0,
"end_offset": 20,
"type": "word",
"position": 0
}, {
"token": "test.xyz#example.co",
"start_offset": 0,
"end_offset": 20,
"type": "word",
"position": 0
}, {
"token": "test.xyz#example.com",
"start_offset": 0,
"end_offset": 20,
"type": "word",
"position": 0
}]
}
So I know that the tokenisation works. But while searching, it fails to match partial strings.
For example, searching http://localhost:9200/my_index/my_field/_search?q=test shows no hits.
Details of my index:
{
"my_index": {
"aliases": {
"alias_default": {}
},
"mappings": {
"my_field": {
"properties": {
"address": {
"type": "string",
"analyzer": "email",
"search_analyzer": "whitespace"
},
"boost": {
"type": "long"
},
"createdat": {
"type": "date",
"format": "strict_date_optional_time||epoch_millis"
},
"instanceid": {
"type": "long"
},
"isdeleted": {
"type": "integer"
},
"object": {
"type": "string"
},
"objecthash": {
"type": "string"
},
"objectid": {
"type": "string"
},
"parent": {
"type": "short"
},
"parentid": {
"type": "integer"
},
"updatedat": {
"type": "date",
"format": "strict_date_optional_time||epoch_millis"
}
}
}
},
"settings": {
"index": {
"creation_date": "1480342980403",
"number_of_replicas": "1",
"max_result_window": "100000",
"uuid": "OUuiTma8CA2VNtw9Og",
"analysis": {
"filter": {
"email_filter": {
"type": "edge_ngram",
"min_gram": "3",
"max_gram": "255"
},
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": "3",
"max_gram": "20"
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"filter": [
"lowercase",
"autocomplete_filter"
],
"tokenizer": "standard"
},
"email": {
"type": "custom",
"filter": [
"lowercase",
"email_filter",
"unique"
],
"tokenizer": "uax_url_email"
}
}
},
"number_of_shards": "5",
"version": {
"created": "2010099"
}
}
},
"warmers": {}
}
}
OK, everything looks correct except your query.
You simply need to specify the address field in your query like this and it will work:
http://localhost:9200/my_index/my_field/_search?q=address:test
If you don't specify the address field, the query runs against the _all field, whose search analyzer is the standard one by default, which is why you're not finding anything.
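For reference, the equivalent request in the query DSL (just a sketch against the same index and type) would be:
POST /my_index/my_field/_search
{
  "query": {
    "query_string": {
      "query": "address:test"
    }
  }
}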
The problem is that any character sequence containing the boost operator "^" (caret symbol) does not return any search results.
But as per the Elasticsearch documentation below,
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#_reserved_characters
the characters && || ! ( ) { } [ ] ^ " ~ * ? : \ can be escaped with a backslash.
I have a requirement to do a "contains" search using an n-gram analyzer in Elasticsearch.
Below is the mapping structure of the sample use case:
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"nGram_analyzer": {
"filter": [
"lowercase",
"asciifolding"
],
"type": "custom",
"tokenizer": "ngram_tokenizer"
},
"whitespace_analyzer": {
"filter": [
"lowercase",
"asciifolding"
],
"type": "custom",
"tokenizer": "whitespace"
}
},
"tokenizer": {
"ngram_tokenizer": {
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
],
"min_gram": "2",
"type": "nGram",
"max_gram": "20"
}
}
}
}
},
"mappings": {
"employee": {
"properties": {
"employeeName": {
"type": "string",
"analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer"
}
}
}
}
}
I have an employee name like the one below, with special characters included:
xyz%^&*
Also, the sample query used for the contains search is below:
GET
{
"query": {
"bool": {
"must": [
{
"match": {
"employeeName": {
"query": "xyz%^",
"type": "boolean",
"operator": "or"
}
}
}
]
}
}
}
Even if we try to escape it as "query": "xyz%\^", it errors out. So we are not able to do a contains search on any text containing "^" (caret symbol).
Any help is greatly appreciated.
There is a bug in the ngram tokenizer related to this issue.
Essentially, ^ is not classified as a symbol, letter, or punctuation character by the ngram tokenizer.
As a result, it splits the input on ^.
Example (URL-encoded xyz%^):
GET <index_name>/_analyze?tokenizer=ngram_tokenizer&text=xyz%25%5E
The result of the _analyze API shows that no token contains ^, as seen in the response below:
{
"tokens": [
{
"token": "xy",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0
},
{
"token": "xyz",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 1
},
{
"token": "xyz%",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 2
},
{
"token": "yz",
"start_offset": 1,
"end_offset": 3,
"type": "word",
"position": 3
},
{
"token": "yz%",
"start_offset": 1,
"end_offset": 4,
"type": "word",
"position": 4
},
{
"token": "z%",
"start_offset": 2,
"end_offset": 4,
"type": "word",
"position": 5
}
]
}
Since '^' is never indexed, there are no matches.
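If your Elasticsearch version is recent enough to support it, one possible workaround (a sketch only, not tested against this exact index) is to tell the ngram tokenizer to treat ^ as a token character via custom_token_chars:
"tokenizer": {
  "ngram_tokenizer": {
    "type": "ngram",
    "min_gram": "2",
    "max_gram": "20",
    "token_chars": [
      "letter",
      "digit",
      "punctuation",
      "symbol",
      "custom"
    ],
    "custom_token_chars": "^"
  }
}
With this, grams such as z%^ would be indexed and a contains search including the caret becomes possible; on older versions that lack custom_token_chars, the caret cannot be preserved by this tokenizer.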
I'm using Elasticsearch 2.2.0 and I'm trying to use the lowercase + asciifolding filters on a field.
This is the output of http://localhost:9200/myindex/
{
"myindex": {
"aliases": {},
"mappings": {
"products": {
"properties": {
"fold": {
"analyzer": "folding",
"type": "string"
}
}
}
},
"settings": {
"index": {
"analysis": {
"analyzer": {
"folding": {
"token_filters": [
"lowercase",
"asciifolding"
],
"tokenizer": "standard",
"type": "custom"
}
}
},
"creation_date": "1456180612715",
"number_of_replicas": "1",
"number_of_shards": "5",
"uuid": "vBMZEasPSAyucXICur3GVA",
"version": {
"created": "2020099"
}
}
},
"warmers": {}
}
}
And when I try to test the folding custom analyzer using the _analyze API, this is what I get as the output of http://localhost:9200/myindex/_analyze?analyzer=folding&text=%C3%89sta%20est%C3%A1%20loca
{
"tokens": [
{
"end_offset": 4,
"position": 0,
"start_offset": 0,
"token": "Ésta",
"type": "<ALPHANUM>"
},
{
"end_offset": 9,
"position": 1,
"start_offset": 5,
"token": "está",
"type": "<ALPHANUM>"
},
{
"end_offset": 14,
"position": 2,
"start_offset": 10,
"token": "loca",
"type": "<ALPHANUM>"
}
]
}
As you can see, the returned tokens are Ésta, está, loca instead of esta, esta, loca. What's going on? It seems that the folding analyzer is being ignored.
Looks like a simple typo when you are creating your index.
In your "analysis":{"analyzer":{...}} block, this:
"token_filters": [...]
Should be
"filter": [...]
Check the documentation for confirmation. Because your filter array wasn't named correctly, ES completely ignored it and just used the standard analyzer. Here is a small example written using the Sense Chrome plugin; execute the requests in order:
DELETE /test
PUT /test
{
"analysis": {
"analyzer": {
"folding": {
"type": "custom",
"filter": [
"lowercase",
"asciifolding"
],
"tokenizer": "standard"
}
}
}
}
GET /test/_analyze
{
"analyzer":"folding",
"text":"Ésta está loca"
}
And the results of the last GET /test/_analyze:
"tokens": [
{
"token": "esta",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "esta",
"start_offset": 5,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "loca",
"start_offset": 10,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 2
}
]
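If you then want the folding analyzer applied to the fold field from your original mapping, a sketch of the corresponding mapping request (ES 2.x syntax, with the products type name taken from your index) would be:
PUT /test/_mapping/products
{
  "properties": {
    "fold": {
      "type": "string",
      "analyzer": "folding"
    }
  }
}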
When we pass a query containing special characters, Elasticsearch splits the text.
E.g. if we pass "test-test" in the query, how can we make Elasticsearch treat it as a single word and not split it up?
Analyzer used on the field we are searching:
"text_search_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 15
},
"standard_stop_filter": {
"type": "stop",
"stopwords": "_english_"
}
},
"analyzer": {
"text_search_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding",
"text_search_filter"
]
}
}
Also, the query used for the search:
"query": {
"multi_match": {
"query": "test-test",
"type": "cross_fields",
"fields": [
"FIELD_NAME"
],
}
}
{
"tokens": [
{
"token": "'",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "'t",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "'te",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "'tes",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "'test",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "'test-",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "'test-t",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "'test-te",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "'test-tes",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "'test-test",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "'test-test'",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
}
]
}
In my code I catch all words which contain "-" and add quotes around them.
Example:
joe-doe -> "joe-doe"
Java code for this:
import java.util.Arrays;
import java.util.stream.Collectors;

static String placeWordsWithDashInQuote(String value) {
    // Split on whitespace and wrap any token containing "-" in double quotes
    // (unless it is already quoted), then re-join with single spaces.
    return Arrays.stream(value.split("\\s"))
        .filter(v -> !v.isEmpty())
        .map(v -> v.contains("-") && !v.startsWith("\"") ? "\"" + v + "\"" : v)
        .collect(Collectors.joining(" "));
}
After this, the example query looks like:
{
"query": {
"bool": {
"must": [
{
"query_string": {
"fields": [
"lastName",
"firstName"
],
"query": "\"joe-doe\"",
"default_operator": "AND"
}
}
]
}
},
"sort": [],
"from": 0,
"size": 10 }