Elasticsearch "AND in query_string" vs. "default_operator AND" - elasticsearch

I am using Elasticsearch v7.1.1.
I don't understand the difference between a query_string containing "AND"
and one that relies on "default_operator": "AND".
I thought both should yield the same result, but they don't:
HTTP PUT http://localhost:9200/umlautsuche
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": ["ph => f"]
}
},
"filter": {
"my_ngram": {
"type": "edge_ngram",
"min_gram": 3,
"max_gram": 10
}
},
"analyzer": {
"my_name_analyzer": {
"tokenizer": "standard",
"char_filter": [
"my_char_filter"
],
"filter": [
"lowercase",
"german_normalization"
]
}
}
}
},
"mappings": {
"date_detection": false,
"dynamic_templates": [
{
"string_fields_german": {
"match_mapping_type": "string",
"match": "*",
"mapping": {
"type": "text",
"analyzer": "my_name_analyzer"
}
}
},
{
"dates": {
"match": "lastModified",
"match_pattern": "regex",
"mapping": {
"type": "date",
"ignore_malformed": true
}
}
}
]
}
}
HTTP POST http://localhost:9200/_bulk
{ "index" : { "_index" : "umlautsuche", "_id" : "1" } }
{"vorname": "Stephan-Jörg", "nachname": "Müller", "ort": "Hollabrunn"}
{ "index" : { "_index" : "umlautsuche", "_id" : "2" } }
{"vorname": "Stephan-Joerg", "nachname": "Mueller", "ort": "Hollabrunn"}
{ "index" : { "_index" : "umlautsuche", "_id" : "3" } }
{"vorname": "Stephan-Jörg", "nachname": "Müll", "ort": "Hollabrunn"}
This query returns no results, which I did not expect:
HTTP POST http://localhost:9200/umlautsuche/_search
{
"query": {
"query_string": {
"query": "Stefan Müller Jör*",
"analyze_wildcard": true,
"default_operator": "AND",
"fields": ["vorname", "nachname"]
}
}
}
This query returns the results I expect:
HTTP POST http://localhost:9200/umlautsuche/_search
{
"query": {
"query_string": {
"query": "Stefan AND Müller AND Jör*",
"analyze_wildcard": true,
"default_operator": "AND",
"fields": ["vorname", "nachname"]
}
}
}
How do I configure the query or the analyzer so that I don't need these "AND"s between my search terms?

What you are facing is an obscure corner of how query_string handles boolean operators, and possibly undocumented behavior. Because of this, I believe it is better either to use a bool query with explicit logic or to use copy_to.
Let me explain in a bit more detail what is going on and how you can fix it.
Why doesn't the first query match?
In order to see how the query gets executed, let's set profile: true:
POST /umlautsuche/_search
{
"query": {
"query_string": {
"query": "Stefan Müller Jör*",
"analyze_wildcard": true,
"default_operator": "AND",
"fields": [
"vorname",
"nachname"
]
}
},
"profile": true
}
In the ES response we will see:
"profile": {
"shards": [
{
"id": "[QCANVs5gR0GOiiGCmEwj7w][umlautsuche][0]",
"searches": [
{
"query": [
{
"type": "BooleanQuery",
"description": "+((+nachname:stefan +nachname:muller) | (+vorname:stefan +vorname:muller)) +(nachname:jor* | vorname:jor*)",
"time_in_nanos": 17787641,
"breakdown": {
"set_min_competitive_score_count": 0,
We are interested in this part:
"+((+nachname:stefan +nachname:muller) | (+vorname:stefan +vorname:muller)) +(nachname:jor* | vorname:jor*)"
Without going into deep analysis, we can tell that this query requires stefan and muller to occur in the same field: either both in nachname or both in vorname. That is impossible for our documents, because stefan is never a surname and muller is never a first name among them.
What we actually want to do, I presume, is "find people whose full name is Stefan Müller Jör*". This is not what the query generated by Elasticsearch does.
Why does the second query match?
Let's do the same trick with profile: true. The response would contain this:
"profile": {
"shards": [
{
"id": "[QCANVs5gR0GOiiGCmEwj7w][umlautsuche][0]",
"searches": [
{
"query": [
{
"type": "BooleanQuery",
"description": "+(nachname:stefan | vorname:stefan) +(nachname:muller | vorname:muller) +(nachname:jor* | vorname:jor*)",
"time_in_nanos": 17970342,
"breakdown": {
We can see that the query got interpreted like this:
"+(nachname:stefan | vorname:stefan) +(nachname:muller | vorname:muller) +(nachname:jor* | vorname:jor*)"
We can roughly read this as "find people for whom each of the three terms matches either the first name or the surname", which is what we expect it to do.
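To make the difference between the two parsed queries concrete, here is a minimal Python sketch that evaluates both boolean forms against the three sample documents. It does not call Elasticsearch. The token lists are written out by hand and are an assumption about what my_name_analyzer produces (they agree with the terms visible in the profile output, e.g. muller and jor*).

```python
# Hand-written analyzed tokens per field (assumption: my_name_analyzer maps
# "Stephan" -> "stefan", "Jörg"/"Joerg" -> "jorg", "Müller"/"Mueller" -> "muller").
DOCS = {
    "1": {"vorname": ["stefan", "jorg"], "nachname": ["muller"]},
    "2": {"vorname": ["stefan", "jorg"], "nachname": ["muller"]},
    "3": {"vorname": ["stefan", "jorg"], "nachname": ["mull"]},
}
FIELDS = ["vorname", "nachname"]


def has(doc, field, term):
    """True if the field contains the term; a trailing * means prefix match."""
    tokens = doc[field]
    if term.endswith("*"):
        return any(tok.startswith(term[:-1]) for tok in tokens)
    return term in tokens


def implicit_and(doc):
    """+((+f:stefan +f:muller) within ONE field f) +(f:jor* in any field f)."""
    same_field = any(has(doc, f, "stefan") and has(doc, f, "muller") for f in FIELDS)
    prefix = any(has(doc, f, "jor*") for f in FIELDS)
    return same_field and prefix


def explicit_and(doc):
    """+(f:stefan in any f) +(f:muller in any f) +(f:jor* in any f)."""
    terms = ["stefan", "muller", "jor*"]
    return all(any(has(doc, f, term) for f in FIELDS) for term in terms)


implicit_hits = [doc_id for doc_id, doc in DOCS.items() if implicit_and(doc)]
explicit_hits = [doc_id for doc_id, doc in DOCS.items() if explicit_and(doc)]
print(implicit_hits)  # []
print(explicit_hits)  # ['1', '2']
```

The first form fails because it needs stefan and muller in the same field. The second form lets each term pick its own field.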
In the documentation of query_string query it says that with default_operator: AND it should interpret spaces as ANDs:
The default operator used if no explicit operator is specified. For
example, with a default operator of OR, the query capital of Hungary
is translated to capital OR of OR Hungary, and with default operator
of AND, the same query is translated to capital AND of AND Hungary.
The default value is OR.
However, from what we have just seen, this does not seem to hold, at least when querying multiple fields.
So what can we do about it?
Use bool with explicit logic
This query seems to work:
POST /umlautsuche/_search
{
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "Stefan Müller Jör*",
"analyze_wildcard": true,
"fields": [
"vorname"
]
}
},
{
"query_string": {
"query": "Stefan Müller Jör*",
"analyze_wildcard": true,
"fields": [
"nachname"
]
}
}
]
}
}
}
This query is not an exact equivalent; treat it as an example. For instance, if we had another record like this one, without "Jörg":
{"vorname": "Stephan", "nachname": "Müller", "ort": "Hollabrunn"}
the bool query above would match it despite missing "Jörg". To overcome this you can write a more complex bool query, but this will not do if you wanted to avoid parsing user input.
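One possible shape for such a "more complex bool query" is one must clause per search term, each searching all fields. The sketch below only builds the request body as a Python dict. The helper name per_term_bool_query is hypothetical, and the whitespace splitting is deliberately naive: it ignores quoted phrases and any operators the user types, which is exactly the input parsing mentioned above.

```python
import json


def per_term_bool_query(user_input, fields):
    """Build one query_string clause per whitespace-separated term.

    Naive sketch: quoted phrases and user-typed operators are not handled.
    """
    terms = user_input.split()
    return {
        "query": {
            "bool": {
                "must": [
                    {
                        "query_string": {
                            "query": term,
                            "analyze_wildcard": True,
                            "fields": fields,
                        }
                    }
                    for term in terms
                ]
            }
        }
    }


body = per_term_bool_query("Stefan Müller Jör*", ["vorname", "nachname"])
print(json.dumps(body, ensure_ascii=False, indent=2))
```

Because every term gets its own clause over both fields, a document has to match all three terms, but each term may match in a different field.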
How can we still use plain, unparsed query string?
Introduce a copy_to field
We can use the copy_to capability. It copies the content of several fields into another field, so that their tokens can be searched together as one field.
We will have to modify the mapping configuration (unfortunately the existing index will have to be recreated):
"mappings": {
"date_detection": false,
"dynamic_templates": [
{
"name_fields_german": {
"match_mapping_type": "string",
"match": "*name",
"mapping": {
"type": "text",
"analyzer": "my_name_analyzer",
"copy_to": "full_name"
}
}
},
{
"string_fields_german": {
"match_mapping_type": "string",
"match": "*",
"mapping": {
"type": "text",
"analyzer": "my_name_analyzer"
}
}
},
{
"dates": {
"match": "lastModified",
"match_pattern": "regex",
"mapping": {
"type": "date",
"ignore_malformed": true
}
}
}
]
}
Then we can populate the index in exactly the same manner as we did before.
Now we can query the new field full_name with the following query:
POST /umlautsuche/_search
{
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "Stefan Müller Jör*",
"analyze_wildcard": true,
"default_operator": "AND",
"fields": [
"full_name"
]
}
}
]
}
}
}
This query returns the same two documents as the second query. In this case default_operator: AND behaves as we would expect, requiring every token from the query to match.
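As a rough illustration of why a single combined field sidesteps the problem, this Python sketch mimics copy_to by concatenating the tokens of both name fields and then applying AND semantics to that one field. The token lists are hand-written assumptions about the analyzer output, and nothing here talks to Elasticsearch.

```python
def with_full_name(doc):
    """Mimic copy_to: full_name receives the tokens of vorname and nachname."""
    combined = dict(doc)
    combined["full_name"] = doc["vorname"] + doc["nachname"]
    return combined


def matches_all(tokens, terms):
    """AND semantics on one field: every term must match some token."""
    def hit(term):
        if term.endswith("*"):
            return any(tok.startswith(term[:-1]) for tok in tokens)
        return term in tokens
    return all(hit(term) for term in terms)


docs = {
    "1": {"vorname": ["stefan", "jorg"], "nachname": ["muller"]},
    "2": {"vorname": ["stefan", "jorg"], "nachname": ["muller"]},
    "3": {"vorname": ["stefan", "jorg"], "nachname": ["mull"]},
}
terms = ["stefan", "muller", "jor*"]
hits = [doc_id for doc_id, doc in docs.items()
        if matches_all(with_full_name(doc)["full_name"], terms)]
print(hits)  # ['1', '2']
```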
Hope that helps!

How does phrase searching and phrase search with ~N interact with quote_field_suffix in a simple query string query?

For example, given:
PUT index
{
"settings": {
"analysis": {
"analyzer": {
"english_exact": {
"tokenizer": "standard",
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"properties": {
"body": {
"type": "text",
"analyzer": "english",
"fields": {
"exact": {
"type": "text",
"analyzer": "english_exact"
}
}
}
}
}
}
PUT index/_doc/1
{
"body": "Ski resorts"
}
PUT index/_doc/2
{
"body": "Ski house resorts"
}
What happens with the following queries?
{
"query": {
"simple_query_string": {
"fields": [ "body" ],
"quote_field_suffix": ".exact",
"query": "\"ski resort\""
}
}
}
{
"query": {
"simple_query_string": {
"fields": [ "body" ],
"quote_field_suffix": ".exact",
"query": "\"ski resort\"~2"
}
}
}
Will the ".exact" extend to the entire phrase, so in this case the first query would get no results?
How could you do a phrase search that is not exact when using quote "quote_field_suffix": ".exact"?
Will the ".exact" extend to the entire phrase, so in this case the first query would get no results?
Yes, your understanding is correct.
The documentation describes quote_field_suffix as the "suffix appended to quoted text in the query string".
So it will search for an exact match of ski resort on body.exact. That is not in the index, so the query returns an empty result.
How could you do a phrase search that is not exact when using quote "quote_field_suffix": ".exact"?
{
"query": {
"simple_query_string": {
"fields": [ "body" ],
"quote_field_suffix": ".exact",
"query": "ski resort~2"
}
}
}
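It is not exact because it also returns ski resorts. Note that without the quotes this is no longer a phrase query: ~2 after a single word sets that word's fuzziness (edit distance) rather than the phrase slop.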
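For the ~N part of the question, the following Python sketch gives a simplified model of phrase slop over token positions. It is not Lucene's implementation. The token list stands in for what an unstemmed exact subfield might hold, which is an assumption. It shows that slop can bridge an intervening token but cannot rescue a term that is missing from the field, such as resort when only resorts was indexed.

```python
from itertools import product


def phrase_matches(tokens, phrase, slop=0):
    """Simplified sloppy-phrase check over token positions.

    Each phrase term's position is shifted back by its index in the phrase;
    the phrase matches if the shifted positions span at most `slop`.
    """
    positions = []
    for term in phrase:
        found = [i for i, tok in enumerate(tokens) if tok == term]
        if not found:
            return False  # an absent term can never be rescued by slop
        positions.append(found)
    for combo in product(*positions):
        shifted = [pos - offset for offset, pos in enumerate(combo)]
        if max(shifted) - min(shifted) <= slop:
            return True
    return False


# Unstemmed tokens, as an exact subfield might hold them (assumption).
exact_tokens = ["ski", "house", "resorts"]
print(phrase_matches(exact_tokens, ["ski", "resorts"], slop=0))  # False
print(phrase_matches(exact_tokens, ["ski", "resorts"], slop=1))  # True
print(phrase_matches(exact_tokens, ["ski", "resort"], slop=2))   # False
```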

Elastic synonyms are taking over other words

Given this sequence of commands:
Create the index:
PUT /test_index
{
"settings": {
"analysis": {
"analyzer": {
"GermanCompoundWordsAnalyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"german_compound_synonym",
"german_normalization"
]
}
},
"filter": {
"german_compound_synonym": {
"type": "synonym",
"synonyms": [
"teppichläufer, auslegware läufer"
]
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"name": {
"type": "text",
"analyzer": "GermanCompoundWordsAnalyzer"
}
}
}
}
}
Adding a few documents:
POST test_index/_doc/
{
"sku" : "kimchy",
"name" : "teppichläufer alfa"
}
POST test_index/_doc/
{
"sku" : "kimchy",
"name" : "teppichläufer beta"
}
I search expecting one document, but two are returned :(
GET /test_index/_search
{
"query": {
"match": {
"name": {
"query": "teppichläufer beta",
"operator": "and"
}
}
}
}
I get both documents because, with the synonym rule teppichläufer, auslegware läufer, the token läufer ends up at position 1 and 'substitutes' for beta. If I remove the "analyzer": "GermanCompoundWordsAnalyzer", I get just one document, as expected.
How can I use these synonyms without running into this issue?
POST /test_index/_search
{
"query": {
"bool" : {
"should": [
{
"query_string": {
"default_field": "name",
"query": "teppichläufer beta",
"default_operator": "AND"
}
}
]
}
}
}
After a little more searching I found it in the documentation. This was an RTFM problem, sorry guys.
I tried the synonym graph token filter:
https://www.elastic.co/guide/en/elasticsearch/reference/master/analysis-synonym-graph-tokenfilter.html
The funny part is that it makes the NDCG of the results worse :)
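For readers who want to see what that could look like, here is an untested sketch that builds an index body as a Python dict with the synonyms moved into a search-time analyzer based on synonym_graph, a filter type intended for search analyzers. The analyzer and filter names (german_index, german_search, german_compound_synonym_graph) are hypothetical, and the typeless mappings layout assumes Elasticsearch 7.x rather than the _doc layout used in the question.

```python
def synonym_index_body(synonyms):
    """Index body with synonyms applied at search time only (hypothetical names)."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "german_compound_synonym_graph": {
                        "type": "synonym_graph",
                        "synonyms": synonyms,
                    }
                },
                "analyzer": {
                    # Index-time analyzer: no synonym expansion.
                    "german_index": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "german_normalization"],
                    },
                    # Search-time analyzer: expands multi-word synonyms as a graph.
                    "german_search": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": [
                            "lowercase",
                            "german_compound_synonym_graph",
                            "german_normalization",
                        ],
                    },
                },
            }
        },
        "mappings": {
            "properties": {
                "name": {
                    "type": "text",
                    "analyzer": "german_index",
                    "search_analyzer": "german_search",
                }
            }
        },
    }


body = synonym_index_body(["teppichläufer, auslegware läufer"])
```

Whether this improves or hurts ranking quality depends on the data, as the NDCG remark above suggests.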

Exclude certain fields from search - Elasticsearch

I have indexed documents with over 100 fields each, and every field is also analyzed with an edge n-gram tokenizer to support auto-suggestion. I need a free-text search that covers all fields. When I run it, the search also hits the autocomplete-analyzed subfields (e.g. Data.autocomplete_analyzed). I want to restrict it to the fields analyzed as plain "text" (e.g. Data). Is there a way to do this (1) at index time or (2) at query time?
Mapping file:
"mappings": {
"_doc": {
"properties": {
"Data": {
"type": "text",
"fields": {
"autocomplete_analyzed": {
"type": "text",
"analyzer": "autocomplete"
},
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
Search query :
{
"query": {
"bool": {
"should": [
{
"multi_match": {
"query": "aim",
"type": "phrase",
"slop": "2",
"fields": []
}
},
{
"multi_match": {
"query": "aim",
"fuzziness": "1",
"fields": []
}
}
],
"minimum_should_match": 1
}
}
}
At query time you can use source filtering to choose which fields are returned (note that this controls the returned _source, not which fields are searched):
GET /_search
{
"_source": [ "obj1.*", "obj2.*" ],
"query" : {
"term" : { "user" : "kimchy" }
}
}
If you use query_string for the search, you can restrict the searched fields with fields:
GET /_search
{
"query": {
"query_string": {
"query": "this AND that OR thus",
"fields": [
"docfilename",
"filepath",
"singlewordfield"
]
}
}
}
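The same idea carries over to multi_match, which also accepts a fields list. A minimal Python sketch, assuming you already have the list of field names (for instance, read from the mapping): it drops the subfields by suffix and builds the request body. Both helper names are hypothetical.

```python
def searchable_fields(all_fields, excluded_suffixes=(".autocomplete_analyzed", ".keyword")):
    """Keep only the fields that should take part in free-text search."""
    return [f for f in all_fields if not f.endswith(excluded_suffixes)]


def free_text_query(text, fields):
    """Build a multi_match request body limited to the given fields."""
    return {"query": {"multi_match": {"query": text, "fields": fields}}}


all_fields = ["Data", "Data.autocomplete_analyzed", "Data.keyword"]
fields = searchable_fields(all_fields)
print(fields)  # ['Data']
body = free_text_query("aim", fields)
```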

ElasticSearch: How to use edge_ngram and have real relevant hits to display first

I'm new to Elasticsearch and I'm trying to build a search for an e-commerce site that suggests 5~10 matching products to the user.
Since it has to work while the user is typing, we followed the official documentation and used edge_ngram, and it KIND OF worked. But when we tested it, the results were not what we expected, as the example below (from our test) shows.
[Image: Searching example]
As the image shows, the search for the term "Furadeira" (power drill) returns accessories before the power drill itself. How can I improve the results? Even ranking by where the match occurs in the string would help, I guess.
So, this is the code I have until now:
//PUT example
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
},
"portuguese_stop": {
"type": "stop",
"stopwords": "_portuguese_"
},
"portuguese_stemmer": {
"type": "stemmer",
"language": "light_portuguese"
}
},
"analyzer": {
"portuguese": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"portuguese_stop",
"portuguese_stemmer"
]
},
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
}
}
/* mapping */
//PUT /example/products/_mapping
{
"products": {
"properties": {
"name": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
/* Search */
//GET /example/products/_search
{
"query" : {
"query_string": {
"query" : "furadeira",
"type" : "most_fields", // Tried without this aswell
"fields" : [
"name^8",
"model^10",
"manufacturer^4",
"description"
]
}
}
}
/* Product example */
// PUT example/products/38313
{
"name": "FITA VEDA FRESTA (ESPUMA 4503) 12X5 M [ H0000164055 ]",
"description": "Caracteristicas do produto:Ve…Diminui ruidos indesejaveis.",
"price":21.90,
"product_id": 38313,
"image": "http://placehold.it/200x200",
"quantity": 92,
"width": 20.200,
"height": 1.500,
"length": 21.500,
"weight": 0.082,
"model": "167083",
"manufacturer": "3M DO BRASIL"
}
Thanks in advance.
You could turn your query into a so-called boolean (bool) query that contains your existing query in a must clause, plus an additional query in a should clause that matches exactly (not using the ngrammed field). Documents that also match the should clause will be scored higher.
See the bool query documentation.
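A minimal sketch of that must-plus-should shape, built as a Python dict. The subfield name.exact is hypothetical: it stands for a variant of name indexed without the edge_ngram filter, which the mapping in the question does not define yet. The boost value is arbitrary.

```python
def autocomplete_query(text, ngram_field="name", exact_field="name.exact", boost=5):
    """must: match on the ngrammed field; should: boost exact-term matches."""
    return {
        "query": {
            "bool": {
                # Every hit has to match the edge_ngram field.
                "must": {"match": {ngram_field: {"query": text}}},
                # Hits that also match the non-ngrammed field score higher.
                "should": [
                    {"match": {exact_field: {"query": text, "boost": boost}}}
                ],
            }
        }
    }


body = autocomplete_query("furadeira")
```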
Let's assume you have a field that differentiates the main product from accessories. I'll call it level_field.
There are two approaches you can take:
1) Boost the main product's _score by adding a should clause:
Put your main query in the must clause, and in the should clause use level_field to boost the _score of documents that are main products.
{
"query": {
"bool": {
"must": {
"match": {
"name": {
"query": "furadeira"
}
}
},
"should": [
{ "match": {
"level_field": {
"query": "level1",
"boost": 3
}
}},
{ "match": {
"level_field": {
"query": "level2",
"boost": 2
}
}}
]
}
}
}
2) In the second approach, you can decrease the _score of documents that are not main products by using a boosting query:
{
"query": {
"boosting": {
"positive": {
"query_string": {
"query" : "furadeira",
"type" : "most_fields",
"fields" : [
"name^8",
"model^10",
"manufacturer^4",
"description"
]
}
},
"negative": {
"term": {
"level_field": {
"value": "level2"
}
}
},
"negative_boost": 0.2
}
}
}
I hope it helps

Multi_match and match queries together

I have the following query in Elasticsearch:
{
"query": {
"multi_match": {
"query": "bluefin bat",
"type": "phrase",
"fields": [
"title^5",
"body.value"
]
}
},
"highlight": {
"fields": {
"body.value": {
"number_of_fragments": 3
}
}
},
"fields": [
"title",
"id"
]
}
I have tried using "dis_max", but then both of my fields have to be searched with the same query text.
The remaining match query uses a different query text.
It looks like this:
{
"query": {
"match": {
"ingredients": {
"query": "key1, key2",
"analyzer": "keyword_analyzer"
}
}
}
}
How can I combine these two queries without using dis_max to join them?
I figured out the answer. multi_match internally applies "dis_max", hence you cannot apply dis_max with multi_match.
What I could do instead is use a bool query to solve this type of problem.
I could use should, which translates to a boolean OR, or must, which is equivalent to AND.
So this is how I modified my query:
{
"query": {
"bool":{
"should": [
{"multi_match":
{"query": "SOME_QUERY",
"type": "phrase",
"fields": ["title^5","body"]
}
},
{
"match":{
"labels" :{
"query": "SOME_QUERY",
"analyzer": "keyword_analyzer"
}
}
},
{
"match":{
"displayName" :{
"query": "SOME_QUERY",
"fuzziness": "AUTO"
}
}
}
],
"minimum_number_should_match": "50%"
}
},
"fields": ["title","id","labels","displayName","username"],
"highlight": {
"fields": {
"body.storage.value": {
"number_of_fragments": 3}
}
}
}
I hope this helps someone in future.
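Since the original question had a different query text for the ingredients match, here is a small Python sketch of the same bool approach with two separate texts. It only builds the query part of the body. The helper name combined_query and the choice between must and should are assumptions for illustration; the field names and analyzer come from the question.

```python
def combined_query(phrase_text, ingredients_text, clause="must"):
    """Combine a phrase multi_match and a match that use different query texts."""
    if clause not in ("must", "should"):
        raise ValueError("clause must be 'must' (AND) or 'should' (OR)")
    return {
        "query": {
            "bool": {
                clause: [
                    {
                        "multi_match": {
                            "query": phrase_text,
                            "type": "phrase",
                            "fields": ["title^5", "body.value"],
                        }
                    },
                    {
                        "match": {
                            "ingredients": {
                                "query": ingredients_text,
                                "analyzer": "keyword_analyzer",
                            }
                        }
                    },
                ]
            }
        }
    }


body = combined_query("bluefin bat", "key1, key2")
```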
