Elasticsearch: simple_query_string and multi-words synonyms - elasticsearch

I have a field with the following search_analyzer:
"name_search_en" : {
"filter" : [
"english_possessive_stemmer",
"lowercase",
"name_synonyms_en",
"english_stop",
"english_stemmer",
"asciifolding"
],
"tokenizer" : "standard"
}
name_synonyms_en is a synonym_graph filter that looks like this:
"name_synonyms_en" : {
"type" : "synonym_graph",
"synonyms" : [
"beach bag => straw bag,beach bag",
"bicycle,bike"
]
}
Running the following multi_match query, the synonyms are correctly applied:
{
"query": {
"multi_match": {
"query": "beach bag",
"auto_generate_synonyms_phrase_query": false,
"type": "cross_fields",
"fields": [
"brand.en-US^1.0",
"name.en-US^1.0"
]
}
}
}
Here is the _validate explanation output. Both beach bag and straw bag are present, as expected, in the raw query:
"explanations" : [
{
"index" : "d7598351-311f-4844-bb91-4f26c9f538f3",
"valid" : true,
"explanation" : "+((((+name.en-US:straw +name.en-US:bag) (+name.en-US:beach +name.en-US:bag))) | (brand.en-US:beach brand.en-US:bag)) #DocValuesFieldExistsQuery [field=_primary_term]"
}
]
I would expect the same in the following simple_query_string
{
"query": {
"simple_query_string": {
"query": "beach bag",
"auto_generate_synonyms_phrase_query": false,
"fields": [
"brand.en-US^1.0",
"name.en-US^1.0"
]
}
}
}
but the straw bag synonym is not present in the raw query
"explanations" : [
{
"index" : "d7598351-311f-4844-bb91-4f26c9f538f3",
"valid" : true,
"explanation" : "+((name.en-US:beach | brand.en-US:beach)~1.0 (name.en-US:bag | brand.en-US:bag)~1.0) #DocValuesFieldExistsQuery [field=_primary_term]"
}
]
The problem seems to be related to multi-term synonyms only. If I search for bike, the bicycle synonym is correctly present in the query:
"explanations" : [
{
"index" : "d7598351-311f-4844-bb91-4f26c9f538f3",
"valid" : true,
"explanation" : "+(Synonym(name.en-US:bicycl name.en-US:bike) | brand.en-US:bike)~1.0 #DocValuesFieldExistsQuery [field=_primary_term]"
}
]
Is this the expected behaviour (meaning multi-term synonyms are not supported by this query)?

By default, simple_query_string has the WHITESPACE flag enabled, so the input text is split on whitespace before analysis and each token is analyzed individually. That is why the synonym filter never sees the multi-word phrase. The following query disables all flags, which makes multi-word synonyms work as expected:
{
"query": {
"simple_query_string": {
"query": "beach bag",
"auto_generate_synonyms_phrase_query": false,
"flags": "NONE",
"fields": [
"brand.en-US^1.0",
"name.en-US^1.0"
]
}
}
}
This unfortunately does not play well with the minimum_should_match parameter. A full discussion with more details can be found at https://discuss.elastic.co/t/simple-query-string-and-multi-terms-synonyms/174780
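As a sanity check, one can confirm the analyzer itself produces both synonym paths by running the text through the _analyze API (the index name below is a placeholder):

```json
GET my-index/_analyze
{
  "analyzer": "name_search_en",
  "text": "beach bag"
}
```

With the synonym_graph filter in the chain, the returned token stream should contain both the straw bag and the beach bag paths.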

Related

Elasticsearch crash repeatedly after phrase prefix search on whitespace analyzer

I have defined my mapping as:
{
"mappings": { // defined all mappings },
"settings": {
"index": {
"analysis": {
"analyzer": {
"default": {
"type": "custom",
"tokenizer": "whitespace",
"filter" : ["lowercase"]
}
}
}
}
}
}
The query which I am executing is this one:
{
"bool" : {
"must" : [
{
"query_string" : {
"query" : "*2AW\\-COTTON_\\&_SON_\\(*",
"fields" : [ ],
"type" : "phrase_prefix",
"default_operator" : "or",
"max_determinized_states" : 10000,
"enable_position_increments" : true,
"fuzziness" : "AUTO",
"fuzzy_prefix_length" : 0,
"fuzzy_max_expansions" : 50,
"phrase_slop" : 0,
"escape" : false,
"auto_generate_synonyms_phrase_query" : true,
"fuzzy_transpositions" : true,
"boost" : 1.0
}
}
],
"filter" : [
{
"terms" : {
"id" : [
"50010",
"1604"
],
"boost" : 1.0
}
}
],
"adjust_pure_negative" : true,
"boost" : 1.0
}
}
I am using a whitespace analyzer instead of the standard one because I need to search on special characters as well, and I have escaped the special characters in this query. But when I run a phrase prefix query on this index, the whole Elasticsearch node crashes every time. The first two queries take 20-30 seconds; after that, any further query crashes ES. Right now I am testing this on a 2 GB RAM machine with an allocated heap size of 1 GB. Can this be the reason, and would increasing the machine size help? Thanks for any help!
Since you haven't specified a field to perform the wildcard on, ES will search across almost all fields.
Have you tried using a wildcard or regexp filter instead of query_string?
If you do know which field you want to query (and I suspect you do), use something along the lines of:
GET fuzzy/_search?request_cache=false
{
"query": {
"bool": {
"must": [
{
"regexp": {
"identifier": ".*2aw-cotton_\\&_son_\\(.*"
}
}
]
}
}
}
Even with 400 sample docs on my machine, the speed improvement is roughly 40x over the wide-ranging query_string.
P.S.: Of course, remove request_cache=false in production.
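As a rough sketch of the other suggestion, a wildcard query scoped to a single field could look like this (field name borrowed from the regexp example above; only * and ? are special in wildcard syntax, and a leading * remains expensive):

```json
GET fuzzy/_search
{
  "query": {
    "wildcard": {
      "identifier": "*2aw-cotton_&_son_(*"
    }
  }
}
```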

Elasticsearch "AND in query_string" vs. "default_operator AND"

elasticsearch v7.1.1
I don't understand the difference between a query_string containing "AND"
vs. "default_operator": "AND".
I thought they should yield the same result, but they don't:
HTTP POST http://localhost:9200/umlautsuche
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": ["ph => f"]
}
},
"filter": {
"my_ngram": {
"type": "edge_ngram",
"min_gram": 3,
"max_gram": 10
}
},
"analyzer": {
"my_name_analyzer": {
"tokenizer": "standard",
"char_filter": [
"my_char_filter"
],
"filter": [
"lowercase",
"german_normalization"
]
}
}
}
},
"mappings": {
"date_detection": false,
"dynamic_templates": [
{
"string_fields_german": {
"match_mapping_type": "string",
"match": "*",
"mapping": {
"type": "text",
"analyzer": "my_name_analyzer"
}
}
},
{
"dates": {
"match": "lastModified",
"match_pattern": "regex",
"mapping": {
"type": "date",
"ignore_malformed": true
}
}
}
]
}
}
HTTP POST http://localhost:9200/_bulk
{ "index" : { "_index" : "umlautsuche", "_id" : "1" } }
{"vorname": "Stephan-Jörg", "nachname": "Müller", "ort": "Hollabrunn"}
{ "index" : { "_index" : "umlautsuche", "_id" : "2" } }
{"vorname": "Stephan-Joerg", "nachname": "Mueller", "ort": "Hollabrunn"}
{ "index" : { "_index" : "umlautsuche", "_id" : "3" } }
{"vorname": "Stephan-Jörg", "nachname": "Müll", "ort": "Hollabrunn"}
No results here, which I did not expect:
HTTP POST http://localhost:9200/umlautsuche/_search
{
"query": {
"query_string": {
"query": "Stefan Müller Jör*",
"analyze_wildcard": true,
"default_operator": "AND",
"fields": ["vorname", "nachname"]
}
}
}
This query gives the results as expected by me:
HTTP POST http://localhost:9200/umlautsuche/_search
{
"query": {
"query_string": {
"query": "Stefan AND Müller AND Jör*",
"analyze_wildcard": true,
"default_operator": "AND",
"fields": ["vorname", "nachname"]
}
}
}
How do I configure the query/analyzer so that I don't need these "AND"s between my search terms?
What you are facing is an obscure corner of query_string's boolean-operator logic, and possibly undocumented behavior. Because of this obscurity I believe it is better either to use a bool query with explicit logic, or to use copy_to.
Let me explain in a bit more detail what is going on and how you can fix it.
Why doesn't the first query match?
In order to see how the query gets executed, let's set profile: true:
POST /umlautsuche/_search
{
"query": {
"query_string": {
"query": "Stefan Müller Jör*",
"analyze_wildcard": true,
"default_operator": "AND",
"fields": [
"vorname",
"nachname"
]
}
},
"profile": true
}
In the ES response we will see:
"profile": {
"shards": [
{
"id": "[QCANVs5gR0GOiiGCmEwj7w][umlautsuche][0]",
"searches": [
{
"query": [
{
"type": "BooleanQuery",
"description": "+((+nachname:stefan +nachname:muller) | (+vorname:stefan +vorname:muller)) +(nachname:jor* | vorname:jor*)",
"time_in_nanos": 17787641,
"breakdown": {
"set_min_competitive_score_count": 0,
We are interested in this part:
"+((+nachname:stefan +nachname:muller) | (+vorname:stefan +vorname:muller)) +(nachname:jor* | vorname:jor*)"
Without going into deep analysis, we can tell that this query requires stefan and muller to match within the same field (both in nachname, or both in vorname), which never happens in these documents (stefan is never a surname, and muller is never a first name).
What we actually want to do, I presume, is "find people whose full name is Stefan Müller Jör*". This is not what the query generated by Elasticsearch does.
Why does the second query match?
Let's do the same trick with profile: true. The response would contain this:
"profile": {
"shards": [
{
"id": "[QCANVs5gR0GOiiGCmEwj7w][umlautsuche][0]",
"searches": [
{
"query": [
{
"type": "BooleanQuery",
"description": "+(nachname:stefan | vorname:stefan) +(nachname:muller | vorname:muller) +(nachname:jor* | vorname:jor*)",
"time_in_nanos": 17970342,
"breakdown": {
We can see that the query got interpreted like this:
"+(nachname:stefan | vorname:stefan) +(nachname:muller | vorname:muller) +(nachname:jor* | vorname:jor*)"
We can roughly interpret this as "find people where each of these three terms matches either the first name or the surname", which is what we expect it to do.
In the documentation of query_string query it says that with default_operator: AND it should interpret spaces as ANDs:
The default operator used if no explicit operator is specified. For
example, with a default operator of OR, the query capital of Hungary
is translated to capital OR of OR Hungary, and with default operator
of AND, the same query is translated to capital AND of AND Hungary.
The default value is OR.
However, from what we have just seen, this does not seem to be accurate, at least when querying multiple fields.
So what can we do about it?
Use bool with explicit logic
This query seems to work:
POST /umlautsuche/_search
{
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "Stefan Müller Jör*",
"analyze_wildcard": true,
"fields": [
"vorname"
]
}
},
{
"query_string": {
"query": "Stefan Müller Jör*",
"analyze_wildcard": true,
"fields": [
"nachname"
]
}
}
]
}
}
}
This query is not an exact equivalent; consider it an example. For instance, if we had another record like this, without "Jörg":
{"vorname": "Stephan", "nachname": "Müll", "ort": "Hollabrunn"}
the bool query above would match it despite the missing "Jörg". To overcome this you can write a more complex bool query, but that will not do if you want to avoid parsing user input.
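For illustration, one such stricter bool query could require every term to match in at least one of the two fields, mirroring the raw query the second query_string produced; note this sketch does require splitting the user input into terms:

```json
POST /umlautsuche/_search
{
  "query": {
    "bool": {
      "must": [
        { "multi_match": { "query": "Stefan", "fields": ["vorname", "nachname"] } },
        { "multi_match": { "query": "Müller", "fields": ["vorname", "nachname"] } },
        { "query_string": { "query": "Jör*", "analyze_wildcard": true, "fields": ["vorname", "nachname"] } }
      ]
    }
  }
}
```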
How can we still use a plain, unparsed query string?
Introduce a copy_to field
We can try to use copy_to capability. It will copy the content of several fields into another field and will analyze these fields all together.
We will have to modify the mapping configuration (unfortunately the existing index will have to be recreated):
"mappings": {
"date_detection": false,
"dynamic_templates": [
{
"name_fields_german": {
"match_mapping_type": "string",
"match": "*name",
"mapping": {
"type": "text",
"analyzer": "my_name_analyzer",
"copy_to": "full_name"
}
}
},
{
"string_fields_german": {
"match_mapping_type": "string",
"match": "*",
"mapping": {
"type": "text",
"analyzer": "my_name_analyzer"
}
}
},
{
"dates": {
"match": "lastModified",
"match_pattern": "regex",
"mapping": {
"type": "date",
"ignore_malformed": true
}
}
}
]
}
Then we can populate the index in exactly the same manner as we did before.
Now we can query the new field full_name with the following query:
POST /umlautsuche/_search
{
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "Stefan Müller Jör*",
"analyze_wildcard": true,
"default_operator": "AND",
"fields": [
"full_name"
]
}
}
]
}
}
}
This query returns the same 2 documents as the second query. Thus, in this case default_operator: AND behaves as we would expect, requiring all tokens from the query string to match.
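As a side note, the reason Stefan matches Stephan at all is the my_char_filter mapping (ph => f) combined with german_normalization (which folds ö to o); this is easy to verify with the _analyze API:

```json
POST /umlautsuche/_analyze
{
  "analyzer": "my_name_analyzer",
  "text": "Stephan-Jörg"
}
```

The returned tokens should be stefan and jorg, which is exactly what the profiled queries above match against.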
Hope that helps!

Elastic search: How to highlight the fragment after the search term?

I am working on a search project which requires highlighting the fragment after the search term.
My query is:
{
"query": {
"multi_match" : {
"query" : "prawn",
"fields": ["name"]
, "operator": "and",
"use_dis_max": true
}
},
"_source": ["name"],
"highlight": {
"fields": {
"name": {
"pre_tags" : [""], "post_tags" : [""],
"fragment_size": 3,
"number_of_fragments": 1
}
}
}
}
Result:
{
"name" : "special prawn curry"
},
"highlight" : {
"name" : [
"special prawn"
]
}
Whereas, I want the result like
"name" : "special prawn curry"
},
"highlight" : {
"name" : [
"prawn curry"
]
}
i.e. the fragment after the search word. Is this possible?
You can make use of the plain highlighter (by setting "type": "plain" in the highlight section) and see if that works out.
This used to be the default highlighter until the 6.0 release, where unified became the default highlighter.
POST <your_index_name>/_search
{
"query": {
"multi_match" : {
"query" : "prawn",
"fields": ["name"]
, "operator": "and",
"use_dis_max": true
}
},
"_source": ["name"],
"highlight": {
"fields": {
"name": {
"type": "plain", <---- Added this
"pre_tags" : [""], "post_tags" : [""],
"fragment_size": 3,
"number_of_fragments": 1,
"order": "score"
}
}
}
}
Hope this helps!

ElasticSearch multi_match query not working

I'm using Elasticsearch 5.3 and have this super simple multi_match query. None of the documents contains this nonsense query string, neither in the title nor the content. Still, ES gives me plenty of matches.
{
"query" : {
"multi_match" : {
"query" : "42a6o8435a6o4385q023bf50",
"fields" : [
"title","content"
],
"type" : "best_fields",
"operator" : "AND",
"minimum_should_match" : "100%"
}
}
}
The analyzer for content is "english" and for title it is this one:
"analyzer": {
"english_autocomplete": {
"tokenizer": "standard",
"filter": [
"lowercase",
"english_stop",
"english_stemmer",
"autocomplete_filter"
]
}
}
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
Am I missing something, or how can I tell ES not to do that?
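No answer was recorded for this question, but one detail worth checking: with min_gram set to 1, the edge_ngram filter emits single-character grams, so if english_autocomplete also runs at search time, the query above dissolves into prefix grams like 4, 42, 42a, ... that match nearly every document. The filter's output can be inspected with _analyze:

```json
POST <your_index_name>/_analyze
{
  "analyzer": "english_autocomplete",
  "text": "42a6o8435a6o4385q023bf50"
}
```

A common remedy is to keep english_autocomplete as the index-time analyzer and configure a search_analyzer without the edge_ngram filter on the title field.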

Override search requirement in Elasticsearch

We search on the following fields in our index:
individual_name (string)
organisation_name (string)
profile (string)
locations (string)
nationwide (boolean)
If a user searches for "optometrist" in "Hamilton", and an optometrist in our index has listed themselves as "nationwide" (but not specifically in Hamilton), desired behaviour is that the optometrist would show up with the Hamilton results - effectively ignoring the location requirement.
We're currently running a multi_match query, an example of which is below.
{
"query": {
"filtered" : {
"query" : {
"multi_match": {
"query": "optometrist",
"zero_terms_query": "all",
"operator": "and",
"fields": [
"individual_name^1.2",
"organisation_name^1.5",
"profile",
"accreditations"
]
}
},
"filter": {
"and": [{
"term": {
"locations" : "hamilton"
}
}]
}
}
}
}
How can this be modified so documents with "nationwide": "yes" are returned for this query, regardless of location?
I've tried an or query under the and, but of course that ignored the multi_match.
I think this will give you the desired results:
{
"query": {
"filtered" : {
"query" : {
"multi_match": {
"query": "optometrist",
"zero_terms_query": "all",
"operator": "and",
"fields": [
"individual_name^1.2",
"organisation_name^1.5",
"profile",
"accreditations"
]
}
},
"filter": {
"or": [
{"term": {"locations" : "hamilton"}},
{"term": {"nationwide": "yes"}}
]
}
}
}
}
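For reference, the filtered query used above was deprecated in Elasticsearch 2.0 and removed in 5.0; on newer versions the same logic can be sketched with a bool query, with the or becoming a should inside the filter clause:

```json
{
  "query": {
    "bool": {
      "must": {
        "multi_match": {
          "query": "optometrist",
          "zero_terms_query": "all",
          "operator": "and",
          "fields": ["individual_name^1.2", "organisation_name^1.5", "profile", "accreditations"]
        }
      },
      "filter": {
        "bool": {
          "should": [
            { "term": { "locations": "hamilton" } },
            { "term": { "nationwide": "yes" } }
          ],
          "minimum_should_match": 1
        }
      }
    }
  }
}
```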
