Exclude certain fields from search - Elasticsearch

I have indexed documents that each have over 100 fields, and every field is also analysed with an edge n-gram tokenizer to support auto-suggestion. I need a free-text search across all fields. When I run it, the search also hits the autocomplete-analysed sub-fields (e.g. Data.autocomplete_analyzed). I want to restrict it to the fields analysed as plain "text" (e.g. Data). Is there a way to do this at (1) index time or (2) query time?
Mapping file:
"mappings": {
"_doc": {
"properties": {
"Data": {
"type": "text",
"fields": {
"autocomplete_analyzed": {
"type": "text",
"analyzer": "autocomplete"
},
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
Search query :
{
"query": {
"bool": {
"should": [
{
"multi_match": {
"query": "aim",
"type": "phrase",
"slop": "2",
"fields": []
}
},
{
"multi_match": {
"query": "aim",
"fuzziness": "1",
"fields": []
}
}
],
"minimum_should_match": 1
}
}
}

At query time you can use source filtering to choose which fields are returned. Note that this controls only the returned _source, not which fields are searched.
GET /_search
{
"_source": [ "obj1.*", "obj2.*" ],
"query" : {
"term" : { "user" : "kimchy" }
}
}
If you use query_string for the search, you can restrict it to specific fields with the fields parameter:
GET /_search
{
"query": {
"query_string": {
"query": "this AND that OR thus",
"fields": [
"docfilename",
"filepath",
"singlewordfield"
]
}
}
}
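Restricting the search to the base fields comes down to listing them explicitly in "fields". As a minimal sketch, assuming the mapping is available as a Python dict, the helper below collects text fields while skipping multi-field sub-fields such as Data.autocomplete_analyzed, and puts the result into a multi_match body. The name base_text_fields is hypothetical and not part of any Elasticsearch client.

```python
from typing import Any, Dict, List


def base_text_fields(properties: Dict[str, Any], prefix: str = "") -> List[str]:
    """Collect full paths of text fields, skipping multi-field sub-fields."""
    names: List[str] = []
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if spec.get("type") == "text":
            # Entries under spec["fields"] (e.g. autocomplete_analyzed) are
            # deliberately not visited, so they never reach the query.
            names.append(path)
        # Object fields nest their children under "properties".
        if "properties" in spec:
            names.extend(base_text_fields(spec["properties"], prefix=f"{path}."))
    return names


# The "properties" section of the mapping from the question.
mapping_properties = {
    "Data": {
        "type": "text",
        "fields": {
            "autocomplete_analyzed": {"type": "text", "analyzer": "autocomplete"},
            "keyword": {"type": "keyword", "ignore_above": 256},
        },
    }
}

fields = base_text_fields(mapping_properties)

# Request body that searches only the collected base fields.
query_body = {
    "query": {
        "multi_match": {
            "query": "aim",
            "fields": fields,
        }
    }
}
print(query_body)
```

With over 100 fields, building the list once from the mapping avoids maintaining it by hand; the same list can be reused in both multi_match clauses of the original bool query.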

Related

How does phrase searching and phrase search with ~N interact with quote_field_suffix in a simple query string query?

For example, given:
PUT index
{
"settings": {
"analysis": {
"analyzer": {
"english_exact": {
"tokenizer": "standard",
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"properties": {
"body": {
"type": "text",
"analyzer": "english",
"fields": {
"exact": {
"type": "text",
"analyzer": "english_exact"
}
}
}
}
}
}
PUT index/_doc/1
{
"body": "Ski resorts"
}
PUT index/_doc/2
{
"body": "Ski house resorts"
}
What happens with the following queries?
{
"query": {
"simple_query_string": {
"fields": [ "body" ],
"quote_field_suffix": ".exact",
"query": "\"ski resort\""
}
}
}
{
"query": {
"simple_query_string": {
"fields": [ "body" ],
"quote_field_suffix": ".exact",
"query": "\"ski resort\"~2"
}
}
}
Will the ".exact" extend to the entire phrase, so in this case the first query would get no results?
How could you do a phrase search that is not exact when using quote "quote_field_suffix": ".exact"?
Will the ".exact" extend to the entire phrase, so in this case the first query would get no results?
Yes, your understanding is correct.
The documentation describes quote_field_suffix as a "suffix appended to quoted text in the query string".
So the quoted phrase is searched on body.exact, which only matches the exact words "ski resort". No document contains them, so the result is empty.
How could you do a phrase search that is not exact when using quote "quote_field_suffix": ".exact"?
{
"query": {
"simple_query_string": {
"fields": [ "body" ],
"quote_field_suffix": ".exact",
"query": "ski resort~2"
}
}
}
It is not exact, because it also returns "Ski resorts".

Elasticsearch "AND in query_string" vs. "default_operator AND"

elasticsearch v7.1.1
I don't understand the difference between a query_string containing "AND" and "default_operator": "AND".
I thought they should yield the same result, but they don't:
HTTP PUT http://localhost:9200/umlautsuche
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": ["ph => f"]
}
},
"filter": {
"my_ngram": {
"type": "edge_ngram",
"min_gram": 3,
"max_gram": 10
}
},
"analyzer": {
"my_name_analyzer": {
"tokenizer": "standard",
"char_filter": [
"my_char_filter"
],
"filter": [
"lowercase",
"german_normalization"
]
}
}
}
},
"mappings": {
"date_detection": false,
"dynamic_templates": [
{
"string_fields_german": {
"match_mapping_type": "string",
"match": "*",
"mapping": {
"type": "text",
"analyzer": "my_name_analyzer"
}
}
},
{
"dates": {
"match": "lastModified",
"match_pattern": "regex",
"mapping": {
"type": "date",
"ignore_malformed": true
}
}
}
]
}
}
HTTP POST http://localhost:9200/_bulk
{ "index" : { "_index" : "umlautsuche", "_id" : "1" } }
{"vorname": "Stephan-Jörg", "nachname": "Müller", "ort": "Hollabrunn"}
{ "index" : { "_index" : "umlautsuche", "_id" : "2" } }
{"vorname": "Stephan-Joerg", "nachname": "Mueller", "ort": "Hollabrunn"}
{ "index" : { "_index" : "umlautsuche", "_id" : "3" } }
{"vorname": "Stephan-Jörg", "nachname": "Müll", "ort": "Hollabrunn"}
No results here, which I did not expect:
HTTP POST http://localhost:9200/umlautsuche/_search
{
"query": {
"query_string": {
"query": "Stefan Müller Jör*",
"analyze_wildcard": true,
"default_operator": "AND",
"fields": ["vorname", "nachname"]
}
}
}
This query gives the results I expected:
HTTP POST http://localhost:9200/umlautsuche/_search
{
"query": {
"query_string": {
"query": "Stefan AND Müller AND Jör*",
"analyze_wildcard": true,
"default_operator": "AND",
"fields": ["vorname", "nachname"]
}
}
}
How do I configure the query or analyzer so that I don't need these "AND"s between my search terms?
What you are facing is a quirk in how query_string handles boolean operators, and possibly undocumented behavior. Because of this, I believe it is better either to use a bool query with explicit logic or to use copy_to.
Let me explain in a bit more detail what is going on and how you can fix it.
Why doesn't the first query match?
In order to see how the query gets executed, let's set profile: true:
POST /umlautsuche/_search
{
"query": {
"query_string": {
"query": "Stefan Müller Jör*",
"analyze_wildcard": true,
"default_operator": "AND",
"fields": [
"vorname",
"nachname"
]
}
},
"profile": true
}
In the ES response we will see:
"profile": {
"shards": [
{
"id": "[QCANVs5gR0GOiiGCmEwj7w][umlautsuche][0]",
"searches": [
{
"query": [
{
"type": "BooleanQuery",
"description": "+((+nachname:stefan +nachname:muller) | (+vorname:stefan +vorname:muller)) +(nachname:jor* | vorname:jor*)",
"time_in_nanos": 17787641,
"breakdown": {
"set_min_competitive_score_count": 0,
We are interested in this part:
"+((+nachname:stefan +nachname:muller) | (+vorname:stefan +vorname:muller)) +(nachname:jor* | vorname:jor*)"
Without going into deep analysis, we can tell that this query requires stefan and muller to appear together in the same field: both in nachname or both in vorname. That is impossible here, because stefan is never a surname and muller is never a first name among the documents.
What we actually want, I presume, is "find people whose full name is Stefan Müller Jör*". This is not what the query generated by Elasticsearch does.
Why does the second query match?
Let's do the same trick with profile: true. The response contains this:
"profile": {
"shards": [
{
"id": "[QCANVs5gR0GOiiGCmEwj7w][umlautsuche][0]",
"searches": [
{
"query": [
{
"type": "BooleanQuery",
"description": "+(nachname:stefan | vorname:stefan) +(nachname:muller | vorname:muller) +(nachname:jor* | vorname:jor*)",
"time_in_nanos": 17970342,
"breakdown": {
We can see that the query got interpreted like this:
"+(nachname:stefan | vorname:stefan) +(nachname:muller | vorname:muller) +(nachname:jor* | vorname:jor*)"
Which we can roughly interpret as "find people where each of these three terms matches either the first name or the surname", which is what we expect it to do.
In the documentation of query_string query it says that with default_operator: AND it should interpret spaces as ANDs:
The default operator used if no explicit operator is specified. For
example, with a default operator of OR, the query capital of Hungary
is translated to capital OR of OR Hungary, and with default operator
of AND, the same query is translated to capital AND of AND Hungary.
The default value is OR.
However, from what we have just seen, this does not seem to hold, at least when querying multiple fields.
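To make the difference between the two profiled queries concrete, here is a toy evaluation of both boolean expressions in plain Python. It is only a sketch: the token lists are hand-derived assumptions about what my_name_analyzer produces for the three documents (ph mapped to f, lowercased, umlauts normalized), and the helper names are made up for illustration.

```python
from typing import Dict, List

# Assumed analyzer output per document and field (hand-derived, not fetched
# from Elasticsearch).
docs: Dict[str, Dict[str, List[str]]] = {
    "1": {"vorname": ["stefan", "jorg"], "nachname": ["muller"]},
    "2": {"vorname": ["stefan", "jorg"], "nachname": ["muller"]},
    "3": {"vorname": ["stefan", "jorg"], "nachname": ["mull"]},
}
FIELDS = ["nachname", "vorname"]


def has_term(doc: Dict[str, List[str]], field: str, term: str) -> bool:
    return term in doc[field]


def has_prefix(doc: Dict[str, List[str]], field: str, prefix: str) -> bool:
    return any(token.startswith(prefix) for token in doc[field])


def implicit_and(doc: Dict[str, List[str]]) -> bool:
    # +((+nachname:stefan +nachname:muller) | (+vorname:stefan +vorname:muller))
    # +(nachname:jor* | vorname:jor*)
    both_in_one_field = any(
        has_term(doc, f, "stefan") and has_term(doc, f, "muller") for f in FIELDS
    )
    wildcard = any(has_prefix(doc, f, "jor") for f in FIELDS)
    return both_in_one_field and wildcard


def explicit_and(doc: Dict[str, List[str]]) -> bool:
    # +(nachname:stefan | vorname:stefan) +(nachname:muller | vorname:muller)
    # +(nachname:jor* | vorname:jor*)
    return (
        any(has_term(doc, f, "stefan") for f in FIELDS)
        and any(has_term(doc, f, "muller") for f in FIELDS)
        and any(has_prefix(doc, f, "jor") for f in FIELDS)
    )


implicit_hits = [doc_id for doc_id, doc in docs.items() if implicit_and(doc)]
explicit_hits = [doc_id for doc_id, doc in docs.items() if explicit_and(doc)]
print(implicit_hits)  # query without explicit AND
print(explicit_hits)  # query with explicit AND
```

Under these assumptions the first expression matches nothing, while the second matches documents 1 and 2, which is consistent with the behavior reported in the question.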
So what can we do about it?
Use bool with explicit logic
This query seems to work:
POST /umlautsuche/_search
{
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "Stefan Müller Jör*",
"analyze_wildcard": true,
"fields": [
"vorname"
]
}
},
{
"query_string": {
"query": "Stefan Müller Jör*",
"analyze_wildcard": true,
"fields": [
"nachname"
]
}
}
]
}
}
}
This query is not an exact equivalent; consider it an example. For instance, if we had another record like this, without "Jörg":
{"vorname": "Stephan", "nachname": "Müll", "ort": "Hollabrunn"}
the bool query above would match it despite the missing "Jörg". To overcome this you can write a more complex bool query, but that will not do if you want to avoid parsing user input.
How can we still use plain, unparsed query string?
Introduce a copy_to field
We can try the copy_to capability. It copies the content of several fields into another field, where the combined values are analyzed together.
We will have to modify the mapping configuration (unfortunately, the existing index has to be recreated):
"mappings": {
"date_detection": false,
"dynamic_templates": [
{
"name_fields_german": {
"match_mapping_type": "string",
"match": "*name",
"mapping": {
"type": "text",
"analyzer": "my_name_analyzer",
"copy_to": "full_name"
}
}
},
{
"string_fields_german": {
"match_mapping_type": "string",
"match": "*",
"mapping": {
"type": "text",
"analyzer": "my_name_analyzer"
}
}
},
{
"dates": {
"match": "lastModified",
"match_pattern": "regex",
"mapping": {
"type": "date",
"ignore_malformed": true
}
}
}
]
}
Then we can populate the index in exactly the same manner as we did before.
Now we can query the new field full_name with the following query:
POST /umlautsuche/_search
{
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "Stefan Müller Jör*",
"analyze_wildcard": true,
"default_operator": "AND",
"fields": [
"full_name"
]
}
}
]
}
}
}
This query returns the same 2 documents as the second query. Thus, in this case default_operator: AND behaves as we would expect, requiring all tokens from the query to match.
Hope that helps!

Elastic synonyms are taking over other words

On this sequence of commands:
Create the index:
PUT /test_index
{
"settings": {
"analysis": {
"analyzer": {
"GermanCompoundWordsAnalyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"german_compound_synonym",
"german_normalization"
]
}
},
"filter": {
"german_compound_synonym": {
"type": "synonym",
"synonyms": [
"teppichläufer, auslegware läufer"
]
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"name": {
"type": "text",
"analyzer": "GermanCompoundWordsAnalyzer"
}
}
}
}
}
Adding a few documents:
POST test_index/_doc/
{
"sku" : "kimchy",
"name" : "teppichläufer alfa"
}
POST test_index/_doc/
{
"sku" : "kimchy",
"name" : "teppichläufer beta"
}
Searching for one document (as I would expect), but 2 are returned :(
GET /test_index/_search
{
"query": {
"match": {
"name": {
"query": "teppichläufer beta",
"operator": "and"
}
}
}
}
I get both documents because, with the synonym rule "teppichläufer, auslegware läufer", läufer ends up at position 1 and 'substitutes' for beta. If I remove "analyzer": "GermanCompoundWordsAnalyzer", I get just one document, as expected.
How do I use these synonyms without running into this issue?
POST /test_index/_search
{
"query": {
"bool" : {
"should": [
{
"query_string": {
"default_field": "name",
"query": "teppichläufer beta"
, "default_operator": "AND"
}
}
]
}
}
}
After a little more searching I found it in the documentation. This was an RTFM problem, sorry guys.
I tried with:
https://www.elastic.co/guide/en/elasticsearch/reference/master/analysis-synonym-graph-tokenfilter.html
The funny part is that it makes the NDCG of the results worse :)
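The position clash described in the question can be modelled in a few lines of Python. This is a toy sketch, not Elasticsearch code: the token positions are an assumption about what the flat "synonym" filter emits for "teppichläufer beta" (german_normalization is left out because it does not change the logic), and the rule that terms stacked at one position are alternatives is a simplified view of how a match query treats them.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Token = Tuple[int, str]  # (position, term)

# Assumed analysis of the query "teppichläufer beta": the multi-word synonym
# pushes "läufer" onto position 1, where it is stacked with "beta".
query_tokens: List[Token] = [
    (0, "teppichläufer"),
    (0, "auslegware"),
    (1, "läufer"),
    (1, "beta"),
]

# Assumed indexed terms for the two documents (same filter at index time).
indexed_docs: Dict[str, Set[str]] = {
    "alfa": {"teppichläufer", "auslegware", "läufer", "alfa"},
    "beta": {"teppichläufer", "auslegware", "läufer", "beta"},
}


def group_by_position(tokens: List[Token]) -> List[Set[str]]:
    groups: Dict[int, Set[str]] = defaultdict(set)
    for position, term in tokens:
        groups[position].add(term)
    return [groups[position] for position in sorted(groups)]


def matches_with_and(doc_terms: Set[str], tokens: List[Token]) -> bool:
    # operator "and": every position must match, but any one of the terms
    # stacked at that position is enough.
    return all(doc_terms & alternatives for alternatives in group_by_position(tokens))


hits = [name for name, terms in indexed_docs.items() if matches_with_and(terms, query_tokens)]
print(hits)
```

In this model the "alfa" document satisfies position 1 through "läufer" alone, so "beta" is never actually required. A graph-aware filter keeps "auslegware läufer" together as one path, which is the motivation for the synonym_graph filter linked above.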

How do I prioritize matches at the beginning of strings in Elasticsearch?

I have an Elasticsearch instance full of documents containing movie and series titles.
When I run this:
{
"query": {
"bool": {
"must": [
{
"multi_match": {
"fields": [
"Name^2",
"SeriesName^1.5",
"Description"
],
"fuzziness": "AUTO",
"prefix_length": 2,
"query": "game"
}
}
]
}
}
}
... I get titles like "The big game", "Hunger games", "War game", etc.
However, I would like to get titles starting with "game" BEFORE titles just containing "game".
When a user searches for "game", they expect titles like "Game of Thrones" and "Game change", before "The imitation game".
How can I make this more precise? Thank you!
Try something like the query below:
{ "query": {
"prefix" : { "Name" : "game" }
}
}
Please refer to the Elasticsearch documentation for details.
To do this, your field/property has to be tokenized as a keyword; see the settings and mapping below. You can also add a lowercase filter to the analyzer for your field/property.
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"analyzer_startswith": {
"tokenizer": "keyword",
"filter": "lowercase"
}
}
}
}
},
"mappings": {
"test_index": {
"properties": {
"Name": {
"search_analyzer": "analyzer_startswith",
"index_analyzer": "analyzer_startswith",
"type": "string"
}
}
}
}
}
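One possible way to combine the two answers with the original query, sketched here as a Python helper that only builds the request body: keep the multi_match in must so the same titles still match, and add the prefix query as a boosted should clause so that titles starting with the search text score higher. This combination is not from the answers above. The function name, the sub-field name Name.startswith (assumed to use a keyword-tokenized, lowercased analyzer like analyzer_startswith) and the boost value are all hypothetical.

```python
from typing import Any, Dict


def build_title_query(
    text: str,
    prefix_field: str = "Name.startswith",  # hypothetical keyword-tokenized field
    prefix_boost: float = 3.0,  # arbitrary illustrative value
) -> Dict[str, Any]:
    """Build a bool query that ranks titles starting with `text` first."""
    return {
        "query": {
            "bool": {
                # Unchanged from the question: decides which documents match.
                "must": [
                    {
                        "multi_match": {
                            "fields": ["Name^2", "SeriesName^1.5", "Description"],
                            "fuzziness": "AUTO",
                            "prefix_length": 2,
                            "query": text,
                        }
                    }
                ],
                # Optional clause: only adds score when the title starts with
                # the text. Lowercased on the assumption that the prefix term
                # is not analyzed.
                "should": [
                    {
                        "prefix": {
                            prefix_field: {
                                "value": text.lower(),
                                "boost": prefix_boost,
                            }
                        }
                    }
                ],
            }
        }
    }


body = build_title_query("Game")
print(body)
```

Because the should clause sits next to a must clause, it does not filter anything out; it only contributes to the score of documents that already match.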

Multi_match and match queries together

I have the following queries in Elasticsearch:
{
"query": {
"multi_match": {
"query": "bluefin bat",
"type": "phrase",
"fields": [
"title^5",
"body.value"
]
}
},
"highlight": {
"fields": {
"body.value": {
"number_of_fragments": 3
}
}
},
"fields": [
"title",
"id"
]
}
I have tried using "dis_max", but then both of my fields have to be searched with the same query text.
The remaining match query has a different query text.
It looks like this:
{
"query": {
"match": {
"ingredients": {
"query": "key1, key2",
"analyzer": "keyword_analyzer"
}
}
}
}
How can I integrate these two queries without using dis_max to join them?
I figured out the answer. multi_match internally applies "dis_max".
Hence, you cannot apply dis_max with multi_match.
But I can use a bool query to solve this type of problem.
I can use should, which effectively translates to a boolean OR, or must, which is equivalent to AND.
So this is how I modified my query:
{
"query": {
"bool":{
"should": [
{"multi_match":
{"query": "SOME_QUERY",
"type": "phrase",
"fields": ["title^5","body"]
}
},
{
"match":{
"labels" :{
"query": "SOME_QUERY",
"analyzer": "keyword_analyzer"
}
}
},
{
"match":{
"displayName" :{
"query": "SOME_QUERY",
"fuzziness": "AUTO"
}
}
}
],
"minimum_number_should_match": "50%"
}
},
"fields": ["title","id","labels","displayName","username"],
"highlight": {
"fields": {
"body.storage.value": {
"number_of_fragments": 3}
}
}
}
I hope this helps someone in the future.
