Requiring Phrase Matches in Elasticsearch SimpleStringQuery - elasticsearch

I'm creating a simple search engine using Elasticsearch 7.7 and the python elasticsearch_dsl package version 7.0.0. I'm using the simple_query_string search, because I'd like to enable most common search functionality (boolean operators, phrase search) without having to parse the query myself. This is largely working well except for the phrase match functionality.
I would like to ensure that every result includes a phrase match if the query contains a phrase. This is how Google works: if I search for "green eggs" and ham, no result will lack "green eggs".
Let's assume I have 3 documents in my index:
{
  "question": "I love my phrase",
  "background": "dont you"
},
{
  "question": "I love my phrase",
  "background": "and other terms"
},
{
  "question": "I have other terms",
  "background": "and more"
}
What I am seeing now:
As expected, the below query only returns the first two documents, which have "my phrase" in one of the fields.
{
'simple_query_string':
{
'query': '"my phrase"',
'fields': ['question', 'background']
}
}
Contrary to what I expect, the below query will return all 3 results, with the 3rd one scored higher than the 1st.
{
'simple_query_string':
{
'query': '"my phrase" other terms',
'fields': ['question', 'background']
}
}
How can I alter my query so that searching for '"my phrase" other terms' will not return the 3rd document because it does not contain the phrase search, but score the 2nd document higher than the 1st because it contains additional search terms outside of the phrase?
Things I have tried that have not worked:
'query': '"my phrase" AND (other terms)'
'query': '"my phrase" AND other terms'
Thank you

Contrary to what I expect, the below query will return all 3 results
By default, the words in the query are combined with the OR operator: see the description of the default_operator parameter in the simple_query_string documentation. Your second query is interpreted as "my phrase" OR other OR terms, so it returns all 3 documents: each one contains at least one of "my phrase", other, or terms.
How can I alter my query so that searching for '"my phrase" other terms' will not return the 3rd document because it does not contain the phrase search, but score the 2nd document higher than the 1st because it contains additional search terms outside of the phrase?
AFAIK, this isn't possible with simple_query_string. You can use query_string instead, which supports boolean operators: a leading + marks a clause as required. With that feature you can write a query that gives the desired result:
{
"query": {
"query_string": {
"query": "+\"my phrase\" other terms",
"fields": ["question", "background"]
}
}
}
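If the phrase and the extra terms arrive separately in application code, the query string above can be assembled programmatically. The following is only a minimal sketch: the helper name `build_required_phrase_query` is hypothetical, it builds a plain dict rather than calling any client library, and it escapes only double quotes inside the phrase (other reserved query_string characters in the optional terms are not handled).

```python
import json


def build_required_phrase_query(phrase, optional_terms, fields):
    """Build a query_string body in which `phrase` must match.

    The leading '+' marks the quoted phrase as required; the remaining
    terms stay optional and only contribute to the score.
    """
    # Escape embedded double quotes so the phrase stays one quoted unit.
    safe_phrase = phrase.replace('"', '\\"')
    query_text = '+"{}"'.format(safe_phrase)
    if optional_terms:
        query_text += " " + " ".join(optional_terms)
    return {
        "query": {
            "query_string": {
                "query": query_text,
                "fields": list(fields),
            }
        }
    }


body = build_required_phrase_query(
    "my phrase", ["other", "terms"], ["question", "background"]
)
print(json.dumps(body, indent=2))
```

The resulting dict can be passed as the search body by whichever client you use.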

Related

Elasticsearch: What is the difference between a match and a term in a filter?

I was following an ES tutorial, and at some point I wrote a query using term in the filter instead of the recommended solution using match. My understanding was that match is used in the query part to get scoring, while term is used in the filter part to just remove hits before they reach the query part. To my surprise, match also works in the filter part.
What is the difference between:
GET blogs/_search
{
"query": {
"bool": {
"filter": {
"match": {
"category.keyword": "News"
}
}
}
}
}
and:
GET blogs/_search
{
"query": {
"bool": {
"filter": {
"term": {
"category.keyword": "News"
}
}
}
}
}
Both return the same hits, and the score is 0 for all hits.
What is the behaviour of match in a filter clause? I would expect it to yield some score, but it does not.
What I thought:
term : does not analyze either the parameter or the field, and it is a yes/no scenario.
match : analyzes parameter and field and calculates a score of how good they match.
But when using match against a keyword in the filter part of the query, how does it behave?
The match query is a high-level query that resorts to using a term query if it needs to.
Scoring has nothing to do with using match instead of term. Scoring kicks in when you use bool/must/should instead of bool/filter.
Here is how the match query works:
First, it checks the type of the field.
If it's a text field then the value will be analyzed, either with the analyzer specified in the query (if any), or with the search- or index-time analyzer specified in the mapping.
If it's a keyword field (like in your case), then the input is not analyzed and taken "as is"
Since you're using the match query on a keyword field and your input is a single term, nothing is analyzed and the match query resorts to using a term query underneath. This is why you're seeing the same results.
In general, it's always best to use a match query as it is smart enough to know what to do given the field you're querying and the input data you're searching for.
You can read more about the difference between the two here.
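To make the point about scoring concrete, here is a small sketch that wraps the same match clause once in filter context and once in query context. The helper name `bool_query` is hypothetical and the code only builds request bodies as dicts; it does not talk to a cluster.

```python
def bool_query(clause, scored):
    """Wrap `clause` in a bool query.

    scored=False puts the clause under bool/filter (no scoring);
    scored=True puts it under bool/must (the clause contributes a score).
    """
    occurrence = "must" if scored else "filter"
    return {"query": {"bool": {occurrence: [clause]}}}


match_clause = {"match": {"category.keyword": "News"}}

# Same clause, different context: only the second body produces scores.
filter_body = bool_query(match_clause, scored=False)
scored_body = bool_query(match_clause, scored=True)

print(filter_body)
print(scored_body)
```

Both bodies select the same documents; per the answer above, only the context the clause sits in decides whether a score is calculated.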

elasticsearch: or operator, number of matches

Is it possible to score my searches according to the number of matches when using operator "or"?
Currently query looks like this:
"query": {
"function_score": {
"query": {
"match": {
"tags.eng": {
"query": "apples banana juice",
"operator": "or",
"fuzziness": "AUTO"
}
}
},
"script_score": {
"script": # TODO
},
"boost_mode": "replace"
}
}
I don't want to use "and" operator, since I want documents containing "apple juice" to be found, as well as documents containing only "juice", etc. However a document containing the three words should score more than documents containing two words or a single word, and so on.
I found a possible solution here https://github.com/elastic/elasticsearch/issues/13806
which uses bool queries. However I don't know how to access the tokens (in this example: apples, banana, juice) generated by the analyzer.
Any help?
Based on the discussions above I came up with the following solution, which is a bit different from what I imagined when I asked the question, but works for my case.
First of all I defined a new similarity:
"settings": {
"similarity": {
"boost_similarity": {
"type": "scripted",
"script": {
"source": "return 1;"
}
}
}
...
}
Then I had the following problem:
A query for "apple banana juice" gave the same score to a doc with tags ["apple juice", "apple"] and to another doc with tags ["banana", "apple juice"], although I would like the second one to score higher.
From another discussion I found out that this was caused by my tags being a nested field, so I created a regular text field to address it.
But I also wanted to distinguish between a doc with tags ["apple", "banana", "juice"] and another doc with the tag ["apple banana juice"] (all three words in the same tag). The final solution was therefore to keep both fields (a nested field and a text field) for my tags.
Finally, the query consists of a bool query with two should clauses: the first should clause runs on the text field and uses an "or" operator; the second runs on the nested field and uses an "and" operator.
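The shape of that final query might look roughly like the sketch below. The answer does not give the real field names, so `tags_text` (the flat text field), `tags` (the nested path) and the helper name `build_tag_query` are assumptions made for illustration; only `tags.eng` comes from the original question.

```python
def build_tag_query(user_input,
                    text_field="tags_text",   # hypothetical flat text field
                    nested_path="tags",       # hypothetical nested path
                    nested_field="tags.eng"):
    """Bool query with two should clauses, as described above."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Clause 1: any of the words may match on the text field.
                    {"match": {text_field: {"query": user_input,
                                            "operator": "or"}}},
                    # Clause 2: all words must occur within one nested tag.
                    {"nested": {
                        "path": nested_path,
                        "query": {"match": {nested_field: {
                            "query": user_input,
                            "operator": "and"}}},
                    }},
                ]
            }
        }
    }


tag_body = build_tag_query("apple banana juice")
print(tag_body)
```

A document that satisfies both clauses collects the score of both, which is what separates a single "apple banana juice" tag from three separate tags.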
Although I found a solution for this specific issue, I still face a few other problems when using ES to search for tagged documents. The examples in the documentation seem to work very well for full-text search. Does anyone know where I can find something more specific to tagged documents?

Boosting the relevance score based on the unique keyword found

I am in a scenario where I need to give more relevance to the document in Index if it has a unique keyword. Let me provide a scenario.
Let's say I search for znkdref unsuccessfull. The results will include content that contains znkdref, unsuccessfull, or both. I want content containing both znkdref and unsuccessfull to have the highest relevance, content containing only znkdref to rank next, and content containing only unsuccessfull to rank lowest.
Is there a way to achieve this? I would be glad to get any help.
You want to use Query Time Boosting, in particular Prioritized Clauses.
In short you need to extract the keywords that you want boosted and build a query that boosts the parts that you want.
{
"query": {
"bool": {
"should": [{
"match": {
"content": {
"query": "znkdref",
"boost": 2
}
}
},
{
"match": {
"content": {
"query": "unsuccessfull"
}
}
}]
}
}
}
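Once the keywords and their weights have been extracted, a query of this shape can be generated rather than written by hand. This is a minimal sketch under that assumption: the helper name `build_boosted_query` and the explicit boost of 1 for the second term are illustrative, and how the boosts are chosen is left to the application.

```python
def build_boosted_query(field, term_boosts):
    """Build a bool/should query with one boosted match clause per term.

    term_boosts maps each keyword to its boost; a document matching
    several clauses accumulates the scores of all of them.
    """
    should = [
        {"match": {field: {"query": term, "boost": boost}}}
        for term, boost in term_boosts.items()
    ]
    return {"query": {"bool": {"should": should}}}


boosted_body = build_boosted_query(
    "content", {"znkdref": 2, "unsuccessfull": 1}
)
print(boosted_body)
```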
Update based on comment:
If you want to know why a document got the score that it did (maybe to identify "keywords") then you can pass in "explain" as a query parameter or set it in the root POST payload. The result will now have document frequency counts and sub scores.
Do you mean that "znkdref" is a unique keyword, for example a special name for something? If so, documents that match the whole query string "znkdref unsuccessfull" will generally have the highest relevance score.
Documents that contain "znkdref" will usually score higher than documents that contain only "unsuccessfull", because the TF-IDF score of "znkdref" is larger than the TF-IDF score of "unsuccessfull".
The relevance score function is described at https://www.elastic.co/guide/en/elasticsearch/guide/current/practical-scoring-function.html
I hope that my answer is helpful for you.

Elasticsearch how to match documents for which the field tokens are a sub-set of the query tokens

I have a keyword/key-phrase field that I tokenize using the standard analyser. I want this field to match if the search phrase contains all of the field's tokens.
For example if the field value is "veni, vidi, vici" and the search phrase is "Ceaser veni,vidi,vici" I want this search phrase to match but search phrase "veni, vidi" not match.
I also need "vidi, veni, vici" (weird!) to match. So the positions and ordering of the terms is not really important. A phrase match would not quite work for me I think.
I can use "bool query" with "minimum_should_match" parameter for this specific example but that is not really what I want as minimum should match is about ratio/number of tokens in the search phrase.
Pure ES solution would go like this. You will need two requests.
1) First you need to pass user query through analyze api to get all the search tokens.
curl -XGET 'localhost:9200/_analyze' -d '
{
"analyzer" : "standard",
"text" : "Ceaser veni,vidi,vici"
}'
You will get 4 tokens: ceaser, veni, vidi, vici. Pass these tokens as an array to the next search request.
2) We need to search for documents whose tokens are subset of search tokens.
{
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"query": {
"match": {
"title": "Ceaser veni,vidi,vici"
}
}
},
{
"script": {
"script": "if(search_tokens.containsAll(doc['title'].values)){return true;}",
"params": {
"search_tokens": [
"ceaser",
"veni",
"vidi",
"vici"
]
}
}
}
]
}
}
}
}
}
Here the job of the first match query inside the filter is to narrow down the documents on which the script runs. The containsAll method checks whether the document's tokens are a subset of the search tokens. This will be slow, but it will do the job with your current setup. One big improvement is to store the tokens as an array in a separate field, so that doc['title'].values can be replaced with that field, which should speed up the script.
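The subset test that the script performs can be sketched outside Elasticsearch as plain set logic. Note the assumption: the `tokens` helper below is only a rough stand-in for the standard analyzer (lowercase, split on non-word characters), and both function names are hypothetical.

```python
import re


def tokens(text):
    """Rough stand-in for the standard analyzer: lowercase word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def field_is_covered(field_value, search_phrase):
    """True when every token of the field occurs in the search phrase."""
    return tokens(field_value) <= tokens(search_phrase)


print(field_is_covered("veni, vidi, vici", "Ceaser veni,vidi,vici"))
print(field_is_covered("veni, vidi, vici", "vidi, veni, vici"))
print(field_is_covered("veni, vidi, vici", "veni, vidi"))
```

Because sets ignore position and order, the reordered phrase is accepted while the shorter phrase is rejected, which is the behaviour the question asks for.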
Hope this helps!
No built-in solution but this works:
1. Add an extra field with the number of terms in the field for each document. So in your "veni, vidi, vici" example, you would have a field like "field_term_count" : 3.
2. Perform a separate match search for each token in the search query.
3. Sum the number of searches that matched for each document with at least one match (e.g. a hashtable with key of document ID and value of count).
4. Compare the number of matches from step 3 to the "field_term_count" field for each of the documents with matches. If they are equal then the document is a match.
Then "Ceaser veni,vidi,vici" will match but the search phrase "veni, vidi" will not, as desired. It should be quite fast for reasonable numbers of matches.
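The counting and comparison steps above can be sketched on the client side as follows. The per-token search results are simulated here with plain dicts instead of real Elasticsearch responses, the document IDs are made up, and the function name `subset_matches` is hypothetical. The sketch also assumes the query tokens are de-duplicated and that field_term_count counts distinct terms.

```python
from collections import Counter


def subset_matches(hits_per_token, field_term_counts):
    """Return the IDs of documents whose field tokens all occur in the query.

    hits_per_token: query token -> IDs of documents matched by that token.
    field_term_counts: document ID -> stored "field_term_count" value.
    """
    match_counts = Counter()
    for doc_ids in hits_per_token.values():
        # Count each document at most once per query token.
        match_counts.update(set(doc_ids))
    return sorted(
        doc_id for doc_id, count in match_counts.items()
        if count == field_term_counts[doc_id]
    )


# Simulated index: doc1 holds "veni, vidi, vici", doc2 holds "veni, vidi".
term_counts = {"doc1": 3, "doc2": 2}

long_query_hits = {"ceaser": [], "veni": ["doc1", "doc2"],
                   "vidi": ["doc1", "doc2"], "vici": ["doc1"]}
short_query_hits = {"veni": ["doc1", "doc2"], "vidi": ["doc1", "doc2"]}

print(subset_matches(long_query_hits, term_counts))
print(subset_matches(short_query_hits, term_counts))
```

With the shorter query only doc2 qualifies, since doc1 has a third term that the query does not supply.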

Is it possible to chain fquery filters in elastic search with exact matches?

I have been having trouble writing a method that will take in various search parameters in elasticsearch. I was working with queries that looked like this:
body:
{query:
{filtered:
{filter:
{and:
[
{term: {some_term: "foo"}},
{term: {is_visible: true}},
{term: {"term_two": "something"}}]
}
}
}
}
Using this syntax I thought I could chain these terms together and programmatically generate these queries. I was using simple strings, and if there was a term like "person_name" I could split the query into two and say "where person_name match 'JOHN'" and "where person_name match 'SMITH'", getting accurate results.
However, I just came across the "fquery" upon asking this question:
Escaping slash in elasticsearch
I was not able to use this "and"/"term" filter when searching for a value with slashes in it, so I learned that I can use fquery to search for the full value, like this:
"fquery": {
"query": {
"match": {
"by_line": "John Smith"
}
}
}
But how can I search like this for multiple items? It seems that when I combine fquery with my filtered/filter/and/term queries, my "and" term queries are ignored. What is the best practice for making nested / chained queries using Elasticsearch?
As in the comment below, yes I can just add fquery to the "and" block like so
{:filtered=>
{:filter=>
{:and=>[
{:term=>{:is_visible=>true}},
{:term=>{:is_private=>false}},
{:fquery=>
{:query=>{:match=>{:sub_location=>"New JErsey"}}}}]}}}
Why would elasticsearch also return results with "sub_location" = "new York"? I would like to only return "new jersey" here.
A match query analyzes the input and by default it is a boolean OR query if there are multiple terms after the analysis. In your case, "New JErsey" gets analyzed into the terms "new" and "jersey". The match query that you are using will search for documents in which the indexed value of field "sub_location" is either "new" or "jersey". That is why your query also matches documents where the value of field "sub_location" is "new York" because of the common term "new".
To only match for "new jersey", you can use the following version of the match query:
{
"query": {
"match": {
"sub_location": {
"query": "New JErsey",
"operator": "and"
}
}
}
}
This will not match documents where the value of field "sub_location" is "New York". However, it will match documents where the value is, say, "Jersey New", because the query ultimately translates into a boolean query like "new" AND "jersey", which ignores term order. If you are fine with this behaviour, well and good; otherwise read on.
All these issues arise because you are using the default analyzer for the field "sub_location", which breaks the text into tokens at word boundaries and indexes them separately. If you do not care about partial matches and always want to match the entire string, you can define a custom analyzer that uses the Keyword Tokenizer and the Lowercase Token Filter. Mind you, this approach requires re-indexing all your documents.
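An index definition along those lines might look like the sketch below. The analyzer name `lowercase_keyword` is made up for illustration, and the mappings section assumes the typeless format of recent Elasticsearch versions; older versions (such as the one the filtered/fquery syntax above belongs to) lay out mappings differently, so treat this as a starting point rather than a drop-in definition.

```python
import json

index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                # Hypothetical name; the whole value becomes one lowercased token.
                "lowercase_keyword": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase"],
                }
            }
        }
    },
    # Assumed typeless mapping layout; adjust for your Elasticsearch version.
    "mappings": {
        "properties": {
            "sub_location": {
                "type": "text",
                "analyzer": "lowercase_keyword",
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```

With such an analyzer, "New JErsey" is indexed and searched as the single token "new jersey", so a match query no longer hits "new York".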
