How to decrease the TF score in Elasticsearch?

I have two documents: 1. "Some Important Company" and 2. "Some Important Company Important branch".
"Important" has a high docCount (many documents contain the word "Important"), so when I search for "Some Important Company",
the 2nd document gets a higher score, even though the 1st document is an exact match.
So my question is: how can I boost the score for an exact match, or decrease the TF score?
My query is a multi_match on customerName and usedName, but usedName is "" for every document in this case.

I assume the field of your document is indexed with the standard text analyzer or something similar. I would combine a match query and a match_phrase query using a dis_max compound query.
This would give something like this:
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "myField": "Some Important Company" }},
        { "match_phrase": { "myField": "Some Important Company" }}
      ],
      "tie_breaker": 0.7
    }
  }
}
There's no notion of "matching an exact phrase" with the match query; for that you need the match_phrase query, which is why the two are combined here. With dis_max and a non-zero tie_breaker, documents that match both queries get a boost. You can read more about dis_max and match_phrase here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-dis-max-query.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase.html

Related

Elasticsearch fuzziness with multi_match and bool_prefix type

I have a set of search_as_you_type_fields I need to search against. Here is my mapping
"mappings" : {
  "properties" : {
    "description" : {
      "type" : "search_as_you_type",
      "doc_values" : false,
      "max_shingle_size" : 3
    },
    "questions" : {
      "properties" : {
        "content" : {
          "type" : "search_as_you_type",
          "doc_values" : false,
          "max_shingle_size" : 3
        },
        "tags" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword"
            }
          }
        }
      }
    },
    "title" : {
      "type" : "search_as_you_type",
      "doc_values" : false,
      "max_shingle_size" : 3
    }
  }
}
I am using a multi_match query with bool_prefix type.
"query": {
"multi_match": {
"query": "triangle",
"type": "bool_prefix",
"fields": [
"title",
"title._2gram",
"title._3gram",
"description",
"description._2gram",
"description._3gram",
"questions.content",
"questions.content._2gram",
"questions.content._3gram",
"questions.tags",
"questions.tags._2gram",
"questions.tags._3gram"
]
}
}
So far this works fine. Now I want to add typo tolerance, which is fuzziness in ES. However, it looks like bool_prefix has some conflicts with it. If I modify my query by adding "fuzziness": "AUTO" and make an error in a word ("triangle" -> "triangld"), I don't get any results.
However, if I search for the phrase "right triangle", I see different behavior:
1. Even if no typo is made, I get more results just by adding "fuzziness": "AUTO" (1759 vs 1267).
2. If I add a typo to the 2nd word ("right triangdd"), it seems to work, but it now pushes results containing "right" without "triangle" ("The Bill of Rights", "Due process and right to privacy" etc.) to the front.
3. If I make a typo in the 1st word ("righd triangle") or in both ("righd triangdd"), the results seem to be just fine. So this is probably the only correct behavior.
I've seen a couple of articles and even GitHub issues saying that fuzziness does not work properly with a multi_match query of type bool_prefix, but I can't find a workaround. I've tried changing the query type, but it looks like bool_prefix is the only one that supports search-as-you-type, and I need to return search results as soon as a user starts typing.
Since I make all the requests to ES from our backend, I can also manipulate the query string to build different search query types if needed, for example one type for 1-word searches and another for multi-word searches. But I basically need to maintain the current behavior.
I've also tried appending "~" or "~1[2]" to the string, which seems to be another way of specifying fuzziness, but the results are rather unclear and performance (search speed) seems to be worse.
My questions are:
1. How can I achieve fuzziness for 1-word searches, so that the query "triangld" returns documents containing "triangle" etc.?
2. How can I achieve correct search results when the typo is in the 2nd (last?) word of the query? As mentioned above, it works, but see point 2 above.
3. Why does just adding fuzziness (see point 1) return more results even if the phrase is correct?
4. Is there anything I need to change in my analyzers etc.?
To achieve the desired behavior, we did the following:
1. Changed the query type to "query_string".
2. Added query string preprocessing on the backend. We split the query string on whitespace and append "~1" or "~2" to each word if it is longer than 4 or 8 characters respectively (~ is the fuzziness syntax in ES). However, we don't add this to the word currently being typed until the user types a whitespace. For example, while the user is typing [t, tr, tri, ... triangle] there is no fuzziness, but once the input is "triangle " it becomes "triangle~2". This is because the results are unexpected when the last word has fuzziness.
3. Removed all ngram fields from the search fields, as we get the same results but performance is a bit better.
4. Added "default_operator": "AND" to the query to restrict the results for phrase queries to matches within one field.
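The backend preprocessing described in this answer could look roughly like the following Python sketch. The function names and the field list passed in are hypothetical, and the length thresholds are an assumption: the answer says "longer than 4 or 8 characters", yet its own example turns the 8-letter word "triangle" into "triangle~2", so this sketch treats 8 characters as already qualifying for "~2". It also does not escape query_string special characters in user input.

```python
def fuzzy_suffix(word: str) -> str:
    """Return the query_string fuzziness suffix for one word (thresholds are assumed)."""
    if len(word) >= 8:
        return "~2"
    if len(word) > 4:
        return "~1"
    return ""


def add_fuzziness(raw_query: str) -> str:
    """Append ~1/~2 to every finished word; leave the word still being typed alone."""
    words = raw_query.split()
    if not words:
        return ""
    # The last word counts as finished only once a whitespace follows it.
    still_typing = not raw_query[-1].isspace()
    finished = words[:-1] if still_typing else words
    result = [word + fuzzy_suffix(word) for word in finished]
    if still_typing:
        result.append(words[-1])
    return " ".join(result)


def build_fuzzy_query(raw_query: str, fields: list) -> dict:
    """Build a query_string request body with default_operator AND."""
    return {
        "query": {
            "query_string": {
                "query": add_fuzziness(raw_query),
                "fields": fields,
                "default_operator": "AND",
            }
        }
    }
```

With these assumed thresholds, add_fuzziness("right triangle ") gives "right~1 triangle~2", while add_fuzziness("right tri") gives "right~1 tri" because "tri" is still being typed.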

Requiring Phrase Matches in Elasticsearch SimpleStringQuery

I'm creating a simple search engine using Elasticsearch 7.7 and the python elasticsearch_dsl package version 7.0.0. I'm using the simple_query_string search, because I'd like to enable most common search functionality (boolean operators, phrase search) without having to parse the query myself. This is largely working well except for the phrase match functionality.
I would like to ensure that all results include the phrase if one is in the query, e.g. the way Google works: if I search for "green eggs" and ham, there will be no results that do not include "green eggs".
Let's assume I have 3 documents in my index:
{
  "question": "I love my phrase",
  "background": "dont you"
},
{
  "question": "I love my phrase",
  "background": "and other terms"
},
{
  "question": "I have other terms",
  "background": "and more"
}
What I am seeing now:
As expected, the below query only returns the first two documents, which have "my phrase" in one of the fields.
{
'simple_query_string':
{
'query': '"my phrase"',
'fields': ['question', 'background']
}
}
Contrary to what I expect, the below query will return all 3 results, with the 3rd one scored higher than the 1st.
{
'simple_query_string':
{
'query': '"my phrase" other terms',
'fields': ['question', 'background']
}
}
How can I alter my query so that searching for '"my phrase" other terms' will not return the 3rd document because it does not contain the phrase search, but score the 2nd document higher than the 1st because it contains additional search terms outside of the phrase?
Things I have tried that have not worked:
'query': '"my phrase" AND (other terms)'
'query': '"my phrase" AND other terms'
Thank you
Contrary to what I expect, the below query will return all 3 results
By default, the words in a query are combined with the OR operator: see the description of the default_operator parameter in the simple_query_string documentation. Your second query is interpreted as "my phrase" OR other OR terms, so it returns all 3 results: each document contains at least one of "my phrase", other, terms.
How can I alter my query so that searching for '"my phrase" other terms' will not return the 3rd document because it does not contain the phrase search, but score the 2nd document higher than the 1st because it contains additional search terms outside of the phrase?
AFAIK, this isn't possible with the simple_query_string search. You can try the query_string search, which supports boolean operators. Using them, you can write a query that produces the desired result:
{
"query": {
"query_string": {
"query": "+\"my phrase\" other terms",
"fields": ["question", "background"]
}
}
}
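Since the question builds its queries from Python, here is a minimal sketch of how the user's input might be rewritten so that every quoted phrase becomes required before it is sent as a query_string query. The helper names are hypothetical and not part of elasticsearch_dsl; the sketch only produces the request body as a plain dict and assumes phrases are delimited by plain double quotes.

```python
import re


def require_phrases(query: str) -> str:
    """Prefix each quoted phrase with '+' unless it already carries '+' or '-'."""
    def add_plus(match):
        # Group 1 is an existing operator (possibly empty), group 2 the phrase.
        prefix = match.group(1) or "+"
        return prefix + match.group(2)

    return re.sub(r'([+-]?)("[^"]+")', add_plus, query)


def build_phrase_query(user_query: str, fields: list) -> dict:
    """Build a query_string request body in which quoted phrases are mandatory."""
    return {
        "query": {
            "query_string": {
                "query": require_phrases(user_query),
                "fields": fields,
            }
        }
    }
```

For the query in the question, require_phrases('"my phrase" other terms') returns '+"my phrase" other terms', which is the query string used in the answer.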

How to boost individual words in a elasticsearch match query

Suppose I want to query "Best holiday places to visit during summer" in an Elasticsearch cluster, but I want holiday, visit and summer to have higher priority than the other words.
Something like this: Best holiday^4 places to visit^3 during summer^2.
I know about field boosting, but what I want to do is not achievable with field boosts.
Basically, I want to boost individual words.
Does anyone have an idea how to do this in Elasticsearch 5.6 and above?
You could use query_string to boost individual terms like this:
{
"query" : {
"query_string" : {
"fields" : ["content", "name"],
"query" : "Best holiday^4 places to visit^3 during summer^2"
}
}
}
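If the boosts are decided in application code, the boosted query string can be assembled before it is sent. The following Python sketch assumes a hypothetical helper boost_terms and a hypothetical word-to-boost mapping; it only builds the string and does not escape query_string special characters.

```python
def boost_terms(text: str, boosts: dict) -> str:
    """Append ^boost to every word that has an entry in boosts (case-insensitive)."""
    boosted = []
    for word in text.split():
        factor = boosts.get(word.lower())
        boosted.append(f"{word}^{factor}" if factor is not None else word)
    return " ".join(boosted)


def build_boosted_query(text: str, boosts: dict, fields: list) -> dict:
    """Build a query_string request body with per-term boosts."""
    return {
        "query": {
            "query_string": {
                "fields": fields,
                "query": boost_terms(text, boosts),
            }
        }
    }
```

Calling boost_terms("Best holiday places to visit during summer", {"holiday": 4, "visit": 3, "summer": 2}) produces the query string shown in the answer.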

Elasticsearch more like this returns too many documents

I have documents like this:
{
title:'...',
body: '...'
}
I want to get documents that are more than 90% similar to a specific document. I have used this query:
query = {
"query": {
"more_like_this" : {
"fields" : ["title", "body"],
"like" : "body of another document",
"min_term_freq" : 1,
"max_query_terms" : 12
}
}
}
How to change this query to check for 90% similarity with specified doc?
Take a look at the query formation parameter minimum_should_match.
You should specify minimum_should_match:
minimum_should_match
After the disjunctive query has been formed, this parameter controls
the number of terms that must match. The syntax is the same as the
minimum should match. (Defaults to "30%").
The query is formed like this:
The MLT query simply extracts the text from the input document,
analyzes it, usually using the same analyzer at the field, then
selects the top K terms with the highest tf-idf to form a disjunctive
query of these terms
So if the title matters more to you, boost the title field: a title that contains most of the selected high tf-idf terms makes the document more relevant, so its score should be boosted. You can boost your title field by 1.5.
Refer to the documentation on the more_like_this query for reference.
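As a minimal sketch, the query from the question with minimum_should_match raised to "90%" could be built like this in Python (the helper name mlt_query is hypothetical). Note that this requires 90% of the selected query terms to match; it approximates the request and is not a true 90% similarity measure between documents.

```python
def mlt_query(like_text: str, fields: list, min_match: str = "90%") -> dict:
    """Build a more_like_this request body with a stricter minimum_should_match."""
    return {
        "query": {
            "more_like_this": {
                "fields": fields,
                "like": like_text,
                "min_term_freq": 1,
                "max_query_terms": 12,
                # Share of the selected terms that a document must contain.
                "minimum_should_match": min_match,
            }
        }
    }


body = mlt_query("body of another document", ["title", "body"])
```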

How to enable fuzziness for phrase queries in ElasticSearch

We're using ElasticSearch for searching through millions of tags. Our users should be able to include boolean operators (+, -, "xy", AND, OR, brackets). If no hits are returned, we fall back to a spelling suggestion provided by ES and search again. That's our query:
$ curl -XGET 'http://127.0.0.1:9200/my_index/my_type/_search' -d '
{
  "query" : {
    "query_string" : {
      "query" : "some test query +bools -included",
      "default_operator" : "AND"
    }
  },
  "suggest" : {
    "text" : "some test query +bools -included",
    "simple_phrase" : {
      "phrase" : {
        "field" : "my_tags_field",
        "size" : 1
      }
    }
  }
}'
Instead of only providing a fallback to spelling suggestions, we'd like to enable fuzzy matching. If, for example, a user searches for "stackoverfolw", ES should return matches for "stackoverflow".
Additional question: What's the better performing method for "correcting" spelling errors? As it is now, we have to perform two subsequent requests, first with the original search term, then with the term suggested by ES.
The query_string query does support some fuzziness, but only when using the ~ operator, which I think doesn't fit your use case. I would add a fuzzy query and combine it with the existing query_string using OR. For instance, you can use a bool query and add the fuzzy query as a should clause, keeping the original query_string as a must clause.
As for your additional question about how to correct spelling mistakes: I would use fuzzy queries to automatically correct them and two subsequent requests if you want the user to select the right correction from a list (e.g. Did you mean), but your approach sounds good too.
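A sketch of such a combined bool query as a Python request body follows. The helper name is hypothetical and my_tags_field is the field from the question. One assumption is made loudly here: both clauses are placed under should (the "OR" reading of the answer) with minimum_should_match set to 1, because a must clause has to match for a document to be returned, so with the query_string under must the fuzzy clause would only influence scoring. The fuzzy query is a term-level query, so it is given a single term.

```python
def fuzzy_or_query(user_query: str, term: str, field: str = "my_tags_field") -> dict:
    """Match either the original query_string or a fuzzy variant of one term."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"query_string": {"query": user_query, "default_operator": "AND"}},
                    {"fuzzy": {field: {"value": term, "fuzziness": "AUTO"}}},
                ],
                # At least one of the two clauses has to match.
                "minimum_should_match": 1,
            }
        }
    }


body = fuzzy_or_query("stackoverfolw", "stackoverfolw")
```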
