Elasticsearch more like this returns too many documents - elasticsearch

I have documents like this:
{
title:'...',
body: '...'
}
I want to get documents that are more than 90% similar to a specific document. I have used this query:
query = {
"query": {
"more_like_this" : {
"fields" : ["title", "body"],
"like" : "body of another document",
"min_term_freq" : 1,
"max_query_terms" : 12
}
}
}
How can I change this query to check for 90% similarity with the specified document?

Take a look at the Query Formation Parameter minimum_should_match

You should specify minimum_should_match:
minimum_should_match
After the disjunctive query has been formed, this parameter controls
the number of terms that must match. The syntax is the same as the
minimum should match. (Defaults to "30%").
The query itself is formed like this:
The MLT query simply extracts the text from the input document,
analyzes it, usually using the same analyzer at the field, then
selects the top K terms with the highest tf-idf to form a disjunctive
query of these terms
So if the title matters more to you than the body, you should boost the title field: when the title contains most of the terms selected by term frequency / inverse document frequency, the document is more relevant and should rank higher. You can boost your title field by 1.5, for example.
Refer to the documentation on the more_like_this query for reference.
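As a minimal sketch of the suggestion above, the query from the question can be built with minimum_should_match set to "90%". The helper name build_mlt_query is hypothetical, and the query body is only constructed as a Python dict here, not sent to a cluster. Note that this requires 90% of the selected terms (at most max_query_terms of them) to match, which approximates "90% similar" rather than measuring it exactly.

```python
def build_mlt_query(like_text, fields, minimum_should_match="90%"):
    """Build a more_like_this query body as a plain dict (hypothetical helper)."""
    return {
        "query": {
            "more_like_this": {
                "fields": list(fields),
                "like": like_text,
                "min_term_freq": 1,
                "max_query_terms": 12,
                # Share of the selected terms that a hit must contain.
                "minimum_should_match": minimum_should_match,
            }
        }
    }


query = build_mlt_query("body of another document", ["title", "body"])
```

Raising max_query_terms makes the 90% requirement stricter, because more terms are selected and more of them then have to be present in each hit.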

Related

Elastic Search - Conditional field query if no match found for another field

Is it possible to query one field only if no match was found for another field?
For example, if I have three fields in the index, local_rating, global_rating and default_rating, I need to check local_rating first, then try global_rating if there is no match, and finally default_rating.
Is this possible with one query, or is there another way to achieve this?
Thanks in advance.
I am not sure whether Elasticsearch has an existing feature that meets this requirement exactly, but you can try querying all three fields with per-field boosting. Individual fields can be boosted with the caret (^) notation. I also don't know whether boosting works with a numeric query value.
GET /_search
{
"query": {
"multi_match" : {
"query" : 10,
"fields" : [ "local_rating^6", "global_rating^3","default_rating"]
}
}
}
See: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-multi-match-query.html#field-boost

How to decrease the TF score in Elasticsearch?

I have two docs: 1. "Some Important Company", 2. "Some Important Company Important branch".
"Important" has a high docCount (many docs contain the word "Important"), so when I search for "Some Important Company",
the 2nd doc gets a higher score, even though the 1st doc is an exact match.
So my question is: how can I boost the score for an exact match, or decrease the TF score?
My query is a multi_match on customerName and usedName, but usedName is always "" in this case.
I assume the field of your document is indexed using the standard text analyzer or something similar. I would combine a match query and a match_phrase query using a dis_max compound query.
This would give something like this:
{
"query": {
"dis_max" : {
"queries" : [
{ "match" : { "myField" : "Some Important Company" }},
{ "match_phrase" : { "myField" : "Some Important Company" }}
],
"tie_breaker" : 0.7
}
}
}
There's no notion of "matching an exact phrase" with the match query. For this you need to use the match_phrase query. That's why you combine the two here. Using the dis_max, documents that match the two queries will get a boost. You can read more about dis_max and match_phrase:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-dis-max-query.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase.html
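To see why a document matching both clauses ends up ahead, here is a small sketch of the dis_max scoring rule: the best subquery score plus tie_breaker times the scores of the other matching subqueries. The function name and the scores are made up for illustration; real scores depend on the index.

```python
def dis_max_score(subquery_scores, tie_breaker=0.0):
    """Combine per-subquery scores the way dis_max does for one document:
    the best score, plus tie_breaker times every other matching score."""
    if not subquery_scores:
        return 0.0
    best = max(subquery_scores)
    return best + tie_breaker * (sum(subquery_scores) - best)


# Hypothetical scores, for illustration only.
match_only = dis_max_score([2.0], tie_breaker=0.7)             # only "match" hits
match_and_phrase = dis_max_score([2.0, 2.0], tie_breaker=0.7)  # both clauses hit
```

With these numbers the document that satisfies both clauses scores higher than the one that satisfies only the match clause; with tie_breaker set to 0 the two would score the same.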

Elasticsearch match query and tokenization

I wrote the following query on a field that is tokenized by whitespace:
"match" : {
"field" : {
"query" : "bora"
}
}
I have two documents in my index that match the query, one with "bora" in that field and another with "bora bora".
My problem is that the "bora bora" document ends up with a better score than the other one, and this is not the required behaviour.
Do you see a way to run the same query but prioritize the records which are not a repetition of the searched word?
I can't update the index or remove the tokenization.

tf/idf boosting within field

My use case is like this:
For the query iphone charger, results named iphone charger coupons get higher relevance than results named iphone charger, possibly because of a better match in the description and other fields. Boosting the name field isn't helping much unless I skew its importance drastically. What I really need is a tf/idf boost within the name field.
To quote the Elasticsearch blog:
the frequency of a term in a field is offset by the length of the field. However, the practical scoring function treats all fields in the same way. It will treat all title fields (because they are short) as more important than all body fields (because they are long).
I need to boost this more important value for a particular field. Can we do this with function score or any other way?
A one term difference in length is not much of a difference to the scoring algorithm (and, in fact, can vanish entirely due to imprecision on the length norm). If there are hits on other fields, you have a lot of scoring elements to fight against.
A dis_max would probably be a reasonable approach to this. Instead of all the additive scores and coords and such you are trying to overcome, it will simply select the score of the best matching subquery. If you boost the query against title, you can ensure matches there are strongly preferred.
You can then assign a "tie_breaker", so that the score of the description subquery contributes only a fraction of its value and mainly decides between documents whose "title" scores are tied.
{
"dis_max" : {
"tie_breaker" : 0.2,
"queries" : [
{
"terms" : {
"title" : ["iphone", "charger"],
"boost" : 10
}
},
{
"terms" : {
"description" : ["iphone", "charger"]
}
}
]
}
}
Another approach to this sort of thing, if you need to know for certain when you have an exact match against the entire field, is to separately index an untokenized version of that field and query that field as well. Any match against the untokenized version of the field will be an exact match against the entire field contents. This saves you from relying on the length norm to make that determination.
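A rough sketch of that second approach, under several assumptions: the field is called name, the untokenized copy is a keyword sub-field with the hypothetical name raw, and the mapping uses the typeless format of recent Elasticsearch versions (older versions used a not_analyzed string instead). The dicts are only built here, not sent to a cluster.

```python
# Index mapping: "name" is analyzed as text, "name.raw" keeps the whole value
# as a single untokenized term. The sub-field name "raw" is an assumption.
mapping = {
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "fields": {"raw": {"type": "keyword"}},
            }
        }
    }
}


def build_exact_or_analyzed_query(text, field="name", exact_boost=10):
    """Prefer an exact whole-field match, fall back to the analyzed match."""
    return {
        "query": {
            "dis_max": {
                "tie_breaker": 0.2,
                "queries": [
                    # Matches only when the entire field value equals `text`.
                    {"term": {f"{field}.raw": {"value": text, "boost": exact_boost}}},
                    # Ordinary analyzed match on the same field.
                    {"match": {field: text}},
                ],
            }
        }
    }


query = build_exact_or_analyzed_query("iphone charger")
```

A keyword field compares values exactly, including case, so "iPhone Charger" would not match "iphone charger" unless the values are normalized at index time.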

Finding fields Elasticsearch has matched on

I am using Elasticsearch to search for a group a user should join. I have the user data nested into the search query. On return I get back the closest matched group that user should be in.
The field I am searching on is a nested field as follows:
`{"interests": [
{"topics":["python", "stackoverflow", "elasticsearch"]},
{"topics":["arts", "textiles"]}
]}`
However, how do you find out why a document matched?
Elasticsearch does have an explain function which shows what the score is made up of using tf-idf, but not specifically which terms were used.
For example, if I search for 'Textile', the doc should match on 'textiles'. Thus I want the term 'textiles' to be returned in explain or in some other way.
The only way I can see to do this is to store the search and the retrieved document, and then process both to work out which words ES has most likely matched on.
EDIT - for some more clarity of the question
An example from my index is a group which has "interests": ['arts', 'fine arts', 'art painting', 'arts and crafts', 'sports']
In my search I am looking for Arts and many other things. The term I am searching for comes up in this list many times, so it should always be a contributor.
What I want in the response is to be told that these words were matched: ['arts', 'fine arts', 'art painting', 'arts and crafts'], along with the degree to which they match, i.e. 'arts' should be higher than the others, but all the others are also relevant.
Elasticsearch allows you to specify the _name field for all queries and
filters. This means that you can separate your query into different parts with
separate names, which will allow you to determine which parts matched.
For example:
{
"query" : {
"bool" : {
"should" : [
{"match" : { "interests.topics" : {"query" : "python", "_name" : "py-topic"} }},
{"match" : { "interests.topics" : {"query" : "arts", "_name" : "arts-topic"} }}
]
}
}
}
Then, in your response, you will get back an array of which queries (or
filters) matched, and you can determine whether the py-topic query and/or the
arts-topic query matched above.
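As a sketch of reading that array back: each hit in the search response carries a matched_queries list with the names of the clauses that matched it. The helper name and the response fragment below are hand-written for illustration and are not real output from a cluster.

```python
def matched_names_by_doc(response):
    """Map each hit's _id to the names of the named queries that matched it."""
    hits = response.get("hits", {}).get("hits", [])
    return {hit["_id"]: hit.get("matched_queries", []) for hit in hits}


# Hand-written response fragment, for illustration only.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "1", "_score": 1.2, "matched_queries": ["py-topic", "arts-topic"]},
            {"_id": "2", "_score": 0.4, "matched_queries": ["arts-topic"]},
        ]
    }
}

matched = matched_names_by_doc(sample_response)
```

This tells you which named clauses matched each document, not which indexed term (such as 'textiles') produced the match, so one clause per search term gives the finest granularity.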
