tf/idf boosting within field

My use case is like this:
For the query iphone charger, results named "iphone charger coupons" score higher than results named "iphone charger", possibly because of better matches in the description and other fields. Boosting the name field isn't helping much unless I skew its importance drastically. What I really need is a tf/idf boost within the name field.
To quote the Elasticsearch blog:
the frequency of a term in a field is offset by the length of the field. However, the practical scoring function treats all fields in the same way. It will treat all title fields (because they are short) as more important than all body fields (because they are long).
I need to boost this "more important" weighting for a particular field. Can we do this with function score or in any other way?

A one term difference in length is not much of a difference to the scoring algorithm (and, in fact, can vanish entirely due to imprecision on the length norm). If there are hits on other fields, you have a lot of scoring elements to fight against.
A dis_max would probably be a reasonable approach to this. Instead of all the additive scores and coords and such you are trying to overcome, it will simply select the score of the best matching subquery. If you boost the query against title, you can ensure matches there are strongly preferred.
You can then assign a "tie_breaker", so that the score against the description subquery contributes only a fraction of its value and mainly decides between documents whose "title" scores are tied.
{
  "dis_max" : {
    "tie_breaker" : 0.2,
    "queries" : [
      {
        "terms" : {
          "title" : ["iphone", "charger"],
          "boost" : 10
        }
      },
      {
        "terms" : {
          "description" : ["iphone", "charger"]
        }
      }
    ]
  }
}
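To see how the tie_breaker shapes the result, the dis_max combination can be sketched in a few lines of Python. This is a simplified model rather than Elasticsearch code: the function name dis_max_score and the sample scores are made up for illustration, and the formula is the documented one (best subquery score, plus tie_breaker times the scores of the other matching subqueries).

```python
def dis_max_score(subquery_scores, tie_breaker=0.0):
    """Combine subquery scores the way dis_max is documented to:
    the best score, plus tie_breaker times every other matching score."""
    if not subquery_scores:
        return 0.0
    best = max(subquery_scores)
    others = sum(subquery_scores) - best
    return best + tie_breaker * others


# Hypothetical per-field scores (title score already boosted) for two documents.
coupons_doc = dis_max_score([4.0, 3.0], tie_breaker=0.2)  # weaker title, strong description
charger_doc = dis_max_score([5.0, 1.0], tie_breaker=0.2)  # stronger title, weak description

print(coupons_doc, charger_doc)
```

With a small tie_breaker, the document with the better title score stays ahead even though its description match is weaker.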
Another approach to this sort of thing, if you absolutely need to know when you have an exact match against the entire field, is to separately index an untokenized version of that field and query that field as well. Any match against the untokenized version of the field will be an exact match against the entire field contents. This spares you from relying on the length norm to make that determination.
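A minimal sketch of the untokenized-field idea, written as Python dicts that serialize to the request bodies. The field names name, name.raw, and description and the boost values are assumptions for illustration. The mapping uses the text and keyword types of Elasticsearch 5.0 and later; on older versions the equivalent is a string field with "index": "not_analyzed".

```python
import json

# Hypothetical mapping: "name" is analyzed as usual, and "name.raw" stores the
# whole value as a single untokenized term.
mapping = {
    "properties": {
        "name": {
            "type": "text",
            "fields": {"raw": {"type": "keyword"}},
        },
        "description": {"type": "text"},
    }
}


def exact_first_query(text):
    """Prefer documents whose entire name equals `text`, then normal matches."""
    return {
        "dis_max": {
            "tie_breaker": 0.2,
            "queries": [
                # Matches only when the whole field value is exactly `text`.
                {"term": {"name.raw": {"value": text, "boost": 20}}},
                {"match": {"name": {"query": text, "boost": 10}}},
                {"match": {"description": text}},
            ],
        }
    }


print(json.dumps({"query": exact_first_query("iphone charger")}, indent=2))
```

Note that an untokenized field is compared verbatim, so case and whitespace differences prevent a match unless the value is normalized at index time.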

Related

Multiple elasticsearch match queries

Say I have a document with 3 text fields: field_a , field_b and field_c.
Is it possible to do a single query so that we have results in this order:
'match' in field_a
'match' in field_b
'match' in field_c
'multi_match' can return matches from different fields mixed together in the result order; what I want is any and all results from field_a, then any and all results from field_b, and so on.

Even though I find this approach strange in general (I think your problem should be solved in a different way, e.g. with multiple stages of search), I think you could solve it for now in the following manner.
The multi_match query lets you boost individual fields, e.g.
"query": {
  "multi_match" : {
    "query" : "this is a test",
    "fields" : [ "field_a^1000", "field_b^10", "field_c" ]
  }
}
The ^ sign is the boost operator, which multiplies the score of a match in that field by the given value: 1000 in the case of field_a.
However, I would recommend avoiding this sort of behavior in general, since:
It's hard to control those boosting values
In some cases it may not behave as expected (imagine you get a score of 1000 in field_b)
If you have many hits, matching on field_c becomes largely pointless, since no user will scroll that far down the search results
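The "multiple stages of search" alternative mentioned above could look roughly like the following sketch: run one match query per field in priority order and concatenate the hits. Everything here is an assumption for illustration: search_fn stands in for whatever client call you use and is expected to return a list of hits, and fake_search only exists so the example runs without a cluster.

```python
def tiered_search(search_fn, text, fields, size=10):
    """Run one match query per field, in priority order, and concatenate hits.

    `search_fn` is a hypothetical callable that takes a query body (dict) and
    returns a list of hits shaped like {"_id": ..., "_score": ...}.
    """
    seen = set()
    results = []
    for field in fields:
        body = {"query": {"match": {field: text}}, "size": size}
        for hit in search_fn(body):
            if hit["_id"] not in seen:  # keep a document only in its best tier
                seen.add(hit["_id"])
                results.append(hit)
    return results[:size]


# Stand-in for a real client call, with canned results for illustration.
def fake_search(body):
    field = next(iter(body["query"]["match"]))
    canned = {
        "field_a": [{"_id": "1", "_score": 0.4}],
        "field_b": [{"_id": "2", "_score": 9.0}, {"_id": "1", "_score": 2.0}],
        "field_c": [{"_id": "3", "_score": 1.5}],
    }
    return canned[field]


ordered = tiered_search(fake_search, "this is a test", ["field_a", "field_b", "field_c"])
print([hit["_id"] for hit in ordered])
```

The ordering no longer depends on picking boost values large enough to dominate every possible score, at the cost of several round trips.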

Elasticsearch - query primary and secondary attribute with different terms

I'm using elasticsearch to query data that originally was exported out of several relational databases that had a lot of redundancies. I now want to perform queries where I have a primary attribute and one or more secondary attributes that should match. I tried using a bool query with a must term and a should term, but that doesn't seem to work for my case, which may look like this:
Example:
I have a document with the fullname and street name of a user and I want to search for similar users in different indices. So the best match for my query should be the best match on the fullname field and the best match on the street field. But since the original data has a lot of redundancies and inconsistencies, the field fullname (which I manually created out of the fields name1, name2, name3) may contain the same name multiple times, and it seems that elasticsearch ranks a double match in a must field higher than a match in a should attribute.
That means, I want to query for John Doe Back Street with the following sample data:
{
"fullname" : "John Doe John and Jane",
"street" : "Main Street"
}
{
"fullname" : "John Doe",
"street" : "Back Street"
}
Long story short, I want to query for a main attribute fullname - John Doe and secondary attribute street - Back Street and want the second document to be the best match and not the first because it contains John multiple times.

Manipulating relevance in Elasticsearch is not the easiest task. Score calculation is based on three main parts:
Term frequency
Inverse document frequency
Field-length norm
In short:
the more often the term occurs in the field, the MORE relevant the document is
the more often the term occurs in the entire index, the LESS relevant it is
the shorter the field is, the MORE relevant a match in it is
I recommend reading the materials below:
What Is Relevance?
Theory Behind Relevance Scoring
Controlling Relevance and subpages
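As a rough illustration of the three factors listed above, here is a sketch of the classic Lucene TF/IDF formulas as they are given in the "Theory Behind Relevance Scoring" chapter. It is a simplification under stated assumptions: it ignores boosts, query normalization, and the coordination factor, real norms are stored with reduced precision, and newer Elasticsearch versions default to BM25 instead. The term counts assume every word of the sample fullname values is indexed.

```python
import math


def tf(term_freq):
    """Term frequency: square root of how often the term occurs in the field."""
    return math.sqrt(term_freq)


def idf(num_docs, doc_freq):
    """Inverse document frequency: rarer terms across the index weigh more."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_norm(num_terms):
    """Field-length norm: the shorter the field, the higher the weight."""
    return 1.0 / math.sqrt(num_terms)


# "john" in a 5-term fullname ("John Doe John and Jane") vs. a 2-term one ("John Doe").
long_field = tf(2) * field_norm(5)
short_field = tf(1) * field_norm(2)
print(round(long_field, 3), round(short_field, 3))
```

Because idf depends on counts across the whole index, the same query can score the same document differently as documents are added.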
If, in general, a match on fullname is more important in your case than a match on street, you can boost the importance of the former. Below is an example based on my working code:
{
  "query": {
    "multi_match": {
      "query": "john doe",
      "fields": [
        "fullname^10",
        "street"
      ]
    }
  }
}
In this example a match on fullname is ten times (^10) more important than a match on street. You can try adjusting the boost or use other ways to control relevance, but as I mentioned at the beginning, it is not easy and everything depends on your particular situation. This is mostly because of the "inverse document frequency" part, which considers terms across the entire index: each document added to the index will probably change the score of the same search query.
I know I did not answer directly, but I hope this helped you understand how it works.

Elasticsearch more like this returns too many documents

I have documents like this:
{
title:'...',
body: '...'
}
I want to get documents that are more than 90% similar to a specific document. I have used this query:
query = {
  "query": {
    "more_like_this" : {
      "fields" : ["title", "body"],
      "like" : "body of another document",
      "min_term_freq" : 1,
      "max_query_terms" : 12
    }
  }
}
How to change this query to check for 90% similarity with specified doc?

Take a look at the Query Formation Parameter minimum_should_match

You should specify minimum_should_match:
minimum_should_match
After the disjunctive query has been formed, this parameter controls
the number of terms that must match. The syntax is the same as the
minimum should match. (Defaults to "30%").
The query itself is formed as follows:
The MLT query simply extracts the text from the input document,
analyzes it, usually using the same analyzer at the field, then
selects the top K terms with the highest tf-idf to form a disjunctive
query of these terms
If you also want matches in the title field to count for more, boost that field: when the title contains most of the selected high tf-idf terms, the document is more relevant and should rank higher. You could boost your title field by 1.5, for example.
Refer to the documentation on the more_like_this query for reference.
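Applied to the query from the question, the suggestion could be sketched as below. The helper name similar_docs_query is made up for illustration; minimum_should_match is the more_like_this parameter quoted above. Keep in mind that the percentage applies to the terms selected for the query (at most max_query_terms), so it approximates "90% similar" rather than measuring similarity of whole documents.

```python
import json


def similar_docs_query(like_text, min_match="90%"):
    """Build a more_like_this query requiring most selected terms to match."""
    return {
        "query": {
            "more_like_this": {
                "fields": ["title", "body"],
                "like": like_text,
                "min_term_freq": 1,
                "max_query_terms": 12,
                # Share of the selected terms (not of the whole document)
                # that a candidate must contain.
                "minimum_should_match": min_match,
            }
        }
    }


query = similar_docs_query("body of another document")
print(json.dumps(query, indent=2))
```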

Scoring documents by both textual match and distance to a point

I have an ElasticSearch index with a list of "shops".
I'd like to allow customers to search these shops by both geo_distance (so, search for a point and get shops near that location), and textual match, like matches on shop name / address.
I'd like to get results that match either of these two criteria, and I'd like the order of these results to be a combination of both. The stronger the textual match, and the closer to the point searched, the higher the result. (Obviously, there's going to be a formula to combine these two, that'll need tweaking, not too worried about that part yet).
My issue / what I've tried:
geo_distance is a filter, not a query, so I can't combine both on the query part of the request.
I can use a bool => should filter (rather than query) that matches on either name or location. This gives me the results I want, but not in order.
I can also have _geo_distance as part of a sort clause so that documents closer to the point rank higher.
What I haven't figured out is how I would take the "regular" _score that ElasticSearch gives to documents when doing textual matches, and combine that with the geo_distance score.
By having the textual match in the filter, it doesn't seem to affect the score of documents (which makes sense). And I don't see how I could combine the textual match in the query part and a geo_distance filter so it's an OR rather than an AND.
I guess my best bet would be the equivalent of this:
{
function_score: {
query: { ... },
functions: [
{ geo_distance function },
{ multi_match_result score },
],
score_mode: 'multiply'
}
}
but I'm not sure you can do geo_distance as a score function, and I don't know how to have multi_match_result score as a score function, or if it's even possible.
Any pointers will be greatly appreciated.
I'm working with ElasticSearch v1.4, but I can upgrade if necessary.

but I'm not sure you can do geo_distance as a score function, and I don't know how to have multi_match_result score as a score function, or if it's even possible.
You can't really do it in the way that you're asking, but you can do what you want just as easily. For the simpler case, you get scoring just by using a normal query.
The problem with filters is that they're yes/no questions, so if you use them in a function_score, then it either boosts the score or it doesn't. What you probably want is degradation of the score as the distance from the origin grows. It's the yes/no nature that stops them from impacting the score at all. There's no improvement to relevancy implied by matching a filter -- it just means that it's part of the answer, but it doesn't make sense to say that it should be closer to the top/bottom as a result.
This is where the Decay function score helps. It works with numbers, dates, and -- most helpfully here -- geo_points. In addition to the types of data it accepts, it can decay using gaussian, exponential, or linear decay functions. Which one you choose is honestly arbitrary; pick the one that gives the best "experience". I would suggest starting with gauss.
"function_score": {
  "functions": [
    {
      "gauss": {
        "my_geo_point_field": {
          "origin": "0, 1",
          "scale": "5km",
          "offset": "500m",
          "decay": 0.5
        }
      }
    }
  ]
}
Note that a geo_point given as a string, as origin is here, is in "lat, lon" order. Only the array form, [lon, lat], follows the x, y order of standard GeoJSON (longitude, latitude).
Each one of the values impacts how the score decays based on the graph (taken wholesale from the documentation). If you would use an offset of 0, then the score begins to drop once it's not exactly at the origin. With the offset, it allows it some buffer to be considered just as good.
The scale is directly associated with the decay in that the score will be chopped down by the decay value once it is scale-distance away from the origin (+/- the offset). In my example above, anything 5km beyond the 500m offset (5.5km from the origin) would get half the score of anything at the origin.
Again, just note that the different types of decay functions change the shape of scoring.
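To make the roles of scale, offset, and decay concrete, here is a small sketch of the gaussian decay as it is defined in the function_score documentation, using plain numbers in kilometres instead of geo distances. The function name gauss_decay is made up for illustration, and Elasticsearch computes the actual distance itself.

```python
import math


def gauss_decay(distance, scale, offset=0.0, decay=0.5):
    """Gaussian decay multiplier for a document `distance` away from the origin.

    All distances use the same unit (kilometres here).
    """
    # Distances inside the offset count as being at the origin.
    effective = max(0.0, abs(distance) - offset)
    # Chosen so that the multiplier equals `decay` at offset + scale.
    sigma_sq = -(scale ** 2) / (2.0 * math.log(decay))
    return math.exp(-(effective ** 2) / (2.0 * sigma_sq))


for km in (0.0, 0.5, 3.0, 5.5, 10.0):
    print(km, round(gauss_decay(km, scale=5.0, offset=0.5, decay=0.5), 3))
```

Everything within the offset keeps the full multiplier of 1, and the multiplier reaches the decay value once the distance is one scale beyond the offset.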
I'd like the order of these results to be a combination of both.
This is the purpose of the bool / should compound query. You get OR behavior with score improvement based on each match. Combining this with the above, you'd want something like:
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": { ... }
        },
        {
          "function_score": {
            "functions": [
              {
                "gauss": {
                  "my_geo_point_field": {
                    "origin": "0, 1",
                    "scale": "5km",
                    "offset": "500m",
                    "decay": 0.5
                  }
                }
              }
            ]
          }
        }
      ]
    }
  }
}
NOTE: If you add a must, then the should behavior changes from literal OR-like behavior (at least 1 must match) to completely optional behavior (none must match).
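If a text match should be required and the distance should only scale its score, which is closer to the multiply idea sketched in the question, one possible variant is to put the multi_match inside function_score and set boost_mode. This is a sketch, not a drop-in answer: the helper name and the name and address fields are assumptions, and unlike the bool / should query above it returns only documents that match the text.

```python
import json


def text_and_distance_query(text, lat, lon):
    """Text score multiplied by a distance decay (a text match is required)."""
    return {
        "query": {
            "function_score": {
                # Hypothetical field names for the shop documents.
                "query": {
                    "multi_match": {"query": text, "fields": ["name", "address"]}
                },
                "functions": [
                    {
                        "gauss": {
                            "my_geo_point_field": {
                                "origin": {"lat": lat, "lon": lon},
                                "scale": "5km",
                                "offset": "500m",
                                "decay": 0.5,
                            }
                        }
                    }
                ],
                # How the function result is combined with the query's _score.
                "boost_mode": "multiply",
            }
        }
    }


print(json.dumps(text_and_distance_query("coffee", 1.0, 0.0), indent=2))
```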
I'm working with ElasticSearch v1.4, but I can upgrade if necessary.
Starting with Elasticsearch 2.0, every filter is a query and every query is also a filter. The only difference is the context that it's used in. This doesn't change my answer here, but it's something that may help you in the future in addition to what I say next.
Geo-related performance increased dramatically in ES 2.2+. You should upgrade (and recreate your geo-related indices) to take advantage of those changes. ES 5.0 will have similar benefits!

Finding fields Elasticsearch has matched on

I am using Elasticsearch to search for a group a user should join. I have the user data nested into the search query. On return I get back the closest matched group that user should be in.
The field I am searching on is a nested field as follows:
{"interests": [
  {"topics":["python", "stackoverflow", "elasticsearch"]},
  {"topics":["arts", "textiles"]}
]}
However if you want an understanding of a match - how do you do this?
Elasticsearch does have an explain function which says what the scoring is made up of using tfidf, but not specifically what terms were used.
For example, if I search for 'Textile', the doc should match on 'textiles'. Thus I want the term 'textiles' to be returned in explain or some other way.
The only way I see that provides this need, is to store the search and the document retrieved and then process both to discover words ES has most likely matched on.
EDIT - for some more clarity of the question
An example in my index of a group which has "interests": ['arts', 'fine arts', 'art painting', 'arts and crafts', 'sports']
Now my search, I am looking for Arts and many other things. Now the term I am searching for comes up in this list many times, thus should always be a contributor.
What I want in the response is to say these words were matched ['arts', 'fine arts', 'art painting', 'arts and crafts'] along with the degree to which they match, i.e. 'arts' should be higher than the others, but all the others are also relevant

Elasticsearch allows you to specify the _name field for all queries and
filters. This means that you can separate your query into different parts with
separate names, which will allow you to determine which parts matched.
For example:
{
  "query" : {
    "bool" : {
      "should" : [
        {"match" : { "interests.topics" : {"query" : "python", "_name" : "py-topic"} }},
        {"match" : { "interests.topics" : {"query" : "arts", "_name" : "arts-topic"} }}
      ]
    }
  }
}
Then, in your response, each hit comes back with an array (matched_queries) of which queries (or
filters) matched, so you can determine whether the py-topic query and/or the
arts-topic query matched.
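Reading those names back out of a search response could look like the following sketch. The helper name matched_names and the sample response are made up for illustration; matched_queries is the per-hit array Elasticsearch returns for named queries.

```python
def matched_names(response):
    """Map each hit's _id to the names of the queries that matched it."""
    return {
        hit["_id"]: hit.get("matched_queries", [])
        for hit in response["hits"]["hits"]
    }


# Trimmed, hand-written response for illustration only.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "group-1", "_score": 1.2, "matched_queries": ["arts-topic"]},
            {"_id": "group-2", "_score": 0.9, "matched_queries": ["py-topic", "arts-topic"]},
        ]
    }
}

print(matched_names(sample_response))
```

This tells you which named clauses matched each hit; to see which term each clause corresponds to, give every search term its own named clause as in the query above.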
