ElasticSearch boosting relevance based on the count of the field value - elasticsearch

I'm trying to boost relevance based on how often a field value occurs: the rarer the field value, the more relevant the document should be.
For example, I have 1001 documents. 1000 documents are written by John, and only one is written by Joe.
// 1000 documents by John
{"title": "abc 1", "author": "John"}
{"title": "abc 2", "author": "John"}
// ...
{"title": "abc 1000", "author": "John"}
// 1 document by Joe
{"title": "abc 1", "author": "Joe"}
I get 1001 documents when I search for "abc" against the title field. These documents should have very similar relevance scores, if not exactly the same. The value "John" occurs 1000 times in the author field and the value "Joe" occurs once. I'd like to boost the relevance of the document {"title": "abc 1", "author": "Joe"}; otherwise it would be really hard to spot the document written by Joe.
Thank you!

In case someone runs into the same use case, here is my workaround using a Function Score Query. It takes at least two calls to the Elasticsearch server.
Get the count for each author (you can use an aggregation for this). In our example, that is 1000 for John and 1 for Joe.
Generate a weight from each count: the higher the count, the lower the weight. Something like 1 + sqrt(1/1000) for John and 1 + sqrt(1/1) for Joe.
Use the weights in a script that scales the score according to the author value (the script could be made much better):
{
  "query": {
    "function_score": {
      "query": {
        "match": { "title": "abc" }
      },
      "script_score": {
        "script": {
          "inline": "if (doc['author'].value == 'John') { return (1 + Math.sqrt(1.0/1000)) * _score; } return (1 + Math.sqrt(1.0/1)) * _score;"
        }
      }
    }
  }
}
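The steps above can also be sketched without a script. This is a minimal sketch, not the original workaround: it assumes the per-author counts have already been fetched (for example with a terms aggregation on author) and it swaps the inline script for function_score's filter + weight functions. The helper names author_weights and build_query are made up for illustration, and author is assumed to be an exact-value (not analyzed) field.

```python
import math


def author_weights(counts):
    """Map each author to 1 + sqrt(1 / count): rarer authors get larger weights."""
    return {author: 1 + math.sqrt(1 / n) for author, n in counts.items()}


def build_query(text, weights):
    """Build a function_score body that multiplies _score by the author's weight."""
    functions = [
        {"filter": {"term": {"author": author}}, "weight": w}
        for author, w in weights.items()
    ]
    return {
        "query": {
            "function_score": {
                "query": {"match": {"title": text}},
                "functions": functions,
                "boost_mode": "multiply",
            }
        }
    }


# Counts as they would come back from the first call (hypothetical values
# taken from the example in the question).
counts = {"John": 1000, "Joe": 1}
weights = author_weights(counts)
body = build_query("abc", weights)
```

With many distinct authors the functions list grows with the number of authors, so in that case passing the weights to a script as parameters may be the better fit.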

Related

Search string keyword by elasticsearch

I'm having trouble with an Elasticsearch search for the query "energy saving tv".
I have 3 objects with a "title" field:
T1: Phone with LG application is an energy saving tv
T2: That tv made by energy saving LG applications
T3: Phone with LG application ensures optimal energy saving
Then I used a "match" query with the "and" operator for the query "energy saving tv":
GET my_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "energy saving tv",
        "operator": "and"
      }
    }
  }
}
Result:
Score T1: 5.0
Score T2: 5.37
So T2's score is higher than T1's score, but I want titles of the form "energy*saving*tv" (the words in the same order as in the query) to get a higher score. Please help me. Thank you very much!
You can use a Match phrase query to match a phrase comprised of several words.
{
  "query": {
    "match_phrase": {
      "title": "energy saving tv"
    }
  }
}
Note that this will only match T1 since the exact order is preserved.
If you also want to include other results with a more mixed up or spread apart word order you can add the slop parameter.
This will also match T2, but with a lower score:
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "energy saving tv",
        "slop": 10
      }
    }
  }
}
The slop basically defines the upper limit to how often you can move a query term to the right or left in order to match the document. It defaults to 0.
E.g. going from the query "energy saving tv" to the document "energy tv saving" would require a slop of 2, since tv moves one term to the left and saving moves one term to the right.
See this answer for a great visual explanation.
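If the goal is to keep every document that contains all three words but rank in-order matches first, one common pattern (not part of the answer above) is to combine the two queries in a bool query: the match clause decides what matches, and the match_phrase clause only adds to the score. A minimal sketch in Python that builds such a request body; the function name ordered_boost_query and the default slop are arbitrary choices for illustration.

```python
def ordered_boost_query(field, text, slop=0):
    """All words must match; a (sloppy) phrase match only adds to the score."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {field: {"query": text, "operator": "and"}}}
                ],
                "should": [
                    {"match_phrase": {field: {"query": text, "slop": slop}}}
                ],
            }
        }
    }


body = ordered_boost_query("title", "energy saving tv")
```

With this shape T1 and T2 both still match through the must clause, and only T1 would pick up the extra score from the phrase clause at slop 0.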

Repeated values in Elasticsearch array and query scoring

I have two documents with a field country which can contain repeated values, e.g.
Doc1:
country: [US, US, GB, US]
Doc2:
country: [US, GB]
I need a query that when looking for country:US will assign a higher score to Doc1 than Doc2 since US appears multiple times in the country field of Doc1, while it will assign the same score to the two documents when looking for country:GB as it appears the same number of times in both documents. Is this something achievable with Elasticsearch?
If you run a simple match query for US:
GET countryindex/_search
{
  "query": {
    "match": {
      "country": "US"
    }
  }
}
The more often the term occurs in the field, the higher the score, so [US, US, GB, US] will score higher than [US, GB].
If you search for GB, [US, GB] will score higher than [US, US, GB, US], because a shorter field gets a higher score.
If you want documents with the same number of matches to get the same score, you need to set norms: false in your mapping.
{
  "properties": {
    "country": {
      "type": "text",
      "norms": false
    }
  }
}
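To see why norms matter here, the following is a rough sketch of the term-frequency part of BM25, assuming BM25 is the similarity in use and using the textbook defaults k1 = 1.2 and b = 0.75. It is not Lucene's exact arithmetic: IDF is left out because it is the same for both documents, Lucene stores field lengths in a lossy encoding, and treating norms: false as "no length normalization" is a simplification. Field lengths are taken to be the number of array values.

```python
def bm25_tf(tf, field_len, avg_len, k1=1.2, b=0.75, norms=True):
    """Term-frequency component of BM25; norms=False drops length normalization."""
    length_factor = (1 - b + b * field_len / avg_len) if norms else 1.0
    return tf * (k1 + 1) / (tf + k1 * length_factor)


doc1 = ["US", "US", "GB", "US"]
doc2 = ["US", "GB"]
avg_len = (len(doc1) + len(doc2)) / 2


def score(doc, term, norms):
    """Score one document for one term, counting occurrences in the array."""
    return bm25_tf(doc.count(term), len(doc), avg_len, norms=norms)


# US: Doc1 comes out ahead because of its higher term frequency.
us_with_norms = (score(doc1, "US", True), score(doc2, "US", True))
# GB with norms: the shorter Doc2 comes out ahead.
gb_with_norms = (score(doc1, "GB", True), score(doc2, "GB", True))
# GB without norms: both documents get the same value.
gb_without_norms = (score(doc1, "GB", False), score(doc2, "GB", False))
```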

Elasticsearch, sorting by exact string match

I want to sort results, such that if one specific field (let's say 'first_name') is equal to an exact value (let's say 'Bob'), then those documents are returned first.
As a result, all documents where first_name is exactly 'Bob' would be returned first, followed by all the other documents. Note that I don't intend to exclude documents where first_name is not 'Bob', merely to sort them so that they're returned after all the Bobs.
I understand how numeric or alphabetical sorting works in Elasticsearch, but I can't find any part of the documentation covering this type of sorting.
Is this possible, and if so, how?
One solution is to manipulate the score of the results that contain Bob in the first name field.
For example:
POST /test/users
{
  "name": "Bob"
}
POST /test/users
{
  "name": "Alice"
}
GET /test/users/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "name": {
              "query": "Bob",
              "boost": 2
            }
          }
        },
        {
          "match_all": {}
        }
      ]
    }
  }
}
Would return both Bob and Alice in that order (with approximate scores of 1 and 0.2 respectively).
From the book:
Query-time boosting is the main tool that you can use to tune
relevance. Any type of query accepts a boost parameter. Setting a
boost of 2 doesn’t simply double the final _score; the actual boost
value that is applied goes through normalization and some internal
optimization. However, it does imply that a clause with a boost of 2
is twice as important as a clause with a boost of 1.
This means that if you also wanted "Fred" to come ahead of Bob, you could add a clause for Fred with a boost of 3 to the example above.
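The same idea extends to several preferred values. A minimal sketch, assuming a hypothetical helper (preferred_names_query is not an Elasticsearch API) that turns a name-to-boost mapping into the bool query shown above:

```python
def preferred_names_query(field, boosts):
    """One boosted match clause per name, plus match_all so nothing is excluded."""
    should = [
        {"match": {field: {"query": name, "boost": boost}}}
        for name, boost in boosts.items()
    ]
    should.append({"match_all": {}})
    return {"query": {"bool": {"should": should}}}


# Fred gets the larger boost so that he sorts ahead of Bob.
body = preferred_names_query("name", {"Fred": 3, "Bob": 2})
```

As the quoted passage notes, the boost values express relative importance rather than exact score multipliers, so the resulting order is worth checking against real data.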

Boosting only results with a near-identical score in Elasticsearch

I'm using the following query to search through a database of names, allowing fuzzy matching but giving preference to exact matches.
"query": {
  "bool": {
    "should": [
      {
        "match": {
          "name": {
            "query": "x",
            "operator": "and",
            "boost": 10
          }
        }
      },
      {
        "match": {
          "name": {
            "query": "x",
            "fuzziness": "AUTO",
            "operator": "and"
          }
        }
      },
      {
        "match": {
          "altname": {
            "query": "x",
            "fuzziness": "AUTO",
            "operator": "and"
          }
        }
      }
    ]
  }
}
The database contains entries with identical names. If that happens, I would like to boost those entries by a second field, let's call it weight. However, I only want the boost to be applied between the subset of results with a (near) identical score, not to all of the results.
This is further complicated by the fact that results with an identical name may receive a slightly different score, as they are influenced by the relevancy on the altname field.
For example, querying for dog could give 3 results:
Dog [id 1, score 2.3, weight 10]
Dog [id 2, score 2.2, weight 20]
Doge [id 3, score 1, weight 100]
I'm looking for a query that would boost the result with id 2 to the top score. The result with id 3 should always stay at the bottom due to its poor relevancy, regardless of its weight. Ideally with tunable parameters to tweak the factor of the score vs. the factor of the weight.
Any way to do this in a single pass in Elasticsearch, of course without ruining performance?
Looks like I figured it out.
First, I realised that the example in my original question was more complex than necessary. I narrowed it down to: "How to compose a query for 'blub' that returns the following documents in the order 2, 3, 1"
id: 1
name: blub
weight: 0.01
---
id: 2
name: blub
weight: 0.1
---
id: 3
name: blub stuff
weight: 1
Thus: for the two documents with an identical (or very similar) score, the weight should be used as a tie-breaker. But documents with a significantly lower score should never be allowed to trump other results, regardless of their weight.
I loaded the data in the excellent Play tool: https://www.found.no/play/gist/edd93c69c015d4c62366#search and started experimenting.
It turned out that the log2p modifier did exactly what I wanted. I repeated the test on a real-world dataset and everything looks as expected.
function_score:
  query:
    match:
      name: blub
  field_value_factor:
    field: weight
    modifier: log2p
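To get a feel for what log2p does to the ranking, here is a small arithmetic sketch applied to the "dog" example from the question. It assumes the listed scores are the plain query scores, that the default boost_mode (multiply) is in effect, and that log2p means log10(value + 2); the function name log2p_score is made up for illustration.

```python
import math


def log2p_score(query_score, weight):
    """_score * log10(weight + 2): field_value_factor with modifier log2p."""
    return query_score * math.log10(weight + 2)


# (id, query score, weight) from the "dog" example in the question
hits = [(1, 2.3, 10), (2, 2.2, 20), (3, 1.0, 100)]
ranked = sorted(hits, key=lambda h: log2p_score(h[1], h[2]), reverse=True)
order = [doc_id for doc_id, _, _ in ranked]
```

Because the logarithm compresses large weights, the weight can reorder hits with similar scores while a much weaker match stays behind. How large a score gap the weight can bridge depends on the actual range of weights, so this is worth re-checking on real data.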

How to filter results based on frequency of repeating terms in an array in elasticsearch

I have an array field with a lot of keywords, and I need to filter the documents based on how many times a particular keyword is repeated in that array.
For example, if my field name is "nationality" and document 1 contains the following:
doc1
nationality :
["US","UK","Australia","India","US","US"]
and for doc2
nationality:
["US","UK","US","US","US","China"]
I want only those documents to be shown where the term "US" occurs more than 3 times. That would leave only doc2. How can I do this?
You can implement this with a script filter:
{
  "query": {
    "filtered": {
      "filter": {
        "script": {
          "script": "_index['nationality']['US'].tf() > 3"
        }
      }
    }
  }
}
In this script, the "nationality" field is checked for the term "US", and tf() (term frequency) returns how many times the term occurs. Only the documents with a term frequency greater than three are shown in the results. You can learn more about the filter operations here
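The intended result can be checked client-side on the two sample documents. This is only an illustration of the condition the filter expresses, not how Elasticsearch evaluates it, and the helper name ids_with_more_than is made up:

```python
docs = {
    "doc1": ["US", "UK", "Australia", "India", "US", "US"],
    "doc2": ["US", "UK", "US", "US", "US", "China"],
}


def ids_with_more_than(docs, term, threshold):
    """Return ids of documents where `term` occurs more than `threshold` times."""
    return [
        doc_id
        for doc_id, values in docs.items()
        if values.count(term) > threshold
    ]


# "US" occurs 3 times in doc1 and 4 times in doc2.
matching = ids_with_more_than(docs, "US", 3)
```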
