Elasticsearch Aggregation Range Buckets based on characters rather than numbers

I can get an aggregation of individual text strings with this:
"alpha_titles": {
"terms": {
"field": "book_title.keyword",
"size" : 100,
"order": {
"_term": "asc"
}
}
}
But the requirement I have is to bucket these titles into ranges: A - E, F - J, etc.
The "range" aggregation seems to only work with numeric values.
Is there a way to get buckets of alphabetic strings based on a starting value and an ending value - something like a* to e*?
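One approach that comes to mind (a sketch, not an answer from the original thread): range queries do compare keyword values lexicographically, so a filters aggregation with one range query per letter bucket should work. The bucket names and boundaries here are illustrative:

"alpha_ranges": {
  "filters": {
    "filters": {
      "A-E": { "range": { "book_title.keyword": { "gte": "A", "lt": "F" } } },
      "F-J": { "range": { "book_title.keyword": { "gte": "F", "lt": "K" } } }
    }
  }
}

Note that this is case-sensitive; titles starting with lowercase letters would need either a lowercase normalizer on the keyword field or additional bounds per bucket.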

Related

A way to score average just like disjunction_max

I have a query that uses dis_max to take the max score across documents scored by several nested queries:
{
  "query": {
    "dis_max": {
      "queries": [
        { "nested": {} },  // nested_1
        { "nested": {} }   // nested_2
      ]
    }
  }
}
Is there a query like "dis_avg" that returns not the max score but the average score?
For example, as far as I know:
nested_1 gets a score of 0.7 from object_a and 0.2 from object_b,
nested_2 gets a score of 0.5 from object_a and 0.8 from object_b,
so dis_max scores object_b at 0.8 and ranks it 1st, and scores object_a at 0.7 and ranks it 2nd.
Is there any query so that I instead get:
object_b scored 0.5 and ranked 2nd, and object_a scored 0.6 and ranked 1st?
I ended up using a bool query to combine the queries, with a boost on each one so that the summed score works out to the average; see the sketch below.
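A minimal sketch of that workaround (the path and the inner match_all queries are placeholders, not from the original post): with two should clauses each boosted by 0.5, the bool query's summed score equals the average of the two nested scores, provided both clauses match.

{
  "query": {
    "bool": {
      "should": [
        { "nested": { "path": "objects", "query": { "match_all": {} }, "boost": 0.5 } },
        { "nested": { "path": "objects", "query": { "match_all": {} }, "boost": 0.5 } }
      ]
    }
  }
}

Here 0.5 * nested_1 + 0.5 * nested_2 is exactly (nested_1 + nested_2) / 2.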

Pact matchingRules, min and max

It does not seem possible to mix min and max in the matching rules.
If I use
"matchingRules":
{
"$.body":
{
"min": 1,
"max": 2
},
...
only the minimum number of elements is validated; "max" has no effect.
I've also tried
"matchingRules":
{
"$.body":
{
"min": 1
},
"$.body":
{
"max": 2
},
...
but then only the second rule matches, so the minimum number of elements will not be validated. Is there another possibility to guarantee the minimum and maximum number of elements in an array?
This is currently not possible. You can raise an issue in https://github.com/pact-foundation/pact-specification/ for this feature to be added in a future specification.

In Elasticsearch, how do you boost a score by the value of a ranking (1st, 2nd, 3rd...) field?

I have a field in the document that is a ranking: the lower the value, the higher the score should be. Being first is best and being last is worst, so a rank of 1 is better than a rank of 1 million.
So far I've been able to boost on the field value itself with a function_score query and field_value_factor, but it gives the docs with higher rank numbers the higher boosts:
"field_value_factor": {
"field": "rank",
"factor": 1.2,
"modifier": "sqrt",
"missing": 1
}
Should I be using a different modifier? Or am I going about this completely wrong? Also, I don't want to sort on it alone since I have other factors influencing the _score.
You actually need a modifier function that reciprocates the rank value, i.e. 1 / rank, so that the higher the rank value is, the lower the score will be, and rank = 1 gets the highest score. The only modifier that does this is reciprocal:
"field_value_factor": {
"field": "rank",
"factor": 1.2,
"modifier": "reciprocal",
"missing": 1
}
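To spell out the arithmetic (my own worked numbers, using the documented field_value_factor formula modifier(factor * value)):

rank = 1    ->  1 / (1.2 * 1)    ≈ 0.83
rank = 100  ->  1 / (1.2 * 100)  ≈ 0.0083

so lower rank numbers now get the higher boost, as desired. With factor 1 instead of 1.2, rank = 1 would score exactly 1.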

How do scoring profiles generate scores in Azure Search?

I want to add a scoring profile to my index in Azure Search. More specifically, every document in my index has a weight field of type Edm.Double, and I want to boost documents according to this value. I don't want to just sort on weight directly, because the relevance of the search term is also important.
So just to test it out, I created a scoring profile with a magnitude function with a boost value of 1000 (just to see how this thing works), linear interpolation, starting value 0 and ending value 1. What I was expecting was for the boost value to be added to the overall search score, so a document with weight 0.5 would get a boost of 500, whereas a document with weight 0.125 would get a boost of 125. However, the resulting scores were nowhere near that intuitive.
I have a couple of questions in this case:
1) How is the function score generated in this case? I have documents with weights close to each other (say 0.5465 and 0.5419), but the difference between their final scores is around 100-150, whereas I would expect it to be around 4-5.
2) How are function scores and weights aggregated into a final score for each search result?
The answer provided by Nate is difficult to understand and it misses some components, so I have made an overview of the entire scoring process. It's quite complex.
When a user executes a search, the query is handed to Azure Search. Azure Search uses the TF-IDF algorithm to compute a score from 0 to 1, based on the tokens produced by the analyzer. Keep in mind that language-specific analyzers can produce multiple tokens for one word. A score is computed for every searchable field and then multiplied by that field's weight in the scoring profile. Finally, all weighted scores are summed up; that is the initial weighted score.
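In formula form (my own notation, summarizing the paragraph above):

weighted_score = sum over searchable fields f of [ tfidf(f, query) * profile_weight(f) ]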
A scoring profile may also contain scoring functions. A scoring function can be magnitude, freshness, geo or tag based, and multiple functions can be defined within one scoring profile.
The functions are evaluated, and their scores are combined by sum, average, minimum, maximum, or first matching. The combined function score is then multiplied by the total weighted score, and that is the final score.
As an example, here is an index definition with scoring profiles:
{
  "name": "musicstoreindex",
  "fields": [
    { "name": "key", "type": "Edm.String", "key": true },
    { "name": "albumTitle", "type": "Edm.String" },
    { "name": "genre", "type": "Edm.String" },
    { "name": "genreDescription", "type": "Edm.String", "filterable": false },
    { "name": "artistName", "type": "Edm.String" },
    { "name": "rating", "type": "Edm.Int32" },
    { "name": "price", "type": "Edm.Double", "filterable": false },
    { "name": "lastUpdated", "type": "Edm.DateTimeOffset" }
  ],
  "scoringProfiles": [
    {
      "name": "boostGenre",
      "text": {
        "weights": {
          "albumTitle": 1.5,
          "genre": 5,
          "artistName": 2
        }
      }
    },
    {
      "name": "newAndHighlyRated",
      "functions": [
        {
          "type": "freshness",
          "fieldName": "lastUpdated",
          "boost": 10,
          "interpolation": "linear",
          "freshness": {
            "boostingDuration": "P365D"
          }
        },
        {
          "type": "magnitude",
          "fieldName": "rating",
          "boost": 8,
          "interpolation": "linear",
          "magnitude": {
            "boostingRangeStart": 1,
            "boostingRangeEnd": 5,
            "constantBoostBeyondRange": false
          }
        }
      ],
      "functionAggregation": 0
    }
  ]
}
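For reference, a scoring profile is selected at query time via the scoringProfile parameter (a sketch; the service name and api-version value are placeholders, not from the original answer):

GET https://myservice.search.windows.net/indexes/musicstoreindex/docs?search=meteora&scoringProfile=newAndHighlyRated&api-version=2020-06-30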
Let's say the entered query is meteora, the famous album by Linkin Park, and that we have the following document in our index:
{
  "key": 123,
  "albumTitle": "Meteora",
  "genre": "Rock",
  "genreDescription": "Rock with a flick of hiphop",
  "artistName": "Linkin Park",
  "rating": 4,
  "price": 30,
  "lastUpdated": "2020-01-01"
}
I'm not an expert on TF-IDF, but I can imagine that the following unweighted scores would be produced:
{
  "albumTitle": 1,
  "genre": 0,
  "genreDescription": 0,
  "artistName": 0
}
The scoring profile has a weight of 1.5 on the albumTitle field, so the total weighted score will be: 1 * 1.5 + 0 + 0 + 0 = 1.5
After that, the scoring profile functions are evaluated. In this case there are two. The first one evaluates freshness over a range of 365 days, one year. The lastUpdated field has a value of the 1st of January; let's say that's 50 days in the past. The total range is 365, so you get a score of 1 if the last updated date is today and 0 if it is 365 days or more in the past. In our case it's 1 - 50 / 365 = 0.863. The boost of the function is 10, so the score for the first function is 8.63.
The second function is a magnitude function with a range from 1 to 5. The document got a 4-star rating; since 1 star maps to 0 and 5 stars map to 1, linear interpolation gives (4 - 1) / (5 - 1) = 0.75. The boost of the magnitude function is 8, so we multiply: 0.75 * 8 = 6.
The functionAggregation is 0, which means we sum the results of all functions, giving a total function score of 8.63 + 6 = 14.63. The rule is then to multiply the total function score by the total weighted field score, giving a grand total of 14.63 * 1.5 = 21.945.
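Condensed into a single formula (my own summary of the walkthrough above, using its multiplication rule):

final_score = (sum of tfidf * field_weight) * (aggregated function scores)
            = 1.5 * (8.63 + 6)
            = 21.945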
Hope you enjoyed this example.
Thanks for providing the details. What were the base relevance scores of the two documents?
The boosting factor provided in the scoring profile is actually multiplied with the base relevance scores computed from term frequencies. For example, suppose the base scores, given as @search.score in the response payload, of the two documents were 0.5 and 0.2, and the values in the weight field were 0.5465 and 0.5419 respectively. With the scoring profile configuration given above (starting value 0, ending value 1, linear interpolation, boost factor 1000), the final score for each document is computed as follows:
document 1:
base_search_score (0.5) * boost_factor (1000) * (weight (0.5465) - min (0)) / (max (1) - min (0)) = final_search_score (273.25)
document 2:
base_search_score (0.2) * boost_factor (1000) * (weight (0.5419) - min (0)) / (max (1) - min (0)) = final_search_score (108.38)
Please let me know if the final scores you get do not agree with the function above. Thanks!
Nate

Average the score of subqueries, excluding the weight of not applicable ones

I am trying to build a custom score for my index. The score is based on several criteria (don't mind whether they are actually relevant; it's just an example):
Is the "size" of the item less than 3 inches? (weighting factor: 2)
Is the "distance from home" of the item less than 3 miles? (weighting factor: 3)
Is the "rating" of the item 3 stars or more? (weighting factor: 1)
My score works like so: each of these 3 criteria that matches contributes a score of 1 times its weight. Then average them (that is, divide by the sum of all weighting factors), with one extra trick: if a criterion cannot be evaluated (i.e. "size" or "distance from home" is null for an item), I have to exclude its weighting factor from the divisor.
Example: if item_1 matches all three criteria, it will have a score of:
Criterion 1: 1 (score) * 2 (weight)
Criterion 2: 1 (score) * 3 (weight)
Criterion 3: 1 (score) * 1 (weight)
Sum of weights for available criteria: 6
Total: 6/6 = 1 (fairly simple)
If item_2 matches criteria 1 and 2 but has no rating, we exclude the weight of criterion 3, and the score goes like so:
Criterion 1: 1 (score) * 2 (weight)
Criterion 2: 1 (score) * 3 (weight)
Criterion 3: not available => 0
Sum of weights for available criteria: 5, as we exclude criterion 3's weight of 1
Total: 5/5 = 1
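In formula form (my own restatement of the examples above):

score = sum over evaluable criteria (match_i * weight_i) / sum over evaluable criteria (weight_i)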
The question is: can I do this efficiently?
What I have so far:
Writing a query for each criterion is easy: { "range": { "size": { "lt": 3 } } }
Combining them as a sum of weighted factors is easy too: I built a bool query of function_score queries.
This goes like so: each function is filter-based and gives a score of one; we use the boost_mode "replace" to replace the query score, and "should" clauses so the individual matches add up.
Let's call this query the sumQuery:
"{ "bool" : {
"should" : [
{ "function_score" : {
"functions" : [
{
"filter": { "term" : { "size" < 3 }},
}
],
"score_mode": "sum",
"boost_mode": "replace",
"boost": 2
}
// Other criteria
]
}
}
Now, for calculating what to divide this sum by, the only thing I can think of is building a script function, say:
// missing fields are detected with .empty (comparing .value to null does not work for absent fields)
double sumOfWeights = 6;
if (doc['size'].empty) sumOfWeights -= 2;
if (doc['ratings'].empty) sumOfWeights -= 1;
// ...
return _score / sumOfWeights;
And compose the main query with this function (the inner query being the sumQuery defined above):
"query": {
  "function_score": {
    "query": **sumQuery**,
    "functions": [
      {
        "script_score": {
          "script": **script above**
        }
      }
    ]
  }
}
This seems overly complex to me, and it tends to become slow (especially the script part) on my index, given the number of criteria. Do you have any better ideas?
A requested evolution should ease this kind of scenario.
It involves:
1) The ability to add a query inside the function_score's functions, thus allowing query scores to be combined arbitrarily
2) The ability to refer to each sub-function's score by name
3) The ability to set a "no match" value (say, a value of 0) that is taken into account when a filter does not match, making avg combinations better in cases where the weight would otherwise not be taken into account
Option 2 pretty much gives everything that was needed for the question, because once you have access to each sub-function's individual score, you can decide which ones to take into account.
Option 1 allows tackling things differently, say with one sub-query that computes the sum of the criteria and another that computes the weight to divide it by.
