Combining filter and function_score: Elasticsearch won't calculate the score?

I have a posts index and I want to:
a) Filter the posts by date, i.e. return only posts created before a given date.
b) Apply function_score to the results.
I came up with a query that looks like this:
get posts/_search
{
"query": {
"function_score": {
"query": {
"bool": {
"filter": [
{
"range": {
"created_at": {
"lte": "2020-09-12T17:23:00Z"
}
}
}
]
}
},
"functions": [
{
"field_value_factor": {
"field": "likes",
"factor": 0.1,
"modifier": "ln1p",
"missing": 1
}
}
],
"score_mode": "sum"
}
},
"from": 0,
"size": 10
}
However, Elasticsearch does NOT apply a score to the results: all documents have a score of 0.0. I can "trick" ES by moving the filter into the functions, and then it does work, but I feel that's not optimal. Any idea why the query above won't return scores?
Example of a "cheat" query which DOES calculate scores for every document:
get posts/_search
{
"query": {
"function_score": {
"functions": [
{
"filter": {
"range": {
"created_at": {
"lte": "2020-09-12T17:22:58Z"
}
}
},
"weight": 100
},
{
"gauss": {
"created_at": {
"origin": "2020-09-12T17:22:58Z",
"scale": "1h",
"decay": 0.95
}
}
},
{
"field_value_factor": {
"field": "likes",
"factor": 0.1,
"modifier": "ln1p",
"missing": 1
}
}
]
}
},
"from": 0,
"size": 10
}

A filter does not influence the score; it only decides whether a document matches or not.
For function_score to have a score to work with, the inner query has to calculate one itself (match, match_phrase, a geo query, etc.).
The documentation on query and filter context has more details.
field_value_factor modifies an existing score, but in your first query nothing has been scored at that stage: a bool query with only filter clauses gives every document a score of 0, and with the default boost_mode (multiply) the function result is multiplied by that 0. What you are asking for is also nearly the same as sorting on the number of likes.
In your second query the functions calculate a score that depends on how recent your docs are, and then adjust it by likes. That's why it works.
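The arithmetic behind this can be sketched in a few lines of Python. This is only an illustration, not Elasticsearch code: `field_value_factor_ln1p` and `combine` are hypothetical helper names, and the sketch assumes the documented behaviour that `ln1p` takes the natural log of 1 plus `factor * value` and that `boost_mode` defaults to `multiply`.

```python
import math


def field_value_factor_ln1p(value, factor=0.1):
    """Mimic field_value_factor with modifier ln1p: ln(1 + factor * value)."""
    return math.log1p(factor * value)


def combine(query_score, function_score, boost_mode="multiply"):
    """Combine the query score and the function score as boost_mode describes."""
    if boost_mode == "multiply":
        return query_score * function_score
    if boost_mode == "sum":
        return query_score + function_score
    if boost_mode == "replace":
        return function_score
    raise ValueError(f"unsupported boost_mode: {boost_mode}")


likes_score = field_value_factor_ln1p(100)        # ln(11) for a post with 100 likes
filter_only = combine(0.0, likes_score)           # filter-only query scores 0
scored_query = combine(1.0, likes_score)          # a query that scores 1.0
replaced = combine(0.0, likes_score, "replace")   # query score ignored

print(filter_only, scored_query, replaced)
```

Under these assumptions, keeping the filter where it is and adding `"boost_mode": "replace"` (or `"sum"`) to the function_score body is one way to get non-zero scores without moving the filter into the functions.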

Related

ES: Sort on the result of a Query function

I'm quite new to ES and have been trying many different ways to sort a subset of results from a query/filter. The aggs always sort the whole collection instead of the result of the query above. My final goal is to sort on the price field within the result of the query (which is already sorted by _score and contains only 5 docs).
{
"query": {
"bool": {
"must": {
"function_score": {
"functions": [....],
"query": {....},
"score_mode": "sum",
"max_boost": 1.5
}
},
"filter": [...]
}
},
"size": 5,
"from": 0,
"sort": {
"_score": "desc"
},
"_source": [
"title",
"price"
],
"aggs": {
"i_am_confused": {
"terms": {
"field": "price",
"order": {
"_term": "desc"
}
}
}
}
}
I don't want to sort on the client (because the subset result would be at least 700 docs).
I've tried a couple of aggs, but none of them work the way I want; I probably didn't use them correctly.
I appreciate your help.

ElasticSearch: Is it possible to do a "Weighted Avg Aggregation" weighted by the score?

I'm trying to compute an average over a price field (price.avg), but I want the best matches of the query to have more impact on the average than the later ones, so the avg should be weighted by the calculated score. This is the aggregation that I'm implementing:
{
"query": {...},
"size": 100,
"aggs": {
"weighted_avg_price": {
"weighted_avg": {
"value": {
"field": "price.avg"
},
"weight": {
"script": "_score"
}
}
}
}
}
It should give me what I want. But instead I receive a null value:
{...
"hits": {...},
"aggregations": {
"weighted_avg_price": {
"value": null
}
}
}
Is there something that I'm missing? Is this aggregation query feasible? Is there any workaround?
When you debug what's available from within the script
GET prices/_search
{
"size": 0,
"aggs": {
"weighted_avg_price": {
"weighted_avg": {
"value": {
"field": "price"
},
"weight": {
"script": "Debug.explain(new ArrayList(params.keySet()))"
}
}
}
}
}
the following gets spit out
[doc, _source, _doc, _fields]
None of these contain information about the query _score that you're trying to access because aggregations operate in a context separate from the query-level scoring. This means the weight value needs to either
exist in the doc or
exist in the doc + be modifiable or
be a query-time constant (like 42 or 0.1)
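To make the constraint concrete, here is a small Python sketch of what a weighted average computes: the sum of value times weight, divided by the sum of weights. It is an illustration only; `weighted_avg` is a hypothetical helper, and the `rating` field and sample documents are made up to stand in for "a weight that exists in the doc".

```python
def weighted_avg(values, weights):
    """Return sum(value * weight) / sum(weight), or None if the weights sum to 0."""
    total_weight = sum(weights)
    if total_weight == 0:
        return None  # avoid dividing by zero in this sketch
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Made-up documents; "rating" is a hypothetical per-document weight field.
docs = [
    {"price": 10.0, "rating": 1.0},
    {"price": 20.0, "rating": 3.0},
]
prices = [d["price"] for d in docs]

by_field = weighted_avg(prices, [d["rating"] for d in docs])
by_constant = weighted_avg(prices, [0.1] * len(docs))

print(by_field, by_constant)
```

With a constant weight every document counts equally, so the result collapses to the plain average; only a per-document weight changes the outcome.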
A workaround could be to apply a math function to the retrieved price such as
"script": "Math.pow(doc.price.value, 0.5)"
@jzzfs I'm trying the approach of "avg of the first N results (ordered by _score)", using a top_hits aggregation:
{
"query": {
"bool": {
"should": [
...
],
"minimum_should_match": 0
}
},
"size": 0,
"from": 0,
"sort": [
{
"_score": {
"order": "desc"
}
}
],
"aggs": {
"top_avg_price": {
"avg": {
"field": "price.max"
}
},
"aggs": {
"top_hits": {
"size": 10, // N: Changing the number of results doesn't change the top_avg_price
"_source": {
"includes": [
"price.max"
]
}
}
}
},
"explain": "false"
}
The avg aggregation is computed over the main results, not over the top_hits aggregation.
I guess top_avg_price should be a sub-aggregation of top_hits, but I think that's not possible at the moment.
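If computing the value outside Elasticsearch is acceptable, the "avg of the first N results" idea can be sketched on the client. This is only a sketch under assumptions: the request is run with "size" set to N instead of 0 so that the hits are returned, each hit has the usual `_score` and `_source` keys, and `avg_of_top_hits` and the sample hits are hypothetical.

```python
def avg_of_top_hits(hits, n, field_path=("price", "max")):
    """Average a nested _source field over the n highest-scoring hits."""
    ranked = sorted(hits, key=lambda h: h["_score"], reverse=True)[:n]
    values = []
    for hit in ranked:
        value = hit["_source"]
        for key in field_path:  # walk down to price.max
            value = value[key]
        values.append(value)
    return sum(values) / len(values) if values else None


# Made-up hits in the shape of the "hits.hits" array of a search response.
sample_hits = [
    {"_score": 3.2, "_source": {"price": {"max": 100.0}}},
    {"_score": 1.1, "_source": {"price": {"max": 40.0}}},
    {"_score": 2.5, "_source": {"price": {"max": 60.0}}},
]

top2 = avg_of_top_hits(sample_hits, 2)
top3 = avg_of_top_hits(sample_hits, 3)
print(top2, top3)
```

Unlike the aggregation in the question, the result here does change with N, because only the N best-scoring hits enter the average.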

Date_histogram and top_hits from unique values only

I am trying to do a date_histogram aggregation to show a sum of Duration for each hour.
I have the following documents:
{
"EntryTimestamp": 1567029600000,
"Username": "johndoe",
"UpdateTimestamp": 1567029600000,
"Duration": 10,
"EntryID": "ASDF1234"
}
The following works very well, but my problem is that sometimes multiple documents appear with the same EntryID. Ideally I would add a top_hits somehow and order by UpdateTimestamp, as I need the last updated document for each unique EntryID, but I'm not sure how to add this to my query.
{
"size": 0,
"query": {
"bool": {
"filter": [{
"range": {
"EntryTimestamp": {
"gte": "1567029600000",
"lte": "1567065599999",
"format": "epoch_millis"
}
}
}, {
"query_string": {
"analyze_wildcard": true,
"query": "Username.keyword=johndoe"
}
}
]
}
},
"aggs": {
"2": {
"date_histogram": {
"interval": "1h",
"field": "EntryTimestamp",
"min_doc_count": 0,
"extended_bounds": {
"min": "1567029600000",
"max": "1567065599999"
},
"format": "epoch_millis"
},
"aggs": {
"1": {
"sum": {
"field": "Duration"
}
}
}
}
}
}
I think you'll need a top_hits aggregation inside a terms aggregation.
The terms aggregation will get the distinct EntryIDs, and the top_hits aggregation inside it will return only the most recent document (based on UpdateTimestamp) for each bucket (each distinct value) of the terms aggregation.
I don't have exact syntax adapted to your case, and I believe you might run into issues with the number of sub-aggregations (I ran into some limitations with advanced aggregations in the past).
You can see this post for more info on that; I hope it proves helpful.
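The logic described here (keep the latest document per EntryID, then sum Duration per hour) can be sketched in plain Python to check the expected numbers. This is not the Elasticsearch query itself; `sum_latest_per_hour` is a hypothetical helper and all sample documents except the first are made up.

```python
from collections import defaultdict

HOUR_MS = 3_600_000  # one hour in milliseconds


def sum_latest_per_hour(docs):
    """Keep the most recently updated doc per EntryID, then sum Duration per hour."""
    latest = {}
    for doc in docs:
        current = latest.get(doc["EntryID"])
        if current is None or doc["UpdateTimestamp"] > current["UpdateTimestamp"]:
            latest[doc["EntryID"]] = doc

    totals = defaultdict(int)
    for doc in latest.values():
        bucket = doc["EntryTimestamp"] // HOUR_MS * HOUR_MS  # start of the hour
        totals[bucket] += doc["Duration"]
    return dict(totals)


docs = [
    {"EntryTimestamp": 1567029600000, "UpdateTimestamp": 1567029600000,
     "Duration": 10, "EntryID": "ASDF1234"},
    # Made-up later update of the same entry; only this one should count.
    {"EntryTimestamp": 1567029600000, "UpdateTimestamp": 1567029700000,
     "Duration": 25, "EntryID": "ASDF1234"},
    # Made-up entry in the following hour.
    {"EntryTimestamp": 1567033200000, "UpdateTimestamp": 1567033200000,
     "Duration": 5, "EntryID": "QWER5678"},
]

hourly = sum_latest_per_hour(docs)
print(hourly)
```

A sketch like this can serve as a reference result when checking whatever terms plus top_hits aggregation you end up with.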

elasticsearch averaging a field on a bucket

I am a newbie to Elasticsearch, trying to understand how aggregations and metrics work. I was running an aggregation query to retrieve the average number of bytes out per clientIPHash from an Elasticsearch instance. The query I created (using Kibana) is as follows:
{
"size": 0,
"query": {
"filtered": {
"query": {
"query_string": {
"query": "*",
"analyze_wildcard": true
}
},
"filter": {
"bool": {
"must": [
{
"range": {
"@timestamp": {
"gte": 1476177616965,
"lte": 1481361616965,
"format": "epoch_millis"
}
}
}
],
"must_not": []
}
}
}
},
"aggs": {
"2": {
"terms": {
"field": "ClientIP_Hash",
"size": 50,
"order": {
"1": "desc"
}
},
"aggs": {
"1": {
"avg": {
"field": "Bytes Out"
}
}
}
}
}
}
It gives me some output (supposed to be avg) grouped on clientIPHash like below:
ClientIP_Hash: Descending Average Bytes Out
64e6b1f6447fd044c5368740c3018f49 1,302,210
4ff8598a995e5fa6930889b8751708df 94,038
33b559ac9299151d881fec7508e2d943 68,527
c2095c87a0e2f254e8a37f937a68a2c0 67,083
...
The problem is, if I replace the avg with sum or min or any other metric type, I still get same values.
ClientIP_Hash: Descending Sum of Bytes Out
64e6b1f6447fd044c5368740c3018f49 1,302,210
4ff8598a995e5fa6930889b8751708df 94,038
33b559ac9299151d881fec7508e2d943 68,527
c2095c87a0e2f254e8a37f937a68a2c0 67,083
I checked the query generated by kibana, and it seems to correctly put the keyword 'sum' or 'avg' accordingly. I am puzzled why I get the same values for avg and sum or any other metric.
Could you check whether your sample data set has more than one value per bucket? Avg, sum, min and max are all the same if a bucket contains only one value.
Thanks
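A quick Python check illustrates the point. It is only a sketch with a hypothetical helper `bucket_metrics`; the first value is taken from the table in the question and the other two are made up.

```python
def bucket_metrics(values):
    """Compute the metrics discussed above for one bucket of values."""
    return {
        "avg": sum(values) / len(values),
        "sum": sum(values),
        "min": min(values),
        "max": max(values),
    }


single = bucket_metrics([94038])              # one document in the bucket
several = bucket_metrics([94038, 1000, 500])  # made-up extra documents

print(single)
print(several)
```

If every ClientIP_Hash bucket holds exactly one document, all metrics coincide, which would match the identical columns in the question; the doc_count of each bucket in the raw response would show whether that is the case.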

Decay filter function for a no-limit value with ElasticSearch

I have the following documents (at least 1 000 000) in an ElasticSearch index:
{"title":"toto", "views":132, "likes":23, "date" : "2014-09-01..." ...}
Where title is indexed with a language analyzer, views and likes are integers from 0 upward with no upper bound, and date is a date field.
I want to search by title, and boost documents that are recent and have high views and likes.
I am using a decay function for the date (with today as origin), and it's working as expected, but I don't know how to boost on the views and likes fields, since they have no maximum to use as an origin.
Here is my search query:
POST /threads/_search
{
"query": {
"function_score": {
"query": {
"multi_match": {
"query": "air france",
"type": "phrase",
"fields": [
"title^4",
"desc"
]
}
},
"functions": [
{
"exp": {
"date": {
"origin": "2014/09/29 13:00:00",
"scale": "12h",
"offset":"6h",
"decay":0.5
}
}
}
]
}
}
}
You could try "field_value_factor", as described in the function_score documentation. You'd need to test and assess the results, adjust the "factor" and the boost you are giving to "title", then test again and see whether it's getting closer to what you need. Also, you can add the explain parameter (?explain) to see how ES computes the _score. Something like this:
POST /threads/_search?explain
{
"query": {
"function_score": {
"query": {
"multi_match": {
"query": "air france",
"type": "phrase",
"fields": [
"title^8",
"desc"
]
}
},
"functions": [
{
"exp": {
"date": {
"origin": "2014/09/29 13:00:00",
"scale": "12h",
"offset":"6h",
"decay":0.5
}
}
},
{
"field_value_factor": {
"field": "views",
"modifier": "log2p",
"factor": 0.1
}
},
{
"field_value_factor": {
"field": "likes",
"modifier": "log2p",
"factor": 0.1
}
}
]
}
}
}
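To get a feel for the numbers before tuning "factor", the two function types can be approximated in Python. This is a rough sketch, not Elasticsearch code: `log2p` and `exp_decay` are hypothetical helpers that follow the documented formulas (log2p as the base-10 log of factor * value + 2, and exp decay as exp(lambda * max(0, distance - offset)) with lambda = ln(decay) / scale), and distances are expressed in hours.

```python
import math


def log2p(value, factor=0.1):
    """field_value_factor with modifier log2p: log10(factor * value + 2)."""
    return math.log10(factor * value + 2)


def exp_decay(distance, scale, offset=0.0, decay=0.5):
    """Exponential decay score for a document `distance` away from the origin."""
    lam = math.log(decay) / scale
    return math.exp(lam * max(0.0, distance - offset))


few_views = log2p(132)          # the "toto" document from the question
many_views = log2p(1_000_000)   # a very popular document

fresh = exp_decay(3, scale=12, offset=6)    # within the 6h offset
older = exp_decay(18, scale=12, offset=6)   # offset + one scale away

print(few_views, many_views, fresh, older)
```

The logarithm is what makes an unbounded field usable without a maximum: a document with a million views scores only a few times higher than one with 132, instead of thousands of times higher.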
