Elasticsearch: Is it possible to do a "Weighted Avg Aggregation" weighted by the score?

I'm trying to compute an average over a price field (price.avg). However, I want the best matches of the query to have more impact on the average than the lower-ranked ones, so the average should be weighted by the calculated score. This is the aggregation I'm using:
{
"query": {...},
"size": 100,
"aggs": {
"weighted_avg_price": {
"weighted_avg": {
"value": {
"field": "price.avg"
},
"weight": {
"script": "_score"
}
}
}
}
}
It should give me what I want. But instead I receive a null value:
{...
"hits": {...},
"aggregations": {
"weighted_avg_price": {
"value": null
}
}
}
Is there something that I'm missing? Is this aggregation query feasible? Is there any workaround?

When you debug what's available from within the script
GET prices/_search
{
"size": 0,
"aggs": {
"weighted_avg_price": {
"weighted_avg": {
"value": {
"field": "price"
},
"weight": {
"script": "Debug.explain(new ArrayList(params.keySet()))"
}
}
}
}
}
the following gets spit out
[doc, _source, _doc, _fields]
None of these contain information about the query _score that you're trying to access because aggregations operate in a context separate from the query-level scoring. This means the weight value needs to either
exist in the doc or
exist in the doc + be modifiable or
be a query-time constant (like 42 or 0.1)
A workaround could be to apply a math function to the retrieved price such as
"script": "Math.pow(doc.price.value, 0.5)"

@jzzfs I'm trying the "average of the first N results (ordered by _score)" approach, using a top_hits aggregation:
{
"query": {
"bool": {
"should": [
...
],
"minimum_should_match": 0
}
},
"size": 0,
"from": 0,
"sort": [
{
"_score": {
"order": "desc"
}
}
],
"aggs": {
"top_avg_price": {
"avg": {
"field": "price.max"
}
},
"aggs": {
"top_hits": {
"size": 10, // N: Changing the number of results doesn't change the top_avg_price
"_source": {
"includes": [
"price.max"
]
}
}
}
},
"explain": "false"
}
The avg aggregation is computed over the main results, not over the top_hits aggregation.
I guess top_avg_price should be a sub-aggregation of top_hits, but I don't think that's possible at the moment.
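Since neither the weighted_avg script nor a sub-aggregation of top_hits gets at the score here, one fallback is to compute the average on the client from the returned hits. This is a minimal sketch, not an Elasticsearch feature: it assumes the request returns hits (size greater than 0), that the response is already parsed into a Python dict, and that each hit's _source carries the price as a nested object. The function name score_weighted_avg and the sample values are made up for illustration.

```python
def score_weighted_avg(hits, path=("price", "avg"), top_n=None):
    """Average a numeric _source field over hits, weighted by each hit's _score.

    If top_n is given, only the first top_n hits are used (hits come back
    ordered by _score unless the request sorts on something else).
    """
    if top_n is not None:
        hits = hits[:top_n]
    weighted_sum = 0.0
    total_weight = 0.0
    for hit in hits:
        value = hit["_source"]
        for key in path:  # walk the nested object, e.g. price -> avg
            value = value.get(key) if isinstance(value, dict) else None
        score = hit.get("_score")
        if value is None or score is None:
            continue  # skip hits without a price or without a score
        weighted_sum += score * value
        total_weight += score
    # Return None when nothing contributed, similar to the null seen above
    return weighted_sum / total_weight if total_weight else None


response = {
    "hits": {
        "hits": [
            {"_score": 3.0, "_source": {"price": {"avg": 10.0}}},
            {"_score": 1.0, "_source": {"price": {"avg": 20.0}}},
        ]
    }
}
print(score_weighted_avg(response["hits"]["hits"]))  # (3*10 + 1*20) / (3 + 1)
```

Passing path=("price", "max") and top_n=10 would cover the "first N results" variant, at the cost of pulling those hits over the wire.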

Related

Combining filter and function_score - elastic search won't calculate score?

I have a posts index and I want to
a) Filter out all the posts by date i.e. return posts only created before the given date
b) Apply function_score to the results.
I came up with this query which looks kinda like this:
GET posts/_search
{
"query": {
"function_score": {
"query": {
"bool": {
"filter": [
{
"range": {
"created_at": {
"lte": "2020-09-12T17:23:00Z"
}
}
}
]
}
},
"functions": [
{
"field_value_factor": {
"field": "likes",
"factor": 0.1,
"modifier": "ln1p",
"missing": 1
}
}
],
"score_mode": "sum"
}
},
"from": 0,
"size": 10
}
However, the issue is that Elasticsearch does NOT apply a score to the results; all documents have a score of 0.0. I can "trick" ES by moving the filter into the functions, and then it does work, but I feel like it's not optimal. Any idea why the query above won't return scores?
Example of a "cheat" query which DOES calculate scores for every document:
GET posts/_search
{
"query": {
"function_score": {
"functions": [
{
"filter": {
"range": {
"created_at": {
"lte": "2020-09-12T17:22:58Z"
}
}
},
"weight": 100
},
{
"gauss": {
"created_at": {
"origin": "2020-09-12T17:22:58Z",
"scale": "1h",
"decay": 0.95
}
}
},
{
"field_value_factor": {
"field": "likes",
"factor": 0.1,
"modifier": "ln1p",
"missing": 1
}
}
]
}
},
"from": 0,
"size": 10
}
A filter does not influence the score. It only decides whether a document matches or not.
To influence a score with a function_score query, the inner query has to calculate a score in the first place (match, match_phrase, a geo query, etc.).
The documentation on query and filter context has more details.
field_value_factor modifies an existing score, but in your first query nothing has been scored at that stage. What you are asking for there is nearly the same as sorting by the number of likes.
In your second query you calculate a score that depends on how recent your documents are, and then adjust it with likes. That's why it works.
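The arithmetic behind this answer can be sketched in a few lines. This is only a model of the documented behaviour, not Elasticsearch code. It assumes the default boost_mode of multiply, that a bool query with only filter clauses gives every match a score of 0, and that the ln1p modifier means ln(1 + factor * value). The function names are made up for illustration.

```python
import math


def field_value_factor_ln1p(value, factor=0.1, missing=1):
    """Model of field_value_factor with modifier ln1p: ln(1 + factor * value)."""
    if value is None:
        value = missing  # the 'missing' setting stands in for an absent field
    return math.log1p(factor * value)


def combined_score(query_score, function_score, boost_mode="multiply"):
    """Combine the query score with the function score as boost_mode describes."""
    if boost_mode == "multiply":
        return query_score * function_score
    if boost_mode == "sum":
        return query_score + function_score
    if boost_mode == "replace":
        return function_score
    raise ValueError(f"unsupported boost_mode: {boost_mode}")


likes_boost = field_value_factor_ln1p(50)           # ln(1 + 0.1 * 50) = ln(6)
print(combined_score(0.0, likes_boost))             # filter-only query: 0 * boost
print(combined_score(0.0, likes_boost, "replace"))  # the boost on its own
```

Under these assumptions, setting "boost_mode" to "replace" or "sum" on the first query should be another way to get non-zero scores while keeping the date range in the filter.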

Elasticsearch: Aggregate all unique values of a field and apply a condition or filter by another field

My documents look like this:
{
"ownID": "Val_123",
"parentID": "Val_456",
"someField": "Val_78",
"otherField": "Val_90",
...
}
I am trying to get all (unique, as in one instance) results for a list of ownID values, while filtering by a list of parentID values and vice-versa.
What I did so far is:
Get (separate!) unique values for ownID and parentID in key1 and key2
{
"size": 0,
"aggs": {
"key1": {
"terms": {
"field": "ownID",
"include": {
"partition": 0,
"num_partitions": 10
},
"size": 100
}
},
"key2": {
"terms": {
"field": "parentID",
"include": {
"partition": 0,
"num_partitions": 10
},
"size": 100
}
}
}
}
Use filter to get (some) results matching either ownID OR parentID
{
"size": 0,
"query": {
"bool": {
"should": [
{
"terms": {
"ownID": ["Val_1","Val_2","Val_3"]
}
},
{
"terms": {
"parentID": ["Val_8","Val_9"]
}
}
]
}
},
"aggs": {
"my_filter": {
"top_hits": {
"size": 30000,
"_source": {
"include": ["ownID", "parentID","otherField"]
}
}
}
}
}
However, I need to get separate results for each filter in the second query, and get:
(1) the parentID of the documents matching some value of ownID
(2) the ownID for the documents matching some value of parentID.
So far I managed to do it using two similar queries (see below for (1)), but I would ideally want to combine them and query only once.
{
"size": 0,
"query": {
"bool": {
"should": [
{
"terms": {
"ownID": ["Val_1", "Val_2", "Val_3"]
}
}
]
}
},
"aggs": {
"my_filter": {
"top_hits": {
"size": 30000,
"_source": {
"include": "parentID"
}
}
}
}
}
I'm using Elasticsearch version 5.2.
If I understood your question correctly, you need the aggregation counts to be computed irrespective of the filter, but you want only the filtered documents in the search hits. For this, Elasticsearch has another type of filter, the "post filter": https://www.elastic.co/guide/en/elasticsearch/reference/5.5/search-request-post-filter.html
It's really simple: it just filters the search hits after the aggregations have been computed.
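If a single round trip is the main goal, another option is to run the combined should query from the question once and split the returned documents on the client. This is a minimal sketch of that idea, separate from the post filter suggestion. It assumes the hits (for example response["aggregations"]["my_filter"]["hits"]["hits"]) are parsed dicts with ownID and parentID in _source; the function name and sample values are hypothetical.

```python
def split_by_match(hits, own_ids, parent_ids):
    """Return (parentIDs of docs matching own_ids, ownIDs of docs matching parent_ids)."""
    own_ids, parent_ids = set(own_ids), set(parent_ids)
    parents_for_own = set()
    owns_for_parent = set()
    for hit in hits:
        src = hit["_source"]
        own, parent = src.get("ownID"), src.get("parentID")
        if own in own_ids and parent is not None:
            parents_for_own.add(parent)   # case (1) in the question
        if parent in parent_ids and own is not None:
            owns_for_parent.add(own)      # case (2) in the question
    return parents_for_own, owns_for_parent


hits = [
    {"_source": {"ownID": "Val_1", "parentID": "Val_7"}},
    {"_source": {"ownID": "Val_5", "parentID": "Val_8"}},
    {"_source": {"ownID": "Val_2", "parentID": "Val_8"}},
]
parents, owns = split_by_match(hits, ["Val_1", "Val_2", "Val_3"], ["Val_8", "Val_9"])
print(sorted(parents), sorted(owns))
```

A document that matches both lists contributes to both result sets, which is usually what is wanted when the two queries are merged.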

Paging the top_hits aggregation in ElasticSearch

Right now I'm doing a top_hits aggregation in Elasticsearch that groups my data by a field, sorts each group by date, and picks the top 1.
I need to page these aggregation results so that I can pass in a pageSize and a pageNumber, but I don't know how.
In addition, I also need the total number of results of this aggregation so we can show it in a table in our web interface.
The aggregation looks like this:
POST my_index/_search
{
"size": 0,
"aggs": {
"top_artifacts": {
"terms": {
"field": "artifactId.keyword"
},
"aggs": {
"top_artifacts_hits": {
"top_hits": {
"size": 1,
"sort": [{
"date": {
"order": "desc"
}
}]
}
}
}
}
}
}
If I understand what you want, you should be able to paginate with a composite aggregation. You can still pass a size parameter for the page size, but instead of from you pass the key of the last bucket from the previous page as after.
POST my_index/_search
{
"size": 0,
"aggs": {
"top_artifacts": {
"composite": {
"sources": [
{
"artifact": {
"terms": {
"field": "artifactId.keyword"
}
}
}
],
"size": 1, // OPTIONAL SIZE (How many buckets)
"after": {
"artifact": "FOO_BAZ" // Buckets after this bucket key
}
},
"aggs": {
"hits": {
"top_hits": {
"size": 1,
"sort": [
{
"timestamp": {
"order": "desc"
}
}
]
}
}
}
}
}
}
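On the client, paging then comes down to feeding each response's after_key back into the next request. This is a minimal sketch of that loop. The search argument is a hypothetical callable that takes a request body as a dict and returns the parsed response as a dict; it stands in for whatever client you use and is not a real library function.

```python
import copy


def iter_composite_pages(search, body, agg_name):
    """Yield one list of buckets per page, following after_key until the buckets run out.

    `search` is a hypothetical callable that takes a request body (dict)
    and returns the parsed search response (dict).
    """
    after = None
    while True:
        request = copy.deepcopy(body)  # leave the caller's body untouched
        if after is not None:
            request["aggs"][agg_name]["composite"]["after"] = after
        result = search(request)["aggregations"][agg_name]
        buckets = result.get("buckets", [])
        if not buckets:
            return  # an empty page means everything has been read
        yield buckets
        after = result.get("after_key")
        if after is None:
            return  # no after_key: nothing further to request
```

Note that this walks the pages in order; there is no direct jump to page N, so a pageNumber has to be reached by stepping through the earlier pages. For the total shown in the table, a cardinality aggregation on the same field is one possible (approximate) source.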

ElasticSearch - Ordering aggregation by nested aggregation on nested field

{
"query": {
"match_all": {}
},
"from": 0,
"size": 0,
"aggs": {
"itineraryId": {
"terms": {
"field": "iid",
"size": 2147483647,
"order": [
{
"price>price>price.max": "desc"
}
]
},
"aggs": {
"duration": {
"stats": {
"field": "drn"
}
},
"price": {
"nested": {
"path": "prl"
},
"aggs": {
"price": {
"filter": {
"terms": {
"prl.cc.keyword": [
"USD"
]
}
},
"aggs": {
"price": {
"stats": {
"field": "prl.spl.vl"
}
}
}
}
}
}
}
}
}
}
Here, I am getting the error:
"Invalid terms aggregation order path [price>price>price.max]. Terms
buckets can only be sorted on a sub-aggregator path that is built out
of zero or more single-bucket aggregations within the path and a final
single-bucket or a metrics aggregation at the path end. Sub-path
[price] points to non single-bucket aggregation"
The query works fine if I order by the duration aggregation, like this:
"order": [
{
"duration.max": "desc"
}
]
So is there any way to order the terms aggregation by a nested aggregation on a nested field, i.e. something like the following?
"order": [
{
"price>price>price.max": "desc"
}
]
As Val pointed out in the comments, ES does not support this yet.
Until it does, you can aggregate on the nested field first and then use the reverse nested aggregation to get back to the root of the document and aggregate the duration there.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-reverse-nested-aggregation.html
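If restructuring the aggregation is not an option, another fallback is to drop the order clause and sort the returned buckets on the client. This is a minimal sketch, assuming the response keeps the aggregation names from the question (price > price > price with a stats max inside each itineraryId bucket) and that the number of buckets is small enough to sort in memory. The function name and sample buckets are hypothetical.

```python
def sort_buckets_by_price_max(buckets, descending=True):
    """Sort terms buckets by the max of the nested price stats; buckets without a price go last."""
    def price_max(bucket):
        return bucket["price"]["price"]["price"].get("max")

    priced = [b for b in buckets if price_max(b) is not None]
    unpriced = [b for b in buckets if price_max(b) is None]
    priced.sort(key=price_max, reverse=descending)
    return priced + unpriced


buckets = [
    {"key": "it-1", "price": {"price": {"price": {"max": 120.0}}}},
    {"key": "it-2", "price": {"price": {"price": {"max": None}}}},  # no USD price
    {"key": "it-3", "price": {"price": {"price": {"max": 340.0}}}},
]
print([b["key"] for b in sort_buckets_by_price_max(buckets)])
```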

Compute the "fill rate" of a field in Elasticsearch

I would like to compute the ratio of documents in my index in which a given field has a value.
I managed to count how many documents are missing the field:
GET profiles/_search
{
"aggs": {
"profiles_wo_country": {
"missing": {
"field": "country"
}
}
},
"size": 0
}
I also managed to count how many documents have the field:
GET profiles/_search
{
"query": {
"filtered": {
"query": {"match_all": {}},
"filter": {
"exists": {
"field": "country"
}
}
}
},
"size": 0
}
Naturally I can also get the total number of documents in the index. How can I compute the ratio?
An easy way to get the numbers you need is the following query:
POST profiles/_search?filter_path=hits.total,aggregations.existing.doc_count
{
"size": 0,
"aggs": {
"existing": {
"filter": {
"exists": {
"field": "tag"
}
}
}
}
}
You'll get a response like this one:
{
"hits": {
"total": 37258601
},
"aggregations": {
"existing": {
"doc_count": 9287160
}
}
}
And then in your client code, you can simply do
fill_rate = (aggregations.existing.doc_count / hits.total) * 100
And you're good to go.
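The client-side step could look like the following in Python. This is a minimal sketch, assuming the response has already been parsed into a dict. It also accepts the object form of hits.total ({"value": ..., "relation": ...}) that Elasticsearch 7 and later return; on those versions the count may be capped unless track_total_hits is enabled on the request. The function name is made up for illustration.

```python
def fill_rate(response, agg_name="existing"):
    """Percentage of documents for which the filter aggregation matched."""
    total = response["hits"]["total"]
    if isinstance(total, dict):  # newer versions: {"value": ..., "relation": ...}
        total = total["value"]
    if not total:
        return None  # empty index: the ratio is undefined
    matched = response["aggregations"][agg_name]["doc_count"]
    return matched / total * 100


response = {
    "hits": {"total": 37258601},
    "aggregations": {"existing": {"doc_count": 9287160}},
}
print(round(fill_rate(response), 2))  # roughly a quarter of the documents
```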
