Elasticsearch top_hits aggregation

I have to get top N documents from multiple indices, then group the resulting set by index. I've tried the following:
{
"size": 0,
"query": {
"multi_match" : {
"query": "some term"
}
},
"aggs": {
"by_index": {
"terms": {
"field": "_index"
},
"aggs": {
"top_results": {
"top_hits": {
"size": 20
}
}
}
}
}
}
It aggregates results by _index and then limits each group to N (20) documents. But I need to receive no more than 20 documents in total.
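One way to cap the total at 20 documents is to skip the aggregation, let the top-level size limit the hits, and group those hits by _index on the client. The following is a minimal Python sketch of that idea, not a definitive solution. It assumes the response has the standard hits.hits shape; group_hits_by_index, the sample response, and the index names index_a and index_b are hypothetical.

```python
from collections import defaultdict

# Search body: the top-level "size" caps the total number of hits at 20.
search_body = {
    "size": 20,
    "query": {"multi_match": {"query": "some term"}},
}


def group_hits_by_index(response):
    """Group the hits of a search response by the index they came from."""
    grouped = defaultdict(list)
    for hit in response["hits"]["hits"]:
        grouped[hit["_index"]].append(hit)
    return dict(grouped)


# Hand-written sample response (illustrative data only).
sample_response = {
    "hits": {
        "hits": [
            {"_index": "index_a", "_id": "1", "_score": 2.1},
            {"_index": "index_b", "_id": "7", "_score": 1.8},
            {"_index": "index_a", "_id": "3", "_score": 1.2},
        ]
    }
}

grouped = group_hits_by_index(sample_response)
```

Because the hits arrive sorted by score, each group keeps that relative order.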

Related

How to do proportions in Elastic search query

I have a field in my data that has four unique values across all records. I need to aggregate the records by each unique value and find the proportion of each value in the data, that is, (number of records with a given value) / (total number of records). Is there a way to do this with Elasticsearch dashboards? I used a terms aggregation to group the values and a value_count metric aggregation to get the doc_count value, but I am not able to use bucket_script to do the division. I get this error: "buckets_path must reference either a number value or a single value numeric metric aggregation, got: [StringTerms] at aggregation [latest_version]"
Below is my code:
{
"size": 0,
"aggs": {
"BAR": {
"date_histogram": {
"field": "timestamp",
"calendar_interval": "day"
},
"aggs": {
"latest_version": {
"filter": {
"match_phrase": {
"log": "main_filter"
}
},
"aggs": {
"latest_version_count": {
"terms": {
"field": "field_name"
},
"aggs": {
"version_count": {
"value_count": {
"field": "field_name"
}
}
}
},
"sum_buckets": {
"sum_bucket": {
"buckets_path": "latest_version_count>_count"
}
}
}
},
"BAR-percentage": {
"bucket_script": {
"buckets_path": {
"eachVersionCount": "latest_version>latest_version_count",
"totalVersionCount": "latest_version>sum_buckets"
},
"script": "params.eachVersionCount/params.totalVersionCount"
}
}
}
}
}
}
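The error message indicates that the eachVersionCount path resolves to the terms aggregation itself (a multi-bucket result) rather than to a single number. One workaround, sketched here in Python under the assumption that the division can happen in client code, is to read the terms buckets from the response and compute the proportions there. The function name bucket_proportions and the sample buckets are hypothetical.

```python
def bucket_proportions(buckets):
    """Return {bucket key: share of the total doc_count} for terms buckets."""
    total = sum(bucket["doc_count"] for bucket in buckets)
    if total == 0:
        # No documents matched, so there is nothing to divide.
        return {}
    return {bucket["key"]: bucket["doc_count"] / total for bucket in buckets}


# Hand-written sample in the shape of a terms aggregation's "buckets" list.
sample_buckets = [
    {"key": "v1", "doc_count": 30},
    {"key": "v2", "doc_count": 10},
]

proportions = bucket_proportions(sample_buckets)
```

The same function can be applied to the latest_version_count buckets of each date_histogram bucket in turn.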

ElasticSearch - 2nd Level nested aggregation incorrect doc_count

ElasticSearch 7.10.1 nested aggregations.
Can anyone point me to why the doc_count on my 2nd nested aggregation is not correct?
The count on the first aggregation is accurate but the second isn't (both are keyword fields).
{
"size": 0,
"_source": false,
"query": {
"match_all": {}
},
"aggs": {
"products": {
"nested": {
"path": "productsImpacted"
},
"aggs": {
"field1": {
"terms": {
"field": "productsImpacted.product.keyword",
"size": 1000
},
"aggs": {
"resellers": {
"nested": {
"path": "requestType"
},
"aggs": {
"field2": {
"terms": {
"field": "requestType.type.keyword",
"size": 1000
}
}
}
}
}
}
}
}
}
}
Thanks,
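Elasticsearch terms aggregations are approximate: each shard returns only its own top terms, so doc_count values can be inaccurate.
You can increase size and shard_size to improve accuracy, at the cost of performance. See the official documentation: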
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-shard-size
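Apart from shard-level accuracy, one other possible cause is worth checking. This is an assumption, because the mapping is not shown: if requestType is a nested field at the root of the document rather than a child of productsImpacted, the inner nested aggregation runs from inside the productsImpacted context. A reverse_nested step returns to the parent document before entering the second nested path. The sketch below builds such a request body in Python; the aggregation name back_to_root is hypothetical.

```python
import json

# Request body with a reverse_nested step between the two nested paths.
query = {
    "size": 0,
    "aggs": {
        "products": {
            "nested": {"path": "productsImpacted"},
            "aggs": {
                "field1": {
                    "terms": {
                        "field": "productsImpacted.product.keyword",
                        "size": 1000,
                    },
                    "aggs": {
                        # Leave the productsImpacted context and go back
                        # to the root document.
                        "back_to_root": {
                            "reverse_nested": {},
                            "aggs": {
                                "resellers": {
                                    "nested": {"path": "requestType"},
                                    "aggs": {
                                        "field2": {
                                            "terms": {
                                                "field": "requestType.type.keyword",
                                                "size": 1000,
                                            }
                                        }
                                    },
                                }
                            },
                        }
                    },
                }
            },
        }
    },
}

# Serialize the body so it can be sent with any HTTP client.
body = json.dumps(query, indent=2)
```

With this shape, field2 counts requestType entries of the documents that contain the given product, which may or may not be the count you expect.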

Elasticsearch - get N top items in group

I store data in Elasticsearch with the following structure:
"_source" : {
"artist" : "Roger McGuinn",
"track_id" : "TRBIACM128F930021A",
"title" : "The Bells Of Rhymney",
"score" : 0,
"user_id" : "61583201a0b70d3f7ed79b60",
"timestamp" : 1634991817
}
How can I get the top N songs with the best score for each user? If a user has rated a song several times, I would like to take into account only the most recent rating.
This is what I have so far, but instead of the top 10 songs for the user, I just get the first 10 songs found, without taking the score into account:
{
"size": 0,
"aggs": {
"group_by_user": {
"terms": {
"field": "user_id.keyword",
"size": 1
},
"aggs": {
"group_by_track": {
"terms": {
"field": "track_id.keyword"
},
"aggs": {
"take_the latest_score": {
"terms": {
"field": "timestamp",
"size": 1
},
"aggs": {
"take N tracks": {
"top_hits": {
"size": 10
}
}
}
}
}
}
}
}
}
}
As I understand it, you want to return, for each user, the highest-rated track within a given time window.
You can use a date_histogram aggregation, followed by a terms aggregation, and then add a top_hits sub-aggregation:
Aggregation query. Three things to note: fixed_interval is set to 1h (change it to 1d if you want results on a daily basis), the terms size of 10 returns 10 users, and top_hits sorts by score in descending order and returns only track_id and score:
POST <your_index_name>/_search
{
  "size": 0,
  "aggs": {
    "songs_over_time": {
      "date_histogram": {
        "field": "timestamp",
        "fixed_interval": "1h",
        "min_doc_count": 1
      },
      "aggs": {
        "group_by_user": {
          "terms": {
            "field": "user_id.keyword",
            "size": 10
          },
          "aggs": {
            "take N tracks": {
              "top_hits": {
                "sort": [
                  {
                    "score": {
                      "order": "desc"
                    }
                  }
                ],
                "_source": {
                  "includes": ["track_id", "score"]
                },
                "size": 1
              }
            }
          }
        }
      }
    }
  }
}
For example, since I'm using a fixed_interval of 1h, this returns, for every hour, the highest-rated track of each user who rated something in that hour.
You can also narrow the documents with a range query before running the aggregation above.
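If the rule "only the most recent rating per track counts" has to be applied exactly, one option is to apply it in client code after fetching the rating documents. The Python sketch below assumes the documents fit in memory and have the fields shown in the question; top_tracks_per_user and the sample data are hypothetical.

```python
from collections import defaultdict


def top_tracks_per_user(ratings, n):
    """Return each user's n best-scored tracks, using only the latest rating per track."""
    # Step 1: keep only the most recent rating for each (user, track) pair.
    latest = {}
    for rating in ratings:
        key = (rating["user_id"], rating["track_id"])
        if key not in latest or rating["timestamp"] > latest[key]["timestamp"]:
            latest[key] = rating

    # Step 2: collect the surviving ratings per user.
    per_user = defaultdict(list)
    for rating in latest.values():
        per_user[rating["user_id"]].append(rating)

    # Step 3: sort each user's ratings by score (highest first) and keep n.
    return {
        user: sorted(items, key=lambda r: r["score"], reverse=True)[:n]
        for user, items in per_user.items()
    }


# Illustrative data: user u1 rated track t1 twice; the later rating wins.
sample = [
    {"user_id": "u1", "track_id": "t1", "score": 5, "timestamp": 100},
    {"user_id": "u1", "track_id": "t1", "score": 1, "timestamp": 200},
    {"user_id": "u1", "track_id": "t2", "score": 3, "timestamp": 150},
    {"user_id": "u1", "track_id": "t3", "score": 4, "timestamp": 120},
]

top = top_tracks_per_user(sample, 2)
```

In the sample, the earlier score of 5 for t1 is discarded because a newer rating exists, so t1 does not make the top two.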

How to mention from and size for the first level of elastic search aggregation in nested aggregation?

I have written a query to get the buckets based on id and then sort them. This works fine. But how can I make it return the buckets from position 100 to 200 for the aggregation_by_id aggregation?
{
"query": {
"match_all": {}
},
"size": 0,
"aggregations": {
"aggregation_by_id": {
"terms": {
"field": "id.keyword",
"size" : 200
},
"aggs": {
"sort_timestamp": {
"top_hits": {
"sort": [{
"timestamp": {
"order": "desc",
"unmapped_type": "long"
}
}],
"size": 1
}
}
}
}
}
}
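One approach that may fit here is the bucket_sort pipeline aggregation, which accepts from and size and trims the buckets of its parent aggregation. The Python sketch below builds such a request body; the helper paged_terms_query and the aggregation name page are hypothetical, and the terms size still has to cover the whole range up to the last requested bucket.

```python
def paged_terms_query(offset, page_size):
    """Build a request body that keeps only buckets [offset, offset + page_size)."""
    return {
        "query": {"match_all": {}},
        "size": 0,
        "aggregations": {
            "aggregation_by_id": {
                # The terms aggregation must produce enough buckets to page into.
                "terms": {"field": "id.keyword", "size": offset + page_size},
                "aggs": {
                    "sort_timestamp": {
                        "top_hits": {
                            "sort": [
                                {
                                    "timestamp": {
                                        "order": "desc",
                                        "unmapped_type": "long",
                                    }
                                }
                            ],
                            "size": 1,
                        }
                    },
                    # Drop the first `offset` buckets and keep the next `page_size`.
                    "page": {"bucket_sort": {"from": offset, "size": page_size}},
                },
            }
        },
    }


# Buckets 100 to 200 from the question.
query = paged_terms_query(100, 100)
```

For deep paging over many distinct ids, a composite aggregation with its after key may scale better, since this approach still computes all 200 buckets on every request.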

Compute the "fill rate" of a field in Elasticsearch

I would like to compute the ratio of fields that have a value in my index.
I managed to count how many documents miss the field:
GET profiles/_search
{
"aggs": {
"profiles_wo_country": {
"missing": {
"field": "country"
}
}
},
"size": 0
}
I also managed to count how many documents have the field:
GET profiles/_search
{
"query": {
"filtered": {
"query": {"match_all": {}},
"filter": {
"exists": {
"field": "country"
}
}
}
},
"size": 0
}
Naturally I can also get the total number of documents in the index. How can I compute the ratio?
An easy way to get both numbers in a single request is the following query:
POST profiles/_search?filter_path=hits.total,aggregations.existing.doc_count
{
"size": 0,
"aggs": {
"existing": {
"filter": {
"exists": {
"field": "country"
}
}
}
}
}
You'll get a response like this one:
{
"hits": {
"total": 37258601
},
"aggregations": {
"existing": {
"doc_count": 9287160
}
}
}
And then in your client code, you can simply do
fill_rate = (aggregations.existing.doc_count / hits.total) * 100
And you're good to go.
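The client-side step can be sketched in Python as follows. This is an illustration rather than a definitive implementation: the function name fill_rate and both sample responses are hypothetical. It also handles Elasticsearch 7.x and later, where hits.total is an object with value and relation fields instead of a plain number.

```python
def fill_rate(response, agg_name="existing"):
    """Return the percentage of documents counted by the given filter aggregation."""
    total = response["hits"]["total"]
    # Elasticsearch 7.x and later return {"value": ..., "relation": ...}.
    if isinstance(total, dict):
        total = total["value"]
    if total == 0:
        # Empty index: avoid dividing by zero.
        return 0.0
    return response["aggregations"][agg_name]["doc_count"] / total * 100


# Illustrative responses in the older and newer hits.total formats.
legacy_response = {
    "hits": {"total": 200},
    "aggregations": {"existing": {"doc_count": 50}},
}
modern_response = {
    "hits": {"total": {"value": 200, "relation": "eq"}},
    "aggregations": {"existing": {"doc_count": 50}},
}

legacy_rate = fill_rate(legacy_response)
modern_rate = fill_rate(modern_response)
```

On 7.x and later, the total is only tracked exactly up to 10,000 hits by default, so the request may need track_total_hits set to true for the ratio to be correct on larger indices.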
