Elasticsearch - get N top items in group

I store data in Elasticsearch with the following structure:
"_source" : {
"artist" : "Roger McGuinn",
"track_id" : "TRBIACM128F930021A",
"title" : "The Bells Of Rhymney",
"score" : 0,
"user_id" : "61583201a0b70d3f7ed79b60",
"timestamp" : 1634991817
}
How can I get the top N songs with the best score for each user? If a user has rated a song several times, I would like to take into account only the most recent rating.
This is what I have so far, but instead of the top 10 songs for the user, I just get the first 10 songs found, without the score being taken into account:
{
"size": 0,
"aggs": {
"group_by_user": {
"terms": {
"field": "user_id.keyword",
"size": 1
},
"aggs": {
"group_by_track": {
"terms": {
"field": "track_id.keyword"
},
"aggs": {
"take_the latest_score": {
"terms": {
"field": "timestamp",
"size": 1
},
"aggs": {
"take N tracks": {
"top_hits": {
"size": 10
}
}
}
}
}
}
}
}
}
}

As I understand it, you want to return, for each user, the highest-rated track within a given time window.
You can use a Date Histogram aggregation with a Terms sub-aggregation, and then extend it further with a Top Hits sub-aggregation:
Aggregation query:
POST <your_index_name>/_search
{
"size": 0,
"aggs": {
"songs_over_time": {
"date_histogram": {
"field": "timestamp",
"fixed_interval": "1h",
"min_doc_count": 1
},
"aggs": {
"group_by_user": {
"terms": {
"field": "user_id.keyword",
"size": 10
},
"aggs": {
"take N tracks": {
"top_hits": {
"sort": [
{
"score": {
"order": "desc"
}
}],
"_source": {
"includes": ["track_id", "score"]
},
"size": 1
}
}
}
}
}
}
}
}
Note the following settings. fixed_interval is set to 1h; change it to 1d if you want results on a daily basis. The size of 10 in the terms aggregation returns 10 users. The sort in top_hits orders the hits by score, descending. The _source includes restrict the returned fields to track_id and score.
Since fixed_interval is 1h here, this returns, for every hour, the highest-rated track of each user who has ratings in that hour.
Feel free to narrow down the documents with a Range query before running the aggregation above.
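The aggregation above does not by itself restrict each track to its most recent rating, which the question asks for. One possible way to cover that part is to post-process the documents on the client. The following is only a sketch under assumptions: the function name `top_n_per_user` is hypothetical, and `ratings` is assumed to be a list of `_source` dictionaries already fetched from the index, with the fields shown in the question.

```python
from collections import defaultdict


def top_n_per_user(ratings, n):
    """Return {user_id: [rating, ...]} holding each user's n best-scored tracks,
    counting only the most recent rating of every (user, track) pair."""
    latest = {}
    for rating in ratings:
        key = (rating["user_id"], rating["track_id"])
        # Keep the rating with the greatest timestamp for this pair.
        if key not in latest or rating["timestamp"] > latest[key]["timestamp"]:
            latest[key] = rating

    by_user = defaultdict(list)
    for rating in latest.values():
        by_user[rating["user_id"]].append(rating)

    # Sort each user's latest ratings by score, best first, and keep n of them.
    return {
        user: sorted(user_ratings, key=lambda r: r["score"], reverse=True)[:n]
        for user, user_ratings in by_user.items()
    }
```

Doing this on the client trades network traffic for simplicity, so it is likely only reasonable when the number of ratings per request is modest.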

Related

Elasticsearch top_hits aggregation

I have to get top N documents from multiple indices, then group the resulting set by index. I've tried the following:
{
"size": 0,
"query": {
"multi_match" : {
"query": "some term"
}
},
"aggs": {
"by_index": {
"terms": {
"field": "_index"
},
"aggs": {
"top_results": {
"top_hits": {
"size": 20
}
}
}
}
}
}
It aggregates results by _index and then limits each group to N (20) documents. But I need to receive no more than 20 documents in total.
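One possible approach, sketched here under assumptions, is to run a plain search with "size": 20 across the indices (no aggregation) and group the returned hits by their `_index` value on the client, which guarantees at most 20 documents in total. The function name `group_hits_by_index` is hypothetical, and `hits` is assumed to be the `hits.hits` array of the search response.

```python
from collections import defaultdict


def group_hits_by_index(hits):
    """Group search hits by the index they came from, preserving hit order."""
    grouped = defaultdict(list)
    for hit in hits:
        # Every hit in a search response carries the name of its index.
        grouped[hit["_index"]].append(hit)
    return dict(grouped)
```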

How to get specific _source fields in aggregation

I am exploring Elasticsearch for use in an application that will handle large volumes of data and generate statistical results over them. My requirement is to retrieve certain statistics for a particular field. For example, for a given field, I would like to retrieve its unique values and the document frequency of each value, along with the length of the value. The value lengths are indexed along with each document.
So far, I have experimented with Terms Aggregation, with the following query:
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"type_count": {
"terms": {
"field": "val.keyword",
"size": 100
}
}
}
}
The query returns all the values in the field val with the number of documents in which each value occurs. I would like the field val_len to be returned as well. Is it possible to achieve this using Elasticsearch? In other words, is it possible to include specific _source fields in buckets? I have looked through the documentation available online, but I haven't found a solution yet.
Hoping somebody could point me in the right direction. Thanks in advance!
I tried to include _source in the following ways:
"aggs": {
"type_count": {
"terms": {
"field": "val.keyword",
"size": 100
},
"_source":["val_len"]
}
}
and
"aggs": {
"type_count": {
"terms": {
"field": "val.keyword",
"size": 100,
"_source":["val_len"]
}
}
}
But I guess this isn't the right way, because both gave me parsing errors.
You need to use another sub-aggregation called top_hits, like this:
"aggs": {
"type_count": {
"terms": {
"field": "val.keyword",
"size": 100
},
"aggs": {
"hits": {
"top_hits": {
"_source":["val_len"],
"size": 1
}
}
}
}
}
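To read the result of the top_hits variant on the client, each bucket has to be unpacked. The sketch below assumes the aggregation names used above (type_count and hits) and a response already parsed into a dictionary; the function name `value_stats` is hypothetical.

```python
def value_stats(response):
    """Return (value, doc_count, val_len) tuples from the terms + top_hits response."""
    rows = []
    for bucket in response["aggregations"]["type_count"]["buckets"]:
        # The sub-aggregation is named "hits"; its body again has hits.hits.
        top = bucket["hits"]["hits"]["hits"]
        val_len = top[0]["_source"]["val_len"] if top else None
        rows.append((bucket["key"], bucket["doc_count"], val_len))
    return rows
```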
Another way of doing it is to use an avg sub-aggregation instead, so you can sort on it, too:
"aggs": {
"type_count": {
"terms": {
"field": "val.keyword",
"size": 100,
"order": {
"length": "desc"
}
},
"aggs": {
"length": {
"avg": {
"field": "val_len"
}
}
}
}
}

How to detect the number of days that a person passed in a city?

I have the following mapping in Elasticsearch:
PUT /traffic-data
{
"mappings": {
"traffic-entry": {
"_all": {
"enabled": false
},
"properties": {
"CameraId": {
"type":"keyword"
},
"VehiclePlateNumber": {
"type":"keyword"
},
"DateTime": {
"type":"date"
}
}
}
}
}
I want to calculate how many days per month a vehicle has stayed. A unique vehicle is identified by VehiclePlateNumber.
So, I want to get a result something like this:
VehiclePlateNumber  Month  StayDays
111                 1      5
222                 1      1
...
How can I do this with an Elasticsearch query?
This is what I tried:
GET traffic-data/_search
{
"size": 0,
"aggs":{
"by_district":{
"terms": {
"field": "VehiclePlateNumber",
"size": 100000
},
"aggs": {
"by_month": {
"terms": {
"field": "DateTime",
"size": 12
}
}
}
}
}
}
You can run a terms aggregation on the vehicle plate number, then a terms sub-aggregation on the month, then a sum sub-aggregation on the days.
Something like:
GET traffic-data/_search
{
"size": 0,
"aggs":{
"by_district":{
"terms": {
"field": "VehiclePlateNumber",
"size": 100000
},
"aggs": {
"by_month": {
"terms": {
"field": "DateTime",
"size": 12
},
"aggs": {
"days": {
"sum": {
"field": "days"
}
}
}
}
}
}
}
}
The month should be a scripted field, but it would be better to compute it at index time.
That should work.
Alternatively, you can use an entity-centric design and regularly index the computed value. See https://www.elastic.co/elasticon/2015/sf/building-entity-centric-indexes
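The mapping in the question has no days field, so the value has to be computed somewhere. As an illustration of what that computation could look like (for example in a job that feeds an entity-centric index), here is a client-side sketch. It is not the aggregation itself: the function name `stay_days` is hypothetical, DateTime values are assumed to be ISO 8601 strings without a timezone suffix, and a "stay day" is assumed to mean a distinct calendar day on which the plate was seen.

```python
from collections import defaultdict
from datetime import datetime


def stay_days(entries):
    """Count distinct days per (plate, year, month) on which a vehicle was seen."""
    days_seen = defaultdict(set)
    for entry in entries:
        seen_at = datetime.fromisoformat(entry["DateTime"])
        key = (entry["VehiclePlateNumber"], seen_at.year, seen_at.month)
        # A set keeps each calendar day once, however many cameras saw the plate.
        days_seen[key].add(seen_at.day)
    return {key: len(days) for key, days in days_seen.items()}
```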

Compute the "fill rate" of a field in Elasticsearch

I would like to compute the ratio of documents in my index that have a value for a given field.
I managed to count how many documents miss the field:
GET profiles/_search
{
"aggs": {
"profiles_wo_country": {
"missing": {
"field": "country"
}
}
},
"size": 0
}
I also managed to count how many documents have the field:
GET profiles/_search
{
"query": {
"filtered": {
"query": {"match_all": {}},
"filter": {
"exists": {
"field": "country"
}
}
}
},
"size": 0
}
Naturally I can also get the total number of documents in the index. How can I compute the ratio?
An easy way to get the numbers you need is the following query:
POST profiles/_search?filter_path=hits.total,aggregations.existing.doc_count
{
"size": 0,
"aggs": {
"existing": {
"filter": {
"exists": {
"field": "tag"
}
}
}
}
}
You'll get a response like this one:
{
"hits": {
"total": 37258601
},
"aggregations": {
"existing": {
"doc_count": 9287160
}
}
}
And then in your client code, you can simply do:
fill_rate = (aggregations.existing.doc_count / hits.total) * 100
And you're good to go.
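The client-side step could look roughly like the sketch below, assuming the response has been parsed into a dictionary with the shape shown above. The function name `fill_rate` is hypothetical. The sketch also accepts `hits.total` as an object with a `value` key, which is how Elasticsearch 7 and later report it; on those versions the total may be a lower bound unless total hit tracking is enabled.

```python
def fill_rate(response):
    """Return the percentage of documents that have the field, from the response above."""
    total = response["hits"]["total"]
    # Newer Elasticsearch versions return {"value": ..., "relation": ...} here.
    if isinstance(total, dict):
        total = total["value"]
    if total == 0:
        # Avoid dividing by zero on an empty index.
        return 0.0
    existing = response["aggregations"]["existing"]["doc_count"]
    return existing / total * 100
```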

Elastic Search Aggregation and Details

I am trying to get the teacher's name as well in this query.
With it I can loop over the teachers and get the number of classes each one works for and also the amount of money she gets for each year.
But I can't get the full details with this query. I want to display the teacher's name too.
Here is my current query:
{
"aggs": {
"teacher": {
"terms": {
"field": "teacher_id",
"size": 10
},
"aggs": {
"academic_year": {
"date_histogram": {
"field": "acc_year",
"interval": "year"
},
"aggs": {
"income": {
"stats": {
"field": "teacher_hourly_fee"
}
}
}
}
}
}
},
"size": 0
}
The most straightforward approach may be to combine the teacher ID and name into a generated term using a script:
{
"aggs" : {
"teacher" : {
"terms" : {
"script" : "_source.teacher_id + '-' + _source.teacher_name",
"size": 10
}
}
}
}
Adjust the script details to match your actual schema.
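On the client, the combined bucket keys then need to be split back into ID and name. A minimal sketch, assuming the '-' separator from the script above and that teacher IDs never contain a hyphen themselves; the function name `split_teacher_buckets` is hypothetical.

```python
def split_teacher_buckets(response):
    """Split 'id-name' bucket keys from the teacher aggregation into separate fields."""
    teachers = []
    for bucket in response["aggregations"]["teacher"]["buckets"]:
        # Split on the first hyphen only, so hyphenated names stay intact.
        teacher_id, _, name = str(bucket["key"]).partition("-")
        teachers.append(
            {"id": teacher_id, "name": name, "doc_count": bucket["doc_count"]}
        )
    return teachers
```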
