Elasticsearch: get N ordered records and then apply grouping

Here's an example of what I'm looking for. Let's say I have records of some purchases. I want to get records where price is > $50, ordered by price descending. I want to limit those ordered records to 100 and then group them by zip code.
The final result should have a count of hits for each zip code, where the sum of those counts totals 100 records.
ES v2.1.1

What do you mean by "group them by zip code"?
1. Do you just want to know the number of docs in each group?
2. Or a hash with zip code as the key, associated with the docs?
If 1:
{
  "size": 100,
  "query": {
    "filtered": {
      "filter": {
        "range": {
          "price": {
            "gt": 50
          }
        }
      }
    }
  },
  "sort": {
    "price": "desc"
  },
  "aggs": {
    "by_zip_code": {
      "terms": {
        "field": "zip_code"
      }
    }
  }
}
In both cases, note that aggregations run over all documents matching the query, not just the 100 hits returned (size only limits the hits). If 2, you may use the top_hits aggregation. However, sorting the zip-code buckets by price is not directly possible (how could we do that?): by default Elasticsearch orders buckets by _count (check the intrinsic sort options). If the sort is not a big deal, the following will work:
{
  "size": 0,
  "query": {
    "filtered": {
      "filter": {
        "range": {
          "price": {
            "gt": 50
          }
        }
      }
    }
  },
  "sort": {
    "price": "desc"
  },
  "aggs": {
    "by_zip_code": {
      "terms": {
        "field": "zip_code",
        "size": 100
      },
      "aggs": {
        "hits": {
          "top_hits": {}
        }
      }
    }
  }
}
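That said, if the bucket order matters, you can approximate a price ordering by sorting the buckets on a max sub-aggregation (a sketch, not in the original answer; max_price is a name I've introduced):

"aggs": {
  "by_zip_code": {
    "terms": {
      "field": "zip_code",
      "size": 100,
      "order": { "max_price": "desc" }
    },
    "aggs": {
      "max_price": { "max": { "field": "price" } },
      "hits": { "top_hits": {} }
    }
  }
}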

You need to use the Search API to get the 100 results and then post-process them to perform the grouping (an aggregation over only the top hits cannot be done directly using the ES API).
1. "I want to get records where price is > $50" - You need a range filter.
2. "...order by price descending" - You need a sort.
3. "I want to limit those ordered records to 100" - You need to specify the size parameter.
4. "...then group them by zip code" - You need to post-process the "hits":"hits" array to do this (e.g. inserting into a hash table / dictionary with zip code as the key).
For steps 1-3 you need:
$ curl -XGET 'http://localhost:9200/my_index/_search?pretty' -d '{
  "query": { "filtered": { "filter": { "range": { "price": { "gt": 50 } } } } },
  "size": 100,
  "sort": { "price": { "order": "desc" } }
}'

Related

Elasticsearch aggregate on term multiple times per different time range

I'm trying to aggregate a field by each half of the time-range given in the query. For example, here's the query:
{
  "query": {
    "simple_query_string": {
      "query": "+sitetype:(redacted) +sort_date:[now-2h TO now]"
    }
  }
}
...and I want to aggregate on term "product1.keyword" from now-2h to now-1h and aggregate on the same term "product1.keyword" from now-1h to now, so like:
"terms": {
"field": "product1",
"size": 10,
}
^ aggregate the top 10 results on product1 in now-2h TO now-1h,
and aggregate the top 10 results on product1 in now-1h TO now.
Clarification: product1 is not a date or time-related field. It would be like a type of car, phone, etc.
If you want to use now in your query, you must make the product1 field a date type; then you can try the following:
GET index1/_search
{
  "size": 0,
  "aggs": {
    "dataAgg": {
      "date_range": {
        "field": "product1",
        "ranges": [
          {
            "from": "now-2h",
            "to": "now-1h"
          },
          {
            "from": "now-1h",
            "to": "now"
          }
        ]
      },
      "aggs": {
        "top10": {
          "top_hits": {
            "size": 10
          }
        }
      }
    }
  }
}
And if you can't change product1's type, you can try a range aggregation, but you must write the times explicitly instead of using now.
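Alternatively (a sketch, not from the original answer, assuming the timestamp lives in the sort_date date field from the query and product1 has a keyword sub-field): a filters aggregation with one range filter per time window gives you the two per-window top-10 term lists without changing any mapping:

GET index1/_search
{
  "size": 0,
  "aggs": {
    "by_window": {
      "filters": {
        "filters": {
          "older_hour": { "range": { "sort_date": { "gte": "now-2h", "lt": "now-1h" } } },
          "latest_hour": { "range": { "sort_date": { "gte": "now-1h", "lte": "now" } } }
        }
      },
      "aggs": {
        "top_products": {
          "terms": {
            "field": "product1.keyword",
            "size": 10
          }
        }
      }
    }
  }
}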

Elasticsearch - get N top items in group

I keep data in Elasticsearch with the following structure:
"_source" : {
"artist" : "Roger McGuinn",
"track_id" : "TRBIACM128F930021A",
"title" : "The Bells Of Rhymney",
"score" : 0,
"user_id" : "61583201a0b70d3f7ed79b60",
"timestamp" : 1634991817
}
How can I get the top N songs with the best score for each user? If a user has rated a song several times, I would like to take into account only the most recent rating.
I tried the query below, but instead of the top 10 songs for the user, I just get the first 10 songs found, without taking the score into account:
{
  "size": 0,
  "aggs": {
    "group_by_user": {
      "terms": {
        "field": "user_id.keyword",
        "size": 1
      },
      "aggs": {
        "group_by_track": {
          "terms": {
            "field": "track_id.keyword"
          },
          "aggs": {
            "take_the latest_score": {
              "terms": {
                "field": "timestamp",
                "size": 1
              },
              "aggs": {
                "take N tracks": {
                  "top_hits": {
                    "size": 10
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
What I understand is that you want to return a list of users with their highest rated track, based on date/times.
You can make use of a Date Histogram aggregation followed by a Terms aggregation, extending the pipeline further with a Top Hits aggregation:
Aggregation Query:
POST <your_index_name>/_search
{
  "size": 0,
  "aggs": {
    "songs_over_time": {
      "date_histogram": {
        "field": "timestamp",
        "fixed_interval": "1h",       <---- Note this. Change to 1d if you'd want to return results on a daily basis
        "min_doc_count": 1
      },
      "aggs": {
        "group_by_user": {
          "terms": {
            "field": "user_id.keyword",
            "size": 10                <---- Note this. To return 10 users
          },
          "aggs": {
            "take N tracks": {
              "top_hits": {
                "sort": [
                  {
                    "score": {
                      "order": "desc"   <---- Also note this, to sort based on score
                    }
                  }
                ],
                "_source": {
                  "includes": ["track_id", "score"]   <---- To return track_id and score
                },
                "size": 1
              }
            }
          }
        }
      }
    }
  }
}
For example, since I'm using a fixed_interval of 1h, this would, for every hour, return the highest rated track of each user active in that window.
Feel free to filter out the docs using a Range Query, on which you can run the above aggregation query.
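For instance (a sketch; the epoch-second bounds are hypothetical, and timestamp is the field from the question), adding this query clause to the request above restricts the aggregation to a window:

"query": {
  "range": {
    "timestamp": {
      "gte": 1634900000,
      "lte": 1634991817
    }
  }
}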

ElasticSearch: Is it possible to do a "Weighted Avg Aggregation" weighted by the score?

I'm trying to perform an avg over a price field (price.avg). But I want the best matches of the query to have more impact on the average than the later, lower-scored ones, so the avg should be weighted by the computed score. This is the aggregation that I'm implementing:
{
  "query": {...},
  "size": 100,
  "aggs": {
    "weighted_avg_price": {
      "weighted_avg": {
        "value": {
          "field": "price.avg"
        },
        "weight": {
          "script": "_score"
        }
      }
    }
  }
}
It should give me what I want. But instead I receive a null value:
{ ...
  "hits": { ... },
  "aggregations": {
    "weighted_avg_price": {
      "value": null
    }
  }
}
Is there something that I'm missing? Is this aggregation query feasible? Is there any workaround?
When you debug what's available from within the script:
GET prices/_search
{
  "size": 0,
  "aggs": {
    "weighted_avg_price": {
      "weighted_avg": {
        "value": {
          "field": "price"
        },
        "weight": {
          "script": "Debug.explain(new ArrayList(params.keySet()))"
        }
      }
    }
  }
}
the following gets spit out:
[doc, _source, _doc, _fields]
None of these contains information about the query _score that you're trying to access, because aggregations operate in a context separate from query-level scoring. This means the weight value needs to either:
- exist in the doc, or
- exist in the doc and be modifiable, or
- be a query-time constant (like 42 or 0.1)
A workaround could be to apply a math function to the retrieved price, such as:
"script": "Math.pow(doc.price.value, 0.5)"
@jzzfs I'm trying the approach of "avg of the first N results (ordered by _score)", using the top_hits aggregation:
{
  "query": {
    "bool": {
      "should": [
        ...
      ],
      "minimum_should_match": 0
    }
  },
  "size": 0,
  "from": 0,
  "sort": [
    {
      "_score": {
        "order": "desc"
      }
    }
  ],
  "aggs": {
    "top_avg_price": {
      "avg": {
        "field": "price.max"
      }
    },
    "aggs": {
      "top_hits": {
        "size": 10, // N: changing the number of results doesn't change top_avg_price
        "_source": {
          "includes": [
            "price.max"
          ]
        }
      }
    }
  },
  "explain": "false"
}
The avg aggregation is being done over the main results, not over the top_hits aggregation.
I guess top_avg_price should be a sub-aggregation of top_hits, but I think that's not possible at the moment.
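As a possible workaround (a sketch, not from the thread): the sampler aggregation restricts its sub-aggregations to the best-scoring documents, which gives an "avg over the top N hits" server-side. The match clause below merely stands in for the real query, and best_hits is a name I've introduced:

{
  "size": 0,
  "query": {
    "match": { "description": "some text" }
  },
  "aggs": {
    "best_hits": {
      "sampler": {
        "shard_size": 10
      },
      "aggs": {
        "top_avg_price": {
          "avg": {
            "field": "price.max"
          }
        }
      }
    }
  }
}

Note that shard_size applies per shard, so on a multi-shard index this only approximates the global top N.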

How do I filter after an aggregation?

I am trying to filter after a top_hits aggregation to tell whether the first appearance of an error was in a given range, but I can't find a way.
I have seen something about the bucket selector but can't get it to work:
POST log-*/_search
{
  "size": 100,
  "aggs": {
    "group": {
      "terms": {
        "field": "errorID.keyword",
        "size": 100
      },
      "aggs": {
        "group_docs": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "#timestamp": {
                  "order": "asc"
                }
              }
            ]
          }
        }
      }
    }
  }
}
With this top_hits I get the first appearance of a given errorID, as I have many documents with the same errorID, but what I want to find out is whether that first appearance falls within a given range of dates.
I think a valid solution would be to filter the results of the aggregation to check whether each one is in the range, but I don't know how I could do that.
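A sketch of the bucket_selector idea (an assumption, not a confirmed answer from the thread): bucket_selector cannot read from a top_hits aggregation, but a min aggregation on #timestamp yields the same "first appearance" instant as a number, which the selector script can compare against the range bounds in epoch milliseconds (the bounds and the names first_seen / first_seen_in_range below are hypothetical):

POST log-*/_search
{
  "size": 0,
  "aggs": {
    "group": {
      "terms": {
        "field": "errorID.keyword",
        "size": 100
      },
      "aggs": {
        "first_seen": {
          "min": {
            "field": "#timestamp"
          }
        },
        "first_seen_in_range": {
          "bucket_selector": {
            "buckets_path": {
              "firstSeen": "first_seen"
            },
            "script": "params.firstSeen >= 1666828800000L && params.firstSeen < 1666915200000L"
          }
        }
      }
    }
  }
}

Buckets whose earliest #timestamp falls outside the window are dropped from the response.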

ElasticSearch filtering by field1 THEN field2 THEN take max of field3

I am struggling to get the information that I need from ElasticSearch.
My log statements are like this:
field1: Example
field2: Example2
field3: Example3
I would like to search a timeframe (say, the last 24 hours) to find all data that has this in field1 and that in field2.
There may then be multiple this.that.[field3] entries, so I want to return only the maximum of that field.
In fact, in my data, field3 is actually the key of the entry.
What is the best way of retrieving the information I need? I have managed to get results returned using aggs, but the data is in buckets, and I am only interested in the data with the max value of field3.
I have added an example of the query that I am looking to do: https://jsonblob.com/54535d49e4b0d117eeaf6bb4
{
  "size": 0,
  "aggs": {
    "agg_129": {
      "filters": {
        "filters": {
          "CarName: Toyota": {
            "query": {
              "query_string": {
                "query": "CarName: Toyota"
              }
            }
          }
        }
      },
      "aggs": {
        "agg_130": {
          "filters": {
            "filters": {
              "Attribute: TimeUsed": {
                "query": {
                  "query_string": {
                    "query": "Attribute: TimeUsed"
                  }
                }
              }
            }
          },
          "aggs": {
            "agg_131": {
              "terms": {
                "field": "#timestamp",
                "size": 0,
                "order": {
                  "_count": "desc"
                }
              }
            }
          }
        }
      }
    }
  },
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "#timestamp": {
                  "gte": "2014-10-27T00:00:00.000Z",
                  "lte": "2014-10-28T23:59:59.999Z"
                }
              }
            }
          ],
          "must_not": []
        }
      }
    }
  }
}
So, the example above shows only those that have CarName = Toyota and Attribute = TimeUsed.
My data is as follows: there are x cars (CarName), each car has y Attributes, and each of those Attributes has documents with a timestamp.
To begin with, I was looking for a query for CarName.Attribute.timestamp (latest); however, if I am able to use just ONE query to get the latest timestamp for EVERY Attribute of EVERY CarName, that would decrease query calls from ~50 to one.
If you are using Elasticsearch v1.3+, you can add a top_hits aggregation with the parameter size: 1 and a descending sort on the field3 value.
This will return the whole document with the maximum value on that field, as you wish.
This example in the documentation might do the trick.
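A sketch of that first option (not tested; field names taken from the query further below, and latest_doc is a name I've introduced):

{
  "size": 0,
  "aggs": {
    "by_carnames": {
      "terms": {
        "field": "CarName",
        "size": 10
      },
      "aggs": {
        "by_attribute": {
          "terms": {
            "field": "Attribute",
            "size": 10
          },
          "aggs": {
            "latest_doc": {
              "top_hits": {
                "size": 1,
                "sort": [
                  { "#timestamp": { "order": "desc" } }
                ]
              }
            }
          }
        }
      }
    }
  }
}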
Edit:
OK, it seems you don't need the whole document, but only the maximum timestamp value. You can use a max aggregation instead of a top_hits one.
The following query (not tested) should give you the maximum timestamp value for each top 10 Attribute value of each top 10 CarName value, in only one request.
A terms aggregation is like a GROUP BY clause, and you should not have to query 50 times to retrieve the values of each CarName/Attribute combination: this is the point of nesting a terms aggregation for Attribute inside the CarName one.
Note that, to work properly, the CarName and Attribute fields should be not_analyzed. If that's not the case, you will have "funny" results in your buckets. The problem (and a possible solution) is very well described here.
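A minimal sketch of such a mapping (ES 1.x syntax; the type name my_type is an assumption):

{
  "mappings": {
    "my_type": {
      "properties": {
        "CarName": { "type": "string", "index": "not_analyzed" },
        "Attribute": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}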
Feel free to change the size parameter of the terms aggregations to fit your case.
{
  "size": 0,
  "aggs": {
    "by_carnames": {
      "terms": {
        "field": "CarName",
        "size": 10
      },
      "aggs": {
        "by_attribute": {
          "terms": {
            "field": "Attribute",
            "size": 10
          },
          "aggs": {
            "max_timestamp": {
              "max": {
                "field": "#timestamp"
              }
            }
          }
        }
      }
    }
  },
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "#timestamp": {
                  "gte": "2014-10-27T00:00:00.000Z",
                  "lte": "2014-10-28T23:59:59.999Z"
                }
              }
            }
          ]
        }
      }
    }
  }
}
