How can I get the most active 10% of users in Elasticsearch?

I would like to perform some aggregations on the most active 10% of users.
Let's say my docs are:
{
  "createDate": "2014-10-7T05:43:02",
  "user": "Raz",
  "os": "IOS"
},
{
  "createDate": "2014-10-7T07:43:02",
  "user": "Raz",
  "os": "Android"
},
{
  "createDate": "2014-10-7T09:43:02",
  "user": "Jim",
  "os": "Android"
}
and my aggregation is:
"aggs": {
"time_aggs": {
"date_histogram": {
"field": "createDate",
"interval": "10m"
},"aggs": {
"device_os":{
"term": {
"os":"IOS"
}
}
}
}
What should I add to the aggregations to apply them only to the most active 10% of users?
Thanks.

For now I'm implementing this by calculating the number of distinct users in a certain time range (using a cardinality aggregation). Then I run a terms aggregation on clientId with a size that reflects 10% of the distinct users, as sketched below.
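A minimal sketch of that two-step workaround, assuming the user field and date range from the sample docs above (swap in clientId as appropriate). First, count the distinct users in the range:

{
  "size": 0,
  "query": {
    "range": {
      "createDate": {
        "gte": "2014-10-01T00:00:00",
        "lte": "2014-10-31T23:59:59"
      }
    }
  },
  "aggs": {
    "distinct_users": {
      "cardinality": {
        "field": "user"
      }
    }
  }
}

If distinct_users comes back as, say, 50, the second query asks for the top 10% (5 users) by event count and applies the sub-aggregations only to their documents:

{
  "size": 0,
  "aggs": {
    "most_active_users": {
      "terms": {
        "field": "user",
        "size": 5,
        "order": { "_count": "desc" }
      },
      "aggs": {
        "time_aggs": {
          "date_histogram": {
            "field": "createDate",
            "interval": "10m"
          }
        }
      }
    }
  }
}

Note that cardinality is approximate for high user counts, so the 10% cutoff is approximate as well.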

Related

Need average per week of some value in Elasticsearch

I have simple data:
sales, date_of_sale
What I need is the average per week, i.e. sum(sales) / number of weeks.
Please help.
What I have till now is:
{
  "size": 0,
  "aggs": {
    "WeekAggergation": {
      "date_histogram": {
        "field": "date_of_sales",
        "interval": "week"
      }
    },
    "TotalSales": {
      "sum": {
        "field": "sales"
      }
    },
    "myValue": {
      "bucket_script": {
        "buckets_path": {
          "myGP": "TotalSales",
          "myCount": "WeekAggergation._bucket_count"
        },
        "script": "params.myGP/params.myCount"
      }
    }
  }
}
I get the error
Invalid pipeline aggregation named [myValue] of type [bucket_script].
Only sibling pipeline aggregations are allowed at the top level.
I think this may help:
{
  "size": 0,
  "aggs": {
    "WeekAggergation": {
      "date_histogram": {
        "field": "date_of_sale",
        "interval": "week",
        "format": "yyyy-MM-dd"
      },
      "aggs": {
        "TotalSales": {
          "sum": {
            "field": "sales"
          }
        },
        "AvgSales": {
          "avg": {
            "field": "sales"
          }
        }
      }
    },
    "avg_all_weekly_sales": {
      "avg_bucket": {
        "buckets_path": "WeekAggergation>TotalSales"
      }
    }
  }
}
Note that the TotalSales aggregation is now nested under the weekly histogram aggregation. (I believe there was a typo in the code provided: the simple schema indicates the field name date_of_sale, while the aggregation uses the plural form date_of_sales.) This gives you the total of all sales in each weekly bucket.
Additionally, AvgSales provides a similar nested aggregation under the weekly histogram aggregation so you can see the average of all sales specific to that week.
Finally, the avg_all_weekly_sales pipeline aggregation gives the average of weekly sales based on the TotalSales buckets and the number of non-empty buckets. If you want to include empty buckets, add the gap_policy parameter like so:
...
"avg_all_weekly_sales": {
  "avg_bucket": {
    "buckets_path": "WeekAggergation>TotalSales",
    "gap_policy": "insert_zeros"
  }
}
...
(see: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-pipeline-avg-bucket-aggregation.html).
This pipeline aggregation may or may not be what you're actually looking for, so please check the math to ensure the result is what you expect, but it should provide the correct output based on the original script.
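For example, if three weekly buckets end up with TotalSales values of 10, 20, and 30, avg_all_weekly_sales returns (10 + 20 + 30) / 3 = 20; with gap_policy set to insert_zeros, an additional empty week would change that to (10 + 20 + 30 + 0) / 4 = 15.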

How do I select the top term buckets based on a rescore function in Elasticsearch

Consider the following query for Elasticsearch 5.6:
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "rescore": [
    {
      "window_size": 10000,
      "query": {
        "rescore_query": {
          "function_score": {
            "boost_mode": "replace",
            "script_score": {
              "script": {
                "source": "doc['topic_score'].value"
              }
            }
          }
        },
        "query_weight": 0,
        "rescore_query_weight": 1
      }
    }
  ],
  "aggs": {
    "distinct": {
      "terms": {
        "field": "identical_id",
        "order": {
          "top_score": "desc"
        }
      },
      "aggs": {
        "best_unique_result": {
          "top_hits": {
            "size": 1
          }
        },
        "top_score": {
          "max": {
            "script": {
              "inline": "_score"
            }
          }
        }
      }
    }
  }
}
This is a simplified version; the real query has a more complex main query and a far more intensive rescore function.
Let me explain its purpose first, in case I'm about to spend a thousand hours developing a pen that writes in space when a pencil would actually solve my problem. I'm performing a fast initial query, then rescoring the top results with a much more intensive function. From those results I want to show the top distinct values, i.e. no two results should have the same identical_id. If there's a better way to do this, I'd also consider that an answer.
I expected a query like this would order results by the rescore query, group all the results that had the same identical_id and display the top hit for each such distinct group. I also assumed that since I'm ordering those term aggregation buckets by the max parent _score, they would be ordered to reflect the best result they contain as determined from the original rescore query.
The reality is that the term buckets are ordered by the maximum query score and not the rescore query score. Strangely the top hits within the buckets do seem to use the rescore.
Is there a better way to achieve the end result that I want, or some way I can fix this query to work the way I expect it to?
From the documentation:
The query rescorer executes a second query only on the Top-K results returned by the query and post_filter phases. The number of docs which will be examined on each shard can be controlled by the window_size parameter, which defaults to 10.
As the rescore query kicks in after the post_filter phase, I assume the term aggregation buckets are already fixed.
I have no idea on how you can combine rescore and aggregations. Sorry :(
I think I have a pretty great solution to this problem, but I'll let the bounty continue to expiration in case someone comes up with a better approach.
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 10000
      },
      "aggs": {
        "distinct": {
          "terms": {
            "field": "identical_id",
            "order": {
              "top_score": "desc"
            }
          },
          "aggs": {
            "best_unique_result": {
              "top_hits": {
                "size": 1,
                "sort": [
                  {
                    "_script": {
                      "type": "number",
                      "script": {
                        "source": "doc['topic_score'].value"
                      },
                      "order": "desc"
                    }
                  }
                ]
              }
            },
            "top_score": {
              "max": {
                "script": {
                  "source": "doc['topic_score'].value"
                }
              }
            }
          }
        }
      }
    }
  }
}
The sampler aggregation will take the top N hits per shard from the core query and run the aggregations over those. In the max aggregator that defines the bucket order, I use the exact same script as the one I use to pick the top hit from each bucket. Now the buckets and the top hits run over the same top-N sets of items, and the buckets are ordered by the max of the same score, generated from the same script. Unfortunately I still need to run the script once to order the buckets and once to pick a top hit within each bucket. You could use the rescore instead for the top-hits ordering, but either way it has to run twice, and I found it was faster as a sort script than as a rescore.

Groupby query in Elasticsearch

I have an Elasticsearch cluster holding the analytics data of my website. There are page view events when a user visits a page. Each page view event has a session-id field, which remains the same during the user session.
I would like to calculate the duration of each session by grouping the events by session id and computing the time difference between the first and the last event.
Is there any way I can achieve this with an Elasticsearch query?
Pageview events
[
  {
    "session-id": "234234-234234-324324-23432432",
    "url": "testpage1",
    "timestamp": 54323424222
  },
  {
    "session-id": "234234-234234-324324-23432432",
    "url": "testpage2",
    "timestamp": 54323424223
  },
  {
    "session-id": "234234-234234-324324-23432432",
    "url": "testpage3",
    "timestamp": 54323424224
  }
]
Session duration will be (54323424224 - 54323424222) ms.
EDIT:
I was able to create a data table visualization with session id, max timestamp, and min timestamp, by querying min(timestamp) and max(timestamp) for each session id. Now all I need is the difference between these two aggs.
There's no way to compute the difference between max and min inside buckets.
Try this, and calculate the max - min difference on your client side:
{
  "aggs": {
    "bySession": {
      "terms": {
        "field": "session-id.keyword"
      },
      "aggs": {
        "statsBySession": {
          "stats": {
            "field": "timestamp"
          }
        }
      }
    }
  }
}
The stats aggregation will give you the min and max timestamps per session. You can calculate the difference between them (max - min) using a bucket script aggregation.
Refer: bucket-script-aggregation
and stats-bucket-aggregation.
You can use the following query to calculate the difference between the max and min timestamps per session-id:
{
  "size": 0,
  "aggs": {
    "session": {
      "terms": {
        "field": "session-id.keyword",
        "size": 10
      },
      "aggs": {
        "stats_bucket": {
          "stats": {
            "field": "timestamp"
          }
        },
        "time_spent": {
          "bucket_script": {
            "buckets_path": {
              "min_stats": "stats_bucket.min",
              "max_stats": "stats_bucket.max"
            },
            "script": "params.max_stats - params.min_stats"
          }
        }
      }
    }
  }
}
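For the three sample events above, each session bucket in the response should look roughly like this (values computed from the sample timestamps):

{
  "key": "234234-234234-324324-23432432",
  "doc_count": 3,
  "stats_bucket": {
    "count": 3,
    "min": 54323424222,
    "max": 54323424224,
    "avg": 54323424223,
    "sum": 162970272669
  },
  "time_spent": {
    "value": 2
  }
}

time_spent.value is the session duration in the same unit as the timestamp field, here 2 ms.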

Elasticsearch - calculate percentage in nested aggregations in relation to parent bucket

Updated question
In my query I aggregate on date and then on sensor name. Is it possible to calculate a ratio from a nested aggregation and the total document count (or any other aggregation) of the parent bucket? Example query:
{
  "size": 0,
  "aggs": {
    "by_date": {
      "date_histogram": {
        "field": "date",
        "interval": "1d",
        "min_doc_count": 0
      },
      "aggs": {
        "measure_count": {
          "cardinality": {
            "field": "date"
          }
        },
        "all_count": {
          "value_count": {
            "field": "name"
          }
        },
        "by_name": {
          "terms": {
            "field": "name",
            "size": 0
          },
          "aggs": {
            "count_by_name": {
              "value_count": {
                "field": "name"
              }
            },
            "my ratio": count_by_name / all_count * 100 <-- How to do that?
          }
        }
      }
    }
  }
}
I want a custom metric that gives me the ratio count_by_name / all_count * 100. Is that possible in ES, or do I have to compute it on the client?
This seems very simple to me, but I haven't found a way yet.
Old post:
Is there a way to let Elasticsearch consider the overall count of documents (or any other metric) when calculating the average for a bucket?
Example:
I have like 100000 sensors that generate events at different times. Every event is indexed as a document that has a timestamp and a value.
When I want to calculate a ratio of the value over a date histogram, and some sensors only generated values at one time, I want Elasticsearch to treat the non-existing values (documents) for my sensors as 0 instead of null.
So when aggregating by day, if a sensor only generated two values, at 10pm (3) and 11pm (5), the aggregate for the day should be (3 + 5) / 24, or formally: SUM(VALUE) / 24.
Instead, Elasticsearch calculates the average as (3 + 5) / 2, which is not correct in my case.
There was once a ticket on GitHub (https://github.com/elastic/elasticsearch/issues/9745), but the answer was "handle it in your application". That's no answer for me, as I would have to generate zillions of zero-value documents for every sensor/time combination to get the ratio right.
Any ideas on this?
If this is the case, simply divide the results by 24 on the application side. When the granularity changes, change this value accordingly. The number of hours per day is fixed, right? Alternatively, the division can be done inside Elasticsearch, as sketched below.
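A minimal sketch of that division done inside Elasticsearch with a bucket_script pipeline aggregation, assuming the field names timestamp and value from the sensor example (and the newer params script syntax):

{
  "size": 0,
  "aggs": {
    "per_day": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "day"
      },
      "aggs": {
        "daily_sum": {
          "sum": {
            "field": "value"
          }
        },
        "hourly_ratio": {
          "bucket_script": {
            "buckets_path": {
              "total": "daily_sum"
            },
            "script": "params.total / 24"
          }
        }
      }
    }
  }
}

For the two-reading example above, daily_sum is 3 + 5 = 8 and hourly_ratio is 8 / 24 ≈ 0.33, i.e. SUM(VALUE) / 24 rather than the plain average (3 + 5) / 2 = 4.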
You can use the Bucket script aggregation to do what you want.
{
  "bucket_script": {
    "buckets_path": {
      "count_by_name": "count_by_name",
      "all_count": "all_count"
    },
    "script": "count_by_name / all_count * 100"
  }
}
It's just an example.
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/search-aggregations-pipeline-bucket-script-aggregation.html

Concurrent events aggregation in ElasticSearch

I have a number of documents representing events with starts_at and ends_at fields. At a given point in time, an event is considered active, if the point in question is after starts_at and before ends_at.
I'm looking for an aggregation, which should result in a date histogram, where each bucket contains the number of active events in that interval.
So far, the best approximation I have found is to create a set of buckets counting the number of starts in each interval, as well as a corresponding set of buckets counting the number of ends, and then postprocessing them by subtracting the number of starts from the number of ends for each interval:
{
  "size": "0",
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "and": [
          {
            "term": {
              "_type": "event"
            }
          },
          {
            "range": {
              "starts_at": {
                "gte": "2015-06-14T05:25:03Z",
                "lte": "2015-06-21T05:25:03Z"
              }
            }
          }
        ]
      }
    }
  },
  "aggs": {
    "starts": {
      "date_histogram": {
        "field": "starts_at",
        "interval": "15m",
        "extended_bounds": {
          "max": "2015-06-21T05:25:04Z",
          "min": "2015-06-14T05:25:04Z"
        },
        "min_doc_count": 0
      }
    },
    "ends": {
      "date_histogram": {
        "field": "ends_at",
        "interval": "15m",
        "extended_bounds": {
          "max": "2015-06-21T05:25:04Z",
          "min": "2015-06-14T05:25:04Z"
        },
        "min_doc_count": 0
      }
    }
  }
}
I'm looking for something like this solution.
Is there a way to achieve that with a single query?
I'm not 100% sure, but the upcoming pipeline aggregations might solve this problem in the near future in a more elegant way.
Meanwhile, you could choose the desired time resolution and, at index time, in addition to the starts_at and ends_at fields, also generate an active_at field. It would be an array of timestamps, and you could use either a terms aggregation (if it is mapped as a not_analyzed string) or a date_histogram aggregation to get the correct "active events count" for each time bucket; a sketch follows below.
The downside is inflated storage requirements and possibly worse performance, since there are more field values to aggregate over. Anyway, it shouldn't be too bad if you don't choose too fine a time resolution, like 1 minute.
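A minimal sketch of that denormalization at a 15-minute resolution, where active_at holds every 15-minute mark between starts_at and ends_at (the concrete timestamps are illustrative). An indexed event would look like:

{
  "starts_at": "2015-06-14T05:00:00Z",
  "ends_at": "2015-06-14T05:40:00Z",
  "active_at": [
    "2015-06-14T05:00:00Z",
    "2015-06-14T05:15:00Z",
    "2015-06-14T05:30:00Z"
  ]
}

and the aggregation becomes a single date histogram:

{
  "size": 0,
  "aggs": {
    "active_events": {
      "date_histogram": {
        "field": "active_at",
        "interval": "15m"
      }
    }
  }
}

Because active_at is an array, each event is counted once in every 15-minute bucket it overlaps, so each bucket's doc_count is exactly the number of active events in that interval.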
