Group-by query in Elasticsearch

I have an Elasticsearch cluster holding the analytics data of my website. A page view event is recorded whenever a user visits a page. Each page view event has a session-id field, which stays the same for the duration of the user's session.
I would like to calculate the duration of each session by grouping the events by session id and computing the difference between the timestamps of the first and last events.
Is there any way I can achieve this with an Elasticsearch query?
Page view events:
[
  {
    "session-id": "234234-234234-324324-23432432",
    "url": "testpage1",
    "timestamp": 54323424222
  },
  {
    "session-id": "234234-234234-324324-23432432",
    "url": "testpage2",
    "timestamp": 54323424223
  },
  {
    "session-id": "234234-234234-324324-23432432",
    "url": "testpage3",
    "timestamp": 54323424224
  }
]
Session duration will be (54323424224 - 54323424222) ms, i.e. 2 ms.
EDIT:
I was able to create a data table visualization with session id, max timestamp and min timestamp by querying min(timestamp) and max(timestamp) for each session id. Now all I need is the difference between these two aggs.

One option is to compute the difference between max and min on the client side. Try this query and calculate max - min from the returned stats:
{
  "aggs": {
    "bySession": {
      "terms": {
        "field": "session-id.keyword"
      },
      "aggs": {
        "statsBySession": {
          "stats": {
            "field": "timestamp"
          }
        }
      }
    }
  }
}
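For the client-side step, here is a minimal sketch in Python. The response dict is a hypothetical example of the shape the stats aggregation above returns; the values are taken from the question.

```python
# Hypothetical aggregation response from the query above.
response = {
    "aggregations": {
        "bySession": {
            "buckets": [
                {
                    "key": "234234-234234-324324-23432432",
                    "statsBySession": {"min": 54323424222, "max": 54323424224},
                },
            ]
        }
    }
}

# Session duration per session id: max timestamp - min timestamp.
durations = {
    bucket["key"]: bucket["statsBySession"]["max"] - bucket["statsBySession"]["min"]
    for bucket in response["aggregations"]["bySession"]["buckets"]
}
print(durations)  # {'234234-234234-324324-23432432': 2}
```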

The stats aggregation will give you the min and max timestamps per session. You can then calculate the difference between them (max - min) using a bucket_script aggregation.
Refer: bucket-script-aggregation
and stats-bucket-aggregation.
You can use the following query to calculate the difference between the max and min timestamps per session-id:
{
  "size": 0,
  "aggs": {
    "session": {
      "terms": {
        "field": "session-id.keyword",
        "size": 10
      },
      "aggs": {
        "stats_bucket": {
          "stats": {
            "field": "timestamp"
          }
        },
        "time_spent": {
          "bucket_script": {
            "buckets_path": {
              "min_stats": "stats_bucket.min",
              "max_stats": "stats_bucket.max"
            },
            "script": "params.max_stats - params.min_stats"
          }
        }
      }
    }
  }
}

Related

Bucket aggregation that doesn't depend on the time range in Elasticsearch

I'm using Elasticsearch 7.9.3 to query time series metrics which are stored in the form:
{
  "timestamp": <long>,
  "name": <string - metric name>,
  "value": <float>
}
I want to show this data in our UI widgets. However, the query might bring back way too much data for the widget, so I went with a bucket aggregation that calculates the average value per bucket and returns the "calculated" representatives of the time series. Here is a slightly simplified version of the query I'm using:
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "name": "METRICS_NAME_COMES_HERE"
          }
        },
        {
          "range": {
            "timestamp": {
              "gte": {{from}},
              "lt": {{to}}
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "primary-agg": {
      "date_histogram": {
        "field": "timestamp",
        "fixed_interval": "{{bucket_size}}ms",
        "min_doc_count": 1,
        "offset": "{{offset_in_ms}}ms"
      },
      "aggs": {
        "average-value": {
          "avg": {
            "field": "value"
          }
        }
      }
    }
  }
}
Now when the time range changes (we have a Kibana-like time picker in our UI widget that translates the selected range to 'from'/'to' in the query), the bucket data gets recalculated, which can lead to significant discrepancies in the data shown in the UI.
For example, if I see a "spike" in the UI and zoom in (thus narrowing the search period), the spike is preserved but the actual values of the "representatives" change significantly.
So my question is: what are the best practices for creating a query that produces a fixed number of results (therefore I understand that I need some kind of aggregation) whose values are not affected by range changes?
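Not an authoritative answer, but one common practice for the fixed-count part is to derive bucket_size on the client from the selected range, so the bucket count stays constant no matter how far you zoom. A sketch (target_buckets is an assumed widget parameter, not from the original query):

```python
import math

def fixed_interval_ms(from_ms, to_ms, target_buckets=50):
    """Pick a fixed_interval (in ms) so the date_histogram yields
    roughly target_buckets buckets for any selected time range."""
    span = to_ms - from_ms
    return max(1, math.ceil(span / target_buckets))

# Zooming from 24 h down to 1 h changes the interval, not the bucket count.
day_interval = fixed_interval_ms(0, 24 * 3600 * 1000)   # 1728000 ms per bucket
hour_interval = fixed_interval_ms(0, 3600 * 1000)       # 72000 ms per bucket
```

To keep the representative values themselves stable while zooming, you would additionally snap 'from'/'to' (and hence the bucket boundaries) to multiples of the chosen interval, so the same raw documents keep falling into the same buckets.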

How to filter the response of a multi-search in Elasticsearch?

I am using the Python client for Elasticsearch 6.5 to run a multi-search, since I have to fetch data from multiple indexes with different queries and aggregations.
GET _msearch/
{
  "index": QUESTION_INDEX
}
{
  "aggs": {
    "order_info": {
      "terms": {
        "field": "order_ids",
        "size": 9999
      },
      "aggs": {
        "total_value": {
          "sum": {
            "field": "selling_price"
          }
        }
      }
    },
    "median_price": {
      "percentiles_bucket": {
        "buckets_path": "order_info>total_value",
        "percents": [50]
      }
    }
  }
}
Now in my response I am getting the order_info bucket, but I only need the percentile value. Is there any way to filter this bucket out of the Elasticsearch response?
Edit 1: I want to reduce the size of the response coming over the network call from ES.
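One way to shrink the payload is Elasticsearch's response filtering via the filter_path request parameter (e.g. GET _msearch?filter_path=responses.aggregations.median_price), which drops everything else server-side before it crosses the network. The helper below is only a client-side illustration of what that pruning does, not the ES implementation:

```python
def prune(doc, path):
    """Keep only the dotted `path` inside nested dict `doc`
    (illustrative only: no wildcards, single path)."""
    keys = path.split(".")
    value = doc
    for key in keys:          # walk down to the value we want to keep
        value = value[key]
    result = {}
    node = result
    for key in keys[:-1]:     # rebuild the enclosing structure
        node[key] = {}
        node = node[key]
    node[keys[-1]] = value
    return result

full_response = {  # hypothetical single-search response, simplified
    "took": 4,
    "aggregations": {
        "order_info": {"buckets": [{"key": 1, "total_value": {"value": 99.0}}]},
        "median_price": {"values": {"50.0": 99.0}},
    },
}

slim = prune(full_response, "aggregations.median_price")
# slim == {"aggregations": {"median_price": {"values": {"50.0": 99.0}}}}
```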

Elasticsearch: need the average per week of some value

I have simple data:
sales, date_of_sales
What I need is the average per week, i.e. sum(sales) / number of weeks.
Please help.
What I have so far is:
{
  "size": 0,
  "aggs": {
    "WeekAggergation": {
      "date_histogram": {
        "field": "date_of_sales",
        "interval": "week"
      }
    },
    "TotalSales": {
      "sum": {
        "field": "sales"
      }
    },
    "myValue": {
      "bucket_script": {
        "buckets_path": {
          "myGP": "TotalSales",
          "myCount": "WeekAggergation._bucket_count"
        },
        "script": "params.myGP/params.myCount"
      }
    }
  }
}
I get the error
Invalid pipeline aggregation named [myValue] of type [bucket_script].
Only sibling pipeline aggregations are allowed at the top level.
I think this may help:
{
  "size": 0,
  "aggs": {
    "WeekAggergation": {
      "date_histogram": {
        "field": "date_of_sale",
        "interval": "week",
        "format": "yyyy-MM-dd"
      },
      "aggs": {
        "TotalSales": {
          "sum": {
            "field": "sales"
          }
        },
        "AvgSales": {
          "avg": {
            "field": "sales"
          }
        }
      }
    },
    "avg_all_weekly_sales": {
      "avg_bucket": {
        "buckets_path": "WeekAggergation>TotalSales"
      }
    }
  }
}
Note that the TotalSales aggregation is now nested under the weekly histogram aggregation, which gives you the total of all sales in each weekly bucket. (There appears to be a field-name discrepancy: the schema in the question uses date_of_sales, while this aggregation uses date_of_sale; use whichever matches your mapping.)
Additionally, AvgSales is a similar nested aggregation under the weekly histogram, so you can see the average of the sales specific to each week.
Finally, the pipeline aggregation avg_all_weekly_sales gives the average of the weekly sales, based on the TotalSales buckets and the number of non-empty buckets. If you want to include empty buckets, add the gap_policy parameter like so:
...
"avg_all_weekly_sales": {
  "avg_bucket": {
    "buckets_path": "WeekAggergation>TotalSales",
    "gap_policy": "insert_zeros"
  }
}
...
(See: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-pipeline-avg-bucket-aggregation.html.)
This pipeline aggregation may or may not be exactly what you're looking for, so please check the math to make sure the result is what you expect; it should, however, produce correct output based on the original query.
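To check the math, here is the avg_bucket arithmetic in plain Python, with a made-up list of weekly totals (None standing in for an empty week bucket):

```python
weekly_totals = [120.0, 80.0, None, 100.0]  # None = empty week bucket

present = [v for v in weekly_totals if v is not None]

# Default gap_policy "skip": empty buckets are ignored entirely.
avg_skip = sum(present) / len(present)          # 300.0 / 3 = 100.0

# gap_policy "insert_zeros": empty buckets count as 0 in the average.
avg_zeros = sum(present) / len(weekly_totals)   # 300.0 / 4 = 75.0
```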

Elasticsearch - calculate percentage in nested aggregations in relation to parent bucket

Updated question
In my query I aggregate on date and then on sensor name. Is it possible to calculate a ratio from a nested aggregation and the total document count (or any other aggregation) of the parent bucket? Example query:
{
  "size": 0,
  "aggs": {
    "over_time": {
      "aggs": {
        "by_date": {
          "date_histogram": {
            "field": "date",
            "interval": "1d",
            "min_doc_count": 0
          },
          "aggs": {
            "measure_count": {
              "cardinality": {
                "field": "date"
              }
            },
            "all_count": {
              "value_count": {
                "field": "name"
              }
            },
            "by_name": {
              "terms": {
                "field": "name",
                "size": 0
              },
              "aggs": {
                "count_by_name": {
                  "value_count": {
                    "field": "name"
                  }
                },
                "my ratio": count_by_name / all_count * 100 <-- How to do that?
              }
            }
          }
        }
      }
    }
  }
}
I want a custom metric that gives me the ratio count_by_name / all_count * 100. Is that possible in ES, or do I have to compute that on the client?
This seems very simple to me, but I haven't found a way yet.
Old post:
Is there a way to let Elasticsearch consider the overall count of documents (or any other metric) when calculating the average for a bucket?
Example:
I have like 100000 sensors that generate events on different times. Every event is indexed as a document that has a timestamp and a value.
When I want to calculate a ratio of the value over a date histogram, and some sensors only generated values at certain times, I want Elasticsearch to treat the missing values (documents) as 0 instead of null.
So when aggregating by day and a sensor only has generated two values at 10pm (3) and 11pm (5), the aggregate for the day should be (3+5)/24, or formal: SUM(VALUE)/24.
Instead, Elasticsearch calculates the average like (3+5)/2, which is not correct in my case.
There was once a ticket on Github https://github.com/elastic/elasticsearch/issues/9745, but the answer was "handle it in your application". That's no answer for me, as I would have to generate zillions of zero-Value documents for every sensor/time combination to get the average ratio right.
Any ideas on this?
If that's the case, simply divide the results by 24 on the application side, and when the granularity changes, change this divisor accordingly. The number of hours per day is fixed, after all.
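In plain Python, that suggestion looks like this for the 10pm/11pm example from the question (hours_per_day would change with the chosen granularity):

```python
readings = [3, 5]        # sensor values at 10pm and 11pm
hours_per_day = 24       # the missing 22 hours are implicitly treated as 0

avg_over_day = sum(readings) / hours_per_day        # 8 / 24 ≈ 0.333
avg_over_present = sum(readings) / len(readings)    # 4.0 - what ES computes
```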
You can use the bucket_script aggregation to do what you want:
{
  "bucket_script": {
    "buckets_path": {
      "count_by_name": "count_by_name",
      "all_count": "all_count"
    },
    "script": "count_by_name / all_count * 100"
  }
}
It's just an example.
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/search-aggregations-pipeline-bucket-script-aggregation.html

How can I get most active 10% users in Elasticsearch?

I would like to perform some aggregations on the most active 10% users.
Let's say my docs are:
{
  "createDate": "2014-10-7T05:43:02",
  "user": "Raz",
  "os": "IOS"
},
{
  "createDate": "2014-10-7T07:43:02",
  "user": "Raz",
  "os": "Android"
},
{
  "createDate": "2014-10-7T09:43:02",
  "user": "Jim",
  "os": "Android"
}
and my aggregation is:
"aggs": {
"time_aggs": {
"date_histogram": {
"field": "createDate",
"interval": "10m"
},"aggs": {
"device_os":{
"term": {
"os":"IOS"
}
}
}
}
What should I add to the aggregations to apply them only to the most active 10% of users?
Thanks.
For now I'm implementing this by calculating the number of distinct users in a certain time range (using a cardinality aggregation), then running a terms aggregation on clientId with a size that reflects 10% of the distinct users.
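A sketch of that sizing step (the function name and the 10% fraction are illustrative, not part of any ES API):

```python
import math

def top_user_terms_size(distinct_users, fraction=0.10):
    """Derive the `size` for the terms aggregation on clientId from the
    cardinality result, covering the most active `fraction` of users."""
    return max(1, math.ceil(distinct_users * fraction))

size = top_user_terms_size(1234)  # -> 124
```

You would run the cardinality aggregation first, compute size like this, and then issue the terms aggregation with that size in a second request.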
