Aggregation Median/Mean Queries - elasticsearch

I have an index with a type that can be reduced to:
{
  'date': DATE_STRING,
  'owner': INT,
  'color': 'red' | 'purple' | 'blue'
}
and am looking to write queries that present the following data, where an owner's value equals the number of 'blue' items they own minus the number of 'red' items they own, over a requested time range (don't ask why):
minimum value of any owner (within requested time)
maximum value of any owner (within requested time)
mean value of all owners (within requested time)
median value of all owners (within requested time)
a particular owner's value (within requested time)

Set up the index:
PUT colorful
{
  "mappings": {
    "properties": {
      "date": {
        "type": "date"
      },
      "owner": {
        "type": "integer"
      },
      "color": {
        "type": "keyword"
      }
    }
  }
}
Insert a few docs
POST colorful/_doc
{"date":"2020-05-28T19:56:12.237Z","owner":131351351,"color":"red"}
POST colorful/_doc
{"date":"2020-04-28T19:58:02.110Z","owner":35135125,"color":"purple"}
POST colorful/_doc
{"date":"2020-05-15T19:58:15.966Z","owner":997654341,"color":"blue"}
POST colorful/_doc
{"date":"2020-05-21T19:58:35.766Z","owner":366449,"color":"red"}
Filter by a date range and aggregate. Min, max and avg (= mean) can be calculated with a stats aggregation; for the median there's percentiles with percents: [50]. I'm not sure what you meant by a particular owner's value, but the range-filtered docs themselves can be fetched with top_hits, and you could add a filter for a specific owner.
GET colorful/_search
{
  "size": 0,
  "query": {
    "range": {
      "date": {
        "gte": "now-3M",
        "lte": "now-1h"
      }
    }
  },
  "aggs": {
    "1)general_stats": {
      "stats": {
        "field": "owner"
      }
    },
    "2)median": {
      "percentiles": {
        "field": "owner",
        "percents": [
          50
        ]
      }
    },
    "3)top_hits": {
      "top_hits": {
        "size": 10
      }
    }
  }
}
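Note that the stats above run over the owner field itself. If by an owner's "value" you meant the blue-minus-red count, here's an untested sketch of one way to compute it per owner and then summarize across owners with sibling pipeline aggregations (the terms size of 10000 is an assumed cap on distinct owners; for a particular owner you could add a term filter on owner to the query):
GET colorful/_search
{
  "size": 0,
  "query": {
    "range": {
      "date": {
        "gte": "now-3M",
        "lte": "now-1h"
      }
    }
  },
  "aggs": {
    "per_owner": {
      "terms": {
        "field": "owner",
        "size": 10000
      },
      "aggs": {
        "blues": {
          "filter": { "term": { "color": "blue" } }
        },
        "reds": {
          "filter": { "term": { "color": "red" } }
        },
        "value": {
          "bucket_script": {
            "buckets_path": {
              "blues": "blues._count",
              "reds": "reds._count"
            },
            "script": "params.blues - params.reds"
          }
        }
      }
    },
    "min_value": { "min_bucket": { "buckets_path": "per_owner>value" } },
    "max_value": { "max_bucket": { "buckets_path": "per_owner>value" } },
    "mean_value": { "avg_bucket": { "buckets_path": "per_owner>value" } },
    "median_value": {
      "percentiles_bucket": {
        "buckets_path": "per_owner>value",
        "percents": [ 50 ]
      }
    }
  }
}
min_bucket, max_bucket, avg_bucket and percentiles_bucket then give the min, max, mean and median of that per-owner value.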

Related

Transform in Elasticsearch not updating aggregated data

I am working on a scenario where I aggregate daily data per user. The data is processed in real time and stored in Elasticsearch, and I want to use Elasticsearch's own features to aggregate the data in real time. I've read about Transforms in Elasticsearch and found that this is what we need.
The problem is that when the source index is updated, the destination index that is supposed to hold the aggregations is not updated. This is the case I have tested:
source_index data model:
{
  "my_datetime": "2021-06-26T08:50:59",
  "client_no": "1",
  "my_date": "2021-06-26",
  "amount": 1000
}
and the transform I defined:
PUT _transform/my_transform
{
  "source": {
    "index": "source_index"
  },
  "pivot": {
    "group_by": {
      "client_no": {
        "terms": {
          "field": "client_no"
        }
      },
      "my_date": {
        "terms": {
          "field": "my_date"
        }
      }
    },
    "aggregations": {
      "sum_amount": {
        "sum": {
          "field": "amount"
        }
      },
      "count_amount": {
        "value_count": {
          "field": "amount"
        }
      }
    }
  },
  "description": "total amount sum per client",
  "dest": {
    "index": "my_analytic"
  },
  "frequency": "60s",
  "sync": {
    "time": {
      "field": "my_datetime",
      "delay": "10s"
    }
  }
}
Now, when I add another document or update existing documents in the source index, the destination index is not updated and does not take the new documents into account.
Also note that the Elasticsearch version I'm using is 7.13.
I also changed the date field to a timestamp (epoch format, like 1624740659000) but still have the same problem.
What am I doing wrong here?
Could it be that your "my_datetime" is further in the past than the "delay": "10s" (plus the time of "frequency": "60s")?
The docs for sync.field note:
In general, it’s a good idea to use a field that contains the ingest timestamp. If you use a different field, you might need to set the delay such that it accounts for data transmission delays.
You might just need a higher delay.
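If so, a minimal sketch of raising the delay via the transform update API (the 120s below is just an illustrative value; size it to your real ingest lag):
POST _transform/my_transform/_update
{
  "sync": {
    "time": {
      "field": "my_datetime",
      "delay": "120s"
    }
  }
}
You can then verify that new checkpoints are being picked up with GET _transform/my_transform/_stats.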

ElasticSearch/Kibana: get values that are not found in entries more recent than a certain date

I have a fleet of devices that push to ElasticSearch at regular intervals (let's say every 10 minutes) entries of this form:
{
  "deviceId": "unique-device-id",
  "timestamp": 1586390031,
  "payload": { various data }
}
I usually look at this through Kibana by filtering for the last 7 days of data and then drilling down by device id or some other piece of data from the payload.
Now I'm trying to get a sense of the health of this fleet by finding devices that haven't reported anything in the last hour let's say. I've been messing around with all sorts of filters and visualisations and the closest I got to this is a data table with device ids and the timestamp of the last entry for each, sorted by timestamp. This is useful but is somewhat hard to work with as I have a few thousand devices.
What I dream of is getting either the above mentioned table to contain only the device ids that have not reported in the last hour, or getting only two numbers: the total count of distinct device ids seen in the last 7 days and the total count of device ids not seen in the last hour.
Can you point me in the right direction, if any one of these is even possible?
I'll skip the table and take the second approach -- only getting the counts. I think it's possible to walk your way backwards to the rows from the counts.
Note: I'll be using a human-readable time format instead of timestamps, but epoch_second will work just as well in your real use case. I've also added a comment field to give each doc some background.
First, set up your index:
PUT fleet
{
  "mappings": {
    "properties": {
      "timestamp": {
        "type": "date",
        "format": "epoch_second||yyyy-MM-dd HH:mm:ss"
      },
      "comment": {
        "type": "text"
      },
      "deviceId": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
Sync a few docs -- I'm in UTC+2 so I chose these timestamps:
POST fleet/_doc
{
  "deviceId": "asdjhfa343",
  "timestamp": "2020-04-05 10:00:00",
  "comment": "in the last week"
}
POST fleet/_doc
{
  "deviceId": "asdjhfa343",
  "timestamp": "2020-04-10 13:05:00",
  "comment": "#asdjhfa343 in the last hour"
}
POST fleet/_doc
{
  "deviceId": "asdjhfa343",
  "timestamp": "2020-04-10 12:05:00",
  "comment": "#asdjhfa343 in the 2 hours"
}
POST fleet/_doc
{
  "deviceId": "asdjhfa343sdas",
  "timestamp": "2020-04-07 09:00:00",
  "comment": "in the last week"
}
POST fleet/_doc
{
  "deviceId": "asdjhfa343sdas",
  "timestamp": "2020-04-10 12:35:00",
  "comment": "in last 2hrs"
}
In total, we've got 5 docs and 2 distinct device ids with the following conditions:
all have appeared in the last 7d
both of which in the last 2h and
only one of which in the last hour
so I'm interested in finding precisely 1 deviceId which has appeared in the last 2hrs BUT not last 1hr.
We'll use a combination of filter (for range filtering), cardinality (for distinct counts), and bucket_script (for count differences) aggregations:
GET fleet/_search
{
  "size": 0,
  "aggs": {
    "distinct_devices_last7d": {
      "filter": {
        "range": {
          "timestamp": {
            "gte": "now-7d"
          }
        }
      },
      "aggs": {
        "uniq_device_count": {
          "cardinality": {
            "field": "deviceId.keyword"
          }
        }
      }
    },
    "not_seen_last1h": {
      "filter": {
        "range": {
          "timestamp": {
            "gte": "now-2h"
          }
        }
      },
      "aggs": {
        "device_ids_per_hour": {
          "date_histogram": {
            "field": "timestamp",
            "calendar_interval": "day",
            "format": "'disregard' -- yyyy-MM-dd"
          },
          "aggs": {
            "total_uniq_count": {
              "cardinality": {
                "field": "deviceId.keyword"
              }
            },
            "in_last_hour": {
              "filter": {
                "range": {
                  "timestamp": {
                    "gte": "now-1h"
                  }
                }
              },
              "aggs": {
                "uniq_count": {
                  "cardinality": {
                    "field": "deviceId.keyword"
                  }
                }
              }
            },
            "uniq_difference": {
              "bucket_script": {
                "buckets_path": {
                  "in_last_1h": "in_last_hour>uniq_count",
                  "in_last2h": "total_uniq_count"
                },
                "script": "params.in_last2h - params.in_last_1h"
              }
            }
          }
        }
      }
    }
  }
}
The date_histogram aggregation is just a placeholder that enables us to use a bucket script to get the final difference and not have to do any post-processing.
Since we passed size: 0, we're not interested in the hits section. So taking only the aggregations, here are the annotated results:
...
"aggregations" : {
  "not_seen_last1h" : {
    "doc_count" : 3,
    "device_ids_per_hour" : {
      "buckets" : [
        {
          "key_as_string" : "disregard -- 2020-04-10",
          "key" : 1586476800000,
          "doc_count" : 3,          <-- 3 device messages in the last 2hrs
          "total_uniq_count" : {
            "value" : 2             <-- 2 distinct IDs
          },
          "in_last_hour" : {
            "doc_count" : 1,
            "uniq_count" : {
              "value" : 1           <-- 1 distinct ID in the last hour
            }
          },
          "uniq_difference" : {
            "value" : 1.0           <-- 1 == final result!
          }
        }
      ]
    }
  },
  "distinct_devices_last7d" : {
    "meta" : { },
    "doc_count" : 5,                <-- 5 device messages in the last 7d
    "uniq_device_count" : {
      "value" : 2                   <-- 2 unique IDs
    }
  }
}
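And if you do want the table itself (only the device ids not seen in the last hour), a sketch using terms + max + bucket_selector should get you there. The cutoff param is a hypothetical epoch-millis value for "now minus 1 hour" that you'd compute client-side, since date math like now-1h isn't available inside the script:
GET fleet/_search
{
  "size": 0,
  "query": {
    "range": {
      "timestamp": {
        "gte": "now-7d"
      }
    }
  },
  "aggs": {
    "per_device": {
      "terms": {
        "field": "deviceId.keyword",
        "size": 10000
      },
      "aggs": {
        "last_seen": {
          "max": {
            "field": "timestamp"
          }
        },
        "stale_only": {
          "bucket_selector": {
            "buckets_path": {
              "last_seen": "last_seen"
            },
            "script": {
              "source": "params.last_seen < params.cutoff",
              "params": {
                "cutoff": 1586527500000
              }
            }
          }
        }
      }
    }
  }
}
The buckets that survive the bucket_selector are exactly the stale devices, each with its last_seen timestamp.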

Elasticsearch: need average per week of some value

I have simple data as:
sales, date_of_sales
What I need is the average per week, i.e. sum(sales) / number of weeks.
Please help.
What I have till now is:
{
  "size": 0,
  "aggs": {
    "WeekAggergation": {
      "date_histogram": {
        "field": "date_of_sales",
        "interval": "week"
      }
    },
    "TotalSales": {
      "sum": {
        "field": "sales"
      }
    },
    "myValue": {
      "bucket_script": {
        "buckets_path": {
          "myGP": "TotalSales",
          "myCount": "WeekAggergation._bucket_count"
        },
        "script": "params.myGP/params.myCount"
      }
    }
  }
}
I get the error
Invalid pipeline aggregation named [myValue] of type [bucket_script].
Only sibling pipeline aggregations are allowed at the top level.
bucket_script is a parent pipeline aggregation, so it has to live inside a multi-bucket aggregation; only sibling pipeline aggregations (like avg_bucket) are allowed at the top level. I think this may help:
{
  "size": 0,
  "aggs": {
    "WeekAggergation": {
      "date_histogram": {
        "field": "date_of_sale",
        "interval": "week",
        "format": "yyyy-MM-dd"
      },
      "aggs": {
        "TotalSales": {
          "sum": {
            "field": "sales"
          }
        },
        "AvgSales": {
          "avg": {
            "field": "sales"
          }
        }
      }
    },
    "avg_all_weekly_sales": {
      "avg_bucket": {
        "buckets_path": "WeekAggergation>TotalSales"
      }
    }
  }
}
Note that the TotalSales aggregation is now nested under the weekly histogram aggregation. (Watch the field name: the schema in the question uses date_of_sales while this query uses date_of_sale; use whichever matches your actual mapping.) This gives you the total of all sales in each weekly bucket.
Additionally, AvgSales provides a similar nested aggregation under the weekly histogram aggregation so you can see the average of all sales specific to that week.
Finally, the pipeline aggregation avg_all_weekly_sales will give the average of weekly sales based on the TotalSales bucket and the number of non-empty buckets - if you want to include empty buckets, add the gap_policy parameter like so:
...
"avg_all_weekly_sales": {
  "avg_bucket": {
    "buckets_path": "WeekAggergation>TotalSales",
    "gap_policy": "insert_zeros"
  }
}
...
(see: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-pipeline-avg-bucket-aggregation.html).
This pipeline aggregation may or may not be what you're actually looking for, so please check the math to ensure the result is what is expected, but should provide the correct output based on the original script.

Trends metric on Kibana dashboard, is it possible?

I want to create a metric in a Kibana dashboard which uses a ratio of multiple metrics with an offset period.
Example :
Date        Budget
YYYY-MM-DD  $
2019-01-01  15
2019-01-02  10
2019-01-03  5
2019-01-04  10
2019-01-05  12
2019-01-06  4
If I select the time range 2019-01-04 to 2019-01-06, I want to compute the ratio against the offset period 2019-01-01 to 2019-01-03.
To summarize: (sum(10+12+4) - sum(15+10+5)) / sum(10+12+4) ≈ -0.15
The evolution of my budget equals -15% (and this is what I want to display in the dashboard).
But with the Metric visualization it's not possible (no offset); with Visual Builder, different metric aggregations cannot have different offsets (too bad, because a bucket script would allow computing the ratio); and with Vega I haven't found a solution either.
Any idea? Thanks a lot.
Aurélien
NB: I use a Kibana version > 6.x
Please check the below sample mapping, which I've constructed based on the data you've provided, and the aggregation query that produces the figure you want.
Mapping:
PUT <your_index_name>
{
  "mappings": {
    "mydocs": {
      "properties": {
        "date": {
          "type": "date",
          "format": "yyyy-MM-dd"
        },
        "budget": {
          "type": "float"
        }
      }
    }
  }
}
Aggregation
I've made use of the following types of aggregation:
Date Histogram, with the interval set to 4d based on the data you've mentioned in the question
Sum
Derivative
Bucket Script which actually gives you the required budget evolution figure.
Also I'm assuming that the date format would be in yyyy-MM-dd and budget would be of float data type.
Below is what your aggregation query would look like.
POST <your_index_name>/_search
{
  "size": 0,
  "query": {
    "range": {
      "date": {
        "gte": "2019-01-01",
        "lte": "2019-01-06"
      }
    }
  },
  "aggs": {
    "my_date": {
      "date_histogram": {
        "field": "date",
        "interval": "4d",
        "format": "yyyy-MM-dd"
      },
      "aggs": {
        "sum_budget": {
          "sum": {
            "field": "budget"
          }
        },
        "budget_derivative": {
          "derivative": {
            "buckets_path": "sum_budget"
          }
        },
        "budget_evolution": {
          "bucket_script": {
            "buckets_path": {
              "input_1": "sum_budget",
              "input_2": "budget_derivative"
            },
            "script": "(params.input_2/params.input_1)*(100)"
          }
        }
      }
    }
  }
}
Note that the result that you are looking for would be in the budget_evolution part.
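As a sanity check against the sample data (assuming the 4d histogram buckets line up with the two 3-day periods): the later bucket would hold sum_budget = 10 + 12 + 4 = 26 and the earlier one 15 + 10 + 5 = 30, so budget_derivative = 26 - 30 = -4 and budget_evolution = -4 / 26 * 100 ≈ -15.4%, matching the -15% evolution from the question.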
Hope this helps!

Group-by query in Elasticsearch

I have an Elasticsearch cluster holding the analytics data of my website. There are page view events when a user visits a page. Each pageview event has a session-id field, which remains the same during the user session.
I would like to calculate the duration of each session by grouping the events by session id and taking the difference between the first event and the last event.
Is there any way I can achieve this with an Elasticsearch query?
Pageview events
[
  {
    "session-id": "234234-234234-324324-23432432",
    "url": "testpage1",
    "timestamp": 54323424222
  },
  {
    "session-id": "234234-234234-324324-23432432",
    "url": "testpage2",
    "timestamp": 54323424223
  },
  {
    "session-id": "234234-234234-324324-23432432",
    "url": "testpage3",
    "timestamp": 54323424224
  }
]
Session duration will be (54323424224 - 54323424222)ms
EDIT:
I was able to create a data table visualization with session id, max timestamp and min timestamp by querying min(timestamp) and max(timestamp) for each session id. Now all I need is the difference between these two aggs.
I don't think there's a way to compute the difference between max and min inside the buckets themselves. Try this and calculate the max - min difference client-side:
{
  "aggs": {
    "bySession": {
      "terms": {
        "field": "session-id.keyword"
      },
      "aggs": {
        "statsBySession": {
          "stats": {
            "field": "timestamp"
          }
        }
      }
    }
  }
}
The stats bucket aggregation will give you the min and max timestamps per session. You can then calculate the difference between them (max - min) using a bucket_script aggregation.
Refer: bucket-script-aggregation and stats-bucket-aggregation.
You can use the following query to calculate the difference between the max and min timestamps per session-id:
{
  "size": 0,
  "aggs": {
    "session": {
      "terms": {
        "field": "session-id.keyword",
        "size": 10
      },
      "aggs": {
        "stats_bucket": {
          "stats": {
            "field": "timestamp"
          }
        },
        "time_spent": {
          "bucket_script": {
            "buckets_path": {
              "min_stats": "stats_bucket.min",
              "max_stats": "stats_bucket.max"
            },
            "script": "params.max_stats - params.min_stats"
          }
        }
      }
    }
  }
}
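As a quick check with the three sample events from the question, the single session bucket would come back with time_spent = 54323424224 - 54323424222 = 2 (ms).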
