Hi, I have denormalized the data to be flat in Elasticsearch.
e.g.
{childId: 123, childAmount: 3.4, parentId: 1, parentAmount: 5.6}
{childId: 234, childAmount: 4.4, parentId: 1, parentAmount: 5.6}
{childId: 345, childAmount: 5.4, parentId: 2, parentAmount: 1.2}
As you can see, there are 3 children and 2 distinct parents (parent 1 appears twice with the same parentAmount).
How can I calculate the sum of parentAmount across distinct parents (which should be 6.8)?
Thanks. And if possible, how can I use a Kibana Metric visualization to show this value?
In Kibana you can do it with a Metric visualization backed by a query like the one below: the max sub-aggregation picks the single parentAmount per parentId, and the sum_bucket pipeline then adds those values up (6.8 for your example):
{
  "size": 0,
  "aggs": {
    "per_parent": {
      "terms": {
        "field": "parentId",
        "size": 25
      },
      "aggs": {
        "max": {
          "max": {
            "field": "parentAmount"
          }
        }
      }
    },
    "sum_amounts": {
      "sum_bucket": {
        "buckets_path": "per_parent>max"
      }
    }
  }
}
Related
Problem description
We have log files from different devices parsed into our Elasticsearch database, line by line. The log files are built as a ring buffer, so they always have a fixed size of 1000 lines. They can be manually exported whenever needed. After import and parsing in Elasticsearch, each document represents a single line of a log file with the following information:
DeviceID: 12345
FileType: ErrorLog
FileTimestamp: 2022-05-10 01:23:45
LogTimestamp: 2022-05-05 01:23:45
LogMessage: something very important here
Now I want a statistic on the timespan that is typically covered by that fixed number of lines, because, depending on how intensively a device is used, the files can cover anything from just a few days to several months... But since the log files are split into individual lines, this is not that trivial (I suppose).
My goal is to have a chart that shows me a "histogram" of the different log file timespans...
First Try: Visualize library > Data table
I started by creating a Data table in the Visualize library where I was able to aggregate the data as follows:
I added 3 Buckets --> so I have all lines bucketed by their original file:
Split rows DeviceID.keyword
Split rows FileType.keyword
Split rows FileTimestamp
... and 2 Metrics --> to show the log file timespan (I couldn't find a way to create a max-min metric, so I started with individual metrics for max and min):
Metric Min LogTimeStamp
Metric Max LogTimeStamp
This results in the following query:
{
  "aggs": {
    "2": {
      "terms": {
        "field": "DeviceID.keyword",
        "order": {
          "_key": "desc"
        },
        "size": 100
      },
      "aggs": {
        "3": {
          "terms": {
            "field": "FileType.keyword",
            "order": {
              "_key": "desc"
            },
            "size": 5
          },
          "aggs": {
            "4": {
              "terms": {
                "field": "FileTimestamp",
                "order": {
                  "_key": "desc"
                },
                "size": 100
              },
              "aggs": {
                "1": {
                  "min": {
                    "field": "LogTimeStamp"
                  }
                },
                "5": {
                  "max": {
                    "field": "LogTimeStamp"
                  }
                }
              }
            }
          }
        }
      }
    }
  },
  "size": 0,
  ...
}
... and this output:
DeviceID | FileType | FileTimestamp       | Min LogTimestamp    | Max LogTimestamp
---------|----------|---------------------|---------------------|--------------------
12345    | ErrorLog | 2022-05-10 01:23:45 | 2022-04-10 01:23:45 | 2022-05-10 01:23:45
...
Looks good so far! The expected result would be exactly 1 month for this example.
But my research showed that it is not possible to add the desired metric here, so I needed to try something else...
Second Try: Visualize library > Custom visualization (Vega-Lite)
So I did some more research and found out that Vega might be a possibility. I was already able to transfer the bucket part from the first attempt, and I also added a scripted metric to calculate the timespan directly (instead of min & max). So far, so good. The request body looks as follows:
body: {
  "aggs": {
    "DeviceID": {
      "terms": { "field": "DeviceID.keyword" },
      "aggs": {
        "FileType": {
          "terms": { "field": "FileType.keyword" },
          "aggs": {
            "FileTimestamp": {
              "terms": { "field": "FileTimestamp" },
              "aggs": {
                "timespan": {
                  "scripted_metric": {
                    "init_script": "state.values = [];",
                    "map_script": "state.values.add(doc['#timestamp'].value);",
                    "combine_script": "long min = Long.MAX_VALUE; long max = 0; for (t in state.values) { long tms = t.toInstant().toEpochMilli(); if(tms > max) max = tms; if(tms < min) min = tms; } return [max,min];",
                    "reduce_script": "long min = Long.MAX_VALUE; long max = 0; for (a in states) { if(a[0] > max) max = a[0]; if(a[1] < min) min = a[1]; } return max-min;"
                  }
                }
              }
            }
          }
        }
      }
    }
  },
  "size": 0,
}
...with this response (unnecessary information removed to reduce complexity):
{
  "took": 12245,
  "timed_out": false,
  "_shards": { ... },
  "hits": { ... },
  "aggregations": {
    "DeviceID": {
      "buckets": [
        {
          "key": "12345",
          "FileType": {
            "buckets": [
              {
                "key": "ErrorLog",
                "FileTimeStamp": {
                  "buckets": [
                    {
                      "key": 1638447972000,
                      "key_as_string": "2021-12-02T12:26:12.000Z",
                      "doc_count": 1000,
                      "timespan": {
                        "value": 31339243240
                      }
                    },
                    {
                      "key": 1636023881000,
                      "key_as_string": "2021-11-04T11:04:41.000Z",
                      "doc_count": 1000,
                      "timespan": {
                        "value": 31339243240
                      }
                    }
                  ]
                }
              },
              {
                "key": "InfoLog",
                "FileTimeStamp": {
                  "buckets": [
                    {
                      "key": 1635773438000,
                      "key_as_string": "2021-11-01T13:30:38.000Z",
                      "doc_count": 1000,
                      "timespan": {
                        "value": 2793365000
                      }
                    },
                    {
                      "key": 1636023881000,
                      "key_as_string": "2021-11-04T11:04:41.000Z",
                      "doc_count": 1000,
                      "timespan": {
                        "value": 2643772000
                      }
                    }
                  ]
                }
              }
            ]
          }
        },
        {
          "key": "12346",
          "FileType": {
            ...
          }
        },
        ...
      ]
    }
  }
}
Yeah, it seems to work! Now I have the timespan for each original log file.
Question
Now I am stuck with:
I want to average the timespans for each original log file (identified via the combination of DeviceID + FileType + FileTimestamp) to prevent devices with multiple imported log files from having a higher weight than devices with only one imported log file. I tried to add another aggregation for the average, but I couldn't figure out where to put it so that the result of the scripted_metric is used. My closest attempt was to put an avg_bucket after the FileTimestamp bucket:
Request:
body: {
  "aggs": {
    "DeviceID": {
      "terms": { "field": "DeviceID.keyword" },
      "aggs": {
        "FileType": {
          "terms": { "field": "FileType.keyword" },
          "aggs": {
            "FileTimestamp": {
              "terms": { "field": "FileTimestamp" },
              "aggs": {
                "timespan": {
                  "scripted_metric": {
                    "init_script": "state.values = [];",
                    "map_script": "state.values.add(doc['FileTimestamp'].value);",
                    "combine_script": "long min = Long.MAX_VALUE; long max = 0; for (t in state.values) { long tms = t.toInstant().toEpochMilli(); if(tms > max) max = tms; if(tms < min) min = tms; } return [max,min];",
                    "reduce_script": "long min = Long.MAX_VALUE; long max = 0; for (a in states) { if(a[0] > max) max = a[0]; if(a[1] < min) min = a[1]; } return max-min;"
                  }
                }
              }
            },
            // new part - start
            "avg_timespan": {
              "avg_bucket": {
                "buckets_path": "FileTimestamp>timespan"
              }
            }
            // new part - end
          }
        }
      }
    }
  },
  "size": 0,
}
But I receive the following error:
EsError: buckets_path must reference either a number value or a single value numeric metric aggregation, got: [InternalScriptedMetric] at aggregation [timespan]
So is that the right spot (just not applicable to a scripted metric)? Or am I on the wrong path?
I need to plot all this, but I can't find my way through all the buckets, etc.
I read about flattening (which would probably be a good idea, so that, if done by the server, the result would not be that complex), but I don't know where and how to apply the flattening transformation.
I imagine the resulting chart like this:
x-axis = log file timespan, where the timespan is "binned" according to a given step size (e.g. 1 day), so there are only bars for each bin (1 = 0-1days, 2 = 1-2days, 3 = 2-3days, etc.) and not for all the different timespans of log files
y-axis = count of devices
type: lines or vertical bars, split by file type
e.g. something like this:
Any help is really appreciated! Thanks in advance!
If you have the privileges to create a transform, then the Elastic Painless example "Getting duration by using bucket script" can do exactly what you want. It creates a new index in which all documents are grouped according to your needs.
To create the transform:
go to Stack Management > Transforms > + Create a transform
select Edit JSON config for the Pivot configuration object
paste & apply the JSON below
check whether the result is as expected in the Transform preview
fill out the rest of the transform details + save the transform
JSON config
{
  "group_by": {
    "DeviceID": {
      "terms": {
        "field": "DeviceID.keyword"
      }
    },
    "FileType": {
      "terms": {
        "field": "FileType.keyword"
      }
    },
    "FileTimestamp": {
      "terms": {
        "field": "FileTimestamp"
      }
    }
  },
  "aggregations": {
    "TimeStampStats": {
      "stats": {
        "field": "#timestamp"
      }
    },
    "TimeSpan": {
      "bucket_script": {
        "buckets_path": {
          "first": "TimeStampStats.min",
          "last": "TimeStampStats.max"
        },
        "script": "params.last - params.first"
      }
    }
  }
}
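For reference, if you would rather create the transform through the API than the UI (available since Elasticsearch 7.5 under the _transform endpoint), a rough sketch could look like the following; the transform id and the source/destination index names are assumptions, and the pivot block is simply the JSON config from above:
# "my-log-index" and "logfile-timespan" are placeholder names, adjust to your setup
PUT _transform/logfile-timespan
{
  "source": {
    "index": "my-log-index"
  },
  "dest": {
    "index": "logfile-timespan"
  },
  "pivot": {
    "group_by": {
      "DeviceID": { "terms": { "field": "DeviceID.keyword" } },
      "FileType": { "terms": { "field": "FileType.keyword" } },
      "FileTimestamp": { "terms": { "field": "FileTimestamp" } }
    },
    "aggregations": {
      "TimeStampStats": { "stats": { "field": "#timestamp" } },
      "TimeSpan": {
        "bucket_script": {
          "buckets_path": {
            "first": "TimeStampStats.min",
            "last": "TimeStampStats.max"
          },
          "script": "params.last - params.first"
        }
      }
    }
  }
}
After creating it, POST _transform/logfile-timespan/_start runs the transform and fills the destination index.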
Now you can create a chart from the new index, for example with these settings:
Vertical Bars
Metrics:
Y-axis = "Count"
Buckets:
X-axis = "TimeSpan"
Split series = "FileType"
Given an index with documents of the following format
{
  "userA": "user1",
  "relation": 10,
  "userB": "user2"
}
How can I create an aggregation query that displays, for each user (from a given list), the sum of 'relation' values between them?
For example: given userX, userY
the result would be:
{
user4: {user1: 100, user2: 300, user3: 350},
...
userX: {user4: 123, user5: 456}
}
I tried to do it using 2 separate queries like the one below (the second one with userB instead of userA in the aggs field):
GET myindex*/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "terms": {
            "userA": [<input user ids>]
          }
        },
        {
          "terms": {
            "userB": [<input user ids>]
          }
        }
      ]
    }
  },
  "aggs": {
    "connections": {
      "terms": {
        "field": "userA" // Second query with `userB`
      },
      "aggs": {
        "privateConversationCount": {
          "avg": {
            "field": "privateConversationCount"
          }
        }
      }
    }
  }
}
But this is not correct; it requires a nested aggregation.
How could I write a query that will answer that need?
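Not an authoritative answer, but a minimal sketch of the nested version: a terms aggregation on userA with a terms sub-aggregation on userB and a sum of relation inside it (field names mirror the question; the agg names and sizes are my own assumptions):
# agg names and the size of 100 are arbitrary placeholders
GET myindex*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "terms": { "userA": [<input user ids>] } },
        { "terms": { "userB": [<input user ids>] } }
      ]
    }
  },
  "aggs": {
    "by_userA": {
      "terms": { "field": "userA", "size": 100 },
      "aggs": {
        "by_userB": {
          "terms": { "field": "userB", "size": 100 },
          "aggs": {
            "relation_sum": {
              "sum": { "field": "relation" }
            }
          }
        }
      }
    }
  }
}
Each by_userA bucket then contains one by_userB bucket per counterpart with the summed relation. Note that this only covers pairs where the given user is stored in userA; depending on how the pairs are indexed you may still need the mirrored query (or index every relation in both directions).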
I have an Elasticsearch index with documents like these:
{
  "_source": {
    "category": 1,
    "value": 10,
    "utctimestamp": "2020-10-21T15:32:00.000+00:00"
  }
}
In Grafana, I'm able to retrieve the value of the most recent event with the following query:
Now, I would like to get the MAX value of the most recent documents for each distinct value of category in the given time range.
This means that if I have the 3 following documents in my index :
{
  "_source": {
    "category": 1,
    "value": 10,
    "utctimestamp": "2020-10-21T10:30:00"
  }
},
{
  "_source": {
    "category": 2,
    "value": 20,
    "utctimestamp": "2020-10-21T10:20:00"
  }
},
{
  "_source": {
    "category": 2,
    "value": 30,
    "utctimestamp": "2020-10-21T10:10:00"
  }
}
I would like the query to return the value MAX(10, 20) which is 20. Because the last document for category 1 has the value 10, and the last document for category 2 has the value 20. (If there were a 3rd category, its last value should also be included in the MAX).
Is it possible?
Thanks to @Val for his brilliant query in "Sum over top_hits aggregation", your query would be something like this:
{
  "size": 0,
  "aggs": {
    "category": {
      "terms": {
        "field": "category",
        "size": 10
      },
      "aggs": {
        "latest_quantity": {
          "scripted_metric": {
            "init_script": "params._agg.quantities = new TreeMap()",
            "map_script": "params._agg.quantities.put(doc.utctimestamp.date, [doc.utctimestamp.date.millis, doc.value.value])",
            "combine_script": "return params._agg.quantities.lastEntry().getValue()",
            "reduce_script": "def maxkey = 0; def qty = 0; for (a in params._aggs) {def currentKey = a[0]; if (currentKey > maxkey) {maxkey = currentKey; qty = a[1]} } return qty;"
          }
        }
      }
    },
    "max_quantities": {
      "max_bucket": {
        "buckets_path": "category>latest_quantity.value"
      }
    }
  }
}
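A side note: params._agg / params._aggs is the scripted_metric syntax of Elasticsearch 6.x and earlier. On 7.x and later the scripts use the state and states variables instead, so the metric part would look roughly like this (a sketch, assuming utctimestamp is a date field and value is numeric; not tested against your mapping):
"latest_quantity": {
  "scripted_metric": {
    "init_script": "state.quantities = new TreeMap()",
    "map_script": "long ts = doc['utctimestamp'].value.toInstant().toEpochMilli(); state.quantities.put(ts, [ts, doc['value'].value])",
    "combine_script": "return state.quantities.isEmpty() ? null : state.quantities.lastEntry().getValue()",
    "reduce_script": "def maxKey = -1L; def qty = null; for (a in states) { if (a != null && a[0] > maxKey) { maxKey = a[0]; qty = a[1]; } } return qty;"
  }
}
The rest of the request (the terms aggregation and the max_bucket pipeline) stays the same.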
I ended up creating a middleware service with a REST API between Elasticsearch and Grafana that can make all the custom requests to Elasticsearch (like the request given in the answer of @saeednasehi), and I query the middleware from Grafana with the JSON data source plugin.
This is my mapping:
{
  "settings" : {
    "number_of_shards" : 2,
    "number_of_replicas" : 1
  },
  "mappings" : {
    "cpt_logs_mapping" : {
      "properties" : {
        "channel_id" : {"type":"integer","store":"yes","index":"not_analyzed"},
        "playing_date" : {"type":"string","store":"yes","index":"not_analyzed"},
        "country_code" : {"type":"text","store":"yes","index":"analyzed"},
        "playtime_in_sec" : {"type":"integer","store":"yes","index":"not_analyzed"},
        "channel_name" : {"type":"text","store":"yes","index":"analyzed"},
        "device_report_tag" : {"type":"text","store":"yes","index":"analyzed"}
      }
    }
  }
}
I want to query the index similarly to the following MySQL query:
SELECT
    channel_name,
    SUM(`playtime_in_sec`) AS playtime_in_sec
FROM
    channel_play_times_bar_chart
WHERE
    country_code = 'country' AND
    device_report_tag = 'device' AND
    channel_name = 'channel' AND
    playing_date BETWEEN 'date_range_start' AND 'date_range_end'
GROUP BY channel_id
ORDER BY SUM(`playtime_in_sec`) DESC
LIMIT 30;
So far my Query DSL looks like this:
{
  "size": 0,
  "aggs": {
    "ch_agg": {
      "terms": {
        "field": "channel_id",
        "size": 30,
        "order": {
          "sum_agg": "desc"
        }
      },
      "aggs": {
        "sum_agg": {
          "sum": {
            "field": "playtime_in_sec"
          }
        }
      }
    }
  }
}
QUESTION 1
The Query DSL I have written does return the top 30 channel_ids with respect to playtime, but I am confused about how to also add the other filters to the search, i.e. country_code, device_report_tag & playing_date.
QUESTION 2
Another issue is that the result set contains only the channel_id and playtime fields, unlike the MySQL result set, which returns the channel_name and playtime_in_sec columns. In other words, I want to aggregate on the channel_id field, but the result set should also return the corresponding channel_name of each group.
NOTE: Performance is a top priority here, as this is supposed to run behind a graph generator querying millions of docs or more.
TEST DATA
hits: [
  {
    _index: "cpt_logs_index",
    _type: "cpt_logs_mapping",
    _id: "",
    _score: 1,
    _source: {
      ChID: 1453,
      playtime_in_sec: 35,
      device_report_tag: "mydev",
      channel_report_tag: "Sony Six",
      country_code: "SE",
      #timestamp: "2017-08-11",
    }
  },
  {
    _index: "cpt_logs_index",
    _type: "cpt_logs_mapping",
    _id: "",
    _score: 1,
    _source: {
      ChID: 145,
      playtime_in_sec: 25,
      device_report_tag: "mydev",
      channel_report_tag: "Star Movies",
      country_code: "US",
      #timestamp: "2017-08-11",
    }
  },
  {
    _index: "cpt_logs_index",
    _type: "cpt_logs_mapping",
    _id: "",
    _score: 1,
    _source: {
      ChID: 12,
      playtime_in_sec: 15,
      device_report_tag: "mydev",
      channel_report_tag: "HBO",
      country_code: "PK",
      #timestamp: "2017-08-12",
    }
  }
]
QUESTION 1:
Are you looking to add a filter/query to the example above? If so, you can simply add a "query" node to the query document:
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "terms": { "country_code": ["pk", "us", "se"] } },
        { "range": { "#timestamp": { "gt": "2017-01-01", "lte": "2017-08-11" } } }
      ]
    }
  },
  "aggs": {
    "ch_agg": {
      "terms": {
        "field": "ChID",
        "size": 30
      },
      "aggs": {
        "ch_report_tag_agg": {
          "terms": {
            "field": "channel_report_tag.keyword"
          },
          "aggs": {
            "sum_agg": {
              "sum": {
                "field": "playtime_in_sec"
              }
            }
          }
        }
      }
    }
  }
}
You can use all the normal Elasticsearch queries/filters to pre-filter your search before you start aggregating. (Regarding performance: Elasticsearch applies any filters/queries before it starts aggregating, so any filtering you can do here will help a lot.)
QUESTION 2:
Off the top of my head I would suggest one of two solutions (unless I'm completely misunderstanding the question):
Add aggs levels for the fields you want in the output, in the order you want to drill down. (You can nest aggs within aggs quite deeply without issues and get the bonus of a count on each level.)
Use the top_hits aggregation on the "lowest" level of aggs, and specify which fields you want in the output using "_source": { "include": [/fields/] }, as in the sketch below.
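For illustration, a rough sketch of the second option against the test data above (the agg names, sizes, and the included field are my own choices, not something confirmed from your setup):
{
  "size": 0,
  "aggs": {
    "ch_agg": {
      "terms": {
        "field": "ChID",
        "size": 30,
        "order": { "sum_agg": "desc" }
      },
      "aggs": {
        "sum_agg": {
          "sum": { "field": "playtime_in_sec" }
        },
        "channel": {
          "top_hits": {
            "size": 1,
            "_source": { "include": ["channel_report_tag"] }
          }
        }
      }
    }
  }
}
Each ChID bucket then carries the channel name from one representative document next to the summed playtime.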
Can you provide a few records of test data?
Also, it is useful to know which version of Elasticsearch you're running, as the syntax and behaviour change a lot between major versions.
How can I do an SQL-like GROUP BY statement in a '_search' query in Elasticsearch?
I basically need to:
1 - Filter a bunch of items using multiple filters, queries etc. Done
2 - Put these results into buckets of unique category_id. 'category_id' is currently mapped as a 'float' field of the item document type. I also need to display one of the items matching the above filters from each bucket.
3 - Paginate through these buckets
Note: Item count: 1 Million, Unique category_id count: 60,000
I would like to get all documents of type 'items' grouped by the 'category_id' field. In the results I would like to get a list of all unique 'category_id' values and a single item from each category (the first or any item, it doesn't matter). I'd like to be able to use "from" and "size" to paginate through these results.
For example if i had data to the effect of:
id:1, category_id: 1, color:'blue',
id:2, category_id: 1, color:'red',
id:3, category_id: 1, color:'red',
id:4, category_id: 2, color:'blue',
id:5, category_id: 2, color:'red',
id:6, category_id: 3, color:'blue',
id:7, category_id: 3, color:'blue',
id:8, category_id: 3, color:'blue',
For example, I want to get all items that have the color 'red', grouped by category_id, and get back data to the effect of:
category_id: 1
{
item: { id:2, category_id: 1, color:'red'}
},
category_id: 2
{
item: { id:5, category_id: 2, color:'red'}
}
This is what I have so far, but it doesn't get the correct top hit, and I don't think it allows multiple filters and queries, or pagination.
GET swap/item/_search
{
  "size": 0,
  "aggs": {
    "color_filtered_items": {
      "filter": {
        "and": [
          {
            "terms": {
              "color": [
                "red"
              ]
            }
          }
        ]
      },
      "aggs": {
        "group_by_cat_id": {
          "terms": {
            "field": "category_id",
            "size": 10
          },
          "aggs": {
            "items": {
              "top_hits": {
                "_source": {
                  "include": [
                    "name",
                    "id",
                    "category_id",
                    "color"
                  ]
                },
                "size": 1
              }
            }
          }
        }
      }
    }
  }
}
Hacks, workarounds, and suggestions for changes to data storage are welcome. Any help is greatly appreciated.
Thank you all :)
The following should work, assuming that you don't want a number-range-based aggregation for category_id.
Also, you can't paginate over aggregated results, but you can control the size per aggregation.
{
  "aggs": {
    "itemsAgg": {
      "terms": {
        "field": "items",
        "size": 10
      },
      "aggs": {
        "categoryAgg": {
          "terms": {
            "field": "category_id",
            "size": 10
          }
        }
      }
    }
  }
}
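One more idea for the pagination requirement, offered as a sketch rather than a definitive answer: since Elasticsearch 6.1 a composite aggregation can page through buckets with an after key, and it accepts a top_hits sub-aggregation for the single representative item per group. Assuming color is a keyword field (otherwise use color.keyword), something along these lines:
# the agg names and the page size of 100 are arbitrary placeholders
GET swap/item/_search
{
  "size": 0,
  "query": {
    "term": { "color": "red" }
  },
  "aggs": {
    "by_category": {
      "composite": {
        "size": 100,
        "sources": [
          { "category_id": { "terms": { "field": "category_id" } } }
        ]
      },
      "aggs": {
        "item": {
          "top_hits": {
            "size": 1,
            "_source": { "include": ["id", "category_id", "color"] }
          }
        }
      }
    }
  }
}
Each response contains an after_key; sending it back as "after" inside the composite block fetches the next page of category buckets, which gives you from/size-like paging over the groups.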