Elasticsearch: Tricky aggregation with sum and comparison

I am trying to pull statistics from my Elasticsearch cluster that I cannot figure out.
What I want in the end is a count of streams (field: status) over time (field: timestamp) for a specific item (field: media).
The data are nginx logs with anonymized IPs (field: ip_hash) and user agents (field: http_user_agent). To get a valid count I need to sum up the bytes transferred (field: bytes_sent) per IP / user agent combination and compare that sum to a minimum threshold (an integer): a stream is only valid / only counts if at least XY bytes of it have been transferred in total.
"_source": {
"media": "my-stream.001",
"http_user_agent": "Spotify/8.4.44 Android/29 (SM-T535)",
"ip_hash": "fcd2653c44c1d8e33ef5d58ac5a33c2599b68f05d55270a8946166373d79a8212a49f75bcf3f71a62b9c71d3206c6343430a9ebec9c062a0b308a48838161ce8",
"timestamp": "2022-02-05 01:32:23.941",
"bytes_sent": 4893480,
"status": 206
}
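For reference, the aggregations below assume the terms fields are mapped as keyword and bytes_sent as a numeric type. A minimal mapping sketch (the index name and the date format are assumptions derived from the sample document above):
PUT /logdata_example
{
  "mappings": {
    "properties": {
      "media": { "type": "keyword" },
      "http_user_agent": { "type": "keyword" },
      "ip_hash": { "type": "keyword" },
      "timestamp": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss.SSS" },
      "bytes_sent": { "type": "long" },
      "status": { "type": "short" }
    }
  }
}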
Where I am having trouble is summing up the transferred bytes per unique user agent / IP hash combination and comparing that sum to the threshold.
Any pointers on how I could solve this are appreciated. Thank you!
So far I've got this:
GET /logdata_*/_search
{
"size": 0,
"query": {
"bool": {
"must": [
{
"range": {
"timestamp": {
"gte": "now-1w/d",
"lt": "now/d"
}
}
}
]
}
},
"aggs": {
"status206":{
"filter": {
"term": {
"status": "206"
}
},
"aggs": {
"medias": {
"terms": {
"field": "media",
"size": 10
},
"aggs": {
"ips": {
"terms": {
"field": "ip_hash",
"size": 10
},
"aggs": {
"clients": {
"terms": {
"field": "http_user_agent",
"size": 10
},
"aggs": {
"transferred": {
"sum": {
"field": "bytes_sent"
}
}
}
}
}
}
}
}
}
}
}
}
Which gives something like this:
{
"took" : 1563,
"timed_out" : false,
"_shards" : {
"total" : 12,
"successful" : 12,
"skipped" : 8,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 10000,
"relation" : "gte"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"status206" : {
"doc_count" : 1307130,
"medias" : {
"doc_count_error_upper_bound" : 7612,
"sum_other_doc_count" : 1163149,
"buckets" : [
{
"key" : "20220402_ETD_Podcast_2234_Eliten_-_VD_Hanson.mp3",
"doc_count" : 21772,
"ips" : {
"doc_count_error_upper_bound" : 12,
"sum_other_doc_count" : 21574,
"buckets" : [
{
"key" : "ae55a10beda61afd3641fe2a6ca8470262d5a0c07040d3b9b8285ea1a4dba661a0502a7974dc5a4fecbfbbe5b7c81544cdcea126271533e724feb3d7750913a5",
"doc_count" : 38,
"clients" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Deezer/7.0.0.xxx (Android; 10; Mobile; de) samsung SM-G960F",
"doc_count" : 38,
"transferred" : {
"value" : 7582635.0
}
}
]
}
},
{
"key" : "60082e96eb57c4a8b7962dc623ef7446fbc08cea676e75c4ff94ab5324dec93a6db1848d45f6dcc6e7acbcb700bb891cf6bee66e1aa98fc228107104176734ff",
"doc_count" : 37,
"clients" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Deezer/7.0.0.xxx (Android; 12; Mobile; de) samsung SM-N770F",
"doc_count" : 36,
"transferred" : {
"value" : 7252448.0
}
},
{
"key" : "Mozilla/5.0 (Linux; Android 11; RMX2063) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.58 Mobile Safari/537.36",
"doc_count" : 1,
"transferred" : {
"value" : 843367.0
}
}
]
}
},
Now I would need to check that "transferred" is gte the threshold, in which case it would count as one stream. In the end I need the count of all applicable streams.

You can try the following:
GET _search?filter_path=aggregations.valid_streams.count
{
"size": 0,
"query": {
"bool": {
"must": [
{
"range": {
"timestamp": {
"gte": "now-1w/d",
"lt": "now/d"
}
}
},
{
"match": {
"status": "206"
}
}
]
}
},
"aggs": {
"streams": {
"multi_terms": {
"size": "65536",
"terms": [
{
"field": "media"
},
{
"field": "ip_hash"
},
{
"field": "http_user_agent"
}
]
},
"aggs": {
"transferred": {
"sum": {
"field": "bytes_sent"
}
},
"threshold": {
"bucket_selector": {
"buckets_path": {
"total": "transferred"
},
"script": "params.total > 12345"
}
}
}
},
"valid_streams": {
"stats_bucket": {
"buckets_path": "streams>transferred"
}
}
}
}
Explanation:
streams - combined multi_terms aggregation, since every distinct combination of these fields should count as a separate stream. This is mainly for readability; change it if it doesn't fit your logic.
transferred - sum aggregation that adds up the sent bytes per stream.
threshold - bucket_selector aggregation which filters out the streams that didn't reach the XY threshold.
valid_streams - stats_bucket aggregation which returns a count field containing the number of remaining buckets = valid streams. As a bonus, it also gives you stats about your valid streams (e.g. average bytes).
The filter_path query parameter is used to reduce the response to only the aggregation output.
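With that filter_path, the whole response collapses to just the count of surviving buckets, i.e. the number of valid streams. Shape-wise it would look like this (the value here is illustrative):
{
  "aggregations" : {
    "valid_streams" : {
      "count" : 1234
    }
  }
}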

Related

Query filter for searching rollup index works with epoch time fails with date math

How do we query (filter) a rollup index?
For example, based on the query here
Request:
{
"size": 0,
"aggregations": {
"timeline": {
"date_histogram": {
"field": "timestamp",
"fixed_interval": "7d"
},
"aggs": {
"nodes": {
"terms": {
"field": "node"
},
"aggs": {
"max_temperature": {
"max": {
"field": "temperature"
}
},
"avg_voltage": {
"avg": {
"field": "voltage"
}
}
}
}
}
}
}
}
Response:
{
"took" : 93,
"timed_out" : false,
"terminated_early" : false,
"_shards" : ... ,
"hits" : {
"total" : {
"value": 0,
"relation": "eq"
},
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"timeline" : {
"buckets" : [
{
"key_as_string" : "2018-01-18T00:00:00.000Z",
"key" : 1516233600000,
"doc_count" : 6,
"nodes" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "a",
"doc_count" : 2,
"max_temperature" : {
"value" : 202.0
},
"avg_voltage" : {
"value" : 5.1499998569488525
}
},
{
"key" : "b",
"doc_count" : 2,
"max_temperature" : {
"value" : 201.0
},
"avg_voltage" : {
"value" : 5.700000047683716
}
},
{
"key" : "c",
"doc_count" : 2,
"max_temperature" : {
"value" : 202.0
},
"avg_voltage" : {
"value" : 4.099999904632568
}
}
]
}
}
]
}
}
}
How can I filter to, say, the last 3 days? Is that possible?
For a test case I used a fixed_interval of 1m (one minute, and also 60 minutes), tried the following, and got the error "all query shards failed". Is it possible to add a query filter to rollup aggregations?
Test Query for searching rollup index
{
"size": 0,
"query": {
"range": {
"timestamp": {
"gte": "now-3d/d",
"lt": "now/d"
}
}
},
"aggregations": {
"timeline": {
"date_histogram": {
"field": "timestamp",
"fixed_interval": "7d"
},
"aggs": {
"nodes": {
"terms": {
"field": "node"
},
"aggs": {
"max_temperature": {
"max": {
"field": "temperature"
}
},
"avg_voltage": {
"avg": {
"field": "voltage"
}
}
}
}
}
}
}
}
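One thing worth checking: a rollup index is normally queried through the _rollup_search endpoint rather than plain _search, which accepts only term, terms, range, match_all and compound (bool) queries, and the queried field must be one of the rollup job's grouping fields. A sketch, assuming timestamp is the job's date_histogram group field and the rollup index is called sensor_rollup (both names are assumptions):
GET /sensor_rollup/_rollup_search
{
  "size": 0,
  "query": {
    "range": {
      "timestamp": {
        "gte": "now-3d/d",
        "lt": "now/d"
      }
    }
  },
  "aggregations": {
    "timeline": {
      "date_histogram": {
        "field": "timestamp",
        "fixed_interval": "60m"
      },
      "aggs": {
        "max_temperature": {
          "max": { "field": "temperature" }
        }
      }
    }
  }
}
Note that the date_histogram interval in a rollup search must be a multiple of the interval the rollup job was configured with.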

bucket aggregation/bucket_script computation

How do I apply a computation on bucket fields via bucket_script? Beyond that, I would like to understand how to aggregate on distinct results.
For example, below is a sample query, and the response.
What I am looking for is to aggregate the following into two fields:
the sum of dist.value over all buckets, e.g. from the response below: 1+2=3
the sum of (dist.value x key) over all buckets, e.g. (1x10)+(2x20)=50
Query
{
"size": 0,
"query": {
"bool": {
"must": [
{
"match": {
"field": "value"
}
}
]
}
},
"aggs":{
"sales_summary":{
"terms":{
"field":"qty",
"size":"100"
},
"aggs":{
"dist":{
"cardinality":{
"field":"somekey.keyword"
}
}
}
}
}
}
Query Result:
{
"aggregations": {
"sales_summary": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 10,
"doc_count": 100,
"dist": {
"value": 1
}
},
{
"key": 20,
"doc_count": 200,
"dist": {
"value": 2
}
}
]
}
}
}
You need to use a sum_bucket aggregation, a pipeline aggregation that sums the results of the cardinality aggregation across all the buckets.
Search Query for the sum of dist.value over all buckets (1+2=3):
POST idxtest1/_search
{
"size": 0,
"aggs": {
"sales_summary": {
"terms": {
"field": "qty",
"size": "100"
},
"aggs": {
"dist": {
"cardinality": {
"field": "pageview"
}
}
}
},
"sum_buckets": {
"sum_bucket": {
"buckets_path": "sales_summary>dist"
}
}
}
}
Search Response :
"aggregations" : {
"sales_summary" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 10,
"doc_count" : 3,
"dist" : {
"value" : 2
}
},
{
"key" : 20,
"doc_count" : 3,
"dist" : {
"value" : 3
}
}
]
},
"sum_buckets" : {
"value" : 5.0
}
}
For the second requirement, you first need to modify each bucket's value using a bucket_script aggregation, and then run the sum_bucket aggregation on the modified values.
Search Query for the sum of (dist.value x key) over all buckets, e.g. (1x10)+(2x20)=50:
POST idxtest1/_search
{
"size": 0,
"aggs": {
"sales_summary": {
"terms": {
"field": "qty",
"size": "100"
},
"aggs": {
"dist": {
"cardinality": {
"field": "pageview"
}
},
"format-value-agg": {
"bucket_script": {
"buckets_path": {
"newValue": "dist"
},
"script": "params.newValue * 10"
}
}
}
},
"sum_buckets": {
"sum_bucket": {
"buckets_path": "sales_summary>format-value-agg"
}
}
}
}
Search Response :
"aggregations" : {
"sales_summary" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 10,
"doc_count" : 3,
"dist" : {
"value" : 2
},
"format-value-agg" : {
"value" : 20.0
}
},
{
"key" : 20,
"doc_count" : 3,
"dist" : {
"value" : 3
},
"format-value-agg" : {
"value" : 30.0
}
}
]
},
"sum_buckets" : {
"value" : 50.0
}
}
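Note that the script above hardcodes * 10, which matches the bucket key only for the first bucket; a true (dist.value x key) would multiply the second bucket by 20. One way to make the multiplier key-dependent (a sketch, not part of the original answer; the key_value and dist-times-key names are mine) is to expose the key as a metric via a max sub-aggregation on qty, since all documents in a bucket share that value, and reference it from the bucket_script:
POST idxtest1/_search
{
  "size": 0,
  "aggs": {
    "sales_summary": {
      "terms": {
        "field": "qty",
        "size": "100"
      },
      "aggs": {
        "dist": {
          "cardinality": { "field": "pageview" }
        },
        "key_value": {
          "max": { "field": "qty" }
        },
        "dist-times-key": {
          "bucket_script": {
            "buckets_path": {
              "dist": "dist",
              "key": "key_value"
            },
            "script": "params.dist * params.key"
          }
        }
      }
    },
    "sum_buckets": {
      "sum_bucket": {
        "buckets_path": "sales_summary>dist-times-key"
      }
    }
  }
}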

Finding sum of the "key" values in bucket aggregations in Elasticsearch

I have the following ES query:
GET database/_search
{
"from": 0,
"size": 0,
"query": {
"bool": {
"must": [
{
"nested": {
"query": {
"term": {
"colleges.institution_full_name": {
"value": "Academy of Sciences",
"boost": 1.0
}
}
},
"path": "colleges"
}
}
]
}
},
"_source": false,
"aggs": {
"publication_years": {
"terms": {
"field": "publication_year"
}
}
}
}
And I got the following response:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 232,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"publication_years" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 2016,
"doc_count" : 119
},
{
"key" : 2017,
"doc_count" : 90
},
{
"key" : 2018,
"doc_count" : 22
},
{
"key" : 2019,
"doc_count" : 1
}
]
}
}
}
Now I want to calculate the average of the key values of the publication years, i.e. the average of 2016, 2017, 2018 & 2019. How can I modify my ES query to get the average of the publication years instead of getting every year individually? I tried using an "avg" aggregation, but it also takes "doc_count" into consideration while calculating the average.
Try this:
POST database/_search
{
"size": 0,
"aggs": {
"groupByYear": {
"terms": {
"field": "publication_year"
},
"aggs": {
"avgYear": {
"avg": {
"field": "publication_year"
}
}
}
},
"avg_year": {
"avg_bucket": {
"buckets_path": "groupByYear>avgYear"
}
}
}
}
It's not clear what you want. Do you want the average of 2016, 2017, 2018, 2019, i.e. 2017.5?
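For the sample buckets above (keys 2016-2019), avgYear inside each bucket equals the bucket key itself, so avg_bucket averages the keys unweighted and the response would end with something like:
"avg_year" : {
  "value" : 2017.5
}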

Elasticsearch - find IPs from which only anonymous requests came

I have network logs in my Elasticsearch. Each log has an username and an IP field. Something like this:
{"username":"user1", "ip": "1.2.3.4"}
{"username":"anonymous", "ip": "1.2.3.4"}
{"username":"anonymous", "ip": "2.3.4.5"}
{"username":"user2", "ip": "3.4.5.6"}
I have a seemingly simple task: list all IPs from which only anonymous requests came. The problem is that I cannot simply filter for anonymous, because then I would list false IPs that appear with anonymous, but not exclusively. Manually I can do this with a 3-step process:
List all unique IPs.
List unique IPs that appear with something other than anonymous.
Exclude the items of the 2nd list from the first.
But is there a way to do this with a single ES query? My first instinct was to use a bool query. My current approach is this:
GET /sample1/_search
{
"query": {
"bool": {
"must": {
"wildcard": {
"ip": "*"
}
},
"must_not": {
"term": {
"username": "-anonymous"
}
}
}
},
"size": 0,
"aggs": {
"ips": {
"terms": {
"field": "ip.keyword"
}
}
}
}
I expect "2.3.4.5", but it returns all 3 unique IPs. I searched the web and tried different query types for hours. Any ideas?
Please find below the mapping, sample docs, the respective query for your scenario, and the response:
Mapping:
PUT my_ip_index
{
"mappings": {
"properties": {
"user":{
"type": "keyword"
},
"ip":{
"type": "ip"
}
}
}
}
Documents:
POST my_ip_index/_doc/1
{
"user": "user1",
"ip": "1.2.3.4"
}
POST my_ip_index/_doc/2
{
"user": "anonymous",
"ip": "1.2.3.4"
}
POST my_ip_index/_doc/3
{
"user": "anonymous",
"ip": "2.3.4.5"
}
POST my_ip_index/_doc/4
{
"user": "user2",
"ip": "3.4.5.6"
}
Aggregation Query:
POST my_ip_index/_search
{
"size": 0,
"aggs": {
"my_valid_ips": {
"terms": {
"field": "ip",
"size": 10
},
"aggs": {
"valid_users": {
"terms": {
"field": "user",
"size": 10,
"include": "anonymous"
}
},
"min_bucket_selector": {
"bucket_selector": {
"buckets_path": {
"valid_users_count": "valid_users._bucket_count",
"my_valid_ips_count": "_count"
},
"script": {
"source": "params.valid_users_count == 1 && params.my_valid_ips_count == 1"
}
}
}
}
}
}
}
Note how I've made use of the Terms Aggregation and the Bucket Selector Aggregation in the above query.
I've added the include parameter to the terms aggregation so that only anonymous users are considered. The logic inside the bucket_selector keeps an IP only if it has a single document in the top-level terms aggregation (e.g. 2.3.4.5) and a single bucket in the second-level terms aggregation.
Response:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"my_valid_ips" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "2.3.4.5", <---- Expected IP/Answer
"doc_count" : 1,
"valid_users" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "anonymous",
"doc_count" : 1
}
]
}
}
]
}
}
}
Hope it helps!
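One caveat with the params.my_valid_ips_count == 1 condition: it also drops IPs that sent more than one anonymous request. If such IPs should still count, a variant (a sketch using the same mapping, not part of the original answer) is to count the non-anonymous documents per IP with a filter sub-aggregation and keep only buckets where that count is zero:
POST my_ip_index/_search
{
  "size": 0,
  "aggs": {
    "ips": {
      "terms": {
        "field": "ip",
        "size": 10
      },
      "aggs": {
        "non_anonymous": {
          "filter": {
            "bool": {
              "must_not": {
                "term": { "user": "anonymous" }
              }
            }
          }
        },
        "only_anonymous": {
          "bucket_selector": {
            "buckets_path": {
              "nonAnon": "non_anonymous._count"
            },
            "script": "params.nonAnon == 0"
          }
        }
      }
    }
  }
}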

Elasticsearch aggregations: how to get bucket with 'other' results of terms aggregation?

I use an aggregation to collect data from a nested field and got stuck a little.
Example of document:
{
...
rectangle: {
attributes: [
{_id: 'some_id', ...}
]
}
}
ES allows grouping data by rectangle.attributes._id, but is there any way to get an 'other' bucket that collects the documents which were not added to any of the groups? Or maybe there is a way to write a query that creates a bucket for documents by {"rectangle.attributes._id": {$ne: "{currentDoc}.rectangle.attributes._id"}}
I think a bucket would be perfect because I need to run further aggregations on the 'other' docs.
Or maybe there's some cool workaround
I use a query like this for the aggregation:
"aggs": {
"attributes": {
"nested": {
"path": "rectangle.attributes"
},
"aggs": {
"attributesCount": {
"cardinality": {
"field": "rectangle.attributes._id.keyword"
}
},
"entries": {
"terms": {
"field": "rectangle.attributes._id.keyword"
}
}
}
}
}
And get this result:
"buckets" : [
{
"key" : "some_parent_id",
"doc_count" : 27616,
"attributes" : {
"doc_count" : 45,
"entries" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "some_id",
"doc_count" : 45,
"attributeOptionsCount" : {
"value" : 2
}
}
]
}
}
}
]
A result like this would be perfect:
"buckets" : [
{
"key" : "some_parent_id",
"doc_count" : 1000,
"attributes" : {
"doc_count" : 145,
"entries" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "some_id",
"doc_count" : 45
},
{
"key" : "other",
"doc_count" : 100
}
]
}
}
}
]
You can make use of the missing parameter, which defines how documents that are missing the field should be treated. Update the aggregation as below:
"aggs": {
"attributes": {
"nested": {
"path": "rectangle.attributes"
},
"aggs": {
"attributesCount": {
"cardinality": {
"field": "rectangle.attributes._id.keyword"
}
},
"entries": {
"terms": {
"field": "rectangle.attributes._id.keyword",
"missing": "other"
}
}
}
}
}
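With missing set, every nested document that has no rectangle.attributes._id lands in a bucket keyed other, e.g. (doc counts illustrative):
"entries" : {
  "buckets" : [
    { "key" : "some_id", "doc_count" : 45 },
    { "key" : "other", "doc_count" : 100 }
  ]
}
Keep in mind that missing only catches documents where the field is absent; documents that do have an _id but fall outside the top terms buckets still end up in sum_other_doc_count instead.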
