How can I add additional terms in an Elasticsearch aggregation with datetime buckets? - elasticsearch

I am using the Elasticsearch 5.3 aggregation API and cannot write a query that calculates a measure per weekly date bucket, split by a dimension (term/field). I can build the date buckets and get the measure calculated for each bucket, but I cannot split it further by a term, say application or transaction. Elasticsearch 5+ has deprecated a lot of APIs from previous versions. Here is what I have so far. Right now it aggregates the measure across all terms for each date bucket. I need to split it by some fields/terms. How do I go about doing it?
POST /index_name/_search?size=0
{
"aggs": {
"myname_Summary": {
"date_histogram": {
"field": "#timestamp",
"interval": "week",
"format": "yyyy-MM-dd",
"time_zone": "-04:00"
},
"aggs": {
"total_volume": {"sum": {"field": "volume"}}
}
}
}
}

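You can try this. Nest one terms aggregation per field you want to split by inside the date_histogram, and put the sum at the innermost level: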
{
"size": 0,
"aggs": {
"myname_Summary": {
"date_histogram": {
"field": "#timestamp",
"interval": "week",
"format": "yyyy-MM-dd",
"time_zone": "-04:00"
},
"aggs": {
"split": {
"terms": {
"field": "application",
"size": 10
},
"aggs": {
"transaction": {
"terms": {
"field": "transaction",
"size": 10
},
"aggs": {
"total_volume": {
"sum": {
"field": "volume"
}
}
}
}
}
}
}
}
}
}
Hope this helps
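The nested response from a query like this can be awkward to consume, so here is a minimal Python sketch that flattens it into one row per week, application and transaction. It assumes the aggregation names used in the answer (myname_Summary, split, transaction, total_volume). The sample response is hypothetical data made up for illustration, not real output.

```python
def flatten_buckets(response):
    """Return one row per (week, application, transaction) combination."""
    rows = []
    weeks = response["aggregations"]["myname_Summary"]["buckets"]
    for week in weeks:
        # "split" is the terms aggregation on the application field.
        for app in week["split"]["buckets"]:
            # "transaction" is the terms aggregation nested one level deeper.
            for txn in app["transaction"]["buckets"]:
                rows.append({
                    "week": week["key_as_string"],
                    "application": app["key"],
                    "transaction": txn["key"],
                    "total_volume": txn["total_volume"]["value"],
                })
    return rows


# Hypothetical response fragment, shaped like a date_histogram > terms > terms > sum result.
sample_response = {
    "aggregations": {
        "myname_Summary": {
            "buckets": [
                {
                    "key_as_string": "2017-05-01",
                    "doc_count": 3,
                    "split": {"buckets": [
                        {"key": "billing", "doc_count": 3,
                         "transaction": {"buckets": [
                             {"key": "refund", "doc_count": 1,
                              "total_volume": {"value": 40.0}},
                             {"key": "payment", "doc_count": 2,
                              "total_volume": {"value": 120.0}},
                         ]}},
                    ]},
                }
            ]
        }
    }
}

rows = flatten_buckets(sample_response)
for row in rows:
    print(row)
```

Each terms level only returns its top "size" buckets (10 in the answer), so rows for less frequent terms may be missing unless you raise that value.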

Related

Elasticsearch: How to set 'doc_count' of a FILTER-Aggregation in relation to total 'doc_count'

A seemingly trivial problem prompted me to read the Elasticsearch documentation again today. So far, however, I have not come across a solution.
Question:
Is there a simple way to relate the doc_count of a filter aggregation to the total doc_count?
Here's a snippet from my search request JSON.
In the feature_occurrences aggregation I filter the documents.
Now I want to calculate the ratio of filtered docs to all docs in each time bucket.
GET my_index/_search
{
"aggs": {
"time_buckets": {
"date_histogram": {
"field": "date",
"calendar_interval": "1d",
"min_doc_count": 0
},
"aggs": {
"feature_occurrences": {
"filter": {
"term": {
"x": "y"
}
}
},
"feature_occurrences_per_doc" : {
// feature_occurences.doc_count / doc_count
}
Any ideas?
You can use a bucket_script pipeline aggregation to calculate the ratio:
{
"aggs": {
"date": {
"date_histogram": {
"field": "#timestamp",
"interval": "hour"
},
"aggs": {
"feature_occurrences": {
"filter": {
"term": {
"cloud.region": "westeurope"
}
}
},
"ratio": {
"bucket_script": {
"buckets_path": {
"doc_count": "_count",
"features_count": "feature_occurrences._count"
},
"script": "params.features_count / params.doc_count"
}
}
}
}
}
}
Elastic bucket script doc:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-pipeline-bucket-script-aggregation.html
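If you would rather compute the ratio on the client, the same arithmetic can be sketched in Python over the returned buckets. This is only an illustration under a few assumptions: the filter aggregation is named feature_occurrences as in the question, the sample buckets are invented, and returning None for empty buckets (which appear when min_doc_count is 0) is one possible choice, not something the answer prescribes.

```python
def add_ratio(buckets, filter_name="feature_occurrences"):
    """Return the share of filtered docs per bucket, or None for empty buckets."""
    result = []
    for bucket in buckets:
        total = bucket["doc_count"]
        matched = bucket[filter_name]["doc_count"]
        # Guard against division by zero for buckets without any documents.
        ratio = matched / total if total else None
        result.append({
            "key": bucket.get("key_as_string", bucket["key"]),
            "ratio": ratio,
        })
    return result


# Hypothetical buckets, shaped like the "buckets" array of the date_histogram.
sample_buckets = [
    {"key": 1577836800000, "key_as_string": "2020-01-01", "doc_count": 8,
     "feature_occurrences": {"doc_count": 2}},
    {"key": 1577923200000, "key_as_string": "2020-01-02", "doc_count": 0,
     "feature_occurrences": {"doc_count": 0}},
]

ratios = add_ratio(sample_buckets)
print(ratios)
```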

ES query ignoring time range filter

I have mimicked how Kibana does a query search and have come up with the query below. Basically I'm looking for the last 6 days of data (including the days where there is no data, since I need to feed it to a graph). But the returned buckets cover more than just those days. I would like to understand where I'm going wrong with this.
{
"version": true,
"size": 0,
"sort": [
{
"#timestamp": {
"order": "desc",
"unmapped_type": "boolean"
}
}
],
"_source": {
"excludes": []
},
"aggs": {
"target_traffic": {
"date_histogram": {
"field": "#timestamp",
"interval": "1d",
"time_zone": "Asia/Kolkata",
"min_doc_count": 0,
"extended_bounds": {
"min": "now-6d/d",
"max": "now"
}
},
"aggs": {
"days_filter": {
"filter": {
"range": {
"#timestamp": {
"gt": "now-6d",
"lte": "now"
}
}
},
"aggs": {
"in_bytes": {
"sum": {
"field": "netflow.in_bytes"
}
},
"out_bytes": {
"sum": {
"field": "netflow.out_bytes"
}
}
}
}
}
}
},
"stored_fields": [
"*"
],
"script_fields": {},
"docvalue_fields": [
"#timestamp",
"netflow.first_switched",
"netflow.last_switched"
],
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "( flow.src_addr: ( \"10.5.5.1\" OR \"10.5.5.2\" ) OR flow.dst_addr: ( \"10.5.5.1\" OR \"10.5.5.2\" ) ) AND flow.traffic_locality: \"private\"",
"analyze_wildcard": true,
"default_field": "*"
}
}
]
}
}
}
If you put the range filter inside the aggregation section and have no date range in your query, the aggregations run over all your data. The date_histogram then creates a daily bucket for every day that has matching documents, and the filter only limits which documents are summed inside each bucket.
The range query on #timestamp should be moved into the query section so that the aggregations are computed only on the data you want, i.e. the last 6 days.
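A minimal Python sketch of the request body with the range moved into the query section could look like this. The function name and its parameters are hypothetical. The field names, bounds and time zone are taken from the question, and putting the range into the bool query's filter clause (rather than must) is my assumption about a reasonable placement.

```python
import json


def last_days_traffic_query(days=6, field="#timestamp", must_clauses=None):
    """Build a search body whose time range limits the query, not just a sub-aggregation."""
    time_range = {"gt": f"now-{days}d", "lte": "now"}
    return {
        "size": 0,
        "query": {
            "bool": {
                # Existing clauses, e.g. the query_string from the question.
                "must": list(must_clauses or []),
                # The range now restricts which documents reach the aggregations.
                "filter": [{"range": {field: time_range}}],
            }
        },
        "aggs": {
            "target_traffic": {
                "date_histogram": {
                    "field": field,
                    "interval": "1d",
                    "time_zone": "Asia/Kolkata",
                    "min_doc_count": 0,
                    # Still needed so that days without data show up as empty buckets.
                    "extended_bounds": {"min": f"now-{days}d/d", "max": "now"},
                },
                "aggs": {
                    "in_bytes": {"sum": {"field": "netflow.in_bytes"}},
                    "out_bytes": {"sum": {"field": "netflow.out_bytes"}},
                },
            }
        },
    }


body = last_days_traffic_query()
print(json.dumps(body, indent=2))
```

The days_filter sub-aggregation from the question is no longer needed here, because every document that reaches the histogram is already inside the time range.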

How to group by month in Elasticsearch

I am using Elasticsearch version 6.0.0.
To group by month, I am using a date_histogram aggregation.
Here is the example I tried:
{
"from":0,
"size":2000,
"_source":{
"includes":[
"cost",
"date"
],
"excludes":[
]
},
"aggregations":{
"date_hist_agg":{
"date_histogram":{
"field":"date",
"interval":"month",
"format":"M",
"order":{
"_key":"asc"
},
"min_doc_count":1
},
"aggregations":{
"cost":{
"sum":{
"field":"cost"
}
}
}
}
}
}
As a result I got 1 (January) multiple times.
Since I have data for January 2016, January 2017 and January 2018, it returns January 3 times. But I want January only once, containing the sum of January across all years.
Instead of using a date_histogram aggregation you could use a terms aggregation with a script that extracts the month from the date.
{
"from": 0,
"size": 2000,
"_source": {"includes": ["cost","date"],"excludes": []},
"aggregations": {
"date_hist_agg": {
"terms": {
"script": "doc['date'].date.monthOfYear",
"order": {
"_key": "asc"
},
"min_doc_count": 1
},
"aggregations": {
"cost": {
"sum": {
"field": "cost"
}
}
}
}
}
}
Note that using scripting is not optimal. If you know you'll need the month information, just create another field with that information so you can use a simple terms aggregation on it without any scripting.
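The extra-field approach can be sketched in Python as a small enrichment step that runs before documents are indexed. The field name month, the helper name and the yyyy-MM-dd date format are assumptions for illustration. The question does not say how the date field is formatted.

```python
from datetime import datetime


def with_month_field(doc, date_field="date", month_field="month"):
    """Return a copy of doc with the month of the year (1-12) stored in its own field."""
    # Assumes dates look like "2017-01-15"; adjust the format to your data.
    parsed = datetime.strptime(doc[date_field], "%Y-%m-%d")
    enriched = dict(doc)
    enriched[month_field] = parsed.month
    return enriched


# With the month indexed as its own field, a plain terms aggregation is enough.
month_agg = {
    "size": 0,
    "aggregations": {
        "by_month": {
            "terms": {"field": "month", "size": 12, "order": {"_key": "asc"}},
            "aggregations": {"cost": {"sum": {"field": "cost"}}},
        }
    },
}

docs = [
    {"cost": 10, "date": "2016-01-20"},
    {"cost": 5, "date": "2017-01-03"},
    {"cost": 7, "date": "2018-02-11"},
]
enriched_docs = [with_month_field(d) for d in docs]
print(enriched_docs)
```

Both January documents end up with month 1, so they fall into the same terms bucket regardless of the year.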
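In newer Elasticsearch versions (7.2 and later), we can use calendar_interval with the value month. Note that this returns one bucket per month of each year: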
Documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-datehistogram-aggregation.html#calendar_interval_examples
GET my_index/_search
{
"size": 0,
"query": {},
"aggs": {
"over_time": {
"date_histogram": {
"field": "yourDateAttribute",
"calendar_interval": "month",
"format": "yyyy-MM" // <--- control the output format
}
}
}
}

Elasticsearch: date_histogram with 5y interval

How can I make an Elasticsearch date_histogram work like this?
{
"aggs": {
"age_range": {
"date_histogram": {
"field": "birthdate",
"interval": "5y"
}
}
}
}
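This is a known limitation in Elasticsearch: calendar units such as years cannot be given a multiplier. As a workaround you can use a fixed-length approximation, either 260 weeks (260w) or 1825 days (1825d). You can add a day or two to account for leap years if you want. Keep in mind that such buckets will not line up exactly with calendar years.
This will work: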
{
"size": 0,
"aggs": {
"NAME": {
"date_histogram": {
"field": "birthdate",
"interval": "1825d"
}
}
}
}
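If the buckets need to line up with calendar years, one alternative (not part of the answer above) is to request yearly buckets and merge them into 5-year groups on the client. The Python sketch below assumes a date_histogram with a one-year interval and "format": "yyyy", so that key_as_string is a four-digit year. The sample buckets are invented for illustration.

```python
def merge_into_year_spans(yearly_buckets, span=5):
    """Merge yearly date_histogram buckets into calendar-aligned spans of `span` years."""
    merged = {}
    for bucket in yearly_buckets:
        year = int(bucket["key_as_string"])
        # Align the span start to a multiple of `span`, e.g. 1983 -> 1980.
        start = year - year % span
        merged[start] = merged.get(start, 0) + bucket["doc_count"]
    return [
        {"from_year": start, "to_year": start + span - 1, "doc_count": count}
        for start, count in sorted(merged.items())
    ]


# Hypothetical yearly buckets for a birthdate field.
yearly = [
    {"key_as_string": "1980", "doc_count": 4},
    {"key_as_string": "1983", "doc_count": 1},
    {"key_as_string": "1985", "doc_count": 2},
    {"key_as_string": "1991", "doc_count": 3},
]

five_year = merge_into_year_spans(yearly)
print(five_year)
```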

elasticsearch getting too many results, need help filtering query

I'm having a lot of trouble understanding the underlying ES querying system.
I've got the following query for example:
{
"size": 0,
"query": {
"bool": {
"must": [
{
"term": {
"referer": "www.xx.yy.com"
}
},
{
"range": {
"#timestamp": {
"gte": "now-1h",
"lt": "now"
}
}
}
]
}
},
"aggs": {
"interval": {
"date_histogram": {
"field": "#timestamp",
"interval": "0.5h"
},
"aggs": {
"what": {
"cardinality": {
"field": "host"
}
}
}
}
}
}
That request gets too many results:
"status" : 500, "reason" :
"ElasticsearchException[org.elasticsearch.common.breaker.CircuitBreakingException:
Data too large, data for field [#timestamp] would be larger than limit
of [3200306380/2.9gb]]; nested:
UncheckedExecutionException[org.elasticsearch.common.breaker.CircuitBreakingException:
Data too large, data for field [#timestamp] would be larger than limit
of [3200306380/2.9gb]]; nested: CircuitBreakingException[Data too
large, data for field [#timestamp] would be larger than limit of
[3200306380/2.9gb]]; "
I've also tried this request:
{
"size": 0,
"filter": {
"and": [
{
"term": {
"referer": "www.geoportail.gouv.fr"
}
},
{
"range": {
"#timestamp": {
"from": "2014-10-04",
"to": "2014-10-05"
}
}
}
]
},
"aggs": {
"interval": {
"date_histogram": {
"field": "#timestamp",
"interval": "0.5h"
},
"aggs": {
"what": {
"cardinality": {
"field": "host"
}
}
}
}
}
}
I would like to filter the data in order to be able to get a correct result, any help would be much appreciated!
I found a solution, though it's kind of weird.
I followed dimzak's advice and cleared the cache:
curl --noproxy localhost -XPOST "http://localhost:9200/_cache/clear"
Then I used filtering instead of querying, as Olly suggested:
{
"size": 0,
"query": {
"filtered": {
"query": {
"term": {
"referer": "www.xx.yy.fr"
}
},
"filter" : {
"range": {
"#timestamp": {
"from": "2014-10-04T00:00",
"to": "2014-10-05T00:00"
}
}
}
}
},
"aggs": {
"interval": {
"date_histogram": {
"field": "#timestamp",
"interval": "0.5h"
},
"aggs": {
"what": {
"cardinality": {
"field": "host"
}
}
}
}
}
}
I cannot accept both answers. I think dimzak deserves it most, but thumbs up to both of you :)
You can try clearing the cache first and then executing the above query, as shown here.
Another solution may be to remove the interval or reduce the time range in your query...
My best bet would be to either clear the cache first or allocate more memory to Elasticsearch (more here).
Using a filter would improve performance:
{
"size": 0,
"query": {
"filtered": {
"query": {
"term": {
"referer": "www.xx.yy.com"
}
},
"filter" : {"range": {
"#timestamp": { "gte": "now-1h", "lt": "now"
}
}
}
}
},
"aggs": {
"interval": {
"date_histogram": {
"field": "#timestamp",
"interval": "0.5h"
},
"aggs": {
"what": {
"cardinality": {
"field": "host"
}
}
}
}
}
}
You may also find that a date_range aggregation works better than a date_histogram, although you need to define the buckets yourself.
Is the referer field being analyzed? If you want an exact match on it, set it to not_analyzed.
Is there much cardinality in your host field? Have you tried pre-hashing the values?
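For the date range suggestion, defining the buckets yourself could be sketched in Python as follows. This generates half-hour ranges for a date_range aggregation over the one-day window from the question. The helper name, the aggregation layout and the ISO timestamp format are my assumptions, so check them against the date format of your mapping.

```python
from datetime import datetime, timedelta


def half_hour_ranges(start, end, step=timedelta(minutes=30)):
    """Return consecutive from/to ranges covering [start, end)."""
    ranges = []
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)
        # In a date_range aggregation "from" is inclusive and "to" is exclusive.
        ranges.append({"from": cursor.isoformat(), "to": upper.isoformat()})
        cursor = upper
    return ranges


ranges = half_hour_ranges(datetime(2014, 10, 4), datetime(2014, 10, 5))

# Same cardinality sub-aggregation as in the question, one value per explicit range.
aggs = {
    "interval": {
        "date_range": {"field": "#timestamp", "ranges": ranges},
        "aggs": {"what": {"cardinality": {"field": "host"}}},
    }
}
print(len(ranges), ranges[0], ranges[-1])
```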
