Elasticsearch aggregation and getting column values

I want to get the values of the bps and pps columns over time through an aggregation. With my query below, I can only get a document count for each interval.
This is my code:
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "date_histogram": {
        "field": "reg-date",
        "interval": "day",
        "min_doc_count": 0,
        "extended_bounds": {
          "min": "2018-10-01T00:00:00",
          "max": "2018-10-07T23:59:59"
        }
      }
    }
  }
}
Is there a way to get the value of a specific column per interval, rather than just the number of documents that satisfy this query?

Getting the "value" of a single field once aggregated doesn't make sense anymore: if your one-day bucket contains three documents, then a single field no longer has one "value" over that one-day period.
Instead you can use a sub-aggregation to compute an aggregate value for the day, such as a sum or average:
{
  "aggs": {
    "group_by_state": {
      "date_histogram": {
        "field": "reg-date",
        "interval": "day",
        "min_doc_count": 0,
        "extended_bounds": {
          "min": "2018-10-01T00:00:00",
          "max": "2018-10-07T23:59:59"
        }
      },
      # sub-aggregations
      "aggs": {
        "bps_average": {
          "avg": {
            "field": "bps"
          }
        },
        "pps_average": {
          "avg": {
            "field": "pps"
          }
        }
      }
    }
  }
}
Then each bucket will have fields bps_average and pps_average. If you replace avg with sum you'll get a sum instead, and there are many other metrics aggregations.
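For instance, a minimal variant of the sub-aggregations above that sums the values instead (the bps_total and pps_total names are just illustrative):
"aggs": {
  "bps_total": { "sum": { "field": "bps" } },
  "pps_total": { "sum": { "field": "pps" } }
}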
The Elasticsearch guide has a good section on aggregations and nesting.

Related

Date_histogram and top_hits from unique values only

I am trying to do a date_histogram aggregation to show a sum of Duration for each hour.
I have the following documents:
{
  "EntryTimestamp": 1567029600000,
  "Username": "johndoe",
  "UpdateTimestamp": 1567029600000,
  "Duration": 10,
  "EntryID": "ASDF1234"
}
The following works very well, but my problem is that sometimes multiple documents appear with the same EntryID. Ideally I would need to add a top_hits somehow, ordered by UpdateTimestamp, as I need the last updated document for each unique EntryID, but I'm not sure how to add this to my query.
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "EntryTimestamp": {
              "gte": "1567029600000",
              "lte": "1567065599999",
              "format": "epoch_millis"
            }
          }
        },
        {
          "query_string": {
            "analyze_wildcard": true,
            "query": "Username.keyword:johndoe"
          }
        }
      ]
    }
  },
  "aggs": {
    "2": {
      "date_histogram": {
        "interval": "1h",
        "field": "EntryTimestamp",
        "min_doc_count": 0,
        "extended_bounds": {
          "min": "1567029600000",
          "max": "1567065599999"
        },
        "format": "epoch_millis"
      },
      "aggs": {
        "1": {
          "sum": {
            "field": "Duration"
          }
        }
      }
    }
  }
}
I think you'll need a top_hits aggregation inside a terms aggregation.
The terms aggregation will get the distinct EntryIDs, and the top_hits aggregation inside of it will get only the most recent document (based on UpdateTimestamp) for each bucket (each distinct value) of the terms aggregation.
I have no exact syntax adapted to your context, and I believe you might run into some issues with the number of sub-aggregation buckets (I ran into some limitations with advanced aggregations in the past).
You can see this post for more info on that; I hope it proves helpful.
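For illustration only, here is roughly what that nesting might look like, adapted from the query in the question (the EntryID.keyword field name, the per_entry and latest aggregation names, and the terms size are assumptions; summing the deduplicated durations per hour would still need extra work on top of this):
"aggs": {
  "2": {
    "date_histogram": {
      "interval": "1h",
      "field": "EntryTimestamp",
      "min_doc_count": 0
    },
    "aggs": {
      "per_entry": {
        "terms": {
          "field": "EntryID.keyword",
          "size": 1000
        },
        "aggs": {
          "latest": {
            "top_hits": {
              "size": 1,
              "sort": [{ "UpdateTimestamp": { "order": "desc" } }],
              "_source": ["Duration"]
            }
          }
        }
      }
    }
  }
}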

Histogram is not starting at the right min even with a filter added

The Mapping
"eventTime": {
"type": "long"
},
The Query
POST some_indices/_search
{
  "size": 0,
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "eventTime": {
            "from": 1563120000000,
            "to": 1565712000000,
            "format": "epoch_millis"
          }
        }
      }
    }
  },
  "aggs": {
    "min_eventTime": { "min": { "field": "eventTime" } },
    "max_eventTime": { "max": { "field": "eventTime" } },
    "time_series": {
      "histogram": {
        "field": "eventTime",
        "interval": 86400000,
        "min_doc_count": 0,
        "extended_bounds": {
          "min": 1563120000000,
          "max": 1565712000000
        }
      }
    }
  }
}
The Response
"aggregations": {
"max_eventTime": {
"value": 1565539199997
},
"min_eventTime": {
"value": 1564934400000
},
"time_series": {
"buckets": [
{
"key": 1563062400000,
"doc_count": 0
},
{
"key": 1563148800000,
"doc_count": 0
},
{
...
Question
As the reference clearly mentions:
For filtering buckets, one should nest the histogram aggregation under a range filter aggregation with the appropriate from/to settings.
I set the filter properly (as the demo does), and the min and max aggregations provide the evidence.
But why is the first key still SMALLER THAN the from (or min_eventTime)?
So weird, and I'm totally lost now ;(
Any advice will be appreciated ;)
References
https://www.elastic.co/guide/en/elasticsearch/reference/5.5/search-aggregations-bucket-histogram-aggregation.html#search-aggregations-bucket-histogram-aggregation
I hacked together a solution for now, but I kind of think it's a bug in Elasticsearch.
I am using date_histogram instead, even though the field itself is a long type, and via offset I moved the starting point forward to the right timestamp.
"aggs": {
"time_series": {
"date_histogram": {
"field": "eventTime",
"interval": 86400000,
"offset": "+16h",
"min_doc_count": 0,
"extended_bounds": {
"min": 1563120000000,
"max": 1565712000000
}
},
"aggs": {
"order_amount_total": {
"sum": {
"field": "order_amount"
}
}
}
}
}
Updated
Thanks to the help of @Val, I re-thought about it and ran the following test:
@Test
public void testComputation() {
    System.out.println(1563120000000L % 86400000L); // 57600000
    System.out.println(1563062400000L % 86400000L); // 0
}
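The two printouts say it all: 1563120000000 is 57600000 ms (exactly 16 hours) past the nearest multiple of the one-day interval, 1563062400000, which is precisely the first bucket key in the response above and the source of the "+16h" offset in the workaround.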
I want to quote from the doc
With extended_bounds setting, you now can "force" the histogram aggregation to start building buckets on a specific min value and also keep on building buckets up to a max value (even if there are no documents anymore). Using extended_bounds only makes sense when min_doc_count is 0 (the empty buckets will never be returned if min_doc_count is greater than 0).
But I believe the specific min value should be one of 0, interval, 2 * interval, 3 * interval, ..., instead of an arbitrary value like the one I used in the question.
So basically, in my case I could use the offset of a plain histogram to solve the issue as follows; I don't actually need date_histogram at all.
"histogram": {
"field": "eventTime",
"interval": 86400000,
"offset": 57600000,
"min_doc_count" : 0,
"extended_bounds": {
"min": 1563120000000,
"max": 1565712000000
}
}
A clear explanation posted by Elasticsearch team member @polyfractal (thank you for the detailed, crystal-clear explanation) proves the same logic; more details can be found here.
The reason for the design, quoted here:
if we cut the aggregation off right at the extended_bounds.min/max, we would generate buckets that are not the full interval and that would break many assumptions about how the histogram works.
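In other words, histogram bucket keys follow the documented formula
bucket_key = Math.floor((value - offset) / interval) * interval + offset
so with the default offset of 0, the bucket containing extended_bounds.min = 1563120000000 necessarily starts at the full-interval boundary 1563062400000, exactly as seen in the response.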

How to get specific _source fields in aggregation

I am exploring Elasticsearch, to be used in an application which will handle large volumes of data and generate some statistical results over them. My requirement is to retrieve certain statistics for a particular field. For example, for a given field, I would like to retrieve its unique values and the document frequency of each value, along with the length of the value. The value lengths are indexed along with each document.
So far, I have experimented with Terms Aggregation, with the following query:
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "type_count": {
      "terms": {
        "field": "val.keyword",
        "size": 100
      }
    }
  }
}
The query returns all the values in the field val with the number of documents in which each value occurs. I would like the field val_len to be returned as well. Is it possible to achieve this using Elasticsearch? In other words, is it possible to include specific _source fields in buckets? I have looked through the documentation available online, but I haven't found a solution yet.
Hoping somebody could point me in the right direction. Thanks in advance!
I tried to include _source in the following two ways:
"aggs": {
"type_count": {
"terms": {
"field": "val.keyword",
"size": 100
},
"_source":["val_len"]
}
}
and
"aggs": {
"type_count": {
"terms": {
"field": "val.keyword",
"size": 100,
"_source":["val_len"]
}
}
}
But I guess this isn't the right way, because both gave me parsing errors.
You need to use another sub-aggregation called top_hits, like this:
"aggs": {
"type_count": {
"terms": {
"field": "val.keyword",
"size": 100
},
"aggs": {
"hits": {
"top_hits": {
"_source":["val_len"],
"size": 1
}
}
}
}
}
Another way of doing it is to use an avg sub-aggregation instead, so you can also sort on it. Since val_len is just the length of val, every document in a bucket shares the same val_len, so the average is simply that length:
"aggs": {
"type_count": {
"terms": {
"field": "val.keyword",
"size": 100,
"order": {
"length": "desc"
}
},
"aggs": {
"length": {
"avg": {
"field": "val_len"
}
}
}
}
}

Sublisting aggregations in Elasticsearch

Hi, I wanted to know whether, after applying an aggregation, I can select only a range of values to return in the response. Suppose the aggregation has 100 docs; can I select, say, documents 10 to 30, or 0 to 20, etc.? Any help would be appreciated, thanks.
Elasticsearch supports filtering aggregation values with partitioning.
GET /_search
{
  "size": 0,
  "aggs": {
    "expired_sessions": {
      "terms": {
        "field": "account_id",
        "include": {
          "partition": 0,
          "num_partitions": 20
        },
        "size": 10000,
        "order": {
          "last_access": "asc"
        }
      },
      "aggs": {
        "last_access": {
          "max": {
            "field": "access_date"
          }
        }
      }
    }
  }
}
See Filtering Values with partitions.
Be aware that partitioning may add a performance hit depending upon the aggregation.
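Each term is hashed into exactly one of the num_partitions groups, so to walk through all values you repeat the same request once per partition, incrementing the partition number each time; for example, the second request would change only the include block:
"include": {
  "partition": 1,
  "num_partitions": 20
}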

Elasticsearch date_histogram extended_bounds

I want to get a date_histogram for a specific period; how do I restrict the date range? Should I use the extended_bounds parameter? For example, I want to query the date_histogram between '2016-08-01' and '2016-08-31' with an interval of day. I query with this expression:
{
  "aggs": {
    "cf_loan": {
      "date_histogram": {
        "field": "createDate",
        "interval": "day",
        "format": "yyyy-MM-dd",
        "min_doc_count": 0,
        "extended_bounds": {
          "min": "2016-08-01",
          "max": "2016-08-31"
        }
      }
    }
  }
}
But I get date_histogram buckets outside of that range.
You're almost there; you need to add a range query in order to select only documents whose createDate field is in the desired range.
{
  "query": {
    "range": {            <---- add this range query
      "createDate": {
        "gte": "2016-08-01T00:00:00.000Z",
        "lt": "2016-09-01T00:00:00.000Z"
      }
    }
  },
  "aggs": {
    "cf_loan": {
      "date_histogram": {
        "field": "createDate",
        "interval": "day",
        "format": "yyyy-MM-dd",
        "min_doc_count": 0,
        "extended_bounds": {
          "min": "2016-08-01",
          "max": "2016-08-31"
        }
      }
    }
  }
}
The role of the extended_bounds parameter is to make sure you get daily buckets from min to max even if there are no documents in them. For instance, if you have 1 document each day between 2016-08-04 and 2016-08-28, then without the extended_bounds parameter you'd get 25 buckets (2016-08-04, 2016-08-05, 2016-08-06, ..., 2016-08-28).
With the extended_bounds parameter, you'll also get the following buckets but with 0 documents:
2016-08-01
2016-08-02
2016-08-03
2016-08-29
2016-08-30
2016-08-31
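With both the range query and extended_bounds in place, the start of the buckets array in the response would look roughly like this (illustrative sketch; real responses also include a numeric key per bucket):
"buckets": [
  { "key_as_string": "2016-08-01", "doc_count": 0 },
  { "key_as_string": "2016-08-02", "doc_count": 0 },
  { "key_as_string": "2016-08-03", "doc_count": 0 },
  { "key_as_string": "2016-08-04", "doc_count": 1 },
  ...
]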
