I'd like to compute the differences between millions of adjacent records in Elasticsearch and then sum them up. How can I do this?
The data (six documents) in Elasticsearch:
10
20
-30
10
30
100
Calculation:
10 to 20 is 10
20 to -30 is -50
-30 to 10 is 40
10 to 30 is 20
30 to 100 is 70
The total is:
10 + (-50) + 40 + 20 + 70 = 90
How would I do a query with the REST API or the RestHighLevelClient API to achieve this?
Generic case
Most likely the only reasonable way to do this in Elasticsearch is to denormalize: compute the deltas beforehand and index them as part of the documents. You will then only need a simple sum aggregation, as sketched below.
This is because data in Elasticsearch is "flat": it does not know that your documents are adjacent. Elasticsearch excels when everything you need to know is already in the document at index time; in that case special indexes are pre-built and aggregations are very fast.
It is like Pratchett's Discworld, the flat world riding on Great A'Tuin's back: some basic physics, like JOINs from an RDBMS, do not work, but magic is possible.
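For example, a minimal sketch, assuming a hypothetical deltas index where every document carries its precomputed delta:
POST /deltas/doc/_bulk
{"index":{}}
{"delta": 10}
{"index":{}}
{"delta": -50}
{"index":{}}
{"delta": 40}
{"index":{}}
{"delta": 20}
{"index":{}}
{"delta": 70}

POST /deltas/doc/_search
{
  "size": 0,
  "aggs": {
    "total_delta": {
      "sum": {"field": "delta"}
    }
  }
}
The total_delta in the response is 90, matching the manual calculation above. From the RestHighLevelClient the same aggregation can be built with AggregationBuilders.sum("total_delta").field("delta") attached to a SearchSourceBuilder.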
Time series-specific case
In case when you have a time series you can achieve your goal with a combination of Serial Differencing and Sum Bucket sibling aggregations.
In order to use this approach you would need to aggregate on some date field. Imagine you have a mapping like this:
PUT time_diff
{
  "mappings": {
    "doc": {
      "properties": {
        "eventTime": {
          "type": "date"
        },
        "val": {
          "type": "integer"
        }
      }
    }
  }
}
And one document per day, which looks like this:
POST /time_diff/doc/1
{
  "eventTime": "2018-01-01",
  "val": 10
}

POST /time_diff/doc/2
{
  "eventTime": "2018-01-02",
  "val": 20
}
Then, assuming the remaining four documents (values -30, 10, 30, and 100 on the following days) are indexed the same way, with a query like this:
POST /time_diff/doc/_search
{
  "size": 0,
  "aggs": {
    "my_date_histo": {
      "date_histogram": {
        "field": "eventTime",
        "interval": "day"
      },
      "aggs": {
        "the_sum": {
          "sum": {
            "field": "val"
          }
        },
        "my_diff": {
          "serial_diff": {
            "buckets_path": "the_sum"
          }
        }
      }
    },
    "my_sum": {
      "sum_bucket": {
        "buckets_path": "my_date_histo>my_diff"
      }
    }
  }
}
The response will look like this (note that the very first bucket has no my_diff value, since there is no previous bucket to diff against):
{
  ...
  "aggregations": {
    "my_date_histo": {
      "buckets": [
        {
          "key_as_string": "2018-01-01T00:00:00.000Z",
          "key": 1514764800000,
          "doc_count": 1,
          "the_sum": {
            "value": 10
          }
        },
        {
          "key_as_string": "2018-01-02T00:00:00.000Z",
          "key": 1514851200000,
          "doc_count": 1,
          "the_sum": {
            "value": 20
          },
          "my_diff": {
            "value": 10
          }
        },
        ...
      ]
    },
    "my_sum": {
      "value": 90
    }
  }
}
This method, though, has obvious limitations:
it only works if you have time series data
it is only correct if you have exactly one data point per date bucket (a day in the example)
it will explode in memory consumption if you have many points (millions, as you mentioned)
Hope that helps!
Related
I use an Elasticsearch terms aggregation to see how many documents share each value of their "foo" field, like this:
{
  ...
  "aggregations": {
    "metastore": {
      "terms": {
        "field": "foo",
        "size": 50
      }
    }
  }
}
and I get the response:
"aggregations": {
"foo": {
"buckets": [
{
"key_as_string": "2018-10-01T00:00:00.000Z",
"key": 1538352000000,
"doc_count": 935
},
{
"key_as_string": "2018-11-01T00:00:00.000Z",
"key": 1541030400000,
"doc_count": 15839
},
...
/* 48 more values */
]
}
}
But I'm limiting the number of returned values to 50. If there are more distinct values in this field they won't be returned in the response, and that's fine, because I don't need all of them, but I would like to know how many there are. So, how could I get the total number of distinct values? It would be fantastic if the answer provided a full example query, thanks.
You can add a cardinality aggregation, which will give you the number of unique terms for the field. This will be equal to the number of buckets the terms aggregation would produce.
{
  ...
  "aggregations": {
    "metastore": {
      "terms": {
        "field": "foo",
        "size": 50
      }
    },
    "uniquefoo": {
      "cardinality": {
        "field": "foo"
      }
    }
  }
}
NOTE: Please keep in mind that the cardinality aggregation may in some cases return an approximate count. To learn more, read the Elasticsearch documentation on the cardinality aggregation.
The cardinality aggregation is there to help. Just note that the number it returns is an approximation and might not reflect the exact number of buckets you'd get if you were to request them all. The accuracy is pretty good on low-cardinality fields, however.
{
  ...
  "aggregations": {
    "unique_count": {
      "cardinality": {
        "field": "foo"
      }
    },
    "metastore": {
      "terms": {
        "field": "foo",
        "size": 50
      }
    }
  }
}
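If the approximation matters for your use case, the cardinality aggregation also accepts a documented precision_threshold option: counts below the threshold are expected to be close to exact, and the maximum accepted value is 40000. A minimal tweak of the query above:
{
  ...
  "aggregations": {
    "unique_count": {
      "cardinality": {
        "field": "foo",
        "precision_threshold": 40000
      }
    }
  }
}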
I want to create a metric in a Kibana dashboard that uses a ratio of multiple metrics with an offset period.
Example:
Date Budget
YYYY-MM-DD $
2019-01-01 15
2019-01-02 10
2019-01-03 5
2019-01-04 10
2019-01-05 12
2019-01-06 4
If I select the time range 2019-01-04 to 2019-01-06, I want to compute the ratio against the offset period 2019-01-01 to 2019-01-03.
To summarize: (sum(10+12+4) - sum(15+10+5)) / sum(10+12+4) = -0.15
The evolution of my budget equals -15% (and this is what I want to print in the dashboard).
But with the Metric visualization it's not possible (there is no offset); with Visual Builder, different metric aggregations cannot have different offsets (too bad, because the bucket script would allow computing the ratio); and with Vega I have not found a solution either.
Any idea? Thanks a lot
Aurélien
NB: I use Kibana version 6.x or later
Please check the sample mapping below, which I've constructed based on the data you've provided, along with the aggregation solution you were looking for.
Mapping:
PUT <your_index_name>
{
  "mappings": {
    "mydocs": {
      "properties": {
        "date": {
          "type": "date",
          "format": "yyyy-MM-dd"
        },
        "budget": {
          "type": "float"
        }
      }
    }
  }
}
Aggregation
I've made use of the following types of aggregation:
Date Histogram, with the interval set to 4d based on the data you've mentioned in the question
Sum
Derivative
Bucket Script, which actually gives you the required budget evolution figure
Also, I'm assuming that the date format is yyyy-MM-dd and that budget is of float data type.
Below is what your aggregation query would look like.
POST <your_index_name>/_search
{
  "size": 0,
  "query": {
    "range": {
      "date": {
        "gte": "2019-01-01",
        "lte": "2019-01-06"
      }
    }
  },
  "aggs": {
    "my_date": {
      "date_histogram": {
        "field": "date",
        "interval": "4d",
        "format": "yyyy-MM-dd"
      },
      "aggs": {
        "sum_budget": {
          "sum": {
            "field": "budget"
          }
        },
        "budget_derivative": {
          "derivative": {
            "buckets_path": "sum_budget"
          }
        },
        "budget_evolution": {
          "bucket_script": {
            "buckets_path": {
              "input_1": "sum_budget",
              "input_2": "budget_derivative"
            },
            "script": "(params.input_2/params.input_1)*(100)"
          }
        }
      }
    }
  }
}
Note that the result you are looking for will be in the budget_evolution part of the second bucket: the two 4-day buckets sum to 30 and 26, the derivative of the second bucket is 26 - 30 = -4, and the bucket script yields -4/26 * 100 ≈ -15.4, i.e. the expected -15%.
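With the sample data, the response would look roughly like this (a sketch, abridged; note that fixed 4d intervals are aligned to the Unix epoch, so the first bucket happens to start at 2018-12-31 and still captures exactly the three offset-period days):
{
  ...
  "aggregations": {
    "my_date": {
      "buckets": [
        {
          "key_as_string": "2018-12-31",
          "doc_count": 3,
          "sum_budget": {"value": 30.0}
        },
        {
          "key_as_string": "2019-01-04",
          "doc_count": 3,
          "sum_budget": {"value": 26.0},
          "budget_derivative": {"value": -4.0},
          "budget_evolution": {"value": -15.384615384615385}
        }
      ]
    }
  }
}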
Hope this helps!
For reference, I'm using Elasticsearch 6.4.0
I have an Elasticsearch query that returns a certain number of hits, and I'm trying to remove hits whose text field values are too similar. My query is:
{
  "size": 10,
  "collapse": {
    "field": "author_id"
  },
  "query": {
    "function_score": {
      "boost_mode": "replace",
      "score_mode": "avg",
      "functions": [
        {
          //my custom query function
        }
      ],
      "query": {
        "bool": {
          "must_not": [
            {
              "term": {
                "author_id": MY_ID
              }
            }
          ]
        }
      }
    }
  },
  "aggs": {
    "book_name_sample": {
      "sampler": {
        "shard_size": 10
      },
      "aggs": {
        "frequent_words": {
          "significant_text": {
            "field": "book_name",
            "filter_duplicate_text": true
          }
        }
      }
    }
  }
}
This query uses a custom function score combined with a filter to return books a person might like (that they haven't authored). The thing is, for some people it returns books with names that are very similar (e.g. The Life of George Washington, Good Times with George Washington, Who Was George Washington), and I'd like the hits to have a more diverse set of names.
I'm using a sampler with a significant_text aggregation to group the hits by the text they share, and the query gives me something like:
...,
"aggregations": {
  "book_name_sample": {
    "doc_count": 10,
    "frequent_words": {
      "doc_count": 10,
      "bg_count": 482626,
      "buckets": [
        {
          "key": "George",
          "doc_count": 3,
          "score": 17.278715785140975,
          "bg_count": 9718
        },
        {
          "key": "Washington",
          "doc_count": 3,
          "score": 15.312204414323656,
          "bg_count": 10919
        }
      ]
    }
  }
}
Is it possible to filter the returned documents based on this aggregation result within Elasticsearch, i.e. remove hits whose book_name_sample doc_count is less than X? I know I can do this in PHP or whatever language consumes the hits, but I'd like to keep it within ES. I've tried using a bucket_selector aggregation like so:
"book_name_bucket_filter": {
"bucket_selector": {
"buckets_path": {
"freqWords": "frequent_words"
},
"script": "params.freqWords < 3"
}
}
But then I get an error: org.elasticsearch.search.aggregations.bucket.sampler.InternalSampler cannot be cast to org.elasticsearch.search.aggregations.InternalMultiBucketAggregation
Also, if that filter removes enough documents that the hit count falls below the requested size, is it possible to tell ES to go fetch the next top-scoring hits so that the hit count is filled out?
Why not use top hits inside the aggregation to get the relevant documents that match each bucket? You can specify how many relevant top hits you want inside the top_hits aggregation, so basically this will give you a certain number of documents per bucket.
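A minimal sketch of that idea; since significant_text does not support sub-aggregations, the top_hits here sits beside it under the sampler, describing the sampled documents (the name top_matches and the size of 3 are arbitrary choices):
"aggs": {
  "book_name_sample": {
    "sampler": {
      "shard_size": 10
    },
    "aggs": {
      "frequent_words": {
        "significant_text": {
          "field": "book_name",
          "filter_duplicate_text": true
        }
      },
      "top_matches": {
        "top_hits": {
          "size": 3,
          "_source": ["book_name"]
        }
      }
    }
  }
}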
TL;DR: I want to do the equivalent of Haskell's zipWith with buckets in Elasticsearch.
I have an index with time and value "tuples", and each entry also has a head_id pointing to meta information about the series of such tuples it belongs to; it's the timeseries ID. Visualized, it might look like this:
head_id | timestamp | value
---------+---------------+-------
1 | 1104537600000 | 10
1 | 1104538500000 | 20
1 | 1104539400000 | 30
2 | 1104537600000 | 1000
2 | 1104538500000 | 2000
2 | 1104539400000 | 3000
Let's represent each individual timeseries as a list like this, for clarity:
1: [ 10, 20, 30]
2: [1000, 2000, 3000]
What I want to achieve is to "zip" those series together in an Elasticsearch aggregation. Let's say I want to sum them:
result: [1010, 2020, 3030]
I currently need to fetch all the data and do the desired operation in application code. To save memory and network bandwidth, I want to perform operations like this directly within Elasticsearch.
In this case, because the values I want to add up share the same timestamps, I was able to achieve this using a terms bucket aggregation with a sum sub-aggregation:
GET /timeseries/_search
{
  "aggs": {
    "result": {
      "terms": {"field": "timestamp"},
      "aggs": {
        "values_sum": {
          "sum": {"field": "value"}
        }
      }
    }
  }
}
returns (simplified):
{
  "aggregations": {
    "result": {
      "buckets": [
        {
          "key": 1104537600000,
          "doc_count": 2,
          "values_sum": {"value": 1010}
        },
        {
          "key": 1104538500000,
          "doc_count": 2,
          "values_sum": {"value": 2020}
        },
        {
          "key": 1104539400000,
          "doc_count": 2,
          "values_sum": {"value": 3030}
        }
      ]
    }
  }
}
However, in my case it isn't guaranteed that the timeseries' timestamps will align like this, which means I need a more general way of aggregating two (or, in general, N) timeseries, assuming each has the same number of values.
A potential workaround I thought of was to shift the beginning of each timeseries to 0, and then use the above technique. However, I don't know how I could achieve that.
Another potential workaround I thought of was first aggregating over head_id to get a bucket for each timeseries, and then use something like the serial differencing aggregation with lag=1. I can't use that aggregation though, because I want to do other operations than just subtraction, and it requires the buckets to be generated through a histogram aggregation, which isn't the case for me.
"A potential workaround I thought of was to shift the beginning of each timeseries to 0, and then use the above technique. However, I don't know how I could achieve that."
This can be achieved using a script for the terms bucket key. It looks like this:
GET /timeseries/_search
{
  "aggs": {
    "result": {
      "terms": {
        "field": "timestamp",
        "script": {
          "inline": "_value - params.anchors[String.valueOf(doc['head_id'].value)]",
          "params": {
            "anchors": {
              "1": 1104537600000,
              "2": 1104624000000,
              ...
            }
          }
        }
      },
      "aggs": {
        "values_sum": {
          "sum": {"field": "value"}
        }
      }
    }
  }
}
Here anchors is a map associating each head_id with the time instant at which its series should start. Since JSON object keys are always strings, the script converts the numeric head_id with String.valueOf before the lookup.
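If the anchors are not known up front, one way to build that map is a preliminary round trip, sketched here with a terms aggregation on head_id and a min sub-aggregation on timestamp (the names series and series_start are arbitrary):
GET /timeseries/_search
{
  "size": 0,
  "aggs": {
    "series": {
      "terms": {"field": "head_id"},
      "aggs": {
        "series_start": {
          "min": {"field": "timestamp"}
        }
      }
    }
  }
}
Each bucket's series_start.value then becomes the anchor for that head_id.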
I have a date field in my data, and I ran a date histogram aggregation on it with the interval set to day. It returns the number of documents per interval.
Here is the query I used:
{
  "aggs": {
    "dateHistogram": {
      "date_histogram": {
        "field": "currentDate",
        "interval": "day"
      }
    }
  }
}
Below is the exact response I received.
{
  "aggregations": {
    "dateHistogram": {
      "buckets": [{
        "key_as_string": "2015-04-06",
        "key": 1428278400000,
        "doc_count": 14
      }, {
        "key_as_string": "2015-05-06",
        "key": 1430870400000,
        "doc_count": 10
      }]
    }
  }
}
From the above response it is clear that there are 14 documents in the 2015-04-06 bucket and 10 documents in the 2015-05-06 bucket. But apart from the document count, the individual documents themselves are not shown. I want them to be shown in the response so that I can take values out of them. How do I achieve this in Elasticsearch?
The easy method for this is the top_hits aggregation; its usage is described in the Elasticsearch documentation.
The top_hits aggregation gives you the relevant documents inside each bucket of the aggregation you have done, with options to specify the offset to fetch from, the number of documents to take, and sorting.
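A minimal sketch applied to your query (the sub-aggregation name docs_per_day and the size of 5 are arbitrary choices):
{
  "aggs": {
    "dateHistogram": {
      "date_histogram": {
        "field": "currentDate",
        "interval": "day"
      },
      "aggs": {
        "docs_per_day": {
          "top_hits": {
            "size": 5,
            "sort": [{"currentDate": {"order": "asc"}}]
          }
        }
      }
    }
  }
}
Each bucket in the response then carries its own hits array with the individual documents.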
As per my understanding, you want to fetch all documents and use those documents for the aggregation, so you should use a match_all query together with the aggregation, as below:
{
  "query": {
    "bool": {
      "must": [
        {
          "match_all": {}
        }
      ]
    }
  },
  "aggs": {
    "date_wise_logs_counts": {
      "date_histogram": {
        "field": "currentDate",
        "interval": "day"
      }
    }
  }
}
The above returns 10 documents in the hits array by default; use size=BIGNUMBER to get more than 10 items (where BIGNUMBER is a number you believe is bigger than your dataset). But for large result sets you should use the scroll API instead of a large size.
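A minimal scroll sketch, assuming a hypothetical index name my_index; repeat the second call with the _scroll_id returned by the previous response until no more hits come back:
POST /my_index/_search?scroll=1m
{
  "size": 1000,
  "query": {"match_all": {}}
}

POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "<_scroll_id from the previous response>"
}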