zip-like bucket aggregation - elasticsearch

TL;DR: I want to do the equivalent of Haskell's zipWith with buckets in elasticsearch.
I have an index with time and value "tuples", and each entry also has a head_id, pointing to meta information about a series of such tuples. It's the timeseries ID. Visualized it might look like this:
head_id | timestamp     | value
--------+---------------+------
      1 | 1104537600000 |    10
      1 | 1104538500000 |    20
      1 | 1104539400000 |    30
      2 | 1104537600000 |  1000
      2 | 1104538500000 |  2000
      2 | 1104539400000 |  3000
Let's represent each individual timeseries as a list like this, for clarity:
1: [ 10, 20, 30]
2: [1000, 2000, 3000]
What I want to achieve is to "zip" those series together in an Elasticsearch aggregation. Let's say I want to sum them:
result: [1010, 2020, 3030]
I currently need to fetch all the data and do the desired operation in application code. To save memory and network bandwidth, I want to perform operations like this directly within Elasticsearch.
In this case, because the values I want to add up share the same timestamp, I was able to achieve this using a terms bucket aggregation with a sum sub-aggregation:
GET /timeseries/_search
{
  "aggs": {
    "result": {
      "terms": {"field": "timestamp"},
      "aggs": {
        "values_sum": {
          "sum": {"field": "value"}
        }
      }
    }
  }
}
returns (simplified):
{
  "aggregations": {
    "result": {
      "buckets": [
        {
          "key": 1104537600000,
          "doc_count": 2,
          "values_sum": {"value": 1010}
        },
        {
          "key": 1104538500000,
          "doc_count": 2,
          "values_sum": {"value": 2020}
        },
        {
          "key": 1104539400000,
          "doc_count": 2,
          "values_sum": {"value": 3030}
        }
      ]
    }
  }
}
However, in my case it isn't guaranteed that the timeseries' timestamps will align like this, which means I need a more general way of aggregating 2 (or, more generally, N) timeseries, assuming each has the same number of values.
A potential workaround I thought of was to shift the beginning of each timeseries to 0, and then use the above technique. However, I don't know how I could achieve that.
Another potential workaround I thought of was first aggregating over head_id to get a bucket for each timeseries, and then use something like the serial differencing aggregation with lag=1. I can't use that aggregation though, because I want to do other operations than just subtraction, and it requires the buckets to be generated through a histogram aggregation, which isn't the case for me.

A potential workaround I thought of was to shift the beginning of each timeseries to 0, and then use the above technique. However, I don't know how I could achieve that.
This can be achieved using a script for the terms bucket key. It looks like this:
GET /timeseries/_search
{
  "aggs": {
    "result": {
      "terms": {
        "field": "timestamp",
        "script": {
          "inline": "_value - params.anchors[String.valueOf(doc['head_id'].value)]",
          "params": {
            "anchors": {
              "1": 1104537600000,
              "2": 1104624000000,
              ...
            }
          }
        }
      },
      "aggs": {
        "values_sum": {
          "sum": {"field": "value"}
        }
      }
    }
  }
}
Where anchors is a map associating each head_id with the time instant that the respective series should start at. (Note the String.valueOf conversion: the keys of the params map are JSON strings, while doc['head_id'].value is numeric.)
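If the anchors aren't known up front, one way to compute them is a preliminary aggregation: a terms bucket per head_id with a min sub-aggregation on timestamp. A minimal sketch (the aggregation names per_series and start are illustrative):
GET /timeseries/_search
{
  "size": 0,
  "aggs": {
    "per_series": {
      "terms": {"field": "head_id"},
      "aggs": {
        "start": {"min": {"field": "timestamp"}}
      }
    }
  }
}
Each per_series bucket's start.value then becomes that head_id's entry in the anchors map of the main query (raise the terms size if you have more than the default 10 series).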

Elasticsearch sort by filtered value

I'm using Elasticsearch 7.12, upgrading to 7.17 soon.
The following description of my problem has had the confusing business logic for my exact scenario removed.
I have an integer field in my document named 'Points'. It will usually contain 5-10 values, but may contain more, probably not more than 100 values. Something like:
Document 1:
{
  "Points": [3, 12, 34, 60, 1203, 70, 88]
}
Document 2:
{
  "Points": [16, 820, 31, 60]
}
Document 3:
{
  "Points": [93, 20, 55]
}
My search needs to return documents with values within a range, such as between 10 and 19 inclusive. That part is fine. However, I need to sort the results by the values found in that range. From the example above, I might need to find values between 30 and 39, sorted by the value in that range ascending - it should return Document 2 (containing the value 31) followed by Document 1 (containing the value 34).
Due to the potential range of values and searches I can't break this field down into fields like 0-9, 10-19 etc. to search on them independently - there would be many thousands of fields.
The documents themselves are otherwise quite large and there are a large number of them, so I have been advised to avoid nested fields if possible.
Can I apply a filter to a sort? Do I need a script to achieve this?
Thanks.
There are several ways of doing this:
Histogram aggregation
Aggregate your documents using a histogram aggregation with "hard bounds". Example query:
POST /my_index/_search?size=0
{
  "query": {
    "constant_score": { "filter": { "range": { "Points": { "gte": "30", "lte": "39" } } } }
  },
  "aggs": {
    "points": {
      "histogram": {
        "field": "Points",
        "interval": 10,
        "hard_bounds": {
          "min": 30,
          "max": 39
        }
      },
      "aggs": { "top": { "top_hits": {} } }
    }
  }
}
This will aggregate all the documents that fall in that range, and the first bucket in the results will contain the document that you want.
Terms aggregation with include
If the range you want is relatively small, e.g. the "30 - 39" you mentioned, a simple terms aggregation with an include listing every number in that range will also give you the desired result.
Example Query:
POST /my_index/_search?size=0
{
  "query": {
    "constant_score": { "filter": { "range": { "Points": { "gte": "30", "lte": "39" } } } }
  },
  "aggs": {
    "points": {
      "terms": {
        "field": "Points",
        "include": ["30", "31", ...., "39"]
      },
      "aggs": { "top": { "top_hits": {} } }
    }
  }
}
Each bucket in the terms aggregation results will contain the documents that have that particular "Point" occurring at least once. The first document in the first bucket has what you want.
The third option involves building a runtime field that trims the points down to only those within your range, and then sorting ascending on that field, as sketched below. But that will be slower.
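A minimal sketch of that third option, assuming Points is an indexed numeric field and using a search-time runtime field (available in your 7.12; the field name points_in_range and the parameter values are illustrative):
POST /my_index/_search
{
  "runtime_mappings": {
    "points_in_range": {
      "type": "long",
      "script": {
        "source": "for (p in doc['Points']) { if (p >= params.min && p <= params.max) emit(p); }",
        "params": { "min": 30, "max": 39 }
      }
    }
  },
  "query": {
    "constant_score": { "filter": { "range": { "Points": { "gte": "30", "lte": "39" } } } }
  },
  "sort": [
    { "points_in_range": { "order": "asc" } }
  ]
}
Ascending sorts on a multi-valued field use the minimum value by default, which matches "sorted by the value in that range ascending".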
HTH.

How to get total number of aggregation buckets in Elasticsearch?

I use an Elasticsearch terms aggregation to see how many documents have a certain value in their "foo" field, like this:
{
  ...
  "aggregations": {
    "metastore": {
      "terms": {
        "field": "foo",
        "size": 50
      }
    }
  }
}
and I get the response:
"aggregations": {
"foo": {
"buckets": [
{
"key_as_string": "2018-10-01T00:00:00.000Z",
"key": 1538352000000,
"doc_count": 935
},
{
"key_as_string": "2018-11-01T00:00:00.000Z",
"key": 1541030400000,
"doc_count": 15839
},
...
/* 48 more values */
]
}
}
But I'm limiting the number of distinct values to 50. If there are more distinct values in this field they won't be returned in the response, and that's fine, because I don't need all of them, but I would like to know how many there are. So, how could I get the total number of distinct values? It would be fantastic if the answer provided a full example query, thanks.
You can add a cardinality aggregation, which will give you the number of unique terms for the field. This will be equal to the number of buckets for the terms aggregation.
{
  ...
  "aggregations": {
    "metastore": {
      "terms": {
        "field": "foo",
        "size": 50
      }
    },
    "uniquefoo": {
      "cardinality": {
        "field": "foo"
      }
    }
  }
}
NOTE: Please keep in mind that the cardinality aggregation might in some cases return an approximate count; see the Elasticsearch documentation on the cardinality aggregation for details.
The cardinality aggregation is there to help. Just note that the number returned is an approximation and might not reflect the exact number of buckets you'd get if you were to request them all. That said, the accuracy is pretty good on low-cardinality fields.
{
  ...
  "aggregations": {
    "unique_count": {
      "cardinality": {
        "field": "foo"
      }
    },
    "metastore": {
      "terms": {
        "field": "foo",
        "size": 50
      }
    }
  }
}

Elasticsearch filter based on field similarity

For reference, I'm using Elasticsearch 6.4.0
I have an Elasticsearch query that returns a certain number of hits, and I'm trying to remove hits whose text field values are too similar. My query is:
{
  "size": 10,
  "collapse": {
    "field": "author_id"
  },
  "query": {
    "function_score": {
      "boost_mode": "replace",
      "score_mode": "avg",
      "functions": [
        {
          //my custom query function
        }
      ],
      "query": {
        "bool": {
          "must_not": [
            {
              "term": {
                "author_id": MY_ID
              }
            }
          ]
        }
      }
    }
  },
  "aggs": {
    "book_name_sample": {
      "sampler": {
        "shard_size": 10
      },
      "aggs": {
        "frequent_words": {
          "significant_text": {
            "field": "book_name",
            "filter_duplicate_text": true
          }
        }
      }
    }
  }
}
This query uses a custom function score combined with a filter to return books a person might like (that they haven't authored). Thing is, for some people, it returns books with names that are very similar (e.g. The Life of George Washington, Good Times with George Washington, Who Was George Washington), and I'd like the hits to have a more diverse set of names.
I'm using a sampler aggregation with a significant_text sub-aggregation to group the hits by text similarity, and the query gives me something like:
...,
"aggregations": {
  "book_name_sample": {
    "doc_count": 10,
    "frequent_words": {
      "doc_count": 10,
      "bg_count": 482626,
      "buckets": [
        {
          "key": "George",
          "doc_count": 3,
          "score": 17.278715785140975,
          "bg_count": 9718
        },
        {
          "key": "Washington",
          "doc_count": 3,
          "score": 15.312204414323656,
          "bg_count": 10919
        }
      ]
    }
  }
}
Is it possible to filter the returned documents based on this aggregation result within Elasticsearch, i.e. remove hits whose book_name_sample doc_count is less than X? I know I can do this in PHP or whatever language consumes the hits, but I'd like to keep it within ES. I've tried using a bucket_selector aggregation like so:
"book_name_bucket_filter": {
"bucket_selector": {
"buckets_path": {
"freqWords": "frequent_words"
},
"script": "params.freqWords < 3"
}
}
But then I get an error: org.elasticsearch.search.aggregations.bucket.sampler.InternalSampler cannot be cast to org.elasticsearch.search.aggregations.InternalMultiBucketAggregation
Also, if that filter removes enough documents that the hit count is less than the requested size, is it possible to tell ES to go fetch the next top-scoring hits so that the hit count is filled out?
Why not use top_hits inside the aggregation to get the relevant documents that match each bucket? You can specify how many relevant top hits you want inside the top_hits aggregation, so this will give you a certain number of documents for each bucket; see the sketch below.
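A minimal sketch of that suggestion. One caveat: significant_text does not support sub-aggregations, so the top_hits has to hang off a bucketing aggregation that does; here a terms aggregation on a hypothetical book_name.keyword sub-field stands in for the grouping:
"aggs": {
  "book_name_sample": {
    "sampler": {
      "shard_size": 10
    },
    "aggs": {
      "by_name": {
        "terms": { "field": "book_name.keyword", "size": 10 },
        "aggs": {
          "top": { "top_hits": { "size": 1 } }
        }
      }
    }
  }
}
With "size": 1 in top_hits, each name bucket contributes a single representative document.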

Calculate millions of adjacent records and summarize them in Elasticsearch

I'd like to calculate the differences between millions of adjacent records and summarize them at the end in Elasticsearch. How can I do this?
Data from six documents in Elasticsearch:
10
20
-30
10
30
100
Calculation:
10 to 20 is 10
20 to -30 is -50
-30 to 10 is 40
10 to 30 is 20
30 to 100 is 70
The total is:
10 + (-50) + 40 + 20 + 70 = 90
How would I do a query with the REST API / RestHighLevelClient to achieve this?
Generic case
Most likely the only reasonable way to do this in Elasticsearch is to denormalize and index already-computed deltas. In this case you will only need a simple sum aggregation; see the sketch below.
This is because data in Elasticsearch is "flat", so it does not know that your documents are adjacent. It excels when all you need to know is already in the document at index time: in this case special indexes are pre-built and aggregations are very fast.
It is like the Discworld from Pratchett's novels, the flat world carried by Great A'Tuin: some basic physics, like JOINs from an RDBMS, do not work, but magic is possible.
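A minimal sketch of the denormalized approach, assuming each document is indexed with a precomputed delta field holding the difference from its predecessor (the index name readings and the field names are illustrative):
POST /readings/_doc
{
  "val": 20,
  "delta": 10
}

POST /readings/_search
{
  "size": 0,
  "aggs": {
    "total": {
      "sum": { "field": "delta" }
    }
  }
}
Summing the delta fields (10 + (-50) + 40 + 20 + 70) yields the 90 from the example directly.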
Time series-specific case
In case when you have a time series you can achieve your goal with a combination of Serial Differencing and Sum Bucket sibling aggregations.
In order to use this approach you would need to aggregate on some date field. Imagine you have a mapping like this:
PUT time_diff
{
  "mappings": {
    "doc": {
      "properties": {
        "eventTime": {
          "type": "date"
        },
        "val": {
          "type": "integer"
        }
      }
    }
  }
}
And a document per day which looks like this:
POST /time_diff/doc/1
{
  "eventTime": "2018-01-01",
  "val": 10
}
POST /time_diff/doc/2
{
  "eventTime": "2018-01-02",
  "val": 20
}
Then with a query like this:
POST /time_diff/doc/_search
{
  "size": 0,
  "aggs": {
    "my_date_histo": {
      "date_histogram": {
        "field": "eventTime",
        "interval": "day"
      },
      "aggs": {
        "the_sum": {
          "sum": {
            "field": "val"
          }
        },
        "my_diff": {
          "serial_diff": {
            "buckets_path": "the_sum"
          }
        }
      }
    },
    "my_sum": {
      "sum_bucket": {
        "buckets_path": "my_date_histo>my_diff"
      }
    }
  }
}
The response will look like:
{
  ...
  "aggregations": {
    "my_date_histo": {
      "buckets": [
        ...
        {
          "key_as_string": "2018-01-02T00:00:00.000Z",
          "key": 1514851200000,
          "doc_count": 1,
          "my_diff": {
            "value": 10
          }
        },
        ...
      ]
    },
    "my_sum": {
      "value": 90
    }
  }
}
This method, though, has obvious limitations:
only works if you have time series data
only correct if you have exactly 1 data point per date bucket (a day in example)
will explode in memory consumption if you have many points (millions as you mentioned)
Hope that helps!

Retrieve document frequency for terms in query result with aggregations

For some of my queries to ElasticSearch I want three pieces of information back:
Which terms T occurred in the result document set?
How often does each element of T occur in the result document set?
How often does each element of T occur in the entire index (--> document frequency)?
The first two points are easily determined using the default terms facet or, nowadays, the terms aggregation.
So my question is really about the third point.
Before ElasticSearch 1.x, i.e. before the switch to the 'aggregation' paradigm, I could use a term facet with the 'global' option set to true and a QueryFilter to get the document frequency ('global counts') of the exact terms occurring in the document set specified by the QueryFilter.
At first I thought I could do the same thing using a global aggregation, but it seems I can't. The reason is - if I understand correctly - that the original facet mechanism was centered around terms, whereas aggregation buckets are defined by the set of documents belonging to each bucket.
I.e. specifying the global option of a terms facet together with a QueryFilter first determined the terms hit by the filter and then computed the facet values. Since the facet was global, I would receive the document counts.
With aggregations, it's different. The global aggregation can only be used as a top-level aggregation, causing it to ignore the current query results and compute its sub-aggregations - e.g. a terms aggregation - on all documents in the index. For me that's too much, since I want to restrict the returned terms ('buckets') to the terms in the document result set. But if I use a filter sub-aggregation with a terms sub-aggregation, I restrict the term buckets to the filter again, thus retrieving normal facet counts rather than document frequencies. The reason is that the buckets are determined after the filter, so they are "too small". But I don't want to restrict the bucket size; I want to restrict the buckets to the terms in the query result set.
How can I get the document frequency of those terms in a query result set using aggregations (since facets are deprecated and will be removed)?
Thanks for your time!
EDIT: Here comes an example of how I tried to achieve the desired behaviour.
I will define two aggregations:
global_agg_with_filter_and_terms
global_agg_with_terms_and_filter
Both have a global aggregation at the top because it's the only valid position for it. Then, in the first aggregation, I first filter the results to the original query and then apply a terms sub-aggregation.
In the second aggregation, I do mostly the same, only that here the filter aggregation is a sub-aggregation of the terms aggregation. Hence the similar names; only the order of aggregation differs.
{
  "query": {
    "query_string": {
      "query": "text: my query string"
    }
  },
  "aggs": {
    "global_agg_with_filter_and_terms": {
      "global": {},
      "aggs": {
        "filter_agg": {
          "filter": {
            "query": {
              "query_string": {
                "query": "text: my query string"
              }
            }
          },
          "aggs": {
            "terms_agg": {
              "terms": {
                "field": "facets"
              }
            }
          }
        }
      }
    },
    "global_agg_with_terms_and_filter": {
      "global": {},
      "aggs": {
        "document_frequency": {
          "terms": {
            "field": "facets"
          },
          "aggs": {
            "term_count": {
              "filter": {
                "query": {
                  "query_string": {
                    "query": "text: my query string"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
Response:
{
  "took": 18,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 221,
    "max_score": 0.9839197,
    "hits": <omitted>
  },
  "aggregations": {
    "global_agg_with_filter_and_terms": {
      "doc_count": 1978,
      "filter_agg": {
        "doc_count": 221,
        "terms_agg": {
          "doc_count_error_upper_bound": 0,
          "sum_other_doc_count": 0,
          "buckets": [
            { "key": "fid8",  "doc_count": 155 },
            { "key": "fid6",  "doc_count": 40 },
            { "key": "fid9",  "doc_count": 10 },
            { "key": "fid5",  "doc_count": 9 },
            { "key": "fid13", "doc_count": 5 },
            { "key": "fid7",  "doc_count": 2 }
          ]
        }
      }
    },
    "global_agg_with_terms_and_filter": {
      "doc_count": 1978,
      "document_frequency": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          { "key": "fid8",  "doc_count": 1050, "term_count": { "doc_count": 155 } },
          { "key": "fid6",  "doc_count": 668,  "term_count": { "doc_count": 40 } },
          { "key": "fid9",  "doc_count": 67,   "term_count": { "doc_count": 10 } },
          { "key": "fid5",  "doc_count": 65,   "term_count": { "doc_count": 9 } },
          { "key": "fid7",  "doc_count": 63,   "term_count": { "doc_count": 2 } },
          { "key": "fid13", "doc_count": 55,   "term_count": { "doc_count": 5 } },
          { "key": "fid10", "doc_count": 11,   "term_count": { "doc_count": 0 } },
          { "key": "fid11", "doc_count": 9,    "term_count": { "doc_count": 0 } },
          { "key": "fid12", "doc_count": 5,    "term_count": { "doc_count": 0 } }
        ]
      }
    }
  }
}
At first, please have a look at the first two term buckets returned by both aggregations, with the keys fid8 and fid6. We can easily see that those terms appear in the result set 155 and 40 times, respectively. Now please look at the second aggregation, global_agg_with_terms_and_filter. The terms aggregation is within the scope of the global aggregation, so here we can actually see the document frequencies, 1050 and 668, respectively. So this part looks good. The issue arises when you scan the list of term buckets further down, to the buckets with the keys fid10 to fid12. While we receive their document frequencies, we can also see that their term_count is 0. This is because those terms did not occur in the query results, which we also used for the filter sub-aggregation. So the problem is that for ALL terms (global scope!) the document frequency and the facet count with respect to the actual query result are returned. But I need this exactly for the terms that occur in the query result, i.e. for the terms returned by the first aggregation, global_agg_with_filter_and_terms.
Perhaps there is a possibility to define some kind of filter that removes all buckets whose sub-filter-aggregation term_count has a doc_count of zero?
Hello and sorry if the answer is late.
You should have a look at the Significant Terms aggregation: like the terms aggregation, it returns one bucket for each term occurring in the result set, with the number of occurrences available through doc_count, but you also get the number of occurrences in a background set through bg_count. This means it only creates buckets for terms appearing in documents of your query result set.
The default background set comprises all documents in the query scope, but can be filtered down to any subset you want using background_filter.
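For example, a sketch restricting the background set with background_filter (the facets field comes from the question; the term filter on a collection field is purely illustrative):
"aggs": {
  "document_frequency": {
    "significant_terms": {
      "field": "facets",
      "background_filter": {
        "term": { "collection": "my_subset" }
      }
    }
  }
}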
You can use a scripted bucket scoring function to rank the buckets the way you want by combining several metrics:
_subset_freq: number of documents the term appears in the results set,
_superset_freq: number of documents the term appears in the background set,
_subset_size: number of documents in the results set,
_superset_size: number of documents in the background set.
Request:
{
  "query": {
    "query_string": {
      "query": "text: my query string"
    }
  },
  "aggs": {
    "terms": {
      "significant_terms": {
        "field": "facets",
        "size": 100,
        "script_heuristic": {
          "script": "_subset_freq"
        }
      }
    }
  }
}
