Elasticsearch - How to down sample large datasets

Elasticsearch - How to down sample large datasets - elasticsearch

I'm trying to figure out how to query a large dataset so I could put it up on a js line chart.
The index has millions of documents and I want to be able to show the entire series even if it's zoomed out.
The mapping kinda looks like this:
{
"counter": {
"type": long // used as kind of a sequential ID
},
"deposits": {
"type": "nested",
"properties": {
"depositA": { "type": "long" },
"depositB": { "type": "long" }
}
}
}
I want to show a line chart where the X axis is the counter values and the Y axis is the sum of the depositA and depositB values.
The dataset has about 7M docs so I'm thinking if I could get ES to return the average of every 7 rows,I could trim that down to 1M points for my chart and still have something that looks sensible. Possibly, even take it down to 100k points?
The problem is I don't really know where to start and I'm just very new to ES.
I tried histogram aggregations but it doesn't seem to be what I'm looking for.
POST /data/_search?size=0
{
"aggs": {
"counters": {
"histogram": {
"field": "counter",
"interval": 50
}
}
}
}
While this returns the counter field in 50 intervals, it also only gives me a doc count in those 50 counters (which I guess is just how histograms work?). I would like to know how to get the average value of depositA+depositB across the 50 items along with the counter keys, if possible.
I'm really over my head here honestly but would love to learn.
If anyone could point me to any helpful information that would be very much appreciated.

In fact, histogram aggregation is the correct way to go, as I see. I think you need to put sub-aggregation on it. Here is an example for you:
POST indexa/_bulk
{"index": {"_id": "1"}}
{"counter": 1, "deposits": {"depositA": 10, "depositB": 15}}
{"index": {"_id": "2"}}
{"counter": 2, "deposits": {"depositA": 12, "depositB": 17}}
{"index": {"_id": "3"}}
{"counter": 3, "deposits": {"depositA": 16, "depositB": 16}}
{"index": {"_id": "4"}}
{"counter": 4, "deposits": {"depositA": 18, "depositB": 18}}
POST indexa/_search
{
"size": 0,
"aggs": {
"range": {
"histogram": {
"field": "counter",
"interval": 2
},
"aggs": {
"nested": {
"nested": {
"path": "deposits"
},
"aggs": {
"scripts": {
"avg": {
"script": {
"lang": "painless",
"source": "return doc['deposits.depositA'].value + doc['deposits.depositB'].value"
}
}
}
}
}
}
}
}
}
I think this will work for you. I put an avg aggregation with sub-aggregation. I used nested aggregation because your deposits field is nested typed.

Related

Elasticsearch filter based on field similarity

For reference, I'm using Elasticsearch 6.4.0
I have a Elasticsearch query that returns a certain number of hits, and I'm trying to remove hits with text field values that are too similar. My query is:
{
"size": 10,
"collapse": {
"field": "author_id"
},
"query": {
"function_score": {
"boost_mode": "replace",
"score_mode": "avg",
"functions": [
{
//my custom query function
}
],
"query": {
"bool": {
"must_not": [
{
"term": {
"author_id": MY_ID
}
}
]
}
}
}
},
"aggs": {
"book_name_sample": {
"sampler": {
"shard_size": 10
},
"aggs": {
"frequent_words": {
"significant_text": {
"field": "book_name",
"filter_duplicate_text": true
}
}
}
}
}
}
This query uses a custom function score combined with a filter to return books a person might like (that they haven't authored). Thing is, for some people, it returns books with names that are very similar (i.e. The Life of George Washington, Good Times with George Washington, Who was George Washington), and I'd like the hits to have a more diverse set of names.
I'm using a bucket_selector to aggregate the hits based on text similarity, and the query gives me something like:
...,
"aggregations": {
"book_name_sample": {
"doc_count": 10,
"frequent_words": {
"doc_count": 10,
"bg_count": 482626,
"buckets": [
{
"key": "George",
"doc_count": 3,
"score": 17.278715785140975,
"bg_count": 9718
},
{
"key": "Washington",
"doc_count": 3,
"score": 15.312204414323656,
"bg_count": 10919
}
]
}
}
}
Is it possible to filter the returned documents based on this aggregation result within Elasticsearch? IE remove hits with book_name_sample doc_count less than X? I know I can do this in PHP or whatever language uses the hits, but I'd like to keep it within ES. I've tried using a bucket_selector aggregator like so:
"book_name_bucket_filter": {
"bucket_selector": {
"buckets_path": {
"freqWords": "frequent_words"
},
"script": "params.freqWords < 3"
}
}
But then I get an error: org.elasticsearch.search.aggregations.bucket.sampler.InternalSampler cannot be cast to org.elasticsearch.search.aggregations.InternalMultiBucketAggregation
Also, if that filter removes enough documents so that the hit count is less than the requested size, is it possible to tell ES to go fetch the next top scoring hits so that hits count is filled out?

Why not use top hits inside the aggregation to get relevant document that match the bucket? You can specify how many relevant top hits you want inside the top hits aggregation. So basically this will give you a certain number of documents for each bucket.

Calculate millions adjacent records and summarize them in Elasticsearch

Id like to calculate millions of adjacent records and summarize them in the end in Elasticsearch. How can I do this?
Documents (six of them) data in Elasticsearch:
10
20
-30
10
30
100
Calculation:
10 to 20 is 10
20 to -30 is -50
-30 to 10 is 40
10 to 30 is 20
30 to 100 is 70
The total is:
10 + (-50) + 40 + 20 + 70 = 90
How would I do a query with REST - RestHighLevelClient API to achive this?

Generic case
Most likely the only reasonable way to do this in Elasticsearch is to denormalize and put into Elasticsearch already computed deltas. In this case you will only need a simple sum aggregation.
This is because data in Elasticsearch is "flat", so it does not know that your documents are adjacent. It excels when all you need to know is already in the document at index time: in this case special indexes are pre-built and aggregations are very fast.
It is like A'tuin, a flat version of Earth from Pratchett's novels: some basic physics, like JOINs from RDBMS, do not work, but magic is possible.
Time series-specific case
In case when you have a time series you can achieve your goal with a combination of Serial Differencing and Sum Bucket sibling aggregations.
In order to use this approach you would need to aggregate on some date field. Imagine you have a mapping like this:
PUT time_diff
{
"mappings": {
"doc": {
"properties": {
"eventTime": {
"type": "date"
},
"val": {
"type": "integer"
}
}
}
}
}
And a document per day which look like this:
POST /time_diff/doc/1
{
"eventTime": "2018-01-01",
"val": 10
}
POST /time_diff/doc/2
{
"eventTime": "2018-01-02",
"val": 20
}
Then with a query like this:
POST /time_diff/doc/_search
{
"size": 0,
"aggs": {
"my_date_histo": {
"date_histogram": {
"field": "eventTime",
"interval": "day"
},
"aggs": {
"the_sum": {
"sum": {
"field": "val"
}
},
"my_diff": {
"serial_diff": {
"buckets_path": "the_sum"
}
}
}
},
"my_sum": {
"sum_bucket": {
"buckets_path": "my_date_histo>my_diff"
}
}
}
}
The response will look like:
{
...
"aggregations": {
"my_date_histo": {
"buckets": [
{
"key_as_string": "2018-01-01T00:00:00.000Z",
"key": 1514764800000,
"doc_count": 1,
"my_delta": {
"value": 10
}
},
...
]
},
"my_sum": {
"value": 90
}
}
}
This method though has obvious limitations:
only works if you have time series data
only correct if you have exactly 1 data point per date bucket (a day in example)
will explode in memory consumption if you have many points (millions as you mentioned)
Hope that helps!

ElasticSearch Aggregate Intersection of GPS

I've strugged to figure this one out. I have records with time and GPS as such:
{ID: 1,Time:"2017-01-1",gps:{lat:38.00,lon:-79.00}},
{ID: 2,Time:"2017-01-1",gps:{lat:38.00,lon:-79.00}},
{ID: 1,Time:"2017-01-2",gps:{lat:39.00,lon:-77.00}},
{ID: 2,Time:"2017-01-2",gps:{lat:20.00,lon:-20.00}},
{ID: 1,Time:"2017-01-3",gps:{lat:20.00,lon:-20.00}},
{ID: 3,Time:"2017-01-1",gps:{lat:20.00,lon:-20.00}},
..........
I have a map that allows drawing circles and selecting regions. Currently, I can easily query and aggregate the records that have appeared in ANY of the locations selected. This is an example:
{
"query": {
"bool": {
"should": [
{
"geo_distance": {
"distance": 56100.0,
"gps": {
"lat": 38,
"lon": -79
}
}
},
{
"geo_distance": {
"distance": 56100.0,
"gps": {
"lat": 39,
"lon": -77
}
}
}
]
}
},
"aggs": {
"by_record_id":{
"terms": {
"field": "id"
}
}
}
}
However, I'm a bit baffled on HOW get the intersection of the selections. (NOTE: the circles are not overlapped). Essentially, I want an aggregate of the records that have had gps values that have appeared in both of the circles and remove any that have only appeared in one or none. For example, with the above records, I would only want an aggregation results for ID=1 (as ID=2 and ID=3 don't appear in both circles).
If I change the query to {"query":{"bool":{"must":[...]}}}, I get no results. Because, obviously, no record appears in 2 locations at the same time.
I've tried many different things with queries including function_score (putting each location in functions) and utilizing the scores (based on different score types). In addition, I've tried many different aggregate combinations including filtering with top_hits, cardinality (with precision_threshold), bucket_selector with cardinality.
This seems super easy and obvious in SQL. Please help an elasticsearch nube.

Got the answer!
"aggs": {
"ids": {
"terms": {
"field": "ID"
},
"aggs": {
"the_filter": {
"bucket_selector": {
"buckets_path": {
"the_doc_count": "_count"
},
"script": "params.the_doc_count >= 2"
}
}
}
}
}

Filter/aggregate one elasticsearch index of time series data by timestamps found in another index

The Data
So I have reams of different types of time series data. Currently i've chosen to put each type of data into their own index because with the exception of 4 fields, all of the data is very different. Also the data is sampled at different rates and are not guaranteed to have common timestamps across the same sub-second window so fusing them all into one large document is also not a trivial task.
The Goal
One of our common use cases that i'm trying to see if I can solve entirely in Elasticsearch is to return an aggregation result of one index based on the time windows returned from a query of another index. Pictorially:
This is what I want to accomplish.
Some Considerations
For small enough signal transitions on the "condition" data, I can just use a date histogram and some combination of a top hits sub aggregation, but this quickly breaks down when I have 10,000's or 100,000's of occurrences of "the condition". Further this is just one "case", I have 100's of sets of similar situations that i'd like to get the overall min/max from.
The comparisons are basically amongst what I would consider to be sibling level documents or indices, so there doesn't seem to be any obvious parent->child relationship that would be flexible enough over the long run, at least with how the data is currently structured.
It feels like there should be an elegant solution instead of brute force building the date ranges outside of Elasticsearch with the results of one query and feeding 100's of time ranges into another query.
Looking through the documentation it feels like some combination of Elasticsearch scripting and some of the pipelined aggregations are going to be what i want, but no definitive solutions are jumping out at me. I could really use some pointers in the right direction from the community.
Thanks.

I found a "solution" that worked for me for this problem. No answers or even comments from anyone yet, but i'll post my solution in case someone else comes along looking for something like this. I'm sure there is a lot of opportunity for improvement and optimization and if I discover such a solution (likely through a scripted aggregation) i'll come back and update my solution.
It may not be the optimal solution but it works for me. The key was to leverage the top_hits, serial_diff and bucket_selector aggregators.
The "solution"
def time_edges(index, must_terms=[], should_terms=[], filter_terms=[], data_sample_accuracy_window=200):
"""
Find the affected flights and date ranges where a specific set of terms occurs in a particular ES index.
index: the Elasticsearch index to search
terms: a list of dictionaries of form { "term": { "<termname>": <value>}}
"""
query = {
"size": 0,
"timeout": "5s",
"query": {
"constant_score": {
"filter": {
"bool": {
"must": must_terms,
"should": should_terms,
"filter": filter_terms
}
}
}
},
"aggs": {
"by_flight_id": {
"terms": {"field": "flight_id", "size": 1000},
"aggs": {
"last": {
"top_hits": {
"sort": [{"#timestamp": {"order": "desc"}}],
"size": 1,
"script_fields": {
"timestamp": {
"script": "doc['#timestamp'].value"
}
}
}
},
"first": {
"top_hits": {
"sort": [{"#timestamp": {"order": "asc"}}],
"size": 1,
"script_fields": {
"timestamp": {
"script": "doc['#timestamp'].value"
}
}
}
},
"time_edges": {
"histogram": {
"min_doc_count": 1,
"interval": 1,
"script": {
"inline": "doc['#timestamp'].value",
"lang": "painless",
}
},
"aggs": {
"timestamps": {
"max": {"field": "#timestamp"}
},
"timestamp_diff": {
"serial_diff": {
"buckets_path": "timestamps",
"lag": 1
}
},
"time_delta_filter": {
"bucket_selector": {
"buckets_path": {
"timestampDiff": "timestamp_diff"
},
"script": "if (params != null && params.timestampDiff != null) { params.timestampDiff > " + str(data_sample_accuracy_window) + "} else { false }"
}
}
}
}
}
}
}
}
return es.search(index=index, body=query)
Breaking things down
Get filter the results by 'Index 2'
"query": {
"constant_score": {
"filter": {
"bool": {
"must": must_terms,
"should": should_terms,
"filter": filter_terms
}
}
}
},
must_terms is the required value to be able to get all the results for "the condition" stored in "Index 2".
For example, to limit results to only the last 10 days and when condition is the value 10 or 12 we add the following must_terms
must_terms = [
{
"range": {
"#timestamp": {
"gte": "now-10d",
"lte": "now"
}
}
},
{
"terms": {"condition": [10, 12]}
}
]
This returns a reduced set of documents that we can then pass on into our aggregations to figure out where our "samples" are.
Aggregations
For my use case we have the notion of "flights" for our aircraft, so I wanted to group the returned results by their id and then "break up" all the occurences into buckets.
"aggs": {
"by_flight_id": {
"terms": {"field": "flight_id", "size": 1000},
...
}
}
}
You can get the rising edge of the first occurence and the falling edge of the last occurence using the top_hits aggregation
"last": {
"top_hits": {
"sort": [{"#timestamp": {"order": "desc"}}],
"size": 1,
"script_fields": {
"timestamp": {
"script": "doc['#timestamp'].value"
}
}
}
},
"first": {
"top_hits": {
"sort": [{"#timestamp": {"order": "asc"}}],
"size": 1,
"script_fields": {
"timestamp": {
"script": "doc['#timestamp'].value"
}
}
}
},
You can get the samples in between using a histogram on a timestamp. This breaks up your returned results into buckets for every unique timestamp. This is a costly aggregation, but worth it. Using the inline script allows us to use the timestamp value for the bucket name.
"time_edges": {
"histogram": {
"min_doc_count": 1,
"interval": 1,
"script": {
"inline": "doc['#timestamp'].value",
"lang": "painless",
}
},
...
}
By default the histogram aggregation returns a set of buckets with the document count for each bucket, but we need a value. This is what is required for serial_diff aggregation to work, so we have to do a token max aggregation on the results to get a value returned.
"aggs": {
"timestamps": {
"max": {"field": "#timestamp"}
},
"timestamp_diff": {
"serial_diff": {
"buckets_path": "timestamps",
"lag": 1
}
},
...
}
We use the results of the serial_diff to determine whether or not two bucket are approximately adjacent. We then discard samples that are adjacent to eachother and create a combined time range for our condition by using the bucket_selector aggregation. This will throw out buckets that are smaller than our data_sample_accuracy_window. This value is dependent on your dataset.
"aggs": {
...
"time_delta_filter": {
"bucket_selector": {
"buckets_path": {
"timestampDiff": "timestamp_diff"
},
"script": "if (params != null && params.timestampDiff != null) { params.timestampDiff > " + str(data_sample_accuracy_window) + "} else { false }"
}
}
}
The serial_diff results are also critical for us to determine how long our condition was set. The timestamps of our buckets end up representing the "rising" edge of our condition signal so the falling edge is unknown without some post-processing. We use the timestampDiff value to figure out where the falling edge is.

Very slow elasticsearch term aggregation. How to improve?

We have ~20M (hotel offers) documents stored in elastic(1.6.2) and the point is to group documents by multiple fields (duration, start_date, adults, kids) and select one cheapest offer out of each group. We have to sort those results by cost field.
To avoid sub-aggregations we have united target fields values into one called default_group_field by joining them with dot(.).
Mapping for the field looks like this:
"default_group_field": {
"index": "not_analyzed",
"fielddata": {
"loading": "eager_global_ordinals"
},
"type": "string"
}
Query we perform looks like this:
{
"size": 0,
"aggs": {
"offers": {
"terms": {
"field": "default_group_field",
"size": 5,
"order": {
"min_sort_value": "asc"
}
},
"aggs": {
"min_sort_value": {
"min": {
"field": "cost"
}
},
"cheapest": {
"top_hits": {
"_source": {}
},
"sort": {
"cost": "asc"
},
"size": 1
}
}
}
}
},
"query": {
"filtered": {
"filter": {
"and": [
...
]
}
}
}
}
The problem is that such query takes seconds (2-5sec) to load.
However once we perform query without aggregations we get a moderate amount of results (say "total": 490) in under 100ms.
{
"took": 53,
"timed_out": false,
"_shards": {
"total": 6,
"successful": 6,
"failed": 0
},
"hits": {
"total": 490,
"max_score": 1,
"hits": [...
But with aggregation it take 2sec :
{
"took": 2158,
"timed_out": false,
"_shards": {
"total": 6,
"successful": 6,
"failed": 0
},
"hits": {
"total": 490,
"max_score": 0,
"hits": [
]
},...
It seems like it should not take so long to process that moderate amount filtered documents and select the cheapest one out of every group. It could be done inside application, which seems an ugly hack for me.
The log is full of lines stating:
[DEBUG][index.fielddata.plain ] [Karen Page] [offers] Global-ordinals[default_group_field][2564761] took 2453 ms
That is why we updated our mapping to perform eager global_ordinals rebuild on index update, however this did not make notable impact on query timings.
Is there any way to speedup such aggregation, or maybe a way to tell elastic to do aggregation on filtered documents only.
Or maybe there is another source of such a long query execution? Any ideas highly appreciated!

thanks again for the effort.
Finally we have solved the main problem and our performance is back to normal.
To be short we have done the following:
- updated the mapping for the default_group_field to be of type Long
- compressed the default_group_field values so that it would match type Long
Some explanations:
Aggregations on string fields require some work work be done on them. As we see from logs building Global Ordinals for that field that has very wide variance was very expensive. In fact we do only aggregations on the field mentioned. With that said it is not very efficient to use String type.
So we have changed the mapping to:
default_group_field: {
type: 'long',
index: 'not_analyzed'
}
This way we do not touch those expensive operations.
After this and the same query timing reduced to ~100ms. It also dropped down CPU usage.
PS 1
I`ve got a lot of info from docs on global ordinals
PS 2
Still I have no idea on how to bypass this issue with the field of type String. Please comment if you have some ideas.

This is likely due to the the default behaviour of terms aggregations, which requires global ordinals to be built. This computation can be expensive for high-cardinality fields.
The following blog addresses the likely cause of this poor performance and several approaches to resolve it.
https://www.elastic.co/blog/improving-the-performance-of-high-cardinality-terms-aggregations-in-elasticsearch

Ok. I will try to answer this,
There are few parts in the question which I was not able to understand like -
To avoid sub-aggregations we have united target fields values into one called default_group_field by joining them with dot(.)
I am not sure what you really mean by this because you said that,
You added this field to avoid aggregation(But how? and also how are you avoiding the aggregation if you are joining them with dot(.)?)
Ok. Even I am also new to elastic search. So If there is anything I missed, you can comment on this answer. Thanks,
I will continue to answer this question.
But before that I am assuming that you have
that(default_group_field) field to differentiate between records
duration, start_date, adults, kids.
I will try to provide one example below after my solution.
My solution:
{
"size": 0,
"aggs": {
"offers": {
"terms": {
"field": "default_group_field"
},
"aggs": {
"sort_cost_asc": {
"top_hits": {
"sort": [
{
"cost": {
"order": "asc"
}
}
],
"_source": {
"include": [ ... fields you want from the document ... ]
},
"size": 1
}
}
}
}
},
"query": {
"... your query part ..."
}
}
I will try to explain what I am trying to do here:
I am assuming that your document looks like this (may be there is some nesting also, But for example I am trying to keep the document as simple as I can):
document1:
{
"default_group_field": "kids",
"cost": 100,
"documentId":1
}
document2:
{
"default_group_field": "kids",
"cost": 120,
"documentId":2
}
document3:
{
"default_group_field": "adults",
"cost": 50,
"documentId":3
}
document4:
{
"default_group_field": "adults",
"cost": 150,
"documentId":4
}
So now you have this documents and you want to get the min. cost document for both adults and kids:
so your query should look like this:
{
"size": 0,
"aggs": {
"offers": {
"terms": {
"field": "default_group_field"
},
"aggs": {
"sort_cost_asc": {
"top_hits": {
"sort": [
{
"cost": {
"order": "asc"
}
}
],
"_source": {
"include": ["documentId", "cost", "default_group_field"]
},
"size": 1
}
}
}
}
},
"query": {
"filtered":{ "query": { "match_all": {} } }
}
}
To explain the above query, what I am doing is grouping the document by "default_group_field" and then I am sorting each group by cost and size:1 helps me to get the just one document.
Therefore the result for this query will be min. cost document in each category (adults and kids)
Usually when I try to write the query for elastic search or db. I try to minimize the number of document or rows.
I assume that I am right in understanding your question.
If I am wrong in understanding your question or I did some mistake, Please reply and let me know where I went wrong.
Thanks,

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio