TL;DR: what is the Elasticsearch equivalent to this Postgres query?
SELECT latest_pipeline_logs.* FROM (
  SELECT pipeline_logs.*,
    rank() OVER (
      PARTITION BY pipeline_name
      ORDER BY updated_at DESC
    )
  FROM pipeline_logs
) latest_pipeline_logs WHERE RANK = 1
I have hundreds of ETL pipelines with logs that are dumped into Elasticsearch. They are each executed independently at different intervals. I would like to derive a simple health status for each of my ETL pipelines using Elasticsearch aggregations.
Every pipeline logs its state on execution. My current thought process is to determine the health of each pipeline based on the two most important states that occur: succeeded and failed.
I know I can make an aggregation query and group by each pipeline with a sub-aggregation for statuses. For example, something along the lines of this:
{
  ...
  "aggs": {
    "pipelines": {
      "terms": {
        "field": "pipeline_name"
      },
      "aggs": {
        "states": {
          "terms": {
            "field": "pipeline_state"
          }
        }
      }
    }
  }
}
The problem with the above example is that I could get several states per pipeline because of the time-series dataset, such as this:
{
  "key": "some-pipeline-name",
  "buckets": [
    {
      "key": "succeeded",
      "doc_count": 123
    },
    {
      "key": "failed",
      "doc_count": 567
    }
  ]
}
I could theoretically filter the results based on the date the pipeline is executed, but because some pipelines run every other month or so, I don't think this is an option.
The end state is to drive a simple dashboard using an Elasticsearch result set that looks something like this:
[
  {
    "key": "some-pipeline-name",
    "latest-status": "succeeded"
  },
  {
    "key": "some-other-pipeline",
    "latest-status": "failed"
  }
]
One thing to note is that in this use case the historical data is not important. The dashboard will simply convey the latest state for each pipeline.
How would you go about achieving this with Elasticsearch?
If you're only interested in the latest status for each pipeline, you could use top_hits as a sub-aggregation and then sort on time:
{
  "size": 0,
  "aggs": {
    "pipeline": {
      "terms": {
        "field": "pipeline_name",
        "size": 1000
      },
      "aggs": {
        "top_hits_status": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "timestamp": {
                  "order": "desc"
                }
              }
            ],
            "_source": {
              "includes": [
                "pipeline_state"
              ]
            }
          }
        }
      }
    }
  }
}
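To turn that response into the dashboard shape shown earlier, a little client-side flattening is enough. Here is a minimal sketch, assuming the official Python client and an index named pipeline_logs (both the client setup and the index name are assumptions, not from the question):
from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumed client; point it at your cluster

# The same aggregation as above, expressed as a Python dict.
query = {
    "size": 0,
    "aggs": {
        "pipeline": {
            "terms": {"field": "pipeline_name", "size": 1000},
            "aggs": {
                "top_hits_status": {
                    "top_hits": {
                        "size": 1,
                        "sort": [{"timestamp": {"order": "desc"}}],
                        "_source": {"includes": ["pipeline_state"]}
                    }
                }
            }
        }
    }
}

response = es.search(index="pipeline_logs", body=query)  # index name assumed

# Each pipeline bucket carries exactly one hit: its most recent log entry.
dashboard = [
    {
        "key": bucket["key"],
        "latest-status": bucket["top_hits_status"]["hits"]["hits"][0]["_source"]["pipeline_state"]
    }
    for bucket in response["aggregations"]["pipeline"]["buckets"]
]
print(dashboard)  # e.g. [{'key': 'some-pipeline-name', 'latest-status': 'succeeded'}, ...]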
Related
I run an aggregation on 2 indices: idx-2020-07-21, idx-2020-07-22
The target:
Get all documents, but in the case of duplicate ids (about 50% are duplicated), get the one from the latest index, using the index name.
This is the query I'm running:
{
  "size": 0,
  "aggregations": {
    "latest_item": {
      "composite": {
        "size": 1000,
        "sources": [
          {
            "product": {
              "terms": {
                "field": "_id",
                "missing_bucket": false,
                "order": "asc"
              }
            }
          }
        ]
      },
      "aggregations": {
        "max_date": {
          "top_hits": {
            "from": 0,
            "size": 1,
            "version": false,
            "explain": false,
            "sort": [
              {
                "_index": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}
Each index is 8G in size with ~1M docs, on ES version 7.5, and it takes around 8 minutes to aggregate. Most of the time I get
{"error":{"root_cause":[{"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [<http_request>] would be [32933676058/30.6gb], which is larger than the limit of [32641751449/30.3gb].
Is there a better way to write this query?
How do I deal with this exception?
I run a Java job that queries ES every 10 minutes, and I noticed it happens a lot on the second run. Do I need to release any resources or something? I use restHighLevelClient.searchAsync() with a listener that calls again with the next key until I get null.
The cluster has 3 nodes, 32G each.
I tried playing with the bucket size; it didn't help much.
Thanks!
We are currently using Elasticsearch 6.7 and have a huge amount of data, which makes some requests take too much time.
To avoid this problem, we want to set up pagination within our Elasticsearch searches. The problem is that I can't apply any of the pagination methods proposed by ES to the various requests that already exist.
For example, this request contains different aggregations and a query:
https://github.com/trackit/trackit/blob/master/usageReports/lambda/es_request_constructor.go#L61-L75
In addition, the results are sorted after the information is collected.
I tried to set up the search_after method as well as a form of pagination using from & size.
Scroll doesn't work with aggregations, and composite aggregation doesn't accept a query.
So, is there any good way to do pagination in Elasticsearch combined with other request types, and how would it work with the example above?
composite aggregation doesn't accept query
It does accept a query. In the example below, the results are filtered based on play_name. The aggregation only gets applied to the result of the query, and it can be paginated using the after option.
{
  "query": {
    "term": {
      "play_name": "A Winters Tale"
    }
  },
  "size": 0,
  "aggs": {
    "speaker": {
      "composite": {
        "after": {
          "product": "FLORIZEL"
        },
        "sources": [
          {
            "product": {
              "terms": {
                "field": "speaker"
              }
            }
          }
        ]
      },
      "aggs": {
        "speech_number": {
          "terms": {
            "field": "speech_number"
          },
          "aggs": {
            "line_id": {
              "terms": {
                "field": "line_id"
              }
            }
          }
        }
      }
    }
  }
}
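Each response then carries an after_key that you feed back as after to fetch the next page. A minimal sketch of that loop, assuming the Python client and the shakespeare sample index (both are assumptions for illustration):
from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumed client

query = {
    "query": {"term": {"play_name": "A Winters Tale"}},
    "size": 0,
    "aggs": {
        "speaker": {
            "composite": {
                "size": 100,
                "sources": [{"product": {"terms": {"field": "speaker"}}}]
            }
        }
    }
}

all_buckets = []
while True:
    response = es.search(index="shakespeare", body=query)  # index name assumed
    agg = response["aggregations"]["speaker"]
    all_buckets.extend(agg["buckets"])
    # The last page returns no buckets and no after_key.
    if not agg.get("buckets") or "after_key" not in agg:
        break
    # Resume the next page where this one left off.
    query["aggs"]["speaker"]["composite"]["after"] = agg["after_key"]
Note that the query part stays the same on every page; only the after cursor moves.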
I have simple data as
sales, date_of_sales
What I need is the average per week, i.e. sum(sales) / number of weeks.
Please help.
What I have so far is:
{
  "size": 0,
  "aggs": {
    "WeekAggergation": {
      "date_histogram": {
        "field": "date_of_sales",
        "interval": "week"
      }
    },
    "TotalSales": {
      "sum": {
        "field": "sales"
      }
    },
    "myValue": {
      "bucket_script": {
        "buckets_path": {
          "myGP": "TotalSales",
          "myCount": "WeekAggergation._bucket_count"
        },
        "script": "params.myGP/params.myCount"
      }
    }
  }
}
I get the error:
Invalid pipeline aggregation named [myValue] of type [bucket_script].
Only sibling pipeline aggregations are allowed at the top level.
I think this may help:
{
  "size": 0,
  "aggs": {
    "WeekAggergation": {
      "date_histogram": {
        "field": "date_of_sale",
        "interval": "week",
        "format": "yyyy-MM-dd"
      },
      "aggs": {
        "TotalSales": {
          "sum": {
            "field": "sales"
          }
        },
        "AvgSales": {
          "avg": {
            "field": "sales"
          }
        }
      }
    },
    "avg_all_weekly_sales": {
      "avg_bucket": {
        "buckets_path": "WeekAggergation>TotalSales"
      }
    }
  }
}
Note that the TotalSales aggregation is now nested under the weekly histogram aggregation (I believe there was a typo in the code provided: the simple schema indicated the field name date_of_sale, while the aggregation used the plural form date_of_sales). This gives you a total of all sales in each weekly bucket.
Additionally, AvgSales is a similar nested aggregation under the weekly histogram, so you can see the average of the sales specific to each week.
Finally, the pipeline aggregation avg_all_weekly_sales gives the average of weekly sales based on the TotalSales buckets and the number of non-empty buckets. If you want to include empty buckets, add the gap_policy parameter like so:
...
"avg_all_weekly_sales": {
  "avg_bucket": {
    "buckets_path": "WeekAggergation>TotalSales",
    "gap_policy": "insert_zeros"
  }
}
...
(see: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-pipeline-avg-bucket-aggregation.html).
This pipeline aggregation may or may not be what you're actually looking for, so please check the math to ensure the result is what you expect, but it should provide the correct output based on the original script.
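As a quick, hypothetical sanity check of the two gap policies: suppose the histogram produced weekly totals of 100 and 200 plus one empty week. The default skip policy ignores the empty bucket, while insert_zeros counts it as zero:
weekly_totals = [100, 200, None]  # None stands for an empty weekly bucket

# gap_policy "skip" (the default): empty buckets are left out of the average
skipped = [v for v in weekly_totals if v is not None]
print(sum(skipped) / len(skipped))  # 150.0

# gap_policy "insert_zeros": empty buckets count as 0
zeroed = [v if v is not None else 0 for v in weekly_totals]
print(sum(zeroed) / len(zeroed))  # 100.0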
I have some time-based, nested data from which I would like to get the biggest changes, positive and negative, in plugins per month. I work with Elasticsearch 5.3 (and Kibana 5.3).
A document is structured as follows:
{
  "_id": "xxx",
  "#timestamp": 1508244365987,
  "siteURL": "www.foo.bar",
  "plugins": [
    {
      "name": "foo",
      "version": "3.1.4"
    },
    {
      "name": "baz",
      "version": "13.37"
    }
  ]
}
However, per id (siteURL), I have multiple entries per month, and I would like to use only the latest per time bucket, to avoid unfair weighting.
I tried to solve this by using the following aggregation:
{
  "aggs": {
    "normal_dates": {
      "date_range": {
        "field": "#timestamp",
        "ranges": [
          {
            "from": "now-1y/d",
            "to": "now"
          }
        ]
      },
      "aggs": {
        "date_histo": {
          "date_histogram": {
            "field": "#timestamp",
            "interval": "month"
          },
          "aggs": {
            "top_sites": {
              "terms": {
                "field": "siteURL.keyword",
                "size": 50000
              },
              "aggs": {
                "top_plugin_hits": {
                  "top_hits": {
                    "sort": [
                      {
                        "#timestamp": {
                          "order": "desc"
                        }
                      }
                    ],
                    "_source": {
                      "includes": [
                        "plugins.name"
                      ]
                    },
                    "size": 1
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
Now I get, per month, the latest entry per site and its plugins. Next I would like to turn the data inside out and get the plugins present per month with a count of their occurrences. Then I would use a serial_diff to compare months.
However, I don't know how to get from my aggregation to the serial_diff, i.e. how to turn the data inside out.
Any help would be most welcome.
PS: extra kudos if I can get it in a Kibana 5.3 table...
It turns out it is not possible to further aggregate on a top_hits query.
I ended up loading the results of the posted query into Python and used Python for further processing and visualization.
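For anyone attempting the same, a rough sketch of what that Python post-processing could look like, assuming the response of the posted query is in response (the variable names and the month-over-month diff are illustrative, not the exact code used):
from collections import Counter

# Walk the aggregation tree: date range -> monthly histogram -> site -> latest hit.
monthly_counts = {}
range_buckets = response["aggregations"]["normal_dates"]["buckets"]
for month_bucket in range_buckets[0]["date_histo"]["buckets"]:
    month = month_bucket.get("key_as_string", month_bucket["key"])
    counts = Counter()
    for site in month_bucket["top_sites"]["buckets"]:
        latest = site["top_plugin_hits"]["hits"]["hits"][0]
        for plugin in latest["_source"].get("plugins", []):
            counts[plugin["name"]] += 1
    monthly_counts[month] = counts

# Diff consecutive months to find the biggest positive and negative changes.
months = sorted(monthly_counts)
for prev, curr in zip(months, months[1:]):
    names = set(monthly_counts[prev]) | set(monthly_counts[curr])
    diff = {n: monthly_counts[curr][n] - monthly_counts[prev][n] for n in names}
    ranked = sorted(diff.items(), key=lambda kv: kv[1])
    if ranked:
        print(curr, "biggest drop:", ranked[0], "biggest gain:", ranked[-1])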
The Data
So I have reams of different types of time-series data. Currently I've chosen to put each type of data into its own index because, with the exception of 4 fields, all of the data is very different. Also, the data is sampled at different rates and is not guaranteed to have common timestamps across the same sub-second window, so fusing it all into one large document is not a trivial task either.
The Goal
One of our common use cases that I'm trying to see if I can solve entirely in Elasticsearch is to return an aggregation result of one index based on the time windows returned from a query of another index. Pictorially:
This is what I want to accomplish.
Some Considerations
For small enough signal transitions in the "condition" data, I can just use a date histogram and some combination of a top_hits sub-aggregation, but this quickly breaks down when I have 10,000s or 100,000s of occurrences of "the condition". Further, this is just one "case"; I have 100s of sets of similar situations from which I'd like to get the overall min/max.
The comparisons are basically amongst what I would consider to be sibling-level documents or indices, so there doesn't seem to be any obvious parent->child relationship that would be flexible enough over the long run, at least with how the data is currently structured.
It feels like there should be an elegant solution instead of brute-force building the date ranges outside of Elasticsearch with the results of one query and feeding 100s of time ranges into another query.
Looking through the documentation, it feels like some combination of Elasticsearch scripting and the pipeline aggregations is going to be what I want, but no definitive solution is jumping out at me. I could really use some pointers in the right direction from the community.
Thanks.
I found a "solution" that worked for me for this problem. No answers or even comments from anyone yet, but i'll post my solution in case someone else comes along looking for something like this. I'm sure there is a lot of opportunity for improvement and optimization and if I discover such a solution (likely through a scripted aggregation) i'll come back and update my solution.
It may not be the optimal solution but it works for me. The key was to leverage the top_hits, serial_diff and bucket_selector aggregators.
The "solution"
from elasticsearch import Elasticsearch

# Client setup (assumed; point this at your own cluster).
es = Elasticsearch()


def time_edges(index, must_terms=[], should_terms=[], filter_terms=[], data_sample_accuracy_window=200):
    """
    Find the affected flights and date ranges where a specific set of terms occurs in a particular ES index.
    index: the Elasticsearch index to search
    must_terms/should_terms/filter_terms: lists of dictionaries of form { "term": { "<termname>": <value>}}
    """
    query = {
        "size": 0,
        "timeout": "5s",
        "query": {
            "constant_score": {
                "filter": {
                    "bool": {
                        "must": must_terms,
                        "should": should_terms,
                        "filter": filter_terms
                    }
                }
            }
        },
        "aggs": {
            "by_flight_id": {
                "terms": {"field": "flight_id", "size": 1000},
                "aggs": {
                    # Last and first samples overall, for the outer edges.
                    "last": {
                        "top_hits": {
                            "sort": [{"#timestamp": {"order": "desc"}}],
                            "size": 1,
                            "script_fields": {
                                "timestamp": {
                                    "script": "doc['#timestamp'].value"
                                }
                            }
                        }
                    },
                    "first": {
                        "top_hits": {
                            "sort": [{"#timestamp": {"order": "asc"}}],
                            "size": 1,
                            "script_fields": {
                                "timestamp": {
                                    "script": "doc['#timestamp'].value"
                                }
                            }
                        }
                    },
                    # One bucket per unique timestamp, keyed by the timestamp itself.
                    "time_edges": {
                        "histogram": {
                            "min_doc_count": 1,
                            "interval": 1,
                            "script": {
                                "inline": "doc['#timestamp'].value",
                                "lang": "painless"
                            }
                        },
                        "aggs": {
                            "timestamps": {
                                "max": {"field": "#timestamp"}
                            },
                            # Gap between consecutive sample timestamps.
                            "timestamp_diff": {
                                "serial_diff": {
                                    "buckets_path": "timestamps",
                                    "lag": 1
                                }
                            },
                            # Keep only buckets whose gap exceeds the accuracy window.
                            "time_delta_filter": {
                                "bucket_selector": {
                                    "buckets_path": {
                                        "timestampDiff": "timestamp_diff"
                                    },
                                    "script": "if (params != null && params.timestampDiff != null) { params.timestampDiff > " + str(data_sample_accuracy_window) + " } else { false }"
                                }
                            }
                        }
                    }
                }
            }
        }
    }
    return es.search(index=index, body=query)
Breaking things down
Filter the results by 'Index 2'
"query": {
"constant_score": {
"filter": {
"bool": {
"must": must_terms,
"should": should_terms,
"filter": filter_terms
}
}
}
},
must_terms holds the required terms for getting all the results for "the condition" stored in "Index 2".
For example, to limit results to only the last 10 days and to when condition has the value 10 or 12, we add the following must_terms:
must_terms = [
    {
        "range": {
            "#timestamp": {
                "gte": "now-10d",
                "lte": "now"
            }
        }
    },
    {
        "terms": {"condition": [10, 12]}
    }
]
This returns a reduced set of documents that we can then pass into our aggregations to figure out where our "samples" are.
Aggregations
For my use case we have the notion of "flights" for our aircraft, so I wanted to group the returned results by their id and then "break up" all the occurrences into buckets.
"aggs": {
"by_flight_id": {
"terms": {"field": "flight_id", "size": 1000},
...
}
}
}
You can get the rising edge of the first occurrence and the falling edge of the last occurrence using the top_hits aggregation:
"last": {
"top_hits": {
"sort": [{"#timestamp": {"order": "desc"}}],
"size": 1,
"script_fields": {
"timestamp": {
"script": "doc['#timestamp'].value"
}
}
}
},
"first": {
"top_hits": {
"sort": [{"#timestamp": {"order": "asc"}}],
"size": 1,
"script_fields": {
"timestamp": {
"script": "doc['#timestamp'].value"
}
}
}
},
You can get the samples in between using a histogram on a timestamp. This breaks up your returned results into buckets for every unique timestamp. This is a costly aggregation, but worth it. Using the inline script allows us to use the timestamp value for the bucket name.
"time_edges": {
"histogram": {
"min_doc_count": 1,
"interval": 1,
"script": {
"inline": "doc['#timestamp'].value",
"lang": "painless",
}
},
...
}
By default the histogram aggregation returns a set of buckets with the document count for each bucket, but we need a value. This is what the serial_diff aggregation requires to work, so we have to do a trivial max aggregation on the results to get a value returned.
"aggs": {
"timestamps": {
"max": {"field": "#timestamp"}
},
"timestamp_diff": {
"serial_diff": {
"buckets_path": "timestamps",
"lag": 1
}
},
...
}
We use the results of the serial_diff to determine whether or not two buckets are approximately adjacent. We then discard samples that are adjacent to each other and create a combined time range for our condition by using the bucket_selector aggregation. This throws out buckets whose gap from the previous sample is smaller than our data_sample_accuracy_window; that value is dependent on your dataset.
"aggs": {
...
"time_delta_filter": {
"bucket_selector": {
"buckets_path": {
"timestampDiff": "timestamp_diff"
},
"script": "if (params != null && params.timestampDiff != null) { params.timestampDiff > " + str(data_sample_accuracy_window) + "} else { false }"
}
}
}
The serial_diff results are also critical for determining how long our condition was set. The timestamps of our buckets end up representing the "rising" edge of our condition signal, so the falling edge is unknown without some post-processing. We use the timestampDiff value to figure out where the falling edge is.
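To make that last step concrete, here is a hedged sketch of the post-processing (the pairing logic is my reading of the buckets, not code from the original answer): every bucket that survives the bucket_selector marks a rising edge, and subtracting its timestampDiff recovers the previous sample, i.e. the falling edge of the prior interval.
def condition_intervals(response):
    """Pair rising and falling edges per flight from a time_edges() response."""
    intervals = {}
    for flight in response["aggregations"]["by_flight_id"]["buckets"]:
        # script_fields come back under "fields" as single-element lists.
        first_ts = flight["first"]["hits"]["hits"][0]["fields"]["timestamp"][0]
        last_ts = flight["last"]["hits"]["hits"][0]["fields"]["timestamp"][0]

        rising, falling = [first_ts], []
        for bucket in flight["time_edges"]["buckets"]:
            # Surviving buckets start a new interval; their timestamp_diff is
            # the gap back to the last sample of the previous interval.
            gap = bucket["timestamp_diff"]["value"]
            falling.append(bucket["key"] - gap)  # falling edge of prior interval
            rising.append(bucket["key"])         # rising edge of this interval
        falling.append(last_ts)

        intervals[flight["key"]] = list(zip(rising, falling))
    return intervals
The resulting [start, end] windows per flight can then be fed into the aggregation query against the other index.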