Filter/aggregate one elasticsearch index of time series data by timestamps found in another index - elasticsearch

The Data
So I have reams of different types of time series data. Currently i've chosen to put each type of data into their own index because with the exception of 4 fields, all of the data is very different. Also the data is sampled at different rates and are not guaranteed to have common timestamps across the same sub-second window so fusing them all into one large document is also not a trivial task.
The Goal
One of our common use cases that i'm trying to see if I can solve entirely in Elasticsearch is to return an aggregation result of one index based on the time windows returned from a query of another index. Pictorially:
This is what I want to accomplish.
Some Considerations
For small enough signal transitions on the "condition" data, I can just use a date histogram and some combination of a top hits sub aggregation, but this quickly breaks down when I have 10,000's or 100,000's of occurrences of "the condition". Further this is just one "case", I have 100's of sets of similar situations that i'd like to get the overall min/max from.
The comparisons are basically amongst what I would consider to be sibling level documents or indices, so there doesn't seem to be any obvious parent->child relationship that would be flexible enough over the long run, at least with how the data is currently structured.
It feels like there should be an elegant solution instead of brute force building the date ranges outside of Elasticsearch with the results of one query and feeding 100's of time ranges into another query.
Looking through the documentation it feels like some combination of Elasticsearch scripting and some of the pipelined aggregations are going to be what i want, but no definitive solutions are jumping out at me. I could really use some pointers in the right direction from the community.
Thanks.

I found a "solution" that worked for me for this problem. No answers or even comments from anyone yet, but i'll post my solution in case someone else comes along looking for something like this. I'm sure there is a lot of opportunity for improvement and optimization and if I discover such a solution (likely through a scripted aggregation) i'll come back and update my solution.
It may not be the optimal solution but it works for me. The key was to leverage the top_hits, serial_diff and bucket_selector aggregators.
The "solution"
def time_edges(index, must_terms=[], should_terms=[], filter_terms=[], data_sample_accuracy_window=200):
"""
Find the affected flights and date ranges where a specific set of terms occurs in a particular ES index.
index: the Elasticsearch index to search
terms: a list of dictionaries of form { "term": { "<termname>": <value>}}
"""
query = {
"size": 0,
"timeout": "5s",
"query": {
"constant_score": {
"filter": {
"bool": {
"must": must_terms,
"should": should_terms,
"filter": filter_terms
}
}
}
},
"aggs": {
"by_flight_id": {
"terms": {"field": "flight_id", "size": 1000},
"aggs": {
"last": {
"top_hits": {
"sort": [{"#timestamp": {"order": "desc"}}],
"size": 1,
"script_fields": {
"timestamp": {
"script": "doc['#timestamp'].value"
}
}
}
},
"first": {
"top_hits": {
"sort": [{"#timestamp": {"order": "asc"}}],
"size": 1,
"script_fields": {
"timestamp": {
"script": "doc['#timestamp'].value"
}
}
}
},
"time_edges": {
"histogram": {
"min_doc_count": 1,
"interval": 1,
"script": {
"inline": "doc['#timestamp'].value",
"lang": "painless",
}
},
"aggs": {
"timestamps": {
"max": {"field": "#timestamp"}
},
"timestamp_diff": {
"serial_diff": {
"buckets_path": "timestamps",
"lag": 1
}
},
"time_delta_filter": {
"bucket_selector": {
"buckets_path": {
"timestampDiff": "timestamp_diff"
},
"script": "if (params != null && params.timestampDiff != null) { params.timestampDiff > " + str(data_sample_accuracy_window) + "} else { false }"
}
}
}
}
}
}
}
}
return es.search(index=index, body=query)
Breaking things down
Get filter the results by 'Index 2'
"query": {
"constant_score": {
"filter": {
"bool": {
"must": must_terms,
"should": should_terms,
"filter": filter_terms
}
}
}
},
must_terms is the required value to be able to get all the results for "the condition" stored in "Index 2".
For example, to limit results to only the last 10 days and when condition is the value 10 or 12 we add the following must_terms
must_terms = [
{
"range": {
"#timestamp": {
"gte": "now-10d",
"lte": "now"
}
}
},
{
"terms": {"condition": [10, 12]}
}
]
This returns a reduced set of documents that we can then pass on into our aggregations to figure out where our "samples" are.
Aggregations
For my use case we have the notion of "flights" for our aircraft, so I wanted to group the returned results by their id and then "break up" all the occurences into buckets.
"aggs": {
"by_flight_id": {
"terms": {"field": "flight_id", "size": 1000},
...
}
}
}
You can get the rising edge of the first occurence and the falling edge of the last occurence using the top_hits aggregation
"last": {
"top_hits": {
"sort": [{"#timestamp": {"order": "desc"}}],
"size": 1,
"script_fields": {
"timestamp": {
"script": "doc['#timestamp'].value"
}
}
}
},
"first": {
"top_hits": {
"sort": [{"#timestamp": {"order": "asc"}}],
"size": 1,
"script_fields": {
"timestamp": {
"script": "doc['#timestamp'].value"
}
}
}
},
You can get the samples in between using a histogram on a timestamp. This breaks up your returned results into buckets for every unique timestamp. This is a costly aggregation, but worth it. Using the inline script allows us to use the timestamp value for the bucket name.
"time_edges": {
"histogram": {
"min_doc_count": 1,
"interval": 1,
"script": {
"inline": "doc['#timestamp'].value",
"lang": "painless",
}
},
...
}
By default the histogram aggregation returns a set of buckets with the document count for each bucket, but we need a value. This is what is required for serial_diff aggregation to work, so we have to do a token max aggregation on the results to get a value returned.
"aggs": {
"timestamps": {
"max": {"field": "#timestamp"}
},
"timestamp_diff": {
"serial_diff": {
"buckets_path": "timestamps",
"lag": 1
}
},
...
}
We use the results of the serial_diff to determine whether or not two bucket are approximately adjacent. We then discard samples that are adjacent to eachother and create a combined time range for our condition by using the bucket_selector aggregation. This will throw out buckets that are smaller than our data_sample_accuracy_window. This value is dependent on your dataset.
"aggs": {
...
"time_delta_filter": {
"bucket_selector": {
"buckets_path": {
"timestampDiff": "timestamp_diff"
},
"script": "if (params != null && params.timestampDiff != null) { params.timestampDiff > " + str(data_sample_accuracy_window) + "} else { false }"
}
}
}
The serial_diff results are also critical for us to determine how long our condition was set. The timestamps of our buckets end up representing the "rising" edge of our condition signal so the falling edge is unknown without some post-processing. We use the timestampDiff value to figure out where the falling edge is.

Related

ElasticSearch _knn_search query on multiple fields

I'm using ES 8.2. I'd like to use approximate method of _knn_search on more than 1 vector. Below I've attached my current code searching on a single vector. So far as I've read _knn_search does not support search on nested fields.
Alternatively, I can use multi index search. One index, one vector, one search, sum up all results together. However, I need to store all these vectors together in one index as I need also to perform filtration on some other fields besides vectors for knn search.
Thus, the question is if there is a work around how I can perform _knn_search on more than 1 vector?
search_vector = np.zeros(512).tolist()
es_query = {
"knn": {
"field": "feature_vector_1.vector",
"query_vector": search_vector,
"k": 100,
"num_candidates": 1000
},
"filter": [
{
"range": {
"feature_vector_1.match_prc": {
"gt": 10
}
}
}
],
"_source": {
"excludes": ["feature_vector_1.vector", "feature_vector_2.vector"]
}
}
The last working query that I've end up with is
es_query = {
"knn": {
"field": "feature_vector_1.vector",
"query_vector": search_vector,
"k": 1000,
"num_candidates": 1000
},
"filter": [
{
"function_score": {
"query": {
"match_all": {}
},
"script_score": {
"script": {
"source": """
double value = dotProduct(params.queryVector, 'feature_vector_2.vector');
return 100 * (1 + value) / 2;
""",
"params": {
"queryVector": search_vector
}
},
}
}
}
],
"_source": {
"excludes": ["feature_vector_1.vector", "feature_vector_2.vector"]
}
}
However, it is not true AKNN on 2 vectors but still working option if performance of such query satisfies your expectations.
the below seems to be working for me for combining KNN searches, taking the average of multiple cosine similarity scores. Note that this is a little different than the original request, since it performs a brute force search, but you can still filter the results up front by replacing the match_all bit.
GET my-index/_search
{
"query": {
"script_score": {
"query": {
"match_all": {}
},
"script": {
"source": "(cosineSimilarity(params.vector1, 'my-vector1') + cosineSimilarity(params.vector2, 'my-vector2'))/2 + 1.0",
"params": {
"vector1": [
1.3012068271636963,
...
0.23468133807182312
],
"vector2": [
-0.49404603242874146,
...
-0.15835021436214447
]
}
}
}
}
}

elasticsearch Need average per week of some value

I have simple data as
sales, date_of_sales
I need is average per week i.e. sum(sales)/no.of weeks.
Please help.
What i have till now is
{
"size": 0,
"aggs": {
"WeekAggergation": {
"date_histogram": {
"field": "date_of_sales",
"interval": "week"
}
},
"TotalSales": {
"sum": {
"field": "sales"
}
},
"myValue": {
"bucket_script": {
"buckets_path": {
"myGP": "TotalSales",
"myCount": "WeekAggergation._bucket_count"
},
"script": "params.myGP/params.myCount"
}
}
}
}
I get the error
Invalid pipeline aggregation named [myValue] of type [bucket_script].
Only sibling pipeline aggregations are allowed at the top level.
I think this may help:
{
"size": 0,
"aggs": {
"WeekAggergation": {
"date_histogram": {
"field": "date_of_sale",
"interval": "week",
"format": "yyyy-MM-dd"
},
"aggs": {
"TotalSales": {
"sum": {
"field": "sales"
}
},
"AvgSales": {
"avg": {
"field": "sales"
}
}
}
},
"avg_all_weekly_sales": {
"avg_bucket": {
"buckets_path": "WeekAggergation>TotalSales"
}
}
}
}
Note the TotalSales aggregation is now a nested aggregation under the weekly histogram aggregation (I believe there was a typo in the code provided - the simple schema provided indicated the field name of date_of_sale and the aggregation provided uses the plural form date_of_sales). This provides you a total of all sales in the weekly bucket.
Additionally, AvgSales provides a similar nested aggregation under the weekly histogram aggregation so you can see the average of all sales specific to that week.
Finally, the pipeline aggregation avg_all_weekly_sales will give the average of weekly sales based on the TotalSales bucket and the number of non-empty buckets - if you want to include empty buckets, add the gap_policy parameter like so:
...
"avg_all_weekly_sales": {
"avg_bucket": {
"buckets_path": "WeekAggergation>TotalSales",
"gap_policy": "insert_zeros"
}
}
...
(see: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-pipeline-avg-bucket-aggregation.html).
This pipeline aggregation may or may not be what you're actually looking for, so please check the math to ensure the result is what is expected, but should provide the correct output based on the original script.

How do I select the top term buckets based on a rescore function in Elasticsearch

Consider the following query for Elasticsearch 5.6:
{
"size": 0,
"query": {
"match_all": {}
},
"rescore": [
{
"window_size": 10000,
"query": {
"rescore_query": {
"function_score": {
"boost_mode": "replace",
"script_score": {
"script": {
"source": "doc['topic_score'].value"
}
}
}
},
"query_weight": 0,
"rescore_query_weight": 1
}
}
],
"aggs": {
"distinct": {
"terms": {
"field": "identical_id",
"order": {
"top_score": "desc"
}
},
"aggs": {
"best_unique_result": {
"top_hits": {
"size": 1
}
},
"top_score": {
"max": {
"script": {
"inline": "_score"
}
}
}
}
}
}
}
This is a simplified version where the real query has a more complex main query and the rescore function is far more intensive.
Let me explain it's purpose first incase I'm about to spend a 1000 hours developing a pen that writes in space when a pencil would actually solve my problem. I'm performing a fast initial query, then rescoring the top results with a much more intensive function. From those results I want to show the top distinct values, i.e. no two results should have the same identical_id. If there's a better way to do this I'd also consider that an answer.
I expected a query like this would order results by the rescore query, group all the results that had the same identical_id and display the top hit for each such distinct group. I also assumed that since I'm ordering those term aggregation buckets by the max parent _score, they would be ordered to reflect the best result they contain as determined from the original rescore query.
The reality is that the term buckets are ordered by the maximum query score and not the rescore query score. Strangely the top hits within the buckets do seem to use the rescore.
Is there a better way to achieve the end result that I want, or some way I can fix this query to work the way I expect it too?
From documentation :
The query rescorer executes a second query only on the Top-K results returned by the query and post_filter phases. The number of docs which will be examined on each shard can be controlled by the window_size parameter, which defaults to 10.
As the rescore query kicks in after the post_filter phase, I assume the term aggregation buckets are already fixed.
I have no idea on how you can combine rescore and aggregations. Sorry :(
I think I have a pretty great solution to this problem, but I'll let the bounty continue to expiration incase someone comes up with a better approach.
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"sample": {
"sampler": {
"shard_size": 10000
},
"aggs": {
"distinct": {
"terms": {
"field": "identical_id",
"order": {
"top_score": "desc"
}
},
"aggs": {
"best_unique_result": {
"top_hits": {
"size": 1,
"sort": [
{
"_script": {
"type": "number",
"script": {
"source": "doc['topic_score'].value"
},
"order": "desc"
}
}
]
}
},
"top_score": {
"max": {
"script": {
"source": "doc['topic_score'].value"
}
}
}
}
}
}
}
}
}
The sampler aggregation will take the top N hits per shard from the core query and run aggregations over those. Then in the max aggregator that defines the bucket order I use the exact same script as the one I use to pick a top hit from the bucket. Now the buckets and the top hits are running over the same top N sets of items and the buckets will order by the max of the same score, generated from the same script. Unfortunately I still need run the script once to order the buckets and once to pick a top hit within the bucket, and you could use the rescore instead for the top hits ordering, but either way it has to run twice and I found it was faster as a sort script then as a rescore

Concurrent events aggregation in ElasticSearch

I have a number of documents representing events with starts_at and ends_at fields. At a given point in time, an event is considered active, if the point in question is after starts_at and before ends_at.
I'm looking for an aggregation, which should result in a date histogram, where each bucket contains the number of active events in that interval.
So far, the best approximation I have found is to create a set of buckets counting the number of starts in each interval, as well as a corresponding set of buckets counting the number of ends, and then postprocessing them by subtracting the number of starts from the number of ends for each interval:
{
"size": "0",
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"and": [
{
"term": {
"_type": "event"
}
},
{
"range": {
"starts_at": {
"gte": "2015-06-14T05:25:03Z",
"lte": "2015-06-21T05:25:03Z"
}
}
}
]
}
}
},
"aggs": {
"starts": {
"date_histogram": {
"field": "starts_at",
"interval": "15m",
"extended_bounds": {
"max": "2015-06-21T05:25:04Z",
"min": "2015-06-14T05:25:04Z"
},
"min_doc_count": 0
}
},
"ends": {
"date_histogram": {
"field": "ends_at",
"interval": "15m",
"extended_bounds": {
"max": "2015-06-21T05:25:04Z",
"min": "2015-06-14T05:25:04Z"
},
"min_doc_count": 0
}
}
}
}
I'm looking for something like this solution.
Is there a way to achieve that with a single query?
I'm not 100% sure but up-coming pipeline aggregations might solve this problem in near-future in a more elegant way.
Meanwhile you could choose the desired time resolution and at index time in addition to starts_at and ends_at fields you would also generate active_at field. It would be an array of time stamps and you could use either terms (if it is mapped as not_analyzed string) or date_histogram aggregation to get the correct "active events count" for each time-bucket.
The down-side is inflated storage requirements and possibly worse performance since there are more field values to aggregate over. Anyway it shouldn't be too bad if you don't choose a too high time resolution like 1 minute.

Is there a way to have elasticsearch return a hit per generated bucket during an aggregation?

right now I have a query like this:
{
"query": {
"bool": {
"must": [
{
"match": {
"uuid": "xxxxxxx-xxxx-xxxx-xxxxx-xxxxxxxxxxxxx"
}
},
{
"range": {
"date": {
"from": "now-12h",
"to": "now"
}
}
}
]
}
},
"aggs": {
"query": {
"terms": [
{
"field": "query",
"size": 3
}
]
}
}
}
The aggregation works perfectly well, but I can't seem to find a way to control the hit data that is returned, I can use the size parameter at the top of the dsl, but the hits that are returned are not returned in the same order as the bucket so the bucket results do not line up with the hit results. Is there any way to correct this or do I have to issue 2 separate queries?
To expand on Filipe's answer, it seems like the top_hits aggregation is what you are looking for, e.g.
{
"query": {
... snip ...
},
"aggs": {
"query": {
"terms": {
"field": "query",
"size": 3
},
"aggs": {
"top": {
"top_hits": {
"size": 42
}
}
}
}
}
}
Your query uses exact matches (match and range) and binary logic (must, bool) and thus should probably be converted to use filters instead:
"filtered": {
"filter": {
"bool": {
"must": [
{
"term": {
"uuid": "xxxxxxx-xxxx-xxxx-xxxxx-xxxxxxxxxxxxx"
}
},
{
"range": {
"date": {
"from": "now-12h",
"to": "now"
}
}
}
]
}
}
As for the aggregations,
The hits that are returned do not represent all the buckets that were returned. so if have buckets for terms 'a', 'b', and 'c' I want to have hits that represent those buckets as well
Perhaps you are looking to control the scope of the buckets? You can make an aggregation bucket global so that it will not be influenced by the query or filter.
Keep in mind that Elasticsearch will not "group" hits in any way -- it is always a flat list ordered according to score and additional sorting options.
Aggregations can be organized in a nested structure and return computed or extracted values, in a specific order. In the case of terms aggregation, it is in descending count (highest number of hits first). The hits section of the response is never influenced by your choice of aggregations. Similarly, you cannot find hits in the aggregation sections.
If your goal is to group documents by a certain field, yes, you will need to run multiple queries in the current Elasticsearch release.
I'm not 100% sure, but I think there's no way to do that in the current version of Elasticsearch (1.2.x). The good news is that there will be when version 1.3.x gets released:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-top-hits-aggregation.html

Resources