Is there any way to aggregate an average without outliers in Elasticsearch?

I need a way to create a transform that will aggregate the average of a field, but without the outliers (let's say only the values that fall between the 10th and 90th percentiles). For example, if I have the following documents:
[
{someField:1},
{someField:2},
{someField:3},
{someField:4},
{someField:5},
{someField:6},
{someField:7},
{someField:8},
{someField:9},
{someField:10}
]
it should calculate the average of the values 2 through 9.
Edited: renamed "value" to "someField"

You can do this in one go with a scripted_metric aggregation, but you'd have to write the percentiles function and then the avg function yourself -- I wrote one here. That script isn't going to be performant, though, so I don't think it'd be worth the effort.
I'd instead suggest first retrieving the percentile bounds:
POST myindex/_search
{
  "size": 0,
  "aggs": {
    "boundaries": {
      "percentiles": {
        "field": "value",
        "percents": [10, 90]
      }
    }
  }
}
yielding [1.5, 9.5] and then plug these numbers into a weighted average aggregation:
POST myindex/_search
{
  "size": 0,
  "aggs": {
    "avg_without_outliers": {
      "weighted_avg": {
        "value": {
          "field": "value"
        },
        "weight": {
          "script": {
            "source": "def v = doc.value.value; return v <= params.min || v >= params.max ? 0 : 1",
            "params": {
              "min": 1.5,
              "max": 9.5
            }
          }
        }
      }
    }
  }
}
The weight is either 0 or 1, depending on whether the particular doc that's being traversed is an outlier or not.
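If you want to automate the two-step approach, the requests can be chained from a client. A minimal sketch, assuming the elasticsearch Python client and an index named myindex (the field is "value", as in the two requests above; depending on the client version you may need to pass the request body differently):

    # Minimal sketch: fetch the percentile bounds, then feed them into weighted_avg.
    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    # Step 1: fetch the 10th/90th percentile bounds of the field.
    bounds = es.search(index="myindex", body={
        "size": 0,
        "aggs": {"boundaries": {"percentiles": {"field": "value", "percents": [10, 90]}}}
    })["aggregations"]["boundaries"]["values"]

    # Step 2: plug the bounds into the weighted_avg aggregation as script params.
    resp = es.search(index="myindex", body={
        "size": 0,
        "aggs": {
            "avg_without_outliers": {
                "weighted_avg": {
                    "value": {"field": "value"},
                    "weight": {
                        "script": {
                            "source": "def v = doc.value.value; return v <= params.min || v >= params.max ? 0 : 1",
                            "params": {"min": bounds["10.0"], "max": bounds["90.0"]}
                        }
                    }
                }
            }
        }
    })
    print(resp["aggregations"]["avg_without_outliers"]["value"])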

Related

How to calculate & draw metrics of scripted metrics in elasticsearch

Problem description
We have log files from different devices parsed into our Elasticsearch database, line by line. The log files are built as a ring buffer, so they always have a fixed size of 1000 lines. They can be manually exported whenever needed. After import and parsing in Elasticsearch, each document represents a single line of a log file with the following information:
DeviceID: 12345
FileType: ErrorLog
FileTimestamp: 2022-05-10 01:23:45
LogTimestamp: 2022-05-05 01:23:45
LogMessage: something very important here
Now I want a statistic on the timespan that is usually covered by that fixed number of lines. Depending on how intensively a device is used, a varying number of log entries is generated, so the files can cover anything from just a few days to several months. But since the log files are split into individual lines, this is not that trivial (I suppose).
My goal is to have a chart that shows me a "histogram" of the different log file timespans...
First Try: Visualize library > Data table
I started by creating a Data table in the Visualize library where I was able to aggregate the data as follows:
I added 3 Buckets --> so I have all lines bucketed by their original file:
Split rows DeviceID.keyword
Split rows FileType.keyword
Split rows FileTimestamp
... and 2 Metrics --> to show the log file timespan (I couldn't find a way to create a max-min metric, so I started with individual metrics for max and min):
Metric Min LogTimeStamp
Metric Max LogTimeStamp
This results in the following query:
{
  "aggs": {
    "2": {
      "terms": {
        "field": "DeviceID.keyword",
        "order": {
          "_key": "desc"
        },
        "size": 100
      },
      "aggs": {
        "3": {
          "terms": {
            "field": "FileType.keyword",
            "order": {
              "_key": "desc"
            },
            "size": 5
          },
          "aggs": {
            "4": {
              "terms": {
                "field": "FileTimestamp",
                "order": {
                  "_key": "desc"
                },
                "size": 100
              },
              "aggs": {
                "1": {
                  "min": {
                    "field": "LogTimeStamp"
                  }
                },
                "5": {
                  "max": {
                    "field": "LogTimeStamp"
                  }
                }
              }
            }
          }
        }
      }
    }
  },
  "size": 0,
  ...
}
... and this output:
DeviceID FileType FileTimestamp Min LogTimestamp Max LogTimestamp
---------------------------------------------------------------------------------------------
12345 ErrorLog 2022-05-10 01:23:45 2022-04-10 01:23:45 2022-05-10 01:23:45
...
Looks good so far! The expected result would be exactly 1 month for this example.
But my research showed that it is not possible to add the desired metric here, so I needed to try something else...
Second Try: Visualize library > Custom visualization (Vega-Lite)
So I did some more research and found out that Vega might be a possibility. I was already able to transfer the bucket part from the first attempt, and I also added a scripted metric to automatically calculate the timespan (instead of min & max). So far, so good. The request body looks as follows:
body: {
  "aggs": {
    "DeviceID": {
      "terms": { "field": "DeviceID.keyword" },
      "aggs": {
        "FileType": {
          "terms": { "field": "FileType.keyword" },
          "aggs": {
            "FileTimestamp": {
              "terms": { "field": "FileTimestamp" },
              "aggs": {
                "timespan": {
                  "scripted_metric": {
                    "init_script": "state.values = [];",
                    "map_script": "state.values.add(doc['#timestamp'].value);",
                    "combine_script": "long min = Long.MAX_VALUE; long max = 0; for (t in state.values) { long tms = t.toInstant().toEpochMilli(); if(tms > max) max = tms; if(tms < min) min = tms; } return [max,min];",
                    "reduce_script": "long min = Long.MAX_VALUE; long max = 0; for (a in states) { if(a[0] > max) max = a[0]; if(a[1] < min) min = a[1]; } return max-min;"
                  }
                }
              }
            }
          }
        }
      }
    }
  },
  "size": 0
}
...with this response (unnecessary information removed to reduce complexity):
{
  "took": 12245,
  "timed_out": false,
  "_shards": { ... },
  "hits": { ... },
  "aggregations": {
    "DeviceID": {
      "buckets": [
        {
          "key": "12345",
          "FileType": {
            "buckets": [
              {
                "key": "ErrorLog",
                "FileTimeStamp": {
                  "buckets": [
                    {
                      "key": 1638447972000,
                      "key_as_string": "2021-12-02T12:26:12.000Z",
                      "doc_count": 1000,
                      "timespan": {
                        "value": 31339243240
                      }
                    },
                    {
                      "key": 1636023881000,
                      "key_as_string": "2021-11-04T11:04:41.000Z",
                      "doc_count": 1000,
                      "timespan": {
                        "value": 31339243240
                      }
                    }
                  ]
                }
              },
              {
                "key": "InfoLog",
                "FileTimeStamp": {
                  "buckets": [
                    {
                      "key": 1635773438000,
                      "key_as_string": "2021-11-01T13:30:38.000Z",
                      "doc_count": 1000,
                      "timespan": {
                        "value": 2793365000
                      }
                    },
                    {
                      "key": 1636023881000,
                      "key_as_string": "2021-11-04T11:04:41.000Z",
                      "doc_count": 1000,
                      "timespan": {
                        "value": 2643772000
                      }
                    }
                  ]
                }
              }
            ]
          }
        },
        {
          "key": "12346",
          "FileType": {
            ...
          }
        },
        ...
      ]
    }
  }
}
Yeah, it seems to work! Now I have the timespan for each original log file.
Question
Now I am stuck with the following:
I want to average the timespans for each original log file (identified via the combination of DeviceID + FileType + FileTimestamp) to prevent devices with multiple imported log files from having a higher weight than devices with only one imported log file. I tried to add another aggregation for the average, but I couldn't figure out where to put it so that the result of the scripted_metric is used. My closest attempt was to put an avg_bucket after the FileTimestamp bucket:
Request:
body: {
  "aggs": {
    "DeviceID": {
      "terms": { "field": "DeviceID.keyword" },
      "aggs": {
        "FileType": {
          "terms": { "field": "FileType.keyword" },
          "aggs": {
            "FileTimestamp": {
              "terms": { "field": "FileTimestamp" },
              "aggs": {
                "timespan": {
                  "scripted_metric": {
                    "init_script": "state.values = [];",
                    "map_script": "state.values.add(doc['FileTimestamp'].value);",
                    "combine_script": "long min = Long.MAX_VALUE; long max = 0; for (t in state.values) { long tms = t.toInstant().toEpochMilli(); if(tms > max) max = tms; if(tms < min) min = tms; } return [max,min];",
                    "reduce_script": "long min = Long.MAX_VALUE; long max = 0; for (a in states) { if(a[0] > max) max = a[0]; if(a[1] < min) min = a[1]; } return max-min;"
                  }
                }
              }
            },
            // new part - start
            "avg_timespan": {
              "avg_bucket": {
                "buckets_path": "FileTimestamp>timespan"
              }
            }
            // new part - end
          }
        }
      }
    }
  },
  "size": 0
}
But I receive the following error:
EsError: buckets_path must reference either a number value or a single value numeric metric aggregation, got: [InternalScriptedMetric] at aggregation [timespan]
So is this the right spot for it (just not applicable to a scripted metric), or am I on the wrong path altogether?
I also need to plot all of this, but I can't find my way through all the buckets, etc.
I read about flattening (which would probably be a good idea, so that, if done by the server, the result would not be that complex), but I don't know where and how to put the flattening transformation.
I imagine the resulting chart like this:
x-axis = log file timespan, where the timespan is "binned" according to a given step size (e.g. 1 day), so there are only bars for each bin (1 = 0-1days, 2 = 1-2days, 3 = 2-3days, etc.) and not for all the different timespans of log files
y-axis = count of devices
type: lines or vertical bars, split by file type
Any help is really appreciated! Thanks in advance!
If you have the privileges to create a transform, then the Elastic Painless example "Getting duration by using bucket script" can do exactly what you want. It creates a new index in which all documents are grouped according to your needs.
To create the transform:
go to Stack Management > Transforms > + Create a transform
select Edit JSON config for the Pivot configuration object
paste & apply the JSON below
check whether the result is as expected in the Transform preview
fill out the rest of the transform details + save the transform
JSON config
{
  "group_by": {
    "DeviceID": {
      "terms": {
        "field": "DeviceID.keyword"
      }
    },
    "FileType": {
      "terms": {
        "field": "FileType.keyword"
      }
    },
    "FileTimestamp": {
      "terms": {
        "field": "FileTimestamp"
      }
    }
  },
  "aggregations": {
    "TimeStampStats": {
      "stats": {
        "field": "#timestamp"
      }
    },
    "TimeSpan": {
      "bucket_script": {
        "buckets_path": {
          "first": "TimeStampStats.min",
          "last": "TimeStampStats.max"
        },
        "script": "params.last - params.first"
      }
    }
  }
}
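If you prefer to create the same transform through the API instead of the Kibana UI, the pivot above can be wrapped in a put-transform call. A rough sketch with the Python client; the transform id, source pattern and destination index below are placeholders, and the exact put_transform/start_transform signatures depend on the client version:

    # Sketch only: creates the same transform via the API instead of the Kibana UI.
    # "logs-*", "logfile-timespans" and the transform id are placeholder names.
    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    pivot = {
        "group_by": {
            "DeviceID": {"terms": {"field": "DeviceID.keyword"}},
            "FileType": {"terms": {"field": "FileType.keyword"}},
            "FileTimestamp": {"terms": {"field": "FileTimestamp"}}
        },
        "aggregations": {
            "TimeStampStats": {"stats": {"field": "#timestamp"}},
            "TimeSpan": {
                "bucket_script": {
                    "buckets_path": {"first": "TimeStampStats.min", "last": "TimeStampStats.max"},
                    "script": "params.last - params.first"
                }
            }
        }
    }

    es.transform.put_transform(
        transform_id="logfile-timespan",
        body={
            "source": {"index": "logs-*"},
            "dest": {"index": "logfile-timespans"},
            "pivot": pivot
        }
    )
    es.transform.start_transform(transform_id="logfile-timespan")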
Now you can create a chart from the new index, for example with these settings:
Vertical Bars
Metrics:
Y-axis = "Count"
Buckets:
X-axis = "TimeSpan"
Split series = "FileType"

ElasticSearch _knn_search query on multiple fields

I'm using ES 8.2. I'd like to use the approximate method of _knn_search on more than one vector. Below I've attached my current code that searches on a single vector. As far as I've read, _knn_search does not support searching on nested fields.
Alternatively, I could use a multi-index search: one index, one vector, one search, and sum up all the results together. However, I need to store all these vectors together in one index, as I also need to filter on some other fields besides the vectors for the kNN search.
Thus, the question is whether there is a workaround so that I can perform _knn_search on more than one vector.
import numpy as np

search_vector = np.zeros(512).tolist()
es_query = {
    "knn": {
        "field": "feature_vector_1.vector",
        "query_vector": search_vector,
        "k": 100,
        "num_candidates": 1000
    },
    "filter": [
        {
            "range": {
                "feature_vector_1.match_prc": {
                    "gt": 10
                }
            }
        }
    ],
    "_source": {
        "excludes": ["feature_vector_1.vector", "feature_vector_2.vector"]
    }
}
The last working query that I've ended up with is:
es_query = {
    "knn": {
        "field": "feature_vector_1.vector",
        "query_vector": search_vector,
        "k": 1000,
        "num_candidates": 1000
    },
    "filter": [
        {
            "function_score": {
                "query": {
                    "match_all": {}
                },
                "script_score": {
                    "script": {
                        "source": """
                            double value = dotProduct(params.queryVector, 'feature_vector_2.vector');
                            return 100 * (1 + value) / 2;
                        """,
                        "params": {
                            "queryVector": search_vector
                        }
                    }
                }
            }
        }
    ],
    "_source": {
        "excludes": ["feature_vector_1.vector", "feature_vector_2.vector"]
    }
}
However, it is not a true approximate kNN search over two vectors, but it is still a workable option if the performance of such a query meets your expectations.
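For what it's worth, the other route from the question (one search per vector, merge the results client-side) can also be kept in a single index by running one _knn_search per vector field. A rough sketch; the endpoint URL and index name are placeholders, no authentication is handled, and the merge here is a plain score sum:

    # Rough sketch: one _knn_search per vector field, scores merged client-side.
    # "http://localhost:9200" and "my-index" are placeholders.
    import numpy as np
    import requests

    search_vector = np.zeros(512).tolist()

    def knn(field):
        body = {
            "knn": {
                "field": field,
                "query_vector": search_vector,
                "k": 100,
                "num_candidates": 1000
            },
            "_source": {"excludes": ["feature_vector_1.vector", "feature_vector_2.vector"]}
        }
        r = requests.post("http://localhost:9200/my-index/_knn_search", json=body)
        return r.json()["hits"]["hits"]

    # Sum (or average) the scores of documents returned by both searches.
    merged = {}
    for hit in knn("feature_vector_1.vector") + knn("feature_vector_2.vector"):
        merged[hit["_id"]] = merged.get(hit["_id"], 0.0) + hit["_score"]

    top = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:100]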
The below seems to be working for me for combining kNN searches, taking the average of multiple cosine similarity scores. Note that this is a little different from the original request, since it performs a brute-force search, but you can still filter the results up front by replacing the match_all bit.
GET my-index/_search
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "(cosineSimilarity(params.vector1, 'my-vector1') + cosineSimilarity(params.vector2, 'my-vector2'))/2 + 1.0",
        "params": {
          "vector1": [
            1.3012068271636963,
            ...
            0.23468133807182312
          ],
          "vector2": [
            -0.49404603242874146,
            ...
            -0.15835021436214447
          ]
        }
      }
    }
  }
}
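And a sketch of what swapping out match_all for an up-front filter might look like, reusing the field names, range filter and search_vector from the question (all of that is illustrative, adjust to your mapping):

    # Sketch: same averaged-cosine scoring, but filtering candidates up front
    # instead of scoring every document via match_all.
    es_query = {
        "query": {
            "script_score": {
                "query": {
                    "bool": {
                        "filter": [
                            {"range": {"feature_vector_1.match_prc": {"gt": 10}}}
                        ]
                    }
                },
                "script": {
                    "source": "(cosineSimilarity(params.vector1, 'feature_vector_1.vector') + cosineSimilarity(params.vector2, 'feature_vector_2.vector'))/2 + 1.0",
                    "params": {"vector1": search_vector, "vector2": search_vector}
                }
            }
        }
    }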

Is there a way to query a range when the maximal range is defined by an array with two numbers?

I need to write an Elasticsearch range query that operates on the following index format:
...
"facetProperties": {
  "fid641616": [
    31.75,
    44.45
  ]
}
...
The following query works only if lt or gt matches the lower or the upper bound of the max range. As soon as I try to narrow both ends, there are no results.
{
  "query": {
    "bool": {
      "should": [
        {
          "range": {
            "facetProperties.fid641616": {
              "gt": 33,
              "lt": 42
            }
          }
        }
      ]
    }
  },
  "from": 0,
  "size": 250,
  "sort": [],
  "aggs": {},
  "_source": "facetProperties.fid641616"
}
Is there a way to get this working without modifying the index?
Update 1 - some use cases:
Query range:
  "range": {
    "facetProperties.fid641616": {
      "gt": 33,
      "lt": 42
    }
  }
facet1 : [31] - should not be found
facet2 : [31,45] - should be found
facet1 : [31,32] - should not be found
facet1 : [44,45] - should not be found
Basically, it is not possible to query based on the range or difference of two numbers in an array using conventional DSL queries in Elasticsearch, but you can do it using a script.
Below is a sample document and a script query that should help you.
Sample Document:
POST range_index/_doc/1
{
"array": [31.75, 44.45]
}
Query:
POST range_index/_search
{
  "query": {
    "script": {
      "script": {
        "source": """
          List list = doc['array'];
          if (list.size() == 2) {
            // the sample values are doubles (31.75, 44.45), so read them as doubles
            double first_number = list.get(0);
            double last_number = list.get(1);
            if (params.gt < first_number)
              return false;
            if (params.lt > last_number)
              return false;
            if ((last_number - first_number) >= (params.lt - params.gt))
              return true;
          }
          return false;
        """,
        "params": {
          "gt": 33,
          "lt": 42
        }
      }
    }
  }
}
What I've done is simply create a script that returns the documents whose stored range covers the gt/lt window from your query.
You should be able to see the sample document above in the result. Note that I'm assuming that the array field is in ascending order.
Basically, it returns all the documents whose two values span at least the queried difference of 42 - 33, i.e. 9, and enclose that window.
Let me know if that helps!
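To double-check the logic against the use cases from the question, the same check can be mirrored in plain Python; this is just a client-side sanity check of the script above, not something you'd run in Elasticsearch:

    # Client-side mirror of the Painless check, run against the sample facets from the question.
    def matches(values, gt, lt):
        if len(values) != 2:
            return False
        first, last = values  # assumes ascending order, as in the answer
        if gt < first or lt > last:
            return False
        return (last - first) >= (lt - gt)

    samples = {
        "facet [31]": [31],
        "facet [31, 45]": [31, 45],
        "facet [31, 32]": [31, 32],
        "facet [44, 45]": [44, 45],
    }
    for name, values in samples.items():
        print(name, matches(values, gt=33, lt=42))
    # Only [31, 45] should print True.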

Filter/aggregate one elasticsearch index of time series data by timestamps found in another index

The Data
So I have reams of different types of time series data. Currently I've chosen to put each type of data into its own index because, with the exception of 4 fields, all of the data is very different. Also, the data is sampled at different rates and is not guaranteed to have common timestamps across the same sub-second window, so fusing it all into one large document is also not a trivial task.
The Goal
One of our common use cases that I'm trying to see if I can solve entirely in Elasticsearch is to return an aggregation result of one index based on the time windows returned from a query of another index. That is what I want to accomplish.
Some Considerations
For small enough signal transitions on the "condition" data, I can just use a date histogram and some combination of a top_hits sub-aggregation, but this quickly breaks down when I have tens or hundreds of thousands of occurrences of "the condition". Further, this is just one case; I have hundreds of sets of similar situations that I'd like to get the overall min/max from.
The comparisons are basically amongst what I would consider to be sibling-level documents or indices, so there doesn't seem to be any obvious parent->child relationship that would be flexible enough over the long run, at least with how the data is currently structured.
It feels like there should be an elegant solution instead of brute-force building the date ranges outside of Elasticsearch with the results of one query and feeding hundreds of time ranges into another query.
Looking through the documentation, it feels like some combination of Elasticsearch scripting and some of the pipeline aggregations is going to be what I want, but no definitive solutions are jumping out at me. I could really use some pointers in the right direction from the community.
Thanks.
I found a "solution" that worked for me for this problem. No answers or even comments from anyone yet, but I'll post my solution in case someone else comes along looking for something like this. I'm sure there is a lot of opportunity for improvement and optimization, and if I discover such a solution (likely through a scripted aggregation) I'll come back and update it.
It may not be the optimal solution but it works for me. The key was to leverage the top_hits, serial_diff and bucket_selector aggregators.
The "solution"
from elasticsearch import Elasticsearch

es = Elasticsearch()  # module-level client used by the helper below (connection details omitted)


def time_edges(index, must_terms=[], should_terms=[], filter_terms=[], data_sample_accuracy_window=200):
    """
    Find the affected flights and date ranges where a specific set of terms occurs in a particular ES index.
    index: the Elasticsearch index to search
    must_terms/should_terms/filter_terms: lists of query clauses of the form { "term": { "<termname>": <value>}}
    """
    query = {
        "size": 0,
        "timeout": "5s",
        "query": {
            "constant_score": {
                "filter": {
                    "bool": {
                        "must": must_terms,
                        "should": should_terms,
                        "filter": filter_terms
                    }
                }
            }
        },
        "aggs": {
            "by_flight_id": {
                "terms": {"field": "flight_id", "size": 1000},
                "aggs": {
                    "last": {
                        "top_hits": {
                            "sort": [{"#timestamp": {"order": "desc"}}],
                            "size": 1,
                            "script_fields": {
                                "timestamp": {
                                    "script": "doc['#timestamp'].value"
                                }
                            }
                        }
                    },
                    "first": {
                        "top_hits": {
                            "sort": [{"#timestamp": {"order": "asc"}}],
                            "size": 1,
                            "script_fields": {
                                "timestamp": {
                                    "script": "doc['#timestamp'].value"
                                }
                            }
                        }
                    },
                    "time_edges": {
                        "histogram": {
                            "min_doc_count": 1,
                            "interval": 1,
                            "script": {
                                "inline": "doc['#timestamp'].value",
                                "lang": "painless"
                            }
                        },
                        "aggs": {
                            "timestamps": {
                                "max": {"field": "#timestamp"}
                            },
                            "timestamp_diff": {
                                "serial_diff": {
                                    "buckets_path": "timestamps",
                                    "lag": 1
                                }
                            },
                            "time_delta_filter": {
                                "bucket_selector": {
                                    "buckets_path": {
                                        "timestampDiff": "timestamp_diff"
                                    },
                                    "script": "if (params != null && params.timestampDiff != null) { params.timestampDiff > " + str(data_sample_accuracy_window) + "} else { false }"
                                }
                            }
                        }
                    }
                }
            }
        }
    }
    return es.search(index=index, body=query)
Breaking things down
Filter the results by 'Index 2'
"query": {
"constant_score": {
"filter": {
"bool": {
"must": must_terms,
"should": should_terms,
"filter": filter_terms
}
}
}
},
must_terms is what's required to get all the results for "the condition" stored in "Index 2".
For example, to limit results to only the last 10 days and to when condition has the value 10 or 12, we add the following must_terms:
must_terms = [
    {
        "range": {
            "#timestamp": {
                "gte": "now-10d",
                "lte": "now"
            }
        }
    },
    {
        "terms": {"condition": [10, 12]}
    }
]
This returns a reduced set of documents that we can then pass on into our aggregations to figure out where our "samples" are.
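For completeness, calling the helper with these terms looks roughly like this (the index name is illustrative, es is the client defined above, and the exact shape of the script_fields values can vary by version):

    # Example call, assuming "index2" holds the "condition" documents and must_terms
    # is the list shown just above.
    results = time_edges("index2", must_terms=must_terms, data_sample_accuracy_window=200)

    for flight in results["aggregations"]["by_flight_id"]["buckets"]:
        # script_fields come back as single-element lists
        first_ts = flight["first"]["hits"]["hits"][0]["fields"]["timestamp"][0]
        last_ts = flight["last"]["hits"]["hits"][0]["fields"]["timestamp"][0]
        print(flight["key"], first_ts, last_ts)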
Aggregations
For my use case we have the notion of "flights" for our aircraft, so I wanted to group the returned results by their id and then "break up" all the occurrences into buckets.
"aggs": {
"by_flight_id": {
"terms": {"field": "flight_id", "size": 1000},
...
}
}
}
You can get the rising edge of the first occurrence and the falling edge of the last occurrence using the top_hits aggregation:
"last": {
"top_hits": {
"sort": [{"#timestamp": {"order": "desc"}}],
"size": 1,
"script_fields": {
"timestamp": {
"script": "doc['#timestamp'].value"
}
}
}
},
"first": {
"top_hits": {
"sort": [{"#timestamp": {"order": "asc"}}],
"size": 1,
"script_fields": {
"timestamp": {
"script": "doc['#timestamp'].value"
}
}
}
},
You can get the samples in between using a histogram on a timestamp. This breaks up your returned results into buckets for every unique timestamp. This is a costly aggregation, but worth it. Using the inline script allows us to use the timestamp value for the bucket name.
"time_edges": {
"histogram": {
"min_doc_count": 1,
"interval": 1,
"script": {
"inline": "doc['#timestamp'].value",
"lang": "painless",
}
},
...
}
By default the histogram aggregation returns a set of buckets with the document count for each bucket, but we need a value. This is what is required for the serial_diff aggregation to work, so we have to add a token max aggregation on the results to get a value returned.
"aggs": {
"timestamps": {
"max": {"field": "#timestamp"}
},
"timestamp_diff": {
"serial_diff": {
"buckets_path": "timestamps",
"lag": 1
}
},
...
}
We use the results of the serial_diff to determine whether or not two buckets are approximately adjacent. We then discard samples that are adjacent to each other and create a combined time range for our condition by using the bucket_selector aggregation. This throws out buckets whose gap is smaller than our data_sample_accuracy_window. This value is dependent on your dataset.
"aggs": {
...
"time_delta_filter": {
"bucket_selector": {
"buckets_path": {
"timestampDiff": "timestamp_diff"
},
"script": "if (params != null && params.timestampDiff != null) { params.timestampDiff > " + str(data_sample_accuracy_window) + "} else { false }"
}
}
}
The serial_diff results are also critical for us to determine how long our condition was set. The timestamps of our buckets end up representing the "rising" edge of our condition signal so the falling edge is unknown without some post-processing. We use the timestampDiff value to figure out where the falling edge is.
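One way to do that post-processing on the client side, as a hedged sketch of how I read the buckets (each surviving time_edges bucket marks a rising edge, and its key minus its timestamp_diff gives the falling edge of the interval before it; keys and diffs are assumed to be epoch millis, adjust to your data):

    # Sketch of the client-side post-processing described above.
    def condition_edges(flight_bucket, data_sample_accuracy_window=200):
        edges = []
        for b in flight_bucket["time_edges"]["buckets"]:
            diff = (b.get("timestamp_diff") or {}).get("value")
            if diff is None or diff <= data_sample_accuracy_window:
                continue
            # This bucket survived the bucket_selector, so it starts a new interval;
            # the previous interval fell at (this bucket's timestamp - gap).
            falling_of_previous = b["key"] - diff
            rising_of_next = b["key"]
            edges.append((falling_of_previous, rising_of_next))
        return edges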

ElasticSearch max score

I'm trying to solve a performance issue we have when querying Elasticsearch for several thousand results. The basic idea is that we do some post-query processing and only show the top X results (a query may have ~100,000 results while we only need the top 100 according to our score mechanics).
The basic mechanics are as follows:
The Elasticsearch score is normalized to 0..1 (score / max(score)); we add our ranking score (also normalized to 0..1) and divide by 2.
What I'd like to do is move this logic into Elasticsearch using custom scoring (or, well, anything that works): https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html#function-script-score
The problem I'm facing is that, using score scripts / score functions, I can't seem to find a way to do something like max(_score) to normalize the score between 0 and 1.
"script_score" : {
"script" : "(_score / max(_score) + doc['some_normalized_field'].value)/2"
}
Any ideas are welcome.
You cannot get max_score before you have actually generated the _score for all the matching documents. The script_score query will first generate the _score for all the matching documents, and only then is max_score reported by Elasticsearch.
From what I can understand of your problem, you want to preserve the max_score that was generated by the original query, before you applied script_score. You can get the required result if you do some computation on the front end. In short, apply your formula on the front end and then sort the results.
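For example, the front-end computation could be a small re-ranking step like this sketch (the index, query, page size and some_normalized_field are illustrative, borrowed from elsewhere in this thread; es is assumed to be a Python client instance):

    # Rough sketch of the client-side re-ranking suggested above: normalize _score by the
    # query's max_score, blend with the document's own normalized field, then sort.
    resp = es.search(index="myindex", body={
        "size": 1000,
        "query": {"match": {"body": "dennis"}},   # whatever your original query is
        "_source": ["some_normalized_field"]
    })

    max_score = resp["hits"]["max_score"] or 1.0
    ranked = sorted(
        resp["hits"]["hits"],
        key=lambda h: (h["_score"] / max_score + h["_source"]["some_normalized_field"]) / 2,
        reverse=True
    )
    top_100 = ranked[:100]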
You can save your factor inside your results using a script_fields query:
{
  "explain": true,
  "query": {
    "match_all": {}
  },
  "script_fields": {
    "total_goals": {
      "script": {
        "lang": "painless",
        "source": """
          int total = 0;
          for (int i = 0; i < doc['goals'].length; ++i) {
            total += doc['goals'][i];
          }
          return total;
        """,
        "params": {
          "last": "any parameters required"
        }
      }
    }
  }
}
I am not sure that I understand your question. Do you want to limit the number of results?
Have you tried this?
{
  "from": 0, "size": 10,
  "query": {
    "term": { "name": "dennis" }
  }
}
You can use sort to define the sort order; by default results are sorted by the main query's relevance score.
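For example, a quick sketch of an explicit sort (the field name here is just illustrative):

    # Example of an explicit sort: by a field first, then by score.
    query = {
        "from": 0, "size": 10,
        "query": {"term": {"name": "dennis"}},
        "sort": [
            {"some_normalized_field": {"order": "desc"}},
            "_score"
        ]
    }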
You can also use aggregations (with or without function_score):
{
  "query": {
    "function_score": {
      "functions": [
        {
          "gauss": {
            "date": {
              "scale": "3d",
              "offset": "7d",
              "decay": 0.1
            }
          }
        },
        {
          "gauss": {
            "priority": {
              "origin": "0",
              "scale": "100"
            }
          }
        }
      ],
      "query": {
        "match": { "body": "dennis" }
      }
    }
  },
  "aggs": {
    "hits": {
      "top_hits": {
        "size": 10
      }
    }
  }
}
Based on this GitHub ticket, it is simply impossible to normalize the score, and they suggest using boolean similarity as a workaround.
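For reference, boolean similarity is something you set per field in the mapping, roughly like this sketch (index and field names are placeholders; with it, each matching clause contributes a constant score instead of a BM25-based one):

    # Minimal sketch: create an index whose "name" field uses boolean similarity.
    es.indices.create(
        index="myindex-boolean",
        body={
            "mappings": {
                "properties": {
                    "name": {"type": "text", "similarity": "boolean"}
                }
            }
        }
    )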

Resources