I am just getting started with Elasticsearch and would like to use script-based sorting on a field that is mapped as date, format hour_minute. There can be multiple instances of the field in each document.
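For context, the mapping is assumed to look something like this (not shown in the original question; index and type names are placeholders):

PUT myIndex
{
  "mappings": {
    "myType": {
      "properties": {
        "someTime": {
          "type": "date",
          "format": "hour_minute"
        }
      }
    }
  }
}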
Before introducing expressions, as a first step I'm trying a simple sort (using the Sense plugin):
POST myIndex/_search
{
"query": {
"match_all": {}
},
"sort": {
"_script": {
"script": "doc[\"someTime\"].value",
"lang": "groovy",
"type": "date",
"order": "asc"
}
}
}
I get this error (fragment):
SearchPhaseExecutionException[Failed to execute phase [query], all shards failed;
shardFailures {[tjWL-zV5QXmGjNlXzLvrzw][myIndex][0]:
SearchParseException[[myIndex][0]:
query[ConstantScore(*:*)],from[-1],size[-1]: Parse Failure [Failed to parse source…
If I post the above query with "type": "number" there is no error, although this of course doesn't sort by date. The following works fine:
POST myIndex/_search
{
"query": {
"match_all": {}
},
"sort": {
"someTime": {
"order": "asc"
}
}
}
Ultimately I'd like to use script-based sorting since I will be querying, filtering, and sorting with date and time conditions - e.g. query for documents with today's date, then sort them by the earliest time after now, and so on.
Any suggestions would be much appreciated.
Sorting documents with scripts is not really performant, especially if your document set is expected to grow over time. So I'm going to show how to do it anyway, and then suggest another option.
In order to sort using a script, you need to transform your date into milliseconds so that your sort can run on a simple number (the script sort type can only be number or string).
POST myIndex/_search
{
"query": {
"match_all": {}
},
"sort": {
"_script": {
"script": "doc[\"someTime\"].date.getMillisOfDay()",
"lang": "groovy",
"type": "number", <----- make sure this is number
"order": "asc"
}
}
}
Note that depending on the granularity you want, you can also use getSecondOfDay() or getMinuteOfDay(). That way, provided your queries and filters have selected documents for the right day, your sort script will sort documents based on the number of milliseconds (or seconds or minutes) within that day.
The second solution is to also index the number of milliseconds (or seconds or minutes) since the beginning of the day into another field and simply sort on that, so that you don't need a script at all. The bottom line is that whatever information you need at search time that can be known at index time should be indexed instead of computed in real time.
For instance, if your someTime field contains the date 2015-10-05T05:34:12.276Z, then you'd index the millisOfDay field with the value 20052276, which is:

5 hours * 3,600,000 ms = 18,000,000 ms
+ 34 minutes * 60,000 ms = 2,040,000 ms
+ 12 seconds * 1,000 ms = 12,000 ms
+ 276 ms
= 20,052,276 ms
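For illustration, indexing such a document could look like this (the type name myType is a placeholder):

PUT myIndex/myType/1
{
  "someTime": "2015-10-05T05:34:12.276Z",
  "millisOfDay": 20052276
}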
Then you can sort using
POST myIndex/_search
{
"query": {
"range": {
"someTime": {
"gt": "now"
}
}
},
"sort": {
"millisOfDay": {
"order": "asc"
}
}
}
Note that I've added a query to select only the documents whose someTime date is after now, so you'll get all documents in the future, but sorted by ascending millisOfDay, which means you'll get the nearest date from now first.
UPDATE
If someTime has the format HH:mm, then you can also store its millisOfDay value, e.g. if someTime = 17:30 then millisOfDay would be (17h * 3600000 ms) + (30 min * 60000 ms) = 63000000
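That value is cheap to compute client-side at index time. A minimal Python sketch (the helper name millis_of_day is illustrative, not part of any API):

def millis_of_day(hhmm):
    # convert an "HH:mm" string, e.g. "17:30", to milliseconds since midnight
    hours, minutes = map(int, hhmm.split(":"))
    return hours * 3600000 + minutes * 60000

doc = {"someTime": "17:30", "millisOfDay": millis_of_day("17:30")}  # -> 63000000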
Then, your query needs to be reworked a little bit using a script filter, like this:
{
"query": {
"filtered": {
"filter": {
"script": {
"script": "doc.millisOfDay.value > new DateTime().millisOfDay"
}
}
}
},
"sort": {
"millisOfDay": {
"order": "asc"
}
}
}
Related
I have an Elastic index (say file) where I append a document every time the file is downloaded by a client. Each document is quite basic: it contains a filename field and a when date field indicating the time of the download.
What I want to achieve is to get, for each file the number of times it has been downloaded in the last 3 months. Thanks to another question, I have a query that returns all the results:
{
"query": {
"range": {
"when": {
"gte": "now-3M"
}
}
},
"aggs": {
"downloads": {
"terms": {
"field": "filename.keyword",
"size": 1000
}
}
},
"size": 0
}
Now, I want to have a paginated result. The terms aggregation cannot be paginated, so I use a composite aggregation. Of course, if there is a better aggregation, it can be used here...
So for the moment, I have something like that:
{
"query": {
"range": {
"when": {
"gte": "now-3M"
}
}
},
"aggs": {
"downloads_agg": {
"composite": {
"size": 100,
"sources": [
{
"downloads": {
"terms": {
"field": "filename.keyword"
}
}
}
]
}
}
},
"size": 0
}
This aggregation allows me to paginate (thanks to the after_key value in the response), but it is not sorted by the number of downloads - it is sorted by the filename.
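For reference, the next page is requested by echoing the returned after_key back in an after clause; a sketch, assuming the last bucket's key was "report.pdf":

{
  "query": {
    "range": {
      "when": {
        "gte": "now-3M"
      }
    }
  },
  "aggs": {
    "downloads_agg": {
      "composite": {
        "size": 100,
        "sources": [
          {
            "downloads": {
              "terms": {
                "field": "filename.keyword"
              }
            }
          }
        ],
        "after": { "downloads": "report.pdf" }
      }
    }
  },
  "size": 0
}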
How can I sort that composite aggregation on the number of documents for each filename in my index?
Thanks.
The composite aggregation doesn't allow sorting on a computed value such as the doc count.
Excerpt from the discussion on the Elastic forum:
it's designed as a memory-friendly way to paginate over aggregations.
Part of the tradeoff is that you lose things like ordering by doc
count, since that isn't known until after all the docs have been
collected.
I have no experience with Transforms (part of X-Pack and licensed), but you could try that out. Apart from this, I don't see a way to get the expected output.
Consider the following query for Elasticsearch 5.6:
{
"size": 0,
"query": {
"match_all": {}
},
"rescore": [
{
"window_size": 10000,
"query": {
"rescore_query": {
"function_score": {
"boost_mode": "replace",
"script_score": {
"script": {
"source": "doc['topic_score'].value"
}
}
}
},
"query_weight": 0,
"rescore_query_weight": 1
}
}
],
"aggs": {
"distinct": {
"terms": {
"field": "identical_id",
"order": {
"top_score": "desc"
}
},
"aggs": {
"best_unique_result": {
"top_hits": {
"size": 1
}
},
"top_score": {
"max": {
"script": {
"inline": "_score"
}
}
}
}
}
}
}
This is a simplified version where the real query has a more complex main query and the rescore function is far more intensive.
Let me explain its purpose first, in case I'm about to spend a thousand hours developing a pen that writes in space when a pencil would actually solve my problem. I'm performing a fast initial query, then rescoring the top results with a much more intensive function. From those results I want to show the top distinct values, i.e. no two results should have the same identical_id. If there's a better way to do this I'd also consider that an answer.
I expected a query like this would order results by the rescore query, group all the results that had the same identical_id and display the top hit for each such distinct group. I also assumed that since I'm ordering those term aggregation buckets by the max parent _score, they would be ordered to reflect the best result they contain as determined from the original rescore query.
The reality is that the term buckets are ordered by the maximum query score and not the rescore query score. Strangely the top hits within the buckets do seem to use the rescore.
Is there a better way to achieve the end result that I want, or some way I can fix this query to work the way I expect it to?
From the documentation:
The query rescorer executes a second query only on the Top-K results returned by the query and post_filter phases. The number of docs which will be examined on each shard can be controlled by the window_size parameter, which defaults to 10.
As the rescore query kicks in after the post_filter phase, I assume the term aggregation buckets are already fixed.
I have no idea on how you can combine rescore and aggregations. Sorry :(
I think I have a pretty great solution to this problem, but I'll let the bounty run to expiration in case someone comes up with a better approach.
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"sample": {
"sampler": {
"shard_size": 10000
},
"aggs": {
"distinct": {
"terms": {
"field": "identical_id",
"order": {
"top_score": "desc"
}
},
"aggs": {
"best_unique_result": {
"top_hits": {
"size": 1,
"sort": [
{
"_script": {
"type": "number",
"script": {
"source": "doc['topic_score'].value"
},
"order": "desc"
}
}
]
}
},
"top_score": {
"max": {
"script": {
"source": "doc['topic_score'].value"
}
}
}
}
}
}
}
}
}
The sampler aggregation will take the top N hits per shard from the core query and run the aggregations over those. Then, in the max aggregator that defines the bucket order, I use the exact same script as the one I use to pick a top hit from the bucket. Now the buckets and the top hits run over the same top N sets of items, and the buckets are ordered by the max of the same score, generated from the same script. Unfortunately I still need to run the script once to order the buckets and once to pick a top hit within each bucket. You could use the rescore instead for the top-hits ordering, but either way it has to run twice, and I found it was faster as a sort script than as a rescore.
The Data
So I have reams of different types of time series data. Currently I've chosen to put each type of data into its own index because, with the exception of 4 fields, all of the data is very different. The data is also sampled at different rates and is not guaranteed to have common timestamps across the same sub-second window, so fusing it all into one large document is not a trivial task either.
The Goal
One of our common use cases, which I'm trying to see if I can solve entirely in Elasticsearch, is to return an aggregation result from one index based on the time windows returned by a query of another index.
[diagram omitted: an aggregation over one index restricted to the time windows found in another index]
Some Considerations
For small enough signal transitions on the "condition" data, I can just use a date histogram and some combination of a top_hits sub-aggregation, but this quickly breaks down when I have tens or hundreds of thousands of occurrences of "the condition". Further, this is just one "case"; I have hundreds of sets of similar situations that I'd like to get the overall min/max from.
The comparisons are basically amongst what I would consider to be sibling level documents or indices, so there doesn't seem to be any obvious parent->child relationship that would be flexible enough over the long run, at least with how the data is currently structured.
It feels like there should be an elegant solution instead of brute force building the date ranges outside of Elasticsearch with the results of one query and feeding 100's of time ranges into another query.
Looking through the documentation, it feels like some combination of Elasticsearch scripting and some of the pipeline aggregations is going to be what I want, but no definitive solution is jumping out at me. I could really use some pointers in the right direction from the community.
Thanks.
I found a "solution" that worked for me for this problem. No answers or even comments from anyone yet, but I'll post my solution in case someone else comes along looking for something like this. I'm sure there is a lot of opportunity for improvement and optimization, and if I discover such a solution (likely through a scripted aggregation) I'll come back and update this answer.
It may not be the optimal solution but it works for me. The key was to leverage the top_hits, serial_diff and bucket_selector aggregators.
The "solution"
from elasticsearch import Elasticsearch

# assumes a locally reachable cluster; adjust hosts as needed
es = Elasticsearch()

def time_edges(index, must_terms=[], should_terms=[], filter_terms=[], data_sample_accuracy_window=200):
    """
    Find the affected flights and date ranges where a specific set of terms occurs in a particular ES index.

    index: the Elasticsearch index to search
    must_terms / should_terms / filter_terms: lists of query clauses of the form {"term": {"<field>": <value>}}
    data_sample_accuracy_window: the largest gap (in ms) between samples that still counts as adjacent
    """
query = {
"size": 0,
"timeout": "5s",
"query": {
"constant_score": {
"filter": {
"bool": {
"must": must_terms,
"should": should_terms,
"filter": filter_terms
}
}
}
},
"aggs": {
"by_flight_id": {
"terms": {"field": "flight_id", "size": 1000},
"aggs": {
"last": {
"top_hits": {
"sort": [{"#timestamp": {"order": "desc"}}],
"size": 1,
"script_fields": {
"timestamp": {
"script": "doc['#timestamp'].value"
}
}
}
},
"first": {
"top_hits": {
"sort": [{"#timestamp": {"order": "asc"}}],
"size": 1,
"script_fields": {
"timestamp": {
"script": "doc['#timestamp'].value"
}
}
}
},
"time_edges": {
"histogram": {
"min_doc_count": 1,
"interval": 1,
"script": {
"inline": "doc['#timestamp'].value",
"lang": "painless",
}
},
"aggs": {
"timestamps": {
"max": {"field": "#timestamp"}
},
"timestamp_diff": {
"serial_diff": {
"buckets_path": "timestamps",
"lag": 1
}
},
"time_delta_filter": {
"bucket_selector": {
"buckets_path": {
"timestampDiff": "timestamp_diff"
},
"script": "if (params != null && params.timestampDiff != null) { params.timestampDiff > " + str(data_sample_accuracy_window) + "} else { false }"
}
}
}
}
}
}
}
}
return es.search(index=index, body=query)
Breaking things down
Filter the results by 'Index 2'
"query": {
"constant_score": {
"filter": {
"bool": {
"must": must_terms,
"should": should_terms,
"filter": filter_terms
}
}
}
},
must_terms contains the required clauses for getting all the results for "the condition" stored in "Index 2".
For example, to limit results to the last 10 days, where condition has the value 10 or 12, we add the following must_terms:
must_terms = [
{
"range": {
"#timestamp": {
"gte": "now-10d",
"lte": "now"
}
}
},
{
"terms": {"condition": [10, 12]}
}
]
This returns a reduced set of documents that we can then pass on into our aggregations to figure out where our "samples" are.
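Putting the pieces together, a call could look like this (the index name is a placeholder):

response = time_edges("index-2", must_terms=must_terms)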
Aggregations
For my use case we have the notion of "flights" for our aircraft, so I wanted to group the returned results by their id and then "break up" all the occurrences into buckets.
"aggs": {
"by_flight_id": {
"terms": {"field": "flight_id", "size": 1000},
...
}
}
}
You can get the rising edge of the first occurrence and the falling edge of the last occurrence using the top_hits aggregation:
"last": {
"top_hits": {
"sort": [{"#timestamp": {"order": "desc"}}],
"size": 1,
"script_fields": {
"timestamp": {
"script": "doc['#timestamp'].value"
}
}
}
},
"first": {
"top_hits": {
"sort": [{"#timestamp": {"order": "asc"}}],
"size": 1,
"script_fields": {
"timestamp": {
"script": "doc['#timestamp'].value"
}
}
}
},
You can get the samples in between using a histogram on a timestamp. This breaks up your returned results into buckets for every unique timestamp. This is a costly aggregation, but worth it. Using the inline script allows us to use the timestamp value for the bucket name.
"time_edges": {
"histogram": {
"min_doc_count": 1,
"interval": 1,
"script": {
"inline": "doc['#timestamp'].value",
"lang": "painless",
}
},
...
}
By default the histogram aggregation returns a set of buckets with the document count for each bucket, but we need a metric value. That is what the serial_diff aggregation requires to work, so we have to add a token max aggregation on the results to get a value returned.
"aggs": {
"timestamps": {
"max": {"field": "#timestamp"}
},
"timestamp_diff": {
"serial_diff": {
"buckets_path": "timestamps",
"lag": 1
}
},
...
}
We use the results of the serial_diff to determine whether or not two buckets are approximately adjacent. We then discard samples that are adjacent to each other and create a combined time range for our condition by using the bucket_selector aggregation. This throws out buckets whose gap from the previous bucket is smaller than our data_sample_accuracy_window; the right value depends on your dataset.
"aggs": {
...
"time_delta_filter": {
"bucket_selector": {
"buckets_path": {
"timestampDiff": "timestamp_diff"
},
"script": "if (params != null && params.timestampDiff != null) { params.timestampDiff > " + str(data_sample_accuracy_window) + "} else { false }"
}
}
}
The serial_diff results are also critical for determining how long our condition was set. The timestamps of our buckets end up representing the "rising" edge of our condition signal, so the falling edge is unknown without some post-processing. We use the timestampDiff value to figure out where the falling edge is.
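As a rough sketch of that post-processing (not part of the original answer; it assumes the response shape produced by the query above, and window mirrors data_sample_accuracy_window):

def extract_ranges(response, window=200):
    # collapse adjacent histogram buckets into (start, end) windows per flight
    ranges = []
    for flight in response["aggregations"]["by_flight_id"]["buckets"]:
        edges = [bucket["key"] for bucket in flight["time_edges"]["buckets"]]
        if not edges:
            continue
        start = edges[0]
        for prev, curr in zip(edges, edges[1:]):
            if curr - prev > window:      # gap found: prev is a falling edge
                ranges.append((start, prev))
                start = curr              # curr is the next rising edge
        ranges.append((start, edges[-1]))
    return ranges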
I'm wondering if it is, in any way, possible to make ES run aggregations on other aggregations all in the same query?
Basically, that's called pipelining.
I'm talking about Elasticsearch 1.5. Yes, I know that's unfortunate, but I'm stuck with AWS and that's what they're selling, so I have to live with it.
I'm guessing that is not possible, so I'll write the next phase of the question right away.
Assuming I can query ES multiple times based on results from previous queries, how would you do the following:
Get a list of the top 100 tags sorted by the number of appearances in the documents in the past hour. (I have a tags field on each record; I'd like to know which tags are the most common.)
Having that, for each of the 100 tags, get the number of appearances split into 1-hour buckets (denote by Y the count for the last hour).
Then, calculate by how many percent Y deviates from the average of all the other 1-hour buckets.
Thank you for helping !!!
Basically, that's called pipelining.
No. Pipeline Aggregations did not appear until Elasticsearch 2.0. For what it's worth, Elastic does offer its own ESaaS offering with Elastic Cloud. It also runs on AWS.
... how would you do the following
The first two are more a matter of narrowing scope than of working on the values.
{
"query": {
"filtered": {
"filter": {
"range" : {
"timestamp": {
"gte": "now-1h"
}
}
}
}
}
}
This will give you the last hour of data.
{
"size": 0,
"aggs": {
"group_by_tag": {
"terms": {
"field": "tag",
"size": 100
}
}
}
}
This will give you the top 100 tags for all time.
If you put them together, then you get the top 100 tags in the past hour.
For the second request, it sounds like you want a mix of that, but you also want more than just the last hour.
Whenever performing an aggregation (or a GROUP BY query for that matter), you need to think about incremental steps. If you want to group by hour and then do something, that's the order it needs to happen in. So it's not a matter of "now that I have the last hour, let's get the other hours too": once you've narrowed your window (scope), you generally can't go back.
So to get number 2, we need to look at it differently. Group by as many hours as you're interested in looking at (however many 1-hour buckets you want), then get those, and then get the count per bucket. I'll take a guess and say that you want 24 one-hour buckets (note that 24 * 100 is 2400, which is not insignificant!).
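For completeness, that brute-force version could look something like this (not from the original answer; a date_histogram with a terms sub-aggregation):

{
  "size": 0,
  "query": {
    "filtered": {
      "filter": {
        "range": {
          "timestamp": {
            "gte": "now-24h"
          }
        }
      }
    }
  },
  "aggs": {
    "group_by_hour": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "hour"
      },
      "aggs": {
        "group_by_tag": {
          "terms": {
            "field": "tag",
            "size": 100
          }
        }
      }
    }
  }
}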
That's a lot of buckets, so maybe we can think about the question differently.
I want the top 100 for the last hour.
I want the average across the top 100 for some time X (where you define X; reducing it will make the query faster, but it naturally limits the window of selection). By limiting with the filter, we reduce the scope of the overall aggregation.
That may look like this:
{
"size": 0,
"query": {
"filtered": {
"filter": {
"range" : {
"timestamp": {
"gte": "now-24h"
}
}
}
}
},
"aggs": {
"group_by_hour_and_day": {
"date_range": {
"field": "timestamp",
"ranges": [
{ "from": "now-1h" },
{ "to": "now-1h" }
]
},
"aggs": {
"group_by_tag": {
"field": "tag",
"size": 100
}
}
}
}
}
The problem with this request is that it gives you now-24h to now-1h, then now-1h to now. That's pretty loosely what you requested, but it doesn't give it by term (which may or may not matter); instead, the split is by time (again, steps/order matter). You can then say that the previous 24h average is the resulting doc count of the wider window divided by the window size (23 in this case, for 23 hours). If you want to include the last hour in the average, then you can change "to": "now-1h" to "to": "now".
We can perhaps flip this to give us the answer differently, but with a little more effort (the query still limits the maximum time range to consider):
{
"size": 0,
"query": { ... },
"aggs": {
"group_by_tag": {
"terms": {
"field": "tag",
"size": 100
},
"aggs": {
"group_by_range": {
"field": "timestamp",
"ranges": [
{ "from": "now-1h" },
{ "to": "now-1h" }
]
}
}
}
}
}
Notice that now we aggregate by tag first across the full scope. You could then drop the second range ({ "to": "now-1h" }) because the terms aggregation already gives you the total for the full window. The problem with this approach is that a tag could be very popular in the last hour without being popular enough over the past full range, and so it won't appear at all.
Unfortunately, the solution to that is to add an extra step: make two top-level aggregations, one for the top 100 in the full scope and one for the top 100 in the last hour.
{
"size": 0,
"query": { ... },
"aggs": {
"group_by_tag": {
"terms": {
"field": "tag",
"size": 100
}
},
"group_by_last_hour": {
"filter": {
"range": {
"timestamp": {
"gte": "now-1h"
}
}
},
"aggs": {
"terms": {
"field": "tag",
"size": 100
}
}
}
}
}
This gives the top 100 for the full window -- whatever that might be -- and then it also separately gives the top 100 for the last hour.
Then, calculate by how many percent Y deviates from the average of all the other 1-hour buckets.
Do this on the client side based on whichever form you care to use, and calculate the average by cross-comparing.
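A sketch of that client-side step in Python (the counts come from the two aggregations above; the function and parameter names are illustrative):

def percent_deviation(last_hour_count, full_window_count, window_hours=24):
    # average over the other buckets, i.e. excluding the last hour
    other_hours = window_hours - 1
    average = (full_window_count - last_hour_count) / float(other_hours)
    if average == 0:
        return 0.0
    return (last_hour_count - average) / average * 100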
And considering the type of query, you should then cache the result, which allows you to play with larger window sizes than might be otherwise desirable.
Updated question
In my query I aggregate on date and then on sensor name. Is it possible to calculate a ratio from a nested aggregation and the total count of documents (or any other aggregation) of the parent bucket? Example query:
{
"size": 0,
"aggs": {
"over_time": {
"aggs": {
"by_date": {
"date_histogram": {
"field": "date",
"interval": "1d",
"min_doc_count": 0
},
"aggs": {
"measure_count": {
"cardinality": {
"field": "date"
}
},
"all_count": {
"value_count": {
"field": "name"
}
},
"by_name": {
"terms": {
"field": "name",
"size": 0
},
"aggs": {
"count_by_name": {
"value_count": {
"field": "name"
}
},
"my ratio": count_by_name / all_count * 100 <-- How to do that?
}
}
}
}
}
}
}
}
I want a custom metric that gives me the ratio count_by_name / all_count * 100. Is that possible in ES, or do I have to compute that on the client?
This seems very simple to me, but I haven't found a way yet.
Old post:
Is there a way to let Elasticsearch consider the overall count of documents (or any other metric) when calculating the average for a bucket?
Example:
I have like 100000 sensors that generate events on different times. Every event is indexed as a document that has a timestamp and a value.
When I want to calculate a ratio of the value over a date histogram, and some sensors only generated values at one time, I want Elasticsearch to treat the missing values (documents) for my sensors as 0 instead of null.
So when aggregating by day, if a sensor only generated two values, at 10pm (3) and 11pm (5), the aggregate for the day should be (3+5)/24, or formally: SUM(value)/24.
Instead, Elasticsearch calculates the average like (3+5)/2, which is not correct in my case.
There was once a ticket on GitHub (https://github.com/elastic/elasticsearch/issues/9745), but the answer was "handle it in your application". That's no answer for me, as I would have to generate zillions of zero-value documents for every sensor/time combination to get the average ratio right.
Any ideas on this?
If this is the case, simply divide the results by 24 on the application side. And when the granularity changes, change this value accordingly. The number of hours per day is fixed, right?
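If you are on a version with pipeline aggregations (2.0+), the bucket_script approach described in the next answer can also push that division into the query itself; a sketch with illustrative names (per_day, day_sum, and the field names timestamp and value are assumptions):

{
  "size": 0,
  "aggs": {
    "per_day": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "1d"
      },
      "aggs": {
        "day_sum": {
          "sum": { "field": "value" }
        },
        "hourly_average": {
          "bucket_script": {
            "buckets_path": { "daySum": "day_sum" },
            "script": "daySum / 24"
          }
        }
      }
    }
  }
}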
You can use the Bucket script aggregation to do what you want.
{
"bucket_script": {
"buckets_path": {
"count_by_name": "count_by_name",
"all_count": "all_count"
},
"script": "count_by_name / all_count*100"
}
}
It's just an example.
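To place it in the query from the question: buckets_path can only reference sibling aggregations in the same bucket, so count_by_name inside by_name cannot directly see the parent's all_count. A ratio of two metrics that do live in the same bucket, such as the by_date-level metrics, would look like this (measure_ratio is an illustrative name):

"aggs": {
  "measure_count": {
    "cardinality": { "field": "date" }
  },
  "all_count": {
    "value_count": { "field": "name" }
  },
  "measure_ratio": {
    "bucket_script": {
      "buckets_path": {
        "measures": "measure_count",
        "total": "all_count"
      },
      "script": "measures / total * 100"
    }
  }
}

For the exact per-name ratio in the question, computing it on the client from count_by_name and all_count remains the straightforward option.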
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/search-aggregations-pipeline-bucket-script-aggregation.html