Elasticsearch date_histogram interval

I am creating a date histogram aggregation like this, where the min and max of extended_bounds are Unix epoch millisecond values.
"aggs": {
"0": {
"date_histogram": {
"field": "#timestamp",
"fixed_interval": "30s",
"time_zone": "Asia/Kolkata",
"extended_bounds": {
"min": 1656419435318,
"max": 1656420335318
}
}
}
}
Right now I am using "30s" as a hard-coded value for fixed_interval.
How can this value be generated dynamically from the duration of the bounds (the min and max of extended_bounds) so that I get the same number of buckets for every duration? Is there a function available in any Kibana plugin for this purpose?
For example, if I want 30 buckets:
(a) for a 1 hour duration, fixed_interval would be 2 minutes
(b) for 24 hours, fixed_interval would be 48 minutes
I can write my own code to do this calculation (a sketch of it follows below), but an existing API would be helpful.
Also, when should calendar_interval be used in place of fixed_interval? I have looked at the queries Kibana Lens generates, where either fixed_interval or calendar_interval is used depending on the search duration.
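I'm not aware of a standalone helper you can call outside Kibana's own plugin code, but the calculation is small enough to do yourself. Below is a minimal sketch in Python; the helper name and the list of "nice" candidate intervals are illustrative assumptions, not an existing Elasticsearch or Kibana API.

# Illustrative helper (not an existing API): pick a fixed_interval from
# extended_bounds given in epoch milliseconds so that the histogram produces
# at most `target_buckets` buckets, rounding up to the nearest "nice" step.
CANDIDATE_INTERVALS_MS = [
    ("1s", 1_000), ("5s", 5_000), ("10s", 10_000), ("30s", 30_000),
    ("1m", 60_000), ("5m", 300_000), ("10m", 600_000), ("30m", 1_800_000),
    ("1h", 3_600_000), ("3h", 10_800_000), ("12h", 43_200_000), ("1d", 86_400_000),
]

def pick_fixed_interval(min_ms: int, max_ms: int, target_buckets: int = 30) -> str:
    raw_interval_ms = (max_ms - min_ms) / target_buckets
    for label, interval_ms in CANDIDATE_INTERVALS_MS:
        if interval_ms >= raw_interval_ms:
            return label
    return CANDIDATE_INTERVALS_MS[-1][0]

# The bounds from the query above span 15 minutes, so this prints "30s":
print(pick_fixed_interval(1656419435318, 1656420335318))

On the second question: fixed_interval is for constant-length units (seconds, minutes, hours, days), while calendar_interval is for calendar-aware units such as week, month, quarter, or year, whose bucket lengths vary (months are 28-31 days, and time zones or DST can change day lengths). That is presumably why Kibana Lens switches between the two depending on the search duration.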

Related

Is there a way to specify percentage value in ES DSL Sampler aggregation

I am trying to do a sum aggregation on a sample of the data: I want the sum of the cost field for only the top 25% of records (those with the highest cost).
I know I have the option to run a sampler aggregation, which can help me achieve this, but there I need to pass the exact number of records on which to run the sampler aggregation.
{
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 300
      },
      "aggs": {
        "total_cost": {
          "sum": {
            "field": "cost"
          }
        }
      }
    }
  }
}
But is there a way to specify a percentage instead of an absolute number here? In my case the total number of documents changes pretty regularly, and I need the top 25% (costliest).
How I get it today is with two queries (a rough sketch of this follows below):
- first, get the total number of records
- then divide that number by 4 and run the sampler query with that value (I have also added a descending sort on the cost field, which is not shown in the query above)
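For reference, here is a minimal sketch of that two-query workaround (count first, then sample roughly a quarter of the documents), using plain HTTP with a placeholder host and index name; it only illustrates the approach described above, there is still no built-in percentage option.

import math

import requests

ES = "http://localhost:9200"   # assumed host
INDEX = "costs"                # placeholder index name

# Query 1: total number of matching documents.
total = requests.get(f"{ES}/{INDEX}/_count").json()["count"]

# Query 2: sampler sized to ~25% of the documents, summing the cost field.
# Note that shard_size applies per shard, so on a multi-shard index this is
# only an approximation of a global top 25%.
body = {
    "size": 0,
    "sort": [{"cost": "desc"}],
    "aggs": {
        "sample": {
            "sampler": {"shard_size": math.ceil(total / 4)},
            "aggs": {"total_cost": {"sum": {"field": "cost"}}},
        }
    },
}
result = requests.post(f"{ES}/{INDEX}/_search", json=body).json()
print(result["aggregations"]["sample"]["total_cost"]["value"])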

Using date_histogram with fixed_interval (30d) unexpected bucket start

I have a requirement to get data aggregated per 30 days (not month) so I'm using a date_histogram with "fixed_interval": "30d" to get that data. For example, if the user wants the last 90 days aggregations, there should be 3 buckets: [90-60, 60-30, 30-0]. Taking today's date (18-Mar-2021), I would want buckets [18-Dec,17-Jan,16-Feb].
However, what I actually get is [4-Dec, 3-Jan, 2-Feb, 4-Mar]. The first bucket starts well before any data is available, which also means there is one more bucket at the end than expected.
I found out that you can't easily control when your buckets are meant to start (e.g. I want my first bucket to start at today minus 90 days). Buckets seem to be aligned to 1970-01-01 according to what I could find, and the documentation says as much, though it doesn't go into the impact of this.
With this in mind, I worked out that I could use offset with an "interesting formula" so that I get the buckets I need. E.g.:
GET /my_index/_search?filter_path=aggregations
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "@timestamp": {
              "gte": "TODAY - 90/60/30",
              "lt": "TODAY"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "discussion_interactions_chart": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "30d",
        "format": "yyyy-MM-dd",
        "offset": "(DAYS(@timestamp.gte, 1970-01-01) % 30)d"
      }
    }
  }
}
(Obviously this query doesn't work directly; I build the variables in code. For the 18-Mar-2021 example, the offset works out to 14.)
So basically the offset is calculated as the number of days between my lower-bound date and the epoch, mod 30. This seems to work, but it's hard to justify this logic in a code review. Is there a nicer solution to this?
Here's a Python implementation of the answer in your question (which you really deserve upvotes for, it's clever and helped me):
import datetime

from elasticsearch_dsl import A  # aggregation factory used below

fixed_interval_days = 90
# offset needed to make the fixed_interval histogram end on today's date
# (otherwise the intervals are aligned to 1970-01-01)
offset_days = (datetime.datetime.utcnow() - datetime.datetime(1970, 1, 1)).days % fixed_interval_days
...
A(
    "date_histogram",
    fixed_interval=f"{fixed_interval_days}d",
    offset=f"{offset_days}d",
)

Need an elasticsearch query filter range which starts 5 minutes before scheduled time

I'm using elasticsearch 6.5.4, and a kibana watcher to alert.
I have a filter range like so:
"filter": [
{
"range": {
"#timestamp": {
"gte": "{{ctx.trigger.scheduled_time}}||-{{ctx.metadata.triggered_interval}}m"
}
}
}
]
The scheduled_time is every hour at the 5th minute (1:05, 2:05, etc.), and the triggered_interval is 60.
I want to gather a range of @timestamps, ignoring the most recent 5 minutes. Basically, certain status messages might be too new to be true errors, so I want to ignore them.
I'm trying to craft this so that it reads as: the begin time is trigger.scheduled_time - 5m and the end time is triggered_interval.
The range format is time1-time2, so scheduled_time-5m-triggered_interval is invalid syntax.
I've tried a few iterations but nothing seems to work; the watcher just returns a null pointer exception.
"gte": "<{{{ctx.trigger.scheduled_time}}||-5m}>-{{ctx.metadata.triggered_interval}}m"
"gte": "<{{ctx.trigger.scheduled_time}}||-5m>-{{ctx.metadata.triggered_interval}}m"
"gte": "{{ctx.trigger.scheduled_time}}||-5m-{{ctx.metadata.triggered_interval}}m"
"gte": "({{ctx.trigger.scheduled_time}}||-5m)-{{ctx.metadata.triggered_interval}}m"
Is this possible to do in the range filter?
The Elasticsearch date math functionality together with a range query should do the trick.
If you want to select all events older than 5 minutes and younger than 60 minutes, relative to the execution time, I'd go with this:
"filter": [
{
"range": {
"#timestamp": {
"lte": "now-5m/m",
"gte": "now-60m/m"
}
}
}
]
In other words: get all events where the @timestamp is older than 5 minutes but not older than 60 minutes, with the @timestamps rounded to the full minute. If you don't need the rounding, just remove the /m.
Cheers!
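To make the rounding concrete, here is a small Python illustration (not Elasticsearch code) of the window those bounds resolve to; in a range query, /m rounds the gte side down to the start of the minute and the lte side up to the end of it.

import datetime

now = datetime.datetime.utcnow()

# "gte": "now-60m/m" -> rounded down to the start of that minute
gte = (now - datetime.timedelta(minutes=60)).replace(second=0, microsecond=0)
# "lte": "now-5m/m" -> rounded up to the end of that minute
lte = (now - datetime.timedelta(minutes=5)).replace(second=59, microsecond=999000)

print(f"window: {gte.isoformat()} to {lte.isoformat()}")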

Can Elasticsearch do a decay search on the log of a value?

I store a number, views, in Elasticsearch. I want to find documents "closest" to it on a logarithmic scale, so that 10k and 1MM are the same distance (and get scored the same) from 100k views. Is that possible?
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html#exp-decay describes field value factor and decay functions but can they be "stacked"? Is there another approach?
I'm not sure if you can achieve this directly with decay, but you could easily do it with the script_score function. The example below uses dynamic scripting, but please be aware that using file-based scripts is the recommended, far more secure approach.
In the query below, the offset parameter is set to 100,000, and documents with that value for their 'views' field will score the highest. Score decays logarithmically as the value of views departs from offset. Per your example, documents with 1,000,000 and/or 10,000 have identical scores (0.30279312 in this formula).
You can invert the order of these results by changing the beginning of the script to multiply by _score instead of divide.
$ curl -XPOST localhost:9200/somestuff/_search -d '{
  "size": 100,
  "query": {
    "bool": {
      "must": [
        {
          "function_score": {
            "functions": [
              {
                "script_score": {
                  "params": {
                    "offset": 100000
                  },
                  "script": "_score / (1 + ((log(offset) - log(doc['views'].value)).abs()))"
                }
              }
            ]
          }
        }
      ]
    }
  }
}'
Note: you may want to account for the possibility of 'views' being null, depending on your data.
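To make the scoring concrete, here is a quick Python check of that formula (log() in the script is the natural logarithm), showing that 10,000 and 1,000,000 views sit at the same log-scale distance from an offset of 100,000.

import math

def log_distance_score(views: int, offset: int = 100_000, base_score: float = 1.0) -> float:
    # Mirrors the script above: score shrinks as |log(offset) - log(views)| grows.
    return base_score / (1 + abs(math.log(offset) - math.log(views)))

print(log_distance_score(100_000))    # 1.0 -- exactly at the offset
print(log_distance_score(10_000))     # ~0.3028
print(log_distance_score(1_000_000))  # ~0.3028 -- same distance on a log scale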

ElasticSearch: rank by proximity to date

In ElasticSearch, is there a way to rank search results by proximity to a given date (or number)?
You can use Script Based Sorting to calculate the proximity. However, if you have a large number of results, you might need to switch to a native script to achieve good performance.
try this example:
"DECAY_FUNCTION": {
"FIELD_NAME": {
"origin": "2013-09-17",
"scale": "10d",
"offset": "5d",
"decay" : 0.5
}
}
DECAY_FUNCTION can be "linear", "exp" or "gauss". If your field is a date field, you can set scale and offset in days, weeks, and so on.
Reference: http://www.elastic.co/guide/en/elasticsearch/reference/1.5/query-dsl-function-score-query.html
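As a sketch of where that decay clause sits in a full request, here is a gauss decay inside a function_score query, written as a Python dict so it can be sent with any client; the field name publish_date and the index in the comment are placeholders.

import json

# Placeholder field name; origin is the date you want results ranked towards.
query = {
    "query": {
        "function_score": {
            "query": {"match_all": {}},
            "functions": [
                {
                    "gauss": {
                        "publish_date": {
                            "origin": "2013-09-17",
                            "scale": "10d",
                            "offset": "5d",
                            "decay": 0.5,
                        }
                    }
                }
            ],
        }
    }
}

print(json.dumps(query, indent=2))  # request body for POST /my_index/_search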
