Using date_histogram with fixed_interval (30d): unexpected bucket start - elasticsearch

I have a requirement to aggregate data per 30 days (not per month), so I'm using a date_histogram with "fixed_interval": "30d". For example, if the user wants aggregations for the last 90 days, there should be 3 buckets: [90-60, 60-30, 30-0]. Taking today's date (18-Mar-2021), I would want the buckets [18-Dec, 17-Jan, 16-Feb].
However, what I actually get is [4-Dec, 3-Jan, 2-Feb, 4-Mar]. The first bucket starts well before any data is available, which also means there is one more bucket at the end than expected.
I found that you can't easily control when your buckets start (e.g. I want my first bucket to start at today minus 90 days). Buckets seem to be aligned to 1970-01-01 according to what I could find (e.g. this), and the documentation says as much as well (this link, though it doesn't go into the impact in depth).
With this in mind, I worked out that I could use offset with an "interesting formula" to get the buckets I need. E.g.:
GET /my_index/_search?filter_path=aggregations
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "@timestamp": {
              "gte": "TODAY - 90/60/30",
              "lt": "TODAY"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "discussion_interactions_chart": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "30d",
        "format": "yyyy-MM-dd",
        "offset": "(DAYS(@timestamp.gte, 1970-01-01) % 30)d"
      }
    }
  }
}
(Obviously this query doesn't run as-is; I build the variables in code. For the 18-Mar-2021 example the offset works out to 14.)
So basically the offset is calculated as the number of days between my lower-bound date and the epoch, mod 30. This seems to work, but it's hard to justify this logic in a code review. Is there a nicer solution?
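For reference, here is a quick worked check of that formula for the 18-Mar-2021 example in plain Python (nothing here is Elasticsearch-specific; the dates come from the question):
import datetime

today = datetime.date(2021, 3, 18)
lower_bound = today - datetime.timedelta(days=90)                  # 2020-12-18
days_since_epoch = (lower_bound - datetime.date(1970, 1, 1)).days  # 18614
offset_days = days_since_epoch % 30                                # 14
print(f"offset: {offset_days}d")                                   # offset: 14d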

Here's a Python implementation of the approach from your question (which really deserves upvotes; it's clever and helped me):
import datetime

from elasticsearch_dsl import A

fixed_interval_days = 90
# offset needed to make the fixed_interval histogram end on today's date
# (Elasticsearch starts the intervals at 1970-01-01)
offset_days = (datetime.datetime.utcnow() - datetime.datetime(1970, 1, 1)).days % fixed_interval_days
...
A(
    "date_histogram",
    fixed_interval=f"{fixed_interval_days}d",
    offset=f"{offset_days}d",
    # ... remaining parameters (e.g. the date field) are omitted in the original snippet
)

Related

Elasticsearch Datehistogram Interval

I am creating a date histogram aggregation like this, where the min and max of extended_bounds are Unix epoch values in milliseconds.
"aggs": {
"0": {
"date_histogram": {
"field": "#timestamp",
"fixed_interval": "30s",
"time_zone": "Asia/Kolkata",
"extended_bounds": {
"min": 1656419435318,
"max": 1656420335318
}
}
}
}
Currently I am using "30s" as a hard-coded value for fixed_interval.
How can this value be generated dynamically based on the duration of the bounds (the min and max of extended_bounds) if I want the same number of buckets for each duration? Is there a function available in any Kibana plugin for this purpose?
For example, if I want 30 buckets:
(a) for a 1-hour duration, fixed_interval would be 2 mins
(b) for a 24-hour duration, fixed_interval would be 48 mins
I can write code of my own to do this calculation, but an existing API would be helpful.
Also, when should calendar_interval be used in place of fixed_interval? I have looked at queries generated by Kibana Lens, where either fixed_interval or calendar_interval is used depending on the search duration.
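A minimal client-side sketch of that calculation in Python (the target bucket count of 30 and returning the interval in raw seconds are assumptions; in practice you would probably also round to "nicer" values, as Kibana does):
def fixed_interval_for(min_ms: int, max_ms: int, target_buckets: int = 30) -> str:
    """Return a fixed_interval string that yields roughly target_buckets buckets."""
    span_seconds = (max_ms - min_ms) // 1000
    interval_seconds = max(1, round(span_seconds / target_buckets))
    return f"{interval_seconds}s"

# The bounds from the query above span 15 minutes -> "30s" for 30 buckets.
print(fixed_interval_for(1656419435318, 1656420335318))   # 30s
# A 24-hour window -> "2880s" (48 minutes) for 30 buckets.
print(fixed_interval_for(0, 24 * 3600 * 1000))             # 2880s
As for calendar_interval vs fixed_interval: calendar_interval is for calendar-aware units (day, week, month, quarter, year) that respect varying month lengths and daylight-saving shifts, while fixed_interval is always an exact number of SI units; that is why Kibana switches between them depending on the chosen time span.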

elasticsearch get date range of most recent ingestion

I have an Elasticsearch index that gets new data in large dumps, so looking at the graph it is very obvious when new data is added.
If I only want to get data from the most recent ingestion (in this case, data from 2020-08-06), what's the best way of doing this?
I can use this query to get the most recent document:
GET /indexname/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": queryString
          }
        }
      ]
    }
  },
  "sort": {
    "@timestamp": "desc"
  },
  "size": 1
}
This returns the most recent document, in this case one with a timestamp of 2020-08-06. I can set that as my endDate and set my startDate to that date minus one day, but I'm worried about cases where the data was ingested overnight and spans two days.
I could keep making requests going back in time 5 hours at a time to find the most recent large gap, but I'm worried that making requests in a for loop could be time-consuming. Is there a smarter way of getting the date range of my most recent ingestion?
When your data comes in batches, it's best to assign an identifier to each batch. That way, no date math is required.
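A sketch of that approach with the official Python client (the index name, the batch_id field, and its numeric type are assumptions, not from the question):
# Tag every document in a dump with the same batch_id at index time,
# then query only the latest batch.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# 1. Find the most recent batch_id (a max agg works for numeric/date ids).
resp = es.search(
    index="indexname",
    body={"size": 0, "aggs": {"latest_batch": {"max": {"field": "batch_id"}}}},
)
latest_batch = resp["aggregations"]["latest_batch"]["value"]

# 2. Fetch only documents from that batch -- no date math required.
docs = es.search(
    index="indexname",
    body={"query": {"term": {"batch_id": latest_batch}}, "size": 100},
)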

Complex ElasticSearch Query

I have documents with (id, value, modified_date). I need to get all the documents for ids which have a specific value as of the latest modified_date.
My understanding is that I first need to find such ids and then put them inside a bigger query. To find such ids, it looks like I would use "top_hits" with some post-filtering of the results.
The goal is to do as much work as possible on the server side to speed things up. This would have been trivial in SQL, but with Elasticsearch I am at a loss. I would then need to write this in Python using elasticsearch_dsl. Can anyone help?
UPDATE: In case it's not clear, "all the documents for ids which have a specific value as of the latest modified_date" means:
1. group by id,
2. in each group select the record with the largest modified_date,
3. keep only those records that have the specific value,
4. from those records keep only the ids,
5. get all documents whose ids are in the list from step 4.
Specifically, step 1 is an aggregation, step 2 is another aggregation using "top_hits" with reverse sorting by date, step 3 is an analog of SQL's HAVING clause (a Bucket Selector Aggregation?), step 4 is _source filtering, and step 5 is a terms lookup.
My biggest challenge so far has been figuring out that the Bucket Selector Aggregation is what I need and putting the pieces together.
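For what it's worth, here is a rough elasticsearch_dsl sketch of steps 1-2 (group by id, take the latest record per group), with steps 3-5 finished client-side for simplicity; the index name, field names, target value, and the terms size cap are all assumptions:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

es = Elasticsearch("http://localhost:9200")
TARGET_VALUE = "some-value"

s = Search(using=es, index="my-index").extra(size=0)
by_id = s.aggs.bucket("by_id", "terms", field="id", size=10000)   # step 1
by_id.metric("latest", "top_hits", size=1,                        # step 2
             sort=[{"modified_date": {"order": "desc"}}],
             _source=["id", "value"])
response = s.execute()

# Steps 3-4 client-side: keep ids whose latest record has the target value.
matching_ids = [
    b.key
    for b in response.aggregations.by_id.buckets
    if b.latest.hits.hits[0]["_source"]["value"] == TARGET_VALUE
]

# Step 5: fetch all documents for those ids (a plain terms query here; a
# terms lookup would let Elasticsearch do this step server-side instead).
docs = Search(using=es, index="my-index").filter("terms", id=matching_ids).execute()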
This shows an example on how to get the latest elements in each group:
How to get latest values for each group with an Elasticsearch query?
This will return the average price bucketed in one-day intervals:
GET /logstash-*/_search?size=0
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "1d",
        "time_zone": "Europe/Berlin",
        "min_doc_count": 1
      },
      "aggs": {
        "1": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}
I wrote it so it matches all records, which obviously returns more data than you need. Depending on the amount of data, it might be easier to finish the task on the client side.

Can Elasticsearch do a decay search on the log of a value?

I store a number, views, in Elasticsearch. I want to find documents "closest" to it on a logarithmic scale, so that 10k and 1MM are the same distance (and get scored the same) from 100k views. Is that possible?
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html#exp-decay describes field value factor and decay functions but can they be "stacked"? Is there another approach?
I'm not sure if you can achieve this directly with decay, but you could easily do it with the script_score function. The example below uses dynamic scripting, but please be aware that using file-based scripts is the recommended, far more secure approach.
In the query below, the offset parameter is set to 100,000, and documents with that value for their 'views' field will score the highest. Score decays logarithmically as the value of views departs from offset. Per your example, documents with 1,000,000 and/or 10,000 have identical scores (0.30279312 in this formula).
You can invert the order of these results by changing the beginning of the script to multiply by _score instead of divide.
$ curl -XPOST localhost:9200/somestuff/_search -d '{
  "size": 100,
  "query": {
    "bool": {
      "must": [
        {
          "function_score": {
            "functions": [
              {
                "script_score": {
                  "params": {
                    "offset": 100000
                  },
                  "script": "_score / (1 + ((log(offset) - log(doc['views'].value)).abs()))"
                }
              }
            ]
          }
        }
      ]
    }
  }
}'
Note: you may want to account for the possibility of 'views' being null, depending on your data.
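Note that the Groovy-style script above won't run on recent Elasticsearch versions, where dynamic Groovy scripting has been removed in favour of Painless. A rough Painless equivalent, sketched here as a Python query dict (the views field and the simplified query structure are carried over from the example above; this is untested):
query_body = {
    "size": 100,
    "query": {
        "function_score": {
            "functions": [
                {
                    "script_score": {
                        "script": {
                            # Same idea: score decays with the log-distance from offset.
                            "source": (
                                "_score / (1 + Math.abs(Math.log(params.offset)"
                                " - Math.log(doc['views'].value)))"
                            ),
                            "params": {"offset": 100000},
                        }
                    }
                }
            ]
        }
    },
}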

Elasticsearch: choose TOP N documents and apply query

I'm sorry, I'm not good at English; please bear with me.
Let's assume I have data like this:
title   category   price
book1   study      10
book2   cook       20
book3   study      30
book4   study      40
book5   art        50
I can do "search books in the 'study' category and sort them by price in descending order". The result would be:
book4 - book3 - book1
However, I couldn't find a way to do
"search books in the 'study' category AMONG the books in the TOP 40% by price".
(I hope 'TOP 40% by price' is the correct expression.)
In this case, the result should be "book4" only, because the category search would be performed only on book5 and book4.
At first, I thought I could do it by:
1. sort all documents by price
2. select the TOP 40%
3. post another query for the category search among them
But I still have no idea how I can post a query against "part of the documents" rather than all documents. After step 2, I'd have a list of documents in the TOP 40%. But how can I make a query that applies just to them?
I also realized that I don't even know how to "search the TOP n%" in Elasticsearch. Is there a better way than "sort all and select the first n%"?
Any advice would be appreciated.
And this is my first question on Stack Overflow. If my question violates any rule here, please tell me so that I can learn and apologize.
If your data is normally distributed, or some other statistical distribution from which you can make sense of the data, you can probably do this in two queries.
You can take a look at the data in histogram form by doing:
{
  "query": {
    "match_all": {}
  },
  "facets": {
    "stats": {
      "histogram": {
        "field": "price",
        "interval": 100
      }
    }
  }
}
I usually pull this data into a spreadsheet to chart it and do other statistical analysis on it. The "interval" above will need to be some reasonable value; 100 might not be the right fit.
This is just to decide how to code the intermediate step. Provided the data is normally distributed, you can then get the statistical information about the collection using this query:
{
  "query": {
    "match_all": {}
  },
  "facets": {
    "stats": {
      "statistical": {
        "field": "price"
      }
    }
  }
}
The above gives you an output that looks like this:
count: 819517
total: 24249527030
min: 32
max: 53352
mean: 29590.023184387876
sum_of_squares: 875494716806082
variance: 192736269.99554798
std_deviation: 13882.94889407679
(the above is not based on your data sample, but just sample of available data I have to demonstrate statistical facet usage.)
So now that you know all of that, you can start applying your knowledge of statistics to the problem at hand. That is, find the Z score at the 60th percentile and find the location of the representative data point based on that.
Your final query would then look something like this:
{
  "query": {
    "range": {
      "talent_profile": {
        "gte": 40,
        "lte": 50
      }
    }
  }
}
The lte is going to be the "max" from the stats facet, and the gte is going to come from your intermediate analysis.
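For concreteness, here is a small Python sketch of that intermediate calculation using the sample statistics above (it assumes prices are roughly normally distributed; 0.2533 is simply the standard-normal z-value for the 60th percentile):
# Turn the statistical facet output into the bounds of the final range query.
from statistics import NormalDist

mean = 29590.023184387876          # "mean" from the statistical facet
std_deviation = 13882.94889407679  # "std_deviation" from the statistical facet
max_price = 53352                  # "max" from the statistical facet

# "TOP 40% in price" = everything above the 60th percentile.
z = NormalDist().inv_cdf(0.60)           # ~0.2533
threshold = mean + z * std_deviation     # ~33106

range_query = {"range": {"price": {"gte": threshold, "lte": max_price}}}
print(range_query)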
