ElasticSearch 2.4 date range histogram using the difference between two date fields - elasticsearch

I haven't been able to find anything regarding this for ES 2.* in the docs or elsewhere, so apologies if this is a duplicate.
What I am trying to do is create an aggregation in an Elasticsearch query that buckets records based on the difference between two date fields.
E.g. if I had data in ES for a shop, I might like to see the time difference between a purchase_date field and a shipped_date field.
So in that instance I'd want an aggregation with buckets giving me the hits for when shipped_date - purchase_date is < 1 day, 1-2 days, 3-4 days or 5+ days.
Ideally I was hoping this was possible in an ES query. Is that the case or would the best approach be to process the results into my own array based on the time difference for each hit?

I was able to achieve this by using the built-in expression scripting language, which is enabled by default in ES 2.4. The functionality I wanted was to group my results to show the difference between EndDate and DateProcessed in increments of 15 days. The relevant part of the query is:
{
  ...,
  "aggs": {
    "reason": {
      "date_histogram": {
        "min_doc_count": 1,
        "interval": "1296000000ms", // 15 days
        "format": "epoch_millis",
        "script": {
          "lang": "expression",
          "inline": "doc['DateProcessed'] > doc['EndDate'] ? doc['DateProcessed'] - doc['EndDate'] : -1"
        }
      },
      ...
    }
  }
}
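For sanity-checking the results on the client, the same bucketing can be reproduced in Python. This is only a sketch, assuming the DateProcessed/EndDate fields from the query above: the date_histogram floors each scripted difference onto 15-day interval boundaries, and documents processed before their EndDate land in the -1 sentinel bucket.

```python
from datetime import datetime

FIFTEEN_DAYS_MS = 15 * 24 * 60 * 60 * 1000  # 1296000000 ms, the interval above

def bucket_key(processed: datetime, end: datetime) -> int:
    """Return the 15-day bucket start (in ms) for the processing delay,
    or -1 when the document was not processed strictly after EndDate,
    mirroring the ternary in the expression script."""
    diff_ms = int((processed - end).total_seconds() * 1000)
    if diff_ms <= 0:
        return -1
    # date_histogram floors each value onto an interval boundary
    return (diff_ms // FIFTEEN_DAYS_MS) * FIFTEEN_DAYS_MS

# a 20-day delay falls into the bucket starting at 15 days
print(bucket_key(datetime(2016, 9, 21), datetime(2016, 9, 1)))  # 1296000000
```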

Related

elasticsearch get date range of most recent ingestion

I have an elasticsearch index that gets new data in large dumps, so from looking at the graph it's very obvious when new data is added.
If I only want to get data from the most recent ingestion (in this case data from 2020-08-06), what's the best way of doing this?
I can use this query to get the most recent document:
GET /indexname/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": queryString
          }
        }
      ]
    }
  },
  "sort": {
    "@timestamp": "desc"
  },
  "size": 1
}
This returns the most recent document, in this case one with a timestamp of 2020-08-06. I can set that as my endDate and set my startDate to that date minus one day, but I'm worried about cases where the data was ingested overnight and spanned two days.
I could keep making requests going back five hours at a time until I find the most recent large gap, but making requests in a loop could be time-consuming. Is there a smarter way of getting the date range of my most recent ingestion?
When your data comes in batches, it's best to attribute an identifier to each batch. That way, no date math is required.
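As a sketch of that approach, assuming a hypothetical batch_id field stamped on every document of an ingestion run, two request bodies suffice: one to read the batch_id off the newest document, and one to pull the whole batch regardless of how many days it spanned.

```python
import json

def latest_batch_query(query_string: str) -> dict:
    """Body to find the batch_id of the most recent matching document.
    batch_id is a hypothetical field name -- adapt to your mapping."""
    return {
        "query": {"query_string": {"query": query_string}},
        "sort": {"@timestamp": "desc"},
        "size": 1,
        "_source": ["batch_id"],
    }

def batch_docs_query(batch_id: str) -> dict:
    """Body to fetch every document from that ingestion run."""
    return {"query": {"term": {"batch_id": batch_id}}}

print(json.dumps(batch_docs_query("2020-08-06-run-1")))
```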

Elasticsearch 2.4 post_filter date math

When using a post_filter with date math on an Elasticsearch 2.4 query such as the following:
"post_filter": {
  "bool": {
    "must": [
      {
        "range": {
          "facets.due_date": {
            "gte": "now+2d/d",
            "lte": "now+3d/d-1s"
          }
        }
      }
    ]
  }
}
The results include documents with dates outside the range by one day. The exact same values are used in the aggregations, which report the correct counts for the buckets (2 documents for Saturday in this case); however, when I apply the above post filter, 3 documents are returned (the extra document being for Sunday at 9am). The dates are arbitrary; I can change them to a few days out and the same thing happens. I'm also on UTC time and have allowed for this in my testing by adding/removing a few hours in the values to rule out timezone errors.
If I use a set of concrete dates it works as expected, so my question is: does post_filter have a problem/bug with date math, or is there a way to use explain to show me the dates the post_filter is sending to the ES server?
Thanks in advance; I've been banging my head against a brick wall for 3 days on this!
It turns out that, for some strange reason, using lte on a post filter captures surrounding documents, whereas if I use lt it works as expected. I don't have a clue why; I can only assume some rounding takes place when the post_filter is applied that isn't applied when the aggregations are calculated.
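For debugging, the date math can be resolved by hand. A rough Python sketch of how "now+2d/d" and "now+3d/d-1s" evaluate, ignoring time zones: operations apply left to right, and /d rounds down to the start of the day.

```python
from datetime import datetime, timedelta

def resolve(now: datetime):
    """Resolve the two date-math bounds used in the post_filter above."""
    # now+2d/d : add two days, then floor to the start of that day
    gte = (now + timedelta(days=2)).replace(
        hour=0, minute=0, second=0, microsecond=0)
    # now+3d/d-1s : add three days, floor to start of day, back off 1s
    lte = (now + timedelta(days=3)).replace(
        hour=0, minute=0, second=0, microsecond=0) - timedelta(seconds=1)
    return gte, lte

gte, lte = resolve(datetime(2016, 10, 6, 9, 30))
print(gte, lte)  # 2016-10-08 00:00:00 2016-10-08 23:59:59
```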

Get records for particular day of the week in ElasticSearch

I have an ES cluster that has some summarized numerical data such that there is exactly 1 record per day. I want to write a query that will return the documents for a specific day of the week. For example, all records for Tuesdays. Currently I am doing this by getting all records for the required date range and then filtering out the ones for the day that I need. Is there a way to do that with a query?
You can do it using a script like this:
POST my_index/_search
{
  "query": {
    "script": {
      "script": {
        "source": "doc.my_date.value.dayOfWeek == 2"
      }
    }
  }
}
If you're going to run this query often, you'd probably be better off creating another field dayOfWeek in your document that contains the day of the week, which you can then query with a simple term query. That would be more efficient than a script.
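A minimal sketch of that index-time enrichment, assuming a my_date field in ISO format (dayOfWeek is a hypothetical field name; Python's isoweekday() happens to match the Monday=1 .. Sunday=7 convention used by the script above):

```python
from datetime import date

def enrich(doc: dict) -> dict:
    """Add a dayOfWeek field before indexing so searches become a cheap
    term query {"term": {"dayOfWeek": 2}} instead of a script query."""
    d = date.fromisoformat(doc["my_date"])
    return {**doc, "dayOfWeek": d.isoweekday()}

print(enrich({"my_date": "2020-08-04"}))  # a Tuesday -> dayOfWeek 2
```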

Complex ElasticSearch Query

I have documents with (id, value, modified_date). I need to get all the documents for ids whose latest record (by modified_date) has a specific value.
My understanding is that I first need to find such ids and then put them inside a bigger query. To find such ids, looks like, I would use "top_hits" with some post-filtering of the results.
The goal is to do as much work as possible on the server side to speed things up. This would be trivial in SQL, but with Elasticsearch I am at a loss. I would then need to write it in Python using elasticsearch_dsl. Can anyone help?
UPDATE: In case it's not clear, "all the documents for ids which have a specific value as of the last modified_date" means: 1. group by id, 2. in each group select the record with the largest modified_date, 3. keep only those records that have the specific value, 4. from those records keep only ids, 5. get all documents where ids are in the list coming from 4.
Specifically, 1 is an aggregation, 2 is another aggregation using "top_hits" and reverse sorting by date, 3 is an analog of SQL's HAVING clause - Bucket Selector Aggregation (?), 4 _source, 5 terms-lookup.
My biggest challenge so far has been figuring out that Bucket Selector Aggregation is what I need and putting things together.
This shows an example on how to get the latest elements in each group:
How to get latest values for each group with an Elasticsearch query?
This will return the average price bucketed in daily intervals:
GET /logstash-*/_search?size=0
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "1d",
        "time_zone": "Europe/Berlin",
        "min_doc_count": 1
      },
      "aggs": {
        "1": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}
I wrote it so it matches all records, which obviously returns more data than you need. Depending on the amount of data, it might be easier to finish the task on the client side.
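If you do finish on the client side, steps 1-4 of the update reduce to a small amount of Python. A sketch with the field names from the question (adapt to your mapping); the resulting id list would feed step 5's terms lookup:

```python
def latest_with_value(hits, value):
    """Group hits by id, keep the record with the greatest modified_date
    in each group, then keep the ids whose latest record has `value`."""
    latest = {}
    for hit in hits:
        prev = latest.get(hit["id"])
        if prev is None or hit["modified_date"] > prev["modified_date"]:
            latest[hit["id"]] = hit
    return sorted(h["id"] for h in latest.values() if h["value"] == value)

hits = [
    {"id": 1, "value": "open",   "modified_date": "2021-01-01"},
    {"id": 1, "value": "closed", "modified_date": "2021-02-01"},
    {"id": 2, "value": "open",   "modified_date": "2021-03-01"},
]
print(latest_with_value(hits, "open"))  # [2]
```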

To Select documents having same startDate and endDate

I have some documents where each document has a startDate and an endDate date field. I need all documents where these two values are the same. I couldn't find any query that would help me do it.
Elasticsearch supports script filters, which you can use in this case. More Info
Something like this is what you will need -
POST /<yourIndex>/<yourType>/_search
{
  "query": {
    "filtered": {
      "filter": {
        "script": {
          "script": "doc['startDate'].value == doc['endDate'].value"
        }
      }
    }
  }
}
This can be achieved in two ways:
Index solution - while indexing, add an additional field called isDateSame and set it to true or false based on the values of startDate and endDate. Then you can easily query on that field. This is the best-optimized solution.
Script solution - Elasticsearch keeps all indexed data in field data, which is more like an un-inverted (doc-to-values) view of the index. Using a script you can access any indexed field and do the comparison. This is pretty fast, but not as good as the first option. You can use the script query shown above for this.
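A sketch of the index-solution enrichment (isDateSame as named above), applied to each document before indexing so that searches become a plain term query:

```python
def add_is_date_same(doc: dict) -> dict:
    """Precompute isDateSame at index time; searching is then
    {"term": {"isDateSame": true}} instead of a per-document script."""
    return {**doc, "isDateSame": doc["startDate"] == doc["endDate"]}

doc = add_is_date_same({"startDate": "2021-06-01", "endDate": "2021-06-01"})
print(doc["isDateSame"])  # True
```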
