Elasticsearch: Aggregate documents based on date range - elasticsearch

I have a set of documents in ElasticSearch 5.5 with two date fields: start_date and end_date.
I want to aggregate them into date histogram buckets (ex: weekly) such that if the start_date < week X < end_date, then document would be in "week X" bucket.
This means that a single document might be in multiple buckets.
Consider the following concrete example: I have a set of documents describing company employees, and for each employee you have hire date and (optionally) termination date. I want to build date histogram of number of active employees for trailing twelve months.
Sample doc content:
{
"start_date": "2013-01-12T00:00:00.000Z",
"end_date": "2016-12-08T00:00:00.000Z",
"id": "123123123"
}
Is there a way to do this in ES?

I have found one way to do this, using filter aggregations (
https://www.elastic.co/guide/en/elasticsearch/reference/master/search-aggregations-bucket-filter-aggregation.html). If I need, say, 12 trailing months report, then I would create 12 buckets, where each bucket defines filter conditions, such as:
"bool":{
"must":[{
"range":{
"start_date":{
"lte":"2016-01-01T00:00:00.000Z"
}
}
},{
{
"range":{
"end_date":{
"gt":"2016-02-01T00:00:00.000Z"
}
}
}]
}
However, I feel that it would be nice if there was an easier way to do this, since if I want say trailing 365 days, that means I have to create 365 bucket filters, which makes resultant query very large.

I know this question is quite old but as it's still open I am sharing my knowledge on this. Also this question does not clearly explains that what kind of output is expected but still I think this can be achieved using the "Date Histogram Aggregation" and "Bucket Script Aggregation".
Here are the documentation links for both of these aggregations.
https://www.elastic.co/guide/en/elasticsearch/reference/5.5/search-aggregations-bucket-datehistogram-aggregation.html
https://www.elastic.co/guide/en/elasticsearch/reference/5.5/search-aggregations-pipeline-bucket-script-aggregation.html

Related

Elasticsearch - scripted sorting by date(s) field ascending with specific input date

Lets assume i have this kind of document (events):
[
{
"title": "Foo",
"dates": [
"2019-07-01",
"2019-07-15",
"2019-08-01"
]
},
{
"title": "Bar",
"dates": [
"2019-07-18"
]
}
]
And i want to perform a search that finds all events happening in a given datespan, so i search for all events that take place in between the 10th of july (2019-07-10) and the 10th auf august (2019-08-10).
The query wont be the problem but if i want to display the events in an ascending order (so the "Foo" event is ranked before because the 15th of july comes before the 18th) - how do i sort that correctly?
I thought about using a script to sort my documents, maybe by filtering the dates so only valid dates remain and then use the timestamp from the first date value to do a simple numeric ordering. But how would i filter the dates by the script?
Could i use a "script field" and pass my fromDate and toDate to copy the filtered dates onto a new field (lets say validDates) and use THAT field for sorting?
BTW: the index will only contain a few thousand events.
UPDATE:
After some research it looks like this can not be done without using nested objects instead of arrays. Please correct me if i am wrong, i would have preferred to use a simple array of dates over the nested type...

Get records for particular day of the week in ElasticSearch

I have an ES cluster that has some summarized numerical data such that there is exactly 1 record per day. I want to write a query that will return the documents for a specific day of the week. For example, all records for Tuesdays. Currently I am doing this by getting all records for the required date range and then filtering out the ones for the day that I need. Is there a way to do that with a query?
You can do it using a script like this:
POST my_index/_search
{
"query": {
"script": {
"script": {
"source": "doc.my_date.value.dayOfWeek == 2"
}
}
}
}
If you're going to run this query often, you would be probably better off creating another field dayOfWeek in your document that contains the day of the week that you can then easily query using a term query. It would be more efficient than a script.

ElasticSearch 2.4 date range histogram using the difference between two date fields

I haven't been able to find anything regarding this for ES 2.* in regards to the problem here or in the docs, so sorry if this is a duplicate.
What I am trying to do is create an aggregation in an ElasticSearch query that will allow me to create buckets based on the difference in a record between 2 date fields.
I.e. If I had data in ES for a shop, I might like to see the time difference between a purchase_date field and shipped_date field.
So in that instance I'd want to create an aggregate that had buckets to give me the hits for when shipped_date - purchase_date is < 1 day, 1-2 days, 3-4 days or 5+ days.
Ideally I was hoping this was possible in an ES query. Is that the case or would the best approach be to process the results into my own array based on the time difference for each hit?
I was able to achieve this by using the built in expression language which is enabled by default in ES 2.4. The functionality I wanted was to group my results to show the difference between EndDate and Date Processed in increments of 15 days. Relevant part of the query is:
{
...,
"aggs": {
"reason": {
"date_histogram": {
"min_doc_count": 1,
"interval": "1296000000ms", // 15 days
"format": "epoch_millis",
"script": {
"lang": "expression",
"inline": "doc['DateProcessed'] > doc['EndDate'] ? doc['DateProcessed'] - doc['EndDate'] : -1"
}
}
...
}
}

Conditional Sorting in ElasticSearch

I have some documents that I would like to sort on a date field. For documents with date equal to a specified date, example today, and all dates after that I would like to sort ascending. For dates before the specified date I would like to sort in descending order.
Is this possible in ElasticSearch? If so could you suggest any literature or an approach.
date is of type "date" and format "dateOptionalTime".
Thanks
Yes this is possible in ElasticSearch using a script, either for sorting or for scoring.
My preference would be for a scoring script because 'script based score' is going to be quicker (according to the documentation).
Using a scoring script, you could use the Unix timestamp for the date field of type int/long and an mvel sorting script in the custom_score query. You might need to re-index your documents. You would also need to be able to convert the searched for time into a Unix timestamp to pump it at ElasticSearch.
The sorting script would then deduct the requested timestamp from each document's timestamp and make an absolute value. Then the results are sorted in ascending order - the lowest 'distance' is the best.
So when looking for documents dated about a year ago, it would look something like:
"query": {
"custom_score" : {
"query" : {
....
},
"params" : {
"req_date_stamp" : 1348438345,
},
"script" : "abs(doc['timestamp'].value - req_date_timestamp)"
}
},
"sort": {
"_score": {
'order': 'asc'
}
}
(Apologies for any mistakes in my JSON - I tested this idea in pyes)
You might need to tweak this to get the rounding right - for example your question mentions matching days, so you might want to round the timestamp generator to the nearest day.
For "full" info you can check out the Custom Score Query docs and follow the link to MVEL scripting.
For this kind of specific use cases, you should use a sorting script.
See the "script based sorting" section in the Sort documentation page.
My English is poor.
My soluation is boost.
My data is {"terms_id": [20211011,20211012,20211013,20211014],"sort_value":1} {"terms_id": [20211012,20211013,20211014],"sort_value":2} {"terms_id": [20211013,20211014,20211015],"sort_value":1}
My query is {"bool":{"must":[],"should":[{"bool":{"must":[{"terms":{"terms_id":[20211012],"boost":5}}],"must_not":[]}},{"bool":{"must_not":[{"terms":{"terms_id":[20211012]}}]}}],"minimum_should_match":1}}
My sort is {"_score":{"order":"desc"},"sort_value":{"order":"desc"}}
Result is{"terms_id": [20211012,20211013,20211014],"sort_value":2} {"terms_id": [20211011,20211012,20211013,20211014],"sort_value":1} {"terms_id": [20211013,20211014,20211015],"sort_value":1}

Histogram on the basis of facet counts

I am currently working on a project in which I am storing user activity logs in elasticsearch. the user field in the log is like {"user":"abc#yahoo.com"}. I have a timestamp field for each activity, that describes when this activity was recorded. Can i generate date histogram on the basis of number of users in a particular time period. eg the histogram entry must show the number of users on that time. I can have this implemented by obtaining facet counts, but i need to get counts on various intervals and various ranges with minimum queries. Please guide me in this regard. Thanks.
Add a facet to your query something like the following:
{"facets": {
"daily_volume": {
"date_histogram": {
"size": 100,
"field": "created_at",
"interval": "day"
"order": "time"
}
}
}
This returns a nice set of ordered data for the number of items per day.
I then feed this to a Google Chart (the ColumnChart works nicely for histograms), doing a conversion on the returned timestamp integer to convert it to a Date type understood correctly by the Javascript charts API.

Resources