How to query if a time is between two field values - elasticsearch

How do I search for documents where a given time falls between a start and end time? For example, I want to query the following document using only a time like "18:33" or "21:32": "18:33" should return the document below and "21:32" shouldn't. I don't care about the date part or the seconds.
{
  "my start time field": "2020-01-23T18:32:21.768Z",
  "my end time field": "2020-01-23T20:32:21.768Z"
}
I've reviewed Using the range query with date fields, but I'm not sure how to look only at times. Also, I want to see if a time is between two fields, not if a field is between two times.
Essentially, I want the Elasticsearch equivalent of BETWEEN for SQL Server. Like this answer, except I don't want to use the current time but a variable.
DECLARE @blah datetime2 = GETDATE()
SELECT *
FROM Table1 T
WHERE CAST(@blah AS TIME)
BETWEEN CAST(T.StartDate AS TIME) AND CAST(T.EndDate AS TIME)

As per the suggestion from the OP and the link they provided, in keeping with Stack Overflow convention I'm providing the second solution here:
Solution 2: Index separate fields for hour and minute as HH:mm
Note the format used, hour_minute. You can find the list of available formats under the aforementioned link.
Basically, you re-ingest the documents with separate fields holding the hour and minute values and execute range queries to get what you want; a backfill sketch follows.
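If the documents already exist, one way to backfill those fields is the _reindex API with a small script. A minimal sketch, assuming a source index named my_source_index and the ISO timestamp field names from the question (adjust both to your data):
POST _reindex
{
  "source": {
    "index": "my_source_index"
  },
  "dest": {
    "index": "my_time_index"
  },
  "script": {
    "source": """
      // Derive HH:mm strings from the assumed ISO-8601 source fields
      ctx._source.start_time = ctx._source['my start time field'].substring(11, 16);
      ctx._source.end_time = ctx._source['my end time field'].substring(11, 16);
    """
  }
}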
Mapping:
PUT my_time_index
{
  "mappings": {
    "properties": {
      "start_time": {
        "type": "date",
        "format": "hour_minute"
      },
      "end_time": {
        "type": "date",
        "format": "hour_minute"
      }
    }
  }
}
Sample Document:
POST my_time_index/_doc/1
{
  "start_time": "18:32",
  "end_time": "20:32"
}
Query Request:
POST my_time_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "start_time": {
              "lte": "18:33"
            }
          }
        },
        {
          "range": {
            "end_time": {
              "gte": "18:33"
            }
          }
        }
      ]
    }
  }
}
This matches documents whose interval contains the input time, i.e. start_time <= 18:33 <= end_time.
Let me know if this helps!

Based on this discussion, don't store times in a datetime datatype.
If you want to filter on a specific hour of the day, you need to extract it into its own field.
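If your source documents carry a full timestamp, one way to populate such a field at index time is an ingest pipeline with a script processor. A minimal sketch, where the pipeline name and the my_timestamp source field are assumptions:
PUT _ingest/pipeline/extract-hour
{
  "processors": [
    {
      "script": {
        "source": """
          // Parse the assumed ISO-8601 timestamp and store its hour-of-day (0-23)
          ZonedDateTime t = ZonedDateTime.parse(ctx.my_timestamp);
          ctx['my start hour'] = t.getHour();
        """
      }
    }
  ]
}
Indexing with POST between-research/_doc/1?pipeline=extract-hour then fills in the hour field automatically.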
Via the Kibana Dev Tools -> Console
Create some mock data:
POST between-research/_doc/1
{
  "my start hour": 0,
  "my end hour": 12
}
POST between-research/_doc/2
{
  "my start hour": 13,
  "my end hour": 23
}
Perform "between" search
POST between-research/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "my start hour": {
              "lte": 10
            }
          }
        },
        {
          "range": {
            "my end hour": {
              "gte": 10
            }
          }
        }
      ]
    }
  }
}

Solution 1: Existing Date Format
Without re-ingesting your hours and minutes as separate fields, I've come up with the solution below. You may not love the way ES gets you there, but it certainly works.
I've created a sample mapping, documents, a query, and a response based on the data you've provided.
Mapping:
PUT my_date_index
{
  "mappings": {
    "properties": {
      "start_time": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
      },
      "end_time": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
      }
    }
  }
}
Sample Documents:
POST my_date_index/_doc/1
{
  "start_time": "2020-01-23T18:32:21.768Z",
  "end_time": "2020-01-23T20:32:21.768Z"
}
POST my_date_index/_doc/2
{
  "start_time": "2020-01-23T19:32:21.768Z",
  "end_time": "2020-01-23T20:32:21.768Z"
}
POST my_date_index/_doc/3
{
  "start_time": "2020-01-23T21:32:21.768Z",
  "end_time": "2020-01-23T22:32:21.768Z"
}
Query Request:
POST my_date_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "script": {
            "script": {
              "source": """
                ZonedDateTime zstart_time = doc['start_time'].value;
                int zstart_hour = zstart_time.getHour();
                int zstart_minute = zstart_time.getMinute();
                int zstart_total_minutes = zstart_hour * 60 + zstart_minute;
                ZonedDateTime zend_time = doc['end_time'].value;
                int zend_hour = zend_time.getHour();
                int zend_minute = zend_time.getMinute();
                int zend_total_minutes = zend_hour * 60 + zend_minute;
                int my_input_total_minutes = params.my_input_hour * 60 + params.my_input_minute;
                if (zstart_total_minutes <= my_input_total_minutes && zend_total_minutes >= my_input_total_minutes) {
                  return true;
                }
                return false;
              """,
              "params": {
                "my_input_hour": 20,
                "my_input_minute": 10
              }
            }
          }
        }
      ]
    }
  }
}
Basically:
- calculate the number of minutes since midnight from start_time
- calculate the number of minutes since midnight from end_time
- calculate the number of minutes from params.my_input_hour and params.my_input_minute
- evaluate the condition start_time <= input <= end_time using the minutes of all three values and return the documents accordingly.
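For example, with the sample params (20:10, i.e. 20 * 60 + 10 = 1210 minutes) against document 1: the start 18:32 gives 18 * 60 + 32 = 1112 and the end 20:32 gives 20 * 60 + 32 = 1232, so 1112 <= 1210 <= 1232 and the document matches. Document 2 (start 19:32 = 1172) matches too, while document 3 (start 21:32 = 1292) does not.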
Response:
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my_date_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "start_time" : "2020-01-23T18:32:21.768Z",
          "end_time" : "2020-01-23T20:32:21.768Z"
        }
      },
      {
        "_index" : "my_date_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "start_time" : "2020-01-23T19:32:21.768Z",
          "end_time" : "2020-01-23T20:32:21.768Z"
        }
      }
    ]
  }
}
Do test Solution 1 thoroughly for performance issues, as script queries generally hurt performance; however, they come in handy when you have no other option.
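If you're on Elasticsearch 7.11 or later, a middle ground worth considering (my addition, not part of the original answers) is a runtime field: the script logic stays, but it is exposed to an ordinary range query. A sketch, with made-up runtime field names:
GET my_date_index/_search
{
  "runtime_mappings": {
    "start_minutes_of_day": {
      "type": "long",
      "script": {
        "source": "emit(doc['start_time'].value.getHour() * 60 + doc['start_time'].value.getMinute())"
      }
    },
    "end_minutes_of_day": {
      "type": "long",
      "script": {
        "source": "emit(doc['end_time'].value.getHour() * 60 + doc['end_time'].value.getMinute())"
      }
    }
  },
  "query": {
    "bool": {
      "must": [
        { "range": { "start_minutes_of_day": { "lte": 1210 } } },
        { "range": { "end_minutes_of_day": { "gte": 1210 } } }
      ]
    }
  }
}
Here 1210 is 20:10 expressed in minutes, matching the script params above.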
Let me know if this helps!

Related

Elasticsearch - Retrieving a list of documents whose last status is 51

I have an index in Elasticsearch with documents of this kind:
{
  "transactionId" : 5588,
  "clientId" : "1",
  "transactionType" : 1,
  "transactionStatus" : 51,
  "locationId" : 12,
  "images" : [
    {
      "imageId" : 5773,
      "imagePath" : "http://some/url/path",
      "imageType" : "dummyData",
      "ocrRead" : "XYZ999",
      "imageName" : "SOMENUMBERSANDCHARACTERS.jpg",
      "ocrConfidence" : "94.6",
      "ROITopLeftCoordinate" : "839x251",
      "ROIBottomRightCoordinate" : "999x323"
    }
  ],
  "creationTimestamp" : 1669645709130,
  "current" : true,
  "timestamp" : 1669646359686
}
It's an "add only" type of stack, where a record is never updated. For instance:
- a new record is added with "transactionStatus": 10
- when that transactionId changes status, a new record is added for the same transactionId with "transactionStatus": 51
and so on.
What I want to achieve is a list of 10 records whose last status is 51, but I can't write the correct query.
Here is what I've tried:
{ "size": 10,
"query": {
"match_all": {}
},
"collapse": {
"field": "transactionId",
"inner_hits": {
"name": "most_recent",
"size": 1,
"sort": [{"timestamp": "desc"}]
}
},
"post_filter": {
"term": {
"transactionStatus": "51"
}
}
}
If I change the "transactionStatus": 51 in the post_filter term to, let's say, 10, it gives me transactionId records whose last record is not 10.
I don't know if I explained this properly. I apologize for my English; it's not my native language.
GET test_status/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "transactionStatus": 51
          }
        }
      ]
    }
  },
  "sort": [
    {
      "timestamp": {
        "order": "desc"
      }
    }
  ]
}
This one will filter and then sort by timestamp. Let me know if there is something missing.
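If you also need at most one row per transactionId, as in your collapse attempt, you can keep the collapse clause alongside the filter; a sketch of my own, not part of the answer above:
GET test_status/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "transactionStatus": 51
          }
        }
      ]
    }
  },
  "collapse": {
    "field": "transactionId"
  },
  "sort": [
    {
      "timestamp": {
        "order": "desc"
      }
    }
  ]
}
Caveat: this returns the newest status-51 record per transaction; it does not by itself prove that record is also the transaction's overall newest entry, so you may still need to verify that client-side or via inner_hits.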

Elasticsearch Birth date aggregation

I need my filter to work like this:
18-24 | (16,635,890)
25-34 | (2,478,382)
35-44 | (1,129,493)
45-54 | (5,689,393)
55-64 | (4,585,933)
This is my ES mapping:
{
  "dynamic": "strict",
  "properties": {
    "birthdate": {
      "type": "date",
      "format": "m/d/yyyy"
    },
    "first_name": {
      "type": "keyword"
    },
    "last_name": {
      "type": "keyword"
    }
  }
}
I would like to know if it's possible to do this with this mapping. I'm not very experienced with ES; I believe doing this requires advanced ES knowledge.
Also, I tried the following as a test, but got no useful aggregation out of it :/
age: {
  terms: {
    field: 'birthdate'
  }
}
--------------------
"doc_count_error_upper_bound" => 0,
"sum_other_doc_count" => 0,
"buckets" => [
  {
    "key" => 1072915200000,
    "key_as_string" => "0/1/2004",
    "doc_count" => 1
  }
]
I tried reading the documentation and searching some forums, but without success. Thanks!
A good candidate for this is the range aggregation, but since your birthdate is stored as a date, you'd need to calculate the age up until now before proceeding to calculate the buckets. You can do so through a Painless script.
Putting it all together:
POST your-index/_search
{
  "size": 0,
  "aggs": {
    "price_ranges": {
      "range": {
        "script": {
          "source": "return (params.now_ms - doc['birthdate'].value.millis) / 1000 / 60 / 60 / 24 / 365",
          "params": {
            "now_ms": 1617958584396
          }
        },
        "ranges": [
          {
            "from": 18,
            "to": 24,
            "key": "18-24"
          },
          {
            "from": 25,
            "to": 34,
            "key": "25-34"
          }
        ]
      }
    }
  }
}
would return:
...
"aggregations" : {
  "price_ranges" : {
    "buckets" : [
      {
        "key" : "18-24",
        "from" : 18.0,
        "to" : 24.0,
        "doc_count" : 0
      },
      {
        "key" : "25-34",
        "from" : 25.0,
        "to" : 34.0,
        "doc_count" : 2
      },
      ...
    ]
  }
}
Note that the current timestamp wasn't obtained through a dynamic new Date() call but rather hardcoded as a parametrized now_ms variable. This is the recommended way of doing date math due to the distributed nature of Elasticsearch. For more info on this, check my answer to How to get current time as unix timestamp for script use.
Shameless plug: if you're relatively new to ES, you might find my recently released Elasticsearch Handbook useful. One of the chapters is dedicated solely to aggregations and one to Painless scripting!

ElasticSearch/Kibana: get values that are not found in entries more recent than a certain date

I have a fleet of devices that push entries of this form to Elasticsearch at regular intervals (say, every 10 minutes):
{
  "deviceId": "unique-device-id",
  "timestamp": 1586390031,
  "payload": { various data }
}
I usually look at this through Kibana by filtering for the last 7 days of data and then drilling down by device id or some other piece of data from the payload.
Now I'm trying to get a sense of the health of this fleet by finding devices that haven't reported anything in, let's say, the last hour. I've been messing around with all sorts of filters and visualisations, and the closest I got is a data table with device ids and the timestamp of the last entry for each, sorted by timestamp. This is useful but somewhat hard to work with, as I have a few thousand devices.
What I dream of is getting either the above-mentioned table containing only the device ids that have not reported in the last hour, or just two numbers: the total count of distinct device ids seen in the last 7 days and the count of device ids not seen in the last hour.
Can you point me in the right direction, if any one of these is even possible?
I'll skip the table and take the second approach -- only getting the counts. I think it's possible to walk your way backwards to the rows from the counts.
Note: I'll be using a human-readable time format instead of timestamps, but epoch_second will work just as well in your real use case. Also, I've added the comment field to give each doc some background.
First, set up your index:
PUT fleet
{
  "mappings": {
    "properties": {
      "timestamp": {
        "type": "date",
        "format": "epoch_second||yyyy-MM-dd HH:mm:ss"
      },
      "comment": {
        "type": "text"
      },
      "deviceId": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
Sync a few docs -- I'm in UTC+2 so I chose these timestamps:
POST fleet/_doc
{
  "deviceId": "asdjhfa343",
  "timestamp": "2020-04-05 10:00:00",
  "comment": "in the last week"
}
POST fleet/_doc
{
  "deviceId": "asdjhfa343",
  "timestamp": "2020-04-10 13:05:00",
  "comment": "#asdjhfa343 in the last hour"
}
POST fleet/_doc
{
  "deviceId": "asdjhfa343",
  "timestamp": "2020-04-10 12:05:00",
  "comment": "#asdjhfa343 in the 2 hours"
}
POST fleet/_doc
{
  "deviceId": "asdjhfa343sdas",
  "timestamp": "2020-04-07 09:00:00",
  "comment": "in the last week"
}
POST fleet/_doc
{
  "deviceId": "asdjhfa343sdas",
  "timestamp": "2020-04-10 12:35:00",
  "comment": "in last 2hrs"
}
In total, we've got 5 docs and 2 distinct device ids with the following conditions:
- all have appeared in the last 7d,
- both have appeared in the last 2h, and
- only one of them in the last hour,
so I'm interested in finding precisely 1 deviceId which has appeared in the last 2 hrs BUT NOT in the last 1 hr.
We'll use a combination of filter (for range filters), cardinality (for distinct counts), and bucket_script (for count differences) aggregations:
GET fleet/_search
{
  "size": 0,
  "aggs": {
    "distinct_devices_last7d": {
      "filter": {
        "range": {
          "timestamp": {
            "gte": "now-7d"
          }
        }
      },
      "aggs": {
        "uniq_device_count": {
          "cardinality": {
            "field": "deviceId.keyword"
          }
        }
      }
    },
    "not_seen_last1h": {
      "filter": {
        "range": {
          "timestamp": {
            "gte": "now-2h"
          }
        }
      },
      "aggs": {
        "device_ids_per_hour": {
          "date_histogram": {
            "field": "timestamp",
            "calendar_interval": "day",
            "format": "'disregard' -- yyyy-MM-dd"
          },
          "aggs": {
            "total_uniq_count": {
              "cardinality": {
                "field": "deviceId.keyword"
              }
            },
            "in_last_hour": {
              "filter": {
                "range": {
                  "timestamp": {
                    "gte": "now-1h"
                  }
                }
              },
              "aggs": {
                "uniq_count": {
                  "cardinality": {
                    "field": "deviceId.keyword"
                  }
                }
              }
            },
            "uniq_difference": {
              "bucket_script": {
                "buckets_path": {
                  "in_last_1h": "in_last_hour>uniq_count",
                  "in_last2h": "total_uniq_count"
                },
                "script": "params.in_last2h - params.in_last_1h"
              }
            }
          }
        }
      }
    }
  }
}
The date_histogram aggregation is just a placeholder that enables us to use a bucket script to get the final difference and not have to do any post-processing.
Since we passed size: 0, we're not interested in the hits section. So taking only the aggregations, here are the annotated results:
...
"aggregations" : {
  "not_seen_last1h" : {
    "doc_count" : 3,
    "device_ids_per_hour" : {
      "buckets" : [
        {
          "key_as_string" : "disregard -- 2020-04-10",
          "key" : 1586476800000,
          "doc_count" : 3,        <-- 3 device messages in the last 2hrs
          "total_uniq_count" : {
            "value" : 2           <-- 2 distinct IDs
          },
          "in_last_hour" : {
            "doc_count" : 1,
            "uniq_count" : {
              "value" : 1         <-- 1 distinct ID in the last hour
            }
          },
          "uniq_difference" : {
            "value" : 1.0         <-- 1 == final result!
          }
        }
      ]
    }
  },
  "distinct_devices_last7d" : {
    "meta" : { },
    "doc_count" : 5,              <-- 5 device messages in the last 7d
    "uniq_device_count" : {
      "value" : 2                 <-- 2 unique IDs
    }
  }
}

Elasticsearch: hour_minute_second mapping returns empty data

Below is the mapping I have created for the search field:
PUT /sample/_mapping
{
  "properties": {
    "webDateTime1": {
      "type": "date",
      "format": "dd-MM-yyyy HH:mm:ss||dd-MM-yyyy||hour_minute_second"
    }
  }
}
If I search based on "04-04-2019 20:17:18", I get proper data.
If I search based on "04-04-2019", I get proper data.
If I search based on "20:17:18", I always get an empty result, and I don't know why.
Any help would be appreciated.
When you ingest some sample docs:
POST sample/_doc/1
{"webDateTime1":"04-04-2019 20:17:18"}
POST sample/_doc/2
{"webDateTime1":"04-04-2019"}
POST sample/_doc/3
{"webDateTime1":"20:17:18"}
and then aggregate on the date field,
GET sample/_search
{
  "size": 0,
  "aggs": {
    "dt_values": {
      "terms": {
        "field": "webDateTime1"
      }
    }
  }
}
you'll see how the values are actually indexed:
...
"buckets" : [
  {
    "key" : 73038000,
    "key_as_string" : "01-01-1970 20:17:18",
    "doc_count" : 1
  },
  {
    "key" : 1554336000000,
    "key_as_string" : "04-04-2019 00:00:00",
    "doc_count" : 1
  },
  {
    "key" : 1554409038000,
    "key_as_string" : "04-04-2019 20:17:18",
    "doc_count" : 1
  }
]
...
That's the reason your query for 20:17:18 is causing you a headache.
Now, you'd typically want to use the range query, like so:
GET sample/_search
{
  "query": {
    "range": {
      "webDateTime1": {
        "gte": "20:17:18",
        "lte": "20:17:18",
        "format": "HH:mm:ss"
      }
    }
  }
}
Notice the format parameter. But again, if you don't provide a date in your datetime field, it's going to take the Unix epoch (1970-01-01) as the date.
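If time-only searches are routine, an alternative (my suggestion, not from the answer above) is to index the time portion into its own field with a time-only format, so queries don't depend on the implicit epoch date; the webTimeOnly name here is hypothetical:
PUT /sample/_mapping
{
  "properties": {
    "webTimeOnly": {
      "type": "date",
      "format": "hour_minute_second"
    }
  }
}
Each document would then carry both webDateTime1 and a webTimeOnly value such as "20:17:18", and range queries can target webTimeOnly directly.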

Average of differences calculated between two date fields

I'm working on a project that uses Elasticsearch to store data and show some complex statistics.
I have an index that looks like this:
Reservation {
  id: number
  check_in: Date
  check_out: Date
  created_at: Date
  // other fields...
}
I need to calculate the average difference in days between check_in and created_at for my Reservations in a specific date range, and show the result as a number.
I tried this query:
{
  "script_fields": {
    "avgDates": {
      "script": {
        "lang": "expression",
        "source": "doc['created_at'].value - doc['check_in'].value"
      }
    }
  },
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "created_at": {
              "gte": "{{lastMountTimestamp}}",
              "lte": "{{currentTimestamp}}"
            }
          }
        }
      ]
    }
  },
  "size": 0,
  "aggs": {
    "avgBetweenDates": {
      "avg": {
        "field": "avgDates"
      }
    }
  }
}
Date fields are saved in ISO 8601 form (e.g. 2020-03-11T14:25:15+00:00); I don't know if this could cause issues.
The query catches some hits, so it definitely works, but it always returns null as the value of the avgBetweenDates aggregation.
I need a result like this:
"aggregations": {
"avgBetweenDates": {
"value": 3.14159 // Π is just an example!
}
}
Any ideas will help!
Thank you.
Scripted fields are not stored fields in ES; they are created on the fly, so you can only aggregate on stored fields.
You can simply move the script logic into the avg aggregation, as shown below. Note that for the sake of understanding, I've created a sample mapping, documents, a query, and its response.
Mapping:
PUT my_date_index
{
  "mappings": {
    "properties": {
      "check_in": {
        "type": "date",
        "format": "date_time"
      },
      "check_out": {
        "type": "date",
        "format": "date_time"
      },
      "created_at": {
        "type": "date",
        "format": "date_time"
      }
    }
  }
}
Sample Documents:
POST my_date_index/_doc/1
{
  "check_in": "2019-01-15T00:00:00.000Z",
  "check_out": "2019-01-20T00:00:00.000Z",
  "created_at": "2019-01-17T00:00:00.000Z"
}
POST my_date_index/_doc/2
{
  "check_in": "2019-01-15T00:00:00.000Z",
  "check_out": "2019-01-22T00:00:00.000Z",
  "created_at": "2019-01-20T00:00:00.000Z"
}
Aggregation Query:
POST my_date_index/_search
{
  "size": 0,
  "aggs": {
    "my_dates_diff": {
      "avg": {
        "script": """
          ZonedDateTime d1 = doc['created_at'].value;
          ZonedDateTime d2 = doc['check_in'].value;
          long differenceInMillis = ChronoUnit.MILLIS.between(d1, d2);
          return Math.abs(differenceInMillis / 86400000);
        """
      }
    }
  }
}
Notice that you wanted the difference as a number of days; the integer division by 86,400,000 (milliseconds per day) in the logic above does exactly that.
Response:
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "my_dates_diff" : {
      "value" : 3.5    <---- Average in Number of Days
    }
  }
}
Hope this helps!
Scripted fields created within the _search context can only be consumed within that scope; they're not visible to the aggregations! This means you'll have to go with either:
- moving your script to the aggs section and doing the avg there,
- a scripted metric aggregation (quite slow and difficult to get right), or
- creating a dateDifference field at index time (preferably an int, i.e. a difference of the timestamps), which enables powerful numeric aggs like extended stats (see the pipeline sketch after the sample output below) that provide a statistically useful output like:
{
  ...
  "aggregations": {
    "grades_stats": {
      "count": 2,
      "min": 50.0,
      "max": 100.0,
      "avg": 75.0,
      "sum": 150.0,
      "sum_of_squares": 12500.0,
      "variance": 625.0,
      "std_deviation": 25.0,
      "std_deviation_bounds": {
        "upper": 125.0,
        "lower": 25.0
      }
    }
  }
}
and are always faster than computing the timestamp differences with a script.
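As a sketch of that third option (the pipeline and field names are my own assumptions, not from this thread), an ingest pipeline can compute the difference once at index time:
PUT _ingest/pipeline/reservation-date-diff
{
  "processors": [
    {
      "script": {
        "source": """
          // Compute the created_at / check_in gap in whole days (assumed ISO-8601 string fields)
          ZonedDateTime created = ZonedDateTime.parse(ctx.created_at);
          ZonedDateTime checkIn = ZonedDateTime.parse(ctx.check_in);
          ctx.dateDifference = (int) Math.abs(ChronoUnit.DAYS.between(created, checkIn));
        """
      }
    }
  ]
}
Index through it with POST my_date_index/_doc/1?pipeline=reservation-date-diff, and the stats above become a plain field aggregation:
POST my_date_index/_search
{
  "size": 0,
  "aggs": {
    "grades_stats": {
      "extended_stats": {
        "field": "dateDifference"
      }
    }
  }
}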
