ElasticSearch - Search between a range of dates to compare them

I am new to ElasticSearch (using version 7.6) and trying to find out how to search between two periods in time. One query I'm trying out is to query week 12 of 2019 and week 12 of 2020. The idea is to compare the results. While reading the documentation and searching for samples, I have come close to what I'm looking for.
The easy way would be to fire two queries with different dates, but I would like to limit the number of queries. The latest query I have written, based on reading the docs, uses aggregations, but I'm not sure this is the right way:
GET sample-data_*/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "#timestamp": {
              "gte": "2020-03-20 08:00:00",
              "lte": "2020-03-27 08:00:00"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "range": {
      "date_range": {
        "field": "date",
        "format": "8yyyy-MM-dd",
        "ranges": [
          {
            "from": "2019-03-20",
            "to": "2019-03-27",
            "key": "last_years_week"
          },
          {
            "from": "2020-03-20",
            "to": "2020-03-27",
            "key": "this_years_week"
          }
        ],
        "keyed": true
      }
    }
  }
}
The results come in followed by the aggregations, but they do not contain the data I am looking for. One of the returned results:
{
  "_index" : "sample-data_2020_03_26",
  "_type" : "_doc",
  "_id" : "JyhcfWFFz0s1vwizjgxh",
  "_score" : 1.0,
  "_source" : {
    "#timestamp" : "2020-03-26 00:00:00",
    "name" : "TEST0001",
    "count" : "150",
    "total" : 3000
  }
}
...
"aggregations" : {
"range" : {
"buckets" : {
"last_years_week" : {
"from" : 1.55304E12,
"from_as_string" : "2019-03-20",
"to" : 1.5536448E12,
"to_as_string" : "2019-03-27",
"doc_count" : 0
},
"this_years_week" : {
"from" : 1.5846624E12,
"from_as_string" : "2020-03-20",
"to" : 1.5852672E12,
"to_as_string" : "2020-03-27",
"doc_count" : 0
}
}
}
}
My question is: what could be an efficient way to query data between two dates in different years using Elasticsearch, so that the numbers can be compared?
I would be happy to read more about these (for me complex) Elasticsearch queries if you could point me in the right direction.
Thank you!

Not posting the full working Elasticsearch query, but as discussed in the question comments, summarizing the solution in the form of an answer with some useful links.
Range queries on date fields are very useful for quickly searching between date ranges; they also support various date math operations.
A date_range aggregation will be useful as well.
The main difference between this aggregation and the normal range aggregation is that the from and to values can be expressed as Date Math expressions, which is exactly what you need when aggregating over date ranges that are relative to each other.
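For example, a minimal sketch (index and field names are taken from the question and may need adjusting) that compares the past week against the same week one year ago using date math, with no top-level query:
# date math: now-1y-7d is the start of the same 7-day window one year ago
GET sample-data_*/_search
{
  "size": 0,
  "aggs": {
    "compare_weeks": {
      "date_range": {
        "field": "#timestamp",
        "ranges": [
          { "from": "now-1y-7d", "to": "now-1y", "key": "last_years_week" },
          { "from": "now-7d", "to": "now", "key": "this_years_week" }
        ],
        "keyed": true
      }
    }
  }
}
Note that in the query from the question, the top-level range filter restricts the hits to the 2020 window before the aggregation runs, so the last_years_week bucket can never receive documents; also double-check that the aggregation field (date) matches the field you actually index (#timestamp).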

Related

Elastic search: hour_minute_second mapping returns empty data

Below is the mapping I have created for the search field:
PUT /sample/_mapping
{
  "properties": {
    "webDateTime1": {
      "type": "date",
      "format": "dd-MM-yyyy HH:mm:ss||dd-MM-yyyy||hour_minute_second"
    }
  }
}
If I search based on "04-04-2019 20:17:18", I get the proper data.
If I search based on "04-04-2019", I get the proper data.
If I search based on "20:17:18", I always get an empty result, and I don't know why.
Any help would be appreciated.
When you ingest some sample docs:
POST sample/_doc/1
{"webDateTime1":"04-04-2019 20:17:18"}
POST sample/_doc/2
{"webDateTime1":"04-04-2019"}
POST sample/_doc/3
{"webDateTime1":"20:17:18"}
and then aggregate on the date field,
GET sample/_search
{
  "size": 0,
  "aggs": {
    "dt_values": {
      "terms": {
        "field": "webDateTime1"
      }
    }
  }
}
you'll see how the values are actually indexed:
...
"buckets" : [
{
"key" : 73038000,
"key_as_string" : "01-01-1970 20:17:18",
"doc_count" : 1
},
{
"key" : 1554336000000,
"key_as_string" : "04-04-2019 00:00:00",
"doc_count" : 1
},
{
"key" : 1554409038000,
"key_as_string" : "04-04-2019 20:17:18",
"doc_count" : 1
}
]
...
That's the reason your query for 20:17:18 is causing you a headache.
Now, you'd typically want to use the range query like so:
GET sample/_search
{
  "query": {
    "range": {
      "webDateTime1": {
        "gte": "20:17:18",
        "lte": "20:17:18",
        "format": "HH:mm:ss"
      }
    }
  }
}
Notice the format parameter. But again, if you don't provide a date in your datetime field, it turns out Elasticsearch is going to take the Unix epoch (01-01-1970) as the date.
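You can confirm this against the sample index above: spelling out the epoch date explicitly should match the time-only document (a sketch, reusing the mapping's first format):
# doc 3 ("20:17:18") was indexed with the Unix epoch date, 01-01-1970
GET sample/_search
{
  "query": {
    "range": {
      "webDateTime1": {
        "gte": "01-01-1970 20:17:18",
        "lte": "01-01-1970 20:17:18",
        "format": "dd-MM-yyyy HH:mm:ss"
      }
    }
  }
}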

No exact match for RANGE query for a specific time

Question
Why does the Elasticsearch range query not exactly match the time "2017-11-30T13:23:23.063657+11:00"? Kindly suggest whether there is a mistake in the query or whether this is expected.
Query
curl -XGET 'https://hostname/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "range" : {
      "time" : {
        "gte": "2017-11-30T13:23:23.063657+11:00",
        "lte": "2017-11-30T13:23:23.063657+11:00"
      }
    }
  }
}
'
The only document expected to match is below:
{
  "_index": "***",
  "_source": {
    "time": "2017-11-30T13:23:23.063657+11:00",
    "log_level": "INFO",
    "log_time": "2017-11-30 13:23:23,042"
  },
  "fields": {
    "time": [
      1512008603063
    ]
  }
}
Result
However, it matched multiple records whose times are merely close:
"hits" : {
"total" : 11,
"max_score" : 1.0,
"hits" : [ {
"_index" : "***",
"_score" : 1.0,
"_source" : {
"time" : "2017-11-30T13:23:23.063612+11:00",
"log_level" : "INFO",
"log_time" : "2017-11-30 13:23:23,016"
}
}, {
"_index" : "core-apis-non-prod.97d5f1ee-a570-11e6-b038-02dc30517283.2017.11.30",
"_score" : 1.0,
"_source" : {
"time" : "2017-11-30T13:23:23.063722+11:00",
"log_level" : "INFO",
"log_time" : "2017-11-30 13:23:23,046"
}
}
...
Elasticsearch uses Joda-Time for parsing dates, and your problem is that Joda-Time only stores date/time values down to the millisecond.
From the docs:
The library internally uses a millisecond instant which is identical to the JDK and similar to other common time representations. This makes interoperability easy, and Joda-Time comes with out-of-the-box JDK interoperability.
This means that the last 3 digits of the fractional seconds (the microseconds) are not taken into account when parsing the date.
2017-11-30T13:23:23.063612+11:00
2017-11-30T13:23:23.063657+11:00
2017-11-30T13:23:23.063722+11:00
Are all interpreted as:
2017-11-30T13:23:23.063+11:00
And the corresponding epoch time is 1512008603063 for all these values.
You can see this too by adding explain to the query like this:
{
  "query": {
    "range" : {
      "time" : {
        "gte": "2017-11-30T13:23:23.063657+11:00",
        "lte": "2017-11-30T13:23:23.063657+11:00"
      }
    }
  },
  "explain": true
}
That is basically the reason all those documents match your query.
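If you really need exact matching at full sub-millisecond precision, one workaround (a sketch, not from the original answer; it assumes you can change the mapping and reindex, and uses current Elasticsearch mapping syntax with illustrative index and field names) is to index the raw timestamp string alongside the date field and match it with a term query:
# hypothetical multi-field mapping: keeps the date field, adds an exact keyword copy
PUT my-index/_mapping
{
  "properties": {
    "time": {
      "type": "date",
      "fields": {
        "raw": { "type": "keyword" }
      }
    }
  }
}

# exact string match, no millisecond truncation
GET my-index/_search
{
  "query": {
    "term": { "time.raw": "2017-11-30T13:23:23.063657+11:00" }
  }
}
The term query compares the stored string byte-for-byte, so the truncation to milliseconds never comes into play.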

Elasticsearch: amount of new documents, per type, in the last 24 hours (or time period)

I tried using the following query:
curl -XGET 'localhost:9200/<index>/<type>/_search?pretty=true' -d '
{
  "size": 0,
  "query" : { "range" : { "_timestamp" : { "from" : "1420070400", "to" : "1451606400" } } },
  "aggs": {
    "langs": {
      "terms": {
        "field": "<field>"
      }
    }
  }
}
'
The from/to detailed here is January 1st 2015 till January 1st 2016. The result of this query is identical to not having the "query" part at all.
What I want to achieve is that the document count covers only the given time range, not all existing documents.
The mapping of the type I'm working with is defined with this:
"_timestamp" : {
"enabled" : true,
"store" : true,
"format" : "date_time"
}
Am I doing it wrong or am I working on a mistaken assumption?
EDIT: To clarify, I'm looking for a way to see how many documents ES has created in the last 24 hours, per index, per type. But not only that, I want to do an aggregation on this.
So, let's say our type is "art" and the field I'm aggregating over is "type_of_art".
While in total there could be millions of documents, in the last 24 hours there would only be 7 statues, 5 paintings and 3 operas that got added, for instance.
And if I wanted to know how much was created between October 1, 2014 and November 15, 2014, I imagine that exact same query would produce the result I need.
The values for dates are held in milliseconds, so the correct query is:
{
  "size": 0,
  "query" : { "range" : { "_timestamp" : { "from" : "1420070400000", "to" : "1451606400000" } } },
  "aggs": {
    "langs": {
      "terms": {
        "field": "<field>"
      }
    }
  }
}
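For the "last 24 hours" case from the edit, date math saves you from computing epoch milliseconds by hand. A sketch, where type_of_art is the example field from the question:
{
  "size": 0,
  "query" : { "range" : { "_timestamp" : { "gte" : "now-24h" } } },
  "aggs": {
    "per_type_of_art": {
      "terms": { "field": "type_of_art" }
    }
  }
}
Run this against each index/type of interest to get the per-type breakdown of recently added documents.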

Elasticsearch, aggregation, how to count accurately in the estimated final list

Elasticsearch (ES) terms aggregation results are approximate, both in which finalists make the list and in their counts. https://www.elastic.co/guide/en/elasticsearch/reference/1.6/search-aggregations-bucket-terms-aggregation.html
I'd like to have accurate counts for the estimated finalists, even though the finalists themselves are not guaranteed to be accurate. I want to eliminate the per-bucket document count error.
I am thinking of issuing a second query filtered by the finalists; since I know the number of finalists, I can count them accurately if I set size=#finalists.
Using the example from the link above: after I have the top 5 products a, z, c, g, b from the first aggregation result, I want to find their accurate counts:
{
  ...
  "aggregations" : {
    "products" : {
      "doc_count_error_upper_bound" : 46,
      "buckets" : [
        {
          "key" : "Product A",
          "doc_count" : 100,
          "doc_count_error_upper_bound" : 0
        },
        {
          "key" : "Product Z",
          "doc_count" : 52,
          "doc_count_error_upper_bound" : 2
        },
        ...
      ]
    }
  }
}
Now that the doc_counts are estimates, I can issue a second query filtered by the product ids:
{
  ...
  "query": {
    "filtered": {
      "filter": {
        "terms": { "product": ["Product A", "Product Z", "Product C", "Product G", "Product B"] }
      }
    }
  },
  "aggs": {
    "products": {
      "terms": {
        "field": "product",
        "size": 5,
        "shard_size": 5
      }
    }
  }
}
My questions are:
Does this give me the correct counts for a, z, c, g, b?
Is there a better way to do this inside one query, maybe a nested aggregation?
Parsing the aggregation results to prepare the filters is done in Java code, and it is error-prone. Is there an example of this task, or can it be done by ES?
Thanks in advance.

Multi-query date histogram in Elasticsearch

I'm using the Elasticsearch date_histogram aggregation for binning/bucketing my data. This works fine when plotting the results of a single query:
{
  "query": {...},
  "aggs" : {
    "timeline" : {
      "date_histogram" : {
        "field" : "date",
        "interval" : "month"
      }
    }
  }
}
However, I now want to use ES for binning/bucketing the results of multiple queries. At the end, I need a line chart with each query representing a single line on the chart.
So, is it possible to use a single bucketing for multiple queries?
OK, I ended up defining a custom range for the date field and executing multiple queries with the same custom ranges. Probably not the most efficient way, but it works fine.
{
  "query": {...},
  "aggs" : {
    "ranges" : {
      "date_range" : {
        "field": "date",
        "format": "yyyyMMdd",
        "ranges": [...]
      }
    }
  }
}
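A single-request alternative worth trying (a sketch, not part of the original answer; the two match clauses stand in for your real queries) is a filters aggregation with a shared date_histogram sub-aggregation, which yields one histogram per named filter, i.e. one line per chart:
{
  "size": 0,
  "aggs": {
    "queries": {
      "filters": {
        "filters": {
          "query_a": { "match": { "title": "foo" } },
          "query_b": { "match": { "title": "bar" } }
        }
      },
      "aggs": {
        "timeline": {
          "date_histogram": {
            "field": "date",
            "interval": "month"
          }
        }
      }
    }
  }
}
All queries then share exactly the same monthly buckets, so the resulting lines are directly comparable.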
