ElasticSearch Date Field Mapping Malformation

In my ElasticHQ mapping:
#timestamp date yyyy-MM-dd HH:mm:ssZZZ
...
date date yyyy-MM-dd HH:mm:ssZZZ
In the above I have two date fields, each mapped to the same format.
In the data:
"#timestamp": "2014-05-21 23:22:47UTC"
....
"date": "2014-05-22 05:08:09-0400",
As shown above, the dates in the data do not match the format ES thinks my dates are in. I assume something hinky happened at index time (I wasn't around).
Also interesting: When using a filtered range query like the following, I get a Parsing Exception explaining that my date is too short:
GET _search
{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "range": {
          "date": {
            "from": "2013-11-23 07:00:29",
            "to": "2015-11-23 07:00:29",
            "time_zone": "+04:00"
          }
        }
      }
    }
  }
}
Searching with the following, however, passes ES's error check but returns no results, I assume because of the date formatting in the documents.
GET _search
{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "range": {
          "date": {
            "from": "2013-11-23 07:00:29UTC",
            "to": "2015-11-23 07:00:29UTC",
            "time_zone": "+04:00"
          }
        }
      }
    }
  }
}
My question is this: given the above, is there any way we can avoid having to re-index and change the mapping, and continue to search the malformed data? We have around 1TB of data in this particular cluster and would like to keep it as is, for obvious reasons.
I also attempted a query that adheres to what is in the data:
"query": {
"range": {
"date": {
"gte": "2014-05-22 05:08:09-0400",
"to": "2015-05-22 05:08:09-0400"
}
}
}

The dates you have in your documents actually do conform to the date format you have in your mapping, i.e. yyyy-MM-dd HH:mm:ssZZZ.
In date format patterns, ZZZ stands for an RFC 822 time zone (e.g. -04:00, +04:00, EST, UTC, GMT, ...), so the dates you have in your data do comply; otherwise they wouldn't have been indexed in the first place.
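For reference, a minimal sketch of what that mapping would look like in JSON (your_index and your_type are placeholders; the format string is the one from your ElasticHQ output):
curl -XPUT 'server:9200/your_index/_mapping/your_type' -d '{
  "your_type": {
    "properties": {
      "#timestamp": { "type": "date", "format": "yyyy-MM-dd HH:mm:ssZZZ" },
      "date":       { "type": "date", "format": "yyyy-MM-dd HH:mm:ssZZZ" }
    }
  }
}'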
However, the best practice is to always make sure dates are transformed to UTC (or any other time zone common to the whole document base that makes sense in your context) before indexing them so that you have a common basis to query on.
As for your query that triggers errors, 2013-11-23 07:00:29 doesn't comply with the date format since the time zone is missing at the end. As you've rightly discovered, adding UTC at the end fixes the query parsing problem (i.e. the missing ZZZ part), but you might still get no results.
Now to answer your question, you have two main tasks to do:
Fix your indexing process/component to make sure all the dates are in a common timezone (usually UTC)
Fix your existing data to transform the dates in your indexed documents into the same timezone
1TB is a lot of data to reindex just to fix one or two fields. I don't know what your documents look like, but it doesn't really matter. The way I would approach the problem is to run a partial update on all documents, and for this I see two different solutions, in both of which the idea is to fix only the #timestamp and date fields:
Depending on your version of ES, you can use the update-by-query plugin, but transforming a date via script is a bit cumbersome.
Or you can write an ad hoc client that scrolls over all your existing documents, partially updates each of them, and sends them back in bulk.
Given the amount of data you have, solution 2 seems more appropriate.
So... your ad hoc client should first issue a scan/scroll query to obtain a scroll id, like this:
curl -XGET 'server:9200/your_index/_search?search_type=scan&scroll=1m' -d '{
  "query": { "match_all": {} },
  "size": 1000
}'
As a result, you'll get a scroll id that you can now use to iterate over all your data with
curl -XGET 'server:9200/_search/scroll?_source=date,%23timestamp&scroll=1m' -d 'your_scroll_id'
(note that the # in #timestamp must be URL-encoded as %23, otherwise curl treats it as a fragment)
You'll get 1000 hits back (you can decrease or increase the size parameter in the first query above as needed) that you can now iterate over.
For each hit you get, you'll only have the two date fields that you need to fix. You can then transform the dates into the standard timezone of your choosing, for instance with the date/time library of your language of choice.
Finally, you can send your 1000 updated partial documents in one bulk like this:
curl -XPOST server:9200/_bulk -d '
{ "update" : {"_id" : "1", "_type" : "your_type", "_index" : "your_index"} }
{ "doc" : {"date" : "2013-11-23 07:00:29Z", "#timestamp": "2013-11-23 07:00:29Z"} }
{ "update" : {"_id" : "2", "_type" : "your_type", "_index" : "your_index"} }
{ "doc" : {"date" : "2014-09-12 06:00:29Z", "#timestamp": "2014-09-12 06:00:29Z"} }
...
'
Rinse and repeat with the next iteration...
I hope this should give you some initial pointers to get started. Let us know if you have any questions.

Related

Elastic relative date math - finding all things today

I'm trying to do a fairly simple query with Elasticsearch, but I don't think I understand what I'm doing wrong, so I'm posting here for some pointers.
I have an elastic index where each document has a date like so:
{
  // edited for brevity
  "releasedate": "2020-10-03T15:55:03+00:00",
}
and I am using Django DRF to make queries like so, where I pass this value along: &releasedate__gt=now-3d/d
This ends up as an Elastic range query like this:
{
  "from": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "releasedate": {
              "gt": "now/d-3d"
            }
          }
        }
      ]
    }
  },
  "size": 10,
  "sort": [
    "_score"
  ]
}
If I want to see all "documents since yesterday", I think of it as all documents with a releasedate greater than midnight yesterday, so I figured the key part of the query would need to be like so:
{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "releasedate": {
              "gt": "now/d-1d"
            }
          }
        }
      ]
    }
  }
}
So I expect this would round the current time down to 00:00 today, then go back one day.
So if I ran this on 2020-10-04, I'd assume it would catch a document with a release date of 2020-10-03T15:55:03+00:00.
Here's my reasoning:
Rounding down with now/d would take us to 2020-10-04T00:00.
And then going back one day with -1d would take us to 2020-10-03T00:00.
This ought to include the document, but I'm not seeing it. I have to look back more than one day, using now/d-2d, before matching documents show up.
Any idea why this might be? I'm also unsure how to see what now/d-1d evaluates to as a timezone-aware object, to check; that's what I'd normally reach for, but I don't know how to do that with Elastic.
FWIW, this is using Elastic 5.6. We'll be updating soon.
I'd say that once you round down to the nearest day (either with now-2d/d or now/d-2d, as you did), the gt query's intervals will indeed be day-based.
In other words, gt: 2020-10-03T00:00 behaves as >= 2020-10-04T00:00. So what you need instead of gt is gte, which will behave as >= 2020-10-03T00:00.
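Keeping everything else the same, the key part of your query would then become something like this (a sketch, using the field name from your example):
{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "releasedate": {
              "gte": "now/d-1d"
            }
          }
        }
      ]
    }
  }
}
As an aside, if you want to inspect what now/d-1d actually resolves to, the _validate API with explain=true should show the rewritten query with concrete bounds; if I remember correctly, it's available in 5.x.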

#timestamp range query in elasticsearch

Can I make a range query on the default timestamp field ignoring the date part, i.e. using only the time in the timestamp, say, 2 hours of each day?
My intention is to search all documents but exclude those indexed between 9 PM and 12 AM (I have seen examples with date ranges in filtering).
An example timestamp looks like this:
"#timestamp": [
"2015-12-21T15:18:17.120Z"
]
Elasticsearch version: 1.5.2
My first idea would be to use date math in the Elasticsearch query; e.g. if you run your query at 1 PM, this would work:
{
  "query": {
    "range": {
      "#timestamp": {
        "gte": "now-16h/h",
        "lte": "now-1h/h"
      }
    }
  }
}
(watch out for the timezone though).
As far as I know, the only other possibility would be to use scripting.
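For instance, something along these lines might do it (only a sketch, assuming Groovy dynamic scripting is enabled on your 1.5.2 cluster; hours 21 through 23 are the ones being excluded):
{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "script": {
          "script": "doc['#timestamp'].date.hourOfDay < 21",
          "lang": "groovy"
        }
      }
    }
  }
}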
Please note also that you are running a very old version of Elasticsearch.
Edit: If you simply need absolute dates, then check what your #timestamp field looks like and use the same format. For instance, on my Elasticsearch, it would be:
{
  "query": {
    "range": {
      "#timestamp": {
        "gte": "2015-03-20T01:21:00.01Z",
        "lte": "2015-03-21T01:12:00.04Z"
      }
    }
  }
}

Can Elasticsearch filter by a date range without specifying a field?

I have multiple date fields and I want to have a single date range query to filter by any of them.
For example, I may have books in my index, and each book may have a published date, edition date, print date, and the author's birth date.
The mapping is straightforward (generated using Elasticsearch.net Nest):
"printDate" : {
"type" : "date",
"format" : "strict_date_optional_time||epoch_millis"
},
I looked at range queries and query string ranges; both need the name of the field explicitly and don't seem to support wildcards.
For example, this doesn't find anything, but works if I use a real field name instead of "*Date":
"filter": [
{
"range": {
"*Date": {
"gte": "2010-01-01T00:00:00",
"lte": "2015-01-01T00:00:00"
}
}
}
]
I also tried placing [2010-01-01 TO 2015-01-01] in a query string, but the dates aren't parsed correctly: it also finds 2010 or 01 as part of other strings (and seemingly other dates).
Another option is to list each field under a "should" clause and specify "minimum_should_match": 1 (sketched below), but that would make me maintain a list of all date fields, which seems inelegant.
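For reference, that workaround would look roughly like this (printDate is from the mapping above; publishedDate and editionDate are just guesses at the other field names):
{
  "query": {
    "bool": {
      "should": [
        { "range": { "publishedDate": { "gte": "2010-01-01T00:00:00", "lte": "2015-01-01T00:00:00" } } },
        { "range": { "editionDate": { "gte": "2010-01-01T00:00:00", "lte": "2015-01-01T00:00:00" } } },
        { "range": { "printDate": { "gte": "2010-01-01T00:00:00", "lte": "2015-01-01T00:00:00" } } }
      ],
      "minimum_should_match": 1
    }
  }
}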
Is there a way of searching for a date range on all date fields?
Try this query, which uses a wildcard in default_field so the range runs against every field matching *Date:
{
  "query": {
    "query_string": {
      "default_field": "*Date",
      "query": "[2010-01-01T00:00:00 TO 2015-01-01T00:00:00]"
    }
  }
}

Kibana: Report on different date intervals such as Today, Yesterday, Last Week

I need to display a report on Kibana that will aggregate results based on multiple date intervals. Times are mapped as float data type along with the timestamp.
Example:
Jobs, Yesterday, Last Week, Last Quarter
Job 1, 5 hr, 10 hr, 60 hr
What is the best way to do this with ES and Kibana?
Given that you want it to display as:
job N | range 1 | range 2 | range 3 | ... | range N
This may be difficult to actually get in Kibana exactly because of how it likes to split up the data table, but it's best to know how to get something before you even try to visualize it:
{
  "size": 0,
  "aggs": {
    "per_job": {
      "terms": {
        "field": "job",
        "size": 10
      },
      "aggs": {
        "ranges": {
          "date_range": {
            "field": "timestamp",
            "ranges": [
              { "from": "now-1d/d" },
              { "from": "now-7d/d" },
              { "from": "now-3M/M" }
            ]
          },
          "aggs": {
            "worked": {
              "sum": {
                "field": "hours"
              }
            }
          }
        }
      }
    }
  }
}
What is this providing? This is grouping by each job, then splitting each job into three bucketed date ranges, each being longer versions of the previous range (notice there's no "to" specified, which you could specify as "to" : "now"), then finally each date range's split is summed up on the field of interest, which I assume is named hours.
How can you use this in Kibana? Well, Kibana is just a visualization tool to build these aggregations and chart or otherwise display them.
The top level aggregation is therefore going to be a Terms aggregation. The secondary or "sub-bucket" will be the Date Range, and finally the metric (above the buckets) will be the Sum.
Unfortunately, given that you seem to want a table view of it, there's no way that I'm aware of to get the separate date ranges to just add another row, unless you accept one table per job.

Elasticsearch filtering by part of date

QUESTION: Does anyone have a solution for how to filter/query Elasticsearch data by month or day? Let's say I need to get all users celebrating their birthdays today.
mapping
mappings:
  dob: { type: date, format: "dd-MM-yyyy HH:mm:ss||yyyy-MM-dd'T'HH:mm:ss'Z'||yyyy-MM-dd'T'HH:mm:ss+SSSS" }
and stored in this way:
dob: 1950-06-03T00:00:00Z
The main problem is how to search users by month and day only, ignoring the year, since birthdays recur annually.
SOLUTION
I found a solution to query birthdays with wildcards. As we know, if we want to use wildcards, the field must be mapped as a string, so I used a multi-field mapping.
mappings:
  dob:
    type: multi_field
    fields:
      dob: { type: date, format: "yyyy-MM-dd'T'HH:mm:ss'Z'" }
      string: { type: string, index: not_analyzed }
and the query to get users by month and day only is:
{
  "query": {
    "wildcard": {
      "dob.string": "*-06-03*"
    }
  }
}
NOTE
This query can be slow, as it needs to iterate over many terms.
CONCLUSION
It's not a pretty way, but it's the only one I've found, and it works!
You should store the value-to-be-searched in Elasticsearch. The string/wildcard solution is halfway there, but storing the numbers would be even better (and faster):
mappings:
  dob:
    type: date
    format: "yyyy-MM-dd'T'HH:mm:ss'Z'"
  dob_day:
    type: byte
  dob_month:
    type: byte
Example:
dob: 1950-03-06
dob_day: 06
dob_month: 03
Filtering (or querying) for plain numbers is easy: Match on both fields.
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"term": {
"dob_day": 06,
}
},
{
"term": {
"dob_month": 03,
}
},
]
}
}
}
}
PS: While thinking about the solution: storing the date as a self-merged number like "06-03" -> "603" or "6.03" would be less obvious, but would allow range queries to be used. But remember that 531 (05-31) plus one day would be 601 (06-01).
A manually-computed Julian date might also be handy, but the calculation must always assume 29 days for February, and the range query would have a chance of being off by one if the range includes the 29th of February.
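As a sketch of the self-merged variant, with a hypothetical dob_mmdd field holding month * 100 + day, all June birthdays could then be found with a single range query:
{
  "query": {
    "range": {
      "dob_mmdd": {
        "gte": 601,
        "lte": 630
      }
    }
  }
}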
Based on your question, I am assuming that you want a query and not a filter (they are different). You can use the date math/format combined with a range query.
See the range query documentation for usage and the date math documentation for an explanation.
curl -XPOST 'http://localhost:9200/twitter/tweet/_search' -d '{
  "query": {
    "range": {
      "birthday": {
        "gte": "2014-01-01",
        "lte": "2014-01-01"
      }
    }
  }
}'
I have tested this with the latest Elasticsearch.
If you don't want to parse strings, you can use a simple script. This filter will match when dateField has a specific month in any year:
"filter": {
"script": {
"lang": "expression",
"script": "doc['dateField'].getMonth() == month",
"params": {
"month": 05,
}
}
}
Note: the month parameter is 0-indexed.
The same method will work for day of month or any other date component.
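For instance, matching a specific day of the month should just be a matter of swapping the accessor (a sketch; day is a parameter like month above, and day of month should be 1-based, unlike the month):
"filter": {
  "script": {
    "lang": "expression",
    "script": "doc['dateField'].getDayOfMonth() == day",
    "params": {
      "day": 3
    }
  }
}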
