Elasticsearch filtering by part of date

QUESTION
Does anyone have a solution for filtering/querying Elasticsearch data by month or day? Let's say I need to get all users who are celebrating their birthdays today.
The mapping:
mappings:
  dob: { type: date, format: "dd-MM-yyyy HH:mm:ss||yyyy-MM-dd'T'HH:mm:ss'Z'||yyyy-MM-dd'T'HH:mm:ss+SSSS" }
and the values are stored like this:
dob: 1950-06-03T00:00:00Z
The main problem is how to search users by month and day only, ignoring the year, since birthdays recur annually.
SOLUTION
I found a solution that queries birthdays with wildcards. To use wildcards, the field must be mapped as a string, so I used a multi-field mapping.
mappings:
  dob:
    type: multi_field
    fields:
      dob: { type: date, format: "yyyy-MM-dd'T'HH:mm:ss'Z'" }
      string: { type: string, index: not_analyzed }
and the query to get users by month and day only is:
{
  "query": {
    "wildcard": {
      "dob.string": "*-06-03*"
    }
  }
}
NOTE
This query can be slow, as it needs to iterate over many terms.
CONCLUSION
It's not a pretty way, but it's the only one I've found, and it works!
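Note for newer Elasticsearch versions: multi_field was later replaced by multi-fields (fields) and not_analyzed strings by the keyword type, so a rough equivalent of this mapping would look like the sketch below (the index name users and the 7.x mapping syntax are assumptions):
# hypothetical index name, Elasticsearch 7.x syntax
PUT users
{
  "mappings": {
    "properties": {
      "dob": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ss'Z'",
        "fields": {
          "string": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
The wildcard query on dob.string then works the same way against the keyword sub-field, which receives the original date string from the source document.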

You should store the value to be searched in Elasticsearch. The string/wildcard solution is halfway there, but storing the numbers would be even better (and faster):
mappings:
  dob:
    type: date, format: "yyyy-MM-dd'T'HH:mm:ss'Z'"
  dob_day:
    type: byte
  dob_month:
    type: byte
Example:
dob: 1950-03-06
dob_day: 06
dob_month: 03
Filtering (or querying) for plain numbers is easy: Match on both fields.
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"term": {
"dob_day": 06,
}
},
{
"term": {
"dob_month": 03,
}
},
]
}
}
}
}
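Note that the filtered query was removed in Elasticsearch 5.0; on newer versions the same filter can be expressed as a bool query with filter clauses, roughly (same field names assumed):
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "dob_day": 6 } },
        { "term": { "dob_month": 3 } }
      ]
    }
  }
}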
PS: While thinking about the solution: storing the date as a combined number like "06-03" -> "603" or "6.03" would be less obvious, but would allow range queries to be used. But remember that 531 (05-31) plus one day would be 601 (06-01), not 532.
A manually computed day-of-year (Julian-style) value might also be handy, but the calculation must always assume 29 days for February, and a range query would have a chance of being off by one if the range includes the 29th of February.
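For illustration, a sketch of that combined-number idea (the field name dob_monthday is an assumption): index month * 100 + day alongside the date, e.g.
dob: 1950-06-03T00:00:00Z
dob_monthday: 603
and a whole month then becomes a single range:
{
  "query": {
    "range": {
      "dob_monthday": {
        "gte": 601,
        "lte": 630
      }
    }
  }
}
This would match every birthday in June.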

Based on your question I am assuming that you want a query, and not a filter (they are different). You can use date math and date formats combined with a range query.
See the range query documentation for usage, and the date math documentation for an explanation of date math.
curl -XPOST 'http://localhost:9200/twitter/tweet/_search' -d '
{
  "query": {
    "range": {
      "birthday": {
        "gte": "2014-01-01",
        "lte": "2014-01-01"
      }
    }
  }
}'
I have tested this with the latest Elasticsearch.
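For example, date math with day rounding can be used to match documents whose birthday field falls anywhere within the current day (a sketch; note that this still compares the full date, year included):
curl -XPOST 'http://localhost:9200/twitter/tweet/_search' -d '
{
  "query": {
    "range": {
      "birthday": {
        "gte": "now/d",
        "lte": "now/d"
      }
    }
  }
}'
With gte the rounding goes down to the start of the day and with lte it goes up to the last millisecond of the day, so the whole day is covered.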

If you don't want to parse strings, you can use a simple script. This filter will match when dateField has a specific month in any year:
"filter": {
"script": {
"lang": "expression",
"script": "doc['dateField'].getMonth() == month",
"params": {
"month": 05,
}
}
}
Note: the month parameter is 0-indexed.
The same method will work for day of month or any other date component.
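On Elasticsearch 5.x and later, a similar filter can be written in Painless, where the month is 1-indexed (a sketch using 7.x syntax and the same dateField name; 6 means June here):
{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "lang": "painless",
            "source": "doc['dateField'].value.getMonthValue() == params.month",
            "params": {
              "month": 6
            }
          }
        }
      }
    }
  }
}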

Related

Elasticsearch: store & query dates before Epoch

I'm looking for a solution to be able to store and query dates before Epoch (1st January 1970).
For example, I'm trying to store the date December 25, 1969 00:00:00+0100 in an Elasticsearch index.
I suppose it's possible to store it as a string or an integer, but is there any solution that keeps the field type date in Elasticsearch's mapping?
You can definitely store dates before the epoch, simply by storing the date string or negative numbers.
If your index mapping looks like this:
PUT epoch
{
  "mappings": {
    "properties": {
      "my_date": {
        "type": "date"
      }
    }
  }
}
Then you can index documents before the epoch, like this:
PUT epoch/_doc/1
{
  "my_date": -608400
}
PUT epoch/_doc/2
{
  "my_date": "1969-12-25T00:00:00"
}
When searching for documents with a date before the epoch, both would be returned:
POST epoch/_search
{
  "query": {
    "range": {
      "my_date": {
        "lt": "1970-01-01"
      }
    }
  }
}
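One caveat: with the default format (strict_date_optional_time||epoch_millis), numeric values are interpreted as milliseconds since the epoch, so -608400 lands only about ten minutes before the epoch. To index December 25, 1969 00:00:00+0100 as a number, the millisecond value would be used instead; a sketch, assuming your Elasticsearch version accepts negative epoch values (some older versions did not):
# -608400000 ms = 1969-12-24T23:00:00Z = 1969-12-25T00:00:00+0100
PUT epoch/_doc/3
{
  "my_date": -608400000
}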

Can Elasticsearch filter by a date range without specifying a field?

I have multiple date fields and I want to have a single date range query to filter by any of them.
For example, I may have books in my index, and each book may have a published date, edition date, print date, and the author's birth date.
The mapping is straightforward (generated using Elasticsearch.net Nest):
"printDate" : {
"type" : "date",
"format" : "strict_date_optional_time||epoch_millis"
},
I looked at range queries and query string ranges - both need the name of the field explicitly and don't seem to support wildcards.
For example, this doesn't find anything, but works if I use a real field name instead of "*Date":
"filter": [
{
"range": {
"*Date": {
"gte": "2010-01-01T00:00:00",
"lte": "2015-01-01T00:00:00"
}
}
}
]
I also tried placing [2010-01-01 TO 2015-01-01] in a query string, but the dates aren't parsed correctly - it also finds 2010 or 01 as part of other strings (and seemingly other dates).
Another option is to list each field under a "should" clause and specify "minimum_should_match": 1, but that would make me maintain a list of all date fields, which seems inelegant.
Is there a way of searching for a date range on all date fields?
Try this query:
{
  "query": {
    "query_string": {
      "default_field": "*Date",
      "query": "[2010-01-01T00:00:00 TO 2015-01-01T00:00:00]"
    }
  }
}
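A variant of the same idea uses the fields parameter, which also accepts wildcard patterns (a sketch):
{
  "query": {
    "query_string": {
      "fields": ["*Date"],
      "query": "[2010-01-01T00:00:00 TO 2015-01-01T00:00:00]"
    }
  }
}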

Conditions in ElasticSearch queries

I'm new to ElasticSearch. It seems pretty awesome, but I'm extremely confused by the JSON-based query language.
I am going to use ES as a document store. I am interested in making queries such as "get all documents where age = 25", or "get all documents where name = 'john' and city = 'london'". However, I have yet to understand how this can be done.
I can do this:
{
  "query": {
    "match": {
      "age": "25"
    }
  }
}
But is this what I'm looking for? I think this would also return documents where age is "25 apples".
Please explain how one can issue such simple queries against ES.
For age matching you can use a term query. A term query looks for an exact match.
{
  "query": {
    "term": {
      "age": "25"
    }
  }
}
If you're afraid of the query DSL, you might be less afraid to use the query string mini-language (close to the Lucene query language) passed directly in the URI with the q= query string parameter. It might be a bit simpler to learn, and even though it has some limitations, it allows you to go a long way.
For instance, in order to query all documents with age = 25, you'd do it like this:
curl -XGET 'localhost:9200/_search?q=age:25'
For all documents with age between 25 and 30:
curl -XGET 'localhost:9200/_search?q=age:[25 TO 30]'
For all documents with age between 25 and 30 and the country US:
curl -XGET 'localhost:9200/_search?q=age:[25 TO 30] AND country:us'
To get all documents where (name = john and city = london), you can use the following kind of command:
GET /bank/account/_search?q=firstname:Rodriquez%20AND%20age:31
Or, you can also try filters:
GET /demo/post/_search?pretty=true
{
  "query": {
    "filtered": {
      "filter": {
        "and": [
          {
            "term": {
              "name": "john"
            }
          },
          {
            "term": {
              "city": "london"
            }
          }
        ]
      }
    }
  }
}
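Note that both the filtered query and the and filter were removed in Elasticsearch 5.0; on newer versions the equivalent is a bool query with filter clauses, roughly:
GET /demo/post/_search?pretty=true
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "name": "john" } },
        { "term": { "city": "london" } }
      ]
    }
  }
}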

How to calculate difference between two datetime in ElasticSearch

I'm working with ES and I need a query that returns the difference between two datetimes (like MySQL's TIMEDIFF), but I have not found any ES function to do that. Can someone help me?
MySQL Query
SELECT SEC_TO_TIME(
  AVG(
    TIME_TO_SEC(
      TIMEDIFF(r.acctstoptime, r.acctstarttime)
    )
  )
) AS average_access
FROM radacct r
Thanks!
Your best bet is scripted fields. The search query below should work, provided you have enabled dynamic scripting and these fields are defined as date in the mapping.
{
  "script_fields": {
    "test1": {
      "script": "doc['acctstoptime'].value - doc['acctstarttime'].value"
    }
  }
}
Note that you would get the result in epoch milliseconds, which you need to convert to your desired unit.
You can read about scripted fields and see examples of them in the Elasticsearch documentation.
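If you also need the average over all matching documents, as in the MySQL query, a sketch using an avg aggregation with the same kind of script (same field names assumed, dynamic scripting enabled; dividing by 1000 gives seconds):
{
  "size": 0,
  "aggs": {
    "average_access": {
      "avg": {
        "script": "(doc['acctstoptime'].value - doc['acctstarttime'].value) / 1000"
      }
    }
  }
}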
Here is another example using script fields. It converts the dates to milliseconds since the epoch, subtracts the two, and converts the result into the number of days between the two dates.
{
  "query": {
    "bool": {
      "must": [
        {
          "exists": {
            "field": "priorTransactionDate"
          }
        },
        {
          "script": {
            "script": "(doc['transactionDate'].date.millis - doc['priorTransactionDate'].date.millis)/1000/86400 < 365"
          }
        }
      ]
    }
  }
}

ElasticSearch Date Field Mapping Malformation

In my ElasticHQ mapping:
#timestamp date yyyy-MM-dd HH:mm:ssZZZ
...
date date yyyy-MM-dd HH:mm:ssZZZ
In the above I have two types of date field each with a mapping to the same format.
In the data:
"#timestamp": "2014-05-21 23:22:47UTC"
....
"date": "2014-05-22 05:08:09-0400",
As above, the date format does not map to what ES thinks I have my dates formatted as. I assume something hinky happened at index time (I wasn't around).
Also interesting: When using a filtered range query like the following, I get a Parsing Exception explaining that my date is too short:
GET _search
{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "range": {
          "date": {
            "from": "2013-11-23 07:00:29",
            "to": "2015-11-23 07:00:29",
            "time_zone": "+04:00"
          }
        }
      }
    }
  }
}
But searching with the following passes ES's error check, but returns no results, I assume because of the date formatting in the documents.
GET _search
{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "range": {
          "date": {
            "from": "2013-11-23 07:00:29UTC",
            "to": "2015-11-23 07:00:29UTC",
            "time_zone": "+04:00"
          }
        }
      }
    }
  }
}
My question is this: given the above, is there any way we can avoid having to re-index and change the mapping, and still search the malformed data? We have around 1TB of data in this particular cluster and would like to keep it as is, for obvious reasons.
Also attempted was a query that adheres to what is in the data:
"query": {
"range": {
"date": {
"gte": "2014-05-22 05:08:09-0400",
"to": "2015-05-22 05:08:09-0400"
}
}
}
The dates you have in your documents actually do conform to the date format you have in your mapping, i.e. yyyy-MM-dd HH:mm:ssZZZ
In date format patterns, ZZZ stands for an RFC 822 time zone (e.g. -04:00, +04:00, EST, UTC, GMT, ...) so the dates you have in your data do comply otherwise they wouldn't have been indexed in the first place.
However, the best practice is to always make sure dates are transformed to UTC (or any other time zone common to the whole document base that makes sense in your context) before indexing them so that you have a common basis to query on.
As for your query that triggers errors, 2013-11-23 07:00:29 doesn't comply with the date format since the time zone is missing at the end. As you've rightly discovered, adding UTC at the end fixes the query parsing problem (i.e. the missing ZZZ part), but you might still get no results.
Now to answer your question, you have two main tasks to do:
Fix your indexing process/component to make sure all the dates are in a common timezone (usually UTC)
Fix your existing data to transform the dates in your indexed documents into the same timezone
1TB is a lot of data to reindex to fix one or two fields. I don't know what your documents look like, but it doesn't really matter. The way I would approach the problem is to run a partial update on all documents, and for this, I see two different solutions, in both of which the idea is to just fix the #timestamp and date fields:
Depending on your version of ES, you can use the update-by-query plugin, but transforming a date via script is a bit cumbersome.
Or you can write an ad hoc client that will scroll over all your existing documents, partially update each of them, and send them back in bulk.
Given the amount of data you have, solution 2 seems more appropriate.
So... your adhoc script should first issue a scroll query to obtain a scroll id like this:
curl -XGET 'server:9200/your_index/_search?search_type=scan&scroll=1m' -d '{
"query": { "match_all": {}},
"size": 1000
}'
As a result, you'll get a scroll id that you can now use to iterate over all your data with
curl -XGET 'server:9200/_search/scroll?_source=date,#timestamp&scroll=1m' -d 'your_scroll_id'
You'll get 1000 hits (you can decrease/increase the size parameter in the first query above depending on your mileage) that you can now iterate over.
For each hit, you'll only get the two date fields that you need to fix. Then you can transform the dates into the standard timezone of your choosing, using a standard date/time library for instance.
Finally, you can send your 1000 updated partial documents in one bulk like this:
curl -XPOST server:9200/_bulk -d '
{ "update" : {"_id" : "1", "_type" : "your_type", "_index" : "your_index"} }
{ "doc" : {"date" : "2013-11-23 07:00:29Z", "#timestamp": "2013-11-23 07:00:29Z"} }
{ "update" : {"_id" : "2", "_type" : "your_type", "_index" : "your_index"} }
{ "doc" : {"date" : "2014-09-12 06:00:29Z", "#timestamp": "2014-09-12 06:00:29Z"} }
...
'
Rinse and repeat with the next iteration...
I hope this should give you some initial pointers to get started. Let us know if you have any questions.
