I'm working on a Firestore database and need to query on date ranges.
My question: in terms of query efficiency, how should I store my dates?
As numbers?
As Firestore Timestamps?
(My guess is that a string format is bad!)
Knowing that my queries will only be on day (not hours or seconds), is it more efficient to query on a field that has just day-precision, or can I keep better precision?
Thanks!
For performance, it doesn't really matter how you store the date. The index created for the date field will be able to find the range of documents equally well for each one. The performance of Firestore queries is based on the number of documents matched by the query, not the type of data used.
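Since the question only needs day precision, one option is to keep full-precision timestamps and turn the requested days into a half-open range before querying. A minimal sketch using only the standard library; the helper name `day_range_bounds` and the choice of UTC are assumptions for illustration, not part of any Firestore API:

```python
from datetime import date, datetime, time, timedelta, timezone


def day_range_bounds(first_day: date, last_day: date) -> tuple[datetime, datetime]:
    """Return (start, end) UTC datetimes covering first_day..last_day inclusive.

    The range is half-open: a stored timestamp matches when start <= ts < end.
    """
    start = datetime.combine(first_day, time.min, tzinfo=timezone.utc)
    end = datetime.combine(last_day + timedelta(days=1), time.min, tzinfo=timezone.utc)
    return start, end


# Bounds for 1 May 2024 through 2 May 2024 (assumed example dates).
start, end = day_range_bounds(date(2024, 5, 1), date(2024, 5, 2))
```

The two bounds would then be used as the lower (inclusive) and upper (exclusive) limits of a range filter on the timestamp field, so the stored precision does not have to be reduced to whole days.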
I know how Elasticsearch indexes words and strings, but I wonder if it behaves differently for timestamps.
We have an internal Elasticsearch instance that indexes events (millions of events per day).
Once every X seconds, I want to pull all the events that we received in the last X seconds.
Does Elasticsearch index the timestamp efficiently, so that we don't need to traverse all the documents to return the relevant results? How does it index this data?
Anything numeric (date fields, integer fields, geo fields, etc.) is not stored in the inverted index but in BKD trees (since ES 5), which are especially well suited to range queries and to finding the unordered set of docIDs that satisfy the time range conditions.
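For the "last X seconds" polling described in the question, the request body could look roughly like the sketch below. It only builds the JSON body as a Python dict; the field name `@timestamp` and the helper name are assumptions, and the bounds use Elasticsearch date math (`now-30s`):

```python
def last_seconds_query(field: str, seconds: int) -> dict:
    """Build a range query body matching documents from the last `seconds` seconds."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return {
        "query": {
            "range": {
                # gte/lt give a half-open window, so consecutive polls do not overlap
                # as long as each poll uses the same window length.
                field: {"gte": f"now-{seconds}s", "lt": "now"}
            }
        }
    }


body = last_seconds_query("@timestamp", 30)
```

Note that `now` is evaluated on the server at query time, so for strictly gap-free polling it may be safer to pass explicit timestamps instead.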
I'm wondering what would be an efficient way to detect the last modified timestamp of an index in Elasticsearch. I have read posts about adding a timestamp field in a pipeline, but this solution has limitations (e.g. only newly created indices support the timestamp update?).
If only a handful of indices need their last modification time tracked, what would be the most efficient way? Would querying periodically and comparing the results between queries give us an approximate last modification time? Are there any other ways to track ES events?
There is a creation_date setting, but not a comparable update_date one. The reasoning behind this is that updating it for every indexing event would be very expensive, even more so in a distributed environment.
You could use something like meta, but it has the same limitation as adding a timestamp to individual documents.
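If the documents do carry a timestamp (with the limitations mentioned above), the periodic check from the question could be a `max` aggregation over that field. A rough sketch that only builds the request body and reads the result; the field name `updated_at` and the aggregation name `last_modified` are hypothetical:

```python
from typing import Optional


def last_modified_query(timestamp_field: str = "updated_at") -> dict:
    """Build a body that returns no hits, only the maximum value of the timestamp field."""
    return {
        "size": 0,
        "aggs": {"last_modified": {"max": {"field": timestamp_field}}},
    }


def extract_last_modified(response: dict) -> Optional[float]:
    """Read the aggregation value (epoch milliseconds for a date field), or None if absent."""
    return response.get("aggregations", {}).get("last_modified", {}).get("value")


body = last_modified_query()
```

This only reflects documents that actually have the field, so it gives an approximation and will not notice deletions.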
It is common to have elastic indices with dates, in particular from something like logstash.
So for example, you have indices like foo-2016.05.01, foo-2016.05.02, etc...
When doing a time range query for data, what is the cost of querying indices that I already know won't have data for that time range?
For example, suppose the time range query only asks for data from 2016.05.02, but I also include the foo-2016.05.01 index in my query.
Is that basically a quick one-op per index where the index knows it has no data in that date range, or will doing this be costly to performance? I'm hoping not only to know the yes/no answer, but to understand why it behaves the way it does.
Short version: it's probably going to be expensive. The cost will be n where n is the number of distinct field values for the date data. If all entries in the index had an identical date field value, it'd be a cheap query of 1 check (and would be pointless since it'd be a binary "all or nothing" response at that point). Of course, the reality is usually that every single doc has a unique date field value (which is incrementing such as in a log), depending on how granular the date is (assuming here that the time is included to seconds or milliseconds). Elasticsearch will check each aggregated, unique date field value of the included indices to try and find documents that match on the field by satisfying the predicates of the range query. This is the nature of the inverted index (indexing documents by their fields).
An easy way to improve performance is to change the Range Query to a Range Filter, which caches results and improves performance for requests beyond the first one. Of course, this is only valuable if you're repeating the same range filter over time (the cache is read more often than it is written), and if the range is not part of scoring the documents (that is to say, documents in range are not more valuable than those out of range when returning a set of both, also known as "boosting").
Another way to improve performance is by convention. If you query by day, store each day in its own rolling index and then do pre-search logic to select the indexes to query. This eliminates the need for the filter or query entirely.
Elasticsearch doesn't care about the index name (that includes the date) and it doesn't automagically exclude that index from your range query. It will query all the shards (a copy - be it replica or primary) of all the indices specified in the query. Period.
Kibana, on the other hand, knows based on the time range selected to query specific indices only.
If you know your range will not make sense on some indices, then exclude those from the query before creating the query.
A common approach for the logging use case, when the current day is the one most frequently queried, is to create an alias. Give it a meaningful name, like today, that will always point to today's index. Time-based indices also commonly have a retention period. For these two tasks, managing the aliases and deleting the "expired" indices, you can use Curator.
If most of the time you only care about the current day, use that alias; that way the indices for the days before today are never queried.
In case not, then filter the indices to be queried based on the range before deciding on which indices to run the query.
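The pre-search step suggested in both answers can be as small as generating the index names for the requested days. A minimal sketch, assuming daily indices named like foo-2016.05.01 as in the question; the helper name `indices_for_range` is made up for illustration:

```python
from datetime import date, timedelta


def indices_for_range(prefix: str, first_day: date, last_day: date) -> list[str]:
    """Return the daily index names covering first_day..last_day inclusive."""
    if last_day < first_day:
        raise ValueError("last_day must not be before first_day")
    day_count = (last_day - first_day).days + 1
    return [
        f"{prefix}-{(first_day + timedelta(days=offset)).strftime('%Y.%m.%d')}"
        for offset in range(day_count)
    ]


names = indices_for_range("foo", date(2016, 5, 1), date(2016, 5, 2))
# Comma-separated form, as used when several indices are named in one search request.
index_param = ",".join(names)
```

A real setup would probably also need to skip days for which no index exists, since naming a missing index in a search normally results in an error.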
I want to query elasticsearch documents within a date range. I have two options now, both work fine for me. Have tested both of them.
1. Range Query
2. Range Filter
Since I have a small data set for now, I am unable to test the performance of either one. What is the difference between the two, and which one would retrieve documents and respond faster?
The main difference between queries and filters has to do with scoring. Queries return documents with a relative ranked score for each document. Filters do not. This difference allows a filter to be faster for two reasons. First, it does not incur the cost of calculating the score for each document. Second, it can cache the results as it does not have to deal with possible changes in the score from moment to moment - it's just a boolean really, does the document match or not?
From the documentation:
Filters are usually faster than queries because:
- they don’t have to calculate the relevance _score for each document; the answer is just a boolean “Yes, the document matches the filter” or “No, the document does not match the filter”.
- the results from most filters can be cached in memory, making subsequent executions faster.
As a practical matter, the question is: do you use the relevance score in any way? If not, filters are the way to go. If you do, filters may still be of use, but only where they make sense. For instance, if you had a language field (let's say language: "EN" as an example) in your documents and wanted to query by language along with a relevance score, you would combine a query for the text search with a filter for language. The filter would cache the document ids for all documents in English, and then the query could be applied to that subset.
I'm oversimplifying a bit, but that's the basics. Good places to read up on this:
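To make the language example concrete: the links in this answer describe the older filtered query, which was deprecated in Elasticsearch 2.0 and later removed; in newer versions the same idea is written as a bool query with a filter clause. A sketch that only builds the request body; the text field name `body` and the helper name are assumptions:

```python
def scored_search_with_language_filter(text: str, language: str) -> dict:
    """Build a bool query: `must` contributes to the score, `filter` only includes or excludes."""
    return {
        "query": {
            "bool": {
                # Scored part: relevance of the text match.
                "must": [{"match": {"body": text}}],
                # Unscored part: a yes/no check on the language field.
                "filter": [{"term": {"language": language}}],
            }
        }
    }


request_body = scored_search_with_language_filter("date range", "EN")
```

A date range that does not affect ranking would go into the same `filter` list as a range clause.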
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-filtered-query.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/query-dsl-filtered-query.html
http://exploringelasticsearch.com/searching_data.html
http://elasticsearch-users.115913.n3.nabble.com/Filters-vs-Queries-td3219558.html
Filters are cached so they are faster!
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/filter-caching.html
I have seen quite a few questions here about the best storage ideologies for dates in MongoDB. most of the answers boiling down to using JavaScript Date objects.
I have another question, however: which is the most performant way to store dates?
I am operating on a collection of about 5 million entries and perform about 500 ranged read operations per minute on it, asking for records $gt or $lt the current timestamp. How are indices built around the native JavaScript Date object? Are they more performant than storing an integer timestamp?
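Dates are stored as 64-bit integers in MongoDB (milliseconds since the Unix epoch), so a Date and an integer timestamp take the same space and compare the same way.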
See the BSON spec http://bsonspec.org/#/specification
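To illustrate what that means, here is a small standard-library sketch of the value a BSON UTC datetime carries: a signed 64-bit count of milliseconds since the Unix epoch, written little-endian. The helper names are made up, and this is not how a driver is actually implemented:

```python
import struct
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(dt: datetime) -> int:
    """Convert a timezone-aware datetime to whole milliseconds since the Unix epoch."""
    delta = dt - EPOCH
    return delta.days * 86_400_000 + delta.seconds * 1000 + delta.microseconds // 1000


def encode_int64(value: int) -> bytes:
    """Pack an integer as a little-endian signed 64-bit value (8 bytes)."""
    return struct.pack("<q", value)


earlier = to_epoch_millis(datetime(2016, 5, 1, tzinfo=timezone.utc))
later = to_epoch_millis(datetime(2016, 5, 2, tzinfo=timezone.utc))
payload = encode_int64(later)  # always 8 bytes, the same size as an int64 timestamp
```

Because the stored value is just an integer, comparing two dates is an integer comparison, which is why a Date and an integer timestamp should behave alike in an index.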