I see people storing and retrieving the server time, and times relative to it, using date or getTime, and keeping it in the database as a string of the sort: "July 21, 1983 01:15:00".
Up until now I have stored my server time as the difference between NOW and 1 January 2013: the number of whole minutes, rounded down, between 1 January 2013 and the present moment, which I keep as my internal server time.
The advantages of this are that:
- querying the server requires only a simple numeric comparison, while (I'm making an educated guess here) comparing two dates implies converting them internally to objects and using heavier comparison operations.
- storing a number of that size is more lightweight than a string of ~25 characters.
- converting back to "real" time is done by adding the offset to 1 January 2013, though second and millisecond values are lost to the initial rounding (sketched below).
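A minimal sketch of that scheme, assuming the epoch is midnight UTC on 1 January 2013 (the class and method names are hypothetical):

    import java.time.Duration;
    import java.time.Instant;

    public class OffsetClock {
        // The custom epoch described above; midnight UTC is an assumption.
        private static final Instant EPOCH_2013 = Instant.parse("2013-01-01T00:00:00Z");

        // Whole minutes elapsed since the custom epoch, rounded down.
        static long toServerMinutes(Instant now) {
            return Duration.between(EPOCH_2013, now).toMinutes();
        }

        // Convert back to "real" time; seconds and milliseconds are lost.
        static Instant toRealTime(long serverMinutes) {
            return EPOCH_2013.plus(Duration.ofMinutes(serverMinutes));
        }
    }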
But still, fellow programmers insist that the string version
- is easy for a human to read.
- is a universal format for most languages (notably Node.js, MongoDB, and AS3, all of which this project uses).
I am uncertain which is better for large-scale databases and, specifically, for a multiplayer socket-based game. I am sure others with real experience in this could shed some light on my issue.
So which is better and why?
Store them as Mongo Date objects. Mongo stores dates as 8-byte integers counting milliseconds since the Unix epoch [1], and displays them in a human-readable format. You are NOT storing 25 characters!
Therefore, all comparisons are just as fast. There is no string parsing except for when you're querying, which is a one-time operation per query.
Your difference would be stored as a 4-byte int, so you're saving ONLY 4 bytes over normal MongoDB date storage. That's a very small saving when weighed against the average size of your Mongo objects.
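For illustration, a minimal sketch using the MongoDB Java driver (the database, collection, and field names are hypothetical): dates go in and come out as native Date objects, and a range query is a plain comparison.

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    import java.util.Date;

    import static com.mongodb.client.model.Filters.gte;

    public class DateStorage {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> events =
                        client.getDatabase("game").getCollection("events");

                // Stored as an 8-byte BSON date, not a string.
                events.insertOne(new Document("name", "login").append("at", new Date()));

                // A range query is a simple numeric comparison internally.
                Date oneHourAgo = new Date(System.currentTimeMillis() - 3_600_000L);
                for (Document d : events.find(gte("at", oneHourAgo))) {
                    System.out.println(d.toJson());
                }
            }
        }
    }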
Consider all the disadvantages of your "offset since January 2013" method:
Time spent writing extra logic to offset the dates when updating or querying.
Time spent dealing with bugs that arise from having forgotten to offset a date.
Time spent shifting dates by hand or in your head when inspecting database output (when diagnosing a problem), instead of seeing the actual date right away.
Inability to use date operators in MongoDB aggregations (e.g. $dayOfMonth) without extra work, the extra work being a projection that shifts your offsets back into real dates.
Basically: more code, more headaches, and more time spent, all to save 4 bytes on objects in a database where the same 4 bytes could be saved by renaming your field from "updated" to "upd". I don't think that's a wise tradeoff.
Also, see: Best way to store date/time in mongodb
Premature optimization is the root of all evil. Don't optimize unless you've determined something to be a problem.
1 - http://bsonspec.org/#/specification
While looking at the Java 8 time API, I see that many methods expect a ChronoUnit (an implementation of TemporalUnit) as a parameter, while others expect a ChronoField (an implementation of TemporalField).
Could anyone help me clarify the designers' decision about when a method expects a ChronoUnit versus a ChronoField, and what the difference between them is?
Thanks.
Units are used to measure a quantity of time - years, months, days, hours, minutes, seconds. For example, the second is an SI unit.
By contrast, fields are how humans generally refer to time, which is in parts. If you look at a digital clock, the seconds count from 0 up to 59 and then go back to 0 again. This is a field - "second-of-minute" in this case, formed by counting seconds within a minute. Similarly, days are counted within a month, and months within a year. To define a complete point on the time-line you have to have a set of linked fields, eg:
second-of-minute
minute-of-hour
hour-of-day
day-of-month
month-of-year
year (-of-forever)
The ChronoField API exposes the two parts of second-of-minute. Use getBaseUnit() to get "seconds" and getRangeUnit() to get "minutes".
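A small sketch showing those two accessors (the expected output is noted in comments):

    import java.time.temporal.ChronoField;

    public class FieldParts {
        public static void main(String[] args) {
            // second-of-minute counts seconds (base unit) within a minute (range unit).
            System.out.println(ChronoField.SECOND_OF_MINUTE.getBaseUnit());  // Seconds
            System.out.println(ChronoField.SECOND_OF_MINUTE.getRangeUnit()); // Minutes

            // day-of-month counts days within a month.
            System.out.println(ChronoField.DAY_OF_MONTH.getBaseUnit());      // Days
            System.out.println(ChronoField.DAY_OF_MONTH.getRangeUnit());     // Months
        }
    }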
The Chrono part of the name refers to the fact that the definitions are chronology-neutral. Specifically, this means that the unit or field has a meaning only when associated with a calendar system, or Chronology. An example of this is the Coptic chronology, where there are 13 months in a year. Despite this being different to the common civil/ISO calendar system, the ChronoField.MONTH_OF_YEAR constant can still be used.
The TemporalUnit and TemporalField interfaces provide the higher level abstraction, allowing units/fields that are not chronology-neutral to be added and processed.
A TemporalUnit serves as a general unit of time measurement. It can therefore be used to determine the size of the temporal amount between two given points in time (in an abstract sense).
However, a TemporalField is not necessarily related to any kind of (abstract) time axis and usually represents one component of a point in time. Example: a month is only one component of a complete calendar date consisting of year, month and day-of-month.
Some people might argue that a calendar month and the month unit could be interpreted as more or less equivalent. Older libraries like java.util.Calendar don't make this distinction. However, fields and units are used in very different ways, as shown above (composing points in time versus measuring temporal amounts).
Interestingly, the JDK 8 designers decided that a field must have a non-null base unit (I am personally not happy about this narrowing decision, because I can imagine fields that do not necessarily have a base unit). In the case of months it is quite trivial. In the case of days, we have different fields with the same base unit DAYS, for example day-of-month, day-of-year and day-of-week. This 1:n relationship justifies the separation of units and fields in the context of JSR-310 (a.k.a. the java.time package).
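A short sketch of the distinction: a unit measures the amount between two points, while a field reads one component of a single point, and several day-based fields share the base unit DAYS.

    import java.time.LocalDate;
    import java.time.temporal.ChronoField;
    import java.time.temporal.ChronoUnit;

    public class UnitsVersusFields {
        public static void main(String[] args) {
            LocalDate a = LocalDate.of(2014, 1, 15);
            LocalDate b = LocalDate.of(2014, 3, 1);

            // A unit measures a temporal amount between two points in time.
            System.out.println(ChronoUnit.DAYS.between(a, b)); // 45

            // Fields decompose a single point in time; these three different
            // fields all share the same base unit, DAYS.
            System.out.println(a.get(ChronoField.DAY_OF_MONTH)); // 15
            System.out.println(a.get(ChronoField.DAY_OF_YEAR));  // 15
            System.out.println(a.get(ChronoField.DAY_OF_WEEK));  // 3 (Wednesday)
        }
    }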
I'm looking for some best practices to handle and store static time values.
A static time is usually the time of a recurring event, e.g. the activities in a sport centre, the opening times of a restaurant, the time a TV show is aired every day.
These time values are not bound to a specific date and should not be affected by daylight saving time. For example, a restaurant will open at 11:00am in both winter and summer.
What's the best way to handle this situation? How should this kind of values be stored?
I'm mainly interested in avoiding issues with automatic time zone and DST adjustments, and in keeping the time values independent of any specific date.
The best strategies I've found so far are:
store the time as an integer number of seconds since midnight,
store the time as a string.
I did read this question, but it's mostly about the normal time values and not the use cases I described.
Update
The library I'm working on: github
Regarding database storage, consider the following in order from most preferred to least preferred option:
Use a TIME type if your database supports it, such as in SQL Server (2008 and greater), MySQL, and Postgres, or INTERVAL HOUR TO SECOND in Oracle.
Use separate integer fields for Hours and Minutes (and Seconds if you need them). Consider using a custom user-defined type to bind these together if your DB supports it.
Use a string in 24-hour format with leading zeros, such as "01:23:00", "12:00:00" or "23:59:00". If you include seconds, then always include seconds: you want to keep the strings lexicographically sortable, so don't mix and match formats. Be consistent (see the sketch below).
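As an illustration of the string option, a minimal sketch (the helper names are hypothetical) that keeps one fixed format so that string order matches chronological order:

    import java.time.LocalTime;
    import java.time.format.DateTimeFormatter;

    public class TimeStrings {
        // One fixed, zero-padded, 24-hour format with seconds, always.
        private static final DateTimeFormatter HMS = DateTimeFormatter.ofPattern("HH:mm:ss");

        static String toStored(LocalTime t)   { return t.format(HMS); }
        static LocalTime fromStored(String s) { return LocalTime.parse(s, HMS); }

        public static void main(String[] args) {
            String open  = toStored(LocalTime.of(1, 23));  // "01:23:00"
            String close = toStored(LocalTime.of(23, 59)); // "23:59:00"
            // Lexicographic comparison agrees with chronological order.
            System.out.println(open.compareTo(close) < 0); // true
        }
    }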
Regarding the approach of storing a whole number of minutes (or seconds) elapsed since midnight, I recommend avoiding it. That works great when you are actually storing an elapsed duration of time, but not so great when storing a time of day. Consider:
Not every day has a midnight. In some time zones (ex: Brazil), on the day of the spring-forward DST transition, the clocks go from 23:59:59 to 01:00:00.
In any time zone that has DST, the "time elapsed since midnight" could be lying to you. Even when midnight exists, if you save 10:00 as "10 hours", then that's potentially a false statement. There may have been 9 hours or 11 hours elapsed since midnight, if you consider the two days per-year involved in DST transitions.
At some point in your application, you'll likely be applying this time-of-day value to some particular date. When you do, if you are using "elapsed time" semantics, you might be tempted to simply add the elapsed time to midnight of the date in question. That will lead to errors on DST transition days, for the reasons I just mentioned. If you are instead representing a "time of day" in your storage, you'll be more likely to combine them together properly. Of course, this is highly dependent on what language and API you are using.
With any of these, be careful when using recurrence patterns. Say you store a time of "02:00:00" when a bar closes every night. When DST springs forward, that time might not exist, and when it falls back, it will exist twice. You need to be prepared to check for this condition when you apply the time to any particular date.
What you should do is entirely up to your use case. In many situations, the sensible thing to do is to jump forward one hour in the spring-forward gap, and to pick the first of the two points in the fall-back overlap. But YMMV.
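For what it's worth, that is the default resolution strategy of java.time when a plain time is combined with a date and a zone; a small sketch (zone and dates chosen purely for illustration):

    import java.time.LocalDate;
    import java.time.LocalTime;
    import java.time.ZoneId;
    import java.time.ZonedDateTime;

    public class DstResolution {
        public static void main(String[] args) {
            ZoneId zone = ZoneId.of("America/New_York");

            // Spring-forward day: 02:00 does not exist, so the result is
            // pushed forward by the length of the gap, to 03:00-04:00.
            System.out.println(ZonedDateTime.of(
                    LocalDate.of(2021, 3, 14), LocalTime.of(2, 0), zone));

            // Fall-back day: 01:30 exists twice; the earlier of the two
            // offsets (-04:00) is chosen by default.
            System.out.println(ZonedDateTime.of(
                    LocalDate.of(2021, 11, 7), LocalTime.of(1, 30), zone));
        }
    }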
See also, the DST tag wiki.
Per comments, it looks like the "tod" gem will suffice for your Ruby code.
The question seems a little vague, but I will have a try.
Generally speaking, using an integer seems good enough for me. It is easy to compare, easy to add or subtract a duration (of seconds), and is space- and time-efficient. You can consider wrapping it in a class if you are using an object-oriented language.
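If you happen to be on Java, a minimal sketch of such a wrapper (all names hypothetical), keeping the integer representation but giving it a type:

    public final class TimeOfDay implements Comparable<TimeOfDay> {
        private final int seconds; // seconds since midnight, 0..86399

        public TimeOfDay(int seconds) {
            if (seconds < 0 || seconds >= 86400)
                throw new IllegalArgumentException("out of range: " + seconds);
            this.seconds = seconds;
        }

        // Add a duration in seconds, wrapping around midnight.
        public TimeOfDay plusSeconds(int delta) {
            return new TimeOfDay(Math.floorMod(seconds + delta, 86400));
        }

        @Override public int compareTo(TimeOfDay o) {
            return Integer.compare(seconds, o.seconds);
        }

        @Override public String toString() {
            return String.format("%02d:%02d:%02d",
                    seconds / 3600, (seconds / 60) % 60, seconds % 60);
        }
    }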
As far as I know, there are no existing classes for your needs in C or C++.
In the .NET world, the TimeSpan type may be useful for your purpose. It has some conveniences: you can get a TimeSpan value from DateTime.TimeOfDay; you can add an interval (another TimeSpan) to it; you can get the hours, minutes, and seconds components separately; etc.
If you use Python, datetime.time is also a good candidate. It is designed exactly for use cases like yours.
I do not know other good candidates in other languages.
Speaking for Java:
In Java, the use cases you describe are not covered well by the old java.util.Date (which is a global timestamp despite its name) or java.util.GregorianCalendar (which is a kind of combination of date, time, zone, etc.), but:
In Java 8 you have the new built-in class java.time.LocalTime, which covers your use cases well. Its predecessor is the equally named class LocalTime in the external and popular Joda-Time library, which works on Java 5 and later. Furthermore, my own alpha-state library offers the type net.time4j.PlainTime, which is similar but also supports 24:00 (useful, for example, for shop opening times). All in all, Java is a well-suited language with interesting time libraries that can mostly do what you wish. In detail:
a) Time zone and DST adjustments are not handled by the Java classes mentioned above. They are only applied when you convert such a plain wall time to another type, like org.joda.time.DateTime, which contains a reference to a time zone.
b) Indeed, these time classes are completely independent of the calendar date, too.
c) The internal storage strategy for JSR-310 (Java 8) is:
private final byte hour;
private final byte minute;
private final byte second;
private final int nano;
Joda-Time uses the other strategy instead, storing local milliseconds (elapsed time since midnight).
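A brief sketch of point a) with java.time: the plain time stays zone-free, and DST only matters once a date and a zone are attached.

    import java.time.LocalDate;
    import java.time.LocalTime;
    import java.time.ZoneId;
    import java.time.ZonedDateTime;

    public class OpeningTime {
        public static void main(String[] args) {
            // The recurring opening time: no date, no zone, no DST.
            LocalTime opens = LocalTime.of(11, 0);

            // Zone and DST rules apply only at this conversion step.
            ZonedDateTime todayInRome =
                    opens.atDate(LocalDate.now()).atZone(ZoneId.of("Europe/Rome"));
            System.out.println(todayInRome);
        }
    }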
You cannot represent a time unless you also know the day/month/year. There is no such thing as "should not be affected by daylight saving time" as there are many complicated issues to deal with, including leap seconds and so on. Time, as a human sees it, is a complicated thing that cannot easily be dealt with mathematically.
If you really need to store "11am" without any date associated, then that's what you should store. Just store 11am (or perhaps just 11, use 24 hour time).
Then, if you need to do any math you must apply a date before doing any operations on the time.
I would also refrain from storing "11am" as "x seconds from midnight". You really should just store 11 hours, since that is what the user sees, and then have a good date/time library convert it to a useful format. For example, to tell the user whether the restaurant is open right now, you'd pass the time to a date library along with today's date.
I am running a fairly large query on a specific range of dates. The query takes about 30 seconds EXCEPT when I do a range of 10/01/2011-10/31/2011. For some reason that range never finishes. For example 01/01/2011-01/31/2011, and pretty much every other range, finishes in the expected time.
Also, I noticed that doing smaller ranges, like a week, takes longer than larger ranges.
When Oracle gathers statistics on a table, it will record the low value and the high value in a date column and use that to estimate the cardinality of a predicate. If you create a histogram on the column, it will gather more detailed information about the distribution of data within the column. Otherwise, Oracle's cost based optimizer (CBO) will assume a uniform distribution.
For example, if you have a table with 1 million rows and a DATE column with a low value of January 1, 2001 and a high value of January 1, 2011, it will assume that approximately 10% of the data is in the range January 1, 2001 - January 1, 2002, and that roughly 0.027% of the data comes from some time on March 3, 2008 (1 / (10 years * 365 days per year + leap days)).
So long as your queries use dates from within the known range, the optimizer's cardinality estimates are generally pretty good so its decisions about what plan to use are pretty good. If you go a bit beyond the upper or lower bound, the estimates are still pretty good because the optimizer assumes that there probably is data that is larger or smaller than it saw when it sampled the data to gather the statistics. But when you get too far from the range that the optimizer statistics expect to see, the optimizer's cardinality estimates get too far out of line and it eventually chooses a bad plan. In your case, prior to refreshing the statistics, the maximum value the optimizer was expecting was probably September 25 or 26, 2011. When your query looked for data for the month of October, 2011, the optimizer most likely expected that the query would return very few rows and chose a plan that was optimized for that scenario rather than for the larger number of rows that were actually returned. That caused the plan to be much worse given the actual volume of data that was returned.
In Oracle 10.2, when Oracle does a hard parse and generates a plan for a query that is loaded into the shared pool, it peeks at the bind variable values and uses those values to estimate the number of rows a query will return and thus the most efficient query plan. Once a query plan has been created and until the plan is aged out of the shared pool, subsequent executions of the same query will use the same query plan regardless of the values of the bind variables. Of course, the next time the query has to be hard parsed because the plan was aged out, Oracle will peek and will likely see new bind variable values.
Bind variable peeking is not a particularly well-loved feature (Adaptive Cursor Sharing in 11g is much better) because it makes it very difficult for a DBA or a developer to predict what plan is going to be used at any particular instant because you're never sure if the bind variable values that the optimizer saw during the hard parse are representative of the bind variable values you generally see. For example, if you are searching over a 1 day range, an index scan would almost certainly be more efficient. If you're searching over a 5 year range, a table scan would almost certainly be more efficient. But you end up using whatever plan was chosen during the hard parse.
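To make the mechanism concrete, a hedged JDBC sketch (table, column, and connection details are placeholders): the SQL text is identical for a 1-day and a 5-year range, so whichever plan was built at hard-parse time gets reused for both.

    import java.sql.Connection;
    import java.sql.Date;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class RangeQuery {
        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection(
                     "jdbc:oracle:thin:@//host:1521/svc", "user", "password");
                 PreparedStatement ps = con.prepareStatement(
                     "SELECT COUNT(*) FROM orders WHERE order_date BETWEEN ? AND ?")) {
                // Only the bound values differ between executions; the SQL text,
                // and therefore the shared cursor and its plan, stay the same.
                ps.setDate(1, Date.valueOf("2011-10-01"));
                ps.setDate(2, Date.valueOf("2011-10-31"));
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) System.out.println(rs.getLong(1));
                }
            }
        }
    }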
Most likely, you can resolve the problem simply by ensuring that statistics are gathered more frequently on tables that are frequently queried based on ranges of monotonically increasing values (date columns being by far the most common such column). In your case, it had been roughly 6 weeks since statistics had been gathered before the problem arose so it would probably be safe to ensure that statistics are gathered on this table every month or every couple weeks depending on how costly it is to gather statistics.
You could also use the DBMS_STATS.SET_COLUMN_STATS procedure to explicitly set the statistics for this column on a more regular basis. That requires more coding and work but saves you the time of gathering statistics. That can be hugely beneficial in a data warehouse environment but it's probably overkill in a more normal OLTP environment.
Is there any sort of advantage (performance, indexes, size, etc) to storing dates in MongoDB as an ISODate() vs. storing as a regular UNIX timestamp?
The overhead of an ISODate compared to a time_t is trivial next to the advantages of the former.
An ISO 8601 format date is human readable, it can be used to express dates prior to January 1, 1970, and most importantly, it isn't prey to the Y2038 problem.
This last bit can't be stressed enough. In 1960, it seemed ludicrous that wasting an octet or two on a century number could yield any benefit, as the turn of the century was impossibly far off. We know how wrong that turned out to be. The year 2038 will be here sooner than you expect, and a 32-bit time_t is already insufficient for representing, for example, the schedule of payments on a 30-year contract.
MongoDB's built-in Date type is very similar to a Unix timestamp stored in a time_t. The only difference is that Dates are a 64-bit field storing milliseconds since Jan 1, 1970, rather than a 32-bit field storing seconds since the same epoch. The only downside is that current releases treat the count as unsigned, so they can't handle dates before 1970 correctly. This will be fixed in MongoDB 2.0, scheduled for release in about a month.
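A quick illustration of the two ranges (the values can be checked with java.time):

    import java.time.Instant;

    public class EpochRanges {
        public static void main(String[] args) {
            // A signed 32-bit count of seconds runs out in January 2038.
            System.out.println(Instant.ofEpochSecond(Integer.MAX_VALUE));
            // -> 2038-01-19T03:14:07Z

            // A 64-bit count of milliseconds, as BSON dates use, has no such
            // problem; this instant is unrepresentable in a 32-bit time_t.
            System.out.println(Instant.ofEpochMilli(3_000_000_000_000L));
            // -> 2065-01-24T05:20:00Z
        }
    }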
A possible point of confusion is the name "ISODate". It is just a helper function in the shell that wraps JavaScript's horrible Date constructor. Whether you call "ISODate()" or "new Date()", you get back the exact same Date object; we just changed how it prints. You are still free to use normal ISO date strings or time_t ints without using our constructors, but you won't get nice Date objects back in your language of choice.
Just curious how others deal with enums & nosql? Is it better to store an attribute as an enum value or a string? Does this affect the size or performance of the database in some cases? For example, just think of, let's say, a pro sports player... his sport type could be Football, Hockey, Baseball, Basketball, etc... string vs enum, what do you all think?
You should be using enums in your code - strong typing helps avoid a lot of mistakes - and converting them to strings or numbers for storage.
Strings do require significantly more storage space: "Basketball" is 10-20 bytes depending on encoding, while storing it as the number 4 needs only 1 byte. However, there are very few cases where this will actually matter; even with a million records, the difference in total database size is still under 20MB. Strings are easier to work with and less likely to fail silently if the enumeration changes, so use strings.
Strings are also slower than numbers for most operations, including conversion to an enum on load. However, the difference is orders of magnitude smaller than the time taken to retrieve anything at all from the database, so it doesn't matter.
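A minimal sketch of that approach (the Sport enum is hypothetical): strong typing in code, plain strings in storage, and loud failures on unknown values.

    enum Sport { FOOTBALL, HOCKEY, BASEBALL, BASKETBALL }

    public class EnumStorage {
        public static void main(String[] args) {
            Sport s = Sport.BASKETBALL;

            // Writing: persist the constant's name as a string.
            String stored = s.name(); // "BASKETBALL"

            // Reading: convert back; valueOf throws IllegalArgumentException
            // on an unknown value instead of failing silently.
            Sport loaded = Sport.valueOf(stored);
            System.out.println(loaded == s); // true
        }
    }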
Strings are better from a portability perspective, and the ENUM column type is not supported by popular DBMSs such as MS SQL Server and many others.
You can add application-level logic to validate input against an array of allowed values and just store the value as a string.
EDIT:
My preference changed to strings when CakePHP (where I do web apps) dropped ENUM support over portability concerns.