While looking at the Java 8 Time API, I see that many methods expect a ChronoUnit (an implementation of TemporalUnit) as a parameter, while others expect a ChronoField (an implementation of TemporalField).
Could anyone help me clarify the designers' decision about when a method is meant to use a ChronoUnit versus a ChronoField, and what the differences between them are?
Thanks.
Units are used to measure a quantity of time - years, months, days, hours, minutes, seconds. For example, the second is an SI unit.
By contrast, fields are how humans generally refer to time, which is in parts. If you look at a digital clock, the seconds count from 0 up to 59 and then go back to 0 again. This is a field - "second-of-minute" in this case, formed by counting seconds within a minute. Similarly, days are counted within a month, and months within a year. To define a complete point on the time-line you have to have a set of linked fields, e.g.:
second-of-minute
minute-of-hour
hour-of-day
day-of-month
month-of-year
year (-of-forever)
The ChronoField API exposes the two parts of second-of-minute. Use getBaseUnit() to get "seconds" and getRangeUnit() to get "minutes".
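For example, a minimal, runnable illustration using the real API:

import java.time.temporal.ChronoField;

public class SecondOfMinute {
    public static void main(String[] args) {
        // second-of-minute = the SECONDS unit counted within the MINUTES unit
        System.out.println(ChronoField.SECOND_OF_MINUTE.getBaseUnit());  // Seconds
        System.out.println(ChronoField.SECOND_OF_MINUTE.getRangeUnit()); // Minutes
    }
}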
The Chrono part of the name refers to the fact that the definitions are chronology-neutral. Specifically, a unit or field only has a full definition when associated with a calendar system, or Chronology. An example of this is the Coptic chronology, where there are 13 months in a year. Despite this being different from the common civil/ISO calendar system, the ChronoField.MONTH_OF_YEAR constant can still be used.
The TemporalUnit and TemporalField interfaces provide the higher level abstraction, allowing units/fields that are not chronology-neutral to be added and processed.
A TemporalUnit serves as a general unit of time measurement. It can therefore be used to determine the size of a temporal amount between two given points in time (in an abstract sense).
A TemporalField, however, is not necessarily related to any kind of (abstract) time axis and usually represents one detail component of a point in time. Example: a month is only one component of a complete calendar date consisting of year, month and day-of-month.
Some people might argue that a calendar month and the month unit could be interpreted as more or less equivalent. Older libraries like java.util.Calendar don't make this distinction. However, fields and units are used in very different ways, as shown above (composing points in time versus measuring temporal amounts).
Interestingly, the JDK 8 designers decided that a field must have a non-null base unit (personally, I am not happy about this narrowing decision, because I can imagine other fields not necessarily having a base unit). In the case of months it is quite trivial. In the case of days, we have different fields with the same base unit DAYS, for example day-of-month, day-of-year and day-of-week. This 1:n relationship justifies the separation of units and fields in the context of JSR-310 (aka the java.time package).
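The 1:n relationship is directly visible in the API; a minimal, runnable illustration using the real ChronoField constants:

import java.time.temporal.ChronoField;

public class DayFields {
    public static void main(String[] args) {
        // Three different fields, one shared base unit (DAYS):
        System.out.println(ChronoField.DAY_OF_MONTH.getBaseUnit()); // Days
        System.out.println(ChronoField.DAY_OF_YEAR.getBaseUnit());  // Days
        System.out.println(ChronoField.DAY_OF_WEEK.getBaseUnit());  // Days
    }
}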
Related
I recently did an RNAseq experiment in which I had controls and experiments at 3 different time periods. My samples were distributed as follows, for a total of 10 samples:
T1_control1, T2_control1, T2_control2, T3_control1,
T1_exp1, T1_exp2, T2_exp1, T2_exp2, T3_exp1, T3_exp2
I did differential expression analysis with DESeq2, and from it I obtained 3 files, one for each time period (T1, T2, and T3), that show the logFold change values from control to experimental for each gene. My question is how I can statistically compare the logFold change value for one gene in one time period vs another time period. I am not sure what test to use, since there is only one logFold change value per time period for each gene.
Thank you in advance.
I am not sure what test to use, since there is only one logFold change value per time period for each gene.
Since RNA sequencing is costly (for many purposes), the way most groups ensure their sequencing run is accurate is by including multiple biological replicates in each group or by sequencing deeply. An argument could be made that showing only one data point per gene at each time point is appropriate, given that standard protocols were followed.
One option, though, is to increase the sample size of each time period by determining the fold change between each control-experimental pair. It would be important, however, to consult the literature and colleagues on whether this is appropriate for the specific type of analysis you are doing.
I'm looking for some best practices to handle and store static time values.
A static time is usually the time of a recurring event, e.g. the activities in a sports centre, the opening times of a restaurant, or the time a TV show is aired every day.
These time values are not bound to a specific date and should not be affected by daylight saving time. For example, a restaurant will open at 11:00 am both in winter and in summer.
What's the best way to handle this situation? How should this kind of values be stored?
I'm mainly interested in issues with automatic time zone and DST adjustments (which should be avoided), and in keeping the time values independent of any specific date.
The best strategies I've found so far are:
store the time as an integer number of seconds since midnight,
store the time as a string.
I did read this question, but it's mostly about normal time values, not the use cases I described.
Update
The library I'm working on: github
Regarding database storage, consider the following in order from most preferred to least preferred option:
Use a TIME type if your database supports it, such as in SQL Server (2008 and greater), MySQL, and Postgres, or INTERVAL HOUR TO SECOND in Oracle.
Use separate integer fields for Hours and Minutes (and Seconds if you need them). Consider using a custom user-defined type to bind these together if your DB supports it.
Use string in 24-hour format with a leading zero, such as "01:23:00", "12:00:00" or "23:59:00". If you include seconds, then always include seconds. You want to keep the strings lexicographically sortable. Don't mix and match formatting. Be consistent.
Regarding the approach of storing a whole number of minutes (or seconds) elapsed since midnight, I recommend avoiding it. That works great when you are actually storing an elapsed duration of time, but not so great when storing a time of day. Consider:
Not every day has a midnight. In some time zones (ex: Brazil), on the day of the spring-forward DST transition, the clocks go from 23:59:59 to 01:00:00.
In any time zone that has DST, the "time elapsed since midnight" could be lying to you. Even when midnight exists, if you save 10:00 as "10 hours", then that's potentially a false statement. There may have been 9 hours or 11 hours elapsed since midnight, if you consider the two days per-year involved in DST transitions.
At some point in your application, you'll likely be applying this time-of-day value to some particular date. When you do, if you are using "elapsed time" semantics, you might be tempted to simply add the elapsed time to midnight of the date in question. That will lead to errors on DST transition days, for the reasons I just mentioned. If you are instead representing a "time of day" in your storage, you'll be more likely to combine them together properly. Of course, this is highly dependent on what language and API you are using.
With any of these, be careful when using recurrence patterns. Say you store a time of "02:00:00" when a bar closes every night. When DST springs forward, that time might not exist, and when it falls back, it will exist twice. You need to be prepared to check for this condition when you apply the time to any particular date.
What you should do is entirely up to your use case. In many situations, the sensible thing to do is to jump forward one hour in the spring-forward gap, and to pick the first of the two points in the fall-back overlap. But YMMV.
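For example, here is a minimal Java sketch of applying a stored time of day to a concrete date; java.time's default resolution shifts forward by the gap length in the spring-forward gap and picks the earlier offset in the fall-back overlap, which matches the strategy described above:

import java.time.LocalDate;
import java.time.LocalTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class ApplyTimeOfDay {
    public static void main(String[] args) {
        LocalTime closing = LocalTime.of(2, 0);               // stored "02:00:00"
        LocalDate springForward = LocalDate.of(2021, 3, 14);  // US DST transition
        ZoneId zone = ZoneId.of("America/New_York");

        // 02:00 does not exist on this date; java.time shifts through the gap:
        ZonedDateTime resolved = ZonedDateTime.of(springForward, closing, zone);
        System.out.println(resolved); // 2021-03-14T03:00-04:00[America/New_York]
    }
}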
See also the DST tag wiki.
Per comments, it looks like the "tod" gem will suffice for your Ruby code.
The question seems a little vague, but I'll give it a try.
Generally speaking, using an integer seems good enough to me. It is easy to compare, easy to add or subtract a duration (of seconds), and space- and time-efficient. You can consider wrapping it in a class if you are using an object-oriented language.
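A minimal Java sketch of such a wrapper (TimeOfDay is a name made up for illustration, not an existing class):

public final class TimeOfDay implements Comparable<TimeOfDay> {
    private final int seconds; // seconds since midnight, 0..86399

    public TimeOfDay(int seconds) {
        if (seconds < 0 || seconds >= 86400)
            throw new IllegalArgumentException("out of range: " + seconds);
        this.seconds = seconds;
    }

    // Adding a duration wraps around midnight.
    public TimeOfDay plusSeconds(int delta) {
        return new TimeOfDay(Math.floorMod(seconds + delta, 86400));
    }

    public int compareTo(TimeOfDay other) {
        return Integer.compare(seconds, other.seconds);
    }

    public String toString() {
        return String.format("%02d:%02d:%02d",
                seconds / 3600, (seconds / 60) % 60, seconds % 60);
    }
}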
As far as I know, there are no existing classes for your needs in C or C++.
In the .NET world, the TimeSpan structure may be useful for your purpose. It has some conveniences: you can get a TimeSpan value from DateTime.TimeOfDay; you can add an interval (another TimeSpan) to a TimeSpan; you can get the hours, minutes, and seconds components separately; etc.
If you use Python, datetime.time is also a good candidate. It is designed exactly for use cases like yours.
I do not know other good candidates in other languages.
Speaking for Java:
In Java, the use cases you describe are not covered well by the old java.util.Date (which is a global timestamp despite its name) or java.util.GregorianCalendar (which is a kind of combination of date, time, zone, etc.), but:
In Java 8 you have the new built-in class java.time.LocalTime, which covers your use cases well. Its predecessor is the identically named class LocalTime in the popular external Java library Joda-Time, which works since Java 5. Furthermore, in my own alpha-state library I have the type net.time4j.PlainTime, which is similar but also offers 24:00 support (good, for example, for shop opening times). All in all, Java is a well-suited language with interesting time libraries that can mostly do what you wish. In detail:
a) Time zone and DST adjustments are not handled by the Java classes mentioned above. Instead, they are applied only if you convert such a plain wall time to another type, like org.joda.time.DateTime, which contains a reference to a time zone (see the sketch after this list).
b) Indeed, these time classes are completely independent of calendar dates, too.
c) The internal storage strategy of JSR-310 (Java 8) is:
private final byte hour;
private final byte minute;
private final byte second;
private final int nano;
Joda-Time uses the other strategy of local milliseconds instead (elapsed time since midnight).
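A minimal sketch of point a), using java.time: the plain time carries no zone information, and DST rules only come into play at the conversion step:

import java.time.LocalDate;
import java.time.LocalTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class PlainWallTime {
    public static void main(String[] args) {
        LocalTime opening = LocalTime.of(11, 0);  // no date, no zone, no DST

        // Zone and DST rules apply only here, when combining with a date and a zone:
        ZonedDateTime concrete = opening.atDate(LocalDate.of(2014, 7, 1))
                                        .atZone(ZoneId.of("Europe/Rome"));
        System.out.println(concrete);
    }
}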
You cannot represent a time unless you also know the day/month/year. There is no such thing as "should not be affected by daylight saving time" as there are many complicated issues to deal with, including leap seconds and so on. Time, as a human sees it, is a complicated thing that cannot easily be dealt with mathematically.
If you really need to store "11am" without any date associated, then that's what you should store. Just store 11am (or perhaps just 11, using 24-hour time).
Then, if you need to do any math you must apply a date before doing any operations on the time.
I would also refrain from storing "11am" as "x seconds from midnight". You really should just store 11 hours, since that is what the user sees, and then have a good date/time library convert it to a useful format. For example, to tell the user whether the restaurant is open right now, you'd pass the time to a date library along with today's date.
Essentially, I want a system that can filter simply, such as "between August 4th and August 7th", but be as complicated as "every third Saturday or Monday of each January in leap years".
I figured that in order to represent the complicated boolean algebra, I would need a tree structure. Each node would be a boolean operation (AND, OR, XOR, NOT) with children that it applies to, which can either be specific filters or other boolean operations.
Each "specific filter" would be something like "Sundays" or "Leap Years". I think everything up to this point is very doable. However, the problem then arises in evaluating the tree to actually find which dates are needed, in order to then make database queries to get the data points.
With the example above (every third Saturday or Monday of each January in leap years), suppose we pre-restrict ourselves to the years for which we have data (5 years' worth). If the sat/mon filters happen to be the top nodes in the tree, we will end up with 500 candidate dates (2 per week, 50 weeks a year, 5 years). Then the next node has to search through all 500 to find which ones conform to the "every third" filter. This isn't even the most complicated example, because an arbitrary number of filters should be allowed, and XOR makes it even crazier.
So, is there any easy route? Did someone already build this? This is just a small part a project involving data visualization, but it seems that it could be an entire project by itself.
I found a couple in Ruby. IceCube seems promising, even though it might not support all your needs.
I will try to explain what I want to accomplish. I am looking for an algorithm or approach, not the actual implementation in my specific system.
I have a table with actuals (incoming customer requests) on a daily basis. These actuals need to be "copied" into the next year, where they will be used as a basis for planning the amount of requests in the future.
The smallest timespan for planning, on a technical basis, is a "period", which consists of at least one day. A period always changes after a week or after a month. This means that if a week is in both May and June, it will be split into two periods.
Here's an example:
2010-05-24 - 2010-05-30 Week 21 | Period_Id 123
2010-05-31 - 2010-05-31 Week 22 | Period_Id 124
2010-06-01 - 2010-06-06 Week 22 | Period_Id 125
We did this to reduce the amount of data, because we have a few thousand items that have 365 daily values. For planning, this is reduced to "a few thousand x 65" (or whatever the period count is per year). I can aggregate a month, or a week, by combining all periods that belong to it. The important thing is that I could still use daily values, then find the corresponding period and add them there if necessary.
What I need is an approach for aggregating the actuals for every (working) day, week, or month into next year's equivalent period. My requirements are not fixed here. The actuals have a certain distribution, because there are certain deadlines and habits that are reflected in the data. I would like to preserve this as far as possible, but planning is never completely accurate, so I can make a compromise here.
Don't know if this is what you're looking for, but this is a strategy for calculating the forecasts using flexible periods:
First define a mapping from each day in the next year to the corresponding day in this year. Then, when you need a forecast for period x, you take all days in that period and sum the actuals for the matching days.
With this you can precalculate every week/month, but also create new forecasts if the contents of the periods change.
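A minimal Java sketch of this, assuming "corresponding day" means the same weekday (shifting by 364 days, i.e. 52 whole weeks, preserves the day of week); the actuals map is a hypothetical stand-in for your table:

import java.time.LocalDate;
import java.util.Map;

public class PeriodForecast {
    // Sum this year's actuals over the days that map to the given next-year period.
    static double forecast(LocalDate periodStart, LocalDate periodEnd,
                           Map<LocalDate, Double> actuals) {
        double sum = 0;
        for (LocalDate d = periodStart; !d.isAfter(periodEnd); d = d.plusDays(1)) {
            sum += actuals.getOrDefault(d.minusDays(364), 0.0); // same weekday, one year back
        }
        return sum;
    }
}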
Map weeks to weeks. The first full week of this year to the first full week of the next. Don't worry about "periods" and aggregation; they are irrelevant.
Where a missing holiday leaves a hole in the data, just take the values for the same day of the previous week or the next week, and do the same at the beginning/end of the year.
Now for each day of the week, combine the results for the year and look for events more than, say, two standard deviations from the mean (if you don't know what that means then skip this step), and look for correlations with known events like holidays. If a holiday doesn't show an effect in this test then ignore it. If you find an effect, shift it to compensate for the different date next year. Don't worry about higher-order effects, you don't have enough data to pin them down.
Now draw in periods wherever you like and aggregate all you want.
Don't make any promises about the accuracy of these predictions, there's no way to know it. Don't worry about whether this is the best possible way; it isn't, but it's as good as any you're likely to find. You can spend as much more time and effort fine-tuning this as you wish; it might raise expectations but it's not likely to make the results much more accurate-- it's about as likely to make them worse.
There is no a priori way to answer that question. You have to look at your data and decide what the important parameters are (day of week, week number, month, season, temperature outside?) using the results.
For example, if many of your customers are Jewish or Muslim, then the Gregorian calendar, ISO week numbers, and all that won't help you much, because Jewish and Muslim holidays (and thus user behaviour) are determined using other calendars.
Another example: trying to predict iPhone search volume from last year's searches doesn't sound like a good idea. It seems that the important timescales are much longer than a year (the technology becoming mainstream over the years) and much shorter than a year (specific events that affect us for days to weeks).
Suppose you were able to keep track of the news mentions of different entities, like, say, "Steve Jobs" and "Steve Ballmer".
What are some ways you could tell whether the number of mentions per entity in a given time period was unusual relative to its normal frequency of appearance?
I imagine that for a more popular person like Steve Jobs an increase of, like, 50% might be unusual (an increase from 1000 to 1500), while for a relatively unknown CEO an increase of 1000% for a given day could be possible (an increase from 2 to 200). If you didn't have a way of scaling that, your unusualness index could be dominated by unheard-ofs getting their 15 minutes of fame.
update: To make it clearer, it's assumed that you are already able to get a continuous news stream and identify entities in each news item and store all of this in a relational data store.
You could use a rolling average. This is how a lot of stock trackers work. By tracking the last n data points, you can see whether a given change falls substantially outside the usual variance.
You could also try some normalization -- one very simple one would be that each category has a total number of mentions (m), a percent change from the last time period (δ), and then some normalized value (z) where z = m * δ. Let's look at the table below (m0 is the previous value of m):
Name            m     m0    δ     z
Steve Jobs      4950  4500  0.10  495
Steve Ballmer   400   300   0.33  132
Larry Ellison   50    10    4.00  200
Andy Nobody     50    40    0.25  12.5
Here, a 400% change for the relatively unknown Larry Ellison gives a z value of 200, a 10% change for the much better known Steve Jobs gives 495, and my spike of 25% is still a low 12.5. You could tweak this algorithm depending on what you feel are good weights, or use the standard deviation or the rolling average to find whether a result is far away from the "expected" one.
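A minimal sketch of that z computation (the Mention class is hypothetical, just to pin down the arithmetic):

public class Mention {
    final String name;
    final int current, previous; // mention counts for this and the previous period

    Mention(String name, int current, int previous) {
        this.name = name;
        this.current = current;
        this.previous = previous;
    }

    double delta() { return (current - previous) / (double) previous; } // percent change
    double z()     { return current * delta(); }                        // normalized value

    public static void main(String[] args) {
        System.out.println(new Mention("Larry Ellison", 50, 10).z()); // 200.0
    }
}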
Create a database and keep a history of stories with timestamps. You then have, for each category of news item you're monitoring, a history of stories over time.
Periodically calculate the number of stories per unit of time (you choose the unit).
Test whether the current value is more than X standard deviations away from the historical mean.
Some data will be more volatile than others, so you may need to adjust X appropriately. X = 1 is a reasonable starting point.
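A minimal Java sketch of step 3, where history holds the per-unit-of-time story counts from step 2:

import java.util.Arrays;

public class SpikeTest {
    // True if the current count is more than x standard deviations from the mean.
    static boolean isUnusual(double[] history, double current, double x) {
        double mean = Arrays.stream(history).average().orElse(0);
        double var  = Arrays.stream(history)
                            .map(v -> (v - mean) * (v - mean))
                            .average().orElse(0);
        return Math.abs(current - mean) > x * Math.sqrt(var);
    }
}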
Way oversimplified:
Store people's names and the number of articles created in the past 24 hours that involve their name. Compare to historical data.
Real life:
If you're trying to dynamically pick out people's names, how would you go about doing that? When searching through articles, how do you grab names? Once you grab a new name, do you search all articles for it? How do you separate Steve Jobs of Apple from Steve Jobs the new star running back who is generating a lot of articles?
If you're looking for simplicity, create a table with 50 people's names that you insert manually. Every day at midnight, have your program run a quick Google query for the past 24 hours and store the number of results. There are a lot of variables in this, though, that we're not accounting for.
The method you use is going to depend on the distribution of the counts for each person. My hunch is that they are not going to be normally distributed, which means that some of the standard approaches to longitudinal data might not be appropriate - especially for the small-fry, unknown CEOs you mention, who will have data that are very much non-continuous.
I'm really not well-versed enough in longitudinal methods to give you a solid answer here, but here's what I'd probably do if you locked me in a room to implement this right now:
Dig up a bunch of past data. Hard to say how much you'd need, but I would basically go until it gets computationally insane or the timeline gets unrealistic (not expecting Steve Jobs references from the 1930s).
In preparation for creating a simulated "probability distribution" of sorts (I'm using terms loosely here), more recent data needs to be weighted more than past data - e.g., a thousand years from now, hearing one mention of (this) Steve Jobs might be considered a noteworthy event, so you wouldn't want to be using expected counts from today (Andy's rolling mean is using this same principle). For each count (day) in your database, create a sampling probability that decays over time. Yesterday is the most relevant datum and should be sampled frequently; 30 years ago should not.
Sample out of that dataset using the weights and with replacement (i.e., same datum can be sampled more than once). How many draws you make depends on the data, how many people you're tracking, how good your hardware is, etc. More is better.
Compare your actual count of stories for the day in question to that distribution. What percent of the simulated counts lie above your real count? That's roughly (god don't let any economists look at this) the probability of your real count or a larger one happening on that day. Now you decide what's relevant - 5% is the norm, but it's an arbitrary, stupid norm. Just browse your results for awhile and see what seems relevant to you. The end.
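A rough Java sketch of the sampling and comparison just described; the method name, the geometric decay weighting, and the tail-fraction readout are illustrative assumptions, not a standard recipe:

import java.util.Random;

public class HistorySim {
    // Fraction of weighted, with-replacement draws from past counts that
    // exceed today's count; pastCounts[0] = yesterday, [1] = two days ago, ...
    static double tailProbability(int[] pastCounts, double dailyDecay,
                                  int todayCount, int draws, Random rng) {
        int n = pastCounts.length;
        double[] cumWeight = new double[n];
        double total = 0;
        for (int i = 0; i < n; i++) {
            total += Math.pow(dailyDecay, i);  // recent days get more weight
            cumWeight[i] = total;
        }
        int above = 0;
        for (int d = 0; d < draws; d++) {
            double r = rng.nextDouble() * total;
            int idx = 0;
            while (cumWeight[idx] < r) idx++;  // inverse-CDF lookup
            if (pastCounts[idx] > todayCount) above++;
        }
        return above / (double) draws;         // share of simulated counts above today's
    }
}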
Here's what sucks about this method: there's no trend in it. If Steve Jobs had 15,000 a week ago, 2000 three days ago, and 300 yesterday, there's a clear downward trend. But the method outlined above can only account for that by reducing the weights for the older data; it has no way to project that trend forward. It assumes that the process is basically stationary - that there's no real change going on over time, just more and less probable events from the same random process.
Anyway, if you have the patience and willpower, check into some real statistics. You could look into multilevel models (each day is a repeated measure nested within an individual), for example. Just beware of your parametric assumptions... mention counts, especially on the small end, are not going to be normal. If they fit a parametric distribution at all, it would be in the Poisson family: the Poisson itself (good luck), the overdispersed Poisson (aka negative binomial), or the zero-inflated Poisson (quite likely for your small-fry, no chance for Steve).
Awesome question, at any rate. Lend your support to the statistics StackExchange site, and once it's up you'll be able to get a much better answer than this.