How do I specify invalid time in a time dimension?

I am building a time dimension for only time in my data warehouse. I already have a date dimension.
How do I denote an unknown time? In my DimDate dimension, I marked 01/01/1753 as being reserved for unknown dates, but I think a time will be a bit harder. We don't allow NULLs in our fact tables. How do I do this, and what might that row look like?

You state "We don't allow NULLs in our fact tables" but ask "How do I denote an unknown time?"
Assuming your fact table uses the TIME data type and you enforce a NOT NULL constraint on data arriving from the source system, you simply cannot insert an unknown or invalid time into your fact table, and hence should have no problem.
The obvious exception is a value reported by the source system that is invalid from a business point of view, such as the one Sunil proposed ('00:59:59.9999999'), but this is a very uncommon and unstable solution for obvious reasons (changing requirements can easily turn this value into a valid one).
If you chose to allow (and I hope you did) records with NULL values or invalid dates from your source system to enter the fact table, then the best practice is to use surrogate keys on your DimTime and insert them as foreign keys into your fact tables. This easily allows you to represent both valid and invalid values in your dimension.
This approach can also support a business-wise invalid value ('00:59:59.9999999'): such a value gets FK_DimTime = -1.
I strongly advise allowing specific types of garbage from source systems (i.e. invalid, missing, or NULL date/time values) to enter the fact tables, as long as you clearly mark it in the relevant dimensions, because this tends to drive users to improve data quality in the source systems.
Here is some background on the matter

It can look like anything you want. Most dimensions have a 'display name' of some kind, so your dimensions could look something like this:
create table dbo.DimDate (DateID int, DateValue date, DisplayDate nvarchar(20))
-- this is an unknown date; 1753-01-01 is only there because we need some valid date value
insert into dbo.DimDate values (1, '1753-01-01', 'Unknown')
-- this is the real date 1 Jan 1753
insert into dbo.DimDate values (2, '1753-01-01', '01 Jan 1753')
create table dbo.DimTime (TimeID int, TimeValue time, DisplayTime nvarchar(20))
-- this is an unknown time; 00:00 is only there because we need some valid time value
insert into dbo.DimTime values (1, '00:00', 'Unknown')
-- this is the real time value for midnight
insert into dbo.DimTime values (2, '00:00', 'Midnight')
Of course, this assumes that your reporting tool and users use the DisplayDate and DisplayTime columns for filtering instead of the DateValue and TimeValue columns directly, but that's simply a matter of training and standards and whatever solution you adopt needs to be understood anyway.
There are other alternatives such as a flag column for unknown values, or a convention that a negative TimeID indicates an unknown value. But those are less obvious and harder to maintain than an explicit row value, in my opinion.
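As a rough illustration of the explicit "Unknown" row described above, here is a minimal sketch in Python using the standard sqlite3 module as a stand-in for the real warehouse. The FactEvent table, the time_key helper, and the sample events are assumptions made up for this example; only DimTime and its columns come from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE DimTime (
    TimeID      INTEGER PRIMARY KEY,
    TimeValue   TEXT NOT NULL,   -- for the Unknown row this is only a placeholder
    DisplayTime TEXT NOT NULL
);
-- FactEvent is a hypothetical fact table; note that TimeID is NOT NULL
CREATE TABLE FactEvent (
    EventID INTEGER PRIMARY KEY,
    TimeID  INTEGER NOT NULL REFERENCES DimTime (TimeID)
);
INSERT INTO DimTime VALUES (1, '00:00', 'Unknown');
INSERT INTO DimTime VALUES (2, '00:00', 'Midnight');
INSERT INTO DimTime VALUES (3, '09:30', '09:30');
""")

def time_key(source_time):
    """Map a source time string (or None) to a DimTime key; key 1 means unknown."""
    if source_time is None:
        return 1
    row = conn.execute(
        "SELECT TimeID FROM DimTime WHERE TimeValue = ? AND DisplayTime <> 'Unknown'",
        (source_time,),
    ).fetchone()
    return row[0] if row else 1

# made-up source events: a normal time, a missing time, and a real midnight
for event_id, source_time in [(100, "09:30"), (101, None), (102, "00:00")]:
    conn.execute("INSERT INTO FactEvent VALUES (?, ?)", (event_id, time_key(source_time)))

report = conn.execute("""
    SELECT f.EventID, d.DisplayTime
    FROM FactEvent AS f JOIN DimTime AS d ON d.TimeID = f.TimeID
    ORDER BY f.EventID
""").fetchall()
print(report)
```

The fact table never holds a NULL, yet the report can still tell a missing time apart from a genuine midnight, because it reads DisplayTime rather than TimeValue.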

Just create a DimTime record with a technical surrogate key of -1 and populate the time column with the value '00:59:59.9999999'. This is a time that is unlikely ever to be captured by your DWH (it is accurate to the last digit), so it will always equate to unknown in your reports or queries when you apply a filter like:
EventTime < @ReportTime AND EventTime <> '00:59:59.9999999'
Hope this is a viable solution to your problem.


Informatica: If Current month data missing, use previous month

The project I'm working on has monthly data for gas prices in California. The data is taken from a website and loaded into a table. I've done this part - the data is current until March 2016. We are now in April, which does not have any data yet, so the next step I need to do is use March's data and place that into April.
Here is what my table looks like right now:
My question is: How do I add a new row with first column data of 201604 and use March's price?
Let me know if I need to add more information.
I can't help but think that your table structure is going to hurt later.
You don't appear to have a primary key, which would help with integrity and performance.
YYYYMM could be a key, but it's not clear whether you are storing it as a number or a string.
Using YYYYMM as a column name might prove troublesome, as it is part of the Oracle date format.
Your naming convention of a GAS_PRICES table with a GAS_PRICE column could cause confusion due to the similarity.
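For the original question (adding a 201604 row that reuses March's price), the underlying logic can be sketched outside Informatica as a single INSERT ... SELECT. The sketch below uses Python's sqlite3 module; it assumes YYYYMM is stored as a number and is the primary key, the carry_forward helper is hypothetical, and the prices are made-up sample values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gas_prices (
    yyyymm    INTEGER PRIMARY KEY,   -- assumed numeric, e.g. 201603
    gas_price REAL NOT NULL
);
-- made-up sample prices, for illustration only
INSERT INTO gas_prices VALUES (201602, 2.30);
INSERT INTO gas_prices VALUES (201603, 2.50);
""")

def carry_forward(conn, current_month):
    """Insert current_month with the most recent earlier price, if it is missing."""
    conn.execute(
        """
        INSERT INTO gas_prices (yyyymm, gas_price)
        SELECT ?, gas_price
        FROM gas_prices
        WHERE yyyymm = (SELECT MAX(yyyymm) FROM gas_prices WHERE yyyymm < ?)
          AND NOT EXISTS (SELECT 1 FROM gas_prices WHERE yyyymm = ?)
        """,
        (current_month, current_month, current_month),
    )

carry_forward(conn, 201604)
carry_forward(conn, 201604)   # running it again does nothing
rows = conn.execute(
    "SELECT yyyymm, gas_price FROM gas_prices ORDER BY yyyymm"
).fetchall()
print(rows)
```

The NOT EXISTS guard keeps the step safe to rerun, and it leaves a real April price untouched once the website publishes one and it has been loaded.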

Constraint with Query

There are two tables
City (Name, Country_code, Population)
Country (Name, Code, Population)
The task is:
The sum of the populations of all cities in a country should be less than or equal to the population of the country.
Create a constraint and an assertion.
Create a trigger using the constraint and assertion, or propose your own trigger syntax.
I tried to create a constraint on the country table, but I get an error because of the query:
ADD CONSTRAINT check_pop_sum
You can do this using a trigger. Check this:
CREATE TRIGGER check_population
RAISE_APPLICATION_ERROR(-20000, 'Population exceeded');
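The trigger idea can be tried out end to end in a small sandbox. The following is only a sketch in Python with SQLite, not the Oracle trigger from this answer: SQLite's RAISE(ABORT, ...) plays the role of RAISE_APPLICATION_ERROR, the try_insert helper is hypothetical, and the country and city rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE country (code TEXT PRIMARY KEY, name TEXT, population INTEGER);
CREATE TABLE city (name TEXT, country_code TEXT, population INTEGER);

-- after each city insert, abort if the city total now exceeds the country figure
CREATE TRIGGER check_population
AFTER INSERT ON city
BEGIN
    SELECT RAISE(ABORT, 'Population exceeded')
    WHERE (SELECT SUM(population) FROM city WHERE country_code = NEW.country_code)
        > (SELECT population FROM country WHERE code = NEW.country_code);
END;

-- made-up data
INSERT INTO country VALUES ('XX', 'Exampleland', 1000);
INSERT INTO city VALUES ('Alpha', 'XX', 600);
""")

def try_insert(name, population):
    """Insert a city into country XX; return 'ok' or the trigger's error text."""
    try:
        with conn:
            conn.execute("INSERT INTO city VALUES (?, 'XX', ?)", (name, population))
        return "ok"
    except sqlite3.DatabaseError as exc:
        return str(exc)

results = [try_insert("Beta", 300), try_insert("Gamma", 200)]
total = conn.execute("SELECT SUM(population) FROM city").fetchone()[0]
print(results, total)
```

A complete solution would also need triggers for updates to city.population and country.population, which this sketch leaves out.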
The situation you describe is not a legitimate data integrity issue, so the solution is not a constraint, no matter how it is implemented. Data integrity is concerned primarily with the validity of the data, not accuracy. Accuracy is not a concern of the database at all.
Data integrity can fit into two categories: context-free integrity and contextual integrity. Context-free integrity is when you can verify the validity of the datum without referring to any other data. Examples are trying to write an integer value into a date field (domain checking), setting an integer field to "3" instead of 3, or setting it to 3 when the range is defined as "between 100 and 2000".
Contextual integrity is when the validity must be considered as part of a group. Any foreign key, for example. The value 3 may be perfectly valid in and of itself, but can fail validity checking if the proper row in a different table doesn't exist.
Accuracy, on the other hand, is completely different. Let's look again at the integer field constrained to a range of between 100 and 2000. Suppose the value 599 is written to that field. Is it valid? Yes, it meets all applicable constraints. Is it accurate? There is no way to tell. Accuracy of the data, as the data itself, originates from outside the database.
But doesn't the ability to add up the populations of all cities within a country and compare the total to the overall country population mean that we can check for accuracy here?
Not really, or not in a significant way. Upon inserting a new city or updating a city population value, we can test whether the total of all city populations exceeds the country population. We can alert the user to a problem but we can't tell the user where the problem is. Is the error in the insert or update? Is there a too-large population value in an existing city that was entered earlier? Are there several such too-large values for many cities? Are all city population values correct but the country population too small?
From within the database, there is no way to tell. The best we can do is verify the incorrect total and warn the user. But we can't say "The population of city XYZ is too large" because we can't tell if that is the problem. The best we can do is warn that the total of all cities within the country exceeds the population defined for the country as a whole. It is up to the data owners to determine where the problem actually occurs.
This may seem like a small difference but a constraint determines that the data is invalid and doesn't allow the operation to continue ("Data Integrity: preventing bogus data from entering the database in the first place").
In the case of a city population, the data is not invalid. We can't even be sure if it is wrong, it could well be absolutely correct. There is no reason to prevent the operation from completing and the data entering the database.
It is nice if there can be some ability to verify accuracy, but notice that this is not even such a case. As city data is entered into the database, the population value for most of them could be wildly erroneous. You aren't aware of a problem until the country population is finally exceeded. Some check is better than none, I suppose, but it only alerts if the inaccuracies result in a too-large value. There could just as well be inaccuracies that result in too small a value. So some sort of accuracy check must be implemented from the get-go that will test for any inaccuracies -- too large or too small.
That is what will maintain the accuracy of the data, not some misplaced operation within the database.
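If the goal is to warn rather than to block, one option is a report query that lists the countries whose city total exceeds the stated figure and leaves the decision to the data owners. The sketch below uses Python's sqlite3 module with invented rows; the table layout follows the question, and everything else is an assumption for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE country (code TEXT PRIMARY KEY, name TEXT, population INTEGER);
CREATE TABLE city (name TEXT, country_code TEXT, population INTEGER);

-- made-up data: the cities of AA add up to more than AA's stated population
INSERT INTO country VALUES ('AA', 'Aland', 1000);
INSERT INTO country VALUES ('BB', 'Bland', 500);
INSERT INTO city VALUES ('A1', 'AA', 600);
INSERT INTO city VALUES ('A2', 'AA', 500);
INSERT INTO city VALUES ('B1', 'BB', 200);
""")

# Nothing is rejected; the query only reports totals that look suspicious.
suspect = conn.execute("""
    SELECT co.code, co.population, SUM(ci.population) AS city_total
    FROM country AS co
    JOIN city AS ci ON ci.country_code = co.code
    GROUP BY co.code, co.population
    HAVING SUM(ci.population) > co.population
    ORDER BY co.code
""").fetchall()
print(suspect)
```

As the answer points out, such a report cannot say which figure is wrong, only that the totals disagree.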

Can I compare values in the same column in adjacent rows in PowerPivot?

I have a PowerPivot table for which I need to be able to determine how long an item was in an Error state. My data set looks something like this:
What I need to be able to do is to look at the values in the ID and State columns, and see if the value in the previous row is ERROR in the State column, and the same in the ID column. If it is, I then need to calculate the difference between the Changed Date values in those two rows.
So, for example, when I got to row 4, I would see that the value in the State column for Row 3, the previous row, is ERROR, and that the value in the ID column in the previous row is the same as the current row, so I would then calculate the difference between the Changed Date values in Row 3 and Row 4 (I don't care about the values in any of the other columns for this particular requirement).
Is there a way to do this in PowerPivot? I've done a fair amount of Internet searching, and it looks like if it can be done, it would use the EARLIER or EARLIEST DAX functions, but I can't find anything that tells me how, or even if, this can be done.
I have had similar requirements many times and after a really long time of trial-and-error, I finally understood how EARLIER works. It can be very powerful, but also very slow so always check for the performance of your calculations.
To answer your question, you will need to create 4 calculated columns:
1) Item Rank - used for ranking the issues with same Item ID
=COUNTROWS(FILTER('ID', EARLIER([Item ID]) = [Item ID] && EARLIER([Date]) >= [Date]))
2) Follows Error - to easily find the issue that follows an EROR issue
=IF([State] = "EROR",[Item Rank]+1)
3) Time of Following Issue - simple lookup so that you can calculate the difference
=IF([Follows Error]>0,
LOOKUPVALUE([Date], [User], [User], [Item Rank], [Follows Error]),
4) Time Diff - calculation of the time difference for the specific issue
DAY([Time of Following Issue])-DAY([Date]),
With those calculated columns, you can then easily create a PowerPivot table: drag State and Item Id onto the ROWS pane and then simply add Time Diff to Values. You will get an overview of the issues in the "EROR" state and the time it took to resolve them.
This is what it looks like in PowerPivot window:
And the resulting Pivot table:
You can download my Excel file here (2013).
As I mentioned, be careful with the performance as the calculated columns with nested EARLIER and IF conditions might be a bit too performance-demanding. If there is a smarter way, I would be very happy to see it, but for now this works for me just fine.
Also, keep in mind that all calculated columns could be nested into 1, but I kept them separated to make it easier to understand the formulas.
Hope this helps :-)
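To make the previous-row logic easier to follow, here is the same idea outside PowerPivot as a small Python sketch: sort by ID and date, pair each row with its predecessor within the same ID, and keep the pairs whose earlier row is in the ERROR state. The rows are made up and the error_durations function is hypothetical. Unlike a DAY() subtraction, which compares day-of-month numbers, this sketch subtracts full timestamps.

```python
from datetime import datetime
from itertools import groupby
from operator import itemgetter

# made-up rows: (ID, State, Changed Date)
rows = [
    (1, "NEW",   "2015-01-01 08:00"),
    (1, "ERROR", "2015-01-01 09:00"),
    (1, "FIXED", "2015-01-01 11:30"),
    (2, "ERROR", "2015-01-02 10:00"),
    (2, "FIXED", "2015-01-02 10:45"),
    (2, "ERROR", "2015-01-03 10:00"),
]

def error_durations(rows):
    """Return (ID, hours) for each row whose previous row with the same ID was ERROR."""
    parsed = sorted(
        (item_id, datetime.strptime(changed, "%Y-%m-%d %H:%M"), state)
        for item_id, state, changed in rows
    )
    result = []
    for item_id, group in groupby(parsed, key=itemgetter(0)):
        group = list(group)
        # zip pairs every row with the one that follows it inside the same ID
        for previous, current in zip(group, group[1:]):
            if previous[2] == "ERROR":
                hours = (current[1] - previous[1]).total_seconds() / 3600
                result.append((item_id, hours))
    return result

durations = error_durations(rows)
print(durations)
```

The last row of ID 2 is still in ERROR and has no following row, so it contributes nothing, which mirrors the blank result the calculated columns would give.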

(var)char as the type of the column for performance?

I have a column called "status" in PostgreSQL. First it used to be "status_id" of type integer. The values were kept on client, so there was no table on the server called statuses where I'd keep those statuses and then do inner join with the first table.
I used to send the ids of the statuses from the client (they had the names on the client). However, at some point I understood I'd better make the server hold those statuses. Not in a separate table but in the first one and I want to make them strings. So the initial table will have a status column of type string (varchar, to be more specific). I read it wouldn't be that slow.
In general, is it a good idea? I suppose it is because doing inner join (in case I'd keep statuses in the separate table) each time is expensive as well as sending ids from the client.
1) The only concern I have is that the column status should be of type char, not varchar. It should make it more effective I suppose. Is that so?
2) If the first case is correct then I'm not sure I'll be able to name all the statuses using exactly the same amount of characters, let's say, 5 characters. Some of them might be longer, some shorter. How can I solve this?
It's not denormalization because I'm talking about a single table. There is not, and has never been, a second table called Statuses with the fields (id, status_name).
What I'm trying to convey is that I could use char(n) for status_name and also add an index on it. Then it should be fast enough. However, it may or may not be possible to name all the statuses with a fixed number (n) of characters, and that's the only concern.
I don't think using char or varchar instead of integer is a good idea. It is hard to predict how much slower it will be than an integer PK, but this design will be slower, and the impact will be worse when you join larger tables. If you can, use ENUM types instead.
CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy');
CREATE TABLE person (
    name text,
    current_mood mood
);
INSERT INTO person VALUES ('Moe', 'happy');
SELECT * FROM person WHERE current_mood = 'happy';
 name | current_mood
------+--------------
 Moe  | happy
(1 row)
PostgreSQL varchar and char types are very similar. The internal implementation is the same; paradoxically, char can be a little bit slower because of the padding with spaces.
I'd go one step further. Never use the outdated data type char(n), unless you know you have to (for compatibility or some rare exotic reason). The type is utterly useless in a modern database. Padding strings with blank characters is nonsense, and if you have to do it, you can do it in a cheaper fashion with rpad() on data retrieval.
SELECT rpad('short', 10) AS char_10_string;
varchar is basically the same as text and allows a length specifier: varchar(n). I generally use just text. If I need to limit the length, I use a CHECK constraint. Here's one example, why.
Whenever you can use a simple integer (or enum) instead, that's a bit smaller and faster in every respect. Consider @Pavel's answer for enum.
As for:
because doing inner join (...) each time is expensive
Well, it carries a small cost, but it's generally cheaper than redundantly saving text representation of the status instead of a much cheaper integer in the main table. That kind of rumor is spread by people having problems understanding the concept of database normalization. The enum type is a compromise here - for relatively static sets of values.
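To illustrate the text-plus-CHECK idea mentioned earlier in this answer, here is a small sketch in Python with SQLite standing in for PostgreSQL. The orders table, the 10-character limit, the try_status helper, and the status names are all assumptions for the example; in PostgreSQL a similar CHECK on length(status) would be written in the table definition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        id     INTEGER PRIMARY KEY,
        -- unbounded text type, with the length limit expressed as a constraint
        status TEXT NOT NULL CHECK (length(status) <= 10)
    )
""")

def try_status(status):
    """Try to insert a row with the given status; return True if it was accepted."""
    try:
        with conn:
            conn.execute("INSERT INTO orders (status) VALUES (?)", (status,))
        return True
    except sqlite3.IntegrityError:
        return False

results = {s: try_status(s) for s in ("new", "shipped", "awaiting_payment")}
print(results)
```

Statuses of different lengths are stored without padding, which addresses the worry about naming every status with exactly n characters.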

What is a reasonable year datatype in Oracle?

Two possibilities come to mind: NUMBER(4) and DATE.
Pro NUMBER(4):
No duplicate entries possible if specified as UNIQUE
Easy arithmetic (add one, subtract one)
Con NUMBER(4):
No validation (e.g. negative numbers)
Con DATE:
Duplicate entries are possible ('2013-06-24', '2013-06-23', ...)
Not so easy arithmetic (add one = ADD_MONTHS(12))
As an additional requirement, the column gets compared with the current year, EXTRACT (YEAR FROM SYSDATE). In my opinion NUMBER(4) is the better choice. What do you think? Is there another option I have missed?
You can restrict a date column to only have one entry per year if you want to, with a function-based index:
create unique index uq_yr on <table> (trunc(<column>, 'YYYY'));
Trying to insert two dates in the same year would give you an ORA-00001 error. Of course, if you don't want the rest of the date then it may be unhelpful or confusing to hold it, but on the other hand there may be secondary info you want to keep (e.g. if you're recording that an annual audit happened, holding the full date might not hurt anything). You could also have a virtual column (from 11g) that holds the trunc value for easier manipulation, perhaps.
You could also use an interval year(4) to month data type, and insert using numtoyminterval(2013, 'year'), etc. You could do interval arithmetic to add and subtract years, and extract to get the year back out as a number. That would probably be more painful than using a date though, overall.
If you're really only interested in the year (and you are not holding the month in a different column!) then a number is probably going to be simplest, with a check constraint to make sure it's a sensible number - number(4) doesn't stop you inserting 2.013 when you meant 2,013 (though you need to be converting from a string to hit that, and not have an NLS parameter mismatch), which would be truncated to just 2.
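A numeric year column with a check constraint might look roughly like the sketch below. It uses Python's sqlite3 module rather than Oracle, so the typeof() test is SQLite-specific; the annual_audit table, the try_year helper, and the 1900 to 2100 range are assumptions chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE annual_audit (
        -- one row per year, restricted to whole numbers in an assumed sensible range
        audit_year INTEGER NOT NULL UNIQUE
            CHECK (typeof(audit_year) = 'integer'
                   AND audit_year BETWEEN 1900 AND 2100)
    )
""")

def try_year(value):
    """Try to insert a year; return True if the constraints accepted it."""
    try:
        with conn:
            conn.execute("INSERT INTO annual_audit VALUES (?)", (value,))
        return True
    except sqlite3.IntegrityError:
        return False

accepted = {value: try_year(value) for value in (2013, 2.013, -5)}
duplicate_ok = try_year(2013)          # second 2013 violates UNIQUE
next_year = conn.execute("SELECT audit_year + 1 FROM annual_audit").fetchone()[0]
print(accepted, duplicate_ok, next_year)
```

The check rejects both the mistyped 2.013 and the negative year, while plain arithmetic such as adding one year stays trivial.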
You've quite well summed up the pros/cons.
Provided that you clearly name your field so that it's easy to understand that it contains a year, I would go with NUMBER(4) for simplicity and for storing no more and no less than what is necessary. And even if there is no validation, IMO negative years are valid :)
Depending on your use case you might also consider building a one-off date (dimension) table and linking to a specific row via ID. That way, you have access to more information which you could later add to the dimension table (leap year etc.) and the entries in your dimension can be validated on creation.
