Querying a data warehouse data involving time dimension - time

I have two tables for time dimension
date (unique row for each day)
time of the day (unique row for each minute in a day)
Given this schema what would a query look like if one wants to retrieve facts for last X hours where X can be any number greater than 0.
Things start to be become tricky when the start time and end time happen to be in two different days of the year.
EDIT: My Fact table does not have a time stamp column

Fact tables do have (and should have) original timestamp in order to avoid weird by-time queries which happen over the boundary of a day. Weird means having some type of complicated date-time function in the WHERE clause.
In most DWs these type of queries are very rare, but you seem to be streaming data into your DW and using it for reporting at the same time.
So I would suggest:
Introduce the full timestamp in the fact table.
For the old records, re-create the timestamp from the Date and Time keys.
DW queries are all about not having any functions in the WHERE clause, or if a function has to be used, make sure it is SARGABLE.

You would probably be better served by converting the Start Date and End Date columns to TIMESTAMP and populating them.
Slicing the table would require taking the appropriate interval BETWEEN Start Date AND End Date. In Oracle the interval would be something along the lines of SYSDATE - (4/24) or SYSDATE - NUMTODSINTERVAL(4, 'HOUR')
This could also be rewritten as:
Start Date <= (SYSDATE - (4/24)) AND End Date >= (SYSDATE - (4/24))

It seems to me that given the current schema you have, that you will need to retrieve the appropriate time IDs from the time dimension table which meet your search criteria, and then search for matching rows in the fact table. Depending on the granularity of your time dimension, you might want to check the performance of doing either (SQL Server examples):
A subselect:
SELECT X FROM FOO WHERE TIMEID IN (SELECT ID FROM DIMTIME WHERE HOUR >= DATEPART(HOUR, CURRENT_TIMESTAMP()) AND DATEID IN (SELECT ID FROM DIMDATE WHERE DATE = GETDATE())
An inner join:
SELECT X FROM FOO INNER JOIN DIMTIME ON TIMEID = DIMTIME.ID WHERE HOUR >= DATEPART(HOUR, CURRENT_TIMESTAMP()) INNER JOIN DIMDATE ON DATEID = DIMDATE.ID WHERE DATE = GETDATE()
Neither of these are truly attractive options.
Have you considered that you may be querying against a cube that is intended for roll-up analysis and not necessarily for "last X" analysis?
If this is not a "roll-up" cube, I would agree with the other posters in that you should re-stamp your fact tables with better keys, and if you do in fact intend to search off of hour frequently, you should probably include that in the fact table as well, as any other attempt will probably make the query non-sargable (see What makes a SQL statement sargable?).
Microsoft recommends at http://msdn.microsoft.com/en-us/library/aa902672%28v=sql.80%29.aspx that:
In contrast to surrogate keys used in other dimension tables, date and time dimension keys should be "smart." A suggested key for a date dimension is of the form "yyyymmdd". This format is easy for users to remember and incorporate into queries. It is also a recommended surrogate key format for fact tables that are partitioned into multiple tables by date.
Best luck!

Related

Query not filtering with date in Oracle

There are records in table for particular date. But when I query with that value, I am unable to filter the records.
select * from TBL_IPCOLO_BILLING_MST
where LAST_UPDATED_DATE = '03-09-21';
The dates are in dd-mm-yy format.
To the answer by Valeriia Sharak, I would just add a few things since your question is tagged Oracle. I was going to add this as a comment to her answer, but it's too long.
First, it is bad practice to compare dates to strings. Your query, for example, would not even execute for me -- it would end with ORA-01843: not a valid month. That is because Oracle must do an implicit type conversion to convert your string "03-09-21" to a date and it uses the current NLS_DATE_FORMAT setting to do that (which in my system happens to be DD-MON-YYYY).
Second, as was pointed out, your comparison is probably not matching rows due LAST_UPDATED_DATE having hours, minutes, and seconds. But a more performant solution for that might be:
...
WHERE last_update_date >= TO_DATE('03-09-21','DD-MM-YY')
AND last_update_date < TO_DATE('04-09-21','DD-MM-YY')
This makes the comparison without wrapping last_update_date in a TRUNC() function. This could perform better in either of the following circumstances:
If there is an index on last_update_date that would be useful in your query
If the table with last_update_date is large and is being joined to other tables (because it makes it easier for Oracle to estimate the number of rows from your table that are inputs to the join).
Your column might contain hours and seconds, but they can be hidden.
So when you filter on the date, oracle implicitly adds time to the date. So basically you are filtering on '03-09-21 00:00:00'
Try to trunc your column:
select * from TBL_IPCOLO_BILLING_MST
where trunc(LAST_UPDATED_DATE) = '03-09-21';
Hope, I understood your question correctly.
Oracle docs

How to create index when there is a arithmetic operation done on that column

I have a query that selects records from a table that are older than 72 days.
SELECT id FROM TABLE_NAME WHERE TIMESTAMP <= SYSDATE - INTERVAL '72' HOUR;
The performance of this query is horrible, so I have added an index to the TIMESTAMP column.
This works fine with thousands of records, but when the record count is 10 million (even more, sometimes), I hardly see any performance improvement with the index.
My guess is that the arithmetic operation is killing the performance of the query.
Please tell me if there are any other approaches to speeding up this query.
Assuming that the timestamp column is of the type TIMESTAMP, the problem is that the implicit conversion from DATE (which is returned by SYSDATE) to TIMESTAMP kills the index.
You could add a function-based index or you could change the use of SYSDATE to SYSTIMESTAMP.

Joining two tables in hive

I have table where I have partitioned date by year and month and date
'ABC' Partition by
(year='2011', month='08', day='01')
I want to run a query something like
select * from ABC where dt>='2011-03-01' and dt<='2012-02-01';
How can I run this query with above partitioning scheme in terms of year, month and day?
You might consider creating an external table that is partitioned by 'yyyy-mm-dd', and uses the same locations as your existing table. You won't have to copy any data, and you'll have the flexibility of both partitioning formats.
select * from ABC where year='2011' and month >= '03'
UNION
select * from ABC where year='2012' and month = '01'
UNION
select * from ABC where year='2012' and month='02' and day='01';
The above query should solve the purpose but it's really neither flexible nor well-readable. Like Matt suggested, a better partitioning format would be of a single string variable in yyyy-MM-dd format as the partitioning column. However, you might have to make a copy of the data if you change the partitioning scheme for year, month, day to dt. In my opinion though, it's totally worth it.

Storing weekly and monthly aggregates in Oracle

I need to dynamically update weekly and monthly sales data per product and customer. These need to be updated and checked during the sale of a product, and for various reasons I'm not able to use stored procedures or materialized views for this (I'll read everything into the application, modify everything in memory and then update and commit the results).
What is the best table structure for holding the sales during a period?
Store the period type (M, W) with start and end dates, or just the type and start date?
Use date fields and a char, or code it into a string ('M201201' / 'W201248')
Normalize sales and periods into two tables, or keep both sales and the period in a single table?
I will be doing two kinds of queries - select the sales of the current weekly (xor monthly) period/customer/article but not update them, and select for update weekly and monthly periods for a customer/article.
If you store both the start and end dates of the applicable period in the row, then your retrieval queries will be much easier, at least the ones that are based on a single date (like today). This is a very typical mode of access since you are probably going to be looking at things from the perspective of a business transaction (like a specific sale) which happens on a given date.
It is very direct and simple to say where #date_of_interest >= start_date and #date_of_interest <= end_date. Any other combination requires you to do date arithmetic either in code before you go to your query or within your query itself.
Keeping a type code (M, W) as well as both start and end dates entails introducing some redundancy. However, you might choose to introduce this redundancy for the sake of easing data retrieval. This: where #date_of_interest >= start_date and #date_of_interest <= end_date and range_type='M' is also very direct and simple.
As with all denormalization, you need to ensure that you have controls that will manage this redundancy.
I would recommend you to use a normalized schema for that purpose where you store weekly and monthly aggregation in two different tables with the same structure. I don't know the kind of queries you're going to do, but I suppose that this would make it easier to do any sort of search (when it's done in the right way!!!).
Probably something like this sample
product_prices (
prod_code,
price,
date_price_begin
);
sales (
prod_code,
customer_code,
sale_date
);
<aggregate-week>
select trunc(sale_date,'w') as week,
prod_code,
customer_code,
sum(price) keep (dense_rank first order by date_price_start) as price
from sales
natural join product_prices
where sale_date > date_from
group by trunc(sale_date,'iw'),
prod_code,
customer_code
/
<aggregate-month>
select trunc(sale_date,'m') as month,
prod_code,
customer_code,
sum(price) keep (dense_rank first order by date_price_start) as price
from sales
natural join product_prices
where sale_date > date_from
group by trunc(sale_date,'m'),
prod_code,
customer_code
/

Real Time issues: Oracle Performance tuning (types / indexes / plsql / queries)

I am looking for a real time solution...
Below are my DB columns. I am using Oracle10g. Please help me in defining table types / indexes and tuned PLSQL / query (both) for the updates and insertion
Insert and Update queries are simple but here we need to take care of the performance because my system will execute such 200 times per second.
Let me know... should I use procedures or simple queries? It is requested to write tuned plsql and query with proper DB table types / indexes.
I would really like to see the performance of my system after continuous 200 updates per second
DB table (columns) (I can change the structure if required so please let me know...)
Play ID - ID
Type - Song or Message
Count - Summation of total play
Retries - Summation of total play, if failed.
Duration - Total Duration
Last Updated - Late Updated Date Time
Thanks in advance ... let me know in case of any confusion...
You've not really given a lot of detail about WHAT you are updating etc.
As a basis for you to write your update statements, don't use PL/SQL unless you cannot achieve what you want to do in SQL as the context switching alone will hurt your performance before you even get round to processing any records.
If you are able to create indexes specifically for the update then index the columns that will appear in your update statement's WHERE clause so the records can be found quickly before being updated.
As for inserting, look up the benefits of the /*+ append */ hint for inserting records to see if it will benefit your particular case.
Finally, the table structure you will use will depend on may factors that you haven't even begun to touch on with the details you've supplied, I suggest you either do some research on DB structure or ask your DBA's for a 101 class in it.
Best of luck...
EDIT:
In response to:
Play ID - ID ( here id would be song name like abc.wav something..so may be VARCHAR2, yet not decided..whats your openion...is that fine if primary key is of type VARCHAR2....any suggesstions are most welcome...... ) Type - Song or Message ( varchar2) Count - Summation of total play ( Integer) Retries - Summation of total play, if failed. ( Integer) Duration - Total Duration ( Integer) Last Updated - Late Updated Date Time ( DateTime )
There is nothing wrong with having a PRIMARY KEY as a VARCHAR2 data type (though there is often debate about the value of having a non-specific PK, i.e. a sequence). You must, however, ensure your PK is unique, if you can't guarentee this then it would be worth having a sequence as your PK over having to introduce another columnn to maintain uniqueness.
As for declaring your table columns as INTEGER, they eventually will be resolved to NUMBER anyway so I'd just create the table column as a number (unless you have a very specific reason for creating them as INTEGER).
Finally, the DATETIME column, you only need decare it as a DATE datatype unless you need real precision in your time portion, in which case declare it as a TIMESTAMP datatype.
As for helping you with the structure of the table itself (i.e. which columns you want etc.) then that is not something I can help you with as I know nothing of your reporting requirements, application requirements or audit requirements, company best practice, naming conventions etc. I'm afraid that is something for you to decide for yourself.
For performance though, keep indexes to a minumum (i.e. only index columns that will aid your UPDATE WHERE clause search), only update the minimum data possible and, as suggested before, research the APPEND hint for inserts it may help in your case but you will have to test it for yourself.

Resources