Seeking Opinion:Denormalising Fact and Dim tables to improve performance of SSRS Reports - reporting

We seem to have bit of a debate on a discussion point in our team.
We are working on a Data Warehouse in the Microsoft SQL Server 2012 platform. We have followed the Kimball Architecture to build this Data Warehouse.
Issue:
A reporting solution (built on SSRS), which sources data from this Warehouse, has significant performance issues when sourcing data from fact and dim tables. Some of our team members suggest that we extract data from facts and dims into a new set of tables using SSIS packages. This would mean we denormalise these tables into ‘Snapshot’ tables. In this way the we would not need to join these tables to create data sets within the reports. Data could be read out of these tables directly.
I do have my own worries about this; inconsistencies, maintenance of different data structures, duplication of data etc to name a few.
Question:
Would you consider creating snapshot tables (by denormalising facts and dim tables) for reporting tables a right approach?
Would like to hear your thoughts on this.
Cheers
Nithin

I don't think there is anything wrong with snapshot tables. The two most important aspects of a data warehouse are:
The data is correct.
The data is useful.
If your users are unable to extract the totals they require, in a reasonable timescale, they won't use the warehouse.
My own solution includes 3 snapshot tables. Like you, I was worried about inconsistencies. To address this we built an automated checking process. This sub-system executes a series of queries, stored on a network drive, once an hour. Any records returned by the queries are considered a fail. Fails are reported and immediately investigated by my ETL team. This sub-system ensures the snapshots and underlying facts are always aligned and consistent with each other. Drift is prevented.
That said, additional tables equals additional complexity. And that requires more time/effort to manage. Before introducing another layer to your warehouse, you should investigate why these queries are underperforming. If joins are to blame:
Are you using an inappropriate data type, for your P/F keys?
Are the FKeys indexed (some RDBMS do this by default, others do not)?
Have you looked at the execution plans, for the offending queries?
Is the join really to blame, or is it a filter applied to the dim table?

for raw cube performance my advice would be to always try to denormalize your tables and have one fact table and one table for each dimension (star schema).
If you are unsure if it will actually help you could start creating materialized views. These are kind of the best of both worlds, on the long run you should alter your etl.
In my previous job we only had flattened tables which worked quite well. Currenly we have a normalized schema but flatten it in the last step.

Related

Store summary data at master table, instead of deriving it

I am trying to prepare DB design for APEX application. Requirement is as follows.
In Departments IR page, users are asking below columns
Number of employees in each department (Department may or may not have employees)
Primary Location for Department (Department can have multiple addresses and addresses are stored in other table, along with primary flag)
Alternative Manager's Email Address for Department (alt_manager_id column, this is optional column and refers to employees table)
I can implement these requirements using either inline sub queries or using OUTER JIONs. But, these approaches will have performance impact as the data grows (like 100s of thousands of rows). So, my question is, is it ok to store these data directly at "Departments" table and update "Departments" table when child tables gets updated. Basically, I am trying to store summary data at master table, instead of deriving it as on when needed from child tables. Is this considered bad practice? Is it ok to implement such DB design?
Thank you
"Is this considered bad practice?"
Usually yes. There are several problems with maintaining summary detail information in a master record.
Your inserts into child tables (and deletes if you have them) now also have to take a lock on the master record, to increment the count. This adds complexity to what should be simple transactions.
It also has two performance hits: the additional overhead of maintaining the counts and the potential for sessions to hang in multi-user environments.
Note that you are adding a definite performance hit to your insert activity for a possible saving in the performance of aggregating queries.
The good practice is to just run the counts when you need the summaries. Tune the queries if you need to.
If you think you really are going to be querying the summary data often enough for the workload to be a problem you should consider building materialized views for the summary queries. Then, when you enable query rewrites, Oracle will transparently query the materialized view if it can satisfy the query rather than re-running the aggregations. This is a technique which is used a lot in data warehouses, but there's no reason not to use it in OLTP environments if you really have the data volumes to justify it. Find out more.
Generally, try the simplest thing which could work first. Only look to do something different (like building a materialized view for aggregations) when you know you have a demonstrable problem with performance.

How to do real time data ingestion from Transactional tables to a Flat table

We have transaction tables in Oracle and for reporting purposes we need this data transfered in real time to another flat Oracle table in another database. The performance of the report is great with table placed in this flat table.
Currently we are using golden gate for replication to the other database and using materialized view for this but due to some problems we need to switch to some other way of populating/maintaining this flat table. What options do we have?
It is a pretty basic requirement but the solutions I can see are for batch processing. Also if there are any other solutions you feel would better serve this purpose. Changing the target database to something other is also an option as there might be more such reports coming ahead.

Oracle 11g - Building a Type 2 SCD based on existing historical data in a relational model

I'm an ETL developer that's currently being tasked with developing a type 2 SCD from existing historical data in a relational database. I'm perfectly capable of creating a type 2 SCD that's responsible for tracking future changes to the data, but I'm completely useless when it comes to the task at hand.
The relational model is in our ODS . Based on that relational model, I'm supposed to build flat records in our DW dimension. There are multiple attributes which need to be monitored for changes, each in specific related tables in the relational model. Historical changes must be kept on a daily basis, and if multiple changes to the same attribute occur on the same day, only the last subsists.
How can I tackle this? I'm lost. Thanks in advance.
P.S. we're talking tables with 20-30 million rows and multiple attributes that may change at any given time and therefore must result in a new record in the SCD.
This will indeed be painful. I'm assuming from your question that the tables containing the attribute values are currently varying independently (or you wouldn't need to ask the question).
If you have a table 'Table1' containing 'Key', 'Attribute1' and 'Effective From','Effective To' columns, then you can 'explode' that table into a virtual table in the form 'Key','Attribute1','Date', projecting out one row for every date where that attribute was current.
(Note that you probably don't want to do this as a ranged join against your date dimension, because this will be a Triangular Join (ie perform really badly), you probably need to explode the rows in an ETL tool/programmatically)
If you perform this process across multiple tables, you will have a set of tables giving you the full day-by-day snapshot of each attribute for every day that you care about. It's then fairly easy to join those tables based on 'FK' and 'Date' to give you the complete daily snapshot across all of the attribute values.
Then, of course, you need to run this though another process to collapse rows with the same Key, contiguous dates and all the same attribute values, ie 'unexplode' the rows, back into 'effective from','effective to' form. Note again, that this is fundamentally a row-by-row operation (or at very least a windowing function), and a set-based approach will perform very badly. Personally I'd just stream it all though some .net/java code to achieve this.
Given data volumes this will take a while, but should be achievable.

insert data from one table to two tables group by for Oracle

I have a situation where I need a large amount of data (9+ billion per day) data being collected in a loading table that has fields like
-TABLE loader
first_seen,request,type,response,hits
1232036346,mydomain.com,A,203.11.12.1,200
1332036546,ogm.com,A,103.13.12.1,600
1432039646,mydomain.com,A,203.11.12.1,30
that need to split into two tables (de-duplicated)
-TABLE final
request,type,response,hitcount,id
mydomain.com,A,203.11.12.1,230,1
ogm.com,A,103.13.12.1,600,2
and
-TABLE timestamps
id,times_seen
1,1232036346
2,1432036546
1,1432039646
I can create the schemas and do the select like
select request,type,response,sum(hitcount) from loader group by request,type,response;
get data into the final table. for best performance I want to see if I can use "insert all" to move data from the loader to these two tables and perhaps use triggers in the database to try to achieve this. Any ideas and recommendations on the best ways to solve this?
"9+ billion per day"
That's more than just a large number of rows: that's a huge number, and it will require special engineering to handle it.
For starters, you don't just need INSERT statements. The requirement to maintain the count for existing (request,type,response) tuples points to UPDATE too. The need to generate and return a synthetic key is problematic in this scenario. It rules out MERGE, the easiest way of implementing upserts (because the MERGE syntax doesn't support the RETURNING clause).
Beyond that, attempting to handle nine billion rows in a single transaction is a bad idea. How long will it take to process? What happens if it fails halfway through? You need to define a more granular unit of work.
Although, that raises some business issues. What do the users only want to see the whole picture, after the Close-Of-Day? Or would they derive benefit from seeing Intra-day results? If yes, how to distinguish Intra-day from Close-Of-Day results? If no, how to hide partially processed results whilst the rest is still in flight? Also, how soon after Close-Of-Day do they want to see those totals?
Then there are the architectural considerations. These figure mean processing over one hundred thousand (one lakh) rows every second. That requires serious crunch and expensive licensing extras. Obviously Enterprise Edition for parallel processing but also Partitioning and perhaps RAC options.
By now you should have an inkling why nobody answered your question straight-away. This is a consultancy gig not a StackOverflow question.
But let's sketch a solution.
We must have continuous processing of incoming raw data. So we stream records for loading into FINAL and TIMESTAMP tables alongside the LOADER table, which becomes an audit of the raw data (or else perhaps we get rid of the LOADER table altogether).
We need to batch the incoming records to leverage set-based operations. Depending on the synthetic key implementation we should aim for pure SQL, otherwise Bulk PL/SQL.
Keeping the thing going is vital so we need to pay attention to Bulk Error Handling.
Ideally the target tables can be partitioned, so we can load into offline tables and use Partition Exchange to bring the cleaned data online.
For the synthetic key I would be tempted to use a hash key based on the (request,type,response) tuple rather than a sequence, as that would give us the option to load TIMESTAMP and FINAL independently. (Collisions are extremely unlikely.)
Just to be clear, this is a bagatelle not a serious architecture. You need to experiment and benchmark various approaches against realistic volumes of data on Production-equivalent hardware.

Disadvantages of consolidating databases?

In an organization that has two applications each with its own Oracle database instance, what are the disadvantages of consolidating the two databases into one database with two schemas?
Backups and replicating the database are bigger and slower, probably. What else?
Some background:
The two databases are the "gold source" for their respective data. Each is critical to the operation of the organization and each is actually used by several appliations, tools, and reports (but each database is principally "owned" by one application). The need to join data across the databases, to relate entities in one to entities in the other, comes up frequently. For this reason there are DB links connecting the two and some cross-database materialized views to help with performance. There is an effort underway to reduce data duplication and these materialized views are under discussion. Some in the organization want to phase out DB links and materialized views and introduce more web services to make the data available across applications. My concern is that there are too many situations that require complex joins of data across the two databases so services that expose the data won't perform. Another approach for reducing DB links and materialized views is to consolidate the schemas into one database, but I want to make sure I'm not forgetting any critical disadvantages to that approach.
In a single consolidated database, you will lose some flexibility from a DBA point of view:
A database obviously can have only one version (10.2.0.5 for example), which means that upgrades and patches will affect all schemas -- this may be a bad thing in case of multiple vendor app requirement mismatch.
Similarly, some administrative tasks (restore database A to point in time t) may be more complicated with a single database.
Overall, you will have less administration tasks (a single backup, single patching...) but each task will be more critical since they will have a global effect.
On the development side, beware of namespace collisions: some features are global over a single database, for example:
directories,
public synonyms,
DB link
Schemas
This means that you will have some work to do if you want to consolidate two databases that have public synonyms with the same name that points to two different things.
Could have something to do with licence costs - scaling up vs. scaling out.
The biggest concern I would have is that all your code will need to be rewritten to account for the new database and schemas. Or at least looked at. This courl introduce new bugs. I don't know how Oracle handles refernces to different databases, so I'll use an example of what I mean using SQL Server syntax. If I was joining to two tables onthe same server in different databases my select would be something like this:
SELECT a.field1, b.field2
FROM database1.dbo.table1 a
JOIN database2.dbo.table2 b
ON a.myid = b.myFK
To go your your new consolidated idea, you would want to write:
SELECT a.field1, b.field2
FROM schema1.table1 a
JOIN schema2.table2 b
ON a.myid = b.myFK
You will need to be especially careful of any tables that have the same name in both databases now, this could cause some sneaky bugs.
Note these are not difficult changes but all SQL hitting your database would have to be examined to see if it will work or adjusted if not.
I'm not sure if just putting them in the same database would do it either. You might need to consolidate some tables to avoid the duplication across applications. (In this case add fields to reference the old id numbers for things people are used to looking up by id like person_id that may appear on old paperwork, so they can be researched) This is a fairly major rewrite with all the attendant possibilites to make things worse due to new bugs.
If you go down this path, I highly recommend that you read a book on refactoring datbases before you decide how to design.
its hard to tell by just the information provided, big in db world would be 100gb or more, so 2 dbs would be 200GB. if both db are not bigger than 100GB then size should not be a huge factor in the decision, replication and sync can be done on changes only and backups should not be a big difference (again this depends on specifics such as when backups are done or if downtime is possible or backups are done during non-peak times)
Other than that other factors are:
naming collisions in dbo's such as keys, foreign key names, table names, etc. some renaming of tables, store procedures names too.

Resources