Currently, I'm involved in a warehouse-based intelligent transaction analysis system for a bank, covering customer churn behavior, fraud detection, and CRM analysis. We're using Oracle as the database, and it's entirely a data warehousing project, with data mining algorithms used for the analysis.
We have records for about 1,000 customers of the bank. For modeling, is it better to use a star schema, a snowflake schema, or a constellation schema? I know the basic difference between the star and snowflake schemas: in a snowflake schema the dimension tables are normalized (a.k.a. snowflaking), which can make joins problematic in a large database.
So, which schema would be better for my case? Answers from experienced programmers involved in data warehousing are very welcome!
Thanks in advance!
In brief, my assumption going into a project like this would be that a star schema is appropriate. I might revisit that if a dimension grew too large to full-scan efficiently and snowflaking would meaningfully improve the queries against it, unless that dimension joins to the fact table on a partitioning key, because it is difficult to apply partition pruning based on a predicate placed on a snowflaked dimension.
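For illustration only, a minimal star schema for a transaction analysis case like this might look roughly like the sketch below; all table and column names are hypothetical, not taken from the actual project, and the Oracle types are just an assumption based on the database mentioned.

```sql
-- Hypothetical star schema: one fact table joined to flat, denormalized dimensions.
CREATE TABLE dim_customer (
    customer_key   INTEGER PRIMARY KEY,   -- surrogate key
    customer_id    VARCHAR2(20),          -- natural/business key from the source system
    customer_name  VARCHAR2(100),
    segment        VARCHAR2(30),
    branch_name    VARCHAR2(50),          -- branch attributes rolled up into the dimension
    branch_region  VARCHAR2(50)           -- (no snowflaking into a separate branch table)
);

CREATE TABLE dim_date (
    date_key       INTEGER PRIMARY KEY,   -- e.g. 20240131
    calendar_date  DATE,
    month_name     VARCHAR2(10),
    quarter_name   VARCHAR2(2),
    year_num       INTEGER
);

CREATE TABLE fact_transaction (
    date_key       INTEGER REFERENCES dim_date (date_key),
    customer_key   INTEGER REFERENCES dim_customer (customer_key),
    txn_amount     NUMBER(12,2),
    txn_count      INTEGER
);
```

With only about 1,000 customers the dimensions are tiny anyway, which is another argument against snowflaking them.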
In this post I am not asking for tutorials or how to do something. I am asking for your help: could someone explain to me, in simple words, what a DWH (data warehouse) is and what ETL is?
Of course, I've Googled and YouTubed a lot and found many articles and videos, but I'm still not really sure what they are.
Why am I asking?
I need to understand it well before applying for a job.
This answer should by no means be treated as a complete definition of a data warehouse. It's only my attempt to explain the term in layman's terms.
Transactional (operational, OLTP) and analytical (data warehouses) systems can both use the same RDBMS as the back-end and they may contain exactly the same data. However, their data models will be completely different, because they are optimized for different access patterns.
In transactional systems you usually work with a single row (e.g. a customer or an invoice) and write consistency is crucial, so the data model is normalized. In contrast, data warehouses are optimized for reading large numbers of rows (e.g. all invoices from the previous year) and aggregating data, so dimensional models are flattened (star schema, Kimball's dimensions and facts).
Transactional systems store only the current version of an entity (i.e. a customer's current address), while data warehouses may use slowly changing dimensions (SCD) to preserve history (e.g. all addresses of the customer, with date ranges indicating when each of them was valid).
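As a rough illustration of the SCD idea (a type 2 dimension, with invented table and column names), the address history might be kept like this:

```sql
-- Hypothetical SCD type 2 customer dimension: history is preserved via date ranges.
CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY,  -- surrogate key, a new one per version
    customer_id   INTEGER,              -- natural key, the same across all versions
    address       VARCHAR(200),
    valid_from    DATE,
    valid_to      DATE,                 -- e.g. 9999-12-31 for the current version
    is_current    CHAR(1)
);

-- The OLTP system holds only the latest address; the warehouse keeps all of them, e.g.:
--   42 | '12 Oak St' | 2015-03-01 | 2019-06-30 | N
--   42 | '7 Elm Ave' | 2019-07-01 | 9999-12-31 | Y
```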
ETL stands for extract, transform, load and it is the procedure of:
extracting data from a transactional system,
transforming it into dimensional format,
loading it into a data warehouse.
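A very rough sketch of the transform and load steps expressed as SQL (all schema and table names are made up; in practice this is usually done by a dedicated ETL tool or scheduled jobs):

```sql
-- Flatten normalized OLTP tables into a denormalized dimension, then load the fact table.
INSERT INTO dw.dim_customer (customer_key, customer_name, country_name)
SELECT c.customer_id, c.name, co.name
FROM   oltp.customer c
JOIN   oltp.country  co ON co.country_id = c.country_id;

INSERT INTO dw.fact_invoice (date_key, customer_key, invoice_amount)
SELECT TO_NUMBER(TO_CHAR(i.invoice_date, 'YYYYMMDD')),  -- transform the date into a date key
       i.customer_id,
       i.total_amount
FROM   oltp.invoice i
WHERE  i.invoice_date >= DATE '2023-01-01';              -- example incremental load window
```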
I did a bit of R&D on fact tables, specifically whether they are normalized or denormalized.
I came across some findings which left me confused.
According to Kimball:
Dimensional models combine normalized and denormalized table structures. The dimension tables of descriptive information are highly denormalized with detailed and hierarchical roll-up attributes in the same table. Meanwhile, the fact tables with performance metrics are typically normalized. While we advise against a fully normalized schema with snowflaked dimension attributes in separate tables (creating blizzard-like conditions for the business user), a single denormalized big wide table containing both metrics and descriptions in the same table is also ill-advised.
The other finding, which I also think is OK, is by fazalhp at GeekInterview:
The main idea of a DW is de-normalizing the data for faster access by the reporting tool... so if you're building a DW, 90% of it has to be de-normalized, and of course the fact table has to be de-normalized...
So my question is: are fact tables normalized or denormalized? Either way, how and why?
From the point of view of relational database design theory, dimension tables are usually in 2NF and fact tables anywhere between 2NF and 6NF.
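To make the contrast concrete (a sketch with invented names, not taken from either of the sources quoted above): the fact table carries only foreign keys and measures, while the dimension deliberately repeats descriptive attributes instead of splitting them into separate tables.

```sql
-- Fact table: essentially normalized, just keys plus additive measures.
CREATE TABLE fact_sales (
    date_key     INTEGER NOT NULL,
    product_key  INTEGER NOT NULL,
    store_key    INTEGER NOT NULL,
    quantity     INTEGER,
    sales_amount DECIMAL(12,2)
);

-- Dimension table: deliberately denormalized; category and department are repeated
-- on every product row rather than snowflaked into their own tables.
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name VARCHAR(100),
    category     VARCHAR(50),
    department   VARCHAR(50)
);
```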
However, dimensional modelling is a methodology unto itself, tailored to:
one use case, namely reporting
mostly one basic type (pattern) of a query
one main user category -- business analyst, or similar
row-store RDBMSs like Oracle, SQL Server, Postgres ...
one independently controlled load/update process (ETL); all other clients are read-only
There are other DW design methodologies out there, like
Inmon's -- data structure driven
Data Vault -- data structure driven
Anchor modelling -- schema evolution driven
The main thing is not to mix up database design theory with a specific design methodology. You may look at a certain methodology through the lens of database design theory, but you have to study each methodology separately.
Most people working with a data warehouse are familiar with transactional RDBMSs and various levels of normalization, so those concepts tend to be used to describe working with a star schema. What the dimensional-modeling literature is really trying to do is get you to unlearn all those normalization habits. This can get confusing because there is a tendency to focus on what "not" to do.
The fact table(s) will probably be the most normalized, since they usually contain just numeric values along with various IDs for linking to dimensions. The key decision with fact tables is how granular you need to get with your data. An example for purchases could be individual line items by product within an order, or data aggregated at a daily, weekly, or monthly level.
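As a hedged example of that granularity choice (hypothetical tables), the same purchase data could be kept at line-item grain or pre-aggregated to a daily grain, and the two fact tables look quite different:

```sql
-- Grain 1: one row per order line (most detailed, largest, most flexible).
CREATE TABLE fact_order_line (
    order_id    INTEGER,
    product_key INTEGER,
    date_key    INTEGER,
    quantity    INTEGER,
    line_amount DECIMAL(12,2)
);

-- Grain 2: one row per product per day (pre-aggregated, much smaller, less flexible).
CREATE TABLE fact_daily_product_sales (
    date_key     INTEGER,
    product_key  INTEGER,
    total_qty    INTEGER,
    total_amount DECIMAL(12,2)
);
```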
My suggestion is to keep searching and studying how to design a warehouse based on your needs. Don't aim for high normal forms; think more about the reports you want to generate and the analysis capabilities you want to give your users.
Are there any real-life (non-academic), public (open-source or free) examples of a semantic database like Metalog being used to solve a computing problem that would traditionally have been handled with a relational database?
Semantic databases work much better if only part of your data follows a schema.
If you need additional columns in a semantic database, you just add them. Even for single rows. This is hard or inefficient in a relational database.
Clustering is also much simpler with semantic or tuple databases; most often it just means installing the database on N servers and setting a few config options.
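A rough relational analogy (not how a semantic store is actually implemented, just a way to picture it) shows why per-row attributes are cheap there: everything is stored as subject/predicate/object triples, so "adding a column" for a single entity is just inserting one more row.

```sql
-- Relational sketch of a triple store: one row per (entity, attribute, value).
CREATE TABLE triples (
    subject   VARCHAR(100),
    predicate VARCHAR(100),
    object    VARCHAR(200)
);

INSERT INTO triples VALUES ('customer:42', 'name', 'Alice');
INSERT INTO triples VALUES ('customer:42', 'city', 'Oslo');
-- An "extra column" for just this one entity: no ALTER TABLE needed.
INSERT INTO triples VALUES ('customer:42', 'twitter', '@alice');
```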
I am currently trying to improve the performance of a web application. The goal of the application is to provide (real-time) analytics. We have a database model that is similar to a star schema: a few fact tables and many dimension tables. The database runs on MySQL with the MyISAM engine.
The fact tables can easily grow into the high millions of rows, and some dimension tables can also reach the millions.
Now, the point is that SELECT queries can get awfully slow when the dimension tables are joined to the fact tables and aggregations are performed. The first thing that comes to mind is: why not precalculate the data? That is not possible because users are allowed to apply several freely customizable filters.
So what I need is an all-in-one system suitable for every purpose ;) Sadly, that hasn't been invented yet. So I came up with the idea of combining two existing systems: mixing a row-oriented and a column-oriented database (e.g. InfiniDB or Infobright). I would keep the MySQL/MyISAM solution (for fast inserts and row-based queries), add a column-oriented database (for fast aggregations over a few columns), and fill it periodically (nightly) via a cron job. The problem is that when the current data is queried (it must be real time), I might need to fetch data from both databases, which can complicate things.
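Under the assumptions of that setup, the "query both" part might look roughly like the sketch below: historical data comes from the nightly-loaded column store and today's rows from MyISAM. In practice the two halves would probably be run against the two databases separately and merged in the application; the single statement (with invented schema and table names) just shows the shape of the split.

```sql
-- Sketch: combine pre-loaded history (column store) with today's live rows (row store).
SELECT region, SUM(amount) AS total
FROM (
    SELECT region, amount
    FROM   columnstore.fact_sales        -- loaded nightly by the cron job
    WHERE  sale_date < CURRENT_DATE
    UNION ALL
    SELECT region, amount
    FROM   rowstore.fact_sales_live      -- today's rows, still only in MyISAM
    WHERE  sale_date >= CURRENT_DATE
) combined
GROUP BY region;
```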
First tests with InfiniDB showed really good performance for aggregations over a few columns, so I really think this could help speed up the application.
So the question is: is this a good idea? Has somebody already done this? Maybe there are better ways to do it.
I have no experience with column-oriented databases yet, and I'm also not sure what their schema should look like. First tests showed good performance both with the same star-schema-like structure and with one big, wide table.
I hope this question fits on SO.
Greenplum, which is a proprietary (but mostly free-as-in-beer) extension to PostgreSQL, supports both column-oriented and row-oriented tables with highly customizable compression. Further, you can mix settings within the same table if you expect that some parts will experience heavy transactional load while others won't. E.g., you could have the most recent year be row-oriented and uncompressed, the prior year column-oriented and QuickLZ-compressed, and all historical years column-oriented and bz2-compressed.
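From memory (so treat the exact option names as assumptions and check the Greenplum docs), the per-partition storage mixing looks roughly like this; I've used zlib and QuickLZ here since those are the compression types I'm reasonably sure of:

```sql
-- Sketch of a Greenplum table with per-partition storage settings.
CREATE TABLE sales (
    sale_id   BIGINT,
    sale_date DATE,
    amount    NUMERIC(12,2)
)
DISTRIBUTED BY (sale_id)
PARTITION BY RANGE (sale_date)
(
    -- Historical years: column-oriented, heavily compressed.
    PARTITION p_hist   START (DATE '2000-01-01') END (DATE '2010-01-01')
        WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=9),
    -- Prior year: column-oriented, light and fast compression.
    PARTITION p_prior  START (DATE '2010-01-01') END (DATE '2011-01-01')
        WITH (appendonly=true, orientation=column, compresstype=quicklz),
    -- Most recent year: plain row-oriented heap (the default), friendlier to writes.
    PARTITION p_recent START (DATE '2011-01-01') END (DATE '2012-01-01')
);
```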
Greenplum is free for use on individual servers, but if you need to scale out with its MPP features (which are its primary selling point) it does cost significant amounts of money, as they're targeting large enterprise customers.
(Disclaimer: I've dealt with Greenplum professionally, but only in the context of evaluating their software for purchase.)
As for the issue of how to set up the schema, it's hard to say much without knowing the particulars of your data, but in general having compressed column-oriented tables should make all of your intuitions about schema design go out the window.
In particular, normalization is almost never worth the effort, and you can sometimes get big gains in performance by denormalizing to borderline-comical levels of redundancy. If the data never hits disk in an uncompressed state, you might just not care that you're repeating each customer's name 40,000 times. Infobright's compression algorithms are designed specifically for this sort of application, and it's not uncommon at all to end up with 40-to-1 ratios between the logical and physical sizes of your tables.
What are the best practices for database design and normalization for high traffic websites like stackoverflow?
Should one use a normalized database for record keeping, a denormalized one, or a combination of both?
Is it sensible to design a normalized database as the main database for record keeping, to reduce redundancy, and at the same time maintain another, denormalized form of the database for fast searching (roughly as sketched after this question)?
or
Should the main database be denormalized but with normalized views at the application level for fast database operations?
or some other approach?
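One concrete way to get that first combination, shown here with Oracle-style syntax as just one possible example (all table and view names are invented): keep the record-keeping schema normalized and maintain a denormalized, read-only copy for searching and reporting, e.g. a materialized view refreshed on a schedule.

```sql
-- The normalized tables stay the system of record; reports and searches hit this
-- flattened materialized view instead of joining the base tables at request time.
CREATE MATERIALIZED VIEW mv_order_search
REFRESH COMPLETE START WITH SYSDATE NEXT SYSDATE + 1/24   -- hourly refresh (example)
AS
SELECT o.order_id,
       o.order_date,
       c.name      AS customer_name,
       c.city      AS customer_city,
       p.name      AS product_name,
       ol.quantity,
       ol.line_amount
FROM   orders o
JOIN   customers   c  ON c.customer_id = o.customer_id
JOIN   order_lines ol ON ol.order_id   = o.order_id
JOIN   products    p  ON p.product_id  = ol.product_id;
```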
The performance hit of joining is frequently overestimated. Database products like Oracle are built to join very efficiently. Joins are often regarded as performing badly when the real culprit is a poor data model or a poor indexing strategy. People also forget that denormalised databases perform very badly when it comes to inserting or updating data.
The key thing to bear in mind is the type of application you're building. Most of the famous websites are not like regular enterprise applications. That's why Google, Facebook, etc don't use relational databases. There's been a lot of discussion of this topic recently, which I have blogged about.
So if you're building a website which is primarily about delivering shedloads of semi-structured content you probably don't want to be using a relational database, denormalised or otherwise. But if you're building a highly transactional website (such as an online bank) you need a design which guarantees data security and integrity, and does so well. That means a relational database in at least third normal form.
Denormalizing the db to reduce the number of joins needed for intense queries is one of many different ways of scaling. Having to do fewer joins means less heavy lifting by the db, and disk is cheap.
That said, for ridiculous amounts of traffic, good relational db performance can be hard to achieve. That is why many bigger sites use key-value stores (e.g. memcached) and other caching mechanisms.
The Art of Capacity Planning is pretty good.
You can listen to a discussion on this very topic by the creators of Stack Overflow on their podcast at:
http://itc.conversationsnetwork.org/shows/detail3993.html
First: define for yourself what high-traffic means:
50,000 page views per day?
500,000 page views per day?
5,000,000 page views per day?
More?
Then calculate that down to probable peak page views per minute and per second.
After that, think about the data you want to query per page view. Is the data cacheable? How dynamic is it, and how big is it?
Analyze your individual requirements, write some code, do some load testing, optimize. In most cases, before you need to scale out the database servers, you need to scale out the web servers.
A relational database, if properly optimized, can be amazingly fast when joining tables!
A relational database used as a back-end may only need to be hit occasionally, e.g. to populate a cache or fill some denormalized data tables. I would not make denormalization the default approach.
(You mentioned search: look into e.g. Lucene or something similar if you need full-text search.)
The best best-practice answer is definitely: It depends ;-)
For a project I'm working on, we've gone down the denormalized-table route, as we expect our major tables to have a high ratio of writes to reads (instead of all users hitting the same tables, we've denormalized them and set each "user set" to use a particular shard). You may find it useful to read http://highscalability.com/ for examples of how the "big sites" cope with the volume; Stack Overflow was recently featured.
Neither matters if you aren't caching properly.