Having a bit of trouble getting my head round this.
I have three models - Sectors, Industries, Companies.
Companies are the viewable resource and are organised into Sectors and Industries.
Sectors contain Industries. Industries contain companies.
Previously this was achieved by a table column containing comma separated values of the industry and sector IDs - tacky, I know.
I'm now using a pivot table (company_industry) along with a bi-directional 'belongsToMany' relationship between the company and industry models.
That works fine for a single-tier organising system. But when I come to add Sectors as a parent to Industries, that's when my brain explodes.
I wonder if anyone recognises this problem and can share with me a good resource to explain a best practice resolution.
Thank you kindly.
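For what it's worth, the shape being described is a one-to-many from Sector to Industry (assuming each Industry sits in exactly one Sector) plus a many-to-many between Industry and Company backed by the company_industry pivot. A minimal in-memory sketch of those relationships, in C++ purely to make the structure concrete rather than as Laravel code (all names here are illustrative):

#include <string>
#include <vector>

struct Company { std::string name; };

// Many-to-many with Company (what the company_industry pivot represents).
struct Industry {
    std::string name;
    std::vector<Company *> companies;
};

// One-to-many: a Sector owns its Industries (assumed; could also be a second pivot).
struct Sector {
    std::string name;
    std::vector<Industry *> industries;
};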
Fairly new to data warehouse design and star schemas. We have designed a fact table which is storing various measures about Memberships, our grain is daily and some of the measures in this table are things like qty sold new, qty sold renewing, qty active, qty cancelled.
My question is this: the business will want to see the measures at other grains such as monthly, quarterly, and yearly, so would the typical approach here just be to aggregate the day-level data for whatever time period is needed, or would you recommend creating separate fact tables for the "key" time periods in our business requirements, e.g. monthly, quarterly, yearly? I have read some mixed information on this, which is mainly why I'm seeking others' views.
Some of the information I read had people embedding a hierarchy in the fact table to designate different grains, identified via a "level" type column. That was advised against by quite a few people and didn't seem good to me either; those advising against it were suggesting separate fact tables per grain. To be honest, though, I don't see why we wouldn't just aggregate from the daily entries we have. What benefits would we get from a fact table for each grain, other than perhaps some slight performance improvements?
Each DataMart will have its own "perspective", which may require an aggregated fact grain.
Star schema modeling is a "top-down" process, where you start from a set of questions or use cases and build a schema that makes those questions easy to answer, not a "bottom-up" process where you start with the source data and figure out the schema design from there.
You may end up with multiple data marts that share the same granular fact table, but which need to aggregate it in different ways, either for performance, or to calculate and store a measure that only makes sense at the aggregated grain.
E.g.
SalesFact (store, day, customer, product, quantity, price, cost)
and
StoreSalesFact (store, week, revenue, payroll_expense, last_year_revenue)
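Conceptually, serving a monthly or quarterly view from the daily grain is just a keyed aggregation over the daily rows, whether the engine does it in SQL or you do it in code. A minimal sketch of that roll-up, with invented column names, just to make the idea concrete:

#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical daily-grain fact row (column names invented for illustration).
struct MembershipDailyFact {
    std::string date;            // "YYYY-MM-DD"
    int64_t qty_sold_new;
    int64_t qty_sold_renewing;
};

// Roll the daily grain up to a monthly grain at query time.
std::map<std::string, int64_t> newMembershipsByMonth(const std::vector<MembershipDailyFact> &rows)
{
    std::map<std::string, int64_t> byMonth;
    for (const auto &r : rows)
        byMonth[r.date.substr(0, 7)] += r.qty_sold_new;   // "YYYY-MM" bucket
    return byMonth;
}

A separate pre-aggregated fact table mostly buys you that roll-up done once at load time, plus a home for measures that only exist at the coarser grain (like last_year_revenue in the example above).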
Could you please explain whether it is possible to create an analysis with only one fact table?
I have one fact table in the physical and business layers. It has all the columns I need.
I've tried to create an analysis: I added the months column to the horizontal axis and sum(sale_num) to the vertical axis from the fact table, expecting to see a chart, but nothing happened, and the query that OBI performs doesn't have any GROUP BY.
Yes you can but you have to stick to the ground rules of dimensional analytics: Facts contain measures. Dimensions contain everything else. Facts do NOT contain attributes!
You simply model one logical fact and one logical dimension on your physical table. If you don't do weird things you don't even need to alias the physical table. It becomes the source of both your logical fact and logical dimension.
As long as you stick to the basic rules of dimensional modeling everything will work fine.
Let's say I have the following situation:
A dimension Product with some attributes that aren't volatile (Description and Diameter, which can only be changed by an SCD-1 change for correction) and an attribute that can be volatile (Selling Group, which can change over time for the same product).
So, when a change occurs in these volatile attributes of one product, I need to somehow track them.
I have come up with these two approaches:
For both: keep using SCD-1 for non-volatile attributes.
Approach #1: Use SCD-2 in product_dim only for volatile attributes.
Approach #2: Make Selling Group a whole new dimension, and every sale will track the current value at the moment of ETL. No need for SCD-2 here.
I am new to data warehousing and I'm trying to understand which is better and why. One of my aims is to use OLAP software to read all of this.
It all comes down to the business needs of your model. I don't know the business well enough from your question, but as a rule of thumb, if you want to do analysis by Selling Group (e.g. total quantity of all products sold by Selling Group X), then you should create it as a separate dimension. So in this case approach #2 is correct.
Considering general concepts, and assuming a selling group is some kind of grouping of products, it doesn't make sense to have it as an attribute of a product.
If you want to learn more about dimensional modelling, I'd suggest looking into Ralph Kimball's work if you haven't done so yet. An excellent resource is his book The Data Warehouse Toolkit, which covers your question and many more techniques. It's a nice tool to have on your desk when questions like this pop up; most experienced data modellers have a copy to consult every now and then.
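To make the two approaches above concrete, here is a rough sketch of the row shapes each one implies; the column names are assumptions for illustration, not a prescribed design:

#include <string>

// Approach #1: SCD-2 on product_dim, but only for the volatile attribute.
// A change in Selling Group closes the current row and opens a new one.
struct ProductDimRow {
    long        product_key;     // surrogate key, a new value per version
    std::string product_id;      // natural/business key
    std::string description;     // SCD-1: overwritten on correction
    double      diameter;        // SCD-1
    std::string selling_group;   // SCD-2: versioned
    std::string valid_from;
    std::string valid_to;        // e.g. "9999-12-31" for the current version
};

// Approach #2: Selling Group as its own dimension; the fact row captures
// the group that was current at ETL time.
struct SalesFactRow {
    long product_key;
    long selling_group_key;      // FK to a separate selling_group_dim
    long quantity;
};

With approach #2 the product dimension stays pure SCD-1, and the history of group membership is carried implicitly by the facts themselves.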
I did a bit of research on fact tables, as to whether they are normalized or de-normalized.
I came across some findings which left me confused.
According to Kimball:
Dimensional models combine normalized and denormalized table structures. The dimension tables of descriptive information are highly denormalized with detailed and hierarchical roll-up attributes in the same table. Meanwhile, the fact tables with performance metrics are typically normalized. While we advise against a fully normalized with snowflaked dimension attributes in separate tables (creating blizzard-like conditions for the business user), a single denormalized big wide table containing both metrics and descriptions in the same table is also ill-advised.
The other finding, which I also think is reasonable, is by fazalhp at GeekInterview:
The main funda of DW is de-normalizing the data for faster access by the reporting tool...so if ur building a DW ..90% it has to be de-normalized and off course the fact table has to be de normalized...
So my question is: are fact tables normalized or de-normalized? If either, then how and why?
From the point of view of relational database design theory, dimension tables are usually in 2NF and fact tables anywhere between 2NF and 6NF.
However, dimensional modelling is a methodology unto itself, tailored to:
one use case, namely reporting
mostly one basic type (pattern) of query
one main user category -- business analysts, or similar
row-store RDBMSs like Oracle, SQL Server, Postgres ...
one independently controlled load/update process (ETL); all other clients are read-only
There are other DW design methodologies out there, like
Inmon's -- data structure driven
Data Vault -- data structure driven
Anchor modelling -- schema evolution driven
The main thing is not to mix up database design theory with a specific design methodology. You may look at a certain methodology through the perspective of database design theory, but you have to study each methodology separately.
Most people working with a data warehouse are familiar with transactional RDBMSs and apply various levels of normalization, so those concepts get used to describe working with a star schema. Those descriptions are really trying to get you to unlearn all those normalization habits. This can get confusing because there is a tendency to focus on what not to do.
The fact table(s) will probably be the most normalized, since they usually contain just numerical values along with various IDs for linking to dimensions. The key with fact tables is how granular you need to get with your data. An example for purchases could be specific line items by product in an order, or data aggregated at a daily, weekly, or monthly level.
My suggestion is to keep searching and studying how to design a warehouse based on your needs. Don't aim for high levels of normalization; think more about the reports you want to generate and the analysis capabilities you want to give your users.
I am looking for a way to search in an efficient way for data in a huge multi-dimensional matrix.
My application contains data that is characterized by multiple dimensions. Imagine keeping data about all sales in a company (my application is totally different, but this is just to demonstrate the problem). Every sale is characterized by:
the product that is being sold
the customer that bought the product
the day on which it has been sold
the employee that sold the product
the payment method
the quantity sold
I have millions of sales, done on thousands of products, by hundreds of employees, on lots of days.
I need a fast way to calculate e.g.:
the total quantity sold by an employee on a certain day
the total quantity bought by a customer
the total quantity of a product paid by credit card
...
I need to store the data in the most detailed way, and I could use a map where the key is the combination of all dimensions, like this:
#include <map>
#include <tuple>
struct Product; struct Customer; struct Day; struct Employee; struct Payment; // forward declarations
struct Combination
{
    Product  *product;
    Customer *customer;
    Day      *day;
    Employee *employee;
    Payment  *payment;
    // std::map needs a strict weak ordering on its key type
    bool operator<(const Combination &o) const
    {
        return std::tie(product, customer, day, employee, payment)
             < std::tie(o.product, o.customer, o.day, o.employee, o.payment);
    }
};
std::map<Combination, double> data;   // quantity per combination
But since I don't know beforehand which queries are performed, I need multiple combination classes (where the data members are in different order) or maps with different comparison functions (using a different sequence to sort on).
Possibly, the problem could be simplified by giving each product, customer, ... a number instead of a pointer to it, but even then I end up with lots of memory.
Are there any data structures that could help in handling this kind of efficient searches?
EDIT:
Just to clarify some things: On disk my data is stored in a database, so I'm not looking for ways to change this.
The problem is that to perform my complex mathematical calculations, I have all this data in memory, and I need an efficient way to search this data in memory.
Could an in-memory database help? Maybe, but I fear that an in-memory database might have a serious impact on memory consumption and on performance, so I'm looking for better alternatives.
EDIT (2):
Some more clarifications: my application will perform simulations on the data, and in the end the user is free to save this data into my database or not. So the data itself changes all the time. While performing these simulations, as the data changes, I need to query the data as explained before.
So again, simply querying the database is not an option. I really need (complex?) in-memory data structures.
EDIT: to replace earlier answer.
Can you imagine any other possible choice besides running qsort() on that giant array of structs? There's just no other way that I can see. Maybe you can sort it just once at time zero and keep it sorted as you do dynamic insertions/deletions of entries.
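To make the "sort once and keep it sorted" idea concrete: with the dimension pointers replaced by integer ids (as the question already considers), the detailed records can live in a std::vector sorted by the dimensions you query most often, and each aggregate becomes a binary search plus a short scan. A rough sketch, where the record layout and field names are assumptions:

#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

struct Sale {                        // assumed flattened record with integer ids
    int32_t product, customer, day, employee, payment;
    int64_t quantity;
};

// Order by (employee, day); keep the vector sorted under insertions/deletions.
bool byEmployeeDay(const Sale &a, const Sale &b)
{
    return std::tie(a.employee, a.day) < std::tie(b.employee, b.day);
}

// Total quantity sold by one employee on one day: binary search, then a short scan.
int64_t totalForEmployeeDay(const std::vector<Sale> &sorted, int32_t employee, int32_t day)
{
    Sale probe{0, 0, day, employee, 0, 0};
    auto range = std::equal_range(sorted.begin(), sorted.end(), probe, byEmployeeDay);
    int64_t total = 0;
    for (auto it = range.first; it != range.second; ++it)
        total += it->quantity;
    return total;
}

For the other query shapes you would either keep additional index vectors sorted by different dimension orders, or fall back to a linear scan; which mix is worth it depends on how often the data changes between queries.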
Using a database (in-memory or not) to work with your data seems like the right way to do this.
If you don't want to do that, you don't have to implement lots of combination classes; just use a collection that can hold any of the objects.
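As a baseline for that "one collection" approach, a single std::vector of records plus a linear scan with a predicate can answer any ad hoc aggregate, and is often fast enough for a few million in-memory rows; only the queries that prove too slow need a dedicated index. A small sketch, again with assumed field names:

#include <cstdint>
#include <numeric>
#include <vector>

struct Sale {                        // assumed flattened record with integer ids
    int32_t product, customer, day, employee, payment;
    int64_t quantity;
};

// Ad hoc aggregate: total quantity bought by one customer, via a single linear scan.
int64_t quantityBoughtBy(const std::vector<Sale> &sales, int32_t customer)
{
    return std::accumulate(sales.begin(), sales.end(), int64_t{0},
                           [customer](int64_t sum, const Sale &s) {
                               return s.customer == customer ? sum + s.quantity : sum;
                           });
}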