Star Schema: How are the fact table aggregations performed? - oracle

https://web.stanford.edu/dept/itss/docs/oracle/10g/olap.101/b10333/globdiag.gif
Assume that we have a star schema as above.
My question is: in real time, how do we populate the (unit_price, unit_cost) columns of the fact table?
Can anyone provide me star schema tables with real data?
I am having a hard time understanding the star schema...
Please help!

A star schema consists of two types of tables: fact tables and dimensions.
The idea of the star design is that you can split your data into two parts.
The static part is described by the dimensions and the dynamic part (= the transactions) by the fact table.
Each transaction is stored in the fact table as a new record and is connected to the surrounding dimensions, which define the context of the transaction.
The example in the link contains two fact tables: SHIPMENTS and PRODUCT_CONDITIONS.
Note that the fact tables in the link are actually named UNITS_HISTORY_FACT and PRICE_AND_COST_HISTORY_FACT, but I don't find those names the best choice.
The SHIPMENTS table stores one record for each shipment of a PRODUCT to a CUSTOMER at some TIME, via a defined CHANNEL.
All the above information is defined using the corresponding keys of the respective dimensions.
The fact table also contains MEASURES describing the attributes of the transaction, here the number of UNITS shipped.
The structure of the fact table would therefore be (a SQL sketch follows the list):
CUSTOMER_ID
PRODUCT_ID
TIME_ID
CHANNEL_ID
UNITS
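To make that concrete, here is a minimal SQL sketch of such a fact table and one possible aggregation over it. The dimension table and column names (CUSTOMER_DIM, PRODUCT_DIM, product_name, calendar_month, ...) are assumptions for illustration, not the exact definitions from the Oracle sample schema:
-- Sketch only: the referenced dimension tables are assumed to exist
-- with surrogate-key primary keys.
CREATE TABLE units_history_fact (
  customer_id NUMBER NOT NULL REFERENCES customer_dim (customer_id),
  product_id  NUMBER NOT NULL REFERENCES product_dim (product_id),
  time_id     NUMBER NOT NULL REFERENCES time_dim (time_id),
  channel_id  NUMBER NOT NULL REFERENCES channel_dim (channel_id),
  units       NUMBER NOT NULL
);
-- "Aggregation" is then nothing special: join the fact table to the
-- dimensions you care about and GROUP BY, e.g. units shipped per
-- product and month:
SELECT p.product_name,
       t.calendar_month,
       SUM(f.units) AS total_units
FROM   units_history_fact f
       JOIN product_dim p ON p.product_id = f.product_id
       JOIN time_dim    t ON t.time_id    = f.time_id
GROUP  BY p.product_name, t.calendar_month;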
The second fact table (bottom) is more interesting, because here you split the product in two parts:
PRODUCT dimension, defining the ID, name and other more static attributes
PRODUCT_CONDITION, the fact table, designed with the expectation that the price and cost of the product will change over time.
With each change of the price or cost, insert a new record into the fact table and connect it to the PRODUCT and the TIME (of the change).
The structure of this fact table would therefore be:
PRODUCT_ID
TIME_ID
UNIT_PRICE
UNIT_COST
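A similar sketch for this fact table, showing how a price change is recorded and how the current price could be read back (again, the names are illustrative assumptions):
CREATE TABLE price_and_cost_history_fact (
  product_id NUMBER       NOT NULL REFERENCES product_dim (product_id),
  time_id    NUMBER       NOT NULL REFERENCES time_dim (time_id),
  unit_price NUMBER(10,2) NOT NULL,
  unit_cost  NUMBER(10,2) NOT NULL
);
-- On every price or cost change: insert a new row, never update old ones.
INSERT INTO price_and_cost_history_fact (product_id, time_id, unit_price, unit_cost)
VALUES (:product_id, :time_id_of_change, :new_unit_price, :new_unit_cost);
-- Current price/cost = the row with the latest time for that product.
-- (Assumes time_id values increase with time, which holds if the DATE
--  natural key discussed in the final note below is used.)
SELECT unit_price, unit_cost
FROM   price_and_cost_history_fact
WHERE  product_id = :product_id
AND    time_id = (SELECT MAX(time_id)
                  FROM   price_and_cost_history_fact
                  WHERE  product_id = :product_id);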
A final note on the design of the TIME dimension.
The best practice for connecting the fact table with the dimension tables is to use meaningless IDs (surrogate keys), but for the TIME dimension you should be careful. For big (time-partitioned) fact tables the natural key (a DATE column) is often used instead, so that the partitioning features can be deployed. See more details in How I Defined a Time Dimension Using a Surrogate Key and other resources on the web.
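For example, with a DATE natural key the shipments fact table from above could be range-partitioned by time, roughly like this (a sketch with arbitrary partition boundaries, not a tuned production design):
-- Same fact table as the earlier sketch, but with a DATE key to enable partitioning.
CREATE TABLE units_history_fact (
  customer_id NUMBER NOT NULL,
  product_id  NUMBER NOT NULL,
  time_id     DATE   NOT NULL,  -- natural key into the TIME dimension
  channel_id  NUMBER NOT NULL,
  units       NUMBER NOT NULL
)
PARTITION BY RANGE (time_id) (
  PARTITION p_2023   VALUES LESS THAN (DATE '2024-01-01'),
  PARTITION p_2024   VALUES LESS THAN (DATE '2025-01-01'),
  PARTITION p_future VALUES LESS THAN (MAXVALUE)
);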

Related

Database: Storing multiple Types in single table or multiple intermediate tables for Delta Tables

Using Java and Oracle.
We need to send changes to an employee's Email and UserID to a third party.
The actual table is Employee; we keep an intermediate table which we use to compare changes before sending them to the third party.
The following database designs come to mind for the intermediate table:
Single table only:
EmployeeID|Value|Type|UpdateDate
Value is the userid or email, and Type will be 'email' or 'userid'. The update date is kept so as to figure out which of the email or userid differed and update the third party.
Multiple Table:
Employee_EmailID
EmpId|EmailID|Updatedate
Employee_UserID
EmpId|UserID|Updatedate
Java flow will be:
Pick employee from actual table.
Pick employee from above intermediate table.
Compare differences. Update difference to third party.
Update above table with updated value and last update date.
Which one is considered the best way, the single-table approach or multiple tables, or is there any standard way to implement this? There are 10,000 employees in the system.
The intermediate table just stores delta records, i.e. the records transferred to the third party, so that they can be compared the next day.
Good database design has separate tables for different concepts. Using the same database column to hold different types of data will lead to code which is harder to understand, prone to data corruption, and less performant.
You may think it's only two tables and a few tens of thousands of rows, so does it matter? But that is only your current requirement. What you choose now will set the template for what happens when (say) you need to add telephone numbers to the process.
Now in future if we get 5 more entities to update
Do you mean "entities", like say Customers rather than Employees? Or do you really mean "attributes" as in my example of Employee Telephone Number?
Generally speaking we have a separate table for distinct entities, and all the attributes of that entity are grouped at the same cardinality. To take your example, I would expect an Employee to have one UserID and one Email Address so I would design the table like this:
Employee_audit
EmpId|UserID|EmailID|Updatedate
That is, I have one record which stores the complete state of the Employee record at the Updatedate.
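In DDL that could look something like this (a sketch with assumed Oracle types; adjust to the real Employee definition):
CREATE TABLE employee_audit (
  emp_id      NUMBER        NOT NULL,
  user_id     VARCHAR2(50),
  email_id    VARCHAR2(255),
  update_date DATE          NOT NULL,
  -- one row per employee per change; use PRIMARY KEY (emp_id) instead
  -- if you only ever keep the last transferred state
  CONSTRAINT employee_audit_pk PRIMARY KEY (emp_id, update_date)
);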
If we add a new entity, Customers then we have a new table. Simple. But a new attribute like Employee Phone Number offers a choice, because an employee can have more than one: work landline, mobile, fax, home, etc. So we could represent this in three ways: a child table with a type column, multiple child tables for each type, or as distinct columns on the Employee record.
For the main Employee table I would choose the separate table (or tables, depending on whether I'm shooting for 6NF). But for an audit table I would choose one record per Employee and pivot the phone numbers like this:
Employee_audit
EmpId|UserID|EmailID|Landline|Mobile|Fax|Home|Updatedate
The one thing I would never do is have a single table with type and value columns. It seems attractive because it means we could track additional entities without any further DDL. But in fact it becomes harder to re-assemble the complete state of an Employee at any given time with each attribute we add. Also it means the auditing process itself is more complicated (because it needs to determine which attributes have changed and whether it needs to audit the change) and more expensive (because changing three attributes on the same record entails inserting three audit records).
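With one row holding the complete state, the daily "compare differences" step also stays a single query, for example (a sketch; the EMPLOYEE table and its column names are assumed, and NULL handling is omitted for brevity):
-- Employees whose current UserID or EmailID differs from the last audited state
SELECT e.emp_id, e.user_id, e.email_id
FROM   employee e
       JOIN employee_audit a
         ON  a.emp_id = e.emp_id
         AND a.update_date = (SELECT MAX(a2.update_date)
                              FROM   employee_audit a2
                              WHERE  a2.emp_id = e.emp_id)
WHERE  e.user_id  <> a.user_id
   OR  e.email_id <> a.email_id;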

How to model an OLTP audit table in dimensional schema?

We have an audit table which we get from the OLTP system; it records any activity done by the user, including whether he downloaded an attachment, read a note, wrote a note, or made any change to an incident, etc. How do we include this audit table activity in our dimensional model for an incident management system (IT service management)?
On a simple level, which is all I can provide based on the level of detail in the question, the approach is to look at your audit table and decide which categories of audit you want to be dimensions. Perhaps there are audit_type, user_type, and audit_subtype fields, or something like that? Also, you typically have another field called a "measure" or "quantity", which is used for stats on numerics, to support aggregate functions. For example, you might have store_id and product_cat as categorical dimensions, but roll up sales $ as min, max, avg, stdev, grouped by different date granularities like month and quarter and by other dimensions. If your data is purely categorical by date, then COUNT() is usually used as the calculated measure.
You really just need to decide how you want to be able to drill up and drill down through the data, which categories matter, and which quantities matter. Once you decide that, create a flat table with FKs to lookup tables. A star schema is simply a fat table with a bunch of lookup tables floating around it like a star.
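As a rough illustration only (every table and column name below is an assumption based on the description, not a prescribed model), an activity fact table for this audit data might look like:
CREATE TABLE incident_activity_fact (
  date_id        INT NOT NULL,  -- FK to a date dimension
  user_id        INT NOT NULL,  -- FK to a user dimension
  incident_id    INT NOT NULL,  -- FK to an incident dimension
  audit_type_id  INT NOT NULL,  -- FK to an audit type/subtype dimension
  activity_count INT NOT NULL   -- measure; often just 1 per audit row
);
-- Drilling up and down is then GROUP BY over whichever dimensions matter,
-- e.g. activity per audit type per month:
SELECT d.calendar_month, ty.audit_type, SUM(f.activity_count) AS activities
FROM   incident_activity_fact f
       JOIN date_dim       d  ON d.date_id        = f.date_id
       JOIN audit_type_dim ty ON ty.audit_type_id = f.audit_type_id
GROUP  BY d.calendar_month, ty.audit_type;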
Hope this helps

Magento reindexing of indexes - database table names?

Which tables are involved in the process of reindexing the indexes in Magento?
Please share any documents available on this.
Can't take credit for this as it is taken from original post at: Can someone explain Magentos Indexing feature in detail?
Magento's indexing is only similar to database-level indexing in spirit. As Anton states, it is a process of denormalization to allow faster operation of a site. Let me try to explain some of the thoughts behind the Magento database structure and why it makes indexing necessary to operate at speed.
In a more "typical" MySQL database, a table for storing catalog products would be structured something like this:
PRODUCT:
product_id INT
sku VARCHAR
name VARCHAR
size VARCHAR
longdesc VARCHAR
shortdesc VARCHAR
... etc ...
This is fast for retrieval, but it leaves a fundamental problem for a piece of eCommerce software: what do you do when you want to add more attributes? What if you sell toys, and rather than a size column, you need age_range? Well, you could add another column, but it should be clear that in a large store (think Walmart, for instance) this would result in rows that are 90% empty, and attempting to maintain new attributes is nigh impossible.
To combat this problem, Magento splits tables into smaller units. I don't want to recreate the entire EAV system in this answer, so please accept this simplified model:
PRODUCT:
product_id INT
sku VARCHAR
PRODUCT_ATTRIBUTE_VALUES
product_id INT
attribute_id INT
value MISC
PRODUCT_ATTRIBUTES
attribute_id
name
Now it's possible to add attributes at will by entering new values into product_attributes and then putting adjoining records into product_attribute_values. This is basically what Magento does (with a little more respect for datatypes than I've displayed here). In fact, now there's no reason for two products to have identical fields at all, so we can create entire product types with different sets of attributes!
However, this flexibility comes at a cost. If I want to find the color of a shirt in my system (a trivial example), I need to find:
The product_id of the item (in the product table)
The attribute_id for color (in the attribute table)
Finally, the actual value (in the attribute_values table)
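In SQL, against the simplified tables above, that lookup is roughly (a sketch; the real Magento EAV tables are further split by datatype and entity type, and the SKU here is hypothetical):
SELECT v.value AS color
FROM   product p
       JOIN product_attribute_values v ON v.product_id   = p.product_id
       JOIN product_attributes       a ON a.attribute_id = v.attribute_id
WHERE  p.sku  = 'SHIRT-001'   -- hypothetical SKU
AND    a.name = 'color';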
Magento used to work like this, but it was dead slow. So, to allow better performance, they made a compromise: once the shop owner has defined the attributes they want, go ahead and generate the big table from the beginning. When something changes, nuke it from space and generate it over again. That way, data is stored primarily in our nice flexible format, but queried from a single table.
These resulting lookup tables are the Magento "indexes". When you re-index, you are blowing up the old table and generating it again.
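Conceptually the re-index builds something like the following flat table by pivoting the attribute rows (illustrative only; the real Magento index/flat tables have different names and are generated per store and attribute set):
CREATE TABLE product_flat AS
SELECT p.product_id,
       p.sku,
       MAX(CASE WHEN a.name = 'name'  THEN v.value END) AS name,
       MAX(CASE WHEN a.name = 'color' THEN v.value END) AS color,
       MAX(CASE WHEN a.name = 'size'  THEN v.value END) AS size
FROM   product p
       LEFT JOIN product_attribute_values v ON v.product_id   = p.product_id
       LEFT JOIN product_attributes       a ON a.attribute_id = v.attribute_id
GROUP  BY p.product_id, p.sku;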

Advantage of splitting a table

My question may seem rather general, but the only answer I got so far is from SO itself. I have a customer information table with 47 fields in it. Some of the fields are optional. I would like to split that table into two: customer_info and customer_additional_info. One of its columns stores a file in byte format. Is there any advantage to splitting the table? I have seen that the JOIN will slow down query execution. Can I have more pros and cons of splitting a table into two?
I don't see much advantage in splitting the table unless some of the columns are very infrequently accessed and fairly large. There's a theoretical advantage to keeping rows small as you're going to get more of them in a cached block, and you improve the efficiency of a full table scan and of the buffer cache. Based on that I'd be wary of storing this file column in the customer table if it was more than a very small size.
Other than that, I'd keep it in a single table.
I can think of only 2 arguments in favor of splitting the table:
If all the columns in customer_additional_info are related, you could potentially get the benefit of additional declarative data integrity that you couldn't get with a single table. For instance, let's say your additional table was CustomerAddress. Your business logic may dictate that a customer address is optional, but once you have a customer Zip code, the AddressL1, City and State become required fields. You could set these columns to NOT NULL if they exist in a CustomerAddress table. You couldn't do that if they existed directly in the customer table.
If you were doing some object-relational mapping and you had a Customer class with many subclasses and you didn't want to use Single Table Inheritance. Sometimes STI creates problems when you have similar properties of various subclasses that require different storage layouts. Since all subclasses have to use the same table, you might have name clashes. The alternative is Class Table Inheritance, where you have a table for the superclass and an additional table for each subclass. This is a similar scenario to the one you described in your question.
As for cons, the join makes things harder and slower. You also run the risk of accidentally creating a one-to-many relationship, i.e. you create 2 addresses in the CustomerAddress table and now you don't know which one is valid.
EDIT:
Let me explain the declarative data integrity point further.
If your business rules are such that a customer address is optional, and you embed addressL1, addressL2, City, State, and Zip in your customer table, you would need to make each of these fields Nullable. That would allow someone to insert a customer with a City but no state. You could write a table level check constraint to cover this situation. But that isn't as easy as simply setting the AddressL1, City, State and Zip columns in the CustomerAddress table not nullable. To be clear, I am NOT advocating using the multi-table approach. However you asked for Pros and Cons, and I'm just pointing out this aspect falls on the pro side of the ledger.
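A sketch of the contrast (column names and types are assumed):
-- Separate table: the address as a whole is optional (no row at all),
-- but if a row exists its required parts are enforced declaratively.
-- The primary key on customer_id also guarantees at most one address,
-- avoiding the accidental one-to-many mentioned under cons.
CREATE TABLE customer_address (
  customer_id INT          NOT NULL PRIMARY KEY REFERENCES customer (customer_id),
  address_l1  VARCHAR(100) NOT NULL,
  address_l2  VARCHAR(100),
  city        VARCHAR(50)  NOT NULL,
  state       VARCHAR(50)  NOT NULL,
  zip         VARCHAR(10)  NOT NULL
);
-- Embedded in Customer, the same "all or nothing" rule needs a
-- table-level check constraint instead of simple NOT NULL columns:
ALTER TABLE customer ADD CONSTRAINT customer_address_all_or_none CHECK (
      (address_l1 IS NULL     AND city IS NULL     AND state IS NULL     AND zip IS NULL)
   OR (address_l1 IS NOT NULL AND city IS NOT NULL AND state IS NOT NULL AND zip IS NOT NULL)
);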
I second what David Aldridge said, I'd just like to add a point about the file column (presumably BLOB)...
BLOBs are stored up to approx. 4000 bytes in-line [1]. If a BLOB is used rarely, you can specify DISABLE STORAGE IN ROW to store it out-of-line, removing the "cache pollution" without the need to split the table.
But whatever you do, measure the effects on realistic amounts of data before you make the final decision.
[1] That is, in the row itself.
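In Oracle that option looks roughly like this (a sketch; the column names are placeholders for the real customer table):
CREATE TABLE customer_info (
  customer_id NUMBER NOT NULL PRIMARY KEY,
  -- ... the other frequently used columns ...
  file_data   BLOB
)
LOB (file_data) STORE AS (DISABLE STORAGE IN ROW);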

SQL Server heavily queried table - should I store secondary info (HTML text) in another table?

The Overview:
I have a table "category" that is for the most part used to categorise products and currently looks like this:
CREATE TABLE [dbo].[Category]
(
CategoryId int IDENTITY(1,1) NOT NULL,
CategoryNode hierarchyid NOT NULL UNIQUE,
CategoryString AS CategoryNode.ToString() PERSISTED,
CategoryLevel AS CategoryNode.GetLevel() PERSISTED,
CategoryTitle varchar(50) NOT NULL,
IsActive bit NOT NULL DEFAULT 1
)
This table is heavily queried to display the category hierarchy on a shopping website (typically every page view) and can have a substantial number of items.
I'm using the Entity Framework in my data layer.
The Question:
I need to add what could potentially be a fairly large "description", which could come in the form of the entire contents of a web page. I'm wondering whether I should store this in a related table rather than adding it to the existing category table, given that the Entity Framework will drag the "description" column out of the database 100% of the time, when 99.5% of the time I'll only want the CategoryTitle and CategoryId.
Typically I wouldn't worry about the overhead of the Entity Framework, but in this case I think it might be important to take it into consideration. I could work around this with a view or a complex type from a stored proc, but that would mean a lot of refactoring that I'd prefer to avoid.
I'm just interested to know if anyone has any thoughts, suggestions or a desire to slap my wrists in relation to this scenario...
EDIT:
I should add that the reason I'm hesitating to set up a secondary table is because I don't like the idea of adding an additional table that has a 1 to 1 relationship with the Category table - it seems somewhat pointless. But I'm also not a DBA so I'm not sure whether this is an acceptable practice or not.
You could put your column in the table and then create an index covering all the other columns. That way the index will be used for all the lookups you currently do with this schema.
The key word for this construction is Covering Index: http://en.m.wikipedia.org/wiki/Database_index#Covering_index
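For the table above that could look something like this (a sketch; whether to include the hierarchyid and the computed columns depends on the queries you actually run):
CREATE NONCLUSTERED INDEX IX_Category_Covering
ON [dbo].[Category] (CategoryId)
INCLUDE (CategoryTitle, CategoryString, CategoryLevel, IsActive);
-- Queries that touch only these columns can be answered from the index
-- without reading the wide rows that also carry the new description.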
I would store it in a different table, for the simple reason of not increasing the record size in the Category table. An increase in record size due to such a VARCHAR column will reduce the number of records that fit in a given disk page (8 KB in SQL Server), thereby increasing the number of pages to fetch into main memory to search, increasing the number of disk accesses and affecting query execution times.
I would store this in a different table (i.e. vertically partition the Category table into most-frequently-accessed columns and not-so-frequently-used columns), define a one-to-one relationship at the application layer with the entity that contains the not-so-frequently-used column as a member of the main Category entity, and set the fetch type to LAZY.
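A sketch of that vertical partition, assuming CategoryId is (or will be made) the primary key of Category and the new table/column names are placeholders:
CREATE TABLE [dbo].[CategoryDescription]
(
  CategoryId  int NOT NULL PRIMARY KEY
      REFERENCES [dbo].[Category] (CategoryId),
  Description nvarchar(max) NOT NULL
);
The primary key on CategoryId keeps the relationship strictly one-to-one, and the ORM can then map Description as a lazily loaded navigation property so the hot category-menu query never pulls the large text.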
