DWH import with surrogate keys (and SCD) - etl

I have a Data Warehouse which uses internal surrogate keys and type 2 slowly changing dimensions.
In the clearing we just have the business keys from the erp-system, like this:
In the Data Warehouse we want to use the surrogate keys instead (note: the article price changed from $500 to $1000, and the articles table uses surrogate keys where possible, here only for the manufacturer).
If we were just using the business keys it would be no problem: compare, update old entries, insert new entries.
But what's the best way to do this with surrogate keys?
Should I fetch the existing Ids (0 or -1 for entries that do not exist yet) from the Data Warehouse into the clearing and then compare the entries?
Or should I keep the business keys in the Data Warehouse as well, compare on them, and then update the Ids in the Data Warehouse?

To be able to do lookups when loading tables, for example when referencing a manufacturer while loading the articles, you have to store the natural/business keys in the DWH. In my experience, this is always done.
But you should store the business key of a source entity only in its own destination entity. To clarify: the business key of the manufacturer should appear only in the Manufacturer table in your DWH, not elsewhere. When you need to reference the manufacturer from a different table, such as Article, you use the manufacturer's surrogate key.
So, you got it right in the second screenshot.
Then, when you load the Article table and need to know whether the manufacturer changed for a specific article, you first look up the manufacturer's surrogate key based on its business key and then compare that surrogate key with the key stored in the Article table. This is how it is usually done.
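
The lookup-then-compare step described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the table and column names (dim_manufacturer, dim_article, manufacturer_sk, is_current and so on) are made up, SQLite stands in for the real warehouse, and a simple is_current flag replaces whatever validity columns your type 2 dimensions actually use.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_manufacturer (
    manufacturer_sk INTEGER PRIMARY KEY,  -- surrogate key, generated in the DWH
    manufacturer_bk TEXT NOT NULL,        -- business key from the ERP, stored only here
    name            TEXT,
    is_current      INTEGER NOT NULL DEFAULT 1
);
CREATE TABLE dim_article (
    article_sk      INTEGER PRIMARY KEY,  -- surrogate key, one per version
    article_bk      TEXT NOT NULL,        -- business key of the article itself
    manufacturer_sk INTEGER REFERENCES dim_manufacturer (manufacturer_sk),
    price           INTEGER,
    is_current      INTEGER NOT NULL DEFAULT 1
);
INSERT INTO dim_manufacturer (manufacturer_bk, name)
VALUES ('M-01', 'Acme'), ('M-02', 'Globex');
INSERT INTO dim_article (article_bk, manufacturer_sk, price)
VALUES ('A-100', 1, 500);
""")


def load_article(con, article_bk, manufacturer_bk, price):
    """Load one clearing row; return 'inserted', 'unchanged' or 'new_version'."""
    # 1. Translate the manufacturer's business key into its surrogate key.
    row = con.execute(
        "SELECT manufacturer_sk FROM dim_manufacturer "
        "WHERE manufacturer_bk = ? AND is_current = 1",
        (manufacturer_bk,),
    ).fetchone()
    if row is None:
        raise LookupError(f"unknown manufacturer {manufacturer_bk!r}")
    manufacturer_sk = row[0]

    # 2. Compare against the current version of the article.
    current = con.execute(
        "SELECT article_sk, manufacturer_sk, price FROM dim_article "
        "WHERE article_bk = ? AND is_current = 1",
        (article_bk,),
    ).fetchone()
    if current is not None and current[1:] == (manufacturer_sk, price):
        return "unchanged"

    # 3. Type 2 change: close the old version (if any) and add a new one.
    if current is not None:
        con.execute(
            "UPDATE dim_article SET is_current = 0 WHERE article_sk = ?",
            (current[0],),
        )
    con.execute(
        "INSERT INTO dim_article (article_bk, manufacturer_sk, price) "
        "VALUES (?, ?, ?)",
        (article_bk, manufacturer_sk, price),
    )
    return "inserted" if current is None else "new_version"


results = [
    load_article(con, "A-100", "M-01", 500),   # same data as before
    load_article(con, "A-100", "M-01", 1000),  # price changed
    load_article(con, "A-200", "M-02", 250),   # article not seen before
]
print(results)
```

Note that the Article table never stores the manufacturer's business key; the comparison happens entirely on surrogate keys once the lookup is done.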

Related

What will be the benefit of a surrogate key in the data warehouse layer?

My OLTP source tables have surrogate keys (numeric values) and natural keys (alphanumeric values). Can I then skip creating surrogate keys for the dimension tables in the target OLAP DB (dimensional model)?
I know that I will need surrogate keys for fact tables, because the unique key of a fact table would span a large set of columns and I need a single numeric column as the primary key there.
I am joining multiple source tables to populate one target dimension, so I am wondering whether I can use the unique id (numeric) of the driving table, inherited from the OLTP source, as the primary key, provided that the granularity of the resulting record (the driving table joined with the other source tables) is at the level of the driving table's id.
What will be the benefit of a surrogate key in the data warehouse layer?
Thanks,
Rajneesh
Basically, a surrogate key is an artificial key that is used as a substitute for the natural key (NK) defined in data warehouse tables. We could also use the natural or business key as the primary key of a table.
These are some of the benefits of surrogate keys:
Surrogate keys help protect the data warehouse from unexpected administrative changes in the source system.
Surrogate keys allow the data warehouse to integrate the same data arriving from different source systems.
Surrogate keys enable you to add rows to dimensions that do not exist in the source system.
Surrogate keys provide the means for tracking changes in dimension attributes over time.
Integer surrogate keys can improve query and processing performance compared to larger character or GUID keys.
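
Two of these benefits, tracking attribute changes over time and adding rows that do not exist in the source, can be illustrated with a small sketch. It assumes a hypothetical dim_customer table with valid_from/valid_to columns and uses SQLite only as a stand-in; the point is that the OLTP id alone cannot be the primary key once one customer has several versions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_customer (
    customer_sk INTEGER PRIMARY KEY,  -- surrogate key, generated in the warehouse
    customer_id TEXT NOT NULL,        -- key inherited from the OLTP source
    city        TEXT,
    valid_from  TEXT NOT NULL,
    valid_to    TEXT                  -- NULL marks the current version
);
-- A row that exists only in the warehouse, for facts without a known customer.
INSERT INTO dim_customer VALUES (0, 'N/A', 'Unknown', '1900-01-01', NULL);
""")


def apply_change(con, customer_id, city, as_of):
    """Close the current version (if any) and insert a new one; return its key."""
    con.execute(
        "UPDATE dim_customer SET valid_to = ? "
        "WHERE customer_id = ? AND valid_to IS NULL",
        (as_of, customer_id),
    )
    cur = con.execute(
        "INSERT INTO dim_customer (customer_id, city, valid_from) VALUES (?, ?, ?)",
        (customer_id, city, as_of),
    )
    return cur.lastrowid


sk_old = apply_change(con, "C-7", "Berlin", "2020-01-01")
sk_new = apply_change(con, "C-7", "Munich", "2023-06-01")

# Same source id, two warehouse rows, two different surrogate keys.
history = con.execute(
    "SELECT city, valid_to FROM dim_customer "
    "WHERE customer_id = 'C-7' ORDER BY valid_from"
).fetchall()
print(sk_old != sk_new, history)
```

Fact rows loaded before June 2023 would keep pointing at the old surrogate key and therefore at the Berlin version, which is not possible if the source id is the only key.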
Closing the loop as Koushik Roy has answered this.
Adding some additional references in the hope that they are helpful to the community:
https://www.mssqltips.com/sqlservertip/5431/surrogate-key-vs-natural-key-differences-and-when-to-use-in-sql-server/
https://dwgeek.com/data-warehouse-surrogate-key-design-advantages-disadvantages.html/

App Inventor TinyDB has no unique key constraint?

I wanted to develop a simple Android app that requires a small database. I've developed a prototype with App Inventor and TinyDB, but it seems that TinyDB allows adding several records to the database with the same "tag" (this is what keys are called in TinyDB).
I am adding an extra field that autoincrements itself in every database record and using this counter as a primary key, but that's not exactly what I want. Is there a way to implement a primary key constraint for a "tag" in TinyDB?
TinyDB has no built-in way to store primary keys, but you can store an ordered list of the items where the index is the primary key. Then you just find where it is in the list to find the primary key.
If you use that system, though, you will decrease the keyspace: the tag that holds your list is one less tag, out of an infinite number of possible tags, that the user can store under. If the users get to create their own tags, you can prefix all of the tags they create with a symbol. Then, no matter what tags the user enters, they will not be able to accidentally or purposely overwrite your primary key list.
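
App Inventor is block-based, so the following Python is only a model of the logic, with a plain dict standing in for TinyDB. The reserved tag name "#keys" and the "u:" prefix are assumptions chosen for illustration, not anything TinyDB defines.

```python
RESERVED_KEYS_TAG = "#keys"  # hypothetical tag that holds the ordered key list
USER_PREFIX = "u:"           # hypothetical prefix added to every user-supplied tag


def insert_unique(store, user_tag, value):
    """Store value under user_tag unless that tag is already taken."""
    tag = USER_PREFIX + user_tag
    keys = store.setdefault(RESERVED_KEYS_TAG, [])
    if tag in keys:
        return False  # emulate a primary key violation
    keys.append(tag)
    store[tag] = value
    return True


def primary_key_of(store, user_tag):
    """Return the 1-based position of the tag in the key list."""
    return store[RESERVED_KEYS_TAG].index(USER_PREFIX + user_tag) + 1


store = {}  # stand-in for TinyDB: tag -> value
first = insert_unique(store, "alice", {"age": 30})
second = insert_unique(store, "alice", {"age": 31})  # rejected, tag already used
third = insert_unique(store, "#keys", "harmless")    # stored as "u:#keys"
print(first, second, third, primary_key_of(store, "#keys"))
```

Because every user tag is prefixed, a user who enters "#keys" writes to "u:#keys" and cannot clobber the list itself.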

Changing Primary Key in Oracle

I'm updating a table that was originally poorly designed. The table currently has a primary key that is the name of the vendor. This serves as a foreign key to many other tables. This has led to issues with the Vendor name initially being entered incorrectly or with typos that need to be fixed. Since it's the foreign key to relationships, this is more complicated than it's worth.
Current Schema:
Vendor_name(pk) Vendor_contact comments
Desired Schema:
id(pk) Vendor_name Vendor_contact comments
I want to update the primary key to be an auto-generated numeric key. The vendor name field needs to persist but no longer be the key. I'll also need to update the value of the foreign key on other tables and on join tables.
Is the best way to do this the following? Create a new numeric id column on my Vendor table, crosswalk the id to vendor names, add a new foreign key column holding the new id to the dependent tables, drop the vendor-name foreign key on those tables (per this post), and then somehow mark the id as the primary key and unmark the vendor name.
Or is there a more streamlined way of doing this that isn't so broken out?
It's important to note that only 5 users can access this table so I can easily shut them out for a period of time while these updates are made - that's not an issue.
I'm working with SQLDeveloper and Python/Django.
The biggest problem you have is all the application code which references VENDOR_NAME in the dependent tables. Not just using it to join to the parent table, but also relying on it to display the name without joining to VENDOR.
So, although having a natural key as a foreign key is a PITN, changing this situation is likely to generate a whole lot of work, with a marginal overall benefit. Be sure to get buy-in from all the stakeholders before starting out.
The way I would approach it is this:
Do a really thorough impact analysis
Ensure you have complete regression tests for all the functions which rely on the Vendor data
Create VENDOR_ID as a unique key on VENDOR
Add VENDOR_ID to all the dependent tables
Create a second foreign key on all the dependent tables referencing VENDOR_ID
Ensure that the VENDOR_ID is populated whenever the VENDOR_NAME is.
That last point can be tackled either by fixing the insert and update statements on the dependent tables, or with triggers. Which approach you take will depend on your application design and also on the number of tables involved. Obviously you want to avoid the performance hit of all those triggers if you can.
At this point you have an infrastructure which will support the new primary key but which still uses the old one. Why would you want to do this? Because you could go into Production like this without changing the application code. It gives you the option to move the application code to use VENDOR_ID across a broader time frame. Obviously, if developers have been keen on coding SELECT * FROM you will have issues that need addressing immediately.
Once you've fixed all the code you can drop VENDOR_NAME from all the dependent tables, and switch VENDOR_NAME to unique key and VENDOR_ID to primary key on the master table.
If you're on 11g you should check out Edition-Based Redefinition. It's designed to make this sort of exercise an awful lot easier.
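
The transitional state described above (new VENDOR_ID columns in place, old VENDOR_NAME keys still working) can be sketched as follows. This is a rough illustration, not Oracle DDL: SQLite stands in for the database, the invoice table is a made-up dependent table, and the trigger is one possible way to keep VENDOR_ID populated for code that still inserts only the name.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Starting point: the vendor name is the primary key and the foreign key.
CREATE TABLE vendor (
    vendor_name    TEXT PRIMARY KEY,
    vendor_contact TEXT,
    comments       TEXT
);
CREATE TABLE invoice (  -- hypothetical dependent table
    invoice_no  INTEGER PRIMARY KEY,
    vendor_name TEXT REFERENCES vendor (vendor_name)
);
INSERT INTO vendor VALUES ('Acme', 'a@example.com', NULL),
                          ('Globex', 'g@example.com', NULL);
INSERT INTO invoice VALUES (1, 'Acme'), (2, 'Globex'), (3, 'Acme');

-- Create VENDOR_ID as a unique key on the master table.
ALTER TABLE vendor ADD COLUMN vendor_id INTEGER;
UPDATE vendor SET vendor_id = rowid;
CREATE UNIQUE INDEX vendor_id_uk ON vendor (vendor_id);

-- Add VENDOR_ID to the dependent table as a second foreign key and backfill it.
ALTER TABLE invoice ADD COLUMN vendor_id INTEGER REFERENCES vendor (vendor_id);
UPDATE invoice
SET vendor_id = (SELECT v.vendor_id FROM vendor v
                 WHERE v.vendor_name = invoice.vendor_name);

-- Keep VENDOR_ID populated for legacy code that only supplies the name.
CREATE TRIGGER invoice_fill_vendor_id AFTER INSERT ON invoice
WHEN NEW.vendor_id IS NULL
BEGIN
    UPDATE invoice
    SET vendor_id = (SELECT v.vendor_id FROM vendor v
                     WHERE v.vendor_name = NEW.vendor_name)
    WHERE invoice_no = NEW.invoice_no;
END;
""")

# Unchanged application code: it still knows nothing about vendor_id.
con.execute("INSERT INTO invoice (invoice_no, vendor_name) VALUES (4, 'Globex')")

missing = con.execute(
    "SELECT COUNT(*) FROM invoice WHERE vendor_id IS NULL"
).fetchone()[0]
mismatched = con.execute(
    "SELECT COUNT(*) FROM invoice i JOIN vendor v ON v.vendor_id = i.vendor_id "
    "WHERE v.vendor_name <> i.vendor_name"
).fetchone()[0]
print(missing, mismatched)
```

Once both counts stay at zero and the application joins on vendor_id, the name columns in the dependent tables can be dropped as described.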
I would do it this way:
create your new sequence
create table temp as select your_sequence.nextval as id, vendor_name, vendor_contact, comments from vendor
rename the original table to something like vendor_old
add the primary key and other constraints to the new table
rename the new table to the old name
Testing is essential and you must ensure no one is working on the database except you when this is done.

How to insert rows in phpMyAdmin

I have a database which I've opened in phpMyAdmin. I clicked the "Insert" button, which has an icon showing one row being inserted between two others.
When I actually try to insert a row, I get the following error:
1062 - Duplicate entry '294' for key 'PRIMARY'
How do I get phpMyAdmin to insert a row (presumably by increasing all the higher-numbered rows by 1) as the icon and the term "Insert" implies? It only seems to want to "Add" a row to the end, not "Insert" it.
As I said, the icon specifically shows one row being inserted between two others, and this is what I want to do. How do I get it to do what it claims it will do?
First, "INSERT" is standard SQL terminology for putting something in the database; it doesn't specifically mean "putting it between two existing values". I see how the icon can be a bit confusing, but when "inserting" data there is no difference between putting something at the end or in the middle of the database. For that matter, there's no real inherent order to data stored in a database; you can choose many different ways to sort it when you display the data (and phpMyAdmin generally does a good job of guessing what's reasonable), but the data just exists. You can choose to sort it by the primary key, alphabetically by user name, or by any means you wish.
Second, your primary key shouldn't change. It's the key that holds your data together; if you start changing that your references from other tables will be messed up (see below). So don't change that.
Third, if you have your primary key set up with auto_increment (the A_I checkbox in phpMyAdmin), then you shouldn't ever need to set it or worry about it yourself. It's all managed by MySQL. If you aren't happy with the order and want to move 294 to 295 so you can insert something else at 294, then your database design needs tweaking because that's not how auto_incrementing primary keys are designed to work. As a simple solution, you may wish to create another field called "sort_value" or something that you can change.
Which all brings me to the root cause of your trouble: you're trying to create a new row while reusing an existing auto_increment value, and MySQL is smart enough to know this is a bad idea.
So as I said above, changing your primary key (whether or not it's auto generated) is a bad idea, but it may not be obvious why if you only have one table. But relational databases are designed so that you can reference tables from other tables, so for instance a customer database might have a table for "customers", "products", and "purchases" where the purchases table references the primary key ID from both customers and products...imagine the carnage your data would see if you then change the value of those keys in the customer table. You'd show customers associated with some other customer's purchases. So it might not make sense in your database, but overall that's the best way to handle things.
If you really, really don't want to change your database structure, don't reference that key from any other tables, and don't want to listen to my advice, you should be able to simply turn off the auto_increment function on your primary key and reorder them however you wish.
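
The "sort_value" suggestion above can be sketched like this. SQLite is used as a stand-in for MySQL, the items table is hypothetical, and making sort_value a fractional number (so a row can be slotted between 1 and 2 as 1.5) is an assumption of this sketch rather than part of the advice.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE items (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,  -- never set or changed by hand
    label      TEXT NOT NULL,
    sort_value REAL NOT NULL                       -- display order, free to change
);
INSERT INTO items (label, sort_value)
VALUES ('first', 1), ('second', 2), ('third', 3);
""")

# "Insert between" first and second: the database picks the id,
# and only sort_value decides where the row shows up.
con.execute("INSERT INTO items (label, sort_value) VALUES ('between', 1.5)")

rows = con.execute("SELECT id, label FROM items ORDER BY sort_value").fetchall()
print(rows)  # [(1, 'first'), (4, 'between'), (2, 'second'), (3, 'third')]
```

The new row gets id 4, yet it is displayed second, and no existing primary key had to be renumbered.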

Surrogate key in 'User' / 'Role' tables for desktop app? What's the purpose?

I have to add some security for a C#/.NET WinForms/Desktop application. I am using Oracle DB back-end.
The tables are simple: User (ID,Name), Role(ID,Role), UserRole(UserID,RoleID).
I am using the windows account name to populate User table. Role table will for now just be simply 'Admin','SuperUser','BasicUser'...
No two people could ever possibly have the same Windows account name, even though I do not control how these names are managed (netops does, which is why I want to use Windows accounts: I don't have to manage them ;)). The Role table should likewise never contain duplicate values: I control the input, and there will only be 3 roles (it is a tactical app that is going away within a year). UserRole is a join table representing the many-to-many relationship between users and roles, so no surrogate key is justified there.
Simple question - Why bother with 'ID' (int) in the User and Role table? Any point or advantage here? Is this one of those 'I've always done it this way' type things? Or have I just not done this in a while and forgotten the reason?
Names change - primary key values must not. Abigail Smith becomes Abigail Jones and the username changes but a surrogate key protects against having to cascade those changes everywhere.
If you are using a surrogate key but there is a column or combination of columns which should be unique, then enforce that using a unique index. There's a good chance you'll want indexes on your user.name and role.role columns anyway, and a unique index is more space efficient and supplies useful metadata to the optimizer. If you have a surrogate key but don't have another combination of columns that uniquely identify a row then think again whether you have your entity definition right.
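
A small sketch of that combination, a surrogate key plus a unique index on the natural column, is shown below. SQLite stands in for Oracle, and the table is called app_user here only to avoid reserved-word trouble; both are assumptions of the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE app_user (
    id   INTEGER PRIMARY KEY,  -- surrogate key, referenced by other tables
    name TEXT NOT NULL         -- Windows account name, allowed to change
);
CREATE UNIQUE INDEX app_user_name_uk ON app_user (name);

CREATE TABLE role (
    id   INTEGER PRIMARY KEY,
    role TEXT NOT NULL UNIQUE
);
CREATE TABLE user_role (       -- intersection table, composite primary key
    user_id INTEGER NOT NULL REFERENCES app_user (id),
    role_id INTEGER NOT NULL REFERENCES role (id),
    PRIMARY KEY (user_id, role_id)
);
INSERT INTO app_user (name) VALUES ('asmith');
INSERT INTO role (role) VALUES ('Admin');
INSERT INTO user_role VALUES (1, 1);
""")

# The account is renamed: one row changes, user_role is untouched.
con.execute("UPDATE app_user SET name = 'ajones' WHERE name = 'asmith'")
roles = [
    r[0]
    for r in con.execute(
        "SELECT r.role FROM app_user u "
        "JOIN user_role ur ON ur.user_id = u.id "
        "JOIN role r ON r.id = ur.role_id "
        "WHERE u.name = 'ajones'"
    )
]

# The unique index still rejects a second user with the same name.
try:
    con.execute("INSERT INTO app_user (name) VALUES ('ajones')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(roles, duplicate_rejected)
```

With the account name as the primary key instead, the rename would have to be cascaded into every row of user_role.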
One caution. Especially for very narrow tables with few access paths, you may use an index-organized table. Oracle will only allow an index organized table on the primary key, but does allow foreign keys against a unique set of columns (if it is enforced by a unique constraint, not simply a unique index).
It is possible that you'll end up with a table where a unique ID is enforced through a unique index and treated as PK by an ORM and used as the parent for foreign key relationships, but the primary key (as defined in the DB) is the rolename/username/whatever because you want that as the driver for an index-organised table.
A surrogate key is not required on intersection tables, but here are a few reasons to do so:
Consistency: If every table has a single artificial key, you always know the key name when you know the table name.
Ease Of Use: Less typing — one key means ON and WHERE clauses are shorter and thus less error-prone.
Interoperability: Some ORMs only work well with tables with a single primary key column.

Resources