Data Vault: How to get hashes for foreign key relationships (populating link tables)

I've read the Data Vault book end to end, but I'm still trying to resolve one specific thing: how you populate the link tables (how to get all the hashes for that). The Scalefree blog post on massively parallel processing demonstrates that satellites and hubs can be loaded fully in parallel, but it doesn't go into much detail on link tables.
Links require hash keys, and therefore the business keys from multiple tables, to establish the relationships; that's what links do, they record relations between hubs. I haven't found good, in-depth explanations of how you would retrieve the business keys of the related entities when populating these link tables.
For a specific table like 'customer', things are easy for the hub and satellite: just convert the business key to a hash and load both of them in parallel.
But a customer details table or a transaction table from an OLTP system needs some kind of join to look up the business key for the customer, or to look up all the related entities in the transaction (product, customer, store, etc.), because those tables do not typically store (all of the) business key(s) as attributes.
If I assume that staging is loaded incrementally and truncated, then staging doesn't necessarily have all the entities loaded to be able to perform the joins there. How do I resolve this dilemma and create a design that works?
Join tables in the source OLTP systems to generate the business keys there, and propagate them onwards as hashes? (this goes wrong if the business key was chosen incorrectly)
Use a persistent staging area, so never truncate? (then it's always possible to join on any table in there to resolve the keys)
Use some kind of index of surrogate keys -> business keys and perform a lookup there? (this minimizes I/O a bit further and is a mix between incremental staging and persistent staging)
Some other method...?
Essentially, what is the best practice for generating the hashes for all foreign key relations of your OLTP systems?

I talked to an expert about this, and this is the answer I accepted from him:
There are only two sensible ways to produce hashes for tables that do not have all the columns necessary to produce a business key:
In the case where you have a full load of all the tables that have the business keys (even if the table feeding the link is loaded incrementally), join to the relevant source tables carrying the business keys in staging. This is OK because you can guarantee you have all the data in staging at that moment (a sketch of this pattern follows below).
In the case where you have incremental loads for the tables that carry the business keys, you must use a persistent staging area (PSA) to do this for you.
It is considered bad practice to join tables in source-system queries in order to generate the business keys, because the data warehouse should have as little operational impact as possible.
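To make the first case concrete, here is a minimal sketch of that staging join, with hypothetical staging and link tables (stg_transaction, stg_customer, stg_product, stg_store, link_customer_product_store) and SQL Server's HASHBYTES used purely as an example hash function:

-- The transaction rows carry only surrogate ids, so join to the fully loaded
-- staging tables that hold the business keys, then hash: one hash per hub
-- reference, plus one hash over the concatenated business keys for the link itself.
-- (In practice you would also filter out link keys that already exist in the link table.)
INSERT INTO link_customer_product_store (link_hk, customer_hk, product_hk, store_hk, load_dts, record_source)
SELECT DISTINCT
    HASHBYTES('MD5', CONCAT(c.customer_bk, '||', p.product_bk, '||', st.store_bk)) AS link_hk,
    HASHBYTES('MD5', c.customer_bk) AS customer_hk,
    HASHBYTES('MD5', p.product_bk)  AS product_hk,
    HASHBYTES('MD5', st.store_bk)   AS store_hk,
    GETDATE()                       AS load_dts,
    'OLTP'                          AS record_source
FROM stg_transaction t
JOIN stg_customer c  ON c.customer_id = t.customer_id
JOIN stg_product  p  ON p.product_id  = t.product_id
JOIN stg_store    st ON st.store_id   = t.store_id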


What is the most performant way to add/remove join table associations for a bi-directional many-to-many association of complex entities?

Just to be clear, I'm NOT asking how to get things WORKING, but what sort of approach will have the best PERFORMANCE.
The reason I'm so focused on performance is that I'm dealing with 2 very complex and hierarchical entities. To give you a little background, the program is for tracking attendance, so the 2 entities in question are Locations (where people attend) and Reasons (whether people can attend and why). These 2 entities are in a bi-directional many-to-many relationship where each entity has many thousands of records in the database, and they are completely independent from each other aside from the mapping table that contains the associations. There can be anywhere from 1 to arbitrarily many Reasons a person may attend a Location, and anywhere from 1 to arbitrarily many Locations a Reason can be used at.
For example, a Reason of 'person is in security group RED' is used to grant access to the locations 'Lab A', 'Lab B', and 'Lab C', resulting in the join table containing 3 entries, each with 'person is in security group RED' on the Reason side of the relationship and one of the lab Locations on the other side. Conversely, Location 'Lab A' may also include other Reasons, such as 'security group BLUE' and 'training event X', and will thus have additional records in the join table pointing to those Reasons.
Both entities are complex in and of themselves and have many cascading associations of their own. i.e. the Reason entity has a 3-tiered cascading relationship structure of its own to maintain, and the Location entity has many other associations it's concerned with completely independent from Reasons. I stress this to emphasize the performance impact concerns of always pulling in all the entity associations just to update the join table entries.
Most of the time when a Location or Reason is being updated, it will be completely independent of the associations they have with one another, so I don't want to worry about the Location <--> Reason join table for normal operations. However, the links between Locations and Reasons are crucial to the program's primary functions, and thus adding and removing these associations is just as crucial. So when it comes to adding/removing these associations, what is the most performant way of doing so?
I want to avoid the cascade behaviors if I can, because I don't want unrelated changes made to these entities to trigger a complex cascade save. i.e. when Location 'X' changes, I don't want to bring in its many Reasons and all their 3-tiered cascaded dependencies when I'm not changing anything related to Location 'X's reasons at the time. I also don't want to bring in all of Location 'X's Reasons (and their dependencies) just to add/remove an entry in the Location <--> Reason join table for that Location (and conversely the same argument for the Reason.)
I was wondering if there was a way to target the Location <--> Reason join table association directly when I want to add/remove such associations, but otherwise keep it pretty light when dealing with regular operations on the entities.
i.e. I still want to get basic details about the Reasons associated with a Location when I go to change other details about that Location, without cascading the save operation of that Location to its Reasons (and thus to each Reason's downstream associations), and vice versa. And when it comes time to add or remove entries in the join table for these entities, I don't want those entities to be saved directly, because then I'd have to worry about all their other data and associations getting updated.
If such a setup isn't possible with Hibernate and its nature of tracking entities rather than their associations, which association setup for these entities would have the best performance, given that each entity already has many other cascading operations to deal with, most operations on these entities will NOT touch the join table between them at all, and only the targeted add/remove operations need to update JUST that join table between the Location and the Reason and nothing else?
First of all, you almost always don't want a real many-to-many association. There will come a time when you want to track additional data, like a timestamp, in these join tables, and that is when you would have to rewrite your logic. So I suggest you always try to model the join table as an entity with two many-to-one associations and, on the inverse side, one-to-many associations. The entity for the join table has a composite id consisting of the ids of the many-to-one associations.
This enables you to manage entries manually by doing persist/merge/remove or by using update/delete DML queries. This will perform best, and the only "downside" is that you might do work upfront that you possibly don't need, because it might in fact be a real many-to-many, although I highly doubt that.
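As a rough illustration of the shape this takes at the database level (table and column names here are hypothetical), the join table gets its own definition with a composite primary key made of the two foreign keys, and associations are added or removed with targeted statements instead of cascading saves:

CREATE TABLE location_reason (
    location_id BIGINT NOT NULL REFERENCES location (id),
    reason_id   BIGINT NOT NULL REFERENCES reason (id),
    created_at  TIMESTAMP NOT NULL,            -- extra columns are easy to add later
    PRIMARY KEY (location_id, reason_id)       -- composite id: the two many-to-one sides
);

-- Add an association without loading either entity graph:
INSERT INTO location_reason (location_id, reason_id, created_at)
VALUES (42, 7, CURRENT_TIMESTAMP);

-- Remove it just as directly:
DELETE FROM location_reason
WHERE location_id = 42 AND reason_id = 7;

In JPA terms, that table becomes a small entity with a composite id (@EmbeddedId or @IdClass) and two @ManyToOne associations; adding or removing an association is then a persist/remove of that small entity, or a bulk update/delete query, and never a save of the full Location or Reason graphs.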

Store summary data at master table, instead of deriving it

I am trying to prepare a DB design for an APEX application. The requirement is as follows.
On the Departments IR page, users are asking for the columns below:
Number of employees in each department (a department may or may not have employees)
Primary location for the department (a department can have multiple addresses; addresses are stored in another table, along with a primary flag)
Alternative manager's email address for the department (the alt_manager_id column; this is an optional column and refers to the employees table)
I can implement these requirements using either inline subqueries or OUTER JOINs. But these approaches will have a performance impact as the data grows (to hundreds of thousands of rows). So my question is: is it OK to store these data directly in the "Departments" table and update the "Departments" table when the child tables get updated? Basically, I am trying to store summary data in the master table instead of deriving it as and when needed from the child tables. Is this considered bad practice? Is it OK to implement such a DB design?
Thank you
"Is this considered bad practice?"
Usually yes. There are several problems with maintaining summary detail information in a master record.
Your inserts into child tables (and deletes if you have them) now also have to take a lock on the master record, to increment the count. This adds complexity to what should be simple transactions.
It also has two performance hits: the additional overhead of maintaining the counts and the potential for sessions to hang in multi-user environments.
Note that you are adding a definite performance hit to your insert activity for a possible saving in the performance of aggregating queries.
The good practice is to just run the counts when you need the summaries. Tune the queries if you need to.
If you think you really are going to be querying the summary data often enough for the workload to be a problem, you should consider building materialized views for the summary queries. Then, when you enable query rewrites, Oracle will transparently query the materialized view if it can satisfy the query, rather than re-running the aggregations. This is a technique which is used a lot in data warehouses, but there's no reason not to use it in OLTP environments if you really have the data volumes to justify it.
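As a hedged sketch of that approach, assuming hypothetical DEPARTMENTS and EMPLOYEES tables, a materialized view for the employee count per department might look like this:

-- Precompute the per-department head count once, instead of on every page view.
CREATE MATERIALIZED VIEW dept_employee_counts
  ENABLE QUERY REWRITE
AS
SELECT e.department_id,
       COUNT(*) AS employee_count
FROM   employees e
GROUP  BY e.department_id;

How fresh the counts are depends on the refresh options you choose (fast refresh on commit needs materialized view logs), but the point is that the aggregation lives outside the DEPARTMENTS table, and with query rewrite enabled Oracle can answer the count column from the view without you changing the report query.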
Generally, try the simplest thing which could work first. Only look to do something different (like building a materialized view for aggregations) when you know you have a demonstrable problem with performance.

Salesforce Table Relationships for Business Analyst

I am a business analyst. I use Tableau a lot but have limited knowledge about the back-end of Salesforce. The majority of our company's data is stored in Salesforce and our data team does not support business users for understanding such topics.
In many of my projects, I use the Salesforce connector inside Tableau to extract Salesforce tables, but it requires knowledge of the join relationships among tables. Most of the time, I can guess the primary keys correctly, but I still want to learn systematically about the data structure and have my data independence.
So, how do I learn the data structure by myself? Or how do I ask specific structure questions to data team so I don't trouble them as much?
Do you have a Salesforce account with the "Customize Application" permission? If you don't have it in production, maybe they'll be willing to promote you to sysadmin in one of the sandboxes.
If you do, Setup -> Schema Builder might be the easiest tool to visualise relations. It's a bit old and Flash-based, but a pretty neat way to model relationships. https://trailhead.salesforce.com/en/content/learn/modules/data_modeling/schema_builder
Another one is Workbench, http://workbench.developerforce.com/ It's not as neat, but it lets you experiment with metadata & queries and learn which object has which child relationships...
For standard objects, if you have a primary key / foreign key you can use some lookup tables to learn more about the target table. All Account Ids in all SF instances start with 001, Contacts with 003, Users with 005... Combine blogs like http://www.fishofprey.com/2011/09/obscure-salesforce-object-key-prefixes.html with https://developer.salesforce.com/docs/atlas.en-us.api.meta/api/sforce_api_objects_account.htm and it's a good start. It won't help much with custom objects and fields (specific to your company), but still.
It's a bit "meta", but you can query info about tables and columns too. After all, you might be more comfortable in Tableau ;) "Querying Salesforce Object Column Names w/SOQL" might give you some hints.
If your job is to build advanced reports off these data sources, I would imagine you need to understand the data structure to some extent. This would mean you need to have authorization to view and access the database table list to get familiar with it and possibly run raw queries to verify data integrity.
If they are not comfortable with you touching the production system, ask for access to a development system which is a copy of production or even just realistic test data.

Loose coupling among objects within oracle schema

I am building an information service that manages Suppliers.
The suppliers are used by our billing system, tender system and sales system.
Though 60% of the attributes of a Supplier are unique to each system, the remaining 40% are shared across the systems.
My objective is to build a flexible system, so that a change to one individual system's data does not impact the other systems. For example, if I need to take certain tables offline to upgrade them, it should not impact the rest of the systems that need supplier information.
What is the best way of achieving this? Should all the different context-specific attributes live in one schema, but be deployed on different tablespaces?
Also, reads and updates may happen more for one set of attributes than for another. How should I logically represent them in one model, but deploy them in such a fashion that they can evolve independently?
Thank you.
First, tablespaces are a means of controlling the storage characteristics of segments; they won't help with avoiding the impact of changes.
I recommend you create separate child tables for each set of attributes, each with a 1:1 referential integrity constraint to a parent table. e.g.
SUPPLIERS (supplier_id PK, common attributes...)
SUPPLIER_BILLING_INFO (supplier_id PK, billing attributes...) + FK to SUPPLIERS
SUPPLIER_TENDER_INFO (supplier_id PK, tender attributes...) + FK to SUPPLIERS
SUPPLIER_SALES_INFO (supplier_id PK, sales attributes...) + FK to SUPPLIERS
Obviously they'll need to live in one instance. Whether you put them in one schema or in separate schemas is up to you.
Changes to one system should have no impact on other systems, as long as they don't all refer to all the tables (i.e. the Billing system should never access SUPPLIER_TENDER_INFO).
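A minimal DDL sketch of that layout (the non-key columns are placeholders):

CREATE TABLE suppliers (
  supplier_id   NUMBER        PRIMARY KEY,
  supplier_name VARCHAR2(200) NOT NULL
  -- ...other common attributes
);

CREATE TABLE supplier_billing_info (
  supplier_id      NUMBER PRIMARY KEY
                   REFERENCES suppliers (supplier_id),  -- PK doubles as the FK: enforces 1:1
  billing_currency VARCHAR2(3)
  -- ...other billing-only attributes
);

-- SUPPLIER_TENDER_INFO and SUPPLIER_SALES_INFO follow the same pattern.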
This sounds like a very difficult question that can't be easily answered here. But I can think of a few tricks that might help you with some of your issues. It is possible to make huge changes to your data and still keep the system online.
DBMS_REDEFINITION allows you to change your table structure while other people are still using the table (although it looks very complicated).
Partitioning also allows you to change part of your table without affecting other users. For example, you can truncate just one of the partitions of a table. Partitioning also allows you to use different physical structures for the same table. For example, one partition could use a tablespace with a small block size (good for writing), and another partition could use a tablespace with a larger block size (good for reading).
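To make the partitioning trick concrete (all names here are hypothetical), a list-partitioned child table lets you clear or rebuild one slice of the data while the rest stays available:

CREATE TABLE supplier_billing_info_part (
  supplier_id      NUMBER       NOT NULL,
  region           VARCHAR2(10) NOT NULL,
  invoice_currency VARCHAR2(3)               -- stand-in for the billing attributes
)
PARTITION BY LIST (region) (
  PARTITION p_emea VALUES ('EMEA'),
  PARTITION p_apac VALUES ('APAC'),
  PARTITION p_amer VALUES ('AMER')
);

-- Maintenance on one region leaves the other partitions untouched:
ALTER TABLE supplier_billing_info_part TRUNCATE PARTITION p_emea;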

Is there a performance hit from adding nonenforced foreign keys to a SQL Server 2008 database?

I'm working with a database and I want to start using LINQ To SQL with it. The database doesn't have any FKs inside of it right now for performance reasons. We are inserting millions of rows at a time to the DB which is why there aren't any FKs.
So I'm thinking I'm going to add nonenforced FKs to the database to describe the relationships between the tables for my LINQ To SQL but I don't want there to be a performance hit by adding nonenforced foreign keys.
Does anyone know what the effect of this might be?
Update: I'm using LINQ-To-SQL for the non-performance-intensive stuff. 80% of the data access is through stored procs on production. But for writing unit tests and other non-performance-critical tasks, LINQ-To-SQL makes data access really easy.
Update: Here is how you add a nonenforced FK:
-- WITH NOCHECK: add the constraint without validating the rows that already exist
ALTER TABLE [dbo].[ACI] WITH NOCHECK ADD CONSTRAINT [FK_ACI_CustomerInformation] FOREIGN KEY([ACIOI])
REFERENCES [dbo].[CustomerInformation] ([ACI_OI])
NOT FOR REPLICATION
GO
-- NOCHECK CONSTRAINT: disable enforcement for future inserts/updates as well
ALTER TABLE [dbo].[ACI] NOCHECK CONSTRAINT [FK_ACI_CustomerInformation]
GO
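If it helps, you can confirm how SQL Server treats such a constraint by checking the sys.foreign_keys catalog view (the constraint name below matches the example above):

-- is_disabled = 1     -> the constraint is not enforced for new inserts/updates
-- is_not_trusted = 1  -> existing rows were never validated against it
SELECT name, is_disabled, is_not_trusted, is_not_for_replication
FROM   sys.foreign_keys
WHERE  name = 'FK_ACI_CustomerInformation';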
The answer can be different for different environments (data/logs on same drive, tempdb on same drive, lots of cache vs little, etc) so the best way to find this out is to benchmark. Create two identical databases, one with fk's and one without. Do your normal million-row-load into each database, and measure your transactions per second. That way you'll know for sure in your own environment.
Foreign keys in SQL Server do not create indexes automatically, but you will usually want to create non-clustered indexes on the FK columns yourself to keep joins (and constraint checks) fast.
Extra indexes will decrease the performance of your insert/update/delete/merge statements and will increase table sizes.
http://msdn.microsoft.com/en-us/library/ms191195.aspx
Even when the constraint is created with NOT FOR REPLICATION, any such supporting indexes are still present and SQL Server will need to maintain them.
In your case I would either:
use foreign keys and take the performance hit, or
not use foreign keys in production (goodbye data integrity) and run my tests against a copy of the production database for which I would create foreign keys.
It may have some impact, especially at those volumes.
However, I would test this on a similar system first, so you can measure the impact, if any.
To be honest though, I would probably use hand written stored procedures for this, so you can optimize them as required, instead of using LINQ to SQL.
I realize this is an old question, but I want to comment on how bad a practice it is to create a FK that is not enforced on existing data. If in fact there is a need for a foreign key, you need to fix any bad data before adding the foreign key (which should have been added at design time) not try to ignore it. All you are doing is masking your very serious data integrity problem by refusing to notice it and do something about it. There is the occasional need to do this due to changed requirements, but it should not be considered as a first choice of techniques when adding a foreign key to a table that has data. Finding and fixing the bad data should be.
Data that has no relationship to the PK is useless. If I had an order table with a customer id that no longer existed in the customer table, how would I know who ordered the product? Of course, this is why the FKs should have been enforced from the beginning, whether you did million-row inserts or not. I do multi-million-row inserts through SSIS on a daily basis into many, many tables that have foreign keys; to use this as a reason for not setting them up in the first place indicates a lack of understanding of database design. Sacrificing your data integrity for speed is ALWAYS a poor idea. Without data integrity, your database is unreliable and therefore useless.
