Can a reference table be collegiated to an HUB ?
In all books the reference tables are collegiated only to the Satellite....but why it's not collegiated to a HUB table?
Thanks a lot
If your reference data is categorising a business key already stored within a hub then yes it would make sense to store it as a satellite to that hub.
Reference tables however are for situations where there are no hubs because you need more information on a piece of data that exists within a satellite but that item doesn't justify being a business key in itself. Remember there's a strict rule of no satellite to satellite joins in Data Vault and so if you find yourself in this situation normally then you need not only a new satellite but a new link table to connect the data your reference code.
A common example would be dates. It's a regular need to have some sort of reference table for dates, effectively the equivalent of a date dimension that allows you to quickly categorise dates into years, quarters, financial periods and so on. Rather than having to create a link table for every satellite that holds a date and shift date keys in to it you can instead use a reference table for direct joins.
A reference table might also lean towards storage of meta data that doesn't fit into the normal data vault format at all, to give an example if you have a source system in your hubs you may want a reference table giving a more descriptive name to those source systems.
Related
I've read the data vault book end to end, but I'm still trying to resolve one specific thing related to how you'd populate the link tables (how to get all hashes for that). From the blog of scalefree: massively parallel processing, it demonstrates that satellites and hubs can be loaded in full parallel fashion, but it doesn't go into a lot of detail related to the link tables.
Links require hash keys, thus in some way 'business keys' from multiple tables to establish the relationships, that's what they do, they record relations between hubs. There aren't very good explanations or in-depth explanations how you would retrieve the business keys of related entities when populating these link tables.
For a specific table like 'customer' things are easy for hub and satellite: just convert the business key to a hash and load both of them in parallel.
But a customer details table or a transaction table from an OLTP need some kind of join to happen to look up the business key for the customer or to look up all the related entities in the transaction (product, customer, store, etc), because those tables do not typically store (all) business key(s) as an attribute in the table.
If I assume that staging is loaded incrementally and truncated, then staging doesn't necessarily have all the entities loaded to be able to perform joins there. How to resolve this dilemma and create a design that works?
Join on tables in the source OLTP systems to generate the business keys from there and propagate them as hashes from there? (this ends up wrong if the business key was chosen incorrectly)
Use a persistent staging area, so never truncate? (then it's always possible to join on any table in there to resolve)
Use some kind of index for surrogate keys -> business keys and perform a lookup from there? (minimizes I/O a bit further and is a mix between incremental staging and persistent staging).
some other method...?
Essentially, what is the best practice for generating the hashes for all foreign key relations of your OLTP systems?
I talked to an expert about this and this is the answer I accepted from him:
The only sensible two ways to produce hashes for tables that do not have all the columns necessary to produce a business key for that table is:
In the case where you have a full load of all the tables that have the business keys (yet maybe incremental for a link table), join to the relevant source tables having the business keys in staging. This is ok, because you can guarantee you have all the data in staging at that moment.
In the case where you have incremental loads for tables having business keys, you must use a persistent staging area (PSA) to do this for you.
It is considered bad practice to join tables in source system queries in order to generate the business keys. The reason is that the data warehouse should have as little operational impact as possible.
I'm an ETL developer that's currently being tasked with developing a type 2 SCD from existing historical data in a relational database. I'm perfectly capable of creating a type 2 SCD that's responsible for tracking future changes to the data, but I'm completely useless when it comes to the task at hand.
The relational model is in our ODS . Based on that relational model, I'm supposed to build flat records in our DW dimension. There are multiple attributes which need to be monitored for changes, each in specific related tables in the relational model. Historical changes must be kept on a daily basis, and if multiple changes to the same attribute occur on the same day, only the last subsists.
How can I tackle this? I'm lost. Thanks in advance.
P.S. we're talking tables with 20-30 million rows and multiple attributes that may change at any given time and therefore must result in a new record in the SCD.
This will indeed be painful. I'm assuming from your question that the tables containing the attribute values are currently varying independently (or you wouldn't need to ask the question).
If you have a table 'Table1' containing 'Key', 'Attribute1' and 'Effective From','Effective To' columns, then you can 'explode' that table into a virtual table in the form 'Key','Attribute1','Date', projecting out one row for every date where that attribute was current.
(Note that you probably don't want to do this as a ranged join against your date dimension, because this will be a Triangular Join (ie perform really badly), you probably need to explode the rows in an ETL tool/programmatically)
If you perform this process across multiple tables, you will have a set of tables giving you the full day-by-day snapshot of each attribute for every day that you care about. It's then fairly easy to join those tables based on 'FK' and 'Date' to give you the complete daily snapshot across all of the attribute values.
Then, of course, you need to run this though another process to collapse rows with the same Key, contiguous dates and all the same attribute values, ie 'unexplode' the rows, back into 'effective from','effective to' form. Note again, that this is fundamentally a row-by-row operation (or at very least a windowing function), and a set-based approach will perform very badly. Personally I'd just stream it all though some .net/java code to achieve this.
Given data volumes this will take a while, but should be achievable.
I want to be able to store an arbitrary C# struct with some variables on the Azure SQL server and retrieve it later into a similar struct. How can I do this without knowing the structure of the database?
SQL Azure is very similar to SQL Server, as you would build your schema, tables, rows, etc. the same way. If you wanted a schemaless approach to data types, you'd need to serialize your objects to some generic column, along with a Type column. Or use a Property table approach.
Alternatively, Windows Azure has a schema-free storage construct, the Windows Azure Table. Each row may contain different data. You'd just need some mechanism for determining the type of data you wrote (maybe one of the row properties, perhaps). Azure Tables are lightweight compared to SQL Azure, in that it's not a relational database. Each row is referenced by a Partition Key and Row Key (the pair being essentially a composite key).
So... assuming you don't have complex search / index requirements, you should be able to use Azure Tables to accomplish what you're trying to do.
This blog post goes over the basics of both SQL Azure and Azure Tables.
There are also examples of using Azure Tables in the Platform Training Kit.
I am aware of the generation of the Performance Counters and Diagnosis in webrole and worker-role in Azure.
My question is can I get the Performance Counter on a remote place or remote app, given the subscription ID and other certificates (3rd Party app to give performance Counter).
Question in other words, Can I get the Performance Counter Data, the way I use Service Management API for any hosted service...?
What are the pre-configurations required to be done in Server...? to get CPU data...???
Following is the description of the attributes for Performance counters table:
EventTickCount: Stores the tick count (in UTC) when the log entry was recorded.
DeploymentId: Id of your deployment.
Role: Role name
RoleInstance: Role instance name
CounterName: Name of the counter
CounterValue: Value of the performance counter
One of the key thing here is to understand how to effectively query this table (and other diagnostics table). One of the things we would want from the diagnostics table is to fetch the data for a certain period of time. Our natural instinct would be to query this table on Timestamp attribute. However that's a BAD DESIGN choice because you know in an Azure table the data is indexed on PartitionKey and RowKey. Querying on any other attribute will result in full table scan which will create a problem when your table contains a lot of data.
The good thing about these logs table is that PartitionKey value in a way represents the date/time when the data point was collected. Basically PartitionKey is created by using higher order bits of DateTime.Ticks (in UTC). So if you were to fetch the data for a certain date/time range, first you would need to calculate the Ticks for your range (in UTC) and then prepend a "0" in front of it and use those values in your query.
If you're querying using REST API, you would use syntax like:
PartitionKey ge '0<from date/time ticks in UTC>' and PartitionKey le '0<to date/time in UTC>'.
You could use this syntax if you're querying table storage in our tool Cloud Storage Studio, Visual Studio or Azure Storage Explorer.
Unfortunately I don't have much experience with the Storage Client library but let me work something out. May be I will write a blog post about it. Once I do that, I will post the link to my blog post here.
Gaurav
Since the performance counters data gets persisted in Windows Azure Table Storage (WADPerformanceCountersTable), you can query that table through a remote app (either by using Microsoft's Storage Client library or writing your own custom wrapper around Azure Table Service REST API to retrieve the data. All you will need is the storage account name and key.
not sure if the subject entirely conveys what I'm trying to achieve, but let me explain:
We are building an application that uses Oracle as storage backend. Each year, last years dataset will be "Archived", and a new instance created and populated from scratch.
What are the options to do this within the same schema?
Keep version information on a record level (we presume this will be too slow for our use-case).
Keep version information on a table level, so for each new version, we will re-create all the tables but with a new version prefix. (We like this solution, since we can do it all in code).
?
Is there not something like partitions/personalities/namespaces available that will allow us to achieve this in Oracle?
My oracle experience is rather limited, any assistance will be greatly appreciated!
The RDBMS conceptual model is not very good at maintaining temporal versions of data. So it is not just Oracle which is lacking in this regard.
I am unclear why you think keeping version information at the record level will be too slow. Too slow in creating a new version? Or too slow where it comes to data retrieval during regular operations?
Here is how you could do it. Given a table CUSTOMERS with a business key of CUSTOMER_REF I might normally build it like this (I am using abbreviated syntax rather than best practice for reasons of space):
create table customers
( id number not null primary key
, customer_ref number not null unique key
, name varchar2(30) not null )
/
The versioned equivalent would look like this:
create table customers
( id number not null primary key
, customer_ref number not null
, version_number number
, name varchar2(30) not null
, constraint whatever unique (customer_ref, version_number) )
/
This works by keeping the current version of VERSION_NUMBER null, and only populating it at archival time. Any lookup is going to have to include and version_number is null. This will be a bit of a pain and you may need to include the column in any additional indexes you build.
Obviously maintaining all versions of the records in the same table will increase the size of your tables, which might have an effect on performance. Oracle's Partitioning option can definitely help here. It also would give you a neat way of creating next year's set of data. However, it is a chargeable extra on top of the Enterprise License, so it is an expensive option. Find out more..
The most time consuming aspect of this will be managing foreign key relationships in the new version of the table. Presuming you choose to use synthetic primary keys, the archival process will have to generate new IDs and then painstakingly cascade them to their dependent records in the new versions of referencing foreign keys.
Thinking about this makes discreet tables for each version seem very attractive. For ease of use I would keep the current version un-prefixed, so that archiving becomes a process simply of
create table customers_n as select * from customers;
You might want to avoid downtime while creating the versioned tables. In that case you could use materialized views to capture the tables' state during the run-up to the archival switchover. When the clock strikes twelve you can switch off the refresh. (caveat: this is thinking on the fly, I have never done anything like this so try before you buy.)
One pertinent advantage of multiple tables (and Partitioning) is that you can move the archived records to a READ ONLY tablespace. This not only preserves them from unwanted change, it also means you can exclude them from subsequent backups.
edit
I notice you have commented that the archived data can occasionbally be amended. In taht case moving it to READ ONLY tablespaces is not a go-er.
The only thing I wil add to what APC said is regarding your asking for "namespaces".
A namespace in Oracle is a schema, whereby you can have the same object name(s) in each schema.
Of course this all depends on how your app must access multiple versions, but I would lean towards a different schema for each year before I would use some sort of naming convention to maintain versions of tables in the same schema. The reason is, eventually you will have a nightmares. At least with different schemas, all DDL can be the same, all references to objects will be the same, and tools like ER modellers and query tools will work within the context of that schema. Data models change, so at some point you may need to run some compare tools, and if all your tables are named funky with some sort of version postfix, that won't work well.
Add a schema can be copied / moved with export or data pump quickly using the fromuser/touser or remap_schema options, so you won't need much code, except to do any cleanup of last years data out of the new version.
I find schemas are very useful as "containers" and most apps I host only have schema level privileges, so I'm guaranteed the app can be easily and quickly moved from instance to instance, or multiple copies of the app can be hosted side-by-side on the same instance.
Might the schema change between years. For example, in 2010 you have fifteen columns but in 2011 you add a sixteenth.
If so, will the same application work on both 2010 and 2011 data.
If the schema is static, I'd go for table with a 'YEAR' column and use VPD/RLS/FGAC to apply a YEAR = '2010' predicate.
I'd only worry about partitioning if performance was a problem.
1) Interval partition it by year and some date field in the row.
2) Add it at the end of each table and populate it with a sequence and trigger.
3) Then partition by interval year on this col.