SCD Type 2 in Informatica - informatica-powercenter

Can any of you please elaborate on how to build an Informatica mapping for the inserts and updates to the target from the source table?
I would appreciate it if you could explain with an example.

Type 2 only INSERTS (new rows as well as updated rows)
Version Data Mapping:
The Type 2 Dimension/Version Data mapping filters source rows based on user-defined comparisons and inserts both new and changed dimensions into the target. Changes are tracked in the target table by versioning the primary key and creating a version number for each dimension in the table. In the Type 2 Dimension/Version Data target, the current version of a dimension has the highest version number and the highest incremented primary key of the dimension.
Use the Type 2 Dimension/Version Data mapping to update a slowly changing dimension table when you want to keep a full history of dimension data in the table. Version numbers and versioned primary keys track the order of changes to each dimension.
When you use this option, the Designer creates two additional fields in the target:
PM_PRIMARYKEY. The Integration Service generates a primary key for each row written to the target.
PM_VERSION_NUMBER. The Integration Service generates a version number for each row written to the target.
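The Designer generates this logic for you, but here is a minimal Python sketch of just the versioning idea (the dimension held as a list of dicts, the natural_key field, and the attribute names are all assumptions, not wizard output):
# Hypothetical sketch: every new or changed row is inserted with the next primary key,
# and changed rows get an incremented version number for their natural key.
def insert_version(dimension, natural_key, attrs):
    existing = [r for r in dimension if r["natural_key"] == natural_key]
    next_pk = max((r["PM_PRIMARYKEY"] for r in dimension), default=0) + 1
    next_version = max((r["PM_VERSION_NUMBER"] for r in existing), default=-1) + 1
    dimension.append({"natural_key": natural_key, "PM_PRIMARYKEY": next_pk,
                      "PM_VERSION_NUMBER": next_version, **attrs})

dim = []
insert_version(dim, "CUST1", {"city": "Pune"})    # first load -> version 0
insert_version(dim, "CUST1", {"city": "Mumbai"})  # change -> version 1, higher key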
Creating a Type 2 Dimension/Effective Date Range Mapping
The Type 2 Dimension/Effective Date Range mapping filters source rows based on user-defined comparisons and inserts both new and changed dimensions into the target. Changes are tracked in the target table by maintaining an effective date range for each version of each dimension in the target. In the Type 2 Dimension/Effective Date Range target, the current version of a dimension has a begin date with no corresponding end date.
Use the Type 2 Dimension/Effective Date Range mapping to update a slowly changing dimension table when you want to keep a full history of dimension data in the table. An effective date range tracks the chronological history of changes for each dimension.
When you use this option, the Designer creates the following additional fields in the target:
PM_BEGIN_DATE. For each new and changed dimension written to the target, the Integration Service uses the system date to indicate the start of the effective date range for the dimension.
PM_END_DATE. For each dimension being updated, the Integration Service uses the system date to indicate the end of the effective date range for the dimension.
PM_PRIMARYKEY. The Integration Service generates a primary key for each row written to the target.
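Again as a rough sketch only (plain Python, not the wizard-generated mapping; the row layout is assumed), the date-range bookkeeping works like this:
from datetime import datetime

# Hypothetical sketch: a change closes the open version (PM_END_DATE = system date)
# and inserts a new version whose PM_BEGIN_DATE is the system date and whose end date is open.
def apply_change(dimension, natural_key, attrs, next_pk):
    now = datetime.now()
    for row in dimension:
        if row["natural_key"] == natural_key and row["PM_END_DATE"] is None:
            row["PM_END_DATE"] = now              # end the effective range of the old version
    dimension.append({"natural_key": natural_key, "PM_PRIMARYKEY": next_pk,
                      "PM_BEGIN_DATE": now, "PM_END_DATE": None, **attrs})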
The Type 2 Dimension/Flag Current mapping
The Type 2 Dimension/Flag Current mapping filters source rows based on user-defined comparisons and inserts both new and changed dimensions into the target. Changes are tracked in the target table by flagging the current version of each dimension and versioning the primary key. In the Type 2 Dimension/Flag Current target, the current version of a dimension has a current flag set to 1 and the highest incremented primary key.
Use the Type 2 Dimension/Flag Current mapping to update a slowly changing dimension table when you want to keep a full history of dimension data in the table, with the most current data flagged. Versioned primary keys track the order of changes to each dimension.
When you use this option, the Designer creates two additional fields in the target:
PM_CURRENT_FLAG. The Integration Service flags the current row “1” and all previous versions “0.”
PM_PRIMARYKEY. The Integration Service generates a primary key for each row written to the target.
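The flag bookkeeping, sketched the same way (hypothetical row layout, not wizard output):
# Hypothetical sketch: the newest version is inserted with PM_CURRENT_FLAG = 1 and
# all earlier versions of the same dimension are updated to 0.
def flag_new_version(dimension, natural_key, attrs, next_pk):
    for row in dimension:
        if row["natural_key"] == natural_key:
            row["PM_CURRENT_FLAG"] = 0            # demote every previous version
    dimension.append({"natural_key": natural_key, "PM_PRIMARYKEY": next_pk,
                      "PM_CURRENT_FLAG": 1, **attrs})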

You can start by looking at the Definition of SCD type-2 here.
http://en.wikipedia.org/wiki/Slowly_changing_dimension#Type_2
This implementation is so common in data warehouses that Informatica actually provides you with a template to do it. You can just plug in your table names and attributes.
If you have Informatica installed, you can go to the following location in the help guide to see the detailed implementation logic.
Contents > Designer Guide > Using the mapping wizards > Creating a type 2 dimension.

Use a Router to define groups for UPDATE and INSERT. Pass the output of each group to an Update Strategy and then to the target; a rough sketch of the routing decision follows. HTH.
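These are Informatica transformations, not code, but here is a rough Python sketch of that routing decision, assuming a lookup keyed by the natural key and a list of columns to compare (all names here are illustrative):
# Hypothetical sketch of the Router groups: rows with no lookup hit go to the insert group,
# rows whose compared attributes changed go to the update group, unchanged rows are dropped.
def route(source_rows, lookup, compare_cols):
    inserts, updates = [], []
    for row in source_rows:
        current = lookup.get(row["id"])
        if current is None:
            inserts.append(row)       # group feeding the DD_INSERT Update Strategy
        elif any(row[c] != current[c] for c in compare_cols):
            updates.append(row)       # group feeding the DD_UPDATE / new-version branch
    return inserts, updates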

Related

The output table generated by aggregation differs between keyedTable and keyedStreamTable

When the output table for the aggregation is a keyedTable versus a keyedStreamTable, the results are different.
When the aggregation engine uses a keyedTable to receive the results, the aggregation output is received correctly, but the table cannot be used as a data source for a larger period; with a keyedStreamTable the aggregation does not take effect, and only the first record of each minute's ticks is captured.
The code executed by the GUI is as follows:
barColNames=`ActionTime`InstrumentID`Open`High`Low`Close`Volume`Amount`OpenPosition`AvgPrice`TradingDay
barColTypes=[TIMESTAMP,SYMBOL,DOUBLE,DOUBLE,DOUBLE,DOUBLE,INT,DOUBLE,DOUBLE,DOUBLE,DATE]
// Generate the 1-minute K-line table (barsMin01); it starts as an empty table.
// Choose ONE of the following two share statements; the results turn out to be inconsistent.
// Option 1: keyedTable -- aggregation works, but the table cannot be used as a data source for other periods.
share keyedTable(`ActionTime`InstrumentID,100:0, barColNames, barColTypes) as barsMin01
// Option 2: keyedStreamTable -- no aggregation effect; only the first tick of every minute is captured.
share keyedStreamTable(`ActionTime`InstrumentID,100:0, barColNames, barColTypes) as barsMin01
// define the aggregation metrics
metrics=<[first(LastPrice), max(LastPrice), min(LastPrice), last(LastPrice), sum(Volume), sum(Amount), sum(OpenPosition), sum(Amount)/sum(Volume)/300, last(TradingDay) ]>
// create the aggregation engine that generates the 1-minute K line
nMin01=1*60000
tsAggrKlineMin01 = createTimeSeriesAggregator(name="aggr_kline_min01", windowSize=nMin01, step=nMin01, metrics=metrics, dummyTable=ticks, outputTable=barsMin01, timeColumn=`ActionTime, keyColumn=`InstrumentID,updateTime=500, useWindowStartTime=true)
/////////// subscribe and the 1-min k line will be generated
subscribeTable(tableName="ticks", actionName="act_tsaggr_min01", offset=0, handler=append!{getStreamEngine("aggr_kline_min01")}, batchSize=1000, throttle=1, hash=0, msgAsTable=true)
There are some differences between keyedTable and keyedStreamTable:
keyedTable: When adding a new record to the table, the system will automatically check the primary key of the new record. If the primary key of the new record is the same as the primary key of the existing record, the corresponding record in the table will be updated.
keyedStreamTable: When adding a new record to the table, the system will automatically check the primary key of the new record. If the primary key of the new record is the same as the primary key of the existing record, the corresponding record will not be updated.
That is, one of them is for updating and the other is for filtering.
The keyedStreamTable behavior you mention ("does not play an aggregation role, but intercepts the first record of ticks data per minute") occurs exactly because you set updateTime=500 in createTimeSeriesAggregator. If updateTime is specified, the calculations may occur multiple times within the current window.
Since you use a keyedStreamTable here so that the result table can be subscribed to, updateTime cannot be used. If you want to force triggering, you can specify the forceTrigger parameter instead.

What happens when two updates for the same record come in one file while loading into the DB using Informatica?

Suppose I have a table xyz:

id  name  add    city  act_flg  start_dtm  end_dtm
1   amit  abc,z  pune  Y        21012018   null

and this table is loaded from a file using Informatica with SCD Type 2.
Suppose there is one file that contains two records with id=2, i.e.:

2 vipul abc,z mumbai
2 vipul asdf bangalore

So how will this be loaded into the DB?
It depends on how you are doing the SCD Type 2. If you are using a Lookup with a static cache, both records will be inserted with the end date as null.
The best approach in this scenario is to use a dynamic lookup cache and read your source data in such a way that the latest record is read last. This ensures that one record is expired with an end date and only one active record (i.e. end date is null) exists per id.
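Here is a rough plain-Python sketch (not a dynamic lookup cache; the column names are taken from the question) of what reading the latest record last buys you when the file carries two rows for id=2:
from datetime import datetime

target = []  # SCD2 target: at most one open row (end_dtm is None) per id

def load_row(row):
    now = datetime.now()
    for existing in target:
        if existing["id"] == row["id"] and existing["end_dtm"] is None:
            existing["end_dtm"] = now             # expire the previously active version
    target.append({**row, "start_dtm": now, "end_dtm": None})

# the two id=2 rows from the file, latest read last
load_row({"id": 2, "name": "vipul", "add": "abc,z", "city": "mumbai"})
load_row({"id": 2, "name": "vipul", "add": "asdf", "city": "bangalore"})
# result: the mumbai row is expired, only the bangalore row remains active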
Hmm, one of two possibilities depending on what you mean. If you mean that you're pulling data from different source systems which sometimes have the same ids, then it's easy: just stamp both the natural key (i.e. the id) and a source-system value on the dimension record, along with the arbitrary surrogate key which is unique to your target table (this is a data warehousing basic, so read Kimball).
If you mean that you are somehow tracking real-time changes to a single record in the source system and writing those changes to the input files of your ETL job, then you need to agree with your client whether they're happy for you to aggregate the changes based on the timestamp of the change and just pick the most recent one, or to create two records, one with its expiry datetime set and the other still open (which is the standard SCD approach; again, read Kimball).

How to write ETL test cases in Excel sheets

I do not have an exact idea of how to write ETL test cases. I covered the following 3 scenarios:
1. Source and target counts should be the same.
2. Check for duplicates in the target.
3. Column mapping between source and target.
How do I write a test case for the mapping? I am really confused. Please help, and please give me one sample test case.
Your test cases should include:
Total record count for Source and Target should be same. (Total distinct count if any duplicates)
Distinct Column count for all the common columns should be same.
Total null count for source and target common columns should be same.
Do source-minus-target and target-minus-source queries for the common columns (a sketch of these checks follows this list).
If your mapping has aggregates, then check that the sum of that column matches the target sum; group by month or year if you have a date column or key.
If your mappings have lookups, then left outer join the lookup table with the source and check that the same value was inserted (you can add this column to the distinct column count and the minus queries).
Check what transformations are in the mapping and see whether the results match the source.
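For a concrete starting point, here is a minimal pandas sketch of the count, null, and minus checks above; the CSV file names are hypothetical and assume you have exported the source and target extracts:
import pandas as pd

src = pd.read_csv("source_extract.csv")
tgt = pd.read_csv("target_extract.csv")
common = [c for c in src.columns if c in tgt.columns]

# 1. total and distinct record counts should match
print(len(src), len(tgt), len(src.drop_duplicates()), len(tgt.drop_duplicates()))

# 2. null counts per common column should match (empty output means they do)
print(src[common].isna().sum().compare(tgt[common].isna().sum()))

# 3. source minus target and target minus source on the common columns
diff = src[common].drop_duplicates().merge(tgt[common].drop_duplicates(),
                                           how="outer", indicator=True)
print(diff[diff["_merge"] == "left_only"])    # rows present in source but missing in target
print(diff[diff["_merge"] == "right_only"])   # rows present in target but not in source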
The test cases are written for the following test scenarios:
mapping doc validation, structure validation, constraint validation, data consistency issues, etc.
Examples of some test cases are as follows:
Validate the source structure and target table structure against corresponding mapping doc.
Source data type and target data type should be same.
Length of data types in both source and target should be equal.
Verify that field data types and formats are specified.
Data type lengths should be checked so that the target can hold the source values without truncation, and so on.
Just to add
Validate post processing
Validate late arriving dimensions
Validate Data access layer against reports and target tables
Inserts, updates, deletes, null records test in source/stg tables/config tables etc.

Oracle - build dimension from a file-based data source

I'm trying to build a star schema in Oracle 12c. In my case the data source is not a relational database but a single Excel/CSV file which is populated via a Google Form, which means I don't have any sort of reference from a source system such as auto-incrementing keys/ids. What would be the best approach to building a star schema under this condition?
File row sample:
<submitted timestamp>,<submitted by user>,<region>,<country>,<branch>,<branch location>,<branch area>,<branch type>,<branch name>,<branch private? yes/no value>,<the following would be all "fact" values (measurements)>,...,...,...
In case I wanted to build a "branch" dimension, how would I handle updates/inserts after the first load into the dimension table?
Thought solution so far:
I had thought of making a concatenated string "key" from the branch values, which would make it unique (an underscore would be the "glue" used to concatenate the values), e.g.:
<region>_<country>_<branch>_<branch location> as branch_key
I would insert all the distinct branches into a staging table, including the branch_key column for each of them. Then, when loading into the dimension, I could compare which keys do not yet exist in my dimension table and insert those. As for updates, I'm a bit stuck on how to handle them; I had thought of having another file mapping which branches are active, with an expiration date column. Basically I am trying to simulate what I could do if the data were stored in a database instead of CSV files.
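Roughly, in pandas terms, the load I have in mind would look like this (the export file name, the normalized column names such as branch_location, and the surrogate-key assignment are all assumptions):
import pandas as pd

src = pd.read_csv("google_form_export.csv")   # hypothetical export of the form responses

# build the concatenated natural key for the branch dimension
key_cols = ["region", "country", "branch", "branch_location"]
src["branch_key"] = src[key_cols].astype(str).agg("_".join, axis=1)

# distinct branches from the file (the "staging" step)
staging = src[key_cols + ["branch_key"]].drop_duplicates()

# dim_branch would normally be read from the Oracle dimension table; empty on first load
dim_branch = pd.DataFrame(columns=key_cols + ["branch_key", "branch_sk"])

# keep only keys not already present, then assign new surrogate keys
new_rows = staging[~staging["branch_key"].isin(dim_branch["branch_key"])].copy()
start_sk = int(dim_branch["branch_sk"].max()) + 1 if len(dim_branch) else 1
new_rows["branch_sk"] = range(start_sk, start_sk + len(new_rows))

dim_branch = pd.concat([dim_branch, new_rows], ignore_index=True)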
This is all I can think of so far; do you have any other recommendations or ideas on how to implement this? Take into consideration that the data source cannot change, as in I have to read these CSV files, since the data is not stored anywhere else.
Thank you.

Teradata: How to design a normalized table with many foreign key columns?

I am designing a table in Teradata with about 30 columns. These columns need to store several time-interval-style values such as Daily, Monthly, Weekly, etc. It is bad design to store the actual string values in the table since this would be an atrocious repeat of data. Instead, what I want to do is create a primitive lookup table. This table would hold Daily, Monthly, and Weekly, and would use Teradata's identity column to derive the primary key. This primary key would then be stored in the table I am creating as foreign keys.
This would work fine for my application since all I need to know is the primitive key value as I populate my web form's dropdown lists. However, other applications we use will need to either run reports or receive this data through feeds. Therefore, a view will need to be created that joins this table out to the primitives table so that it can actually return Daily, Monthly, and Weekly.
My concern is performance. I've never created a table with such a large amount of foreign key fields and am fairly new to Teradata. Before I go on the long road of figuring this all out the hard way, I'd like any advice I can get on the best way to achieve my goal.
Edit: I suppose I should add that this lookup table would be a mishmash of unrelated primitives. It would contain groups of values relating to time intervals as already mentioned above, but also time frames such as 24x7 and 8x5. The table would be designed like this:
ID   Type        Value
---  ----------  ----------
1    Interval    Daily
2    Interval    Monthly
3    Interval    Weekly
4    TimeFrame   24x7
5    TimeFrame   8x5
Edit Part 2: Added a new tag to get more exposure to this question.
What you've done should be fine. Obviously, you'll need to run the actual queries and collect statistics where appropriate.
One thing I can recommend is to have an additional row in the lookup table like so:
ID   Type        Value
---  ----------  ----------
0    Unknown     Unknown
Then in the main table, instead of having fields as null, you would give them a value of 0. This allows you to use inner joins instead of outer joins, which will help with performance.
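A small pandas sketch of the effect (the table contents are illustrative, not from the question): because every foreign key points at a real lookup row, including the 0/Unknown row, an inner join keeps all the rows that an outer join would otherwise have been needed for:
import pandas as pd

lookup = pd.DataFrame({
    "id": [0, 1, 2, 3, 4, 5],
    "type": ["Unknown", "Interval", "Interval", "Interval", "TimeFrame", "TimeFrame"],
    "value": ["Unknown", "Daily", "Monthly", "Weekly", "24x7", "8x5"],
})

# main table: a missing interval is stored as 0 instead of NULL
main = pd.DataFrame({"row_id": [10, 11, 12], "interval_id": [1, 0, 3]})

# an inner join now keeps every main row, because 0 always matches the Unknown row;
# with real NULLs an outer join would have been required to avoid dropping rows
report = main.merge(lookup, left_on="interval_id", right_on="id", how="inner")
print(report[["row_id", "value"]])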
