CloverETL: Compare two records - etl

I have two files, A and B. The records in both files share the same format, and the first n characters of a record form its unique identifier. Each record is of fixed-length format and consists of m fields (field1, field2, field3, ... fieldm). File B contains new records and records from file A that have changed. How can I use CloverETL to determine which fields have changed in a record that appears in both files?
Also, how can I gather metrics on the frequency of changes for individual fields? For example, I would like to know how many records had changes in fieldm.

This is a typical example of the Slowly Changing Dimension problem. A solution with CloverETL is described on their blog: Building Data Warehouse with CloverETL: Slowly Changing Dimension Type 1 and Building Data Warehouse with CloverETL: Slowly Changing Dimension Type 2.
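Independent of the CloverETL graph itself, the field-by-field comparison and the per-field change counters boil down to something like this plain-Java sketch (the key length, field widths, and file names below are made-up assumptions, not part of the original question):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;

    public class RecordDiff {

        // Hypothetical fixed-length layout: key = first KEY_LEN chars,
        // followed by m fields of fixed widths (example values only).
        static final int KEY_LEN = 10;
        static final int[] FIELD_WIDTHS = {10, 20, 15, 8};   // field1..field4

        public static void main(String[] args) throws IOException {
            // Index file A by its unique identifier (first KEY_LEN characters).
            Map<String, String> fileA = new HashMap<>();
            for (String rec : Files.readAllLines(Paths.get("A.txt"))) {
                fileA.put(rec.substring(0, KEY_LEN), rec);
            }

            // changeCounts[i] = number of records whose (i+1)-th field changed.
            int[] changeCounts = new int[FIELD_WIDTHS.length];

            for (String recB : Files.readAllLines(Paths.get("B.txt"))) {
                String recA = fileA.get(recB.substring(0, KEY_LEN));
                if (recA == null) {
                    continue;                    // brand new record, nothing to compare
                }
                int offset = KEY_LEN;            // fields assumed to follow the key
                for (int i = 0; i < FIELD_WIDTHS.length; i++) {
                    String a = recA.substring(offset, offset + FIELD_WIDTHS[i]);
                    String b = recB.substring(offset, offset + FIELD_WIDTHS[i]);
                    if (!a.equals(b)) {
                        changeCounts[i]++;       // field i+1 changed for this record
                    }
                    offset += FIELD_WIDTHS[i];
                }
            }

            for (int i = 0; i < changeCounts.length; i++) {
                System.out.println("field" + (i + 1) + " changed in " + changeCounts[i] + " records");
            }
        }
    }

Within a CloverETL graph, the same logic would typically sit in a transformation component keyed on the record identifier, with the counters written out as the change-frequency metrics.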

Related

How to get the sum of values of a column in tmap?

I have 2 columns: Matches (Integer) and Accounts_type (String). I want to create a third column holding the proportion of matches played by each account type. I am new to Talend and have been stuck on this for the past 2 days; I did a lot of research but to no avail. Please help.
You can do it like this:
You need to read your source data twice (I used tFixedFlowInput_1 and tFixedFlowInput_2 with the same data). The idea is to calculate the total of your matches in tAggregateRow_1 (it simply sums all Matches with no group-by column), then use that total as a lookup.
The tMap then joins your source data with the calculated total. Since the total is always a single record, you don't need any join column. You then simply divide Matches by Total as required.
This assumes you have unique values in Account_type; if you don't, you need to add another tAggregateRow between your source and tMap_1 in order to get the sum of Matches for each Account_type (group by Account_type).
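The arithmetic that the tAggregateRow/tMap combination performs can be sketched in plain Java as follows (the sample data and class name are made up; this is an illustration of the calculation, not Talend-generated code):

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class MatchProportions {
        public static void main(String[] args) {
            // Sample rows: Accounts_type -> Matches. Assumes one row per account type,
            // i.e. the extra tAggregateRow mentioned above has already been applied if needed.
            Map<String, Integer> matchesByType = new LinkedHashMap<>();
            matchesByType.put("PREMIUM", 30);
            matchesByType.put("FREE", 50);
            matchesByType.put("TRIAL", 20);

            // Step 1: the lookup total, i.e. what tAggregateRow_1 computes (sum, no group-by).
            int total = matchesByType.values().stream().mapToInt(Integer::intValue).sum();

            // Step 2: the tMap expression, Matches / Total, applied to every row.
            for (Map.Entry<String, Integer> row : matchesByType.entrySet()) {
                double proportion = (double) row.getValue() / total;
                System.out.printf("%s,%d,%.2f%n", row.getKey(), row.getValue(), proportion);
            }
        }
    }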

CDC strategy for multiple staging tables

I'm implementing a Data Mart following the Kimball methodology and I have a challenge with applying deltas from multiple source tables against a single target dimension.
Here's an example of the incoming source data:
STG_APPLICATION
APP_ID, APP_NAME, APP_START_DATE, CDC_HASH, ...
1, FOOBAR, 20/10/2018, MD5_XXX
STG_APPLICATION_STATUS
APP_ID, STATUS_CODE, STATUS_DESC, CDC_HASH, ...
1, SUBMITTED, "APP WAS SUBMITTED", MD5_YYY
Each of these tables (there are several others) represent a normalised version of the source data i.e. a single application can have one or more statuses associated with it.
Now then, because we only get a full alpha for these tables, we have to do a snapshot merge, i.e. apply a full outer join of the current day's set of records against the previous day's set for each individual table. The comparison is done on the CDC_HASH (a concatenation of all source columns). The result of this comparison is stored in a delta table as follows:
STG_APPLICATION_DELTA
APP_ID, APP_NAME, APP_START_DATE, CDC_HASH, CDC_STATUS ...
STG_APPLICATION_STATUS_DELTA
APP_ID, STATUS_CODE, STATUS_DESC, CDC_HASH, CDC_STATUS...
1, AWARDED, "APP WAS AWARDED", MD5_YYY, NEW
So in this example, the first table, STG_APPLICATION did not generate a delta record as the attributes pertaining to that table did not change between daily loads. However, the associated table, STG_APPLICATION_STATUS, did calculate a delta, i.e. one or more fields have changed since the last load. This is highlighted by the CDC_STATUS which identifies it as a new record to insert.
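Conceptually, that snapshot merge amounts to the comparison sketched below (a hedged plain-Java illustration; the helper names and the NEW/CHANGED/DELETED status values are assumptions, and the real job would run as SQL against the staging tables):

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class SnapshotMerge {

        // A staging row reduced to what the delta needs: the natural key and the CDC hash.
        record StgRow(String appId, String cdcHash, String payload) {}

        // Compare today's full snapshot with yesterday's and tag each key with a CDC status.
        static Map<String, String> delta(Map<String, StgRow> today, Map<String, StgRow> yesterday) {
            Map<String, String> status = new LinkedHashMap<>();
            for (StgRow row : today.values()) {
                StgRow prev = yesterday.get(row.appId());
                if (prev == null) {
                    status.put(row.appId(), "NEW");
                } else if (!prev.cdcHash().equals(row.cdcHash())) {
                    status.put(row.appId(), "CHANGED");
                }
                // equal hashes -> no delta row is produced at all
            }
            for (String appId : yesterday.keySet()) {
                if (!today.containsKey(appId)) {
                    status.put(appId, "DELETED");
                }
            }
            return status;
        }
    }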
The problem now of course is how to correctly handle this situation when loading the target dimension? For example:
DIM_APPLICATION
ID, APPLICATION_ID, APP_NAME, APP_START_DATE, APP_STATUS_CODE, FROM_DATE, TO_DATE
1, 1, FOOBAR, 20/10/2018, SUBMITTED, 20/10/2018, 12/04/2019
2, 1, NULL, NULL, AWARDED, 13/04/2019, 99/99/9999
This shows the first record - based on these two staging tables being joined - and the second record which is meant to reflect an updated version of the record. However, as previously illustrated, my Delta tables are only partially populated, and therefore I am unable to correctly update the dimension as shown here.
Logically, I understand that I need to be able to include all fields that are used by the dimension as part of my delta calculation, so that I have a copy of a full record when updating the dimension, but I'm not sure of the best way to implement this in my staging area. As shown already, I currently only have independent staging tables, each of which calculate their delta separately.
Please can somebody advise on the best way to handle this? I've scrutinized Kimball's books on this but to no avail, and I've equally found no suitable answer on any online forums. This is a common problem, so I'm sure there exists a suitable architectural pattern to resolve it.
You will need to either compare on joined records or lookup the current dimension values.
If the amount of (unchanged) data is not excessive, you could join the full snapshots of STG_APPLICATION and STG_APPLICATION_STATUS together on APP_ID until they resemble the dimension record column-wise and store those in a separate table with their CDC hash to use as previous day. You then take the deltas at this level and send the (complete) changed records as updates to the dimension.
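A rough sketch of this first option, with hypothetical record types and an assumed MD5 hash over the combined columns (plain Java standing in for whatever the actual load would use):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class DimensionGrainDelta {

        record App(String appId, String appName, String appStartDate) {}
        record AppStatus(String appId, String statusCode) {}
        record DimRecord(String appId, String appName, String appStartDate, String statusCode) {}

        // Join STG_APPLICATION and STG_APPLICATION_STATUS on APP_ID so each row already
        // carries every column the dimension needs (assumes one current status per app).
        static List<DimRecord> toDimensionGrain(List<App> apps, Map<String, AppStatus> statusByAppId) {
            return apps.stream()
                    .map(a -> new DimRecord(a.appId(), a.appName(), a.appStartDate(),
                            statusByAppId.get(a.appId()).statusCode()))
                    .collect(Collectors.toList());
        }

        // Hash the combined record; a change in either source table now changes this hash,
        // so the delta is taken once, at the grain the dimension actually uses.
        static String combinedHash(DimRecord r) throws Exception {
            String concat = r.appId() + "|" + r.appName() + "|" + r.appStartDate() + "|" + r.statusCode();
            byte[] digest = MessageDigest.getInstance("MD5").digest(concat.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        }
    }

Because the hash covers every column the dimension needs, the delta record is already complete when it reaches the dimension.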
If the amount of records in the daily update makes it impractical to join the full tables, you can take the deltas and full outer join them as you do now. Then you look up the current dimension record for this APP_ID and fill in all empty fields in the delta record. The completed record is then sent as an update to the dimension.
This solution requires less storage but seems more fragile, especially if multiple changes are possible within a day. If there are many changes, performance may also suffer. For a handful of changes in millions of records, it should be more efficient.

How to write etl test cases in excel sheets

I do not have an exact idea of how to write ETL test cases. I came up with the following 3 scenarios:
1. Source and target counts should be the same.
2. Check for duplicates in the target.
3. Column mapping between source and target.
How do I write a test case for the mapping? I am really confused. Please help, and please give me one sample test case.
Your test cases should include:
Total record count for source and target should be the same (total distinct count if there are any duplicates).
Distinct counts for each of the common columns should be the same.
The total null count for the common columns should be the same in source and target.
Do a source-minus-target and a target-minus-source on the common columns (see the sketch after this list).
If your mapping has aggregates, check that the sum of that column matches the target sum; group by month and year if you have a date column or key.
If your mappings have lookups, left outer join that table with the source and check whether the same value was inserted. (You can add this column to the distinct-count and minus queries.)
Check what transformations are in the mapping and see if the result matches the source.
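As a minimal illustration of the count and minus checks above, with each row reduced to a delimited string of the common columns (in practice these would be SQL queries against source and target; the class and method names are made up):

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class BasicEtlChecks {

        // Each row is represented as a single delimited string of the common columns.
        static void compare(List<String> sourceRows, List<String> targetRows) {
            // 1. Total (distinct) record counts should match.
            Set<String> src = new HashSet<>(sourceRows);
            Set<String> tgt = new HashSet<>(targetRows);
            System.out.println("source distinct count = " + src.size()
                    + ", target distinct count = " + tgt.size());

            // 2. Source minus target: rows loaded incorrectly or not loaded at all.
            Set<String> srcMinusTgt = new HashSet<>(src);
            srcMinusTgt.removeAll(tgt);

            // 3. Target minus source: rows that appeared in the target from nowhere.
            Set<String> tgtMinusSrc = new HashSet<>(tgt);
            tgtMinusSrc.removeAll(src);

            System.out.println("source minus target: " + srcMinusTgt);
            System.out.println("target minus source: " + tgtMinusSrc);
        }
    }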
Test cases are written for the following test scenarios:
mapping doc validation, structure validation, constraint validation, data consistency issues, etc.
Examples of some test cases are as follows:
Validate the source structure and target table structure against the corresponding mapping doc.
Source data type and target data type should be the same.
The lengths of the data types in source and target should be equal.
Verify that field data types and formats are specified.
The source data type length should not be less than the target data type length, etc.
Just to add
Validate post processing
Validate late arriving dimensions
Validate Data access layer against reports and target tables
Inserts, updates, deletes, and null-record tests in source/staging/config tables, etc.

Oracle - build dimension from a file based data source

I'm trying to build a star schema in Oracle 12c. In my case the data source is not a relational database but a single Excel/CSV file populated via a Google Form, which means I don't have any reference from a source system such as auto-incremented keys/IDs. What would be the best approach to building a star schema given this condition?
File row sample:
<submitted timestamp>,<submitted by user>,<region>,<country>,<branch>,<branch location>,<branch area>,<branch type>,<branch name>,<branch private? yes/no value>,<the following would be all "fact" values (measurements)>,...,...,...
In case I wanted to build a "branch" dimension, how would I handle updates/inserts after the first load into the dimension table?
Thought solution so far:
I had thought of making a concatenated string "key" from the branch values, which would make it unique (an underscore would be the "glue" to concatenate the values), e.g.:
<region>_<country>_<branch>_<branch location> as branch_key
I would insert all the distinct branches into a staging table, including the branch_key column for each of them. Then, when loading into the dimension, I could check which keys do not yet exist in my dimension table and insert them. As for updates, I'm a bit stuck on how to handle them; I had thought of having another file mapping which branches are active, with an expiration date column. Basically, I'm trying to simulate what I could do if the data were in a database instead of CSV files.
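A rough sketch of that idea, with hypothetical field names and plain Java standing in for whatever would actually perform the load:

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class BranchDimensionLoad {

        // One parsed CSV row, reduced to the branch attributes (hypothetical field names).
        record CsvRow(String region, String country, String branch, String branchLocation) {}

        // The natural key described above: the branch attributes glued together with underscores.
        static String branchKey(CsvRow r) {
            return r.region() + "_" + r.country() + "_" + r.branch() + "_" + r.branchLocation();
        }

        // Insert-if-missing: existingKeys would come from the branch dimension table,
        // newRows from the staging load of today's CSV file.
        static List<CsvRow> newBranches(List<CsvRow> newRows, Set<String> existingKeys) {
            Set<String> seen = new HashSet<>(existingKeys);
            return newRows.stream()
                    .filter(r -> seen.add(branchKey(r)))   // also de-duplicates within the file
                    .toList();
        }
    }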
This is all I can think of so far. Do you have any other recommendations/ideas on how to implement this? Take into consideration that the data source cannot change, i.e. I have to read these CSV files, since the data is not stored anywhere else.
Thank you.

Hadoop map-reduce : Order of records while grouping

I have a record in each line of input and each record has around 10 fields. First, I group the records by three fields (field1, field2, field3) thus one mapper/reducer is responsible for one unique group (based on the three fields). Within each group, I sort the records based on another integer field timestamp and I tag each record in the group with the same tag aTag by adding another field.
Let's say that in mapper#1 I tag a sorted group as aTag, and in mapper#2 I tag another group (a different group, because I initially grouped the records based on the three fields) with the same tag aTag.
Now, if I group the records based on the tag field (i.e., grouping the groups from different mappers), I notice that the ordering within each group is no longer preserved. I was expecting that, since each mapper has a group with all records having the same tag, grouping by the tag name should just involve getting the relevant groups from the other mappers and concatenating them, without re-ordering each individual group.
Is it because I am trying to store the records in gzip format, and hence it tries to re-order the records for better compression? Also, I would like to know how to preserve the order after grouping by the tag name.
It seems that you are trying to implement the sort step of MapReduce yourself in local memory, but the framework then ignores what you did and re-sorts the items in each group anyway. The proper way to fix this is to specify a comparator on the keys, so that within each partition the merged input to the reducer is ordered according to that comparison function. This means that:
You don't have to do the sorting yourself
You don't run out of memory on one machine trying to sort a really large group.
It seems that in your case you'd want to add the timestamp to the set of keys, tell it to partition on the first three keys, and tell it to sort on the timestamp.
For more information, see Where is Sort used in MapReduce phase and why?
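As a sketch of that wiring in Hadoop's Java API, assuming a Text key laid out as field1|field2|field3|timestamp with zero-padded timestamps, so that the default lexicographic key sort already yields timestamp order inside each group:

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class SecondarySort {

        // The grouping prefix: everything before the trailing timestamp.
        private static String groupPart(String key) {
            return key.substring(0, key.lastIndexOf('|'));
        }

        // Partition on the three grouping fields only, so all timestamps of a group
        // reach the same reducer regardless of which mapper produced them.
        public static class GroupPartitioner extends Partitioner<Text, Text> {
            @Override
            public int getPartition(Text key, Text value, int numPartitions) {
                return (groupPart(key.toString()).hashCode() & Integer.MAX_VALUE) % numPartitions;
            }
        }

        // Group reducer input on the three fields while the full key (including the
        // timestamp) is what gets sorted, giving timestamp order inside each group.
        public static class GroupComparator extends WritableComparator {
            protected GroupComparator() {
                super(Text.class, true);
            }
            @Override
            public int compare(WritableComparable a, WritableComparable b) {
                return groupPart(a.toString()).compareTo(groupPart(b.toString()));
            }
        }

        // Job wiring (fragment):
        //   job.setPartitionerClass(SecondarySort.GroupPartitioner.class);
        //   job.setGroupingComparatorClass(SecondarySort.GroupComparator.class);
    }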
