How to write ETL test cases in Excel sheets - etl

I do not have an exact idea of how to write ETL test cases. I came up with the following 3 scenarios:
1. Source and target record counts should be the same.
2. Check for duplicates in the target.
3. Column mapping between source and target.
How do I write a test case for the mapping? I am really confused. Please help, and please give me one sample test case.

Your test cases should include:
Total record count for source and target should be the same (total distinct count if there are any duplicates).
Distinct column counts for all the common columns should be the same.
Total null counts for the common source and target columns should be the same.
Run source minus target and target minus source queries for the common columns (see the SQL sketch after this list).
If your mapping has aggregates, check that the sum of that column matches the target sum. Group it by month and year if you have a date column or key.
If your mappings have lookups, left outer join that table with the source and check that the same value was inserted. (You can add this column to the distinct column count and minus queries.)
Check which transformations are in the mapping and verify that the result matches the source.
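For example, a few of these checks can be written as SQL queries along the following lines. The table and column names are only placeholders for your own source and target, and MINUS is Oracle syntax (use EXCEPT on other databases):

-- Record counts should match (switch to COUNT(DISTINCT key_col) if duplicates are expected)
SELECT COUNT(*) FROM source_table;
SELECT COUNT(*) FROM target_table;

-- Rows present in source but missing in target, and vice versa
SELECT common_col1, common_col2 FROM source_table
MINUS
SELECT common_col1, common_col2 FROM target_table;

SELECT common_col1, common_col2 FROM target_table
MINUS
SELECT common_col1, common_col2 FROM source_table;

-- Aggregate check, grouped by month of a date column
SELECT TRUNC(order_date, 'MM') AS month_key, SUM(amount) AS total_amount
FROM   source_table
GROUP  BY TRUNC(order_date, 'MM');
-- run the same query against target_table and compare the results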

The test cases are written around the following test scenarios:
Mapping doc validation, structure validation, constraint validation, data consistency issues, etc.
Examples of some test cases are as follows:
Validate the source structure and target table structure against the corresponding mapping doc.
Source data type and target data type should be the same.
Length of data types in both source and target should be equal.
Verify that field data types and formats are specified.
Source data type length should not be less than the target data type length, etc.
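If both the source and the target are Oracle tables, one way to sketch these structure checks is to compare the data dictionary entries (SRC_TABLE and TGT_TABLE are assumed names, not anything from the mapping doc):

-- Columns whose data type or length differs between source and target
SELECT s.column_name,
       s.data_type   AS src_type, t.data_type   AS tgt_type,
       s.data_length AS src_len,  t.data_length AS tgt_len
FROM   all_tab_columns s
JOIN   all_tab_columns t ON t.column_name = s.column_name
WHERE  s.table_name = 'SRC_TABLE'
AND    t.table_name = 'TGT_TABLE'
AND   (s.data_type <> t.data_type OR s.data_length <> t.data_length);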

Just to add:
Validate post processing.
Validate late arriving dimensions.
Validate the data access layer against reports and target tables.
Test inserts, updates, deletes and null records in source/stg tables, config tables etc. (a short SQL sketch follows).
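A minimal sketch of the null record and delete checks, with placeholder table and column names:

-- Unexpected nulls in key columns of a staging table
SELECT COUNT(*) AS null_key_rows
FROM   stg_orders
WHERE  order_id IS NULL OR customer_id IS NULL;

-- Rows deleted from the staging/source feed that are still present in the target
SELECT t.order_id
FROM   target_orders t
LEFT JOIN stg_orders s ON s.order_id = t.order_id
WHERE  s.order_id IS NULL;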

Related

Power query - strategy for handling repeating rows

Given a report which has a table with repeated row headings, is there a good strategy for using Power Query/M to extract the data in a clean format?
For example, the report available here has an Excel file (which at the time of writing points to August 2021):
https://www.opec.org/opec_web/static_files_project/media/downloads/publications/MOMR%20Appendix%20Tables%20(August%202021).xlsx
In this example:
we have the World demand table portion
and the Non-OPEC Liquids production portion
both of these have the rows Americas/Europe/Asia Pacific,
which makes it hard to distinguish them in Power Query.
What is the right approach to allow extraction of data from this type of table?
I would add a custom column (Add Column > Custom Column) with the formula
= if [2018] = null then [Column] else null
and then right-click the new column and choose Fill Down.
That would put World Demand and Non-OPEC as a column that you could additionally filter on.

Performance in Elasticsearch

I am just beginning with Elasticsearch.
I have two cases of data in a relational database, but in both cases I want to find the records from the first table as quickly as possible.
Case 1: tables bound 1:n (for example Invoice - Invoice items)
Should I save all the rows from the child table to Elasticsearch, or save the master_id and group all the data from the child table into a single string?
Case 2: tables bound n:1 (for example Invoice - Customer)
Should I save the data as in case 1, to an independent index, or add another column to the previous index?
The problem is that sometimes I only need to search for records that contain a specific invoice item, sometimes a specific customer, and sometimes both an invoice item and a customer.
Should I create one index containing all the data, or all 3 variants?
Another question: is it possible to speed up the search in Elasticsearch somehow when the stored data is, for example, only an EAN (13-digit number) rather than plain text?
Thanks,
Jaroslav
You should denormalize and just use a single index for all your data (invoices, items and customers) for the best performance. Although Elasticsearch supports joins and parent-child relationships, their performance is nowhere near what you get when all the data is part of a single index, and a quick benchmark test on your data will prove it easily.

Oracle - build dimension from a file based data source

I'm trying to build a star schema in Oracle 12c. In my case the data source is not a relational database but a single Excel/CSV file which is populated via a Google Form, which means I don't have any sort of reference from a source system such as auto-incrementing keys/IDs. Now what would be the best approach to build a star schema given this condition?
File row sample:
<submitted timestamp>,<submitted by user>,<region>,<country>,<branch>,<branch location>,<branch area>,<branch type>,<branch name>,<branch private? yes/no value>,<the following would be all "fact" values (measurements),...,...,...
In case I wanted to build a "branch" dimension, how would I handle updates/inserts after the first load into the dimension table?
Thought solution so far:
I had thought of making a concatenated string "key" with the branch values, which would make it unique (underscore would be the "glue" to concatenate the values), eg:
<region>_<country>_<branch>_<branch location> as branch_key
I would insert all the distinct branches into a staging table, including the branch_key column for each one of them; then when loading into the dimension I could work out which keys do not exist yet in my dimension table and insert them. As for updates, I'm a bit stuck on how to handle those; I had thought of having another file mapping which branches are active, with an expiration date column. Basically I am trying to simulate what I could do if the data were in a database instead of CSV files.
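In SQL terms, the load I have in mind would look roughly like this (the table and column names below are only placeholders):

-- Build the concatenated natural key in a staging table
INSERT INTO stg_branch (branch_key, region, country, branch, branch_location)
SELECT DISTINCT region || '_' || country || '_' || branch || '_' || branch_location,
       region, country, branch, branch_location
FROM   ext_form_rows;   -- external table (or otherwise loaded) CSV rows

-- Insert only the branches whose key is not in the dimension yet
INSERT INTO dim_branch (branch_sk, branch_key, region, country, branch, branch_location)
SELECT dim_branch_seq.NEXTVAL, s.branch_key, s.region, s.country, s.branch, s.branch_location
FROM   stg_branch s
WHERE  NOT EXISTS (SELECT 1 FROM dim_branch d WHERE d.branch_key = s.branch_key);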
This is all I can think of so far, do you have any other recommendations/ideas on how to implement this? Take into consideration that the data source cannot change, as in I have to read these CSV files, since the data is not stored anywhere else.
Thank you.

CloverETL: Compare two records

I have two files, A and B. The records in both files share the same format and the first n characters of a record is its unique identifier. The record is of fixed length format and consists of m fields (field1, field2, field3, ...fieldm). File B contains new records and records in file A that have changed. How can I use cloverETL to determine which fields have changed in a record that appears in both files?
Also, how can I gather metrics on the frequency of changes for individual fields? For example, I would like to know how many records had changes in fieldm.
This is a typical example of the Slowly Changing Dimension problem. A solution with CloverETL is described on their blog: Building Data Warehouse with CloverETL: Slowly Changing Dimension Type 1 and Building Data Warehouse with CloverETL: Slowly Changing Dimension Type 2.
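If the two files are first loaded into staging tables, the same comparison can also be sketched in SQL (file_a, file_b, record_id and the field names are placeholders; note that NULL values need extra handling with the <> operator):

-- Which fields changed for records present in both files
SELECT a.record_id,
       CASE WHEN a.field1 <> b.field1 THEN 1 ELSE 0 END AS field1_changed,
       CASE WHEN a.fieldm <> b.fieldm THEN 1 ELSE 0 END AS fieldm_changed
FROM   file_a a
JOIN   file_b b ON b.record_id = a.record_id;

-- Frequency of changes per field
SELECT SUM(CASE WHEN a.field1 <> b.field1 THEN 1 ELSE 0 END) AS field1_changes,
       SUM(CASE WHEN a.fieldm <> b.fieldm THEN 1 ELSE 0 END) AS fieldm_changes
FROM   file_a a
JOIN   file_b b ON b.record_id = a.record_id;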

SCD TYPE2 in informatica

Can anyone please elaborate on how to build the Informatica mapping for inserts and updates to the target from a source table?
I would appreciate it if you could explain with an example.
Type 2 only INSERTS (new rows as well as updated rows).
Version Data Mapping:
The Type 2 Dimension/Version Data mapping filters source rows based on user-defined comparisons and inserts both new and changed dimensions into the target. Changes are tracked in the target table by versioning the primary key and creating a version number for each dimension in the table. In the Type 2 Dimension/Version Data target, the current version of a dimension has the highest version number and the highest incremented primary key of the dimension.
Use the Type 2 Dimension/Version Data mapping to update a slowly changing dimension table when you want to keep a full history of dimension data in the table. Version numbers and versioned primary keys track the order of changes to each dimension.
When you use this option, the Designer creates two additional fields in the target:
PM_PRIMARYKEY. The Integration Service generates a primary key for each row written to the target.
PM_VERSION_NUMBER. The Integration Service generates a version number for each row written to the target.
Creating a Type 2 Dimension/Effective Date Range Mapping
The Type 2 Dimension/Effective Date Range mapping filters source rows based on user-defined comparisons and inserts both new and changed dimensions into the target. Changes are tracked in the target table by maintaining an effective date range for each version of each dimension in the target. In the Type 2 Dimension/Effective Date Range target, the current version of a dimension has a begin date with no corresponding end date.
Use the Type 2 Dimension/Effective Date Range mapping to update a slowly changing dimension table when you want to keep a full history of dimension data in the table. An effective date range tracks the chronological history of changes for each dimension.
When you use this option, the Designer creates the following additional fields in the target:
PM_BEGIN_DATE. For each new and changed dimension written to the target, the Integration Service uses the system date to indicate the start of the effective date range for the dimension.
PM_END_DATE. For each dimension being updated, the Integration Service uses the system date to indicate the end of the effective date range for the dimension.
PM_PRIMARYKEY. The Integration Service generates a primary key for each row written to the target.
The Type 2 Dimension/Flag Current mapping
The Type 2 Dimension/Flag Current mapping filters source rows based on user-defined comparisons and inserts both new and changed dimensions into the target. Changes are tracked in the target table by flagging the current version of each dimension and versioning the primary key. In the Type 2 Dimension/Flag Current target, the current version of a dimension has a current flag set to 1 and the highest incremented primary key.
Use the Type 2 Dimension/Flag Current mapping to update a slowly changing dimension table when you want to keep a full history of dimension data in the table, with the most current data flagged. Versioned primary keys track the order of changes to each dimension.
When you use this option, the Designer creates two additional fields in the target:
PM_CURRENT_FLAG. The Integration Service flags the current row “1” and all previous versions “0.”
PM_PRIMARYKEY. The Integration Service generates a primary key for each row written to the target.
You can start by looking at the Definition of SCD type-2 here.
http://en.wikipedia.org/wiki/Slowly_changing_dimension#Type_2
This implementation is so common in data warehouses that Informatica actually provides you with the template to do so. You can just "plug-in" your table names and the attributes.
If you have informatica installed, you can go to the following location in the help guide to see the detailed implementation logic.
Contents > Designer Guide > Using the mapping wizards > Creating a type 2 dimension.
Use a Router to define groups for UPDATE and INSERT. Pass the output of each group to an Update Strategy and then to the target. HTH.
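For reference, the core of this Type 2 logic corresponds roughly to the following SQL. This is only a sketch that blends the flag-current and effective-date-range variants; dim_customer, src_customer, the attribute columns and the sequence are assumptions rather than objects the wizard generates (Informatica populates PM_PRIMARYKEY, PM_BEGIN_DATE, PM_END_DATE and PM_CURRENT_FLAG for you):

-- Close the current version of every dimension row whose attributes changed
UPDATE dim_customer d
SET    pm_end_date = SYSDATE,
       pm_current_flag = 0
WHERE  d.pm_current_flag = 1
AND    EXISTS (SELECT 1
               FROM   src_customer s
               WHERE  s.customer_id = d.customer_id
               AND   (s.name <> d.name OR s.city <> d.city));

-- Insert a new current version for new and changed customers
INSERT INTO dim_customer
       (pm_primarykey, customer_id, name, city,
        pm_begin_date, pm_end_date, pm_current_flag)
SELECT dim_customer_seq.NEXTVAL, s.customer_id, s.name, s.city,
       SYSDATE, NULL, 1
FROM   src_customer s
LEFT JOIN dim_customer d
       ON d.customer_id = s.customer_id
      AND d.pm_current_flag = 1
WHERE  d.customer_id IS NULL
   OR  s.name <> d.name
   OR  s.city <> d.city;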
