Data validation for millions of rows in ETL testing

How do we validate data for millions of records in an ETL testing process in one go (without doing validation only on sample records between source and target)?

Set up test data that covers all the logic scenarios in your ETL code, including field values that are transformed and field values that should remain untouched.
Perform record counts on your sources and targets to ensure that the correct number of records is being moved; a sketch of this kind of check is shown below.
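For the count step, a minimal SQL sketch might look like the one below. The schema and column names (src.orders, tgt.orders, order_amount) are hypothetical, and it assumes source and target are reachable from one connection; on Oracle, MINUS replaces EXCEPT.

    -- Compare row counts and a simple column checksum between source and target
    -- instead of sampling rows.
    SELECT
        (SELECT COUNT(*) FROM src.orders) AS source_count,
        (SELECT COUNT(*) FROM tgt.orders) AS target_count,
        (SELECT SUM(order_amount) FROM src.orders) AS source_amount_sum,
        (SELECT SUM(order_amount) FROM tgt.orders) AS target_amount_sum;

    -- Full-volume minus check: rows present in source but missing or altered
    -- in target (untransformed columns should match exactly).
    SELECT order_id, customer_id, order_amount FROM src.orders
    EXCEPT
    SELECT order_id, customer_id, order_amount FROM tgt.orders;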

Related

Which NiFi processor to use for RDBMS Extract

I will explain my use case to help determine which DB extract utility to use.
I need to extract data from SQL Server tables with varying frequency each day. Each extract query is a complex SQL statement involving 5-10 tables in joins with multiple clauses. I have around 20-30 such statements overall.
All these extract queries might be required to run multiple times a day, with varying frequencies each day, depending on how many times we receive data from the source system, among other factors.
We are planning to use Kafka to publish a message to let the NiFi workflow know whenever an RDBMS table is updated and the flow needs to be triggered (I can't just trigger the NiFi flow based on an "incremental" column value; there might be all-row-update scenarios and we might not create new rows in the tables).
How should I go about designing my NiFi flow? There are all sorts of processors available (ExecuteSQL, GenerateTableFetch, ExecuteSQLRecord, QueryDatabaseTable). Which one will fit my requirement best?
Thanks!
I suggest using ExecuteSQL. You can set the query from an attribute or compose it using attributes. The easiest way is to create JSON, then parse that JSON and create attributes from it. Check this example (link): there the SQL is created from a file, and you can adjust it to create it from Kafka instead.
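As a rough illustration only (the Kafka payload, the attribute names, and the table/column names below are all hypothetical): a message such as {"table": "orders", "last_update": "2021-06-01 10:15:00"} consumed from Kafka could be parsed with EvaluateJsonPath into flowfile attributes, and ExecuteSQL's SQL query property, which supports NiFi Expression Language, would substitute those attributes at run time.

    -- Query placed in ExecuteSQL; ${extract.last_update} is a hypothetical
    -- flowfile attribute created from the Kafka JSON by EvaluateJsonPath.
    SELECT o.order_id,
           o.customer_id,
           c.customer_name,
           o.order_amount
    FROM   dbo.orders o
           JOIN dbo.customers c ON c.customer_id = o.customer_id
    WHERE  o.updated_at >= '${extract.last_update}'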

Oracle Data Integrator (ODI) Error Handling

Is there any error handling mechanism in ODI?
I am trying to handle a scenario where ODI loads the bad data into an error table when it fails to transform the source data and insert it into the target table, so that the process will not stop even if there is a change in the incoming data format.
Most of the Integration Knowledge Modules (IKMs) have an option to enable or disable Flow Control. When Flow Control is enabled, these main steps occur:
The data will first be inserted into a temporary table which has the same structure as the target table. These tables are prefixed by I$_ by default.
All the conditions (constraints) defined in the model for the target datastore will be checked.
The rows failing the conditions will be inserted in an error table with some information about the loading time, the condition which has been broken and all the data of the row. These tables are prefixed by E$_.
The rows passing the conditions will be inserted/updated in the target table.
Needless to say, enabling Flow Control will affect the performance of your loading, as there is an extra insert and some constraint checks. But if catching data quality issues is needed, it's a great feature that is easy to implement.
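Roughly, those steps follow a pattern like the simplified sketch below. This is not the exact SQL ODI generates; the table, column and constraint names are made up for illustration.

    -- 1) Load the flow table I$_ with the transformed source data.
    INSERT INTO I$_TARGET_CUSTOMER (cust_id, cust_name, birth_date)
    SELECT s.id, UPPER(s.name), s.dob
    FROM   STG_CUSTOMER s;

    -- 2) Copy rows breaking a condition into the E$_ error table, recording
    --    which constraint failed and when.
    INSERT INTO E$_TARGET_CUSTOMER (err_cond, err_date, cust_id, cust_name, birth_date)
    SELECT 'CK_BIRTH_DATE_NOT_NULL', CURRENT_TIMESTAMP, i.cust_id, i.cust_name, i.birth_date
    FROM   I$_TARGET_CUSTOMER i
    WHERE  i.birth_date IS NULL;

    -- 3) Insert only the rows that passed every check into the real target.
    INSERT INTO TARGET_CUSTOMER (cust_id, cust_name, birth_date)
    SELECT i.cust_id, i.cust_name, i.birth_date
    FROM   I$_TARGET_CUSTOMER i
    WHERE  i.birth_date IS NOT NULL;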

Historical Data Comparison in realtime - faster in SQL or code?

I have a requirement in the project I am currently working on to compare the most recent version of a record with the previous historical record to detect changes.
I am using the Azure offline data sync framework to transfer data from a client device to the server, which causes records in the synced table to update based on user changes. I then have a trigger that copies each update into a history table, and a SQL query that runs when building a list of changes: it compares the current record with the most recent historical record column by column (mainly strings, but also some integer and date values).
Is this the most efficient way of achieving this? Would it be quicker to load the data into memory and perform a code based comparison with rules?
Also, if I continually store all the historical data in a SQL table, will this affect performance over time, and would I be better off storing this data in something like Azure Table Storage? I am also thinking along the lines of cost, as SQL usage is much more expensive than Table Storage, but obviously I could not use a trigger and would need to insert each synced row into Table Storage manually.
You could avoid querying and comparing the historical data altogether, because the most recent version is already in the main table (and if it's not, it will certainly be new/changed data).
Consider a main table with 50,000 records and 1,000,000 records of historical data (and growing every day).
Instead of updating the main table directly and then querying the 1,000,000 historical records (and extracting the most recent one), you could query the smaller main table for that one record (probably by an ID), compare the fields, and only if there is a change (or no data yet) update those fields and add the record to the historical data (or use a trigger / stored procedure for that).
That way you don't even need a database (probably containing multiple indexes) for the historical data; you could even store it in a flat file if you wanted, depending on what you want to do with that data.
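As a rough T-SQL sketch of that idea (the tables main_record, synced_batch and record_history and their columns are hypothetical), the incoming batch is compared against the current main-table rows, and only rows that actually differ are updated and archived:

    -- Touch history only when something changed; OUTPUT deleted.* archives the
    -- pre-update values. Add IS NULL handling if the compared columns are nullable.
    UPDATE m
    SET    m.name   = s.name,
           m.status = s.status,
           m.notes  = s.notes
    OUTPUT deleted.id, deleted.name, deleted.status, deleted.notes, GETUTCDATE()
    INTO   record_history (id, name, status, notes, archived_at)
    FROM   main_record m
           JOIN synced_batch s ON s.id = m.id
    WHERE  m.name   <> s.name
       OR  m.status <> s.status
       OR  m.notes  <> s.notes;

This keeps the comparison set-based and never has to read the growing history table at sync time.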
The sync framework I am using deals with the actual data changes, so I only get new history records when there is an actual change. Given a batch of updates to a number of records, I need to compare all the changes with their previous state and produce an output list of what's changed.

How to improve DB load produced by SSRS reports

I would like to know whether there is a possibility to reduce the DB load produced by SSRS reports.
I have an SSRS report consisting of several sub-reports. Each of them has at least one DB query.
Many of them query the same data, since most sub-reports have a kind of template header filled with dynamic data.
Some sub-reports are shown only when a query returns data. So the data is queried once to determine whether to show the report at all; then the report itself queries the same data again to show it in a table.
In general, I can tell that I need a mechanism to pass queried DB data from a parent report to a sub-report. The parent report will query some data, iterate over the data sets, and for every data set show a sub-report, passing the current data set as a parameter.
I could not find a mechanism to pass the data set (data row). That's why I show the sub-report by passing a kind of data set ID. The sub-report itself queries the same data again, filters by the passed data set ID and shows only the relevant data set. This causes huge load on the DB.
Thank you in advance!
The design you describe is fairly standard and I would not expect it to cause "huge load on the DB". I would expect the DB load of running 10 filtered sub-reports to only be about 10-20% more than running one report covering the same 10 items.
I would add an index on the "data set ID" column to make that filter more efficient.
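For example, a hypothetical index on that filter column (assuming SQL Server behind SSRS; table and column names are illustrative):

    -- Turn the repeated per-dataset filters into index seeks; INCLUDE the
    -- columns the sub-report displays so the query is covered.
    CREATE NONCLUSTERED INDEX IX_ReportData_DataSetId
    ON dbo.ReportData (DataSetId)
    INCLUDE (HeaderTitle, HeaderDate, Amount);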
Depending on the complexity of your sub-reports, using the Lookup function may be an acceptable, faster solution. And the previous comment about hiding rows or sub-reports with no data applies here too.

Informatica Data Quality - Match Analysis

In our Duplicate analysis requirement the input data has 1418 records out of which 1380 records are duplicate records.
On using Match Analysis (with Key Generator, Matcher, Associator and Consolidator) in IDQ integrated with PowerCenter, all duplicates except for 8 records were eliminated.
On executing the workflow after excluding these records, duplicates appear in other records for which duplicates didn't occur in the previous run.
Can anyone tell why this mismatch occurs?
Looks like your Consolidator transformation is not getting the correct association IDs and is therefore inserting multiple records, resulting in duplicates.
Please try the steps below:
1) Try to create a workflow in IDQ itself by deploying the mapping which you developed in IDQ.
2) Also keep a check on the business keys of the records that form the primary key through which you are identifying the duplicates in the source; a quick SQL check is sketched below.
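A hedged sketch of such a check, with a hypothetical source table and business-key columns:

    -- Verify which business keys really are duplicated in the source before
    -- tuning the match mapping.
    SELECT cust_name, birth_date, postal_code, COUNT(*) AS dup_count
    FROM   src_customer
    GROUP  BY cust_name, birth_date, postal_code
    HAVING COUNT(*) > 1
    ORDER  BY dup_count DESC;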
