Informatica workflow shows a different number of records in the transformation and the target - informatica-powercenter

Hi, I am an Informatica newbie trying to train myself in PowerCenter workflows.
When I look at the last session runs of many workflows in the Workflow Monitor, I see that the number of records picked up by the transformation is different from the number of records that get updated or inserted in the target table.
In the image below, for example, my SQL transformation picks up 80,742 rows from the source table, but only 29,813 rows get loaded into the target table.
[Image: Informatica Workflow Monitor showing the session run]
On further analysis of the workflow log file, I can see that it loaded both inserted and updated records:
WRT_8036 Target: W_SALES_ORDER_LINE_F (Instance Name: [W_SALES_ORDER_LINE_F])
WRT_8038 Inserted rows - Requested: 15284  Applied: 15284  Rejected: 0  Affected: 15284
WRT_8041 Updated rows - Requested: 14529  Applied: 14529  Rejected: 0  Affected: 14529
WRITER_1_*_1> WRT_8035 Load complete time: Wed Mar 19 04:41:24 2014
I am not able to figure out why the workflow would load fewer records than what the source SQL returns, and I would really appreciate some help in this matter.
Thanks,
Matt

This happens when there is a join in the ETL and the column we join on has duplicate values, so the row count after the join no longer matches the row count of the source extract.
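As a rough illustration (the table and column names below are hypothetical, not taken from the actual mapping), the count coming out of a join can differ from the count of the source extract in either direction:

SELECT COUNT(*) FROM src_order_line;          -- e.g. 80,742 rows in the source extract

SELECT COUNT(*)
FROM   src_order_line s
JOIN   ref_customer   c ON c.customer_id = s.customer_id;
-- Fewer rows come out if some customer_id values have no match in ref_customer
-- (the inner join drops them); more rows come out if a customer_id is duplicated
-- in ref_customer (each duplicate multiplies the matching source rows).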

Related

Fixing missing data in Hadoop

I have a question which, according to my understanding, is more about theory.
I have a job running in Hadoop; basically, it pulls all customer information from all applications into the company's database. The job runs on a daily basis. The result is a master table with over 200 columns.
customer_id | active_status | dayid
1           | 0             | 20221230
2           | 1             | 20230101
From Jan 01st to Jan 03rd, the data in the column active_status was missing. One of my stakeholders needs the data of Jan 02nd for the reports. My boss said there were two options:
Copy the data of Jan 02 into a new table, with the values in column A replaced with the data of Jan 04.
Fix the master table. She said that in Hadoop, tables cannot be updated the way they are in a SQL database, and I would need to remove the file in HDFS, add a new file, and load the data into the master table.
This is not the first time I have heard that tables cannot be updated in Hadoop. I have read that Hadoop works on a write-once, read-many model.
However, I also know there is the INSERT OVERWRITE statement, which is used to replace existing data with new rows.
So how should I understand this? For the second option recommended by my boss, is only the file related to dayid 20230102 removed from HDFS, or is the whole file (all dayid values) removed?
I'm a BA and quite new to big data, so I hope you can shed more light on this.
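For reference, here is a minimal Hive sketch of the INSERT OVERWRITE statement mentioned above, assuming (this is an assumption, not stated in the question) that the master table is partitioned by dayid; in that case only the files behind that one partition are replaced:

-- Hypothetical names; assumes master_table is partitioned by dayid.
-- Only the files under the dayid=20230102 partition are rewritten;
-- the other partitions stay untouched.
INSERT OVERWRITE TABLE master_table PARTITION (dayid = 20230102)
SELECT customer_id, active_status   -- plus the remaining columns of the master table
FROM   corrected_staging_table
WHERE  dayid = 20230102;

If the table is not partitioned, INSERT OVERWRITE without a PARTITION clause replaces the whole table's data, which is closer to the "remove the file and reload everything" description.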

What is the best way of handling data resubmission in a data warehouse?

Let's assume that we have a data warehouse composed of four components:
extract: source data is extracted from an Oracle database to flat files, one flat file per source table. The extraction date is kept as part of the flat file name. Each record contains an insert/update date from the source system.
staging area: temporary tables used to load the extracted data into the database.
operational data store: staged data is loaded into the ODS. The ODS keeps the full history of all loaded data, and the data is typecast. Surrogate keys are not yet generated.
data warehouse: data is loaded from the ODS, surrogate keys are generated, dimensions are historized, and finally fact data is loaded and attached to the proper dimensions.
So far so good, and I have no issue with regular delta loading. However, the question I ask myself is this: I have regularly encountered situations in the past where, for whatever reason, you want to resubmit already-extracted data into the loading pipeline. Let's assume, for instance, that we select all the flat files extracted over the last 15 days and push them again through the ETL process.
There is no new extraction from the source systems; previously loaded files are re-used and fed into the ETL process.
The data is then reloaded into the staging tables, which have been truncated beforehand.
Now the data has to move to the ODS, and here I have a real headache about how to proceed.
Alternative 1: just insert the new rows. So we would have:
new row: natural key ID001, batch date 12/1/2022 16:34, extraction date 10/1/2022, source system modification timestamp 10/1/2022 10:43:00
previous row: natural key ID001, batch date 10/1/2022 01:00, extraction date 10/1/2022, source system modification timestamp 10/1/2022
But then, when loading to the DWH, we need some kind of insert/update mechanism; we cannot do a straight insert, as it would create duplicate facts.
Alternative 2: apply insert/update logic at the ODS level. With the previous example we would have:
check whether the ODS table already contains a row with natural key ID001, extraction date 10/1/2022, source system modification timestamp 10/1/2022
insert it if not found
Alternative 3: purge the previously loaded data from the ODS, i.e.
purge all the data where the extraction date is in the last 15 days
load the data from staging.
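To make alternatives 2 and 3 concrete, here is a rough SQL sketch; all table and column names (ods_order_line, stg_order_line, natural_key, extraction_date, source_modified_ts, batch_date) are hypothetical placeholders, not from the actual system:

-- Alternative 2: insert-if-not-found at ODS level.
MERGE INTO ods_order_line tgt
USING stg_order_line src
   ON (    tgt.natural_key        = src.natural_key
       AND tgt.extraction_date    = src.extraction_date
       AND tgt.source_modified_ts = src.source_modified_ts)
WHEN NOT MATCHED THEN
  INSERT (natural_key, extraction_date, source_modified_ts, batch_date)
  VALUES (src.natural_key, src.extraction_date, src.source_modified_ts, CURRENT_TIMESTAMP);

-- Alternative 3: purge the resubmitted window, then reload it from staging.
DELETE FROM ods_order_line
 WHERE extraction_date >= CURRENT_DATE - 15;

INSERT INTO ods_order_line (natural_key, extraction_date, source_modified_ts, batch_date)
SELECT natural_key, extraction_date, source_modified_ts, CURRENT_TIMESTAMP
  FROM stg_order_line;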
Alternative 1 is performant, but it shifts the insert/update task to the DWH level, so the performance killer is still there.
Alternative 2 requires an insert/update, which for millions of rows does not seem optimal.
Alternative 3 looks good, but it feels wrong to delete data from the ODS.
What is your view on this? In other words, my question is how to reconcile the recommendation to have insert-only processes in the data warehouse with the reality that, from time to time, you will need to reload previously extracted data to fix bugs or fill in missing data.
There are two primary methods to load data into your data warehouse:
Full load: the entire staged data set is dumped and then completely replaced with the new, updated data. No additional information, such as timestamps or audit technical columns, is needed.
Incremental load / delta load: only the difference between the target and the source data is loaded through the ETL process into the data warehouse. There are two types of incremental load, depending on the data volume: streaming incremental load and batch incremental load.
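As a rough sketch of the difference (hypothetical table names; etl_load_control is an assumed control table holding the last successful load timestamp, not something from the question):

-- Full load: throw everything away and reload it.
TRUNCATE TABLE dwh_customer;
INSERT INTO dwh_customer
SELECT * FROM stg_customer;

-- Batch incremental (delta) load: only rows changed since the last run,
-- identified via a modification timestamp and a stored watermark.
INSERT INTO dwh_customer
SELECT *
  FROM stg_customer s
 WHERE s.last_modified_ts > (SELECT last_successful_load_ts
                               FROM etl_load_control
                              WHERE table_name = 'DWH_CUSTOMER');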

Why does QueryDatabaseTable do a complete query fetch instead of using the maximum column value to fetch data from Oracle in Apache NiFi?

I am using the QueryDatabaseTable processor to do an incremental batch update to BigQuery. The Oracle database table keeps growing at a rate of 5 new rows per minute.
Flow: QueryDatabaseTable -> ConvertAvroToJson -> PutBigQueryBatchUpdate
I ran this flow on a 10-minute schedule; the query results in about 2000 rows.
QueryDatabaseTable processor properties I have modified: Table Name, Additional WHERE clause, Maximum-value Columns.
QueryDatabaseTable is supposed to fetch only rows beyond the maximum column value visible in 'View State', but my setup simply returns the entire result of the query.
After each query, the maximum value of the column is updated to the latest maximum value.
The maximum value of the column contains a date.
I have also tried running after clearing the state, and with Maximum-value Columns left empty, with the same result.
What am I missing?
Additional info:
The QueryDatabaseTable config also has the following property, which I think is related to this issue:
Transaction Isolation Level: No value set
QueryDatabaseTable did not work if I gave just the table name.
Removing the Additional WHERE clause property and creating a Custom Query instead made the processor work as intended.
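For context, once state exists the processor is expected to append a filter on the maximum-value column to the query it issues against Oracle; roughly something like the following (table and column names here are hypothetical, and the custom-query variant is shown only as an illustration of the working setup described above):

-- Effective incremental query once a maximum value is stored in state:
SELECT *
  FROM sales_orders
 WHERE order_status = 'ACTIVE'                          -- the additional WHERE clause
   AND last_updated > TO_DATE('2023-01-02 10:15:00',
                              'YYYY-MM-DD HH24:MI:SS'); -- appended from the stored max value

-- Custom Query form (the processor wraps it and applies the max-value filter on top):
SELECT * FROM sales_orders WHERE order_status = 'ACTIVE'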

Sync database extraction with Hadoop

Let's say you have a periodic task that extracts data from a database and loads that data into Hadoop.
How do Apache Sqoop/NiFi maintain sync between the source database (SQL or NoSQL) and the destination storage (Hadoop HDFS or HBase, or even S3)?
For example, let's say that at time A the database has 500 records and at time B it has 600 records, with some of the old records updated. Is there a mechanism that efficiently knows the difference between time A and time B, so that it only updates the rows that changed and adds the missing rows?
Yes, NiFi has the QueryDatabaseTable processor, which can store state and incrementally fetch the records that got updated.
If your table has a date column that is updated whenever a record is updated, you can use that date column in the Maximum-value Columns property; the processor will then pull only the changes made since the last state value.
Here is an excellent article about the QueryDatabaseTable processor:
https://community.hortonworks.com/articles/51902/incremental-fetch-in-nifi-with-querydatabasetable.html

Apache Sqoop Incremental import

I understand that Sqoop offers a couple of methods to handle incremental imports:
Append mode
lastmodified mode
Questions on append mode:
1. Is append mode supported only when the check column is of an integer data type? What if I want to use a date or a timestamp column but still only want to append to the data already in HDFS?
2. Does this mode mean that the new data is appended to the existing HDFS file, that it picks only the new data from the source DB, or both?
3. Let's say that the check-column is an id column in the source table, and there already exists a row in the table where the id column is 100. The Sqoop import is run in append mode with last-value 50, so it imports all rows where id > 50. It is then run again with last-value 150, but this time the row with id 100 has been updated to 200. Would this row also be pulled?
Example: let's say there is a table called customers with one of its records as follows (the first column is the id):
100 abc xyz 5000
When the Sqoop job is run in append mode with last-value 50 for the id column, it pulls the above record.
Now the same record is changed and the id also gets changed (a hypothetical example) as follows:
200 abc xyz 6000
The question is whether running the sqoop command again would pull the above record as well.
Questions on lastmodified mode:
1. It looks like running Sqoop in this mode merges the existing data with the new data using 2 MR jobs internally. Which column does Sqoop use to compare the old and the new data for the merge process?
2. Can the user specify the column to use for the merge process?
3. Can more than one column be provided for the merge process?
4. Should the target-dir already exist for the merge process to happen, so that Sqoop treats the existing target dir as the old dataset? Otherwise, how would Sqoop know which old dataset is to be merged?
Answers for append mode:
1. Yes, the check column needs to be an integer.
2. Both.
3. The question is not clear.
Answers for lastmodified mode:
1. An incremental load with lastmodified does not merge data; it is primarily used to pull updated and inserted data based on a timestamp.
2. The merge process is completely separate. Once you have both the old data and the new data, you can merge the new data onto the old data into a different directory. You can see a detailed explanation here.
3. The merge process works with only one field.
4. The target-dir should not exist. The video covers the complete merge process.
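To make the two modes more concrete, here is roughly the kind of filter Sqoop builds in each case, shown as plain SQL against the customers example above (the updated_at column and the literal timestamps are hypothetical):

-- Append mode with --check-column id --last-value 50:
-- only rows whose check column exceeds the stored last-value are selected;
-- rows already in HDFS are never re-read or merged.
SELECT * FROM customers WHERE id > 50;

-- lastmodified mode with --check-column updated_at --last-value '2023-01-01 00:00:00':
-- rows inserted or updated since the stored last-value are selected, and the
-- saved last-value is advanced for the next run.
SELECT * FROM customers
 WHERE updated_at > TO_TIMESTAMP('2023-01-01 00:00:00', 'YYYY-MM-DD HH24:MI:SS');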
