What is the best way of handling data resubmission in a data warehouse (ETL)?

Let's assume that we have a data warehouse composed of four components:
Extract: source data is extracted from an Oracle database to flat files, one flat file per source table. The extraction date is kept as part of the flat file name, and each record contains an insert/update date from the source system.
Staging area: temporary tables used to load the extracted data into database tables.
Operational data store: staged data is loaded into the ODS. The ODS keeps the full history of all loaded data, and the data is typecast. Surrogate keys are not yet generated.
Data warehouse: data is loaded from the ODS, surrogate keys are generated, dimensions are historized, and finally fact data is loaded and attached to the proper dimensions.
So far so good, and for regular delta loading I have no issue. However, I have regularly encountered situations where, for whatever reason, you want to resubmit previously extracted data into the loading pipeline. Let's assume, for instance, that we select all the flat files extracted over the last 15 days and push them again through the ETL process.
There is no new extraction from the source systems. Previously loaded files are re-used and fed into the ETL process.
Data is then reloaded into the staging tables, which have been truncated beforehand.
Now the data has to move to the ODS, and here I have a real headache about how to proceed.
Alternative 1: just insert the new rows. So we would have:
new row: natural key ID001, batch date 12/1/2022 16:34, extraction date 10/1/2022, source system modification timestamp 10/1/2022 10:43:00
previous row: natural key ID001, batch date 10/1/2022 01:00, extraction date 10/1/2022, source system modification timestamp 10/1/2022
But then, when loading to the DWH, we need some kind of insert/update mechanism; we cannot do a straight insert, as it would create duplicate facts.
Alternative 2: apply insert/update logic at the ODS level. With the previous example we would:
check whether the ODS table already contains a row with natural key ID001, extraction date 10/1/2022, source system modification timestamp 10/1/2022;
insert the row if it is not found.
Alternative 3: purge the previously loaded data from the ODS, i.e.
purge all the data whose extraction date falls in the last 15 days,
then reload the data from staging.
Alternative 1 is performant but shifts the insert/update task to the DWH level, so the performance killer is still there.
Alternative 2 requires an insert/update, which for millions of rows does not seem optimal.
Alternative 3 looks good, but it feels wrong to delete data from the ODS.
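For illustration, alternatives 2 and 3 boil down to something like the sketch below; the object names (ods.customer, stg.customer, natural_key, extraction_date, src_modified_ts, payload) are placeholders, not my actual model. The point is that alternative 2 can be a single set-based anti-join insert rather than a row-by-row upsert, while alternative 3 is a ranged delete followed by a plain insert.

```python
# Sketch of alternatives 2 and 3 driven from Python with a DB-API connection
# (cx_Oracle, pyodbc, ...). All object names are placeholders.

ALT2_INSERT_IF_ABSENT = """
INSERT INTO ods.customer (natural_key, extraction_date, src_modified_ts, payload)
SELECT s.natural_key, s.extraction_date, s.src_modified_ts, s.payload
FROM   stg.customer s
WHERE  NOT EXISTS (
    SELECT 1 FROM ods.customer o
    WHERE  o.natural_key     = s.natural_key
    AND    o.extraction_date = s.extraction_date
    AND    o.src_modified_ts = s.src_modified_ts)
"""

ALT3_PURGE = "DELETE FROM ods.customer WHERE extraction_date >= TRUNC(SYSDATE) - 15"
ALT3_RELOAD = """
INSERT INTO ods.customer (natural_key, extraction_date, src_modified_ts, payload)
SELECT natural_key, extraction_date, src_modified_ts, payload FROM stg.customer
"""

def reload_ods(conn, purge_and_reload=False):
    """Alternative 2 (anti-join insert) or alternative 3 (purge then insert)."""
    cur = conn.cursor()
    try:
        if purge_and_reload:
            cur.execute(ALT3_PURGE)
            cur.execute(ALT3_RELOAD)
        else:
            cur.execute(ALT2_INSERT_IF_ABSENT)
        conn.commit()
    finally:
        cur.close()
```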
What is your view on this? In other words, my question is how to reconcile the recommendation to have insert-only processes in the data warehouse with the reality that, from time to time, you will need to reload previously extracted data to fix bugs or correct missing data.

There are two primary methods to load data into your data warehouse:
Full load: the entire staged dataset is dumped and then completely replaced with the new, updated data. No additional information, such as timestamps or technical audit columns, is needed.
Incremental load / delta load: only the difference between the source and target data is loaded through the ETL process into the data warehouse. There are two types of incremental loads, depending on data volume: streaming incremental load and batch incremental load.
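For the batch variant, the usual pattern is a high-water-mark filter on the source modification timestamp. A rough sketch (the watermark table, source table and column names are made up, and the bind style is cx_Oracle/sqlite-like):

```python
# Sketch of a batch incremental (delta) load using a stored high-water mark.
# Table and column names are placeholders.

GET_WATERMARK = "SELECT last_loaded_ts FROM etl_watermark WHERE table_name = 'orders'"
EXTRACT_DELTA = "SELECT id, payload, last_modified FROM src_orders WHERE last_modified > :wm"
SET_WATERMARK = "UPDATE etl_watermark SET last_loaded_ts = :ts WHERE table_name = 'orders'"

def incremental_load(conn, load_into_staging):
    cur = conn.cursor()
    cur.execute(GET_WATERMARK)
    watermark = cur.fetchone()[0]

    cur.execute(EXTRACT_DELTA, {"wm": watermark})
    rows = cur.fetchall()
    load_into_staging(rows)                   # push only the delta downstream

    if rows:
        new_wm = max(row[2] for row in rows)  # column 2 is last_modified
        cur.execute(SET_WATERMARK, {"ts": new_wm})
        conn.commit()
```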

Related

Update Parquet file format

My requirement is to read that and generate another set of Parquet data into another ADLS folder.
Do I need to load this into Spark dataframes and perform upserts?
Parquet is like any other file format: you have to overwrite the files to perform inserts, updates and deletes. It does not have ACID properties like a database.
1 - We can use set operations on Spark dataframes to accomplish what you want. However, they compare at both the row and column level, so it is not as convenient as ANSI SQL.
https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-setops.html
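For example (folder paths are placeholders), exceptAll gives you the rows present in the new extract but not in the target:

```python
# Sketch: use a set operation to find rows that are new or changed.
# Paths are placeholders for the two ADLS folders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

new_df = spark.read.parquet("abfss://container@account.dfs.core.windows.net/source/")
old_df = spark.read.parquet("abfss://container@account.dfs.core.windows.net/target/")

# Compares every column of every row, so any changed value shows up.
changed_or_new = new_df.exceptAll(old_df)
changed_or_new.show()
```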
2 - We can save the data in the target directory in Delta format. Most people use Delta since it has ACID properties like a database. Please see the MERGE statement; it allows for updates and inserts.
https://docs.delta.io/latest/delta-update.html
Additionally, we can soft delete by reversing the match.
The nice thing about a Delta file (table) is that we can partition by date for a daily file load. Thus we can use time travel to see what happened yesterday versus today.
https://www.databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html
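Here is a rough sketch of both ideas; the paths, the "id" key and the "active" flag are assumptions, and the not-matched-by-source clause needs a recent Delta Lake release:

```python
# Sketch of a Delta upsert plus a soft delete. Paths, the "id" key and the
# "active" flag are placeholder names; assumes a Delta-enabled Spark session
# (e.g. Databricks or delta-spark configured locally).
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

target = DeltaTable.forPath(spark, "/mnt/adls/target_delta")
updates = spark.read.parquet("/mnt/adls/source_parquet")

# Upsert: update matching rows, insert new ones.
(target.alias("t")
       .merge(updates.alias("s"), "t.id = s.id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())

# Soft delete by "reversing the match": flag target rows whose key no longer
# appears in the source instead of physically deleting them
# (whenNotMatchedBySourceUpdate requires Delta Lake 2.3+).
(target.alias("t")
       .merge(updates.alias("s"), "t.id = s.id")
       .whenNotMatchedBySourceUpdate(set={"active": "false"})
       .execute())
```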
3 - If you do not care about history and soft deletes, the easiest way to accomplish this task is to archive the old files in the target directory, then copy over the new files from the source directory to the target directory.
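For a local or mounted filesystem that can be as simple as the sketch below (paths are placeholders; on ADLS you would do the same with dbutils.fs or the Azure storage SDK):

```python
# Sketch of option 3: move today's target files to an archive folder, then
# copy the new extract over. Paths are placeholders.
import shutil
import time
from pathlib import Path

source = Path("/data/source")
target = Path("/data/target")
archive = Path("/data/archive") / time.strftime("%Y%m%d_%H%M%S")
archive.mkdir(parents=True, exist_ok=True)

for f in target.glob("*.parquet"):
    shutil.move(str(f), str(archive / f.name))   # keep the old files, just out of the way

for f in source.glob("*.parquet"):
    shutil.copy2(f, target / f.name)             # the new extract becomes the target
```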

Sync database extraction with Hadoop

Let's say you have a periodic task that extracts data from a database and loads it into Hadoop.
How does Apache Sqoop/NiFi maintain sync between the source database (SQL or NoSQL) and the destination storage (Hadoop HDFS or HBase, even S3)?
For example, let's say that at time A the database has 500 records and at time B it has 600 records, with some of the old records updated. Is there a mechanism that efficiently determines the difference between time A and time B, updates only the rows that changed, and adds the missing rows?
Yes, NiFi has the QueryDatabaseTable processor, which can store state and incrementally fetch the records that were updated.
If your table has a date column that is updated whenever a record changes, you can use that column in the Maximum-value Columns property; the processor will then pull only the changes made since the last state value.
Here is a good article about the QueryDatabaseTable processor:
https://community.hortonworks.com/articles/51902/incremental-fetch-in-nifi-with-querydatabasetable.html

How to deal with primary key check like situations in Hive?

I have to load an incremental load into my base table (say table_stg) once every day. I get a snapshot of the data every day from various sources in XML format. The id column is supposed to be unique, but since data comes from different sources there is a chance of duplicates.
day1:
table_stg
id,col2,col3,ts,col4
1,a,b,2016-06-24 01:12:27.000532,c
2,e,f,2016-06-24 01:12:27.000532,k
3,a,c,2016-06-24 01:12:27.000532,l
day2: (say the xml is parsed and loaded into table_inter as below)
id,col2,col3,ts,col4
4,a,b,2016-06-25 01:12:27.000417,l
2,e,f,2016-06-25 01:12:27.000417,k
5,w,c,2016-06-25 01:12:27.000417,f
5,w,c,2016-06-25 01:12:27.000417,f
When I put this data into table_stg, my final output should be:
id,col2,col3,ts,col4
1,a,b,2016-06-24 01:12:27.000532,c
2,e,f,2016-06-24 01:12:27.000532,k
3,a,c,2016-06-24 01:12:27.000532,l
4,a,b,2016-06-25 01:12:27.000417,l
5,w,c,2016-06-25 01:12:27.000417,f
What could be the best way to handle this kind of situation (without deleting table_stg, the base table, and reloading the whole data)?
Hive does allow duplicates on primary and unique keys. You should have an upstream job doing the data cleaning before loading it into the Hive table.
You can write a Python script for that if the data is small, or use Spark if the data size is huge.
Spark provides the dropDuplicates() method to achieve this.
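Note that dropDuplicates() keeps an arbitrary copy of each duplicate, so if you need the day 1 row for id=2 to survive (as in the expected output), a window function that keeps the earliest ts is safer. A sketch, assuming both datasets are available as Hive tables and the target table name is made up:

```python
# Sketch: merge the daily increment into the base data without duplicates.
# Keeps the earliest row per id, so id=2 from day 1 survives as in the example.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

base = spark.table("table_stg")          # existing data
incr = spark.table("table_inter")        # today's parsed XML snapshot

merged = (base.unionByName(incr)
              .withColumn("rn", F.row_number().over(
                  Window.partitionBy("id").orderBy(F.col("ts").asc())))
              .filter("rn = 1")
              .drop("rn"))

# Write to a separate table (or partition) rather than the table being read.
merged.write.mode("overwrite").saveAsTable("table_stg_deduped")
```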

Schema verification/validation before loading data into HDFS/Hive

I am a newbie to the Hadoop ecosystem and I need some suggestions from big data experts on achieving schema verification/validation before loading huge data into HDFS.
The scenario is:
I have a huge dataset with a given schema (around 200 column headers). This dataset is going to be stored in Hive tables/HDFS. Before loading the data into the Hive table/HDFS, I want to perform schema-level verification/validation on the supplied data, to avoid any unwanted errors/exceptions while loading the data into HDFS. For example, if somebody tries to pass a data file having fewer or more columns, the load should fail at this first level of verification.
What could be the best possible approach for achieving the same?
Regards,
Bhupesh
Since you have files, you can add them to HDFS and run MapReduce on top of them. There you have a hold of each row, so you can verify the number of columns, their types, and any other validations.
Regarding JSON/XML, there is a slight overhead in making MapReduce identify the records in that format. However, with respect to validation, there is schema validation that you can enforce, and you can also restrict a field to specific values using the schema. Once the schema is ready, you do your parsing (XML to Java) and then store the records at another, final HDFS location for further use (like HBase). When you are sure the data is validated, you can create Hive tables on top of it.
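As an alternative to hand-written MapReduce, the same first-level check can be scripted with PySpark before anything lands in Hive; a minimal sketch (the expected schema, path and table names are made up):

```python
# Sketch: fail the load early if the incoming CSV does not match the expected
# schema. The schema, path and table names below are examples only.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

expected = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), True),
    # ... the remaining columns of the ~200-column schema
])

# First check: column count from the header.
header_only = spark.read.option("header", True).csv("/landing/incoming_file.csv")
if len(header_only.columns) != len(expected.fields):
    raise ValueError(f"Expected {len(expected.fields)} columns, "
                     f"got {len(header_only.columns)}")

# Second check: enforce types; FAILFAST aborts the read on any malformed row.
validated = (spark.read.option("header", True)
                  .option("mode", "FAILFAST")
                  .schema(expected)
                  .csv("/landing/incoming_file.csv"))
validated.write.mode("append").saveAsTable("landing_db.validated_table")
```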
Use the utility below to create temp tables each time, based on the schema of the CSV file you receive in the staging directory, then apply some conditions to check whether you have valid columns. Finally, load into the original table.
https://github.com/enahwe/Csv2Hive

Extraction of data from Siebel database to .dat file and staging table

I am working on a new requirement and I am new to this, so I am seeking your help.
Requirement: from the Siebel base tables (S_ORG_EXT, S_CONTACT, S_PROD_INT) I have to export data into two staging tables (S1 and S2), and from these staging tables I need to create pipe-delimited .dat files that also include a row count. Staging table S1 should have accounts with their associated contacts, and S2 should have accounts with their associated contacts and products.
How should I go about this? Should I use an Informatica job to pull data directly from the Siebel base tables, or should I run an EIM export job to get the data into an EIM table and from there into the staging tables?
Kindly help me decide which way to go.
Access the base tables directly using Informatica, limiting the extract to only the rows and columns you need.
I'd recommend unloading these to flat files before loading them into the Staging Tables (it gives you a point of recovery if something goes wrong in your Staging Table load, and means you don't have to hit the Siebel DB again).
Then from there you can either unload the staging tables, or just use your flat file extract, to generate your delimited files with row counts.
I tend to favour modular processes, with sensible recovery points, over 'streaming' the data through for (arguably) faster execution time, so here's what I'd do (one mapping for each):
1. Unload from Base Tables to flat files.
2. Join the flat file entities as required and create new flat files in the Staging Table format.
3. Load staging tables.
4. Unload staging tables (optional, if you can get away with using the files created in Step 2)
5. Generate .dat files in pipe-delimited format with the row count.
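For step 5, the pipe-delimited file with a trailing row count is simple enough to script outside Informatica too; here is a sketch (the column list, file name and trailer layout are assumptions, adapt them to your spec):

```python
# Sketch: write a pipe-delimited .dat file and append a trailer row carrying
# the record count. Columns, file name and trailer format are examples only.
import csv

def write_dat(rows, header, out_path):
    count = 0
    with open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh, delimiter="|")
        writer.writerow(header)
        for row in rows:
            writer.writerow(row)
            count += 1
        writer.writerow(["TRAILER", count])   # row count for the consumer to verify
    return count

# Example with an S1-style account/contact extract:
rows = [("ACC001", "Contact A"), ("ACC002", "Contact B")]
write_dat(rows, ("ACCOUNT_ID", "CONTACT_NAME"), "S1_extract.dat")
```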
If the loading of a staging table is only for audit purposes etc., and you can base Step 5 on the files you created in Step 2, then you could perform stage (3) concurrently with stage (5), which may reduce overall runtime.
If this is a one-off process, or you just want to write it in a hurry, you could skip writing out the flat files and just do it all in one or two mappings. I wouldn't do this though, because
a) it's harder to test and
b) there are fewer recovery points.
Cheers!
