How do I read this mapping document?

I am new to ETL and am working with Talend. I was given this document and was told to make an "extraction job." How exactly do I read this document for the Talend job that I have to make?

Well, ETL basically means Extract-Transform-Load.
From your example, I understand that you have to create a target table that will pull data from the source table based on certain conditions. These conditions are spelled out in your image.
You basically have to look at the Source File columns in your image. They state:
1.) File (Table Name): which table in the source DB this attribute comes from.
2.) Attribute(s) (Field Name): the name of the column.
3.) Extract Logic: the logic to apply while extracting this column from the source. "Straight Move" means you just copy the source value into the target unchanged.
This is just to get you started, as nobody here will build the whole ETL flow for you.
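To make this concrete, here is a tiny illustrative sketch in plain Python (the table and column names are made up) of what a few rows of such a mapping document mean; in Talend the same thing is configured in components such as tDBInput and tMap rather than written as code.

    # Hypothetical illustration of a few mapping-document rows; all names are made up.
    source_row = {"CUST_ID": "C001", "CUST_NAME": "Acme", "CREATED_DT": "2024-01-01"}

    mapping = [
        # (source table, source column, target column, extract logic)
        ("CUSTOMER", "CUST_ID",   "CUSTOMER_ID",   "Straight Move"),
        ("CUSTOMER", "CUST_NAME", "CUSTOMER_NAME", "Straight Move"),
    ]

    target_row = {}
    for _table, src_col, tgt_col, logic in mapping:
        if logic == "Straight Move":
            # "Straight Move" = copy the source value to the target unchanged.
            target_row[tgt_col] = source_row[src_col]

    print(target_row)  # {'CUSTOMER_ID': 'C001', 'CUSTOMER_NAME': 'Acme'}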

Related

What is the best way of handling data resubmission in a data warehouse?

Let's assume that we have a data warehouse composed of four components:
Extract: source data is extracted from an Oracle database to flat files, one flat file per source table. The extraction date is kept as part of the flat file name, and each record contains an insert/update date from the source system.
Staging area: temporary tables used to load the extracted data into database tables.
Operational data store: staged data is loaded into the ODS. The ODS keeps the full history of all loaded data, and the data is typecast. Surrogate keys are not yet generated.
Data warehouse: data is loaded from the ODS, surrogate keys are generated, dimensions are historized, and finally fact data is loaded and attached to the proper dimensions.
So far so good, and for regular delta loading I have no issue. However, I have regularly encountered situations in the past where, for whatever reason, you want to resubmit extracted data into the loading pipeline. Let's assume, for instance, that we select all the flat files extracted over the last 15 days and push them through the ETL process again.
There is no new extraction from the source systems. Previously loaded files are re-used and fed into the ETL process.
Data is then reloaded into the staging tables, which will have been truncated previously.
Now the data has to move to the ODS, and here I have a real headache about how to proceed.
Alternative 1: just insert the new rows. So we would have:
Row 2: natural key ID001, batch date 12/1/2022 16:34, extraction date 10/1/2022, source system modification timestamp 10/1/2022 10:43:00
Previous row: natural key ID001, batch date 10/1/2022 01:00, extraction date 10/1/2022, source system modification timestamp 10/1/2022
But then, when loading into the DWH, we need some kind of insert/update mechanism; we cannot do a straight insert, as it would create duplicate facts.
Alternative 2: apply insert/update logic at the ODS level. With the previous example we would:
Check whether the ODS table already contains a row with natural key ID001, extraction date 10/1/2022, source system modification timestamp 10/1/2022.
Insert if not found.
Alternative 3: purge the previously loaded data from the ODS, i.e.
Purge all data whose extraction date falls within the last 15 days.
Load the data from staging.
Alternative 1 is performant but shifts the insert/update task to the DWH level, so the performance killer is still there.
Alternative 2 requires an insert/update, which for millions of rows does not seem optimal.
Alternative 3 looks good, but it feels wrong to delete data from the ODS.
What is your view on this? In other words, how do you reconcile the recommendation to have insert-only processes in the data warehouse with the reality that, from time to time, you will need to reload previously extracted data to fix bugs or correct missing data?
There are two primary methods to load data into your data warehouse:
Full load: the entire dataset is dumped from the source and loaded, completely replacing the existing target data with the new, updated data. No additional information, such as timestamps or audit technical columns, is needed.
Incremental load / delta load: only the difference between the source and target data is loaded through the ETL process into the data warehouse. There are two types of incremental load, depending on the data volume: streaming incremental load and batch incremental load.
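As a rough sketch of the incremental path, the insert-if-not-found check from Alternative 2 can be expressed as a single set-based statement. The table and column names below (stg_customer, ods_customer, natural_key, extraction_date, src_modified_ts) and the use of the python-oracledb driver are assumptions for illustration only; adapt them to your actual schema and engine.

    # Minimal sketch of an incremental (delta) load from staging into the ODS.
    # Table/column names are hypothetical; the MERGE only inserts rows the ODS
    # has not seen yet, so resubmitted files become no-ops instead of duplicates.
    import oracledb  # assumes the python-oracledb driver; any DB-API driver works

    MERGE_SQL = """
    MERGE INTO ods_customer tgt
    USING stg_customer src
       ON (    tgt.natural_key     = src.natural_key
           AND tgt.extraction_date = src.extraction_date
           AND tgt.src_modified_ts = src.src_modified_ts)
    WHEN NOT MATCHED THEN
      INSERT (natural_key, extraction_date, src_modified_ts, batch_date, payload)
      VALUES (src.natural_key, src.extraction_date, src.src_modified_ts,
              SYSTIMESTAMP, src.payload)
    """

    def incremental_load(dsn: str, user: str, password: str) -> None:
        """Insert only the staged rows the ODS does not already contain."""
        with oracledb.connect(user=user, password=password, dsn=dsn) as conn:
            with conn.cursor() as cur:
                cur.execute(MERGE_SQL)
            conn.commit()

Because only the WHEN NOT MATCHED branch exists, the ODS stays insert-only while still tolerating resubmitted extracts; Alternative 3 would instead be a DELETE over the 15-day window followed by a plain INSERT ... SELECT from staging.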

How to find out that all the files created by GenerateTableFetch have been processed

We have a flow where GenerateTableFetch takes input from SplitJson, which supplies TableName and ColumnName as arguments. Multiple tables are passed as input to GenerateTableFetch at once, and then ExecuteSQL executes the queries.
Now I want to trigger a new process when all the files for a table have been processed by the processors below (at the end there is PutFile).
How can I find out that all the files created for a table have been processed?
You may need NIFI-5601 to accomplish this; there is a patch currently under review at the time of this writing, and I hope to get it into NiFi 1.9.0.
EDIT: Adding potential workarounds in the meantime
If you can use ListDatabaseTables instead of getting your table names from a JSON file, then you can set Include Count to true. You will then get attributes for the table name and the count of its rows. Divide the count by the value of Partition Size in GTF (rounding up) and that gives you the number of fetches (let's call it X). Then add an attribute via UpdateAttribute called "parent" or something, and set it to ${UUID()}. Keep these attributes in the flow files going into GTF and ExecuteSQL; then you can use Wait/Notify to wait until X flow files are received, setting Target Signal Count to ${X} and using ${parent} as the Release Signal Identifier.
If you can't use ListDatabaseTables, you may be able to put ExecuteSQLRecord after your SplitJson and execute something like SELECT COUNT(*) FROM ${table.name}. If you use ExecuteSQL, you may need a ConvertAvroToJSON; if you use ExecuteSQLRecord, use a JSONRecordSetWriter. Then you can extract the count from the flow file contents using EvaluateJsonPath.
Once you have the table name and the row count in attributes, you can continue with the flow I outlined above (i.e. determine the number of flow files that GTF will generate, etc.).
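One small detail: the division should round up, since GenerateTableFetch emits one SQL statement per partition, including the last partial one. A quick sketch of the arithmetic in plain Python, just to illustrate how X is derived (in the flow itself you would compute it when setting the attribute):

    import math

    def expected_fetch_count(row_count: int, partition_size: int) -> int:
        # One flow file per partition, rounding up for the final partial page.
        return math.ceil(row_count / partition_size)

    # e.g. 1,000,500 rows with a Partition Size of 10,000 -> 101 fetches,
    # so the Wait processor's Target Signal Count (X) would be 101.
    print(expected_fetch_count(1_000_500, 10_000))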

Providing user-defined column names in AWS Glue

I have a lot of Parquet files. I need to read them through AWS Glue and then provide column names to the table that is being read.
The problem is that the Parquet files already have column names, which are read by the crawler and shown in the table. Is it possible to provide my own column names for these Parquet files in Glue?
To replace the detected column names with names of your own, you could either:
Use one of the following built-in transformations on DynamicFrame:
ApplyMapping - Applies a declarative mapping to this DynamicFrame and returns a new DynamicFrame with those mappings applied. (source column, source type, target column, target type)
RenameField - Renames a field in this DynamicFrame and returns a new DynamicFrame with the field renamed. (oldName -> newName)
See the Scala or Python ETL programming guides for more detail.
Or try updating the data catalog field names manually if you don't need to continuously re-crawl the data (or if you do, it is possible to prevent a glue crawler from updating existing data catalog tables via the crawler configuration).
Alternatively, if your requirements are more discrete, the map transform is available to convert each DynamicRecord in the DynamicFrame to a new DynamicRecord of your choosing.
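As a minimal sketch, an AWS Glue (PySpark) job using ApplyMapping might look like the following; the catalog database, table name, column names, and S3 path are hypothetical placeholders. RenameField works the same way when you only need to rename one or two fields.

    # Minimal sketch: rename crawled Parquet columns with ApplyMapping in a Glue job.
    # Database, table, column names, and the S3 path are hypothetical placeholders.
    from awsglue.context import GlueContext
    from awsglue.transforms import ApplyMapping
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Read the crawled Parquet table from the Data Catalog.
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database="my_database",
        table_name="my_parquet_table",
    )

    # Mappings are (source column, source type, target column, target type).
    renamed = ApplyMapping.apply(
        frame=dyf,
        mappings=[
            ("col0", "string", "customer_id", "string"),
            ("col1", "double", "order_total", "double"),
        ],
    )

    # Write the renamed data back out, e.g. to S3 as Parquet.
    glue_context.write_dynamic_frame.from_options(
        frame=renamed,
        connection_type="s3",
        connection_options={"path": "s3://my-bucket/renamed/"},
        format="parquet",
    )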

Schema verification/validation before loading data into HDFS/Hive

I am a newbie to the Hadoop ecosystem and I need some suggestions from big data experts on achieving schema verification/validation before loading huge data into HDFS.
The scenario is:
I have a huge dataset with a given schema (having around 200 column headers in it). This dataset is going to be stored in Hive tables/HDFS. Before loading the data into the Hive table/HDFS I want to perform schema-level verification/validation on the supplied data, to avoid any unwanted errors/exceptions while loading the data into HDFS. For example, if somebody tries to pass a data file having fewer or more columns in it, the load should fail at this first level of verification.
What could be the best possible approach for achieving the same?
Regards,
Bhupesh
Since you have files, you can add them into HDFS and run MapReduce on top of them. There you have a hold on each row, so you can verify the number of columns, their types, and any other validations.
Regarding JSON/XML, there is a slight overhead in making MapReduce identify records in those formats. However, with respect to validation, those formats support schema validation, which you can enforce, and you can also restrict a field to specific values using the schema. So once the schema is ready, you parse the data (e.g. XML to Java objects) and then store it at another, final HDFS location for further use (like HBase). When you are sure the data is validated, you can create Hive tables on top of it.
Use the utility below to create temp tables each time, based on the schema you receive in CSV format in the staging directory, then apply some conditions to check whether you have valid columns. Finally, load into the original table.
https://github.com/enahwe/Csv2Hive
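As a concrete example of that first-level check, here is a minimal sketch of a pre-load gate you could run before pushing a file to HDFS or issuing the Hive load; the file path and the column list are hypothetical placeholders for your ~200-column schema.

    # Minimal pre-load gate: reject a CSV whose header does not match the expected
    # schema before it is pushed to HDFS/Hive. Paths and column names are made up.
    import csv
    import sys

    EXPECTED_COLUMNS = ["col_001", "col_002", "col_003"]  # replace with the real ~200 columns

    def validate_header(csv_path, expected):
        with open(csv_path, newline="") as f:
            header = next(csv.reader(f))
        if len(header) != len(expected):
            print(f"REJECT {csv_path}: {len(header)} columns, expected {len(expected)}")
            return False
        if header != expected:
            print(f"REJECT {csv_path}: column names/order differ from the schema")
            return False
        return True

    if __name__ == "__main__":
        # e.g. python validate_header.py /staging/incoming_file.csv
        sys.exit(0 if validate_header(sys.argv[1], EXPECTED_COLUMNS) else 1)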

How to work on a specific part of a CSV file uploaded into HDFS?

I'm new to Hadoop and I have a question: if I export a relational database into CSV files and then upload them into HDFS, how can I work on a specific part (table) of the file using MapReduce?
Thanks in advance.
I assume that the RDBMS tables are exported to individual CSV files, one per table, and stored in HDFS. I presume that you are referring to column data within the table(s) when you mention 'specific part (table)'. If so, place the individual CSV files into separate file paths, say /user/userName/dbName/tables/table1.csv.
Now you can configure the job for the input path and field occurrences. You may consider using the default input format so that your mapper gets one line at a time as input. Based on the configuration/properties, you can read the specific fields and process the data.
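If you prefer a scripting approach, a minimal Hadoop Streaming mapper along those lines could look like the sketch below; the field indexes are hypothetical placeholders, and the same logic can be written as a Java Mapper. You would submit it with the Hadoop Streaming jar, pointing -input at the table's CSV path and -mapper at this script.

    #!/usr/bin/env python3
    # Minimal Hadoop Streaming mapper: read lines of one table's CSV from stdin
    # and emit only the fields of interest (field indexes are made-up examples).
    import csv
    import sys

    KEY_FIELD = 0    # e.g. the table's primary-key column
    VALUE_FIELD = 3  # e.g. the column you want to process

    def main():
        for row in csv.reader(sys.stdin):
            if len(row) <= max(KEY_FIELD, VALUE_FIELD):
                continue  # skip short or malformed lines
            print(f"{row[KEY_FIELD]}\t{row[VALUE_FIELD]}")

    if __name__ == "__main__":
        main()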
Cascading allows you to get started very quickly with MapReduce. It provides a framework that lets you set up Taps to access sources (your CSV file) and process them inside a pipeline, for example to add column A to column B and place the sum into column C by selecting them as Fields.
Use a BigTable-style approach, which means converting your database into one big table.
