Change datasource in the same tranformation ETL - etl

I have a transformation to extract data for a database but the database source has many tables with different names. The database does have a structure consistent with my transformation, e.g: Events_1, Events_2, Events_3.
It possible change the connection parameters for the extraction of all tables dynamically? I want to extract all data with just one job that will still work when there is a new insert or a new table like Events_600.
Screen-shot of DB:

You can use variables in the transformation and use them to set the connection and even to change the source table.
You will need a job to run through the list of variable values, then once for each row pass those values as parameters to the transformation, for example.

Related

Read data from multiple tables at a time and combine the data based where clause using Nifi

I have scenario where I need to extract multiple database table data including schema and combine(combination data) them and then write to xl file?
In NiFi the general strategy to read in from a something like a fact table with ExecuteSQL or some other SQL processor, then using LookupRecord to enrich the data with a lookup table. The thing in NiFi is that you can only do a table at a time, so you'd need one LookupRecord for each enrichment table. You could then write to a CSV file that you could open in Excel. There might be some extensions elsewhere that can write directly to Excel but I'm not aware of any in the standard NiFi distro.

Providing user defined column names in AWS glue

I have a lot of parquet files. I need to read them through Amazon Glue and then provide column names to the table that is being read.
The problem is parquet already have column names which is being read by the crawler and show it in the table. Is it possible to provide my column names to these parquet files in glue
To replace the detected column names with names of your own, you could either:
Use one of the following build in transformations on DynamicFrame
ApplyMapping - Applies a declarative mapping to this DynamicFrame and returns a new DynamicFrame with those mappings applied. (source column, source type, target column, target type)
RenameField - Renames a field in this DynamicFrame and returns a new DynamicFrame with the field renamed. (oldName -> newName)
See the Scala or Python ETL programming guides for more detail.
Or try updating the data catalog field names manually if you don't need to continuously re-crawl the data (or if you do, it is possible to prevent a glue crawler from updating existing data catalog tables via the crawler configuration).
Alternatively, if your requirements are more discrete, the map transform is available to convert each DynamicRecord in the DynamicFrame to a new DynamicRecord of your choosing.

Can I keep data of different file formats in same hive table?

I am receiving data of formats like csv, xml, json and I want to keep all the files in same hive table.Is it achievable?
Hive expects all the files for one table to use the same delimiter, same compression applied etc. So, you cannot use a Hive table on top of files with multiple formats.
The solution you may want to use is
Create a separate table (json/xml/csv) for each of the file formats
Create a view for the UNION of the 3 tables created above.
This way the consumer of the data has to query only one view/object, if that's what you are looking for.
Yes, you can achieve this through a combination of different external tables.
Because different SerDes with different specifications for how to read columns in the different files will be needed, you will need to create one external table per type of file (and table). The data from each of these external tables can then be combined into a view with UNION, as suggested by Ramesh. The view can could then be used for reading from these, and you could e.g. insert the data into a managed table.

Schema verification/validation before loading data into HDFS/Hive

I am a newbie to Hadoop Ecosystem and I need some suggestion from Bigdata experts on achieving schema verification/validation before loading the huge data into hdfs.
The scenario is:
I have a huge dataset with given schema (having around 200
column-header in it). This dataset is going to be stored in Hive
tables/HDFS. Before loading the data into hive table/hdfs I want to
perform a schema level verification/validation on the data supplied to
avoid any unwanted errors/exception while loading the data into hdfs.
Like in case somebody tries to pass a data file having fewer or more
number of columns in it then at the first level of verification this
load fail.
What could be the best possible approach for achieving the same?
Regards,
Bhupesh
Since you have files, you can add them into HDFS,and run map reduce on top of that. Here you would be having a hold on each row, so you can verify number of columns, their types and any other validations.
When i referred to jason/xml, there is slight overhead to make map reduce identify the records in that format. However with respect to validation there is schema validation which you can enforce and also define only specific values for a field using schema. So once the schema is ready, your parsing(xml to java) and then store them at another final HDFS location for further use(like HBase). When you are sure that data is validated, you can create Hive tables on top of that.
Use below utility to create temp tables every time based on the schema you receive in csv file format in staging directory and then apply some conditions to identify whether you have valid columns or not. Finally load into original table.
https://github.com/enahwe/Csv2Hive

AS400 to Oracle 10g via xml with Informatica Powercenter

Is the following workflow possible with Informatica Powercenter?
AS400 -> Xml(in memory) -> Oracle 10g stored procedure (pass xml as param)
Specifically, I need to take a result set eg. 100 rows. Convert those rows into a single xml document as a string in memory and then pass that as a parameter to an Oracle stored procedure that is called only once. I understood that a workflow runs row-by-row and this kind of 'batching' is not possible.
Yes, this scenario should be possible.
You can connect to AS/400 sources with native Informatica connector(s), although this might require (expensive) licenses. Another option is to extract the data from AS/400 source into a text file, and use that as a normal file source.
To convert multiple rows into one row, you would use an Aggregator transformation. You may need to create a dummy column (with same value for all rows) using an Expression, and use that column as the grouping key of the Aggregator, to squeeze the input into one single row. Row values would be concatenated together (separated by some special character) and then you would use another Expression to split and parse the data into as many ports (fields) as you need.
Next, with an XML Generator transformation you can create the XML. This transformation can have multiple input ports (fields) and its result will be directed into a single output port.
Finally, you would load the generated XML value into your Oracle target, possibly using a Stored Procedure transformation.

Resources