How to load multiple Excel files into different tables based on XLS metadata using SSIS?

I have multiple Excel files with two types of metadata. Now I have to push the data into two different tables, based on the metadata of the Excel files, using SSIS.

There are many, many different ways to do this. You'd need to share a lot more information on how your data is structured to really give a great answer, but here's the general strategy I'd suggest.
In the control flow tab, have a separate data flow for each Excel file. The data flows will all work the same, with the exception of having a different Excel source in each data flow, so it will be enough to get the first version working and then copy and paste for the other files.
In the data flow, use a conditional split transformation to read the metadata coming from Excel and send the row to the correct table.
If you really want to be fancy, however, you could create a child package that includes all your data flow logic. Using the Execute Package Task you can pass the Excel file name to the child package for each Excel file you need to import. This way you consolidate your logic in one package and can still import from multiple Excel files in parallel.
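The conditional-split step above can be sketched outside SSIS. Here is a minimal Python illustration of the routing logic; the metadata column name `file_type` and the two table names are assumptions, not anything from the question:

```python
# Sketch of the Conditional Split routing: each row carries a metadata
# field and is sent to one of two target tables based on its value.
# The field name "file_type" and the table names are assumptions.

def split_rows(rows):
    tables = {"TableA": [], "TableB": []}
    for row in rows:
        # Equivalent of an SSIS Conditional Split expression
        # such as: file_type == "A"
        if row["file_type"] == "A":
            tables["TableA"].append(row)
        else:
            tables["TableB"].append(row)
    return tables

rows = [
    {"file_type": "A", "value": 1},
    {"file_type": "B", "value": 2},
]
routed = split_rows(rows)
```

In SSIS itself this is one Conditional Split transformation with one output per expression plus a default output; the sketch only shows the decision being made per row.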

Related

Incremental data processing for file in talend

How can you manage incremental data processing when you don't have a database or anything else to log the previous execution timestamp?
Can we use the tAddCRCRow component? But how will it know whether the data has already been processed, especially when both source and destination are nothing but collections of files?
Thank you.
You have to use your target file as a lookup and identify the values that already exist. This will resolve your issue.
In the case of files, you may have to use multiple files as the lookup, or create a separate table that holds the unique key values from all the files and use that as the lookup.
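The target-as-lookup idea can be sketched in plain Python: read the keys already present in the target, then keep only the source rows whose key is new. The key column name `id` and the inline CSV data are assumptions for illustration:

```python
import csv
import io

# Sketch: use the target file as a lookup of already-processed keys
# and keep only source rows whose key is not present yet.
# The key column "id" is an assumption.

target_csv = "id,name\n1,alice\n2,bob\n"
source_csv = "id,name\n2,bob\n3,carol\n"

# Build the lookup set from the target file.
existing = {row["id"] for row in csv.DictReader(io.StringIO(target_csv))}

# Incremental load: only rows not seen before.
new_rows = [row for row in csv.DictReader(io.StringIO(source_csv))
            if row["id"] not in existing]
```

In Talend the same shape is a tFileInputDelimited on the target feeding a lookup flow into tMap, with an inner-join reject (or "catch lookup inner join reject") output carrying the new rows.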

How to save a text file to Hive using the table of contents as schema

I have many project reports in text format (Word and PDF). These files contain data that I want to extract, such as references, keywords, names mentioned, and so on.
I want to process these files with Apache Spark and save the result to Hive, using the power of DataFrames (with the table of contents as the schema). Is that possible?
Could you share any ideas about how to process these files?
As far as I understand, you will need to parse the files using Tika and manually create custom schemas as described here.
Let me know if this helps. Cheers.
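Whichever parser you use (Tika or otherwise), the extraction step just needs to turn raw text into structured records that Spark can later wrap in a DataFrame. Here is a hedged stdlib sketch of that step; the section markers (`Keywords:`, bracketed reference numbers) and field names are assumptions about the report layout, not something from the question:

```python
import re

# Sketch: turn raw report text into structured records that could
# later become rows of a Spark DataFrame. The markers and field
# names below are assumptions about the report layout.

text = """Keywords: etl, spark, hive
References: [1] Some Paper, [2] Another Paper"""

def extract(report):
    keywords = re.search(r"Keywords:\s*(.+)", report)
    references = re.findall(r"\[\d+\]\s*([^,\n]+)", report)
    return {
        "keywords": [k.strip() for k in keywords.group(1).split(",")]
                    if keywords else [],
        "references": [r.strip() for r in references],
    }

record = extract(text)
```

Once each document yields such a dict, `spark.createDataFrame([...])` with an explicit schema (or a case class / Row) gets you to Hive via `saveAsTable`.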

Can we send two types of files in a single flow file in Apache NiFi?

I have made a custom processor which converts an Excel workbook to JSON and outputs it, but I want both the workbook and the JSON in the output. Is that possible?
The content of a flow file is just bytes; you could put whatever you want in it, assuming that something downstream knows how to interpret the combination of an Excel workbook and a JSON document.
A more common approach is for a processor to have multiple relationships: one would be "original", where you transfer the original input to the processor (in this case the Excel workbook); another would be "success", where you transfer the successfully created JSON; and perhaps a "failure" relationship, where you transfer the Excel workbook if you couldn't create the JSON for some reason.
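The relationship pattern above can be sketched abstractly. This is a Python analogue, not NiFi API code; the relationship names mirror NiFi's `original`/`success`/`failure` convention, and the converter function is a hypothetical stand-in for the Excel-to-JSON logic:

```python
import json

# Sketch of the NiFi relationship pattern: transfer the untouched
# input to "original" and the converted result to "success", or the
# input to "failure" if conversion raises.

def process(workbook_bytes, convert):
    transfers = {"original": [], "success": [], "failure": []}
    try:
        result = convert(workbook_bytes)
        transfers["original"].append(workbook_bytes)
        transfers["success"].append(result)
    except Exception:
        transfers["failure"].append(workbook_bytes)
    return transfers

# Hypothetical converter standing in for the Excel-to-JSON step.
out = process(b"fake-xlsx-bytes", lambda b: json.dumps({"rows": len(b)}))
```

In an actual NiFi processor you would declare the three `Relationship` objects, clone or re-route the incoming flow file to `original`, and write the JSON into a new flow file transferred to `success`.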

How to import only some columns from XLS with ETL?

I want to do something like Read only certain columns from xls in Jaspersoft ETL Express 5.4.1, but I don't know the schema of the files. However, from what I read here, it looks like I can only do this with the Enterprise Version's Dynamic Schema thing.
Is there no other way?
You can do this using the tMap component. Design the job like below:
tFileInputExcel --main--> tMap --main--> your output
Create metadata for your Excel input file, then use this metadata in your input component.
In tMap, select only the required columns in the output.
See the tMap image, where I am selecting only two columns from the input flow.
The Enterprise version has many features, and dynamic schema is the most important one, but for your case it is not required. It is needed when you have a variable schema, i.e. you don't know in advance how many columns you will receive in your feed.

Kettle: load CSV file which contains multiple data tables

I'm trying to import data from a csv file which, unfortunately, contains multiple data tables. Actually, it's not really a pure csv file.
It contains a header field with some metadata and then the actual csv data parts are separated by:
//-------------
Table <table_nr>;;;;
An example file looks as follows:
Summary;;
Reporting Date;29/05/2013;12:36:18
Report Name;xyz
Reporting Period From;20/05/2013;00:00:00
Reporting Period To;26/05/2013;23:59:59
//-------------
Table 1;;;;
header1;header2;header3;header4;header5
string_aw;0;0;0;0
string_ax;1;1;1;0
string_ay;1;2;0;1
string_az;0;0;0;0
TOTAL;2;3;1;1
//-------------
Table 2;;;
header1;header2;header3;header4
string_bv;2;2;2
string_bw;3;2;3
string_bx;1;1;1
string_by;1;1;1
string_bz;0;0;0
What would be the best way to load such data using Kettle?
Is there a way to split this file into the header and csv data parts and then process each of them as separate inputs?
Thanks in advance for any hints and tips.
Best,
Haes.
I don't think there are any steps that will really help you with data in that format. You probably need to do some preprocessing before bringing the data into a CSV Input step. You could still do this within your job, though, by calling out to the shell first and executing a command there, such as an awk script that splits the file into its component files, and then loading those files via the normal Kettle pattern.
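The preprocessing step could look like this Python sketch, which splits the mixed file on the `//-------------` separator lines into a header block plus one chunk per `Table N` section; how you then write each chunk to its own file is left out here:

```python
# Sketch of the preprocessing: split the mixed file into its header
# block and one chunk per "Table N" section, using the
# "//-------------" lines as separators.

sample = """Summary;;
Reporting Date;29/05/2013;12:36:18
//-------------
Table 1;;;;
header1;header2;header3;header4;header5
string_aw;0;0;0;0
//-------------
Table 2;;;
header1;header2;header3;header4
string_bv;2;2;2
"""

def split_sections(text):
    sections, current = [], []
    for line in text.splitlines():
        if line.startswith("//-"):
            sections.append(current)
            current = []
        else:
            current.append(line)
    sections.append(current)
    return sections

parts = split_sections(sample)
```

Each element of `parts` (after the first, which is the metadata header) starts with its `Table N;...` line followed by a self-contained CSV block, so each can be written to a temporary file and fed to a normal CSV Input step.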
