Strategy to load a set of files in Talend - etl

I want to know the best strategy to tackle the following problem in Talend:
I need to load data from a set of delimited files that are stored in a directory with names like (SAMPLE1.DAT, SAMPLE2.DAT, ... , SAMPLEX.DAT)
The target will be a table in a MySQL database
I have to load all data at once because after this task I need to work with all records in the same table
I'm a bit confused because I don't know if this is possible in Talend. I was looking at the tFileInputDelimited component, but I couldn't find a way to solve it.
Thanks

To read several files from one directory, you would use the tFileList component. It allows you to specify a directory and a file name pattern. All files in the directory matching the pattern will be processed, one after the other.
You need to use an "Iterate" link from the tFileList component to those components that describe what you want to do with each file. In your case, you would start with a tFileInputDelimited component (read the file) and connect the main output of that to a tMysqlOutput component. The MySQL component will, by default, just append the data to an existing table, so that should get you the result you want.
In the tFileInputDelimited component, you would not use a fixed filename, but a variable filename which is set by the tFileList component for each iteration (your loop variable, so to speak). The name of that loop variable can be seen in the "Outline" view in the Studio, usually in the bottom left corner.
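For example, as a minimal sketch assuming your tFileList component is named tFileList_1 (adjust the name to whatever the Studio gave yours), the "File name/Stream" field of the tFileInputDelimited component would contain the expression
((String)globalMap.get("tFileList_1_CURRENT_FILEPATH"))
which resolves to the full path of the file being processed in the current iteration.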

You would use the components tFileInputDelimited into tMap (optional) into tMysqlOutput.
Step 1 : configure the components like this, except you will use the delimited file input:
Step 2 : configure the component settings for the delimited file; click the disk icon for the wizard:
Step 3 : configure your database by right-clicking on Db Connection under Metadata, then follow the wizard:
Step 4 : right-click on each component and choose Row > Main > drag to the next step in the flow.
Step 5 : open your tMap and map the columns from the file schema to the database schema.
Step 6 : run the job. It should work if you have followed all the wizards; if there are errors, just hover over the red component and it usually describes them pretty well. You will see how many records have been transferred as the job runs.
Step 7 : after you have made it that far, create a tFileOutputDelimited with the same schema as the input, right-click on the input, choose Row > Rejects, and drag that to the new delimited output; this is where any records that are rejected by the tMap will be sent.

Related

Azure Data Factory Dataflow with Azure Synapse Link to Remove Duplicate Records

I am using Azure Synapse Link to connect Azure Storage and load the raw data from CRM. The option I chose is append-only mode, and it creates duplicate records whenever anything changes in the CRM tables (modules). In that case, how do we handle this scenario: by aggregating in the dataflow to remove the duplicate records, or can it be handled in Dataverse (Power Apps) itself? Kindly advise.
e.g.:
accountnumber  accountname
222            XXX
222            XXX
222            XXX
How do we handle this on the Dataverse side, or with a dataflow aggregation? Kindly help me.
Link referred: https://learn.microsoft.com/en-us/power-apps/maker/data-platform/azure-synapse-link-advanced-configuration
I also found the snippet below in the MS docs, but where and how do we use it? If possible, please share a screenshot showing how the code below is used.
aggregate(groupBy(mycols = sha2(256,columns())),
each(match(true()), $$ = first($$))) ~> DistinctRows
As per the link below:
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-script#distinct-row-using-all-columns
I have tried to reproduce this in my Azure ADF dataflow environment.
A sample source is taken as in the image below.
Select the Aggregate transform next to the source.
In the Aggregate settings, click Group by and then click Open expression builder.
Enter the column name, enter the expression sha2(256,columns()), and then click Save and finish.
Then, in the Aggregate settings, click Aggregates and then click Open expression builder.
Click + Add column pattern near Column1 and then delete Column1. Enter true() in the matching condition. Then click the undefined column expression and enter $$ in the column name expression and first($$) in the value expression. Then click Save and finish.
Output Of Aggregate Transform
This can be done by writing a script in the dataflow. To write a dataflow transformation script, click the Script button.
The script for the equivalent UI transformation is shown in the image below.
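As a rough sketch of what that full script might look like (the stream names source1 and sink1 and the two-column schema are assumptions based on the example above, not the exact script from the screenshot):
source(output(
        accountnumber as string,
        accountname as string
    ),
    allowSchemaDrift: true,
    validateSchema: false) ~> source1
source1 aggregate(groupBy(mycols = sha2(256,columns())),
    each(match(true()), $$ = first($$))) ~> DistinctRows
DistinctRows sink(allowSchemaDrift: true,
    validateSchema: false) ~> sink1
The aggregate(...) ~> DistinctRows line is exactly the snippet quoted in the question; the surrounding source and sink lines are generated for you when you build the flow in the UI.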
In this way, you can aggregate in the dataflow and remove the duplicates.

How to read an excel sheet and put the cell value within different text fields through UiPath?

I have an Excel sheet as follows:
I have read the Excel contents and, to iterate over the contents later, I have stored them in an Output Data Table as follows:
Read Range - Output:
DataTable: CVdatatable
Output Data Table
DataTable: CVdatatable
Text: opCVdatatable
Finally, I want to read the text opCVdatatable in an iteration and write the values into text fields. So in the desired input fields I entered opCVdatatable or opCVdatatable + "[k(enter)]" as required.
But UiPath seems to start from the beginning of the Output Data Table whenever I call opCVdatatable.
In short, each desired input field is iteratively getting filled with all the data stored in the Output Data Table.
Can someone help me out please?
My first recommendation is to use the Workbook Read Range activity to read data from Excel, because it is quicker, works in the background, and does not require Excel to be installed on the system.
Start your sequence like this (note the add headers property is not checked):
You do not need to use Output Data Table, because this activity outputs a string containing all row items. What you want to do instead is access the items in the data table and output each one as a string in your Type Into, e.g., CVDatatable.Rows(0).Item(0).ToString, like so:
You mention you want to read the text opCVdatatable in an iteration and write the values into text fields. This is a little bit more complex, but I'll give you an example. You can use a For Each Row activity and loop through each row in CVDatatable, setting the index property if required. See below:
The challenge is to get the selector correct here and to make it dynamic, so that it targets a different text field per iteration. The selector for the Type Into activity will depend on the system you are targeting, but here is an example:
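As a purely hypothetical sketch (the application name, the window attributes, and the counter variable rowIndex are all assumptions, since the original example selector is not reproduced here), a dynamic selector built inside the loop might look like:
"<wnd app='myapp.exe' cls='#32770' /><wnd ctrlid='" + rowIndex.ToString() + "' />"
where rowIndex is an Int32 variable you increment at the end of each iteration, so that each pass targets the next text field.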
Also, here is a working XAML file for you to test.
Hope this helps.
Chris
Here's a different, more general approach. Instead of including the target in the process itself, the Excel would be modified to include parts of a selector:
Note that column B now contains an identifier, and this ID depends on the application you will be working with. For example, here's what my sample app looks like. As you can see, the first text box has an id of 585, the second one is 586, and so on (note that you can work with any kind of identifier, including the control's name, if exposed to UiPath):
Now, instead of adding multiple Type Into elements to your workflow, you would add just a single one, loop over each of the datatable's rows, and then create a dynamic selector:
In my case the selector for the Type Into activity looks as follows:
"<wnd cls='#32770' title='General' /><wnd ctrlid='" + row(1).ToString() + "' />"
This will allow you to maintain the process from the Excel sheet alone - if there's a new field that needs to be mapped, just add it to your sheet. No changes to the Workflow are required.

Get data's source in kettle

When I use Kettle, I was wondering how to get a table column's source column. For example, after I have merged two tables into one table based on the primary key, given any column in the output table, I would like to be able to tell which table it belongs to and get the original column name in the original table. Thank you for helping, and sorry for my poor English.
http://i.stack.imgur.com/xoR0s.png
Given any field in table3 (suppose a field named A in table3), I would like to know where it comes from without the graphical view (from Java code or other means), i.e. the original table name (here input1 or input2) and the original column name (maybe B in input1, which represents A in table3). Besides, I use MySQL.
There are a couple of ways to do this:
1) Manually. If you right-click on the output step and choose Show Output fields (or whatever it's called), you will see the "origin step" for each of the outgoing fields. You can do the same for input fields. Then you can trace them back to those origin steps, and repeat the process of viewing the input fields at those steps, and seeing those fields' origins, and so on. This is probably not what you're looking for.
2) With code. Prior to 6.0, you'd need to programmatically perform the same operations as are listed in option 1 above. In 6.0 there is the Data Lineage capability, which offers the LineageClient API that can find the origin fields for the specified output fields. For more information see my blog post describing the Data Lineage capability. Also I put a Gremlin Console in the PDI Marketplace, to make the use of LineageClient easier (and you can visually see the lineage graph too).
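If you want to do part of this yourself in code (the pre-6.0 route from option 2), here is a minimal sketch using the classic Kettle Java API; the transformation path merge.ktr and the step name table3 are placeholders for your own transformation:
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.core.row.RowMetaInterface;
import org.pentaho.di.core.row.ValueMetaInterface;
import org.pentaho.di.trans.TransMeta;

public class FieldOriginLister {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();                          // start the Kettle runtime
        TransMeta transMeta = new TransMeta("merge.ktr");  // placeholder .ktr path

        // Fields arriving at the named step; getOrigin() reports the step
        // that last created (or renamed) each field.
        RowMetaInterface fields = transMeta.getStepFields("table3");
        for (ValueMetaInterface field : fields.getValueMetaList()) {
            System.out.println(field.getName() + " <- " + field.getOrigin());
        }
    }
}
To trace a field all the way back to input1 or input2 you would repeat the same call for each upstream step, exactly as in the manual approach; the 6.0 LineageClient does that walk for you.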

Tibco BusinessWorks 6 Parsing CSV File

I am trying to parse a CSV file as an input for my process.
I used the FilePoller to detect any changes made in the .csv file. The poller is connected to the ParseData activity, which references the DataFormat resource where I am defining the structure of my data. The output of ParseData is used in the RenderXML activity to create an XML string.
I am facing the issue that I am not able to loop through each line in the CSV file to parse it into an XML document structure.
Do you have a suggestion for how to implement this?
Please find the "Demo" process attached to this post.
Link to Process
Thank you in advance
Adrian
Select both the RenderXml and WriteFile activities. You can do that by clicking one, and then holding the Shift button while you click on the second.
Right-click over the selected activities, and select Create Group > Iterate. This will create an Iterate Scope around the two selected activities.
Click on the Properties panel and select the output of the ParseData activity as the Variable List through which you would like to iterate.
I hope this helps!
As per my understanding, you are getting a repeating element in the ParseData output, and you want to map these repeated elements to the Render XML schema. This is not allowed unless the schema element you are mapping to is itself of a repeating type. If that element is of a repeating type, map the element from the right side to the left side and select the "for each" option. Then every element in the ParseData output will be mapped to the schema element.
Hope your doubt gets resolved.

Ignore error in SSIS

I am getting a "Violation of UNIQUE KEY constraint 'AK_User'. Cannot insert duplicate key in object 'dbo.tblUsers'" error when trying to copy data from an Excel file to a SQL database using SSIS.
Is there any way of ignoring this error and letting the package continue to the next record without stopping?
What I need is: if it inserts three records but the first record is a duplicate, instead of failing, it should continue with the other records and insert them.
There is a system variable called Propagate which can be used to continue or stop the execution of the package.
1. Create an OnError event handler for the task which is failing. Generally it is created for the entire Data Flow Task.
2. Press F4 to get the list of all variables and click on the icon at the top to show system variables. By default the Propagate variable will be True; you need to change it to False, which basically means that SSIS won't propagate the error to other components and will let the execution continue.
Update 1:
There are basically two ways to skip the bad rows:
1. Use a Lookup
Try to match the primary key column values in the source and destination, and then connect the Lookup No Match Output to your destination. If the value doesn't match the destination, the rows are inserted; otherwise just skip the rows, or redirect them to some table or flat file using the Lookup Match Output.
Example
For more details on the Lookup, refer to this article.
2. Or you can redirect the error rows to a flat file or a table. Every SSIS Data Flow component has an Error Output.
For example, for the Derived Column component, the error output dialog box is:
But this option may not be helpful in your case, as redirecting error rows at the destination doesn't work properly. If an error occurs, it redirects all the data without inserting any rows into the destination. I think this happens because the OLE DB destination does a bulk insert or inserts data using transactions. So try to use the Lookup to achieve your functionality.
