Talend : Generate files dynamically

Talend : Generate files dynamically - etl

I'm working on a projet with Talend (TOS 5.2.2) which aims to split one data (.dat) file into many files. The .dat file structure is like this:
countryA|CountryCode|Name
town|Countrycode|NAme
countryB|CountryCode|Name
town|Countrycode|NAme
town|Countrycode|NAme
town|Countrycode|NAme
countryC|CountryCode|Name
town|Countrycode|NAme
town|Countrycode|NAme
I want to generate dynamically files, and every file contains : A Country and their related town (based on CountryCode).
No database must be used.
Any ideas?

You need to have a tJavaRow after the dat input.
At the start define a
String currCntry = "";
int prevCntryCode = -1 ;
In the tJavaRow you should compare the Actual cntrycode with the previous one, if it isn't matching, then you should make currCntry = row1.Country After that, you should have a simple tFileOutPutDelimited where the name is set like this: currCntry + ".DAT"
If you run this, the output dat will contain the country rows aswell, what you can do is when you found a new country you manually set the first row to be a "header" row.

Related

Excel Power Query import (same file but with different month name)

Each month I need to automate the importing of reference data, however the Excel file is named differently.
Monthly Data File January 2022.xlsx
Monthly Data File February 2022.xlsx
Could you point me in the right direction please?

In excel, use formulas ... name manager... to pick a cell and give it a range name, like NameVariable
Enter your filepath and filename C:\temp\Monthly Data File January 2022.xlsx in that named cell. Change the content of that range as needed when filename changes later
Load one file into powerquery, then in home ... advanced editor ... add a formula that refers to that range name, similar to this:
MVar = Excel.CurrentWorkbook(){[Name="NameVariable"]}[Content]{0}[Column1],
and change any hard coded references to the filename to use MVar instead
As an example, change
let Source = Excel.Workbook(File.Contents("C:\temp\Monthly Data File January 2022.xlsx"), null, true),
to be
let MVar = Excel.CurrentWorkbook(){[Name="NameVariable"]}[Content]{0}[Column1],
Source = Excel.Workbook(File.Contents(MVar), null, true),

How to extract multiple values as multiple column data from filename by Informatica PowerCenter?

I am very new to Informatica PowerCenter, Just started learning. Looking for help. My requirement is : I have to extract data from flat file(CSV file) and store the data into Oracle Table. Some of the column value of the target table should be coming from extracting file name.
For example:
My Target Table is like below:
USER_ID Program_Code Program_Desc Visit Date Term
EACRP00127 ER Special Visits 08/02/2015 Aug 2015
My input filename is: Aug 2015 ER Special Visits EACRP00127.csv
From this FileName I have to extract "AUG 2015" as Term, "ER Special Visits" as Program_Desc and "EACRP00127" as Program_Code along with some other fields from the CSV file.
I have found one solution using "Currently Processed Filename". But with this I am able to get one single value from filename. how can I extract 3 values from the filename and store in the target table? Looking for some shed of light towards solution. Thank you.

Using expression transformation you can create three output values from Currently Processed Filename column.
So you get the file name from SQ using this field 'Currently Processed Filename'. Then you can substring the whole string to get what you want.
input/output = Currently Processed Filename
o_Term = substr(Currently Processed Filename,1,9)
o_Program_Desc = substr(Currently Processed Filename,10,18)
o_Program_Code = substr(Currently Processed Filename,28,11)

Combine csv files with different structures over time

I am here to ask you a hypothetical question.
Part of my current job consists of creating and updating dashboards. Most dashboards have to be updated everyday.
I've created a PowerBI dashboard from data linked to a folder filled with csv files. I did some queries to edit some things. So, everyday, I download a csv file from a client's web application and add the said file to the linked folder, everything gets updated automatically and all the queries created are applied.
Hypothetical scenario: my client changes the csv structure (e.g. column order, a few column name). How can I deal with this so I can keep my merged csv files table updated?
My guess would be to put the files with the new structure in a different folder, apply new queries so the table structures match, then append queries so I have a single table of data.
Is there a better way?
Thanks in advance.

Say I have some CSVs (all in the same folder) that I need to append/combine into a single Excel table, but:
the column order varies in some CSVs,
and the headers in some CSVs are different (for whatever reason) and need changing/renaming.
First CSV:
a,c,e,d,b
1,1,1,1,1
2,2,2,2,2
3,3,3,3,3
Second CSV:
ALPHA,b,c,d,e
4,4,4,4,4
5,5,5,5,5
6,6,6,6,6
Third CSV:
a,b,charlie,d,e
7,7,7,7,7
8,8,8,8,8
9,9,9,9,9
10,10,10,10,10
If the parent folder (containing my CSVs) is at "C:\Users\user\Desktop\changing csvs\csvs", then this M code should help me achieve what I need:
let
renameMap = [ALPHA = "a", charlie = "c"],
filesInFolder = Folder.Files("C:\Users\user\Desktop\changing csvs\csvs"),
binaryToCSV = Table.AddColumn(filesInFolder, "CSVs", each
let
csv = Csv.Document([Content], [Delimiter = ",", Encoding = 65001, QuoteStyle = QuoteStyle.Csv]),
promoteHeaders = Table.PromoteHeaders(csv, [PromoteAllScalars = true]),
headers = Table.ColumnNames(promoteHeaders),
newHeaders = List.Transform(headers, each Record.FieldOrDefault(renameMap, _, _)),
renameHeaders = Table.RenameColumns(promoteHeaders, List.Zip({headers, newHeaders}))
in
renameHeaders
),
append = Table.Combine(binaryToCSV[CSVs])
in
append
You'd need to change the folder path in the code to whatever it is on your system.
Regarding this line renameMap = [ALPHA = "a", charlie = "c"],, I needed to change "ALPHA" to "a" and "charlie" to "c" in my case, but you'd need to replace with whatever columns need renaming in your case. (Add however many headers you need to rename.)
This line append = Table.Combine(binaryToCSV[CSVs]) will append the tables to one another (to give you one table). It should automatically handle differences in column order. If there any rogue columns (e.g. say there was a column f in one of my CSVs that I didn't notice), then my final table will contain a column f, albeit with some nulls/blanks -- which is why it's important all renaming has been done before that line.
Once combined, you can obviously do whatever else needs doing to the table.
Try it to see if it works in your case.

Pentaho Data Integration (DI) Get Last File in a Directory of a SFTP Server

I am doing a transformation on Pentaho Data Integration and I have a list of files in a directory of my SFTP server. This files are named with FILE_YYYYMMDDHHIISS.txt format, my directory looks like that:
mydirectory
FILE_20130701090000.txt
FILE_20130701170000.txt
FILE_20130702090000.txt
FILE_20130702170000.txt
FILE_20130703090000.txt
FILE_20130703170000.txt
My problem is that I need get the last file of this list in accordance of its creation date, to pass it to other transformation step...
How can I do this in Pentaho Data Integration?

In fact this is quite simple because your file names can be sorted textually, and the max in the sort list will be your most recent file.
Since a list of files is likely short, you can use a Memory Group by step. A grouping step needs a separate column by which to aggregate. If you only have column and you want to find the max in the entire set, you can add a grouping column with an Add Constants step, and configure it to add a column with, say an integer 1 in every row.
Configure your Memory Group by to group on the column of 1s, and use the file name column as the subject. Then simply select the Maximum grouping type. This will produce a single row with your grouping column, the file name field removed and the aggregate column containing your max file name. It would look something like this:

algorithm - how do I use data from one file as criteria for search/delete from a different file?

I have two different pipe-delimited data files. One is larger than the other. I'm trying to selectively remove data from the large file (we'll call it file A), based on the data contained in the small file (file B). File A contains all of the data, and file B contains only a portion of the data from file A.
I want a function or existing program that removes all of the data contained within file B from file A. I had in mind a function like this:
Pseudo-code:
while !eof(fileB) {
criteria = readLine(fileB);
lineToRemove = searchForLine(criteria, fileA);
deleteLine(lineToRemove, fileA);
}
However, that solution seems very inefficient to me. File A has 23,000 lines in it, and file B has 17,000. And the data contained within file B is literally scattered throughout file A.
If there is a program that can do this, I'd prefer it over code. I'm not picky about the code either. C++ is my strong language, but this data file is going to get converted into a SQL database in the near future so I'm good with SQL/PHP code as well.

Load the two tables into SQL, whatever the database. Doing this sort of manipulation is what databases are designed for. Then you can execute the command:
delete from A
where A.criteria = (select B.criteria from B)
However, I would put the data into Staging tables, and then create and populate the data that I want in SQL. Something like:
create table A ( . . . )
insert into A
select *
from StagingA
where A.criteria not in (select B.criteria from StagingB)
(Here I've used "*" and an insert without a column list. In practice, you should have the list of columns.)

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Talend : Generate files dynamically - etl

Related

Excel Power Query import (same file but with different month name)

How to extract multiple values as multiple column data from filename by Informatica PowerCenter?

Combine csv files with different structures over time

Pentaho Data Integration (DI) Get Last File in a Directory of a SFTP Server

algorithm - how do I use data from one file as criteria for search/delete from a different file?

Categories

Resources