Append multiple CSVs into one single file with Apache Nifi - apache-nifi

I have a folder of CSV files that share the same first 3 columns but differ in their last N columns, where N is at least 2 and at most 11.
The last N columns have numbers as headers, for example:
File 1:
AAA,BBB,CCC,0,10,15
1,India,c,0,28,54
2,Taiwan,c,0,23,52
3,France,c,0,26,34
4,Japan,c,0,27,46
File 2:
AAA,BBB,CCC,0,5,15,30,40
1,Brazil,c,0,20,64,71,88
2,Russia,c,0,20,62,72,81
3,Poland,c,0,21,64,78,78
4,Litva,c,0,22,66,75,78
Desired output:
AAA,BBB,CCC,0,5,10,15,30,40
1,India,c,0,null,28,54,null,null
2,Taiwan,c,0,null,23,52,null,null
3,France,c,0,null,26,34,null,null
4,Japan,c,0,null,27,46,null,null
1,Brazil,c,0,20,null,64,71,88
2,Russia,c,0,20,null,62,72,81
3,Poland,c,0,21,null,64,78,78
4,Litva,c,0,22,null,66,75,78
Is there a way to append these files together with NiFi so that a new column gets created (even if I do not know the column name beforehand) whenever a file with additional data is present in the folder?
I tried the MergeContent processor, but by default it just concatenates the content of all my files without regard for the headers (every header row ends up in the output).

What you could do is write a script that combines the rows and columns and run it with the ExecuteStreamCommand processor. This would allow you to write the merge logic in whatever language you want.
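As an illustration, here is a minimal Python sketch of the core alignment logic such a script could use. It is my own example, not part of a ready-made NiFi flow, and it assumes what the question shows: the first three columns are always the shared ones, the trailing headers are numeric, and missing values should become `null`.

```python
import csv
import io

def merge_csvs(csv_texts, fixed=3, fill="null"):
    """Merge CSV texts that share their first `fixed` columns but have
    differing numeric trailing columns; gaps are filled with `fill`."""
    parsed = [list(csv.reader(io.StringIO(t))) for t in csv_texts]
    # Union of all trailing headers across files, sorted numerically.
    extra = sorted({h for rows in parsed for h in rows[0][fixed:]}, key=int)
    header = parsed[0][0][:fixed] + extra
    out = [header]
    for rows in parsed:
        cols = rows[0]
        for row in rows[1:]:
            rec = dict(zip(cols, row))          # map header -> value per row
            out.append([rec.get(h, fill) for h in header])
    return out
```

In an ExecuteStreamCommand setup, the script would read the incoming files (or the whole directory), run this alignment, and write the merged CSV to stdout; the function above is only the column-alignment core.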

Related

How to take one column from each of two TXT files and create a new TXT with the two columns?

I have two text files with only one column each.
I need to take the column from each text file and create a new tab-separated text file containing both columns.
The columns have no relating ID, but they correspond to each other by row order.
I could do that in Excel, but there are more than 200 thousand lines, which Excel will not accept.
How can I do it in Pentaho?
Take 2 Text file input steps and read both files.
Then add an Add constants step to each stream, creating the same column with the same value in both streams.
Use a Stream lookup / Merge join step to join the streams on that constant column.
Finally, generate the output file.
You can read both files with Text file input, add "row number" in each stream, which gives you two streams of 2 fields each. Then you can Merge join both streams on Row number, and finally a Select fields step to clean up the output so that only the two relevant fields are kept. Then Text file output to write it.
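The row-number pairing both answers describe is easy to express outside Pentaho as well; here is a small Python sketch of the same logic (a hypothetical helper of my own, not a Pentaho step):

```python
from itertools import zip_longest

def paste_columns(lines_a, lines_b, sep="\t", fill=""):
    """Pair line i of file A with line i of file B, tab-separated --
    the same effect as joining two streams on an added row number."""
    return [f"{a}{sep}{b}" for a, b in zip_longest(lines_a, lines_b, fillvalue=fill)]
```

`zip_longest` pads the shorter file with empty fields, which mirrors an outer merge join on the row number.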

How to find columns count of csv(Excel) sheet in ETL?

We can count the rows of a CSV file with the Get Files Rows Count input step in ETL. How do we find the number of columns of a CSV file?
Just read the first row of the CSV file using Text file input with header rows set to 0. Usually, the first row contains the field names. If you read the whole row into a single field, you can use Split field to rows to get one field name per row; the number of rows then tells you the number of fields. There are other ways, but this one easily prepares for a subsequent metadata injection, if that's what you have in mind.
No need for metadata injection: in Split field to rows, check "Include rownum in output" and give that field a name. Then apply Sort rows on that field and use a Sample rows step, and you will get the number of fields present in the file.
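Outside of Pentaho, the underlying check is just "read the first row and count its fields"; a tiny Python sketch (the function name and default delimiter are my own assumptions):

```python
import csv
import io

def count_columns(csv_text, delimiter=","):
    """Return the number of fields in the first (header) row of a CSV."""
    first_row = next(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    return len(first_row)
```

Using `csv.reader` rather than a plain `split` keeps quoted fields containing the delimiter from being miscounted.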

How to start reading from second row using CSV data config element in jmeter

CSV Data Set Config always reads from the first row. I want to add column headers to the CSV file, so I want it to start reading from the second row.
Below is the setting in Thread Group.
No.of threads = 1
Loop Count= 10 (depends on no.of rows in CSV file)
What version of JMeter are you using? It seems that leaving the Variable Names field empty will do the trick. More info here:
Versions of JMeter after 2.3.4 support CSV files which have a header line defining the column names. To enable this, leave the "Variable Names" field empty. The correct delimiter must be provided.
Leave the Variable Names field empty.
In the CSV file, add a header. The header values will act as variable names, which can be used as parameters in requests.
A sample facility.csv file:
facility1,name
GG1LMD,test1
There is a field called "Ignore first line (only used if Variable Names is not empty)"; set this to true.

Pentaho Data Integration (DI) Get Last File in a Directory of a SFTP Server

I am building a transformation in Pentaho Data Integration and I have a list of files in a directory of my SFTP server. These files are named in FILE_YYYYMMDDHHIISS.txt format; my directory looks like this:
mydirectory
FILE_20130701090000.txt
FILE_20130701170000.txt
FILE_20130702090000.txt
FILE_20130702170000.txt
FILE_20130703090000.txt
FILE_20130703170000.txt
My problem is that I need to get the last file in this list according to its creation date, so I can pass it to another transformation step...
How can I do this in Pentaho Data Integration?
In fact this is quite simple, because your file names sort textually: the maximum of the sorted list will be your most recent file.
Since the list of files is likely short, you can use a Memory Group by step. A grouping step needs a column to aggregate by; if you only have one column and want the max of the entire set, add a grouping column with an Add constants step, configured to add a column containing, say, the integer 1 in every row.
Configure the Memory Group by to group on the column of 1s, use the file name column as the subject, and select the Maximum grouping type. This produces a single row with your grouping column and an aggregate column containing your max file name.
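The "sorts textually, so the max is the newest" claim is easy to verify; a one-line Python sketch (the function name is mine) applied to the directory listing above:

```python
def latest_file(names):
    """FILE_YYYYMMDDHHIISS.txt names sort lexicographically in timestamp
    order, so the plain string maximum is the most recent file."""
    return max(names)
```

This works precisely because the timestamp is zero-padded and ordered year-first; names like FILE_2013731.txt without padding would break the property.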

SSIS - Create a CSV file from 2 different OLE DB sources

I'm looking for a way in SSIS (2005) to create a CSV file from 2 or more OLE DB sources with different layouts. Again, the column layout differs across all the sources.
EX.
Source 1
A,1234,ABCD,1234
Source 2
A,ABCD,1234
Final CSV Result File
A,1234,ABCD,1234
A,ABCD,1234
There are a couple of options. You could serialize each source's columns into a single column, commas included, and then merge the two sources. It's not a great solution, but it would meet your business needs as stated.
Combine the two inputs with a Union All component feeding your Destination component. In the union, you'll have to choose which input columns map to which output columns.
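The first option, flattening each row into one comma-joined column so differently-shaped rows can share a single output column, can be sketched in Python (illustrative only; in SSIS itself you would build the concatenation with a Derived Column or Script Component before the Union All):

```python
def flatten_rows(rows):
    """Serialize each row into one comma-joined string so rows with
    different column counts can share a single output column."""
    return [",".join(row) for row in rows]

def union_sources(*sources):
    """Concatenate the flattened rows of every source, in order --
    the effect of a Union All over single-column inputs."""
    out = []
    for rows in sources:
        out.extend(flatten_rows(rows))
    return out
```

Note that any field that itself contains a comma would need quoting before joining; the sketch ignores that for clarity.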
