I have a folder of CSV files that share the same first 3 columns and have a varying number of additional columns, N, where N ranges from 2 to 11.
The last N columns have numbers as headers, for example:
File 1:
AAA,BBB,CCC,0,10,15
1,India,c,0,28,54
2,Taiwan,c,0,23,52
3,France,c,0,26,34
4,Japan,c,0,27,46
File 2:
AAA,BBB,CCC,0,5,15,30,40
1,Brazil,c,0,20,64,71,88
2,Russia,c,0,20,62,72,81
3,Poland,c,0,21,64,78,78
4,Litva,c,0,22,66,75,78
Desired output:
AAA,BBB,CCC,0,5,10,15,30,40
1,India,c,0,null,28,54,null,null
2,Taiwan,c,0,null,23,52,null,null
3,France,c,0,null,26,34,null,null
4,Japan,c,0,null,27,46,null,null
1,Brazil,c,0,20,null,64,71,88
2,Russia,c,0,20,null,62,72,81
3,Poland,c,0,21,null,64,78,78
4,Litva,c,0,22,null,66,75,78
Is there a way to append these files together with NiFi so that a new column gets created (even if I do not know the column name beforehand) whenever a file with additional data is present in the folder?
I tried the MergeContent processor, but by default it just appends the content of all my files together without regard for the headers (every file's header row is appended as well).
What you could do is write a script that combines the rows and columns, and invoke it with the ExecuteStreamCommand processor. That lets you write the custom script in whatever language you want.
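For example, here is a minimal Python sketch of such a merge script (just the merge logic, not the NiFi-specific wiring; the input glob, output path, and the "null" placeholder are assumptions for illustration). It reads every CSV in a folder, builds the union of the headers with the three shared columns first and the numeric columns in sorted order, and writes "null" wherever a file had no value for a column:

import csv
import glob

SHARED = ["AAA", "BBB", "CCC"]   # the three columns every file shares
INPUT_GLOB = "input/*.csv"       # hypothetical input folder
OUTPUT_FILE = "merged.csv"       # hypothetical output file

rows = []
numeric_headers = set()

# Read every file and remember which numeric headers it contained.
for path in glob.glob(INPUT_GLOB):
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        numeric_headers.update(h for h in reader.fieldnames if h not in SHARED)
        rows.extend(reader)

# Union of headers: shared columns first, then the numeric ones in order.
all_headers = SHARED + sorted(numeric_headers, key=int)

# restval="null" fills in "null" for any column a given row does not have.
with open(OUTPUT_FILE, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=all_headers, restval="null")
    writer.writeheader()
    writer.writerows(rows)

The column names never need to be known beforehand; the union of headers is discovered from whatever files are in the folder.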
I have a txt file which I have to insert into a database.
My problem is that in some files the header is "customer_" instead of "customer".
I don't know how to fix this in Pentaho. I've tried the "Select values" step, but I have no idea how it works.
My transformation so far: Get File Names -> CSV file input -> Text file output -> Table output.
You have Metadata Injection capabilities built into Pentaho Data Integration, but just "any" file won't work: you need some kind of logic to determine that "customer_", or whatever you get, maps to the "customer" column in the database.
Once you have the logic to map the possible column-name variations in the source file to the columns in the table, you can inject that metadata into your transformation.
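The mapping logic itself can live wherever you prepare the injection metadata; as a purely illustrative Python sketch (the variant list here is invented), the rule might be as simple as a lookup table plus a fallback that strips a trailing underscore:

# Map known header variants in the incoming file to the database column names.
HEADER_MAP = {
    "customer_": "customer",
    "customer": "customer",
}

def normalize_header(header: str) -> str:
    # Fall back to stripping trailing underscores for unknown variants.
    return HEADER_MAP.get(header, header.rstrip("_"))

print(normalize_header("customer_"))  # -> customer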
I'm trying to build a star schema in Oracle 12c. In my case the data source is not a relational database but a single Excel/CSV file populated via a Google Form, which means I don't have any references from a source system such as auto-incrementing keys/IDs. What would be the best approach to building a star schema given this condition?
File row sample:
<submitted timestamp>,<submitted by user>,<region>,<country>,<branch>,<branch location>,<branch area>,<branch type>,<branch name>,<branch private? yes/no value>,<the following would be all "fact" values (measurements)>,...,...,...
In case I wanted to build a "branch" dimension, how would I handle updates/inserts after the first load into the dimension table?
Thought solution so far:
I had thought of making a concatenated string "key" from the branch values, which would make it unique (underscore would be the "glue" to concatenate the values), e.g.:
<region>_<country>_<branch>_<branch location> as branch_key
I would insert all the distinct branches into a staging table, including the branch_key column for each one of them; then, when loading the dimension, I could check which keys do not yet exist in my dimension table and insert those. As for updates, I'm a bit stuck on how to handle them; I had thought of having another file mapping which branches are active, with an expiration date column. Basically I'm trying to simulate what I could do if the data were in a database instead of CSV files.
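As a rough sketch of that key-building idea in Python (the file name, the column names, and the existing-keys set are assumptions here, not part of the actual source file):

import csv

SOURCE_FILE = "form_responses.csv"   # hypothetical export of the Google Form
KEY_COLUMNS = ["region", "country", "branch", "branch_location"]

# Keys already present in the branch dimension (in practice, loaded with a query).
existing_keys = set()

new_branches = {}
with open(SOURCE_FILE, newline="") as f:
    for row in csv.DictReader(f):
        branch_key = "_".join(row[col] for col in KEY_COLUMNS)
        if branch_key not in existing_keys:
            # Keep one row per distinct key; later rows with the same key are duplicates.
            new_branches.setdefault(branch_key, row)

# new_branches now holds the distinct branches still missing from the dimension.
for branch_key in new_branches:
    print(branch_key)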
This is all I can think of so far; do you have any other recommendations/ideas on how to implement this? Take into consideration that the data source cannot change, as in I have to read these CSV files, since the data is not stored anywhere else.
Thank you.
I am doing a transformation in Pentaho Data Integration and I have a list of files in a directory on my SFTP server. These files are named in the FILE_YYYYMMDDHHIISS.txt format; my directory looks like this:
mydirectory
FILE_20130701090000.txt
FILE_20130701170000.txt
FILE_20130702090000.txt
FILE_20130702170000.txt
FILE_20130703090000.txt
FILE_20130703170000.txt
My problem is that I need to get the latest file in this list according to its creation date, to pass it to another transformation step...
How can I do this in Pentaho Data Integration?
In fact this is quite simple, because your file names sort textually, and the maximum in the sorted list will be your most recent file.
Since the list of files is likely short, you can use a Memory Group by step. A grouping step needs a separate column by which to aggregate. If you only have one column and you want to find the maximum over the entire set, you can add a grouping column with an Add constants step, configured to add a column containing, say, the integer 1 in every row.
Configure your Memory Group by to group on the column of 1s, use the file name column as the subject, and select the Maximum aggregation type. This produces a single row with your grouping column and the aggregate column containing your maximum file name (the original file name field itself is removed).
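Conceptually the whole answer boils down to taking the textual maximum of the file names; a tiny Python illustration of why the lexicographic maximum is also the most recent file for this naming scheme:

# FILE_YYYYMMDDHHIISS.txt sorts lexicographically in chronological order,
# so the plain maximum of the names is the newest file.
files = [
    "FILE_20130701090000.txt",
    "FILE_20130701170000.txt",
    "FILE_20130702090000.txt",
    "FILE_20130703170000.txt",
]
print(max(files))  # FILE_20130703170000.txt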
I have two different pipe-delimited data files. One is larger than the other. I'm trying to selectively remove data from the large file (we'll call it file A), based on the data contained in the small file (file B). File A contains all of the data, and file B contains only a portion of the data from file A.
I want a function or existing program that removes all of the data contained within file B from file A. I had in mind a function like this:
Pseudo-code:
while !eof(fileB) {
    criteria = readLine(fileB);
    lineToRemove = searchForLine(criteria, fileA);
    deleteLine(lineToRemove, fileA);
}
However, that solution seems very inefficient to me. File A has 23,000 lines in it, and file B has 17,000. And the data contained within file B is literally scattered throughout file A.
If there is a program that can do this, I'd prefer it over code. I'm not picky about the code either. C++ is my strong language, but this data file is going to get converted into a SQL database in the near future so I'm good with SQL/PHP code as well.
Load the two files into tables in SQL, whatever the database. Doing this sort of manipulation is what databases are designed for. Then you can execute the command:
delete from A
where A.criteria in (select B.criteria from B);
However, I would put the data into Staging tables, and then create and populate the data that I want in SQL. Something like:
create table A ( . . . );

insert into A
select *
from StagingA
where StagingA.criteria not in (select StagingB.criteria from StagingB);
(Here I've used "*" and an insert without a column list. In practice, you should have the list of columns.)