NiFi - ListS3 listing files with exact path but different dates - apache-nifi

I have to list and collect several specific files in an S3 bucket from different dates.
The path looks like the following:
/path/to/20221201/files/specificfile/
/path/to/20221202/files/specificfile/
The "files" and "specificfile" folders contain several different files, but I am only interested in a specific one from each.
I tried replacing the date with *, thinking it would match any date, but I get no results.
Any suggestions?
Thanks

Related

Append multiple CSVs into one single file with Apache Nifi

I have a folder with CSV files that have the same first 3 columns and different last N columns. N is minimum 2 and up to 11.
Last n columns have number as header, for example:
File 1:
AAA,BBB,CCC,0,10,15
1,India,c,0,28,54
2,Taiwan,c,0,23,52
3,France,c,0,26,34
4,Japan,c,0,27,46
File 2:
AAA,BBB,CCC,0,5,15,30,40
1,Brazil,c,0,20,64,71,88
2,Russia,c,0,20,62,72,81
3,Poland,c,0,21,64,78,78
4,Litva,c,0,22,66,75,78
Desired output:
AAA,BBB,CCC,0,5,10,15,30,40
1,India,c,0,null,28,54,null,null
2,Taiwan,c,0,null,23,52,null,null
3,France,c,0,null,26,34,null,null
4,Japan,c,0,null,27,46,null,null
1,Brazil,c,0,20,null,64,71,88
2,Russia,c,0,20,null,62,72,81
3,Poland,c,0,21,null,64,78,78
4,Litva,c,0,22,null,66,75,78
Is there a way to append these files together with NiFi so that a new column is created (even if I do not know the column name beforehand) whenever a file with additional data is present in the folder?
I tried the MergeContent processor, but by default it just concatenates the content of all my files without regard to headers (every file's header is appended as well).
What you could do is write a script that combines the rows and columns, and run it with the ExecuteStreamCommand processor. This would allow you to write the custom script in whatever language you want.
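As a rough illustration of such a script, here is a minimal Python sketch that merges CSV texts by taking the union of their headers and filling missing cells with "null". The function name merge_csv_texts and its parameters are hypothetical, and it assumes the first three columns are shared and the remaining headers are integers, as in the example files. How the file contents reach the script from NiFi is not shown here.

```python
import csv
import io


def merge_csv_texts(texts, fixed_columns=3, missing="null"):
    """Merge CSV documents that share their first `fixed_columns` headers.

    Assumption: every header after the fixed columns is an integer,
    as in the example files from the question.
    """
    tables = []
    for text in texts:
        rows = list(csv.reader(io.StringIO(text)))
        if rows:
            tables.append((rows[0], rows[1:]))
    if not tables:
        return ""

    fixed = tables[0][0][:fixed_columns]
    # Union of all extra headers, ordered by their numeric value.
    extra = sorted(
        {name for header, _ in tables for name in header[fixed_columns:]},
        key=int,
    )
    merged_header = fixed + extra

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(merged_header)
    for header, rows in tables:
        for row in rows:
            # Map each value to its own header, then emit it in merged order.
            record = dict(zip(header, row))
            writer.writerow([record.get(col, missing) for col in merged_header])
    return out.getvalue()
```

Passing the contents of File 1 and File 2 in that order should give the merged header AAA,BBB,CCC,0,5,10,15,30,40, with "null" wherever a file lacks a column.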

Check the input file names against the file names in the config table

I have a folder which contains many files, and I have a configuration table in a SQL database which contains the list of file names that I need to load to Azure Blob Storage.
I tried getting the file names from the source folder using the 'Get Metadata' activity and then used a Filter activity to filter on the file name, but this way I have to hard-code the file name inside the filter.
Can someone please let me know a way to do this?
Here is an example:
I have the files below in a folder.
And the entries below in the SQL config table.
This is what the sample pipeline looks like.
1. Look up the list of files from the SQL config table and, using a ForEach activity, append each name to an array variable. In my example it is config_files.
2. Using Get Metadata, list the childItems in the folder and append the file names to another variable. In my example it is files.
3. Use a Set Variable activity to store the result, i.e. the files that match the entries in the config table.
Expression: @intersection(variables('files'),variables('config_files'))
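To make the matching logic concrete outside the pipeline, here is a small Python sketch of the same idea: keep only the folder's file names that also appear in the config list. The function name matching_files and the sample file names are hypothetical; this only illustrates the set intersection and is not Data Factory code.

```python
def matching_files(files, config_files):
    """Return the names from `files` that also appear in `config_files`."""
    wanted = set(config_files)
    return [name for name in files if name in wanted]


# Hypothetical sample data standing in for the two pipeline variables.
files = ["a.csv", "b.csv", "c.csv"]
config_files = ["b.csv", "c.csv", "d.csv"]
matches = matching_files(files, config_files)
```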

RStudio: list files doesn't work

I want to list several .tif files in order to calculate a satellite-based index.
I want to use these two commands:
setwd("C:/Satellitendaten/Rohdaten/BT")
list_files <- list.files(getwd(), pattern=".tif$", full.names=FALSE, recursive=FALSE)
But nothing seems to run, and I get no error either. What might be the reason for getting no results? Are the commands correct?

Power Query From Folder as Merge, Not Append

I need to import multiple files from a folder and I need each file's contents to be new columns in the resultant table.
There are multiple examples all over the web of how to include multiple files from a folder as an append (e.g., PowerQuery multiple files and add column) but I need the contents of each file to be merged as new columns in the original table.
Any help will be greatly appreciated.
I came up with my own answer. Once you append the files you can pivot on the file name to turn them into columns.

Pentaho Data Integration (DI) Get Last File in a Directory of a SFTP Server

I am building a transformation in Pentaho Data Integration, and I have a list of files in a directory on my SFTP server. These files are named with the FILE_YYYYMMDDHHIISS.txt format; my directory looks like this:
mydirectory
FILE_20130701090000.txt
FILE_20130701170000.txt
FILE_20130702090000.txt
FILE_20130702170000.txt
FILE_20130703090000.txt
FILE_20130703170000.txt
My problem is that I need to get the last file in this list according to its creation date, so that I can pass it to another transformation step.
How can I do this in Pentaho Data Integration?
In fact this is quite simple because your file names can be sorted textually, and the max in the sort list will be your most recent file.
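As a rough illustration of why this works (plain Python, not a Pentaho step), the sketch below picks the textual maximum of the names. It relies on the timestamp in FILE_YYYYMMDDHHIISS.txt being fixed-width and zero-padded, so that textual order equals chronological order. The function name latest_file is hypothetical.

```python
def latest_file(names):
    """Return the textually greatest file name, or None if the list is empty."""
    return max(names, default=None)


names = [
    "FILE_20130701090000.txt",
    "FILE_20130703170000.txt",
    "FILE_20130702090000.txt",
]
newest = latest_file(names)
```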
Since a list of files is likely short, you can use a Memory Group by step. A grouping step needs a separate column by which to aggregate. If you only have one column and you want to find the max in the entire set, you can add a grouping column with an Add Constants step, and configure it to add a column with, say, an integer 1 in every row.
Configure your Memory Group by to group on the column of 1s, and use the file name column as the subject. Then simply select the Maximum grouping type. This will produce a single row with your grouping column, the file name field removed and the aggregate column containing your max file name. It would look something like this:
