NiFi - How to get listing of directories and organize them by name then obtain files from each directory? - apache-nifi

I'm trying to figure out how to perform the following steps within NiFi:
1. Obtain a listing of directories from a specific location, e.g. /my_src (note that the folders appearing here will be dated, e.g. 20211125).
2. Based on the listing obtained, sort the folders by date.
3. For each folder, GetFile from that directory.
4. Sort those files by their names.
I am stuck at step 1: I cannot find a processor that pulls the directory names. I only see GetFile and ListFile.
The reason for this is that I need to process the folders in order from oldest to newest.
I would expect to be using a regex pattern to locate the valid folders that match the date format and ignore the other folders. Then with those values found pass them along sorted to another process that would get files from that path location, which GetFile does not seem to allow me to set dynamically.
Am I to approach this process differently within NiFi?
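For reference, the ordering described in the steps above can be sketched in plain Python, outside NiFi. This is only an illustration of the intended logic, not a NiFi flow; the function names and the assumption that folder names are exactly eight digits (yyyyMMdd) are mine. With that format, sorting by name is the same as sorting by date.

```python
import re
from pathlib import Path

# Assumed folder-name format: exactly eight digits, e.g. 20211125 (yyyyMMdd).
DATE_DIR = re.compile(r"^\d{8}$")

def dated_dirs(root):
    """Return subdirectories of root whose names match the date pattern, oldest first."""
    dirs = [p for p in Path(root).iterdir() if p.is_dir() and DATE_DIR.match(p.name)]
    # yyyyMMdd names sort chronologically when sorted as plain strings.
    return sorted(dirs, key=lambda p: p.name)

def files_in_order(root):
    """Yield files folder by folder (oldest folder first), sorted by file name."""
    for d in dated_dirs(root):
        files = (p for p in d.iterdir() if p.is_file())
        for f in sorted(files, key=lambda p: p.name):
            yield f
```

Calling files_in_order("/my_src") would skip any folder whose name does not match the pattern and return the remaining files in the order they should be processed.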

Related

extracting .jpeg files from subfolders and putting them in another folder using SSIS

I have a folder that has around 400 subfolders each with ONE .jpeg file in them. I need to get all the pictures into 1 new folder using SSIS, everything is on my local (no connecting through different servers or DBs) just subfolders to one folder so that I can pull out those images without going one by one into each subfolder.
I would create 3 variables, all of type String. CurrentFile, FolderBase, FolderOutput.
FolderBase is going to be where we start searching i.e. C:\ssisdata
FolderOutput is where we are going to move any .jpg files that we find rooted under FolderBase.
Use a Foreach File Enumerator (for a sample, see "How to import text files with the same name and schema but different directories into database?") configured to process subfolders looking for *.jpg. Map the first element on the Variable tab to our CurrentFile variable. Map the Enumerator to start in FolderBase. For extra flexibility, create an additional variable to hold the file mask *.jpg.
Run the package. It should quickly zip through all the folders finding and doing nothing.
Drag and drop a File System Task into the Foreach Enumerator. Make it a Move file (or maybe it's Rename) type. Use a variable for both the source and the destination: the source will be CurrentFile and the destination will be FolderOutput.
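For comparison, the same enumerate-and-move logic can be sketched in Python rather than SSIS. This is a swapped-in illustration, not the SSIS package itself; the function name collect_files is hypothetical, and its parameters simply mirror the FolderBase, FolderOutput and file-mask variables described above. It refuses to overwrite when two subfolders contain a file with the same name.

```python
import shutil
from pathlib import Path

def collect_files(folder_base, folder_output, file_mask="*.jpg"):
    """Move every file matching file_mask under folder_base into folder_output."""
    out = Path(folder_output)
    out.mkdir(parents=True, exist_ok=True)
    moved = []
    # sorted() materializes the listing first, so moving files does not disturb the traversal.
    for src in sorted(Path(folder_base).rglob(file_mask)):
        if not src.is_file() or src.parent == out:
            continue
        dest = out / src.name
        if dest.exists():
            raise FileExistsError(f"{dest} already exists; refusing to overwrite")
        shutil.move(str(src), str(dest))
        moved.append(dest)
    return moved
```

If the pictures actually carry the .jpeg extension mentioned in the question, the mask would need to be "*.jpeg" instead.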

How can I find duplicately named files in Windows?

I am organizing a large Windows folder with many subfolders (with sub folders, etc...), in which files have been saved multiple times in different locations. Can anyone figure out how to identify all files with duplicate names across multiple directories? Some ways I am thinking about include:
A command or series of that could be run in the command line (cmd). Perhaps DIR could be a start...
Possibly a tool that comes with Windows
Possibly a way to specify in search to find duplicate filenames
NOT a separate downloadable tool (those could carry unwanted security risks).
I would like to be able to know the directory paths and filename to the duplicate file(s).
Not yet a full solution, but I think I am on the right track, further comments would be appreciated:
From CMD (start, type cmd):
DIR "C:\mypath" /S > filemap.txt
This should generate a recursive list of files within the directories.
TODO: Find a way to have filenames on the left side of the list
From outside cmd:
Open filemap.txt
Copy and paste the results into Excel
From Excel:
Sort the data
Add in the next column logic to compare to see if the current text = previous text (for filename)
Filter on that row to identify all duplicates
To see where the duplicates are located:
Search filemap.txt for the duplicate filenames identified above and note their directory location.
Note: I plan to update this as I get further along, or if a better solution is found.
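If a scripting language happens to be available on the machine (the question does not say so, and Python does not ship with Windows), the DIR-plus-Excel steps above could be collapsed into one short script. The sketch below is an assumption-laden alternative, not part of the approach above: the function name is hypothetical, and it compares names in lower case because Windows file names are case-insensitive.

```python
import os
from collections import defaultdict

def find_duplicate_names(root):
    """Map each file name that occurs more than once under root to its directories."""
    locations = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Windows file names are case-insensitive, so compare in lower case.
            locations[name.lower()].append(dirpath)
    return {name: sorted(dirs) for name, dirs in locations.items() if len(dirs) > 1}

# Example usage (path taken from the DIR command above):
# for name, dirs in sorted(find_duplicate_names(r"C:\mypath").items()):
#     print(name)
#     for d in dirs:
#         print("    " + d)
```

The result lists each duplicated file name together with every directory it appears in, which covers both the "find duplicates" and the "where are they located" steps.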

pig load udf for loading files from several sub directories

I want to write a custom load udf in pig for loading files from a directory structure.
The directory structure is like an email directory. It has a root directory called maildir. Inside it are the sub-directories of the individual mail holders. Inside every mail holder's directory are several sub-directories such as inbox, sent, trash, etc.
eg: maildir/mailholdername1/inbox/1.txt
maildir/mailholdername2/sent/1.txt
I want to read only inbox files from all mailerholdername sub-directories.
I am not able to understand:
what should be passed to the load udf as parameter
how the entire directory structure should be parsed so that only the respective inbox files are read.
I want to process one file at a time, perform some data extraction, and load it as one record. Hence, if there are 10 files, I get a relation having 10 records.
Further, I want to do some operations on these inbox files and extract some data.
Because you have a defined folder structure that doesn't have variable depth, I think it's as simple as passing the following pattern as your input path:
A = LOAD 'maildir/*/inbox/1.txt' USING PigStorage('\t') AS (f1,f2,f3);
You probably don't need to create your own UDF for this; the built-in PigStorage loader should be able to handle the files, assuming they are in some delimited format (the above example assumes 3 fields, tab-delimited).
If there are multiple txt files in each inbox, use *.txt rather than 1.txt. Finally, if the maildir root directory is not in your user's home directory, you should use the absolute path to the folder, say /data/maildir/*/inbox/*.txt

7zip: In C#, how to add multiple files of the same name in different directories to the same zip file?

I created a C# snippet that calls 7zip (7za) to add a list of files to a zip archive. Problem is multiple files in different directories have the same name, so 7zip either complains about duplicate names or replaces the first file with the second only storing the last added. I cannot recursively scan a directory which would allow duplicates.
Is there a way to force 7zip to store the directory, or in ASP.NET MVC 3 C# to create zip files with duplicate file names when not considering the full path?
The path to our image is the GTIN number broken up by every five digits. The last five are the name of the image.
G:\1234\56789\01234.jpg
G:\4321\09876\01234.jpg
G:\5531\33355\01234.jpg
These would fail to all store in a 7zip archive correctly.
You can use SevenZipSharp: http://sevenzipsharp.codeplex.com/ a wrapper around 7zip. You will have full control from code.
We managed to get multiples in the same archive by creating a file list that doesn't contain leading backslashes, then running the application from the directory containing them:
1234\56789\01234.jpg
4321\09876\01234.jpg
5531\33355\01234.jpg
It solves it for now. Anyone with a better idea?
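The underlying idea, storing each file under its path relative to a common base directory so that equal file names do not collide, can be illustrated with Python's standard zipfile module. This is a swapped-in sketch, not the 7za command line or SevenZipSharp; the function name and parameters are hypothetical.

```python
import os
import zipfile

def zip_with_relative_paths(base_dir, relative_paths, archive_path):
    """Add each file under its path relative to base_dir so equal file names do not collide."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for rel in relative_paths:
            # arcname keeps the directory part inside the archive,
            # e.g. 1234/56789/01234.jpg and 4321/09876/01234.jpg stay distinct.
            zf.write(os.path.join(base_dir, rel), arcname=rel)
```

With base_dir set to the drive root and the relative paths taken from the file list above, all three 01234.jpg files would end up in the archive under different directory entries.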

Arbitrary sort key in filesystem

I have a pet project where I build a text-to-HTML translator. I keep the content and the converted output in a directory tree, mirroring the structure via the filesystem hierarchy. Chapters go into directories and subchapters go into subdirectories. I get the chapter headings from the directory and file names. I want to keep all data in files, with no database or the like.
Kind of a keep-it-simple approach, no need to deal with meta-data.
All works well, except for the sort order of the directories and files to be included. I need sort of an arbitrary key for sorting directories and files in my application. That would determine the order the content goes into the output.
I have two solutions, both not really good:
1) Prepend directories and files with a sort key (e.g. "01_") and strip that in the output files in order not to pollute the output file names. That works badly for directories since they must keep the key data in order not to break the directory structure. That ends with an ugly "01_Introduction"...
2) Put a config file into each directory with information on how to sort the directory content, to be used by my application. That is error-prone and breaks the keep-it-simple, no-metadata approach.
Do you have an idea? What would you do?
If your goal is to effectively avoid metadata, then I'd go with some variation of option 1.
I really do not find 01_Introduction to be ugly at all.
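A minimal sketch of option 1, assuming the sort key is a run of digits followed by an underscore (as in "01_") and that keys are zero-padded to the same width so that plain string sorting gives the intended order. The function names are hypothetical. The key stays in the directory and file names on disk and is stripped only when the heading is generated.

```python
import re

# Assumed key format: one or more digits followed by an underscore, e.g. "01_".
SORT_PREFIX = re.compile(r"^\d+_")

def display_title(name):
    """Drop the leading sort key, so '01_Introduction' becomes 'Introduction'."""
    return SORT_PREFIX.sub("", name, count=1)

def ordered_titles(names):
    """Sort by the full name (including the key), then strip the key for output."""
    return [display_title(n) for n in sorted(names)]
```

For example, ordered_titles(["02_Usage", "01_Introduction", "10_Appendix"]) returns ["Introduction", "Usage", "Appendix"].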
