Apache NiFi: extract only the filename from GetFile

Below is a simple NiFi flow that monitors a folder for files and copies them to a different folder. It works fine, but I'm looking for a processor that extracts only the filename and writes it to a text file.
I tried the ExtractText processor but could not figure out how to configure it to read only the filename. Any advice is highly appreciated.

If I understand your use case correctly, you should be able to use ListFile -> ReplaceText -> UpdateAttribute -> PutFile.
ListFile will generate a flow file for each file it finds in the directory. The flow file will not have any content; ListFile only records the filename in an attribute. You can then use ReplaceText to replace the entire text (i.e. the flow file contents) with ${filename}. UpdateAttribute would be used to change the filename attribute to whatever you want the destination text file to be called, for use in PutFile.
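To make the data movement in that chain concrete, here is a minimal Python sketch that imitates the four steps outside of NiFi. It is not NiFi code: the FlowFile class, the function names, and the ".name.txt" suffix chosen for the destination file are all assumptions made for illustration.

```python
import os
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Hypothetical stand-in for a NiFi flow file: attributes plus content."""
    attributes: dict = field(default_factory=dict)
    content: str = ""


def list_file(directory):
    """Like ListFile: one content-less flow file per file, name kept in 'filename'."""
    return [
        FlowFile({"filename": name})
        for name in sorted(os.listdir(directory))
        if os.path.isfile(os.path.join(directory, name))
    ]


def replace_text(flow_file):
    """Like ReplaceText with ${filename}: the content becomes the filename."""
    flow_file.content = flow_file.attributes["filename"]
    return flow_file


def update_attribute(flow_file):
    """Like UpdateAttribute: pick the destination name (suffix is an assumption)."""
    flow_file.attributes["filename"] += ".name.txt"
    return flow_file


def put_file(flow_file, out_dir):
    """Like PutFile: write the content under the current 'filename' attribute."""
    target = os.path.join(out_dir, flow_file.attributes["filename"])
    with open(target, "w", encoding="utf-8") as handle:
        handle.write(flow_file.content)
    return target


def run_flow(in_dir, out_dir):
    """Chain the four steps for every file found in in_dir."""
    return [
        put_file(update_attribute(replace_text(ff)), out_dir)
        for ff in list_file(in_dir)
    ]
```

Calling run_flow("input_dir", "output_dir") with two existing directories writes one small text file per input file, each containing only the original filename.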

Related

Write header line for empty CSVs using Apache NiFi's "CSVRecordSetWriter" Controller and "ConvertRecord" Processor

I'm using NiFi 1.11.4 to read CSV files from an SFTP, do a few transformations and then drop them off on GCS. Some of the files contain no content, only a header line. During my transformations I convert the files to the AVRO format, but when converting back to CSV no file output is produced for the files where the content is empty.
I have the following settings for the Processor:
And for the Controller:
I did find the following topic: How to use ConvertRecord and CSVRecordSetWriter to output header (with no data) in Apache NiFi? The comments there state explicitly that ConvertRecord should cover this since 1.8. Either I misunderstood that, it does not work, or my setup is wrong.
While I could make it work by explicitly writing the schema as a line to empty files, I wanted to know whether there is a more elegant way.

How can I add the path of the file as attributes/fields in a flowfile after parsing?

I have a GetFile processor feeding logs into an ExtractGrok processor. Unfortunately, the log files all share the same name, request.log, and only the path to them differs, for example /var/log/server1/account1/request.log or /var/log/server1/account2/request.log. What I'd like to do is capture the server1 and account1 segments of the path and store them as fields "host" and "account" so I can further partition the records.
I imagine this will look something like GetFile ---> ExtractGrok and then possibly PartitionRecord, but this is where I'm stuck. PartitionRecord would require the directory path to be part of the attributes, so I can't include it. I guess I would need to somehow extract the file path before PartitionRecord? I'm just not sure exactly where to do this.
GetFile should already add the filename and path attributes to the FlowFiles (see the documentation). You could use UpdateAttribute to get the values "server1" and "account1" out of the path attribute using Expression Language (see the getDelimitedField function for example).
To add the attribute(s) as fields you can use UpdateRecord with a GrokReader and then you don't need the ExtractGrok processor.
It doesn't sound like you need PartitionRecord as I'm presuming each FlowFile's records contain the same value for "host" and/or "account". If that's the case, and you don't need the "host" and "account" fields for anything else, you probably don't need all of the above components and could use UpdateAttribute -> RouteOnAttribute. If that's not the case, then you can use PartitionRecord on whichever other field(s) you want to partition on.
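The getDelimitedField suggestion in the answer amounts to picking path segments by position. The Python sketch below shows that idea only; it is not NiFi Expression Language. The helper get_delimited_field is a hypothetical imitation, the 1-based indexing and the counting of the empty segment before the leading slash are assumptions to verify against the Expression Language guide, and the sample path assumes the attribute holds the absolute directory.

```python
def get_delimited_field(value, index, delimiter="/"):
    """Return the 1-based field of value split on delimiter, or '' if out of range.

    Hypothetical helper that loosely imitates the idea behind getDelimitedField.
    """
    fields = value.split(delimiter)
    if 1 <= index <= len(fields):
        return fields[index - 1]
    return ""


# Assumed attribute value: the absolute directory of the log file.
path = "/var/log/server1/account1/"

# The leading "/" yields an empty first field, so "var" is field 2.
host = get_delimited_field(path, 4)
account = get_delimited_field(path, 5)

print(host)     # server1
print(account)  # account1
```

If the attribute you use holds a path relative to the input directory instead, the field positions shift, so it is worth inspecting a real flow file's attributes before fixing the indexes.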

NiFi ReplaceText Behaving Differently

I have a CSV file that I've introduced into my pipeline (for testing purposes) in two ways. First, I'm using GetFile to read it in from the server file system. Second, I'm using GenerateFlowFile. The content of these files is identical; I copied and pasted the content from the GetFile output to insert as text into GenerateFlowFile. Yet, when I run these through a ReplaceText processor, I am seeing different results.
The file from GenerateFlowFile is working as expected, and the regex string in ReplaceText is being found and replaced with an empty string exactly as I want. But, the file from GetFile is returning a file with no change after running through ReplaceText. How is this possible, and how can I fix this?
I tried to create a reproducible example, but I'm only seeing the issue with my data and can't replicate it with non-PII data. If it makes a difference, the regex used in ReplaceText is ^.*"\(Line.*,\n and the replacement value is an empty string set. Essentially, I want to drop the extraneous first line.
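One thing that may be worth ruling out, offered as an assumption rather than a confirmed diagnosis, is a difference in line endings: a file read from disk can contain \r\n, while text pasted into GenerateFlowFile may contain only \n. The Python sketch below uses the regex from the question on invented sample data to show how a pattern that ends in \n stops matching when the line ends in \r\n, and how \r?\n tolerates both.

```python
import re

# Regex from the question, and a variant that also accepts Windows line endings.
STRICT = re.compile(r'^.*"\(Line.*,\n')
TOLERANT = re.compile(r'^.*"\(Line.*,\r?\n')

# Invented sample data; the real file is not available.
lf_text = 'id,"(Line 1 of report),\nvalue1,value2\n'
crlf_text = lf_text.replace("\n", "\r\n")


def drop_first_line(pattern, text):
    """Remove the first match of pattern, mirroring an empty replacement value."""
    return pattern.sub("", text, count=1)


# With \n endings the strict pattern removes the first line.
lf_result = drop_first_line(STRICT, lf_text)

# With \r\n endings the comma is followed by \r, so the strict pattern never matches.
crlf_strict_result = drop_first_line(STRICT, crlf_text)

# The tolerant pattern handles both cases.
crlf_tolerant_result = drop_first_line(TOLERANT, crlf_text)
```

If the two inputs differ in this way, comparing them byte by byte (for example with a hex dump) would confirm it before the regex is changed.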

MergeRecord - control name of output

I have a fairly simple process that merges XML files into one or more XML files, using the MergeRecord processor. I'm then converting them into JSON and writing them out with PutFile. The files come out with fabulous names like 79f000ec-9da1-4b59-a0a8-79cc3bb5e85a.
Is there any way to control those file names, or at least give them an appropriate extension?
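Before your PutFile, use an UpdateAttribute processor to set the filename attribute (referenced as ${filename}; attribute names are case-sensitive).
For example, adding a property named filename with the value ${filename}.json would append a .json extension.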

NiFi: Routing on File Types, e.g. csv, tsv, xlsx

I have a connected SFTP server, and I am trying to route files based on type: .csv, .tsv, and .xlsx. For now, I'm just uploading test files through the command line.
My flow is:
GetSFTP (with correct hostname, etc.) ->
RouteOnAttribute ->
LogAttribute (will dump elsewhere soon, this is just for testing)
My problem, I think, is that I created a property in RouteOnAttribute incorrectly:
Am I correct in assuming that this does not actually pick up on the .csv because it is not technically part of the filename? What would be the correct expression to route on the file type? Thanks!
You need some information that will tell you the type of file.
GetSFTP should be getting the filename from the file on the sftp server, so if those have the appropriate extensions then I would expect your RouteOnAttribute to work correctly.
If the filename does not have the appropriate extension, then the only thing you can do is try to use IdentifyMimeType to determine what type of file it is, and then route on the mime.type attribute.
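The routing decision described above can be sketched in Python as follows. This is an illustration of the logic only, not NiFi configuration: the relationship names and the case-insensitive comparison are assumptions. In RouteOnAttribute, the comparable check would likely be an expression such as ${filename:endsWith('.csv')}, since the property from the question is not shown here.

```python
# Extension-to-relationship table; the relationship names are hypothetical.
ROUTES = {
    ".csv": "csv",
    ".tsv": "tsv",
    ".xlsx": "xlsx",
}


def route_on_filename(filename):
    """Return the relationship for a filename, or 'unmatched' if none applies."""
    lowered = filename.lower()  # treat .CSV and .csv alike (an assumption)
    for extension, relationship in ROUTES.items():
        if lowered.endswith(extension):
            return relationship
    return "unmatched"


print(route_on_filename("report.csv"))    # csv
print(route_on_filename("data.TSV"))      # tsv
print(route_on_filename("notes.txt"))     # unmatched
```

Files that land in the unmatched branch are the ones for which a content-based check such as IdentifyMimeType would be needed.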
