How to process files with same name in Apache NiFi? - hadoop

I'm learning NiFi and I'm working on a flow where I pick up files using GetFile, do some processing, and then store them in HDFS using the PutHDFS processor. The thing is, I will most probably get files with the same name. For example, a file might be generated every 30 minutes, and each of those files will have the same name.
Now when I put that file into HDFS, I get a "File with the same name already exists" error. How do I overcome this? Is there any way to change the file name on the fly?

It is a very easy one. I just have to use the UpdateAttribute processor to change the file name. For example, you can append a timestamp to the file name.
In the UpdateAttribute processor, add a property named filename and set its value to ${filename}.${now()}
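To make the renaming idea concrete outside of NiFi, here is a minimal Python sketch of the same logic: append a timestamp to the incoming name so that two files with identical names no longer collide. The function name `timestamped_name` and the compact timestamp layout are assumptions made for illustration; inside NiFi a comparable result could be obtained with an expression along the lines of `${filename}.${now():format('yyyyMMddHHmmssSSS')}`.

```python
from datetime import datetime, timezone


def timestamped_name(filename: str, now: datetime) -> str:
    """Return filename with a millisecond-precision timestamp appended.

    Hypothetical helper: the layout yyyyMMddHHmmssSSS is an assumption,
    chosen because it sorts lexically and contains no ':' or spaces.
    """
    stamp = now.strftime("%Y%m%d%H%M%S") + f"{now.microsecond // 1000:03d}"
    return f"{filename}.{stamp}"


if __name__ == "__main__":
    # Two arrivals of the "same" file 30 minutes apart get distinct names.
    first = datetime(2019, 1, 28, 10, 0, 0, tzinfo=timezone.utc)
    second = datetime(2019, 1, 28, 10, 30, 0, tzinfo=timezone.utc)
    print(timestamped_name("data.csv", first))
    print(timestamped_name("data.csv", second))
```

Two files arriving within the same millisecond would still collide, so a flow with very high throughput may need an additional unique component in the name.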

Related

Process latest file from GetS/List SFTP processor

I am getting multiple files from the ListSFTP processor. However, the requirement is to process only the latest file, based on the file's last modification time. I tried merging the files with the MergeContent processor, but the last modification time is lost. The current version of NiFi is 1.6, so a record set writer can't be used. How can a solution for this be implemented?
You can use one of the AttributesTo* processors (such as AttributesToJSON) to create new flow file content from the filename and file.lastModifiedTime attributes. Then you can use MergeContent to get a single flow file that holds both the filename and the modification time of every listed file. From there you should be able to determine the latest file and fetch it.
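The selection step at the end of that answer can be sketched in Python as follows. This is only an illustration of the logic, not a NiFi component: it assumes each listed file is represented as a dictionary holding the `filename` and `file.lastModifiedTime` attributes, and that the timestamp is formatted like `2019-01-28T10:00:00+0000`. Both the record shape and the helper name `latest_file` are assumptions.

```python
from datetime import datetime
from typing import Iterable, Mapping, Optional

# Assumed timestamp layout of the file.lastModifiedTime attribute.
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S%z"


def latest_file(listings: Iterable[Mapping[str, str]]) -> Optional[Mapping[str, str]]:
    """Return the listing with the newest modification time, or None if empty."""
    return max(
        listings,
        key=lambda item: datetime.strptime(item["file.lastModifiedTime"], TIME_FORMAT),
        default=None,
    )


if __name__ == "__main__":
    merged = [
        {"filename": "a.csv", "file.lastModifiedTime": "2019-01-28T10:00:00+0000"},
        {"filename": "b.csv", "file.lastModifiedTime": "2019-01-29T09:30:00+0000"},
    ]
    newest = latest_file(merged)
    print(newest["filename"] if newest else "no files listed")
```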

Need to use 1 Processor instead of 5 FetchHDFS in NiFi

I have 5 XML files in HDFS which I am fetching using Apache NiFi. First, I use a GenerateFlowFile processor, and then I have to use 5 different FetchHDFS processors. I can't use GetHDFS because it deletes all the files from the directory and I don't have permission to ingest the files back. So, instead of using 5 FetchHDFS processors, what else can I do? All the files are in the same directory and I want to keep them so that I can test multiple times.
I am ingesting those files in TransformXML processor and converting them to JSON
Instead of the GetHDFS processor, try the ListHDFS processor: it lists the entire directory and doesn't delete the files. Its description says, "Unlike GetHDFS, this Processor does not delete any data from HDFS."
Thanks everyone for answering. I am unable to vote on anyone's answer, so I am writing up what I did.
First I used the ListHDFS processor, which lists all the filenames.
Then I used FetchHDFS and, in its HDFS Filename property, I put '${path}/${filename}'.
Change ${path} to the path of your directory and leave ${filename} as is, because it is an attribute set by ListHDFS and that is where the filenames are picked up from.
This way there is no need for loops or anything else, and as soon as a new file is uploaded to the directory, it will be picked up by the ListHDFS processor.
So you can simply leave the whole flow running.
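The list-then-fetch pattern described above can be illustrated with a small Python sketch that works on a local directory instead of HDFS. It is a simplified analogy, not how NiFi is implemented: the helper names `list_new_files` and `fetch` are hypothetical, and tracking already-seen names in a set is an assumption that stands in for the state the listing processor keeps between runs.

```python
from pathlib import Path
from typing import List, Set


def list_new_files(directory: Path, seen: Set[str]) -> List[Path]:
    """List files not returned before; nothing is deleted from the directory."""
    new_files = sorted(
        p for p in directory.iterdir() if p.is_file() and p.name not in seen
    )
    seen.update(p.name for p in new_files)
    return new_files


def fetch(path: Path) -> bytes:
    """Read one listed file; the source file stays in place."""
    return path.read_bytes()


if __name__ == "__main__":
    seen: Set[str] = set()
    # Hypothetical directory; each run only reports files added since the last one.
    for path in list_new_files(Path("."), seen):
        print(path.name, len(fetch(path)), "bytes")
```

Because listing and fetching are separate steps, one listing stage can feed a single fetch stage for any number of files, which is why five separate fetch processors are not needed.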

Is it possible to get Nifi to Put to multiple HDFS folders?

I need to stream a bunch of JSON files to NiFi, which will then go to HDFS. NiFi needs to look at the creation date (UNIX format) within the JSON file and then route it to the appropriate HDFS folder. So far I have the processors set up like this:
Consume Kafka -> RouteOnContent (using regex ^"creationDate": \"[0-9]{4}-[0-9]{2}-[0-9]{2}$) -> PutHDFS
There is an HDFS folder for every day, like "2019-01-28", "2019-01-29", "2019-01-30", etc. However, the PutHDFS processor will just output to a single directory, and I obviously don't want to have 365 processors. As far as I know, NiFi doesn't have a way to create HDFS folders dynamically, so is there an elegant way to handle this?
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-hadoop-nar/1.8.0/org.apache.nifi.processors.hadoop.PutHDFS/index.html
There is a Directory property in the PutHDFS processor:
The parent HDFS directory to which files should be written. The directory will be created if it doesn't exist.
Supports Expression Language: true (will be evaluated using flow file attributes and variable registry)
So you can use an expression like ${creationDate} for this property.
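For ${creationDate} to resolve, the date first has to be available as a flow file attribute, for example by extracting it from the JSON content before PutHDFS (EvaluateJsonPath is one processor that can do this). The Python sketch below only illustrates the mapping from a record to its target folder. It assumes a top-level `creationDate` field that starts with a `yyyy-MM-dd` date, as the regex in the question suggests; the base directory `/data/events` and the function name `target_directory` are hypothetical.

```python
import json
import re

# Leading yyyy-MM-dd part of the creation date (assumed field layout).
DATE_PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2}")


def target_directory(record_json: str, base_dir: str = "/data/events") -> str:
    """Build the per-day directory for one JSON record.

    base_dir is a hypothetical parent folder used only for illustration.
    """
    record = json.loads(record_json)
    match = DATE_PREFIX.match(str(record["creationDate"]))
    if match is None:
        raise ValueError("creationDate does not start with yyyy-MM-dd")
    return f"{base_dir}/{match.group(0)}"


if __name__ == "__main__":
    print(target_directory('{"creationDate": "2019-01-28T12:00:00", "id": 1}'))
```

With the date held in an attribute, a single PutHDFS processor can write to every daily folder, since the Directory property is evaluated per flow file.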

Error while adding TimeLine to file in Apache Nifi

I am using HDP 2.5. I am trying to add a timestamp to the name of a file located in HDFS. For that I use GetHDFS->UpdateAttribute->PutHDFS.
First I get the file from HDFS through the GetHDFS processor, and then I change the file name in UpdateAttribute by adding a filename property with the value "${filename}.${now():format("yyyy-MM-dd-HH:mm:ss.SSS'z'")}". Finally I put the file back into HDFS. At this stage I have one issue: if the destination folder (in HDFS) contains a file which already has a timestamp, then once I run the flow, two or more timestamps end up in the name of the same file.
[Screenshot: file that already contains a timestamp]
[Screenshot: after the NiFi flow, the file contains two timestamps]
Can anyone tell me how to resolve this issue?
If you don't want to change your current workflow, the best option is probably to use the "File filter" property in the GetHDFS processor to only get files that do not contain the date in the filename (assuming your files follow some naming convention). Another option is to send the renamed files to another directory.
As a general comment, I'd recommend using the combination of ListHDFS and FetchHDFS processors as it is a more efficient pattern when working with a NiFi cluster. You could then use a RouteOnAttribute in the middle to do some more advanced filtering than the "File filter" option.
Another comment: your approach is not the most performant one, as you are downloading the data from HDFS and then uploading it back. A rename/move operation in HDFS would probably be cleaner (or having correct naming in the first place). You could use the WebHDFS interface to perform the renaming, using the InvokeHTTP processor in NiFi in combination with the ListHDFS processor.
You can use Expression Language to delete the previous timestamp and then add the current timestamp. There are several string functions, such as substringBefore or substringAfter, that you can use depending on the logic of your file names.
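The "delete the previous timestamp, then add the current one" logic can be sketched in Python as below. This is an illustration of the string handling only, under the assumption that earlier timestamps were appended as a suffix in the question's layout (`.yyyy-MM-dd-HH:mm:ss.SSSz`); the function name `restamp` is hypothetical.

```python
import re
from datetime import datetime

# One or more trailing timestamps in the assumed layout, e.g.
# ".2017-01-28-10:00:00.000z", anchored at the end of the name.
OLD_STAMPS = re.compile(r"(?:\.\d{4}-\d{2}-\d{2}-\d{2}:\d{2}:\d{2}\.\d{3}z)+$")


def restamp(filename: str, now: datetime) -> str:
    """Strip any existing trailing timestamps, then append the current one."""
    base = OLD_STAMPS.sub("", filename)
    stamp = now.strftime("%Y-%m-%d-%H:%M:%S.") + f"{now.microsecond // 1000:03d}z"
    return f"{base}.{stamp}"


if __name__ == "__main__":
    now = datetime(2017, 1, 29, 8, 15, 30, 5000)
    print(restamp("report.txt", now))
    print(restamp("report.txt.2017-01-28-10:00:00.000z", now))
```

Both calls in the usage example produce a name with exactly one timestamp, so running the flow repeatedly no longer stacks suffixes.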

How to append current date to property file value every day in Unix?

I've got a property file which is read several times per day by an external application in order to process some files. One of the properties tells the app where to store the processed files. Application runs on Linux.
success_path=/u02/oapp/success
The problem is that every day several files are dropped into that path, and after several months I would have thousands of files in this flat folder.
Question: How can I append the current date to this property file so it would look like:
success_path=/u02/oapp/success/dd-MMM-yyyy
This would be updated every day at 12:00AM so for example today it would be
success_path=/u02/oapp/success/28-JAN-2017
The file is /u02/oapp/configuration/oapp.properties
Thanks in advance
Instead of appending current date to the property, add additional logic to the code that stores the processed files so that:
it takes the base directory from the property file (success_path in your case)
it creates a year/month/day directory to store the files
Something like:
/u02/oapp/success/year/month/day (as in `/u02/oapp/success/2017/01/01`)
or
/u02/oapp/success/yearmonth/day (as in `/u02/oapp/success/201701/01`)
or
/u02/oapp/success/yearmonthday (as in `/u02/oapp/success/20170101`)
If you don't have the capability to change the app's behavior, you might need to write a cron job that periodically moves the files external to the app.
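A job of that kind could look roughly like the following Python sketch, which moves every file sitting directly in the success directory into a year/month/day subdirectory. Choosing the folder from each file's modification time is an assumption, as is the function name `archive_by_day`; the script would be scheduled externally, for example from cron.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import List


def archive_by_day(success_dir: Path) -> List[Path]:
    """Move top-level files into success_dir/YYYY/MM/DD based on their mtime."""
    moved = []
    for item in sorted(success_dir.iterdir()):
        if not item.is_file():
            continue  # leave existing date directories untouched
        day = datetime.fromtimestamp(item.stat().st_mtime)
        target_dir = success_dir / day.strftime("%Y/%m/%d")
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / item.name
        shutil.move(str(item), str(target))
        moved.append(target)
    return moved


if __name__ == "__main__":
    # Path taken from the question; the application keeps writing here.
    for path in archive_by_day(Path("/u02/oapp/success")):
        print("moved", path)
```

The property file stays unchanged with this approach, so the application never sees a path that changes at midnight.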
jq -Rr 'select(startswith("success_path="))="success_path=/u02/oapp/success/"+(now|strflocaltime("%d-%b-%Y")|ascii_upcase)' /u02/oapp/configuration/oapp.properties | sponge /u02/oapp/configuration/oapp.properties
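For readers who prefer not to depend on jq and sponge, the same rewrite of the success_path line can be sketched in Python. The function name `with_dated_success_path` is hypothetical, and the sketch assumes the default C locale so that `%b` yields English month abbreviations.

```python
from datetime import date


def with_dated_success_path(
    properties_text: str, today: date, base: str = "/u02/oapp/success"
) -> str:
    """Return the properties text with success_path pointing at base/dd-MMM-yyyy."""
    # %b is locale dependent; upper() turns "Jan" into "JAN".
    stamp = today.strftime("%d-%b-%Y").upper()
    lines = [
        f"success_path={base}/{stamp}" if line.startswith("success_path=") else line
        for line in properties_text.splitlines()
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = "input_path=/u02/oapp/in\nsuccess_path=/u02/oapp/success\n"
    print(with_dated_success_path(sample, date(2017, 1, 28)), end="")
```

Writing the result back to /u02/oapp/configuration/oapp.properties once a day at midnight would reproduce the behaviour of the one-liner above.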
