Wait for all files to arrive in NiFi - apache-nifi

Is there any processor in NiFi that waits for all the files to arrive and then puts those files into HDFS?
For example:
If there are 5 files in total to be fetched using SFTP but we have received only 3, I want NiFi to wait until all 5 files have arrived and then put those 5 files into HDFS using PutHDFS.
Thank you for your answers.

The issue is, how do you know all files have arrived? Is it always a static 5 files?
If it is absolutely always 5 files, then just use MergeContent with both Minimum and Maximum Number of Entries set to 5. This means that all files will wait until there are exactly 5 files ready to be merged.
But this is very inflexible to change.
Why do you need to wait for all 5 files before you put them into HDFS?
Are you trying to prevent a small files problem?
If so, you don't need to wait for all 5 files; just use MergeContent and set a minimum group size to bucket files up to a minimum size, with a worst-case timeout (Max Bin Age).
Alternatively, PutHDFS has a Conflict Resolution Strategy property which can be set to append; as long as the filename is the same, the content is appended. You can just use UpdateAttribute to set every file's filename to the same name, and then append the files whenever they arrive.
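Conceptually, the merge-with-timeout behaviour described above works like the sketch below (plain Python as an illustration of the binning logic, not NiFi's actual implementation; the class and property-like names are just stand-ins):

```python
import time

class Bin:
    """Conceptual model of a MergeContent bin: release when full enough or too old.
    This is an illustration of the binning idea, not NiFi code."""

    def __init__(self, min_entries=5, max_bin_age_secs=300):
        self.min_entries = min_entries            # like Minimum Number of Entries
        self.max_bin_age_secs = max_bin_age_secs  # like Max Bin Age (worst-case timeout)
        self.entries = []
        self.created = time.monotonic()

    def offer(self, flowfile):
        self.entries.append(flowfile)

    def ready(self):
        full = len(self.entries) >= self.min_entries
        expired = time.monotonic() - self.created >= self.max_bin_age_secs
        return full or expired

# Usage: offer incoming files and merge once the bin is ready.
bin_ = Bin(min_entries=5, max_bin_age_secs=300)
for f in ["file1", "file2", "file3", "file4", "file5"]:
    bin_.offer(f)
if bin_.ready():
    merged = "".join(bin_.entries)  # stand-in for the actual merge step
```

The worst-case timeout is what keeps a partial batch from waiting forever if the fifth file never shows up.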

You can use the List* processors with a Record Writer and a MergeRecord processor to wait for a specific number of files.
Use a ListSFTP processor and set its Record Writer property; any record writer will do.
Connect the success relationship to a MergeRecord processor with the Minimum and Maximum Number of Records set to the number of files you want to wait for.
The merged relationship will then carry a single flow file containing the complete file listing. Split it back into individual entries and process them.
Have a look at the Additional Details of the ListSFTP processor; it explains how you can wait for a complete batch before processing.

Related

How to wait until a specific file arrives in a folder before NiFi's ListFile processor lists the entire contents of the folder

I need to move several hundred files from a Windows source folder to a destination folder together in one operation. The files are named sequentially (e.g. part-0001.csv, part-0002.csv), and it is not known what the final file in the sequence will be called. The files will arrive in the source folder over a number of weeks, and it is not ascertainable when the final one will arrive. The users want to use a trigger file (i.e. the arrival of a specifically named file in the folder, e.g. trigger.txt) to cause the flow to start.

My first two thoughts were using a first ListFile processor as an input to a second, or as the input to an ExecuteProcess processor that would call a script to start the second one. However, neither of these processors accepts an input, so I am a bit stumped as to how I might achieve this, or indeed whether it is possible with NiFi. Has anyone encountered this use case, and if so, how did you resolve it?
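One common workaround is to have a small watcher script (run from cron, or from NiFi via ExecuteProcess on a timer) wait for the trigger file and then move the whole batch in one pass. A minimal sketch, assuming the folder paths and trigger filename shown here (adjust them to your environment):

```python
import glob
import os
import shutil
import time

SOURCE = r"C:\data\incoming"   # hypothetical source folder
DEST = r"C:\data\ready"        # hypothetical destination folder
TRIGGER = os.path.join(SOURCE, "trigger.txt")

# Poll until the trigger file shows up.
while not os.path.exists(TRIGGER):
    time.sleep(60)

# Move every sequentially named part file in one operation.
for path in sorted(glob.glob(os.path.join(SOURCE, "part-*.csv"))):
    shutil.move(path, DEST)

os.remove(TRIGGER)  # consume the trigger so the next batch starts clean
```

A ListFile pointed at the destination folder can then pick up the batch, since everything lands there together.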

Need to use 1 Processor instead of 5 FetchHDFS in NiFi

I have 5 XML files in HDFS which I am fetching using Apache NiFi. This is the flow: first, I use a GenerateFlowFile processor, and then I have to use 5 different FetchHDFS processors. I can't use GetHDFS because it deletes all the files from the directory and I don't have permission to ingest the files back. Hence, I am looking for a way to avoid using 5 FetchHDFS processors; what else can I do? All the files are in the same directory, and I want to keep them there so that I can test multiple times.
I am ingesting those files into a TransformXML processor and converting them to JSON.
Instead of the GetHDFS processor, try the ListHDFS processor, as it lists the entire directory and doesn't delete the files. Its description says: "Unlike GetHDFS, this Processor does not delete any data from HDFS."
Thanks everyone for answering. I am unable to vote on anyone's answer, so I am writing up what I did.
First I used the ListHDFS processor, which lists out all the filenames.
Then I used FetchHDFS and set its HDFS Filename property to ${path}/${filename}.
Change ${path} to the path of your directory and leave ${filename} as is; it is an attribute written by ListHDFS, and that is where the filenames come from.
This way there is no need for loops or anything, and as soon as a new file is uploaded to the directory, it will be picked up by the ListHDFS processor.
So just leave the whole flow running.
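Outside NiFi, the same list-then-fetch pattern can be sketched with the standard HDFS CLI (hdfs dfs -ls and -get are real commands; the directory path here is a hypothetical example):

```python
import subprocess

HDFS_DIR = "/data/xml"  # hypothetical directory holding the 5 XML files

# List the directory without deleting anything (analogous to ListHDFS).
listing = subprocess.run(
    ["hdfs", "dfs", "-ls", HDFS_DIR],
    capture_output=True, text=True, check=True,
).stdout

# Entry lines start with a permission string; the path is the last column.
paths = [line.split()[-1] for line in listing.splitlines()
         if line.startswith(("-", "d")) and line.split()[-1].endswith(".xml")]

# Fetch each file locally (analogous to FetchHDFS); the source stays untouched.
for path in paths:
    subprocess.run(["hdfs", "dfs", "-get", path, "."], check=True)
```

The key point in both versions is that listing and fetching are separate steps, so the source files are never deleted.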

How to combine the files in a queue using filename?

I have processed two files (file1 and file2), and the result of processing those 2 files is 1000 flow files, which are queued.
Now I need to combine the flow files using the "filename" attribute.
For example: there are 1000 flow files in the queue (unordered); we need to combine the flow files whose filename is file1 or file2, and then process them based on a FIFO strategy.
In short: combine all flow files based on their filename.
Is this possible in NiFi?
I'm not sure if I fully understand your use case, but check out the MergeContent processor: you could set "filename" as the Correlation Attribute Name property, which should combine all flow files that have the same filename.
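What the Correlation Attribute Name does is, in effect, group flow files by an attribute before binning. Conceptually (plain Python as an illustration, not NiFi code; the flow file dictionaries are stand-ins):

```python
from collections import defaultdict

# Queued flow files, each carrying a "filename" attribute.
flowfiles = [{"filename": "file1", "content": b"..."},
             {"filename": "file2", "content": b"..."},
             {"filename": "file1", "content": b"..."}]

# Correlation Attribute Name = "filename": bin flow files by that attribute.
bins = defaultdict(list)
for ff in flowfiles:
    bins[ff["filename"]].append(ff)

# Each bin is then merged and can be processed in arrival (FIFO) order.
for name, group in bins.items():
    merged = b"".join(ff["content"] for ff in group)
```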
Have you tried the RouteOnAttribute processor?
From what you describe, it feels like it might do the job.

Hadoop streaming with multiple input files

I want to build an inverted index from a set of files with Hadoop using the Streaming API. The documentation always refers to a file whose lines are fed to the mapper as the entries. But in this case I have multiple input files, and I need each mapper to process only one file at a time. Is there a way to accomplish that? For preprocessing reasons the input needs to be like this, and I cannot have the input in the classic line = key, value format that the documentation refers to.
By default a mapper processes only one file, unless you use an input format that allows combining inputs, such as CombineFileInputFormat.
So if you have 10 files you will end up with 10 mappers, and each of them will process only one file. If you are only using mappers (no reducers), that will result in 10 output files (one for each mapper).
On the other hand, if you have big enough splittable files, it is possible that one file will be processed by several mappers at the same time.
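With the Streaming API, the name of the file currently being read is exposed to the mapper through the mapreduce_map_input_file environment variable (older Hadoop versions expose it as map_input_file), so a per-file inverted-index mapper can be sketched like this (the tokenization is deliberately naive and just illustrative):

```python
#!/usr/bin/env python3
# Streaming mapper: emit "term <TAB> filename" pairs for an inverted index.
import os
import sys

# Hadoop Streaming exports job configuration as environment variables,
# with dots replaced by underscores.
input_file = os.environ.get("mapreduce_map_input_file", "unknown")

for line in sys.stdin:
    for term in line.split():
        print(f"{term}\t{input_file}")
```

A reducer would then collect, per term, the set of filenames it appeared in.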

Processing GBs of data in Kafka/Storm

Is it possible to process GBs of data in Kafka/Storm as a single message? The file frequency is 30 minutes.
If that is not possible, can I break the message into chunks of 1 MB each and then process it in Kafka/Storm?
My files are in SEGY format (oil/gas domain) and I will call binary executables (written in C++) through Storm to process them. Can tuples be formed successfully for this file format?
Please help.
Are you sure you want to use Storm for this processing? It seems like a batch application may be more appropriate.
Regardless, you might be able to get it to work, but I would recommend having your spout split the data into more manageable chunks that can be processed by your bolts.
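Kafka brokers default to roughly a 1 MB maximum message size, so sending a multi-GB file as one message is impractical without reconfiguration. A hedged sketch of the 1 MB chunking approach using the kafka-python library (the topic name, key scheme, and input filename are assumptions, not anything prescribed by Kafka or Storm):

```python
from kafka import KafkaProducer

CHUNK_SIZE = 1024 * 1024  # 1 MB, matching the broker's default message limit
TOPIC = "segy-chunks"     # hypothetical topic name

producer = KafkaProducer(bootstrap_servers="localhost:9092")

def send_file(path):
    """Split a large SEGY file into 1 MB chunks keyed by file + sequence number."""
    with open(path, "rb") as f:
        seq = 0
        while chunk := f.read(CHUNK_SIZE):
            # The key lets a downstream Storm bolt reassemble chunks in order.
            producer.send(TOPIC, key=f"{path}:{seq}".encode(), value=chunk)
            seq += 1
    producer.flush()

send_file("survey.segy")  # hypothetical input file
```

A Storm spout consuming this topic would emit one tuple per chunk, and a bolt could buffer chunks by key until the file is complete before invoking the C++ executable.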
