How to combining the files in queue using filename? - apache-nifi

I have process two files(file1 and file2) and result of processing 2 files is 1000 flow files which is queued.
Now i need to combine flow files using "filename" attribute.
For example: 1000 flow files in queues(unordered).we need to combine flow files if filename is file1/file2. And then process it based on FIFO strategy.
combine all flow files based on it's filename.
is it possible in NiFi?

I'm not sure if I fully understand your use case, but check out the MergeContent processor, you could set "filename" for the Correlation Attribute Name property, that should combine together all flow files that have the same filename.

Have you tried to use RouteOnAttribute processor?
From what you describe it feels that this might do the job.

Related

Wait for all the files arrival-NiFi

Is there any processor in Nifi that waits for the all the files to arrive and then put those files into HDFS.
For example:
If there are total 5 files to be fetched using SFTP but we received only 3 files, I want NiFi to wait till 5 files arrived and then put those 5 files into HDFS using PUTHDFS.
Thank you for your anwsers
The issue is, how do you know all files have arrived? Is it always a static 5 files?
If it is absolutely always 5 files, then just use a MergeContent with a Minimum and Maximum Number of Entries set to 5. This means that all files will wait until there are exactly 5 files waiting to be merge.
But this is very inflexible to change.
Why do you need to wait for all 5 files before you put them into HDFS?
Are you trying to prevent a small files problem?
If so, you don't need to wait for all 5 files, just use a Merge and set a minimum file size to bucket files up to a minimum, with a worst-case time out.
Alternatively, the PutHDFS has a Conflict Resolution Strategy property which can be set to append as long as the filename is the same - you can just UpdateAttribute and set the filename to the same name, and then append the files whenever they arrive.
You can use List* processors with a Record Writer and use a MergeRecord processor to wait for a specific number of files.
Use a ListSFTP processor. Set the Record Writer attribute. You can use anyone.
Connect the success to a MergeRecord processor with maximum and minimum bin sizes to set to the number of files you want to wait for.
Now the merge relation will have a single flowfile containing the file listing. Split them to individual files and process them.
Have a look at Additional Details of ListSFTP processor. It details how you can wait for your batch to complete process.

Need to use 1 Processor instead of 5 FetchHDFS in NiFi

I have 5 XML files in HDFS which I am fetching using Apache this is the flow nifi. First, I am using Generate Flow file processor and then I have to use 5 different FetchHdfs processors. I can't use GetHdfs because it deletes all the file from directory and I don't have permission to ingest the files back. Hence, I am searching for a way that instead of using 5 FetchHdfs, what else can I do?. All the files are in the same directory and I want to keep them so that I can test multiple times.
I am ingesting those files in TransformXML processor and converting them to JSON
Instead of the GetHDFS Processor, try the ListHDFS Processor as it lists the entire directory and doesn't delete the files ListHDFS It says in the description, "Unlike GetHDFS, this Processor does not delete any data from HDFS."
Thanks everyone for answering. I am unable to vote anyone's answer and hence I am writing what I did.
First I used the ListHDFS processor and it will list out all the filenames.
Then I used FetchHDFS and in HDFS filename, I put '${path}/${filename}'.
change the ${path} to your path of the directory and leave the ${filename} as is as this is a property of ListHDFS and that's where it is picking the filenames from.
This way, there is no need of loops or anything and as soon as the new file is uploaded in the directory, it will be picked by the ListHDFS processors.
So, leave the entire processes working.

How to implement the equivalent of the Aggregator EIP in Nifi

I'm very experienced with Apache Camel and EIPs and am struggling to understand how to implement equivalents in Nifi. I understand that Nifi uses a different paradigm (flow based programming) but I don't think what I'm trying to do is unreasonable.
In a nutshell I want the contents of each file to be sent to many rest services and I want to aggregate the responses into a single document which will stored in elasticsearch. I might also do some further processing and cleanup to improve what is stored (but this isn't my immediate issue)
The screenshot is a quick mock-up of what I'm trying to achieve but I don't understand enough about Nifi to know how to implement this pattern correctly.
If you are going to take a single piece of data and then fork to multiple parts of the flow and then converge back, there needs to be a way for MergeContent to know which pieces go together.
There are generally two ways this can be done...
The first is using MergeContent in "defragment mode". Think of this as reversing a split operation that was performed by one of the split processors like SplitText. For example, you split a file of 100 lines into 100 flow files of 1 line each, then do some stuff to each one, then want to converge back. The split processors produce a standard set of split attributes (described in the docs of the processors) and the defragment mode knows how to bin the splits accordingly and merge them back together. This probably doesn't apply to your example since you didn't start with a split processor.
The second approach is the "Correlation Attribute" in MergeConent. This tells merge content to only merge flow files together that have the same value for the attribute specified. In your example, when a file gets picked up by GetFile and sent to 3 InvokeHttp processors, there are 3 flow files created, and they all should have their "filename" attribute set to the name of the file picked up from disk. So telling MergeContent to correlate on filename should do the trick, and probably setting the min and max number of entries to the number you expect like 3, and a maximum time in case one of them fails or hangs.

Hadoop streaming with multiple input files

I want to build an inverted index from a set of files with Hadoop using the Streaming API. The documentation always refers to using a file whose lines have the entries to the mapper to be fed. But in this case, I have multiple input files, and I need the mappers to process only one file at a time. Is there a way to accomplish that. For preprocessing reasons, I need the input to be like this, and I cannot have the input in the classic line = key, value format that the documentation refers.
By default a mapper only processes one file, unless you use an input class that allow combine inputs like CombineFileInputFormat.
Then, if you have 10 files you will end with 10 mappers and each of them will process only one file. If you are only using mappers (not reducers) that will end in 10 outputs files (one for each mapper).
In the other side, if you have enough big splittable files, it is possible that one file be processed by several mappers at the same time.

Hadoop streaming: single file or multi file per map. Don't Split

I have a lot of zip files that need to be processed by a C++ library. So I use C++ to write my hadoop streaming program. The program will read a zip file, unzip it, and process the extracted data.
My problem is that:
my mapper can't get the content of exactly one file. It usually gets something like 2.4 files or 3.2 files. Hadoop will send several files to my mapper but at least one of the file is partial. You know zip files can't be processed like this.
Can I get exactly one file per map? I don't want to use file list as input and read it from my program because I want to have the advantage of data locality.
I can accept the contents of multiple zip file per map if Hadoop don't split the zip files. I mean exactly 1, 2, 3 files, not something like 2.3 files. Actually it will be even better because my program need to load about 800MB data file for processing the unziped data. Can we do this?
You can find the solution here:
http://wiki.apache.org/hadoop/FAQ#How_do_I_get_each_of_a_job.27s_maps_to_work_on_one_complete_input-file_and_not_allow_the_framework_to_split-up_the_files.3F
The easiest way I would suggest is to set mapred.min.split.size to a large value so that your files do not get split.
If this does not work then you would need to implement an InputFormat which is not very difficult to do and you can find the steps at: http://developer.yahoo.com/hadoop/tutorial/module5.html#fileformat
Rather then depending on the min split size I would suggest an easier way is to Gzip your files.
There is a way to compress files using gzip
http://www.gzip.org/
If you are on Linux you compress the extracted data with
gzip -r /path/to/data
Now that you have this pass this data as your input in your hadoop streaming job.

Resources