Need to use 1 Processor instead of 5 FetchHDFS in NiFi - apache-nifi

I have 5 XML files in HDFS which I am fetching using Apache NiFi. First, I use a GenerateFlowFile processor, and then I have to use 5 different FetchHDFS processors. I can't use GetHDFS because it deletes all the files from the directory and I don't have permission to ingest the files back. So instead of using 5 FetchHDFS processors, what else can I do? All the files are in the same directory and I want to keep them so that I can test multiple times.
I am feeding those files into a TransformXml processor and converting them to JSON.

Instead of the GetHDFS processor, try the ListHDFS processor: it lists the entire directory and doesn't delete the files. The ListHDFS description says, "Unlike GetHDFS, this Processor does not delete any data from HDFS."

Thanks everyone for answering. I am unable to vote on anyone's answer, so I am writing up what I did.
First I used the ListHDFS processor, which lists all the filenames.
Then I used FetchHDFS and, in its HDFS filename property, I put '${path}/${filename}'.
Change ${path} to the path of your directory and leave ${filename} as is: it is an attribute set by ListHDFS, and that is where the filenames are picked up from.
This way there is no need for loops or anything, and as soon as a new file is uploaded to the directory, it will be picked up by the ListHDFS processor.
So, leave the entire flow running.
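To make the substitution step concrete, here is a minimal sketch in Python of how one expression such as '${path}/${filename}' resolves to a different HDFS path for every listed file. It is only an illustration of the idea, not NiFi code: the directory `/data/xml`, the file names, and the `resolve` helper are all hypothetical, and Python's `string.Template` merely happens to share the `${name}` placeholder syntax.

```python
from string import Template

# Hypothetical directory; in the flow this is the HDFS directory being listed.
HDFS_DIR = "/data/xml"

# Attributes as a listing step might attach them, one dict per listed file.
listed = [
    {"path": HDFS_DIR, "filename": "a.xml"},
    {"path": HDFS_DIR, "filename": "b.xml"},
]


def resolve(expression, attributes):
    """Replace ${name} placeholders with the matching attribute values."""
    return Template(expression).substitute(attributes)


# One expression, evaluated once per flowfile, yields one path per file.
resolved = [resolve("${path}/${filename}", attrs) for attrs in listed]
print(resolved)
```

Because the expression is evaluated per flowfile, a single fetch step covers any number of files in the directory.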

Related

Wait for all the files arrival-NiFi

Is there any processor in NiFi that waits for all the files to arrive and then puts those files into HDFS?
For example:
If a total of 5 files are to be fetched using SFTP but only 3 have been received, I want NiFi to wait until all 5 files have arrived and then put those 5 files into HDFS using PutHDFS.
Thank you for your answers.
The issue is, how do you know all files have arrived? Is it always a static 5 files?
If it is absolutely always 5 files, then just use a MergeContent with both the Minimum and Maximum Number of Entries set to 5. This means that the files will wait until there are exactly 5 of them ready to be merged.
But this is very inflexible to change.
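The fixed-count behaviour described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how MergeContent is implemented: the `FixedCountBin` class and the file names are hypothetical, and the sketch only shows why nothing is released until the fifth file arrives.

```python
class FixedCountBin:
    """Hold items back until exactly `size` of them have arrived."""

    def __init__(self, size):
        self.size = size
        self.pending = []

    def offer(self, item):
        """Add an item; return the full batch once the bin is complete, else None."""
        self.pending.append(item)
        if len(self.pending) < self.size:
            return None
        batch, self.pending = self.pending, []
        return batch


bin5 = FixedCountBin(5)
# The first four arrivals release nothing; the fifth releases all five at once.
results = [bin5.offer(f"file{i}.csv") for i in range(1, 6)]
print(results)
```

The inflexibility is visible here too: if only four files ever arrive, the batch is never released unless some timeout is added.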
Why do you need to wait for all 5 files before you put them into HDFS?
Are you trying to prevent a small files problem?
If so, you don't need to wait for all 5 files; just use a merge and set a minimum size so that files are bucketed up to that minimum, with a worst-case timeout.
Alternatively, PutHDFS has a Conflict Resolution Strategy property which can be set to append as long as the filename is the same: you can use UpdateAttribute to set the filename to the same name, and then append the files whenever they arrive.
You can use List* processors with a Record Writer and use a MergeRecord processor to wait for a specific number of files.
Use a ListSFTP processor. Set its Record Writer property; you can use any writer.
Connect the success relationship to a MergeRecord processor with the minimum and maximum bin sizes set to the number of files you want to wait for.
Now the merged relationship will have a single flowfile containing the file listing. Split it into individual entries and process them.
Have a look at the Additional Details of the ListSFTP processor. It describes how you can wait for a batch to complete.

Is it possible to get Nifi to Put to multiple HDFS folders?

I need to stream a bunch of json files to Nifi, which will then go to HDFS. Nifi needs to look at the creation date (UNIX format) within the json file and then route it to the appropriate HDFS folder. So far I have the processors set up like this:
Consume Kafka -> RouteOnContent (using regex ^"creationDate": \"[0-9]{4}-[0-9]{2}-[0-9]{2}$) -> PutHDFS
There is an HDFS folder for every day, like "2019-01-28", "2019-01-29", "2019-01-30" etc. However, the "PutHDFS" processor will just output to a single directory and I obviously don't want to have 365 processors. And as far as I know, Nifi doesn't have a way to create HDFS folders dynamically so is there an elegant way to handle this?
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-hadoop-nar/1.8.0/org.apache.nifi.processors.hadoop.PutHDFS/index.html
There is a Directory property in the PutHDFS processor:
The parent HDFS directory to which files should be written. The directory will be created if it doesn't exist.
Supports Expression Language: true (will be evaluated using flow file attributes and variable registry)
So you can use an expression like ${creationDate} for this property.
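For ${creationDate} to work, the value has to exist as an attribute on the flowfile first (for example, extracted from the JSON with a processor such as EvaluateJsonPath). The logic of deriving a per-day directory from the record can be sketched in Python as follows. This is an illustration only: the base directory `/data/events`, the sample message, and the assumption that `creationDate` holds epoch seconds interpreted in UTC are all hypothetical.

```python
import json
from datetime import datetime, timezone

# Hypothetical parent directory under which the per-day folders live.
BASE_DIR = "/data/events"


def target_directory(message: str) -> str:
    """Derive the per-day directory from the record's creationDate field.

    Assumes creationDate is a UNIX timestamp in seconds, read as UTC.
    """
    record = json.loads(message)
    created = datetime.fromtimestamp(int(record["creationDate"]), tz=timezone.utc)
    return f"{BASE_DIR}/{created:%Y-%m-%d}"


example = target_directory('{"id": 1, "creationDate": 1548633600}')
print(example)
```

Since the directory is computed per record, one PutHDFS processor can serve every day, and missing directories are created on write as the property description states.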

Error while adding TimeLine to file in Apache Nifi

I am using HDP 2.5. I am trying to add a timestamp to the name of a file which is located in HDFS. For that I use GetHDFS->UpdateAttribute->PutHDFS.
First I get the file from HDFS through the GetHDFS processor, and then I change the filename in UpdateAttribute by adding the property "${filename}.${now():format("yyyy-MM-dd-HH:mm:ss.SSS'z'")}". Finally I put the file in HDFS. At this stage I have one issue: if the destination folder (in HDFS) contains a file which already has a timestamp, then once I run the flow the result is two or more timestamps on the same file.
File which already contains a timestamp
After the NiFi flow, the file contains two timestamps
Can anyone tell me how to resolve this issue?
If you don't want to change your current workflow, the best option is probably to use the "File filter" property in the GetHDFS processor to only get files that do not contain the date in the filename (assuming your files have some naming convention). Another option is to send the renamed files to another directory.
As a general comment, I'd recommend using the combination of ListHDFS and FetchHDFS processors as it is a more efficient pattern when working with a NiFi cluster. You could then use a RouteOnAttribute in the middle to do some more advanced filtering than the "File filter" option.
Another comment: your approach is not the most performant one, as you are downloading the data from HDFS and then uploading it back. A rename/move operation in HDFS would probably be cleaner (or having a correct naming in the first place). You could use the WebHDFS interface to perform the renaming, using the InvokeHTTP processor in NiFi in combination with the ListHDFS processor.
You can use Expression Language to remove the previous timestamp and then add the current one. There are several string functions, such as substringBefore or substringAfter, that you can use depending on the logic of your file names.
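The strip-then-append logic can be sketched in Python to show what the expression needs to achieve. This is an illustration under assumptions, not Expression Language: the `restamp` helper and the sample name `report.csv` are hypothetical, and the pattern assumes the suffix always has the shape produced by the format string in the question (`yyyy-MM-dd-HH:mm:ss.SSS` followed by a literal `z`).

```python
import re
from datetime import datetime

# Matches a trailing ".yyyy-MM-dd-HH:mm:ss.SSSz" suffix, as in the question's format.
STAMP = re.compile(r"\.\d{4}-\d{2}-\d{2}-\d{2}:\d{2}:\d{2}\.\d{3}z$")


def restamp(filename, now):
    """Drop an existing timestamp suffix, if any, then append a fresh one."""
    base = STAMP.sub("", filename)
    stamp = now.strftime("%Y-%m-%d-%H:%M:%S") + f".{now.microsecond // 1000:03d}z"
    return f"{base}.{stamp}"


when = datetime(2017, 3, 1, 12, 0, 0, 5000)  # fixed time so the output is stable
first = restamp("report.csv", when)
second = restamp(first, when)  # running it again does not stack a second stamp
print(first, second)
```

Running the rename twice yields the same single-stamped name, which is the property the original flow was missing.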

Hadoop streaming: single file or multi file per map. Don't Split

I have a lot of zip files that need to be processed by a C++ library. So I use C++ to write my hadoop streaming program. The program will read a zip file, unzip it, and process the extracted data.
My problem is this:
My mapper can't get the content of exactly one file. It usually gets something like 2.4 files or 3.2 files. Hadoop sends several files to my mapper, but at least one of them is partial. As you know, zip files can't be processed like this.
Can I get exactly one file per map? I don't want to use a file list as input and read the files from my program, because I want to have the advantage of data locality.
I can accept the contents of multiple zip files per map if Hadoop doesn't split the zip files. I mean exactly 1, 2, 3 files, not something like 2.3 files. Actually that would be even better, because my program needs to load a data file of about 800MB for processing the unzipped data. Can we do this?
You can find the solution here:
http://wiki.apache.org/hadoop/FAQ#How_do_I_get_each_of_a_job.27s_maps_to_work_on_one_complete_input-file_and_not_allow_the_framework_to_split-up_the_files.3F
The easiest way I would suggest is to set mapred.min.split.size to a large value so that your files do not get split.
If this does not work then you would need to implement an InputFormat which is not very difficult to do and you can find the steps at: http://developer.yahoo.com/hadoop/tutorial/module5.html#fileformat
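To see why a large minimum split size keeps a file whole, here is a rough Python sketch of the split-size rule used by Hadoop's file-based input formats, where the split size is the block size clamped between the configured minimum and maximum. Treat it as an approximation under assumptions: the function names are hypothetical, the file and block sizes are example values, and the real split computation has extra details (such as a small slop factor for the last split) that are ignored here.

```python
import math

MB = 1024 * 1024


def split_size(block_size, min_size, max_size):
    """Block size clamped between the configured minimum and maximum split size."""
    return max(min_size, min(max_size, block_size))


def estimated_splits(file_size, block_size, min_size=1, max_size=2**63 - 1):
    """Rough number of splits for one file (ignores the last-split slop factor)."""
    return max(1, math.ceil(file_size / split_size(block_size, min_size, max_size)))


# A 300 MB zip file on 128 MB blocks is cut into several pieces by default ...
default_splits = estimated_splits(300 * MB, 128 * MB)
# ... but stays in one piece once the minimum split size exceeds the file size.
forced_single = estimated_splits(300 * MB, 128 * MB, min_size=1024 * MB)
print(default_splits, forced_single)
```

The trade-off is that a mapper handling an unsplit multi-block file reads some blocks remotely, so part of the data-locality benefit is lost for large files.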
Rather than depending on the min split size, I would suggest an easier way: gzip your files.
You can compress files using gzip:
http://www.gzip.org/
If you are on Linux, you can compress the extracted data with
gzip -r /path/to/data
Now that you have this, pass the compressed data as the input to your Hadoop streaming job.

atomic hadoop fs move

While building an infrastructure for one of my current projects I've faced the problem of replacement of already existing HDFS files. More precisely, I want to do the following:
We have a few machines (log-servers) which are continuously generating logs. We have a dedicated machine (log-preprocessor) which is responsible for receiving log chunks (each chunk is about 30 minutes in length and 500-800 mb in size) from log-servers, preprocessing them and uploading to HDFS of our Hadoop-cluster.
Preprocessing is done in 3 steps:
1. For each log-server: filter (in parallel) the received log chunk (the output file is about 60-80 MB).
2. Combine (merge-sort) all output files from step 1 and do some minor filtering (additionally, 30-minute files are combined into 1-hour files).
3. Using the current mapping from an external DB, process the file from step 2 to obtain the final logfile, and put this file into HDFS.
Final logfiles are to be used as input for several periodic Hadoop applications which run on a Hadoop cluster. In HDFS, logfiles are stored as follows:
hdfs:/spool/.../logs/YYYY-MM-DD.HH.MM.log
Problem description:
The mapping which is used on step 3 changes over time and we need to reflect these changes by recalculating step3 and replacing old HDFS files with new ones. This update is performed with some periodicity (e.g. every 10-15 minutes) at least for last 12 hours. Please note that, if the mapping has changed, the result of applying step3 on the same input file may be significantly different (it will not be just a superset/subset of previous result). So we need to overwrite existing files in HDFS.
However, we can't just do hadoop fs -rm and then hadoop fs -copyFromLocal, because if some Hadoop application is using the file while it is temporarily removed, the app may fail. The solution I use: put the new file next to the old one, so that the files have the same name but different suffixes denoting the file's version. Now the layout is the following:
hdfs:/spool/.../logs/2012-09-26.09.00.log.v1
hdfs:/spool/.../logs/2012-09-26.09.00.log.v2
hdfs:/spool/.../logs/2012-09-26.09.00.log.v3
hdfs:/spool/.../logs/2012-09-26.10.00.log.v1
hdfs:/spool/.../logs/2012-09-26.10.00.log.v2
Any Hadoop application, during its start (setup), chooses the files with the most up-to-date versions and works with them. So even if some update is going on, the application will not experience any problems because no input file is removed.
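The selection step an application performs at setup could look roughly like the Python sketch below. It is a minimal illustration of the versioning scheme, not the code actually in use: the `latest_versions` helper is hypothetical, bare file names stand in for the full HDFS paths, and the `.tmp` entry is an assumed convention for a file that is still being uploaded.

```python
import re

# "<name>.log.v<N>"; anything else (for example a ".tmp" upload) is ignored.
VERSIONED = re.compile(r"^(?P<base>.+\.log)\.v(?P<version>\d+)$")


def latest_versions(names):
    """Return, for every logfile, the name carrying the highest version number."""
    best = {}
    for name in names:
        match = VERSIONED.match(name)
        if match is None:
            continue  # not a finished, versioned logfile
        base, version = match["base"], int(match["version"])
        if version > best.get(base, (0, ""))[0]:
            best[base] = (version, name)
    return sorted(name for _, name in best.values())


listing = [
    "2012-09-26.09.00.log.v1",
    "2012-09-26.09.00.log.v2",
    "2012-09-26.09.00.log.v3",
    "2012-09-26.10.00.log.v1",
    "2012-09-26.10.00.log.v2",
    "2012-09-26.10.00.log.v3.tmp",  # hypothetical in-progress upload, skipped
]
chosen = latest_versions(listing)
print(chosen)
```

Versions are compared as integers, so v10 correctly outranks v9, which a plain string sort would get wrong.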
Questions:
Do you know some easier approach to this problem which does not use this complicated/ugly file versioning?
Some applications may start using an HDFS file which is still being uploaded (applications see this file in HDFS but don't know whether it is complete). In the case of gzip files this may lead to failed mappers. Could you please advise how I could handle this issue? I know that for local file systems I can do something like:
cp infile /finaldir/outfile.tmp && mv /finaldir/outfile.tmp /finaldir/outfile
This works because mv is an atomic operation; however, I'm not sure that this is the case for HDFS. Could you please advise whether HDFS has some atomic operation like mv in conventional local file systems?
Thanks in advance!
IMO, the file rename approach is absolutely fine to go with.
HDFS, up to 1.x, lacks atomic renames (they are dirty updates IIRC), but the operation has usually been considered 'atomic-like' and has never caused problems in the specific scenario you have in mind here. You can rely on this without worrying about a partial state, since the source file is already created and closed.
HDFS 2.x onwards supports proper atomic renames (via a new API call) that has replaced the earlier version's dirty one. It is also the default behavior of rename if you use the FileContext APIs.
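The write-then-rename pattern from the question can be sketched on a local file system in Python as follows. This is only the local analogue, under stated assumptions: the `publish` helper, the `.tmp` suffix, and the temporary directory are hypothetical, and `os.replace` is an atomic rename on a local POSIX file system; the sketch makes no claim about HDFS semantics beyond what is said above.

```python
import os
import tempfile


def publish(data: bytes, final_path: str) -> None:
    """Write to a temporary sibling file, then rename it over the final name."""
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "wb") as handle:
        handle.write(data)
        handle.flush()
        os.fsync(handle.fileno())  # make sure the bytes are on disk before renaming
    os.replace(tmp_path, final_path)  # readers see either the old or the new file


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "outfile")
publish(b"v1", target)
publish(b"v2", target)  # overwrites without ever exposing a half-written file

with open(target, "rb") as handle:
    published = handle.read()
leftovers = sorted(os.listdir(workdir))
print(published, leftovers)
```

The key point is that the file only appears under its final name after it has been fully written and closed, which matches the condition the answer relies on.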
