Is it possible to get Nifi to Put to multiple HDFS folders? - hadoop

I need to stream a bunch of JSON files to NiFi, which will then send them on to HDFS. NiFi needs to look at the creation date (UNIX format) within each JSON file and then route it to the appropriate HDFS folder. So far I have the processors set up like this:
ConsumeKafka -> RouteOnContent (using regex ^"creationDate": \"[0-9]{4}-[0-9]{2}-[0-9]{2}$) -> PutHDFS
There is an HDFS folder for every day, like "2019-01-28", "2019-01-29", "2019-01-30", etc. However, the PutHDFS processor only writes to a single directory, and I obviously don't want to have 365 processors. As far as I know, NiFi doesn't have a way to create HDFS folders dynamically, so is there an elegant way to handle this?

https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-hadoop-nar/1.8.0/org.apache.nifi.processors.hadoop.PutHDFS/index.html
There is a Directory parameter in the PutHDFS processor:
The parent HDFS directory to which files should be written. The directory will be created if it doesn't exist.
Supports Expression Language: true (will be evaluated using flow file attributes and variable registry)
So you can use an expression like ${creationDate} for this parameter.
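Since creationDate lives in the JSON body rather than in a flow file attribute, you would first extract it into an attribute before PutHDFS can reference it. A minimal sketch of how that could look (the base path /data/events and the JsonPath are assumptions; adjust them to your schema):

ConsumeKafka -> EvaluateJsonPath -> PutHDFS

EvaluateJsonPath (Destination = flowfile-attribute, dynamic property):
  creationDate = $.creationDate

PutHDFS:
  Directory = /data/events/${creationDate}

If creationDate is an epoch timestamp in seconds rather than a yyyy-MM-dd string, you can format it on the fly, e.g. Directory = /data/events/${creationDate:toNumber():multiply(1000):format('yyyy-MM-dd')}.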

Related

Need to use 1 Processor instead of 5 FetchHDFS in NiFi

I have 5 XML files in HDFS which I am fetching using Apache NiFi. This is the flow: first I use a GenerateFlowFile processor, and then I have to use 5 different FetchHDFS processors. I can't use GetHDFS because it deletes all the files from the directory and I don't have permission to ingest the files back. Hence, I am looking for an alternative to using 5 FetchHDFS processors. All the files are in the same directory, and I want to keep them there so that I can test multiple times.
I am ingesting those files into a TransformXML processor and converting them to JSON.
Instead of the GetHDFS processor, try the ListHDFS processor, as it lists the entire directory and doesn't delete the files. As the ListHDFS description says, "Unlike GetHDFS, this Processor does not delete any data from HDFS."
Thanks everyone for answering. I am unable to vote on anyone's answer, so I am writing up what I did.
First I used the ListHDFS processor, which lists out all the filenames.
Then I used FetchHDFS and, for HDFS Filename, I put ${path}/${filename}.
Change ${path} to the path of your directory and leave ${filename} as is, since that attribute is written by ListHDFS and that is where the filenames come from.
This way there is no need for loops or anything, and as soon as a new file is uploaded to the directory, it will be picked up by the ListHDFS processor.
So the whole flow can just be left running.
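Putting that together, a minimal sketch of the flow (the input directory is a placeholder):

ListHDFS -> FetchHDFS -> TransformXML -> ...

ListHDFS:
  Directory = /data/xml_input

FetchHDFS:
  HDFS Filename = ${path}/${filename}

Note that ListHDFS is a source processor that keeps its own state, so the GenerateFlowFile processor is no longer needed.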

Using Logstash on tarfile to create Elasticsearch pipelines

I periodically receive gzipped tarfiles containing different types of logs I want to load into Elasticsearch. Is Logstash suitable for this use case? The issue I seem to be running into is that even if I can extract the tarfile contents, Logstash requires me to specify absolute file paths whereas my file paths will differ for each tarfile I want to load.
The file input plugin for Logstash is usually used for "active" log files to read and index data in real time.
If the log files you are going to be processing are complete, you don't need to use the file input plugin at all; it's enough to use the stdin input plugin and pipe the contents of the files into the Logstash process.
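A minimal sketch of what that could look like, assuming GNU tar and a pipeline file called pipeline.conf (both names are placeholders):

tar -xzOf logs.tar.gz | logstash -f pipeline.conf

pipeline.conf:

input {
  stdin { }
}
filter {
  # your grok/date filters here
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}

The -O flag tells tar to extract the member contents to stdout, so the concatenated log lines are piped straight into Logstash's stdin input.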

Error while adding TimeLine to file in Apache Nifi

I am using HDP 2.5. I am trying to append a timestamp to the name of a file located in HDFS. For that I use GetHDFS -> UpdateAttribute -> PutHDFS.
First I get the file from HDFS through the GetHDFS processor, then I rename it in UpdateAttribute by setting the filename property to ${filename}.${now():format("yyyy-MM-dd-HH:mm:ss.SSS'z'")}. Finally I put the file back into HDFS.
The issue is that if the destination folder (in HDFS) already contains a file whose name carries a timestamp, then after the flow runs, the same file ends up with two or more timestamps in its name.
Can anyone tell me how to resolve this issue?
If you don't want to change your current workflow, the best option is probably to use the "File filter" property of the GetHDFS processor to only pick up files that do not already contain the date in their filename (assuming your files follow some naming convention). Another option is to send the renamed files to another directory.
As a general comment, I'd recommend using the combination of ListHDFS and FetchHDFS processors as it is a more efficient pattern when working with a NiFi cluster. You could then use a RouteOnAttribute in the middle to do some more advanced filtering than the "File filter" option.
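For example, a single RouteOnAttribute property along these lines (the date pattern is an assumption about your naming convention) could pass along only the files that don't yet carry a timestamp:

files_without_timestamp = ${filename:matches('.*[0-9]{4}-[0-9]{2}-[0-9]{2}.*'):not()}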
Another comment: your approach is not the most performant one, as you are downloading the data from HDFS and then uploading it back. A rename/move operation in HDFS would be cleaner (or getting the naming right in the first place). You could use the WebHDFS interface to perform the renaming with an InvokeHTTP processor in NiFi, in combination with the ListHDFS processor.
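For the rename route, a hedged sketch of the WebHDFS call that an InvokeHTTP processor would issue (namenode host, port, and paths are placeholders):

curl -i -X PUT "http://namenode:50070/webhdfs/v1/data/in/myfile.txt?op=RENAME&destination=/data/in/myfile.txt.2019-01-28"

In InvokeHTTP you would set the HTTP Method to PUT and build the Remote URL with Expression Language from the path and filename attributes provided by ListHDFS.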
You can use Expression Language to delete the previous timestamp and then add the current timestamp. You have several string functions, such as substringBefore or substringAfter, that you can use depending on the logic of your file names.
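A minimal sketch, assuming the original files end in .txt and the timestamp is appended after that extension (adjust the delimiter to your actual naming convention): in UpdateAttribute, set the filename property to

${filename:substringBefore('.txt')}.txt.${now():format("yyyy-MM-dd-HH:mm:ss.SSS'z'")}

so any previous timestamp after .txt is discarded before the current one is appended.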

Write time series data into hdfs partitioned by month and day?

I'm writing a program which saves time series data from Kafka into Hadoop, and I designed the directory structure like this:
event_data
|-2016
  |-01
    |-data01
    |-data02
    |-data03
|-2017
  |-01
    |-data01
Because this is a daemon task, I wrote an LRU-based manager to keep track of the opened files and close inactive ones in time to avoid resource leaks. But since the incoming data stream is not sorted by time, it is very common to reopen an existing file to append new data.
I tried using the FileSystem#append() method to open an OutputStream when the file already existed, but it failed with an error on my HDFS cluster (sorry, I can't provide the specific error here because it was several months ago, and I have since tried another solution).
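Roughly, the append attempt looked like this (a sketch from memory; the path is a placeholder, and append has to be enabled on the cluster, e.g. via dfs.support.append):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendOrCreate {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/event_data/2016/01/data01");  // placeholder path

        String record = "{\"ts\": 1483228800, \"value\": 42}\n";  // example event line

        // Append if the file already exists, otherwise create it.
        try (FSDataOutputStream out = fs.exists(file) ? fs.append(file) : fs.create(file)) {
            out.write(record.getBytes("UTF-8"));
        }
    }
}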
Then I used another way to achieve my goal: adding a sequence suffix to the file name when a file with the same name already exists. Now I have a lot of files in my HDFS, and it looks very messy.
My question is: what is the best practice for this situation?
Sorry that this is not a direct answer to your programming problem, but if you're open to options other than implementing it yourself, I'd like to share our experience with fluentd and its HDFS (WebHDFS) output plugin.
Fluentd is an open-source, pluggable data collector with which you can build your data pipeline easily. It reads data from inputs, processes it, and then writes it to the specified outputs; in your scenario, the input is Kafka and the output is HDFS. What you need to do is:
Configure the fluentd input following the fluentd Kafka plugin docs; you'll configure the source section with your Kafka broker/topic info.
Enable WebHDFS and the append operation for your HDFS cluster; you can find out how in the HDFS (WebHDFS) Output Plugin docs.
Configure your match section to write the data to HDFS; there's an example on the plugin docs page. To partition your data by month and day, you can configure the path parameter with time slice placeholders, something like:
path "/event_data/%Y/%m/data%d"
With this option for collecting your data, you can then write your MapReduce jobs to do ETL or whatever you like.
I don't know whether this suits your problem; it's just one more option, and a rough configuration sketch follows.
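Something along these lines (broker, host, topic, and tag names are placeholders, and the exact parameter names depend on your fluent-plugin-kafka and fluent-plugin-webhdfs versions, so check their docs):

<source>
  @type kafka_group
  brokers kafka01:9092
  consumer_group fluentd
  topics event_topic
  format json
</source>

<match event_topic>
  @type webhdfs
  host namenode.example.com
  port 50070
  path /event_data/%Y/%m/data%d
</match>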

How do I control output files name and content of an Hadoop streaming job?

Is there a way to control the output filenames of a Hadoop Streaming job?
Specifically, I would like my job's output files' content and names to be organized by the key the reducer outputs: each file would contain only the values for one key, and its name would be the key.
Update:
Just found the answer: using a Java class that derives from MultipleOutputFormat as the job's output format allows control of the output file names.
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputFormat.html
I haven't seen any samples for this out there...
Can anyone point me to a Hadoop Streaming sample that makes use of a custom output format Java class?
Using a Java class that derives from MultipleOutputFormat as the job's output format allows control of the output file names. http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputFormat.html
When using Hadoop Streaming, since only one JAR is supported, you actually have to fork the streaming jar and put your new output format class in it so that streaming jobs can reference it...
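A hedged sketch of such a class against the old mapred API (the class name and Text key/value types are assumptions; adapt them to what your streaming job emits):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Names each output file after the record's key, so every key ends up in its own file.
public class KeyAsFileNameOutputFormat extends MultipleTextOutputFormat<Text, Text> {
  @Override
  protected String generateFileNameForKeyValue(Text key, Text value, String name) {
    return key.toString();
  }
}

You would then reference it from the streaming job with -outputformat KeyAsFileNameOutputFormat, once it is bundled into the jar as described above.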
EDIT:
As of Hadoop 0.20.2 this class has been deprecated and you should now use:
http://hadoop.apache.org/docs/mapreduce/current/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html
In general, Hadoop would have you consider the entire directory to be the output, and not an individual file. There's no way to directly control the filename, whether using Streaming or regular Java jobs.
However, nothing is stopping you from doing this splitting and renaming yourself after the job has finished. You can run $HADOOP dfs -cat path/to/your/output/part-* and pipe that to a script of yours that splits the content up by key and writes each key to its own file.
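A rough sketch of that post-processing step, assuming the reducer emits tab-separated key/value lines and a local output directory out/ (both are assumptions about your job):

mkdir -p out
$HADOOP dfs -cat path/to/your/output/part-* \
  | awk -F '\t' '{ f = "out/" $1; print $2 >> f; close(f) }'

This appends each value to a local file named after its key; the per-key files can then be copied back into HDFS if needed.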
