How to handle the small files problem in NiFi

My current flow in NiFi is:
ListHDFS -> FetchHDFS -> SplitText -> JoltTransformJSON -> PutHBaseJSON
Hourly input JSON files add up to a maximum of about 10 GB.
A single file is 80-100 MB.
SplitText and JoltTransformJSON split and transform the text and emit it as 4 KB flowfiles, so the hourly job takes 50 minutes to 1 hour 20 minutes to complete the flow. How can I make this faster? What would be the best flow to handle this use case?
I have tried using MergeContent, but it didn't work out well.
Thanks all.

You can use the MergeRecord processor after the JoltTransformJSON processor and
set the Minimum/Maximum Number of Records properties to control how many records get merged into a single flowfile.
Use the Max Bin Age property as a fallback to force bins that never reach the record count to be merged anyway.
Then use the record-oriented processor for HBase, i.e. the PutHBaseRecord processor, configure its Record Reader controller service (JsonTreeReader) to read the incoming flowfiles, and tune the Batch Size property to get maximum performance (example settings are sketched below).
By processing chunks of records this way we greatly increase the performance of storing data into HBase.
Flow:
ListHDFS -> FetchHDFS -> SplitText -> JoltTransformJSON -> MergeRecord -> PutHBaseRecord
Refer to these links for Merge Record configs and Record Reader configs
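As a rough sketch, the MergeRecord and PutHBaseRecord settings could look like the following; the record counts, batch size, and service names are illustrative assumptions to be tuned for your data:

MergeRecord
    Record Reader             = JsonTreeReader
    Record Writer             = JsonRecordSetWriter
    Merge Strategy            = Bin-Packing Algorithm
    Minimum Number of Records = 100000    (assumed; raise or lower based on record size)
    Maximum Number of Records = 1000000   (assumed)
    Max Bin Age               = 2 min     (flushes partially filled bins)

PutHBaseRecord
    Record Reader             = JsonTreeReader
    HBase Client Service      = HBase_1_1_2_ClientService
    Table Name                = <your table>
    Row Identifier Field Name = <field holding the row key>
    Column Family             = <your column family>
    Batch Size                = 10000     (assumed; tune for throughput)

With the Bin-Packing Algorithm a bin becomes eligible to merge once it reaches the minimum record count, and Max Bin Age forces out bins that never reach it.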

Related

Running record count from the SplitRecord processor in NiFi

Is there a way to get the fragment index from the SplitRecord processor in NiFi? I am splitting a very big xls (4 million records) with "Records Per Split" = 100000.
Now I want to process just the first 2 splits, to check the quality of the file, and reject the rest of it.
I can see a fragment index in other split processors (e.g. SplitJson), but not in SplitRecord. Any other hack?
Method1:
By using the ControlRate processor we can achieve this (a sketch of the settings follows below).
ControlRate processor: with these configs we release 2 flowfiles every minute.
Flow: configure the queue expiration to something like 10 sec (or a lower value if you need); the remaining flowfiles then expire in the queue, but the first 2 flowfiles will have been released.
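A minimal sketch of the ControlRate settings, assuming the goal is 2 flowfiles per minute (the property values are illustrative):

ControlRate
    Rate Control Criteria = flowfile count
    Maximum Rate          = 2
    Time Duration         = 1 min

On the incoming connection, FlowFile Expiration is the queue-expiration setting mentioned above; anything beyond the first 2 splits expires there instead of piling up.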
Method2:
Use the SplitText processor, then a RouteOnAttribute processor, and add a new property such as
${fragment.index:le(2)}
This expression allows only the first 2 fragment indexes through (a sketch of the RouteOnAttribute config follows below).
Refer to this link for splitting Big File in NiFi.
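A sketch of the RouteOnAttribute configuration (the property name first_two is just an illustrative label):

RouteOnAttribute
    Routing Strategy = Route to Property name
    first_two        = ${fragment.index:le(2)}

Route the first_two relationship downstream and auto-terminate (or expire) the unmatched relationship.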

What processors should be combined to process large JSON files in NiFi?

I would like to set up a NiFi workflow that pulls large gzipped JSON documents (between 500 MB and 3 GB) from an FTP server, splits the JSON objects into individual flow files, and finally converts each JSON object to SQL and inserts it into a MySQL database.
I am running NiFi 1.6.0, on Oracle Java 8, and Java has 1024 MB heap space set.
My current flow is:
GetFTP -> CompressContent -> SplitJson -> EvaluateJsonPath -> AttributesToJSON -> ConvertJSONToSQL -> PutSQL
This flow works great for smaller JSON documents. It throws Java OutOfMemoryError once a file larger than 400 MB enters the SplitJson processor. What changes can I make to the existing flow to enable it to process large JSON documents?
Generally you will want to avoid splitting to a flow file per document. You will get much better performance if you can keep many documents together in a single flow file. You will want to take a look at NiFi's record processing capabilities, specifically you will want to look at PutDatabaseRecord.
Here is a good intro to the record processing approach:
https://www.slideshare.net/BryanBende/apache-nifi-record-processing
If you absolutely have to split down to individual records per flow file, then you should at least perform a two-phase split, where the first split processor splits to maybe 10k-20k records per flow file, and the second split processor splits down to 1 record per flow file.
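For the record-oriented approach, a sketch of the simplified flow and PutDatabaseRecord settings might look like this (the controller-service and table names are placeholders):

GetFTP -> CompressContent (decompress) -> PutDatabaseRecord

PutDatabaseRecord
    Record Reader                       = JsonTreeReader
    Statement Type                      = INSERT
    Database Connection Pooling Service = DBCPConnectionPool (pointed at your MySQL database)
    Table Name                          = <your table>

This keeps many records together per flow file and avoids the per-record splitting that causes the SplitJson memory blow-up.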

NiFi record counts

I am getting files from a remote server using NiFi; my files are as follows:
timestamp (ms), nodeID,value
12345,x,12.4
12346,x,12.7
12348,x,13.4
12356,x,13,6
12355,y,12.0
Right now I just get and fetch the files, split the lines, and send them to Kafka. But beforehand, I need to apply a checksum approach to my records and aggregate them based on timestamp. What I need to do is add an additional column to my content and count the records based on aggregated timestamps, for example aggregating per 10 milliseconds and per nodeID:
timestamp (ms), nodeID,value, counts
12345,x,12.4,3
12346,x,12.7,3
12348,x,13.4,3
12356,x,13,6,1
12355,y,12.0,1
How can I do the above in NiFi? I am totally new to NiFi but need to add the above functionality to my NiFi process. I am currently using the NiFi flow below.
This may not answer your question directly, but you should consider refactoring your flow to use the "record" processors. It would greatly simplify things and would probably get you closer to being able to do the aggregation.
The idea is to not split up the records, and instead process them in place. Given your current flow, the 4 processors after FetchSFTP would likely change to a single ConvertRecord processor that converts CSV to JSON. You would first need to define a simple Avro schema for your data (a sketch follows below).
Once you have the record processing set up, you might be able to use PartitionRecord to partition the records by nodeID, and from there the missing piece would be how to count by the timestamps.
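A minimal sketch of what that Avro schema could look like for the sample data (the field and record names here are assumptions; adjust them to match your columns):

{
  "type": "record",
  "name": "reading",
  "fields": [
    { "name": "timestamp", "type": "long" },
    { "name": "nodeID", "type": "string" },
    { "name": "value", "type": "double" }
  ]
}

ConvertRecord would then use a CSVReader and a JsonRecordSetWriter that both reference this schema (for example via the Schema Text property).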
Some additional resources...
https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi
https://bryanbende.com/development/2017/06/20/apache-nifi-records-and-schema-registries
https://www.slideshare.net/BryanBende/apache-nifi-record-processing

Creating larger NiFi flow files when using the ConsumeKafka processor

I've created a simple NiFi pipeline that reads a stream of data from a Kafka topic (using ConsumeKafka) and writes it to HDFS (using PutHDFS). Currently, I'm seeing lots of small files being created in HDFS. A new file is created about once a second, some with only one or two records.
I want fewer, larger files to be written to the HDFS.
I have the following settings in ConsumeKafka:
Message Demarcator = <new line>
Max Poll Records = 10000
Max Uncommitted Time = 20s
In the past I've used Flume instead of NiFi; it has batchSize and batchDurationMillis, which let me tweak how big the HDFS files are. It seems like ConsumeKafka in NiFi is missing a batchDurationMillis equivalent.
What's the solution in NiFi?
Using the Message Demarcator and Max Poll Records is the correct approach to get multiple messages per flow file. You may want to slow down the ConsumeKafka processor by adjusting the Run Schedule (on the scheduling tab) from 0 sec which means run as fast as possible, to something like 1 second or whatever makes sense for you to grab more data.
Even with the above, you will likely still want to put a MergeContent processor before PutHDFS and merge flow files together based on size, so that you can wait until you have an appropriate amount of data before writing to HDFS.
How to use MergeContent will depend on the type of data you are merging... If you have Avro, there is a specific merge format for Avro. If you have JSON you can merge the documents one after another, or you can wrap them with a header, footer, and demarcator to make a valid JSON array.
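A sketch of MergeContent settings for newline-demarcated JSON records headed to HDFS (the sizes and ages are assumptions; tune them toward your HDFS block size and latency tolerance):

MergeContent
    Merge Strategy            = Bin-Packing Algorithm
    Merge Format              = Binary Concatenation
    Maximum Number of Entries = 1000000   (raised so the size limits decide when to merge)
    Minimum Group Size        = 128 MB
    Maximum Group Size        = 256 MB
    Max Bin Age               = 5 min
    Delimiter Strategy        = Text
    Demarcator                = newline (matching the ConsumeKafka demarcator)

Max Bin Age guarantees data still lands in HDFS during quiet periods when the size threshold is never reached.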

Bulk loading Avro to HBase with NiFi

I'm ingesting flowfiles containing Avro records with NiFi and need to insert them into HBase. These flowfiles vary in size, but some have 10,000,000+ records. I use SplitAvro twice (once to split to 10,000 records, then once to split to 1 record), then use an ExecuteScript processor to pull out the row key for HBase and add it as a flowfile attribute. Finally I use PutHBaseCell (with a batch size of 10,000) to write to HBase using the row key attribute.
The processor that splits the Avro down to 1 record is very slow (Concurrent Tasks is set to 5). Is there a way to speed that up? And is there a better way to load this Avro data into HBase?
(I am using a 2-node NiFi (v1.2) cluster made from VMs; each node has 16 CPUs and 16 GB RAM.)
There is a new PutHBaseRecord processor that will be part of the next release (there is a 1.4.0 release being voted upon right now).
With this processor you would avoid ever splitting your flow files; you just send a flow file with millions of Avro records right to PutHBaseRecord, and PutHBaseRecord would be configured with an Avro reader.
You should get significantly better performance with this approach.
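A sketch of how PutHBaseRecord could be configured for this case once it is available (the service and field names are placeholders):

PutHBaseRecord
    Record Reader             = AvroReader (Schema Access Strategy = Use Embedded Avro Schema)
    HBase Client Service      = HBase_1_1_2_ClientService
    Table Name                = <your table>
    Row Identifier Field Name = <the field you were extracting with ExecuteScript>
    Column Family             = <your column family>
    Batch Size                = 10000

Row Identifier Field Name takes over the job of the ExecuteScript step, since the processor pulls the row key out of each record itself.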
