Creating larger NiFi flow files when using the ConsumeKafka processor - hadoop

I've created a simple NiFi pipeline that reads a stream of data from a Kafka topic (using ConsumeKafka) and writes it to the HDFS (using PutHDFS). Currently, I'm seeing lots of small files being created on the HDFS. A new file is created about once a second, some with only one or two records.
I want fewer, larger files to be written to the HDFS.
I have the following settings in ConsumeKafka:
Message Demarcator = <new line>
Max Poll Records = 10000
Max Uncommitted Time = 20s
In the past I've used Flume instead of Nifi, and it has batchSize and batchDurationMillis, which allow me to tweak how big HDFS files are. It seems like ConsumeKafka in Nifi is missing a batchDurationMillis equivalent.
What's the solution in NiFi?

Using the Message Demarcator and Max Poll Records is the correct approach to get multiple messages per flow file. You may want to slow down the ConsumeKafka processor by adjusting the Run Schedule (on the scheduling tab) from 0 sec which means run as fast as possible, to something like 1 second or whatever makes sense for you to grab more data.
Even with the above, you would likely still want to stick a MergeContent processor before PutHDFS, and merge together flow files based on size so that you can wait til you have the appropriate amount of data before writing to HDFS.
How to use MergeContent will depend on the type of data you are merging... If you have Avro, there is a specific merge strategy for Avro. If you have JSON you can merge them one after another, or you can wrap them with a header, footer, and demarcator to make a valid JSON array.

Related

How to wait for GenerateTableFetch queries to finish

My use case is like this. I have some X tables to be pulled from MySQL. I am splitting them using SplitText to put each table in a individual flow file and pull using GenerateTableFetch and ExecuteSQL.
And I want to be notified or put some other action when import is done for all the tables. At SplitText text processor I have routed original relationship to Wait on ${filename} with target count ${fragment.count}. This will track how many tables are done.
But now I am not able to figure out how to know when a particular table is done. GenerateTableFetch forks flow file into multiple based on Partition Size. But it does not write attributes like fragment.count which I can use to wait on for each table.
Is there a way I can achieve this? Or maybe is there a way to know at the end of the entire flow if all flow files in the flow have been processed and nothing is in queue or being processed?
If you have a standalone instance of NiFi (or are not distributing the flow files among a cluster to ExecuteSQL nodes), then you could use QueryDatabaseTable instead, it (by default) will only issue all flow files when the entire result set is processed. If you have all the rows go into a single flow file, then the fact that the flow file has been transferred downstream is an indication that the fetch is complete.
I have written NIFI-5601 to cover the improvement of adding fragment.* attributes to flow files generated by GTF.
Till NiFi add's support for this, I managed to make it work using MergeContent. Use table_name as Correlation attribute name and then use merged relation to Wait processor using ${merge.count} as target. Refer screenshots if someone is looking to do the same.

How to handle small files problem in Nifi

My current flow in Nifi is like
ListHDFS->FetchHDFS->SplitText->JoltransformJSon->PUTHBaseJSON.
Hourly input JSON files would be max of 10GB.
Single file size would be 80 -100MB.
Splittext & JoltTransform -> transform the text and sent it as 4KB files . Hence the hourly job is taking 50 to 1.20 minutest to complete the flow . How can I make this faster. What would be the best flow to handle the use case.
Have tried to use Mergecontent , didnt worked out well .
Thanks All
You can use MergeRecord processor After JoltTransfromJson Processor and
keep your maximum number of records to make flowfile eligible to merge into single flowfile.
Use Max Bin Age property as wildcard to force eligible the bin to be Merged.
Then use record oriented processor for HBase i.e PutHBaseRecord processor and configure your Record Reader controller service(JsonTree Reader) to read the incoming flowfile and tune the Batch size property value to get maximum performance.
By using this process we are processing chunks of records which eventually increase the performance of storing data into HBase.
Flow:
ListHDFS->FetchHDFS->SplitText->JoltransformJSon->MergeRecord ->PUTHBaseRecord
Refer to these links for Merge Record configs and Record Reader configs

Apache Nifi - Consume Kafka + Merge Content + Put HDFS to avoid small files

I am having around 2000000 messages in Kafka topic and I want to put these records into HDFS using NiFi,so I am using PutHDFS processor for this along with ConsumeKafka_0_10 but it generates small files in HDFS, So I am using Merge Content processor for the merging the records before pushing the file.
Please help if the configuration needs changes This works fine for small number of messages but writes a single file for every record when it comes to topics with massive data.
Thank you!!
The Minimum Number of Entries is set to 1 which means it could have anywhere from 1 to the Max Number of Entries. Try making that something higher like 100k.

Nifi record counts

I am getting files from remote server using Nifi: my files are as follow:
timestamp (ms), nodeID,value
12345,x,12.4
12346,x,12.7
12348,x,13.4
12356,x,13,6
12355,y,12.0
I am now just get and fetch and split lines and send them to Kafka, but before hand, I need to apply a checksum approach on my records and aggregate them based on time stamp, what I need to do to add an additional column to my content and count the records based on aggregated time stamps, for example aggregation based on each 10 milliseconds and nodeID..
timestamp (ms), nodeID,value, counts
12345,x,12.4,3
12346,x,12.7,3
12348,x,13.4,3
12356,x,13,6,1
12355,y,12.0,1
How to do above process in NiFi. I am totally new to Nifi but need to add above functinality to my Nifi process. I am currently using below nifi process
This may not answer your question directly, but you should consider refactoring your flow to use the "record" processors. It would greatly simplify things and would probably get you closer to being able to do the aggregation.
The idea is to not split up the records, and instead process them in place. Given your current flow, the 4 processors after FetchSFTP would like change to a single ConvertRecord processor that converts CSV to JSON. You would first need to defined a simple Avro schema for your data.
Once you have the record processing setup, you might be able to use PartitionRecord to partition the records by the node id, and then from there the missing piece would be how to count by the timestamps.
Some additional resources...
https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi
https://bryanbende.com/development/2017/06/20/apache-nifi-records-and-schema-registries
https://www.slideshare.net/BryanBende/apache-nifi-record-processing

Running a MR Job on a portion of the HDFS file

Imagine you have a big file stored in hdtf which contains structured data. Now the goal is to process only a portion of data in the file like all the lines in the file where second column value is between so and so. Is it possible to launch the MR job such that hdfs only stream the relevant portion of the file versus streaming everything to the mappers.
The reason is that I want to expedite the job speed by only working on the portion that I need. Probably one approach is to run a MR job to get create a new file but I am wondering if one can avoid that?
Please note that the goal is to keep the data in HDFS and I do not want to read and write from database.
HDFS stores files as a bunch of bytes in blocks, and there is no indexing, and therefore no way to only read in a portion of your file (at least at the time of this writing). Furthermore, any given mapper may get the first block of the file or the 400th, and you don't get control over that.
That said, the whole point of MapReduce is to distribute the load over many machines. In our cluster, we run up to 28 mappers at a time (7 per node on 4 nodes), so if my input file is 1TB, each map slot may only end up reading 3% of the total file, or about 30GB. You just perform the filter that you want in the mapper, and only process the rows you are interested in.
If you really need filtered access, you might want to look at storing your data in HBase. It can act as a native source for MapReduce jobs, provides filtered reads, and stores its data on HDFS, so you are still in the distributed world.
One answer is looking at the way that hive solves this problem. The data is in "tables" which are really just meta data about files on disk. Hive allows you to set columns on which a table is partitioned. This creates a separate folder for each partition so if you were partitioning a file by date you would have:
/mytable/2011-12-01
/mytable/2011-12-02
Inside of the date directory would be you actual files. So if you then ran a query like:
SELECT * FROM mytable WHERE dt ='2011-12-01'
Only files in /mytable/2011-12-01 would be fed into the job.
Tho bottom line is that if you want functionality like this you either want to move to a higher level language (hive/pig) or you need to roll your own solutions.
Big part of the processing cost - is data parsing to produce Key-Values to the Mapper. We create there (usually) one java object per value + some container. It is costly both in terms of CPU and garbage collector pressure
I would suggest the solution "in the middle". You can write input format which will read the input stream and skip non-relevant data in the early stage (for example by looking into few first bytes of the string). As a result you will read all data, but actually parse and pass to the Mapper - only portion of it.
Another approach I would consider - is to use RCFile format (or other columnar format), and take care that relevant and non relevant data will sit in the different columns.
If the files that you want to process have some unique attribute about their filename (like extension or partial filename match), you can also use the setInputPathFilter method of FileInputFormat to ignore all but the ones you want for your MR job. Hadoop by default ignores all ".xxx" and _xxx" files/dirs, but you can extend with setInputPathFilter.
As others have noted above, you will likely get sub-optimal performance out of your cluster doing something like this which breaks the "one block per mapper" paradigm, but sometimes this is acceptable. Can sometimes take more to "do it right", esp if you're dealing with a small amount of data & the time to re-architect and/or re-dump into HBase would eclipse the extra time required to run your job sub-optimally.

Resources