Bulk loading Avro to HBase with NiFi - apache-nifi

I'm ingesting flowfiles containing Avro records with NiFi, and I need to insert them into HBase. These flowfiles vary in size, but some have 10,000,000+ records. I use SplitAvro twice (once to split down to 10,000 records, then again to split down to 1 record), then use an ExecuteScript processor to pull out the row key for HBase and add it as a flowfile attribute. Finally, I use PutHBaseCell (with a batch size of 10,000) to write to HBase using the row key attribute.
The processor that splits the Avro down to 1 record is very slow (Concurrent Tasks is set to 5). Is there a way to speed that up? And is there a better way to load this Avro data into HBase?
(I am using a 2-node NiFi (v1.2) cluster of VMs; each node has 16 CPUs and 16 GB RAM.)

There is a new PutHBaseRecord processor that will be part of the next release (there is a 1.4.0 release being voted upon right now).
With this processor you avoid splitting your flow files at all: you send a flow file with millions of Avro records straight to PutHBaseRecord, which is configured with an Avro record reader.
You should get significantly better performance with this approach.
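For intuition, here is a rough sketch (plain Avro and HBase client APIs, not NiFi internals) of the record-oriented idea: iterate over a multi-record Avro file and write batched Puts through a single connection instead of creating one flow file per record. The table name, column family, and the "id"/"value" field names are placeholders.

    import java.io.File;

    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.BufferedMutator;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class AvroToHBaseSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            byte[] cf = Bytes.toBytes("cf"); // hypothetical column family

            try (Connection conn = ConnectionFactory.createConnection(conf);
                 // BufferedMutator buffers Puts client-side and sends them in batches
                 BufferedMutator mutator = conn.getBufferedMutator(TableName.valueOf("my_table"));
                 DataFileReader<GenericRecord> reader = new DataFileReader<>(
                         new File("records.avro"), new GenericDatumReader<GenericRecord>())) {

                for (GenericRecord record : reader) {
                    // Use one field of the record as the HBase row key ("id" is an assumption)
                    Put put = new Put(Bytes.toBytes(record.get("id").toString()));
                    put.addColumn(cf, Bytes.toBytes("value"),
                            Bytes.toBytes(record.get("value").toString()));
                    mutator.mutate(put); // queued, not sent one-by-one
                }
            } // closing the mutator flushes any remaining buffered Puts
        }
    }

PutHBaseRecord configured with an Avro reader gives you this kind of batched write entirely inside NiFi, so no custom code or splitting is needed.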

Related

NiFi: Proper way to consume Kafka and store data into Hive

I have the task of creating a Kafka consumer that should extract messages from Kafka, transform them, and store them in a Hive table.
There are a lot of messages in the Kafka topic, each a JSON object.
I'd like to add some fields and insert the result into Hive.
I created a flow with the following NiFi processors:
ConsumeKafka_2_0
JoltTransformJSON (to transform the JSON)
ConvertRecord (to turn the JSON into an INSERT query for Hive)
PutHiveQL
The topic is fairly heavily loaded and handles about 5 GB of data per day.
So, are there any ways to optimize my flow? (I think it's a bad idea to send a huge number of INSERT queries to Hive.) Maybe it would be better to use an external table and the PutHDFS processor; in that case, how do I handle partitioning and merge the input JSON into one file?
As you suspect, using PutHiveQL to perform a large number of individual INSERTs is not very performant. Using your external table approach will likely be much better. If the table is in ORC format, you could use ConvertAvroToORC (for Hive 1.2) or PutORC (for Hive 3) which both generate Hive DDL to help create the external table.
There are also Hive streaming processors, but if you are using Hive 1.2 PutHiveStreaming is not very performant either (but should still be better than PutHiveQL with INSERTs). For Hive 3, PutHive3Streaming should be much more performant and is my recommended solution.
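If you do go the external-table route, the DDL only needs to be issued once (from beeline or over JDBC). Below is a hedged sketch using the Hive JDBC driver; the connection URL, credentials, table name, columns, partition column, and HDFS paths are all placeholders, assuming NiFi writes ORC files under /data/events.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class HiveExternalTableSketch {
        public static void main(String[] args) throws Exception {
            // Connection URL, credentials, table, columns, and paths are placeholders
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hive-server:10000/default", "user", "");
                 Statement stmt = conn.createStatement()) {

                // External table over the directory your PutHDFS / PutORC processor writes to
                stmt.execute(
                    "CREATE EXTERNAL TABLE IF NOT EXISTS events (" +
                    "  id STRING, payload STRING, extra_field STRING) " +
                    "PARTITIONED BY (dt STRING) " +
                    "STORED AS ORC " +
                    "LOCATION '/data/events'");

                // Register a partition once NiFi has written files into its directory
                stmt.execute(
                    "ALTER TABLE events ADD IF NOT EXISTS PARTITION (dt='2019-01-01') " +
                    "LOCATION '/data/events/dt=2019-01-01'");
            }
        }
    }

As noted above, ConvertAvroToORC and PutORC can generate Hive DDL for you, so you may not need to hand-write the CREATE TABLE statement at all.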

Nifi joins using ExecuteSQL for larger tables

I am trying to join multiple tables using NiFi. The data source may be MySQL or Redshift, and maybe something else in the future. Currently I am using the ExecuteSQL processor for this, but the output lands in a single flowfile. Hence, for terabytes of data, this may not be suitable. I have also tried using GenerateTableFetch, but it doesn't have a join option.
Here are my Questions:
Is there any alternative for ExecuteSQL processor?
Is there a way to make the ExecuteSQL processor output multiple flowfiles? Currently I can split the output of ExecuteSQL using the SplitAvro processor, but I want ExecuteSQL itself to split the output.
GenerateTableFetch generates SQL queries based on offset. Will this slow down the process when the dataset becomes larger?
Please share your thoughts. Thanks in advance
1. Is there any alternative to the ExecuteSQL processor?
If you are joining multiple tables, then you need to use the ExecuteSQL processor.
2. Is there a way to make the ExecuteSQL processor output multiple flowfiles?
Starting with NiFi 1.8, you can configure Max Rows Per Flow File so that the ExecuteSQL processor splits its output into multiple flowfiles.
NIFI-1251 addresses this issue.
3. GenerateTableFetch generates SQL queries based on offset. Will this slow down the process when the dataset becomes larger?
If your source table has an index on the maximum-value column(s), the process won't slow down even as your dataset grows.
If there is no index on the source table, a full table scan will be done every time, which slows down the process.
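To make that last point concrete, here is a hedged JDBC sketch: an index on the maximum-value column, plus the general shape of a paged, incremental query (the exact SQL that GenerateTableFetch emits depends on the configured database adapter). The connection URL, table, and column names are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class IncrementalFetchSketch {
        public static void main(String[] args) throws Exception {
            // MySQL URL, table, and column names are placeholders
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://db-host:3306/mydb", "user", "password")) {

                // An index on the maximum-value column lets the incremental predicate
                // and the ORDER BY be served without a full table scan.
                try (Statement stmt = conn.createStatement()) {
                    stmt.execute("CREATE INDEX idx_orders_updated_at ON orders (updated_at)");
                }

                // Shape of a paged, incremental fetch over the maximum-value column
                String sql = "SELECT * FROM orders"
                           + " WHERE updated_at > ? AND updated_at <= ?"
                           + " ORDER BY updated_at LIMIT 10000 OFFSET ?";
                try (PreparedStatement ps = conn.prepareStatement(sql)) {
                    ps.setTimestamp(1, java.sql.Timestamp.valueOf("2019-01-01 00:00:00"));
                    ps.setTimestamp(2, java.sql.Timestamp.valueOf("2019-01-02 00:00:00"));
                    ps.setLong(3, 0);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            // process row...
                        }
                    }
                }
            }
        }
    }

The exact syntax varies by database; the point is simply that the WHERE and ORDER BY on the maximum-value column should be index-backed.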

How to handle small files problem in Nifi

My current flow in NiFi is:
ListHDFS -> FetchHDFS -> SplitText -> JoltTransformJSON -> PutHBaseJSON
The hourly input JSON files are at most 10 GB in total.
A single file is 80-100 MB.
SplitText and JoltTransformJSON transform the text and send it on as 4 KB files, so the hourly job takes anywhere from 50 minutes to 1 hour 20 minutes to complete the flow. How can I make this faster? What would be the best flow to handle this use case?
I have tried using MergeContent, but it didn't work out well.
Thanks All
You can use the MergeRecord processor after the JoltTransformJSON processor and
set the Minimum/Maximum Number of Records properties to control how many records make a bin eligible to be merged into a single flowfile.
Use the Max Bin Age property as a backstop to force a bin to be merged even if it hasn't reached the record threshold.
Then use the record-oriented processor for HBase, i.e. PutHBaseRecord, configure its Record Reader controller service (JsonTreeReader) to read the incoming flowfiles, and tune the Batch Size property to get maximum performance.
This way you process chunks of records at a time, which ultimately increases the performance of storing data in HBase.
Flow:
ListHDFS -> FetchHDFS -> SplitText -> JoltTransformJSON -> MergeRecord -> PutHBaseRecord
Refer to these links for MergeRecord configs and Record Reader configs.
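For a feel of what the record-oriented HBase write buys you (this is not PutHBaseRecord's implementation, just the batched-Put idea it relies on), here is a sketch that parses newline-delimited JSON records and writes them through a single BufferedMutator; the table, column family, and the "id" row-key field are assumptions.

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.BufferedMutator;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class JsonRecordsToHBaseSketch {
        public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            Configuration conf = HBaseConfiguration.create();
            byte[] cf = Bytes.toBytes("cf"); // hypothetical column family

            // One JSON record per line, as a merged flowfile might contain
            List<String> lines = Files.readAllLines(Paths.get("merged-records.json"));

            try (Connection conn = ConnectionFactory.createConnection(conf);
                 BufferedMutator mutator = conn.getBufferedMutator(TableName.valueOf("my_table"))) {
                for (String line : lines) {
                    JsonNode record = mapper.readTree(line);
                    Put put = new Put(Bytes.toBytes(record.get("id").asText())); // "id" as row key
                    put.addColumn(cf, Bytes.toBytes("payload"), Bytes.toBytes(record.toString()));
                    mutator.mutate(put); // buffered and flushed in batches
                }
            }
        }
    }

MergeRecord plus PutHBaseRecord gives you this chunked behavior with configuration only, without writing any code.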

Creating larger NiFi flow files when using the ConsumeKafka processor

I've created a simple NiFi pipeline that reads a stream of data from a Kafka topic (using ConsumeKafka) and writes it to HDFS (using PutHDFS). Currently I'm seeing lots of small files being created on HDFS: a new file is created about once a second, some with only one or two records.
I want fewer, larger files to be written to HDFS.
I have the following settings in ConsumeKafka:
Message Demarcator = <new line>
Max Poll Records = 10000
Max Uncommitted Time = 20s
In the past I've used Flume instead of NiFi; it has batchSize and batchDurationMillis parameters, which let me tweak how big the HDFS files are. It seems like ConsumeKafka in NiFi is missing a batchDurationMillis equivalent.
What's the solution in NiFi?
Using the Message Demarcator and Max Poll Records is the correct approach to get multiple messages per flow file. You may also want to slow down the ConsumeKafka processor by adjusting the Run Schedule (on the Scheduling tab) from 0 sec, which means run as fast as possible, to something like 1 second, or whatever makes sense for you to grab more data.
Even with the above, you would likely still want to put a MergeContent processor before PutHDFS and merge flow files together based on size, so that you wait until you have an appropriate amount of data before writing to HDFS.
How to use MergeContent will depend on the type of data you are merging... If you have Avro, there is a specific merge strategy for Avro. If you have JSON you can merge them one after another, or you can wrap them with a header, footer, and demarcator to make a valid JSON array.
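Outside of NiFi, the batch-by-size-or-age idea the asker knows from Flume looks roughly like the sketch below (plain Kafka consumer API; the broker, topic, group id, and thresholds are placeholders, and the HDFS write is stubbed out).

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class BatchingConsumerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka:9092"); // placeholder broker
            props.put("group.id", "hdfs-writer");         // placeholder group id
            props.put("key.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");

            long batchSize = 10_000;           // analogous to Flume's batchSize
            long batchDurationMillis = 30_000; // analogous to Flume's batchDurationMillis

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("my-topic")); // placeholder topic
                StringBuilder batch = new StringBuilder();
                long count = 0;
                long batchStart = System.currentTimeMillis();

                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        batch.append(record.value()).append('\n'); // newline as the demarcator
                        count++;
                    }
                    boolean bigEnough = count >= batchSize;
                    boolean oldEnough = System.currentTimeMillis() - batchStart >= batchDurationMillis;
                    if (count > 0 && (bigEnough || oldEnough)) {
                        writeToHdfs(batch.toString()); // stand-in for the actual HDFS write
                        batch.setLength(0);
                        count = 0;
                        batchStart = System.currentTimeMillis();
                    }
                }
            }
        }

        private static void writeToHdfs(String data) {
            // omitted: would write the accumulated batch as a single HDFS file
        }
    }

In NiFi, the Run Schedule plays roughly the role of the time threshold, and MergeContent's size/bin-age settings play the role of the size threshold.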

HBase aggregation, Get And Put operation, Bulk Operation

I would like to know how I can map the value of a key.
I know it can be done with a Get followed by a Put. Is there any other way to do it efficiently? checkAndPut is not very helpful.
Can it be done with something like:
(key, value) => value + g()
I have read the book HBase: The Definitive Guide, and it seems like a MapReduce job is translated into Put/Get operations on top of HBase. Does that mean it is not a 'bulk operation' (since it is one operation per key)?
Is Spark relevant here, and how?
HBase has scans (1) to retrieve multiple rows, and MapReduce jobs can and do use this command (2).
For HBase, 'bulk' mostly [or solely] means 'bulk load'/'bulk import', where one adds data by constructing HFiles and 'injecting' them into the HBase cluster (as opposed to issuing Puts) (3).
Your task can be implemented as a MapReduce job as well as a Spark app (4 being one example, maybe not the best one), or a Pig script, or a Hive query if you use the HBase table from Hive (5); pick your poison.
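For reference, a multi-row read with the plain HBase client API looks like the sketch below; the table name and the cf:count column are placeholders, and the column is assumed to hold an 8-byte counter value.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ScanSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("my_table"))) { // placeholder table

                Scan scan = new Scan();
                scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count")); // placeholder cf:qualifier
                scan.setCaching(1000); // rows fetched per RPC

                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result result : scanner) {
                        // Assumes cf:count holds an 8-byte counter (long) value
                        long count = Bytes.toLong(
                                result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("count")));
                        System.out.println(Bytes.toString(result.getRow()) + " -> " + count);
                    }
                }
            }
        }
    }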
If you set up a Table with a counter then you can use an Increment to add a certain amount to the existing value in an atomic operation.
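A minimal sketch of the Increment approach with the plain HBase client API (the table, column family, qualifier, and row key are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Increment;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class IncrementSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("my_table"))) { // placeholder table

                // Atomically add 5 to the existing value, with no separate Get + Put
                Increment inc = new Increment(Bytes.toBytes("some-row-key"));
                inc.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count"), 5L);
                table.increment(inc);

                // Shorthand for a single column; returns the new counter value
                long newValue = table.incrementColumnValue(
                        Bytes.toBytes("some-row-key"), Bytes.toBytes("cf"), Bytes.toBytes("count"), 5L);
                System.out.println("counter is now " + newValue);
            }
        }
    }

HBase counters are stored as 8-byte longs, so you read them back with Bytes.toLong on the cell value.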
From a MapReduce job you would aggregate your input in micro batches (wherever you have your incremental counts), group them by key/value, sum them up, and then issue a Put from your job (1 Put per key).
What I mentioned above is not a 'bulk' operation, but it would probably work just fine if the number of rows you modify in each batch is relatively small compared to the total number of rows in your table.
If (and only if) you expect to modify your entire table in each batch, then you should look at bulk loads. That will require you to write a job that reads your existing values from HBase and your new values from the incremental sources, adds them together, and writes them back to HBase (in a 'bulk load' fashion, not directly).
A bulk load writes HFiles directly to HDFS without going through the HBase write pipeline (memstore, minor compactions, major compactions, etc.), and then issues a command to swap the existing files with the new ones. The swap is FAST! Note that you could also generate the new HFiles outside the HBase cluster (so as not to overload it), then copy them over and issue the swap command.
