How does PutHiveQL work on batch? - apache-nifi

I am trying to feed multiple insert statements to PutHiveQL via a ReplaceText processor. Each insert statement is a flowfile coming out of ReplaceText. I set the Batch Size in PutHiveQL to 100. However, it still seems to send one flowfile at a time. How can I best implement this batching?

I don't think the PutHiveQL processor batches statements at the JDBC layer as you expect, not in the way that processors like PutSQL do. From the code, it looks like the Batch Size property is used to control how many flowfiles the processor works on before yielding, but the statements for each flowfile are still executed individually.
That might be a good topic for a NiFi feature request.
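To make the distinction concrete, here is a toy model in Python of the behavior described above. It is not NiFi source code: the function name `run_once`, the queue contents, and the table `t` are all hypothetical. It only illustrates that the batch size caps how many flowfiles are handled per invocation while each statement is still executed on its own.

```python
from collections import deque


def run_once(queue, batch_size, execute):
    """Toy model of one processor invocation (not NiFi source code).

    Pulls at most batch_size flow files from the queue and executes each
    statement individually; nothing is combined into a single JDBC batch.
    """
    handled = 0
    while queue and handled < batch_size:
        execute(queue.popleft())  # one execute call per flow file
        handled += 1
    return handled


# Hypothetical queue of 250 single-statement flow files.
queue = deque(f"INSERT INTO t VALUES ({i})" for i in range(250))
executed = []  # stands in for the statements sent to Hive
per_run = []   # how many flow files each invocation handled
while queue:
    per_run.append(run_once(queue, 100, executed.append))
```

In this sketch, raising the batch size reduces the number of invocations but never the number of execute calls.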

The version of Hive supported by NiFi doesn't allow for batching/transactions. The Batch Size parameter is meant to try to move multiple incoming flow files a bit faster than having the processor invoked every so often. So if you schedule the PutHiveQL processor for every 5 seconds with a Batch Size of 100, then every 5 seconds (if there are 100 flow files queued), the processor will attempt to process those during one "session".
Alternatively you can specify a Batch Size of 0 or 1 and schedule it as fast as you like; unfortunately this will have no effect on the Hive side of things, as it auto-commits each HiveQL statement; the version of Hive doesn't support transactions or batching.
Another (possibly more performant) alternative is to put the entire set of rows as a CSV file into HDFS and use the HiveQL "LOAD DATA" DML statement to load that file into a table: https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-DMLOperations
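As a rough illustration of the LOAD DATA approach, the Python sketch below builds such a statement for a file that is already in HDFS. The helper name `load_data_statement`, the path `/tmp/staging/rows.csv`, and the table `my_table` are assumptions made up for this example, and the target table is assumed to exist already with a matching delimited text format.

```python
def load_data_statement(hdfs_path, table, overwrite=False):
    """Build a HiveQL LOAD DATA statement for a file already in HDFS.

    The path and table name are supplied by the caller; this helper does
    no escaping beyond rejecting single quotes in the path.
    """
    if "'" in hdfs_path:
        raise ValueError("path must not contain single quotes")
    mode = "OVERWRITE " if overwrite else ""
    return f"LOAD DATA INPATH '{hdfs_path}' {mode}INTO TABLE {table}"


# Hypothetical staging file and table.
stmt = load_data_statement("/tmp/staging/rows.csv", "my_table")
print(stmt)
```

One statement then covers the whole file, instead of one auto-committed INSERT per row.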

Related

Looking for an Equivalent of GenerateTableFetch

I use ExecuteSQLRecord to run a query and write to CSV format. The table has 10M rows. Although I can split the output into multiple flow files, the query is executed by only a single thread and is very slow.
Is there a way to partition the query into multiple queries so that the next processor can run multiple concurrent tasks, each one processing one partition? It would be like:
GenerateTableFetch -> ExecuteSQLRecord (with concurrent tasks)
The problem is that GenerateTableFetch only accepts table name as input. It does not accept customized queries.
Please advise if you have solutions. Thank you in advance.
You can increase the concurrency of NiFi processors (by increasing the number of Concurrent Tasks). You can also increase the throughput; sometimes that works.
Also, if you are working on a cluster, you can apply load balancing on the queue before the processor so that the workload is distributed among the nodes of your cluster (set the load balance strategy to round robin).
Check the NiFi Notes YouTube channel for NiFi anti-patterns (there is a video on concurrency).
Please clarify your question, if I didn't answer it.
I figured out an alternative way. I developed an Oracle PL/SQL function that takes a table name as an argument and produces a series of queries like "SELECT * FROM T1 OFFSET x ROWS FETCH NEXT 10000 ROWS ONLY". The number of queries is based on the number of rows in the table, which is a statistic stored in the catalog table. If the table has 1M rows and I want 100k rows in each batch, it will produce 10 queries. I use ExecuteSQLRecord to call this function, which effectively does the job of the NiFi processor GenerateTableFetch. My next processor (e.g. ExecuteSQLRecord again) can now have 10 concurrent tasks working in parallel.
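The paging logic described in that answer can be sketched in a few lines of Python. This is not the author's PL/SQL function, only an illustration of the same arithmetic; the function name `paged_queries` is hypothetical and the row count is passed in rather than read from catalog statistics.

```python
def paged_queries(table, row_count, page_size):
    """Return one OFFSET/FETCH query per page of page_size rows."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return [
        f"SELECT * FROM {table} OFFSET {offset} ROWS "
        f"FETCH NEXT {page_size} ROWS ONLY"
        for offset in range(0, row_count, page_size)
    ]


# 1M rows in pages of 100k rows gives 10 queries, as in the answer above.
queries = paged_queries("T1", 1_000_000, 100_000)
print(len(queries))
print(queries[0])
```

Each generated query can then become its own flow file for a downstream processor with several concurrent tasks. Note that without an ORDER BY clause the database does not guarantee a stable row order between queries, so a real version would probably want to order by a key.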

How to run a processor only when another processor has finished its execution?

I'm migrating a table (2 million rows) from DB2 to SQL Server. I'm using the following flow:
ExecuteSQL (to select records from the DB2 table).
SplitAvro (to split the records. I configured it with Output Size = 1 so that if one record fails, the rest are still inserted without problems).
PutDatabaseRecord (to insert the records into the SQL Server table).
ExecuteSQL (I need to call a stored procedure that executes UPDATE statements against the same table that PutDatabaseRecord is writing to).
The problem is that the second ExecuteSQL runs before PutDatabaseRecord completes the insertion of all records.
How can I tell nifi to run that processor only when the other one finishes?
Thanks in advance!
After PutDatabaseRecord you can use MergeContent in Defragment mode to undo the split operation performed by SplitAvro. This way a single flow file will come out of MergeContent only when all splits have been seen, and at that point you know it's time for the second ExecuteSQL to run.
The answer provided by #bryan-bende is great, as it is simple and elegant. If that doesn't work for some reason, you could also look at Wait/Notify. Having said that, Bryan's answer is simpler and probably more robust.
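The idea behind the Defragment approach can be modeled as follows. This is a toy Python sketch, not MergeContent's implementation; the class name `DefragmentTracker` is hypothetical. It relies on the `fragment.identifier`, `fragment.index`, and `fragment.count` attributes that the split step writes on each piece, and reports completion only once every index of a group has been seen.

```python
from collections import defaultdict


class DefragmentTracker:
    """Toy model of waiting until every fragment of a split has arrived."""

    def __init__(self):
        # fragment.identifier -> set of fragment indexes seen so far
        self._seen = defaultdict(set)

    def offer(self, attributes):
        """Record one fragment; return True once its group is complete."""
        group = attributes["fragment.identifier"]
        expected = int(attributes["fragment.count"])
        self._seen[group].add(int(attributes["fragment.index"]))
        if len(self._seen[group]) == expected:
            del self._seen[group]  # group finished, release it
            return True
        return False


tracker = DefragmentTracker()
# Three fragments of one hypothetical split, arriving out of order.
results = [
    tracker.offer(
        {"fragment.identifier": "abc", "fragment.index": str(i), "fragment.count": "3"}
    )
    for i in (2, 0, 1)
]
print(results)
```

Only the arrival of the last fragment returns True, which corresponds to the single merged flow file that would trigger the second ExecuteSQL.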

How to wait for GenerateTableFetch queries to finish

My use case is like this. I have some X tables to be pulled from MySQL. I am splitting them using SplitText to put each table in an individual flow file, and pulling them using GenerateTableFetch and ExecuteSQL.
I want to be notified, or to trigger some other action, when the import is done for all the tables. At the SplitText processor I have routed the original relationship to Wait on ${filename} with target count ${fragment.count}. This will track how many tables are done.
But now I am not able to figure out how to know when a particular table is done. GenerateTableFetch forks a flow file into multiple flow files based on Partition Size, but it does not write attributes like fragment.count that I could use to wait on for each table.
Is there a way I can achieve this? Or maybe is there a way to know at the end of the entire flow if all flow files in the flow have been processed and nothing is in queue or being processed?
If you have a standalone instance of NiFi (or are not distributing the flow files among a cluster to ExecuteSQL nodes), then you could use QueryDatabaseTable instead. By default it will only issue its flow files once the entire result set has been processed. If you have all the rows go into a single flow file, then the fact that the flow file has been transferred downstream is an indication that the fetch is complete.
I have written NIFI-5601 to cover the improvement of adding fragment.* attributes to flow files generated by GTF.
Until NiFi adds support for this, I managed to make it work using MergeContent. Use table_name as the Correlation Attribute Name and then route the merged relationship to the Wait processor using ${merge.count} as the target. Refer to the screenshots if you are looking to do the same.

Creating larger NiFi flow files when using the ConsumeKafka processor

I've created a simple NiFi pipeline that reads a stream of data from a Kafka topic (using ConsumeKafka) and writes it to the HDFS (using PutHDFS). Currently, I'm seeing lots of small files being created on the HDFS. A new file is created about once a second, some with only one or two records.
I want fewer, larger files to be written to the HDFS.
I have the following settings in ConsumeKafka:
Message Demarcator = <new line>
Max Poll Records = 10000
Max Uncommitted Time = 20s
In the past I've used Flume instead of NiFi, and it has batchSize and batchDurationMillis, which allow me to tweak how big HDFS files are. It seems like ConsumeKafka in NiFi is missing a batchDurationMillis equivalent.
What's the solution in NiFi?
Using the Message Demarcator and Max Poll Records is the correct approach to get multiple messages per flow file. You may want to slow down the ConsumeKafka processor by adjusting the Run Schedule (on the scheduling tab) from 0 sec which means run as fast as possible, to something like 1 second or whatever makes sense for you to grab more data.
Even with the above, you would likely still want to put a MergeContent processor before PutHDFS and merge flow files together based on size, so that you can wait until you have the appropriate amount of data before writing to HDFS.
How to use MergeContent will depend on the type of data you are merging... If you have Avro, there is a specific merge strategy for Avro. If you have JSON you can merge them one after another, or you can wrap them with a header, footer, and demarcator to make a valid JSON array.
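For the JSON case, the effect of wrapping with a header, footer, and demarcator can be sketched in Python. This only models the text concatenation, not MergeContent itself; the function name `merge_as_json_array` and the sample documents are made up for illustration.

```python
import json


def merge_as_json_array(documents, header="[", demarcator=",", footer="]"):
    """Join JSON documents into one JSON array using header/demarcator/footer."""
    return header + demarcator.join(documents) + footer


# Three hypothetical flow files, each holding one JSON object.
flow_files = ['{"id": 1}', '{"id": 2}', '{"id": 3}']
merged = merge_as_json_array(flow_files)

# The merged text parses as a single valid JSON array.
records = json.loads(merged)
print(merged)
```

Without the header and footer, the same concatenation would give objects placed one after another, which is fine for line-oriented consumers but is not a single valid JSON document.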

Pull Data from Hive to SQL Server without duplicates using Apache Nifi

Sorry, I'm new to Apache NiFi. I made a data flow that pulls data from Hive and stores it in SQL Server. There is no error in my data flow; the only problem is that it's pulling the data repeatedly.
My data flow consists of the following:
SelectHiveQL
SplitAvro
ConvertAvroToJSON
ConvertJSONToSQL
PutSQL
For example, my table in Hive has only 20 rows, but when I run the data flow and check my table in MS SQL, it has saved 5,000 rows. SelectHiveQL pulled the data repeatedly.
What do I need to do so that it will pull only 20 rows, or just the exact number of rows in my Hive table?
Thank you
SelectHiveQL (like many NiFi processors) runs on a user-specified schedule. To get a processor to run only once, you can set the run schedule to something like 30 sec, then start and immediately stop the processor. The processor will be triggered once, and stopping it does not interrupt the current execution; it just prevents it from being scheduled again.
Another way might be to set the run schedule to something very large, such that it would only execute once per some very long time interval (days, years, etc.)
