Nifi joins using ExecuteSQL for larger tables - apache-nifi

I am trying to Join multiple tables using NiFi. The datasource may be MySQL or RedShift maybe something else in future. Currently, I am using ExecuteSQL processor for this but the output is in a Single flowfile. Hence, for terabyte of data, this may not be suitable. I have also tried using generateTableFetch but this doesn't have join option.
Here are my Questions:
Is there any alternative for ExecuteSQL processor?
Is there a way to make ExecuteSQL processor output in multiple flowfiles? Currently I can split the output of ExecuteSQL using SplitAvro processor. But I want ExecuteSQL itself splitting the output
GenerateTableFetch generates SQL queries based on offset. Will this slows down the process when the dataset becomes larger?
Please share your thoughts. Thanks in advance

1.Is there any alternative for ExecuteSQL processor?
if you are joining multiple tables then we need to use ExecuteSQL processor.
2.Is there a way to make ExecuteSQL processor output in multiple flowfiles? Currently I can split the output of ExecuteSQL using SplitAvro processor. But I want ExecuteSQL itself splitting the output ?
Starting from NiFi-1.8 version we can configure Max Rows for flowfile, so that ExecuteSQL processor splits the flowfiles.
NiFi-1251 addressing this issue.
3.GenerateTableFetch generates SQL queries based on offset. Will this slows down the process when the dataset becomes larger?
if your source table is having indexes on the Maximum-value Columns then it won't slow down the process even if your dataset is becoming larger.
if there is no indexes created on the source table then there will be full table scan will be done always, which results slow down the process.

Related

what is the difference between executesql and executesqlrecord

In Nifi, why do we have executesql if we have executesqlrecord?
Is there any difference between executesql and executesqlrecord other than that the first produces only Avro and the second gives more options for the produced flowfiles?
Is there any performance preferences between them? for example executesql executes in a batch mode and the executesqlrecord executes in a row-by-row mode?
Both processors are very similar. See here Difference Between ExecuteSQL and ExecuteSQLRecord
ExecuteSQL was created first.
ExecuteSQLRecord was added later after the Records feature was introduced to NiFi.
ExecuteSQL was never removed. It provides a user with options & maintains backwards compatability for people still using ExecuteSQL.

Looking for an Equivalent of GenerateTableFetch

I use ExecuteSQLRecord to run a query and write to CSV format. The table has 10M rows. Although I can split the output into multiple flow files, the query is executed by only a single thread and is very slow.
Is there a way to partition the query into multiple queries so that the next processor can run multiple concurrent tasks, each one process one partition? It would be like:
GenerateTableFetch -> ExecuteSQLRecord (with concurrent tasks)
The problem is that GenerateTableFetch only accepts table name as input. It does not accept customized queries.
Please advise if you have solutions. Thank you in advance.
You can increase the concurrency on Nifi processors (by increase the number in Councurrent Task), you can also increase the throughput, some time it works :
Also if you are working on the cluster, before the processor, you can apply load balancing on the queue, so it will distribute the workload among the nodes of your cluster (load balance strategy, put to round robin):
Check this, youtube channel, for Nifi antipatterns (there is a video on concurrency): Nifi Notes
Please clarify your question, if I didn't answer it.
Figured out an alternative way. I developed a Oracle PL/SQL function which takes table name as an argument, and produces a series of queries like "SELECT * FROM T1 OFFSET x ROWS FETCH NEXT 10000 ROWS ONLY". The number of queries is based on the number of rows of the table, which is a statistics number in the catalog table. If the table has 1M rows, and I want to have 100k rows in each batch, it will produces 10 queries. I use ExecuteSQLRecord to call this function, which effectively does the job of NiFi processor GenerateTableFetch. My next processor (e.g. ExecuteSQLRecord again) can now have 10 concurrent tasks working in parallel.

How to run a processor only when ahother proccessor has finished its execution?

I'm migrating a table (2 millions of rows) from DB2 to SQL Server. I'm using the next flow:
ExecuteSQL (to select records from the Db2 table).
SplitAvro (to split the records. I configured it with Output Size = 1 to control the case that if one fails the rest is inserted without problems.
PutDataBaseRecord (to insert the records in the SQL Server table).
ExecuteSQL (I need to call a stored procedure that executes update sentences against the same table that PutDataBaseRecord is working to).
The problem is the second ExecuteSQL is running before PutDataBaseRecord complete the insertion of all records.
How can I tell nifi to run that processor only when the other one finishes?
Thanks in advance!
After PutDatabaseRecord you can use MergeContent in Defragment mode to undo the split operation performed by SplitAvro. This way a single flow file will come out of MergeContent only when all splits have been seen, and at that point you know its time to for the second ExecuteSQL to run.
The answer provided by #bryan-bende is great, as it is simple and elegant. If that doesn't work for some reason, you could also look at Wait/Notify. Having said that, Bryan's answer is simpler and probably more robust.

How to order by fetch data in NIFI QueryDataBaseTable processor

How to guarantee data sequence every time when fetching delta table by NiFi QueryDataBaseTable processor. The table has an incremental field called "SEQNUM". And set up the "Maximum-value Columns" by "SEQNUM" in QueryDataBaseTable processor. Has any method to order by fetching delta table?
Once you got the result flowfile from QueryDatabaseTable processor
Then use QueryRecord processor add new sql query with order by clause in it.
By using QueryRecord processor we are making sure the order of seqnum in each flowfile is arranged either asc/desc.
if you are having more than one flowfile as result of QueryDatabaseTable then by using MergeRecord processor merge the flowfiles into one then connect the merged connection to QueryRecord processor for ordering the data in flowfile (but this is not optimal way instead of NiFi consider Hive for these kind of heavy lifts).
Refer this and this links for more details regards to QueryRecord processor.

Bulk loading Avro to HBase with NiFi

I'm ingesting flowfiles containing Avro records with NiFi, and need to insert them into HBase. These flowfiles vary in size, but some have 10,000,000+ records. I use SplitAvro twice (one to split to 10,000 recs, then one to split to 1 rec), then use an ExecuteScript processor to pull out the row key for HBase and add it as a flowfile attribute. Finally I use PutHBaseCell (with a batch size of 10,000) to write to HBase using the row key attribute..
The processor that splits the Avro to 1 rec is very slow (Concurrent tasks is set to 5). Is there a way to speed that up? And is there a better way to load this Avro data into HBase?
(I am using a 2 node NiFi (v1.2) cluster (made from VMs), each node has 16 CPUs and 16GB RAM.)
There is a new PutHBaseRecord processor that will be part of the next release (there is a 1.4.0 release being voted upon right now).
With this processor you would avoid ever splitting your flow files, and you just send a flow file will millions of Avro records right to PutHBaseRecord, and PutHBaseRecord would be configured with an Avro reader.
You should get significantly better performance with this approach.

Resources