How to run a processor only when another processor has finished its execution? - apache-nifi

I'm migrating a table (2 million rows) from DB2 to SQL Server. I'm using the following flow:
ExecuteSQL (to select records from the DB2 table).
SplitAvro (to split the records; I configured it with Output Size = 1 so that if one record fails, the rest are still inserted without problems).
PutDatabaseRecord (to insert the records into the SQL Server table).
ExecuteSQL (I need to call a stored procedure that executes update statements against the same table that PutDatabaseRecord is writing to).
The problem is that the second ExecuteSQL runs before PutDatabaseRecord completes the insertion of all records.
How can I tell NiFi to run that processor only when the other one finishes?
Thanks in advance!

After PutDatabaseRecord you can use MergeContent in Defragment mode to undo the split operation performed by SplitAvro. This way a single flow file will come out of MergeContent only when all splits have been seen, and at that point you know it's time for the second ExecuteSQL to run.
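The Defragment behavior can be sketched in plain Python (illustrative only, not NiFi's actual implementation). SplitAvro stamps each split with `fragment.identifier`, `fragment.index`, and `fragment.count` attributes, and Defragment holds splits until a complete group has arrived:

```python
# Minimal sketch of MergeContent's Defragment logic (illustrative only).
# Splits are held in bins keyed by fragment.identifier; a merged flowfile
# is released only once all fragment.count splits have been seen.

def defragment(flowfiles):
    """Group split flowfiles and release one merged result per complete group."""
    bins = {}
    merged = []
    for ff in flowfiles:
        key = ff["fragment.identifier"]
        group = bins.setdefault(key, {})
        group[ff["fragment.index"]] = ff
        if len(group) == ff["fragment.count"]:  # all splits have arrived
            ordered = [group[i] for i in sorted(group)]
            merged.append({"fragment.identifier": key, "splits": ordered})
            del bins[key]
    return merged

splits = [
    {"fragment.identifier": "q1", "fragment.index": i, "fragment.count": 3}
    for i in range(3)
]
print(len(defragment(splits)))      # 1 -- released once all 3 splits are seen
print(len(defragment(splits[:2])))  # 0 -- incomplete group stays held
```

This is why routing MergeContent's merged relationship to the second ExecuteSQL guarantees the stored procedure only fires after every insert has passed through.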

The answer provided by @bryan-bende is great, as it is simple and elegant. If that doesn't work for some reason, you could also look at Wait/Notify. Having said that, Bryan's answer is simpler and probably more robust.

Related

How to increase the performance of insert data from mongo to greenplum with PDI(kettle)?

I use PDI (Kettle) to extract data from MongoDB to Greenplum. When I tested extracting from MongoDB to a file, it was fast, about 10,000 rows per second. But extracting into Greenplum runs at only about 130 rows per second.
I modified the following Greenplum parameters, but there was no significant improvement.
gpconfig -c log_statement -v none
gpconfig -c gp_enable_global_deadlock_detector -v on
And if I increase the number of output tables, the transformation seems to hang and no data is inserted for a long time. I don't know why.
How can I increase the performance of inserting data from Mongo to Greenplum with PDI (Kettle)?
Thank you.
There are a variety of factors that could be at play here.
Is PDI loading via an ODBC or JDBC connection?
What is the size of data? (row count doesn't really tell us much)
What is the size of your Greenplum cluster (# of hosts and # of segments per host)?
Is the table you are loading into indexed?
What is the network connectivity between Mongo and Greenplum?
The best bulk-load performance using data integration tools such as PDI, Informatica PowerCenter, IBM DataStage, etc., will be achieved using Greenplum's native bulk loading utilities, gpfdist and gpload.
Greenplum loves batches.
a) You can modify the batch size in the transformation with Nr rows in rowset.
b) You can modify the commit size in the table output.
I think a and b should match.
Find your optimum values. (For example, we use 1000 for rows with big JSON objects inside.)
Now, use the following connection property:
reWriteBatchedInserts=true
It rewrites the SQL from individual inserts to a batched insert. It increased insert performance ten times in my scenario.
https://jdbc.postgresql.org/documentation/94/connect.html
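Conceptually, what the PostgreSQL JDBC driver does with reWriteBatchedInserts=true can be sketched like this (illustrative only; the real rewrite happens inside the driver): a batch of identical single-row INSERTs is collapsed into one multi-values INSERT, cutting per-row round trips.

```python
# Sketch of the rewrite performed by reWriteBatchedInserts (illustrative).
# N executions of a single-row parameterized INSERT become one
# multi-values INSERT, so the server processes the batch in one statement.

def rewrite_batched_insert(sql, batch):
    """Turn 'INSERT INTO t (a, b) VALUES (?, ?)' executed len(batch) times
    into a single multi-values statement."""
    head, values = sql.split(" VALUES ", 1)
    return head + " VALUES " + ", ".join([values] * len(batch))

batch = [(1, "x"), (2, "y"), (3, "z")]
sql = "INSERT INTO t (a, b) VALUES (?, ?)"
print(rewrite_batched_insert(sql, batch))
# INSERT INTO t (a, b) VALUES (?, ?), (?, ?), (?, ?)
```

This is why the property only helps when the statements in a batch are identical apart from their parameters, which is exactly the shape a table-output step produces.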
Thank you guys!

Looking for an Equivalent of GenerateTableFetch

I use ExecuteSQLRecord to run a query and write to CSV format. The table has 10M rows. Although I can split the output into multiple flow files, the query is executed by only a single thread and is very slow.
Is there a way to partition the query into multiple queries so that the next processor can run multiple concurrent tasks, each one processing one partition? It would be like:
GenerateTableFetch -> ExecuteSQLRecord (with concurrent tasks)
The problem is that GenerateTableFetch only accepts a table name as input; it does not accept custom queries.
Please advise if you have solutions. Thank you in advance.
You can increase the concurrency on NiFi processors (by increasing the number of Concurrent Tasks); you can also increase the throughput. Sometimes it works.
Also, if you are working on a cluster, you can apply load balancing on the queue before the processor, so it will distribute the workload among the nodes of your cluster (set the load balance strategy to Round Robin).
Check this YouTube channel for NiFi antipatterns (there is a video on concurrency): NiFi Notes
Please clarify your question if I didn't answer it.
Figured out an alternative way. I developed an Oracle PL/SQL function which takes a table name as an argument and produces a series of queries like "SELECT * FROM T1 OFFSET x ROWS FETCH NEXT 10000 ROWS ONLY". The number of queries is based on the table's row count, which is a statistic in the catalog table. If the table has 1M rows and I want 100k rows in each batch, it produces 10 queries. I use ExecuteSQLRecord to call this function, which effectively does the job of the GenerateTableFetch processor. My next processor (e.g. ExecuteSQLRecord again) can now have 10 concurrent tasks working in parallel.
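The query-generation step of that approach can be sketched in Python (hypothetical function and table names, shown only to illustrate the partitioning scheme the PL/SQL function implements):

```python
# Sketch: emit one OFFSET/FETCH query per partition of a table, so a
# downstream processor can run them with concurrent tasks. Table name
# and row count are illustrative placeholders.
import math

def partition_queries(table, row_count, batch_size):
    """Generate ceil(row_count / batch_size) pagination queries."""
    n = math.ceil(row_count / batch_size)
    return [
        f"SELECT * FROM {table} OFFSET {i * batch_size} ROWS "
        f"FETCH NEXT {batch_size} ROWS ONLY"
        for i in range(n)
    ]

queries = partition_queries("T1", 1_000_000, 100_000)
print(len(queries))  # 10
print(queries[0])
# SELECT * FROM T1 OFFSET 0 ROWS FETCH NEXT 100000 ROWS ONLY
```

Each generated query becomes its own flowfile, which is what lets the next ExecuteSQLRecord fan the work out across concurrent tasks.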

Nifi joins using ExecuteSQL for larger tables

I am trying to join multiple tables using NiFi. The data source may be MySQL or Redshift, or maybe something else in the future. Currently I am using the ExecuteSQL processor for this, but the output is a single flowfile. Hence, for terabytes of data, this may not be suitable. I have also tried using GenerateTableFetch, but it doesn't have a join option.
Here are my Questions:
Is there any alternative for ExecuteSQL processor?
Is there a way to make the ExecuteSQL processor output multiple flowfiles? Currently I can split the output of ExecuteSQL using the SplitAvro processor, but I want ExecuteSQL itself to split the output.
GenerateTableFetch generates SQL queries based on an offset. Will this slow down the process when the dataset becomes larger?
Please share your thoughts. Thanks in advance
1. Is there any alternative for the ExecuteSQL processor?
If you are joining multiple tables, then you need to use the ExecuteSQL processor.
2. Is there a way to make the ExecuteSQL processor output multiple flowfiles?
Starting with NiFi 1.8, you can configure Max Rows Per Flow File, so that the ExecuteSQL processor itself splits the output into multiple flowfiles.
NIFI-1251 addresses this issue.
3. GenerateTableFetch generates SQL queries based on an offset. Will this slow down the process when the dataset becomes larger?
If your source table has indexes on the Maximum-value Columns, it won't slow down the process even as your dataset becomes larger.
If there are no indexes on the source table, a full table scan will be done every time, which slows down the process.

Pull Data from Hive to SQL Server without duplicates using Apache Nifi

Sorry, I'm new to Apache NiFi. I made a data flow that pulls data from Hive and stores it in SQL Server. There is no error in my data flow; the only problem is that it pulls the data repeatedly.
My data flow consists of the following:
SelectHiveQL
SplitAvro
ConvertAvroToJson
ConvertJsonTOSQL
PutSQL
For example, my table in Hive has only 20 rows, but when I run the data flow and check my table in MS SQL, it has saved 5,000 rows. SelectHiveQL pulled the data repeatedly.
What do I need to do so it will only pull 20 rows, or exactly the number of rows in my Hive table?
Thank you
SelectHiveQL (like many NiFi processors) runs on a user-specified schedule. To get a processor to only run once, you can set the run schedule to something like 30 sec, then start and immediately stop the processor. The processor will be triggered once, and stopping it does not interrupt that current execution, it just causes it not to be scheduled again.
Another way might be to set the run schedule to something very large, such that it would only execute once per some very long time interval (days, years, etc.)

How does PutHiveQL works on batch?

I am trying to send multiple insert statements to PutHiveQL via the ReplaceText processor. Each insert statement is a flowfile coming out of ReplaceText. I set the Batch Size in PutHiveQL to 100. However, it seems it still executes one statement at a time. How can I best implement batching?
I don't think the PutHiveQL processor batches statements at the JDBC layer as you expect, not in the way that processors like PutSQL do. From the code, it looks like the Batch Size property is used to control how many flowfiles the processor works on before yielding, but the statements for each flowfile are still executed individually.
That might be a good topic for a NiFi feature request.
The version of Hive supported by NiFi doesn't allow for batching/transactions. The Batch Size parameter is meant to try to move multiple incoming flow files a bit faster than having the processor invoked every so often. So if you schedule the PutHiveQL processor for every 5 seconds with a Batch Size of 100, then every 5 seconds (if there are 100 flow files queued), the processor will attempt to process those during one "session".
Alternatively you can specify a Batch Size of 0 or 1 and schedule it as fast as you like; unfortunately this will have no effect on the Hive side of things, as that version of Hive auto-commits each HiveQL statement and does not support transactions or batching.
Another (possibly more performant) alternative is to put the entire set of rows as a CSV file into HDFS and use the HiveQL "LOAD DATA" DML statement to create a table on top of the data: https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-DMLOperations
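A minimal sketch of that alternative (the HDFS path and table name below are hypothetical placeholders; the LOAD DATA syntax is from the Hive DML documentation linked above): write the rows out as a delimited file, put it in HDFS, then issue a single LOAD DATA statement instead of many per-row inserts.

```python
# Sketch: build the delimited payload locally and emit the HiveQL
# LOAD DATA statement. Path and table name are placeholders only;
# the actual HDFS upload (e.g. via a PutHDFS processor) is omitted.
import csv, io

def rows_to_csv(rows):
    """Serialize rows to CSV text suitable for an HDFS staging file."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()

def load_data_stmt(hdfs_path, table):
    """Build the single HiveQL statement that ingests the staged file."""
    return f"LOAD DATA INPATH '{hdfs_path}' INTO TABLE {table}"

payload = rows_to_csv([(1, "a"), (2, "b")])
stmt = load_data_stmt("/staging/rows.csv", "my_table")
print(stmt)  # LOAD DATA INPATH '/staging/rows.csv' INTO TABLE my_table
```

In a NiFi flow this maps naturally to MergeContent (to batch the rows), PutHDFS (to stage the file), and one PutHiveQL invocation for the LOAD DATA statement.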
