I have a NiFi flow which inserts some data into some tables. After I insert data into a table, I send a signal, and then ExecuteSQL runs an aggregation query on that table. The table names are based on the file names.
The thing is that when ExecuteSQL runs the query, I only get a subset of the result. If I run the same query in the database's console, I get a different number of rows back.
Could this be a problem that has to do with the Event Driven scheduling strategy?
If ExecuteSQL is stopped and the flowfile (the signal) is sitting in ExecuteSQL's queue, and I then start ExecuteSQL manually, I get back the expected result.
If you are running multiple inserts (using PutSQL for example) and you wish to run ExecuteSQL only after all of them are finished, and the order in which they finish is not deterministic, you might try one of these two approaches:
MergeContent - use a MergeContent processor after PutSQL, setting the Minimum Number of Entries and/or Max Bin Age to trigger when the inserts are finished. You can route the merged relationship to ExecuteSQL.
MonitorActivity - use a MonitorActivity processor to monitor the flow of output from PutSQL and trigger an inactive alert after a configured time period. You would route the inactive relationship to ExecuteSQL to run the aggregate query.
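For the MonitorActivity approach, the configuration would look roughly like this (the 1 min threshold is only an illustrative value; pick something comfortably longer than the gap between individual inserts):

MonitorActivity
    Threshold Duration        : 1 min   (no output from PutSQL for 1 min => flow considered inactive)
    Continually Send Messages : false   (emit a single inactivity marker, not one per scheduled run)

Route the 'inactive' relationship to ExecuteSQL; the 'success' relationship simply passes the PutSQL output through.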
Related
I use ExecuteSQLRecord to run a query and write to CSV format. The table has 10M rows. Although I can split the output into multiple flow files, the query is executed by only a single thread and is very slow.
Is there a way to partition the query into multiple queries so that the next processor can run multiple concurrent tasks, each one processing one partition? It would be like:
GenerateTableFetch -> ExecuteSQLRecord (with concurrent tasks)
The problem is that GenerateTableFetch only accepts a table name as input. It does not accept customized queries.
Please advise if you have solutions. Thank you in advance.
You can increase the concurrency on NiFi processors (by increasing the number of Concurrent Tasks), and you can also increase the throughput; sometimes it works.
Also, if you are working on a cluster, you can apply load balancing on the queue before the processor, so it will distribute the workload among the nodes of your cluster (set the Load Balance Strategy to Round Robin).
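Concretely, that is done in the connection's configuration dialog; illustrative settings:

Connection > SETTINGS
    Load Balance Strategy    : Round robin
    Load Balance Compression : Do not compress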
Check the Nifi Notes YouTube channel for NiFi antipatterns (there is a video on concurrency).
Please clarify your question if I didn't answer it.
Figured out an alternative way. I developed an Oracle PL/SQL function which takes a table name as an argument and produces a series of queries like "SELECT * FROM T1 OFFSET x ROWS FETCH NEXT 10000 ROWS ONLY". The number of queries is based on the number of rows of the table, which is a statistic in the catalog table. If the table has 1M rows and I want 100k rows in each batch, it will produce 10 queries. I use ExecuteSQLRecord to call this function, which effectively does the job of the NiFi processor GenerateTableFetch. My next processor (e.g. ExecuteSQLRecord again) can now have 10 concurrent tasks working in parallel.
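A minimal sketch of such a function (the names gen_fetch_queries and t_query_list here are illustrative, not the exact code):

CREATE OR REPLACE TYPE t_query_list AS TABLE OF VARCHAR2(4000);
/

CREATE OR REPLACE FUNCTION gen_fetch_queries (
    p_table_name IN VARCHAR2,
    p_batch_size IN NUMBER DEFAULT 100000
) RETURN t_query_list PIPELINED
IS
    v_num_rows NUMBER;
BEGIN
    -- NUM_ROWS is an optimizer statistic, so it is only as fresh
    -- as the last DBMS_STATS gather on the table.
    SELECT NVL(num_rows, 0)
      INTO v_num_rows
      FROM user_tables
     WHERE table_name = UPPER(p_table_name);

    -- One query per batch: 1M rows / 100k per batch -> 10 queries.
    -- (An ORDER BY inside each query would make the pages deterministic.)
    FOR i IN 0 .. CEIL(v_num_rows / p_batch_size) - 1 LOOP
        PIPE ROW ('SELECT * FROM ' || p_table_name ||
                  ' OFFSET ' || (i * p_batch_size) ||
                  ' ROWS FETCH NEXT ' || p_batch_size || ' ROWS ONLY');
    END LOOP;
    RETURN;
END gen_fetch_queries;
/

ExecuteSQLRecord can then run something like SELECT COLUMN_VALUE AS query FROM TABLE(gen_fetch_queries('T1')); each returned row is one batch query for the downstream processor's concurrent tasks.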
NiFi || Combining flowfiles coming from multiple PutSQL processors and connecting with another process group.
We are doing the calculation in two parts:
First process group - inserting data into a table.
Second process group - doing the calculation on the inserted data.
I want to connect both flows so that, in case of any issue, no overall calculation takes place, and they both run in one go. Currently I am scheduling them separately.
I tried MergeContent but nothing worked.
The 'Success' relationship from PutSQL contains the records that were successfully updated in the database.
Add an 'Input Port' to your 'Overall Calculation' group, then drag the 'Success' relationship from each PutSQL onto the 'Overall Calculation' group and connect them all to the same 'Input Port'.
I used MergeContent with a ReplaceText processor and it worked.
Thanks everyone
I am exploring NiFi. So far I have created a process group with a number of processors which basically select data from an Oracle DB and insert it into MongoDB. The flow works as expected.
The flow is QueryDatabaseTable -> SplitAvro -> ConvertAvroToJSON -> PutMongoRecord
In QueryDatabaseTable I have the custom query select * from employee, which gives me 100 records, and these 100 records are inserted into MongoDB. But the issue is that QueryDatabaseTable is called again and again, so the same 100 records get added to MongoDB again and again. Is there any way to stop this repeated execution? Thanks in advance.
Update: I am using NiFi 1.9.2
PFB the QueryDatabaseTable settings below (screenshots of the Scheduling and Properties tabs).
Update 2: (screenshot of the Configuration)
Use Maximum-value Columns if you want to prevent duplicate selection.
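Roughly what that does (assuming employee has an ever-increasing column, say emp_id; the column name here is my assumption, not from your schema): the processor stores the largest value it has seen in its state and appends a predicate on each run, so instead of repeating the full select it effectively issues:

-- first run (no state yet): returns all 100 rows
SELECT * FROM employee;

-- subsequent runs (stored maximum was 100): returns only newly added rows
SELECT * FROM employee WHERE emp_id > 100;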
Sorry, I'm new to Apache NiFi. I made a data flow for pulling data from Hive and storing it in SQL. There is no error in my data flow; the only problem is that it pulls the data repeatedly.
My data flow consists of the following:
SelectHiveQL
SplitAvro
ConvertAvroToJSON
ConvertJSONToSQL
PutSQL
For example, my table in Hive has only 20 rows, but when I run the data flow and check my table in MS SQL, it has saved 5,000 rows. SelectHiveQL pulled the data repeatedly.
What do I need to do so it will only pull 20 rows, or exactly the number of rows in my Hive table?
Thank you
SelectHiveQL (like many NiFi processors) runs on a user-specified schedule. To get a processor to only run once, you can set the run schedule to something like 30 sec, then start and immediately stop the processor. The processor will be triggered once, and stopping it does not interrupt that current execution, it just causes it not to be scheduled again.
Another way might be to set the run schedule to something very large, such that it would only execute once per some very long time interval (days, years, etc.)
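For reference, the first approach amounts to this on the processor's Scheduling tab (30 sec is just an example value):

Scheduling Strategy : Timer driven
Run Schedule        : 30 sec
(start the processor, let it trigger once, then stop it)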
I am using the QueryDatabaseTable processor in NiFi to incrementally get data from a DB2 database. QueryDatabaseTable is scheduled to run every 5 minutes. Maximum-value Columns is set to "rep" (which corresponds to a date in the DB2 db).
I have a separate MySQL database I want to update with the value of "rep" that QueryDatabaseTable uses to query the DB2 database. How can I get this value?
In the log files I've found that the attributes of the FlowFiles do not contain this value.
QueryDatabaseTable doesn't currently accept incoming flow files or allow the use of Expression Language to define the table name. I've written up an improvement Jira to handle this:
https://issues.apache.org/jira/browse/NIFI-2340