I am exploring NiFi. So far I have created a process group with a number of processors that select data from an Oracle DB and insert it into MongoDB. The flow works as expected.
The flow is QueryDatabaseTable -> SplitAvro -> ConvertAvroToJSON -> PutMongoRecord
In QueryDatabaseTable I have the custom query select * from employee, which returns 100 records, and these 100 records are inserted into MongoDB. The issue is that QueryDatabaseTable is triggered again and again, so the same 100 records keep getting added to MongoDB. Is there any way to stop this repeated execution? Thanks in advance.
Update: I am using NiFi 1.9.2.
Please find the QueryDatabaseTable Scheduling and Properties settings in the attached screenshots.
Update 2: configuration screenshot attached.
Use Maximum-value Columns if you want to prevent selecting duplicates.
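For example, assuming the employee table has a strictly increasing column such as an auto-incrementing EMP_ID or a last-updated timestamp (hypothetical column names), a configuration along these lines fetches each row only once:

    Table Name:            EMPLOYEE
    Maximum-value Columns: EMP_ID

On the first run the processor fetches all rows and stores the largest EMP_ID it saw in its state; on subsequent runs it effectively adds WHERE EMP_ID > <stored maximum> to the query, so already-loaded rows are not selected again. The stored value can be inspected or cleared via the processor's View State dialog.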
I am using the QueryDatabaseTable processor to do an incremental batch update to BigQuery. The Oracle database table keeps growing at a rate of about 5 new rows per minute.
Flow: QueryDatabaseTable -> ConvertAvroToJSON -> PutBigQueryBatchUpdate
I ran this flow with a schedule of 10 minutes; the query returns about 2000 rows.
QueryDatabaseTable processor configuration I have modified:
Table Name, Additional WHERE clause, Maximum-value Columns.
QueryDatabaseTable is supposed to fetch only rows beyond the maximum column value visible in 'View State', but my setup simply returns the entire result of the query every time.
After each query the maximum value of the column is updated to the latest maximum value.
The maximum value of the column contains a Date.
I have also tried running after clearing the state, and with Maximum-value Columns left empty; same result.
What am I missing?
Additional info:
The QueryDatabaseTable config also has the following property, which I think is related to this issue:
Transaction Isolation Level : No value set
QueryDatabaseTable did not work if I gave just the table name.
Removing the Additional WHERE clause property and using a Custom Query instead made the processor work as intended.
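A minimal sketch of that change, assuming the incremental column is called REP_DATE and the original filter was on a STATUS column (both hypothetical names): instead of setting Table Name plus Additional WHERE clause, move the filter into the Custom Query property,

    Custom Query:          SELECT * FROM MY_TABLE WHERE STATUS = 'ACTIVE'
    Maximum-value Columns: REP_DATE

QueryDatabaseTable then wraps the custom query and appends its own REP_DATE > <stored maximum> condition on subsequent runs, which is what gives the incremental behaviour.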
I'm working in NiFi with PutDatabaseRecord to insert the data of a CSV file into a database table. Everything goes well in the first execution because there is no data in the table. Then I modify the file so it contains new records as well as existing ones. PutDatabaseRecord fails because of the existing records (primary key constraint) and doesn't insert the new records.
Is there any way to configure the processor so that it inserts the new records and ignores the ones that fail?
I attached pictures of how my processor is configured.
Thanks in advance!
(Screenshots: NiFi flow, PutDatabaseRecord configuration)
This is possible. However it is not a straightforward implementation.
I would suggest you to try the following flow - ListFile -> FetchFile -> SplitRecord -> PutDatabaseRecord.
In SplitRecord processor, set 'Records per Split' property to '1'.
The SplitRecord processor splits the input flow file into multiple small flow files (one file per row in our case, because of the setting 'Records per Split = 1'). These individual flow files are then routed to the 'splits' relationship, i.e. to the PutDatabaseRecord processor in our flow.
PutDatabaseRecord inserts the new records into the table and fails only for the existing ones.
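A minimal sketch of the relevant settings (the reader/writer controller-service names are assumptions; use whatever matches your CSV):

    SplitRecord
      Record Reader:      CSVReader
      Record Writer:      CSVRecordSetWriter
      Records Per Split:  1
      'splits' relationship -> PutDatabaseRecord

    PutDatabaseRecord
      Record Reader:      CSVReader
      Statement Type:     INSERT
      Database Connection Pooling Service: (your DBCPConnectionPool)
      'failure' relationship -> auto-terminate or a logging branch, so duplicate-key failures don't block the queue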
I tested the flow with a GetFile processor and it works. Hope this solves your problem.
I have a NiFi flow which inserts some data into some tables. After I insert some data into a table, I send a signal and then ExecuteSQL runs an aggregation query on that table. The table names are based on the file names.
The thing is that when ExecuteSQL runs the query, I only get a subset of the result. If I run the same query in the database's console, I get a different number of rows back.
Could this be a problem related to the Event Driven scheduling strategy?
If ExecuteSQL is stopped and the flow file (the signal) is waiting in ExecuteSQL's queue, and I then start ExecuteSQL manually, I get back the expected result.
If you are running multiple inserts (using PutSQL for example) and you wish to run ExecuteSQL only after all of them are finished, and the order in which they finish is not deterministic, you might try one of these two approaches:
MergeContent - use a MergeContent processor after PutSQL, setting the Minimum Number of Entries and/or Max Bin Age to trigger when the inserts are finished. You can route the merged relationship to ExecuteSQL.
MonitorActivity - use a MonitorActivity processor to monitor the flow of output from PutSQL and trigger an inactive alert after a configured time period. You would route the inactive relationship to ExecuteSQL to run the aggregate query.
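A rough sketch of both options (the numbers are assumptions to tune for your flow):

    Option 1: MergeContent after PutSQL's success relationship
      Minimum Number of Entries: 10      (the number of insert flow files you expect)
      Max Bin Age:               5 min   (safety net if fewer arrive)
      'merged' relationship -> ExecuteSQL

    Option 2: MonitorActivity after PutSQL's success relationship
      Threshold Duration:        2 min   (no output from PutSQL for 2 minutes means the inserts are done)
      'inactive' relationship -> ExecuteSQL
      'success' relationship  -> auto-terminate or downstream processing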
Sorry, I'm new to Apache NiFi. I made a data flow that pulls data from Hive and stores it in SQL. There is no error in my data flow; the only problem is that it pulls the data repeatedly.
My data flow consists of the following:
SelectHiveQL
SplitAvro
ConvertAvroToJSON
ConvertJSONToSQL
PutSQL
For example, my table in Hive has only 20 rows, but when I run the data flow and check my table in MS SQL, it has saved 5,000 rows. SelectHiveQL pulled the data repeatedly.
What do I need to do so it pulls only 20 rows, i.e. exactly the number of rows in my Hive table?
Thank you
SelectHiveQL (like many NiFi processors) runs on a user-specified schedule. To get a processor to only run once, you can set the run schedule to something like 30 sec, then start and immediately stop the processor. The processor will be triggered once, and stopping it does not interrupt that current execution, it just causes it not to be scheduled again.
Another way might be to set the run schedule to something very large, such that it would only execute once per some very long time interval (days, years, etc.)
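As a concrete sketch of the first approach (the value is just an example), on the SelectHiveQL Scheduling tab:

    Scheduling Strategy: Timer driven
    Run Schedule:        30 sec

Start the processor, wait for it to be triggered once, then stop it; the in-flight execution completes and no further runs are scheduled, so the 20 rows are pulled exactly once.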
I am using the QueryDatabaseTable processor in NiFi for incrementally getting data from a DB2 database. QueryDatabaseTable is scheduled to run every 5 minutes. Maximum-value Columns is set to "rep" (which corresponds to a date in the DB2 database).
I have a separate MySQL database that I want to update with the value of "rep" that QueryDatabaseTable uses to query the DB2 database. How can I get this value?
In the log files I've found that the attributes of the flow files do not contain this value.
QueryDatabaseTable doesn't currently accept incoming flow files or allow the use of Expression Language to define the table name; I've written up an improvement Jira to handle this:
https://issues.apache.org/jira/browse/NIFI-2340