NiFi GetMongo fetches data forever

I have millions of records in MongoDB and I want to use NiFi to move data. Here is the scenario I want to run:
1) I will set up NiFi
2) NiFi will automatically fetch records in batches of 100 records.
3) Once the initial load is done, it will only fetch newly added entries.
I tried this scenario with a small MongoDB collection (fetch from Mongo and store as a file) and saw that NiFi repeats the process forever and duplicates the records.
Here is the flow I created in NiFi:
Are there any suggestions to solve this problem?

Unfortunately, GetMongo doesn't have state-tracking capabilities. There are similar questions where I have explained this; you can find them here (a small sketch of what such state tracking would involve follows the links):
Apache NiFi Job is not terminating automatically
Apache NiFi getMongo Processor
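To make that concrete, the missing piece is somewhere to remember how far the last fetch got. Below is a minimal sketch of that idea outside of GetMongo (for example in a standalone script or something called from ExecuteScript); the connection string, database and collection names are placeholders, and using _id as the cursor assumes default, roughly insertion-ordered ObjectIds. This is an illustration of the state tracking GetMongo does not do, not a NiFi feature.

```python
# Minimal sketch of incremental fetching, i.e. the state tracking GetMongo
# does not do itself. Connection string, database and collection names are
# placeholders; persisting last_id between runs is left out for brevity.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["mydb"]["mycollection"]

last_id = None  # in practice this would be persisted (file, cache, ...)

def fetch_next_batch(batch_size=100):
    """Return up to batch_size documents newer than the last one seen."""
    global last_id
    query = {"_id": {"$gt": last_id}} if last_id is not None else {}
    docs = list(collection.find(query).sort("_id", 1).limit(batch_size))
    if docs:
        last_id = docs[-1]["_id"]  # remember where this batch stopped
    return docs

# Each call returns every document once; new inserts show up in later calls.
while True:
    batch = fetch_next_batch()
    if not batch:
        break
```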

Related

Scalable elasticsearch module with spring data elasticsearch possible?

I am working on designing a scalable service (Spring Boot) that will index data into Elasticsearch.
Use case:
My application uses 6 MySQL databases with the same schema; each database caters to a specific region. I have a microservice that connects to all these databases and indexes data from specific tables into an Elasticsearch server (v6.8.8) in a similar fashion, with 6 Elasticsearch indexes, one for each database.
Quartz jobs and the RestHighLevelClient are used for this purpose. There are also delta jobs running every second that look for changes using audit data and then update the indexes.
Current problem:
The current design is not scalable - one service does all the work (data loading, mapping, bulk upserts). Because indexing is done through Quartz jobs, scaling the service (running multiple instances) will run the same job multiple times.
No failover - I am looking at distributed Elasticsearch nodes and indexing data to both nodes. How can this be done efficiently?
I am considering Spring Data Elasticsearch to index data at the same time it is persisted to the database.
Does it offer all the features? I use:
Elasticsearch administration, right from installing templates to creating/deleting indexes and aliases.
Blue/green deployment - index to non-active nodes and then switch the aliases.
Bulk upserts, querying, aggregations, etc.
Any other solutions are welcome. Thanks for your time.
One of your use cases is to move data from the database (MySQL) to ES in a scalable manner. That is basically a CDC (change data capture) pipeline.
You can use the Kafka Connect framework for this.
The flow should look like this:
Read MySQL transaction logs => publish the data to Kafka (this can be accomplished using the Debezium source connector)
Consume data from Kafka => push it to Elasticsearch (this can be accomplished using the Elasticsearch sink connector)
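As a rough illustration, both connectors are typically registered through the Kafka Connect REST API. The sketch below posts a Debezium MySQL source config and an Elasticsearch sink config; all hostnames, credentials, table and topic names are placeholders, and the exact property names vary between connector versions, so check the Debezium and sink connector documentation for your versions.

```python
# Rough sketch: register a Debezium MySQL source connector and an
# Elasticsearch sink connector through the Kafka Connect REST API.
# Hosts, credentials, table and topic names are placeholders, and
# property names can differ between connector versions.
import json
import requests

CONNECT_URL = "http://localhost:8083/connectors"  # Kafka Connect REST endpoint

mysql_source = {
    "name": "mysql-cdc-source",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql-host",
        "database.port": "3306",
        "database.user": "cdc_user",
        "database.password": "cdc_password",
        "database.server.id": "184054",
        "database.server.name": "region1",  # topic prefix in older Debezium versions
        "table.include.list": "appdb.orders,appdb.customers",
        "database.history.kafka.bootstrap.servers": "kafka:9092",
        "database.history.kafka.topic": "schema-changes.appdb",
    },
}

es_sink = {
    "name": "es-sink",
    "config": {
        "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "connection.url": "http://es-node1:9200",
        "topics": "region1.appdb.orders,region1.appdb.customers",
        "key.ignore": "false",
    },
}

# Register both connectors; Connect distributes the work across its workers.
for connector in (mysql_source, es_sink):
    resp = requests.post(
        CONNECT_URL,
        headers={"Content-Type": "application/json"},
        data=json.dumps(connector),
    )
    resp.raise_for_status()
    print(connector["name"], "created:", resp.status_code)
```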
Why use the framework?
With the Connect framework, data can be read directly from MySQL transaction logs without writing code.
The Connect framework is a distributed and scalable system.
It reduces the load on your database, since you no longer need to query the database to detect changes.
It is easy to set up.

How do we know when a flow is completed when we have multiple flowfiles running in parallel?

I have a requirement where a template uses SQL as the source and SQL as the destination, and the data would be more than 100 GB for each table. The template will be instantiated multiple times based on the tables to be migrated, and each table is also partitioned into multiple flowfiles. How do we know when the process is completed? Since there are multiple flowfiles, we cannot conclude anything just because one of them hits the end processor.
I have tried using SiteToSiteStatusReportingTask to check the queue count, but it provides counts per connection, and it is difficult to fetch the connection ID for each connection and then aggregate them, since we have a large number of templates. There is another problem with the reporting task: it reports on all process groups on the NiFi canvas, which is a huge amount of data if all templates are running and may impact performance, even though I used an Avro schema to fetch only the queue count and connection ID.
Can you please suggest some ideas to help me achieve this?
You have multiple solutions:
1 - You can use the Wait/Notify processor pair.
If you don't want multiple flowfiles running in parallel:
2 - Set back pressure on the queue.
3 - Specify group-level flowfile concurrency (recommended, but NiFi 1.12+ only).
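If the goal is simply to detect that a given process group has drained, one alternative to the SiteToSiteStatusReportingTask approach mentioned in the question is to poll the process group's aggregate status through the NiFi REST API, which avoids dealing with individual connection IDs. Below is a rough sketch, assuming an unsecured NiFi on localhost and a known process group ID; note that an empty queue only means "done" once the upstream processors have finished producing work.

```python
# Rough sketch: poll a process group's aggregate queued-flowfile count via
# the NiFi REST API and report when it drains to zero. Base URL and process
# group ID are placeholders; a secured instance would also need credentials.
import time
import requests

NIFI_API = "http://localhost:8080/nifi-api"
PROCESS_GROUP_ID = "root"  # or the UUID of the group running the template

def queued_flowfiles(group_id):
    """Aggregate number of flowfiles queued anywhere inside the group."""
    resp = requests.get(f"{NIFI_API}/flow/process-groups/{group_id}/status")
    resp.raise_for_status()
    snapshot = resp.json()["processGroupStatus"]["aggregateSnapshot"]
    return snapshot["flowFilesQueued"]

while True:
    queued = queued_flowfiles(PROCESS_GROUP_ID)
    print("flowfiles queued:", queued)
    if queued == 0:
        print("process group drained")
        break
    time.sleep(10)
```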

NiFi processor runs repeatedly

I am exploring NiFi. So far I have created a process group with a number of processors which basically select data from an Oracle DB and insert it into MongoDB. The flow works as expected.
The flow is QueryDatabaseTable -> SplitAvro -> ConvertAvroToJSON -> PutMongoRecord
In QueryDatabaseTable I have the custom query select * from employee, which gives me 100 records, and these 100 records are inserted into MongoDB. But the issue is that QueryDatabaseTable is called again and again, so the same 100 records get added to MongoDB over and over. Is there any way to stop this repeated execution? Thanks in advance.
Update: I am using NiFi 1.9.2.
Please find the QueryDatabaseTable configuration below (screenshots of the Scheduling and Properties tabs).
Update 2: Configuration (screenshot).
Use the Maximum-value Columns property if you want to prevent duplicate selection. QueryDatabaseTable keeps state of the largest value seen in those columns and, on subsequent runs, only fetches rows whose values are greater, so the same rows are not returned again.

NiFi Fetching Data From Oracle Issue

I have a requirement to fetch data from Oracle and upload it into Google Cloud Storage.
I am using the ExecuteSQL processor, but it fails for large tables, and even for a table with 1 million records of approx. 45 MB it takes 2 hours to pull.
The table names are passed via a REST API to ListenHTTP, which passes them to ExecuteSQL. I can't use QueryDatabaseTable because the set of tables is dynamic, and the calls to start the fetch are also triggered dynamically from a UI via the NiFi REST API.
Please suggest any tuning parameters for the ExecuteSQL processor.
I believe you are talking about the capability to produce smaller flow files and possibly send them downstream while the processor is still working on the (large) result set. For QueryDatabaseTable this was added in NiFi 1.6.0 (via NIFI-4836), and in an upcoming release (NiFi 1.8.0, via NIFI-1251) this capability will be available for ExecuteSQL as well.
You should be able to use GenerateTableFetch to do what you want. There you can set the Partition Size (which will end up being the number of rows per flow file), and you don't need a Maximum-value Column if you want to fetch the entire table each time a flow file comes in (which also allows you to handle multiple tables as you described). GenerateTableFetch will generate the SQL statements to fetch "pages" of data from the table, which should give you better, incremental performance on very large tables.

Pull Data from Hive to SQL Server without duplicates using Apache NiFi

Sorry, I'm new to Apache NiFi. I made a data flow for pulling data from Hive and storing it in SQL Server. There are no errors in my data flow; the only problem is that it pulls the data repeatedly.
My data flow consists of the following:
SelectHiveQL
SplitAvro
ConvertAvroToJSON
ConvertJSONToSQL
PutSQL
For example, my table in Hive has only 20 rows, but when I run the data flow and check my table in MS SQL, it has saved 5,000 rows. SelectHiveQL pulled the data repeatedly.
What do I need to do so it only pulls the 20 rows, or just the exact number of rows in my Hive table?
Thank you
SelectHiveQL (like many NiFi processors) runs on a user-specified schedule. To get a processor to only run once, you can set the run schedule to something like 30 sec, then start and immediately stop the processor. The processor will be triggered once, and stopping it does not interrupt that current execution, it just causes it not to be scheduled again.
Another way might be to set the run schedule to something very large, such that it would only execute once per some very long time interval (days, years, etc.)
