Getting execution time from QueryDatabaseTable in NiFi - apache-nifi

I am using the QueryDatabaseTable processor in NiFi to incrementally pull data from a DB2 database. QueryDatabaseTable is scheduled to run every 5 minutes, and Maximum-value Columns is set to "rep" (which corresponds to a date column in the DB2 database).
I have a separate MySQL database that I want to update with the "rep" value that QueryDatabaseTable uses to query the DB2 database. How can I get this value?
In the log files I've found that the attributes of the FlowFiles do not contain this value.

QueryDatabaseTable doesn't currently accept incoming flow files or allow the use of Expression Language to define the table name. I've written up an improvement Jira to handle this:
https://issues.apache.org/jira/browse/NIFI-2340
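As a workaround, note that the value QueryDatabaseTable tracks is kept in the processor's state, which you can see in the NiFi UI ("View state") and read through the REST API at GET /nifi-api/processors/{id}/state. Here is a minimal Python sketch of reading that state and flattening the entries; the endpoint is part of the NiFi REST API, but the exact state-key names vary between NiFi versions, so treat the parsing as an assumption to verify against your own instance (the example also assumes an unsecured NiFi):

```python
import json
from urllib.request import urlopen

def state_entries(state_json):
    """Flatten the local-state entries of a
    GET /nifi-api/processors/{id}/state response into a dict.
    QueryDatabaseTable's tracked maximum value (e.g. for "rep")
    appears among these entries; the key format varies by version."""
    entries = state_json["componentState"]["localState"]["state"]
    return {e["key"]: e["value"] for e in entries}

def fetch_state(base_url, processor_id):
    """Read processor state from an (assumed unsecured) NiFi instance."""
    url = f"{base_url}/nifi-api/processors/{processor_id}/state"
    with urlopen(url) as resp:
        return json.load(resp)
```

From there you can write the value into your MySQL table with any client; running such a script on a schedule alongside the flow keeps MySQL in step with the processor's state.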

Related

How to use putSQL in apache nifi

I am a beginner in data warehousing and Apache NiFi. I am trying to pull MySQL table data into NiFi and then put that data into another MySQL database table. I am successfully getting data from the first database table, and I can also print that data to a file using the PutFile processor.
But now I want to store that queued data into a MySQL database table. I know there is a PutSQL processor, but it is not working for me.
Can anyone let me know how to do this correctly?
Here are the screenshots of my flow
PutSQL configuration-
I converted the data from Avro to JSON and then from JSON to SQL in case that would work, but that did not work either.
Use PutDatabaseRecord and remove the Convert* processors.
From the NiFi docs:
The PutDatabaseRecord processor uses a specified RecordReader to input
(possibly multiple) records from an incoming flow file. These records
are translated to SQL statements and executed as a single transaction.
If any errors occur, the flow file is routed to failure or retry, and
if the records are transmitted successfully, the incoming flow file is
routed to success. The type of statement executed by the processor is
specified via the Statement Type property, which accepts some
hard-coded values such as INSERT, UPDATE, and DELETE, as well as 'Use
statement.type Attribute', which causes the processor to get the
statement type from a flow file attribute. IMPORTANT: If the Statement
Type is UPDATE, then the incoming records must not alter the value(s)
of the primary keys (or user-specified Update Keys). If such records
are encountered, the UPDATE statement issued to the database may do
nothing (if no existing records with the new primary key values are
found), or could inadvertently corrupt the existing data (by changing
records for which the new values of the primary keys exist).
This should be more performant and cleaner.
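To make the doc excerpt above concrete, here is a rough Python illustration of the translation PutDatabaseRecord performs when Statement Type is INSERT: each record's fields become the columns of one parameterized statement. This is a sketch of the idea only, not NiFi's actual implementation:

```python
def records_to_insert(table, records):
    """Build one parameterized INSERT per record, roughly the way
    PutDatabaseRecord translates records with Statement Type = INSERT.
    (Illustrative only; NiFi executes these as a single transaction.)"""
    statements = []
    for rec in records:
        cols = list(rec.keys())
        placeholders = ", ".join("?" for _ in cols)
        sql = f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders})"
        statements.append((sql, [rec[c] for c in cols]))
    return statements
```

Because the RecordReader supplies the schema, no intermediate Avro-to-JSON or JSON-to-SQL conversion steps are needed in the flow.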

Nifi Fetching Data From Oracle Issue

I have a requirement to fetch data from Oracle and upload it into Google Cloud Storage.
I am using the ExecuteSQL processor, but it is failing for large tables; even for a table with about 1 million records (approx. 45 MB), it is taking 2 hours to pull.
The table names are passed via a REST API to ListenHTTP, which passes them to ExecuteSQL. I can't use QueryDatabaseTable because the number of tables is dynamic, and the calls that start the fetch are also triggered dynamically from a UI via the NiFi REST API.
Please suggest any tuning parameters for the ExecuteSQL processor.
I believe you are asking about the capability to produce smaller flow files and possibly send them downstream while the processor is still working on the (large) result set. For QueryDatabaseTable this was added in NiFi 1.6.0 (via NIFI-4836), and in an upcoming release (NiFi 1.8.0, via NIFI-1251) this capability will be available for ExecuteSQL as well.
You should be able to use GenerateTableFetch to do what you want. There you can set the Partition Size (which will end up being the number of rows per flow file) and you don't need a Maximum Value Column if you want to fetch the entire table each time a flow file comes in (which also allows you do handle multiple tables as you described). GenerateTableFetch will generate the SQL statements to fetch "pages" of data from the table, which should give you better, incremental performance on very large tables.
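The partitioning GenerateTableFetch performs can be sketched as follows. The real SQL depends on the configured Database Type (older Oracle adapters use ROWNUM subqueries rather than the OFFSET/FETCH shown here), so this is an illustration of the paging idea only:

```python
def page_queries(table, order_col, row_count, partition_size):
    """Generate paged SELECTs the way GenerateTableFetch conceptually
    does with Partition Size = partition_size: one statement per page
    of rows. Illustrative sketch; not NiFi's actual adapter code."""
    queries = []
    for offset in range(0, row_count, partition_size):
        queries.append(
            f"SELECT * FROM {table} ORDER BY {order_col} "
            f"OFFSET {offset} ROWS FETCH NEXT {partition_size} ROWS ONLY"
        )
    return queries
```

Each generated statement becomes its own flow file, so downstream ExecuteSQL instances can fetch the pages in parallel instead of streaming one huge result set.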

Sync database extraction with Hadoop

Let's say you have a periodic task that extracts data from a database and loads that data into Hadoop.
How does Apache Sqoop/NiFi maintain sync between the source database (SQL or NoSQL) and the destination storage (Hadoop HDFS or HBase, even S3)?
For example, let's say that at time A the database has 500 records and at time B it has 600 records, with some of the old records updated. Is there a mechanism that efficiently knows the difference between time A and time B, and only updates the rows that changed and adds the missing rows?
Yes, NiFi has the QueryDatabaseTable processor, which can store state and incrementally fetch the records that were updated.
If your table has a date column that is updated whenever a record is updated, you can use that column in the Maximum-value Columns property; the processor will then pull only the changes made since the last state value.
Here is a good article on the QueryDatabaseTable processor:
https://community.hortonworks.com/articles/51902/incremental-fetch-in-nifi-with-querydatabasetable.html
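The incremental mechanism can be simulated in a few lines: the processor remembers the largest value of the date column it has seen, and each run fetches only rows beyond it. A simplified Python simulation of that behavior (NiFi persists the state in its state manager, not in a Python dict, and issues real SQL rather than filtering in memory):

```python
def incremental_fetch(rows, max_col, state):
    """Simulate QueryDatabaseTable's incremental fetch: return only
    rows whose max-value column exceeds the stored state value, then
    advance the state. `rows` stands in for the source table."""
    last = state.get(max_col)
    new_rows = [r for r in rows if last is None or r[max_col] > last]
    if new_rows:
        state[max_col] = max(r[max_col] for r in new_rows)
    return new_rows
```

Note the caveat this implies: rows updated without touching the tracked column, and deleted rows, are not picked up.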

Pull Data from Hive to SQL Server without duplicates using Apache Nifi

Sorry, I'm new to Apache NiFi. I made a data flow that pulls data from Hive and stores it in SQL Server. There is no error in my data flow; the only problem is that it pulls the data repeatedly.
My data flow consists of the following:
SelectHiveQL
SplitAvro
ConvertAvroToJSON
ConvertJSONToSQL
PutSQL
For example, my table in Hive has only 20 rows, but when I run the data flow and check my table in MS SQL, it has saved 5,000 rows. SelectHiveQL pulled the data repeatedly.
What do I need to do so it will only pull 20 rows, or exactly the number of rows in my Hive table?
Thank you
SelectHiveQL (like many NiFi processors) runs on a user-specified schedule. To get a processor to only run once, you can set the run schedule to something like 30 sec, then start and immediately stop the processor. The processor will be triggered once, and stopping it does not interrupt that current execution, it just causes it not to be scheduled again.
Another way might be to set the run schedule to something very large, such that it would only execute once per some very long time interval (days, years, etc.)

Apache Nifi ExecuteSQL Processor

I am trying to fetch data from an Oracle database using the ExecuteSQL processor. Suppose there are 15 records in my Oracle database. When I run the ExecuteSQL processor, it runs continuously as a streaming process, stores all the records as a single file in HDFS, and then repeatedly does the same. As a result, many files end up in the HDFS location, each re-fetching the already-fetched records from the Oracle DB, and these files all contain the same data. How can I make this processor fetch all the data from the Oracle DB once, store it as a single file, and then, whenever new records are inserted into the DB, ingest only those to the HDFS location?
Take a look at the QueryDatabaseTable processor:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.QueryDatabaseTable/index.html
You will need to tell this processor one or more columns to use to track new records, this is the Maximum Value Columns property. If your table has a one-up id column you can use that, and every time it runs it will track the last id that was seen, and start there on the next execution.
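With a one-up id column, the queries the processor effectively issues across runs look like this sketch (illustrative only; the real processor builds its SQL through a database adapter and persists the last-seen value in NiFi's state manager):

```python
def build_query(table, id_col, last_id):
    """First run: no state, so fetch everything. Later runs: fetch
    only rows past the last id seen. Sketch of QueryDatabaseTable's
    effective behavior, not its actual generated SQL."""
    if last_id is None:
        return f"SELECT * FROM {table}"
    return f"SELECT * FROM {table} WHERE {id_col} > {last_id}"
```

So the first execution pulls all 15 records as one flow file, and subsequent executions produce output only when new rows appear past the stored id.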
