How to use PutSQL in Apache NiFi - ETL

I am a beginner in data warehousing and Apache NiFi. I am trying to pull MySQL table data into NiFi and then put that data into another MySQL database table. I can successfully get data from the first database table, and I am also able to write that data to a file using the PutFile processor.
But now I want to store that queued data in a MySQL database table. I know there is a PutSQL processor, but it is not working for me.
Can anyone let me know how to do this correctly?
Here are the screenshots of my flow.
PutSQL configuration-
I also tried converting the data from Avro to JSON and then from JSON to SQL in case that would work, but that did not work either.

Use PutDatabaseRecord and remove the Convert* processors.
From the NiFi docs:
The PutDatabaseRecord processor uses a specified RecordReader to input
(possibly multiple) records from an incoming flow file. These records
are translated to SQL statements and executed as a single transaction.
If any errors occur, the flow file is routed to failure or retry, and
if the records are transmitted successfully, the incoming flow file is
routed to success. The type of statement executed by the processor is
specified via the Statement Type property, which accepts some
hard-coded values such as INSERT, UPDATE, and DELETE, as well as 'Use
statement.type Attribute', which causes the processor to get the
statement type from a flow file attribute. IMPORTANT: If the Statement
Type is UPDATE, then the incoming records must not alter the value(s)
of the primary keys (or user-specified Update Keys). If such records
are encountered, the UPDATE statement issued to the database may do
nothing (if no existing records with the new primary key values are
found), or could inadvertently corrupt the existing data (by changing
records for which the new values of the primary keys exist).
This should be more performant and cleaner.
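To make the "translated to SQL statements and executed as a single transaction" behavior concrete, here is a rough sketch of what PutDatabaseRecord does for Statement Type = INSERT: one prepared statement, one parameter set per record, all committed together. The table and column names are invented for the example, and sqlite3 stands in for MySQL so the sketch is self-contained.

```python
# Hypothetical sketch of PutDatabaseRecord's INSERT behavior.
# sqlite3 is used as a stand-in database; table/columns are illustrative.
import sqlite3

records = [  # records as a RecordReader (e.g. an Avro reader) would parse them
    {"id": 1, "name": "alice"},
    {"id": 2, "name": "bob"},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

columns = list(records[0])
sql = "INSERT INTO users ({}) VALUES ({})".format(
    ", ".join(columns), ", ".join("?" for _ in columns)
)
with conn:  # single transaction: all records succeed or none do
    conn.executemany(sql, [tuple(r[c] for c in columns) for r in records])

print(conn.execute("SELECT COUNT(*) FROM users").fetchone()[0])  # -> 2
```

Because the values are bound as parameters rather than baked into SQL text, this is both safer and faster than generating literal INSERT statements per row, which is why it beats the ConvertJSONToSQL -> PutSQL chain.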

Related

Apache NiFi how to check output from each processor

I am new to using Apache NiFi, and I am trying to create a template that takes a JSON file and turns it into a set of SQL insert statements.
So far I have created a template that takes the JSON file and I have got it to the point of PutSQL. There is no database to connect to at the moment, but what I have not been able to check is the output. Can this be done? What I need to check is whether the JSON array has been turned into one INSERT per element of the array.
As far as inspecting the output, what does your flow look like? If you have something like ConvertJSONToSQL -> PutSQL, you can leave PutSQL stopped and run ConvertJSONToSQL; then you will see FlowFile(s) in the connection between the two processors. You can then right-click on the connection and choose List Queue, then click the "eye" icon on the right for the FlowFile you wish to inspect. That will show you the contents of the FlowFile right before it goes into PutSQL.
Having said all that, if your JSON file contains fields that correspond to columns in your database, consider PutDatabaseRecord instead of ConvertJSONToSQL -> PutSQL. That can use a JsonTreeReader to parse each record, and it will generate and execute the necessary SQL as a prepared statement using the values in all records of the FlowFile. That way you don't need to generate the SQL yourself or worry about fragmented transactions or any of that.
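For the specific thing being verified, a small sketch of the transformation may help: one INSERT per element of the incoming JSON array, which is the shape you should see when inspecting the queue. The table name and fields here are made up for illustration, and this only mimics the statement structure, not ConvertJSONToSQL's actual output format.

```python
# Rough sketch of the check the asker wants to make: a JSON array of N
# elements should yield N INSERT statements. Table/fields are invented.
import json

flowfile_content = '[{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Bergen"}]'

inserts = []
for element in json.loads(flowfile_content):
    cols = ", ".join(element)
    placeholders = ", ".join("?" for _ in element)
    inserts.append(f"INSERT INTO cities ({cols}) VALUES ({placeholders})")

for stmt in inserts:
    print(stmt)  # two elements in -> two INSERT statements out
```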

Read data from multiple tables at a time and combine the data based where clause using Nifi

I have a scenario where I need to extract data from multiple database tables, including the schema, combine the data based on a where clause, and then write the combined data to an Excel file.
In NiFi the general strategy is to read in from something like a fact table with ExecuteSQL or some other SQL processor, then use LookupRecord to enrich the data with a lookup table. The catch in NiFi is that you can only do one table at a time, so you'd need one LookupRecord for each enrichment table. You could then write to a CSV file that you could open in Excel. There might be extensions elsewhere that can write directly to Excel, but I'm not aware of any in the standard NiFi distribution.
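As a toy illustration of that enrichment pattern: each lookup table contributes one "LookupRecord" pass over the records, and the result can be written as CSV for Excel. All table and column names below are invented for the example.

```python
# Illustrative sketch of fact-table enrichment via a lookup table,
# followed by a CSV write that Excel can open. Names are hypothetical.
import csv
import io

fact_rows = [
    {"order_id": 1, "customer_id": 10, "amount": 99.0},
    {"order_id": 2, "customer_id": 11, "amount": 45.5},
]
customers = {10: "Alice", 11: "Bob"}  # lookup table keyed on customer_id

for row in fact_rows:  # the LookupRecord step: enrich each record
    row["customer_name"] = customers.get(row["customer_id"], "UNKNOWN")

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fact_rows[0].keys())
writer.writeheader()
writer.writerows(fact_rows)
print(buf.getvalue())
```

A second enrichment table would simply be another dictionary and another loop, mirroring the one-LookupRecord-per-table constraint described above.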

PutDatabaseRecord does not insert any record when one of them already exists in the database

I'm working in NiFi with PutDatabaseRecord to insert the data of a CSV file into a database table. Everything goes well in the first execution because there is no data in the table. Then I modify the file so it contains new records and existing ones. PutDatabaseRecord fails because of the existing records (primary key constraint), but it doesn't insert the new records either.
Is there any way to configure the processor to instruct it to insert the new records and ignore the ones that fail?
I attached pictures of how my processor is configured.
Thanks in advance!
NiFi flow
PutDatabaseRecord
This is possible. However, it is not a straightforward implementation.
I would suggest you try the following flow: ListFile -> FetchFile -> SplitRecord -> PutDatabaseRecord.
In the SplitRecord processor, set the 'Records per Split' property to '1'.
The SplitRecord processor splits the input flow file into multiple small flow files (one file for each row in our case, due to the setting 'Records per Split' = 1). These individual flow files are then routed to the 'split' relationship, i.e. to the PutDatabaseRecord processor in our flow.
PutDatabaseRecord inserts new records into table and fails for existing records.
Tested the flow with GetFile processor and it works. Hope this solves your problem.

Nifi Fetching Data From Oracle Issue

I have a requirement to fetch data from Oracle and upload it to Google Cloud Storage.
I am using the ExecuteSQL processor, but it fails for large tables, and even for a table with about 1 million records (roughly 45 MB) it takes 2 hours to pull.
The table names are passed via a REST API to ListenHTTP, which passes them to ExecuteSQL. I can't use QueryDatabaseTable because the number of tables is dynamic, and the calls that start the fetch are also triggered dynamically through a UI and the NiFi REST API.
Please suggest any tuning parameters for the ExecuteSQL processor.
I believe you are talking about having the capability to produce smaller flow files and possibly send them downstream while the processor is still working on the (large) result set. For QueryDatabaseTable this was added in NiFi 1.6.0 (via NIFI-4836), and in an upcoming release (NiFi 1.8.0, via NIFI-1251) this capability will be available for ExecuteSQL as well.
You should be able to use GenerateTableFetch to do what you want. There you can set the Partition Size (which will end up being the number of rows per flow file) and you don't need a Maximum Value Column if you want to fetch the entire table each time a flow file comes in (which also allows you do handle multiple tables as you described). GenerateTableFetch will generate the SQL statements to fetch "pages" of data from the table, which should give you better, incremental performance on very large tables.
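To show the "pages" idea concretely, here is a sketch of the kind of paged statements GenerateTableFetch produces for a given Partition Size. The exact SQL depends on the configured database adapter; the LIMIT/OFFSET form below is just a common shape, and the table name is illustrative.

```python
# Illustrative sketch of GenerateTableFetch-style paging: with a
# Partition Size of N, each generated statement fetches one page of
# N rows. Actual generated SQL varies by database adapter.
def generate_fetch_statements(table, row_count, partition_size):
    statements = []
    for offset in range(0, row_count, partition_size):
        statements.append(
            f"SELECT * FROM {table} LIMIT {partition_size} OFFSET {offset}"
        )
    return statements

for sql in generate_fetch_statements("big_table", 1_000_000, 250_000):
    print(sql)  # 4 statements, each fetching a 250000-row page
```

Each generated statement becomes its own flow file, so the downstream ExecuteSQL (or similar) processors can fetch pages in parallel instead of streaming one giant result set.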

Getting execution time from QueryDatabaseTable in NiFi

I am using the QueryDatabaseTable processor in NiFi for incrementally getting data from a DB2 database. QueryDatabaseTable is scheduled to run every 5 minutes. Maximum-value Columns is set to "rep" (which corresponds to a date in the DB2 database).
I have a separate MySQL database I want to update with the value of "rep" that QueryDatabaseTable uses to query the DB2 database. How can I get this value?
In the log files I've found that the attributes of the FlowFiles do not contain this value.
QueryDatabaseTable doesn't currently accept incoming flow files or allow the use of Expression Language to define the table name; I've written up an improvement Jira to handle this:
https://issues.apache.org/jira/browse/NIFI-2340
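Until something like that lands, one possible workaround for the original question is to recompute the maximum of the "rep" column from the fetched records downstream (for example, in a scripting processor after converting the output to JSON) and use that value to update the MySQL side. The "rep" field name comes from the question; the record shape and values below are hypothetical.

```python
# Hypothetical downstream step: recompute the max of the "rep" column
# from the records QueryDatabaseTable just fetched. Sample data invented.
records = [
    {"id": 1, "rep": "2016-07-01"},
    {"id": 2, "rep": "2016-07-03"},
    {"id": 3, "rep": "2016-07-02"},
]

# The largest "rep" in this batch is the value QueryDatabaseTable will
# use as its lower bound on the next run (for runs that returned rows).
max_rep = max(r["rep"] for r in records)
print(max_rep)  # -> 2016-07-03
```

Note this only observes the value for runs that actually return rows; the authoritative copy lives in the processor's state, not in the flow files.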
