PutDatabaseRecord does not insert any record when one of them already exists in the database

I'm working in NiFi with PutDatabaseRecord to insert the data from a CSV file into a database table. Everything goes well on the first execution because there is no data in the table. Then I modify the file so it contains both new records and existing ones. PutDatabaseRecord fails because of the existing records (primary key constraint), but it doesn't insert the new records either.
Is there any way to configure the processor to instruct it to insert the new records and ignore the ones that failed?
I attached pictures of how my processor is configured.
Thanks in advance!
(Screenshots: NiFi flow, PutDatabaseRecord configuration.)

This is possible, but it is not entirely straightforward.
I would suggest trying the following flow: ListFile -> FetchFile -> SplitRecord -> PutDatabaseRecord.
In the SplitRecord processor, set the 'Records Per Split' property to '1'.
SplitRecord splits the incoming flow file into multiple small flow files (one per row in our case, because of 'Records Per Split = 1'). These individual flow files are then routed to the 'splits' relationship, i.e. to the PutDatabaseRecord processor in our flow.
PutDatabaseRecord then inserts the new records into the table and fails only for the records that already exist.
I tested the flow with a GetFile processor and it works. Hope this solves your problem.
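As a minimal sketch of the key properties involved, assuming a CSV record reader/writer pair is already set up (the controller service and table names below are placeholders, and exact property labels may vary slightly between NiFi versions):

    SplitRecord
      Record Reader        CSVReader            (your existing reader service)
      Record Writer        CSVRecordSetWriter   (your existing writer service)
      Records Per Split    1

    PutDatabaseRecord
      Record Reader                         CSVReader
      Statement Type                        INSERT
      Database Connection Pooling Service   DBCPConnectionPool (target database)
      Table Name                            my_target_table

Route PutDatabaseRecord's 'failure' relationship somewhere harmless (for example a LogAttribute processor or a funnel) so the duplicate-key rows are simply set aside while the per-row inserts for new records go through.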

Related

How to use PutSQL in Apache NiFi

I am a beginner in data warehousing and Apache NiFi. I am trying to pull data from a MySQL table into NiFi and then put that data into another MySQL database table. I am successfully getting data from the first database table, and I am also able to write that data to a file using the PutFile processor.
But now I want to store that queued data in a MySQL database table. I know there is a PutSQL processor, but it was not working for me.
Can anyone let me know how to do it correctly?
(Screenshots: my flow and the PutSQL configuration.)
I converted the data from Avro to JSON and then from JSON to SQL in case that would work, but that did not work either.
Use PutDatabaseRecord and remove the Convert* processors.
From the NiFi docs:
The PutDatabaseRecord processor uses a specified RecordReader to input (possibly multiple) records from an incoming flow file. These records are translated to SQL statements and executed as a single transaction. If any errors occur, the flow file is routed to failure or retry, and if the records are transmitted successfully, the incoming flow file is routed to success. The type of statement executed by the processor is specified via the Statement Type property, which accepts some hard-coded values such as INSERT, UPDATE, and DELETE, as well as 'Use statement.type Attribute', which causes the processor to get the statement type from a flow file attribute. IMPORTANT: If the Statement Type is UPDATE, then the incoming records must not alter the value(s) of the primary keys (or user-specified Update Keys). If such records are encountered, the UPDATE statement issued to the database may do nothing (if no existing records with the new primary key values are found), or could inadvertently corrupt the existing data (by changing records for which the new values of the primary keys exist).
This should be more performant and cleaner.
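A rough sketch of the simplified flow, assuming the source processor (ExecuteSQL or QueryDatabaseTable) is emitting Avro as in the original flow; the controller service and table names below are placeholders, not taken from the question:

    ExecuteSQL (source MySQL) -> PutDatabaseRecord (target MySQL)

    PutDatabaseRecord
      Record Reader                         AvroReader (using the embedded Avro schema)
      Statement Type                        INSERT
      Database Connection Pooling Service   DBCPConnectionPool for the target database
      Table Name                            my_target_table

The record reader parses the Avro coming from the source query directly, so the ConvertAvroToJSON and ConvertJSONToSQL steps (and PutSQL) are no longer needed.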

NiFi make single Database call after processing multiple flowfiles

I have the scenario below. I am trying to:
Get the list of new files with the ListFile processor.
Set a constant attribute zipFilesBundleConstant = listBundle on each flowfile.
Put the list into the database.
Get the whole list of files, old and new, from the database to process further with the ExecuteSQL processor. (Here I want to make only one database call to fetch the complete list, old and new, but ExecuteSQL is being called for every flowfile.)
I tried putting a MergeContent processor with zipFilesBundleConstant as the Correlation Attribute Name before ExecuteSQL to combine all the flowfiles, but that is not working as expected: it merges some, but always gives me multiple flowfiles.
Can anyone please help me with a solution on how to make one call after inserting the new files list into the database?
You can run the ExecuteSQL processor as a separate workflow to fetch the existing files list from the database, with a scheduling strategy that matches your requirement.
https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#scheduling-tab
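For example, a standalone ExecuteSQL source could be scheduled along these lines; the query, table name, and interval are placeholders for illustration, not taken from the question:

    ExecuteSQL
      Scheduling Strategy   Timer driven
      Run Schedule          5 min
      SQL select query      SELECT file_name, file_path FROM files_list

Because it runs on its own schedule rather than once per incoming flowfile, this issues a single database call per interval and returns the complete list (old and new) in one result set.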

NiFi processor runs repeatedly

I am exploring NiFi. So far I have created a process group with a number of processors which basically select data from an Oracle DB and insert it into MongoDB. The flow works as expected.
The flow is QueryDatabaseTable -> SplitAvro -> ConvertAvroToJSON -> PutMongoRecord.
In QueryDatabaseTable I have the custom query select * from employee, which gives me 100 records, and these 100 records are inserted into MongoDB. But the issue is that QueryDatabaseTable is called again and again, so as a result the same 100 records get added to MongoDB again and again. Is there any way to stop this repeated execution? Thanks in advance.
Update: I am using NiFi 1.9.2.
(Screenshots: QueryDatabaseTable Scheduling and Properties tabs. Update 2: configuration screenshot.)
Use the Maximum-value Columns property if you want to prevent selecting duplicates.
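As a rough sketch, assuming the employee table has an ever-increasing column such as EMP_ID (that column name is an assumption for illustration, not something from the question):

    QueryDatabaseTable
      Table Name              EMPLOYEE
      Maximum-value Columns   EMP_ID

The processor stores the largest EMP_ID it has seen in its state and, on each scheduled run, only fetches rows with a greater value, so the same 100 rows are not pulled and inserted into MongoDB over and over.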

NiFi DistributedMapCache lookup issue

I have configured a flow as follows:
GetFile
SplitText -> splitting into flowfiles
ExtractText -> adding attributes with two keys
PutDistributedMapCache -> Cache Entry Identifier is ${Key1}_${Key2}
Now I have configured a sample GenerateFlowFile which generates a sample record and then goes into LookupRecord (concat(/Key1, '_', /Key2)), which looks for the same key in the cache.
I see a problem in my caching flow, because when I configure a GenerateFlowFile to cache the same records, I am able to do the lookup.
This flow is not able to look the records up. Please help.
(Screenshots: PutDistributedMapCache, ExtractText, the lookup flow, and the LookupRecord configuration.)
I have added four keys in total because that is my business use case.
I have a CSV file with 53 records and I use SplitText to split out each record and add attributes which act as my key, which I then store with PutDistributedMapCache. Now I have a different flow where I start with a GenerateFlowFile which generates a record like this:
So I expect my LookupRecord processor, which has a JSON reader and JSON writer, to read this record, look up the key in the distributed cache, and populate the /Feedback field in my record.
This fails to look up the records, and the records are routed to unmatched.
Now the catch is: let's say I remove GetFile and use a GenerateFlowFile with this config to cache:
Then my lookup works with the keys 9_9_9_9. But the moment I add another set of records with different keys, my lookup fails.
I figured it out: my DistributedMapCacheServer had 'Max Cache Entries' configured as 1. I increased it, and it's working now :)
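For anyone hitting the same symptom, the two things that have to line up are the cache key on both sides and the cache server capacity. A minimal sketch, assuming four key attributes/fields named Key1..Key4 as described in the question (exact property labels may differ slightly by NiFi version):

    PutDistributedMapCache
      Cache Entry Identifier   ${Key1}_${Key2}_${Key3}_${Key4}

    LookupRecord
      Lookup Service       DistributedMapCacheLookupService
      key                  concat(/Key1, '_', /Key2, '_', /Key3, '_', /Key4)
      Result RecordPath    /Feedback

    DistributedMapCacheServer
      Maximum Cache Entries   10000   (large enough to hold all cached rows)

With only one entry allowed, every new row cached by PutDistributedMapCache evicts the previous one, which is why only the most recently cached key (9_9_9_9) could ever be looked up successfully.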

How to run a processor only when another processor has finished its execution?

I'm migrating a table (2 million rows) from DB2 to SQL Server. I'm using the following flow:
ExecuteSQL (to select the records from the DB2 table).
SplitAvro (to split the records; I configured it with Output Size = 1 so that if one record fails, the rest are still inserted without problems).
PutDatabaseRecord (to insert the records into the SQL Server table).
ExecuteSQL (I need to call a stored procedure that executes update statements against the same table that PutDatabaseRecord writes to).
The problem is that the second ExecuteSQL runs before PutDatabaseRecord has finished inserting all the records.
How can I tell nifi to run that processor only when the other one finishes?
Thanks in advance!
After PutDatabaseRecord you can use MergeContent in Defragment mode to undo the split operation performed by SplitAvro. This way a single flow file will come out of MergeContent only when all the splits have been seen, and at that point you know it's time for the second ExecuteSQL to run.
The answer provided by Bryan Bende is great, as it is simple and elegant. If that doesn't work for some reason, you could also look at Wait/Notify. Having said that, Bryan's answer is simpler and probably more robust.
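A rough sketch of where the extra step sits and the one property that matters (the remaining MergeContent properties can stay at their defaults; this assumes the fragment attributes written by SplitAvro are still present on the flow files, which is the case unless something downstream strips them):

    ExecuteSQL -> SplitAvro -> PutDatabaseRecord -> MergeContent -> ExecuteSQL (stored procedure)

    MergeContent
      Merge Strategy   Defragment

In Defragment mode, MergeContent uses the fragment.identifier, fragment.index, and fragment.count attributes that SplitAvro adds to every split, so it emits a single merged flow file only after all the splits have passed through PutDatabaseRecord, and only that one flow file triggers the second ExecuteSQL.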
