run nifi flow once and notify me when it is finish - apache-nifi

I use rest api in my program,I made a processor group for convent a mongodb collection to json file:
I want to run the scheduling only one time,so I set the "Run schedule" to 10000 sec.Then I will stop the group when the data flow have ran one time,and I made a Notify processor and add a DistributedMapCacheService.But the DistributedMapCacheClientService of the Notify processor only comunicates with the DistributedMapCacheService in nifi itself,It never nofity my program.
I try to use my own socket server,but I only get a message "nifi" but no more message.
My question is:If I only want scheduling run once and stop it,how do I know when shall I stop it?Or is there some other way to achieve my purpose,like detect if the json file exists or use incremental data(If the scheduling run twice,the data will be repeated twice)?

As #daggett said you can do it in a synchronous way you can use HandleHttpRequest as trigger and HandleHttpResponse to manage the response.
For an asynchronous was you have several options for the notification like PutTCP, PostHTTP, GetHTTP, use FTP, file system, XMPP or whatever.
If the scheduling run twice the duplicated elements depends on the processors you use, some of them have state others no, but if you are facing problems with repeated elements you can use the DetectDuplicate processor.

Related

Camunda: Receive multiple, different messages at once

I am currently developing a kinda complex workflow with camunda. The goal of this workflow is to orchestrate the execution of different external business processes. Which includes start, overwatch and synchronize these workflows. Everything besides the synchronization works as expected.
Example:
My example has one main workflow which starts multiple sub workflows. The main workflow has to be aware when all sub workflows are finished. Every sub workflow is triggered by a message and sends a message back to the main workflow at the end of execution. Therefore, all sub workflows should be synchronized in the main workflow.
Xml can be accessed on this site: https://pastebin.com/2aj4z0zU
Unfortunately, this leads to numerous message correlation exceptions at the choke point in the main workflow (1st lane, after the first parallel gateway). I am using the following code to correlate the messages:
this.runtimeService.createMessageCorrelation(messageName)
.processInstanceId(processInstanceId)
.setVariables(payload)
.correlate();
The whole workflow is executable and runs without errors, but only if one example_workflow at a time is executed. Starting multiple example_workflows quickly one after another results in this type of exception randomly for every message type:
ENGINE-16004 Exception while closing command context: Cannot correlate message 'PROCESS_B_FINISHED': No process definition or execution matches the parameters org.camunda.bpm.engine.MismatchingMessageCorrelationException: Cannot correlate message 'PROCESS_B_FINISHED': No process definition or execution matches the parameters
at org.camunda.bpm.engine.impl.cmd.CorrelateMessageCmd.execute(CorrelateMessageCmd.java:88) ~[camunda-engine-7.14.0.jar!/:7.14.0]
Currently, the correlation exceptions occur if a postgresql database is used. The same workflow runs much better, but not perfect, when we use a h2 file-based database. All receive tasks are not configured asynchronously, only send tasks are (async before + exclusive).
Questions:
Is this already the best practice to synchronize multiple messages in one workflow?
What could be the reason for the correlation exceptions while using a postgresql database?
Used software:
spring boot application [Version:2.3.4]
camunda [Version:7.14.0]
h2 [Version:1.4.200]
postgresql [Version:42.2.22]
the process model seems to contain sequences where it can run into a deadlock (What if blue is followed directly by green? Or yellow?) or where you have race conditions. If the process has not reached a state where it is in a receiving state for the message, then the message delivery will fail (as indicated in the error message you shared)
(The reason you are observing the CorellationException more frequently on postgresql if the race condition. With this external database some operations take slightly more time, increasing the chance of the race condition occurring).
The process engine needs to be able to match a message to a unique receiver. If there are multiple potential receivers for the same message name, and no other correlation criteria creating a unique match is provided, then the delivery will also fail. You either need to use unique message names per instance or better use a businessKey or a process data which is unique per instance as additional correlation criteria. This is why it does not work when you run multiple process instances.
Modelling a workflow with this parallel message bottleneck leads to a race condition, as mentioned by #rob2universe's post.
To solve this problem, I had firstly to correlate the messages directly. I did this by adding a unique identifier to every message, which was not a big deal due to the fact that an item ID was defined within the payload of every message. Secondly, I had to remove all asynchronous and exclusive markers for every receive task and connected gateways. And thirdly, I had to reset the job executor properties to default values. Limiting the pool size and jobs per acquisition did not benefit the workflow execution.
After all these changes, my workflow now runs as expected with no errors. Unfortunately, due to the described bottleneck optimistic logging exceptions are common, but the workflow engine handles these exceptions without further errors.

Find Provenance Data For Flowfile Within a Processor

I am attempting to develop a NiFi processor that would extend the functionality of the built-in processor "Monitor Activity".
The problem I am attempting to solve is that in my application, I would have multiple flows entering the processor, with the processor alerting by email when no flowfiles arrive within a certain time period. However, if only one of the flows stop, no alert will be triggered.
I would like to modify the processor such that it would be able to distinguish between the different flows and alert accordingly.
In order to do this, I would need a way to deferentiate between flowfiles originating from one processor and another.
I am aware NiFi keeps detailed provenance records that can be easily accessed from within the GUI interface but I'm unable to find an easy way of accessing this information programmatically from within processor code.

Nifi processor to route flows based on changeable list of regex

I am trying to use Nifi to act as a router for syslog based on a list of regexes matching the syslog.body (nb as this is just a proof of concept I can change any part if needed)
The thought process is that via a separate system (for now, vi and a text file 😃) an admin can define a list of criteria (regex format for each seems sensible) which, if matched, would result in syslog messages being sent to a specific separate system (for example, all critical audit data (matched by the regex list) is sent to the audit system and all other data goes to the standard log store
I know that this can be done on Route by content processors but the properties are configured before the processor starts and an admin would have to stop the processor every time they need to make an edit
I would like to load the list of regex in periodically (automatically) and have the processor properties be updated
I don’t mind if this is done all natively in Nifi (but that is preferable for elegance and to save an external app being written) or via a REST API call driven by a python script or something (or can Nifi send REST calls to itself?!)
I appreciate a processor property cannot be updated while running, so it would have to be stopped to be updated, but that’s fine as the queue will buffer for the brief period. Maybe a check to see if the file has changed could avoid outages for no reason rather than periodic update regardless, I can solve that problem later.
Thanks
Chris
I think the easiest solution would be to use ScanContent, a processor which specifies a dictionary file on disk which contains a list of search terms and monitors the file for changes, reloading in that event. The processor then applies the search terms to the content of incoming flowfiles and allows you to route them based on matches. While this processor doesn't support regular expressions as dictionary terms, you could make a slight modification to the code or use this as a baseline for a custom processor with those changes.
If that doesn't work for you, there are a number of LookupService implementations which show how CSV, XML, property files, etc. can be monitored and read by the controller framework to provide an updated mapping of key/value pairs. These can also serve as a foundation for building a more complicated scan/match flow using the loaded terms/patterns.
Finally, if you have to rely on direct processor property updating, you can script this with the NiFi API calls to stop, update, and restart the processors so it can be done in near-real-time. To determine these APIs, visit the API documentation or execute the desired tasks via the UI in your browser and use the Developer Tools to capture the HTTP requests being made.

Serial consumption between message types

I have a MassTransit system that will consume 2 message types, one for a batch process, the other for CRUD operations on a single entity. Whilst the batch process is running, the CRUD operations should not be de-queued.
Is this possible to achieve using MassTransit? It seems the exchange binding -> type name, would potentially make this behavior difficult.
A solution would be to use one message type to denote both operations and then interrogate the message contents to discern between single and batch but this feels like a code smell. Also, this would require concurrency configuration to ensure only one consumer is ever active.
Can anyone help with an alternative solution here? Essentially, we need to pause all message consumption whilst an event driven process is running.
Thanks in advance.
By pause, do you mean that you want the CRUD operations to be able to occur without being blocked by the batch process? Because if it's only a matter of not having the two separate messages get in the way of each other, the most logical solution is using two separate queues, one receive endpoint for the batch process and another for the CRUD operations.
Now, if you truly need to separate the batch process such that it doesn't happen during the CRUD operations, that will require more work. And what if you receive a CRUD operation while the batch process is already running?
I think the separate queues is your best solution, however.

Connecting Redis events to Lua Script execution and concurrency issues

I have grouped key value pairs or data structures built using Redisson library. The design is that a change in value of any group of value(s) should be sent as event to subscribing Lua scripts. These scripts then do computations and update another group's key-value pair. This process is implemented as a chain such that once the Lua script updates a key-value per, that in turn generates a event and another Lua script does the work similar to first Lua script based on certain parameters.
Question 1: How to connect the Lua script and the event?
Question 2: Events are pipelined but it may be that my Lua Scripts may have to wait for network IO. In that case, I assume the next event is processed and the subscribing script executed. this for me is a problem because first script hasn't finished updating the key-value pair it needs to and the second script is going ahead with its work. This will cause errors for me. Is there a way to get over this?
Question 3: How to emit events from Redisson datastructures and I need the Lua script to understand that data structure's structure. How?
At the time of writing, Redis (3.2.9) does not allow blocking commands inside Lua scripts, including the subscribe command. So it is impossible to achieve what you have described via Lua script.
However you can do it using Redisson Topic and/or Redisson distributed services:
Modify a value, send a message to a channel. Another process receives the message, do the computation and updating.
Or ...
If there's only one particular process that does the computation and updating, you can use Redisson remote service to tell this process do the work, it works like RPC. Maybe it is able to modify the first value too.
Or ...
Create the whole lot as one runnable job and send it to be processed by a Redisson remote executor. You can also choose to schedule the job if it is not immediately required.

Resources