Store external data into NiFi Registry - apache-nifi

Is it possible to store external data (not a NiFi flow) in NiFi Registry using the REST API?
https://nifi.apache.org/docs/nifi-registry-docs/index.html
As far as I know, NiFi Registry is designed for versioning NiFi flows, but I want to know whether it is capable of storing other data and retrieving it by version.

As of today, it is not possible to store data/objects in NiFi Registry other than a NiFi flow and its configuration (component properties, default variable values, controller services, etc.).
There have been discussions about extending NiFi Registry's storage capabilities to include other items. Often discussed are NiFi extensions, such as NAR bundles, the archive format for components like custom processors. This would allow custom components to be versioned in the same place as a flow and downloaded at runtime based on a flow definition, rather than being pre-installed on NiFi/MiNiFi instances.
Today, though, only flows are supported. Other data or components have to be stored/versioned somewhere else.
If you have data you want to associate with a specific flow version snapshot, here is a suggestion: store that data externally in another service and use the flow version snapshot's comment field to hold a URI/link to where the associated data resides. If you use a machine-parsable format such as JSON for this URI metadata in the snapshot comment, an automated process can read the field and retrieve the data from the external system whenever it performs an operation involving that specific flow snapshot version.
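As an illustration, here is a minimal Java sketch of that pattern. It assumes an unsecured Registry on localhost:18080, a Jackson dependency for JSON parsing, and our own convention of putting a {"dataUri": ...} object in the comment; the bucket/flow IDs and the dataUri key are hypothetical, not Registry fields.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SnapshotDataLink {
    public static void main(String[] args) throws Exception {
        // Coordinates of the flow snapshot (hypothetical values)
        String registry = "http://localhost:18080/nifi-registry-api";
        String bucketId = "my-bucket-id";
        String flowId   = "my-flow-id";
        int version     = 3;

        // Fetch the versioned flow snapshot from the Registry REST API
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(registry + "/buckets/" + bucketId
                        + "/flows/" + flowId + "/versions/" + version))
                .header("Accept", "application/json")
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // The snapshot metadata carries the free-text comment for this version
        JsonNode snapshot = new ObjectMapper().readTree(response.body());
        String comment = snapshot.path("snapshotMetadata").path("comments").asText();

        // Our own convention: the comment itself is JSON, e.g. {"dataUri":"s3://bucket/config/v3.json"}
        String dataUri = new ObjectMapper().readTree(comment).path("dataUri").asText();
        System.out.println("External data for this flow version lives at: " + dataUri);
    }
}
```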

Related

How To Run A Nifi Template As Part Of A DH Ingestion Flow

We have a NiFi template that is used to extract data from a relational SQL Server DB and ingest every row of the tables as documents in MarkLogic.
Now we want to add DH into the mix and run the NiFi template as part of the ingestion flows to populate the staging DB.
What is the recommended approach for having the NiFi template called as part of an ingestion flow?
Is there any other recommended approach for extracting the data from the relational DB and ingesting it into MarkLogic during an ingestion flow?
Thanks for any help.
The easiest integration here is to run your NiFi flow and use the PutMarkLogic processor to write data to ML - that processor can be found in the MarkLogic-NiFi connector at https://github.com/marklogic/nifi/releases . You can configure PutMarkLogic with a REST transform. You'll want to use the mlRunIngest transform documented at https://docs.marklogic.com/datahub/5.4/tools/rest/rest-extensions.html so that you can reference your ingestion step configuration while writing data to ML.
The NiFi community generally recommends replacing Templates with NiFi Registry, which has built-in versioning and deployment controls that are far more accessible than Templates.
Both Templates and Versioned Flows can be automated using the REST endpoints in NiFi & Registry, and it is recommended to set variables using Parameter Contexts.
You can roll your own client for this, or you may wish to use my Python client NiPyAPI.
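If you do roll your own client, a minimal sketch of deploying a versioned flow from Registry through NiFi's REST API could look like the following (Java 11+ HttpClient with a Java 15 text block; the URL, IDs, and unsecured setup are assumptions, and a secured instance would also need an access token):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DeployVersionedFlow {
    public static void main(String[] args) throws Exception {
        // Hypothetical IDs - substitute your own parent group, registry client, bucket and flow
        String nifi = "http://localhost:8080/nifi-api";
        String parentGroupId = "parent-process-group-id";

        // Creating a process group with versionControlInformation imports the flow from Registry
        String body = """
            {
              "revision": { "version": 0 },
              "component": {
                "position": { "x": 0, "y": 0 },
                "versionControlInformation": {
                  "registryId": "registry-client-id",
                  "bucketId": "bucket-id",
                  "flowId": "flow-id",
                  "version": 1
                }
              }
            }""";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(nifi + "/process-groups/" + parentGroupId + "/process-groups"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Deploy request returned HTTP " + response.statusCode());
    }
}
```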

How to reload external configuration data in NiFi processor

I have a custom NiFi processor that uses external data for some user-controlled configuration. I want to know how to signal the processor to reload the data when it is changed.
I was thinking that a flowfile could be sent to signal the processor, but I am concerned that in a clustered environment only one node's processor would get the notification and all the others would still be running on the old configuration.
The most common ways to watch a file for changes are the JDK WatchService or Apache Commons IO Monitor...
https://www.baeldung.com/java-watchservice-vs-apache-commons-io-monitor-library
https://www.baeldung.com/java-nio2-watchservice
Your processor could use one of these and reload the data when the file changes; just make sure to synchronize access to the relevant fields in your processor between the code that reloads them and the code that uses them during execution.
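As a rough sketch of that pattern with the JDK WatchService (the file path and properties format are assumptions; in a real processor you would start/stop the watch thread from @OnScheduled/@OnStopped lifecycle methods and read config from onTrigger()):

```java
import java.io.FileInputStream;
import java.nio.file.*;
import java.util.Properties;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/**
 * Watches a properties file and atomically swaps in the new contents when it changes.
 * Readers call get() and always see a complete, consistent snapshot of the config.
 */
public class ReloadableConfig implements Runnable {
    private final Path file;  // e.g. rules.properties next to the working directory (assumed)
    private final AtomicReference<Properties> config = new AtomicReference<>(new Properties());
    private volatile boolean running = true;

    public ReloadableConfig(Path file) throws Exception {
        this.file = file;
        reload();  // initial load
    }

    public Properties get() { return config.get(); }
    public void stop()      { running = false; }

    private void reload() throws Exception {
        Properties p = new Properties();
        try (FileInputStream in = new FileInputStream(file.toFile())) {
            p.load(in);
        }
        config.set(p);  // atomic swap: no reader ever sees a half-loaded config
    }

    @Override
    public void run() {
        try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
            file.toAbsolutePath().getParent().register(watcher, StandardWatchEventKinds.ENTRY_MODIFY);
            while (running) {
                WatchKey key = watcher.poll(1, TimeUnit.SECONDS);
                if (key == null) continue;  // timed out, re-check the running flag
                for (WatchEvent<?> event : key.pollEvents()) {
                    if (file.getFileName().equals(event.context())) {
                        reload();  // our file changed: swap in the new config
                    }
                }
                key.reset();
            }
        } catch (Exception e) {
            // In a real processor, log via ComponentLog and keep serving the last good config
        }
    }

    // Example usage outside NiFi: watch ./rules.properties and print a (hypothetical) value
    public static void main(String[] args) throws Exception {
        ReloadableConfig cfg = new ReloadableConfig(Path.of("rules.properties"));
        new Thread(cfg).start();
        while (true) {
            System.out.println("threshold = " + cfg.get().getProperty("threshold"));
            Thread.sleep(5000);
        }
    }
}
```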

How to send updated data from java program to NiFi?

I have microservices running, and when a web user updates data in the DB using microservice endpoints, I want to send the updated data to NiFi as well. This data contains an updated list of names, deleted names, edited names, etc. How do I do it? Which processor do I have to use on the NiFi side?
I am new to NiFi and have yet to try anything myself. I am reading documentation found through Google to guide me.
No source code is written yet. I want to get started, and I will share it here once I write it.
The expected result is that NiFi gets the updated list of names and refers to that updated list for generating the required alerts/triggers, etc.
You can actually do it in lots of ways: MQ, Kafka, HTTP (using ListenHTTP), or even listening to a directory (using ListFile & FetchFile). Just deploy the one most relevant to you and configure it.
You can connect NiFi to pretty much everything, so just choose how you want to connect your microservices to NiFi.
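For the ListenHTTP option, for instance, the microservice side is just an HTTP POST. A minimal Java sketch, assuming ListenHTTP is configured with Listening Port 9090 and Base Path contentListener (adjust both to your processor's settings; the payload shape is also just an example):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PushNamesToNiFi {
    public static void main(String[] args) throws Exception {
        // Must match the ListenHTTP processor's Listening Port and Base Path (assumed values)
        String listenHttpUrl = "http://nifi-host:9090/contentListener";

        // The updated list of names, serialized however you like (JSON here)
        String updatedNames = "{\"updated\":[\"alice\",\"bob\"],\"deleted\":[\"carol\"]}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(listenHttpUrl))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(updatedNames))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // ListenHTTP turns each accepted request body into a FlowFile
        System.out.println("NiFi responded with HTTP " + response.statusCode());
    }
}
```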

Apache NiFi deployment in production

I am new to Apache NiFi. From the documentation, I understand that NiFi is a framework with a drag-and-drop UI that helps build data pipelines. A NiFi flow can be exported as a template, which is then saved in Git.
Are we supposed to import this template into the production NiFi server? Once it is imported, are we supposed to manually start all the processors via the UI? Please help.
Templates are just example flows to share with people and are not really meant for deployment. Please take a look at NiFi Registry and the concept of versioned flows.
https://nifi.apache.org/registry.html
https://www.youtube.com/watch?v=X_qhRVChjZY&feature=youtu.be
https://bryanbende.com/development/2018/01/19/apache-nifi-how-do-i-deploy-my-flow
A template is an XML representation of your flow structure (processors, process groups, controllers, relationships, etc.). You can upload it to another NiFi server for deployment. You can also start all the processors via the nifi-api.
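For instance, starting every processor in a process group through the REST API is a single call to the flow resource. A minimal Java sketch, assuming an unsecured NiFi on localhost:8080 and a known process group id (a secured instance also needs an Authorization header):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class StartProcessGroup {
    public static void main(String[] args) throws Exception {
        // Assumed unsecured NiFi; the process group id comes from the UI URL or the API
        String nifi = "http://localhost:8080/nifi-api";
        String groupId = "your-process-group-id";

        // Schedules every processor in the group (and its children) to RUNNING
        String body = "{\"id\":\"" + groupId + "\",\"state\":\"RUNNING\"}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(nifi + "/flow/process-groups/" + groupId))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Start request returned HTTP " + response.statusCode());
    }
}
```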

How to implement search functionalities?

Currently we are implementing an in-memory cache mechanism for our search functionality.
Now the data is getting very large and we are not able to handle it in memory. We are also getting more input sources from different systems (Oracle, flat files, and Git).
Can you please share how we can achieve this?
We thought ES would help with this. But how can we provide input if any changes happen in the end source? (Batch processing will NOT help.)
Hadoop - we are NOT handling that level of data.
Also share your thoughts.
we are getting more input source from different systems (oracle, flat file, and git)
I assume that's why you tagged Kafka? It'll work, but you bring up a valid point:
But how can we provide input if any changes happen...?
For plain text or Git events, you'll obviously need to alter some parsing logic and restart the job to get the extra data into the message schema.
For Oracle, the GoldenGate product will publish table column changes, and Kafka Connect can recognize those events and update the payload accordingly.
If all you care about is searching things, plenty of tools exist. But since you mention Elasticsearch: Filebeat works for plain text, and Logstash can work with various other types of input sources. If you have Kafka, then feed events to Kafka and let Logstash or Kafka Connect update ES.
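As an illustration of the "feed events to Kafka" step, here is a minimal Java producer sketch (the broker address, topic name, and event shape are assumptions; the Logstash or Kafka Connect side would consume the topic and index the documents into ES):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class SearchEventProducer {
    public static void main(String[] args) {
        // Assumed broker address; serialize keys and values as plain strings
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // One event per change in a source system (Oracle row, file line, Git commit, ...)
            String event = "{\"source\":\"oracle\",\"op\":\"update\",\"id\":42,\"name\":\"new value\"}";
            producer.send(new ProducerRecord<>("search-updates", "42", event));
            producer.flush();
        }
    }
}
```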
