Using Apache NiFi in a Docker instance, for a beginner

So, I want, very basically, to be able to spin up a container which runs NiFi, with a template I already have. I'm very new to containers, and fairly new to NiFi. I think I know how to spin up a NiFi container, but not how to make it automatically run my template every time.

You can use the apache/nifi Docker image found here as a starting point, and use a Docker RUN/COPY command to inject your desired flow. There are three ways to load an existing flow into a NiFi instance:
1) Export the flow as a template (an XML file containing the exported flow segment) and import it into your running NiFi instance. This requires the "destination" NiFi instance to be running and uses the NiFi REST API (see the sketch after this answer).
2) Create the flow you want, manually extract the entire flow from the "source" NiFi instance by copying $NIFI_HOME/conf/flow.xml.gz, and overwrite the flow.xml.gz file in the "destination" NiFi's conf directory. This does not require the destination NiFi instance to be running, but it must happen before the destination NiFi starts.
3) Use the NiFi Registry to version-control the original flow segment from the source NiFi and make it available to the destination NiFi. This seems like overkill for your scenario.
I would recommend Option 2, since you presumably already have the flow built exactly the way you want it. Simply use COPY /src/flow.xml.gz /destination/flow.xml.gz in your Dockerfile.
If you literally want it to "run my template every time", you probably want to ensure that the processors are all running (showing the "Play" icon) when you copy/save off the flow.xml.gz file, and that nifi.flowcontroller.autoResumeState=true is set in your nifi.properties.
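If you do go with Option 1 instead, the template import can be scripted against the NiFi REST API. This is only a rough sketch, assuming an unsecured NiFi 1.x instance on localhost:8080, a template file named my_template.xml, and the "root" process-group alias; the exact endpoints can differ between versions:

```python
import xml.etree.ElementTree as ET
import requests

NIFI_API = "http://localhost:8080/nifi-api"  # assumed unsecured NiFi instance

# Upload the template XML into the root process group.
with open("my_template.xml", "rb") as f:
    upload = requests.post(
        f"{NIFI_API}/process-groups/root/templates/upload",
        files={"template": f},
    )
upload.raise_for_status()

# The upload response is XML; pull the new template's id out of it.
template_id = ET.fromstring(upload.text).find(".//id").text

# Instantiate the template onto the canvas at the origin.
requests.post(
    f"{NIFI_API}/process-groups/root/template-instance",
    json={"templateId": template_id, "originX": 0.0, "originY": 0.0},
).raise_for_status()
```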

Related

How to send updated data from a Java program to NiFi?

I have micro-services running, and when a web user updates data in the DB through a micro-service endpoint, I want to send the updated data to NiFi as well. This data contains an updated list of names, deleted names, edited names, etc. How do I do it? Which processor should I use on the NiFi side?
I am new to NiFi and have yet to try anything myself. I am reading documentation found via Google to guide me.
No source code is written yet; I will share it here once I write it.
The expected result is that NiFi gets the updated list of names and refers to that list when generating the required alerts/triggers, etc.
You can actually do it in lots of ways: MQ, Kafka, or HTTP (using ListenHTTP). Just deploy the one that is relevant to you and configure it; you can even listen to a directory (using ListFile & FetchFile).
You can connect NiFi to pretty much everything, so just choose how you want to connect your microservices to NiFi.
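For example, if you drop a ListenHTTP processor onto the canvas, your microservice can simply POST the updated names to it whenever the DB changes. A minimal sketch, assuming ListenHTTP is configured with Listening Port 8081 and the default contentListener base path (both are properties you set on the processor, so adjust the URL to your own setup):

```python
import json
import requests

# Assumed endpoint: ListenHTTP with Listening Port 8081 and Base Path "contentListener".
NIFI_LISTEN_URL = "http://nifi-host:8081/contentListener"

def push_name_changes(updated, deleted, edited):
    """POST the latest name changes to NiFi; each request body becomes a FlowFile."""
    payload = {"updated": updated, "deleted": deleted, "edited": edited}
    resp = requests.post(
        NIFI_LISTEN_URL,
        data=json.dumps(payload),
        headers={"Content-Type": "application/json"},
    )
    resp.raise_for_status()

# Call this from your micro-service after every successful DB update.
push_name_changes(["alice", "bob"], ["carol"], [{"old": "dave", "new": "david"}])
```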

How do I run a cron job inside a Kubernetes pod/container which has a running Spring Boot application?

I have a Spring Boot application running in a container. One of the APIs is a file upload API, and every time a file is uploaded it has to be scanned for viruses. We use uvscan to scan the uploaded file. I'm looking at adding uvscan to the base image, but the virus definitions need to be updated on a daily basis. I've created a script to update the virus definitions. The simplest approach currently is to run a cron job inside the container which invokes the script. Is there an alternative? Can the uvscan utility be isolated from the app pod and invoked from the application?
There are many ways to solve this problem; I hope I can help you find what suits you best.
From my perspective, it would be pretty convenient to have a CronJob that builds and pushes a new Docker image with uvscan and the updated virus-definition database on a daily basis.
In your file-processing sequence you can then create a scan Job through the Kubernetes API (sketched below) and give it access to a shared volume containing the file you need to scan.
The scan Job will use the :latest image, so whenever a new image appears in the registry it will pull that image and create its pod from it.
The downside is that building images daily consumes a certain amount of disk space, so you may need a process for removing old images from the registry and from the Docker cache on each node of the Kubernetes cluster.
Alternatively, you can put the AV database on a shared volume, or use Mount Propagation, and update it independently of the pods. If uvscan opens the AV database in read-only mode, this should be possible.
On the other hand, it usually takes time to load virus definitions into memory, so it might be better to run the virus scanner as a Deployment rather than a Job, restarting it after a new image is pushed to the registry.
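As a rough illustration of the "create a scan Job through the Kubernetes API" idea, here is a sketch using the official Python client. The image name, namespace, volume claim, and scan command are all assumptions you would replace with your own:

```python
from kubernetes import client, config

def launch_scan_job(file_path: str) -> None:
    """Create a one-off Job that runs the scanner image against a file on a shared volume."""
    config.load_incluster_config()  # use config.load_kube_config() outside the cluster

    container = client.V1Container(
        name="uvscan",
        image="registry.example.com/uvscan:latest",  # assumed scanner image
        command=["uvscan", file_path],                # assumed scan command
        volume_mounts=[client.V1VolumeMount(name="uploads", mount_path="/uploads")],
    )
    pod_spec = client.V1PodSpec(
        restart_policy="Never",
        containers=[container],
        volumes=[client.V1Volume(
            name="uploads",
            persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
                claim_name="uploads-pvc"              # assumed shared PVC with the uploaded files
            ),
        )],
    )
    job = client.V1Job(
        metadata=client.V1ObjectMeta(generate_name="virus-scan-"),
        spec=client.V1JobSpec(template=client.V1PodTemplateSpec(spec=pod_spec), backoff_limit=0),
    )
    client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```

Because the tag is :latest, Kubernetes defaults the imagePullPolicy to Always, so each newly pushed daily image is picked up the next time a Job is created.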
At my place of work, we also run our dockerized services on EC2 instances. If you only need to update the definitions once a day, I would recommend using an AWS Lambda function. It's relatively affordable, and you don't need to worry about the overhead of a scheduler, etc. If you need help setting up the Lambda, I can always provide more context. Nevertheless, I'm only offering another solution for you in the AWS realm of things.
So basically I simply added a cron job to the container running the application to update the virus definitions.

Do we need to stop NiFi when we dynamically update an existing workflow?

In my existing NiFi workflow I have the following scenario:
1) I may have to add a new Kafka topic dynamically.
2) I may have to add a new route.
3) I may need to add a new processor (not sure whether we can add a new processor to a running flow!).
My question is: if my NiFi workflow is running in production, do I need to stop and restart it when I dynamically (on the fly) update the workflow?
In what scenario would I need to bring it down? Thanks.
You don't need to restart the whole NiFi instance, but some changes may require stopping and starting a specific processor. For example, if you have a ConsumeKafka processor with a list of topics, you would have to stop it, add the new topic to the property, and start it again. You could script these operations using the REST API.
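As a rough sketch of scripting that stop / update / start cycle with Python against the NiFi 1.x REST API (an unsecured instance and the processor id are assumed, and the property key for the topic list may differ between ConsumeKafka versions, so check it in your flow):

```python
import requests

NIFI_API = "http://localhost:8080/nifi-api"       # assumed unsecured NiFi instance
PROCESSOR_ID = "your-consumekafka-processor-id"   # copy this from the processor's details

def set_run_status(state: str) -> None:
    """Stop or start the processor; NiFi requires the current revision on every change."""
    entity = requests.get(f"{NIFI_API}/processors/{PROCESSOR_ID}").json()
    requests.put(
        f"{NIFI_API}/processors/{PROCESSOR_ID}/run-status",
        json={"revision": entity["revision"], "state": state},
    ).raise_for_status()

def update_topics(topics: str) -> None:
    """Overwrite the topic list; "topic" is the key recent ConsumeKafka processors
    use for "Topic Name(s)", but verify it for your version."""
    entity = requests.get(f"{NIFI_API}/processors/{PROCESSOR_ID}").json()
    body = {
        "revision": entity["revision"],
        "component": {"id": PROCESSOR_ID, "config": {"properties": {"topic": topics}}},
    }
    requests.put(f"{NIFI_API}/processors/{PROCESSOR_ID}", json=body).raise_for_status()

set_run_status("STOPPED")
update_topics("orders,payments,new-topic")
set_run_status("RUNNING")
```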

How does one setup a Distributed Map Cache for NiFi?

I'm brand new to NiFi and simply playing around with processors.
I'm trying to incorporate Wait and Notify processors in my testing, but I have to set up a Distributed Map Cache (server and client?).
The NiFi documentation assumes a level of understanding that I do not have.
I've installed memcached on my computer (macOS) and verified that it's running on Port 11211 (default). I've created a DistributedMapCacheClientService and DistributedMapCacheServer under NiFi's CONTROLLER SERVICES, but I'm getting java.net.SocketTimeoutException & other errors.
Is there a good tutorial on this entire topic? Can someone suggest how to move forward?
The DistributedMapCacheClientService and DistributedMapCacheServer do not require any additional software (such as memcached).
To create these services, right-click on the canvas, select Configure, and then select the Controller Services tab. You can then add new services by clicking the + button on the right and searching by name.
Create a DistributedMapCacheServer with the default parameters (port 4557) and enable it. This will start the built-in cache server.
Create a DistributedMapCacheClientService with hostname localhost and the other default parameters, and enable it.
Create a simple flow: a GenerateFlowFile processor with a run schedule set and a non-zero file size in its properties.
Connect it to PutDistributedMapCache, set the Cache Entry Identifier to Key01, and choose your DistributedMapCacheClientService.
Try to run it; if port 4557 is not used by other software, the put to the cache should work.
@Darshan: Yes, it will work, because the documentation of DistributedMapCacheClientService says that it:
Provides the ability to communicate with a DistributedMapCacheServer. This can be used in order to share a Map between nodes in a NiFi cluster

elasticsearch.yml: will the configuration file be written on the fly?

I am building my own Docker image for the Elasticsearch application.
One question I have : will the configuration file elasticsearch.yml be modified by the application on the fly?
I hope that will never happen, even if the node is running in a cluster. But some other applications (like Redis) modify their config file on the fly when the cluster status changes. And if the configuration file changes on the fly, I have to export it as a volume, since a Docker image cannot retain changes made on the fly.
No, you don't run any risk of overwriting your configuration file. The configuration will be read from that file and kept in memory. ES also allows you to change settings persistently at runtime, but they are stored in another global cluster state file (in data/CLUSTER_NAME/nodes/N/_state, where N is the 0-based node index) and re-read on each restart.
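Those runtime changes go through the cluster settings API rather than the file. A quick sketch with Python; the host/port and the particular setting are just examples, and a secured cluster would also need authentication:

```python
import requests

ES = "http://localhost:9200"  # assumed local, unsecured Elasticsearch node

# Persistent settings survive a full cluster restart, but they live in the cluster
# state under the data directory; elasticsearch.yml is never rewritten.
resp = requests.put(
    f"{ES}/_cluster/settings",
    json={"persistent": {"cluster.routing.allocation.enable": "all"}},
)
resp.raise_for_status()
print(resp.json())
```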
