How to empty all the queues in Nifi at a time? - apache-nifi

I have started exploring NiFi. I have built a flow that works, but I want to clear all the queues at once so I can re-test the flow each time I make a change. I know we can stop and start each processor and test step by step, but I want to know whether there is a way to clear all the queues at once.

The easiest way is to stop NiFi, delete the following repository folders, and start it again:
content_repository
database_repository
flowfile_repository
provenance_repository
Another approach is to use the nifi-api to get a list of all queues (connections) and then call the drop-request endpoint to empty each of them.
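The nifi-api route can be scripted with nothing but the Python standard library. The sketch below assumes an unsecured NiFi instance at http://localhost:8080 (host, port, and lack of authentication are assumptions, not from the question); the endpoints used (/process-groups/root/connections and /flowfile-queues/{id}/drop-requests) are the standard NiFi REST API, and this version only walks connections directly under the root group - nested process groups would need a recursive walk:

```python
import json
import urllib.request

BASE = "http://localhost:8080/nifi-api"  # assumed unsecured local instance


def connection_ids(connections_json):
    """Extract connection ids from a /process-groups/{id}/connections response."""
    return [c["id"] for c in connections_json.get("connections", [])]


def empty_all_queues(base=BASE):
    """Issue a drop request for every connection under the root process group."""
    with urllib.request.urlopen(base + "/process-groups/root/connections") as resp:
        connections = json.load(resp)
    for conn_id in connection_ids(connections):
        req = urllib.request.Request(
            base + "/flowfile-queues/" + conn_id + "/drop-requests",
            method="POST",
        )
        # NiFi creates a drop request and empties the queue asynchronously
        urllib.request.urlopen(req)
```

Calling empty_all_queues() between test runs then clears every root-level queue; since NiFi processes each drop request asynchronously, very large queues may take a moment to empty.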

Related

NiFi how to release flow file until a process downstream is finished

I am designing a data ingestion pattern using NiFi. One process needs to stop releasing flow files until a downstream process has finished. I tried to use Wait and Notify but have not had any success. I am hoping the queue size and back pressure can be set across a few processors.
Similarly, is there a way to implement this logic: don't allow flow files in if one is currently being processed between multiple processors?
Any help is appreciated
You need a combination of MonitorActivity with ExecuteStreamCommand (running a Python "nipyapi" script).
I have a similar requirement in one of my working flows.
You will need to install the Python library nipyapi first and create this script on the NiFi box:
from time import sleep
import nipyapi
nipyapi.utils.set_endpoint('http://ipaddress:port/nifi-api', ssl=False, login=False)
## Get PG ID using the PG Name
mypg = nipyapi.canvas.get_process_group('start')
nipyapi.canvas.schedule_process_group(mypg.id, scheduled=True) ## Start
sleep(1)
nipyapi.canvas.schedule_process_group(mypg.id, scheduled=False) ## Stop
I have put the template in the image at the link below; see the configuration on the MonitorActivity processor - it generates a flow file if no activity has happened for 10 seconds (you can play with the timings, though).
Download template
Note: this is not a very good approach if you have high latency requirements.
Another idea would be to monitor the aggregate queue across the entire flow and restart the start flow when the queued count reaches zero (this would be very intensive if you have a lot of connections).
I was able to design a solution within NiFi. Essentially, it uses GenerateFlowFile as a signal (set to run only once ever). The trick is to have the newly generated flow file merge with the original input flow through defragmentation. Each time the flow finishes, the success branch is then able to merge with the next input flow file.
Solution Flow

Execute a method only once at start of an Apache Storm topology

If I have a simple Apache Storm topology with a spout (set to a parallelism of 2) running on two separate nodes, how can I write a method that will be run once, and only once, at the start of the topology, before any processing of tuples has begun?
Any implementation of a singleton/static class, or synchronized method alone will not work, as the two instances are running on separate nodes.
Perhaps there are some Storm methods that I can use to decide if I'm the first Spout to be instantiated, and run only then? I tried playing around with the getThisTaskId() & getThisWorkerTasks() methods, but was unsuccessful.
NOTE: The parallelism of 2 is to keep things simple. A solution should work for any number of nodes/workers.
Edit: Thought of an easier solution. I'll leave the original answer below in case it is helpful.
You can use TopologyContext.getThisTaskIndex to do this. If you make your spout's open method run the code only when TopologyContext.getThisTaskIndex() == 0, your code will run only once, before any tuples are emitted.
If the worker that ran this code crashes, the code will run again when the spout instance with task index 0 is restarted. To fix this, you can use Zookeeper to store state that should carry over across restarts, e.g. put a flag in Zookeeper once the run-once code has executed, and have the spout's open method check that the flag is not set before running the code.
You can use TopologyContext.getStormId to get a constant unique string to identify the topology, so you can tell whether the flag was set by this topology or a previous deployment.
Original answer:
The easiest way to run some code only once on deployment of a topology, is to call the code when you submit the topology. You can call the only-once code at the same time as you wire your topology with TopologyBuilder. This will only get run once. The downside is it will run on the machine you're calling storm jar from.
If you for some reason can't do it this way or need to run the code from one of the worker nodes, there isn't anything built in to Storm to allow you to do this. The reason there isn't such a mechanism is that it requires extra coordination between the worker JVMs, and I don't think anyone has needed something like this.
The best option for you would probably be to look at Zookeeper/Curator to do this coordination (see https://curator.apache.org/curator-recipes/index.html). This should allow you to make only one worker in the cluster run your code. You'll have to consider what should happen if the worker chosen to run your code crashes/stalls.
Storm already uses Zookeeper for coordination, so you can just connect to that cluster.

Is there a way to limit the number of File Handler instances?

I have a component which uses Spring Integration's file support to kick off a process each time a file arrives at a location. I can see from the log files that two threads/processes/instances are running. Is there a way to limit it to one?
The second process/thread appears to kick off almost immediately after the first and they are interfering with each other. The first instance processes the file but then the second tries to do the same and hits a filenotfound exception because the first moved it.
First of all, consider configuring the poller for your file inbound channel adapter with fixedDelay instead of fixedRate. That way, the next polling task does not start until the previous one has finished.
Also consider using a filter so that the same file is not processed again. I'm not sure what your use case is, but the simple AcceptOnceFileListFilter should be enough. There is a prevent-duplicates option on the channel adapter for convenience.
See more info in the Reference Manual: https://docs.spring.io/spring-integration/docs/current/reference/html/#files
And also about the poller behavior: https://docs.spring.io/spring-integration/docs/current/reference/html/#channel-adapter-namespace-inbound
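Put together, the two suggestions above might look like the following in the XML namespace configuration; the directory, channel name, and delay are placeholders for illustration, not from the original question:

```xml
<int-file:inbound-channel-adapter id="filesIn"
                                  channel="fileChannel"
                                  directory="/path/to/input"
                                  prevent-duplicates="true">
    <!-- fixed-delay: the next poll starts only after the previous one finishes -->
    <int:poller fixed-delay="5000"/>
</int-file:inbound-channel-adapter>
```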

Kubernetes Job: exactly one pod

I want to run only one pod of my Kubernetes app at a time (relaunching it in case of failure), so I am using the Job controller.
But as per the documentation, Kubernetes may launch more than one pod and will eventually achieve the specified number of completions. Is there any way to achieve exactly one pod at a time, or a recommended design pattern for such use cases?
My app is reading data from HDFS and writing it to a message queue. It exits after processing all the files. I want to minimize possibility of writing duplicate records.
I suggest you use a ReplicaSet for this. Set the number of replicas to 1. More here: https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/#when-to-use-a-replicaset
In principle, for Jobs with no parallelism implied, there shouldn't be this kind of "race condition" (parallelism "should be 1" according to the documentation [1]). The Job would be rescheduled only if an attempt fails. Did you come across a situation where two pods from the same Job were being executed at the same time?
In any case, if you want to be completely sure, you may want to implement an extra coordination method or external solution.
[1] https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/
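Making that "no parallelism" explicit in the Job spec might look like the sketch below; the name and image are placeholders, and parallelism: 1 together with completions: 1 asks Kubernetes to run a single pod to completion, retrying on failure:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: hdfs-to-queue        # placeholder name
spec:
  completions: 1             # the Job is done after one successful run
  parallelism: 1             # at most one pod at a time
  backoffLimit: 3            # retry a failed pod up to 3 times
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: worker
        image: my-app:latest # placeholder image
```

Note that even then, a pod Kubernetes believes has failed may briefly overlap with its replacement, so idempotent writes or deduplication on the message-queue side remain the safest guard against duplicate records.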
If I understand your question correctly, I think you are looking for .spec.strategy.rollingUpdate.maxSurge.
If you set this to 0, the existing pods will be killed before a new one is started.

storm - How to check if a topology is idle or running?

This can be determined by checking whether a particular bolt is processing, whether the bolt still has tuples queued for it, or something like that.
What I want, in short, is to know, in any way, whether a topology has finished its work or not.
I know it sounds contradictory, since a topology should never have its work "done", but I'm using it for tests, and at the beginning I have a finite amount of data rather than a non-stop stream.
To check the running topologies and their statuses you can run:
{dir/to/storm}/bin/storm list
You can also navigate to the running storm UI and check topologies/logs from there.
If you want to check whether work has been performed on a tuple, you can add your own logging. I have added some logic to print out how many tuples are processed each second, which I find useful.
You can check it from the Storm UI, which I think is the easiest way.
