I have a NiFi flow in which I'm getting data from Elasticsearch and, after some processing, saving each flow file to a destination. Once all the flow files are saved, I want to notify the Merge processor to merge the data from the files into one large CSV file. My problem is how to notify the Merge processor that all the files have been saved and that it should now start merging them.
Thanks for the help.. :)
NiFi has Wait/Notify processors for exactly this kind of signalling. What I would suggest is to route the success relationship of whichever processor you are using to save the files to a Notify processor, which then releases a corresponding Wait processor. The Wait processor can in turn be placed in front of the Merge processor to merge those files. It is indeed possible to use Wait/Notify in NiFi for complex coordination like this; see Pierre Villard's NiFi workflow with Wait/Notify.
Although this isn't what you're specifically trying to do here, it might help you understand Wait/Notify a bit better: Wait/Notify Moving Zendesk
We have multiple teams' NiFi applications running on the same NiFi machine. Is there any way to capture the logs specific to my application? Also, by default the nifi-app.log file makes it difficult to track down issues, and the bulletin board shows error messages for only 5 minutes. How can I get the errors captured and send an email alert in NiFi?
Please help me to get through this. Thanks in advance!
There are a couple ways to approach this. One is to route failure relationships from processors to a PutEmail processor which can send an alert on errors. Another is to use a custom reporting task to alert a monitoring service when a certain number of flowfiles are in an error queue.
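For the reporting-task option, a rough sketch of what a custom NiFi ReportingTask could look like is below. This is only an illustration, not the answer's actual implementation: the threshold, the convention of naming error connections with "error", and logging a warning instead of calling a real monitoring service are all assumptions.

```java
import org.apache.nifi.controller.status.ConnectionStatus;
import org.apache.nifi.controller.status.ProcessGroupStatus;
import org.apache.nifi.reporting.AbstractReportingTask;
import org.apache.nifi.reporting.ReportingContext;

public class ErrorQueueAlertingTask extends AbstractReportingTask {

    private static final int THRESHOLD = 100; // illustrative threshold

    @Override
    public void onTrigger(ReportingContext context) {
        // Walk the flow status tree starting at the root process group.
        check(context.getEventAccess().getControllerStatus());
    }

    private void check(ProcessGroupStatus group) {
        for (ConnectionStatus conn : group.getConnectionStatus()) {
            // Assumption: error queues are connections whose name contains "error".
            if (conn.getName() != null
                    && conn.getName().toLowerCase().contains("error")
                    && conn.getQueuedCount() > THRESHOLD) {
                // Placeholder alert: a real task would call a monitoring service here.
                getLogger().warn("Error queue '{}' has {} queued flowfiles",
                        new Object[]{conn.getName(), conn.getQueuedCount()});
            }
        }
        for (ProcessGroupStatus child : group.getProcessGroupStatus()) {
            check(child); // recurse into nested process groups
        }
    }
}
```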
Finally, we have heard that in multitenant environments, log parsing is difficult. While NiFi aims to reduce or completely eliminate the need to visually inspect logs by providing the data provenance feature, in the event you do need to inspect the logs, we recommend searching the log by processor ID to isolate relevant messages. You can also use NiFi itself to ingest those same logs and perform parsing and filtering activities if desired. Future versions may improve this experience.
By parsing the NiFi log you can separate out the entries specific to your team's applications, using the process group ID and the NiFi REST API. Check the link below for a NiFi template and Python code that address this issue:
https://link.medium.com/L6IY1wTimV
You can route all the errors in a Process Group to the same processor. It could be a regular UpdateAttribute or a custom processor; that processor adds the path and all the relevant information and then sends the flow file to a general error/logging flow, which inspects the error information inside the flow file and decides whether to send an email, to whom, and so on.
With this approach the system stays simple and entirely inside NiFi, so you don't add extra layers of complexity, and you end up with only one error-handling processor per Process Group.
This is the way we are managing errors in my company.
We have a requirement to parse lots of incoming files (arriving in a directory), process them, and put the outcome onto AWS Kinesis for each file.
The frequency of files can be 60,000 per day and files can arrive every 15 seconds. Each file may contain about 1000 entries.
Can spring-integration handle this load?
Would there be any issues processing this kind of volumes?
As the files come in to an inbound-channel-adapter, can we execute a service-activator for each file?
I believe we need to use task-executors on the channels with a poller? Any examples?
Would task-executors call the service-activators in a multi-threaded manner?
Any pointers would be helpful. Links to any code examples would be nice.
This is not the kind of question one asks here on SO - it's too broad, with too many questions in a single thread. I assume that even if I answer all of them you are going to ask more, and SO is not good for Q&A chat. Anyway:
Yes, Spring Integration can handle this. You can use a simple FileReadingMessageSource to poll the directory periodically.
Each file (as the message payload) can be fed to a FileSplitter to parse it line by line.
After splitter you indeed can use an ExecutorChannel to process those lines in parallel.
The service activator can be called in a multi-threaded environment as long as it is thread-safe.
In the end you can use the KinesisMessageHandler to send records to AWS Kinesis. And yes, this one can be used from different threads as well.
You can find all the information in the Spring Integration Reference Manual. Some of the samples may help you as well, and there is also the Spring Integration AWS extension.
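To make that concrete, here is a minimal sketch using the Java DSL, assuming spring-integration-file and spring-integration-aws (the AmazonKinesisAsync flavour) are on the classpath. The directory, poll interval, thread-pool size, stream name, partition key and the lineProcessor bean are placeholders, not anything prescribed above, and the exact DSL factory names can differ between Spring Integration versions.

```java
import java.io.File;
import java.util.concurrent.Executors;

import com.amazonaws.services.kinesis.AmazonKinesisAsync;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.aws.outbound.KinesisMessageHandler;
import org.springframework.integration.config.EnableIntegration;
import org.springframework.integration.dsl.IntegrationFlow;
import org.springframework.integration.dsl.IntegrationFlows;
import org.springframework.integration.dsl.Pollers;
import org.springframework.integration.file.dsl.Files;
import org.springframework.messaging.MessageHandler;

@Configuration
@EnableIntegration
public class FileToKinesisFlow {

    @Bean
    public IntegrationFlow fileFlow(AmazonKinesisAsync amazonKinesis) {
        return IntegrationFlows
                .from(Files.inboundAdapter(new File("/data/incoming")),      // watched directory (placeholder)
                        e -> e.poller(Pollers.fixedDelay(5000)))             // poll every 5 seconds
                .split(Files.splitter())                                     // FileSplitter: one message per line
                .channel(c -> c.executor(Executors.newFixedThreadPool(8)))   // ExecutorChannel: parallel line processing
                .handle("lineProcessor", "process")                          // your thread-safe service activator (placeholder bean)
                .handle(kinesisMessageHandler(amazonKinesis))                // send the result to Kinesis
                .get();
    }

    @Bean
    public MessageHandler kinesisMessageHandler(AmazonKinesisAsync amazonKinesis) {
        KinesisMessageHandler handler = new KinesisMessageHandler(amazonKinesis);
        handler.setStream("my-stream");           // placeholder stream name
        handler.setPartitionKey("partitionKey");  // static partition key, just for the sketch
        return handler;
    }
}
```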
My Raspberry Pi 2 is running fine with Windows 10 and I'm able to control an LED from the internet using .NET MF. Now I want to send my LED ON/OFF signal (I'm going to use a temperature sensor instead of the LED) to a big data platform for storage, analysis, and retrieval.
I checked on the net but couldn't find a simple and easy way to do that. Could anyone please suggest a tutorial on "How can I send real-time data to Hadoop"? I want to understand the whole architecture before proceeding.
Which technologies/components should I concentrate on to build such a POC?
Note: I think I need some combination of an MQTT broker, Spark or Storm, etc., but I'm not sure how to put everything together to make it practically possible. Please correct me if I'm wrong and help.
You could send the signals to Hadoop in real time as a stream of events, using one of the several components that make up the Hadoop "ecosystem". Systems such as Spark or Storm, which process data in real time, are only necessary if you want to apply logic to the stream as it arrives. If you just want to batch up the events and store them in HDFS for later retrieval by a batch process, you could use:
Apache Flume. A Flume agent runs on one or more of the Hadoop nodes and listens on a port. Your Raspberry Pi sends each event one by one to that port. Flume buffers the events and then writes them to HDFS: https://flume.apache.org/FlumeUserGuide.html
Apache Kafka. Your Raspberry Pi sends the events one by one to a Kafka broker, which stores them in a message queue (a topic). A separate distributed batch process runs periodically on Hadoop to move the events from Kafka to HDFS. This solution is more robust but has more moving parts.
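If you go the Kafka route, the device-side (or gateway-side) producer can be very small. Here is a rough sketch in Java; the broker address, topic name and JSON payload are placeholders, and getting the messages from Kafka into HDFS would still be handled by a separate periodic job as described above.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SensorEventProducer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker:9092"); // placeholder broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // One record per sensor reading; a periodic batch job later moves
            // these messages from the topic into HDFS.
            String event = "{\"sensor\":\"temp-1\",\"value\":23.5,\"ts\":"
                    + System.currentTimeMillis() + "}";
            producer.send(new ProducerRecord<>("sensor-events", "temp-1", event));
        }
    }
}
```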
In an Elastic MapReduce streaming job, what happens if a mapper suddenly dies? Will the data that was already processed be replayed? If so, is there any option to disable that?
I am asking because I am using EMR to insert some data into a third-party database. Every mapper sends its incoming data over HTTP. If a mapper crashes, I don't want to replay the HTTP requests; I need to continue from where I left off.
MR is a fault tolerant framework. When a Map task fails (streaming API or Java API) the behavior is the same.
Once the job tracker is notified that the task has failed it will try and reschedule the task. The temporary output generated by the failed task is deleted.
A more detailed discussion on how failures are handled in MR can be seen here
For your particular case, I think you need to query the external source in your setup() method to find out which records have already been processed, then use that information in your map() method to decide whether a particular record should be processed or not.
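A rough sketch of that idea using the Java MapReduce API is below (for a streaming job the same lookup would happen when the script starts). The ID-lookup and HTTP-delivery methods are hypothetical placeholders, and the assumption that the record ID is the first CSV field is mine, not from the question.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class IdempotentHttpMapper
        extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    private final Set<String> alreadySent = new HashSet<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Placeholder: query the third-party system (or a side store you maintain)
        // for the IDs of records already delivered by a previous, failed attempt.
        alreadySent.addAll(fetchAlreadyProcessedIds());
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String recordId = value.toString().split(",")[0]; // assumes the ID is the first CSV field
        if (alreadySent.contains(recordId)) {
            return; // skip records that were sent before the crash
        }
        sendOverHttp(value.toString()); // deliver only records not yet sent
    }

    private Set<String> fetchAlreadyProcessedIds() {
        // Hypothetical lookup; replace with a call to the external database/service.
        return new HashSet<>();
    }

    private void sendOverHttp(String payload) {
        // Hypothetical delivery; replace with your HTTP client code.
    }
}
```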
Does HDFS provide a way to poll for file system events like file creation/modification/deletion? Also, does it provide/support any callback mechanism to get notified of such events as they occur?
I don't see an immediate, elaborate use case for such a thing, but there is a specific requirement to check on this capability. I did not come across any documentation that mentions it. It would be great if any of the HDFS committers could comment on this.
There is currently no built-in HDFS feature that allows this.
Workarounds would be to perform client-side polling on watched directories, or a manual tailing of the transaction log(s) for all recorded events.
As of Hadoop 2.7, this is possible with the HDFS INotify API. See this example: https://github.com/onefoursix/hdfs-inotify-example/blob/master/src/main/java/com/onefoursix/HdfsINotifyExample.java
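For reference, a minimal sketch of that INotify usage (modelled loosely on the linked example) looks roughly like this; it assumes Hadoop 2.7+ client libraries, HDFS superuser access to read the edit stream, and a placeholder NameNode URI.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSInotifyEventInputStream;
import org.apache.hadoop.hdfs.client.HdfsAdmin;
import org.apache.hadoop.hdfs.inotify.Event;
import org.apache.hadoop.hdfs.inotify.EventBatch;

public class HdfsEventWatcher {

    public static void main(String[] args) throws Exception {
        HdfsAdmin admin = new HdfsAdmin(
                URI.create("hdfs://namenode:8020"), new Configuration()); // placeholder URI
        DFSInotifyEventInputStream stream = admin.getInotifyEventStream();

        while (true) {
            EventBatch batch = stream.take(); // blocks until events are available
            for (Event event : batch.getEvents()) {
                switch (event.getEventType()) {
                    case CREATE:
                        System.out.println("created: " + ((Event.CreateEvent) event).getPath());
                        break;
                    case UNLINK:
                        System.out.println("deleted: " + ((Event.UnlinkEvent) event).getPath());
                        break;
                    default:
                        break;
                }
            }
        }
    }
}
```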