Does HDFS provide a way to poll for file system events like file creation/modification/deletion? Also, does it provide/support any callback mechanism to get notified of such events as they occur?
I don't see an immediate, elaborate use case for this, but there is a specific requirement to check on this capability. I did not come across any documentation that mentions it. It would be great if any of the HDFS committers could comment on this.
There is currently no built-in HDFS feature that allows this.
Workarounds would be to perform client-side polling of watched directories, or to manually tail the transaction log(s) for all recorded events.
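For the polling workaround, here is a minimal sketch: list a watched directory on an interval and report paths whose modification time changed. The directory path, the interval, and the in-memory tracking map are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DirectoryPoller {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path watched = new Path("/data/incoming");      // hypothetical watched directory
        Map<String, Long> lastSeen = new HashMap<>();

        while (true) {
            for (FileStatus status : fs.listStatus(watched)) {
                String path = status.getPath().toString();
                Long previous = lastSeen.put(path, status.getModificationTime());
                if (previous == null) {
                    System.out.println("new file:  " + path);
                } else if (previous != status.getModificationTime()) {
                    System.out.println("modified:  " + path);
                }
            }
            // Deletions could be detected by diffing lastSeen against the current listing.
            Thread.sleep(10_000);
        }
    }
}
```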
As of Hadoop 2.7, this is possible with the HDFS inotify API. See this example: https://github.com/onefoursix/hdfs-inotify-example/blob/master/src/main/java/com/onefoursix/HdfsINotifyExample.java
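For reference, a minimal sketch of that API (class names per Hadoop 2.7+; the NameNode URI and the event handling are placeholders, and reading the inotify stream generally requires HDFS superuser privileges):

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSInotifyEventInputStream;
import org.apache.hadoop.hdfs.client.HdfsAdmin;
import org.apache.hadoop.hdfs.inotify.Event;
import org.apache.hadoop.hdfs.inotify.EventBatch;

public class InotifySketch {
    public static void main(String[] args) throws Exception {
        // hdfs://namenode:8020 is a placeholder NameNode URI.
        HdfsAdmin admin = new HdfsAdmin(URI.create("hdfs://namenode:8020"), new Configuration());
        DFSInotifyEventInputStream stream = admin.getInotifyEventStream();

        while (true) {
            EventBatch batch = stream.take();           // blocks until events arrive
            for (Event event : batch.getEvents()) {
                switch (event.getEventType()) {
                    case CREATE:
                        System.out.println("created: " + ((Event.CreateEvent) event).getPath());
                        break;
                    case UNLINK:
                        System.out.println("deleted: " + ((Event.UnlinkEvent) event).getPath());
                        break;
                    default:
                        System.out.println("other event: " + event.getEventType());
                }
            }
        }
    }
}
```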
I have a NiFi flow in which I'm getting data from Elasticsearch; after some processing, I'm saving the flow files to a destination. Once all the flow files are saved, I want to notify the merge process to merge all the data from the files into one large CSV file. My problem is how to notify the merge processor that all the files are now saved and that it should start merging them.
Thanks for the help.. :)
NiFi's Wait/Notify processors can be used to signal when the files are saved. What I would suggest is routing the success relationship of whichever processor you are using to save the files to a Notify processor, which then releases the Wait processor. The Wait processor can in turn be used with the Merge processor to merge those files. It is indeed possible to use Wait/Notify in NiFi for complex actions; see Pierre Villard's NiFi workflow Wait/Notify example.
Although this isn't specifically what you're trying to do here, it might help you understand Wait/Notify a bit better: Wait/Notify Moving Zendesk.
We have multiple teams' NiFi applications running on the same NiFi machine. Is there any way to capture logs specific to my application? Also, by default the nifi-app.log file makes it difficult to track down issues, and the bulletin board shows error messages for only 5 minutes. How can the errors be captured and an email alert sent from NiFi?
Please help me to get through this. Thanks in advance!
There are a couple ways to approach this. One is to route failure relationships from processors to a PutEmail processor which can send an alert on errors. Another is to use a custom reporting task to alert a monitoring service when a certain number of flowfiles are in an error queue.
Finally, we have heard that in multitenant environments, log parsing is difficult. While NiFi aims to reduce or completely eliminate the need to visually inspect logs by providing the data provenance feature, in the event you do need to inspect the logs, we recommend searching the log by processor ID to isolate relevant messages. You can also use NiFi itself to ingest those same logs and perform parsing and filtering activities if desired. Future versions may improve this experience.
By parsing the NiFi log, you can separate out the log entries specific to your team's applications using the process group ID and the NiFi REST API. Check the link below for a NiFi template and Python code that solve this issue:
https://link.medium.com/L6IY1wTimV
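As a rough illustration of the REST approach in Java, here is a hedged sketch that pulls bulletins for a single process group. The host, port, group ID, and the groupId/limit query parameters are assumptions; verify them against your NiFi version's REST API documentation.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BulletinPoller {
    public static void main(String[] args) throws Exception {
        // Hypothetical process group ID for one team's flow.
        String groupId = "0a1b2c3d-0123-4567-89ab-cdef01234567";
        URI uri = URI.create("http://nifi-host:8080/nifi-api/flow/bulletin-board"
                + "?groupId=" + groupId + "&limit=100");

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        // The response is JSON; each bulletin carries the source component ID,
        // severity level, and message, which can be filtered per team and
        // forwarded to an email alert or a team-specific log.
        System.out.println(response.body());
    }
}
```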
You can route all the errors in a process group to the same processor, whether a regular UpdateAttribute or a custom processor. That processor adds the path and all the relevant information, then sends the flowfile to a general error/logging flow that inspects the error details inside the flowfile and decides whether to send an email, to whom, and so on.
With this approach, the system stays simple and entirely inside NiFi, so you don't add more layers of complexity, and you end up with only one processor managing errors per process group.
This is the way we are managing errors in my company.
I have a problem using real-time UDP stream processing with the MapReduce system. I am doing a university project and want to use MapReduce to process this data. The UDP stream carries ship data from several AIS devices.
As far as I am aware, Apache Storm would be the solution for that, but I don't know whether I can incorporate MapReduce into Storm. I want to incorporate MapReduce concepts, and ultimately I want to learn them.
I would also like some advice about the system architecture. The normal procedure is this:
the UDP stream is received by the system
the stream is decoded
real-time analytics are shown
the data is stored for future retrieval
So can anyone suggest the best way to do this? Can Apache Storm do it?
I'll answer the easy question first: Yes, Apache Storm can do what you want it to do.
That said, any of the other 'big data' streaming tools can handle this data processing as well; besides Storm, that includes Spark and Samza.
If I were building this myself, I'd push the streaming data into a messaging queue, probably Kafka, then use Storm to pull individual messages out and process them. You can then store the result however you want. That could be onto disk, back into Kafka, or whatever makes sense in your case.
Finally, MapReduce doesn't seem to be a good fit for your problem. MapReduce is for batch processing, which isn't what you are describing.
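To illustrate the Kafka-to-Storm approach described above, here is a minimal topology sketch. The broker address, topic name, and the DecodeAisBolt are illustrative assumptions rather than a complete AIS implementation, and the spout's "value" field assumes the default Kafka record translator.

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class AisTopologySketch {

    // Hypothetical bolt that turns a raw AIS sentence into a decoded value.
    public static class DecodeAisBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            String raw = input.getStringByField("value");   // Kafka record value
            // Real AIS decoding would go here; pass the raw sentence through for now.
            collector.emit(new Values(raw));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("decoded"));
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder broker address and topic name.
        KafkaSpoutConfig<String, String> spoutConfig =
                KafkaSpoutConfig.builder("kafka-broker:9092", "ais-raw").build();

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("ais-spout", new KafkaSpout<>(spoutConfig), 1);
        builder.setBolt("ais-decoder", new DecodeAisBolt(), 2).shuffleGrouping("ais-spout");
        // Further bolts could compute the real-time analytics and write results to storage.

        // Local run for testing; a real deployment would use StormSubmitter.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("ais-topology", new Config(), builder.createTopology());
        Thread.sleep(60_000);
        cluster.shutdown();
    }
}
```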
I am looking for a clean way to implement a Java event system that hooks into Hadoop v2. I know there is a notification URL and I have used that in the past. What I want to do is hook into the JobStatus and have events posted to a queue service for propagating them to clients. I tried extending Job and assigning the status via reflection to my custom JobStatus class, but this is not working. I have also taken a cursory look at YARN's event system to add a hook that would let me listen for YARN events and propagate those. I really need an expert opinion on how to accomplish this kind of task. I want to get log messages and status change events in real time over to a web client.
Thanks for any assistance in advance.
I've found references in a few places to some internal logging capabilities of ZMQ. The functionality that I think might exist is the ability to connect to an inproc and/or ipc SUB socket and listen to messages that give information about the internal state of ZMQ. This would be quite useful when debugging a distributed application. For instance, if messages are missing or being dropped, it might shed some light on why they're being dropped.
The most obvious mention of this is here: http://lists.zeromq.org/pipermail/zeromq-dev/2010-September/005724.html, but it's also referred to here: http://lists.zeromq.org/pipermail/zeromq-dev/2011-April/010830.html. However, I haven't found any documentation of this feature.
Is some sort of logging functionality truly available? If so, how is it used?
Some grepping through the git history eventually answered my question. The short answer is that a way for ZMQ to transmit logging messages to the outside world was implemented, but the rest of the code base never actually used it to send logging messages. After a while it was removed, since nothing used it.
The commit that originally added it, making use of an inproc socket:
https://github.com/zeromq/libzmq/commit/ce0972dca3982538fd123b61fbae3928fad6d1e7
The commit that added a new "sys" socket type specifically to support the logging:
https://github.com/zeromq/libzmq/commit/651c1adc80ddc724877f2ebedf07d18e21e363f6
JIRA issue, pull request, and commit to remove the functionality:
https://zeromq.jira.com/browse/LIBZMQ-336
https://github.com/zeromq/libzmq/pull/277
https://github.com/zeromq/libzmq/commit/5973da486696aca389dab0f558c5ef514470bcd2