How to find the data volume for a specific process group in NiFi? - apache-nifi

I am new to NiFi. We have a complex NiFi data flow in our organization, and we have segregated each project's data flow into its own Process Group. I was asked to find the data volume for a specific project (a Process Group in NiFi) over a given period of time. How can I find this in the NiFi Web UI?

The Apache NiFi User Guide has a section on monitoring components and data flow. By default, a processor/process group will display the total amount of data processed by that component over the last 5 minutes. There are also Reporting Tasks which allow the transmission of such monitoring data to external destinations.
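If you need the figures programmatically rather than from the UI, the same statistics are exposed by the REST API at GET /nifi-api/flow/process-groups/{id}/status. Below is a minimal sketch; the host/port and the process group ID are placeholders you would replace with your own values:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class ProcessGroupStatus {
        public static void main(String[] args) throws Exception {
            // Assumed values: adjust the host/port and use your real process group ID.
            String nifiApi = "http://localhost:8080/nifi-api";
            String processGroupId = "your-process-group-id";

            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(nifiApi + "/flow/process-groups/" + processGroupId + "/status"))
                    .GET()
                    .build();

            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

            // The JSON response contains an aggregate snapshot of the rolling 5-minute
            // window, including byte counts (e.g. bytesIn/bytesOut). Parse it with the
            // JSON library of your choice; here we just print the raw body.
            System.out.println(response.body());
        }
    }

Because the snapshot only covers a rolling 5-minute window, covering "a given period of time" means either polling this endpoint on a schedule and storing the results, or using a Reporting Task (for example SiteToSiteStatusReportingTask) to push status history to an external store.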

Related

Nifi - Flow Files Queue List

I was working on Apache NiFi and found hundreds of files in a queue, but listing the queue only displays 100 FlowFiles.
My question is: is there any way to see all of the files that are in the queue?
For reference, I have shared a screenshot.
The NiFi UI only shows 100 queued FlowFiles because that value is hardcoded. See this discussion: Nifi API: Get ALL elements with a flowfile queue
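If you want to inspect the queue programmatically, the REST API offers a listing-request flow: create a listing request for a connection, poll it until it finishes, then delete it. A rough sketch follows, assuming an unsecured NiFi at localhost:8080 and a placeholder connection (queue) UUID; note that the listing itself is still capped, so this mirrors what the UI shows rather than dumping every FlowFile:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class QueueListing {
        public static void main(String[] args) throws Exception {
            // Assumed values: replace with your NiFi host and the connection (queue) UUID.
            String nifiApi = "http://localhost:8080/nifi-api";
            String connectionId = "your-connection-id";

            HttpClient client = HttpClient.newHttpClient();

            // 1. Ask NiFi to build a listing of the queue.
            HttpRequest create = HttpRequest.newBuilder()
                    .uri(URI.create(nifiApi + "/flowfile-queues/" + connectionId + "/listing-requests"))
                    .POST(HttpRequest.BodyPublishers.noBody())
                    .build();
            String created = client.send(create, HttpResponse.BodyHandlers.ofString()).body();
            System.out.println(created); // contains the listing request ID and its URI

            // 2. Extract listingRequest.id from the JSON above (JSON parsing omitted),
            //    then GET .../listing-requests/{id} until "finished" is true, and
            //    finally DELETE the same URI to clean the request up.
        }
    }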

How to get errors captured from nifi logs specific to my application when multiple nifi applications are running

We have multiple teams' NiFi applications running on the same NiFi machine. Is there any way to capture the logs specific to my application? Also, by default the nifi-app.log file makes it difficult to track issues, and the bulletin board shows error messages for only 5 minutes. How can the errors be captured and an email alert sent in NiFi?
Please help me to get through this. Thanks in advance!
There are a couple ways to approach this. One is to route failure relationships from processors to a PutEmail processor which can send an alert on errors. Another is to use a custom reporting task to alert a monitoring service when a certain number of flowfiles are in an error queue.
Finally, we have heard that in multitenant environments, log parsing is difficult. While NiFi aims to reduce or completely eliminate the need to visually inspect logs by providing the data provenance feature, in the event you do need to inspect the logs, we recommend searching the log by processor ID to isolate relevant messages. You can also use NiFi itself to ingest those same logs and perform parsing and filtering activities if desired. Future versions may improve this experience.
By parsing the NiFi log you can separate out the entries specific to your team's applications, using the process group ID and the NiFi REST API. See the link below for a NiFi template and Python code that address this issue:
https://link.medium.com/L6IY1wTimV
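As a rough illustration of the log-parsing idea (not the template from the link above): collect the component IDs of the processors in your process group, for example from GET /nifi-api/process-groups/{your-group-id}/processors, and keep only the nifi-app.log lines that mention one of them. The IDs and log path below are assumptions:

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;
    import java.util.stream.Stream;

    public class TeamLogFilter {
        public static void main(String[] args) throws Exception {
            // Assumed: the component IDs of your team's processors, e.g. gathered via
            // the REST API from your process group.
            List<String> componentIds = List.of("processor-uuid-1", "processor-uuid-2");

            // Assumed default log location; adjust for your installation.
            Path log = Path.of("/opt/nifi/logs/nifi-app.log");

            try (Stream<String> lines = Files.lines(log)) {
                lines.filter(line -> componentIds.stream().anyMatch(line::contains))
                     .forEach(System.out::println); // or write to a per-team log / alert flow
            }
        }
    }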
You can route all the errors in a Process Group to the same processor. It can be a regular UpdateAttribute or a custom processor; this processor adds the path and all the relevant information, then sends the FlowFile to a general error/logging flow that inspects the error information inside the FlowFile and decides whether to send an email, to whom, and so on.
Using this approach the system stays simple and inside NiFi, so you don't add more layers of complexity, and you have only one processor managing errors per Process Group.
This is how we manage errors in my company.

Spring boot applications high availability

We have a microservice developed using Spring Boot. A couple of the functionalities it implements are:
1) A scheduler that triggers, at a specified time, a file download using WebHDFS, processes it, and, once the data is processed, sends an email to users with a processing summary.
2) Reading messages from Kafka and, once the data is read, sending an email to users.
We are now planning to make this application highly available, in either an active-active or active-passive setup. The problem we are facing is that if both instances of the application are running, both of them will try to download the file / read the data from Kafka, process it, and send emails. How can this be avoided? That is, how do we ensure that only one instance triggers the download and processes it?
Please let me know if there is a known solution for this kind of scenario, as it seems common to most projects. Is a master-slave / leader-election approach the correct solution?
Thanks
Let the service download the file, extract the information, and publish it via Kafka.
Check beforehand whether the information was already processed by querying Kafka or a local DB.
You could also publish a DataProcessed event that triggers the EmailService, which sends the corresponding e-mail.
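One way to implement that "check beforehand" step is an idempotency guard in a shared database: each instance tries to insert a marker row for the unit of work (for example the file name plus run date), and only the instance whose insert succeeds goes on to process and send the email. A minimal JDBC sketch with assumed table, connection details and work key (the JDBC driver is assumed to be on the classpath):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class ProcessedGuard {

        // Returns true only for the single instance that wins the insert.
        // Assumes a table: CREATE TABLE processed_work (work_key VARCHAR(255) PRIMARY KEY).
        static boolean tryClaim(String jdbcUrl, String user, String password, String workKey) {
            try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
                 PreparedStatement ps = conn.prepareStatement(
                         "INSERT INTO processed_work (work_key) VALUES (?)")) {
                ps.setString(1, workKey);
                ps.executeUpdate();
                return true;              // we claimed this unit of work
            } catch (SQLException e) {
                // Duplicate key (or other failure) is treated as "already claimed" in this sketch.
                return false;
            }
        }

        public static void main(String[] args) {
            // Assumed connection details and work key (e.g. file name + run date).
            if (tryClaim("jdbc:postgresql://db:5432/app", "app", "secret", "report-2024-01-01")) {
                // download the file via WebHDFS, process it, send the summary email
            } else {
                // another instance is handling this run; do nothing
            }
        }
    }

Libraries such as ShedLock wrap this pattern for Spring's scheduled tasks, and leader election (e.g. via ZooKeeper) is the heavier-weight alternative when one instance must own all scheduled work.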

Scaling services in Distributed system SOA

What are the various alternatives for data processing in SOA? What I have done so far in a PoC is:
Scaling the Services on multiple machines.
One universal service will handle the service registry & discovery.
Multiple requests for one service can be forwarded to any instance of the service running on multiple machines on the cluster.
Next, we are planning to introduce a distributed caching layer. Any service can get data from this caching layer. The entire flow of the system will be:
1. The client requests data from a service.
2. The service checks the cache for valid requested data. If the data is in a valid state it is returned to the client right away; otherwise the permanent data store is queried for the requested data, the cache is updated, and the data flows back to the client. (A minimal sketch of this read path appears after the list.)
3. Now, if the client requests processing of the data, it can be processed by a service. Should the data be processed by a single instance of the service or by multiple instances, i.e. option 3a or 3b?
3a. We pass only the important data filters from the client to the service and distribute the processing command among the multiple instances of the service. Each instance performs the operation on a small subset of the data and updates the data in the cache and the permanent store. Here, instead of passing the data, we are passing a processing command across the cluster.
3b. We process the whole data set in one instance of the service and update the cache and the permanent data store.
4. Finally we return the processed data to the client.
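Here is a minimal sketch of the read path in step 2 (a cache-aside pattern), with an in-memory map standing in for the distributed cache and a stubbed permanent store; all names are placeholders, not a specific product:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Function;

    public class CacheAsideService {
        // Stand-in for the distributed cache (e.g. a data grid); placeholder only.
        private final Map<String, String> cache = new ConcurrentHashMap<>();
        // Stand-in for the permanent data store lookup.
        private final Function<String, String> permanentStore;

        CacheAsideService(Function<String, String> permanentStore) {
            this.permanentStore = permanentStore;
        }

        // Step 2 of the flow: serve from cache if present, otherwise load from the
        // permanent store and update the cache on the way back to the client.
        String getData(String key) {
            return cache.computeIfAbsent(key, permanentStore);
        }

        public static void main(String[] args) {
            CacheAsideService service = new CacheAsideService(key -> "value-from-db-for-" + key);
            System.out.println(service.getData("order-42")); // loads from store, caches
            System.out.println(service.getData("order-42")); // served from cache
        }
    }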
For a transactional system, should we depend on the distributed cache? It might result in consistency problems while data is being processed by multiple instances of the service: one instance can read stale data and process that stale copy. How robust is it to depend on a distributed cache?
How should a large set of transactional data be processed in a distributed system (SOA)? I have been reading this line on MuleSoft's site:
"Share workload between applications while maintaining transient state information with in-memory data-grid to provide bulletproof reliability together with scalability"
Any pointers to achieve such a distributed system where we can have scalability and reliability?

Apache Kafka vs Apache Storm

Apache Kafka: Distributed messaging system
Apache Storm: Real Time Message Processing
How can we use both technologies in a real-time data pipeline for processing event data?
In terms of a real-time data pipeline, both seem to me to do an identical job. How can we use the two technologies together in a data pipeline?
You use Apache Kafka as a distributed and robust queue that can handle high volume data and enables you to pass messages from one end-point to another.
Storm is not a queue. It is a system that has distributed real time processing abilities, meaning you can execute all kind of manipulations on real time data in parallel.
The common flow of these tools (as I know it) goes as follows:
real-time-system --> Kafka --> Storm --> NoSql --> BI(optional)
So you have your real-time app handling high-volume data and sending it to the Kafka queue. Storm pulls the data from Kafka and applies the required manipulation. At this point you usually want to get some benefit from this data, so you either send it to a NoSQL db for additional BI calculations, or you simply query that NoSQL store from any other system.
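To make the "Kafka --> Storm" hop concrete, here is a rough topology sketch using the storm-kafka-client spout. The broker address, topic name and bolt logic are placeholders, and the exact spout/cluster APIs vary a bit between Storm 1.x and 2.x (this sketch assumes 2.x):

    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.kafka.spout.KafkaSpout;
    import org.apache.storm.kafka.spout.KafkaSpoutConfig;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Tuple;

    public class KafkaStormPipeline {

        // Placeholder bolt: this is where the "required manipulation" would happen
        // before writing results to a NoSQL store.
        public static class PrintBolt extends BaseBasicBolt {
            @Override
            public void execute(Tuple tuple, BasicOutputCollector collector) {
                System.out.println(tuple.getValues());
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                // terminal bolt in this sketch: no output fields
            }
        }

        public static void main(String[] args) throws Exception {
            // Assumed broker address and topic name.
            KafkaSpoutConfig<String, String> spoutConfig =
                    KafkaSpoutConfig.builder("localhost:9092", "events").build();

            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("kafka-spout", new KafkaSpout<>(spoutConfig));
            builder.setBolt("process-bolt", new PrintBolt()).shuffleGrouping("kafka-spout");

            // Local cluster for experimentation; use StormSubmitter on a real cluster.
            try (LocalCluster cluster = new LocalCluster()) {
                cluster.submitTopology("kafka-storm-demo", new Config(), builder.createTopology());
                Thread.sleep(30_000);
            }
        }
    }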
I know that this is an older thread and the comparisons of Apache Kafka and Storm were valid and correct when they were written but it is worth noting that Apache Kafka has evolved a lot over the years and since version 0.10 (April 2016) Kafka has included a Kafka Streams API which provides stream processing capabilities without the need for any additional software such as Storm. Kafka also includes the Connect API for connecting into various sources and sinks (destinations) of data.
Announcement blog - https://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/
Current Apache documentation - https://kafka.apache.org/documentation/streams/
In Kafka 0.11 the stream processing functionality was further expanded to provide exactly-once semantics and transactions.
https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/
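For comparison, here is a small Kafka Streams sketch that consumes a topic, transforms each record, and writes to another topic from inside a plain Java application. It assumes a reasonably recent Kafka Streams version; the application ID, broker address and topic names are placeholders:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class StreamsExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Assumed application id and broker address.
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "events-uppercaser");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> events = builder.stream("events");
            // Trivial stand-in for real stream processing logic.
            events.mapValues(value -> value.toUpperCase()).to("events-processed");

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }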
Kafka and Storm have a slightly different purpose:
Kafka is a distributed message broker which can handle a large volume of messages per second. It uses the publish-subscribe paradigm and relies on topics and partitions. Kafka uses ZooKeeper to share and save state between brokers. So Kafka is basically responsible for transferring messages from one machine to another.
Storm is a scalable, fault-tolerant, real-time analytics system (think of it as Hadoop in real time). It consumes data from sources (spouts) and passes it through a pipeline (bolts), and you combine them into a topology. So Storm is basically a computation unit (aggregation, machine learning).
But you can use them together: for example, your application uses Kafka to send data to other servers, which use Storm to perform some computation on it.
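As a concrete example of the "transferring messages from one machine to another" part, here is a minimal Kafka producer in Java; the broker address, topic, key and payload are placeholders, and any consumer (a Storm spout, another service, etc.) can subscribe to the same topic:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class EventProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Assumed broker address.
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                // Publish one record to the "events" topic; the consumer side
                // (e.g. a Storm topology) does the actual computation.
                producer.send(new ProducerRecord<>("events", "sensor-1", "{\"temp\": 21.5}"));
            }
        }
    }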
This is how it works:
Kafka - provides a real-time stream
Storm - performs some operations on that stream
You might take a look at the GitHub project https://github.com/abhishekgoel137/kafka-nodejs-d3js.
(D3js is a graph-representation library)
Ideal case:
Realtime application -> Kafka -> Storm -> NoSQL -> d3js
This repository is based on:
Realtime application -> Kafka -> <plain Node.js> -> NoSQL -> d3js
As everyone has explained:
Apache Kafka: a continuous messaging queue
Apache Storm: a continuous processing tool
In this setup, Kafka gets the data from any website such as Facebook or Twitter via their APIs, that data is processed using Apache Storm, and you can store the processed data in any database you like.
https://github.com/miguno/kafka-storm-starter
Just follow it and you will get some idea.
When I have a use case that requires me to visualize or alert on patterns (think of Twitter trends) while continuing to process the events, I have several patterns to choose from.
NiFi would allow me to process an event and update a persistent data store with low(er) batch aggregation and very, very little custom coding.
Storm (lots of custom coding) allows me nearly real-time access to the trending events.
If I can wait many seconds, then I can batch out of Kafka into HDFS (Parquet) and process there.
If I need to know within seconds, I need NiFi, and probably even Storm. (Think of monitoring thousands of earth stations, where I need to see small-region weather conditions for tornado warnings.)
Simply put, Kafka sends the messages from one node to another, and Storm processes the messages. Check this example of how you can integrate Apache Kafka with Storm.
