What is the best practice for NiFi production deployment?

I have a three-node NiFi cluster. We just installed the NiFi packages on Linux machines and clustered them with a separate ZooKeeper cluster. I am planning to monitor NiFi performance via Nagios, but we saw that Hortonworks Ambari also provides features for management and monitoring.
What is the best practice for NiFi deployment in production?
How should we scale up?
How can we monitor NiFi?
Should we monitor queue/processor performance?
Should we use something like Ambari?
Regards.
Edit-1:
@James Actually, I am collecting user event logs from several sources within the company.
All events are first written to Kafka. NiFi consumes from Kafka and does simple transformations, like promoting a field from the payload to an attribute.
After the transformations, the data is written to both Elasticsearch and HDFS. Before writing to HDFS we merge FlowFiles, so the writes to HDFS happen in batches.
I have around 50k events/s.
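For the Nagios side, the kind of check I have in mind polls NiFi's REST status endpoint for the total queued FlowFile count. A minimal sketch (the host, port, and thresholds are placeholders, and it assumes an unsecured HTTP endpoint):

```python
#!/usr/bin/env python3
"""Nagios-style check of total queued FlowFiles via the NiFi REST API (sketch)."""
import sys

import requests

NIFI_URL = "http://nifi-node1:8080"  # placeholder: any node in the cluster
QUEUE_WARN = 50_000                  # illustrative thresholds, tune for your rate
QUEUE_CRIT = 200_000

resp = requests.get(f"{NIFI_URL}/nifi-api/flow/status", timeout=10)
resp.raise_for_status()
status = resp.json()["controllerStatus"]

queued = status["flowFilesQueued"]   # FlowFiles currently waiting in queues

# Standard Nagios exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL
if queued >= QUEUE_CRIT:
    print(f"CRITICAL - {queued} FlowFiles queued")
    sys.exit(2)
if queued >= QUEUE_WARN:
    print(f"WARNING - {queued} FlowFiles queued")
    sys.exit(1)
print(f"OK - {queued} FlowFiles queued")
sys.exit(0)
```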

Related

Installing NiFi (open source) on the datanodes of an existing Hadoop cluster

If you have 10 datanodes on an existing Hadoop cluster, could you install NiFi on 4 or 6 of those datanodes?
The main purpose of NiFi would be loading data daily from an RDBMS to HDFS, in high volume.
The datanodes would be configured with high RAM, let's say 100 GB.
An external 3-node ZooKeeper cluster would be used.
Are there any major concerns with this approach?
Does it make more sense to just install NiFi on EVERY datanode, so 10?
Are there any issues with having a large cluster of 10 NiFi nodes?
Will some NiFi configuration best practices conflict with the Hadoop config?
Edit: currently using Hortonworks HDP 2.6.5 and open-source NiFi 1.9.2.
Are there any major concerns with this approach?
Cloudera Data Platform is integrated with Cloudera DataFlow, which is based on Apache NiFi, so integration should not be a concern.
Does it make more sense to just install NiFi on EVERY datanode, so 10?
That depends on what traffic you are expecting, but I would consider NiFi a standalone service, like Kafka or ZooKeeper... so a cluster of 3 would be a great start, growing if needed. Starting with all the datanodes is not required. It is OK to share these services with datanodes; just make sure resources are allocated correctly (cores, memory, storage...) - this is easier with Cloudera.
Are there any issues with having a large cluster of 10 nifi nodes?
More info on scaling under "6) NiFi Clusters Scale Linearly". You would need a lot of traffic to justify going beyond 10 nodes.
Will some NiFi configuration best practices conflict with Hadoop config?
That depends on how you configure it. I would advise using Cloudera for both, since the components are well tested to work together. You may not end up with the latest versions of your services, but at least you get higher reliability.
Even if you have an existing HDP 2.6.5 cluster, or perhaps by now you have upgraded to HDP 3 or even its successor CDP, you can use the Hortonworks/Cloudera NiFi solution via your management console. So if you currently use Ambari (or its counterpart Cloudera Manager), the recommended way to install NiFi is through that.
It will be called Hortonworks DataFlow or Cloudera DataFlow, respectively.
Regarding the other part of your question:
Typically it is recommended to install NiFi on dedicated nodes, and 10 nodes is likely overkill if you are not sure you need them.
Here is some information on sizing your NiFi deployment (note that Cloudera and Hortonworks have merged, so although the site is called Cloudera, this page was actually written with an HDP cluster in mind; of course that does not impact the sizing):
https://docs.cloudera.com/HDPDocuments/HDF3/HDF-3.1.1/bk_planning-your-deployment/content/ch_hardware-sizing.html
Full disclosure: I am an employee of Cloudera (formerly Hortonworks).

What is the difference between Apache Flume and Apache Storm?

Is it possible to ingest log data into a Hadoop cluster using Storm?
Both are used for streaming data, so can Storm be used as an alternative to Flume?
Apache Flume is a service for collecting large amounts of streaming data, particularly logs. Flume pushes data to consumers using mechanisms it calls data sinks. Flume can push data to many popular sinks right out of the box, including HDFS, HBase, Cassandra, and some relational databases.
Apache Storm is about processing streams of data. It is the bridge between batch processing and stream processing, which Hadoop is not natively designed to handle. Storm runs continuously, processing a stream of incoming data and dicing it into batches so that Hadoop can more easily ingest it. Data sources are called spouts and each processing node is a bolt. Bolts perform computations and processing on the data, including pushing output to data stores and other services.
If you need something that works out of the box, choose Flume, once you have decided whether pushing or pulling makes more sense. If streaming data is, for now, just a small add-on to your already-developed Hadoop environment, Storm is a good choice.
It is possible to ingest log data into a Hadoop cluster using Storm.
Storm can be used as an alternative to Flume.

HDF NiFi - Does NiFi write provenance/data on HDP nodes?

Hi,
I have an HDF cluster with 3 NiFi instances which launches jobs (Hive/Spark) on an HDP cluster. Usually NiFi writes all its information to the different repositories available on the local machine.
My question is: does NiFi write any data or provenance information, or do any spilling, onto HDP nodes (e.g. the datanodes in the HDP cluster) while accessing the HDFS, Hive, or Spark services?
Thanks
Apache NiFi does not use HDFS for any of its internal repositories/data. The only interaction between NiFi and Hadoop services would be through specific processors made to interact with these services, such as PutHDFS, PutHiveQL, etc.
Provenance data can be pushed out of NiFi using the SiteToSiteProvenanceReportingTask and then stored in whatever location is appropriate (HDFS, HBase, etc).

Where to run the Flume agent that writes to HDFS?

I have 20-25 agents sending data to a couple of collector agents, and these collector agents then have to write it to HDFS.
Where should these collector agents run? On the datanodes of the Hadoop cluster or outside the cluster? What are the pros/cons of each, and how are people currently running them?
Tier-2 Flume agents use the HDFS sink to write directly to HDFS. What's more, tier-1 agents can use a failover sink group in case one of the tier-2 Flume agents goes down.
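For example, the tier-1 side of that could be configured roughly like this (a sketch; the agent, host, and port names are illustrative):

```properties
# Tier-1 agent: two Avro sinks pointing at the two tier-2 collectors
a1.channels = c1
a1.channels.c1.type = file

a1.sinks = k1 k2
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = collector1.example.com
a1.sinks.k1.port = 4545

a1.sinks.k2.type = avro
a1.sinks.k2.channel = c1
a1.sinks.k2.hostname = collector2.example.com
a1.sinks.k2.port = 4545

# Failover sink group: events go to k1 while it is healthy
# and fall back to k2 if k1's collector goes down
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
```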
I assume you're using something like Flume. If that's the case, the Flume agent (at least the first tier) runs wherever the data is being sourced from, i.e. the web server for web logs.
Flume does support other protocols, like JMS, so the location will vary in those scenarios.
For production clusters, you don't want to run "agents" like Flume on datanodes. Best to leave the resources of that hardware to the cluster.
If you have a lot of agents, you'll want to use a tiered architecture to consolidate and funnel the numerous sources into a smaller set of agents that will write to HDFS. This helps control visibility and exposure of the cluster to external servers.

How to decide on the Flume topology approach?

I am setting up Flume but am not sure what topology to go with for our use case.
We basically have two web servers which can generate logs at a rate of 2000 entries per second, each entry around 137 bytes in size.
Currently we use rsyslog (writing to a TCP port), to which a PHP script writes these logs. We are running a local Flume agent on each web server; these local agents listen on a TCP port and put the data directly into HDFS.
So localhost:tcpport is the Flume source and HDFS is the Flume sink.
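In essence, each local agent's configuration today looks something like this (a sketch; the agent name, port, and path are placeholders):

```properties
# Per-webserver agent: syslog TCP in, straight to HDFS
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = syslogtcp
a1.sources.r1.host = localhost
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1

a1.channels.c1.type = file

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/weblogs
```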
I am not sure about the above approach and am torn between three options:
Approach 1: web server, rsyslog & Flume agent on each machine, and a Flume collector running on the NameNode of the Hadoop cluster to collect the data and dump it into HDFS.
Approach 2: web server and rsyslog on the same machine, and a Flume collector (listening on a remote port for events written by rsyslog on the web server) running on the NameNode of the Hadoop cluster to collect the data and dump it into HDFS.
Approach 3: web server, rsyslog & Flume agent on the same machine, with all agents writing directly to HDFS.
Also, we are using Hive, so we write directly into partitioned directories, and we want an approach that lets us write into hourly partitions.
Basically, I just want to know if people have used Flume for similar purposes, whether it is the right and reliable tool, and whether my approach seems sensible.
I hope that's not too vague. Any help would be appreciated.
The typical suggestion for your problem would be a fan-in or converging-flow agent deployment model (Google "flume fan in" for more details). In this model, you would ideally have an agent on each web server. Each of those agents forwards the events to a few aggregator or collector agents. The aggregator agents then forward the events to a final destination agent that writes to HDFS.
This tiered architecture simplifies scaling, failover, etc.
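As a sketch of the final destination agent in that model (host names, ports, and paths are illustrative), the HDFS sink's path escapes give you the hourly Hive partitions directly:

```properties
# Collector agent: Avro in from the upstream agents, HDFS out
col1.sources = r1
col1.channels = c1
col1.sinks = k1

col1.sources.r1.type = avro
col1.sources.r1.bind = 0.0.0.0
col1.sources.r1.port = 4545
col1.sources.r1.channels = c1

col1.channels.c1.type = file

# %Y-%m-%d and %H escapes create one directory per hour,
# matching Hive partitions such as dt=2015-06-01/hr=13
col1.sinks.k1.type = hdfs
col1.sinks.k1.channel = c1
col1.sinks.k1.hdfs.path = hdfs://namenode:8020/warehouse/weblogs/dt=%Y-%m-%d/hr=%H
col1.sinks.k1.hdfs.useLocalTimeStamp = true
col1.sinks.k1.hdfs.rollInterval = 300
```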
