HDF NiFi - Does NiFi write provenance/data on HDP nodes? - apache-nifi

Hi
I have an HDF cluster with 3 NiFi instances which launches jobs (Hive/Spark) on an HDP cluster. Usually NiFi writes all of its information to the different repositories available on the local machine.
My question is: does NiFi write any data or provenance information, or do any spilling, onto the HDP nodes (e.g. the data nodes in the HDP cluster) while accessing the HDFS, Hive or Spark services?
Thanks

Apache NiFi does not use HDFS for any of its internal repositories/data. The only interaction between NiFi and Hadoop services would be through specific processors made to interact with these services, such as PutHDFS, PutHiveQL, etc.
Provenance data can be pushed out of NiFi using the SiteToSiteProvenanceReportingTask and then stored in whatever location is appropriate (HDFS, HBase, etc.).
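If you want to confirm this on your own nodes, a quick way is to look at the repository directories on each NiFi host's local disk. Below is a minimal sketch assuming the default repository directory names and an install path of /opt/nifi; if you relocated the repositories in nifi.properties, adjust the paths accordingly.

import os

NIFI_HOME = "/opt/nifi"  # assumption: adjust to your actual install path
REPOS = ["flowfile_repository", "content_repository",
         "provenance_repository", "database_repository"]

for repo in REPOS:
    path = os.path.join(NIFI_HOME, repo)
    total = 0
    for root, _, files in os.walk(path):
        total += sum(os.path.getsize(os.path.join(root, f)) for f in files)
    print(f"{repo}: {total / (1024 ** 2):.1f} MiB on local disk")

On an HDP data node where NiFi is not installed, these directories simply do not exist, since NiFi only writes its repositories on its own hosts.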

Related

Installing NiFi (open source) on the datanodes of an existing Hadoop cluster

If you have 10 datanodes on an existing Hadoop cluster, could you install NiFi on 4 or 6 of the datanodes?
The main purpose of NiFi would be loading data daily from an RDBMS to HDFS, at high volume.
The datanodes would be configured with high RAM, let's say 100 GB.
An external 3-node ZooKeeper cluster would be used.
Are there any major concerns with this approach?
Does it make more sense to just install NiFi on EVERY datanode, so all 10?
Are there any issues with having a large cluster of 10 NiFi nodes?
Will some NiFi configuration best practices conflict with Hadoop config?
Edit: Currently using Hortonworks version 2.6.5 and open source NiFi 1.9.2
Are there any major concerns with this approach?
Cloudera Data Platform is integrated with Cloudera DataFlow, which is based on Apache NiFi, so integration should not be a concern.
Does it make more sense to just install NiFi on EVERY datanode, so 10?
It depends on what traffic you are expecting, but I would consider NiFi a standalone service, like Kafka or ZooKeeper, so a cluster of 3 would be a great start, increasing later if needed. Starting with all datanodes is not required. It is OK to share these services with datanodes, just make sure resources are allocated correctly (cores, memory, storage...) - this is easier with Cloudera.
Are there any issues with having a large cluster of 10 nifi nodes?
There is more information on scaling under "6) NiFi Clusters Scale Linearly". You would need a lot of traffic to justify going beyond 10 nodes.
Will some NiFi configuration best practices conflict with Hadoop config?
That depends on how you configure it. I would advise using Cloudera for both, since the two are well tested to work together. You may not end up with the latest versions of your services, but at least you get higher reliability.
Even if you have an existing HDP 2.6.5 cluster, or perhaps by now you have upgraded to HDP 3 or even its successor CDP, you can use the Hortonworks/Cloudera NiFi solution via your management console. So if you currently use Ambari (or its counterpart Cloudera Manager), the recommended way to install NiFi is through that.
It will be called Hortonworks DataFlow or Cloudera DataFlow, respectively.
Regarding the other part of your question:
Typically it is recommended to install NiFi on dedicated nodes, and 10 nodes is likely overkill if you are not sure you need them.
Here is some information on sizing your NiFi deployment (note that Cloudera and Hortonworks have merged, so although the site is called Cloudera, this page is actually written with an HDP cluster in mind; of course that does not impact the sizing):
https://docs.cloudera.com/HDPDocuments/HDF3/HDF-3.1.1/bk_planning-your-deployment/content/ch_hardware-sizing.html
Full disclosure: I am an employee of Cloudera (formerly Hortonworks)

Data transfer approach between NiFi clusters

I am looking for a reliable data transfer approach between different NiFi clusters.
I have two NiFi clusters - one fetching data from the source and another pushing data into Hive/HDFS.
I need to transfer data from the first NiFi cluster to the second. Is there any component available in NiFi to do this?
NiFi Cluster 1
GetFile --> Publish to Port
NiFi Cluster 2
Read from Port --> Publish to HDFS
Thanks
Apache NiFi provides the site-to-site feature for transferring data between two instances. You can read about it here:
https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#site-to-site
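Concretely, the usual pattern for your two clusters is: on cluster 1, GetFile feeds a Remote Process Group pointing at cluster 2's URL; on cluster 2, a root-level Input Port receives the flowfiles and feeds PutHDFS. As a rough sketch (host and port are assumptions, and the field names are what the /nifi-api/site-to-site endpoint exposes in recent NiFi versions - verify against your version), you can check whether site-to-site is enabled on the receiving cluster like this:

import requests

# Assumption: URL of any node in the receiving (cluster 2) NiFi cluster.
NIFI2_API = "http://nifi-cluster2-node1:8080/nifi-api"

resp = requests.get(f"{NIFI2_API}/site-to-site", timeout=10)
resp.raise_for_status()
controller = resp.json().get("controller", {})

# If site-to-site is enabled (nifi.remote.input.socket.port set in nifi.properties),
# the listening port is advertised here.
print("Site-to-site socket port:", controller.get("remoteSiteListeningPort"))
print("Site-to-site secure:", controller.get("siteToSiteSecure"))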

What is the best practice for NiFi production deployment

I have a three-node NiFi cluster. We just installed the NiFi packages on Linux machines and clustered them with a separate ZooKeeper cluster. I am planning to monitor NiFi performance via Nagios, but we saw that Hortonworks Ambari also provides features for management and monitoring.
What is the best practice for NiFi deployment in production?
How should we scale up?
How can we monitor NiFi?
Should we monitor queue/process performance? (see the polling sketch after this question)
Should we use something like Ambari?
regards..
Edit-1:
@James, actually I am collecting user event logs from several sources within the company.
All events are first written to Kafka. NiFi consumes from Kafka and does simple transformations, like promoting a field from the payload to an attribute.
After the transformations, the data is written to both Elasticsearch and HDFS. Before writing to HDFS we merge flowfiles, so writes to HDFS happen in batches.
I have around 50k events/s.
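One way to watch queue/process performance from Nagios is to poll NiFi's REST API. A minimal check sketch follows, assuming the /nifi-api/flow/status endpoint and purely illustrative thresholds; adjust the URL and limits to your environment.

import sys
import requests

NIFI_API = "http://nifi-node1:8080/nifi-api"   # assumption: any node of the cluster
WARN_QUEUED = 50_000                           # assumed warning threshold (flowfiles queued)
CRIT_QUEUED = 200_000                          # assumed critical threshold (flowfiles queued)

status = requests.get(f"{NIFI_API}/flow/status", timeout=10).json().get("controllerStatus", {})
queued = status.get("flowFilesQueued", 0)

# Standard Nagios plugin exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL.
if queued >= CRIT_QUEUED:
    print(f"CRITICAL - {queued} flowfiles queued")
    sys.exit(2)
if queued >= WARN_QUEUED:
    print(f"WARNING - {queued} flowfiles queued")
    sys.exit(1)
print(f"OK - {queued} flowfiles queued")
sys.exit(0)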

Spark cluster - read/write on Hadoop

I would like to read data from Hadoop, process it in Spark, and write the results to Hadoop and Elasticsearch. I have a few worker nodes to do this.
Is a Spark standalone cluster sufficient, or do I need to set up the Hadoop cluster to use YARN or Mesos?
If standalone cluster mode is sufficient, should the jar file be placed on all nodes, unlike in YARN/Mesos mode?
First of all, you cannot write data to or read data from "Hadoop" as such. It is HDFS (a component of the Hadoop ecosystem) which is responsible for reading/writing the data.
Now coming to your questions:
Yes, it is possible to read data from HDFS, process it in the Spark engine, and then write the output back to HDFS.
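A minimal PySpark sketch of that read -> transform -> write pattern (paths, hostnames, column names and the Elasticsearch connector coordinates are placeholders to adapt to your environment):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hdfs-read-write-example").getOrCreate()

# Read input from HDFS (namenode host/port and path are placeholders).
df = spark.read.json("hdfs://namenode:8020/data/input/events")

# A simple transformation: keep a couple of fields and add a processing timestamp.
out = df.select("event_id", "payload").withColumn("processed_at", F.current_timestamp())

# Write the result back to HDFS.
out.write.mode("overwrite").parquet("hdfs://namenode:8020/data/output/events")

# Writing to Elasticsearch additionally needs the elasticsearch-hadoop connector on the
# classpath (e.g. spark-submit --packages org.elasticsearch:elasticsearch-spark-30_2.12:<version>).
# out.write.format("org.elasticsearch.spark.sql") \
#     .option("es.nodes", "es-host:9200") \
#     .mode("append") \
#     .save("events-index")

spark.stop()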
YARN, Mesos and Spark standalone are all cluster managers; you can use any one of them to manage the resources in your cluster, and that has nothing to do with Hadoop. But since you want to read and write data from/to HDFS, you need HDFS installed on the cluster, so it is better to install Hadoop on all of your nodes, which will also put HDFS on all nodes. Whether you use YARN, Mesos or Spark standalone is your choice - all of them work with HDFS. I myself use Spark standalone for cluster management.
It is not clear which jar files you are talking about, but I assume you mean the Spark jars. In that case, yes, you need to set the path to the Spark jars on each node so that there is no mismatch in paths when Spark runs.

Does Apache Storm do the resource management job in the cluster by itself?

Well, I am new to Apache Storm, and after some searching and reading of tutorials, I still don't get how fault tolerance, load balancing and other resource manager duties take place in a Storm cluster. Should it be configured on top of YARN, or does it do the resource management job itself? Does it have its own HDFS part, or should there be an existing HDFS configured in the cluster first?
Storm can manage its resources by itself or can run on top of YARN. If you have a shared cluster (i.e., with other systems like Hadoop, Spark, or Flink running), using YARN should be the better choice to avoid resource conflicts.
About HDFS: Storm is independent of HDFS. If you want to run it on top of HDFS, you need to set up HDFS yourself. Furthermore, Storm provides spouts/bolts to access HDFS: https://storm.apache.org/documentation/storm-hdfs.html

Resources