Installing NiFi (open source) on the datanodes of an existing Hadoop cluster

If you have 10 datanodes on an existing Hadoop cluster could you install NiFi on 4 or 6 datanodes?
The main purpose of NiFi would be loading data daily from RDBMS to HDFS, high volume.
Datanodes would be configured with high RAM, let's say 100 GB.
External 3 node Zookeeper cluster would be used.
Are there any major concerns with this approach?
Does it make more sense to just install NiFi on EVERY datanode, so 10?
Are there any issues with having a large cluster of 10 nifi nodes?
Will some NiFi configuration best practices conflict with Hadoop config?
Edit: currently using Hortonworks HDP 2.6.5 and open source NiFi 1.9.2.

Are there any major concerns with this approach?
Cloudera Data Platform is integrated with Cloudera DataFlow, which is based on Apache NiFi, so integration should not be a concern.
Does it make more sense to just install NiFi on EVERY datanode, so 10?
That depends on what traffic you are expecting, but I would treat NiFi as a standalone service, like Kafka or ZooKeeper, so a cluster of 3 would be a great start, growing it if needed. Starting with all DataNodes is not required. It is OK to co-locate these services with DataNodes, just make sure resources are allocated correctly (cores, memory, storage...) - this is easier with Cloudera.
Are there any issues with having a large cluster of 10 nifi nodes?
There is more information on scaling under "NiFi Clusters Scale Linearly". You would need a lot of traffic to justify going over 10 nodes.
Will some NiFi configuration best practices conflict with Hadoop config?
That depends on how you configure it. I would advise using Cloudera for both, since that combination is well tested to work together. You may not end up with the latest versions of your services, but you get higher reliability.
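To make the "standalone service with its own resources" point concrete, here is a minimal sketch of the per-node cluster settings for a NiFi 1.9.x node using the external 3-node ZooKeeper from the question; the hostnames, protocol port and 16 GB heap are placeholders to adjust for your hardware. The RDBMS-to-HDFS flow itself would typically be built from processors such as QueryDatabaseTable or ExecuteSQL feeding PutHDFS.

    # nifi.properties (per node) - hostnames are placeholders
    nifi.cluster.is.node=true
    nifi.cluster.node.address=nifi1.example.com
    nifi.cluster.node.protocol.port=11443
    nifi.cluster.flow.election.max.candidates=3
    # use the external ZooKeeper; do not start the embedded one
    nifi.state.management.embedded.zookeeper.start=false
    nifi.zookeeper.connect.string=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181

    # bootstrap.conf - cap the JVM heap so NiFi does not starve a co-located DataNode
    java.arg.2=-Xms16g
    java.arg.3=-Xmx16g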

Even if you have an existing HDP 2.6.5 cluster, or perhaps by now you have upgraded to HDP 3 or even its successor CDP, you can use the Hortonworks/Cloudera NiFi solution via your management console. So if you currently use Ambari (or its counterpart Cloudera Manager), the recommended way to install NiFi is through that.
It will be called Hortonworks DataFlow (HDF) or Cloudera DataFlow (CDF), respectively.
Regarding the other part of your question:
Typically it is recommended to install NiFi on dedicated nodes, and 10 nodes is likely overkill if you are not sure you need them.
Here is some information on sizing your NiFi deployment (note that Cloudera and Hortonworks have merged, so although the site is called Cloudera, this page is actually written with an HDP cluster in mind; of course that does not affect the sizing).
https://docs.cloudera.com/HDPDocuments/HDF3/HDF-3.1.1/bk_planning-your-deployment/content/ch_hardware-sizing.html
Full disclosure: I am an employee of Cloudera (formerly Hortonworks)
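As a rough sketch of what installing through Ambari looks like in practice: HDF ships as an Ambari management pack that is registered on the Ambari server before NiFi appears in the Add Service wizard. The archive path and version below are placeholders; pick the HDF mpack that matches your HDP release.

    # on the Ambari server host (path and version are placeholders)
    ambari-server install-mpack --mpack=/tmp/hdf-ambari-mpack-3.x.x.x.tar.gz --verbose
    ambari-server restart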

Related

Does Hadoop itself contain fault-tolerance/failover functionality?

I just installed a new version of Hadoop 2. Once I configure a Hadoop cluster and bring it up, how can I know whether data transmission has failed and a failover is needed?
Do I have to install other components like ZooKeeper to track/enable any HA events?
Thanks!
High Availability is not enabled by default. I would highly encourage you to read the Hadoop documentation from Apache. (http://hadoop.apache.org/) It will give an overview of the architecture and services that run on a Hadoop cluster.
Zookeeper is required for many Hadoop services to coordinate their actions across the entire Hadoop cluster, regardless of the cluster being HA or not. More information can be found in the Apache Zookeeper documentation (http://zookeeper.apache.org/).
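For reference, enabling automatic NameNode failover is mostly a configuration exercise on top of ZooKeeper. A minimal sketch of the relevant properties is below; the nameservice ID, hostnames and ZooKeeper quorum are placeholders, and the JournalNode/shared-edits and failover proxy settings are omitted for brevity.

    <!-- hdfs-site.xml -->
    <property><name>dfs.nameservices</name><value>mycluster</value></property>
    <property><name>dfs.ha.namenodes.mycluster</name><value>nn1,nn2</value></property>
    <property><name>dfs.namenode.rpc-address.mycluster.nn1</name><value>master1.example.com:8020</value></property>
    <property><name>dfs.namenode.rpc-address.mycluster.nn2</name><value>master2.example.com:8020</value></property>
    <property><name>dfs.ha.automatic-failover.enabled</name><value>true</value></property>

    <!-- core-site.xml -->
    <property><name>ha.zookeeper.quorum</name><value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value></property>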

How to integrate Hadoop with Ambari without HDP?

I have a Hadoop cluster running Apache Hadoop 2.0.7.
I want to know how to integrate Ambari with Apache Hadoop without HDP (Hortonworks).
If I use HDP the solution is easy, but I don't want to use it in my situation.
Do you have any ideas?
Ambari relies on 'Stack' definitions to describe which services the Hadoop cluster consists of. Hortonworks defined a custom Ambari stack; it's called HDP.
You could define your own stack and use any services and respective versions that you want. See the Ambari wiki for more information about defining stacks and services.
That being said, I don't think it's possible to use your pre-existing installation of Hadoop with Ambari. Ambari is used to provision and manage Hadoop clusters. It keeps track of the state of each of its stack's services, and the state of each service's components. Since your cluster is already provisioned, it would be difficult (maybe impossible) to add it to an Ambari instance.
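To give a sense of what defining your own stack involves, here is a rough skeleton of a service metainfo.xml; the service name, version and referenced script are hypothetical, and a real definition also needs command scripts, configuration templates and alert descriptors alongside it.

    <metainfo>
      <schemaVersion>2.0</schemaVersion>
      <services>
        <service>
          <name>MYHADOOP</name>
          <displayName>My Hadoop</displayName>
          <comment>Custom-built Apache Hadoop</comment>
          <version>2.0.7</version>
          <components>
            <component>
              <name>MY_NAMENODE</name>
              <displayName>NameNode</displayName>
              <category>MASTER</category>
              <cardinality>1</cardinality>
              <commandScript>
                <script>scripts/namenode.py</script>
                <scriptType>PYTHON</scriptType>
                <timeout>1200</timeout>
              </commandScript>
            </component>
          </components>
        </service>
      </services>
    </metainfo>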

Existing Cluster monitoring by Hortonworks Ambari

I have an existing 10-node cluster on RHEL 6.6 that was set up with plain Apache Hadoop configuration XMLs. Now I want to check the cluster status with Ambari. Would it be possible to install Hortonworks Ambari just for monitoring, without using it to install Hadoop?
No, Ambari must provision the cluster it's monitoring.
Ambari is designed around a Stack concept where each stack consists of several services. A stack definition is what allows Ambari to install, manage and monitor the services in the cluster.
In order for you to use Ambari with the hadoop core that you built you would have to provide your own Ambari stack definition.
Specifically, in your case, your existing Hadoop installation would not have the necessary alerts.json descriptors used by Ambari to provide alerts for any given service.

Can the Hortonworks sandbox be used as a single-node Hadoop cluster?

I'd like to study Hadoop multi-node setup and installation. From the tutorial linked below I understand that a single-node cluster environment can be used as a node of a multi-node cluster:
http://bigdatahandler.com/hadoop-hdfs/hadoop-multi-node-cluster-setup/
Currently I am learning Hadoop using the Hortonworks sandbox. Can we use a sandbox system as a single-node environment?
If not, what is the difference between the sandbox and a traditional Hadoop cluster installation?
The sandbox images (from Hortonworks and Cloudera) provide the user with a pre-configured development environment with all the usual tools already available and installed (pig, hive etc.). Since the image is a single "system" it is set-up such that the hadoop cluster is single-node: i.e. everything - HDFS, Hadoop map-reduce etc. - is local to that image. That is a massive benefit, as anyone who has set up a hadoop cluster will tell you! It allows you to get up-and-running with very little operational overhead.
What these sandboxes do not provide, however, is realistic cluster behaviour, as you have only one node. But there are other possibilities - tools such as Vagrant and Docker - that would allow you to do this (I have not tried it myself).
The big data handler link you shared seems to be about combining several of these standalone, inherently single-node "clusters" so that you have something more realistic. But I would guess setting this up so that YARN, Zookeeper and other services are not duplicated comes with a not insignificant challenge.
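As an illustration of the Docker route (a sketch, not a tested recipe; the image name is a placeholder for an image you would build or pull), you could start several containers on one bridge network so the daemons can reach each other by hostname:

    # 'my-hadoop-image' is a placeholder for a Hadoop image you build or pull
    docker network create hadoopnet
    for n in node1 node2 node3; do
      docker run -d --name "$n" --hostname "$n" --network hadoopnet my-hadoop-image
    done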

Cloudera installation Doubts?

I am new to Cloudera. I installed Cloudera on my system successfully and I have two doubts:
Consider a machine with some nodes already running Hadoop with some data. Can we install Cloudera to use the existing Hadoop without making any changes or modifications to the data stored in the existing Hadoop?
I installed Cloudera on my machine and I have another three machines to add to the cluster. Do I need to install Cloudera on those three machines before adding them to the cluster, or can we add a node to the cluster without installing Cloudera on that particular node?
Thanks in advance; can anyone please give some information about the above questions?
Answer to questions -
1. If you want to migrate to CDH from an existing Apache distribution, you can follow this link.
Excerpt:
Overview
The migration process does require a moderate understanding of Linux system administration. You should make a plan before you start. You will be restarting some critical services such as the name node and job tracker, so some downtime is necessary. Given the value of the data on your cluster, you'll also want to be careful to take recent backups of any mission-critical data sets as well as the name node metadata.
Backing up your data is most important if you're upgrading from a version of Hadoop based on an Apache Software Foundation release earlier than 0.20.
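On an Apache Hadoop 2.x cluster, the NameNode metadata backup mentioned in the excerpt could look roughly like this; the metadata and backup paths are placeholders for whatever dfs.namenode.name.dir points to on your cluster and wherever you keep backups.

    hdfs dfsadmin -safemode enter
    hdfs dfsadmin -saveNamespace
    # archive the NameNode metadata directory (placeholder paths)
    tar czf /backup/namenode-meta-$(date +%F).tar.gz -C /hadoop/hdfs/namenode .
    hdfs dfsadmin -safemode leave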
2. CDH binaries need to be installed and configured on all the nodes to have a CDH-based cluster up and running.
From the Cloudera Manual
You can migrate the data from a CDH3 (or any Apache Hadoop) cluster to a CDH4 cluster by using a tool that copies out data in parallel, such as the DistCp tool offered in CDH4.
Other sources
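For a concrete sense of the DistCp approach quoted above, a copy between the old and the new cluster (run from the CDH4 side) might look like this; hostnames, ports and paths are placeholders.

    hadoop distcp hdfs://old-nn.example.com:8020/user/data hdfs://new-nn.example.com:8020/user/data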
Regarding your second question,
Again from the manual page
Important:
As a general rule:
The NameNode and JobTracker run on the same "master" host unless the cluster is large (more than a few tens of nodes), and the master host (or hosts) should not run the Secondary NameNode (if used), DataNode or TaskTracker services. In a large cluster, it is especially important that the Secondary NameNode (if used) runs on a separate machine from the NameNode. Each node in the cluster except the master host(s) should run the DataNode and TaskTracker services.
Additionally, if you use Cloudera Manager it will automatically do all the necessary setup, i.e. install the selected components on the nodes in the cluster.
Off-topic: I had a bad habit of not referring to the manual properly. Have a close look at it; it answers all of these questions.
Answer to your second question,
You can add them directly, after installing a few prerequisites such as openssh-clients and Java, and configuring the firewall.
These machines (the existing node and the three new nodes) should accept the same username and password, or you should set up passwordless SSH to these hosts.
You should be connected to the internet while adding the nodes.
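As a sketch of the passwordless-SSH prerequisite (the username and hostnames are placeholders): generate a key on the manager host and copy it to each new node.

    ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
    for h in node1.example.com node2.example.com node3.example.com; do
      ssh-copy-id admin@"$h"
    done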
I hope it will help you:)
