Sorry if my question appears to be naïve. We are planning to use CDH 5.3.0 or 5.4.0. We want to implement a multi-node cluster.
The example multi-node installations that I have seen/read on different blogs/resources have master and slaves on different hosts.
However, we are constrained by the number of hosts. We have only 2 powerful hosts (32 cores, 400+ GB RAM each), so if we decide to have the master on one and a slave on the other, we will end up with only one slave. My questions are:
Is it possible to have the master and a slave on the same host?
Can I have more than one slave node on a single host?
Also, does one need to pay to use Cloudera Manager, or is it open source like the rest of the components?
If you can point me to some resources that would help me understand the above scenarios, it would be helpful.
Thanks for your help.
Regards,
V
Old question, but it has no correct answer yet:
Yes, it is possible to install master & worker services on a single host,
e.g. HDFS (NameNode and DataNode). You can even install a full Cloudera or Hortonworks stack with ALL services on a single host if it is powerful enough, but I would only recommend that for a POC or test cases.
If you use Cloudera or Hortonworks without virtualization, it is not possible to run multiple instances of the SAME worker service (e.g. DataNode) on the same host: one host, one worker instance. Anything else would not make sense.
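For illustration, here is a minimal sketch of running a NameNode and a DataNode on the same host with plain Apache Hadoop (a pseudo-distributed setup). The paths, the port and the use of $HADOOP_HOME are assumptions; a Cloudera Manager or Ambari install would generate the equivalent configuration for you:

```bash
# Minimal sketch: NameNode and DataNode on the same host (pseudo-distributed HDFS).
# Assumes Hadoop is unpacked under $HADOOP_HOME and JAVA_HOME is set.

cat > $HADOOP_HOME/etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <!-- NameNode and DataNode both live on this host -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>
EOF

cat > $HADOOP_HOME/etc/hadoop/hdfs-site.xml <<'EOF'
<configuration>
  <!-- Only one DataNode, so keep a single replica -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

$HADOOP_HOME/bin/hdfs namenode -format   # one-time format of the NameNode metadata
$HADOOP_HOME/sbin/start-dfs.sh           # starts NameNode, DataNode and SecondaryNameNode locally
```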
Cloudera is a package of multiple open-source projects (Hadoop, Spark, ...) plus closed-source parts such as Cloudera Manager and other enterprise features. But everything you need is free, even for commercial use, with the community licence.
Right now (2017), Cloudera Navigator is the only big feature that is not part of the community edition.
Yes, you can configure the NameNode and a DataNode on a single node.
You cannot run more than one DataNode on a single machine (without virtualization or separate configurations).
Cloudera's CDH is an open-source Hadoop distribution.
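As a quick sanity check (assuming a plain Hadoop or CDH install where both roles were assigned to the host), you can verify with jps that both daemons are running on the same machine; the output below is only an illustration:

```bash
# List the Java daemons running on this host; on a combined master/worker
# node you would expect to see both the NameNode and the DataNode.
jps
# Example output (PIDs will differ):
# 12001 NameNode
# 12187 DataNode
# 12410 SecondaryNameNode
```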
Related question:
If you have 10 DataNodes on an existing Hadoop cluster, could you install NiFi on 4 or 6 of them?
The main purpose of NiFi would be loading high-volume data daily from an RDBMS into HDFS.
The DataNodes would be configured with a lot of RAM, let's say 100 GB.
An external 3-node ZooKeeper cluster would be used.
Are there any major concerns with this approach?
Does it make more sense to just install NiFi on EVERY datanode, so 10?
Are there any issues with having a large cluster of 10 nifi nodes?
Will some NiFi configuration best practices conflict with Hadoop config?
Edit: currently using Hortonworks HDP 2.6.5 and open-source NiFi 1.9.2.
Are there any major concerns with this approach?
Cloudera Data Platform is integrated with Cloudera DataFlow, which is based on Apache NiFi, so integration should not be a concern.
Does it make more sense to just install NiFi on EVERY datanode, so 10?
It depends on what traffic you are expecting, but I would treat NiFi as a standalone service, like Kafka or ZooKeeper, so a cluster of 3 would be a great start, growing it if needed. Starting with all DataNodes is not required. It is OK to share these services with DataNodes, just make sure resources are allocated correctly (cores, memory, storage...) - this is easier with Cloudera.
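For reference, a rough sketch of the cluster-related settings that have to be aligned in conf/nifi.properties on each of the three NiFi nodes; the host names, the protocol port and the ZooKeeper quorum are placeholders, each node uses its own address, and a Cloudera/HDF-managed install would set these values for you:

```bash
# Point this NiFi node at the cluster and at the external ZooKeeper quorum
# (placeholder host names; run the equivalent edit on each of the 3 nodes,
# with nifi.cluster.node.address set to that node's own hostname).
sed -i \
  -e 's|^nifi.cluster.is.node=.*|nifi.cluster.is.node=true|' \
  -e 's|^nifi.cluster.node.address=.*|nifi.cluster.node.address=nifi-node1.example.com|' \
  -e 's|^nifi.cluster.node.protocol.port=.*|nifi.cluster.node.protocol.port=11443|' \
  -e 's|^nifi.zookeeper.connect.string=.*|nifi.zookeeper.connect.string=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181|' \
  conf/nifi.properties
```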
Are there any issues with having a large cluster of 10 nifi nodes?
More info on scaling under point 6), "NiFi Clusters Scale Linearly". You would need a lot of traffic to justify going over 10 nodes.
Will some NiFi configuration best practices conflict with Hadoop config?
That depends on how you configure it. I would advise using Cloudera for both, since the combination is well tested to work together. You may not end up with the latest versions of your services, but at least you get higher reliability.
Even if you have an existing HDP 2.6.5 cluster, or perhaps by now you have upgraded to HDP 3 or even its successor CDP, you can use the Hortonworks/Cloudera NiFi solution via your management console. So if you currently use Ambari (or its counterpart Cloudera Manager), the recommended way to install NiFi is through that.
It will be called Hortonworks DataFlow (HDF) or Cloudera DataFlow (CDF), respectively.
Regarding the other part of your question:
Typically it is recommended to install NiFi on dedicated nodes, and 10 nodes is likely overkill if you are not sure you need them.
Here is some information on sizing your NiFi deployment (note that Cloudera and Hortonworks have merged, so although the site is called Cloudera, this page was actually written with an HDP cluster in mind; of course that does not impact the sizing):
https://docs.cloudera.com/HDPDocuments/HDF3/HDF-3.1.1/bk_planning-your-deployment/content/ch_hardware-sizing.html
Full disclosure: I am an employee of Cloudera (formerly Hortonworks)
I am in the planning phase of a multi-node Hadoop cluster in a Docker-based environment, so it should be based on a lightweight, easy-to-use virtualized system.
The current architecture (according to its documentation) contains 1 master and 3 slave nodes. The host machine uses the HDFS filesystem and KVM for virtualization.
The whole cloud is managed by Cloudera Manager. There are several Hadoop modules installed on this cluster. There is also a NodeJS data upload service.
This time I need to make the architecture Docker-based.
I have read several tutorials and have some opinions, but also open questions.
A. What do you think, is https://github.com/Lewuathe/docker-hadoop-cluster a good base for my project? I have also found an official image, but it is single-node.
B. How will the system requirements change if I want to run this in a single container? That would be great, because this architecture should work in different locations, so changes could be easily transferred between them. Synchronization between these so-called clones would be important.
C. Do you have some other ideas, maybe best practices?
As of September 2016 there is no quick answer.
https://github.com/Lewuathe/docker-hadoop-cluster does not seem like a good start, since you need something universal enough to also cover your option B.
Keep an eye on https://github.com/sequenceiq/hadoop-docker and https://github.com/kiwenlau/hadoop-cluster-docker
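As an illustration of how lightweight these images are to try, this is roughly how the sequenceiq image was run according to its README at the time; the image tag is taken from that README and may be outdated, so treat it as a sketch:

```bash
# Pull the Hadoop image and start it with its bootstrap script
# (tag from the project's README at the time; adjust as needed).
docker pull sequenceiq/hadoop-docker:2.7.1
docker run -it sequenceiq/hadoop-docker:2.7.1 /etc/bootstrap.sh -bash

# Inside the container you can then run the bundled example MapReduce job:
# cd $HADOOP_PREFIX
# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar grep input output 'dfs[a-z.]+'
```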
To address your question C., you may want to check out BlueData's software platform: http://www.bluedata.com/blog/2015/06/docker-containers-big-data-clusters
It's designed to run multi-node Hadoop clusters in a Docker-based environment and there is a free version available for download (you can also run it in an AWS EC2 instance).
This work has already been done for you, actually:
https://hub.docker.com/r/cloudera/clusterdock/
It includes a pre-packaged multi-node CDH cluster, with Cloudera Manager as an optional component for cluster management et al.
I'm new to the Hadoop ecosystem and I'm trying to understand how a cluster works. Until now, I've been using the Hortonworks distribution to test everything in single-node mode. Now I'm wondering whether it's possible to connect two VMs (running physically on one PC) so that one will be the NameNode and the other one a DataNode (I'm not sure whether they should be separated). I found a similar tutorial for Cloudera, so I guess it's possible in theory.
If it's not even a good idea to run two Hadoop VMs on one PC, then what is the most painless way to configure and run it on two separate PCs?
Maybe this post will be useful: "Setting up a Hadoop cluster"
http://gbif.blogspot.ru/2011/01/setting-up-hadoop-cluster-part-1-manual.html
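To give an idea of what a minimal two-machine setup involves with plain Apache Hadoop, here is a sketch; the host names "master-vm" and "worker-vm" are placeholders, and the exact file names and steps differ between Hadoop versions and distributions:

```bash
# Sketch of a minimal two-node HDFS setup: "master-vm" runs the NameNode,
# "worker-vm" runs a DataNode. Both VMs need Java and Hadoop installed and
# must be able to resolve each other's host names (and SSH to each other).

# On BOTH machines, point HDFS at the NameNode host in core-site.xml:
cat > $HADOOP_HOME/etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master-vm:8020</value>
  </property>
</configuration>
EOF

# On the master, list the worker so start-dfs.sh knows where to start DataNodes
# (the file is called "workers" in Hadoop 3.x and "slaves" in Hadoop 2.x):
echo "worker-vm" > $HADOOP_HOME/etc/hadoop/workers

# On the master only:
$HADOOP_HOME/bin/hdfs namenode -format
$HADOOP_HOME/sbin/start-dfs.sh   # uses SSH to start the DataNode on worker-vm
```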
Hi, I have a small doubt. I started using Hadoop out of curiosity, but now I have the following problem.
My scenario is like this: I have 10 machines connected over a LAN and I need to run the NameNode on one system and DataNodes on the remaining 9 machines. So do I need to install Hadoop on all 10 machines?
For example, I have machines (1..10), where machine 1 is the server and machines (2..10) are the slaves [DataNodes]; do I need to install Hadoop on all 10 machines?
I have also searched a lot on Hadoop clusters built from commodity machines, but I didn't find anything about installation [that is, configuration] in a clustered environment. Some resources explain how to configure and install Hadoop on a single system, but not across a cluster.
Can anyone help me and give me a detailed idea, or suggest articles/links for the above process?
Thanks
Yes, you need Hadoop installed on every node, and each node should have the services started as appropriate for its role. Also, the configuration files present on each node have to coherently describe the topology of the cluster, including the location/name/port of commonly used resources (e.g. the NameNode). Doing this manually, from scratch, is error prone, especially if you have never done it before and you don't know exactly what you're trying to do. It would also be good to decide on a specific distribution of Hadoop (Hortonworks, Cloudera, HDInsight, Intel, etc.).
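As a rough illustration of what "coherently describe the topology" means for the 10-machine scenario above (host names are placeholders, and a distribution's tooling or a deployment tool would normally handle this for you):

```bash
# Sketch: the same configuration is pushed to every node, and the master's
# workers/slaves file lists the nine DataNode hosts (placeholder names).
cat > $HADOOP_HOME/etc/hadoop/workers <<'EOF'
machine2
machine3
machine4
machine5
machine6
machine7
machine8
machine9
machine10
EOF

# Copy the identical config files to every node so they all agree on the
# NameNode location, ports, data directories, etc.
for host in machine{2..10}; do
  scp "$HADOOP_HOME"/etc/hadoop/*.xml "$HADOOP_HOME"/etc/hadoop/workers \
      "$host:$HADOOP_HOME/etc/hadoop/"
done
```

In practice, a tool like Puppet, Chef or Ambari (mentioned below) automates exactly this kind of file distribution.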
I would recommend using one of the many deployment solutions out there. My favorite is Puppet, but I'm sure Chef will do too.
A different (perhaps better?) alternative is to use Ambari, which is a Hadoop specialized deployment and administering solution. See Deploying and Managing Hadoop Clusters with AMBARI.
Some Puppet resources to get you started: Using Vagrant, Puppet, Testing & Hadoop
Please check the tutorial below:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
Hope it helps
Yes, Hadoop needs to be installed on all the computers.
For a clustered environment, please go through the video.
I am new to Cloudera. I installed Cloudera on my system successfully, and I have two doubts:
Consider a cluster with some nodes already running Hadoop and holding some data. Can we install Cloudera to use the existing Hadoop without making any changes or modifications to the data stored in the existing Hadoop?
I installed Cloudera on my machine and I have another three machines to add to the cluster. I want to know: do I need to install Cloudera on those three machines before adding them to the cluster, or can we add a node to the cluster without installing Cloudera on that particular node?
Thanks in advance. Can anyone please give some information about the above questions?
Answers to the questions:
1. If you want to migrate to CDH from an existing Apache distribution, you can follow this link.
Excerpt:
Overview

The migration process does require a moderate understanding of Linux system administration. You should make a plan before you start. You will be restarting some critical services such as the name node and job tracker, so some downtime is necessary. Given the value of the data on your cluster, you’ll also want to be careful to take recent back ups of any mission-critical data sets as well as the name node meta-data.

Backing up your data is most important if you’re upgrading from a version of Hadoop based on an Apache Software Foundation release earlier than 0.20.
2. The CDH binaries need to be installed and configured on all the nodes to have a CDH-based cluster up and running.
From the Cloudera Manual
You can migrate the data from a CDH3 (or any Apache Hadoop) cluster to a CDH4 cluster by using a tool that copies out data in parallel, such as the DistCp tool offered in CDH4.
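For illustration, a typical DistCp invocation looks like the sketch below. The host names, ports and paths are placeholders, and the exact form (for example reading over hftp:// or webhdfs:// when the HDFS versions are not wire-compatible) depends on the versions involved:

```bash
# Copy a directory tree from the old cluster to the new one in parallel
# (run from the destination cluster; placeholder hosts and paths).
hadoop distcp hdfs://old-namenode:8020/user/data hdfs://new-namenode:8020/user/data

# Between incompatible HDFS versions the source is often read over HTTP instead, e.g.:
# hadoop distcp hftp://old-namenode:50070/user/data hdfs://new-namenode:8020/user/data
```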
Other sources
Regarding your second question,
Again from the manual page
Important:
Before proceeding, you need to decide:
As a general rule:
The NameNode and JobTracker run on the same "master" host unless the cluster is large (more than a few tens of nodes), and the master host (or hosts) should not run the Secondary NameNode (if used), DataNode or TaskTracker services. In a large cluster, it is especially important that the Secondary NameNode (if used) runs on a separate machine from the NameNode. Each node in the cluster except the master host(s) should run the DataNode and TaskTracker services.
Additionally, if you use Cloudera Manager it will automatically do all the necessary setup, i.e. install the selected components on the nodes in the cluster.
Off-topic: I had a bad habit of not referring to the manual properly. Have a good look at it; it answers all of these questions.
Answer to your second question,
You can add them directly, after installing a few prerequisites such as openssh-clients and Java, and sorting out the firewall.
These machines (the existing node and the three new nodes) should accept the same username and password, or you should set up passwordless SSH to these hosts.
You should be connected to the internet while adding the nodes.
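For the passwordless-SSH option mentioned above, here is a minimal sketch run from the Cloudera Manager host; the user name and host names are placeholders:

```bash
# Generate a key pair once on the Cloudera Manager host (accept the defaults).
ssh-keygen -t rsa

# Copy the public key to each new node so it can be reached without a password.
for host in node1 node2 node3; do
  ssh-copy-id admin@"$host"
done
```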
I hope it will help you:)