Cloudera installation questions - hadoop

I am new to Cloudera and have installed it successfully on my system. I have two questions:
1. Consider some nodes that already run Hadoop and hold data. Can we install Cloudera on top of the existing Hadoop without changing or modifying the data already stored in it?
2. I installed Cloudera on one machine and have three more machines that I want to add to the cluster. Do I need to install Cloudera on those three machines before adding them, or can a node be added to the cluster without Cloudera installed on it?
Thanks in advance to anyone who can give some information on these questions.

Answers to your questions:
1. If you want to migrate to CDH from an existing Apache distribution, you can follow this link
Excerpt:
Overview
The migration process does require a moderate understanding of Linux
system administration. You should make a plan before you start. You
will be restarting some critical services such as the name node and
job tracker, so some downtime is necessary. Given the value of the
data on your cluster, you’ll also want to be careful to take recent
back ups of any mission-critical data sets as well as the name node
meta-data.
Backing up your data is most important if you’re upgrading from a
version of Hadoop based on an Apache Software Foundation release
earlier than 0.20.
2. The CDH binaries need to be installed and configured on all the nodes to have a CDH-based cluster up and running.

From the Cloudera Manual
You can migrate the data from a CDH3 (or any Apache Hadoop) cluster to a CDH4 cluster by
using a tool that copies out data in parallel, such as the DistCp tool
offered in CDH4.
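As a rough illustration of the DistCp approach quoted above, here is a minimal Python sketch that only assembles the command line. The host names, ports and paths are placeholders I made up, and the `hftp://` source scheme is my assumption for a copy between different Hadoop versions; check the manual for the exact form your versions need.

```python
import shlex


def build_distcp_command(source_uri, target_uri):
    """Return the argument list for a basic `hadoop distcp` copy."""
    return ["hadoop", "distcp", source_uri, target_uri]


if __name__ == "__main__":
    # Hypothetical clusters: replace hosts, ports and paths with your own.
    command = build_distcp_command(
        "hftp://old-namenode.example.com:50070/user/data",
        "hdfs://new-namenode.example.com:8020/user/data",
    )
    # Print the command instead of running it, so it can be reviewed first.
    print(" ".join(shlex.quote(part) for part in command))
```

The sketch prints the command rather than executing it, since a copy like this is something you would want to review and run by hand on the destination cluster.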
Other sources
Regarding your second question,
Again from the manual page
Important:
Before proceeding, you need to decide:
As a general rule:
The NameNode and JobTracker run on the same "master" host unless
the cluster is large (more than a few tens of nodes), and the master
host (or hosts) should not
run the Secondary NameNode (if used), DataNode or TaskTracker
services. In a large cluster, it is especially important that the
Secondary NameNode (if used) runs on a separate machine from the
NameNode. Each node in the cluster except the master host(s) should
run the DataNode and TaskTracker services.
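The general rule quoted above can be written down as a small planning helper. This is only a sketch under my own assumptions: the host names are invented, the first host in the list is treated as the master, and the Secondary NameNode is placed on a worker only if you name one.

```python
def assign_roles(hosts, secondary_host=None):
    """Map each host to the services it should run for a small cluster.

    The first host is the master (NameNode and JobTracker); every other
    host is a worker (DataNode and TaskTracker).
    """
    if len(hosts) < 2:
        raise ValueError("need one master and at least one worker host")
    master, *workers = hosts
    if secondary_host is not None and secondary_host not in workers:
        raise ValueError("the Secondary NameNode must run on a non-master host")

    layout = {master: ["NameNode", "JobTracker"]}
    for worker in workers:
        layout[worker] = ["DataNode", "TaskTracker"]
    if secondary_host is not None:
        layout[secondary_host].append("SecondaryNameNode")
    return layout


if __name__ == "__main__":
    # Hypothetical four-machine setup matching the question (1 + 3 nodes).
    plan = assign_roles(["master1", "node1", "node2", "node3"], secondary_host="node1")
    for host, services in plan.items():
        print(host, services)
```

Each host that appears in the result needs the matching CDH packages installed, which is the point of answer 2 above.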
Additionally, if you use Cloudera Manager, it will automatically do all the necessary setup, i.e. install the selected components on the nodes in the cluster.
Off-topic: I had a bad habit of not reading the manual properly. Have a careful look at it; it answers all of these questions.

Answer to your second question:
You can add the nodes directly, once a few prerequisites such as openssh-clients, firewall settings and Java are in place.
These machines (the existing node and the three new nodes) should accept the same username and password, or you should set up passwordless SSH to these hosts.
The machines should be connected to the internet while you add the nodes.
I hope this helps :)
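To check the passwordless SSH prerequisite mentioned above before adding hosts, something like the following Python sketch could be used. The user name `hadoop` and the host names are assumptions for illustration; `BatchMode=yes` makes `ssh` fail instead of prompting for a password, which is what turns the login attempt into a yes/no check.

```python
import subprocess


def build_ssh_check(host, user="hadoop", timeout_seconds=5):
    """Return an ssh command that succeeds only without a password prompt."""
    return [
        "ssh",
        "-o", "BatchMode=yes",  # never prompt; fail if a password is needed
        "-o", f"ConnectTimeout={timeout_seconds}",
        f"{user}@{host}",
        "true",  # harmless remote command, only the exit status matters
    ]


def has_passwordless_ssh(host, user="hadoop"):
    """Run the check and report whether key-based login worked."""
    result = subprocess.run(build_ssh_check(host, user), capture_output=True)
    return result.returncode == 0


if __name__ == "__main__":
    # Hypothetical host names for the three new machines.
    for host in ["node1", "node2", "node3"]:
        print(" ".join(build_ssh_check(host)))
```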

Related

Hadoop environment is Down

I am a computer science student, and as part of my research I am working on a Hadoop environment. The person who worked on this research before me configured 9 DataNodes with a NameNode and a standby node. Our network traffic data is stored in Hive, and I am developing Hive queries to identify network attacks. That person has since left, works somewhere else and is busy with their job, so I have a couple of questions:
1) How can I understand the architecture of my environment on HDFS, i.e. how the machines are connected to build it? Also, which services are installed on which machines?
2) We currently have 9 DataNodes in the environment, and my professor wants to reduce that number. Her goal is to do the research with a minimal set of 2-3 machines.
3) What are good and easy sources for understanding Cloudera and Hadoop? Also, which commands can be used to explicitly start and stop a service?
4) Right now, in Cloudera Manager, I am not able to start the NameNode server, the Secondary datanode and one more. I stopped all the services in order from Cloudera Manager and am now starting them in order. The HDFS service comes first in that order, and while starting it I get a failure message for namenode, datanode and datanode8.
I tried several ways but had no luck. Please suggest some ways to solve these issues, and some good resources (for beginners) that I can refer to in order to dig into this more.
Thanks.
There are several resources to start with. For everything Cloudera/CDH, the place to go is the Cloudera Documentation. For Hadoop, the place to go is the Hadoop Documentation. Now, I reckon this is a rather big bite to chew. If you're new to Hadoop, it is better to start with a book or some introduction (I can't recommend one since I haven't read any).
For your specific problem, it seems that some services don't start. You need to look at the services' logs on the respective nodes. I can't tell you where those logs are, because that depends on your distribution version and on how it was configured. I suspect one vital service does not start (probably HDFS; it looks like the NameNode is down) and this causes every other service to fail. The Hadoop Wiki has a troubleshooting guide; try to follow that and see if it helps you.
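For the log inspection suggested above, a minimal Python sketch like this one can pull the most recent ERROR and FATAL lines out of a service log. The log path in the usage part is purely hypothetical, since the real location depends on the distribution and its configuration.

```python
import re
from pathlib import Path

# Hadoop daemons log through log4j, so severe entries carry these level names.
SEVERE = re.compile(r"\b(ERROR|FATAL)\b")


def severe_lines(lines):
    """Keep only the lines that contain an ERROR or FATAL level marker."""
    return [line.rstrip("\n") for line in lines if SEVERE.search(line)]


def scan_log(path, tail=20):
    """Return the last `tail` severe lines found in the log file at `path`."""
    with Path(path).open(errors="replace") as handle:
        return severe_lines(handle)[-tail:]


if __name__ == "__main__":
    # Hypothetical path: look up the real log directory for your setup.
    log_path = Path("/var/log/hadoop-hdfs/namenode.log")
    if log_path.exists():
        for line in scan_log(log_path):
            print(line)
    else:
        print(f"{log_path} not found, adjust the path for your installation")
```

Starting with the NameNode log makes sense here, because the other HDFS roles cannot come up while it is down.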
As for the question of how to adjust the cluster size: first get the cluster up and running, and only then consider changing it. Refer to Decommissioning and Recommissioning Hosts.

Multi-node hadoop cluster installation

Sorry if my question appears to be naïve. We are planning to use CDH 5.3.0 or 5.4.0. We want to implement a multi-node cluster.
The example multi-node installations that I have seen/read on different blogs/resources have master and slaves on different hosts.
However, we are constrained by the number of hosts. We have only 2 powerful hosts (32 cores, 400+ GB RAM), so if we decide to put the master on one and the slave on the other, we will end up with only one slave. My questions are:
Is it possible to have master and slave on the same host?
Can I have more than one slave node on a single host?
Also, does one need to pay to use Cloudera Manager, or is it open source like the rest of the components?
If you can point me in the direction of some resource which would help me understand above scenarios it would be helpful.
Thanks for your help.
Regards,
V
This is an old question, but the existing answers are missing or wrong:
Yes, it is possible to install master and worker services on a single host,
e.g. HDFS (NameNode and DataNode). You can even run a full Cloudera or Hortonworks installation with ALL services on a single host if it is powerful enough, but I would only recommend that for a POC or test cases.
If you use Cloudera or Hortonworks without virtualization, it is not possible to run multiple instances of the SAME worker service (e.g. DataNode) on the same host: one host, one worker instance. Anything else would not make sense.
Cloudera is a package of multiple open-source projects (Hadoop, Spark, ...) plus closed-source parts such as Cloudera Manager and other enterprise features. But everything you need is free, even for commercial use, with the community licence.
Right now (2017), Cloudera Navigator is the only big feature that is not part of the community edition.
Yes, you can configure both the NameNode and a DataNode on a single node.
You cannot have more than two datanodes on a single machine.
Cloudera is an open-source Hadoop distribution.

Can the Hortonworks sandbox be used as a single-node Hadoop cluster?

I would like to study Hadoop multi-node setup and installation. From the tutorial below, I understand that a single-node cluster environment can be used as a node in a multi-node cluster:
http://bigdatahandler.com/hadoop-hdfs/hadoop-multi-node-cluster-setup/
Currently I am learning Hadoop using the Hortonworks sandbox. Can we use a sandbox system as a single-node environment?
If not, what is the difference between the sandbox and a traditional Hadoop cluster installation?
The sandbox images (from Hortonworks and Cloudera) provide the user with a pre-configured development environment with all the usual tools already available and installed (pig, hive etc.). Since the image is a single "system" it is set-up such that the hadoop cluster is single-node: i.e. everything - HDFS, Hadoop map-reduce etc. - is local to that image. That is a massive benefit, as anyone who has set up a hadoop cluster will tell you! It allows you to get up-and-running with very little operational overhead.
What these sandboxes do not provide, however, is realistic cluster behaviour, as you have only one node. But there are other possibilities - tools such as Vagrant and Docker - that would allow you to do this (I have not tried it myself).
The big data handler link you shared seems to be about combining several of these standalone, inherently single-node "clusters" so that you have something more realistic. But I would guess setting this up so that YARN, Zookeeper and other services are not duplicated comes with a not insignificant challenge.

Hadoop on cluster configuration /Installation

Hi, I have a small question. I started using Hadoop out of curiosity, but now I have the following problem.
My scenario is this: I have 10 machines connected in a LAN, and I need to create the NameNode on one system and DataNodes on the remaining 9 machines. So do I need to install Hadoop on all 10 machines?
For example, I have machines (1..10), where machine1 is the server and machines (2..10) are the slaves [DataNodes]. Do I need to install Hadoop on all 10 machines?
I have searched a lot for Hadoop cluster networks on commodity machines, but I didn't find anything related to installation (that is, configuration). Some sources explain how to configure and install Hadoop on a single system, but not in a clustered environment.
Can anyone help me and give me a detailed idea, or suggest articles or links, for doing the above?
Thanks
Yes, you need Hadoop installed on every node, and each node should have the services appropriate for its role started. Also, the configuration files present on each node have to coherently describe the topology of the cluster, including the location/name/port of various commonly used resources (e.g. the NameNode). Doing this manually, from scratch, is error prone, especially if you have never done it before and don't know exactly what you're trying to do. It would also be good to decide on a specific distribution of Hadoop (Hortonworks, Cloudera, HDInsight, Intel, etc.).
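To make the point about coherent configuration concrete, here is a small Python sketch that renders the same `core-site.xml` content for every node, so they all agree on where the NameNode lives. The host name and port are assumptions for illustration. `fs.defaultFS` is the property name in Hadoop 2.x; older 1.x releases use `fs.default.name` instead, which is why the name is a parameter.

```python
import xml.etree.ElementTree as ET


def render_core_site(namenode_host, namenode_port=8020, property_name="fs.defaultFS"):
    """Return core-site.xml content pointing every node at one NameNode."""
    configuration = ET.Element("configuration")
    prop = ET.SubElement(configuration, "property")
    ET.SubElement(prop, "name").text = property_name
    ET.SubElement(prop, "value").text = f"hdfs://{namenode_host}:{namenode_port}"
    return ET.tostring(configuration, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical host name: machine1 from the question acts as the NameNode.
    print(render_core_site("machine1"))
```

Generating the file once and copying the identical result to all 10 machines avoids the typical mistake of nodes pointing at different addresses, which is also what the deployment tools mentioned in this answer automate.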
I would recommend using one of the many deployment solutions out there. My favorite is Puppet, but I'm sure Chef will do too.
A different (perhaps better?) alternative is to use Ambari, which is a Hadoop-specific deployment and administration solution. See Deploying and Managing Hadoop Clusters with AMBARI.
Some Puppet resources to get you started: Using Vagrant, Puppet, Testing & Hadoop
Please check the tutorial below:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
Hope it helps
Yes, Hadoop needs to be installed on all the computers.
For a clustered environment, please go through the video

Is it possible to add nodes automatically while a Hadoop application is running?

I'm a beginner programmer and Hadoop learner.
I'm testing Hadoop in fully distributed mode using 5 PCs (each with a dual-core CPU and 2 GB of RAM).
Before starting map tasks and HDFS, I learned that I must configure some files (/etc/hosts with IPs and hostnames, plus the masters and slaves files in the Hadoop conf folder), so I finished configuring those files.
During a seminar discussion at my company, my boss and chief insisted that even while a Hadoop application is running, Hadoop will automatically add more nodes if it needs them.
Is that possible? When I studied Hadoop clustering, many Hadoop books and community sites insisted that after configuring and starting an application, we can't add more nodes.
But my boss told me that Amazon says adding nodes to a running application is possible.
Is that really true?
Hadoop experts in the Stack Overflow community, please tell me the details.
Yes it indeed is possible.
Here is the explanation in hadoop's wiki.
Also, Amazon's EMR lets you add hundreds of nodes on the fly to an already running cluster, and as soon as the machines are up the master delegates tasks to them (mapper and/or reducer tasks that have not started yet).
So, yes, it is very much possible and is in use and not just in theory.
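For a self-managed cluster like the one in the question (Hadoop 1.x style, with a conf/slaves file), the manual version of adding a node can be sketched in Python as below. This is my own summary of the usual procedure, not a quote from the wiki page: the install path `/usr/local/hadoop` and the host names are assumptions, and the sketch only prepares the new slaves file content and the two daemon start commands to run on the new machine.

```python
def add_worker_to_slaves(slaves_text, hostname):
    """Return the slaves file content with `hostname` listed exactly once."""
    hosts = [line.strip() for line in slaves_text.splitlines() if line.strip()]
    if hostname not in hosts:
        hosts.append(hostname)
    return "\n".join(hosts) + "\n"


def worker_start_commands(hadoop_home="/usr/local/hadoop"):
    """Commands to run on the new node so it joins the running cluster."""
    script = f"{hadoop_home}/bin/hadoop-daemon.sh"
    return [
        [script, "start", "datanode"],     # registers with the NameNode
        [script, "start", "tasktracker"],  # registers with the JobTracker
    ]


if __name__ == "__main__":
    # Hypothetical: four existing workers, a sixth PC joins as node5.
    current = "node1\nnode2\nnode3\nnode4\n"
    print(add_worker_to_slaves(current, "node5"), end="")
    for command in worker_start_commands():
        print(" ".join(command))
```

The master does not need a restart for this; once the daemons on the new node report in, it becomes eligible for blocks and for tasks that have not started yet.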

Resources