cloudera cluster node roles - hadoop

I need to run a simple benchmark test on my Cloudera CDH4 cluster.
The cluster has 4 nodes: A, B, C and D.
I am using the free edition of Cloudera Manager to manage the Cloudera services.
Each node is configured to perform multiple roles, as listed below.
A : NameNode, JobTrackerNode, regionserver, SecondaryNameNode, DataNode, TaskTrackerNode
B : DataNode, TaskTrackerNode
C : DataNode, TaskTrackerNode
D : DataNode, TaskTrackerNode
My first question is: can one node be both a NameNode and a DataNode?
Is this setup all right?
My second question: in the Cloudera Manager UI I can see many services running, but I am not sure whether I need all of them.
Services running on my setup are:
Do I need only the hdfs1 and mapreduce1 services? If so, how can I remove the other services?
Cloud and Hadoop concepts are new to me, so pardon me if some of my assumptions are illogical or wrong.

The answer to your first question is yes, but you would never do that in production, because the NameNode needs a sufficient amount of RAM. People usually run only the NameNode and JobTracker on their master node. It is also better to run the SecondaryNameNode on a different machine.
Coming to your second question: Cloudera Manager does not manage only Hadoop. It comes with a complete package that includes several Hadoop sub-projects such as HBase (a NoSQL database) and Oozie (a workflow engine), and these are the processes you see in the UI.
If you just want to play with Hadoop, HDFS and MapReduce are sufficient. You can easily stop the rest of the processes from the UI itself; it won't do any harm to your Hadoop cluster.


Understanding wrt Hadoop

I am learning Hadoop and came to know that there are two versions of the framework, viz. Hadoop1 and Hadoop2.
If my understanding is correct, in Hadoop1 the execution environment is based on two daemons, viz. TaskTracker and JobTracker, whereas in Hadoop2 (aka YARN) the execution environment is based on "new daemons", viz. ResourceManager, NodeManager and ApplicationMaster.
Please correct me if this is wrong.
I came to know of the following configuration parameter:
Possible values it can take: local, classic, yarn
I don't understand what they actually mean. For example, if I install Hadoop 2, how can it have the old execution environment (the one with TaskTracker and JobTracker)?
Can anyone explain what these values mean?
yarn stands for MR version 2.
classic is for MR version 1.
local is for local runs of MR jobs.
MRv1 and MRv2 differ only in how resources are managed and how a job is executed. The current Hadoop release is capable of both (and of a lightweight local mode as well). When you set the value to yarn, you are simply instructing the framework to execute the job the YARN way. Similarly, when you set it to local, you are telling the framework that there is no cluster for execution and everything runs within a single JVM. MRv1 and MRv2 are not different infrastructures; only the way the job is executed changes.
JobTracker, TaskTracker, etc. are just daemon processes, which are started when needed and can be stopped again.
MRv1 uses the JobTracker to create tasks and assign them to the TaskTrackers on the data nodes. This was found to be too inefficient on large clusters, which led to YARN.
MRv2 (aka YARN, "Yet Another Resource Negotiator") has one ResourceManager per cluster, and each data node runs a NodeManager. For each job, one worker node hosts the ApplicationMaster, which monitors resources, tasks, etc.
Local mode is provided to simulate and debug an MR application within a single machine/JVM.
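As a rough illustration of the three values described above, here is a minimal Python sketch. It assumes the parameter being discussed is `mapreduce.framework.name` (normally set in `mapred-site.xml`). The helper names `describe_framework` and `framework_property_xml` are made up for this example and are not part of Hadoop.

```python
import xml.etree.ElementTree as ET

# Summary of the three values, following the explanation in the answer above.
FRAMEWORK_MODES = {
    "local": "run the job inside a single JVM, no cluster involved",
    "classic": "MRv1-style execution (JobTracker / TaskTracker)",
    "yarn": "MRv2 execution on YARN (ResourceManager / NodeManager)",
}


def describe_framework(value):
    """Return a short description of a framework value, or raise ValueError."""
    if value not in FRAMEWORK_MODES:
        raise ValueError(
            "unknown framework %r, expected one of %s"
            % (value, sorted(FRAMEWORK_MODES))
        )
    return FRAMEWORK_MODES[value]


def framework_property_xml(value):
    """Build the <property> element one would place in mapred-site.xml."""
    describe_framework(value)  # validates the value before building XML
    prop = ET.Element("property")
    # Assumption: this is the parameter the question refers to.
    ET.SubElement(prop, "name").text = "mapreduce.framework.name"
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")


if __name__ == "__main__":
    for mode in ("local", "classic", "yarn"):
        print(mode, "->", describe_framework(mode))
    print(framework_property_xml("yarn"))
```

The point of the sketch is only that the same installation can be told to execute a job in different ways by changing one property value.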
EDIT: Based on comments
jps (Java Virtual Machine Process Status) is a JVM tool which, according to its official page:
The jps tool lists the instrumented HotSpot Java Virtual Machines
(JVMs) on the target system. The tool is limited to reporting
information on JVMs for which it has the access permissions.
jps is not a big data tool but a Java tool that reports on JVMs; it does not divulge any information about what is running inside a JVM.
It lists only the JVMs it has access to, which means some JVMs may still go undetected.
Keeping the above points in mind, the jps command emits different results depending on the Hadoop deployment mode:
Local (or Standalone) mode: There are no daemons and everything runs in a single JVM.
Pseudo-Distributed mode: Each daemon (NameNode, DataNode, etc.) runs in its own JVM on a single host.
Distributed mode: Each daemon runs in its own JVM across a cluster of hosts.
Each of the processes may or may not run in the same JVM, and hence the jps output will differ.
In distributed mode, the MRv2 framework runs in its default mode, i.e. yarn; hence you see the YARN-specific daemons running.
Apache Hadoop 1.x (MRv1) consists of the following daemons:
Note that NameNode and DataNode are common to both, because they are HDFS-specific daemons, while the other two are specific to MRv1 and YARN respectively.
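To make the HDFS / MRv1 / YARN split concrete, here is a small Python sketch that groups the process names from jps output by the framework they belong to. The function name `classify_jps` and the sample output are illustrative assumptions; the sample resembles what a Hadoop 1.x pseudo-distributed setup would typically print.

```python
# Daemon names grouped by the layer they belong to.
HDFS_DAEMONS = {"NameNode", "SecondaryNameNode", "DataNode"}
MRV1_DAEMONS = {"JobTracker", "TaskTracker"}
YARN_DAEMONS = {"ResourceManager", "NodeManager"}


def classify_jps(jps_output):
    """Group the process names found in jps output by framework."""
    groups = {"hdfs": [], "mrv1": [], "yarn": [], "other": []}
    for line in jps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) < 2:
            continue  # blank line, or a JVM for which jps shows no name
        name = parts[1].strip()
        if name in HDFS_DAEMONS:
            groups["hdfs"].append(name)
        elif name in MRV1_DAEMONS:
            groups["mrv1"].append(name)
        elif name in YARN_DAEMONS:
            groups["yarn"].append(name)
        else:
            groups["other"].append(name)
    return groups


# Illustrative sample, not captured from a real cluster.
SAMPLE = """\
775 DataNode
1053 JobTracker
962 SecondaryNameNode
1365 Jps
1246 TaskTracker
590 NameNode
"""

if __name__ == "__main__":
    for group, names in classify_jps(SAMPLE).items():
        print(group, names)
```

On a YARN cluster the `mrv1` group would be expected to stay empty and the `yarn` group to fill up instead, while the `hdfs` group looks the same in both cases.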

Should the Falcon Prism be installed on separate machine than the existing clusters?

I am trying to understand the setup for a Falcon distributed cluster.
I have Cluster A and Cluster B, both with their own Falcon servers (and NameNode, Oozie, Hive, etc.). Now, what would be the best way to install Prism? Shall I install it on one of the clusters (on a different node than the Falcon server) or on a different machine? If Prism is set up on a third cluster (single node), should that cluster also run components like NameNode, Oozie, etc.?
Prism has a config store where the entities are stored. The config store will typically be on HDFS and hence needs a Hadoop client.
So yes, the third cluster would need HDFS, a NameNode, etc. Oozie is not necessary.

How to start multiple datanode processes on standalone hadoop setup(pseudo-distributed)

I am new to Hadoop. I have configured a standalone Hadoop setup on a single VM running Ubuntu 13.03. After starting the Hadoop processes, the jps command shows
775 DataNode
1053 JobTracker
962 SecondaryNameNode
1365 Jps
1246 TaskTracker
590 NameNode
As per my understanding, Hadoop has started with 1 NameNode and 1 DataNode. I want to create multiple DataNode processes, i.e. multiple instances of the DataNode. Is there any way I can do that?
There are several ways to install and configure Hadoop.
Local (standalone) Mode - all Hadoop components run in a single Java process.
Pseudo-Distributed Mode - Hadoop runs all its components (DataNode, TaskTracker, JobTracker, NameNode, ...) as separate Java processes. It serves as a simulation of a fully distributed installation, but it runs on the local machine only.
Distributed Mode - a fully distributed installation. In short, without going into detail: some machines play the 'slave' role and run the DataNode and TaskTracker components, and one server plays the 'master' role and runs the NameNode and JobTracker.
Back to your question: if you would like to run Hadoop on a single machine, you have the first two options. It is impossible to run it in fully distributed mode on a single node. Maybe you can find a workaround, but it makes no sense from a basic point of view. Hadoop was designed as a distributed system; the possibility to run it on a single machine serves, IMHO, for debug/trial purposes only.
For more details follow Hadoop documentation. I hope I answered your question.

Cloudera installation Doubts?

I am new to Cloudera. I have installed it on my system successfully, and I have two questions:
1. Consider a set of nodes that already run Hadoop and hold some data. Can we install Cloudera on top of the existing Hadoop without making any changes or modifications to the data already stored there?
2. I installed Cloudera on my machine and have another three machines that I want to add to the cluster. Do I need to install Cloudera on those three machines before adding them, or can a node be added to the cluster without installing Cloudera on it?
Thanks in advance. Can anyone please give some information about the above questions?
Answer to questions -
1. If you want to migrate to CDH from an existing Apache distribution, you can follow this link
The migration process does require a moderate understanding of Linux
system administration. You should make a plan before you start. You
will be restarting some critical services such as the name node and
job tracker, so some downtime is necessary. Given the value of the
data on your cluster, you’ll also want to be careful to take recent
back ups of any mission-critical data sets as well as the name node
Backing up your data is most important if you’re upgrading from a
version of Hadoop based on an Apache Software Foundation release
earlier than 0.20.
2. The CDH binaries need to be installed and configured on all the nodes to have a CDH-based cluster up and running.
From the Cloudera Manual
You can migrate the data from a CDH3 (or any Apache Hadoop) cluster to a CDH4 cluster by
using a tool that copies out data in parallel, such as the DistCp tool
offered in CDH4.
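As a minimal sketch of what such a parallel copy could look like, the Python snippet below only assembles a `hadoop distcp` command line; it does not run it. The hostnames, ports and paths are placeholders, and the `hftp://` source scheme is an assumption based on the usual advice for copying between different Hadoop versions, so check the DistCp documentation for your releases before relying on it.

```python
import shlex


def build_distcp_command(source_uri, target_uri):
    """Return the argument list for a basic `hadoop distcp` invocation."""
    return ["hadoop", "distcp", source_uri, target_uri]


if __name__ == "__main__":
    # Placeholder URIs: replace hosts, ports and paths with your own.
    cmd = build_distcp_command(
        "hftp://old-namenode.example.com:50070/user/data",
        "hdfs://new-namenode.example.com:8020/user/data",
    )
    # Print a shell-safe version of the command for review before running it.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Building the command as a list keeps it easy to inspect, and it can later be passed to `subprocess.run` on a node that has the Hadoop client installed.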
Other sources
Regarding your second question,
Again from the manual page
Before proceeding, you need to decide:
As a general rule:
The NameNode and JobTracker run on the same "master" host unless
the cluster is large (more than a few tens of nodes), and the master
host (or hosts) should not
run the Secondary NameNode (if used), DataNode or TaskTracker
services. In a large cluster, it is especially important that the
Secondary NameNode (if used) runs on a separate machine from the
NameNode. Each node in the cluster except the master host(s) should
run the DataNode and TaskTracker services.
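The general rule quoted above can be turned into a small check. The following Python sketch is only an illustration: the function name `check_layout` and the four-host layout are hypothetical, and the check applies the rule literally, without the large-cluster exceptions the manual mentions.

```python
MASTER_ROLES = {"NameNode", "JobTracker"}
# Roles the quoted rule says a master host should not run.
KEEP_OFF_MASTER = {"SecondaryNameNode", "DataNode", "TaskTracker"}
# Roles every non-master host should run according to the rule.
WORKER_ROLES = {"DataNode", "TaskTracker"}


def check_layout(layout):
    """Return a list of warnings for a {host: [roles]} mapping."""
    warnings = []
    for host in sorted(layout):
        roles = set(layout[host])
        if roles & MASTER_ROLES:
            # Master host: flag every role that should live elsewhere.
            for role in sorted(roles & KEEP_OFF_MASTER):
                warnings.append("%s: master host also runs %s" % (host, role))
        else:
            # Worker host: flag missing worker roles. A host dedicated to
            # the SecondaryNameNode would also be flagged by this literal
            # reading of the rule.
            for role in sorted(WORKER_ROLES - roles):
                warnings.append("%s: worker host is missing %s" % (host, role))
    return warnings


if __name__ == "__main__":
    # Hypothetical layout in which host A carries every role.
    layout = {
        "A": ["NameNode", "JobTracker", "SecondaryNameNode",
              "DataNode", "TaskTracker"],
        "B": ["DataNode", "TaskTracker"],
        "C": ["DataNode", "TaskTracker"],
        "D": ["DataNode", "TaskTracker"],
    }
    for warning in check_layout(layout):
        print(warning)
```

For a small test cluster such warnings are advisory at most; they simply point out where the layout departs from the recommendation.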
Additionally, if you use Cloudera Manager, it will automatically do all the necessary setup, i.e. install the selected components on the nodes in the cluster.
Off-topic: I used to have a bad habit of not consulting the manual properly. Have a close look at it; it answers all of these questions.
Answer to your second question:
You can add the nodes directly, once a few prerequisites such as openssh-clients, firewall settings and Java are in place.
These machines (the existing node and the three new nodes) should accept the same username and password, or you should set up passwordless SSH to these hosts.
The machines should be connected to the internet while the nodes are being added.
I hope it will help you :)

How to administer Hadoop Cluster

I have a running 4-node Hadoop cluster and I am looking for a way to administer it remotely,
for example
administering the cluster from my laptop for
executing MapReduce tasks
disabling or enabling data nodes
Is there any way to do that remotely?
If you're using the Cloudera distribution, the Cloudera Manager web app would let you do that.
Other distributions may have similar control apps. That would give you per-node control.
For executing MR tasks, you would normally submit the job from an external node anyway, pointing it at the correct JobTracker and NameNode. So I'm not sure what else you're asking for there.
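A minimal sketch of such a remote submission, again only assembling the command: it assumes a Hadoop 1.x client on the laptop and a job whose driver goes through `ToolRunner`/`GenericOptionsParser`, so that the generic `-fs` and `-jt` options are honoured. The jar name, class name, hostnames and ports are all placeholders.

```python
import shlex


def build_job_command(jar, main_class, namenode_uri, jobtracker, job_args):
    """Return the argument list for submitting an MR job to a remote cluster."""
    return [
        "hadoop", "jar", jar, main_class,
        "-fs", namenode_uri,   # which NameNode (file system) to use
        "-jt", jobtracker,     # which JobTracker to submit to
    ] + list(job_args)


if __name__ == "__main__":
    # Placeholder values: substitute your own jar, class, hosts and paths.
    cmd = build_job_command(
        "wordcount.jar",
        "com.example.WordCount",
        "hdfs://namenode.example.com:8020",
        "jobtracker.example.com:8021",
        ["/input", "/output"],
    )
    print(" ".join(shlex.quote(part) for part in cmd))
```

The same effect is usually achieved by putting the cluster's addresses into the client's configuration files, in which case the two options can be dropped.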
