Topologies for hadoop cluster? - hadoop

In what topologies can we set up a Hadoop cluster? Currently I am only running a tree structure: a single master and multiple slaves. I am looking for more variants, like multiple masters, etc.

Your question is interesting. After seeing it, I searched for any tool or framework that provides multiple masters in Hadoop, but I didn't find one.
Coming to Hadoop itself, the design is one NameNode (master) and multiple DataNodes (slaves), so from Hadoop's design perspective this is fair enough.
But there is one more addition to the NameNode, called the Secondary NameNode; don't think of it as another master. Refer here: http://www.cloudera.com/blog/2009/02/multi-host-secondarynamenode-configuration/
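For reference, the classic single-master topology is driven entirely by two text files on the master. A minimal sketch, assuming a Hadoop 1.x layout under $HADOOP_HOME and placeholder hostnames:

    # conf/masters lists the host that runs the Secondary NameNode
    echo "master01" > $HADOOP_HOME/conf/masters
    # conf/slaves lists every host that should run a DataNode / TaskTracker
    cat > $HADOOP_HOME/conf/slaves <<EOF
    slave01
    slave02
    slave03
    EOF
    # bring up HDFS and MapReduce across the hosts listed above
    $HADOOP_HOME/bin/start-dfs.sh
    $HADOOP_HOME/bin/start-mapred.sh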

Related

MapR Architecture Vs Cloudera Architecture

I'm familiar with the infrastructure or architecture of Cloudera:
Master Nodes include NameNode, SecondaryNameNode, JobTracker, and HMaster.
Slave Nodes include DataNode, TaskTracker, and HRegionServer.
Master daemons should each be on their own node (unless it's a small cluster, in which case the SecondaryNameNode, JobTracker, and HMaster may be combined, and even the NameNode if it's a really small cluster).
The slave daemons (DataNode, TaskTracker, and HRegionServer) should always be colocated on the same node. The more slave nodes, the merrier.
SecondaryNameNode is a misnomer, unless you enable it for High Availability.
Does MapR maintain this setup? How is it similar and how is it different?
Good information from @JamCon in his reply, but there are some things worth clarifying:
The comment regarding patches is not accurate. MapR packages a broad range of Hadoop projects in its distribution so you don't have to separately compile anything. And MapR has the same APIs as any other distro, meaning their packages are not about compatibility but are simply bug fixes / enhancements from the community. There's typically no extra work required to get Hadoop ecosystem projects to run on MapR. And they release ecosystem updates at least once a month, as far as I can tell, to keep current with new enhancements.
Regarding the inclusion of YARN, we've been running MapR on YARN across large clusters since July '14! I believe MapR has their own ecosystem project vetting process, and they graduate MapR packaged versions to GA once they determine a project is ready for enterprise support.
MapR deviates from the vanilla Hadoop & CDH distributions a bit. It keeps most of the services and structure (Job Tracker, Data Nodes, HBase Master & Region, MR, etc), but there are some significant differences.
One of the defining items about MapR's distribution is that it doesn't use HDFS. It has its own custom FS, which features HA and operates without NameNodes (via distributed metadata). It also allowed them to enable NFS access years ahead of the rest of the Hadoop distros, as well as snapshotting.
The custom FS does complicate their distribution a bit, though ... for example, when you want to run products or services, you often need to install the MapR specific patches. When you want to run mahout, you need to compile it with the MapR patches from https://github.com/mapr/mahout. But it also gives them an opportunity to incorporate better security at the FS level, as seen by the implementation of "Access Control Expressions" and Cluster/Job/Volume ACLs.
Overall, it's a well-structured product. My biggest concern is that they've deviated so far from the norm that when new innovations are adopted upstream, they're slow to adapt, because everything has to be incorporated into their highly modified environment. YARN is a perfect example ... they haven't released it yet, even though their competitors have.
From an architecture standpoint, MapR has no master nodes. The functions that master nodes provide in a typical Hadoop architecture are instead distributed across and performed within the "data nodes" of a MapR cluster.
https://www.mapr.com/why-hadoop/why-mapr/architecture-matters
MapR doesn't have a master node (it uses a built-in distributed mechanism), whereas Cloudera has a master NameNode, a Secondary NameNode, and a ResourceManager.
http://commandstech.com/mapr-vs-cloudera-vs-hortonworks/

How to submit a job to specific nodes in Hadoop?

I have a Hadoop cluster with 1 master and 5 slaves. Is there any way of submitting jobs to a specific set of slaves? Basically, what I am trying to do is benchmark my application with many possibilities. So after testing with 5 slaves, I would like to run my application with 4 slaves, then 3 slaves, and so on.
Currently the only way I know of is decommissioning a slave and removing it from the Hadoop cluster. But that seems to be a tedious task. I was wondering if there is an easier approach, so as to avoid removing nodes from the cluster.
Thanks.
In hadoop/conf there is a file called 'slaves'; here you can simply add or remove nodes, and then restart your DFS and MapReduce daemons.
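A minimal sketch of that approach, assuming a Hadoop 1.x layout under $HADOOP_HOME and a placeholder hostname for the slave being removed:

    # drop the slave you want to exclude (e.g. slave05) from conf/slaves
    sed -i '/^slave05$/d' $HADOOP_HOME/conf/slaves
    # restart DFS and MapReduce so the new slave list is picked up
    $HADOOP_HOME/bin/stop-mapred.sh && $HADOOP_HOME/bin/stop-dfs.sh
    $HADOOP_HOME/bin/start-dfs.sh && $HADOOP_HOME/bin/start-mapred.sh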
There is a setting that points to a file with a list of excluded hosts, which you can set in mapred-site.xml. Though also a bit cumbersome, changing a single configuration value might be preferable to physically decommissioning and recommissioning multiple nodes. You could prepare multiple host-exclusion files in advance, change the setting, and restart the MapReduce service. Restarting the MapReduce service is pretty quick.
In 0.23 this setting is named mapreduce.jobtracker.hosts.exclude.filename. This is a feature introduced in 0.21, though I believe the setting was named mapred.hosts.exclude then. Check what this setting is called for the version of Hadoop you are using.
For those who encounter this problem, the comments from Alex and this Stack Overflow question will help in successfully decommissioning a node from a Hadoop cluster.
EDIT: Just editing the files hdfs-site.xml and mapred-site.xml and executing hadoop dfsadmin -refreshNodes might leave your DataNode in the decommissioning state for a long time. So it is also necessary to change dfs.replication to an appropriate value.
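Putting the pieces above together, a rough sketch of the exclusion-file approach on a Hadoop 1.x cluster; property names differ between versions, and the file path and hostname here are placeholders:

    # list the hosts to take out of rotation
    echo "slave05" > /etc/hadoop/conf/excludes
    # point HDFS and MapReduce at that file; inside <configuration> of hdfs-site.xml:
    #   <property><name>dfs.hosts.exclude</name><value>/etc/hadoop/conf/excludes</value></property>
    # and inside <configuration> of mapred-site.xml:
    #   <property><name>mapred.hosts.exclude</name><value>/etc/hadoop/conf/excludes</value></property>
    # tell the NameNode and JobTracker to re-read their host lists
    hadoop dfsadmin -refreshNodes
    hadoop mradmin -refreshNodes
    # if decommissioning hangs because blocks cannot be re-replicated, reduce the
    # replication factor of existing data to at most the number of remaining DataNodes
    hadoop fs -setrep -R 3 /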

Is it possible to add a node automatically while a Hadoop application is running?

I'm a beginner programmer and Hadoop learner.
I'm testing Hadoop in fully distributed mode using 5 PCs (each with a dual-core CPU and 2 GB of RAM).
Before starting the map tasks and HDFS, I knew that I had to configure the files (/etc/hosts with IPs and hostnames, and the masters and slaves files under the Hadoop conf folder), so I finished configuring them.
Then, during a seminar discussion at my company, my boss and chief insisted that if Hadoop needs more nodes or capacity while an application is running, it will automatically add more nodes to the cluster.
Is that possible? When I studied Hadoop clustering, many Hadoop books and community sites said that once the cluster is configured and an application is running, we can't add more nodes.
But my boss told me that Amazon says adding nodes while an application is running is possible.
Is that really true?
Hadoop experts on the Stack Overflow community, please tell me the truth in detail.
Yes, it indeed is possible.
Here is the explanation in Hadoop's wiki.
Also, Amazon's EMR enables one to add hundreds of nodes on-the-fly to an already running cluster, and as soon as the machines are up they are delegated tasks (unstarted mapper and/or reducer tasks) by the master.
So, yes, it is very much possible and is in use and not just in theory.
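As a hedged sketch of how a node joins a live Hadoop 1.x cluster by hand (hostnames and paths are placeholders): the new machine just needs the same configuration pointing at the existing master, and its daemons register themselves on startup.

    # on the new machine, with the same Hadoop install and configuration
    # (fs.default.name and mapred.job.tracker pointing at the existing master):
    $HADOOP_HOME/bin/hadoop-daemon.sh start datanode
    $HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker
    # the node registers with the NameNode and JobTracker and can start
    # receiving tasks; also add it to conf/slaves on the master so the
    # start/stop scripts manage it after the next full restart:
    echo "newslave01" >> $HADOOP_HOME/conf/slaves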

cloudera cluster node roles

I need to run a simple benchmark test on my Cloudera CDH4 cluster setup.
My Cloudera cluster setup (CDH4) has 4 nodes: A, B, C, and D.
I am using Cloudera Manager Free Edition to manage the Cloudera services.
Each node is configured to perform multiple roles as stated below.
A : NameNode, JobTrackerNode, regionserver, SecondaryNameNode, DataNode, TaskTrackerNode
B : DataNode, TaskTrackerNode
C : DataNode, TaskTrackerNode
D : DataNode, TaskTrackerNode
My first question is: can one node be both a NameNode and a DataNode?
Is this setup all right?
My second question is: on the Cloudera Manager UI, I can see many services running, but I am not sure whether I need all of these services or not.
Services running on my setup are :
hbase1
hdfs1
mapreduce1
hue1
oozie1
zookeeper1
Do I need only the hdfs1 and mapreduce1 services? If yes, how can I remove the other services?
Cloud and Hadoop concepts are new to me, so pardon me if some of my assumptions are illogical or wrong.
The answer to your first question is yes, but you would never do that in production, as the NameNode needs a sufficient amount of RAM. People usually run only the NameNode+JobTracker on their master node. It is also better to run the SecondaryNameNode on a different machine.
Coming to your second question, Cloudera Manager is not only Hadoop. It's a complete package that includes several Hadoop sub-projects like HBase (a NoSQL DB), Oozie (a workflow engine), etc., and these are the processes which you see on the UI.
If you just want to play with Hadoop, HDFS and MapReduce are sufficient. You can stop the rest of the processes easily from the UI itself; it won't do any harm to your Hadoop cluster.
HTH

How to config and use multi master nodes in a Hadoop cluster?

Could anyone please tell us how to configure and use multiple master nodes in a Hadoop cluster?
If you are looking for multiple NameNodes, then check HDFS High Availability and HDFS Federation. Both are available in the 2.x Hadoop releases.
One more master in the 1.x Hadoop releases is the JobTracker, and there can be only one JobTracker in a cluster. BTW, the JobTracker functionality has been split up in the 2.x Hadoop releases; check this for more details.
There might be some other alternate options as well, but it depends on the requirement for having multiple masters. Is it availability, scalability, or something else?
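For the HA option specifically, here is a rough sketch of the core hdfs-site.xml properties for a two-NameNode setup; the nameservice name and hosts are placeholders, and the full configuration (shared edits via JournalNodes, fencing, client failover proxy) is described in the Hadoop HA documentation:

    # inside the <configuration> element of hdfs-site.xml on every node:
    #   <property><name>dfs.nameservices</name><value>mycluster</value></property>
    #   <property><name>dfs.ha.namenodes.mycluster</name><value>nn1,nn2</value></property>
    #   <property><name>dfs.namenode.rpc-address.mycluster.nn1</name><value>nn1host:8020</value></property>
    #   <property><name>dfs.namenode.rpc-address.mycluster.nn2</name><value>nn2host:8020</value></property>
    # after the shared edits and fencing are configured, initialize the standby
    # from the active NameNode and verify which one is active:
    hdfs namenode -bootstrapStandby          # run on the standby NameNode host
    hdfs haadmin -getServiceState nn1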
