I'm familiar with the infrastructure or architecture of Cloudera:
Master Nodes include NameNode, SecondaryNameNode, JobTracker, and HMaster.
Slave Nodes include DataNode, TaskTracker, and HRegionServer.
Master nodes should all be on their own nodes (unless its a small cluster, than SecondaryNameNode, JobTracker, and HMaster may be combined, and even the NameNode if its a really small cluster).
Slave Nodes should always be colocated on the same node. The more slave nodes, the merrier.
SecondaryNameNode is a misnomer, unless you enable it for High Availability.
Does MapR maintain this setup? How is it similar and how is it different?
Good information by #JamCon in his reply, but there are some things worth clarifying:
The comment regarding patches is not accurate. MapR packages a broad range of Hadoop projects in its distribution so you don't have to separately compile anything. And MapR has the same APIs as any other distro, meaning their packages are not about compatibility but are simply bug fixes / enhancements from the community. There's typically no extra work required to get Hadoop ecosystem projects to run on MapR. And they release ecosystem updates at least once a month, as far as I can tell, to keep current with new enhancements.
Regarding the inclusion of YARN, we've been running MapR on YARN across large clusters since July '14! I believe MapR has their own ecosystem project vetting process, and they graduate MapR packaged versions to GA once they determine a project is ready for enterprise support.
MapR deviates from the vanilla Hadoop & CDH distributions a bit. It keeps most of the services and structure (Job Tracker, Data Nodes, HBase Master & Region, MR, etc), but there are some significant differences.
One of the defining items about MapR's distribution is that it doesn't use HDFS. It has its own custom FS, which features HA and operates without Name Nodes (via distributed metadata). It also allowed them to enable NFS access years ahead of the rest of the Hadoop distros, as well as snap shotting.
The custom FS does complicate their distribution a bit, though ... for example, when you want to run products or services, you often need to install the MapR specific patches. When you want to run mahout, you need to compile it with the MapR patches from https://github.com/mapr/mahout. But it also gives them an opportunity to incorporate better security at the FS level, as seen by the implementation of "Access Control Expressions" and Cluster/Job/Volume ACLs.
Overall, it's a well structured product. My biggest concern is they've deviated so far from the norm that when new innovations are adopted, they're slow to adapt, because it has to be incorporated into their highly modified environment. YARN is a perfect example ... they haven't released it yet, even though their competitors have.
From an architecture stand point with MapR there are no master nodes. The functions that the master nodes provide in a typical Hadoop architecture are instead distributed and performed within the "data nodes" of MapR.
https://www.mapr.com/why-hadoop/why-mapr/architecture-matters
MapR doesn't have master node, inbuilt mechansim but in Cloudera have master node, secondary name node and resource manager
http://commandstech.com/mapr-vs-cloudera-vs-hortonworks/
Related
The HortonWorks HDP, could be implemented in two ways:
Sandbox (VM)
Manual Installation.
I would like to understand, whether HDP SandBox, or the manual installation is preferred in the production environment. The choice could be made on obvious reasons like performance, but I would like to understand whether there are any other considerations?
The Hortonworks Sandbox allows to try out the features and functionality in Hadoop and its' ecosystem of projects. That's all.
If you want to go to production, you have three installation type:
Automated with Ambari
Manual
Cloud with Cloudbreak
Regards,
Alain
performance. hadoop is about parallel processing. Can't do that with a single node.
storage. hadoop uses a distributed file system. With a single node your storage space is very limited.
redundancy. if this node dies, everything is gone. Normal hadoop configuration include a redundancy factor (of 3 by default) so that when some nodes or disks go down, all of the data is still reachable. Similarly with a standby namenode.
There are a few other points, but these are the main ones IMO.
Single node hadoop only makes sense for proof of concept, and experimentation. Not for providing production level value.
I know that Hadoop is based on Master/Slave architecture
HDFS works with NameNodes and DataNodes
and MapReduce works with jobtrackers and Tasktrackers
But I can't find all these services on MapR, I find out that it has its own Architecture with its own services
I'm a little bit confused, could any one please tell me what is the difference between using Hadoop only and using it with MapR !
You have to refer to Hadoop 2.x latest architecture since YARN ( Yet Another Resource Negotiator) & High Availability have been introduced in 2.x version.
Job tracker and Task tracker are replaced with Resource Manager, Node Manager and Applications Manager.
Hadoop 2.x YARN & High Availability
For MapR architecture, refer to MapR article
For comparison between different distributors, refer to this image
Detailed comparison is available at Data-magnum article by Bill Vorhies
MapR and apache Hadoop DO NOT have same architecture at storage level. MapR uses its own filesystem MaRFS which is completely different from HDFS in terms of concept and implemenation . you can find more detailed comparision here : https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots#.VfGwwxG6eUk
https://www.mapr.com/resources/videos/comparison-mapr-fs-and-hdfs
Mapr uses most of Apache bigdata distributions as their baseline.
Mapr is a hadoop (and bigdata technology stacks) distribution provider with certain add-ons and technical support to its client.
Underline the mapr is entirely on the same architecture as of apache hadoop including all the core library distribution. However mapr distribution is more like a bundle of a complete and compatible bigdata technology package.
The main benefit of mapr is that it's distribution of various technologies like hive, hbase, spark etc will be compatible with core hadoop and among each other. This I'd particularly important because the bigdata technologies are evolving in different pace and hence news releases becomes incompatible very soon.
So, the vendors like mapr, cloudera etc are providing their version of hadoop didtribution and support such that end users can concentrate on the product building without worrying about the compatibility issues. But almost all of them are using apache distribution under the carpet.
In future, they might come up certain variation and additional features in an attempt to prevent client's switch to other vendors, but as of now is not the case.
I like to study about Hadoop multinode setup and installation, by referring the above tutorial I understand that single node cluster environment can be used as node for the multinode cluster
http://bigdatahandler.com/hadoop-hdfs/hadoop-multi-node-cluster-setup/
Currently I am learning Hadoop using Horton sandbox, can we use a sandbox system as a single node environment?
If not what is the difference between sandbox and traditional Hadoop cluster installation
The sandbox images (from Hortonworks and Cloudera) provide the user with a pre-configured development environment with all the usual tools already available and installed (pig, hive etc.). Since the image is a single "system" it is set-up such that the hadoop cluster is single-node: i.e. everything - HDFS, Hadoop map-reduce etc. - is local to that image. That is a massive benefit, as anyone who has set up a hadoop cluster will tell you! It allows you to get up-and-running with very little operational overhead.
What these sandboxes do not provide, however, is realistic cluster behaviour as you have only one node. But there other possibilities - tools such as Vagrant and Docker - that would allow you to do this (I have not tried it myself).
The big data handler link you shared seems to be about combining several of these standalone, inherently single-node "clusters" so that you have something more realistic. But I would guess setting this up so that YARN, Zookeeper and other services are not duplicated comes with a not insignificant challenge.
I am new to cloudera, I installed cloudera in my system successfully I have two doubts,
Consider a machine with some nodes already using hadoop with some data, Can we install Cloudera to use the existing Hadoop without made any changes or modifaction on data stored existing hadooop.
I installed Cloudera in my machine, I have another three machines to add those as clusters, I want to know, Am i want install cloudera in those three machines before add those machines as clusters ?, or Can we add a node as clusters without installing cloudera on that purticular nodes?.
Thanks in advance can anyone, please give some information about the above questions.
Answer to questions -
1. If you want to migrate to CDH from existing Apache Distribution, you can follow this link
Excerpt:
Overview
The migration process does require a moderate understanding of Linux
system administration. You should make a plan before you start. You
will be restarting some critical services such as the name node and
job tracker, so some downtime is necessary. Given the value of the
data on your cluster, you’ll also want to be careful to take recent
back ups of any mission-critical data sets as well as the name node
meta-data.
Backing up your data is most important if you’re upgrading from a
version of Hadoop based on an Apache Software Foundation release
earlier than 0.20.
2.CDH binary needs be installed and configured in all the nodes to have a CDH based cluster up and running.
From the Cloudera Manual
You can migrate the data from a CDH3 (or any Apache Hadoop) cluster to a CDH4 cluster by
using a tool that copies out data in parallel, such as the DistCp tool
offered in CDH4.
Other sources
Regarding your second question,
Again from the manual page
Important:
Before proceeding, you need to decide:
As a general rule:
The NameNode and JobTracker run on the the same "master" host unless
the cluster is large (more than a few tens of nodes), and the master
host (or hosts) should not
run the Secondary NameNode (if used), DataNode or TaskTracker
services. In a large cluster, it is especially important that the
Secondary NameNode (if used) runs on a separate machine from the
NameNode. Each node in the cluster except the master host(s) should
run the DataNode and TaskTracker services.
Additionally, if you use Cloudera Manager it will automatically do all the setup necessary i.e install the necessary selected components on the nodes in the cluster.
Off-topic: I had a bad habit of not referrring the manual properly. Have a clear look at it, it answers all our questions
Answer to your second question,
you can add directly, with installation few pre requisites like openssh-clients and firewalls and java.
these machines( existing node, new three nodes) should accept same username and password (or) you should set passwordless ssh to these hosts..
you should connect to the internet while adding the nodes.
I hope it will help you:)
I'm beginner programmer and hadoop learner.
I'm testing hadoop full distribute mode using 5 PC(has Dual-core cpu and ram 2G)
before starting maptask and hdfs, I knew that I must configure file(etc/hosts on Ip, hostname and hadoop folder/conf/masters,slaves file) so I finished configured that file
and when debating on seminar in my company, my boss and chief insisted that even if hadoop application running state, if hadoop need more node or cluster, automatically, hadoop will add more node
Is it possible? When I studied about hadoop clusturing, Many hadoop books and community site insisted that after configuration and running application, We can't add more node or cluster.
But My boss said to me that Amazon said adding node on running application is possible.
Is really true?
hadoop master users on stack overflow community, Please tell me detail about the truth.
Yes it indeed is possible.
Here is the explanation in hadoop's wiki.
Also Amazon's EMR enables one to add 100s of nodes on-the-fly in an alreadt running cluster and as soon as the machines are up they are delegated tasks(unstarted mapper and/or reducer tasks) by the master.
So, yes, it is very much possible and is in use and not just in theory.