Hadoop in Maui+Torque cluster - hadoop

I have a cluster running Torque+Maui. Is it possible to install Hadoop on the same cluster? If so, what are the pros and cons of doing this?

This may be a good place to start:
http://hadoop.apache.org/docs/r0.18.3/hod.html
I haven't worked with it personally, but I've heard that it isn't being actively maintained.
From what I've seen, Hadoop has its own scheduler, which expects a set of Hadoop nodes to be running where the Hadoop file system lives. This is usually a persistent environment, so you can load the file system once (big data) and assign your job to a node that happens to hold a copy of the data you need. Torque tends to take any set of free nodes from the cluster, assign them to a job, run the job, then clean up the environment for the next job. This runs contrary to the design of Hadoop.
I can see where it would be good to have an environment that could do both, to fully utilize the systems you already have, but management will be messy at best.
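To make the contrast concrete, here is a minimal Python sketch of the two assignment styles. It is a toy model only: the block layout, node names, and both functions are hypothetical and are not Hadoop's or Torque's actual scheduling code.

```python
# Toy model: which nodes hold a replica of each block (hypothetical layout).
block_locations = {
    "block-a": {"node1", "node2", "node3"},
    "block-b": {"node2", "node4", "node5"},
    "block-c": {"node1", "node4", "node6"},
}

def locality_aware_assign(block, free_nodes):
    """Prefer a free node that already stores the block (Hadoop-style)."""
    local = sorted(block_locations[block] & set(free_nodes))
    if local:
        return local[0], True
    # No free node holds a replica: fall back to any free node (remote read).
    return sorted(free_nodes)[0], False

def arbitrary_assign(block, free_nodes):
    """Take the first free node regardless of where the data lives."""
    node = sorted(free_nodes)[0]
    return node, node in block_locations[block]

free = ["node5", "node6", "node7"]
# Each result is (chosen node, whether the data is local to that node).
locality_results = {b: locality_aware_assign(b, free) for b in block_locations}
arbitrary_results = {b: arbitrary_assign(b, free) for b in block_locations}
```

In this layout the locality-aware version places block-c on node6, which holds a replica, while the arbitrary version sends it to node5 and has to read the data over the network.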

Related

Ability to offer only part of the node resources?

Using DC/OS, we would like to schedule tasks close to the data they require, which in our case is stored in Hadoop/HDFS (on an HDP cluster). The issue is that the Hadoop cluster is not run from within DC/OS, so we are looking for a way to offer only a subset of each node's system resources.
For example, say we want to reserve 8 GB of memory for the DataNode services and offer the remainder to DC/OS for scheduling tasks.
From what I have read so far, a task can specify the resources it requires, but I have not found any means to specify what to offer from the node's perspective.
I'm aware that a CDH cluster can be run on DC/OS, which would be one way to go, but for now that is not provided for HDP.
Thanks for any ideas/tips,
Paul
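For illustration, the arithmetic behind the 8 GB example can be sketched in Python as below. The node size (32 GB, 8 cores), the 2-core reservation, and both function names are assumptions made up for this sketch. The output string imitates the `name:value;name:value` style of the Mesos agent `--resources` flag; whether and how a given DC/OS installation accepts such a setting is not confirmed by this post.

```python
def offered_resources(total_mem_mb, total_cpus, reserved_mem_mb, reserved_cpus=0.0):
    """Return what is left to offer after reserving a share for local services."""
    mem = total_mem_mb - reserved_mem_mb
    cpus = total_cpus - reserved_cpus
    if mem <= 0 or cpus <= 0:
        raise ValueError("reservation leaves nothing to offer")
    return {"cpus": cpus, "mem": mem}

def as_resources_string(resources):
    """Format as 'name:value;name:value' (style assumed from Mesos agent flags)."""
    return ";".join(f"{name}:{value:g}" for name, value in sorted(resources.items()))

# Hypothetical node: 32 GB RAM and 8 cores; keep 8 GB and 2 cores for DataNode services.
offer = offered_resources(
    total_mem_mb=32 * 1024,
    total_cpus=8,
    reserved_mem_mb=8 * 1024,
    reserved_cpus=2,
)
resources_flag = as_resources_string(offer)
print(resources_flag)
```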

Is the Hortonworks Sandbox VM preferred in a production environment?

Hortonworks HDP can be deployed in two ways:
Sandbox (VM)
Manual installation
I would like to understand whether the HDP Sandbox or a manual installation is preferred in a production environment. The choice could be made for obvious reasons like performance, but are there any other considerations?
The Hortonworks Sandbox allows you to try out the features and functionality of Hadoop and its ecosystem of projects. That's all.
If you want to go to production, you have three installation types:
Automated with Ambari
Manual
Cloud with Cloudbreak
Regards,
Alain
Performance. Hadoop is about parallel processing. You can't do that with a single node.
Storage. Hadoop uses a distributed file system. With a single node, your storage space is very limited.
Redundancy. If this node dies, everything is gone. A normal Hadoop configuration includes a replication factor (3 by default) so that when some nodes or disks go down, all of the data is still reachable. Similarly with a standby NameNode.
There are a few other points, but these are the main ones IMO.
Single-node Hadoop only makes sense for proof of concept and experimentation, not for providing production-level value.
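The redundancy point can be sketched with a small Python model. This is an illustration under simplifying assumptions: the round-robin placement, node names, and block names are made up, and real HDFS placement is rack-aware rather than round-robin.

```python
import itertools

def place_blocks(blocks, nodes, replication=3):
    """Round-robin placement: each block gets `replication` distinct nodes.

    Assumes replication <= len(nodes); real HDFS placement is rack-aware.
    """
    ring = itertools.cycle(nodes)
    return {block: {next(ring) for _ in range(replication)} for block in blocks}

def readable_blocks(placement, failed_nodes):
    """Blocks that still have at least one replica on a surviving node."""
    failed = set(failed_nodes)
    return {block for block, holders in placement.items() if holders - failed}

blocks = ["b0", "b1", "b2", "b3"]

# Five-node cluster with the default replication factor of 3.
cluster = place_blocks(blocks, ["n1", "n2", "n3", "n4", "n5"], replication=3)
survivors_cluster = readable_blocks(cluster, failed_nodes=["n1", "n2"])

# Single-node sandbox: only one copy of everything is possible.
sandbox = place_blocks(blocks, ["sandbox"], replication=1)
survivors_sandbox = readable_blocks(sandbox, failed_nodes=["sandbox"])
```

With three replicas per block, losing any two nodes still leaves every block readable, whereas losing the only node of a single-node setup leaves nothing.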

Can the Hortonworks Sandbox be used as a single-node Hadoop cluster?

I would like to study Hadoop multi-node setup and installation. From the following tutorial, I understand that a single-node cluster environment can be used as a node in a multi-node cluster:
http://bigdatahandler.com/hadoop-hdfs/hadoop-multi-node-cluster-setup/
Currently I am learning Hadoop using the Hortonworks Sandbox. Can we use a sandbox system as a single-node environment?
If not, what is the difference between the sandbox and a traditional Hadoop cluster installation?
The sandbox images (from Hortonworks and Cloudera) provide the user with a pre-configured development environment with all the usual tools (Pig, Hive, etc.) already installed. Since the image is a single "system", it is set up such that the Hadoop cluster is single-node, i.e. everything (HDFS, Hadoop MapReduce, etc.) is local to that image. That is a massive benefit, as anyone who has set up a Hadoop cluster will tell you! It allows you to get up and running with very little operational overhead.
What these sandboxes do not provide, however, is realistic cluster behaviour, as you have only one node. But there are other possibilities, tools such as Vagrant and Docker, that would allow you to do this (I have not tried it myself).
The Big Data Handler link you shared seems to be about combining several of these standalone, inherently single-node "clusters" so that you have something more realistic. But I would guess that setting this up so that YARN, ZooKeeper and other services are not duplicated is a not insignificant challenge.

Differences between existing MapReduce and YARN (MRv2)

Could anyone tell me what the differences are between the existing MapReduce and YARN? I cannot find a clear account of all the differences between the two.
P.S.: I'm asking for something like a comparison between these.
Thanks!
MRv1 uses the JobTracker to create tasks and assign them to the TaskTrackers on the data nodes, which can become a resource bottleneck when the cluster scales out far enough (usually around 4,000 nodes).
MRv2 (aka YARN, "Yet Another Resource Negotiator") has a Resource Manager for each cluster, and each data node runs a Node Manager. For each job, one worker node hosts the Application Master, which monitors that job's resources, tasks, etc.
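The division of roles described above can be sketched as a toy Python model. The class and function names mirror the YARN terms but are purely illustrative: this is not the YARN API, and the memory sizes and the "most free memory" policy are assumptions for the example.

```python
class NodeManager:
    """Per-node agent: here it only tracks free memory on its node."""
    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb

class ResourceManager:
    """Cluster-wide: tracks node capacity and hands out containers."""
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, memory_mb):
        # Pick the node with the most free memory (a deliberately naive policy).
        node = max(self.nodes, key=lambda n: n.free_mb)
        if node.free_mb < memory_mb:
            return None
        node.free_mb -= memory_mb
        return node.name

def run_job(rm, task_count, am_mb=512, task_mb=1024):
    """The first container hosts the per-job Application Master; the rest run tasks."""
    am_node = rm.allocate(am_mb)
    if am_node is None:
        raise RuntimeError("no capacity for the Application Master")
    task_nodes = [rm.allocate(task_mb) for _ in range(task_count)]
    return am_node, task_nodes

rm = ResourceManager([NodeManager("a", 4096), NodeManager("b", 4096)])
am_node, task_nodes = run_job(rm, task_count=3)
```

The point of the sketch is that per-job bookkeeping lives with the Application Master on a worker node, while the Resource Manager only arbitrates capacity, which is what removes the single JobTracker bottleneck.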
MRv1 belongs to what is also called Hadoop 1, where resource management and scheduling are tightly coupled with MapReduce (the programming framework).
Because of this, non-batch applications cannot be run on Hadoop 1.
It has a single NameNode, so it doesn't provide high availability or scalability.
In MRv2 (aka Hadoop 2), the resource management and scheduling tasks are separated from MapReduce and handled by YARN (Yet Another Resource Negotiator).
The resource management and scheduling layer lies beneath the MapReduce layer.
It also provides high system availability and scalability, as we can create redundant NameNodes.
There is also a new snapshot feature, through which we can take backups of file systems, which helps with disaster recovery.

Is it possible to add a node automatically while a Hadoop application is running?

I'm a beginner programmer and Hadoop learner.
I'm testing Hadoop in fully distributed mode using 5 PCs (each with a dual-core CPU and 2 GB of RAM).
Before starting map tasks and HDFS, I learned that I must configure some files (/etc/hosts with IP addresses and hostnames, and the masters and slaves files in the Hadoop conf folder), so I configured those files.
During a seminar discussion at my company, my boss and chief insisted that even while a Hadoop application is running, Hadoop will automatically add more nodes if it needs more nodes or clusters.
Is that possible? When I studied Hadoop clustering, many Hadoop books and community sites insisted that after configuring and starting an application, we can't add more nodes or clusters.
But my boss told me that Amazon says adding nodes to a running application is possible.
Is that really true?
Hadoop experts in the Stack Overflow community, please tell me the truth in detail.
Yes, it is indeed possible.
Here is the explanation in Hadoop's wiki.
Also, Amazon's EMR lets you add hundreds of nodes on the fly to an already running cluster, and as soon as the machines are up they are delegated tasks (unstarted mapper and/or reducer tasks) by the master.
So, yes, it is very much possible and is in use, not just in theory.
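As a rough illustration of that behaviour, here is a toy Python model in which a node that joins mid-job starts receiving unstarted tasks in the next scheduling round. The `Master` class, its methods, and the task and node names are hypothetical; this is not Hadoop's or EMR's actual mechanism.

```python
from collections import deque

class Master:
    """Toy master that hands unstarted tasks to whichever nodes are registered."""
    def __init__(self, tasks):
        self.pending = deque(tasks)
        self.assignments = {}  # node name -> list of tasks given to it

    def add_node(self, node):
        # Nodes may register at any time, including while a job is running.
        self.assignments.setdefault(node, [])

    def schedule_round(self):
        """Give each registered node one pending task, while any remain."""
        for node in self.assignments:
            if not self.pending:
                break
            self.assignments[node].append(self.pending.popleft())

master = Master(["t1", "t2", "t3", "t4", "t5", "t6"])
master.add_node("n1")
master.schedule_round()   # only n1 exists, so it takes t1

master.add_node("n2")     # a new node joins while the job is in progress
master.schedule_round()   # n1 takes t2, n2 takes t3
print(master.assignments)
```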

Resources