Sharing a cluster with Hadoop - hadoop

Is it possible to set Hadoop up so that it plays nicely with other applications on a cluster?
I'm familiar with the Torque+Maui resource scheduler, and with using HadoopOnDemand to provision temporary Hadoop clusters. But that gets pretty cumbersome if lots of people want to use Hadoop: each person has the same headache of setting up and tearing down their own mini hadoop cluster, copying data on and off thier own HDFS, etc.
It would be much cooler if we could have one permanent instance of Hadoop running that people share, with an HDFS that is always up. This would require Hadoop intelligently allocating work to nodes that aren't busy with other applications (like R, say), and not being to greedy when queueing jobs.
Is this possible?

Isn't this what the fair scheduler does?
http://hadoop.apache.org/mapreduce/docs/r0.21.0/fair_scheduler.html
We use this to run a permanent hadoop cluster with 30 users. You can have it preempt tasks to reallocate to new pools, and can set individual priorities for each pool too.

Related

Hadoop Resource management

I have a 12 node cluster and I am running a yarn architecture. It seems that my nodes are busy most of the time and many times job fails. How can I check the usage of the resources at any point of time?
Also is there any method to set a limited resource to a user for eg: if a user submits a job he should be given only 25gb of memory and 12 cores.
There are multiple ways to monitor the cluster.
If you are using Cloudera distribution then you can go to Cloudera Manager to monitor and manage the resources
If you are using Hortonworks distribution then you can go to Ambari web interface to monitor and manage the resources
If you are not using any distributions then clusters will be managed using Ganglia or Nagios web interface
Even if you do not have any of these you can go to resource manager web interface which typically runs on http://:8088. 8088 is default port number, it can be customized and you can get that information from yarn-site.xml
If your organization does not provide access to the web interfaces you can use commands such as yarn application --list and mapred job --list to see what is going on in the cluster
It is little tedious to monitor actual usage. You should know linux commands to monitor and develop shell scripts.
Also is there any method to set a limited resource to a user for eg: if a user submits a job he should be given only 25gb of memory and 12 cores.
Yes, you need to use queues and pools concept of schedulers embedded in Yarn. There are 3 types of scheduler FIFO, Capacity and Fair. FIFO should not be used in any of the clusters, it is mainly for development. You need to understand capacity and fair scheduler and set the limits.
It seems that my nodes are busy most of the time and many times job fails
You can implement some generic performance tuning guidelines to improve the thorughput. Have a look at this post : Tips to improve MapReduce Job performance in Hadoop , cloudera article and Map reduce performance aticle
Also is there any method to set a limited resource to a user for eg: if a user submits a job he should be given only 25gb of memory and 12 cores.
Adding to Durga's answer,
Fair scheduling is a method of assigning resources to applications such that all apps get, on average, an equal share of resources over time. Hadoop NextGen is capable of scheduling multiple resource types.
By default, the Fair Scheduler bases scheduling fairness decisions only on memory. It can be configured to schedule with both memory and CPU, using the notion of Dominant Resource Fairness developed by Ghodsi.
The scheduler organizes apps further into “queues”, and shares resources fairly between these queues. By default, all users share a single queue, named “default”. If an app specifically lists a queue in a container resource request, the request is submitted to that queue. It is also possible to assign queues based on the user name included with the request through configuration.
e.g.
<user name="sample_user">
<maxRunningApps>30</maxRunningApps>
</user>
<userMaxAppsDefault>5</userMaxAppsDefault>
The CapacityScheduler is designed to run Hadoop applications as a shared, multi-tenant cluster in an operator-friendly manner while maximizing the throughput and the utilization of the cluster.
Traditionally each organization has it own private set of compute resources that have sufficient capacity to meet the organization’s SLA under peak or near peak conditions. This generally leads to poor average utilization and overhead of managing multiple independent clusters, one per each organization
<property>
<name>yarn.scheduler.capacity.queue-mappings</name>
<value>u:user1:queue1,g:group1:queue2,u:%user:%user,u:user2:%primary_group</value>
<description>
Here, <user1> is mapped to <queue1>, <group1> is mapped to <queue2>,
maps users to queues with the same name as user, <user2> is mapped
to queue name same as <primary group> respectively. The mappings will be
evaluated from left to right, and the first valid mapping will be used.
</description>
</property>
Have a look at Fair scheduler and Capacity scheduler

Can the same Zookeeper instance be used by number of services?

Is one Zookeeper installation good enough to be used by Hadoop Kafka and Storm clusters?
I want to deploy all on one test environment and try playing with those technologies,
can I use one zookeeper installation for that? same znode could be dedicated for number of services?
Yes, you can use a single zookeeper installation to support more than one cluster and indeed different types of clusters. This has been the case for a long time - here's a link to a good discussion on it from 2009: http://zookeeper-user.578899.n2.nabble.com/Multiple-ZK-clusters-or-a-single-shared-cluster-td3277547.html
For testing this is fine (and even to run it on one ZK server). For production use though you'll want at least a 3 node cluster. And you should think carefully about running everything off of a single cluster.
The reason is that if you run multiple Hadoop, Storm and Kafka clusters off of a single ZK cluster, that one set of servers becomes a single point of failure for all of your distributed systems. You can beef up the ZK setup with more than 3 servers (let's say 7) so that it can handle multiple failures, but if someone were to accidentally bring ZK down all your distributed environments would come down too.
Some argue that you would be better off with more isolation between systems. I think it varies by use case but I'd be careful about putting all of your eggs in one ZK basket.

Hadoop Performance Monitoring tools for Windows

Any tools for monitoring performance on a Hadoop cluster in Windows. We installed Hortonworks HDP 2.2.0 on windows single node cluster and tested our jar. we were able to process 5 million records in 26 minutes. Now we have set up a cluster with 4 slave machines and 1 name node. Though the RAM of each machine is 8 Gigs, we are just doing a proof of concept. we see no improvement in the processing time in the cluster. Are there any tools which point out the problem. All the available are written for Linux.
Thanks,
Kishore.
5 million records doesn't sound like a lot to throw on Hadoop. What's the size of your data in gb?
I don't know any Hadoop monitoring tools for Windows but you should start with the basics - is your data splittable? Have a look at the resource manager's view - how many containers did you have for your map-reduce app? Were they distributed on all machines? (the capacity scheduler tends not to distribute the load on several machines if it can stick all of it on one). CPU usage per task attempt, io per task attempt?
You should also store, compare and analyze Windows performance counters - cpu, i/o, network to see if you have any bottlenecks.
You may not need Windows-native tools to surface the kinds of performance metrics you are looking for. If you're after performance metrics from YARN, MapReduce, or HDFS, you can collect metrics from each of those technologies out of the box from a web interface/HTTP endpoint exposed by each tech in question.
With HDFS, for example, you can collect metrics from the NameNode and DataNodes via HTTP. In addition, you can access the full suite of metrics via JMX, though that option requires a little more configuration.
I wrote a guide to collecting Hadoop performance metrics with native tools which you might find useful. It details methods for collecting metrics for MapReduce, YARN, HDFS, and ZooKeeper.

What additional benefit does Yarn bring to the existing map reduce?

Yarn differs in its infrastructure layer from the original map reduce architecture in the following way:
In YARN, the job tracker is split into two different daemons called Resource Manager and Node Manager (node specific). The resource manager only manages the allocation of resources to the different jobs apart from comprising a scheduler which just takes care of the scheduling jobs without worrying about any monitoring or status updates. Different resources such as memory, cpu time, network bandwidth etc. are put into one unit called the Resource Container. There are different AppMasters running on different nodes which talk to a number of these resource containers and accordingly update the Node Manager with the monitoring/status details.
I want to know that how does using this kind of an approach increase the performance from the map-reduce perspective? Also, if there is any definitive content on the motivation behind Yarn and its benefits over the existing implementation of Map-reduce, please point me to the same.
Here are some of the articles (1, 2, 3) about YARN. These talk about the benefits of using YARN.
YARN is more general than MR and it should be possible to run other computing models like BSP besides MR. Prior to YARN, it required a separate cluster for MR, BSP and others. Now they they can coexist in a single cluster, which leads to higher usage of the cluster. Here are some of the applications ported to YARN.
From a MapReduce perspective in legacy MR there are separate slots for Map and Reduce tasks, but in YARN their is no fixed purpose of a container. The same container can be used for a Map task, Reduce task, Hama BSP Task or something else. This leads to better utilization.
Also, it makes it possible to run different versions of Hadoop in the same cluster which is not possible with legacy MR, which makes is easy from a maintenance point.
Here are some of the additional links for YARN. Also, Hadoop: The Definitive Guide, 3rd Edition has an entire section dedicated to YARN.
FYI, it had been a bit controversial to develop YARN instead of using some of frameworks which had been doing something similar and had been running for ages successfully with bugs ironed out.
I do not think that Yarn will speedup the existing MR framework. Looking into architecture we can see that the system now is more modular - but modularity usually contradicts higher performance.
It can be claimed that YARN has nothing to do with MapReduce. MapReduce just became one of the YARN applications. You can see it as moving from some embedded program to embeded OS with program within it
At the same time Yarn opens the door for different MR implementations with different frameworks. For example , if we assume that our dataset is smaller then cluster memory we can get much better performance. I think http://www.spark-project.org/ is one such example
To summarize it: Yarn does not improve the existing MR, but will enable other MR implementations to be better in all aspects.
All the above answers covered lot of information: I am simplifying all the information as follows:
MapReduce: YARN:
1. It is Platform plus Application It is a Platform in Hadoop 2.0 and
in Hadoop 1. 0 and it is only of doesn't exist in Hadoop 1.0
the applications in Hadoop 2.0
2. It is single use system i.e., It is multi purpose system, We can run
We can run MapReduce jobs only. MapReduce, Spark, Tez, Flink, BSP, MPP,
MPI, Giraph etc... (General Purpose)
3. JobTracker scalability i.e., Both Resource Management and
Both Resource Management and Application Management gets separated &
Job Management managed by RM+NM, Paradigm specific AMs
respectively.
4. Poor Resource Management Flexible Resource Management i.e.,
system i.e., slots (map/reduce) containers.
5. It is not highly available High availability and reliability.
6. Scaled out up to 5000 nodes Scaled out 10000 plus nodes.
7. Job->tasks Application -> DAG of Jobs -> tasks
8. Classical MapReduce = MapReduce Yarn MapReduce = MapReduce API +
API + MapReduce FrameWork MapReduce FrameWork + YARN System
+ MapReduce System So MR programs which were written over
Hadoop 1.0 run over Yarn also with out
changing a single line of code i.e.,
backward compatibility.
Let's see Hadoop 1.0 drawbacks, which have been addressed by Hadoop 2.0 with addition of Yarn.
Issue of Scalability : Job Tracker runs on a single machine even though you have thousands of nodes in Hadoop cluster. The responsibilities of Job tracker : Resource management, Job and Task schedule and monitoring. Since all these processes are running on a single node, this model is not scalable.
Issue of availability ( Single point of failure): Job Tracker is a single point of failure.
Resource utilization: Due to predefined number of Map & Reduce task slots, resources are not utilized properly. When all Mapper nodes are busy, Reducer nodes are idle and can't be used to process Mapper tasks.
Tight integration with Map Reduce framework: Hadoop 1.x can run Map reduce jobs only. Support for jobs other than Map Reduce jobs does not exists.
Now single Job Tracker bottleneck has been removed with YARN architecture in Hadoop 2.x
The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs.
The ResourceManager has two main components: Scheduler and ApplicationsManager.
The Scheduler is responsible for allocating resources to the various running applications subject to familiar constraints of capacities, queues etc. The Scheduler is pure scheduler in the sense that it performs no monitoring or tracking of status for the application.
The ApplicationsManager is responsible for accepting job-submissions, negotiating the first container for executing the application specific ApplicationMaster and provides the service for restarting the ApplicationMaster container on failure.
The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring for progress.
Now advantages of YARN
Scalability issues have been resolved
No single point of failure. All components are highly available
Resource utilization has been improved with proper utilization of Map and reduce slots.
Non Map Reduce Jobs can be submitted
It looks like this link might be what you're looking for: http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/.
My understanding is that YARN is supposed to be more generic. You can create your own YARN applications that negotiate directly with the Resource Manager for resources (1), and MapReduce is just one of several Application Managers that already exist (2).

Can hadoop be used as a distributed queue server?

I'm thinking of learning hadoop but not sure if it'll solve my problem. Basically I have a job with a queue and a bunch of workers. Each worker does a small amount of work and then either saves the results(if successful) or sends it back to the queue for further processing. My problem is scalable, is limited by the bandwidth on the network(ec2) which will never keep up with multiple cpu's crunching the data. I thought maybe I could run my jobs in Java in a hadoop cluster and have hadoop distribute the work via a queue. Would this be a better approach? I am correct in assuming hadoop can a queue and try to run jobs as locally as possible to minimize bandwidth usage and maximize cpu usage? My program is very cpu bound but most of my recent problems with its performence are related to passing work over a network(I want to keep the work as local as possible), but the difference between the hadoop tutorials I see and my problem is that in the tutorials all the work is known in advance while my program is generating new work for its self constantly(until its finally done). Would this work and would it help me reduce the impact of passing messages over a network?
Sorry I'm new to hadoop and wanted to know if it could solve my problem.
Hadoop is all about running jobs in a batch-like mode over a large data set. It's hard to get it to have some sort of queue-like behavior, but not impossible. There is Apache ZooKeeper, which will give you synchronization to build a queue if you need it.
There are plenty of tools to solve the problem it looks like you are trying to solve. I suggest taking a look at RabbitMQ. If you use python, Celery is quite fantastic.

Resources