Changing the default scheduler in Hadoop 1.2.1 - hadoop

Since FIFO is the default scheduler in Hadoop 1.2.1, where exactly do I need to make changes to switch the default scheduler from FIFO to Capacity or Fair? I recently checked the mapred-default.xml inside hadoop-core-1.2.1.jar, as directed in this answer, but I couldn't figure out where to change the scheduling criteria. Please provide guidance. Thanks in advance.

where exactly do I need to make changes to switch the default scheduler from FIFO to Capacity or Fair
In mapred-site.xml:
Fair Scheduler
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
Capacity Scheduler
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
</property>
Note: you may want to actually read the documentation from those links, because it explains how to set each scheduler up in detail.
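For the Fair Scheduler on Hadoop 1.x you will typically also point the JobTracker at an allocation (pools) file. A minimal sketch, assuming the standard 1.x property name and an example path (adjust it to your installation):
<property>
  <name>mapred.fairscheduler.allocation.file</name>
  <value>/path/to/conf/fair-scheduler.xml</value>
</property>
In either case the JobTracker has to be restarted for the new scheduler class to take effect.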

Related

"Application priority" in Yarn

I am using Hadoop 2.9.0. Is it possible to submit jobs with different priorities in YARN? According to some JIRA tickets it seems that application priorities have now been implemented.
I tried using the YarnClient, and setting a priority to the ApplicationSubmissionContext before submitting the job. I also tried using the CLI and using updateApplicationPriority. However, nothing seems to be changing the application priority, it always remains 0.
Have I misunderstood the concept of ApplicationPriority for YARN? I saw some documentation about setting priorities to queues, but for my use case I need all jobs in one queue.
Will appreciate any clarification on my understanding, or suggestions about what I could be doing wrong.
Thanks.
Yes, it is possible to set the priority of your applications on a YARN cluster.
Leaf Queue-level priority
You can define queues with different priorities and use spark-submit to submit your application to the specific queue with the desired priority.
Basically you can define your queues in etc/hadoop/capacity-scheduler.xml like this:
<property>
  <name>yarn.scheduler.capacity.root.prod.queues</name>
  <value>prod1,prod2</value>
  <description>Production queues.</description>
</property>
<property>
  <name>yarn.scheduler.capacity.root.test.queues</name>
  <value>test1,test2</value>
  <description>Test queues.</description>
</property>
See the documentation of the queue properties here.
Note: application priority works only with the FIFO ordering policy, which is the default ordering policy.
In order to set application priority you can add properties like this to the same file:
<property>
  <name>yarn.scheduler.capacity.root.test.default-application-priority</name>
  <value>10</value>
  <description>Test queues have low priority.</description>
</property>
<property>
  <name>yarn.scheduler.capacity.root.prod.default-application-priority</name>
  <value>90</value>
  <description>Production queues have high priority.</description>
</property>
See more information about application priority here
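Note that on a running cluster, changes to capacity-scheduler.xml usually need to be applied with a queue refresh, for example:
yarn rmadmin -refreshQueues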
Changing application priority at runtime:
If you want to change application priority at runtime you can also use the CLI like this:
yarn application -appId <ApplicationId> -updatePriority <Priority>
Can you share what command you execute on what node and what response you get?
See more info here
Using YarnClient
You did not share your code, so it is difficult to see whether you are doing it right. But it is possible to submit a new application with a specific priority using YarnClient
ApplicationClientProtocol.submitApplication(SubmitApplicationRequest)
See more info here
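Since the question's code is not shown, here is only a minimal sketch (not the asker's code) of setting a priority through YarnClient with the Hadoop 2.9 client API. The application name, queue, resources and AM command are placeholders.
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SubmitWithPriority {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("priority-demo");        // placeholder name
        ctx.setQueue("default");                        // all jobs in one queue
        ctx.setPriority(Priority.newInstance(10));      // capped by yarn.cluster.max-application-priority
        ctx.setResource(Resource.newInstance(512, 1));  // AM memory (MB) and vcores

        // Placeholder AM spec: a real application would launch its ApplicationMaster here.
        ContainerLaunchContext amSpec = ContainerLaunchContext.newInstance(
                null, null, Collections.singletonList("sleep 60"), null, null, null);
        ctx.setAMContainerSpec(amSpec);

        System.out.println("Submitted " + yarnClient.submitApplication(ctx));
        yarnClient.stop();
    }
}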

MapReduce jobs get stuck in Accepted state

I have my own MapReduce code that I'm trying to run, but it just stays in the ACCEPTED state. I tried running another sample MR job that I'd run previously and which was successful. But now both jobs stay in the ACCEPTED state. I tried changing various properties in mapred-site.xml and yarn-site.xml as mentioned here and here, but that didn't help either. Can someone please point out what could possibly be going wrong? I'm using Hadoop 2.2.0.
I've tried many values for the various properties; here is one set of values:
In mapred-site.xml
<property>
  <name>mapreduce.job.tracker</name>
  <value>localhost:54311</value>
</property>
<property>
  <name>mapreduce.job.tracker.reserved.physicalmemory.mb</name>
  <value></value>
</property>
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>256</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>256</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>400</value>
  <source>mapred-site.xml</source>
</property>
In yarn-site.xml
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>400</value>
  <source>yarn-site.xml</source>
</property>
<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>.3</value>
</property>
I've had the same problem and found that making more memory available per worker node and reducing the memory required per application helped.
The settings I have (on my very small experimental boxes) in my yarn-site.xml:
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>2200</value>
  <description>Amount of physical memory, in MB, that can be allocated for containers.</description>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>500</value>
</property>
I had the same issue, and for me it was a nearly full hard drive (>90% used). Freeing up space fixed it.
A job stuck in the ACCEPTED state on YARN is usually there because free resources are not enough. You can check this at http://resourcemanager:port/cluster/scheduler:
if Memory Used + Memory Reserved >= Memory Total, memory is not enough
if VCores Used + VCores Reserved >= VCores Total, VCores are not enough
It may also be limited by parameters such as maxAMShare.
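As a rough example, the same totals can also be pulled from the ResourceManager REST API (hostname and port are placeholders; 8088 is the common default web port):
curl http://resourcemanager:8088/ws/v1/cluster/metrics
The JSON response contains fields such as totalMB, allocatedMB and reservedMB that correspond to the values shown on the scheduler page.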
I'm using Hadoop 3.0.1. I faced the same issue, where submitted MapReduce jobs were shown as stuck in the ACCEPTED state in the ResourceManager web UI. Also, in the same ResourceManager web UI, under Cluster Metrics the Memory Used was 0 and the Total Memory was 0, and under Cluster Node Metrics the Active Nodes count was 0, although the NameNode web UI listed the data nodes perfectly. Running yarn node -list on the cluster did not display any NodeManagers. It turned out that my NodeManagers were not running. After starting the NodeManagers, the newly submitted MapReduce jobs could proceed further; they were no longer stuck in the ACCEPTED state and got to the RUNNING state.
I faced the same issue. I changed every configuration mentioned in the above answers, but it was still no use. After this, I re-checked the health of my cluster. There, I observed that my one and only node was in an unhealthy state. The issue was a lack of disk space in my /tmp/hadoop-hadoopUser/nm-local-dir directory. The same can be checked by looking at the node health status in the ResourceManager web UI at port 8032. To resolve this, I added the property below to yarn-site.xml.
<property>
  <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
  <value>98.5</value>
</property>
After restarting my Hadoop daemons, the node status changed to healthy and jobs started to run.
Setting the property yarn.resourcemanager.hostname to the master node's hostname in yarn-site.xml and copying this file to all the nodes in the cluster to reflect this configuration solved the issue for me.
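For example (the hostname below is just a placeholder for your actual master node):
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>master-node</value>
</property>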

Coexistence of Hadoop MR1 and MR2

Is it possible to run both Hadoop MR1 and MR2 together in the same cluster (at least in theory)?
If yes, how can I do that?
In theory, you can do it as follows:
run the DataNode, TaskTracker and NodeManager on one machine
run the NameNode, SecondaryNameNode and ResourceManager on other machines
run all processes on different ports
However, this is not suggested; see the Cloudera blog:
"Make sure you are not trying to run MRv1 and YARN on the same set of nodes at the same time. This is not supported; it will degrade performance and may result in an unstable cluster deployment."
In theory, yes.
Unpack the tarball into 2 different locations, owned by different users.
In both of them, change all mapred/yarn related ports to mutually exclusive sets (an example of such a port split is sketched below).
Run the datanodes from only one of the locations.
Start mapred/yarn related daemons in both locations
Do post here if it works.
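A rough sketch of what mutually exclusive port sets could look like; the hostnames and port numbers are only examples, with the MR1 properties going into the first install's mapred-site.xml and the YARN properties into the second install's yarn-site.xml:
<!-- MR1 (first install) -->
<property>
  <name>mapred.job.tracker</name>
  <value>master:9001</value>
</property>
<property>
  <name>mapred.job.tracker.http.address</name>
  <value>0.0.0.0:50030</value>
</property>
<!-- YARN (second install) -->
<property>
  <name>yarn.resourcemanager.address</name>
  <value>master:8032</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>0.0.0.0:8088</value>
</property>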
Also, dfs.name.dir and dfs.data.dir should be different for MR1 and MR2.
<property>
  <name>dfs.name.dir</name>
  <value>/home/userx/hdfs/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/home/userx/hdfs/data</value>
</property>
It seems that for MapR this is not only theory but practice; check this link.
You don't need to run both; just run Hadoop 2.0, as it provides full backward compatibility for MapReduce applications written for Hadoop 1.0.
There are a few minor changes in the API; please look at the link to check whether any of them affect your applications.

In Hadoop, how to get the instance of the currently running JobTracker?

I am working on a Monitoring Tool for Hadoop. I need to get the currently running jobtracker. How can I get that?
Check out the <hadoopdir>/conf/mapred-site.xml configuration file.
In this file, you should find a <property> that has a <name> of mapred.job.tracker:
<property>
  <name>mapred.job.tracker</name>
  <value>node5:12345</value>
</property>
This tells you what node it is running on and what port it is running on.
If you are looking for any specific information, please elaborate in your original question.
I believe the closest you can get at this time is using the JobClient class: here
This will allow you to see running jobs, or walk all jobs.
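A minimal sketch of that approach on MR1 (Hadoop 1.x), assuming the cluster's mapred-site.xml is on the monitoring tool's classpath:
import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobStatus;

public class JobTrackerInfo {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        // mapred.job.tracker from mapred-site.xml tells the client where the JobTracker runs
        System.out.println("JobTracker address: " + conf.get("mapred.job.tracker"));

        JobClient client = new JobClient(conf);
        ClusterStatus cluster = client.getClusterStatus();
        System.out.println("Task trackers: " + cluster.getTaskTrackers());

        // Walk all jobs known to this JobTracker and print their run state
        for (JobStatus status : client.getAllJobs()) {
            System.out.println(status.getJobID() + " -> "
                    + JobStatus.getJobRunState(status.getRunState()));
        }
        client.close();
    }
}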

Is it possible to run Hadoop in Pseudo-Distributed operation without HDFS?

I'm exploring the options for running a hadoop application on a local system.
As with many applications the first few releases should be able to run on a single node, as long as we can use all the available CPU cores (Yes, this is related to this question). The current limitation is that on our production systems we have Java 1.5 and as such we are bound to Hadoop 0.18.3 as the latest release (See this question). So unfortunately we can't use this new feature yet.
The first option is to simply run hadoop in pseudo distributed mode. Essentially: create a complete hadoop cluster with everything on it running on exactly 1 node.
The "downside" of this form is that it also uses a full fledged HDFS. This means that in order to process the input data this must first be "uploaded" onto the DFS ... which is locally stored. So this takes additional transfer time of both the input and output data and uses additional disk space. I would like to avoid both of these while we stay on a single node configuration.
So I was thinking: Is it possible to override the "fs.hdfs.impl" setting and change it from "org.apache.hadoop.dfs.DistributedFileSystem" into (for example) "org.apache.hadoop.fs.LocalFileSystem"?
If this works the "local" hadoop cluster (which can ONLY consist of ONE node) can use existing files without any additional storage requirements and it can start quicker because there is no need to upload the files. I would expect to still have a job and task tracker and perhaps also a namenode to control the whole thing.
Has anyone tried this before?
Can it work or is this idea much too far off the intended use?
Or is there a better way of getting the same effect: Pseudo-Distributed operation without HDFS?
Thanks for your insights.
EDIT 2:
This is the conf/hadoop-site.xml config I created for Hadoop 0.18.3, using the answer provided by bajafresh4life.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>file:///</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:33301</value>
  </property>
  <property>
    <name>mapred.job.tracker.http.address</name>
    <value>localhost:33302</value>
    <description>
      The job tracker http server address and port the server will listen on.
      If the port is 0 then the server will start on a free port.
    </description>
  </property>
  <property>
    <name>mapred.task.tracker.http.address</name>
    <value>localhost:33303</value>
    <description>
      The task tracker http server address and port.
      If the port is 0 then the server will start on a free port.
    </description>
  </property>
</configuration>
Yes, this is possible, although I'm using 0.19.2. I'm not too familiar with 0.18.3, but I'm pretty sure it shouldn't make a difference.
Just make sure that fs.default.name is set to the default (which is file:///), and mapred.job.tracker is set to point to where your jobtracker is hosted. Then start up your daemons using bin/start-mapred.sh. You don't need to start up the namenode or datanodes. At this point you should be able to run your map/reduce jobs using bin/hadoop jar ...
We've used this configuration to run Hadoop over a small cluster of machines using a Netapp appliance mounted over NFS.
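For reference, the resulting workflow looks roughly like this (the examples jar name and the paths are only placeholders for a 0.18.3 install):
bin/start-mapred.sh
bin/hadoop jar hadoop-0.18.3-examples.jar wordcount /local/input /local/output
Because fs.default.name points at file:///, the input and output paths above refer to the local filesystem, so nothing has to be copied into HDFS first.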
