why hadoop capacity scheduler uses 200% of Capacity - hadoop

I encountered the same problem on our cluster and returned to my pc to do some simple experiments hoping to figure it out.I configured hadoop in Pseudo-distributed mode and used the default capacity-scheduler.xml and configured the mapred-site.xml as the following:
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>io.sort.mb</name>
<value>5</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx10m</value>
</property>
<property>
<name>mapred.jobtracker.taskScheduler</name>
<value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
</property>
<property>
<name>mapred.queue.names</name>
<value>default</value>
</property>
<property>
<name>mapred.cluster.map.memory.mb</name>
<value>100</value>
</property>
<property>
<name>mapred.cluster.max.map.memory.mb</name>
<value>200</value>
</property>
</configuration>
The web UI looks like this :
Queue Name default
Scheduling Information
Queue configurationfatal
Capacity Percentage: 100.0%
User Limit: 100%
Priority Supported: NO
-------------
Map tasks
Capacity: 2 slots
Used capacity: 2 (100.0% of Capacity)
Running tasks: 1
Active users:
User 'luo': 2 (100.0% of used capacity)
-------------
Reduce tasks
Capacity: 2 slots
Used capacity: 0 (0.0% of Capacity)
Running tasks: 0
-------------
Job info
Number of Waiting Jobs: 0
Number of users who have submitted jobs: 1
Actually , it did work without anything wrong when I submitted a streaming job with one map task which occupies 2 slots and no reduce task.The streaming script is rather simple
~/hadoop/hadoop-0.20.2/bin/hadoop jar Streaming_blat.jar -D mapred.job.map.memory.mb=199 -D mapred.job.name='memory alloc' -D mapred.map.tasks=1 -input file://pwd/input/ -mapper ' /home/luo/hadoop/hadoop-0.20.2/bin/a.out' -output file://pwd/output/ -reducer NONE
a.out is just a C program simply outputting the pid and ppid to a specified file.
And problems came when I set mapred.map.tasks=3. The web UI showed
Map tasks
Capacity: 2 slots
Used capacity: 4 (200.0% of Capacity)
Running tasks: 2
Active users:
User 'luo': 4 (100.0% of used capacity)
which means it already exceeds the limit of map slots I set in mapred-site.xml. As a result, it prompted something like this again and again
Killing one of the least progress tasks - attempt_201210121915_0012_m_000000_0, as the cumulative memory usage of all the tasks on the TaskTracker exceeds virtual memory limit 207618048.
What I want it to do is suspend the map task until there are available slots without exceeding the capacity.So what's wrong have I done ? Could any one provide some solutions? Thanks a lot.

All right,I answer it myself.After cracking the code, I know those 4 properties must be all set in the mapred-site.xml,otherwise the scheduler does not perform memory check(I only set two of them).
mapred.cluster.map.memory.mb
mapred.cluster.reduce.memory.mb
mapred.cluster.max.map.memory.mb
mapred.cluster.max.reduce.memory.mb

Related

HADOOP YARN - Application is added to the scheduler and is not yet activated. Skipping AM assignment as cluster resource is empty

I am evaluating YARN for a project. I am trying to get the simple distributed shell example to work. I have gotten the application to the SUBMITTED phase, but it never starts. This is the information reported from this line:
ApplicationReport report = yarnClient.getApplicationReport(appId);
Application is added to the scheduler and is not yet activated. Skipping AM assignment as cluster resource is empty. Details : AM Partition = DEFAULT_PARTITION; AM Resource Request = memory:1024, vCores:1; Queue Resource Limit for AM = memory:0, vCores:0; User AM Resource Limit of the queue = memory:0, vCores:0; Queue AM Resource Usage = memory:128, vCores:1;
The solutions for other developers seems to have to increase yarn.scheduler.capacity.maximum-am-resource-percent in the yarn-site.xml file from its default value of .1. I have tried values of .2 and .5 but it does not seem to help.
Looks like you did not configure the RAM allocated to Yarn in a proper way. This can be a pin in the ..... if you try to infer/adapt from tutorials according to your own installation. I would strongly recommend that you use tools such as this one:
wget http://public-repo-1.hortonworks.com/HDP/tools/2.6.0.3/hdp_manual_install_rpm_helper_files-2.6.0.3.8.tar.gz
tar zxvf hdp_manual_install_rpm_helper_files-2.6.0.3.8.tar.gz
rm hdp_manual_install_rpm_helper_files-2.6.0.3.8.tar.gz
mv hdp_manual_install_rpm_helper_files-2.6.0.3.8/ hdp_conf_files
python hdp_conf_files/scripts/yarn-utils.py -c 4 -m 8 -d 1 false
-c number of cores you have for each node
-m amount of memory you have for each node (Giga)
-d number of disk you have for each node
-bool "True" if HBase is installed; "False" if not
This should give you something like:
Using cores=4 memory=8GB disks=1 hbase=True
Profile: cores=4 memory=5120MB reserved=3GB usableMem=5GB disks=1
Num Container=3
Container Ram=1536MB
Used Ram=4GB
Unused Ram=3GB
yarn.scheduler.minimum-allocation-mb=1536
yarn.scheduler.maximum-allocation-mb=4608
yarn.nodemanager.resource.memory-mb=4608
mapreduce.map.memory.mb=1536
mapreduce.map.java.opts=-Xmx1228m
mapreduce.reduce.memory.mb=3072
mapreduce.reduce.java.opts=-Xmx2457m
yarn.app.mapreduce.am.resource.mb=3072
yarn.app.mapreduce.am.command-opts=-Xmx2457m
mapreduce.task.io.sort.mb=614
Edit your yarn-site.xml and mapred-site.xml accordingly.
nano ~/hadoop/etc/hadoop/yarn-site.xml
nano ~/hadoop/etc/hadoop/mapred-site.xml
Moreover, you should have this in your yarn-site.xml
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>name_of_your_master_node</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
and this in your mapred-site.xml:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
Then, upload your conf files to each node using scp (If you uploaded you ssh keys to each one)
for node in node1 node2 node3; do scp ~/hadoop/etc/hadoop/* $node:/home/hadoop/hadoop/etc/hadoop/; done
And then, restart yarn
stop-yarn.sh
start-yarn.sh
and check that you can see your nodes:
hadoop#master-node:~$ yarn node -list
18/06/01 12:51:33 INFO client.RMProxy: Connecting to ResourceManager at master-node/192.168.0.37:8032
Total Nodes:3
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
node3:34683 RUNNING node3:8042 0
node2:36467 RUNNING node2:8042 0
node1:38317 RUNNING node1:8042 0
This might fix the issue (good luck) (additional info)
Add below properties to yarn-site.xml and restart dfs and yarn
<property>
<name>yarn.scheduler.capacity.root.support.user-limit-factor</name>
<value>2</value>
</property>
<property>
<name>yarn.nodemanager.disk-health-checker.min-healthy-disks</name>
<value>0.0</value>
</property>
<property>
<name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
<value>100.0</value>
</property>
I got the same error and tried to solve it hard. I realized the resource manager had no resource to allocate the application master (AM) of the MapReduce application.
I navigated on browser http://localhost:8088/cluster/nodes/unhealthy and examined unhealthy nodes (in my case there was only one) -> health report. I saw the warning about that some log directories filled up. I cleaned those directories then my node became healthy and the application state switched to RUNNING from ACCEPTED. Actually, as a default, if the node disk fills up more than %90, YARN behaves like that. Someway you have to clean space and make available space lower than %90.
My exact health report was:
1/1 local-dirs usable space is below configured utilization percentage/no more usable space [ /tmp/hadoop-train/nm-local-dir : used space above threshold of 90.0% ] ;
1/1 log-dirs usable space is below configured utilization percentage/no more usable space [ /opt/manual/hadoop/logs/userlogs : used space above threshold of 90.0% ]

always Hive Job running in-process local Hadoop

When I set this property in hive-site.xml
<property>
<name>hive.exec.mode.local.auto</name>
<value>false</value>
</property>
Hive always runs the hadoop job locally.
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 55
Job running in-process (local Hadoop)
Why does this happen?
As mentioned in HIVE-2585,Going forward Hive will assume that the metastore is operating in local mode if the configuration property hive.metastore.uris is unset, and will assume remote mode otherwise.
Ensure following property is set in Hive-site.xml:
<property>
<name>hive.metastore.uris</name>
<value><URIs of metastore server>:9083</value>
</property>
<property>
<name> hive.metastore.local</name>
<value>false</value>
</property>
The hive.metastore.local property is no longer supported as of Hive 0.10; setting hive.metastore.uris is sufficient to indicate that you are using a remote metastore.
EDIT:
Starting with release 0.7, Hive also supports a mode to run map-reduce jobs in local-mode automatically. The relevant options are hive.exec.mode.local.auto, hive.exec.mode.local.auto.inputbytes.max, and hive.exec.mode.local.auto.tasks.max:
hive> SET hive.exec.mode.local.auto=false;
Note that this feature is disabled by default. If enabled, Hive analyzes the size of each map-reduce job in a query and may run it locally if the following thresholds are satisfied:
1. The total input size of the job is lower than: hive.exec.mode.local.auto.inputbytes.max (128MB by default)
2. The total number of map-tasks is less than: hive.exec.mode.local.auto.tasks.max (4 by default)
3. The total number of reduce tasks required is 1 or 0.
So for queries over small data sets, or for queries with multiple map-reduce jobs where the input to subsequent jobs is substantially smaller (because of reduction/filtering in the prior job), jobs may be run locally.
Reference: Hive Getting started

How to optimize and tune hadoop cluster performance

I am not very familiar with hadoop cluster configs and I have recently integrated Apache Nutch with Apache Hadoop and I have crawled data indexed in Solr successfully.
I have my master-slave sources as below:
Master:
CPU : 4 cores
memory :12G
hard disk : 37G
Slave1 :
CPU : 2 cores
memory :4G
hard disk : 18G
Slave2:
CPU : 2 cores
memory :4G
hard disk : 16G
Slave3 :
CPU : 2 cores
memory :4G
hard disk : 16G
Slave4 :
CPU : 4 cores
memory :4G
hard disk : 50G
I have configed core-site.xml, mapred-site.xml, hdfs-site.xml, masters and slaves.
Here is my core-site.xml :
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/My Project Name/hadoop-datastore</value>
<description>store data</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://master:54310</value>
<description>the name of default file system</description>
</property>
</configuration>
Here is my mapred-site.xml :
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>master:54311</value>
<description>host and port</description>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>10</value>
<description></description>
</property>
<property>
<name>mapred.map.tasks</name>
<value>20</value>
<description></description>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>8</value>
<description></description>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>8</value>
<description></description>
</property>
</configuration>
And here is my hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
<description>default block</description>
</property>
</configuration>
And here is my conf/masters :
master
And finally my conf/slaves:
master
slave1
slave2
slave3
slave4
This story goes well: When I run master and run the Jps command, I have the folowings on master:
19031 TaskTracker
18644 DataNode
18764 SecondaryNameNode
18884 JobTracker
13226 Jps
18506 NameNode
And when I run the Jps command on all the slaves, I have the followings:
4969 DataNode
5057 TaskTracker
5592 Jps
When I look at Master Hadoop Map/Reduce administration I have the following Cluster Summary:
<h2>Cluster Summary (Heap Size is 114.5 MB/889 MB)</h2>
<table border="1" cellpadding="5" cellspacing="0">
<tr><th>Running Map Tasks</th><th>Running Reduce Tasks</th><th>Total Submissions</th><th>Nodes</th><th>Occupied Map Slots</th><th>Occupied Reduce Slots</th><th>Reserved Map Slots</th><th>Reserved Reduce Slots</th><th>Map Task Capacity</th><th>Reduce Task Capacity</th><th>Avg. Tasks/Node</th><th>Blacklisted Nodes</th><th>Graylisted Nodes</th><th>Excluded Nodes</th></tr>
<tr><td>8</td><td>8</td><td>1607</td><td>1</td><td>8</td><td>8</td><td>0</td><td>0</td><td>8</td><td>8</td><td>16.00</td><td>0</td><td>0</td><td>0</td></tr></table>
<br>
The problem is this procedure works fine with topN :1000 but There is load on master with high cpu and memory usage but when I find top on slaves, Neither cpu nor memory has loads. I mean both cpu and memory usage is low and cpu idle is high.
I wonder whether it is natural and OK or not. I am looking for some solutions and configs so that I am able to share the load on all slaves and make the procedure faster.
Any links, documentations and solutions are very much appreciated.
Your master node is running a lot of services :
TaskTracker DataNode SecondaryNameNode JobTracker NameNode
Typically in a decent sized cluster the Master would not have the datanode service.
Name Node & secondary Name node should be on different nodes. You can set secondary name node on one of your data nodes.
Similarly Task Tracker - Master typically does not have task Tracker. I.e. you do not run MR tasks on Master.
On the other hand for pure experimentation the setup you have done is ok & the CPU usage you are noticing is obvious.
I found an error about version 1.2.1 looking deeply at logs directory, saying this version is a 1.2.1 snapshot version. So I changed the server, installing simply version 1.2.1 and making all slaves and master similar in version. That fixed my problem. Now happily I have five nodes equal to the count of my machines.
And I really thank ... for his greate help

Discrepancy in task run per map with the tasks configured in mapred-site.xml

I stopped all the agents running in my pseudo distributed mode by giving the following command.
stop-all.sh
Then I changed the configuration file of "mapred-site.xml" to 1 Map Task
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>1</value>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>1</value>
</property>
</configuration>
you see I have set 1 MapTask snd 1 ReduceTask to run.
Then I started back all the agents
start-all.sh
and run the map-reduce program but still I see 2 tasks instead of 1 as configured in mapred-site.xml.
The screen shot of the tasks are shown below,
Why is such discrepancy occurring, Please guide me through
thanks
Okay so this property mapred.tasktracker.map.tasks.maximum tells the maximum number of tasks (mapper tasks) which can be run by a task tracker at a time. Basically you are restricting each node running task tracker to run one mapper at a time.
If you have 10 nodes then you should be able to run 10 mappers in parallel.
However if your job requires 2 mappers (which is totally based on size of input data & block size unless you extend inputformat) and you have only one node then the map tasks would be executed sequentially on that node.
Hope this is clearer now.

How to get datanode timeout?

I have a 3 node hadoop setup, with replication factor as 2.
When one of my datanode dies, namenode waits for 10 mins before removing it from live nodes. Till then my hdfs writes fail saying bad ack from node.
Is there a way to set a smaller timeout( like 1 min) so that the node where datanode dies is discarded immediately ?
Setting up the following in your hdfs-site.xml will give you 1-minute timeout.
<property>
<name>heartbeat.recheck.interval</name>
<value>15</value>
<description>Determines datanode heartbeat interval in seconds</description>
</property>
If above doesn't work - try the following (seems to be version-dependent):
<property>
<name>dfs.heartbeat.recheck.interval</name>
<value>15</value>
<description>Determines datanode heartbeat interval in seconds.</description>
</property>
Timeout equals to 2 * heartbeat.recheck.interval + 10 * heartbeat.interval. Default for heartbeat.interval is 3 seconds.
In the version of Hadoop that we use, dfs.heartbeat.recheck.interval should be specified in milliseconds (check the code/doc of your version of Hadoop, to validate that).
I've managed to make this work. I'm using Hadoop version 0.2.2.
Here's what I added to my hdfs-site.xml:
<property>
<name>dfs.heartbeat.interval</name>
<value>2</value>
<description>Determines datanode heartbeat interval in seconds.</description>
</property>
<property>
<name>dfs.heartbeat.recheck.interval</name>
<value>1</value>
<description>Determines when machines are marked dead</description>
</property>
This parameters can differ for other versions of Hadoop. Here's how to check that you're using the right parameters: Once you set them, start your master, and check the configuration at :
http://your_master_machine:19888/conf
If you don't find "dfs.heartbeat.interval" and/or "dfs.heartbeat.recheck.interval" in there, that means you should try using their version without the "dfs." prefix:
"heartbeat.interval" and "heartbeat.recheck.interval"
Finally, to check that the dead datanode is no longer used after the desired amount of time, kill a datanode, then check repeatedly the console at:
http://your_master_machine:50070
For me, with the configuration shown here, I can see that a dead datanode is removed after about 20 seconds.

Resources