MapReduce jobs get stuck in Accepted state - hadoop

I have my own MapReduce code that I'm trying to run, but it just stays in the ACCEPTED state. I tried running another sample MR job that I had run previously and which was successful, but now both jobs stay in the ACCEPTED state. I tried changing various properties in mapred-site.xml and yarn-site.xml as mentioned here and here, but that didn't help either. Can someone please point out what could possibly be going wrong? I'm using hadoop-2.2.0.
I've tried many values for the various properties; here is one set of values:
In mapred-site.xml
<property>
<name>mapreduce.job.tracker</name>
<value>localhost:54311</value>
</property>
<property>
<name>mapreduce.job.tracker.reserved.physicalmemory.mb</name>
<value></value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>256</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>256</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>400</value>
</property>
In yarn-site.xml
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>400</value>
</property>
<property>
<name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
<value>.3</value>
</property>

I've had the same effect and found that giving each worker node more available memory and reducing the memory required per application helped.
The settings I have (on my very small experimental boxes) in my yarn-site.xml:
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>2200</value>
<description>Amount of physical memory, in MB, that can be allocated for containers.</description>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>500</value>
</property>
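To shrink the per-application side to match, the same mapred-site.xml properties that appear in the question can be lowered so that the map, reduce and ApplicationMaster containers all fit within yarn.nodemanager.resource.memory-mb; the numbers below are only illustrative values, not a recommendation:
<property>
<name>mapreduce.map.memory.mb</name>
<value>512</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>512</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<!-- the ApplicationMaster container also has to fit on a node -->
<value>1024</value>
</property>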

Had the same issue, and for me the cause was a full hard drive (>90% full). Freeing up space fixed it.

A job stuck in the ACCEPTED state on YARN is usually because there are not enough free resources. You can check this at http://resourcemanager:port/cluster/scheduler:
if Memory Used + Memory Reserved >= Memory Total, memory is not enough
if VCores Used + VCores Reserved >= VCores Total, VCores are not enough
It may also be limited by parameters such as maxAMShare.
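For example, if the cluster runs the Fair Scheduler, the ApplicationMaster share is controlled from the allocation file (fair-scheduler.xml); the snippet below is only a sketch with an assumed queue name, and with the Capacity Scheduler the analogous knob is yarn.scheduler.capacity.maximum-am-resource-percent, already shown in the question:
<allocations>
<!-- default fraction of each queue's fair share that ApplicationMasters may use (0.5 if unset) -->
<queueMaxAMShareDefault>0.8</queueMaxAMShareDefault>
<queue name="default">
<!-- per-queue override of the AM share limit -->
<maxAMShare>0.8</maxAMShare>
</queue>
</allocations>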

I am using Hadoop 3.0.1. I faced the same issue, where submitted MapReduce jobs were shown as stuck in the ACCEPTED state in the ResourceManager web UI. Also, in the same ResourceManager web UI, under Cluster Metrics, Memory Used was 0 and Total Memory was 0; under Cluster Node Metrics, Active Nodes was 0, although the NameNode web UI listed the data nodes perfectly. Running yarn node -list on the cluster did not display any NodeManagers. It turned out that my NodeManagers were not running. After starting the NodeManagers, the newly submitted MapReduce jobs could proceed further; they were no longer stuck in the ACCEPTED state and got to the RUNNING state.

I faced the same issue. I changed every configuration mentioned in the answers above, but it was still no use. After this, I re-checked the health of my cluster and observed that my one and only node was in an unhealthy state. The issue was a lack of disk space in my /tmp/hadoop-hadoopUser/nm-local-dir directory. The same can be checked by looking at the node health status in the ResourceManager web UI (default port 8088). To resolve this, I added the property below to yarn-site.xml.
<property>
<name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
<value>98.5</value>
</property>
After restarting my Hadoop daemons, the node status changed to healthy and jobs started to run.

Setting the property yarn.resourcemanager.hostname to the master node's hostname in yarn-site.xml, and copying this file to all the nodes in the cluster so they pick up the configuration, solved the issue for me.
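For reference, a sketch of that yarn-site.xml entry; the value here is just a placeholder for your actual master node's hostname:
<property>
<name>yarn.resourcemanager.hostname</name>
<!-- replace with the hostname of the node running the ResourceManager -->
<value>master-node-hostname</value>
</property>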

Related

Unable to allocate more than one CPU to my Map/Reduce Job in Hortonworks

I am running HDP 2.4.2 on a 5-node cluster.
Whenever I launch a job on the cluster, it only takes one CPU instead of the configured number of CPUs.
I have configured 4 CPUs, but my jobs only take one CPU.
I have five 24-core, 128 GB Ubuntu boxes in my cluster.
Please let me know if this is a limitation of HDP, because it was working fine with Cloudera.
EDIT
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>15</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>15</value>
</property>
Below is the solution to this problem.
CPU scheduling must be enabled on the cluster; by default it is disabled. After enabling CPU scheduling on the cluster, it started providing the requested CPUs to my jobs.
yarn.scheduler.capacity.resource-calculator is the property name. Search for it in the YARN configuration and enable CPU scheduling there. The default value for this property is DefaultResourceCalculator, which gets overridden by DominantResourceCalculator once CPU scheduling is enabled.
Ref link: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_yarn-resource-management/content/enabling_cpu_scheduling.html
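As a sketch, this is what the change looks like in capacity-scheduler.xml (whether you edit the file directly or go through Ambari depends on your setup):
<property>
<name>yarn.scheduler.capacity.resource-calculator</name>
<!-- DefaultResourceCalculator only considers memory; DominantResourceCalculator also accounts for vcores, enabling CPU scheduling -->
<value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>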

HBase's RegionServer crashes

I'm trying to create about 589 tables and make random insertions. I process table by table: I create one table, make all of its insertions, then create another one, until all of the data gets ingested.
The architecture of this solution is:
A Python client located on one machine, which ingests data into HBase.
A Cloudera server hosting HBase configured in stand-alone mode, which is a VM located on the same machine as the client and identified by its IP address. The characteristics of this server are as follows: 64 GB of storage, 4 GB of RAM and 1 CPU.
The client communicates with an HBase Thrift Server.
The problem is that when I try to ingest all of that data, the client is only able to create and insert about 300 MB before the RegionServer shuts down (about 45 tables created and their rows inserted, then the server crashes during the 46th table's data ingestion). I have tested all of this with different machine characteristics, and the size of the ingested data varies from one machine to another (if the machine has more memory, more data gets inserted; I have tested this with different VM hardware characteristics). I suspect it comes from the management of the Java heap memory, so I have tried different configurations, but it didn't make things better. Here is my main HBase configuration:
hbase-site.xml
<property>
<name>hbase.rest.port</name>
<value>8070</value>
<description>The port for the HBase REST server.</description>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.rootdir</name>
<value>hdfs://quickstart.cloudera:8020/hbase</value>
</property>
<property>
<name>hbase.regionserver.ipc.address</name>
<value>0.0.0.0</value>
</property>
<property>
<name>hbase.master.ipc.address</name>
<value>0.0.0.0</value>
</property>
<property>
<name>hbase.thrift.info.bindAddress</name>
<value>0.0.0.0</value>
</property>
<property>
<name>hbase.hregion.max.filesize</name>
<value>10737418240</value> <!-- 10 GB -->
</property>
<property>
<name>hbase.hregion.memstore.flush.size</name>
<value>33554432</value> <!-- 32 MB -->
</property>
<property>
<name>hbase.client.write.buffer</name>
<value>8388608</value>
</property>
<property>
<name>hbase.client.scanner.caching</name>
<value>10000</value>
</property>
<property>
<name>hbase.regionserver.handler.count</name>
<value>64</value>
</property>
hbase-env.sh
# The maximum amount of heap to use. Default is left to JVM default.
export HBASE_HEAPSIZE=4G
# Uncomment below if you intend to use off heap cache. For example, to allocate 8G of
# offheap, set the value to "8G".
# export HBASE_OFFHEAPSIZE=1G
# Extra Java runtime options.
# Below are what we set by default. May only work with SUN JVM.
# For more on why as well as other possible settings,
# see http://wiki.apache.org/hadoop/PerformanceTuning
export HBASE_OPTS="-XX:+UseConcMarkSweepGC"
# Configure PermSize. Only needed in JDK7. You can safely remove it for JDK8+
export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS -XX:PermSize=128m -XX:MaxPermSize=128m"
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:PermSize=4g -XX:MaxPermSize=4g"
Here is the error that I get from the Master Server's log:
util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC):
pause of approximately 1016ms
No GCs detected
and nothing appears in the RegionServer's log.
On the other hand, when I try to create only one table and insert a greater amount of data, it works!
Any brilliant idea about how to fix this, please?
Thanks in advance.
Your VM's memory is way too low. Try bumping it up to at least 12GB. You're forgetting that a Java process's heap is only one part of the memory footprint. By setting HBASE_HEAPSIZE=4G you're saying you want HBase to allocate all your VM's memory. The VM also needs to run Linux daemons and your Cloudera services besides HBase.

Hadoop: Running beyond virtual memory limits, showing huge numbers

I am running a MapReduce Pipes program, and I have set the memory limits to be as follows:
in yarn-site.xml:
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>3072</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>256</value>
</property>
In mapred-site.xml:
<property>
<name>mapreduce.map.memory.mb</name>
<value>512</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>512</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx384m</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx384m</value>
</property>
I am running currently on a single node in pseudo-distributed mode. I am getting the following error before having the container killed:
2015-04-11 12:47:49,594 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1428741438743_0001_m_000000_0: Container [pid=8140,containerID=container_1428741438743_0001_01_000002] is running beyond virtual memory limits. Current usage: 304.1 MB of 1 GB physical memory used; 1.0 TB of 2.1 GB virtual memory used. Killing container.
The main thing that concerns me is the 1.0 TB of virtual memory used; the application I am running is nowhere near consuming that amount of memory, it is not even close to consuming 1 GB of memory.
Does that mean that there is a memory leak in my code, or could my memory configurations just be wrong?
Thank you.
Regards,
I found out what the problem was: in part of my code, each of the mappers had to access a local LMDB database. When an LMDB database is opened, it reserves 1 TB of virtual memory, which caused Hadoop to think I was using that much memory when in fact I wasn't.
I solved the issue by setting yarn.nodemanager.vmem-check-enabled to false in yarn-site.xml, which prevents Hadoop from checking the virtual memory limits. Note that you shouldn't do that unless you're sure about your situation, because Hadoop uses this check to protect you from memory leaks and similar issues. I only used it because I was sure it wasn't a memory leak.
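For reference, a minimal yarn-site.xml sketch of that setting (note that the related yarn.nodemanager.vmem-pmem-ratio knob, 2.1 by default, would not have helped here, since no reasonable ratio covers a 1 TB reservation):
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<!-- stop the NodeManager from killing containers based on virtual memory usage -->
<value>false</value>
</property>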

Hadoop HA Namenode remote access

I'm configuring the Hadoop 2.2.0 stable release with an HA NameNode, but I don't know how to configure remote access to the cluster.
I have the HA NameNode configured with manual failover and I defined dfs.nameservices, and I can access HDFS via the nameservice from all the nodes included in the cluster, but not from outside.
I can perform operations on HDFS by contacting the active NameNode directly, but I don't want that; I want to contact the cluster and then be redirected to the active NameNode. I think this is the normal configuration for an HA cluster.
Does anyone know how to do that?
(thanks in advance...)
You have to add more values to hdfs-site.xml:
<property>
<name>dfs.ha.namenodes.myns</name>
<value>machine-98,machine-99</value>
</property>
<property>
<name>dfs.namenode.rpc-address.myns.machine-98</name>
<value>machine-98:8100</value>
</property>
<property>
<name>dfs.namenode.rpc-address.myns.machine-99</name>
<value>machine-145:8100</value>
</property>
<property>
<name>dfs.namenode.http-address.myns.machine-98</name>
<value>machine-98:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.myns.machine-99</name>
<value>machine-145:50070</value>
</property>
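For a remote client to address the cluster by its logical nameservice rather than a specific NameNode, the client-side configuration typically also needs entries along these lines (a sketch; "myns" is the nameservice name used above):
<property>
<name>dfs.nameservices</name>
<value>myns</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.myns</name>
<!-- lets the client work out which NameNode is currently active and fail over between them -->
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
and in the client's core-site.xml:
<property>
<name>fs.defaultFS</name>
<value>hdfs://myns</value>
</property>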
You need to contact one of the NameNodes (as you're currently doing) - there is no separate cluster endpoint to contact.
The Hadoop client code knows the addresses of the two NameNodes (from the client-side configuration) and can identify which is the active and which is the standby. There might be a way to interrogate a ZooKeeper node in the quorum to identify the active / standby (maybe, I'm not sure), but you might as well just check one of the NameNodes - you have a 50/50 chance it's the active one.
I'd have to check, but you might be able to query either if you're just reading from HDFS.
For the active NameNode you can always ask ZooKeeper.
You can get the active NameNode from the ZooKeeper path below.
/hadoop-ha/namenodelogicalname/ActiveStandbyElectorLock
There are two ways to resolve this situation (with Java code):
use core-site.xml and hdfs-site.xml in your code
load the configuration via conf.addResource
use conf.set in your code
set the Hadoop configuration properties via conf.set
For example, with conf.set you would set the same HA-related properties shown above on your Configuration object before creating the FileSystem.

Simulating Map-reduce using Cloudera

I want to use Cloudera to simulate a Hadoop job on a single machine (of course with many VMs). I have 2 questions:
1) Can I change the replication policy of HDFS in Cloudera?
2) Can I see the CPU usage of each VM?
You can use hadoop fs -setrep to change the replication factor on any file. You can also change the default replication factor by modifying hdfs-site.xml and adding the following:
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
You'll have to log into each box and use top to see the cpu usage of each VM. There is nothing out of the box in Hadoop that lets you see this.
I found out that I can change the data replication (replica placement) policy by modifying "ReplicationTargetChooser.java" in the Hadoop source.
