I am new to Hadoop. I tried to create a Hadoop cluster based on the example given on the Apache Hadoop site.
However, when I run the MapReduce example, the application gets stuck at map 100% and reduce 0%.
Please help.
I have set up the environment using Vagrant and VirtualBox and created two instances.
I am running the NameNode and a DataNode on one instance, and the ResourceManager and a NodeManager on the other.
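For diagnosis, this is roughly how I have been inspecting the stuck job (a sketch; the application ID below is a placeholder, and yarn logs only returns output once log aggregation is enabled):

yarn application -list
# pull the container logs for the stuck job (placeholder ID)
yarn logs -applicationId application_1234567890123_0001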
mapred-site.xml configuration
<configuration>
<!-- Map Reduce applications configuration -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>1536</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx1024M</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>3072</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx2560M</value>
</property>
<property>
<name>mapreduce.task.io.sort.mb</name>
<value>512</value>
</property>
<property>
<name>mapreduce.task.io.sort.factor</name>
<value>100</value>
</property>
<property>
<name>mapreduce.reduce.shuffle.parallelcopies</name>
<value>50</value>
</property>
<!-- Map Reduce Job History Server -->
<property>
<name>mapreduce.jobhistory.intermediate-done-dir</name>
<value>/mr-history/tmp</value>
</property>
<property>
<name>mapreduce.jobhistory.done-dir</name>
<value>/mr-history/done</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<!-- Resource Manager -->
<property>
<name>yarn.acl.enable</name>
<value>false</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>4096</value>
</property>
<!-- Node Manager -->
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/opt/hadoop-2.6.2/tempData</value>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>/opt/hadoop-2.6.2/logDir</value>
</property>
<property>
<name>yarn.nodemanager.log.retain-seconds</name>
<value>10800</value>
</property>
<property>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/logs</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- History Server -->
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>-1</value>
</property>
<property>
<name>yarn.log-aggregation.retain-check-interval-seconds</name>
<value>-1</value>
</property>
</configuration>
I was able to run the application now. As I suspected, it was a memory problem: the reduce container asks for 3072 MB (mapreduce.reduce.memory.mb), but each NodeManager only offered 2048 MB (yarn.nodemanager.resource.memory-mb), so the reducer could never be scheduled. I changed the following properties as shown below:
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>8192</value>
</property>
<!-- Node Manager -->
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>8192</value>
</property>
and repeated the process. It's working fine now.
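To confirm the new limits were picked up, I checked what each NodeManager advertises (a sketch; the node ID comes from the first command, and the ResourceManager web UI on port 8088 shows the same numbers):

yarn node -list
# Memory-Capacity in the status output should now read 8192 MB
yarn node -status <node-id>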
Related
I can see the logs after running MapReduce tasks on the ResourceManager UI page,
but they are gone after I reboot the Hadoop cluster.
Configs are below. Any help is much appreciated; this has been broken for a long time.
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop201:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop201:19888</value>
</property>
<property>
<name>mapreduce.jobhistory.done-dir</name>
<value>/opt/module/hadoop-3.1.3/logs/his_log/done</value>
</property>
<property>
<name>mapreduce.jobhistory.intermediate-done-dir</name>
<value>/opt/module/hadoop-3.1.3/logs/his_log</value>
</property>
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>/opt/module/hadoop-3.1.3/logs/mr-stage-his</value>
<description></description>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds</name>
<value>3600</value>
</property>
<property>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/opt/module/hadoop-3.1.3/logs/resource_manager_logs</value>
</property>
<property>
<name>yarn.log.server.url</name>
<value>http://hadoop201:19888/jobhistory/logs</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>5184000</value>
</property>
I have reviewed the configs in mapred-site.xml and yarn-site.xml, but it still doesn't work.
I expect the logs to remain visible after a cluster reboot.
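One guess on my part (an assumption, not something confirmed from the configs above): aggregated logs are served by the JobHistory server, which start-yarn.sh does not launch, so it has to be started by hand after every reboot. On Hadoop 3.x:

# start the JobHistory server after the reboot
mapred --daemon start historyserver
jps   # should now list JobHistoryServer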
I configured spark engine in hive-site.xml using:
<property>
<name>hive.execution.engine</name>
<value>spark</value>
</property>
<property>
<name>spark.master</name>
<value>yarn-cluster</value>
</property>
<property>
<name>spark.dynamicAllocation.enabled</name>
<value>true</value>
</property>
<property>
<name>spark.executor.cores</name>
<value>4</value>
</property>
<property>
<name>spark.dynamicAllocation.initialExecutors</name>
<value>1</value>
</property>
<property>
<name>spark.dynamicAllocation.minExecutors</name>
<value>1</value>
</property>
<property>
<name>spark.dynamicAllocation.maxExecutors</name>
<value>8</value>
</property>
<property>
<name>spark.shuffle.service.enabled</name>
<value>true</value>
</property>
<property>
<name>spark.executor.memory</name>
<value>3g</value>
</property>
<property>
<name>spark.driver.memory</name>
<value>3g</value>
</property>
<property>
<name>spark.serializer</name>
<value>org.apache.spark.serializer.KryoSerializer</value>
</property>
<property>
<name>spark.io.compression.codec</name>
<value>lzf</value>
</property>
<property>
<name>spark.yarn.jar</name>
<value>hdfs://VCluster1/user/spark/share/lib/spark-assembly-1.3.1-hadoop2.7.1.jar</value>
</property>
<property>
<name>spark.kryo.referenceTracking</name>
<value>false</value>
</property>
<property>
<name>spark.kryo.classesToRegister</name>
<value>org.apache.hadoop.hive.ql.io.HiveKey,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch</value>
</property>
In yarn-site.xml:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
<value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
When I run a Hive on Spark job, dynamic allocation is not working: Spark assigns spark.executor.instances whatever number I set for spark.dynamicAllocation.initialExecutors, and it never changes. Can anyone help me figure out the problem?
Thanks
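An addendum, in case it is relevant (this is my own assumption, and the paths below are illustrative): the spark_shuffle aux-service can only load if the Spark YARN shuffle jar is on every NodeManager's classpath, and the NodeManagers must be restarted afterwards.

# copy the shuffle service jar somewhere the NodeManager classpath picks it up (example paths)
cp $SPARK_HOME/lib/spark-1.3.1-yarn-shuffle.jar $HADOOP_HOME/share/hadoop/yarn/lib/
# restart each NodeManager so YarnShuffleService is loaded
yarn-daemon.sh stop nodemanager && yarn-daemon.sh start nodemanager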
I'm preparing a Hadoop cluster on four very small virtual servers (2 GB RAM, 2 cores each) for a proof of concept.
One server acts as NameNode and ResourceManager; the other three are DataNodes.
Every time I run the test job (a 3.4 GB data file), two of the DataNodes (random ones) work at maximum capacity while one of them sits idle (monitored via htop).
All three DataNodes are visible in the Hadoop GUI.
What am I missing?
Any help will be much appreciated.
yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>hadoop-master:8031</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>hadoop-master:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>hadoop-master:8030</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>hadoop-master:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>hadoop-master:8088</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.data.dir</name>
<value>/opt/hadoop/hadoop/dfs/data</value>
<final>true</final>
</property>
<property>
<name>dfs.name.dir</name>
<value>/opt/hadoop/hadoop/dfs/name</value>
<final>true</final>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.blocksize</name>
<value>67108864</value>
</property>
</configuration>
I found the solution.
To increase the number of reducers, I added the following to mapred-site.xml:
<property>
<name>mapreduce.job.reduces</name>
<value>5</value>
</property>
After I added more nodes to the cluster, Hadoop increased the number of mappers without any further configuration changes. All DataNodes are now working at maximum capacity.
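For reference, the same setting can also be passed per job instead of cluster-wide (a sketch; the jar name and input/output paths are placeholders, and the -D generic option works because the bundled examples use ToolRunner):

yarn jar hadoop-mapreduce-examples.jar wordcount -D mapreduce.job.reduces=5 /input /output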
I have just configured a Hadoop cluster using CDH5. I can successfully run test jobs on the command line and get results, but the ResourceManager UI does not show the job status, even after completion. If I set mapreduce.framework.name to yarn in mapred-site.xml, the job fails and shows a failure status in the ResourceManager UI.
The test job I ran:
yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.5.0-cdh5.3.1.jar pi 16 10000
Here is my yarn-site.xml
<configuration>
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>cluster1</value>
</property>
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm1,rm2</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm1</name>
<value>rhel2</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm2</name>
<value>rhel3</value>
</property>
<property>
<name>yarn.resourcemanager.ha.automatic-failover.embedded</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
<property>
<name>yarn.resourcemanager.recovery.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.store.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
<name>yarn.resourcemanager.zk.state-store.address</name>
<value>localhost:2181</value>
</property>
<property>
<name>yarn.resourcemanager.zk-address</name>
<value>rhel2.had.com:2181,rhel3.had.com:2181,rhel4.had.com:2181</value>
</property>
<property>
<name>yarn.app.mapreduce.am.scheduler.connection.wait.interval-ms</name>
<value>5000</value>
</property>
<property>
<name>yarn.web-proxy.address</name>
<value>rhel2:9046</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<!-- Node Config -->
<property>
<description>Address where the localizer IPC is.</description>
<name>yarn.nodemanager.localizer.address</name>
<value>0.0.0.0:23344</value>
</property>
<property>
<description>NM Webapp address.</description>
<name>yarn.nodemanager.webapp.address</name>
<value>0.0.0.0:23999</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/tmp/pseudo-dist/yarn/local</value>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>/tmp/pseudo-dist/yarn/log</value>
</property>
<property>
<name>mapreduce.shuffle.port</name>
<value>23080</value>
</property>
</configuration>
I didn't set any parameters in mapred-site.xml; the file is empty.
Please let me know what changes need to be made to mapred-site.xml or yarn-site.xml to get the web UI working.
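For context (my understanding, not CDH-specific advice): with an empty mapred-site.xml, jobs fall back to the local job runner, which never registers with the ResourceManager, so nothing shows up in its UI. A minimal mapred-site.xml that targets YARN would look like the sketch below; the hostname is a placeholder for whichever node runs the JobHistory server, and 10020/19888 are the standard history-server ports:

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>rhel2:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>rhel2:19888</value>
</property>
</configuration>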
I tried to set up two queues, queue1 and queue2.
I added the names of these queues to mapred-site.xml:
<property>
<name>mapred.queue.names</name>
<value>queue1,queue2</value>
</property>
I configured CapacityScheduler.xml as shown below.
<?xml version="1.0"?>
<configuration>
<property>
<name>mapred.capacity-scheduler.maximum-system-jobs</name>
<value>3000</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue1.capacity</name>
<value>100</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue2.capacity</name>
<value>100</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue1.maximum-capacity</name>
<value>-1</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue2.maximum-capacity</name>
<value>-1</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue1.supports-priority</name>
<value>false</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue2.supports-priority</name>
<value>false</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue1.minimum-user-limit-percent</name>
<value>100</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue2.minimum-user-limit-percent</name>
<value>100</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue1.user-limit-factor</name>
<value>1</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue2.user-limit-factor</name>
<value>1</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue1.maximum-initialized-active-tasks</name>
<value>200000</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue2.maximum-initialized-active-tasks</name>
<value>200000</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue1.maximum-initialized-active-tasks-per-user</name>
<value>100000</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue2.maximum-initialized-active-tasks-per-user</name>
<value>100000</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue1.init-accept-jobs-factor</name>
<value>10</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue2.init-accept-jobs-factor</name>
<value>10</value>
</property>
<property>
<name>mapred.capacity-scheduler.default-supports-priority</name>
<value>false</value>
</property>
<property>
<name>mapred.capacity-scheduler.default-minimum-user-limit-percent</name>
<value>100</value>
</property>
<property>
<name>mapred.capacity-scheduler.default-user-limit-factor</name>
<value>1</value>
</property>
<property>
<name>mapred.capacity-scheduler.default-maximum-active-tasks-per-queue</name>
<value>200000</value>
</property>
<property>
<name>mapred.capacity-scheduler.default-maximum-active-tasks-per-user</name>
<value>100000</value>
</property>
<property>
<name>mapred.capacity-scheduler.default-init-accept-jobs-factor</name>
<value>10</value>
</property>
<!-- Capacity scheduler Job Initialization configuration parameters -->
<property>
<name>mapred.capacity-scheduler.init-poll-interval</name>
<value>5000</value>
</property>
<property>
<name>mapred.capacity-scheduler.init-worker-threads</name>
<value>5</value>
</property>
</configuration>
Running bin/start-all.sh starts the following services:
17083 DataNode
17557 TaskTracker
17373 JobTracker
16902 NameNode
17279 SecondaryNameNode
17703 Jps
I'm able to view the JobTracker web UI at
http://localhost:50030/
The TaskTracker web UI at
http://localhost:50060/
shows "Unable to connect". After a few seconds, the JobTracker and TaskTracker shut down, and jps on the terminal only shows:
17083 DataNode
16902 NameNode
17279 SecondaryNameNode
17703 Jps
What might be the solution?
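Before anything else, check the daemon logs to see why the JobTracker and TaskTracker die (a sketch; the actual file names include your user name and host):

tail -n 50 $HADOOP_HOME/logs/hadoop-*-jobtracker-*.log
tail -n 50 $HADOOP_HOME/logs/hadoop-*-tasktracker-*.log

In this case, though, the queue configuration itself is the likely culprit: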
Both of your queues have a capacity of 100, which makes the CapacityScheduler think there are two queues that each claim 100% of the cluster's capacity. I suggest you change the settings to:
<?xml version="1.0"?>
<configuration>
<property>
<name>mapred.capacity-scheduler.maximum-system-jobs</name>
<value>3000</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue1.capacity</name>
<value>80</value> <!-- change here -->
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue2.capacity</name>
<value>20</value> <!-- change here -->
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue1.maximum-capacity</name>
<value>-1</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue2.maximum-capacity</name>
<value>-1</value>
</property>
The sum of all your queue capacities must always be exactly 100 (i.e., 100%). You can have two queues at 100 and 0 percent respectively; that is valid.
Also, I think it's good practice to always have a "default" queue with at least some allocation. I don't know what the scheduler will do when you submit a job without a queue name and there is no default.
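For completeness, this is how a job is pointed at one of the named queues on this Hadoop 1.x setup (a sketch; the jar name and input/output paths are placeholders):

hadoop jar hadoop-examples.jar wordcount -Dmapred.job.queue.name=queue1 /input /output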