Hadoop Wordcount example failing due to AM container - windows

I've been trying to run the hadoop wordcount example for a while now, however I am facing some issues. I have hadoop 2.7.1 and running it on Windows. Below are the error details:
command:
yarn jar C:\hadoop-2.7.1\share\hadoop\mapreduce\hadoop-mapreduce-examples-2.7.1.jar wordcount input output
Output:
INFO input.FileInputFormat: Total input paths to process : 1
INFO mapreduce.JobSubmitter: number of splits:1
INFO mapreduce.JobSubmitter: Submitting tokens for job: job_14
90853163147_0009
INFO impl.YarnClientImpl: Submitted application application_14
90853163147_0009
INFO mapreduce.Job: The url to track the job: http://*****
*****/proxy/application_1490853163147_0009/
INFO mapreduce.Job: Running job: job_1490853163147_0009
INFO mapreduce.Job: Job job_1490853163147_0009 running in uber
mode : false
INFO mapreduce.Job: map 0% reduce 0%
INFO mapreduce.Job: Job job_1490853163147_0009 failed with sta
te FAILED due to: Application application_1490853163147_0009 failed 2 times due
to AM Container for appattempt_1490853163147_0009_000002 exited with exitCode:
1639
For more detailed output, check application tracking page:http://********
:****/cluster/app/application_1490853163147_0009Then, click on links to logs of
each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1490853163147_0009_02_000001
Exit code: 1639
Exception message: Incorrect command line arguments.
Stack trace: ExitCodeException exitCode=1639: Incorrect command line arguments.
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:
722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.la
unchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.C
ontainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.C
ontainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
.java:617)
at java.lang.Thread.run(Thread.java:745)
Shell output: Usage: task create [TASKNAME] [COMMAND_LINE] |
task isAlive [TASKNAME] |
task kill [TASKNAME]
task processList [TASKNAME]
Creates a new task jobobject with taskname
Checks if task jobobject is alive
Kills task jobobject
Prints to stdout a list of processes in the task
along with their resource usage. One process per line
and comma separated info per process
ProcessId,VirtualMemoryCommitted(bytes),
WorkingSetSize(bytes),CpuTime(Millisec,Kernel+User)
Container exited with a non-zero exit code 1639
Failing this attempt. Failing the application.
INFO mapreduce.Job: Counters: 0
Yarn-site.xml:
<configuration>
<property>
<name>yarn.application.classpath</name>
<value>
C:\hadoop-2.7.1\etc\hadoop,
C:\hadoop-2.7.1\share\hadoop\common\*,
C:\hadoop-2.7.1\share\hadoop\common\lib\*,
C:\hadoop-2.7.1\share\hadoop\hdfs\*,
C:\hadoop-2.7.1\share\hadoop\hdfs\lib\*,
C:\hadoop-2.7.1\share\hadoop\mapreduce\*,
C:\hadoop-2.7.1\share\hadoop\mapreduce\lib\*,
C:\hadoop-2.7.1\share\hadoop\yarn\*,
C:\hadoop-2.7.1\share\hadoop\yarn\lib\*
</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
<value>98.5</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>2200</value>
<description>Amount of physical memory, in MB, that can be allocated for containers.</description>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>500</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<description>Where to aggregate logs to.</description>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/tmp/logs</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>259200</value>
</property>
<property>
<name>yarn.log-aggregation.retain-check-interval-seconds</name>
<value>3600</value>
</property>
</configuration>
mapred.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Any idea on what is going wrong?

exitCode: 1639 Looks like your are running hadoop on Windows .
https://github.com/OctopusDeploy/Issues/issues/1346

I faced exactly same problem. I was following a guide on how to install Hadoop 2.6.0 (http://www.ics.uci.edu/~shantas/Install_Hadoop-2.6.0_on_Windows10.pdf) while actually installing Hadoop 2.8.0.
As soon as I was done I ran
hadoop jar D:\hadoop-2.8.0\share\hadoop\mapreduce\hadoop-mapreduce-examples-2.8.0.jar wordcount /foo/bar/LICENSE.txt /out1
And got (from yarn nodemanager's logs):
17/06/19 13:15:30 INFO monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1497902417767_0004_01_000001
17/06/19 13:15:30 INFO nodemanager.DefaultContainerExecutor: launchContainer: [D:\hadoop-2.8.0\bin\winutils.exe, task, create, -m, -1, -c, -1, container_1497902417767_0004_01_000001, cmd /c D:/hadoop/temp/nm-localdir/usercache/******/appcache/application_1497902417767_0004/container_1497902417767_0004_01_000001/default_container_executor.cmd]
17/06/19 13:15:30 WARN nodemanager.DefaultContainerExecutor: Exit code from container container_1497902417767_0004_01_000001 is : 1639
17/06/19 13:15:30 WARN nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1497902417767_0004_01_000001 and exit code: 1639
ExitCodeException exitCode=1639: Incorrect command line arguments.
TaskExit: error (1639): Invalid command line argument. Consult the Windows Installer SDK for detailed command line help.
Another symptom was (from yarn nodemanager's logs):
17/06/19 13:25:49 WARN util.SysInfoWindows: Expected split length of sysInfo to be 11. Got 7
The solution was to get compatible (with Hadoop 2.8.0) binaries: https://github.com/steveloughran/winutils/tree/master/hadoop-2.8.0-RC3/bin
Once I got a correct winutils.exe, my problem went away.

Related

InvalidAuxServiceException in MapReduce Job

I am getting the following exception while running a map-reduce job on recently created open source hadoop cluster. I am running the latest hadoop version 3.3.0.
2020-09-03 00:58:30,068 INFO mapreduce.Job: Task Id : attempt_1599094453872_0001_m_000000_2, Status : FAILED
Container launch failed for container_1599094453872_0001_01_000004 : org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:mapreduce_shuffle does not exist
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:83)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:57)
at java.lang.reflect.Constructor.newInstance(Constructor.java:437)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateExceptionImpl(SerializedExceptionPBImpl.java:171)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:182)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:163)
at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:394)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1160)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.lang.Thread.run(Thread.java:820)
As per some of the online suggestions, I have added the following two properties in yarn-site.xml and restarted both yarn and dfs. However, it is still throwing the same exception as above. Sometimes the job succeeds with the exception.
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

Sqoop Import - Job stuck while connecting to Resource Manager

I am trying to import data from sql server to Hive using Sqoop. When I execute the import command, sql server connection is successful but when it comes to connecting to resource manager, the job get stuck. Here is the log:
18/07/16 12:35:20 DEBUG mapreduce.JobBase: Adding to job classpath: file:/usr/lib/sqoop/lib/commons-jexl-2.1.1.jar
18/07/16 12:35:20 DEBUG mapreduce.JobBase: Adding to job classpath: file:/usr/lib/sqoop/lib/avro.jar
18/07/16 12:35:20 DEBUG mapreduce.JobBase: Adding to job classpath: file:/usr/lib/sqoop/lib/fastutil-6.3.jar
18/07/16 12:35:20 DEBUG mapreduce.JobBase: Adding to job classpath: file:/usr/lib/sqoop/lib/commons-codec-1.4.jar
18/07/16 12:35:21 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
Here is the sqoop import command:
sqoop import --connect "jdbc:sqlserver://ip;databaseName=TEST" --driver com.microsoft.sqlserver.jdbc.SQLServerDriver --username user1 --password pass1 --hive-import --create-hive-table --hive-table "customer_data_march" --table "customer_data_march_parsed" --split-by Account_Branch_Converted -m 1 --verbose
When I check the resource manager process allocation, it shows the following:
[root#quickstart /]# jps
3527
1609 ThriftServer
850 JobHistoryServer
1135 ResourceManager
2611 HRegionServer
786 Bootstrap
506 NameNode
4225 Jps
3499
950 NodeManager
382 JournalNode
1983 RunJar
3469 Bootstrap
3124 Bootstrap
1745 RunJar
2459 Bootstrap
2493 HistoryServer
248 QuorumPeerMain
1446 HMaster
634 SecondaryNameNode
297 DataNode
Here are configurations in yarn-site.xml
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.dispatcher.exit-on-error</name>
<value>true</value>
</property>
<property>
<description>List of directories to store localized files in.</description>
<name>yarn.nodemanager.local-dirs</name>
<value>/var/lib/hadoop-yarn/cache/${user.name}/nm-local-dir</value>
</property>
<property>
<description>Where to store container logs.</description>
<name>yarn.nodemanager.log-dirs</name>
<value>/var/log/hadoop-yarn/containers</value>
</property>
<property>
<description>Where to aggregate logs to.</description>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/var/log/hadoop-yarn/apps</value>
</property>
<property>
<description>Classpath for typical applications.</description>
<name>yarn.application.classpath</name>
<value>
$HADOOP_CONF_DIR,
$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*
</value>
</property>
<!--
<property>
<name>yarn.resourcemanager.address</name>
<value>127.0.0.1:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>127.0.0.1:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>127.0.0.1:8031</value>
</property>
-->
</configuration>
After leaving my job for a while, it gives me following error:
18/07/16 14:02:19 INFO mapreduce.Job: Job job_1531745885301_0001 running in uber mode : false
18/07/16 14:02:20 INFO mapreduce.Job: map 0% reduce 0%
18/07/16 14:02:20 INFO mapreduce.Job: Job job_1531745885301_0001 failed with state FAILED due to: Application application_1531745885301_0001 failed 2 times due to ApplicationMaster for attempt appattempt_1531745885301_0001_000002 timed out. Failing the application.
18/07/16 14:02:20 INFO mapreduce.Job: Counters: 0
18/07/16 14:02:20 WARN mapreduce.Counters: Group FileSystemCounters is deprecated. Use org.apache.hadoop.mapreduce.FileSystemCounter instead
18/07/16 14:02:20 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 1,241.2774 seconds (0 bytes/sec)
18/07/16 14:02:20 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
18/07/16 14:02:20 INFO mapreduce.ImportJobBase: Retrieved 0 records.
18/07/16 14:02:20 ERROR tool.ImportTool: Error during import: Import job failed!
I am running the docker using following command:
docker run --hostname=quickstart.cloudera --privileged=true -t -p 8889:8888 -i 00a03c98e0d2 /u
sr/bin/docker-quickstart
Where am I going wrong?

Application status in hadoop cluster manager is undefined

I am trying to run a HelloWorld programing in Pig or Hive on my single-cluster setup. All my applications hangs. That includes my word count example in pig. My Hive command below also hangs:
SELECT COUNT(*) FROM u_data;
My Hive logs stops as follows: set mapreduce.job.reduces=<number>
Starting Job = job_1433035285759_0006, Tracking URL = http://der-Inspiron-3521:8088/proxy/application_1433035285759_0006/
Kill Command = /home/der/utility/hadoop-2.7.0/bin/hadoop job -kill job_1433035285759_0006
I follow the link. The application shows the
status = undefined
Here's my map-reduce configuration file:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>localhost:10020</value>
<description>Host and port for Job History Server (default 0.0.0.0:10020)</description>
</property>
<name>mapreduce.jobtracker.http.address</name>
<value>localhost:50030</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
</configuration>
My hadoop server is version 2.7.0.
I tried hadoop server 2.6.0 but I had other issues.
My pig scripts hangs at the MapReduce Launching stage:
2015-05-30 19:34:59,641 [main] INFO org.apache.pig.backend.hadoop.executionengine.
mapReduceLayer.MapReduceLauncher -
detailed locations: M: lines[1,8],words[-1,-1],
wordcount[4,12],grouped[3,10] C: wordcount[4,12],grouped[3,10] R: wordcount[4,12]
2015-05-30 19:34:59,666 [main]
INFO org.apache.pig.backend.
hadoop.executionengine.mapReduceLayer
MapReduceLauncher - 0% complete
2015-05-30 19:34:59,666 [main] INFO org.apache.pig.backend.hadoop.executionengine.
mapReduceLayer.MapReduceLauncher -
Running jobs are [job_1433039621140_0001]

Mapreduce job gzip compression failure

I have setup a new cluster (using HDP on Windows ) and I am encountering a new problem which I haven't seen before.
When I run a simple word count problem from hadoop-examples jar the MapreduceV2 job fails with below error
5/05/16 18:58:29 INFO mapreduce.Job: Task Id : attempt_1431802381254_0001_r_000000_0, Status : FAILED
Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#15
Now,when I go to Application Master tracker and dig into logs I find that reducer is expecting a gzip file but the mapper output wasn’t
2015-05-16 18:45:20,864 WARN [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to shuffle output of attempt_1431791182314_0011_m_000000_0 from <url>:13562
java.io.IOException: not a gzip file
When I specifically drill into Map phase log,I see this
2015-05-16 18:45:09,532 WARN [main] org.apache.hadoop.io.compress.zlib.ZlibFactory: Failed to load/initialize native-zlib library
2015-05-16 18:45:09,532 INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.gz]
2015-05-16 18:45:09,532 WARN [main] org.apache.hadoop.mapred.IFile: Could not obtain compressor from CodecPool
I have the following configurations in my core-site.xml
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
<description>A list of the compression codec classes that can be used for compression/decompression.</description>
</property>
and in mapred-site.xml
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
<property>
<name>mapred.map.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
<property>
<name>mapred.output.compression.type</name>
<value>BLOCK</value>
</property>
<property>
<name>mapred.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapred.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
Now I realise this is pointing to error in native zlib dll loading,so I ran the job overriding options to run without compression and it does work.
I have downloaded the zlib.dll from zlib site and placed it in Hadoop/bin , C:\system32 and C:\SystemWOW64 folders and restarted the cluster services but still I have same error. Not sure why.I would appreciate any ideas to debug this further and resolve it
Hadoop 2.7.2
I ran into the same issue, when I built and ran hadoop 2.7.2 on windows 7. To resolve the issue you need to do the following:
1) On the Build Machine: set ZLIB_HOME to the zlib headers folder zlib_unzip_folder\zlib128-dll\include and build the distribution.
2) On the Run Machine make zlib1.dll zlib_unzip_folder\zlib128-dll\zlib1.dll available on the path.
I used zlib 1.2.8 and the download link can be found here: http://zlib.net/zlib128-dll.zip
Hadoop 2.4.1
This issue can also be reproduced on an older version of HADOOP by setting native lib as false and forcing map output to be compressed. For More detail you can see here: https://issues.apache.org/jira/browse/HADOOP-11334

yarn hadoop 2.4.0: info message: ipc.Client Retrying connect to server

i've searched for two days for a solution. but nothing worked.
First, i'm new to the whole hadoop/yarn/hdfs topic and want to configure a small cluster.
the message above doesn't show up everytime i run an example from the mapreduce-examples.jar
sometimes teragen works, sometimes not.
in some cases the whole job failed, in others the job finishes successfully. sometimes the job failes, without printing the message above.
14/06/08 15:42:46 INFO ipc.Client: Retrying connect to server: FQDN-HOSTNAME/XXX.XX.XX.XXX:53022. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
this message is print 30 times. also the port (in code example: 53022) changes with every time a job is started.
if job finished succesfuly, this is print
14/06/08 15:34:20 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
14/06/08 15:34:20 INFO mapreduce.Job: Job job_1402234146062_0002 running in uber mode : false
14/06/08 15:34:20 INFO mapreduce.Job: map 100% reduce 100%
14/06/08 15:34:20 INFO mapreduce.Job: Job job_1402234146062_0002 completed successfully
if it fails,this is shown.
INFO mapreduce.Job: Job job_1402234146062_0005 failed with state FAILED due to: Task failed task_1402234146062_0005_m_000002
Job failed as tasks failed. failedMaps:1 failedReduces:0
in this case, some tasks failed. but in log files of nodemanager, datanode, resourcemanager, ... is no reason or message to find.
INFO mapreduce.Job: Task Id : attempt_1402234146062_0006_m_000002_1, Status : FAILED
Additional Information about my Configuration:
used OS: centOS 6.5
Java Version: OpenJDK Runtime Environment (rhel-2.4.7.1.el6_5-x86_64 u55-b13)
OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)
yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.address</name>
<value>FQDN-HOSTNAME:8050</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.localizer.address</name>
<value>FQDN-HOSTNAME:8040</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>FQDN-HOSTNAME:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>FQDN-HOSTNAME:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>FQDN-HOSTNAME:8032</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.permissions </name>
<value>false </value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///var/data/hadoop/hdfs/nn</value>
</property>
<property>
<name>fs.checkpoint.dir</name>
<value>file:///var/data/hadoop/hdfs/snn</value>
</property>
<property>
<name>fs.checkpoint.edits.dir</name>
<value>file:///var/data/hadoop/hdfs/snn</value>
<name>fs.checkpoint.edits.dir</name>
<value>file:///var/data/hadoop/hdfs/snn</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///var/data/hadoop/hdfs/dn</value>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.cluster.temp.dir</name>
<value>/mapred/tempDir</value>
</property>
<property>
<name>mapreduce.cluster.local.dir</name>
<value>/mapred/localDir</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>FQDN-HOSTNAME:10020</value>
</property>
</configuration>
I hope somebody could help me. :)
Thank you,
Norman
The job finishes sometimes successfully because when you have one reducer and that reduce task by chance is sent to a working node manager then it becomes successful job.
You have to make sure that FQDN-HOSTNAME is written exactly the same way in the slaves file. If I remember correctly, my solution was that I removed the entry for the hostname mapping in /etc/hosts, that is commenting it out like this:
#127.0.0.1 FQDN-HOSTNAME
This is a bug in how the MR AppMaster starts up with ephemeral ports. It exists in Hadoop 2.6.0 release version as well.
I have figured out a fix to this bug and created a JIRA on the MAPREDUCE project along with a comment on how to fix it.
https://issues.apache.org/jira/browse/MAPREDUCE-6338
Another possible solution for this, is to check for the firewall in all the nodes.
If you're dealing with iptables, you can run this on every node:
# /etc/init.d/iptables save
# /etc/init.d/iptables stop
That will stop the firewall until next restart, but it should be enough for you to test the cluster. You don't have to restart yarn or anything, just run the job again.
If you want to completely stop the FW:
# chkconfig iptables off
Definitely a bug, this post provides a clearer insight into what is happening.
https://groups.google.com/a/cloudera.org/forum/#!msg/cdh-user/P1rfMQmYVWk/eARZXHUTkW0J
We are planning on getting around this issue by reducing the ephemeral port range, thus limiting what ports are grabbed, and then configuring iptables to allow for that port range. Setting the port ranges is explained here -
http://www.ncftp.com/ncftpd/doc/misc/ephemeral_ports.html
if you see a message like
INFO ipc.Client: Retrying connect to server: <hostname>/<ip>:<port>. Already tried 1 time(s); maxRetries=3
Need to check:
check your firewall between client and Node Manager
check yarn.app.mapreduce.am.job.client.port-range by default the he range is all possible ports
Wow! Are these answers for real?? Talking about FQDN when the job clearly completes...as long as firewall is disabled?? And the OP even put the detailed log messages / configuration.
The problem is that yarn.app.mapreduce.am.job.client.port-range is not being honored. I'm running into it also.
Firewall off...all is well (and I can see the ephemeral ports from yarn job).
Firewall on...all times outs (eventually).
Horton completely ignores this question on other boards.
So here's a log output from a job which demonstrates the problem. In first case, I have the firewall enabled on the client(s) based on Horton's doc (along with other ports I discovered by looking very closely at my installation). You will see the process timing out...and then all of a sudden working. Because I disabled the firewall after watching the job output :)
2015-01-15 16:48:22,943 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: de-luster-l2723nraqsy5-ywhniidze3lb-qfk4asn77vc5/10.0.0.41:52015. Already tried 39 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-01-15 16:48:23,349 INFO [main] org.apache.hadoop.mapred.YarnChild: mapreduce.cluster.local.dir for child: /hadoop/yarn/local/usercache/l.admin/appcache/application_1420482341308_0020
2015-01-15 16:48:24,122 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
2015-01-15 16:48:24,656 INFO [main] org.apache.hadoop.mapred.Task: Using ResourceCalculatorProcessTree : [ ]
2015-01-15 16:48:24,724 INFO [main] org.apache.hadoop.mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle#7f94ee59
2015-01-15 16:48:24,792 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: MergerManager: memoryLimit=534354336, maxSingleShuffleLimit=133588584, mergeThreshold=352673888, ioSortFactor=100, memToMemMergeOutputsThreshold=100
Did ya see it?? Problem with timeout...then all of a sudden Shuffle commences. Nothing to do with FQDNs after all :)

Resources