YARN MapReduce approximate-pi example fails with exit code 1 when run as non-hadoop user

I am running a small private cluster of Linux machines with Hadoop 2.6.2 and YARN. I launch YARN jobs from a Linux edge node. The canned YARN example that approximates the value of pi works perfectly when run by the hadoop user (the superuser and owner of the cluster), but fails when run from my personal account on the edge node. In both cases (hadoop, me) I run the job exactly like this:
clott@edge: /home/hadoop/hadoop-2.6.2/bin/yarn jar /home/hadoop/hadoop-2.6.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.2.jar pi 2 5
It fails; the full output is below. I think the file-not-found exception is totally bogus. I think something causes the launch of the container to fail, and so there's no output to be found. What causes container launches to fail, and how can this be debugged?
Because this identical command works perfectly when run by the hadoop user but not when run by a different account on the same edge node, I suspect a permission or other YARN configuration problem; I don't suspect a missing-jar problem. My personal account uses the same environment variables as the hadoop account, for what that's worth.
These questions are similar but I didn't find a solution:
https://issues.cloudera.org/browse/DISTRO-577
Running a map reduce job as a different user
Yarn MapReduce Job Issue - AM Container launch error in Hadoop 2.3.0
I have tried these remedies without any success:
In core-site.xml, set the value of hadoop.tmp.dir to /tmp/temp-${user.name}
Add my personal user account to every node in the cluster
I guess that many installations run with just a single user, but I'm trying to allow two people to work together on the cluster without trashing each other's intermediate results. Am I totally nuts?
Full output:
Number of Maps = 2
Samples per Map = 5
Wrote input for Map #0
Wrote input for Map #1
Starting Job
15/12/22 15:29:18 INFO client.RMProxy: Connecting to ResourceManager at ac1.mycompany.com/1.2.3.4:8032
15/12/22 15:29:18 INFO input.FileInputFormat: Total input paths to process : 2
15/12/22 15:29:19 INFO mapreduce.JobSubmitter: number of splits:2
15/12/22 15:29:19 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1450815437271_0002
15/12/22 15:29:19 INFO impl.YarnClientImpl: Submitted application application_1450815437271_0002
15/12/22 15:29:19 INFO mapreduce.Job: The url to track the job: http://ac1.mycompany.com:8088/proxy/application_1450815437271_0002/
15/12/22 15:29:19 INFO mapreduce.Job: Running job: job_1450815437271_0002
15/12/22 15:29:31 INFO mapreduce.Job: Job job_1450815437271_0002 running in uber mode : false
15/12/22 15:29:31 INFO mapreduce.Job: map 0% reduce 0%
15/12/22 15:29:31 INFO mapreduce.Job: Job job_1450815437271_0002 failed with state FAILED due to: Application application_1450815437271_0002 failed 2 times due to AM Container for appattempt_1450815437271_0002_000002 exited with exitCode: 1
For more detailed output, check application tracking page:http://ac1.mycompany.com:8088/proxy/application_1450815437271_0002/Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1450815437271_0002_02_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.
15/12/22 15:29:31 INFO mapreduce.Job: Counters: 0
Job Finished in 13.489 seconds
java.io.FileNotFoundException: File does not exist: hdfs://ac1.mycompany.com/user/clott/QuasiMonteCarlo_1450816156703_163431099/out/reduce-out
at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1122)
at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1817)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1841)
at org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:314)
at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:354)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:363)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

Yes, Manjunath Ballur, you were right: it was a permissions problem! I finally learned how to preserve the YARN application logs, which clearly revealed the problem. Here are the steps:
Edit yarn-site.xml and add a property to delay yarn log deletion:
<property>
<name>yarn.nodemanager.delete.debug-delay-sec</name>
<value>600</value>
</property>
Push yarn-site.xml to all nodes (ARGH, I forgot this for a long time) and restart the cluster.
Run the YARN example to estimate pi as shown above; it fails. Look at http://namenode:8088/cluster/apps/FAILED to see the failed application, click on the link for the most recent failure, and look at the bottom to see which nodes in the cluster were used.
Open a window on one of the nodes in the cluster where the app failed. Find the job directory, which in my case was
~hadoop/hadoop-2.6.2/logs/userlogs/application_1450815437271_0004/container_1450815437271_0004_01_000001/
Et voila, I saw files stdout (only log4j bitching), stderr (nearly empty) and syslog (winner winner chicken dinner). In the syslog file I found this gem:
2015-12-23 08:31:42,376 INFO [main] org.apache.hadoop.service.AbstractService: Service JobHistoryEventHandler failed in state INITED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.security.AccessControlException: Permission denied: user=clott, access=EXECUTE, inode="/tmp/hadoop-yarn/staging/history":hadoop:supergroup:drwxrwx---
So the problem was permissions on hdfs:///tmp/hadoop-yarn/staging/history. A simple chmod 777 put me right, I'm not fighting the group perms anymore. Now a non-hadoop non-superuser can run a yarn job.
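The check that failed here can be reproduced offline. The following is a minimal Python sketch, assuming the message format shown in the syslog line above; the function names `parse_denial` and `allowed` are made up for illustration, and the check is deliberately simplified (it ignores the HDFS superuser, ACLs, and compound access values such as READ_EXECUTE).

```python
import re

# Matches the "Permission denied" message format seen in the syslog line above.
DENIED = re.compile(
    r'Permission denied: user=(?P<user>[^,]+), access=(?P<access>\w+), '
    r'inode="(?P<path>[^"]+)":(?P<owner>[^:]+):(?P<group>[^:]+):'
    r'(?P<mode>[d-][rwxsStT-]{9})'
)


def parse_denial(line):
    """Return the fields of a 'Permission denied' message, or None if absent."""
    match = DENIED.search(line)
    return match.groupdict() if match else None


def allowed(mode, owner, group, user, user_groups, access):
    """Simplified owner/group/other check against a mode string like drwxrwx---.

    Hypothetical helper: no superuser, no ACLs, single access types only.
    """
    bit = {"READ": 0, "WRITE": 1, "EXECUTE": 2}[access]
    perms = mode[1:]  # drop the leading file-type character
    if user == owner:
        triple = perms[0:3]
    elif group in user_groups:
        triple = perms[3:6]
    else:
        triple = perms[6:9]
    # "-" means not granted; "S"/"T" mean a special bit without execute.
    return triple[bit] not in "-ST"


line = ('org.apache.hadoop.security.AccessControlException: Permission denied: '
        'user=clott, access=EXECUTE, '
        'inode="/tmp/hadoop-yarn/staging/history":hadoop:supergroup:drwxrwx---')
info = parse_denial(line)
print(info["path"], info["mode"])
print(allowed(info["mode"], info["owner"], info["group"],
              info["user"], set(), info["access"]))
```

With an empty group set, the check falls through to the "other" triple (`---`) and is denied, which matches the log. Under this simplified model, the same check passes if the user belongs to supergroup, so group membership as seen by the NameNode may be a narrower alternative to chmod 777.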

Related

S3distcp on local hadoop cluster not working

I am trying to run s3distcp from my local Hadoop pseudo-cluster. Executing s3distcp.jar produced the following stack trace. It seems that the reducer task is failing, but I am not able to pinpoint what is causing it to fail:
18/02/21 12:14:01 WARN mapred.LocalJobRunner: job_local639263089_0001
java.lang.Exception: java.lang.RuntimeException: Reducer task failed to copy 1 files: file:/home/chirag/workspaces/lzo/data-1518765365022.lzo etc
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:489)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:556)
Caused by: java.lang.RuntimeException: Reducer task failed to copy 1 files: file:/home/chirag/workspaces/lzo/data-1518765365022.lzo etc
at com.amazon.external.elasticmapreduce.s3distcp.CopyFilesReducer.close(CopyFilesReducer.java:70)
at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:250)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:459)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:346)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/02/21 12:14:02 INFO mapreduce.Job: Job job_local639263089_0001 running in uber mode : false
18/02/21 12:14:02 INFO mapreduce.Job: map 100% reduce 0%
18/02/21 12:14:02 INFO mapreduce.Job: Job job_local639263089_0001 failed with state FAILED due to: NA
18/02/21 12:14:02 INFO mapreduce.Job: Counters: 35
I'm getting the same error. In my case, I found logs in HDFS /var/log/hadoop-yarn/apps/hadoop/logs related to the MR job that s3-dist-cp kicks off.
hadoop fs -ls /var/log/hadoop-yarn/apps/hadoop/logs
I copied them out to local:
hadoop fs -get /var/log/hadoop-yarn/apps/hadoop/logs/application_nnnnnnnnnnnnn_nnnn/ip-nnn-nn-nn-nnn.ec2.internal_nnnn
And then examined them in a text editor to find more diagnostic information about the detailed results of the Reducer phase. In my case I was getting an error back from the S3 service. You might find a different error.
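Instead of reading the fetched logs by eye, the interesting lines can be pulled out with a few lines of Python. This is only a sketch: the function name `error_lines` and the default marker list are assumptions chosen for illustration, and it presumes the fetched file is readable as text.

```python
def error_lines(log_text, markers=("ERROR", "FATAL", "Exception", "Caused by:")):
    """Return (line_number, line) pairs for lines containing any marker."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if any(marker in line for marker in markers):
            hits.append((number, line.strip()))
    return hits


sample = "\n".join([
    "18/02/21 12:14:01 WARN mapred.LocalJobRunner: job_local639263089_0001",
    "java.lang.Exception: java.lang.RuntimeException: Reducer task failed to copy 1 files",
    "at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:489)",
    "Caused by: java.lang.RuntimeException: Reducer task failed to copy 1 files",
])
for number, line in error_lines(sample):
    print(number, line)
```

For a real file, the text could be loaded with `open(path, errors="replace").read()` so that any non-text bytes in the aggregated log do not abort the scan.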

Hadoop: NullPointerException when redirecting to job history server

I have a Hadoop cluster (HDP 2.1). Everything has been working for a long time, but suddenly jobs have started to return the following recurrent error:
16/10/13 16:21:11 INFO Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
16/10/13 16:21:12 INFO Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
16/10/13 16:21:12 INFO impl.TimelineClientImpl: Timeline service address: http://dev-fiwr-bignode-12.hi.inet:8188/ws/v1/timeline/
16/10/13 16:21:13 INFO client.RMProxy: Connecting to ResourceManager at dev-fiwr-bignode-12.hi.inet/10.95.76.79:8050
16/10/13 16:21:13 INFO input.FileInputFormat: Total input paths to process : 2
16/10/13 16:21:13 INFO mapreduce.JobSubmitter: number of splits:2
16/10/13 16:21:13 INFO Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
16/10/13 16:21:14 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1476366871137_0003
16/10/13 16:21:14 INFO impl.YarnClientImpl: Submitted application application_1476366871137_0003
16/10/13 16:21:14 INFO mapreduce.Job: The url to track the job: http://dev-fiwr-bignode-12.hi.inet:8088/proxy/application_1476366871137_0003/
16/10/13 16:21:14 INFO mapreduce.Job: Running job: job_1476366871137_0003
16/10/13 16:21:19 INFO mapreduce.Job: Job job_1476366871137_0003 running in uber mode : false
16/10/13 16:21:19 INFO mapreduce.Job: map 0% reduce 0%
16/10/13 16:21:23 INFO mapreduce.Job: map 50% reduce 0%
16/10/13 16:21:24 INFO mapreduce.Job: map 100% reduce 0%
16/10/13 16:21:28 INFO mapreduce.Job: map 100% reduce 100%
16/10/13 16:21:29 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
16/10/13 16:21:29 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
16/10/13 16:21:29 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
Exception in thread "main" java.io.IOException:
org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): java.lang.NullPointerException
org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler.getTaskAttemptCompletionEvents(HistoryClientService.java:277)
org.apache.hadoop.mapreduce.v2.api.impl.pb.service.MRClientProtocolPBServiceImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBServiceImpl.java:173)
org.apache.hadoop.yarn.proto.MRClientProtocol$MRClientProtocolService$2.callBlockingMethod(MRClientProtocol.java:283)
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
java.security.AccessController.doPrivileged(Native Method)
javax.security.auth.Subject.doAs(Subject.java:415)
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:334)
org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:386)
org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:539)
org.apache.hadoop.mapreduce.Job$5.run(Job.java:668)
org.apache.hadoop.mapreduce.Job$5.run(Job.java:665)
java.security.AccessController.doPrivileged(Native Method)
javax.security.auth.Subject.doAs(Subject.java:415)
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:665)
org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1366)
org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1306)
dijkstra.adjacencylist.AdjacencyListDriver.jobRun(AdjacencyListDriver.java:53)
dijkstra.adjacencylist.AdjacencyListDriver.run(AdjacencyListDriver.java:31)
org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
dijkstra.launch.LaunchClass.launchAdjMatrix(LaunchClass.java:226)
dijkstra.launch.LaunchClass.main(LaunchClass.java:199)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:606)
org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by:
org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException):
java.lang.NullPointerException
org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler.getTaskAttemptCompletionEvents(HistoryClientService.java:277)
...
Googling a bit, I've seen these issues:
https://issues.apache.org/jira/browse/MAPREDUCE-5703
https://issues.apache.org/jira/browse/MAPREDUCE-5547
They seem to be related. Nevertheless, why was the cluster running properly until now? Nothing was changed in the configuration, the cluster is not in safe mode, the HDFS space usage is around 0.03%... Any clues? And if this is related to the issues mentioned above, is there any workaround?
Many thanks, I'll stay tuned for your answers or additional info requirements.
Your issue looks similar to MAPREDUCE-5703, judging by the stack trace. As stated in that bug:
"The method GetTaskAttemptCompletionEventsResponse() fetched a Job by calling verifyAndGetJob(), but it never checked if job was null or not, which was the root cause of this issue."
In other words, a job is looked up by its job ID, and the job is not found.
That bug describes a scenario in which the job history server (JHS) is queried about a finished job, but the JHS never received the info for that job.
There seem to be open issues around job termination and job history uploads that allow this exception to occur when the history upload fails. In the bug, the issue was triggered by restarting the node writing the history before the upload was complete, or by that node having no good nodes to write the history to.
Unfortunately, there is nothing else here that might help identify what caused the history upload to fail in your case, but that appears to be the underlying source of the issue. Your job history server has no record of the job that successfully completed.
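The missing null check described in the bug can be illustrated with a toy model. This is not Hadoop code: the dictionary-backed history store, the `JobNotFound` exception, and the function names are hypothetical stand-ins that only mirror the shape of the problem (a lookup that may return nothing, followed by an unguarded use of the result).

```python
class JobNotFound(Exception):
    """Raised when the history store has no record of the requested job."""


def verify_and_get_job(history, job_id):
    # Returns None when the history store never received this job.
    return history.get(job_id)


def completion_events_unchecked(history, job_id):
    job = verify_and_get_job(history, job_id)
    # Fails with TypeError if job is None, the Python analogue of the
    # NullPointerException in the stack trace.
    return job["events"]


def completion_events_checked(history, job_id):
    job = verify_and_get_job(history, job_id)
    if job is None:
        raise JobNotFound("history store has no record of " + job_id)
    return job["events"]


history = {"job_1": {"events": ["map done", "reduce done"]}}
print(completion_events_checked(history, "job_1"))
try:
    completion_events_checked(history, "job_1476366871137_0003")
except JobNotFound as err:
    print("lookup failed:", err)
```

The checked variant does not make the missing history appear; it only turns an opaque failure into a message that says what is actually wrong.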

Hadoop error in shuffle in fetcher: Exceeded MAX_FAILED_UNIQUE_FETCHES

I am new to Hadoop. I have a Kerberos-secured Hadoop cluster (a master and one slave) set up on VirtualBox. I am trying to run the 'pi' job from the Hadoop examples. The job terminates with the error Exceeded MAX_FAILED_UNIQUE_FETCHES. I searched for this error, but the solutions I found online do not seem to work for me. Perhaps I am missing something obvious. I even tried removing the slave from the etc/hadoop/slaves file to see whether the job can run on the master alone, but that fails with the same error. Below is the log. I am running this on a 64-bit Ubuntu 14.04 VirtualBox VM. Any help appreciated.
montauk@montauk-vmaster:/usr/local/hadoop$ sudo -u yarn bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.0.jar pi 2 10
Number of Maps = 2
Samples per Map = 10
OpenJDK 64-Bit Server VM warning: You have loaded library /usr/local/hadoop/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
14/06/05 12:04:43 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Wrote input for Map #0
Wrote input for Map #1
Starting Job
14/06/05 12:04:49 INFO client.RMProxy: Connecting to ResourceManager at /192.168.0.29:8040
14/06/05 12:04:50 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 17 for yarn on 192.168.0.29:54310
14/06/05 12:04:50 INFO security.TokenCache: Got dt for hdfs://192.168.0.29:54310; Kind: HDFS_DELEGATION_TOKEN, Service: 192.168.0.29:54310, Ident: (HDFS_DELEGATION_TOKEN token 17 for yarn)
14/06/05 12:04:50 INFO input.FileInputFormat: Total input paths to process : 2
14/06/05 12:04:51 INFO mapreduce.JobSubmitter: number of splits:2
14/06/05 12:04:51 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1401975262053_0007
14/06/05 12:04:51 INFO mapreduce.JobSubmitter: Kind: HDFS_DELEGATION_TOKEN, Service: 192.168.0.29:54310, Ident: (HDFS_DELEGATION_TOKEN token 17 for yarn)
14/06/05 12:04:53 INFO impl.YarnClientImpl: Submitted application application_1401975262053_0007
14/06/05 12:04:53 INFO mapreduce.Job: The url to track the job: http://montauk-vmaster:8088/proxy/application_1401975262053_0007/
14/06/05 12:04:53 INFO mapreduce.Job: Running job: job_1401975262053_0007
14/06/05 12:05:29 INFO mapreduce.Job: Job job_1401975262053_0007 running in uber mode : false
14/06/05 12:05:29 INFO mapreduce.Job: map 0% reduce 0%
14/06/05 12:06:04 INFO mapreduce.Job: map 50% reduce 0%
14/06/05 12:06:06 INFO mapreduce.Job: map 100% reduce 0%
14/06/05 12:06:34 INFO mapreduce.Job: map 100% reduce 100%
14/06/05 12:06:34 INFO mapreduce.Job: Task Id : attempt_1401975262053_0007_r_000000_0, Status : FAILED
Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#4
at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.checkReducerHealth(ShuffleSchedulerImpl.java:323)
at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:245)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:347)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165)
I came across the same problem when I installed CDH 5.1.0 with Kerberos security from a tarball. The solutions I found through Google blame insufficient memory, but I don't think that applies to my situation, since my input is very small (52K).
After digging for several days, I found the root cause in this link.
To sum up, the solution from that link is:
Add the following property to yarn-site.xml, even though it is already the default in yarn-default.xml:
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
Remove the yarn.nodemanager.local-dirs property so that the default location under /tmp is used, then run the following commands:
mkdir -p /tmp/hadoop-yarn/nm-local-dir
chown yarn:yarn /tmp/hadoop-yarn/nm-local-dir
The problem can be summarized as follows:
Once yarn.nodemanager.local-dirs is set explicitly, the default for yarn.nodemanager.aux-services.mapreduce_shuffle.class from yarn-default.xml no longer seems to take effect.
I still haven't found out why that happens.
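To see which of the two settings a given yarn-site.xml actually carries, the file can be inspected with a short script. This is a sketch in Python using only the standard library; the function names `read_properties` and `shuffle_findings` are invented for illustration, and the findings simply restate the two points from the answer above rather than any official validation rule.

```python
import xml.etree.ElementTree as ET

SHUFFLE_CLASS_KEY = "yarn.nodemanager.aux-services.mapreduce_shuffle.class"
LOCAL_DIRS_KEY = "yarn.nodemanager.local-dirs"


def read_properties(xml_text):
    """Parse Hadoop-style <configuration> XML into a name -> value dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def shuffle_findings(props):
    """Report the two settings discussed above (illustrative checks only)."""
    findings = []
    if SHUFFLE_CLASS_KEY not in props:
        findings.append("shuffle class is not set explicitly")
    if LOCAL_DIRS_KEY in props:
        findings.append("local-dirs is overridden: " + props[LOCAL_DIRS_KEY])
    return findings


site = """
<configuration>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/data/yarn/local</value>
  </property>
</configuration>
"""
for finding in shuffle_findings(read_properties(site)):
    print(finding)
```

The `/data/yarn/local` value in the sample is a made-up path. For a real check, the text would come from the yarn-site.xml deployed on each NodeManager, since the files can differ between nodes.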
I had the same issue. My MapReduce job had no reducer, and I solved it by calling job.setNumReduceTasks(0);
Change the property below in yarn-site.xml and create the directory:
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/tmp</value>
</property>
mkdir -p /tmp/hadoop-yarn/nm-local-dir
chown yarn:yarn /tmp/hadoop-yarn/nm-local-dir
Tune the resource properties in mapred-site.xml:
mapreduce.reduce.shuffle.input.buffer.percent=0.50
mapreduce.reduce.shuffle.memory.limit.percent=0.2
mapreduce.reduce.shuffle.parallelcopies=4
Restart resourcemanager and nodemanager on their respective nodes.

Hadoop error stalling job reduce process

I have been running a Hadoop job (the word count example) a few times on my two-node cluster setup, and it has worked fine up until now. Now I keep getting a RuntimeException that stalls the reduce process at 19%:
2013-04-13 18:45:22,191 INFO org.apache.hadoop.mapred.Task: Task:attempt_201304131843_0001_m_000000_0 is done. And is in the process of commiting
2013-04-13 18:45:22,299 INFO org.apache.hadoop.mapred.Task: Task 'attempt_201304131843_0001_m_000000_0' done.
2013-04-13 18:45:22,318 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2013-04-13 18:45:23,181 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.RuntimeException: Error while running command to get file permissions : org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:255)
at org.apache.hadoop.util.Shell.run(Shell.java:182)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)
at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:710)
at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:443)
at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getOwner(RawLocalFileSystem.java:426)
at org.apache.hadoop.mapred.TaskLog.obtainLogDirOwner(TaskLog.java:267)
at org.apache.hadoop.mapred.TaskLogsTruncater.truncateLogs(TaskLogsTruncater.java:124)
at org.apache.hadoop.mapred.Child$4.run(Child.java:260)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:468)
at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getOwner(RawLocalFileSystem.java:426)
at org.apache.hadoop.mapred.TaskLog.obtainLogDirOwner(TaskLog.java:267)
at org.apache.hadoop.mapred.TaskLogsTruncater.truncateLogs(TaskLogsTruncater.java:124)
at org.apache.hadoop.mapred.Child$4.run(Child.java:260)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Has anyone any ideas of what might be causing this?
Edit: Solved it myself.
If anyone else runs into the same problem: it was caused by the /etc/hosts file on the master node. I hadn't entered the hostname and address of the slave node.
This is how my hosts-file is structured on the master-node:
127.0.0.1 MyUbuntuServer
192.xxx.x.xx2 master
192.xxx.x.xx3 MySecondUbuntuServer
192.xxx.x.xx3 slave
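A quick way to catch this kind of omission is to check a hosts file for every hostname the cluster needs. The sketch below is in Python; `parse_hosts` and `missing_hosts` are hypothetical helper names, and the masked addresses are copied from the listing above as opaque strings rather than validated as IP addresses.

```python
def parse_hosts(text):
    """Map each hostname in hosts-file text to the first address listing it."""
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            table.setdefault(name, address)
    return table


def missing_hosts(text, required):
    """Return the required hostnames that have no entry in the hosts text."""
    table = parse_hosts(text)
    return [name for name in required if name not in table]


hosts = """
127.0.0.1 MyUbuntuServer
192.xxx.x.xx2 master
192.xxx.x.xx3 MySecondUbuntuServer
192.xxx.x.xx3 slave
"""
print(missing_hosts(hosts, ["master", "slave"]))
```

Running the same check against the text of /etc/hosts on each node, with the full list of cluster hostnames, would show which node is missing which entry.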
A similar problem is described here:
http://comments.gmane.org/gmane.comp.apache.mahout.user/8898
The info there might apply to a different version of Hadoop. It says:
java.lang.RuntimeException: Error while running command to
get file permissions : java.io.IOException: Cannot run program
"/bin/ls": error=12, Not enough space
The solution there was to change the heap size through mapred.child.java.opts (-Xmx1200M).
See also: https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!topic/cdh-user/BHGYJDNKMGE
HTH,
Avner

Oozie map-reduce example fails with ClassNotFoundException when using Bigtop 0.5.0

I'm using a relatively clean installation of CentOS 6.3 minimal with the Bigtop 0.5.0 repo and Sun Java 1.6. I add the Bigtop repo as per the instructions here.
I have installed Hadoop common and Oozie using yum. I configured oozie by running sudo service oozie init, then set up the HDFS paths using the commands in the init-hdfs.sh file in Bigtop 0.6.0.
I can run Java and streaming map-reduce jobs without any problems. I can also run the Oozie streaming example that comes bundled with Bigtop. Unfortunately, when I try to run the map-reduce example, I get a java.lang.ClassNotFoundException.
I can see from the HDFS audit logs that the oozie-examples-3.3.0.jar file gets inspected, but never opened. These are the only four entries for the jar file in the audit log for the time the workflow is running:
2013-03-12 14:42:07,394 INFO FSNamesystem.audit: allowed=true ugi=user (auth:SIMPLE) via oozie (auth:SIMPLE) ip=/192.168.56.12 cmd=getfileinfo src=/user/user/examples/apps/map-reduce/lib/oozie-examples-3.3.0.jar dst=null perm=null
2013-03-12 14:42:07,399 INFO FSNamesystem.audit: allowed=true ugi=user (auth:SIMPLE) via oozie (auth:SIMPLE) ip=/192.168.56.12 cmd=getfileinfo src=/user/user/examples/apps/map-reduce/lib/oozie-examples-3.3.0.jar dst=null perm=null
<snip>
2013-03-12 14:42:07,547 INFO FSNamesystem.audit: allowed=true ugi=user (auth:SIMPLE) via oozie (auth:SIMPLE) ip=/192.168.56.12 cmd=getfileinfo src=/user/user/examples/apps/map-reduce/lib/oozie-examples-3.3.0.jar dst=null perm=null
2013-03-12 14:42:07,550 INFO FSNamesystem.audit: allowed=true ugi=user (auth:SIMPLE) via oozie (auth:SIMPLE) ip=/192.168.56.12 cmd=getfileinfo src=/user/user/examples/apps/map-reduce/lib/oozie-examples-3.3.0.jar dst=null perm=null
The container logs I get from the webconsole on port 8088 show the following exception, but offer no further clues:
2013-03-12 15:10:18,681 FATAL [IPC Server handler 2 on 57310] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1363061307536_0002_m_000000_0 - exited : java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:396)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:335)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1367)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:103)
... 9 more
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.oozie.example.SampleMapper not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1611)
at org.apache.hadoop.mapred.JobConf.getMapperClass(JobConf.java:979)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
... 14 more
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.oozie.example.SampleMapper not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1579)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1603)
... 16 more
Caused by: java.lang.ClassNotFoundException: Class org.apache.oozie.example.SampleMapper not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1485)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1577)
... 17 more
I managed to grab the job.xml file out of the temp directory while the failing stage of the workflow was running, and I can see that the jar file gets added to the classpath property:
<property><name>mapreduce.job.classpath.files</name><value>/user/user/user-oozi/0000001-130312141058075-oozie-oozi-W/mr-node--map-reduce/map-reduce-launcher.jar,/user/user/examples/apps/map-reduce/lib/oozie-examples-3.3.0.jar</value><source>programatically</source></property>
... but the class is still apparently not found. I've set all debugging up to DEBUG for all components and can find no more clues.
Have I simply misconfigured something, or is this actually a bug? I don't really know what to do next.
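One thing that can be ruled out cheaply is a packaging problem, that is, whether the jar listed in the classpath property really contains the class named in the exception. The following Python sketch does that with the standard zipfile module; the helper name `jar_has_class` is invented here, and the in-memory jar is a stand-in built only so the example runs without the real oozie-examples-3.3.0.jar.

```python
import io
import zipfile


def jar_has_class(jar, class_name):
    """Return True if the jar (path or file-like object) contains the class."""
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar) as archive:
        return entry in archive.namelist()


# Build a tiny stand-in jar in memory; the entry content is a placeholder,
# not real bytecode.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("org/apache/oozie/example/SampleMapper.class", b"placeholder")
buffer.seek(0)

print(jar_has_class(buffer, "org.apache.oozie.example.SampleMapper"))
```

If the class is present in a copy of the jar fetched from HDFS, the jar's contents are not the problem, which would be consistent with the audit log showing the file being inspected but never opened.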
