Issue in MapReduce program - Hadoop

I am executing a Hadoop MapReduce job for a simple word count problem using PuTTY.
I have configured Hadoop on a VM and verified that all Hadoop components are running using jps.
When I execute the job with the command
hadoop jar Untitled.jar
I get the following error:
15/06/20 19:36:48 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
15/06/20 19:37:09 INFO util.NativeCodeLoader: Loaded the native-hadoop library
15/06/20 19:37:09 WARN snappy.LoadSnappy: Snappy native library not loaded
15/06/20 19:37:09 INFO mapred.FileInputFormat: Total input paths to process : 0
15/06/20 19:37:10 INFO mapred.JobClient: Running job: job_201506201820_0004
15/06/20 19:37:11 INFO mapred.JobClient: map 0% reduce 0%
15/06/20 19:37:12 INFO mapred.JobClient: Task Id : attempt_201506201820_0004_m_000001_0, Status : FAILED
Error initializing attempt_201506201820_0004_m_000001_0:
ENOENT: No such file or directory
at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method)
at org.apache.hadoop.fs.FileUtil.execSetPermission(FileUtil.java:701)
at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:656)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:514)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:349)
at org.apache.hadoop.mapred.JobLocalizer.initializeJobLogDir(JobLocalizer.java:240)
at org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:205)
at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1336)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1311)
at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1226)
at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2603)
at java.lang.Thread.run(Thread.java:745)
15/06/20 19:37:13 WARN mapred.JobClient: Error reading task outputhttp://ankit-VirtualBox:50060/tasklog?plaintext=true&attemptid=attempt_201506201820_0004_m_000001_0&filter=stdout
15/06/20 19:37:13 WARN mapred.JobClient: Error reading task outputhttp://ankit-VirtualBox:50060/tasklog?plaintext=true&attemptid=attempt_201506201820_0004_m_000001_0&filter=stderr
15/06/20 19:37:14 INFO mapred.JobClient: Task Id : attempt_201506201820_0004_m_000001_1, Status : FAILED
Error initializing attempt_201506201820_0004_m_000001_1:
ENOENT: No such file or directory
at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method)
at org.apache.hadoop.fs.FileUtil.execSetPermission(FileUtil.java:701)
at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:656)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:514)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:349)
at org.apache.hadoop.mapred.JobLocalizer.initializeJobLogDir(JobLocalizer.java:240)
at org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:205)
at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1336)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1311)
at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1226)
at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2603)
at java.lang.Thread.run(Thread.java:745)
15/06/20 19:37:15 WARN mapred.JobClient: Error reading task outputhttp://ankit-VirtualBox:50060/tasklog?plaintext=true&attemptid=attempt_201506201820_0004_m_000001_1&filter=stdout
15/06/20 19:37:15 WARN mapred.JobClient: Error reading task outputhttp://ankit-VirtualBox:50060/tasklog?plaintext=true&attemptid=attempt_201506201820_0004_m_000001_1&filter=stderr
15/06/20 19:37:16 INFO mapred.JobClient: Task Id : attempt_201506201820_0004_m_000001_2, Status : FAILED
15/06/20 19:37:16 WARN mapred.JobClient: Error reading task outputhttp://ankit-VirtualBox:50060/tasklog?plaintext=true&attemptid=attempt_201506201820_0004_m_000001_2&filter=stdout
15/06/20 19:37:16 WARN mapred.JobClient: Error reading task outputhttp://ankit-VirtualBox:50060/tasklog?plaintext=true&attemptid=attempt_201506201820_0004_m_000001_2&filter=stderr
15/06/20 19:37:17 INFO mapred.JobClient: Task Id : attempt_201506201820_0004_m_000000_0, Status : FAILED
Error initializing attempt_201506201820_0004_m_000000_0:
ENOENT: No such file or directory
at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method)
at org.apache.hadoop.fs.FileUtil.execSetPermission(FileUtil.java:701)
at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:656)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:514)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:349)
at org.apache.hadoop.mapred.JobLocalizer.initializeJobLogDir(JobLocalizer.java:240)
at org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:205)
at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1336)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1311)
at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1226)
at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2603)
at java.lang.Thread.run(Thread.java:745)
15/06/20 19:37:17 WARN mapred.JobClient: Error reading task outputhttp://ankit-VirtualBox:50060/tasklog?plaintext=true&attemptid=attempt_201506201820_0004_m_000000_0&filter=stdout
15/06/20 19:37:17 WARN mapred.JobClient: Error reading task outputhttp://ankit-VirtualBox:50060/tasklog?plaintext=true&attemptid=attempt_201506201820_0004_m_000000_0&filter=stderr
15/06/20 19:37:18 INFO mapred.JobClient: Task Id : attempt_201506201820_0004_m_000000_1, Status : FAILED
Error initializing attempt_201506201820_0004_m_000000_1:
ENOENT: No such file or directory
at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method)
at org.apache.hadoop.fs.FileUtil.execSetPermission(FileUtil.java:701)
at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:656)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:514)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:349)
at org.apache.hadoop.mapred.JobLocalizer.initializeJobLogDir(JobLocalizer.java:240)
at org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:205)
at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1336)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1311)
at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1226)
at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2603)
at java.lang.Thread.run(Thread.java:745)
15/06/20 19:37:19 WARN mapred.JobClient: Error reading task outputhttp://ankit-VirtualBox:50060/tasklog?plaintext=true&attemptid=attempt_201506201820_0004_m_000000_1&filter=stdout
15/06/20 19:37:19 WARN mapred.JobClient: Error reading task outputhttp://ankit-VirtualBox:50060/tasklog?plaintext=true&attemptid=attempt_201506201820_0004_m_000000_1&filter=stderr
15/06/20 19:37:20 INFO mapred.JobClient: Task Id : attempt_201506201820_0004_m_000000_2, Status : FAILED
Error initializing attempt_201506201820_0004_m_000000_2:
ENOENT: No such file or directory
at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method)
at org.apache.hadoop.fs.FileUtil.execSetPermission(FileUtil.java:701)
at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:656)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:514)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:349)
at org.apache.hadoop.mapred.JobLocalizer.initializeJobLogDir(JobLocalizer.java:240)
at org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:205)
at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1336)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1311)
at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1226)
at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2603)
at java.lang.Thread.run(Thread.java:745)
15/06/20 19:37:20 WARN mapred.JobClient: Error reading task outputhttp://ankit-VirtualBox:50060/tasklog?plaintext=true&attemptid=attempt_201506201820_0004_m_000000_2&filter=stdout
15/06/20 19:37:20 WARN mapred.JobClient: Error reading task outputhttp://ankit-VirtualBox:50060/tasklog?plaintext=true&attemptid=attempt_201506201820_0004_m_000000_2&filter=stderr
15/06/20 19:37:21 INFO mapred.JobClient: Job complete: job_201506201820_0004
15/06/20 19:37:21 INFO mapred.JobClient: Counters: 4
15/06/20 19:37:21 INFO mapred.JobClient: Job Counters
15/06/20 19:37:21 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
15/06/20 19:37:21 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
15/06/20 19:37:21 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=0
15/06/20 19:37:21 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
15/06/20 19:37:21 INFO mapred.JobClient: Job Failed: JobCleanup Task Failure, Task: task_201506201820_0004_m_000000
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
at WordCount.main(WordCount.java:32)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
What am I missing?

You don't seem to be providing any input path. Look at this line in the log:
15/06/20 19:37:09 INFO mapred.FileInputFormat: Total input paths to process : 0

From the logs, you can see the issue occurred at the very first map execution, since both map and reduce sit at 0%. Next comes the actual error: "No such file or directory". The only files a MapReduce job touches directly are its input and output, and since the failure happens right at the start of the map phase, the problem is most likely the input path or its permissions. Please look into both. Also, the output directory must not exist before you run the job; the job creates it itself. Happy coding.
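For reference, here is a minimal sketch of how a driver for the old mapred API (which your JobClient/FileInputFormat log lines suggest you are using) would take the paths from the command line. The class name WordCount comes from your stack trace; the mapper/reducer wiring is omitted:

// Sketch only: mapper/reducer setup omitted; paths come from the command line.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        // conf.setMapperClass(...); conf.setReducerClass(...);
        // Without an input path here, you get "Total input paths to process : 0".
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        // The output directory must not already exist; the job creates it.
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}

The submission would then pass the HDFS paths explicitly, e.g. hadoop jar Untitled.jar WordCount /user/ankit/input /user/ankit/output (the directory names are placeholders).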

Related

Hadoop Mapreduce Job: Input path does not exist: hdfs://localhost:9000/user/******/grep-temp-1208171489

I am running an example MapReduce job that comes with Hadoop 2.8.1.
I am using these commands:
bin/hdfs dfs -copyFromLocal etc/hadoop/core-site.xml .
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar grep ./core-site.xml output 'configuration'
However, when I run this, the job exits (the main error appears in the title):
***********s-mbp-2:hadoop-2.8.1 ***********$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar grep ./core-site.xml output 'configuration'
17/09/12 14:30:28 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/09/12 14:30:29 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/09/12 14:30:29 INFO input.FileInputFormat: Total input files to process : 1
17/09/12 14:30:30 INFO mapreduce.JobSubmitter: number of splits:1
17/09/12 14:30:30 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1505251091124_0001
17/09/12 14:30:30 INFO impl.YarnClientImpl: Submitted application application_1505251091124_0001
17/09/12 14:30:30 INFO mapreduce.Job: The url to track the job: http://***********s-mbp-2.lan:8088/proxy/application_1505251091124_0001/
17/09/12 14:30:30 INFO mapreduce.Job: Running job: job_1505251091124_0001
17/09/12 14:30:33 INFO mapreduce.Job: Job job_1505251091124_0001 running in uber mode : false
17/09/12 14:30:33 INFO mapreduce.Job: map 0% reduce 0%
17/09/12 14:30:33 INFO mapreduce.Job: Job job_1505251091124_0001 failed with state FAILED due to: Application application_1505251091124_0001 failed 2 times due to AM Container for appattempt_1505251091124_0001_000002 exited with exitCode: 127
Failing this attempt.Diagnostics: Exception from container-launch.
Container id: container_1505251091124_0001_02_000001
Exit code: 127
Stack trace: ExitCodeException exitCode=127:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
at org.apache.hadoop.util.Shell.run(Shell.java:869)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:236)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:305)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:84)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Container exited with a non-zero exit code 127
For more detailed output, check the application tracking page: http://***********s-mbp-2.lan:8088/cluster/app/application_1505251091124_0001 Then click on links to logs of each attempt.
. Failing the application.
17/09/12 14:30:33 INFO mapreduce.Job: Counters: 0
17/09/12 14:30:33 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/09/12 14:30:33 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/***********/.staging/job_1505251091124_0002
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:9000/user/***********/grep-temp-576807334
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:329)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:271)
at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:393)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:303)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:320)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:198)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1341)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1338)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1338)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1359)
at org.apache.hadoop.examples.Grep.run(Grep.java:94)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.examples.Grep.main(Grep.java:103)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
What is going wrong, and how do I fix it?
It looks like the program can't access the path in the exception. You may try manually creating that directory in HDFS with mkdir before running the code. Hope it helps.
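If you want to try that, the shell equivalent is hdfs dfs -mkdir -p <path-from-the-exception>, or programmatically, a minimal sketch with the Hadoop FileSystem API (the path below is a placeholder for the grep-temp path in your own stack trace; substitute your username):

// Sketch: creates a missing HDFS directory up front. The path is a placeholder.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MakeTempDir {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Picks up fs.defaultFS (hdfs://localhost:9000 here) from core-site.xml.
        FileSystem fs = FileSystem.get(conf);
        Path tmp = new Path("/user/USERNAME/grep-temp-576807334"); // placeholder
        if (!fs.exists(tmp)) {
            fs.mkdirs(tmp);
        }
        fs.close();
    }
}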

reducer always fails and map succeeds

I am running a simple wordcount job on a 1 GB text file. My cluster has 8 datanodes and 1 namenode, each with a storage capacity of 3 GB.
When I run wordcount I can see that the map always succeeds, while the reducer throws an error and fails. Please find the error message below.
14/10/05 15:42:02 INFO mapred.JobClient: map 100% reduce 31%
14/10/05 15:42:07 INFO mapred.JobClient: Task Id : attempt_201410051534_0002_m_000016_0, Status : FAILED
FSError: java.io.IOException: No space left on device
14/10/05 15:42:14 INFO mapred.JobClient: Task Id : attempt_201410051534_0002_r_000000_0, Status : FAILED
java.io.IOException: Task: attempt_201410051534_0002_r_000000_0 - The reduce copier failed
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for file:/app/hadoop/tmp/mapred/local/taskTracker/hduser/jobcache/job_201410051534_0002/attempt_201410051534_0002_r_000000_0/output/map_18.out
Could you please tell me how I can fix this problem?
Thanks,
Navaz

MapReduce working on single-node cluster but not on multinode cluster

I am running a MapReduce program which works fine on my CDH quickstart VM, but when I try it on a multinode cluster it gives the below error:
WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/02/12 00:23:06 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
14/02/12 00:23:06 INFO input.FileInputFormat: Total input paths to process : 1
14/02/12 00:23:07 INFO mapred.JobClient: Running job: job_201401221117_5777
14/02/12 00:23:08 INFO mapred.JobClient: map 0% reduce 0%
14/02/12 00:23:16 INFO mapred.JobClient: Task Id : attempt_201401221117_5777_m_000000_0, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class Mappercsv not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1774)
at org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:191)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:631)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.ClassNotFoundException: Class Mappercsv not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1680)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1772)
... 8 more
Please help.
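One hint from the log itself: the warning "No job jar file set. User classes may not be found." usually means the driver never pointed Hadoop at the jar containing your classes, which would explain why Mappercsv resolves on the single-node quickstart VM (local classpath) but not on a real cluster. A minimal, hypothetical sketch of the usual fix in the driver (CsvDriver is a placeholder name):

// Hypothetical driver fragment; CsvDriver stands in for your driver class.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CsvDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "csv job");
        // Tells Hadoop which jar to ship to the cluster; without this,
        // remote task JVMs cannot load user classes such as Mappercsv.
        job.setJarByClass(CsvDriver.class);
        // job.setMapperClass(Mappercsv.class);  // your mapper class
        // ... input/output formats and paths, then:
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}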

Hadoop giving SCDynamicStore on my JAR but not on hadoop-examples.jar

I'm very confused about building and executing my first job in Hadoop and would love help from anyone who can clarify the error I am seeing and provide guidance :)
I have a JAR file that I've compiled. When I try to execute an M/R job using it on OS X, I get the SCDynamicStore error that is often associated with the HADOOP_OPTS environment variable. However, this does not happen when I run examples from the example JAR file. I have set the variable in hadoop-env.sh and it appears to be recognized in the cluster.
Running a test from hadoop-examples.jar works:
$ hadoop jar /usr/local/Cellar/hadoop/1.1.2/libexec/hadoop-examples-1.1.2.jar wordcount /stock/data /stock/count.out
13/06/22 13:21:51 INFO input.FileInputFormat: Total input paths to process : 3
13/06/22 13:21:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/06/22 13:21:51 WARN snappy.LoadSnappy: Snappy native library not loaded
13/06/22 13:21:51 INFO mapred.JobClient: Running job: job_201306221315_0003
13/06/22 13:21:52 INFO mapred.JobClient: map 0% reduce 0%
13/06/22 13:21:56 INFO mapred.JobClient: map 66% reduce 0%
13/06/22 13:21:58 INFO mapred.JobClient: map 100% reduce 0%
13/06/22 13:22:04 INFO mapred.JobClient: map 100% reduce 33%
13/06/22 13:22:05 INFO mapred.JobClient: map 100% reduce 100%
13/06/22 13:22:05 INFO mapred.JobClient: Job complete: job_201306221315_0003
...
Running a job using my own class does not work:
$ hadoop jar test.jar mapreduce.X /data /output
13/06/22 13:38:36 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/06/22 13:38:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/06/22 13:38:36 WARN snappy.LoadSnappy: Snappy native library not loaded
13/06/22 13:38:36 INFO mapred.FileInputFormat: Total input paths to process : 3
13/06/22 13:38:36 INFO mapred.JobClient: Running job: job_201306221328_0002
13/06/22 13:38:37 INFO mapred.JobClient: map 0% reduce 0%
13/06/22 13:38:44 INFO mapred.JobClient: Task Id : attempt_201306221328_0002_m_000000_0, Status : FAILED
java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
... 9 more
Caused by: java.lang.NoClassDefFoundError: com/google/gson/TypeAdapterFactory
at mapreduce.VerifyMarket$Map.<clinit>(VerifyMarket.java:26)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:249)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:802)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:847)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:873)
at org.apache.hadoop.mapred.JobConf.getMapperClass(JobConf.java:947)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
... 14 more
Caused by: java.lang.ClassNotFoundException: com.google.gson.TypeAdapterFactory
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 22 more
attempt_201306221328_0002_m_000000_0: 2013-06-22 13:38:39.314 java[60367:1203] Unable to load realm info from SCDynamicStore
13/06/22 13:38:44 INFO mapred.JobClient: Task Id : attempt_201306221328_0002_m_000001_0, Status : FAILED
... (This repeats a few times, but hopefully this is enough to see what I mean.)
Initially, I thought this was related to the aforementioned environment variable, but now I'm not so sure. Maybe I'm packaging my JAR incorrectly?
The easiest answer was to convert the project to Maven and include a gson dependency in the POM. Now mvn package picks up all the necessary dependencies and creates a single JAR file that contains everything necessary to complete the job in the cluster.
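For anyone hitting the same NoClassDefFoundError, the POM changes amount to something like the fragment below (version numbers are illustrative). Note that plain mvn package only bundles dependencies into a single JAR if you also configure a plugin such as maven-shade-plugin:

<!-- Illustrative pom.xml fragment; version numbers are placeholders. -->
<dependencies>
  <dependency>
    <groupId>com.google.code.gson</groupId>
    <artifactId>gson</artifactId>
    <version>2.2.4</version>
  </dependency>
</dependencies>
<build>
  <plugins>
    <!-- Folds gson (and other runtime deps) into the job JAR so task JVMs
         can resolve com.google.gson.TypeAdapterFactory on the cluster. -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>2.4.3</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals><goal>shade</goal></goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>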

Can distcp be used to copy a directory of files from S3 to HDFS?

I am wondering if hadoop distcp can be used to copy multiple files at once from S3 to HDFS. It appears to only work for individual files with absolute paths. I would like to copy either an entire directory, or use a wildcard.
See: Hadoop DistCp using wildcards?
I am aware of s3distcp, but I would prefer to use distcp for simplicity's sake.
Here was my attempt at copying a directory from S3 to HDFS:
[root@ip-10-147-167-56 ~]# /root/ephemeral-hdfs/bin/hadoop distcp s3n://<key>:<secret>@mybucket/dir hdfs:///input/
13/05/23 19:58:27 INFO tools.DistCp: srcPaths=[s3n://<key>:<secret>@mybucket/dir]
13/05/23 19:58:27 INFO tools.DistCp: destPath=hdfs:/input
13/05/23 19:58:29 INFO tools.DistCp: sourcePathsCount=4
13/05/23 19:58:29 INFO tools.DistCp: filesToCopyCount=3
13/05/23 19:58:29 INFO tools.DistCp: bytesToCopyCount=87.0
13/05/23 19:58:29 INFO mapred.JobClient: Running job: job_201305231521_0005
13/05/23 19:58:30 INFO mapred.JobClient: map 0% reduce 0%
13/05/23 19:58:45 INFO mapred.JobClient: Task Id : attempt_201305231521_0005_m_000000_0, Status : FAILED
java.lang.NullPointerException
at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.close(NativeS3FileSystem.java:106)
at java.io.BufferedInputStream.close(BufferedInputStream.java:468)
at java.io.FilterInputStream.close(FilterInputStream.java:172)
at org.apache.hadoop.tools.DistCp.checkAndClose(DistCp.java:1386)
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:434)
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:547)
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:314)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
13/05/23 19:58:55 INFO mapred.JobClient: Task Id : attempt_201305231521_0005_m_000000_1, Status : FAILED
java.lang.NullPointerException
at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.close(NativeS3FileSystem.java:106)
at java.io.BufferedInputStream.close(BufferedInputStream.java:468)
at java.io.FilterInputStream.close(FilterInputStream.java:172)
at org.apache.hadoop.tools.DistCp.checkAndClose(DistCp.java:1386)
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:434)
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:547)
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:314)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
13/05/23 19:59:04 INFO mapred.JobClient: Task Id : attempt_201305231521_0005_m_000000_2, Status : FAILED
java.lang.NullPointerException
at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.close(NativeS3FileSystem.java:106)
at java.io.BufferedInputStream.close(BufferedInputStream.java:468)
at java.io.FilterInputStream.close(FilterInputStream.java:172)
at org.apache.hadoop.tools.DistCp.checkAndClose(DistCp.java:1386)
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:434)
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:547)
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:314)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
13/05/23 19:59:18 INFO mapred.JobClient: Job complete: job_201305231521_0005
13/05/23 19:59:18 INFO mapred.JobClient: Counters: 6
13/05/23 19:59:18 INFO mapred.JobClient: Job Counters
13/05/23 19:59:18 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=38319
13/05/23 19:59:18 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/05/23 19:59:18 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/05/23 19:59:18 INFO mapred.JobClient: Launched map tasks=4
13/05/23 19:59:18 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
13/05/23 19:59:18 INFO mapred.JobClient: Failed map tasks=1
13/05/23 19:59:18 INFO mapred.JobClient: Job Failed: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201305231521_0005_m_000000
With failures, global counters are inaccurate; consider running with -i
Copy failed: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
at org.apache.hadoop.tools.DistCp.copy(DistCp.java:667)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)
You cannot use wildcards in s3n:// addresses.
However, it is possible to copy an entire directory from S3 to HDFS. The reason for the null pointer exceptions in this case was that the HDFS destination folder already existed.
Fix: delete the HDFS destination folder: ./hadoop fs -rmr /input/
Note 1: I also tried passing -update and -overwrite, but I still got NPE.
Note 2: https://hadoop.apache.org/docs/r1.2.1/distcp.html shows how to copy multiple explicit files.
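For completeness, the docs in Note 2 describe two ways to name multiple explicit sources: listing them on the command line, or passing a file of source URIs with -f. With the same s3n credential syntax as above (bucket and file names are placeholders), that would look like:

hadoop distcp s3n://<key>:<secret>@mybucket/dir/file1 s3n://<key>:<secret>@mybucket/dir/file2 hdfs:///input/
hadoop distcp -f hdfs:///srclist hdfs:///input/

where hdfs:///srclist is a file containing one source URI per line.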
