distcp hdfs to s3 fails - hadoop

I was trying to do one directory which has hundreds os small files with extension .avro
but it fails for some files with following error :
14/09/18 13:05:19 INFO mapred.JobClient: map 99% reduce 0%
14/09/18 13:05:22 INFO mapred.JobClient: map 100% reduce 0%
14/09/18 13:05:24 INFO mapred.JobClient: Task Id : attempt_201408291204_35665_m_000000_0, Status : FAILED
java.io.IOException: Copied: 32 Skipped: 0 Failed: 1
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.close(DistCp.java:584)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
14/09/18 13:05:25 INFO mapred.JobClient: map 83% reduce 0%
14/09/18 13:05:32 INFO mapred.JobClient: map 100% reduce 0%
14/09/18 13:05:32 INFO mapred.JobClient: Task Id : attempt_201408291204_35665_m_000005_0, Status : FAILED
java.io.IOException: Copied: 20 Skipped: 0 Failed: 1
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.close(DistCp.java:584)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
14/09/18 13:05:33 INFO mapred.JobClient: map 83% reduce 0%
14/09/18 13:05:41 INFO mapred.JobClient: map 93% reduce 0%
14/09/18 13:05:48 INFO mapred.JobClient: map 100% reduce 0%
14/09/18 13:05:51 INFO mapred.JobClient: Job complete: job_201408291204_35665
14/09/18 13:05:51 INFO mapred.JobClient: Counters: 33
14/09/18 13:05:51 INFO mapred.JobClient: File System Counters
14/09/18 13:05:51 INFO mapred.JobClient: FILE: Number of bytes read=0
14/09/18 13:05:51 INFO mapred.JobClient: FILE: Number of bytes written=1050200
14/09/18 13:05:51 INFO mapred.JobClient: FILE: Number of read operations=0
14/09/18 13:05:51 INFO mapred.JobClient: FILE: Number of large read operations=0
14/09/18 13:05:51 INFO mapred.JobClient: FILE: Number of write operations=0
14/09/18 13:05:51 INFO mapred.JobClient: HDFS: Number of bytes read=782797980
14/09/18 13:05:51 INFO mapred.JobClient: HDFS: Number of bytes written=0
14/09/18 13:05:51 INFO mapred.JobClient: HDFS: Number of read operations=88
14/09/18 13:05:51 INFO mapred.JobClient: HDFS: Number of large read operations=0
14/09/18 13:05:51 INFO mapred.JobClient: HDFS: Number of write operations=0
14/09/18 13:05:51 INFO mapred.JobClient: S3: Number of bytes read=0
14/09/18 13:05:51 INFO mapred.JobClient: S3: Number of bytes written=782775062
14/09/18 13:05:51 INFO mapred.JobClient: S3: Number of read operations=0
14/09/18 13:05:51 INFO mapred.JobClient: S3: Number of large read operations=0
14/09/18 13:05:51 INFO mapred.JobClient: S3: Number of write operations=0
14/09/18 13:05:51 INFO mapred.JobClient: Job Counters
14/09/18 13:05:51 INFO mapred.JobClient: Launched map tasks=8
14/09/18 13:05:51 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=454335
14/09/18 13:05:51 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=0
14/09/18 13:05:51 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/09/18 13:05:51 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/09/18 13:05:51 INFO mapred.JobClient: Map-Reduce Framework
14/09/18 13:05:51 INFO mapred.JobClient: Map input records=125
14/09/18 13:05:51 INFO mapred.JobClient: Map output records=53
14/09/18 13:05:51 INFO mapred.JobClient: Input split bytes=798
14/09/18 13:05:51 INFO mapred.JobClient: Spilled Records=0
14/09/18 13:05:51 INFO mapred.JobClient: CPU time spent (ms)=50250
14/09/18 13:05:51 INFO mapred.JobClient: Physical memory (bytes) snapshot=1930326016
14/09/18 13:05:51 INFO mapred.JobClient: Virtual memory (bytes) snapshot=9781469184
14/09/18 13:05:51 INFO mapred.JobClient: Total committed heap usage (bytes)=5631639552
14/09/18 13:05:51 INFO mapred.JobClient: org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
14/09/18 13:05:51 INFO mapred.JobClient: BYTES_READ=22883
14/09/18 13:05:51 INFO mapred.JobClient: distcp
14/09/18 13:05:51 INFO mapred.JobClient: Bytes copied=782769559
14/09/18 13:05:51 INFO mapred.JobClient: Bytes expected=782769559
14/09/18 13:05:51 INFO mapred.JobClient: Files copied=70
14/09/18 13:05:51 INFO mapred.JobClient: Files skipped=53
Here more snippet from JobTracker UI :
2014-09-18 13:04:24,381 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem: OutputStream for key '09/01/01/SEARCHES/_distcp_tmp_hrb8ba/part-m-00005.avro' upload complete
2014-09-18 13:04:25,136 INFO org.apache.hadoop.tools.DistCp: FAIL part-m-00005.avro : java.io.IOException: Fail to rename tmp file (=s3://magnetic-test/09/01/01/SEARCHES/_distcp_tmp_hrb8ba/part-m-00005.avro) to destination file (=s3://abc/09/01/01/SEARCHES/part-m-00005.avro)
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.rename(DistCp.java:494)
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:463)
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:549)
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:316)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.io.IOException
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.rename(DistCp.java:490)
... 11 more
Anyone know about this issue ?

Got this resolved by adding -D mapred.task.timeout=60000000 in distcp command

I tried the suggested answer, but with no luck. I experienced the issue when copying many small files (in the order of thousands, which in total did not account for more than half a gigabyte). I couldn't make distcp command work (same error as posted by OP), so switching to hadoop fs -cp was my solution. As a side note, in the same cluster, using distcp for copying other, much larger files worked ok.

Related

Map and Reduce execute 100% each but streaming job fails. python

I'm running a Graph Traversal algorithm using map reduce, and it gives the desired output when tested without using hadoop. but on running the command :
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar -file /home/hduser/finalmap.py -mapper 'python finalmap.py' -file /home/hduser/finalred.py -reducer 'python finalred.py' -input /Random_Walk_Input -output Random_Walk_Output1
the following happens :
16/01/27 11:03:51 INFO mapreduce.Job: map 0% reduce 0%
16/01/27 11:03:55 INFO mapreduce.Job: map 33% reduce 0%
16/01/27 11:04:02 INFO mapreduce.Job: Task Id : attempt_1453872707553_0001_m_000001_1, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1 at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322) at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535) at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61) at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
16/01/27 11:04:03 INFO mapreduce.Job: map 50% reduce 0%
16/01/27 11:04:14 INFO mapreduce.Job: Task Id : attempt_1453872707553_0001_m_000001_2, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1 at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322) at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535) at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61) at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
16/01/27 11:04:22 INFO mapreduce.Job: map 50% reduce 17%
16/01/27 11:04:25 INFO mapreduce.Job: map 100% reduce 100%
16/01/27 11:04:26 INFO mapreduce.Job: Job job_1453872707553_0001 failed with state FAILED due to: Task failed task_1453872707553_0001_m_000001
Job failed as tasks failed. failedMaps:1 failedReduces:0
16/01/27 11:04:27 INFO mapreduce.Job: Counters: 39 File System Counters FILE: Number of bytes read=0 FILE: Number of bytes written=15725173 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=413787 HDFS: Number of bytes written=0 HDFS: Number of read operations=3 HDFS: Number of large read operations=0 HDFS: Number of write operations=0 Job Counters Failed map tasks=4 Killed reduce tasks=1 Launched map tasks=5 Launched reduce tasks=1 Other local map tasks=3 Data-local map tasks=2 Total time spent by all maps in occupied slots (ms)=68482 Total time spent by all reduces in occupied slots (ms)=19382 Total time spent by all map tasks (ms)=68482 Total time spent by all reduce tasks (ms)=19382 Total vcore-seconds taken by all map tasks=68482 Total vcore-seconds taken by all reduce tasks=19382 Total megabyte-seconds taken by all map tasks=70125568 Total megabyte-seconds taken by all reduce tasks=19847168 Map-Reduce Framework Map input records=17666 Map output records=767145 Map output bytes=14081829 Map output materialized bytes=15616125 Input split bytes=91 Combine input records=0 Spilled Records=767145 Failed Shuffles=0 Merged Map outputs=0 GC time elapsed (ms)=229 CPU time spent (ms)=17120 Physical memory (bytes) snapshot=269684736 Virtual memory (bytes) snapshot=852369408 Total committed heap usage (bytes)=200802304 File Input Format Counters Bytes Read=413696 16/01/27 11:04:27 ERROR streaming.StreamJob: Job not successful! Streaming Command Failed!
What does this mean? It shows mapper and reducer have executed 100% each but again says failed maps :1 and failed reduces :0
Make sure that your streaming jar version and the hadoop version match(They are of the same version number)
This fixed the error for me!!

Hadoop MapReduce job is using only two reducers out of 16

I am new to Hadoop MR. I have a 4 node cluster which has 32 Map slots and 16 Reduce slots. Job is processing close to 100 GB of data using 761 Maps and 2 Reducers.
My question is why is it using 2 reducers only. Please let me if I missed any configuration related to Reducers or it is expected.
I have set below property in mapreduce configuration but still it is using 2 reducers.
Default Number of Reduce Tasks per Job
mapred.reduce.tasks=8
Log:
15/12/30 14:58:56 INFO mapred.JobClient: Job complete: job_201512301313_0002
15/12/30 14:58:56 INFO mapred.JobClient: Counters: 33
15/12/30 14:58:56 INFO mapred.JobClient: File System Counters
15/12/30 14:58:56 INFO mapred.JobClient: FILE: Number of bytes read=11711801793
15/12/30 14:58:56 INFO mapred.JobClient: FILE: Number of bytes written=24324166884
15/12/30 14:58:56 INFO mapred.JobClient: FILE: Number of read operations=0
15/12/30 14:58:56 INFO mapred.JobClient: FILE: Number of large read operations=0
15/12/30 14:58:56 INFO mapred.JobClient: FILE: Number of write operations=0
15/12/30 14:58:56 INFO mapred.JobClient: HDFS: Number of bytes read=101855418108
15/12/30 14:58:56 INFO mapred.JobClient: HDFS: Number of bytes written=821001518
15/12/30 14:58:56 INFO mapred.JobClient: HDFS: Number of read operations=1536
15/12/30 14:58:56 INFO mapred.JobClient: HDFS: Number of large read operations=0
15/12/30 14:58:56 INFO mapred.JobClient: HDFS: Number of write operations=2
15/12/30 14:58:56 INFO mapred.JobClient: Job Counters
15/12/30 14:58:56 INFO mapred.JobClient: Launched map tasks=761
15/12/30 14:58:56 INFO mapred.JobClient: Launched reduce tasks=2
15/12/30 14:58:56 INFO mapred.JobClient: Data-local map tasks=753
15/12/30 14:58:56 INFO mapred.JobClient: Rack-local map tasks=8
15/12/30 14:58:56 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=10467348
15/12/30 14:58:56 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=936182
15/12/30 14:58:56 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
15/12/30 14:58:56 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0

From Hadoop logs how can I find intermediate output byte sizes & reduce output bytes sizes?

From hadoop logs, How can I estimate the size of total intermediate outputs of Mappers(in Bytes) and the size of total outputs of Reducers(in Bytes)?
My mappers and reducers use LZO compression, and I want to know the size of mapper/reducer outputs after compression.
15/06/06 17:19:15 INFO mapred.JobClient: map 100% reduce 94%
15/06/06 17:19:16 INFO mapred.JobClient: map 100% reduce 98%
15/06/06 17:19:17 INFO mapred.JobClient: map 100% reduce 99%
15/06/06 17:20:04 INFO mapred.JobClient: map 100% reduce 100%
15/06/06 17:20:05 INFO mapred.JobClient: Job complete: job_201506061602_0026
15/06/06 17:20:05 INFO mapred.JobClient: Counters: 30
15/06/06 17:20:05 INFO mapred.JobClient: Job Counters
15/06/06 17:20:05 INFO mapred.JobClient: Launched reduce tasks=401
15/06/06 17:20:05 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=1203745
15/06/06 17:20:05 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
15/06/06 17:20:05 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
15/06/06 17:20:05 INFO mapred.JobClient: Rack-local map tasks=50
15/06/06 17:20:05 INFO mapred.JobClient: Launched map tasks=400
15/06/06 17:20:05 INFO mapred.JobClient: Data-local map tasks=350
15/06/06 17:20:05 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=6642599
15/06/06 17:20:05 INFO mapred.JobClient: File Output Format Counters
15/06/06 17:20:05 INFO mapred.JobClient: Bytes Written=534808008
15/06/06 17:20:05 INFO mapred.JobClient: FileSystemCounters
15/06/06 17:20:05 INFO mapred.JobClient: FILE_BYTES_READ=247949371
15/06/06 17:20:05 INFO mapred.JobClient: HDFS_BYTES_READ=168030609
15/06/06 17:20:05 INFO mapred.JobClient: FILE_BYTES_WRITTEN=651797418
15/06/06 17:20:05 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=534808008
15/06/06 17:20:05 INFO mapred.JobClient: File Input Format Counters
15/06/06 17:20:05 INFO mapred.JobClient: Bytes Read=167978609
15/06/06 17:20:05 INFO mapred.JobClient: Map-Reduce Framework
15/06/06 17:20:05 INFO mapred.JobClient: Map output materialized bytes=354979707
15/06/06 17:20:05 INFO mapred.JobClient: Map input records=3774768
15/06/06 17:20:05 INFO mapred.JobClient: Reduce shuffle bytes=354979707
15/06/06 17:20:05 INFO mapred.JobClient: Spilled Records=56007636
15/06/06 17:20:05 INFO mapred.JobClient: Map output bytes=336045816
15/06/06 17:20:05 INFO mapred.JobClient: Total committed heap usage (bytes)=592599187456
15/06/06 17:20:05 INFO mapred.JobClient: CPU time spent (ms)=9204120
15/06/06 17:20:05 INFO mapred.JobClient: Combine input records=0
15/06/06 17:20:05 INFO mapred.JobClient: SPLIT_RAW_BYTES=52000
15/06/06 17:20:05 INFO mapred.JobClient: Reduce input records=28003818
15/06/06 17:20:05 INFO mapred.JobClient: Reduce input groups=11478107
15/06/06 17:20:05 INFO mapred.JobClient: Combine output records=0
15/06/06 17:20:05 INFO mapred.JobClient: Physical memory (bytes) snapshot=516784615424
15/06/06 17:20:05 INFO mapred.JobClient: Reduce output records=94351104
15/06/06 17:20:05 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1911619866624
15/06/06 17:20:05 INFO mapred.JobClient: Map output records=28003818
You can get these info by using FileSystemCounters. Details of the terms used in this counter is given below:
FILE_BYTES_READ is the number of bytes read by local file system. Assume all the map input data comes from HDFS, then in map phase FILE_BYTES_READ should be zero. On the other hand, the input file of reducers are data on the reduce-side local disks which are fetched from map-side disks. Therefore, FILE_BYTES_READ denotes the total bytes read by reducers.
FILE_BYTES_WRITTEN consists of two parts. The first part comes from mappers. All the mappers will spill intermediate output to disk. All the bytes that mappers write to disk will be included in FILE_BYTES_WRITTEN. The second part comes from reducers. In the shuffle phase, all the reducers will fetch intermediate data from mappers and merge and spill to reducer-side disks. All the bytes that reducers write to disk will also be included in FILE_BYTES_WRITTEN.
HDFS_BYTES_READ denotes the bytes read by mappers from HDFS when the job starts. This data includes not only the content of source file but also metadata about splits.
HDFS_BYTES_WRITTEN denotes the bytes written to HDFS. It’s the number of bytes of the final output.

Hadoop distcp command not working

Hi i am trying to move my data from cluster having CDH4.3 to cluster having CDH4.5.
I am executing the following command.
hadoop distcp -update hftp://server1:50070/hbase/test/x hdfs://server2:8020/copy/
After executing i am getting the following error:
14/01/28 19:42:43 INFO tools.DistCp: srcPaths=[hftp://server1:50070/hbase/test/x]
14/01/28 19:42:43 INFO tools.DistCp: destPath=hdfs://server2:8020/copy
14/01/28 19:42:45 INFO tools.DistCp: sourcePathsCount=1
14/01/28 19:42:45 INFO tools.DistCp: filesToCopyCount=1
14/01/28 19:42:45 INFO tools.DistCp: bytesToCopyCount=1
14/01/28 19:42:46 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/01/28 19:42:47 INFO mapred.JobClient: Running job: job_201401101918_0008
14/01/28 19:42:48 INFO mapred.JobClient: map 0% reduce 0%
14/01/28 19:43:05 INFO mapred.JobClient: map 100% reduce 0%
14/01/28 19:43:07 INFO mapred.JobClient: Task Id : attempt_201401101918_0008_m_000000_0, Status : FAILED
14/01/28 19:43:08 INFO mapred.JobClient: map 0% reduce 0%
14/01/28 19:43:19 INFO mapred.JobClient: map 100% reduce 0%
14/01/28 19:43:22 INFO mapred.JobClient: Task Id : attempt_201401101918_0008_m_000000_1, Status : FAILED
java.io.IOException: Copied: 0 Skipped: 0 Failed: 1
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.close(DistCp.java:582)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
14/01/28 19:43:23 INFO mapred.JobClient: map 0% reduce 0%
14/01/28 19:43:33 INFO mapred.JobClient: map 100% reduce 0%
14/01/28 19:43:35 INFO mapred.JobClient: Task Id : attempt_201401101918_0008_m_000000_2, Status : FAILED
java.io.IOException: Copied: 0 Skipped: 0 Failed: 1
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.close(DistCp.java:582)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
14/01/28 19:43:36 INFO mapred.JobClient: map 0% reduce 0%
14/01/28 19:43:46 INFO mapred.JobClient: map 100% reduce 0%
14/01/28 19:43:50 INFO mapred.JobClient: map 0% reduce 0%
14/01/28 19:43:53 INFO mapred.JobClient: Job complete: job_201401101918_0008
14/01/28 19:43:53 INFO mapred.JobClient: Counters: 6
14/01/28 19:43:53 INFO mapred.JobClient: Job Counters
14/01/28 19:43:53 INFO mapred.JobClient: Failed map tasks=1
14/01/28 19:43:53 INFO mapred.JobClient: Launched map tasks=4
14/01/28 19:43:53 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=64095
14/01/28 19:43:53 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=0
14/01/28 19:43:53 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/01/28 19:43:53 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/01/28 19:43:53 INFO mapred.JobClient: Job Failed: NA
With failures, global counters are inaccurate; consider running with -i
Copy failed: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1388)
at org.apache.hadoop.tools.DistCp.copy(DistCp.java:667)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)
You have new mail in /var/spool/mail/root
[hdfs#sdl1039 root]$ hadoop distcp -update hftp://server1:50070/hbase/test/x hdfs://server2:8020/copy hadoop distcp -update hftp://server1:50070/hbase/test/x hdfs://server2:8020/copy
14/01/28 19:46:09 INFO tools.DistCp: srcPaths=[hftp://server1:50070/hbase/test/x, hdfs://server2:8020/copy, hadoop, distcp, hftp://server1:50070/hbase/test/x]
14/01/28 19:46:09 INFO tools.DistCp: destPath=hdfs://server2:8020/copy
With failures, global counters are inaccurate; consider running with -i
Copy failed: org.apache.hadoop.mapred.InvalidInputException: Input source hadoop does not exist.
Input source distcp does not exist.
at org.apache.hadoop.tools.DistCp.checkSrcPath(DistCp.java:641)
at org.apache.hadoop.tools.DistCp.copy(DistCp.java:656)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)
Please guide me where i am going wrong.
I got a solution for now
hadoop distcp -update hdfs://server1:8020/hbase/test/x hdfs://server2:8020/copy/
But definatly would like to know why hftp is not working for me.
I think u have a wrong port number for hftp. 50070 is the default port for namenode web ui.
try :
hadoop distcp -update hftp://server1/hbase/test/x hdfs://server2:8020/copy/

hadoop showing map reduce percentages running twice

I'm running Apache's Hadoop, and using the grep example provided by that installation. I'm wondering why map reduce percentages show up running twice? I thought they only had to run once; which makes me doubt my understanding of map reduce. I looked it up (http://grokbase.com/t/gg/mongodb-user/125ay1eazq/map-reduce-percentage-seems-running-twice) but there really wasn't an explanation and this link was for MongoDB.
hduser#ubse1:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar grep /user/hduser/grep /user/hduser/grep-output4 ".*woe is me.*"
I'm running this on a project gutenberg .txt file. The output file is correct.
Here is the output for running the command if needed:
12/08/06 06:56:57 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/08/06 06:56:57 WARN snappy.LoadSnappy: Snappy native library not loaded
12/08/06 06:56:57 INFO mapred.FileInputFormat: Total input paths to process : 1
12/08/06 06:56:58 INFO mapred.JobClient: Running job: job_201208030925_0011
12/08/06 06:56:59 INFO mapred.JobClient: map 0% reduce 0%
12/08/06 06:57:18 INFO mapred.JobClient: map 100% reduce 0%
12/08/06 06:57:30 INFO mapred.JobClient: map 100% reduce 100%
12/08/06 06:57:35 INFO mapred.JobClient: Job complete: job_201208030925_0011
12/08/06 06:57:35 INFO mapred.JobClient: Counters: 30
12/08/06 06:57:35 INFO mapred.JobClient: Job Counters
12/08/06 06:57:35 INFO mapred.JobClient: Launched reduce tasks=1
12/08/06 06:57:35 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=31034
12/08/06 06:57:35 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/08/06 06:57:35 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/08/06 06:57:35 INFO mapred.JobClient: Rack-local map tasks=2
12/08/06 06:57:35 INFO mapred.JobClient: Launched map tasks=2
12/08/06 06:57:35 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=11233
12/08/06 06:57:35 INFO mapred.JobClient: File Input Format Counters
12/08/06 06:57:35 INFO mapred.JobClient: Bytes Read=5592666
12/08/06 06:57:35 INFO mapred.JobClient: File Output Format Counters
12/08/06 06:57:35 INFO mapred.JobClient: Bytes Written=391
12/08/06 06:57:35 INFO mapred.JobClient: FileSystemCounters
12/08/06 06:57:35 INFO mapred.JobClient: FILE_BYTES_READ=281
12/08/06 06:57:35 INFO mapred.JobClient: HDFS_BYTES_READ=5592862
12/08/06 06:57:35 INFO mapred.JobClient: FILE_BYTES_WRITTEN=65331
12/08/06 06:57:35 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=391
12/08/06 06:57:35 INFO mapred.JobClient: Map-Reduce Framework
12/08/06 06:57:35 INFO mapred.JobClient: Map output materialized bytes=287
12/08/06 06:57:35 INFO mapred.JobClient: Map input records=124796
12/08/06 06:57:35 INFO mapred.JobClient: Reduce shuffle bytes=287
12/08/06 06:57:35 INFO mapred.JobClient: Spilled Records=10
12/08/06 06:57:35 INFO mapred.JobClient: Map output bytes=265
12/08/06 06:57:35 INFO mapred.JobClient: Total committed heap usage (bytes)=336404480
12/08/06 06:57:35 INFO mapred.JobClient: CPU time spent (ms)=7040
12/08/06 06:57:35 INFO mapred.JobClient: Map input bytes=5590193
12/08/06 06:57:35 INFO mapred.JobClient: SPLIT_RAW_BYTES=196
12/08/06 06:57:35 INFO mapred.JobClient: Combine input records=5
12/08/06 06:57:35 INFO mapred.JobClient: Reduce input records=5
12/08/06 06:57:35 INFO mapred.JobClient: Reduce input groups=5
12/08/06 06:57:35 INFO mapred.JobClient: Combine output records=5
12/08/06 06:57:35 INFO mapred.JobClient: Physical memory (bytes) snapshot=464568320
12/08/06 06:57:35 INFO mapred.JobClient: Reduce output records=5
12/08/06 06:57:35 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1539559424
12/08/06 06:57:35 INFO mapred.JobClient: Map output records=5
12/08/06 06:57:35 INFO mapred.FileInputFormat: Total input paths to process : 1
12/08/06 06:57:35 INFO mapred.JobClient: Running job: job_201208030925_0012
12/08/06 06:57:36 INFO mapred.JobClient: map 0% reduce 0%
12/08/06 06:57:50 INFO mapred.JobClient: map 100% reduce 0%
12/08/06 06:58:05 INFO mapred.JobClient: map 100% reduce 100%
12/08/06 06:58:10 INFO mapred.JobClient: Job complete: job_201208030925_0012
12/08/06 06:58:10 INFO mapred.JobClient: Counters: 30
12/08/06 06:58:10 INFO mapred.JobClient: Job Counters
12/08/06 06:58:10 INFO mapred.JobClient: Launched reduce tasks=1
12/08/06 06:58:10 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=15432
12/08/06 06:58:10 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/08/06 06:58:10 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/08/06 06:58:10 INFO mapred.JobClient: Rack-local map tasks=1
12/08/06 06:58:10 INFO mapred.JobClient: Launched map tasks=1
12/08/06 06:58:10 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=14264
12/08/06 06:58:10 INFO mapred.JobClient: File Input Format Counters
12/08/06 06:58:10 INFO mapred.JobClient: Bytes Read=391
12/08/06 06:58:10 INFO mapred.JobClient: File Output Format Counters
12/08/06 06:58:10 INFO mapred.JobClient: Bytes Written=235
12/08/06 06:58:10 INFO mapred.JobClient: FileSystemCounters
12/08/06 06:58:10 INFO mapred.JobClient: FILE_BYTES_READ=281
12/08/06 06:58:10 INFO mapred.JobClient: HDFS_BYTES_READ=505
12/08/06 06:58:10 INFO mapred.JobClient: FILE_BYTES_WRITTEN=42985
12/08/06 06:58:10 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=235
12/08/06 06:58:10 INFO mapred.JobClient: Map-Reduce Framework
12/08/06 06:58:10 INFO mapred.JobClient: Map output materialized bytes=281
12/08/06 06:58:10 INFO mapred.JobClient: Map input records=5
12/08/06 06:58:10 INFO mapred.JobClient: Reduce shuffle bytes=0
12/08/06 06:58:10 INFO mapred.JobClient: Spilled Records=10
EDIT Driver Class for Grep:
Grep.java
/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.examples;
import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
/* Extracts matching regexs from input files and counts them. */
public class Grep extends Configured implements Tool {
private Grep() {} // singleton
public int run(String[] args) throws Exception {
if (args.length < 3) {
System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
ToolRunner.printGenericCommandUsage(System.out);
return -1;
}
Path tempDir =
new Path("grep-temp-"+
Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
JobConf grepJob = new JobConf(getConf(), Grep.class);
try {
grepJob.setJobName("grep-search");
FileInputFormat.setInputPaths(grepJob, args[0]);
grepJob.setMapperClass(RegexMapper.class);
grepJob.set("mapred.mapper.regex", args[2]);
if (args.length == 4)
grepJob.set("mapred.mapper.regex.group", args[3]);
grepJob.setCombinerClass(LongSumReducer.class);
grepJob.setReducerClass(LongSumReducer.class);
FileOutputFormat.setOutputPath(grepJob, tempDir);
grepJob.setOutputFormat(SequenceFileOutputFormat.class);
grepJob.setOutputKeyClass(Text.class);
grepJob.setOutputValueClass(LongWritable.class);
JobClient.runJob(grepJob);
JobConf sortJob = new JobConf(getConf(), Grep.class);
sortJob.setJobName("grep-sort");
FileInputFormat.setInputPaths(sortJob, tempDir);
sortJob.setInputFormat(SequenceFileInputFormat.class);
sortJob.setMapperClass(InverseMapper.class);
sortJob.setNumReduceTasks(1); // write a single file
FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
sortJob.setOutputKeyComparatorClass // sort by decreasing freq
(LongWritable.DecreasingComparator.class);
JobClient.runJob(sortJob);
}
finally {
FileSystem.get(grepJob).delete(tempDir, true);
}
return 0;
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new Grep(), args);
System.exit(res);
}
}
In the file there are the statistics of two jobs: job: job_201208030925_0011 and job: job_201208030925_0012. The percentages belong to these two jobs, hence there are 2 map progress percentages.

Resources