Hadoop MapReduce job is using only two reducers out of 16

I am new to Hadoop MR. I have a 4-node cluster with 32 map slots and 16 reduce slots. The job is processing close to 100 GB of data using 761 maps and 2 reducers.
My question is: why is it using only 2 reducers? Please let me know if I missed any configuration related to reducers, or whether this is expected.
I have set the property below in the MapReduce configuration, but it still uses 2 reducers.
Default Number of Reduce Tasks per Job
mapred.reduce.tasks=8
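For reference, in MR1-era (JobTracker) clusters this default is usually set in mapred-site.xml; a sketch of the standard property block is below (the value mirrors the setting quoted above). Note that a reducer count set programmatically in the job's driver, e.g. via JobConf.setNumReduceTasks(), takes precedence over this file, which is one common reason a job ignores the configured default.

```xml
<!-- mapred-site.xml: default number of reduce tasks per job.
     A driver that calls setNumReduceTasks() overrides this value. -->
<property>
  <name>mapred.reduce.tasks</name>
  <value>8</value>
</property>
```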
Log:
15/12/30 14:58:56 INFO mapred.JobClient: Job complete: job_201512301313_0002
15/12/30 14:58:56 INFO mapred.JobClient: Counters: 33
15/12/30 14:58:56 INFO mapred.JobClient: File System Counters
15/12/30 14:58:56 INFO mapred.JobClient: FILE: Number of bytes read=11711801793
15/12/30 14:58:56 INFO mapred.JobClient: FILE: Number of bytes written=24324166884
15/12/30 14:58:56 INFO mapred.JobClient: FILE: Number of read operations=0
15/12/30 14:58:56 INFO mapred.JobClient: FILE: Number of large read operations=0
15/12/30 14:58:56 INFO mapred.JobClient: FILE: Number of write operations=0
15/12/30 14:58:56 INFO mapred.JobClient: HDFS: Number of bytes read=101855418108
15/12/30 14:58:56 INFO mapred.JobClient: HDFS: Number of bytes written=821001518
15/12/30 14:58:56 INFO mapred.JobClient: HDFS: Number of read operations=1536
15/12/30 14:58:56 INFO mapred.JobClient: HDFS: Number of large read operations=0
15/12/30 14:58:56 INFO mapred.JobClient: HDFS: Number of write operations=2
15/12/30 14:58:56 INFO mapred.JobClient: Job Counters
15/12/30 14:58:56 INFO mapred.JobClient: Launched map tasks=761
15/12/30 14:58:56 INFO mapred.JobClient: Launched reduce tasks=2
15/12/30 14:58:56 INFO mapred.JobClient: Data-local map tasks=753
15/12/30 14:58:56 INFO mapred.JobClient: Rack-local map tasks=8
15/12/30 14:58:56 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=10467348
15/12/30 14:58:56 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=936182
15/12/30 14:58:56 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
15/12/30 14:58:56 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0

Related

Hadoop producing no output?

I've recently started learning how to use the Hadoop system, and decided it's time to try writing some code. Before that, I wanted to try running the examples seen in the Getting Started page. However, it does not seem to produce any visible results.
I'm currently using Hadoop 3.3.1 in a single-node setup with JDK 11.0.11. I am running this on Windows 10 (due to current development requirements).
I've used the following command on cmd:
hadoop jar %hadoop_home%/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar grep input /output 'dfs[a-z.]+'
The output to the command:
C:\Windows\system32>hadoop jar %hadoop_home%/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar grep input /output 'dfs[a-z.]+'
2021-12-15 00:33:10,486 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at /0.0.0.0:8032
2021-12-15 00:33:10,800 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/E/.staging/job_1639519343908_0005
2021-12-15 00:33:11,029 INFO input.FileInputFormat: Total input files to process : 10
2021-12-15 00:33:11,108 INFO mapreduce.JobSubmitter: number of splits:10
2021-12-15 00:33:11,281 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1639519343908_0005
2021-12-15 00:33:11,281 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-12-15 00:33:11,442 INFO conf.Configuration: resource-types.xml not found
2021-12-15 00:33:11,443 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2021-12-15 00:33:11,497 INFO impl.YarnClientImpl: Submitted application application_1639519343908_0005
2021-12-15 00:33:11,527 INFO mapreduce.Job: The url to track the job: http://DESKTOP-S15C716:8088/proxy/application_1639519343908_0005/
2021-12-15 00:33:11,528 INFO mapreduce.Job: Running job: job_1639519343908_0005
2021-12-15 00:33:19,611 INFO mapreduce.Job: Job job_1639519343908_0005 running in uber mode : false
2021-12-15 00:33:19,615 INFO mapreduce.Job: map 0% reduce 0%
2021-12-15 00:33:31,178 INFO mapreduce.Job: map 50% reduce 0%
2021-12-15 00:33:32,263 INFO mapreduce.Job: map 60% reduce 0%
2021-12-15 00:33:39,624 INFO mapreduce.Job: map 90% reduce 0%
2021-12-15 00:33:40,632 INFO mapreduce.Job: map 100% reduce 0%
2021-12-15 00:33:41,636 INFO mapreduce.Job: map 100% reduce 100%
2021-12-15 00:33:41,648 INFO mapreduce.Job: Job job_1639519343908_0005 completed successfully
2021-12-15 00:33:41,760 INFO mapreduce.Job: Counters: 51
File System Counters
FILE: Number of bytes read=6
FILE: Number of bytes written=3021766
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=31877
HDFS: Number of bytes written=86
HDFS: Number of read operations=35
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
HDFS: Number of bytes read erasure-coded=0
Job Counters
Killed map tasks=1
Launched map tasks=10
Launched reduce tasks=1
Data-local map tasks=10
Total time spent by all maps in occupied slots (ms)=89653
Total time spent by all reduces in occupied slots (ms)=8222
Total time spent by all map tasks (ms)=89653
Total time spent by all reduce tasks (ms)=8222
Total vcore-milliseconds taken by all map tasks=89653
Total vcore-milliseconds taken by all reduce tasks=8222
Total megabyte-milliseconds taken by all map tasks=91804672
Total megabyte-milliseconds taken by all reduce tasks=8419328
Map-Reduce Framework
Map input records=819
Map output records=0
Map output bytes=0
Map output materialized bytes=60
Input split bytes=1139
Combine input records=0
Combine output records=0
Reduce input groups=0
Reduce shuffle bytes=60
Reduce input records=0
Reduce output records=0
Spilled Records=0
Shuffled Maps =10
Failed Shuffles=0
Merged Map outputs=10
GC time elapsed (ms)=90
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=2952790016
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=30738
File Output Format Counters
Bytes Written=86
2021-12-15 00:33:41,790 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at /0.0.0.0:8032
2021-12-15 00:33:41,814 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/E/.staging/job_1639519343908_0006
2021-12-15 00:33:41,855 INFO input.FileInputFormat: Total input files to process : 1
2021-12-15 00:33:41,913 INFO mapreduce.JobSubmitter: number of splits:1
2021-12-15 00:33:41,950 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1639519343908_0006
2021-12-15 00:33:41,950 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-12-15 00:33:42,179 INFO impl.YarnClientImpl: Submitted application application_1639519343908_0006
2021-12-15 00:33:42,190 INFO mapreduce.Job: The url to track the job: http://DESKTOP-S15C716:8088/proxy/application_1639519343908_0006/
2021-12-15 00:33:42,191 INFO mapreduce.Job: Running job: job_1639519343908_0006
2021-12-15 00:33:55,301 INFO mapreduce.Job: Job job_1639519343908_0006 running in uber mode : false
2021-12-15 00:33:55,302 INFO mapreduce.Job: map 0% reduce 0%
2021-12-15 00:34:00,336 INFO mapreduce.Job: map 100% reduce 0%
2021-12-15 00:34:06,366 INFO mapreduce.Job: map 100% reduce 100%
2021-12-15 00:34:07,375 INFO mapreduce.Job: Job job_1639519343908_0006 completed successfully
2021-12-15 00:34:07,404 INFO mapreduce.Job: Counters: 50
File System Counters
FILE: Number of bytes read=6
FILE: Number of bytes written=548197
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=212
HDFS: Number of bytes written=0
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=3232
Total time spent by all reduces in occupied slots (ms)=3610
Total time spent by all map tasks (ms)=3232
Total time spent by all reduce tasks (ms)=3610
Total vcore-milliseconds taken by all map tasks=3232
Total vcore-milliseconds taken by all reduce tasks=3610
Total megabyte-milliseconds taken by all map tasks=3309568
Total megabyte-milliseconds taken by all reduce tasks=3696640
Map-Reduce Framework
Map input records=0
Map output records=0
Map output bytes=0
Map output materialized bytes=6
Input split bytes=126
Combine input records=0
Combine output records=0
Reduce input groups=0
Reduce shuffle bytes=6
Reduce input records=0
Reduce output records=0
Spilled Records=0
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=13
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=536870912
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=86
File Output Format Counters
Bytes Written=0
Yet when viewing the contents of the newly created 'output' folder, I get the following result:
hdfs dfs -ls /output
Found 2 items
-rw-r--r-- 1 E supergroup 0 2021-12-15 00:34 /output/_SUCCESS
-rw-r--r-- 1 E supergroup 0 2021-12-15 00:34 /output/part-r-00000
I.e. there's no data written to those files!
May anyone please assist me?
If no data in your HDFS input folder matches the grep pattern 'dfs[a-z.]+', then the output will be empty.
From the linked docs (which are for Unix, not Windows), make sure this command completed:
bin/hdfs dfs -put %HADOOP_HOME%/etc/hadoop/*.xml input
You can also run grep dfs $HADOOP_HOME/etc/hadoop/*.xml locally (at least on Unix) to verify that there should be output.

Why there is no reducer when running 1TB teragen?

I am running a terasort benchmark for Hadoop using the following command:
hadoop jar /Users/karan.verma/Documents/backups/h/hadoop-2.6.4/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teragen -Dmapreduce.job.maps=100 1t random-data
and got the following logs printed for 100 map tasks:
18/03/27 13:06:03 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/03/27 13:06:04 INFO client.RMProxy: Connecting to ResourceManager at /127.0.0.1:8032
18/03/27 13:06:05 INFO terasort.TeraSort: Generating -727379968 using 100
18/03/27 13:06:05 INFO mapreduce.JobSubmitter: number of splits:100
18/03/27 13:06:05 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1522131782827_0001
18/03/27 13:06:06 INFO impl.YarnClientImpl: Submitted application application_1522131782827_0001
18/03/27 13:06:06 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1522131782827_0001/
18/03/27 13:06:06 INFO mapreduce.Job: Running job: job_1522131782827_0001
18/03/27 13:06:16 INFO mapreduce.Job: Job job_1522131782827_0001 running in uber mode : false
18/03/27 13:06:16 INFO mapreduce.Job: map 0% reduce 0%
18/03/27 13:06:29 INFO mapreduce.Job: map 2% reduce 0%
18/03/27 13:06:31 INFO mapreduce.Job: map 3% reduce 0%
18/03/27 13:06:32 INFO mapreduce.Job: map 5% reduce 0%
....
18/03/27 13:09:27 INFO mapreduce.Job: map 100% reduce 0%
and here is the final counters as printed on console:
18/03/27 13:09:29 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=10660990
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=8594
HDFS: Number of bytes written=0
HDFS: Number of read operations=400
HDFS: Number of large read operations=0
HDFS: Number of write operations=200
Job Counters
Launched map tasks=100
Other local map tasks=100
Total time spent by all maps in occupied slots (ms)=983560
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=983560
Total vcore-milliseconds taken by all map tasks=983560
Total megabyte-milliseconds taken by all map tasks=1007165440
Map-Reduce Framework
Map input records=0
Map output records=0
Input split bytes=8594
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=9746
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=11220811776
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
and here is the output on the job scheduler:
Please suggest why there is no reduce task?
Your run command says that you're running teragen and not terasort. teragen simply generates data that you can then use for terasort, and so no reducers are needed.
To run terasort over the data that you've just generated, run:
hadoop jar /Users/karan.verma/Documents/backups/h/hadoop-2.6.4/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar terasort random-data terasort-output
You should then see reducers.
No reduce tasks run when executing teragen. Here is the documentation:
TeraGen will run map tasks to generate the data and will not run any reduce tasks. The default number of map tasks is defined by the "mapreduce.job.maps=2" param. Its only purpose here is to generate the 1 TB of random data in the following format: "10 bytes key | 2 bytes break | 32 bytes ascii/hex | 4 bytes break | 48 bytes filler | 4 bytes break | \r\n".
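As a back-of-the-envelope sanity check on the quoted record layout (a sketch, not tied to any particular teragen version): the listed fields sum to 100 bytes per row, so generating 1 TB of data takes on the order of 10 billion rows.

```java
public class TeraGenRowMath {
    public static void main(String[] args) {
        // Field widths from the quoted TeraGen documentation:
        // 10-byte key, 2-byte break, 32 bytes ascii/hex,
        // 4-byte break, 48-byte filler, 4-byte break
        long rowBytes = 10 + 2 + 32 + 4 + 48 + 4;   // 100 bytes per record
        long oneTB = 1_000_000_000_000L;            // 10^12 bytes
        long rowsNeeded = oneTB / rowBytes;         // rows to request from teragen
        System.out.println(rowBytes + " bytes/row, " + rowsNeeded + " rows for 1 TB");
    }
}
```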

From Hadoop logs how can I find intermediate output byte sizes & reduce output bytes sizes?

From the Hadoop logs, how can I estimate the total size of the intermediate outputs of the mappers (in bytes) and the total size of the outputs of the reducers (in bytes)?
My mappers and reducers use LZO compression, and I want to know the size of mapper/reducer outputs after compression.
15/06/06 17:19:15 INFO mapred.JobClient: map 100% reduce 94%
15/06/06 17:19:16 INFO mapred.JobClient: map 100% reduce 98%
15/06/06 17:19:17 INFO mapred.JobClient: map 100% reduce 99%
15/06/06 17:20:04 INFO mapred.JobClient: map 100% reduce 100%
15/06/06 17:20:05 INFO mapred.JobClient: Job complete: job_201506061602_0026
15/06/06 17:20:05 INFO mapred.JobClient: Counters: 30
15/06/06 17:20:05 INFO mapred.JobClient: Job Counters
15/06/06 17:20:05 INFO mapred.JobClient: Launched reduce tasks=401
15/06/06 17:20:05 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=1203745
15/06/06 17:20:05 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
15/06/06 17:20:05 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
15/06/06 17:20:05 INFO mapred.JobClient: Rack-local map tasks=50
15/06/06 17:20:05 INFO mapred.JobClient: Launched map tasks=400
15/06/06 17:20:05 INFO mapred.JobClient: Data-local map tasks=350
15/06/06 17:20:05 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=6642599
15/06/06 17:20:05 INFO mapred.JobClient: File Output Format Counters
15/06/06 17:20:05 INFO mapred.JobClient: Bytes Written=534808008
15/06/06 17:20:05 INFO mapred.JobClient: FileSystemCounters
15/06/06 17:20:05 INFO mapred.JobClient: FILE_BYTES_READ=247949371
15/06/06 17:20:05 INFO mapred.JobClient: HDFS_BYTES_READ=168030609
15/06/06 17:20:05 INFO mapred.JobClient: FILE_BYTES_WRITTEN=651797418
15/06/06 17:20:05 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=534808008
15/06/06 17:20:05 INFO mapred.JobClient: File Input Format Counters
15/06/06 17:20:05 INFO mapred.JobClient: Bytes Read=167978609
15/06/06 17:20:05 INFO mapred.JobClient: Map-Reduce Framework
15/06/06 17:20:05 INFO mapred.JobClient: Map output materialized bytes=354979707
15/06/06 17:20:05 INFO mapred.JobClient: Map input records=3774768
15/06/06 17:20:05 INFO mapred.JobClient: Reduce shuffle bytes=354979707
15/06/06 17:20:05 INFO mapred.JobClient: Spilled Records=56007636
15/06/06 17:20:05 INFO mapred.JobClient: Map output bytes=336045816
15/06/06 17:20:05 INFO mapred.JobClient: Total committed heap usage (bytes)=592599187456
15/06/06 17:20:05 INFO mapred.JobClient: CPU time spent (ms)=9204120
15/06/06 17:20:05 INFO mapred.JobClient: Combine input records=0
15/06/06 17:20:05 INFO mapred.JobClient: SPLIT_RAW_BYTES=52000
15/06/06 17:20:05 INFO mapred.JobClient: Reduce input records=28003818
15/06/06 17:20:05 INFO mapred.JobClient: Reduce input groups=11478107
15/06/06 17:20:05 INFO mapred.JobClient: Combine output records=0
15/06/06 17:20:05 INFO mapred.JobClient: Physical memory (bytes) snapshot=516784615424
15/06/06 17:20:05 INFO mapred.JobClient: Reduce output records=94351104
15/06/06 17:20:05 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1911619866624
15/06/06 17:20:05 INFO mapred.JobClient: Map output records=28003818
You can get this information from the FileSystemCounters group. The terms used in this counter group are explained below:
FILE_BYTES_READ is the number of bytes read from the local file system. Assuming all the map input data comes from HDFS, FILE_BYTES_READ should be zero in the map phase. On the other hand, the input files of the reducers are data on the reduce-side local disks, fetched from the map-side disks, so FILE_BYTES_READ denotes the total bytes read by the reducers.
FILE_BYTES_WRITTEN consists of two parts. The first part comes from mappers. All the mappers will spill intermediate output to disk. All the bytes that mappers write to disk will be included in FILE_BYTES_WRITTEN. The second part comes from reducers. In the shuffle phase, all the reducers will fetch intermediate data from mappers and merge and spill to reducer-side disks. All the bytes that reducers write to disk will also be included in FILE_BYTES_WRITTEN.
HDFS_BYTES_READ denotes the bytes read by mappers from HDFS when the job starts. This data includes not only the content of source file but also metadata about splits.
HDFS_BYTES_WRITTEN denotes the bytes written to HDFS. It’s the number of bytes of the final output.
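To make these relationships concrete, here is a small arithmetic sketch using the counter values from the log above (counter names are from the posted output; "Map output materialized bytes" is commonly the post-compression size of the map output, which is what the question is after):

```java
public class CounterArithmetic {
    public static void main(String[] args) {
        // Values copied from the job log above
        long hdfsBytesRead      = 168_030_609L;  // HDFS_BYTES_READ
        long splitRawBytes      =      52_000L;  // SPLIT_RAW_BYTES (split metadata)
        long fileInputBytesRead = 167_978_609L;  // File Input Format: Bytes Read

        // HDFS_BYTES_READ = actual file content + split metadata
        System.out.println(hdfsBytesRead - splitRawBytes == fileInputBytesRead); // true

        // LZO-compressed intermediate map output, as shuffled to reducers
        long mapOutputMaterialized = 354_979_707L;  // Map output materialized bytes
        // Compressed final reducer output written to HDFS
        long hdfsBytesWritten      = 534_808_008L;  // HDFS_BYTES_WRITTEN
        System.out.println("intermediate (compressed): " + mapOutputMaterialized);
        System.out.println("final output (compressed): " + hdfsBytesWritten);
    }
}
```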

distcp hdfs to s3 fails

I was trying to distcp one directory which has hundreds of small files with the extension .avro,
but it fails for some files with the following error:
14/09/18 13:05:19 INFO mapred.JobClient: map 99% reduce 0%
14/09/18 13:05:22 INFO mapred.JobClient: map 100% reduce 0%
14/09/18 13:05:24 INFO mapred.JobClient: Task Id : attempt_201408291204_35665_m_000000_0, Status : FAILED
java.io.IOException: Copied: 32 Skipped: 0 Failed: 1
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.close(DistCp.java:584)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
14/09/18 13:05:25 INFO mapred.JobClient: map 83% reduce 0%
14/09/18 13:05:32 INFO mapred.JobClient: map 100% reduce 0%
14/09/18 13:05:32 INFO mapred.JobClient: Task Id : attempt_201408291204_35665_m_000005_0, Status : FAILED
java.io.IOException: Copied: 20 Skipped: 0 Failed: 1
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.close(DistCp.java:584)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
14/09/18 13:05:33 INFO mapred.JobClient: map 83% reduce 0%
14/09/18 13:05:41 INFO mapred.JobClient: map 93% reduce 0%
14/09/18 13:05:48 INFO mapred.JobClient: map 100% reduce 0%
14/09/18 13:05:51 INFO mapred.JobClient: Job complete: job_201408291204_35665
14/09/18 13:05:51 INFO mapred.JobClient: Counters: 33
14/09/18 13:05:51 INFO mapred.JobClient: File System Counters
14/09/18 13:05:51 INFO mapred.JobClient: FILE: Number of bytes read=0
14/09/18 13:05:51 INFO mapred.JobClient: FILE: Number of bytes written=1050200
14/09/18 13:05:51 INFO mapred.JobClient: FILE: Number of read operations=0
14/09/18 13:05:51 INFO mapred.JobClient: FILE: Number of large read operations=0
14/09/18 13:05:51 INFO mapred.JobClient: FILE: Number of write operations=0
14/09/18 13:05:51 INFO mapred.JobClient: HDFS: Number of bytes read=782797980
14/09/18 13:05:51 INFO mapred.JobClient: HDFS: Number of bytes written=0
14/09/18 13:05:51 INFO mapred.JobClient: HDFS: Number of read operations=88
14/09/18 13:05:51 INFO mapred.JobClient: HDFS: Number of large read operations=0
14/09/18 13:05:51 INFO mapred.JobClient: HDFS: Number of write operations=0
14/09/18 13:05:51 INFO mapred.JobClient: S3: Number of bytes read=0
14/09/18 13:05:51 INFO mapred.JobClient: S3: Number of bytes written=782775062
14/09/18 13:05:51 INFO mapred.JobClient: S3: Number of read operations=0
14/09/18 13:05:51 INFO mapred.JobClient: S3: Number of large read operations=0
14/09/18 13:05:51 INFO mapred.JobClient: S3: Number of write operations=0
14/09/18 13:05:51 INFO mapred.JobClient: Job Counters
14/09/18 13:05:51 INFO mapred.JobClient: Launched map tasks=8
14/09/18 13:05:51 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=454335
14/09/18 13:05:51 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=0
14/09/18 13:05:51 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/09/18 13:05:51 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/09/18 13:05:51 INFO mapred.JobClient: Map-Reduce Framework
14/09/18 13:05:51 INFO mapred.JobClient: Map input records=125
14/09/18 13:05:51 INFO mapred.JobClient: Map output records=53
14/09/18 13:05:51 INFO mapred.JobClient: Input split bytes=798
14/09/18 13:05:51 INFO mapred.JobClient: Spilled Records=0
14/09/18 13:05:51 INFO mapred.JobClient: CPU time spent (ms)=50250
14/09/18 13:05:51 INFO mapred.JobClient: Physical memory (bytes) snapshot=1930326016
14/09/18 13:05:51 INFO mapred.JobClient: Virtual memory (bytes) snapshot=9781469184
14/09/18 13:05:51 INFO mapred.JobClient: Total committed heap usage (bytes)=5631639552
14/09/18 13:05:51 INFO mapred.JobClient: org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
14/09/18 13:05:51 INFO mapred.JobClient: BYTES_READ=22883
14/09/18 13:05:51 INFO mapred.JobClient: distcp
14/09/18 13:05:51 INFO mapred.JobClient: Bytes copied=782769559
14/09/18 13:05:51 INFO mapred.JobClient: Bytes expected=782769559
14/09/18 13:05:51 INFO mapred.JobClient: Files copied=70
14/09/18 13:05:51 INFO mapred.JobClient: Files skipped=53
Here more snippet from JobTracker UI :
2014-09-18 13:04:24,381 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem: OutputStream for key '09/01/01/SEARCHES/_distcp_tmp_hrb8ba/part-m-00005.avro' upload complete
2014-09-18 13:04:25,136 INFO org.apache.hadoop.tools.DistCp: FAIL part-m-00005.avro : java.io.IOException: Fail to rename tmp file (=s3://magnetic-test/09/01/01/SEARCHES/_distcp_tmp_hrb8ba/part-m-00005.avro) to destination file (=s3://abc/09/01/01/SEARCHES/part-m-00005.avro)
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.rename(DistCp.java:494)
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:463)
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:549)
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:316)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.io.IOException
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.rename(DistCp.java:490)
... 11 more
Anyone know about this issue ?
Got this resolved by adding -D mapred.task.timeout=60000000 to the distcp command.
I tried the suggested answer, but with no luck. I experienced the issue when copying many small files (on the order of thousands, which in total did not amount to more than half a gigabyte). I couldn't make the distcp command work (same error as posted by the OP), so switching to hadoop fs -cp was my solution. As a side note, in the same cluster, using distcp to copy other, much larger files worked fine.

hadoop showing map reduce percentages running twice

I'm running Apache's Hadoop, using the grep example provided by that installation. I'm wondering why the map reduce percentages show up twice? I thought they only had to run once, which makes me doubt my understanding of map reduce. I looked it up (http://grokbase.com/t/gg/mongodb-user/125ay1eazq/map-reduce-percentage-seems-running-twice) but there wasn't really an explanation, and that link was for MongoDB.
hduser#ubse1:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar grep /user/hduser/grep /user/hduser/grep-output4 ".*woe is me.*"
I'm running this on a Project Gutenberg .txt file. The output file is correct.
Here is the output for running the command if needed:
12/08/06 06:56:57 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/08/06 06:56:57 WARN snappy.LoadSnappy: Snappy native library not loaded
12/08/06 06:56:57 INFO mapred.FileInputFormat: Total input paths to process : 1
12/08/06 06:56:58 INFO mapred.JobClient: Running job: job_201208030925_0011
12/08/06 06:56:59 INFO mapred.JobClient: map 0% reduce 0%
12/08/06 06:57:18 INFO mapred.JobClient: map 100% reduce 0%
12/08/06 06:57:30 INFO mapred.JobClient: map 100% reduce 100%
12/08/06 06:57:35 INFO mapred.JobClient: Job complete: job_201208030925_0011
12/08/06 06:57:35 INFO mapred.JobClient: Counters: 30
12/08/06 06:57:35 INFO mapred.JobClient: Job Counters
12/08/06 06:57:35 INFO mapred.JobClient: Launched reduce tasks=1
12/08/06 06:57:35 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=31034
12/08/06 06:57:35 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/08/06 06:57:35 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/08/06 06:57:35 INFO mapred.JobClient: Rack-local map tasks=2
12/08/06 06:57:35 INFO mapred.JobClient: Launched map tasks=2
12/08/06 06:57:35 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=11233
12/08/06 06:57:35 INFO mapred.JobClient: File Input Format Counters
12/08/06 06:57:35 INFO mapred.JobClient: Bytes Read=5592666
12/08/06 06:57:35 INFO mapred.JobClient: File Output Format Counters
12/08/06 06:57:35 INFO mapred.JobClient: Bytes Written=391
12/08/06 06:57:35 INFO mapred.JobClient: FileSystemCounters
12/08/06 06:57:35 INFO mapred.JobClient: FILE_BYTES_READ=281
12/08/06 06:57:35 INFO mapred.JobClient: HDFS_BYTES_READ=5592862
12/08/06 06:57:35 INFO mapred.JobClient: FILE_BYTES_WRITTEN=65331
12/08/06 06:57:35 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=391
12/08/06 06:57:35 INFO mapred.JobClient: Map-Reduce Framework
12/08/06 06:57:35 INFO mapred.JobClient: Map output materialized bytes=287
12/08/06 06:57:35 INFO mapred.JobClient: Map input records=124796
12/08/06 06:57:35 INFO mapred.JobClient: Reduce shuffle bytes=287
12/08/06 06:57:35 INFO mapred.JobClient: Spilled Records=10
12/08/06 06:57:35 INFO mapred.JobClient: Map output bytes=265
12/08/06 06:57:35 INFO mapred.JobClient: Total committed heap usage (bytes)=336404480
12/08/06 06:57:35 INFO mapred.JobClient: CPU time spent (ms)=7040
12/08/06 06:57:35 INFO mapred.JobClient: Map input bytes=5590193
12/08/06 06:57:35 INFO mapred.JobClient: SPLIT_RAW_BYTES=196
12/08/06 06:57:35 INFO mapred.JobClient: Combine input records=5
12/08/06 06:57:35 INFO mapred.JobClient: Reduce input records=5
12/08/06 06:57:35 INFO mapred.JobClient: Reduce input groups=5
12/08/06 06:57:35 INFO mapred.JobClient: Combine output records=5
12/08/06 06:57:35 INFO mapred.JobClient: Physical memory (bytes) snapshot=464568320
12/08/06 06:57:35 INFO mapred.JobClient: Reduce output records=5
12/08/06 06:57:35 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1539559424
12/08/06 06:57:35 INFO mapred.JobClient: Map output records=5
12/08/06 06:57:35 INFO mapred.FileInputFormat: Total input paths to process : 1
12/08/06 06:57:35 INFO mapred.JobClient: Running job: job_201208030925_0012
12/08/06 06:57:36 INFO mapred.JobClient: map 0% reduce 0%
12/08/06 06:57:50 INFO mapred.JobClient: map 100% reduce 0%
12/08/06 06:58:05 INFO mapred.JobClient: map 100% reduce 100%
12/08/06 06:58:10 INFO mapred.JobClient: Job complete: job_201208030925_0012
12/08/06 06:58:10 INFO mapred.JobClient: Counters: 30
12/08/06 06:58:10 INFO mapred.JobClient: Job Counters
12/08/06 06:58:10 INFO mapred.JobClient: Launched reduce tasks=1
12/08/06 06:58:10 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=15432
12/08/06 06:58:10 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/08/06 06:58:10 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/08/06 06:58:10 INFO mapred.JobClient: Rack-local map tasks=1
12/08/06 06:58:10 INFO mapred.JobClient: Launched map tasks=1
12/08/06 06:58:10 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=14264
12/08/06 06:58:10 INFO mapred.JobClient: File Input Format Counters
12/08/06 06:58:10 INFO mapred.JobClient: Bytes Read=391
12/08/06 06:58:10 INFO mapred.JobClient: File Output Format Counters
12/08/06 06:58:10 INFO mapred.JobClient: Bytes Written=235
12/08/06 06:58:10 INFO mapred.JobClient: FileSystemCounters
12/08/06 06:58:10 INFO mapred.JobClient: FILE_BYTES_READ=281
12/08/06 06:58:10 INFO mapred.JobClient: HDFS_BYTES_READ=505
12/08/06 06:58:10 INFO mapred.JobClient: FILE_BYTES_WRITTEN=42985
12/08/06 06:58:10 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=235
12/08/06 06:58:10 INFO mapred.JobClient: Map-Reduce Framework
12/08/06 06:58:10 INFO mapred.JobClient: Map output materialized bytes=281
12/08/06 06:58:10 INFO mapred.JobClient: Map input records=5
12/08/06 06:58:10 INFO mapred.JobClient: Reduce shuffle bytes=0
12/08/06 06:58:10 INFO mapred.JobClient: Spilled Records=10
EDIT Driver Class for Grep:
Grep.java
/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.examples;
import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
/* Extracts matching regexs from input files and counts them. */
public class Grep extends Configured implements Tool {

  private Grep() {}  // singleton

  public int run(String[] args) throws Exception {
    if (args.length < 3) {
      System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
      ToolRunner.printGenericCommandUsage(System.out);
      return -1;
    }

    Path tempDir =
        new Path("grep-temp-" +
            Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

    JobConf grepJob = new JobConf(getConf(), Grep.class);

    try {
      grepJob.setJobName("grep-search");

      FileInputFormat.setInputPaths(grepJob, args[0]);

      grepJob.setMapperClass(RegexMapper.class);
      grepJob.set("mapred.mapper.regex", args[2]);
      if (args.length == 4)
        grepJob.set("mapred.mapper.regex.group", args[3]);

      grepJob.setCombinerClass(LongSumReducer.class);
      grepJob.setReducerClass(LongSumReducer.class);

      FileOutputFormat.setOutputPath(grepJob, tempDir);
      grepJob.setOutputFormat(SequenceFileOutputFormat.class);
      grepJob.setOutputKeyClass(Text.class);
      grepJob.setOutputValueClass(LongWritable.class);

      JobClient.runJob(grepJob);

      JobConf sortJob = new JobConf(getConf(), Grep.class);
      sortJob.setJobName("grep-sort");

      FileInputFormat.setInputPaths(sortJob, tempDir);
      sortJob.setInputFormat(SequenceFileInputFormat.class);

      sortJob.setMapperClass(InverseMapper.class);

      sortJob.setNumReduceTasks(1);  // write a single file
      FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
      sortJob.setOutputKeyComparatorClass  // sort by decreasing freq
          (LongWritable.DecreasingComparator.class);

      JobClient.runJob(sortJob);
    } finally {
      FileSystem.get(grepJob).delete(tempDir, true);
    }
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new Grep(), args);
    System.exit(res);
  }
}
The output contains the statistics of two jobs: job_201208030925_0011 and job_201208030925_0012. The Grep driver chains two MapReduce jobs, a search job ("grep-search") and a sort job ("grep-sort"), and the percentages belong to these two jobs; hence there are two sets of map/reduce progress percentages.
