mapreduce matrix multiplication with hadoop - hadoop

I am trying to run the matrix multiplication example (with source code) described at the following link:
http://www.norstad.org/matrix-multiply/index.html
I have hadoop setup in pseudodistributed mode and I configured it using this tutorial:
http://hadoop-tutorial.blogspot.com/2010/11/running-hadoop-in-pseudo-distributed.html?showComment=1321528406255#c3661776111033973764
When I run my jar file then I get the following error:
Identity test
11/11/30 10:37:34 INFO input.FileInputFormat: Total input paths to process : 2
11/11/30 10:37:34 INFO mapred.JobClient: Running job: job_201111291041_0010
11/11/30 10:37:35 INFO mapred.JobClient: map 0% reduce 0%
11/11/30 10:37:44 INFO mapred.JobClient: map 100% reduce 0%
11/11/30 10:37:56 INFO mapred.JobClient: map 100% reduce 100%
11/11/30 10:37:58 INFO mapred.JobClient: Job complete: job_201111291041_0010
11/11/30 10:37:58 INFO mapred.JobClient: Counters: 17
11/11/30 10:37:58 INFO mapred.JobClient: Job Counters
11/11/30 10:37:58 INFO mapred.JobClient: Launched reduce tasks=1
11/11/30 10:37:58 INFO mapred.JobClient: Launched map tasks=2
11/11/30 10:37:58 INFO mapred.JobClient: Data-local map tasks=2
11/11/30 10:37:58 INFO mapred.JobClient: FileSystemCounters
11/11/30 10:37:58 INFO mapred.JobClient: FILE_BYTES_READ=114
11/11/30 10:37:58 INFO mapred.JobClient: HDFS_BYTES_READ=248
11/11/30 10:37:58 INFO mapred.JobClient: FILE_BYTES_WRITTEN=298
11/11/30 10:37:58 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=124
11/11/30 10:37:58 INFO mapred.JobClient: Map-Reduce Framework
11/11/30 10:37:58 INFO mapred.JobClient: Reduce input groups=2
11/11/30 10:37:58 INFO mapred.JobClient: Combine output records=0
11/11/30 10:37:58 INFO mapred.JobClient: Map input records=4
11/11/30 10:37:58 INFO mapred.JobClient: Reduce shuffle bytes=60
11/11/30 10:37:58 INFO mapred.JobClient: Reduce output records=2
11/11/30 10:37:58 INFO mapred.JobClient: Spilled Records=8
11/11/30 10:37:58 INFO mapred.JobClient: Map output bytes=100
11/11/30 10:37:58 INFO mapred.JobClient: Combine input records=0
11/11/30 10:37:58 INFO mapred.JobClient: Map output records=4
11/11/30 10:37:58 INFO mapred.JobClient: Reduce input records=4
11/11/30 10:37:58 INFO input.FileInputFormat: Total input paths to process : 1
11/11/30 10:37:59 INFO mapred.JobClient: Running job: job_201111291041_0011
11/11/30 10:38:00 INFO mapred.JobClient: map 0% reduce 0%
11/11/30 10:38:09 INFO mapred.JobClient: map 100% reduce 0%
11/11/30 10:38:21 INFO mapred.JobClient: map 100% reduce 100%
11/11/30 10:38:23 INFO mapred.JobClient: Job complete: job_201111291041_0011
11/11/30 10:38:23 INFO mapred.JobClient: Counters: 17
11/11/30 10:38:23 INFO mapred.JobClient: Job Counters
11/11/30 10:38:23 INFO mapred.JobClient: Launched reduce tasks=1
11/11/30 10:38:23 INFO mapred.JobClient: Launched map tasks=1
11/11/30 10:38:23 INFO mapred.JobClient: Data-local map tasks=1
11/11/30 10:38:23 INFO mapred.JobClient: FileSystemCounters
11/11/30 10:38:23 INFO mapred.JobClient: FILE_BYTES_READ=34
11/11/30 10:38:23 INFO mapred.JobClient: HDFS_BYTES_READ=124
11/11/30 10:38:23 INFO mapred.JobClient: FILE_BYTES_WRITTEN=100
11/11/30 10:38:23 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=124
11/11/30 10:38:23 INFO mapred.JobClient: Map-Reduce Framework
11/11/30 10:38:23 INFO mapred.JobClient: Reduce input groups=2
11/11/30 10:38:23 INFO mapred.JobClient: Combine output records=2
11/11/30 10:38:23 INFO mapred.JobClient: Map input records=2
11/11/30 10:38:23 INFO mapred.JobClient: Reduce shuffle bytes=0
11/11/30 10:38:23 INFO mapred.JobClient: Reduce output records=2
11/11/30 10:38:23 INFO mapred.JobClient: Spilled Records=4
11/11/30 10:38:23 INFO mapred.JobClient: Map output bytes=24
11/11/30 10:38:23 INFO mapred.JobClient: Combine input records=2
11/11/30 10:38:23 INFO mapred.JobClient: Map output records=2
11/11/30 10:38:23 INFO mapred.JobClient: Reduce input records=2
Exception in thread "main" java.io.IOException: Cannot open filename /tmp/Matrix Multiply/out/_logs
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1497)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1488)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:376)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:178)
at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1437)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
at TestMatrixMultiply.fillMatrix(TestMatrixMultiply.java:62)
at TestMatrixMultiply.readMatrix(TestMatrixMultiply.java:84)
at TestMatrixMultiply.checkAnswer(TestMatrixMultiply.java:108)
at TestMatrixMultiply.runOneTest(TestMatrixMultiply.java:144)
at TestMatrixMultiply.testIdentity(TestMatrixMultiply.java:156)
at TestMatrixMultiply.main(TestMatrixMultiply.java:258)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Can someone please tell me what I am doing wrong? Thanks.

It tries to read the job output. When you submit the job to your cluster, Hadoop adds a _logs directory to the output directory. Since directories are not sequence files, they can't be read.
You have to change the code that reads the output.
I have written something similar:
FileStatus[] stati = fs.listStatus(output);
for (FileStatus status : stati) {
  if (!status.isDir()) {
    Path path = status.getPath();
    // HERE IS THE READ CODE FROM YOUR EXAMPLE
  }
}
http://code.google.com/p/hama-shortest-paths/source/browse/trunk/hama-gsoc/src/de/jungblut/clustering/mapreduce/KMeansClusteringJob.java#127
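The same "skip anything that is not a regular file" idea can be tried without a cluster. The following is only a minimal sketch on the local filesystem using java.nio.file, not the Hadoop FileSystem API; the class name OutputFileLister and the method dataFiles are made up for illustration, and the temporary directory stands in for the job output directory.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: a local-filesystem analog of the listStatus/isDir loop.
final class OutputFileLister {
    // Returns only the regular files directly under outputDir, sorted by name.
    // Subdirectories such as _logs are skipped.
    static List<Path> dataFiles(Path outputDir) throws IOException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(outputDir)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)) {
                    files.add(entry);
                }
            }
        }
        Collections.sort(files);
        return files;
    }

    public static void main(String[] args) throws IOException {
        // Simulate a job output directory with one part file and a _logs directory.
        Path out = Files.createTempDirectory("out");
        Files.createDirectory(out.resolve("_logs"));
        Files.createFile(out.resolve("part-r-00000"));
        for (Path p : dataFiles(out)) {
            System.out.println(p.getFileName());
        }
    }
}
```

In the real test class, the read code would be called once per returned path, which is what the loop above does with fs.listStatus.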

This may be a basic suggestion, but you may need to write the log path as
/tmp/Matrix\ Multiply/out/_logs. Spaces in directory names may not be handled automatically; I am assuming you are working on Linux.

There are two problems in TestMatrixMultiply.java:
1. As Thomas Jungblut said, _logs should be excluded in the readMatrix() method. I changed the code like this:
if (fs.isFile(path)) {
  fillMatrix(result, path);
} else {
  FileStatus[] fileStatusArray = fs.listStatus(path);
  for (FileStatus fileStatus : fileStatusArray) {
    if (!fileStatus.isDir()) // this line is added by me
      fillMatrix(result, fileStatus.getPath());
  }
}
2. At the end of the main() method, the fs.delete call should be commented out; otherwise the output directory is deleted immediately after each MapReduce job finishes.
finally {
  //fs.delete(new Path(DATA_DIR_PATH), true);
}

Related

From Hadoop logs how can I find intermediate output byte sizes & reduce output bytes sizes?

From the Hadoop logs, how can I estimate the total size of the mappers' intermediate output (in bytes) and the total size of the reducers' output (in bytes)?
My mappers and reducers use LZO compression, and I want to know the size of the mapper/reducer outputs after compression.
15/06/06 17:19:15 INFO mapred.JobClient: map 100% reduce 94%
15/06/06 17:19:16 INFO mapred.JobClient: map 100% reduce 98%
15/06/06 17:19:17 INFO mapred.JobClient: map 100% reduce 99%
15/06/06 17:20:04 INFO mapred.JobClient: map 100% reduce 100%
15/06/06 17:20:05 INFO mapred.JobClient: Job complete: job_201506061602_0026
15/06/06 17:20:05 INFO mapred.JobClient: Counters: 30
15/06/06 17:20:05 INFO mapred.JobClient: Job Counters
15/06/06 17:20:05 INFO mapred.JobClient: Launched reduce tasks=401
15/06/06 17:20:05 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=1203745
15/06/06 17:20:05 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
15/06/06 17:20:05 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
15/06/06 17:20:05 INFO mapred.JobClient: Rack-local map tasks=50
15/06/06 17:20:05 INFO mapred.JobClient: Launched map tasks=400
15/06/06 17:20:05 INFO mapred.JobClient: Data-local map tasks=350
15/06/06 17:20:05 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=6642599
15/06/06 17:20:05 INFO mapred.JobClient: File Output Format Counters
15/06/06 17:20:05 INFO mapred.JobClient: Bytes Written=534808008
15/06/06 17:20:05 INFO mapred.JobClient: FileSystemCounters
15/06/06 17:20:05 INFO mapred.JobClient: FILE_BYTES_READ=247949371
15/06/06 17:20:05 INFO mapred.JobClient: HDFS_BYTES_READ=168030609
15/06/06 17:20:05 INFO mapred.JobClient: FILE_BYTES_WRITTEN=651797418
15/06/06 17:20:05 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=534808008
15/06/06 17:20:05 INFO mapred.JobClient: File Input Format Counters
15/06/06 17:20:05 INFO mapred.JobClient: Bytes Read=167978609
15/06/06 17:20:05 INFO mapred.JobClient: Map-Reduce Framework
15/06/06 17:20:05 INFO mapred.JobClient: Map output materialized bytes=354979707
15/06/06 17:20:05 INFO mapred.JobClient: Map input records=3774768
15/06/06 17:20:05 INFO mapred.JobClient: Reduce shuffle bytes=354979707
15/06/06 17:20:05 INFO mapred.JobClient: Spilled Records=56007636
15/06/06 17:20:05 INFO mapred.JobClient: Map output bytes=336045816
15/06/06 17:20:05 INFO mapred.JobClient: Total committed heap usage (bytes)=592599187456
15/06/06 17:20:05 INFO mapred.JobClient: CPU time spent (ms)=9204120
15/06/06 17:20:05 INFO mapred.JobClient: Combine input records=0
15/06/06 17:20:05 INFO mapred.JobClient: SPLIT_RAW_BYTES=52000
15/06/06 17:20:05 INFO mapred.JobClient: Reduce input records=28003818
15/06/06 17:20:05 INFO mapred.JobClient: Reduce input groups=11478107
15/06/06 17:20:05 INFO mapred.JobClient: Combine output records=0
15/06/06 17:20:05 INFO mapred.JobClient: Physical memory (bytes) snapshot=516784615424
15/06/06 17:20:05 INFO mapred.JobClient: Reduce output records=94351104
15/06/06 17:20:05 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1911619866624
15/06/06 17:20:05 INFO mapred.JobClient: Map output records=28003818
You can get this information from the FileSystemCounters group. The counters in this group are described below:
FILE_BYTES_READ is the number of bytes read from the local file system. Assuming all the map input data comes from HDFS, FILE_BYTES_READ should be zero in the map phase. The reducers' input, on the other hand, is data on the reduce-side local disks that was fetched from the map-side disks. Therefore, FILE_BYTES_READ denotes the total bytes read by the reducers.
FILE_BYTES_WRITTEN consists of two parts. The first part comes from the mappers: all the mappers spill intermediate output to disk, and every byte they write to disk is included in FILE_BYTES_WRITTEN. The second part comes from the reducers: in the shuffle phase, the reducers fetch intermediate data from the mappers, then merge it and spill it to reducer-side disks. Every byte the reducers write to disk is also included in FILE_BYTES_WRITTEN.
HDFS_BYTES_READ denotes the bytes read by the mappers from HDFS when the job starts. This includes not only the content of the source files but also metadata about the splits.
HDFS_BYTES_WRITTEN denotes the bytes written to HDFS. It is the number of bytes of the final output.
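To pull these numbers out of a saved console log, something like the following sketch could be used. It only assumes the "name=value" line format shown in the question; the class name JobCounterParser is made up for illustration. As far as I know, "Map output materialized bytes" is the size of the map output as actually written to disk (after compression, if map output compression is enabled), while HDFS_BYTES_WRITTEN is the size of the final reducer output, so treat that reading of the counters as an assumption to verify on your version.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that extracts numeric counters from JobClient log lines.
final class JobCounterParser {
    // Matches the "name=value" tail of a JobClient counter line.
    private static final Pattern COUNTER =
        Pattern.compile("mapred\\.JobClient:\\s*(.+?)=(\\d+)\\s*$");

    // Returns counter name -> value, in the order the counters appear in the log.
    // Lines without a numeric "name=value" tail (progress lines, group headers) are ignored.
    static Map<String, Long> parse(List<String> logLines) {
        Map<String, Long> counters = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = COUNTER.matcher(line);
            if (m.find()) {
                counters.put(m.group(1).trim(), Long.parseLong(m.group(2)));
            }
        }
        return counters;
    }

    public static void main(String[] args) {
        // Three lines copied from the log in the question.
        List<String> log = Arrays.asList(
            "15/06/06 17:20:05 INFO mapred.JobClient: Map output materialized bytes=354979707",
            "15/06/06 17:20:05 INFO mapred.JobClient: Map output bytes=336045816",
            "15/06/06 17:20:05 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=534808008");
        Map<String, Long> counters = parse(log);
        System.out.println("Intermediate bytes on disk: "
            + counters.get("Map output materialized bytes"));
        System.out.println("Final output bytes: " + counters.get("HDFS_BYTES_WRITTEN"));
    }
}
```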

hadoop test examples to validate the installation

I have successfully configured Hadoop 2.4 on my Ubuntu 14.04 using this tutorial.
http://dogdogfish.com/2014/04/26/installing-hadoop-2-4-on-ubuntu-14-04/
Now that the installation is complete, how can I test it?
How and where can I get the test data or jar files?
Your Hadoop installation directory contains some example jars.
The simplest thing you can do is run the teragen example (or wordcount).
Teragen is the first step in performing a terasort.
Steps:
1. Go to the Hadoop installation directory.
2. Run "hadoop jar hadoop-examples-0.20.2-cdh3u0.jar" to see all the example programs you can run. (The jar name depends on your distribution and version; on Hadoop 2.x it is typically share/hadoop/mapreduce/hadoop-mapreduce-examples-<version>.jar.)
3. Go to your home/[user] directory and create a file "example.txt" with the following data:
"This is a file to test Hadoop Installation example
For the sake of the experiment, consider it to be 1TB"
4. While you are in that directory, run "hadoop dfs -put example.txt /". This uploads the file to your HDFS.
5. Run "hadoop dfs -ls /" to check that it is there.
6. Go to your Hadoop installation directory and run "hadoop jar hadoop-examples-0.20.2-cdh3u0.jar teragen 1000 /user/teragendata". Here 1000 is the number of rows to generate (each row is 100 bytes) and the other parameter is the output directory.
7. On successful execution, you will see something like the text at the bottom.
8. To see how your MR job ran, open the JobTracker in your browser and look at the completed jobs: "localhost:50030/jobtracker.jsp"
cloudera@cloudera-vm:/usr/lib/hadoop$ hadoop jar hadoop-examples-0.20.2-cdh3u0.jar teragen 600 /user/teragendata
Generating 600 using 2 maps with step of 300
14/07/24 09:02:44 INFO mapred.JobClient: Running job: job_201407230030_0008
14/07/24 09:02:45 INFO mapred.JobClient: map 0% reduce 0%
14/07/24 09:02:57 INFO mapred.JobClient: map 100% reduce 0%
14/07/24 09:03:00 INFO mapred.JobClient: Job complete: job_201407230030_0008
14/07/24 09:03:00 INFO mapred.JobClient: Counters: 13
14/07/24 09:03:00 INFO mapred.JobClient: Job Counters
14/07/24 09:03:00 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=22008
14/07/24 09:03:00 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/07/24 09:03:00 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/07/24 09:03:00 INFO mapred.JobClient: Launched map tasks=2
14/07/24 09:03:00 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
14/07/24 09:03:00 INFO mapred.JobClient: FileSystemCounters
14/07/24 09:03:00 INFO mapred.JobClient: HDFS_BYTES_READ=164
14/07/24 09:03:00 INFO mapred.JobClient: FILE_BYTES_WRITTEN=105150
14/07/24 09:03:00 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=60000
14/07/24 09:03:00 INFO mapred.JobClient: Map-Reduce Framework
14/07/24 09:03:00 INFO mapred.JobClient: Map input records=600
14/07/24 09:03:00 INFO mapred.JobClient: Spilled Records=0
14/07/24 09:03:00 INFO mapred.JobClient: Map input bytes=600
14/07/24 09:03:00 INFO mapred.JobClient: Map output records=600
14/07/24 09:03:00 INFO mapred.JobClient: SPLIT_RAW_BYTES=164

logging the output of a map reduce job to a text file

I've been using the JobClient.monitorAndPrintJob() method to print the output of a map reduce job to the console. My usage is something like this:
job_client.monitorAndPrintJob(job_conf, job_client.getJob(j.getAssignedJobID()))
The output of which is as follows (printed on the console):
13/03/04 07:20:00 INFO mapred.JobClient: Running job: job_201302211725_10139
13/03/04 07:20:01 INFO mapred.JobClient: map 0% reduce 0%
13/03/04 07:20:08 INFO mapred.JobClient: map 100% reduce 0%
13/03/04 07:20:13 INFO mapred.JobClient: map 100% reduce 100%
13/03/04 07:20:13 INFO mapred.JobClient: Job complete: job_201302211725_10139
13/03/04 07:20:13 INFO mapred.JobClient: Counters: 26
13/03/04 07:20:13 INFO mapred.JobClient: Job Counters
13/03/04 07:20:13 INFO mapred.JobClient: Launched reduce tasks=1
13/03/04 07:20:13 INFO mapred.JobClient: Aggregate execution time of mappers(ms)=5539
13/03/04 07:20:13 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/03/04 07:20:13 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/03/04 07:20:13 INFO mapred.JobClient: Launched map tasks=2
13/03/04 07:20:13 INFO mapred.JobClient: Data-local map tasks=2
13/03/04 07:20:13 INFO mapred.JobClient: Aggregate execution time of reducers(ms)=4337
13/03/04 07:20:13 INFO mapred.JobClient: FileSystemCounters
13/03/04 07:20:13 INFO mapred.JobClient: MAPRFS_BYTES_READ=583
13/03/04 07:20:13 INFO mapred.JobClient: MAPRFS_BYTES_WRITTEN=394
13/03/04 07:20:13 INFO mapred.JobClient: FILE_BYTES_WRITTEN=140219
13/03/04 07:20:13 INFO mapred.JobClient: Map-Reduce Framework
13/03/04 07:20:13 INFO mapred.JobClient: Map input records=6
13/03/04 07:20:13 INFO mapred.JobClient: Reduce shuffle bytes=136
13/03/04 07:20:13 INFO mapred.JobClient: Spilled Records=22
13/03/04 07:20:13 INFO mapred.JobClient: Map output bytes=116
13/03/04 07:20:13 INFO mapred.JobClient: CPU_MILLISECONDS=1320
13/03/04 07:20:13 INFO mapred.JobClient: Map input bytes=64
13/03/04 07:20:13 INFO mapred.JobClient: Combine input records=13
13/03/04 07:20:13 INFO mapred.JobClient: SPLIT_RAW_BYTES=180
13/03/04 07:20:13 INFO mapred.JobClient: Reduce input records=11
13/03/04 07:20:13 INFO mapred.JobClient: Reduce input groups=11
13/03/04 07:20:13 INFO mapred.JobClient: Combine output records=11
13/03/04 07:20:13 INFO mapred.JobClient: PHYSICAL_MEMORY_BYTES=734961664
13/03/04 07:20:13 INFO mapred.JobClient: Reduce output records=11
13/03/04 07:20:13 INFO mapred.JobClient: VIRTUAL_MEMORY_BYTES=9751805952
13/03/04 07:20:13 INFO mapred.JobClient: Map output records=13
13/03/04 07:20:13 INFO mapred.JobClient: GC time elapsed (ms)=0
I would like the above output/log to be written to a text file rather than the console. Any suggestions?
In your HADOOP_HOME/conf directory you should find a file named log4j.properties. I believe you can configure where and how to log there.
To be precise, you should use a rolling file appender, so uncomment (just remove the #) the following lines in the log4j.properties file:
# Rolling File Appender
#
#log4j.appender.RFA=org.apache.log4j.RollingFileAppender
#log4j.appender.RFA.File=${hadoop.log.dir}/${hadoop.log.file}
# Logfile size and and 30-day backups
#log4j.appender.RFA.MaxFileSize=1MB
#log4j.appender.RFA.MaxBackupIndex=30
#log4j.appender.RFA.layout=org.apache.log4j.PatternLayout
#log4j.appender.RFA.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} - %m%n
#log4j.appender.RFA.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} (%F:%M(%L)) - %m%n
And customize the other parameters to your liking.
For more about log4j configuration, see the log4j documentation.

hadoop showing map reduce percentages running twice

I'm running Apache's Hadoop, and using the grep example provided by that installation. I'm wondering why map reduce percentages show up running twice? I thought they only had to run once; which makes me doubt my understanding of map reduce. I looked it up (http://grokbase.com/t/gg/mongodb-user/125ay1eazq/map-reduce-percentage-seems-running-twice) but there really wasn't an explanation and this link was for MongoDB.
hduser@ubse1:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar grep /user/hduser/grep /user/hduser/grep-output4 ".*woe is me.*"
I'm running this on a Project Gutenberg .txt file. The output file is correct.
Here is the output for running the command if needed:
12/08/06 06:56:57 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/08/06 06:56:57 WARN snappy.LoadSnappy: Snappy native library not loaded
12/08/06 06:56:57 INFO mapred.FileInputFormat: Total input paths to process : 1
12/08/06 06:56:58 INFO mapred.JobClient: Running job: job_201208030925_0011
12/08/06 06:56:59 INFO mapred.JobClient: map 0% reduce 0%
12/08/06 06:57:18 INFO mapred.JobClient: map 100% reduce 0%
12/08/06 06:57:30 INFO mapred.JobClient: map 100% reduce 100%
12/08/06 06:57:35 INFO mapred.JobClient: Job complete: job_201208030925_0011
12/08/06 06:57:35 INFO mapred.JobClient: Counters: 30
12/08/06 06:57:35 INFO mapred.JobClient: Job Counters
12/08/06 06:57:35 INFO mapred.JobClient: Launched reduce tasks=1
12/08/06 06:57:35 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=31034
12/08/06 06:57:35 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/08/06 06:57:35 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/08/06 06:57:35 INFO mapred.JobClient: Rack-local map tasks=2
12/08/06 06:57:35 INFO mapred.JobClient: Launched map tasks=2
12/08/06 06:57:35 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=11233
12/08/06 06:57:35 INFO mapred.JobClient: File Input Format Counters
12/08/06 06:57:35 INFO mapred.JobClient: Bytes Read=5592666
12/08/06 06:57:35 INFO mapred.JobClient: File Output Format Counters
12/08/06 06:57:35 INFO mapred.JobClient: Bytes Written=391
12/08/06 06:57:35 INFO mapred.JobClient: FileSystemCounters
12/08/06 06:57:35 INFO mapred.JobClient: FILE_BYTES_READ=281
12/08/06 06:57:35 INFO mapred.JobClient: HDFS_BYTES_READ=5592862
12/08/06 06:57:35 INFO mapred.JobClient: FILE_BYTES_WRITTEN=65331
12/08/06 06:57:35 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=391
12/08/06 06:57:35 INFO mapred.JobClient: Map-Reduce Framework
12/08/06 06:57:35 INFO mapred.JobClient: Map output materialized bytes=287
12/08/06 06:57:35 INFO mapred.JobClient: Map input records=124796
12/08/06 06:57:35 INFO mapred.JobClient: Reduce shuffle bytes=287
12/08/06 06:57:35 INFO mapred.JobClient: Spilled Records=10
12/08/06 06:57:35 INFO mapred.JobClient: Map output bytes=265
12/08/06 06:57:35 INFO mapred.JobClient: Total committed heap usage (bytes)=336404480
12/08/06 06:57:35 INFO mapred.JobClient: CPU time spent (ms)=7040
12/08/06 06:57:35 INFO mapred.JobClient: Map input bytes=5590193
12/08/06 06:57:35 INFO mapred.JobClient: SPLIT_RAW_BYTES=196
12/08/06 06:57:35 INFO mapred.JobClient: Combine input records=5
12/08/06 06:57:35 INFO mapred.JobClient: Reduce input records=5
12/08/06 06:57:35 INFO mapred.JobClient: Reduce input groups=5
12/08/06 06:57:35 INFO mapred.JobClient: Combine output records=5
12/08/06 06:57:35 INFO mapred.JobClient: Physical memory (bytes) snapshot=464568320
12/08/06 06:57:35 INFO mapred.JobClient: Reduce output records=5
12/08/06 06:57:35 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1539559424
12/08/06 06:57:35 INFO mapred.JobClient: Map output records=5
12/08/06 06:57:35 INFO mapred.FileInputFormat: Total input paths to process : 1
12/08/06 06:57:35 INFO mapred.JobClient: Running job: job_201208030925_0012
12/08/06 06:57:36 INFO mapred.JobClient: map 0% reduce 0%
12/08/06 06:57:50 INFO mapred.JobClient: map 100% reduce 0%
12/08/06 06:58:05 INFO mapred.JobClient: map 100% reduce 100%
12/08/06 06:58:10 INFO mapred.JobClient: Job complete: job_201208030925_0012
12/08/06 06:58:10 INFO mapred.JobClient: Counters: 30
12/08/06 06:58:10 INFO mapred.JobClient: Job Counters
12/08/06 06:58:10 INFO mapred.JobClient: Launched reduce tasks=1
12/08/06 06:58:10 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=15432
12/08/06 06:58:10 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/08/06 06:58:10 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/08/06 06:58:10 INFO mapred.JobClient: Rack-local map tasks=1
12/08/06 06:58:10 INFO mapred.JobClient: Launched map tasks=1
12/08/06 06:58:10 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=14264
12/08/06 06:58:10 INFO mapred.JobClient: File Input Format Counters
12/08/06 06:58:10 INFO mapred.JobClient: Bytes Read=391
12/08/06 06:58:10 INFO mapred.JobClient: File Output Format Counters
12/08/06 06:58:10 INFO mapred.JobClient: Bytes Written=235
12/08/06 06:58:10 INFO mapred.JobClient: FileSystemCounters
12/08/06 06:58:10 INFO mapred.JobClient: FILE_BYTES_READ=281
12/08/06 06:58:10 INFO mapred.JobClient: HDFS_BYTES_READ=505
12/08/06 06:58:10 INFO mapred.JobClient: FILE_BYTES_WRITTEN=42985
12/08/06 06:58:10 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=235
12/08/06 06:58:10 INFO mapred.JobClient: Map-Reduce Framework
12/08/06 06:58:10 INFO mapred.JobClient: Map output materialized bytes=281
12/08/06 06:58:10 INFO mapred.JobClient: Map input records=5
12/08/06 06:58:10 INFO mapred.JobClient: Reduce shuffle bytes=0
12/08/06 06:58:10 INFO mapred.JobClient: Spilled Records=10
EDIT Driver Class for Grep:
Grep.java
/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.examples;
import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
/* Extracts matching regexs from input files and counts them. */
public class Grep extends Configured implements Tool {
  private Grep() {} // singleton
  public int run(String[] args) throws Exception {
    if (args.length < 3) {
      System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
      ToolRunner.printGenericCommandUsage(System.out);
      return -1;
    }
    Path tempDir =
      new Path("grep-temp-"+
          Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
    JobConf grepJob = new JobConf(getConf(), Grep.class);
    try {
      grepJob.setJobName("grep-search");
      FileInputFormat.setInputPaths(grepJob, args[0]);
      grepJob.setMapperClass(RegexMapper.class);
      grepJob.set("mapred.mapper.regex", args[2]);
      if (args.length == 4)
        grepJob.set("mapred.mapper.regex.group", args[3]);
      grepJob.setCombinerClass(LongSumReducer.class);
      grepJob.setReducerClass(LongSumReducer.class);
      FileOutputFormat.setOutputPath(grepJob, tempDir);
      grepJob.setOutputFormat(SequenceFileOutputFormat.class);
      grepJob.setOutputKeyClass(Text.class);
      grepJob.setOutputValueClass(LongWritable.class);
      JobClient.runJob(grepJob);
      JobConf sortJob = new JobConf(getConf(), Grep.class);
      sortJob.setJobName("grep-sort");
      FileInputFormat.setInputPaths(sortJob, tempDir);
      sortJob.setInputFormat(SequenceFileInputFormat.class);
      sortJob.setMapperClass(InverseMapper.class);
      sortJob.setNumReduceTasks(1); // write a single file
      FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
      sortJob.setOutputKeyComparatorClass // sort by decreasing freq
        (LongWritable.DecreasingComparator.class);
      JobClient.runJob(sortJob);
    }
    finally {
      FileSystem.get(grepJob).delete(tempDir, true);
    }
    return 0;
  }
  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new Grep(), args);
    System.exit(res);
  }
}
The output contains the statistics of two jobs: job_201208030925_0011 and job_201208030925_0012. The Grep driver runs two MapReduce jobs in sequence, "grep-search" and then "grep-sort" (note the two JobClient.runJob calls), so each job prints its own map and reduce progress percentages.
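One quick way to confirm this on a saved console log is to collect the "Running job:" lines. The sketch below assumes only the log format shown in the question; the class name JobIdScanner is made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that lists the job IDs announced in a JobClient console log.
final class JobIdScanner {
    // Matches the job ID in a "Running job: job_..." line printed by JobClient.
    private static final Pattern RUNNING_JOB =
        Pattern.compile("Running job: (job_\\d+_\\d+)");

    // Returns the job IDs in the order the jobs were started.
    static List<String> jobIds(List<String> logLines) {
        List<String> ids = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = RUNNING_JOB.matcher(line);
            if (m.find()) {
                ids.add(m.group(1));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Lines copied from the log in the question.
        List<String> log = Arrays.asList(
            "12/08/06 06:56:58 INFO mapred.JobClient: Running job: job_201208030925_0011",
            "12/08/06 06:56:59 INFO mapred.JobClient: map 0% reduce 0%",
            "12/08/06 06:57:35 INFO mapred.JobClient: Running job: job_201208030925_0012");
        System.out.println(jobIds(log));
    }
}
```

Two IDs in the result mean two separate jobs, each with its own progress lines.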

Too many fetch failures: Hadoop on cluster (x2)

I have been using Hadoop for the last week or so (trying to get to grips with it), and although I have been able to set up a multinode cluster (2 machines: 1 laptop and a small desktop) and retrieve results, I always seem to encounter "Too many fetch failures" when I run a hadoop job.
An example output (on a trivial wordcount example) is:
hadoop@ap200:/usr/local/hadoop$ bin/hadoop jar hadoop-examples-0.20.203.0.jar wordcount sita sita-output3X
11/05/20 15:02:05 INFO input.FileInputFormat: Total input paths to process : 7
11/05/20 15:02:05 INFO mapred.JobClient: Running job: job_201105201500_0001
11/05/20 15:02:06 INFO mapred.JobClient: map 0% reduce 0%
11/05/20 15:02:23 INFO mapred.JobClient: map 28% reduce 0%
11/05/20 15:02:26 INFO mapred.JobClient: map 42% reduce 0%
11/05/20 15:02:29 INFO mapred.JobClient: map 57% reduce 0%
11/05/20 15:02:32 INFO mapred.JobClient: map 100% reduce 0%
11/05/20 15:02:41 INFO mapred.JobClient: map 100% reduce 9%
11/05/20 15:02:49 INFO mapred.JobClient: Task Id : attempt_201105201500_0001_m_000003_0, Status : FAILED
Too many fetch-failures
11/05/20 15:02:53 INFO mapred.JobClient: map 85% reduce 9%
11/05/20 15:02:57 INFO mapred.JobClient: map 100% reduce 9%
11/05/20 15:03:10 INFO mapred.JobClient: Task Id : attempt_201105201500_0001_m_000002_0, Status : FAILED
Too many fetch-failures
11/05/20 15:03:14 INFO mapred.JobClient: map 85% reduce 9%
11/05/20 15:03:17 INFO mapred.JobClient: map 100% reduce 9%
11/05/20 15:03:25 INFO mapred.JobClient: Task Id : attempt_201105201500_0001_m_000006_0, Status : FAILED
Too many fetch-failures
11/05/20 15:03:29 INFO mapred.JobClient: map 85% reduce 9%
11/05/20 15:03:32 INFO mapred.JobClient: map 100% reduce 9%
11/05/20 15:03:35 INFO mapred.JobClient: map 100% reduce 28%
11/05/20 15:03:41 INFO mapred.JobClient: map 100% reduce 100%
11/05/20 15:03:46 INFO mapred.JobClient: Job complete: job_201105201500_0001
11/05/20 15:03:46 INFO mapred.JobClient: Counters: 25
11/05/20 15:03:46 INFO mapred.JobClient: Job Counters
11/05/20 15:03:46 INFO mapred.JobClient: Launched reduce tasks=1
11/05/20 15:03:46 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=72909
11/05/20 15:03:46 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/05/20 15:03:46 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/05/20 15:03:46 INFO mapred.JobClient: Launched map tasks=10
11/05/20 15:03:46 INFO mapred.JobClient: Data-local map tasks=10
11/05/20 15:03:46 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=76116
11/05/20 15:03:46 INFO mapred.JobClient: File Output Format Counters
11/05/20 15:03:46 INFO mapred.JobClient: Bytes Written=1412473
11/05/20 15:03:46 INFO mapred.JobClient: FileSystemCounters
11/05/20 15:03:46 INFO mapred.JobClient: FILE_BYTES_READ=4462381
11/05/20 15:03:46 INFO mapred.JobClient: HDFS_BYTES_READ=6950740
11/05/20 15:03:46 INFO mapred.JobClient: FILE_BYTES_WRITTEN=7546513
11/05/20 15:03:46 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1412473
11/05/20 15:03:46 INFO mapred.JobClient: File Input Format Counters
11/05/20 15:03:46 INFO mapred.JobClient: Bytes Read=6949956
11/05/20 15:03:46 INFO mapred.JobClient: Map-Reduce Framework
11/05/20 15:03:46 INFO mapred.JobClient: Reduce input groups=128510
11/05/20 15:03:46 INFO mapred.JobClient: Map output materialized bytes=2914947
11/05/20 15:03:46 INFO mapred.JobClient: Combine output records=201001
11/05/20 15:03:46 INFO mapred.JobClient: Map input records=137146
11/05/20 15:03:46 INFO mapred.JobClient: Reduce shuffle bytes=2914947
11/05/20 15:03:46 INFO mapred.JobClient: Reduce output records=128510
11/05/20 15:03:46 INFO mapred.JobClient: Spilled Records=507835
11/05/20 15:03:46 INFO mapred.JobClient: Map output bytes=11435785
11/05/20 15:03:46 INFO mapred.JobClient: Combine input records=1174986
11/05/20 15:03:46 INFO mapred.JobClient: Map output records=1174986
11/05/20 15:03:46 INFO mapred.JobClient: SPLIT_RAW_BYTES=784
11/05/20 15:03:46 INFO mapred.JobClient: Reduce input records=201001
I googled the problem, and the people at Apache seem to suggest it could be anything from a networking problem (or something to do with the /etc/hosts files) to a corrupt disk on the slave nodes.
Just to add: I do see 2 "live nodes" on the namenode admin panel (localhost:50070/dfshealth), and under the Map/Reduce admin I see 2 nodes as well.
Any clues as to how I can avoid these errors?
Thanks in advance.
Edit:1:
The tasktracker log is on: http://pastebin.com/XMkNBJTh
The datanode log is on: http://pastebin.com/ttjR7AYZ
Many thanks.
Modify the /etc/hosts file on the datanode.
Each line is divided into three parts: the first part is the network IP address, the second part is the host name or domain name, and the third part is the host alias. The detailed steps are as follows:
First check the host name:
cat /proc/sys/kernel/hostname
You will see a HOSTNAME attribute. Change the value after it to the IP address, then save and exit.
Use the command:
hostname ***.***.***.***
Replace the asterisks with the corresponding IP address.
Modify the hosts configuration similarly, as follows:
127.0.0.1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6
10.200.187.77 10.200.187.77 hadoop-datanode
If the host name still shows a problem after the IP address has been configured and modified, continue to modify the hosts file.
The following solution should work:
1. Remove or comment out the lines with IP 127.0.0.1 and 127.0.1.1.
2. Use the host name, not an alias, to refer to each node in the hosts file and in the masters/slaves files in the Hadoop directory:
--> in the hosts file: 172.21.3.67 master-ubuntu
--> in the masters/slaves file: master-ubuntu
3. Check that the namespaceID of the namenode equals the namespaceID of the datanode.
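Before editing /etc/hosts, it can help to check what each node's own host name currently resolves to. The following is a minimal sketch using only java.net.InetAddress; the class name HostnameCheck is made up for illustration, and the idea that a loopback result is the cause of the fetch failures is an assumption based on the answers above, not something the logs prove.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical diagnostic: does this machine's host name resolve to a loopback address?
final class HostnameCheck {
    // True when the given IP literal is a loopback address (127.x.x.x or ::1).
    // An IP literal is parsed directly, so no DNS lookup is performed.
    static boolean isLoopbackLiteral(String ip) throws UnknownHostException {
        return InetAddress.getByName(ip).isLoopbackAddress();
    }

    public static void main(String[] args) throws UnknownHostException {
        // Resolve the local host name the same way a Java process on this node would.
        InetAddress local = InetAddress.getLocalHost();
        System.out.println(local.getHostName() + " -> " + local.getHostAddress());
        if (local.isLoopbackAddress()) {
            System.out.println("Host name resolves to loopback; "
                + "other nodes cannot reach this node at that address.");
        }
    }
}
```

Running main on each node shows whether the host name maps to an address such as 127.0.1.1, which is the kind of /etc/hosts entry step 1 asks you to remove.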
I had the same problem: "Too many fetch failures" and very slow Hadoop performance (the simple wordcount example took more than 20 minutes to run on a 2-node cluster of powerful servers). I also got "WARN mapred.JobClient: Error reading task outputConnection refused" errors.
The problem was fixed, when I followed the instruction by Thomas Jungblut: I removed my master node from the slaves configuration file. After this, the errors disappeared and the wordcount example took only 1 minute.

Resources