DiskErrorException on slave machine - Hadoop multinode - hadoop

I am trying to process XML files from hadoop, i got following error on invoking word-count job on XML files .
13/07/25 12:39:57 INFO mapred.JobClient: Task Id : attempt_201307251234_0001_m_000008_0, Status : FAILED
Too many fetch-failures
13/07/25 12:39:58 INFO mapred.JobClient: map 99% reduce 0%
13/07/25 12:39:59 INFO mapred.JobClient: map 100% reduce 0%
13/07/25 12:40:56 INFO mapred.JobClient: Task Id : attempt_201307251234_0001_m_000009_0, Status : FAILED
Too many fetch-failures
13/07/25 12:40:58 INFO mapred.JobClient: map 99% reduce 0%
13/07/25 12:40:59 INFO mapred.JobClient: map 100% reduce 0%
13/07/25 12:41:22 INFO mapred.JobClient: map 100% reduce 1%
13/07/25 12:41:57 INFO mapred.JobClient: Task Id : attempt_201307251234_0001_m_000015_0, Status : FAILED
Too many fetch-failures
13/07/25 12:41:58 INFO mapred.JobClient: map 99% reduce 1%
13/07/25 12:41:59 INFO mapred.JobClient: map 100% reduce 1%
13/07/25 12:42:57 INFO mapred.JobClient: Task Id : attempt_201307251234_0001_m_000014_0, Status : FAILED
Too many fetch-failures
13/07/25 12:42:58 INFO mapred.JobClient: map 99% reduce 1%
13/07/25 12:42:59 INFO mapred.JobClient: map 100% reduce 1%
13/07/25 12:43:22 INFO mapred.JobClient: map 100% reduce 2%
i observer following error at hadoop-hduser-tasktracker-localhost.localdomain.log file on slave machine .
2013-07-25 12:38:58,124 WARN org.apache.hadoop.mapred.TaskTracker: getMapOutput(attempt_201307251234_0001_m_000001_0,0) failed :
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/hduser/jobcache/job_201307251234_0001/attempt_201307251234_0001_m_000001_0/output/file.out.index in any of the configured local directories
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:429)
This works fine when i ran for text files

Looks like you have hit this issue. Either apply the patch or download the fixed version, and you should be good to go.
HTH

Related

Hadoop mapper phase stuck after 19% In which case this possibility can occur?

My MapReduce program is running fine with other MR code. There is no error in the code. Still it is getting stuck.
15/05/28 19:53:29 INFO input.FileInputFormat: Total input paths to process : 1
15/05/28 19:53:31 INFO mapred.JobClient: Running job: job_201504101709_0927
15/05/28 19:53:32 INFO mapred.JobClient: map 0% reduce 0%
15/05/28 19:53:46 INFO mapred.JobClient: map 19% reduce 0%
15/05/28 20:03:50 INFO mapred.JobClient: map 0% reduce 0%
15/05/28 20:03:51 INFO mapred.JobClient: Task Id : attempt_201504101709_0927_m_000000_0, Status : FAILED
Task attempt_201504101709_0927_m_000000_0 failed to report status for 602 seconds. Killing!
Possible reason for this might be a bug in Mapper, like an infinite loop. Just check if everything is fine in Mapper. If you feel its not problem, update your question with your mapper code.

hadoop yarn single node performance tuning

I have hadoop 2.5.2 single mode installation on my Ubuntu VM, which is: 4-core, 3GHz per core; 4G memory. This VM is not for production, only for demo and learning.
Then, I wrote a vey simple map-reduce application using python, and use this application to process 49 xmls. All these xml files are small-size, hundreds of lines each. So, I expected a quick process. But, big22 surprise to me, it took more than 20 minutes to finish the job (the output of the job is correct.). Below is the output metrics :
14/12/15 19:37:55 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/12/15 19:37:57 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/12/15 19:38:03 INFO mapred.FileInputFormat: Total input paths to process : 49
14/12/15 19:38:06 INFO mapreduce.JobSubmitter: number of splits:49
14/12/15 19:38:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1418368500264_0005
14/12/15 19:38:10 INFO impl.YarnClientImpl: Submitted application application_1418368500264_0005
14/12/15 19:38:10 INFO mapreduce.Job: Running job: job_1418368500264_0005
14/12/15 19:38:59 INFO mapreduce.Job: Job job_1418368500264_0005 running in uber mode : false
14/12/15 19:38:59 INFO mapreduce.Job: map 0% reduce 0%
14/12/15 19:39:42 INFO mapreduce.Job: map 2% reduce 0%
14/12/15 19:40:05 INFO mapreduce.Job: map 4% reduce 0%
14/12/15 19:40:28 INFO mapreduce.Job: map 6% reduce 0%
14/12/15 19:40:49 INFO mapreduce.Job: map 8% reduce 0%
14/12/15 19:41:10 INFO mapreduce.Job: map 10% reduce 0%
14/12/15 19:41:29 INFO mapreduce.Job: map 12% reduce 0%
14/12/15 19:41:50 INFO mapreduce.Job: map 14% reduce 0%
14/12/15 19:42:08 INFO mapreduce.Job: map 16% reduce 0%
14/12/15 19:42:28 INFO mapreduce.Job: map 18% reduce 0%
14/12/15 19:42:49 INFO mapreduce.Job: map 20% reduce 0%
14/12/15 19:43:08 INFO mapreduce.Job: map 22% reduce 0%
14/12/15 19:43:28 INFO mapreduce.Job: map 24% reduce 0%
14/12/15 19:43:48 INFO mapreduce.Job: map 27% reduce 0%
14/12/15 19:44:09 INFO mapreduce.Job: map 29% reduce 0%
14/12/15 19:44:29 INFO mapreduce.Job: map 31% reduce 0%
14/12/15 19:44:49 INFO mapreduce.Job: map 33% reduce 0%
14/12/15 19:45:09 INFO mapreduce.Job: map 35% reduce 0%
14/12/15 19:45:28 INFO mapreduce.Job: map 37% reduce 0%
14/12/15 19:45:49 INFO mapreduce.Job: map 39% reduce 0%
14/12/15 19:46:09 INFO mapreduce.Job: map 41% reduce 0%
14/12/15 19:46:29 INFO mapreduce.Job: map 43% reduce 0%
14/12/15 19:46:49 INFO mapreduce.Job: map 45% reduce 0%
14/12/15 19:47:09 INFO mapreduce.Job: map 47% reduce 0%
14/12/15 19:47:29 INFO mapreduce.Job: map 49% reduce 0%
14/12/15 19:47:49 INFO mapreduce.Job: map 51% reduce 0%
14/12/15 19:48:08 INFO mapreduce.Job: map 53% reduce 0%
14/12/15 19:48:28 INFO mapreduce.Job: map 55% reduce 0%
14/12/15 19:48:48 INFO mapreduce.Job: map 57% reduce 0%
14/12/15 19:49:09 INFO mapreduce.Job: map 59% reduce 0%
14/12/15 19:49:29 INFO mapreduce.Job: map 61% reduce 0%
14/12/15 19:49:55 INFO mapreduce.Job: map 63% reduce 0%
14/12/15 19:50:23 INFO mapreduce.Job: map 65% reduce 0%
14/12/15 19:50:53 INFO mapreduce.Job: map 67% reduce 0%
14/12/15 19:51:22 INFO mapreduce.Job: map 69% reduce 0%
14/12/15 19:51:50 INFO mapreduce.Job: map 71% reduce 0%
14/12/15 19:52:18 INFO mapreduce.Job: map 73% reduce 0%
14/12/15 19:52:48 INFO mapreduce.Job: map 76% reduce 0%
14/12/15 19:53:18 INFO mapreduce.Job: map 78% reduce 0%
14/12/15 19:53:48 INFO mapreduce.Job: map 80% reduce 0%
14/12/15 19:54:18 INFO mapreduce.Job: map 82% reduce 0%
14/12/15 19:54:48 INFO mapreduce.Job: map 84% reduce 0%
14/12/15 19:55:19 INFO mapreduce.Job: map 86% reduce 0%
14/12/15 19:55:48 INFO mapreduce.Job: map 88% reduce 0%
14/12/15 19:56:16 INFO mapreduce.Job: map 90% reduce 0%
14/12/15 19:56:44 INFO mapreduce.Job: map 92% reduce 0%
14/12/15 19:57:14 INFO mapreduce.Job: map 94% reduce 0%
14/12/15 19:57:45 INFO mapreduce.Job: map 96% reduce 0%
14/12/15 19:58:15 INFO mapreduce.Job: map 98% reduce 0%
14/12/15 19:58:46 INFO mapreduce.Job: map 100% reduce 0%
14/12/15 19:59:20 INFO mapreduce.Job: map 100% reduce 100%
14/12/15 19:59:28 INFO mapreduce.Job: Job job_1418368500264_0005 completed successfully
14/12/15 19:59:30 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=17856
FILE: Number of bytes written=5086434
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=499030
HDFS: Number of bytes written=10049
HDFS: Number of read operations=150
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=49
Launched reduce tasks=1
Data-local map tasks=49
Total time spent by all maps in occupied slots (ms)=8854232
Total time spent by all reduces in occupied slots (ms)=284672
Total time spent by all map tasks (ms)=1106779
Total time spent by all reduce tasks (ms)=35584
Total vcore-seconds taken by all map tasks=1106779
Total vcore-seconds taken by all reduce tasks=35584
Total megabyte-seconds taken by all map tasks=1133341696
Total megabyte-seconds taken by all reduce tasks=36438016
Map-Reduce Framework
Map input records=9352
Map output records=296
Map output bytes=17258
Map output materialized bytes=18144
Input split bytes=6772
Combine input records=0
Combine output records=0
Reduce input groups=53
Reduce shuffle bytes=18144
Reduce input records=296
Reduce output records=52
Spilled Records=592
Shuffled Maps =49
Failed Shuffles=0
Merged Map outputs=49
GC time elapsed (ms)=33590
CPU time spent (ms)=191390
Physical memory (bytes) snapshot=13738057728
Virtual memory (bytes) snapshot=66425016320
Total committed heap usage (bytes)=10799808512
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=492258
File Output Format Counters
Bytes Written=10049
14/12/15 19:59:30 INFO streaming.StreamJob: Output directory: /data_output/sb50projs_1_output
As a newbie to hadoop, for this crazy unreasonable performance, I have several questions:
how to configure my hadoop/yarn/mapreduce to make the whole environment more convenient for trial usage?
I understand hadoop is designed for huge-data and big files. But for a trial environment, my files are small and my data is very limited, which default configuration items should I change? I have changed "dfs.blocksize" of hdfs-site.xml to a smaller value to match my small files, but seems no big enhancements. I know there are some JVM configuration items in yarn-site.xml and mapred-site.xml, but I am not sure about how to adjust them.
how to read hadoop logs
Under the logs folder, there are separate log files for nodemanager/resourcemanager/namenode/datanode. I tried to read these files to understand how the 20 minutes are spent during the process, but it's not easy for a newbie like me. So I wonder is there any tool/UI could help me to analyze the logs.
basic performance tuning tools
Actually I have googled around for this question, and I got a bunch of names like Ganglia/Nagios/Vaidya/Ambari. I want to know, which tool is best analyse the issue like , "why it took 20 minutes to do such a simple job?".
big number of hadoop processes
Even if there is no job running on my hadoop, I found around 100 hadoop processes on my VM, like below (I am using htop, and sort the result by memory). Is this normal for hadoop ? Or am I incorrect for some environment configuration?
You don't have to change anything.
The default configuration is done for small environment. You may change it if you grow the environment. Ant there is a lot of params and a lot of time for fine tuning.
But I admit your configuration is smaller than the usual ones for tests.
The log you have to read isn't the services ones but the job ones. Find them in /var/log/hadoop-yarn/containers/
If you want a better view of your MR, use the web interface on http://127.0.0.1:8088/. You will see your job's progression in real time.
IMO, Basic tuning = use hadoop web interfaces. There are plenty available natively.
I think you find your problem. This can be nomal, or not.
But quickly, YARN launch MR to use all the available memory :
Available memory is set in your yarn-site.xml : yarn.nodemanager.resource.memory-mb (default to 8 Gio).
Memory for a task is defined in mapred-site.xml or in the task itself by the property : mapreduce.map.memory.mb (default to 1536 Mio)
So :
Change the available memory for your nodemanager (to 3Gio, in order to let 1 Gio for the system)
Change the memory available for hadoop services (-Xmx in hadoop-env.sh, yarn-env.sh) (system + each hadoop services (namenode / datanode / ressourcemanager / nodemanager) < 1 Gio.
Change the memory for your map tasks (512 Mio ?). The lesser it is, more task can be executed in the same time.
Change yarn.scheduler.minimum-allocation-mb to 512 in yarn-site.xml to allow mappers with less than 1 Gio of memory.
I hope this will help you.

hadoop test examples to validate the installation

I have successfully configured Hadoop 2.4 on my Ubuntu 14.04 using this tutorial.
http://dogdogfish.com/2014/04/26/installing-hadoop-2-4-on-ubuntu-14-04/
Now after completing installtion how can I perform test on it?
How and where can I get the test data or jar files?
You have some example jars in your hadoop installation directory.
Simplest thing you can do is run the teragen example(or wordcount).
It is the first step in perform terasort.
Steps:
Go to the hadoop installation directory.
Run "hadoop jar hadoop-examples-0.20.2-cdh3u0.jar" to see all the jars you can run.
Go to home/[user] directory and create a file "example.txt" with the following data
"This is a file to test Hadoop Installation example
For the sake of the experiment, consider it to be 1TB"
While you are in that directory, run "hadoop dfs -put examples.txt /" this uploads the file onto your HDFS
Run "hadoop dfs -ls /" to check it is on there
Go to your Hadoop installation directory and run "hadoop jar hadoop-examples-0.20.2-cdh3u0.jar teragen 1000 /user/teragendata" - 1000 is the size data is to be broken into and the other param is the output directory.
On successful execution, you will see something like the text at the bottom.
Now to see how your MR job was run, in your browser open JobTracker and see the completed jobs. "localhost50030/jobtracker.jsp"
cloudera#cloudera-vm:/usr/lib/hadoop$ hadoop jar hadoop-examples-0.20.2-cdh3u0.jar teragen 600 /user/teragendata
Generating 600 using 2 maps with step of 300
14/07/24 09:02:44 INFO mapred.JobClient: Running job: job_201407230030_0008
14/07/24 09:02:45 INFO mapred.JobClient: map 0% reduce 0%
14/07/24 09:02:57 INFO mapred.JobClient: map 100% reduce 0%
14/07/24 09:03:00 INFO mapred.JobClient: Job complete: job_201407230030_0008
14/07/24 09:03:00 INFO mapred.JobClient: Counters: 13
14/07/24 09:03:00 INFO mapred.JobClient: Job Counters
14/07/24 09:03:00 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=22008
14/07/24 09:03:00 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/07/24 09:03:00 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/07/24 09:03:00 INFO mapred.JobClient: Launched map tasks=2
14/07/24 09:03:00 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
14/07/24 09:03:00 INFO mapred.JobClient: FileSystemCounters
14/07/24 09:03:00 INFO mapred.JobClient: HDFS_BYTES_READ=164
14/07/24 09:03:00 INFO mapred.JobClient: FILE_BYTES_WRITTEN=105150
14/07/24 09:03:00 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=60000
14/07/24 09:03:00 INFO mapred.JobClient: Map-Reduce Framework
14/07/24 09:03:00 INFO mapred.JobClient: Map input records=600
14/07/24 09:03:00 INFO mapred.JobClient: Spilled Records=0
14/07/24 09:03:00 INFO mapred.JobClient: Map input bytes=600
14/07/24 09:03:00 INFO mapred.JobClient: Map output records=600
14/07/24 09:03:00 INFO mapred.JobClient: SPLIT_RAW_BYTES=164

cloudera hadoop mapreduce job GC overhead limit exceeded error

I am running a canopy cluster job (using mahout) on a cloudera cdh4. the content to be clustered has about 1m records (each record is less than 1k in size). the whole hadoop environment (including all the nodes) is running in a vm with 4G memory. the installation of cdh4 is by default. I got the following exception when running the job.
It looks the job client should need a higher jvm heap size according to the exception. However, there are quite a few configuration options for jvm heap size in cloudera manager. I changed "Client Java Heap Size in Bytes" from 256MiB to 512MiB. However, it didnt improve.
Any hints/tips on setting these heap size options?
13/07/03 17:12:45 INFO input.FileInputFormat: Total input paths to process : 1
13/07/03 17:12:46 INFO mapred.JobClient: Running job: job_201307031710_0001
13/07/03 17:12:47 INFO mapred.JobClient: map 0% reduce 0%
13/07/03 17:13:06 INFO mapred.JobClient: map 1% reduce 0%
13/07/03 17:13:27 INFO mapred.JobClient: map 2% reduce 0%
13/07/03 17:14:01 INFO mapred.JobClient: map 3% reduce 0%
13/07/03 17:14:50 INFO mapred.JobClient: map 4% reduce 0%
13/07/03 17:15:50 INFO mapred.JobClient: map 5% reduce 0%
13/07/03 17:17:06 INFO mapred.JobClient: map 6% reduce 0%
13/07/03 17:18:44 INFO mapred.JobClient: map 7% reduce 0%
13/07/03 17:20:24 INFO mapred.JobClient: map 8% reduce 0%
13/07/03 17:22:20 INFO mapred.JobClient: map 9% reduce 0%
13/07/03 17:25:00 INFO mapred.JobClient: map 10% reduce 0%
13/07/03 17:28:08 INFO mapred.JobClient: map 11% reduce 0%
13/07/03 17:31:46 INFO mapred.JobClient: map 12% reduce 0%
13/07/03 17:35:57 INFO mapred.JobClient: map 13% reduce 0%
13/07/03 17:40:52 INFO mapred.JobClient: map 14% reduce 0%
13/07/03 17:46:55 INFO mapred.JobClient: map 15% reduce 0%
13/07/03 17:55:02 INFO mapred.JobClient: map 16% reduce 0%
13/07/03 18:08:42 INFO mapred.JobClient: map 17% reduce 0%
13/07/03 18:59:11 INFO mapred.JobClient: map 8% reduce 0%
13/07/03 18:59:13 INFO mapred.JobClient: Task Id : attempt_201307031710_0001_m_000001_0, Status : FAILED
Error: GC overhead limit exceeded
13/07/03 18:59:23 INFO mapred.JobClient: map 9% reduce 0%
13/07/03 19:00:09 INFO mapred.JobClient: map 10% reduce 0%
13/07/03 19:01:49 INFO mapred.JobClient: map 11% reduce 0%
13/07/03 19:04:25 INFO mapred.JobClient: map 12% reduce 0%
13/07/03 19:07:48 INFO mapred.JobClient: map 13% reduce 0%
13/07/03 19:12:48 INFO mapred.JobClient: map 14% reduce 0%
13/07/03 19:19:46 INFO mapred.JobClient: map 15% reduce 0%
13/07/03 19:29:05 INFO mapred.JobClient: map 16% reduce 0%
13/07/03 19:43:43 INFO mapred.JobClient: map 17% reduce 0%
13/07/03 20:49:36 INFO mapred.JobClient: map 8% reduce 0%
13/07/03 20:49:38 INFO mapred.JobClient: Task Id : attempt_201307031710_0001_m_000001_1, Status : FAILED
Error: GC overhead limit exceeded
13/07/03 20:49:48 INFO mapred.JobClient: map 9% reduce 0%
13/07/03 20:50:31 INFO mapred.JobClient: map 10% reduce 0%
13/07/03 20:52:08 INFO mapred.JobClient: map 11% reduce 0%
13/07/03 20:54:38 INFO mapred.JobClient: map 12% reduce 0%
13/07/03 20:58:01 INFO mapred.JobClient: map 13% reduce 0%
13/07/03 21:03:01 INFO mapred.JobClient: map 14% reduce 0%
13/07/03 21:10:10 INFO mapred.JobClient: map 15% reduce 0%
13/07/03 21:19:54 INFO mapred.JobClient: map 16% reduce 0%
13/07/03 21:31:35 INFO mapred.JobClient: map 8% reduce 0%
13/07/03 21:31:37 INFO mapred.JobClient: Task Id : attempt_201307031710_0001_m_000000_0, Status : FAILED
java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:250)
Caused by: java.io.IOException: Task process exit with nonzero status of 65.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:237)
13/07/03 21:32:09 INFO mapred.JobClient: map 9% reduce 0%
13/07/03 21:33:31 INFO mapred.JobClient: map 10% reduce 0%
13/07/03 21:35:42 INFO mapred.JobClient: map 11% reduce 0%
13/07/03 21:38:41 INFO mapred.JobClient: map 12% reduce 0%
13/07/03 21:42:27 INFO mapred.JobClient: map 13% reduce 0%
13/07/03 21:48:20 INFO mapred.JobClient: map 14% reduce 0%
13/07/03 21:56:12 INFO mapred.JobClient: map 15% reduce 0%
13/07/03 22:07:20 INFO mapred.JobClient: map 16% reduce 0%
13/07/03 22:26:36 INFO mapred.JobClient: map 17% reduce 0%
13/07/03 23:35:30 INFO mapred.JobClient: map 8% reduce 0%
13/07/03 23:35:32 INFO mapred.JobClient: Task Id : attempt_201307031710_0001_m_000000_1, Status : FAILED
Error: GC overhead limit exceeded
13/07/03 23:35:42 INFO mapred.JobClient: map 9% reduce 0%
13/07/03 23:36:16 INFO mapred.JobClient: map 10% reduce 0%
13/07/03 23:38:01 INFO mapred.JobClient: map 11% reduce 0%
13/07/03 23:40:47 INFO mapred.JobClient: map 12% reduce 0%
13/07/03 23:44:44 INFO mapred.JobClient: map 13% reduce 0%
13/07/03 23:50:42 INFO mapred.JobClient: map 14% reduce 0%
13/07/03 23:58:58 INFO mapred.JobClient: map 15% reduce 0%
13/07/04 00:10:22 INFO mapred.JobClient: map 16% reduce 0%
13/07/04 00:21:38 INFO mapred.JobClient: map 7% reduce 0%
13/07/04 00:21:40 INFO mapred.JobClient: Task Id : attempt_201307031710_0001_m_000001_2, Status : FAILED
java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:250)
Caused by: java.io.IOException: Task process exit with nonzero status of 65.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:237)
13/07/04 00:21:50 INFO mapred.JobClient: map 8% reduce 0%
13/07/04 00:22:27 INFO mapred.JobClient: map 9% reduce 0%
13/07/04 00:23:52 INFO mapred.JobClient: map 10% reduce 0%
13/07/04 00:26:00 INFO mapred.JobClient: map 11% reduce 0%
13/07/04 00:28:47 INFO mapred.JobClient: map 12% reduce 0%
13/07/04 00:32:17 INFO mapred.JobClient: map 13% reduce 0%
13/07/04 00:37:34 INFO mapred.JobClient: map 14% reduce 0%
13/07/04 00:44:30 INFO mapred.JobClient: map 15% reduce 0%
13/07/04 00:54:28 INFO mapred.JobClient: map 16% reduce 0%
13/07/04 01:16:30 INFO mapred.JobClient: map 17% reduce 0%
13/07/04 01:32:05 INFO mapred.JobClient: map 8% reduce 0%
13/07/04 01:32:08 INFO mapred.JobClient: Task Id : attempt_201307031710_0001_m_000000_2, Status : FAILED
Error: GC overhead limit exceeded
13/07/04 01:32:21 INFO mapred.JobClient: map 9% reduce 0%
13/07/04 01:33:26 INFO mapred.JobClient: map 10% reduce 0%
13/07/04 01:35:37 INFO mapred.JobClient: map 11% reduce 0%
13/07/04 01:38:48 INFO mapred.JobClient: map 12% reduce 0%
13/07/04 01:43:06 INFO mapred.JobClient: map 13% reduce 0%
13/07/04 01:49:58 INFO mapred.JobClient: map 14% reduce 0%
13/07/04 01:59:07 INFO mapred.JobClient: map 15% reduce 0%
13/07/04 02:12:00 INFO mapred.JobClient: map 16% reduce 0%
13/07/04 02:37:56 INFO mapred.JobClient: map 17% reduce 0%
13/07/04 03:31:55 INFO mapred.JobClient: map 8% reduce 0%
13/07/04 03:32:00 INFO mapred.JobClient: Job complete: job_201307031710_0001
13/07/04 03:32:00 INFO mapred.JobClient: Counters: 7
13/07/04 03:32:00 INFO mapred.JobClient: Job Counters
13/07/04 03:32:00 INFO mapred.JobClient: Failed map tasks=1
13/07/04 03:32:00 INFO mapred.JobClient: Launched map tasks=8
13/07/04 03:32:00 INFO mapred.JobClient: Data-local map tasks=8
13/07/04 03:32:00 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=11443502
13/07/04 03:32:00 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=0
13/07/04 03:32:00 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/07/04 03:32:00 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
Exception in thread "main" java.lang.RuntimeException: java.lang.InterruptedException: Canopy Job failed processing vector
The Mahout jobs are very memory intensive. I don't know whether the mappers or reducers are the culprits, but, either way you will have to tell Hadoop to give them more RAM. "GC Overhead Limit Exceeded" is just a way of saying "out of memory" -- means the JVM gave up trying to reclaim the last 0.01% of available RAM.
How you set this is indeed a little complex, because there are several properties and they changed in Hadoop 2. CDH4 can support Hadoop 1 or 2 -- which one are you using?
If I had to guess: set mapreduce.child.java.opts to -Xmx1g. But the right answer really depends on your version and your data.
You need to change your memory settings for Hadoop, as the memory allocated for Hadoop is not enough to accommodate the job requirement you are running, Try to increase the heap memory and verify, due to over uses of memory OS might be killing the processes due to which job is failing.

Too many fetch failures: Hadoop on cluster (x2)

I have been using Hadoop for the last week or so (trying to get to grips with it), and although I have been able to set up a multinode cluster (2 machines: 1 laptop and a small desktop) and retrieve results, I always seem to encounter "Too many fetch failures" when I run a hadoop job.
An example output (on a trivial wordcount example) is:
hadoop#ap200:/usr/local/hadoop$ bin/hadoop jar hadoop-examples-0.20.203.0.jar wordcount sita sita-output3X
11/05/20 15:02:05 INFO input.FileInputFormat: Total input paths to process : 7
11/05/20 15:02:05 INFO mapred.JobClient: Running job: job_201105201500_0001
11/05/20 15:02:06 INFO mapred.JobClient: map 0% reduce 0%
11/05/20 15:02:23 INFO mapred.JobClient: map 28% reduce 0%
11/05/20 15:02:26 INFO mapred.JobClient: map 42% reduce 0%
11/05/20 15:02:29 INFO mapred.JobClient: map 57% reduce 0%
11/05/20 15:02:32 INFO mapred.JobClient: map 100% reduce 0%
11/05/20 15:02:41 INFO mapred.JobClient: map 100% reduce 9%
11/05/20 15:02:49 INFO mapred.JobClient: Task Id : attempt_201105201500_0001_m_000003_0, Status : FAILED
Too many fetch-failures
11/05/20 15:02:53 INFO mapred.JobClient: map 85% reduce 9%
11/05/20 15:02:57 INFO mapred.JobClient: map 100% reduce 9%
11/05/20 15:03:10 INFO mapred.JobClient: Task Id : attempt_201105201500_0001_m_000002_0, Status : FAILED
Too many fetch-failures
11/05/20 15:03:14 INFO mapred.JobClient: map 85% reduce 9%
11/05/20 15:03:17 INFO mapred.JobClient: map 100% reduce 9%
11/05/20 15:03:25 INFO mapred.JobClient: Task Id : attempt_201105201500_0001_m_000006_0, Status : FAILED
Too many fetch-failures
11/05/20 15:03:29 INFO mapred.JobClient: map 85% reduce 9%
11/05/20 15:03:32 INFO mapred.JobClient: map 100% reduce 9%
11/05/20 15:03:35 INFO mapred.JobClient: map 100% reduce 28%
11/05/20 15:03:41 INFO mapred.JobClient: map 100% reduce 100%
11/05/20 15:03:46 INFO mapred.JobClient: Job complete: job_201105201500_0001
11/05/20 15:03:46 INFO mapred.JobClient: Counters: 25
11/05/20 15:03:46 INFO mapred.JobClient: Job Counters
11/05/20 15:03:46 INFO mapred.JobClient: Launched reduce tasks=1
11/05/20 15:03:46 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=72909
11/05/20 15:03:46 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/05/20 15:03:46 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/05/20 15:03:46 INFO mapred.JobClient: Launched map tasks=10
11/05/20 15:03:46 INFO mapred.JobClient: Data-local map tasks=10
11/05/20 15:03:46 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=76116
11/05/20 15:03:46 INFO mapred.JobClient: File Output Format Counters
11/05/20 15:03:46 INFO mapred.JobClient: Bytes Written=1412473
11/05/20 15:03:46 INFO mapred.JobClient: FileSystemCounters
11/05/20 15:03:46 INFO mapred.JobClient: FILE_BYTES_READ=4462381
11/05/20 15:03:46 INFO mapred.JobClient: HDFS_BYTES_READ=6950740
11/05/20 15:03:46 INFO mapred.JobClient: FILE_BYTES_WRITTEN=7546513
11/05/20 15:03:46 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1412473
11/05/20 15:03:46 INFO mapred.JobClient: File Input Format Counters
11/05/20 15:03:46 INFO mapred.JobClient: Bytes Read=6949956
11/05/20 15:03:46 INFO mapred.JobClient: Map-Reduce Framework
11/05/20 15:03:46 INFO mapred.JobClient: Reduce input groups=128510
11/05/20 15:03:46 INFO mapred.JobClient: Map output materialized bytes=2914947
11/05/20 15:03:46 INFO mapred.JobClient: Combine output records=201001
11/05/20 15:03:46 INFO mapred.JobClient: Map input records=137146
11/05/20 15:03:46 INFO mapred.JobClient: Reduce shuffle bytes=2914947
11/05/20 15:03:46 INFO mapred.JobClient: Reduce output records=128510
11/05/20 15:03:46 INFO mapred.JobClient: Spilled Records=507835
11/05/20 15:03:46 INFO mapred.JobClient: Map output bytes=11435785
11/05/20 15:03:46 INFO mapred.JobClient: Combine input records=1174986
11/05/20 15:03:46 INFO mapred.JobClient: Map output records=1174986
11/05/20 15:03:46 INFO mapred.JobClient: SPLIT_RAW_BYTES=784
11/05/20 15:03:46 INFO mapred.JobClient: Reduce input records=201001
I did a google on the problem, and the people at apache seem to suggest it could be anything from a networking problem (or something to do with /etc/hosts files) or could be a corrupt disk on the slave nodes.
Just to add: I do see 2 "live nodes" on namenode Admin panel (localhost:50070/dfshealth) and under Map/reduce Admin, I see 2 nodes aswell.
Any clues as to how I can avoid these errors?
Thanks in advance.
Edit:1:
The tasktracker log is on: http://pastebin.com/XMkNBJTh
The datanode log is on: http://pastebin.com/ttjR7AYZ
Many thanks.
Modify datanode node/etc/hosts file.
Each line is divided into three parts. The first part is the network IP address, the second part is the host name or domain name, the third part is the host alias detailed steps are as follows:
First check the host name:
cat / proc / sys / kernel / hostname
You will see a HOSTNAME attribute. Change the value of the IP behind on OK and then exit.
Use the command:
hostname ***. ***. ***. ***
Asterisk is replaced by the corresponding IP.
Modify the the hosts configuration similarly, as follows:
127.0.0.1 localhost.localdomain localhost
:: 1 localhost6.localdomain6 localhost6
10.200.187.77 10.200.187.77 hadoop-datanode
If the IP address is configured and successfully modified, or show host name there is a problem, continue to modify the hosts file.
Following solution will definitely work
1.Remove or comment line with Ip 127.0.0.1 and 127.0.1.1
2.use host name not alias for referring node in host file and Master/slave file present in hadoop directory
-->in Host file 172.21.3.67 master-ubuntu
-->in master/slave file master-ubuntu
3. see for NameSpaceId of namenode = NameSpaceId of Datanode
I had the same problem: "Too many fetch failures" and very slow Hadoop performance (the simple wordcount example took more than 20 minutes to run on a 2-node cluster of powerful servers). I also got "WARN mapred.JobClient: Error reading task outputConnection refused" errors.
The problem was fixed, when I followed the instruction by Thomas Jungblut: I removed my master node from the slaves configuration file. After this, the errors disappeared and the wordcount example took only 1 minute.

Resources