Hadoop MapReduce Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out

I get the error below when I run a MapReduce job with more than one input file, although the same job runs fine with a single input file.
I went through several posts, and almost all of them say the cause is a firewall issue or hostnames that are not set up properly in the /etc/hosts file.
But if that were the case, my MapReduce job would fail whether the input is a single file or a directory (multiple files).
Below is the console output.
INFO input.FileInputFormat: Total input paths to process : 2
WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
WARN snappy.LoadSnappy: Snappy native library not loaded
INFO mapred.JobClient: Running job: job_201505201700_0005
INFO mapred.JobClient: map 0% reduce 0%
INFO mapred.JobClient: map 50% reduce 0%
INFO mapred.JobClient: map 100% reduce 0%
INFO mapred.JobClient: map 100% reduce 16%
INFO mapred.JobClient: map 100% reduce 0%
INFO mapred.JobClient: Task Id : attempt_201505201700_0005_r_000000_0, Status : FAILED
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
WARN mapred.JobClient: Error reading task outputAMR-DEV02.local
WARN mapred.JobClient: Error reading task outputAMR-DEV02.local
INFO mapred.JobClient: map 100% reduce 16%
INFO mapred.JobClient: map 100% reduce 0%
INFO mapred.JobClient: Task Id : attempt_201505201700_0005_r_000000_1, Status : FAILED
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
WARN mapred.JobClient: Error reading task outputEmbeddedQASrv.local
WARN mapred.JobClient: Error reading task outputEmbeddedQASrv.local
INFO mapred.JobClient: map 100% reduce 16%
Note: EmbeddedQASrv.local (IP address 192.168.115.80) and AMR-DEV02.local (IP address 192.168.115.79) are the host names of my slave nodes.
My Hadoop cluster consists of 1 master and 2 slaves.
This is the command I am running from the console (emp_dept_data is a directory containing the empData and deptData files):
hadoop jar testdata/joindevice.jar JoinDevice emp_dept_data output15
However, if I run this command, the MapReduce job succeeds (single file as input):
hadoop jar testdata/joindevice.jar JoinDevice emp_dept_data/empData output16
Here are the /etc/hosts entries on the master node. The same entries were copied to the slave nodes as well.
127.0.0.1 amr-dev01.local amr-dev01 localhost
::1 localhost6.localdomain6 localhost6
#Hadoop Configurations
192.168.115.78 master
192.168.115.79 slave01
192.168.115.80 slave02
I have no idea what is wrong or where to look for the exact root cause.

The actual problem was with the /etc/hosts file. I commented out my localhost configuration:
amr-dev01.local amr-dev01 localhost
and instead of using different names such as master, slave01, slave02, I used the machines' own hostnames:
192.168.115.78 amr-dev01
192.168.115.79 amr-dev02
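A quick way to spot this kind of entry is to scan a hosts file for real host names bound to a loopback address. The Python sketch below is not from the original answer: the function name `loopback_hostnames` and the rule that names beginning with `localhost` are harmless are assumptions made for illustration. The sample text is the hosts file from the question.

```python
import ipaddress


def loopback_hostnames(hosts_text):
    """Return non-localhost names that a hosts file maps to a loopback address."""
    flagged = []
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        try:
            is_loopback = ipaddress.ip_address(address).is_loopback
        except ValueError:
            continue  # first field is not a parsable IP address; skip the line
        if is_loopback:
            # Assumption: names starting with "localhost" are expected here.
            flagged.extend(n for n in names if not n.startswith("localhost"))
    return flagged


# The /etc/hosts content quoted in the question.
SAMPLE = """\
127.0.0.1 amr-dev01.local amr-dev01 localhost
::1 localhost6.localdomain6 localhost6
#Hadoop Configurations
192.168.115.78 master
192.168.115.79 slave01
192.168.115.80 slave02
"""

print(loopback_hostnames(SAMPLE))  # prints ['amr-dev01.local', 'amr-dev01']
```

Any name this reports is one that other nodes may be told to contact but that resolves to 127.0.0.1 locally, which matches the symptom described above.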

Related

Hadoop mapper phase stuck after 19%: in which cases can this occur?

My MapReduce program is running fine with other MR code. There is no error in the code. Still it is getting stuck.
15/05/28 19:53:29 INFO input.FileInputFormat: Total input paths to process : 1
15/05/28 19:53:31 INFO mapred.JobClient: Running job: job_201504101709_0927
15/05/28 19:53:32 INFO mapred.JobClient: map 0% reduce 0%
15/05/28 19:53:46 INFO mapred.JobClient: map 19% reduce 0%
15/05/28 20:03:50 INFO mapred.JobClient: map 0% reduce 0%
15/05/28 20:03:51 INFO mapred.JobClient: Task Id : attempt_201504101709_0927_m_000000_0, Status : FAILED
Task attempt_201504101709_0927_m_000000_0 failed to report status for 602 seconds. Killing!
A possible reason for this is a bug in the Mapper, such as an infinite loop. Check that everything is fine in the Mapper. If you think that is not the problem, update your question with your mapper code.

In Hadoop, How can I find which slave node is executing an attempt N?

I'm using Hadoop 1.2.1, and my Hadoop application fails during Reduce. When I run it, I see messages like the following:
15/05/22 18:14:15 INFO mapred.JobClient: map 0% reduce 0%
15/05/22 18:14:25 INFO mapred.JobClient: map 100% reduce 0%
15/05/22 18:24:25 INFO mapred.JobClient: map 0% reduce 0%
15/05/22 18:24:26 INFO mapred.JobClient: Task Id : attempt_201505221804_0013_m_000000_0, Status : FAILED
Task attempt_201505221804_0013_m_000000_0 failed to report status for 600 seconds. Killing!
15/05/22 18:24:35 INFO mapred.JobClient: map 100% reduce 0%
I'd like to see the log of attempt_201505221804_0013_m_000000_0, but it is too time-consuming to work out which slave executed it.
Someone told me to use the Hadoop web pages to find it, but there is a firewall on this cluster and I can't change its settings because the cluster is not owned by our group.
Is there any way to find out where this attempt was executed?
You should be able to find this information in the jobtracker logs, which are by default under HADOOP_HOME/logs. They contain entries similar to this:
INFO org.apache.hadoop.mapred.JobTracker: Adding task (MAP) 'attempt_201503262103_0001_m_000000_0' to tip task_201503262103_0001_m_000000, for tracker 'host'
You can search the file for the specific attempt id.
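If grepping by hand is tedious, the search can be scripted. The Python sketch below is an illustration rather than part of the original answer: the regular expression is modelled only on the sample log entry quoted above, and the function name `find_trackers` is hypothetical. Real log lines may differ between Hadoop versions, so treat the pattern as an assumption.

```python
import re

# Pattern modelled on the sample JobTracker entry quoted above (an assumption;
# the exact wording may vary between Hadoop versions).
ASSIGNMENT = re.compile(
    r"Adding task \((?P<kind>\w+)\) '(?P<attempt>attempt_\w+)' "
    r"to tip (?P<tip>task_\w+), for tracker '(?P<tracker>[^']+)'"
)


def find_trackers(log_lines, attempt_id):
    """Return the tracker names that the given attempt id was assigned to."""
    trackers = []
    for line in log_lines:
        match = ASSIGNMENT.search(line)
        if match and match.group("attempt") == attempt_id:
            trackers.append(match.group("tracker"))
    return trackers


# The sample entry from the answer above.
sample_line = (
    "INFO org.apache.hadoop.mapred.JobTracker: Adding task (MAP) "
    "'attempt_201503262103_0001_m_000000_0' to tip "
    "task_201503262103_0001_m_000000, for tracker 'host'"
)

print(find_trackers([sample_line], "attempt_201503262103_0001_m_000000_0"))
# prints ['host']
```

To run it against a real log, pass an open file object as `log_lines`; the function only iterates over the lines, so large logs are read one line at a time.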

hadoop test examples to validate the installation

I have successfully configured Hadoop 2.4 on my Ubuntu 14.04 using this tutorial.
http://dogdogfish.com/2014/04/26/installing-hadoop-2-4-on-ubuntu-14-04/
Now that the installation is complete, how can I test it?
How and where can I get the test data or jar files?
You have some example jars in your Hadoop installation directory.
The simplest thing you can do is run the teragen example (or wordcount).
It is the first step in performing terasort.
Steps:
Go to the Hadoop installation directory.
Run "hadoop jar hadoop-examples-0.20.2-cdh3u0.jar" to see all the example programs you can run.
Go to the home/[user] directory and create a file "example.txt" with the following data
"This is a file to test Hadoop Installation example
For the sake of the experiment, consider it to be 1TB"
While you are in that directory, run "hadoop dfs -put example.txt /" to upload the file to your HDFS.
Run "hadoop dfs -ls /" to check that it is there.
Go to your Hadoop installation directory and run "hadoop jar hadoop-examples-0.20.2-cdh3u0.jar teragen 1000 /user/teragendata". Here 1000 is the number of rows to generate and the other parameter is the output directory.
On successful execution, you will see something like the text at the bottom.
To see how your MR job ran, open the JobTracker in your browser and look at the completed jobs: "localhost:50030/jobtracker.jsp"
cloudera@cloudera-vm:/usr/lib/hadoop$ hadoop jar hadoop-examples-0.20.2-cdh3u0.jar teragen 600 /user/teragendata
Generating 600 using 2 maps with step of 300
14/07/24 09:02:44 INFO mapred.JobClient: Running job: job_201407230030_0008
14/07/24 09:02:45 INFO mapred.JobClient: map 0% reduce 0%
14/07/24 09:02:57 INFO mapred.JobClient: map 100% reduce 0%
14/07/24 09:03:00 INFO mapred.JobClient: Job complete: job_201407230030_0008
14/07/24 09:03:00 INFO mapred.JobClient: Counters: 13
14/07/24 09:03:00 INFO mapred.JobClient: Job Counters
14/07/24 09:03:00 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=22008
14/07/24 09:03:00 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/07/24 09:03:00 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/07/24 09:03:00 INFO mapred.JobClient: Launched map tasks=2
14/07/24 09:03:00 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
14/07/24 09:03:00 INFO mapred.JobClient: FileSystemCounters
14/07/24 09:03:00 INFO mapred.JobClient: HDFS_BYTES_READ=164
14/07/24 09:03:00 INFO mapred.JobClient: FILE_BYTES_WRITTEN=105150
14/07/24 09:03:00 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=60000
14/07/24 09:03:00 INFO mapred.JobClient: Map-Reduce Framework
14/07/24 09:03:00 INFO mapred.JobClient: Map input records=600
14/07/24 09:03:00 INFO mapred.JobClient: Spilled Records=0
14/07/24 09:03:00 INFO mapred.JobClient: Map input bytes=600
14/07/24 09:03:00 INFO mapred.JobClient: Map output records=600
14/07/24 09:03:00 INFO mapred.JobClient: SPLIT_RAW_BYTES=164

DiskErrorException on slave machine - Hadoop multinode

I am trying to process XML files with Hadoop. I got the following error when invoking a word-count job on XML files.
13/07/25 12:39:57 INFO mapred.JobClient: Task Id : attempt_201307251234_0001_m_000008_0, Status : FAILED
Too many fetch-failures
13/07/25 12:39:58 INFO mapred.JobClient: map 99% reduce 0%
13/07/25 12:39:59 INFO mapred.JobClient: map 100% reduce 0%
13/07/25 12:40:56 INFO mapred.JobClient: Task Id : attempt_201307251234_0001_m_000009_0, Status : FAILED
Too many fetch-failures
13/07/25 12:40:58 INFO mapred.JobClient: map 99% reduce 0%
13/07/25 12:40:59 INFO mapred.JobClient: map 100% reduce 0%
13/07/25 12:41:22 INFO mapred.JobClient: map 100% reduce 1%
13/07/25 12:41:57 INFO mapred.JobClient: Task Id : attempt_201307251234_0001_m_000015_0, Status : FAILED
Too many fetch-failures
13/07/25 12:41:58 INFO mapred.JobClient: map 99% reduce 1%
13/07/25 12:41:59 INFO mapred.JobClient: map 100% reduce 1%
13/07/25 12:42:57 INFO mapred.JobClient: Task Id : attempt_201307251234_0001_m_000014_0, Status : FAILED
Too many fetch-failures
13/07/25 12:42:58 INFO mapred.JobClient: map 99% reduce 1%
13/07/25 12:42:59 INFO mapred.JobClient: map 100% reduce 1%
13/07/25 12:43:22 INFO mapred.JobClient: map 100% reduce 2%
I observed the following error in the hadoop-hduser-tasktracker-localhost.localdomain.log file on the slave machine.
2013-07-25 12:38:58,124 WARN org.apache.hadoop.mapred.TaskTracker: getMapOutput(attempt_201307251234_0001_m_000001_0,0) failed :
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/hduser/jobcache/job_201307251234_0001/attempt_201307251234_0001_m_000001_0/output/file.out.index in any of the configured local directories
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:429)
This works fine when I run it on text files.
Looks like you have hit this issue. Either apply the patch or download the fixed version, and you should be good to go.
HTH

Too many fetch failures: Hadoop on cluster (x2)

I have been using Hadoop for the last week or so (trying to get to grips with it), and although I have been able to set up a multinode cluster (2 machines: 1 laptop and a small desktop) and retrieve results, I always seem to encounter "Too many fetch failures" when I run a hadoop job.
An example output (on a trivial wordcount example) is:
hadoop@ap200:/usr/local/hadoop$ bin/hadoop jar hadoop-examples-0.20.203.0.jar wordcount sita sita-output3X
11/05/20 15:02:05 INFO input.FileInputFormat: Total input paths to process : 7
11/05/20 15:02:05 INFO mapred.JobClient: Running job: job_201105201500_0001
11/05/20 15:02:06 INFO mapred.JobClient: map 0% reduce 0%
11/05/20 15:02:23 INFO mapred.JobClient: map 28% reduce 0%
11/05/20 15:02:26 INFO mapred.JobClient: map 42% reduce 0%
11/05/20 15:02:29 INFO mapred.JobClient: map 57% reduce 0%
11/05/20 15:02:32 INFO mapred.JobClient: map 100% reduce 0%
11/05/20 15:02:41 INFO mapred.JobClient: map 100% reduce 9%
11/05/20 15:02:49 INFO mapred.JobClient: Task Id : attempt_201105201500_0001_m_000003_0, Status : FAILED
Too many fetch-failures
11/05/20 15:02:53 INFO mapred.JobClient: map 85% reduce 9%
11/05/20 15:02:57 INFO mapred.JobClient: map 100% reduce 9%
11/05/20 15:03:10 INFO mapred.JobClient: Task Id : attempt_201105201500_0001_m_000002_0, Status : FAILED
Too many fetch-failures
11/05/20 15:03:14 INFO mapred.JobClient: map 85% reduce 9%
11/05/20 15:03:17 INFO mapred.JobClient: map 100% reduce 9%
11/05/20 15:03:25 INFO mapred.JobClient: Task Id : attempt_201105201500_0001_m_000006_0, Status : FAILED
Too many fetch-failures
11/05/20 15:03:29 INFO mapred.JobClient: map 85% reduce 9%
11/05/20 15:03:32 INFO mapred.JobClient: map 100% reduce 9%
11/05/20 15:03:35 INFO mapred.JobClient: map 100% reduce 28%
11/05/20 15:03:41 INFO mapred.JobClient: map 100% reduce 100%
11/05/20 15:03:46 INFO mapred.JobClient: Job complete: job_201105201500_0001
11/05/20 15:03:46 INFO mapred.JobClient: Counters: 25
11/05/20 15:03:46 INFO mapred.JobClient: Job Counters
11/05/20 15:03:46 INFO mapred.JobClient: Launched reduce tasks=1
11/05/20 15:03:46 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=72909
11/05/20 15:03:46 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/05/20 15:03:46 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/05/20 15:03:46 INFO mapred.JobClient: Launched map tasks=10
11/05/20 15:03:46 INFO mapred.JobClient: Data-local map tasks=10
11/05/20 15:03:46 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=76116
11/05/20 15:03:46 INFO mapred.JobClient: File Output Format Counters
11/05/20 15:03:46 INFO mapred.JobClient: Bytes Written=1412473
11/05/20 15:03:46 INFO mapred.JobClient: FileSystemCounters
11/05/20 15:03:46 INFO mapred.JobClient: FILE_BYTES_READ=4462381
11/05/20 15:03:46 INFO mapred.JobClient: HDFS_BYTES_READ=6950740
11/05/20 15:03:46 INFO mapred.JobClient: FILE_BYTES_WRITTEN=7546513
11/05/20 15:03:46 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1412473
11/05/20 15:03:46 INFO mapred.JobClient: File Input Format Counters
11/05/20 15:03:46 INFO mapred.JobClient: Bytes Read=6949956
11/05/20 15:03:46 INFO mapred.JobClient: Map-Reduce Framework
11/05/20 15:03:46 INFO mapred.JobClient: Reduce input groups=128510
11/05/20 15:03:46 INFO mapred.JobClient: Map output materialized bytes=2914947
11/05/20 15:03:46 INFO mapred.JobClient: Combine output records=201001
11/05/20 15:03:46 INFO mapred.JobClient: Map input records=137146
11/05/20 15:03:46 INFO mapred.JobClient: Reduce shuffle bytes=2914947
11/05/20 15:03:46 INFO mapred.JobClient: Reduce output records=128510
11/05/20 15:03:46 INFO mapred.JobClient: Spilled Records=507835
11/05/20 15:03:46 INFO mapred.JobClient: Map output bytes=11435785
11/05/20 15:03:46 INFO mapred.JobClient: Combine input records=1174986
11/05/20 15:03:46 INFO mapred.JobClient: Map output records=1174986
11/05/20 15:03:46 INFO mapred.JobClient: SPLIT_RAW_BYTES=784
11/05/20 15:03:46 INFO mapred.JobClient: Reduce input records=201001
I googled the problem, and the people at Apache seem to suggest it could be anything from a networking problem (or something to do with /etc/hosts files) to a corrupt disk on the slave nodes.
Just to add: I do see 2 "live nodes" on the namenode admin panel (localhost:50070/dfshealth), and under the Map/Reduce admin page I see 2 nodes as well.
Any clues as to how I can avoid these errors?
Thanks in advance.
Edit 1:
The tasktracker log is on: http://pastebin.com/XMkNBJTh
The datanode log is on: http://pastebin.com/ttjR7AYZ
Many thanks.
Modify the /etc/hosts file on the datanode.
Each line is divided into three parts: the first is the network IP address, the second is the host name or domain name, and the third is the host alias. The detailed steps are as follows:
First check the host name:
cat /proc/sys/kernel/hostname
You will see a HOSTNAME attribute. Change the value after it to the IP address, then exit.
Use the command:
hostname ***.***.***.***
Replace the asterisks with the corresponding IP address.
Modify the hosts configuration similarly, as follows:
127.0.0.1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6
10.200.187.77 10.200.187.77 hadoop-datanode
If the IP address has been configured and modified successfully but the host name still shows a problem, continue modifying the hosts file.
The following solution will definitely work:
1. Remove or comment out the lines with IP 127.0.0.1 and 127.0.1.1.
2. Use the host name, not an alias, to refer to a node in the hosts file and in the masters/slaves files in the Hadoop directory:
--> in the hosts file: 172.21.3.67 master-ubuntu
--> in the masters/slaves file: master-ubuntu
3. Check that the namespaceID of the namenode equals the namespaceID of the datanode.
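Step 3 can be checked by comparing the `namespaceID` entries in the VERSION files that Hadoop 1.x keeps under the `current` directory of the namenode and datanode storage directories. The Python sketch below is only an illustration: the function names are hypothetical, the ID values in the example are made up, and where the VERSION files live depends on your dfs.name.dir and dfs.data.dir settings.

```python
def read_namespace_id(version_text):
    """Extract the namespaceID value from the text of a VERSION file."""
    for raw in version_text.splitlines():
        line = raw.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skip the timestamp comment and malformed lines
        key, value = line.split("=", 1)
        if key.strip() == "namespaceID":
            return value.strip()
    return None


def same_namespace(namenode_version, datanode_version):
    """True only if both texts carry a namespaceID and the two values match."""
    nn_id = read_namespace_id(namenode_version)
    dn_id = read_namespace_id(datanode_version)
    return nn_id is not None and nn_id == dn_id


# Made-up VERSION contents, for illustration only.
namenode_text = "#example\nnamespaceID=1234567\ncTime=0\nstorageType=NAME_NODE\n"
datanode_text = "#example\nnamespaceID=7654321\ncTime=0\nstorageType=DATA_NODE\n"

print(same_namespace(namenode_text, datanode_text))  # prints False
```

In practice you would read each VERSION file from disk on the respective node and pass its contents to `same_namespace`.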
I had the same problem: "Too many fetch failures" and very slow Hadoop performance (the simple wordcount example took more than 20 minutes to run on a 2-node cluster of powerful servers). I also got "WARN mapred.JobClient: Error reading task outputConnection refused" errors.
The problem was fixed when I followed the instructions from Thomas Jungblut: I removed my master node from the slaves configuration file. After this, the errors disappeared and the wordcount example took only 1 minute.
