I have loaded a file with about 6,000 data lines using the following commands:
A = load '/home/hduser/hdfsdrive/piginput/data/airlines.dat' using PigStorage(',') as (Airline_ID:int, Name:chararray, Alias:chararray, IATA:chararray, ICAO:chararray, Callsign:chararray, Country:chararray, Active:chararray);
B = foreach airline generate Country,Airline_ID;
C = group B by Country;
D = foreach C generate group,COUNT(B);
In the above code, I could execute the first 3 commands without any issue, but the 4th one runs for a very long time. I also tried the following:
dump C;
Even this got stuck at the same place. Here is the log:
2016-04-20 16:08:16,617 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2016-04-20 16:08:16,898 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already exists!
2016-04-20 16:08:17,125 INFO org.apache.hadoop.util.ProcessTree: setsid exited with exit code 0
2016-04-20 16:08:17,129 INFO org.apache.hadoop.mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1da9647b
2016-04-20 16:08:17,190 INFO org.apache.hadoop.mapred.ReduceTask: ShuffleRamManager: MemoryLimit=130652568, MaxSingleShuffleLimit=32663142
2016-04-20 16:08:17,195 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Thread started: Thread for merging on-disk files
2016-04-20 16:08:17,195 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Thread started: Thread for merging in memory files
2016-04-20 16:08:17,195 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Thread waiting: Thread for merging on-disk files
2016-04-20 16:08:17,196 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Need another 1 map output(s) where 0 is already in progress
2016-04-20 16:08:17,196 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Thread started: Thread for polling Map Completion Events
2016-04-20 16:08:17,196 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
2016-04-20 16:08:22,197 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
2016-04-20 16:09:18,202 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Need another 1 map output(s) where 1 is already in progress
2016-04-20 16:09:18,203 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
2016-04-20 16:10:18,208 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Need another 1 map output(s) where 1 is already in progress
2016-04-20 16:10:18,208 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
2016-04-20 16:11:18,214 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Need another 1 map output(s) where 1 is already in progress
2016-04-20 16:11:18,214 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
2016-04-20 16:11:22,395 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 copy failed: attempt_201604201138_0003_m_000000_0 from ubuntu
2016-04-20 16:11:22,396 WARN org.apache.hadoop.mapred.ReduceTask: java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:579)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
    at sun.net.www.http.HttpClient.New(HttpClient.java:308)
    at sun.net.www.http.HttpClient.New(HttpClient.java:326)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:998)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:934)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:852)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1636)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1593)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1493)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1401)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1333)
2016-04-20 16:11:22,398 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_201604201138_0003_r_000000_0: Failed fetch #1 from attempt_201604201138_0003_m_000000_0
2016-04-20 16:11:22,398 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 adding host ubuntu to penalty box, next contact in 12 seconds
2016-04-20 16:11:22,398 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0: Got 1 map-outputs from previous failures
2016-04-20 16:11:37,399 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
2016-04-20 16:12:19,403 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Need another 1 map output(s) where 1 is already in progress
2016-04-20 16:12:19,403 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201604201138_0003_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
I even stopped all the jobs and restarted them, but it did not help. My environment is Ubuntu / Hadoop 1.2.1 / Pig 0.15.0.
Please help.
Thanks, Sathish
I resolved this. The issue was that the IP address configured in /etc/hosts was incorrect. I updated it to the IP address actually assigned to the Ubuntu machine and restarted the Hadoop services. I found this mismatch in hadoop-hduser-jobtracker-ubuntu.log, where it says:
STARTUP_MSG: host = ubuntu/10.1.0.249
And hadoop-hduser-datanode-ubuntu.log threw the following error:
2016-04-25 12:23:05,738 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: 10.1.6.173/10.1.6.173:9000. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
Based on these errors, I could track the issue down to the IP address; I fixed it in the /etc/hosts file and restarted the server. After this, all the Hadoop jobs ran without any issues and I could load the data and run Pig scripts.
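For anyone hitting the same thing, the shape of the fix is below; the 10.1.0.249 address and the hostname ubuntu are taken from the logs above, so substitute your machine's real address and name:

```
# /etc/hosts -- the hostname must resolve to the address the machine actually uses
127.0.0.1    localhost
10.1.0.249   ubuntu
```

The machine's own hostname should not be mapped to 127.0.0.1, or the Hadoop daemons may bind to the loopback interface and become unreachable from other nodes.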
Thanks, Sathish.
You are loading the data into relation A but generating B from an undefined relation airline:
B = foreach airline generate Country,Airline_ID;
Should be
B = foreach A generate Country,Airline_ID;
Also, if you are counting the number of airlines per country, you will have to modify relation D as follows:
D = foreach C generate group as Country,COUNT(B.Airline_ID);
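Putting both corrections together, the full script from the question would read (same path and schema as above):

```pig
A = load '/home/hduser/hdfsdrive/piginput/data/airlines.dat' using PigStorage(',')
    as (Airline_ID:int, Name:chararray, Alias:chararray, IATA:chararray,
        ICAO:chararray, Callsign:chararray, Country:chararray, Active:chararray);
B = foreach A generate Country, Airline_ID;  -- A, not the undefined alias 'airline'
C = group B by Country;
D = foreach C generate group as Country, COUNT(B.Airline_ID) as Airline_Count;
dump D;
```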
Related
My Hadoop job was stuck at the last reduce tasks. I saw a lot of connection timeouts from 3 different hosts to a single host. However, I am able to ping the troublesome machine from every other machine.
This is a 5-node cluster, built very recently, with the same Hadoop and Pig binaries on every node. It has 3 new machines and 2 old ones. If I remove the 2 old machines, everything works fine.
Java version on the problematic old machines:
java version "1.6.0_20"
OpenJDK Runtime Environment (IcedTea6 1.9.13) (6b20-1.9.13-0ubuntu1~10.10.1)
OpenJDK 64-Bit Server VM (build 19.0-b09, mixed mode)
Java version on the new, working machines:
java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.3) (6b24-1.11.3-1ubuntu0.12.04.1)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
I have also seen an insane number (180) of dup hosts in the logs.
2012-08-28 12:34:53,307 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 1 is already in progress
2012-08-28 12:34:53,307 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (0 slow hosts and3 dup hosts)
2012-08-28 12:35:53,308 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 1 is already in progress
2012-08-28 12:35:53,308 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (0 slow hosts and3 dup hosts)
2012-08-28 12:36:53,309 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 1 is already in progress
2012-08-28 12:36:53,309 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (0 slow hosts and3 dup hosts)
2012-08-28 12:37:18,416 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 copy failed: attempt_201208231557_0021_m_000002_0 from Han
2012-08-28 12:37:18,417 WARN org.apache.hadoop.mapred.ReduceTask: java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:327)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:193)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:180)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
at java.net.Socket.connect(Socket.java:546)
at sun.net.NetworkClient.doConnect(NetworkClient.java:173)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:409)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:530)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:240)
at sun.net.www.http.HttpClient.New(HttpClient.java:321)
at sun.net.www.http.HttpClient.New(HttpClient.java:338)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:935)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:876)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:801)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1618)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1575)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1483)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1394)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1326)
2012-08-28 12:37:18,417 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_201208231557_0021_r_000008_0: Failed fetch #6 from attempt_201208231557_0021_m_000002_0
2012-08-28 12:37:18,417 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 adding host Han to penalty box, next contact in 48 seconds
2012-08-28 12:37:18,417 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0: Got 1 map-outputs from previous failures
2012-08-28 12:37:53,418 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 0 is already in progress
2012-08-28 12:37:53,418 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (1 slow hosts and0 dup hosts)
2012-08-28 12:37:53,418 INFO org.apache.hadoop.mapred.ReduceTask: Penalized(slow) Hosts:
2012-08-28 12:37:53,418 INFO org.apache.hadoop.mapred.ReduceTask: Han Will be considered after: 13 seconds.
2012-08-28 12:38:08,418 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
2012-08-28 12:38:53,419 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 1 is already in progress
2012-08-28 12:38:53,419 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (0 slow hosts and3 dup hosts)
2012-08-28 12:39:53,420 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 1 is already in progress
2012-08-28 12:39:53,420 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (0 slow hosts and3 dup hosts)
2012-08-28 12:40:53,421 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 1 is already in progress
2012-08-28 12:40:53,421 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (0 slow hosts and3 dup hosts)
2012-08-28 12:41:08,537 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 copy failed: attempt_201208231557_0021_m_000002_0 from Han
2012-08-28 12:41:08,538 WARN org.apache.hadoop.mapred.ReduceTask: java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:327)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:193)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:180)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
at java.net.Socket.connect(Socket.java:546)
at sun.net.NetworkClient.doConnect(NetworkClient.java:173)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:409)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:530)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:240)
at sun.net.www.http.HttpClient.New(HttpClient.java:321)
at sun.net.www.http.HttpClient.New(HttpClient.java:338)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:935)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:876)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:801)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1618)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1575)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1483)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1394)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1326)
2012-08-28 12:41:08,538 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_201208231557_0021_r_000008_0: Failed fetch #7 from attempt_201208231557_0021_m_000002_0
2012-08-28 12:41:08,538 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 adding host Han to penalty box, next contact in 62 seconds
2012-08-28 12:41:08,538 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0: Got 1 map-outputs from previous failures
2012-08-28 12:41:53,539 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 0 is already in progress
2012-08-28 12:41:53,539 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (1 slow hosts and0 dup hosts)
2012-08-28 12:41:53,539 INFO org.apache.hadoop.mapred.ReduceTask: Penalized(slow) Hosts:
2012-08-28 12:41:53,539 INFO org.apache.hadoop.mapred.ReduceTask: Han Will be considered after: 17 seconds.
2012-08-28 12:42:13,540 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
2012-08-28 12:42:53,540 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 1 is already in progress
2012-08-28 12:42:53,540 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (0 slow hosts and3 dup hosts)
More logs:
2012-09-06 11:01:47,719 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201209041720_0010_r_000014_0 1.0% reduce > reduce
2012-09-06 11:01:47,720 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_201209041720_0010_r_000014_0 is done.
2012-09-06 11:01:47,720 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_201209041720_0010_r_000014_0 was -1
2012-09-06 11:01:47,720 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 6
2012-09-06 11:01:47,769 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201209041720_0010_r_1617080101 exited with exit code 0. Number of tasks it ran: 1
2012-09-06 11:01:47,913 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201209041720_0010_r_-434797671 exited with exit code 0. Number of tasks it ran: 1
2012-09-06 11:01:50,428 INFO org.apache.hadoop.mapred.TaskTracker: Received KillTaskAction for task: attempt_201209041720_0010_r_000014_0
2012-09-06 11:01:50,428 INFO org.apache.hadoop.mapred.TaskTracker: About to purge task: attempt_201209041720_0010_r_000014_0
2012-09-06 11:01:50,429 INFO org.apache.hadoop.mapred.TaskTracker: Received KillTaskAction for task: attempt_201209041720_0010_r_000020_0
2012-09-06 11:01:50,429 INFO org.apache.hadoop.mapred.TaskTracker: About to purge task: attempt_201209041720_0010_r_000020_0
Are you using the default speculative execution settings? (4 tries) - here's an article that may help.
It might be due to slow hard drives. On the new machines (Samsung 830 SSDs) I am seeing a transfer rate of 3 MB/s, and on the slow machines (spinning hard drives) 0.12 MB/s. They probably just got a large reducer task and are too slow.
TaskTrackers need to shuffle intermediate data among the nodes when a Hadoop job is in its reduce phase. A connection timeout occurs if you list data node host names in the slaves configuration file but those host names cannot be resolved.
Adding the host names and their IP addresses to /etc/hosts is a quick fix.
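A quick way to check name resolution on a node is getent, which consults /etc/hosts and DNS much like the JVM's resolver does (slave1 below is a placeholder for a name from your slaves file):

```shell
# Does the name resolve? Prints the matching address line if it does.
getent hosts localhost
# Check each entry in conf/slaves the same way, e.g.:
#   getent hosts slave1 || echo "slave1 does not resolve - add it to /etc/hosts"
```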
Not sure why this might be happening, but I had the same issue... My datanode was trying to access a namenode on localhost instead of my actual 'master' namenode.
I solved this by going to my hosts file:
/etc/hosts
and removing the localhost line from there. I'm pretty sure there should be a cleaner way to fix this, but this solution worked for me.
If anybody has a cleaner way to prevent the datanode from trying to access the namenode on localhost, please let me know!
I noticed my reducer was stuck because of a dead host. The logs show a lot of retry messages. Is it possible to tell the JobTracker to give up on the dead node and resume the work? There were 323 mappers and only 1 reducer. I am on hadoop-1.0.3.
2012-08-08 11:52:19,903 INFO org.apache.hadoop.mapred.ReduceTask: 192.168.1.23 Will be considered after: 65 seconds.
2012-08-08 11:53:19,905 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201207191440_0203_r_000000_0 Need another 63 map output(s) where 0 is already in progress
2012-08-08 11:53:19,905 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201207191440_0203_r_000000_0 Scheduled 0 outputs (1 slow hosts and0 dup hosts)
2012-08-08 11:53:19,905 INFO org.apache.hadoop.mapred.ReduceTask: Penalized(slow) Hosts:
2012-08-08 11:53:19,905 INFO org.apache.hadoop.mapred.ReduceTask: 192.168.1.23 Will be considered after: 5 seconds.
2012-08-08 11:53:29,906 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201207191440_0203_r_000000_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
2012-08-08 11:53:47,907 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201207191440_0203_r_000000_0 copy failed: attempt_201207191440_0203_m_000001_0 from 192.168.1.23
2012-08-08 11:53:47,907 WARN org.apache.hadoop.mapred.ReduceTask: java.net.NoRouteToHostException: No route to host
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:327)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:193)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:180)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
at java.net.Socket.connect(Socket.java:546)
at sun.net.NetworkClient.doConnect(NetworkClient.java:173)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:409)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:530)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:240)
at sun.net.www.http.HttpClient.New(HttpClient.java:321)
at sun.net.www.http.HttpClient.New(HttpClient.java:338)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:935)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:876)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:801)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1618)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1575)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1483)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1394)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1326)
2012-08-08 11:53:47,907 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_201207191440_0203_r_000000_0: Failed fetch #18 from attempt_201207191440_0203_m_000001_0
2012-08-08 11:53:47,907 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201207191440_0203_r_000000_0 adding host 192.168.1.23 to penalty box, next contact in 1124 seconds
2012-08-08 11:53:47,907 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201207191440_0203_r_000000_0: Got 1 map-outputs from previous failures
2012-08-08 11:54:22,909 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201207191440_0203_r_000000_0 Need another 63 map output(s) where 0 is already in progress
2012-08-08 11:54:22,909 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201207191440_0203_r_000000_0 Scheduled 0 outputs (1 slow hosts and0 dup hosts)
2012-08-08 11:54:22,909 INFO org.apache.hadoop.mapred.ReduceTask: Penalized(slow) Hosts:
2012-08-08 11:54:22,909 INFO org.apache.hadoop.mapred.ReduceTask: 192.168.1.23 Will be considered after: 1089 seconds.
I left it alone; it retried for a while, then gave up on the dead host, reran the mapper, and succeeded. The problem was caused by the host having two IP addresses: I had intentionally turned off one of them, which was the one Hadoop was using.
My question is whether there is a way to tell Hadoop to give up on the dead host without all the retrying.
From your log you can see that one of the TaskTrackers that ran a map task cannot be connected to. The TaskTracker on which the reducer runs is trying to retrieve the map's intermediate results over HTTP, and it fails because the TaskTracker holding those results is dead.
The default behaviour for tasktracker failure is like this:
The jobtracker arranges for map tasks that were run and completed successfully on the failed tasktracker to be rerun if they belong to incomplete jobs, since their intermediate output residing on the failed tasktracker’s local filesystem may not be accessible to the reduce task. Any tasks in progress are also rescheduled.
The problem is that if a task (be it a map or a reduce) fails too many times (4 by default), it will not be rescheduled anymore and the job will fail.
In your case, the map seems to complete successfully, but the reducer is unable to connect to the mapper and retrieve the intermediate results. It tries 4 times, and after that the job fails.
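For completeness: the four-attempt limit is configurable in Hadoop 1.x through mapred-site.xml (the values below are the defaults), though raising it only buys more retries and does not fix the underlying connectivity problem:

```xml
<property>
  <name>mapred.map.max.attempts</name>
  <value>4</value>
</property>
<property>
  <name>mapred.reduce.max.attempts</name>
  <value>4</value>
</property>
```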
A failed task cannot be completely ignored: it is part of the job, and unless all tasks comprising the job succeed, the job itself doesn't succeed.
Try to find the URL the reducer is trying to access and paste it into a browser to see the error you get.
You can also blacklist and completely exclude a node from the list of nodes Hadoop uses:
In conf/mapred-site.xml
<property>
<name>mapred.hosts.exclude</name>
<value>/full/path/of/host/exclude/file</value>
</property>
Then, to make the change take effect, reconfigure the nodes:
/bin/hadoop mradmin -refreshNodes
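As a concrete (hypothetical) walk-through, with Han standing in for the dead host from the logs and /tmp/mapred.exclude for whatever path you configured in mapred.hosts.exclude:

```shell
# Create the exclude file: one hostname per line, at the path configured
# in mapred.hosts.exclude.
EXCLUDE_FILE=/tmp/mapred.exclude
printf 'Han\n' > "$EXCLUDE_FILE"
cat "$EXCLUDE_FILE"
# Then make the JobTracker re-read its include/exclude lists (needs a running
# Hadoop 1.x installation, hence commented out here):
# bin/hadoop mradmin -refreshNodes
```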
I am running a simple sort program; however, I get the error below.
12/06/15 01:13:17 WARN mapred.JobClient: Error reading task outputServer returned HTTP response code: 403 for URL: _http://192.168.1.106:50060/tasklog?plaintext=true&attemptid=attempt_201206150102_0002_m_000001_1&filter=stdout
12/06/15 01:13:18 WARN mapred.JobClient: Error reading task outputServer returned HTTP response code: 403 for URL: _http://192.168.1.106:50060/tasklog?plaintext=true&attemptid=attempt_201206150102_0002_m_000001_1&filter=stderr
12/06/15 01:13:20 INFO mapred.JobClient: map 50% reduce 0%
12/06/15 01:13:23 INFO mapred.JobClient: map 100% reduce 0%
12/06/15 01:14:19 INFO mapred.JobClient: Task Id : attempt_201206150102_0002_m_000000_2, Status : FAILED
Too many fetch-failures
12/06/15 01:14:20 WARN mapred.JobClient: Error reading task outputServer returned HTTP response code: 403 for URL: _http://192.168.1.106:50060/tasklog?plaintext=true&attemptid=attempt_201206150102_0002_m_000000_2&filter=stdout
Does anyone know the reason and how to resolve it?
-------Update for more log information-------------------
2012-06-15 19:56:07,039 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2012-06-15 19:56:07,258 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already exists!
2012-06-15 19:56:07,339 INFO org.apache.hadoop.mapred.Task: Using ResourceCalculatorPlugin : null
2012-06-15 19:56:07,346 INFO org.apache.hadoop.mapred.ReduceTask: ShuffleRamManager: MemoryLimit=144965632, MaxSingleShuffleLimit=36241408
2012-06-15 19:56:07,351 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201206151954_0001_r_000000_0 Thread started: Thread for merging on-disk files
2012-06-15 19:56:07,351 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201206151954_0001_r_000000_0 Thread started: Thread for merging in memory files
2012-06-15 19:56:07,351 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201206151954_0001_r_000000_0 Thread waiting: Thread for merging on-disk files
2012-06-15 19:56:07,352 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201206151954_0001_r_000000_0 Need another 2 map output(s) where 0 is already in progress
2012-06-15 19:56:07,352 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201206151954_0001_r_000000_0 Thread started: Thread for polling Map Completion Events
2012-06-15 19:56:07,352 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201206151954_0001_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
2012-06-15 19:56:12,353 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201206151954_0001_r_000000_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
2012-06-15 19:56:32,076 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201206151954_0001_r_000000_0 copy failed: attempt_201206151954_0001_m_000000_0 from 192.168.1.106
2012-06-15 19:56:32,077 WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException: Server returned HTTP response code: 403 for URL: _http://192.168.1.106:50060/mapOutput?job=job_201206151954_0001&map=attempt_201206151954_0001_m_000000_0&reduce=0
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1436)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1639)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1575)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1483)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1394)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1326)
2012-06-15 19:56:32,077 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_201206151954_0001_r_000000_0: Failed fetch #1 from attempt_201206151954_0001_m_000000_0
2012-06-15 19:56:32,077 INFO org.apache.hadoop.mapred.ReduceTask: Failed to fetch map-output from attempt_201206151954_0001_m_000000_0 even after MAX_FETCH_RETRIES_PER_MAP retries... or it is a read error, reporting to the JobTracker
2012-06-15 19:56:32,077 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201206151954_0001_r_000000_0 adding host 192.168.1.106 to penalty box, next contact in 12 seconds
2012-06-15 19:56:32,077 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201206151954_0001_r_000000_0: Got 1 map-outputs from previous failures
2012-06-15 19:56:47,080 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201206151954_0001_r_000000_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
2012-06-15 19:56:56,048 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201206151954_0001_r_000000_0 copy failed: attempt_201206151954_0001_m_000000_0 from 192.168.1.106
2012-06-15 19:56:56,049 WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException: Server returned HTTP response code: 403 for URL: _http://192.168.1.106:50060/mapOutput?job=job_201206151954_0001&map=attempt_201206151954_0001_m_000000_0&reduce=0
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1436)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1639)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1575)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1483)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1394)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1326)
2012-06-15 19:56:56,049 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_201206151954_0001_r_000000_0: Failed fetch #2 from attempt_201206151954_0001_m_000000_0
2012-06-15 19:56:56,049 INFO org.apache.hadoop.mapred.ReduceTask: Failed to fetch map-output from attempt_201206151954_0001_m_000000_0 even after MAX_FETCH_RETRIES_PER_MAP retries... or it is a read error, reporting to the JobTracker
2012-06-15 19:56:56,049 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201206151954_0001_r_000000_0 adding host 192.168.1.106 to penalty box, next contact in 16 seconds
2012-06-15 19:56:56,049 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201206151954_0001_r_000000_0: Got 1 map-outputs from previous failures
2012-06-15 19:57:11,053 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201206151954_0001_r_000000_0 Need another 2 map output(s) where 0 is already in progress
2012-06-15 19:57:11,053 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201206151954_0001_r_000000_0 Scheduled 0 outputs (1 slow hosts and0 dup hosts)
2012-06-15 19:57:11,053 INFO org.apache.hadoop.mapred.ReduceTask: Penalized(slow) Hosts:
2012-06-15 19:57:11,053 INFO org.apache.hadoop.mapred.ReduceTask: 192.168.1.106 Will be considered after: 1 seconds.
2012-06-15 19:57:16,055 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201206151954_0001_r_000000_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
2012-06-15 19:57:25,984 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201206151954_0001_r_000000_0 copy failed: attempt_201206151954_0001_m_000000_0 from 192.168.1.106
2012-06-15 19:57:25,984 WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException: Server returned HTTP response code: 403 for URL: _http://192.168.1.106:50060/mapOutput?job=job_201206151954_0001&map=attempt_201206151954_0001_m_000000_0&reduce=0
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1436)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1639)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1575)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1483)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1394)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1326)
I had the same issue. After digging in, I pinpointed the problem to name resolution for the hosts. Please check the log of the particular attempt in
$HADOOP_HOME/logs/userlogs/JobXXX/attemptXXX/syslog
and if it has something like
WARN org.apache.hadoop.mapred.ReduceTask:
java.net.UnknownHostException: slave-1.local.lan
then just add the appropriate entry to /etc/hosts. After doing this, the error was resolved, and on the next attempt everything worked fine.
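For example, if the unresolved host from the log above were slave-1.local.lan, the /etc/hosts entry might look like the following (the IP address is only illustrative; use the host's actual address):

```
# /etc/hosts -- map the worker's hostname to its address
192.168.1.106   slave-1.local.lan   slave-1
```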
I'm a PIG beginner (using pig 0.10.0) and I have some simple JSON like the following:
test.json:
{
"from": "1234567890",
.....
"profile": {
"email": "me#domain.com"
.....
}
}
on which I perform some grouping/counting in Pig (started in local mode):
>pig -x local
with the following PIG script:
REGISTER /pig-udfs/oink.jar;
REGISTER /pig-udfs/json-simple-1.1.jar;
REGISTER /pig-udfs/guava-12.0.jar;
REGISTER /pig-udfs/elephant-bird-2.2.3.jar;
users = LOAD 'test.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true') as (json:map[]);
domain_user = FOREACH users GENERATE oink.EmailDomainFilter(json#'profile'#'email') as email, json#'from' as user_id;
DUMP domain_user; /* Outputs: (domain.com,1234567890) */
grouped_domain_user = GROUP domain_user BY email;
DUMP grouped_domain_user; /* Outputs: =stuck here= */
Basically, when I try to dump grouped_domain_user, Pig gets stuck, seemingly waiting for a map output to complete:
2012-05-31 17:45:22,111 [Thread-15] INFO org.apache.hadoop.mapred.Task - Task 'attempt_local_0002_m_000000_0' done.
2012-05-31 17:45:22,119 [Thread-15] INFO org.apache.hadoop.mapred.Task - Using ResourceCalculatorPlugin : null
2012-05-31 17:45:22,123 [Thread-15] INFO org.apache.hadoop.mapred.ReduceTask - ShuffleRamManager: MemoryLimit=724828160, MaxSingleShuffleLimit=181207040
2012-05-31 17:45:22,125 [Thread-15] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,125 [Thread-15] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,125 [Thread-15] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,126 [Thread-15] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,126 [Thread-15] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,128 [Thread for merging on-disk files] INFO org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread started: Thread for merging on-disk files
2012-05-31 17:45:22,128 [Thread for merging in memory files] INFO org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread started: Thread for merging in memory files
2012-05-31 17:45:22,128 [Thread for merging on-disk files] INFO org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread waiting: Thread for merging on-disk files
2012-05-31 17:45:22,129 [Thread-15] INFO org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Need another 1 map output(s) where 0 is already in progress
2012-05-31 17:45:22,129 [Thread for polling Map Completion Events] INFO org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread started: Thread for polling Map Completion Events
2012-05-31 17:45:22,129 [Thread-15] INFO org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
2012-05-31 17:45:28,118 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
2012-05-31 17:45:31,122 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
2012-05-31 17:45:37,123 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
2012-05-31 17:45:43,124 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
2012-05-31 17:45:46,124 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
2012-05-31 17:45:52,126 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
2012-05-31 17:45:58,127 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
2012-05-31 17:46:01,128 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
.... repeats ....
Suggestions on why this is happening would be welcome.
Thanks!
UPDATE
Chris solved this one for me. I was setting fs.default.name, etc. to the correct values in pig.properties; however, I also had the HADOOP_CONF_DIR environment variable pointing to my local Hadoop installation, where these same values were set with <final>true</final>.
Great find and much appreciated.
To mark this question as answered, and to those stumbling across this in future:
When running in local mode (whether for Pig via pig -x local, or when submitting a map-reduce job to the local job runner), if you see the reduce phase 'hang', especially with log entries similar to:
2012-05-31 17:45:22,129 [Thread-15] INFO org.apache.hadoop.mapred.ReduceTask -
attempt_local_0002_r_000000_0 Need another 1 map output(s) where 0 is already in progress
Then your job, although started in local mode, has probably switched to 'clustered' mode because the mapred.job.tracker property is marked as 'final' in your $HADOOP_HOME/conf/mapred-site.xml:
<property>
<name>mapred.job.tracker</name>
<value>hdfs://localhost:9000</value>
<final>true</final>
</property>
You should also check the fs.default.name property in core-site.xml, and ensure it is not marked as final.
This means that you are unable to set this value at runtime, and you may even see error messages similar to:
12/05/22 14:28:29 WARN conf.Configuration:
file:/tmp/.../job_local_0001.xml:a attempt to override final parameter: fs.default.name; Ignoring.
12/05/22 14:28:29 WARN conf.Configuration:
file:/tmp/.../job_local_0001.xml:a attempt to override final parameter: mapred.job.tracker; Ignoring.
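A sketch of the fix, assuming you want runtime overrides (such as pig -x local) to take effect: drop the <final>true</final> element from the property in $HADOOP_HOME/conf/mapred-site.xml, leaving something like:

```xml
<!-- mapred-site.xml: without <final>true</final>, job submissions
     (e.g. the local job runner) can override this value at runtime -->
<property>
  <name>mapred.job.tracker</name>
  <value>hdfs://localhost:9000</value>
</property>
```

Do the same for fs.default.name in core-site.xml if it is also marked final.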
I am new to hadoop and trying to get a single node setup of Hadoop 0.20.2 on my Windows 7 machine.
My questions are two-fold - one with respect to the completeness of the installation itself and the other regarding the error in the reduce stage of a sample Word Count program.
My Installation steps are as follows:
I am following http://blog.benhall.me.uk/2011/01/installing-hadoop-0210-on-windows.html for the installation procedure.
I have installed cygwin and set up password-less ssh on my localhost
My java version is:
java version "1.7.0_02"
Java(TM) SE Runtime Environment (build 1.7.0_02-b13)
Java HotSpot(TM) 64-Bit Server VM (build 22.0-b10, mixed mode)
Contents of conf/core-site.xml:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Contents of conf/hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Contents of conf/mapred-site.xml:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
I set the JAVA_HOME variable and the command "hadoop version" prints 0.20.2
hadoop namenode -format creates the DFS without any errors
start-all.sh prints that namenode, secondarynamenode, datanode, jobtracker and tasktracker have all started.
However, the command "jps" prints:
$ jps
4584 Jps
11008 JobTracker
2084 NameNode
I have noticed that jps has, at other times, also printed the pids of the tasktracker and secondarynamenode.
I am able to view the output of
http://localhost:50030 for the jobtracker,
http://localhost:50060 for the tasktracker and
http://localhost:50070 for the namenode.
I tried both put and get commands to the hdfs and they were successful:
bin/hadoop fs -mkdir In
bin/hadoop fs -put *.txt In
mkdir temp
bin/hadoop fs -get In temp
ls -l temp/In
$ ls -l temp/In/
total 365
348624 Mar 24 23:59 CHANGES.txt
13366 Mar 24 23:59 LICENSE.txt
101 Mar 24 23:59 NOTICE.txt
1366 Mar 24 23:59 README.txt
I could also view these files by browsing the DFS via the http interface for namenode
Is my installation complete?
If yes, why does the jps command not show the pids of all five components?
If not, what steps do I need to complete the installation?
What are other sanity checks used to test the completeness of the installation?
I initially believed my installation to be complete and ran a sample WordCount map-reduce program along the lines of http://jayant7k.blogspot.com/2010/06/writing-your-first-map-reduce-program.html
I obtain the following output:
12/03/25 00:10:26 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/03/25 00:10:26 INFO input.FileInputFormat: Total input paths to process : 1
12/03/25 00:10:27 INFO mapred.JobClient: Running job: job_201203242348_0001
12/03/25 00:10:28 INFO mapred.JobClient: map 0% reduce 0%
12/03/25 00:10:35 INFO mapred.JobClient: map 100% reduce 0%
12/03/25 00:21:29 INFO mapred.JobClient: Task Id : attempt_201203242348_0001_r_000000_0, Status : FAILED
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
12/03/25 00:32:25 INFO mapred.JobClient: Task Id : attempt_201203242348_0001_r_000000_1, Status : FAILED
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
12/03/25 00:44:02 INFO mapred.JobClient: Task Id : attempt_201203242348_0001_r_000000_2, Status : FAILED
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
12/03/25 00:55:00 INFO mapred.JobClient: Job complete: job_201203242348_0001
12/03/25 00:55:00 INFO mapred.JobClient: Counters: 12
12/03/25 00:55:00 INFO mapred.JobClient: Job Counters
12/03/25 00:55:00 INFO mapred.JobClient: Launched reduce tasks=4
12/03/25 00:55:00 INFO mapred.JobClient: Launched map tasks=1
12/03/25 00:55:00 INFO mapred.JobClient: Data-local map tasks=1
12/03/25 00:55:00 INFO mapred.JobClient: Failed reduce tasks=1
12/03/25 00:55:00 INFO mapred.JobClient: FileSystemCounters
12/03/25 00:55:00 INFO mapred.JobClient: HDFS_BYTES_READ=13366
12/03/25 00:55:00 INFO mapred.JobClient: FILE_BYTES_WRITTEN=23511
12/03/25 00:55:00 INFO mapred.JobClient: Map-Reduce Framework
12/03/25 00:55:00 INFO mapred.JobClient: Combine output records=0
12/03/25 00:55:00 INFO mapred.JobClient: Map input records=244
12/03/25 00:55:00 INFO mapred.JobClient: Spilled Records=1887
12/03/25 00:55:00 INFO mapred.JobClient: Map output bytes=19699
12/03/25 00:55:00 INFO mapred.JobClient: Combine input records=0
12/03/25 00:55:00 INFO mapred.JobClient: Map output records=1887
The map task seems complete, but the reduce task shows the following error in the logs:
2012-03-25 00:10:35,202 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201203242348_0001_r_000000_0: Got 1 new map-outputs
2012-03-25 00:10:40,193 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201203242348_0001_r_000000_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
2012-03-25 00:10:40,243 INFO org.apache.hadoop.mapred.ReduceTask: header: attempt_201203242348_0001_m_000000_0, compressed len: 23479, decompressed len: 23475
2012-03-25 00:10:40,243 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 23475 bytes (23479 raw bytes) into RAM from attempt_201203242348_0001_m_000000_0
2012-03-25 00:11:35,194 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201203242348_0001_r_000000_0 Need another 1 map output(s) where 1 is already in progress
2012-03-25 00:11:35,194 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201203242348_0001_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
2012-03-25 00:12:35,197 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201203242348_0001_r_000000_0 Need another 1 map output(s) where 1 is already in progress
2012-03-25 00:12:35,197 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201203242348_0001_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
2012-03-25 00:13:35,202 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201203242348_0001_r_000000_0 Need another 1 map output(s) where 1 is already in progress
2012-03-25 00:13:35,202 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201203242348_0001_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
2012-03-25 00:13:40,249 INFO org.apache.hadoop.mapred.ReduceTask: Failed to shuffle from attempt_201203242348_0001_m_000000_0
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:150)
at java.net.SocketInputStream.read(SocketInputStream.java:121)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at sun.net.www.http.ChunkedInputStream.fastRead(ChunkedInputStream.java:239)
at sun.net.www.http.ChunkedInputStream.read(ChunkedInputStream.java:680)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:2959)
at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:149)
at org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:101)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1522)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
The following are the contents of the task tracker logs:
2012-03-25 00:10:27,910 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_201203242348_0001_m_000002_0 task's state:UNASSIGNED
2012-03-25 00:10:27,915 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_201203242348_0001_m_000002_0
2012-03-25 00:10:27,915 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 2 and trying to launch attempt_201203242348_0001_m_000002_0
2012-03-25 00:10:28,453 INFO org.apache.hadoop.mapred.JvmManager: In JvmRunner constructed JVM ID: jvm_201203242348_0001_m_625085452
2012-03-25 00:10:28,454 INFO org.apache.hadoop.mapred.JvmManager: JVM Runner jvm_201203242348_0001_m_625085452 spawned.
2012-03-25 00:10:29,217 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201203242348_0001_m_625085452 given task: attempt_201203242348_0001_m_000002_0
2012-03-25 00:10:29,523 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201203242348_0001_m_000002_0 0.0% setup
2012-03-25 00:10:29,524 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_201203242348_0001_m_000002_0 is done.
2012-03-25 00:10:29,524 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_201203242348_0001_m_000002_0 was 0
2012-03-25 00:10:29,526 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 2
2012-03-25 00:10:29,718 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201203242348_0001_m_625085452 exited. Number of tasks it ran: 1
2012-03-25 00:10:30,911 INFO org.apache.hadoop.mapred.TaskTracker: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_201203242348_0001/attempt_201203242348_0001_m_000002_0/output/file.out in any of the configured local directories
2012-03-25 00:10:30,952 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_201203242348_0001_m_000000_0 task's state:UNASSIGNED
2012-03-25 00:10:30,952 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_201203242348_0001_m_000000_0
2012-03-25 00:10:30,952 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 2 and trying to launch attempt_201203242348_0001_m_000000_0
2012-03-25 00:10:30,952 INFO org.apache.hadoop.mapred.TaskTracker: Received KillTaskAction for task: attempt_201203242348_0001_m_000002_0
2012-03-25 00:10:30,952 INFO org.apache.hadoop.mapred.TaskTracker: About to purge task: attempt_201203242348_0001_m_000002_0
2012-03-25 00:10:30,952 INFO org.apache.hadoop.mapred.TaskRunner: attempt_201203242348_0001_m_000002_0 done; removing files.
2012-03-25 00:10:30,952 INFO org.apache.hadoop.mapred.IndexCache: Map ID attempt_201203242348_0001_m_000002_0 not found in cache
2012-03-25 00:10:31,077 INFO org.apache.hadoop.mapred.JvmManager: In JvmRunner constructed JVM ID: jvm_201203242348_0001_m_-1399302881
2012-03-25 00:10:31,077 INFO org.apache.hadoop.mapred.JvmManager: JVM Runner jvm_201203242348_0001_m_-1399302881 spawned.
2012-03-25 00:10:31,812 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201203242348_0001_m_-1399302881 given task: attempt_201203242348_0001_m_000000_0
2012-03-25 00:10:32,642 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201203242348_0001_m_000000_0 1.0%
2012-03-25 00:10:32,642 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_201203242348_0001_m_000000_0 is done.
2012-03-25 00:10:32,642 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_201203242348_0001_m_000000_0 was 0
2012-03-25 00:10:32,642 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 2
2012-03-25 00:10:32,822 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201203242348_0001_m_-1399302881 exited. Number of tasks it ran: 1
2012-03-25 00:10:33,982 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_201203242348_0001_r_000000_0 task's state:UNASSIGNED
2012-03-25 00:10:33,982 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_201203242348_0001_r_000000_0
2012-03-25 00:10:33,982 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 2 and trying to launch attempt_201203242348_0001_r_000000_0
2012-03-25 00:10:34,057 INFO org.apache.hadoop.mapred.JvmManager: In JvmRunner constructed JVM ID: jvm_201203242348_0001_r_625085452
2012-03-25 00:10:34,057 INFO org.apache.hadoop.mapred.JvmManager: JVM Runner jvm_201203242348_0001_r_625085452 spawned.
2012-03-25 00:10:34,852 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201203242348_0001_r_625085452 given task: attempt_201203242348_0001_r_000000_0
2012-03-25 00:10:40,243 INFO org.apache.hadoop.mapred.TaskTracker: Sent out 23479 bytes for reduce: 0 from map: attempt_201203242348_0001_m_000000_0 given 23479/23475
2012-03-25 00:10:40,243 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 192.168.1.33:50060, dest: 192.168.1.33:60790, bytes: 23479, op: MAPRED_SHUFFLE, cliID: attempt_201203242348_0001_m_000000_0
2012-03-25 00:10:41,153 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201203242348_0001_r_000000_0 0.0% reduce > copy >
2012-03-25 00:10:44,158 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201203242348_0001_r_000000_0 0.0% reduce > copy >
2012-03-25 00:16:05,244 INFO org.apache.hadoop.mapred.TaskTracker: Sent out 23479 bytes for reduce: 0 from map: attempt_201203242348_0001_m_000000_0 given 23479/23475
2012-03-25 00:16:05,244 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 192.168.1.33:50060, dest: 192.168.1.33:60864, bytes: 23479, op: MAPRED_SHUFFLE, cliID: attempt_201203242348_0001_m_000000_0
2012-03-25 00:16:05,249 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201203242348_0001_r_000000_0 0.0% reduce > copy >
2012-03-25 00:16:08,249 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201203242348_0001_r_000000_0 0.0% reduce > copy >
2012-03-25 00:21:25,251 FATAL org.apache.hadoop.mapred.TaskTracker: Task: attempt_201203242348_0001_r_000000_0 - Killed due to Shuffle Failure: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
I had opened ports 9000 and 9001 in the Windows firewall.
I checked the netstat output to verify that these ports were indeed open:
C:\Windows\system32>netstat -a -n | grep -e "500[367]0"
TCP 0.0.0.0:50030 0.0.0.0:0 LISTENING
TCP 0.0.0.0:50060 0.0.0.0:0 LISTENING
TCP 0.0.0.0:50070 0.0.0.0:0 LISTENING
TCP [::]:50030 [::]:0 LISTENING
TCP [::]:50060 [::]:0 LISTENING
TCP [::]:50070 [::]:0 LISTENING
C:\Windows\system32>netstat -a -n | grep -e "900[01]"
TCP 127.0.0.1:9000 0.0.0.0:0 LISTENING
TCP 127.0.0.1:9000 127.0.0.1:60332 ESTABLISHED
TCP 127.0.0.1:9000 127.0.0.1:60987 ESTABLISHED
TCP 127.0.0.1:9001 0.0.0.0:0 LISTENING
TCP 127.0.0.1:9001 127.0.0.1:60410 ESTABLISHED
TCP 127.0.0.1:60332 127.0.0.1:9000 ESTABLISHED
TCP 127.0.0.1:60410 127.0.0.1:9001 ESTABLISHED
TCP 127.0.0.1:60987 127.0.0.1:9000 ESTABLISHED
Could you help with both the issues of installation and getting the reduce task to work?
I looked at http://wiki.apache.org/hadoop/SocketTimeout and a few other links and tried the suggestions, but without any success.
I appreciate your patience in reading this post and would be happy to provide additional details.
Thanks in advance.
See this line in your logs:
2012-03-25 00:10:30,911 INFO org.apache.hadoop.mapred.TaskTracker: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_201203242348_0001/attempt_201203242348_0001_m_000002_0/output/file.out in any of the configured local directories
I am guessing that you need to check hadoop.tmp.dir and mapred.local.dir. Judging from the configs you posted, both of these params are left at their default values. The default values of these params are given here. Set them to some relevant location and try again.
NOTE: Stop Hadoop before making this change, and start it again after you are done.
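As an illustration only (the paths below are hypothetical; choose directories that actually exist and are writable by the user running Hadoop), the two properties could be set like this:

```xml
<!-- core-site.xml: base directory for Hadoop's temporary files -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hadoop/tmp</value>
</property>

<!-- mapred-site.xml: where intermediate map outputs (file.out) are written -->
<property>
  <name>mapred.local.dir</name>
  <value>/home/hadoop/mapred-local</value>
</property>
```

On a cygwin setup such as the one described above, a path like /cygdrive/c/hadoop/tmp may be more appropriate.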