Cannot write the output of the reducer to a sequence file - hadoop

I have a Map function and a Reduce function outputting key-value pairs of class Text and IntWritable.
This is just the gist of the Map part in the main function:
TableMapReduceUtil.initTableMapperJob(
        tablename,            // input HBase table name
        scan,                 // Scan instance to control CF and attribute selection
        AnalyzeMapper.class,  // mapper
        Text.class,           // mapper output key
        IntWritable.class,    // mapper output value
        job);
And here's my Reducer part in the main function, which writes the output to HDFS:
job.setReducerClass(AnalyzeReducerFile.class);
job.setNumReduceTasks(1);
FileOutputFormat.setOutputPath(job, new Path("hdfs://localhost:54310/output_file"));
How do I make the reducer write to a SequenceFile instead?
I've tried the following code, but it doesn't work:
job.setReducerClass(AnalyzeReducerFile.class);
job.setNumReduceTasks(1);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setOutputPath(job, new Path("hdfs://localhost:54310/sequenceOutput"));
Edit: here is the output message I get when I run it:
WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /sequenceOutput/_temporary/_attempt_local_0001_r_000000_0/part-r-00000 File does not exist. Holder DFSClient_NONMAPREDUCE_-79044441_1 does not have any open files.
13/07/29 17:04:20 WARN hdfs.DFSClient: Error Recovery for block null bad datanode[0] nodes == null
13/07/29 17:04:20 WARN hdfs.DFSClient: Could not get block locations. Source file "/sequenceOutput/_temporary/_attempt_local_0001_r_000000_0/part-r-00000" - Aborting...
13/07/29 17:04:20 ERROR hdfs.DFSClient: Failed to close file /sequenceOutput/_temporary/_attempt_local_0001_r_000000_0/part-r-00000
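For comparison, here is a minimal sketch of how SequenceFile output is typically wired up with the new mapreduce API. The reducer, key, and value classes are taken from the question; the explicit setOutputKeyClass/setOutputValueClass calls are an assumption on my part, since that part of the driver is not shown.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Sketch only: continues the driver Job from the question.
job.setReducerClass(AnalyzeReducerFile.class);
job.setNumReduceTasks(1);

// Declare what the reducer emits (assumption: Text / IntWritable, same as the mapper).
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

// Write the reducer output as a SequenceFile.
job.setOutputFormatClass(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setOutputPath(job, new Path("hdfs://localhost:54310/sequenceOutput"));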

Related

File is created in HDFS but can't write content

I installed HDP 3.0.1 in VMware.
DataNode and NameNode are running.
I can upload files to HDFS from the Ambari UI and the terminal; everything works.
When I try to write the data:
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://172.16.68.131:8020");
FileSystem fs = FileSystem.get(conf);
OutputStream os = fs.create(new Path("hdfs://172.16.68.131:8020/tmp/write.txt"));
InputStream is = new BufferedInputStream(new FileInputStream("/home/vq/hadoop/test.txt"));
IOUtils.copyBytes(is, os, conf);
log:
19/07/15 22:40:31 WARN hdfs.DataStreamer: Abandoning BP-1419118625-172.17.0.2-1543512323726:blk_1073760904_20134
19/07/15 22:40:31 WARN hdfs.DataStreamer: Excluding datanode DatanodeInfoWithStorage[172.18.0.2:50010,DS-6c34ba72-0587-4927-88a1-781ba7d444d9,DISK]
19/07/15 22:40:32 WARN hdfs.DataStreamer: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /tmp/write.txt could only be written to 0 of the 1 minReplication nodes. There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
The file is created in HDFS, but it's empty.
The same happens when I read data:
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://172.16.68.131:8020");
FileSystem fs = FileSystem.get(conf);
FSDataInputStream inputStream = fs.open(new Path("hdfs://172.16.68.131:8020/tmp/ui.txt"));
System.out.println(inputStream.available());
byte[] bs = new byte[inputStream.available()];
I can read the number of available bytes, but I can't read the file.
log:
19/07/15 22:33:33 WARN hdfs.DFSClient: Failed to connect to /172.18.0.2:50010 for file /tmp/ui.txt for block BP-1419118625-172.17.0.2-1543512323726:blk_1073760902_20132, add to deadNodes and continue.
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/172.18.0.2:50010]
19/07/15 22:33:33 WARN hdfs.DFSClient: No live nodes contain block BP-1419118625-172.17.0.2-1543512323726:blk_1073760902_20132 after checking nodes = [DatanodeInfoWithStorage[172.18.0.2:50010,DS-6c34ba72-0587-4927-88a1-781ba7d444d9,DISK]], ignoredNodes = null
19/07/15 22:33:33 INFO hdfs.DFSClient: Could not obtain BP-1419118625-172.17.0.2-1543512323726:blk_1073760902_20132 from any node: No live nodes contain current block Block locations: DatanodeInfoWithStorage[172.18.0.2:50010,DS-6c34ba72-0587-4927-88a1-781ba7d444d9,DISK] Dead nodes: DatanodeInfoWithStorage[172.18.0.2:50010,DS-6c34ba72-0587-4927-88a1-781ba7d444d9,DISK]. Will get new block locations from namenode and retry...
19/07/15 22:33:33 WARN hdfs.DFSClient: DFS chooseDataNode: got # 3 IOException, will wait for 6717.521796266041 msec
I've tried many answers from the internet, but with no success.
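One detail visible in the logs: the client talks to the NameNode at 172.16.68.131 but is then sent to a DataNode at 172.18.0.2, which it cannot reach (the excluded datanode and the ConnectTimeoutException above). A minimal sketch, assuming the DataNode's hostname resolves to a reachable address from the client machine, of telling the HDFS client to connect to DataNodes by hostname instead of by the advertised IP:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Sketch only: dfs.client.use.datanode.hostname makes the client connect to
// DataNodes by hostname rather than by the internal IP the NameNode reports.
// This assumes the DataNode hostname resolves from the client machine.
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://172.16.68.131:8020");
conf.setBoolean("dfs.client.use.datanode.hostname", true);
FileSystem fs = FileSystem.get(conf);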

Sentiment analysis of Twitter data using Hadoop and Pig

Tweets from Twitter are stored in HDFS.
The tweets need to be processed for sentiment analysis. The tweets in HDFS are in Avro format, so they need to be processed with a JSON loader. But in my Pig script the tweets from HDFS are not getting read, and after changing the jar files the Pig script fails.
With the following jar files, the Pig script fails:
REGISTER '/home/cloudera/Desktop/elephant-bird-hadoop-compat-4.17.jar';
REGISTER '/home/cloudera/Desktop/elephant-bird-pig-4.17.jar';
REGISTER '/home/cloudera/Desktop/json-simple-3.1.0.jar';
This is another set of jar files with which the script does not fail, but no data gets read either:
REGISTER '/home/cloudera/Desktop/elephant-bird-hadoop-compat-4.17.jar';
REGISTER '/home/cloudera/Desktop/elephant-bird-pig-4.17.jar';
REGISTER '/home/cloudera/Desktop/json-simple-1.1.jar';
Here are all the Pig commands I have used:
tweets = LOAD '/user/cloudera/OutputData/tweets' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;
B = FOREACH tweets GENERATE myMap#'id' as id ,myMap#'tweets' as tweets;
tokens = foreach B generate id, tweets, FLATTEN(TOKENIZE(tweets)) As word;
dictionary = load ' /user/cloudera/OutputData/AFINN.txt' using PigStorage('\t') AS(word:chararray,rating:int);
word_rating = join tokens by word left outer, dictionary by word using 'replicated';
describe word_rating;
rating = foreach word_rating generate tokens::id as id,tokens::tweets as tweets, dictionary::rating as rate;
word_group = group rating by (id,tweets);
avg_rate = foreach word_group generate group, AVG(rating.rate) as tweet_rating;
positive_tweets = filter avg_rate by tweet_rating>=0;
DUMP positive_tweets;
negative_tweets = filter avg_rate by tweet_rating<=0;
DUMP negative_tweets;
Error when dumping the tweets with the first set of jar files:
Input(s):
Failed to read data from "/user/cloudera/OutputData/tweets"
Output(s):
Failed to produce result in "hdfs://quickstart.cloudera:8020/tmp/temp-1614543351/tmp37889715"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1556902124324_0001
2019-05-03 09:59:09,409 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2019-05-03 09:59:09,427 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias tweets. Backend error : org.json.simple.parser.ParseException
Details at logfile: /home/cloudera/pig_1556902594207.log
Output when dumping the tweets with the second set of jar files:
Input(s):
Successfully read 0 records (5178477 bytes) from: "/user/cloudera/OutputData/tweets"
Output(s):
Successfully stored 0 records in: "hdfs://quickstart.cloudera:8020/tmp/temp-1614543351/tmp479037703"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1556902124324_0002
2019-05-03 10:01:05,417 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2019-05-03 10:01:05,418 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2019-05-03 10:01:05,418 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2019-05-03 10:01:05,428 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2019-05-03 10:01:05,428 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
The expected output was sorted positive and negative tweets, but instead I am getting errors.
Please do help. Thank you.
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias tweets. Backend error : org.json.simple.parser.ParseException
This usually indicates a syntax error in the Pig script.
The AS keyword in a LOAD statement usually requires a schema. myMap in your LOAD statement is not a valid schema.
See https://stackoverflow.com/a/12829494/8886552 for an example of JsonLoader.

All task attempts are done but job failed in MapReduce

I work with 8 map tasks and 1 reduce task. Although all of the map task attempts complete successfully, the MapReduce job fails. My example code is from Hadoop Beginner's Guide (Garry Turkington) and is the example used for skipping bad data. The main idea of the program is to test task failure in MapReduce: even though the data that causes the failure ("skiptext" in the example) is present in the source file, the job should still finish successfully. In my case, however, the job does not finish and fails instead. What should I do?
The full source code is:
import java.io.IOException;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.*;

public class SkipData
{
    public static class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable>
    {
        private final static LongWritable one = new LongWritable(1);
        private Text word = new Text("totalcount");

        public void map(LongWritable key, Text value,
                OutputCollector<Text, LongWritable> output,
                Reporter reporter) throws IOException
        {
            String line = value.toString();
            if (line.equals("skiptext"))
                throw new RuntimeException("Found skiptext");
            output.collect(word, one);
        }
    }

    public static void main(String[] args) throws Exception
    {
        Configuration config = new Configuration();
        JobConf conf = new JobConf(config, SkipData.class);
        conf.setJobName("SkipData");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);
        conf.setMapperClass(MapClass.class);
        conf.setCombinerClass(LongSumReducer.class);
        conf.setReducerClass(LongSumReducer.class);
        FileInputFormat.setInputPaths(conf, args[0]);
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
The full console error output is:
18/02/28 21:12:58 INFO mapreduce.Job: Job job_local724352166_0001 failed with state FAILED due to: NA
18/02/28 21:12:58 WARN mapred.LocalJobRunner: job_local724352166_0001
java.lang.Exception: java.lang.RuntimeException: Found skiptext
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.RuntimeException: Found skiptext
at mapredpack.SkipTest$MapClass.map(SkipTest.java:23)
at mapredpack.SkipTest$MapClass.map(SkipTest.java:1)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
18/02/28 21:12:58 DEBUG security.UserGroupInformation: PrivilegedAction as:naychi (auth:SIMPLE) from:org.apache.hadoop.mapreduce.Job.getCounters(Job.java:758)
18/02/28 21:12:59 DEBUG security.UserGroupInformation: PrivilegedAction as:naychi (auth:SIMPLE) from:org.apache.hadoop.fs.FileContext.getAbstractFileSystem(FileContext.java:331)
18/02/28 21:12:59 INFO mapreduce.Job: Counters: 23
File System Counters
FILE: Number of bytes read=29905
FILE: Number of bytes written=2020669
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=128005127
HDFS: Number of bytes written=0
HDFS: Number of read operations=80
HDFS: Number of large read operations=0
HDFS: Number of write operations=7
Map-Reduce Framework
Map input records=1542671
Map output records=1542669
Map output bytes=29310711
Map output materialized bytes=135
Input split bytes=686
Combine input records=1161148
Combine output records=5
Spilled Records=5
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=8601
Total committed heap usage (bytes)=3840933888
File Input Format Counters
Bytes Read=23163911
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:873)
at mapredpack.SkipTest.main(SkipTest.java:58)
18/02/28 21:12:59 DEBUG ipc.Client: stopping client from cache: org.apache.hadoop.ipc.Client@2e55dd0c
18/02/28 21:12:59 DEBUG ipc.Client: removing client from cache: org.apache.hadoop.ipc.Client@2e55dd0c
18/02/28 21:12:59 DEBUG ipc.Client: stopping actual client because no more references remain: org.apache.hadoop.ipc.Client@2e55dd0c
18/02/28 21:12:59 DEBUG ipc.Client: Stopping client
18/02/28 21:12:59 DEBUG ipc.Client: IPC Client (1313916817) connection to localhost/127.0.0.1:9000 from naychi: closed
18/02/28 21:12:59 DEBUG ipc.Client: IPC Client (1313916817) connection to localhost/127.0.0.1:9000 from naychi: stopped, remaining connections 0
Looks like the code is working as designed. A skiptext line was found, and the job is implemented to throw a task-ending exception in that case. This is a common coding technique to force people to implement logic at a certain point: put a throw new RuntimeException() where the code needs to be modified, and the developer is forced to look at that part of the code.
Look at the code and decide what you want to do when a skiptext line is found. Is there additional logic you need to implement, replacing the exception? If so, replace the thrown exception with the correct behavior.
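To illustrate that last point, here is a minimal sketch of the map method (old mapred API, same classes as in the question) that counts and skips such lines instead of throwing; whether skipping is the right behaviour is an assumption about what the job should do:
public void map(LongWritable key, Text value,
        OutputCollector<Text, LongWritable> output,
        Reporter reporter) throws IOException
{
    String line = value.toString();
    if (line.equals("skiptext")) {
        // Count the bad line and move on instead of failing the whole task.
        reporter.incrCounter("SkipData", "SkippedLines", 1);
        return;
    }
    output.collect(word, one);
}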

HBase looks for hadoop-common library on HDFS

I am trying to run a Hadoop (version 2.2.0) job on a single machine that writes the reducer output into an HBase (version 0.98) table. When I run the job, HBase looks for the hadoop-common jar file in HDFS:
2014-03-07 10:52:44,499 ERROR [main] security.UserGroupInformation
(UserGroupInformation.java:doAs(1494)) - PriviledgedActionException
as:ubuntu (auth:SIMPLE) cause:java.io.FileNotFoundException: File does
not exist:
hdfs://127.0.0.1:9000/home/ubuntu/workspace/XXXX/lib/hadoop-common-2.2.0.jar
Here is my job configuration:
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum", "127.0.0.1");
Job job = Job.getInstance(conf, "CreateLookUpTable");
job.setJarByClass(HBaseLookUp.class);
// configure output and input source
N3TextInputFormat.addInputPath(job, new Path(inputText));
job.setInputFormatClass(N3InputFormat.class);
// configure mapper and reducer
job.setMapperClass(HBaseLookUp.Map.class);
// configure output
TableMapReduceUtil.initTableReducerJob("LookUp", HBaseLookUp.Red.class, job);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setReducerClass(HBaseLookUp.Red.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
Any idea about this issue?
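If I recall correctly, initTableReducerJob adds dependency jars from the local classpath to the job by default (via tmpjars), and unqualified local paths can then be resolved against fs.defaultFS, which would explain a workspace path being looked up on HDFS. A minimal sketch, assuming the HBase 0.98 overload that exposes an addDependencyJars flag, of disabling that behaviour:
// Sketch only: same table/reducer/job as in the question; the trailing
// 'false' (addDependencyJars) stops TableMapReduceUtil from registering
// local classpath jars, whose unqualified paths may otherwise be resolved
// against hdfs:// and trigger the FileNotFoundException above.
TableMapReduceUtil.initTableReducerJob(
        "LookUp",              // output table
        HBaseLookUp.Red.class, // reducer class
        job,
        null,                  // partitioner
        null, null, null,      // quorum address, server class, server impl
        false);                // addDependencyJars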

Getting error while implementing a simple sorting program in Mapreduce with zero reduce nodes

I tried implementing a sorting program in MapReduce such that I have just the sorted output after the map phase, where the sorting is done by the Hadoop framework internally. For this, I tried to set the number of reduce tasks to zero, as there wasn't any reduction required. Now when I try to execute the program, I keep getting a checksum
error. I am not able to figure out what to do next. Surely it's possible to run the program on my netbook, as the sorting works fine when I set the number of reduce tasks to one. Please help!
For your reference, here's the entire code that I have written to perform the sorting:
/*
 * To change this template, choose Tools | Templates
 * and open the template in the editor.
 */

/**
 *
 * @author root
 */
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.io.*;
import java.util.*;
import java.io.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.*;
import org.apache.hadoop.conf.*;

public class word extends Configured implements Tool
{
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
    {
        private static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter report) throws IOException
        {
            String line = value.toString();
            StringTokenizer token = new StringTokenizer(line, " .,?!");
            String wordToken = null;
            while (token.hasMoreTokens())
            {
                wordToken = token.nextToken();
                output.collect(new Text(wordToken), one);
            }
        }
    }

    public int run(String args[]) throws Exception
    {
        //Configuration conf=getConf();
        JobConf job = new JobConf(word.class);
        job.setInputFormat(TextInputFormat.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setOutputFormat(TextOutputFormat.class);
        job.setMapperClass(Map.class);
        job.setNumReduceTasks(0);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        JobClient.runJob(job);
        return 0;
    }

    public static void main(String args[]) throws Exception
    {
        int exitCode = ToolRunner.run(new word(), args);
        System.exit(exitCode);
    }
}
Here is the checksum error I got on executing this program:
12/03/25 10:26:42 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
12/03/25 10:26:43 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
12/03/25 10:26:43 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/03/25 10:26:44 INFO mapred.FileInputFormat: Total input paths to process : 1
12/03/25 10:26:45 INFO mapred.JobClient: Running job: job_local_0001
12/03/25 10:26:45 INFO mapred.FileInputFormat: Total input paths to process : 1
12/03/25 10:26:45 INFO mapred.MapTask: numReduceTasks: 0
12/03/25 10:26:45 INFO fs.FSInputChecker: Found checksum error: b[0, 26]=610a630a620a640a650a740a790a780a730a670a7a0a680a730a
org.apache.hadoop.fs.ChecksumException: Checksum error: file:/root/NetBeansProjects/projectAll/output/regionMulti/individual/part-00000 at 0
at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
12/03/25 10:26:45 WARN mapred.LocalJobRunner: job_local_0001
org.apache.hadoop.fs.ChecksumException: Checksum error: file:/root/NetBeansProjects/projectAll/output/regionMulti/individual/part-00000 at 0
at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
12/03/25 10:26:46 INFO mapred.JobClient: map 0% reduce 0%
12/03/25 10:26:46 INFO mapred.JobClient: Job complete: job_local_0001
12/03/25 10:26:46 INFO mapred.JobClient: Counters: 0
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
at sortLog.run(sortLog.java:59)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at sortLog.main(sortLog.java:66)
Java Result: 1
BUILD SUCCESSFUL (total time: 4 seconds)
So have a look at org.apache.hadoop.mapred.MapTask around line 600 in 0.20.2:
// get an output object
if (job.getNumReduceTasks() == 0) {
    output =
        new NewDirectOutputCollector(taskContext, job, umbilical, reporter);
} else {
    output = new NewOutputCollector(taskContext, job, umbilical, reporter);
}
If you set the number of reduce tasks to zero, the map output is written directly to the output. The NewOutputCollector will use the so-called MapOutputBuffer, which does the spilling, sorting, combining and partitioning.
So when you set no reducer, no sort takes place, even if Tom White states this in the Definitive Guide.
I have faced the same problem (checksum error concerning file part-00000 at 0). I solved it by renaming the file to any name other than -00000.
So if you need at least one reducer to make the internal sorting happen, you can take the IdentityReducer.
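A minimal sketch of that option, using the old mapred API from the question (IdentityReducer writes every key/value pair through unchanged, so the framework's sort still runs):
// Keep one reduce task so the shuffle/sort phase happens; the values
// stay exactly as the mapper emitted them.
job.setReducerClass(org.apache.hadoop.mapred.lib.IdentityReducer.class);
job.setNumReduceTasks(1);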
You may also want to see this discussion:
hadoop: difference between 0 reducer and identity reducer?

Resources