Hadoop MultipleOutputs in Reducer with FileAlreadyExistsException - hadoop

I use MultipleOutputs in a reducer. The multiple output writes files to a folder called NewIdentities. The code is shown below:
private MultipleOutputs<Text, Text> mos;

@Override
public void reduce(Text inputKey, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    ......
    // output to change report
    if (ischangereport.equals("TRUE")) {
        mos.write(new Text(e.getHID()), new Text(changereport.deleteCharAt(changereport.length() - 1).toString()), "NewIdentities/");
    }
}

@Override
public void setup(Context context) {
    mos = new MultipleOutputs<Text, Text>(context);
}

@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
    mos.close();
}
It ran fine previously, but when I ran it today it threw the exception below. My Hadoop version is 2.4.0.
Error: org.apache.hadoop.fs.FileAlreadyExistsException: /CaptureOnlyMatchIndex9/TEMP/ChangeReport/NewIdentities/-r-00000 for client 192.168.71.128 already exists
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2297)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2225)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2178)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:520)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:354)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1604)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1465)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1390)
at org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:394)
at org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:390)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:390)
at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:334)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:784)
at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:132)
at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.getRecordWriter(MultipleOutputs.java:475)
at

I found the reason. One of my reducers ran out of memory, so it implicitly threw an out-of-memory exception. Hadoop stopped the current multiple output, and presumably another reducer attempt then tried to write and created another multiple output object for the same path, which is where the collision happened.
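A minimal sketch of a mitigation (not from the original post), assuming the standard Hadoop 2.x property names for reducer memory; raising these targets the out-of-memory root cause described above, and the values are only placeholders:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

final class ChangeReportDriver {
    // Illustrative values only; size these to your cluster and data volume.
    static Job createJob() throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.reduce.memory.mb", "4096");      // YARN container size per reduce task
        conf.set("mapreduce.reduce.java.opts", "-Xmx3584m"); // JVM heap inside that container
        return Job.getInstance(conf, "change-report");       // job name is a placeholder
    }
}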

Related

spring-retry : "Getting Retry exhausted after last attempt with no recovery path" with original exception on non retryable method

I am trying to implement spring-retry (version 1.3.1) in my Spring Boot application. I need to retry a web service read operation if the record is not found on the first request.
Sample code:
@Retryable(include = {IllegalArgumentException.class}, backoff = @Backoff(500), maxAttempts = 3, recover = "readFallback")
Object read(String Id);

@Recover
Object readFallback(RuntimeException e, String Id);

void deletePayment(String paymentId);
Problem:
I get the correct response from the read method (annotated with @Retryable) in the exception scenario, but when an exception is thrown from my delete method I get a RetryExhaustedException with the original exception nested inside. As you can see, the delete method is not annotated with @Retryable, and it is in a different package.
Sample exception response: "Retry exhausted after last attempt with no recovery path; nested exception is exception.NotFoundException: Not found"
Expected: the delete method should not be affected by @Retryable. Can someone help me find what I am missing or doing wrong? I have searched but have been unable to find a solution to this problem online.
Thanks in advance!
Works as expected for me:
@SpringBootApplication
@EnableRetry
public class So71546747Application {

    public static void main(String[] args) {
        SpringApplication.run(So71546747Application.class, args);
    }

    @Bean
    ApplicationRunner runner(SomeRetryables retrier) {
        return args -> {
            retrier.foo("testFoo");
            try {
                Thread.sleep(1000);
                retrier.bar("testBar");
            }
            catch (Exception e) {
                e.printStackTrace();
            }
        };
    }

}

@Component
class SomeRetryables {

    @Retryable
    void foo(String in) {
        System.out.println(in);
        throw new RuntimeException(in);
    }

    @Recover
    void recover(String in, Exception ex) {
        System.out.println("recovered");
    }

    void bar(String in) {
        System.out.println(in);
        throw new RuntimeException(in);
    }

}
testFoo
testFoo
testFoo
recovered
testBar
java.lang.RuntimeException: testBar
at com.example.demo.SomeRetryables.bar(So71546747Application.java:52)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344)
at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:198)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:789)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:753)
at org.springframework.retry.annotation.AnnotationAwareRetryOperationsInterceptor.invoke(AnnotationAwareRetryOperationsInterceptor.java:166)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:753)
at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:698)
at com.example.demo.SomeRetryables$$EnhancerBySpringCGLIB$$e61dd199.bar(<generated>)
at com.example.demo.So71546747Application.lambda$0(So71546747Application.java:26)
at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:768)
at org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:758)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:310)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:1312)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:1301)
at com.example.demo.So71546747Application.main(So71546747Application.java:17)
Please provide an MCRE that exhibits the behavior you see so we can see what's wrong.
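In the meantime, one hedged way to narrow it down is to check whether the bean that owns deletePayment is actually sitting behind spring-retry's interceptor. This is a diagnostic sketch, not from the original answer; PaymentService and retryProxyCheck are placeholder names, and the method goes inside any @Configuration class:
import org.springframework.aop.Advisor;
import org.springframework.aop.framework.Advised;
import org.springframework.aop.support.AopUtils;
import org.springframework.boot.ApplicationRunner;
import org.springframework.context.annotation.Bean;

@Bean
ApplicationRunner retryProxyCheck(PaymentService service) {
    return args -> {
        // If this prints true and AnnotationAwareRetryOperationsInterceptor shows up below,
        // every call on this bean (including deletePayment) is routed through spring-retry.
        System.out.println("AOP proxy? " + AopUtils.isAopProxy(service));
        if (service instanceof Advised) {
            for (Advisor advisor : ((Advised) service).getAdvisors()) {
                System.out.println(advisor.getAdvice().getClass().getName());
            }
        }
    };
}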

How to read binary file from hdfs?

I'm writing a parser for shapefiles with Spark. I use newAPIHadoopFile to extract the binary records one by one from the original .shp file. The problem is that when the program reads the file from the local disk it works, but when reading the file from HDFS the byte stream I get from the DataInputStream no longer matches the original file. The exception is below.
java.lang.NegativeArraySizeException
at ShapeFileParse.ShapeParseUtil.parseRecordPrimitiveContent(ShapeParseUtil.java:53)
at spatial.ShapeFileReader.nextKeyValue(ShapeFileReader.java:54)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:182)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1702)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/05/25 17:19:15 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NegativeArraySizeException
at ShapeFileParse.ShapeParseUtil.parseRecordPrimitiveContent(ShapeParseUtil.java:53)
at spatial.ShapeFileReader.nextKeyValue(ShapeFileReader.java:54)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:182)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1702)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
My code in the RecordReader is as follows:
public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
    ShapeParseUtil.initializeGeometryFactory();
    FileSplit fileSplit = (FileSplit) split;
    long start = fileSplit.getStart();
    long end = start + fileSplit.getLength();
    int len = (int) fileSplit.getLength();
    Path filePath = fileSplit.getPath();
    FileSystem fileSys = filePath.getFileSystem(context.getConfiguration());
    FSDataInputStream inputStreamFS = fileSys.open(filePath);
    //byte[] wholeStream = new byte[len];
    inputStream = new DataInputStream(inputStreamFS);
    //IOUtils.readFully(inputStream, wholeStream, 0, len);
    //inputStream = new DataInputStream(new ByteArrayInputStream(wholeStream));
    ShapeParseUtil.parseShapeFileHead(inputStream);
}
And here is how I set up my FileInputFormat:
public class ShapeInputFormat extends FileInputFormat<ShapeKey, BytesWritable> {

    public RecordReader<ShapeKey, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        return new ShapeFileReader(split, context);
    }

    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }
}
When I use IOUtils.readFully to test, the whole byte array I get is correct. But when reading through the DataInputStream, the stream goes wrong. So I believe the file on HDFS is still intact. Since my file is only 560 KB and there is only one file on HDFS right now, I don't think it can be split across multiple blocks.
I'm new to Spark, so maybe this is a naive problem; thanks to anyone who can explain it to me.
I figured it out myself by changing read() to readFully() when reading the byte array for each record. I'm still a little confused, since the available() I get is exactly equal to the size of the split, yet if I call inputStream.read() it fails to read the full length I ask for.
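That behavior matches the java.io contract: InputStream.read(byte[]) may return fewer bytes than requested (common when the data comes over the network from HDFS), while DataInputStream.readFully(byte[]) keeps reading until the buffer is full or throws EOFException. A minimal sketch of the difference; recordLen is assumed to come from the record header already parsed:
import java.io.DataInputStream;
import java.io.IOException;

final class RecordBytes {
    // Reads exactly recordLen bytes for one shapefile record.
    static byte[] readRecord(DataInputStream in, int recordLen) throws IOException {
        byte[] buf = new byte[recordLen];
        // Unreliable over HDFS: a single read() may legally stop early.
        //   int n = in.read(buf);   // n can be < recordLen
        // Reliable: readFully() loops until buf is completely filled (or throws EOFException).
        in.readFully(buf);
        return buf;
    }
}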

Reading ORC File using MapReduce

I'm trying to read ORC files compressed with SNAPPY via MapReduce. My intent is just to leverage the identity mapper, essentially to merge small files. However, I get a NullPointerException when doing so. I can see from the log that the schema is being inferred; do I still need to set the schema for the output file from the mapper?
public class Test {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "test");
        job.setJarByClass(Test.class);
        job.setMapperClass(Mapper.class);
        conf.set("orc.compress", "SNAPPY");
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Writable.class);
        job.setInputFormatClass(OrcInputFormat.class);
        job.setOutputFormatClass(OrcOutputFormat.class);
        job.setNumReduceTasks(0);
        String source = args[0];
        String target = args[1];
        FileInputFormat.setInputPaths(job, new Path(source));
        FileOutputFormat.setOutputPath(job, new Path(target));
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
Error: java.lang.NullPointerException
at org.apache.orc.impl.WriterImpl.<init>(WriterImpl.java:178)
at org.apache.orc.OrcFile.createWriter(OrcFile.java:559)
at org.apache.orc.mapreduce.OrcOutputFormat.getRecordWriter(OrcOutputFormat.java:55)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:644)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
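To the question asked above: with the orc-mapreduce OrcOutputFormat, the writer schema is taken from the job configuration rather than inferred from the input, so the NullPointerException in WriterImpl is consistent with no output schema being set. A hedged sketch of the missing configuration; the schema string is a placeholder and has to match the real structure of your ORC files:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TestWithSchema {
    static Job configure(String schema) throws Exception {
        Configuration conf = new Configuration();
        // Key behind OrcConf.MAPRED_OUTPUT_SCHEMA; OrcOutputFormat.getRecordWriter()
        // builds its writer from this value, e.g. "struct<id:int,name:string>".
        conf.set("orc.mapred.output.schema", schema);
        conf.set("orc.compress", "SNAPPY");   // set on conf before creating the Job, so it is copied in
        return new Job(conf, "orc-merge");
    }
}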

Avro AvroMultipleOutputs part-r-00000: File is not open for writing Exception

I have written a MapReduce job with Avro 1.7.4 on Hadoop 2.3.0. In the first step I wrote all the Avro results to an AvroSequenceFile. Everything worked well without problems.
Then I tried to use the AvroMultipleOutputs class to write the results to different files. I wrote the same MapReduce job without using Avro and it had no problem writing the data to two separate files (by the way, part-r-00000 was created in HDFS but left empty).
The Avro variant produces exceptions when I write data in the reducer. (If I comment out the lines that write the data, I get no exceptions.)
Here is the job configuration:
this.conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJarByClass(ArchiveDataProcessorMR.class);
Path inPath = new Path(props.getProperty("inPath").trim());
Path outPath = new Path(props.getProperty("outPath").trim());
Path outPathMeta = new Path(props.getProperty("outPath.meta").trim());
Path outPathPayload = new Path(props.getProperty("outPath.payload").trim());
// cleanup resources from previous run
FileSystem fs = FileSystem.get(conf);
fs.delete(outPath, true);
FileInputFormat.setInputPaths(job, inPath);
FileOutputFormat.setOutputPath(job, outPath);
AvroSequenceFileInputFormat<DataExportKey,DataExportValue> sequenceInputFormat = new AvroSequenceFileInputFormat<DataExportKey,DataExportValue>();
job.setInputFormatClass(sequenceInputFormat.getClass());
AvroJob.setInputValueSchema(job, DataExportValue.getClassSchema());
AvroJob.setInputKeySchema(job, DataExportKey.getClassSchema());
AvroJob.setMapOutputValueSchema(job, DataExportValue.getClassSchema());
AvroJob.setMapOutputKeySchema(job, DataExportKey.getClassSchema());
job.setMapperClass(ArchiveDataMapper.class);
job.setReducerClass(ArchiveDataReducer.class);
AvroSequenceFileOutputFormat<DataExportKey,DataExportValue> sequenceOutputFormat = new AvroSequenceFileOutputFormat<DataExportKey,DataExportValue>();
AvroJob.setOutputKeySchema(job, DataExportKey.getClassSchema());
AvroJob.setOutputValueSchema(job, DataExportValue.getClassSchema());
job.setOutputFormatClass(AvroSequenceFileOutputFormat.class);
AvroMultipleOutputs.addNamedOutput(job, "meta", sequenceOutputFormat.getClass(), DataExportKey.getClassSchema(), DataExportValue.getClassSchema());
AvroMultipleOutputs.addNamedOutput(job, "payload", sequenceOutputFormat.getClass(), DataExportKey.getClassSchema(), DataExportValue.getClassSchema());
The reducer code (without the business logic) looks like this:
public static class ArchiveDataReducer extends Reducer<AvroKey<DataExportKey>, AvroValue<DataExportValue>,AvroKey<DataExportKey>,AvroValue<DataExportValue>> {
private AvroMultipleOutputs amos;
public void setup(Context context) throws IOException, InterruptedException {
this.amos = new AvroMultipleOutputs(context);
}
public void cleanup(Context context) throws IOException, InterruptedException {
this.amos.close();
}
/**
 * @param key
 */
public void reduce(AvroKey<DataExportKey> key, Iterable<AvroValue<DataExportValue>> xmlIter, Context context) throws IOException, InterruptedException {
try {
DataExportValue newValue = new DataExportValue();
if (key.datum()......) {
... snip...
amos.write("meta",key, new AvroValue<DataExportValue>(newValue));
} else {
... snip...
amos.write("payload",key, new AvroValue<DataExportValue>(newValue));
}
} catch (Exception e) {
e.printStackTrace();
}
}
} // class ArchiveDataReducer
The exception message is
14/05/04 06:52:58 INFO mapreduce.Job: map 100% reduce 0%
14/05/04 06:53:09 INFO mapreduce.Job: map 100% reduce 91%
14/05/04 06:53:09 INFO mapreduce.Job: Task Id : attempt_1399104292130_0016_r_000000_1, Status : FAILED
Error: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /applications/wd/data_avro/_temporary/1/_temporary/attempt_1399104292130_0016_r_000000_1/part-r-00000: File is not open for writing. Holder DFSClient_attempt_1399104292130_0016_r_000000_1_338983539_1 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2856)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2667)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2573)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:563)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:407)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1958)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1956)
at org.apache.hadoop.ipc.Client.call(Client.java:1406)
at org.apache.hadoop.ipc.Client.call(Client.java:1359)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:348)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1264)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1112)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:522)
Any hints on how to solve this problem? Do you have another example with AvroMultipleOutputs running? I would like to see your code.
Kind regards
Martin
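No accepted fix is recorded here, but one hedged observation tied to the question itself: since all records go out through the AvroMultipleOutputs named outputs, the default output (the part-r-00000 that was "created in hdfs but left empty") can be made lazy so it is never opened at all. This is a sketch worth trying, not a confirmed cure for the LeaseExpiredException:
import org.apache.avro.mapreduce.AvroSequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;

final class LazyDefaultOutput {
    // Use instead of job.setOutputFormatClass(AvroSequenceFileOutputFormat.class):
    // the default output is only materialized if something is actually written to it,
    // so no empty part-r-00000 appears next to the "meta"/"payload" named outputs.
    static void apply(Job job) {
        LazyOutputFormat.setOutputFormatClass(job, AvroSequenceFileOutputFormat.class);
    }
}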

hadoop MultipleInputs fails with ClassCastException

My Hadoop version is 1.0.3. When I use MultipleInputs, I get this error:
java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.TaggedInputSplit cannot be cast to org.apache.hadoop.mapreduce.lib.input.FileSplit
at org.myorg.textimage$ImageMapper.setup(textimage.java:80)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapreduce.lib.input.DelegatingMapper.run(DelegatingMapper.java:55)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
I tested with a single input path and there was no problem. The error only occurs when I use
MultipleInputs.addInputPath(job, TextInputpath, TextInputFormat.class,
TextMapper.class);
MultipleInputs.addInputPath(job, ImageInputpath,
WholeFileInputFormat.class, ImageMapper.class);
I googled and found this link, https://issues.apache.org/jira/browse/MAPREDUCE-1178, which said 0.21 had this bug. But I am using 1.0.3; has this bug come back? Has anyone had the same problem, or can anyone tell me how to fix it? Thanks.
Here is the setup code of the image mapper; the fourth line is where the error occurs:
protected void setup(Context context) throws IOException,
InterruptedException {
InputSplit split = context.getInputSplit();
Path path = ((FileSplit) split).getPath();
try {
pa = new Text(path.toString());
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
Following up on my comment, the Javadocs for TaggedInputSplit confirm that you are probably wrongly casting the input split to a FileSplit:
/**
 * An {@link InputSplit} that tags another InputSplit with extra data for use
 * by {@link DelegatingInputFormat}s and {@link DelegatingMapper}s.
 */
My guess is your setup method looks something like this:
@Override
protected void setup(Context context) throws IOException,
        InterruptedException {
    FileSplit split = (FileSplit) context.getInputSplit();
}
Unfortunately TaggedInputSplit is not publicly visible, so you can't easily do an instanceof-style check followed by a cast and a call to TaggedInputSplit.getInputSplit() to get the actual underlying FileSplit. So either you'll need to update the source yourself and recompile and deploy, file a JIRA ticket to ask for this to be fixed in a future version (if it hasn't already been actioned in 2+), or perform some nasty reflection hackery to get to the underlying InputSplit.
This is completely untested:
@Override
protected void setup(Context context) throws IOException,
        InterruptedException {
    InputSplit split = context.getInputSplit();
    Class<? extends InputSplit> splitClass = split.getClass();

    FileSplit fileSplit = null;
    if (splitClass.equals(FileSplit.class)) {
        fileSplit = (FileSplit) split;
    } else if (splitClass.getName().equals(
            "org.apache.hadoop.mapreduce.lib.input.TaggedInputSplit")) {
        // begin reflection hackery...
        try {
            Method getInputSplitMethod = splitClass
                    .getDeclaredMethod("getInputSplit");
            getInputSplitMethod.setAccessible(true);
            fileSplit = (FileSplit) getInputSplitMethod.invoke(split);
        } catch (Exception e) {
            // wrap and re-throw error
            throw new IOException(e);
        }
        // end reflection hackery
    }
}
Reflection Hackery Explained:
With TaggedInputSplit being declared protected scope, it's not visible to classes outside the org.apache.hadoop.mapreduce.lib.input package, and therefore you cannot reference that class in your setup method. To get around this, we perform a number of reflection-based operations:
Inspecting the class name, we can test for the type TaggedInputSplit using its fully qualified name:
splitClass.getName().equals("org.apache.hadoop.mapreduce.lib.input.TaggedInputSplit")
We know we want to call the TaggedInputSplit.getInputSplit() method to recover the wrapped input split, so we use the Class.getDeclaredMethod(..) reflection call to acquire a reference to the method:
Method getInputSplitMethod = splitClass.getDeclaredMethod("getInputSplit");
The class still isn't publicly visible, so we call setAccessible(..) on the method to override this, stopping the security manager from throwing an exception:
getInputSplitMethod.setAccessible(true);
Finally, we invoke the method on the input split reference and cast the result to a FileSplit (optimistically hoping it's an instance of this type!):
fileSplit = (FileSplit) getInputSplitMethod.invoke(split);
I had this same problem, but the actual cause was that I was still setting the InputFormat after setting up the MultipleInputs:
job.setInputFormatClass(SequenceFileInputFormat.class);
Once I removed this line everything worked fine.
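For reference, a sketch of the driver wiring implied by that fix; TextMapper, ImageMapper and WholeFileInputFormat are the classes from the question and are assumed to exist in the same package:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

final class DriverSketch {
    static Job build(Configuration conf, Path textIn, Path imageIn, Path out) throws Exception {
        Job job = new Job(conf, "text-and-image");
        job.setJarByClass(DriverSketch.class);

        // MultipleInputs installs DelegatingInputFormat and maps each path to its own
        // input format and mapper...
        MultipleInputs.addInputPath(job, textIn, TextInputFormat.class, TextMapper.class);
        MultipleInputs.addInputPath(job, imageIn, WholeFileInputFormat.class, ImageMapper.class);

        // ...so there must be no later job.setInputFormatClass(...) call, which would
        // override that mapping and reintroduce the ClassCastException.
        FileOutputFormat.setOutputPath(job, out);
        return job;
    }
}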

Resources