hadoop MultipleInputs fails with ClassCastException

My Hadoop version is 1.0.3. When I use MultipleInputs, I get this error:
java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.TaggedInputSplit cannot be cast to org.apache.hadoop.mapreduce.lib.input.FileSplit
at org.myorg.textimage$ImageMapper.setup(textimage.java:80)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapreduce.lib.input.DelegatingMapper.run(DelegatingMapper.java:55)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
I tested a single input path and there was no problem. The error only occurs when I use
MultipleInputs.addInputPath(job, TextInputpath, TextInputFormat.class,
        TextMapper.class);
MultipleInputs.addInputPath(job, ImageInputpath,
        WholeFileInputFormat.class, ImageMapper.class);
I googled and found this link https://issues.apache.org/jira/browse/MAPREDUCE-1178, which says 0.21 had this bug. But I am using 1.0.3; has this bug come back? Has anyone had the same problem, or can anyone tell me how to fix it? Thanks
Here is the setup code of the image mapper; the fourth line is where the error occurs:
protected void setup(Context context) throws IOException,
        InterruptedException {
    InputSplit split = context.getInputSplit();
    Path path = ((FileSplit) split).getPath();
    try {
        pa = new Text(path.toString());
    } catch (Exception e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}

Following up on my comment, the Javadocs for TaggedInputSplit confirm that you are probably wrongly casting the input split to a FileSplit:
/**
 * An {@link InputSplit} that tags another InputSplit with extra data for use
 * by {@link DelegatingInputFormat}s and {@link DelegatingMapper}s.
 */
My guess is your setup method looks something like this:
@Override
protected void setup(Context context) throws IOException,
        InterruptedException {
    FileSplit split = (FileSplit) context.getInputSplit();
}
Unfortunately TaggedInputSplit is not publicly visible, so you can't easily do an instanceof-style check followed by a cast and a call to TaggedInputSplit.getInputSplit() to get the actual underlying FileSplit. So you'll either need to update the source yourself and re-compile and deploy, post a JIRA ticket asking for this to be fixed in a future version (if it hasn't already been actioned in 2+), or perform some nasty, nasty reflection hackery to get to the underlying InputSplit.
This is completely untested:
@Override
protected void setup(Context context) throws IOException,
        InterruptedException {
    InputSplit split = context.getInputSplit();
    Class<? extends InputSplit> splitClass = split.getClass();

    FileSplit fileSplit = null;
    if (splitClass.equals(FileSplit.class)) {
        fileSplit = (FileSplit) split;
    } else if (splitClass.getName().equals(
            "org.apache.hadoop.mapreduce.lib.input.TaggedInputSplit")) {
        // begin reflection hackery...
        try {
            Method getInputSplitMethod = splitClass
                    .getDeclaredMethod("getInputSplit");
            getInputSplitMethod.setAccessible(true);
            fileSplit = (FileSplit) getInputSplitMethod.invoke(split);
        } catch (Exception e) {
            // wrap and re-throw error
            throw new IOException(e);
        }
        // end reflection hackery
    }
}
Reflection Hackery Explained:
With TaggedInputSplit being declared protected scope, it's not visible to classes outside the org.apache.hadoop.mapreduce.lib.input package, and therefore you cannot reference that class in your setup method. To get around this, we perform a number of reflection based operations:
Inspecting the class name, we can test for the type TaggedInputSplit using its fully qualified name:
splitClass.getName().equals("org.apache.hadoop.mapreduce.lib.input.TaggedInputSplit")
We know we want to call the TaggedInputSplit.getInputSplit() method to recover the wrapped input split, so we use the Class.getDeclaredMethod(..) reflection call to acquire a reference to the method:
Method getInputSplitMethod = splitClass.getDeclaredMethod("getInputSplit");
The class still isn't publicly visible, so we use the setAccessible(..) method to override this, stopping the security manager from throwing an exception:
getInputSplitMethod.setAccessible(true);
Finally we invoke the method on the input split reference and cast the result to a FileSplit (optimistically hoping it's an instance of this type!):
fileSplit = (FileSplit) getInputSplitMethod.invoke(split);

I had this same problem, but in my case the actual cause was that I was still setting the InputFormat after setting up the MultipleInputs:
job.setInputFormatClass(SequenceFileInputFormat.class);
Once I removed this line everything worked fine.
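For illustration, a minimal driver sketch of the working setup (the job name and path variables are hypothetical); each input path carries its own InputFormat and mapper, so no job-wide input format should be set afterwards:
Job job = new Job(new Configuration(), "text-and-image");

// Each path registers its own InputFormat and Mapper; MultipleInputs installs a
// DelegatingInputFormat behind the scenes, so do not override it afterwards.
MultipleInputs.addInputPath(job, textInputPath, TextInputFormat.class, TextMapper.class);
MultipleInputs.addInputPath(job, imageInputPath, WholeFileInputFormat.class, ImageMapper.class);

// job.setInputFormatClass(SequenceFileInputFormat.class); // the extra line that broke things in my case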


Get failure exception in @HystrixCommand fallback method

Is there a way to get the reason a HystrixCommand failed when using the @HystrixCommand annotation within a Spring Boot application? It looks like if you implement your own HystrixCommand, you have access to getFailedExecutionException, but how can you get access to this when using the annotation? I would like to be able to do different things in the fallback method based on the type of exception that occurred. Is this possible?
I saw a note about HystrixRequestContext.initializeContext(), but the HystrixRequestContext doesn't give you access to anything. Is there a different way to use that context to get access to the exceptions?
Simply add a Throwable parameter to the fallback method and it will receive the exception which the original command produced.
From https://github.com/Netflix/Hystrix/tree/master/hystrix-contrib/hystrix-javanica
@HystrixCommand(fallbackMethod = "fallback1")
User getUserById(String id) {
    throw new RuntimeException("getUserById command failed");
}

@HystrixCommand(fallbackMethod = "fallback2")
User fallback1(String id, Throwable e) {
    assert "getUserById command failed".equals(e.getMessage());
    throw new RuntimeException("fallback1 failed");
}
I haven't found a way to get the exception with annotations either, but creating my own command worked for me, like so:
public static class DemoCommand extends HystrixCommand<String> {

    protected DemoCommand() {
        super(HystrixCommandGroupKey.Factory.asKey("Demo"));
    }

    @Override
    protected String run() throws Exception {
        throw new RuntimeException("failed!");
    }

    @Override
    protected String getFallback() {
        System.out.println("Events (so far) in Fallback: " + getExecutionEvents());
        return getFailedExecutionException().getMessage();
    }
}
Hopefully this helps someone else as well.
As said in the Hystrix documentation, the getFallback() method will be invoked when:
Whenever a command execution fails: when an exception is thrown by construct() or run()
When the command is short-circuited because the circuit is open
When the command’s thread pool and queue or semaphore are at capacity
When the command has exceeded its timeout length.
So you can easily find out what caused your fallback method to be called by assigning the execution exception to a Throwable object.
Assuming your HystrixCommand returns a String
public class ExampleTask extends HystrixCommand<String> {
    // Your class body
}
do as follows:
@Override
protected String getFallback() {
    Throwable t = getExecutionException();

    if (circuitBreaker.isOpen()) {
        // Log or something
    } else if (t instanceof RejectedExecutionException) {
        // Log and get the thread pool name, could be useful
    } else {
        // Maybe something else happened
    }

    return "A default String"; // Avoid making any HTTP request here, or you will need to wrap it in a HystrixCommand too
}
More info here
I couldn't find a way to obtain the exception with the annotations either, but I found HystrixPlugins; with that you can register a HystrixCommandExecutionHook and get the exact exception in it, like this:
HystrixPlugins.getInstance().registerCommandExecutionHook(new HystrixCommandExecutionHook() {
    @Override
    public <T> void onFallbackStart(final HystrixInvokable<T> commandInstance) {
    }
});
The command instance is a GenericCommand.
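As a rough, untested sketch (the hook signatures follow the Hystrix 1.4+ HystrixCommandExecutionHook API, so verify them against your version), the onError callback is one place where the exact exception is handed to you directly:
HystrixPlugins.getInstance().registerCommandExecutionHook(new HystrixCommandExecutionHook() {
    @Override
    public <T> Exception onError(HystrixInvokable<T> commandInstance,
                                 HystrixRuntimeException.FailureType failureType,
                                 Exception e) {
        // e is the exception raised by the command (or a wrapper for timeouts/rejections,
        // as indicated by failureType); log it or stash it somewhere reachable by your fallback
        System.out.println("Command failed (" + failureType + "): " + e);
        return e; // return the exception (possibly wrapped) to let Hystrix continue as usual
    }
});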
Most of the time just using getFailedExecutionException().getMessage() gave me null values.
Exception errorFromThrowable = getExceptionFromThrowable(getExecutionException());
String errMessage = (errorFromThrowable != null) ? errorFromThrowable.getMessage() : ""; // empty default when no execution exception is available
This gives me better results all the time.
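Put together, a minimal sketch of a fallback inside a HystrixCommand<String> subclass that unwraps the execution exception (the default message is just a placeholder):
@Override
protected String getFallback() {
    // getFailedExecutionException() is often null (e.g. timeouts, rejections),
    // so unwrap the raw execution exception instead
    Exception cause = getExceptionFromThrowable(getExecutionException());
    String message = (cause != null) ? cause.getMessage() : "no execution exception";
    return "fallback: " + message;
}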

set a conf value in mapper - get it in run method

In the run method of the Driver class, I want to fetch a String value (from the mapper function) and write it to a file. I used the following code, but null was returned. Please help.
Mapper
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    context.getConfiguration().set("feedName", feedName);
}
Driver Class
@Override
public int run(String[] args) throws Exception {
    String lineVal = conf.get("feedName");
}
Configuration only works in one direction (driver to tasks).
If you want to pass non-counter types of values back to the driver, you can utilize HDFS for that.
Either write to your main output context (the keys and values you emit from your job), or alternatively use MultipleOutputs if you do not want to mess with your standard job output.
For example, you can write any kind of properties as Text keys and Text values from your mappers or reducers.
Once control is back in your driver, simply read from HDFS. For example, you can load your name/value pairs into a Configuration object to be used by the next job in your sequence:
public void load(Configuration targetConf, Path src, FileSystem fs) throws IOException {
    InputStream is = fs.open(src);
    try {
        Properties props = new Properties();
        props.load(new InputStreamReader(is, "UTF8"));

        for (Map.Entry prop : props.entrySet()) {
            String name = (String) prop.getKey();
            String value = (String) prop.getValue();
            targetConf.set(name, value);
        }
    } finally {
        is.close();
    }
}
Note that if you have multiple mappers or reducers where you write to MultipleOutputs, you will end up with multiple {name}-m-##### or {name}-r-##### files.
In that case, you will need to either read from every output file or run a single reducer job to combine your outputs into one and then just read from one file as shown above.
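To make the writing side concrete, here is a minimal, untested sketch (the class, property, and named-output names are hypothetical) of a mapper emitting a property through MultipleOutputs; it assumes the driver registered the named output with MultipleOutputs.addNamedOutput(job, "props", TextOutputFormat.class, Text.class, Text.class):
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Hypothetical mapper: the real map work is omitted; the point is emitting a
// name/value pair that the driver can later read back with the load() method above.
public class FeedNameMapper extends Mapper<LongWritable, Text, Text, Text> {

    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        // actual map logic omitted for brevity
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // TextOutputFormat writes "key<TAB>value" lines, which Properties.load() can parse
        mos.write("props", new Text("feedName"), new Text("myFeedName"));
        mos.close();
    }
}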
Using Configuration you can only do it the other way around: you can set values in the Driver class
public int run(String[] args) throws Exception {
    conf.set("feedName", value);
}
and then get them in the Mapper class:
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    String lineVal = conf.get("feedName");
}
UPDATE
One option for your question is to write the data to a file, store it in HDFS, and then access it in the Driver class. These files can be treated as "intermediate files".
Just try it and see.

How to set multiple Avro schemas with AvroParquetOutputFormat?

In my MapReduce job, I'm using AvroParquetOutputFormat to write to Parquet files using an Avro schema.
The application logic requires the Reducer to create multiple types of files, and each file has its own Avro schema.
The class AvroParquetOutputFormat has a static method setSchema() to set the Avro schema of the output. Looking at the code, AvroParquetOutputFormat uses AvroWriteSupport.setSchema(), which is again a static implementation.
Without extending AvroWriteSupport and hacking the logic, is there a simpler way to achieve multiple Avro schema output from AvroParquetOutputFormat in a single MR job?
Any pointers/inputs highly appreciated.
Thanks & Regards
MK
It may be quite late to answer, but I have also faced this issue and came up with a solution.
First, there is no built-in support like 'MultipleAvroParquetOutputFormat' in parquet-mr. But to achieve a similar behavior I used MultipleOutputs.
For a map-only kind of job, put your mapper like this:
public class EventMapper extends Mapper<LongWritable, BytesWritable, Void, GenericRecord> {

    protected KafkaAvroDecoder deserializer;
    protected String outputPath = "";

    // Using MultipleOutputs to write custom named files
    protected MultipleOutputs<Void, GenericRecord> mos;

    public void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        Configuration conf = context.getConfiguration();
        outputPath = conf.get(FileOutputFormat.OUTDIR);
        mos = new MultipleOutputs<Void, GenericRecord>(context);
    }

    public void map(LongWritable ln, BytesWritable value, Context context) {
        try {
            GenericRecord record = (GenericRecord) deserializer.fromBytes(value.getBytes());
            AvroWriteSupport.setSchema(context.getConfiguration(), record.getSchema());
            Schema schema = record.getSchema();
            String mergeEventsPath = outputPath + "/" + schema.getName(); // Adding '/' will do no harm
            mos.write((Void) null, record, mergeEventsPath);
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}
This will create a new RecordWriter for each schema and a new Parquet file, appended with the schema name, for example schema1-r-0000.parquet.
This will also create the default part-r-0000x.parquet files based on the schema set in the driver. To avoid this, use LazyOutputFormat like:
LazyOutputFormat.setOutputFormatClass(job, AvroParquetOutputFormat.class);
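For completeness, a rough, untested sketch of the corresponding driver wiring (conf, defaultSchema, and args are assumed to exist in the driver; the schema set here only seeds AvroWriteSupport, and the mapper above overrides it per record):
// Map-only job: MultipleOutputs in the mapper produces one Parquet file per schema,
// and LazyOutputFormat suppresses the empty default part-* files.
Job job = new Job(conf, "multi-schema-parquet");
job.setMapperClass(EventMapper.class);
job.setNumReduceTasks(0);
AvroParquetOutputFormat.setSchema(job, defaultSchema);
LazyOutputFormat.setOutputFormatClass(job, AvroParquetOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);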
Hope this helps.

Hadoop RawLocalFileSystem and getPos

I've found that the getPos in the RawLocalFileSystem's input stream can throw a null pointer exception if its underlying stream is closed.
I discovered this when playing with a custom record reader.
To patch it, I simply check whether a call to "stream.available()" throws an exception, and if so, I return 0 in the getPos() function.
The existing getPos() implementation is found here:
https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20/src/examples/org/apache/hadoop/examples/MultiFileWordCount.java
What should be the correct behaviour of getPos() in the RecordReader?
The "getPos" in the RecordReader has changed over time.
In the old mapred RecordReader implementations, it was used to count bytes read.
/**
 * Returns the current position in the input.
 *
 * @return the current position in the input.
 * @throws IOException
 */
long getPos() throws IOException;
In the newer mapreduce RecordReader implementations, this information is not provided by the RR class, but rather, it is part of the FSInputStream implementations:
class LocalFSFileInputStream extends FSInputStream implements HasFileDescriptor {
    private FileInputStream fis;
    private long position;

    public LocalFSFileInputStream(Path f) throws IOException {
        this.fis = new TrackingFileInputStream(pathToFile(f));
    }

    @Override
    public void seek(long pos) throws IOException {
        fis.getChannel().position(pos);
        this.position = pos;
    }

    @Override
    public long getPos() throws IOException {
        return this.position;
    }
Thus, with the new mapreduce API, the RecordReader was abstracted to not necessarily return a getPos(). Newer implementations of RecordReaders which might want to use this underlying implementation can be rewritten to use the FSInputStream objects directly, which do provide a getPos().
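For reference, here is a minimal sketch of the defensive check the question describes, written as a hypothetical helper inside a custom RecordReader that tracks position itself (e.g. for getProgress()):
// Returns the stream position, or 0 if the underlying stream has already been
// closed (the situation that triggers the NPE described in the question).
private long safePos(FSDataInputStream in) {
    try {
        in.available(); // throws once the stream has been closed
        return in.getPos();
    } catch (Exception e) {
        return 0L;
    }
}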

how to set custom input format in MapReduce?

I am writing a MapReduce program using classes in org.apache.hadoop.mapred.*. Can anybody tell me the cause of this error? My CustomInputFormat class extends InputFormat and I have overridden the createRecordReader method.
Signature of my CustomInputFormat is:
class ParagraphInputFormat extends InputFormat {

    @Override
    public RecordReader createRecordReader(InputSplit arg0,
            TaskAttemptContext arg1) throws IOException, InterruptedException {
        return new CustomRecordReader();
    }

    @Override
    public List<InputSplit> getSplits(JobContext arg0) throws IOException,
            InterruptedException {
        // TODO Auto-generated method stub
        return null;
    }
}
And the signature of CustomRecordReader is class CustomRecordReader extends RecordReader.
While declaring this class I used org.apache.hadoop.mapreduce.*. I am confused between org.apache.hadoop.mapred.* and org.apache.hadoop.mapreduce.*. Eclipse sometimes keeps showing deprecation messages. I heard that Apache added some classes, then removed them, and then added the previous classes again. Is it due to that? Is it affecting my code?
JobConf conf = new JobConf(new Configuration(), MyMRJob.class);
conf.setJobName("NameofJob");

conf.setOutputKeyClass(CutomeKeyClass.class); //no error to this line
conf.setOutputValueClass(IntWritable.class);

conf.setMapperClass(MYMap.class);
conf.setCombinerClass(MyReduce.class);
conf.setReducerClass(MyReduce.class);

conf.setInputFormat(CustomInputFormat.class); //ERROR to this line while typing
conf.setOutputFormat(IntWritable.class);

FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));

JobClient.runJob(conf);
Your input format extends InputFormat from the mapreduce package (it extends rather than implements, and the signature matches that of the new API), yet your job configuration is using the old API (JobConf rather than Job).
So you'll either need to amend your custom input format to implement InputFormat (o.a.h.mapred.InputFormat), or amend your job configuration to use the new API (Job).
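As an illustration of the second option, a rough, untested sketch of the driver rewritten against the new mapreduce API, reusing the class names from the question and assuming MYMap and MyReduce are (or are ported to) new-API Mapper/Reducer types:
// New-API driver: Job instead of JobConf; FileInputFormat/FileOutputFormat come from
// org.apache.hadoop.mapreduce.lib.input / .lib.output.
Configuration conf = new Configuration();
Job job = new Job(conf, "NameofJob");
job.setJarByClass(MyMRJob.class);

job.setOutputKeyClass(CutomeKeyClass.class);
job.setOutputValueClass(IntWritable.class);

job.setMapperClass(MYMap.class);
job.setCombinerClass(MyReduce.class);
job.setReducerClass(MyReduce.class);

job.setInputFormatClass(ParagraphInputFormat.class);

FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

System.exit(job.waitForCompletion(true) ? 0 : 1);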
Hey, I faced the same problem, and then I used the classes from org.apache.hadoop.mapreduce instead of org.apache.hadoop.mapred. It's a problem of the old vs. new API; don't use the JobConf configuration, use the Job configuration only...
