Hadoop PathFilter config is null

I've got a path filter that looks like this:
public class AvroFileInclusionFilter extends Configured implements PathFilter {

    Configuration conf;

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
    }

    @Override
    public boolean accept(Path path) {
        System.out.println("FileInclusion: " + conf.get("fileInclusion"));
        return true;
    }
}
I am explicitly setting the fileInclusion property on the configuration. For some reason, the configuration being used in the path filter is not the same configuration that I am setting up in my job, like so:
Job job = Job.getInstance(getConf(), "Stock Updater");
job.getConfiguration().set("outputPath", opts.outputPath);
String[] inputPaths = findPathsForDays(job.getConfiguration(),
        new Path(opts.inputPath),
        findDaysToQuery(job.getConfiguration(), opts.updatefile)).toArray(new String[]{});
job.getConfiguration().set("fileInclusion", "hello");
AvroKeyValueInputFormat.addInputPath(job, new Path(opts.inputPath));
job.getConfiguration().set("mapred.input.pathFilter.class", AvroFileInclusionFilter.class.getName());
job.setInputFormatClass(AvroKeyValueInputFormat.class);
LazyOutputFormat.setOutputFormatClass(job, AvroKeyValueOutputFormat.class);
AvroKeyValueOutputFormat.setOutputPath(job, new Path(opts.outputPath));
job.addCacheFile(new Path(opts.updatefile).toUri());
AvroKeyValueOutputFormat.setCompressOutput(job, true);
job.getConfiguration().set(AvroJob.CONF_OUTPUT_CODEC, snappyCodec().toString());
AvroJob.setInputKeySchema(job, DateKey.SCHEMA$);
AvroJob.setInputValueSchema(job, StockUpdated.SCHEMA$);
AvroJob.setMapOutputKeySchema(job, DateKey.SCHEMA$);
AvroJob.setMapOutputValueSchema(job, StockUpdated.SCHEMA$);
AvroJob.setOutputKeySchema(job, DateKey.SCHEMA$);
AvroJob.setOutputValueSchema(job, StockUpdated.SCHEMA$);
job.setMapperClass(StockUpdaterMapper.class);
job.setReducerClass(StockUpdaterReducer.class);
AvroMultipleOutputs.addNamedOutput(job, "output", AvroKeyValueOutputFormat.class,
        DateKey.SCHEMA$, StockUpdated.SCHEMA$);
job.setJarByClass(getClass());
boolean success = job.waitForCompletion(true);
The conf.get("fileInclusion") is always null and I cannot figure out why. I've been working on this for quite a while now and I'm pretty much at the end of my rope. Why is the configuration different? I'm submitting the job with both "hadoop jar" and "yarn jar".

Instead of creating the Job object by passing getConf() as an argument, try the following:
Configuration conf = new Configuration();
conf.set("outputPath", opts.outputPath);
conf.set("mapred.input.pathFilter.class", AvroFileInclusionFilter.class.getName());
..
..
// After setting the required key/value pairs on the Configuration object, create the Job by supplying conf
Job job = new Job(conf, "Stock Updater");

The PathFilter should 'implement Configurable' instead of 'extend Configured'.
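For reference, a minimal sketch of that suggestion, reusing the fileInclusion property name from the question (the rest is illustrative, not your exact job):

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Implements Configurable directly, so setConf(...) is called with the job
// configuration when the filter is instantiated through ReflectionUtils.
public class AvroFileInclusionFilter implements Configurable, PathFilter {

    private Configuration conf;

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
    }

    @Override
    public Configuration getConf() {
        return conf;
    }

    @Override
    public boolean accept(Path path) {
        // With the job configuration injected, the property should be visible here.
        String inclusion = (conf == null) ? null : conf.get("fileInclusion");
        System.out.println("FileInclusion: " + inclusion);
        return true;
    }
}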

Related

Passing objects to MapReduce from a driver

I created a driver which reads a config file, builds a list of objects (based on the config) and passes that list to MapReduce (MapReduce has a static attribute which holds a reference to that list of objects).
It works, but only locally. As soon as I run the job on a cluster configuration I get all sorts of errors suggesting that the list hasn't been built. It makes me think that I'm doing it wrong and that on a cluster setup MapReduce is run independently from the driver.
My question is how to correctly initialise a Mapper.
(I'm using Hadoop 2.4.1)
This is related to the problem of side data distribution.
There are two approaches to side data distribution:
1) the distributed cache
2) the Configuration
Since you have objects to be shared, we can use the Configuration class.
This approach relies on the Configuration class to make an object available across the cluster, accessible to all Mappers and/or Reducers. The idea is quite simple: the set(String, String) setter of the Configuration class is used for this. The object to be shared is serialized into a Java String at the driver end and deserialized back into the object in the Mapper or Reducer.
In the example code below, I have used the com.google.gson.Gson class for easy serialization and deserialization. You can use Java serialization as well (a minimal sketch of that appears at the end of this answer).
Class that Represents the Object You need to Share
public class TestBean {

    String string1;
    String string2;

    public TestBean(String test1, String test2) {
        super();
        this.string1 = test1;
        this.string2 = test2;
    }

    public TestBean() {
        this("", "");
    }

    public String getString1() {
        return string1;
    }

    public void setString1(String test1) {
        this.string1 = test1;
    }

    public String getString2() {
        return string2;
    }

    public void setString2(String test2) {
        this.string2 = test2;
    }
}
The Main Class from where you can set the Configurations
public class GSONTestDriver {

    public static void main(String[] args) throws Exception {
        System.out.println("In Main");
        Configuration conf = new Configuration();

        TestBean testB1 = new TestBean("Hello1", "Gson1");
        TestBean testB2 = new TestBean("Hello2", "Gson2");

        Gson gson = new Gson();
        String testSerialization1 = gson.toJson(testB1);
        String testSerialization2 = gson.toJson(testB2);
        conf.set("instance1", testSerialization1);
        conf.set("instance2", testSerialization2);

        Job job = new Job(conf, " GSON Test");
        job.setJarByClass(GSONTestDriver.class);
        job.setMapperClass(GSONTestMapper.class);
        job.setNumReduceTasks(0);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
The mapper class from where you can retrieve the object
public class GSONTestMapper extends
        Mapper<LongWritable, Text, Text, NullWritable> {

    Configuration conf;
    String inst1;
    String inst2;

    public void setup(Context context) {
        conf = context.getConfiguration();
        inst1 = conf.get("instance1");
        inst2 = conf.get("instance2");

        Gson gson = new Gson();
        TestBean tb1 = gson.fromJson(inst1, TestBean.class);
        System.out.println(tb1.getString1());
        System.out.println(tb1.getString2());

        TestBean tb2 = gson.fromJson(inst2, TestBean.class);
        System.out.println(tb2.getString1());
        System.out.println(tb2.getString2());
    }

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(value, NullWritable.get());
    }
}
The bean is converted to a serialized JSON string using the toJson(Object src) method of the com.google.gson.Gson class. The serialized JSON string is then passed as a value through the Configuration instance and accessed by name in the Mapper, where it is deserialized using the fromJson(String json, Class classOfT) method of the same Gson class. In place of my test bean, you could use your own objects.
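If you would rather not depend on Gson, a rough sketch of the plain Java serialization route mentioned above could look like the following. The SideDataUtil class and its method names are made up for illustration, and the bean has to implement java.io.Serializable.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;
import org.apache.hadoop.conf.Configuration;

public class SideDataUtil {

    // Serialize an object and store it in the Configuration as a Base64 string.
    public static void setObject(Configuration conf, String key, Serializable value)
            throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        conf.set(key, Base64.getEncoder().encodeToString(bytes.toByteArray()));
    }

    // Read the Base64 string back from the Configuration and deserialize it.
    @SuppressWarnings("unchecked")
    public static <T> T getObject(Configuration conf, String key)
            throws IOException, ClassNotFoundException {
        byte[] bytes = Base64.getDecoder().decode(conf.get(key));
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (T) in.readObject();
        }
    }
}

In the driver you would call SideDataUtil.setObject(conf, "instance1", testB1) before constructing the Job, and in the mapper's setup() you would call SideDataUtil.getObject(context.getConfiguration(), "instance1").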

Is the generic option -D not supported in Hadoop 0.20.2?

I am trying to set a configuration property from the console using the -D generic option.
Here is my console input:
$ hadoop jar hadoop-0.20.2/gtee.jar dd.MaxTemperature -D file.pattern=2007.* /inputdata /outputdata
But when I cross-verify it from the code with
Configuration conf;
System.out.println(conf.get("file.pattern"));
the result is null. What would be the problem here? Why is the value of the property "file.pattern" not showing up? Can anyone please help me?
Thanks
EDITED SECTION:
Driver Code:
public int run(String[] args) throws Exception {
    Job job = new Job();
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("MaxTemperature");

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(args[0]), conf);
    // System.out.println(conf.get("file.pattern"));
    if (fs.exists(new Path(args[1]))) {
        fs.delete(new Path(args[1]), true);
    }
    System.out.println(conf.get("file.pattern"));

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileInputFormat.setInputPathFilter(job, RegexExcludePathFilter.class);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setMapperClass(MapMapper.class);
    job.setCombinerClass(Mapreducers.class);
    job.setReducerClass(Mapreducers.class);
    return job.waitForCompletion(true) ? 0 : 1;
}

public static void main(String[] args) throws Exception {
    int xx = ToolRunner.run(new Configuration(), new MaxTemperature(), args);
    System.exit(xx);
}
path filter implementation:
public static class RegexExcludePathFilter extends Configured implements
        PathFilter {

    // String pattern = "2007.[0-1]?[0-2].[0-9][0-9].txt";
    Configuration conf;
    Pattern pattern;

    @Override
    public boolean accept(Path path) {
        Matcher m = pattern.matcher(path.toString());
        return m.matches();
    }

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
        pattern = Pattern.compile(conf.get("file.pattern"));
        System.out.println(pattern);
    }
}
To confirm: the -D option is supported in version 0.20.2; however, it requires you to implement the Tool interface so that variables can be read from the command line.
Configuration conf = new Configuration(); // this is the issue
// When implementing Tool, use this instead:
Configuration conf = this.getConf();
You are passing it with a space in between; that is not how you should do it. Instead try:
-Dfile.pattern=2007.*
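Putting the two answers together, a minimal sketch of how the driver could look (illustrative only, not the asker's exact code; imports are omitted, and MapMapper, Mapreducers and RegexExcludePathFilter are the classes from the question):

public class MaxTemperature extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() returns the configuration populated by ToolRunner, including
        // anything passed with -D, instead of a fresh, empty Configuration.
        Configuration conf = getConf();
        System.out.println(conf.get("file.pattern")); // should no longer be null

        Job job = new Job(conf);
        job.setJarByClass(MaxTemperature.class);
        job.setJobName("MaxTemperature");

        FileSystem fs = FileSystem.get(URI.create(args[0]), conf);
        if (fs.exists(new Path(args[1]))) {
            fs.delete(new Path(args[1]), true);
        }

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileInputFormat.setInputPathFilter(job, RegexExcludePathFilter.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setMapperClass(MapMapper.class);
        job.setCombinerClass(Mapreducers.class);
        job.setReducerClass(Mapreducers.class);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MaxTemperature(), args));
    }
}

launched with the -D option placed directly after the main class and before the program arguments:

hadoop jar hadoop-0.20.2/gtee.jar dd.MaxTemperature -Dfile.pattern=2007.* /inputdata /outputdata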

Hadoop: how to include a third-party jar when running a MapReduce job

As we know, we need to pack all required classes into the job jar and upload it to the server. This is slow, so I would like to know whether there is a way to specify third-party jars when executing a MapReduce job, so that I only have to pack my own classes without their dependencies.
PS: I found there is a "-libjars" option, but I can't figure out how to use it. Here is the link: http://blog.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
Those are called generic options.
So, to support those, your job should implement Tool.
Run your job like --
hadoop jar yourfile.jar [mainClass] args -libjars <comma-separated list of jars>
Edit:
To implement Tool and extend Configured, you do something like this in your MapReduce application --
public class YourClass extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new YourClass(), args);
        System.exit(res);
    }

    public int run(String[] args) throws Exception {
        // parse your normal arguments here

        Configuration conf = getConf();
        Job job = new Job(conf, "Name of job");

        // set the class names etc.
        // set the output data type classes etc.

        // to accept the hdfs input and output dir at run time
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }
}
For me, I had to specify the -libjars option before the arguments; otherwise it was treated as a regular argument.
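In other words, the generic options go between the main class and the program arguments, for example (jar name, class name and paths are only placeholders):

hadoop jar yourfile.jar com.example.MainClass -libjars /path/to/dep1.jar,/path/to/dep2.jar /input /output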

Unable to run MR on cluster

I have a MapReduce program that runs successfully in standalone (Eclipse) mode, but when I try to run the same MR job by exporting the jar to a cluster, it shows a null pointer exception like this:
13/06/26 05:46:22 ERROR mypackage.HHDriver: Error while configuring run method.
java.lang.NullPointerException
I used the following code for the main method:
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
    Configuration configuration = new Configuration();
    Tool headOfHouseHold = new HHDriver();
    try {
        ToolRunner.run(configuration, headOfHouseHold, args);
    } catch (Exception exception) {
        exception.printStackTrace();
        LOGGER.error("Error while configuring run method", exception);
        // System.exit(1);
    }
}
run method:
if (args != null && args.length == 8) {
    // Setting the Configurations
    GenericOptionsParser genericOptionsParser = new GenericOptionsParser(args);
    Configuration configuration = genericOptionsParser.getConfiguration();
    // Configuration configuration = new Configuration();

    configuration.set("fs.default.name", args[0]);
    configuration.set("mapred.job.tracker", args[1]);
    configuration.set("deltaFlag", args[2]);
    configuration.set("keyPrefix", args[3]);
    configuration.set("outfileName", args[4]);
    configuration.set("Inpath", args[5]);
    String outputPath = args[6];

    configuration.set("mapred.map.tasks.speculative.execution", "false");
    configuration.set("mapred.reduce.tasks.speculative.execution", "false");

    // To avoid the creation of _LOG and _SUCCESS files
    configuration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false");
    configuration.set("hadoop.job.history.user.location", "none");
    configuration.set(Constants.MAX_NUM_REDUCERS, args[7]);

    // Configuration of the MR-Job
    Job job = new Job(configuration, "HH Job");
    job.setJarByClass(HHDriver.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setNumReduceTasks(HouseHoldingHelper.numReducer(configuration));
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    MultipleOutputs.addNamedOutput(job, configuration.get("outfileName"),
            TextOutputFormat.class, Text.class, Text.class);

    // Deletion of OutputTemp folder (if exists)
    FileSystem fileSystem = FileSystem.get(configuration);
    Path path = new Path(outputPath);
    if (path != null /*&& path.depth() >= 5*/) {
        fileSystem.delete(path, true);
    }

    // Deletion of empty files in the output (if exists)
    FileStatus[] fileStatus = fileSystem.listStatus(new Path(outputPath));
    for (FileStatus file : fileStatus) {
        if (file.getLen() == 0) {
            fileSystem.delete(file.getPath(), true);
        }
    }

    // Setting the Input/Output paths
    FileInputFormat.setInputPaths(job, new Path(configuration.get("Inpath")));
    FileOutputFormat.setOutputPath(job, new Path(outputPath));

    job.setMapperClass(HHMapper.class);
    job.setReducerClass(HHReducer.class);

    job.waitForCompletion(true);

    return job.waitForCompletion(true) ? 0 : 1;
I double-checked the run method parameters; they are not null, and it runs in standalone mode as well.
The issue could be that the Hadoop configuration is not being passed to your program properly.
You can try putting this in the beginning of your driver class:
GenericOptionsParser genericOptionsParser = new GenericOptionsParser(args);
Configuration hadoopConfiguration = genericOptionsParser.getConfiguration();
Then use the hadoopConfiguration object when initializing objects.
e.g.
public int run(String[] args) throws Exception {
    GenericOptionsParser genericOptionsParser = new GenericOptionsParser(args);
    Configuration hadoopConfiguration = genericOptionsParser.getConfiguration();
    Job job = new Job(hadoopConfiguration);
    // set other stuff
    return job.waitForCompletion(true) ? 0 : 1;
}
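A related point (a sketch, not part of the answer above): GenericOptionsParser also strips the generic options (-D, -fs, -jt, -libjars, -files, -archives) out of the argument list, so if you mix them with positional arguments, read the leftovers via getRemainingArgs():

GenericOptionsParser parser = new GenericOptionsParser(new Configuration(), args);
Configuration hadoopConfiguration = parser.getConfiguration();
// Only the application-specific arguments remain, so positional indexes such as
// args[0]..args[7] in the run method above keep their meaning.
String[] remainingArgs = parser.getRemainingArgs();
String outputPath = remainingArgs[6]; // e.g. the output path argument from the question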

How to use JobControl in hadoop

I want to merge two files into one.
I made two mappers to read, and one reducer to join.
JobConf classifiedConf = new JobConf(new Configuration());
classifiedConf.setJarByClass(myjob.class);
classifiedConf.setJobName("classifiedjob");
FileInputFormat.setInputPaths(classifiedConf,classifiedInputPath );
classifiedConf.setMapperClass(ClassifiedMapper.class);
classifiedConf.setMapOutputKeyClass(TextPair.class);
classifiedConf.setMapOutputValueClass(Text.class);
Job classifiedJob = new Job(classifiedConf);
//first mapper config
JobConf featureConf = new JobConf(new Configuration());
featureConf.setJobName("featureJob");
featureConf.setJarByClass(myjob.class);
FileInputFormat.setInputPaths(featureConf, featuresInputPath);
featureConf.setMapperClass(FeatureMapper.class);
featureConf.setMapOutputKeyClass(TextPair.class);
featureConf.setMapOutputValueClass(Text.class);
Job featureJob = new Job(featureConf);
//second mapper config
JobConf joinConf = new JobConf(new Configuration());
joinConf.setJobName("joinJob");
joinConf.setJarByClass(myjob.class);
joinConf.setReducerClass(JoinReducer.class);
joinConf.setOutputKeyClass(Text.class);
joinConf.setOutputValueClass(Text.class);
Job joinJob = new Job(joinConf);
//reducer config
//JobControl config
joinJob.addDependingJob(featureJob);
joinJob.addDependingJob(classifiedJob);
secondJob.addDependingJob(joinJob);
JobControl jobControl = new JobControl("jobControl");
jobControl.addJob(classifiedJob);
jobControl.addJob(featureJob);
jobControl.addJob(secondJob);
Thread thread = new Thread(jobControl);
thread.start();
while (jobControl.allFinished()) {
    jobControl.stop();
}
But, I get this message:
WARN mapred.JobClient:
Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
Can anyone please help?
Which version of Hadoop are you using?
Does the warning you get stop the program?
You don't need to use setJarByClass(). You can see from my snippet below that I run it without using the setJarByClass() method:
JobConf job = new JobConf(PageRankJob.class);
job.setJobName("PageRankJob");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(PageRankMapper.class);
job.setReducerClass(PageRankReducer.class);
job.setInputFormat(TextInputFormat.class);
job.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
JobClient.runJob(job);
You should implement your Job this way:
public class MyApp extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        // Configuration processed by ToolRunner
        Configuration conf = getConf();

        // Create a JobConf using the processed conf
        JobConf job = new JobConf(conf, MyApp.class);

        // Process custom command-line options
        Path in = new Path(args[1]);
        Path out = new Path(args[2]);

        // Specify various job-specific parameters
        job.setJobName("my-app");
        job.setInputPath(in);
        job.setOutputPath(out);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        // Submit the job, then poll for progress until the job is complete
        JobClient.runJob(job);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // Let ToolRunner handle generic command-line options
        int res = ToolRunner.run(new Configuration(), new MyApp(), args);
        System.exit(res);
    }
}
This comes straight out of Hadoop's documentation here.
So basically your job needs to extend Configured and implement Tool. This will force you to implement run(). Then start your job from your main class using ToolRunner.run(<your job>, <args>) and the warning will disappear.
You need to have this code in the driver: job.setJarByClass(MapperClassName.class);
