Is the generic option -D not supported in Hadoop 0.20.2? - hadoop

I am trying to set a configuration property from the console using the -D generic option.
Here is my console input:
$ hadoop jar hadoop-0.20.2/gtee.jar dd.MaxTemperature -D file.pattern=2007.* /inputdata /outputdata
To cross-check, I read the property back in the code:
Configuration conf;
System.out.println(conf.get("file.pattern"));
This prints null. What is the problem here, and why is the value of the property "file.pattern" not displayed? Can anyone please help me?
Thanks
EDITED SECTION:
Driver Code:
public int run(String[] args) throws Exception {
Job job = new Job();
job.setJarByClass(MaxTemperature.class);
job.setJobName("MaxTemperature");
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(args[0]), conf);
// System.out.println(conf.get("file.pattern"));
if (fs.exists(new Path(args[1]))) {
fs.delete(new Path(args[1]), true);
}
System.out.println(conf.get("file.pattern"));
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileInputFormat.setInputPathFilter(job, RegexExcludePathFilter.class);
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setOutputFormatClass(TextOutputFormat.class);
job.setMapperClass(MapMapper.class);
job.setCombinerClass(Mapreducers.class);
job.setReducerClass(Mapreducers.class);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int xx = ToolRunner.run(new Configuration(),new MaxTemperature(), args);
System.exit(xx);
}
path filter implementation:
public static class RegexExcludePathFilter extends Configured implements
PathFilter {
//String pattern = "2007.[0-1]?[0-2].[0-9][0-9].txt" ;
Configuration conf;
Pattern pattern;
@Override
public boolean accept(Path path) {
Matcher m = pattern.matcher(path.toString());
return m.matches();
}
@Override
public void setConf(Configuration conf) {
this.conf = conf;
pattern = Pattern.compile(conf.get("file.pattern"));
System.out.println(pattern);
}
}

To confirm, the -D option is supported in version 0.20.2. However, it requires you to implement the Tool interface to read variables from the command line, and to use the configuration that ToolRunner passes in rather than creating a new one:
Configuration conf = new Configuration(); //this is the issue
// When implementing tool use this
Configuration conf = this.getConf();
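To illustrate why a freshly created configuration prints null, here is a minimal sketch of the idea in plain Java. GenericOptionsSketch is a hypothetical stand-in written for this example, not Hadoop's GenericOptionsParser: it collects -D pairs into a map and leaves the remaining arguments for the application. Accepting both the spaced and the compact form is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for illustration only; this is NOT Hadoop's GenericOptionsParser.
final class GenericOptionsSketch {
    // Properties collected from -D options, in the order they appeared.
    final Map<String, String> properties = new LinkedHashMap<>();
    // Everything that is not a -D option, e.g. input and output paths.
    final List<String> remainingArgs = new ArrayList<>();

    GenericOptionsSketch(String[] args) {
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            String pair = null;
            if (arg.equals("-D") && i + 1 < args.length) {
                pair = args[++i];          // spaced form: -D key=value
            } else if (arg.startsWith("-D") && arg.length() > 2) {
                pair = arg.substring(2);   // compact form: -Dkey=value
            }
            if (pair == null) {
                remainingArgs.add(arg);    // ordinary application argument
                continue;
            }
            int eq = pair.indexOf('=');
            if (eq > 0) {
                properties.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
            // A pair without '=' is silently ignored in this sketch.
        }
    }

    public static void main(String[] args) {
        GenericOptionsSketch parsed = new GenericOptionsSketch(
                new String[] {"-D", "file.pattern=2007.*", "/inputdata", "/outputdata"});
        // The parsed settings know about file.pattern ...
        System.out.println(parsed.properties.get("file.pattern"));
        System.out.println(parsed.remainingArgs);
        // ... but a brand-new, empty settings object does not, which mirrors
        // calling "new Configuration()" inside run() instead of getConf().
        System.out.println(new LinkedHashMap<String, String>().get("file.pattern"));
    }
}
```

The point is only that the values parsed from the command line live in the object handed to your Tool; any configuration you construct yourself starts without them.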

You are passing it with a space in between. The documented form is -D <property=value>, so the space should be accepted, but you can also try the form without it:
-Dfile.pattern=2007.*

Related

Not understanding the path in distributed cache

From the code below, I didn't understand two things:
DistributedCache.addCacheFile(new URI("/abc.dat"), job.getConfiguration())
I didn't understand whether the URI path has to be present in HDFS. Correct me if I am wrong.
And what does p.getName().equals() do in the code below?
public class MyDC {
public static class MyMapper extends Mapper < LongWritable, Text, Text, Text > {
private Map < String, String > abMap = new HashMap < String, String > ();
private Text outputKey = new Text();
private Text outputValue = new Text();
protected void setup(Context context) throws
java.io.IOException, InterruptedException {
Path[] files = DistributedCache.getLocalCacheFiles(context.getConfiguration());
for (Path p: files) {
if (p.getName().equals("abc.dat")) {
BufferedReader reader = new BufferedReader(new FileReader(p.toString()));
String line = reader.readLine();
while (line != null) {
String[] tokens = line.split("\t");
String ab = tokens[0];
String state = tokens[1];
abMap.put(ab, state);
line = reader.readLine();
}
}
}
if (abMap.isEmpty()) {
throw new IOException("Unable to load abbreviation data.");
}
}
protected void map(LongWritable key, Text value, Context context)
throws java.io.IOException, InterruptedException {
String row = value.toString();
String[] tokens = row.split("\t");
String inab = tokens[0];
String state = abMap.get(inab);
outputKey.set(state);
outputValue.set(row);
context.write(outputKey, outputValue);
}
}
public static void main(String[] args)
throws IOException, ClassNotFoundException, InterruptedException {
Job job = new Job();
job.setJarByClass(MyDC.class);
job.setJobName("DCTest");
job.setNumReduceTasks(0);
try {
DistributedCache.addCacheFile(new URI("/abc.dat"), job.getConfiguration());
} catch (Exception e) {
System.out.println(e);
}
job.setMapperClass(MyMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}
The idea of the distributed cache is to make some static data available on the task node before the task starts executing.
The file has to be present in HDFS so that it can then be added to the distributed cache (and copied to each task node).
DistributedCache.getLocalCacheFiles returns all the cache files present on that task node. With if (p.getName().equals("abc.dat")) { you select the cache file that your application needs to process.
Please refer to the docs below:
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#DistributedCache
https://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/filecache/DistributedCache.html#getLocalCacheFiles(org.apache.hadoop.conf.Configuration)
DistributedCache is an API used to distribute a file or a group of files so that they are available on every node where map-reduce tasks run. One example of using DistributedCache is map-side joins.
DistributedCache.addCacheFile(new URI("/abc.dat"), job.getConfiguration()) adds the abc.dat file to the cache. There can be any number of files in the cache, and p.getName().equals("abc.dat") checks for the file you need. Every path in HDFS is represented as a Path for map-reduce processing. For example:
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
The first, Path(args[0]), is the first argument (the input file location) you pass when executing the jar, and Path(args[1]) is the second argument, which is the output file location. Each is handled as a Path.
In the same way, when you add a file to the cache it is placed in the Path array, which you should retrieve using the code below.
Path[] files = DistributedCache.getLocalCacheFiles(context.getConfiguration());
It returns all the files present in the cache, and you find your file by name with the p.getName().equals() method.
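The selection step can be tried outside Hadoop with a small sketch. It assumes java.nio.file.Path in place of Hadoop's Path, and CacheLookupSketch and its load method are hypothetical names introduced for illustration; the logic mirrors the setup() method in the question: scan the list of local files, pick the one whose name matches, and load its tab-separated lines into a map.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; uses java.nio.file.Path as a stand-in for Hadoop's Path.
final class CacheLookupSketch {
    // Picks the file named wantedName out of localFiles and loads "key<TAB>value" lines.
    static Map<String, String> load(List<Path> localFiles, String wantedName) throws IOException {
        Map<String, String> lookup = new HashMap<>();
        for (Path p : localFiles) {
            // Compare only the last path component, like p.getName() in the Hadoop code.
            if (!p.getFileName().toString().equals(wantedName)) {
                continue;
            }
            try (BufferedReader reader = Files.newBufferedReader(p, StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] tokens = line.split("\t");
                    if (tokens.length >= 2) {   // skip malformed lines
                        lookup.put(tokens[0], tokens[1]);
                    }
                }
            }
        }
        return lookup;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("cache");
        Path wanted = Files.write(dir.resolve("abc.dat"),
                "CA\tCalifornia\nNY\tNew York\n".getBytes(StandardCharsets.UTF_8));
        Path other = Files.write(dir.resolve("other.dat"),
                "XX\tignored\n".getBytes(StandardCharsets.UTF_8));
        // Only the entries from abc.dat end up in the map.
        System.out.println(load(Arrays.asList(other, wanted), "abc.dat"));
    }
}
```

The name check matters because the cache can hold several files, and each task sees all of them in the same array.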

How can I have multiple mappers and reducers?

I have this code in which I have set one mapper and one reducer. I want to include one more mapper and reducer for further processing.
The problem is that I have to take the output file of the first MapReduce job as the input to the next MapReduce job. Is it possible to do that? If yes, how can I do it?
public int run(String[] args) throws Exception
{
JobConf conf = new JobConf(getConf(),DecisionTreec45.class);
conf.setJobName("c4.5");
// the keys are words (strings)
conf.setOutputKeyClass(Text.class);
// the values are counts (ints)
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(MyMapper.class);
conf.setReducerClass(MyReducer.class);
//set your input file path below
FileInputFormat.setInputPaths(conf, "/home/hduser/Id3_hds/playtennis.txt");
FileOutputFormat.setOutputPath(conf, new Path("/home/hduser/Id3_hds/1/output"+current_index));
JobClient.runJob(conf);
return 0;
}
Yes, it's possible to do that. You can check the following tutorial to see how chaining works: http://gandhigeet.blogspot.com/2012/12/as-discussed-in-previous-post-hadoop.html
Make sure you delete the intermediate output data in HDFS that each MR phase creates, using fs.delete(intermediateoutputPath);
Look at how this works.
You need to have two jobs. Job2 is dependent on job1.
public class ChainJobs extends Configured implements Tool {
private static final String OUTPUT_PATH = "intermediate_output";
@Override
public int run(String[] args) throws Exception {
/*
* Job 1
*/
Configuration conf = getConf();
FileSystem fs = FileSystem.get(conf);
Job job = new Job(conf, "Job1");
job.setJarByClass(ChainJobs.class);
job.setMapperClass(MyMapper1.class);
job.setReducerClass(MyReducer1.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
TextInputFormat.addInputPath(job, new Path(args[0]));
TextOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));
job.waitForCompletion(true); /* Blocks until job 1 completes; the second job depends on the first. */
/*
* Job 2
*/
Configuration conf2 = getConf();
Job job2 = new Job(conf2, "Job 2");
job2.setJarByClass(ChainJobs.class);
job2.setMapperClass(MyMapper2.class);
job2.setReducerClass(MyReducer2.class);
job2.setOutputKeyClass(Text.class);
job2.setOutputValueClass(Text.class);
job2.setInputFormatClass(TextInputFormat.class);
job2.setOutputFormatClass(TextOutputFormat.class);
TextInputFormat.addInputPath(job2, new Path(OUTPUT_PATH));
TextOutputFormat.setOutputPath(job2, new Path(args[1]));
return job2.waitForCompletion(true) ? 0 : 1;
}
/**
* Method Name: main Return type: none Purpose:Read the arguments from
* command line and run the Job till completion
*
*/
public static void main(String[] args) throws Exception {
// TODO Auto-generated method stub
if (args.length != 2) {
System.err.println("Enter valid number of arguments <Inputdirectory> <Outputlocation>");
System.exit(2); // non-zero exit code: invalid usage
}
System.exit(ToolRunner.run(new Configuration(), new ChainJobs(), args));
}
}
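Note that the ChainJobs code above does not check the result of the first job before starting the second. A minimal sketch of that dependency in plain Java follows; ChainSketch and runChain are hypothetical names, and each stage stands in for a call such as job.waitForCompletion(true), returning true on success.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

// Hypothetical illustration of sequential job chaining; no Hadoop classes involved.
final class ChainSketch {
    // Runs the stages in order and stops at the first failure.
    // Returns 0 if every stage succeeded, 1 otherwise.
    static int runChain(List<Callable<Boolean>> stages) throws Exception {
        for (Callable<Boolean> stage : stages) {
            if (!stage.call()) {
                return 1;   // a later stage must not read a failed stage's output
            }
        }
        return 0;
    }

    public static void main(String[] args) throws Exception {
        List<String> log = new ArrayList<>();
        List<Callable<Boolean>> stages = new ArrayList<>();
        // Each lambda stands in for job.waitForCompletion(true).
        stages.add(() -> { log.add("job1"); return true; });
        stages.add(() -> { log.add("job2"); return true; });
        System.out.println(runChain(stages) + " " + log);
    }
}
```

In the Hadoop driver the same effect could be had by returning a non-zero code from run() when the first waitForCompletion call returns false, so that job 2 never reads a partial intermediate output.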

Hadoop mapreduce.job.reduces in Generic Option Syntax?

I am trying to set the number of reducers via the command line. It seems like I am using the wrong syntax. I am using Hadoop 2.5 (YARN), MR2.
hadoop jar mrjobs-0.1.jar com.example.Weather -D mapreduce.job.reduces=2 datasets/inputs output
This command does not work when I add the -D option; otherwise it works fine.
Any help appreciated !
thanks!
The syntax looks correct. I have tested the following against 2.5 YARN MR2 and it works:
hadoop jar hadoop-mapreduce-examples.jar wordcount -Dmapreduce.job.reduces=5 input output
Most probably the problem is that your driver class doesn't implement the Tool interface and run through ToolRunner, which works together with GenericOptionsParser to parse generic command-line arguments.
Here is an example of how to use ToolRunner in your MapReduce driver class:
// imports ignored
public class ExampleDriver extends Configured implements Tool {
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: ExampleDriver <in> <out>");
System.exit(2);
}
Configuration conf = getConf();
Job job = Job.getInstance(conf);
job.setJobName("example driver");
job.setJarByClass(ExampleDriver.class);
job.setMapperClass(YourMapper.class);
job.setReducerClass(YourReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
int ret = job.waitForCompletion(true) ? 0 : 1;
return ret;
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new ExampleDriver(), args);
System.exit(res);
}
}

Unable to run MR on cluster

I have a MapReduce program that runs successfully in standalone (Eclipse) mode, but when I export the jar and run the same job on the cluster, it throws a NullPointerException like this:
13/06/26 05:46:22 ERROR mypackage.HHDriver: Error while configuring run method.
java.lang.NullPointerException
I used the following code for the main method:
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
Configuration configuration = new Configuration();
Tool headOfHouseHold = new HHDriver();
try {
ToolRunner.run(configuration,headOfHouseHold,args);
} catch (Exception exception) {
exception.printStackTrace();
LOGGER.error("Error while configuring run method", exception);
// System.exit(1);
}
}
run method:
if (args != null && args.length == 8) {
// Setting the Configurations
GenericOptionsParser genericOptionsParser=new GenericOptionsParser(args);
Configuration configuration=genericOptionsParser.getConfiguration();
//Configuration configuration = new Configuration();
configuration.set("fs.default.name", args[0]);
configuration.set("mapred.job.tracker", args[1]);
configuration.set("deltaFlag",args[2]);
configuration.set("keyPrefix",args[3]);
configuration.set("outfileName",args[4]);
configuration.set("Inpath",args[5]);
String outputPath=args[6];
configuration.set("mapred.map.tasks.speculative.execution", "false");
configuration.set("mapred.reduce.tasks.speculative.execution", "false");
// To avoid the creation of _LOG and _SUCCESS files
configuration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false");
configuration.set("hadoop.job.history.user.location", "none");
configuration.set(Constants.MAX_NUM_REDUCERS,args[7]);
// Configuration of the MR-Job
Job job = new Job(configuration, "HH Job");
job.setJarByClass(HHDriver.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setNumReduceTasks(HouseHoldingHelper.numReducer(configuration));
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
MultipleOutputs.addNamedOutput(job,configuration.get("outfileName"),
TextOutputFormat.class,Text.class,Text.class);
// Deletion of OutputTemp folder (if exists)
FileSystem fileSystem = FileSystem.get(configuration);
Path path = new Path(outputPath);
if (path != null /*&& path.depth() >= 5*/) {
fileSystem.delete(path, true);
}
// Deletion of empty files in the output (if exists)
FileStatus[] fileStatus = fileSystem.listStatus(new Path(outputPath));
for (FileStatus file : fileStatus) {
if (file.getLen() == 0) {
fileSystem.delete(file.getPath(), true);
}
}
// Setting the Input/Output paths
FileInputFormat.setInputPaths(job, new Path(configuration.get("Inpath")));
FileOutputFormat.setOutputPath(job, new Path(outputPath));
job.setMapperClass(HHMapper.class);
job.setReducerClass(HHReducer.class);
job.waitForCompletion(true);
return job.waitForCompletion(true) ? 0 : 1;
I double-checked the run method parameters; they are not null, and the job runs in standalone mode as well.
The issue could be that the Hadoop configuration is not being passed to your program properly.
You can try putting this at the beginning of your driver class:
GenericOptionsParser genericOptionsParser = new GenericOptionsParser(args);
Configuration hadoopConfiguration = genericOptionsParser.getConfiguration();
Then use the hadoopConfiguration object when initializing objects.
e.g.
public int run(String[] args) throws Exception {
GenericOptionsParser genericOptionsParser = new GenericOptionsParser(args);
Configuration hadoopConfiguration = genericOptionsParser.getConfiguration();
Job job = new Job(hadoopConfiguration);
//set other stuff
}

Multiple ways to write driver of Hadoop program - Which one to choose?

I have observed that there are multiple ways to write the driver of a Hadoop program.
The following method is given in the Hadoop Tutorial by Yahoo:
public void run(String inputPath, String outputPath) throws Exception {
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
// the keys are words (strings)
conf.setOutputKeyClass(Text.class);
// the values are counts (ints)
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(MapClass.class);
conf.setReducerClass(Reduce.class);
FileInputFormat.addInputPath(conf, new Path(inputPath));
FileOutputFormat.setOutputPath(conf, new Path(outputPath));
JobClient.runJob(conf);
}
And this method is given in the book Hadoop: The Definitive Guide (O'Reilly, 2012):
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: MaxTemperature <input path> <output path>");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(MaxTemperature.class);
job.setJobName("Max temperature");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(MaxTemperatureMapper.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
While trying the program given in the O'Reilly book, I found that the constructors of the Job class are deprecated. As the O'Reilly book is based on Hadoop 2 (YARN), I was surprised to see that they used deprecated constructors.
I would like to know which method everyone uses.
I use the former approach. If we override the run() method, we can use hadoop jar options like -D, -libjars, -files, etc. All of these are necessary in almost any Hadoop project.
I am not sure whether we can use them through the main() method.
Slightly different from your first (Yahoo) block: you should be using the ToolRunner / Tool classes, which take advantage of the GenericOptionsParser (as noted in Eswara's answer).
A template pattern would be something like:
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class ToolExample extends Configured implements Tool {
@Override
public int run(String[] args) throws Exception {
// old API
JobConf jobConf = new JobConf(getConf());
// new API
Job job = new Job(getConf());
// rest of your config here
// determine success / failure (depending on your choice of old / new api)
// return 0 for success, non-zero for an error
return 0;
}
public static void main(String args[]) throws Exception {
System.exit(ToolRunner.run(new ToolExample(), args));
}
}
