Unable to run MR on cluster

I have a MapReduce program that runs successfully in standalone (Eclipse) mode, but when I export the jar and run the same job on the cluster, it throws a NullPointerException like this:
13/06/26 05:46:22 ERROR mypackage.HHDriver: Error while configuring run method.
java.lang.NullPointerException
I used the following main method to invoke the run method:
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
Configuration configuration = new Configuration();
Tool headOfHouseHold = new HHDriver();
try {
ToolRunner.run(configuration,headOfHouseHold,args);
} catch (Exception exception) {
exception.printStackTrace();
LOGGER.error("Error while configuring run method", exception);
// System.exit(1);
}
}
run method:
if (args != null && args.length == 8) {
// Setting the Configurations
GenericOptionsParser genericOptionsParser=new GenericOptionsParser(args);
Configuration configuration=genericOptionsParser.getConfiguration();
//Configuration configuration = new Configuration();
configuration.set("fs.default.name", args[0]);
configuration.set("mapred.job.tracker", args[1]);
configuration.set("deltaFlag",args[2]);
configuration.set("keyPrefix",args[3]);
configuration.set("outfileName",args[4]);
configuration.set("Inpath",args[5]);
String outputPath=args[6];
configuration.set("mapred.map.tasks.speculative.execution", "false");
configuration.set("mapred.reduce.tasks.speculative.execution", "false");
// To avoid the creation of _LOG and _SUCCESS files
configuration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false");
configuration.set("hadoop.job.history.user.location", "none");
configuration.set(Constants.MAX_NUM_REDUCERS,args[7]);
// Configuration of the MR-Job
Job job = new Job(configuration, "HH Job");
job.setJarByClass(HHDriver.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setNumReduceTasks(HouseHoldingHelper.numReducer(configuration));
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
MultipleOutputs.addNamedOutput(job,configuration.get("outfileName"),
TextOutputFormat.class,Text.class,Text.class);
// Deletion of OutputTemp folder (if exists)
FileSystem fileSystem = FileSystem.get(configuration);
Path path = new Path(outputPath);
if (path != null /*&& path.depth() >= 5*/) {
fileSystem.delete(path, true);
}
// Deletion of empty files in the output (if exists)
FileStatus[] fileStatus = fileSystem.listStatus(new Path(outputPath));
for (FileStatus file : fileStatus) {
if (file.getLen() == 0) {
fileSystem.delete(file.getPath(), true);
}
}
// Setting the Input/Output paths
FileInputFormat.setInputPaths(job, new Path(configuration.get("Inpath")));
FileOutputFormat.setOutputPath(job, new Path(outputPath));
job.setMapperClass(HHMapper.class);
job.setReducerClass(HHReducer.class);
job.waitForCompletion(true);
return job.waitForCompletion(true) ? 0 : 1;
I double-checked the run method parameters; they are not null, and the program runs fine in standalone mode.

The issue could be that the Hadoop configuration is not being passed to your program properly.
You can try putting this at the beginning of your driver class:
GenericOptionsParser genericOptionsParser=new GenericOptionsParser(args);
Configuration hadoopConfiguration=genericOptionsParser.getConfiguration();
Then use the hadoopConfiguration object when initializing other objects.
For example:
public int run(String[] args) throws Exception {
GenericOptionsParser genericOptionsParser=new GenericOptionsParser(args);
Configuration hadoopConfiguration=genericOptionsParser.getConfiguration();
Job job = new Job(hadoopConfiguration);
//set other stuff
return job.waitForCompletion(true) ? 0 : 1;
}

Related

Not understanding the path in distributed path

There are two things I didn't understand in the code below:
DistributedCache.addCacheFile(new URI("/abc.dat"), job.getConfiguration())
I didn't understand whether the URI path has to be present in HDFS. Correct me if I am wrong.
And what does p.getName().equals() do in the code below?
public class MyDC {
public static class MyMapper extends Mapper < LongWritable, Text, Text, Text > {
private Map < String, String > abMap = new HashMap < String, String > ();
private Text outputKey = new Text();
private Text outputValue = new Text();
protected void setup(Context context) throws
java.io.IOException, InterruptedException {
Path[] files = DistributedCache.getLocalCacheFiles(context.getConfiguration());
for (Path p: files) {
if (p.getName().equals("abc.dat")) {
BufferedReader reader = new BufferedReader(new FileReader(p.toString()));
String line = reader.readLine();
while (line != null) {
String[] tokens = line.split("\t");
String ab = tokens[0];
String state = tokens[1];
abMap.put(ab, state);
line = reader.readLine();
}
}
}
if (abMap.isEmpty()) {
throw new IOException("Unable to load Abbreviation data.");
}
}
protected void map(LongWritable key, Text value, Context context)
throws java.io.IOException, InterruptedException {
String row = value.toString();
String[] tokens = row.split("\t");
String inab = tokens[0];
String state = abMap.get(inab);
outputKey.set(state);
outputValue.set(row);
context.write(outputKey, outputValue);
}
}
public static void main(String[] args)
throws IOException, ClassNotFoundException, InterruptedException {
Job job = new Job();
job.setJarByClass(MyDC.class);
job.setJobName("DCTest");
job.setNumReduceTasks(0);
try {
DistributedCache.addCacheFile(new URI("/abc.dat"), job.getConfiguration());
} catch (Exception e) {
System.out.println(e);
}
job.setMapperClass(MyMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}

The idea of Distributed Cache is to make some static data available to the task node before it starts its execution.
The file has to be present in HDFS so that it can be added to the Distributed Cache and copied to each task node.
DistributedCache.getLocalCacheFiles returns all the cache files present on that task node. With if (p.getName().equals("abc.dat")) { you select the cache file that your application needs to process.
Please refer to the docs below:
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#DistributedCache
https://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/filecache/DistributedCache.html#getLocalCacheFiles(org.apache.hadoop.conf.Configuration)
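The lookup that setup() performs can be tried outside Hadoop. The following is only a plain-Java sketch under stated assumptions: it uses java.nio.file.Path instead of Hadoop's Path, so getFileName() stands in for getName(), and the class name CacheFileLookup, the demo() helper and the sample file contents are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plain-Java analogue of the mapper's setup(); not Hadoop code.
class CacheFileLookup {

    // Scans the localized files, picks the one with the wanted name, and
    // loads its tab-separated "abbreviation<TAB>state" lines into a map.
    static Map<String, String> load(List<Path> localFiles, String wantedName) throws IOException {
        Map<String, String> abMap = new HashMap<>();
        for (Path p : localFiles) {
            // Same idea as p.getName().equals("abc.dat") in the question.
            if (!p.getFileName().toString().equals(wantedName)) {
                continue;
            }
            // try-with-resources closes the reader even if parsing fails.
            try (BufferedReader reader = Files.newBufferedReader(p, StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] tokens = line.split("\t");
                    if (tokens.length >= 2) {
                        abMap.put(tokens[0], tokens[1]);
                    }
                }
            }
        }
        return abMap;
    }

    // Creates two sample files (assumed contents) and loads only abc.dat.
    static Map<String, String> demo() {
        try {
            Path dir = Files.createTempDirectory("cache-demo");
            Path wanted = Files.write(dir.resolve("abc.dat"),
                    "CA\tCalifornia\nNY\tNew York\n".getBytes(StandardCharsets.UTF_8));
            Path other = Files.write(dir.resolve("other.dat"),
                    "XX\tIgnored\n".getBytes(StandardCharsets.UTF_8));
            return load(Arrays.asList(other, wanted), "abc.dat");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Only the entries from abc.dat end up in the map; other.dat is skipped by the name check, which is the role the equals() test plays in the mapper.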

DistributedCache is an API used to distribute a file or a group of files to the local disk of every node on which the job's map or reduce tasks run, so that the tasks can read them locally. One example of using DistributedCache is map-side joins.
DistributedCache.addCacheFile(new URI("/abc.dat"), job.getConfiguration()) adds the abc.dat file to the cache. There can be any number of files in the cache, and p.getName().equals("abc.dat") picks out the file you need. File locations in a MapReduce job are represented as Path objects. For example:
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
The first, Path(args[0]), is the first argument (the input file location) that you pass when executing the jar, and Path(args[1]) is the second argument, which is the output file location. Each location is wrapped in a Path.
In the same way, when you add a file to the cache it ends up in a Path array, which you retrieve using the code below.
Path[] files = DistributedCache.getLocalCacheFiles(context.getConfiguration());
It returns all the files present in the cache, and you find your file by name using the p.getName().equals() method.

Why Context.Write not working as expected- Hadoop Map reduce

I have one MR job whose output looks like this:
128.187.140.171,11
129.109.6.54,27
129.188.154.200,44
129.193.116.41,5
129.217.186.112,17
In the mapper code of the second MR job, I am doing this:
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
// Parse the input string into a nice map
// System.out.println(value.toString());
if (value.toString().contains(",")) {
System.out.println("Inside");
String[] arr = value.toString().split(",");
if (arr.length > 1) {
System.out.println(arr[1]);
context.write(new Text(arr[1]), new Text(arr[0]));
}
}
}
The output of the print statements is correct:
Inside
11
Inside
27
But context.write keeps producing the following output:
1,slip4068.sirius.com
1,hstar.gsfc.nasa.gov
1,ad11-010.compuserve.com
1,slip85-2.co.us.ibm.net
1,stimpy.actrix.gen.nz
1,j14.ktk1.jaring.my
1,ad08-009.compuserve.com
Why do I keep getting 1 in the keys?
This is my driver code:
public int run(String[] args) throws Exception {
// TODO Auto-generated method stub
Configuration conf = getConf();
conf.set("mapreduce.output.textoutputformat.separator", ",");
Job job = new Job(conf, "WL Demo");
job.setJarByClass(WLDemo.class);
job.setMapperClass(WLMapper1.class);
job.setReducerClass(WLReducer1.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
Path in = new Path(args[0]);
Path out = new Path(args[1]);
Path out2 = new Path(args[2]);
FileInputFormat.setInputPaths(job, in);
FileOutputFormat.setOutputPath(job, out);
boolean succ = job.waitForCompletion(true);
if (!succ) {
System.out.println("Job1 failed, exiting");
return -1;
}
Job job2 = new Job(conf, "top-k-pass-2");
FileInputFormat.setInputPaths(job2, out);
FileOutputFormat.setOutputPath(job2, out2);
job2.setJarByClass(WLDemo.class);
job2.setMapperClass(WLMapper2.class);
// job2.setReducerClass(Reducer1.class);
job2.setInputFormatClass(TextInputFormat.class);
job2.setMapOutputKeyClass(Text.class);
job2.setMapOutputValueClass(Text.class);
job2.setNumReduceTasks(1);
succ = job2.waitForCompletion(true);
if (!succ) {
System.out.println("Job2 failed, exiting");
return -1;
}
return 0;
}
How can I get the correct values in the output key of my second MR job?

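Change job2.setNumReduceTasks(1) to job2.setNumReduceTasks(0). With one reduce task and no reducer class set, the job runs the default identity reducer, so the map output is sorted by key before it is written and the records with key 1 come first. Those keys are most likely genuine: some records in the first job's output presumably have a count of 1.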
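To see why the records with key 1 would appear first, here is a small plain-Java sketch of what the second job does when one identity reducer is in play. It is not Hadoop code: a TreeMap stands in for the shuffle's sort by key (an assumption that matches Text ordering for plain ASCII digits), and the input line slip4068.sirius.com,1 is inferred from the output shown in the question rather than taken from the real data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for job 2: swap "host,count" to (count, host), then sort by key.
class SwapAndSortDemo {

    static List<String> run(List<String> firstJobOutput) {
        // TreeMap keeps the keys in String order, imitating the sort done before the reducer.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : firstJobOutput) {
            String[] arr = line.split(",");
            if (arr.length > 1) {
                // Mapper logic from the question: key = arr[1], value = arr[0].
                grouped.computeIfAbsent(arr[1], k -> new ArrayList<>()).add(arr[0]);
            }
        }
        // Identity reducer: write every (key, value) pair unchanged, using "," as separator.
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            for (String host : entry.getValue()) {
                out.add(entry.getKey() + "," + host);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "128.187.140.171,11",
                "129.109.6.54,27",
                "slip4068.sirius.com,1"); // assumed input record
        for (String line : run(input)) {
            System.out.println(line);
        }
    }
}
```

Because the keys are compared as text, "1" sorts before "11" and "27", so the beginning of the output file is filled with the count-1 records even though the mapper also emitted the larger counts.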

How can i have multiple mappers and reducers?

I have this code, in which I have set one mapper and one reducer. I want to add one more mapper and reducer for further processing.
The problem is that I have to use the output file of the first MapReduce job as the input to the next MapReduce job. Is that possible? If yes, how can I do it?
public int run(String[] args) throws Exception
{
JobConf conf = new JobConf(getConf(),DecisionTreec45.class);
conf.setJobName("c4.5");
// the keys are words (strings)
conf.setOutputKeyClass(Text.class);
// the values are counts (ints)
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(MyMapper.class);
conf.setReducerClass(MyReducer.class);
//set your input file path below
FileInputFormat.setInputPaths(conf, "/home/hduser/Id3_hds/playtennis.txt");
FileOutputFormat.setOutputPath(conf, new Path("/home/hduser/Id3_hds/1/output"+current_index));
JobClient.runJob(conf);
return 0;
}

Yes, it is possible. You can check the following tutorial to see how chaining works: http://gandhigeet.blogspot.com/2012/12/as-discussed-in-previous-post-hadoop.html
Make sure you delete the intermediate output data that each MR phase creates in HDFS, for example with fs.delete(intermediateOutputPath, true);

Look at how this works.
You need to have two jobs. Job2 is dependent on job1.
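import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class ChainJobs extends Configured implements Tool {
private static final String OUTPUT_PATH = "intermediate_output";
@Override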
public int run(String[] args) throws Exception {
/*
* Job 1
*/
Configuration conf = getConf();
FileSystem fs = FileSystem.get(conf);
Job job = new Job(conf, "Job1");
job.setJarByClass(ChainJobs.class);
job.setMapperClass(MyMapper1.class);
job.setReducerClass(MyReducer1.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
TextInputFormat.addInputPath(job, new Path(args[0]));
TextOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));
// Wait for job 1 to finish; job 2 depends on its output, so stop here if job 1 failed.
if (!job.waitForCompletion(true)) {
return 1;
}
/*
* Job 2
*/
Configuration conf2 = getConf();
Job job2 = new Job(conf2, "Job 2");
job2.setJarByClass(ChainJobs.class);
job2.setMapperClass(MyMapper2.class);
job2.setReducerClass(MyReducer2.class);
job2.setOutputKeyClass(Text.class);
job2.setOutputValueClass(Text.class);
job2.setInputFormatClass(TextInputFormat.class);
job2.setOutputFormatClass(TextOutputFormat.class);
TextInputFormat.addInputPath(job2, new Path(OUTPUT_PATH));
TextOutputFormat.setOutputPath(job2, new Path(args[1]));
return job2.waitForCompletion(true) ? 0 : 1;
}
/**
* Method Name: main Return type: none Purpose:Read the arguments from
* command line and run the Job till completion
*
*/
public static void main(String[] args) throws Exception {
// TODO Auto-generated method stub
if (args.length != 2) {
System.err.println("Enter valid number of arguments <Inputdirectory> <Outputlocation>");
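System.exit(2); // a non-zero status signals the usage error
}
// Propagate the job's result as the process exit status.
System.exit(ToolRunner.run(new Configuration(), new ChainJobs(), args));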
}
}
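The control flow of the chained driver can also be tried without a cluster. What follows is a plain-Java sketch, not Hadoop code: stage1 and stage2 are hypothetical stand-ins for the two jobs (a word count, then a filter on that count), a temporary file plays the role of the intermediate_output directory, and the cleanup step mirrors the advice to delete intermediate data.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical two-stage pipeline illustrating "job 2 depends on job 1".
class ChainSketch {

    // Stage 1: count words and write "word<TAB>count" lines to the intermediate file.
    static boolean stage1(List<String> input, Path intermediate) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : input) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            lines.add(entry.getKey() + "\t" + entry.getValue());
        }
        try {
            Files.write(intermediate, lines, StandardCharsets.UTF_8);
            return true;
        } catch (IOException e) {
            return false; // reported like a failed job
        }
    }

    // Stage 2: read the intermediate file and keep the words seen more than once.
    static List<String> stage2(Path intermediate) throws IOException {
        List<String> repeated = new ArrayList<>();
        for (String line : Files.readAllLines(intermediate, StandardCharsets.UTF_8)) {
            String[] tokens = line.split("\t");
            if (tokens.length == 2 && Integer.parseInt(tokens[1]) > 1) {
                repeated.add(tokens[0]);
            }
        }
        return repeated;
    }

    // Runs stage 2 only if stage 1 succeeded, then removes the intermediate data.
    static List<String> runChain(List<String> input) {
        try {
            Path intermediate = Files.createTempFile("intermediate_output", ".txt");
            try {
                if (!stage1(input, intermediate)) {
                    throw new IllegalStateException("stage 1 failed, skipping stage 2");
                }
                return stage2(intermediate);
            } finally {
                Files.deleteIfExists(intermediate);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runChain(Arrays.asList("a b a", "b c")));
    }
}
```

The design point carried over from the Hadoop driver is the ordering: the second stage reads exactly what the first stage wrote, it is skipped when the first stage fails, and the intermediate data is removed once it is no longer needed.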

Hadoop mapreduce.job.reduces in Generic Option Syntax?

I am trying to set the number of reducers via the command line. It seems like I am using the wrong syntax. I am using Hadoop 2.5 (YARN) MR2.
hadoop jar mrjobs-0.1.jar com.example.Weather -D mapreduce.job.reduces=2 datasets/inputs output
This command does not work when I add the -D option; without it, it works fine.
Any help appreciated!
Thanks!

The syntax looks correct. I tested the following against 2.5 YARN MR2 and it works:
hadoop jar hadoop-mapreduce-examples.jar wordcount -Dmapreduce.job.reduces=5 input output
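Most probably the problem is that your driver class does not implement the Tool interface and is not launched through ToolRunner, which works together with GenericOptionsParser to parse the generic command-line arguments.
Here is an example of how to use ToolRunner in your MapReduce driver class:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;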
public class ExampleDriver extends Configured implements Tool {
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: ExampleDriver <in> <out>");
System.exit(2);
}
Configuration conf = getConf();
Job job = Job.getInstance(conf);
job.setJobName("example driver");
job.setJarByClass(ExampleDriver.class);
job.setMapperClass(YourMapper.class);
job.setReducerClass(YourReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
int ret = job.waitForCompletion(true) ? 0 : 1;
return ret;
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new ExampleDriver(), args);
System.exit(res);
}
}
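To make the division of labour concrete, here is a deliberately simplified plain-Java sketch of the kind of work a generic-options parser does before run() is called: it pulls the -D key=value pairs out of the argument list and hands the rest to the driver. This is not Hadoop's GenericOptionsParser; the class name GenericOptionSketch and its fields are invented for illustration, and the sketch simply accepts both spellings that appear in this thread (with and without a space after -D).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified illustration; Hadoop's real parser handles many more options.
class GenericOptionSketch {
    final Map<String, String> properties = new LinkedHashMap<>();
    final List<String> remainingArgs = new ArrayList<>();

    static GenericOptionSketch parse(String[] args) {
        GenericOptionSketch result = new GenericOptionSketch();
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            String pair = null;
            if (arg.equals("-D") && i + 1 < args.length) {
                pair = args[++i];            // form: -D key=value
            } else if (arg.startsWith("-D") && arg.length() > 2) {
                pair = arg.substring(2);     // form: -Dkey=value
            }
            if (pair == null) {
                // Not a property definition: leave it for the driver's run(args).
                result.remainingArgs.add(arg);
                continue;
            }
            int eq = pair.indexOf('=');
            if (eq > 0) {
                result.properties.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
            // Pairs without '=' are silently dropped in this sketch.
        }
        return result;
    }

    public static void main(String[] args) {
        GenericOptionSketch parsed = parse(new String[] {
                "-D", "mapreduce.job.reduces=2", "datasets/inputs", "output"});
        System.out.println(parsed.properties);
        System.out.println(parsed.remainingArgs);
    }
}
```

A driver that builds its own Configuration and reads args directly never benefits from this step, which is why the -D value appears to be ignored unless the job is created from getConf().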

is Generic option -D not supporting in hadoop 0.20.2?

I am trying to set a configuration property from the console using the -D generic option.
Here is my console input:
$ hadoop jar hadoop-0.20.2/gtee.jar dd.MaxTemperature -D file.pattern=2007.* /inputdata /outputdata
But when I cross-checked it from the code with
Configuration conf;
System.out.println(conf.get("file.pattern"));
it prints null. What could be the problem here? Why is the value of the property "file.pattern" not displayed? Can anyone please help me?
Thanks
EDITED SECTION:
Driver Code:
public int run(String[] args) throws Exception {
Job job = new Job();
job.setJarByClass(MaxTemperature.class);
job.setJobName("MaxTemperature");
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(args[0]), conf);
// System.out.println(conf.get("file.pattern"));
if (fs.exists(new Path(args[1]))) {
fs.delete(new Path(args[1]), true);
}
System.out.println(conf.get("file.pattern"));
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileInputFormat.setInputPathFilter(job, RegexExcludePathFilter.class);
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setOutputFormatClass(TextOutputFormat.class);
job.setMapperClass(MapMapper.class);
job.setCombinerClass(Mapreducers.class);
job.setReducerClass(Mapreducers.class);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int xx = ToolRunner.run(new Configuration(),new MaxTemperature(), args);
System.exit(xx);
}
path filter implementation:
public static class RegexExcludePathFilter extends Configured implements
PathFilter {
//String pattern = "2007.[0-1]?[0-2].[0-9][0-9].txt" ;
Configuration conf;
Pattern pattern;
@Override
public boolean accept(Path path) {
Matcher m = pattern.matcher(path.toString());
return m.matches();
}
@Override
public void setConf(Configuration conf) {
this.conf = conf;
pattern = Pattern.compile(conf.get("file.pattern"));
System.out.println(pattern);
}
}

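To confirm: the -D option is supported in version 0.20.2. However, your driver must implement the Tool interface and read the configuration through getConf() for the command-line values to be visible.
Configuration conf = new Configuration(); //this is the issue
// When implementing Tool, use this instead
Configuration conf = this.getConf();
// Pass the same configuration to the job so that the path filter can read file.pattern
Job job = new Job(conf);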

You are passing it with a space in between. If that form is not picked up in your setup, try it without the space:
-Dfile.pattern=2007.*
