Setting number of Reduce tasks using command line - hadoop

I am a beginner in Hadoop. When trying to set the number of reducers using command line using Generic Options Parser, the number of reducers is not changing. There is no property set in the configuration file "mapred-site.xml" for the number of reducers and I think, that would make the number of reducers=1 by default. I am using cloudera QuickVM and hadoop version : "Hadoop 2.5.0-cdh5.2.0".
Pointers Appreciated. Also my issue was I wanted to know the preference order of the ways to set the number of reducers.
Using configuration File "mapred-site.xml"
mapred.reduce.tasks
By specifying in the driver class
job.setNumReduceTasks(4)
By specifying at the command line using Tool interface:
-Dmapreduce.job.reduces=2
Mapper :
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
#Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
{
String line = value.toString();
//Split the line into words
for(String word: line.split("\\W+"))
{
//Make sure that the word is legitimate
if(word.length() > 0)
{
//Emit the word as you see it
context.write(new Text(word), new IntWritable(1));
}
}
}
}
Reducer :
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
#Override
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
{
//Initializing the word count to 0 for every key
int count=0;
for(IntWritable value: values)
{
//Adding the word count counter to count
count += value.get();
}
//Finally write the word and its count
context.write(key, new IntWritable(count));
}
}
Driver :
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class WordCount extends Configured implements Tool
{
public int run(String[] args) throws Exception
{
//Instantiate the job object for configuring your job
Job job = new Job();
//Specify the class that hadoop needs to look in the JAR file
//This Jar file is then sent to all the machines in the cluster
job.setJarByClass(WordCount.class);
//Set a meaningful name to the job
job.setJobName("Word Count");
//Add the apth from where the file input is to be taken
FileInputFormat.addInputPath(job, new Path(args[0]));
//Set the path where the output must be stored
FileOutputFormat.setOutputPath(job, new Path(args[1]));
//Set the Mapper and the Reducer class
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);
//Set the type of the key and value of Mapper and reducer
/*
* If the Mapper output type and Reducer output type are not the same then
* also include setMapOutputKeyClass() and setMapOutputKeyValue()
*/
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
//job.setNumReduceTasks(4);
//Start the job and wait for it to finish. And exit the program based on
//the success of the program
System.exit(job.waitForCompletion(true)?0:1);
return 0;
}
public static void main(String[] args) throws Exception
{
// Let ToolRunner handle generic command-line options
int res = ToolRunner.run(new Configuration(), new WordCount(), args);
System.exit(res);
}
}
And I have tried the following commands to run the job :
hadoop jar /home/cloudera/Misc/wordCount.jar WordCount -Dmapreduce.job.reduces=2 hdfs:/Input/inputdata hdfs:/Output/wordcount_tool_D=2_take13
and
hadoop jar /home/cloudera/Misc/wordCount.jar WordCount -D mapreduce.job.reduces=2 hdfs:/Input/inputdata hdfs:/Output/wordcount_tool_D=2_take14

Answering your query on order. It would always be 2>3>1
The option specified in your driver class takes precedence over the ones you specify as an argument to your GenOptionsParser or the ones you specify in your site specific config.
I would recommend debugging the configurations inside your driver class by printing it out before you submit the job. This way , you can be sure what the configurations are , right before you submit the job to the cluster.
Configuration conf = getConf(); // This is available to you since you extended Configured
for(Entry entry: conf)
//Sysout the entries here

Related

Hadoop WordCount Tutorial java.lang.ClassNotFoundException

I'm relatively new to hadoop and I'm struggling a little bit to understand the ClassNotFoundException I get when trying to run the job. I'm using the standard tutorial found here and here is my WordCount class (running on ubuntu 16.04 hadoop 2.7.3 distributed cluster mode):
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
To try and remain organized, I added a couple paths to my ~/.bashrc file:
hduser#mynode:~$ cd $HADOOP_CODE
hduser#mynode:/usr/local/hadoop/code$
This is one directory down from the $HADOOP_HOME directory. To compile the WordCount.JAVA file, I ran:
hduser#mynode:/usr/local/hadoop$ hadoop com.sun.tools.javac.Main $HADOOP_CODE/WordCount.java
hduser#mynode:/usr/local/hadoop$ jar cf wc.jar $HADOOP_CODE/WordCount*.class
I then tried:
hduser#mynode:/usr/local/hadoop$ hadoop jar $HADOOP_CODE/wc.jar $HADOOP_CODE/WordCount /home/hduser/input /home/hduser/output/wordcount
which bombed with the following error:
Exception in thread "main" java.lang.ClassNotFoundException: /usr/local/hadoop/code/WordCount
EDIT
This gave me the same error:
hduser#mynode:/usr/local/hadoop/code$ hadoop jar $HADOOP_CODE/wc.jar WordCount /home/hduser/input /home/hduser/output/wordcount
To get it to run without error, I moved the WordCount.Java file up one directory to the default hadoop ($HADOOP_HOME) folder. I also know from here and here that the solution is to add a package to the file.
What I'm trying to understand is why that is the solution. With no package name, where is hadoop looking for the specified package, and why can't I pass it a full path to get it to run correctly? This may be a basic java question (apologies - I'm from the python world), but what is the package name doing during the compile process that makes it so I could run without a path name, but leaving off the package name means it has to be in that default directory? I'd prefer not to have to add a package name to every job I run. An explanation would be greatly appreciated!

Meaning of context.getconfiguration in hadoop

I have this doubt in code of searching by argument.
what is the meaning of context.getConfiguration().get("Uid2Search");
package SearchTxnByArg;
// This is the Mapper Program for SearchTxnByArg
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MyMap extends Mapper<LongWritable, Text, NullWritable, Text>{
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String Txn = value.toString();
String TxnParts[] = Txn.split(",");
String Uid = TxnParts[2];
String Uid2Search = context.getConfiguration().get("Uid2Search");
if(Uid.equals(Uid2Search))
{
context.write(null, value);
}
}
}
Driver Program
package SearchTxnByArg;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MyDriver {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
conf.set("Uid2Search", args[0]);
Job job = new Job(conf, "Map Reduce Search Txn by Arg");
job.setJarByClass(MyDriver.class);
job.setMapperClass(MyMap.class);
job.setMapOutputKeyClass(NullWritable.class);
job.setMapOutputValueClass(Text.class);
job.setNumReduceTasks(0);
FileInputFormat.addInputPath(job, new Path(args[1]));
FileOutputFormat.setOutputPath(job, new Path(args[2]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
I don't know how you have written your driver program. But in my experience,
If you are trying to get system property either by using -D option from the command line or by System.setproperty method by default these values will be set to context configuration.
As per documentation,
Configurations are specified by resources. A resource contains a set
of name/value pairs as XML data. Each resource is named by either a
String or by a Path. If named by a String, then the classpath is
examined for a file with that name. If named by a Path, then the local
filesystem is examined directly, without referring to the classpath.
Unless explicitly turned off, Hadoop by default specifies two
resources, loaded in-order from the classpath: core-default.xml :
Read-only defaults for hadoop. core-site.xml: Site-specific
configuration for a given hadoop installation. Applications may add
additional resources, which are loaded subsequent to these resources
in the order they are added.
Please see this answer as well
Context object: allows the Mapper/Reducer to interact with the rest of the Hadoop system. It includes configuration data for the job as well as interfaces which allow it to emit output.
Applications can use the Context:
to report progress
to set application-level status messages
update Counters
indicate they are alive
to get the values that are stored in job configuration across map/reduce phase.

Hadoop MapReduce Program for removing duplicate records

Could anyone help me to write the mapper and reducer for merging these two files and then removing the duplicate records?
These are the two text files:
file1.txt
2012-3-1a
2012-3-2b
2012-3-3c
2012-3-4d
2012-3-5a
2012-3-6b
2012-3-7c
2012-3-3c
and file2.txt:
2012-3-1b
2012-3-2a
2012-3-3b
2012-3-4d
2012-3-5a
2012-3-6c
2012-3-7d
2012-3-3c
A simple word count program will do the job for you. The only change you need to make is, set the output value of the Reducer as NullWritable.get()
Is there a common key in both the files which helps identify if record matched or not? If so then:
Mappers Input: Standard TextInputFormat
Mapper's Output Key : Common Key and Mapper's output Value : Entire Record.
At reducer : It will not be required to iterate over the Keys just take only 1 instance of the Value for Write.
If the match or duplicacy can be concluded only if complete record matched: then
Mappers Input: Standard TextInputFormat
Mapper's Output Key : Entire Record and Mapper's output Value : NullWritable.
At reducer: It will not be required to iterate over the Keys. Just take only one instance of Key and write that as a Value.
Reducer Output Key: Reducer Input Key, Reducer Output Value : NullWritable
Here's code to remove duplicate lines in large text data, which uses hash for efficiency:
DRMapper.java
import com.google.common.hash.Hashing;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
class DRMapper extends Mapper<LongWritable, Text, Text, Text> {
private Text hashKey = new Text();
private Text mappedValue = new Text();
#Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
hashKey.set(Hashing.murmur3_32().hashString(line, StandardCharsets.UTF_8).toString());
mappedValue.set(line);
context.write(hashKey, mappedValue);
}
}
DRReducer.java
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class DRReducer extends Reducer<Text, Text, Text, NullWritable> {
#Override
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
Text value;
if (values.iterator().hasNext()) {
value = values.iterator().next();
if (!(value.toString().isEmpty())) {
context.write(value, NullWritable.get());
}
}
}
}
DuplicateRemover.java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class DuplicateRemover {
private static final int DEFAULT_NUM_REDUCERS = 210;
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: DuplicateRemover <input path> <output path>");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(DuplicateRemover.class);
job.setJobName("Duplicate Remover");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(DRMapper.class);
job.setReducerClass(DRReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setNumReduceTasks(DEFAULT_NUM_REDUCERS);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
compile with:
javac -encoding UTF8 -cp $(hadoop classpath) *.java
jar cf dr.jar *.class
Assuming that the input text files are in in_folder, run as:
hadoop jar dr.jar in_folder out_folder

yarn stderr no logger appender and no stdout

I'm running a simple mapreduce program wordcount agian Apache Hadoop 2.6.0. The hadoop is running distributedly (several nodes). However, I'm not able to see any stderr and stdout from yarn job history. (but I can see the syslog)
The wordcount program is really simple, just for demo purpose.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
public static final Log LOG = LogFactory.getLog(WordCount.class);
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
LOG.info("LOG - map function invoked");
System.out.println("stdout - map function invoded");
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
conf.set("mapreduce.job.jar","/space/tmp/jar/wordCount.jar");
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path("hdfs://localhost:9000/user/jsun/input"));
FileOutputFormat.setOutputPath(job, new Path("hdfs://localhost:9000/user/jsun/output"));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Note in the map function of Mapper class, I added two statements:
LOG.info("LOG - map function invoked");
System.out.println("stdout - map function invoded");
These two statements are to test whether I can see logging from hadoop server. I can successfully run the program. But if I go to localhost:8088 to see the application history and then "logs", I see nothing in "stdout", and in "stderr":
log4j:WARN No appenders could be found for logger (org.apache.hadoop.ipc.Server).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
I think there is some configuration needed to get those output, but not sure which piece of information is missing. I searched online as well as in stackoverflow. Some people mentioned container-log4j.properties but they are not specific about how to configure that file and where to put.
One thing to note is I also tried the job with Hortonworks Data Platform 2.2 and Cloudera 5.4. The result is the same. I remember when I dealt with some previous version of hadoop (hadoop 1.x), I can easily see the loggings from same place. So I guess this is something new in hadoop 2.x
=======
As a comparison, if I make the apache hadoop run in local mode (meaning LocalJobRunner), I can see some loggings in console like this:
[2015-09-08 15:57:25,992]org.apache.hadoop.mapred.MapTask$MapOutputBuffer.init(MapTask.java:998) INFO:kvstart = 26214396; length = 6553600
[2015-09-08 15:57:25,996]org.apache.hadoop.mapred.MapTask.createSortingCollector(MapTask.java:402) INFO:Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
[2015-09-08 15:57:26,064]WordCount$TokenizerMapper.map(WordCount.java:28) INFO:LOG - map function invoked
stdout - map function invoded
[2015-09-08 15:57:26,075]org.apache.hadoop.mapred.LocalJobRunner$Job.statusUpdate(LocalJobRunner.java:591) INFO:
[2015-09-08 15:57:26,077]org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1457) INFO:Starting flush of map output
[2015-09-08 15:57:26,077]org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1475) INFO:Spilling map output
These kind of loggings ("map function is invoked") is what I expected in hadoop server logging.
All the sysout written in Map-Reduce program can not be seen on console. It is because map-reduce run in multiple parallel copies across the cluster, so there is no concept of a single console with output.
However, The System.out.println() for map and reduce phases can be seen in the job logs. Easy way to access the logs is
open the jobtracker web console - http://localhost:50030/jobtracker.jsp
click on the completed job
click on map or reduce task
click on tasknumber
Go to task logs
Check stdout logs.
Please note that if you are not able to locate URL, just look into the console log for jobtracker URL.

Not getting correct output when running standard "WordCount" program using Hadoop0.20.2

I'm new to Hadoop.I have been trying to run the famous "WordCount" program -- which counts the total number of words
in a list of files using Hadoop-0.20.2.
I'm using single node cluster.
Following is my program:
import java.io.File;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordCount {
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, one);
}
}
}
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
while (values.hasNext()) {
++sum ;
}
context.write(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setJarByClass(WordCount.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setMapperClass(Map.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setNumReduceTasks(5);
job.waitForCompletion(true);
}
}
Suppose input file is A.txt which has following contents
A B C D A B C D
When I run this program using hadoop-0.20.2 (not showing commands for sake of clarity) ,the output that comes is
A 1
A 1
B 1
B !
C 1
C 1
D !
D 1
which is wrong.The actual output should be :
A 2
B 2
C 2
D 2
This "WordCount" program is pretty standard program. I'm not sure what is wrong with this code.
I have written the contents of all configuration files like mapred-site.xml , core-site.xml etc correctly.
How can I fix this problem?
This code actually runs a local mapreduce job. If you want to submit this to the real cluster, you have to provide the fs.default.name and the mapred.job.tracker configuration parameter. These keys are mapped to your machine with a host:port pair. Just like in your mapred/core-site.xml.
Make sure your data is available in HDFS and not on local disk, as well as your number of reducers should be reduced. That's about 2 records per reducer. You should set this to 1.
reduce signature is incorrect.
Second parameter is Iterable type and not Iterator
http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/mapreduce/Reducer.html
See also Using Hadoop for the First Time, MapReduce Job does not run Reduce Phase

Resources