MapReduce job: Mapper not invoked while the reducer gets invoked - hadoop

I have four classes: MapperOne, ReducerOne, MapperTwo, and ReducerTwo. I want to chain them: MapperOne --> ReducerOne --> output file, which is the input to MapperTwo --> ReducerTwo --> final output file.
MY DRIVER CLASS CODE:
public class StockDriver {
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
System.out.println(" Driver invoked------");
Configuration config = new Configuration();
config.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", " ");
config.set("mapred.textoutputformat.separator", " --> ");
String inputPath="In\\NYSE_daily_prices_Q_less.csv";
String outpath = "C:\\Users\\Outputs\\run1";
String outpath2 = "C:\\Users\\Outputs\\run2";
Job job1 = new Job(config,"Stock Analysis: Creating key values");
job1.setInputFormatClass(TextInputFormat.class);
job1.setOutputFormatClass(TextOutputFormat.class);
job1.setMapOutputKeyClass(Text.class);
job1.setMapOutputValueClass(StockDetailsTuple.class);
job1.setOutputKeyClass(Text.class);
job1.setOutputValueClass(Text.class);
job1.setMapperClass(StockMapperOne.class);
job1.setReducerClass(StockReducerOne.class);
FileInputFormat.setInputPaths(job1, new Path(inputPath));
FileOutputFormat.setOutputPath(job1, new Path(outpath));
//THE SECOND MAP_REDUCE TO DO CALCULATIONS
Job job2 = new Job(config,"Stock Analysis: Calculating Covariance");
job2.setInputFormatClass(TextInputFormat.class);
job2.setOutputFormatClass(TextOutputFormat.class);
job2.setMapOutputKeyClass(LongWritable.class);
job2.setMapOutputValueClass(Text.class);
job2.setOutputKeyClass(Text.class);
job2.setOutputValueClass(Text.class);
job2.setMapperClass(StockMapperTwo.class);
job2.setReducerClass(StockReducerTwo.class);
String outpath3=outpath+"\\part-r-00000";
System.out.println("OUT PATH 3: " +outpath3 );
FileInputFormat.setInputPaths(job2, new Path(outpath3));
FileOutputFormat.setOutputPath(job2, new Path(outpath2));
if(job1.waitForCompletion(true)){
System.out.println(job2.waitForCompletion(true));
}
}
}
My MapperOne and ReducerOne are executed properly and the output file is stored in the proper path. When the second job runs, ONLY the reducer is invoked. Below are my MapperTwo and ReducerTwo codes.
MAPPER TWO
public class StockMapperTwo extends Mapper<Text, Text, LongWritable, Text> {
public void map(LongWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException{
System.out.println("------ MAPPER 2 CALLED-----");
for(Text val: values){
System.out.println("KEY: "+ key.toString() + " VALUE: "+ val.toString());
//context.write(new Text("mapper2"), new Text("hi"));
context.write(new LongWritable(2), new Text("hi"));
}
}
}
REDUCER TWO
public class StockReducerTwo extends Reducer<LongWritable, Text, Text, Text>{
public void reduce(LongWritable key, Iterable<Text>values, Context context) throws IOException, InterruptedException{
System.out.println(" REDUCER 2 INVOKED");
context.write(new Text("hello"), new Text("hi"));
}
}
My doubts about this configuration are:
Why is the mapper skipped even though it is set via job2.setMapperClass(StockMapperTwo.class)?
If I don't set job2.setMapOutputKeyClass(LongWritable.class) and job2.setMapOutputValueClass(Text.class), then even the reducer is not invoked, and this error appears:
java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:870)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:573)
    at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
    at org.apache.hadoop.mapreduce.Mapper.map(Mapper.java:124)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
How is this happening? I am not able to get my mapper and reducer invoked properly.

Sorry for posting this question. I didn't notice that my mapper was written incorrectly.
Instead of this:
public void map(LongWritable key,Text values, Context context) throws IOException, InterruptedException{
I had written it like this:
public void map(LongWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException{
And it took me a really long time to spot the mistake. I'm not sure why there was no clear error pointing out the problem. Anyway, it's resolved now.
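For reference, a minimal sketch of the corrected mapper (an assumption on my part, not necessarily the author's exact final code): with TextInputFormat the input key is the byte offset (LongWritable) and the value is the line (Text), so the class generics and the map() parameters have to agree, and an @Override annotation turns a signature mismatch into a compile error instead of a silently skipped mapper.
// imports assumed: java.io.IOException, org.apache.hadoop.io.*, org.apache.hadoop.mapreduce.Mapper
public class StockMapperTwo extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        System.out.println("------ MAPPER 2 CALLED ----- KEY: " + key + " VALUE: " + value);
        // same placeholder output as in the original snippet
        context.write(new LongWritable(2), new Text("hi"));
    }
}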

Related

Getting the partition id of input file in Hadoop

I need to know the row index of the partitions of the input file that I'm using. I could force this in the original file by concatenating the row index to the data but I'd rather have a way of doing this in Hadoop. I have this in my mapper...
String id = context.getConfiguration().get("mapreduce.task.partition");
But "id" is 0 in every case. In the "Hadoop: The Definitive Guide" it mentions accessing properties like the partition id "can be accessed from the context object passed to all methods of the Mapper or Reducer". It does not, from what I can tell, actually go into how to access this information.
I went through the documentation for the Context object and it seems like the above is the way to do it and the script does compile. But since I'm getting 0 for every value, I'm not sure if I'm actually using the right thing and I'm unable to find any detail online that could help in figuring this out.
Code used to test...
public class Test {
public static class TestMapper extends Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String id = context.getConfiguration().get("mapreduce.task.partition");
context.write(new Text("Test"), new Text(id + "_" + value.toString()));
}
}
public static class TestReducer extends Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
for(Text value : values) {
context.write(key, value);
}
}
}
public static void main(String[] args) throws Exception {
if(args.length != 2) {
System.err.println("Usage: Test <input path> <output path>");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(Test.class);
job.setJobName("Test");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(TestMapper.class);
job.setReducerClass(TestReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Two options are:
Use the offset instead of the row number
Track the line number in the mapper
For the first one, the key, which is a LongWritable, tells you the offset of the line being processed. Unless your lines are exactly the same length, you won't be able to calculate the line number from an offset, but it does allow you to determine ordering, if that's useful.
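A minimal sketch of this first option (illustrative only, mirroring the TestMapper above): emit the byte-offset key itself as the identifier.
// imports assumed: java.io.IOException, org.apache.hadoop.io.*, org.apache.hadoop.mapreduce.Mapper
public static class OffsetMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key.get() is the byte offset of this line within the input file/split;
        // it is not a row number, but it is unique and increases with position
        context.write(new Text("Test"), new Text(key.get() + "_" + value.toString()));
    }
}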
The second option is to just track it in the mapper. You could change your code to something like:
public static class TestMapper extends Mapper<LongWritable, Text, Text, Text> {
private long currentLineNum = 0;
private Text test = new Text("Test");
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
context.write(test, new Text(currentLineNum + "_" + value));
currentLineNum++;
}
}
You could also represent your matrix as lines of tuples and include the row and column on every tuple, so that when you're reading in the file you have that information. If you use a file that is just space- or comma-separated values making up a 2D array, it will be extremely hard to figure out which line (row) you are currently working on in the mapper.
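For example (a purely hypothetical layout), each line could carry its own coordinates, so the mapper reads the row and column straight out of the record instead of inferring them:
0,0,1.5
0,1,2.0
1,0,3.25
1,1,4.0
The mapper would then split each value on the comma and take the first two fields as the row and column indices.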

Hadoop MapReduce example for string transformation

I have a large number of strings in a text file and need to transform them with the following algorithm: convert the string to lowercase and remove all spaces.
Can you give me an example of a Hadoop MapReduce job which implements that algorithm?
Thank you.
I tried the code below and I am getting the output in a single line.
public class toUpper {
public static class textMapper extends Mapper<LongWritable,Text,NullWritable,Text>
{
Text outvalue=new Text();
public void map(LongWritable key,Text values,Context context) throws IOException, InterruptedException
{
String token;
StringBuffer br=new StringBuffer();
StringTokenizer st=new StringTokenizer(values.toString());
while(st.hasMoreTokens())
{
token=st.nextToken();
br.append(token.toUpperCase());
}
st=null;
outvalue.set(br.toString());
context.write(NullWritable.get(), outvalue);
br=null;
}
}
public static class textReduce extends Reducer<NullWritable,Text,NullWritable,Text>
{
Text outvale=new Text();
public void reduce(NullWritable key,Iterable<Text> values,Context context) throws IOException, InterruptedException
{
StringBuffer br=new StringBuffer();
for(Text st:values)
{
br.append(st.toString());
}
outvale.set(br.toString());
context.write(NullWritable.get(), outvale);
}
}
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Configuration conf=new Configuration();
@SuppressWarnings("deprecation")
Job job=new Job(conf,"touipprr");
job.setJarByClass(toUpper.class);
job.setMapperClass(textMapper.class);
job.setReducerClass(textReduce.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true)?0:1);
}
}
In the days when I was playing around with MapReduce, I had a similar thought: there must be some practice or technique through which we can modify every word in a record and do all the cleaning stuff.
When we recap the entire MapReduce algorithm, we have a map function which splits the incoming records into tokens with the help of delimiters (perhaps you know them better). Now, let us try to approach your problem statement in a descriptive manner.
Following are the things that I would try to do when new to MapReduce:
> I will probably write a map() method which will split the lines for me
> I will possibly run out of options and write a reduce function and somehow be able to achieve my objective
The above practice is completely okay, but there is a better technique that can help you decide whether or not you are going to need a reduce function at all, giving you more room to think, to focus completely on achieving your objective, and to think about optimizing your code.
In situations like yours, a class came to my rescue: ChainMapper.
Now, how is ChainMapper going to work? Following are a few points to be considered:
-> The first mapper will read the file from HDFS, split each line as per the delimiter and store the tokens in the context.
-> The second mapper will get the output from the first mapper, and here you can do all sorts of string-related operations as your business requires, such as encrypting the text or changing it to upper or lower case.
-> The operated string, which is the result of the second mapper, is stored into the context again.
-> Now, if you need a reducer to do an aggregation task such as wordcount, go for it.
I have a piece of code which may not be efficient (or some may feel it's horrible), but it serves your purpose since you are playing around with MapReduce.
SplitMapper.java
public class SplitMapper extends Mapper<LongWritable,Text,Text,IntWritable>{
@Override
public void map(LongWritable key,Text value,Context context)
throws IOException,InterruptedException{
StringTokenizer xs=new StringTokenizer(value.toString());
IntWritable dummyValue=new IntWritable(1);
while(xs.hasMoreElements()){
String content=(String)xs.nextElement();
context.write(new Text(content),dummyValue);
}
}
}
LowerCaseMapper.java
public class LowerCaseMapper extends Mapper<Text,IntWritable,Text,IntWritable>{
@Override
public void map(Text key,IntWritable value,Context context)
throws IOException,InterruptedException{
String val=key.toString().toLowerCase();
Text newKey=new Text(val);
context.write(newKey,value);
}
}
Since I am performing a wordcount here so I require a reducer
ChainMapReducer.java
public class ChainMapReducer extends Reducer<Text,IntWritable,Text,IntWritable>{
@Override
public void reduce(Text key,Iterable<IntWritable> values,Context context)
throws IOException,InterruptedException{
int sum=0;
for(IntWritable v:values){
sum+=v.get();
}
context.write(key,new IntWritable(sum));
}
}
To be able to implement the concept of chainmapper successfully, you must pay attention to every details of the driver class
DriverClass.java
public class DriverClass extends Configured implements Tool{
public int run(String args[]) throws IOException,InterruptedException,ClassNotFoundException{
Configuration cf=getConf();
Job j=Job.getInstance(cf);
j.setJarByClass(DriverClass.class);
//configuration for the first mapper
Configuration splitMapConfig=new Configuration(false);
ChainMapper.addMapper(j,SplitMapper.class,LongWritable.class,Text.class,Text.class,IntWritable.class,splitMapConfig);
//configuration for the second mapper
Configuration lowerCaseConfig=new Configuration(false);
ChainMapper.addMapper(j,LowerCaseMapper.class,Text.class,IntWritable.class,Text.class,IntWritable.class,lowerCaseConfig);
j.setReducerClass(ChainMapReducer.class);
j.setCombinerClass(ChainMapReducer.class);
j.setOutputKeyClass(Text.class);
j.setOutputValueClass(IntWritable.class);
Path outputPath=new Path(args[1]);
FileInputFormat.addInputPath(j,new Path(args[0]));
FileOutputFormat.setOutputPath(j,outputPath);
//remove any output left over from a previous run
outputPath.getFileSystem(cf).delete(outputPath,true);
return j.waitForCompletion(true)?0:1;
}
public static void main(String args[]) throws Exception{
int res=ToolRunner.run(new Configuration(),new DriverClass(),args);
System.exit(res);
}
}
The driver class is pretty much self-explanatory; one only needs to observe the signature of ChainMapper.addMapper(<job-object>, <Mapper-ClassName>, <input/output key and value types>, <configuration-for-the-concerned-mapper>).
I hope that the solution serves your purpose; please let me know about any issues that might arise when you try to implement it.
Thank you!
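As a side note, if all you need is the transformation from the original question (lowercase and strip spaces) with no aggregation at all, a map-only job is enough. A minimal sketch under that assumption (class name and argument handling are illustrative):
// imports assumed: java.io.IOException, org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.Path,
// org.apache.hadoop.io.*, org.apache.hadoop.mapreduce.*, org.apache.hadoop.mapreduce.lib.input.FileInputFormat,
// org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
public class LowerCaseNoSpaces {
    public static class CleanMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        private final Text out = new Text();
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // lowercase the whole line and drop every whitespace character
            out.set(value.toString().toLowerCase().replaceAll("\\s+", ""));
            context.write(NullWritable.get(), out);
        }
    }
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "lowercase-no-spaces");
        job.setJarByClass(LowerCaseNoSpaces.class);
        job.setMapperClass(CleanMapper.class);
        job.setNumReduceTasks(0); // map-only: mapper output is written straight to the output files
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
With NullWritable keys, TextOutputFormat writes only the value, so each output line is just the cleaned string.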

If 2 Mappers output the same key, what will the input to the reducer be?

I have the following doubt while learning MapReduce. It would be of great help if someone could answer it.
I have two mappers working on the same file; I configured them using MultipleInputs.
Mapper 1 expected output [after extracting a few columns of the file]:
a - 1234
b - 3456
c - 1345
Mapper 2 expected output [after extracting a few columns of the same file]:
a - Monday
b - Tuesday
c - Wednesday
And there is a reducer function that just outputs the key and value pair that it gets as input.
So I expected the output to be as below, since I know that matching keys are shuffled together to make a list:
a - [1234,Monday]
b - [3456, Tuesday]
c - [1345, Wednesday]
But I am getting some weird output. I guess only one mapper is getting run.
Should this not be expected? Will the output of each mapper be shuffled separately? Will both mappers run in parallel?
Excuse me if it's a lame question; please understand that I am new to Hadoop and MapReduce.
Below is the code
//Mapper1
public class numbermapper extends Mapper<Object, Text, Text, Text>{
public void map(Object key,Text value, Context context) throws IOException, InterruptedException {
String record = value.toString();
String[] parts = record.split(",");
System.out.println("***Mapper number output "+parts[0]+" "+parts[1]);
context.write(new Text(parts[0]), new Text(parts[1]));
}
}
//Mapper2
public class weekmapper extends Mapper<Object, Text, Text, Text> {
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String record = value.toString();
String[] parts = record.split(",");
System.out.println("***Mapper week output "+parts[0]+" "+parts[2]);
context.write(new Text(parts[0]), new Text(parts[2]));
}
}
//Reducer
public class rjoinreducer extends Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Text values, Context context)
throws IOException, InterruptedException {
context.write(key, values);
}
}
//Driver class
public class driver {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "Reduce-side join");
job.setJarByClass(numbermapper.class);
job.setReducerClass(rjoinreducer.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
MultipleInputs.addInputPath(job, new Path(args[0]),TextInputFormat.class, numbermapper.class);
MultipleInputs.addInputPath(job, new Path(args[0]),TextInputFormat.class, weekmapper.class);
Path outputPath = new Path(args[1]);
FileOutputFormat.setOutputPath(job, outputPath);
outputPath.getFileSystem(conf).delete(outputPath);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
And this is the output I got:
a Monday
b Tuesday
c Wednesday
Dataset used
a,1234,Monday
b,3456,Tuesday
c,1345,Wednesday
MultipleInputs was just taking one file and running one mapper on it, because I had given the same path for both mappers.
When I copied the dataset to a different file and ran the same program on two different files (same content but different names), I got the expected output.
So I now understand that the output from different mapper classes is also combined based on key, not just the output from the same mapper class.
Thanks for trying to help!
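For reference, a minimal sketch of the corrected driver lines (the file names here are hypothetical; the point is only that each mapper class is registered against its own path):
// MultipleInputs keeps a per-path mapping of input format and mapper class,
// so registering the same path twice leaves only the last mapper bound to it.
MultipleInputs.addInputPath(job, new Path("in/numbers.csv"), TextInputFormat.class, numbermapper.class);
MultipleInputs.addInputPath(job, new Path("in/weeks.csv"), TextInputFormat.class, weekmapper.class);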

Hadoop MultipleOutputs does not write to multiple files when file formats are custom formats

I am trying to read from Cassandra and write the reducer's output to multiple output files using the MultipleOutputs API (Hadoop version 1.0.3). My file formats are custom output formats extending FileOutputFormat. I have configured my job in a similar manner as shown in the MultipleOutputs API documentation.
However, when I run the job, I only get one output file named part-r-00000, which is in text output format. If job.setOutputFormatClass() is not set, it considers TextOutputFormat to be the format by default. Also, it only allows one of the two format classes to be initialized. It completely ignores the output formats I specified in MultipleOutputs.addNamedOutput(job, "format1", MyCustomFileFormat1.class, Text.class, Text.class) and MultipleOutputs.addNamedOutput(job, "format2", MyCustomFileFormat2.class, Text.class, Text.class). Is anyone else facing a similar problem, or am I doing something wrong?
I also tried to write a very simple MR program which reads from a text file and writes the output in two formats, TextOutputFormat and SequenceFileOutputFormat, as shown in the MultipleOutputs API documentation. However, no luck there either; I get only one output file, in text output format.
Can someone help me with this ?
Job job = new Job(getConf(), "cfdefGen");
job.setJarByClass(CfdefGeneration.class);
//read input from cassandra column family
ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);
job.setInputFormatClass(ColumnFamilyInputFormat.class);
job.getConfiguration().set("cassandra.consistencylevel.read", "QUORUM");
//thrift input job configurations
ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
ConfigHelper.setInputInitialAddress(job.getConfiguration(), HOST);
ConfigHelper.setInputPartitioner(job.getConfiguration(), "RandomPartitioner");
SlicePredicate predicate = new SlicePredicate().setColumn_names(Arrays.asList(ByteBufferUtil.bytes("classification")));
//ConfigHelper.setRangeBatchSize(job.getConfiguration(), 2048);
ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);
//specification for mapper
job.setMapperClass(MyMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
//specifications for reducer (writing to files)
job.setReducerClass(ReducerToFileSystem.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
//job.setOutputFormatClass(MyCdbWriter1.class);
job.setNumReduceTasks(1);
//set output path for storing output files
Path filePath = new Path(OUTPUT_DIR);
FileSystem hdfs = FileSystem.get(getConf());
if(hdfs.exists(filePath)){
hdfs.delete(filePath, true);
}
MyCdbWriter1.setOutputPath(job, new Path(OUTPUT_DIR));
MultipleOutputs.addNamedOutput(job, "cdb1', MyCdbWriter1.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, "cdb2", MyCdbWriter2.class, Text.class, Text.class);
boolean success = job.waitForCompletion(true);
return success ? 0:1;
public static class ReducerToFileSystem extends Reducer<Text, Text, Text, Text>
{
private MultipleOutputs<Text, Text> mos;
public void setup(Context context){
mos = new MultipleOutputs<Text, Text>(context);
}
//public void reduce(Text key, Text value, Context context)
//throws IOException, InterruptedException (This was the mistake, changed the signature and it worked fine)
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException
{
for (Text value : values) {
//context.write(key, value);
mos.write("cdb1", key, value, OUTPUT_DIR+"/"+"cdb1");
mos.write("cdb2", key, value, OUTPUT_DIR+"/"+"cdb2");
}
context.progress();
}
public void cleanup(Context context) throws IOException, InterruptedException {
mos.close();
}
}
public class MyCdbWriter1<K, V> extends FileOutputFormat<K, V>
{
@Override
public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException
{
}
public static void setOutputPath(Job job, Path outputDir) {
job.getConfiguration().set("mapred.output.dir", outputDir.toString());
}
protected static class CdbDataRecord<K, V> extends RecordWriter<K, V>
{
@Override
write()
close()
}
}
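For comparison, a minimal custom FileOutputFormat that does compile (a generic sketch, not the author's MyCdbWriter1; class and variable names are illustrative): it returns a RecordWriter that writes key<TAB>value lines to the task's work file.
// imports assumed: java.io.IOException, org.apache.hadoop.fs.*, org.apache.hadoop.io.Text,
// org.apache.hadoop.mapreduce.*, org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
public class TabSeparatedOutputFormat extends FileOutputFormat<Text, Text> {
    @Override
    public RecordWriter<Text, Text> getRecordWriter(TaskAttemptContext job)
            throws IOException, InterruptedException {
        Path file = getDefaultWorkFile(job, ".txt"); // per-task-attempt output file
        FileSystem fs = file.getFileSystem(job.getConfiguration());
        final FSDataOutputStream out = fs.create(file, false);
        return new RecordWriter<Text, Text>() {
            @Override
            public void write(Text key, Text value) throws IOException {
                out.write((key.toString() + "\t" + value.toString() + "\n").getBytes("UTF-8"));
            }
            @Override
            public void close(TaskAttemptContext context) throws IOException {
                out.close();
            }
        };
    }
}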
I found my mistake after debugging: my reduce method was never called. My method definition did not match the API's definition, so I changed it from public void reduce(Text key, Text value, Context context) to public void reduce(Text key, Iterable<Text> values, Context context). I don't know why my reduce method did not have an @Override annotation; it would have prevented my mistake.
I also encountered a similar issue - mine turned out to be that I was filtering all my records in the Map process so nothing was being passed to Reduce. With un-named multiple outputs in the reduce task, this still resulted in a _SUCCESS file and an empty part-r-00000 file.

Hadoop - Join with MultipleInputs probably skips Reducer

So, I want to perform a reduce-side join with MR (no Hive or anything; I'm experimenting on vanilla Hadoop at the moment).
I have two input files; the first goes like this:
12 13
12 15
12 16
12 23
The second is simply 12 1000.
So I assign each file to a separate mapper which tags each key-value pair with 0 or 1 depending on its source file. And that works well. How can I tell?
I get the map output as expected:
| key | |value|
12 0 1000
12 1 13
12 1 15
12 1 16 etc
My Partitioner partitions based on the first part of the key (i.e. 12).
The Reducer should join by key. Yet the job seems to skip the reduce step.
I wonder if there is something wrong with my Driver?
My code (Hadoop v0.22, but same results with 0.20.2 with extra libs from the trunk):
Mappers
public static class JoinDegreeListMapper extends
Mapper<Text, Text, TextPair, Text> {
public void map(Text node, Text degree, Context context)
throws IOException, InterruptedException {
context.write(new TextPair(node.toString(), "0"), degree);
}
}
public static class JoinEdgeListMapper extends
Mapper<Text, Text, TextPair, Text> {
public void map(Text firstNode, Text secondNode, Context context)
throws IOException, InterruptedException {
context.write(new TextPair(firstNode.toString(), "1"), secondNode);
}
}
Reducer
public static class JoinOnFirstReducer extends
Reducer<TextPair, Text, Text, Text> {
public void reduce(TextPair key, Iterator<Text> values, Context context)
throws IOException, InterruptedException {
context.progress();
Text nodeDegree = new Text(values.next());
while (values.hasNext()) {
Text secondNode = values.next();
Text outValue = new Text(nodeDegree.toString() + "\t"
+ secondNode.toString());
context.write(key.getFirst(), outValue);
}
}
}
Partitioner
public static class JoinOnFirstPartitioner extends
Partitioner<TextPair, Text> {
@Override
public int getPartition(TextPair key, Text Value, int numOfPartitions) {
return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numOfPartitions;
}
}
Driver
public int run(String[] args) throws Exception {
Path edgeListPath = new Path(args[0]);
Path nodeListPath = new Path(args[1]);
Path outputPath = new Path(args[2]);
Configuration conf = getConf();
Job job = new Job(conf);
job.setJarByClass(JoinOnFirstNode.class);
job.setJobName("Tag first node with degree");
job.setPartitionerClass(JoinOnFirstPartitioner.class);
job.setGroupingComparatorClass(TextPair.FirstComparator.class);
//job.setSortComparatorClass(TextPair.FirstComparator.class);
job.setReducerClass(JoinOnFirstReducer.class);
job.setMapOutputKeyClass(TextPair.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
MultipleInputs.addInputPath(job, edgeListPath, EdgeInputFormat.class,
JoinEdgeListMapper.class);
MultipleInputs.addInputPath(job, nodeListPath, EdgeInputFormat.class,
JoinDegreeListMapper.class);
FileOutputFormat.setOutputPath(job, outputPath);
return job.waitForCompletion(true) ? 0 : 1;
}
My reduce function had Iterator<> instead of Iterable<>, so the job fell through to the identity reducer.
I can't quite believe I overlooked that. Noob error.
And the answer came from this Q/A:
Using Hadoop for the First Time, MapReduce Job does not run Reduce Phase
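For reference, a sketch of what the corrected reducer looks like (same logic as above, but with Iterable<Text> so it actually overrides Reducer.reduce; TextPair is the author's composite key class):
// imports assumed: java.io.IOException, java.util.Iterator, org.apache.hadoop.io.Text, org.apache.hadoop.mapreduce.Reducer
public static class JoinOnFirstReducer extends
        Reducer<TextPair, Text, Text, Text> {
    @Override
    public void reduce(TextPair key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Iterator<Text> iter = values.iterator();
        // with the sort order and grouping comparator set in the driver,
        // the degree record (tagged "0") arrives first for each node
        Text nodeDegree = new Text(iter.next());
        while (iter.hasNext()) {
            Text secondNode = iter.next();
            context.write(key.getFirst(),
                    new Text(nodeDegree.toString() + "\t" + secondNode.toString()));
        }
    }
}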
