Weird error in Hadoop reducer - hadoop

The reducer in my map-reduce job is as follows:
public static class Reduce_Phase2 extends MapReduceBase implements Reducer<IntWritable, Neighbourhood, Text,Text> {
public void reduce(IntWritable key, Iterator<Neighbourhood> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
ArrayList<Neighbourhood> cachedValues = new ArrayList<Neighbourhood>();
while(values.hasNext()){
Neighbourhood n = values.next();
cachedValues.add(n);
//correct output
//output.collect(new Text(n.source), new Text(n.neighbours));
}
for(Neighbourhood node:cachedValues){
//wrong output
output.collect(new Text(key.toString()), new Text(node.source+"\t\t"+node.neighbours));
}
}
}
The Neighbourhood class has two attributes, source and neighbours, both of type Text. This reducer receives one key with 19 values (of type Neighbourhood) assigned to it. When I output the source and neighbours inside the while loop, I get the actual contents of the 19 different values. However, if I output them after the while loop as shown in the code, I get 19 identical values. That is, one object gets output 19 times! It is very weird. Does anyone have an idea what is happening?
Here is the code of the class Neighbourhood
public class Neighbourhood extends Configured implements WritableComparable<Neighbourhood> {
Text source ;
Text neighbours ;
public Neighbourhood(){
source = new Text();
neighbours = new Text();
}
public Neighbourhood (String s, String n){
source = new Text(s);
neighbours = new Text(n);
}
@Override
public void readFields(DataInput arg0) throws IOException {
source.readFields(arg0);
neighbours.readFields(arg0);
}
@Override
public void write(DataOutput arg0) throws IOException {
source.write(arg0);
neighbours.write(arg0);
}
@Override
public int compareTo(Neighbourhood o) {
return 0;
}
}

You're being caught out by an efficiency mechanism employed by Hadoop: object reuse.
Each call to values.next() returns the same object reference; all Hadoop does behind the scenes is replace the contents of that same object with the next value's bytes (deserialized using the readFields() method). Your list therefore ends up holding 19 references to one object, which shows the last value read.
To avoid this you'll need to create deep copies of the object returned from values.next(). Hadoop has a utility method for this, ReflectionUtils.copy. A simple fix would be as follows:
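while(values.hasNext()){
Neighbourhood n = ReflectionUtils.newInstance(Neighbourhood.class, conf);
// copy(conf, src, dst): serializes the reused object and reads it back into the fresh instance
ReflectionUtils.copy(conf, values.next(), n);
cachedValues.add(n);
}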
You'll need to cache a version of the job Configuration (conf in the above code), which you can obtain by overriding the configure(JobConf) method in your Reducer:
private JobConf conf;
@Override
public void configure(JobConf job) {
conf = job;
}
Be warned though - accumulating a list in this way is often the cause of memory problems in your job, especially if you have 100,000+ values for a given single key.
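To make the reuse behaviour concrete, here is a minimal sketch in plain Java with no Hadoop dependency. The Holder class and the reusing iterator are hypothetical stand-ins for a Writable value and for the values iterator; they only imitate the pattern described above and are not Hadoop code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class ObjectReuseDemo {
    // Hypothetical stand-in for a Writable value; not a Hadoop class.
    static final class Holder {
        String text = "";

        Holder copy() {
            Holder h = new Holder();
            h.text = text;
            return h;
        }
    }

    // Imitates the reuse pattern: one instance is overwritten on every next().
    static Iterator<Holder> reusingIterator(List<String> raw) {
        final Holder shared = new Holder();
        final Iterator<String> it = raw.iterator();
        return new Iterator<Holder>() {
            @Override
            public boolean hasNext() {
                return it.hasNext();
            }

            @Override
            public Holder next() {
                shared.text = it.next();
                return shared;
            }
        };
    }

    // Caches the returned references: every list entry is the same object.
    static List<String> cacheReferences(List<String> raw) {
        List<Holder> cached = new ArrayList<>();
        Iterator<Holder> values = reusingIterator(raw);
        while (values.hasNext()) {
            cached.add(values.next());
        }
        List<String> out = new ArrayList<>();
        for (Holder h : cached) {
            out.add(h.text);
        }
        return out;
    }

    // Caches a copy of each value: every list entry keeps its own contents.
    static List<String> cacheCopies(List<String> raw) {
        List<Holder> cached = new ArrayList<>();
        Iterator<Holder> values = reusingIterator(raw);
        while (values.hasNext()) {
            cached.add(values.next().copy());
        }
        List<String> out = new ArrayList<>();
        for (Holder h : cached) {
            out.add(h.text);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("a", "b", "c");
        System.out.println(cacheReferences(raw)); // prints [c, c, c]
        System.out.println(cacheCopies(raw));     // prints [a, b, c]
    }
}
```

The first call shows the symptom from the question (the last value repeated), the second shows the effect of copying before caching.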

Related

Getting the partition id of input file in Hadoop

I need to know the row index of the partitions of the input file that I'm using. I could force this in the original file by concatenating the row index to the data but I'd rather have a way of doing this in Hadoop. I have this in my mapper...
String id = context.getConfiguration().get("mapreduce.task.partition");
But "id" is 0 in every case. In the "Hadoop: The Definitive Guide" it mentions accessing properties like the partition id "can be accessed from the context object passed to all methods of the Mapper or Reducer". It does not, from what I can tell, actually go into how to access this information.
I went through the documentation for the Context object and it seems like the above is the way to do it and the script does compile. But since I'm getting 0 for every value, I'm not sure if I'm actually using the right thing and I'm unable to find any detail online that could help in figuring this out.
Code used to test...
public class Test {
public static class TestMapper extends Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String id = context.getConfiguration().get("mapreduce.task.partition");
context.write(new Text("Test"), new Text(id + "_" + value.toString()));
}
}
public static class TestReducer extends Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
for(Text value : values) {
context.write(key, value);
}
}
}
public static void main(String[] args) throws Exception {
if(args.length != 2) {
System.err.println("Usage: Test <input path> <output path>");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(Test.class);
job.setJobName("Test");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(TestMapper.class);
job.setReducerClass(TestReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Two options are:
Use the offset instead of the row number
Track the line number in the mapper
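For the first one, the key, which is a LongWritable, tells you the byte offset of the line being processed. Unless your lines are all exactly the same length, you won't be able to calculate the line number from an offset, but it does allow you to determine ordering if that's useful.
The second option is to just track it in the mapper. Bear in mind that the counter lives in one mapper instance, so it is the line number within that mapper's split and matches the line number in the file only when the whole file is a single split. You could change your code to something like: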
public static class TestMapper extends Mapper<LongWritable, Text, Text, Text> {
private long currentLineNum = 0;
private Text test = new Text("Test");
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
context.write(test, new Text(currentLineNum + "_" + value));
currentLineNum++;
}
}
You could also represent your matrix as lines of tuples and include the row and column in every tuple, so that you have that information when you read the file. If you use a file of space- or comma-separated values that make up a 2D array, it will be extremely hard to figure out which line (row) you are currently working on in the mapper.
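As a rough illustration of the first option, the sketch below computes the byte offset at which each line of a small text starts, which is the kind of key a line-oriented input format passes to map(). It is plain Java, LineOffsets is a hypothetical name, and the sample text is made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class LineOffsets {
    // Returns the byte offset at which each '\n'-terminated line starts.
    static List<Long> startOffsets(String content) {
        List<Long> offsets = new ArrayList<>();
        byte[] bytes = content.getBytes(StandardCharsets.UTF_8);
        int start = 0;
        for (int i = 0; i < bytes.length; i++) {
            if (bytes[i] == '\n') {
                offsets.add((long) start);
                start = i + 1;
            }
        }
        // A final line without a trailing newline still counts as a line.
        if (start < bytes.length) {
            offsets.add((long) start);
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Lines of length 2, 4 and 1: offsets grow, but not in steps of one.
        System.out.println(startOffsets("aa\nbbbb\nc\n")); // prints [0, 3, 8]
    }
}
```

The offsets are strictly increasing, so they give the order of the lines, but 3 and 8 say nothing about being the second and third line unless every line has the same length.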

When to go for custom Input format for Map reduce jobs

When should we go for custom Input Format while using Map Reduce programming ?
Say I have a file which I need to read line by line and it has 15 columns delimited by pipe, should I go for custom Input Format ?
I can use a TextInput Format as well as Custom Input Format in this case.
A custom InputFormat is worth writing when you need to customize how input records are read. In your case you do not need one.
Here is one example of a custom InputFormat, out of many.
Example: Reading Paragraphs as Input Records
If you are working on Hadoop MapReduce or using AWS EMR, there might be a use case where each input record is a paragraph rather than a single line (think about scenarios like analyzing comments on news articles). To process a complete paragraph at once as a single record, you need to change the default behavior of TextInputFormat, which reads one line per record, so that it reads a complete paragraph as one input key-value pair for further processing in MapReduce jobs.
This requires us to create a custom record reader, which can be done by implementing the RecordReader interface. The next() method is where you tell the record reader to fetch a paragraph instead of one line. See the following implementation:
public class ParagraphRecordReader implements RecordReader<LongWritable, Text> {
private LineRecordReader lineRecord;
private LongWritable lineKey;
private Text lineValue;
public ParagraphRecordReader(JobConf conf, FileSplit split) throws IOException {
lineRecord = new LineRecordReader(conf, split);
lineKey = lineRecord.createKey();
lineValue = lineRecord.createValue();
}
@Override
public void close() throws IOException {
lineRecord.close();
}
@Override
public LongWritable createKey() {
return new LongWritable();
}
@Override
public Text createValue() {
return new Text("");
}
@Override
public float getProgress() throws IOException {
// report the fraction consumed (0.0 to 1.0), not the byte position
return lineRecord.getProgress();
}
@Override
public synchronized boolean next(LongWritable key, Text value) throws IOException {
boolean appended, isNextLineAvailable;
boolean retval;
byte space[] = {' '};
value.clear();
isNextLineAvailable = false;
do {
appended = false;
retval = lineRecord.next(lineKey, lineValue);
if (retval) {
if (lineValue.toString().length() > 0) {
byte[] rawline = lineValue.getBytes();
int rawlinelen = lineValue.getLength();
value.append(rawline, 0, rawlinelen);
value.append(space, 0, 1);
appended = true;
}
isNextLineAvailable = true;
}
} while (appended);
return isNextLineAvailable;
}
@Override
public long getPos() throws IOException {
return lineRecord.getPos();
}
}
With the ParagraphRecordReader implementation in place, we extend TextInputFormat to create a custom InputFormat, overriding only the getRecordReader method so that it returns a ParagraphRecordReader instead of the default reader.
ParagraphInputFormat will look like:
public class ParagraphInputFormat extends TextInputFormat
{
@Override
public RecordReader<LongWritable, Text> getRecordReader(InputSplit split, JobConf conf, Reporter reporter)throws IOException {
reporter.setStatus(split.toString());
return new ParagraphRecordReader(conf, (FileSplit)split);
}
}
Make sure the job configuration uses our custom input format implementation for reading data into MapReduce jobs. It is as simple as setting the input format type to ParagraphInputFormat, as shown below:
conf.setInputFormat(ParagraphInputFormat.class);
With above changes, we can read paragraphs as input records into MapReduce programs.
let’s assume that input file is as follows with paragraphs:
And a simple mapper code would look like:
@Override
public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter)
throws IOException {
System.out.println(key+" : "+value);
}
Yes, you can use TextInputFormat for your case.
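With TextInputFormat each record arrives as one line, so the 15 pipe-delimited columns can be split inside the mapper. A minimal sketch of that split in plain Java follows; PipeRecord is a hypothetical helper name and the sample line is made up.

```java
import java.util.Arrays;

final class PipeRecord {
    // Splits one pipe-delimited line into its columns.
    static String[] parse(String line) {
        // "|" is a regex metacharacter, so it has to be escaped.
        // The limit of -1 keeps trailing empty columns instead of dropping them.
        return line.split("\\|", -1);
    }

    public static void main(String[] args) {
        String[] cols = parse("a|b||d|");
        System.out.println(cols.length);           // prints 5
        System.out.println(Arrays.toString(cols)); // prints [a, b, , d, ]
    }
}
```

In a mapper, the line would come from value.toString(); checking the column count against the expected 15 is a cheap way to skip malformed lines.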

Hadoop MapReduce example for string transformation

I have a large number of strings in a text file and need to transform them with the following algorithm: convert each string to lowercase and remove all spaces.
Can you give me an example of a Hadoop MapReduce function which implements that algorithm?
Thank you.
I tried the code below, and I am getting the output on a single line.
public class toUpper {
public static class textMapper extends Mapper<LongWritable,Text,NullWritable,Text>
{
Text outvalue=new Text();
public void map(LongWritable key,Text values,Context context) throws IOException, InterruptedException
{
String token;
StringBuffer br=new StringBuffer();
StringTokenizer st=new StringTokenizer(values.toString());
while(st.hasMoreTokens())
{
token=st.nextToken();
br.append(token.toUpperCase());
}
st=null;
outvalue.set(br.toString());
context.write(NullWritable.get(), outvalue);
br=null;
}
}
public static class textReduce extends Reducer<NullWritable,Text,NullWritable,Text>
{
Text outvale=new Text();
public void reduce(NullWritable key,Iterable<Text> values,Context context) throws IOException, InterruptedException
{
StringBuffer br=new StringBuffer();
for(Text st:values)
{
br.append(st.toString());
}
outvale.set(br.toString());
context.write(NullWritable.get(), outvale);
}
}
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Configuration conf=new Configuration();
@SuppressWarnings("deprecation")
Job job=new Job(conf,"touipprr");
job.setJarByClass(toUpper.class);
job.setMapperClass(textMapper.class);
job.setReducerClass(textReduce.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true)?0:1);
}
}
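The code above upper-cases the tokens, and its reducer appends every value to one buffer, which would explain the single-line output. The transformation the question asks for (lowercase, no spaces) works line by line, so a sketch of just that step may help. LineNormalizer is a hypothetical helper in plain Java; inside a mapper it would be applied to value.toString(), and a map-only job (for example via job.setNumReduceTasks(0)) would keep one output line per input line.

```java
import java.util.Locale;

final class LineNormalizer {
    // Lower-cases the line and removes every run of whitespace.
    static String normalize(String line) {
        return line.toLowerCase(Locale.ROOT).replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        System.out.println(normalize("Hello  World\tFoo")); // prints helloworldfoo
    }
}
```

Locale.ROOT is used so that the result does not depend on the default locale of the machine running the task.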
Back when I was playing around with map-reduce, I had a similar thought: there must be some technique for modifying every word in a record and doing all the cleaning work.
To recap the map-reduce flow, we have a map function that splits the incoming records into tokens with the help of delimiters (you probably know your delimiters better than I do). Now, let us approach your problem statement step by step.
These are the things I would probably have done when I was new to map-reduce:
> I would write a map() method that splits the lines for me
> I would run out of options, write a reduce function,
and somehow manage to achieve my objective
That approach is perfectly okay, but there is a better technique that helps you decide whether you need the reduce function at all. It leaves you free to focus on the objective itself and on optimizing your code.
For situations like yours, one class came to my rescue: ChainMapper.
How does ChainMapper work? Here are a few points to consider:
-> The first mapper reads the file from HDFS, splits each line on the delimiter, and writes the tokens to the context.
-> The second mapper receives the output of the first mapper. Here you can do all sorts of string operations your business requires, such as encrypting the text or changing it to upper case or lower case.
-> The transformed string produced by the second mapper is written to the context again.
-> Now, if you need a reducer for an aggregation task such as word count, go for it.
I have a piece of code that may not be efficient (some may feel it's horrible), but it serves your purpose while you are playing around with mapreduce.
SplitMapper.java
public class SplitMapper extends Mapper<LongWritable,Text,Text,IntWritable>{
@Override
public void map(LongWritable key,Text value,Context context)
throws IOException,InterruptedException{
StringTokenizer xs=new StringTokenizer(value.toString());
IntWritable dummyValue=new IntWritable(1);
while(xs.hasMoreElements()){
String content=(String)xs.nextElement();
context.write(new Text(content),dummyValue);
}
}
}
LowerCaseMapper.java
public class LowerCaseMapper extends Mapper<Text,IntWritable,Text,IntWritable>{
@Override
public void map(Text key,IntWritable value,Context context)
throws IOException,InterruptedException{
String val=key.toString().toLowerCase();
Text newKey=new Text(val);
context.write(newKey,value);
}
}
Since I am performing a word count here, I need a reducer:
ChainMapReducer.java
public class ChainMapReducer extends Reducer<Text,IntWritable,Text,IntWritable>{
@Override
public void reduce(Text key,Iterable<IntWritable> value,Context context)
throws IOException,InterruptedException{
int sum=0;
for(IntWritable v:value){
sum+=v.get();
}
context.write(key,new IntWritable(sum));
}
}
To implement the ChainMapper concept successfully, you must pay attention to every detail of the driver class.
DriverClass.java
public class DriverClass extends Configured implements Tool{
public int run(String args[]) throws IOException,InterruptedException,ClassNotFoundException{
Configuration cf=getConf();
Job j=Job.getInstance(cf);
//configuration for the first mapper
Configuration splitMapConfig=new Configuration(false);
ChainMapper.addMapper(j,SplitMapper.class,LongWritable.class,Text.class,Text.class,IntWritable.class,splitMapConfig);
//configuration for the second mapper
Configuration lowerCaseConfig=new Configuration(false);
ChainMapper.addMapper(j,LowerCaseMapper.class,Text.class,IntWritable.class,Text.class,IntWritable.class,lowerCaseConfig);
j.setJarByClass(DriverClass.class);
j.setCombinerClass(ChainMapReducer.class);
j.setReducerClass(ChainMapReducer.class);
j.setOutputKeyClass(Text.class);
j.setOutputValueClass(IntWritable.class);
Path outputPath=new Path(args[1]);
FileInputFormat.addInputPath(j,new Path(args[0]));
FileOutputFormat.setOutputPath(j,outputPath);
//remove any previous output so the job can be rerun
outputPath.getFileSystem(cf).delete(outputPath,true);
//submit the job and wait for it to finish
return j.waitForCompletion(true)?0:1;
}
public static void main(String args[]) throws Exception{
int res=ToolRunner.run(new Configuration(),new DriverClass(),args);
System.exit(res);
}
}
The driver class is mostly self-explanatory. The one thing to observe is the signature ChainMapper.addMapper(<job-object>,<Map-ClassName>,<input key and value types>,<output key and value types>,<configuration-for-the-concerned-mapper>)
I hope that the solution serves your purpose. Please let me know if any issues arise when you try to implement it.
Thank you!

How to read two files parallely with two map task running in parallel

Please go a little easy on me because I have only been working with Hadoop and MapReduce for 3 months.
I have 2 files of 120 MB each. The data inside each file is completely unstructured but follows a common pattern. Because of the varying structure of the data, the default TextInputFormat does not meet my requirement.
Hence, while reading the file, I override the isSplitable() method and prevent splitting by returning false, so that one mapper can access one complete file and I can apply my logic to it.
My machine can run two mappers in parallel. By preventing the split I am degrading performance: the files are processed one after the other by a single mapper each, rather than by two mappers working on one file in parallel.
My question is: how can I run two mappers in parallel, one for each file, so that performance improves?
For Example
When split was allowed:
file 1: split 1 (1st mapper) || split 2 (2nd mapper)------ 2 min
file 2: split 1 (1st mapper) || split 2 (2nd mapper)------ 2 min
Total Time for reading two files ===== 4 min
When Split not allowed:
file 1: no parallel jobs so (1st mapper)---------4 min
file 2: no parallel jobs so (1st mapper)---------4 min
Total Time to read two files ===== 8 min (Performance degraded)
What I want
File 1 (1st Mapper) || file 2 (2nd Mapper) ------4 min
Total time to read two files ====== 4 min
Basically I want both the Files to be read at the same time by two different mapper.
Please help me in achieving the scenario.
Below are my Custom InputFormat and Custom RecordReader Code.
public class NSI_inputformatter extends FileInputFormat<NullWritable, Text>{
@Override
public boolean isSplitable(FileSystem fs, Path filename)
{
//System.out.println("Inside the isSplitable Method of NSI_inputformatter");
return false;
}
@Override
public RecordReader<NullWritable, Text> getRecordReader(InputSplit split,
JobConf job_run, Reporter reporter) throws IOException {
// TODO Auto-generated method stub
//System.out.println("Inside the getRecordReader method of NSI_inputformatter");
return new NSI_record_reader(job_run, (FileSplit)split);
}
}
Record Reader:
public class NSI_record_reader implements RecordReader<NullWritable, Text>
{
FileSplit split;
JobConf job_run;
String text;
public boolean processed=false;
public NSI_record_reader(JobConf job_run, FileSplit split)
{
//System.out.println("Inside the NSI_record_reader constructor");
this.split=split;
this.job_run=job_run;
//System.out.println(split.toString());
}
@Override
public boolean next(NullWritable key, Text value) throws IOException {
// TODO Auto-generated method stub
//System.out.println("Inside the next method of the NLI_record_reader");
if (!processed)
{
byte [] content_add=new byte[(int)(split.getLength())];
Path file=split.getPath();
FileSystem fs=file.getFileSystem(job_run);
FSDataInputStream input=null;
try{
input=fs.open(file);
System.out.println("the input is " +input+ input.toString());
IOUtils.readFully(input, content_add, 0, content_add.length);
value.set(content_add, 0, content_add.length);
}
finally
{
IOUtils.closeStream(input);
}
processed=true;
return true;
}
return false;
}
@Override
public void close() throws IOException {
// TODO Auto-generated method stub
}
@Override
public NullWritable createKey() {
System.out.println("Inside createKey() method of NSI_record_reader");
// TODO Auto-generated method stub
return NullWritable.get();
}
@Override
public Text createValue() {
System.out.println("Inside createValue() method of NSI_record_reader");
// TODO Auto-generated method stub
return new Text();
}
@Override
public long getPos() throws IOException {
// TODO Auto-generated method stub
System.out.println("Inside getPos() method of NSI_record_reader");
return processed ? split.getLength() : 0;
}
@Override
public float getProgress() throws IOException {
// TODO Auto-generated method stub
System.out.println("Inside getProgress() method of NSI_record_reader");
return processed ? 1.0f : 0.0f;
}
}
Input Sample:
<Dec 12, 2013 1:05:56 AM CST> <Error> <HTTP> <BEA-101017> <[weblogic.servlet.internal.WebAppServletContext@42e87d99 - appName: 'Agile', name: '/Agile', context-path: '/Agile', spec-version: 'null'] Root cause of ServletException.
javax.servlet.jsp.JspException: Connection reset by peer: socket write error
at com.agile.ui.web.taglib.common.FormTag.writeFormHeader(FormTag.java:498)
at com.agile.ui.web.taglib.common.FormTag.doStartTag(FormTag.java:429)
at jsp_servlet._default.__login_45_cms._jspService(__login_45_cms.java:929)
at weblogic.servlet.jsp.JspBase.service(JspBase.java:34)
at weblogic.servlet.internal.StubSecurityHelper$ServletServiceAction.run(StubSecurityHelper.ja va:227)
Truncated. see log file for complete stacktrace
>
Retrieving the value for the attribute Page Two.Validation Status for the Object 769630
Retrieving the value for the attribute Page Two.Pilot Required for the Object 769630
Retrieving the value for the attribute Page Two.NPO Contact for the Object 769630
<Dec 12, 2013 1:12:13 AM CST> <Warning> <Socket> <BEA-000449> <Closing socket as no data read from it during the configured idle timeout of 0 secs>
Thanks.
You could try setting the property -D mapred.min.split.size=209715200 (200 MB). In this case FileInputFormat should not split your files because they are smaller than mapred.min.split.size.
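To see why a large minimum split size keeps a 120 MB file in one piece, here is a minimal sketch of the split-size rule as I understand it for the old-API FileInputFormat: max(minSize, min(goalSize, blockSize)). The 64 MB block size and the 120 MB goal size are assumptions chosen for illustration, and SplitSizeSketch is a hypothetical name, not a Hadoop class.

```java
final class SplitSizeSketch {
    static final long MB = 1024L * 1024L;

    // Split-size rule: the minimum wins over the smaller of goal and block size.
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // A file that fits into one split is not divided.
    static boolean staysWhole(long fileLength, long splitSize) {
        return fileLength <= splitSize;
    }

    public static void main(String[] args) {
        long fileLength = 120 * MB; // size of each input file in the question
        long blockSize = 64 * MB;   // assumed block size
        long goalSize = 120 * MB;   // assumed goal size (total input / requested maps)

        long defaultSplit = computeSplitSize(goalSize, 1L, blockSize);
        long raisedSplit = computeSplitSize(goalSize, 209715200L, blockSize);

        System.out.println(staysWhole(fileLength, defaultSplit)); // prints false
        System.out.println(staysWhole(fileLength, raisedSplit));  // prints true
    }
}
```

With the minimum left at 1 byte the split size falls back to the block size, so the file is cut; with the minimum raised to 200 MB the split size exceeds the file length.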

Make use of the relation name/table name/file name in Hadoop's MapReduce

Is there a way to use the relation name in MapReduce's Map and Reduce? I am trying to do Set difference using Hadoop's MapReduce.
Input: 2 files R and S containing list of terms. (Am going to use t to denote a term)
Objective: To find R - S, i.e. terms in R and not in S
Approach:
Mapper: Spits out t -> R or t -> S, depending on whether t comes from R or S. So, the map output has the t as the key and the file name as the value.
Reducer: If the value list for a t contains only R, then output t -> t.
Do I need to somehow tag the terms with the filename? Or is there any other way?
Source code for something I did for Set Union (doesn't need file name anywhere in this case). Just wanted to use this as an example to illustrate the unavailability of filename in Mapper.
public class Union {
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable key, Text value, OutputCollector output, Reporter reporter) throws IOException {
output.collect(value, value);
}
}
public static class Reduce extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException{
while (values.hasNext())
{
output.collect(key, values.next());
break;
}
}
}
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(Union.class);
conf.setJobName("Union");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.set("mapred.job.queue.name", "myQueue");
conf.setNumReduceTasks(5);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
}
As you can see I can't identify which key -> value pair (input to the Mapper) came from which file. Am I overlooking something simple here?
Thanks much.
I would implement it just the way you described. That is the way MapReduce is meant to be used.
I guess your actual problem was writing the same value n times into HDFS?
EDIT:
Pasted from my Comment down there
Ah, I got it ;) I'm not really familiar with the "old" API, but you can query your Reporter with:
reporter.getInputSplit();
This returns an InputSplit, which is an interface. For file-based input formats it can be cast to FileSplit. From the FileSplit object you can obtain the Path with split.getPath(), and from the Path object you just need to call the getName() method.
So this snippet should work for you:
FileSplit fsplit = (FileSplit) reporter.getInputSplit(); // cast is needed: getInputSplit() returns InputSplit
String yourFileName = fsplit.getPath().getName();
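Once the mapper can tag each term with its file name, the reducer rule from the question (emit t only if its value list contains nothing but R) can be sketched as below. This is a local simulation in plain Java, not Hadoop code: SetDifferenceSketch is a hypothetical name, the map groups terms the way the shuffle would, and the tags "R" and "S" stand for the two file names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

final class SetDifferenceSketch {
    // Reducer-side rule: keep a term only if it was seen in R and never in S.
    static boolean onlyInR(Iterable<String> sourceTags) {
        boolean sawR = false;
        for (String tag : sourceTags) {
            if ("S".equals(tag)) {
                return false;
            }
            if ("R".equals(tag)) {
                sawR = true;
            }
        }
        return sawR;
    }

    // Simulates map (tag each term with its source) and reduce (apply the rule).
    static SortedSet<String> difference(List<String> r, List<String> s) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String t : r) {
            grouped.computeIfAbsent(t, k -> new ArrayList<>()).add("R");
        }
        for (String t : s) {
            grouped.computeIfAbsent(t, k -> new ArrayList<>()).add("S");
        }
        SortedSet<String> out = new TreeSet<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            if (onlyInR(e.getValue())) {
                out.add(e.getKey());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> r = Arrays.asList("apple", "banana", "cherry");
        List<String> s = Arrays.asList("banana", "date");
        System.out.println(difference(r, s)); // prints [apple, cherry]
    }
}
```

One caveat with this design: the rule should only run in the reducer. Used as a combiner it would see just the values from one map task and could emit a term that also occurs in S elsewhere.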
