NoSuchElementException in mapreduce - hadoop

I am new to MapReduce and am getting a NoSuchElementException. Please help.
The input file contains the text below:
this is a hadoop program
i am writing it for first time
Mapper class :
public class Mappers extends MapReduceBase implements Mapper<LongWritable, Text, IntWritable, IntWritable> {
    private Text word = new Text();
    private IntWritable singleWordCount = new IntWritable();
    private IntWritable one = new IntWritable(1);
    @Override
    public void map(LongWritable key, Text value, OutputCollector<IntWritable, IntWritable> output, Reporter reporter) throws IOException {
        StringTokenizer wordList = new StringTokenizer(value.toString());
        while (wordList.hasMoreTokens()) {
            int wordSize = wordList.nextToken().length();
            singleWordCount.set(wordSize);
            if (word != null && wordList != null && wordList.nextToken() != null) {
                word.set(wordList.nextToken());
                output.collect(singleWordCount, one);
            }
        }
    }
}
This is the error I am getting

You're calling wordList.nextToken() three times in every iteration of the loop, but hasMoreTokens() only guarantees that one more token exists. Each call returns the next token and advances the StringTokenizer, so as soon as fewer than three tokens remain at the start of an iteration, one of the later calls throws NoSuchElementException. Assuming each map() call receives one line (as with the default TextInputFormat), the first line has five words: the first iteration consumes "this", "is" and "a", the second consumes "hadoop" and "program", and the third call in that iteration finds nothing left.
What you need to do is retrieve the token once per iteration and store it in a variable. If you really need two words in one iteration, always call hasMoreTokens() to check that another word exists before each call to nextToken().
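The "retrieve it once and store it" advice can be sketched in plain Java, without the Hadoop classes. The class and method names here (WordLengths, lengths) are made up for illustration; in the mapper, the list would be replaced by calls to output.collect.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class WordLengths {
    // Returns the length of every whitespace-separated token in the line.
    public static List<Integer> lengths(String line) {
        List<Integer> result = new ArrayList<>();
        StringTokenizer wordList = new StringTokenizer(line);
        while (wordList.hasMoreTokens()) {
            // Exactly one nextToken() call per hasMoreTokens() check.
            String token = wordList.nextToken();
            result.add(token.length());
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [4, 2, 1, 6, 7] for the first line of the sample input.
        System.out.println(lengths("this is a hadoop program"));
    }
}
```

Because the token is stored in a local variable, it can be used as often as needed inside the loop body without advancing the tokenizer again.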

Getting the partition id of input file in Hadoop

I need to know the row index of the partitions of the input file that I'm using. I could force this in the original file by concatenating the row index to the data but I'd rather have a way of doing this in Hadoop. I have this in my mapper...
String id = context.getConfiguration().get("mapreduce.task.partition");
But "id" is 0 in every case. In the "Hadoop: The Definitive Guide" it mentions accessing properties like the partition id "can be accessed from the context object passed to all methods of the Mapper or Reducer". It does not, from what I can tell, actually go into how to access this information.
I went through the documentation for the Context object and it seems like the above is the way to do it and the script does compile. But since I'm getting 0 for every value, I'm not sure if I'm actually using the right thing and I'm unable to find any detail online that could help in figuring this out.
Code used to test...
public class Test {
    public static class TestMapper extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String id = context.getConfiguration().get("mapreduce.task.partition");
            context.write(new Text("Test"), new Text(id + "_" + value.toString()));
        }
    }
    public static class TestReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            for (Text value : values) {
                context.write(key, value);
            }
        }
    }
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: Test <input path> <output path>");
            System.exit(-1);
        }
        Job job = new Job();
        job.setJarByClass(Test.class);
        job.setJobName("Test");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(TestMapper.class);
        job.setReducerClass(TestReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Two options are:
Use the offset instead of the row number
Track the line number in the mapper
For the first one, the key, which is a LongWritable, tells you the byte offset of the line being processed. Unless your lines are all exactly the same length, you won't be able to calculate the line number from an offset, but it does let you determine ordering, if that's useful.
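To see why offsets give ordering but not line numbers, here is a small plain-Java sketch that computes the offset at which each line starts. It assumes UTF-8 text with single-byte '\n' line endings; the class name LineOffsets is only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class LineOffsets {
    // Byte offset at which each line starts, assuming UTF-8 text and '\n' line endings.
    public static List<Long> offsets(String text) {
        List<Long> result = new ArrayList<>();
        long position = 0;
        for (String line : text.split("\n")) {
            result.add(position);
            // Advance past the line's bytes plus the one-byte newline.
            position += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [0, 3, 8]: increasing, but not the line numbers 0, 1, 2.
        System.out.println(offsets("ab\ncdef\ng\n"));
    }
}
```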
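The second option is to just track it in the mapper. Keep in mind that each map task has its own Mapper instance, so the counter restarts at 0 for every input split and matches the file's line numbers only if the whole file is read by a single map task. You could change your code to something like: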
public static class TestMapper extends Mapper<LongWritable, Text, Text, Text> {
    private long currentLineNum = 0;
    private Text test = new Text("Test");
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(test, new Text(currentLineNum + "_" + value));
        currentLineNum++;
    }
}
You could also represent your matrix as lines of tuples and include the row and column in every tuple, so that the information is available as you read the file. If you use a file of space- or comma-separated values that make up a 2D array, it will be extremely hard to figure out which line (row) you are currently working on in the mapper.
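As a rough sketch of the tuple idea: if every input line carried its own coordinates, for example in a hypothetical "row,col,value" layout, the mapper could recover the row without any counter. The format and the MatrixEntry class below are assumptions for illustration, not something taken from the question.

```java
class MatrixEntry {
    final int row;
    final int col;
    final double value;

    MatrixEntry(int row, int col, double value) {
        this.row = row;
        this.col = col;
        this.value = value;
    }

    // Parses one line of the assumed "row,col,value" format.
    public static MatrixEntry parse(String line) {
        String[] parts = line.trim().split(",");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected row,col,value but got: " + line);
        }
        return new MatrixEntry(
                Integer.parseInt(parts[0].trim()),
                Integer.parseInt(parts[1].trim()),
                Double.parseDouble(parts[2].trim()));
    }
}
```

With this layout the order in which lines reach the mappers no longer matters, because each record is self-describing.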

When to go for custom Input format for Map reduce jobs

When should we go for a custom InputFormat in MapReduce programming?
Say I have a file which I need to read line by line, and it has 15 columns delimited by pipes. Should I go for a custom InputFormat?
I can use TextInputFormat as well as a custom InputFormat in this case.
A custom InputFormat can be written when you need to customize how input records are read. But in your case you don't need such an implementation.
See the example below, one of many possible custom InputFormats...
Example: Reading Paragraphs as Input Records
If you are working with Hadoop MapReduce or AWS EMR, there might be a use case where each key-value record in the input files is a paragraph instead of a single line (think of scenarios like analyzing comments on news articles). So if you need to process a complete paragraph at once as a single record, instead of a single line, you will need to customize the default behavior of TextInputFormat, which reads one line per record, so that it reads a complete paragraph as one input key-value pair for further processing in MapReduce jobs.
This requires us to create a custom record reader, which can be done by implementing the RecordReader interface. The next() method is where you tell the record reader to fetch a paragraph instead of one line. See the following implementation; it's self-explanatory:
public class ParagraphRecordReader implements RecordReader<LongWritable, Text> {
    private LineRecordReader lineRecord;
    private LongWritable lineKey;
    private Text lineValue;
    public ParagraphRecordReader(JobConf conf, FileSplit split) throws IOException {
        lineRecord = new LineRecordReader(conf, split);
        lineKey = lineRecord.createKey();
        lineValue = lineRecord.createValue();
    }
    @Override
    public void close() throws IOException {
        lineRecord.close();
    }
    @Override
    public LongWritable createKey() {
        return new LongWritable();
    }
    @Override
    public Text createValue() {
        return new Text("");
    }
    @Override
    public float getProgress() throws IOException {
        // Report the fraction of the split consumed (0.0 to 1.0), not the byte position.
        return lineRecord.getProgress();
    }
    @Override
    public synchronized boolean next(LongWritable key, Text value) throws IOException {
        boolean appended, isNextLineAvailable;
        boolean retval;
        byte[] space = {' '};
        value.clear();
        isNextLineAvailable = false;
        // Keep appending lines until a blank line or the end of the split is reached.
        do {
            appended = false;
            retval = lineRecord.next(lineKey, lineValue);
            if (retval) {
                if (lineValue.toString().length() > 0) {
                    if (value.getLength() == 0) {
                        // Use the offset of the paragraph's first line as the record key.
                        key.set(lineKey.get());
                    }
                    byte[] rawline = lineValue.getBytes();
                    int rawlinelen = lineValue.getLength();
                    value.append(rawline, 0, rawlinelen);
                    value.append(space, 0, 1);
                    appended = true;
                }
                isNextLineAvailable = true;
            }
        } while (appended);
        return isNextLineAvailable;
    }
    @Override
    public long getPos() throws IOException {
        return lineRecord.getPos();
    }
}
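The grouping that next() performs can also be tried out without a cluster. The following is a plain-Java sketch of the same idea (consecutive non-empty lines form one record, a blank line ends it), not the Hadoop reader itself. The class name ParagraphSplitter is made up, and unlike the reader above it skips runs of blank lines instead of returning empty records and does not add a trailing space.

```java
import java.util.ArrayList;
import java.util.List;

class ParagraphSplitter {
    // Groups consecutive non-empty lines into one record; blank lines end a record.
    public static List<String> paragraphs(String text) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : text.split("\n", -1)) {
            if (line.isEmpty()) {
                flush(current, records);
            } else {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(line);
            }
        }
        flush(current, records); // last paragraph may not be followed by a blank line
        return records;
    }

    // Moves the accumulated paragraph (if any) into the record list.
    private static void flush(StringBuilder current, List<String> records) {
        if (current.length() > 0) {
            records.add(current.toString());
            current.setLength(0);
        }
    }
}
```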
With the ParagraphRecordReader in place, we extend TextInputFormat to create a custom InputFormat, overriding only the getRecordReader method so that it returns a ParagraphRecordReader instead of the default reader.
ParagraphInputFormat will look like:
public class ParagraphInputFormat extends TextInputFormat {
    @Override
    public RecordReader<LongWritable, Text> getRecordReader(InputSplit split, JobConf conf, Reporter reporter) throws IOException {
        reporter.setStatus(split.toString());
        return new ParagraphRecordReader(conf, (FileSplit) split);
    }
}
Make sure the job configuration uses our custom input format when reading data into the MapReduce job. This is as simple as setting the input format type to ParagraphInputFormat, as shown below:
conf.setInputFormat(ParagraphInputFormat.class);
With the above changes, we can read paragraphs as input records in MapReduce programs.
Let's assume that the input file consists of paragraphs separated by blank lines.
A simple mapper would then look like:
@Override
public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
    System.out.println(key + " : " + value);
}
Yes, you can use TextInputFormat for your case.
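For the pipe-delimited file in the question, TextInputFormat hands each line to the mapper, which can then split it itself. A minimal sketch of that parsing step in plain Java, without Hadoop classes; the class name PipeRecord and the strict 15-column check are assumptions for illustration:

```java
class PipeRecord {
    static final int EXPECTED_COLUMNS = 15;

    // Splits one pipe-delimited line into its columns.
    public static String[] parse(String line) {
        // The pipe is a regex metacharacter, so it must be escaped.
        // The limit of -1 keeps trailing empty columns instead of dropping them.
        String[] columns = line.split("\\|", -1);
        if (columns.length != EXPECTED_COLUMNS) {
            throw new IllegalArgumentException(
                    "expected " + EXPECTED_COLUMNS + " columns but got " + columns.length);
        }
        return columns;
    }
}
```

Inside map(), the same call would be applied to value.toString().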

hadoop - total line of input files

I have an input file that contains:
id value
1e 1
2e 1
...
2e 1
3e 1
4e 1
And I would like to find the total number of ids in my input file. So in my main class, I have declared a list, so that when I read the input file I can insert each id into it
MainDriver.java
public static Set list = new HashSet();
and in my map
// Apply regex to find the id
...
// Insert id to the list
MainDriver.list.add(regex.group(1)); // add 1e, 2e, 3e ...
and in my reduce, I try to use the list like this:
public void reduce(WritableComparable key, Iterator values,
        OutputCollector output, Reporter reporter) throws IOException
{
    ...
    output.collect(key, new IntWritable(MainDriver.list.size()));
}
So I expect the value printed to the output file to be 4 in this case, but it actually prints 0.
I have verified that regex.group(1) extracts a valid id, so I have no clue why the size of my list is 0 in the reduce phase.
The mappers and reducers run on separate JVMs (and often separate machines altogether) both from each other and from the driver program, so there is no common instance of your list Set variable that all of those methods can concurrently read and write to.
One way in MapReduce to count the number of keys is:
1. Emit (id, 1) from your mapper
2. (Optionally) sum the 1s for each mapper using a combiner to minimize network and reducer I/O
3. In the reducer:
   - In setup(), initialize a class-scope numeric variable (int or long, presumably) to 0
   - In reduce(), increment the counter and ignore the values
   - In cleanup(), emit the counter value now that all keys have been processed
4. Run the job with a single reducer, so all the keys go to the same JVM where a single count can be made
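The reasoning behind these steps is that reduce() is called once per distinct key, so counting the calls counts the ids. Here is a plain-Java simulation of that single-reducer flow; it is only a sketch of the logic, not Hadoop code, and the class name DistinctIdCount is made up. The sorted set stands in for the shuffle, which groups identical keys.

```java
import java.util.List;
import java.util.TreeSet;

class DistinctIdCount {
    // Simulates a single reducer that counts how many distinct keys it sees.
    public static long countDistinct(List<String> ids) {
        long counter = 0;                        // setup(): initialise the counter
        for (String key : new TreeSet<>(ids)) {  // shuffle: one group per distinct key
            counter++;                           // reduce(): one call per key, values ignored
        }
        return counter;                          // cleanup(): emit the final count
    }
}
```

For the ids in the question (1e, 2e, 2e, 3e, 4e) this returns 4, which is the value the asker expected.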
This is basically ignoring the advantage of using MapReduce in the first place.
Correct me if I'm wrong, but it appears you can map your output from your Mapper by "id", and then in your Reducer you receive something like Text key, Iterator values as the parameters.
You can then just sum up values and output output.collect(key, <total value>);
Example (apologies for using Context rather than OutputCollector, but the logic is the same):
public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
    // Placeholder value: the reducer only counts how many values arrive per key.
    private final Text countOne = new Text("1");
    private final Text id = new Text();
    public void map(LongWritable key, Text value,
            Context context) throws IOException, InterruptedException {
        id.set(regex.group(1)); // do whatever you do
        context.write(id, countOne);
    }
}
public static class MyReducer extends Reducer<Text, Text, Text, IntWritable> {
    private final IntWritable totalCount = new IntWritable();
    public void reduce(Text key, Iterable<Text> values,
            Context context) throws IOException, InterruptedException {
        int cnt = 0;
        for (Text value : values) {
            cnt++;
        }
        totalCount.set(cnt);
        context.write(key, totalCount);
    }
}

Hadoop MapReduce: return sorted list of words in a text file

So my task is to return an alphabetically sorted list of all words contained in a text file while keeping duplicates.
{To be or not to be} → {be be not or to to}
My idea is to take each word as the key as well as the value. This way, because Hadoop sorts the keys, they will automatically be sorted alphabetically. In the reduce phase I simply append all words with the same key (so basically identical words) to one single Text value.
public class WordSort {
    public static class Map extends Mapper<LongWritable, Text, Text, Text> {
        private Text word = new Text();
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                // transform to lower case
                String lower = word.toString().toLowerCase();
                context.write(new Text(lower), new Text(lower));
            }
        }
    }
    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            String result = "";
            for (Text value : values) {
                result += value.toString() + " ";
            }
            context.write(key, new Text(result));
        }
    }
However my problem is, how do I simply return the value in my output file? At the moment I have this:
be be be
not not
or or
to to to
So in every line I have the key first and then the values, but I just want to return the values so that I get this:
be be
not
or
to to
Is this even possible or do I have to just delete one entry from the value of each word?
Disclaimer: I'm not a Hadoop user, but I do a lot of Map/Reduce with CouchDB.
If you just need the keys, why don't you emit an empty value?
Moreover, it sounds like you don't want to reduce them at all, since you want to get a key for every occurrence.
I just tried this with the MaxTemperature example from Hadoop: The Definitive Guide, and the code below worked (TextOutputFormat writes only the value when the key is null):
context.write(null, new Text(result));
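To check what the output should look like once the key is dropped, the whole flow can be simulated in plain Java. This is only a sketch of the logic, not Hadoop code: the sorted map stands in for the shuffle, and the class name SortedWords is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class SortedWords {
    // Returns one output line per distinct word, containing only the values.
    public static List<String> sortedLines(String text) {
        // "Map" and "shuffle": group identical lower-cased words under a sorted key.
        Map<String, List<String>> groups = new TreeMap<>();
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            String lower = tokenizer.nextToken().toLowerCase();
            groups.computeIfAbsent(lower, k -> new ArrayList<>()).add(lower);
        }
        // "Reduce": write only the joined values, never the key.
        List<String> lines = new ArrayList<>();
        for (List<String> values : groups.values()) {
            lines.add(String.join(" ", values));
        }
        return lines;
    }
}
```

For the input "To be or not to be" this yields the four lines "be be", "not", "or" and "to to", which is the output the question asks for.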

Hadoop Task Side Effect File Example

Can I get an example of how to use task side effect files?
public class Map0t extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        IntWritable one = new IntWritable(1);
        StringTokenizer tokenizer = new StringTokenizer(value.toString(), ",");
        String x;
        String y;
        String z;
        x = tokenizer.nextToken();
        y = tokenizer.nextToken();
        z = tokenizer.nextToken();
        output.collect(new Text(x + " " + z), one);
    }
}
I want to write, new Text(x+" "+y), new Text(z) as a side effect in the above Mapper function to a different folder in HDFS.
I searched but could not find any example on how to use task side effect files.
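Not an optimal approach, but one way I can think of is:
Open a file in HDFS in the mapper's setup(), write to it from map(), and then close it in the mapper's cleanup() (with the old MapReduceBase API used in the question, the corresponding methods are configure() and close()). One thing to make sure of is to use a unique file name in setup(), so that parallel map tasks do not overwrite each other's files.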
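The open / write / close lifecycle described above can be sketched as follows. To keep the example runnable without a cluster, it writes to the local filesystem with java.nio; in a real job the writer would presumably be created through Hadoop's FileSystem API instead. The class name SideFileWriter, the "side-" file name pattern and the taskId parameter are all made up for illustration.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class SideFileWriter implements AutoCloseable {
    private final Path file;
    private final BufferedWriter writer;

    // Corresponds to setup(): open one file per task; the task id keeps names unique.
    public SideFileWriter(Path dir, String taskId) {
        this.file = dir.resolve("side-" + taskId + ".txt");
        try {
            this.writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Corresponds to map(): append one tab-separated key/value line.
    public void write(String key, String value) {
        try {
            writer.write(key + "\t" + value);
            writer.newLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Corresponds to cleanup(): flush and close the file.
    @Override
    public void close() {
        try {
            writer.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public Path file() {
        return file;
    }
}
```

In the mapper from the question, write(x + " " + y, z) would be called next to output.collect, with one writer instance held in a field for the lifetime of the task.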
