How can I emit key/values at the end of the whole file processing? - hadoop

The Mapper reads lines from a file... How can I emit key/values at the end, after the whole file has been scanned, and not per line?

Using the new mapreduce API, you can override the Mapper.cleanup(Context) method and use Context.write(K, V) as you normally would in the map method.
@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
    context.write(new Text("key"), new Text("value"));
}
With the old mapred API you can override the close() method - but you'll need to store a reference to the OutputCollector given to the map method:
private OutputCollector<Text, Text> cachedCollector = null;

public void map(LongWritable key, Text value, OutputCollector<Text, Text> outputCollector, Reporter reporter) throws IOException {
    if (cachedCollector == null) {
        cachedCollector = outputCollector;
    }
    // ...
}

public void close() throws IOException {
    // cachedCollector is only non-null if map() was called at least once
    cachedCollector.collect(outputKey, outputValue);
}

Do you have one key/value for the whole file, or multiple?
If it is case #1:
Use WholeFileInputFormat. You will receive the full file content as a single record. You can split this into records, process all of them, and emit the final key/value at the end of your processing.
Case #2:
Use the same FileInputFormat. Store all key/values in temporary storage. At the end, access your temporary storage, emit whichever key/values you want, and suppress those you don't (a sketch of this follows below).
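A minimal sketch of case #2 with the new API, assuming Text keys/values and that the buffered data fits in mapper memory (the class and field names here are illustrative, not part of the original answer):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BufferingMapper extends Mapper<LongWritable, Text, Text, Text> {

    // key/values collected while scanning the split
    private final Map<String, String> buffered = new HashMap<String, String>();

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        // store instead of emitting per line; derive whatever value you need here
        buffered.put(value.toString(), "derived-value");
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // the whole split has been read: emit only what you want to keep
        for (Map.Entry<String, String> e : buffered.entrySet()) {
            context.write(new Text(e.getKey()), new Text(e.getValue()));
        }
    }
}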

Another alternative to Chris's answer is to override the run() method of the Mapper class (new API):
public static class Map extends Mapper<IntWritable, IntWritable, IntWritable, IntWritable> {

    // map method here

    // Override the run()
    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
        // Have your last <key, value> emitted here
        context.write(lastOutputKey, lastOutputValue);
        cleanup(context);
    }
}
And in order to ensure that each mapper gets one whole file to process, you have to create your own version of FileInputFormat and override isSplitable(), like this:
class NonSplittableFileInputFormat extends TextInputFormat {
    // extending TextInputFormat (a FileInputFormat) keeps the line-based record reader
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }
}
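Then wire it up in the driver; assuming your new-API Job instance is called job, something like:

job.setInputFormatClass(NonSplittableFileInputFormat.class);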

Related

How to share a HashMap between Mappers in Hadoop?

Can I share a HashMap with the same values between different Mappers, like a static variable? I am running a job on a Hadoop cluster, and I am trying to share variable values between all the mappers, which run on different datanodes.
INPUT ==> FileID FilePath
InputFormat => KeyValueTextInputFormat
public class Demo {
    static int termID = 0;

    public static class DemoMapper extends Mapper<Object, Text, IntWritable, Text> {

        static HashMap<String, Integer> termMapping = new HashMap<String, Integer>();
        HashMap<String, Integer> termMap = new HashMap<String, Integer>();

        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            BufferedReader reader = new BufferedReader(new FileReader(value.toString()));
            String line;
            String currentTerm;
            while ((line = reader.readLine()) != null) {
                StringTokenizer tokenizer = new StringTokenizer(line, " ");
                while (tokenizer.hasMoreTokens()) {
                    currentTerm = tokenizer.nextToken();
                    if (!termMap.containsKey(currentTerm)) {
                        if (!termMapping.containsKey(currentTerm)) {
                            termMapping.put(currentTerm, termID++);
                        }
                        termMap.put(currentTerm, 1);
                    } else {
                        termMap.put(currentTerm, termMap.get(currentTerm) + 1);
                    }
                }
            }
            reader.close();
        }
    }

    public static void main(String[] args) {
    }
}
I don't think you really need to share anything.
All you are doing here is a variant of a simple word count (of paths).
Just output (currentTerm, 1) and let the reducer handle the appropriate aggregation, as sketched below. You can also toss in a Combiner for improved performance.
You don't need to worry about duplicates - just look back over the WordCount example.
Also, I think your types should instead be extends Mapper<LongWritable, Text, Text, IntWritable> if you are reading a file and outputting (String, int) data.
There is also a MapWritable class, but that seems like overkill.
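A minimal sketch of that word-count-style mapper, using the types suggested above (the class name is illustrative; the reducer and combiner can be the standard WordCount summing reducer):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TermCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text term = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString(), " ");
        while (tokenizer.hasMoreTokens()) {
            term.set(tokenizer.nextToken());
            // emit (term, 1) and let the combiner/reducer sum the counts
            context.write(term, ONE);
        }
    }
}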

Determine which file path a block belongs to in Hadoop

I have multiple input paths for my job. Ex:
//Driver.class
for (String s : listFile) {
MultipleInputs.addInputPath(job, new Path(s), SequenceFileInputFormat.class);// ex: /home/path1, /home/path2, ...
}
.....
//Mapper.class
public void map(Text key, Data bytes, Context context) throws IOException, InterruptedException {
.....
}
My question is: is there any way to determine which file the current (key, value) pair in the map() function belongs to?
Since you are using SequenceFileInputFormat as your InputFormat: SequenceFileInputFormat uses SequenceFileRecordReader as its RecordReader and extends FileInputFormat, whose getSplits() method returns FileSplits that carry the Path, so the SequenceFileRecordReader can get the Path. What you need to do is make either the key or the value carry the Path, and that has to be done in the RecordReader.
Here are the steps:
Make a custom value class that contains the original value and the path:
class YourValClass implements Writable {
    Writable value; // your original value
    Path path;      // the path you want
    // write(DataOutput) and readFields(DataInput) still need to be implemented (see the sketch below)
}
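For illustration, a hedged sketch of what those Writable methods could look like, assuming the wrapped value is a Text (swap in your real value type; these method bodies are not part of the original answer):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

class YourValClass implements Writable {
    Writable value = new Text(); // assuming a Text payload for this sketch
    Path path;

    @Override
    public void write(DataOutput out) throws IOException {
        value.write(out);                                          // serialize the wrapped value
        new Text(path == null ? "" : path.toString()).write(out);  // serialize the path as a string
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        value.readFields(in);                                      // restore the wrapped value
        Text p = new Text();
        p.readFields(in);
        path = p.toString().isEmpty() ? null : new Path(p.toString());
    }
}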
Make a custom InputFormat class that extends SequenceFileInputFormat, and override the createRecordReader() method to return your custom RecordReader:
class YourInputFormat extends SequenceFileInputFormat<YourKeyClass, YourValClass> {
    @Override
    public RecordReader<YourKeyClass, YourValClass> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException {
        return new YourRecordReader(); // return your custom RecordReader
    }
}
Make the custom RecordReader, in which you can combine your value and path together:
class YourRecordReader extends SequenceFileRecordReader<YourKeyClass, YourValClass> {

    Path path;

    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        super.initialize(inputSplit, taskAttemptContext);
        FileSplit fileSplit = (FileSplit) inputSplit;
        this.path = fileSplit.getPath(); // assign the path
    }

    @Override
    public YourValClass getCurrentValue() {
        YourValClass val = super.getCurrentValue();
        if (null != val) {
            val.path = path; // set the path
        }
        return val;
    }
}
Now you can get the path from YourValClass value.
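For completeness, a hedged sketch of how the driver and mapper would then use these classes (assuming the class names above; setInputFormatClass is the standard new-API hook):

// driver
job.setInputFormatClass(YourInputFormat.class);

// mapper
public void map(YourKeyClass key, YourValClass value, Context context)
        throws IOException, InterruptedException {
    Path sourcePath = value.path; // the file this record came from
    // ... rest of your map logic ...
}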

How to effectively reduce the length of input to mapper

My data has 20 fields in the schema. Only the first three fields are important to me as far as my MapReduce program is concerned. How can I decrease the size of the input to the mapper so that only the first three fields are received?
1,2,3,4,5,6,7,8...20 columns in schema.
I want only 1,2,3 in the mapper to process it as offset and value.
NOTE: I can't use Pig, as some other logic is already implemented in MapReduce.
You need a custom RecordReader to do this:
public class TrimmedRecordReader implements RecordReader<LongWritable, Text> {

    private LineRecordReader lineReader;
    private LongWritable lineKey;
    private Text lineValue;

    public TrimmedRecordReader(JobConf job, FileSplit split) throws IOException {
        lineReader = new LineRecordReader(job, split);
        lineKey = lineReader.createKey();
        lineValue = lineReader.createValue();
    }

    public boolean next(LongWritable key, Text value) throws IOException {
        if (!lineReader.next(lineKey, lineValue)) {
            return false;
        }
        String[] fields = lineValue.toString().split(",");
        if (fields.length < 3) {
            throw new IOException("Invalid record received");
        }
        key.set(lineKey.get()); // keep the original byte offset as the key
        value.set(fields[0] + "," + fields[1] + "," + fields[2]);
        return true;
    }

    public LongWritable createKey() {
        return lineReader.createKey();
    }

    public Text createValue() {
        return lineReader.createValue();
    }

    public long getPos() throws IOException {
        return lineReader.getPos();
    }

    public void close() throws IOException {
        lineReader.close();
    }

    public float getProgress() throws IOException {
        return lineReader.getProgress();
    }
}
It should be pretty self-explanatory; it is just a thin wrapper around LineRecordReader.
Unfortunately, to invoke it you need to extend the InputFormat too. The following is enough:
public class TrimmedTextInputFormat extends FileInputFormat<LongWritable, Text> {

    public RecordReader<LongWritable, Text> getRecordReader(InputSplit input,
            JobConf job, Reporter reporter) throws IOException {
        reporter.setStatus(input.toString());
        return new TrimmedRecordReader(job, (FileSplit) input);
    }
}
Just don't forget to set it in the driver.
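With the old mapred API this code targets, that driver line would look something like this (assuming your JobConf instance is called conf):

conf.setInputFormat(TrimmedTextInputFormat.class);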
You can implement a custom input format in MapReduce to read only the required fields.
Just for reference, the following blog post explains how to read text as paragraphs:
http://blog.minjar.com/post/54759039969/mapreduce-custom-input-formats-reading

Hadoop Mapreduce: Custom Input Format

I have a file with text data separated by "^" characters:
SOME TEXT^GOES HERE^
AND A FEW^MORE
GOES HERE
I am writing a custom input format to delimit the rows using the "^" character, i.e. the output of the mapper should be like:
SOME TEXT
GOES HERE
AND A FEW
MORE GOES HERE
I have written a custom input format which extends FileInputFormat, and also a custom record reader that extends RecordReader. The code for my custom record reader is given below. I don't know how to proceed with this code. I am having trouble with the while loop in the nextKeyValue() method. How should I read the data from the split and generate my custom key/value? I am using the new mapreduce package instead of the old mapred package.
public class MyRecordReader extends RecordReader<LongWritable, Text>
{
long start, current, end;
Text value;
LongWritable key;
LineReader reader;
FileSplit split;
Path path;
FileSystem fs;
FSDataInputStream in;
Configuration conf;
@Override
public void initialize(InputSplit inputSplit, TaskAttemptContext cont) throws IOException, InterruptedException
{
conf = cont.getConfiguration();
split = (FileSplit)inputSplit;
path = split.getPath();
fs = path.getFileSystem(conf);
in = fs.open(path);
reader = new LineReader(in, conf);
start = split.getStart();
current = start;
end = split.getLength() + start;
}
@Override
public boolean nextKeyValue() throws IOException
{
if(key==null)
key = new LongWritable();
key.set(current);
if(value==null)
value = new Text();
long readSize = 0;
while(current<end)
{
Text tmpText = new Text();
readSize = read //here how should i read data from the split, and generate key-value?
if(readSize==0)
break;
current+=readSize;
}
if(readSize==0)
{
key = null;
value = null;
return false;
}
return true;
}
@Override
public float getProgress() throws IOException
{
}
@Override
public LongWritable getCurrentKey() throws IOException
{
}
@Override
public Text getCurrentValue() throws IOException
{
}
@Override
public void close() throws IOException
{
}
}
There is no need to implement that yourself. You can simply set the configuration value textinputformat.record.delimiter to be the circumflex character.
conf.set("textinputformat.record.delimiter", "^");
This should work fine with the normal TextInputFormat.
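A hedged sketch of where that property would go in the driver, assuming a Hadoop version whose TextInputFormat/LineRecordReader honours textinputformat.record.delimiter (the job name here is just illustrative):

Configuration conf = new Configuration();
conf.set("textinputformat.record.delimiter", "^");

Job job = new Job(conf, "caret-delimited-records");
job.setInputFormatClass(TextInputFormat.class); // the default, shown here for clarity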

Type mismatch in key from map, using SequenceFileInputFormat correctly

I am trying to run a recommender example from chapter 6 (listings 6.1 - 6.4) of the ebook Mahout in Action. There are two mapper/reducer pairs. Here is the code:
Mapper - 1
public class WikipediaToItemPrefsMapper extends
Mapper<LongWritable,Text,VarLongWritable,VarLongWritable> {
private static final Pattern NUMBERS = Pattern.compile("(\\d+)");
@Override
public void map(LongWritable key,
Text value,
Context context)
throws IOException, InterruptedException {
String line = value.toString();
Matcher m = NUMBERS.matcher(line);
m.find();
VarLongWritable userID = new VarLongWritable(Long.parseLong(m.group()));
VarLongWritable itemID = new VarLongWritable();
while (m.find()) {
itemID.set(Long.parseLong(m.group()));
context.write(userID, itemID);
}
}
}
Reducer - 1
public class WikipediaToUserVectorReducer extends
Reducer<VarLongWritable,VarLongWritable,VarLongWritable,VectorWritable> {
@Override
public void reduce(VarLongWritable userID,
Iterable<VarLongWritable> itemPrefs,
Context context)
throws IOException, InterruptedException {
Vector userVector = new RandomAccessSparseVector(
Integer.MAX_VALUE, 100);
for (VarLongWritable itemPref : itemPrefs) {
userVector.set((int)itemPref.get(), 1.0f);
}
//LongWritable userID_lw = new LongWritable(userID.get());
context.write(userID, new VectorWritable(userVector));
//context.write(userID_lw, new VectorWritable(userVector));
}
}
The reducer outputs a userID and a userVector, and it looks like this: 98955 {590:1.0 22:1.0 9059:1.0 3:1.0 2:1.0 1:1.0}, provided FileInputFormat and TextInputFormat are used in the driver.
I want to use another pair of mapper-reducer to process this data further:
Mapper - 2
public class UserVectorToCooccurenceMapper extends
Mapper<VarLongWritable,VectorWritable,IntWritable,IntWritable> {
@Override
public void map(VarLongWritable userID,
VectorWritable userVector,
Context context)
throws IOException, InterruptedException {
Iterator<Vector.Element> it = userVector.get().iterateNonZero();
while (it.hasNext()) {
int index1 = it.next().index();
Iterator<Vector.Element> it2 = userVector.get().iterateNonZero();
while (it2.hasNext()) {
int index2 = it2.next().index();
context.write(new IntWritable(index1),
new IntWritable(index2));
}
}
}
}
Reducer - 2
public class UserVectorToCooccurenceReducer extends
Reducer {
@Override
public void reduce(IntWritable itemIndex1,
Iterable<IntWritable> itemIndex2s,
Context context)
throws IOException, InterruptedException {
Vector cooccurrenceRow = new RandomAccessSparseVector(Integer.MAX_VALUE, 100);
for (IntWritable intWritable : itemIndex2s) {
int itemIndex2 = intWritable.get();
cooccurrenceRow.set(itemIndex2, cooccurrenceRow.get(itemIndex2) + 1.0);
}
context.write(itemIndex1, new VectorWritable(cooccurrenceRow));
}
}
This is the driver I am using:
public final class RecommenderJob extends Configured implements Tool {
@Override
public int run(String[] args) throws Exception {
Job job_preferenceValues = new Job (getConf());
job_preferenceValues.setJarByClass(RecommenderJob.class);
job_preferenceValues.setJobName("job_preferenceValues");
job_preferenceValues.setInputFormatClass(TextInputFormat.class);
job_preferenceValues.setOutputFormatClass(SequenceFileOutputFormat.class);
FileInputFormat.setInputPaths(job_preferenceValues, new Path(args[0]));
SequenceFileOutputFormat.setOutputPath(job_preferenceValues, new Path(args[1]));
job_preferenceValues.setMapOutputKeyClass(VarLongWritable.class);
job_preferenceValues.setMapOutputValueClass(VarLongWritable.class);
job_preferenceValues.setOutputKeyClass(VarLongWritable.class);
job_preferenceValues.setOutputValueClass(VectorWritable.class);
job_preferenceValues.setMapperClass(WikipediaToItemPrefsMapper.class);
job_preferenceValues.setReducerClass(WikipediaToUserVectorReducer.class);
job_preferenceValues.waitForCompletion(true);
Job job_cooccurence = new Job (getConf());
job_cooccurence.setJarByClass(RecommenderJob.class);
job_cooccurence.setJobName("job_cooccurence");
job_cooccurence.setInputFormatClass(SequenceFileInputFormat.class);
job_cooccurence.setOutputFormatClass(TextOutputFormat.class);
SequenceFileInputFormat.setInputPaths(job_cooccurence, new Path(args[1]));
FileOutputFormat.setOutputPath(job_cooccurence, new Path(args[2]));
job_cooccurence.setMapOutputKeyClass(VarLongWritable.class);
job_cooccurence.setMapOutputValueClass(VectorWritable.class);
job_cooccurence.setOutputKeyClass(IntWritable.class);
job_cooccurence.setOutputValueClass(VectorWritable.class);
job_cooccurence.setMapperClass(UserVectorToCooccurenceMapper.class);
job_cooccurence.setReducerClass(UserVectorToCooccurenceReducer.class);
job_cooccurence.waitForCompletion(true);
return 0;
}
public static void main(String[] args) throws Exception {
ToolRunner.run(new Configuration(), new RecommenderJob(), args);
}
}
The error that I get is:
java.io.IOException: Type mismatch in key from map: expected org.apache.mahout.math.VarLongWritable, received org.apache.hadoop.io.IntWritable
In the course of Googling for a fix, I found out that my issue is similar to this question. But the difference is that I am already using SequenceFileInputFormat and SequenceFileOutputFormat, I believe correctly. I also see that org.apache.mahout.cf.taste.hadoop.item.RecommenderJob does more or less something similar. According to my understanding and the Yahoo tutorial:
SequenceFileOutputFormat rapidly serializes arbitrary data types to the file; the corresponding SequenceFileInputFormat will deserialize the file into the same types and presents the data to the next Mapper in the same manner as it was emitted by the previous Reducer.
What am I doing wrong? I would really appreciate some pointers from someone. I spent the day trying to fix this and got nowhere :(
Your second mapper has the following signature:
public class UserVectorToCooccurenceMapper extends
Mapper<VarLongWritable,VectorWritable,IntWritable,IntWritable>
But you define the following in your driver code:
job_cooccurence.setMapOutputKeyClass(VarLongWritable.class);
job_cooccurence.setMapOutputValueClass(VectorWritable.class);
The reducer is expecting <IntWritable, IntWritable> as input, so you should just amend your driver code to:
job_cooccurence.setMapOutputKeyClass(IntWritable.class);
job_cooccurence.setMapOutputValueClass(IntWritable.class);

Resources