Determine which file path a block belongs to in Hadoop

I have multiple input paths for my job. For example:
for (String s : listFile) {
    MultipleInputs.addInputPath(job, new Path(s), SequenceFileInputFormat.class); // e.g. /home/path1, /home/path2, ...
}
public void map(Text key, Data bytes, Context context) throws IOException, InterruptedException {
    // ...
}
My question: is there any way to determine which file the current (key, value) pair in the map() function belongs to?

Since you are using SequenceFileInputFormat as your InputFormat, note that it uses SequenceFileRecordReader as its RecordReader and extends FileInputFormat, whose getSplits() method returns FileSplit objects that carry the Path. The SequenceFileRecordReader can therefore get the Path. What you need to do is make the key or the value carry the Path when it is produced, and that has to happen in the RecordReader.
Here are the steps:
Create a custom value class that contains the original value and the path:
class YourValClass implements Writable {
    Writable value; // your original value
    Path path;      // the path you want
    // write() and readFields() omitted
}
Make a custom InputFormat class that extends SequenceFileInputFormat, and override the createRecordReader() method to return your custom RecordReader:
class YourInputFormat extends SequenceFileInputFormat<YourKeyClass, YourValClass> {
    @Override
    public RecordReader<YourKeyClass, YourValClass> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException {
        return new YourRecordReader(); // return your custom RecordReader
    }
}
Make the custom RecordReader, in which you combine the value and the path:
class YourRecordReader extends SequenceFileRecordReader<YourKeyClass, YourValClass> {
    Path path;
    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        super.initialize(inputSplit, taskAttemptContext);
        FileSplit fileSplit = (FileSplit) inputSplit;
        this.path = fileSplit.getPath(); // assign the path
    }
    @Override
    public YourValClass getCurrentValue() {
        YourValClass val = super.getCurrentValue();
        if (null != val) {
            val.path = path; // set the path
        }
        return val;
    }
}
Now you can get the path from the YourValClass value inside map().
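One detail the steps above leave open is how the value class would serialize the path if the value is ever written out again. As far as I know, Hadoop's Path is not itself a Writable, so one option is to carry it as a string. The following is only a sketch under that assumption: PathTaggedValue is a hypothetical stand-in that mimics the write()/readFields() shape of the Writable contract using nothing but java.io, and the original value is simplified to a String.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for YourValClass. It follows the shape of Hadoop's
// Writable contract (write/readFields) but depends only on java.io.
class PathTaggedValue {
    String value = ""; // stands in for the original value (assumption: a string)
    String path = "";  // source file path, carried as a plain string

    void write(DataOutput out) throws IOException {
        out.writeUTF(value);
        out.writeUTF(path);
    }

    void readFields(DataInput in) throws IOException {
        // Fields must be read in the same order they were written.
        value = in.readUTF();
        path = in.readUTF();
    }

    // Serializes a value to bytes and reads it back into a fresh instance.
    static PathTaggedValue roundTrip(PathTaggedValue original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            original.write(new DataOutputStream(buffer));
            PathTaggedValue copy = new PathTaggedValue();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a real job the class would implement Writable and the record reader would set the path field from fileSplit.getPath().toString().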


How to effectively reduce the length of input to mapper

My data has 20 fields in the schema. Only the first three fields matter to my MapReduce program. How can I reduce the size of the mapper input so that only the first three fields are received?
The schema has columns 1,2,3,4,5,6,7,8...20.
I want only columns 1,2,3 to reach the mapper, to be processed as offset and value.
Note: I can't use Pig, because other logic is already implemented in MapReduce.
You need a custom RecordReader to do this:
public class TrimmedRecordReader implements RecordReader<LongWritable, Text> {
    private LineRecordReader lineReader;
    private LongWritable lineKey;
    private Text lineValue;
    public TrimmedRecordReader(JobConf job, FileSplit split) throws IOException {
        lineReader = new LineRecordReader(job, split);
        lineKey = lineReader.createKey();
        lineValue = lineReader.createValue();
    }
    public boolean next(LongWritable key, Text value) throws IOException {
        if (!lineReader.next(lineKey, lineValue)) {
            return false;
        }
        key.set(lineKey.get()); // pass the line offset through as the key
        String[] fields = lineValue.toString().split(",");
        if (fields.length < 3) {
            throw new IOException("Invalid record received");
        }
        value.set(fields[0] + "," + fields[1] + "," + fields[2]);
        return true;
    }
    public LongWritable createKey() {
        return lineReader.createKey();
    }
    public Text createValue() {
        return lineReader.createValue();
    }
    public long getPos() throws IOException {
        return lineReader.getPos();
    }
    public void close() throws IOException {
        lineReader.close();
    }
    public float getProgress() throws IOException {
        return lineReader.getProgress();
    }
}
It should be pretty self-explanatory: it is just a wrapper around LineRecordReader.
Unfortunately, to invoke it you need to extend the InputFormat too. The following is enough:
public class TrimmedTextInputFormat extends FileInputFormat<LongWritable, Text> {
    public RecordReader<LongWritable, Text> getRecordReader(InputSplit input,
            JobConf job, Reporter reporter) throws IOException {
        return new TrimmedRecordReader(job, (FileSplit) input);
    }
}
Just don't forget to set it as the input format in the driver.
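The trimming step inside next() can also be tried on its own, outside Hadoop. Below is a minimal sketch; FieldTrimmer and firstFields are hypothetical names, not part of the answer's code. It differs from the reader above in one detail: it passes a limit to split() so that the remaining 17 columns are never split apart.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Hadoop: keeps the leading fields of a CSV line.
class FieldTrimmer {
    // Returns the first `count` comma-separated fields of `line`, re-joined with commas.
    static String firstFields(String line, int count) {
        // A limit of count + 1 stops splitting once the wanted fields are isolated;
        // the unsplit remainder of the line lands in the last array slot.
        String[] fields = line.split(",", count + 1);
        if (fields.length < count) {
            throw new IllegalArgumentException("Invalid record received: " + line);
        }
        return String.join(",", Arrays.copyOfRange(fields, 0, count));
    }
}
```

For example, FieldTrimmer.firstFields("1,2,3,4,5", 3) gives "1,2,3". Note that this only shrinks what the mapper sees; the whole line is still read from disk.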
You can implement a custom input format in MapReduce to read only the required fields.
Just for reference, the following blog post explains how to read text as paragraphs

Hadoop Mapreduce: Custom Input Format

I have a file with data having text and "^" in between:
I am writing a custom input format to delimit the rows using the "^" character, i.e. the output of the mapper should be like:
I have written a custom input format that extends FileInputFormat, and a custom record reader that extends RecordReader. The code for my custom record reader is given below. I don't know how to proceed with it: I am having trouble with the while loop in the nextKeyValue() method. How should I read the data from a split and generate my custom key-value pairs? I am using the new mapreduce package throughout instead of the old mapred package.
public class MyRecordReader extends RecordReader<LongWritable, Text>
long start, current, end;
Text value;
LongWritable key;
LineReader reader;
FileSplit split;
Path path;
FileSystem fs;
FSDataInputStream in;
Configuration conf;
public void initialize(InputSplit inputSplit, TaskAttemptContext cont) throws IOException, InterruptedException
conf = cont.getConfiguration();
split = (FileSplit)inputSplit;
path = split.getPath();
fs = path.getFileSystem(conf);
in = fs.open(path);
reader = new LineReader(in, conf);
start = split.getStart();
current = start;
end = split.getLength() + start;
public boolean nextKeyValue() throws IOException
key = new LongWritable();
value = new Text();
long readSize = 0;
Text tmpText = new Text();
readSize = read //here how should i read data from the split, and generate key-value?
key = null;
value = null;
return false;
return true;
public float getProgress() throws IOException
public LongWritable getCurrentKey() throws IOException
public Text getCurrentValue() throws IOException
public void close() throws IOException
There is no need to implement that yourself. You can simply set the configuration value textinputformat.record.delimiter to be the circumflex character.
conf.set("textinputformat.record.delimiter", "^");
This should work fine with the normal TextInputFormat.
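To get a feel for which records that setting should produce, the same splitting can be mimicked in plain Java. This is only an illustration, not Hadoop code: CaretRecords is a hypothetical name, and the sketch assumes the delimiter is matched as a literal character rather than as a regular expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hadoop: splits text into "^"-delimited records.
class CaretRecords {
    static List<String> split(String data) {
        List<String> records = new ArrayList<>();
        // Pattern.quote makes "^" a literal character rather than a regex anchor.
        try (Scanner scanner = new Scanner(data).useDelimiter(Pattern.quote("^"))) {
            while (scanner.hasNext()) {
                records.add(scanner.next());
            }
        }
        return records;
    }
}
```

Newlines inside a record are kept as part of that record, which is the main difference from the default line-based splitting.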

How can I emit key values at the end of the whole file processing?

The mapper reads lines from a file... How can I emit key values at the end, after the whole file has been scanned, and not per line?
Using the new mapreduce API, you can override the Mapper.cleanup(Context) method and use Context.write(K, V) as you normally would in the map method.
@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
    context.write(new Text("key"), new Text("value"));
}
With the old mapred API, you can override the close() method, but you'll need to store a reference to the OutputCollector given to the map method:
private OutputCollector cachedCollector = null;
public void map(LongWritable key, Text value, OutputCollector outputCollector, Reporter reporter) throws IOException {
    if (cachedCollector == null) {
        cachedCollector = outputCollector;
    }
    // ...
}
public void close() throws IOException {
    if (cachedCollector != null) { // stays null if map() was never called
        cachedCollector.collect(outputKey, outputValue);
    }
}
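Both variants follow the same pattern: map() only accumulates state, and the single emit happens in the end-of-input hook. Here is a minimal sketch of that pattern without any Hadoop dependency. LineCountingMapper is a hypothetical stand-in, and the BiConsumer plays the role of Context.write() or OutputCollector.collect().

```java
import java.util.function.BiConsumer;

// Hypothetical stand-in for a Mapper: map() only accumulates, cleanup() emits once.
class LineCountingMapper {
    private long lines = 0;

    // Called once per input record; nothing is emitted here.
    void map(String line) {
        lines++;
    }

    // Called once after the last record; emits the single summary pair.
    void cleanup(BiConsumer<String, Long> output) {
        output.accept("lines", lines);
    }
}
```

Keep in mind that each map task has its own instance, so with several input splits you get one such pair per split, not one per job.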
Do you have one key/value pair for the whole file, or several?
Case #1:
Use a WholeFileInputFormat (a custom input format, not one that ships with Hadoop). You will receive the full file content as a single record. You can split this into records, process them all, and emit the final key/value at the end of your processing.
Case #2:
Use the same FileInputFormat. Store all key/values in temporary storage. At the end, go through that storage, emit the key/values you want, and suppress those you don't.
As an alternative to Chris's answer, you can achieve this by overriding the run() method of the Mapper class (new API):
public static class Map extends Mapper<IntWritable, IntWritable, IntWritable, IntWritable> {
    // map method here
    // Override run()
    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context); // the default run() calls this first
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
        // Have your last <key,value> emitted here
        context.write(lastOutputKey, lastOutputValue);
        cleanup(context); // the default run() calls this last
    }
}
To ensure that each mapper gets a whole file to process, create your own version of a FileInputFormat (for example by extending TextInputFormat) and override isSplitable(), like this:
class NonSplittableFileInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }
}

Custom WritableCompare displays object reference as output

I am new to Hadoop and Java, and I feel there is something obvious I am just missing. I am using Hadoop 1.0.3 if that means anything.
My goal in using Hadoop is to take a bunch of files and parse them one file at a time (as opposed to line by line). Each file will produce multiple key-values, but the context of the other lines is important. The key and value are multi-value/composite, so I have implemented WritableComparable for the key and Writable for the value. Because processing each file takes a bit of CPU, I want to save the output of the mapper and then run multiple reducers later on.
For the composite keys, I followed [][1]
The problem is, the output is just Java object references as opposed to the composite key and value. Example:
LinkKeyWritable#bd2f9730 LinkValueWritable#8752408c
I am not sure if the problem is related to not reducing the data at all or
Here is my main class:
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(Parser.class);
PerFileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
And my Mapper class:
public class RawMap extends MapReduceBase implements
        Mapper<NullWritable, Text, LinkKeyWritable, LinkValueWritable> {
    public void map(NullWritable key, Text value,
            OutputCollector<LinkKeyWritable, LinkValueWritable> output,
            Reporter reporter) throws IOException {
        String json = value.toString();
        SerpyReader reader = new SerpyReader(json);
        GoogleParser parser = new GoogleParser(reader);
        for (String page : reader.getPages()) {
            String content = reader.readPageContent(page);
            for (Link link : parser.getLinks()) {
                LinkKeyWritable linkKey = new LinkKeyWritable(link);
                LinkValueWritable linkValue = new LinkValueWritable(link);
                output.collect(linkKey, linkValue);
            }
        }
    }
}
Link is basically a struct of various information that gets split between LinkKeyWritable and LinkValueWritable:
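public class LinkKeyWritable implements WritableComparable<LinkKeyWritable> {
    protected Link link;
    public LinkKeyWritable() {
        link = new Link();
    }
    public LinkKeyWritable(Link link) {
        super();
        this.link = link;
    }
    public void readFields(DataInput in) throws IOException {
        link.batchDay = in.readLong();
        link.source = in.readUTF();
        link.domain = in.readUTF();
        link.path = in.readUTF();
    }
    public void write(DataOutput out) throws IOException {
        // mirrors readFields(): same fields, same order
        out.writeLong(link.batchDay);
        out.writeUTF(link.source);
        out.writeUTF(link.domain);
        out.writeUTF(link.path);
    }
    public int compareTo(LinkKeyWritable o) {
        // field order assumed to match hashCode() and equals()
        return ComparisonChain.start()
                .compare(link.batchDay, o.link.batchDay)
                .compare(link.source, o.link.source)
                .compare(link.domain, o.link.domain)
                .compare(link.path, o.link.path)
                .result();
    }
    public int hashCode() {
        return Objects.hashCode(link.batchDay, link.source, link.domain, link.path);
    }
    public boolean equals(final Object obj) {
        if (obj instanceof LinkKeyWritable) {
            final LinkKeyWritable o = (LinkKeyWritable) obj;
            return Objects.equal(link.batchDay, o.link.batchDay)
                    && Objects.equal(link.source, o.link.source)
                    && Objects.equal(link.domain, o.link.domain)
                    && Objects.equal(link.path, o.link.path);
        }
        return false;
    }
}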
public class LinkValueWritable implements Writable {
    protected Link link;
    public LinkValueWritable() {
        link = new Link();
    }
    public LinkValueWritable(Link link) {
        this.link = new Link();
        this.link.type = link.type;
        this.link.description = link.description;
    }
    public void readFields(DataInput in) throws IOException {
        link.type = in.readUTF();
        link.description = in.readUTF();
    }
    public void write(DataOutput out) throws IOException {
        // mirrors readFields(): same fields, same order
        out.writeUTF(link.type);
        out.writeUTF(link.description);
    }
    public int hashCode() {
        return Objects.hashCode(link.type, link.description);
    }
    public boolean equals(final Object obj) {
        if (obj instanceof LinkValueWritable) {
            final LinkValueWritable o = (LinkValueWritable) obj;
            return Objects.equal(link.type, o.link.type)
                    && Objects.equal(link.description, o.link.description);
        }
        return false;
    }
}
I think the answer is in the implementation of the TextOutputFormat. Specifically, the LineRecordWriter's writeObject method:
/**
 * Write the object to the byte stream, handling Text as a special
 * case.
 * @param o the object to print
 * @throws IOException if the write throws, we pass it on
 */
private void writeObject(Object o) throws IOException {
    if (o instanceof Text) {
        Text to = (Text) o;
        out.write(to.getBytes(), 0, to.getLength());
    } else {
        out.write(o.toString().getBytes(utf8));
    }
}
As you can see, if your key or value is not a Text object, it calls the toString method on it and writes that out. Since you've left toString unimplemented in your key and value, it's using the Object class's implementation, which is writing out the reference.
I'd say that you should try writing an appropriate toString function or using a different OutputFormat.
It looks like you have a list of objects, just as you wanted. You need to implement toString() on your Writables if you want a human-readable version printed out instead of an ugly Java object reference.
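A minimal sketch of such a toString(), assuming the key's four fields are simply printed tab-separated. LinkKey here is a simplified, hypothetical stand-in for LinkKeyWritable with the Writable plumbing left out.

```java
// Hypothetical, simplified stand-in for LinkKeyWritable (no Hadoop dependency).
class LinkKey {
    final long batchDay;
    final String source;
    final String domain;
    final String path;

    LinkKey(long batchDay, String source, String domain, String path) {
        this.batchDay = batchDay;
        this.source = source;
        this.domain = domain;
        this.path = path;
    }

    // Text-based output calls toString() on non-Text objects, so this is what
    // would appear in the output file in place of the default class@hash form.
    @Override
    public String toString() {
        return batchDay + "\t" + source + "\t" + domain + "\t" + path;
    }
}
```

Tab is a convenient separator here because TextOutputFormat also puts a tab between key and value by default, so each output line becomes one flat tab-separated row. If a field can itself contain a tab, another separator or some escaping would be needed.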

Type mismatch in key from map, using SequenceFileInputFormat correctly

I am trying to run a recommender example from chapter 6 (listings 6.1 to 6.4) of the ebook Mahout in Action. There are two mapper/reducer pairs. Here is the code:
Mapper - 1
public class WikipediaToItemPrefsMapper extends
        Mapper<LongWritable,Text,VarLongWritable,VarLongWritable> {
    private static final Pattern NUMBERS = Pattern.compile("(\\d+)");
    public void map(LongWritable key,
            Text value,
            Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        Matcher m = NUMBERS.matcher(line);
        m.find(); // the first number on the line is the user ID
        VarLongWritable userID = new VarLongWritable(Long.parseLong(m.group()));
        VarLongWritable itemID = new VarLongWritable();
        while (m.find()) {
            itemID.set(Long.parseLong(m.group()));
            context.write(userID, itemID);
        }
    }
}
Reducer - 1
public class WikipediaToUserVectorReducer extends
        Reducer<VarLongWritable,VarLongWritable,VarLongWritable,VectorWritable> {
    public void reduce(VarLongWritable userID,
            Iterable<VarLongWritable> itemPrefs,
            Context context)
            throws IOException, InterruptedException {
        Vector userVector = new RandomAccessSparseVector(
                Integer.MAX_VALUE, 100);
        for (VarLongWritable itemPref : itemPrefs) {
            userVector.set((int)itemPref.get(), 1.0f);
        }
        //LongWritable userID_lw = new LongWritable(userID.get());
        context.write(userID, new VectorWritable(userVector));
        //context.write(userID_lw, new VectorWritable(userVector));
    }
}
The reducer outputs a userID and a userVector, and it looks like this: 98955 {590:1.0 22:1.0 9059:1.0 3:1.0 2:1.0 1:1.0}, provided FileInputFormat and TextInputFormat are used in the driver.
I want to use another pair of mapper-reducer to process this data further:
Mapper - 2
public class UserVectorToCooccurenceMapper extends
        Mapper<VarLongWritable,VectorWritable,IntWritable,IntWritable> {
    public void map(VarLongWritable userID,
            VectorWritable userVector,
            Context context)
            throws IOException, InterruptedException {
        Iterator<Vector.Element> it = userVector.get().iterateNonZero();
        while (it.hasNext()) {
            int index1 = it.next().index();
            Iterator<Vector.Element> it2 = userVector.get().iterateNonZero();
            while (it2.hasNext()) {
                int index2 = it2.next().index();
                context.write(new IntWritable(index1),
                        new IntWritable(index2));
            }
        }
    }
}
Reducer - 2
public class UserVectorToCooccurenceReducer extends
        Reducer<IntWritable,IntWritable,IntWritable,VectorWritable> {
    public void reduce(IntWritable itemIndex1,
            Iterable<IntWritable> itemIndex2s,
            Context context)
            throws IOException, InterruptedException {
        Vector cooccurrenceRow = new RandomAccessSparseVector(Integer.MAX_VALUE, 100);
        for (IntWritable intWritable : itemIndex2s) {
            int itemIndex2 = intWritable.get();
            cooccurrenceRow.set(itemIndex2, cooccurrenceRow.get(itemIndex2) + 1.0);
        }
        context.write(itemIndex1, new VectorWritable(cooccurrenceRow));
    }
}
This is the driver I am using:
public final class RecommenderJob extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job_preferenceValues = new Job(getConf());
        FileInputFormat.setInputPaths(job_preferenceValues, new Path(args[0]));
        SequenceFileOutputFormat.setOutputPath(job_preferenceValues, new Path(args[1]));
        Job job_cooccurence = new Job(getConf());
        SequenceFileInputFormat.setInputPaths(job_cooccurence, new Path(args[1]));
        FileOutputFormat.setOutputPath(job_cooccurence, new Path(args[2]));
        return 0;
    }
    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new RecommenderJob(), args);
    }
}
The error that I get is: Type mismatch in key from map: expected org.apache.mahout.math.VarLongWritable, received
In course of Googling for a fix, I found out that my issue is similar to this question. But the difference is that I am already using SequenceFileInputFormat and SequenceFileOutputFormat, I believe correctly. I also see that does more or less something similar. In my understanding & Yahoo Tutorial
SequenceFileOutputFormat rapidly serializes arbitrary data types to the file; the corresponding SequenceFileInputFormat will deserialize the file into the same types and presents the data to the next Mapper in the same manner as it was emitted by the previous Reducer.
What am I doing wrong? Will really appreciate some pointers from someone.. I spent the day trying to fix this and got nowhere :(
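Your second mapper has the following signature:
public class UserVectorToCooccurenceMapper extends
        Mapper<VarLongWritable,VectorWritable,IntWritable,IntWritable> {
But, judging by the error message, your driver code declares VarLongWritable as the map output key class for that job.
The mapper emits <IntWritable, IntWritable>, which is also what the reducer expects as input, so you should just amend your driver code to declare those types as the map output:
job_cooccurence.setMapOutputKeyClass(IntWritable.class);
job_cooccurence.setMapOutputValueClass(IntWritable.class);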
