How to effectively reduce the length of input to mapper - hadoop

My data has 20 fields in the schema. Only the first three fields are important to me as far as my MapReduce program is concerned. How can I reduce the size of the input to the mapper so that only the first three fields are received?
1,2,3,4,5,6,7,8...20 columns in schema.
I want only 1,2,3 in the mapper to process it as offset and value.
NOTE: I can't use Pig, because other logic is already implemented in MapReduce.

You need a custom RecordReader to do this:
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;

public class TrimmedRecordReader implements RecordReader<LongWritable, Text> {

    private LineRecordReader lineReader;
    private LongWritable lineKey;
    private Text lineValue;

    public TrimmedRecordReader(JobConf job, FileSplit split) throws IOException {
        lineReader = new LineRecordReader(job, split);
        lineKey = lineReader.createKey();
        lineValue = lineReader.createValue();
    }

    public boolean next(LongWritable key, Text value) throws IOException {
        if (!lineReader.next(lineKey, lineValue)) {
            return false;
        }
        String[] fields = lineValue.toString().split(",");
        if (fields.length < 3) {
            throw new IOException("Invalid record received");
        }
        // Pass the offset through and keep only the first three fields.
        key.set(lineKey.get());
        value.set(fields[0] + "," + fields[1] + "," + fields[2]);
        return true;
    }

    public LongWritable createKey() {
        return lineReader.createKey();
    }

    public Text createValue() {
        return lineReader.createValue();
    }

    public long getPos() throws IOException {
        return lineReader.getPos();
    }

    public void close() throws IOException {
        lineReader.close();
    }

    public float getProgress() throws IOException {
        return lineReader.getProgress();
    }
}
It should be pretty self-explanatory; it is just a wrapper around LineRecordReader.
Unfortunately, to invoke it you need to extend the InputFormat too. The following is enough:
public class TrimmedTextInputFormat extends FileInputFormat<LongWritable, Text> {

    public RecordReader<LongWritable, Text> getRecordReader(InputSplit input,
            JobConf job, Reporter reporter) throws IOException {
        reporter.setStatus(input.toString());
        return new TrimmedRecordReader(job, (FileSplit) input);
    }
}
Just don't forget to set it in the driver.
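For completeness, a minimal driver sketch for the old mapred API used above (MyDriver and MyMapper are placeholder names for your own classes):
JobConf conf = new JobConf(MyDriver.class);
conf.setJobName("trim-to-three-fields");

// Use the custom input format so the mappers only ever see the first three fields.
conf.setInputFormat(TrimmedTextInputFormat.class);

FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));

conf.setMapperClass(MyMapper.class);
JobClient.runJob(conf);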

You can implement a custom input format in MapReduce to read only the required fields.
Just for reference, the following blog post explains how to read text as paragraphs:
http://blog.minjar.com/post/54759039969/mapreduce-custom-input-formats-reading

Related

How to share HashMap between Mappers in Hadoop?

Can I share a HashMap between different Mappers with the same values, like a static variable? I am running a job on a Hadoop cluster, and I am trying to share variable values between all mappers, which run on different datanodes.
INPUT ==> FileID FilePath
InputFormat => KeyValueTextInputFormat
public class Demo {

    static int termID = 0;

    public static class DemoMapper extends Mapper<Object, Text, IntWritable, Text> {

        static HashMap<String, Integer> termMapping = new HashMap<String, Integer>();

        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Per-file term counts.
            HashMap<String, Integer> termMap = new HashMap<String, Integer>();
            BufferedReader reader = new BufferedReader(new FileReader(value.toString()));
            String line;
            String currentTerm;
            while ((line = reader.readLine()) != null) {
                StringTokenizer tokenizer = new StringTokenizer(line, " ");
                while (tokenizer.hasMoreTokens()) {
                    currentTerm = tokenizer.nextToken();
                    if (!termMap.containsKey(currentTerm)) {
                        if (!termMapping.containsKey(currentTerm)) {
                            termMapping.put(currentTerm, termID++);
                        }
                        termMap.put(currentTerm, 1);
                    } else {
                        termMap.put(currentTerm, termMap.get(currentTerm) + 1);
                    }
                }
            }
            reader.close();
        }
    }

    public static void main(String[] args) {
    }
}
I don't think you really need to share anything.
All you are doing here is a variant of simple word count (of paths).
Just output (currentTerm, 1) and let the reducer handle the appropriate aggregation. You can also toss in a Combiner for improved performance.
You don't need to worry about duplicates - just look back over the WordCount example.
Also, I think your types should instead be extends Mapper<LongWritable, Text, Text, IntWritable> if you are reading a file and outputting (String, int) data.
There is also a MapWritable class, but that seems like overkill.
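Just as an illustration, a minimal sketch of that word-count style mapper, assuming you keep KeyValueTextInputFormat so the mapper receives (FileID, FilePath) as Text/Text; with plain TextInputFormat the input types would be LongWritable/Text as noted above. TermCountMapper is a placeholder name:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TermCountMapper extends Mapper<Text, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text term = new Text();

    @Override
    protected void map(Text fileID, Text filePath, Context context)
            throws IOException, InterruptedException {
        // Read the file named in the value and emit (term, 1) for every token.
        BufferedReader reader = new BufferedReader(new FileReader(filePath.toString()));
        String line;
        while ((line = reader.readLine()) != null) {
            StringTokenizer tokenizer = new StringTokenizer(line, " ");
            while (tokenizer.hasMoreTokens()) {
                term.set(tokenizer.nextToken());
                context.write(term, ONE);
            }
        }
        reader.close();
    }
}
A standard sum reducer (optionally reused as a Combiner) then produces the per-term totals, exactly as in the stock WordCount example.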

Hadoop Mapreduce: Custom Input Format

I have a file with data having text and "^" in between:
SOME TEXT^GOES HERE^
AND A FEW^MORE
GOES HERE
I am writing a custom input format to delimit the rows using the "^" character, i.e. the output of the mapper should look like:
SOME TEXT
GOES HERE
AND A FEW
MORE GOES HERE
I have written a custom input format which extends FileInputFormat and also a custom record reader that extends RecordReader. The code for my custom record reader is given below. I don't know how to proceed with this code. I am having trouble with the nextKeyValue() method in the while loop part. How should I read the data from a split and generate my custom key/value? I am using the new mapreduce package instead of the old mapred package.
public class MyRecordReader extends RecordReader<LongWritable, Text> {

    long start, current, end;
    Text value;
    LongWritable key;
    LineReader reader;
    FileSplit split;
    Path path;
    FileSystem fs;
    FSDataInputStream in;
    Configuration conf;

    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext cont) throws IOException, InterruptedException {
        conf = cont.getConfiguration();
        split = (FileSplit) inputSplit;
        path = split.getPath();
        fs = path.getFileSystem(conf);
        in = fs.open(path);
        reader = new LineReader(in, conf);
        start = split.getStart();
        current = start;
        end = split.getLength() + start;
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (key == null)
            key = new LongWritable();
        key.set(current);
        if (value == null)
            value = new Text();
        long readSize = 0;
        while (current < end) {
            Text tmpText = new Text();
            readSize = read //here how should i read data from the split, and generate key-value?
            if (readSize == 0)
                break;
            current += readSize;
        }
        if (readSize == 0) {
            key = null;
            value = null;
            return false;
        }
        return true;
    }

    @Override
    public float getProgress() throws IOException {
    }

    @Override
    public LongWritable getCurrentKey() throws IOException {
    }

    @Override
    public Text getCurrentValue() throws IOException {
    }

    @Override
    public void close() throws IOException {
    }
}
There is no need to implement that yourself. You can simply set the configuration value textinputformat.record.delimiter to be the circumflex character.
conf.set("textinputformat.record.delimiter", "^");
This should work fine with the normal TextInputFormat.
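For example, with the new mapreduce API the delimiter just has to be set on the Configuration before the job is built (a minimal sketch; the job name is arbitrary):
Configuration conf = new Configuration();
// Records now end at '^' instead of '\n'.
conf.set("textinputformat.record.delimiter", "^");

Job job = new Job(conf, "caret-delimited");   // Job.getInstance(conf) on newer Hadoop
job.setInputFormatClass(TextInputFormat.class);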

How can I emit key values at the end of the whole file processing?

The mapper reads lines from a file... How can I emit key/values at the end, after scanning the whole file, and not per line?
Using the new mapreduce API, you can override the Mapper.cleanup(Context) method and use Context.write(K, V) as you normally would in the map method.
@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
    context.write(new Text("key"), new Text("value"));
}
With the old mapred API you can override the close() method - but you'll need to store a reference to the OutputCollector given to the map method:
private OutputCollector cachedCollector = null;

public void map(LongWritable key, Text value, OutputCollector outputCollector, Reporter reporter) throws IOException {
    if (cachedCollector == null) {
        cachedCollector = outputCollector;
    }
    // ...
}

public void close() throws IOException {
    cachedCollector.collect(outputKey, outputValue);
}
Do you have one key/value for the whole file, or multiple?
If it is case #1:
Use WholeFileInputFormat. You will receive the full file content as a single record. You can split this into records, process all of them, and emit the final key/value at the end of your processing.
Case #2:
Use the same FileInputFormat. Store all key/values in temporary storage. At the end, access your temporary storage, emit whatever key/values you want, and suppress those you don't want (see the sketch below).
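A minimal sketch of case #2 with the new mapreduce API (BufferingMapper, the "summary" key and the shouldEmit() filter are placeholders for your own logic):
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BufferingMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Temporary storage for everything seen during the map phase.
    private final List<String> buffered = new ArrayList<String>();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        buffered.add(value.toString());   // do not write anything yet
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        for (String line : buffered) {
            if (shouldEmit(line)) {        // suppress whatever you don't want
                context.write(new Text("summary"), new Text(line));
            }
        }
    }

    private boolean shouldEmit(String line) {
        return !line.isEmpty();            // placeholder condition
    }
}
Keep in mind the buffer lives in mapper memory, so this only works when one mapper's input comfortably fits on the heap.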
Another alternative to Chris's answer is to override the run() of the Mapper class (new API):
public static class Map extends Mapper<IntWritable, IntWritable, IntWritable, IntWritable> {

    // map method here

    // Override the run()
    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
        // Have your last <key,value> emitted here
        context.write(lastOutputKey, lastOutputValue);
        cleanup(context);
    }
}
And in order to ensure that each mapper gets one whole file to process, you have to create your own FileInputFormat subclass and override isSplitable(), like this (new API, extending TextInputFormat so the usual line reading is inherited):
class NonSplittableFileInputFormat extends TextInputFormat {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}
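Then wire it into the driver as usual (a one-line sketch; job is assumed to be your Job instance):
job.setInputFormatClass(NonSplittableFileInputFormat.class);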

Custom WritableCompare displays object reference as output

I am new to Hadoop and Java, and I feel there is something obvious I am just missing. I am using Hadoop 1.0.3 if that means anything.
My goal for using Hadoop is to take a bunch of files and parse them one file at a time (as opposed to line by line). Each file will produce multiple key/values, but context from the other lines is important. The key and value are multi-value/composite, so I have implemented WritableComparable for the key and Writable for the value. Because the processing of each file takes a bit of CPU, I want to save the output of the mapper, then run multiple reducers later on.
For the composite keys, I followed http://stackoverflow.com/questions/12427090/hadoop-composite-key
The problem is, the output is just Java object references as opposed to the composite key and value. Example:
LinkKeyWritable#bd2f9730 LinkValueWritable#8752408c
I am not sure if the problem is related to not reducing the data at all, or something else.
Here is my main class:
public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(Parser.class);
    conf.setJobName("raw_parser");

    conf.setOutputKeyClass(LinkKeyWritable.class);
    conf.setOutputValueClass(LinkValueWritable.class);

    conf.setMapperClass(RawMap.class);
    conf.setNumMapTasks(0);

    conf.setInputFormat(PerFileInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    PerFileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
}
And my Mapper class:
public class RawMap extends MapReduceBase implements
        Mapper<NullWritable, Text, LinkKeyWritable, LinkValueWritable> {

    public void map(NullWritable key, Text value,
            OutputCollector<LinkKeyWritable, LinkValueWritable> output,
            Reporter reporter) throws IOException {
        String json = value.toString();
        SerpyReader reader = new SerpyReader(json);
        GoogleParser parser = new GoogleParser(reader);
        for (String page : reader.getPages()) {
            String content = reader.readPageContent(page);
            parser.addPage(content);
        }
        for (Link link : parser.getLinks()) {
            LinkKeyWritable linkKey = new LinkKeyWritable(link);
            LinkValueWritable linkValue = new LinkValueWritable(link);
            output.collect(linkKey, linkValue);
        }
    }
}
Link is basically a struct of various information that gets split between LinkKeyWritable and LinkValueWritable.
LinkKeyWritable:
public class LinkKeyWritable implements WritableComparable<LinkKeyWritable> {

    protected Link link;

    public LinkKeyWritable() {
        super();
        link = new Link();
    }

    public LinkKeyWritable(Link link) {
        super();
        this.link = link;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        link.batchDay = in.readLong();
        link.source = in.readUTF();
        link.domain = in.readUTF();
        link.path = in.readUTF();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(link.batchDay);
        out.writeUTF(link.source);
        out.writeUTF(link.domain);
        out.writeUTF(link.path);
    }

    @Override
    public int compareTo(LinkKeyWritable o) {
        return ComparisonChain.start().
                compare(link.batchDay, o.link.batchDay).
                compare(link.domain, o.link.domain).
                compare(link.path, o.link.path).
                result();
    }

    @Override
    public int hashCode() {
        return Objects.hashCode(link.batchDay, link.source, link.domain, link.path);
    }

    @Override
    public boolean equals(final Object obj) {
        if (obj instanceof LinkKeyWritable) {
            final LinkKeyWritable o = (LinkKeyWritable) obj;
            return Objects.equal(link.batchDay, o.link.batchDay)
                    && Objects.equal(link.source, o.link.source)
                    && Objects.equal(link.domain, o.link.domain)
                    && Objects.equal(link.path, o.link.path);
        }
        return false;
    }
}
LinkValueWritable:
public class LinkValueWritable implements Writable {

    protected Link link;

    public LinkValueWritable() {
        link = new Link();
    }

    public LinkValueWritable(Link link) {
        this.link = new Link();
        this.link.type = link.type;
        this.link.description = link.description;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        link.type = in.readUTF();
        link.description = in.readUTF();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(link.type);
        out.writeUTF(link.description);
    }

    @Override
    public int hashCode() {
        return Objects.hashCode(link.type, link.description);
    }

    @Override
    public boolean equals(final Object obj) {
        if (obj instanceof LinkValueWritable) {
            final LinkValueWritable o = (LinkValueWritable) obj;
            return Objects.equal(link.type, o.link.type)
                    && Objects.equal(link.description, o.link.description);
        }
        return false;
    }
}
I think the answer is in the implementation of the TextOutputFormat. Specifically, the LineRecordWriter's writeObject method:
/**
 * Write the object to the byte stream, handling Text as a special
 * case.
 * @param o the object to print
 * @throws IOException if the write throws, we pass it on
 */
private void writeObject(Object o) throws IOException {
    if (o instanceof Text) {
        Text to = (Text) o;
        out.write(to.getBytes(), 0, to.getLength());
    } else {
        out.write(o.toString().getBytes(utf8));
    }
}
As you can see, if your key or value is not a Text object, it calls the toString method on it and writes that out. Since you've left toString unimplemented in your key and value, it's using the Object class's implementation, which is writing out the reference.
I'd say that you should try writing an appropriate toString function or using a different OutputFormat.
It looks like you have a list of objects just like you wanted. You need to implement toString() on your writable if you want a human-readable version printed out instead of an ugly Java reference.
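For example, a minimal toString() sketch for the key class above (the comma separator and field order are just one reasonable choice):
@Override
public String toString() {
    // Printed by TextOutputFormat instead of the object reference.
    return link.batchDay + "," + link.source + "," + link.domain + "," + link.path;
}
A similar toString() on LinkValueWritable (joining link.type and link.description) takes care of the value side.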

Type mismatch in key from map, using SequenceFileInputFormat correctly

I am trying to run a recommender example from chapter 6 (listings 6.1 to 6.4) of the ebook Mahout in Action. There are two mapper/reducer pairs. Here is the code:
Mapper - 1
public class WikipediaToItemPrefsMapper extends
        Mapper<LongWritable, Text, VarLongWritable, VarLongWritable> {

    private static final Pattern NUMBERS = Pattern.compile("(\\d+)");

    @Override
    public void map(LongWritable key,
                    Text value,
                    Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        Matcher m = NUMBERS.matcher(line);
        m.find();
        VarLongWritable userID = new VarLongWritable(Long.parseLong(m.group()));
        VarLongWritable itemID = new VarLongWritable();
        while (m.find()) {
            itemID.set(Long.parseLong(m.group()));
            context.write(userID, itemID);
        }
    }
}
Reducer - 1
public class WikipediaToUserVectorReducer extends
        Reducer<VarLongWritable, VarLongWritable, VarLongWritable, VectorWritable> {

    @Override
    public void reduce(VarLongWritable userID,
                       Iterable<VarLongWritable> itemPrefs,
                       Context context)
            throws IOException, InterruptedException {
        Vector userVector = new RandomAccessSparseVector(
                Integer.MAX_VALUE, 100);
        for (VarLongWritable itemPref : itemPrefs) {
            userVector.set((int) itemPref.get(), 1.0f);
        }
        //LongWritable userID_lw = new LongWritable(userID.get());
        context.write(userID, new VectorWritable(userVector));
        //context.write(userID_lw, new VectorWritable(userVector));
    }
}
The reducer outputs a userID and a userVector, and it looks like this: 98955 {590:1.0 22:1.0 9059:1.0 3:1.0 2:1.0 1:1.0}, provided FileInputFormat and TextInputFormat are used in the driver.
I want to use another pair of mapper-reducer to process this data further:
Mapper - 2
public class UserVectorToCooccurenceMapper extends
        Mapper<VarLongWritable, VectorWritable, IntWritable, IntWritable> {

    @Override
    public void map(VarLongWritable userID,
                    VectorWritable userVector,
                    Context context)
            throws IOException, InterruptedException {
        Iterator<Vector.Element> it = userVector.get().iterateNonZero();
        while (it.hasNext()) {
            int index1 = it.next().index();
            Iterator<Vector.Element> it2 = userVector.get().iterateNonZero();
            while (it2.hasNext()) {
                int index2 = it2.next().index();
                context.write(new IntWritable(index1),
                        new IntWritable(index2));
            }
        }
    }
}
Reducer - 2
public class UserVectorToCooccurenceReducer extends
        Reducer<IntWritable, IntWritable, IntWritable, VectorWritable> {

    @Override
    public void reduce(IntWritable itemIndex1,
                       Iterable<IntWritable> itemIndex2s,
                       Context context)
            throws IOException, InterruptedException {
        Vector cooccurrenceRow = new RandomAccessSparseVector(Integer.MAX_VALUE, 100);
        for (IntWritable intWritable : itemIndex2s) {
            int itemIndex2 = intWritable.get();
            cooccurrenceRow.set(itemIndex2, cooccurrenceRow.get(itemIndex2) + 1.0);
        }
        context.write(itemIndex1, new VectorWritable(cooccurrenceRow));
    }
}
This is the driver I am using:
public final class RecommenderJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job_preferenceValues = new Job(getConf());
        job_preferenceValues.setJarByClass(RecommenderJob.class);
        job_preferenceValues.setJobName("job_preferenceValues");

        job_preferenceValues.setInputFormatClass(TextInputFormat.class);
        job_preferenceValues.setOutputFormatClass(SequenceFileOutputFormat.class);

        FileInputFormat.setInputPaths(job_preferenceValues, new Path(args[0]));
        SequenceFileOutputFormat.setOutputPath(job_preferenceValues, new Path(args[1]));

        job_preferenceValues.setMapOutputKeyClass(VarLongWritable.class);
        job_preferenceValues.setMapOutputValueClass(VarLongWritable.class);
        job_preferenceValues.setOutputKeyClass(VarLongWritable.class);
        job_preferenceValues.setOutputValueClass(VectorWritable.class);

        job_preferenceValues.setMapperClass(WikipediaToItemPrefsMapper.class);
        job_preferenceValues.setReducerClass(WikipediaToUserVectorReducer.class);
        job_preferenceValues.waitForCompletion(true);

        Job job_cooccurence = new Job(getConf());
        job_cooccurence.setJarByClass(RecommenderJob.class);
        job_cooccurence.setJobName("job_cooccurence");

        job_cooccurence.setInputFormatClass(SequenceFileInputFormat.class);
        job_cooccurence.setOutputFormatClass(TextOutputFormat.class);

        SequenceFileInputFormat.setInputPaths(job_cooccurence, new Path(args[1]));
        FileOutputFormat.setOutputPath(job_cooccurence, new Path(args[2]));

        job_cooccurence.setMapOutputKeyClass(VarLongWritable.class);
        job_cooccurence.setMapOutputValueClass(VectorWritable.class);
        job_cooccurence.setOutputKeyClass(IntWritable.class);
        job_cooccurence.setOutputValueClass(VectorWritable.class);

        job_cooccurence.setMapperClass(UserVectorToCooccurenceMapper.class);
        job_cooccurence.setReducerClass(UserVectorToCooccurenceReducer.class);
        job_cooccurence.waitForCompletion(true);

        return 0;
    }

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new RecommenderJob(), args);
    }
}
The error that I get is:
java.io.IOException: Type mismatch in key from map: expected org.apache.mahout.math.VarLongWritable, received org.apache.hadoop.io.IntWritable
In the course of Googling for a fix, I found out that my issue is similar to this question. But the difference is that I am already using SequenceFileInputFormat and SequenceFileOutputFormat, I believe correctly. I also see that org.apache.mahout.cf.taste.hadoop.item.RecommenderJob does something more or less similar. In my understanding, and according to the Yahoo tutorial:
SequenceFileOutputFormat rapidly serializes arbitrary data types to the file; the corresponding SequenceFileInputFormat will deserialize the file into the same types and presents the data to the next Mapper in the same manner as it was emitted by the previous Reducer.
What am I doing wrong? I will really appreciate some pointers. I spent the day trying to fix this and got nowhere :(
Your second mapper has the following signature:
public class UserVectorToCooccurenceMapper extends
        Mapper<VarLongWritable, VectorWritable, IntWritable, IntWritable>
But you define the following in your driver code:
job_cooccurence.setMapOutputKeyClass(VarLongWritable.class);
job_cooccurence.setMapOutputValueClass(VectorWritable.class);
The reducer is expecting <IntWritable, IntWritable> as input, so you should just amend your driver code to:
job_cooccurence.setMapOutputKeyClass(IntWritable.class);
job_cooccurence.setMapOutputValueClass(IntWritable.class);
