Hadoop MapReduce: how do I overwrite a txt file read in the mapper with the MapReduce output?

I am trying to write a MapReduce program that performs the k-means algorithm. I know MapReduce isn't the best fit for iterative algorithms.
I have created the mapper and reducer classes.
In the mapper code I read an input file. When a MapReduce job has completed, I want the results to be stored in that same input file. How do I make the job output overwrite the file the mapper read?
Also, how do I make the job iterate until the values in the old and new input files converge, i.e. the difference between the values is less than 0.1?
My code is:
import java.io.IOException;
import java.util.StringTokenizer;
import java.util.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.FileReader;
import java.io.BufferedReader;
import java.util.ArrayList;
public class kmeansMapper extends Mapper<Object, Text, DoubleWritable,
DoubleWritable> {
private final static String centroidFile = "centroid.txt";
private List<Double> centers = new ArrayList<Double>();
public void setup(Context context) throws IOException{
BufferedReader br = new BufferedReader(new
FileReader(centroidFile));
String contentLine;
while((contentLine = br.readLine())!=null){
centers.add(Double.parseDouble(contentLine));
}
}
public void map(Object key, Text input, Context context) throws IOException,
InterruptedException {
String[] fields = input.toString().split(" ");
Double rating = Double.parseDouble(fields[2]);
Double distance = Math.abs(centers.get(0) - rating); // absolute distance, consistent with the loop below
int position = 0;
for(int i=1; i<centers.size(); i++){
Double cDistance = Math.abs(centers.get(i) - rating);
if(cDistance< distance){
position = i;
distance = cDistance;
}
}
Double closestCenter = centers.get(position);
context.write(new DoubleWritable(closestCenter),new
DoubleWritable(rating)); //outputs closestcenter and rating value
}
}
import java.io.IOException;
import java.lang.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Reducer;
import java.util.*;
public class kmeansReducer extends Reducer<DoubleWritable, DoubleWritable,
DoubleWritable, Text> {
public void reduce(DoubleWritable key, Iterable<DoubleWritable> values,
Context context)// get count // get total //get values in a string
throws IOException, InterruptedException {
Iterator<DoubleWritable> v = values.iterator();
double total = 0;
double count = 0;
String value = ""; //value is the rating
while (v.hasNext()){
double i = v.next().get();
value = value + " " + Double.toString(i);
total = total + i;
++count;
}
double nCenter = total/count;
context.write(new DoubleWritable(nCenter), new Text(value));
}
}
import java.util.Arrays;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class run
{
public static void runJob(String[] input, String output) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf);
Path toCache = new Path("input/centroid.txt");
job.addCacheFile(toCache.toUri());
job.setJarByClass(run.class);
job.setMapperClass(kmeansMapper.class);
job.setReducerClass(kmeansReducer.class);
job.setMapOutputKeyClass(DoubleWritable.class);
job.setMapOutputValueClass(DoubleWritable.class);
job.setNumReduceTasks(1);
Path outputPath = new Path(output);
FileInputFormat.setInputPaths(job, StringUtils.join(input, ","));
FileOutputFormat.setOutputPath(job, outputPath);
outputPath.getFileSystem(conf).delete(outputPath,true);
job.waitForCompletion(true);
}
public static void main(String[] args) throws Exception {
runJob(Arrays.copyOfRange(args, 0, args.length-1), args[args.length-1]);
}
}
Thanks

I know you put the disclaimer.. but please switch to Spark or some other framework that can solve problems in-memory. Your life will be so much better.
If you really want to do this, just run the code in runJob iteratively and use a temporary path for the input. You can see this question on moving files in Hadoop to achieve this. You'll need a FileSystem instance and a temporary input path:
FileSystem fs = FileSystem.get(new Configuration());
Path tempInputPath = new Path("/user/th/kmeans/tmp_input");
Broadly speaking, after each iteration is finished, do
fs.delete(tempInputPath, true); // recursive, since the job output is a directory
fs.rename(outputPath, tempInputPath);
Of course for the very first iteration you must set the input path to be the input paths provided when running the job. Subsequent iterations can use the tempInputPath, which will be the output of the previous iteration.
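For the convergence part of the question, here is a minimal sketch of the driver loop. It is plain Java only: runIteration is a hypothetical stand-in for one MapReduce pass (in the real driver this would be the runJob call followed by the delete and rename above), and the sample points and starting centers are made up for illustration. The 0.1 threshold is the one from the question.

```java
import java.util.Arrays;

// Sketch only: runIteration stands in for one MapReduce job (hypothetical helper).
class KMeansLoopSketch {
    static final double THRESHOLD = 0.1; // convergence limit taken from the question

    // Stand-in for one job: assign each point to its nearest center, then average.
    static double[] runIteration(double[] centers, double[] points) {
        double[] sum = new double[centers.length];
        int[] count = new int[centers.length];
        for (double p : points) {
            int best = 0;
            for (int i = 1; i < centers.length; i++) {
                if (Math.abs(centers[i] - p) < Math.abs(centers[best] - p)) {
                    best = i;
                }
            }
            sum[best] += p;
            count[best]++;
        }
        double[] next = new double[centers.length];
        for (int i = 0; i < centers.length; i++) {
            // Keep the old center if no point was assigned to it.
            next[i] = count[i] == 0 ? centers[i] : sum[i] / count[i];
        }
        return next;
    }

    // True when every center moved by less than the threshold.
    static boolean converged(double[] oldCenters, double[] newCenters) {
        for (int i = 0; i < oldCenters.length; i++) {
            if (Math.abs(oldCenters[i] - newCenters[i]) >= THRESHOLD) {
                return false;
            }
        }
        return true;
    }

    // Repeat passes until the centers stop moving or the iteration cap is hit.
    static double[] iterate(double[] centers, double[] points, int maxIterations) {
        for (int iter = 0; iter < maxIterations; iter++) {
            double[] next = runIteration(centers, points);
            boolean done = converged(centers, next);
            centers = next; // in Hadoop: the previous output becomes the next input
            if (done) {
                break;
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[] points = {1.0, 2.0, 3.0, 8.0, 9.0, 10.0};
        double[] result = iterate(new double[] {0.0, 5.0}, points, 20);
        System.out.println(Arrays.toString(result));
    }
}
```

The iteration cap is a safety net so the driver cannot loop forever if the centers keep oscillating; pick whatever limit suits your data.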

Related

Find reversed names on a file with hadoop mapreduce

Hello, I have this file, http://aminer.org/lab-datasets/citation/citation-network1.zip, and I need to find the names of the authors who have publications with exactly 2 authors and who reverse their names on at least one of them.
The mapper I made is this one:
package bigdatauom;
import java.io.IOException;
import java.util.ArrayList;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text keyAuthors = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer authorslinetok = new StringTokenizer(value.toString(), "#");
while (authorslinetok.hasMoreTokens()) {
String tempLine = authorslinetok.nextToken();
if (tempLine.charAt(0) == '#') {
tempLine = tempLine.substring(1);
StringTokenizer seperateAuthorsTok = new StringTokenizer(tempLine, ",");
ArrayList<String> authors = new ArrayList<String>();
while (seperateAuthorsTok.hasMoreTokens()) {
authors.add(seperateAuthorsTok.nextToken());
}
if (authors.size() == 2){
keyAuthors.set(tempLine);
context.write(keyAuthors, one);
}
}
}
}
}
I need to have 2 instances of the reducer, and I have been working on this project for a week with no result.
Any advice is appreciated. Thanks in advance!

Hadoop Total Order Partitioner

import java.io.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.mapreduce.lib.partition.*;
import org.apache.hadoop.mapreduce.lib.reduce.*;
import org.apache.hadoop.util.*;
/**
* Demonstrates how to use Total Order Partitioner on Word Count.
*/
public class TotalOrderPartitionerExample {
public static class WordCount extends Configured implements Tool {
private final static int REDUCE_TASKS = 8;
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new WordCount(), args);
System.exit(exitCode);
}
@Override @SuppressWarnings({ "unchecked", "rawtypes" })
public int run(String[] args) throws Exception {
// Check arguments.
if (args.length != 2) {
String usage =
"Usage: " +
"hadoop jar TotalOrderPartitionerExample$WordCount " +
"<input dir> <output dir>\n";
System.out.printf(usage);
System.exit(-1);
}
String jobName = "WordCount";
String mapJobName = jobName + "-Map";
String reduceJobName = jobName + "-Reduce";
// Get user args.
String inputDir = args[0];
String outputDir = args[1];
// Define input path and output path.
Path mapInputPath = new Path(inputDir);
Path mapOutputPath = new Path(outputDir + "-inter");
Path reduceOutputPath = new Path(outputDir);
// Define partition file path.
Path partitionPath = new Path(outputDir + "-part.lst");
// Configure map-only job for sampling.
Job mapJob = new Job(getConf());
mapJob.setJobName(mapJobName);
mapJob.setJarByClass(WordCount.class);
mapJob.setMapperClass(WordMapper.class);
mapJob.setNumReduceTasks(0);
mapJob.setOutputKeyClass(Text.class);
mapJob.setOutputValueClass(IntWritable.class);
TextInputFormat.setInputPaths(mapJob, mapInputPath);
// Set the output format to a sequence file.
mapJob.setOutputFormatClass(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setOutputPath(mapJob, mapOutputPath);
// Submit the map-only job.
int exitCode = mapJob.waitForCompletion(true) ? 0 : 1;
if (exitCode != 0) { return exitCode; }
// Set up the second job, the reduce-only.
Job reduceJob = new Job(getConf());
reduceJob.setJobName(reduceJobName);
reduceJob.setJarByClass(WordCount.class);
// Set the input to the previous job's output.
reduceJob.setInputFormatClass(SequenceFileInputFormat.class);
SequenceFileInputFormat.setInputPaths(reduceJob, mapOutputPath);
// Set the output path to the final output path.
TextOutputFormat.setOutputPath(reduceJob, reduceOutputPath);
// Use identity mapper for key/value pairs in SequenceFile.
reduceJob.setReducerClass(IntSumReducer.class);
reduceJob.setMapOutputKeyClass(Text.class);
reduceJob.setMapOutputValueClass(IntWritable.class);
reduceJob.setOutputKeyClass(Text.class);
reduceJob.setOutputValueClass(IntWritable.class);
reduceJob.setNumReduceTasks(REDUCE_TASKS);
// Use Total Order Partitioner.
reduceJob.setPartitionerClass(TotalOrderPartitioner.class);
// Generate partition file from map-only job's output.
TotalOrderPartitioner.setPartitionFile(
reduceJob.getConfiguration(), partitionPath);
InputSampler.writePartitionFile(reduceJob, new InputSampler.RandomSampler(
1, 10000));
// Submit the reduce job.
return reduceJob.waitForCompletion(true) ? 0 : 2;
}
}
public static class WordMapper extends
Mapper<LongWritable, Text, Text, IntWritable> {
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
for (String word : line.split("\\W+")) {
if (word.length() == 0) { continue; }
context.write(new Text(word), new IntWritable(1));
}
}
}
}
I got this code from GitHub.
I compared the elapsed time of the map and reduce phases.
The regular word count performs better than the version using the total order partitioner.
Why is that?
Are any optimizations or changes needed to reach comparable performance?
How does HashPartitioner performance compare with TotalOrderPartitioner performance?
Yes, HashPartitioner will perform better than TotalOrderPartitioner because it does not have the overhead of running the InputSampler, writing the partition file, and so on.
TotalOrderPartitioner is only needed when you want globally sorted output, and it will be slower than HashPartitioner.
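To make the difference concrete, here is a rough sketch of the two partitioning ideas in plain Java. The hash version uses the same arithmetic as HashPartitioner (clear the sign bit, take the modulus) and needs nothing prepared in advance. The range version mirrors the idea behind TotalOrderPartitioner: it needs sorted split points, and producing those is what the sampling step and partition file are for. The class name, method names, and split points are made up for illustration.

```java
import java.util.Arrays;

// Sketch only: class, method names, and split points are hypothetical.
class PartitionerSketch {
    // Hash partitioning: no preparation needed, but output is not globally ordered.
    static int hashPartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Range partitioning: sorted split points define key ranges,
    // and a binary search finds the range a key falls into.
    static int rangePartition(String key, String[] sortedSplitPoints) {
        int pos = Arrays.binarySearch(sortedSplitPoints, key);
        // Found at index i: the key starts partition i + 1.
        // Not found: binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    public static void main(String[] args) {
        String[] splitPoints = {"h", "p"}; // two split points give three partitions
        for (String word : new String[] {"apple", "kiwi", "zebra"}) {
            System.out.println(word
                    + " hash=" + hashPartition(word, 3)
                    + " range=" + rangePartition(word, splitPoints));
        }
    }
}
```

With range partitioning every key in partition 0 sorts before every key in partition 1, and so on, which is what makes the concatenated reducer outputs globally sorted.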

Number of parallel mapper tasks in Hadoop Streaming job

I'm just starting to learn about Hadoop. I'm trying to use the streaming interface in conjunction with a Python script that processes files: for each input file I create an output file with some information about it, so this is a map job with no reducer. What I'm finding is that files are being processed one at a time, which isn't quite what I'd wanted.
I'll explain what I've done, but I'll also post some code afterwards in case there's something I'm missing there.
I've got an input format and record reader that reads whole files and uses their content as values and file names as keys. (The files aren't huge.) On the other end, I've got an output format and record writer that writes out values to files with names based on the keys. I'm using -io rawbytes and my Python script knows how to read and write key/value pairs.
It all works fine, in terms of producing the output I'm expecting. If I run with, e.g., 10 input files I get 10 splits. That means that each time my script runs it only gets one key/value pair - which isn't ideal, but it's not a big deal, and I can see that this might be unavoidable. What's less good is that there is only one running instance of the script at any one time. Setting mapreduce.job.maps doesn't make any difference (although I vaguely remember seeing something about this value only being a suggestion, so perhaps Hadoop is making a different decision). What am I missing?
Here's my code:
#!/bin/bash
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-libjars mimi.jar \
-D mapreduce.job.reduces=0 \
-files rawbytes_mapper.py,irrelevant.py \
-inputformat "mimi.WholeFileInputFormat" \
-outputformat "mimi.NamedFileOutputFormat" \
-io rawbytes \
-mapper "rawbytes_mapper.py irrelevant blah blah blah" \
-input "input/*.xml" \
-output output
#!/usr/bin/python
def read_raw_bytes(input):
length_bytes = input.read(4)
if len(length_bytes) < 4:
return None
length = 0
for b in length_bytes:
length = (length << 8) + ord(b)
return input.read(length)
def write_raw_bytes(output, s):
length = len(s)
length_bytes = []
for _ in range(4):
length_bytes.append(chr(length & 0xff))
length = length >> 8
length_bytes.reverse()
for b in length_bytes:
output.write(b)
output.write(s)
def read_keys_and_values(input):
d = {}
while True:
key = read_raw_bytes(input)
if key is None: break
value = read_raw_bytes(input)
d[key] = value
return d
def write_keys_and_values(output, d):
for key in d:
write_raw_bytes(output, key)
write_raw_bytes(output, d[key])
if __name__ == "__main__":
import sys
module = __import__(sys.argv[1])
before = read_keys_and_values(sys.stdin)
module.init(sys.argv[2:])
after = module.process(before)
write_keys_and_values(sys.stdout, after)
package mimi;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
public class WholeFileInputFormat extends FileInputFormat<BytesWritable, BytesWritable>
{
private static class WholeFileRecordReader implements RecordReader<BytesWritable, BytesWritable>
{
private FileSplit split;
private JobConf conf;
private boolean processed = false;
public WholeFileRecordReader(FileSplit split, JobConf conf)
{
this.split = split;
this.conf = conf;
}
@Override
public BytesWritable createKey()
{
return new BytesWritable();
}
@Override
public BytesWritable createValue()
{
return new BytesWritable();
}
@Override
public boolean next(BytesWritable key, BytesWritable value) throws IOException
{
if (processed)
{
return false;
}
byte[] contents = new byte[(int) split.getLength()];
Path file = split.getPath();
String name = file.getName();
byte[] bytes = name.getBytes(StandardCharsets.UTF_8);
key.set(bytes, 0, bytes.length);
FileSystem fs = file.getFileSystem(conf);
FSDataInputStream in = null;
try
{
in = fs.open(file);
IOUtils.readFully(in, contents, 0, contents.length);
value.set(contents, 0, contents.length);
}
finally
{
IOUtils.closeStream(in);
}
processed = true;
return true;
}
@Override
public float getProgress() throws IOException
{
return processed ? 1.0f : 0.0f;
}
@Override
public long getPos() throws IOException
{
return processed ? split.getLength() : 0L; // position is the end of the file once it has been read
}
@Override
public void close() throws IOException
{
// do nothing
}
}
@Override
protected boolean isSplitable(FileSystem fs, Path file)
{
return false;
}
@Override
public RecordReader<BytesWritable, BytesWritable> getRecordReader(InputSplit split,
JobConf conf,
Reporter reporter)
throws IOException
{
return new WholeFileRecordReader((FileSplit) split, conf);
}
}
package mimi;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.mapred.lib.MultipleOutputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;
public class NamedFileOutputFormat extends MultipleOutputFormat<BytesWritable, BytesWritable>
{
private static class BytesValueWriter implements RecordWriter<BytesWritable, BytesWritable>
{
FSDataOutputStream out;
BytesValueWriter(FSDataOutputStream out)
{
this.out = out;
}
@Override
public synchronized void write(BytesWritable key, BytesWritable value) throws IOException
{
out.write(value.getBytes(), 0, value.getLength());
}
@Override
public void close(Reporter reporter) throws IOException
{
out.close();
}
}
@Override
protected String generateFileNameForKeyValue(BytesWritable key, BytesWritable value, String name)
{
return new String(key.getBytes(), 0, key.getLength(), StandardCharsets.UTF_8);
}
@Override
public RecordWriter<BytesWritable, BytesWritable> getBaseRecordWriter(FileSystem ignored,
JobConf conf,
String name,
Progressable progress)
throws IOException
{
Path file = FileOutputFormat.getTaskOutputPath(conf, name);
FileSystem fs = file.getFileSystem(conf);
FSDataOutputStream out = fs.create(file, progress);
return new BytesValueWriter(out);
}
}
I think I can help you with this part of your problem:
each time my script runs it only gets one key/value pair - which isn't ideal
If the isSplitable method returns false, only one file per mapper will be processed. So if you don't override isSplitable and leave it returning true, you should get more than one key/value pair in one mapper. In your case every file is one key/value pair, so the files can't be split even when isSplitable returns true.
I cannot figure out why only one mapper runs at a time, but I'm still thinking about it :)
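On the rawbytes side of the question: the framing that the Python script reads and writes is a 4-byte big-endian length followed by that many bytes, once for the key and once for the value. Below is a rough Java equivalent, in case it helps to test the format outside Hadoop. The class and method names are hypothetical, and it uses only the standard library.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Sketch only: mirrors read_raw_bytes / write_raw_bytes from the Python script.
class RawBytesFramingSketch {
    // Encode each field as a 4-byte big-endian length followed by the raw bytes.
    static byte[] frame(byte[]... fields) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            for (byte[] field : fields) {
                out.writeInt(field.length); // writeInt is big-endian
                out.write(field);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Decode length-prefixed fields until the input runs out.
    static List<byte[]> parse(byte[] framed) {
        List<byte[]> fields = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(framed))) {
            while (true) {
                int length;
                try {
                    length = in.readInt();
                } catch (EOFException endOfInput) {
                    break; // fewer than 4 bytes left: no more fields
                }
                byte[] field = new byte[length];
                in.readFully(field);
                fields.add(field);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return fields;
    }

    public static void main(String[] args) {
        byte[] key = "a.xml".getBytes(StandardCharsets.UTF_8);
        byte[] value = "<x/>".getBytes(StandardCharsets.UTF_8);
        List<byte[]> fields = parse(frame(key, value));
        System.out.println(new String(fields.get(0), StandardCharsets.UTF_8));
        System.out.println(new String(fields.get(1), StandardCharsets.UTF_8));
    }
}
```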

Problems with setting up and accessing Distributed Cache

For some reason I can't find any good sources online for getting Distributed Cache working with the new API. Hoping someone here can explain what I'm doing wrong. My current attempt is sort of a mish-mash of various things I've found online.
This program attempts to run the k-nearest neighbors algorithm. The input file is the test dataset, while the distributed cache holds the train dataset and train labels. The mapper should take one row of test data, compare it to every row in the distributed cache data, and return the label of the row it is most similar to.
import java.net.URI;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class KNNDriver extends Configured implements Tool {
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.out.printf("Usage: %s [generic options] <input dir> <output dir>\n", getClass().getSimpleName());
return -1;
}
Configuration conf = new Configuration();
// conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "^");
conf.setInt ("train_rows",1000);
conf.setInt ("test_rows",1000);
conf.setInt ("cols",612);
DistributedCache.addCacheFile(new URI("cacheData/train_sample.csv"),conf);
DistributedCache.addCacheFile(new URI("cacheData/train_labels.csv"),conf);
Job job = new Job(conf);
job.setJarByClass(KNNDriver.class);
job.setJobName("KNN");
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(KNNMapper.class);
job.setReducerClass(KNNReducer.class);
// job.setInputFormatClass(KeyValueTextInputFormat.class);
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(IntWritable.class);
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new Configuration(), new KNNDriver(), args);
System.exit(exitCode);
}
}
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.FileNotFoundException;
import java.util.Scanner;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class KNNMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
int[][] train_vals;
int[] train_label_vals;
int train_rows;
int test_rows;
int cols;
@Override
public void setup(Context context) throws IOException, InterruptedException {
Configuration conf = context.getConfiguration();
// Path[] cacheFiles = context.getLocalCacheFiles();
// Assign the fields; redeclaring them as locals would leave the fields at 0 in map().
train_rows = conf.getInt("train_rows", 0);
test_rows = conf.getInt("test_rows", 0);
cols = conf.getInt("cols", 0);
train_vals = new int[train_rows][cols];
train_label_vals = new int[train_rows];
// read train csv, parse, and store into 2d int array
Scanner myScan;
try {
myScan = new Scanner(new File("train_sample.csv"));
//Set the delimiter used in file
myScan.useDelimiter("[,\r\n]+");
//Get all tokens and store them in some data structure
//I am just printing them
System.out.println("myScan loaded for train_sample");
for(int row = 0; row < train_rows; row++) {
for(int col = 0; col < cols; col++) {
train_vals[row][col] = Integer.parseInt(myScan.next().toString());
}
}
myScan.close();
} catch (FileNotFoundException e) {
System.out.print("Error: Train file not found.");
}
// read train_labels csv, parse, and store into 1d int array
try {
myScan = new Scanner(new File("train_labels.csv"));
//Set the delimiter used in file
myScan.useDelimiter("[,\r\n]+");
//Get all tokens and store them in some data structure
//I am just printing them
System.out.println("myScan loaded for train_labels");
for(int row = 0; row < train_rows; row++) {
train_label_vals[row] = Integer.parseInt(myScan.next().toString());
}
myScan.close();
} catch (FileNotFoundException e) {
System.out.print("Error: Train Labels file not found.");
}
}
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
// setup() gave us train_vals & train_label_vals.
// Each line in map() represents a test observation. We iterate
// through every train_val row to find nearest L2 match, then
// return a key/value pair of <observation #,
// convert from Text to String
String line = value.toString();
long distance;
double best_distance = Double.POSITIVE_INFINITY;
int col_num;
int best_digit = -1;
IntWritable rowId = null;
int i;
IntWritable rowNum;
String[] pixels;
// comma delimited files, split on commas
// first we find the # of rows
for (i = 0; i < train_rows; i++) {
distance = 0;
col_num = 0;
pixels = line.split(",");
rowId = new IntWritable(Integer.parseInt(pixels[0]));
for (int j = 1; j < cols; j++) {
int diff = Integer.parseInt(pixels[j]) - train_vals[i][j-1];
distance += (long) diff * diff; // squared difference; ^ is XOR in Java, not a power operator
}
if (distance < best_distance) {
best_distance = distance;
best_digit = train_label_vals[i];
}
}
context.write(rowId, new IntWritable(best_digit));
}
}
I commented out the Path... statement because I don't understand what it does, or how it sends the file data to the mapper, but I noticed it listed on a couple websites. Currently the program is not finding the Distributed Cache datasets even though they are uploaded to HDFS.
Try using symlinks:
DistributedCache.createSymlink(conf);
DistributedCache.addCacheFile(new URI("cacheData/train_sample.csv#train_sample.csv"),conf);
DistributedCache.addCacheFile(new URI("cacheData/train_labels.csv#train_labels.csv"),conf);
This makes the files available in the mapper's local working directory under the names you are actually trying to open.
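To illustrate what the #name suffix does: the part of the URI after # is the fragment, and Hadoop uses it as the name of the link created in the task's working directory. The sketch below only uses java.net.URI to pull that name out. linkName is a hypothetical helper, not a Hadoop API, and falling back to the file name when there is no fragment is an assumption about how newer Hadoop versions behave.

```java
import java.net.URI;

// Sketch only: linkName is a hypothetical helper, not part of Hadoop.
class CacheUriSketch {
    // Returns the local name a cached file would be reachable under.
    static String linkName(String cacheUri) {
        URI uri = URI.create(cacheUri);
        String fragment = uri.getFragment();
        if (fragment != null) {
            return fragment; // explicit link name given after '#'
        }
        // Assumed fallback: the last component of the path.
        String path = uri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        System.out.println(linkName("cacheData/train_sample.csv#train_sample.csv"));
        System.out.println(linkName("cacheData/train_labels.csv#train_labels.csv"));
    }
}
```

The name returned here is the one that new File(...) in the mapper's setup method has to use.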

Why does the last reducer stop with java heap error during merge step

I keep increasing the number of reducers, and I see that while all reducers but one run quickly and finish their work, the last reducer hangs at the merge step with this message in its tasktracker log:
Down to the last merge-pass, with 3 segments left of total size: 171207264 bytes
... and after staying at this statement for a long time, it throws a Java heap error and starts some cleanup that never finishes.
I increased the child.opts memory to 3.5GB (I am unable to go beyond this limit) and compressed the map output too.
What might be the cause?
Here is the driver code:
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
conf.set("mapred.task.timeout", "6000000");
conf.set("mapred.compress.map.output", "true");
Job job = new Job(conf, "FreebasePreprocess_Phase2");
job.setNumReduceTasks(6);
job.setJarByClass(FreebasePreprocess.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path("/user/watsonuser/freebase_data100m120m_output"));
FileOutputFormat.setOutputPath(job, new Path("/user/watsonuser/freebase_data100m120m_output_2"));
job.waitForCompletion(true);
}
Here is the mapper:
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Mapper.Context;
public class Map extends Mapper<LongWritable, Text, Text, Text>{
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
{
String[] entities = value.toString().split("\\t");
String[] strings = {"/type/object/type", "/common/topic/notable_for", "/type/user/usergroup"};
List<String> filteredPredicates = Arrays.asList(strings);
FileSplit fileSplit = (FileSplit)context.getInputSplit();
String filename = fileSplit.getPath().getName();
// System.out.println("File name "+filename);
if(filename.startsWith("part-r")) {
// if(filename.equalsIgnoreCase("quad.tsv")) {
//this is a quad dump file
String name = null;
String predicate = null;
String oid = null;
String outVal = null;
String outKey = null;
if(entities.length==3) {
oid = entities[0].trim();
predicate = entities[1].trim();
name = entities[2].trim();
/*if(predicate.contains("/type/object/name/lang"))
{
if(predicate.endsWith("/en"))
{*/
/*outKey = sid;
outVal = oid+"#-#-#-#"+"topic_name";
context.write(new Text(outKey), new Text(outVal));*/
/* }
}*/
outKey = oid;
outVal = predicate+"#-#-#-#"+name;
context.write(new Text(outKey), new Text(outVal));
}
}
else if(filename.equalsIgnoreCase("freebase-simple-topic-dump.tsv")) {
//this is a simple topic dump file
String sid = null;
String name = null;
String outKey = null;
String outVal = null;
if(entities.length>1) {
sid = entities[0];
name = entities[1];
outKey = sid;
outVal = name+"#-#-#-#"+"topic_name";
context.write(new Text(outKey), new Text(outVal));
}
}
}
}
Here is the reducer
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class Reduce extends Reducer<Text, Text, Text, Text>
{
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException
{
String name = null;
String sid = null;
String predicate = null;
String oid = null;
String id = null;
String outKey = null;
String outVal = null;
ArrayList<Text> valuesList = new ArrayList<Text>();
Iterator<Text> ite = values.iterator();
while(ite.hasNext()) {
Text t = ite.next();
Text txt = new Text();
txt.set(t.toString());
valuesList.add(txt);
String[] entities = t.toString().split("#-#-#-#");
if(entities[entities.length-1].equalsIgnoreCase("topic_name"))
{
name = entities[0];
}
}
for(int i=0; i<valuesList.size(); i++) {
Text t2 = valuesList.get(i);
String[] entities = t2.toString().split("#-#-#-#");
if(!entities[entities.length-1].contains("topic_name"))
{
if(name!=null) {
outKey = entities[1]+"\t"+entities[0]+"\t"+name;
}
else {
outKey = entities[1]+"\t"+entities[0]+"\t"+key.toString();
}
context.write(new Text(outKey), null);
}
}
}
}
My guess is that you have a single key with a huge number of values and the following line in your reducer is causing you problems:
valuesList.add(txt);
Let's say you had a key with 100m values: you're trying to build an ArrayList of size 100m, and at some stage your reducer JVM is going to run out of memory.
You can probably confirm this by adding some debug output and inspecting the logs for the reducer that never ends:
valuesList.add(txt);
if (valuesList.size() % 10000 == 0) {
System.err.println(key + "\t" + valuesList.size());
}
I haven't written raw MR in a while, but I would approach it in a way similar to this:
Keeping all values for a key in memory is always dangerous. I would instead add another MR phase to your job. In the first stage emit newKey = (key, 0), newValue = value when the value contains "topic_name", and newKey = (key, 1), newValue = value when it doesn't. This requires writing a custom WritableComparable that can hold a pair and knows how to sort it.
For the reducer in the next phase, write a partitioner that partitions on the first element of the new key. Because the reducer receives its input sorted by the new key, you are guaranteed to get the k,v pair with the name before the other k,v pairs for each key. Now you have access to the name for each value corresponding to a key.
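Here is a small simulation of that idea in plain Java, to show why the ordering removes the need to buffer values. The Rec class, the join method, and the sample data are hypothetical. An in-memory sort stands in for the shuffle, and the 0/1 flag plays the role of the second element of the composite key. In a real job you would probably also need a grouping comparator on the first element so that all records for one key reach the same reduce call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch only: simulates the sorted reducer input with an in-memory sort.
class SecondarySortSketch {
    // Composite key (key, flag) plus value; flag 0 = name record, 1 = everything else.
    static final class Rec {
        final String key;
        final int flag;
        final String value;

        Rec(String key, int flag, String value) {
            this.key = key;
            this.flag = flag;
            this.value = value;
        }
    }

    // Streams through the sorted records, remembering only the current name.
    static List<String> join(List<Rec> records) {
        List<Rec> sorted = new ArrayList<>(records);
        sorted.sort(Comparator.comparing((Rec r) -> r.key).thenComparingInt(r -> r.flag));
        List<String> out = new ArrayList<>();
        String currentKey = null;
        String name = null;
        for (Rec r : sorted) {
            if (!r.key.equals(currentKey)) {
                currentKey = r.key; // new key group: forget the previous name
                name = null;
            }
            if (r.flag == 0) {
                name = r.value; // the name record always arrives first in its group
            } else {
                // Fall back to the key itself when no name exists, as in the original reducer.
                out.add(r.value + "\t" + (name != null ? name : r.key));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> input = Arrays.asList(
                new Rec("m1", 1, "pred1"),
                new Rec("m1", 0, "Alice"),
                new Rec("m2", 1, "pred2"));
        for (String line : join(input)) {
            System.out.println(line);
        }
    }
}
```

Memory use stays constant per key because only the current name is held, no matter how many values a key has.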
