For some reason I can't find any good sources online for getting Distributed Cache working with the new API. Hoping someone here can explain what I'm doing wrong. My current attempt is sort of a mish-mash of various things I've found online.
This program attempts to run the k-nearest neighbors algorithm. The input file is the test dataset, while the distributed cache holds the train dataset and train labels. The mapper should take one row of test data, compare it to every row in the distributed cache data, and return the label of the row it is most similar to.
import java.net.URI;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class KNNDriver extends Configured implements Tool {
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.out.printf("Usage: %s [generic options] <input dir> <output dir>\n", getClass().getSimpleName());
return -1;
}
Configuration conf = new Configuration();
// conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "^");
conf.setInt("train_rows", 1000);
conf.setInt("test_rows", 1000);
conf.setInt("cols", 612);
DistributedCache.addCacheFile(new URI("cacheData/train_sample.csv"),conf);
DistributedCache.addCacheFile(new URI("cacheData/train_labels.csv"),conf);
Job job = new Job(conf);
job.setJarByClass(KNNDriver.class);
job.setJobName("KNN");
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(KNNMapper.class);
job.setReducerClass(KNNReducer.class);
// job.setInputFormatClass(KeyValueTextInputFormat.class);
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(IntWritable.class);
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new Configuration(), new KNNDriver(), args);
System.exit(exitCode);
}
}
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.FileNotFoundException;
import java.util.Scanner;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class KNNMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
int[][] train_vals;
int[] train_label_vals;
int train_rows;
int test_rows;
int cols;
@Override
public void setup(Context context) throws IOException, InterruptedException {
Configuration conf = context.getConfiguration();
// Path[] cacheFiles = context.getLocalCacheFiles();
// assign to the mapper fields; declaring new locals here would shadow them and leave them at 0 in map()
train_rows = conf.getInt("train_rows", 0);
test_rows = conf.getInt("test_rows", 0);
cols = conf.getInt("cols", 0);
train_vals = new int[train_rows][cols];
train_label_vals = new int[train_rows];
// read train csv, parse, and store into 2d int array
Scanner myScan;
try {
myScan = new Scanner(new File("train_sample.csv"));
//Set the delimiter used in file
myScan.useDelimiter("[,\r\n]+");
//Get all tokens and store them in the train_vals array
System.out.println("myScan loaded for train_sample");
for(int row = 0; row < train_rows; row++) {
for(int col = 0; col < cols; col++) {
train_vals[row][col] = Integer.parseInt(myScan.next().toString());
}
}
myScan.close();
} catch (FileNotFoundException e) {
System.out.print("Error: Train file not found.");
}
// read train_labels csv, parse, and store into 2d int array
try {
myScan = new Scanner(new File("train_labels.csv"));
//Set the delimiter used in file
myScan.useDelimiter("[,\r\n]+");
//Get all tokens and store them in the train_label_vals array
System.out.println("myScan loaded for train_labels");
for(int row = 0; row < train_rows; row++) {
train_label_vals[row] = Integer.parseInt(myScan.next().toString());
}
myScan.close();
} catch (FileNotFoundException e) {
System.out.print("Error: Train Labels file not found.");
}
}
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
// setup() gave us train_vals & train_label_vals.
// Each line in map() represents a test observation. We iterate
// through every train_vals row to find the nearest L2 match, then
// emit a key/value pair of <observation #, label of the nearest train row>.
// convert from Text to String
String line = value.toString();
long distance;
double best_distance = Double.POSITIVE_INFINITY;
int col_num;
int best_digit = -1;
IntWritable rowId = null;
int i;
IntWritable rowNum;
String[] pixels;
// comma delimited files, split on commas
// first we find the # of rows
for (i = 0; i < train_rows; i++) {
distance = 0;
col_num = 0;
pixels = line.split(",");
rowId = new IntWritable(Integer.parseInt(pixels[0]));
for (int j = 1; j < cols; j++) {
// ^ is XOR in Java, not exponentiation, so square the difference explicitly
int diff = Integer.parseInt(pixels[j]) - train_vals[i][j-1];
distance += (long) diff * diff;
}
if (distance < best_distance) {
best_distance = distance;
best_digit = train_label_vals[i];
}
}
context.write(rowId, new IntWritable(best_digit));
}
}
I commented out the Path... statement because I don't understand what it does or how it sends the file data to the mapper, but I noticed it listed on a couple of websites. Currently the program is not finding the Distributed Cache datasets even though they are uploaded to HDFS.
Try to use symlinking:
DistributedCache.createSymlink(conf);
DistributedCache.addCacheFile(new URI("cacheData/train_sample.csv#train_sample.csv"),conf);
DistributedCache.addCacheFile(new URI("cacheData/train_labels.csv#train_labels.csv"),conf);
This makes the files available in the task's local working directory under the names you are actually trying to open them with (train_sample.csv and train_labels.csv).
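If you are on Hadoop 2.x, DistributedCache is deprecated and the same thing can be done directly on the Job object. Here is a minimal sketch under that assumption; the class and method names (CacheExample, buildJob, openTrainSample) are just placeholders, not part of the original code:
import java.io.File;
import java.io.IOException;
import java.net.URI;
import java.util.Scanner;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheExample {
    // Driver side: register the cache files on the Job (Hadoop 2.x API).
    // The "#name" fragment is the symlink name created in each task's working directory.
    public static Job buildJob(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "KNN");
        job.addCacheFile(new URI("cacheData/train_sample.csv#train_sample.csv"));
        job.addCacheFile(new URI("cacheData/train_labels.csv#train_labels.csv"));
        return job;
    }

    // Task side: setup() can open the file by its symlink name.
    static Scanner openTrainSample() throws IOException {
        return new Scanner(new File("train_sample.csv"));
    }
}
The fragment after # is what makes the local file name predictable, which is why the symlink approach works with the mapper's new File("train_sample.csv") call.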
Related
I am trying to create a MapReduce program to perform the k-means algorithm. I know MapReduce isn't the best way to do iterative algorithms.
I have created the mapper and reducer classes.
In the mapper code I read an input file. When a MapReduce pass has completed I want the results to be stored in the same input file. How do I make the output file overwrite the file the mapper read as input?
Also, how do I make the MapReduce job iterate until the values from the old input file and the new input file converge, i.e. until the difference between the values is less than 0.1?
My code is:
import java.io.IOException;
import java.util.StringTokenizer;
import java.util.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.FileReader;
import java.io.BufferedReader;
import java.util.ArrayList;
public class kmeansMapper extends Mapper<Object, Text, DoubleWritable,
DoubleWritable> {
private final static String centroidFile = "centroid.txt";
private List<Double> centers = new ArrayList<Double>();
public void setup(Context context) throws IOException{
BufferedReader br = new BufferedReader(new FileReader(centroidFile));
String contentLine;
while((contentLine = br.readLine()) != null){
centers.add(Double.parseDouble(contentLine));
}
br.close();
}
public void map(Object key, Text input, Context context) throws IOException,
InterruptedException {
String[] fields = input.toString().split(" ");
Double rating = Double.parseDouble(fields[2]);
Double distance = Math.abs(centers.get(0) - rating); // use the absolute distance, as in the loop below
int position = 0;
for(int i=1; i<centers.size(); i++){
Double cDistance = Math.abs(centers.get(i) - rating);
if(cDistance< distance){
position = i;
distance = cDistance;
}
}
Double closestCenter = centers.get(position);
context.write(new DoubleWritable(closestCenter),new
DoubleWritable(rating)); //outputs closestcenter and rating value
}
}
import java.io.IOException;
import java.lang.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Reducer;
import java.util.*;
public class kmeansReducer extends Reducer<DoubleWritable, DoubleWritable,
DoubleWritable, Text> {
public void reduce(DoubleWritable key, Iterable<DoubleWritable> values,
Context context)// get count // get total //get values in a string
throws IOException, InterruptedException {
Iterator<DoubleWritable> v = values.iterator();
double total = 0;
double count = 0;
String value = ""; //value is the rating
while (v.hasNext()){
double i = v.next().get();
value = value + " " + Double.toString(i);
total = total + i;
++count;
}
double nCenter = total/count;
context.write(new DoubleWritable(nCenter), new Text(value));
}
}
import java.util.Arrays;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class run
{
public static void runJob(String[] input, String output) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf);
Path toCache = new Path("input/centroid.txt");
job.addCacheFile(toCache.toUri());
job.setJarByClass(run.class);
job.setMapperClass(kmeansMapper.class);
job.setReducerClass(kmeansReducer.class);
job.setMapOutputKeyClass(DoubleWritable.class);
job.setMapOutputValueClass(DoubleWritable.class);
job.setNumReduceTasks(1);
Path outputPath = new Path(output);
FileInputFormat.setInputPaths(job, StringUtils.join(input, ","));
FileOutputFormat.setOutputPath(job, outputPath);
outputPath.getFileSystem(conf).delete(outputPath,true);
job.waitForCompletion(true);
}
public static void main(String[] args) throws Exception {
runJob(Arrays.copyOfRange(args, 0, args.length-1), args[args.length-1]);
}
}
Thanks
I know you put the disclaimer.. but please switch to Spark or some other framework that can solve problems in-memory. Your life will be so much better.
If you really want to do this, just iteratively run the code in runJob and use a temporary file name for input. You can see this question on moving files in hadoop to achieve this. You'll need a FileSystem instance and a temp file for input:
FileSystem fs = FileSystem.get(new Configuration());
Path tempInputPath = new Path("/user/th/kmeans/tmp_input");
Broadly speaking, after each iteration is finished, do
fs.delete(tempInputPath, true);
fs.rename(outputPath, tempInputPath);
Of course for the very first iteration you must set the input path to be the input paths provided when running the job. Subsequent iterations can use the tempInputPath, which will be the output of the previous iteration.
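To make the iteration concrete, here is a rough sketch of such a driver loop; runJob is the method from the question's run class, while IterativeRunner and hasConverged are hypothetical names, and the convergence check itself (comparing old and new centroids against the 0.1 threshold) is left as a placeholder:
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IterativeRunner {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path tempInputPath = new Path("/user/th/kmeans/tmp_input");
        Path outputPath = new Path(args[args.length - 1]);

        String[] input = Arrays.copyOfRange(args, 0, args.length - 1);
        boolean converged = false;
        while (!converged) {
            // run.runJob(...) is the job submission method from the question
            run.runJob(input, outputPath.toString());

            // hypothetical helper: read the old and new centroid files and return true
            // when every centroid has moved by less than 0.1
            converged = hasConverged(fs, tempInputPath, outputPath);

            // make this iteration's output the next iteration's input
            fs.delete(tempInputPath, true);
            fs.rename(outputPath, tempInputPath);
            input = new String[] { tempInputPath.toString() };
        }
    }

    // placeholder for the convergence test described above
    static boolean hasConverged(FileSystem fs, Path oldCentroids, Path newCentroids) {
        return false;
    }
}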
I want to run two mappers that produce two different outputs in different directories. The output of the first mapper (passed as an argument) should be sent as the input of the second mapper. I have this code in the driver class:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class Export_Column_Mapping
{
private static String[] Detail_output_column_array = new String[27];
private static String[] Shop_output_column_array = new String[8];
private static String details_output = null ;
private static String Shop_output = null;
public static void main(String[] args) throws Exception
{
String Output_filetype = args[3];
String Input_column_number = args[4];
String Output_column_number = args[5];
Configuration Detailsconf = new Configuration(false);
Detailsconf.setStrings("output_filetype",Output_filetype);
Detailsconf.setStrings("Input_column_number",Input_column_number);
Detailsconf.setStrings("Output_column_number",Output_column_number);
Job Details = new Job(Detailsconf," Export_Column_Mapping");
Details.setJarByClass(Export_Column_Mapping.class);
Details.setJobName("DetailsFile_Job");
Details.setMapperClass(DetailFile_Mapper.class);
Details.setNumReduceTasks(0);
Details.setInputFormatClass(TextInputFormat.class);
Details.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.setInputPaths(Details, new Path(args[0]));
FileOutputFormat.setOutputPath(Details, new Path(args[1]));
if(Details.waitForCompletion(true))
{
Configuration Shopconf = new Configuration();
Job Shop = new Job(Shopconf,"Export_Column_Mapping");
Shop.setJarByClass(Export_Column_Mapping.class);
Shop.setJobName("ShopFile_Job");
Shop.setMapperClass(ShopFile_Mapper.class);
Shop.setNumReduceTasks(0);
Shop.setInputFormatClass(TextInputFormat.class);
Shop.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.setInputPaths(Shop, new Path(args[1]));
FileOutputFormat.setOutputPath(Shop, new Path(args[2]));
MultipleOutputs.addNamedOutput(Shop, "text", TextOutputFormat.class,LongWritable.class, Text.class);
System.exit(Shop.waitForCompletion(true) ? 0 : 1);
}
}
public static class DetailFile_Mapper extends Mapper<LongWritable,Text,Text,Text>
{
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
{
String str_Output_filetype = context.getConfiguration().get("output_filetype");
String str_Input_column_number = context.getConfiguration().get("Input_column_number");
String[] input_columns_number = str_Input_column_number.split(",");
String str_Output_column_number= context.getConfiguration().get("Output_column_number");
String[] output_columns_number = str_Output_column_number.split(",");
String str_line = value.toString();
String[] input_column_array = str_line.split(",");
try
{
for(int i = 0;i<=input_column_array.length+1; i++)
{
int int_outputcolumn = Integer.parseInt(output_columns_number[i]);
int int_inputcolumn = Integer.parseInt(input_columns_number[i]);
if((int_inputcolumn != 0) && (int_outputcolumn != 0) && output_columns_number.length == input_columns_number.length)
{
Detail_output_column_array[int_outputcolumn-1] = input_column_array[int_inputcolumn-1];
if(details_output != null)
{
details_output = details_output+" "+ Detail_output_column_array[int_outputcolumn-1];
Shop_output = Shop_output+" "+ Shop_output_column_array[int_outputcolumn-1];
}else
{
details_output = Detail_output_column_array[int_outputcolumn-1];
Shop_output = Shop_output_column_array[int_outputcolumn-1];
}
}
}
}catch (Exception e)
{
}
context.write(null,new Text(details_output));
}
}
public static class ShopFile_Mapper extends Mapper<LongWritable,Text,Text,Text>
{
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
{
try
{
for(int i = 0;i<=Shop_output_column_array.length; i++)
{
Shop_output_column_array[0] = Detail_output_column_array[0];
Shop_output_column_array[1] = Detail_output_column_array[1];
Shop_output_column_array[2] = Detail_output_column_array[2];
Shop_output_column_array[3] = Detail_output_column_array[3];
Shop_output_column_array[4] = Detail_output_column_array[14];
if(details_output != null)
{
Shop_output = Shop_output+" "+ Shop_output_column_array[i];
}else
{
Shop_output = Shop_output_column_array[i-1];
}
}
}catch (Exception e){
}
context.write(null,new Text(Shop_output));
}
}
}
I get the error:
Error:org.apache.hadoop.mapreduce.lib.input.InvalidInputException:
Input path does not exist:
file:/home/Barath.B.Natarajan.ap/rules/text.txt
I want to run the jobs one by one; can anyone help me with this?
There is something called JobControl with which you will be able to achieve this.
Suppose there are two jobs A and B
ControlledJob A= new ControlledJob(JobConf for A);
ControlledJob B= new ControlledJob(JobConf for B);
B.addDependingJob(A);
JobControl jControl = new JobControl("Name");
jControl.addJob(A);
jControl.addJob(B);
Thread runJControl = new Thread(jControl);
runJControl.start();
while (!jControl.allFinished()) {
code = jControl.getFailedJobList().size() == 0 ? 0 : 1;
Thread.sleep(1000);
}
System.exit(code);
Initialize code at the beginning like this:
int code =1;
Let the first job in your case be the first mapper with zero reducers, and the second job be the second mapper with zero reducers. The configuration should be such that the input path of B and the output path of A are the same.
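A rough sketch of wiring the two jobs together with the new API is below; jobA and jobB are assumed to be the two fully configured Job instances from the question's driver (mapper set, zero reducers, input/output paths), and ChainRunner is just a placeholder name:
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class ChainRunner {
    public static int runChained(Job jobA, Job jobB) throws Exception {
        ControlledJob a = new ControlledJob(jobA, null);
        ControlledJob b = new ControlledJob(jobB, null);
        b.addDependingJob(a); // B starts only after A has succeeded

        JobControl control = new JobControl("Export_Column_Mapping");
        control.addJob(a);
        control.addJob(b);

        // JobControl is a Runnable, so run it on its own thread and poll for completion
        Thread thread = new Thread(control);
        thread.setDaemon(true);
        thread.start();
        while (!control.allFinished()) {
            Thread.sleep(1000);
        }
        int code = control.getFailedJobList().isEmpty() ? 0 : 1;
        control.stop();
        return code;
    }
}
As the answer says, the output path of job A must be the same as the input path of job B, so the second mapper reads what the first one wrote.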
I am trying to do this large matrix multiplication. The code below works fine, but when I try it for large matrices, I get the below error. Note that there is absolutely nothing wrong with my input file (no weird characters etc).
Also, I would like to mention that after running for some time it crashes with the following error, and I also get a prompt on my Ubuntu system saying that the filesystem root only has around 360 MB left. Is the crash happening only because of the space issue?
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class OneStepMatrixMultiplication {
public static class Map extends Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
Configuration conf = context.getConfiguration();
int m = Integer.parseInt(conf.get("m"));
int p = Integer.parseInt(conf.get("p"));
String line = value.toString();
String[] indicesAndValue = line.split(",");
Text outputKey = new Text();
Text outputValue = new Text();
if (indicesAndValue[0].equals("A")) {
for (int k = 0; k < p; k++) {
outputKey.set(indicesAndValue[1] + "," + k);
outputValue.set("A," + indicesAndValue[2] + "," + indicesAndValue[3]);
context.write(outputKey, outputValue);
}
} else {
for (int i = 0; i < m; i++) {
outputKey.set(i + "," + indicesAndValue[2]);
outputValue.set("B," + indicesAndValue[1] + "," + indicesAndValue[3]);
context.write(outputKey, outputValue);
}
}
}
}
public static class Reduce extends Reducer<Text, Text, Text, Text>
{
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
String[] value;
HashMap<Integer, Float> hashA = new HashMap<Integer, Float>();
HashMap<Integer, Float> hashB = new HashMap<Integer, Float>();
for (Text val : values)
{
value = val.toString().split(",");
if (value[0].equals("A"))
{
hashA.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
} else
{
hashB.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
}
}
int n = Integer.parseInt(context.getConfiguration().get("n"));
float result = 0.0f;
float a_ij;
float b_jk;
for (int j = 0; j < n; j++)
{
a_ij = hashA.containsKey(j) ? hashA.get(j) : 0.0f;
b_jk = hashB.containsKey(j) ? hashB.get(j) : 0.0f;
result += a_ij * b_jk;
}
if (result != 0.0f)
{
context.write(null, new Text(key.toString() + "," + Float.toString(result)));
}
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
// A is an m-by-n matrix; B is an n-by-p matrix.
conf.set("m", "10000");
conf.set("n", "3");
conf.set("p", "10000");
Job job = new Job(conf, "MatrixMatrixMultiplicationOneStep");
job.setJarByClass(OneStepMatrixMultiplication.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}
Error
5/05/31 18:45:35 WARN mapred.LocalJobRunner: job_local1019087739_0001
java.lang.Exception: java.io.IOException: Spill failed
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Try to change your
hadoop.tmp.dir
to a location with sufficient storage.
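For example, in the driver you could point it at a bigger volume before the job is created. This is only a sketch: /data/hadoop-tmp is a placeholder for any local directory with enough free space, and on a real cluster the equivalent setting is usually made in the site configuration files rather than in code.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TmpDirExample {
    public static Job buildJob() throws Exception {
        Configuration conf = new Configuration();
        // Placeholder path: spill and other intermediate files end up under this
        // directory when running with the LocalJobRunner, so it needs free space.
        conf.set("hadoop.tmp.dir", "/data/hadoop-tmp");
        return new Job(conf, "MatrixMatrixMultiplicationOneStep");
    }
}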
I've written a Linear Regression program in Java.
The input is:
2,21.05
3,23.51
4,24.23
5,27.71
6,30.86
8,45.85
10,52.12
11,55.98
I want to store the input in an array like x[] = {2,3,...,11} before processing it in the reduce task, and then send that array to the reduce() function.
But in my program I only get one value at a time. My program:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Reducer.Context;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class LinearRegression {
public static class RegressionMapper extends
Mapper<LongWritable, Text, Text, CountRegression> {
private Text id = new Text();
private CountRegression countRegression = new CountRegression();
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String tempString = value.toString();
String[] inputData = tempString.split(",");
String xVal = inputData[0];
String yVal = inputData[1];
countRegression.setxVal(Integer.parseInt(xVal));
countRegression.setyVal(Float.parseFloat(yVal));
id.set(xVal);
context.write(id, countRegression);
}
}
public static class RegressionReducer extends
Reducer<Text, CountRegression, Text, CountRegression> {
private CountRegression result = new CountRegression();
// static float meanX = 0;
// private float xValues[];
// private float yValues[];
static float xRed = 0.0f;
static float yRed = 0.3f;
static float sum = 0;
static ArrayList<Float> list = new ArrayList<Float>();
public void reduce(Text key, Iterable<CountRegression> values,
Context context) throws IOException, InterruptedException {
//float b = 0;
// while(values.iterator().hasNext())
// {
// xRed = xRed + values.iterator().next().getxVal();
// yRed = yRed + values.iterator().next().getyVal();
// }
for (CountRegression val : values) {
list.add(val.getxVal());
// list.add(val.getyVal());
// xRed += val.getxVal();
// yRed = val.getyVal();
// meanX += val.getxVal();
//xValues = val.getxVal();
}
for (int i = 0; i < list.size(); i++) {
// previousIndex() on a fresh listIterator() is -1, so index with the loop variable instead
sum += list.get(i);
}
result.setxVal(sum);
result.setyVal(yRed);
context.write(key, result);
}
}
public static class CountRegression implements Writable {
private float xVal = 0;
private float yVal = 0;
public float getxVal() {
return xVal;
}
public void setxVal(float x) {
this.xVal = x;
}
public float getyVal() {
return yVal;
}
public void setyVal(float y) {
this.yVal = y;
}
@Override
public void readFields(DataInput in) throws IOException {
xVal = in.readFloat();
yVal = in.readFloat();
}
@Override
public void write(DataOutput out) throws IOException {
out.writeFloat(xVal);
out.writeFloat(yVal);
}
@Override
public String toString() {
return "y = "+xVal+" +"+yVal+" x" ;
}
}
public static void main(String[] args) throws Exception {
// Provides access to configuration parameters.
Configuration conf = new Configuration();
// Create a new Job It allows the user to configure the job, submit it, control its execution, and query the state.
Job job = new Job(conf);
//Set the user-specified job name.
job.setJobName("LinearRegression");
//Set the Jar by finding where a given class came from.
job.setJarByClass(LinearRegression.class);
// Set the Mapper for the job.
job.setMapperClass(RegressionMapper.class);
// Set the Combiner for the job.
job.setCombinerClass(RegressionReducer.class);
// Set the Reducer for the job.
job.setReducerClass(RegressionReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(CountRegression.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
I keep increasing the number of reducers, and I see that while all but one of the reducers run quickly and finish their job, the last reducer just hangs at the merge step with this message in its tasktracker log:
Down to the last merge-pass, with 3 segments left of total size: 171207264 bytes
... and after staying at this statement for a long time, it throws a Java heap space error and starts some cleanup which just doesn't finish.
I increased the child.opts memory to 3.5GB (unable to go beyond this limit) and compressed the map output too.
What might be the cause?
Here is the driver code:
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
conf.set("mapred.task.timeout", "6000000");
conf.set("mapred.compress.map.output", "true");
Job job = new Job(conf, "FreebasePreprocess_Phase2");
job.setNumReduceTasks(6);
job.setJarByClass(FreebasePreprocess.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path("/user/watsonuser/freebase_data100m120m_output"));
FileOutputFormat.setOutputPath(job, new Path("/user/watsonuser/freebase_data100m120m_output_2"));
job.waitForCompletion(true);
}
Here is the mapper:
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Mapper.Context;
public class Map extends Mapper<LongWritable, Text, Text, Text>{
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
{
String[] entities = value.toString().split("\\t");
String[] strings = {"/type/object/type", "/common/topic/notable_for", "/type/user/usergroup"};
List<String> filteredPredicates = Arrays.asList(strings);
FileSplit fileSplit = (FileSplit)context.getInputSplit();
String filename = fileSplit.getPath().getName();
// System.out.println("File name "+filename);
if(filename.startsWith("part-r")) {
// if(filename.equalsIgnoreCase("quad.tsv")) {
//this is a quad dump file
String name = null;
String predicate = null;
String oid = null;
String outVal = null;
String outKey = null;
if(entities.length==3) {
oid = entities[0].trim();
predicate = entities[1].trim();
name = entities[2].trim();
/*if(predicate.contains("/type/object/name/lang"))
{
if(predicate.endsWith("/en"))
{*/
/*outKey = sid;
outVal = oid+"#-#-#-#"+"topic_name";
context.write(new Text(outKey), new Text(outVal));*/
/* }
}*/
outKey = oid;
outVal = predicate+"#-#-#-#"+name;
context.write(new Text(outKey), new Text(outVal));
}
}
else if(filename.equalsIgnoreCase("freebase-simple-topic-dump.tsv")) {
//this is a simple topic dump file
String sid = null;
String name = null;
String outKey = null;
String outVal = null;
if(entities.length>1) {
sid = entities[0];
name = entities[1];
outKey = sid;
outVal = name+"#-#-#-#"+"topic_name";
context.write(new Text(outKey), new Text(outVal));
}
}
}
}
Here is the reducer
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class Reduce extends Reducer<Text, Text, Text, Text>
{
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException
{
String name = null;
String sid = null;
String predicate = null;
String oid = null;
String id = null;
String outKey = null;
String outVal = null;
ArrayList<Text> valuesList = new ArrayList<Text>();
Iterator<Text> ite = values.iterator();
while(ite.hasNext()) {
Text t = ite.next();
Text txt = new Text();
txt.set(t.toString());
valuesList.add(txt);
String[] entities = t.toString().split("#-#-#-#");
if(entities[entities.length-1].equalsIgnoreCase("topic_name"))
{
name = entities[0];
}
}
for(int i=0; i<valuesList.size(); i++) {
{
Text t2 = valuesList.get(i);
String[] entities = t2.toString().split("#-#-#-#");
if(!entities[entities.length-1].contains("topic_name"))
{
if(name!=null) {
outKey = entities[1]+"\t"+entities[0]+"\t"+name;
}
else {
outKey = entities[1]+"\t"+entities[0]+"\t"+key.toString();
}
context.write(new Text(outKey), null);
}
}
}
}
}
My guess is that you have a single key with a huge number of values and the following line in your reducer is causing you problems:
valuesList.add(txt);
Let's say you had a key with 100m values: you're trying to build an ArrayList of size 100m, and at some stage your reducer JVM is going to run out of memory.
You can probably confirm this by putting in some debug and inspecting the logs for the reducer that never ends:
valuesList.add(txt);
if (valuesList.size() % 10000 == 0) {
System.err.println(key + "\t" + valuesList.size());
}
I haven't written raw MR in a while, but I would approach it in a way similar to this:
Keeping all values for a key in memory is always dangerous. I would instead add another MR phase to your job. In the first stage emit newKey = (key, 0), newValue = value when the value contains "topic_name", and newKey = (key, 1), newValue = value when it doesn't. This will require writing a custom WritableComparable that can hold such a pair and knows how to sort it.
For the reducer in the next phase, write a partitioner that partitions on the first element of the new key. Because the map output is sorted on the composite key, each reducer is guaranteed to see the k,v pair carrying the 'name' before it sees the other k,v pairs for that key, so you have access to the "name" for each value corresponding to a key. A sketch of such a composite key and partitioner is below.
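A minimal sketch of that composite key and partitioner, assuming Text keys and the 0/1 flag described above (TaggedKey and TaggedKeyPartitioner are placeholder names):
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Partitioner;

// Composite key: (original key, flag), where flag 0 = the "topic_name" record
// and flag 1 = everything else. Sorting by flag after the key means the name
// record arrives first.
public class TaggedKey implements WritableComparable<TaggedKey> {
    private Text key = new Text();
    private int flag;

    public TaggedKey() {}
    public TaggedKey(String k, int f) { key.set(k); flag = f; }

    public Text getKey() { return key; }

    @Override
    public void write(DataOutput out) throws IOException {
        key.write(out);
        out.writeInt(flag);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        key.readFields(in);
        flag = in.readInt();
    }

    @Override
    public int compareTo(TaggedKey other) {
        int cmp = key.compareTo(other.key);
        return cmp != 0 ? cmp : Integer.compare(flag, other.flag);
    }
}

// Partition only on the original key, so every record for a key reaches the same reducer.
class TaggedKeyPartitioner extends Partitioner<TaggedKey, Text> {
    @Override
    public int getPartition(TaggedKey key, Text value, int numPartitions) {
        return (key.getKey().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
Depending on how you want the values grouped, you may also need a grouping comparator that compares only the key part, so that all records for one original key arrive in a single reduce() call.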