cleaning data using mapreduce program - hadoop

I have data with 30 lines. I am trying to clean the data using a MapReduce program. The data is cleaned properly, but only one line out of the 30 is displayed. I guess the record reader is not reading line by line here. Could you please check my code and let me know where the problem is? I am new to Hadoop.
Data :-
1 Vlan154.DEL-ISP-COR-SWH-002.mantraonline.com (61.95.250.140) 0.460 ms 0.374 ms 0.351 ms
2 202.56.223.213 (202.56.223.213) 39.718 ms 39.511 ms 39.559 ms
3 202.56.223.17 (202.56.223.17) 39.714 ms 39.724 ms 39.628 ms
4 125.21.167.153 (125.21.167.153) 41.114 ms 40.001 ms 39.457 ms
5 203.208.190.65 (203.208.190.65) 120.340 ms 71.384 ms 71.346 ms
6 ge-0-1-0-0.sngtp-dr1.ix.singtel.com (203.208.149.158) 71.493 ms ge-0-1-2-0.sngtp-dr1.ix.singtel.com (203.208.149.210) 71.183 ms ge-0-1-0-0.sngtp-dr1.ix.singtel.com (203.208.149.158) 71.739 ms
7 ge-0-0-0-0.sngtp-ar3.ix.singtel.com (203.208.182.2) 80.917 ms ge-2-0-0-0.sngtp-ar3.ix.singtel.com (203.208.183.20) 71.550 ms ge-1-0-0-0.sngtp-ar3.ix.singtel.com (203.208.182.6) 71.534 ms
8 203.208.151.26 (203.208.151.26) 141.716 ms 203.208.145.190 (203.208.145.190) 134.740 ms 203.208.151.26 (203.208.151.26) 142.453 ms
9 219.158.3.225 (219.158.3.225) 138.774 ms 157.205 ms 157.123 ms
10 219.158.4.69 (219.158.4.69) 156.865 ms 157.044 ms 156.845 ms
11 202.96.12.62 (202.96.12.62) 157.109 ms 160.294 ms 159.805 ms
12 61.148.3.58 (61.148.3.58) 159.521 ms 178.088 ms 160.004 ms
MPLS Label=33 CoS=5 TTL=1 S=0
13 202.106.48.18 (202.106.48.18) 199.730 ms 181.263 ms 181.300 ms
14 * * *
15 * * *
16 * * *
17 * * *
18 * * *
19 * * *
20 * * *
21 * * *
22 * * *
23 * * *
mapreduce program:-
public class TraceRouteDataCleaning {
/**
* @param args
* @throws IOException
* @throws InterruptedException
* @throws ClassNotFoundException
*/
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Configuration conf = new Configuration();
String userArgs[] = new GenericOptionsParser(conf, args).getRemainingArgs();
if (userArgs.length < 2) {
System.out.println("Usage: hadoop jar jarfilename mainclass input output");
System.exit(1);
}
Job job = new Job(conf, "cleaning trace route data");
job.setJarByClass(TraceRouteDataCleaning.class);
job.setMapperClass(TraceRouteMapper.class);
job.setReducerClass(TraceRouteReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(userArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(userArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
public static class TraceRouteMapper extends Mapper<LongWritable, Text, Text, Text>{
StringBuilder emitValue = null;
StringBuilder emitKey = null;
Text kword = new Text();
Text vword = new Text();
public void map(LongWritable key, Text value, Context context) throws InterruptedException, IOException
{
// String[] cleanData;
String lines = value.toString();
//deleting ms in RTT time data
lines = lines.replace(" ms", "");
String[] data = lines.split(" ");
emitValue = new StringBuilder(1024);
emitKey = new StringBuilder(1024);
if (data.length == 6) {
emitKey.append(data[0]);
emitValue.append(data[1]).append("\t").append(data[2]).append("\t").append(data[3]).append("\t").append(data[4]).append("\t").append(data[5]);
kword.set(emitKey.toString());
vword.set(emitValue.toString());
context.write(kword, vword);
}
}
}
public static class TraceRouteReducer extends Reducer<Text, Text, Text, Text>{
Text vword = new Text();
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException{
context.write(key,vword);
}
}
}

First, your reducer class should be one of the two below, depending on your requirement: if each key emits only a single Text value, use the first reducer; otherwise choose the second one.
public static class TraceRouteReducer extends Reducer<Text, Text, Text, Text>{
Text vword = new Text();
public void reduce(Text key, Text values, Context context) throws IOException, InterruptedException{
vword=values;
/*
for (Iterator iterator = values.iterator(); iterator.hasNext();) {
vword.set(iterator.next().toString());
System.out.println("printing " +vword.toString());
}*/
context.write(key,vword);
}
}
----------or------------
public static class TraceRouteReducer extends Reducer<Text, Text, Text, Text>{
Text vword = new Text();
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException{
for (Iterator iterator = values.iterator(); iterator.hasNext();) {
vword.set(iterator.next().toString());
context.write(key,vword);
}
}
}
Second, in your mapper you are splitting on a single space, which, as far as I know, is not reliable here because the fields are separated by variable runs of whitespace. Split on the "\\s+" regular expression instead:
String[] data = lines.split("\\s+");

Related

Sort the output in Hadoop Mapreduce

I am a beginner with Hadoop and I want to read a text file through MapReduce and output it. I have set up a counter to see the order of the data, but the output is not in order. Here are my code and a screenshot.
Question: How can we sort the output based on the value of the key?
Sample input data in text file:
199907 21 22 23 24 25
199808 26 27 28 29 30
199909 31 32 33 34 35
200010 36 37 38 39 40
200411 41 42 43 44 45
Mapper
public static class TestMapper
extends Mapper<LongWritable, Text, Text, Text>{
int days = 1;
@Override
public void map(LongWritable key, Text value, Context context
) throws IOException, InterruptedException {
/* get the file name */
FileSplit fileSplit = (FileSplit)context.getInputSplit();
String filename = fileSplit.getPath().getName();
//context.write(new Text(filename), new Text(""));
StringTokenizer token = new StringTokenizer(value.toString());
String yearMonth = token.nextToken();
if(Integer.parseInt(yearMonth) ==0)
return;
while(token.hasMoreTokens()){
context.write(new Text(yearMonth+" "+days),new Text(token.nextToken()));
}
days++;
}
}
Reducer
public static class TestReducer
extends Reducer<Text,Text,Text,Text> {
@Override
public void reduce(Text key, Iterable<Text> values,Context context)
throws IOException, InterruptedException {
ArrayList<String> valList = new ArrayList<String>();
for(Text val: values)
valList.add(val.toString());
context.write(key,new Text(valList.toString()));
}
}
Driver/Main class
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
if (otherArgs.length < 2) {
System.err.println("Usage: Class name <in> [<in>...] <out>");
System.exit(2);
}
Job job = Job.getInstance(conf, "My Class");
job.addFileToClassPath(new Path("/myPath"));
job.setJarByClass(myJar.class);
job.setMapperClass(TestMapper.class);
job.setReducerClass(TestReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.setInputDirRecursive(job, true);
for (int i = 0; i < otherArgs.length - 1; ++i) {
FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
}
FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Screenshot of part of the output

MultipleOutputFormat does only one iteration in Reduce step

Here is my Reducer. The Reducer takes in EdgeWritable and NullWritable.
EdgeWritable has 4 integers, say <71, 74, 7, 2000>.
The communication is from 71 (FromID) to 74 (ToID) in 7 (July) 2000 (Year).
The Mapper outputs 10787 such records to the reducer, but the Reducer outputs only 1.
I need to output 44 files, one for each of the 44 months between Oct 1998 and July 2002. The output should be in the format "out"+month+year. For example, July 2002 records will go to the file out72002.
I have debugged the code. After one iteration, it outputs one file and stops without taking the next record. Please suggest how I should use MultipleOutputs. Thanks
public class MultipleOutputReducer extends Reducer<EdgeWritable, NullWritable, IntWritable, IntWritable>{
private MultipleOutputs<IntWritable,IntWritable> multipleOutputs;
protected void setup(Context context) throws IOException, InterruptedException{
multipleOutputs = new MultipleOutputs<IntWritable, IntWritable>(context);
}
@Override
public void reduce(EdgeWritable key, Iterable val , Context context) throws IOException, InterruptedException {
int year = key.get(3).get();
int month= key.get(2).get();
int to = key.get(1).get();
int from = key.get(0).get();
//if(year >= 1997 && year <= 2001){
if((month >= 9 && year >= 1997) || (month <= 6 && year <= 2001)){
multipleOutputs.write(new IntWritable(from), new IntWritable(to), "out"+month+year );
}
//}
}
@Override
public void cleanup(Context context) throws IOException, InterruptedException{
multipleOutputs.close();
}
}
Driver
public class TimeSlicingDriver extends Configured implements Tool{
static final SimpleDateFormat sdf = new SimpleDateFormat("EEE, d MMM yyyy HH:mm:ss Z");
public int run(String[] args) throws Exception {
if(args.length != 2){
System.out.println("Enter <input path> <output path>");
System.exit(-1);
}
Configuration setup = new Configuration();
//setup.set("Input Path", args[0]);
Job job = new Job(setup, "Time Slicing");
//job.setJobName("Time Slicing");
job.setJarByClass(TimeSlicingDriver.class);
job.setMapperClass(TimeSlicingMapper.class);
job.setReducerClass(MultipleOutputReducer.class);
//MultipleOutputs.addNamedOutput(setup, "output", org.apache.hadoop.mapred.TextOutputFormat.class, EdgeWritable.class, NullWritable.class);
job.setMapOutputKeyClass(EdgeWritable.class);
job.setMapOutputValueClass(NullWritable.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(IntWritable.class);
/**Set the Input File Path and output file path*/
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job.waitForCompletion(true)?0:1;
}
}
You are not iterating over your Iterable "val"; because of that, the logic in your reduce method is executed only once per key group.
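A minimal sketch of what that looks like, assuming the mapper emits one NullWritable per record under an EdgeWritable key (the date filter is kept exactly as in your code):

@Override
public void reduce(EdgeWritable key, Iterable<NullWritable> val, Context context) throws IOException, InterruptedException {
    int year = key.get(3).get();
    int month = key.get(2).get();
    int to = key.get(1).get();
    int from = key.get(0).get();
    // iterate over the grouped values: one output record per incoming mapper record,
    // instead of one per distinct key
    for (NullWritable ignored : val) {
        if ((month >= 9 && year >= 1997) || (month <= 6 && year <= 2001)) {
            multipleOutputs.write(new IntWritable(from), new IntWritable(to), "out" + month + year);
        }
    }
}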

hadoop input data problems

I'm having trouble with the map function.
The original data is stored in a tsv file, and I just want the last two columns saved:
the first column is the original node (383), the second is the target (4575), and the third is the weight (1).
383 4575 1
383 4764 1
383 5458 1
383 5491 1
public void map(LongWritable key, Text value,OutputCollector output, Reporter reporter) throws IOException {
String line = value.toString();
String[] tokens = line.split("t");
int weight = Integer.parseInt(tokens[2]);
int target = Integer.parseInt(tokens[0]);
}
Here is my code:
public void map(LongWritable key, Text value, Context context) throws IOException InterruptedException
{
String line = value.toString();
//split the tsv file
String[] tokens = line.split("/t");
//save the weight and target
private Text target = Integer.parsetxt(tokens[0]);
int weight = Integer.parseInt(tokens[2]);
context.write(new Text(target), new Intwritable(weight) );
}
}
public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable>
{
@Override
public void reduce(Text key, Iterable<IntWritable> values,Context context)
throws IOException, InterruptedException
{
//initialize the count variable
int weightsum = 0;
for (IntWritable value : values) {
weightsum += value.get();
}
context.write(key, new IntWritable(weightsum));
}
}
String[] tokens = line.split("t");
should be
String[] tokens = line.split("\t");
or, to split on any whitespace:
String[] tokens = line.split("\\s+");
Also, the target is the second column, not the first:
Text target = new Text(tokens[1]);
int weight = Integer.parseInt(tokens[2]);
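Putting those fixes together, a minimal sketch of the corrected map method could look like this (it assumes the mapper is declared as Mapper<LongWritable, Text, Text, IntWritable> so that it matches the Reduce class):

public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String line = value.toString();
    // split on tabs; use "\\s+" instead if the separator is arbitrary whitespace
    String[] tokens = line.split("\t");
    // tokens[1] is the target node, tokens[2] is the edge weight
    Text target = new Text(tokens[1]);
    int weight = Integer.parseInt(tokens[2]);
    context.write(target, new IntWritable(weight));
}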

isSplitable in combineFileInputFormat does not work

I have thousands of small files, and I want to process them with CombineFileInputFormat.
With CombineFileInputFormat, multiple small files go to one mapper, and each file will not be split.
A snippet of one of the small input files looks like this:
vers,3
period,2015-01-26-18-12-00,438469546,449329626,complete
config,libdvm.so,chromeview
pkgproc,com.futuredial.digitchat,10021,,0ns:10860078
pkgpss,com.futuredial.digitchat,10021,,0ns:9:6627:6627:6637:5912:5912:5912
pkgsvc-run,com.futuredial.digitchat,10021,.LiveScreenService,1,0n:10860078
pkgsvc-start,com.futuredial.digitchat,10021,.LiveScreenService,1,0n:10860078
pkgproc,com.google.android.youtube,10103,,0ns:10860078
pkgpss,com.google.android.youtube,10103,,0ns:9:12986:13000:13021:11552:11564:11580
pkgsvc-run,com.google.android.youtube,10103,com.google.android.apps.youtube.app.offline.transfer.OfflineTransferService,1,0n:10860078
pkgsvc-start,com.google.android.youtube,10103,com.google.android.apps.youtube.app.offline.transfer.OfflineTransferService,1,0n:10860078
I want to pass the whole file content to the mapper. However, Hadoop splits the file in half.
For example, the above file may be split into
vers,3
period,2015-01-26-18-12-00,438469546,449329626,complete
config,libdvm.so,chromeview
pkgproc,com.futuredial.digitchat,#the line has been cut
But I want the content of the whole file to be processed.
Here is my code, which references Reading file as single record in hadoop.
The driver code:
public class CombineSmallfiles {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
if (otherArgs.length != 2) {
System.err.println("Usage: conbinesmallfiles <in> <out>");
System.exit(2);
}
conf.setInt("mapred.min.split.size", 1);
conf.setLong("mapred.max.split.size", 26214400); // 25m
//conf.setLong("mapred.max.split.size", 134217728); // 128m
//conf.setInt("mapred.reduce.tasks", 5);
Job job = new Job(conf, "combine smallfiles");
job.setJarByClass(CombineSmallfiles.class);
job.setMapperClass(CombineSmallfileMapper.class);
//job.setReducerClass(IdentityReducer.class);
job.setNumReduceTasks(0);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
MultipleOutputs.addNamedOutput(job,"pkgproc",TextOutputFormat.class,Text.class,Text.class);
MultipleOutputs.addNamedOutput(job,"pkgpss",TextOutputFormat.class,Text.class,Text.class);
MultipleOutputs.addNamedOutput(job,"pkgsvc",TextOutputFormat.class,Text.class,Text.class);
job.setInputFormatClass(CombineSmallfileInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
int exitFlag = job.waitForCompletion(true) ? 0 : 1;
System.exit(exitFlag);
}
}
My Mapper code
public class CombineSmallfileMapper extends Mapper<NullWritable, Text, Text, Text> {
private Text file = new Text();
private MultipleOutputs mos;
private String period;
private Long elapsed;
@Override
public void setup(Context context) throws IOException, InterruptedException {
mos = new MultipleOutputs(context);
}
@Override
protected void map(NullWritable key, Text value, Context context) throws IOException, InterruptedException {
String file_name = context.getConfiguration().get("map.input.file.name");
String [] filename_tokens = file_name.split("_");
String uuid = filename_tokens[0];
String [] datetime_tokens;
try{
datetime_tokens = filename_tokens[1].split("-");
}catch(ArrayIndexOutOfBoundsException err){
throw new ArrayIndexOutOfBoundsException(file_name);
}
String year,month,day,hour,minute,sec,msec;
year = datetime_tokens[0];
month = datetime_tokens[1];
day = datetime_tokens[2];
hour = datetime_tokens[3];
minute = datetime_tokens[4];
sec = datetime_tokens[5];
msec = datetime_tokens[6];
String datetime = year+"-"+month+"-"+"-"+day+" "+hour+":"+minute+":"+sec+"."+msec;
String content = value.toString();
String []lines = content.split("\n");
for(int u = 0;u<lines.length;u++){
String line = lines[u];
String []tokens = line.split(",");
if(tokens[0].equals("period")){
period = tokens[1];
try{
long startTime = Long.valueOf(tokens[2]);
long endTime = Long.valueOf(tokens[3]);
elapsed = endTime-startTime;
}catch(NumberFormatException err){
throw new NumberFormatException(line);
}
}else if(tokens[0].equals("pkgproc")){
String proc_info = "";
try{
proc_info += period+","+String.valueOf(elapsed)+","+tokens[2]+","+tokens[3];
}catch(ArrayIndexOutOfBoundsException err){
throw new ArrayIndexOutOfBoundsException("pkgproc: "+content+ "line:"+line);
}
for(int i = 4;i<tokens.length;i++){
String []state_info = tokens[i].split(":");
String state = "";
state += ","+state_info[0].charAt(0)+","+state_info[0].charAt(1)+","+state_info[0].charAt(2)+","+state_info[1];
mos.write("pkgproc",new Text(tokens[1]), new Text(proc_info+state+','+uuid+','+datetime));
}
}else if(tokens[0].equals("pkgpss")){
String proc_info = "";
proc_info += period+","+String.valueOf(elapsed)+","+tokens[2]+","+tokens[3];
for(int i = 4;i<tokens.length;i++){
String []state_info = tokens[i].split(":");
String state = "";
state += ","+state_info[0].charAt(0)+","+state_info[0].charAt(1)+","+state_info[0].charAt(2)+","+state_info[1]+","+state_info[2]+","+state_info[3]+","+state_info[4]+","+state_info[5]+","+state_info[6]+","+state_info[7];
mos.write("pkgpss",new Text(tokens[1]), new Text(proc_info+state+','+uuid+','+datetime));
}
}else if(tokens[0].startsWith("pkgsvc")){
String []stateName = tokens[0].split("-");
String proc_info = "";
//tokens[2] = uid, tokens[3] = serviceName
proc_info += stateName[1]+','+period+","+String.valueOf(elapsed)+","+tokens[2]+","+tokens[3];
String opcount = tokens[4];
for(int i = 5;i<tokens.length;i++){
String []state_info = tokens[i].split(":");
String state = "";
state += ","+state_info[0].charAt(0)+","+state_info[0].charAt(1)+","+state_info[1];
mos.write("pkgsvc",new Text(tokens[1]), new Text(proc_info+state+','+opcount+','+uuid+','+datetime));
}
}
}
}
}
My CombineFileInputFormat, which overrides isSplitable and returns false:
public class CombineSmallfileInputFormat extends CombineFileInputFormat<NullWritable, Text> {
@Override
public RecordReader<NullWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException {
return new CombineFileRecordReader<NullWritable,Text>((CombineFileSplit) split,context,WholeFileRecordReader.class);
}
@Override
protected boolean isSplitable(JobContext context,Path file ){
return false;
}
}
The WholeFileRecordReader
public class WholeFileRecordReader extends RecordReader<NullWritable, Text> {
//private static final Logger LOG = Logger.getLogger(WholeFileRecordReader.class);
/** The path to the file to read. */
private final Path mFileToRead;
/** The length of this file. */
private final long mFileLength;
/** The Configuration. */
private final Configuration mConf;
/** Whether this FileSplit has been processed. */
private boolean mProcessed;
/** Single Text to store the file name of the current file. */
// private final Text mFileName;
/** Single Text to store the value of this file (the value) when it is read. */
private final Text mFileText;
/**
* Implementation detail: This constructor is built to be called via
* reflection from within CombineFileRecordReader.
*
* @param fileSplit The CombineFileSplit that this will read from.
* @param context The context for this task.
* @param pathToProcess The path index from the CombineFileSplit to process in this record.
*/
public WholeFileRecordReader(CombineFileSplit fileSplit, TaskAttemptContext context,
Integer pathToProcess) {
mProcessed = false;
mFileToRead = fileSplit.getPath(pathToProcess);
mFileLength = fileSplit.getLength(pathToProcess);
mConf = context.getConfiguration();
context.getConfiguration().set("map.input.file.name", mFileToRead.getName());
assert 0 == fileSplit.getOffset(pathToProcess);
//if (LOG.isDebugEnabled()) {
//LOG.debug("FileToRead is: " + mFileToRead.toString());
//LOG.debug("Processing path " + pathToProcess + " out of " + fileSplit.getNumPaths());
//try {
//FileSystem fs = FileSystem.get(mConf);
//assert fs.getFileStatus(mFileToRead).getLen() == mFileLength;
//} catch (IOException ioe) {
//// oh well, I was just testing.
//}
//}
//mFileName = new Text();
mFileText = new Text();
}
/** {@inheritDoc} */
@Override
public void close() throws IOException {
mFileText.clear();
}
/**
* Returns the absolute path to the current file.
*
* @return The absolute path to the current file.
* @throws IOException never.
* @throws InterruptedException never.
*/
@Override
public NullWritable getCurrentKey() throws IOException, InterruptedException {
return NullWritable.get();
}
/**
* <p>Returns the current value. If the file has been read with a call to NextKeyValue(),
* this returns the contents of the file as a BytesWritable. Otherwise, it returns an
* empty BytesWritable.</p>
*
* <p>Throws an IllegalStateException if initialize() is not called first.</p>
*
* @return A BytesWritable containing the contents of the file to read.
* @throws IOException never.
* @throws InterruptedException never.
*/
@Override
public Text getCurrentValue() throws IOException, InterruptedException {
return mFileText;
}
/**
* Returns whether the file has been processed or not. Since only one record
* will be generated for a file, progress will be 0.0 if it has not been processed,
* and 1.0 if it has.
*
* @return 0.0 if the file has not been processed. 1.0 if it has.
* @throws IOException never.
* @throws InterruptedException never.
*/
@Override
public float getProgress() throws IOException, InterruptedException {
return (mProcessed) ? (float) 1.0 : (float) 0.0;
}
/**
* All of the internal state is already set on instantiation. This is a no-op.
*
* @param split The InputSplit to read. Unused.
* @param context The context for this task. Unused.
* @throws IOException never.
* @throws InterruptedException never.
*/
@Override
public void initialize(InputSplit split, TaskAttemptContext context)
throws IOException, InterruptedException {
// no-op.
}
/**
* <p>If the file has not already been read, this reads it into memory, so that a call
* to getCurrentValue() will return the entire contents of this file as Text,
* and getCurrentKey() will return the qualified path to this file as Text. Then, returns
* true. If it has already been read, then returns false without updating any internal state.</p>
*
* @return Whether the file was read or not.
* @throws IOException if there is an error reading the file.
* @throws InterruptedException if there is an error.
*/
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
if (!mProcessed) {
if (mFileLength > (long) Integer.MAX_VALUE) {
throw new IOException("File is longer than Integer.MAX_VALUE.");
}
byte[] contents = new byte[(int) mFileLength];
FileSystem fs = mFileToRead.getFileSystem(mConf);
FSDataInputStream in = null;
try {
// Set the contents of this file.
in = fs.open(mFileToRead);
IOUtils.readFully(in, contents, 0, contents.length);
mFileText.set(contents, 0, contents.length);
} finally {
IOUtils.closeQuietly(in);
}
mProcessed = true;
return true;
}
return false;
}
}
I want every mapper to parse multiple small files, and each small file must not be split.
However, the above code cuts (splits) my input files and raises a parsing error (since my parser splits each line into tokens).
In my understanding, CombineFileInputFormat gathers multiple files into one split, and each split feeds one mapper. Therefore, one mapper can handle multiple files.
In my code, the max input split size is set to 25 MB, so I think the problem is that CombineFileInputFormat splits the last part of a small file to satisfy the split size limit.
However, I have overridden isSplitable and returned false, yet it still splits the small files.
What is the correct way to do this? Is it possible to specify the number of files per mapper, rather than specifying an input split size?
Use the setMaxSplitSize() method in your constructor code; it should work. It effectively tells CombineFileInputFormat the maximum split size:
public class CFInputFormat extends CombineFileInputFormat<FileLineWritable, Text> {
public CFInputFormat(){
super();
setMaxSplitSize(67108864); // 64 MB, default block size on hadoop
}
public RecordReader<FileLineWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException{
return new CombineFileRecordReader<FileLineWritable, Text>((CombineFileSplit)split, context, CFRecordReader.class);
}
@Override
protected boolean isSplitable(JobContext context, Path file){
return false;
}
}
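Applied to the CombineSmallfileInputFormat from the question, that amounts to adding a constructor while keeping the existing WholeFileRecordReader and the NullWritable/Text types. This is only a sketch; the 26214400 value mirrors the 25 MB limit the driver was already trying to set via mapred.max.split.size.

public class CombineSmallfileInputFormat extends CombineFileInputFormat<NullWritable, Text> {
    public CombineSmallfileInputFormat() {
        super();
        // cap each combined split at 25 MB instead of relying on the
        // mapred.min/max.split.size configuration keys set in the driver
        setMaxSplitSize(26214400);
    }
    @Override
    public RecordReader<NullWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException {
        return new CombineFileRecordReader<NullWritable, Text>((CombineFileSplit) split, context, WholeFileRecordReader.class);
    }
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}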

Writing a value to file without moving to reducer

I have an input of records like this,
a|1|Y,
b|0|N,
c|1|N,
d|2|Y,
e|1|Y
Now, in the mapper, I have to check the value of the third column. If it is 'Y', that record has to be written directly to the output file without going to the reducer; otherwise, i.e. for 'N' records, it has to go to the reducer for further processing.
So,
a|1|Y,
d|2|Y,
e|1|Y
should not go to the reducer, but
b|0|N,
c|1|N
should go to the reducer and then to the output file.
How can I do this?
What you can probably do is use MultipleOutputs to separate out the 'Y' and 'N' records into two different files from the mappers.
Next, you run separate jobs on the two newly generated 'Y' and 'N' data sets.
For the 'Y' set, set the number of reducers to 0 so that no reducers are used. For the 'N' set, process it the way you want using reducers. A rough sketch of the first, map-only job is below.
Hope this helps.
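The named outputs "ytype" and "ntype" and the a|1|Y field layout used in the sketch are assumptions based on the sample records:

public static class SplitMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    private MultipleOutputs<NullWritable, Text> mos;
    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<NullWritable, Text>(context);
    }
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // records look like "a|1|Y"; | is a regex metacharacter, so escape it
        String[] parts = value.toString().split("\\|");
        if (parts[2].trim().startsWith("Y")) {
            mos.write("ytype", NullWritable.get(), value);   // 'Y' records
        } else {
            mos.write("ntype", NullWritable.get(), value);   // 'N' records
        }
    }
    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}

Both named outputs have to be registered in the driver with MultipleOutputs.addNamedOutput(), and this first job runs with job.setNumReduceTasks(0); the generated 'N' files then become the input of the second job, which uses a reducer.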
See if this works,
public class Xxxx {
public static class MyMapper extends
Mapper<LongWritable, Text, LongWritable, Text> {
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
FileSystem fs = FileSystem.get(context.getConfiguration());
Random r = new Random();
FileSplit split = (FileSplit)context.getInputSplit();
String fileName = split.getPath().getName();
FSDataOutputStream out = fs.create(new Path(fileName + "-m-" + r.nextInt()));
String parts[];
String line = value.toString();
String[] splits = line.split(",");
for(String s : splits) {
parts = s.split("\\|");
if(parts[2].equals("Y")) {
out.writeBytes(line);
}else {
context.write(key, value);
}
}
out.close();
fs.close();
}
}
public static class MyReducer extends
Reducer<LongWritable, Text, LongWritable, Text> {
public void reduce(LongWritable key, Iterable<Text> values,
Context context) throws IOException, InterruptedException {
for(Text t : values) {
context.write(key, t);
}
}
}
/**
* @param args
* @throws IOException
* @throws InterruptedException
* @throws ClassNotFoundException
*/
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
// TODO Auto-generated method stub
Configuration conf = new Configuration();
conf.set("fs.default.name", "hdfs://localhost:9000");
conf.set("mapred.job.tracker", "localhost:9001");
Job job = new Job(conf, "Xxxx");
job.setJarByClass(Xxxx.class);
Path outPath = new Path("/output_path");
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
FileInputFormat.addInputPath(job, new Path("/input.txt"));
FileOutputFormat.setOutputPath(job, outPath);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
In your map function, you will get the input line by line. Split it using | as the delimiter (with the String.split() method, to be exact). Note that | is a regex metacharacter, so it has to be escaped:
String[] line = value.toString().split("\\|");
Access the third element of this array with line[2].
Then, using a simple if/else statement, emit the 'N' records for further processing.
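A minimal sketch of that map method, assuming only the 'N' records should reach the reducer (the 'Y' records are handled in the mapper itself, as described in the answers above):

public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String[] fields = value.toString().split("\\|");
    if (fields[2].trim().startsWith("N")) {
        // only 'N' records are emitted to the reducer for further processing
        context.write(new Text(fields[0]), value);
    }
    // 'Y' records never reach the reducer; write them out in the mapper,
    // for example via MultipleOutputs or a direct HDFS write as shown above
}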
