Creating Sequence File Format for Hadoop MR - hadoop

I was working with Hadoop MapRedue, and had a question.
Currently, my mapper's input KV type is LongWritable, LongWritable type and
output KV type is also LongWritable, LongWritable type.
InputFileFormat is SequenceFileInputFormat.
Basically What I want to do is to change a txt file into SequenceFileFormat so that I can use this into my mapper.
What I would like to do is
input file is something like this
1\t2 (key = 1, value = 2)
2\t3 (key = 2, value = 3)
and on and on...
I looked at this thread How to convert .txt file to Hadoop's sequence file format but reliazing that TextInputFormat only support Key = LongWritable and Value = Text
Is there any way to get txt and make a sequence file in KV = LongWritable, LongWritable?

Sure, basically the same way I told in the other thread you've linked. But you have to implement your own Mapper.
Just a quick scratch for you:
public class LongLongMapper extends
Mapper<LongWritable, Text, LongWritable, LongWritable> {
#Override
protected void map(LongWritable key, Text value,
Mapper<LongWritable, Text, LongWritable, LongWritable>.Context context)
throws IOException, InterruptedException {
// assuming that your line contains key and value separated by \t
String[] split = value.toString().split("\t");
context.write(new LongWritable(Long.valueOf(split[0])), new LongWritable(
Long.valueOf(split[1])));
}
public static void main(String[] args) throws IOException,
InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
Job job = new Job(conf);
job.setJobName("Convert Text");
job.setJarByClass(LongLongMapper.class);
job.setMapperClass(Mapper.class);
job.setReducerClass(Reducer.class);
// increase if you need sorting or a special number of files
job.setNumReduceTasks(0);
job.setOutputKeyClass(LongWritable.class);
job.setOutputValueClass(LongWritable.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
job.setInputFormatClass(TextInputFormat.class);
FileInputFormat.addInputPath(job, new Path("/input"));
FileOutputFormat.setOutputPath(job, new Path("/output"));
// submit and wait for completion
job.waitForCompletion(true);
}
}
Each value in your mapper function will get a line of your input, so we are just splitting it by your delimiter (tab) and parsing each part of it into longs.
That's it.

Related

Getting the partition id of input file in Hadoop

I need to know the row index of the partitions of the input file that I'm using. I could force this in the original file by concatenating the row index to the data but I'd rather have a way of doing this in Hadoop. I have this in my mapper...
String id = context.getConfiguration().get("mapreduce.task.partition");
But "id" is 0 in every case. In the "Hadoop: The Definitive Guide" it mentions accessing properties like the partition id "can be accessed from the context object passed to all methods of the Mapper or Reducer". It does not, from what I can tell, actually go into how to access this information.
I went through the documentation for the Context object and it seems like the above is the way to do it and the script does compile. But since I'm getting 0 for every value, I'm not sure if I'm actually using the right thing and I'm unable to find any detail online that could help in figuring this out.
Code used to test...
public class Test {
public static class TestMapper extends Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String id = context.getConfiguration().get("mapreduce.task.partition");
context.write(new Text("Test"), new Text(id + "_" + value.toString()));
}
}
public static class TestReducer extends Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
for(Text value : values) {
context.write(key, value);
}
}
}
public static void main(String[] args) throws Exception {
if(args.length != 2) {
System.err.println("Usage: Test <input path> <output path>");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(Test.class);
job.setJobName("Test");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(TestMapper.class);
job.setReducerClass(TestReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Two options are:
Use the offset instead of the row number
Track the line number in the mapper
For the first one, the key which is LongWritable tells you the offset of the line being processed. Unless your lines are exactly the same length, you won't be able to calculate the line number from an offset, but it does allow you to determine ordering if thats useful.
The second option is to just track it in the mapper. You could change your code to something like:
public static class TestMapper extends Mapper<LongWritable, Text, Text, Text> {
private long currentLineNum = 0;
private Text test = new Text("Test");
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
context.write(test, new Text(currentLineNum + "_" + value));
currentLineNum++;
}
}
You could also represent your matrix as lines of tuples and include the row and col on every tuple so when you're reading in the file, you have that information. If you use a file that is just space or comma seperated values that make up a 2D array, it'll be extremely hard to figure out what line (row) you are currently working on in the mapper

If 2 Mappers output the same key , what will the input to the reducer be?

I've the following doubt while learning Map reduce. It will be of great help if some one could answer.
I've two mappers working on the same file - I configured them using MultipleInputFormat
mapper 1 - Expected Output [ after extracting few columns of a file]
a - 1234
b - 3456
c - 1345
Mapper 2 Expected output [After extracting few columns of the same file]
a - Monday
b - Tuesday
c - Wednesday
And there is a reducer function that just outputs the key and value pair that it gets as input
So I expected the output to be as I know that similar keys will be shuffled to make a list.
a - [1234,Monday]
b - [3456, Tuesday]
c - [1345, Wednesday]
But am getting some weird output.I guess only 1 Mapper is getting run.
Should this not be expected ? Will the output of each mapper be shuffled separately ? Will both the mappers run parallel ?
Excuse me if its a lame question Please understand that I am new to Hadoop and Map Reduce
Below is the code
//Mapper1
public class numbermapper extends Mapper<Object, Text, Text, Text>{
public void map(Object key,Text value, Context context) throws IOException, InterruptedException {
String record = value.toString();
String[] parts = record.split(",");
System.out.println("***Mapper number output "+parts[0]+" "+parts[1]);
context.write(new Text(parts[0]), new Text(parts[1]));
}
}
//Mapper2
public class weekmapper extends Mapper<Object, Text, Text, Text> {
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String record = value.toString();
String[] parts = record.split(",");
System.out.println("***Mapper week output "+parts[0]+" "+parts[2]);
context.write(new Text(parts[0]), new Text(parts[2]));
}
}
//Reducer
public class rjoinreducer extends Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Text values, Context context)
throws IOException, InterruptedException {
context.write(key, values);
}
}
//Driver class
public class driver {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "Reduce-side join");
job.setJarByClass(numbermapper.class);
job.setReducerClass(rjoinreducer.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
MultipleInputs.addInputPath(job, new Path(args[0]),TextInputFormat.class, numbermapper.class);
MultipleInputs.addInputPath(job, new Path(args[0]),TextInputFormat.class, weekmapper.class);
Path outputPath = new Path(args[1]);
FileOutputFormat.setOutputPath(job, outputPath);
outputPath.getFileSystem(conf).delete(outputPath);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
And this is the O/P I got-
a Monday
b Tuesday
c Wednesday
Dataset used
a,1234,Monday
b,3456,Tuesday
c,1345,Wednesday
Multiple input format was just taking 1 file and running one mapper on it because I have given the same path for both the Mappers.
When I copy the dataset to a different file and ran the same program taking two different files (same content but different names for the files) I got the expected output.
So i now understood that the output from different mapper functions is also combined based on key , not just the output from the same mapper function.
Thanks for trying to help....!!!

mapper behaving differently on local and on the cluster

I run a map only job(on Hadoop) in order to sort the key values, because it's said "Hadoop automatically sorts data emitted by mappers before being sent to reducers".
input file
2013-04-15 835352
2013-04-16 846299
2013-04-17 828286
2013-04-18 747767
2013-04-19 807924
I think Map(second_cloumn, first_column) should sort this file as shown in output1. It actually did when i run this on my local machine. But when i run this on a cluster, the output is like shown in output2.
output1 file
747767 2013-04-18
807924 2013-04-19
828286 2013-04-17
835352 2013-04-15
846299 2013-04-16
output2 file
835352 2013-04-15
747767 2013-04-18
807924 2013-04-19
828286 2013-04-17
846299 2013-04-16
How can I guarantee it to be always like in the output1. I am open for another suggestions for sorting by the second column.
Mapper
public class MapAccessTime1 extends Mapper<LongWritable, Text, IntWritable, Text> {
private IntWritable one = new IntWritable(1);
private Text word = new Text();
#Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
int val = 0;
StringTokenizer tokenizer = new StringTokenizer(line);
if (!line.startsWith("#")) {
if (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
}
if (tokenizer.hasMoreTokens()) {
val = Integer.parseInt(tokenizer.nextToken());
one = new IntWritable(val);
context.write(one, word);
}
}
}
}
MapOnly job doesn't do shuffle and sorting. Using an identity reducer solves my problem.

Hadoop MultipleOutputs does not write to multiple files when file formats are custom format

I am trying to read from cassandra and write the reducers output to multiple output files using MultipleOutputs api (Hadoop version 1.0.3). The file formats in my case are custom output formats extending FileOutputFormat. I have configured my job in a similar manner as shown in MultipleOutputs api.
However, when I run the job, I only get one output file named part-r-0000 which is in text output format. If job.setOutputFormatClass() is not set, by default it considers TextOutputFormat to be the format. Also it will only allow one of the two format classes to be initialized. It completely ignores the output formats I specified in MulitpleOutputs.addNamedOutput(job, "format1", MyCustomFileFormat1.class, Text.class, Text.class) and MulitpleOutputs.addNamedOutput(job, "format2", MyCustomFileFormat2.class, Text.class, Text.class). Is someone else facing similar problem or am I doing something wrong ?
I also tried to write a very simple MR program which reads from a text file and writes the output in 2 formats TextOutputFormat and SequenceFileOutputFormat as shown in the MultipleOutputs api. However, no luck there as well. I get only 1 output file in text output format.
Can someone help me with this ?
Job job = new Job(getConf(), "cfdefGen");
job.setJarByClass(CfdefGeneration.class);
//read input from cassandra column family
ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);
job.setInputFormatClass(ColumnFamilyInputFormat.class);
job.getConfiguration().set("cassandra.consistencylevel.read", "QUORUM");
//thrift input job configurations
ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
ConfigHelper.setInputInitialAddress(job.getConfiguration(), HOST);
ConfigHelper.setInputPartitioner(job.getConfiguration(), "RandomPartitioner");
SlicePredicate predicate = new SlicePredicate().setColumn_names(Arrays.asList(ByteBufferUtil.bytes("classification")));
//ConfigHelper.setRangeBatchSize(job.getConfiguration(), 2048);
ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);
//specification for mapper
job.setMapperClass(MyMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
//specifications for reducer (writing to files)
job.setReducerClass(ReducerToFileSystem.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
//job.setOutputFormatClass(MyCdbWriter1.class);
job.setNumReduceTasks(1);
//set output path for storing output files
Path filePath = new Path(OUTPUT_DIR);
FileSystem hdfs = FileSystem.get(getConf());
if(hdfs.exists(filePath)){
hdfs.delete(filePath, true);
}
MyCdbWriter1.setOutputPath(job, new Path(OUTPUT_DIR));
MultipleOutputs.addNamedOutput(job, "cdb1', MyCdbWriter1.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, "cdb2", MyCdbWriter2.class, Text.class, Text.class);
boolean success = job.waitForCompletion(true);
return success ? 0:1;
public static class ReducerToFileSystem extends Reducer<Text, Text, Text, Text>
{
private MultipleOutputs<Text, Text> mos;
public void setup(Context context){
mos = new MultipleOutputs<Text, Text>(context);
}
//public void reduce(Text key, Text value, Context context)
//throws IOException, InterruptedException (This was the mistake, changed the signature and it worked fine)
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException
{
//context.write(key, value);
mos.write("cdb1", key, value, OUTPUT_DIR+"/"+"cdb1");
mos.write("cdb2", key, value, OUTPUT_DIR+"/"+"cdb2");
context.progress();
}
public void cleanup(Context context) throws IOException, InterruptedException {
mos.close();
}
}
public class MyCdbWriter1<K, V> extends FileOutputFormat<K, V>
{
#Override
public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException
{
}
public static void setOutputPath(Job job, Path outputDir) {
job.getConfiguration().set("mapred.output.dir", outputDir.toString());
}
protected static class CdbDataRecord<K, V> extends RecordWriter<K, V>
{
#override
write()
close()
}
}
I found my mistake after debugging that my reduce method is never called. I found that my function definition did not match API's definition, changed it from public void reduce(Text key, Text value, Context context) to public void reduce(Text key, Iterable<Text> values, Context context). I don't know why reduce method does not have #Override tag, it would have prevented my mistake.
I also encountered a similar issue - mine turned out to be that I was filtering all my records in the Map process so nothing was being passed to Reduce. With un-named multiple outputs in the reduce task, this still resulted in a _SUCCESS file and an empty part-r-00000 file.

Make use of the relation name/table name/file name in Hadoop's MapReduce

Is there a way to use the relation name in MapReduce's Map and Reduce? I am trying to do Set difference using Hadoop's MapReduce.
Input: 2 files R and S containing list of terms. (Am going to use t to denote a term)
Objective: To find R - S, i.e. terms in R and not in S
Approach:
Mapper: Spits out t -> R or t -> S, depending on whether t comes from R or S. So, the map output has the t as the key and the file name as the value.
Reducer: If the value list for a t contains only R, then output t -> t.
Do I need to some how tag the terms with the filename? Or is there any other way?
Source code for something I did for Set Union (doesn't need file name anywhere in this case). Just wanted to use this as an example to illustrate the unavailability of filename in Mapper.
public class Union {
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable key, Text value, OutputCollector output, Reporter reporter) throws IOException {
output.collect(value, value);
}
}
public static class Reduce extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException{
while (values.hasNext())
{
output.collect(key, values.next());
break;
}
}
}
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(Union.class);
conf.setJobName("Union");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.set("mapred.job.queue.name", "myQueue");
conf.setNumReduceTasks(5);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
}
As you can see I can't identify which key -> value pair (input to the Mapper) came from which file. Am I overlooking something simple here?
Thanks much.
I would implement your question just like you answered. That is just the way MapReduce was meant to be.
I guess your problem was actually writing n-times the same value into the HDFS?
EDIT:
Pasted from my Comment down there
Ah I got it ;) I'm not really familiar with the "old" API, but you can "query" your Reporter with:
reporter.getInputSplit();
This returns you an interface called InputSplit. This is easily castable to "FileSplit". And within FileSplit object you could obtain the Path with: "split.getPath()". And from the Path object you just need to call the getName() method.
So this snippet should work for you:
FileSplit fsplit = reporter.getInputSplit(); // maybe cast it down to FileSplit if needed..
String yourFileName = fsplit.getPath().getName();

Resources