I have a map reduce job running. Within the job I want to output some data to one file and some other data to another file. How can this be achieved? Please help me, as I am new to Hadoop MapReduce. Can someone give examples?
There is an OutputFormat class called MultipleOutputFormat that can be used instead of the default TextOutputFormat.
As stated in the documentation:
This abstract class extends the FileOutputFormat, allowing to write the output data to different output files. There are three basic use cases for this class. Case one: This class is used for a map reduce job with at least one reducer. The reducer wants to write data to different files depending on the actual keys. It is assumed that a key (or value) encodes the actual key (value) and the desired location for the actual key (value). Case two: This class is used for a map only job. The job wants to use an output file name that is either a part of the input file name of the input data, or some derivation of it. Case three: This class is used for a map only job. The job wants to use an output file name that depends on both the keys and the input file name.
Since this is an abstract class, you'll need to use one of its implementations, most probably MultipleTextOutputFormat.
To use an OutputFormat other than the default TextOutputFormat, specify it when creating and configuring the job:
Configuration conf = this.getConf();
Job job = Job.getInstance(conf, "job-name");
job.setJarByClass(getClass());
job.setMapperClass(MyMapper.class);
job.setCombinerClass(MyCombiner.class);
job.setReducerClass(MyReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
job.setOutputFormatClass(MultipleTextOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path(args[1]));
...
Hope it helps.
I have a single Mapper, say SingleGroupIdentifierMapper.java.
This is a generic mapper which does all the filtration on a single line of mapper input value/record, based on a property file (containing filters and key-value field indexes) passed to it from the driver class via the distributed cache.
Only the reducer business logic is different; the mapper logic has been kept generic and is driven by the property file as mentioned above.
Now my problem is that I have input from multiple sources, each with a different format. That means I have to do something like
MultipleInputs.addInputPath(conf, new Path("/inputA"),TextInputFormat.class, SingleGroupIdentifierMapper.class);
MultipleInputs.addInputPath(conf, new Path("/inputB"),TextInputFormat.class, SingleGroupIdentifierMapper.class);
But the cached property file which I pass from the driver class to the mapper (to implement filtering based on field indexes) is common to both. So how can I pass two different property files to the same mapper, such that if it processes, say, input A it uses PropertyFileA (to filter and create key-value pairs), and if it processes, say, input B it uses PropertyFileB (to filter and create key-value pairs)?
It is possible to change the generic code of the mapper to take care of this scenario, but how should I approach this in the generic class, and how do I identify inside the same mapper class whether the input is from inputA or inputB so I can apply the corresponding property file configuration to the data?
Can we pass arguments to the constructor of this mapper class to specify that the input is from inputB, or which property file in the cache it needs to read?
E.g. something like:
MultipleInputs.addInputPath(conf, new Path("/inputB"),TextInputFormat.class, args[], SingleGroupIdentifierMapper.class);
where args[] is passed to the SingleGroupIdentifierMapper class's constructor, which we define to take it as input and set it as an attribute.
Any thoughts or expertise is most welcomed.
I hope I was able to express my problem clearly; kindly ask if anything in the question needs more clarity.
Thanks in Advance,
Cheers :)
Unfortunately MultipleInputs is not that flexible. But there is a workaround: match the InputSplit path to the property files in the setup method of the Mapper. If you are not using any sort of Combine*Format, then a single mapper will process a single split from a single file (a sketch follows the steps below):
When adding the property files to the cache, use /propfile_1#PROPS_A and /propfile_2#PROPS_B
Store the input paths in the configuration: job.getConfiguration().set("PROPS_A", "/inputA") and job.getConfiguration().set("PROPS_B", "/inputB")
In the Mapper.setup(Context context) method, use context.getInputSplit().toString() to get the path of the split. Then match it against the paths saved under context.getConfiguration().get("PROPS_A") or PROPS_B
If you are using some Combine*Format, then you would need to extend it and override getSplits so that it uses information from the JobContext to build the PathFilter[] and calls createPool, which will create splits that contain files from the same group (inputA or inputB).
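A minimal sketch of the third step, assuming the cache symlinks (#PROPS_A, #PROPS_B) and the configuration keys from the first two steps; the mapper's key/value types and the way the filter properties are loaded are illustrative:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SingleGroupIdentifierMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Properties filterProps = new Properties();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        String splitPath = context.getInputSplit().toString();

        // Pick the cache symlink whose input path matches this split
        String propsLink;
        if (splitPath.contains(conf.get("PROPS_A"))) {
            propsLink = "PROPS_A";   // symlink created by /propfile_1#PROPS_A
        } else if (splitPath.contains(conf.get("PROPS_B"))) {
            propsLink = "PROPS_B";   // symlink created by /propfile_2#PROPS_B
        } else {
            throw new IOException("No property file mapped for split " + splitPath);
        }

        try (InputStream in = new FileInputStream(propsLink)) {
            filterProps.load(in);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // apply the filters from filterProps to build and emit the key/value pair
    }
}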
I know that a writable object can be passed to a mapper using something like:
DefaultStringifier.store(conf, object, "key");
object = DefaultStringifier.load(conf, "key", Class);
My question is:
In a mapper I read out the object and then change its value,
for example: object = another.
How do I make sure that the change to the object's value
is visible to the next mapper task?
Is there any better way to pass parameters to a mapper?
Use the file system instead. Write the value to HDFS, and overwrite the file whenever the value changes. Neither the configuration nor the DistributedCache is appropriate for mutable state.
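For illustration, a minimal sketch of that approach, assuming a fixed HDFS path for the state file and a plain-text value (both made up for the example):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsStateStore {

    // Illustrative location of the mutable state
    private static final Path STATE_FILE = new Path("/state/current-value.txt");

    // Overwrite the state file with the new value
    public static void write(Configuration conf, String value) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(STATE_FILE, true)) { // true = overwrite
            out.write(value.getBytes(StandardCharsets.UTF_8));
        }
    }

    // Read the current value, e.g. from Mapper.setup()
    public static String read(Configuration conf) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        try (FSDataInputStream in = fs.open(STATE_FILE)) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            IOUtils.copyBytes(in, buf, 4096, false);
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}

Bear in mind that concurrent tasks overwriting the same file will race with each other, so this only makes sense when a single task (or the driver between jobs) updates the value.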
I'm a newbie in Hadoop!
Now I am trying to use MultipleOutputFormat with Hadoop 2.2.0, but it seems to work only with the deprecated JobConf, which in turn uses the deprecated Mapper and Reducer (org.apache.hadoop.mapred.Reducer), etc. Any ideas how to achieve multiple-output functionality with the new org.apache.hadoop.mapreduce.Job?
As #JudgeMental noted, you should use MultipleOutputs with the new API (mapreduce) because MultipleOutputFormat only supports the old API (mapred). MultipleOutputs actually provides you more features than MultipleOutputFormat:
With MultipleOutputs, each output can have its own OutputFormat, whereas with MultipleOutputFormat every output has to be the same OutputFormat.
With MultipleOutputFormat you have more control over the naming scheme and output directory structure than MultipleOutputs.
You can use MultipleOutputs in the map and reduce functions in the same job, something that you cannot do with MultipleOutputFormat.
You can have different key and value types for different outputs with MultipleOutputs.
So the two are not mutually exclusive: even though MultipleOutputs has more features, it is less flexible regarding the naming capabilities.
To learn how to use MultipleOutputs, you should just take a look at this documentation which contains a complete example. In short, here is what you would put in the driver class:
// Defines additional single text based output 'text' for the job
MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class, LongWritable.class, Text.class);
// Defines additional sequence-file based output 'seq' for the job
MultipleOutputs.addNamedOutput(job, "seq", SequenceFileOutputFormat.class, LongWritable.class, Text.class);
And in your Mapper or Reducer you should initialize your MultipleOutputs in the setup method with MultipleOutputs mos = new MultipleOutputs(context); you can then use it in the map and reduce functions as mos.write("seq", new LongWritable(1), new Text("Bye"), "seq_a"). Don't forget to close it in the cleanup method with mos.close()!
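Putting that together, a minimal reducer sketch wired to the two named outputs above; the class name, the input key/value types and the reduce logic are illustrative:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class MultiOutputReducer extends Reducer<Text, Text, LongWritable, Text> {

    private MultipleOutputs<LongWritable, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<LongWritable, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // goes to the 'text' named output (TextOutputFormat)
            mos.write("text", new LongWritable(1), value);
            // goes to the 'seq' named output (SequenceFileOutputFormat),
            // under files prefixed with "seq_a"
            mos.write("seq", new LongWritable(1), new Text("Bye"), "seq_a");
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}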
As it stands, MultipleTextOutputFormat has not been migrated to the new API. So if we need to choose the output directory and output filename on the fly, based on the key-value pair being written, what alternative do we have with the new mapreduce API?
I'm using AWS EMR Hadoop 1.0.3, and it is possible to specify different directories and files based on k/v pairs. Use either of the following functions from the MultipleOutputs class:
public void write(KEYOUT key, VALUEOUT value, String baseOutputPath)
or
public <K, V> void write(String namedOutput, K key, V value, String baseOutputPath)
The former write method requires the key to be the same type as the map output key (in case you are using this in the mapper) or the same type as the reduce output key (in case you are using this in the reducer). The value must also be typed in similar fashion.
The latter write method requires the key/value types to match the types specified when you set up the named output using the static MultipleOutputs.addNamedOutput function:
public static void addNamedOutput(Job job,
String namedOutput,
Class<? extends OutputFormat> outputFormatClass,
Class<?> keyClass,
Class<?> valueClass)
So if you need different output types than the Context is using, you must use the latter write method.
The trick to getting different output directories is to pass a baseOutputPath that contains a directory separator, like this:
multipleOutputs.write("output1", key, value, "dir1/part");
In my case, this created files named "dir1/part-r-00000".
I was not successful in using a baseOutputPath that contains the .. directory, so all baseOutputPaths are strictly contained in the path passed to the -output parameter.
For more details on how to set up and properly use MultipleOutputs, see this code I found (not mine, but I found it very helpful; it does not use different output directories): https://github.com/rystsov/learning-hadoop/blob/master/src/main/java/com/twitter/rystsov/mr/MultipulOutputExample.java
Similar to: Hadoop Reducer: How can I output to multiple directories using speculative execution?
Basically you can write to HDFS directly from your reducer. You'll just need to be wary of speculative execution and name your files uniquely, and you'll need to implement your own OutputCommitter to clean up the aborted attempts (this is the most difficult part if you have truly dynamic output folders: you'll need to step through each folder and delete the attempts associated with aborted/failed tasks). A simpler solution is to turn off speculative execution.
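For illustration, here is a minimal sketch of the simpler variant (speculative execution turned off, one file per key per task attempt); the output root /output and the value formatting are made up:

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class DirectHdfsReducer extends Reducer<Text, Text, Text, Text> {

    private FileSystem fs;

    @Override
    protected void setup(Context context) throws IOException {
        fs = FileSystem.get(context.getConfiguration());
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // The key chooses the folder; the task attempt id keeps file names
        // unique across retries of the same task.
        Path out = new Path("/output/" + key.toString() + "/"
                + context.getTaskAttemptID().toString());
        try (FSDataOutputStream stream = fs.create(out, true)) {
            for (Text value : values) {
                stream.writeBytes(value.toString() + "\n");
            }
        }
    }
}

In the driver you would also disable speculative execution for reducers (e.g. job.setReduceSpeculativeExecution(false)), otherwise two attempts of the same task can leave overlapping files behind.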
For the best answer, turn to Hadoop: The Definitive Guide, 3rd ed. (starting at page 253).
An excerpt from the book:
"In the old MapReduce API, there are two classes for producing multiple outputs: MultipleOutputFormat and MultipleOutputs. In a nutshell, MultipleOutputs is more fully featured, but MultipleOutputFormat has more control over the output directory structure and file naming. MultipleOutputs in the new API combines the best features of the two multiple output classes in the old API."
It has an example of how you can control the directory structure, file naming and output format using the MultipleOutputs API.
HTH.
I am totally confused with the Hadoop API. (I guess it's changing all the time.)
If I am not wrong, JobConf was deprecated and we were supposed to use the Job and Configuration classes instead to run a map reduce job from Java. It seems, though, that in the recently released Hadoop 1.0.0 JobConf is no longer deprecated!
So I am using the Job and Configuration classes to run a map reduce job. Now I need to put the reducer output files in a folder structure based on certain values that are part of my map output. I went through several articles and found that one can achieve that with an OutputFormat class, but we have this class in two packages:
org.apache.hadoop.mapred and
org.apache.hadoop.mapreduce
In our job object we can set an output format class as:
job.setOutputFormatClass(SomeOutputFormat.class);
Now if SomeOutputFormat extends, say, org.apache.hadoop.mapreduce.lib.output.FileOutputFormat, we get one method named getRecordWriter(); this does not help in any way to override the output path.
There is another way using JobConf, but that again does not seem to work in terms of setting the mapper, reducer, partitioner, sorting and grouping classes.
Is there something very obvious that I am missing? I want to write my reduce output file inside a folder which is based on a value. For example: SomeOutputPrefix/Value1/Value2/realReduceFileName
Thanks!
I think you need to implement
your own output format class and
your own RecordWriter, which will write different values to different places.
So your SomeOutputFormat will return new SomeRecordWriter("SomeOutputPrefix") in its getRecordWriter() method, and SomeRecordWriter will write different values to different folders.
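A rough sketch of that idea with the new API, assuming Text keys and values and that the target folder is encoded in the value; the names SomeOutputFormat/SomeRecordWriter and the value-to-folder mapping are illustrative:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SomeOutputFormat extends FileOutputFormat<Text, Text> {

    @Override
    public RecordWriter<Text, Text> getRecordWriter(TaskAttemptContext context)
            throws IOException {
        // Write under the task attempt's work directory so that only committed
        // (non-failed, non-speculative) attempts are promoted to the final output.
        Path workDir = ((FileOutputCommitter) getOutputCommitter(context)).getWorkPath();
        return new SomeRecordWriter(workDir, context);
    }

    public static class SomeRecordWriter extends RecordWriter<Text, Text> {

        private final Path workDir;
        private final TaskAttemptContext context;
        private final Map<String, FSDataOutputStream> streams =
                new HashMap<String, FSDataOutputStream>();

        SomeRecordWriter(Path workDir, TaskAttemptContext context) {
            this.workDir = workDir;
            this.context = context;
        }

        @Override
        public void write(Text key, Text value) throws IOException {
            // Illustrative: derive the sub-folder from the value,
            // e.g. a value of "Value1\tValue2" becomes "Value1/Value2"
            String subDir = value.toString().replace('\t', '/');
            FSDataOutputStream out = streams.get(subDir);
            if (out == null) {
                Path file = new Path(new Path(workDir, subDir),
                        "part-" + context.getTaskAttemptID().getTaskID().getId());
                FileSystem fs = file.getFileSystem(context.getConfiguration());
                out = fs.create(file, false);
                streams.put(subDir, out);
            }
            out.writeBytes(key.toString() + "\t" + value.toString() + "\n");
        }

        @Override
        public void close(TaskAttemptContext context) throws IOException {
            for (FSDataOutputStream out : streams.values()) {
                out.close();
            }
        }
    }
}

In the driver you would then call job.setOutputFormatClass(SomeOutputFormat.class) and FileOutputFormat.setOutputPath(job, new Path("SomeOutputPrefix")); the sub-folders created under the work directory are moved into SomeOutputPrefix when the task commits.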