Hadoop API: OutputFormat for Reducer

I am totally confused by the Hadoop API (I guess it's changing all the time).
If I am not wrong, JobConf was deprecated and we were supposed to use the Job and Configuration classes instead to run a MapReduce job from Java. It seems, though, that in the recently released Hadoop 1.0.0, JobConf is no longer deprecated!
So I am using the Job and Configuration classes to run a MapReduce job. Now I need to put the reducers' output files in a folder structure based on certain values that are part of my map output. I went through several articles and found that one can achieve that with an OutputFormat class, but we have this class in two packages:
org.apache.hadoop.mapred and
org.apache.hadoop.mapreduce
On our Job object we can set an output format class as:
job.setOutputFormatClass(SomeOutputFormat.class);
Now if SomeOutputFormat extends, say, org.apache.hadoop.mapreduce.lib.output.FileOutputFormat, we get one method named getRecordWriter(); this does not help in any way to override the output path.
There is another way using JobConf, but that again does not seem to work in terms of setting mapper, reducer, partitioner, sorting and grouping classes.
Is there something very obvious that I am missing? I want to write my reduce output files inside a folder that is based on a value, for example SomeOutputPrefix/Value1/Value2/realReduceFileName.
Thanks!

I think you need to implement your own output format class and your own RecordWriter, which will write different values to different places.
So your SomeOutputFormat will return new SomeRecordWriter("SomeOutputPrefix") in its getRecordWriter() method, and SomeRecordWriter will write different values to different folders.
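A rough sketch of that idea using the new (mapreduce) API. SomeRecordWriter, the Text key/value types, and the assumption that the first tab-separated field of the value names the sub-folder are all illustrative, not taken from the question:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SomeOutputFormat extends FileOutputFormat<Text, Text> {

    @Override
    public RecordWriter<Text, Text> getRecordWriter(TaskAttemptContext context)
            throws IOException, InterruptedException {
        // the job's output path acts as the "SomeOutputPrefix"
        return new SomeRecordWriter(FileOutputFormat.getOutputPath(context), context);
    }

    static class SomeRecordWriter extends RecordWriter<Text, Text> {
        private final Path outputPrefix;
        private final TaskAttemptContext context;
        // one open stream per sub-folder, so files are not reopened for every record
        private final Map<String, FSDataOutputStream> streams =
                new HashMap<String, FSDataOutputStream>();

        SomeRecordWriter(Path outputPrefix, TaskAttemptContext context) {
            this.outputPrefix = outputPrefix;
            this.context = context;
        }

        @Override
        public void write(Text key, Text value) throws IOException {
            // assumption: the first tab-separated field of the value holds "Value1/Value2"
            String subDir = value.toString().split("\t")[0];
            FSDataOutputStream out = streams.get(subDir);
            if (out == null) {
                Path file = new Path(new Path(outputPrefix, subDir),
                        "part-r-" + context.getTaskAttemptID().getTaskID().getId());
                out = file.getFileSystem(context.getConfiguration()).create(file, false);
                streams.put(subDir, out);
            }
            out.writeBytes(key + "\t" + value + "\n");
        }

        @Override
        public void close(TaskAttemptContext context) throws IOException {
            for (FSDataOutputStream out : streams.values()) {
                out.close();
            }
        }
    }
}

Note that writing your own files outside the framework-managed part files means you have to think about speculative execution and failed attempts yourself, as one of the answers further down this page points out; MultipleOutputs (also discussed below) takes care of much of that for you.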

Related

How to pass different set of data to two different mappers of the same job

I have one single Mapper, say SingleGroupIdentifierMapper.java.
Now this is a generic mapper which does all the filtration on a single line of mapper input value/record, based on a property file (containing filters and key-value field indexes) passed to it from the driver class using the distributed cache.
Only the reducer business logic is different; the mapper logic is kept generic and is driven by the property file as mentioned above.
Now my problem statement is that I have input from multiple sources, having different formats. That means I have to do something like:
MultipleInputs.addInputPath(conf, new Path("/inputA"),TextInputFormat.class, SingleGroupIdentifierMapper.class);
MultipleInputs.addInputPath(conf, new Path("/inputB"),TextInputFormat.class, SingleGroupIdentifierMapper.class);
But the cached property file which I pass from the driver class to the mapper, for implementing the filter based on field indexes, is common. So how can I pass two different property files to the same mapper, such that if it processes, say, input A, it will use PropertyFileA (to filter and create key-value pairs), and if it processes, say, input B, it will use PropertyFileB (to filter and create key-value pairs)?
It is possible to change the generic code of the mapper to take care of this scenario, BUT how do I approach this problem in the generic class, and how do I identify in the same mapper class whether the input is from inputA or inputB, and accordingly apply the property file configuration to the data?
Can we pass arguments to the constructor of this mapper class to specify that it is from inputB, or which property file in the cache it needs to read?
E.g. something like:
MultipleInputs.addInputPath(conf, new Path("/inputB"),TextInputFormat.class, args[], SingleGroupIdentifierMapper.class);
where args[] is passed to the SingleGroupIdentifierMapper class's constructor, which we define to take it as input and set it as an attribute.
Any thoughts or expertise are most welcome.
Hope I was able to express my problem clearly, kindly ask me in case there needs to be more clarity in the question.
Thanks in Advance,
Cheers :)
Unfortunately MultipleInputs is not that flexible. But there is a workaround which matches InputSplit paths to the property files in the setup method of the Mapper. If you are not using any sort of Combine*Format, then a single mapper will process a single split from a single file (a rough sketch follows after the steps):
When adding the property files into the distributed cache, use /propfile_1#PROPS_A and /propfile_2#PROPS_B.
Add the input paths into the configuration: job.getConfiguration().set("PROPS_A", "/inputA") and job.getConfiguration().set("PROPS_B", "/inputB").
In the Mapper.setup(Context context) method, use context.getInputSplit().toString() to get the path of the split. Then match it against the paths saved under context.getConfiguration().get("PROPS_A") or "PROPS_B".
If you are using some Combine*Format, then you would need to extend it, override getSplits so that it uses information from the JobContext to build the PathFilter[] and calls createPool, which will create splits that contain files from the same group (inputA or inputB).
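A minimal sketch of the non-combined case under those assumptions; the mapper's input/output types and the Properties-based filter loading are illustrative, not taken from the question:

import java.io.FileReader;
import java.io.IOException;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SingleGroupIdentifierMapper extends Mapper<LongWritable, Text, Text, Text> {

    // driver side (for reference):
    //   DistributedCache.addCacheFile(new URI("/propfile_1#PROPS_A"), job.getConfiguration());
    //   DistributedCache.addCacheFile(new URI("/propfile_2#PROPS_B"), job.getConfiguration());
    //   DistributedCache.createSymlink(job.getConfiguration()); // needed on older Hadoop versions
    //   job.getConfiguration().set("PROPS_A", "/inputA");
    //   job.getConfiguration().set("PROPS_B", "/inputB");

    private final Properties filterProps = new Properties();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        // e.g. "hdfs://namenode/inputA/part-00000:0+67108864" for a plain FileSplit
        String splitPath = context.getInputSplit().toString();

        String alias;
        if (splitPath.contains(conf.get("PROPS_A"))) {
            alias = "PROPS_A";          // this split comes from /inputA
        } else if (splitPath.contains(conf.get("PROPS_B"))) {
            alias = "PROPS_B";          // this split comes from /inputB
        } else {
            throw new IOException("Cannot map split to a property file: " + splitPath);
        }
        // the #ALIAS fragment makes the cached file appear under that name
        // in the task's working directory
        filterProps.load(new FileReader(alias));
    }

    // map(...) then uses filterProps to drive the generic filtering logic
}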

Issue with setting multiple projectionSchemas for AvroParquetInputFormat

I use AvroParquetInputFormat. The use case requires scanning multiple input directories, and each directory will have files with one schema. Since the AvroParquetInputFormat class cannot handle multiple input schemas, I created a workaround by statically creating multiple dummy classes like MyAvroParquetInputFormat1, MyAvroParquetInputFormat2, etc., where each class just inherits from AvroParquetInputFormat. For each directory, I set a different MyAvroParquetInputFormat, and that worked (please let me know if there is a cleaner way to achieve this).
My current problem is as follows:
Each file has a few hundred columns, and based on metadata I construct a projectionSchema for each directory to reduce unnecessary disk & network IO. I use the static setRequestedProjection() method on each of my MyAvroParquetInputFormat classes. But, being static, the last call's projectionSchema is used for reading data from all directories, which is not the required behavior.
Any pointers to workarounds/solutions would be highly appreciated.
Thanks & Regards
MK
Keep in mind that if your Avro schemas are compatible (see the Avro documentation for the definition of schema compatibility), you can access all the data with a single schema. Extending on this, it is also possible to construct a Parquet-friendly schema (no unions) that is compatible with all your schemas, so you can use just that one.
As for the approach you took, there is no easy way of doing this that I know of. You have to extend MultipleInputs functionality somehow to assign a different schema for each of your input formats. MultipleInputs works by setting two configuration properties in your job configuration:
mapreduce.input.multipleinputs.dir.formats //contains a comma separated list of InputFormat classes
mapreduce.input.multipleinputs.dir.mappers //contains a comma separated list of Mapper classes.
These two lists must be the same length. And this is where it gets tricky. This information is used deep within hadoop code to initialize mappers and input formats, so that's where you should add your own code.
As an alternative, I would suggest that you do the projection using one of the tools already available, such as hive. If there are not too many different schemas, you can write a set of simple hive queries to do the projection for each of the schemas, and after that you can use a single mapper to process the data or whatever the hell you want.

MultipleTextOutputFormat alternative in new API

As it stands, MultipleTextOutputFormat has not been migrated to the new API. So if we need to choose an output directory and output filename based on the key-value pair being written on the fly, what alternative do we have with the new mapreduce API?
I'm using AWS EMR Hadoop 1.0.3, and it is possible to specify different directories and files based on k/v pairs. Use either of the following functions from the MultipleOutputs class:
public void write(KEYOUT key, VALUEOUT value, String baseOutputPath)
or
public <K,V> void write(String namedOutput, K key, V value,
String baseOutputPath)
The former write method requires the key to be the same type as the map output key (in case you are using this in the mapper) or the same type as the reduce output key (in case you are using this in the reducer). The value must also be typed in similar fashion.
The latter write method requires the key/value types to match the types specified when you set up the MultipleOutputs named output using the addNamedOutput function:
public static void addNamedOutput(Job job,
String namedOutput,
Class<? extends OutputFormat> outputFormatClass,
Class<?> keyClass,
Class<?> valueClass)
So if you need different output types than the Context is using, you must use the latter write method.
The trick to getting different output directories is to pass a baseOutputPath that contains a directory separator, like this:
multipleOutputs.write("output1", key, value, "dir1/part");
In my case, this created files named "dir1/part-r-00000".
I was not successful in using a baseOutputPath that contains the .. directory, so all baseOutputPaths are strictly contained in the path passed to the -output parameter.
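A hedged sketch of how this fits together in a reducer; the reducer class, the Text types and the "output1" named output are illustrative assumptions, not code from the answer:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class DirSplittingReducer extends Reducer<Text, Text, Text, Text> {

    private MultipleOutputs<Text, Text> multipleOutputs;

    @Override
    protected void setup(Context context) {
        multipleOutputs = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // a '/' in baseOutputPath puts the file into a sub-directory,
            // e.g. key "dir1" -> files named dir1/part-r-00000
            multipleOutputs.write("output1", key, value, key.toString() + "/part");
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        multipleOutputs.close();   // without this, the named-output files may be incomplete
    }
}

// driver side (for reference):
//   MultipleOutputs.addNamedOutput(job, "output1", TextOutputFormat.class, Text.class, Text.class);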
For more details on how to set up and properly use MultipleOutputs, see this code I found (not mine, but I found it very helpful; it does not use different output directories): https://github.com/rystsov/learning-hadoop/blob/master/src/main/java/com/twitter/rystsov/mr/MultipulOutputExample.java
Similar to: Hadoop Reducer: How can I output to multiple directories using speculative execution?
Basically you can write to HDFS directly from your reducer - you'll just need to be wary of speculative execution and name your files uniquely, and then you'll need to implement your own OutputCommitter to clean up the aborted attempts (this is the most difficult part if you have truly dynamic output folders - you'll need to step through each folder and delete the attempts associated with aborted / failed tasks). A simple solution to this is to turn off speculative execution.
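For example, turning off speculative execution in the driver looks roughly like this, assuming job is your org.apache.hadoop.mapreduce.Job (these are the classic Hadoop 1.x property names; Hadoop 2+ uses mapreduce.map.speculative / mapreduce.reduce.speculative instead):

Configuration conf = job.getConfiguration();
conf.setBoolean("mapred.map.tasks.speculative.execution", false);
conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);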
For the best answer, turn to Hadoop: The Definitive Guide, 3rd ed. (starting at page 253).
An Excerpt from the HDG book -
"In the old MapReduce API, there are two classes for producing multiple outputs: MultipleOutputFormat and MultipleOutputs. In a nutshell, MultipleOutputs is more fully featured, but MultipleOutputFormat has more control over the output directory structure and file naming. MultipleOutputs in the new API combines the best features of the two multiple output classes in the old API."
It has an example of how you can control the directory structure, file naming and output format using the MultipleOutputs API.
HTH.

hadoop CustomWritables

I have more of a design question regarding the necessity of a CustomWritable for my use case:
So I have a document pair that I will process through a pipeline and write out intermediate and final data to HDFS. My key will be something like ObjectId - DocId - Pair - Lang. I do not see why/if I will need a CustomWritable for this use case. I guess if I did not have a key, I would need a CustomWritable? Also, when I write data out to HDFS in the Reducer, I use a Custom Partitioner. So, that would kind of eliminate my need for a Custom Writable?
I am not sure if I got the concept of the need for a Custom Writable right. Can someone point me in the right direction?
Writables can be used for de/serializing objects. For example, a log entry can contain a timestamp, a user IP and the browser agent. So you should implement your own WritableComparable for a key that identifies this entry, and you should implement a value class that implements Writable and reads and writes the attributes in your log entry.
These serializations are just a handy way to get the data from a binary format into an object. Some frameworks like HBase still require byte arrays to persist the data, so doing that conversion by yourself adds a lot of overhead and messes up your code.
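A minimal sketch of such a Writable value class for the log-entry example; the class and field names are illustrations only:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class LogEntryWritable implements Writable {

    private long timestamp;
    private Text userIp = new Text();
    private Text browserAgent = new Text();

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(timestamp);
        userIp.write(out);
        browserAgent.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        timestamp = in.readLong();
        userIp.readFields(in);
        browserAgent.readFields(in);
    }
}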
Thomas' answer explains a bit. It's way too late, but I'd like to add the following for prospective readers:
The Partitioner only comes into play between the map and reduce phases and has no role in writing from the reducer to the output files.
I don't believe writing INTERMEDIATE data to HDFS is a requirement in most cases, although there are some hacks that can be applied to do the same.
When you write from a reducer to HDFS, the keys will automatically be sorted and each reducer will write to ONE SEPARATE file. Keys are sorted based on their compareTo method. So if you want to sort based on multiple variables, go for a custom key class that implements WritableComparable, and implement the write, readFields and compareTo methods. You can then control the way the keys are sorted through the compareTo implementation.
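A hedged sketch of such a composite key, loosely modeled on the ObjectId - DocId - Pair - Lang key from the question (the field choice and sort order are assumptions):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class DocLangKey implements WritableComparable<DocLangKey> {

    private Text docId = new Text();
    private Text lang = new Text();

    @Override
    public void write(DataOutput out) throws IOException {
        docId.write(out);
        lang.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        docId.readFields(in);
        lang.readFields(in);
    }

    @Override
    public int compareTo(DocLangKey other) {
        int cmp = docId.compareTo(other.docId);                // primary sort on docId
        return cmp != 0 ? cmp : lang.compareTo(other.lang);    // then on lang
    }

    // also override hashCode()/equals() so the default HashPartitioner
    // sends equal keys to the same reducer
}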

Create Value class for Sequence Files at runtime

I have some types of data that I have to upload on HDFS as Sequence Files.
Initially, I had thought of creating a .jr file at runtime depending on the type of schema and using Hadoop's rcc DDL tool to create these classes and use them.
But looking at the rcc documentation, I see that it has been deprecated. I was trying to see what other options I have to create these value classes per type of data.
This is a problem because I only get to know the metadata of the data to be loaded (along with the data stream) at runtime. So I have no choice but to create the value class at runtime and then use it for writing (key, value) pairs to a SequenceFile.Writer and finally saving them on HDFS.
Is there any solution for this problem?
You can try looking at other serialization frameworks, like Protocol Buffers, Thrift, or Avro. You might want to look at Avro first, since it doesn't require static code generation, which might be more suitable for you.
Or, if you want something really quick and dirty, each record in the SequenceFile can be a map (e.g. a MapWritable) where the keys are the field names and the values are the field values.
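A quick sketch of that quick-and-dirty idea using Hadoop's MapWritable; the output path and the field names are invented for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class RuntimeSchemaSeqFileWriter {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path file = new Path("/data/runtime-records.seq");   // hypothetical HDFS path

        SequenceFile.Writer writer = SequenceFile.createWriter(
                FileSystem.get(conf), conf, file, Text.class, MapWritable.class);
        try {
            // build each record from whatever metadata you discover at runtime
            MapWritable record = new MapWritable();
            record.put(new Text("fieldA"), new Text("some value"));
            record.put(new Text("fieldB"), new IntWritable(42));

            writer.append(new Text("record-key-1"), record);
        } finally {
            writer.close();
        }
    }
}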
