Control the Reducer Result Output File/Bucket - hadoop

I have an application where I would like to make my reducers (I have several of them for a map/reduce job) to record their outputs into different files on the HDFS depending on the key coming to them for processing. So if the reducer sees a key of say type A, apply the reduce the logic but tell Hadoop to put the result into the hdfs file belonging to type A result and so on. Obviously multiple reducers can will be outputting different portion of the type A result and each reducer can end up working on any type like A or B but tell hadoop to write the result into the type A bucket or something
Is this possible?

MultipleOutputs is almost what you are looking for (assuming you are at least at version 0.21). In my own work I have used a clone of this class modified to be more flexible about naming conventions to send output to different folders/files based on anything I want, including aspects of the input records (keys or values). As is, the class has some draconian restrictions on what names you can give to the outputs.

Related

Hadoop streaming with multiple input files

I want to build an inverted index from a set of files with Hadoop using the Streaming API. The documentation always refers to using a file whose lines have the entries to the mapper to be fed. But in this case, I have multiple input files, and I need the mappers to process only one file at a time. Is there a way to accomplish that. For preprocessing reasons, I need the input to be like this, and I cannot have the input in the classic line = key, value format that the documentation refers.
By default a mapper only processes one file, unless you use an input class that allow combine inputs like CombineFileInputFormat.
Then, if you have 10 files you will end with 10 mappers and each of them will process only one file. If you are only using mappers (not reducers) that will end in 10 outputs files (one for each mapper).
In the other side, if you have enough big splittable files, it is possible that one file be processed by several mappers at the same time.

Hadoop Streaming Job with no input file

Is it possible to execute a Hadoop Streaming job that has no input file?
In my use case, I'm able to generate the necessary records for the reducer with a single mapper and execution parameters. Currently, I'm using a stub input file with a single line, I'd like to remove this requirement.
We have 2 use cases in mind.
1)
I want to distribute the loading of files into hdfs from a network location available to all nodes. Basically, I'm going to run ls in the mapper and send the output to a small set of reducers.
We are going to be running fits leveraging several different parameter ranges against several models. The model names do not change and will go to the reducer as keys while the list of tests to run is generated in the mapper.
According to the docs this is not possible. The following are required parameters for execution:
input directoryname or filename
output directoryname
mapper executable or JavaClassName
reducer executable or JavaClassName
It looks like providing a dummy input file is the way to go currently.

how output files(part-m-0001/part-r-0001) are created in map reduce

I understand that the map reduce output are stored in files named like part-r-* for reducer and part-m-* for mapper.
When I run a mapreduce job sometimes a get the whole output in a single file(size around 150MB), and sometimes for almost same data size I get two output files(one 100mb and other 50mb). This seems very random to me. I cant find out any reason for this.
I want to know how its decided to put that data in a single or multiple output files. and if any way we can control it.
Thanks
Unlike specified in the answer by Jijo here - the number of the files depends on on the number of Reducers/Mappers.
It has nothing to do with the number of physical nodes in the cluster.
The rule is: one part-r-* file for one Reducer. The number of Reducers is set by job.setNumReduceTasks();
If there are no Reducers in your job - then one part-m-* file for one Mapper. There is one Mapper for one InputSplit (usually - unless you use custom InputFormat implementation, there is one InputSplit for one HDFS block of your input data).
The number of output files part-m-* and part-r-* is set according to the number of map tasks and the number of reduce tasks respectively.

How to go through the OutputFormat.RecordWriter write(key,value) twice in Hadoop

I have a situation where I need to go through the key/value pairs of my OutputFormat twice. In essence:
OutputFormat.getRecordWriter() // returns RecordWriteType1
... and when all those are complete across all machines
OutputFormat.getRecordWriter() // return RecordWriterType2
The typing of both RecordWriterType1/2 are the same. Is there a way to do this?
Thank you,
Marko.
Unfortunately you cannot simply run over the reducer data twice.
You do have some options to possibly work around:
Use an identity reducer to output the sorted data to HDFS, then run two jobs over the data with identity mappers - wasteful but simple if you don't have that much data
As above, but you could use map only jobs and the key comparator to emulate the reducer function as you know the input is already sorted (you'll need to make sure the split size is set sufficiently large to ensure all data from the first reducer output file is processed in a single mapper and not split over 2+ mapper instances
You could write the reducer key/values to local disk in your reducer, and then in the clean up method of the reducer, opening the local file up and process as detailed in the second option (using the group comparator to detemine key boundary).
If you dig through the source for ReduceTask, you may even be able to 'abuse' the merged sorted segments on local disk and run over the data again, but this option is pure unadulterated hackery...

hadoop streaming ensuring one key per reducer

I have a mapper that, while processing data, classifies output into 3 different types (type is the output key). My goal is to create 3 different csv files via the reducers, each with all of the data for one key with a header row.
The key values can change and are text strings.
Now, ideally, i would like to have 3 different reducers and each reducer would get only one key with it's entire list of values.
Except, this doesn't seem to work because the keys don't get mapped to specific reducers.
The answer to this in other places has been to write a custom partitioner class that would map each desired key value to a specific reducer. This would be great except that I need to use streaming with python and i am not able to include a custom streaming jar in my job so that seems not an option.
I see in the hadoop docs that there is an alternate partitioner class available that can enable secondary sorts, but it isn't immediately obvious to me that it is possible, using either the default or key field based partitioner, to ensure that each key ends up on it's own reducer without writing a java class and using a custom streaming jar.
Any suggestions would be much appreciated.
Examples:
mapper output:
csv2\tfieldA,fieldB,fieldC
csv1\tfield1,field2,field3,field4
csv3\tfieldRed,fieldGreen
...
the problem is that if i have 3 reducers i end up with key distribution like this:
reducer1 reducer2 recuder3
csv1 csv2
csv3
one reducer gets two different key types and one reducer gets no data sent to it at all. this is because the hash(key csv1) mod 3 and hash(key csv2) mod 3 result in the same value.
I'm pretty sure MultipleOutputFormat [1] can be used under streaming. That'll solve most of your problems.
http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/mapred/lib/MultipleOutputFormat.html
If you are stuck with streaming, and can't include any external jars for a custom partitioner, then this is probably not going to work the way you want it to without some hacks.
If these are absolute requirements, you can get around this, but it's messy.
Here's what you can do:
Hadoop, by default, uses a hashing partitioner, like this:
key.hashCode() % numReducers
So you can pick keys such that they hash to 1, 2, and 3 (or three numbers such that x % 3 = 1, 2, 3). This is a nasty hack, and I wouldn't suggest it unless you have no other options.
If you want custom output to different csv files you can direct write (with API) to hdfs. As you know hadoop passes with key and associated value list to single reduce task. In reduce code, check , while key is same write to same file. If another key comes, create new file manually and write into it. It does not matter how many reducers you have

Resources