Map Reduce: ChainMapper and ChainReducer - hadoop

I need to split my MapReduce jar file into two jobs in order to get two different output files, one from each job's reducer.
I mean that the first job has to produce an output file that will be the input for the second job in the chain.
I read something about ChainMapper and ChainReducer in Hadoop version 0.20 (I am currently using 0.18): could those be a good fit for my needs?
Can anybody suggest some links where I can find examples of how to use those classes? Or is there another way to achieve what I need?
Thank you,
Luca

There are many ways you can do it.
Cascading jobs
Create the JobConf object "job1" for the first job and set all the parameters with "input" as the input directory and "temp" as the output directory. Execute this job: JobClient.runJob(job1).
Immediately below it, create the JobConf object "job2" for the second job and set all the parameters with "temp" as the input directory and "output" as the output directory. Execute this job: JobClient.runJob(job2).
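A minimal sketch of this approach with the old (org.apache.hadoop.mapred) API might look like the following; the IdentityMapper/IdentityReducer classes are only stand-ins for your own mapper and reducer, and "input", "temp" and "output" are the placeholder directories used above:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class TwoStepDriver {
  public static void main(String[] args) throws Exception {
    // Job 1: reads "input", writes "temp"
    JobConf job1 = new JobConf(TwoStepDriver.class);
    job1.setJobName("job1");
    job1.setMapperClass(IdentityMapper.class);    // replace with your first mapper
    job1.setReducerClass(IdentityReducer.class);  // replace with your first reducer
    FileInputFormat.setInputPaths(job1, new Path("input"));
    FileOutputFormat.setOutputPath(job1, new Path("temp"));
    JobClient.runJob(job1);   // blocks until job 1 has finished

    // Job 2: reads "temp" (the output of job 1), writes "output"
    JobConf job2 = new JobConf(TwoStepDriver.class);
    job2.setJobName("job2");
    job2.setMapperClass(IdentityMapper.class);    // replace with your second mapper
    job2.setReducerClass(IdentityReducer.class);  // replace with your second reducer
    FileInputFormat.setInputPaths(job2, new Path("temp"));
    FileOutputFormat.setOutputPath(job2, new Path("output"));
    JobClient.runJob(job2);
  }
}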
Two JobConf objects
Create two JobConf objects and set all the parameters in them just like in (1), except that you don't call JobClient.runJob.
Then create two Job objects (org.apache.hadoop.mapred.jobcontrol.Job) with the JobConfs as parameters:
Job job1 = new Job(jobconf1); Job job2 = new Job(jobconf2);
Using a JobControl object, you specify the job dependencies and then run the jobs:
JobControl jbcntrl=new JobControl("jbcntrl");
jbcntrl.addJob(job1);
jbcntrl.addJob(job2);
job2.addDependingJob(job1);
jbcntrl.run();
ChainMapper and ChainReducer
If you need a structure somewhat like Map+ | Reduce | Map*, you can use the ChainMapper and ChainReducer classes that come with Hadoop version 0.19 and onwards. Note that in this case, you can use only one reducer but any number of mappers before or after it.
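A rough sketch of what the chaining calls look like with the old API; AMapper, BMapper, CMapper and MyReducer are hypothetical stand-ins for your own classes (with matching key/value types), not classes that ship with Hadoop:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;

public class ChainDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ChainDriver.class);
    conf.setJobName("chain");
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Mappers that run before the single reducer (the "Map+" part)
    ChainMapper.addMapper(conf, AMapper.class,
        LongWritable.class, Text.class, Text.class, Text.class,
        true, new JobConf(false));
    ChainMapper.addMapper(conf, BMapper.class,
        Text.class, Text.class, Text.class, Text.class,
        true, new JobConf(false));

    // The one and only reducer
    ChainReducer.setReducer(conf, MyReducer.class,
        Text.class, Text.class, Text.class, Text.class,
        true, new JobConf(false));

    // Mappers that run after the reducer (the "Map*" part)
    ChainReducer.addMapper(conf, CMapper.class,
        Text.class, Text.class, Text.class, Text.class,
        true, new JobConf(false));

    JobClient.runJob(conf);
  }
}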

I think the above solutions involve disk I/O between the jobs, so they will slow down with large datasets. An alternative is to use Oozie or Cascading.

Related

Job with just the reducer phase?

In Hadoop MapReduce the intermediate output (map output) is saved to local disk. I would like to know whether it is possible to start a job with just the reduce phase, which reads the map output from local disk, partitions the data and executes the reduce tasks.
There is a basic implementation of Mapper called IdentityMapper, which essentially passes all the key-value pairs to a Reducer.
The Reducer reads the outputs generated by the different mappers as key-value pairs and emits key-value pairs of its own.
The Reducer's job is to process the data that comes from the mapper.
If the MapReduce programmer does not set the mapper class using JobConf.setMapperClass, then IdentityMapper.class is used as the default.
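For example, in an old-API driver (same imports as the sketch earlier) you can simply leave the mapper unset and only configure the reducer; MyDriver, MyReducer and the paths below are placeholders:

JobConf conf = new JobConf(MyDriver.class);
// No conf.setMapperClass(...) call: IdentityMapper.class is used by default,
// so every input record reaches the reducer unchanged.
conf.setReducerClass(MyReducer.class);
FileInputFormat.setInputPaths(conf, new Path("/data/on/hdfs"));
FileOutputFormat.setOutputPath(conf, new Path("/data/reduced"));
JobClient.runJob(conf);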
You can't run just reducers without any mappers.
MapReduce works on data that is in HDFS, so I don't think you can write a reducer-only MapReduce job that reads from local disk.
If you use Hadoop Streaming, you can just add:
-mapper "/bin/sh -c \"cat\""

Chaining Map Reduce Program

I have a situation: during a POC I want to create a nested MapReduce within one job. For example, mapper M1's output goes to reducer R1, then R1's output goes to M2, and the final output comes either from M2 or from an R2 run on M2's output.
A single job ID - M1 -> R1 -> M2 -> R2... The final output will be in a single output file.
Can we do it without Oozie?
You can chain multiple jobs in your driver class. First, create a job for the first MapReduce by defining all the required configuration. Then start the job as usual by calling:
job1.waitForCompletion(true);
This waits until the job is finished. Now check the final status of the first job, whether it failed or succeeded, to take the appropriate next action.
If the first job is completed successfully, then launch the next MapReduce in the same way. First define the required parameters and launch the job with:
job2.waitForCompletion(true);
The important thing is that the output path of the first job must be the input path of the second job. This is serial (sequential) job chaining, because the two jobs run one after another.
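A minimal sketch of such a driver with the newer (org.apache.hadoop.mapreduce) API; the mapper/reducer classes and the "input", "temp" and "output" paths are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedJobsDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Job job1 = new Job(conf, "job1");
    job1.setJarByClass(ChainedJobsDriver.class);
    job1.setMapperClass(FirstMapper.class);      // placeholder: your M1
    job1.setReducerClass(FirstReducer.class);    // placeholder: your R1
    FileInputFormat.addInputPath(job1, new Path("input"));
    FileOutputFormat.setOutputPath(job1, new Path("temp"));

    if (!job1.waitForCompletion(true)) {
      System.exit(1);   // first job failed, so don't start the second one
    }

    Job job2 = new Job(conf, "job2");
    job2.setJarByClass(ChainedJobsDriver.class);
    job2.setMapperClass(SecondMapper.class);     // placeholder: your M2
    job2.setReducerClass(SecondReducer.class);   // placeholder: your R2
    // job2.setNumReduceTasks(0);                // map-only variant, if no second reducer is needed
    FileInputFormat.addInputPath(job2, new Path("temp"));    // output of job1 = input of job2
    FileOutputFormat.setOutputPath(job2, new Path("output"));

    System.exit(job2.waitForCompletion(true) ? 0 : 1);
  }
}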
You can also make use of JobControl, which lets you execute a number of MapReduce jobs in a sequence. In your case there are two mappers and one or two reducers. You can have two MapReduce jobs, and for the second job you can set the number of reducers to zero if you don't require a reducer.

How to schedule post processing task after a mapreduce job

I'm looking for a simple method to chain post-processing code after a MapReduce job.
Specifically, it involves renaming/moving the output files created by org.apache.hadoop.mapred.lib.MultipleOutputs (the class has limitations on the output file names, so I can't produce the files directly in the MapReduce job).
The options I know of (or can think of) are:
adding it in the job creation code - this is what I do now, but I would prefer the task to be scheduled by the JobTracker (to reduce the chances of the process being aborted)
using a workflow engine (Luigi, Oozie) - but this seems like overkill for this issue
using job chaining - this allows chaining MapReduce jobs - is it possible to chain a "simple" task?
Your "simple" task should be a Mapper-only job. Your Map() receives as key the file name and renames the file. For this you have to write your own InputFormat and RecordReader, like in the links, but your RecordReader should not actually read the file, just return the file name in getCurrentKey():
https://code.google.com/p/hadoop-course/source/browse/HadoopSamples/src/main/java/mr/wholeFile/WholeFileInputFormat.java?r=3
https://code.google.com/p/hadoop-course/source/browse/HadoopSamples/src/main/java/mr/wholeFile/WholeFileRecordReader.java?r=3
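A rough sketch of such a RecordReader (new API; not the exact code from the links above): it emits exactly one record per file, with the file's path as the key, and never opens the file. Pair it with an InputFormat whose isSplitable() returns false, and let the map() call FileSystem.rename() on each key:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FileNameRecordReader extends RecordReader<Text, NullWritable> {
  private Path file;
  private boolean emitted = false;

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context) {
    file = ((FileSplit) split).getPath();   // remember the file, never open it
  }

  @Override
  public boolean nextKeyValue() {
    if (emitted) {
      return false;
    }
    emitted = true;   // exactly one record per file
    return true;
  }

  @Override
  public Text getCurrentKey() {
    return new Text(file.toString());   // the file name is the key
  }

  @Override
  public NullWritable getCurrentValue() {
    return NullWritable.get();
  }

  @Override
  public float getProgress() {
    return emitted ? 1.0f : 0.0f;
  }

  @Override
  public void close() throws IOException {
    // nothing to close: the file was never opened
  }
}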

How to configure oozie workflow for multi-input path with multiple mappers

Can anyone help me configure a workflow with a map-reduce action that takes multiple input paths, where each input path is associated with one mapper, just as the MultipleInputs.addInputPath API takes an input path and a mapper? The output of these mappers will be given to the reducer.
I tried this with a java action, but it executes only one map task. The input paths contain huge amounts of data, so a java action will not help in this case.
Is there any way to handle this case?
Regards,
Krish.
In the workflow you can give a comma-separated list of input directories in mapred.input.dir. This will make the files in those directories run on different mappers.
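For example, in the map-reduce action of the workflow (the action name and the directory list here are just placeholders):

<action name="multi-input-mr">
    <map-reduce>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapred.input.dir</name>
                <value>/data/input1,/data/input2,/data/input3</value>
            </property>
            <property>
                <name>mapred.output.dir</name>
                <value>/data/output</value>
            </property>
        </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
</action>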

Accessing MR job output from java main class

My MapReduce program has three chained MR jobs. I want to access MR1's output from the main class. Is this possible in a Hadoop environment?
If not, then please suggest whether there is any other way to do a similar thing.
One way would be to feed the output of job1 to the input of job2, and the output of job2 to the input of job3.
Here is an example how: http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining
This blog post talks about it some more:
http://blogs.msdn.com/b/avkashchauhan/archive/2012/03/29/how-to-chain-multiple-mapreduce-jobs-in-hadoop.aspx
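As a rough sketch (new API, same imports as the driver sketch earlier; job1, job2 and job3 are already-configured Job objects and the directory names are placeholders), each job's output directory simply becomes the next job's input directory:

FileOutputFormat.setOutputPath(job1, new Path("mr1_out"));
if (!job1.waitForCompletion(true)) System.exit(1);

FileInputFormat.addInputPath(job2, new Path("mr1_out"));    // MR1 output -> MR2 input
FileOutputFormat.setOutputPath(job2, new Path("mr2_out"));
if (!job2.waitForCompletion(true)) System.exit(1);

FileInputFormat.addInputPath(job3, new Path("mr2_out"));    // MR2 output -> MR3 input
FileOutputFormat.setOutputPath(job3, new Path("mr3_out"));
System.exit(job3.waitForCompletion(true) ? 0 : 1);

If the main class itself needs to look at MR1's output, it can also open the part files under mr1_out with the HDFS FileSystem API once job1 has finished.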
