Accessing MR job output from java main class - hadoop

My MapReduce program has three chained MR jobs. I want to access MR1's output from the main class. Is this possible in a Hadoop environment?
If not, then please suggest if there is any other way to do similar thing.

One way would be to feed the output of job 1 into the input of job 2, and the output of job 2 into the input of job 3.
Here is an example of how: http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining
This blog post covers a few more approaches:
http://blogs.msdn.com/b/avkashchauhan/archive/2012/03/29/how-to-chain-multiple-mapreduce-jobs-in-hadoop.aspx
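If what you actually need is to read MR1's output back in the driver (main class) before submitting the next job, you can open the part files in its output directory with the FileSystem API once job 1 has completed. A minimal sketch, assuming job 1 wrote text output to a directory called "mr1-output" (the path and class name are only illustrative):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadMr1Output {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path mr1Output = new Path("mr1-output");   // hypothetical output dir of the first job
        for (FileStatus status : fs.listStatus(mr1Output)) {
            // Skip _SUCCESS, _logs and anything else that is not a part file.
            if (!status.getPath().getName().startsWith("part-")) {
                continue;
            }
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(fs.open(status.getPath())));
            String line;
            while ((line = reader.readLine()) != null) {
                // Each line is one key<TAB>value record from MR1's text output.
                System.out.println(line);
            }
            reader.close();
        }
    }
}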

Related

How to check that MapReduce is running in parallel?

I submitted a MapReduce job and checked the log.
In the log I see that there are many mappers, each mapper processes one split, and the processing details of each mapper are logged sequentially in time.
However, I would like to check whether my job is running in parallel and see how many mappers are running concurrently.
I don't know where to find this information.
Please help me, thanks!
Use the following JobTracker web UI and drill down to the executing MapReduce job:
http://<Jobtracker-HostName>:50030/

How to schedule a post-processing task after a MapReduce job

I'm looking for a simple method to chain post-processing code after a MapReduce job.
Specifically, it involves renaming/moving the output files created by org.apache.hadoop.mapred.lib.MultipleOutputs (the class has limitations on the output file names, so I can't produce the files directly in the MapReduce job).
The options I know of (or can think of) are:
add it in the job creation code - this is what I do now, but I'd prefer the task to be scheduled by the JobTracker (to reduce the chances of the process being aborted)
using a workflow engine (Luigi, Oozie) - but this seems like overkill for this issue
using job chaining - this allows chaining MapReduce jobs - is it possible to chain a "simple" task?
Your "simple" task should be a Mapper-only job. Your Map() receives as key the file name and renames the file. For this you have to write your own InputFormat and RecordReader, like in the links, but your RecordReader should not actually read the file, just return the file name in getCurrentKey():
https://code.google.com/p/hadoop-course/source/browse/HadoopSamples/src/main/java/mr/wholeFile/WholeFileInputFormat.java?r=3
https://code.google.com/p/hadoop-course/source/browse/HadoopSamples/src/main/java/mr/wholeFile/WholeFileRecordReader.java?r=3
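A rough sketch of that idea, using the newer org.apache.hadoop.mapreduce API; the class names (FileNameInputFormat, RenameMapper) and the renaming rule are placeholders you would adapt to your own naming scheme:

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Emits exactly one record per input file: the file path as key, nothing as value.
public class FileNameInputFormat extends FileInputFormat<Text, NullWritable> {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // one split per file, so each map() call sees one whole file
    }

    @Override
    public RecordReader<Text, NullWritable> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new RecordReader<Text, NullWritable>() {
            private Path file;
            private boolean done = false;

            public void initialize(InputSplit s, TaskAttemptContext c) {
                file = ((FileSplit) s).getPath();
            }
            public boolean nextKeyValue() {
                if (done) return false;
                done = true;   // emit the single record, then stop
                return true;
            }
            public Text getCurrentKey()           { return new Text(file.toString()); }
            public NullWritable getCurrentValue() { return NullWritable.get(); }
            public float getProgress()            { return done ? 1.0f : 0.0f; }
            public void close()                   { }
        };
    }
}

// Map-only job: receives each file name and renames the file in place.
class RenameMapper extends Mapper<Text, NullWritable, Text, NullWritable> {
    @Override
    protected void map(Text key, NullWritable value, Context context)
            throws IOException, InterruptedException {
        Path src = new Path(key.toString());
        FileSystem fs = src.getFileSystem(context.getConfiguration());
        // Placeholder rule: strip the "-m-00000" / "-r-00000" suffix; adjust to your needs.
        Path dst = new Path(src.getParent(), src.getName().replaceAll("-[mr]-\\d+$", ""));
        fs.rename(src, dst);
    }
}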

Is there a way to kill reducer task in Hadoop?

Running a few MapReduce jobs, and one job takes over all the reducer capacity. Is there a way to kill one or two reducer tasks to free up the cluster?
I can go directly to one of the TaskTracker servers and kill the Java process manually, but I am wondering if there is a more graceful way to do this?
You can kill a task attempt with:
hadoop job -kill-task [task_attempt_id]
To get the task attempt ID, you need to go one level deeper into the task (by clicking the task hyperlink in the JobTracker UI).
First find the job ID:
hadoop job -list
Now, kill the job:
hadoop job -kill <job_ID_goes_here>
To kill only a single task attempt instead of the whole job, use hadoop job -kill-task [attempt-id], where the attempt ID can be obtained from the JobTracker UI.
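If you would rather do this from code than from the shell, the old mapred API exposes the same operation through JobClient and RunningJob. A minimal sketch, with placeholder job and attempt IDs that you would replace with the real ones from the UI or from hadoop job -list:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.TaskAttemptID;

public class KillReducerAttempt {
    public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf());
        // Placeholder IDs: substitute the values reported by the JobTracker.
        RunningJob job = client.getJob(JobID.forName("job_201301010000_0001"));
        // Second argument: false = kill the attempt, true = mark it as failed.
        job.killTask(TaskAttemptID.forName("attempt_201301010000_0001_r_000003_0"), false);
    }
}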

A tool showing a breakdown of completion times and source machine names for each and every mapper and reducer?

I know the job tasks page (in the JobTracker UI) already shows the start and end time of every map and reduce task, but I would like to see more, such as source machine names, number of spills, and so on. I guess I could try to write such a tool using the JobTracker class, but before embarking on that, I would like to know whether such a tool already exists.
Does the hadoop job -history all output-dir command give you enough information to parse / process?
http://hadoop.apache.org/common/docs/r1.0.3/cluster_setup.html - Search for the above command

Map Reduce: ChainMapper and ChainReducer

I need to split my MapReduce jar file into two jobs in order to get two different output files, one from each job's reducer.
I mean that the first job has to produce an output file that will be the input for the second job in the chain.
I read something about ChainMapper and ChainReducer in Hadoop version 0.20 (currently I am using 0.18): would those be suitable for my needs?
Can anybody suggest some links with examples of how to use those classes? Or is there another way to achieve this?
Thank you,
Luca
There are many ways you can do it.
Cascading jobs
Create the JobConf object "job1" for the first job and set all the parameters, with "input" as the input directory and "temp" as the output directory. Execute this job: JobClient.runJob(job1).
Immediately below it, create the JobConf object "job2" for the second job and set all the parameters, with "temp" as the input directory and "output" as the output directory. Execute this job: JobClient.runJob(job2).
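A minimal driver sketch of this cascading approach with the old (org.apache.hadoop.mapred) API; FirstMapper/FirstReducer and SecondMapper/SecondReducer stand in for your own classes, and the directory names are just the ones used above:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class TwoStageDriver {
    public static void main(String[] args) throws Exception {
        // Job 1: reads "input", writes intermediate results to "temp".
        JobConf job1 = new JobConf(TwoStageDriver.class);
        job1.setJobName("stage-1");
        job1.setMapperClass(FirstMapper.class);      // your mapper
        job1.setReducerClass(FirstReducer.class);    // your reducer
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job1, new Path("input"));
        FileOutputFormat.setOutputPath(job1, new Path("temp"));
        JobClient.runJob(job1);                      // blocks until job 1 finishes

        // Job 2: reads "temp", writes the final results to "output".
        JobConf job2 = new JobConf(TwoStageDriver.class);
        job2.setJobName("stage-2");
        job2.setMapperClass(SecondMapper.class);
        job2.setReducerClass(SecondReducer.class);
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job2, new Path("temp"));
        FileOutputFormat.setOutputPath(job2, new Path("output"));
        JobClient.runJob(job2);
    }
}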
Two JobConf objects
Create two JobConf objects and set all the parameters in them just like in (1), except that you don't call JobClient.runJob.
Then create two Job objects with the JobConfs as parameters:
Job job1 = new Job(jobconf1);
Job job2 = new Job(jobconf2);
Using the JobControl object, you specify the job dependencies and then run the jobs:
JobControl jbcntrl=new JobControl("jbcntrl");
jbcntrl.addJob(job1);
jbcntrl.addJob(job2);
job2.addDependingJob(job1);
jbcntrl.run();
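One detail worth knowing: JobControl.run() keeps polling until it is told to stop, so a common pattern (just a sketch, reusing the jbcntrl object above) is to run it on its own thread and stop it once allFinished() reports true:

Thread controller = new Thread(jbcntrl);   // JobControl implements Runnable
controller.start();
while (!jbcntrl.allFinished()) {
    Thread.sleep(500);                     // poll until both jobs have completed
}
jbcntrl.stop();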
ChainMapper and ChainReducer
If you need a structure somewhat like Map+ | Reduce | Map*, you can use the ChainMapper and ChainReducer classes that come with Hadoop version 0.19 and onwards. Note that in this case, you can use only one reducer but any number of mappers before or after it.
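For completeness, a sketch of how such a chain is wired up with the old org.apache.hadoop.mapred.lib API (available from 0.19 onwards, so not in 0.18); AMapper, BMapper, MyReducer and CMapper are placeholders for your own classes:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;

public class ChainDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ChainDriver.class);
        conf.setJobName("chain");
        FileInputFormat.setInputPaths(conf, new Path("input"));
        FileOutputFormat.setOutputPath(conf, new Path("output"));

        // Map+ : two mappers run back to back before the reducer.
        ChainMapper.addMapper(conf, AMapper.class, LongWritable.class, Text.class,
                Text.class, Text.class, true, new JobConf(false));
        ChainMapper.addMapper(conf, BMapper.class, Text.class, Text.class,
                Text.class, Text.class, true, new JobConf(false));

        // Exactly one reducer is allowed in the chain ...
        ChainReducer.setReducer(conf, MyReducer.class, Text.class, Text.class,
                Text.class, Text.class, true, new JobConf(false));

        // ... optionally followed by more mappers (Map*), which run after the reducer.
        ChainReducer.addMapper(conf, CMapper.class, Text.class, Text.class,
                Text.class, Text.class, true, new JobConf(false));

        JobClient.runJob(conf);
    }
}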
I think the above solutions involve disk I/O, and so will slow down with large datasets. An alternative is to use Oozie or Cascading.
