Mapper with multipleInput on Hadoop cluster - hadoop

I have to implement two mapReduce jobs where a Mapper in phase II (Mapper_2) needs to have an output of the Reducer in phase I (reducer_1).
Mapper_2 also needs another input that is a big text file (2TB).
I have written as follows but my question is: text input will be split amongst nodes in the cluster, but what about output of reducer _1 as I want each mapper in phase II to have the whole of Reducer_1's output.
MultipleInputs.addInputPath(Job, TextInputPath, SomeInputFormat.class, Mapper_2.class);
MultipleInputs.addInputPath(Job, Ruducer_1OutputPath, SomeInputFormat.class, Mapper_2.class);

Your use of multiple inputs seems fine. I would look at using the distributed cache for the output of reducer_1 to be shared with mapper_2.
JobConf job = new JobConf();
DistributedCache.addCacheFile(new URI("/path/to/reducer_1/ouput"),
job);
Also, when using Distributed Cache, remember to read the cache file in the setup method of mapper_2.
setup() runs once for each mapper before map() gets called and cleanup() runs once for each mapper after the last call to map()

Related

Hadoop passing variables from reducer to main

I am working on a map reduce program. I'm trying to pass parameters to the context configuration in the reduce method using the setLong method and then after completion read them in the main
in reducer:
context.getConfiguration().setLong(key, someLong);
In the Main after the job completion i try to read using :
long val = job.getConfiguration().getLong(key, -1);
but i always get -1.
when i try reading inside the reducer i see that the value is set and i get the correct answer.
am i missing something?
Thank you
You can use counters: set&update their value in reducers and then you can access them in your client application (Main).
You can translate configuration from main to map task or reduce task, but you cannot translate it back. The procedure of configuration translation is:
A configuration file is generated on the MapReduce client based on the configuration you set on main, and it will be pushed to a HDFS path only shared by the job. The file will be readonly
When launching a map or reduce task, the configuration file is pulled from the HDFS path, and task init the configuration based by the file.
If you want to translate configuration back, you may use another HDFS file: update the file on Reducer, and read it after job completes

Where is script like wordcount executed in MapReduce?

I am asking for knowledge? When running a MapReduce with a wordcount jar when does the code execute? Is it during the mapper task or the driver method?
When running a MapReduce with a wordcount jar when does the code execute?
It executes with main i.e. Driver code and then map code followed by reducer code (if any)
Is it during the mapper task or the driver method?
Yes its both.
Driver - would drive the map reduce, where you define which class i should use for mapper, reducer, paritioner, combiner.
Mapper - mapper is where your word count would create a map of key value pair with key as word and value as number of times it occured.
Reducer - Reducer would take the value from each mapper and would sum all the value against same key accross mapper and would give you the final result.
as correctly answered by SMA , code execution starts with main method of driver class , which passes control to mapper and reducer class using methods setMapperClass , setReducerClass of job object.
Have a look at official map reduce tutorial for better understanding. I am using key points to explain the example.
Let's have a look at java word count example.
Assume that you have created wc.jar as follows.
$ jar cf wc.jar WordCount*.class
Now run the WordCount example
$ bin/hadoop jar wc.jar WordCount /user/joe/wordcount/input /user/joe/wordcount/output
Now you have passed input directory & output directory for reading and writing of data respectively.
Assuming that:
/user/joe/wordcount/input - input directory in HDFS
/user/joe/wordcount/output - output directory in HDFS
input directory in HDFS is used by Mapper to read the data & output directory in HDFS is used by reducer to store data.
Applications typically implement the Mapper and Reducer interfaces to provide the map and reduce methods. These form the core of the job.
If you see the main method of WordCount class (you can call it as Driver program),
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
We are setting Mapper class, Reducer class, Combiner class, Output Key & Value classes and Input/Output file locations.
Some jobs may have only Mapper. Some jobs will have Mapper and Reducer. Some jobs will have Partitioner & Combiner in addition to Mapper & Reducer classes. So basically the flow is decided by you and Hadoop framework will form a workflow depending on your inputs.
For above example :
Mapper will read input data from HDFS file location from FileInputFormat.addInputPath API. Mapper is set by below line
job.setMapperClass(TokenizerMapper.class);
Combiner, which is a mini reducer will run on Mapper output. It will reduce network IO by combining output of Mappers.
job.setCombinerClass(IntSumReducer.class);
Reducer is set by below API.
job.setReducerClass(IntSumReducer.class);
Mapper
Mapper maps input key/value pairs to a set of intermediate key/value pairs. The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.
Reducer
Reducer reduces a set of intermediate values which share a key to a smaller set of values.
Combiner:
Users can optionally specify a combiner, via Job.setCombinerClass(Class), to perform local aggregation of the intermediate outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer.

How to tune Spark application with hadoop custom input format

My spark application process the files (average size is 20 MB) with custom hadoop input format and stores the result in HDFS.
Following is the code snippet.
Configuration conf = new Configuration();
JavaPairRDD<Text, Text> baseRDD = ctx
.newAPIHadoopFile(input, CustomInputFormat.class,Text.class, Text.class, conf);
JavaRDD<myClass> mapPartitionsRDD = baseRDD
.mapPartitions(new FlatMapFunction<Iterator<Tuple2<Text, Text>>, myClass>() {
//my logic goes here
}
//few more translformations
result.saveAsTextFile(path);
This application creates 1 task/ partition per file and processes and stores the corresponding part file in HDFS.
i.e, For 10,000 input files 10,000 tasks are created and 10,000 part files are stored in HDFS.
Both mapPartitions and map operations on baseRDD are creating 1 task per file.
SO question
How to set the number of partitions for newAPIHadoopFile?
suggests to set
conf.setInt("mapred.max.split.size", 4); for configuring no of partitions.
But when this parameter is set CPU is utilized at maximum and none of the stage is not started even after long time.
If I don't set this parameter then application will be completed successfully as mentioned above.
How to set number of partitions with newAPIHadoopFile and increase the efficiency?
What happens with mapred.max.split.size option?
============
update:
What happens with mapred.max.split.size option?
In my use case file size is small and changing the split size options are irrelevant here.
more info on this SO: Behavior of the parameter "mapred.min.split.size" in HDFS
Just use baseRDD.repartition(<a sane amount>).mapPartitions(...). That will move the resulting operation to fewer partitions, especially if your files are small.

MRJob and mapreduce task partitioning over Hadoop

I am trying to perform a mapreduce job using the Python MRJob lib and am having some issues getting it to properly distribute across my Hadoop cluster. I believe I am simply missing a basic principle of mapreduce. My cluster is a small, one master one slave test cluster. The basic idea is that I'm just requesting a series of web pages with parameters, doing some analysis on them and returning back some properties on the web page.
The input to my map function is simply a list of URLs with parameters such as the following:
http://guelph.backpage.com/automotive/?layout=bla&keyword=towing
http://guelph.backpage.com/whatever/?p=blah
http://semanticreference.com/search.html?go=Search&q=red
http://copiahcounty.wlbt.com/h/events?ename=drupaleventsxmlapi&s=rrr
http://sweetrococo.livejournal.com/34076.html?mode=ffff
Such that the key-value pairs for the initial input are just key:None, val:URL.
The following is my map function:
def mapper(self, key, url):
'''Yield domain as the key, and (url, query parameter) tuple as the value'''
parsed_url = urlparse(url)
domain = parsed_url.scheme + "://" + parsed_url.netloc + "/"
if self.myclass.check_if_param(parsed_url):
parsed_url_query = parsed_url.query
url_q_dic = parse_qs(parsed_url_query)
for query_param, query_val in url_q_dic.iteritems():
#yielding a tuple in mrjob will yield a list
yield domain, (url, query_param)
Pretty simple, I'm just checking to make sure the URL has a parameter and yielding the URL's domain as key and a tuple giving me the URL and the query parameter as value which MRJob kindly transforms into a list to pass to the reducer, which is the following:
def reducer(self, domain, url_query_params):
final_list = []
for url_query_param in url_query_params:
url_to_list_props = url_query_param[0]
param_to_list_props = url_query_param[1]
#set our target that we will request and do some analysis on
self.myclass.set_target(url_to_list_props, param_to_list_props)
#perform a bunch of requests and do analysis on the URL requested
props_list = self.myclass.get_props()
for prop in props_list:
final_list.append(prop)
#index this stuff to a central db
MapReduceIndexer(domain, final_list).add_prop_info()
yield domain, final_list
My problem is that only one reducer task is run. I would expect the number of reducer tasks to be equal to the number of unique keys emitted by the mapper. The end result with the above code is that I have one reducer which runs on the master, but the slave sits idly and does nothing, which is obviously not ideal. I notice that in my output a few mapper tasks are started, but always only 1 reducer task. Other than that, the task runs smoothly and all works as expected.
My question is... what the heck am I doing wrong? Am I misunderstanding the reduce step or screwing up my key-value pairs somewhere? Why are there not multiple reducers running on this job?
Update: OK so from the answer given I increased mapred.reduce.tasks to higher (it was the default which I now realize is 1). This was indeed why I was getting 1 reducer. I now see 3 reduce tasks being performed simultaneously. I now have an import error on my slave that needs to be resolved but at least I am getting somewhere...
The number of reducers is totally unrelated to the form of your input data. For MRJob it looks like you need bootstrap options

Chaining multiple MapReduce jobs in Hadoop

In many real-life situations where you apply MapReduce, the final algorithms end up being several MapReduce steps.
i.e. Map1 , Reduce1 , Map2 , Reduce2 , and so on.
So you have the output from the last reduce that is needed as the input for the next map.
The intermediate data is something you (in general) do not want to keep once the pipeline has been successfully completed. Also because this intermediate data is in general some data structure (like a 'map' or a 'set') you don't want to put too much effort in writing and reading these key-value pairs.
What is the recommended way of doing that in Hadoop?
Is there a (simple) example that shows how to handle this intermediate data in the correct way, including the cleanup afterward?
I think this tutorial on Yahoo's developer network will help you with this: Chaining Jobs
You use the JobClient.runJob(). The output path of the data from the first job becomes the input path to your second job. These need to be passed in as arguments to your jobs with appropriate code to parse them and set up the parameters for the job.
I think that the above method might however be the way the now older mapred API did it, but it should still work. There will be a similar method in the new mapreduce API but i'm not sure what it is.
As far as removing intermediate data after a job has finished you can do this in your code. The way i've done it before is using something like:
FileSystem.delete(Path f, boolean recursive);
Where the path is the location on HDFS of the data. You need to make sure that you only delete this data once no other job requires it.
There are many ways you can do it.
(1) Cascading jobs
Create the JobConf object "job1" for the first job and set all the parameters with "input" as inputdirectory and "temp" as output directory. Execute this job:
JobClient.run(job1).
Immediately below it, create the JobConf object "job2" for the second job and set all the parameters with "temp" as inputdirectory and "output" as output directory. Execute this job:
JobClient.run(job2).
(2) Create two JobConf objects and set all the parameters in them just like (1) except that you don't use JobClient.run.
Then create two Job objects with jobconfs as parameters:
Job job1=new Job(jobconf1);
Job job2=new Job(jobconf2);
Using the jobControl object, you specify the job dependencies and then run the jobs:
JobControl jbcntrl=new JobControl("jbcntrl");
jbcntrl.addJob(job1);
jbcntrl.addJob(job2);
job2.addDependingJob(job1);
jbcntrl.run();
(3) If you need a structure somewhat like Map+ | Reduce | Map*, you can use the ChainMapper and ChainReducer classes that come with Hadoop version 0.19 and onwards.
There are actually a number of ways to do this. I'll focus on two.
One is via Riffle ( http://github.com/cwensel/riffle ) an annotation library for identifying dependent things and 'executing' them in dependency (topological) order.
Or you can use a Cascade (and MapReduceFlow) in Cascading ( http://www.cascading.org/ ). A future version will support Riffle annotations, but it works great now with raw MR JobConf jobs.
A variant on this is to not manage MR jobs by hand at all, but develop your application using the Cascading API. Then the JobConf and job chaining is handled internally via the Cascading planner and Flow classes.
This way you spend your time focusing on your problem, not on the mechanics of managing Hadoop jobs etc. You can even layer different languages on top (like clojure or jruby) to even further simplify your development and applications. http://www.cascading.org/modules.html
I have done job chaining using with JobConf objects one after the other. I took WordCount example for chaining the jobs. One job figures out how many times a word a repeated in the given output. Second job takes first job output as input and figures out total words in the given input. Below is the code that need to be placed in Driver class.
//First Job - Counts, how many times a word encountered in a given file
JobConf job1 = new JobConf(WordCount.class);
job1.setJobName("WordCount");
job1.setOutputKeyClass(Text.class);
job1.setOutputValueClass(IntWritable.class);
job1.setMapperClass(WordCountMapper.class);
job1.setCombinerClass(WordCountReducer.class);
job1.setReducerClass(WordCountReducer.class);
job1.setInputFormat(TextInputFormat.class);
job1.setOutputFormat(TextOutputFormat.class);
//Ensure that a folder with the "input_data" exists on HDFS and contains the input files
FileInputFormat.setInputPaths(job1, new Path("input_data"));
//"first_job_output" contains data that how many times a word occurred in the given file
//This will be the input to the second job. For second job, input data name should be
//"first_job_output".
FileOutputFormat.setOutputPath(job1, new Path("first_job_output"));
JobClient.runJob(job1);
//Second Job - Counts total number of words in a given file
JobConf job2 = new JobConf(TotalWords.class);
job2.setJobName("TotalWords");
job2.setOutputKeyClass(Text.class);
job2.setOutputValueClass(IntWritable.class);
job2.setMapperClass(TotalWordsMapper.class);
job2.setCombinerClass(TotalWordsReducer.class);
job2.setReducerClass(TotalWordsReducer.class);
job2.setInputFormat(TextInputFormat.class);
job2.setOutputFormat(TextOutputFormat.class);
//Path name for this job should match first job's output path name
FileInputFormat.setInputPaths(job2, new Path("first_job_output"));
//This will contain the final output. If you want to send this jobs output
//as input to third job, then third jobs input path name should be "second_job_output"
//In this way, jobs can be chained, sending output one to other as input and get the
//final output
FileOutputFormat.setOutputPath(job2, new Path("second_job_output"));
JobClient.runJob(job2);
Command to run these jobs is:
bin/hadoop jar TotalWords.
We need to give final jobs name for the command. In the above case, it is TotalWords.
You may run MR chain in the manner as given in the code.
PLEASE NOTE: Only the driver code has been provided
public class WordCountSorting {
// here the word keys shall be sorted
//let us write the wordcount logic first
public static void main(String[] args)throws IOException,InterruptedException,ClassNotFoundException {
//THE DRIVER CODE FOR MR CHAIN
Configuration conf1=new Configuration();
Job j1=Job.getInstance(conf1);
j1.setJarByClass(WordCountSorting.class);
j1.setMapperClass(MyMapper.class);
j1.setReducerClass(MyReducer.class);
j1.setMapOutputKeyClass(Text.class);
j1.setMapOutputValueClass(IntWritable.class);
j1.setOutputKeyClass(LongWritable.class);
j1.setOutputValueClass(Text.class);
Path outputPath=new Path("FirstMapper");
FileInputFormat.addInputPath(j1,new Path(args[0]));
FileOutputFormat.setOutputPath(j1,outputPath);
outputPath.getFileSystem(conf1).delete(outputPath);
j1.waitForCompletion(true);
Configuration conf2=new Configuration();
Job j2=Job.getInstance(conf2);
j2.setJarByClass(WordCountSorting.class);
j2.setMapperClass(MyMapper2.class);
j2.setNumReduceTasks(0);
j2.setOutputKeyClass(Text.class);
j2.setOutputValueClass(IntWritable.class);
Path outputPath1=new Path(args[1]);
FileInputFormat.addInputPath(j2, outputPath);
FileOutputFormat.setOutputPath(j2, outputPath1);
outputPath1.getFileSystem(conf2).delete(outputPath1, true);
System.exit(j2.waitForCompletion(true)?0:1);
}
}
THE SEQUENCE IS
(JOB1)MAP->REDUCE-> (JOB2)MAP
This was done to get the keys sorted yet there are more ways such as using a treemap
Yet I want to focus your attention onto the way the Jobs have been chained!!
Thank you
You can use oozie for barch processing your MapReduce jobs. http://issues.apache.org/jira/browse/HADOOP-5303
There are examples in Apache Mahout project that chains together multiple MapReduce jobs. One of the examples can be found at:
RecommenderJob.java
http://search-lucene.com/c/Mahout:/core/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java%7C%7CRecommenderJob
We can make use of waitForCompletion(true) method of the Job to define the dependency among the job.
In my scenario I had 3 jobs which were dependent on each other. In the driver class I used the below code and it works as expected.
public static void main(String[] args) throws Exception {
// TODO Auto-generated method stub
CCJobExecution ccJobExecution = new CCJobExecution();
Job distanceTimeFraudJob = ccJobExecution.configureDistanceTimeFraud(new Configuration(),args[0], args[1]);
Job spendingFraudJob = ccJobExecution.configureSpendingFraud(new Configuration(),args[0], args[1]);
Job locationFraudJob = ccJobExecution.configureLocationFraud(new Configuration(),args[0], args[1]);
System.out.println("****************Started Executing distanceTimeFraudJob ================");
distanceTimeFraudJob.submit();
if(distanceTimeFraudJob.waitForCompletion(true))
{
System.out.println("=================Completed DistanceTimeFraudJob================= ");
System.out.println("=================Started Executing spendingFraudJob ================");
spendingFraudJob.submit();
if(spendingFraudJob.waitForCompletion(true))
{
System.out.println("=================Completed spendingFraudJob================= ");
System.out.println("=================Started locationFraudJob================= ");
locationFraudJob.submit();
if(locationFraudJob.waitForCompletion(true))
{
System.out.println("=================Completed locationFraudJob=================");
}
}
}
}
The new Class org.apache.hadoop.mapreduce.lib.chain.ChainMapper help this scenario
Although there are complex server based Hadoop workflow engines e.g., oozie, I have a simple java library that enables execution of multiple Hadoop jobs as a workflow. The job configuration and workflow defining inter job dependency is configured in a JSON file. Everything is externally configurable and does not require any change in existing map reduce implementation to be part of a workflow.
Details can be found here. Source code and jar is available in github.
http://pkghosh.wordpress.com/2011/05/22/hadoop-orchestration/
Pranab
I think oozie helps the consequent jobs to receive the inputs directly from the previous job. This avoids the I/o operation performed with jobcontrol.
If you want to programmatically chain your jobs, you will wnat to use JobControl. The usage is quite simple:
JobControl jobControl = new JobControl(name);
After that you add ControlledJob instances. ControlledJob defines a job with it's dependencies, thus automatically pluging inputs and outputs to fit a "chain" of jobs.
jobControl.add(new ControlledJob(job, Arrays.asList(controlledjob1, controlledjob2));
jobControl.run();
starts the chain. You will want to put that in a speerate thread. This allows to check the status of your chain whil it runs:
while (!jobControl.allFinished()) {
System.out.println("Jobs in waiting state: " + jobControl.getWaitingJobList().size());
System.out.println("Jobs in ready state: " + jobControl.getReadyJobsList().size());
System.out.println("Jobs in running state: " + jobControl.getRunningJobList().size());
List<ControlledJob> successfulJobList = jobControl.getSuccessfulJobList();
System.out.println("Jobs in success state: " + successfulJobList.size());
List<ControlledJob> failedJobList = jobControl.getFailedJobList();
System.out.println("Jobs in failed state: " + failedJobList.size());
}
As you have mentioned in your requirement that you want o/p of MRJob1 to be the i/p of MRJob2 and so on, you can consider using oozie workflow for this usecase. Also you might consider writing your intermediate data to HDFS since it will used by the next MRJob. And after the job completes you can clean-up your intermediate data.
<start to="mr-action1"/>
<action name="mr-action1">
<!-- action for MRJob1-->
<!-- set output path = /tmp/intermediate/mr1-->
<ok to="end"/>
<error to="end"/>
</action>
<action name="mr-action2">
<!-- action for MRJob2-->
<!-- set input path = /tmp/intermediate/mr1-->
<ok to="end"/>
<error to="end"/>
</action>
<action name="success">
<!-- action for success-->
<ok to="end"/>
<error to="end"/>
</action>
<action name="fail">
<!-- action for fail-->
<ok to="end"/>
<error to="end"/>
</action>
<end name="end"/>
New answer since the confirmed answer with the JobClient.run() is not working in the new API:
If you have two jobs like this:
Configuration conf1 = new Configuration();
Job job1 = Job.getInstance(conf1, "a");
Configuration conf2 = new Configuration();
Job job2 = Job.getInstance(conf2, "b");
Then the only thing you should do is adding the following line before creating 'job2':
job1.waitForCompletion(true);

Resources