I would like to know if the new DistributedCache API is backward-compatible with Hadoop 1.
If I change my code to use the new API (since the old one is deprecated), will it still work on a Hadoop 1 cluster?
By "new" I mean:
Configuration conf = getConf();
...
Job job = Job.getInstance(conf);
...
job.addCacheFile(new URI(filename));
I need help with tuning Apache Storm. I ran a command on the nimbus server to increase the number of executors for a spout and for a bolt.
My question is simple. Does the command:
storm rebalance TopologyName -e <spout-or-bolt-id>=<new-parallelism>
override the parallelism hints set in the Java code?
I ran this and did not see a change in the Storm web UI.
Also, is there a way to override this parameter from the storm.yaml file?
topology.max.spout.pending: 1000
Thanks for any help on this. I do have an excellent book on Storm, but I cannot find out why my changes are not reflected after the rebalance...
Did you set the number of tasks high enough? See here for further details:
Rebalancing executors in Apache Storm
https://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html
So yes, it does override the parallelism hint, but only if the number of tasks allows it (you cannot rebalance to more executors than there are tasks).
And yes, you can use storm.yaml to set the default "max spout pending" parameter. This value can be overridden for each topology individually in the configuration you provide when submitting it:
Config conf = new Config();
conf.setMaxSpoutPending( /* put your value here */ );
StormSubmitter.submitTopology("topologyName", conf, builder.createTopology());
I'm very new to Hadoop and have a question.
I'm submitting (or creating) MapReduce jobs using the Hadoop Job API v2 (i.e., the mapreduce namespace rather than the old mapred one).
I submit MR jobs based on our own jobs, and we maintain the Hadoop job name in our own table.
I want to track the submitted jobs' progress (and thus completion) so that we can mark our own jobs as complete.
All of the job-status APIs require a Job object, whereas our 'Job Monitoring' module does not have any Job object available.
Can you please help us with any way to get a JobStatus given a job name? We make sure job names are unique.
I googled quite a bit, only to find the code below. Is this the way to go? Is there no other way in the v2 (.mapreduce., not .mapred.) API to get a job's status given the JobID?
Configuration conf = new Configuration();
JobClient jobClient = new JobClient(new JobConf(conf)); // deprecation WARN
JobID jobId = JobID.forName(jobIdString); // deprecation WARN; jobIdString holds the "job_..." ID string
RunningJob runningJob = jobClient.getJob(jobId);
Field field = runningJob.getClass().getDeclaredField("status"); // reflection !!!
field.setAccessible(true);
JobStatus jobStatus = (JobStatus) field.get(runningJob);
http://blog.erdemagaoglu.com/post/9407457968/hadoop-mapreduce-job-statistics-a-fraction-of-them
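For what it's worth, the v2 API does expose a reflection-free route: org.apache.hadoop.mapreduce.Cluster can look a Job up by its JobID, and the Job then exposes its JobStatus directly. A sketch, assuming Hadoop 2.x (the ID string below is a placeholder):

```java
// Sketch using only the mapreduce (v2) API -- no JobClient, no reflection.
Configuration conf = new Configuration();
Cluster cluster = new Cluster(conf);
JobID jobId = JobID.forName("job_1234567890123_0001"); // placeholder ID string
Job job = cluster.getJob(jobId); // returns null if the ID is unknown
if (job != null) {
    JobStatus status = job.getStatus();
    System.out.println(status.getState());
}
```

If I read the javadoc correctly, going from a job *name* to a JobID would still require enumerating cluster.getAllJobStatuses() and matching on the name, so unique job names matter here too.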
I am adding a file to Hadoop's DistributedCache using:
Configuration cng=new Configuration();
JobConf conf = new JobConf(cng, Driver.class);
DistributedCache.addCacheFile(new Path("DCache/Orders.txt").toUri(), cng);
where DCache/Orders.txt is the file in HDFS.
When I try to retrieve this file from the cache in the configure method of my mapper using:
Path[] cacheFiles=DistributedCache.getLocalCacheFiles(conf);
I get a null pointer. What could the error be?
Thanks
DistributedCache doesn't work in single-node (local) mode; it just returns a null pointer. Or at least that was my experience with the current version.
I think the URI is supposed to start with the hdfs:// scheme.
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#DistributedCache
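To illustrate what a fully qualified URI looks like (the host and port here are placeholders for your cluster's fs.default.name, not values from the question):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class CacheUriExample {
    public static void main(String[] args) throws URISyntaxException {
        // Hypothetical fully qualified HDFS URI; "namenode:8020" is a placeholder.
        URI cacheUri = new URI("hdfs://namenode:8020/user/me/DCache/Orders.txt");
        // On a real cluster you would then register it:
        // DistributedCache.addCacheFile(cacheUri, conf);
        System.out.println(cacheUri.getScheme()); // hdfs
        System.out.println(cacheUri.getPath());   // /user/me/DCache/Orders.txt
    }
}
```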
In many real-life situations where you apply MapReduce, the final algorithms end up being several MapReduce steps.
i.e. Map1 , Reduce1 , Map2 , Reduce2 , and so on.
So you have the output from the last reduce that is needed as the input for the next map.
The intermediate data is something you (in general) do not want to keep once the pipeline has completed successfully. Also, because this intermediate data is in general some data structure (like a 'map' or a 'set'), you don't want to put too much effort into writing and reading these key-value pairs.
What is the recommended way of doing that in Hadoop?
Is there a (simple) example that shows how to handle this intermediate data in the correct way, including the cleanup afterward?
I think this tutorial on Yahoo's developer network will help you with this: Chaining Jobs
You use JobClient.runJob(). The output path of the data from the first job becomes the input path to your second job. These need to be passed in as arguments to your jobs, with appropriate code to parse them and set up the parameters for the job.
I think that the above method might, however, be the way the now-older mapred API did it, but it should still work. There is a similar method in the new mapreduce API, but I'm not sure what it is.
As far as removing intermediate data after a job has finished, you can do this in your code. The way I've done it before is using something like:
FileSystem.delete(Path f, boolean recursive);
where the path is the HDFS location of the data (note that delete is an instance method, so you need a FileSystem handle such as FileSystem.get(conf)). You need to make sure that you only delete this data once no other job requires it.
There are many ways you can do it.
(1) Cascading jobs
Create the JobConf object "job1" for the first job and set all the parameters with "input" as the input directory and "temp" as the output directory. Execute this job:
JobClient.runJob(job1);
Immediately below it, create the JobConf object "job2" for the second job and set all the parameters with "temp" as the input directory and "output" as the output directory. Execute this job:
JobClient.runJob(job2);
(2) Create two JobConf objects and set all the parameters in them just like in (1), except that you don't call JobClient.runJob.
Then create two Job objects with jobconfs as parameters:
Job job1=new Job(jobconf1);
Job job2=new Job(jobconf2);
Using the jobControl object, you specify the job dependencies and then run the jobs:
JobControl jbcntrl=new JobControl("jbcntrl");
jbcntrl.addJob(job1);
jbcntrl.addJob(job2);
job2.addDependingJob(job1);
jbcntrl.run();
(3) If you need a structure somewhat like Map+ | Reduce | Map*, you can use the ChainMapper and ChainReducer classes that come with Hadoop version 0.19 and onwards.
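As a rough sketch of option (3) with the old mapred API (the driver, mapper, and reducer class names below are hypothetical placeholders, not from the question):

```java
// Builds a single MR job of shape Map1 -> Map2 -> Reduce -> Map3.
// MyDriver, FirstMapper, SecondMapper, MyReducer, LastMapper are hypothetical.
JobConf job = new JobConf(MyDriver.class);
ChainMapper.addMapper(job, FirstMapper.class,
    LongWritable.class, Text.class, Text.class, Text.class,
    true, new JobConf(false));
ChainMapper.addMapper(job, SecondMapper.class,
    Text.class, Text.class, Text.class, IntWritable.class,
    true, new JobConf(false));
ChainReducer.setReducer(job, MyReducer.class,
    Text.class, IntWritable.class, Text.class, IntWritable.class,
    true, new JobConf(false));
ChainReducer.addMapper(job, LastMapper.class,
    Text.class, IntWritable.class, Text.class, IntWritable.class,
    true, new JobConf(false));
JobClient.runJob(job);
```

Note how the types chain: each stage's output key/value classes must match the next stage's input classes, and byValue=true copies the objects between stages.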
There are actually a number of ways to do this. I'll focus on two.
One is via Riffle ( http://github.com/cwensel/riffle ) an annotation library for identifying dependent things and 'executing' them in dependency (topological) order.
Or you can use a Cascade (and MapReduceFlow) in Cascading ( http://www.cascading.org/ ). A future version will support Riffle annotations, but it works great now with raw MR JobConf jobs.
A variant on this is to not manage MR jobs by hand at all, but develop your application using the Cascading API. Then the JobConf and job chaining is handled internally via the Cascading planner and Flow classes.
This way you spend your time focusing on your problem, not on the mechanics of managing Hadoop jobs. You can even layer different languages on top (like Clojure or JRuby) to simplify your development and applications even further. http://www.cascading.org/modules.html
I have done job chaining using JobConf objects one after the other. I took the WordCount example for chaining the jobs. One job figures out how many times a word is repeated in the given input. The second job takes the first job's output as input and figures out the total number of words in the given input. Below is the code that needs to be placed in the Driver class.
//First Job - Counts, how many times a word encountered in a given file
JobConf job1 = new JobConf(WordCount.class);
job1.setJobName("WordCount");
job1.setOutputKeyClass(Text.class);
job1.setOutputValueClass(IntWritable.class);
job1.setMapperClass(WordCountMapper.class);
job1.setCombinerClass(WordCountReducer.class);
job1.setReducerClass(WordCountReducer.class);
job1.setInputFormat(TextInputFormat.class);
job1.setOutputFormat(TextOutputFormat.class);
//Ensure that a folder with the "input_data" exists on HDFS and contains the input files
FileInputFormat.setInputPaths(job1, new Path("input_data"));
//"first_job_output" contains data that how many times a word occurred in the given file
//This will be the input to the second job. For second job, input data name should be
//"first_job_output".
FileOutputFormat.setOutputPath(job1, new Path("first_job_output"));
JobClient.runJob(job1);
//Second Job - Counts total number of words in a given file
JobConf job2 = new JobConf(TotalWords.class);
job2.setJobName("TotalWords");
job2.setOutputKeyClass(Text.class);
job2.setOutputValueClass(IntWritable.class);
job2.setMapperClass(TotalWordsMapper.class);
job2.setCombinerClass(TotalWordsReducer.class);
job2.setReducerClass(TotalWordsReducer.class);
job2.setInputFormat(TextInputFormat.class);
job2.setOutputFormat(TextOutputFormat.class);
//Path name for this job should match first job's output path name
FileInputFormat.setInputPaths(job2, new Path("first_job_output"));
//This will contain the final output. If you want to send this jobs output
//as input to third job, then third jobs input path name should be "second_job_output"
//In this way, jobs can be chained, sending output one to other as input and get the
//final output
FileOutputFormat.setOutputPath(job2, new Path("second_job_output"));
JobClient.runJob(job2);
Command to run these jobs is:
bin/hadoop jar TotalWords.
We need to give the final job's name in the command. In the above case, it is TotalWords.
You may run the MR chain in the manner given in the code below.
PLEASE NOTE: Only the driver code has been provided.
public class WordCountSorting {
    // Driver for an MR chain: Job 1 does the word count,
    // Job 2 (map-only) reads Job 1's output so the word keys come out sorted.
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        // THE DRIVER CODE FOR THE MR CHAIN
        Configuration conf1 = new Configuration();
        Job j1 = Job.getInstance(conf1);
        j1.setJarByClass(WordCountSorting.class);
        j1.setMapperClass(MyMapper.class);
        j1.setReducerClass(MyReducer.class);
        j1.setMapOutputKeyClass(Text.class);
        j1.setMapOutputValueClass(IntWritable.class);
        j1.setOutputKeyClass(LongWritable.class);
        j1.setOutputValueClass(Text.class);
        Path outputPath = new Path("FirstMapper");
        FileInputFormat.addInputPath(j1, new Path(args[0]));
        FileOutputFormat.setOutputPath(j1, outputPath);
        outputPath.getFileSystem(conf1).delete(outputPath, true); // clear any stale output dir
        j1.waitForCompletion(true);

        Configuration conf2 = new Configuration();
        Job j2 = Job.getInstance(conf2);
        j2.setJarByClass(WordCountSorting.class);
        j2.setMapperClass(MyMapper2.class);
        j2.setNumReduceTasks(0); // map-only job
        j2.setOutputKeyClass(Text.class);
        j2.setOutputValueClass(IntWritable.class);
        Path outputPath1 = new Path(args[1]);
        FileInputFormat.addInputPath(j2, outputPath); // Job 1's output is Job 2's input
        FileOutputFormat.setOutputPath(j2, outputPath1);
        outputPath1.getFileSystem(conf2).delete(outputPath1, true);
        System.exit(j2.waitForCompletion(true) ? 0 : 1);
    }
}
THE SEQUENCE IS
(JOB1) MAP -> REDUCE -> (JOB2) MAP
This was done to get the keys sorted; there are other ways, such as using a TreeMap.
Yet I want to focus your attention on the way the jobs have been chained!
Thank you
You can use Oozie for batch processing your MapReduce jobs. http://issues.apache.org/jira/browse/HADOOP-5303
There are examples in the Apache Mahout project that chain together multiple MapReduce jobs. One of the examples can be found at:
RecommenderJob.java
http://search-lucene.com/c/Mahout:/core/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java%7C%7CRecommenderJob
We can make use of the waitForCompletion(true) method of Job to define dependencies among jobs.
In my scenario, I had 3 jobs which were dependent on each other. In the driver class I used the code below, and it works as expected.
public static void main(String[] args) throws Exception {
    CCJobExecution ccJobExecution = new CCJobExecution();
    Job distanceTimeFraudJob = ccJobExecution.configureDistanceTimeFraud(new Configuration(), args[0], args[1]);
    Job spendingFraudJob = ccJobExecution.configureSpendingFraud(new Configuration(), args[0], args[1]);
    Job locationFraudJob = ccJobExecution.configureLocationFraud(new Configuration(), args[0], args[1]);

    System.out.println("=================Started Executing distanceTimeFraudJob=================");
    distanceTimeFraudJob.submit();
    if (distanceTimeFraudJob.waitForCompletion(true)) {
        System.out.println("=================Completed distanceTimeFraudJob=================");
        System.out.println("=================Started Executing spendingFraudJob=================");
        spendingFraudJob.submit();
        if (spendingFraudJob.waitForCompletion(true)) {
            System.out.println("=================Completed spendingFraudJob=================");
            System.out.println("=================Started locationFraudJob=================");
            locationFraudJob.submit();
            if (locationFraudJob.waitForCompletion(true)) {
                System.out.println("=================Completed locationFraudJob=================");
            }
        }
    }
}
The new class org.apache.hadoop.mapreduce.lib.chain.ChainMapper helps with this scenario.
Although there are complex server-based Hadoop workflow engines (e.g., Oozie), I have a simple Java library that enables execution of multiple Hadoop jobs as a workflow. The job configuration and the workflow defining inter-job dependencies are configured in a JSON file. Everything is externally configurable and does not require any change to an existing MapReduce implementation to be part of a workflow.
Details can be found here. Source code and a jar are available on GitHub.
http://pkghosh.wordpress.com/2011/05/22/hadoop-orchestration/
Pranab
I think Oozie helps the subsequent jobs receive their inputs directly from the previous job. This avoids the I/O operation performed with JobControl.
If you want to programmatically chain your jobs, you will want to use JobControl. The usage is quite simple:
JobControl jobControl = new JobControl(name);
After that you add ControlledJob instances. A ControlledJob defines a job together with its dependencies, thus automatically plugging inputs and outputs together to fit a "chain" of jobs.
jobControl.addJob(new ControlledJob(job, Arrays.asList(controlledjob1, controlledjob2)));
jobControl.run();
starts the chain. You will want to put that in a separate thread, which allows you to check the status of your chain while it runs:
while (!jobControl.allFinished()) {
System.out.println("Jobs in waiting state: " + jobControl.getWaitingJobList().size());
System.out.println("Jobs in ready state: " + jobControl.getReadyJobsList().size());
System.out.println("Jobs in running state: " + jobControl.getRunningJobList().size());
List<ControlledJob> successfulJobList = jobControl.getSuccessfulJobList();
System.out.println("Jobs in success state: " + successfulJobList.size());
List<ControlledJob> failedJobList = jobControl.getFailedJobList();
System.out.println("Jobs in failed state: " + failedJobList.size());
}
As you have mentioned in your requirement that you want the o/p of MRJob1 to be the i/p of MRJob2 and so on, you can consider using an Oozie workflow for this use case. You might also consider writing your intermediate data to HDFS, since it will be used by the next MRJob. And after the job completes, you can clean up your intermediate data.
<start to="mr-action1"/>
<action name="mr-action1">
<!-- action for MRJob1-->
<!-- set output path = /tmp/intermediate/mr1-->
<ok to="mr-action2"/>
<error to="fail"/>
</action>
<action name="mr-action2">
<!-- action for MRJob2-->
<!-- set input path = /tmp/intermediate/mr1-->
<ok to="success"/>
<error to="fail"/>
</action>
<action name="success">
<!-- action for success-->
<ok to="end"/>
<error to="end"/>
</action>
<action name="fail">
<!-- action for fail-->
<ok to="end"/>
<error to="end"/>
</action>
<end name="end"/>
A new answer, since the accepted answer's JobClient.runJob() approach does not work in the new API:
If you have two jobs like this:
Configuration conf1 = new Configuration();
Job job1 = Job.getInstance(conf1, "a");
Configuration conf2 = new Configuration();
Job job2 = Job.getInstance(conf2, "b");
Then the only thing you need to do is add the following line before creating 'job2':
job1.waitForCompletion(true);
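Put together, a minimal sketch of such a chain in the new API might look like this (mapper/reducer setup is elided; the "temp" directory name and argument indices are placeholders):

```java
Configuration conf1 = new Configuration();
Job job1 = Job.getInstance(conf1, "a");
// ... set jar, mapper, reducer, and output key/value classes for job1 ...
FileInputFormat.addInputPath(job1, new Path(args[0]));
Path temp = new Path("temp"); // intermediate directory (placeholder name)
FileOutputFormat.setOutputPath(job1, temp);
if (!job1.waitForCompletion(true)) {
    System.exit(1); // don't start job2 if job1 failed
}

Configuration conf2 = new Configuration();
Job job2 = Job.getInstance(conf2, "b");
// ... set jar, mapper, reducer, and output key/value classes for job2 ...
FileInputFormat.addInputPath(job2, temp); // job1's output is job2's input
FileOutputFormat.setOutputPath(job2, new Path(args[1]));
boolean ok = job2.waitForCompletion(true);
temp.getFileSystem(conf2).delete(temp, true); // clean up the intermediate data
System.exit(ok ? 0 : 1);
```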