How to chain a mapred and a mapreduce job - hadoop

I have two Hadoop jobs that need to be chained together. One is a mapred job (old API) and the other is a mapreduce job (new API); this split is dictated by the external libraries we use for the two jobs.
I want to know whether there is a good way to chain these two jobs.
I have tried one way: first run the mapred job with JobClient.runJob(), and after it finishes run the second one. But there is a problem when I submit this to the Hadoop cluster: since the driver code runs locally, if I close my local terminal only the first job runs and the second one never starts. Is there a good solution for this, so that I can submit the whole chain to the cluster and the local program does not need to keep running?
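A minimal sketch of that driver-side chaining, assuming hypothetical input/output paths and leaving the library-specific mapper/reducer classes as comments. Both JobClient.runJob() and Job.waitForCompletion() block until their job finishes, so this driver still has to keep running somewhere until the second job is done, e.g. on an edge node under nohup, or wrapped in an Oozie workflow as discussed in the related questions below:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapreduce.Job;

    public class ChainDriver {
        public static void main(String[] args) throws Exception {
            // Job 1: old (mapred) API. JobClient.runJob() blocks until the job finishes
            // and throws an exception if it fails, so job 2 only starts on success.
            JobConf conf1 = new JobConf(ChainDriver.class);
            conf1.setJobName("job1-old-api");
            // conf1.setMapperClass(...); conf1.setReducerClass(...);  // library-specific
            org.apache.hadoop.mapred.FileInputFormat.setInputPaths(conf1, new Path(args[0]));
            org.apache.hadoop.mapred.FileOutputFormat.setOutputPath(conf1, new Path(args[1]));
            JobClient.runJob(conf1);

            // Job 2: new (mapreduce) API, reading job 1's output. Also blocks.
            Job job2 = Job.getInstance(new Configuration(), "job2-new-api");
            job2.setJarByClass(ChainDriver.class);
            // job2.setMapperClass(...); job2.setReducerClass(...);    // library-specific
            org.apache.hadoop.mapreduce.lib.input.FileInputFormat.addInputPath(job2, new Path(args[1]));
            org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.setOutputPath(job2, new Path(args[2]));
            System.exit(job2.waitForCompletion(true) ? 0 : 1);
        }
    }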

Related

hadoop jobs in deadlock with pyspark and oozie

I am trying to run PySpark on YARN with Oozie. After submitting the workflow, there are 2 jobs in the Hadoop job queue: one is the Oozie job, with application type "map reduce", and the other is the job it triggers, with application type "Spark". While the first job is running, the second job remains in "accepted" status. Here comes the problem: the first job is waiting for the second job to finish before it can proceed, while the second is waiting for the first one to finish before it can run, so I seem to be stuck in a deadlock. How can I get out of this? Is there any way for the Hadoop job with application type "mapreduce" to run in parallel with jobs of a different application type?
Any advice is appreciated, thanks!
Please check the value of the property below in the YARN scheduler configuration. I guess you need to increase it to something like 0.9 or so.
Property: yarn.scheduler.capacity.maximum-am-resource-percent
You would need to restart YARN, MapReduce and Oozie after updating the property.
More info: Setting Application Limits.
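For reference, a sketch of how that property could look in capacity-scheduler.xml (the 0.9 is just the illustrative value suggested above; the default is 0.1, which caps how much of the cluster ApplicationMasters may use and can leave the Oozie launcher AM and the Spark AM unable to run at the same time):

    <!-- capacity-scheduler.xml (sketch) -->
    <property>
      <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
      <value>0.9</value>
    </property>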

Submitting parallel jobs from the same client to Hadoop

I have a three-node Hadoop 2.6 cluster on which I tried to run multiple instances of TestDFSIO in parallel by putting "&" at the end of each command. But it turns out that only one of those jobs gets submitted and processed by the cluster, and the rest are not even submitted (somehow thrown away). So I was wondering whether this has anything to do with Hadoop's YARN or MapReduce options, or with anything else.
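For comparison, a hedged sketch of what submitting several jobs in parallel from one client looks like with the Java API (hypothetical identity jobs rather than TestDFSIO itself): Job.submit() returns as soon as YARN accepts the job, so all of them land in the queue at once, and the YARN scheduler and queue configuration then decide how many actually run concurrently.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    import java.util.ArrayList;
    import java.util.List;

    public class ParallelSubmit {
        public static void main(String[] args) throws Exception {
            Path input = new Path(args[0]);
            List<Job> jobs = new ArrayList<>();
            for (int i = 0; i < 3; i++) {
                Job job = Job.getInstance(new Configuration(), "parallel-job-" + i);
                job.setJarByClass(ParallelSubmit.class);
                // Default (identity) mapper/reducer; each run needs its own output dir.
                FileInputFormat.addInputPath(job, input);
                FileOutputFormat.setOutputPath(job, new Path(args[1], "run-" + i));
                job.submit();            // non-blocking: returns once YARN accepts the job
                jobs.add(job);
            }
            // All three are now queued on the cluster; how many run at the same time
            // is up to the YARN scheduler and queue limits, not the client.
            for (Job job : jobs) {
                System.out.println(job.getJobName() + " succeeded: " + job.waitForCompletion(false));
            }
        }
    }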

Difference between job, application, task, task attempt logs in Hadoop, Oozie

I'm running an Oozie job with multiple actions and there's a part I could not make work. In the process of troubleshooting I'm overwhelmed with lots of logs.
In the YARN UI (yarn.resourcemanager.webapp.address in yarn-site.xml, normally on port 8088), there are the application_<app_id> logs.
In the Job History Server (yarn.log.server.url in yarn-site.xml, ours on port 19888), there are the job_<job_id> logs. (These job logs should also show up in Hue's Job Browser, right?)
In Hue's Oozie workflow editor, there are the task and task_attempt logs (not sure if they're the same, everything's a mixed-up soup to me already), which redirect to the Job Browser if you click here and there.
Can someone explain the difference between these things from a Hadoop/Oozie architectural standpoint?
P.S.
I've seen container_<container_id> in the logs as well. You might as well include this in your explanation in relation to the things above.
In terms of YARN, the programs that are being run on a cluster are called applications. In terms of MapReduce they are called jobs. So, if you are running MapReduce on YARN, job and application are the same thing (if you take a close look, job ids and application ids are the same).
A MapReduce job consists of several tasks (each is either a map or a reduce task). If a task fails, it is launched again on another node; those runs are the task attempts.
Container is a YARN term. It is a unit of resource allocation. For example, a MapReduce task would be run in a single container.
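A small sketch to make that id correspondence concrete (the ids here are made up): the MapReduce job id and the YARN application id are built from the same cluster timestamp and sequence number, and the task, task attempt and container ids nest underneath them.

    import org.apache.hadoop.mapreduce.JobID;
    import org.apache.hadoop.yarn.api.records.ApplicationId;

    public class IdDemo {
        public static void main(String[] args) {
            // Hypothetical MapReduce job id: cluster start timestamp + sequence number.
            JobID jobId = JobID.forName("job_1448064502511_0007");

            // The matching YARN application id is built from the same two components.
            ApplicationId appId = ApplicationId.newInstance(
                    Long.parseLong(jobId.getJtIdentifier()), jobId.getId());

            System.out.println(jobId);   // job_1448064502511_0007
            System.out.println(appId);   // application_1448064502511_0007

            // Task and attempt ids nest inside the job id, e.g.
            //   task_1448064502511_0007_m_000000        (first map task)
            //   attempt_1448064502511_0007_m_000000_0   (its first attempt)
            // and each attempt runs in a YARN container such as
            //   container_1448064502511_0007_01_000002
        }
    }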

How to schedule Hadoop jobs conditionally?

I am pretty new to Hadoop, and particularly to Hadoop Job Scheduling. Here is what I am trying to do.
I have 2 flows, each having a Hadoop job. I have the freedom to put these flows either in the same project or in different ones. I don't want the Hadoop jobs to run simultaneously on the cluster, but I do want to make sure that they run alternately.
E.g. flow_1 (with hadoop_job_1) runs and finishes -> flow_2 (with hadoop_job_2) runs and finishes -> flow_1 (with hadoop_job_1) runs and finishes and so on.
And of course, I would also like to handle special conditions gracefully.
E.g. if flow_1 is done but flow_2 is not ready, then flow_1 gets a chance to run again if it is ready; if flow_1 fails, flow_2 still gets its turn, etc.
I would like to know which schedulers I could explore that are capable of doing this.
We are using MapR.
Thanks
This looks to be a standard use case for Oozie. Take a look at these tutorials:
Executing an Oozie workflow with Pig, Hive & Sqoop actions and Oozie workflow scheduler for Hadoop
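As a rough illustration (not taken from those tutorials), an Oozie workflow for this could chain the two flows with ok/error transitions; all names, paths and properties below are hypothetical, the map-reduce action bodies are stripped to a skeleton, and an Oozie coordinator would then re-trigger the workflow on a schedule to get the alternating behaviour:

    <workflow-app name="alternating-flows" xmlns="uri:oozie:workflow:0.4">
        <start to="hadoop_job_1"/>

        <action name="hadoop_job_1">
            <map-reduce>
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <!-- mapper/reducer classes and input/output paths go in <configuration> -->
            </map-reduce>
            <ok to="hadoop_job_2"/>
            <!-- point this at "hadoop_job_2" instead of "fail" if flow_2
                 should still get its turn when flow_1 fails -->
            <error to="fail"/>
        </action>

        <action name="hadoop_job_2">
            <map-reduce>
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
            </map-reduce>
            <ok to="end"/>
            <error to="fail"/>
        </action>

        <kill name="fail">
            <message>Flow failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
        </kill>
        <end name="end"/>
    </workflow-app>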

Does oozie provide any performance optimizations in terms of I/O?

Since Oozie is a workflow engine for the Hadoop platform, does it improve the performance of executing a DAG of dependent MapReduce jobs?
I mean, since the output of one MapReduce job is given as input to the next MapReduce job in the DAG, does Oozie provide any mechanism for storing the intermediate results in memory and thus saving I/O?
Or is it just a workflow manager that coordinates a series of dependent MapReduce jobs?
I want to know how Oozie works internally.
It is just a workflow manager. It doesn't change how, say, MapReduce works even though it runs M/R jobs.
What you are describing is much more like what Apache Spark does. I'm not aware that Oozie integrates directly with Spark yet, but it can't possibly be difficult or far off.
It is "just a workflow manager, that coordinates a series of MapReduce" jobs. It uses the same mechanisms to execute jobs as using the command line.

Resources