Does Oozie provide any performance optimizations in terms of I/O?

Since Oozie is a workflow engine for the Hadoop platform, does it improve the execution performance of a DAG of dependent MapReduce jobs?
I mean, since the output of one MapReduce job is given as input to the next MapReduce job in the DAG, does Oozie provide any mechanism for storing the intermediate results in memory, thus saving I/O?
Or is it just a workflow manager that coordinates a series of dependent MapReduce jobs?
I want to know how Oozie works internally.

It is just a workflow manager. It doesn't change how, say, MapReduce works even though it runs M/R jobs.
What you are describing is much more like what Apache Spark does. I'm not aware that Oozie integrates directly with Spark yet, but it can't be difficult or far off.
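For contrast, here is a minimal Spark sketch (the paths and the threshold are made up for illustration) of the in-memory chaining the question asks about; this is Spark behavior, not something Oozie adds:

import org.apache.spark.{SparkConf, SparkContext}

object InMemoryChain {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("in-memory-chain"))

    // Stage 1: word count. With two chained MapReduce jobs, this result would
    // be written to HDFS and re-read by the next job; here it can stay cached
    // in executor memory instead.
    val counts = sc.textFile("hdfs:///input")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)
      .cache()

    // Stage 2 consumes stage 1's output directly from memory; only the final
    // result is written to disk.
    counts.filter { case (_, n) => n > 100 }.saveAsTextFile("hdfs:///output")

    sc.stop()
  }
}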

It is "just a workflow manager, that coordinates a series of MapReduce" jobs. It uses the same mechanisms to execute jobs as using the command line.

Related

Hadoop Ecosystem: Map Reduce needed for Pig/Hive

There are a whole lot of Hadoop ecosystem pictures on the internet, and I struggle to get an understanding of how the tools work together.
E.g. in the picture attached, why are Pig and Hive based on MapReduce, whereas other tools like Spark or Storm sit directly on YARN?
Would you be so kind as to explain this?
Thanks!
BR
[hadoop ecosystem diagram]
The picture shows Pig and Hive on top of MapReduce because MapReduce is the distributed computing engine they use: Pig and Hive queries are executed as MapReduce jobs. It is easier to work with Pig and Hive, since they provide a higher-level abstraction over MapReduce.
Now let's take a look at Spark/Storm/Flink on YARN in the picture. YARN is a cluster manager that allows various applications to run on top of it. Storm, Spark and Flink are all examples of applications that can run on top of YARN. MapReduce itself is also an application that can run on YARN, as shown in the diagram. YARN handles the resource-management piece so that multiple applications can share the same cluster. (If you are interested in another example of a similar technology, check out Mesos.)
Finally, at the bottom of the picture is HDFS, the distributed storage layer that allows applications to store and access data. It provides replication and fault tolerance.
If you are interested in deeper-dives, check out the Apache Projects page.

How does MapReduce processing work with Local File System?

How does MapReduce processing work if the input/output are on the local file system?
Does MapReduce job execution happen asynchronously across the Hadoop cluster?
If yes, how does that happen?
In which use case do we actually need this approach?
MapReduce works the same way on a local file system (mapper -> reducer); it is only a matter of efficiency, since it will be less efficient on a local system than on a cluster.
Yes, MapReduce job execution happens asynchronously across the Hadoop cluster (how depends on what kind of scheduler you are using in your MapReduce program); see the Hadoop scheduler documentation for more.
In most cases this approach is used for testing purposes (running a MapReduce program against the local file system).
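As a concrete illustration, here is a minimal, hypothetical driver (the paths are made up, and the configuration keys are the Hadoop 2.x-style names) showing the two settings that point a job at a single local JVM and the local file system instead of a cluster and HDFS:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

object LocalRun {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Run mappers and reducers in a single local JVM instead of on a cluster.
    conf.set("mapreduce.framework.name", "local")
    // Resolve unqualified paths against the local file system, not HDFS.
    conf.set("fs.defaultFS", "file:///")

    val job = Job.getInstance(conf, "local-test")
    // No mapper/reducer classes are set, so Hadoop's identity Mapper/Reducer
    // run; substitute your own job classes here.
    FileInputFormat.addInputPath(job, new Path("file:///tmp/mr-input"))
    FileOutputFormat.setOutputPath(job, new Path("file:///tmp/mr-output"))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}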

What is oozie equivalent for Spark?

We have very complex pipelines which we need to compose and schedule. I see that the Hadoop ecosystem has Oozie for this. What are the choices for Spark-based jobs when I am running Spark on Mesos or Standalone and don't have a Hadoop cluster?
Unlike with Hadoop, it is pretty easy to chain things together with Spark, so writing a Spark Scala script might be enough. My first recommendation is trying that; a sketch follows below.
If you'd like to keep it SQL-like, you can try Spark SQL.
If you have a really complex flow, it is worth looking at Google Dataflow: https://github.com/GoogleCloudPlatform/DataflowJavaSDK.
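A minimal sketch of that first recommendation (the paths and field positions are invented for illustration): what would otherwise be several scheduled, chained jobs collapses into one script, so no external scheduler is needed between steps.

import org.apache.spark.{SparkConf, SparkContext}

object Pipeline {
  def main(args: Array[String]): Unit = {
    // Works the same against Mesos or Standalone masters; the master URL is
    // normally supplied via spark-submit rather than hard-coded here.
    val sc = new SparkContext(new SparkConf().setAppName("pipeline"))

    // Step 1: parse raw records.
    val parsed = sc.textFile("/data/raw").map(_.split(','))
    // Step 2: filter and aggregate, consuming step 1's output directly.
    val counts = parsed.filter(_.length > 2)
      .map(fields => (fields(0), 1L))
      .reduceByKey(_ + _)
    // Step 3: persist only the final result.
    counts.saveAsTextFile("/data/out")

    sc.stop()
  }
}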
Oozie can be used in the case of YARN.
For Spark there is no built-in scheduler available, so you are free to choose any scheduler that works in cluster mode.
For Mesos, I feel Chronos would be the right choice; see the Chronos documentation for more info.

How to schedule Hadoop jobs conditionally?

I am pretty new to Hadoop, and particularly to Hadoop Job Scheduling. Here is what I am trying to do.
I have 2 flows, each having a Hadoop job. I have the freedom to put these flows either in the same project or in different ones. I don't want the Hadoop jobs to run simultaneously on the cluster, but I also want to make sure that they run alternately.
E.g. flow_1 (with hadoop_job_1) runs and finishes -> flow_2 (with hadoop_job_2) runs and finishes -> flow_1 (with hadoop_job_1) runs and finishes and so on.
And of course, I would also like to handle special conditions gracefully.
E.g. if flow_1 is done but flow_2 is not ready, then flow_1 gets a chance to run again if it is ready; if flow_1 fails, flow_2 still gets its turn; etc.
I would like to know which schedulers I can explore which are capable of doing this.
We are using MapR.
Thanks
This looks to be a standard use case for Oozie. Take a look at these tutorials:
Executing an Oozie workflow with Pig, Hive & Sqoop actions and Oozie workflow scheduler for Hadoop

How to schedule hadoop jobs using BMC Control-M?

Does anybody know how to control/schedule Hadoop jobs using BMC Control-M software? Is it even possible?
I have tried Oozie and want to explore more options for scheduling Hadoop jobs.
Please enlighten!
The answer is YES.
And this answer is going to get even better.
Today, you can use the abundant command-line interfaces available with the various Hadoop components. You can run these CLIs as commands individually, combine them into scripts embedded directly in Control-M jobs, or wrap them in shell scripts (Bash is a popular one) and schedule them with Control-M. I've provided a sample script below that performs some HDFS manipulation and then runs a MapReduce job.
The better part is coming in a few months when we will be releasing our integrated support for Hadoop. At that point (I am assuming you are familiar with BMC Control-M) we will be providing graphical forms similar to our other CMs, for defining various job types (Pig, Hive, MapReduce are all being considered but I'm not sure what will actually get implemented), integrated support for status monitoring, retrieval of job output, etc.
We have already heard from a number of customers who are using Control-M to manage their Hadoop environments.
In addition to the "mechanics" of running Hadoop jobs, you also get Control-M's capabilities for managing graphical flows, integration with a broad range of platforms and applications, the ability to manage service levels, forecasting, auditing, reporting, and much more.
I would be happy to discuss this further with you, especially since we are still in the early stages of this work; we would love to learn what your requirements are in this area. Please send me a note at joe_goldberg@bmc.com and I would be happy to set up a conference call or demo.
#!/bin/csh
#
# Sample Control-M job script: clean up any previous output, then run the
# grep example MapReduce job. $UUID is assumed to be set by the scheduler
# so each run gets a unique output directory.
cd /h/gron/java/hadoop/hadoop-1.0.3
# Remove the output directory left by a previous run, if any.
bin/hadoop dfs -rmr output_$UUID
# Scan 'input' for strings matching the regex dfs[a-z.]+ and write matches.
bin/hadoop jar hadoop-examples-1.0.3.jar grep input output_$UUID 'dfs[a-z.]+'
