Workflow tool comparison: Oozie vs Cascading - hadoop

I am looking for a workflow tool to run complex map-reduce jobs. I have Oozie in mind but also want to explore Cascading. Is there any sample code or example that chains existing M/R jobs using the Cascading API? Also, can you provide a comparison of Oozie vs Cascading?

Cascading and Oozie are not in the same category.
Oozie is a workflow scheduler.
Cascading is an API for creating workflows. It is agnostic about schedulers, i.e., it should run with whatever scheduler system you use.
There is perhaps some confusion because the Oozie docs mention a "DAG", and both run atop Hadoop.
Also, Cascading has a notion of "data availability" in the checkpoint support, which is supported in Oozie, albeit differently.
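To make the distinction concrete, here is a minimal sketch in plain Python (it is not Cascading or Oozie code). The workflow is just a dependency graph of steps, which is the part an API like Cascading helps you build; deciding when to trigger that workflow is the scheduler's job, which is where Oozie sits. The step names and the `run_workflow` helper are hypothetical and exist only for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(steps, dependencies):
    """Run each step once, after all of the steps it depends on.

    steps:        dict mapping a step name to a zero-argument callable
    dependencies: dict mapping a step name to the set of steps it needs first
    Returns the order in which the steps were executed.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        steps[name]()
    return order


# Hypothetical three-step workflow: two independent jobs feed a join.
log = []
steps = {
    "parse_logs": lambda: log.append("parse_logs"),
    "load_users": lambda: log.append("load_users"),
    "join": lambda: log.append("join"),
}
dependencies = {
    "parse_logs": set(),
    "load_users": set(),
    "join": {"parse_logs", "load_users"},
}

executed = run_workflow(steps, dependencies)
```

In this picture, a scheduler would be whatever calls `run_workflow` at a given time or when input data becomes available; the graph itself does not care which scheduler that is.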

Personally, I have played around with both to some extent. What I found interesting about Cascading:
1) It is concise and expressive, built on a few simple concepts such as flow, tap, and pipe.
2) It offers a great TDD-based approach for local development and research.
3) It has a nice planner view (.dot file), which becomes useful as the project grows and makes maintenance easier.
4) It offers a DSL-based approach using Groovy, Scala, or Clojure, so there is no need to learn a new language or the details of Hadoop.
5) Cloud deployment is simple (e.g. Amazon supports deploying the raw JAR).
6) You can call existing Pig, Hive, or plain MR JARs, as long as they expose a Java API.
7) It works very well for ML and NLP related work.

Related

What is oozie equivalent for Spark?

We have very complex pipelines which we need to compose and schedule. I see that the Hadoop ecosystem has Oozie for this. What are the choices for Spark-based jobs when I am running Spark on Mesos or in Standalone mode and don't have a Hadoop cluster?
Unlike with Hadoop, it is pretty easy to chain things with Spark, so writing a Spark Scala script might be enough. My first recommendation is trying that.
If you would like to keep it SQL-like, you can try Spark SQL.
If you have a really complex flow, it is worth looking at Google Cloud Dataflow: https://github.com/GoogleCloudPlatform/DataflowJavaSDK.
Oozie can be used if you run on YARN. Spark itself has no built-in workflow scheduler, so you are free to choose any scheduler that works in cluster mode.
For Mesos, I feel Chronos would be the right choice.

How to start exploring BigData, Hadoop and its ecosystem components?

I have just started exploring BigData technology and the Hadoop framework.
But I am getting confused by so many ecosystem components and frameworks. Could you please advise how to get a structured start for learning?
I mean, which ecosystem components should one focus on? Any in particular, or all?
Help much appreciated!
Ranit
I wrote this answer on Quora a few months back. Hope this will help:
1. Go through some introductory videos on Hadoop
It's very important to have a high-level idea of Hadoop before you start working with it directly. These introductory videos will help in understanding the scope of Hadoop and the use cases where it can be applied. There are a lot of resources available online, and going through any of the videos will be beneficial.
2. Understanding MapReduce
The second thing which helped me was to understand what Map Reduce is and how it works. It is explained very nicely in this paper: http://static.googleusercontent....
Another nice tutorial is available here : http://ksat.me/map-reduce-a-real...
For points 1 and 2, go through the first four video lectures of week one. The whole concept of distributed computing and MapReduce is explained very nicely here: https://class.coursera.org/mmds-001/lecture
3. Getting started with Cloudera VM
Once you understand the basics of Hadoop, you can download the VM provided by Cloudera and start running some Hadoop commands on it. You can download the VM from this link: http://www.cloudera.com/content/...
It would be nice to get familiar with the basic Hadoop commands on the VM and understand how they work.
4. Setting up the standalone/Pseudo distributed Hadoop
I would recommend setting up your own standalone Hadoop on your machine once you are familiar with Hadoop using the VM. The installation steps are explained very nicely in this blog post by Michael G. Noll: Running Hadoop On Ubuntu Linux (Single-Node Cluster)
5. Understanding the Hadoop Ecosystem
It would be nice to get familiar with other components in the Hadoop ecosystem like Apache Pig, Hive, HBase, Flume-NG, Hue etc. All of these serve different purposes, and knowing a little about each will be really helpful in building any product around the Hadoop ecosystem. You can install all of them easily on your machine and get started with them. The Cloudera VM has most of them installed already.
6. Writing Map Reduce Jobs
Once you are done with steps 1-5, I don't think writing MapReduce jobs will be a challenge. It is explained thoroughly in Hadoop: The Definitive Guide. If MapReduce really interests you, I would suggest reading the book Mining of Massive Datasets by Anand Rajaraman, Jure Leskovec and Jeffrey D. Ullman: Page on Stanford
I would recommend going for Hadoop first; it's the basis for a lot of the other systems out there. Check out the main site (http://hadoop.apache.org/) and check out Cloudera: they provide a virtual machine image, built on their CDH distribution, that comes with everything pre-installed, so you can jump into action without having to deal with installation problems: http://www.cloudera.com/content/cloudera/en/downloads/cdh/cdh-5-2-0.html
After that, I would look into HDFS, just to understand a bit more about how Hadoop stores data. From there it depends on what type of problems you're trying to solve, since each system tackles a specific and (usually) different problem:
Hive / Cassandra: For database-like interaction
Pig: For data transformation.
Spark: For real time data analysis
Check out this link for more details: http://www.cloudera.com/content/cloudera/en/training/library/apache-hadoop-ecosystem.html
I hope you find that useful.
Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, and information privacy. (From Wikipedia)
Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
There are four main modules in Hadoop.
1. Hadoop Common: The common utilities that support the other Hadoop modules.
2. Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
3. Hadoop YARN: A framework for job scheduling and cluster resource management.
4. Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Before going further, let's note that we have three different types of data.
Structured: Structured data has a strong schema, and the schema is checked during write and read operations, e.g. data in RDBMS systems like Oracle, MySQL Server etc.
Unstructured: The data does not have any structure and can take any form, e.g. web server logs, e-mail, images etc.
Semi-structured: The data is not strictly structured but has some structure, e.g. XML files.
Depending on the type of data to be processed, we have to choose the right technology.
Some more projects that are part of the Hadoop ecosystem:
HBase™: A scalable, distributed database that supports structured data storage for large tables.
Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Pig™: A high-level data-flow language and execution framework for parallel computation.
A Hive vs Pig comparison can be found in my other post in this question.
HBase won't replace MapReduce. HBase is a scalable distributed database, and MapReduce is a programming model for distributed processing of data. MapReduce may act on data in HBase during processing.
You can use Hive/HBase for structured/semi-structured data and process it with Hadoop MapReduce.
You can use Sqoop to import structured data from a traditional RDBMS (Oracle, SQL Server etc.) and process it with Hadoop MapReduce.
You can use Flume to collect unstructured data and process it with Hadoop MapReduce.
Have a look at: Hadoop Use Cases
Hive should be used for analytical querying of data collected over a period of time, e.g. to calculate trends or summarize website logs, but it can't be used for real-time queries.
HBase fits real-time querying of big data. Facebook uses it for messaging and real-time analytics.
Pig can be used to construct data flows, run scheduled jobs, crunch big volumes of data, aggregate/summarize it, and store it in relational database systems. It is good for ad-hoc analysis.
Hive can be used for ad-hoc data analysis, but unlike Pig it can't support all unstructured data formats.
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization and providing group services which are very useful for a variety of distributed systems. HBase is not operational without ZooKeeper.
Apache Spark is a general compute engine that offers fast data analysis on a large scale. Spark can run on top of HDFS but bypasses MapReduce and instead uses its own data processing framework. Common use cases for Apache Spark include real-time queries, event stream processing, iterative algorithms, complex operations and machine learning.
Mahout™: A scalable machine learning and data mining library.
Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
I have covered only some of the key components of the Hadoop ecosystem. If you would like to look at all components of the ecosystem, have a look at this ecosystem table.
If the table above is too difficult to digest, have a look at the condensed version of the ecosystem in this article.
But to understand all of these systems, I would suggest you start with the Apache website first and explore other articles later.
Big data is not a technology in itself; it is a concept.
Think of databases: a database is not a technology in itself, it is a concept. Oracle, DB2 etc. are database technologies.
Coming back to big data, the concept is about dealing with huge data that is difficult to analyze using traditional databases or technologies. People treat Hadoop as a synonym for big data, but Hadoop is just one technology, developed at Apache, that implements the big data concept.
Hadoop has its own file system called HDFS, and it uses MapReduce to solve big data problems. Apart from Hadoop there is Hive, which offers an SQL-like language but internally uses MapReduce. HBase is a NoSQL database. Pig is a scripting language which uses MapReduce internally.
There are many commercial Hadoop distributions, such as MapR, Hortonworks, Cloudera etc.
So start by learning Hadoop: HDFS, MapReduce, YARN, Hive.
Things I did to learn Hadoop:
a) Install Hadoop from scratch. I mean download CentOS, Hadoop, Java etc., and install them manually.
b) Understand how HDFS works.
c) Understand how MapReduce works.
d) Write word count in Java.
This will help you get started.
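Before writing word count against the Hadoop API, it can help to see the model on its own. The following is a minimal single-process sketch in Python, not Hadoop code: it imitates the map, shuffle, and reduce phases in memory. The function names are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reducer: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run the three phases in sequence over a list of input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


counts = word_count(["big data big ideas", "Big clusters"])
```

In a real Hadoop job you would write only the mapper and the reducer; the grouping step in the middle is done by the framework across machines.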

Does oozie provide any performance optimizations in terms of I/O?

Since Oozie is a workflow engine for the Hadoop platform, does it improve the performance of executing a DAG of dependent MapReduce jobs?
I mean, since the output of one MapReduce job is given as input to the next MapReduce job in the DAG, does Oozie provide any mechanism for storing the intermediate results in memory and thus saving I/O?
Or is it just a workflow manager that coordinates a series of dependent MapReduce jobs?
I want to know how Oozie works internally.
It is just a workflow manager. It doesn't change how, say, MapReduce works even though it runs M/R jobs.
What you are describing is much more like what Apache Spark does. I'm not aware that Oozie integrates directly with Spark yet, but it can't possibly be difficult or far off.
It is "just a workflow manager, that coordinates a series of MapReduce" jobs. It uses the same mechanisms to execute jobs as using the command line.
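As a rough illustration of that point, here is a small Python sketch (not Oozie code) of two chained jobs coordinated the way a workflow manager would coordinate them: the second job starts only after the first has finished, and the only thing passed between them is a path. Local temporary files stand in for HDFS, and all function names are hypothetical.

```python
import os
import tempfile


def job_one(input_path, output_path):
    """First job: read numbers, write their squares to the output file."""
    with open(input_path) as src, open(output_path, "w") as dst:
        for line in src:
            dst.write(f"{int(line) ** 2}\n")


def job_two(input_path):
    """Second job: read the first job's output back from disk and sum it."""
    with open(input_path) as src:
        return sum(int(line) for line in src)


def run_chain(numbers):
    """Coordinate the two jobs: job two runs only after job one completes."""
    with tempfile.TemporaryDirectory() as workdir:
        raw = os.path.join(workdir, "input.txt")
        intermediate = os.path.join(workdir, "intermediate.txt")
        with open(raw, "w") as f:
            f.writelines(f"{n}\n" for n in numbers)
        job_one(raw, intermediate)    # intermediate result is written out
        return job_two(intermediate)  # and read back in by the next job


total = run_chain([1, 2, 3])
```

The coordinator in `run_chain` never touches the data itself, so it has no opportunity to keep intermediate results in memory; saving that I/O would require changing the execution engine, not the workflow manager.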

How to schedule hadoop jobs using BMC Control-M?

Does anybody know how to control/schedule Hadoop jobs using BMC Control-M software? Is it even possible?
I have tried Oozie and want to explore more options for scheduling Hadoop jobs.
Please enlighten!
The answer is YES.
And this answer is going to get even better.
Today, you can use the abundant command line interfaces available with various Hadoop components. You can then run these CLIs as commands individually or combine them into scripts embedded directly in Control-M jobs or wrapped in shell scripts (Bash is a popular one) and scheduled with Control-M. I've provided a sample script that performs some HDFS manipulation and then runs a MapReduce job.
The better part is coming in a few months when we will be releasing our integrated support for Hadoop. At that point (I am assuming you are familiar with BMC Control-M) we will be providing graphical forms similar to our other CMs, for defining various job types (Pig, Hive, MapReduce are all being considered but I'm not sure what will actually get implemented), integrated support for status monitoring, retrieval of job output, etc.
We have already heard from a number of customers who are using Control-M to manage their Hadoop environments.
In addition to the "mechanics" of running Hadoop jobs, you also get Control-M's capabilities for managing graphical flows, integration with a broad range of platforms and applications, ability to manage Service Levels, forecasting, auditing, reporting, and much more.
I would be happy to discuss this further with you and especially since we are still in the early stages of this work, we would love to learn what your requirements are in this area. Please send me a note at joe_goldberg#bmc.com and I would be happy to set up a conference call or demo.
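#!/bin/csh
# Sample job script: clean up old output, then run the MapReduce grep example.
# UUID is assumed to be set in the job's environment.
cd /h/gron/java/hadoop/hadoop-1.0.3
# Remove any previous output directory
bin/hadoop dfs -rmr output_$UUID
# Run the bundled grep example over the "input" directory
bin/hadoop jar hadoop-examples-1.0.3.jar grep input output_$UUID 'dfs[a-z.]+'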

Best practices for using Oozie for Hadoop

I have been using Hadoop for quite a while now. After some time I realized I need to chain Hadoop jobs and have some type of workflow. I decided to use Oozie, but couldn't find much information about best practices. I would like to hear from more experienced folks.
Best Regards
The best way to learn Oozie is to download the examples tar file that comes with the distribution and run each of them. It has examples for MapReduce, Pig, and streaming workflows, as well as sample coordinator XMLs.
First run the plain workflows, and once you have debugged those, move on to running the workflows with a coordinator, so that you can take it step by step. Lastly, one best practice is to make most of the variables in your workflow and coordinator configurable and supply them through a component.properties file, so that you don't have to touch the XML often.
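The idea behind that last practice can be sketched in a few lines of Python. This only mimics the substitution of ${name} placeholders from a properties file; Oozie does its own parameter resolution when a job is submitted, and this is not its implementation. The property keys and the XML fragment are hypothetical, and the fragment is deliberately trimmed and is not a valid workflow definition.

```python
from string import Template


def load_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props


# Hypothetical properties file contents; the keys are illustrative only.
PROPERTIES = """
# environment-specific settings live here, not in the XML
inputDir=/data/in
outputDir=/data/out
"""

# Hypothetical, heavily trimmed fragment with ${...} placeholders.
WORKFLOW_TEMPLATE = "<input>${inputDir}</input><output>${outputDir}</output>"


def render(template, props):
    """Fill the ${name} placeholders from the properties."""
    return Template(template).substitute(props)


rendered = render(WORKFLOW_TEMPLATE, load_properties(PROPERTIES))
```

Moving from a test cluster to production then means editing only the properties file; the XML stays the same.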
http://yahoo.github.com/oozie/releases/3.1.0/DG_Examples.html
There are documents about Oozie on GitHub and Apache.
https://github.com/yahoo/oozie/wiki
http://yahoo.github.com/oozie/releases/3.1.0/DG_Examples.html
http://incubator.apache.org/oozie/index.html
The Apache documentation is being updated and should be live soon.
