How to schedule Hadoop jobs using BMC Control-M?

Does anybody know how to control/schedule Hadoop jobs using the BMC Control-M software? Is it even possible?
I have tried Oozie and want to explore more options for scheduling Hadoop jobs.
Please enlighten me!

The answer is YES.
And this answer is going to get even better.
Today, you can use the abundant command-line interfaces available with the various Hadoop components. You can run these CLIs as individual commands, or combine them into scripts that are either embedded directly in Control-M jobs or wrapped in shell scripts (Bash is a popular one) and scheduled with Control-M. I've provided a sample script below that performs some HDFS manipulation and then runs a MapReduce job.
The better part is coming in a few months, when we will be releasing our integrated support for Hadoop. At that point (I am assuming you are familiar with BMC Control-M), we will be providing graphical forms, similar to our other CMs, for defining various job types (Pig, Hive, and MapReduce are all being considered, but I'm not sure what will actually get implemented), integrated support for status monitoring, retrieval of job output, etc.
We have already heard from a number of customers who are using Control-M to manage their Hadoop environments.
In addition to the "mechanics" of running Hadoop jobs, you also get Control-M's capabilities for managing graphical flows, integration with a broad range of platforms and applications, the ability to manage service levels, forecasting, auditing, reporting, and much more.
I would be happy to discuss this further with you, and especially since we are still in the early stages of this work, we would love to learn what your requirements are in this area. Please send me a note at joe_goldberg#bmc.com and I would be happy to set up a conference call or demo.
#!/bin/csh
# Sample Control-M job: remove the previous HDFS output directory, then run the
# "grep" example MapReduce job. Assumes $UUID is set by the calling environment
# (e.g. a Control-M variable) so each run writes to a unique output directory.
cd /h/gron/java/hadoop/hadoop-1.0.3
# Clean up any output directory left over from a previous run
bin/hadoop dfs -rmr output_$UUID
# Scan the "input" directory for lines matching the regex; write matches to output_$UUID
bin/hadoop jar hadoop-examples-1.0.3.jar grep input output_$UUID 'dfs[a-z.]+'

Related

How does Hadoop MapReduce work internally in the cloud?

I have started working on Hadoop MapReduce.
I am a beginner to Java and Hadoop. I know how to write Hadoop MapReduce code, but I am interested in learning how it works internally in the cloud.
Can you please share some good links that explain how Hadoop works internally?
How Hadoop works is not related to the cloud. It works the same way on 3 laptops ;-) Hadoop is often "linked" to cloud computing because it is designed to be used with a lot of cheap machines, so it makes sense to run Hadoop in the cloud.
By the way, Hadoop is NOT only map/reduce. It is a distributed file system first, and we are able to execute distributed tasks on top of that distributed file system, and NOT ONLY map/reduce tasks (since version 2, I think).
It's a very large subject, so if you are starting out, you will have to read many articles before you become a master ;-)
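As a small illustration of the "file system first" point, here is a minimal sketch that talks to HDFS through the Java FileSystem API without running any MapReduce job. The "/user/demo" path and the class name are placeholders of mine, and it assumes the cluster's configuration files are on the classpath (e.g. via the hadoop classpath command):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// List a directory in HDFS; no MapReduce involved at all.
public class ListHdfsDir {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);       // connects to the default file system (HDFS on a cluster)
    for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {   // placeholder path
      System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
    }
    fs.close();
  }
}

You would typically run it with java -cp `hadoop classpath`:app.jar ListHdfsDir.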
My advice: first, look for articles about MapReduce:
http://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/ (short)
https://developer.yahoo.com/hadoop/tutorial/module4.html (long)
Then look for articles about the Hadoop architecture (the file system first, then YARN):
http://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
http://hadoop.apache.org/docs/r2.7.0/hadoop-yarn/hadoop-yarn-site/YARN.html
You should also have a look at SlideShare.

Running a Hadoop job using the java command

I have a simple Java program that sets up an MR job. I can successfully execute this on Hadoop infrastructure (Hadoop 2.x) using 'hadoop jar'. But I want to achieve the same thing using the java command, as below.
java className
How can I pass the Hadoop configuration to this className?
What extra arguments do I need to supply?
Any link/documentation would be highly appreciated.
Just as you run your 'hadoop jar' command with its other parameters, you can run the same thing using java.
First check that this command evaluates to the Hadoop classpath:
$ hadoop classpath
Then whatever your custom jar is should be added to that classpath, followed by your main class and its arguments:
$ java -cp `hadoop classpath`:/my/tools/jar/tools.jar className
I am able to get mine working this way on my Hadoop cluster.
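To make that concrete, here is a minimal sketch of the kind of driver class this works with (the name MyJobDriver and the mapper/reducer placeholders are hypothetical, not from the question). Building it on Configuration and ToolRunner means the *-site.xml files found via hadoop classpath are picked up automatically, and generic options such as -D key=value are parsed for you:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical driver. Launch with:
//   java -cp `hadoop classpath`:myjob.jar MyJobDriver /input /output
// new Configuration() reads the *-site.xml files found on that classpath, and ToolRunner
// additionally parses generic options such as -D mapreduce.job.reduces=2.
public class MyJobDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    Job job = Job.getInstance(getConf(), "my-job");
    job.setJarByClass(MyJobDriver.class);
    // job.setMapperClass(MyMapper.class);    // plug in your own classes here;
    // job.setReducerClass(MyReducer.class);  // the defaults are identity mapper/reducer
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new MyJobDriver(), args));
  }
}

This is essentially what 'hadoop jar' does for you: it puts the Hadoop jars and configuration directory on the classpath and then calls your main class.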
I don't think you can find documentation on this. The hadoop command is a script, and a lot of classes are used there, e.g. FsShell (the class for accessing the file system), RunJar (the class used when we run a jar), etc. Adding the Hadoop-related libraries and configuration files to the classpath is handled in the hadoop command itself.
You had better take a look at the hadoop script.
How can you do that? Executing any jar file here means it has to execute in a distributed environment where all the daemons work together to complete the execution.
We are not running locally or on the local file system, so it needs to be executed according to the norms of HDFS; I don't think we can execute it the way we do on a local file system.
Hadoop is a framework which simplifies distributed computing. Even before Hadoop, programmers knew about parallel processing and multi-threading concepts. But when you deal with multiple machines you need to know how to handle:
Communication between machines
Network processing
Failure of a machine (fault tolerance)
and many more. That is a huge amount of work, and that's where Hadoop simplifies your job: it takes care of all the operating-level concerns so you can focus on just your business logic.
So in your case, based on what you are asking, there is no direct answer, because just passing parameters will not make your program work; you would need to write a lot of libraries to deal with distributed computing. If you want to explore them, I would suggest you go ahead and read the Hadoop source code.
http://hadoop.apache.org/version_control.html

Does Oozie provide any performance optimizations in terms of I/O?

Since Oozie is a workflow engine for the Hadoop platform, does it improve the performance of executing a DAG of dependent MapReduce jobs?
I mean, since the output of one MapReduce job is given as input to the next MapReduce job in the DAG, does Oozie provide any mechanism for storing the intermediate results in memory and thus saving I/O?
Or is it just a workflow manager that coordinates a series of dependent MapReduce jobs?
I also want to know how Oozie works internally.
It is just a workflow manager. It doesn't change how, say, MapReduce works even though it runs M/R jobs.
What you are describing is much more like what Apache Spark does. I'm not aware that Oozie integrates directly with Spark yet, but it can't possibly be difficult or far off.
It is "just a workflow manager that coordinates a series of" MapReduce jobs. It uses the same mechanisms to execute jobs as you would from the command line.

Running a non-MapReduce program in Hadoop

I have a question. I have a program written in NetBeans. The program reads data from Cassandra and writes the results back into it. My program is not MapReduce at all. I execute the program and make a .jar file from it. Now, I want to know if I can execute it in Hadoop.
Actually, I want to know: can I run a non-MapReduce program in Hadoop?
You could architect this program to run on Hadoop v2 as a YARN application. This would require re-architecting your application to fit the YARN paradigm. An example of how to do this is given here: Writing App Framework on Yarn
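To give a feel for what "fitting the YARN paradigm" means, below is a minimal sketch of just the client side. The application name, memory figures, and the MyAppMaster launch command are placeholders; a real application also has to ship its jars as LocalResources and implement an ApplicationMaster that negotiates worker containers:

import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

// Heavily simplified YARN client sketch: asks the ResourceManager for an application id
// and submits a container launch context for a (placeholder) ApplicationMaster.
public class SimpleYarnClient {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    // Ask the ResourceManager for a new application id
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
    appContext.setApplicationName("cassandra-job");   // placeholder name

    // Describe how to launch the ApplicationMaster container (placeholder command)
    ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(Collections.singletonList(
        "$JAVA_HOME/bin/java MyAppMaster 1>stdout 2>stderr"));
    appContext.setAMContainerSpec(amContainer);

    // Resources requested for the ApplicationMaster container (placeholder figures)
    Resource capability = Records.newRecord(Resource.class);
    capability.setMemory(512);
    capability.setVirtualCores(1);
    appContext.setResource(capability);

    yarnClient.submitApplication(appContext);
  }
}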
This is not a simple exercise. Also, if you are interested in using Hadoop, I would consider simply rewriting your application to use HBase (another NoSQL columnar database and a competitor to Cassandra), which is written specifically for Hadoop: it stores its data in HDFS and integrates directly with MapReduce.
This question is ages old but has never been answered. Anyhow, two projects are looking into this issue:
Apache Slider (incubating): http://slider.incubator.apache.org/
and
Apache Myriad (incubating): http://myriad.incubator.apache.org/
Slider is mainly sponsored by Hortonworks, while Myriad is a MapR / Mesosphere project with substantial assistance from PayPal.

Workflow tool comparison: Oozie vs Cascading

I am looking for a workflow tool to run complex MapReduce jobs. I have Oozie in mind but also want to explore Cascading. Is there any sample code or example that chains existing M/R jobs using the Cascading API? Also, can you provide a comparison of Oozie vs Cascading?
Cascading and Oozie are not in the same category.
Oozie is a workflow scheduler.
Cascading is an API for creating workflows. It is agnostic about schedulers, i.e., it should run with whatever scheduling system you use.
There is perhaps some confusion because the Oozie docs mention a "DAG", and both run atop Hadoop.
Also, Cascading has a notion of "data availability" in the checkpoint support, which is supported in Oozie, albeit differently.
Personally, I have played around with both to some extent. What I found interesting with Cascading (a minimal flow sketch follows this list):
1) It is concise and expressive, built around simple keywords like flow, tap, pipe, etc.
2) An amazing TDD-based approach for local development and research.
3) A nice planner view (.dot file), which is useful once the project has grown, so maintenance is easier.
4) DSL-based approaches using Groovy, Scala, or Clojure, so there is no need to worry about learning a new language, or Hadoop itself for that matter.
5) Simple cloud deployment (e.g. Amazon support as raw jar deployment).
6) You can call anything, like existing Pig or Hive or other pure MR jars, as long as they expose a Java API.
7) Amazing for ML- and NLP-related work.
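Since the question also asks for sample code, here is a minimal, hedged Cascading sketch (Cascading 2.x is assumed; the class name and paths are placeholders) that defines a trivial copy flow from one HDFS location to another. Chaining real work means wiring more pipes between the source and sink taps before handing the FlowDef to the connector:

import java.util.Properties;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

// Minimal Cascading 2.x flow: read text lines from one HDFS path and write them to another.
// Real workflows insert Each/Every/GroupBy pipes between the taps to do actual processing.
public class CopyFlow {
  public static void main(String[] args) {
    String inputPath = args[0];    // HDFS path to read (placeholder)
    String outputPath = args[1];   // HDFS path to write (placeholder)

    Properties properties = new Properties();
    AppProps.setApplicationJarClass(properties, CopyFlow.class);

    Tap inTap = new Hfs(new TextLine(), inputPath);
    Tap outTap = new Hfs(new TextLine(), outputPath);

    Pipe copyPipe = new Pipe("copy");   // identity pipe: no transformation

    FlowDef flowDef = FlowDef.flowDef()
        .setName("copy-flow")
        .addSource(copyPipe, inTap)
        .addTailSink(copyPipe, outTap);

    // Plan the flow into MapReduce jobs and run it on the cluster
    new HadoopFlowConnector(properties).connect(flowDef).complete();
  }
}

For stitching several complete flows together, Cascading also has a CascadeConnector, if I recall the API correctly, which is roughly the point where its role starts to overlap with what people use Oozie for.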
