How can I run a computation on a remote Hadoop cluster and monitor its progress? - hadoop

I have a Java program, and I want to submit a task (a jar) from it to a remote
Hadoop cluster. Of course, I also need to pass special parameters to the jar.
When the computation finishes, the Java program must be notified.
Can I do this through the Hadoop API?
Where can I find articles or other material on this?

Hadoop has an API for this. If you write Java code for a Hadoop job, you can define the job's characteristics with calls such as:
job.setMapperClass(),
job.setReducerClass(),
job.setPartitionerClass(),
FileInputFormat.addInputPath(job, path),
etc.
Then you run your job, and you can wait for it to finish by calling
job.waitForCompletion(true)
This call blocks until the job is done and returns true if the job succeeded.
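If linking against the Hadoop client libraries is not convenient, a cruder alternative to the Job API is to launch the hadoop jar command as a child process from your Java program and wait for its exit code. The following is only a sketch under several assumptions: the hadoop script is on the PATH, the client configuration on that machine already points at the remote cluster, and the driver class exits with a non-zero code on failure. The class HadoopJarLauncher, the jar path, the driver class name, and the arguments are all hypothetical.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: launches `hadoop jar` as a child process and waits for it.
final class HadoopJarLauncher {

    // Builds the argument vector: hadoop jar <jarPath> <mainClass> <jobArgs...>
    static List<String> buildCommand(String jarPath, String mainClass, List<String> jobArgs) {
        List<String> command = new ArrayList<>();
        command.add("hadoop"); // assumption: the hadoop script is on the PATH
        command.add("jar");
        command.add(jarPath);
        command.add(mainClass);
        command.addAll(jobArgs);
        return command;
    }

    // Starts the command, forwards its output to this console, and returns its exit code.
    static int runAndWait(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO()
                .start();
        return process.waitFor(); // blocks until the child process exits
    }

    public static void main(String[] args) throws Exception {
        // Placeholder jar, driver class, and parameters.
        List<String> command = buildCommand(
                "/path/to/my-job.jar",
                "com.example.MyJobDriver",
                Arrays.asList("/input", "/output"));
        int exitCode = runAndWait(command);
        System.out.println(exitCode == 0
                ? "Job finished successfully"
                : "Job failed with exit code " + exitCode);
    }
}
```

The exit code is only meaningful if the driver propagates the job result, for example by ending with System.exit(job.waitForCompletion(true) ? 0 : 1).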

Related

Running a Hadoop job using the java command

I have a simple Java program that sets up a MapReduce job. I can run it successfully on a Hadoop 2.x installation using 'hadoop jar '. But I want to achieve the same thing using the java command, as below.
java className
How can I pass hadoop configuration to this className?
What extra arguments do I need to supply?
Any link/documentation would be highly appreciated.
You can run it with java the same way you run your 'hadoop jar' command, passing the same parameters.
First check that this command prints the Hadoop classpath:
$ hadoop classpath
Then add your custom jar to that classpath:
$ java -cp `hadoop classpath`:/my/tools/jar/tools.jar className
This works for me on my Hadoop cluster.
I don't think you will find documentation on this. The hadoop command is a script that uses many classes, e.g. FsShell for accessing the filesystem and RunJar for running a jar. Adding the Hadoop libraries and configuration files to the classpath is handled by the hadoop command itself.
You are better off taking a look at the hadoop script.
How would you do that? Executing a jar file means running it in a distributed environment where all the daemons work together to complete the execution.
We are not running locally or on the local file system. The job has to be executed according to the rules of HDFS, so I don't think we can run it the way we would on a local file system.
Hadoop is a framework that simplifies distributed computing. Programmers knew about parallel processing and multithreading before Hadoop too. But when you deal with multiple machines, you need to handle:
Communication between machines
Network processing
Fault tolerance: what if one machine fails?
and much more, which is a huge amount of work. That is where Hadoop simplifies your job: it takes care of all the operating-level details so you can focus on your business logic.
So in your case, based on what you are asking, there is no direct answer. Simply passing parameters to your program will not work; you would need to write a lot of libraries to deal with distributed computing. If you want to explore this, I suggest reading the Hadoop source code.
http://hadoop.apache.org/version_control.html

How to get multiple outputs in Hadoop

I'm new to Hadoop and have to process an input file. I want to process each line, and the output should be one file per line.
I searched the internet and found MultipleOutputFormat and generateFileNameForKeyValue.
But most people write it with the JobConf class. As I'm using Hadoop 0.20.1, I think the Job class takes its place, and I don't know how to use the Job class to generate multiple output files by key.
Could anyone help me?
The Eclipse plugin is mainly used to submit and monitor jobs, as well as to interact with HDFS, against a real or 'pseudo' cluster.
If you're running in local mode, then I don't think the plugin gains you anything, seeing as your job will run in a single JVM. With this in mind, I would say include the most recent 1.x hadoop-core in your Eclipse project's classpath.
Either way, MultipleOutputFormat has not been ported to the new mapreduce package (in neither 1.1.2 nor 2.0.4-alpha), so you'll either need to port it yourself or find another way (maybe MultipleOutputs; its Javadoc page has some usage examples).

How to schedule jobs in hadoop

I am new to Hadoop. I have written a few jobs and exported them as jar files. I am able to run them using the hadoop jar command, and I want to run these jobs every hour. How do I do this? Thanks in advance.
Hadoop itself doesn't have a way to schedule jobs as you describe. So you have two main choices: Java's timing and scheduling functions, or running the jobs from the operating system. I would personally use cron for this: it's simple and very flexible, and it is installed by default on most servers. There are also lots of tutorials.
Cron example that runs at minute 0 of every hour:
0 * * * * /bin/hadoop jar myJar.jar
If you want to keep it inside Java itself, I suggest checking out this question, which has details and code: How to schedule task for start of every hour.
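For the Java-only route, a minimal sketch using the JDK's ScheduledExecutorService might look like the following. It computes the delay until the next full hour and then repeats every hour. The class name HourlyJobScheduler is hypothetical, the hadoop script is assumed to be on the PATH, and myJar.jar is taken from the cron example. LocalDateTime ignores daylight-saving shifts, so treat this as an illustration rather than a drop-in scheduler.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs `hadoop jar myJar.jar` at the start of every hour.
final class HourlyJobScheduler {

    // Time from `now` until the next full hour (a whole hour if `now` is exactly on the hour).
    static Duration delayUntilNextHour(LocalDateTime now) {
        LocalDateTime nextHour = now.truncatedTo(ChronoUnit.HOURS).plusHours(1);
        return Duration.between(now, nextHour);
    }

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        Runnable job = () -> {
            try {
                // Assumption: the hadoop script is on the PATH of this process.
                Process process = new ProcessBuilder("hadoop", "jar", "myJar.jar")
                        .inheritIO()
                        .start();
                System.out.println("hadoop jar exited with code " + process.waitFor());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
            } catch (Exception e) {
                // Swallow and log: an exception escaping the task would cancel all later runs.
                e.printStackTrace();
            }
        };

        long initialDelayMs = delayUntilNextHour(LocalDateTime.now()).toMillis();
        scheduler.scheduleAtFixedRate(
                job, initialDelayMs, TimeUnit.HOURS.toMillis(1), TimeUnit.MILLISECONDS);
    }
}
```

Unlike cron, this only works while the JVM stays up, so the process itself needs to be supervised.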
You could probably achieve that by writing a cron job or a script. But the better way, in my view, is to use a scheduler such as Oozie.
In addition to the already mentioned Oozie, you might want to take a look at Falcon.
From my own experience, however, a much easier approach is to use your CI system, for example Jenkins, which avoids adding new systems to your stack.
Another option besides cron and Oozie: Quartz Scheduler.

Hadoop Job Automation

I have a 3-node Hadoop cluster that is used to analyse data every day at 9 PM. I want to automate running the job from the Hadoop command line. How can I do that?
Assuming you are using Linux, I recommend the cron scheduler.
Look at this tutorial for instructions.
Check out Oozie. It's a workflow manager for Hadoop, and I believe it can schedule jobs.

Hadoop Job Scheduling query

I am a beginner to Hadoop.
As per my understanding, the Hadoop framework runs jobs in FIFO order (the default scheduling).
Is there any way to tell the framework to run a job at a particular time?
That is, is there any way to configure a job to run daily at 3 PM?
Any input on this is greatly appreciated.
Thanks, R
What about calling the job from an external Java scheduling framework, like Quartz? Then you can run the job whenever you want.
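To illustrate the idea of driving the job from Java, here is a rough sketch that uses the JDK's own ScheduledExecutorService instead of Quartz (a deliberate substitution, to keep the example free of external dependencies). It computes how long to wait until the next occurrence of a given time of day, 3 PM here, and then repeats every 24 hours. The class name DailyJobScheduler is hypothetical, and the task body is a placeholder for whatever actually launches your Hadoop job.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: triggers a task once a day at a fixed local time.
final class DailyJobScheduler {

    // Time from `now` until the next occurrence of `target` (tomorrow if already reached today).
    static Duration delayUntilNext(LocalTime target, LocalDateTime now) {
        LocalDateTime next = now.toLocalDate().atTime(target);
        if (!next.isAfter(now)) {
            next = next.plusDays(1);
        }
        return Duration.between(now, next);
    }

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        // Placeholder task: replace with the code that submits your Hadoop job.
        Runnable job = () -> System.out.println("Submitting Hadoop job at " + LocalDateTime.now());

        long initialDelayMs = delayUntilNext(LocalTime.of(15, 0), LocalDateTime.now()).toMillis();
        scheduler.scheduleAtFixedRate(
                job, initialDelayMs, TimeUnit.DAYS.toMillis(1), TimeUnit.MILLISECONDS);
    }
}
```

A fixed 24-hour period drifts by an hour across daylight-saving changes; a cron-style scheduler such as Quartz handles that case for you.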
You might consider using Oozie (http://yahoo.github.com/oozie/). Among other things, it allows:
Frequency execution: Oozie workflow specification supports both data
and time triggers. Users can specify execution frequency and can wait
for data arrival to trigger an action in the workflow.
It is independent of any other Hadoop schedulers and should work with any of them, so probably nothing in your Hadoop configuration will change.
How about having a script that executes your Hadoop job and then using the at command to run it at a specified time? If you want the job to run regularly, you could set up a cron job to execute your script.
I'd use a commercial scheduling app and/or a custom workflow solution if cron does not cut it. We use a solution called JAMS, but keep in mind it's .NET-oriented.
