Difference between job, application, task, task attempt logs in Hadoop, Oozie - hadoop

I'm running an Oozie job with multiple actions, and there's a part I could not get to work. While troubleshooting, I'm overwhelmed by the number of logs.
In the YARN UI (yarn.resourcemanager.webapp.address in yarn-site.xml, normally on port 8088), there are the application_<app_id> logs.
In the Job History Server (yarn.log.server.url in yarn-site.xml, ours on port 19888), there are the job_<job_id> logs. (These job logs should also show up in Hue's Job Browser, right?)
In Hue's Oozie workflow editor, there are the task and task_attempt logs (I'm not sure whether they're the same; everything is a mixed-up soup to me already), which redirect to the Job Browser if you click around.
Can someone explain the difference between these things from a Hadoop/Oozie architectural standpoint?
P.S.
I've seen in logs container_<container_id> as well. Might as well include this in your explanation in relation to the things above.

In YARN terms, the programs that run on a cluster are called applications. In MapReduce terms, they are called jobs. So, if you are running MapReduce on YARN, a job and an application are the same thing (if you look closely, a job ID and its application ID share the same numeric suffix and differ only in the job_ / application_ prefix).
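To make the ID correspondence concrete, here is a minimal sketch that converts between the two forms. It assumes the usual `job_<cluster timestamp>_<sequence>` and `application_<cluster timestamp>_<sequence>` layout; the function names and the sample ID are made up for illustration.

```python
import re

# Both IDs share "<cluster timestamp>_<sequence number>"; only the prefix differs.
_ID_PATTERN = re.compile(r"^(job|application)_(\d+)_(\d+)$")


def _swap_prefix(identifier: str, expected: str, replacement: str) -> str:
    """Check the prefix of an ID and return the same suffix under another prefix."""
    match = _ID_PATTERN.match(identifier)
    if match is None or match.group(1) != expected:
        raise ValueError(f"not a {expected} ID: {identifier!r}")
    return f"{replacement}_{match.group(2)}_{match.group(3)}"


def to_application_id(job_id: str) -> str:
    """Map a MapReduce job ID to the YARN application ID with the same suffix."""
    return _swap_prefix(job_id, "job", "application")


def to_job_id(application_id: str) -> str:
    """Map a YARN application ID to the MapReduce job ID with the same suffix."""
    return _swap_prefix(application_id, "application", "job")


# Hypothetical ID, for illustration only.
print(to_application_id("job_1400000000000_0001"))
```

This is handy when jumping from a job_ entry in the Job History Server to the matching application_ entry in the YARN UI.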
A MapReduce job consists of several tasks (each is either a map task or a reduce task). Each execution of a task is a task attempt: if an attempt fails, the task is launched again, possibly on another node, as a new attempt.
Container is a YARN term: it is the unit of resource allocation. For example, a MapReduce task attempt runs in a single container.
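The job / task / task attempt hierarchy is visible in the IDs themselves. The sketch below splits a task attempt ID into its parents, assuming the layout commonly seen in MapReduce logs (`attempt_<timestamp>_<job seq>_<m|r>_<task seq>_<attempt no>`). The `AttemptInfo` type, the function name, and the sample ID are hypothetical.

```python
from typing import NamedTuple


class AttemptInfo(NamedTuple):
    job_id: str
    task_id: str
    task_type: str       # "m" for a map task, "r" for a reduce task
    attempt_number: int  # 0 for the first attempt, 1 for the first retry, ...


def parse_attempt_id(attempt_id: str) -> AttemptInfo:
    """Derive the owning task and job from a task attempt ID."""
    parts = attempt_id.split("_")
    # Assumed shape: attempt_<timestamp>_<job seq>_<m|r>_<task seq>_<attempt no>
    if len(parts) != 6 or parts[0] != "attempt" or parts[3] not in ("m", "r"):
        raise ValueError(f"unexpected task attempt ID: {attempt_id!r}")
    _, timestamp, job_seq, task_type, task_seq, attempt_no = parts
    return AttemptInfo(
        job_id=f"job_{timestamp}_{job_seq}",
        task_id=f"task_{timestamp}_{job_seq}_{task_type}_{task_seq}",
        task_type=task_type,
        attempt_number=int(attempt_no),
    )


# Hypothetical ID: second attempt of map task 3 of job 1.
info = parse_attempt_id("attempt_1400000000000_0001_m_000003_1")
print(info.job_id, info.task_id, info.attempt_number)
```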

Related

Does oozie use Yarn containers

We are currently running a large number of Oozie jobs in our cluster.
Many of those jobs use templates and have sub-workflows.
These jobs don't always contain large, heavy steps; mostly they contain a small shell script.
The Hue job browser shows lots and lots of Oozie steps.
We now sometimes feel that our cluster is getting overloaded by these jobs. This made me wonder: does every one of those Oozie jobs get a YARN container assigned to it?
If so, this would mean that for a 2-minute job we are effectively using 2-10 times more resources than required.
Just see for yourself...
in the Hue Dashboard, click on any Workflow that has been executed, select the "Actions" tab, look at the "External ID" column => every job_000000_0000 refers to a YARN job
...and when "External ID" points to a Sub-Workflow, then if you click, you will get its own YARN jobs
alternatively, you can use the command line with oozie job -info <wkf/sub-wkf exec id>
You can get more details in that post for instance.
A frequent issue with Shell or Java actions is that the "launcher" YARN job uses the default job settings defined by your Hadoop admin -- e.g. 1 GB of RAM for the AppMaster and 1.5 GB for the "launcher".
But typically your shell script just requires a few MB of RAM (on top of what is used by Oozie to bootstrap the Action in a raw YARN container), and its AppMaster just requires the bare minimum to control the execution -- say, 512 MB each.
So you can reduce the footprint of your Oozie actions by setting some undocumented properties -- in practice, standard Hadoop props prefixed by oozie.launcher.
See for instance this post then that post.
PS: oozie.launcher.mapreduce.map.java.opts is relevant for a Java action (or a Pig action, a Sqoop action, etc.) and should stay consistent with the global RAM quota; but it's not relevant for a Shell action [unless you set a really goofy value, in which case it might affect the Oozie bootstrap process]
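As a rough sketch of the prefixing rule described above, the snippet below takes standard Hadoop property names, prepends `oozie.launcher.`, and renders them as the `<property>` entries you would place in an action's `<configuration>` block. The helper names are hypothetical, and the 512 MB values are illustrative only (they assume your cluster's minimum container size allows them).

```python
from xml.sax.saxutils import escape

LAUNCHER_PREFIX = "oozie.launcher."


def launcher_properties(hadoop_props: dict) -> dict:
    """Prefix standard Hadoop property names so they apply to the launcher job."""
    return {LAUNCHER_PREFIX + name: str(value) for name, value in hadoop_props.items()}


def to_configuration_xml(props: dict) -> str:
    """Render properties as a <configuration> fragment for a workflow action."""
    lines = ["<configuration>"]
    for name, value in props.items():
        lines.append("  <property>")
        lines.append(f"    <name>{escape(name)}</name>")
        lines.append(f"    <value>{escape(value)}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


# Illustrative values only: shrink the launcher's AppMaster and its single mapper.
overrides = launcher_properties({
    "yarn.app.mapreduce.am.resource.mb": 512,
    "mapreduce.map.memory.mb": 512,
})
print(to_configuration_xml(overrides))
```

For a Java, Pig, or Sqoop action you would probably also add `mapreduce.map.java.opts` to the input dictionary, keeping the heap size below the container quota as the PS above notes.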
In your case, yes: every job still gets a container if you are invoking MR through a shell. It is not true, however, that YARN gives each container unnecessary memory or resources.
YARN allocates the requested resources or slightly more, and the allocation grows only if the job requires more.

Get status when running job without hadoop

When I run a Hadoop job with the hadoop application, it prints a lot of output. Among other things, it shows the relative progress of the job ("map: 30%, reduce: 0%" and the like). But when I run a job without the application, it does not print anything, not even errors. Is there a way to get that level of logging without the application? That is, without running [hadoop_folder]/bin/hadoop jar <my_jar> <indexer> <args>....
You can get this information from the Application Master (assuming you use YARN and not MR1, where you would get it from the Job Tracker). There is usually a web UI where you can find this information. Details will depend on your Hadoop installation / distribution.
For Hadoop v1, check the Job Tracker web UI; for Hadoop v2, check the Application Master web UI.
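If you would rather poll from a script than open a web UI, one related route on YARN (not named in the answers above) is the ResourceManager REST API, which reports an application's state and overall progress. The sketch below is a hedged example: the host and port in `rm_address` are whatever your cluster uses, the helper names are hypothetical, and the reported progress is a single overall percentage rather than separate map and reduce figures.

```python
import json
from urllib.request import urlopen


def app_status_url(rm_address: str, application_id: str) -> str:
    """Build the ResourceManager REST URL for one application."""
    return f"http://{rm_address}/ws/v1/cluster/apps/{application_id}"


def summarize_app(payload: dict) -> str:
    """Turn the JSON returned for an application into a one-line summary."""
    app = payload["app"]
    return (
        f"{app['id']}: {app['state']} ({app['progress']:.0f}%), "
        f"final status {app['finalStatus']}"
    )


def fetch_app_status(rm_address: str, application_id: str) -> str:
    """Query the ResourceManager (needs network access to the cluster)."""
    with urlopen(app_status_url(rm_address, application_id)) as response:
        return summarize_app(json.load(response))


# Made-up payload showing only the fields used above.
sample = {
    "app": {
        "id": "application_1400000000000_0001",
        "state": "RUNNING",
        "finalStatus": "UNDEFINED",
        "progress": 30.0,
    }
}
print(summarize_app(sample))
```

Against a real cluster you would call something like `fetch_app_status("rm-host:8088", "application_...")`, where `rm-host` is a placeholder for your ResourceManager host.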

How to schedule Hadoop jobs conditionally?

I am pretty new to Hadoop, and particularly to Hadoop Job Scheduling. Here is what I am trying to do.
I have 2 flows, each with a Hadoop job. I am free to put these flows either in the same project or in different ones. I don't want the Hadoop jobs to run simultaneously on the cluster, but I also want to make sure that they run alternately.
E.g. flow_1 (with hadoop_job_1) runs and finishes -> flow_2 (with hadoop_job_2) runs and finishes -> flow_1 (with hadoop_job_1) runs and finishes and so on.
And of course, I would also like to handle special conditions gracefully.
E.g. if flow_1 is done but flow_2 is not ready, then flow_1 gets a chance to run again if it is ready; if flow_1 fails, flow_2 still gets its turn; etc.
I would like to know which schedulers I can explore which are capable of doing this.
We are using MapR.
Thanks
This looks like a standard use case for Oozie. Take a look at these tutorials:
Executing an Oozie workflow with Pig, Hive & Sqoop actions and Oozie workflow scheduler for Hadoop
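Whichever scheduler you pick, it helps to pin down the turn-taking rule first. The following is a plain Python sketch of the rule as the question states it, not Oozie configuration; `next_flow` and its arguments are hypothetical names. It assumes `last_run` is updated after every run, successful or not, so that a failed flow_1 still hands the turn to flow_2.

```python
from typing import Optional, Set


def next_flow(last_run: Optional[str], ready: Set[str]) -> Optional[str]:
    """Pick the flow to run next, or None if nothing is ready.

    last_run: the flow that ran most recently (whether it succeeded or failed),
              or None if nothing has run yet.
    ready:    the flows whose inputs are currently available.
    """
    # Prefer whichever flow did not run last; flow_1 goes first at startup.
    preferred = "flow_2" if last_run == "flow_1" else "flow_1"
    fallback = "flow_1" if preferred == "flow_2" else "flow_2"
    for candidate in (preferred, fallback):
        if candidate in ready:
            return candidate
    return None


print(next_flow("flow_1", {"flow_1", "flow_2"}))  # flow_2 gets its turn
print(next_flow("flow_1", {"flow_1"}))            # flow_2 not ready, flow_1 runs again
```

Because only one flow is returned at a time, the two Hadoop jobs never run simultaneously under this rule.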

How to chain mapred and mapreduce jobs

I have two Hadoop jobs that need to be chained together. One is a mapred job (old API) and the other is a mapreduce job (new API); this is because of the external libraries we use for these two jobs.
I want to know whether there is a good way to chain these two jobs.
I have tried one way: first run the mapred job with JobClient.runJob(), and after it finishes, run the second one. But there is a problem when I submit this to the Hadoop cluster: if I close my local terminal, only the first job runs and the second doesn't. That is because the Java driver code runs locally. Is there a good solution for this, so that I can submit the whole chain to the cluster without the local program having to keep running?

How I can find out IPs of slave nodes where currently map reduce task is running or about to run for a given Job?

I want to find out the IPs of the slave nodes where a MapReduce task is currently running, or is about to run, for a given job.
Is there any method to do this?
Thanks in advance.
For any job, you can view the list of running tasks through the Job Scheduler Web UI - this will detail the nodes on which the task is running.
As for where tasks are about to run: this is not necessarily decided in advance. As slots become available on a node, the Job Scheduler (there are several, which behave differently depending on your needs) identifies a task that will run on that node (based on a number of criteria, hopefully honoring data locality where it can) and instructs the task tracker on that node to run that task.
Programmatically, look at the javadocs for the JobClient class; it should be able to acquire information about the running tasks and their node names (you'll probably need to do a DNS lookup to get the actual IPs, I imagine).
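The DNS lookup step mentioned above can be sketched as follows. This assumes you already have the node names (from the web UI or from JobClient); the function name and the `resolver` parameter are hypothetical, the latter added so the lookup can be swapped out for testing.

```python
import socket
from typing import Callable, Dict, Iterable, Optional


def resolve_nodes(
    hostnames: Iterable[str],
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> Dict[str, Optional[str]]:
    """Map each node name to an IPv4 address, or None if it cannot be resolved."""
    addresses: Dict[str, Optional[str]] = {}
    for host in hostnames:
        try:
            addresses[host] = resolver(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            addresses[host] = None
    return addresses
```

Calling `resolve_nodes(["node1.example.com"])` with the default resolver performs a real lookup against whatever DNS the calling machine uses, so run it from a host that can resolve the cluster's names.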
Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at these locations:
http://localhost:50030/ – web UI for MapReduce job tracker(s)
http://localhost:50060/ – web UI for task tracker(s)
http://localhost:50070/ – web UI for HDFS name node(s)
Thanks to @Chris:
Programmatically, look at the javadocs for the JobClient class; it should be able to acquire information about the running tasks and their node names.
