We are currently running a large number of Oozie jobs in our cluster.
Many of those jobs use templates and have sub-workflows.
These workflows don't always contain large and heavy jobs; most of them just run a small shell script.
The Hue job browser shows lots and lots of Oozie steps.
We now sometimes feel that our cluster is getting overloaded by these jobs. This made me wonder: does every one of those Oozie jobs get a YARN container assigned to it?
If so, this would mean that for a 2-minute job we are effectively using 2-10 times more resources than required.
Just see for yourself...
In the Hue Dashboard, click on any Workflow that has been executed, select the "Actions" tab, and look at the "External ID" column => every job_000000_0000 refers to a YARN job.
...and when the "External ID" points to a Sub-Workflow, clicking it will show that sub-workflow's own YARN jobs.
Alternatively, you can use the command line with oozie job -info <wkf/sub-wkf exec id>.
You can find more details in that post, for instance.
A frequent issue with Shell or Java actions is that the "launcher" YARN job uses the default job settings defined by your Hadoop admin -- e.g. 1 GB of RAM for the AppMaster and 1.5 GB for the "launcher".
But typically your shell just requires a few MB of RAM (on top of what is used by Oozie to bootstrap the Action in a raw YARN container), and its AppMaster just requires the bare minimum to control the execution-- say, 512 MB each.
So you can reduce the footprint of your Oozie actions by setting some undocumented properties -- in practice, standard Hadoop props prefixed by oozie.launcher.
See for instance this post, then that post.
PS: oozie.launcher.mapreduce.map.java.opts is relevant for a Java action (or a Pig action, a Sqoop action, etc.) and should stay consistent with the global RAM quota; but it's not relevant for a Shell action [unless you set a really goofy value, in which case it might affect the Oozie bootstrap process].
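For example, a Shell action with a reduced launcher footprint could look something like this (the action name, script name, and memory values are just placeholders):

<action name="small-shell-step">
    <shell xmlns="uri:oozie:shell-action:0.3">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <!-- shrink the AppMaster of the launcher job -->
            <property>
                <name>oozie.launcher.yarn.app.mapreduce.am.resource.mb</name>
                <value>512</value>
            </property>
            <!-- shrink the "launcher" map container that runs the script -->
            <property>
                <name>oozie.launcher.mapreduce.map.memory.mb</name>
                <value>512</value>
            </property>
        </configuration>
        <exec>my_script.sh</exec>
        <file>my_script.sh#my_script.sh</file>
    </shell>
    <ok to="next-step"/>
    <error to="kill"/>
</action>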
In your case, yes: every job will still get a container, even if you are invoking MR through a shell action. But it's not true that YARN will give each container unnecessary memory or resources.
YARN provides exactly the requested resources, or a little more, and it only grows if the job asks for more.
Related
Does anyone have any idea on what's the maximum limit of oozie workflows that can execute in parallel?
I'm running 35 workflows in parallel (or at least that's what the Oozie UI says: they all got started in parallel). All the subworkflows ingest files from local storage into HDFS and then run some validation checks on the file metadata. Simple as that.
However, I see some subworkflows fail during execution; the step in which they fail tries to put the files into the HDFS location, i.e., the process wasn't able to execute the hdfs dfs -put command. However, when I rerun these subworkflows they run successfully.
Not sure what caused them to fail on hdfs dfs -put.
Any clues/suggestions on what could be happening?
The first limitation does not depend on Oozie, but on the resources available in YARN to execute the Oozie actions, since each action is executed in one map task. But this limit will not fail your workflows: they will just wait for resources.
The major limit we've faced, leading to trouble, was the callable queue of the Oozie services. Sometimes, under heavy load created by plenty of coordinators submitting plenty of workflows, Oozie was losing more time processing its internal callable queue than running workflows :/
Check the oozie.service.CallableQueueService settings for information about this.
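These settings live in oozie-site.xml; for example (the values below are purely illustrative, not recommendations):

<property>
    <name>oozie.service.CallableQueueService.queue.size</name>
    <value>10000</value>
</property>
<property>
    <name>oozie.service.CallableQueueService.threads</name>
    <value>50</value>
</property>
<property>
    <name>oozie.service.CallableQueueService.callable.concurrency</name>
    <value>3</value>
</property>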
When I look at my logs, I see that my Oozie Java actions are actually running on multiple machines.
I assume that is because they're wrapped inside an M/R job? (Is this correct?)
Is there a way to have only a single instance of the java action executing on the entire cluster?
The Java action runs inside an Oozie "launcher" job, with just one YARN "map" container.
The trick is that every YARN job requires an application master (AM) container for coordination.
So you end up with 2 containers, _0001 for the AM and _0002 for the Oozie action, probably on different machines.
To control the resource allocation for each one, you can set the following Action properties to override your /etc/hadoop/conf/*-site.xml config and/or hard-coded defaults (which are specific to each version and each distro, by the way):
oozie.launcher.yarn.app.mapreduce.am.resource.mb
oozie.launcher.yarn.app.mapreduce.am.command-opts (to align the max heap size with the global memory max)
oozie.launcher.mapreduce.map.memory.mb
oozie.launcher.mapreduce.map.java.opts (...)
oozie.launcher.mapreduce.job.queuename (in case you've got multiples queues with different priorities)
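For example, in a Java action these overrides go into the action's <configuration> block, something like this (the class name, queue name, and memory values are just placeholders):

<action name="my-java-step">
    <java>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <!-- AppMaster container size and heap -->
            <property>
                <name>oozie.launcher.yarn.app.mapreduce.am.resource.mb</name>
                <value>512</value>
            </property>
            <property>
                <name>oozie.launcher.yarn.app.mapreduce.am.command-opts</name>
                <value>-Xmx384m</value>
            </property>
            <!-- launcher "map" container size and heap -->
            <property>
                <name>oozie.launcher.mapreduce.map.memory.mb</name>
                <value>1024</value>
            </property>
            <property>
                <name>oozie.launcher.mapreduce.map.java.opts</name>
                <value>-Xmx768m</value>
            </property>
            <!-- optional: send the launcher to a specific queue -->
            <property>
                <name>oozie.launcher.mapreduce.job.queuename</name>
                <value>low_priority</value>
            </property>
        </configuration>
        <main-class>com.example.MyMainClass</main-class>
    </java>
    <ok to="next-step"/>
    <error to="kill"/>
</action>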
Well, actually, the explanation above is not entirely true... On a HortonWorks distro you end up with 2 containers, as expected.
But with a Cloudera distro, you typically end up with just one container, running both the AM and the action in the same Linux process.
And I have no idea how they do that. Maybe there's a generic YARN config somewhere, maybe it's a Cloudera-specific feature.
When I run a Hadoop job with the hadoop application it prints a lot of stuff. Among other things, it shows the relative progress of the job ("map: 30%, reduce: 0%" and things like that). But when running a job without the application, it does not print anything, not even errors. Is there a way to get that level of logging without the application? That is, without running [hadoop_folder]/bin/hadoop jar <my_jar> <indexer> <args>....
You can get this information from the Application Master (assuming you use YARN and not MR1, where you would get it from the Job Tracker). There is usually a web UI where you can find this information. Details will depend on your Hadoop installation/distribution.
In the case of Hadoop v1, check the Job Tracker web URL; in the case of Hadoop v2, check the Application Master web UI.
I'm running an Oozie job with multiple actions and there's a part I could not make work. In the process of troubleshooting, I'm overwhelmed by the amount of logs.
In YARN UI (yarn.resourcemanager.webapp.address in yarn-site.xml, normally on port 8088), there's the application_<app_id> logs.
In Job History Server (yarn.log.server.url in yarn-site.xml, ours on port 19888), there's the job_<job_id> logs. (These job logs should also show up on Hue's Job Browser, right?)
In Hue's Oozie workflow editor, there's the task and task_attempt (not sure if they're the same, everything's a mixed-up soup to me already), which redirects to the Job Browser if you clicked here and there.
Can someone explain what's the difference between these things from Hadoop/Oozie architectural standpoint?
P.S.
I've seen in logs container_<container_id> as well. Might as well include this in your explanation in relation to the things above.
In terms of YARN, the programs that are being run on a cluster are called applications. In terms of MapReduce they are called jobs. So, if you are running MapReduce on YARN, a job and an application are the same thing (if you take a close look, job IDs and application IDs are the same).
A MapReduce job consists of several tasks (they could be either map or reduce tasks). If a task fails, it is launched again on another node. Those are task attempts.
A container is a YARN term. It is a unit of resource allocation. For example, a MapReduce task attempt runs in a single container.
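To make the correspondence concrete, here is how the IDs of one and the same run might look (the numbers below are made up for illustration):

application_1408862192818_0001 - the YARN application (what the ResourceManager shows)
job_1408862192818_0001 - the same run, seen as a MapReduce job
task_1408862192818_0001_m_000000 - one map task of that job
attempt_1408862192818_0001_m_000000_0 - the first attempt of that task
container_1408862192818_0001_01_000002 - the YARN container that attempt ran in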
I am carrying out several Hadoop tests using the TestDFSIO and TeraSort benchmark tools. I am basically testing with different numbers of datanodes in order to assess the linearity of the processing capacity and datanode scalability.
During the above-mentioned process, I have obviously had to restart the whole Hadoop environment several times. Every time I restart Hadoop, all MapReduce jobs are removed and the job counter starts again from "job_2013*_0001". For comparison purposes, it is very important for me to keep all the MapReduce jobs that I have previously launched. So, my questions are:
How can I prevent Hadoop from removing all MapReduce job history after it is restarted?
Is there some property to control job removal after the Hadoop environment restarts?
Thanks!
The MR job history logs are not deleted right away after you restart Hadoop; the new jobs will be counted from *_0001, and only new jobs started after the Hadoop restart will be displayed on the Resource Manager web portal, though. In fact, there are 2 log-related settings in the YARN defaults:
# this is where you can find the MR job history logs
yarn.nodemanager.log-dirs = ${yarn.log.dir}/userlogs
# this is how long the history logs will be retained
yarn.nodemanager.log.retain-seconds = 10800
and the default ${yarn.log.dir} is defined in $HADOOP_HOME/etc/hadoop/yarn-env.sh.
YARN_LOG_DIR="$HADOOP_YARN_HOME/logs"
BTW, similar settings can also be found in mapred-env.sh if you are using Hadoop 1.X.
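If the goal is simply to keep those logs around longer, you can raise the retention in yarn-site.xml, e.g. (the value below is illustrative, and note that this setting only applies when log aggregation is disabled):

<property>
  <name>yarn.nodemanager.log.retain-seconds</name>
  <!-- keep logs for one week instead of the default 3 hours -->
  <value>604800</value>
</property>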