Max limit of oozie workflows - hadoop

Does anyone know the maximum number of Oozie workflows that can execute in parallel?
I'm running 35 workflows in parallel (or at least the Oozie UI shows that they all started in parallel). All the subworkflows ingest files from local disk into HDFS and then run some validation checks on the files' metadata. Simple as that.
However, some subworkflows fail during execution. The failing step is the one that puts the files into the HDFS location, i.e., the process wasn't able to execute the hdfs dfs -put command. When I rerun these subworkflows, they run successfully.
Not sure what caused them to fail on hdfs dfs -put.
Any clues/suggestions on what could be happening?

The first limitation does not depend on Oozie but on the resources available in YARN to execute Oozie actions, since each action is executed in one map task. But this limit will not fail your workflows: they will just wait for resources.
The major limit we've faced, the one leading to trouble, was on the callable queue of the Oozie services. Sometimes, under heavy load created by plenty of coordinators submitting plenty of workflows, Oozie was spending more time processing its internal callable queue than running workflows.
Check the oozie.service.CallableQueueService settings for information about this.

Related

Spark Launcher Jobs not starting because of token cant be found in cache after 24 hours

I have a Java application which runs continuously and checks a database table for new records. When a new record is added to the table, the Java application unzips a file and puts it into an HDFS location, and then a Spark job gets triggered (I am programmatically triggering the Spark job using the "SparkLauncher" class inside the Java application), which processes the newly added file in the HDFS location.
I have scheduled the Java application in the cluster using an Oozie Java action.
The cluster is a kerberized HDP cluster.
The job works perfectly fine for 24 hours: all the unzips happen and the Spark jobs run.
But after 24 hours, the unzip still happens in the Java application but the Spark job no longer gets triggered in the Resource Manager.
Exception : Exception encountered while connecting to the server :INFO: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (owner=****, renewer=oozie mr token, realUser=oozie, issueDate=1498798762481, maxDate=1499403562481, sequenceNumber=36550, masterKeyId=619) can't be found in cache
As per my understanding, after 24 hours oozie is renewing the token, and the new token is not being passed on to the Spark launcher job. The Spark launcher is still looking for the older token, which is no longer available in the cache.
Please help me: how can I make the Spark launcher look for the new token?
"As per my understanding, after 24 hours oozie is renewing the token"
Why? Can you point to any documentation, source code, blog?
Remember that Oozie is a scheduler for batch jobs, and its canonical use case (at Yahoo!) is for triggering hourly jobs.
Only a pathological batch job would run for more than 24h, therefore renewal of the Hadoop delegation token is not really useful in Oozie.
But your Java thing acts as a service, running continuously, and needing automatic restart if it ever crashes. So you should consider...
either Slider, if you really want to run it inside YARN (although there are many, many drawbacks -- how do you inspect the logs of a running YARN job? how can you make sure that the app starts on time and is not delayed by a lack of resources? how can you make sure that your app will not be killed because YARN needs resources for a high-priority job?) but it is probably overkill for simply running your toy app
or a plain Linux service running on some Edge Node -- it's a Do-It-Yourself task, but not extremely complicated, and there are tutorials on the web
If you insist on using Oozie, in spite of all the limitations of both YARN and Oozie, then you have to change the way your app runs -- for instance, schedule the Coordinator to launch a job every 12h and pass the "nominal time" as a Workflow property, edit the Workflow to pass that time to the Java app, and edit the Java code so that the app exits at (arg + 11:58) and clears the way for the next exec.
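The "exit at (arg + 11:58)" part could look roughly like the sketch below. It assumes the workflow passes the nominal time to the app as an ISO-8601 UTC string with seconds (the format Oozie actually emits depends on how the coordinator formats it, so the parsing may need adjusting). The class name BoundedRun and the commented-out polling call are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical wrapper around the existing polling loop; names are illustrative only.
final class BoundedRun {
    // Run window taken from the suggestion above: exit 11h58 after the nominal time.
    static final Duration RUN_WINDOW = Duration.ofHours(11).plusMinutes(58);

    // Assumption: nominalTime looks like "2017-06-30T00:00:00Z".
    static Instant deadlineFor(String nominalTime) {
        return Instant.parse(nominalTime).plus(RUN_WINDOW);
    }

    // True once the deadline has been reached or passed.
    static boolean shouldExit(Instant now, Instant deadline) {
        return !now.isBefore(deadline);
    }

    public static void main(String[] args) throws InterruptedException {
        Instant deadline = deadlineFor(args[0]);
        while (!shouldExit(Instant.now(), deadline)) {
            // pollTableAndLaunchSpark();  // placeholder for the app's existing logic
            Thread.sleep(1_000L);
        }
        // Returning here ends the Java action, so the next coordinator run starts with a fresh token.
    }
}
```

Checking the deadline between polling iterations keeps the change small: the existing loop body stays as it is, and only the loop condition changes.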

Does oozie use Yarn containers

We are currently running a large number of Oozie jobs in our cluster.
Many of those jobs use templates and have sub-workflows.
These jobs don't always contain large and heavy steps; they mostly contain a small shell script.
The Hue job browser shows lots and lots of Oozie steps.
We now sometimes feel that our cluster is getting overloaded by these jobs. This made me wonder: does every one of those Oozie jobs get a YARN container allocated to it?
If so, this would mean that for a 2 min job we are effectively using 2-10 times more resources than required.
Just see for yourself...
in the Hue Dashboard, click on any Workflow that has been executed, select the "Actions" tab, look at the "External ID" column => every job_000000_0000 refers to a YARN job
...and when "External ID" points to a Sub-Workflow, then if you click, you will get its own YARN jobs
alternatively, you can use the command line with oozie job -info <wkf/sub-wkf exec id>
You can get more details in that post for instance.
A frequent issue with Shell or Java actions is that the "launcher" YARN job uses the default job settings defined by your Hadoop admin -- e.g. 1 GB of RAM for the AppMaster and 1.5 GB for the "launcher".
But typically your shell just requires a few MB of RAM (on top of what is used by Oozie to bootstrap the Action in a raw YARN container), and its AppMaster just requires the bare minimum to control the execution -- say, 512 MB each.
So you can reduce the footprint of your Oozie actions by setting some undocumented properties -- in practice, standard Hadoop props prefixed by oozie.launcher.
See for instance this post then that post.
PS: oozie.launcher.mapreduce.map.java.opts is relevant for a Java action (or a Pig action, a Sqoop action, etc.) and should stay consistent with the global RAM quota; but it's not relevant for a Shell action [unless you set a really goofy value, in which case it might affect the Oozie bootstrap process]
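To make the "standard Hadoop props prefixed by oozie.launcher." idea concrete, here is a small sketch that derives the launcher properties from their plain Hadoop names. The 512 MB values are the ones suggested above; the choice of mapreduce.map.memory.mb and yarn.app.mapreduce.am.resource.mb as the two properties to shrink is my assumption about which settings the linked posts refer to, and the class name is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative helper, not part of Oozie: builds the key/value pairs
// one would put in the action's <configuration> block.
final class LauncherProps {
    static final String PREFIX = "oozie.launcher.";

    // Prefix every standard Hadoop property so that it applies to the launcher job.
    static Map<String, String> forLauncher(Map<String, String> hadoopProps) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : hadoopProps.entrySet()) {
            out.put(PREFIX + e.getKey(), e.getValue());
        }
        return out;
    }

    // Assumed settings for a small Shell action: 512 MB each, as suggested above.
    static Map<String, String> smallShellFootprint() {
        Map<String, String> base = new LinkedHashMap<>();
        base.put("yarn.app.mapreduce.am.resource.mb", "512"); // AppMaster container
        base.put("mapreduce.map.memory.mb", "512");           // launcher map task container
        return forLauncher(base);
    }
}
```

Keep in mind that YARN may round these requests up to the scheduler's minimum allocation, so the actual saving depends on the cluster configuration.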
In your case, yes: every job still gets a container, even if you are invoking MR through a shell. It's not true, though, that YARN gives each container unnecessary memory or resources.
YARN provides the requested resources, or a little more, and the allocation increases if the job requires more.

Difference between job, application, task, task attempt logs in Hadoop, Oozie

I'm running an Oozie job with multiple actions and there's a part I could not make work. In the process of troubleshooting, I'm overwhelmed with lots of logs.
In the YARN UI (yarn.resourcemanager.webapp.address in yarn-site.xml, normally on port 8088), there are the application_<app_id> logs.
In the Job History Server (yarn.log.server.url in yarn-site.xml, ours on port 19888), there are the job_<job_id> logs. (These job logs should also show up in Hue's Job Browser, right?)
In Hue's Oozie workflow editor, there are the task and task_attempt logs (not sure if they're the same, everything's a mixed-up soup to me already), which redirect to the Job Browser if you click here and there.
Can someone explain what's the difference between these things from Hadoop/Oozie architectural standpoint?
P.S.
I've seen in logs container_<container_id> as well. Might as well include this in your explanation in relation to the things above.
In YARN terms, the programs that run on a cluster are called applications. In MapReduce terms, they are called jobs. So, if you are running MapReduce on YARN, a job and an application are the same thing (if you take a close look, a job ID and its application ID share the same numeric part; only the job_ / application_ prefix differs).
A MapReduce job consists of several tasks (either map or reduce tasks). If a task fails, it is launched again, possibly on another node. Each of those launches is a task attempt.
Container is a YARN term: it is the unit of resource allocation. For example, a MapReduce task runs in a single container.
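When jumping between the YARN UI and the Job History Server, the correspondence between these IDs can be expressed as plain string manipulation. The sketch below is only an illustration of the naming scheme (the class and the sample IDs are made up); it does not use Hadoop's own ID classes.

```java
// Illustrative helper showing how job, application and task attempt IDs relate.
final class HadoopIds {
    // job_1498798762481_0042 -> application_1498798762481_0042
    static String jobToApplicationId(String jobId) {
        if (!jobId.startsWith("job_")) {
            throw new IllegalArgumentException("not a job id: " + jobId);
        }
        return "application_" + jobId.substring("job_".length());
    }

    // application_1498798762481_0042 -> job_1498798762481_0042
    static String applicationToJobId(String appId) {
        if (!appId.startsWith("application_")) {
            throw new IllegalArgumentException("not an application id: " + appId);
        }
        return "job_" + appId.substring("application_".length());
    }

    // attempt_1498798762481_0042_m_000003_1 -> job_1498798762481_0042
    // (fields: cluster timestamp, job sequence, m/r task type, task number, attempt number)
    static String attemptToJobId(String attemptId) {
        String[] parts = attemptId.split("_");
        if (parts.length != 6 || !parts[0].equals("attempt")) {
            throw new IllegalArgumentException("not a task attempt id: " + attemptId);
        }
        return "job_" + parts[1] + "_" + parts[2];
    }
}
```

So a task attempt found in Hue can be traced back to its job in the Job History Server, and from there to the application in the YARN UI.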

How to trigger Oozie jobs on particular condition?

I have a folder where all my application log files get stored. When a new log file is created in the folder, Oozie should immediately trigger a Flume job which will put the log file into HDFS.
How can I trigger an Oozie job when a new log file is created in the folder?
Any help on this topic is greatly appreciated !!!
That's not how Oozie works. Oozie is a scheduler, a bit like CRON. First, you specify how often a workflow should run, and then you can add the availability of files as an additional requirement.
I think it's more a matter of how you place the files in HDFS. You could always have a parameterized Oozie job, invoked through the Oozie Java API by the client that writes to HDFS (unless it is streaming), passing in the name of the file it created on HDFS.
Every time an Oozie workflow is initiated, it runs on a separate thread, which allows you to launch multiple Oozie instances with different parameters.
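As a rough sketch of the parameterized-job idea, the client that writes the file could build the job properties like this. The property oozie.wf.application.path is the standard one naming the workflow directory on HDFS; the parameter name inputFile, the class name and the paths are hypothetical and would have to match whatever the workflow.xml expects.

```java
import java.util.Properties;

// Illustrative only: builds the properties for one run of a parameterized workflow.
final class IngestJobConf {
    static Properties forFile(String workflowAppPath, String hdfsFile) {
        Properties conf = new Properties();
        // Standard Oozie property: HDFS directory containing workflow.xml.
        conf.setProperty("oozie.wf.application.path", workflowAppPath);
        // Hypothetical workflow parameter, referenced as ${inputFile} in workflow.xml.
        conf.setProperty("inputFile", hdfsFile);
        return conf;
    }
}
```

With the oozie-client library on the classpath, these properties would then be handed to OozieClient.run(conf) to submit and start the workflow; that call is left out here so the snippet has no external dependencies. In practice, starting from OozieClient.createConfiguration() rather than an empty Properties object is the usual route, as I understand it also fills in the submitting user.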

What is significance of the Oozie MR launcher?

I created a simple Oozie workflow with Sqoop, Hive and Pig actions. For each of these actions, Oozie launches an MR launcher, which in turn launches the action (Sqoop/Hive/Pig). So, there are a total of 6 MR jobs for the 3 actions in the workflow.
Why does Oozie start an MR launcher to start the action and not directly start the action?
I posted the same in Apache Flume forums and here is the response.
It's also to keep the Oozie server from being bogged down or becoming
unstable. For example, if you have a bunch of workflows running Pig jobs,
then you'd have the Oozie server running multiple copies of the Pig client
(which is a relatively "heavy" program) directly. By moving all of the
user code and external clients to map tasks in the launcher job, the Oozie
server remains more light-weight and less prone to errors. It is also
much more scalable this way because the launcher jobs distribute the
job launching/monitoring to other machines in the cluster; otherwise, with
the Oozie server doing everything, we'd have to limit the number of
concurrent workflows based on your Oozie server's machine specs (RAM, CPU,
etc). And finally, from an architectural standpoint, the Oozie server
itself is stateless; that is, everything is stored in the database and the
Oozie server can be taken down at any point without losing anything. If we
were to launch jobs directly from the Oozie server, then we'd now have some
state (e.g. the Pig client cannot be restarted and resumed).

Resources