Why does Oozie submit a shell action to YARN? - hadoop

I am currently learning Oozie and am a little curious about the shell action. I am executing a shell action which contains a shell command like
hadoop jar <jarPath> <FQCN>
While running this action, there are two YARN jobs running:
one for the Hadoop job
one for the shell action
I don't understand why the shell action needs YARN for execution. I also tried the email action; it executes without using YARN resources.

To answer this question, the difference is between
running a shell script independently (a .sh file or from the CLI)
running a shell script as part of an Oozie workflow (a shell script in an Oozie shell action)
The first case is straightforward.
In the second case, Oozie launches the shell script via YARN (the cluster's resource negotiator): it internally runs an MR launcher job to start the shell action on the cluster where Oozie is installed, so the shell script itself runs as a YARN application. The logs of the Oozie workflow show how the shell action is launched.
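You can observe this yourself while the action runs: listing the running YARN applications shows both the Oozie launcher and the job submitted by hadoop jar (the application names below are illustrative, not from the original thread).
# List the applications currently running on the cluster.
yarn application -list -appStates RUNNING
# Typically two entries appear while the action runs, e.g.:
#   application_..._0001  oozie:launcher:T=shell:W=my-wf:A=shell-node  MAPREDUCE
#   application_..._0002  <the job submitted by hadoop jar>            MAPREDUCE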

Related

In Oozie, how would I be able to use script output

I have to create a cron-like coordinator job and collect some logs.
/mydir/sample.sh >> /mydir/cron.log 2>&1
Can I use a simple Oozie workflow, as I would for any shell command?
I'm asking because I've seen that there are specific workflows to execute .sh scripts.
Sure, you can execute a Shell action (on any node in the YARN cluster) or use the Ssh action if you'd like to target specific hosts. Keep in mind that the "/mydir/cron.log" file will be created on the host the action is executed on, and the generated file might not be available to other Oozie actions.
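If downstream actions do need the log, one workaround (a sketch; the HDFS directory and file names are assumptions) is to have the script push its log to HDFS at the end:
#!/bin/bash
# Run the job and capture output locally on whichever node executes the action.
/mydir/sample.sh >> /tmp/cron.log 2>&1
# Copy the log to HDFS so later Oozie actions can read it no matter
# which node this action ran on (paths are illustrative).
hdfs dfs -mkdir -p /user/me/cron-logs
hdfs dfs -put -f /tmp/cron.log /user/me/cron-logs/cron-$(date +%Y%m%d%H%M).log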

Oozie fork running only 2 branches in parallel

I am running an Oozie workflow job which has a fork node. The fork node directs the workflow to 4 different sub-workflows, which in turn call shell scripts.
All 4 shell scripts are supposed to execute in parallel, but for me only 2 shell scripts execute in parallel.
Could someone help me address this issue?
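One thing worth checking (an assumption, not an answer from the original thread): as discussed above, every forked shell action needs its own YARN launcher ApplicationMaster, and the CapacityScheduler caps the share of a queue that AMs may consume (yarn.scheduler.capacity.maximum-am-resource-percent, 0.1 by default). If only two launchers fit under that cap, only two branches run at a time. You can inspect this from the command line (the config path is typical but may differ on your cluster):
# See how many applications (including Oozie launchers) are running at once.
yarn application -list -appStates RUNNING
# Check the configured AM resource cap (default 0.1, i.e. 10% of the queue).
grep -A1 'maximum-am-resource-percent' /etc/hadoop/conf/capacity-scheduler.xml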

How is running a script using the AWS EMR script-runner different from running it from bash?

I have used script-runner on AWS EMR, and while this may look like a very basic (and maybe stupid) question, I have read many documents and none of them answers why we need a script runner in EMR when all it does is execute a script on the master node.
Can the same script not be run using bash?
The script runner is needed when you want to simply execute a script but the entry point expects a jar. For example, submitting an EMR step executes a "hadoop jar blah ..." command; if "blah" is a script, this will fail. script-runner.jar becomes the jar that the step expects and then uses its argument (the path to the script) to execute the shell script.
When you run your script directly in bash, the script has to be present locally and you have to set up all the configuration yourself.
With the script-runner you have more options: for example, you can run it as part of your cluster launch command, or execute a script that is hosted remotely in S3. See the example from the EMR documentation: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hadoop-script.html
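As a concrete illustration (a sketch; the cluster ID and S3 bucket/script paths are placeholders), adding such a step with the AWS CLI looks like:
# Submit a shell script stored in S3 as an EMR step via script-runner.jar.
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
  Type=CUSTOM_JAR,Name="Run my script",ActionOnFailure=CONTINUE,Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://mybucket/scripts/myscript.sh"]
Here script-runner.jar is the jar the step mechanism expects, and the S3 path to the script is passed to it as an argument.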

Why does scheduling Spark jobs through cron fail (while the same command works when executed in a terminal)?

I am trying to schedule a Spark job using cron.
I have made a shell script and it executes well in a terminal.
However, when I execute the script using cron, it gives me an "insufficient memory to start JVM thread" error.
Every time I start the script from a terminal there is no issue; the problem only appears when the script is started by cron.
Could you kindly suggest something?
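A common cause (an assumption, since no accepted fix appears here) is that cron starts the script with a minimal environment, so variables like JAVA_HOME, SPARK_HOME, and PATH that your terminal session exports are missing. A wrapper that sets them explicitly often helps (the paths and class name below are placeholders):
#!/bin/bash
# cron provides almost no environment; export what the terminal gave you implicitly.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk     # placeholder: your JDK path
export SPARK_HOME=/opt/spark                     # placeholder: your Spark install
export PATH="$JAVA_HOME/bin:$SPARK_HOME/bin:$PATH"
spark-submit --class com.example.MyApp /path/to/app.jar >> /var/log/spark-cron.log 2>&1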

How to get the ID of the MapReduce job submitted by the `hadoop jar <example.jar> <main-class>` command?

I want to write a shell script which submits a MapReduce job via the command hadoop jar <example.jar> <main-class>. How can I get the ID of the job submitted by that command in the shell script, right after the command is invoked?
I know that the command hadoop job -list can display all jobs' IDs, but in that case I can't tell which job was submitted by my shell script.
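One approach (a sketch, assuming the standard client log line "INFO mapreduce.Job: Running job: job_..." that hadoop jar prints to stderr) is to capture the command's output and extract the first job ID from it:
# Run the job, keep a copy of the output, and pull out the submitted job ID.
JOB_ID=$(hadoop jar example.jar com.example.Main 2>&1 | tee run.log | grep -o 'job_[0-9]*_[0-9]*' | head -n 1)
echo "Submitted job: $JOB_ID"
The tee keeps the job's full output in run.log, so nothing is lost while grep isolates the ID.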
