How to know a Spark application's parent application in an Oozie Spark action - hadoop

When I use Oozie's Spark action to launch a Spark application, Oozie first launches a MapReduce application (the launcher), and that launcher then starts the Spark application. How can I tell which MapReduce launcher started a given Spark application?
So far I can see that the MapReduce application is named with some Oozie information, like oozie:launcher:T=spark:W=JavaWordCount:A=spark-test:ID=0000023-171207132348866-oozie-oozi-W, and the Spark application has an application tag like oozie-6e83d420c018bc0f63bccd19fe73b24f. But I still don't know how to associate them.

You can get the Spark application id by using the YARN client.
To show all application ids and find the oozie:launcher MapReduce application id, run the following command:
yarn application -list
Then you can get the Spark application id from the launcher's logs, using the oozie:launcher MapReduce application id, like this:
yarn logs -applicationId $APPID | grep "Submitted application" | awk '{print $NF}'
Replace $APPID with the application id of the MapReduce launcher that started the Spark application.
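Putting the two steps together, a minimal shell sketch might look like this; the launcher application id below is a made-up placeholder, and it assumes YARN log aggregation has already collected the launcher's logs:
# Hypothetical launcher application id -- replace it with the id of your oozie:launcher job.
LAUNCHER_APPID=application_1512628394426_0042
# The launcher log contains a "Submitted application <id>" line written when it hands off to Spark;
# the last field of that line is the Spark application id.
SPARK_APPID=$(yarn logs -applicationId "$LAUNCHER_APPID" | grep "Submitted application" | awk '{print $NF}')
echo "Launcher $LAUNCHER_APPID started Spark application $SPARK_APPID"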

Related

How to kill a Hive query without knowing the application id?

My HiveServer2 lists a few running jobs, so I can find the various query_ids.
But there is no YARN application information on the YARN 8088 pages.
My question is: how do I kill the running job?
If you are using YARN as the resource manager, you can find all jobs by running the following in a shell:
yarn application -list -appStates ALL
You can change ALL to RUNNING etc. depending on what application state you are interested in seeing.
An alternative command to the above to see running applications is:
mapred job -list
In order to kill a specific application/job with YARN, you can run:
yarn application -kill <application_id>
Or otherwise:
mapred job -kill <job_id>
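As a rough sketch, assuming the query shows up as a RUNNING YARN application and that grepping for the submitting user (the user name hive_user and the application id below are only examples) is enough to identify it:
# List running applications and keep the ones submitted by the user in question.
yarn application -list -appStates RUNNING | grep hive_user
# Copy the application id from the first column of the matching line and kill it.
yarn application -kill application_1498765432100_0123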

Spark setAppName doesn't appear in Hadoop running applications UI

I am running a Spark Streaming job, and when I set the app name (a more readable string) for it, the name doesn't appear in the Hadoop running-applications UI. I always see the class name as the name in the Hadoop UI:
val sparkConf = new SparkConf().setAppName("BetterName")
How do I set the job name in Spark so that it appears in this Hadoop UI?
Hadoop URL for running applications is - http://localhost:8088/cluster/apps/RUNNING
[update]
It looks like this issue only affects Spark Streaming jobs; I couldn't find a solution for how to fix it, though.
When you submit a job via spark-submit, the SparkContext that gets created cannot set the name of the app, because YARN has already registered the job's name before Spark starts. For the app name to appear in the Hadoop running-jobs UI, you have to set it on the spark-submit command line with --name BetterName. I kick off my job with a shell script that calls spark-submit, so I added the name to the command in that script.
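For example, such a spark-submit call might look like the sketch below; the master, class name and jar path are placeholders for your own job:
# --name sets the YARN application name before the SparkContext is created.
spark-submit \
  --master yarn \
  --name BetterName \
  --class com.example.StreamingJob \
  /path/to/streaming-app.jar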

Running Spark Jobs via Oozie

Is it possible to run Spark jobs, e.g. Spark SQL jobs, via Oozie?
In the past we have used Oozie with Hadoop. Since we are now using Spark SQL on top of YARN, I am looking for a way to use Oozie to schedule the jobs.
Thanks.
Yes, it's possible. The procedure is the same: you have to provide Oozie with a directory structure containing coordinator.xml, workflow.xml and a lib directory with your jar files.
But remember that Oozie starts the job with a java -cp command, not with spark-submit, so if you have to run it through Oozie, here is a trick.
Run your jar with spark-submit in the background.
Look for that process in the process list. It will be running under a java -cp command, but with some additional jars that spark-submit added. Add those jars to your CLASS_PATH, and that's it; now you can run your Spark applications through Oozie (one way to read that classpath is sketched after the commands below).
1. nohup spark-submit --class package.to.MainClass /path/to/App.jar &
2. ps aux | grep '/path/to/App.jar'
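Here is one way to do step 2, sketched under the assumption of a Linux host where the JVM's full command line can be read from /proc (the jar path is the same placeholder as above):
# Find the PID of the JVM that spark-submit started in step 1.
PID=$(pgrep -f '/path/to/App.jar' | head -n 1)
# /proc/<pid>/cmdline holds the exact java -cp ... invocation, NUL-separated;
# print it with spaces so the extra jars spark-submit added are readable.
tr '\0' ' ' < /proc/$PID/cmdline
# Copy the value that follows -cp into the CLASS_PATH you configure for Oozie.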
EDITED: You can also use the latest Oozie, which has a Spark action as well.
To run Spark SQL from Oozie you need to use the Oozie Spark action.
You can locate the Oozie examples archive (oozie.gz) on your distribution; in Cloudera it is usually found at the path below:
]$ locate oozie.gz
/usr/share/doc/oozie-4.1.0+cdh5.7.0+267/oozie-examples.tar.gz
Spark SQL needs the hive-site.xml file at execution time, which you have to provide in workflow.xml:
<spark-opts>--files /hive-site.xml</spark-opts>

How to kill a mapred job started by Hive?

I'm working with CDH 5.1 now. It starts normal Hadoop jobs through YARN, but Hive still works with mapred. Sometimes a big query will hang for a long time and I want to kill it.
I can find this big job in the JobTracker web console, but it doesn't provide a button to kill it.
Another way is to kill it from the command line. However, I couldn't find any running job from the command line.
I have tried 2 commands:
yarn application -list
mapred job -list
How do I kill a big query like this?
You can get the job ID from the Hive CLI when you run a job, or from the web UI. You can also list the job IDs using the application ID from the ResourceManager. Ideally, you should get everything from
mapred job -list
or
hadoop job -list
Using the job ID, you can kill it with the command below.
hadoop job -kill <job_id>
Another alternative would be to kill the application using
yarn application -kill <application_id>

Hadoop Streaming job vs Hadoop Pipes job

I am trying to run a Hadoop job using the following command:
hadoop -jar myjob.jar
In this case I cannot see the submitted jar and its status on the web page (at port 50030),
but if I do
hadoop jar myjob.jar
I can see the progress on the same port (50030).
What is the difference between these two commands? I searched a bit and found:
hadoop -jar to submit pipe jobs
hadoop jar to submit streaming jobs
Any insight would be of great help.
There is no hadoop -jar command.
From the docs:
Usage: hadoop jar <jar> [mainClass] args...
The streaming jobs are run via this command.
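For reference, a minimal streaming submission looks like the sketch below; the streaming jar path is where CDH usually ships it, and the input/output paths and mapper/reducer commands are placeholders:
# Submit a trivial streaming job with "hadoop jar" (there is no "hadoop -jar").
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -input /user/me/input \
  -output /user/me/output \
  -mapper /bin/cat \
  -reducer /usr/bin/wc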
