Spark setAppName doesn't appear in Hadoop running applications UI - hadoop

I am running a Spark Streaming job, and when I set the app name (a more readable string) for it, the name doesn't appear in the Hadoop running applications UI. I always see the class name as the application name in the Hadoop UI.
val sparkConf = new SparkConf().setAppName("BetterName")
How do I set the job name in Spark so that it appears in this Hadoop UI?
The Hadoop URL for running applications is http://localhost:8088/cluster/apps/RUNNING
[update]
Looks like this issue only occurs with Spark Streaming jobs; I couldn't find a solution for it, though.

When a job is submitted via spark-submit, the SparkContext it creates can't change the app name, because YARN has already been configured for the job before Spark starts. For the app name to appear in the Hadoop running applications UI, you have to set it on the spark-submit command line with "--name BetterName". I kick off my job with a shell script that calls spark-submit, so I added the name to the command in that script.
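For example, a minimal sketch of such a spark-submit invocation (the class name, jar path, and deploy mode below are placeholders, not from the original question):
# Hypothetical example: set the YARN application name at submit time
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --name BetterName \
  --class com.example.StreamingJob \
  /path/to/streaming-job.jar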

Related

How to get the job id of a specific running hadoop job

I need to get the id of a specific hadoop job.
In my case, I launch a Sqoop command remotely and I want to verify the job status with this command:
hadoop job -status job_id | grep -w 'state'
I can get this information from the GUI, but I want to do it from the command line.
Can anyone help me?
You can use the YARN REST APIs, via your browser or curl from the command line. They list all currently running and previously run jobs, including Sqoop jobs and the MapReduce jobs that Sqoop generates and executes. Use the UI first: if you have it up and running, just point your browser to http://<host>:8088/cluster (not sure whether the port is the same on all Hadoop distributions; I believe 8088 is the default in Apache Hadoop). Alternatively, you can use YARN commands directly, e.g. yarn application -list.
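As a rough sketch (the hostname and port are assumptions based on the defaults mentioned above), you could query the ResourceManager REST API with curl, or use the yarn CLI and filter for your job:
# List running applications via the ResourceManager REST API (default port 8088 assumed)
curl 'http://localhost:8088/ws/v1/cluster/apps?states=RUNNING'

# Or list applications with the yarn CLI and grep for the Sqoop job by name
yarn application -list | grep -i sqoop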

how to know the spark application's parent application in oozie spark action

When I use Oozie's Spark action to launch a Spark application, Oozie first launches a MapReduce application (the launcher), and that MapReduce job then launches the Spark application. How can I tell which MapReduce task launched a given Spark application?
So far I can see that the MapReduce application is named with some Oozie information, like oozie:launcher:T=spark:W=JavaWordCount:A=spark-test:ID=0000023-171207132348866-oozie-oozi-W, and the Spark application has an application tag like oozie-6e83d420c018bc0f63bccd19fe73b24f. But I still don't know how to associate them.
You can get the Spark application id by using the YARN client.
To show all application ids and find the oozie:launcher MapReduce application id, run the following command:
yarn application -list
Then you can get the Spark application id from the oozie:launcher MapReduce application's logs like this:
yarn logs -applicationId $APPID | grep "Submitted application" | awk '{print $NF}'
Replace $APPID with the oozie:launcher MapReduce application id that launched the Spark application.
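Putting the two steps together as a rough sketch (the application id below is a placeholder, not a real id from the question):
# 1. Find the oozie:launcher MapReduce application id
yarn application -list | grep 'oozie:launcher'

# 2. Use its logs to find the Spark application id it submitted
#    (application_1512345678901_0023 is a placeholder)
APPID=application_1512345678901_0023
yarn logs -applicationId $APPID | grep "Submitted application" | awk '{print $NF}'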

Apache Zeppelin running on Spark Cluster and YARN

I have created and run a %pyspark program in Apache Zeppelin running on a Spark cluster with yarn-client. The program reads a file from HDFS into a DataFrame, does a simple groupBy, and prints the output successfully. I am using Zeppelin version 0.6.2 and Spark 2.0.0.
I can see the job running in YARN (see application_1480590511892_0007):
But when I check the Spark UI at the same time, there is nothing at all for this job:
Question 1: Shouldn't this job appear in both of these windows?
Also, the completed applications in the Spark UI image just above were Zeppelin jobs with the %python interpreter that simply initialized a SparkSession and stopped it:
1st Zeppelin block:
%python
from pyspark.sql import SparkSession
from pyspark.sql import Row
import collections
spark = SparkSession.builder.appName("SparkSQL").getOrCreate()
2nd Zeppelin block:
%python
spark.stop()
Question 2: This job, in turn, has not appeared in the YARN UI. Is it the case that whenever a job appears in the Spark UI, it is running with the Spark resource manager?
Any insights for these questions are highly appreciated.
Zeppelin runs a continuous Spark application once the interpreter is first used; all paragraphs run in this one application. In your second paragraph you are stopping the SparkSession (spark.stop()), which kills the application that was created when the interpreter was first used, so you can only see the job under the Completed Applications section. If you remove the spark.stop() call, you should see the job listed under Running Applications.

Running Spark Jobs via Oozie

Is it possible to run Spark jobs, e.g. Spark SQL jobs, via Oozie?
In the past we have used Oozie with Hadoop. Since we are now using Spark SQL on top of YARN, we are looking for a way to use Oozie to schedule jobs.
Thanks.
Yup, it's possible... The procedure is also the same: you have to provide Oozie a directory structure containing coordinator.xml, workflow.xml, and a lib directory with your jar files.
But remember that Oozie starts the job with a java -cp command, not with spark-submit, so if you have to run it with Oozie, here is a trick.
Run your jar with spark-submit in the background.
Look for that process in the process list. It will be running under a java -cp command, but with some additional jars that were added by spark-submit. Add those jars to the CLASS_PATH, and that's it. Now you can run your Spark applications through Oozie (a rough sketch of extracting that classpath follows the two commands below).
1. nohup spark-submit --class package.to.MainClass /path/to/App.jar &
2. ps aux | grep '/path/to/App.jar'
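As a minimal sketch of step 2 (the jar path is the same placeholder used above, and the exact ps output format varies by OS), you could pull the -cp value out of the running JVM's command line:
# Show the java command line that spark-submit launched and cut out the
# classpath argument that follows -cp
ps -ef | grep '[/]path/to/App.jar' | grep -o -- '-cp [^ ]*'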
EDITED: You can also use a more recent Oozie release, which has a Spark action as well.
To run Spark SQL via Oozie you need to use the Oozie Spark action.
You can locate the Oozie examples archive in your distribution. In Cloudera you can usually find it at the path below.
]$ locate oozie.gz
/usr/share/doc/oozie-4.1.0+cdh5.7.0+267/oozie-examples.tar.gz
Spark SQL needs the hive-site.xml file at execution time, which you need to pass in workflow.xml:
<spark-opts>--files /hive-site.xml</spark-opts>
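As a rough sketch of trying the bundled Spark action example (the paths, the Oozie server URL, and the assumption that the extracted examples contain an apps/spark directory are all guesses for your environment):
# Extract the bundled examples (path taken from the locate output above)
tar -xzf /usr/share/doc/oozie-4.1.0+cdh5.7.0+267/oozie-examples.tar.gz

# Edit examples/apps/spark/job.properties for your NameNode/ResourceManager,
# then submit the Spark action example with the Oozie CLI (default port 11000 assumed)
oozie job -oozie http://localhost:11000/oozie \
    -config examples/apps/spark/job.properties -run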

I cannot see the running applications in hadoop 2.5.2 (yarn)

I installed Hadoop 2.5.2, and I can run the wordcount sample successfully. However, when I want to see the application running on YARN (while the job is running), I can't: the all-applications interface is always empty (shown in the following screen).
Is there anyway to make the jobs visible?
Please try localhost:19888, or check the value of the job history web UI property (mapreduce.jobhistory.webapp.address) configured in your Hadoop configuration (typically mapred-site.xml).
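A quick sketch for checking this (the config path is an assumption; adjust it for your install):
# Look up the JobHistory web UI address in the MapReduce config (path may differ)
grep -A1 'mapreduce.jobhistory.webapp.address' /etc/hadoop/conf/mapred-site.xml

# If the JobHistory server is running, its REST API should answer here (default port 19888)
curl http://localhost:19888/ws/v1/history/info

# If nothing is listening, start the history server (Hadoop 2.x)
# $HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver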
