Apache Zeppelin running on Spark Cluster and YARN - hadoop

I have created and run a %pyspark program in Apache Zeppelin, which runs on a Spark cluster in yarn-client mode. The program reads a file from HDFS into a DataFrame, performs a simple groupBy, and prints the output successfully. I am using Zeppelin 0.6.2 and Spark 2.0.0.
I can see the job running in YARN (see application_1480590511892_0007):
But when I check the Spark UI at the same time, there is nothing at all for this job:
Question 1: Shouldn't this job appear in both of these windows?
Also, the completed applications in the Spark UI image just above were Zeppelin jobs using the %python interpreter that simply initialized a SparkSession and then stopped it:
1st Zeppelin block:
%python
from pyspark.sql import SparkSession
from pyspark.sql import Row
import collections
spark = SparkSession.builder.appName("SparkSQL").getOrCreate()
2nd Zeppelin block:
%python
spark.stop()
Question 2: This job, in turn, has not appeared in the YARN UI. Does a job appearing in the Spark UI mean that it is running under the Spark standalone resource manager rather than YARN?
Any insights for these questions are highly appreciated.

Zeppelin runs one continuous Spark application once the interpreter is first used, and all paragraphs run inside that single application. In your second paragraph you stop the SparkSession (spark.stop()), which kills the application that was created when the interpreter was first used, so you only see the jobs under the Completed Applications section. If you remove the spark.stop() call, you should see the job listed under Running Applications.
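As a minimal sketch (assuming the same %python interpreter process is reused across paragraphs, as in the question above), keeping the session alive and reusing it with getOrCreate() keeps the application under Running Applications; the DataFrame below is purely illustrative:
%python
from pyspark.sql import SparkSession

# getOrCreate() returns the session created in an earlier paragraph instead of
# starting a brand-new application, so the app stays under Running Applications.
spark = SparkSession.builder.appName("SparkSQL").getOrCreate()

# Illustrative work; any job run here shows up under the one running application.
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "label"])
df.groupBy("label").count().show()

# Only call spark.stop() when you are completely done; it ends the application
# and moves it to Completed Applications in the Spark master UI.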

Related

Run Apache Zeppelin as a different user

How can I run a Zeppelin interpreter as a different user than the user who started the process?
I want to run Zeppelin as "root" and then launch a Spark application as the "admin" user.
You can keep running Zeppelin as you're currently doing, but start the Spark process separately as that admin user.
The Spark interpreter can be pointed to an external master. Open the Zeppelin interpreter config and change the value of the spark master config key, pointing it to the instance started by the admin user.
In other words, you run Spark and Zeppelin as separate processes:
# First run spark as admin:
$ /path/to/spark/sbin/start-all.sh
# Then run zeppelin as root:
# /path/to/zeppelin/bin/zeppelin-daemon.sh start
According to the Zeppelin documentation for the Spark interpreter, you can point Zeppelin to a separate master by changing the value of the master configuration.
The default value for this config is local[*], which makes Zeppelin start a spark context just as done through the spark shell.
And just as the Spark shell can be pointed to an external master, you can use a value for the master URL, such as spark://masterhost:7077.
After this change (and possibly a restart of the interpreter), Zeppelin will only run the driver program, while all the workers and scheduling will be handled by your master.
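For illustration, here is a minimal PySpark sketch of what Zeppelin's driver effectively does once its master property points at the external standalone master (masterhost:7077 is the placeholder URL from the answer above):
from pyspark.sql import SparkSession

# Attach this driver to the external standalone master started by the admin user.
spark = (SparkSession.builder
         .master("spark://masterhost:7077")
         .appName("zeppelin-style-driver")
         .getOrCreate())

# Only the driver runs in this process; executors are scheduled by the external master.
print(spark.sparkContext.master)
spark.stop()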

Spark History Server on YARN only shows Python application

I have two Spark contexts running on a box, one from Python and one from Scala. They are similarly configured, yet only the Python application appears in the Spark history page pointed to by the YARN tracking URL. Is there extra configuration I am missing here? (Both run in yarn-client mode.)

While running an application using spark-submit in Apache Spark, got a WARN message

I have configured an Apache Spark standalone cluster on two Ubuntu 14.04 VMs: one is the master and the other is a worker, and the two are connected with passwordless SSH as described here.
After that, from the master VM, I started both the master and the worker with the following command from the Spark home directory:
sbin/start-all.sh
Then I ran the following command on both the master and the worker VMs:
jps
On the master VM it shows:
6047 jps
6048 Master
And on the worker VM:
6046 jps
6045 Worker
It seems that the master and the worker are running properly, and no errors appear in the web UI.
But when I try to run an application using the following command:
spark-1.6.0/bin/spark-submit spark.py
it gives the following WARN message in the console:
TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Here is my test application:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Connect to the standalone master started above
conf = SparkConf().setMaster('spark://SparkMaster:7077').setAppName("My_App")
sc = SparkContext(conf=conf)
SQLCtx = SQLContext(sc)

# Read the CSV, split each line on commas, and collect the rows to the driver
list_of_list = sc.textFile("ver1_sample.csv").map(lambda line: line.split(",")).collect()
print("type_of_list_of_list===========", type(list_of_list), list_of_list)
I am new to Apache Spark. Please help.
The problem could be with resource (memory/cores) availability. By default, Spark takes its defaults from spark-defaults.conf.
Try using
bin/spark-submit --executor-memory 1g spark.py
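Equivalently (a hedged sketch, assuming the standalone worker in the question has at least 1 GB of memory free), the same limits can be set in the application itself via SparkConf so the job only requests what the worker can provide; the 1g and 2-core values below are illustrative:
from pyspark import SparkConf, SparkContext

# Keep the requested executor memory and total cores within what the worker offers.
conf = (SparkConf()
        .setMaster('spark://SparkMaster:7077')
        .setAppName("My_App")
        .set("spark.executor.memory", "1g")
        .set("spark.cores.max", "2"))

sc = SparkContext(conf=conf)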

Spark setAppName doesn't appear in Hadoop running applications UI

I am running a Spark Streaming job, and when I set a more readable app name for it, that name doesn't appear in the Hadoop running-applications UI. I always see the class name as the name in the Hadoop UI.
val sparkConf = new SparkConf().setAppName("BetterName")
How do I set the job name in Spark so that it appears in this Hadoop UI?
The Hadoop URL for running applications is http://localhost:8088/cluster/apps/RUNNING
[update]
Looks like this is an issue only with Spark Streaming jobs; I couldn't find a solution for it though.
When submitting a job via spark-submit, the SparkContext that gets created can't set the name of the app, because YARN has already registered the application before the Spark code runs. For the app name to appear in the Hadoop running-jobs UI, you have to set it on the spark-submit command line with --name BetterName. I kick off my job with a shell script that calls spark-submit, so I added the name to the command in that script.
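As a rough sketch of that launcher idea (the paths, main class, and jar below are placeholders, not from the question), a small Python wrapper can pass --name to spark-submit explicitly:
import subprocess

# YARN registers the application name at submit time, so pass it here
# rather than relying on setAppName() inside the streaming job.
subprocess.check_call([
    "spark-submit",
    "--master", "yarn",
    "--deploy-mode", "client",
    "--name", "BetterName",                 # shows up in the YARN RM UI
    "--class", "com.example.StreamingJob",  # placeholder main class
    "/path/to/streaming-job.jar",           # placeholder jar
])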

How to run Mahout jobs on Spark Engine?

Currently I'm doing some document similarity analysis using the Mahout RowSimilarity job. This can easily be done by running the command 'mahout rowsimilarity…' from the console. However, I noticed that this job can also be run on the Spark engine, and I would like to know how to run it on Spark.
You can use MLlib in Spark as an alternative to Mahout; all MLlib algorithms run in distributed mode (similar to MapReduce on Hadoop).
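For example, here is a hedged PySpark MLlib sketch (assuming Spark 2.0+, where the Python API exposes RowMatrix.columnSimilarities) that computes pairwise cosine similarities. Note that it compares columns, so each document vector would need to be laid out as a column, and the tiny dense vectors below are placeholders rather than real TF-IDF data:
from pyspark import SparkContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

sc = SparkContext(appName="similarity-sketch")

# Placeholder document-term matrix; each column plays the role of a document.
rows = sc.parallelize([
    Vectors.dense([1.0, 0.0, 2.0]),
    Vectors.dense([0.0, 3.0, 4.0]),
    Vectors.dense([5.0, 6.0, 0.0]),
])

mat = RowMatrix(rows)
sims = mat.columnSimilarities()  # CoordinateMatrix of pairwise cosine similarities

for entry in sims.entries.collect():
    print(entry.i, entry.j, entry.value)

sc.stop()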
Mahout 0.10 also provides job execution on Spark.
For more details, see:
http://mahout.apache.org/users/sparkbindings/play-with-shell.html
Steps to set up Spark with Mahout:
1. Go to the directory where you unpacked Spark and run sbin/start-all.sh to start Spark locally.
2. Open a browser and point it to http://localhost:8080/ to check whether Spark started successfully. Copy the URL of the Spark master at the top of the page (it starts with spark://).
3. Define the following environment variables:
export MAHOUT_HOME=[directory into which you checked out Mahout]
export SPARK_HOME=[directory where you unpacked Spark]
export MASTER=[url of the Spark master]
4. Finally, change to the directory where you unpacked Mahout and run bin/mahout spark-shell; you should see the shell starting and get the prompt mahout>. Check the FAQ for further troubleshooting.
Please visit the link above; it uses the new Mahout 0.10 and runs on Spark.
