Spark with yarn cluster in docker - hadoop

I have been working on my Spark project in standalone mode (in the Eclipse IDE). Here is some code that I used (it works well):
val conf = new SparkConf()
  .setAppName("My Application")
  .setMaster("local[*]")
  .set("spark.executor.memory", "4g")
  .set("spark.driver.memory", "4g")

val spark = SparkSession.builder
  .config(conf)
  .appName("spark app")
  .config("spark.sql.warehouse.dir", "file:///.")
  .getOrCreate()
Until now, I have packaged this application as a jar and used it in another Java project.
Now I want to change it to YARN cluster mode for multiple users, so I installed a Hadoop cluster in Docker with this image (link).
I think the YARN configuration is already set up. Do I need to do anything more to get a Spark YARN cluster?
How should I set SparkConf? How do I pass the Hadoop IP, my application jar, and the other information needed?
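My rough guess, from reading the running-on-YARN docs, is a sketch like the one below (assuming HADOOP_CONF_DIR points at the client-side configs copied from the Docker cluster; the jar path is just a placeholder), but I am not sure whether this is enough:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Assumption: HADOOP_CONF_DIR (or YARN_CONF_DIR) is exported and contains the
// core-site.xml / hdfs-site.xml / yarn-site.xml of the Dockerized Hadoop cluster.
val conf = new SparkConf()
  .setAppName("My Application")
  .setMaster("yarn")                    // instead of local[*]
  .set("spark.executor.memory", "4g")
  .set("spark.driver.memory", "4g")
  .setJars(Seq("/path/to/my-app.jar"))  // placeholder path to my application jar

val spark = SparkSession.builder
  .config(conf)
  .getOrCreate()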

Related

How to use JDBC to read datasets from Oracle?

What is really executed, and where, when using JDBC drivers to connect to, e.g., Oracle?
I have started a Spark master as
spark-class.cmd org.apache.spark.deploy.master.Master
and a worker like so
spark-class.cmd org.apache.spark.deploy.worker.Worker spark://myip:7077
and spark shell as
spark-shell --master spark://myip:7077
in spark-defaults.conf I have
spark.driver.extraClassPath = C:/jdbcDrivers/ojdbc8.jar
spark.executor.extraClassPath = C:/jdbcDrivers/ojdbc8.jar
and in spark-env.sh I have
SPARK_CLASSPATH=C:/jdbcDrivers/ojdbc8.jar
I can now run queries against Oracle in the spark-shell:
val jdbcDF = spark.read.format("jdbc").option("url","jdbc:oracle:thin:#...
This works fine without separately adding the jdbc driver jar in the scala shell.
When I start the master and worker in the same way, but create a Scala project in Eclipse and connect to the master as follows:
val sparkSession = SparkSession.builder
  .master("spark://myip:7077")
  .appName("SparkTestApp")
  .config("spark.jars", "C:\\pathToJdbc\\ojdbc8.jar")
  .getOrCreate()
then it fails if I don't explicitly add the JDBC jar in the Scala code.
How is the execution different? Why do I need to specify the JDBC jar in the code? What is the point of connecting to the master if the application doesn't rely on the master and workers I started?
If I use multiple workers with JDBC, will they use only one connection, or will they read in parallel over several connections simultaneously?
You are certainly using too many moving parts for the sample, and that is what got you confused.
The two lines, spark-class.cmd org.apache.spark.deploy.master.Master and spark-class.cmd org.apache.spark.deploy.worker.Worker spark://myip:7077, started a Spark Standalone cluster with one master and one worker. See Spark Standalone Mode.
In addition to running on the Mesos or YARN cluster managers, Spark also provides a simple standalone deploy mode. You can launch a standalone cluster either manually, by starting a master and workers by hand, or use our provided launch scripts. It is also possible to run these daemons on a single machine for testing.
You chose to start the Spark Standalone cluster manually (as described in Starting a Cluster Manually).
I doubt that spark-defaults.conf is used by the cluster at all. The file is used to configure the Spark applications that you spark-submit to a cluster (as described in Dynamically Loading Spark Properties):
bin/spark-submit will also read configuration options from conf/spark-defaults.conf, in which each line consists of a key and a value separated by whitespace.
With that said, I think we can safely put Spark Standalone aside. It does not add much to the discussion (and does confuse a bit).
"Installing" JDBC Driver for Spark Application
In order to use a JDBC driver in your Spark application, you should spark-submit with the --driver-class-path command-line option (or the spark.driver.extraClassPath property, as described in Runtime Environment):
spark.driver.extraClassPath Extra classpath entries to prepend to the classpath of the driver.
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-class-path command line option or in your default properties file.
I strongly recommend using spark-submit --driver-class-path.
$ ./bin/spark-submit --help
...
--driver-class-path Extra class path entries to pass to the driver. Note that
jars added with --jars are automatically included in the
classpath.
You can read my notes on how to use a JDBC driver with PostgreSQL in Working with Datasets from JDBC Data Sources (and PostgreSQL).
PROTIP Use SPARK_PRINT_LAUNCH_COMMAND=1 to check out the command line of spark-submit.
All above applies to spark-shell too (as it uses spark-submit under the covers).

Setting spark yarn client

I would like to set up a Spark YARN client (link). Does it require installing Hadoop, or is it OK to install only YARN (by this link)?
No, Spark does not require Hadoop to run. Apache Spark is an independent project that can run on its own. If you want, you can even run it without Apache YARN.
Spark supports three types of cluster managers: Mesos, YARN, and standalone. If you do not have YARN installed, it can use Mesos or standalone, and it uses standalone by default when you do not specify a preference for a cluster manager. The links you have mentioned are fine to use, but I think better resources are available on Google.
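As a rough sketch (host names and ports below are placeholders), the choice of cluster manager mostly comes down to the master URL you pass:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("cluster-manager-demo")
  // Pick exactly one of the following master URLs:
  .master("local[*]")                     // no cluster manager, runs in-process
  // .master("spark://master-host:7077")  // Spark standalone (placeholder host)
  // .master("mesos://mesos-host:5050")   // Mesos (placeholder host)
  // .master("yarn")                      // YARN; needs HADOOP_CONF_DIR or YARN_CONF_DIR set
  .getOrCreate()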

How to connect Apache Spark with Yarn from the SparkContext?

I have developed a Spark application in Java using Eclipse.
So far, I am using the standalone mode by configuring the master's address to 'local[*]'.
Now I want to deploy this application on a Yarn cluster.
The only official documentation I found is http://spark.apache.org/docs/latest/running-on-yarn.html
Unlike the documentation for deploying on a Mesos cluster or in standalone mode (http://spark.apache.org/docs/latest/running-on-mesos.html), there is no URL to use within SparkContext for the master's address.
Apparently, I have to use the command line to deploy Spark on YARN.
Do you know if there is a way to configure the master's address in the SparkContext, as in the standalone and Mesos modes?
There actually is a URL.
Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster. These configs are used to write to HDFS and connect to the YARN ResourceManager
You should have at least hdfs-site.xml, yarn-site.xml, and core-site.xml files that specify all the settings and URLs for the Hadoop cluster you connect to.
Some properties from yarn-site.xml include yarn.nodemanager.hostname and yarn.nodemanager.address.
Since the address has a default of ${yarn.nodemanager.hostname}:0, you may only need to set the hostname.
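In other words, a minimal sketch, assuming HADOOP_CONF_DIR (or YARN_CONF_DIR) is exported in the environment that launches the application and contains those XML files:

import org.apache.spark.sql.SparkSession

// "yarn" is the whole master "URL"; the ResourceManager address is resolved
// from the yarn-site.xml found via HADOOP_CONF_DIR, not from the master string.
val spark = SparkSession.builder
  .master("yarn")
  .appName("SparkOnYarnApp")
  .getOrCreate()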

Adding JDBC driver to Spark on EMR

I'm trying to add a JDBC driver to a Spark cluster that is executing on top of Amazon EMR, but I keep getting a
java.sql.SQLException: No suitable driver found exception.
I tried the following things:
Using addJar to add the driver jar explicitly from the code.
Using the spark.executor.extraClassPath and spark.driver.extraClassPath parameters.
Using spark.driver.userClassPathFirst=true; with this option I get a different error because of a mix of dependencies with Spark. Anyway, this option seems too aggressive if I just want to add a single JAR.
Could you please help me with this? How can I introduce the driver to the Spark cluster easily?
Thanks,
David
Source code of the application
val properties = new Properties()
properties.put("ssl", "***")
properties.put("user", "***")
properties.put("password", "***")
properties.put("account", "***")
properties.put("db", "***")
properties.put("schema", "***")
properties.put("driver", "***")
val conf = new SparkConf().setAppName("***")
.setMaster("yarn-cluster")
.setJars(JavaSparkContext.jarOfClass(this.getClass()))
val sc = new SparkContext(conf)
sc.addJar(args(0))
val sqlContext = new SQLContext(sc)
var df = sqlContext.read.jdbc(connectStr, "***", properties = properties)
df = df.select( Constants.***,
Constants.***,
Constants.***,
Constants.***,
Constants.***,
Constants.***,
Constants.***,
Constants.***,
Constants.***)
// Additional actions on df
I had the same problem. What ended up working for me was to use the --driver-class-path parameter with spark-submit.
The main thing is to add the entire default Spark classpath to --driver-class-path.
Here are my steps:
I got the default driver classpath by reading the value of the "spark.driver.extraClassPath" property from the Spark History Server under "Environment".
Copied the MySQL JAR file to each node in the EMR cluster.
Put the MySQL jar path at the front of the --driver-class-path argument to the spark-submit command and appended the value of "spark.driver.extraClassPath" to it.
My driver class path ended up looking like this:
--driver-class-path /home/hadoop/jars/mysql-connector-java-5.1.35.jar:/etc/hadoop/conf:/usr/lib/hadoop/:/usr/lib/hadoop-hdfs/:/usr/lib/hadoop-mapreduce/:/usr/lib/hadoop-yarn/:/usr/lib/hadoop-lzo/lib/:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/:/usr/share/aws/emr/emrfs/auxlib/*
This worked with EMR 4.1 using Java with Spark 1.5.0.
I had already added the MySQL JAR as a dependency in the Maven pom.xml
You may also want to look at this answer as it seems like a cleaner solution. I haven't tried it myself.
With EMR 5.2 I add any new jars to the original driver classpath with:
export MY_DRIVER_CLASS_PATH=my_jdbc_jar.jar:some_other_jar.jar:$(grep spark.driver.extraClassPath /etc/spark/conf/spark-defaults.conf | awk '{print $2}')
and after that
spark-submit --driver-class-path $MY_DRIVER_CLASS_PATH
Following a similar pattern to this answer quoted above, this is how I automated installing a JDBC driver on EMR clusters. (Full automation is useful for transient clusters started and terminated per job.)
Use a bootstrap action to install the JDBC driver on all EMR cluster nodes. Your bootstrap action will be a one-line shell script, stored in S3, that looks like:
aws s3 cp s3://.../your-jdbc-driver.jar /home/hadoop
Add a step to your EMR cluster, before running your actual Spark job, to modify /etc/spark/conf/spark-defaults.conf.
This will be another one-line shell script, stored in S3:
sudo sed -e 's,\(^spark.driver.extraClassPath.*$\),\1:/home/hadoop/your-jdbc-driver.jar,' -i /etc/spark/conf/spark-defaults.conf
The step itself will look like:
{
  "name": "add JDBC driver to classpath",
  "jar": "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
  "args": ["s3://...bucket.../set-spark-driver-classpath.sh"]
}
This will add your JDBC driver to spark.driver.extraClassPath
Explanation
You can't do both as bootstrap actions, because Spark won't be installed yet, so there is no config file to update.
You can't install the JDBC driver as a step, because you need the JDBC driver installed on the same path on all cluster nodes. In YARN cluster mode, the driver process does not necessarily run on the master node.
The configuration only needs to be updated on the master node, though, as the config is packed up and shipped to whatever node ends up running the driver.
In case you're using Python in your EMR cluster, there's no need to specify the jar while creating the cluster. You can add the jar package while creating your SparkSession.
# spark.jars.packages takes a single comma-separated list of Maven coordinates;
# a second .config() call with the same key would overwrite the first one.
spark = SparkSession \
    .builder \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0,mysql:mysql-connector-java:8.0.17") \
    .getOrCreate()
And then, when you make your query, mention the driver like this:
form_df = spark.read.format("jdbc"). \
option("url", "jdbc:mysql://yourdatabase"). \
option("driver", "com.mysql.jdbc.Driver"). \
This way the package is included in the SparkSession, as it is pulled from a Maven repository. I hope it helps someone who is in the same situation I once was.
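For completeness, the same idea in Scala (only a sketch; note that spark.jars.packages is a single comma-separated list of Maven coordinates, which is why the two packages above are combined into one value):

import org.apache.spark.sql.SparkSession

// The coordinates are resolved from Maven Central (or repositories listed in
// spark.jars.repositories) at session startup and shipped to the executors.
val spark = SparkSession.builder
  .config("spark.jars.packages",
    "org.apache.hadoop:hadoop-aws:2.7.0,mysql:mysql-connector-java:8.0.17")
  .getOrCreate()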

Spark on yarn concept understanding

I am trying to understand how Spark runs on a YARN cluster/client. I have the following questions in mind.
Is it necessary that Spark is installed on all the nodes in the YARN cluster? I think it should be, because the worker nodes in the cluster execute tasks and should be able to decode the code (Spark APIs) in the Spark application sent to the cluster by the driver.
It says in the documentation "Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster". Why does the client node have to install Hadoop when it is only sending the job to the cluster?
Adding to other answers.
Is it necessary that Spark is installed on all the nodes in the YARN cluster?
No, not if the Spark job is scheduled in YARN (either client or cluster mode). Spark installation is needed on all the nodes only for standalone mode.
These are the visualizations of the Spark app deployment modes (the diagrams themselves are not reproduced here).
Spark Standalone Cluster: in cluster mode the driver sits in one of the Spark worker nodes, whereas in client mode it runs within the machine that launched the job.
YARN cluster mode and YARN client mode: shown as separate diagrams in the source, along with a table offering a concise list of the differences between these modes.
It says in the documentation "Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client-side) configuration files for the Hadoop cluster". Why does the client node have to install Hadoop when it is sending the job to the cluster?
A Hadoop installation is not mandatory, but the configurations (not all of them) are! We can call these gateway nodes. This is for two main reasons:
The configuration contained in HADOOP_CONF_DIR directory will be distributed to the YARN cluster so that all containers used by the application use the same configuration.
In YARN mode the ResourceManager’s address is picked up from the Hadoop configuration (yarn-default.xml). Thus, the --master parameter is yarn.
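If you want to double-check which ResourceManager a gateway node will pick up, here is a small sketch using the Hadoop client API (it assumes the XML files from HADOOP_CONF_DIR are on the application classpath):

import org.apache.hadoop.yarn.conf.YarnConfiguration

// YarnConfiguration loads yarn-default.xml and then yarn-site.xml from the classpath;
// this is the same place a --master yarn submission resolves the ResourceManager from.
val yarnConf = new YarnConfiguration()
println(yarnConf.get(YarnConfiguration.RM_ADDRESS, YarnConfiguration.DEFAULT_RM_ADDRESS))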
Update (2017-01-04): Spark 2.0+ no longer requires a fat assembly jar for production deployment (source).
We are running spark jobs on YARN (we use HDP 2.2).
We don't have Spark installed on the cluster; we only added the Spark assembly jar to HDFS.
For example to run the Pi example:
./bin/spark-submit \
--verbose \
--class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--conf spark.yarn.jar=hdfs://master:8020/spark/spark-assembly-1.3.1-hadoop2.6.0.jar \
--num-executors 2 \
--driver-memory 512m \
--executor-memory 512m \
--executor-cores 4 \
hdfs://master:8020/spark/spark-examples-1.3.1-hadoop2.6.0.jar 100
--conf spark.yarn.jar=hdfs://master:8020/spark/spark-assembly-1.3.1-hadoop2.6.0.jar - this config tells YARN where to take the Spark assembly from. If you don't use it, the jar will be uploaded from the machine where you run spark-submit.
About your second question: the client node doesn't need Hadoop installed. It only needs the configuration files. You can copy the directory from your cluster to your client.
1 - Spark follows a master/slave architecture. So on your cluster, you have to install a Spark master and N Spark slaves. You can run Spark in standalone mode, but using the YARN architecture will give you some benefits.
There is a very good explanation of it here : http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
2 - It is necessary if you want to use YARN or HDFS, for example, but as I said before, you can run it in standalone mode.
Let me try to cut through the glue and make it short for the impatient.
There are 6 components: 1. client, 2. driver, 3. executors, 4. application master, 5. workers, and 6. resource manager; 2 deploy modes; and 2 resource (cluster) managers.
Here's the relation:
Client
Nothing special; the client is the one submitting the Spark app.
Worker, executors
Nothing special, one worker holds one or more executors.
Master, & resource (cluster) manager
(regardless of client or cluster mode)
in YARN, the resource manager and the master sit on two different nodes;
in standalone, resource manager == master: the same process on the same node.
Driver
in client mode, sits with the client
in yarn-cluster mode, sits with the application master (in this case, the client process exits after submitting the app)
in standalone cluster mode, sits with one of the workers
Voilà!
