Missing java system properties when running spark-streaming on Mesos cluster - spark-streaming

I submit a Spark app to a Mesos cluster (running in cluster mode) and pass Java system properties through "--driver-java-options=-Dkey=value -Dkey=value", but these system properties are not available at runtime; it seems they are never set. --conf "spark.driver.extraJavaOptions=-Dkey=value" doesn't work either.
More details:
the command is
bin/spark-submit --master mesos://10.3.101.119:7077 --deploy-mode cluster --class ${classname} --driver-java-options "-Dconfiguration.http=http://10.3.101.119:9090/application.conf" --conf "spark.executor.extraJavaOptions=-Dconfiguration.http=http://10.3.101.119:9090/application.conf" ${jar file}
I have a two-node Mesos cluster: one node runs both the master and a slave, and the other runs a slave only. I submit the Spark application on the master node.
Internally, the application expects to read a configuration file from the Java system property "configuration.http"; if the property is not available, it loads a default file from the root of the classpath. When I submit the application, the logs show that the default configuration file is loaded.
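The lookup inside the application is roughly the following (a minimal sketch for illustration; the default resource name is an assumption):

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

public class ConfigLoader {
    // Prefer the URL given via -Dconfiguration.http; otherwise fall back to a
    // default configuration file bundled at the root of the classpath.
    public static InputStream openConfig() throws IOException {
        String url = System.getProperty("configuration.http"); // null when the -D option is lost
        if (url != null) {
            return new URL(url).openStream();
        }
        return ConfigLoader.class.getResourceAsStream("/application.conf"); // illustrative default
    }
}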
And the actual command to run the application is
"sh -c '/home/ubuntu/spark-1.6.0/bin/spark-submit --name ${appName} --master mesos://zk://10.3.101.184:2181/mesos/grant --driver-cores 1.0 --driver-memory 1024M --class ${classname} ./${jar file} '"
From here you can see that the system property is lost.

You might have a look at this blog post which recommends using an external properties file for this purpose:
$ vi app.properties
spark.driver.extraJavaOptions -Dconfiguration.http=http://10.3.101.119:9090/application.conf
spark.executor.extraJavaOptions -Dconfiguration.http=http://10.3.101.119:9090/application.conf
Then try to run this via
bin/spark-submit --master mesos://10.3.101.119:7077 --deploy-mode cluster --class ${classname} --properties-file app.properties ${jar file}
See
How to pass -D parameter or environment variable to Spark job?
Separate logs from Apache spark

Related

how to switch between cluster types in Apache Spark

I'm trying to switch cluster manager from standalone to 'YARN' in Apache Spark that I've installed for learning.
I read the following thread to understand which cluster type should be chosen.
However, I'd like to know the steps/syntax to change the cluster type.
Ex: from Standalone to YARN or from YARN to Standalone.
In Spark, the --master option lets you run your application in YARN cluster mode, standalone mode, or local mode.
To run the application in local mode or on a standalone cluster, use this with the spark-submit command:
--master local[*]
or
--master spark://192.168.10.01:7077 \
--deploy-mode cluster \
Run on a YARN cluster
--master yarn
--deploy-mode cluster
For more information, see:
https://spark.apache.org/docs/latest/submitting-applications.html
If you are not running through the command line, you can set the master directly on the SparkConf object:
sparkConf.setMaster("spark://host:port") in cluster mode
or
sparkConf.setMaster("local[*]") in client/local mode
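A minimal sketch in Java (the standalone URL is illustrative; swap in your own master URL, or leave setMaster out and pass --master on the command line instead):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class MasterSwitchExample {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf()
                .setAppName("master-switch-example")
                // .setMaster("yarn")                      // run on a YARN cluster
                // .setMaster("local[*]")                  // run locally using all cores
                .setMaster("spark://192.168.10.01:7077");  // run on a standalone cluster
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        System.out.println(sc.parallelize(Arrays.asList(1, 2, 3)).count());
        sc.stop();
    }
}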

Can't see Yarn Job when doing Spark-Submit on Yarn Cluster

I am using spark-submit for my job with the commands below:
spark-submit script_test.py --master yarn --deploy-mode cluster
spark-submit script_test.py --master yarn-cluster --deploy-mode cluster
The job is working fine. I can see it under the Spark History Server UI. However, I cannot see it under the ResourceManager UI (YARN).
I have the feeling that my job is not sent to the cluster but is running on a single node only. However, I see nothing wrong with the way I use the spark-submit command.
Am I wrong? How can I check it, or send the job to the YARN cluster?
When you use --master yarn, it means that somewhere you have configured yarn-site.xml with hosts, ports, and so on.
Maybe the machine from which you run spark-submit doesn't know where the YARN master is.
You could check your Hadoop/YARN/Spark config files, especially yarn-site.xml, to verify that the host of the ResourceManager is correct.
Those files are in different folders depending on which Hadoop distribution you are using. In HDP I guess they are in /etc/hadoop/conf.
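A quick way to see which ResourceManager the submitting machine resolves is a small check against the Hadoop configuration on its classpath (a sketch, assuming the Hadoop YARN client libraries and the config directory are on the classpath):

import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class WhichResourceManager {
    public static void main(String[] args) {
        // Loads yarn-default.xml and yarn-site.xml from the classpath (e.g. /etc/hadoop/conf)
        YarnConfiguration conf = new YarnConfiguration();
        System.out.println("yarn.resourcemanager.hostname = " + conf.get("yarn.resourcemanager.hostname"));
        System.out.println("yarn.resourcemanager.address  = " + conf.get(YarnConfiguration.RM_ADDRESS));
    }
}

If these come back as 0.0.0.0 (the default), the submitting machine does not know where your cluster's ResourceManager is.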
Hope it helps.

Spark-submit not working when application jar is in hdfs

I'm trying to run a Spark application using bin/spark-submit. When I reference my application jar inside my local filesystem, it works. However, when I copied my application jar to a directory in HDFS, I get the following exception:
Warning: Skip remote jar hdfs://localhost:9000/user/hdfs/jars/simple-project-1.0-SNAPSHOT.jar.
java.lang.ClassNotFoundException: com.example.SimpleApp
Here's the command:
$ ./bin/spark-submit --class com.example.SimpleApp --master local hdfs://localhost:9000/user/hdfs/jars/simple-project-1.0-SNAPSHOT.jar
I'm using Hadoop version 2.6.0 and Spark version 1.2.1.
The only way it worked for me was when I used
--master yarn-cluster
To make a jar stored on HDFS accessible to the Spark job, you have to run the job in cluster mode.
$SPARK_HOME/bin/spark-submit \
--deploy-mode cluster \
--class <main_class> \
--master yarn-cluster \
hdfs://myhost:8020/user/root/myjar.jar
Also, there is a Spark JIRA for client mode, which is not supported yet:
SPARK-10643: Support HDFS application download in client mode spark submit
There is a workaround. You could mount the HDFS directory (which contains your application jar) as a local directory.
I did the same (with Azure blob storage, but it should be similar for HDFS).
Example command for Azure (WASB):
sudo mount -t cifs //{storageAccountName}.file.core.windows.net/{directoryName} {local directory path} -o vers=3.0,username={storageAccountName},password={storageAccountKey},dir_mode=0777,file_mode=0777
Now, in your spark submit command, you provide the path from the command above
$ ./bin/spark-submit --class com.example.SimpleApp --master local {local directory path}/simple-project-1.0-SNAPSHOT.jar
spark-submit --master spark://kssr-virtual-machine:7077 --deploy-mode client --executor-memory 1g hdfs://localhost:9000/user/wordcount.py
It works for me; I am using Hadoop 3.3.1 and Spark 3.2.1, and I am able to read the file from HDFS.
Yes, it has to be a local file. I think that's simply the answer.

How to report JMX from Spark Streaming on EC2 to VisualVM?

I have been trying to get a Spark Streaming job, running on an EC2 instance, to report to VisualVM using JMX.
As of now I have the following config file:
spark/conf/metrics.properties:
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
And I start the spark streaming job like this:
(I added the -D flags afterwards in the hope of getting remote access to the EC2 instance's JMX)
terminal:
spark/bin/spark-submit --class my.class.StarterApp --master local --deploy-mode client \
project-1.0-SNAPSHOT.jar \
-Dcom.sun.management.jmxremote \
-Dcom.sun.management.jmxremote.port=54321 \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.ssl=false
There are two issues with the spark-submit command line:
local - you must not run a Spark Streaming application with the local master URL, because there will be no threads left to run your computations (jobs), and you need two, i.e. one for the receiver and another for processing the received data. You should see the following WARN in the logs:
WARN StreamingContext: spark.master should be set as local[n], n > 1
in local mode if you have receivers to get data, otherwise Spark jobs
will not get resources to process the received data.
-D options are not picked up by the JVM, as they're given after the Spark Streaming application jar and effectively become its command-line arguments. Put them before project-1.0-SNAPSHOT.jar and start over (you have to fix the above issue first!).
spark-submit --conf "spark.driver.extraJavaOptions=-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=8090 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false" /path/example/src/main/python/pi.py 10000
Note: the configuration format is --conf "params". Tested under Spark 2.+.

Spark Streaming to ElasticSearch

I'm trying to replicate the example Streamlining Search Indexing using Elastic Search by Holden Karau using the Spark Java API. I've successfully made it work as a normal Java application with some changes in the code. Instead of using the saveAsHadoopDataset method, I'm sending my tweets with:
JavaEsSpark.saveToEs(rdd,"/test/collection");
and running my code with:
java -cp ./target/hbase-spark-playground-1.0-SNAPSHOT.jar spark.examples.SparkToElasticSearchStreaming local[2] collection-name
My current problem is how to execute it on a Yarn Cluster. A code snippet of what I'm doing can be found here:
https://gist.github.com/IvanFernandez/b3a3e25397f8b402256b
and running my class this way:
spark.examples.SparkToElasticSearchStreaming --master yarn-cluster --executor-memory 400m --num-executors 1 ./target/hbase-spark-playground-1.0-SNAPSHOT.jar yarn-cluster collection-name
I think the main problem is that I don't have any Elasticsearch configuration in the foreach transformation, so I can't reach my Elasticsearch master. Any ideas?
The ES cluster and other configuration information should be set in the SparkConf, which is already done in your code snippet, where args[2] is set as es.nodes. In your YARN command the third argument with the ES host is missing; also, I believe your command should be using spark-submit to submit the application.
Can you please try setting the spark.es.nodes and es.port properties in SparkConf as shown below:
sparkConf.set("spark.es.nodes", args[2]);
sparkConf.set("es.port", args[3]); // HTTP Port of elastic search
And use the below command to run the app on yarn:
spark-submit --class spark.examples.SparkToElasticSearchStreaming --master yarn-cluster --executor-memory 400m --num-executors 1 ./target/hbase-spark-playground-1.0-SNAPSHOT.jar yarn-cluster collection-name localhost 9200
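For reference, a minimal sketch of the driver wiring (assuming the elasticsearch-hadoop Java API; the class name and sample documents are illustrative, not taken from the gist):

import java.util.Arrays;
import java.util.Collections;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

public class SparkToEsExample {
    public static void main(String[] args) {
        // args: <master> <collection> <es host> <es port>, mirroring the spark-submit command above
        SparkConf sparkConf = new SparkConf()
                .setAppName("spark-to-es-example")
                .set("es.nodes", args[2])
                .set("es.port", args[3]);   // Elasticsearch HTTP port, e.g. 9200
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        JavaRDD<Map<String, String>> docs = sc.parallelize(Arrays.asList(
                Collections.singletonMap("message", "hello"),
                Collections.singletonMap("message", "world")));
        // Each map becomes one document in the test/collection index/type
        JavaEsSpark.saveToEs(docs, "test/collection");
        sc.stop();
    }
}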
