How to report JMX from Spark Streaming on EC2 to VisualVM? - amazon-ec2

I have been trying to get a Spark Streaming job, running on a EC2 instance to report to VisualVM using JMX.
As of now I have the following config file:
spark/conf/metrics.properties:
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
And I start the spark streaming job like this:
(the -D bits I have added afterwards in the hopes of getting remote access to the ec2's jmx)
terminal:
spark/bin/spark-submit --class my.class.StarterApp --master local --deploy-mode client \
project-1.0-SNAPSHOT.jar \
-Dcom.sun.management.jmxremote \
-Dcom.sun.management.jmxremote.port=54321 \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.ssl=false

There are two issues with the spark-submit command line:
local - you must not run Spark Standalone with local master URL because there will be no threads to run your computations (jobs) and you've got two, i.e. one for a receiver and another for the driver. You should see the following WARN in the logs:
WARN StreamingContext: spark.master should be set as local[n], n > 1
in local mode if you have receivers to get data, otherwise Spark jobs
will not get resources to process the received data.
-D options are not picked up by the JVM as they're given after the Spark Streaming application and effectively became its command-line arguments. Put them before project-1.0-SNAPSHOT.jar and start over (you have to fix the above issue first!)

spark-submit --conf "spark.driver.extraJavaOptions=-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=8090 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"/path/example/src/main/python/pi.py 10000
Notes:the configurations format : --conf "params" . tested under spark 2.+

Related

how to switch between cluster types in Apache Spark

I'm trying to switch cluster manager from standalone to 'YARN' in Apache Spark that I've installed for learning.
I read following thread to understand which cluster type should be chosen
However, I'd like to know the steps/syntax to change the cluster type.
Ex: from Standalone to YARN or from YARN to Standalone.
In spark there is one function name as --master that can helps you to execute your script on yarn Cluster mode or standalone mode.
Run the application on local mode or standalone used this with spark-submit command
--master Local[*]
or
--master spark://192.168.10.01:7077 \
--deploy-mode cluster \
Run on a YARN cluster
--master yarn
--deploy-mode cluster
For more information kindly visit this link.
https://spark.apache.org/docs/latest/submitting-applications.html
If you are not running through command line then you can directly set this master on SparkConf object.
sparkConf.setMaster(http://path/to/master/url:port) in cluster mode
or
sparkConf.setMaster(local[*]) in client/local mode

Launching a Spark-Submit job under Kerberos for Kafka

Through tinkering I have been able to partially launch a spark submit job using the following command, however soon after starting it crashes and gives me the exception outlined below:
Spark-Submit Command:
su spark -c 'export SPARK_MAJOR_VERSION=2; spark-submit \
--verbose \
--master yarn \
--driver-cores 5 \
--num-executors 3 --executor-cores 6 \
--principal spark#test.com \
--keytab /etc/security/keytabs/spark.headless.keytab \
--driver-java-options "-Djava.security.auth.login.config=kafka_client_jaas.conf"\
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=kafka_client_jaas.conf" \
--files "/tmp/kafka_client_jaas.conf,/tmp/kafka.service.keytab" \
--class au.com.XXX.XXX.spark.test.test test.jar application.properties'
EXCEPTION:
Caused by: org.apache.kafka.common.KafkaException: javax.security.auth.login.LoginException: Could not login: the client is being asked for a password, but the Kafka client code does not currently support obtaining a password from the user. not available to garner authentication information from the user
WARN KerberosLogin: [Principal=kafka/test.com#test.com]: TGT renewal thread has been interrupted and will exit.
How can I get Kerberos to KINIT two principals at the same time? I'm assuming that is the problem here? I have tried adding another set of --principal/--keytab to the initial command, although this presented more permission issues within HDFS.
It's an old thread, but I struggled with this for some time and hopefully this can help someone.
The possible cause is that the Spark executors are not being able to locate the keytab, so they are failling to authenticate to Kerberos. On your submit, you should pass your Jaas config and Keytab files to your executors using the following options:
spark-submit --master yarn --deploy-mode cluster --files /path/to/keytab/yourkeytab.keytab#yourkeytab.keytab,/path/to/jaas/your-kafka-jaas.conf#your-kafka-jaas.conf --conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=your-kafka-jaas.conf" --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=your-kafka-jaas.conf" --driver-java-options "-Djava.security.auth.login.config=your-kafka-jaas.conf" your-application.jar
Finally, since these jaas files are being sent to executors (and the spark driver), you should use the relative path for the Keytab, and not the absolute. Your jaas config then should have the following line:
keyTab="./yourkeytab.keytab"

Missing java system properties when running spark-streaming on Mesos cluster

I submit a spark app to mesos cluster(running in cluster mode), and pass java system property through "--drive-java-options=-Dkey=value -Dkey=value", however these system properties are not available at runtime, seems they are not set. --conf "spark.driver.extraJavaOptions=-Dkey=value" doesn't work either
More details:
the command is
bin/spark-submit --master mesos://10.3.101.119:7077 --deploy-mode cluster --class ${classname} --driver-java-options "-Dconfiguration.http=http://10.3.101.119:9090/application.conf" --conf "spark.executor.extraJavaOptions=-Dconfiguration.http=http://10.3.101.119:9090/application.conf" ${jar file}
I have a two-node mesos cluster, one node both runs master and slave, and the other runs slave only. I submit the spark application on master node.
Internally, the application hopes to read a configuration file from java system property "configuration.http", if the property is not available, the application will load a default file from the root of the classpath. When I submit the application, from the logs, i saw the default configuration file is loaded.
And the actual command to run the application is
"sh -c '/home/ubuntu/spark-1.6.0/bin/spark-submit --name ${appName} --master mesos://zk://10.3.101.184:2181/mesos/grant --driver-cores 1.0 --driver-memory 1024M --class ${classname} ./${jar file} '"
from here you can see the system property is lost
You might have a look at this blog post which recommends using an external properties file for this purpose:
$ vi app.properties
spark.driver.extraJavaOptions -Dconfiguration.http=http://10.3.101.119:9090/application.conf
spark.executor.extraJavaOptions –Dconfiguration.http=http://10.3.101.119:9090/application.conf
Then try to run this via
bin/spark-submit --master mesos://10.3.101.119:7077 --deploy-mode cluster --class ${classname} —-properties-file app.properties ${jar file}
See
How to pass -D parameter or environment variable to Spark job?
Separate logs from Apache spark

spark-submit --proxy-user do not work in yarn cluster mode

Currently I am using a cloudera hadoop single node cluster (kerberos enabled.)
In client mode I use following commands
kinit
spark-submit --master yarn-client --proxy-user cloudera examples/src/main/python/pi.py
This works fine. In cluster mode I use following command (no kinit done and no TGT is present in the cache)
spark-submit --principal <myprinc> --keytab <KT location> --master yarn-cluster examples/src/main/python/pi.py
Also works fine. But when I use following command in cluster mode (no kinit done and no TGT is present in the cache)
spark-submit --principal <myprinc> --keytab <KT location> --master yarn-cluster --proxy-user <proxy-user> examples/src/main/python/pi.py
throws following error
<proxy-user> tries to renew a token with renewer <myprinc>
I guess in cluster mode the spark-submit do not look for TGT in the client machine... it transfers the "keytab" file to the cluster and then starts the spark job. So why does the specifying "--proxy-user" option looks for TGT while submitting in the "yarn-cluster" mode. Am I doing some thing wrong.
Spark doesn't allow to submit keytab and principal with proxy-user. The feature description in the official documentation for YARN mode (second paragraph) states specifically that you need keytab and principal when you are running long running jobs. This enables the application to continue working with any security issue.
Imagine if all application users logging into your applications can proxy to your keytab.
I have to do what Hive does to run "spark-submit". Basically kinit before submitting my application and then provide a proxy-user. So here is how I solved it.
kinit # -k -t
spark-submit with --proxy-user
is best implementation. So no your are not doing anything wrong.

Spark-submit not working when application jar is in hdfs

I'm trying to run a spark application using bin/spark-submit. When I reference my application jar inside my local filesystem, it works. However, when I copied my application jar to a directory in hdfs, i get the following exception:
Warning: Skip remote jar hdfs://localhost:9000/user/hdfs/jars/simple-project-1.0-SNAPSHOT.jar.
java.lang.ClassNotFoundException: com.example.SimpleApp
Here's the command:
$ ./bin/spark-submit --class com.example.SimpleApp --master local hdfs://localhost:9000/user/hdfs/jars/simple-project-1.0-SNAPSHOT.jar
I'm using hadoop version 2.6.0, spark version 1.2.1
The only way it worked for me, when I was using
--master yarn-cluster
To make HDFS library accessible to spark-job , you have to run job in cluster mode.
$SPARK_HOME/bin/spark-submit \
--deploy-mode cluster \
--class <main_class> \
--master yarn-cluster \
hdfs://myhost:8020/user/root/myjar.jar
Also, There is Spark JIRA raised for client mode which is not supported yet.
SPARK-10643 :Support HDFS application download in client mode spark submit
There is a workaround. You could mount the directory in HDFS (which contains your application jar) as local directory.
I did the same (with azure blob storage, but it should be similar for HDFS)
example command for azure wasb
sudo mount -t cifs //{storageAccountName}.file.core.windows.net/{directoryName} {local directory path} -o vers=3.0,username={storageAccountName},password={storageAccountKey},dir_mode=0777,file_mode=0777
Now, in your spark submit command, you provide the path from the command above
$ ./bin/spark-submit --class com.example.SimpleApp --master local {local directory path}/simple-project-1.0-SNAPSHOT.jar
spark-submit --master spark://kssr-virtual-machine:7077 --deploy-mode client --executor-memory 1g hdfs://localhost:9000/user/wordcount.py
For me its working I am using Hadoop 3.3.1 & Spark 3.2.1. I am able to read the file from HDFS.
Yes, it has to be a local file. I think that's simply the answer.

Resources