Spark-shell --conf option for Kryo/Java serializer - hadoop

I need to launch the spark shell with custom classes registered via the registerKryoClasses method, as described on the Spark help page.
As mentioned on that page, I cannot recreate the sc variable after launching the spark shell, so I need to pass the setting with the --conf option when launching the spark-shell command.
What should the --conf option value be so that it is equivalent to the following call:
conf.registerKryoClasses(Array(classOf[ArrayBuffer[String]], classOf[ListBuffer[String]]))

The option to use the JavaSerializer instead of the Kryo serializer worked for me:
spark-shell --conf 'spark.serializer=org.apache.spark.serializer.JavaSerializer'
Edit: I just figured out how to use the options. We can do the following:
--conf 'spark.kryo.classesToRegister=scala.collection.mutable.ArrayBuffer,scala.collection.mutable.ListBuffer'
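For completeness, the two options can be combined into a single launch command that switches to Kryo and registers both classes (a sketch that simply puts the pieces above together):
# switch the serializer to Kryo and register the two collection classes at launch
spark-shell \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.kryo.classesToRegister=scala.collection.mutable.ArrayBuffer,scala.collection.mutable.ListBuffer'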

Related

How to launch a Spark Streaming YARN application with Kerberos-Only Users?

The Problem: As expected, OS users are able to launch and own a Spark Streaming application. However, when we try to run a job where the owner of the application is not an OS user, Spark Streaming returns an error saying that the user was not found, as you can see in the output of the spark-submit command:
main : run as user is 'user_name'
main : requested yarn user is 'user_name'
User 'user_name' not found
I have already seen this error on some other forums, and the recommendation was to create the OS user, but unfortunately that is not an option here. In Storm applications a Kerberos-only user can be used in combination with an OS user, but this does not seem to be the case in Spark.
What I have tried so far: The closest I could get was to use two OS users, where one has read access to the keytab file of the second one. I ran the application from the first to 'impersonate' the second, and the second appears as the owner. No errors appear as both are OS users, but it does fail when I use a Kerberos-only user as the second. Below is the submitted command for Spark Streaming (BTW, both are also HDFS users, otherwise it would not be possible to launch at all):
spark-submit --master yarn --deploy-mode cluster \
  --keytab /etc/security/keytabs/user_name.keytab \
  --principal kerberosOnlyUser@LOCAL \
  --files ./spark_jaas.conf#spark_jaas.conf,./user_name_copy.keytab#user_name_copy.keytab \
  --conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=./spark_jaas.conf" \
  --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./spark_jaas.conf" \
  --driver-java-options "-Djava.security.auth.login.config=./spark_jaas.conf" \
  --conf spark.yarn.submit.waitAppCompletion=true --class ...
I also tried the alternative with the --proxy-user command, but the same error was returned.
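For reference, that attempt looked roughly like the sketch below (the exact arguments are an assumption; spark-submit does not allow --proxy-user together with --principal/--keytab, so the keytab options are dropped here):
# sketch of the --proxy-user variant, launched by an OS user on behalf of the Kerberos-only user
spark-submit --master yarn --deploy-mode cluster \
  --proxy-user kerberosOnlyUser \
  --conf spark.yarn.submit.waitAppCompletion=true --class ...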
Is it really not possible to use a Kerberos-only user in spark? Or is there a workaround?
The environment is:
Spark 2.3.0 on YARN.
Hadoop 2.7.3.
Thanks a lot for your help!

How to run a spark-shell?

I have downloaded this: https://github.com/sryza/spark-timeseries, and followed these instructions
milenko@milenko-desktop:~/spark-timeseries$ spark --jars /home/milenko/spark-timeseries/target/sparkts-0.4.0-SNAPSHOT-jar-with-dependencies.jar
I got this:
Invalid command line option:
If I try what was suggested in the tutorial:
spark-shell --jars /home/milenko/spark-timeseries/target/sparkts-0.4.0-SNAPSHOT-jar-with-dependencies.jar
spark-shell: command not found
Why?
Use the full SPARK_HOME/bin/spark-shell path, or update your PATH environment variable to contain SPARK_HOME/bin:
bin/spark-shell
This will start the Spark session and the Scala REPL.
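A minimal sketch, assuming Spark was unpacked under /opt/spark (adjust the path to your installation):
export SPARK_HOME=/opt/spark                  # assumed install location
export PATH="$SPARK_HOME/bin:$PATH"           # makes spark-shell resolvable from any directory
spark-shell --jars /home/milenko/spark-timeseries/target/sparkts-0.4.0-SNAPSHOT-jar-with-dependencies.jar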

HDP: How to change HADOOP_CLASSPATH value

I need to add a value to the HADOOP_CLASSPATH environment variable, according to this troubleshoot article: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_installing_manually_book/content/troubleshooting-phoenix.html
When I type echo $HADOOP_CLASSPATH in the console I get an empty result back. I think I need to set this value in a configuration .xml file...
Where or how can I set this new value to the environment variable?
Can I set it in spark-submit?
You can add the HADOOP_CONF_DIR environment variable in spark-env.sh, so whenever you run spark-submit it will automatically pick up all environment variables. The value of this variable is the path to the Hadoop configuration:
export HADOOP_CONF_DIR=<path to Hadoop conf>   # to point Spark towards Hadoop configuration files
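For HADOOP_CLASSPATH itself, a minimal sketch is to export it the same way before launching; the jar path below is only an example, borrowed from the answer that follows:
export HADOOP_CLASSPATH=/usr/hdp/current/hbase-client/lib/hbase-protocol-1.1.2.2.4.2.0-258.jar:$HADOOP_CLASSPATH   # prepend the Phoenix/HBase protocol jar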
The error can be avoided by adding the jar path to the spark-submit call via the --driver-class-path parameter:
spark-submit --class sparkhbase.PhoenixTest --master yarn --deploy-mode client --driver-class-path "/usr/hdp/current/hbase-client/lib/hbase-protocol-1.1.2.2.4.2.0-258.jar" /home/test/app.jar
It also worked by setting the same jar through the --conf parameter like this:
spark-submit --class sparkhbase.PhoenixTest --master yarn --deploy-mode client --conf "spark.driver.extraClassPath=/usr/hdp/current/hbase-client/lib/hbase-protocol-1.1.2.2.4.2.0-258.jar" /home/test/app.jar
Setting ONE of them should do it!
Also add --conf "spark.executor.extraClassPath=/usr/hdp/current/hbase-client/lib/hbase-protocol-1.1.2.2.4.2.0-258.jar" to your spark-submit if you still get an exception (this can happen when the code runs on the executors rather than on the driver).
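Combined, a single submit that covers both the driver and the executors might look like this (a sketch reusing the paths from above):
spark-submit --class sparkhbase.PhoenixTest --master yarn --deploy-mode client \
  --conf "spark.driver.extraClassPath=/usr/hdp/current/hbase-client/lib/hbase-protocol-1.1.2.2.4.2.0-258.jar" \
  --conf "spark.executor.extraClassPath=/usr/hdp/current/hbase-client/lib/hbase-protocol-1.1.2.2.4.2.0-258.jar" \
  /home/test/app.jar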

Setting S3 output file grantees for spark output files

I'm running Spark on AWS EMR and I'm having some issues getting the correct permissions on the output files (rdd.saveAsTextFile('<file_dir_name>')). In Hive, I would add a line at the beginning with set fs.s3.canned.acl=BucketOwnerFullControl and that would set the correct permissions. For Spark, I tried running:
hadoop jar /mnt/var/lib/hadoop/steps/s-3HIRLHJJXV3SJ/script-runner.jar \
/home/hadoop/spark/bin/spark-submit --deploy-mode cluster --master yarn-cluster \
--conf "spark.driver.extraJavaOptions -Dfs.s3.canned.acl=BucketOwnerFullControl" \
hdfs:///user/hadoop/spark.py
But the permissions do not get set properly on the output files. What is the proper way to pass in the 'fs.s3.canned.acl=BucketOwnerFullControl' or any of the S3 canned permissions to the spark job?
Thanks in advance
I found the solution. In the job, you have to access the JavaSparkContext and from there get the Hadoop configuration and set the parameter there. For example:
sc._jsc.hadoopConfiguration().set('fs.s3.canned.acl','BucketOwnerFullControl')
The proper way to pass Hadoop config keys in Spark is to use --conf with keys prefixed with spark.hadoop.. Your command would look like:
hadoop jar /mnt/var/lib/hadoop/steps/s-3HIRLHJJXV3SJ/script-runner.jar \
/home/hadoop/spark/bin/spark-submit --deploy-mode cluster --master yarn-cluster \
--conf "spark.hadoop.fs.s3.canned.acl=BucketOwnerFullControl" \
hdfs:///user/hadoop/spark.py
Unfortunately, I cannot find any reference to this in the official Spark documentation.

Spark Streaming to ElasticSearch

I'm trying to replicate the example Streamlining Search Indexing using Elastic Search by Holden Karau using the Spark Java API. I've successfully made it work as a normal Java application with some changes to the code. Instead of using the saveAsHadoopDataset method, I'm sending my tweets with:
JavaEsSpark.saveToEs(rdd,"/test/collection");
and running my code with:
java -cp ./target/hbase-spark-playground-1.0-SNAPSHOT.jar spark.examples.SparkToElasticSearchStreaming local[2] collection-name
My current problem is how to execute it on a YARN cluster. A code snippet of what I'm doing can be found here:
https://gist.github.com/IvanFernandez/b3a3e25397f8b402256b
and running my class this way:
spark.examples.SparkToElasticSearchStreaming --master yarn-cluster --executor-memory 400m --num-executors 1 ./target/hbase-spark-playground-1.0-SNAPSHOT.jar yarn-cluster collection-name
I think the main problem is that I don't have any Elasticsearch configuration in the foreach transformation, so I can't reach my Elasticsearch master. Any ideas?
The ES cluster and other configuration information should be set in the SparkConf, which is already done in your code snippet where args[2] is set as es.nodes. In your YARN command the third argument with the ES host is missing; also, I believe your command is not using spark-submit to submit the application.
Can you please try setting the spark.es.nodes and es.port properties in SparkConf as shown below:
sparkConf.set("spark.es.nodes", args[2]);
sparkConf.set("es.port", args[3]); // HTTP Port of elastic search
And use the command below to run the app on YARN:
spark-submit --class spark.examples.SparkToElasticSearchStreaming --master yarn-cluster --executor-memory 400m --num-executors 1 ./target/hbase-spark-playground-1.0-SNAPSHOT.jar yarn-cluster collection-name localhost 9200
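Alternatively (a sketch, assuming the elasticsearch-hadoop connector picks up spark.-prefixed es.* keys from SparkConf and the code does not override them from args), the ES settings can be passed on the command line instead of as positional arguments:
spark-submit --class spark.examples.SparkToElasticSearchStreaming --master yarn-cluster \
  --executor-memory 400m --num-executors 1 \
  --conf "spark.es.nodes=localhost" --conf "spark.es.port=9200" \
  ./target/hbase-spark-playground-1.0-SNAPSHOT.jar yarn-cluster collection-name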
