The Problem: As expected, OS users are able to launch and own a Spark Streaming application. However, when we try to run a job where the owner of the application is not an OS user, Spark Streaming returns an error saying that the user was not found, as you can see in the output of the spark-submit command:
main : run as user is 'user_name'
main : requested yarn user is 'user_name'
User 'user_name' not found
I have already seen this error in some other forums, where the recommendation was to create the OS user, but unfortunately that is not an option here. In Storm applications a Kerberos-only user can be used in combination with an OS user, but this does not seem to be the case in Spark.
What I have tried so far: The closest I could get was to use two OS users, where one has read access to the keytab file of the second. I ran the application from the first to 'impersonate' the second, and the second appears as the owner. No errors appear as long as both are OS users, but it does fail when I use a Kerberos-only user as the second one. Below is the submitted spark-streaming command (by the way, both are also HDFS users, otherwise it would not be possible to launch at all):
spark-submit --master yarn --deploy-mode cluster \
  --keytab /etc/security/keytabs/user_name.keytab \
  --principal kerberosOnlyUser@LOCAL \
  --files ./spark_jaas.conf#spark_jaas.conf,./user_name_copy.keytab#user_name_copy.keytab \
  --conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=./spark_jaas.conf" \
  --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./spark_jaas.conf" \
  --driver-java-options "-Djava.security.auth.login.config=./spark_jaas.conf" \
  --conf spark.yarn.submit.waitAppCompletion=true \
  --class ...
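For reference, the principals actually contained in the keytab can be listed, and a ticket for the Kerberos-only principal obtained manually, with the standard Kerberos tools (the realm LOCAL is taken from the principal above and may differ in your setup):
klist -kt /etc/security/keytabs/user_name.keytab
kinit -kt /etc/security/keytabs/user_name.keytab kerberosOnlyUser@LOCAL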
I also tried the alternative with the --proxy-user option, but the same error was returned.
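For completeness, the proxy-user variant I tried was roughly of the following form (the realm and the kinit principal are assumptions based on the names above, and impersonation additionally requires matching hadoop.proxyuser.* settings in core-site.xml):
kinit -kt /etc/security/keytabs/user_name.keytab user_name@LOCAL
spark-submit --master yarn --deploy-mode cluster --proxy-user kerberosOnlyUser --conf spark.yarn.submit.waitAppCompletion=true --class ...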
Is it really not possible to use a Kerberos-only user in Spark? Or is there a workaround?
The environment is:
Spark 2.3.0 on YARN.
Hadoop 2.7.3.
Thanks a lot for your help!
Related
I need to get the id of a specific Hadoop job.
In my case, I launch a Sqoop command remotely and I want to verify the job status with this command:
hadoop job -status job_id | grep -w 'state'
I can get this information from the GUI, but I want to do it from the command line.
Can anyone help me?
You can use the YARN REST APIs, via your browser or with curl from the command line. They list all currently running and previously run jobs, including Sqoop jobs and the MapReduce jobs that Sqoop generates and executes. Use the UI first if you have it up and running: just point your browser to http://<host>:8088/cluster (not sure if the port is the same on all Hadoop distributions; I believe 8088 is the default on Apache). Alternatively, you can use the YARN commands directly, e.g. yarn application -list.
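For example (the ResourceManager host is a placeholder, 8088 assumes the default port, and <application_id> stands for the id you are after):
yarn application -list -appStates ALL            # list applications and their states
yarn application -status <application_id>        # status of a single application
curl http://<rm-host>:8088/ws/v1/cluster/apps/<application_id>   # same information via the RM REST API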
I want to run a Maven project in Spark cluster mode. I have the application jar file. I also have one master and six workers in working condition. But when I execute the jar application, the work is not distributed among the workers. The following is the command I ran from the Spark directory:
./bin/spark-submit --class org.deeplearning4j.mlp.MnistMLPExample --master spark://115.145.173.152:7077 --driver-memory 10g /home/hadoop/Niki/mnist/target/dl4j-spark-0.7-SNAPSHOT-bin.jar
If I add another parameter, --deploy-mode cluster, then it throws the following exception:
Exception in thread "main" com.beust.jcommander.ParameterException: Unknown option: --deploy-mode
Can anyone help me out? Thanks a lot.
Hi Nikitha, yes, you need the jar file on all worker nodes, because Spark transformations and actions execute on the worker nodes, and if they use this path they look for the file on their local filesystem. So distribute it to all worker nodes as well. Also, can you please tell why you use this jar file path in the Spark code?
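For example, a simple copy loop could place the jar at the same local path on every worker (the worker hostnames below are placeholders, and the target directory is assumed to exist on each node):
for w in worker1 worker2 worker3; do
  scp /home/hadoop/Niki/mnist/target/dl4j-spark-0.7-SNAPSHOT-bin.jar $w:/home/hadoop/Niki/mnist/target/
done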
You are running Spark in standalone mode. There is no cluster/client mode in standalone; it is relevant in YARN only.
Currently I am using a Cloudera Hadoop single-node cluster (Kerberos enabled).
In client mode I use the following commands:
kinit
spark-submit --master yarn-client --proxy-user cloudera examples/src/main/python/pi.py
This works fine. In cluster mode I use the following command (no kinit done and no TGT present in the cache):
spark-submit --principal <myprinc> --keytab <KT location> --master yarn-cluster examples/src/main/python/pi.py
This also works fine. But when I use the following command in cluster mode (no kinit done and no TGT present in the cache),
spark-submit --principal <myprinc> --keytab <KT location> --master yarn-cluster --proxy-user <proxy-user> examples/src/main/python/pi.py
it throws the following error:
<proxy-user> tries to renew a token with renewer <myprinc>
I guess that in cluster mode spark-submit does not look for a TGT on the client machine; it transfers the keytab file to the cluster and then starts the Spark job. So why does specifying the --proxy-user option look for a TGT when submitting in yarn-cluster mode? Am I doing something wrong?
Spark doesn't allow submitting a keytab and principal together with --proxy-user. The description of this feature in the official documentation for YARN mode (second paragraph) states specifically that you need the keytab and principal when you are running long-running jobs; this enables the application to keep working without any security issue.
Imagine if every user logging into your application could proxy to your keytab.
I have to do what Hive does to run spark-submit: basically, kinit before submitting my application and then provide a proxy user. So here is how I solved it:
kinit -k -t <KT location> <myprinc>
spark-submit --master yarn-cluster --proxy-user <proxy-user> examples/src/main/python/pi.py
This is the best implementation, so no, you are not doing anything wrong.
I have created a Spark 'Hello World' application that works well locally through the Eclipse IDE.
I would like to deploy this application remotely from my local machine to the VirtualBox Cloudera machine, using spark-submit.
The command line used for that is:
C:\Users\S-LAMARTI\Desktop\AXA\Workspaces\AXA\helloworld\target>%SPARK_HOME%/spark-submit --class com.saadlamarti.helloworld.App --master spark://192.168.56.102:7077 --deploy-mode cluster helloworld-0.0.1-SNAPSHOT.jar
Unfortunately, the application doesn't work, and I get this error message:
15/10/12 12:20:40 WARN RestSubmissionClient: Unable to connect to server spark://192.168.56.102:7077.
Warning: Master endpoint spark://192.168.56.102:7077 was not a REST server. Falling back to legacy submission gateway instead.
Does anyone have an idea why it is not working?
Remove the argument --deploy-mode cluster and try again.
Check the master UI at master:8080; there you can see two URLs: one is the client submission URL, the other is the REST URL for cluster mode.
Find your REST URL: if you set the argument --deploy-mode cluster, you must set --master to that spark://<REST URL>.
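To illustrate (the REST port below is the usual standalone default of 6066, so treat it as an assumption and confirm it on the master:8080 page): client-mode submissions go to the legacy spark://host:7077 endpoint, while a cluster-mode submission should target the REST URL, e.g.:
%SPARK_HOME%/spark-submit --class com.saadlamarti.helloworld.App --master spark://192.168.56.102:6066 --deploy-mode cluster helloworld-0.0.1-SNAPSHOT.jar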
I am running Spark 1.1.0, HDP 2.1, on a kerberized cluster. I can successfully run spark-submit using --master yarn-client and the results are properly written to HDFS; however, the job doesn't show up on the Hadoop All Applications page. I want to run spark-submit using --master yarn-cluster, but I continue to get this error:
appDiagnostics: Application application_1417686359838_0012 failed 2 times due to AM Container
for appattempt_1417686359838_0012_000002 exited with exitCode: -1000 due to: File does not
exist: hdfs://<HOST>/user/<username>/.sparkStaging/application_<numbers>_<more numbers>/spark-assembly-1.1.0-hadoop2.4.0.jar
.Failing this attempt.. Failing the application.
I've provisioned my account with access to the cluster. I've configured yarn-site.xml. I've cleared .sparkStaging. I've tried including --jars [path to my spark assembly in spark/lib]. I've found this question, which is very similar yet unanswered. I can't tell if this is an HDP 2.1 issue, a Spark 1.1.0 issue, the kerberized cluster, the configuration, or something else. Any help would be much appreciated.
This is probably because you left sparkConf.setMaster("local[n]") in the code.