I'm bulk loading data into HBase using Spark. My Python script does the job perfectly, but I need to be able to submit it with spark-submit so that I can run it on the cluster.
When I run the script locally with the following:
#!/bin/bash
sudo /usr/hdp/current/spark-client/bin/spark-submit \
--master local[*] \
--deploy-mode client \
--verbose \
--num-executors 3 \
--executor-cores 1 \
--executor-memory 512m \
--driver-memory 512m \
--conf spark.logConf=true \
/test/BulkLoader.py
it works perfectly: it loads the data, writes the HFiles, and bulk loads them. However, when I run the code on YARN as follows:
#!/bin/bash
sudo /usr/hdp/current/spark-client/bin/spark-submit \
--master yarn \
--deploy-mode client \
--verbose \
--num-executors 3 \
--executor-cores 1 \
--executor-memory 512m \
--driver-memory 512m \
--conf spark.logConf=true \
--conf spark.speculation=false \
/test/BulkLoader.py
things go wrong quickly. As soon as the script tries to write the HFile, I get the following error:
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.saveAsNewAPIHadoopFile.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 15.0 failed 26 times, most recent failure:
Lost task 0.25 in stage 15.0 (TID 67, sandbox.hortonworks.com): java.io.FileNotFoundException: File file:/tmp/hfiles-06-46-57/_temporary/0/_temporary/attempt_201602150647_0019_r_000000_25/f1 does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:609)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:822)
...
When writing the HFiles, a FileNotFoundException is raised on a _temporary directory. I looked around and found that many other people have hit similar errors (here and here), but none of the suggested fixes worked for me. I set the number of executors to 1 and set speculation to false, since speculative execution was suggested as the likely cause, but the problem persists. If anybody could suggest other options for me to explore, I'd be grateful.
Related
I have installed Hadoop 2.7.1 with Spark 1.4.1 on Windows 8.1.
When I execute the command below
cd spark
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client lib/spark-examples*.jar 10
I get the following error in the JobHistoryServer log:
Error: Could not find or load main class '-Dspark.externalBlockStore.folderName=spark-262c4697-ef0c-4042-af0c-8106b08574fb'
I did some further debugging (along with searching the net) and managed to get hold of the container cmd script, in which the sections below appear (other lines are omitted):
...
#set CLASSPATH=C:/tmp/hadoop-xyz/nm-local-dir/usercache/xyz/appcache/application_1487502025818_0003/container_1487502025818_0003_02_000001/classpath-3207656532274684591.jar
...
#call %JAVA_HOME%/bin/java -server -Xmx512m -Djava.io.tmpdir=%PWD%/tmp '-Dspark.fileserver.uri=http://192.168.1.2:34814' '-Dspark.app.name=Spark shell' '-Dspark.driver.port=34810' '-Dspark.repl.class.uri=http://192.168.1.2:34785' '-Dspark.driver.host=192.168.1.2' '-Dspark.externalBlockStore.folderName=spark-dd9f3f84-6cf4-4ff8-b0f6-7ff84daf74bc' '-Dspark.master=yarn-client' '-Dspark.driver.appUIAddress=http://192.168.1.2:4040' '-Dspark.jars=' '-Dspark.executor.id=driver' -Dspark.yarn.app.container.log.dir=/dep/logs/userlogs/application_1487502025818_0003/container_1487502025818_0003_02_000001 org.apache.spark.deploy.yarn.ExecutorLauncher --arg '192.168.1.2:34810' --executor-memory 1024m --executor-cores 1 --num-executors 2 1> /dep/logs/userlogs/application_1487502025818_0003/container_1487502025818_0003_02_000001/stdout 2> /dep/logs/userlogs/application_1487502025818_0003/container_1487502025818_0003_02_000001/stderr
I checked the relevant files for the CLASSPATH and they look OK. The main class org.apache.spark.deploy.yarn.ExecutorLauncher is available in the Spark assembly jar, which is part of the container classpath jar.
So what could be the issue here? I searched the net and found many discussions, but they are mostly for Unix variants, not Windows. I am wondering whether spark-submit really works on Windows in yarn-client mode without any special setup (standalone cluster mode works).
By the way, if I run the above java command from a cmd.exe prompt, I get the same error, since all the command-line arguments are quoted with single quotes instead of double quotes (changing them to double quotes makes it work). So is this a bug?
Note that spark-shell also fails in yarn mode, but the yarn jar ... command works.
It looks like this was a defect in the earlier versions. With the latest Hadoop 2.7.3 and Spark 2.1.0 it works correctly, although I could not find any reference for the fix.
I've submitted my Spark job on the ambari-server using the following command:
./spark-submit --class customer.core.classname --master yarn --num-executors 2 --driver-memory 2g --executor-memory 2g --executor-cores 1 /home/hdfs/Test/classname-0.0.1-SNAPSHOT-SNAPSHOT.jar newdata host:6667
and it works fine.
But how can I make it keep running? Even if we close the command prompt or kill the spark-submit process, the job must keep on running.
Any help is appreciated.
You can achieve this in a couple of ways:
1) Run the spark-submit driver process in the background using nohup. For example:
nohup ./spark-submit --class customer.core.classname \
--master yarn --num-executors 2 \
--driver-memory 2g --executor-memory 2g --executor-cores 1 \
/home/hdfs/Test/classname-0.0.1-SNAPSHOT-SNAPSHOT.jar \
newdata host:6667 &
2) Run with --deploy-mode cluster so that the driver process runs on a different node, managed by YARN; see the sketch below.
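For option 2, the same job submitted in cluster mode would look roughly like this (same application and arguments as above, only the deploy mode is added):
./spark-submit --class customer.core.classname \
--master yarn --deploy-mode cluster \
--num-executors 2 \
--driver-memory 2g --executor-memory 2g --executor-cores 1 \
/home/hdfs/Test/classname-0.0.1-SNAPSHOT-SNAPSHOT.jar \
newdata host:6667
In cluster mode the driver runs inside the YARN ApplicationMaster on a cluster node, so closing your local terminal does not stop the job.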
I think this question is more about the shell than about Spark.
To keep an application running even after you close the shell, you should add & at the end of your command. So your spark-submit command becomes (just add the & at the end):
./spark-submit --class customer.core.classname --master yarn --num-executors 2 --driver-memory 2g --executor-memory 2g --executor-cores 1 /home/hdfs/Test/classname-0.0.1-SNAPSHOT-SNAPSHOT.jar newdata host:6667 &
[1] 28299
You will still get the logs and output messages in the terminal unless you redirect them, for example as shown below.
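If you would rather capture that output in a file instead of the terminal, redirect it when you submit (the log file name is just an example):
./spark-submit --class customer.core.classname --master yarn --num-executors 2 \
--driver-memory 2g --executor-memory 2g --executor-cores 1 \
/home/hdfs/Test/classname-0.0.1-SNAPSHOT-SNAPSHOT.jar newdata host:6667 \
> spark-job.log 2>&1 &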
I hope I understand the question. In general, if you want a process to keep running, you can run it in the background. In your case, the job will continue running until you specifically kill it with yarn application -kill; even if you kill the spark-submit process, it will continue to run, since YARN manages the application after submission.
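To stop it explicitly, use the YARN CLI rather than killing the local spark-submit process (the application id below is a placeholder):
yarn application -list
yarn application -kill <application_id>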
Warning: I didn't test this, but the better way to do what you describe is probably to use the following settings:
--deploy-mode cluster \
--conf spark.yarn.submit.waitAppCompletion=false
Found here:
How to exit spark-submit after the submission
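Combined with the command from the question, that would look roughly like this (untested, as noted above):
./spark-submit --class customer.core.classname \
--master yarn --deploy-mode cluster \
--conf spark.yarn.submit.waitAppCompletion=false \
--num-executors 2 --driver-memory 2g --executor-memory 2g --executor-cores 1 \
/home/hdfs/Test/classname-0.0.1-SNAPSHOT-SNAPSHOT.jar \
newdata host:6667
With waitAppCompletion set to false, spark-submit returns as soon as the application is accepted by YARN instead of waiting for it to finish.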
I ran into the following issue: Spark cannot launch on YARN on Windows.
15/06/05 06:31:34 ERROR spark.SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:114)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:59)
And when I drill down into YARN, I get the error:
Error: Could not find or load main class '-Dspark.driver.memory=2G'
After investigating this issue, the root cause is:
In the YARN command generation, single quotes (') are added around some of the Java options. Windows cmd does not treat single quotes as quoting characters, so java receives the quoted option as a literal argument and tries to load it as the main class.
This is similar to https://issues.apache.org/jira/browse/SPARK-5754.
Does anyone know how to work around this quoting logic on a Windows YARN cluster?
Spark Submit Command:
%SPARK_HOME%\bin\spark-submit.cmd --jars ... ^
--class ....^
--master yarn-client ^
--driver-memory 10G ^
--executor-memory 20G ^
--executor-cores 6 ^
--num-executors 10 ^
QuasarNRT.jar 10 6
-Tao
I am trying to execute my code on a YARN cluster.
The command I am using is:
$SPARK_HOME/bin/spark-submit \
--class "MyApp" \
target/scala-2.10/my-application_2.10-1.0.jar \
--master yarn-cluster \
--num-executors 3 \
--driver-memory 6g \
--executor-memory 7g \
<outputPath>
But I can see that this program is running only on localhost.
It is able to read the file from HDFS.
I have tried this in standalone mode and it works fine.
Please suggest where it is going wrong.
I am using Hadoop 2.4 with Spark 1.1.0. I was able to get it running in cluster mode.
To solve it, we simply removed all the configuration files from all the slave nodes. Earlier we were running in standalone mode, which led to the configuration being duplicated on all the slaves. Once those files were removed, it ran as expected in cluster mode, although performance is not up to standalone mode.
Thanks.
I am running Spark 1.1.0, HDP 2.1, on a kerberized cluster. I can successfully run spark-submit using --master yarn-client, and the results are properly written to HDFS; however, the job doesn't show up on the Hadoop All Applications page. I want to run spark-submit using --master yarn-cluster, but I continue to get this error:
appDiagnostics: Application application_1417686359838_0012 failed 2 times due to AM Container
for appattempt_1417686359838_0012_000002 exited with exitCode: -1000 due to: File does not
exist: hdfs://<HOST>/user/<username>/.sparkStaging/application_<numbers>_<more numbers>/spark-assembly-1.1.0-hadoop2.4.0.jar
.Failing this attempt.. Failing the application.
I've provisioned my account with access to the cluster. I've configured yarn-site.xml. I've cleared .sparkStaging. I've tried including --jars [path to my spark assembly in spark/lib]. I've found this question, which is very similar yet unanswered. I can't tell if this is an HDP 2.1 issue, a Spark 1.1.0 issue, the kerberized cluster, a configuration problem, or something else. Any help would be much appreciated.
This is probably because you left sparkConf.setMaster("local[n]") in the code; a master set explicitly in SparkConf takes precedence over the --master flag passed to spark-submit.
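If that is the case, remove the hard-coded master from the code and let spark-submit set it. Note also that all spark-submit options must come before the application jar; anything placed after the jar is passed to the application as arguments. A reordered sketch of the yarn-cluster command from the earlier MyApp question:
$SPARK_HOME/bin/spark-submit \
--class "MyApp" \
--master yarn-cluster \
--num-executors 3 \
--driver-memory 6g \
--executor-memory 7g \
target/scala-2.10/my-application_2.10-1.0.jar \
<outputPath>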