I installed a fresh instance of Cloudera 5.4 on a single Ubuntu 14.04 server and want to run one of the Spark example applications.
This is the command:
sudo -uhdfs spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode cluster --master yarn /opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/jars/spark-examples-1.3.0-cdh5.4.5-hadoop2.6.0-cdh5.4.5.jar
This is the output:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/jars/avro-tools-1.7.6-cdh5.4.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/08/29 12:07:56 INFO RMProxy: Connecting to ResourceManager at chd2.moneyball.guru/104.131.78.0:8032
15/08/29 12:07:56 INFO Client: Requesting a new application from cluster with 1 NodeManagers
15/08/29 12:07:56 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (1750 MB per container)
15/08/29 12:07:56 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
15/08/29 12:07:56 INFO Client: Setting up container launch context for our AM
15/08/29 12:07:56 INFO Client: Preparing resources for our AM container
15/08/29 12:07:57 INFO Client: Uploading resource file:/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/jars/spark-examples-1.3.0-cdh5.4.5-hadoop2.6.0-cdh5.4.5.jar -> hdfs://chd2.moneyball.guru:8020/user/hdfs/.sparkStaging/application_1440861466017_0007/spark-examples-1.3.0-cdh5.4.5-hadoop2.6.0-cdh5.4.5.jar
15/08/29 12:07:57 INFO Client: Setting up the launch environment for our AM container
15/08/29 12:07:57 INFO SecurityManager: Changing view acls to: hdfs
15/08/29 12:07:57 INFO SecurityManager: Changing modify acls to: hdfs
15/08/29 12:07:57 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hdfs); users with modify permissions: Set(hdfs)
15/08/29 12:07:57 INFO Client: Submitting application 7 to ResourceManager
15/08/29 12:07:57 INFO YarnClientImpl: Submitted application application_1440861466017_0007
15/08/29 12:07:58 INFO Client: Application report for application_1440861466017_0007 (state: ACCEPTED)
15/08/29 12:07:58 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.hdfs
start time: 1440864477580
final status: UNDEFINED
tracking URL: http://chd2.moneyball.guru:8088/proxy/application_1440861466017_0007/
user: hdfs
15/08/29 12:07:59 INFO Client: Application report for application_1440861466017_0007 (state: ACCEPTED)
15/08/29 12:08:00 INFO Client: Application report for application_1440861466017_0007 (state: ACCEPTED)
15/08/29 12:08:01 INFO Client: Application report for application_1440861466017_0007 (state: ACCEPTED)
15/08/29 12:08:02 INFO Client: Application report for application_1440861466017_0007 (state: ACCEPTED)
15/08/29 12:08:03 INFO Client: Application report for application_1440861466017_0007 (state: ACCEPTED)
15/08/29 12:08:04 INFO Client: Application report for application_1440861466017_0007 (state: ACCEPTED)
15/08/29 12:08:05 INFO Client: Application report for application_1440861466017_0007 (state: ACCEPTED)
15/08/29 12:08:06 INFO Client: Application report for application_1440861466017_0007 (state: ACCEPTED)
15/08/29 12:08:07 INFO Client: Application report for application_1440861466017_0007 (state: ACCEPTED
.....
It keeps printing the last line in a loop.
Can you help, please? Let me know if you need anything else.
I increased yarn.nodemanager.resource.memory-mb, and everything is OK now.
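In case it helps anyone else, a minimal sketch of that change, assuming the property lives in yarn-site.xml; the 4096 MB value is just an example and should be sized to the RAM your node can spare:

<!-- yarn-site.xml: total memory the NodeManager may hand out to containers -->
<!-- 4096 is an example value, not a recommendation -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>
</property>

After editing, restart the NodeManager (or redeploy the YARN configuration through Cloudera Manager) so the new limit takes effect.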
This can happen when YARN's slots are occupied by other jobs and the cluster is at capacity. The job gets stuck in the ACCEPTED state, waiting for its turn to run. Can you check the YARN ResourceManager UI to see if anything else is running on the cluster that might be holding this app back? The RM UI can be accessed at http://104.131.78.0:8088, assuming that your RM address is still 104.131.78.0 as shown in your logs. You should be able to 1) see if any other application is running on your cluster, and 2) navigate to the Spark UI running on http://ApplicationMasterAddress:4040 for further analysis.
I ran into a similar issue on Spark 1.5.2, and was able to fix it by using a Scala object to contain my main method instead of a Scala class.
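For illustration, a minimal sketch of that pattern; the object name and app name below are made up:

// Hypothetical example: a top-level object (not a class) holding main,
// so spark-submit --class MySparkApp finds a static entry point.
import org.apache.spark.{SparkConf, SparkContext}

object MySparkApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MySparkApp")
    val sc = new SparkContext(conf)
    // trivial job just to prove the ApplicationMaster started and ran something
    println("count = " + sc.parallelize(1 to 100).count())
    sc.stop()
  }
}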
Related
When I submit a Spark application using Hadoop with YARN in cluster mode, the YARN client state gets stuck in ACCEPTED and never changes to RUNNING. I am using a CentOS 7 Hadoop cluster which has 1 master and 2 slaves.
I log in to OpenStack with a floating IP (which we associate externally). This IP is different from the IP address we get when we run ifconfig on the system.
Below are the logs:
18/01/21 16:34:19 INFO yarn.Client: Uploading resource file:/usr/local/spark/examples/jars/spark-examples_2.11-2.0.1.jar -> hdfs://192.168.198.10:8020/user/cloud-user/.sparkStaging/application_1516548465362_0014/spark-examples_2.11-2.0.1.jar
18/01/21 16:34:19 INFO yarn.Client: Uploading resource file:/tmp/spark-f37b5cec-a81f-46c3-9b5e-6ce7854c6dd4/__spark_conf__2008488553335511154.zip -> hdfs://192.168.198.10:8020/user/cloud-user/.sparkStaging/application_1516548465362_0014/__spark_conf__.zip
18/01/21 16:34:19 INFO spark.SecurityManager: Changing view acls to: cloud-user
18/01/21 16:34:19 INFO spark.SecurityManager: Changing modify acls to: cloud-user
18/01/21 16:34:19 INFO spark.SecurityManager: Changing view acls groups to:
18/01/21 16:34:19 INFO spark.SecurityManager: Changing modify acls groups to:
18/01/21 16:34:19 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(cloud-user); groups with view permissions: Set(); users with modify permissions: Set(cloud-user); groups with modify permissions: Set()
18/01/21 16:34:19 INFO yarn.Client: Submitting application application_1516548465362_0014 to ResourceManager
18/01/21 16:34:19 INFO impl.YarnClientImpl: Submitted application application_1516548465362_0014
18/01/21 16:34:20 INFO yarn.Client: Application report for application_1516548465362_0014 (state: ACCEPTED)
18/01/21 16:34:20 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1516552459599
tracking URL: http://master.abc.com:8088/proxy/application_1516548465362_0014/
user: cloud-user
18/01/21 16:34:21 INFO yarn.Client: Application report for application_1516548465362_0014 (state: ACCEPTED)
18/01/21 16:34:22 INFO yarn.Client: Application report for application_1516548465362_0014 (state: ACCEPTED)
18/01/21 16:34:23 INFO yarn.Client: Application report for application_1516548465362_0014 (state: ACCEPTED)
18/01/21 16:34:24 INFO yarn.Client: Application report for application_1516548465362_0014 (state: ACCEPTED)
18/01/21 16:34:25 INFO yarn.Client: Application report for application_1516548465362_0014 (state: ACCEPTED)
18/01/21 16:34:26 INFO yarn.Client: Application report for application_1516548465362_0014 (state: ACCEPTED)
18/01/21 16:34:27 yarn.Client: Application report for application_1516548465362_0014 (state: ACCEPTED)
I tried all the options people have suggested, but nothing worked. I see the node has enough space, but I am not sure why this is not working. Any help is appreciated. Thanks.
Unresolved datanode registration: hostname cannot be resolved (ip=192.168.198.11, hostname=192.168.198.11)
I don't think the hostname should be an IP. You need to update /etc/hosts on each machine to tell it where the masters and slaves are, or you need either static IPs or a DNS server to resolve addresses for machines that float on the network.
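For example, a hedged /etc/hosts sketch using the addresses visible in the logs above; the slave hostname is illustrative, so substitute your real node names:

# /etc/hosts on every node: stable private IPs mapped to the hostnames the cluster uses
# (names below are illustrative)
192.168.198.10   master.abc.com   master
192.168.198.11   slave1.abc.com   slave1

The same entries should be present on the master and on every slave.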
Thank you in advance for any help.
I am running a YARN job using the provided Hadoop example. The job never completes and stays in the "ACCEPTED" state. Looking at what is being printed out, it seems like the job is waiting to complete, with the client continuously probing for the job status.
Example job (from Hadoop 2.6.0):
spark-submit --master yarn-client --driver-memory 4g --executor-memory 2g --executor-cores 4 --class org.apache.spark.examples.SparkPi /home/john/spark/spark-1.6.1-bin-hadoop2.6/lib/spark-examples-1.6.1-hadoop2.6.0.jar 100
Output:
....
....
disabled; ui acls disabled; users with view permissions: Set(john); users with modify permissions: Set(john)
16/07/27 17:36:09 INFO yarn.Client: Submitting application 1 to ResourceManager
16/07/27 17:36:09 INFO impl.YarnClientImpl: Submitted application application_1469665943738_0001
16/07/27 17:36:10 INFO yarn.Client: Application report for application_1469665943738_0001 (state: ACCEPTED)
16/07/27 17:36:10 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1469666169333
final status: UNDEFINED
tracking URL: http://cpt-bdx021:8088/proxy/application_1469665943738_0001/
user: john
16/07/27 17:36:11 INFO yarn.Client: Application report for application_1469665943738_0001 (state: ACCEPTED)
16/07/27 17:36:12 INFO yarn.Client: Application report for application_1469665943738_0001 (state: ACCEPTED)
16/07/27 17:36:13 INFO yarn.Client: Application report for application_1469665943738_0001 (state: ACCEPTED)
16/07/27 17:36:14 INFO yarn.Client: Application report for application_1469665943738_0001 (state: ACCEPTED)
16/07/27 17:36:15 INFO yarn.Client: Application report for application_1469665943738_0001 (state: ACCEPTED)
16/07/27 17:36:16 INFO yarn.Client: Application report for application_1469665943738_0001 (state: ACCEPTED)
16/07/27 17:36:17 INFO yarn.Client: Application report for application_1469665943738_0001 (state: ACCEPTED)
16/07/27 17:36:18 INFO yarn.Client: Application report for application_1469665943738_0001 (state: ACCEPTED)
16/07/27 17:36:19 INFO yarn.Client: Application report for application_1469665943738_0001 (state: ACCEPTED)
16/07/27 17:36:20 INFO yarn.Client: Application report for application_1469665943738_0001 (state: ACCEPTED)
16/07/27 17:36:21 INFO yarn.Client: Application report for application_1469665943738_0001 (state: ACCEPTED)
16/07/27 17:36:22 INFO yarn.Client: Application report for application_1469665943738_0001 (state: ACCEPTED)
...........
...........
...........
UPDATE (it looks like the job was submitted to the ResourceManager, hence "ACCEPTED", but the ResourceManager "sees" no nodes or Hadoop workers to actually hand the job to):
$ jps
jps
12404 Jps
12211 NameNode
12315 DataNode
11743 ApplicationHistoryServer
11876 ResourceManager
11542 NodeManager
$ yarn node -list
16/07/27 23:07:53 INFO client.RMProxy: Connecting to ResourceManager at /192.168.0.5.55:8032
Total Nodes:0
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
UPDATE (2): I am using the default etc/container-executor.cfg file:
yarn.nodemanager.linux-container-executor.group=#configured value of yarn.nodemanager.linux-container-executor.group
banned.users=#comma separated list of users who can not run applications
min.user.id=1000#Prevent other super-users
allowed.system.users=##comma separated list of system users who CAN run applications
Also, as an aside, I want to mention that I do not have a hadoop user or a hadoop user group. I am using the default account with which I logged on to the system, if that matters. Thanks!
UPDATE(3): NodeManager log
org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at 192.168.0.5.55:8031
2016-07-28 00:23:26,083 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out 0 NM container statuses: []
2016-07-28 00:23:26,087 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registering with RM using containers :[]
2016-07-28 00:23:26,233 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager: Rolling master-key for container-tokens, got key with id -160570002
2016-07-28 00:23:26,236 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM: Rolling master-key for container-tokens, got key with id -1876215653
2016-07-28 00:23:26,237 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as 192.168.0.5.55:53034 with total resource of <memory:8192, vCores:8>
2016-07-28 00:23:26,237 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Notifying ContainerManager to unblock new container-requests
The reason your job never completes is that it never goes from state ACCEPTED to state RUNNING. There is a scheduler that decides which apps get resources and thereby move to state RUNNING.
There are two schedulers available: the fair scheduler and the capacity scheduler. You can find details in the Hadoop YARN documentation. If you could provide your yarn-site.xml, capacity-scheduler.xml and fair-scheduler.xml files, I could give you better help :).
The most common possibility is that the queue that you are sending your job to does not have the available resources you are requesting.
Typical problems may be:
Resource requirements (memory and/or cores). You're asking for more memory or cores than the cluster is able to allocate. This may be because the cluster is nearly full, or because your settings are not consistent; a sketch of the relevant yarn-site.xml settings follows this list. More details on this page.
Disk space. Check the node's disk space; there is a health check that may stop your application from running:
yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage
In a multitenant / multiqueue environment, if there are hard resource limits per queue, your application may be hitting those. You may want to increase your settings, or test in another queue with more resources.
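As a rough illustration of the first two points (the sketch referenced above), these yarn-site.xml properties control the per-container ceiling and the disk health check; the values are examples only and must be tuned to your own nodes:

<!-- yarn-site.xml: example values, not recommendations -->
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>4096</value>   <!-- largest single container YARN will grant -->
</property>
<property>
  <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
  <value>90.0</value>   <!-- node is marked unhealthy above this disk usage -->
</property>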
I have a cluster of 2 machines and am trying to submit a spark job with YARN cluster manager.
vanilla Spark 1.6.2 built against Hadoop 2.6.2
vanilla Hadoop 2.7.2
I can successfully run MapReduce jobs and Spark jobs with the standalone cluster manager, but when I run with YARN, I get an error.
Any suggestions on how to get it to work?
How do I enable more verbose logging? The error message is absolutely unclear.
Why are no log files created under hadoop/logs/userlogs/applicationXXX?
Rhetorical question: IMO, Hadoop's logging and diagnostics aren't very good. Why is that? Hadoop seems to be an established product.
Below is the output:
mike#mp-desktop ~/opt/hadoop $ spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster ~/prg/scala/spark-examples_2.11-1.0.jar 10
16/07/09 08:59:00 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/07/09 08:59:01 INFO client.RMProxy: Connecting to ResourceManager at mp-desktop/192.168.1.60:8050
16/07/09 08:59:01 INFO yarn.Client: Requesting a new application from cluster with 2 NodeManagers
16/07/09 08:59:01 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
16/07/09 08:59:01 INFO yarn.Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
16/07/09 08:59:01 INFO yarn.Client: Setting up container launch context for our AM
16/07/09 08:59:01 INFO yarn.Client: Setting up the launch environment for our AM container
16/07/09 08:59:01 INFO yarn.Client: Preparing resources for our AM container
16/07/09 08:59:02 INFO yarn.Client: Uploading resource file:/home/mike/opt/spark-1.6.2-bin-hadoop2.6/lib/spark-assembly-1.6.2-hadoop2.6.0.jar -> hdfs://mp-desktop:9000/user/mike/.sparkStaging/application_1468043888852_0001/spark-assembly-1.6.2-hadoop2.6.0.jar
16/07/09 08:59:06 INFO yarn.Client: Uploading resource file:/home/mike/prg/scala/spark-examples_2.11-1.0.jar -> hdfs://mp-desktop:9000/user/mike/.sparkStaging/application_1468043888852_0001/spark-examples_2.11-1.0.jar
16/07/09 08:59:06 INFO yarn.Client: Uploading resource file:/tmp/spark-2ee6dfd6-e9d3-4ca4-9e98-5ce9e75dc757/__spark_conf__7114661171911035574.zip -> hdfs://mp-desktop:9000/user/mike/.sparkStaging/application_1468043888852_0001/__spark_conf__7114661171911035574.zip
16/07/09 08:59:06 INFO spark.SecurityManager: Changing view acls to: mike
16/07/09 08:59:06 INFO spark.SecurityManager: Changing modify acls to: mike
16/07/09 08:59:06 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mike); users with modify permissions: Set(mike)
16/07/09 08:59:07 INFO yarn.Client: Submitting application 1 to ResourceManager
16/07/09 08:59:07 INFO impl.YarnClientImpl: Submitted application application_1468043888852_0001
16/07/09 08:59:08 INFO yarn.Client: Application report for application_1468043888852_0001 (state: ACCEPTED)
16/07/09 08:59:08 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1468043947113
final status: UNDEFINED
tracking URL: http://mp-desktop:8088/proxy/application_1468043888852_0001/
user: mike
16/07/09 08:59:09 INFO yarn.Client: Application report for application_1468043888852_0001 (state: ACCEPTED)
16/07/09 08:59:10 INFO yarn.Client: Application report for application_1468043888852_0001 (state: ACCEPTED)
16/07/09 08:59:11 INFO yarn.Client: Application report for application_1468043888852_0001 (state: ACCEPTED)
16/07/09 08:59:12 INFO yarn.Client: Application report for application_1468043888852_0001 (state: ACCEPTED)
16/07/09 08:59:13 INFO yarn.Client: Application report for application_1468043888852_0001 (state: ACCEPTED)
16/07/09 08:59:14 INFO yarn.Client: Application report for application_1468043888852_0001 (state: ACCEPTED)
16/07/09 08:59:15 INFO yarn.Client: Application report for application_1468043888852_0001 (state: ACCEPTED)
16/07/09 08:59:16 INFO yarn.Client: Application report for application_1468043888852_0001 (state: ACCEPTED)
16/07/09 08:59:17 INFO yarn.Client: Application report for application_1468043888852_0001 (state: ACCEPTED)
16/07/09 08:59:18 INFO yarn.Client: Application report for application_1468043888852_0001 (state: ACCEPTED)
16/07/09 08:59:19 INFO yarn.Client: Application report for application_1468043888852_0001 (state: ACCEPTED)
16/07/09 08:59:20 INFO yarn.Client: Application report for application_1468043888852_0001 (state: ACCEPTED)
16/07/09 08:59:21 INFO yarn.Client: Application report for application_1468043888852_0001 (state: FAILED)
16/07/09 08:59:21 INFO yarn.Client:
client token: N/A
diagnostics: Application application_1468043888852_0001 failed 2 times due to AM Container for appattempt_1468043888852_0001_000002 exited with exitCode: -1
For more detailed output, check application tracking page:http://mp-desktop:8088/cluster/app/application_1468043888852_0001Then, click on links to logs of each attempt.
Diagnostics: File /home/mike/hadoopstorage/nm-local-dir/usercache/mike/appcache/application_1468043888852_0001/container_1468043888852_0001_02_000001 does not exist
Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1468043947113
final status: FAILED
tracking URL: http://mp-desktop:8088/cluster/app/application_1468043888852_0001
user: mike
16/07/09 08:59:21 INFO yarn.Client: Deleting staging directory .sparkStaging/application_1468043888852_0001
Exception in thread "main" org.apache.spark.SparkException: Application application_1468043888852_0001 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1034)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1081)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16/07/09 08:59:21 INFO util.ShutdownHookManager: Shutdown hook called
16/07/09 08:59:21 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-2ee6dfd6-e9d3-4ca4-9e98-5ce9e75dc757
Thanks!
The error message I had was similar:
16/07/15 13:55:53 INFO Client: Application report for application_1468583505911_0002 (state: ACCEPTED)
16/07/15 13:55:54 INFO Client: Application report for application_1468583505911_0002 (state: ACCEPTED)
16/07/15 13:55:55 INFO Client: Application report for application_1468583505911_0002 (state: ACCEPTED)
16/07/15 13:55:56 INFO Client: Application report for application_1468583505911_0002 (state: FAILED)
16/07/15 13:55:56 INFO Client:
client token: N/A
diagnostics: Application application_1468583505911_0002 failed 2 times due to AM Container for appattempt_1468583505911_0002_000002 exited with exitCode: -1000
For more detailed output, check application tracking page:http://<redacted>:8088/cluster/app/application_1468583505911_0002Then, click on links to logs of each attempt.
Diagnostics: File does not exist: hdfs://<redacted>:8020/user/root/.sparkStaging/application_1468583505911_0002/__spark_conf__4995486282135454270.zip
java.io.FileNotFoundException: File does not exist: hdfs://<redacted>:8020/user/root/.sparkStaging/application_1468583505911_0002/__spark_conf__4995486282135454270.zip
at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1367)
at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1359)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1359)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
Try running with YARN in client mode instead of cluster mode, which prints the driver program's log to your shell:
spark-submit --class myClass --master yarn /path/to/myClass.jar
The log output showed that myClass was failing immediately because I had the incorrect number of args (the class expected more than 1 arg).
The class failed with my custom exit code (42) and printed "Usage" info to the log, allowing me to fix the actual problem.
When I ran with --master yarn-cluster, this output was not visible to me and I could not see the "Usage" information mentioned above. Instead, all I had was the vague "File does not exist" issue shown above.
Specifying the correct number of arguments to myClass resolved the issue.
At this point I've assumed that my Spark job failed so quickly that it started cleaning up the .sparkStaging files it had copied before YARN had checked for them.
You have probably solved your problem already, but I faced the same issue this morning with Spark 2.1 on a YARN cluster and found this post. I had the same error as you, and my problem was the SparkConf object, which needed:
from pyspark import SparkConf

conf = (SparkConf()
        .setMaster("yarn")      # I had this value as "local"
        .setAppName("My app Name"))
So when I changed this and did my spark-submit (--master yarn --deploy-mode cluster), everything worked correctly.
I solved this problem with the following command; I am using CDH 5.14.1 and Spark 1.6:
$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 1 \
--queue thequeue \
lib/spark-examples*.jar \
10
Here is the link: https://spark.apache.org/docs/1.6.0/running-on-yarn.html
I have Hadoop 2.6.0.2.2.0.0-2041 with Hive 0.14.0.2.2.0.0-2041
After building Spark with command:
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests package
I try to run Pi example on YARN with the following command:
export HADOOP_CONF_DIR=/etc/hadoop/conf
/var/home2/test/spark/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--executor-memory 3G \
--num-executors 50 \
hdfs:///user/test/jars/spark-examples-1.3.0-hadoop2.4.0.jar \
1000
I get the exception: application_1427875242006_0029 failed 2 times due to AM Container for appattempt_1427875242006_0029_000002 exited with exitCode: 1, which in fact is Diagnostics: Exception from container-launch (please see the log below).
The application tracking URL reveals the following messages:
java.lang.Exception: Unknown container. Container either has not started or has already completed or doesn't belong to this node at all
and also:
Error: Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster
I have Hadoop working fine on 4 nodes and am completely at a loss as to how to make Spark work on YARN.
Should I set the spark.yarn.access.namenodes Spark configuration property? My application does not need to access any NameNodes directly, but maybe this would solve the problem?
Please advise where to look; any ideas would be of great help, thank you!
Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/04/06 10:53:40 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/04/06 10:53:42 INFO impl.TimelineClientImpl: Timeline service address: http://etl-hdp-yarn.foo.bar.com:8188/ws/v1/timeline/
15/04/06 10:53:42 INFO client.RMProxy: Connecting to ResourceManager at etl-hdp-yarn.foo.bar.com/192.168.0.16:8050
15/04/06 10:53:42 INFO yarn.Client: Requesting a new application from cluster with 4 NodeManagers
15/04/06 10:53:42 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (4096 MB per container)
15/04/06 10:53:42 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
15/04/06 10:53:42 INFO yarn.Client: Setting up container launch context for our AM
15/04/06 10:53:42 INFO yarn.Client: Preparing resources for our AM container
15/04/06 10:53:43 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
15/04/06 10:53:43 INFO yarn.Client: Uploading resource file:/var/home2/test/spark-1.3.0/assembly/target/scala-2.10/spark-assembly-1.3.0-hadoop2.6.0.jar -> hdfs://etl-hdp-nn1.foo.bar.com:8020/user/test/.sparkStaging/application_1427875242006_0029/spark-assembly-1.3.0-hadoop2.6.0.jar
15/04/06 10:53:44 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs:/user/test/jars/spark-examples-1.3.0-hadoop2.4.0.jar
15/04/06 10:53:44 INFO yarn.Client: Setting up the launch environment for our AM container
15/04/06 10:53:44 INFO spark.SecurityManager: Changing view acls to: test
15/04/06 10:53:44 INFO spark.SecurityManager: Changing modify acls to: test
15/04/06 10:53:44 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(test); users with modify permissions: Set(test)
15/04/06 10:53:44 INFO yarn.Client: Submitting application 29 to ResourceManager
15/04/06 10:53:44 INFO impl.YarnClientImpl: Submitted application application_1427875242006_0029
15/04/06 10:53:45 INFO yarn.Client: Application report for application_1427875242006_0029 (state: ACCEPTED)
15/04/06 10:53:45 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1428317623905
final status: UNDEFINED
tracking URL: http://etl-hdp-yarn.foo.bar.com:8088/proxy/application_1427875242006_0029/
user: test
15/04/06 10:53:46 INFO yarn.Client: Application report for application_1427875242006_0029 (state: ACCEPTED)
15/04/06 10:53:47 INFO yarn.Client: Application report for application_1427875242006_0029 (state: ACCEPTED)
15/04/06 10:53:48 INFO yarn.Client: Application report for application_1427875242006_0029 (state: ACCEPTED)
15/04/06 10:53:49 INFO yarn.Client: Application report for application_1427875242006_0029 (state: FAILED)
15/04/06 10:53:49 INFO yarn.Client:
client token: N/A
diagnostics: Application application_1427875242006_0029 failed 2 times due to AM Container for appattempt_1427875242006_0029_000002 exited with exitCode: 1
For more detailed output, check application tracking page:http://etl-hdp-yarn.foo.bar.com:8088/proxy/application_1427875242006_0029/Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1427875242006_0029_02_000001
Exit code: 1
Exception message: /mnt/hdfs01/hadoop/yarn/local/usercache/test/appcache/application_1427875242006_0029/container_1427875242006_0029_02_000001/launch_container.sh: line 27: $PWD:$PWD/__spark__.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure: bad substitution
Stack trace: ExitCodeException exitCode=1: /mnt/hdfs01/hadoop/yarn/local/usercache/test/appcache/application_1427875242006_0029/container_1427875242006_0029_02_000001/launch_container.sh: line 27: $PWD:$PWD/__spark__.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure: bad substitution
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1428317623905
final status: FAILED
tracking URL: http://etl-hdp-yarn.foo.bar.com:8088/cluster/app/application_1427875242006_0029
user: test
Exception in thread "main" org.apache.spark.SparkException: Application finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:622)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:647)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
If you are using Spark with HDP, then you have to do the following things.
Add these entries to your $SPARK_HOME/conf/spark-defaults.conf:
spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041 (your installed HDP version)
spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041 (your installed HDP version)
Create a java-opts file in $SPARK_HOME/conf and add the installed HDP version to that file, like:
-Dhdp.version=2.2.0.0-2041 (your installed HDP version)
To find the HDP version, run the command hdp-select status hadoop-client on the cluster.
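A short shell sketch of applying the above, assuming the version printed by hdp-select is 2.2.0.0-2041 as in this answer:

# Find the installed HDP version, then wire it into the driver and AM JVM options
hdp-select status hadoop-client        # e.g. prints: hadoop-client - 2.2.0.0-2041
echo "-Dhdp.version=2.2.0.0-2041" > $SPARK_HOME/conf/java-opts
cat >> $SPARK_HOME/conf/spark-defaults.conf <<'EOF'
spark.driver.extraJavaOptions  -Dhdp.version=2.2.0.0-2041
spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041
EOF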
This is a bug in the HDP - Spark Integration
In your spark-defaults.conf add the following lines
spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041
spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041
This should help address the issue
I think your Hadoop classpath is not set up.
lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure: bad substitution
When running the Spark 1.3.0 Pi example on YARN (Hadoop 2.6.0.2.2.0.0-2041) with the following script:
# Run on a YARN cluster
export HADOOP_CONF_DIR=/etc/hadoop/conf
/var/home2/test/spark/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--executor-memory 3G \
--num-executors 50 \
/var/home2/test/spark/lib/spark-examples-1.3.0-hadoop2.4.0.jar \
1000
It fails with an "Application failed 2 times due to AM Container" message (please see below). As far as I understand, all the necessary information to run a Spark application in YARN mode is provided in this launch script. What else should be configured to run on YARN? What is missing? Are there other reasons for the YARN launch to fail?
[test#etl-hdp-mgmt pi]$ ./run-pi.sh
Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/04/01 12:59:57 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/04/01 12:59:58 INFO client.RMProxy: Connecting to ResourceManager at etl-hdp-yarn.foo.bar.com/192.168.0.16:8050
15/04/01 12:59:58 INFO yarn.Client: Requesting a new application from cluster with 4 NodeManagers
15/04/01 12:59:58 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (4096 MB per container)
15/04/01 12:59:58 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
15/04/01 12:59:58 INFO yarn.Client: Setting up container launch context for our AM
15/04/01 12:59:58 INFO yarn.Client: Preparing resources for our AM container
15/04/01 12:59:59 WARN hdfs.BlockReaderLocal: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
15/04/01 12:59:59 INFO yarn.Client: Uploading resource file:/var/home2/test/spark-1.3.0-bin-hadoop2.4/lib/spark-assembly-1.3.0-hadoop2.4.0.jar -> hdfs://foo.bar.com:8020/user/test/.sparkStaging/application_1427875242006_0010/spark-assembly-1.3.0-hadoop2.4.0.jar
15/04/01 13:00:01 INFO yarn.Client: Uploading resource file:/var/home2/test/spark/lib/spark-examples-1.3.0-hadoop2.4.0.jar -> hdfs://foo.bar.com:8020/user/test/.sparkStaging/application_1427875242006_0010/spark-examples-1.3.0-hadoop2.4.0.jar
15/04/01 13:00:02 INFO yarn.Client: Setting up the launch environment for our AM container
15/04/01 13:00:03 INFO spark.SecurityManager: Changing view acls to: test
15/04/01 13:00:03 INFO spark.SecurityManager: Changing modify acls to: test
15/04/01 13:00:03 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(test); users with modify permissions: Set(test)
15/04/01 13:00:03 INFO yarn.Client: Submitting application 10 to ResourceManager
15/04/01 13:00:03 INFO impl.YarnClientImpl: Submitted application application_1427875242006_0010
15/04/01 13:00:04 INFO yarn.Client: Application report for application_1427875242006_0010 (state: ACCEPTED)
15/04/01 13:00:04 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1427893202566
final status: UNDEFINED
tracking URL: http://etl-hdp-yarn.foo.bar.com:8088/proxy/application_1427875242006_0010/
user: test
15/04/01 13:00:05 INFO yarn.Client: Application report for application_1427875242006_0010 (state: ACCEPTED)
15/04/01 13:00:06 INFO yarn.Client: Application report for application_1427875242006_0010 (state: ACCEPTED)
15/04/01 13:00:07 INFO yarn.Client: Application report for application_1427875242006_0010 (state: ACCEPTED)
15/04/01 13:00:08 INFO yarn.Client: Application report for application_1427875242006_0010 (state: ACCEPTED)
15/04/01 13:00:09 INFO yarn.Client: Application report for application_1427875242006_0010 (state: FAILED)
15/04/01 13:00:09 INFO yarn.Client:
client token: N/A
diagnostics: Application application_1427875242006_0010 failed 2 times due to AM Container for appattempt_1427875242006_0010_000002 exited with exitCode: 1
For more detailed output, check application tracking page:http://etl-hdp-yarn.foo.bar.com:8088/proxy/application_1427875242006_0010/Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1427875242006_0010_02_000001
Exit code: 1
Exception message: /mnt/hdfs01/hadoop/yarn/local/usercache/test/appcache/application_1427875242006_0010/container_1427875242006_0010_02_000001/launch_container.sh: line 27: $PWD:$PWD/__spark__.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure: bad substitution
Stack trace: ExitCodeException exitCode=1: /mnt/hdfs01/hadoop/yarn/local/usercache/test/appcache/application_1427875242006_0010/container_1427875242006_0010_02_000001/launch_container.sh: line 27: $PWD:$PWD/__spark__.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure: bad substitution
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1427893202566
final status: FAILED
tracking URL: http://etl-hdp-yarn.foo.bar.com:8088/cluster/app/application_1427875242006_0010
user: test
Exception in thread "main" org.apache.spark.SparkException: Application finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:622)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:647)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Run
yarn logs -applicationId application_1427875242006_0010 > /tmp/application_1427875242006_0010
Logs there should indicate why it failed.
"Failed 2 times" happens because when you run in yarn cluster mode, driver runs in AM whose retry is 2 by default.
So your driver is retried twice.
I totally agree with #SeanOwen. Follow the Spark Building documentation.
You need to compile Spark for YARN using the correct configuration for your Hadoop cluster (version, Hive support, etc.).
The problem won't persist then!
This is a problem with Spark communicating with the ApplicationMaster.
The RM and NM talk to each other over RPC, so the problem could be that launch_container.sh is not running correctly. Check that the NM is communicating with the RM while submitting the job.
Try adding this to your yarn-site.xml:
<property>
<name>yarn.nodemanager.delete.debug-delay-sec</name>
<value>1200</value>
</property>
This ensures that the launch_container.sh from the NM error seen above does not get deleted (it will remain around for 20 minutes; increase 1200 to a higher number if needed). Now, what you can do is try to run that launch_container.sh script manually from the container directory and see where it bails out.
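For instance, using the container path from the diagnostics earlier in this thread (substitute your own NodeManager local dir and container id):

# Illustrative path taken from the error output above
cd /mnt/hdfs01/hadoop/yarn/local/usercache/test/appcache/application_1427875242006_0010/container_1427875242006_0010_02_000001
bash -x launch_container.sh    # -x traces each command so you can see exactly where it bails out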
Hope this will help you.
I also faced a similar problem. Actually, you do not need to mention --master yarn-cluster when you are running your self-contained application in the cluster.
This problem has been solved on the Cloudera forum; refer to this: https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Issue-running-spark-application-in-Yarn-cluster-mode/td-p/44570