Spark streaming job on YARN cluster mode stuck in accepted, then fails with a Timeout Exception

I am running a Spark Streaming application that simply reads messages from a Kafka topic, enriches them, and writes the enriched messages to another Kafka topic.
I have already run it successfully in Standalone mode (both client and cluster deploy modes) and in YARN client mode.
When I submit the application in YARN cluster mode, I get the following messages:
18/01/10 12:13:34 INFO Client: Submitting application application_1515582681419_0001 to ResourceManager
18/01/10 12:13:34 INFO YarnClientImpl: Submitted application application_1515582681419_0001
18/01/10 12:13:35 INFO Client: Application report for application_1515582681419_0001 (state: ACCEPTED)
18/01/10 12:13:35 INFO Client:
client token: N/A
diagnostics: AM container is launched, waiting for AM container to Register with RM
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1515582814080
final status: UNDEFINED
tracking URL: http://ambari1.internal:8088/proxy/application_1515582681419_0001/
user: root
18/01/10 12:13:36 INFO Client: Application report for application_1515582681419_0001 (state: ACCEPTED)
18/01/10 12:13:37 INFO Client: Application report for application_1515582681419_0001 (state: ACCEPTED)
It stays stuck in the ACCEPTED state for around 4-5 minutes and then exits with the following error message:
18/01/10 12:17:00 INFO InputInfoTracker: remove old batch metadata: 1515583000000 ms
18/01/10 12:17:02 ERROR ApplicationMaster: Uncaught exception:
java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:423)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:282)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:768)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:766)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
18/01/10 12:17:02 INFO ApplicationMaster: Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds])
18/01/10 12:17:02 INFO StreamingContext: Invoking stop(stopGracefully=false) from shutdown hook
18/01/10 12:17:02 INFO ReceiverTracker: ReceiverTracker stopped
18/01/10 12:17:02 INFO JobGenerator: Stopping JobGenerator immediately
Funny fact: if I visit the page of the application, I can see that the Spark context has been started and that it processes some messages.
Could anyone help me with this?
PS: These are the resources of my YARN cluster:

The problem might be with the YARN "App Timeline Server". Try restarting it.

Are you creating your Spark session with the master set to local? Please check this.
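One way to check this is to search the application sources for a hard-coded local master, since properties set directly in code take precedence over the --master flag passed to spark-submit. Below is a minimal sketch in Python, assuming the sources are Scala, Java, or Python files under one project directory. The function name find_hardcoded_local_master and the regular expression are illustrative only and are not part of Spark.

```python
import re
from pathlib import Path

# Matches calls such as setMaster("local[*]") or master('local'),
# with single or double quotes around the master URL.
LOCAL_MASTER = re.compile(r"""(?:setMaster|master)\(\s*["']local[^"']*["']\s*\)""")


def find_hardcoded_local_master(root, suffixes=(".scala", ".java", ".py")):
    """Return (path, line_number, stripped_line) for each line that hard-codes a local master."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="replace")
            for number, line in enumerate(text.splitlines(), start=1):
                if LOCAL_MASTER.search(line):
                    hits.append((str(path), number, line.strip()))
    return hits
```

Calling find_hardcoded_local_master("src") lists candidate lines to review. A hit is only a hint: the master may be set conditionally, or built from a variable that this simple pattern does not see.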

Related

Spark 2.1 + Yarn application has already ended

We are using Spark 2.1 in our Ambari cluster.
The Thrift servers managed by Ambari are not stable and restart all the time.
From the log we can see the following:
ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
We found the following link, which describes a solution for this problem:
https://markobigdata.com/2016/08/11/yarn-application-has-already-ended-it-might-have-been-killed-or-unable-to-launch-application-master/
However, after we set the parameters as described in the article, the problem still exists.
Please advise: what is the solution for this?
Full log:
tail -f spark-hive-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-master01.sys873dns.com.out
Spark Command: /usr/jdk64/jdk1.8.0_112/bin/java -Dhdp.version=2.6.0.3-8 -cp /usr/hdp/current/spark2-thriftserver/conf/:/usr/hdp/current/spark2-thriftserver/jars/*:/usr/hdp/current/hadoop-client/conf/ -Xmx10000m org.apache.spark.deploy.SparkSubmit --conf spark.driver.memory=50g --properties-file /usr/hdp/current/spark2-thriftserver/conf/spark-thrift-sparkconf.conf --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --name Thrift JDBC/ODBC Server --executor-cores 7 spark-internal
========================================
Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" with specified deploy mode instead.
18/02/08 09:38:07 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:85)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:156)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2320)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)
at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:47)
at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:81)
at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:745)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
18/02/08 09:38:07 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
18/02/08 09:38:07 ERROR Utils: Uncaught exception in thread main
java.lang.NullPointerException
Here are the YARN logs as well:
grep -i erro yarn-yarn-resourcemanager-master01.sys873dns.com.log
2018-02-08 11:19:00,993 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(1019)) - Opening socket connection to server master01.sys873dns.com/23.1.29.61:2181. Will not attempt to authenticate using SASL (unknown error)
2018-02-08 11:19:15,767 ERROR resourcemanager.ResourceManager (LogAdapter.java:error(69)) - RECEIVED SIGNAL 15: SIGTERM
2018-02-08 11:19:27,281 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(1019)) - Opening socket connection to server master01.sys873dns.com/23.1.29.61:2181. Will not attempt to authenticate using SASL (unknown error)
2018-02-08 11:29:00,064 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(1019)) - Opening socket connection to server master01.sys873dns.com/23.1.29.61:2181. Will not attempt to authenticate using SASL (unknown error)
2018-02-08 11:29:01,839 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(1019)) - Opening socket connection to server master01.sys873dns.com/23.1.29.61:2181. Will not attempt to authenticate using SASL (unknown error)
2018-02-08 11:29:15,725 ERROR resourcemanager.ResourceManager (LogAdapter.java:error(69)) - RECEIVED SIGNAL 15: SIGTERM
2018-02-08 11:29:27,033 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(1019)) - Opening socket connection to server master03.sys873dns.com/23.1.29.63:2181. Will not attempt to authenticate using SASL (unknown error)
ons.YarnException: Unauthorized request to start container.
2018-02-08 12:56:11,144 INFO amlauncher.AMLauncher (AMLauncher.java:run(273)) - Error launching appattempt_1518089370033_0028_000008. Got exception: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
2018-02-08 12:59:39,822 INFO amlauncher.AMLauncher (AMLauncher.java:run(273)) - Error launching appattempt_1518089370033_0029_000002. Got exception: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
2018-02-08 13:00:01,671 INFO amlauncher.AMLauncher (AMLauncher.java:run(273)) - Error launching appattempt_1518089370033_0029_000004. Got exception: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
2018-02-08 13:00:18,062 INFO amlauncher.AMLauncher (AMLauncher.java:run(273)) - Error launching appattempt_1518089370033_0029_000006. Got exception: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
2018-02-08 13:00:20,245 INFO amlauncher.AMLauncher (AMLauncher.java:run(273)) - Error launching appattempt_1518089370033_0030_000003. Got exception: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
2018-02-08 13:00:42,100 INFO amlauncher.AMLauncher (AMLauncher.java:run(273)) - Error launching appattempt_1518089370033_0030_000006. Got exception: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
2018-02-08 13:00:56,310 INFO amlauncher.AMLauncher (AMLauncher.java:run(273)) - Error launching appattempt_1518089370033_0030_000008. Got exception: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
2018-02-08 13:00:58,511 INFO amlauncher.AMLauncher (AMLauncher.java:run(273)) - Error launching appattempt_1518089370033_0030_000010. Got exception: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
2018-02-08 13:00:58,537 INFO rmapp.RMAppImpl (RMAppImpl.java:transition(1063)) - Application application_1518089370033_0030 failed 10 times due to Error launching appattempt_1518089370033_0030_000010. Got exception: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
Last log:
2018-02-08 14:14:54,410 INFO rmapp.RMAppImpl (RMAppImpl.java:handle(778)) - application_1518089370033_0050 State change from FINAL_SAVING to FAILED
2018-02-08 14:14:54,410 INFO capacity.ParentQueue (ParentQueue.java:removeApplication(385)) - Application removed - appId: application_1518089370033_0050 user: hive leaf-queue of parent: root #applications: 1
2018-02-08 14:14:54,412 INFO integration.RMRegistryOperationsService (RMRegistryOperationsService.java:onApplicationCompleted(119)) - Application application_1518089370033_0050 completed, purging application-level records
2018-02-08 14:14:54,412 INFO integration.RMRegistryOperationsService (RMRegistryOperationsService.java:purgeRecordsAsync(198)) - records under / with ID application_1518089370033_0050 and policy application: {}
2018-02-08 14:14:55,393 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(422)) - container_e09_1518089370033_0049_10_000001 Container Transitioned from RUNNING to COMPLETED
2018-02-08 14:14:55,393 INFO scheduler.SchedulerNode (SchedulerNode.java:releaseContainer(220)) - Released container container_e09_1518089370033_0049_10_000001 of capacity <memory:10240, vCores:1> on host worker02.sys768.com:45454, which currently has 0 containers, <memory:0, vCores:0> used and <memory:30720, vCores:6> available, release resources=true
2018-02-08 14:14:55,393 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:rememberTargetTransitionsAndStoreState(1209)) - Updating application attempt appattempt_1518089370033_0049_000010 with final state: FAILED, and exit status: -1000
2018-02-08 14:14:55,398 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(809)) - appattempt_1518089370033_0049_000010 State change from LAUNCHED to FINAL_SAVING
2018-02-08 14:14:55,399 INFO integration.RMRegistryOperationsService (RMRegistryOperationsService.java:onContainerFinished(144)) - Container container_e09_1518089370033_0049_10_000001 finished, purging container-level records
2018-02-08 14:14:55,400 INFO integration.RMRegistryOperationsService (RMRegistryOperationsService.java:purgeRecordsAsync(198)) - records under / with ID container_e09_1518089370033_0049_10_000001 and policy container: {}
2018-02-08 14:14:55,408 INFO resourcemanager.ApplicationMasterService (ApplicationMasterService.java:unregisterAttempt(685)) - Unregistering app attempt : appattempt_1518089370033_0049_000010
2018-02-08 14:14:55,408 INFO security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:applicationMasterFinished(124)) - Application finished, removing password for appattempt_1518089370033_0049_000010
2018-02-08 14:14:55,408 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(809)) - appattempt_1518089370033_0049_000010 State change from FINAL_SAVING to FAILED
2018-02-08 14:14:55,408 INFO rmapp.RMAppImpl (RMAppImpl.java:transition(1330)) - The number of failed attempts is 10. The max attempts is 10
2018-02-08 14:14:55,409 INFO rmapp.RMAppImpl (RMAppImpl.java:rememberTargetTransitionsAndStoreState(1123)) - Updating application application_1518089370033_0049 with final state: FAILED
2018-02-08 14:14:55,409 INFO rmapp.RMAppImpl (RMAppImpl.java:handle(778)) - application_1518089370033_0049 State change from ACCEPTED to FINAL_SAVING
2018-02-08 14:14:55,409 INFO recovery.RMStateStore (RMStateStore.java:transition(228)) - Updating info for app: application_1518089370033_0049
2018-02-08 14:14:55,409 INFO capacity.CapacityScheduler (CapacityScheduler.java:doneApplicationAttempt(811)) - Application Attempt appattempt_1518089370033_0049_000010 is done. finalState=FAILED
2018-02-08 14:14:55,409 INFO scheduler.AppSchedulingInfo (AppSchedulingInfo.java:clearRequests(124)) - Application application_1518089370033_0049 requests cleared
2018-02-08 14:14:55,410 INFO capacity.LeafQueue (LeafQueue.java:removeApplicationAttempt(795)) - Application removed - appId: application_1518089370033_0049 user: hive queue: default #user-pending-applications: 0 #user-active-applications: 0 #queue-pending-applications: 0 #queue-active-applications: 0
2018-02-08 14:14:55,417 INFO rmapp.RMAppImpl (RMAppImpl.java:transition(1063)) - Application application_1518089370033_0049 failed 10 times due to AM Container for appattempt_1518089370033_0049_000010 exited with exitCode: -1000
For more detailed output, check the application tracking page: http://master02.sys768.com:8088/cluster/app/application_1518089370033_0049 Then click on links to logs of each attempt.
Diagnostics: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1212891131-25.1.53.61-1518077044052:blk_1073741833_1009 file=/hdp/apps/2.6.0.3-8/spark2/spark2-hdp-yarn-archive.tar.gz
Failing this attempt. Failing the application.
2018-02-08 14:14:55,418 INFO rmapp.RMAppImpl (RMAppImpl.java:handle(778)) - application_1518089370033_0049 State change from FINAL_SAVING to FAILED
2018-02-08 14:14:55,418 INFO capacity.ParentQueue (ParentQueue.java:removeApplication(385)) - Application removed - appId: application_1518089370033_0049 user: hive leaf-queue of parent: root #applications: 0
2018-02-08 14:14:55,419 INFO integration.RMRegistryOperationsService (RMRegistryOperationsService.java:onApplicationCompleted(119)) - Application application_1518089370033_0049 completed, purging application-level records
2018-02-08 14:14:55,419 INFO integration.RMRegistryOperationsService (RMRegistryOperationsService.java:purgeRecordsAsync(198)) - records under / with ID application_1518089370033_0049 and policy application: {}
[root@master02 yarn]#

Problems using Spark 1.6.2 for Hadoop 2.6.0 in a Hadoop 2.7.1 cluster

I have access to a Hadoop cluster, version 2.7.1, that was installed using HDP 2.4. The cluster has Spark installed, specifically:
$ cat /usr/hdp/2.4.3.0-227/spark/RELEASE
Spark 1.6.2.2.4.3.0-227 built for Hadoop 2.7.1.2.4.3.0-227
I'm trying to set up a "client" machine that can remotely connect to the cluster and deploy Spark jobs. Thus, I need to install a Spark distribution matching the versions above.
First of all, I went to the official Spark download page, but 1.6.2 is only available pre-built for Hadoop 2.6.
Then, I decided to download the Spark source code and build it by following this guide. The interesting thing is that the required build profile for Hadoop "2.6.x and later 2.x" is hadoop-2.6. I.e. if I build Spark myself, I'll obtain the same distribution as the one available on the official Spark download page.
Thus, I went with the official pre-built distribution of Spark 1.6.2 for Hadoop 2.6.0.
It does not seem to work properly. I submitted a Python script (a very simple one that only creates a Spark context) and there is some kind of problem (only the relevant parts of the log are shown):
$ ./bin/spark-submit --master yarn --deploy-mode cluster basic.py
...
17/08/28 13:08:29 INFO Client: Requesting a new application from cluster with 8 NodeManagers
17/08/28 13:08:29 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (24576 MB per container)
17/08/28 13:08:29 INFO Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
17/08/28 13:08:29 INFO Client: Setting up container launch context for our AM
17/08/28 13:08:29 INFO Client: Setting up the launch environment for our AM container
17/08/28 13:08:29 INFO Client: Preparing resources for our AM container
17/08/28 13:08:36 INFO Client: Uploading resource file:/Users/frb/Applications/spark-1.6.2-bin-hadoop2.6/lib/spark-assembly-1.6.2-hadoop2.6.0.jar -> hdfs://<host>:8020/user/frb/.sparkStaging/application_1495097788339_0066/spark-assembly-1.6.2-hadoop2.6.0.jar
17/08/28 13:14:40 INFO Client: Uploading resource file:basic.py -> hdfs://<host>:8020/user/frb/.sparkStaging/application_1495097788339_0066/basic.py
17/08/28 13:14:40 INFO Client: Uploading resource file:/Users/frb/Applications/spark-1.6.2-bin-hadoop2.6/python/lib/pyspark.zip -> hdfs://<host>:8020/user/frb/.sparkStaging/application_1495097788339_0066/pyspark.zip
17/08/28 13:14:41 INFO Client: Uploading resource file:/Users/frb/Applications/spark-1.6.2-bin-hadoop2.6/python/lib/py4j-0.9-src.zip -> hdfs://<host>:8020/user/frb/.sparkStaging/application_1495097788339_0066/py4j-0.9-src.zip
17/08/28 13:14:42 INFO Client: Uploading resource file:/private/var/folders/cc/p9gx2wnn3dz8g6yf_r4308fm0000gn/T/spark-0d86f1f4-d310-423a-9d2f-90e2ff46f84e/__spark_conf__3704082754178078870.zip -> hdfs://<host>:8020/user/frb/.sparkStaging/application_1495097788339_0066/__spark_conf__3704082754178078870.zip
17/08/28 13:14:42 INFO SecurityManager: Changing view acls to: frb
17/08/28 13:14:42 INFO SecurityManager: Changing modify acls to: frb
17/08/28 13:14:42 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(frb); users with modify permissions: Set(frb)
17/08/28 13:14:42 INFO Client: Submitting application 66 to ResourceManager
17/08/28 13:14:42 INFO YarnClientImpl: Submitted application application_1495097788339_0066
17/08/28 13:14:48 INFO Client: Application report for application_1495097788339_0066 (state: ACCEPTED)
17/08/28 13:14:48 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1503918882943
final status: UNDEFINED
tracking URL: <host>:8088/proxy/application_1495097788339_0066/
user: frb
17/08/28 13:14:49 INFO Client: Application report for application_1495097788339_0066 (state: ACCEPTED)
...
17/08/28 13:14:52 INFO Client: Application report for application_1495097788339_0066 (state: RUNNING)
17/08/28 13:14:52 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.95.120.6
ApplicationMaster RPC port: 0
queue: default
start time: 1503918882943
final status: UNDEFINED
tracking URL: <host>:8088/proxy/application_1495097788339_0066/
user: frb
17/08/28 13:14:53 INFO Client: Application report for application_1495097788339_0066 (state: RUNNING)
...
17/08/28 13:14:59 INFO Client: Application report for application_1495097788339_0066 (state: ACCEPTED)
17/08/28 13:14:59 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1503918882943
final status: UNDEFINED
tracking URL: <host>:8088/proxy/application_1495097788339_0066/
user: frb
17/08/28 13:15:00 INFO Client: Application report for application_1495097788339_0066 (state: ACCEPTED)
17/08/28 13:15:01 INFO Client: Application report for application_1495097788339_0066 (state: RUNNING)
17/08/28 13:15:01 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.95.58.21
ApplicationMaster RPC port: 0
queue: default
start time: 1503918882943
final status: UNDEFINED
tracking URL: <host>:8088/proxy/application_1495097788339_0066/
user: frb
17/08/28 13:15:02 INFO Client: Application report for application_1495097788339_0066 (state: RUNNING)
...
17/08/28 13:15:09 INFO Client: Application report for application_1495097788339_0066 (state: FINISHED)
17/08/28 13:15:09 INFO Client:
client token: N/A
diagnostics: Max number of executor failures (4) reached
ApplicationMaster host: 10.95.58.21
ApplicationMaster RPC port: 0
queue: default
start time: 1503918882943
final status: FAILED
tracking URL: <host>:8088/proxy/application_1495097788339_0066/
user: frb
Exception in thread "main" org.apache.spark.SparkException: Application application_1495097788339_0066 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1034)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1081)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/08/28 13:15:09 INFO ShutdownHookManager: Shutdown hook called
17/08/28 13:15:09 INFO ShutdownHookManager: Deleting directory /private/var/folders/cc/p9gx2wnn3dz8g6yf_r4308fm0000gn/T/spark-0d86f1f4-d310-423a-9d2f-90e2ff46f84e
If I check the logs for this job, I see the following:
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server
Traceback (most recent call last):
File "/disk0/hadoop/yarn/local/usercache/frb/appcache/application_1495097788339_0066/container_e03_1495097788339_0066_02_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 690, in start
self.socket.connect((self.address, self.port))
File "/usr/lib64/python2.7/socket.py", line 224, in meth
return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
Traceback (most recent call last):
File "basic.py", line 36, in <module>
sc = SparkContext(conf=conf)
File "/disk0/hadoop/yarn/local/usercache/frb/appcache/application_1495097788339_0066/container_e03_1495097788339_0066_02_000001/pyspark.zip/pyspark/context.py", line 115, in __init__
File "/disk0/hadoop/yarn/local/usercache/frb/appcache/application_1495097788339_0066/container_e03_1495097788339_0066_02_000001/pyspark.zip/pyspark/context.py", line 172, in _do_init
File "/disk0/hadoop/yarn/local/usercache/frb/appcache/application_1495097788339_0066/container_e03_1495097788339_0066_02_000001/pyspark.zip/pyspark/context.py", line 235, in _initialize_context
File "/disk0/hadoop/yarn/local/usercache/frb/appcache/application_1495097788339_0066/container_e03_1495097788339_0066_02_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 1062, in __call__
File "/disk0/hadoop/yarn/local/usercache/frb/appcache/application_1495097788339_0066/container_e03_1495097788339_0066_02_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 631, in send_command
File "/disk0/hadoop/yarn/local/usercache/frb/appcache/application_1495097788339_0066/container_e03_1495097788339_0066_02_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 624, in send_command
File "/disk0/hadoop/yarn/local/usercache/frb/appcache/application_1495097788339_0066/container_e03_1495097788339_0066_02_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 579, in _get_connection
File "/disk0/hadoop/yarn/local/usercache/frb/appcache/application_1495097788339_0066/container_e03_1495097788339_0066_02_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 585, in _create_connection
File "/disk0/hadoop/yarn/local/usercache/frb/appcache/application_1495097788339_0066/container_e03_1495097788339_0066_02_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 697, in start
py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to the Java server
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server
Traceback (most recent call last):
File "/disk0/hadoop/yarn/local/usercache/frb/appcache/application_1495097788339_0066/container_e03_1495097788339_0066_02_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 690, in start
self.socket.connect((self.address, self.port))
File "/usr/lib64/python2.7/socket.py", line 224, in meth
return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
I.e. the Spark context is not created: the connection between the Python driver that creates the Spark context and the JVM running the Java gateway fails.
This must be related to the Spark distribution I've installed on my client machine, because:
The Spark distribution of my client machine is uploaded to the cluster, thus it is the one used; just remember this log line when submitting:
17/08/28 13:08:36 INFO Client: Uploading resource file:/Users/frb/Applications/spark-1.6.2-bin-hadoop2.6/lib/spark-assembly-1.6.2-hadoop2.6.0.jar -> hdfs://<host>:8020/user/frb/.sparkStaging/application_1495097788339_0066/spark-assembly-1.6.2-hadoop2.6.0.jar
The same command works when submitted from within the cluster, i.e. when using the "Spark 1.6.2.2.4.3.0-227 built for Hadoop 2.7.1.2.4.3.0-227" version of Spark installed by HDP.
Any idea how to fix this? Thanks!

I finally solved this:
I added the option --conf spark.yarn.jars to the spark-submit command, with its value set to the location of the Spark assembly jar in the remote Spark cluster. This avoids uploading the client-side Spark assembly jar I installed (which is slow and does not exactly match the remote version anyway).
I added the property hdp.version to the client-side yarn-site.xml, with its value set to the HDP version of the remote Hadoop-Spark cluster. This avoids a substitution error in certain paths, which ultimately surfaced as the connection error I described in the question.
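The spark-submit change described above can be sketched as a small Python helper that assembles the command line. This is only an illustration: the function names build_submit_command and as_shell are made up, and the jar location in the usage note is a placeholder rather than a real path. Note also that the Spark 1.6 configuration documentation names this property spark.yarn.jar (singular), while spark.yarn.jars is the Spark 2.x name, so the property name is left as a parameter here.

```python
import shlex


def build_submit_command(script, jar_property, jar_location,
                         spark_submit="./bin/spark-submit"):
    """Build a spark-submit argument list for YARN cluster mode.

    jar_property is the configuration key that points Spark at a jar already
    stored on the cluster, so the client-side assembly jar is not uploaded.
    """
    return [
        spark_submit,
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--conf", f"{jar_property}={jar_location}",
        script,
    ]


def as_shell(argv):
    """Render an argument list as a single shell-quoted command line."""
    return " ".join(shlex.quote(arg) for arg in argv)
```

For example, as_shell(build_submit_command("basic.py", "spark.yarn.jar", "hdfs:///path/to/spark-assembly.jar")) produces a command that can be pasted into a shell, with the placeholder replaced by the actual location of the assembly jar on the cluster.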

Spark not running on yarn-client mode (state: ACCEPTED) ends for Spark Submit (with Spark 1.6.1 on YARN) with failure

I am trying to run the following command on Spark in yarn-client mode:
$SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client $SPARK_HOME/examples/target/scala-2.10/spark-examples*.jar 10
When I run the above command, my application gets stuck at the following:
16/07/13 17:14:28 INFO yarn.Client: Application report for application_1468428769910_0002 (state: ACCEPTED)
16/07/13 17:14:28 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1468430067384
final status: UNDEFINED
tracking URL: http://hadoop-master:8088/proxy/application_1468428769910_0002/
user: nachiket
16/07/13 17:14:29 INFO yarn.Client: Application report for application_1468428769910_0002 (state: ACCEPTED)
16/07/13 17:14:30 INFO yarn.Client: Application report for application_1468428769910_0002 (state: ACCEPTED)
16/07/13 17:14:31 INFO yarn.Client: Application report for application_1468428769910_0002 (state: ACCEPTED)
16/07/13 17:14:32 INFO yarn.Client: Application report for application_1468428769910_0002 (state: ACCEPTED)
I have already implemented most of the suggestions mentioned in the following question:
Application report for application_ (state: ACCEPTED) never ends for Spark Submit (with Spark 1.2.0 on YARN)
I am still facing the same issue. Are there any solutions other than those in the question linked above?
Finally, the job fails with the following diagnostics:
client token: N/A
diagnostics: Application application_1468455134412_0001 failed 2 times due to Error launching appattempt_1468455134412_0001_000002. Got exception: org.apache.hadoop.net.ConnectTimeoutException: Call From sclab103/104.239.213.7 to 104.239.213.7:60640 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=104.239.213.7/104.239.213.7:60640]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:751)
at org.apache.hadoop.ipc.Client.call(Client.java:1479)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy82.startContainers(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy83.startContainers(Unknown Source)
at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:118)
at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:250)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=104.239.213.7/104.239.213.7:60640]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:614)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712)
at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
at org.apache.hadoop.ipc.Client.call(Client.java:1451)
... 16 more
. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1468455280498
final status: FAILED
tracking URL: http://hadoop-master:8088/cluster/app/application_1468455134412_0001
user: sclab

Spark 1.3.0: Running Pi example on YARN fails

I have Hadoop 2.6.0.2.2.0.0-2041 with Hive 0.14.0.2.2.0.0-2041
After building Spark with the command:
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests package
I try to run the Pi example on YARN with the following command:
export HADOOP_CONF_DIR=/etc/hadoop/conf
/var/home2/test/spark/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--executor-memory 3G \
--num-executors 50 \
hdfs:///user/test/jars/spark-examples-1.3.0-hadoop2.4.0.jar \
1000
The job fails with: application_1427875242006_0029 failed 2 times due to AM Container for appattempt_1427875242006_0029_000002 exited with exitCode: 1. The diagnostics report "Exception from container-launch" (please see the log below).
The application tracking URL shows the following messages:
java.lang.Exception: Unknown container. Container either has not started or has already completed or doesn't belong to this node at all
and also:
Error: Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster
I have Hadoop working fine on 4 nodes and am completely at a loss as to how to make Spark work on YARN.
Should I set the spark.yarn.access.namenodes Spark configuration property? My application does not need to access any name nodes directly, but maybe this would solve the problem?
Please advise where to look; any ideas would be of great help, thank you!
Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/04/06 10:53:40 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/04/06 10:53:42 INFO impl.TimelineClientImpl: Timeline service address: http://etl-hdp-yarn.foo.bar.com:8188/ws/v1/timeline/
15/04/06 10:53:42 INFO client.RMProxy: Connecting to ResourceManager at etl-hdp-yarn.foo.bar.com/192.168.0.16:8050
15/04/06 10:53:42 INFO yarn.Client: Requesting a new application from cluster with 4 NodeManagers
15/04/06 10:53:42 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (4096 MB per container)
15/04/06 10:53:42 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
15/04/06 10:53:42 INFO yarn.Client: Setting up container launch context for our AM
15/04/06 10:53:42 INFO yarn.Client: Preparing resources for our AM container
15/04/06 10:53:43 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
15/04/06 10:53:43 INFO yarn.Client: Uploading resource file:/var/home2/test/spark-1.3.0/assembly/target/scala-2.10/spark-assembly-1.3.0-hadoop2.6.0.jar -> hdfs://etl-hdp-nn1.foo.bar.com:8020/user/test/.sparkStaging/application_1427875242006_0029/spark-assembly-1.3.0-hadoop2.6.0.jar
15/04/06 10:53:44 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs:/user/test/jars/spark-examples-1.3.0-hadoop2.4.0.jar
15/04/06 10:53:44 INFO yarn.Client: Setting up the launch environment for our AM container
15/04/06 10:53:44 INFO spark.SecurityManager: Changing view acls to: test
15/04/06 10:53:44 INFO spark.SecurityManager: Changing modify acls to: test
15/04/06 10:53:44 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(test); users with modify permissions: Set(test)
15/04/06 10:53:44 INFO yarn.Client: Submitting application 29 to ResourceManager
15/04/06 10:53:44 INFO impl.YarnClientImpl: Submitted application application_1427875242006_0029
15/04/06 10:53:45 INFO yarn.Client: Application report for application_1427875242006_0029 (state: ACCEPTED)
15/04/06 10:53:45 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1428317623905
final status: UNDEFINED
tracking URL: http://etl-hdp-yarn.foo.bar.com:8088/proxy/application_1427875242006_0029/
user: test
15/04/06 10:53:46 INFO yarn.Client: Application report for application_1427875242006_0029 (state: ACCEPTED)
15/04/06 10:53:47 INFO yarn.Client: Application report for application_1427875242006_0029 (state: ACCEPTED)
15/04/06 10:53:48 INFO yarn.Client: Application report for application_1427875242006_0029 (state: ACCEPTED)
15/04/06 10:53:49 INFO yarn.Client: Application report for application_1427875242006_0029 (state: FAILED)
15/04/06 10:53:49 INFO yarn.Client:
client token: N/A
diagnostics: Application application_1427875242006_0029 failed 2 times due to AM Container for appattempt_1427875242006_0029_000002 exited with exitCode: 1
For more detailed output, check application tracking page:http://etl-hdp-yarn.foo.bar.com:8088/proxy/application_1427875242006_0029/Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1427875242006_0029_02_000001
Exit code: 1
Exception message: /mnt/hdfs01/hadoop/yarn/local/usercache/test/appcache/application_1427875242006_0029/container_1427875242006_0029_02_000001/launch_container.sh: line 27: $PWD:$PWD/__spark__.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure: bad substitution
Stack trace: ExitCodeException exitCode=1: /mnt/hdfs01/hadoop/yarn/local/usercache/test/appcache/application_1427875242006_0029/container_1427875242006_0029_02_000001/launch_container.sh: line 27: $PWD:$PWD/__spark__.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure: bad substitution
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1428317623905
final status: FAILED
tracking URL: http://etl-hdp-yarn.foo.bar.com:8088/cluster/app/application_1427875242006_0029
user: test
Exception in thread "main" org.apache.spark.SparkException: Application finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:622)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:647)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

If you are using Spark with HDP, you have to do the following.
Add these entries to your $SPARK_HOME/conf/spark-defaults.conf:
spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041 (your installed HDP version)
spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041 (your installed HDP version)
Create a java-opts file in $SPARK_HOME/conf and add the installed HDP version to it, like this:
-Dhdp.version=2.2.0.0-2041 (your installed HDP version)
To find the HDP version, run the command hdp-select status hadoop-client on the cluster.
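The entries above can be generated rather than typed by hand, which avoids typos in the version string. This is a minimal sketch, not part of the original answer: the function name hdp_spark_conf is made up for illustration, and the commented lines assume that hdp-select prints the version as the last field of its output (check this on your cluster first).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the two spark-defaults.conf entries for an HDP version.
hdp_spark_conf() {
  local version="$1"
  printf 'spark.driver.extraJavaOptions -Dhdp.version=%s\n' "$version"
  printf 'spark.yarn.am.extraJavaOptions -Dhdp.version=%s\n' "$version"
}

# Version taken from the question; replace it with your installed HDP version.
hdp_spark_conf "2.2.0.0-2041"

# Assumption: the version is the last whitespace-separated field of the
# hdp-select output. Not run here because it needs an HDP node.
# version="$(hdp-select status hadoop-client | awk '{print $NF}')"
# hdp_spark_conf "$version" >> "$SPARK_HOME/conf/spark-defaults.conf"
# echo "-Dhdp.version=$version" > "$SPARK_HOME/conf/java-opts"
```

Review the generated lines before appending them, since spark-defaults.conf may already contain extraJavaOptions entries that would need to be merged rather than duplicated.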

This is a bug in the HDP-Spark integration.
In your spark-defaults.conf, add the following lines:
spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041
spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041
This should help address the issue.

I think your Hadoop classpath is not set up. Note this part of the error:
lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure: bad substitution
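The "bad substitution" message comes from the shell itself: a dot is not valid in a shell variable name, so ${hdp.version} cannot be expanded when launch_container.sh runs. The placeholder is expected to be replaced by a real version before the script is generated, which is what the -Dhdp.version option in the answers above is for. A small reproduction that needs no Hadoop installation (the path is shortened from the log):

```shell
#!/usr/bin/env bash
# Single quotes keep the outer shell from touching ${hdp.version}; the inner
# bash tries to expand it and fails, just as launch_container.sh did.
if bash -c 'echo "/usr/hdp/${hdp.version}/hadoop/lib"' 2>/dev/null; then
  result="expanded"
else
  result="bad substitution"
fi
echo "$result"

# With the placeholder replaced by a literal version, the same line is valid.
fixed="$(bash -c 'echo "/usr/hdp/2.2.0.0-2041/hadoop/lib"')"
echo "$fixed"
```

If the container still fails after setting hdp.version, searching the generated launch_container.sh for a remaining ${hdp.version} shows which classpath entry was left unresolved.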

Spark 1.3.0 on YARN: Application failed 2 times due to AM Container

I run the Spark 1.3.0 Pi example on YARN (Hadoop 2.6.0.2.2.0.0-2041) with the following script:
# Run on a YARN cluster
export HADOOP_CONF_DIR=/etc/hadoop/conf
/var/home2/test/spark/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--executor-memory 3G \
--num-executors 50 \
/var/home2/test/spark/lib/spark-examples-1.3.0-hadoop2.4.0.jar \
1000
It fails with the "Application failed 2 times due to AM Container" message (please see below). As far as I understand, all necessary information to run a Spark application in YARN mode is provided in this launch script. What else should be configured to run on YARN? What is missing? Are there other reasons for the YARN launch to fail?
[test@etl-hdp-mgmt pi]$ ./run-pi.sh
Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/04/01 12:59:57 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/04/01 12:59:58 INFO client.RMProxy: Connecting to ResourceManager at etl-hdp-yarn.foo.bar.com/192.168.0.16:8050
15/04/01 12:59:58 INFO yarn.Client: Requesting a new application from cluster with 4 NodeManagers
15/04/01 12:59:58 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (4096 MB per container)
15/04/01 12:59:58 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
15/04/01 12:59:58 INFO yarn.Client: Setting up container launch context for our AM
15/04/01 12:59:58 INFO yarn.Client: Preparing resources for our AM container
15/04/01 12:59:59 WARN hdfs.BlockReaderLocal: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
15/04/01 12:59:59 INFO yarn.Client: Uploading resource file:/var/home2/test/spark-1.3.0-bin-hadoop2.4/lib/spark-assembly-1.3.0-hadoop2.4.0.jar -> hdfs://foo.bar.com:8020/user/test/.sparkStaging/application_1427875242006_0010/spark-assembly-1.3.0-hadoop2.4.0.jar
15/04/01 13:00:01 INFO yarn.Client: Uploading resource file:/var/home2/test/spark/lib/spark-examples-1.3.0-hadoop2.4.0.jar -> hdfs://foo.bar.com:8020/user/test/.sparkStaging/application_1427875242006_0010/spark-examples-1.3.0-hadoop2.4.0.jar
15/04/01 13:00:02 INFO yarn.Client: Setting up the launch environment for our AM container
15/04/01 13:00:03 INFO spark.SecurityManager: Changing view acls to: test
15/04/01 13:00:03 INFO spark.SecurityManager: Changing modify acls to: test
15/04/01 13:00:03 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(test); users with modify permissions: Set(test)
15/04/01 13:00:03 INFO yarn.Client: Submitting application 10 to ResourceManager
15/04/01 13:00:03 INFO impl.YarnClientImpl: Submitted application application_1427875242006_0010
15/04/01 13:00:04 INFO yarn.Client: Application report for application_1427875242006_0010 (state: ACCEPTED)
15/04/01 13:00:04 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1427893202566
final status: UNDEFINED
tracking URL: http://etl-hdp-yarn.foo.bar.com:8088/proxy/application_1427875242006_0010/
user: test
15/04/01 13:00:05 INFO yarn.Client: Application report for application_1427875242006_0010 (state: ACCEPTED)
15/04/01 13:00:06 INFO yarn.Client: Application report for application_1427875242006_0010 (state: ACCEPTED)
15/04/01 13:00:07 INFO yarn.Client: Application report for application_1427875242006_0010 (state: ACCEPTED)
15/04/01 13:00:08 INFO yarn.Client: Application report for application_1427875242006_0010 (state: ACCEPTED)
15/04/01 13:00:09 INFO yarn.Client: Application report for application_1427875242006_0010 (state: FAILED)
15/04/01 13:00:09 INFO yarn.Client:
client token: N/A
diagnostics: Application application_1427875242006_0010 failed 2 times due to AM Container for appattempt_1427875242006_0010_000002 exited with exitCode: 1
For more detailed output, check application tracking page:http://etl-hdp-yarn.foo.bar.com:8088/proxy/application_1427875242006_0010/Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1427875242006_0010_02_000001
Exit code: 1
Exception message: /mnt/hdfs01/hadoop/yarn/local/usercache/test/appcache/application_1427875242006_0010/container_1427875242006_0010_02_000001/launch_container.sh: line 27: $PWD:$PWD/__spark__.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure: bad substitution
Stack trace: ExitCodeException exitCode=1: /mnt/hdfs01/hadoop/yarn/local/usercache/test/appcache/application_1427875242006_0010/container_1427875242006_0010_02_000001/launch_container.sh: line 27: $PWD:$PWD/__spark__.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure: bad substitution
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1427893202566
final status: FAILED
tracking URL: http://etl-hdp-yarn.foo.bar.com:8088/cluster/app/application_1427875242006_0010
user: test
Exception in thread "main" org.apache.spark.SparkException: Application finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:622)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:647)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Run
yarn logs -applicationId application_1427875242006_0010 > /tmp/application_1427875242006_0010
The logs there should indicate why it failed.
"Failed 2 times" appears because in yarn-cluster mode the driver runs inside the ApplicationMaster (AM), and YARN attempts the AM 2 times by default.
So your driver is run twice before the application is marked as failed.
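When the submit output has been saved to a file, the application ID does not have to be copied by hand. A minimal sketch, assuming the ID appears in the usual application_<cluster timestamp>_<sequence> form seen in the logs above; the helper name extract_app_id is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read spark-submit output on stdin and print the first
# YARN application ID found in it.
extract_app_id() {
  grep -o 'application_[0-9]\{1,\}_[0-9]\{1,\}' | head -n 1
}

# Sample line copied from the question's log.
sample='15/04/01 13:00:03 INFO impl.YarnClientImpl: Submitted application application_1427875242006_0010'
app_id="$(printf '%s\n' "$sample" | extract_app_id)"
echo "$app_id"

# On a cluster node (not run here, it needs a YARN client):
# yarn logs -applicationId "$app_id" > "/tmp/$app_id"
# grep -n -i -E 'error|exception' "/tmp/$app_id" | head
```

While debugging it can help to stop after a single attempt so the log contains only one failure. Spark documents a spark.yarn.maxAppAttempts setting for this; whether it is available depends on your Spark version, so treat that as something to verify rather than a given.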

I totally agree with @SeanOwen. Follow the Spark building documentation.
You need to compile Spark for YARN using the correct configuration for your Hadoop cluster (version, Hive support, etc.).
The problem won't persist then!

This is a problem with Spark communicating with the ApplicationMaster.
The RM and NM talk to each other over RPC, so the problem could be that the container launch script (launch_container.cmd on Windows, launch_container.sh on Linux as in the log above) is not running correctly. Check that the NM is communicating with the RM while the job is being submitted.
Try adding this to your yarn-site.xml:
<property>
<name>yarn.nodemanager.delete.debug-delay-sec</name>
<value>1200</value>
</property>
This ensures that the launch script named in the NM error (launch_container.cmd, or launch_container.sh on Linux) is not deleted right away: it will remain for 20 minutes, and you can increase 1200 to a higher number if needed. You can then run that script manually from the container directory and see where it bails out.
Hope this helps.
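Once the debug delay is in place, the retained script has to be located on the NodeManager host that ran the container. A minimal sketch, assuming the usercache/<user>/appcache/<application>/<container> layout visible in the error message above; the helper name find_launch_scripts is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list retained launch scripts for one application under
# a NodeManager local directory.
find_launch_scripts() {
  local nm_local_dir="$1" app_id="$2"
  find "$nm_local_dir/usercache" -type f -name 'launch_container.*' \
       -path "*/appcache/$app_id/*" 2>/dev/null | sort
}

# Example with the paths from the log (run on the NodeManager host):
# find_launch_scripts /mnt/hdfs01/hadoop/yarn/local application_1427875242006_0010
# Then trace one of the printed scripts to see which line fails:
# bash -x /path/printed/above/launch_container.sh
```

Running the script with bash -x prints each command before executing it, so the failing export line is easy to spot. Run it as the same user YARN used, since the container directory is normally owned by that user.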

I also faced a similar problem. Actually, you do not need to specify --master yarn-cluster when you are running your self-contained application in the cluster.
This problem has been solved on the Cloudera forum; see https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Issue-running-spark-application-in-Yarn-cluster-mode/td-p/44570

Spark streaming job on YARN cluster mode stuck in accepted, then fails with a Timeout Exception - spark-streaming

The problem might be with the YARN "App Timeline Server". Try restarting it.

Are you creating your Spark session with master set to local? Please check this.
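A quick way to check is to search the application source for a hard-coded master. Spark's configuration documentation states that properties set directly on SparkConf take precedence over spark-submit flags, so a leftover setMaster("local[...]") would override --master. This is a minimal sketch; the helper name find_local_master and the src/main path are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print file:line:text for every source line under a
# directory that hard-codes a local master.
find_local_master() {
  grep -rnE '(setMaster|\.master)\("local' "$1"
}

# Example (path is an assumption, adjust it to your project layout):
# find_local_master src/main
```

If the search finds a match, removing the hard-coded master and passing it through spark-submit keeps the same jar usable in local, client and cluster modes.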
