I am running spark-submit in YARN client mode. YARN has been set up on the HDP sandbox with Kerberos enabled, and the HDP sandbox is running in a Docker container on a Mac host.
When spark-submit is run from inside the sandbox's Docker container it runs successfully, but when it is run from the host machine it fails immediately after the ACCEPTED state with this error:
19/07/28 00:41:21 INFO yarn.Client: Application report for application_1564298049378_0008 (state: ACCEPTED)
19/07/28 00:41:22 INFO yarn.Client: Application report for application_1564298049378_0008 (state: ACCEPTED)
19/07/28 00:41:23 INFO yarn.Client: Application report for application_1564298049378_0008 (state: FAILED)
19/07/28 00:41:23 INFO yarn.Client:
client token: N/A
diagnostics: Application application_1564298049378_0008 failed 2 times due to AM Container for appattempt_1564298049378_0008_000002 exited with exitCode: -1000
Failing this attempt.Diagnostics: (Client.java:1558)
... 37 more
Caused by: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
I could not find any more information about the failure. Any help will be greatly appreciated.
Here is the ResourceManager log:
2019-07-28 22:39:04,654 INFO resourcemanager.ClientRMService (ClientRMService.java:getNewApplicationId(341)) - Allocated new applicationId: 20
2019-07-28 22:39:10,982 INFO capacity.CapacityScheduler (CapacityScheduler.java:checkAndGetApplicationPriority(2526)) - Application 'application_1564332457320_0020' is submitted without priority hence considering default queue/cluster priority: 0
2019-07-28 22:39:10,982 INFO capacity.CapacityScheduler (CapacityScheduler.java:checkAndGetApplicationPriority(2547)) - Priority '0' is acceptable in queue : santosh for application: application_1564332457320_0020
2019-07-28 22:39:10,983 WARN rmapp.RMAppImpl (RMAppImpl.java:(473)) - The specific max attempts: 0 for application: 20 is invalid, because it is out of the range [1, 2]. Use the global max attempts instead.
2019-07-28 22:39:10,983 INFO collector.TimelineCollectorManager (TimelineCollectorManager.java:putIfAbsent(142)) - the collector for application_1564332457320_0020 was added
2019-07-28 22:39:10,984 INFO resourcemanager.ClientRMService (ClientRMService.java:submitApplication(648)) - Application with id 20 submitted by user santosh
2019-07-28 22:39:10,984 INFO security.DelegationTokenRenewer (DelegationTokenRenewer.java:handleAppSubmitEvent(458)) - application_1564332457320_0020 found existing hdfs token Kind: HDFS_DELEGATION_TOKEN, Service: 192.168.50.1:8020, Ident: (token for santosh: HDFS_DELEGATION_TOKEN owner=santosh#XXX.XX, renewer=yarn, realUser=, issueDate=1564353550169, maxDate=1564958350169, sequenceNumber=125, masterKeyId=20)
2019-07-28 22:39:11,011 INFO security.DelegationTokenRenewer (DelegationTokenRenewer.java:renewToken(635)) - Renewed delegation-token= [Kind: HDFS_DELEGATION_TOKEN, Service: 192.168.50.1:8020, Ident: (token for santosh: HDFS_DELEGATION_TOKEN owner=santosh#XXX.XX, renewer=yarn, realUser=, issueDate=1564353550169, maxDate=1564958350169, sequenceNumber=125, masterKeyId=20);exp=1564439951007; apps=[application_1564332457320_0020]]
2019-07-28 22:39:11,011 INFO security.DelegationTokenRenewer (DelegationTokenRenewer.java:setTimerForTokenRenewal(613)) - Renew Kind: HDFS_DELEGATION_TOKEN, Service: 192.168.50.1:8020, Ident: (token for santosh: HDFS_DELEGATION_TOKEN owner=santosh#XXX.XX, renewer=yarn, realUser=, issueDate=1564353550169, maxDate=1564958350169, sequenceNumber=125, masterKeyId=20);exp=1564439951007; apps=[application_1564332457320_0020] in 86399996 ms, appId = [application_1564332457320_0020]
2019-07-28 22:39:11,011 INFO rmapp.RMAppImpl (RMAppImpl.java:transition(1259)) - Storing application with id application_1564332457320_0020
2019-07-28 22:39:11,012 INFO rmapp.RMAppImpl (RMAppImpl.java:handle(912)) - application_1564332457320_0020 State change from NEW to NEW_SAVING on event = START
2019-07-28 22:39:11,012 INFO recovery.RMStateStore (RMStateStore.java:transition(222)) - Storing info for app: application_1564332457320_0020
2019-07-28 22:39:11,022 INFO rmapp.RMAppImpl (RMAppImpl.java:handle(912)) - application_1564332457320_0020 State change from NEW_SAVING to SUBMITTED on event = APP_NEW_SAVED
2019-07-28 22:39:11,022 INFO capacity.ParentQueue (ParentQueue.java:addApplication(494)) - Application added - appId: application_1564332457320_0020 user: santosh leaf-queue of parent: root #applications: 1
2019-07-28 22:39:11,023 INFO capacity.CapacityScheduler (CapacityScheduler.java:addApplication(990)) - Accepted application application_1564332457320_0020 from user: santosh, in queue: santosh
2019-07-28 22:39:11,023 INFO rmapp.RMAppImpl (RMAppImpl.java:handle(912)) - application_1564332457320_0020 State change from SUBMITTED to ACCEPTED on event = APP_ACCEPTED
2019-07-28 22:39:11,023 INFO resourcemanager.ApplicationMasterService (ApplicationMasterService.java:registerAppAttempt(479)) - Registering app attempt : appattempt_1564332457320_0020_000001
2019-07-28 22:39:11,024 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(925)) - appattempt_1564332457320_0020_000001 State change from NEW to SUBMITTED on event = START
2019-07-28 22:39:11,024 INFO capacity.LeafQueue (LeafQueue.java:activateApplications(911)) - Application application_1564332457320_0020 from user: santosh activated in queue: santosh
2019-07-28 22:39:11,025 INFO capacity.LeafQueue (LeafQueue.java:addApplicationAttempt(941)) - Application added - appId: application_1564332457320_0020 user: santosh, leaf-queue: santosh #user-pending-applications: 0 #user-active-applications: 1 #queue-pending-applications: 0 #queue-active-applications: 1
2019-07-28 22:39:11,025 INFO capacity.CapacityScheduler (CapacityScheduler.java:addApplicationAttempt(1036)) - Added Application Attempt appattempt_1564332457320_0020_000001 to scheduler from user santosh in queue santosh
2019-07-28 22:39:11,028 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(925)) - appattempt_1564332457320_0020_000001 State change from SUBMITTED to SCHEDULED on event = ATTEMPT_ADDED
2019-07-28 22:39:11,033 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1564332457320_0020_000001 container=null queue=santosh clusterResource= type=OFF_SWITCH requestedPartition=
2019-07-28 22:39:11,034 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e20_1564332457320_0020_01_000001 Container Transitioned from NEW to ALLOCATED
2019-07-28 22:39:11,035 INFO fica.FiCaSchedulerNode (FiCaSchedulerNode.java:allocateContainer(169)) - Assigned container container_e20_1564332457320_0020_01_000001 of capacity on host sandbox-hdp.hortonworks.com:45454, which has 1 containers, used and available after allocation
2019-07-28 22:39:11,038 INFO security.NMTokenSecretManagerInRM (NMTokenSecretManagerInRM.java:createAndGetNMToken(200)) - Sending NMToken for nodeId : sandbox-hdp.hortonworks.com:45454 for container : container_e20_1564332457320_0020_01_000001
2019-07-28 22:39:11,043 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e20_1564332457320_0020_01_000001 Container Transitioned from ALLOCATED to ACQUIRED
2019-07-28 22:39:11,043 INFO security.NMTokenSecretManagerInRM (NMTokenSecretManagerInRM.java:clearNodeSetForAttempt(146)) - Clear node set for appattempt_1564332457320_0020_000001
2019-07-28 22:39:11,044 INFO capacity.ParentQueue (ParentQueue.java:apply(1332)) - assignedContainer queue=root usedCapacity=0.25 absoluteUsedCapacity=0.25 used= cluster=
2019-07-28 22:39:11,044 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2890)) - Allocation proposal accepted
2019-07-28 22:39:11,044 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:storeAttempt(2213)) - Storing attempt: AppId: application_1564332457320_0020 AttemptId: appattempt_1564332457320_0020_000001 MasterContainer: Container: [ContainerId: container_e20_1564332457320_0020_01_000001, AllocationRequestId: -1, Version: 0, NodeId: sandbox-hdp.hortonworks.com:45454, NodeHttpAddress: sandbox-hdp.hortonworks.com:8042, Resource: , Priority: 0, Token: Token { kind: ContainerToken, service: 172.18.0.3:45454 }, ExecutionType: GUARANTEED, ]
2019-07-28 22:39:11,051 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(925)) - appattempt_1564332457320_0020_000001 State change from SCHEDULED to ALLOCATED_SAVING on event = CONTAINER_ALLOCATED
2019-07-28 22:39:11,057 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(925)) - appattempt_1564332457320_0020_000001 State change from ALLOCATED_SAVING to ALLOCATED on event = ATTEMPT_NEW_SAVED
2019-07-28 22:39:11,060 INFO amlauncher.AMLauncher (AMLauncher.java:run(307)) - Launching masterappattempt_1564332457320_0020_000001
2019-07-28 22:39:11,068 INFO amlauncher.AMLauncher (AMLauncher.java:launch(109)) - Setting up container Container: [ContainerId: container_e20_1564332457320_0020_01_000001, AllocationRequestId: -1, Version: 0, NodeId: sandbox-hdp.hortonworks.com:45454, NodeHttpAddress: sandbox-hdp.hortonworks.com:8042, Resource: , Priority: 0, Token: Token { kind: ContainerToken, service: 172.18.0.3:45454 }, ExecutionType: GUARANTEED, ] for AM appattempt_1564332457320_0020_000001
2019-07-28 22:39:11,069 INFO security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:createAndGetAMRMToken(195)) - Create AMRMToken for ApplicationAttempt: appattempt_1564332457320_0020_000001
2019-07-28 22:39:11,069 INFO security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:createPassword(307)) - Creating password for appattempt_1564332457320_0020_000001
2019-07-28 22:39:11,265 INFO amlauncher.AMLauncher (AMLauncher.java:launch(130)) - Done launching container Container: [ContainerId: container_e20_1564332457320_0020_01_000001, AllocationRequestId: -1, Version: 0, NodeId: sandbox-hdp.hortonworks.com:45454, NodeHttpAddress: sandbox-hdp.hortonworks.com:8042, Resource: , Priority: 0, Token: Token { kind: ContainerToken, service: 172.18.0.3:45454 }, ExecutionType: GUARANTEED, ] for AM appattempt_1564332457320_0020_000001
2019-07-28 22:39:11,265 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(925)) - appattempt_1564332457320_0020_000001 State change from ALLOCATED to LAUNCHED on event = LAUNCHED
2019-07-28 22:39:11,852 INFO resourcemanager.ResourceTrackerService (ResourceTrackerService.java:updateAppCollectorsMap(713)) - Update collector information for application application_1564332457320_0020 with new address: sandbox-hdp.hortonworks.com:35197 timestamp: 1564332457320, 36
2019-07-28 22:39:11,854 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e20_1564332457320_0020_01_000001 Container Transitioned from ACQUIRED to RUNNING
2019-07-28 22:39:12,833 INFO provider.BaseAuditHandler (BaseAuditHandler.java:logStatus(312)) - Audit Status Log: name=yarn.async.batch.hdfs, interval=01:11.979 minutes, events=162, succcessCount=162, totalEvents=17347, totalSuccessCount=17347
2019-07-28 22:39:12,834 INFO destination.HDFSAuditDestination (HDFSAuditDestination.java:logJSON(179)) - Flushing HDFS audit. Event Size:1
2019-07-28 22:39:12,857 INFO resourcemanager.ResourceTrackerService (ResourceTrackerService.java:updateAppCollectorsMap(713)) - Update collector information for application application_1564332457320_0020 with new address: sandbox-hdp.hortonworks.com:35197 timestamp: 1564332457320, 37
2019-07-28 22:39:14,054 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e20_1564332457320_0020_01_000001 Container Transitioned from RUNNING to COMPLETED
2019-07-28 22:39:14,055 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:rememberTargetTransitionsAndStoreState(1412)) - Updating application attempt appattempt_1564332457320_0020_000001 with final state: FAILED, and exit status: -1000
2019-07-28 22:39:14,055 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(925)) - appattempt_1564332457320_0020_000001 State change from LAUNCHED to FINAL_SAVING on event = CONTAINER_FINISHED
2019-07-28 22:39:14,066 INFO resourcemanager.ApplicationMasterService (ApplicationMasterService.java:unregisterAttempt(496)) - Unregistering app attempt : appattempt_1564332457320_0020_000001
2019-07-28 22:39:14,066 INFO security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:applicationMasterFinished(124)) - Application finished, removing password for appattempt_1564332457320_0020_000001
2019-07-28 22:39:14,066 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(925)) - appattempt_1564332457320_0020_000001 State change from FINAL_SAVING to FAILED on event = ATTEMPT_UPDATE_SAVED
2019-07-28 22:39:14,067 INFO rmapp.RMAppImpl (RMAppImpl.java:transition(1538)) - The number of failed attempts is 1. The max attempts is 2
2019-07-28 22:39:14,067 INFO resourcemanager.ApplicationMasterService (ApplicationMasterService.java:registerAppAttempt(479)) - Registering app attempt : appattempt_1564332457320_0020_000002
2019-07-28 22:39:14,067 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(925)) - appattempt_1564332457320_0020_000002 State change from NEW to SUBMITTED on event = START
2019-07-28 22:39:14,067 INFO capacity.CapacityScheduler (CapacityScheduler.java:doneApplicationAttempt(1085)) - Application Attempt appattempt_1564332457320_0020_000001 is done. finalState=FAILED
2019-07-28 22:39:14,067 INFO scheduler.AppSchedulingInfo (AppSchedulingInfo.java:clearRequests(159)) - Application application_1564332457320_0020 requests cleared
2019-07-28 22:39:14,067 INFO capacity.LeafQueue (LeafQueue.java:removeApplicationAttempt(1003)) - Application removed - appId: application_1564332457320_0020 user: santosh queue: santosh #user-pending-applications: 0 #user-active-applications: 0 #queue-pending-applications: 0 #queue-active-applications: 0
2019-07-28 22:39:14,068 INFO capacity.LeafQueue (LeafQueue.java:activateApplications(911)) - Application application_1564332457320_0020 from user: santosh activated in queue: santosh
2019-07-28 22:39:14,068 INFO capacity.LeafQueue (LeafQueue.java:addApplicationAttempt(941)) - Application added - appId: application_1564332457320_0020 user: santosh, leaf-queue: santosh #user-pending-applications: 0 #user-active-applications: 1 #queue-pending-applications: 0 #queue-active-applications: 1
2019-07-28 22:39:14,068 INFO capacity.CapacityScheduler (CapacityScheduler.java:addApplicationAttempt(1036)) - Added Application Attempt appattempt_1564332457320_0020_000002 to scheduler from user santosh in queue santosh
2019-07-28 22:39:14,068 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(925)) - appattempt_1564332457320_0020_000002 State change from SUBMITTED to SCHEDULED on event = ATTEMPT_ADDED
2019-07-28 22:39:14,074 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1564332457320_0020_000002 container=null queue=santosh clusterResource= type=OFF_SWITCH requestedPartition=
2019-07-28 22:39:14,074 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e20_1564332457320_0020_02_000001 Container Transitioned from NEW to ALLOCATED
2019-07-28 22:39:14,075 INFO fica.FiCaSchedulerNode (FiCaSchedulerNode.java:allocateContainer(169)) - Assigned container container_e20_1564332457320_0020_02_000001 of capacity on host sandbox-hdp.hortonworks.com:45454, which has 1 containers, used and available after allocation
2019-07-28 22:39:14,075 INFO security.NMTokenSecretManagerInRM (NMTokenSecretManagerInRM.java:createAndGetNMToken(200)) - Sending NMToken for nodeId : sandbox-hdp.hortonworks.com:45454 for container : container_e20_1564332457320_0020_02_000001
2019-07-28 22:39:14,076 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e20_1564332457320_0020_02_000001 Container Transitioned from ALLOCATED to ACQUIRED
2019-07-28 22:39:14,076 INFO security.NMTokenSecretManagerInRM (NMTokenSecretManagerInRM.java:clearNodeSetForAttempt(146)) - Clear node set for appattempt_1564332457320_0020_000002
2019-07-28 22:39:14,076 INFO capacity.ParentQueue (ParentQueue.java:apply(1332)) - assignedContainer queue=root usedCapacity=0.25 absoluteUsedCapacity=0.25 used= cluster=
2019-07-28 22:39:14,076 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2890)) - Allocation proposal accepted
2019-07-28 22:39:14,076 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:storeAttempt(2213)) - Storing attempt: AppId: application_1564332457320_0020 AttemptId: appattempt_1564332457320_0020_000002 MasterContainer: Container: [ContainerId: container_e20_1564332457320_0020_02_000001, AllocationRequestId: -1, Version: 0, NodeId: sandbox-hdp.hortonworks.com:45454, NodeHttpAddress: sandbox-hdp.hortonworks.com:8042, Resource: , Priority: 0, Token: Token { kind: ContainerToken, service: 172.18.0.3:45454 }, ExecutionType: GUARANTEED, ]
2019-07-28 22:39:14,077 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(925)) - appattempt_1564332457320_0020_000002 State change from SCHEDULED to ALLOCATED_SAVING on event = CONTAINER_ALLOCATED
2019-07-28 22:39:14,088 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(925)) - appattempt_1564332457320_0020_000002 State change from ALLOCATED_SAVING to ALLOCATED on event = ATTEMPT_NEW_SAVED
2019-07-28 22:39:14,089 INFO amlauncher.AMLauncher (AMLauncher.java:run(307)) - Launching masterappattempt_1564332457320_0020_000002
2019-07-28 22:39:14,091 INFO amlauncher.AMLauncher (AMLauncher.java:launch(109)) - Setting up container Container: [ContainerId: container_e20_1564332457320_0020_02_000001, AllocationRequestId: -1, Version: 0, NodeId: sandbox-hdp.hortonworks.com:45454, NodeHttpAddress: sandbox-hdp.hortonworks.com:8042, Resource: , Priority: 0, Token: Token { kind: ContainerToken, service: 172.18.0.3:45454 }, ExecutionType: GUARANTEED, ] for AM appattempt_1564332457320_0020_000002
2019-07-28 22:39:14,092 INFO security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:createAndGetAMRMToken(195)) - Create AMRMToken for ApplicationAttempt: appattempt_1564332457320_0020_000002
2019-07-28 22:39:14,092 INFO security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:createPassword(307)) - Creating password for appattempt_1564332457320_0020_000002
2019-07-28 22:39:14,110 INFO amlauncher.AMLauncher (AMLauncher.java:launch(130)) - Done launching container Container: [ContainerId: container_e20_1564332457320_0020_02_000001, AllocationRequestId: -1, Version: 0, NodeId: sandbox-hdp.hortonworks.com:45454, NodeHttpAddress: sandbox-hdp.hortonworks.com:8042, Resource: , Priority: 0, Token: Token { kind: ContainerToken, service: 172.18.0.3:45454 }, ExecutionType: GUARANTEED, ] for AM appattempt_1564332457320_0020_000002
2019-07-28 22:39:14,110 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(925)) - appattempt_1564332457320_0020_000002 State change from ALLOCATED to LAUNCHED on event = LAUNCHED
2019-07-28 22:39:15,056 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e20_1564332457320_0020_02_000001 Container Transitioned from ACQUIRED to RUNNING
2019-07-28 22:39:16,752 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e20_1564332457320_0020_02_000001 Container Transitioned from RUNNING to COMPLETED
2019-07-28 22:39:16,755 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:rememberTargetTransitionsAndStoreState(1412)) - Updating application attempt appattempt_1564332457320_0020_000002 with final state: FAILED, and exit status: -1000
2019-07-28 22:39:16,755 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(925)) - appattempt_1564332457320_0020_000002 State change from LAUNCHED to FINAL_SAVING on event = CONTAINER_FINISHED
2019-07-28 22:39:16,899 INFO resourcemanager.ApplicationMasterService (ApplicationMasterService.java:unregisterAttempt(496)) - Unregistering app attempt : appattempt_1564332457320_0020_000002
2019-07-28 22:39:16,900 INFO security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:applicationMasterFinished(124)) - Application finished, removing password for appattempt_1564332457320_0020_000002
2019-07-28 22:39:16,900 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(925)) - appattempt_1564332457320_0020_000002 State change from FINAL_SAVING to FAILED on event = ATTEMPT_UPDATE_SAVED
2019-07-28 22:39:16,900 INFO rmapp.RMAppImpl (RMAppImpl.java:transition(1538)) - The number of failed attempts is 2. The max attempts is 2
2019-07-28 22:39:16,900 INFO rmapp.RMAppImpl (RMAppImpl.java:rememberTargetTransitionsAndStoreState(1278)) - Updating application application_1564332457320_0020 with final state: FAILED
2019-07-28 22:39:16,900 INFO rmapp.RMAppImpl (RMAppImpl.java:handle(912)) - application_1564332457320_0020 State change from ACCEPTED to FINAL_SAVING on event = ATTEMPT_FAILED
2019-07-28 22:39:16,900 INFO recovery.RMStateStore (RMStateStore.java:transition(260)) - Updating info for app: application_1564332457320_0020
2019-07-28 22:39:16,900 INFO capacity.CapacityScheduler (CapacityScheduler.java:doneApplicationAttempt(1085)) - Application Attempt appattempt_1564332457320_0020_000002 is done. finalState=FAILED
2019-07-28 22:39:16,901 INFO scheduler.AppSchedulingInfo (AppSchedulingInfo.java:clearRequests(159)) - Application application_1564332457320_0020 requests cleared
2019-07-28 22:39:16,901 INFO capacity.LeafQueue (LeafQueue.java:removeApplicationAttempt(1003)) - Application removed - appId: application_1564332457320_0020 user: santosh queue: santosh #user-pending-applications: 0 #user-active-applications: 0 #queue-pending-applications: 0 #queue-active-applications: 0
2019-07-28 22:39:16,916 INFO rmapp.RMAppImpl (RMAppImpl.java:transition(1197)) - Application application_1564332457320_0020 failed 2 times due to AM Container for appattempt_1564332457320_0020_000002 exited with exitCode: -1000
Failing this attempt.Diagnostics: (Client.java:1558)
at org.apache.hadoop.ipc.Client.call(Client.java:1389)
... 37 more
Caused by: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:173)
at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:390)
at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:614)
at org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:410)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:800)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:796)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:796)
... 40 more
Caused by: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
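In case it helps, these are the checks I have been running from the Mac host while trying to narrow this down. This is only a debugging sketch; the principal and realm below are placeholders for my own setup:

# Confirm the host actually holds a valid Kerberos ticket for the submitting user
klist

# If the ticket is missing or expired, obtain a fresh one (placeholder principal/realm)
kinit santosh@EXAMPLE.COM

# Verify HDFS access from the host with the same ticket
hdfs dfs -ls /user/santosh

# Pull the AM container logs of the failed application to look for the full
# stack trace behind exitCode -1000
yarn logs -applicationId application_1564298049378_0008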
I have a Docker swarm running a number of services, and I'm using the Elastic Stack (Kibana, Elasticsearch, Filebeat, etc.) for monitoring.
For the business logic I'm writing logs and using Filebeat to ship them to Logstash, then analyzing the data in Kibana.
But I'm having trouble monitoring the liveness of my Docker services. Some of them are deployed globally (like Filebeat) and some of them have a number of replicas. I want to be able to see in Kibana that the number of running containers is equal to the number the service should have. I'm trying to use Metricbeat with the docker module; the most useful metricset I've found is container, but it doesn't seem to contain enough information for me to display or analyze the number of instances of a service.
I'd appreciate any advice on how to achieve this.
The Metricbeat config:
metricbeat.autodiscover:
  providers:
    - type: docker
      hints.enabled: true

metricbeat.modules:
  - module: docker
    enabled: true
    metricsets:
      - container
      - healthcheck
      - info
    period: 10s
    hosts: [ "unix:///var/run/docker.sock" ]

processors:
  - add_docker_metadata: ~
  - add_locale:
      format: offset

output.logstash:
  hosts: [ "mylogstash.com" ]
The container metricset event data (the relevant docker part):
...
"docker" : {
"container": {
"id": "60983ad304e13cb0245a589ce843100da82c5fv9e093aad68abb439cdc2f3044"
"status": "Up 3 weeks",
"command": "./entrypoint.sh",
"image": "registry.com/myimage",
"created": "2019-04-08T11:38:10.000Z",
"name": "mystack_myservice.wuiqep73p99hcbto2kgv6vhr2.mufs70y24k5388jxv782in18f",
"ip_addresses": [ "10.0.0.148" ]
"labels" : {
"com_dokcer_swarm_node_id": "wuiqep73p99hcbto2kgv6vhr2",
"com_docker_swarm_task_name": "stack_service.wuiqep73p99hcbto2kgv6vhr2.mufs70y24k5388jxv782in18f",
"com_docker_swarm_service_id": "kxm5dk43yzyzpemcbz23s21xo",
"com_docker_swarn_task_id": "mufs70y24k5388jxv782in18f",
"com_docker_swarm_task" : "",
"com_docker_stack_namespace": "mystack",
"com_docker_swarm_service_name": "mystack_myservice"
},
"size": {
"rw": 0,
"root_fs": 0
}
}
}
...
For future reference:
I wrote a bash script that runs at an interval and writes a JSON log line for each of the swarm services; the script is wrapped in a Docker image and deployed as a logger service in the swarm.
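Roughly, it looks like the sketch below. This is simplified; the interval and log file path are placeholders, and it assumes the script runs on a manager node with access to the Docker socket so that Filebeat can pick the file up:

#!/bin/bash
# One JSON line per swarm service with running vs. desired replica counts,
# so Kibana can compare the two.
INTERVAL="${INTERVAL:-60}"                          # seconds between samples (placeholder)
LOG_FILE="${LOG_FILE:-/var/log/swarm-services.log}" # placeholder path, shipped by Filebeat

while true; do
  ts="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  # `docker service ls` prints replicas as "running/desired", e.g. "3/3"
  docker service ls --format '{{.Name}} {{.Replicas}}' |
  while read -r name replicas; do
    running="${replicas%%/*}"
    desired="${replicas#*/}"
    desired="${desired%% *}"   # drop a trailing "(max N per node)" note if present
    printf '{"@timestamp":"%s","service":"%s","running":%s,"desired":%s}\n' \
      "$ts" "$name" "$running" "$desired" >> "$LOG_FILE"
  done
  sleep "$INTERVAL"
done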
I am trying to set up a Jelastic clustered database as described in Setting Up Auto-Clusterization with Cloud Scripting but I don't see documentation there that describes how to either set or retrieve the cluster username and password.
I did try passing db_user and db_pass to the cluster (names I found in some of the sample JPS files), as well as providing them as settings, but the credentials were still just the Jelastic-generated ones.
Here is the JPS I am trying to use; it includes a simple Debian container that requires the database credentials as environment variables. In this case the Docker container includes just the MariaDB client for testing purposes; the real environment is a bit more complex than that, running startup scripts that need the database connection.
{
  "version": "1.5",
  "type": "install",
  "name": "Database test",
  "skipNodeEmails": true,
  "globals":
  {
    "MYSQL_ROOT_USERNAME": "root",
    "MYSQL_ROOT_PASSWORD": "${fn.password(20)}",
    "MYSQL_USERNAME": "username",
    "MYSQL_PASSWORD": "${fn.password(20)}",
    "MYSQL_DATABASE": "database",
    "MYSQL_HOSTNAME": "ProxySQL"
  },
  "nodes":
  [
    {
      "image": "mireiawen/debian-sql",
      "count": 1,
      "cloudlets": 8,
      "nodeGroup": "vds",
      "displayName": "SQL worker",
      "env":
      {
        "MYSQL_ROOT_USERNAME": "${globals.MYSQL_ROOT_USERNAME}",
        "MYSQL_ROOT_PASSWORD": "${globals.MYSQL_ROOT_PASSWORD}",
        "MYSQL_USERNAME": "${globals.MYSQL_USERNAME}",
        "MYSQL_PASSWORD": "${globals.MYSQL_PASSWORD}",
        "MYSQL_DATABASE": "${globals.MYSQL_DATABASE}",
        "MYSQL_HOSTNAME": "${globals.MYSQL_HOSTNAME}"
      }
    },
    {
      "nodeType": "mariadb-dockerized",
      "nodeGroup": "sqldb",
      "count": "2",
      "cloudlets": 16,
      "cluster":
      {
        "scheme": "master"
      }
    }
  ]
}
This JPS seems to launch the MariaDB master-master cluster correctly with ProxySQL included; I am just missing documentation on how to either provide the database credentials to the database cluster, or retrieve the generated ones as variables in the JPS so I can pass them to the containers.
The mechanism has been improved, so now you can pass custom credentials to the cluster using either environment variables or cluster settings:
type: install
name: env. variables
nodes:
  nodeType: mariadb-dockerized
  nodeGroup: sqldb
  count: 2
  cloudlets: 8
  env:
    DB_USER: customuser
    DB_PASS: custompass
  cluster:
    scheme: master
or
type: install
name: cluster settings
nodes:
  nodeType: mariadb-dockerized
  nodeGroup: sqldb
  count: 2
  cloudlets: 8
  cluster:
    scheme: master
    db_user: customuser
    db_pass: custompass
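Either way, once the environment is up you can quickly check that the custom credentials were applied by connecting from any container that has the MySQL client installed (assuming the ProxySQL node is reachable under the hostname proxy, and using the values from the examples above):

# Should succeed with the custom credentials instead of the generated ones
mysql -h proxy -u customuser -pcustompass -e 'SELECT 1;'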
Thank you for the good question. The mechanism for passing custom credentials should be, and will be, improved soon. At the moment you can use the example below: in short, we disable automated clustering and then enable it again with a custom username and password.
---
version: 1.5
type: install
name: Database test
skipNodeEmails: true
baseUrl: https://raw.githubusercontent.com/jelastic-jps/mysql-cluster/master

globals:
  logic_jps: ${baseUrl}/addons/auto-clustering/scripts/auto-cluster-logic.jps
  MYSQL_USERNAME: username
  MYSQL_PASSWORD: ${fn.password(20)}

nodes:
  - image: mireiawen/debian-sql
    count: 1
    cloudlets: 8
    nodeGroup: extra
    displayName: SQL worker
    env:
      MYSQL_USERNAME: ${globals.MYSQL_USERNAME}
      MYSQL_PASSWORD: ${globals.MYSQL_PASSWORD}
  - nodeType: mariadb-dockerized
    nodeGroup: sqldb
    count: 2
    cloudlets: 16
    cluster: false

onInstall:
  install:
    jps: ${globals.logic_jps}
    envName: ${env.envName}
    nodeGroup: sqldb
    settings:
      path: ${baseUrl}
      scheme: master
      logic_jps: ${globals.logic_jps}
      db_user: ${globals.MYSQL_USERNAME}
      db_pass: ${globals.MYSQL_PASSWORD}
      repl_user: repl-${fn.random}
      repl_pass: ${fn.password(20)}
After the environment is ready, you can test the connection by executing the following command in your Docker image:
mysql -h proxy -u $MYSQL_USERNAME -p$MYSQL_PASSWORD
My kops configuration has the feature gates and policies as follows (notice that I added a test policy of EC2:* just to make sure the masters and nodes had all the permissions):
spec:
  additionalPolicies:
    master: |
      [
        {
          "Effect": "Allow",
          "Action": ["ec2:*"],
          "Resource": ["*"]
        }
      ]
    node: |
      [
        {
          "Effect": "Allow",
          "Action": ["ec2:*"],
          "Resource": ["*"]
        }
      ]
and
kubeAPIServer:
  featureGates:
    ExpandPersistentVolumes: "true"
kubeControllerManager:
  featureGates:
    ExpandPersistentVolumes: "true"
kubelet:
  featureGates:
    ExpandPersistentVolumes: "true"
I created a PVC as follows:
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: kek-pvc-2
  namespace: kekdev
  labels:
    app: kek
    storage: persistent
  annotations:
    volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/aws-ebs
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 4Gi
  storageClassName: kek-storage-class
StorageClass looks like this:
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: kek-storage-class
provisioner: kubernetes.io/aws-ebs
reclaimPolicy: Retain
parameters:
  type: gp2
  fsType: ext4
  encrypted: "true"
allowVolumeExpansion: true
Kubernetes properly created the volume in AWS. I then edited the PVC storage request and increased it to, say, 7Gi. I could see in the AWS console how the volume got resized, optimized, and became available again. However, in Kubernetes nothing seems to update: the PVC still shows as Bound with the initial 4Gi size. It has no pods attached whatsoever. If I look at the resource in the Kubernetes dashboard, I see this:
"conditions": [
{
"type": "Resizing",
"status": "True",
"lastProbeTime": null,
"lastTransitionTime": "2018-07-27T10:02:33Z"
}
]
It looks like it's stuck in the Resizing state, even though the volume has already been resized.
The kube logs say:
I0727 09:54:03.546372 1 operation_generator.go:1195] ExpandVolume succeeded for volume kekdev/kek-pvc-2
I0727 09:54:03.558268 1 operation_generator.go:1207] ExpandVolume.UpdatePV succeeded for volume kekdev/kek-pvc-2
I0727 09:54:32.909753 1 operation_generator.go:1195] ExpandVolume succeeded for volume kekdev/kek-pvc-2
I0727 09:54:32.912113 1 operation_generator.go:1207] ExpandVolume.UpdatePV succeeded for volume kekdev/kek-pvc-2
Any ideas on what could be going on?
kubectl version output:
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.4", GitCommit:"5ca598b4ba5abb89bb773071ce452e33fb66339d", GitTreeState:"clean", BuildDate:"2018-06-06T08:13:03Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.9", GitCommit:"57729ea3d9a1b75f3fc7bbbadc597ba707d47c8a", GitTreeState:"clean", BuildDate:"2018-06-29T01:07:01Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
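In case it is useful, this is how I have been checking the claim and volume state while waiting for the resize to show up (plain kubectl, with the names from the manifests above):

# PVC status, capacity and events, including the Resizing condition shown above
kubectl -n kekdev describe pvc kek-pvc-2

# The bound PV should already report the new capacity after the AWS-side resize
kubectl get pv -o wide

# Recent events in the namespace, in case the resize is waiting on something
kubectl -n kekdev get events --sort-by=.metadata.creationTimestamp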
Many thanks in advance for any inputs on this.
I have created the Docker stack file below. It creates the 3 services as expected, but I am unable to access them from outside the host, and no ports are being published. I have created an overlay network called test01. When I create the services manually via the command line, everything works perfectly.
version: '3.0'

networks:
  default:
    external:
      name: test01

services:
  mssql:
    image: microsoft/mssql-server-windows-developer
    environment:
      - SA_PASSWORD=Password1
      - ACCEPT_EULA=Y
    ports:
      - 1433:1433
    volumes:
      - c:\Databases:c:\Databases
    deploy:
      placement:
        constraints: [node.labels.os==Windows]
  web:
    image: iiswithdb:latest
    ports:
      - 8080:8080
    deploy:
      replicas: 3
  lbs:
    image: nginx:latest
    ports:
      - 80:80
    deploy:
      placement:
        constraints: [node.labels.os==Windows]
Your services need to explicitly join the network you are defining. You can do this in the compose file. Otherwise they will use the default network created by the stack/compose. https://docs.docker.com/compose/compose-file/#networks
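To confirm whether the stack actually attached each service to test01 and published its ports, the service definitions can be inspected on a manager node, for example (standard Docker CLI, using the service names from the question):

# Which networks the service's tasks are attached to
docker service inspect --format '{{json .Spec.TaskTemplate.Networks}}' test_lbs

# Which ports the service publishes on the routing mesh
docker service inspect --format '{{json .Endpoint.Ports}}' test_lbs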
c:\Program Files\docker>docker network inspect test01
[
{
"Name": "test01",
"Id": "8ffz8xihux13gx1uuhalub5by",
"Created": "2017-09-11T12:30:35.7747711+05:30",
"Scope": "swarm",
"Driver": "overlay",
"EnableIPv6": false,
"IPAM": {
"Driver": "default",
"Options": null,
"Config": [
{
"Subnet": "10.0.0.0/24",
"Gateway": "10.0.0.1"
}
]
},
"Internal": false,
"Attachable": false,
"Ingress": false,
"ConfigFrom": {
"Network": ""
},
"ConfigOnly": false,
"Containers": {
"2f283e7c21608d09a57a7cdef25a836d77c0ceb8030ae15796ff692e43b0eb73": {
"Name": "test_web.1.jti1pyrgxv3v4yet9m9cpk0i4",
"EndpointID": "bed2a5e0d077fcf48ab2d6fe419a8a69a45c3033e1a8602cf6395f93bec405b8",
"MacAddress": "00:15:5d:f3:aa:1a",
"IPv4Address": "10.0.0.5/24",
"IPv6Address": ""
},
"8c55fad8ad54e5286bb7fc54da52ad1958854bceacbf0260400e7dc3c00c1c45": {
"Name": "test_mssql.1.mn31bwoh8iwg5sge5rllh7gc9",
"EndpointID": "00c6e68d6a22ee0dc5ad90cda7ab958323a0b07206ce4583f11baa8b3476de8f",
"MacAddress": "00:15:5d:f3:aa:23",
"IPv4Address": "10.0.0.3/24",
"IPv6Address": ""
}
},
"Options": {
"com.docker.network.driver.overlay.vxlanid_list": "4097",
"com.docker.network.windowsshim.hnsid": "b76fa7e3-530d-4133-b72a-1d1818cd3c16"
},
"Labels": {},
"Peers": [
{
"Name": "node2-f3dedf0e26d9",
"IP": "10.30.50.10"
},
{
"Name": "node3-2e1ad7fb91be",
"IP": "10.30.50.13"
}
]
}
]
Below is the output
c:\Program Files\docker>docker service ls
ID NAME MODE REPLICAS IMAGE PORTS
bo9uovidd4z3 test_web replicated 3/3 iiswithdb:latest *:8080->8080/tcp
sujwg53gjnp3 test_lbs replicated 0/1 nginx:latest *:80->80/tcp
vyxyoaji8jkd test_mssql replicated 1/1 microsoft/mssql-server-windows-developer:latest *:1433->1433/tcp
c:\Program Files\docker>docker service ps test_mssql
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
mn31bwoh8iwg test_mssql.1 microsoft/mssql-server-windows-developer:latest node2 Running Running 6 minutes ago
When I inspect the SQL Server container I can't find any port tagged.
c:\Program Files\docker>docker service ps test_lbs
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
j4x806u1ucdr test_lbs.1 nginx:latest Running Pending 32 minutes ago
c:\Program Files\docker>docker service ps test_web
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
jti1pyrgxv3v test_web.1 iiswithdb:latest node2 Running Running 22 minutes ago
1gudznmi9ufz \_ test_web.1 iiswithdb:latest node2 Shutdown Failed 27 minutes ago "task: non-zero exit (21479434…"
xxkr98na4qsy test_web.2 iiswithdb:latest node3 Running Running 29 minutes ago
7j1y6vc90qvf test_web.3 iiswithdb:latest node3 Running Running 29 minutes ago
C:\Users\Administrator>docker service ls
ID NAME MODE REPLICAS IMAGE PORTS
19qeljqt3wuf test_mssql replicated 1/1 microsoft/mssql-server-windows-developer:latest *:1433->1433/tcp
48gamfl4j4rl test_web replicated 3/3 iiswithdb:latest *:8080->8080/tcp
nxycxrigmz4u test_lbs replicated 1/1 nginx:latest *:80->80/tcp
C:\Users\Administrator>docker service ps test_lbs
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
81fm4xplekig test_lbs.1 nginx:latest node2 Running Running 25 minutes ago
C:\Users\Administrator>docker service ps test_web
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
aivzt7eagf4f test_web.1 iiswithdb:latest node1 Running Running about an hour ago
sny1zf7osibq test_web.2 iiswithdb:latest node2 Running Running about an hour ago
lwzlpaks1b4t \_ test_web.2 iiswithdb:latest node2 Shutdown Failed about an hour ago "task: non-zero exit (21479434…"
iav5mxqdbzoy test_web.3 iiswithdb:latest node3 Running Running about an hour ago
C:\Users\Administrator>docker service ps test_mssql
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
pfu8qyw7vqxp test_mssql.1 microsoft/mssql-server-windows-developer:latest node2 Running Running 26 minutes ago