I am running spark-submit in YARN client mode. YARN is set up on an HDP sandbox with Kerberos enabled. The HDP sandbox runs in a Docker container on a Mac host.
When spark-submit is run from within the sandbox's Docker container, it succeeds, but when it is run from the host machine, it fails immediately after the ACCEPTED state with this error:
19/07/28 00:41:21 INFO yarn.Client: Application report for application_1564298049378_0008 (state: ACCEPTED)
19/07/28 00:41:22 INFO yarn.Client: Application report for application_1564298049378_0008 (state: ACCEPTED)
19/07/28 00:41:23 INFO yarn.Client: Application report for application_1564298049378_0008 (state: FAILED)
19/07/28 00:41:23 INFO yarn.Client:
client token: N/A
diagnostics: Application application_1564298049378_0008 failed 2 times due to AM Container for appattempt_1564298049378_0008_000002 exited with exitCode: -1000
Failing this attempt.Diagnostics: (Client.java:1558)
... 37 more
Caused by: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
I could not find any more information about the failure. Any help will be greatly appreciated.
Here is the resourcemanager log:
2019-07-28 22:39:04,654 INFO resourcemanager.ClientRMService (ClientRMService.java:getNewApplicationId(341)) - Allocated new applicationId: 20
2019-07-28 22:39:10,982 INFO capacity.CapacityScheduler (CapacityScheduler.java:checkAndGetApplicationPriority(2526)) - Application 'application_1564332457320_0020' is submitted without priority hence considering default queue/cluster priority: 0
2019-07-28 22:39:10,982 INFO capacity.CapacityScheduler (CapacityScheduler.java:checkAndGetApplicationPriority(2547)) - Priority '0' is acceptable in queue : santosh for application: application_1564332457320_0020
2019-07-28 22:39:10,983 WARN rmapp.RMAppImpl (RMAppImpl.java:(473)) - The specific max attempts: 0 for application: 20 is invalid, because it is out of the range [1, 2]. Use the global max attempts instead.
2019-07-28 22:39:10,983 INFO collector.TimelineCollectorManager (TimelineCollectorManager.java:putIfAbsent(142)) - the collector for application_1564332457320_0020 was added
2019-07-28 22:39:10,984 INFO resourcemanager.ClientRMService (ClientRMService.java:submitApplication(648)) - Application with id 20 submitted by user santosh
2019-07-28 22:39:10,984 INFO security.DelegationTokenRenewer (DelegationTokenRenewer.java:handleAppSubmitEvent(458)) - application_1564332457320_0020 found existing hdfs token Kind: HDFS_DELEGATION_TOKEN, Service: 192.168.50.1:8020, Ident: (token for santosh: HDFS_DELEGATION_TOKEN owner=santosh#XXX.XX, renewer=yarn, realUser=, issueDate=1564353550169, maxDate=1564958350169, sequenceNumber=125, masterKeyId=20)
2019-07-28 22:39:11,011 INFO security.DelegationTokenRenewer (DelegationTokenRenewer.java:renewToken(635)) - Renewed delegation-token= [Kind: HDFS_DELEGATION_TOKEN, Service: 192.168.50.1:8020, Ident: (token for santosh: HDFS_DELEGATION_TOKEN owner=santosh#XXX.XX, renewer=yarn, realUser=, issueDate=1564353550169, maxDate=1564958350169, sequenceNumber=125, masterKeyId=20);exp=1564439951007; apps=[application_1564332457320_0020]]
2019-07-28 22:39:11,011 INFO security.DelegationTokenRenewer (DelegationTokenRenewer.java:setTimerForTokenRenewal(613)) - Renew Kind: HDFS_DELEGATION_TOKEN, Service: 192.168.50.1:8020, Ident: (token for santosh: HDFS_DELEGATION_TOKEN owner=santosh#XXX.XX, renewer=yarn, realUser=, issueDate=1564353550169, maxDate=1564958350169, sequenceNumber=125, masterKeyId=20);exp=1564439951007; apps=[application_1564332457320_0020] in 86399996 ms, appId = [application_1564332457320_0020]
2019-07-28 22:39:11,011 INFO rmapp.RMAppImpl (RMAppImpl.java:transition(1259)) - Storing application with id application_1564332457320_0020
2019-07-28 22:39:11,012 INFO rmapp.RMAppImpl (RMAppImpl.java:handle(912)) - application_1564332457320_0020 State change from NEW to NEW_SAVING on event = START
2019-07-28 22:39:11,012 INFO recovery.RMStateStore (RMStateStore.java:transition(222)) - Storing info for app: application_1564332457320_0020
2019-07-28 22:39:11,022 INFO rmapp.RMAppImpl (RMAppImpl.java:handle(912)) - application_1564332457320_0020 State change from NEW_SAVING to SUBMITTED on event = APP_NEW_SAVED
2019-07-28 22:39:11,022 INFO capacity.ParentQueue (ParentQueue.java:addApplication(494)) - Application added - appId: application_1564332457320_0020 user: santosh leaf-queue of parent: root #applications: 1
2019-07-28 22:39:11,023 INFO capacity.CapacityScheduler (CapacityScheduler.java:addApplication(990)) - Accepted application application_1564332457320_0020 from user: santosh, in queue: santosh
2019-07-28 22:39:11,023 INFO rmapp.RMAppImpl (RMAppImpl.java:handle(912)) - application_1564332457320_0020 State change from SUBMITTED to ACCEPTED on event = APP_ACCEPTED
2019-07-28 22:39:11,023 INFO resourcemanager.ApplicationMasterService (ApplicationMasterService.java:registerAppAttempt(479)) - Registering app attempt : appattempt_1564332457320_0020_000001
2019-07-28 22:39:11,024 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(925)) - appattempt_1564332457320_0020_000001 State change from NEW to SUBMITTED on event = START
2019-07-28 22:39:11,024 INFO capacity.LeafQueue (LeafQueue.java:activateApplications(911)) - Application application_1564332457320_0020 from user: santosh activated in queue: santosh
2019-07-28 22:39:11,025 INFO capacity.LeafQueue (LeafQueue.java:addApplicationAttempt(941)) - Application added - appId: application_1564332457320_0020 user: santosh, leaf-queue: santosh #user-pending-applications: 0 #user-active-applications: 1 #queue-pending-applications: 0 #queue-active-applications: 1
2019-07-28 22:39:11,025 INFO capacity.CapacityScheduler (CapacityScheduler.java:addApplicationAttempt(1036)) - Added Application Attempt appattempt_1564332457320_0020_000001 to scheduler from user santosh in queue santosh
2019-07-28 22:39:11,028 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(925)) - appattempt_1564332457320_0020_000001 State change from SUBMITTED to SCHEDULED on event = ATTEMPT_ADDED
2019-07-28 22:39:11,033 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1564332457320_0020_000001 container=null queue=santosh clusterResource= type=OFF_SWITCH requestedPartition=
2019-07-28 22:39:11,034 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e20_1564332457320_0020_01_000001 Container Transitioned from NEW to ALLOCATED
2019-07-28 22:39:11,035 INFO fica.FiCaSchedulerNode (FiCaSchedulerNode.java:allocateContainer(169)) - Assigned container container_e20_1564332457320_0020_01_000001 of capacity on host sandbox-hdp.hortonworks.com:45454, which has 1 containers, used and available after allocation
2019-07-28 22:39:11,038 INFO security.NMTokenSecretManagerInRM (NMTokenSecretManagerInRM.java:createAndGetNMToken(200)) - Sending NMToken for nodeId : sandbox-hdp.hortonworks.com:45454 for container : container_e20_1564332457320_0020_01_000001
2019-07-28 22:39:11,043 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e20_1564332457320_0020_01_000001 Container Transitioned from ALLOCATED to ACQUIRED
2019-07-28 22:39:11,043 INFO security.NMTokenSecretManagerInRM (NMTokenSecretManagerInRM.java:clearNodeSetForAttempt(146)) - Clear node set for appattempt_1564332457320_0020_000001
2019-07-28 22:39:11,044 INFO capacity.ParentQueue (ParentQueue.java:apply(1332)) - assignedContainer queue=root usedCapacity=0.25 absoluteUsedCapacity=0.25 used= cluster=
2019-07-28 22:39:11,044 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2890)) - Allocation proposal accepted
2019-07-28 22:39:11,044 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:storeAttempt(2213)) - Storing attempt: AppId: application_1564332457320_0020 AttemptId: appattempt_1564332457320_0020_000001 MasterContainer: Container: [ContainerId: container_e20_1564332457320_0020_01_000001, AllocationRequestId: -1, Version: 0, NodeId: sandbox-hdp.hortonworks.com:45454, NodeHttpAddress: sandbox-hdp.hortonworks.com:8042, Resource: , Priority: 0, Token: Token { kind: ContainerToken, service: 172.18.0.3:45454 }, ExecutionType: GUARANTEED, ]
2019-07-28 22:39:11,051 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(925)) - appattempt_1564332457320_0020_000001 State change from SCHEDULED to ALLOCATED_SAVING on event = CONTAINER_ALLOCATED
2019-07-28 22:39:11,057 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(925)) - appattempt_1564332457320_0020_000001 State change from ALLOCATED_SAVING to ALLOCATED on event = ATTEMPT_NEW_SAVED
2019-07-28 22:39:11,060 INFO amlauncher.AMLauncher (AMLauncher.java:run(307)) - Launching masterappattempt_1564332457320_0020_000001
2019-07-28 22:39:11,068 INFO amlauncher.AMLauncher (AMLauncher.java:launch(109)) - Setting up container Container: [ContainerId: container_e20_1564332457320_0020_01_000001, AllocationRequestId: -1, Version: 0, NodeId: sandbox-hdp.hortonworks.com:45454, NodeHttpAddress: sandbox-hdp.hortonworks.com:8042, Resource: , Priority: 0, Token: Token { kind: ContainerToken, service: 172.18.0.3:45454 }, ExecutionType: GUARANTEED, ] for AM appattempt_1564332457320_0020_000001
2019-07-28 22:39:11,069 INFO security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:createAndGetAMRMToken(195)) - Create AMRMToken for ApplicationAttempt: appattempt_1564332457320_0020_000001
2019-07-28 22:39:11,069 INFO security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:createPassword(307)) - Creating password for appattempt_1564332457320_0020_000001
2019-07-28 22:39:11,265 INFO amlauncher.AMLauncher (AMLauncher.java:launch(130)) - Done launching container Container: [ContainerId: container_e20_1564332457320_0020_01_000001, AllocationRequestId: -1, Version: 0, NodeId: sandbox-hdp.hortonworks.com:45454, NodeHttpAddress: sandbox-hdp.hortonworks.com:8042, Resource: , Priority: 0, Token: Token { kind: ContainerToken, service: 172.18.0.3:45454 }, ExecutionType: GUARANTEED, ] for AM appattempt_1564332457320_0020_000001
2019-07-28 22:39:11,265 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(925)) - appattempt_1564332457320_0020_000001 State change from ALLOCATED to LAUNCHED on event = LAUNCHED
2019-07-28 22:39:11,852 INFO resourcemanager.ResourceTrackerService (ResourceTrackerService.java:updateAppCollectorsMap(713)) - Update collector information for application application_1564332457320_0020 with new address: sandbox-hdp.hortonworks.com:35197 timestamp: 1564332457320, 36
2019-07-28 22:39:11,854 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e20_1564332457320_0020_01_000001 Container Transitioned from ACQUIRED to RUNNING
2019-07-28 22:39:12,833 INFO provider.BaseAuditHandler (BaseAuditHandler.java:logStatus(312)) - Audit Status Log: name=yarn.async.batch.hdfs, interval=01:11.979 minutes, events=162, succcessCount=162, totalEvents=17347, totalSuccessCount=17347
2019-07-28 22:39:12,834 INFO destination.HDFSAuditDestination (HDFSAuditDestination.java:logJSON(179)) - Flushing HDFS audit. Event Size:1
2019-07-28 22:39:12,857 INFO resourcemanager.ResourceTrackerService (ResourceTrackerService.java:updateAppCollectorsMap(713)) - Update collector information for application application_1564332457320_0020 with new address: sandbox-hdp.hortonworks.com:35197 timestamp: 1564332457320, 37
2019-07-28 22:39:14,054 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e20_1564332457320_0020_01_000001 Container Transitioned from RUNNING to COMPLETED
2019-07-28 22:39:14,055 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:rememberTargetTransitionsAndStoreState(1412)) - Updating application attempt appattempt_1564332457320_0020_000001 with final state: FAILED, and exit status: -1000
2019-07-28 22:39:14,055 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(925)) - appattempt_1564332457320_0020_000001 State change from LAUNCHED to FINAL_SAVING on event = CONTAINER_FINISHED
2019-07-28 22:39:14,066 INFO resourcemanager.ApplicationMasterService (ApplicationMasterService.java:unregisterAttempt(496)) - Unregistering app attempt : appattempt_1564332457320_0020_000001
2019-07-28 22:39:14,066 INFO security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:applicationMasterFinished(124)) - Application finished, removing password for appattempt_1564332457320_0020_000001
2019-07-28 22:39:14,066 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(925)) - appattempt_1564332457320_0020_000001 State change from FINAL_SAVING to FAILED on event = ATTEMPT_UPDATE_SAVED
2019-07-28 22:39:14,067 INFO rmapp.RMAppImpl (RMAppImpl.java:transition(1538)) - The number of failed attempts is 1. The max attempts is 2
2019-07-28 22:39:14,067 INFO resourcemanager.ApplicationMasterService (ApplicationMasterService.java:registerAppAttempt(479)) - Registering app attempt : appattempt_1564332457320_0020_000002
2019-07-28 22:39:14,067 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(925)) - appattempt_1564332457320_0020_000002 State change from NEW to SUBMITTED on event = START
2019-07-28 22:39:14,067 INFO capacity.CapacityScheduler (CapacityScheduler.java:doneApplicationAttempt(1085)) - Application Attempt appattempt_1564332457320_0020_000001 is done. finalState=FAILED
2019-07-28 22:39:14,067 INFO scheduler.AppSchedulingInfo (AppSchedulingInfo.java:clearRequests(159)) - Application application_1564332457320_0020 requests cleared
2019-07-28 22:39:14,067 INFO capacity.LeafQueue (LeafQueue.java:removeApplicationAttempt(1003)) - Application removed - appId: application_1564332457320_0020 user: santosh queue: santosh #user-pending-applications: 0 #user-active-applications: 0 #queue-pending-applications: 0 #queue-active-applications: 0
2019-07-28 22:39:14,068 INFO capacity.LeafQueue (LeafQueue.java:activateApplications(911)) - Application application_1564332457320_0020 from user: santosh activated in queue: santosh
2019-07-28 22:39:14,068 INFO capacity.LeafQueue (LeafQueue.java:addApplicationAttempt(941)) - Application added - appId: application_1564332457320_0020 user: santosh, leaf-queue: santosh #user-pending-applications: 0 #user-active-applications: 1 #queue-pending-applications: 0 #queue-active-applications: 1
2019-07-28 22:39:14,068 INFO capacity.CapacityScheduler (CapacityScheduler.java:addApplicationAttempt(1036)) - Added Application Attempt appattempt_1564332457320_0020_000002 to scheduler from user santosh in queue santosh
2019-07-28 22:39:14,068 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(925)) - appattempt_1564332457320_0020_000002 State change from SUBMITTED to SCHEDULED on event = ATTEMPT_ADDED
2019-07-28 22:39:14,074 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1564332457320_0020_000002 container=null queue=santosh clusterResource= type=OFF_SWITCH requestedPartition=
2019-07-28 22:39:14,074 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e20_1564332457320_0020_02_000001 Container Transitioned from NEW to ALLOCATED
2019-07-28 22:39:14,075 INFO fica.FiCaSchedulerNode (FiCaSchedulerNode.java:allocateContainer(169)) - Assigned container container_e20_1564332457320_0020_02_000001 of capacity on host sandbox-hdp.hortonworks.com:45454, which has 1 containers, used and available after allocation
2019-07-28 22:39:14,075 INFO security.NMTokenSecretManagerInRM (NMTokenSecretManagerInRM.java:createAndGetNMToken(200)) - Sending NMToken for nodeId : sandbox-hdp.hortonworks.com:45454 for container : container_e20_1564332457320_0020_02_000001
2019-07-28 22:39:14,076 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e20_1564332457320_0020_02_000001 Container Transitioned from ALLOCATED to ACQUIRED
2019-07-28 22:39:14,076 INFO security.NMTokenSecretManagerInRM (NMTokenSecretManagerInRM.java:clearNodeSetForAttempt(146)) - Clear node set for appattempt_1564332457320_0020_000002
2019-07-28 22:39:14,076 INFO capacity.ParentQueue (ParentQueue.java:apply(1332)) - assignedContainer queue=root usedCapacity=0.25 absoluteUsedCapacity=0.25 used= cluster=
2019-07-28 22:39:14,076 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2890)) - Allocation proposal accepted
2019-07-28 22:39:14,076 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:storeAttempt(2213)) - Storing attempt: AppId: application_1564332457320_0020 AttemptId: appattempt_1564332457320_0020_000002 MasterContainer: Container: [ContainerId: container_e20_1564332457320_0020_02_000001, AllocationRequestId: -1, Version: 0, NodeId: sandbox-hdp.hortonworks.com:45454, NodeHttpAddress: sandbox-hdp.hortonworks.com:8042, Resource: , Priority: 0, Token: Token { kind: ContainerToken, service: 172.18.0.3:45454 }, ExecutionType: GUARANTEED, ]
2019-07-28 22:39:14,077 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(925)) - appattempt_1564332457320_0020_000002 State change from SCHEDULED to ALLOCATED_SAVING on event = CONTAINER_ALLOCATED
2019-07-28 22:39:14,088 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(925)) - appattempt_1564332457320_0020_000002 State change from ALLOCATED_SAVING to ALLOCATED on event = ATTEMPT_NEW_SAVED
2019-07-28 22:39:14,089 INFO amlauncher.AMLauncher (AMLauncher.java:run(307)) - Launching masterappattempt_1564332457320_0020_000002
2019-07-28 22:39:14,091 INFO amlauncher.AMLauncher (AMLauncher.java:launch(109)) - Setting up container Container: [ContainerId: container_e20_1564332457320_0020_02_000001, AllocationRequestId: -1, Version: 0, NodeId: sandbox-hdp.hortonworks.com:45454, NodeHttpAddress: sandbox-hdp.hortonworks.com:8042, Resource: , Priority: 0, Token: Token { kind: ContainerToken, service: 172.18.0.3:45454 }, ExecutionType: GUARANTEED, ] for AM appattempt_1564332457320_0020_000002
2019-07-28 22:39:14,092 INFO security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:createAndGetAMRMToken(195)) - Create AMRMToken for ApplicationAttempt: appattempt_1564332457320_0020_000002
2019-07-28 22:39:14,092 INFO security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:createPassword(307)) - Creating password for appattempt_1564332457320_0020_000002
2019-07-28 22:39:14,110 INFO amlauncher.AMLauncher (AMLauncher.java:launch(130)) - Done launching container Container: [ContainerId: container_e20_1564332457320_0020_02_000001, AllocationRequestId: -1, Version: 0, NodeId: sandbox-hdp.hortonworks.com:45454, NodeHttpAddress: sandbox-hdp.hortonworks.com:8042, Resource: , Priority: 0, Token: Token { kind: ContainerToken, service: 172.18.0.3:45454 }, ExecutionType: GUARANTEED, ] for AM appattempt_1564332457320_0020_000002
2019-07-28 22:39:14,110 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(925)) - appattempt_1564332457320_0020_000002 State change from ALLOCATED to LAUNCHED on event = LAUNCHED
2019-07-28 22:39:15,056 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e20_1564332457320_0020_02_000001 Container Transitioned from ACQUIRED to RUNNING
2019-07-28 22:39:16,752 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e20_1564332457320_0020_02_000001 Container Transitioned from RUNNING to COMPLETED
2019-07-28 22:39:16,755 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:rememberTargetTransitionsAndStoreState(1412)) - Updating application attempt appattempt_1564332457320_0020_000002 with final state: FAILED, and exit status: -1000
2019-07-28 22:39:16,755 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(925)) - appattempt_1564332457320_0020_000002 State change from LAUNCHED to FINAL_SAVING on event = CONTAINER_FINISHED
2019-07-28 22:39:16,899 INFO resourcemanager.ApplicationMasterService (ApplicationMasterService.java:unregisterAttempt(496)) - Unregistering app attempt : appattempt_1564332457320_0020_000002
2019-07-28 22:39:16,900 INFO security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:applicationMasterFinished(124)) - Application finished, removing password for appattempt_1564332457320_0020_000002
2019-07-28 22:39:16,900 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(925)) - appattempt_1564332457320_0020_000002 State change from FINAL_SAVING to FAILED on event = ATTEMPT_UPDATE_SAVED
2019-07-28 22:39:16,900 INFO rmapp.RMAppImpl (RMAppImpl.java:transition(1538)) - The number of failed attempts is 2. The max attempts is 2
2019-07-28 22:39:16,900 INFO rmapp.RMAppImpl (RMAppImpl.java:rememberTargetTransitionsAndStoreState(1278)) - Updating application application_1564332457320_0020 with final state: FAILED
2019-07-28 22:39:16,900 INFO rmapp.RMAppImpl (RMAppImpl.java:handle(912)) - application_1564332457320_0020 State change from ACCEPTED to FINAL_SAVING on event = ATTEMPT_FAILED
2019-07-28 22:39:16,900 INFO recovery.RMStateStore (RMStateStore.java:transition(260)) - Updating info for app: application_1564332457320_0020
2019-07-28 22:39:16,900 INFO capacity.CapacityScheduler (CapacityScheduler.java:doneApplicationAttempt(1085)) - Application Attempt appattempt_1564332457320_0020_000002 is done. finalState=FAILED
2019-07-28 22:39:16,901 INFO scheduler.AppSchedulingInfo (AppSchedulingInfo.java:clearRequests(159)) - Application application_1564332457320_0020 requests cleared
2019-07-28 22:39:16,901 INFO capacity.LeafQueue (LeafQueue.java:removeApplicationAttempt(1003)) - Application removed - appId: application_1564332457320_0020 user: santosh queue: santosh #user-pending-applications: 0 #user-active-applications: 0 #queue-pending-applications: 0 #queue-active-applications: 0
2019-07-28 22:39:16,916 INFO rmapp.RMAppImpl (RMAppImpl.java:transition(1197)) - Application application_1564332457320_0020 failed 2 times due to AM Container for appattempt_1564332457320_0020_000002 exited with exitCode: -1000
Failing this attempt.Diagnostics: (Client.java:1558)
at org.apache.hadoop.ipc.Client.call(Client.java:1389)
... 37 more
Caused by: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:173)
at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:390)
at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:614)
at org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:410)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:800)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:796)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:796)
... 40 more
Caused by: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
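One detail in the ResourceManager log above may be worth checking, though it is not a confirmed diagnosis: the HDFS delegation token is recorded with Service: 192.168.50.1:8020, while the container token uses 172.18.0.3:45454. Hadoop clients select a token by its service address, so if the NodeManager inside the container reaches the NameNode under a different address than the one the host used at submit time, it may not find a matching token. The following is a minimal Python sketch, assuming the two log lines are abridged copies of the ones above, that lists the service address recorded for each token kind so the two can be compared. The function name `token_services` is hypothetical.

```python
import re

# Two lines abridged from the ResourceManager log above.
LOG = """found existing hdfs token Kind: HDFS_DELEGATION_TOKEN, Service: 192.168.50.1:8020, Ident: (token for santosh)
Token: Token { kind: ContainerToken, service: 172.18.0.3:45454 }, ExecutionType: GUARANTEED, ]"""


def token_services(log_text):
    """Return {token kind: set of service addresses} found in log_text."""
    # Matches both "Kind: X, Service: host:port" and "kind: X, service: host:port".
    pattern = re.compile(r"kind:\s*(\w+),\s*service:\s*([\w.\-]+:\d+)", re.IGNORECASE)
    services = {}
    for kind, address in pattern.findall(log_text):
        services.setdefault(kind, set()).add(address)
    return services


services = token_services(LOG)
for kind, addresses in sorted(services.items()):
    print(kind, sorted(addresses))
```

If the address printed for HDFS_DELEGATION_TOKEN is not the one the cluster nodes use for the NameNode, that mismatch is a candidate explanation for the authentication failure.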
I get the following error while running an Oozie job.
Command:
oozie job -oozie http://10.xxx.xx.xx:11000/oozie/ -log 0000017-151029172404066-oozie-oozi-W
Logs:
2015-11-24 11:50:23,469 INFO ActionStartXCommand:543 - SERVER[hostname.abc.com] USER[oozie] GROUP[-] TOKEN[] APP[sqoop-wf] JOB[0000017-151029172404066-oozie-oozi-W] ACTION[0000017-151029172404066-oozie-oozi-W#:start:] Start action [0000017-151029172404066-oozie-oozi-W#:start:] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
2015-11-24 11:50:23,470 INFO ActionStartXCommand:543 - SERVER[hostname.abc.com] USER[oozie] GROUP[-] TOKEN[] APP[sqoop-wf] JOB[0000017-151029172404066-oozie-oozi-W] ACTION[0000017-151029172404066-oozie-oozi-W#:start:] [***0000017-151029172404066-oozie-oozi-W#:start:***]Action status=DONE
2015-11-24 11:50:23,470 INFO ActionStartXCommand:543 - SERVER[hostname.abc.com] USER[oozie] GROUP[-] TOKEN[] APP[sqoop-wf] JOB[0000017-151029172404066-oozie-oozi-W] ACTION[0000017-151029172404066-oozie-oozi-W#:start:] [***0000017-151029172404066-oozie-oozi-W#:start:***]Action updated in DB!
2015-11-24 11:50:23,567 INFO ActionStartXCommand:543 - SERVER[hostname.abc.com] USER[oozie] GROUP[-] TOKEN[] APP[sqoop-wf] JOB[0000017-151029172404066-oozie-oozi-W] ACTION[0000017-151029172404066-oozie-oozi-W#sqoop-node] Start action [0000017-151029172404066-oozie-oozi-W#sqoop-node] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
2015-11-24 11:50:24,323 WARN ActionStartXCommand:546 - SERVER[hostname.abc.com] USER[oozie] GROUP[-] TOKEN[] APP[sqoop-wf] JOB[0000017-151029172404066-oozie-oozi-W] ACTION[0000017-151029172404066-oozie-oozi-W#sqoop-node] Error starting action [sqoop-node]. ErrorType [NON_TRANSIENT], ErrorCode [JA002], Message [JA002: SIMPLE authentication is not enabled. Available:[TOKEN]]
org.apache.oozie.action.ActionExecutorException: JA002: SIMPLE authentication is not enabled. Available:[TOKEN]
at org.apache.oozie.action.ActionExecutor.convertExceptionHelper(ActionExecutor.java:418)
at org.apache.oozie.action.ActionExecutor.convertException(ActionExecutor.java:392)
at org.apache.oozie.action.hadoop.JavaActionExecutor.submitLauncher(JavaActionExecutor.java:980)
at org.apache.oozie.action.hadoop.JavaActionExecutor.start(JavaActionExecutor.java:1135)
at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:228)
at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:63)
at org.apache.oozie.command.XCommand.call(XCommand.java:281)
at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:323)
at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:252)
at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:174)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.security.AccessControlException: SIMPLE authentication is not enabled. Available:[TOKEN]
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getDelegationToken(ApplicationClientProtocolPBClientImpl.java:309)
at sun.reflect.GeneratedMethodAccessor46.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy45.getDelegationToken(Unknown Source)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getRMDelegationToken(YarnClientImpl.java:486)
at org.apache.hadoop.mapred.ResourceMgrDelegate.getDelegationToken(ResourceMgrDelegate.java:174)
at org.apache.hadoop.mapred.YARNRunner.getDelegationToken(YARNRunner.java:221)
at org.apache.hadoop.mapreduce.Cluster.getDelegationToken(Cluster.java:400)
at org.apache.hadoop.mapred.JobClient$16.run(JobClient.java:1240)
at org.apache.hadoop.mapred.JobClient$16.run(JobClient.java:1237)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.mapred.JobClient.getDelegationToken(JobClient.java:1236)
at org.apache.oozie.service.HadoopAccessorService.createJobClient(HadoopAccessorService.java:439)
at org.apache.oozie.action.hadoop.JavaActionExecutor.createJobClient(JavaActionExecutor.java:1178)
at org.apache.oozie.action.hadoop.JavaActionExecutor.submitLauncher(JavaActionExecutor.java:927)
... 10 more
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): SIMPLE authentication is not enabled. Available:[TOKEN]
at org.apache.hadoop.ipc.Client.call(Client.java:1469)
at org.apache.hadoop.ipc.Client.call(Client.java:1400)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy44.getDelegationToken(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getDelegationToken(ApplicationClientProtocolPBClientImpl.java:306)
... 29 more
2015-11-24 11:50:24,324 WARN ActionStartXCommand:546 - SERVER[hostname.abc.com] USER[oozie] GROUP[-] TOKEN[] APP[sqoop-wf] JOB[0000017-151029172404066-oozie-oozi-W] ACTION[0000017-151029172404066-oozie-oozi-W#sqoop-node] Suspending Workflow Job id=0000017-151029172404066-oozie-oozi-W
In my case, I was connecting to the YARN scheduler instead of the YARN ResourceManager.
In your Oozie job.properties, make sure the jobTracker URL points to the YARN ResourceManager. Look for "yarn.resourcemanager.address" in your yarn-site.xml and use that value, not the scheduler address.
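The check described above can be sketched in Python. This is a minimal illustration, not part of Oozie or YARN: the host name rm.example.com, the ports, and the helper names are all assumptions, and the sample job.properties deliberately points at the scheduler address to show the mistake.

```python
import xml.etree.ElementTree as ET

# Hypothetical yarn-site.xml excerpt; host and ports are illustrative only.
YARN_SITE = """<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>rm.example.com:8050</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>rm.example.com:8030</value>
  </property>
</configuration>"""

# Hypothetical job.properties that points at the scheduler by mistake.
JOB_PROPERTIES = """nameNode=hdfs://rm.example.com:8020
jobTracker=rm.example.com:8030
queueName=default
"""


def yarn_site_values(xml_text):
    """Map each <name> in yarn-site.xml to its <value>."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def check_job_tracker(job_props, yarn_site):
    """Classify jobTracker as 'ok', 'scheduler', or 'unknown'."""
    job_tracker = job_props.get("jobTracker")
    if job_tracker == yarn_site.get("yarn.resourcemanager.address"):
        return "ok"
    if job_tracker == yarn_site.get("yarn.resourcemanager.scheduler.address"):
        return "scheduler"
    return "unknown"


result = check_job_tracker(parse_properties(JOB_PROPERTIES), yarn_site_values(YARN_SITE))
print(result)
```

With the sample data the result is "scheduler", which is the misconfiguration described above. Changing jobTracker to the yarn.resourcemanager.address value makes the check return "ok".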
I am trying to run an Oozie coordinator with a Java action workflow that runs a Camus mapper job. The coordinator seems to run and starts the workflow every 20 minutes, but the workflow just runs indefinitely, even though the job completes in a few minutes when run independently. I think the error has to do either with how I run the job or with how the arguments are passed, but I'm not sure how to debug this. Here is the code:
/coord/job.properties
oozie.coord.application.path=hdfs://10.0.2.15:8020/user/hue/app/coord/coordinator.xml
name=camus
frequency=20
start=2015-07-30T11:40Z
end=2016-07-30T11:40Z
timezone=GMT+0530
workflow=hdfs://10.0.2.15:8020/user/hue/app/workflow/workflow.xml
nameNode=hdfs://10.0.2.15:8020
jobTracker=10.0.2.15:8021
queueName=default
properties=${nameNode}/user/hue/app/workflows/lib/config.properties
/coord/coordinator.xml
<coordinator-app name="${name}" frequency="${frequency}" start="${start}" end="${end}" timezone="${timezone}" xmlns="uri:oozie:coordinator:0.1">
<action>
<workflow>
<app-path>${workflow}</app-path>
</workflow>
</action>
</coordinator-app>
/workflow/workflow.xml
<workflow-app xmlns='uri:oozie:workflow:0.4' name='camus-wf'>
<start to='camus_job' />
<action name='camus_job'>
<java>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<main-class>com.linkedin.camus.etl.kafka.CamusJob</main-class>
<arg>-P</arg>
<arg>${properties}</arg>
</java>
<ok to="end" />
<error to="fail" />
</action>
<kill name="fail">
<message>Camus Job Failed</message>
</kill>
<end name='end' />
</workflow-app>
The shaded JAR and config.properties are located in /workflow/lib/.
I'm running HDP 2.2.
Coordinator Logs:
2015-08-03 06:43:43,820 INFO CoordSubmitXCommand:543 - SERVER[sandbox.hortonworks.com] USER[root] GROUP[-] TOKEN[] APP[camus] JOB[0000000-150803063131195-oozie-oozi-C] ACTION[-] ENDED Coordinator Submit jobId=0000000-150803063131195-oozie-oozi-C
2015-08-03 06:43:43,935 INFO CoordMaterializeTransitionXCommand:543 - SERVER[sandbox.hortonworks.com] USER[root] GROUP[-] TOKEN[] APP[camus] JOB[0000000-150803063131195-oozie-oozi-C] ACTION[-] materialize actions for tz=Coordinated Universal Time,
start=Thu Jul 30 11:40:00 UTC 2015, end=Thu Jul 30 15:40:00 UTC 2015,
timeUnit 12,
frequency :20:MINUTE,
lastActionNumber 0
2015-08-03 06:43:43,971 INFO CoordMaterializeTransitionXCommand:543 - SERVER[sandbox.hortonworks.com] USER[root] GROUP[-] TOKEN[] APP[camus] JOB[0000000-150803063131195-oozie-oozi-C] ACTION[-] [0000000-150803063131195-oozie-oozi-C]: Update status from PREP to RUNNING
2015-08-03 06:43:44,113 INFO CoordActionInputCheckXCommand:543 - SERVER[sandbox.hortonworks.com] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000000-150803063131195-oozie-oozi-C] ACTION[0000000-150803063131195-oozie-oozi-C#1] [0000000-150803063131195-oozie-oozi-C#1]::CoordActionInputCheck:: Missing deps:
2015-08-03 06:43:44,209 INFO CoordActionNotificationXCommand:543 - SERVER[sandbox.hortonworks.com] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000000-150803063131195-oozie-oozi-C] ACTION[0000000-150803063131195-oozie-oozi-C#1] STARTED Coordinator Notification actionId=0000000-150803063131195-oozie-oozi-C#1 : WAITING
...
2015-08-03 06:43:44,267 INFO CoordActionNotificationXCommand:543 - SERVER[sandbox.hortonworks.com] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000000-150803063131195-oozie-oozi-C] ACTION[0000000-150803063131195-oozie-oozi-C#12] No Notification URL is defined. Therefore nothing to notify for job 0000000-150803063131195-oozie-oozi-C action ID 0000000-150803063131195-oozie-oozi-C#12
2015-08-03 06:43:44,268 INFO CoordActionNotificationXCommand:543 - SERVER[sandbox.hortonworks.com] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000000-150803063131195-oozie-oozi-C] ACTION[0000000-150803063131195-oozie-oozi-C#12] ENDED Coordinator Notification actionId=0000000-150803063131195-oozie-oozi-C#12
2015-08-03 06:43:44,433 WARN ParameterVerifier:546 - SERVER[sandbox.hortonworks.com] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000000-150803063131195-oozie-oozi-C] ACTION[0000000-150803063131195-oozie-oozi-C#1] The application does not define formal parameters in its XML definition
...
Workflow Logs:
2015-08-03 06:43:44,672 INFO ActionStartXCommand:543 - SERVER[sandbox.hortonworks.com] USER[root] GROUP[-] TOKEN[] APP[camus-wf] JOB[0000001-150803063131195-oozie-oozi-W] ACTION[0000001-150803063131195-oozie-oozi-W#:start:] Start action [0000001-150803063131195-oozie-oozi-W#:start:] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
2015-08-03 06:43:44,673 INFO ActionStartXCommand:543 - SERVER[sandbox.hortonworks.com] USER[root] GROUP[-] TOKEN[] APP[camus-wf] JOB[0000001-150803063131195-oozie-oozi-W] ACTION[0000001-150803063131195-oozie-oozi-W#:start:] [***0000001-150803063131195-oozie-oozi-W#:start:***]Action status=DONE
2015-08-03 06:43:44,673 INFO ActionStartXCommand:543 - SERVER[sandbox.hortonworks.com] USER[root] GROUP[-] TOKEN[] APP[camus-wf] JOB[0000001-150803063131195-oozie-oozi-W] ACTION[0000001-150803063131195-oozie-oozi-W#:start:] [***0000001-150803063131195-oozie-oozi-W#:start:***]Action updated in DB!
2015-08-03 06:43:45,104 INFO ActionStartXCommand:543 - SERVER[sandbox.hortonworks.com] USER[root] GROUP[-] TOKEN[] APP[camus-wf] JOB[0000001-150803063131195-oozie-oozi-W] ACTION[0000001-150803063131195-oozie-oozi-W#camus_job] Start action [0000001-150803063131195-oozie-oozi-W#camus_job] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
I set up a two-node Hadoop cluster. After starting the cluster, it looks like this:
machine namenode:
hadoop#namenode:~$ jps
5691 Jps
3531 DataNode
3424 NameNode
3669 SecondaryNameNode
3822 ResourceManager
3908 NodeManager
second machine datanode:
hadoop#datanode:~$ jps
3716 Jps
2137 DataNode
2231 NodeManager
After starting the cluster, I tried to run a standard benchmark:
hadoop jar /opt/hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.2.0-tests.jar TestDFSIO -write -nrFiles 20 -fileSize 10
However, the job fails and the log files contain the following messages:
On the datanode:
hadoop#datanode:~$ cat /opt/hadoop-2.2.0/logs/yarn-hadoop-nodemanager-datanode.log
...
2014-02-18 16:37:41,567 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 3547 for container-id container_1392741263071_0001_02_000001: 26.2 MB of 2 GB physical memory used; 1.2 GB of 4.2 GB virtual memory used
2014-02-18 16:37:42,158 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id { app_attempt_id { application_id { id: 1 cluster_timestamp: 1392741263071 } attemptId: 2 } id: 1 } state: C_RUNNING diagnostics: "" exit_status: -1000
2014-02-18 16:37:43,166 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id { app_attempt_id { application_id { id: 1 cluster_timestamp: 1392741263071 } attemptId: 2 } id: 1 } state: C_RUNNING diagnostics: "" exit_status: -1000
2014-02-18 16:37:44,171 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id { app_attempt_id { application_id { id: 1 cluster_timestamp: 1392741263071 } attemptId: 2 } id: 1 } state: C_RUNNING diagnostics: "" exit_status: -1000
2014-02-18 16:37:44,579 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 3547 for container-id container_1392741263071_0001_02_000001: 95.3 MB of 2 GB physical memory used; 1.3 GB of 4.2 GB virtual memory used
2014-02-18 16:37:45,180 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id { app_attempt_id { application_id { id: 1 cluster_timestamp: 1392741263071 } attemptId: 2 } id: 1 } state: C_RUNNING diagnostics: "" exit_status: -1000
2014-02-18 16:37:46,183 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id { app_attempt_id { application_id { id: 1 cluster_timestamp: 1392741263071 } attemptId: 2 } id: 1 } state: C_RUNNING diagnostics: "" exit_status: -1000
2014-02-18 16:37:47,189 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id { app_attempt_id { application_id { id: 1 cluster_timestamp: 1392741263071 } attemptId: 2 } id: 1 } state: C_RUNNING diagnostics: "" exit_status: -1000
2014-02-18 16:37:47,584 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 3547 for container-id container_1392741263071_0001_02_000001: 108.1 MB of 2 GB physical memory used; 1.3 GB of 4.2 GB virtual memory used
2014-02-18 16:37:48,196 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id { app_attempt_id { application_id { id: 1 cluster_timestamp: 1392741263071 } attemptId: 2 } id: 1 } state: C_RUNNING diagnostics: "" exit_status: -1000
2014-02-18 16:37:49,157 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1392741263071_0001_02_000001 is : 1
2014-02-18 16:37:49,157 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1392741263071_0001_02_000001 and exit code: 1
org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
2014-02-18 16:37:49,159 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:
2014-02-18 16:37:49,159 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container exited with a non-zero exit code 1
2014-02-18 16:37:49,160 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1392741263071_0001_02_000001 transitioned from RUNNING to EXITED_WITH_FAILURE
2014-02-18 16:37:49,160 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1392741263071_0001_02_000001
2014-02-18 16:37:49,172 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /home/hadoop/hadoop/yarn-data/usercache/hadoop/appcache/application_1392741263071_0001/container_1392741263071_0001_02_000001
2014-02-18 16:37:49,173 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hadoop OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE APPID=application_1392741263071_0001 CONTAINERID=container_1392741263071_0001_02_000001
...
On the namenode:
hadoop#namenode:/opt/hadoop-2.2.0/logs$ cat yarn-hadoop-*.log
2014-02-18 16:34:25,054 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: STARTUP_MSG:
...
2014-02-18 16:37:37,441 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 4493 for container-id container_1392741263071_0001_01_000001: 131.1 MB of 2 GB physical memory used; 1.4 GB of 4.2 GB virtual memory used
2014-02-18 16:37:38,367 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id { app_attempt_id { application_id { id: 1 cluster_timestamp: 1392741263071 } attemptId: 1 } id: 1 } state: C_RUNNING diagnostics: "" exit_status: -1000
2014-02-18 16:37:39,369 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id { app_attempt_id { application_id { id: 1 cluster_timestamp: 1392741263071 } attemptId: 1 } id: 1 } state: C_RUNNING diagnostics: "" exit_status: -1000
...
2014-02-18 16:34:23,131 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: STARTUP_MSG:
...
2014-02-18 16:37:49,186 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode: Released container container_1392741263071_0001_02_000001 of capacity <memory:2048, vCores:1> on host datanode.c.forward-camera-473.internal:43994, which currently has 0 containers, <memory:0, vCores:0> used and <memory:8192, vCores:8> available, release resources=true
2014-02-18 16:37:49,186 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: default used=<memory:0, vCores:0> numContainers=0 user=hadoop user-resources=<memory:0, vCores:0>
2014-02-18 16:37:49,186 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: completedContainer container=Container: [ContainerId: container_1392741263071_0001_02_000001, NodeId: datanode.c.forward-camera-473.internal:43994, NodeHttpAddress: datanode.c.forward-camera-473.internal:8042, Resource: <memory:2048, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.240.110.76:43994 }, ] resource=<memory:2048, vCores:1> queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0, vCores:0>usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=1, numContainers=0 usedCapacity=0.0 absoluteUsedCapacity=0.0 used=<memory:0, vCores:0> cluster=<memory:16384, vCores:16>
2014-02-18 16:37:49,186 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: completedContainer queue=root usedCapacity=0.0 absoluteUsedCapacity=0.0 used=<memory:0, vCores:0> cluster=<memory:16384, vCores:16>
2014-02-18 16:37:49,186 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Re-sorting completed queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0, vCores:0>usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=1, numContainers=0
2014-02-18 16:37:49,186 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1392741263071_0001_000002 released container container_1392741263071_0001_02_000001 on node: host: datanode.c.forward-camera-473.internal:43994 #containers=0 available=8192 used=0 with event: FINISHED
2014-02-18 16:37:49,187 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Unregistering app attempt : appattempt_1392741263071_0001_000002
2014-02-18 16:37:49,187 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1392741263071_0001_000002 State change from RUNNING to FAILED
2014-02-18 16:37:49,187 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application application_1392741263071_0001 failed 2 times due to AM Container for appattempt_1392741263071_0001_000002 exited with exitCode: 1 due to: Exception from container-launch:
org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
.Failing this attempt.. Failing the application.
2014-02-18 16:37:49,189 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Removing info for app: application_1392741263071_0001
2014-02-18 16:37:49,194 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1392741263071_0001 State change from RUNNING to FAILED
2014-02-18 16:37:49,194 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1392741263071_0001_000002 is done. finalState=FAILED
2014-02-18 16:37:49,194 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo: Application application_1392741263071_0001 requests cleared
2014-02-18 16:37:49,194 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Application removed - appId: application_1392741263071_0001 user: hadoop queue: default #user-pending-applications: 0 #user-active-applications: 0 #queue-pending-applications: 0 #queue-active-applications: 0
2014-02-18 16:37:49,194 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Application removed - appId: application_1392741263071_0001 user: hadoop leaf-queue of parent: root #applications: 0
2014-02-18 16:37:49,204 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop OPERATION=Application Finished - Failed TARGET=RMAppManager RESULT=FAILURE DESCRIPTION=App failed with state: FAILED PERMISSIONS=Application application_1392741263071_0001 failed 2 times due to AM Container for appattempt_1392741263071_0001_000002 exited with exitCode: 1 due to: Exception from container-launch:
org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
.Failing this attempt.. Failing the application. APPID=application_1392741263071_0001
2014-02-18 16:37:49,205 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary: appId=application_1392741263071_0001,name=hadoop-mapreduce-client-jobclient-2.2.0-tests.jar,user=hadoop,queue=default,state=FAILED,trackingUrl=namenode:8088/cluster/app/application_1392741263071_0001,appMasterHost=,startTime=1392741381131,finishTime=1392741469188,finalStatus=FAILED
2014-02-18 16:37:49,205 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Cleaning master appattempt_1392741263071_0001_000002
What is happening?
It looks like it can't spawn a new Java process. Your .profile or .bashrc probably does not set up JAVA_HOME or PATH correctly, so the java executable is not accessible.
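One way to check this on each node is a small shell function like the sketch below. The name check_java is made up for illustration; it only tests whether JAVA_HOME points at an executable java, and otherwise whether java can be found on PATH. It does not prove that this is the cause of the failure.

```shell
# Report how (or whether) a java executable can be resolved in this environment.
# check_java is a hypothetical helper name, not part of Hadoop.
check_java() {
  if [ -n "${JAVA_HOME:-}" ] && [ -x "${JAVA_HOME}/bin/java" ]; then
    # JAVA_HOME is set and contains an executable bin/java
    echo "JAVA_HOME ok: ${JAVA_HOME}"
  elif command -v java >/dev/null 2>&1; then
    # JAVA_HOME is missing or wrong, but java is still reachable through PATH
    echo "java found on PATH: $(command -v java)"
  else
    echo "java not found"
    return 1
  fi
}
```

Call check_java as the user that runs the NodeManager (hadoop in the logs above), on both machines. Daemons started over SSH typically get a non-interactive shell, so a variable that is only exported in .bashrc may not be visible to them; running the check through a non-interactive SSH command is likely closer to what the container launch sees.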
I am trying to set up an Oozie and Sqoop workflow (I want to back up MySQL data into HDFS).
But I am stuck when I try to start my job.
I am using Hadoop 2 (with a working HDFS node) and the latest version of Oozie.
I installed the Oozie server on my computer (I want to test it before deploying it) with the HDFS config (core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml in the Oozie conf/hadoop-conf directory); my HDFS is on a server.
I made a basic workflow (for testing purposes; I just want to see if Sqoop is working) like this:
<workflow-app name="Sqoop" xmlns="uri:oozie:workflow:0.4">
  <start to="Sqoop"/>
  <action name="Sqoop">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
      <job-tracker>yarn.resourcemanager.address:8040</job-tracker>
      <name-node>hdfs://hdfs-server:54310</name-node>
      <command>job --list</command>
    </sqoop>
    <ok to="end"/>
    <error to="kill"/>
  </action>
  <kill name="kill">
    <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
I put this workflow into HDFS.
I wrote some Java code to start my job:
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

// Connect to the local Oozie server and point the job at the workflow in HDFS
OozieClient wc = new OozieClient("http://localhost:11000/oozie");
Properties conf = wc.createConfiguration();
conf.setProperty(OozieClient.APP_PATH, "hdfs://hdfs_server:54310/hive/testSqoop/sqoop-workflow.xml");
conf.setProperty("queueName", "default");
try {
    // Submit and start the workflow, then poll its status every 10 seconds
    String jobId = wc.run(conf);
    System.out.println("Workflow job submitted");
    while (wc.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
        System.out.println("Workflow job running ...");
        System.out.println("..." + wc.getJobInfo(jobId).getStatus().toString());
        Thread.sleep(10 * 1000);
    }
    System.out.println("Workflow job completed ...");
    System.out.println(wc.getJobInfo(jobId));
} catch (Exception r) {
    r.printStackTrace();
}
In the Oozie web interface I can see my job running:
2013-05-28 12:42:30,004 INFO ActionStartXCommand:539 - USER[anthonyc] GROUP[-] TOKEN[] APP[Sqoop] JOB[0000000-130528124140043-oozie-anth-W] ACTION[0000000-130528124140043-oozie-anth-W#:start:] Start action [0000000-130528124140043-oozie-anth-W#:start:] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
2013-05-28 12:42:30,008 WARN ActionStartXCommand:542 - USER[anthonyc] GROUP[-] TOKEN[] APP[Sqoop] JOB[0000000-130528124140043-oozie-anth-W] ACTION[0000000-130528124140043-oozie-anth-W#:start:] [***0000000-130528124140043-oozie-anth-W#:start:***]Action status=DONE
2013-05-28 12:42:30,009 WARN ActionStartXCommand:542 - USER[anthonyc] GROUP[-] TOKEN[] APP[Sqoop] JOB[0000000-130528124140043-oozie-anth-W] ACTION[0000000-130528124140043-oozie-anth-W#:start:] [***0000000-130528124140043-oozie-anth-W#:start:***]Action updated in DB!
2013-05-28 12:42:30,192 INFO ActionStartXCommand:539 - USER[anthonyc] GROUP[-] TOKEN[] APP[Sqoop] JOB[0000000-130528124140043-oozie-anth-W] ACTION[0000000-130528124140043-oozie-anth-W#Sqoop] Start action [0000000-130528124140043-oozie-anth-W#Sqoop] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
2013-05-28 12:42:31,389 WARN SqoopActionExecutor:542 - USER[anthonyc] GROUP[-] TOKEN[] APP[Sqoop] JOB[0000000-130528124140043-oozie-anth-W] ACTION[0000000-130528124140043-oozie-anth-W#Sqoop] credentials is null for the action
2013-05-28 12:42:42,942 INFO SqoopActionExecutor:539 - USER[anthonyc] GROUP[-] TOKEN[] APP[Sqoop] JOB[0000000-130528124140043-oozie-anth-W] ACTION[0000000-130528124140043-oozie-anth-W#Sqoop] checking action, external ID [job_1369126414383_0003] status [RUNNING]
2013-05-28 12:42:42,945 WARN ActionStartXCommand:542 - USER[anthonyc] GROUP[-] TOKEN[] APP[Sqoop] JOB[0000000-130528124140043-oozie-anth-W] ACTION[0000000-130528124140043-oozie-anth-W#Sqoop] [***0000000-130528124140043-oozie-anth-W#Sqoop***]Action status=RUNNING
2013-05-28 12:42:42,946 WARN ActionStartXCommand:542 - USER[anthonyc] GROUP[-] TOKEN[] APP[Sqoop] JOB[0000000-130528124140043-oozie-anth-W] ACTION[0000000-130528124140043-oozie-anth-W#Sqoop] [***0000000-130528124140043-oozie-anth-W#Sqoop***]Action updated in DB!
2013-05-28 12:47:43,034 INFO KillXCommand:539 - USER[anthonyc] GROUP[-] TOKEN[] APP[Sqoop] JOB[0000000-130528124140043-oozie-anth-W] ACTION[-] STARTED WorkflowKillXCommand for jobId=0000000-130528124140043-oozie-anth-W
2013-05-28 12:47:43,328 WARN CoordActionUpdateXCommand:542 - USER[anthonyc] GROUP[-] TOKEN[] APP[Sqoop] JOB[0000000-130528124140043-oozie-anth-W] ACTION[-] E1100: Command precondition does not hold before execution, [, coord action is null], Error Code: E1100
2013-05-28 12:47:43,328 INFO KillXCommand:539 - USER[anthonyc] GROUP[-] TOKEN[] APP[Sqoop] JOB[0000000-130528124140043-oozie-anth-W] ACTION[-] ENDED WorkflowKillXCommand for jobId=0000000-130528124140043-oozie-anth-W
And when I check the YARN web interface, I can see my job, but with the status FAILED and this message:
Application application_1369126414383_0003 failed 1 times due to AM Container for appattempt_1369126414383_0003_000001 exited with exitCode: 1 due to: .Failing this attempt.. Failing the application.
I really don't know what is wrong.
I need your advice.
Thank you~
You have to inspect the job logs:
$ oozie job -log <coord_job_id>
to understand what is happening.
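If the Oozie log only shows the action's external ID (for example job_1369126414383_0003), the container logs on the YARN side can also help. The sketch below is only an illustration: job_to_app_id and launcher_logs are hypothetical helper names, and it assumes the yarn CLI is on PATH and that log aggregation is enabled on the cluster. It relies on the fact that a MapReduce job ID and its YARN application ID share the same timestamp and sequence number.

```shell
# Convert a MapReduce job ID (as reported by Oozie) into the YARN application ID.
job_to_app_id() {
  case "$1" in
    job_*) echo "application_${1#job_}" ;;
    *) echo "not a job id: $1" >&2; return 1 ;;
  esac
}

# Fetch the aggregated container logs for that job.
# Assumption: the yarn CLI is installed and log aggregation is enabled.
launcher_logs() {
  _app_id=$(job_to_app_id "$1") || return 1
  yarn logs -applicationId "$_app_id"
}
```

For example, launcher_logs job_1369126414383_0003 would ask YARN for the logs of application_1369126414383_0003, which is where the stderr of the failed AM container should show up if aggregation is on.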