The oozie job does not run with the message [AM container is launched, waiting for AM container to Register with RM] - hadoop

I ran the shell job from the Oozie examples. However, the YARN application is not executed.
Detailed information from the YARN UI & logs:
https://docs.google.com/document/d/1N8LBXZGttY3rhRTwv8cUEfK3WkWtvWJ-YV1q_fh_kks/edit
The YARN application status is:
Application Priority: 0 (Higher Integer value indicates higher priority)
YarnApplicationState: ACCEPTED: waiting for AM container to be allocated, launched and register with RM.
Queue: default
FinalStatus Reported by AM: Application has not completed yet.
Finished: N/A
Elapsed: 20mins, 30sec
Tracking URL: ApplicationMaster
Log Aggregation Status: DISABLED
Application Timeout (Remaining Time): Unlimited
Diagnostics: AM container is launched, waiting for AM container to Register with RM
The application attempt status is:
Application Attempt State: FAILED
Elapsed: 13mins, 19sec
AM Container: container_1607273090037_0001_02_000001
Node: N/A
Tracking URL: History
Diagnostics Info: ApplicationMaster for attempt appattempt_1607273090037_0001_000002 timed out
                                            Node Local Request   Rack Local Request   Off Switch Request
Num Node Local Containers (satisfied by)    0
Num Rack Local Containers (satisfied by)    0                    0
Num Off Switch Containers (satisfied by)    0                    0                    1
NodeManager log:
2020-12-07 01:45:16,237 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler: Starting container [container_1607273090037_0001_01_000001]
2020-12-07 01:45:16,267 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1607273090037_0001_01_000001 transitioned from SCHEDULED to RUNNING
2020-12-07 01:45:16,267 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1607273090037_0001_01_000001
2020-12-07 01:45:16,272 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: launchContainer: [bash, /tmp/hadoop-oozie/nm-local-dir/usercache/oozie/appcache/application_1607273090037_0001/container_1607273090037_0001_01_000001/default_container_executor.sh]
2020-12-07 01:45:17,301 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: container_1607273090037_0001_01_000001's ip = 127.0.0.1, and hostname = localhost.localdomain
2020-12-07 01:45:17,345 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Skipping monitoring container container_1607273090037_0001_01_000001 since CPU usage is not yet available.
2020-12-07 01:45:48,274 INFO logs: Aliases are enabled
2020-12-07 01:54:50,242 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Cache Size Before Clean: 496756, Total Deleted: 0, Public Deleted: 0, Private Deleted: 0
2020-12-07 01:58:10,071 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1607273090037_0001_000001 (auth:SIMPLE)
2020-12-07 01:58:10,078 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Stopping container with container Id: container_1607273090037_0001_01_000001
What is the problem?

Related

Running Kafka-Manager inside Docker container on Windows

I am following this tutorial to run Kafka inside a Docker container on Windows.
When I try to launch Kafka-Manager by opening http://localhost:9000 in the browser as described there, I get ERR_CONNECTION_REFUSED.
Something I think might be related: the first time I ran docker-compose up, PowerShell showed an error saying I first needed to run some command to start a virtual machine, or something like that.
I ran the command PowerShell suggested and was then able to run docker-compose up successfully. However, the tutorial didn't mention anything about this, and since then I have managed to run docker-compose up without running any other command first, even after closing and reopening PowerShell.
I suspect PowerShell remembers I'm connected to a virtual machine, so docker-compose up runs Kafka inside a virtual machine, and that is why I can't reach Kafka-Manager in the browser, even though I see the following message:
kafkamanager | [info] p.c.s.NettyServer - Listening for HTTP on /0.0.0.0:9000
Edit:
docker logs for kafka container:
/usr/lib/python2.7/dist-packages/supervisor/options.py:296: UserWarning: Supervisord is running as root and it is searching for its configuration file in default locations (including its current working directory); you probably want to specify a "-c" argument specifying an absolute path to a configuration file for improved security.
'Supervisord is running as root and it is searching '
2020-02-28 08:37:37,274 CRIT Supervisor running as root (no user in config file)
2020-02-28 08:37:37,274 WARN Included extra file "/etc/supervisor/conf.d/zookeeper.conf" during parsing
2020-02-28 08:37:37,274 WARN Included extra file "/etc/supervisor/conf.d/kafka.conf" during parsing
2020-02-28 08:37:37,303 INFO RPC interface 'supervisor' initialized
2020-02-28 08:37:37,303 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2020-02-28 08:37:37,303 INFO supervisord started with pid 1
2020-02-28 08:37:38,306 INFO spawned: 'zookeeper' with pid 8
2020-02-28 08:37:38,308 INFO spawned: 'kafka' with pid 9
2020-02-28 08:37:39,372 INFO success: zookeeper entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-02-28 08:37:39,372 INFO success: kafka entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-02-28 21:16:01,095 WARN received SIGTERM indicating exit request
2020-02-28 21:16:01,095 INFO waiting for zookeeper, kafka to die
2020-02-28 21:16:02,102 INFO stopped: kafka (terminated by SIGTERM)
2020-02-28 21:16:02,442 INFO stopped: zookeeper (exit status 143)
/usr/lib/python2.7/dist-packages/supervisor/options.py:296: UserWarning: Supervisord is running as root and it is searching for its configuration file in default locations (including its current working directory); you probably want to specify a "-c" argument specifying an absolute path to a configuration file for improved security.
'Supervisord is running as root and it is searching '
2020-02-28 21:17:50,843 CRIT Supervisor running as root (no user in config file)
2020-02-28 21:17:50,843 WARN Included extra file "/etc/supervisor/conf.d/zookeeper.conf" during parsing
2020-02-28 21:17:50,843 WARN Included extra file "/etc/supervisor/conf.d/kafka.conf" during parsing
2020-02-28 21:17:50,858 INFO RPC interface 'supervisor' initialized
2020-02-28 21:17:50,858 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2020-02-28 21:17:50,859 INFO supervisord started with pid 1
2020-02-28 21:17:51,862 INFO spawned: 'zookeeper' with pid 8
2020-02-28 21:17:51,864 INFO spawned: 'kafka' with pid 9
2020-02-28 21:17:52,926 INFO success: zookeeper entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-02-28 21:17:52,927 INFO success: kafka entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-02-28 21:17:59,672 INFO exited: kafka (exit status 1; not expected)
2020-02-28 21:18:00,675 INFO spawned: 'kafka' with pid 297
2020-02-28 21:18:01,694 INFO success: kafka entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-02-29 19:42:18,487 WARN received SIGTERM indicating exit request
2020-02-29 19:42:18,487 INFO waiting for zookeeper, kafka to die
2020-02-29 19:42:18,488 INFO stopped: kafka (terminated by SIGTERM)
2020-02-29 19:42:18,821 INFO stopped: zookeeper (exit status 143)
/usr/lib/python2.7/dist-packages/supervisor/options.py:296: UserWarning: Supervisord is running as root and it is searching for its configuration file in default locations (including its current working directory); you probably want to specify a "-c" argument specifying an absolute path to a configuration file for improved security.
'Supervisord is running as root and it is searching '
2020-02-29 19:42:26,841 CRIT Supervisor running as root (no user in config file)
2020-02-29 19:42:26,841 WARN Included extra file "/etc/supervisor/conf.d/zookeeper.conf" during parsing
2020-02-29 19:42:26,842 WARN Included extra file "/etc/supervisor/conf.d/kafka.conf" during parsing
2020-02-29 19:42:26,854 INFO RPC interface 'supervisor' initialized
2020-02-29 19:42:26,854 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2020-02-29 19:42:26,855 INFO supervisord started with pid 1
2020-02-29 19:42:27,857 INFO spawned: 'zookeeper' with pid 8
2020-02-29 19:42:27,859 INFO spawned: 'kafka' with pid 9
2020-02-29 19:42:28,903 INFO success: zookeeper entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-02-29 19:42:28,903 INFO success: kafka entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-02-29 19:42:34,985 INFO exited: kafka (exit status 1; not expected)
2020-02-29 19:42:35,988 INFO spawned: 'kafka' with pid 297
2020-02-29 19:42:37,014 INFO success: kafka entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-02-29 19:43:20,590 WARN received SIGTERM indicating exit request
2020-02-29 19:43:20,590 INFO waiting for zookeeper, kafka to die
2020-02-29 19:43:20,590 INFO stopped: kafka (terminated by SIGTERM)
2020-02-29 19:43:20,784 INFO stopped: zookeeper (exit status 143)
/usr/lib/python2.7/dist-packages/supervisor/options.py:296: UserWarning: Supervisord is running as root and it is searching for its configuration file in default locations (including its current working directory); you probably want to specify a "-c" argument specifying an absolute path to a configuration file for improved security.
'Supervisord is running as root and it is searching '
2020-02-29 19:45:38,600 CRIT Supervisor running as root (no user in config file)
2020-02-29 19:45:38,600 WARN Included extra file "/etc/supervisor/conf.d/zookeeper.conf" during parsing
2020-02-29 19:45:38,600 WARN Included extra file "/etc/supervisor/conf.d/kafka.conf" during parsing
2020-02-29 19:45:38,619 INFO RPC interface 'supervisor' initialized
2020-02-29 19:45:38,629 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2020-02-29 19:45:38,630 INFO supervisord started with pid 1
2020-02-29 19:45:39,632 INFO spawned: 'zookeeper' with pid 8
2020-02-29 19:45:39,634 INFO spawned: 'kafka' with pid 9
2020-02-29 19:45:40,687 INFO success: zookeeper entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-02-29 19:45:40,689 INFO success: kafka entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-02-29 19:45:47,740 INFO exited: kafka (exit status 1; not expected)
2020-02-29 19:45:48,743 INFO spawned: 'kafka' with pid 297
2020-02-29 19:45:49,763 INFO success: kafka entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-02-29 19:46:20,659 WARN received SIGTERM indicating exit request
2020-02-29 19:46:20,659 INFO waiting for zookeeper, kafka to die
2020-02-29 19:46:20,660 INFO stopped: kafka (terminated by SIGTERM)
2020-02-29 19:46:20,991 INFO stopped: zookeeper (exit status 143)
/usr/lib/python2.7/dist-packages/supervisor/options.py:296: UserWarning: Supervisord is running as root and it is searching for its configuration file in default locations (including its current working directory); you probably want to specify a "-c" argument specifying an absolute path to a configuration file for improved security.
'Supervisord is running as root and it is searching '
2020-03-13 22:16:26,128 CRIT Supervisor running as root (no user in config file)
2020-03-13 22:16:26,128 WARN Included extra file "/etc/supervisor/conf.d/zookeeper.conf" during parsing
2020-03-13 22:16:26,128 WARN Included extra file "/etc/supervisor/conf.d/kafka.conf" during parsing
2020-03-13 22:16:26,157 INFO RPC interface 'supervisor' initialized
2020-03-13 22:16:26,162 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2020-03-13 22:16:26,162 INFO supervisord started with pid 1
2020-03-13 22:16:27,164 INFO spawned: 'zookeeper' with pid 8
2020-03-13 22:16:27,167 INFO spawned: 'kafka' with pid 9
2020-03-13 22:16:28,226 INFO success: zookeeper entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-03-13 22:16:28,227 INFO success: kafka entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-03-13 22:16:36,496 INFO exited: kafka (exit status 1; not expected)
2020-03-13 22:16:37,499 INFO spawned: 'kafka' with pid 298
2020-03-13 22:16:38,511 INFO success: kafka entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-03-13 22:17:20,939 WARN received SIGTERM indicating exit request
2020-03-13 22:17:20,940 INFO waiting for zookeeper, kafka to die
2020-03-13 22:17:20,940 INFO stopped: kafka (terminated by SIGTERM)
2020-03-13 22:17:21,268 INFO stopped: zookeeper (exit status 143)
/usr/lib/python2.7/dist-packages/supervisor/options.py:296: UserWarning: Supervisord is running as root and it is searching for its configuration file in default locations (including its current working directory); you probably want to specify a "-c" argument specifying an absolute path to a configuration file for improved security.
'Supervisord is running as root and it is searching '
2020-03-27 21:25:59,495 CRIT Supervisor running as root (no user in config file)
2020-03-27 21:25:59,496 WARN Included extra file "/etc/supervisor/conf.d/zookeeper.conf" during parsing
2020-03-27 21:25:59,497 WARN Included extra file "/etc/supervisor/conf.d/kafka.conf" during parsing
2020-03-27 21:25:59,520 INFO RPC interface 'supervisor' initialized
2020-03-27 21:25:59,522 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2020-03-27 21:25:59,523 INFO supervisord started with pid 1
2020-03-27 21:26:00,530 INFO spawned: 'zookeeper' with pid 8
2020-03-27 21:26:00,532 INFO spawned: 'kafka' with pid 9
2020-03-27 21:26:01,620 INFO success: zookeeper entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-03-27 21:26:01,620 INFO success: kafka entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
docker logs for kafka manager container seems fine:
[info] o.a.z.ZooKeeper - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
[info] o.a.z.ZooKeeper - Client environment:java.io.tmpdir=/tmp
[info] o.a.z.ZooKeeper - Client environment:java.compiler=<NA>
[info] o.a.z.ZooKeeper - Client environment:os.name=Linux
[info] o.a.z.ZooKeeper - Client environment:os.arch=amd64
[info] o.a.z.ZooKeeper - Client environment:os.version=4.9.93-boot2docker
[info] o.a.z.ZooKeeper - Client environment:user.name=root
[info] o.a.z.ZooKeeper - Client environment:user.home=/root
[info] o.a.z.ZooKeeper - Client environment:user.dir=/kafka-manager-1.3.3.4
[info] o.a.z.ZooKeeper - Initiating client connection, connectString=kafkaserver:2181 sessionTimeout=60000 watcher=org.apache.curator.ConnectionState#7a27a9b4
[info] o.a.z.ClientCnxn - Opening socket connection to server kafka.kafka_kafkanet/172.18.0.2:2181. Will not attempt to authenticate using SASL (unknown error)
[info] k.m.a.KafkaManagerActor - zk=kafkaserver:2181
[info] k.m.a.KafkaManagerActor - baseZkPath=/kafka-manager
[info] o.a.z.ClientCnxn - Socket connection established to kafka.kafka_kafkanet/172.18.0.2:2181, initiating session
[info] o.a.z.ClientCnxn - Session establishment complete on server kafka.kafka_kafkanet/172.18.0.2:2181, sessionid = 0x1711de33be70001, negotiated timeout = 40000
[info] k.m.a.KafkaManagerActor - Started actor akka://kafka-manager-system/user/kafka-manager
[info] k.m.a.KafkaManagerActor - Starting delete clusters path cache...
[info] k.m.a.DeleteClusterActor - Started actor akka://kafka-manager-system/user/kafka-manager/delete-cluster
[info] k.m.a.DeleteClusterActor - Starting delete clusters path cache...
[info] k.m.a.DeleteClusterActor - Adding kafka manager path cache listener...
[info] k.m.a.DeleteClusterActor - Scheduling updater for 10 seconds
[info] k.m.a.KafkaManagerActor - Starting kafka manager path cache...
[info] k.m.a.KafkaManagerActor - Adding kafka manager path cache listener...
[info] play.api.Play - Application started (Prod)
[info] p.c.s.NettyServer - Listening for HTTP on /0.0.0.0:9000
[info] k.m.a.KafkaManagerActor - Updating internal state...
[info] k.m.a.KafkaManagerActor - Updating internal state...
[info] k.m.a.KafkaManagerActor - Updating internal state...
[info] k.m.a.KafkaManagerActor - Updating internal state...
This log is a lot longer so I've omitted the beginning, but it seems fine.
Yes, there's a hypervisor, not a full VM. You can open the Hyper-V Manager to look at it.
Your compose file needs a port forward:
ports:
- '9000:9000'
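A hedged sketch of how the kafka-manager service entry could look with that mapping in place (the image name and ZK_HOSTS variable are assumptions based on a common kafka-manager image, not taken from the tutorial; the ports section is the point here):
kafkamanager:
  image: sheepkiller/kafka-manager    # assumed image; keep whatever the tutorial uses
  environment:
    ZK_HOSTS: "kafkaserver:2181"      # matches the connectString visible in the kafka-manager logs
  ports:
    - '9000:9000'                     # publish the UI so http://localhost:9000 reaches the container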
If you are using Docker Toolbox on Windows, you can try to access kafka-manager at this address: http://192.168.99.100:9000
Note: 192.168.99.100 is the default IP address of the VM that Docker is running in.
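If you are on Docker Toolbox, you can confirm that VM address from PowerShell (assuming the default machine name that Toolbox creates):
docker-machine ip default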
The docker-compose.yaml given in the tutorial is fine. Can you do a docker-compose down and then bring everything back up with docker-compose up (see the sequence below)?
Then browse to http://localhost:9000 and you should be able to see it.
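Restart sequence, run from the directory containing the compose file:
docker-compose down
docker-compose up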
Possible causes:
Port forwarding is missing (already handled in the docker-compose file)
You are opening HTTPS instead of HTTP in the browser.

Spark streaming job on YARN cluster mode stuck in accepted, then fails with a Timeout Exception

I am running a Spark Streaming application that simply reads messages from a Kafka topic, enriches them, and then writes the enriched messages to another Kafka topic.
I already tried it in Standalone mode (both client and cluster deploy mode) and in YARN client mode, successfully.
When I submit the application in cluster mode it gives me the following messages:
18/01/10 12:13:34 INFO Client: Submitting application application_1515582681419_0001 to ResourceManager
18/01/10 12:13:34 INFO YarnClientImpl: Submitted application application_1515582681419_0001
18/01/10 12:13:35 INFO Client: Application report for application_1515582681419_0001 (state: ACCEPTED)
18/01/10 12:13:35 INFO Client:
client token: N/A
diagnostics: AM container is launched, waiting for AM container to Register with RM
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1515582814080
final status: UNDEFINED
tracking URL: http://ambari1.internal:8088/proxy/application_1515582681419_0001/
user: root
18/01/10 12:13:36 INFO Client: Application report for application_1515582681419_0001 (state: ACCEPTED)
18/01/10 12:13:37 INFO Client: Application report for application_1515582681419_0001 (state: ACCEPTED)
It stays stuck in ACCEPTED status for around 4-5 minutes, and then exits with the following error message:
18/01/10 12:17:00 INFO InputInfoTracker: remove old batch metadata: 1515583000000 ms
18/01/10 12:17:02 ERROR ApplicationMaster: Uncaught exception:
java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:423)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:282)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:768)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:766)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
18/01/10 12:17:02 INFO ApplicationMaster: Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds])
18/01/10 12:17:02 INFO StreamingContext: Invoking stop(stopGracefully=false) from shutdown hook
18/01/10 12:17:02 INFO ReceiverTracker: ReceiverTracker stopped
18/01/10 12:17:02 INFO JobGenerator: Stopping JobGenerator immediately
Funny fact: if I visit the application's page, I can see that the Spark context has been started and it processes some messages.
Could anyone help me with this?
PS: These are the resources of my YARN cluster:
The problem might be with the YARN "App Timeline Server". Try restarting it.
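For example, from the node running the timeline server (a hedged sketch; on an Ambari-managed cluster like this one, the same can be done by restarting the "App Timeline Server" component from the Ambari UI):
$HADOOP_YARN_HOME/sbin/yarn-daemon.sh stop timelineserver
$HADOOP_YARN_HOME/sbin/yarn-daemon.sh start timelineserver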
Are you creating your Spark session with the master set to local? Please do check this.
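For instance, a minimal PySpark-style sketch (the question doesn't show the application code, so the names here are purely illustrative): leave the master out of the code so that --master yarn passed to spark-submit takes effect.
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

conf = SparkConf().setAppName("kafka-enricher")   # note: no setMaster("local[*]") here
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 5)                     # 5-second batches, illustrative only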

Change Mesos Master Leader, cause Marathon shutdown?

Env:
Zookeeper on computer A,
Mesos master on computer B as Leader,
Mesos master on computer C,
Marathon on computer B singleton.
Action:
Kill the Mesos master process on computer B, to force a Mesos cluster leader change.
Result:
The Mesos cluster leader changes to the Mesos master on computer C,
but the Marathon task on computer B shuts down automatically with the following logs.
Question:
Can somebody help me understand why Marathon went down, and how to fix it?
Logs:
I1109 12:19:10.010197 11287 detector.cpp:152] Detected a new leader: (id='9')
I1109 12:19:10.010646 11291 group.cpp:699] Trying to get '/mesos/json.info_0000000009' in ZooKeeper
I1109 12:19:10.013425 11292 zookeeper.cpp:262] A new leading master (UPID=master#10.4.23.55:5050) is detected
[2017-11-09 12:19:10,015] WARN Disconnected (mesosphere.marathon.MarathonScheduler:Thread-23)
I1109 12:19:10.018977 11292 sched.cpp:2021] Asked to stop the driver
I1109 12:19:10.019161 11292 sched.cpp:336] New master detected at master#10.4.23.55:5050
I1109 12:19:10.019892 11292 sched.cpp:1203] Stopping framework d52cbd8c-1015-4d94-8328-e418876ca5b2-0000
[2017-11-09 12:19:10,020] INFO Driver future completed with result=Success(()). (mesosphere.marathon.MarathonSchedulerService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,022] INFO Abdicating leadership (mesosphere.marathon.core.election.impl.CuratorElectionService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,022] INFO Stopping the election service (mesosphere.marathon.core.election.impl.CuratorElectionService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,029] INFO backgroundOperationsLoop exiting (org.apache.curator.framework.imps.CuratorFrameworkImpl:Curator-Framework-0)
[2017-11-09 12:19:10,061] INFO Session: 0x15f710ffb010058 closed (org.apache.zookeeper.ZooKeeper:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,061] INFO EventThread shut down for session: 0x15f710ffb010058 (org.apache.zookeeper.ClientCnxn:pool-3-thread-1-EventThread)
[2017-11-09 12:19:10,063] INFO Stopping MarathonSchedulerService [RUNNING]'s leadership (mesosphere.marathon.core.election.impl.CuratorElectionService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,063] INFO Lost leadership (mesosphere.marathon.MarathonSchedulerService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,066] INFO All actors suspended:
* Actor[akka://marathon/user/offerMatcherStatistics#-1904211014]
* Actor[akka://marathon/user/reviveOffersWhenWanted#-238627718]
* Actor[akka://marathon/user/expungeOverdueLostTasks#608979053]
* Actor[akka://marathon/user/launchQueue#803590575]
* Actor[akka://marathon/user/offersWantedForReconciliation#598482724]
* Actor[akka://marathon/user/offerMatcherLaunchTokens#813230776]
* Actor[akka://marathon/user/offerMatcherManager#1205401692]
* Actor[akka://marathon/user/instanceTracker#1055980147]
* Actor[akka://marathon/user/killOverdueStagedTasks#-40058350]
* Actor[akka://marathon/user/taskKillServiceActor#-602552505]
* Actor[akka://marathon/user/rateLimiter#-911383474]
* Actor[akka://marathon/user/deploymentManager#2013376325] (mesosphere.marathon.core.leadership.impl.LeadershipCoordinatorActor:marathon-akka.actor.default-dispatcher-10)
I1109 12:19:10.069551 11272 sched.cpp:2021] Asked to stop the driver
[2017-11-09 12:19:10,068] INFO Stopping driver (mesosphere.marathon.MarathonSchedulerService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,069] INFO Stopped MarathonSchedulerService [RUNNING]'s leadership (mesosphere.marathon.core.election.impl.CuratorElectionService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,070] INFO Terminating due to leadership abdication or failure (mesosphere.marathon.core.election.impl.CuratorElectionService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,071] INFO Call postDriverRuns callbacks on (mesosphere.marathon.MarathonSchedulerService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,074] INFO Now standing by. Closing existing handles and rejecting new. (mesosphere.marathon.core.event.impl.stream.HttpEventStreamActor:marathon-akka.actor.default-dispatcher-12)
[2017-11-09 12:19:10,074] INFO Suspending scheduler actor (mesosphere.marathon.MarathonSchedulerActor:marathon-akka.actor.default-dispatcher-2)
[2017-11-09 12:19:10,083] INFO Finished postDriverRuns callbacks (mesosphere.marathon.MarathonSchedulerService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,084] INFO ExpungeOverdueLostTasksActor has stopped (mesosphere.marathon.core.task.jobs.impl.ExpungeOverdueLostTasksActor:marathon-akka.actor.default-dispatcher-9)
[1]+ Exit 137
I think there is a wrong configuration in the ZooKeeper cluster. Use a 3-node ZooKeeper ensemble with 2 Mesos masters and multiple slaves. Ref: https://www.google.co.in/amp/s/beingasysadmin.wordpress.com/2014/08/16/managing-ha-docker-cluster-using-multiple-mesos-masters/amp/
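For reference, a minimal 3-node ensemble in zoo.cfg looks roughly like this (hostnames are placeholders; each node also needs a matching myid file):
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888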
Did you set the master reference in the Marathon conf? Can you run:
cat /etc/marathon/conf/master
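On this setup the file should contain the ZooKeeper URL of the Mesos masters, roughly like this (host and port are placeholders for the ZooKeeper on computer A):
zk://computerA:2181/mesos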

Problems using Spark 1.6.2 for Hadoop 2.6.0 in a Hadoop 2.7.1 cluster

I have access to a Hadoop cluster, version 2.7.1, that was installed using HDP 2.4. Such a cluster has Spark installed, specifically:
$ cat /usr/hdp/2.4.3.0-227/spark/RELEASE
Spark 1.6.2.2.4.3.0-227 built for Hadoop 2.7.1.2.4.3.0-227
I'm trying to set up a "client" machine able to remotely connect to the cluster and deploy Spark jobs. Thus, I need to install a Spark distribution for the same versions as above.
First of all, I've gone to the official Spark download page, but 1.6.2 is only available for Hadoop 2.6.
Then I decided to download the Spark source code and build it by following this guide. The interesting thing is that the required build profile for Hadoop "2.6.x and later 2.x" is hadoop-2.6, i.e. if I build Spark myself, I'll obtain the same kind of distribution as the one available on the official Spark download page.
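For reference, the build the guide describes would look roughly like this (a hedged sketch based on the Spark 1.6 "Building Spark" documentation, passing the cluster's Hadoop version explicitly):
mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -DskipTests clean package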
Thus, I've gone with that official pre-built distribution of Spark 1.6.2 for Hadoop 2.6.0.
And it does not seem to work properly. I've submitted a Python script - a very simple one that only creates a Spark context - and there is some kind of problem (only the relevant parts of the log are shown):
$ ./bin/spark-submit --master yarn --deploy-mode cluster basic.py
...
17/08/28 13:08:29 INFO Client: Requesting a new application from cluster with 8 NodeManagers
17/08/28 13:08:29 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (24576 MB per container)
17/08/28 13:08:29 INFO Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
17/08/28 13:08:29 INFO Client: Setting up container launch context for our AM
17/08/28 13:08:29 INFO Client: Setting up the launch environment for our AM container
17/08/28 13:08:29 INFO Client: Preparing resources for our AM container
17/08/28 13:08:36 INFO Client: Uploading resource file:/Users/frb/Applications/spark-1.6.2-bin-hadoop2.6/lib/spark-assembly-1.6.2-hadoop2.6.0.jar -> hdfs://<host>:8020/user/frb/.sparkStaging/application_1495097788339_0066/spark-assembly-1.6.2-hadoop2.6.0.jar
17/08/28 13:14:40 INFO Client: Uploading resource file:basic.py -> hdfs://<host>:8020/user/frb/.sparkStaging/application_1495097788339_0066/basic.py
17/08/28 13:14:40 INFO Client: Uploading resource file:/Users/frb/Applications/spark-1.6.2-bin-hadoop2.6/python/lib/pyspark.zip -> hdfs://<host>:8020/user/frb/.sparkStaging/application_1495097788339_0066/pyspark.zip
17/08/28 13:14:41 INFO Client: Uploading resource file:/Users/frb/Applications/spark-1.6.2-bin-hadoop2.6/python/lib/py4j-0.9-src.zip -> hdfs://<host>:8020/user/frb/.sparkStaging/application_1495097788339_0066/py4j-0.9-src.zip
17/08/28 13:14:42 INFO Client: Uploading resource file:/private/var/folders/cc/p9gx2wnn3dz8g6yf_r4308fm0000gn/T/spark-0d86f1f4-d310-423a-9d2f-90e2ff46f84e/__spark_conf__3704082754178078870.zip -> hdfs://<host>:8020/user/frb/.sparkStaging/application_1495097788339_0066/__spark_conf__3704082754178078870.zip
17/08/28 13:14:42 INFO SecurityManager: Changing view acls to: frb
17/08/28 13:14:42 INFO SecurityManager: Changing modify acls to: frb
17/08/28 13:14:42 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(frb); users with modify permissions: Set(frb)
17/08/28 13:14:42 INFO Client: Submitting application 66 to ResourceManager
17/08/28 13:14:42 INFO YarnClientImpl: Submitted application application_1495097788339_0066
17/08/28 13:14:48 INFO Client: Application report for application_1495097788339_0066 (state: ACCEPTED)
17/08/28 13:14:48 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1503918882943
final status: UNDEFINED
tracking URL: <host>:8088/proxy/application_1495097788339_0066/
user: frb
17/08/28 13:14:49 INFO Client: Application report for application_1495097788339_0066 (state: ACCEPTED)
...
17/08/28 13:14:52 INFO Client: Application report for application_1495097788339_0066 (state: RUNNING)
17/08/28 13:14:52 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.95.120.6
ApplicationMaster RPC port: 0
queue: default
start time: 1503918882943
final status: UNDEFINED
tracking URL: <host>:8088/proxy/application_1495097788339_0066/
user: frb
17/08/28 13:14:53 INFO Client: Application report for application_1495097788339_0066 (state: RUNNING)
...
17/08/28 13:14:59 INFO Client: Application report for application_1495097788339_0066 (state: ACCEPTED)
17/08/28 13:14:59 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1503918882943
final status: UNDEFINED
tracking URL: <host>:8088/proxy/application_1495097788339_0066/
user: frb
17/08/28 13:15:00 INFO Client: Application report for application_1495097788339_0066 (state: ACCEPTED)
17/08/28 13:15:01 INFO Client: Application report for application_1495097788339_0066 (state: RUNNING)
17/08/28 13:15:01 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.95.58.21
ApplicationMaster RPC port: 0
queue: default
start time: 1503918882943
final status: UNDEFINED
tracking URL: <host>:8088/proxy/application_1495097788339_0066/
user: frb
17/08/28 13:15:02 INFO Client: Application report for application_1495097788339_0066 (state: RUNNING)
...
17/08/28 13:15:09 INFO Client: Application report for application_1495097788339_0066 (state: FINISHED)
17/08/28 13:15:09 INFO Client:
client token: N/A
diagnostics: Max number of executor failures (4) reached
ApplicationMaster host: 10.95.58.21
ApplicationMaster RPC port: 0
queue: default
start time: 1503918882943
final status: FAILED
tracking URL: <host>:8088/proxy/application_1495097788339_0066/
user: frb
Exception in thread "main" org.apache.spark.SparkException: Application application_1495097788339_0066 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1034)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1081)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/08/28 13:15:09 INFO ShutdownHookManager: Shutdown hook called
17/08/28 13:15:09 INFO ShutdownHookManager: Deleting directory /private/var/folders/cc/p9gx2wnn3dz8g6yf_r4308fm0000gn/T/spark-0d86f1f4-d310-423a-9d2f-90e2ff46f84e
If I check the logs for this job, I see that:
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server
Traceback (most recent call last):
File "/disk0/hadoop/yarn/local/usercache/frb/appcache/application_1495097788339_0066/container_e03_1495097788339_0066_02_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 690, in start
self.socket.connect((self.address, self.port))
File "/usr/lib64/python2.7/socket.py", line 224, in meth
return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
Traceback (most recent call last):
File "basic.py", line 36, in <module>
sc = SparkContext(conf=conf)
File "/disk0/hadoop/yarn/local/usercache/frb/appcache/application_1495097788339_0066/container_e03_1495097788339_0066_02_000001/pyspark.zip/pyspark/context.py", line 115, in __init__
File "/disk0/hadoop/yarn/local/usercache/frb/appcache/application_1495097788339_0066/container_e03_1495097788339_0066_02_000001/pyspark.zip/pyspark/context.py", line 172, in _do_init
File "/disk0/hadoop/yarn/local/usercache/frb/appcache/application_1495097788339_0066/container_e03_1495097788339_0066_02_000001/pyspark.zip/pyspark/context.py", line 235, in _initialize_context
File "/disk0/hadoop/yarn/local/usercache/frb/appcache/application_1495097788339_0066/container_e03_1495097788339_0066_02_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 1062, in __call__
File "/disk0/hadoop/yarn/local/usercache/frb/appcache/application_1495097788339_0066/container_e03_1495097788339_0066_02_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 631, in send_command
File "/disk0/hadoop/yarn/local/usercache/frb/appcache/application_1495097788339_0066/container_e03_1495097788339_0066_02_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 624, in send_command
File "/disk0/hadoop/yarn/local/usercache/frb/appcache/application_1495097788339_0066/container_e03_1495097788339_0066_02_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 579, in _get_connection
File "/disk0/hadoop/yarn/local/usercache/frb/appcache/application_1495097788339_0066/container_e03_1495097788339_0066_02_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 585, in _create_connection
File "/disk0/hadoop/yarn/local/usercache/frb/appcache/application_1495097788339_0066/container_e03_1495097788339_0066_02_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 697, in start
py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to the Java server
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server
Traceback (most recent call last):
File "/disk0/hadoop/yarn/local/usercache/frb/appcache/application_1495097788339_0066/container_e03_1495097788339_0066_02_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 690, in start
self.socket.connect((self.address, self.port))
File "/usr/lib64/python2.7/socket.py", line 224, in meth
return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
I.e. the Spark context is not created: the connection between the JVM running the Py4J Java gateway and the Python driver creating the Spark context fails.
This must be related to the Spark distribution I've installed on my client machine, because:
The Spark distribution of my client machine is uploaded to the cluster, thus it is the one used; just remember this log line from the submission:
17/08/28 13:08:36 INFO Client: Uploading resource file:/Users/frb/Applications/spark-1.6.2-bin-hadoop2.6/lib/spark-assembly-1.6.2-hadoop2.6.0.jar -> hdfs://:8020/user/frb/.sparkStaging/application_1495097788339_0066/spark-assembly-1.6.2-hadoop2.6.0.jar
The same command works when submitted from within the cluster, i.e. when using the "Spark 1.6.2.2.4.3.0-227 built for Hadoop 2.7.1.2.4.3.0-227" version of Spark installed by HDP.
Any idea about how to fix this? Thanks!
I finally solved this:
I added the option --conf spark.yarn.jar to the spark-submit command, with the location of the Spark assembly jar in the remote Spark cluster as its value. This avoids uploading the client-side Spark assembly jar I installed (which is a slow process, and does not exactly match the remote version anyway).
I added the property hdp.version to the client-side yarn-site.xml, with the HDP version of the remote Hadoop/Spark cluster as its value. This avoids a substitution error in certain paths, which in the end turned out to be the cause of the connection error I described in the question.
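Putting the two changes together, a rough sketch (the assembly path on HDFS is a placeholder, not a value from the question; note that on Spark 1.6.x the property is spark.yarn.jar, while spark.yarn.jars is its Spark 2.x successor):
$ ./bin/spark-submit --master yarn --deploy-mode cluster \
    --conf spark.yarn.jar=hdfs://<host>:8020/hdp/apps/2.4.3.0-227/spark/spark-hdp-assembly.jar \
    basic.py
And the client-side yarn-site.xml addition, using the cluster's HDP version from the RELEASE file above:
<property>
  <name>hdp.version</name>
  <value>2.4.3.0-227</value>
</property>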

Spark: Unknown/unsupported param error when setting conf.yarn.jar

I have a little application that runs fine on my YARN-based Spark cluster when I submit it with spark-submit like this:
~/spark-1.4.0-bin-hadoop2.4$ bin/spark-submit --class MyClass --master yarn-cluster --queue testing myApp.jar hdfs://nameservice1/user/XXX/README.md_count
However, I would like to avoid uploading the spark-assembly.jar file each time, so I set the spark.yarn.jar configuration parameter:
~/spark-1.4.0-bin-hadoop2.4$ bin/spark-submit --class MyClass --master yarn-cluster --queue testing --conf "spark.yarn.jar=hdfs://nameservice1/user/spark/share/lib/spark-assembly.jar" myApp.jar hdfs://nameservice1/user/XXX/README.md_count
This seems to be fine at first:
15/07/08 13:57:17 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/07/08 13:57:18 INFO yarn.Client: Requesting a new application from cluster with 24 NodeManagers
15/07/08 13:57:18 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
15/07/08 13:57:18 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
15/07/08 13:57:18 INFO yarn.Client: Setting up container launch context for our AM
15/07/08 13:57:18 INFO yarn.Client: Preparing resources for our AM container
15/07/08 13:57:18 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://nameservice1/user/spark/share/lib/spark-assembly.jar
[...]
However, it fails eventually:
15/07/08 13:57:18 INFO yarn.Client: Submitting application 670 to ResourceManager
15/07/08 13:57:18 INFO impl.YarnClientImpl: Submitted application application_1434986503384_0670
15/07/08 13:57:19 INFO yarn.Client: Application report for application_1434986503384_0670 (state: ACCEPTED)
15/07/08 13:57:19 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: testing
start time: 1436356638869
final status: UNDEFINED
tracking URL: http://node-00a/cluster/app/application_1434986503384_0670
user: XXX
15/07/08 13:57:20 INFO yarn.Client: Application report for application_1434986503384_0670 (state: ACCEPTED)
15/07/08 13:57:21 INFO yarn.Client: Application report for application_1434986503384_0670 (state: ACCEPTED)
15/07/08 13:57:23 INFO yarn.Client: Application report for application_1434986503384_0670 (state: FAILED)
15/07/08 13:57:23 INFO yarn.Client:
client token: N/A
diagnostics: Application application_1434986503384_0670 failed 2 times due to AM Container for appattempt_1434986503384_0670_000002 exited with exitCode: 1 due to: Exception from container-launch.
Container id: container_1434986503384_0670_02_000001
Exit code: 1
[...]
In the YARN log, I find the following error message indicating wrong usage of parameters:
Container: container_1434986503384_0670_01_000001 on node-01b_8041
===================================================================================================
LogType:stderr
Log Upload Time:Mi Jul 08 13:57:22 +0200 2015
LogLength:764
Log Contents:
Unknown/unsupported param List(--arg, hdfs://nameservice1/user/XXX/README.md_count, --executor-memory, 1024m, --executor-cores, 1, --num-executors, 2)
Usage: org.apache.spark.deploy.yarn.ApplicationMaster [options]
Options:
--jar JAR_PATH Path to your application's JAR file (required)
--class CLASS_NAME Name of your application's main class (required)
--args ARGS Arguments to be passed to your application's main class.
Mutliple invocations are possible, each will be passed in order.
--num-executors NUM Number of executors to start (Default: 2)
--executor-cores NUM Number of cores for the executors (Default: 1)
--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G)
End of LogType:stderr
As the same application runs when uploading the local assembly file upon submission, it seems to come down to the assembly file. Could the one on the cluster be a wrong/different version? How could I validate that? What other reasons might be the cause? Is the warning WARN util.NativeCodeLoader: ... possibly related?
The same happens when I set the (deprecated) environment variable SPARK_JAR instead of setting spark.yarn.jar.
Asking the obvious question here: are you sure the spark-assembly.jar on HDFS is the same one as you have locally? If not, can you try uploading your local spark-assembly to your home directory on HDFS and try again?
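A hedged way to do that comparison with standard HDFS shell commands (the local jar name is assumed from the spark-1.4.0-bin-hadoop2.4 layout shown above):
$ hdfs dfs -get hdfs://nameservice1/user/spark/share/lib/spark-assembly.jar /tmp/spark-assembly-hdfs.jar
$ md5sum /tmp/spark-assembly-hdfs.jar lib/spark-assembly-1.4.0-hadoop2.4.0.jar
$ # if the checksums differ, overwrite the HDFS copy with the local assembly:
$ hdfs dfs -put -f lib/spark-assembly-1.4.0-hadoop2.4.0.jar hdfs://nameservice1/user/spark/share/lib/spark-assembly.jar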

Resources