How can I compile my C++ code with Mesos?

I have a cluster of 1 master node and 2 slaves, and I'm trying to compile my application with Mesos.
Basically, here is the command that I use:
mesos-execute --name=alc1 --command="ccmake -j myapp" --master=10.11.12.13:5050
Offers are made from the slave but this compilation task keeps failing.
[root@master-node ~]# mesos-execute --name=alc1 --command="ccmake -j myapp" --master=10.11.12.13:5050
I0511 22:26:11.623016 11560 sched.cpp:222] Version: 0.28.0
I0511 22:26:11.625602 11564 sched.cpp:326] New master detected at master#10.11.12.13:5050
I0511 22:26:11.625952 11564 sched.cpp:336] No credentials provided. Attempting to register without authentication
I0511 22:26:11.627279 11564 sched.cpp:703] Framework registered with 70582e35-5d6e-4915-a919-cae61c904fd9-0139
Framework registered with 70582e35-5d6e-4915-a919-cae61c904fd9-0139
task alc1 submitted to slave 70582e35-5d6e-4915-a919-cae61c904fd9-S2
Received status update TASK_RUNNING for task alc1
Received status update TASK_FAILED for task alc1
I0511 22:26:11.759610 11567 sched.cpp:1903] Asked to stop the driver
I0511 22:26:11.759639 11567 sched.cpp:1143] Stopping framework '70582e35-5d6e-4915-a919-cae61c904fd9-0139'
On the slave node, here is the stderr log from the sandbox:
I0511 22:26:13.781070 5037 exec.cpp:143] Version: 0.28.0
I0511 22:26:13.785001 5040 exec.cpp:217] Executor registered on slave 70582e35-5d6e-4915-a919-cae61c904fd9-S2
sh: ccmake: command not found
I0511 22:26:13.892653 5042 exec.cpp:390] Executor asked to shutdown
Just to mention that commands like this work fine and give me the expected results:
[root@master-node ~]# mesos-execute --name=alc1 --command="find / -name a" --master=10.11.12.13:5050
I0511 22:26:03.733172 11550 sched.cpp:222] Version: 0.28.0
I0511 22:26:03.736112 11554 sched.cpp:326] New master detected at master#10.11.12.13:5050
I0511 22:26:03.736383 11554 sched.cpp:336] No credentials provided. Attempting to register without authentication
I0511 22:26:03.737730 11554 sched.cpp:703] Framework registered with 70582e35-5d6e-4915-a919-cae61c904fd9-0138
Framework registered with 70582e35-5d6e-4915-a919-cae61c904fd9-0138
task alc1 submitted to slave 70582e35-5d6e-4915-a919-cae61c904fd9-S2
Received status update TASK_RUNNING for task alc1
Received status update TASK_FINISHED for task alc1
I0511 22:26:04.184813 11553 sched.cpp:1903] Asked to stop the driver
I0511 22:26:04.184844 11553 sched.cpp:1143] Stopping framework '70582e35-5d6e-4915-a919-cae61c904fd9-0138'
I don't really get what is needed to even troubleshoot this issue.
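For what it's worth, mesos-execute runs the --command string through sh on whichever agent accepts the offer, and the stderr above shows that agent has no ccmake on its PATH (ccmake is CMake's interactive curses front end, so it is a poor fit for a batch task even when installed). A sketch of an invocation that should behave like the find example, assuming cmake and make are installed on every agent and the sources live at a hypothetical path:
mesos-execute --name=alc1 --command="cd /home/user/myapp && cmake . && make -j4" --master=10.11.12.13:5050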

Related

YARN complains java.net.NoRouteToHostException: No route to host (Host unreachable)

Attempting to run h2o on an HDP 3.1 cluster and running into an error that appears to be about YARN resource capacity...
[ml1user@HW04 h2o-3.26.0.1-hdp3.1]$ hadoop jar h2odriver.jar -nodes 3 -mapperXmx 10g
Determining driver host interface for mapper->driver callback...
[Possible callback IP address: 192.168.122.1]
[Possible callback IP address: 172.18.4.49]
[Possible callback IP address: 127.0.0.1]
Using mapper->driver callback IP address and port: 172.18.4.49:46015
(You can override these with -driverif and -driverport/-driverportrange and/or specify external IP using -extdriverif.)
Memory Settings:
mapreduce.map.java.opts: -Xms10g -Xmx10g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Dlog4j.defaultInitOverride=true
Extra memory percent: 10
mapreduce.map.memory.mb: 11264
Hive driver not present, not generating token.
19/07/25 14:48:05 INFO client.RMProxy: Connecting to ResourceManager at hw01.ucera.local/172.18.4.46:8050
19/07/25 14:48:06 INFO client.AHSProxy: Connecting to Application History server at hw02.ucera.local/172.18.4.47:10200
19/07/25 14:48:07 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/ml1user/.staging/job_1564020515809_0006
19/07/25 14:48:08 INFO mapreduce.JobSubmitter: number of splits:3
19/07/25 14:48:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1564020515809_0006
19/07/25 14:48:08 INFO mapreduce.JobSubmitter: Executing with tokens: []
19/07/25 14:48:08 INFO conf.Configuration: found resource resource-types.xml at file:/etc/hadoop/3.1.0.0-78/0/resource-types.xml
19/07/25 14:48:08 INFO impl.YarnClientImpl: Submitted application application_1564020515809_0006
19/07/25 14:48:08 INFO mapreduce.Job: The url to track the job: http://HW01.ucera.local:8088/proxy/application_1564020515809_0006/
Job name 'H2O_47159' submitted
JobTracker job ID is 'job_1564020515809_0006'
For YARN users, logs command is 'yarn logs -applicationId application_1564020515809_0006'
Waiting for H2O cluster to come up...
ERROR: Timed out waiting for H2O cluster to come up (120 seconds)
ERROR: (Try specifying the -timeout option to increase the waiting time limit)
Attempting to clean up hadoop job...
19/07/25 14:50:19 INFO impl.YarnClientImpl: Killed application application_1564020515809_0006
Killed.
19/07/25 14:50:23 INFO client.RMProxy: Connecting to ResourceManager at hw01.ucera.local/172.18.4.46:8050
19/07/25 14:50:23 INFO client.AHSProxy: Connecting to Application History server at hw02.ucera.local/172.18.4.47:10200
----- YARN cluster metrics -----
Number of YARN worker nodes: 3
----- Nodes -----
Node: http://HW03.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
Node: http://HW04.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
Node: http://HW02.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
----- Queues -----
Queue name: default
Queue state: RUNNING
Current capacity: 0.00
Capacity: 1.00
Maximum capacity: 1.00
Application count: 0
Queue 'default' approximate utilization: 0.0 / 45.0 GB used, 0 / 9 vcores used
----------------------------------------------------------------------
ERROR: Unable to start any H2O nodes; please contact your YARN administrator.
A common cause for this is the requested container size (11.0 GB)
exceeds the following YARN settings:
yarn.nodemanager.resource.memory-mb
yarn.scheduler.maximum-allocation-mb
----------------------------------------------------------------------
For YARN users, logs command is 'yarn logs -applicationId application_1564020515809_0006'
Looking in the YARN configs in Ambari UI, these properties are nowhere to be found. But checking the YARN logs in the YARN resource manager UI and checking some of the logs for the killed application, I see what appears to be unreachable-host errors...
Container: container_e05_1564020515809_0006_02_000002 on HW03.ucera.local_45454_1564102219781
LogAggregationType: AGGREGATED
=============================================================================================
LogType:stderr
LogLastModifiedTime:Thu Jul 25 14:50:19 -1000 2019
LogLength:2203
LogContents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/yarn/local/filecache/11/mapreduce.tar.gz/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/yarn/local/usercache/ml1user/appcache/application_1564020515809_0006/filecache/10/job.jar/job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapred.YarnChild).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
java.net.NoRouteToHostException: No route to host (Host unreachable)
at java.net.PlainSocketImpl.socketConnect(Native Method)
....
at java.net.Socket.<init>(Socket.java:211)
at water.hadoop.EmbeddedH2OConfig$BackgroundWriterThread.run(EmbeddedH2OConfig.java:38)
End of LogType:stderr
***********************************************************************
Taking note of "java.net.NoRouteToHostException: No route to host (Host unreachable)". However, all the nodes can reach and ping each other, so I'm not sure what is going on here. Any suggestions for debugging or fixing?
I think I found the problem. TL;DR: firewalld (the nodes run CentOS 7) was still running, when it should be disabled on HDP clusters.
From another community post:
For Ambari to communicate during setup with the hosts it deploys to and manages, certain ports must be open and available. The easiest way to do this is to temporarily disable iptables, as follows:
systemctl disable firewalld
service firewalld stop
So apparently iptables and firewalld need to be disabled across the cluster (supporting docs can be found here; I had only disabled them on the Ambari installation node). After stopping these services across the cluster (I recommend using clush; see the sketch below), I was able to run the YARN job without incident.
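A sketch of that cluster-wide stop/disable with clush, using the node names visible in the logs above (adjust the node set to your cluster):
clush -w hw0[1-4].ucera.local 'systemctl stop firewalld && systemctl disable firewalld'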
Normally, this problem is either due to bad DNS configuration, firewalls, or network unreachability (a port-level check is sketched after the quoted list below). To quote this official doc:
The hostname of the remote machine is wrong in the configuration files
The client's host table /etc/hosts has an invalid IPAddress for the target host.
The DNS server's host table has an invalid IPAddress for the target host.
The client's routing tables (In Linux, iptables) are wrong.
The DHCP server is publishing bad routing information.
Client and server are on different subnets, and are not set up to talk to each other. This may be an accident, or it is to deliberately lock down the Hadoop cluster.
The machines are trying to communicate using IPv6. Hadoop does not currently support IPv6
The host's IP address has changed but a long-lived JVM is caching the old value. This is a known problem with JVMs (search for "java negative DNS caching" for the details and solutions). The quick solution: restart the JVMs
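In this case ICMP worked but the service ports were filtered by firewalld, so testing the actual port tells you more than ping does; a sketch using the callback address and port from the h2odriver output above:
nc -zv 172.18.4.49 46015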
For me, the problem was that the driver was inside a Docker container, which made it impossible for the workers to send data back to it. In other words, the workers and the driver were not in the same subnet. The solution, as given in this answer, was to set the following configurations (a submit-time sketch follows below):
spark.driver.host=<container's host IP accessible by the workers>
spark.driver.bindAddress=0.0.0.0
spark.driver.port=<forwarded port 1>
spark.driver.blockManager.port=<forwarded port 2>
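These can go into spark-defaults.conf or be passed at submit time; a sketch with placeholder values and a hypothetical application file:
spark-submit \
  --conf spark.driver.host=192.0.2.10 \
  --conf spark.driver.bindAddress=0.0.0.0 \
  --conf spark.driver.port=40000 \
  --conf spark.driver.blockManager.port=40001 \
  my_app.py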

Change Mesos Master Leader, cause Marathon shutdown?

Env:
Zookeeper on computer A,
Mesos master on computer B as Leader,
Mesos master on computer C,
Marathon on computer B singleton.
Action:
Kill the Mesos master process on computer B, to force a change of the Mesos cluster leader.
Result:
The Mesos cluster leader changes to the Mesos master on computer C,
but the Marathon task on computer B automatically shuts down with the following logs.
Question:
Can somebody help me understand why Marathon went down, and how to fix it?
Logs:
I1109 12:19:10.010197 11287 detector.cpp:152] Detected a new leader: (id='9')
I1109 12:19:10.010646 11291 group.cpp:699] Trying to get '/mesos/json.info_0000000009' in ZooKeeper
I1109 12:19:10.013425 11292 zookeeper.cpp:262] A new leading master (UPID=master#10.4.23.55:5050) is detected
[2017-11-09 12:19:10,015] WARN Disconnected (mesosphere.marathon.MarathonScheduler:Thread-23)
I1109 12:19:10.018977 11292 sched.cpp:2021] Asked to stop the driver
I1109 12:19:10.019161 11292 sched.cpp:336] New master detected at master#10.4.23.55:5050
I1109 12:19:10.019892 11292 sched.cpp:1203] Stopping framework d52cbd8c-1015-4d94-8328-e418876ca5b2-0000
[2017-11-09 12:19:10,020] INFO Driver future completed with result=Success(()). (mesosphere.marathon.MarathonSchedulerService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,022] INFO Abdicating leadership (mesosphere.marathon.core.election.impl.CuratorElectionService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,022] INFO Stopping the election service (mesosphere.marathon.core.election.impl.CuratorElectionService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,029] INFO backgroundOperationsLoop exiting (org.apache.curator.framework.imps.CuratorFrameworkImpl:Curator-Framework-0)
[2017-11-09 12:19:10,061] INFO Session: 0x15f710ffb010058 closed (org.apache.zookeeper.ZooKeeper:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,061] INFO EventThread shut down for session: 0x15f710ffb010058 (org.apache.zookeeper.ClientCnxn:pool-3-thread-1-EventThread)
[2017-11-09 12:19:10,063] INFO Stopping MarathonSchedulerService [RUNNING]'s leadership (mesosphere.marathon.core.election.impl.CuratorElectionService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,063] INFO Lost leadership (mesosphere.marathon.MarathonSchedulerService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,066] INFO All actors suspended:
* Actor[akka://marathon/user/offerMatcherStatistics#-1904211014]
* Actor[akka://marathon/user/reviveOffersWhenWanted#-238627718]
* Actor[akka://marathon/user/expungeOverdueLostTasks#608979053]
* Actor[akka://marathon/user/launchQueue#803590575]
* Actor[akka://marathon/user/offersWantedForReconciliation#598482724]
* Actor[akka://marathon/user/offerMatcherLaunchTokens#813230776]
* Actor[akka://marathon/user/offerMatcherManager#1205401692]
* Actor[akka://marathon/user/instanceTracker#1055980147]
* Actor[akka://marathon/user/killOverdueStagedTasks#-40058350]
* Actor[akka://marathon/user/taskKillServiceActor#-602552505]
* Actor[akka://marathon/user/rateLimiter#-911383474]
* Actor[akka://marathon/user/deploymentManager#2013376325] (mesosphere.marathon.core.leadership.impl.LeadershipCoordinatorActor:marathon-akka.actor.default-dispatcher-10)
I1109 12:19:10.069551 11272 sched.cpp:2021] Asked to stop the driver
[2017-11-09 12:19:10,068] INFO Stopping driver (mesosphere.marathon.MarathonSchedulerService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,069] INFO Stopped MarathonSchedulerService [RUNNING]'s leadership (mesosphere.marathon.core.election.impl.CuratorElectionService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,070] INFO Terminating due to leadership abdication or failure (mesosphere.marathon.core.election.impl.CuratorElectionService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,071] INFO Call postDriverRuns callbacks on (mesosphere.marathon.MarathonSchedulerService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,074] INFO Now standing by. Closing existing handles and rejecting new. (mesosphere.marathon.core.event.impl.stream.HttpEventStreamActor:marathon-akka.actor.default-dispatcher-12)
[2017-11-09 12:19:10,074] INFO Suspending scheduler actor (mesosphere.marathon.MarathonSchedulerActor:marathon-akka.actor.default-dispatcher-2)
[2017-11-09 12:19:10,083] INFO Finished postDriverRuns callbacks (mesosphere.marathon.MarathonSchedulerService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,084] INFO ExpungeOverdueLostTasksActor has stopped (mesosphere.marathon.core.task.jobs.impl.ExpungeOverdueLostTasksActor:marathon-akka.actor.default-dispatcher-9)
[1]+ Exit 137
I think there is a wrong configuration in your ZooKeeper cluster. Use a 3-node ZooKeeper ensemble and 2 Mesos masters with multiple slaves. Ref: https://www.google.co.in/amp/s/beingasysadmin.wordpress.com/2014/08/16/managing-ha-docker-cluster-using-multiple-mesos-masters/amp/
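A sketch of the ensemble section of zoo.cfg for such a layout, with placeholder hostnames (each server also needs a matching myid file):
server.1=zk-a:2888:3888
server.2=zk-b:2888:3888
server.3=zk-c:2888:3888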
Did you set the master reference in the Marathon conf? Can you run:
cat /etc/marathon/conf/master
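On a package-based install that file usually holds a single line pointing Marathon at the masters, typically the ZooKeeper connection string so it can follow leader changes; a sketch with a placeholder host for computer A:
zk://computer-a:2181/mesos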

ArangoDB DCOS: Coordinator dies when writing data

I have set up a DCOS cluster and installed the arangodb mesos framework (I haven't changed the initial config). I can access the web interface on port 8529 through the arangodb-proxy and I can create databases, collections, documents there.
Now I'm trying to import some data using the Java driver (3.1.4). After 2-3 calls the coordinator goes down. Mesos restarts it, but as long as I keep sending data it immediately dies again after a few requests (I also lose the connection on the web interface for a few seconds):
com.arangodb.ArangoException: org.apache.http.NoHttpResponseException: 172.16.100.99:8529 failed to respond
My insert is basically just a create statement:
arangoDriver.graphCreateVertex(GRAPH_NAME, VERTEX_COLLECTION,
getId(), this, true);
ArangoDB proxy also complains:
I0109 11:26:45.046947 113285 exec.cpp:161] Version: 1.0.1
I0109 11:26:45.051712 113291 exec.cpp:236] Executor registered on agent b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2
I0109 11:26:45.052942 113293 docker.cpp:815] Running docker -H unix:///var/run/docker.sock run --cpu-shares 1024 --memory 134217728 -e MARATHON_APP_VERSION=2017-01-09T10:26:29.819Z -e HOST=172.16.100.99 -e MARATHON_APP_RESOURCE_CPUS=1.0 -e MARATHON_APP_RESOURCE_GPUS=0 -e MARATHON_APP_DOCKER_IMAGE=arangodb/arangodb-mesos-haproxy -e MESOS_TASK_ID=arangodb-proxy.16604c72-d656-11e6-80d4-70b3d5800001 -e PORT=8529 -e MARATHON_APP_RESOURCE_MEM=128.0 -e PORTS=8529 -e MARATHON_APP_RESOURCE_DISK=0.0 -e PORT_80=8529 -e MARATHON_APP_LABELS= -e MARATHON_APP_ID=/arangodb-proxy -e PORT0=8529 -e LIBPROCESS_IP=172.16.100.99 -e MESOS_SANDBOX=/mnt/mesos/sandbox -e MESOS_CONTAINER_NAME=mesos-b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2.e0db1925-ff85-4454-bd7e-e0f46e502631 -v /var/lib/mesos/slave/slaves/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2/frameworks/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0000/executors/arangodb-proxy.16604c72-d656-11e6-80d4-70b3d5800001/runs/e0db1925-ff85-4454-bd7e-e0f46e502631:/mnt/mesos/sandbox --net bridge -p 8529:80/tcp --entrypoint /bin/sh --name mesos-b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2.e0db1925-ff85-4454-bd7e-e0f46e502631 arangodb/arangodb-mesos-haproxy -c nodejs /configurator.js arangodb3
{ [Error: connect ECONNREFUSED 172.16.100.98:1891]
code: 'ECONNREFUSED',
errno: 'ECONNREFUSED',
syscall: 'connect',
address: '172.16.100.98',
port: 1891 }
{ [Error: connect ECONNREFUSED 172.16.100.99:10413]
code: 'ECONNREFUSED',
errno: 'ECONNREFUSED',
syscall: 'connect',
address: '172.16.100.99',
port: 10413 }
I can see the failed task in the arangodb service completed task list, but the stderr log doesn't seem to say anything:
I0109 16:28:31.792980 126177 exec.cpp:161] Version: 1.0.1
I0109 16:28:31.797145 126182 exec.cpp:236] Executor registered on agent b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2
I0109 16:28:31.798338 126183 docker.cpp:815] Running docker -H unix:///var/run/docker.sock run --cpu-shares 1024 --memory 4294967296 -e CLUSTER_ROLE=coordinator -e CLUSTER_ID=Coordinator002 -e ADDITIONAL_ARGS= -e AGENCY_ENDPOINTS=tcp://172.16.100.97:1025 tcp://172.16.100.99:1025 tcp://172.16.100.98:1025 -e HOST=172.16.100.99 -e PORT0=1027 -e LIBPROCESS_IP=172.16.100.99 -e MESOS_SANDBOX=/mnt/mesos/sandbox -e MESOS_CONTAINER_NAME=mesos-b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2.61fb92b9-62e0-48b2-b2a3-3dc0b95f7818 -v /var/lib/mesos/slave/slaves/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2/frameworks/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049/executors/0c81c1f2-0404-4943-ab81-0abc78763140/runs/61fb92b9-62e0-48b2-b2a3-3dc0b95f7818:/mnt/mesos/sandbox --net bridge -p 1027:8529/tcp --name mesos-b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2.61fb92b9-62e0-48b2-b2a3-3dc0b95f7818 arangodb/arangodb-mesos:3.1
Mesos log indicates that the task failed:
I0109 16:55:44.821689 13431 master.cpp:5728] Sending 1 offers to framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (arangodb3) at scheduler-a39fa06a-627c-48bc-bd59-89e55d2ebaa2#172.16.100.99:2273
I0109 16:55:45.313108 13431 master.cpp:5466] Performing explicit task state reconciliation for 1 tasks of framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0003 (marathon-user) at scheduler-f4e239f5-3249-4b48-9bae-24c1e3d3152c#172.16.100.98:42099
I0109 16:55:45.560523 13428 master.cpp:3954] Processing DECLINE call for offers: [ b8bd75d7-b013-4a60-8644-94cf7d8e53bb-O141655 ] for framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (arangodb3) at scheduler-a39fa06a-627c-48bc-bd59-89e55d2ebaa2#172.16.100.99:2273
I0109 16:55:45.676347 13431 http.cpp:381] HTTP GET for /master/state-summary from 172.16.100.106:42540 with User-Agent='python-requests/2.10.0'
I0109 16:55:45.823482 13425 master.cpp:5728] Sending 1 offers to framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0001 (metronome) at scheduler-07a8a68b-8942-46e5-af63-d404785f3f39#172.16.100.107:44838
I0109 16:55:45.823698 13425 http.cpp:381] HTTP GET for /master/state-summary from 172.16.100.105:34374 with User-Agent='python-requests/2.10.0'
I0109 16:55:45.824986 13425 master.cpp:3954] Processing DECLINE call for offers: [ b8bd75d7-b013-4a60-8644-94cf7d8e53bb-O141656 ] for framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0001 (metronome) at scheduler-07a8a68b-8942-46e5-af63-d404785f3f39#172.16.100.107:44838
I0109 16:55:45.826448 13425 http.cpp:381] HTTP GET for /master/state-summary from 172.16.100.107:41364 with User-Agent='python-requests/2.10.0'
I0109 16:55:46.694202 13425 master.cpp:5140] Status update TASK_FAILED (UUID: 2abcbe87-e1d6-4968-965d-33429573dfd9) for task 32014e7f-7f5b-4fea-b757-cca0faa3deac of framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 from agent b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2 at slave(1)#172.16.100.99:5051 (172.16.100.99)
I0109 16:55:46.694247 13425 master.cpp:5202] Forwarding status update TASK_FAILED (UUID: 2abcbe87-e1d6-4968-965d-33429573dfd9) for task 32014e7f-7f5b-4fea-b757-cca0faa3deac of framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049
I0109 16:55:46.694344 13425 master.cpp:6844] Updating the state of task 32014e7f-7f5b-4fea-b757-cca0faa3deac of framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (latest state: TASK_FAILED, status update state: TASK_FAILED)
I0109 16:55:46.695953 13425 master.cpp:4265] Processing ACKNOWLEDGE call 2abcbe87-e1d6-4968-965d-33429573dfd9 for task 32014e7f-7f5b-4fea-b757-cca0faa3deac of framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (arangodb3) at scheduler-a39fa06a-627c-48bc-bd59-89e55d2ebaa2#172.16.100.99:2273 on agent b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2
I0109 16:55:46.695989 13425 master.cpp:6910] Removing task 32014e7f-7f5b-4fea-b757-cca0faa3deac with resources mem(*):4096; cpus(*):1; disk(*):1024; ports(*):[1027-1027] of framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 on agent b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2 at slave(1)#172.16.100.99:5051 (172.16.100.99)
I0109 16:55:46.824192 13430 master.cpp:5728] Sending 1 offers to framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (arangodb3) at scheduler-a39fa06a-627c-48bc-bd59-89e55d2ebaa2#172.16.100.99:2273
I0109 16:55:46.824347 13430 master.cpp:5728] Sending 1 offers to framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0001 (metronome) at scheduler-07a8a68b-8942-46e5-af63-d404785f3f39#172.16.100.107:44838
I0109 16:55:46.825814 13425 master.cpp:3954] Processing DECLINE call for offers: [ b8bd75d7-b013-4a60-8644-94cf7d8e53bb-O141658 ] for framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0001 (metronome) at scheduler-07a8a68b-8942-46e5-af63-d404785f3f39#172.16.100.107:44838
I0109 16:55:47.567651 13426 master.cpp:3954] Processing DECLINE call for offers: [ b8bd75d7-b013-4a60-8644-94cf7d8e53bb-O141657 ] for framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (arangodb3) at scheduler-a39fa06a-627c-48bc-bd59-89e55d2ebaa2#172.16.100.99:2273
I had the same system import tasks running without any issues on a single arangodb instance, so I assume the issue is not with my java code. But I can't find any further logs to indicate where the issue could be (but then I'm pretty new to Mesos).
My cluster:
4 Slaves (126GB RAM, 8 CPUs each)
3 Masters (32 GB RAM, 4 CPUs)
Can anyone tell me what I'm doing wrong or where I can find some more logging information?
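For reference, the sandbox mounted in the coordinator's "Running docker ..." line above can also be read directly on that agent; a sketch using the framework/executor/run IDs from that line (they change on every run):
cat /var/lib/mesos/slave/slaves/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2/frameworks/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049/executors/0c81c1f2-0404-4943-ab81-0abc78763140/runs/61fb92b9-62e0-48b2-b2a3-3dc0b95f7818/stderr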
Update: stdout only shows the startup log messages (looking at the timestamps, both entries in stderr (above) and stdout (below) are from when the task started, not when it failed):
--container="mesos-b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2.209e5908-e884-4c92-92b2-a27f1761ad6e" --docker="docker" --docker_socket="/var/run/docker.sock" --help="false" --initialize_driver_logging="true" --launcher_dir="/opt/mesosphere/packages/mesos--253f5cb0a96e2e3574293ddfecf5c63358527377/libexec/mesos" --logbufsecs="0" --logging_level="INFO" --mapped_directory="/mnt/mesos/sandbox" --quiet="false" --sandbox_directory="/var/lib/mesos/slave/slaves/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2/frameworks/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049/executors/1d0cf1e4-3267-4150-a465-19ecca21fa65/runs/209e5908-e884-4c92-92b2-a27f1761ad6e" --stop_timeout="20secs"
--container="mesos-b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2.209e5908-e884-4c92-92b2-a27f1761ad6e" --docker="docker" --docker_socket="/var/run/docker.sock" --help="false" --initialize_driver_logging="true" --launcher_dir="/opt/mesosphere/packages/mesos--253f5cb0a96e2e3574293ddfecf5c63358527377/libexec/mesos" --logbufsecs="0" --logging_level="INFO" --mapped_directory="/mnt/mesos/sandbox" --quiet="false" --sandbox_directory="/var/lib/mesos/slave/slaves/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2/frameworks/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049/executors/1d0cf1e4-3267-4150-a465-19ecca21fa65/runs/209e5908-e884-4c92-92b2-a27f1761ad6e" --stop_timeout="20secs"
Registered docker executor on 172.16.100.99
Starting task 1d0cf1e4-3267-4150-a465-19ecca21fa65
Initializing database...Hang on...
Database initialized...Starting System...
2017-01-10T08:04:30Z [1] INFO ArangoDB 3.1.7 [linux] 64bit, using VPack 0.1.30, ICU 54.1, V8 5.0.71.39, OpenSSL 1.0.1t 3 May 2016
2017-01-10T08:04:30Z [1] INFO using SSL options: SSL_OP_CIPHER_SERVER_PREFERENCE, SSL_OP_TLS_ROLLBACK_BUG
2017-01-10T08:04:30Z [1] WARNING {agencycomm} got an agency redirect from 'http+tcp://172.16.100.97:1025' to 'http+tcp://172.16.100.99:1025'
2017-01-10T08:04:31Z [1] WARNING {agencycomm} Retrying agency communication at 'http+tcp://172.16.100.99:1025', tries: 2
2017-01-10T08:04:31Z [1] INFO Waiting for DBservers to show up...
2017-01-10T08:04:31Z [1] INFO Found 3 DBservers.
2017-01-10T08:04:31Z [1] INFO file-descriptors (nofiles) hard limit is 1048576, soft limit is 1048576
2017-01-10T08:04:31Z [1] INFO JavaScript using startup '/usr/share/arangodb3/js', application '/var/lib/arangodb3-apps'
2017-01-10T08:04:32Z [1] INFO Cluster feature is turned on. Agency version: {"server":"arango","version":"3.1.7","license":"community"}, Agency endpoints: http+tcp://172.16.100.99:1025, http+tcp://172.16.100.97:1025, http+tcp://172.16.100.98:1025, server id: 'Coordinator002', internal address: tcp://172.16.100.99:1027, role: COORDINATOR
2017-01-10T08:04:32Z [1] INFO using heartbeat interval value '1000 ms' from agency
2017-01-10T08:04:32Z [1] INFO In database '_system': Database is up-to-date (30107/cluster-local/existing)
2017-01-10T08:04:32Z [1] INFO using endpoint 'http+tcp://0.0.0.0:8529' for non-encrypted requests
2017-01-10T08:04:33Z [1] INFO bootstraped coordinator Coordinator002
2017-01-10T08:04:33Z [1] INFO ArangoDB (version 3.1.7 [linux]) is ready for business. Have fun!
I've noted some output in stderr for the arangodb3 task but I'm not sure if this is just logging requests or part of the issue - this gets repeated every few seconds:
I0110 08:50:25.981262 23 HttpServer.cpp:456] handling http request 'GET /v1/endpoints.json'
I0110 08:50:26.000558 22 CaretakerCluster.cpp:470] And here the offer:
{"id":{"value":"b8bd75d7-b013-4a60-8644-94cf7d8e53bb-O173464"},"framework_id":{"value":"b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049"},"slave_id":{"value":"b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S0"},"hostname":"172.16.100.97","url":{"scheme":"http","address":{"hostname":"172.16.100.97","ip":"172.16.100.97","port":5051},"path":"/slave(1)","query":[]},"resources":[{"name":"ports","type":1,"ranges":{"range":[{"begin":1028,"end":2180},{"begin":2182,"end":3887},{"begin":3889,"end":5049},{"begin":5052,"end":8079},{"begin":8082,"end":8180},{"begin":8182,"end":8528},{"begin":8530,"end":32000}]},"role":"*"},{"name":"disk","type":0,"scalar":{"value":291730},"role":"*"},{"name":"cpus","type":0,"scalar":{"value":4.75},"role":"*"},{"name":"mem","type":0,"scalar":{"value":117367},"role":"*"}],"attributes":[],"executor_ids":[]}
Update 2: also, the online log in /mesos shows me this - does this mean the cluster isn't started properly?
I0110 09:55:30.857897 13427 http.cpp:381] HTTP GET for /master/state-summary from 172.16.100.107:60624 with User-Agent='python-requests/2.10.0'
I0110 09:55:31.111609 13426 master.cpp:3954] Processing DECLINE call for offers: [ b8bd75d7-b013-4a60-8644-94cf7d8e53bb-O173624 ] for framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (arangodb3) at scheduler-b97d05bc-d50c-49e6-8ef1-a9ff324fd2ec#172.16.100.98:9546
I0110 09:55:31.111747 13426 master.cpp:3954] Processing DECLINE call for offers: [ b8bd75d7-b013-4a60-8644-94cf7d8e53bb-O173623 ] for framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (arangodb3) at scheduler-b97d05bc-d50c-49e6-8ef1-a9ff324fd2ec#172.16.100.98:9546
I

Submit Job in Spark using Yarn Cluster

I am unable to submit the job in yarn-cluster mode. The job runs fine under the yarn-client option. When I submit it to yarn-cluster, only this log line appears, multiple times:
Application report for application_1421828570504_0002 (state: ACCEPTED)
and it eventually failed with the following exception:
diagnostics: Application application_1421828570504_0002 failed 10 times due to AM Container for app
attempt_1421828570504_0002_000010 exited with exitCode: 1 due to: Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException:
org.apache.hadoop.util.Shell$ExitCodeException:
You should have a look at the logs of your application:
> yarn logs --applicationId application_1421828570504_0002
This will yield some debug information of the actual run within the spark containers.
Since it is running locally but not on the cluster, my wild guess would be a missing SparkContext definition. Have a look at my answer to this question for a fix.
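For completeness, a cluster-mode submission for that Spark generation looks roughly like this (class and jar names are placeholders; newer Spark versions use --master yarn --deploy-mode cluster instead):
spark-submit --master yarn-cluster --class com.example.MyApp my-app.jar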

Hadoop 2.5.2 on Mesos 0.21.0 - Failed to fetch URIs for container

I'm trying to run a simple WordCount example on Mesos with Hadoop 2.5.2. I've successfully set up HDFS (I actually have a YARN setup behind this and it is working fine). The Mesos master is running and has 4 slaves connected to it. The Hadoop library for Mesos is 0.0.8.
The configuration for Hadoop 2.5.2 is (mapred-site.xml):
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>*.*.*.*:9001</value>
</property>
<property>
<name>mapred.job.tracker.http.address</name>
<value>*.*.*.*:50030</value>
</property>
<property>
<name>mapred.jobtracker.taskScheduler</name>
<value>org.apache.hadoop.mapred.MesosScheduler</value>
</property>
<property>
<name>mapred.mesos.taskScheduler</name>
<value>org.apache.hadoop.mapred.JobQueueTaskScheduler</value>
</property>
<property>
<name>mapred.mesos.master</name>
<value>*.*.*.*:5050</value>
</property>
<property>
<name>mapred.mesos.executor.uri</name>
<value>hdfs://*.*.*.*:9000/hadoop-2.5.0-cdh5.2.0.tgz</value>
</property>
</configuration>
I've got the following logs from all my slaves (example):
dbpc42: I1202 00:03:12.066195 11232 launcher.cpp:137] Forked child with pid '18714' for container 'c10c2d2b-bf4b-469b-97a2-60c9720773b4'
dbpc42: I1202 00:03:12.068272 11232 containerizer.cpp:571] Fetching URIs for container 'c10c2d2b-bf4b-469b-97a2-60c9720773b4' using command '/opt/mesos-0.21.0/build/src/mesos-fetcher'
dbpc42: I1202 00:03:12.140894 11226 containerizer.cpp:946] Destroying container 'c10c2d2b-bf4b-469b-97a2-60c9720773b4'
dbpc42: E1202 00:03:12.141315 11229 slave.cpp:2787] Container 'c10c2d2b-bf4b-469b-97a2-60c9720773b4' for executor 'executor_Task_Tracker_93' of framework '20141201-225046-698725789-5050-19765-0003' failed to start: Failed to fetch URIs for container 'c10c2d2b-bf4b-469b-97a2-60c9720773b4': exit status 256
dbpc42: I1202 00:03:12.242033 11231 containerizer.cpp:1117] Executor for container 'c10c2d2b-bf4b-469b-97a2-60c9720773b4' has exited
dbpc42: I1202 00:03:12.243896 11225 slave.cpp:2898] Executor 'executor_Task_Tracker_93' of framework 20141201-225046-698725789-5050-19765-0003 exited with status 1
The JobTracker is running fine; with the hadoop jar command the job gets stuck at map 0% reduce 0%. In the Mesos cluster information the TASKS_LOST counter keeps climbing until I kill the job. Mesos and the JobTracker run as root, the job runs as user hdfs.
What is this URI problem all about?
Thank you for your kind help or hint!
(I'll provide more information if needed.)
UPDATE
Starting a slave on the same PC where the master runs gets tasks into the STAGING status, 5 of them each time.
The mapred.mesos.executor.uri has been changed from the IP to dbpc41 (the master PC).
<property>
<name>mapred.mesos.executor.uri</name>
<value>hdfs://dbpc41:9000/hadoop-2.5.0-cdh5.2.0.tgz</value>
</property>
The other 4 slaves are still losing tasks, probably because they are unable to fetch the executor URI.
These are the logs from the 5th slave, running on the same PC as the master:
I1202 16:17:57.434345 1405 containerizer.cpp:571] Fetching URIs for container '5f33123b-00eb-4e05-9dcc-30f16f5eee44' using command '/opt/mesos-0.21.0/build/src/mesos-fetcher'
I1202 16:18:08.620708 1412 slave.cpp:2840] Monitoring executor 'executor_Task_Tracker_445' of framework '20141201-225046-698725789-5050-19765-0012' in container '5f33123b-00eb-4e05-9dcc-30f16f5eee44'
I1202 16:18:09.022902 1407 containerizer.cpp:1117] Executor for container '5f33123b-00eb-4e05-9dcc-30f16f5eee44' has exited
I1202 16:18:09.022964 1407 containerizer.cpp:946] Destroying container '5f33123b-00eb-4e05-9dcc-30f16f5eee44'
W1202 16:18:11.369912 1407 containerizer.cpp:888] Skipping resource statistic for container 5f33123b-00eb-4e05-9dcc-30f16f5eee44 because: Failed to get usage: No process found at 11093
W1202 16:18:11.369971 1407 containerizer.cpp:888] Skipping resource statistic for container 5f33123b-00eb-4e05-9dcc-30f16f5eee44 because: Failed to get usage: No process found at 11093
I1202 16:18:11.399648 1412 slave.cpp:2898] Executor 'executor_Task_Tracker_445' of framework 20141201-225046-698725789-5050-19765-0012 exited with status 1
I1202 16:18:11.401949 1412 slave.cpp:2215] Handling status update TASK_LOST (UUID: 959709c2-5546-41fd-9af3-09f024bb6354) for task Task_Tracker_445 of framework 20141201-225046-698725789-5050-19765-0012 from #0.0.0.0:0
W1202 16:18:11.402245 1409 containerizer.cpp:852] Ignoring update for unknown container: 5f33123b-00eb-4e05-9dcc-30f16f5eee44
I1202 16:18:11.403017 1410 status_update_manager.cpp:317] Received status update TASK_LOST (UUID: 959709c2-5546-41fd-9af3-09f024bb6354) for task Task_Tracker_445 of framework 20141201-225046-698725789-5050-19765-0012
I1202 16:18:11.403437 1406 slave.cpp:2458] Forwarding the update TASK_LOST (UUID: 959709c2-5546-41fd-9af3-09f024bb6354) for task Task_Tracker_445 of framework 20141201-225046-698725789-5050-19765-0012 to master#157.181.165.41:5050
I1202 16:18:11.448752 1409 status_update_manager.cpp:389] Received status update acknowledgement (UUID: 959709c2-5546-41fd-9af3-09f024bb6354) for task Task_Tracker_445 of framework 20141201-225046-698725789-5050-19765-0012
I1202 16:18:11.449354 1408 slave.cpp:3007] Cleaning up executor 'executor_Task_Tracker_445' of framework 20141201-225046-698725789-5050-19765-0012
I1202 16:18:11.449707 1405 gc.cpp:56] Scheduling '/tmp/mesos/slaves/20141201-225046-698725789-5050-19765-S4/frameworks/20141201-225046-698725789-5050-19765-0012/executors/executor_Task_Tracker_445/runs/5f33123b-00eb-4e05-9dcc-30f16f5eee44' for gc 6.99999479755852days in the future
I1202 16:18:11.450034 1409 gc.cpp:56] Scheduling '/tmp/mesos/slaves/20141201-225046-698725789-5050-19765-S4/frameworks/20141201-225046-698725789-5050-19765-0012/executors/executor_Task_Tracker_445' for gc 6.9999947929037days in the future
I1202 16:18:11.450147 1408 slave.cpp:3084] Cleaning up framework 20141201-225046-698725789-5050-19765-0012
I1202 16:18:11.450213 1406 status_update_manager.cpp:279] Closing status update streams for framework 20141201-225046-698725789-5050-19765-0012
I1202 16:18:11.450381 1412 gc.cpp:56] Scheduling '/tmp/mesos/slaves/20141201-225046-698725789-5050-19765-S4/frameworks/20141201-225046-698725789-5050-19765-0012' for gc 6.99999478812444days in the future
I1202 16:18:12.441505 1405 slave.cpp:1083] Got assigned task Task_Tracker_472 for framework 20141201-225046-698725789-5050-19765-0012
I1202 16:18:12.442337 1405 gc.cpp:84] Unscheduling '/tmp/mesos/slaves/20141201-225046-698725789-5050-19765-S4/frameworks/20141201-225046-698725789-5050-19765-0012' from gc
I1202 16:18:12.442617 1405 slave.cpp:1193] Launching task Task_Tracker_472 for framework 20141201-225046-698725789-5050-19765-0012
I1202 16:18:12.444263 1405 slave.cpp:3997] Launching executor executor_Task_Tracker_472 of framework 20141201-225046-698725789-5050-19765-0012 in work directory '/tmp/mesos/slaves/20141201-225046-698725789-5050-19765-S4/frameworks/20141201-225046-698725789-5050-19765-0012/executors/executor_Task_Tracker_472/runs/2310c642-02bf-401b-954c-876c88675c31'
I1202 16:18:12.444756 1405 slave.cpp:1316] Queuing task 'Task_Tracker_472' for executor executor_Task_Tracker_472 of framework '20141201-225046-698725789-5050-19765-0012
I1202 16:18:12.444793 1406 containerizer.cpp:424] Starting container '2310c642-02bf-401b-954c-876c88675c31' for executor 'executor_Task_Tracker_472' of framework '20141201-225046-698725789-5050-19765-0012'
I1202 16:18:12.447434 1406 launcher.cpp:137] Forked child with pid '11549' for container '2310c642-02bf-401b-954c-876c88675c31'
I1202 16:18:12.448652 1406 containerizer.cpp:571] Fetching URIs for container '2310c642-02bf-401b-954c-876c88675c31' using command '/opt/mesos-0.21.0/build/src/mesos-fetcher'
I checked the executor logs (stderr in /tmp/mesos/slaves/...) and found out that JAVA_HOME was not set, so the hadoop dfs command was not able to run to fetch the executor. The URI was fine; JAVA_HOME was not set. Additionally, I had to set HADOOP_HOME when starting the slaves.
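A sketch of that environment, with placeholder paths, exported before launching the slave (shown here via the build-tree wrapper script; adjust to however you start your slaves):
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk        # placeholder path
export HADOOP_HOME=/opt/hadoop-2.5.0-cdh5.2.0       # placeholder path
/opt/mesos-0.21.0/build/bin/mesos-slave.sh --master=dbpc41:5050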
Looks like the Mesos slave cannot fetch one of the URIs, likely the executor itself.
Did you upload your modified Hadoop on Mesos distribution (including the hadoop-mesos-0.0.8.jar) to hdfs://*.*.*.*:9000/hadoop-2.5.0-cdh5.2.0.tgz as specified by mapred.mesos.executor.uri? Is it accessible from the slave?
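One quick way to check both, run from one of the slave nodes (substitute the real namenode address for the masked one):
hadoop fs -ls hdfs://*.*.*.*:9000/hadoop-2.5.0-cdh5.2.0.tgz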

Resources