I have set up a DC/OS cluster and installed the ArangoDB Mesos framework (I haven't changed the initial config). I can access the web interface on port 8529 through the arangodb-proxy, and I can create databases, collections, and documents there.
Now I'm trying to import some data using the Java driver (3.1.4). After 2-3 calls the coordinator goes down. Mesos restarts it, but as long as I keep sending data it immediately dies again after a few requests (I also lose the connection to the web interface for a few seconds):
com.arangodb.ArangoException: org.apache.http.NoHttpResponseException: 172.16.100.99:8529 failed to respond
My insert is basically just a single vertex-create call:
arangoDriver.graphCreateVertex(GRAPH_NAME, VERTEX_COLLECTION,
getId(), this, true);
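To give the full picture, here is a trimmed-down, self-contained sketch of the import code. The endpoint, graph/collection names and the vertex class are placeholders rather than the real ones, and the driver setup is reduced to the basic ArangoConfigure/ArangoDriver calls of the pre-4.0 com.arangodb API, so treat it as illustrative:

import com.arangodb.ArangoConfigure;
import com.arangodb.ArangoDriver;
import com.arangodb.ArangoException;
import com.arangodb.ArangoHost;

public class VertexImport {
    // Placeholder names; the real ones come from my application config.
    private static final String GRAPH_NAME = "myGraph";
    private static final String VERTEX_COLLECTION = "myVertices";

    public static void main(String[] args) {
        // Point the driver at the arangodb-proxy endpoint (address is a placeholder).
        ArangoConfigure configure = new ArangoConfigure();
        configure.setArangoHost(new ArangoHost("172.16.100.99", 8529));
        configure.init();
        ArangoDriver arangoDriver = new ArangoDriver(configure);

        try {
            MyVertex vertex = new MyVertex("vertex-1");
            // The call shown above: create one vertex with waitForSync = true.
            arangoDriver.graphCreateVertex(GRAPH_NAME, VERTEX_COLLECTION,
                    vertex.getId(), vertex, true);
        } catch (ArangoException e) {
            // After 2-3 successful calls this is where the
            // NoHttpResponseException surfaces once the coordinator dies.
            e.printStackTrace();
        } finally {
            configure.shutdown();
        }
    }

    // Minimal placeholder vertex entity.
    public static class MyVertex {
        private String id;
        public MyVertex() { }
        public MyVertex(String id) { this.id = id; }
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
    }
}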
The ArangoDB proxy also complains:
I0109 11:26:45.046947 113285 exec.cpp:161] Version: 1.0.1
I0109 11:26:45.051712 113291 exec.cpp:236] Executor registered on agent b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2
I0109 11:26:45.052942 113293 docker.cpp:815] Running docker -H unix:///var/run/docker.sock run --cpu-shares 1024 --memory 134217728 -e MARATHON_APP_VERSION=2017-01-09T10:26:29.819Z -e HOST=172.16.100.99 -e MARATHON_APP_RESOURCE_CPUS=1.0 -e MARATHON_APP_RESOURCE_GPUS=0 -e MARATHON_APP_DOCKER_IMAGE=arangodb/arangodb-mesos-haproxy -e MESOS_TASK_ID=arangodb-proxy.16604c72-d656-11e6-80d4-70b3d5800001 -e PORT=8529 -e MARATHON_APP_RESOURCE_MEM=128.0 -e PORTS=8529 -e MARATHON_APP_RESOURCE_DISK=0.0 -e PORT_80=8529 -e MARATHON_APP_LABELS= -e MARATHON_APP_ID=/arangodb-proxy -e PORT0=8529 -e LIBPROCESS_IP=172.16.100.99 -e MESOS_SANDBOX=/mnt/mesos/sandbox -e MESOS_CONTAINER_NAME=mesos-b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2.e0db1925-ff85-4454-bd7e-e0f46e502631 -v /var/lib/mesos/slave/slaves/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2/frameworks/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0000/executors/arangodb-proxy.16604c72-d656-11e6-80d4-70b3d5800001/runs/e0db1925-ff85-4454-bd7e-e0f46e502631:/mnt/mesos/sandbox --net bridge -p 8529:80/tcp --entrypoint /bin/sh --name mesos-b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2.e0db1925-ff85-4454-bd7e-e0f46e502631 arangodb/arangodb-mesos-haproxy -c nodejs /configurator.js arangodb3
{ [Error: connect ECONNREFUSED 172.16.100.98:1891]
code: 'ECONNREFUSED',
errno: 'ECONNREFUSED',
syscall: 'connect',
address: '172.16.100.98',
port: 1891 }
{ [Error: connect ECONNREFUSED 172.16.100.99:10413]
code: 'ECONNREFUSED',
errno: 'ECONNREFUSED',
syscall: 'connect',
address: '172.16.100.99',
port: 10413 }
I can see the failed task in the ArangoDB service's completed-tasks list, but its stderr log doesn't seem to say anything:
I0109 16:28:31.792980 126177 exec.cpp:161] Version: 1.0.1
I0109 16:28:31.797145 126182 exec.cpp:236] Executor registered on agent b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2
I0109 16:28:31.798338 126183 docker.cpp:815] Running docker -H unix:///var/run/docker.sock run --cpu-shares 1024 --memory 4294967296 -e CLUSTER_ROLE=coordinator -e CLUSTER_ID=Coordinator002 -e ADDITIONAL_ARGS= -e AGENCY_ENDPOINTS=tcp://172.16.100.97:1025 tcp://172.16.100.99:1025 tcp://172.16.100.98:1025 -e HOST=172.16.100.99 -e PORT0=1027 -e LIBPROCESS_IP=172.16.100.99 -e MESOS_SANDBOX=/mnt/mesos/sandbox -e MESOS_CONTAINER_NAME=mesos-b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2.61fb92b9-62e0-48b2-b2a3-3dc0b95f7818 -v /var/lib/mesos/slave/slaves/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2/frameworks/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049/executors/0c81c1f2-0404-4943-ab81-0abc78763140/runs/61fb92b9-62e0-48b2-b2a3-3dc0b95f7818:/mnt/mesos/sandbox --net bridge -p 1027:8529/tcp --name mesos-b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2.61fb92b9-62e0-48b2-b2a3-3dc0b95f7818 arangodb/arangodb-mesos:3.1
The Mesos log indicates that the task failed:
I0109 16:55:44.821689 13431 master.cpp:5728] Sending 1 offers to framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (arangodb3) at scheduler-a39fa06a-627c-48bc-bd59-89e55d2ebaa2#172.16.100.99:2273
I0109 16:55:45.313108 13431 master.cpp:5466] Performing explicit task state reconciliation for 1 tasks of framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0003 (marathon-user) at scheduler-f4e239f5-3249-4b48-9bae-24c1e3d3152c#172.16.100.98:42099
I0109 16:55:45.560523 13428 master.cpp:3954] Processing DECLINE call for offers: [ b8bd75d7-b013-4a60-8644-94cf7d8e53bb-O141655 ] for framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (arangodb3) at scheduler-a39fa06a-627c-48bc-bd59-89e55d2ebaa2#172.16.100.99:2273
I0109 16:55:45.676347 13431 http.cpp:381] HTTP GET for /master/state-summary from 172.16.100.106:42540 with User-Agent='python-requests/2.10.0'
I0109 16:55:45.823482 13425 master.cpp:5728] Sending 1 offers to framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0001 (metronome) at scheduler-07a8a68b-8942-46e5-af63-d404785f3f39#172.16.100.107:44838
I0109 16:55:45.823698 13425 http.cpp:381] HTTP GET for /master/state-summary from 172.16.100.105:34374 with User-Agent='python-requests/2.10.0'
I0109 16:55:45.824986 13425 master.cpp:3954] Processing DECLINE call for offers: [ b8bd75d7-b013-4a60-8644-94cf7d8e53bb-O141656 ] for framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0001 (metronome) at scheduler-07a8a68b-8942-46e5-af63-d404785f3f39#172.16.100.107:44838
I0109 16:55:45.826448 13425 http.cpp:381] HTTP GET for /master/state-summary from 172.16.100.107:41364 with User-Agent='python-requests/2.10.0'
I0109 16:55:46.694202 13425 master.cpp:5140] Status update TASK_FAILED (UUID: 2abcbe87-e1d6-4968-965d-33429573dfd9) for task 32014e7f-7f5b-4fea-b757-cca0faa3deac of framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 from agent b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2 at slave(1)#172.16.100.99:5051 (172.16.100.99)
I0109 16:55:46.694247 13425 master.cpp:5202] Forwarding status update TASK_FAILED (UUID: 2abcbe87-e1d6-4968-965d-33429573dfd9) for task 32014e7f-7f5b-4fea-b757-cca0faa3deac of framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049
I0109 16:55:46.694344 13425 master.cpp:6844] Updating the state of task 32014e7f-7f5b-4fea-b757-cca0faa3deac of framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (latest state: TASK_FAILED, status update state: TASK_FAILED)
I0109 16:55:46.695953 13425 master.cpp:4265] Processing ACKNOWLEDGE call 2abcbe87-e1d6-4968-965d-33429573dfd9 for task 32014e7f-7f5b-4fea-b757-cca0faa3deac of framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (arangodb3) at scheduler-a39fa06a-627c-48bc-bd59-89e55d2ebaa2#172.16.100.99:2273 on agent b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2
I0109 16:55:46.695989 13425 master.cpp:6910] Removing task 32014e7f-7f5b-4fea-b757-cca0faa3deac with resources mem(*):4096; cpus(*):1; disk(*):1024; ports(*):[1027-1027] of framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 on agent b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2 at slave(1)#172.16.100.99:5051 (172.16.100.99)
I0109 16:55:46.824192 13430 master.cpp:5728] Sending 1 offers to framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (arangodb3) at scheduler-a39fa06a-627c-48bc-bd59-89e55d2ebaa2#172.16.100.99:2273
I0109 16:55:46.824347 13430 master.cpp:5728] Sending 1 offers to framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0001 (metronome) at scheduler-07a8a68b-8942-46e5-af63-d404785f3f39#172.16.100.107:44838
I0109 16:55:46.825814 13425 master.cpp:3954] Processing DECLINE call for offers: [ b8bd75d7-b013-4a60-8644-94cf7d8e53bb-O141658 ] for framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0001 (metronome) at scheduler-07a8a68b-8942-46e5-af63-d404785f3f39#172.16.100.107:44838
I0109 16:55:47.567651 13426 master.cpp:3954] Processing DECLINE call for offers: [ b8bd75d7-b013-4a60-8644-94cf7d8e53bb-O141657 ] for framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (arangodb3) at scheduler-a39fa06a-627c-48bc-bd59-89e55d2ebaa2#172.16.100.99:2273
I had the same import tasks running without any issues against a single ArangoDB instance, so I assume the issue is not in my Java code. But I can't find any further logs to indicate where the problem could be (then again, I'm pretty new to Mesos).
My cluster:
4 slaves (126 GB RAM, 8 CPUs each)
3 masters (32 GB RAM, 4 CPUs)
Can anyone tell me what I'm doing wrong or where I can find some more logging information?
Update: stdout only shows the startup log messages (looking at the timestamps, both the stderr entries (above) and the stdout entries (below) are from when the task was started, not from when it failed):
--container="mesos-b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2.209e5908-e884-4c92-92b2-a27f1761ad6e" --docker="docker" --docker_socket="/var/run/docker.sock" --help="false" --initialize_driver_logging="true" --launcher_dir="/opt/mesosphere/packages/mesos--253f5cb0a96e2e3574293ddfecf5c63358527377/libexec/mesos" --logbufsecs="0" --logging_level="INFO" --mapped_directory="/mnt/mesos/sandbox" --quiet="false" --sandbox_directory="/var/lib/mesos/slave/slaves/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2/frameworks/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049/executors/1d0cf1e4-3267-4150-a465-19ecca21fa65/runs/209e5908-e884-4c92-92b2-a27f1761ad6e" --stop_timeout="20secs"
--container="mesos-b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2.209e5908-e884-4c92-92b2-a27f1761ad6e" --docker="docker" --docker_socket="/var/run/docker.sock" --help="false" --initialize_driver_logging="true" --launcher_dir="/opt/mesosphere/packages/mesos--253f5cb0a96e2e3574293ddfecf5c63358527377/libexec/mesos" --logbufsecs="0" --logging_level="INFO" --mapped_directory="/mnt/mesos/sandbox" --quiet="false" --sandbox_directory="/var/lib/mesos/slave/slaves/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2/frameworks/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049/executors/1d0cf1e4-3267-4150-a465-19ecca21fa65/runs/209e5908-e884-4c92-92b2-a27f1761ad6e" --stop_timeout="20secs"
Registered docker executor on 172.16.100.99
Starting task 1d0cf1e4-3267-4150-a465-19ecca21fa65
Initializing database...Hang on...
Database initialized...Starting System...
2017-01-10T08:04:30Z [1] INFO ArangoDB 3.1.7 [linux] 64bit, using VPack 0.1.30, ICU 54.1, V8 5.0.71.39, OpenSSL 1.0.1t 3 May 2016
2017-01-10T08:04:30Z [1] INFO using SSL options: SSL_OP_CIPHER_SERVER_PREFERENCE, SSL_OP_TLS_ROLLBACK_BUG
2017-01-10T08:04:30Z [1] WARNING {agencycomm} got an agency redirect from 'http+tcp://172.16.100.97:1025' to 'http+tcp://172.16.100.99:1025'
2017-01-10T08:04:31Z [1] WARNING {agencycomm} Retrying agency communication at 'http+tcp://172.16.100.99:1025', tries: 2
2017-01-10T08:04:31Z [1] INFO Waiting for DBservers to show up...
2017-01-10T08:04:31Z [1] INFO Found 3 DBservers.
2017-01-10T08:04:31Z [1] INFO file-descriptors (nofiles) hard limit is 1048576, soft limit is 1048576
2017-01-10T08:04:31Z [1] INFO JavaScript using startup '/usr/share/arangodb3/js', application '/var/lib/arangodb3-apps'
2017-01-10T08:04:32Z [1] INFO Cluster feature is turned on. Agency version: {"server":"arango","version":"3.1.7","license":"community"}, Agency endpoints: http+tcp://172.16.100.99:1025, http+tcp://172.16.100.97:1025, http+tcp://172.16.100.98:1025, server id: 'Coordinator002', internal address: tcp://172.16.100.99:1027, role: COORDINATOR
2017-01-10T08:04:32Z [1] INFO using heartbeat interval value '1000 ms' from agency
2017-01-10T08:04:32Z [1] INFO In database '_system': Database is up-to-date (30107/cluster-local/existing)
2017-01-10T08:04:32Z [1] INFO using endpoint 'http+tcp://0.0.0.0:8529' for non-encrypted requests
2017-01-10T08:04:33Z [1] INFO bootstraped coordinator Coordinator002
2017-01-10T08:04:33Z [1] INFO ArangoDB (version 3.1.7 [linux]) is ready for business. Have fun!
I've noticed some output in stderr for the arangodb3 task, but I'm not sure whether this is just request logging or part of the issue; it gets repeated every few seconds:
I0110 08:50:25.981262 23 HttpServer.cpp:456] handling http request 'GET /v1/endpoints.json'
I0110 08:50:26.000558 22 CaretakerCluster.cpp:470] And here the offer:
{"id":{"value":"b8bd75d7-b013-4a60-8644-94cf7d8e53bb-O173464"},"framework_id":{"value":"b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049"},"slave_id":{"value":"b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S0"},"hostname":"172.16.100.97","url":{"scheme":"http","address":{"hostname":"172.16.100.97","ip":"172.16.100.97","port":5051},"path":"/slave(1)","query":[]},"resources":[{"name":"ports","type":1,"ranges":{"range":[{"begin":1028,"end":2180},{"begin":2182,"end":3887},{"begin":3889,"end":5049},{"begin":5052,"end":8079},{"begin":8082,"end":8180},{"begin":8182,"end":8528},{"begin":8530,"end":32000}]},"role":"*"},{"name":"disk","type":0,"scalar":{"value":291730},"role":"*"},{"name":"cpus","type":0,"scalar":{"value":4.75},"role":"*"},{"name":"mem","type":0,"scalar":{"value":117367},"role":"*"}],"attributes":[],"executor_ids":[]}
Update 2: also, the live log in /mesos shows me this; does this mean the cluster isn't started properly?
I0110 09:55:30.857897 13427 http.cpp:381] HTTP GET for /master/state-summary from 172.16.100.107:60624 with User-Agent='python-requests/2.10.0'
I0110 09:55:31.111609 13426 master.cpp:3954] Processing DECLINE call for offers: [ b8bd75d7-b013-4a60-8644-94cf7d8e53bb-O173624 ] for framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (arangodb3) at scheduler-b97d05bc-d50c-49e6-8ef1-a9ff324fd2ec#172.16.100.98:9546
I0110 09:55:31.111747 13426 master.cpp:3954] Processing DECLINE call for offers: [ b8bd75d7-b013-4a60-8644-94cf7d8e53bb-O173623 ] for framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (arangodb3) at scheduler-b97d05bc-d50c-49e6-8ef1-a9ff324fd2ec#172.16.100.98:9546
I
Related
Attempting to run H2O on an HDP 3.1 cluster and running into an error that appears to be about YARN resource capacity...
[ml1user@HW04 h2o-3.26.0.1-hdp3.1]$ hadoop jar h2odriver.jar -nodes 3 -mapperXmx 10g
Determining driver host interface for mapper->driver callback...
[Possible callback IP address: 192.168.122.1]
[Possible callback IP address: 172.18.4.49]
[Possible callback IP address: 127.0.0.1]
Using mapper->driver callback IP address and port: 172.18.4.49:46015
(You can override these with -driverif and -driverport/-driverportrange and/or specify external IP using -extdriverif.)
Memory Settings:
mapreduce.map.java.opts: -Xms10g -Xmx10g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Dlog4j.defaultInitOverride=true
Extra memory percent: 10
mapreduce.map.memory.mb: 11264
Hive driver not present, not generating token.
19/07/25 14:48:05 INFO client.RMProxy: Connecting to ResourceManager at hw01.ucera.local/172.18.4.46:8050
19/07/25 14:48:06 INFO client.AHSProxy: Connecting to Application History server at hw02.ucera.local/172.18.4.47:10200
19/07/25 14:48:07 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/ml1user/.staging/job_1564020515809_0006
19/07/25 14:48:08 INFO mapreduce.JobSubmitter: number of splits:3
19/07/25 14:48:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1564020515809_0006
19/07/25 14:48:08 INFO mapreduce.JobSubmitter: Executing with tokens: []
19/07/25 14:48:08 INFO conf.Configuration: found resource resource-types.xml at file:/etc/hadoop/3.1.0.0-78/0/resource-types.xml
19/07/25 14:48:08 INFO impl.YarnClientImpl: Submitted application application_1564020515809_0006
19/07/25 14:48:08 INFO mapreduce.Job: The url to track the job: http://HW01.ucera.local:8088/proxy/application_1564020515809_0006/
Job name 'H2O_47159' submitted
JobTracker job ID is 'job_1564020515809_0006'
For YARN users, logs command is 'yarn logs -applicationId application_1564020515809_0006'
Waiting for H2O cluster to come up...
ERROR: Timed out waiting for H2O cluster to come up (120 seconds)
ERROR: (Try specifying the -timeout option to increase the waiting time limit)
Attempting to clean up hadoop job...
19/07/25 14:50:19 INFO impl.YarnClientImpl: Killed application application_1564020515809_0006
Killed.
19/07/25 14:50:23 INFO client.RMProxy: Connecting to ResourceManager at hw01.ucera.local/172.18.4.46:8050
19/07/25 14:50:23 INFO client.AHSProxy: Connecting to Application History server at hw02.ucera.local/172.18.4.47:10200
----- YARN cluster metrics -----
Number of YARN worker nodes: 3
----- Nodes -----
Node: http://HW03.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
Node: http://HW04.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
Node: http://HW02.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
----- Queues -----
Queue name: default
Queue state: RUNNING
Current capacity: 0.00
Capacity: 1.00
Maximum capacity: 1.00
Application count: 0
Queue 'default' approximate utilization: 0.0 / 45.0 GB used, 0 / 9 vcores used
----------------------------------------------------------------------
ERROR: Unable to start any H2O nodes; please contact your YARN administrator.
A common cause for this is the requested container size (11.0 GB)
exceeds the following YARN settings:
yarn.nodemanager.resource.memory-mb
yarn.scheduler.maximum-allocation-mb
----------------------------------------------------------------------
For YARN users, logs command is 'yarn logs -applicationId application_1564020515809_0006'
Looking at the YARN configs in the Ambari UI, these properties are nowhere to be found. But checking the YARN logs in the YARN ResourceManager UI and looking at some of the logs for the killed application, I see what appear to be unreachable-host errors...
Container: container_e05_1564020515809_0006_02_000002 on HW03.ucera.local_45454_1564102219781
LogAggregationType: AGGREGATED
=============================================================================================
LogType:stderr
LogLastModifiedTime:Thu Jul 25 14:50:19 -1000 2019
LogLength:2203
LogContents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/yarn/local/filecache/11/mapreduce.tar.gz/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/yarn/local/usercache/ml1user/appcache/application_1564020515809_0006/filecache/10/job.jar/job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapred.YarnChild).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
java.net.NoRouteToHostException: No route to host (Host unreachable)
at java.net.PlainSocketImpl.socketConnect(Native Method)
....
at java.net.Socket.<init>(Socket.java:211)
at water.hadoop.EmbeddedH2OConfig$BackgroundWriterThread.run(EmbeddedH2OConfig.java:38)
End of LogType:stderr
***********************************************************************
Taking note of "java.net.NoRouteToHostException: No route to host (Host unreachable)". However, I can access all the other nodes from each other and they can all ping each other, so I'm not sure what is going on here. Any suggestions for debugging or fixing this?
I think I found the problem. TL;DR: firewalld (the nodes run CentOS 7) was still running, when it should be disabled on HDP clusters.
From another community post:
For Ambari to communicate during setup with the hosts it deploys to and manages, certain ports must be open and available. The easiest way to do this is to temporarily disable iptables, as follows:
systemctl disable firewalld
service firewalld stop
So apparently iptables and firewalld need to be disabled across the cluster (supporting docs can be found here; I had only disabled them on the Ambari installation node). After stopping these services across the cluster (I recommend using clush), I was able to run the YARN job without incident.
Normally, this problem is due to bad DNS configuration, firewalls, or network unreachability. To quote this official doc:
The hostname of the remote machine is wrong in the configuration files
The client's host table /etc/hosts has an invalid IPAddress for the target host.
The DNS server's host table has an invalid IPAddress for the target host.
The client's routing tables (In Linux, iptables) are wrong.
The DHCP server is publishing bad routing information.
Client and server are on different subnets, and are not set up to talk to each other. This may be an accident, or it is to deliberately lock down the Hadoop cluster.
The machines are trying to communicate using IPv6. Hadoop does not currently support IPv6
The host's IP address has changed but a long-lived JVM is caching the old value. This is a known problem with JVMs (search for "java negative DNS caching" for the details and solutions). The quick solution: restart the JVMs
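As an aside on that last quoted point, the JVM-level DNS caching it mentions is controlled by the standard networkaddress security properties; a small illustrative snippet (the TTL values are arbitrary examples, not recommendations):

import java.security.Security;

public class DnsCacheSettings {
    public static void main(String[] args) {
        // These must be set early, before the JVM performs (and caches) lookups.
        // TTLs are in seconds; "0" disables caching for that case.
        Security.setProperty("networkaddress.cache.ttl", "60");
        Security.setProperty("networkaddress.cache.negative.ttl", "0");

        System.out.println("positive TTL: " + Security.getProperty("networkaddress.cache.ttl"));
        System.out.println("negative TTL: " + Security.getProperty("networkaddress.cache.negative.ttl"));
    }
}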
For me, the problem was that the driver was inside a Docker container, which made it impossible for the workers to send data back to it. In other words, the workers and the driver were not on the same subnet. The solution, as given in this answer, was to set the following configuration properties:
spark.driver.host=<container's host IP accessible by the workers>
spark.driver.bindAddress=0.0.0.0
spark.driver.port=<forwarded port 1>
spark.driver.blockManager.port=<forwarded port 2>
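For reference, a minimal sketch of setting those same properties programmatically on a SparkConf; the host and port values are placeholders, and the master is assumed to be supplied by spark-submit:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class DockerizedDriverConfig {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("driver-behind-docker")
                // IP of the Docker host as seen by the workers (placeholder).
                .set("spark.driver.host", "192.0.2.10")
                // Bind on all interfaces inside the container.
                .set("spark.driver.bindAddress", "0.0.0.0")
                // Ports that are actually forwarded out of the container (placeholders).
                .set("spark.driver.port", "7078")
                .set("spark.driver.blockManager.port", "7079");

        // The master (e.g. yarn) is expected to come from spark-submit.
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... job code ...
        sc.stop();
    }
}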
Env:
Zookeeper on computer A,
Mesos master on computer B as Leader,
Mesos master on computer C,
Marathon on computer B singleton.
Action:
Kill the Mesos master task on computer B to attempt a Mesos cluster leader change.
Result:
The Mesos cluster leader changes to the Mesos master on computer C,
but the Marathon task on computer B automatically shuts down with the following logs.
Question:
Can somebody help me understand why Marathon went down, and how to fix it?
Logs:
I1109 12:19:10.010197 11287 detector.cpp:152] Detected a new leader: (id='9')
I1109 12:19:10.010646 11291 group.cpp:699] Trying to get '/mesos/json.info_0000000009' in ZooKeeper
I1109 12:19:10.013425 11292 zookeeper.cpp:262] A new leading master (UPID=master@10.4.23.55:5050) is detected
[2017-11-09 12:19:10,015] WARN Disconnected (mesosphere.marathon.MarathonScheduler:Thread-23)
I1109 12:19:10.018977 11292 sched.cpp:2021] Asked to stop the driver
I1109 12:19:10.019161 11292 sched.cpp:336] New master detected at master@10.4.23.55:5050
I1109 12:19:10.019892 11292 sched.cpp:1203] Stopping framework d52cbd8c-1015-4d94-8328-e418876ca5b2-0000
[2017-11-09 12:19:10,020] INFO Driver future completed with result=Success(()). (mesosphere.marathon.MarathonSchedulerService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,022] INFO Abdicating leadership (mesosphere.marathon.core.election.impl.CuratorElectionService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,022] INFO Stopping the election service (mesosphere.marathon.core.election.impl.CuratorElectionService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,029] INFO backgroundOperationsLoop exiting (org.apache.curator.framework.imps.CuratorFrameworkImpl:Curator-Framework-0)
[2017-11-09 12:19:10,061] INFO Session: 0x15f710ffb010058 closed (org.apache.zookeeper.ZooKeeper:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,061] INFO EventThread shut down for session: 0x15f710ffb010058 (org.apache.zookeeper.ClientCnxn:pool-3-thread-1-EventThread)
[2017-11-09 12:19:10,063] INFO Stopping MarathonSchedulerService [RUNNING]'s leadership (mesosphere.marathon.core.election.impl.CuratorElectionService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,063] INFO Lost leadership (mesosphere.marathon.MarathonSchedulerService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,066] INFO All actors suspended:
* Actor[akka://marathon/user/offerMatcherStatistics#-1904211014]
* Actor[akka://marathon/user/reviveOffersWhenWanted#-238627718]
* Actor[akka://marathon/user/expungeOverdueLostTasks#608979053]
* Actor[akka://marathon/user/launchQueue#803590575]
* Actor[akka://marathon/user/offersWantedForReconciliation#598482724]
* Actor[akka://marathon/user/offerMatcherLaunchTokens#813230776]
* Actor[akka://marathon/user/offerMatcherManager#1205401692]
* Actor[akka://marathon/user/instanceTracker#1055980147]
* Actor[akka://marathon/user/killOverdueStagedTasks#-40058350]
* Actor[akka://marathon/user/taskKillServiceActor#-602552505]
* Actor[akka://marathon/user/rateLimiter#-911383474]
* Actor[akka://marathon/user/deploymentManager#2013376325] (mesosphere.marathon.core.leadership.impl.LeadershipCoordinatorActor:marathon-akka.actor.default-dispatcher-10)
I1109 12:19:10.069551 11272 sched.cpp:2021] Asked to stop the driver
[2017-11-09 12:19:10,068] INFO Stopping driver (mesosphere.marathon.MarathonSchedulerService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,069] INFO Stopped MarathonSchedulerService [RUNNING]'s leadership (mesosphere.marathon.core.election.impl.CuratorElectionService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,070] INFO Terminating due to leadership abdication or failure (mesosphere.marathon.core.election.impl.CuratorElectionService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,071] INFO Call postDriverRuns callbacks on (mesosphere.marathon.MarathonSchedulerService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,074] INFO Now standing by. Closing existing handles and rejecting new. (mesosphere.marathon.core.event.impl.stream.HttpEventStreamActor:marathon-akka.actor.default-dispatcher-12)
[2017-11-09 12:19:10,074] INFO Suspending scheduler actor (mesosphere.marathon.MarathonSchedulerActor:marathon-akka.actor.default-dispatcher-2)
[2017-11-09 12:19:10,083] INFO Finished postDriverRuns callbacks (mesosphere.marathon.MarathonSchedulerService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,084] INFO ExpungeOverdueLostTasksActor has stopped (mesosphere.marathon.core.task.jobs.impl.ExpungeOverdueLostTasksActor:marathon-akka.actor.default-dispatcher-9)
[1]+ Exit 137
I think there is a wrong configuration in the ZooKeeper cluster. Use a 3-node ZooKeeper cluster and 2 Mesos masters with multiple slaves. Ref: https://www.google.co.in/amp/s/beingasysadmin.wordpress.com/2014/08/16/managing-ha-docker-cluster-using-multiple-mesos-masters/amp/
Did you set the masters reference in the Marathon conf? Can you run:
cat /etc/marathon/conf/master
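For reference, that file normally just contains the value passed to Marathon's --master flag, i.e. the ZooKeeper URL of the Mesos masters; a hypothetical example with placeholder hostnames:
zk://zk1:2181,zk2:2181,zk3:2181/mesos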
I have a cluster of 1 master node and 2 slaves, and I'm trying to compile my application with Mesos.
Basically, here is the command that I use:
mesos-execute --name=alc1 --command="ccmake -j myapp" --master=10.11.12.13:5050
Offers are made from the slave but this compilation task keeps failing.
[root#master-node ~]# mesos-execute --name=alc1 --command="ccmake -j myapp" --master=10.11.12.13:5050
I0511 22:26:11.623016 11560 sched.cpp:222] Version: 0.28.0
I0511 22:26:11.625602 11564 sched.cpp:326] New master detected at master@10.11.12.13:5050
I0511 22:26:11.625952 11564 sched.cpp:336] No credentials provided. Attempting to register without authentication
I0511 22:26:11.627279 11564 sched.cpp:703] Framework registered with 70582e35-5d6e-4915-a919-cae61c904fd9-0139
Framework registered with 70582e35-5d6e-4915-a919-cae61c904fd9-0139
task alc1 submitted to slave 70582e35-5d6e-4915-a919-cae61c904fd9-S2
Received status update TASK_RUNNING for task alc1
Received status update TASK_FAILED for task alc1
I0511 22:26:11.759610 11567 sched.cpp:1903] Asked to stop the driver
I0511 22:26:11.759639 11567 sched.cpp:1143] Stopping framework '70582e35-5d6e-4915-a919-cae61c904fd9-0139'
In the sandbox on the slave node, here are the stderr logs:
I0511 22:26:13.781070 5037 exec.cpp:143] Version: 0.28.0
I0511 22:26:13.785001 5040 exec.cpp:217] Executor registered on slave 70582e35-5d6e-4915-a919-cae61c904fd9-S2
sh: ccmake: command not found
I0511 22:26:13.892653 5042 exec.cpp:390] Executor asked to shutdown
Just to mention that commands like this work fine and give me the expected results:
[root#master-node ~]# mesos-execute --name=alc1 --command="find / -name a" --master=10.11.12.13:5050
I0511 22:26:03.733172 11550 sched.cpp:222] Version: 0.28.0
I0511 22:26:03.736112 11554 sched.cpp:326] New master detected at master@10.11.12.13:5050
I0511 22:26:03.736383 11554 sched.cpp:336] No credentials provided. Attempting to register without authentication
I0511 22:26:03.737730 11554 sched.cpp:703] Framework registered with 70582e35-5d6e-4915-a919-cae61c904fd9-0138
Framework registered with 70582e35-5d6e-4915-a919-cae61c904fd9-0138
task alc1 submitted to slave 70582e35-5d6e-4915-a919-cae61c904fd9-S2
Received status update TASK_RUNNING for task alc1
Received status update TASK_FINISHED for task alc1
I0511 22:26:04.184813 11553 sched.cpp:1903] Asked to stop the driver
I0511 22:26:04.184844 11553 sched.cpp:1143] Stopping framework '70582e35-5d6e-4915-a919-cae61c904fd9-0138'
I don't really get what is needed to even troubleshoot this issue.
I'm trying to run Nutch 2.3.1 with Cassandra. I followed the steps on http://wiki.apache.org/nutch/Nutch2Cassandra . Finally, when I try to start Nutch with the command:
bin/crawl urls/ test http://localhost:8983/solr/ 2
I got the following exception:
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: java.lang.RuntimeException: job failed: name=[test]generate: 1454483370-31180, jobid=job_local1380148534_0001
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:227)
at org.apache.nutch.crawl.GeneratorJob.generate(GeneratorJob.java:256)
at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:322)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.GeneratorJob.main(GeneratorJob.java:330)
Error running:
/home/user/apache-nutch-2.3.1/runtime/local/bin/nutch generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0 - crawlId webmd -batchId 1454483370-31180
Failed with exit value 255.
When I check logs/hadoop.log, here's the error message:
2016-02-03 15:18:14,741 ERROR connection.HConnectionManager - Could not start connection pool for host localhost(127.0.0.1):9160
...
2016-02-03 15:18:15,185 ERROR store.CassandraStore - All host pools marked down. Retry burden pushed out to client.
me.prettyprint.hector.api.exceptions.HectorException: All host pools marked down. Retry burden pushed out to client.
at me.prettyprint.cassandra.connection.HConnectionManager.getClientFromLBPolicy(HConnectionManager.java:390)
But my Cassandra server is up:
runtime/local$ netstat -l |grep 9160
tcp 0 0 172.16.230.130:9160 *:* LISTEN
Can anyone help with this issue? Thanks.
The address of Cassandra is not localhost, it's 172.16.230.130. That is why Nutch cannot connect to the Cassandra store.
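If Cassandra really does listen only on 172.16.230.130, one place to point Nutch at it is the Gora Cassandra server list; a hedged example of the entry in runtime/local/conf/gora.properties (verify the property name against your Gora version; the value is just this cluster's address and Thrift port):
gora.cassandrastore.servers=172.16.230.130:9160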
Hope this helps,
Le Quoc Do
I have a remote Spark-on-YARN cluster. If I use RStudio Server (the web version) hosted on that cluster to connect in client mode, I can do the following:
sc <- SparkR::sparkR.init(master = "yarn-client")
However, if I try to use RStudio on my local machine to connect to that Spark cluster in the same way, I get errors:
ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master
...
ERROR Utils: Uncaught exception in thread nioEventLoopGroup-2-2
java.lang.NullPointerException
...
ERROR RBackendHandler: createSparkContext on org.apache.spark.api.r.RRDD failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
A more detailed error message on the Hadoop application tracking page looks like this:
User: blueivy
Name: SparkR
Application Type: SPARK
Application Tags:
State: FAILED
FinalStatus: FAILED
Started: 27-Oct-2015 11:07:09
Elapsed: 4mins, 39sec
Tracking URL: History
Diagnostics:
Application application_1445628650748_0027 failed 2 times due to AM Container for appattempt_1445628650748_0027_000002 exited with exitCode: 10
For more detailed output, check application tracking page:http://master:8088/proxy/application_1445628650748_0027/Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1445628650748_0027_02_000001
Exit code: 10
Stack trace: ExitCodeException exitCode=10:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:267)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1143)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:618)
at java.lang.Thread.run(Thread.java:785)
Container exited with a non-zero exit code 10
Failing this attempt. Failing the application.
I have the same configuration and environment for Hadoop and Spark as the remote cluster: Spark 1.5.1, Hadoop 2.6.0 and Ubuntu 14.04. Can anyone help me find my mistake here?