I set up a small test Mesosphere (DC/OS) cluster according to this guide: https://dcos.io/docs/1.8/administration/installing/custom/cli/
Everything went smoothly. There are only 3 nodes in the cluster: one bootstrap node, one master (10.7.1.12), and one agent (10.7.1.13).
But after rebooting the machine with the agent node, it is no longer visible to the master node.
In /var/log/mesos/mesos-agent.log the last entry has a timestamp from before the reboot.
I tried all the steps from https://dcos.io/docs/1.8/administration/installing/custom/troubleshooting/ but nothing changed.
Here are the logs from the master after the agent disconnected (sudo journalctl -u dcos-mesos-master):
lut 06 15:48:14 arangodb2.test1.fgtsa.com mesos-master[2661]: I0206 15:48:14.556001 2671 master.cpp:1245] Agent d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-S0 at slave(1)#10.7.1.13:5051 (10.7.1.13) disconnected
lut 06 15:48:14 arangodb2.test1.fgtsa.com mesos-master[2661]: I0206 15:48:14.556089 2671 master.cpp:2784] Disconnecting agent d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-S0 at slave(1)#10.7.1.13:5051 (10.7.1.13)
lut 06 15:48:14 arangodb2.test1.fgtsa.com mesos-master[2661]: I0206 15:48:14.556170 2671 master.cpp:2803] Deactivating agent d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-S0 at slave(1)#10.7.1.13:5051 (10.7.1.13)
lut 06 15:53:16 arangodb2.test1.fgtsa.com mesos-master[2661]: W0206 15:53:16.926198 2670 master.cpp:5334] Shutting down agent d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-S0 at slave(1)#10.7.1.13:5051 (10.7.1.13) with message 'health check timed out'
lut 06 15:53:16 arangodb2.test1.fgtsa.com mesos-master[2661]: I0206 15:53:16.926230 2670 master.cpp:6617] Removing agent d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-S0 at slave(1)#10.7.1.13:5051 (10.7.1.13): health check timed out
lut 06 15:53:16 arangodb2.test1.fgtsa.com mesos-master[2661]: I0206 15:53:16.926507 2670 master.cpp:6910] Removing task 93f4b075-1338-4a84-afd6-6932cfe44c30 with resources mem(arangodb31, arangodb3):2048; cpus(arangodb31, arangodb3):0.25; disk(arangodb31, arangodb3)[AGENCY_991972e5-2d83-4710-ba3c-de8cf02303ab:myPersistentVolume]:2048; ports(arangodb31, arangodb3):[1026-1026] of framework d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-0004 on agent d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-S0 at slave(1)#10.7.1.13:5051 (10.7.1.13)
lut 06 15:53:16 arangodb2.test1.fgtsa.com mesos-master[2661]: I0206 15:53:16.926695 2670 master.cpp:6910] Removing task 644b59eb-fb20-43fd-a7c1-b1d9406cbfcb with resources mem(arangodb3, arangodb3):2048; cpus(arangodb3, arangodb3):0.25; disk(arangodb3, arangodb3)[AGENCY_0c76702f-ae8b-423c-83a8-1b6e2af8b723:myPersistentVolume]:2048; ports(arangodb3, arangodb3):[1025-1025] of framework d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-0002 on agent d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-S0 at slave(1)#10.7.1.13:5051 (10.7.1.13)
lut 06 15:53:16 arangodb2.test1.fgtsa.com mesos-master[2661]: I0206 15:53:16.928460 2670 master.cpp:6736] Removed agent d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-S0 (10.7.1.13): health check timed out
lut 06 15:53:16 arangodb2.test1.fgtsa.com mesos-master[2661]: I0206 15:53:16.928472 2670 master.cpp:5197] Sending status update TASK_LOST for task 93f4b075-1338-4a84-afd6-6932cfe44c30 of framework d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-0004 'Slave 10.7.1.13 removed: health check timed out'
lut 06 15:53:16 arangodb2.test1.fgtsa.com mesos-master[2661]: W0206 15:53:16.928486 2670 master.hpp:2113] Master attempted to send message to disconnected framework d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-0004 (arangodb3-1) at scheduler-f4f3a3f0-2261-4aaf-9390-81f4b1cc6d20#10.7.1.13:25366
lut 06 15:53:16 arangodb2.test1.fgtsa.com mesos-master[2661]: I0206 15:53:16.928611 2670 master.cpp:5197] Sending status update TASK_LOST for task 644b59eb-fb20-43fd-a7c1-b1d9406cbfcb of framework d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-0002 'Slave 10.7.1.13 removed: health check timed out'
lut 06 15:53:16 arangodb2.test1.fgtsa.com mesos-master[2661]: W0206 15:53:16.928638 2670 master.hpp:2113] Master attempted to send message to disconnected framework d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-0002 (arangodb3) at scheduler-7870582e-becd-4747-aeba-0217e91d537e#10.7.1.13:19866
lut 06 15:53:16 arangodb2.test1.fgtsa.com mesos-master[2661]: I0206 15:53:16.928747 2670 master.cpp:6759] Notifying framework d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-0003 (arangodb3-standalone) at scheduler-180f6695-f3c9-4da6-80e8-d1dc633ec737#10.7.1.13:3583 of lost agent d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-S0 (10.7.1.13) after recovering
lut 06 15:53:16 arangodb2.test1.fgtsa.com mesos-master[2661]: W0206 15:53:16.928761 2670 master.hpp:2113] Master attempted to send message to disconnected framework d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-0003 (arangodb3-standalone) at scheduler-180f6695-f3c9-4da6-80e8-d1dc633ec737#10.7.1.13:3583
lut 06 15:53:16 arangodb2.test1.fgtsa.com mesos-master[2661]: I0206 15:53:16.928894 2670 master.cpp:6759] Notifying framework d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-0002 (arangodb3) at scheduler-7870582e-becd-4747-aeba-0217e91d537e#10.7.1.13:19866 of lost agent d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-S0 (10.7.1.13) after recovering
lut 06 15:53:16 arangodb2.test1.fgtsa.com mesos-master[2661]: W0206 15:53:16.928905 2670 master.hpp:2113] Master attempted to send message to disconnected framework d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-0002 (arangodb3) at scheduler-7870582e-becd-4747-aeba-0217e91d537e#10.7.1.13:19866
lut 06 15:53:16 arangodb2.test1.fgtsa.com mesos-master[2661]: I0206 15:53:16.928921 2670 master.cpp:6759] Notifying framework d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-0004 (arangodb3-1) at scheduler-f4f3a3f0-2261-4aaf-9390-81f4b1cc6d20#10.7.1.13:25366 of lost agent d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-S0 (10.7.1.13) after recovering
lut 06 15:53:16 arangodb2.test1.fgtsa.com mesos-master[2661]: W0206 15:53:16.928928 2670 master.hpp:2113] Master attempted to send message to disconnected framework d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-0004 (arangodb3-1) at scheduler-f4f3a3f0-2261-4aaf-9390-81f4b1cc6d20#10.7.1.13:25366
lut 06 15:53:16 arangodb2.test1.fgtsa.com mesos-master[2661]: I0206 15:53:16.928941 2670 master.cpp:6759] Notifying framework d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-0001 (metronome) at scheduler-4eb937a4-9a64-4a47-9245-3858defe691a#10.7.1.12:41077 of lost agent d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-S0 (10.7.1.13) after recovering
lut 06 15:53:16 arangodb2.test1.fgtsa.com mesos-master[2661]: I0206 15:53:16.928963 2670 master.cpp:6759] Notifying framework d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-0000 (marathon) at scheduler-02bf4e29-4dd7-4cf8-b14b-4a064b4d082c#10.7.1.12:43643 of lost agent d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-S0 (10.7.1.13) after recovering
The rest of the journals (journalctl ...) are empty.
There is also an error in the ZooKeeper logs.
I would be grateful for any suggestions on how to investigate this further.
EDIT:
I managed to start the agent node manually by starting the dcos-mesos-slave service (before that I had to start the dcos-spartan and dcos-gen-resolvconf services). Any ideas why it didn't start automatically?
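For reference, the manual start sequence was roughly the following (a sketch using the unit names mentioned above, run on the agent node; the exact ordering on other clusters may differ):

sudo systemctl start dcos-spartan
sudo systemctl start dcos-gen-resolvconf
sudo systemctl start dcos-mesos-slave

# confirm the agent came up and re-registered with the master
sudo journalctl -u dcos-mesos-slave -n 20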
Any ideas why it didn't start automatically?
According to the rules for using systemd reliably, DC/OS systemd units do not depend on each other, so you need to start everything manually:
Requires=, Wants= are not allowed. If something that is depended upon fails, the thing depending on it will never try to be started again.
Before=, After= are discouraged. They are not strong guarantees; software needs to check that its prerequisites are up and working correctly.
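To check whether this explains why the agent unit did not come back after the reboot, you can inspect how the unit is wired on the agent node (a read-only sketch using standard systemd tooling; it assumes the unit name dcos-mesos-slave from above):

systemctl is-enabled dcos-mesos-slave          # was it supposed to start at boot at all?
systemctl cat dcos-mesos-slave                 # show the unit file, including any Before=/After= lines
systemctl list-dependencies dcos-mesos-slave   # what, if anything, it is ordered against
journalctl -b -u dcos-mesos-slave              # what happened to it on the current boot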
Related
I just spent some time setting up my node following the first tutorial. For some reason, when launching the node, it throws this error:
Dec 04 11:09:40.546 WARN Running in --dev mode, RPC CORS has been disabled.
Dec 04 11:09:40.546 INFO Substrate Node
Dec 04 11:09:40.546 INFO ✌️ version 2.0.0-655bfdc-x86_64-linux-gnu
Dec 04 11:09:40.546 INFO ❤️ by Substrate DevHub <https://github.com/substrate-developer-hub>, 2017-2020
Dec 04 11:09:40.546 INFO 📋 Chain specification: Development
Dec 04 11:09:40.546 INFO 🏷 Node name: abaft-jar-7208
Dec 04 11:09:40.546 INFO 👤 Role: AUTHORITY
Dec 04 11:09:40.546 INFO 💾 Database: RocksDb at /tmp/substratefBeKdR/chains/dev/db
Dec 04 11:09:40.546 INFO ⛓ Native runtime: node-template-1 (node-template-1.tx1.au1)
Illegal instruction (core dumped)
Looking at the expected outputs, it looks like this takes place when initialising the genesis block.
I've just started learning about all this, so this may be a quite basic issue (e.g. an incompatible version or something). I didn't find any previous question on this, but please let me know if it has been asked before.
Thanks!
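An "Illegal instruction" crash means the process executed a CPU instruction the host cannot run (or hit a deliberate trap in the binary). A minimal diagnostic sketch, assuming the node was built locally with cargo and you have shell access; the actual cause is not confirmed by the log above:

dmesg | tail -n 20            # see what the kernel recorded for the crash
grep -m1 flags /proc/cpuinfo  # which instruction-set extensions this CPU supports

# rebuild from scratch so the binary matches this machine's toolchain and CPU
cargo clean
cargo build --release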
Elasticsearch is working with no issues on http://localhost:9200, and the operating system is Ubuntu 18.04.
Here is the error log for Kibana:
root@syed-MS-7B17:/var/log# journalctl -fu kibana.service
-- Logs begin at Sat 2020-01-04 18:30:58 IST. --
Apr 03 20:22:49 syed-MS-7B17 kibana[7165]: {"type":"log","@timestamp":"2020-04-03T14:52:49Z","tags":["fatal","root"],"pid":7165,"message":"{ Error: listen EADDRNOTAVAIL: address not available 7.0.0.1:5601\n at Server.setupListenHandle [as _listen2] (net.js:1263:19)\n at listenInCluster (net.js:1328:12)\n at GetAddrInfoReqWrap.doListen (net.js:1461:7)\n at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:61:10)\n code: 'EADDRNOTAVAIL',\n errno: 'EADDRNOTAVAIL',\n syscall: 'listen',\n address: '7.0.0.1',\n port: 5601 }"}
Apr 03 20:22:49 syed-MS-7B17 kibana[7165]: FATAL Error: listen EADDRNOTAVAIL: address not available 7.0.0.1:5601
Apr 03 20:22:50 syed-MS-7B17 systemd[1]: kibana.service: Main process exited, code=exited, status=1/FAILURE
Apr 03 20:22:50 syed-MS-7B17 systemd[1]: kibana.service: Failed with result 'exit-code'.
Apr 03 20:22:53 syed-MS-7B17 systemd[1]: kibana.service: Service hold-off time over, scheduling restart.
Apr 03 20:22:53 syed-MS-7B17 systemd[1]: kibana.service: Scheduled restart job, restart counter is at 2.
Apr 03 20:22:53 syed-MS-7B17 systemd[1]: Stopped Kibana.
Apr 03 20:22:53 syed-MS-7B17 systemd[1]: kibana.service: Start request repeated too quickly.
Apr 03 20:22:53 syed-MS-7B17 systemd[1]: kibana.service: Failed with result 'exit-code'.
Apr 03 20:22:53 syed-MS-7B17 systemd[1]: Failed to start Kibana.
I have resolved it myself after checking the /etc/hosts file.
It had been edited by mistake like below:
7.0.0.1 localhost
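The fix is to restore the standard loopback entry and restart Kibana (a sketch; other lines in /etc/hosts on your machine may differ and should be left alone):

sudoedit /etc/hosts                  # change "7.0.0.1 localhost" back to "127.0.0.1 localhost"
sudo systemctl restart kibana
sudo journalctl -fu kibana.service   # it should now bind to localhost:5601 without EADDRNOTAVAIL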
neo4j.service - Neo4j Graph Database
Loaded: loaded (/lib/systemd/system/neo4j.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Mar 06 13:26:43 ip-10-14-12-59 neo4j[12287]: 2020-03-06 13:26:43.564+0000 INFO ======== Neo4j 3.5.14 ========
Mar 06 13:26:43 ip-10-14-12-59 neo4j[12287]: 2020-03-06 13:26:43.572+0000 INFO Starting...
Mar 06 13:26:49 ip-10-14-12-59 neo4j[12287]: 2020-03-06 13:26:49.780+0000 INFO Bolt enabled on 0.0.0.0:7687.
Mar 06 13:26:51 ip-10-14-12-59 neo4j[12287]: 2020-03-06 13:26:51.153+0000 INFO Started.
Mar 06 13:26:52 ip-10-14-12-59 neo4j[12287]: 2020-03-06 13:26:52.131+0000 INFO Remote interface available at http://10.14.12.59:7474/
Mar 06 13:42:38 ip-10-14-12-59 systemd[1]: Stopping Neo4j Graph Database...
Mar 06 13:42:38 ip-10-14-12-59 neo4j[12287]: 2020-03-06 13:42:38.818+0000 INFO Neo4j Server shutdown initiated by request
Mar 06 13:42:38 ip-10-14-12-59 neo4j[12287]: 2020-03-06 13:42:38.832+0000 INFO Stopping...
Mar 06 13:42:38 ip-10-14-12-59 neo4j[12287]: 2020-03-06 13:42:38.884+0000 INFO Stopped.
Mar 06 13:42:39 ip-10-14-12-59 systemd[1]: Stopped Neo4j Graph Database.
But still, according to sudo lsof -i -P -n | grep LISTEN, neither port 7474 nor port 7687 is listening.
Your neo4j Browser session is connected to a different (running) neo4j instance (probably on your local host). You can use this Browser command to see the URL it is currently using:
:server status
You can run these two Browser commands to disconnect, and then connect to the correct instance (the second command will display a form):
:server disconnect
:server connect
Based on your logs, it looks like you want to set the Connect URL to bolt://10.14.12.59:7687.
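Before reconnecting, it may be worth confirming that the Bolt port on that host is actually reachable from where the Browser is running (a sketch; it assumes nc and, optionally, cypher-shell are installed, and the credentials are placeholders):

nc -zv 10.14.12.59 7687                                            # is anything listening on Bolt?
cypher-shell -a bolt://10.14.12.59:7687 -u neo4j -p '<password>'   # optional: try a real Bolt login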
Help, my Kapacitor is not running. I'm running InfluxDB on the same server as Kapacitor and Telegraf, but Kapacitor doesn't work.
kapacitor.service - Time series data processing engine.
Loaded: loaded (/lib/systemd/system/kapacitor.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Thu 2019-01-03 17:56:38 UTC; 3s ago
Docs: https://github.com/influxdb/kapacitor
Process: 2502 ExecStart=/usr/bin/kapacitord -config /etc/kapacitor/kapacitor.conf $KAPACITOR_OPTS (code=exited, status=1/FAILURE)
Main PID: 2502 (code=exited, status=1/FAILURE)
Jan 03 17:56:38 ip-172-31-43-67 systemd[1]: kapacitor.service: Service hold-off time over, scheduling restart.
Jan 03 17:56:38 ip-172-31-43-67 systemd[1]: kapacitor.service: Scheduled restart job, restart counter is at 5.
Jan 03 17:56:38 ip-172-31-43-67 systemd[1]: Stopped Time series data processing engine..
Jan 03 17:56:38 ip-172-31-43-67 systemd[1]: kapacitor.service: Start request repeated too quickly.
Jan 03 17:56:38 ip-172-31-43-67 systemd[1]: kapacitor.service: Failed with result 'exit-code'.
Jan 03 17:56:38 ip-172-31-43-67 systemd[1]: Failed to start Time series data processing engine..
I found the solution myself:
[[influxdb]]
enabled = true
name = "localhost"
default = true
urls = ["http://localhost:8086"]
username = "user"
password = "password"
You must take into account that you will need to have this user created in InfluxDB beforehand.
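A sketch of that last step, assuming the InfluxDB 1.x CLI (influx) is available on the same server and the credentials match the [[influxdb]] section above:

influx -execute "CREATE USER \"user\" WITH PASSWORD 'password' WITH ALL PRIVILEGES"
sudo systemctl restart kapacitor
sudo systemctl status kapacitor      # should now show active (running)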
I'm trying to install and run Docker on my vServer and can't find information on whether it's even possible. I have tried CentOS (6 & 7), Ubuntu, Debian, and Fedora, and I'm just not able to get the Docker daemon to run.
docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled)
Active: failed (Result: exit-code) since So 2015-04-05 17:12:23 EDT; 16s ago
Docs: http://docs.docker.com
Process: 956 ExecStart=/usr/bin/docker -d $OPTIONS $DOCKER_STORAGE_OPTIONS $DOCKER_NETWORK_OPTIONS $INSECURE_REGISTRY (code=exited, status=1/FAILURE)
Main PID: 956 (code=exited, status=1/FAILURE)
Apr 05 17:12:23 vvs.valentinsavenko.com systemd[1]: Starting Docker Applicati...
Apr 05 17:12:23 vvs.valentinsavenko.com docker[956]: time="2015-04-05T17:12:2...
Apr 05 17:12:23 vvs.valentinsavenko.com docker[956]: time="2015-04-05T17:12:2...
Apr 05 17:12:23 vvs.valentinsavenko.com docker[956]: time="2015-04-05T17:12:2...
Apr 05 17:12:23 vvs.valentinsavenko.com docker[956]: inappropriate ioctl for ...
Apr 05 17:12:23 vvs.valentinsavenko.com docker[956]: time="2015-04-05T17:12:2...
Apr 05 17:12:23 vvs.valentinsavenko.com docker[956]: time="2015-04-05T17:12:2...
Apr 05 17:12:23 vvs.valentinsavenko.com systemd[1]: docker.service: main proc...
Apr 05 17:12:23 vvs.valentinsavenko.com systemd[1]: Failed to start Docker Ap...
Apr 05 17:12:23 vvs.valentinsavenko.com systemd[1]: Unit docker.service enter...
Hint: Some lines were ellipsized, use -l to show in full.
[root@vvs ~]# systemctl status docker.service -l
docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled)
Active: failed (Result: exit-code) since So 2015-04-05 17:12:23 EDT; 33s ago
Docs: http://docs.docker.com
Process: 956 ExecStart=/usr/bin/docker -d $OPTIONS $DOCKER_STORAGE_OPTIONS $DOCKER_NETWORK_OPTIONS $INSECURE_REGISTRY (code=exited, status=1/FAILURE)
Main PID: 956 (code=exited, status=1/FAILURE)
Apr 05 17:12:23 vvs.valentinsavenko.com systemd[1]: Starting Docker Application Container Engine...
Apr 05 17:12:23 vvs.valentinsavenko.com docker[956]: time="2015-04-05T17:12:23-04:00" level="info" msg="+job serveapi(unix:///var/run/docker.sock)"
Apr 05 17:12:23 vvs.valentinsavenko.com docker[956]: time="2015-04-05T17:12:23-04:00" level="info" msg="WARNING: You are running linux kernel version 2.6.32-042stab094.8, which might be unstable running docker. Please upgrade your kernel to 3.8.0."
Apr 05 17:12:23 vvs.valentinsavenko.com docker[956]: time="2015-04-05T17:12:23-04:00" level="info" msg="+job init_networkdriver()"
Apr 05 17:12:23 vvs.valentinsavenko.com docker[956]: inappropriate ioctl for device
Apr 05 17:12:23 vvs.valentinsavenko.com docker[956]: time="2015-04-05T17:12:23-04:00" level="info" msg="-job init_networkdriver() = ERR (1)"
Apr 05 17:12:23 vvs.valentinsavenko.com docker[956]: time="2015-04-05T17:12:23-04:00" level="fatal" msg="inappropriate ioctl for device"
Apr 05 17:12:23 vvs.valentinsavenko.com systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Apr 05 17:12:23 vvs.valentinsavenko.com systemd[1]: Failed to start Docker Application Container Engine.
Apr 05 17:12:23 vvs.valentinsavenko.com systemd[1]: Unit docker.service entered failed state.
On every system there is a different problem, and I'm wasting hours and hours not solving them.
http://kb.odin.com/en/125115
This post suggests that it might not work at all on a vServer with an old kernel, like in my case.
Did anybody actually manage to use Docker on a vServer, and if so, which kernel does your host system have?
I have a cheap server at https://www.netcix.de if that's important.
The installation page has a section "Check kernel dependencies" which clearly states the minimum kernel version expected for Docker to run:
Docker in daemon mode has specific kernel requirements. For details, check your distribution in Installation.
A 3.10 Linux kernel is the minimum requirement for Docker. Kernels older than 3.10 lack some of the features required to run Docker containers. These older versions are known to have bugs which cause data loss and frequently panic under certain conditions.
The latest minor version (3.x.y) of the 3.10 (or a newer maintained version) Linux kernel is recommended. Keeping the kernel up to date with the latest minor version will ensure critical kernel bugs get fixed.
So if your distro has a kernel that is too old, or some other requirement is not met (as listed in the installation docs), that would explain why the Docker daemon fails.
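A quick way to check what the vServer actually runs (a sketch; check-config.sh lives in the Docker/Moby source tree, so the exact URL may change over time):

uname -r    # in the logs above: 2.6.32-042stab094.8, well below the required 3.10

# optional: the kernel-config checker shipped with the Docker/Moby sources
curl -fsSL https://raw.githubusercontent.com/moby/moby/master/contrib/check-config.sh -o check-config.sh
bash check-config.sh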