DC/OS - Dashboard showing 0 connected nodes - mesos

After restarting my 3 masters in my DC/OS cluster, the DC/OS dashboard is showing 0 connected nodes. However from the DC/OS cli I see all 6 of my agent nodes:
$ dcos node
HOSTNAME IP ID
172.16.1.20 172.16.1.20 a7af5134-baa2-45f3-892e-5e578cc00b4d-S7
172.16.1.21 172.16.1.21 a7af5134-baa2-45f3-892e-5e578cc00b4d-S12
172.16.1.22 172.16.1.22 a7af5134-baa2-45f3-892e-5e578cc00b4d-S8
172.16.1.23 172.16.1.23 a7af5134-baa2-45f3-892e-5e578cc00b4d-S6
172.16.1.24 172.16.1.24 a7af5134-baa2-45f3-892e-5e578cc00b4d-S11
172.16.1.25 172.16.1.25 a7af5134-baa2-45f3-892e-5e578cc00b4d-S10`
I am still able to schedule tasks in Marathon both from the dcos cli and from the Marathon gui, they then are properly scheduled and executed on the agents. Also, from the mesos interface on :5050 I can see all of the agents in the slaves page.
I have restarted agent nodes and master nodes. I have also rerun the DC/OS GUI installer and run preflight check, which of course fails with an "already installed" error.
Is there a way to re-register the node with DC/OS GUI short of uninstalling/reinstalling a node?

For anyone who is running into this, my problem was related to our corporate proxy. In order to get the Universe working in my cluster I had to add proxy settings to /opt/mesosphere/environment. I then restarted the dcos-cosmos.service and life was good. However, upon server restart, dcos-history-service.service was now running with the new environment and was unable to resolve my local names with our proxy server. To solve, I added a NO_PROXY to the /opt/mesosphere/environment and DCOS dashboard is again happy.

Related

Nomad and consul setup

Should I run consul slaves alongside nomad slaves or inside them?
The later might not make sense at all but I'm asking it just in case.
I brought my own nomad cluster up with consul slaves running alongside nomad slaves (inside worker nodes), my deployable artifacts are docker containers (java spring applications).
The issue with my current setup is that my applications can't access consul slaves (to read configurations) (none of 0.0.0.0, localhost, worker node ip worked)
Lets say my service exposes 8080, I configured docker part (in hcl file) to use bridge as network mode. Nomad maps 8080 to 43210.
Everything is fine until my service tries to reach the consul slave to read configuration. Ideally giving nomad worker node IP as consul host to Spring should suffice. But for some reason it's not.
I'm using latest version of nomad.
I configured my nomad slaves like https://github.com/bmd007/statefull-geofencing-faas/blob/master/infrastructure/nomad/client1.hcl
And the link below shows how I configured/ran my consul slave:
https://github.com/bmd007/statefull-geofencing-faas/blob/master/infrastructure/server2.yml
Note: if I use static port mapping and host as the network mode for docker (in nomad) I'll be fine but then I can't deploy more than one instance of each application in each worker node (due to port conflic)
Nomad jobs listen on a specific host/port pair.
You might want to ssh into the server and run docker ps to see what host/port pair the job is listening on.
a93c5cb46a3e image-name bash 2 hours ago Up 2 hours 10.0.47.2:21435->8000/tcp, 10.0.47.2:21435->8000/udp foo-bar
Additionally, you will need to ensure that the consul nomad job is listening on port 0.0.0.0, or the specific ip of the machine. I believe that is this config value: https://www.consul.io/docs/agent/options.html#_bind
All those will need to match up in order to consul to be reachable.
More generally, I might recommend: if you're going to run consul with nomad, you might want to switch to host networking, so that you don't have to deal with the specifics of the networking within a container. Additionally, you could schedule consul as a system job so that it is automatically present on every host.
So I managed to solve the issue like this:
nomad.job.group.network.mode = host
nomad.job.group.network.port: port "http" {}
nomad.job.group.task.driver = docker
nomad.job.group.task.config.network_mode = host
nomad.job.group.task.config.ports = ["http"]
nomad.job.group.task.service.connect: connect { native = true }
nomad.job.group.task.env: SERVER_PORT= "${NOMAD_PORT_http}"
nomad.job.group.task.env: SPRING_CLOUD_CONSUL_HOST = "localhost"
nomad.job.group.task.env: SPRING_CLOUD_SERVICE_REGISTRY_AUTO_REGISTRATION_ENABLED = "false"
Running consul agent (slaves) using docker-compose alongside nomad agent (slave) with host as network mode + exposing all required ports.
Example of nomad job: https://github.com/bmd007/statefull-geofencing-faas/blob/master/infrastructure/nomad/location-update-publisher.hcl
Example of consul agent config (docker-compose file): https://github.com/bmd007/statefull-geofencing-faas/blob/master/infrastructure/server2.yml
Disclaimer: The LAB is part of Cluster Visualization Framework called: LiteArch Trafik which I have created as an interesting exercise to understand Nomad and Consul.
It took me long time to shift my mind from K8S to Nomad and Consul,
Integration them was one of my effort I spent in the last year.
When service resolution doesn't work, I found out it's more or less the DNS configuration on servers.
There is a section for it on Hashicorp documentation called DNS Forwarding
Hashicorp DNS Forwarding
I have created a LAB which explains how to set up Nomad and Consul.
But you can use the LAB seperately.
I created the LAB after learning the hard way how to install the cluster and how to integrate Nomad and Consul.
With the LAB you need Ubuntu Multipass installed.
You execute one script and you will get full functional Cluster locally with three servers and three nodes.
It shows you as well how to install docker and integrate the services with Consul and DNS services on Ubuntu.
After running the LAB you will get the links to Nomad, Fabio, Consul.
Hopefully it will guide you through the learning process of Nomad and Consul
LAB: LAB
Trafik:Trafik Visualizer

How to reinstall Ambari Server on a crashed node and migrate the Cluster settings?

Two of my drives crashed on the Ambari Server node so I have to re-migrate my Ambari Cluster. No real data was lost (due to a different backup strategy) but the configuration files of the node, including Ambari Server configuration, are gone.
Because two drives crashed, I can not access any files from that node anymore (RAID 5).
I am now in the process of reinstalling the Ambari Server on the same node and would like to have my agents seamlessly reconnect to the "new" Ambari Server.
Is there a way to migrate the existing Cluster settings to the Ambari Server? I am thinking of Cluster settings that were distributed to the agents or similar.
If there is no such way to migrate the cluster, how would I go and install the Ambari Server? Do a fresh install and setup everything again? Will the Ambari agents be able to connect to the "new" Cluster without problems? Note that the Ambari Server will run on the same hostname/ip.

Setting up Ganglia on spark running on EC2

I am trying to set up ganglia on ec2 servers, but can't seem to recevie info from any other host than my master.
The master meterics are showing fine, but I have 4 more nodes that aren't listed in the web interface.
My question is - could it be that because I'm using unicast I can only see the master node, and that it's aggregating all of the other nodes data?
I have ran both gmetad and gmond in the foreground, and saw that the node and the master are communicating with each other, but still can't see the node in the web UI.
Any help would be appreciated.

Marathon loses control over Mesos when Marathon and Mesos leaders mismatch

When mesos or marathon service restart due to some reasons and leader of mesos and marathon is not on the same machine, deployments stuck in marathon and nothing happens in mesos, that leads to terrible results when marathon can not restart failed services and do nothing with deployments until leaders will not match again.
Our cluster has 3 masters (installed through mesosphere website) and this situation happens quite often, is there any way to fix that?
Marathon v.0.9.0
Mesos v0.22.1
It sounds like either Mesos or Marathon use a private ip (localhost/127.0.0.1), thus they weren't able to talk to each other.
You should be able to solve your issue by setting a public ip using the respective --ip command line flag or LIBPROCESS_IP environment var.
One particularly useful setting is LIBPROCESS_IP, which tells the master and slave binaries which IP address to bind to; in some installations, the default interface that the hostname resolves to is not the machine’s external IP address, so you can set the right IP through this variable.
/source http://mesos.apache.org/documentation/latest/deploy-scripts/

How does one install etcd in a cluster?

Newbie w/ etcd/zookeeper type services ...
I'm not quite sure how to handle cluster installation for etcd. Should the service be installed on each client or a group of independent servers? I ask because if I'm on a client, how would I query the cluster? Every tutorial I've read shows a curl command running against localhost.
For etcd cluster installation, you can install the service on independent servers and form a cluster. The cluster information can be queried by logging onto one of the machines and running curl or remotely by specifying the IP address of one of the cluster member node.
For more information on how to set it up, follow this article

Resources