DC/OS single-node master does not recover manual reboot - mesos

I've already setup a single-master single-slave DC/OS 1.9 for test purpose. I did reboot master node and it failed to restart itself (Server is up, but none of dcos-* services are running). Is there any procedure to recover my master? I did find slave recovery in the docs, but there was nothing about master.
Please tell me how should i recover my master.

Related

Ambari won't restart: DB check failed

When I restarted my cluster, ambari didn't start because of a db check failed config:
sudo service ambari-server restart --skip-database-check
Using python /usr/bin/python
Restarting ambari-server
Waiting for server stop...
Ambari Server stopped
Ambari Server running with administrator privileges.
Organizing resource files at /var/lib/ambari-server/resources...
Ambari Server is starting with the database consistency check skipped. Do not make any changes to your cluster topology or perform a cluster upgrade until you correct the database consistency issues. See "/var/log/ambari-server/ambari-server-check-database.log" for more details on the consistency issues.
Server PID at: /var/run/ambari-server/ambari-server.pid
Server out at: /var/log/ambari-server/ambari-server.out
Server log at: /var/log/ambari-server/ambari-server.log
Waiting for server start.....................
DB configs consistency check failed. Run "ambari-server start --skip-database-check" to skip. You may try --auto-fix-database flag to attempt to fix issues automatically. If you use this "--skip-database-check" option, do not make any changes to your cluster topology or perform a cluster upgrade until you correct the database consistency issues. See /var/log/ambari-server/ambari-server-check-database.log for more details on the consistency issues.
ERROR: Exiting with exit code -1.
REASON: Ambari Server java process has stopped. Please check the logs for more information.
I looked in the logs in "/var/log/ambari-server/ambari-server-check-database.log", and I saw:
2017-08-23 08:16:13,445 INFO - Checking Topology tables
2017-08-23 08:16:13,447 ERROR - Your topology request hierarchy is not complete for each row in topology_request should exist at least one raw in topology_logical_request, topology_host_request, topology_host_task, topology_logical_task.
I tried both options --auto-fix-database and --skip-database-check, it didn't work.
It seems that postgresql didn't start correctly, and even if in the log of Ambari there was no mention of postgresql not started or not available, but it was weird that ambari couldn't access to the topology configuration stored in it.
sudo service postgresql restart
Stopping postgresql service: [ OK ]
Starting postgresql service: [ OK ]
It did the trick:
sudo service ambari-server restart
Using python /usr/bin/python
Restarting ambari-server
Ambari Server is not running
Ambari Server running with administrator privileges.
Organizing resource files at /var/lib/ambari-server/resources...
Ambari database consistency check started...
Server PID at: /var/run/ambari-server/ambari-server.pid
Server out at: /var/log/ambari-server/ambari-server.out
Server log at: /var/log/ambari-server/ambari-server.log
Waiting for server start.........
Server started listening on 8080

Ambari show namenode is stop but actually namenode is still working

We are using HDP 2.7.1.2.3 with Ambari 2.1.2
After finish setup, every node status is correct.
But oneday ambari suddenly show namdenode is stopped.(we don't change any config of ambari or namenode)
However, we still can use HBASE and run MapReduce.
we think name node status should be normal.
We try to restart namenode and check ambari-server log
It shows:
ServiceComponentHostImpl:949 - Host role transitioned to a new state, serviceComponentName=NAMENODE, oldState=STARTING, currentState=STARTED
HeartBeatHandler:657 - State of service component NAMENODE of service HDFS of cluster wae has changed from STARTED to INSTALLED
we don't understand why its status change from "STARTED" to "INSTALLED".
In namenode side, we check ambari-agent.log
It shows one warning:
[Alert][namenode_directory_status] HA nameservice value is present but there are no aliases for {{hdfs-site/dfs.ha.namenodes.{{ha-nameservice}}}}
We think it is irrelevant.
What's the reason that ambari think namenode is stopped?
Is there any way that we can fix this issue?
Run the command ambari-server restart from linux terminal in Ambari server node
Run the command ambari-agent restart from linux terminal in all the nodes in the cluster.
You can run the command hdfs dfsadmin -report from the terminal as hdfs user to confirm all the nodes are up and running.

Hadoop HA Namenode goes down with the Error: flush failed for required journal (JournalAndStream(mgr=QJM to [< ip >:8485, < ip >:8485, < ip >:8485]))

Hadoop Namenode goes down almost everyday once.
FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) -
**Error: flush failed for required journal** (JournalAndStream(mgr=QJM to [< ip >:8485, < ip >:8485, < ip >:8485], stream=QuorumOutputStream starting at txid <>))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
at
Can someone suggest what are the things that I need to look into for resolving this issue?
I am using VMs for the journal nodes and master nodes. Does it cause any issue?
From the error you pasted. It appears your journal nodes could not talk to the NN in a timely manner. What was going on at the time of this event?
Since you mention that your nodes are vms I would guess you overloaded the hypervisor or it had troubling talking from the NN to the JN and zk quorum.
In my case, this issue was caused due to the difference in the system time between the nodes of the cluster.
To keep the system time in sync, we can execute the commands below in each node.
sudo service ntpd stop
sudo ntpdate pool.ntp.org # Run this command multiple times
sudo service ntpd start
If hue is down, run below command on the hue server machine
sudo service hue start
If namenode is down, start the namenode.
Recurring fix
Add a crontab for the root user on all the nodes of the environment.
or
Install VM tools, to keep the system time in sync.

Unable to start Mesos slave on single node cluster

From what I know I am able to set up Mesos master, slave, zookeeper, marathon on a single node.
But once I execute the command to start mesos-master and after that I am trying to start mesos-slave as well but I don't have any way to continue to execute other commands else where. I have to stop the running and run but the problem is mesos-master already stop running.
Don't execute the commands directly from your shell, you want to start all of those components (zookeeper, mesos-master, mesos-slave, and marathon) as services.
/etc/init.d/zookeeper start
start mesos-master
start mesos-slave
start marathon
I forget if zookeeper creates the init script as part of the install for you or not, you may have to find it in the Hadoop docs.
As for the other 3, they all use 'upstart' and you can find the configuration files in /etc/init/

DataNode automatically getting restarted in the CDH5 cluster

We have setup a cluster with 6 slave nodes. I am trying to see how replication happens when one of the DataNode dies.
I logged into one of the slave and killed the DataNode using the kill -9 command. After sometime the DataNode is restarted automatically and HDFS gets back into healthy status. I am verify this because the PID of the DataNode has changed.
I don't see any documentation on the above behavior of DataNode. Is this the Apache Hadoop or Cloudera CDH feature? Any reference to the documentation is appreciated.
As the pid of datanode has been changed, I don't think it is a behavior of datanode. If you are managing your cluster using Cloudera Manager, there is an option for restarting datanode daemon if it fails(Automatically Restart Process). This option will be set by default. When the datanode process gets failed or killed, As Automatic restart option is set Cloudera Scm agent will start the the datanode daemon.
For Automatic restart option : Choose HDFS services -> go to Configuration section -> Search for automatic restart.
This feature is available in CM 4.X release as well.

Resources