Find unreachable/ deactivated mesos slaves agents - mesos

I deployed dcos cluster on aws ec2 instances having a couple of mesos-slave agents. Few out of them were unexpectedly terminated. Mesos master marked them "unreachable". I would like to change their status from "Unreachable" to "Gone". To do that dcos provide following command:
dcos node decommission <mesos-id>
However, I am unable to find mesos-id of the unreachable mesos-agents. Neither mesos-master nor dc/os GUI/logs show any information for these nodes.
My question is how to get a list of all the unreachable (or deactivated) mesos-slave agents?
Thanks in anticipation.

To get an history of agents marked as unreachable use this command:
grep unreachable /var/log/mesos/*.INFO.*
or
gawk 'match($0, /.*Marking agent (.*) \(.*\) unreachable.*/, a) {print a[1]}' /var/log/mesos/*.INFO.*|sort|uniq
But if you only want to reset metrics reported in web ui you need to restart the mesos-master service (take a look at https://mesos.apache.org/documentation/latest/monitoring/)

Related

Hadoop HA Namenode goes down with the Error: flush failed for required journal (JournalAndStream(mgr=QJM to [< ip >:8485, < ip >:8485, < ip >:8485]))

Hadoop Namenode goes down almost everyday once.
FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) -
**Error: flush failed for required journal** (JournalAndStream(mgr=QJM to [< ip >:8485, < ip >:8485, < ip >:8485], stream=QuorumOutputStream starting at txid <>))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
at
Can someone suggest what are the things that I need to look into for resolving this issue?
I am using VMs for the journal nodes and master nodes. Does it cause any issue?
From the error you pasted. It appears your journal nodes could not talk to the NN in a timely manner. What was going on at the time of this event?
Since you mention that your nodes are vms I would guess you overloaded the hypervisor or it had troubling talking from the NN to the JN and zk quorum.
In my case, this issue was caused due to the difference in the system time between the nodes of the cluster.
To keep the system time in sync, we can execute the commands below in each node.
sudo service ntpd stop
sudo ntpdate pool.ntp.org # Run this command multiple times
sudo service ntpd start
If hue is down, run below command on the hue server machine
sudo service hue start
If namenode is down, start the namenode.
Recurring fix
Add a crontab for the root user on all the nodes of the environment.
or
Install VM tools, to keep the system time in sync.

redis on windows cluster setup

I have downloaded MSOpenTech Redis version 3.x which includes the long awaited clustering feature. My redis database is all working and I can start my cluster on the min 3 nodes required (in cluster mode). Does anyone know how to configure the cluster (it seems no one knows)?
Installing Linux and running the native Linux version is not an option for me sadly.
Any help would be greatly appreciated.
You can follow the Redis Cluster Tutorial and to create the cluster you can use the redis-trib.rb ruby script, for which you need to install Ruby for Windows.
For example:
> C:\Ruby22\Bin\ruby.exe redis-trib.rb create --replicas 1 192.168.1.1:7000 192.168.1.1:7001 192.168.1.1:7002 192.168.1.1:7003 192.168.1.1:7004 192.168.1.1:7005
Did not have the option to install Ruby on Windows but found the manual steps worked for me. The Ruby script seems to do a lot of checking stuff is setup correctly and is the preferred setup route. So Beware, here be dragons.
Set each node to run in Cluster mode. Edit the redis.windows-service.conf file and uncomment
cluster-enabled yes
cluster-config-file nodes-6379.conf
cluster-node-timeout 15000
restart the service.
Run a powershell window and change to the Redis installed folder and start the redis-cli. e.g.
cd "C:\Program Files\Redis"
.\redis-cli.exe
Now you can join other nodes. Run CLUSTER MEET IPADDRESS PORT for each of the other nodes, than the instance you happen to be on. e.g.
CLUSTER MEET 10.10.0.2 6379
After a few seconds running
CLUSTER NODES
Should list all the nodes connected, but all will be set as MASTER.
On each of the other nodes, run CLUSTER REPLICATE MASTERNODEID. Where MASTERNODEID is the hash-looking value next the node declared "myself" on your master when running CLUSTER NODES. e.g.
CLUSTER REPLICATE b7c767ab3ab7c4a926ac2fed937cf140b96764a7
Now allocate slots to each Master. My setup has three instances, only one master.
for ($slot=0;$slot -le 16383;$slot++) {
.\redis-cli.exe -h REDMST CLUSTER ADDSLOTS $slot
}
Reconnect with redis-cli and try and save data. e.g.
SET foo bar
OK
GET foo
"bar"
Phew! Got most this from reading https://www.javacodegeeks.com/2015/09/redis-clustering.html#InstallingRedis which is not Windows specific.
for windows version:
open the command window then type below command
C:\ProgramFiles\redis>FOR /L %i IN (0,1,16383) DO ( redis-cli.exe -p **6380** CLUSTER ADDSLOTS %i )
6380 is port of master node.

Mesos slave node unable to restart

I've setup a Mesos cluster using the CloudFormation templates from Mesosphere. Things worked fine after cluster launch.
I recently noticed that none of the slave nodes are listed in the Mesos dashboard. EC2 console shows the slaves are running & pass health checks. I restarted nodes on cluster but that didn't help.
I ssh'ed into one of the slaves and noticed mesos-slave services are not running. Executed sudo systemctl status dcos-mesos-slave.service but that couldn't start the service.
Looked in /var/log/mesos/ and tail -f mesos-slave.xxx.invalid-user.log.ERROR.20151127-051324.31267 and saw the following...
F1127 05:13:24.242182 31270 slave.cpp:4079] CHECK_SOME(state::checkpoint(path, bootId.get())): Failed to create temporary file: No space left on device
But the output of df -h and free show there is plenty of disk space left.
Which leads me to wonder, why is it complaining about no disk space?
Ok I figured it out.
When running Mesos for a long time or under frequent load, the /tmp folder won't have any disk space left since Mesos uses the /tmp/mesos/ as the work_dir. You see, the filesystem can only hold a certain number of file references(inodes). In my case, slaves were collecting large number of file chuncks from image pulls in /var/lib/docker/tmp.
To resolve this issue:
1) Remove files under /tmp
2) Set a different work_dir location
It is good practice to run
docker rmi -f $(docker images | grep "<none>" | awk "{print \$3}")
this way you will free space by deleting unused docker images

Unable to start Mesos slave on single node cluster

From what I know I am able to set up Mesos master, slave, zookeeper, marathon on a single node.
But once I execute the command to start mesos-master and after that I am trying to start mesos-slave as well but I don't have any way to continue to execute other commands else where. I have to stop the running and run but the problem is mesos-master already stop running.
Don't execute the commands directly from your shell, you want to start all of those components (zookeeper, mesos-master, mesos-slave, and marathon) as services.
/etc/init.d/zookeeper start
start mesos-master
start mesos-slave
start marathon
I forget if zookeeper creates the init script as part of the install for you or not, you may have to find it in the Hadoop docs.
As for the other 3, they all use 'upstart' and you can find the configuration files in /etc/init/

Need to Install Mesos to get Mesos Slave?

I'm trying to get this question solve,
To get mesos slave, is it we have to install Mesos and start mesos slave set up or?
And also I have problem with mesos master which I run a command
./bin/mesos-master.sh --ip=*** --work_dir=/var/lib/mesos
end up it does not continue to run so i stop it running. End up I run the same above command and I get error shown
Failed to initialize, bind: Address already in use [98]
Which part did I do wrongly?
You have to run mesos-master first and then you can connect mesos slave running on a different node to the master. You can refer to getting started guide of mesos. only one slave can connect to the master on the same port. If you get bind address already in use, you can try running slave on another port by passing --port=value parameter. Replace value with port number.
to start mesos master on localhost:
./bin/mesos-master.sh --ip=127.0.0.1 --work_dir=/var/lib/mesos
to start and connect slave to master
./bin/mesos-slave.sh --master=127.0.0.1:5050
to start and connect another slave to the same master you have to use another port as default port 5051 is already used by the first connected slave. Use argument --port-value to start slave on another port
./bin/mesos-slave.sh --master=127.0.0.1:5050 --port=5053
You may get a permission denied error. If so use sudo to access the given port
sudo ./bin/mesos-slave.sh --master=127.0.0.1:5050 --port=5053
You can run one more slave but you have to specify ip and a different workdir using
./mesos-slave.sh --master=<ipaddr>:<port> --ip=<ip of slave> --work_dir=<work_dir other than that of a running slave> --port=<another_port>
edit your etc/hosts and add more local ips with the following entries
127.0.0.2 slave2
127.0.0.3 slave3
then you can replace --ip=<ip of slave> with --ip=slave1 or --ip=slave2
You may have to replace <another_port> with ports like 5052,5053 or any available port if you have a running slave. The slave will be using the default port.
To run only a mesos-slave on a host is simple by installing the mesos package and only running the mesos-slave process with the correct flags, it's not a problem if the master is also installed, but be careful only to run the masters correct to the quorum number.
Something already running on the port you are trying to start the mesos-master, which has a web interface.
Check what program runs on the mesos default port, or use another port, more info about the command line documentation available here: Mesos configuration
To see what's using port 5050 or 5051 use either one of these commands:
sudo fuser -v 5050/tcp
sudo lsof -i | grep 5050
Both command will give you the process pid which holds the port. Either kill them or specify a new port for mesos by starting it with the correct port option:
./bin/mesos-master.sh --ip=*** --work_dir=/var/lib/mesos --port=FREE_PORT
Where do you specify the zookeepers for the mesos master and slaves? The following flags are required to start mesos-master (see the link I gave you):
--advertise_ip, --advertise_port, --quorum, --work_dir, --zk
What are your current full configuration for mesos master? You can find the files under related at /etc/mesos/, /etc/mesos-master/, /etc/mesos-slave/, /etc/defaults/mesos, /etc/defaults/mesos-master, /etc/defaults/mesos-slave. If you copy paste the lines from them and the mesos log here, we might give you more help.
Also please explain the cluster you would like to set up (Number of hosts, masters, slaves) and we can also help there.
excecute below command :
sudo netstat -peanut
Then check which process is using the port 5050 and 5051.
Kill those process using the pid.
Start the mesos master and slave again.
This happens to me when I killed the mesos slave accidentally and then restarted it but failed with address-bind issue.

Resources