I deployed a DC/OS cluster on AWS EC2 instances with several Mesos agents (mesos-slave). A few of them were unexpectedly terminated, and the Mesos master marked them "unreachable". I would like to change their status from "Unreachable" to "Gone". DC/OS provides the following command for that:
dcos node decommission <mesos-id>
However, I am unable to find the mesos-id of the unreachable Mesos agents. Neither the Mesos master nor the DC/OS GUI or logs show any information for these nodes.
My question is: how do I get a list of all the unreachable (or deactivated) Mesos agents?
Thanks in anticipation.
To get a history of the agents that were marked as unreachable, use this command on the master:
grep unreachable /var/log/mesos/*.INFO.*
or
gawk 'match($0, /.*Marking agent (.*) \(.*\) unreachable.*/, a) {print a[1]}' /var/log/mesos/*.INFO.*|sort|uniq
But if you only want to reset the metrics reported in the web UI, you need to restart the mesos-master service (take a look at https://mesos.apache.org/documentation/latest/monitoring/).
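If gawk is not installed on the master, the same extraction can be sketched with POSIX sed. This is only a sketch: the function name unreachable_agent_ids is made up for illustration, and it assumes the log line format implied by the gawk pattern above ("Marking agent <id> (<host>) unreachable").

```shell
# Print the unique agent IDs found in Mesos master log lines of the form
#   "... Marking agent <agent-id> (<host>) unreachable ..."
# Reads log text on stdin and uses only POSIX sed and sort.
# The line format is an assumption taken from the gawk pattern above.
unreachable_agent_ids() {
    sed -n 's/.*Marking agent \([^ ]*\) (.*) unreachable.*/\1/p' | sort -u
}

# Usage on a master node (log path as in the answer above):
#   cat /var/log/mesos/*.INFO.* | unreachable_agent_ids
```

Each ID printed this way can then be passed to dcos node decommission <mesos-id>.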
I am currently using PostgreSQL with log-shipping replication. I use a Pacemaker master/slave resource to handle PostgreSQL failover.
Is there a way to demote a master, set it as a standby, and keep it synchronized without using either "repmgr standby clone" or pg_rewind?
I want the old master to be ready to return to the master role quickly, and "repmgr standby clone" takes several minutes to recover, which is too long.
I see that pg_rewind can synchronize faster, but it requires wal_log_hints to be enabled, and I am afraid that this option will decrease the performance of the master, which is already too busy.
I tried just writing recovery.conf into the data directory. The master switched to standby mode as expected, but it has no upstream:
[root#bkm-01 httpd]# su - postgres -c "/usr/pgsql-9.5/bin/repmgr -f /var/lib/pgsql/repmgr/repmgr.conf cluster show"
  Role    | Name    | Upstream | Connection String
----------+---------+----------+----------------------------------------
* master  | node-02 |          | host=node-02 user=repmgr dbname=repmgr
  standby | node-01 |          | host=node-01 user=repmgr dbname=repmgr
I hope this is clear enough; I am a newbie in database replication. Any help would be appreciated.
I found the solution myself. The former master just needs to be registered again after being demoted. Use --force if the node was previously registered.
[root#node-01 ] su - postgres -c "/usr/pgsql-9.5/bin/repmgr -f /var/lib/pgsql/repmgr/repmgr.conf standby register --force"
Hey, I have a cluster ID mismatch for some reason. I had it on one node; it disappeared after I cleared the data dir a few times and changed the cluster token and node names, but then it appeared on another node.
Here is the script I use:
IP0=10.150.0.1
IP1=10.150.0.2
IP2=10.150.0.3
IP3=10.150.0.4
NODENAME0=node0
NODENAME1=node1
NODENAME2=node2
NODENAME3=node3
# changing these on each box
THISIP=$IP2
THISNODENAME=$NODENAME2
etcd --name $THISNODENAME --initial-advertise-peer-urls http://$THISIP:2380 \
--data-dir /root/etcd-data \
--listen-peer-urls http://$THISIP:2380 \
--listen-client-urls http://$THISIP:2379,http://127.0.0.1:2379 \
--advertise-client-urls http://$THISIP:2379 \
--initial-cluster-token etcd-cluster-2 \
--initial-cluster $NODENAME0=http://$IP0:2380,$NODENAME1=http://$IP1:2380,$NODENAME2=http://$IP2:2380,$NODENAME3=http://$IP3:2380 \
--initial-cluster-state new
I get
2016-11-11 22:13:12.090515 I | etcdmain: etcd Version: 2.3.7
2016-11-11 22:13:12.090643 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2016-11-11 22:13:12.090713 I | etcdmain: listening for peers on http://10.150.0.3:2380
2016-11-11 22:13:12.090745 I | etcdmain: listening for client requests on http://10.150.0.3:2379
2016-11-11 22:13:12.090771 I | etcdmain: listening for client requests on http://127.0.0.1:2379
2016-11-11 22:13:12.090960 I | etcdserver: name = node2
2016-11-11 22:13:12.090976 I | etcdserver: data dir = /root/etcd-data
2016-11-11 22:13:12.090983 I | etcdserver: member dir = /root/etcd-data/member
2016-11-11 22:13:12.090990 I | etcdserver: heartbeat = 100ms
2016-11-11 22:13:12.090995 I | etcdserver: election = 1000ms
2016-11-11 22:13:12.091001 I | etcdserver: snapshot count = 10000
2016-11-11 22:13:12.091011 I | etcdserver: advertise client URLs = http://10.150.0.3:2379
2016-11-11 22:13:12.091269 I | etcdserver: restarting member 7fbd572038b372f6 in cluster 4e73d7b9b94fe83b at commit index 4
2016-11-11 22:13:12.091317 I | raft: 7fbd572038b372f6 became follower at term 8
2016-11-11 22:13:12.091346 I | raft: newRaft 7fbd572038b372f6 [peers: [], term: 8, commit: 4, applied: 0, lastindex: 4, lastterm: 1]
2016-11-11 22:13:12.091516 I | etcdserver: starting server... [version: 2.3.7, cluster version: to_be_decided]
2016-11-11 22:13:12.091869 E | etcdmain: failed to notify systemd for readiness: No socket
2016-11-11 22:13:12.091894 E | etcdmain: forgot to set Type=notify in systemd service file?
2016-11-11 22:13:12.096380 N | etcdserver: added member 7508b3e625cfed5 [http://10.150.0.4:2380] to cluster 4e73d7b9b94fe83b
2016-11-11 22:13:12.099800 N | etcdserver: added member 14c76eb5d27acbc5 [http://10.150.0.1:2380] to cluster 4e73d7b9b94fe83b
2016-11-11 22:13:12.100957 N | etcdserver: added local member 7fbd572038b372f6 [http://10.150.0.2:2380] to cluster 4e73d7b9b94fe83b
2016-11-11 22:13:12.102711 N | etcdserver: added member d416fca114f17871 [http://10.150.0.3:2380] to cluster 4e73d7b9b94fe83b
2016-11-11 22:13:12.134330 E | rafthttp: request cluster ID mismatch (got cfd5ef74b3dcf6fe want 4e73d7b9b94fe83b)
The other members are not even running, so how is that possible?
Thank you
For all those who stumble upon this from google:
The error is about the peer member ID: a node is trying to join the cluster under the same peer name as a member that already exists in the cluster (probably an old instance), but with a different ID. That is the problem.
You should delete the peer and re-add it, as shown in this helpful post:
In order to fix this it was pretty simple, first we had to log into an existing working server on the rest of the cluster and remove server00 from its member list:
etcdctl member remove <UID>
This frees up the ability for the new server00 to join, but we still needed to tell the cluster it could by issuing the add command:
etcdctl member add server00 http://1.2.3.4:2380
If you follow the logs on server00 you'll then see everything spring into life. You can confirm this with the commands:
etcdctl member list
etcdctl cluster-health
Use "etcdctl member list" to find the IDs of the current members and identify the one that is trying to join the cluster with the wrong ID. Then delete that peer from the member list with "etcdctl member remove <UID>" and try to rejoin it.
Hope it helps.
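To look up the ID to pass to "etcdctl member remove", the member list output can be filtered by name. The following is a rough sketch, not part of the original answer: member_id_by_name is a hypothetical helper, and it assumes the v2-style output format "<id>: name=<name> peerURLs=... clientURLs=...", which may differ in other etcdctl versions.

```shell
# Print the ID of the member whose name is $1, given v2-style
# `etcdctl member list` output on stdin, for example:
#   7fbd572038b372f6: name=node1 peerURLs=http://10.150.0.2:2380 clientURLs=...
# The output format is an assumption; check it against your etcdctl version.
member_id_by_name() {
    awk -v want="name=$1" '
        {
            for (i = 2; i <= NF; i++) {
                if ($i == want) {
                    id = $1
                    sub(/:$/, "", id)   # drop the trailing colon after the ID
                    print id
                }
            }
        }
    '
}

# Usage on a healthy member:
#   etcdctl member list | member_id_by_name node1
```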
I just ran into this same issue, 2 years later. Dmitry's answer is fine but misses what the OP likely did wrong in the first place when setting up an etcd cluster.
Running an etcd instance with "--initial-cluster-state new" at any point generates a cluster ID in the data directory. If you later try to join an existing cluster with that data directory, etcd will use the old generated cluster ID (which is when the mismatch error occurs). Yes, technically the OP had an "old cluster", but a far more common case is someone standing up their first cluster and not noticing that the procedure has to change for the second node. I find that etcd generally fails to provide a good usage model.
So, removing the member (you don't really need to if the new node never joined successfully) and/or deleting the new node's data directory will "fix" the issue, but the real problem is how the OP set up the 2nd cluster node.
Here's an example of the setup nuance: (sigh... thanks for that etcd...)
# On the 1st node (I used Centos7 minimal, with etcd installed)
sudo firewall-cmd --permanent --add-port=2379/tcp
sudo firewall-cmd --permanent --add-port=2380/tcp
sudo firewall-cmd --reload
export CL_NAME=etcd1
export HOST=$(hostname)
export IP_ADDR=$(ip -4 addr show ens33 | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
# turn on etcdctl v3 api support, why is this not default?!
export ETCDCTL_API=3
sudo etcd --name $CL_NAME --data-dir ~/data --advertise-client-urls=http://127.0.0.1:2379,https://$IP_ADDR:2379 --listen-client-urls=https://0.0.0.0:2379 --initial-advertise-peer-urls https://$IP_ADDR:2380 --listen-peer-urls https://$IP_ADDR:2380 --initial-cluster-state new
OK, the first node is running. The cluster data is in the ~/data directory. On future runs you only need the following (note that --initial-cluster-state isn't needed):
sudo etcd --name $CL_NAME --data-dir ~/data --advertise-client-urls=http://127.0.0.1:2379,https://$IP_ADDR:2379 --listen-client-urls=https://0.0.0.0:2379 --initial-advertise-peer-urls https://$IP_ADDR:2380 --listen-peer-urls https://$IP_ADDR:2380
Next, on the 1st node, add your 2nd node's expected member name and peer URLs:
etcdctl --endpoints="https://127.0.0.1:2379" member add etcd2 --peer-urls="http://<next node's IP address>:2380"
Adding the member is important. You won't be able to successfully join without doing it first.
# Next on the 2nd/new node
# The name must match the one given to "member add" on the 1st node
export CL_NAME=etcd2
export HOST=$(hostname)
export IP_ADDR=$(ip -4 addr show ens33 | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
sudo etcd --name $CL_NAME --data-dir ~/data --advertise-client-urls=https://127.0.0.1:2379,https://$IP_ADDR:2379 --listen-client-urls=https://0.0.0.0:2379 --initial-advertise-peer-urls https://$IP_ADDR:2380 --listen-peer-urls https://$IP_ADDR:2380 --initial-cluster-state existing --initial-cluster="etcd1=http://<IP of 1st node>:2380,etcd2=http://$IP_ADDR:2380"
Note the annoying extra arguments here. --initial-cluster must identify 100% of the nodes in the cluster... which doesn't matter after you join the cluster, because cluster data will be replicated anyway... Also, "--initial-cluster-state existing" is needed.
Again, after the 1st time the 2nd node runs/joins, you can run it without any cluster arguments:
sudo etcd --name $CL_NAME --data-dir ~/data --advertise-client-urls=http://127.0.0.1:2379,https://$IP_ADDR:2379 --listen-client-urls=https://0.0.0.0:2379 --initial-advertise-peer-urls https://$IP_ADDR:2380 --listen-peer-urls https://$IP_ADDR:2380
Sure, you could keep running etcd with all the cluster settings in there, but they "might" get ignored in favor of what's in the data directory. Remember that if you join a 3rd node, knowledge of the new member is replicated to the other nodes, and those "initial" cluster settings could be completely false or misleading in the future when your cluster changes. So run your joined nodes with no initial cluster settings unless you are actually joining a cluster.
Also, one last bit to impart: you should run at least 3 nodes in a cluster, otherwise losing a single node takes the whole cluster down. With 2 nodes, when 1 node goes down or they get disconnected, the remaining node cannot win an election on its own (Raft requires a majority of members) and spins in an election loop. Clients can't talk to an etcd service that's in election mode... Great availability! You need a minimum of 3 nodes to survive 1 going down.
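The arithmetic behind the 3-node minimum can be sketched as follows. The function names quorum and fault_tolerance are illustrative only; the formulas are the usual majority rule for Raft-style clusters.

```shell
# A majority (quorum) of members is needed to elect a leader and commit writes.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Number of members that can be lost while a quorum is still available.
fault_tolerance() {
    echo $(( ($1 - 1) / 2 ))
}

for n in 1 2 3 4 5; do
    echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

With 2 members the quorum is 2, so no failure is tolerated; 3 is the smallest size that tolerates 1 failure, and going from 3 to 4 members adds no extra tolerance.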
In my case I got the error
rafthttp: request cluster ID mismatch (got 1b3a88599e79f82b want b33939d80a381a57)
due to an incorrect config on one node.
Two of my nodes had this in their config:
env ETCD_INITIAL_CLUSTER="etcd-01=http://172.16.50.101:2380,etcd-02=http://172.16.50.102:2380,etcd-03=http://172.16.50.103:2380"
and one node had:
env ETCD_INITIAL_CLUSTER="etcd-01=http://172.16.50.101:2380"
To resolve the problem I stopped etcd on all nodes, fixed the incorrect config, deleted the /var/lib/etcd/member folder on all nodes, restarted etcd on all nodes, and voila!
P.S. /var/lib/etcd is the folder where etcd saves its data in my case.
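A quick way to catch this kind of drift before starting etcd is to compare the ETCD_INITIAL_CLUSTER value from each node. The sketch below uses two hypothetical helpers, normalize_initial_cluster and same_initial_cluster, and only compares the strings; it does not talk to etcd.

```shell
# Normalize an --initial-cluster / ETCD_INITIAL_CLUSTER value into one
# name=url entry per line, sorted, so the order of entries does not matter.
normalize_initial_cluster() {
    printf '%s\n' "$1" | tr ',' '\n' | sort
}

# Exit status 0 only when both values list exactly the same members.
same_initial_cluster() {
    [ "$(normalize_initial_cluster "$1")" = "$(normalize_initial_cluster "$2")" ]
}

# Usage, with the two values from the answer above:
#   same_initial_cluster "$VALUE_FROM_NODE_1" "$VALUE_FROM_NODE_3" || echo "configs differ"
```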
My --data-dir is /var/etcd/data; removing and recreating it worked for me. It seems that something left in this directory by a previous etcd cluster I had created was affecting the etcd settings.
I faced the same problem: our leader etcd server went down, and after replacing it with a new one we got this error:
rafthttp: request sent was ignored (cluster ID mismatch)
The new server was looking for the old cluster ID and, because of some misconfiguration, was generating a random local cluster of its own.
I followed these steps to fix the issue:
1) Log in to one of the working cluster members and remove the unreachable member from the cluster:
etcdctl cluster-health
etcdctl member remove <member-id>
2) Log in to the new server and stop the etcd process if it is running:
systemctl stop etcd2
3) Remove the data from the data directory, keeping a backup of it in another folder before deleting it:
rm -rf /var/etcd2/data
4) Now start etcd on the new server with the --initial-cluster-state existing parameter; don't use --initial-cluster-state new when you are adding a server to an existing cluster.
5) Now go back to one of the running etcd servers and add the new member to the cluster:
etcdctl member add node0 http://$IP:2380
I spent a lot of time debugging this issue, and now my cluster is running healthy with all members. Hope this information helps.
Add the new node to the existing etcd cluster:
etcdctl member add <new_node_name> --peer-urls="http://<new_node_ip>:2380"
Note: if TLS is enabled, replace http with https.
Run etcd on the new node. It is important to add "--initial-cluster-state existing", which tells the new node to join the existing cluster instead of creating a new one.
etcd --name <new_node_name> --initial-cluster-state existing ...
Check the result
etcdctl member list
I've set up a Mesos cluster using the CloudFormation templates from Mesosphere. Things worked fine after the cluster launched.
I recently noticed that none of the slave nodes are listed in the Mesos dashboard. The EC2 console shows that the slaves are running and pass health checks. I restarted the nodes in the cluster, but that didn't help.
I ssh'ed into one of the slaves and noticed that the mesos-slave service is not running. I executed sudo systemctl status dcos-mesos-slave.service, but the service could not be started.
I looked in /var/log/mesos/, ran tail -f on mesos-slave.xxx.invalid-user.log.ERROR.20151127-051324.31267, and saw the following:
F1127 05:13:24.242182 31270 slave.cpp:4079] CHECK_SOME(state::checkpoint(path, bootId.get())): Failed to create temporary file: No space left on device
But the output of df -h and free show there is plenty of disk space left.
Which leads me to wonder, why is it complaining about no disk space?
Ok I figured it out.
When Mesos runs for a long time or under frequent load, the filesystem holding /tmp can run out of room, since Mesos uses /tmp/mesos/ as its work_dir. The catch is that a filesystem can hold only a fixed number of file references (inodes), so it can report "No space left on device" even while df -h still shows free space. In my case, the slaves were accumulating a large number of file chunks from image pulls in /var/lib/docker/tmp.
To resolve this issue:
1) Remove files under /tmp
2) Set a different work_dir location
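Inode exhaustion does not show up in df -h, but it does in df -i. The following sketch filters that output for filesystems that are nearly out of inodes. The helper name full_inodes and the 90% default threshold are made up for illustration, and the column layout assumed is that of GNU df -Pi (IUse% in column 5, mount point in column 6).

```shell
# Read `df -Pi` output on stdin and print "<mountpoint> <IUse%>" for every
# filesystem whose inode usage is at or above the threshold $1 (default 90).
# Assumes the GNU coreutils column layout:
#   Filesystem Inodes IUsed IFree IUse% Mounted on
full_inodes() {
    awk -v limit="${1:-90}" '
        NR > 1 {
            pct = $5
            sub(/%/, "", pct)              # "100%" -> "100"
            if (pct + 0 >= limit + 0) print $6, pct "%"
        }
    '
}

# Usage on a slave node:
#   df -Pi | full_inodes 90
```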
It is good practice to run
docker rmi -f $(docker images | grep "<none>" | awk "{print \$3}")
This frees space by deleting unused Docker images (the ones listed as "<none>").
I have searched all over and couldn't find the cause of this error.
I have checked this Stack Overflow issue, but it is not the same problem as mine.
I have started a zookeeper server
Command to start server was
bin/zookeeper-server-start.sh config/zookeeper.properties
Then I SSH into VM by using Putty and started kafka server using
$ bin/kafka-server-start.sh config/server.properties
Then I created Kafka Topic and when I list the topic, it appears.
Then I opened another PuTTY session, started kafka-console-producer.sh, typed a message (even just Enter), and got this long repetitive exception.
The configuration files zookeeper.properties, server.properties, and kafka-producer.properties are as follows (respectively).
The version of Kafka I am running is 8.2.2.something, judging by what I saw in the kafka/libs folder.
P.S. I get no messages in the consumer.
Can anybody figure out the problem?
The tutorial I was following is this one: http://www.bogotobogo.com/Hadoop/BigData_hadoop_Zookeeper_Kafka_single_node_single_broker_cluster.php
On the Hortonworks sandbox, have a look at the server configuration:
$ less /etc/kafka/conf/server.properties
In my case it said
...
listeners=PLAINTEXT://sandbox.hortonworks.com:6667
...
This means you have to use the following command to successfully connect with the console-producer
$ cd /usr/hdp/current/kafka-broker
$ bin/kafka-console-producer.sh --topic test --broker-list sandbox.hortonworks.com:6667
It won't work, if you use --broker-list 127.0.0.1:6667 or --broker-list localhost:6667 . See also http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_installing_manually_book/content/configure_kafka.html
To consume the messages use
$ bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
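The address to pass to --broker-list can also be read straight from server.properties instead of being copied by hand. This is a small sketch under assumptions: listener_addresses is a hypothetical helper, and it expects a line of the exact form listeners=PROTOCOL://host:port with no spaces around the equals sign.

```shell
# Print host:port for each listener configured in a Kafka server.properties
# file read from stdin, e.g. "listeners=PLAINTEXT://sandbox.hortonworks.com:6667".
# Commented-out lines and other keys (such as advertised.listeners) are ignored.
listener_addresses() {
    sed -n 's/^listeners=//p' | tr ',' '\n' | sed 's|^[A-Za-z0-9_]*://||'
}

# Usage on the sandbox (path as in the answer above):
#   listener_addresses < /etc/kafka/conf/server.properties
```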
As you mentioned in your question, you are using HDP 2.3, so when you run the console producer you need to provide sandbox.hortonworks.com:6667 as the broker list.
Use the same address when running the console consumer.
Please let me know if you still face any issue.
Inside Kafka there is an ongoing conversation between the clients (producers and consumers) and the broker (server). During those conversations, clients often ask the server for the address of the broker that manages a particular partition. The answer is always a fully-qualified host name. Without going into specifics: if you ever refer to a broker by an address that is not that broker's fully-qualified host name, there are situations in which the Kafka implementation runs into trouble.
Another mistake that's easy to make, especially with the sandbox, is referring to a broker by an address that's not defined in DNS. That's why every node in the cluster has to be able to address every other node in the cluster by fully-qualified host name. It's also why, when accessing the sandbox from another virtual image running on the same machine, you have to add sandbox.hortonworks.com to that image's hosts file.