Can anyone explain this etcd behavior?

Let me ask a question about some strange etcd behavior.
Environment:
-bash-4.2$ etcd --version
etcd Version: 3.2.28
Git SHA: 2d861f3
Go Version: go1.10.3
Go OS/Arch: linux/amd64
-bash-4.2$ etcdctl cluster-health
member ef05587d2e4769f is healthy: got healthy result from http://10.7.211.15:2379
member 6066465b170c501d is healthy: got healthy result from http://10.7.211.13:2379
member 7132cc73aebdbcb8 is healthy: got healthy result from http://10.10.51.17:2379
member 7eb23e55f039af25 is healthy: got healthy result from http://10.7.211.14:2379
member c60f0881d3524793 is healthy: got healthy result from http://10.7.211.12:2379
cluster is healthy
What happened:
Member ef05587d2e4769f on host 10.7.211.15 had been the leader.
Host 10.7.211.15 had a defect that caused its Ethernet link to flap.
Member ef05587d2e4769f therefore also flipped between available and unavailable.
The cluster elected a new leader, but switched back to member ef05587d2e4769f every time it became available again.
Because member ef05587d2e4769f soon became unavailable again, the cluster elected yet another leader...
This loop repeated for hours.
The whole cluster was effectively blocked by the constant leader elections.
...
Feb 03 05:58:34 szhm58466 etcd[1027]: raft.node: 6066465b170c501d changed leader from 7132cc73aebdbcb8 to 7eb23e55f039af25 at term 3349
Feb 03 05:58:43 szhm58466 etcd[1027]: raft.node: 6066465b170c501d changed leader from 6066465b170c501d to ef05587d2e4769f at term 3352
Feb 03 06:10:34 szhm58466 etcd[1027]: raft.node: 6066465b170c501d changed leader from 6066465b170c501d to ef05587d2e4769f at term 3371
Feb 03 06:17:42 szhm58466 etcd[1027]: raft.node: 6066465b170c501d changed leader from 6066465b170c501d to ef05587d2e4769f at term 3373
Feb 03 06:23:47 szhm58466 etcd[1027]: raft.node: 6066465b170c501d changed leader from 6066465b170c501d to ef05587d2e4769f at term 3375
Feb 03 06:30:58 szhm58466 etcd[1027]: raft.node: 6066465b170c501d changed leader from 6066465b170c501d to ef05587d2e4769f at term 3379
Feb 03 06:31:01 szhm58466 etcd[1027]: raft.node: 6066465b170c501d changed leader from ef05587d2e4769f to c60f0881d3524793 at term 3379
Feb 03 06:36:46 szhm58466 etcd[1027]: raft.node: 6066465b170c501d changed leader from 6066465b170c501d to ef05587d2e4769f at term 3387
Feb 03 06:37:24 szhm58466 etcd[1027]: raft.node: 6066465b170c501d changed leader from 6066465b170c501d to ef05587d2e4769f at term 3389
Feb 03 06:37:26 szhm58466 etcd[1027]: raft.node: 6066465b170c501d changed leader from ef05587d2e4769f to c60f0881d3524793 at term 3389
Feb 03 07:11:04 szhm58466 etcd[1027]: raft.node: 6066465b170c501d changed leader from 6066465b170c501d to ef05587d2e4769f at term 3400
...
QUESTIONS:
Why did the cluster change back to the old leader every time the old leader came back?
Can this be avoided with some kind of parameter, i.e. once the leader has changed, don't change it back again?
Thanks in advance,
Markus

I read there (https://kubernetes.io/blog/2019/08/30/announcing-etcd-3-4/) that the etcd Raft voting process has problems with flaky members by design, and that this should be fixed by the "--pre-vote" feature in etcd v3.4. Is that correct?
Thanks, Markus

1. Why did the cluster change back to the old leader every time the old leader came back again?
Answer: According to the Raft algorithm, an election occurs whenever followers stop receiving heartbeats from the current leader. Which candidate wins depends largely on timing and network performance, so the old leader's host, presumably on the fastest network, kept winning the election whenever it became reachable again.
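For what it's worth, here is a minimal sketch of the v3.4 mitigation mentioned above: enable pre-vote so a rejoining member must first confirm it could actually win an election before it disturbs the current leader, and optionally widen the election timeout so short link flaps are less likely to trigger elections at all. The flag values, member name, and URLs below are illustrative placeholders, not tested settings:
# Assumes an upgrade to etcd >= 3.4, where the --pre-vote flag is available.
# --heartbeat-interval and --election-timeout are in milliseconds; etcd's defaults are 100/1000.
etcd --name infra2 \
  --pre-vote=true \
  --heartbeat-interval=100 \
  --election-timeout=2500 \
  --initial-advertise-peer-urls http://10.7.211.12:2380 \
  --listen-peer-urls http://10.7.211.12:2380 \
  --listen-client-urls http://10.7.211.12:2379 \
  --advertise-client-urls http://10.7.211.12:2379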

Related

Updated Elasticsearch on a DigitalOcean Droplet, Elasticsearch will no longer start

This is the error I am receiving when I try to start up Elasticsearch:
-- Unit elasticsearch.service has begun starting up.
Oct 08 23:54:05 ElasticSearch logstash[1064]: [2020-10-08T23:54:05,137][WARN ][logstash.outputs.elasticsearch][main] Attempted to resurrect connection to dead ES instance, but got an error. {:url=
Oct 08 23:54:05 ElasticSearch logstash[1064]: [2020-10-08T23:54:05,138][WARN ][logstash.outputs.elasticsearch][main] Attempted to resurrect connection to dead ES instance, but got an error. {:url=
Oct 08 23:54:05 ElasticSearch kernel: [UFW BLOCK] IN=eth0 OUT= MAC=76:67:e9:46:24:b8:fe:00:00:00:01:01:08:00 SRC=79.124.62.110 DST=206.189.196.214 LEN=40 TOS=0x00 PREC=0x00 TTL=244 ID=52316 PROTO=
Oct 08 23:54:05 ElasticSearch systemd-entrypoint[14701]: Exception in thread "main" java.lang.RuntimeException: starting java failed with [1]
Oct 08 23:54:05 ElasticSearch systemd-entrypoint[14701]: output:
Oct 08 23:54:05 ElasticSearch systemd-entrypoint[14701]: error:
Oct 08 23:54:05 ElasticSearch systemd-entrypoint[14701]: Unrecognized VM option 'UseConcMarkSweepGC'
Oct 08 23:54:05 ElasticSearch systemd-entrypoint[14701]: Error: Could not create the Java Virtual Machine.
Oct 08 23:54:05 ElasticSearch systemd-entrypoint[14701]: Error: A fatal exception has occurred. Program will exit.
Oct 08 23:54:05 ElasticSearch systemd-entrypoint[14701]: at org.elasticsearch.tools.launchers.JvmErgonomics.flagsFinal(JvmErgonomics.java:126)
Oct 08 23:54:05 ElasticSearch systemd-entrypoint[14701]: at org.elasticsearch.tools.launchers.JvmErgonomics.finalJvmOptions(JvmErgonomics.java:88)
Oct 08 23:54:05 ElasticSearch systemd-entrypoint[14701]: at org.elasticsearch.tools.launchers.JvmErgonomics.choose(JvmErgonomics.java:59)
Oct 08 23:54:05 ElasticSearch systemd-entrypoint[14701]: at org.elasticsearch.tools.launchers.JvmOptionsParser.jvmOptions(JvmOptionsParser.java:137)
Oct 08 23:54:05 ElasticSearch systemd-entrypoint[14701]: at org.elasticsearch.tools.launchers.JvmOptionsParser.main(JvmOptionsParser.java:95)
It looks a lot like this reported issue and this one.
In your jvm.options file, if you replace this
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
with this
8-13:-XX:+UseConcMarkSweepGC
8-13:-XX:CMSInitiatingOccupancyFraction=75
8-13:-XX:+UseCMSInitiatingOccupancyOnly
it should work again.
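For what it's worth, the 8-13: prefix is the jvm.options version-conditional syntax: the flag on that line is only applied when the running JVM's major version falls within that range, so a newer JDK that no longer ships CMS simply ignores it. A small illustrative sketch (the G1 line is an assumption modeled on the stock Elasticsearch 7.x jvm.options, not taken from your file):
# applied only on JDK 8 through 13, which still include the CMS collector
8-13:-XX:+UseConcMarkSweepGC
8-13:-XX:CMSInitiatingOccupancyFraction=75
8-13:-XX:+UseCMSInitiatingOccupancyOnly
# applied only on JDK 14 and later, where CMS was removed
14-:-XX:+UseG1GC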

Cassandra: Cannot achieve consistency level QUORUM on a specific keyspace

Actually, I'm using Elassandra, which is a combination of Cassandra and Elasticsearch,
but the issue seems to come from Cassandra (judging by the logs).
I have two nodes joined as a single datacenter DC1, and I'm trying to install Kibana on one of the nodes. My Kibana server always says "Kibana server is not ready yet", and I've found that the error is something around the Cassandra consistency level.
My Cassandra system_auth keyspace is set to
system_auth
WITH REPLICATION = {'class' : 'SimpleStrategy',
'DC1' : 2 };
and here is the log from manually triggering the Kibana service with /usr/share/kibana/bin/kibana -c /etc/kibana/kibana.yml:
FATAL [exception] org.apache.cassandra.exceptions.UnavailableException: Cannot achieve
consistency level QUORUM :: {"path":"/.kibana_1","query":{"include_type_name":true},"body":"
{\"mappings\":{\"doc\":{\"dynamic\":\"strict\",\"properties\":{\"config\":
{\"dynamic\":\"true\",\"properties\":{\"buildNum\":
{\"type\":\"keyword\"}}},\"migrationVersion\":
{\"dynamic\":\"true\",\"type\":\"object\"},\"type\":{\"type\":\"keyword\"},\"namespace\":
{\"type\":\"keyword\"},\"updated_at\":{\"type\":\"date\"},\"index-pattern\":{\"properties\":
{\"fieldFormatMap\":{\"type\":\"text\"},\"fields\":{\"type\":\"text\"},\"intervalName\":
{\"type\":\"keyword\"},\"notExpandable\":{\"type\":\"boolean\"},\"sourceFilters\":
{\"type\":\"text\"},\"timeFieldName\":{\"type\":\"keyword\"},\"title\":
{\"type\":\"text\"},\"type\":{\"type\":\"keyword\"},\"typeMeta\":
{\"type\":\"keyword\"}}},\"visualization\":{\"properties\":{\"description\":
{\"type\":\"text\"},\"kibanaSavedObjectMeta\":{\"properties\":{\"searchSourceJSON\":
{\"type\":\"text\"}}},\"savedSearchId\":{\"type\":\"keyword\"},\"title\":
{\"type\":\"text\"},\"uiStateJSON\":{\"type\":\"text\"},\"version\":
{\"type\":\"integer\"},\"visState\":{\"type\":\"text\"}}},\"search\":{\"properties\":
{\"columns\":{\"type\":\"keyword\"},\"description\":{\"type\":\"text\"},\"hits\":
{\"type\":\"integer\"},\"kibanaSavedObjectMeta\":{\"properties\":{\"searchSourceJSON\":
{\"type\":\"text\"}}},\"sort\":{\"type\":\"keyword\"},\"title\":{\"type\":\"text\"},\"version\":
{\"type\":\"integer\"}}},\"dashboard\":{\"properties\":{\"description\":
{\"type\":\"text\"},\"hits\":{\"type\":\"integer\"},\"kibanaSavedObjectMeta\":{\"properties\":
{\"searchSourceJSON\":{\"type\":\"text\"}}},\"optionsJSON\":{\"type\":\"text\"},\"panelsJSON\":
{\"type\":\"text\"},\"refreshInterval\":{\"properties\":{\"display\":
{\"type\":\"keyword\"},\"pause\":{\"type\":\"boolean\"},\"section\":
{\"type\":\"integer\"},\"value\":{\"type\":\"integer\"}}},\"timeFrom\":
{\"type\":\"keyword\"},\"timeRestore\":{\"type\":\"boolean\"},\"timeTo\":
{\"type\":\"keyword\"},\"title\":{\"type\":\"text\"},\"uiStateJSON\":
{\"type\":\"text\"},\"version\":{\"type\":\"integer\"}}},\"url\":{\"properties\":
{\"accessCount\":{\"type\":\"long\"},\"accessDate\":{\"type\":\"date\"},\"createDate\":
{\"type\":\"date\"},\"url\":{\"type\":\"text\",\"fields\":{\"keyword\":
{\"type\":\"keyword\",\"ignore_above\":2048}}}}},\"server\":{\"properties\":{\"uuid\":
{\"type\":\"keyword\"}}},\"kql-telemetry\":{\"properties\":{\"optInCount\":
{\"type\":\"long\"},\"optOutCount\":{\"type\":\"long\"}}},\"timelion-sheet\":{\"properties\":
{\"description\":{\"type\":\"text\"},\"hits\":{\"type\":\"integer\"},\"kibanaSavedObjectMeta\":
{\"properties\":{\"searchSourceJSON\":{\"type\":\"text\"}}},\"timelion_chart_height\":
{\"type\":\"integer\"},\"timelion_columns\":{\"type\":\"integer\"},\"timelion_interval\":
{\"type\":\"keyword\"},\"timelion_other_interval\":{\"type\":\"keyword\"},\"timelion_rows\":
{\"type\":\"integer\"},\"timelion_sheet\":{\"type\":\"text\"},\"title\":
{\"type\":\"text\"},\"version\":{\"type\":\"integer\"}}}}}},\"settings\":
{\"number_of_shards\":1,\"auto_expand_replicas\":\"0-1\"}}","statusCode":500,"response":"
{\"error\":{\"root_cause\":
[{\"type\":\"exception\",\"reason\":\"org.apache.cassandra.exceptions.UnavailableException:
Cannot achieve consistency level
QUORUM\"}],\"type\":\"exception\",\"reason\":\"org.apache.cassandra.exceptions.UnavailableExcept
ion: Cannot achieve consistency level QUORUM\",\"caused_by\":
{\"type\":\"unavailable_exception\",\"reason\":\"Cannot achieve consistency level
QUORUM\"}},\"status\":500}"}
There are no indices named 'kibana_1', nor any indices containing the word kibana, but there are keyspaces named "_kibana_1" and "_kibana".
This causes the Kibana service to fail to start:
systemctl status kibana
● kibana.service - Kibana
Loaded: loaded (/etc/systemd/system/kibana.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Thu 2020-09-10 16:26:14 CEST; 2s ago
Process: 16942 ExecStart=/usr/share/kibana/bin/kibana -c /etc/kibana/kibana.yml (code=exited, status=1
Main PID: 16942 (code=exited, status=1/FAILURE)
Sep 10 16:26:14 ns3053180 systemd[1]: kibana.service: Service hold-off time over, scheduling restart.
Sep 10 16:26:14 ns3053180 systemd[1]: kibana.service: Scheduled restart job, restart counter is at 3.
Sep 10 16:26:14 ns3053180 systemd[1]: Stopped Kibana.
Sep 10 16:26:14 ns3053180 systemd[1]: kibana.service: Start request repeated too quickly.
Sep 10 16:26:14 ns3053180 systemd[1]: kibana.service: Failed with result 'exit-code'.
Sep 10 16:26:14 ns3053180 systemd[1]: Failed to start Kibana.
I think this is your problem:
system_auth WITH REPLICATION= {'class' : 'SimpleStrategy', 'DC1' :2 };
The SimpleStrategy class does not accept datacenter/RF pairs as parameters. It has one parameter, which is simply replication_factor:
ALTER KEYSPACE system_auth WITH REPLICATION= {'class' : 'SimpleStrategy', 'replication_factor' :2 };
By contrast, the NetworkTopologyStrategy takes the parameters you have provided above:
ALTER KEYSPACE system_auth WITH REPLICATION= {'class' : 'NetworkTopologyStrategy', 'DC1' :2 };
IMO, there really isn't much of a need for SimpleStrategy. I never use it.
Note: If you're going to query at LOCAL_QUORUM, you should have at least 3 replicas. Or at the very least, an odd number capable of computing a majority. Because quorum of 2 is, well, 2. So querying at quorum with only 2 replicas doesn't really help you.
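As an illustrative sketch only: if a third node were added to DC1, bumping system_auth to three replicas would let QUORUM (2 of 3) succeed even with one node down:
ALTER KEYSPACE system_auth WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'DC1' : 3};
After changing the replication of an existing keyspace, the new replicas typically only receive the existing data once a repair is run (for example nodetool repair system_auth on each node).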

Kapacitor not running, service status indicates failure

Help, my Kapacitor is not running. I'm actually running InfluxDB on the same server as Kapacitor and Telegraf, but my Kapacitor doesn't work:
kapacitor.service - Time series data processing engine.
Loaded: loaded (/lib/systemd/system/kapacitor.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Thu 2019-01-03 17:56:38 UTC; 3s ago
Docs: https://github.com/influxdb/kapacitor
Process: 2502 ExecStart=/usr/bin/kapacitord -config /etc/kapacitor/kapacitor.conf $KAPACITOR_OPTS (code=exited, status=1/FAILURE)
Main PID: 2502 (code=exited, status=1/FAILURE)
Jan 03 17:56:38 ip-172-31-43-67 systemd[1]: kapacitor.service: Service hold-off time over, scheduling restart.
Jan 03 17:56:38 ip-172-31-43-67 systemd[1]: kapacitor.service: Scheduled restart job, restart counter is at 5.
Jan 03 17:56:38 ip-172-31-43-67 systemd[1]: Stopped Time series data processing engine..
Jan 03 17:56:38 ip-172-31-43-67 systemd[1]: kapacitor.service: Start request repeated too quickly.
Jan 03 17:56:38 ip-172-31-43-67 systemd[1]: kapacitor.service: Failed with result 'exit-code'.
Jan 03 17:56:38 ip-172-31-43-67 systemd[1]: Failed to start Time series data processing engine..
I found the solution myself:
[[influxdb]]
enabled = true
name = "localhost"
default = true
urls = ["http://localhost:8086"]
username = "user"
password = "password"
Take into account that you will need to have this user created in InfluxDB beforehand.
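For completeness, a minimal sketch of creating that user with the InfluxDB 1.x influx shell; the "user"/"password" values are just the placeholders from the config above, not real credentials:
$ influx
> CREATE USER "user" WITH PASSWORD 'password' WITH ALL PRIVILEGES
The WITH ALL PRIVILEGES clause makes it an admin user; adjust the privileges to what your setup actually needs.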

How to set up automatic node discovery in Elasticsearch 6.1

I have created a cluster of 5 nodes in ES 6.1. I am able to form the cluster when I add a line with the IP addresses of all the other nodes to the configuration file elasticsearch.yml as discovery.zen.ping.unicast.hosts. It looks like this:
discovery.zen.ping.unicast.hosts: ["10.206.81.241","10.206.81.238","10.206.81.237","10.206.81.239"]
When I have this line in my config file, everything works well.
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.206.81.241 9 54 0 0.03 0.05 0.05 mi * master4
10.206.81.239 10 54 0 0.00 0.01 0.05 mi - master1
10.206.81.238 14 54 0 0.00 0.01 0.05 mi - master3
10.206.81.240 15 54 0 0.00 0.01 0.05 mi - master5
10.206.81.237 10 54 0 0.00 0.01 0.05 mi - master2
When I added discovery.zen.ping.multicast.enabled: true, Elasticsearch would not start.
I would like to have more nodes, and having to edit each configuration file separately and add the new address every time is not a workable approach. So is there any way to set up ES 6 to find new nodes automatically?
EDIT:
journalctl -f output:
led 08 10:43:04 elk-prod3.user.dc.company.local polkitd[548]: Registered Authentication Agent for unix-process:23395:23676999 (system bus name :1.162 [/usr/bin/pkttyagent --notify-fd 5 --fallback], object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_US.UTF-8)
led 08 10:43:04 elk-prod3.user.dc.company.local systemd[1]: Stopping Elasticsearch...
led 08 10:43:04 elk-prod3.user.dc.company.local systemd[1]: Started Elasticsearch.
led 08 10:43:04 elk-prod3.user.dc.company.local systemd[1]: Starting Elasticsearch...
led 08 10:43:04 elk-prod3.user.dc.company.local polkitd[548]: Unregistered Authentication Agent for unix-process:23395:23676999 (system bus name :1.162, object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_US.UTF-8) (disconnected from bus)
led 08 10:43:07 elk-prod3.user.dc.company.local systemd[1]: elasticsearch.service: main process exited, code=exited, status=1/FAILURE
led 08 10:43:07 elk-prod3.user.dc.company.local systemd[1]: Unit elasticsearch.service entered failed state.
led 08 10:43:07 elk-prod3.user.dc.company.local systemd[1]: elasticsearch.service failed.
Basically you should have "stable" nodes, by which I mean IPs that are always part of the cluster:
discovery.zen.ping.unicast.hosts: [MASTER_NODE_IP_OR_DNS, MASTER2_NODE_IP_OR_DNS, MASTER3_NODE_IP_OR_DNS]
Then, if you use autoscaling or add nodes, they must "talk" to those IPs to let the cluster know they are joining (see the sketch after this answer).
You haven't mentioned anything about your network setup, so I can't say for sure what is wrong. But as I recall, unicast hosts is the recommended approach.
PS. If you are using Azure, there is a feature called VM scale sets; I modified the template to my needs. The idea is that by default I always run 3 nodes, and if my cluster is under load, the scale set dynamically adds more nodes.
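As a hedged sketch of that idea, a node added later (for example by a scale set) would only need the stable master IPs in its elasticsearch.yml to join the cluster; the cluster name and node roles below are illustrative placeholders:
cluster.name: my-cluster
node.master: false
node.data: true
# list only the stable master IPs; the new node discovers the rest of the cluster through them
discovery.zen.ping.unicast.hosts: ["10.206.81.241", "10.206.81.238", "10.206.81.237"]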
discovery.zen.ping.multicast has been removed from Elasticsearch; see: https://www.elastic.co/guide/en/elasticsearch/plugins/6.1/discovery-multicast.html

Aerospike on OpenVZ (2 cores, 4 GB RAM) doesn't start and doesn't give errors

After an installation without any trouble, I started Aerospike on an OpenVZ VPS with 2 cores and 4 GB of RAM.
This is the result:
root@outland:~# /etc/init.d/aerospike start
* Start aerospike: asd [OK]
Then check whether asd is running:
root@outland:~# /etc/init.d/aerospike status
* Halt aerospike: asd [fail]
What is going wrong?
Adding logs:
Mar 03 2015 15:17:57 GMT: INFO (config): (cfg.c::3033) system file descriptor limit: 100000, proto-fd-max: 15000
Mar 03 2015 15:17:57 GMT: WARNING (cf:misc): (id.c::249) Tried eth,bond,wlan and list of all available interfaces on device.Failed to retrieve physical address with errno 19 No such device
Mar 03 2015 15:17:57 GMT: CRITICAL (config): (cfg.c:3363) could not get unique id and/or ip address
Mar 03 2015 15:17:57 GMT: WARNING (as): (signal.c::120) SIGINT received, shutting down
Mar 03 2015 15:17:57 GMT: WARNING (as): (signal.c::123) startup was not complete, exiting immediately
This is your config problem
Mar 03 2015 15:17:57 GMT: WARNING (cf:misc): (id.c::249) Tried eth,bond,wlan and list of all available interfaces on device.Failed to retrieve physical address with errno 19 No such device
Mar 03 2015 15:17:57 GMT: CRITICAL (config): (cfg.c:3363) could not get unique id and/or ip address
Basically the VPS has a non-standard interface name.
The solution is to add your interface name as network-interface-name to the config.
http://www.aerospike.com/docs/operations/troubleshoot/startup/#problem-with-network-interface
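For instance, on an OpenVZ guest the interface is often venet0 rather than eth0. Something like the following (a sketch, assuming the ip utility is available on the VPS) lists the interface names that actually exist, and that name is what goes into the parameter described on the linked page:
# list the interface names present on this VPS (OpenVZ guests commonly show venet0 instead of eth0)
ip -o link show | awk -F': ' '{print $2}'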
Which OS are you using, btw?
