etcd v2: etcd-server is healthy but etcd-events is not joining ("cluster ID mismatch" and "unmatched member while checking PeerURLs" errors) - etcd

I have a legacy Kubernetes cluster running etcd v2 with 3 masters (etcd-a, etcd-b, etcd-c). We attempted an upgrade to etcd v3, but this broke the first master (etcd-a), which was no longer able to join the cluster. After some time I was able to restore it:
removed etcd-a from the etcd cluster with etcdctl member rm
added a new etcd-a1 with a clean state and added it to the cluster with etcdctl member add
started kubelet with ETCD_INITIAL_CLUSTER_STATE set to existing, then started protokube. At this point the master was able to join the cluster.
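The steps above would look roughly like this as etcd v2 commands (the member ID is a placeholder, and the environment-variable handoff to protokube/kubelet is specific to this kops setup, so treat this as a sketch rather than the exact commands used):

```shell
# On a healthy master: drop the broken member (etcd v2 etcdctl syntax)
etcdctl member remove <old-etcd-a-member-id>

# Register the replacement member before starting it
etcdctl member add etcd-a1 http://etcd-a1.internal.mydomain.com:2380

# On the new node: join the existing cluster with a clean data directory
export ETCD_NAME=etcd-a1
export ETCD_INITIAL_CLUSTER_STATE=existing
export ETCD_INITIAL_CLUSTER="etcd-a1=http://etcd-a1.internal.mydomain.com:2380,etcd-b=http://etcd-b.internal.mydomain.com:2380,etcd-c=http://etcd-c.internal.mydomain.com:2380"
```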
At the beginning I thought the cluster was healthy:
/ # etcdctl member list
a4***b2: name=etcd-c peerURLs=http://etcd-c.internal.mydomain.com:2380 clientURLs=http://etcd-c.internal.mydomain.com:4001
cf***97: name=etcd-a1 peerURLs=http://etcd-a1.internal.mydomain.com:2380 clientURLs=http://etcd-a1.internal.mydomain.com:4001
d3***59: name=etcd-b peerURLs=http://etcd-b.internal.mydomain.com:2380 clientURLs=http://etcd-b.internal.mydomain.com:4001
/ # etcdctl cluster-health
member a4***b2 is healthy: got healthy result from http://etcd-c.internal.mydomain.com:4001
member cf***97 is healthy: got healthy result from http://etcd-a1.internal.mydomain.com:4001
member d3***59 is healthy: got healthy result from http://etcd-b.internal.mydomain.com:4001
cluster is healthy
Yet the status of etcd-events is not great: the etcd-events member on a1 is not running.
etcd-server-events-ip-a1 0/1 CrashLoopBackOff 430
etcd-server-events-ip-b 1/1 Running 3
etcd-server-events-ip-c 1/1 Running 0
Logs from etcd-events-a1:
flags: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://etcd-events-a1.internal.mydomain.com:4002
flags: recognized and used environment variable ETCD_DATA_DIR=/var/etcd/data-events
flags: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=http://etcd-events-a1.internal.mydomain.com:2381
flags: recognized and used environment variable ETCD_INITIAL_CLUSTER=etcd-events-a1=http://etcd-events-a1.internal.mydomain.com:2381,etcd-events-b=http://etcd-events-b.internal.mydomain.com:2381,etcd-events-c=http://etcd-events-c.internal.mydomain.com:2381
flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=existing
flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster-token-etcd-events
flags: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:4002
flags: recognized and used environment variable ETCD_LISTEN_PEER_URLS=http://0.0.0.0:2381
flags: recognized and used environment variable ETCD_NAME=etcd-events-a1
etcdmain: etcd Version: 2.2.1
etcdmain: Git SHA: 75f8282
etcdmain: Go Version: go1.5.1
etcdmain: Go OS/Arch: linux/amd64
etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
etcdmain: the server is already initialized as member before, starting as etcd member...
etcdmain: listening for peers on http://0.0.0.0:2381
etcdmain: listening for client requests on http://0.0.0.0:4002
netutil: resolving etcd-events-b.internal.mydomain.com:2381 to 10.15.***:2381
netutil: resolving etcd-events-a1.internal.mydomain.com:2381 to 10.15.***:2381
etcdmain: stopping listening for client requests on http://0.0.0.0:4002
etcdmain: stopping listening for peers on http://0.0.0.0:2381
etcdmain: error validating peerURLs {ClusterID:5a***b3 Members:[&{ID:a7***32 RaftAttributes:{PeerURLs:[http://etcd-events-b.internal.mydomain.com:2381]} Attributes:{Name:etcd-events-b ClientURLs:[http://etcd-events-b.internal.mydomain.com:4002]}} &{ID:cc***b3 RaftAttributes:{PeerURLs:[https://etcd-events-a.internal.mydomain.com:2381]} Attributes:{Name:etcd-events-a ClientURLs:[https://etcd-events-a.internal.mydomain.com:4002]}} &{ID:7f***2ca RaftAttributes:{PeerURLs:[http://etcd-events-c.internal.mydomain.com:2381]} Attributes:{Name:etcd-events-c ClientURLs:[http://etcd-events-c.internal.mydomain.com:4002]}}] RemovedMemberIDs:[]}: unmatched member while checking PeerURLs
# restart
...
etcdserver: restarting member eb***3a in cluster 96***07 at commit index 3
raft: eb***a3a became follower at term 12407
raft: newRaft eb***3a [peers: [], term: 12407, commit: 3, applied: 0, lastindex: 3, lastterm: 1]
etcdserver: starting server... [version: 2.2.1, cluster version: to_be_decided]
etcdserver: added local member eb***3a [http://etcd-events-a1.internal.mydomain.com:2381] to cluster 96***07
etcdserver: added member 7f***ca [http://etcd-events-c.internal.mydomain.com:2381] to cluster 96***07
rafthttp: request sent was ignored (cluster ID mismatch: remote[7f***ca]=5a***b3, local=96***07)
rafthttp: request sent was ignored (cluster ID mismatch: remote[7f***ca]=5a***3, local=96***07)
rafthttp: failed to dial 7f***ca on stream Message (cluster ID mismatch)
rafthttp: failed to dial 7f***ca on stream MsgApp v2 (cluster ID mismatch)
etcdserver: added member a7***32 [http://etcd-events-b.internal.mydomain.com:2381] to cluster 96***07
rafthttp: request sent was ignored (cluster ID mismatch: remote[a7***32]=5a***b3, local=96***07)
rafthttp: failed to dial a7***32 on stream MsgApp v2 (cluster ID mismatch)
...
rafthttp: request sent was ignored (cluster ID mismatch: remote[a7***32]=5a***b3, local=96***07)
osutil: received terminated signal, shutting down...
etcdserver: aborting publish because server is stopped
Logs from etcd-events-b:
rafthttp: streaming request ignored (cluster ID mismatch got 96***07 want 5a***b3)
rafthttp: the connection to peer cc***b3 is unhealthy
Logs from etcd-events-c:
etcdserver: failed to reach the peerURL(https://etcd-events-a.internal.mydomain.com:2381) of member cc***b3 (Get https://etcd-events-a.internal.mydomain.com:2381/version: dial tcp 10.15.131.7:2381: i/o timeout)
etcdserver: cannot get the version of member cc***b3 (Get https://etcd-events-a.internal.mydomain.com:2381/version: dial tcp 10.15.131.7:2381: i/o timeout)
From the log I saw two problems:
etcd-events on a1 seems to ignore the existing cluster (hence the cluster IDs don't match).
the other nodes (b and c) still somehow remember the removed old node a.
I'm short of ideas on how to fix this. Any suggestion?
Thanks!

If you tried to upgrade etcd2 and did not restart all the masters at the same time, the upgrade will fail.
Make sure you read through https://kops.sigs.k8s.io/etcd3-migration/
I also strongly recommend using the latest possible version of kOps as there are quite a few migration bugs fixed along the way.
There may be multiple reasons why the cluster ID changed, but if I remember correctly, replacing members like that was never really supported, and with etcd2 your options are limited. Trying to get to etcd-manager and etcd v3 may be the best way to get your cluster into a working state again.
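If staying on etcd2 is unavoidable in the meantime, the logs suggest the same replacement dance is still pending on the events cluster: b and c still list a member etcd-events-a, so a1 keeps bootstrapping a brand-new cluster instead of joining the old one. A hedged sketch of the usual fix (the member ID is a placeholder, and --endpoints/paths are taken from the config shown above):

```shell
# On a healthy events member (b or c): remove the stale etcd-events-a entry
etcdctl --endpoints http://etcd-events-b.internal.mydomain.com:4002 member remove <stale-etcd-events-a-id>

# Register the new events member before it starts
etcdctl --endpoints http://etcd-events-b.internal.mydomain.com:4002 \
  member add etcd-events-a1 http://etcd-events-a1.internal.mydomain.com:2381

# On a1: wipe the half-bootstrapped data dir so it stops announcing its own cluster ID
rm -rf /var/etcd/data-events/*

# Then restart etcd-events with ETCD_INITIAL_CLUSTER_STATE=existing (as already configured)
```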

Related

Jaeger operator fails to parse Jaeger instance version on Kubernetes

The Jaeger operator shows this log:
time="2022-01-07T11:27:57Z" level=info msg=Versions arch=amd64 identity=jaeger-operator.jaeger-operator jaeger=1.21.0 jaeger-operator=v1.21.3 operator-sdk=v0.18.2 os=linux version=go1.14.15
time="2022-01-07T11:28:20Z" level=warning msg="Failed to parse current Jaeger instance version. Unable to perform upgrade" current= error="Invalid Semantic Version" instance=tracing namespace=istio-system
The operated tracing resource looks like this afterwards:
kubectl get jaeger
NAME STATUS VERSION STRATEGY STORAGE AGE
tracing Running allinone elasticsearch 37d
We use GitOps to distribute the applications (including jaeger-operator and the jaeger tracing resource). The only difference we are aware of is the Kubernetes version of the clusters. It is only failing on a particular cluster with the following version:
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.12-gke.1500", GitCommit:"d32c0db9a3ccd0ac73b0b3abd0532505217b376e", GitTreeState:"clean", BuildDate:"2021-11-17T09:30:02Z", GoVersion:"go1.15.15b5", Compiler:"gc", Platform:"linux/amd64"}
Other than the log error and the resulting missing information from the get jaeger command, the jaeger-operator modifies two things in the initial manifest:
It removes the line: .spec.storage.esRollover.enabled: true
It lowercases the .spec.strategy: AllInOne
The functions used for parsing the version: https://github.com/jaegertracing/jaeger-operator/blob/v1.21.3/pkg/upgrade/main.go#L28
The function used to check the current version and compare it, to verify whether it needs to update the resource: https://github.com/jaegertracing/jaeger-operator/blob/v1.21.3/pkg/upgrade/upgrade.go#L134
They both look OK to me. I can't tell where the problem is or how to work around it.
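The log's empty `current=` field and the blank VERSION column both point at the recorded instance version being empty, which would indeed fail semver parsing. One way to confirm what the operator is reading (this jsonpath is a guess at where the operator stores the version, based on the `get jaeger` columns above):

```shell
# Print the version the operator recorded on the Jaeger instance;
# an empty line here would match the "current=" in the warning log
kubectl -n istio-system get jaeger tracing -o jsonpath='{.status.version}{"\n"}'
```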

How to resolve (error) ERR unknown command 'service' error in Redis in windows machine?

I installed Redis on my Windows machine. After it installed successfully, I opened redis-cli and ran service redis_6379 status, which shows the following error:
(error) ERR unknown command 'service'
How do I resolve this error and start the Redis server?
These are the info about my server
Server
redis_version:3.0.504
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:a4f7a6e86f2d60b3
redis_mode:standalone
os:Windows
arch_bits:64
multiplexing_api:WinSock_IOCP
process_id:22004
run_id:7db1c339b5291dc237372c740f8630fd0c665442
tcp_port:6379
uptime_in_seconds:791
uptime_in_days:0
hz:10
lru_clock:2329164
config_file:C:\Program Files\Redis\redis.windows-service.conf
Clients
connected_clients:1
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
Memory
used_memory:693104
used_memory_human:676.86K
used_memory_rss:655352
used_memory_peak:693104
used_memory_peak_human:676.86K
used_memory_lua:36864
mem_fragmentation_ratio:0.95
mem_allocator:jemalloc-3.6.0
Persistence
loading:0
rdb_changes_since_last_save:0
rdb_bgsave_in_progress:0
rdb_last_save_time:1629718325
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
Stats
total_connections_received:2
total_commands_processed:2
instantaneous_ops_per_sec:0
total_net_input_bytes:290
total_net_output_bytes:2142
instantaneous_input_kbps:0.00
instantaneous_output_kbps:0.00
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
evicted_keys:0
keyspace_hits:0
keyspace_misses:0
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
migrate_cached_sockets:0
Replication
role:master
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0
CPU
used_cpu_sys:0.28
used_cpu_user:0.36
used_cpu_sys_children:0.00
used_cpu_user_children:0.00
Cluster
cluster_enabled:0
Keyspace
The error message is quite clear: there is no Redis command called service. service is an operating-system command, not something you run inside redis-cli.
If you want to check the status of the Redis server, you can run the INFO command instead. Check https://redis.io/commands for a list of commands that Redis supports.
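Since the config_file shown in the INFO output is redis.windows-service.conf, Redis appears to be installed as a Windows service, so it is managed from a Windows command prompt rather than from redis-cli. A sketch (the service name "Redis" is the usual default for the Windows port, but may differ on your install):

```shell
# From inside redis-cli, check the server with a Redis command:
#   127.0.0.1:6379> INFO server

# From a Windows command prompt (not redis-cli), manage the service:
net start Redis
net stop Redis

# The MSOpenTech Windows port also ships service helper flags:
redis-server --service-start
redis-server --service-stop
```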

issue with adding a new member in etcd cluster

I have a 3-node etcd cluster running on Docker.
Node1:
etcd-advertise-client-urls: "http://sensu-backend1:2379"
etcd-initial-advertise-peer-urls: "http://sensu-backend3:2380"
etcd-initial-cluster: "sensu-backend1=http://sensu-backend1:2380,sensu-backend2=http://sensu-backend2:2380,sensu-backend3=http://sensu-backend3:2380"
etcd-initial-cluster-state: "new" # new or existing
etcd-listen-client-urls: "http://0.0.0.0:2379"
etcd-listen-peer-urls: "http://0.0.0.0:2380"
etcd-name: "sensu-backend1"
Node2:
etcd-advertise-client-urls: "http://sensu-backend2:2379"
etcd-initial-advertise-peer-urls: "http://sensu-backend3:2380"
etcd-initial-cluster: "sensu-backend1=http://sensu-backend1:2380,sensu-backend2=http://sensu-backend2:2380,sensu-backend3=http://sensu-backend3:2380"
etcd-initial-cluster-state: "new" # new or existing
etcd-listen-client-urls: "http://0.0.0.0:2379"
etcd-listen-peer-urls: "http://0.0.0.0:2380"
etcd-name: "sensu-backend2"
Node3:
etcd-advertise-client-urls: "http://sensu-backend3:2379"
etcd-initial-advertise-peer-urls: "http://sensu-backend3:2380"
etcd-initial-cluster: "sensu-backend1=http://sensu-backend1:2380,sensu-backend2=http://sensu-backend2:2380,sensu-backend3=http://sensu-backend3:2380"
etcd-initial-cluster-state: "new" # new or existing
etcd-listen-client-urls: "http://0.0.0.0:2379"
etcd-listen-peer-urls: "http://0.0.0.0:2380"
etcd-name: "sensu-backend3"
I am running each node as a docker service without persisting the etcd data directory.
When I start all the nodes together etcd forms the cluster.
If I delete one node and try to add it back with etcd-initial-cluster-state: "existing", I get the following error:
{"component":"etcd","level":"fatal","msg":"tocommit(6264) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?","pkg":"raft","time":"2020-12-09T11:32:55Z"}
After stopping etcd, I deleted the node from the cluster using etcdctl member remove . When I restart the container with an empty etcd data directory, I get a cluster ID mismatch error:
{"component":"backend","error":"error starting etcd: error validating peerURLs {ClusterID:4bccd6f485bb66f5 Members:[\u0026{ID:2ea5b7e4c09185e2 RaftAttributes:{PeerURLs:[http://sensu-backend1:2380]} Attributes:{Name:sensu-backend1 ClientURLs:[http://sensu-backend1:2379]}} \u0026{ID:9e83e7f64749072d RaftAttributes:{PeerURLs:[http://sensu-backend2:2380]} Attributes:{Name:sensu-backend2 ClientURLs:[http://sensu-backend2:2379]}}] RemovedMemberIDs:[]}: member count is unequal"}
Please help me fix this issue.
If you delete a node that was in a cluster, you should also manually remove it from the etcd cluster, i.e. with 'etcdctl member remove '.
The 'member count is unequal' error occurs because 'etcd-initial-cluster' still lists all 3 nodes; you need to remove the deleted node's entry from this field in all the containers as well.
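Putting that together, a rejoin would look roughly like this (run against a surviving member; the member ID is a placeholder, and this assumes the sensu-backend passes the etcd-* settings straight through to its embedded etcd):

```shell
# 1. From a surviving node, remove the dead member from the cluster state
etcdctl member remove <old-member-id>

# 2. Register the replacement member before starting it
etcdctl member add sensu-backend3 http://sensu-backend3:2380

# 3. On the replacement node: start with an empty data dir, joining rather than bootstrapping:
#      etcd-initial-cluster-state: "existing"
#      etcd-initial-cluster: only the current members plus this node
```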

RabbitMQ (OSX) : ERROR: epmd error for host x1-6-20-0c-c8-19-6b-bd: timeout (timed out)

I'm working on OSX 10.10.5 and installed RabbitMQ using the tarball.
Running it via the script:
bash sbin/rabbitmq-server
The first time it ran fine, but after a restart it gives this error:
ERROR: epmd error for host x1-6-20-0c-c8-19-6b-bd: timeout (timed out)
sbin/rabbitmqctl status returns this:
Status of node 'rabbit@x1-6-20-0c-c8-19-6b-bd' ...
Error: unable to connect to node 'rabbit@x1-6-20-0c-c8-19-6b-bd': nodedown
DIAGNOSTICS
===========
attempted to contact: ['rabbit@x1-6-20-0c-c8-19-6b-bd']
rabbit@x1-6-20-0c-c8-19-6b-bd:
* unable to connect to epmd (port 4369) on x1-6-20-0c-c8-19-6b-bd: timeout (timed out)
current node details:
- node name: 'rabbitmq-cli-25@x1-6-20-0c-c8-19-6b-bd'
- home dir: /Users/mohit
- cookie hash: FOxL2w3eJGpNkenIS5ebSw==
Please help me resolve this, thanks!
Update: interestingly, it works when I switch back to my personal network from the office network. Possibly something to do with a port or network firewall?
Add a configuration file:
/etc/rabbitmq/rabbitmq-env.conf
Add a line as below:
NODENAME=rabbit@localhost
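With a tarball install on macOS there may be no /etc/rabbitmq directory; setting the documented environment variable before starting the server is an equivalent way to pin the node name, which sidesteps resolving the machine's (network-dependent) hostname:

```shell
# Pin the node name so epmd never has to resolve the machine's hostname
export RABBITMQ_NODENAME=rabbit@localhost
sbin/rabbitmq-server
```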

RabbitMQ as Windows service: badarith error on rpc.erl

I am experiencing some problems with RabbitMQ started as a service on Windows.
Operative System: Windows 8 (Microsoft Windows NT version 6.2 Server)
(build 9200)
Erlang: R16B03 (erts-5.10.4)
RabbitMQ: 3.2.2
Goal: create a RabbitMQ cluster with three servers: Srv1, Srv2, Srv3.
Note: I have carefully followed the official documentation
All the following operations are executed as user "Administrator".
FIRST SCENARIO: start RabbitMQ from command line as a background process
I used the command "rabbitmq-server -detached" on Srv1.
Result: a file ".erlang.cookie" is created under C:\Users\Administrator
The execution of the command "rabbitmqctl status" is successful and gives me the current state of the node.
I can then copy the file .erlang.cookie in the same folder on Srv2 and Srv3 and successfully create a cluster.
SECOND SCENARIO: start RabbitMQ as a service (this is a requirement I have)
Result: the file ".erlang.cookie" is created under C:\Windows.
When I type the command "rabbitmqctl status" another file .erlang.cookie is created under C:\Users\Administrator and I receive the following result:
C:\Program Files\Aspect\DashBoard\RabbitMQ\sbin>rabbitmqctl.bat status
Status of node 'rabbit@RABBITMQ-NODE4' ...
Error: unable to connect to node 'rabbit@RABBITMQ-NODE4': nodedown
DIAGNOSTICS
===========
nodes in question: ['rabbit@RABBITMQ-NODE4']
hosts, their running nodes and ports:
- RABBITMQ-NODE4: [{rabbit,49428},{rabbitmqctl3045334,49434}]
current node details:
- node name: 'rabbitmqctl3045334@rabbitmq-node4'
- home dir: C:\Users\Administrator
- cookie hash: 0DLAKf8pOVrGC016+6BDBw==
We know why this fails: the two cookie hashes are different.
So I copy the .erlang.cookie file from C:\Windows into C:\Users\Administrator and try the same command again. This time I get:
C:\Program Files\Aspect\DashBoard\RabbitMQ\sbin>rabbitmqctl.bat status
Status of node 'rabbit@RABBITMQ-NODE4' ...
Error: unable to connect to node 'rabbit@RABBITMQ-NODE4': nodedown
DIAGNOSTICS
===========
nodes in question: ['rabbit@RABBITMQ-NODE4']
hosts, their running nodes and ports:
- RABBITMQ-NODE4: [{rabbitmqctl1178095,49471}]
current node details:
- node name: 'rabbitmqctl1178095@rabbitmq-node4'
- home dir: C:\Users\Administrator
- cookie hash: TIuqp21HOQSoUJT8JfgRQw==
C:\Program Files\Aspect\DashBoard\RabbitMQ\sbin>rabbitmqctl.bat status
Status of node 'rabbit@RABBITMQ-NODE4' ...
Error: {badarith,[{rabbit_vm,bytes,1,[]},
{rabbit_vm,'-mnesia_memory/0-lc$^0/1-0-',1,[]},
{rabbit_vm,mnesia_memory,0,[]},
{rabbit_vm,memory,0,[]},
{rabbit,status,0,[]},
{rpc,'-handle_call_call/6-fun-0-',5,
[{file,"rpc.erl"},{line,205}]}]}
Please notice the Error at the end: "badarith" in rpc.erl, line 205.
I think that the file is Erlang\lib\kernel-2.16.4\src\rpc.erl
The function is this one:
handle_call_call(Mod, Fun, Args, Gleader, To, S) ->
    RpcServer = self(),
    %% Spawn not to block the rpc server.
    {Caller,_} =
        erlang:spawn_monitor(
          fun () ->
                  set_group_leader(Gleader),
                  Reply =
                      %% in case some sucker rex'es
                      %% something that throws
                      case catch apply(Mod, Fun, Args) of
                          {'EXIT', _} = Exit ->
                              {badrpc, Exit};
                          Result ->
                              Result
                      end,
                  RpcServer ! {self(), {reply, Reply}}
          end),
    {noreply, gb_trees:insert(Caller, To, S)}.
and line 205 is 'case catch apply(Mod, Fun, Args) of'
THIRD SCENARIO: start RabbitMQ as a named user to prevent it from creating the file .erlang.cookie under C:\Windows
I set the RabbitMQ service to log on as the user "Administrator"; this way it does not create the file under C:\Windows but only under C:\Users\Administrator.
Result: when the service starts, the file ".erlang.cookie" is created only under C:\Users\Administrator.
When I type the command "rabbitmqctl status" I get the same error as in the previous case (badarith...).
Now the question: I have not found any information about this error (badarith).
Could anyone give me a suggestion about how to troubleshoot/avoid this?