Pacemaker on CentOS 7: VIP released for unmanaged resource of type ocf:heartbeat:IPaddr2 on a network blip - pacemaker

I have a 2-node Pacemaker setup with 2 VIPs of resource type ocf:heartbeat:IPaddr2.
VIP1: this VIP is not expected to fail over automatically, so the resource is kept unmanaged.
VIP2: this VIP is expected to fail over automatically, so it is kept managed.
Issue: we had a network outage of about 3 minutes, and during it:
VIP1: the IP we were using for VIP1 was released from the host and did not come back automatically even after the network was fixed; the resource was marked as Stopped, so the VIP1 address existed on neither host1 nor host2.
VIP2: the IP came back on the node and the resource was started again.
We do not want resource VIP1 to release its IP even though the resource is unmanaged.
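For reference, a resource is typically switched into this unmanaged state with standard pcs commands along these lines (a sketch, not taken from the original post):
pcs resource unmanage VIP1    # sets is-managed=false on VIP1
pcs resource manage VIP1      # reverts it to managed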
[root@osboxes1 ~]# pcs config
Cluster Name: test-cluster
Corosync Nodes:
osboxes1 osboxes
Pacemaker Nodes:
osboxes osboxes1
Resources:
Resource: VIP2 (class=ocf provider=heartbeat type=IPaddr2)
Attributes: ip=192.168.50.54 nic=enp0s3:2 cidr_netmask=19
Operations: start interval=0s timeout=20s (VIP2-start-interval-0s)
stop interval=0s timeout=20s (VIP2-stop-interval-0s)
monitor interval=20s (VIP2-monitor-interval-20s)
Resource: VIP1 (class=ocf provider=heartbeat type=IPaddr2)
Attributes: ip=192.168.50.53 nic=enp0s3:1 cidr_netmask=19
Meta Attrs: is-managed=false
Operations: start interval=0s timeout=20s (VIP1-start-interval-0s)
stop interval=0s timeout=20s (VIP1-stop-interval-0s)
monitor interval=20s (VIP1-monitor-interval-20s)
Stonith Devices:
Fencing Levels:
Location Constraints:
Resource: VIP1
Enabled on: osboxes (score:50) (id:location-VIP1-osboxes-50)
Resource: VIP2
Enabled on: osboxes1 (score:50) (id:location-VIP2-osboxes1-50)
Ordering Constraints:
Colocation Constraints:
Ticket Constraints:
Alerts:
No alerts defined
Resources Defaults:
resource-stickiness: 100
Operations Defaults:
No defaults set
Cluster Properties:
cluster-infrastructure: corosync
cluster-name: test-cluster
dc-version: 1.1.15-11.el7_3.4-e174ec8
have-watchdog: false
no-quorum-policy: ignore
stonith-enabled: false
Quorum:
Options:

If I understand your setup correctly, try removing the VIP1 resource from the cluster completely, because there is no point in keeping it in the cluster if the cluster doesn't manage it (see the sketch after the quoted config below).
Resource: VIP1 (class=ocf provider=heartbeat type=IPaddr2)
Attributes: ip=192.168.50.53 nic=enp0s3:1 cidr_netmask=19
Meta Attrs: is-managed=false
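A minimal sketch of that approach, assuming standard pcs and iproute2 tooling (address, netmask and NIC taken from the config above; adjust to your nodes):
pcs resource delete VIP1                                  # remove VIP1 from cluster management entirely
ip addr add 192.168.50.53/19 dev enp0s3 label enp0s3:1    # plumb the address manually on the node that should own it
The address then stays on that node regardless of cluster state, and only VIP2 keeps failing over under Pacemaker's control.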

Related

Azure Kubernetes, running DaemonSet to a pool "CriticalAddonsOnly=true:NoSchedule"

I'm configuring the Elastic Cloud agent on Azure AKS with a system pool and a user pool. On the system pool I configured the CriticalAddonsOnly=true:NoSchedule taint to prevent application pods from running there. I installed the Elastic Cloud agent, but I'm noticing that the DaemonSet tries to run pods on that system pool without success. I tried setting the CriticalAddonsOnly=true:NoSchedule toleration in the agent's YAML config, but I got the same errors. Is there a way to force the deployment onto the system pool, or to exclude Elastic Cloud pod deployment from that pool?
Here is the setup YAML:
tolerations:
- key: node-role.kubernetes.io/control-plane
  effect: NoSchedule
- key: node-role.kubernetes.io/master
  effect: NoSchedule
- key: CriticalAddonsOnly
  operator: "Exists"
  effect: NoSchedule
Regards
node-role.kubernetes.io/control-plane and node-role.kubernetes.io/master are not taints on AKS nodes; they are node labels. So please remove them from the toleration spec.
Furthermore, specifying a toleration does not guarantee scheduling onto the tolerated nodes. It only means a tainted node will not accept pods that do not tolerate its taints. Since your second node pool does not seem to be tainted, the scheduler simply places your pods there.
You could now add taints to your other node pools, or, more easily, just specify a node selector:
nodeSelector:
  kubernetes.azure.com/mode: system
tolerations:
- key: "CriticalAddonsOnly"
  operator: "Exists"
  effect: "NoSchedule"
The same could also be achieved with node affinity. Check the Helm chart or your deployment option to see whether nodeSelector or nodeAffinity is exposed.
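For reference, a roughly equivalent node affinity block might look like the following sketch (assuming the same kubernetes.azure.com/mode=system label; adjust to the fields your chart exposes):
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.azure.com/mode
          operator: In
          values:
          - system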

Google Anthos Bare Metal - Add Node

So I'm trying to add nodes to my existing Anthos Bare Metal Kubernetes cluster (Anthos Bare Metal - add node / resize cluster).
I just added the new nodes under the NodePool and ran the bmctl update command (a sketch of that command follows the manifest below):
apiVersion: baremetal.cluster.gke.io/v1
kind: NodePool
metadata:
  name: node-pool-1
  namespace: cluster-anthos-prod-cluster
spec:
  clusterName: anthos-prod-cluster
  nodes:
  - address: 10.0.14.66
  - address: 10.0.14.67
  - address: 10.0.14.68
  - address: 10.0.14.72
  - address: 10.0.14.73
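The update step was roughly the following (the kubeconfig path is a placeholder for the admin kubeconfig, not copied from my setup):
bmctl update cluster -c anthos-prod-cluster --kubeconfig <ADMIN_KUBECONFIG>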
The new nodes are in the RECONCILING state, and upon checking the logs I see the message below:
Warning  FailedScheduling  6m5s (x937 over 17h)  default-scheduler  0/6 nodes are available: 1 node(s) had taint {client.ip.colocate: NoIngress}, that the pod didn't tolerate, 2 node(s) had taint {client.ip.colocate: SameInstance}, that the pod didn't tolerate, 3 node(s) didn't match Pod's node affinity/selector.
Did I miss a step?
I would like some help on where to start checking in order to fix my problem.
Appreciate any help.
Thank you.
-MD

Fix elasticsearch broken cluster within kubernetes

I deployed an Elasticsearch cluster with the official Helm chart (https://github.com/elastic/helm-charts/tree/master/elasticsearch).
There are 3 Helm releases:
master (3 nodes)
client (1 node)
data (2 nodes)
The cluster was running fine. I did a crash test by removing the master release and re-creating it.
After that, the master nodes are OK, but the data nodes complain:
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: join validation on cluster state with a different cluster uuid xeQ6IVkDQ2es1CO2yZ_7rw than local cluster uuid 9P9ZGqSuQmy7iRDGcit5fg, rejecting
which is normal because the master nodes are new.
How can I fix the data nodes' cluster state without removing the data folder?
Edit:
I know why it is broken, and I know a basic solution is to remove the data folder and restart the node (as I can see on the Elastic forum, there are lots of similar questions without answers). But I am looking for a production-aware solution, maybe with the https://www.elastic.co/guide/en/elasticsearch/reference/current/node-tool.html tool?
Using the elasticsearch-node utility, it's possible to reset the cluster state so that the fresh node can join another cluster.
The tricky part is using this utility with Docker, because the Elasticsearch server must be stopped!
Solution with Kubernetes:
Stop the pods by scaling the StatefulSet to 0: kubectl scale sts data-nodes --replicas=0
Create a k8s Job that resets the cluster state, with the data volume attached
Apply the Job for each PVC
Rescale the StatefulSet and enjoy! (a kubectl sketch of these steps follows job.yaml below)
job.yaml:
apiVersion: batch/v1
kind: Job
metadata:
  name: test-fix-cluster-m[0-3]
spec:
  template:
    spec:
      containers:
      - args:
        - -c
        - yes | elasticsearch-node detach-cluster; yes | elasticsearch-node remove-customs '*'
        # uncomment for at least 1 PVC
        #- yes | elasticsearch-node unsafe-bootstrap -v
        command:
        - /bin/sh
        image: docker.elastic.co/elasticsearch/elasticsearch:7.10.1
        name: elasticsearch
        volumeMounts:
        - mountPath: /usr/share/elasticsearch/data
          name: es-data
      restartPolicy: Never
      volumes:
      - name: es-data
        persistentVolumeClaim:
          claimName: es-test-master-es-test-master-[0-3]
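For completeness, a minimal sketch of the surrounding kubectl steps, assuming the data StatefulSet is called data-nodes with 2 replicas as described above and the Job instance for index 0 (adjust names to your release):
kubectl scale sts data-nodes --replicas=0                        # stop the data pods so elasticsearch-node can run
kubectl apply -f job.yaml                                        # one Job per PVC, editing name/claimName each time
kubectl wait --for=condition=complete job/test-fix-cluster-m0    # wait for the Job to finish
kubectl scale sts data-nodes --replicas=2                        # bring the data nodes back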
If you are interested, here is the code behind unsafe-bootstrap: https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/cluster/coordination/UnsafeBootstrapMasterCommand.java#L83
I have written a small story at https://medium.com/@thomasdecaux/fix-broken-elasticsearch-cluster-405ad67ee17c.

Gluster_Volume module in ansible

I request your help with the following issue.
I am building a highly available LAMP app on Ubuntu 14.04 with Ansible (on my home lab). All the tasks execute fine up to the GlusterFS installation; however, creating the GlusterFS volume has been a challenge for me for a week. If I use the command module, the GlusterFS volume gets created:
- name: Creating the Gluster Volume
  command: sudo gluster volume create var-www replica 2 transport tcp server01-private:/data/glusterfs/var-www/brick01/brick server02-private:/data/glusterfs/var-www/brick02/brick
But if I use the gluster_volume module, I get an error:
- name: Creating the Gluster Volume
  gluster_volume:
    state: present
    name: var-www
    bricks: /server01-private:/data/glusterfs/var-www/brick01/brick,/server02-private:/data/glusterfs/var-www/brick02/brick
    replicas: 2
    transport: tcp
    cluster:
      - server01-private
      - server02-private
    force: yes
  run_once: true
The error is
"msg": "error running gluster (/usr/sbin/gluster --mode=script volume add-brick var-www replica 2 server01-private:/server01-private:/data/glusterfs/var-www/brick01/brick server01-private:/server02-private:/data/glusterfs/var-www/brick02/brick server02-private:/server01-private:/data/glusterfs/var-www/brick01/brick server02-private:/server02-private:/data/glusterfs/var-www/brick02/brick force) command (rc=1): internet address 'server01-private:/server01-private' does not conform to standards\ninternet address 'server01-private:/server02-private' does not conform to standards\ninternet address 'server02-private:/server01-private' does not conform to standards\ninternet address 'server02-private:/server02-private' does not conform to standards\nvolume add-brick: failed: Host server01-private:/server01-private is not in 'Peer in Cluster' state\n"
}
May I know the mistake I am making?
The bricks: declaration of the Ansible gluster_volume module requires only the path of the brick. The nodes participating in the volume are identified by cluster:.
The <hostname>:<brickpath> format is required on the gluster command line, but it is not needed when you use the Ansible module.
So your task should be something like:
- name: Creating the Gluster Volume
  gluster_volume:
    name: 'var-www'
    bricks: '/data/glusterfs/var-www/brick01/brick,/data/glusterfs/var-www/brick02/brick'
    replicas: '2'
    cluster:
      - 'server01-private'
      - 'server02-private'
    transport: 'tcp'
    state: 'present'
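Once the play has run, you can verify the result on either node with the standard Gluster CLI, for example:
gluster volume info var-www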

Service discovery does not discover multiple nodes

I have an Elasticsearch cluster on Kubernetes on AWS. I have used the UPMC operator and the ReadonlyREST security plugin.
I started my cluster by passing YAML to the UPMC operator with 3 data, master and ingest nodes.
However, when I query localhost:9200/_nodes, I see that only 1 node is assigned. Service discovery did not attach the other nodes to the cluster; essentially I have a single-node cluster. Are there any settings I am missing, or do I need to apply some settings after creating the cluster so that all nodes become part of it?
Here is my YAML file:
The following YAML file is used to create the cluster.
This YAML config creates 3 master/data/ingest nodes, and the UPMC operator uses pod affinity to allocate the pods in different zones.
All 9 nodes are getting created just fine, but they are unable to become part of the cluster.
====================================
apiVersion: enterprises.upmc.com/v1
kind: ElasticsearchCluster
metadata:
  name: es-cluster
spec:
  kibana:
    image: docker.elastic.co/kibana/kibana-oss:6.1.3
    image-pull-policy: Always
  cerebro:
    image: upmcenterprises/cerebro:0.7.2
    image-pull-policy: Always
  elastic-search-image: myelasticimage:elasticsearch-mod-v0.1
  elasticsearchEnv:
  - name: PUBLIC_KEY
    value: "default"
  - name: NETWORK_HOST
    value: "_eth0:ipv4_"
  image-pull-secrets:
  - name: egistrykey
  image-pull-policy: Always
  client-node-replicas: 3
  master-node-replicas: 3
  data-node-replicas: 3
  network-host: 0.0.0.0
  zones: []
  data-volume-size: 1Gi
  java-options: "-Xms512m -Xmx512m"
  snapshot:
    scheduler-enabled: false
    bucket-name: somebucket
    cron-schedule: "@every 1m"
    image: upmcenterprises/elasticsearch-cron:0.0.4
  storage:
    type: standard
    storage-class-provisioner: volume.alpha.kubernetes.io/storage-class
    volume-reclaim-policy: Delete
