I am creating a Fedora PCS cluster for HAProxy. I have it running on VMware, am following this guide, and have got to the step of adding an IPaddr2 resource: http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/_adding_a_resource.html
The only difference is that I need my cluster heartbeat/comms on one NIC/subnet, and my shared resource IP on a different NIC/subnet.
My internal comms are Node1=192.168.0.1 and Node2=192.168.0.2, and my resource IP is 10.0.0.1
How do I use this command in this situation:
pcs resource create ClusterIP ocf:heartbeat:IPaddr2 \
ip=192.168.0.120 cidr_netmask=32 op monitor interval=30s
If I add it as above, I get this:
[root@node-01 .pcs]# pcs status
Cluster name: mycluster
Last updated: Tue Oct 28 09:10:13 2014
Last change: Tue Oct 28 09:00:13 2014 via cibadmin on node-02
Stack: corosync
Current DC: node-02 (2) - partition with quorum
Version: 1.1.11-1.fc20-9d39a6b
2 Nodes configured
1 Resources configured
Online: [ node-01 node-02 ]
Full list of resources:
ClusterIP (ocf::heartbeat:IPaddr2): Stopped
Failed actions:
ClusterIP_start_0 on node-01 'unknown error' (1): call=7, status=complete, last-rc-change='Tue Oct 28 09:00:13 2014', queued=0ms, exec=27ms
ClusterIP_start_0 on node-02 'unknown error' (1): call=6, status=complete, last-rc-change='Tue Oct 28 09:00:13 2014', queued=0ms, exec=27ms
First, you need to specify the network device as mentioned by Daniel, e.g.
pcs resource create ClusterIP ocf:heartbeat:IPaddr2 ip=10.0.0.1 cidr_netmask=32 nic=eth0 op monitor interval=30s
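If you're not sure which device name to use, it helps to first check which NIC actually carries the 10.0.0.x subnet on each node; these are just the standard iproute2 commands, and the interface name will be whatever they report:
ip addr show              # list every NIC and its addresses on this node
ip route get 10.0.0.1     # shows which interface would be used to reach that address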
Since you are running a two-node cluster, you don't have a fencing device, so you have to disable the STONITH setting; note that this is not recommended for a production environment.
pcs property set stonith-enabled=false
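To confirm the property took effect, and to clear the earlier failed start attempts so Pacemaker retries the resource, something along these lines should do (resource name as in your config):
pcs property list               # should now include stonith-enabled: false
pcs resource cleanup ClusterIP  # clears the old failed actions so the IP is started again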
The virtual IP address should then be activated automatically.
# pcs status resources
Full list of resources:
ClusterIP (ocf::heartbeat:IPaddr2): Started:node-01
You need to specify the NIC. If your first NIC is eth0 and your second is eth1, you can create the resource like this:
pcs resource create ClusterIP ocf:heartbeat:IPaddr2 ip=10.0.0.1 cidr_netmask=32 nic=eth1:0 op monitor interval=30s
You can also use just eth1, but I prefer to use a sub-interface for my floating IP address. You can create more than one floating IP address on a single NIC, but you need to configure each one on a unique sub-interface.
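For example, if you later need a second floating IP on the same NIC, a sketch like this keeps each one on its own sub-interface (the second address and resource name here are made up):
pcs resource create ClusterIP ocf:heartbeat:IPaddr2 ip=10.0.0.1 cidr_netmask=32 nic=eth1:0 op monitor interval=30s
pcs resource create ClusterIP2 ocf:heartbeat:IPaddr2 ip=10.0.0.2 cidr_netmask=32 nic=eth1:1 op monitor interval=30s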
I'm having trouble adding labels in Grafana, but this issue only occurs on one node.
I already have 3 Promtails with labels working properly. I tried the same example on this machine, which belongs to the same cluster and also has connectivity to the Loki port.
Here is what I have:
server:
  http_listen_port: 9080
  grpc_listen_port: 0
positions:
  filename: /usr/hdp/promtail/data/positions.yaml
clients:
  - url: http://machine4:3100/loki/api/v1/push
scrape_configs:
  - job_name: zeppelin
    static_configs:
      - targets:
          - machine1:9080
        labels:
          host: machine1
          stream: zeppelin
          job: zeppelin
          __path__: /usr/hdp/logs/zeppelin/zeppelin-zeppelin-machine1.log
    pipeline_stages:
      - match:
          selector: '{job="zeppelin"}'
          stages:
            - regex:
                expression: '(?P<zeppelinError>RemoteInterpreterManagedProcess)'
            - labels:
                zeppelinError:
So when I go to Grafana, into Variables, and type 'label_values(zeppelinError)', it doesn't show me the label.
Here are the logs from Promtail, which look fine:
Aug 10 11:47:14 machine1 systemd[1]: Started Promtail service.
Aug 10 11:47:14 machine1 promtail[25496]: level=info ts=2021-08-10T10:47:14.666205865Z caller=server.go:225 http=0.0.0.0:9080 grpc=0.0.0.0:44903 msg="server listening on addresses"
Aug 10 11:47:14 machine1 promtail[25496]: level=info ts=2021-08-10T10:47:14.666573544Z caller=main.go:108 msg="Starting Promtail" version="(version=2.0.0, branch=HEAD, revision=6978ee5d)"
Aug 10 11:47:19 machine1 promtail[25496]: level=info ts=2021-08-10T10:47:19.663478261Z caller=filetargetmanager.go:261 msg="Adding target" key="{host=\"machine1\", job=\"zeppelin\", stream=\"zeppelin\"}"
Aug 10 11:47:19 machine1 promtail[25496]: level=info ts=2021-08-10T10:47:19.667623654Z caller=tailer.go:122 component=tailer msg="tail routine: started" path=/usr/hdp/logs/zeppelin/zeppelin-zeppelin-machine1.log
Aug 10 11:47:19 machine1 promtail[25496]: ts=2021-08-10T10:47:19.668843991Z caller=log.go:124 component=tailer level=info msg="Seeked /usr/hdp/logs/zeppelin/zeppelin-zeppelin-machine1.log - &{Offset:713999 Whence:0}"
And here the log I want to trace:
ERROR [2021-07-30 06:37:40,836] ({pool-4-thread-74} NotebookServer.java[afterStatusChange]:2294) - Error
java.lang.RuntimeException:
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterManagedProcess.start(RemoteInterpreterManagedProcess.java:205)
It's probably something small I'm missing here; I hope you can give me a hand with this.
Following:
https://grafana.com/docs/loki/latest/clients/promtail/stages/regex/#schema (how to capture data),
https://github.com/google/re2/wiki/Syntax (Regex Expression Rules),
https://sbcode.net/grafana/nginx-promtail/ (following a similar build)
Oddly, I ended up implementing this just like you did and had the same issue, so if we are doing something wrong, it means the docs are not as clear as they could be. I couldn't get more than one label to be recognised and the only working label would work on one machine but not another, with the same config.
Looking at the documentation, though, I spotted that starting with Loki 2.3.0, LogQL supports dynamic creation of labels at query time:
https://grafana.com/docs/loki/latest/logql/log_queries/#parser-expression
So, after spending way too long trying to get this fixed to no avail, I decided to rip out the pipeline stages from the Promtail config and apply the regex directly on the Loki query:
https://grafana.com/docs/loki/latest/logql/log_queries/#regular-expression
Which finally fixed my problem. I've now applied this on all Promtail-ed machines.
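For the record, the query I ended up with looks roughly like this (the selector and label name are the ones from your config above, so swap in your own):
{job="zeppelin"} |= "RemoteInterpreterManagedProcess" | regexp "(?P<zeppelinError>RemoteInterpreterManagedProcess)"
The line filter in the middle just keeps the matching lines so the regexp parser only runs on those; the parser then adds zeppelinError as a query-time label.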
I hope this helps, but if someone knows how to fix labelling via the Promtail config, that would still be very helpful to know as an alternative, since there's a limit to how many dynamic labels you can use before they start making querying too heavy.
Cheers.
I'm attempting to set up a Rancher / Kubernetes dev lab on a set of four local virtual machines, however when attempting to add nodes to the cluster Rancher seems to be permanently stuck at 'Waiting to register with Kubernetes'.
From extensive Googling I suspect there is some kind of communications problem between the Rancher node and the other three, but I can't work out how to diagnose it. The instructions for finding logs in Rancher 1.x don't apply to 2.x, and all the information I've found so far for 2.x appears to be about configuring logging for a working cluster, as opposed to where to find Rancher's own logs of its attempts to set up clusters.
So effectively two questions:
What is the best way to go about diagnosing this problem?
Where can I find Rancher's logs of its cluster-building activities?
Details of my setup:
Four identical VMs, all with Ubuntu 20.04 and Docker 20.10.5, all running under Proxmox on the same host and all can ping and ssh to each other. All have full Internet access.
Rancher 2.5.7 is installed on 192.168.0.180 with the other three nodes being 181-183.
Using "Global > Cluster > Add Cluster" I created a new cluster, using the default settings.
Rancher gives me the following code to execute on the nodes, this has been done, with no errors reported:
sudo docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run rancher/rancher-agent:v2.5.7 --server https://192.168.0.180 --token (token) --ca-checksum (checksum) --etcd --controlplane --worker
According to the Rancher setup instructions, Rancher should now configure and take control of the nodes; however, nothing happens and the nodes continue to show "Waiting to register with Kubernetes".
I've exec'd into the Rancher container on .180 ("docker exec -it (container-id) bash") and searched for the logs; however, the /var/lib/cattle directory, where the debug logs were found in older versions, is empty.
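For completeness, this is roughly how I've been pulling the server's own output on .180, on the assumption that the 2.x container simply logs to stdout (container id is a placeholder):
docker ps                                              # find the rancher/rancher server container id
docker logs --tail 200 -f (container-id) 2>&1 | grep -iE 'error|regist'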
Update 2021-06-23
Having got nowhere with this I deleted the existing cluster attempt in Rancher, stopped all existing Docker processes on the nodes, and tried to create a new cluster, this time using one node each for etcd, controlplane, and worker, instead of all three doing all three tasks.
Exactly the same thing happens: Rancher just says "Waiting to register with Kubernetes" forever. Looking at the logs on node-1 (181), using docker ps to find the id and then docker logs to view them, I get this:
root@knode-1:~# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
3ca92e0ea581 rancher/rancher-agent:v2.5.7 "run.sh --server htt…" About a minute ago Up About a minute epic_goldberg
root@knode-1:~# docker logs 3ca92e0ea581
INFO: Arguments: --server https://192.168.0.180 --token REDACTED --ca-checksum 151f030e78c10cf8e2dad63679f6d07c166d2da25b979407a606dc195d08855e --etcd
INFO: Environment: CATTLE_ADDRESS=192.168.0.181 CATTLE_INTERNAL_ADDRESS= CATTLE_NODE_NAME=knode-1 CATTLE_ROLE=,etcd CATTLE_SERVER=https://192.168.0.180 CATTLE_TOKEN=REDACTED
INFO: Using resolv.conf: nameserver 127.0.0.53 options edns0
WARN: Loopback address found in /etc/resolv.conf, please refer to the documentation how to configure your cluster to resolve DNS properly
INFO: https://192.168.0.180/ping is accessible
INFO: Value from https://192.168.0.180/v3/settings/cacerts is an x509 certificate
time="2021-06-23T09:46:36Z" level=info msg="Listening on /tmp/log.sock"
time="2021-06-23T09:46:36Z" level=info msg="Rancher agent version v2.5.7 is starting"
time="2021-06-23T09:46:36Z" level=info msg="Option customConfig=map[address:192.168.0.181 internalAddress: label:map[] roles:[etcd] taints:[]]"
time="2021-06-23T09:46:36Z" level=info msg="Option etcd=true"
time="2021-06-23T09:46:36Z" level=info msg="Option controlPlane=false"
time="2021-06-23T09:46:36Z" level=info msg="Option worker=false"
time="2021-06-23T09:46:36Z" level=info msg="Option requestedHostname=knode-1"
time="2021-06-23T09:46:36Z" level=info msg="Connecting to wss://192.168.0.180/v3/connect/register with token rbdbrk8r7ncbvb9ktw9w669tj7q9xppb9scwxp9wj8zj25nhfq24s9"
time="2021-06-23T09:46:36Z" level=info msg="Connecting to proxy" url="wss://192.168.0.180/v3/connect/register"
time="2021-06-23T09:46:36Z" level=info msg="Waiting for node to register. Either cluster is not ready for registering or etcd and controlplane node have to be registered first"
time="2021-06-23T09:46:38Z" level=info msg="Starting plan monitor, checking every 15 seconds"
The only error showing appears to be the DNS one. I originally set the node's resolv.conf to use 1.1.1.1 and 8.8.4.4, so presumably the Docker install changed it; however, testing 127.0.0.53 against a range of domains and record types, it resolves DNS correctly, so I don't think that's the problem.
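For reference, the sort of checks I ran against the stub resolver looked like this (the domain is just an example; dig comes from the dnsutils package):
resolvectl query github.com
dig @127.0.0.53 github.com +short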
Help?
I have created a cluster on an Ubuntu Proxmox node ("node01"):
pvecm create cluster1
This is the output of pvecm status (I changed my IP address to 1.1.1.1 for security purposes):
root@node01:~# pvecm status
Quorum information
------------------
Date: Thu Jul 9 09:41:47 2020
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000001
Ring ID: 1/8
Quorate: Yes
Votequorum information
----------------------
Expected votes: 1
Highest expected: 1
Total votes: 1
Quorum: 1
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 1.1.1.1 (local)
However I want to completely remove it. How can I do that?
I remember creating my Proxmox cluster using the GUI, and honestly ... I have never removed the cluster that I currently have working, but this information may be of use to you.
https://pve.proxmox.com/wiki/Cluster_Manager#_remove_a_cluster_node
https://gist.github.com/ianchen06/73acc392c72d6680099b7efac1351f56
https://www.youtube.com/watch?v=GSg-aeQ5gT8
https://forum.proxmox.com/threads/removing-deleting-a-created-cluster.18887/
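From the first link above, the gist of turning a single clustered node back into a standalone one is roughly the following, but please check the current wiki text before running it, since it deletes the corosync configuration:
systemctl stop pve-cluster corosync
pmxcfs -l                      # start the cluster filesystem in local mode
rm /etc/pve/corosync.conf
rm -r /etc/corosync/*
killall pmxcfs
systemctl start pve-cluster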
I have a CoreOS cluster with 3 AWS EC2 instances. The cluster was set up using the CoreOS CloudFormation stack. After the cluster was up and running, I needed to update the auto-scaling policy to pick up an EC2 instance profile. I copied the existing auto-scaling configuration file and updated the IAM role for the EC2s. Then I terminated the EC2s in the fleet, letting auto-scaling fire up new instances. The new instances indeed assumed their new roles; however, the cluster seems to have lost the cluster machine information:
ip-10-214-156-29 ~ # systemctl -l status etcd.service
● etcd.service - etcd
Loaded: loaded (/usr/lib64/systemd/system/etcd.service; disabled)
Drop-In: /run/systemd/system/etcd.service.d
└─10-oem.conf, 20-cloudinit.conf
Active: activating (auto-restart) (Result: exit-code) since Wed 2014-09-24 18:28:58 UTC; 9s ago
Process: 14124 ExecStart=/usr/bin/etcd (code=exited, status=1/FAILURE)
Main PID: 14124 (code=exited, status=1/FAILURE)
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal systemd[1]: etcd.service: main process exited, code=exited, status=1/FAILURE
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal systemd[1]: Unit etcd.service entered failed state.
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal etcd[14124]: [etcd] Sep 24 18:28:58.206 INFO | d9a7cb8df4a049689de452b6858399e9 attempted to join via 10.252.78.43:7001 failed: fail checking join version: Client Internal Error (Get http://10.252.78.43:7001/version: dial tcp 10.252.78.43:7001: connection refused)
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal etcd[14124]: [etcd] Sep 24 18:28:58.206 WARNING | d9a7cb8df4a049689de452b6858399e9 cannot connect to existing peers [10.214.135.35:7001 10.16.142.108:7001 10.248.7.66:7001 10.35.142.159:7001 10.252.78.43:7001]: fail joining the cluster via given peers after 3 retries
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal etcd[14124]: [etcd] Sep 24 18:28:58.206 CRITICAL | fail joining the cluster via given peers after 3 retries
The same token was used from cloud-init. https://discovery.etcd.io/<cluster token> shows 6 machines: 3 dead ones and 3 new ones. So it looks like the 3 new instances joined the cluster alright. The journalctl -u etcd.service logs show that etcd timed out on the dead instances and got connection refused for the new ones.
journalctl -u etcd.service shows:
...
Sep 24 06:01:11 ip-10-35-142-159.us-west-2.compute.internal etcd[574]: [etcd] Sep 24 06:01:11.198 INFO | 5c4531d885df4d06ae2d369c94f4de11 attempted to join via 10.214.156.29:7001 failed: fail checking join version: Client Internal Error (Get http://10.214.156.29:7001/version: dial tcp 10.214.156.29:7001: connection refused)
etcdctl --debug ls
Cluster-Peers: http://127.0.0.1:4001 http://10.35.142.159:4001
Curl-Example: curl -X GET http://127.0.0.1:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Curl-Example: curl -X GET http://10.35.142.159:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Curl-Example: curl -X GET http://127.0.0.1:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Curl-Example: curl -X GET http://10.35.142.159:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Error: 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
Maybe this is not the right process for updating a cluster's configuration, but if the cluster does need auto-scaling for whatever reason (load-triggered, for example), will the fleet still be able to function with dead instances and new instances mixed in the pool?
How do I recover from this situation without tearing down and rebuilding?
Xueshan
In this scheme etcd will not remain with a quorum of machines and can't operate successfully. The best scheme for doing autoscaling would be to set up two groups of machines:
A fixed number (1-9) of etcd machines that will always be up. These are set up with a discovery token or static networking like normal.
Your autoscaling group, which doesn't start etcd, but instead configures fleet (and any other tool) to use the fixed etcd cluster. You can do this in cloud-config. Here's an example that also sets some fleet metadata so you can schedule jobs specifically to the autoscaled machines if desired:
#cloud-config
coreos:
  fleet:
    metadata: "role=autoscale"
    etcd_servers: "http://:4001,http://:4001,http://:4001,http://:4001,http://:4001,http://:4001"
  units:
    - name: fleet.service
      command: start
The validator wouldn't let me put in any 10.x IP addresses in my answer (wtf!?) so be sure to replace those.
You must have at least one machine always running with the discovery token; as soon as all of them go down, the heartbeat will fail, no new machine will be able to join, and you will need a new token for the cluster.
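To see which machines the discovery service currently thinks are in the cluster, you can also just fetch the token URL (same <cluster token> placeholder as in the question):
curl -s https://discovery.etcd.io/<cluster token> | python -m json.tool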
I set up a 3-node ZooKeeper CDH4 ensemble on RHEL 5.5 machines. I have started the service by running zkServer.sh on each of the nodes. A ZooKeeper instance is running on all the nodes, but how do I know whether they are part of an ensemble or running as individual services?
I tried to start the service and check the ensemble as stated here, on Cloudera's site, but it throws a ClassNotFoundException.
You can use the stat four-letter word:
~$ echo stat | nc 127.0.0.1 <zkport>
which gives you output like this:
Zookeeper version: 3.4.5-1392090, built on 09/30/2012 17:52 GMT
Clients:
/127.0.0.1:55829[0](queued=0,recved=1,sent=0)
Latency min/avg/max: 0/0/0
Received: 3
Sent: 2
Connections: 1
Outstanding: 0
Zxid: 0x100000000
Mode: leader
Node count: 4
The Mode: line tells you which mode the server is running in: leader or follower, or standalone if the node is not part of an ensemble.
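To check the whole ensemble at once, you can run the same four-letter word against each server; in a healthy 3-node ensemble you should see one leader and two followers (hostnames and port are placeholders for your own nodes):
for h in zk-node1 zk-node2 zk-node3; do
  echo -n "$h: "
  echo stat | nc $h 2181 | grep Mode
done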