IBM Cloud Private 2.1.0 CE - ibm-cloud-private

TASK [addon : Waiting for cloudant to start] ***************************************************************************************************************
FAILED - RETRYING: TASK: addon : Waiting for cloudant to start (50 retries left).
FAILED - RETRYING: TASK: addon : Waiting for cloudant to start (49 retries left).
FAILED - RETRYING: TASK: addon : Waiting for cloudant to start (48 retries left).
FAILED - RETRYING: TASK: addon : Waiting for cloudant to start (47 retries left).
FAILED - RETRYING: TASK: addon : Waiting for cloudant to start (46 retries left).
Failed twice. Any ideas on what can be done?
thanks
--rv

Please be sure you are sizing the VMs according to the hardware requirements:
https://www.ibm.com/support/knowledgecenter/SSBS6K_2.1.0/supported_system_config/hardware_reqs.html
and following the stated install steps:
https://www.ibm.com/support/knowledgecenter/SSBS6K_2.1.0/installing/install_containers_CE.html
You can also pull the Cloudant image onto the nodes before the ICP installation; this can work around the Cloudant retry failure.
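If you drive node preparation with Ansible (the ICP installer itself is Ansible-based), the pre-pull could look roughly like the sketch below; the image name and tag are placeholders, so substitute the Cloudant image that ships with your ICP release:
- name: Pre-pull the Cloudant image on every node
  hosts: all
  become: true
  tasks:
    - name: Pull the Cloudant image so the addon task does not time out
      docker_image:
        # Placeholder image reference -- take the real name/tag from the ICP 2.1.0 image list
        name: "ibmcom/icp-datastore:2.1.0"
        state: present   # on newer Ansible, also set source: pull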

Lack of resources, maybe?
Debug the pod to learn more (see the Kubernetes documentation on debugging pods, for example with kubectl describe pod and kubectl logs).
During the TASK [addon : Waiting for cloudant to start] phase, the installer checks the pod to see whether the desired number of instances equals the current number.
If it fails, it usually means they don't match; in other words, your pod did not start.

Related

Unable to Join Kubernetes Cluster with Windows Worker Node using containerd and Calico CNI

I'm trying to add a Windows worker node to the Kubernetes cluster
using containerd and the Calico CNI. It fails to join the cluster after
running the kubeadm join command in PowerShell, with the following
error:
[preflight] Running pre-flight checks
[preflight] WARNING: Couldn't create the interface used for talking to the container runtime: crictl is required for container runtime: exec: "crictl": executable file not found in %PATH%
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
W0406 09:27:34.133070  676 utils.go:69] The recommended value for "authentication.x509.clientCAFile" in "KubeletConfiguration" is: \etc\kubernetes\pki\ca.crt; the provided value is: /etc/kubernetes/pki/ca.crt
[kubelet-start] Writing kubelet configuration to file "\\var\\lib\\kubelet\\config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "\\var\\lib\\kubelet\\kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
[kubelet-check] Initial timeout of 40s passed.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp 127.0.0.1:10248: connectex: No connection could be made because the target machine actively refused it..
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp 127.0.0.1:10248: connectex: No connection could be made because the target machine actively refused it..
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp 127.0.0.1:10248: connectex: No connection could be made because the target machine actively refused it..
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp 127.0.0.1:10248: connectex: No connection could be made because the target machine actively refused it..
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp 127.0.0.1:10248: connectex: No connection could be made because the target machine actively refused it...
Unfortunately, an error has occurred:
  timed out waiting for the condition
This error is likely caused by:
  - The kubelet is not running
  - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
  - 'systemctl status kubelet'
  - 'journalctl -xeu kubelet'
error execution phase kubelet-start: timed out waiting for the condition
To see the stack trace of this error execute with --v=5 or higher
Thank you in advance for all help.

What's the correct way to configure Ansible tasks to make Helm deployments fault-tolerant to internet connection issues?

I'm deploying Helm charts with community.kubernetes.helm without trouble, but I've run into conditions where the connection is refused and it's not clear how best to configure retries/until/delay. Every now and then Helm can't communicate with the cluster; here's an example (DNS/IP faked) showing that the issue is as simple as not being able to connect to the cluster:
fatal: [localhost]: FAILED! => {"changed": false, "command":
"/usr/local/bin/helm --kubeconfig /var/opt/kubeconfig
--namespace=gpu-operator list --output=yaml --filter gpu-operator", "msg": "Failure when executing Helm command. Exited 1.\nstdout:
\nstderr: Error: Kubernetes cluster unreachable: Get
"https://ec2-host/k8s/clusters/c-xrsqn/version?timeout=32s": dial
tcp 192.168.1.1:443: connect: connection refused\n", "stderr": "Error:
Kubernetes cluster unreachable: Get
"https://ec2-host/k8s/clusters/c-xrsqn/version?timeout=32s": dial
tcp 192.168.1.1:443: connect: connection refused\n", "stderr_lines":
["Error: Kubernetes cluster unreachable: Get
"https://ec2-host/k8s/clusters/c-xrsqn/version?timeout=32s": dial
tcp 192.168.1.1:443: connect: connection refused"], "stdout": "",
"stdout_lines": []}
In my experience, retrying will work. I agree that it would be ideal to figure out why I can't connect to the service in the first place, but it would be even better to work around it with a catch-all until loop that retries the task until it succeeds or gives up after N tries, pausing N seconds between attempts.
Here's an example of the Ansible block:
- name: deploy Nvidia GPU Operator
  block:
    - name: deploy gpu operator
      community.kubernetes.helm:
        name: gpu-operator
        chart_ref: "{{ CHARTS_DIR }}/gpu-operator"
        create_namespace: yes
        release_namespace: gpu-operator
        kubeconfig: "{{ STATE_DIR }}/{{ INSTANCE_NAME }}-kubeconfig"
      until: ???
      retries: 5
      delay: 3
  when: GPU_NODE is defined
I would really appreciate any suggestions/pointers.
I discovered that registering the output and then testing until a field of it is defined gets Ansible to rerun the task. The key is learning what successful output looks like. For Helm, the module defines a status when it works correctly. So this is what you need to add:
register: _gpu_result
until: _gpu_result.status is defined
ignore_errors: true
retries: 5
delay: 3
The retries/delay values are up to you.
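Putting the pieces together, here is a sketch of the full task from the question with the retry loop applied (same module, names, and variables as above; tune retries and delay to taste):
- name: deploy Nvidia GPU Operator
  block:
    - name: deploy gpu operator
      community.kubernetes.helm:
        name: gpu-operator
        chart_ref: "{{ CHARTS_DIR }}/gpu-operator"
        create_namespace: yes
        release_namespace: gpu-operator
        kubeconfig: "{{ STATE_DIR }}/{{ INSTANCE_NAME }}-kubeconfig"
      register: _gpu_result
      # Per the note above: helm only defines a status in its result when the
      # release operation worked, so retry until that field appears.
      until: _gpu_result.status is defined
      ignore_errors: true
      retries: 5
      delay: 3
  when: GPU_NODE is defined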

How to add more monitors in heartbeat.yml?

I'm trying out the Uptime feature in Kibana. I've downloaded Heartbeat and run it with the default settings, and it works okay.
However, when I try to add more monitors under heartbeat.monitors in heartbeat.yml, I run into an error.
The below is the default heartbeat.yml, and it runs okay:
# Configure monitors inline
heartbeat.monitors:
- type: http
  # List of urls to query
  urls: ["http://localhost:9200"]
  # Configure task schedule
  schedule: '@every 10s'
  # Total test connection and data exchange timeout
  #timeout: 16s
However, when I add the following, I get an error.
# Configure monitors inline
heartbeat.monitors:
- type: http
  # List of urls to query
  urls: ["http://localhost:9200"]
  # Configure task schedule
  schedule: '@every 10s'
  # Total test connection and data exchange timeout
  #timeout: 16s
- type: icmp                 <------ When I try to add tcp or icmp,
    schedule: '@every 10s'   <------ I get an error. I am doing something
    hosts: ["localhost"]     <------ wrong. How can I add more monitors?
PS C:\Program Files\Heartbeat> Start-Service heartbeat
Start-Service : Service 'heartbeat (heartbeat)' cannot be started due to the following error: Cannot start service heartbeat on computer '.'.
At line:1 char:1
+ Start-Service heartbeat
+ ~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : OpenError: (System.ServiceProcess.ServiceController:ServiceController) [Start-Service], ServiceCommandException
+ FullyQualifiedErrorId : CouldNotStartService,Microsoft.PowerShell.Commands.StartServiceCommand
When I erase what I wanted to add, it works fine. How can I add more monitors in heartbeat.yml?
I strongly believe that it's an indentation issue in the YAML file.
Look at your icmp monitor:
- type: icmp                 <------ When I try to add tcp or icmp,
    schedule: '@every 10s'   <------ I get an error. I am doing something
    hosts: ["localhost"]     <------ wrong. How can I add more monitors?
There is extra whitespace in front of the schedule and hosts settings.
Now look at the default monitor:
heartbeat.monitors:
- type: http
  # List of urls to query
  urls: ["http://localhost:9200"]
  # Configure task schedule
  schedule: '@every 10s'
  # Total test connection and data exchange timeout
  #timeout: 16s
Align the settings exactly under the type field and run it again.
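For reference, here is the combined heartbeat.monitors section from the question with the icmp monitor aligned the same way as the http one (values unchanged):
heartbeat.monitors:
- type: http
  # List of urls to query
  urls: ["http://localhost:9200"]
  # Configure task schedule
  schedule: '@every 10s'
  # Total test connection and data exchange timeout
  #timeout: 16s
- type: icmp
  # Same indentation as the http monitor: schedule and hosts align under type
  schedule: '@every 10s'
  hosts: ["localhost"]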

Waiting for configuring calico node to node mesh - ICP 2.1.0.2 Installation

During the installation I encounter this error on an Ubuntu 16.04 single node with the host IP 192.168.240.14:
TASK [network : Ensuring that the calico.yaml file exist] **********************
changed: [localhost]
TASK [network : include] *******************************************************
TASK [network : include] *******************************************************
TASK [network : include] *******************************************************
included: /installer/playbook/roles/network/tasks/calico.yaml for localhost
TASK [network : Enabling calico] ***********************************************
changed: [localhost]
TASK [network : Waiting for configuring calico service] ************************
ok: [localhost -> 192.168.240.14] => (item=192.168.240.14)
TASK [network : Waiting for configuring calico node to node mesh] **************
FAILED - RETRYING: Waiting for configuring calico node to node mesh (100 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (99 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (98 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (97 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (96 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (95 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (94 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (93 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (92 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (91 retries left).
I read that it's possible to disable Calico's node-to-node mesh functionality, but since Calico is installed via ICP, the calicoctl command is not recognized. In config.yaml I couldn't find an option to disable this setting either.
So far I've tried to disable it by downloading and executing calicoctl separately, but the connection to the cluster can't be established:
user@user:~/Desktop/calicoctl$ ./calicoctl config set nodeToNodeMesh off
Error executing command: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 127.0.0.1:2379: getsockopt: connection refused
I'm not sure whether that's because it tries to dial the loopback IP address instead of 192.168.240.14, or something else. I also don't know whether this can actually fix the issue during the installation.
I'm not very experienced with this and am thankful for any help!
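(For reference, calicoctl reads its datastore settings from /etc/calico/calicoctl.cfg or from environment variables such as ETCD_ENDPOINTS, so a config file along the lines of the sketch below would point it at the master's etcd instead of the loopback default; the endpoint and any TLS certificate paths ICP's etcd may require are assumptions to adapt:)
# /etc/calico/calicoctl.cfg -- format used by the calicoctl 1.x/2.x line
apiVersion: v1
kind: calicoApiConfig
metadata:
spec:
  datastoreType: "etcdv2"
  # Point at the master's etcd instead of the default 127.0.0.1:2379
  etcdEndpoints: "http://192.168.240.14:2379"
  # If etcd is secured with TLS, etcdCACertFile / etcdCertFile / etcdKeyFile
  # also need to point at the cluster's certificates.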
EDIT:
I ran the installation with ICP 2.1.0.1 again and had the same error, but with 10 retries instead, and received the following error message:
TASK [network : Enabling calico] ***********************************************
changed: [localhost]
TASK [network : Waiting for configuring calico service] ************************
ok: [localhost -> 192.168.240.14] => (item=192.168.240.14)
FAILED - RETRYING: Waiting for configuring calico node to node mesh (10 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (9 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (8 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (7 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (6 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (5 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (4 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (3 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (2 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (1 retries left).
TASK [network : Waiting for configuring calico node to node mesh] **************
fatal: [localhost]: FAILED! => {"attempts": 10, "changed": true, "cmd": "kubectl get pods --show-all --namespace=kube-system |grep configure-calico-mesh", "delta": "0:00:01.343071", "end": "2018-06-20 08:12:28.433186", "failed": true, "rc": 0, "start": "2018-06-20 08:12:27.090115", "stderr": "", "stderr_lines": [], "stdout": "configure-calico-mesh-9f756 0/1 Pending 0 5m", "stdout_lines": ["configure-calico-mesh-9f756 0/1 Pending 0 5m"]}
PLAY RECAP *********************************************************************
192.168.240.14 : ok=168 changed=54 unreachable=0 failed=0
localhost : ok=81 changed=16 unreachable=0 failed=1
Playbook run took 0 days, 0 hours, 19 minutes, 8 seconds
user@user:/opt/ibm-cloud-private-ce-2.1.0.1/cluster$
I don't understand why localhost is suddenly included in the setup steps, since I only specified my IP address in the hosts file:
[master]
192.168.240.14 ansible_user="user" ansible_ssh_pass="6CEd29CN" ansible_become=true ansible_become_pass="6CEd29CN" ansible_port="22" ansible_ssh_common_args="-oPubkeyAuthentication=no"
[worker]
192.168.240.14 ansible_user="user" ansible_ssh_pass="6CEd29CN" ansible_become=true ansible_become_pass="6CEd29CN" ansible_port="22" ansible_ssh_common_args="-oPubkeyAuthentication=no"
[proxy]
192.168.240.14 ansible_user="user" ansible_ssh_pass="6CEd29CN" ansible_become=true ansible_become_pass="6CEd29CN" ansible_port="22" ansible_ssh_common_args="-oPubkeyAuthentication=no"
#[management]
#4.4.4.4
#[va]
#5.5.5.5
My config.yaml file looks like this:
# Licensed Materials - Property of IBM
# IBM Cloud private
# # Copyright IBM Corp. 2017 All Rights Reserved
# US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
---
###### docker0: 172.17.0.1
###### eth0: 192.168.240.14
## Network Settings
#network_type: calico
# network_helm_chart_path: < helm chart path >
## Network in IPv4 CIDR format
network_cidr: 10.1.0.0/16
## Kubernetes Settings
service_cluster_ip_range: 10.0.0.1/24
## Makes the Kubelet start if swap is enabled on the node. Remove
## this if your production env wants to disable swap.
kubelet_extra_args: ["--fail-swap-on=false"]
# cluster_domain: cluster.local
# cluster_name: mycluster
cluster_CA_domain: "mydomain.icp"
# cluster_zone: "myzone"
# cluster_region: "myregion"
## Etcd Settings
#etcd_extra_args: ["--grpc-keepalive-timeout=0", "--grpc-keepalive-interval=0", #"--snapshot-count=10000"]
## General Settings
# wait_for_timeout: 600
# docker_api_timeout: 100
## Advanced Settings
default_admin_user: user
default_admin_password: 6CEd29CN
# ansible_user: <username>
# ansible_become: true
# ansible_become_password: <password>
## Kubernetes Settings
# kube_apiserver_extra_args: []
# kube_controller_manager_extra_args: []
# kube_proxy_extra_args: []
# kube_scheduler_extra_args: []
## Enable Kubernetes Audit Log
# auditlog_enabled: false
## GlusterFS Settings
# glusterfs: false
## GlusterFS Storage Settings
# storage:
# - kind: glusterfs
# nodes:
# - ip: <worker_node_m_IP_address>
# device: <link path>/<symlink of device aaa>,<link path>/<symlink of device bbb>
# - ip: <worker_node_n_IP_address>
# device: <link path>/<symlink of device ccc>
# - ip: <worker_node_o_IP_address>
# device: <link path>/<symlink of device ddd>
# storage_class:
# name:
# default: false
# volumetype: replicate:3
## Network Settings
## Calico Network Settings
### calico_ipip_enabled: true
calico_ipip_enabled: false
calico_tunnel_mtu: 1430
calico_ip_autodetection_method: interface=eth0
## IPSec mesh Settings
## If user wants to configure IPSec mesh, the following parameters
## should be configured through config.yaml
ipsec_mesh:
  enable: false
  # interface: <interface name on which IPsec will be enabled>
  # subnets: []
  # exclude_ips: "<list of IP addresses separated by a comma>"
kube_apiserver_insecure_port: 8080
kube_apiserver_secure_port: 8001
## External loadbalancer IP or domain
## Or floating IP in OpenStack environment
# cluster_lb_address: none
## External loadbalancer IP or domain
## Or floating IP in OpenStack environment
# proxy_lb_address: none
## Install in firewall enabled mode
firewall_enabled: false
## Allow loopback dns server in cluster nodes
loopback_dns: true
## High Availability Settings
# vip_manager: etcd
## High Availability Settings for master nodes
# vip_iface: eth0
# cluster_vip: 127.0.1.1
## High Availability Settings for Proxy nodes
# proxy_vip_iface: eth0
# proxy_vip: 127.0.1.1
## Federation cluster Settings
# federation_enabled: false
# federation_cluster: federation-cluster
# federation_domain: cluster.federation
# federation_apiserver_extra_args: []
# federation_controllermanager_extra_args: []
# federation_external_policy_engine_enabled: false
## vSphere cloud provider Settings
## If user wants to configure vSphere as cloud provider, vsphere_conf
## parameters should be configured through config.yaml
# kubelet_nodename: hostname
# cloud_provider: vsphere
# vsphere_conf:
# user: <vCenter username for vSphere cloud provider>
# password: <password for vCenter user>
# server: <vCenter server IP or FQDN>
# port: [vCenter Server Port; default: 443]
# insecure_flag: [set to 1 if vCenter uses a self-signed certificate]
# datacenter: <datacenter name on which Node VMs are deployed>
# datastore: <default datastore to be used for provisioning volumes>
# working_dir: <vCenter VM folder path in which node VMs are located>
## Disabled Management Services Settings
## You can disable the following management services: ["service-catalog", "metering", "monitoring", "istio", "vulnerability-advisor", "custom-metrics-adapter"]
#disabled_management_services: ["istio", "vulnerability-advisor", "custom-metrics-adapter"]
disabled_management_services: ["service-catalog", "metering", "monitoring", "istio", "vulnerability-advisor", "custom-metrics-adapter"]
## Docker Settings
# docker_env: []
# docker_extra_args: []
## The maximum size of the log before it is rolled
# docker_log_max_size: 50m
## The maximum number of log files that can be present
# docker_log_max_file: 10
## Install/upgrade docker version
# docker_version: 17.12.1
## ICP install docker automatically
# install_docker: true
## Ingress Controller Settings
## You can add your ingress controller configuration, and the allowed configuration can refer to
## https://github.com/kubernetes/ingress-nginx/blob/nginx-0.9.0/docs/user-guide/configmap.md#configuration-options
# ingress_controller:
# disable-access-log: 'true'
## Clean metrics indices in Elasticsearch older than this number of days
# metrics_max_age: 1
## Clean application log indices in Elasticsearch older than this number of days
# logs_maxage: 1
## Uncomment the line below to install Kibana as a managed service.
kibana_install: true
# STARTING_CLOUDANT
# cloudant:
# namespace: kube-system
# pullPolicy: IfNotPresent
# pvPath: /opt/ibm/cfc/cloudant
# database:
# password: fdrreedfddfreeedffde
# federatorCommand: hostname
# federationIdentifier: "-0"
# readinessProbePeriodSeconds: 2
# readinessProbeInitialDelaySeconds: 90
# END_CLOUDANT
I had a similar issue when deploying with Ansible on Ubuntu servers. As a user mentioned on Kubernetes issue 43156, "We should not inherit nameserver 127.x.x.x in the pod resolv.conf from the node, as the node localhost will not be accessible from the pod."
If your /etc/resolv.conf has the localhost IP in it, I suggest replacing it with the node's IP and, if you're using Ubuntu, opting out of NetworkManager/systemd-resolved management so the change isn't reverted after a restart:
systemctl disable --now systemd-resolved.service
cp /etc/resolv.conf /etc/resolv.conf.bkp
echo "nameserver <Node's_IP>" > /etc/resolv.conf
More details on opting-out of NetworkManager at the following link:
How to take back control of /etc/resolv.conf on Linux
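If the nodes are already managed with Ansible, the same change can be scripted; a rough sketch of equivalent tasks (node_ip is a hypothetical variable, and disabling systemd-resolved this way assumes an Ubuntu node as above):
- name: Stop and disable systemd-resolved
  become: true
  systemd:
    name: systemd-resolved
    state: stopped
    enabled: no

- name: Point resolv.conf at the node's own IP (keeping a backup)
  become: true
  copy:
    content: "nameserver {{ node_ip }}\n"
    dest: /etc/resolv.conf
    backup: yes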

Ansible: async mode is not supported with the service module

I have been trying to call the network restart service in fire-and-forget mode, because obviously I will be disconnected from the SSH session after restarting the network on a VM, so I wanted an asynchronous task to do that.
In order to do that I put this in my network restart tasks:
- name: Restart network
  become: true
  service: name=network state=restarted
  async: 1000
  poll: 0
When Ansible gets to this point I get this error:
fatal: [build]: FAILED! => {"failed": true, "msg": "async mode is not supported with the service module"}
I found that this is an Ansible bug whose fix is not yet in a release and only exists in the development branch, which I don't want to use because that could also mean more possible bugs in Ansible.
So, in my opinion I have two options: either wait for the Ansible release that contains the bug fix, or change to async: 0 and poll: 0 so the task waits for the service to finish (which it never will) and press CTRL+C at that point to stop it manually.
I don't want to go down either of those routes because they are not very practical for me, so I was wondering whether there is a better solution at this point.
Try this as a temporary workaround:
- name: Restart network
  become: yes
  shell: sleep 2 && service network restart
  async: 1
  poll: 0
And don't forget to wait_for port 22 after this task, to avoid a host-unreachable error.
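For example, a follow-up task along these lines (a sketch; how the target host is looked up and the timeout values are assumptions to adapt):
- name: Wait for SSH to come back after the network restart
  wait_for:
    host: "{{ ansible_host | default(inventory_hostname) }}"
    port: 22
    delay: 10
    timeout: 300
  # Run the check from the control machine, since the target is unreachable
  # while its network restarts.
  delegate_to: localhost
  become: false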
