Seldon-core deployment in GKE private cluster with Anthos Service Mesh

I'm trying to use a GKE private cluster with the standard configuration and the Anthos Service Mesh managed profile. However, when I deploy the "Iris" model as a test, the deployment gets stuck calling "storage.googleapis.com":
$ kubectl get all -n test
NAME                                                  READY   STATUS     RESTARTS   AGE
pod/iris-model-default-0-classifier-dfb586df4-ltt29   0/3     Init:1/2   0          30s

NAME                                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/iris-model-default              ClusterIP   xxx.xxx.65.194   <none>        8000/TCP,5001/TCP   30s
service/iris-model-default-classifier   ClusterIP   xxx.xxx.79.206   <none>        9000/TCP,9500/TCP   30s

NAME                                              READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/iris-model-default-0-classifier   0/1     1            0           31s

NAME                                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/iris-model-default-0-classifier-dfb586df4   1         1         0       31s
$ kubectl logs -f -n test pod/iris-model-default-0-classifier-dfb586df4-ltt29 -c classifier-model-initializer
2022/11/19 20:59:34 NOTICE: Config file "/.rclone.conf" not found - using defaults
2022/11/19 20:59:57 ERROR : GCS bucket seldon-models path v1.15.0-dev/sklearn/iris: error reading source root directory: Get "https://storage.googleapis.com/storage/v1/b/seldon-models/o?alt=json&delimiter=%2F&maxResults=1000&prefix=v1.15.0-dev%2Fsklearn%2Firis%2F&prettyPrint=false": dial tcp 199.36.153.8:443: connect: connection refused
2022/11/19 20:59:57 ERROR : Attempt 1/3 failed with 1 errors and: Get "https://storage.googleapis.com/storage/v1/b/seldon-models/o?alt=json&delimiter=%2F&maxResults=1000&prefix=v1.15.0-dev%2Fsklearn%2Firis%2F&prettyPrint=false": dial tcp 199.36.153.8:443: connect: connection refused
2022/11/19 21:00:17 ERROR : GCS bucket seldon-models path v1.15.0-dev/sklearn/iris: error reading source root directory: Get "https://storage.googleapis.com/storage/v1/b/seldon-models/o?alt=json&delimiter=%2F&maxResults=1000&prefix=v1.15.0-dev%2Fsklearn%2Firis%2F&prettyPrint=false": dial tcp 199.36.153.8:443: connect: connection refused
2022/11/19 21:00:17 ERROR : Attempt 2/3 failed with 1 errors and: Get "https://storage.googleapis.com/storage/v1/b/seldon-models/o?alt=json&delimiter=%2F&maxResults=1000&prefix=v1.15.0-dev%2Fsklearn%2Firis%2F&prettyPrint=false": dial tcp 199.36.153.8:443: connect: connection refused
I used "sidecar injection" with the namespace labeling:
kubectl create namespace test
kubectl label namespace test istio-injection- istio.io/rev=asm-managed --overwrite
kubectl annotate --overwrite namespace test mesh.cloud.google.com/proxy='{"managed":"true"}'
When I don't use sidecar injection, the deployment succeeds, but then I have to inject the proxy manually to get access to the model API. I wonder whether this is the intended behavior or not.

Istio sidecars block connectivity for other init containers. This is, unfortunately, a known issue with Istio sidecars. A potential workaround is to ask Istio not to "filter" traffic going to storage.googleapis.com (i.e. not to route that traffic through Istio's egress), which can be done through Istio's excludeIPRanges flag.
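For illustration, a minimal sketch of that workaround using the standard traffic.sidecar.istio.io/excludeOutboundIPRanges pod annotation, assuming the 199.36.153.8/30 range seen in the error above is the one to bypass (adapt the SeldonDeployment fields to your own manifest):
# Sketch: keep traffic to the Private Google Access VIP range out of sidecar
# redirection so the model initializer can reach storage.googleapis.com directly.
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: iris-model
  namespace: test
spec:
  predictors:
  - name: default
    graph:
      name: classifier
      implementation: SKLEARN_SERVER
      modelUri: gs://seldon-models/v1.15.0-dev/sklearn/iris
    componentSpecs:
    - metadata:
        annotations:
          traffic.sidecar.istio.io/excludeOutboundIPRanges: "199.36.153.8/30"
      spec:
        containers:
        - name: classifier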
In the longer term, due to these shortcomings, Istio seems to be moving away from sidecars and towards its new "ambient mesh".

Related

Unable to connect to the server: dial tcp [::1]:8080: connectex: No connection could be made because the target machine actively refused it. -Microk8s

When I run the command kubectl get pods --all-namespaces I get Unable to connect to the server: dial tcp [::1]:8080: connectex: No connection could be made because the target machine actively refused it.
All of my pods are running and ready 1/1, but when I run microk8s kubectl get service -n kube-system I get
kubernetes-dashboard ClusterIP 10.152.183.132 <none> 443/TCP 6h13m
dashboard-metrics-scraper ClusterIP 10.152.183.10 <none> 8000/TCP 6h13m
I am missing kube-dns even though DNS is enabled. Also, when I start the proxy for all IP addresses with microk8s kubectl proxy --accept-hosts=.* --address=0.0.0.0 & I only get Starting to serve on [::]:8001 and I am missing the [1]84623 job line, for example.
I am using MicroK8s and Multipass with Hyper-V Manager on Windows, and I can't reach the dashboard in the browser. I am also a beginner; this is for my college paper. I saw something similar online, but it was for Azure.
Posting the answer from the comments for better visibility:
The problem was solved by reinstalling Multipass and MicroK8s. Now it works.
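Before resorting to a full reinstall, it can be worth confirming that the DNS add-on actually deployed; a small sketch using standard MicroK8s commands (not part of the original answer):
# Check add-on status and (re-)enable DNS if it is missing:
microk8s status --wait-ready
microk8s enable dns

# CoreDNS should then show up in kube-system under the kube-dns label:
microk8s kubectl get pods,svc -n kube-system -l k8s-app=kube-dns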

kubectl port-forward does not return when connection lost

The help for kubectl port-forward says "The forwarding session ends when the selected pod terminates, and a rerun of the command is needed to resume forwarding."
Although it does not auto-reconnect when the pod terminates, the command does not return either; it just hangs with errors:
E0929 11:57:50.187945 62466 portforward.go:400] an error occurred forwarding 8000 -> 8080: error forwarding port 8080 to pod a1fe1d167955e1c345e0f8026c4efa70a84b9d46029037ebc5b69d9da5d30249, uid : network namespace for sandbox "a1fe1d167955e1c345e0f8026c4efa70a84b9d46029037ebc5b69d9da5d30249" is closed
Handling connection for 8000
E0929 12:02:44.505938 62466 portforward.go:400] an error occurred forwarding 8000 -> 8080: error forwarding port 8080 to pod a1fe1d167955e1c345e0f8026c4efa70a84b9d46029037ebc5b69d9da5d30249, uid : failed to find sandbox "a1fe1d167955e1c345e0f8026c4efa70a84b9d46029037ebc5b69d9da5d30249" in store: not found
I would like it to return so that I can handle the error and write a script that reruns it.
Is there any way or workaround to do this?
Based on the information described on the Kubernetes issues page on GitHub, I suppose this is normal behavior in your case: the port-forward connection cannot be canceled on pod deletion, since there is no connection management inside the REST connectors on the server side.
A connection is maintained from kubectl all the way through to the kubelet and hangs open even if the pod no longer exists. For port-forward, a websocket connection is proxied kubectl -> kube-apiserver -> kubelet.
A recursive function?
# Re-invokes itself whenever port-forward exits (e.g. the pod was recreated):
kpf(){ kubectl port-forward $type/$object $LOCAL:$REMOTE $ns || kpf; }
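For scripting, a loop-based wrapper may be easier to reason about than recursion; a sketch (the namespace, target, and ports below are placeholders, not taken from the question):
#!/usr/bin/env bash
# Restart kubectl port-forward whenever it exits, e.g. because the pod was deleted.
NAMESPACE="default"
TARGET="deployment/my-app"   # hypothetical target
LOCAL_PORT=8000
REMOTE_PORT=8080

while true; do
  kubectl port-forward -n "$NAMESPACE" "$TARGET" "$LOCAL_PORT:$REMOTE_PORT"
  echo "port-forward exited with status $?; retrying in 2 seconds..." >&2
  sleep 2
done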

Why does search name resolution fail for elasticsearch-master-headless on Kubernetes 1.16?

I'm trying to get Elasticsearch running on Kubernetes 1.16 with Helm 3 on GKE. I'm aware that neither Kubernetes 1.16 nor Helm 3 is supported yet; I want to prepare a PR to make the chart compatible. I'm using the Helm charts from https://github.com/elastic/helm-charts.
If I use the original chart 7.6.1, pod creation fails with create Pod elasticsearch-master-0 in StatefulSet elasticsearch-master failed error: pods "elasticsearch-master-0" is forbidden: unable to validate against any pod security policy: [spec.volumes[1]: Invalid value: "projected": projected volumes are not allowed to be used]. Therefore I created the following patch:
diff --git a/elasticsearch/values.yaml b/elasticsearch/values.yaml
index 053c020..fd9c37b 100755
--- a/elasticsearch/values.yaml
+++ b/elasticsearch/values.yaml
@@ -107,6 +107,7 @@ podSecurityPolicy:
     - secret
     - configMap
     - persistentVolumeClaim
+    - projected
 persistence:
   enabled: true
With this patch on master/d9ccb5a and on tag 7.6.1 (I tried both), the pods quickly get into an unhealthy state due to failed to resolve host [elasticsearch-master-headless], caused by a java.net.UnknownHostException: elasticsearch-master-headless.
I don't understand why the name resolution doesn't work, since afaik there's no change introduced in 1.16 that affects resolution of Kubernetes names. If I try to ping elasticsearch-master-headless from a shell in the pod started with kubectl exec, I can't reach it either.
I tried to contact the nameserver in /etc/resolv.conf with telnet because it allows specifying a specific port:
[elasticsearch@elasticsearch-master-1 ~]$ cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local us-central1-a.c.myproject.internal c.myproject.internal google.internal
nameserver 10.23.240.10
options ndots:5
[elasticsearch@elasticsearch-master-1 ~]$ telnet 10.23.240.10
Trying 10.23.240.10...
^C
[elasticsearch@elasticsearch-master-1 ~]$ telnet 10.23.240.10 53
Trying 10.23.240.10...
telnet: connect to address 10.23.240.10: Connection refused
I obfuscated the project ID with myproject.
The patch is already proposed to be merged upstream together with other changes at https://github.com/elastic/helm-charts/pull/496.
This is caused by the kube-dns pod crashing with
F0315 20:01:02.464839 1 server.go:61] Failed to create a kubernetes client: open /var/run/secrets/kubernetes.io/serviceaccount/token: permission denied
Since Kubernetes 1.16 is only available in the rapid channel of GKE and kube-dns is a system pod, I consider this a bug.
I'll update this answer if I find the energy to file one.
Chances are that a firewall (firewalld) is blocking 53/udp and 53/tcp, or there is an issue with the CoreDNS pod in the cluster where you are running the test.
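As a starting point for either theory, a few hedged checks (plain kubectl commands; the k8s-app=kube-dns label is the usual default for both kube-dns and CoreDNS and may differ in your cluster):
# Is the cluster DNS deployment healthy, and what do its logs say?
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
kubectl logs -n kube-system -l k8s-app=kube-dns --all-containers --tail=50

# Can a throwaway pod resolve the headless service at all?
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup elasticsearch-master-headless.default.svc.cluster.local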

How can I troubleshoot/fix an issue interacting with a running Kubernetes pod (timeout error)?

I have two EC2 instances, one running a Kubernetes master node and the other running a worker node. I can successfully create a pod from a deployment file that pulls a Docker image, and it starts with a status of "Running". However, when I try to interact with it I get a timeout error.
Ex: kubectl logs <pod-name> -v6
Output:
Config loaded from file /home/ec2-user/.kube/config
GET https://<master-node-ip>:6443/api/v1/namespaces/default/pods/<pod-name> 200 OK in 11 milliseconds
GET https://<master-node-ip>:6443/api/v1/namespaces/default/pods/<pod-name>/log 500 Internal Server Error in 30002 milliseconds
Server response object: [{"status": "Failure", "message": "Get https://<worker-node-ip>:10250/containerLogs/default/<pod-name>/<container-name>: dial tcp <worker-node-ip>:10250: i/o timeout", "code": 500 }]
I can get information about the pod by running kubectl describe pod <pod-name> and confirm the status as Running. Any ideas on how to identify exactly what is causing this error and/or how to fix it?
Probably you didn't install a network add-on in your Kubernetes cluster. It's not included in a kubeadm installation, but it's required for communication between pods scheduled on different nodes. The most popular add-ons are Calico and Flannel. As you already have a cluster, you may want to choose the network add-on that uses the same subnet as the one you specified with kubeadm init --pod-network-cidr=xx.xx.xx.xx/xx during cluster initialization:
192.168.0.0/16 is default for Calico network addon
10.244.0.0/16 is default for Flannel network addon
You can change it by downloading the corresponding YAML file and replacing the default subnet with the subnet you want, then applying it with kubectl apply -f filename.yaml.
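For example, a sketch of the usual flow; the manifest URLs below are the commonly documented ones and may have moved since:
# Flannel, whose default pod CIDR is 10.244.0.0/16:
kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml

# Or Calico, whose default pod CIDR is 192.168.0.0/16; download it first so you
# can edit CALICO_IPV4POOL_CIDR to match your --pod-network-cidr if it differs:
curl -LO https://docs.projectcalico.org/manifests/calico.yaml
kubectl apply -f calico.yaml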

IBM Cloud Private monitoring gets 502 bad gateway

The following containers are not starting after installing IBM Cloud Private. I had previously installed ICP without a Management node and was doing a new install after an uninstall, and I restarted the Docker service on all nodes.
I installed a second time with a Management node defined, Master/Proxy on a single node, and two Worker nodes.
Selecting the menu option Platform / Monitoring returns 502 Bad Gateway.
Event messages from deployed containers
Deployment - monitoring-prometheus
TYPE      SOURCE              COUNT   REASON             MESSAGE
Warning   default-scheduler   2113    FailedScheduling   No nodes are available that match all of the following predicates:: MatchNodeSelector (3), NoVolumeNodeConflict (4).

Deployment - monitoring-grafana
TYPE      SOURCE              COUNT   REASON             MESSAGE
Warning   default-scheduler   2097    FailedScheduling   No nodes are available that match all of the following predicates:: MatchNodeSelector (3), NoVolumeNodeConflict (4).

Deployment - rootkit-annotator
TYPE      SOURCE                   COUNT   REASON       MESSAGE
Normal    kubelet 169.53.226.142   125     Pulled       Container image "ibmcom/rootkit-annotator:20171011" already present on machine
Normal    kubelet 169.53.226.142   125     Created      Created container
Normal    kubelet 169.53.226.142   125     Started      Started container
Warning   kubelet 169.53.226.142   2770    BackOff      Back-off restarting failed container
Warning   kubelet 169.53.226.142   2770    FailedSync   Error syncing pod
The management console sometimes displays a 502 Bad Gateway Error after installation or rebooting the master node. If you recently installed IBM Cloud Private, wait a few minutes and reload the page.
If you rebooted the master node, take the following steps:
1. Configure the kubectl command line interface. See Accessing your IBM Cloud Private cluster by using the kubectl CLI.
2. Obtain the IP addresses of the icp-ds pods. Run the following command:
kubectl get pods -o wide -n kube-system | grep "icp-ds"
The output resembles the following text:
icp-ds-0 1/1 Running 0 1d 10.1.231.171 10.10.25.134
In this example, 10.1.231.171 is the IP address of the pod. In high availability (HA) environments, an icp-ds pod exists for each master node.
3. From the master node, ping the icp-ds pods. Check the IP address for each icp-ds pod by running the following command for each IP address:
ping 10.1.231.171
If the output resembles the following text, you must delete the pod:
connect: Invalid argument
4. Delete each pod that you cannot reach:
kubectl delete pods icp-ds-0 -n kube-system
In this example, icp-ds-0 is the name of the unresponsive pod. In HA installations, you might have to delete the pod for each master node.
5. Obtain the IP address of the replacement pod or pods. Run the following command:
kubectl get pods -o wide -n kube-system | grep "icp-ds"
The output resembles the following text:
icp-ds-0 1/1 Running 0 1d 10.1.231.172 10.10.2
6. Ping the pods again. Check the IP address for each icp-ds pod by running the following command for each IP address:
ping 10.1.231.172
If you can reach all icp-ds pods, you can access the IBM Cloud Private management console when that pod enters the available state.
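The check-ping-delete loop above can also be scripted; a rough sketch, assuming kubectl and ping are available on the master node (pod names and IPs are read from the live cluster, nothing here comes from the original procedure):
#!/usr/bin/env bash
# For each icp-ds pod, ping its IP from the master node and delete any pod
# that cannot be reached so that it gets recreated.
kubectl get pods -n kube-system -o wide --no-headers | awk '/icp-ds/ {print $1, $6}' |
while read -r pod ip; do
  if ! ping -c 1 -W 2 "$ip" > /dev/null 2>&1; then
    echo "Cannot reach $pod at $ip; deleting it"
    kubectl delete pod "$pod" -n kube-system
  fi
done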

Resources