Fail to connect to kubectl from client-go - /serviceaccount/token: no such file - go

I am using golang lib client-go to connect to a running local kubrenets. To start with I took code from the example: out-of-cluster-client-configuration.
Running a code like this:
$ KUBERNETES_SERVICE_HOST=localhost KUBERNETES_SERVICE_PORT=6443 go run ./main.go results in following error:
panic: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
goroutine 1 [running]:
/var/run/secrets/kubernetes.io/serviceaccount/
I am not quite sure which part of configuration I am missing. I've researched following links :
https://kubernetes.io/docs/reference/access-authn-authz/authentication/#client-go-credential-plugins
https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/
But with no luck.
I guess I need to either let the client-go know which token/serviceAccount to use, or configure kubectl in a way that everyone can connect to its api.
Here's status of my kubectl though some commands results:
$ kubectl config view
apiVersion: v1
clusters:
- cluster:
insecure-skip-tls-verify: true
server: https://localhost:6443
name: docker-for-desktop-cluster
contexts:
- context:
cluster: docker-for-desktop-cluster
user: docker-for-desktop
name: docker-for-desktop
current-context: docker-for-desktop
kind: Config
preferences: {}
users:
- name: docker-for-desktop
user:
client-certificate-data: REDACTED
client-key-data: REDACTED
$ kubectl get serviceAccounts
NAME SECRETS AGE
default 1 3d
test-user 1 1d
$ kubectl describe serviceaccount test-user
Name: test-user
Namespace: default
Labels: <none>
Annotations: <none>
Image pull secrets: <none>
Mountable secrets: test-user-token-hxcsk
Tokens: test-user-token-hxcsk
Events: <none>
$ kubectl get secret test-user-token-hxcsk -o yaml
apiVersion: v1
data:
ca.crt: LS0tLS1CRUdJTiBDRVJUSUZJQ0......=
namespace: ZGVmYXVsdA==
token: ZXlKaGJHY2lPaUpTVXpJMU5pSX......=
kind: Secret
metadata:
annotations:
kubernetes.io/service-account.name: test-user
kubernetes.io/service-account.uid: 984b359a-6bd3-11e8-8600-XXXXXXX
creationTimestamp: 2018-06-09T10:55:17Z
name: test-user-token-hxcsk
namespace: default
resourceVersion: "110618"
selfLink: /api/v1/namespaces/default/secrets/test-user-token-hxcsk
uid: 98550de5-6bd3-11e8-8600-XXXXXX
type: kubernetes.io/service-account-token

This answer could be a little outdated but I will try to give more perspective/baseline for future readers that encounter the same/similar problem.
TL;DR
The following error:
panic: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
is most likely connected with the lack of token in the /var/run/secrets/kubernetes.io/serviceaccount location when using in-cluster-client-configuration. Also, it could be related to the fact of using in-cluster-client-configuration code outside of the cluster (for example running this code directly on a laptop or in pure Docker container).
You can check following commands to troubleshoot your issue further (assuming this code is running inside a Pod):
$ kubectl get serviceaccount X -o yaml:
look for: automountServiceAccountToken: false
$ kubectl describe pod XYZ
look for: containers.mounts and volumeMounts where Secret is mounted
Citing the official documentation:
Authenticating inside the cluster
This example shows you how to configure a client with client-go to authenticate to the Kubernetes API from an application running inside the Kubernetes cluster.
client-go uses the Service Account token mounted inside the Pod at the /var/run/secrets/kubernetes.io/serviceaccount path when the rest.InClusterConfig() is used.
-- Github.com: Kubernetes: client-go: Examples: in cluster client configuration
If you are authenticating to the Kubernetes API with ~/.kube/config you should be using the out-of-cluster-client-configuration.
Additional information:
I've added additional information for more reference on further troubleshooting when the code is run inside of a Pod.
automountServiceAccountToken: false
In version 1.6+, you can opt out of automounting API credentials for a service account by setting automountServiceAccountToken: false on the service account:
apiVersion: v1
kind: ServiceAccount
metadata:
name: go-serviceaccount
automountServiceAccountToken: false
In version 1.6+, you can also opt out of automounting API credentials for a particular pod:
apiVersion: v1
kind: Pod
metadata:
name: sdk
spec:
serviceAccountName: go-serviceaccount
automountServiceAccountToken: false
-- Kubernetes.io: Docs: Tasks: Configure pod container: Configure service account
$ kubectl describe pod XYZ:
When the servicAccount token is mounted, the Pod definition should look like this:
<-- OMITTED -->
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from go-serviceaccount-token-4rst8 (ro)
<-- OMITTED -->
Volumes:
go-serviceaccount-token-4rst8:
Type: Secret (a volume populated by a Secret)
SecretName: go-serviceaccount-token-4rst8
Optional: false
If it's not:
<-- OMITTED -->
Mounts: <none>
<-- OMITTED -->
Volumes: <none>
Additional resources:
Kubernetes.io: Docs: Reference: Access authn authz: Authentication

Just to make it clear, in case it helps you further debug it: the problem has nothing to do with Go or your code, and everything to do with the Kubernetes node not being able to get a token from the Kubernetes master.
In kubectl config view, clusters.cluster.server should probably point at an IP address that the node can reach.
It needs to access the CA, i.e., the master, in order to provide that token, and I'm guessing it fails to for that reason.
kubectl describe <your_pod_name> would probably tell you what the problem was acquiring the token.
Since you assumed the problem was Go/your code and focused on that, you neglected to provide more information about your Kubernetes setup, which makes it more difficult for me to give you a better answer than my guess above ;-)
But I hope it helps!

Related

ElasticSearch CrashLoopBackoff when deploying with ECK in Kubernetes OKD 4.11

I am running Kubernetes using OKD 4.11 (running on vSphere) and have validated the basic functionality (including dyn. volume provisioning) using applications (like nginx).
I also applied
oc adm policy add-scc-to-group anyuid system:authenticated
to allow authenticated users to use anyuid (which seems to have been required to deploy the nginx example I was testing with).
Then I installed ECK using this quickstart with kubectl to install the CRD and RBAC manifests. This seems to have worked.
Then I deployed the most basic ElasticSearch quickstart example with kubectl apply -f quickstart.yaml using this manifest:
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
name: quickstart
spec:
version: 8.4.2
nodeSets:
- name: default
count: 1
config:
node.store.allow_mmap: false
The deployment proceeds as expected, pulling image and starting container, but ends in a CrashLoopBackoff with the following error from ElasticSearch at the end of the log:
"elasticsearch.cluster.name":"quickstart",
"error.type":"java.lang.IllegalStateException",
"error.message":"failed to obtain node locks, tried
[/usr/share/elasticsearch/data]; maybe these locations
are not writable or multiple nodes were started on the same data path?"
Looking into the storage, the PV and PVC are created successfully, the output of kubectl get pv,pvc,sc -A -n my-namespace is:
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
persistentvolume/pvc-9d7b57db-8afd-40f7-8b3d-6334bdc07241 1Gi RWO Delete Bound my-namespace/elasticsearch-data-quickstart-es-default-0 thin 41m
NAMESPACE NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
my-namespace persistentvolumeclaim/elasticsearch-data-quickstart-es-default-0 Bound pvc-9d7b57db-8afd-40f7-8b3d-6334bdc07241 1Gi RWO thin 41m
NAMESPACE NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
storageclass.storage.k8s.io/thin (default) kubernetes.io/vsphere-volume Delete Immediate false 19d
storageclass.storage.k8s.io/thin-csi csi.vsphere.vmware.com Delete WaitForFirstConsumer true 19d
Looking at the pod yaml, it appears that the volume is correctly attached :
volumes:
- name: elasticsearch-data
persistentVolumeClaim:
claimName: elasticsearch-data-quickstart-es-default-0
- name: downward-api
downwardAPI:
items:
- path: labels
fieldRef:
apiVersion: v1
fieldPath: metadata.labels
defaultMode: 420
....
volumeMounts:
...
- name: elasticsearch-data
mountPath: /usr/share/elasticsearch/data
I cannot understand why the volume would be read-only or rather why ES cannot create the lock.
I did find this similar issue, but I am not sure how to apply the UID permissions (in general I am fairly naive about the way permissions work in OKD) when when working with ECK.
Does anyone with deeper K8s / OKD or ECK/ElasticSearch knowledge have an idea how to better isolate and/or resolve this issue?
Update: I believe this has something to do with this issue and am researching the optionas related to OKD.
For posterity, the ECK starts an init container that should take care of the chown on the data volume, but can only do so if it is running as root.
The resolution for me was documented here:
https://repo1.dso.mil/dsop/elastic/elasticsearch/elasticsearch/-/issues/7
The manifest now looks like this:
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
name: quickstart
spec:
version: 8.4.2
nodeSets:
- name: default
count: 1
config:
node.store.allow_mmap: false
# run init container as root to chown the volume to uid 1000
podTemplate:
spec:
securityContext:
runAsUser: 1000
runAsGroup: 0
initContainers:
- name: elastic-internal-init-filesystem
securityContext:
runAsUser: 0
runAsGroup: 0
And the pod starts up and can write to the volume as uid 1000.

Record Kubernetes container resource utilization data

I'm doing a perf test for web server which is deployed on EKS cluster. I'm invoking the server using jmeter with different conditions (like varying thread count, payload size, etc..).
So I want to record kubernetes perf data with the timestamp so that I can analyze these data with my jmeter output (JTL).
I have been digging through the internet to find a way to record kubernetes perf data. But I was unable to find a proper way to do that.
Can experts please provide me a standard way to do this??
Note: I have a multi-container pod also.
In line with #Jonas comment
This is the quickest way of installing Prometheus in you K8 cluster. Added Details in the answer as it was impossible to put the commands in a readable format in Comment.
Add bitnami helm repo.
helm repo add bitnami https://charts.bitnami.com/bitnami
Install helmchart for promethus
helm install my-release bitnami/kube-prometheus
Installation output would be:
C:\Users\ameena\Desktop\shine\Article\K8\promethus>helm install my-release bitnami/kube-prometheus
NAME: my-release
LAST DEPLOYED: Mon Apr 12 12:44:13 2021
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
** Please be patient while the chart is being deployed **
Watch the Prometheus Operator Deployment status using the command:
kubectl get deploy -w --namespace default -l app.kubernetes.io/name=kube-prometheus-operator,app.kubernetes.io/instance=my-release
Watch the Prometheus StatefulSet status using the command:
kubectl get sts -w --namespace default -l app.kubernetes.io/name=kube-prometheus-prometheus,app.kubernetes.io/instance=my-release
Prometheus can be accessed via port "9090" on the following DNS name from within your cluster:
my-release-kube-prometheus-prometheus.default.svc.cluster.local
To access Prometheus from outside the cluster execute the following commands:
echo "Prometheus URL: http://127.0.0.1:9090/"
kubectl port-forward --namespace default svc/my-release-kube-prometheus-prometheus 9090:9090
Watch the Alertmanager StatefulSet status using the command:
kubectl get sts -w --namespace default -l app.kubernetes.io/name=kube-prometheus-alertmanager,app.kubernetes.io/instance=my-release
Alertmanager can be accessed via port "9093" on the following DNS name from within your cluster:
my-release-kube-prometheus-alertmanager.default.svc.cluster.local
To access Alertmanager from outside the cluster execute the following commands:
echo "Alertmanager URL: http://127.0.0.1:9093/"
kubectl port-forward --namespace default svc/my-release-kube-prometheus-alertmanager 9093:9093
Follow the commands to forward the UI to localhost.
echo "Prometheus URL: http://127.0.0.1:9090/"
kubectl port-forward --namespace default svc/my-release-kube-prometheus-prometheus 9090:9090
Open the UI in browser: http://127.0.0.1:9090/classic/graph
Annotate the pods for sending the metrics.
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-deployment
spec:
selector:
matchLabels:
app: nginx
replicas: 4 # Update the replicas from 2 to 4
template:
metadata:
labels:
app: nginx
annotations:
prometheus.io/scrape: 'true'
prometheus.io/port: '9102'
spec:
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
In the ui put appropriate filters and start observing the crucial parameter such as memory CPU etc. UI supports autocomplete so it will not be that difficult to figure out things.
Regards

Kubernetes Endpoint created for Kafka but not reflecting in POD

In Kubernetes cluster I have created Endpoint pointing to Kafka cluster. Endpoint created successfully.
Name - kafka
Endpoint - X.X.X.X:9092
In my Spring Boot application's deployment yaml I have kept environment variable BROKER_IP. For this environment variable I have pointed:
env:
- name: BROKER_IP
value: kafka
The POD is in Error state. In my bootstrap-server I am getting kafka and not the actual Endpoint that was created. Any thoughts?
UPDATE - Just tried kafka:9092 and it worked. So wondering does the ENDPOINT maps to IP only and not the Port? Is my understanding correct??
Is it possible that you forgot to create the Service object matching the Endpoints? Because you are providing the ip-port pairs yourself the Service would need to be selectorless.
This works for me:
kind: Endpoints
apiVersion: v1
metadata:
name: kafka
subsets:
- addresses: [{ip: "1.2.3.4"}]
ports: [{port: 9092}]
---
kind: Service
apiVersion: v1
metadata:
name: kafka
spec:
ports: [{port: 9092}]
Testing it:
$ kubectl run kafka-dns-test --image=busybox --attach --rm --restart=Never -- nslookup kafka
If you don't see a command prompt, try pressing enter.
Server: 10.96.0.10
Address: 10.96.0.10:53
Name: kafka.default.svc.cluster.local
Address: 10.96.220.40
Successful lookup, ignore extra *** Can't find xxx: No answer messages
Also, because there is a Service object you get some environment variables in your Pods (without having to declare them):
KAFKA_PORT='tcp://10.96.220.40:9092'
KAFKA_PORT_9092_TCP='tcp://10.96.220.40:9092'
KAFKA_PORT_9092_TCP_ADDR='10.96.220.40'
KAFKA_PORT_9092_TCP_PORT='9092'
KAFKA_PORT_9092_TCP_PROTO='tcp'
KAFKA_SERVICE_HOST='10.96.220.40'
KAFKA_SERVICE_PORT='9092'
But the most flexible way to use a Service is still to use the dns name (kafka in this case).

Why Prometheus pod pending after setup it by helm in Kubernetes cluster on Rancher server?

Installed Rancher server and 2 Rancher agents in Vagrant. Then switch to K8S environment from Rancher server.
On Rancher server host, installed kubectl and helm. Then installed Prometheus by Helm:
helm install stable/prometheus
Now check the status from Kubernetes dashboard, there are 2 pods pending:
It noticed PersistentVolumeClaim is not bound, so aren't the K8S components been installed default with Rancher server?
(another name, same issue)
Edit
> kubectl get pvc
NAME STATUS VOLUME CAPACITY
ACCESSMODES STORAGECLASS AGE
voting-prawn-prometheus-alertmanager Pending 6h
voting-prawn-prometheus-server Pending 6h
> kubectl get pv
No resources found.
Edit 2
$ kubectl describe pvc voting-prawn-prometheus-alertmanager
Name: voting-prawn-prometheus-alertmanager
Namespace: default
StorageClass:
Status: Pending
Volume:
Labels: app=prometheus
chart=prometheus-4.6.9
component=alertmanager
heritage=Tiller
release=voting-prawn
Annotations: <none>
Capacity:
Access Modes:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal FailedBinding 12s (x10 over 2m) persistentvolume-controller no persistent volumes available for this claim and no storage class is set
$ kubectl describe pvc voting-prawn-prometheus-server
Name: voting-prawn-prometheus-server
Namespace: default
StorageClass:
Status: Pending
Volume:
Labels: app=prometheus
chart=prometheus-4.6.9
component=server
heritage=Tiller
release=voting-prawn
Annotations: <none>
Capacity:
Access Modes:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal FailedBinding 12s (x14 over 3m) persistentvolume-controller no persistent volumes available for this claim and no storage class is set
I had same issues as you. I found two ways to solve this:
edit values.yaml under persistentVolumes.enabled=false this will allow you to use emptyDir "this applies to Prometheus-Server and AlertManager"
If you can't change values.yaml you will have to create the PV before deploying the chart so that the pod can bind to the volume otherwise it will stay in the pending state forever
PV are cluster scoped and PVC are namespaced scope.
If your application running in a different namespace and PVC in a different namespace, it can be issue.
If yes, use RBAC to give proper permissions, or put app and PVC in same namespace.
Can you make sure PV which is getting created from Storage class is the default SC of the cluster ?
I found that i was missing storage class and storage volumes. fixed similar problems on my cluster by first creating a storage class.
kubectl apply -f storageclass.ymal
storageclass.ymal:
{
"kind": "StorageClass",
"apiVersion": "storage.k8s.io/v1",
"metadata": {
"name": "local-storage",
"annotations": {
"storageclass.kubernetes.io/is-default-class": "true"
}
},
"provisioner": "kubernetes.io/no-provisioner",
"reclaimPolicy": "Delete"
and the using the storage class when install Prometheus with helm
helm install stable/prometheus --set server.storageClass=local-storage
and i was also forced to create a volume for Prometheus to bind to
kubectl apply -f prometheusVolume.yaml
prometheusVolume.yaml:
apiVersion: v1
kind: PersistentVolume
metadata:
name: prometheus-volume
spec:
storageClassName: local-storage
capacity:
storage: 2Gi #Size of the volume
accessModes:
- ReadWriteOnce #type of access
hostPath:
path: "/mnt/data" #host location
You could use other storage classes, found that there as a lot to chose between but then there might be other steps involved to get it working.

How do I access this Kubernetes service via kubectl proxy?

I want to access my Grafana Kubernetes service via the kubectl proxy server, but for some reason it won't work even though I can make it work for other services. Given the below service definition, why is it not available on http://localhost:8001/api/v1/proxy/namespaces/monitoring/services/grafana?
grafana-service.yaml
apiVersion: v1
kind: Service
metadata:
namespace: monitoring
name: grafana
labels:
app: grafana
spec:
type: NodePort
ports:
- name: web
port: 3000
protocol: TCP
nodePort: 30902
selector:
app: grafana
grafana-deployment.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
namespace: monitoring
name: grafana
spec:
replicas: 1
template:
metadata:
labels:
app: grafana
spec:
containers:
- name: grafana
image: grafana/grafana:4.1.1
env:
- name: GF_AUTH_BASIC_ENABLED
value: "true"
- name: GF_AUTH_ANONYMOUS_ENABLED
value: "true"
- name: GF_SECURITY_ADMIN_USER
valueFrom:
secretKeyRef:
name: grafana-credentials
key: user
- name: GF_SECURITY_ADMIN_PASSWORD
valueFrom:
secretKeyRef:
name: grafana-credentials
key: password
volumeMounts:
- name: grafana-storage
mountPath: /var/grafana-storage
ports:
- name: web
containerPort: 3000
resources:
requests:
memory: 100Mi
cpu: 100m
limits:
memory: 200Mi
cpu: 200m
- name: grafana-watcher
image: quay.io/coreos/grafana-watcher:v0.0.5
args:
- '--watch-dir=/var/grafana-dashboards'
- '--grafana-url=http://localhost:3000'
env:
- name: GRAFANA_USER
valueFrom:
secretKeyRef:
name: grafana-credentials
key: user
- name: GRAFANA_PASSWORD
valueFrom:
secretKeyRef:
name: grafana-credentials
key: password
resources:
requests:
memory: "16Mi"
cpu: "50m"
limits:
memory: "32Mi"
cpu: "100m"
volumeMounts:
- name: grafana-dashboards
mountPath: /var/grafana-dashboards
volumes:
- name: grafana-storage
emptyDir: {}
- name: grafana-dashboards
configMap:
name: grafana-dashboards
The error I'm seeing when accessing the above URL is "no endpoints available for service "grafana"", error code 503.
With Kubernetes 1.10 the proxy URL should be slighly different, like this:
http://localhost:8080/api/v1/namespaces/default/services/SERVICE-NAME:PORT-NAME/proxy/
Ref: https://kubernetes.io/docs/tasks/access-application-cluster/access-cluster/#manually-constructing-apiserver-proxy-urls
As Michael says, quite possibly your labels or namespaces are mismatching. However in addition to that, keep in mind that even when you fix the endpoint, the url you're after (http://localhost:8001/api/v1/proxy/namespaces/monitoring/services/grafana) might not work correctly.
Depending on your root_url and/or static_root_path grafana configuration settings, when trying to login you might get grafana trying to POST to http://localhost:8001/login and get a 404.
Try using kubectl port-forward instead:
kubectl -n monitoring port-forward [grafana-pod-name] 3000
then access grafana via http://localhost:3000/
https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/
The issue is that Grafana's port is named web, and as a result one needs to append :web to the kubectl proxy URL: http://localhost:8001/api/v1/proxy/namespaces/monitoring/services/grafana:web.
An alternative, is to instead not name the Grafana port, because then you don't have to append :web to the kubectl proxy URL for the service: http://localhost:8001/api/v1/proxy/namespaces/monitoring/services/grafana:web. I went with this option in the end since it's easier.
There are a few factors that might be causing this issue.
The service expects to find one or more supporting endpoints, which it discovers through matching rules on the labels. If the labels don't align, then the service won't find endpoints, and the network gateway function performed by the service will result in 503.
The port declared by the POD and the process within the container are misaligned from the --target-port expected by the service.
Either one of these might generate the error. Let's take a closer look.
First, kubectl describe the service:
$ kubectl describe svc grafana01-grafana-3000
Name: grafana01-grafana-3000
Namespace: default
Labels: app=grafana01-grafana
chart=grafana-0.3.7
component=grafana
heritage=Tiller
release=grafana01
Annotations: <none>
Selector: app=grafana01-grafana,component=grafana,release=grafana01
Type: NodePort
IP: 10.0.0.197
Port: <unset> 3000/TCP
NodePort: <unset> 30905/TCP
Endpoints: 10.1.45.69:3000
Session Affinity: None
Events: <none>
Notice that my grafana service has 1 endpoint listed (there could be multiple). The error above in your example indicates that you won't have endpoints listed here.
Endpoints: 10.1.45.69:3000
Let's take a look next at the selectors. In the example above, you can see I have 3 selector labels on my service:
Selector: app=grafana01-grafana,component=grafana,release=grafana01
I'll kubectl describe my pods next:
$ kubectl describe pod grafana
Name: grafana01-grafana-1843344063-vp30d
Namespace: default
Node: 10.10.25.220/10.10.25.220
Start Time: Fri, 14 Jul 2017 03:25:11 +0000
Labels: app=grafana01-grafana
component=grafana
pod-template-hash=1843344063
release=grafana01
...
Notice that the labels on the pod align correctly, hence my service finds pods which provide endpoints which are load balanced against by the service. Verify that this part of the chain isn't broken in your environment.
If you do find that the labels are correct, you may still have a disconnect in that the grafana process running within the container within the pod is running on a different port than you expect.
$ kubectl describe pod grafana
Name: grafana01-grafana-1843344063-vp30d
...
Containers:
grafana:
Container ID: docker://69f11b7828c01c5c3b395c008d88e8640c5606f4d865107bf4b433628cc36c76
Image: grafana/grafana:latest
Image ID: docker-pullable://grafana/grafana#sha256:11690015c430f2b08955e28c0e8ce7ce1c5883edfc521b68f3fb288e85578d26
Port: 3000/TCP
State: Running
Started: Fri, 14 Jul 2017 03:25:26 +0000
If for some reason, your port under the container listed a different value, then the service is effectively load balancing against an invalid endpoint.
For example, if it listed port 80:
Port: 80/TCP
Or was an empty value
Port:
Then even if your label selectors were correct, the service would never find a valid response from the pod and would remove the endpoint from the rotation.
I suspect your issue is the first problem above (mismatched label selectors).
If both the label selectors and ports align, then you might have a problem with the MTU setting between nodes. In some cases, if the MTU used by your networking layer (like calico) is larger than the MTU of the supporting network, then you'll never get a valid response from the endpoint. Typically, this last potential issue will manifest itself as a timeout rather than a 503 though.
Your Deployment may not have a label app: grafana, or be in another namespace. Could you also post the Deployment definition?

Resources