Packetbeat does not add Kubernetes metadata - elasticsearch

I've started a minikube cluster (Kubernetes 1.18.3) to test ECK, specifically Packetbeat. The minikube profile is called "packetbeat" (important, as that is also the hostname of the VirtualBox VM), and I followed the ECK quickstart to get it up and running. Elasticsearch (single node) and Kibana are running fine and Packetbeat is gathering flows, but I'm unable to make it add the Kubernetes metadata to the fields.
I'm working in the default namespace and created a ClusterRoleBinding to the view role for the default ServiceAccount in that namespace. This works: without it, Packetbeat reports that it is unable to list the Pods on the API server.
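A binding of the kind described above might look like the following sketch. The binding name default-view is hypothetical; the rest assumes the built-in view ClusterRole and the default ServiceAccount in the default namespace, as stated in the question.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: default-view          # hypothetical name, chosen for illustration
subjects:
- kind: ServiceAccount
  name: default               # the ServiceAccount the Packetbeat pods run as
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view                  # built-in read-only role, allows listing Pods
```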
This is the Beat config I'm using to make ECK deploy packetbeat:
apiVersion: beat.k8s.elastic.co/v1beta1
kind: Beat
metadata:
  name: packetbeat
spec:
  type: packetbeat
  version: 7.9.0
  elasticsearchRef:
    name: quickstart
  kibanaRef:
    name: kibana
  config:
    packetbeat.interfaces.device: any
    packetbeat.protocols:
    - type: http
      ports: [80, 8000, 8080, 9200]
    - type: tls
      ports: [443]
    packetbeat.flows:
      timeout: 30s
      period: 10s
    processors:
    - add_kubernetes_metadata: {}
  daemonSet:
    podTemplate:
      spec:
        terminationGracePeriodSeconds: 30
        hostNetwork: true
        automountServiceAccountToken: true # some older Beat versions are depending on this settings presence in k8s context
        dnsPolicy: ClusterFirstWithHostNet
        containers:
        - name: packetbeat
          securityContext:
            runAsUser: 0
            capabilities:
              add:
              - NET_ADMIN
(This is mostly a slightly modified example from the ECK example page.) However, this is not working at all. I tried it with "add_kubernetes_metadata: {}" first, but that will error with the message:
2020-08-19T14:23:38.550Z ERROR [kubernetes] kubernetes/util.go:117
kubernetes: Querying for pod failed with error: pods "packetbeat" not
found {"libbeat.processor": "add_kubernetes_metadata"}
This message goes away when I add "host: packetbeat" to the processor. I no longer get an error, but I'm not getting the Kubernetes metadata either. I'm mostly interested in the namespace field, but none is added. I do not see any additional errors in the log; at the moment it just reports monitoring details every 30 seconds.
What am I doing wrong? Is there any more information I can provide to help debug this?

So the docs are just unclear. Although they do not explicitly state it, you do need to add indexers and matchers. My understanding was that there are "default" ones (as you can disable those), but that does not seem to be the case for Packetbeat. Adding the indexers and matchers as per the example in the docs makes the Kubernetes metadata part of the data.
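As a rough sketch, the processors section of the Beat config could then look like this. The ip_and_port indexer and the field_format matcher are taken from the add_kubernetes_metadata documentation; the field names in the format string are an assumption about which event fields should be matched against the pod IP and port, so adjust them to the events you actually see.

```yaml
processors:
- add_kubernetes_metadata:
    host: packetbeat            # node name, as used in the question
    indexers:
    # Index pod metadata by pod IP and by IP:port.
    - ip_and_port:
    matchers:
    # Build the lookup key from event fields.
    # Assumption: destination.ip / destination.port are the fields to match on.
    - field_format:
        format: '%{[destination.ip]}:%{[destination.port]}'
```

Matching on the destination only enriches traffic going to a pod; matching source addresses as well would need a second processor or matcher.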

Related

ElasticSearch CrashLoopBackoff when deploying with ECK in Kubernetes OKD 4.11

I am running Kubernetes using OKD 4.11 (running on vSphere) and have validated the basic functionality (including dynamic volume provisioning) using applications such as nginx.
I also applied
oc adm policy add-scc-to-group anyuid system:authenticated
to allow authenticated users to use anyuid (which seems to have been required to deploy the nginx example I was testing with).
Then I installed ECK using this quickstart with kubectl to install the CRD and RBAC manifests. This seems to have worked.
Then I deployed the most basic ElasticSearch quickstart example with kubectl apply -f quickstart.yaml using this manifest:
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 8.4.2
  nodeSets:
  - name: default
    count: 1
    config:
      node.store.allow_mmap: false
The deployment proceeds as expected, pulling image and starting container, but ends in a CrashLoopBackoff with the following error from ElasticSearch at the end of the log:
"elasticsearch.cluster.name":"quickstart",
"error.type":"java.lang.IllegalStateException",
"error.message":"failed to obtain node locks, tried
[/usr/share/elasticsearch/data]; maybe these locations
are not writable or multiple nodes were started on the same data path?"
Looking into the storage, the PV and PVC are created successfully, the output of kubectl get pv,pvc,sc -A -n my-namespace is:
NAME                                                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                                     STORAGECLASS   REASON   AGE
persistentvolume/pvc-9d7b57db-8afd-40f7-8b3d-6334bdc07241   1Gi        RWO            Delete           Bound    my-namespace/elasticsearch-data-quickstart-es-default-0   thin                    41m

NAMESPACE      NAME                                                               STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
my-namespace   persistentvolumeclaim/elasticsearch-data-quickstart-es-default-0   Bound    pvc-9d7b57db-8afd-40f7-8b3d-6334bdc07241   1Gi        RWO            thin           41m

NAMESPACE      NAME                                         PROVISIONER                    RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
               storageclass.storage.k8s.io/thin (default)   kubernetes.io/vsphere-volume   Delete          Immediate              false                  19d
               storageclass.storage.k8s.io/thin-csi         csi.vsphere.vmware.com         Delete          WaitForFirstConsumer   true                   19d
Looking at the pod YAML, it appears that the volume is correctly attached:
volumes:
- name: elasticsearch-data
  persistentVolumeClaim:
    claimName: elasticsearch-data-quickstart-es-default-0
- name: downward-api
  downwardAPI:
    items:
    - path: labels
      fieldRef:
        apiVersion: v1
        fieldPath: metadata.labels
    defaultMode: 420
....
volumeMounts:
...
- name: elasticsearch-data
  mountPath: /usr/share/elasticsearch/data
I cannot understand why the volume would be read-only or rather why ES cannot create the lock.
I did find this similar issue, but I am not sure how to apply the UID permissions (in general I am fairly naive about the way permissions work in OKD) when working with ECK.
Does anyone with deeper K8s / OKD or ECK/Elasticsearch knowledge have an idea how to better isolate and/or resolve this issue?
Update: I believe this has something to do with this issue and am researching the options related to OKD.
For posterity: ECK starts an init container that should take care of the chown on the data volume, but it can only do so if it is running as root.
The resolution for me was documented here:
https://repo1.dso.mil/dsop/elastic/elasticsearch/elasticsearch/-/issues/7
The manifest now looks like this:
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 8.4.2
  nodeSets:
  - name: default
    count: 1
    config:
      node.store.allow_mmap: false
    # run init container as root to chown the volume to uid 1000
    podTemplate:
      spec:
        securityContext:
          runAsUser: 1000
          runAsGroup: 0
        initContainers:
        - name: elastic-internal-init-filesystem
          securityContext:
            runAsUser: 0
            runAsGroup: 0
And the pod starts up and can write to the volume as uid 1000.

Record Kubernetes container resource utilization data

I'm running a performance test against a web server deployed on an EKS cluster. I'm invoking the server with JMeter under different conditions (varying thread count, payload size, etc.).
I want to record Kubernetes performance data with timestamps so that I can analyze it alongside my JMeter output (JTL).
I have been searching the internet for a way to record Kubernetes performance data, but I was unable to find a proper way to do it.
Can anyone suggest a standard way to do this?
Note: I have a multi-container pod also.
In line with @Jonas's comment:
This is the quickest way of installing Prometheus in your Kubernetes cluster. I've added the details as an answer because it was impossible to put the commands in a readable format in a comment.
Add the Bitnami Helm repo:
helm repo add bitnami https://charts.bitnami.com/bitnami
Install the Helm chart for Prometheus:
helm install my-release bitnami/kube-prometheus
The installation output will be:
C:\Users\ameena\Desktop\shine\Article\K8\promethus>helm install my-release bitnami/kube-prometheus
NAME: my-release
LAST DEPLOYED: Mon Apr 12 12:44:13 2021
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
** Please be patient while the chart is being deployed **
Watch the Prometheus Operator Deployment status using the command:
kubectl get deploy -w --namespace default -l app.kubernetes.io/name=kube-prometheus-operator,app.kubernetes.io/instance=my-release
Watch the Prometheus StatefulSet status using the command:
kubectl get sts -w --namespace default -l app.kubernetes.io/name=kube-prometheus-prometheus,app.kubernetes.io/instance=my-release
Prometheus can be accessed via port "9090" on the following DNS name from within your cluster:
my-release-kube-prometheus-prometheus.default.svc.cluster.local
To access Prometheus from outside the cluster execute the following commands:
echo "Prometheus URL: http://127.0.0.1:9090/"
kubectl port-forward --namespace default svc/my-release-kube-prometheus-prometheus 9090:9090
Watch the Alertmanager StatefulSet status using the command:
kubectl get sts -w --namespace default -l app.kubernetes.io/name=kube-prometheus-alertmanager,app.kubernetes.io/instance=my-release
Alertmanager can be accessed via port "9093" on the following DNS name from within your cluster:
my-release-kube-prometheus-alertmanager.default.svc.cluster.local
To access Alertmanager from outside the cluster execute the following commands:
echo "Alertmanager URL: http://127.0.0.1:9093/"
kubectl port-forward --namespace default svc/my-release-kube-prometheus-alertmanager 9093:9093
Follow the commands to forward the UI to localhost.
echo "Prometheus URL: http://127.0.0.1:9090/"
kubectl port-forward --namespace default svc/my-release-kube-prometheus-prometheus 9090:9090
Open the UI in browser: http://127.0.0.1:9090/classic/graph
Annotate the pods so that their metrics are scraped:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 4 # Update the replicas from 2 to 4
  template:
    metadata:
      labels:
        app: nginx
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '9102'
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
In the UI, apply appropriate filters and start observing the key metrics, such as memory and CPU. The UI supports autocomplete, so finding the right metric names should not be difficult.
Regards

Kubernetes Kibana operator failures and Nginx ingress timeouts

I just started implementing a Kubernetes cluster on an Azure Linux VM. I'm very new to all this. The cluster is running on a small VM (2 cores, 16 GB). I set up the ECK stack using their online tutorial, and an Nginx Ingress controller to expose it.
Most of the day, everything runs fine. I can access the Kibana dashboard, run Elastic queries, Nginx is working. But about once each day, something happens that causes the Kibana Endpoint matching the Kibana Service to not have any IP address. As a result, the Service can't route correctly to the container. When this happens, the Kibana pod has a status of Running, but says that 0/1 are running. It never triggers any restarts, and as a result, the Kibana dashboard becomes inaccessible. I've tried reproducing this by shutting down the Docker container, force killing the pod, but can't reliably reproduce it.
Looking at the logs on the Kibana pod, there are a bunch of errors due to timeouts. The Nginx logs say that it can't find the Endpoint for the Service. It looks like this could potentially be the source. Has anyone encountered this? Does anyone know a reliable way to prevent this?
This should probably be a separate question, but the other issue this causes is completely blocking all Nginx Ingress. Any new requests are not seen in the logs, and the logs completely stop after the message about not finding an endpoint. As a result, all URLs that Ingress is normally responsible for time out, and the whole cluster becomes externally unusable. This is fixed by deleting the Nginx controller pod, but the pod doesn't restart itself. Can someone explain why an issue like this would completely block Nginx? And why the Nginx pod can't detect this and restart?
Edit:
The Nginx logs end with this:
W1126 16:20:31.517113 6 controller.go:950] Service "default/gwam-kb-http" does not have any active Endpoint.
W1126 16:20:34.848942 6 controller.go:950] Service "default/gwam-kb-http" does not have any active Endpoint.
W1126 16:21:52.555873 6 controller.go:950] Service "default/gwam-kb-http" does not have any active Endpoint.
Any further requests time out and do not appear in the logs.
I don't have logs for the kibana pod, but they were just consistent timeouts to the kibana service default/gwam-kb-http (same as in Nginx logs above). This caused the readiness probe to fail, and show 0/1 Running, but did not trigger a restart of the pod.
Kibana Endpoints when everything is normal
Name:         gwam-kb-http
Namespace:    default
Labels:       common.k8s.elastic.co/type=kibana
              kibana.k8s.elastic.co/name=gwam
Annotations:  endpoints.kubernetes.io/last-change-trigger-time: 2020-11-26T16:27:20Z
Subsets:
  Addresses:          10.244.0.6
  NotReadyAddresses:  <none>
  Ports:
    Name   Port  Protocol
    ----   ----  --------
    https  5601  TCP
Events:  <none>
When I run into this issue, Addresses is empty, and the pod IP is under NotReadyAddresses
I'm using the very basic YAML from the ECK setup tutorial:
Elastic (no problems here)
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: gwam
spec:
  version: 7.10.0
  nodeSets:
  - name: default
    count: 3
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 2Gi
        storageClassName: elasticsearch
Kibana:
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: gwam
spec:
  version: 7.10.0
  count: 1
  elasticsearchRef:
    name: gwam
Ingress for the Kibana service:
kind: Ingress
apiVersion: extensions/v1beta1
metadata:
  name: nginx-ingress-secure-backend-no-rewrite
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.org/proxy-connect-timeout: "30s"
    nginx.org/proxy-read-timeout: "20s"
    nginx.org/proxy-send-timeout: "60s"
    nginx.org/client-max-body-size: "4m"
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
spec:
  tls:
  - hosts:
    - <internal company site>
    secretName: gwam-tls-secret
  rules:
  - host: <internal company site>
    http:
      paths:
      - path: /
        backend:
          serviceName: gwam-kb-http
          servicePort: 5601
Some more environment details:
Kubernetes version: 1.19.3
OS: Ubuntu 18.04.5 LTS (GNU/Linux 5.4.0-1031-azure x86_64)
Edit 2:
It seems I'm getting some kind of network error here. None of my pods can do a DNS lookup for kubernetes.default. All the networking pods are running, but after enabling logging in CoreDNS, I'm seeing the following:
[ERROR] plugin/errors: 2 1699910358767628111.9001703618875455268. HINFO: read udp 10.244.0.69:35222->10.234.44.20:53: i/o timeout
I'm using Flannel for my network. Thinking of trying to reset and switch to Calico and increasing nf_conntrack_max as some answers suggest.
This ended up being a very simple mistake on my part. I thought it was a pod or DNS issue, but it was just a general network issue: IP forwarding was turned off. I turned it on with:
sysctl -w net.ipv4.ip_forward=1
and added net.ipv4.ip_forward=1 to /etc/sysctl.conf so that the setting persists across reboots.

How to show custom application metrics in Prometheus captured using the golang client library from all pods running in Kubernetes

I am trying to get some custom application metrics captured in golang using the prometheus client library to show up in Prometheus.
I have the following working:
I have a go application which is exposing metrics on localhost:8080/metrics as described in this article:
https://godoc.org/github.com/prometheus/client_golang/prometheus
I have a Kubernetes minikube cluster which has Prometheus, Grafana and AlertManager running, using the operator from this article:
https://github.com/coreos/prometheus-operator/tree/master/contrib/kube-prometheus
I created a docker image for my go app, when I run it and go to localhost:8080/metrics I can see the prometheus metrics showing up in a browser.
I use the following pod.yaml to deploy my docker image to a pod in k8s
apiVersion: v1
kind: Pod
metadata:
  name: my-app-pod
  labels:
    zone: prod
    version: v1
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/port: '8080'
spec:
  containers:
  - name: my-container
    image: name/my-app:latest
    imagePullPolicy: IfNotPresent
    ports:
    - containerPort: 8080
If I connect to my pod using:
kubectl exec -it my-app-pod -- /bin/bash
then do wget on "localhost:8080/metrics", I can see my metrics.
So far so good, but here is where I am hitting a wall. I could have multiple pods running this same image, and I want to expose all of them to Prometheus as targets. How do I configure my pods so that they show up in Prometheus and I can report on my custom metrics?
Thanks for any help offered!
The kubernetes_sd_config directive can be used to discover all pods with a given label. Your prometheus.yml config file should contain something like this:
- job_name: 'some-app'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_label_app]
    regex: python-app
    action: keep
The source label __meta_kubernetes_pod_label_app makes Prometheus use the Kubernetes API to look at each pod's 'app' label, and the keep action retains only the pods whose label value matches the regular expression on the line below it (in this case, 'python-app').
Once you've done this, Prometheus will automatically discover the pods you want and start scraping the metrics from your app.
Hope that helps. You can follow the blog post here for more detail.
Note: it is worth mentioning that at the time of writing, kubernetes_sd_config is still in beta. Thus breaking changes to configuration may occur in future releases.
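Since the pod in the question already carries prometheus.io/scrape and prometheus.io/port annotations, a variant of the job above can key on those annotations instead of a label. This is a sketch modeled on the commonly used Kubernetes example configuration; the job name is made up, and it assumes you control the scrape configuration yourself. With the Prometheus Operator the configuration is generated, so a fragment like this would have to be supplied through the operator rather than edited into prometheus.yml.

```yaml
scrape_configs:
- job_name: 'annotated-pods'      # hypothetical job name
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  # Keep only pods annotated with prometheus.io/scrape: 'true'.
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: "true"
  # Rewrite the scrape address to use the port from prometheus.io/port.
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
```

Every pod started from the same template carries the same annotations, so all replicas show up as separate targets.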
You need 2 things:
a ServiceMonitor for the Prometheus Operator, which specifies which services will be scraped for metrics
a Service which matches the ServiceMonitor and points to your pods
There is an example in the docs over here: https://coreos.com/operators/prometheus/docs/latest/user-guides/running-exporters.html
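A minimal sketch of the two objects follows. The name my-app and the app: my-app label are hypothetical (the pod in the question only has zone and version labels, so a label like this would need to be added), and the Prometheus resource must be configured to select this ServiceMonitor, which depends on how the operator was installed.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app              # hypothetical name
  labels:
    app: my-app             # hypothetical label, matched by the ServiceMonitor
spec:
  selector:
    app: my-app             # assumes the pods carry this label
  ports:
  - name: metrics
    port: 8080
    targetPort: 8080        # the port the Go app listens on
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app           # selects the Service above
  endpoints:
  - port: metrics           # refers to the Service port by name
    path: /metrics
```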
Can you share the Prometheus config that you are using to scrape the metrics? The config controls which sources the metrics are scraped from. Here are a few links that you can refer to: https://groups.google.com/forum/#!searchin/prometheus-users/Application$20metrics$20monitoring$20of$20Kubernetes$20Pods%7Csort:relevance/prometheus-users/uNPl4nJX9yk/cSKEBqJlBwAJ

How do I access this Kubernetes service via kubectl proxy?

I want to access my Grafana Kubernetes service via the kubectl proxy server, but for some reason it won't work even though I can make it work for other services. Given the below service definition, why is it not available on http://localhost:8001/api/v1/proxy/namespaces/monitoring/services/grafana?
grafana-service.yaml
apiVersion: v1
kind: Service
metadata:
  namespace: monitoring
  name: grafana
  labels:
    app: grafana
spec:
  type: NodePort
  ports:
  - name: web
    port: 3000
    protocol: TCP
    nodePort: 30902
  selector:
    app: grafana
grafana-deployment.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  namespace: monitoring
  name: grafana
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana:4.1.1
        env:
        - name: GF_AUTH_BASIC_ENABLED
          value: "true"
        - name: GF_AUTH_ANONYMOUS_ENABLED
          value: "true"
        - name: GF_SECURITY_ADMIN_USER
          valueFrom:
            secretKeyRef:
              name: grafana-credentials
              key: user
        - name: GF_SECURITY_ADMIN_PASSWORD
          valueFrom:
            secretKeyRef:
              name: grafana-credentials
              key: password
        volumeMounts:
        - name: grafana-storage
          mountPath: /var/grafana-storage
        ports:
        - name: web
          containerPort: 3000
        resources:
          requests:
            memory: 100Mi
            cpu: 100m
          limits:
            memory: 200Mi
            cpu: 200m
      - name: grafana-watcher
        image: quay.io/coreos/grafana-watcher:v0.0.5
        args:
        - '--watch-dir=/var/grafana-dashboards'
        - '--grafana-url=http://localhost:3000'
        env:
        - name: GRAFANA_USER
          valueFrom:
            secretKeyRef:
              name: grafana-credentials
              key: user
        - name: GRAFANA_PASSWORD
          valueFrom:
            secretKeyRef:
              name: grafana-credentials
              key: password
        resources:
          requests:
            memory: "16Mi"
            cpu: "50m"
          limits:
            memory: "32Mi"
            cpu: "100m"
        volumeMounts:
        - name: grafana-dashboards
          mountPath: /var/grafana-dashboards
      volumes:
      - name: grafana-storage
        emptyDir: {}
      - name: grafana-dashboards
        configMap:
          name: grafana-dashboards
The error I'm seeing when accessing the above URL is "no endpoints available for service "grafana"", error code 503.
With Kubernetes 1.10 the proxy URL should be slightly different, like this:
http://localhost:8080/api/v1/namespaces/default/services/SERVICE-NAME:PORT-NAME/proxy/
Ref: https://kubernetes.io/docs/tasks/access-application-cluster/access-cluster/#manually-constructing-apiserver-proxy-urls
As Michael says, quite possibly your labels or namespaces are mismatching. However in addition to that, keep in mind that even when you fix the endpoint, the url you're after (http://localhost:8001/api/v1/proxy/namespaces/monitoring/services/grafana) might not work correctly.
Depending on your root_url and/or static_root_path grafana configuration settings, when trying to login you might get grafana trying to POST to http://localhost:8001/login and get a 404.
Try using kubectl port-forward instead:
kubectl -n monitoring port-forward [grafana-pod-name] 3000
then access grafana via http://localhost:3000/
https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/
The issue is that Grafana's port is named web, and as a result one needs to append :web to the kubectl proxy URL: http://localhost:8001/api/v1/proxy/namespaces/monitoring/services/grafana:web.
An alternative is to not name the Grafana port at all, because then you don't have to append :web to the kubectl proxy URL for the service: http://localhost:8001/api/v1/proxy/namespaces/monitoring/services/grafana. I went with this option in the end since it's easier.
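For illustration, the Service from the question with the port name dropped would look roughly like this. It is the same definition with only the name: web line removed; a Service with a single port does not need a port name.

```yaml
apiVersion: v1
kind: Service
metadata:
  namespace: monitoring
  name: grafana
  labels:
    app: grafana
spec:
  type: NodePort
  ports:
  - port: 3000          # no "name: web" here, so no port suffix is needed in the proxy URL
    protocol: TCP
    nodePort: 30902
  selector:
    app: grafana
```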
There are a few factors that might be causing this issue.
The service expects to find one or more supporting endpoints, which it discovers through matching rules on the labels. If the labels don't align, then the service won't find endpoints, and the network gateway function performed by the service will result in 503.
The port declared by the pod, or the port the process inside the container actually listens on, does not match the target port (--target-port) expected by the service.
Either one of these might generate the error. Let's take a closer look.
First, kubectl describe the service:
$ kubectl describe svc grafana01-grafana-3000
Name:              grafana01-grafana-3000
Namespace:         default
Labels:            app=grafana01-grafana
                   chart=grafana-0.3.7
                   component=grafana
                   heritage=Tiller
                   release=grafana01
Annotations:       <none>
Selector:          app=grafana01-grafana,component=grafana,release=grafana01
Type:              NodePort
IP:                10.0.0.197
Port:              <unset>  3000/TCP
NodePort:          <unset>  30905/TCP
Endpoints:         10.1.45.69:3000
Session Affinity:  None
Events:            <none>
Notice that my grafana service has 1 endpoint listed (there could be multiple). The error above in your example indicates that you won't have endpoints listed here.
Endpoints: 10.1.45.69:3000
Let's take a look next at the selectors. In the example above, you can see I have 3 selector labels on my service:
Selector: app=grafana01-grafana,component=grafana,release=grafana01
I'll kubectl describe my pods next:
$ kubectl describe pod grafana
Name:        grafana01-grafana-1843344063-vp30d
Namespace:   default
Node:        10.10.25.220/10.10.25.220
Start Time:  Fri, 14 Jul 2017 03:25:11 +0000
Labels:      app=grafana01-grafana
             component=grafana
             pod-template-hash=1843344063
             release=grafana01
...
Notice that the labels on the pod align correctly, hence my service finds pods which provide endpoints which are load balanced against by the service. Verify that this part of the chain isn't broken in your environment.
If you do find that the labels are correct, you may still have a disconnect in that the grafana process running within the container within the pod is running on a different port than you expect.
$ kubectl describe pod grafana
Name:  grafana01-grafana-1843344063-vp30d
...
Containers:
  grafana:
    Container ID:  docker://69f11b7828c01c5c3b395c008d88e8640c5606f4d865107bf4b433628cc36c76
    Image:         grafana/grafana:latest
    Image ID:      docker-pullable://grafana/grafana@sha256:11690015c430f2b08955e28c0e8ce7ce1c5883edfc521b68f3fb288e85578d26
    Port:          3000/TCP
    State:         Running
      Started:     Fri, 14 Jul 2017 03:25:26 +0000
If for some reason, your port under the container listed a different value, then the service is effectively load balancing against an invalid endpoint.
For example, if it listed port 80:
Port: 80/TCP
Or if it was an empty value:
Port:
Then even if your label selectors were correct, the service would never find a valid response from the pod and would remove the endpoint from the rotation.
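To make the port relationship concrete, here is a sketch of how the two sides line up for the grafana Service in the question. The explicit targetPort line is an addition for illustration; when it is omitted, targetPort defaults to the value of port.

```yaml
# Service side (fragment)
spec:
  ports:
  - name: web
    port: 3000          # port exposed by the Service
    targetPort: 3000    # port on the pod that traffic is forwarded to
---
# Pod template side (fragment)
spec:
  containers:
  - name: grafana
    ports:
    - name: web
      containerPort: 3000   # should be the port the grafana process listens on
```

If the process listens on a different port than targetPort, requests through the Service fail even though the label selectors match.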
I suspect your issue is the first problem above (mismatched label selectors).
If both the label selectors and ports align, then you might have a problem with the MTU setting between nodes. In some cases, if the MTU used by your networking layer (like calico) is larger than the MTU of the supporting network, then you'll never get a valid response from the endpoint. Typically, this last potential issue will manifest itself as a timeout rather than a 503 though.
Your Deployment may not have a label app: grafana, or be in another namespace. Could you also post the Deployment definition?
